
10-14-13 - Oodle's Fast LZ4

Oodle now has a compressor called "LZB" (LZ-Bytewise), which is basically a variant of Yann's LZ4. A few customers were testing LZNib against LZ4 and Snappy and were bothered that it was slower to decode than those (though it offers much more compression, and I believe it is usually a better choice of tradeoffs). In any case, OodleLZB is now there for people who care more about speed than ratio.

Oodle LZB has some implementation tricks that could also be applied to LZ4. I thought I'd go through them as an illustration of how you make compressors fast, and to give back to the community since LZ4 is nicely open source.

OodleLZB is 10-20% faster to decode than LZ4 (1900 MB/s vs. 1700 MB/s on lzt99, and 1550 MB/s vs. 1290 MB/s on all_dds). The base LZ4 implementation is not bad; in fact it's very fast and has the obvious optimizations like 8-byte word copies and so on. I'm not gonna talk about fiddly junk like whether you do ptr++ or ++ptr, though that stuff can make a difference on some platforms. I want to talk about how you structure a decode loop.

The LZ4 decoder is like this :


U8 * ip; // = compressed input
U8 * op; // = decompressed output

for(;;)
{
    int control = *ip++;

    int lrl = control>>4;
    int ml_control = control&0xF;

    // get excess lrl
    if ( lrl == 0xF ) AddExcess(lrl,ip);

    // copy literals :
    copy_literals(op, ip, lrl);

    ip += lrl;
    op += lrl;

    if ( EOF ) break;

    // get offset
    int offset = *((U16 *)ip); // 16-bit offset, read unaligned (little-endian)
    ip+=2;

    int ml = ml_control + 4;
    if ( ml_control == 0xF ) AddExcess(ml,ip);

    // copy match :
    if ( overlap )
    {
        copy_match_overlap(op, op - offset, ml );
    }
    else
    {
        copy_match_nooverlap(op, op - offset, ml );
    }

    op += ml;

    if ( EOF ) break;
}

and AddExcess is :

// read continuation bytes for the excess length ; each 0xFF adds 255
// and means more follows, any other byte is the final increment :
#define AddExcess(val,cp)   do { int b = *cp++; val += b; if ( b != 0xFF ) break; } while(1)

So, how do we speed this up?

The main thing we want to focus on is branches. We want to reduce the number of branches on the most common paths, and we want to make maximum use of the branches that we can't avoid.

There are four branches we want to pay attention to :

1. the checks for control nibbles = 0xF
2. the check for match overlap
3. the check of LRL and match len inside the match copiers
4. the EOF checks

Last one first: the EOF check is the easiest to eliminate and also the least important. On modern chips with good branch prediction, highly predictable branches like that don't cost much. If you know that your input stream is not corrupted (because you've already checked a CRC of some kind), then you can put the EOF check under one of the rare code cases, like perhaps LRL control = 0xF. Just make the encoder emit that rare code when it hits EOF. On Intel chips that makes almost no speed difference. (If you need to handle corrupted data without crashing, don't do that.)
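
To illustrate, here's a minimal sketch of hiding the EOF check under the rare LRL-escape path. The encoder has to cooperate by always ending the stream with an LRL escape code, and ip_end is my own name for the end of the compressed input :

    // get excess lrl
    if ( lrl == 0xF )
    {
        AddExcess(lrl,ip);

        // encoder guarantees the stream ends with an lrl escape code,
        // so this rare path is the only place we need to test EOF :
        if ( ip + lrl >= ip_end )
        {
            copy_literals(op, ip, lrl);
            break;
        }
    }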

On to the substantial ones. Note that branch #3 is hidden inside "copy_literals" and "copy_match". copy_literals is something like :


for(int i=0;i<lrl;i+=8)
{
    *((U64 *)(dest+i)) = *((U64 *)(src+i));
}

(The exact way to do copy_literals quickly depends on your platform; in particular, are offset addresses free, and are unaligned loads fast? Depending on those, you would write the loop in different ways.)
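
For example, if unaligned word loads are a problem on your target, a portable variant (a sketch, not the Oodle code) can lean on memcpy of a constant size, which compilers turn into the same word copies :

#include <string.h>

// portable 8-bytes-at-a-time literal copy ; a constant-size memcpy
// compiles to single (unaligned) load/store instructions on most targets
static void copy_literals(U8 * dest, const U8 * src, int lrl)
{
    for(int i=0; i<lrl; i+=8)
    {
        memcpy(dest+i, src+i, 8);
    }
}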

We should notice a couple of things about this. One is that the first branch on lrl is the most important; later branches are rare, and when we do take them we're covering a lot of bytes per branch. When lrl is low, we're paying a high number of branches per byte. Also, the probability of taking the branch is very different on the first iteration than on later ones, so you can help the chip's branch prediction by doing some explicit branching, like :


if ( lrl > 0 )
{
    *((U64 *)(dest)) = *((U64 *)(src));
    if ( lrl > 8 )
    {
        *((U64 *)(dest+8)) = *((U64 *)(src+8));
        if ( lrl > 16 )
        {
            // .. okay now do a loop here ...
        }
    }
}

We'll also see later that the branch on ( lrl > 0 ) is optional, and in fact it's best not to do it - just always unconditionally/speculatively copy the first 8 bytes.

The next thing to notice is that the copier's branch for lrl > 16 is redundant with the check of the control value (lrl == 0xF), so we should merge them :


change this :

    // get excess lrl
    if ( lrl == 0xF ) AddExcess(lrl,ip);

    // copy literals :
    copy_literals(op, ip, lrl);

to :

    // get excess lrl
    if ( lrl == 0xF )
    {
        AddExcess(lrl,ip);

        // lrl >= 15
        copy_literals_long(op, ip, lrl);
    }
    else
    {
        // lrl < 15
        copy_literals_short(op, ip, lrl);
    }

This is the principle that we can't avoid the branch on the LRL escape code, so we should make maximum use of it. That is, it carries extra information - the branch tells us something about the value of LRL - and any time we take a branch we should try to make use of all the information it gives us.

But we can do more here. If we gather some stats about LRL, we see something like :

% of LRL >= 0xF : 8%
% of LRL > 8    : 15%
% of LRL <= 8   : 85%

The vast majority of LRLs are short, so we should first detect and handle that case :

    if ( lrl <= 8 )
    {
        if ( lrl > 0 )
        {
            *((U64 *)(op)) = *((U64 *)(ip));
        }
    }
    else
    {
        // lrl > 8
        if ( lrl == 0xF )
        {
            AddExcess(lrl,ip);

            // lrl >= 15
            copy_literals_long(op, ip, lrl);
        }
        else
        {
            // lrl > 8 && < 15
            *((U64 *)(op)) = *((U64 *)(ip));
            *((U64 *)(op+8)) = *((U64 *)(ip+8));
        }
    }

which we should then rearrange a bit :

    // always immediately speculatively copy 8 :
    *((U64 *)(op)) = *((U64 *)(ip));

    if_unlikely( lrl > 8 )  // if_unlikely = a compiler branch hint, eg. __builtin_expect
    {
        // lrl > 8
        // speculatively copy next 8 :
        *((U64 *)(op+8)) = *((U64 *)(ip+8));

        if_unlikely( lrl == 0xF )
        {
            AddExcess(lrl,ip);

            // lrl >= 15
            // can copy first 16 without checking lrl
            copy_literals_long(op, ip, lrl);
        }
    }
    
    op += lrl;
    ip += lrl;

and we have a faster literal-copy path.
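
For completeness, here's a sketch of what copy_literals_long can look like (per the correction in the comments below : the first 16 bytes do need to be copied here, because the speculative copies above may have picked up excess-length bytes, back before ip had stepped past them) :

// literal copy for the rare lrl >= 15 path :
// copy the first 16 without checking lrl, then a plain 8-byte loop.
// This can stomp a few bytes past the end - that's the "hot region".
static void copy_literals_long(U8 * dest, const U8 * src, int lrl)
{
    *((U64 *)(dest))   = *((U64 *)(src));
    *((U64 *)(dest+8)) = *((U64 *)(src+8));

    for(int i=16; i<lrl; i+=8)
    {
        *((U64 *)(dest+i)) = *((U64 *)(src+i));
    }
}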

Astute readers may notice that we can now be stomping past the end of the buffer, because we always do the 8-byte copy regardless of LRL. There are various ways to prevent this. One is to make your EOF check test against (end - 8), and when you break out of that loop, finish up the end with a slower/safer loop. (ADD : not exactly right, see notes in comments)
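
As a sketch of that structure, using the margin from the comment thread below (the 32 here is one possible choice; the encoder must guarantee that wherever the fast loop drops out, at least 8 bytes of input and output remain) :

U8 * fastEndPtr = raw + rawLen - 32; // margin is a choice ; see comments

while ( op < fastEndPtr )
{
    // ... fast decode loop with speculative 8-byte copies ...
}

while ( ! EOF )
{
    // ... simple safe loop : byte-exact copies, full bounds checks ...
}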

Obviously we should do the exact same procedure with the ml_control. Again the check for spilling the 0xF nibble tells us something about the match length, and we should use that information to our advantage. And again the short matches are by far the most common, so we should detect that case and make it as quick as possible.
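
Sketched out, that looks something like this (my illustration, not the actual Oodle loop; it assumes the format change introduced next, so that offset >= 8 is guaranteed on the non-escape path and 8-byte chunk copies can't mis-overlap) :

if ( ml_control < 0xF )
{
    int ml = ml_control + 4;
    U8 * from = op - offset;

    // ml is 4-18 : speculatively copy 16, then at most one more 8
    *((U64 *)(op))   = *((U64 *)(from));
    *((U64 *)(op+8)) = *((U64 *)(from+8));

    if_unlikely( ml > 16 )
    {
        *((U64 *)(op+16)) = *((U64 *)(from+16));
    }

    op += ml;
}
else
{
    // rare path : read excess ml and do the long match copy
    // (and, after the next change, overlap matches too)
}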

The next branch we'll look at is the overlap check. Again some statistics should be our guide : less than 1% of all matches are overlap matches. Overlap matches are important (well, offset=1 overlaps are important) because they are occasionally very long, but they are rare. One option is just to forbid overlap matches entirely, and really that doesn't hurt compression much. We won't do that. Instead we want to hide the overlap case in another rare case : the excess ML 0xF nibble case.

The way to make compressors fast is to look at the code you have to write, and then change the format to make that code fast. That is, write the code the way you want it and the format follows - don't think of a format and then write code to handle it.

We want our match decoder to be like this :


if ( ml_control < 0xF )
{
    // 4-18 byte short match
    // no overlap allowed
    // ...
}
else
{
    // long match OR overlap match
    if ( offset < 8 )
    {
        ml = 4; // back up to MML
        AddExcess(ml,ip);

        // do offset < 8 possible overlap match
        // ...
    }
    else
    {
        AddExcess(ml,ip);

        // do offset >= 8 long match
        // ...
    }
}

so we change our encoder to make data like that.
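
For the offset < 8 path, a simple overlap-safe copy will do, since this path is rare. A sketch :

// overlap-safe match copy for offset < 8 ;
// byte-at-a-time naturally gives the repeating-pattern semantics
// (offset=1 replicates a single byte, etc.)
static void copy_match_overlap(U8 * op, const U8 * from, int ml)
{
    for(int i=0; i<ml; i++)
    {
        op[i] = from[i];
    }
}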

Astute readers may note that overlap matches now take an extra byte to encode, which does cost a little compression ratio. If we like, we can fix that by reorganizing the code stream a little (one option is to send one ml excess byte before sending the offset and put a flag value in there; since this is all in the very rare pathway, it can be more complex in its encoding), or we can just ignore it; it's around a 1% hit.

That's it.

One final note: if you want to skip all that, there is a shortcut to get much of the win.

The simplest case is the most important - no excess lrl, no excess ml - and it occurs for around 40% of all control bytes. So we can just detect it and fastpath that case :


U8 * ip; // = compressed input
U8 * op; // = decompressed output

for(;;)
{
    int control = *ip++;

    int lrl = control>>4;
    int ml_control = control&0xF;

    if ( (control & 0x88) == 0 )
    {
        // lrl < 8 and ml_control < 8

        // copy literals :
        *((U64 *)(op)) = *((U64 *)(ip));
        ip += lrl;
        op += lrl;

        // get offset
        int offset = *((U16 *)ip);
        ip+=2;
    
        int ml = ml_control + 4;    

        // copy match; 4-11 bytes :
        U8 * from = op - offset;
        *((U64 *)(op)) = *((U64 *)(from));
        *((U32 *)(op+8)) = *((U32 *)(from+8));

        op += ml;

        continue;
    }

    // ... normal decoder here ...
}

This fastpath doesn't add much on top of all the other improvements we've done here, but you can just drop it into the original decoder and get much of the win. Of course beware the EOF handling. Also this fastpath assumes that you have forbidden overlaps from the normal codes and are sending the escape match code (0xF) for overlap matches.

ADDENDUM : A few notes :

In case it's not clear, one of the keys to this type of fast decoder is that there's a "hot region" in front of the output pointer that contains undefined data, which we keep stomping on over and over. eg. when we do the 8-byte literal blasts regardless of lrl. This has a few consequences you must be aware of.

One is if you're trying to decode in a minimum size circular window (eg. 64k in this case). Offsets to matches that are near window size (like 65534) are actually wrapping around to be just *ahead* of the current output pointer. You cannot allow those matches in the hot region because those bytes are undefined. There are a few fixes - don't use this decoder for 64k circular windows, or don't allow your encoder to output offsets that cause that problem.
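
A sketch of the encoder-side fix (the names and the hot-region size here are mine, for illustration) :

// in a circular window, offsets near the window size wrap around to just
// ahead of op, where the decoder's speculative copies leave undefined
// bytes - so the encoder simply never emits those offsets :

#define WINDOW_SIZE      65536
#define HOT_REGION_SIZE  32    // >= the decoder's max speculative overrun

static int offset_is_safe(int offset)
{
    return offset <= (WINDOW_SIZE - HOT_REGION_SIZE);
}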

A similar problem arises with parallelizing the decode. If you're decoding chunks of the stream in parallel, you cannot allow the hot region of one thread to be stomping past the end of its chunk, which another thread might be reading from. To handle this Oodle defines "seek points" which are places that you are allowed to parallelize the decode, and the coders ensure that the hot regions do not cross seek points. That is, within a chunk up to the seek point, the decoder is allowed to do these wild blast-aheads, but as it gets close to a seek point it must break out and fall into a safer mode (this can be done with a modified end condition check).

9 comments:

Yann Collet said...

very, very good article, with lots of good ideas and crystal clear explanations.

I'll test some of them in the near future.

Yann Collet said...

I've been trying to use your presented ideas on LZ4 lately.
For now, I haven't yet succeeded in getting a speed boost, but it's just a beginning.

One thing I noticed is that this comment might be incorrect :

// first 16 already done, so don't re-copy them

If I understand the code correctly, the next line does indeed just that, since op and ip have not yet been updated:
copy_literals_long(op, ip, lrl);

Moreover, it's in fact necessary : the previous 16 bytes were copied speculatively, but some of what was copied was length-control bytes instead of literals. So it's absolutely necessary to overwrite those wrong bytes by copying the right ones on top of them.

cbloom said...

Yeah, that comment is incorrect, will fix.

The correct thing is to do the first 16 without checking lrl, then go to a normal 8-byte copy loop.

As you point out they do need to be recopied.

Yann Collet said...

I've been trying to adapt your suggestion to LZ4 lately, but I'm getting into a wall.

One of the key suggestions of your proposal is for 8 bytes to be copied "no matter what", speculatively.

For this action to be possible, at least 8 bytes must remain in both the input and output buffers.

That's a verification I haven't been able to remove so far. I could maybe merge the output buffer limit check into the previous copy operation, but no such luck with the input buffer limit check.

As a consequence, since at least one verification remains, the current code also works on the assumption that it's better to know the copy length up front, in order to check buffer limits in a single test.

cbloom said...

Right. The crucial trick is that the encoder knows what the decoder is doing. You can't just use the standard encoder.

The encoder knows when the decoder will speculate, and can ensure that it only does so in safe places (no buffer overruns).

Specifically, in this case, an LRL cannot start within the last 8 bytes. You must either have a final LRL of >= 8 bytes, or there must be a match that takes you all the way to the end.

Yann Collet said...

This is the reason why the last 5 bytes must be literals :
the penultimate literal run copy has a guarantee that at least 8 bytes are remaining in the input buffer (offset (2) + token (1) + literals (5+) ).

Unfortunately the last run does not have this guarantee.

Even more importantly, this guarantee only holds for valid input. The input might also be malicious, crafted to produce intentional reads/writes beyond buffer limits. At least the _safe version must guarantee protection against such a scenario.

cbloom said...

Eh, right. "or there must be a match that takes you all the way to the end" is wrong. I get in trouble when I say something that differs from what I actually do in Oodle.

The way to be end-of-buffer safe is you drop out of the fast loop somewhere before the end, and finish off with a simple slow loop.

The point that you drop out of the fast loop must always have >= 8 bytes of raw and comp data.

So for example your check could be

fastEndPtr = raw + rawLen - 32

and then the encoder must also make data specifically such that the last 32 raw bytes always take >= 8 comp bytes.

Of course the simplest way to do it is just to write a lot of literals at the end, which is what I do cuz I also hide my EOF check in the long-lrl case.

cbloom said...

And you are right that this is all very unsafe for corrupted/malicious data.

I should also say that this post is not intended as "LZ4 should do this". There are definite disadvantages - malicious data danger, code complexity, etc.

Yann Collet said...

Indeed, this is how I read it.

I've been tempted however to see if it could be applied "as is" to LZ4, and the answer is : it's a bit more complicated than that.

That's nonetheless a very good read, with excellent ideas that are generic enough to be used anywhere else.
