10/14/2013

10-14-13 - Oodle's Fast LZ4

Oodle now has a compressor called "LZB" (LZ-Bytewise) which is basically a variant of Yann's LZ4. A few customers were testing LZNib against LZ4 and Snappy and were bothered that it was slower to decode than those (though it offers much more compression, and I believe it is usually a better choice of tradeoffs). In any case, OodleLZB is now there for people who care more about speed than ratio.

Oodle LZB has some implementation tricks that could also be applied to LZ4. I thought I'd go through them as an illustration of how you make compressors fast, and to give back to the community since LZ4 is nicely open source.

OodleLZB is 10-20% faster to decode than LZ4. (1900 mb/s vs. 1700 mb/s on lzt99, and 1550 mb/s vs. 1290 mb/s on all_dds). The base LZ4 implementation is not bad; it's in fact very fast and has the obvious optimizations like 8-byte word copies and so on. I'm not gonna talk about fiddly junk like whether you do ptr++ or ++ptr, though that stuff can make a difference on some platforms. I want to talk about how you structure a decode loop.

The LZ4 decoder is like this :


U8 * ip; // = compressed input
U8 * op; // = decompressed output

for(;;)
{
    int control = *ip++;

    int lrl = control>>4;
    int ml_control = control&0xF;

    // get excess lrl
    if ( lrl == 0xF ) AddExcess(lrl,ip);

    // copy literals :
    copy_literals(op, ip, lrl);

    ip += lrl;
    op += lrl;

    if ( EOF ) break;

    // get offset
    int offset = *((U16 *)ip);
    ip+=2;

    int ml = ml_control + 4;
    if ( ml_control == 0xF ) AddExcess(ml,ip);

    // copy match :
    if ( overlap )
    {
        copy_match_overlap(op, op - offset, ml );
    }
    else
    {
        copy_match_nooverlap(op, op - offset, ml );
    }

    op += ml;

    if ( EOF ) break;
}

and AddExcess is :

#define AddExcess(val,cp)   do { int b = *cp++; val += b; if ( b != 0xFF ) break; } while(1)
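To see what the escape encodes: each 0xFF byte adds 255 and continues, and any other byte adds its value and terminates, so lengths are open-ended. A little driver to make that concrete (decode_excess is hypothetical, just for illustration) :

```c
#define AddExcess(val,cp)   do { int b = *cp++; val += b; if ( b != 0xFF ) break; } while(1)

// hypothetical driver: run the excess loop over a byte sequence.
// Each 0xFF byte adds 255 and continues; any other byte adds its
// value and terminates.
static int decode_excess(const unsigned char *cp, int val)
{
    AddExcess(val, cp);
    return val;
}
```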

So, how do we speed this up?

The main thing we want to focus on is branches. We want to reduce the number of branches on the most common paths, and we want to make maximum use of the branches that we can't avoid.

There are four branches we want to pay attention to :

1. the checks for control nibbles = 0xF
2. the check for match overlap
3. the check of LRL and match len inside the match copiers
4. the EOF checks

So last one first; the EOF check is the easiest to eliminate and also the least important. On modern chips with good branch prediction, the highly predictable branches like that don't cost much. If you know that your input stream is not corrupted (because you've already checked a CRC of some kind), then you can put the EOF check under one of the rare code cases, like perhaps LRL control = 0xF. Just make the encoder emit that rare code when it hits EOF. On Intel chips that makes almost no speed difference. (if you need to handle corrupted data without crashing, don't do that).

On to the substantial ones. Note that branch #3 is hidden inside "copy_literals" and "copy_match". copy_literals is something like :


for(int i=0;i<lrl;i+=8)
{
    *((U64 *)(dest+i)) = *((U64 *)(src+i));
}

(the exact way to do copy_literals quickly depends on your platform; in particular are offset-addresses free? and are unaligned loads fast? Depending on those, you would write the loop in different ways.)
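As one hedged example of a portable variant (a sketch, not the actual LZ4 code) : fixed-size memcpy, which compilers turn into a single unaligned 8-byte load/store on targets where those are fast :

```c
#include <stdint.h>
#include <string.h>

// Portable 8-byte chunked copy. memcpy with a constant size of 8
// compiles to one unaligned load/store on most targets. Note it
// rounds lrl up to a multiple of 8, so dest needs up to 7 bytes
// of slack past dest+lrl.
static void copy_literals_portable(uint8_t *dest, const uint8_t *src, int lrl)
{
    for (int i = 0; i < lrl; i += 8)
        memcpy(dest + i, src + i, 8);
}
```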

We should notice a couple of things about this. One is that the first branch on lrl is the most important. Later branches are rare, and in those cases we're also covering a lot of bytes per branch. When lrl is low, we're getting a high number of branches per byte. Also, the probability of taking the branch is very different the first time, so you can help the chip's branch prediction by doing some explicit branching, like :


if ( lrl > 0 )
{
    *((U64 *)(dest)) = *((U64 *)(src));
    if ( lrl > 8 )
    {
        *((U64 *)(dest+8)) = *((U64 *)(src+8));
        if ( lrl > 16 )
        {
            // .. okay now do a loop here ...
        }
    }
}

We'll also see later that the branch on ( lrl > 0 ) is optional, and in fact it's best to not do it - just always unconditionally/speculatively copy the first 8 bytes.

The next issue we should see is that the branch for lrl > 16 there for the copier is redundant with the check for the control value (lrl == 0xF). So we should merge them :


change this :

    // get excess lrl
    if ( lrl == 0xF ) AddExcess(lrl,ip);

    // copy literals :
    copy_literals(op, ip, lrl);

to :

    // get excess lrl
    if ( lrl == 0xF )
    {
        AddExcess(lrl,ip);

        // lrl >= 15
        copy_literals_long(op, ip, lrl);
    }
    else
    {
        // lrl < 15
        copy_literals_short(op, ip, lrl);
    }

This is the principle that we can't avoid the branch on the LRL escape code, so we should make maximum use of it. That is, it carries extra information - the branch tells us something about the value of LRL, and any time we take a branch we should try to make use of all the information it gives us.

But we can do more here. If we gather some stats about LRL we see something like :

% of LRL >= 0xF : 8%
% of LRL > 8 : 15%
% of LRL <= 8 : 85%

The vast majority of LRLs are short, so we should first detect and handle that case :

    if ( lrl <= 8 )
    {
        if ( lrl > 0 )
        {
            *((U64 *)(op)) = *((U64 *)(ip));
        }
    }
    else
    {
        // lrl > 8
        if ( lrl == 0xF )
        {
            AddExcess(lrl,ip);

            // lrl >= 15
            copy_literals_long(op, ip, lrl);
        }
        else
        {
            // lrl > 8 && < 15
            *((U64 *)(op)) = *((U64 *)(ip));
            *((U64 *)(op+8)) = *((U64 *)(ip+8));
        }
    }

which we should then rearrange a bit :

    // always immediately speculatively copy 8 :
    *((U64 *)(op)) = *((U64 *)(ip));

    if_unlikely( lrl > 8 )
    {
        // lrl > 8
        // speculatively copy next 8 :
        *((U64 *)(op+8)) = *((U64 *)(ip+8));

        if_unlikely ( lrl == 0xF )
        {
            AddExcess(lrl,ip);

            // lrl >= 15
            // can copy first 16 without checking lrl
            copy_literals_long(op, ip, lrl);
        }
    }
    
    op += lrl;
    ip += lrl;

and we have a faster literal-copy path.

Astute readers may notice that we can now be stomping past the end of the buffer, because we always do the 8 byte copy regardless of LRL. There are various ways to prevent this. One is to make your EOF check test for (end - 8), and when you break out of that loop, then you have a slower/safer loop to finish up the end. (ADD : not exactly right, see notes in comments)
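For illustration (a sketch, not Oodle's actual code), the two-loop shape looks like this; op_buf_end marks the real end of the output buffer, past which we must never write :

```c
#include <stdint.h>
#include <string.h>

// Fast loop is allowed to stomp up to 7 bytes past the current write
// position, so it only runs while at least 8 bytes of room remain in
// the buffer; a bytewise loop finishes the tail safely.
static void copy_fast_then_safe(uint8_t *op, const uint8_t *ip,
                                size_t len, uint8_t *op_buf_end)
{
    uint8_t *op_end = op + len;
    // fast path: wild 8-byte blasts, only while slack remains
    while (op < op_end && op + 8 <= op_buf_end)
    {
        memcpy(op, ip, 8);   // may copy past op_end, never past op_buf_end
        op += 8; ip += 8;
    }
    // safe tail: no overwrite allowed here
    while (op < op_end)
        *op++ = *ip++;
}
```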

Obviously we should do the exact same procedure with the ml_control. Again the check for spilling the 0xF nibble tells us something about the match length, and we should use that information to our advantage. And again the short matches are by far the most common, so we should detect that case and make it as quick as possible.

The next branch we'll look at is the overlap check. Again some statistics should be our guide : less than 1% of all matches are overlap matches. Overlap matches are important (well, offset=1 overlaps are important) because they are occasionally very long, but they are rare. One option is just to forbid overlap matches entirely, and really that doesn't hurt compression much. We won't do that. Instead we want to hide the overlap case in another rare case : the excess ML 0xF nibble case.

The way to make compressors fast is to look at the code you have to write, and then change the format to make that code fast. That is, write the code the way you want it and the format follows - don't think of a format and then write code to handle it.

We want our match decoder to be like this :


if ( ml_control < 0xF )
{
    // 4-18 byte short match
    // no overlap allowed
    // ...
}
else
{
    // long match OR overlap match
    if ( offset < 8 )
    {
        ml = 4; // back up to MML
        AddExcess(ml,ip);

        // do offset < 8 possible overlap match
        // ...
    }
    else
    {
        AddExcess(ml,ip);

        // do offset >= 8 long match
        // ...
    }
}

so we change our encoder to make data like that.
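To be concrete, here's a hedged sketch of the match copier split on offset (names are mine, not LZ4's). When offset < 8, an 8-byte blast would read bytes it hasn't written yet, so we go bytewise; otherwise 8-byte chunks are safe (with slack past the end, as with literals) :

```c
#include <stdint.h>
#include <string.h>

static void copy_match(uint8_t *op, int offset, int ml)
{
    const uint8_t *from = op - offset;
    if (offset < 8)
    {
        // overlap-safe bytewise copy; for offset==1 this is an RLE expand
        while (ml--) *op++ = *from++;
    }
    else
    {
        // no overlap within 8 bytes: chunked copy, may stomp up to 7
        // bytes of slack past op+ml
        for (int i = 0; i < ml; i += 8)
            memcpy(op + i, from + i, 8);
    }
}
```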

Astute readers may note that overlap matches now take an extra byte to encode, which does cost a little compression ratio. If we like we can fix that by reorganizing the code stream a little (one option is to send one ml excess byte before sending offset and put a flag value in there; since this is all in the very rare pathway it can be more complex in its encoding), or we can just ignore it, it's around a 1% hit.

That's it.

One final note, if you want to skip all that, there is a short cut to get much of the win.

The simplest case is the most important - no excess lrl, no excess ml - and it occurs for around 40% of all control bytes. So we can just detect it and fastpath that case :


U8 * ip; // = compressed input
U8 * op; // = decompressed output

for(;;)
{
    int control = *ip++;

    int lrl = control>>4;
    int ml_control = control&0xF;

    if ( (control & 0x88) == 0 )
    {
        // lrl < 8 and ml_control < 8

        // copy literals :
        *((U64 *)(op)) = *((U64 *)(ip));
        ip += lrl;
        op += lrl;

        // get offset
        int offset = *((U16 *)ip);
        ip+=2;
    
        int ml = ml_control + 4;    

        // copy match; 4-11 bytes :
        U8 * from = op - offset;
        *((U64 *)(op)) = *((U64 *)(from));
        *((U32 *)(op+8)) = *((U32 *)(from+8));

        op += ml;

        continue;
    }

    // ... normal decoder here ...
}

This fastpath doesn't help much with all the other improvements we've done here, but you can just drop it on the original decoder and get much of the win. Of course beware the EOF handling. Also this fastpath assumes that you have forbidden overlaps from the normal codes and are sending the escape match code (0xF) for overlap matches.
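As a sanity check on that mask trick: (control & 0x88) tests the top bit of each nibble, so comparing it to zero is exactly "both nibbles < 8" in a single branch :

```c
// (control & 0x88) == 0  <=>  (control>>4) < 8  &&  (control&0xF) < 8
static int fastpath_ok(int control)
{
    return (control & 0x88) == 0;
}
```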

ADDENDUM : A few notes :

In case it's not clear, one of the keys to this type of fast decoder is that there's a "hot region" in front of the output pointer that contains undefined data, which we keep stomping on over and over. eg. when we do the 8-byte literal blasts regardless of lrl. This has a few consequences you must be aware of.

One is if you're trying to decode in a minimum size circular window (eg. 64k in this case). Offsets to matches that are near window size (like 65534) are actually wrapping around to be just *ahead* of the current output pointer. You cannot allow those matches in the hot region because those bytes are undefined. There are a few fixes - don't use this decoder for 64k circular windows, or don't allow your encoder to output offsets that cause that problem.

A similar problem arises with parallelizing the decode. If you're decoding chunks of the stream in parallel, you cannot allow the hot region of one thread to be stomping past the end of its chunk, which another thread might be reading from. To handle this Oodle defines "seek points" which are places that you are allowed to parallelize the decode, and the coders ensure that the hot regions do not cross seek points. That is, within a chunk up to the seek point, the decoder is allowed to do these wild blast-aheads, but as it gets close to a seek point it must break out and fall into a safer mode (this can be done with a modified end condition check).

10/09/2013

10-09-13 - Urgh ; Threads and Memory

This is a problem I've been trying to avoid really facing, so I keep hacking around it, but it keeps coming back to bite me every few months.

Threads/Jobs and memory allocation is a nasty problem.

Say you're trying to process some 8 GB file on a 32-bit system. You'd like to fire up a bunch of threads and let them all crank on chunks of the file simultaneously. But how big of a chunk can each thread work on? And how many threads can you run?

The problem is those threads may need to do allocations to do their processing. With free-form allocations you don't necessarily know in advance how much they need to allocate (it might depend on processing options or the data they see or whatever). With a multi-process OS you also don't know in advance how much memory you have available (it may reduce while you're running). So you can't just say "I have the memory to run 4 threads at a time". You don't know. You can run out of memory, and then you have to abort the whole process and try again with fewer threads.

In case it's not obvious, you can't just try running 4 threads, and when one of them runs out of memory you pause that thread and run others, or kill that thread, because the thread may do work and allocations incrementally, like work,alloc,work,alloc,etc. so that by the time an alloc fails, you're already holding a bunch of other allocs and no other thread may be able to run.

To be really clear, imagine you have 2 MB free and your threads do { alloc 1 MB, work A, alloc 1 MB, work B }. You try to run 2 threads, and they both get up to work A. Now neither thread can continue because you're out of memory.
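That scenario can be simulated with a toy allocator (everything here is made up; the point is just the interleaving) :

```c
// 2 MB pool; each "thread" runs { alloc 1 MB, work A, alloc 1 MB, work B }.
static int pool_mb = 2;

// returns 1 on success, 0 if the pool is exhausted
static int try_alloc_mb(int mb)
{
    if (pool_mb < mb) return 0;
    pool_mb -= mb;
    return 1;
}
```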

The real solution is for each Job to pre-declare its resource requirements. Like "I need 80 MB" to run. Then it becomes the responsibility of the Job Manager to do the allocation, so when the Job is started, it is handed the memory and it knows it can run; all allocations within the Job then come from the reserved pool, not from the system.

(there are of course other solutions; for example you could make all your jobs rewindable, so if one fails an allocation it is aborted (and any changes to global state undone), or similarly all your jobs could work in two stages, a "gather" stage where allocs are allowed, but no changes to the global state are allowed, and a "commit" phase where the changes are applied; the job can be aborted during "gather" but must not fail during "commit").

So the Job Manager might try to allocate memory for a job, fail, and run some other jobs that need less memory. eg. if you have jobs that take { 10, 1, 10, 1 } of memories, and you have only 12 memories free, you can't run the two 10's at the same time, but you can run the 1's while a 10 is running.
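A minimal sketch of that admission policy, using the { 10, 1, 10, 1 } example above (all names here are hypothetical) :

```c
typedef struct { int need_mb; int running; } Job;

static int free_mb;

// returns 1 if the job was started (memory reserved), 0 if it must wait
static int try_start(Job *j)
{
    if (j->need_mb > free_mb) return 0;
    free_mb -= j->need_mb;
    j->running = 1;
    return 1;
}

// job done: its whole reservation returns to the pool
static void finish(Job *j)
{
    free_mb += j->need_mb;
    j->running = 0;
}
```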

While you're at it, you may as well put some load-balancing in your Jobs as well. You could have each Job mark up to what extent it needs CPU, GPU, or IO resources (in addition to memory use). Then the Job Manager can try to run jobs that are of different types (eg. don't run two IO-heavy jobs at the same time).

If you want to go even more extreme, you could have Jobs pre-declare the shared system resources that they need locks on, and the Job Manager can try to schedule jobs to avoid lock contention. (the even super extreme version of this is to pre-declare *all* your locks and have the Job Manager take them for you, so that you are guaranteed to get them; at this point you're essentially making Jobs into snippets that you know cannot ever fail and cannot even ever *block*, that is they won't even start unless they can run straight to completion).

I haven't wanted to go down this route because it violates one of my Fundamental Theorems of Jobs, which is that job code should be the same as main-thread code, not some weird meta-language that requires lots of markup and is totally different code from what you would write in the non-threaded case.

Anyway, because I haven't properly addressed this, it means that in low-memory scenarios (eg. any 32-bit platform), the Oodle compressors (at the optimal parse level) can run out of memory if you use too many worker threads, and it's hard to really know that's going to happen in advance (since the exact memory use depends on a bunch of options and is hard to measure). Bleh.

(and obviously what I need to do for Oodle, rather than solving this problem correctly and generally, is just to special case my LZ string matchers and have them allocate their memory before starting the parallel compress, so I know how many threads I can run)

10/04/2013

10-04-13 - The Reality of Coding

How you actually spend time.

4 hours - think about a problem, come up with a new algorithm or variation of algorithm, or read papers to find an algorithm that will solve your problem. This is SO FUCKING FUN and what keeps pulling me back in.

8 hours - do initial test implementation to prove concept. It works! Yay!

And now the fun part is over.

50 hours - do more careful implementation that handles all the annoying corner cases; integrate with the rest of your library; handle failures and so on. Provide lots of options for slightly different use cases that massively increase the complexity.

50 hours - adding features and fixing rare bugs; spread out over the next year

20 hours - have to install new SDKs to test it; inevitably they've broken a bunch of APIs and changed how you package builds so waste a bunch of time on that

10 hours - some stupid problem with Win8 loading the wrong drivers; or the linux dir my test is writing to is no longer writable by my user account; whatever stupid computer problem; chase that around for a while

10 hours - the p4 server is down / vpn is down / MSVC has an internal compiler error / my laptop is overheating / my hard disk is full, whatever stupid problem always holds you up.

10 hours - someone checks in a breakage to the shared library; it would take a minute just to fix it, but you can't do that because it would break them so you have to do meetings to agree on how it should be fixed

10 hours - some OS API you were using doesn't actually behave the way you expected, or has a bug; some weird corner case or undocumented interaction in the OS API that you have to chase down

40 hours - writing docs and marketing materials, teaching other members of your team, internal or external evangelizing

30 hours - some customer sees a bug on some specific OS or SDK version that I no longer have installed; try to remote debug it, that doesn't work, try to find a repro, that doesn't work, give up and install their setup; in the end it turns out they had bad RAM or something silly.

The reality is that as a working coder, the amount of time you actually get to spend working on the good stuff (new algorithms, hacking, clever implementations) is vanishingly small.

10/03/2013

10-03-13 - SetLastError(0)

Public reminder to myself about something I discovered a while ago.

If you want to do IO really robustly in Windows, you can't just assume that your ReadFile / WriteFile will succeed under normal usage. There are lots of nasty cases where you need to retry (perhaps with smaller IO sizes, or just after waiting a bit).

In particular you can see these errors in normal runs :


ERROR_NOT_ENOUGH_MEMORY = too many AsyncIO 's pending

ERROR_NOT_ENOUGH_QUOTA  = single IO call too large
    not enough process space pages available
    -> SetProcessWorkingSetSize

ERROR_NO_SYSTEM_RESOURCES = 
    failure to alloc pages in the kernel address space for the IO
    try again with smaller IOs  

ERROR_IO_PENDING = 
    normal async IO return value

ERROR_HANDLE_EOF = 
    sometimes normal EOF return value

Anyway, this post is not about the specifics of IO errors. (Random aside : I believe that some of these annoying errors were much more common in 32-bit Windows; the failure to get address space to map IO pages was a bigger problem in 32-bit (I saw it most often when running with the /3GB option, which makes the kernel page space a scarce commodity); I don't think I've seen it in the field in 64-bit Windows.)
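For what it's worth, the retry-with-smaller-sizes shape mentioned above can be sketched generically (try_write_fn is a hypothetical callback standing in for WriteFile; it returns bytes written, or -1 for a transient resource failure) :

```c
#include <stddef.h>

typedef long (*try_write_fn)(const char *buf, size_t size);

// On a transient failure, halve the IO size and retry, down to a floor.
static int robust_write(try_write_fn try_write, const char *buf, size_t size)
{
    while (size > 0)
    {
        size_t chunk = size;
        long done;
        while ((done = try_write(buf, chunk)) < 0)
        {
            if (chunk <= 4096) return 0;   // persistent failure : give up
            chunk /= 2;                    // retry with a smaller IO
        }
        buf += done;
        size -= (size_t)done;
    }
    return 1;
}
```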

I discovered a while ago that ReadFile and WriteFile can fail (return false) but not set last error to anything. That is, you have something like :


SetLastError(77); // something bogus

if ( ! ReadFile(....) )
{
    // failure, get code :
    DWORD new_error = GetLastError();

    // new_error should be the error info about ReadFile failing
    // but sometimes it's still 77
    ...
}

and *sometimes* new_error is still 77 (or whatever; that is, it wasn't actually set when ReadFile failed).

I have no idea exactly what situations make the error get set or not. I have no idea how many other Win32 APIs are affected by this flaw; I only have empirical evidence for ReadFile and WriteFile.

Anyhoo, the conclusion is that best practice on Win32 is to call SetLastError(0) before invoking any API where you need to know for sure that the error code you get was in fact set by that call. eg.


SetLastError(0);
if ( ! SomeWin32API(...) )
{
    DWORD hey_I_know_this_error_is_legit = GetLastError();

}

That is, Win32 APIs returning failure does *not* guarantee that they set LastError.
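Incidentally, portable C has the same trap with errno : functions like strtol only set errno on failure (ERANGE on overflow) and never clear it on success, so you must zero it first to know a nonzero value is fresh. A sketch :

```c
#include <errno.h>
#include <stdlib.h>

// errno analogue of the same discipline: strtol only touches errno
// on overflow, so clear it first or you may read a stale value.
// returns 1 on success, 0 on overflow
static int parse_long_checked(const char *s, long *out)
{
    errno = 0;
    char *end;
    long v = strtol(s, &end, 10);
    if (errno == ERANGE) return 0;   // genuine overflow from this call
    *out = v;
    return 1;
}
```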


ADD : while I'm at it :

$err
$err,hr
in the VC watch window is pretty sweet.

GetLastError is :

*(DWORD *)($tib+0x34)

or *(DWORD *)(FS:[0x34]) on x86
