9/28/2010

09-28-10 - Branchless LZ77 Decoder

It occurred to me that if you do an LZ77 where the literal/match flag is sent via a run length of literals, then a run of literals is the same as a match, but the source is the next few bytes of the compressed buffer, rather than some previous location in the decompressed buffer.

That is, you're decompressing from "comp" buffer into "dest" buffer. A "match" is just a copy from the "dest" buffer, and literals are just at copy from the "comp" buffer.

So, let's say we do a byte-wise LZ77 , use one bit to flag literal or match, then 7 bits for the length. Our branchless decoder is something like :


{
    U8 control = *comp++;
    int source = control >> 7;
    int length = (control & 127) + 1;
    U8 * lit_ptr = comp;
    U8 * mat_ptr = dest - *((U16 *)comp);
    U8 * copy_from_ptr = select( lit_ptr, mat_ptr, source );
    memcpy( dest, copy_from_ptr, length );
    dest += length;
    comp += (source<<1);
}

Where "select(a,b,c)" = c ? ( a : b ); or something like that. (sometimes better implemented with a negative and; on the PC you might use cmov, on the spu you might use sel, etc.)

While this should be very fast, compression will not be awesome because the division for literals and matches is not ideal and 7 bits of length is a lot for literals but not enough for matches, and offsets are always 16 bits which is too little. We can do a slightly better version using a 256 entry lookup table for the control :


{
    U8 control = *comp++;
    int source = source_table[control];
    int length = length_table[control];
    U8 * lit_ptr = comp;
    U8 * mat_ptr = dest - *((U16 *)comp);
    U8 * copy_from_ptr = select( lit_ptr, mat_ptr, source );
    memcpy( dest, copy_from_ptr, length );
    dest += length;
    comp += (source<<1);
}

for example with the table you could let the match lengths be larger and sparse. But it would probably be better to just have a branch that reads more bytes for long lengths.

Adding things like optionally larger offsets starts to make the branchless code complex enough that eating one branch is better. If you're willing to do a lot of variable shifts it's certainly possible, for example you could grab 1 control byte, look it up in various tables. The tables tell you some # of bits for match length, some # for match offset, and some # for literal run length (they add up a multiple of 8 and use some portion of the control byte as well). Unfortunately variable shifting is untenably slow on many important platforms.

BTW one useful trick for reducing branches in your LZ decoder is to put the EOF case out in some rare case, rather than as your primary loop condition, and you change your primary loop to be an unconditional branch. On PC's this doesn't change much but on some architectures an unconditional branch is much cheaper than a conditional one, even it's predictable. That is, instead of :


while ( ! eof )
{
   .. do one decode step ..
   if ( mode 1 )
     ..
   if ( mdoe 2 )
    ...
}

You do :

for(;;)
{
   .. do one decode step ..
   if ( mode 1 )
     ..
   if ( mode 2 )
   {
     if ( rare case )
     {
        if ( eof )
          break;
     }
   }
}

Also, obviously you don't actually use "memcpy" , but whatever you use for the copy has an internal branch. And of course we have turned our branch into a tight data dependency. On some platforms that's not much better than a branch, but on many it is much better. Unfortunately unrolling doesn't help the data dependency much because of the fact that LZ can copy from its own output, so you have to wait for the copy to be done before you can start the next one (there's also an inherent LHS here, though that stall is the least of our worries).

18 comments:

cbloom said...

Obviously memcpy can be done without a branch on some architectures (rep stosb), but in practice it's always faster to use at least one branch to separate the short case from the long case.

Shelwien said...

its rep movsb/d.
Also what do you think about that idea - http://encode.ru/threads/550-Ultra-fast-LZ?p=11005&viewfull=1#post11005 ?

cbloom said...

The idea of generating an executable that outputs the data?

It's an interesting idea, but I just don't see a way to make it work since executable instructions are so much larger than we need.

There are certainly places where some limited amount of codegen could be used to speed up decoders. For example as I noted before, for Huffman decoders you could generate the branch tree.

If you have cores to spare it's possible you could decode the packed custom executable tokens into machine instructions on one core and then run them on another, but that's insanity.

Shelwien said...

I meant specifically splitting LZ decoding into entropy decoding (with x86 code generation) and match unrolling (by running that x86 code).
And I do believe that it could improve the decoding speed, because L1 code cache would be employed for match offset/length decoding, thus allowing for more efficient use of L1 data cache.
Also it was just an interpretation of a name of LZO technique.

cbloom said...

Yeah, but the output of instructions from the entropy decode part has to basically be a memcpy, so you're eating the branches there instead (and maybe eating some of them twice).

eg. drawing analogy to the code in the post you would do something like :

U8 control = *comp++;
void * iptr = table_instructions[control];
int isize = table_instruction_size[control];
memcpy(code_ptr,iptr,isize);
code_ptr += isize;

this gets the intructions to decode a given control code then puts them together, and after a bit you run them.
The actual instructions to copy a length 4 match (or whatever) can then be branchless.

But you haven't actually won anything I don't think. (* in the trivial case)

(ignoring practical problems with code generation)

(* that is if you just directly translate your simple LZ77 decoder you don't win much; but the instruction method does let you do much more complex things for the same cost; eg. all 256 control codes could do completely different decode ops and you avoid the cost of branching to find all those cases).

The win of getting the match lengths/offsets into icache is an illusion, because you had to pull them through dcache to generate the code.

Shelwien said...

Ok, here's an actual test:
http://nishi.dreamhosters.com/u/lzcodegen_v0.rar

clocks are cpu clocks from rdtsc.
first result is from 2.cpp, with a plain decoding loop,
and second is the result with code generation
(generation+execution)
Note that IntelC replaced the cpy() with its crazy memcpy
function call, which works faster with SSE2 or whatever it uses,
but it doesn't really count because the same tricks can be
used in generated code too.

[ MSC ]
clock=55799840
clock=42811688+5514896=48326584

[ gcc450 ]
clock=53975928
clock=43155400+5571568=48726968

[ IC111 ]
clock=39331688
clock=43025976+5640504=48666480

ryg said...

"Note that IntelC replaced the cpy() with its crazy memcpy
function call, which works faster with SSE2 or whatever it uses,
but it doesn't really count because the same tricks can be
used in generated code too."
Wait, what?

You post what is basically a strawman LZ decoder with a match-copying loop that's significantly worse than anything you'd do in a real decoder. And when ICC detects the pattern and compiles it into a good memcpy, you say that doesn't really matter because you could do the same in your generated code?

Did you even look at your numbers? The ICC-compiled normal decoder is over 10% faster than your code generator!

Your code generator has the same branches that a regular decoder would have. But it then generates 3 bytes worth of code for every 1-byte literal! In what universe is that a good idea?

Matches are just as bad! For most files (and most LZs), the majority of matches are short (say between 2 and 5 bytes), yet you generate 10 bytes worth of code for them. If you generate an "optimized" memcpy, you have all the ifs in your code generator that the optimized memcpy would have, and you end up outputting even more code!

What problem is this even trying to solve? I've never seen a decently written LZ+huffman decoder be a bottleneck in any real app. (Hint: Decompressing from the FS cache into the FS cache is not a real app). In reality, as soon as you can decompress data faster than your IO can hand you new chunks to decompress, you're going fast enough. Pure-LZ stuff like LZO is there already (and has been for quite some time), and the amount of computation we can do per IO (and that includes memory bandwidth!) is only going up. In that scenario, you want to spend more CPU time per byte if it allows you to reduce the amount of input you're reading (which is your actual bottleneck).

Shelwien said...

@ryg:
I look at my numbers and see that my code generator is faster, despite all the problems you mentioned.
(btw, IC version with /Oi- - disabled
intrinsics - shows clock=54467520)
And I'm sorry, but that demo was written just now for this discussion, so isn't it weird to expect me to post a perfect codec?
I still think that it supports my point well enough as it is.
Also do you have another explanation for LZO's "the revolutionary Data To Code Transformations (D2CT)."?

ryg said...

"I look at my numbers and see that my code generator is faster, despite all the problems you mentioned."
Yes, and I say that this result is completely bogus, because you're not comparing the implementations on equal terms. You can't change two things A and B (copy loop and decoding strategy) at the same time and then conclude that B is faster; all you know is A+B is faster, and A may have a much larger effect than B does.

"And I'm sorry, but that demo was written just now for this discussion, so isn't it weird to expect me to post a perfect codec?"
I don't care about "perfection", but I do care about proper methodology. In particular, if you make any claims and I see a flaw in your experiments, I will point them out, and I won't be subtle about it if I think it's something that should be obvious.

Another thing is making your results relevant and reproducible. You didn't include your test data file, so I can't re-run your experiments (and figure out if the copy loop makes a difference on my own), and I also can't see what type of data you're testing on - might be 12MB of zeroes for all I know.

Finally - and that's the biggest red flag - you made a lot of claims that are either wrong or just make no sense. Once a certain level of handwaving is reached, I tend to get very skeptical. As Charles pointed out, your comment about L1 ICache vs. DCache makes no sense; your generated code goes through both. Nor have you explained where the gain you're expecting is supposed to come from (you're still doing the same number of branches, and you're generating a bunch of extra data that a normal LZ decoder wouldn't, then execute it out of a cold ICache. In short, you're doing strictly more work than a normal decoder would, have no reasonable explanation of where you expect to get your gains, and only come out ahead in an example where you need to make the compiler intentionally generate bad code. I'm not impressed.

Will said...

so JIT aside, what else might D2CT mean? Speculation please!

Shelwien said...

http://encode.ru/threads/1135-Code-generation-in-LZ-decoder-Branchless-LZ77-Decoder?p=22606&viewfull=1#post22606

cbloom said...

BTW the fact that ICC is changing that loop to "memcpy" is in fact a big *disadvantage* because the matches are typically short. If you want to make the test a bit more fair, you should just write your direct LZ decode loop using the same instructions that you emit in the second case.

Anonymous said...

Funny idea, but not practical.

ryg said...

"BTW the fact that ICC is changing that loop to "memcpy" is in fact a big *disadvantage* because the matches are typically short. If you want to make the test a bit more fair, you should just write your direct LZ decode loop using the same instructions that you emit in the second case."
Not in Shelwiens case apparently, which made me wonder how his test file looks (must have lots of long matches to explain the figures he's getting). Both stosb (1 cycle latency, 4 uops) and rep movs* (relatively high startup cost) aren't awesome on current Intel processors either.

In any case starting from VC2008 you just have __movsb as an intrinsic (same parameters as memcpy, defined in intrin.h) so this is really easy to change.

Shelwien said...

@ryg: As I already said (in the forum post)
there's a testfile generator which you didn't notice somehow.
Matches there have random length
from 2 to 255 (for MOV CL,xx; REP MOVSB), match or literal is selected with probability 0.5

  said...

I've done this before on a SPU. Worked out pretty well in terms of decoding performance. Compression was so so, but worth it in some cases where you knew it was going to be a win.

ryg said...

"@ryg: As I already said (in the forum post)
there's a testfile generator which you didn't notice somehow."

I'm not on that forum and I don't read it, so it's lost on me if you don't mention it here :)

In any case, now that you mentioned it I found it, but given that it's named "1.cpp" I'm not surprised that I didn't notice it the first time round. A readme.txt that described what the programs do would've helped.

"Matches there have random length
from 2 to 255 (for MOV CL,xx; REP MOVSB), match or literal is selected with probability 0.5"
Now that is really completely bogus. I had assumed you used a "disassembled" LZ datastream. But you're literally drawing random numbers out of a uniform distribution for both match offset and length! That does not have anything to do with the distribution of values you get in a real LZ datastream, where both offsets and lengths have a (very roughly) exponential distribution, with small lengths and offsets much more common than large ones.

Among other things, the average "match length" in your test datastream is 128.5 bytes. That's way, *way* off what you'll see in a real datastream (as mentioned multiple times by CB and me, expect an average match length between 3 and 5 - or even less if you allow 2-byte matches). Even though you use completely byte-aligned coding using 1 byte for the "is this a match" flag, your test generator outputs a file that has an astonishing expected 21.58:1 "compression ratio", or roughly 0.37bpb - again, way off the expected operating range of a lossless coder. Among other things, this means your generated decoder is actually significantly smaller than the data it outputs, something which would never happen under more realistic assumptions. Let's assume a generous average match length of 5. 3.cpp generates 2 bytes of code for every literal (1 byte output) and 10 bytes of code for every match (expected 5 bytes output). That means in a realistic scenario your generated code is going to be about 2x as large as the data it writes, and the total amount of bytes written by (codegen+code) is 3x as much as a straight LZ decoder would.

Then there's the matter of the uniform random offsets. Most LZ offsets are small; of course, since you're copying 128.5 bytes on average, you don't get a big penalty from cache-thrashing since you actually read/write whole cache lines a substantial fraction of the time, but trust me when I say that your program still doesn't behave anything like a real LZ decoder.

This kind of synthetic benchmark is a tricky business; even when generated data looks statistically similar to actual traces, it can still get the actual behavior completely wrong (for a cautionary tale of how this hampered memory allocator research for several decades, refer to http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.141.4610&rep=rep1&type=pdf). It's way better (and more reliable) to use real test data instead of generating synthetic traces. But in your case, the traces don't even have the same statistical properties as actual LZ token streams, they're completely unrelated!

Anonymous said...

Totally tangential, but I wish that memory allocator paper ryg linked to was better.

The big problem is their analysis is incredibly sloppy. For example, they present a lot of totally meaningless averages in the last few tables.

In general they give short shrift to worst cases. For a general purpose thing (a system allocator), we generally want something with no worst case. (We don't necessarily have to optimize for things with the least bad worst case, but even if we design for the average case we want the least bad worst case to be not *too* bad.) Sadly, eight traces don't cut it for detecting whether these algorithms have worst cases.

Furthermore, their measurement of fragmentation is suspect; sure, they give 4 different measuresments of it and report the results for two of them, but then they say "The fact that AO performs *very* well on Espresso, using just 0.26% more memory than the theoretical minimum, but shows over 25% fragmentation using measure #3 is evidence of the misleading behavior of measure #3"... which both begs and invites the question... it uses 0.26% more memory than *what* theoretical minimum under *what* model? (Other allocators reported lower numbers... were they less close to the theoeretical minimum? Damn, why don't you just report their ratio relative to the theoretical minimum then?)

Moreover, they don't even *need* a fragmentation metric. For short-running programs (programs that don't run for multiple hours), a simple measurement of the maximum memory used by the allocator ever will suffice to compare multiple allocators. Fortunately this is strictly related to their fragmentation metric #4 for any given program; the way measure #4 fails is a failure for the ratio they talk about to be meaningful, not a failure of the numbers to be comparable across allocators.

old rants