9/10/2012

09-10-12 - LZ4 - Large Window

Continuing from : LZ4 Optimal Parse (also see the Encoding Values in Bytes series).

Trying the obvious ways to extend LZ4 to large windows.

In my view, the fundamental thing about LZ4 is the strictly alternating sequence of literal-run-len, match-len. That is, in LZ4 to send two matches in a row you must send a literal-run-len of 0 between them. This helps decoder speed a bit because it removes the "is this token a match or literal?" branch and makes you just always go literals-match-literals-match. (though you don't really get to avoid the branch, it's just moved into the looping on the run len).

I relax the control word structure to not just be 4 bits of LRL and 4 bits of ML. But I do keep strict bit-wise division of the control world, and strictly byte-by-byte literals and offsets. The obvious contenders are :


LZ4 variants are named like :

# bits of LRL - # bits of ML - # bits of offset : how offset is encoded

we have :

4-4-0 : 16  = "LZ4 classic" , always a 16 bit offset

4-4-0 : 15/23 = one bit of 16 bit offset is reserved to indicate a 3 byte offset (so windows are 1<<15 , 1<<23)

4-4-0 : encodemodWB = use encodemod for offset; send word first then bytes

3-4-1 : 16/24 =  3 bits of LRL, 1 bit in control flags 2 or 3 byte offset
     (4-3-1 is strictly worse than 3-4-1) 

3-3-2 : encodemodB = 2 bottom bits of offset go in the control ; remainder send with encodemod bytes
        this is the first variant that can send an offset in only 1 byte

3-3-2-B ML3-4 : same as above, but 1 byte offets get a min match len of 3 (instead of 4)

And the optimal parse compressed sizes are :

raw 4-4-0 : 16 4-4-0 : 15/23 4-4-0 : encodemodWB 3-4-1 : 16/24 3-3-2 : encodemodB 3-3-2-B MML3-4
lzt00 16914 6444 6444 6444 6551 6186 6068
lzt01 200000 198900 198900 198900 198905 198893 198880
lzt02 755121 410549 321448 314152 307669 315101 292427
lzt03 3471552 1815464 1804312 1804086 1807200 1799418 1795951
lzt04 48649 16461 16463 16461 16564 15938 15584
lzt05 927796 459700 457261 454931 460191 445986 440742
lzt06 563160 492938 429336 428734 431119 429374 419768
lzt07 500000 264112 264594 263954 266882 257550 248500
lzt08 355400 330793 330524 328740 336329 334284 322959
lzt09 786488 340317 327145 323145 321273 326352 325124
lzt10 154624 14845 14706 14627 14687 13714 13299
lzt11 58524 25749 25755 25750 25943 24555 23870
lzt12 164423 32485 32770 32470 32542 31272 30864
lzt13 1041576 1042749 1043586 1042984 1042836 1042747 1040033
lzt14 102400 56478 56734 56535 57516 55182 53395
lzt15 34664 13995 13996 13995 14102 13050 12723
lzt16 21504 12340 12340 12340 12517 11847 11392
lzt17 53161 23025 23167 23025 23163 22374 22028
lzt18 102400 85614 87190 85929 86374 86197 79138
lzt19 768771 359276 345974 339273 334512 337204 335912
lzt20 1179702 1043192 1011629 1004412 1004435 1002099 993442
lzt21 679936 192411 120808 120704 121908 115289 113461
lzt22 400000 361524 356885 353333 353676 353120 348347
lzt23 1048576 1040623 1038648 1034493 1039073 1038802 1035197
lzt24 3471552 2369040 1911004 1907645 1929223 1931560 1934129
lzt25 1029744 324107 324281 323513 323032 332437 332747
lzt26 262144 246334 248360 246587 247177 246667 244990
lzt27 857241 425694 386493 386056 387358 358184 353497
lzt28 1591760 437666 399105 393814 390517 390421 388712
lzt29 3953035 2230095 1563410 1554583 1553331 1537093 1519904
lzt30 100000 100394 100394 100394 100394 100394 100393
total 24700817 14773314 13273662 13212009 13246999 13173290 13053476
raw 4-4-0 : 16 4-4-0 : 15/23 4-4-0 : encodemodWB 3-4-1 : 16/24 3-3-2 : encodemodB 3-3-2 : encdemodB

Some thoughts : going to large window helps a lot, but the exact scheme doesn't matter much.

Obviously you could do better by going to more complexity; arbitrary dividers instead of integer bit counts; different min match lens for each # of offset bytes (eg. 3 for 1 byte, 4 for 2 bytes, 5 for 3 bytes, etc.).

Probably the biggest next step (which isn't a big speed hit) would be to add a "last offset" (aka "repeat match"). Last offset is a big LHS penalty on the shit platforms, but here our decoder is so simple that we have the potential to write the whole thing in assembly and just keep the last offset in a register.

BTW I don't think I said this explicitly in the last post, but the great thing about having an optimal parser is that it's much *easier* to write an optimal parser for an arbitrary coding scheme than it is to write a well tweaked heuristic. Each time you change the coding, you would have to re-tune the heuristic, whereas with an optimal parser you just plug in the code costs and let it run. Then once you have decided on a format, you can go back and try to find a heuristic that produces a good parse without the slowness of the optimal parse.

2 comments:

Will said...

> (though you don't really get to avoid the branch, it's just moved into the looping on the run len)

How about always having a multiple of some number of match-literal pairs? Then you'd only have to check every so often.

If the point is for big windows, then the tail end where there might be some padding is no big deal.

Probably a bad idea.

cbloom said...

Well, back in the ancient days we did things like this :

Put match/literal flag in a byte, so you put 8 single-bit flags together.

Decoder just does a switch() on the control byte with 8 flags in it

Jump to 256 copies of unrolled decoder that spits out 8 matches or literals at a time

On old chips that had tiny icaches, this didn't really hurt much (because you had very little hope of keeping much in icache anyway). (and of course instructions were relatively expensive compared to cache misses in the old days)

Nowadays I-cache is usually very large and you can often keep your whole decoder in I-cache, so doing something like this that creates a huge amount of code is a no-go. (as bad as a branch is, it's a lot better than an unpredictable I-cache miss).

old rants