Trying the obvious ways to extend LZ4 to large windows.
In my view, the fundamental thing about LZ4 is the strictly alternating sequence of literal-run-len, match-len. That is, in LZ4 to send two matches in a row you must send a literal-run-len of 0 between them. This helps decoder speed a bit because it removes the "is this token a match or literal?" branch and makes you just always go literals-match-literals-match. (though you don't really get to avoid the branch, it's just moved into the looping on the run len).
I relax the control word structure to not just be 4 bits of LRL and 4 bits of ML. But I do keep strict bit-wise division
of the control world, and strictly byte-by-byte literals and offsets. The obvious contenders are :
LZ4 variants are named like :
# bits of LRL - # bits of ML - # bits of offset : how offset is encoded
we have :
4-4-0 : 16 = "LZ4 classic" , always a 16 bit offset
4-4-0 : 15/23 = one bit of 16 bit offset is reserved to indicate a 3 byte offset (so windows are 1<<15 , 1<<23)
4-4-0 : encodemodWB = use encodemod for offset; send word first then bytes
3-4-1 : 16/24 = 3 bits of LRL, 1 bit in control flags 2 or 3 byte offset
(4-3-1 is strictly worse than 3-4-1)
3-3-2 : encodemodB = 2 bottom bits of offset go in the control ; remainder send with encodemod bytes
this is the first variant that can send an offset in only 1 byte
3-3-2-B ML3-4 : same as above, but 1 byte offets get a min match len of 3 (instead of 4)
And the optimal parse compressed sizes are :
raw | 4-4-0 : 16 | 4-4-0 : 15/23 | 4-4-0 : encodemodWB | 3-4-1 : 16/24 | 3-3-2 : encodemodB | 3-3-2-B MML3-4 | |||||||||||||
lzt00 | 16914 | 6444 | 6444 | 6444 | 6551 | 6186 | 6068 | ||||||||||||
lzt01 | 200000 | 198900 | 198900 | 198900 | 198905 | 198893 | 198880 | ||||||||||||
lzt02 | 755121 | 410549 | 321448 | 314152 | 307669 | 315101 | 292427 | ||||||||||||
lzt03 | 3471552 | 1815464 | 1804312 | 1804086 | 1807200 | 1799418 | 1795951 | ||||||||||||
lzt04 | 48649 | 16461 | 16463 | 16461 | 16564 | 15938 | 15584 | ||||||||||||
lzt05 | 927796 | 459700 | 457261 | 454931 | 460191 | 445986 | 440742 | ||||||||||||
lzt06 | 563160 | 492938 | 429336 | 428734 | 431119 | 429374 | 419768 | ||||||||||||
lzt07 | 500000 | 264112 | 264594 | 263954 | 266882 | 257550 | 248500 | ||||||||||||
lzt08 | 355400 | 330793 | 330524 | 328740 | 336329 | 334284 | 322959 | ||||||||||||
lzt09 | 786488 | 340317 | 327145 | 323145 | 321273 | 326352 | 325124 | ||||||||||||
lzt10 | 154624 | 14845 | 14706 | 14627 | 14687 | 13714 | 13299 | ||||||||||||
lzt11 | 58524 | 25749 | 25755 | 25750 | 25943 | 24555 | 23870 | ||||||||||||
lzt12 | 164423 | 32485 | 32770 | 32470 | 32542 | 31272 | 30864 | ||||||||||||
lzt13 | 1041576 | 1042749 | 1043586 | 1042984 | 1042836 | 1042747 | 1040033 | ||||||||||||
lzt14 | 102400 | 56478 | 56734 | 56535 | 57516 | 55182 | 53395 | ||||||||||||
lzt15 | 34664 | 13995 | 13996 | 13995 | 14102 | 13050 | 12723 | ||||||||||||
lzt16 | 21504 | 12340 | 12340 | 12340 | 12517 | 11847 | 11392 | ||||||||||||
lzt17 | 53161 | 23025 | 23167 | 23025 | 23163 | 22374 | 22028 | ||||||||||||
lzt18 | 102400 | 85614 | 87190 | 85929 | 86374 | 86197 | 79138 | ||||||||||||
lzt19 | 768771 | 359276 | 345974 | 339273 | 334512 | 337204 | 335912 | ||||||||||||
lzt20 | 1179702 | 1043192 | 1011629 | 1004412 | 1004435 | 1002099 | 993442 | ||||||||||||
lzt21 | 679936 | 192411 | 120808 | 120704 | 121908 | 115289 | 113461 | ||||||||||||
lzt22 | 400000 | 361524 | 356885 | 353333 | 353676 | 353120 | 348347 | ||||||||||||
lzt23 | 1048576 | 1040623 | 1038648 | 1034493 | 1039073 | 1038802 | 1035197 | ||||||||||||
lzt24 | 3471552 | 2369040 | 1911004 | 1907645 | 1929223 | 1931560 | 1934129 | ||||||||||||
lzt25 | 1029744 | 324107 | 324281 | 323513 | 323032 | 332437 | 332747 | ||||||||||||
lzt26 | 262144 | 246334 | 248360 | 246587 | 247177 | 246667 | 244990 | ||||||||||||
lzt27 | 857241 | 425694 | 386493 | 386056 | 387358 | 358184 | 353497 | ||||||||||||
lzt28 | 1591760 | 437666 | 399105 | 393814 | 390517 | 390421 | 388712 | ||||||||||||
lzt29 | 3953035 | 2230095 | 1563410 | 1554583 | 1553331 | 1537093 | 1519904 | ||||||||||||
lzt30 | 100000 | 100394 | 100394 | 100394 | 100394 | 100394 | 100393 | ||||||||||||
total | 24700817 | 14773314 | 13273662 | 13212009 | 13246999 | 13173290 | 13053476 | ||||||||||||
raw | 4-4-0 : 16 | 4-4-0 : 15/23 | 4-4-0 : encodemodWB | 3-4-1 : 16/24 | 3-3-2 : encodemodB | 3-3-2 : encdemodB |
Some thoughts : going to large window helps a lot, but the exact scheme doesn't matter much.
Obviously you could do better by going to more complexity; arbitrary dividers instead of integer bit counts; different min match lens for each # of offset bytes (eg. 3 for 1 byte, 4 for 2 bytes, 5 for 3 bytes, etc.).
Probably the biggest next step (which isn't a big speed hit) would be to add a "last offset" (aka "repeat match"). Last offset is a big LHS penalty on the shit platforms, but here our decoder is so simple that we have the potential to write the whole thing in assembly and just keep the last offset in a register.
BTW I don't think I said this explicitly in the last post, but the great thing about having an optimal parser is that it's much *easier* to write an optimal parser for an arbitrary coding scheme than it is to write a well tweaked heuristic. Each time you change the coding, you would have to re-tune the heuristic, whereas with an optimal parser you just plug in the code costs and let it run. Then once you have decided on a format, you can go back and try to find a heuristic that produces a good parse without the slowness of the optimal parse.
2 comments:
> (though you don't really get to avoid the branch, it's just moved into the looping on the run len)
How about always having a multiple of some number of match-literal pairs? Then you'd only have to check every so often.
If the point is for big windows, then the tail end where there might be some padding is no big deal.
Probably a bad idea.
Well, back in the ancient days we did things like this :
Put match/literal flag in a byte, so you put 8 single-bit flags together.
Decoder just does a switch() on the control byte with 8 flags in it
Jump to 256 copies of unrolled decoder that spits out 8 matches or literals at a time
On old chips that had tiny icaches, this didn't really hurt much (because you had very little hope of keeping much in icache anyway). (and of course instructions were relatively expensive compared to cache misses in the old days)
Nowadays I-cache is usually very large and you can often keep your whole decoder in I-cache, so doing something like this that creates a huge amount of code is a no-go. (as bad as a branch is, it's a lot better than an unpredictable I-cache miss).
Post a Comment