cbloom rants: 09-10-12 - LZ4

Continuing from : LZ4 Optimal Parse (also see the Encoding Values in Bytes series).

Trying the obvious ways to extend LZ4 to large windows.

In my view, the fundamental thing about LZ4 is the strictly alternating sequence of literal-run-len, match-len. That is, in LZ4 to send two matches in a row you must send a literal-run-len of 0 between them. This helps decoder speed a bit because it removes the "is this token a match or literal?" branch and makes you just always go literals-match-literals-match. (though you don't really get to avoid the branch, it's just moved into the looping on the run len).

I relax the control word structure to not just be 4 bits of LRL and 4 bits of ML. But I do keep strict bit-wise division of the control world, and strictly byte-by-byte literals and offsets. The obvious contenders are :


LZ4 variants are named like :

# bits of LRL - # bits of ML - # bits of offset : how offset is encoded

we have :

4-4-0 : 16  = "LZ4 classic" , always a 16 bit offset

4-4-0 : 15/23 = one bit of 16 bit offset is reserved to indicate a 3 byte offset (so windows are 1<<15 , 1<<23)

4-4-0 : encodemodWB = use encodemod for offset; send word first then bytes

3-4-1 : 16/24 =  3 bits of LRL, 1 bit in control flags 2 or 3 byte offset
     (4-3-1 is strictly worse than 3-4-1) 

3-3-2 : encodemodB = 2 bottom bits of offset go in the control ; remainder send with encodemod bytes
        this is the first variant that can send an offset in only 1 byte

3-3-2-B ML3-4 : same as above, but 1 byte offets get a min match len of 3 (instead of 4)

And the optimal parse compressed sizes are :

	raw	4-4-0 : 16	4-4-0 : 15/23	4-4-0 : encodemodWB	3-4-1 : 16/24	3-3-2 : encodemodB	3-3-2-B MML3-4
lzt00	16914	6444	6444	6444	6551	6186	6068
lzt01	200000	198900	198900	198900	198905	198893	198880
lzt02	755121	410549	321448	314152	307669	315101	292427
lzt03	3471552	1815464	1804312	1804086	1807200	1799418	1795951
lzt04	48649	16461	16463	16461	16564	15938	15584
lzt05	927796	459700	457261	454931	460191	445986	440742
lzt06	563160	492938	429336	428734	431119	429374	419768
lzt07	500000	264112	264594	263954	266882	257550	248500
lzt08	355400	330793	330524	328740	336329	334284	322959
lzt09	786488	340317	327145	323145	321273	326352	325124
lzt10	154624	14845	14706	14627	14687	13714	13299
lzt11	58524	25749	25755	25750	25943	24555	23870
lzt12	164423	32485	32770	32470	32542	31272	30864
lzt13	1041576	1042749	1043586	1042984	1042836	1042747	1040033
lzt14	102400	56478	56734	56535	57516	55182	53395
lzt15	34664	13995	13996	13995	14102	13050	12723
lzt16	21504	12340	12340	12340	12517	11847	11392
lzt17	53161	23025	23167	23025	23163	22374	22028
lzt18	102400	85614	87190	85929	86374	86197	79138
lzt19	768771	359276	345974	339273	334512	337204	335912
lzt20	1179702	1043192	1011629	1004412	1004435	1002099	993442
lzt21	679936	192411	120808	120704	121908	115289	113461
lzt22	400000	361524	356885	353333	353676	353120	348347
lzt23	1048576	1040623	1038648	1034493	1039073	1038802	1035197
lzt24	3471552	2369040	1911004	1907645	1929223	1931560	1934129
lzt25	1029744	324107	324281	323513	323032	332437	332747
lzt26	262144	246334	248360	246587	247177	246667	244990
lzt27	857241	425694	386493	386056	387358	358184	353497
lzt28	1591760	437666	399105	393814	390517	390421	388712
lzt29	3953035	2230095	1563410	1554583	1553331	1537093	1519904
lzt30	100000	100394	100394	100394	100394	100394	100393
total	24700817	14773314	13273662	13212009	13246999	13173290	13053476
	raw	4-4-0 : 16	4-4-0 : 15/23	4-4-0 : encodemodWB	3-4-1 : 16/24	3-3-2 : encodemodB	3-3-2 : encdemodB

Some thoughts : going to large window helps a lot, but the exact scheme doesn't matter much.

Obviously you could do better by going to more complexity; arbitrary dividers instead of integer bit counts; different min match lens for each # of offset bytes (eg. 3 for 1 byte, 4 for 2 bytes, 5 for 3 bytes, etc.).

Probably the biggest next step (which isn't a big speed hit) would be to add a "last offset" (aka "repeat match"). Last offset is a big LHS penalty on the shit platforms, but here our decoder is so simple that we have the potential to write the whole thing in assembly and just keep the last offset in a register.

BTW I don't think I said this explicitly in the last post, but the great thing about having an optimal parser is that it's much *easier* to write an optimal parser for an arbitrary coding scheme than it is to write a well tweaked heuristic. Each time you change the coding, you would have to re-tune the heuristic, whereas with an optimal parser you just plug in the code costs and let it run. Then once you have decided on a format, you can go back and try to find a heuristic that produces a good parse without the slowness of the optimal parse.

2 comments:

Will said...: > (though you don't really get to avoid the branch, it's just moved into the looping on the run len)

How about always having a multiple of some number of match-literal pairs? Then you'd only have to check every so often.

If the point is for big windows, then the tail end where there might be some padding is no big deal.

Probably a bad idea.; September 10, 2012 at 10:45 PM
cbloom said...: Well, back in the ancient days we did things like this :

Put match/literal flag in a byte, so you put 8 single-bit flags together.

Decoder just does a switch() on the control byte with 8 flags in it

Jump to 256 copies of unrolled decoder that spits out 8 matches or literals at a time

On old chips that had tiny icaches, this didn't really hurt much (because you had very little hope of keeping much in icache anyway). (and of course instructions were relatively expensive compared to cache misses in the old days)

Nowadays I-cache is usually very large and you can often keep your whole decoder in I-cache, so doing something like this that creates a huge amount of code is a no-go. (as bad as a branch is, it's a lot better than an unpredictable I-cache miss).; September 13, 2012 at 2:44 PM

cbloom rants

9/10/2012

09-10-12 - LZ4 - Large Window

2 comments:

old rants