cbloom rants: 09-04-12 - LZ4 Optimal Parse

I wrote an optimal parser for LZ4 to have as a reference point.

It's a very easy format to optimal parse, because the state space is small enough that you can walk the entire dynamic programming table. That is, you just make a table which is :


          <------ all file positions ------>
^       XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
|       XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
states  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
|       XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 
V       XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

and then you just go fill all the slots. Like LZSS (Storer-Szymanski) it's simplest to walk backwards, that way any value in the table that you need from later positions is already computed.

(* see addendum at end)

For LZ4 the state is just the literal run len (there is no entropy coder; there is no "last offset"; and there are no carried bits between coding events - the way a match is coded is totally independent of what precedes it). I use 16 states. Whenever you code a literal, the state transition is just state++ , when you code a match the transition is always to state = 0.

There is a small approximation in my optimal parse; I don't keep individual states for literal run lens > 15. That means I do measure the cost jump when you go from 14 to 15 literals (and have to output an extra byte), but I don't measure the cost jump when you go from 15+254 to 15+255.

The optimal parse can be made very very slightly better by using 20 states or so (instead of 16). Then from state 15-20 you count the cost of sending a literal to be exactly 1 byte (no extra cost in control words or literal run len). At state 20 you count the cost to be 1 byte + (1/234) , that is you add in the amortized cost of the 1-1-1-1 code that will be used to send large literal run lengths. While this is better in theory, on my test set I don't get any win from going to more than 16 states.

Without further ado, the numbers :

	raw	greedy	lazy	optimal	XXX	lz4 -c0	lz4 -c1	lz4 -c2
lzt00	16914	6646	6529	6444		7557	6666	6489
lzt01	200000	198906	198901	198900		198922	198918	198916
lzt02	755121	412064	411107	410549		447200	412085	410705
lzt03	3471552	1827696	1821562	1815464		2024928	1827577	1820776
lzt04	48649	17521	16985	16461		21803	17540	16725
lzt05	927796	484749	462204	459700		518392	484764	460905
lzt06	563160	493815	493056	492938		508281	493838	493071
lzt07	500000	269945	266191	264112		300505	269945	265704
lzt08	355400	332459	332201	330793		335890	332476	331470
lzt09	786488	352802	346700	340317		416708	352821	344807
lzt10	154624	16517	15173	14845		21334	16537	15155
lzt11	58524	26719	26060	25749		30148	26737	25848
lzt12	164423	35168	33504	32485		60450	35187	33682
lzt13	1041576	1042758	1042750	1042749		1045665	1042774	1042765
lzt14	102400	57421	56834	56478		59832	57432	56541
lzt15	34664	14755	14055	13995		15702	14776	14078
lzt16	21504	12503	12396	12340		12908	12528	12365
lzt17	53161	24023	23450	23025		28657	24040	23157
lzt18	102400	86880	86109	85614		88745	86899	85675
lzt19	768771	381018	369110	359276		478748	381033	363233
lzt20	1179702	1051478	1047742	1043192		1073769	1051500	1045195
lzt21	679936	203405	196764	192411		244363	203410	194091
lzt22	400000	363832	362390	361524		371121	363845	361748
lzt23	1048576	1040762	1040758	1040623		1049408	1040778	1040717
lzt24	3471552	2391132	2372991	2369040		2426582	2391145	2369896
lzt25	1029744	328271	326682	324107		370278	328295	324206
lzt26	262144	246951	246876	246334		250159	246972	246481
lzt27	857241	478524	429287	425694		531932	478537	430366
lzt28	1591760	468568	455644	437666		580253	468589	445814
lzt29	3953035	2342525	2259916	2230095		2536827	2342537	2235202
lzt30	100000	100394	100394	100394		100410	100410	100410
total	24700817	15110207	14874321	14773314		16157477	15110591	14816193
	raw	greedy	lazy	optimal		lz4 -c0	lz4 -c1	lz4 -c2

greedy, lazy, optimal are mine. They all use a suffix array for string searching, and thus always find the longest possible match. Greedy just takes the longest match. Lazy considers a match at the next position also and has a very simple heuristic for preferring it or not. Optimal is the big state table described above.

Yann's lz4 -c2 is a lazy parse that seems to go 3 steps ahead with some funny heurstics that I can't quite follow; I see it definitely considers the transition threshold of matchlen from 18 to 19, and also some other stuff. It uses MMC for string matching. His heuristic parse is quite good; I actually suspect that most of the win of "optimal" over "lz4 -c2" is due to finding better matches, not from making better parse decisions.

(Yann's lz4.exe seems to also add a 16 byte header to every file)

See also previous posts on LZ and optimal parsing :

cbloom rants 10-10-08 - 7 - On LZ Optimal Parsing
cbloom rants 10-24-11 - LZ Optimal Parse with A Star Part 1
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 2
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 3
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 4
cbloom rants 01-09-12 - LZ Optimal Parse with A Star Part 5
cbloom rants 11-02-11 - StringMatchTest Release + String Match Post Index
cbloom rants 09-27-08 - 2 - LZ and ACB
cbloom rants 08-20-10 - Deobfuscating LZMA
cbloom rants 09-03-10 - LZ and Exclusions
cbloom rants 09-14-10 - A small note on structured data
cbloom rants 06-08-11 - Tech Todos

(*) ADDENDUM :

It turns out you can optimal parse LZ4 without keeping all the states, that is with just a single LZSS style backwards walk and only a 1-wide dynamic programming table. There are several subtle things that make this possible. See the comments on this post : LZ-Bytewise conclusions

5 comments:

Cyan said...: > Yann's lz4.exe seems to also add a 16 byte header to every file

That's correct.
- 4 bytes for a "magic number"
- 8 bytes for option flags and original size
- 4 bytes per chunk, with a single chunk being 4MB, so i guess all your samples need only one chunk.; September 5, 2012 at 6:13 AM
Cyan said...: > Yann's lz4 -c2 is a lazy parse that seems to go 3 steps ahead with some funny heurstics that I can't quite follow

The algorithm consider only overlapping matches of growing lengthes.
To keep it simple, i restricted the length of the serie to 3. As long as the serie of growing lengthes is <=3, i think it can be proven that the parsing is optimal.

But beyond that limit, the algorithm makes a wild guess, encodes one match, and move along.
I tried to explain all this in an old post :
http://fastcompression.blogspot.fr/2011/12/advanced-parsing-strategies.html; September 5, 2012 at 6:22 AM
Anonymous said...: Why is the literal run length a state?; September 5, 2012 at 8:00 AM
Cyan said...: Hi Charles

Do you think it could be a good idea to share your Optimal Parsing algorithm for LZ4 as a part of the open-source package ?
Since it achieves higher compression ratio than HC, i believe there may be some programmers which will be interested.

Regards; September 21, 2012 at 12:19 PM
cbloom said...: Yeah, I'll email you.; September 21, 2012 at 3:08 PM

cbloom rants

9/04/2012

09-04-12 - LZ4 Optimal Parse

5 comments:

old rants