9/04/2012

09-04-12 - LZ4 Optimal Parse

I wrote an optimal parser for LZ4 to have as a reference point.

It's a very easy format to optimal parse, because the state space is small enough that you can walk the entire dynamic programming table. That is, you just make a table which is :


          <------ all file positions ------>
^       XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
|       XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
states  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
|       XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 
V       XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 

and then you just go fill all the slots. Like LZSS (Storer-Szymanski) it's simplest to walk backwards, that way any value in the table that you need from later positions is already computed.

(* see addendum at end)

For LZ4 the state is just the literal run len (there is no entropy coder; there is no "last offset"; and there are no carried bits between coding events - the way a match is coded is totally independent of what precedes it). I use 16 states. Whenever you code a literal, the state transition is just state++ , when you code a match the transition is always to state = 0.

There is a small approximation in my optimal parse; I don't keep individual states for literal run lens > 15. That means I do measure the cost jump when you go from 14 to 15 literals (and have to output an extra byte), but I don't measure the cost jump when you go from 15+254 to 15+255.

The optimal parse can be made very very slightly better by using 20 states or so (instead of 16). Then from state 15-20 you count the cost of sending a literal to be exactly 1 byte (no extra cost in control words or literal run len). At state 20 you count the cost to be 1 byte + (1/234) , that is you add in the amortized cost of the 1-1-1-1 code that will be used to send large literal run lengths. While this is better in theory, on my test set I don't get any win from going to more than 16 states.

Without further ado, the numbers :

raw greedy lazy optimal XXX lz4 -c0 lz4 -c1 lz4 -c2
lzt00 16914 6646 6529 6444 7557 6666 6489
lzt01 200000 198906 198901 198900 198922 198918 198916
lzt02 755121 412064 411107 410549 447200 412085 410705
lzt03 3471552 1827696 1821562 1815464 2024928 1827577 1820776
lzt04 48649 17521 16985 16461 21803 17540 16725
lzt05 927796 484749 462204 459700 518392 484764 460905
lzt06 563160 493815 493056 492938 508281 493838 493071
lzt07 500000 269945 266191 264112 300505 269945 265704
lzt08 355400 332459 332201 330793 335890 332476 331470
lzt09 786488 352802 346700 340317 416708 352821 344807
lzt10 154624 16517 15173 14845 21334 16537 15155
lzt11 58524 26719 26060 25749 30148 26737 25848
lzt12 164423 35168 33504 32485 60450 35187 33682
lzt13 1041576 1042758 1042750 1042749 1045665 1042774 1042765
lzt14 102400 57421 56834 56478 59832 57432 56541
lzt15 34664 14755 14055 13995 15702 14776 14078
lzt16 21504 12503 12396 12340 12908 12528 12365
lzt17 53161 24023 23450 23025 28657 24040 23157
lzt18 102400 86880 86109 85614 88745 86899 85675
lzt19 768771 381018 369110 359276 478748 381033 363233
lzt20 1179702 1051478 1047742 1043192 1073769 1051500 1045195
lzt21 679936 203405 196764 192411 244363 203410 194091
lzt22 400000 363832 362390 361524 371121 363845 361748
lzt23 1048576 1040762 1040758 1040623 1049408 1040778 1040717
lzt24 3471552 2391132 2372991 2369040 2426582 2391145 2369896
lzt25 1029744 328271 326682 324107 370278 328295 324206
lzt26 262144 246951 246876 246334 250159 246972 246481
lzt27 857241 478524 429287 425694 531932 478537 430366
lzt28 1591760 468568 455644 437666 580253 468589 445814
lzt29 3953035 2342525 2259916 2230095 2536827 2342537 2235202
lzt30 100000 100394 100394 100394 100410 100410 100410
total 24700817 15110207 14874321 14773314 16157477 15110591 14816193
raw greedy lazy optimal lz4 -c0 lz4 -c1 lz4 -c2

greedy, lazy, optimal are mine. They all use a suffix array for string searching, and thus always find the longest possible match. Greedy just takes the longest match. Lazy considers a match at the next position also and has a very simple heuristic for preferring it or not. Optimal is the big state table described above.

Yann's lz4 -c2 is a lazy parse that seems to go 3 steps ahead with some funny heurstics that I can't quite follow; I see it definitely considers the transition threshold of matchlen from 18 to 19, and also some other stuff. It uses MMC for string matching. His heuristic parse is quite good; I actually suspect that most of the win of "optimal" over "lz4 -c2" is due to finding better matches, not from making better parse decisions.

(Yann's lz4.exe seems to also add a 16 byte header to every file)

See also previous posts on LZ and optimal parsing :

cbloom rants 10-10-08 - 7 - On LZ Optimal Parsing
cbloom rants 10-24-11 - LZ Optimal Parse with A Star Part 1
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 2
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 3
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 4
cbloom rants 01-09-12 - LZ Optimal Parse with A Star Part 5
cbloom rants 11-02-11 - StringMatchTest Release + String Match Post Index
cbloom rants 09-27-08 - 2 - LZ and ACB
cbloom rants 08-20-10 - Deobfuscating LZMA
cbloom rants 09-03-10 - LZ and Exclusions
cbloom rants 09-14-10 - A small note on structured data
cbloom rants 06-08-11 - Tech Todos

(*) ADDENDUM :

It turns out you can optimal parse LZ4 without keeping all the states, that is with just a single LZSS style backwards walk and only a 1-wide dynamic programming table. There are several subtle things that make this possible. See the comments on this post : LZ-Bytewise conclusions

5 comments:

Yann Collet said...

> Yann's lz4.exe seems to also add a 16 byte header to every file

That's correct.
- 4 bytes for a "magic number"
- 8 bytes for option flags and original size
- 4 bytes per chunk, with a single chunk being 4MB, so i guess all your samples need only one chunk.

Yann Collet said...

> Yann's lz4 -c2 is a lazy parse that seems to go 3 steps ahead with some funny heurstics that I can't quite follow

The algorithm consider only overlapping matches of growing lengthes.
To keep it simple, i restricted the length of the serie to 3. As long as the serie of growing lengthes is <=3, i think it can be proven that the parsing is optimal.

But beyond that limit, the algorithm makes a wild guess, encodes one match, and move along.
I tried to explain all this in an old post :
http://fastcompression.blogspot.fr/2011/12/advanced-parsing-strategies.html

nothings said...

Why is the literal run length a state?

Yann Collet said...

Hi Charles

Do you think it could be a good idea to share your Optimal Parsing algorithm for LZ4 as a part of the open-source package ?
Since it achieves higher compression ratio than HC, i believe there may be some programmers which will be interested.

Regards

cbloom said...

Yeah, I'll email you.

old rants