It's a very easy format to optimal parse, because the state space is small enough that you can walk the entire
dynamic programming table. That is, you just make a table which is :
<------ all file positions ------>
^ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
| XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
states XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
| XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
V XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
and then you just go fill all the slots. Like LZSS (Storer-Szymanski) it's simplest to walk backwards, that way
any value in the table that you need from later positions is already computed.
(* see addendum at end)
For LZ4 the state is just the literal run len (there is no entropy coder; there is no "last offset"; and there are no carried bits between coding events - the way a match is coded is totally independent of what precedes it). I use 16 states. Whenever you code a literal, the state transition is just state++ , when you code a match the transition is always to state = 0.
There is a small approximation in my optimal parse; I don't keep individual states for literal run lens > 15. That means I do measure the cost jump when you go from 14 to 15 literals (and have to output an extra byte), but I don't measure the cost jump when you go from 15+254 to 15+255.
The optimal parse can be made very very slightly better by using 20 states or so (instead of 16). Then from state 15-20 you count the cost of sending a literal to be exactly 1 byte (no extra cost in control words or literal run len). At state 20 you count the cost to be 1 byte + (1/234) , that is you add in the amortized cost of the 1-1-1-1 code that will be used to send large literal run lengths. While this is better in theory, on my test set I don't get any win from going to more than 16 states.
Without further ado, the numbers :
raw | greedy | lazy | optimal | XXX | lz4 -c0 | lz4 -c1 | lz4 -c2 | |
lzt00 | 16914 | 6646 | 6529 | 6444 | 7557 | 6666 | 6489 | |
lzt01 | 200000 | 198906 | 198901 | 198900 | 198922 | 198918 | 198916 | |
lzt02 | 755121 | 412064 | 411107 | 410549 | 447200 | 412085 | 410705 | |
lzt03 | 3471552 | 1827696 | 1821562 | 1815464 | 2024928 | 1827577 | 1820776 | |
lzt04 | 48649 | 17521 | 16985 | 16461 | 21803 | 17540 | 16725 | |
lzt05 | 927796 | 484749 | 462204 | 459700 | 518392 | 484764 | 460905 | |
lzt06 | 563160 | 493815 | 493056 | 492938 | 508281 | 493838 | 493071 | |
lzt07 | 500000 | 269945 | 266191 | 264112 | 300505 | 269945 | 265704 | |
lzt08 | 355400 | 332459 | 332201 | 330793 | 335890 | 332476 | 331470 | |
lzt09 | 786488 | 352802 | 346700 | 340317 | 416708 | 352821 | 344807 | |
lzt10 | 154624 | 16517 | 15173 | 14845 | 21334 | 16537 | 15155 | |
lzt11 | 58524 | 26719 | 26060 | 25749 | 30148 | 26737 | 25848 | |
lzt12 | 164423 | 35168 | 33504 | 32485 | 60450 | 35187 | 33682 | |
lzt13 | 1041576 | 1042758 | 1042750 | 1042749 | 1045665 | 1042774 | 1042765 | |
lzt14 | 102400 | 57421 | 56834 | 56478 | 59832 | 57432 | 56541 | |
lzt15 | 34664 | 14755 | 14055 | 13995 | 15702 | 14776 | 14078 | |
lzt16 | 21504 | 12503 | 12396 | 12340 | 12908 | 12528 | 12365 | |
lzt17 | 53161 | 24023 | 23450 | 23025 | 28657 | 24040 | 23157 | |
lzt18 | 102400 | 86880 | 86109 | 85614 | 88745 | 86899 | 85675 | |
lzt19 | 768771 | 381018 | 369110 | 359276 | 478748 | 381033 | 363233 | |
lzt20 | 1179702 | 1051478 | 1047742 | 1043192 | 1073769 | 1051500 | 1045195 | |
lzt21 | 679936 | 203405 | 196764 | 192411 | 244363 | 203410 | 194091 | |
lzt22 | 400000 | 363832 | 362390 | 361524 | 371121 | 363845 | 361748 | |
lzt23 | 1048576 | 1040762 | 1040758 | 1040623 | 1049408 | 1040778 | 1040717 | |
lzt24 | 3471552 | 2391132 | 2372991 | 2369040 | 2426582 | 2391145 | 2369896 | |
lzt25 | 1029744 | 328271 | 326682 | 324107 | 370278 | 328295 | 324206 | |
lzt26 | 262144 | 246951 | 246876 | 246334 | 250159 | 246972 | 246481 | |
lzt27 | 857241 | 478524 | 429287 | 425694 | 531932 | 478537 | 430366 | |
lzt28 | 1591760 | 468568 | 455644 | 437666 | 580253 | 468589 | 445814 | |
lzt29 | 3953035 | 2342525 | 2259916 | 2230095 | 2536827 | 2342537 | 2235202 | |
lzt30 | 100000 | 100394 | 100394 | 100394 | 100410 | 100410 | 100410 | |
total | 24700817 | 15110207 | 14874321 | 14773314 | 16157477 | 15110591 | 14816193 | |
raw | greedy | lazy | optimal | lz4 -c0 | lz4 -c1 | lz4 -c2 |
greedy, lazy, optimal are mine. They all use a suffix array for string searching, and thus always find the longest possible match. Greedy just takes the longest match. Lazy considers a match at the next position also and has a very simple heuristic for preferring it or not. Optimal is the big state table described above.
Yann's lz4 -c2 is a lazy parse that seems to go 3 steps ahead with some funny heurstics that I can't quite follow; I see it definitely considers the transition threshold of matchlen from 18 to 19, and also some other stuff. It uses MMC for string matching. His heuristic parse is quite good; I actually suspect that most of the win of "optimal" over "lz4 -c2" is due to finding better matches, not from making better parse decisions.
(Yann's lz4.exe seems to also add a 16 byte header to every file)
See also previous posts on LZ and optimal parsing :
cbloom rants 10-10-08 - 7 - On LZ Optimal Parsing
cbloom rants 10-24-11 - LZ Optimal Parse with A Star Part 1
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 2
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 3
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 4
cbloom rants 01-09-12 - LZ Optimal Parse with A Star Part 5
cbloom rants 11-02-11 - StringMatchTest Release + String Match Post Index
cbloom rants 09-27-08 - 2 - LZ and ACB
cbloom rants 08-20-10 - Deobfuscating LZMA
cbloom rants 09-03-10 - LZ and Exclusions
cbloom rants 09-14-10 - A small note on structured data
cbloom rants 06-08-11 - Tech Todos
(*) ADDENDUM :
It turns out you can optimal parse LZ4 without keeping all the states, that is with just a single LZSS style backwards walk and only a 1-wide dynamic programming table. There are several subtle things that make this possible. See the comments on this post : LZ-Bytewise conclusions
> Yann's lz4.exe seems to also add a 16 byte header to every file
ReplyDeleteThat's correct.
- 4 bytes for a "magic number"
- 8 bytes for option flags and original size
- 4 bytes per chunk, with a single chunk being 4MB, so i guess all your samples need only one chunk.
> Yann's lz4 -c2 is a lazy parse that seems to go 3 steps ahead with some funny heurstics that I can't quite follow
ReplyDeleteThe algorithm consider only overlapping matches of growing lengthes.
To keep it simple, i restricted the length of the serie to 3. As long as the serie of growing lengthes is <=3, i think it can be proven that the parsing is optimal.
But beyond that limit, the algorithm makes a wild guess, encodes one match, and move along.
I tried to explain all this in an old post :
http://fastcompression.blogspot.fr/2011/12/advanced-parsing-strategies.html
Why is the literal run length a state?
ReplyDeleteHi Charles
ReplyDeleteDo you think it could be a good idea to share your Optimal Parsing algorithm for LZ4 as a part of the open-source package ?
Since it achieves higher compression ratio than HC, i believe there may be some programmers which will be interested.
Regards
Yeah, I'll email you.
ReplyDelete