It's a very easy format to optimal parse, because the state space is small enough that you can walk the entire
dynamic programming table. That is, you just make a table which is :
 states  <------ all file positions ------>
   |
   v
and then you just go fill all the slots. Like LZSS (Storer-Szymanski) it's simplest to walk backwards, that way
any value in the table that you need from later positions is already computed.
(* see addendum at end)
For LZ4 the state is just the literal run len (there is no entropy coder; there is no "last offset"; and there are no carried bits between coding events - the way a match is coded is totally independent of what precedes it). I use 16 states. Whenever you code a literal, the state transition is just state++ , when you code a match the transition is always to state = 0.
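A minimal sketch of that state machine (the names and the 16-state cap constant are mine, not code from any actual implementation) :

```c
#include <assert.h>

/* The parse state is just the pending literal run length, clamped. */
#define NUM_STATES 16

static int next_state_literal(int state)
{
    /* coding a literal : state++ , saturating at the last state */
    return (state < NUM_STATES - 1) ? state + 1 : NUM_STATES - 1;
}

static int next_state_match(int state)
{
    (void)state; /* coding a match always resets the literal run */
    return 0;
}
```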
There is a small approximation in my optimal parse; I don't keep individual states for literal run lens > 15. That means I do measure the cost jump when you go from 14 to 15 literals (and have to output an extra byte), but I don't measure the cost jump when you go from 15+254 to 15+255.
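Concretely, those cost jumps come from the LZ4 run-length coding; a little helper (name is mine) that counts the extension bytes for a literal run :

```c
#include <assert.h>

/* Bytes needed to code a literal run of length run_len in LZ4, beyond
   the 4-bit field in the token.  Runs of 0..14 fit in the nibble;
   15+ sets the nibble to 15 and appends extension bytes, where a
   byte value of 255 means "keep going". */
static int litrun_extra_bytes(int run_len)
{
    if (run_len < 15)
        return 0;                  /* fits in the token nibble */
    return 1 + (run_len - 15) / 255;
}
```

So the 14 to 15 jump costs a byte (the first extension byte appears), and the next jump is at 15+255 = 270 literals; that later jump is what the 16-state approximation ignores.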
The optimal parse can be made very very slightly better by using 20 states or so (instead of 16). Then from state 15-20 you count the cost of sending a literal to be exactly 1 byte (no extra cost in control words or literal run len). At state 20 you count the cost to be 1 byte + (1/255) , that is you add in the amortized cost of the 1-1-1-1 code that will be used to send large literal run lengths (one extension byte per 255 literals). While this is better in theory, on my test set I don't get any win from going to more than 16 states.
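Putting the pieces together, the backwards table walk looks roughly like this (pseudocode; the cost names and the match-candidate interface are my own sketch, not code from the post) :

```
// cost[pos][state] = fewest bytes to code file[pos..end) when 'state'
// literals are already pending.  Filled back-to-front, so every
// cost[pos+k][*] that we read has already been computed.
for pos = file_len - 1 down to 0 :
    for state = 0 to NUM_STATES - 1 :
        // option 1 : code file[pos] as a literal
        best = literal_cost(state) + cost[pos + 1][min(state + 1, NUM_STATES - 1)]
        // option 2 : code any match starting at pos
        for each candidate match of length len at pos :
            best = min(best, match_cost(len) + cost[pos + len][0])
        cost[pos][state] = best
```

literal_cost(state) is where the run-length jumps get charged (1 byte normally, 2 bytes at the 14 to 15 transition), and match_cost(len) covers the token, the 2 offset bytes, and the matchlen extension bytes.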
Without further ado, the numbers :
|raw||greedy||lazy||optimal||lz4 -c0||lz4 -c1||lz4 -c2|
greedy, lazy, optimal are mine. They all use a suffix array for string searching, and thus always find the longest possible match. Greedy just takes the longest match. Lazy considers a match at the next position also and has a very simple heuristic for preferring it or not. Optimal is the big state table described above.
Yann's lz4 -c2 is a lazy parse that seems to go 3 steps ahead with some funny heuristics that I can't quite follow; I see it definitely considers the transition threshold of matchlen from 18 to 19, and also some other stuff. It uses MMC for string matching. His heuristic parse is quite good; I actually suspect that most of the win of "optimal" over "lz4 -c2" comes from finding better matches, not from making better parse decisions.
(Yann's lz4.exe seems to also add a 16 byte header to every file)
See also previous posts on LZ and optimal parsing :
cbloom rants 10-10-08 - 7 - On LZ Optimal Parsing
cbloom rants 10-24-11 - LZ Optimal Parse with A Star Part 1
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 2
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 3
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 4
cbloom rants 01-09-12 - LZ Optimal Parse with A Star Part 5
cbloom rants 11-02-11 - StringMatchTest Release + String Match Post Index
cbloom rants 09-27-08 - 2 - LZ and ACB
cbloom rants 08-20-10 - Deobfuscating LZMA
cbloom rants 09-03-10 - LZ and Exclusions
cbloom rants 09-14-10 - A small note on structured data
cbloom rants 06-08-11 - Tech Todos
(*) ADDENDUM :
It turns out you can optimal parse LZ4 without keeping all the states, that is with just a single LZSS style backwards walk and only a 1-wide dynamic programming table. There are several subtle things that make this possible. See the comments on this post : LZ-Bytewise conclusions