cbloom rants: 10-22-12 - LZ-Bytewise conclusions

10/22/2012

10-22-12 - LZ-Bytewise conclusions

Wrapping this up + index post. Previous posts in the series :

cbloom rants 09-02-12 - Encoding Values in Bytes Part 1
cbloom rants 09-02-12 - Encoding Values in Bytes Part 2
cbloom rants 09-02-12 - Encoding Values in Bytes Part 3
cbloom rants 09-04-12 - Encoding Values in Bytes Part 4
cbloom rants 09-04-12 - LZ4 Optimal Parse
cbloom rants 09-10-12 - LZ4 - Large Window
cbloom rants 09-11-12 - LZ MinMatchLen and Parse Strategies
cbloom rants 09-13-12 - LZNib
cbloom rants 09-14-12 - Things Most Compressors Leave On the Table
cbloom rants 09-15-12 - Some compression comparison charts
cbloom rants 09-23-12 - Patches and Deltas
cbloom rants 09-24-12 - LZ String Matcher Decision Tree
cbloom rants 09-28-12 - LZNib on enwik8 with Long Range Matcher
cbloom rants 09-30-12 - Long Range Matcher Notes
cbloom rants 10-02-12 - Small note on LZHAM
cbloom rants 10-04-12 - Hash-Link match finder tricks
cbloom rants 10-05-12 - OodleLZ Encoder Speed Variation with Worker Count
cbloom rants 10-07-12 - Small Notes on LZNib
cbloom rants: 10-16-12 - Two more small notes on LZNib

And some little additions :

First a correction/addendum on cbloom rants 09-04-12 - LZ4 Optimal Parse :

I wrote before that going beyond the 15 states needed to capture the LRL overflowing the control byte doesn't help much (or at all). That's true if you only go up to 20 or 30 or 200 states, but if you go all the way to 270 states, so that you capture the transition to needing another byte, there is some win to be had (LZ4P-LO-332 got lztestset to 12714031 with small optimal state set, 12492631 with large state set).

If you just do it naively, it greatly increases memory use and run time. However, I realized that there is a better way. The key is to use the fact that there are so many code-cost ties. In LZ-Bytewise with the large state set, often the coding decision in a large number of states will have the same cost, and furthermore often the end point states will all have the same cost. When this happens, you don't need to make the decision independently for each state, instead you make one decision for the entire block, and you store a decision for a range of states, instead of one for each state.

eg. to be explicit, instead of doing :


in state 20 at pos P
consider coding a literal (takes me to state 21 at pos P+1)
consider various matches (takes me to state 0 at pos P+L)
store best choice in table[P][20]

in state 21 ...

do :

in states 16-260 at pos P
consider coding a literal (takes me to states 17-261 at pos P+1 which I saw all have the same cost)
consider various matches (takes me to state 0 at pos P+L)
store in table[P] : range {16-260} makes decision X

in states 261-263 ...

so you actually can do the very large optimal parse state set with not much increase in run time or memory use.

Second : I did a more complex variant of LZ4P (large window). LZ4P-LO includes "last offset". LZ4P-LO-332 uses a 3-bit-3-bit-2-bit control word (as described previously here : cbloom rants 09-10-12 - LZ4 - Large Window ) ; the 2 bit offset reserves one value for LO and 3 values for normal offsets.

(I consider this an "LZ4" variant because (unlike LZNib) it sends LZ codes as a strictly alternating LRL-ML pairs (LRL can be zero) and the control word of LRL and ML is in one byte)

Slightly better than LZ4P-LO-332 is LZ4P-LO-695 , where the numbering has switched from bits to number of values (so 332 should be 884 for consistency). You may have noticed that 6*9*5 = 270 does not fit in a byte, but that's fixed easily by forbidding some of the possibilities. 6-9-5 = 6 values for literals, 9 for match lengths, and 5 for offsets. The 5 offsets are LO + 2 bits of normal offset. So for example one of the ways that the 270 values is reduced is because an LO match can never occur after an LRL of 0 (the previous match would have just been longer), so those combinations are removed from the control byte.

LZ4P-LO-695 is not competitive with LZNib unless you spill the excess LRL and ML (the amount that is too large to fit in the control word) to nibbles, instead of spilling to bytes as in the original LZ4 and LZ4P. Even with spilling to nibbles, it's no better than LZNib. Doing LZ4P-LO-695, I found a few bugs in LZNib, so its results also got better.

Thirdly, current numbers :

	raw	lz4	lz4p332	lz4plo695	lznib d8	zlib	OodleLZHLW
lzt00	16914	6473	6068	6012	5749	4896	4909
lzt01	200000	198900	198880	198107	198107	198199	198271
lzt02	755121	410695	292427	265490	253935	386203	174946
lzt03	3471552	1820761	1795951	1745594	1732491	1789728	1698003
lzt04	48649	16709	15584	15230	14352	11903	10679
lzt05	927796	460889	440742	420541	413894	422484	357308
lzt06	563160	493055	419768	407437	398780	446533	347495
lzt07	500000	265688	248500	240004	237120	229426	210182
lzt08	355400	331454	322959	297694	302303	277666	232863
lzt09	786488	344792	325124	313076	298340	325921	268715
lzt10	154624	15139	13299	11774	11995	12577	10274
lzt11	58524	25832	23870	22381	22219	21637	19132
lzt12	164423	33666	30864	29023	29214	27583	24101
lzt13	1041576	1042749	1040033	1039169	1009055	969636	923798
lzt14	102400	56525	53395	51328	51522	48155	46422
lzt15	34664	14062	12723	11610	11696	11464	10349
lzt16	21504	12349	11392	10881	10889	10311	9936
lzt17	53161	23141	22028	21877	20857	18518	17931
lzt18	102400	85659	79138	74459	76335	68392	59919
lzt19	768771	363217	335912	323886	299498	312257	268329
lzt20	1179702	1045179	993442	973791	955546	952365	855231
lzt21	679936	194075	113461	107860	102857	148267	83825
lzt22	400000	361733	348347	336715	331960	309569	279646
lzt23	1048576	1040701	1035197	1008638	989387	777633	798045
lzt24	3471552	2369885	1934129	1757927	1649592	2289316	1398291
lzt25	1029744	324190	332747	269047	230931	210363	96745
lzt26	262144	246465	244990	239816	239509	222808	207600
lzt27	857241	430350	353497	315394	328666	333120	223125
lzt28	1591760	445806	388712	376137	345343	335243	259488
lzt29	3953035	2235299	1519904	1451801	1424026	1805289	1132368
lzt30	100000	100394	100393	100010	100013	100020	100001
total	24700817	14815832	13053476	12442709	12096181	13077482	10327927

And comparison charts on the aggregated single file lzt99 :

Speeds are the best of 20 trials on each core; speed is the best of either x86 or x64 (usually x64 is faster). The decode times measured are slightly lower for everybody in this post (vs the last post of this type) because of the slightly more rigorous timing runs. For reference the decode speeds I measured are (mb/s) :


LZ4 :      1715.10235
LZNib :     869.1924302
OodleLZHLW: 287.2821629
zlib :      226.9286645
LZMA :       31.41397495

Also LZNib current enwik8 size : (parallel chunking (8 MB chunks) and LRM 12/12 with bubble)


LZNib enwik8 mml3 : 30719351
LZNib enwik8 stepml : 30548818

(all other LZNib results are for mml3)

22 comments:

wgarvin said...: Like sean (on the first post about optimal parsing LZ4) I didn't understand why you need different states for the literals at all. At least, not with LZ4, where an N-byte match is always representable in N bytes or less. Maybe you can explain that part.

Inspired by your posts, I hacked together my own optimal parser for LZ4 which did indeed slightly beat LZ4hc on compressed size (although its >10x slower, but I don't care for my use-case..)

Anyway, here's what I did for mine. As I scan the file backwards, I keep track of "cost of the current run of literals" up to the start of the next match.
This is incremental and needs no state. If some match at the current position is better than just using literals, it will be stored as the "best match" at this position and a new run of literals will be started which terminates at this position. I only store the "literal run" cost if there is no lower-cost match available.

I did it that way because I couldn't stomach the huge memory requirement to store separate cost/offset/etc. for all of those states. But maybe there's a scenario that I haven't thought of where I lose compression.

As far as I can tell, the same suffix is optimal for every "state". If this sounds nonsensical, maybe you could elaborate on what you use your separate states for, and what state it transitions into after a match. :P; October 23, 2012 at 1:53 PM
cbloom said...: So, with all these bytewise LRL formats, after a match you always go to state 0. That means the choice of match is independent of state, so you can just do it once. That means you only need to store the choice of match once at each position, not for each state.

However, whether or not to take a match or a literal does depend on the current state, because the current state changes the cost of the literal.

The optimum suffix encoding from any pos P to the end is dependent on pos P and also the current LRL used to arrive at pos P. That's where state comes in.

In particular, I made a text image of whether states chose a match or a literal. Most of the time they all pick either literal or match. But sometimes the choice is different per state. Specifically for LZ4 you see a lot of these patterns :

-------------++-
-------------++-
--------------+-
-------------++-
-------------++-
--------------+-
-------------++-
--------------+-
--------------+-
-------------++-

where + is match and - is literal; October 23, 2012 at 5:21 PM
cbloom said...: It does look like for LZ4 you could store a very compact parse; you only need to store decisions for {0-12,13,14,15+} . So you could store just one ML and offset and then just 4 bits for whether those state groups chose match or literal.

Also you can check your parse against mine; book1 -> 359276; October 23, 2012 at 5:27 PM
cbloom said...: ------------++++

these do occur, but they're very rare; only 5 times on book1, so excluding the possibility of state 11&12 making a different choice than state 0-10 is a small approximation.

-----------+++++

occurs only once on book1; October 23, 2012 at 5:30 PM
Anonymous said...: Why does lz4 decode in half the time of lznib?; October 23, 2012 at 10:27 PM
wgarvin said...: Interesting.. Mine produces 359279 bytes for book1. Only 3 bytes difference, which makes me wonder if it has something to do with LZ4's rules for near end of block (last 5 bytes must be literals, and last match cannot start within last 12 bytes of the output). I enforce those rules; maybe you don't, or maybe there's a bug in how I enforce them (or I might just have a bug somewhere else.. of course I do run my output through LZ4's decompressor to verify that I produced a valid compressed stream for the original data).

My LZ4 parser is quite slow for two reasons: (1) I just wrapped up your suffixarray code in a C++ class and used that for the string matching; LZ4 has a small window and the suffix array wastes a lot of time skipping matches that are out of the window. (2) I don't do any intelligent selection of match lengths; I start with the longest match length and just decrement it N times to see if the resulting cost is cheaper. I don't keep track of where overlapping matches are or anything, its just stupid brute force. with N=32 it compresses pretty well and is not too slow; the 359279 bytes on book1 above were produced with N=512 which is the slowest (and I can't imagine a scenario where using a match 512 bytes shorter than the longest available match actually improves compression). On 1 run, it took me 0.39825 secs to compress book1 (for comparison, LZ4HC took 0.06274 secs and miniz took 0.10053 secs). The biggest file I've tried it on is a 22716480-byte exe file (the setup exe from GuildWars2), LZ4HC took 1.08394 seconds and produced an output size of 11653902 (51.30153%) while my LZ4 parser takes 31.00071 seconds and produced an output size of 11610063 (51.10855%). I actually wanted to write one for an LZNib-like format after I read your description of it. I decided to write a parser for LZ4 first since it has a ready-made decompressor I could test it with. I don't quite have the LZNib compressor working yet (there's some bug I didnt get around to tracking down, its probably not in the decompressor because I kept that super simple). But anyway, it uses the same literal-counting thing I used in my LZ4 parser.. it doesn't have any separate "states" and only stores one possible suffix for each position (one total cost to end, and match offset,length which are zero if its coded as literals). Even though you have to pay the cost of the nibble to start the run of literals, I don't think anything special is needed to handle that. My code is crappy experimental code, I mostly just want to see what kind of compression is possible with these things, on the datasets I have at work (day job = game developer).; October 23, 2012 at 11:19 PM
wgarvin said...: nothings said...

"Why does lz4 decode in half the time of lznib?"

There's two reasons for this.

(1) LZNib is more branchy than LZ4, it has a branch to decide between literals and match while LZ4 does not. Also it needs extra length bytes slightly more often (making those branch(es) more unpredictable than they are in LZ4) and if I interpreted charles' posts correctly, in LZNib his match offset is also coded in a variable-size way which involves a branch there.

(2) Cache misses. Because LZ4 uses a 64K window, matches being copied are almost always still in L1 cache. If not, then certainly they're in L2. With LZNib a match might be copied from a couple megabytes earlier in the text; not in L1 anyway. Unless its really long, the x86 auto-prefetching hardware doesn't have time to get going.; October 23, 2012 at 11:23 PM
wgarvin said...: Going back to the literal-counting / states question:

In my LZ4 parser, I keep a count of the exact cost of the run of literals, no matter how long it is: I have a counter which adds 1 byte of cost every 255 literals (and starts at the proper value so the first byte is charged on the 15th literal, IIRC). In effect, I only consider one "state" for each position, which is the number of literals until the next match (next in file order, I mean). So when I find a match at a position, I store the cost and length and offset for that match. Then I consider the previous pos, and start by assuming it will be a run of 1 literal + cost of following match. And try to beat that with some other match + cost of following stuff. If no match is found, I'll store the cost of those literals, and then consider the pos before that and now assume it is a run of 2 literals. Etc. So I think this is equivalent to only considering whichever "state" would represent a run of literals from the current pos to the next match, and not any others.

As I wrote the above paragraph I noticed, that my parser will NOT consider a longer run of literals which ends at the start of a better match (i.e. cheaper suffix, so that that encoding is cheaper overall). But I think it compensates by rarely (or never?) getting into that situation in the first place; it always starts at each position by assuming it will code literal(s) there and figuring out what that would cost to the end of the file. A match has to actually be better than that or it won't get stored.

Of course it only works because there's no entropy coder and no last offset, etc. The encoded sizes of matches and literals are so uniform that once you find the best suffix for a given position, there's almost no situation where it would make sense to code something else at that spot.

Anyway, I've been reading your blog for a long time and I'm interested in a lot of the technical topics you post about (compression, threading, etc.) So thanks for posting all of that stuff. Your recent LZ benchmarks and LZNib posts were really timely, since I'm a game developer in my day job and was just thinking about replacing our existing compressors with some more modern stuff. LZNib looks especially interesting, being so simple and fast to decompress, yet still giving kinda decent compression on large files just because of the large window. For a game that doesn't want to spend much CPU on decompression, there certainly might be a niche for that.

...Sorry about the tl,dr length of these posts.; October 23, 2012 at 11:28 PM
wgarvin said...: LZ4 formats in general, don't have the optimal subgraph property. But I think LZ4 almost has this property.

At least I'm struggling to come up with a case where pretending it does actually gives worse compression. (I think LZNib almost has this property too, at least until you start adding last offset in there..)

So my LZ4 parser is just doing a simple backwards walk, dynamic programming, like the old LZSS optimal parse you described in that post from 4 years ago.; October 23, 2012 at 11:48 PM
wgarvin said...: I thought of another way to summarize the difference:

If I understand correctly, when you consider a run of literals K bytes long and you consider extending it to K+1 bytes, you charge any "extra cost" of adding another length byte to the end of that run of literals (by it ending in a different state).

But since I'm scanning backwards and "extending" at the front of that run of literals, not the end.. I charge the "extra cost" of extending the length to K+1 to the first byte in the new run of literals. It works for LZ4 because the coding cost of a run of literals is completely determined by its length. So when I evaluate possible matches at that position, I already know exactly what it would cost to just use literals instead.; October 24, 2012 at 6:24 AM
cbloom said...: @Unknown - first the small points :

1. Yeah, suffixarray is not great for windowed matching. If you just do 128k suffix arrays that overlap by 64k, that's not bad.

2. You're right I don't do the special end-of-file stuff that LZ4's decoder wants, so that is probably the source of the few bytes difference in size.

3. For selection of match lengths, you only need to consider reducing to ML=18 in LZ4

4. Who are you? Send me an email if you prefer.

Next, about the parsing.; October 24, 2012 at 10:20 AM
cbloom said...: About the parsing -

I think you are right, but I'm not entirely sure yet, needs a bit more thought.

I believe that in LZ4, any time the optimal subgraph is different, the code cost is a tie, so it doesn't matter.

That is, if literals cost (1 + epsilon) bytes instead of exactly 1 byte, then it would *not* have the optimal subgraph property.

Also, if you break the tie in a strict way (eg. always prefer matches in case of tie), then you would conclude that LZ4 does *not* have state-independent optimal subgraphs (that is, the chosen subgraph with strict tie breaking would be different depending on preceding LRL).

But if all you care about is total cost to code, then I think you are right.

To be quite concrete, here's an explicit example :

Say at some point in the file, pos P :

You find a match of length 4
After 4 bytes, there is a match of length 16
After 20 bytes, EOF

So you can send {M4,M16} in 3+3 = 6 bytes

Alternatively, you can send 1 literal then a match of length 19 to get to EOF {L1,M19} in 1+4 = 5 bytes

So {L1,M19} looks like a strictly better way to code the current subgraph.

However, if you are preceded by 14 literals, then the cost of that subgraph changes to 6 bytes. So you had :

{M4,M16} = 6
{L1,M19} = 5

but

{L14,M4,M16} = 20
{L15,M19} = 20

the cost has changed and L1,M19 is no longer strictly the best way to encode that subgraph -

*However* the change has only made it into a tie, so you didn't actually make the wrong choice by assuming that it was best.

I can't come up with a case where assuming an optimal subgraph actually makes you take the wrong decision, it seems to only change strictly better choices into ties, which is benign.; October 24, 2012 at 10:29 AM
wgarvin said...: Yeah, I was thinking only about the output sizes and not about the actual subgraphs. And thats a good concrete example. I also can't think of a case where it makes me take a sub-optimal decision.

I do sometimes get sub-optimal results if I don't consider enough match lengths though. Even if I try up to 256 different lengths (max, max-1, max-2...) I still get a few bytes less compression on large files.

I guess there's some edge cases to do with adding extra length bytes around every 255 bytes of length... careful tweaking of two overlapping long matches might be able to save a byte there, occasionally. If I try up to 512 possible match lengths I don't see any difference in compressed size vs. just trying all possible lengths for the match (except that is disasterously slow for really long matches). There's probably a much smarter way to choose candidate lengths (I read some of your posts about that) but the speed hit even from comparing 512 lengths seemed small, compared to just the time spent in the string matcher to find the match in the first place, so I decided to ignore that for now. I mean, it only finds the match once and then considers all the pre-computed costs-to-end for each possible length.

BTW, the string matcher I'm currently using is basically your "SuffixArray3" with your interval-tree code in it, which I got from one of your old posts. I tried "SuffixArray1" also, just for kicks--and I tried replacing divsufsort with Yuta Mori's SAIS, but it was slower. The dates are confusing--divsufsort seems to be older, much more complicated, and just about the best possible thing for computing a suffix array. The world needs more people like Yuta Mori, divsufsort is awesome.

btw I emailed you, from this address (wylie.garvin_at_gmail.com).; October 24, 2012 at 11:23 AM
cbloom said...: "I do sometimes get sub-optimal results if I don't consider enough match lengths though"

Yeah I didn't mean that only considering 18 is optimal, but the loss is quite small and the speed gain is quite large. On my "lztestset" the size difference is 14773304 vs 14773045 (259 bytes)

So I have implemented your parser and can confirm it produces the exact same output sizes, and is much faster. Thanks!

(I can also confirm that divsufsort is awesome; really superb piece of code; maybe the best single piece of free code I have ever seen in my life, simple interface, it just works, and it's actually world-class)

One note which may not have been obvious to all from the preceding discussion :

In Wylie's parser you must break ties in favor of matches. The reason is that matches will never increase in cost due to preceding choices, but literals can increase in cost.; October 24, 2012 at 11:44 AM
wgarvin said...: This comment has been removed by the author.; October 24, 2012 at 2:52 PM
wgarvin said...: So I have implemented your parser and can confirm it produces the exact same output sizes, and is much faster. Thanks!"

That's good news. I hope it will work for your LZNib parser too.

[I told blogger my name, yay, I'm no longer "Unknown".]; October 24, 2012 at 3:08 PM
wgarvin said...: This comment has been removed by the author.; October 24, 2012 at 4:18 PM
wgarvin said...: Ahhh, I think I get it now! You're right.

Example:

At a position P, I might look at my two choices and say "literals or match, the cost is the same" and pick either one. Say I pick literals.

But then at position P-3, suppose I have the option of a match which is 4 bytes more expensive than the match I considered at position P. But because I didn't choose the match at position P (where the score was tied) I'm still thinking of my current run of literals as being length 3+N, rather than just length 3. And the cost of that run of 3+N literals can be strictly larger, so I end up with a sub-optimal parse! The best option at P-3 is really "code 3 literals, then at pos P code that match that looked like it was an equally good suffix". But I might not discover that unless I break ties between literal and match, by always choosing the match.

And I just checked and my code does not do this properly, so that makes it an "almost-optimal" parser :P I just have to fix the bug. Thanks.; October 24, 2012 at 4:32 PM
wgarvin said...: After fixing the bug, I also now get book1 -> 359276 bytes. (and book2 -> 235311, and pic -> 66127).; October 24, 2012 at 4:57 PM
wgarvin said...: On my setup exe from GuildWars2, it now takes about 31.5 seconds and compressed size is now 11607793 (51.09855% of original size, slightly better than before).; October 24, 2012 at 5:07 PM
cbloom said...: For the historical record, responding to one of your deleted comments :

I actually disagree with you about what the "clever part" is.

Incrementally counting the cost to send the LRL code is not really significant.

When you inc the LRL, if you do it naively you would write

running_cost += LRL_Cost(LRL+1) - LRL_Cost(LRL)

which, sure, is not maximally efficient, but is not a major issue, and it's easy to change that one line to instead do incremental cost.

(the fastest way is to start lrl_counter at -15 and inc cost when it hits zero, then change lrl_counter to -255)

The clever part is the fact that it's okay to use the optimal subgraph from any pos P to the end of the file (if you break ties for matches) regardless of arrival state/lrl ; eg. ignoring what might precede pos P.

The fact that that is okay is/was not at all obvious (to me) and is really the major difference.; October 26, 2012 at 10:49 AM
wgarvin said...: I wouldn't say it was obvious to me either. I think I just assumed it was true after reading your old page about LZSS optimal parse. I said to myself "that sounds like it will work for LZ4, for more or less the same reasons", but I must have skipped some mental steps in there. LZSS does not have the varying cost per byte of literals that LZ4 has, which adds a wrinkle to things. I completely missed that bit.

Turns out the "prefer matches over literals" rule does make it size-optimal, but you're right that this is not exactly obvious... its pretty cool that it actually works!; October 26, 2012 at 2:36 PM

cbloom rants

10/22/2012

10-22-12 - LZ-Bytewise conclusions

22 comments:

old rants