cbloom rants

10/22/2012

10-22-12 - LZ-Bytewise conclusions

Wrapping this up + index post. Previous posts in the series :

And some little additions :

First a correction/addendum on cbloom rants 09-04-12 - LZ4 Optimal Parse :

I wrote before that going beyond the 15 states needed to capture the LRL overflowing the control byte doesn't help much (or at all). That's true if you only go up to 20 or 30 or 200 states, but if you go all the way to 270 states, so that you capture the transition to needing another byte, there is some win to be had (LZ4P-LO-332 got lztestset to 12714031 with small optimal state set, 12492631 with large state set).

If you just do it naively, it greatly increases memory use and run time. However, I realized that there is a better way. The key is to use the fact that there are so many code-cost ties. In LZ-Bytewise with the large state set, often the coding decision in a large number of states will have the same cost, and furthermore often the end point states will all have the same cost. When this happens, you don't need to make the decision independently for each state, instead you make one decision for the entire block, and you store a decision for a range of states, instead of one for each state.

eg. to be explicit, instead of doing :

```
in state 20 at pos P
consider coding a literal (takes me to state 21 at pos P+1)
consider various matches (takes me to state 0 at pos P+L)
store best choice in table[P][20]

in state 21 ...

do :

in states 16-260 at pos P
consider coding a literal (takes me to states 17-261 at pos P+1 which I saw all have the same cost)
consider various matches (takes me to state 0 at pos P+L)
store in table[P] : range {16-260} makes decision X

in states 261-263 ...

```
so you actually can do the very large optimal parse state set with not much increase in run time or memory use.

Second : I did a more complex variant of LZ4P (large window). LZ4P-LO includes "last offset". LZ4P-LO-332 uses a 3-bit-3-bit-2-bit control word (as described previously here : cbloom rants 09-10-12 - LZ4 - Large Window ) ; the 2 bit offset reserves one value for LO and 3 values for normal offsets.

(I consider this an "LZ4" variant because (unlike LZNib) it sends LZ codes as a strictly alternating LRL-ML pairs (LRL can be zero) and the control word of LRL and ML is in one byte)

Slightly better than LZ4P-LO-332 is LZ4P-LO-695 , where the numbering has switched from bits to number of values (so 332 should be 884 for consistency). You may have noticed that 6*9*5 = 270 does not fit in a byte, but that's fixed easily by forbidding some of the possibilities. 6-9-5 = 6 values for literals, 9 for match lengths, and 5 for offsets. The 5 offsets are LO + 2 bits of normal offset. So for example one of the ways that the 270 values is reduced is because an LO match can never occur after an LRL of 0 (the previous match would have just been longer), so those combinations are removed from the control byte.

LZ4P-LO-695 is not competitive with LZNib unless you spill the excess LRL and ML (the amount that is too large to fit in the control word) to nibbles, instead of spilling to bytes as in the original LZ4 and LZ4P. Even with spilling to nibbles, it's no better than LZNib. Doing LZ4P-LO-695, I found a few bugs in LZNib, so its results also got better.

Thirdly, current numbers :

 raw lz4 lz4p332 lz4plo695 lznib d8 zlib OodleLZHLW lzt00 16914 6473 6068 6012 5749 4896 4909 lzt01 200000 198900 198880 198107 198107 198199 198271 lzt02 755121 410695 292427 265490 253935 386203 174946 lzt03 3471552 1820761 1795951 1745594 1732491 1789728 1698003 lzt04 48649 16709 15584 15230 14352 11903 10679 lzt05 927796 460889 440742 420541 413894 422484 357308 lzt06 563160 493055 419768 407437 398780 446533 347495 lzt07 500000 265688 248500 240004 237120 229426 210182 lzt08 355400 331454 322959 297694 302303 277666 232863 lzt09 786488 344792 325124 313076 298340 325921 268715 lzt10 154624 15139 13299 11774 11995 12577 10274 lzt11 58524 25832 23870 22381 22219 21637 19132 lzt12 164423 33666 30864 29023 29214 27583 24101 lzt13 1041576 1042749 1040033 1039169 1009055 969636 923798 lzt14 102400 56525 53395 51328 51522 48155 46422 lzt15 34664 14062 12723 11610 11696 11464 10349 lzt16 21504 12349 11392 10881 10889 10311 9936 lzt17 53161 23141 22028 21877 20857 18518 17931 lzt18 102400 85659 79138 74459 76335 68392 59919 lzt19 768771 363217 335912 323886 299498 312257 268329 lzt20 1179702 1045179 993442 973791 955546 952365 855231 lzt21 679936 194075 113461 107860 102857 148267 83825 lzt22 400000 361733 348347 336715 331960 309569 279646 lzt23 1048576 1040701 1035197 1008638 989387 777633 798045 lzt24 3471552 2369885 1934129 1757927 1649592 2289316 1398291 lzt25 1029744 324190 332747 269047 230931 210363 96745 lzt26 262144 246465 244990 239816 239509 222808 207600 lzt27 857241 430350 353497 315394 328666 333120 223125 lzt28 1591760 445806 388712 376137 345343 335243 259488 lzt29 3953035 2235299 1519904 1451801 1424026 1805289 1132368 lzt30 100000 100394 100393 100010 100013 100020 100001 total 24700817 14815832 13053476 12442709 12096181 13077482 10327927

And comparison charts on the aggregated single file lzt99 :

Speeds are the best of 20 trials on each core; speed is the best of either x86 or x64 (usually x64 is faster). The decode times measured are slightly lower for everybody in this post (vs the last post of this type) because of the slightly more rigorous timing runs. For reference the decode speeds I measured are (mb/s) :

```
LZ4 :      1715.10235
LZNib :     869.1924302
OodleLZHLW: 287.2821629
zlib :      226.9286645
LZMA :       31.41397495

```
Also LZNib current enwik8 size : (parallel chunking (8 MB chunks) and LRM 12/12 with bubble)
```
LZNib enwik8 mml3 : 30719351
LZNib enwik8 stepml : 30548818

```
(all other LZNib results are for mml3)

wgarvin said...

Like sean (on the first post about optimal parsing LZ4) I didn't understand why you need different states for the literals at all. At least, not with LZ4, where an N-byte match is always representable in N bytes or less. Maybe you can explain that part.

Inspired by your posts, I hacked together my own optimal parser for LZ4 which did indeed slightly beat LZ4hc on compressed size (although its >10x slower, but I don't care for my use-case..)

Anyway, here's what I did for mine. As I scan the file backwards, I keep track of "cost of the current run of literals" up to the start of the next match.
This is incremental and needs no state. If some match at the current position is better than just using literals, it will be stored as the "best match" at this position and a new run of literals will be started which terminates at this position. I only store the "literal run" cost if there is no lower-cost match available.

I did it that way because I couldn't stomach the huge memory requirement to store separate cost/offset/etc. for all of those states. But maybe there's a scenario that I haven't thought of where I lose compression.

As far as I can tell, the same suffix is optimal for every "state". If this sounds nonsensical, maybe you could elaborate on what you use your separate states for, and what state it transitions into after a match. :P

cbloom said...

So, with all these bytewise LRL formats, after a match you always go to state 0. That means the choice of match is independent of state, so you can just do it once. That means you only need to store the choice of match once at each position, not for each state.

However, whether or not to take a match or a literal does depend on the current state, because the current state changes the cost of the literal.

The optimum suffix encoding from any pos P to the end is dependent on pos P and also the current LRL used to arrive at pos P. That's where state comes in.

In particular, I made a text image of whether states chose a match or a literal. Most of the time they all pick either literal or match. But sometimes the choice is different per state. Specifically for LZ4 you see a lot of these patterns :

-------------++-
-------------++-
--------------+-
-------------++-
-------------++-
--------------+-
-------------++-
--------------+-
--------------+-
-------------++-

where + is match and - is literal

cbloom said...

It does look like for LZ4 you could store a very compact parse; you only need to store decisions for {0-12,13,14,15+} . So you could store just one ML and offset and then just 4 bits for whether those state groups chose match or literal.

Also you can check your parse against mine; book1 -> 359276

cbloom said...

------------++++

these do occur, but they're very rare; only 5 times on book1, so excluding the possibility of state 11&12 making a different choice than state 0-10 is a small approximation.

-----------+++++

occurs only once on book1

Anonymous said...

Why does lz4 decode in half the time of lznib?

wgarvin said...

Interesting.. Mine produces 359279 bytes for book1. Only 3 bytes difference, which makes me wonder if it has something to do with LZ4's rules for near end of block (last 5 bytes must be literals, and last match cannot start within last 12 bytes of the output). I enforce those rules; maybe you don't, or maybe there's a bug in how I enforce them (or I might just have a bug somewhere else.. of course I do run my output through LZ4's decompressor to verify that I produced a valid compressed stream for the original data).

My LZ4 parser is quite slow for two reasons: (1) I just wrapped up your suffixarray code in a C++ class and used that for the string matching; LZ4 has a small window and the suffix array wastes a lot of time skipping matches that are out of the window. (2) I don't do any intelligent selection of match lengths; I start with the longest match length and just decrement it N times to see if the resulting cost is cheaper. I don't keep track of where overlapping matches are or anything, its just stupid brute force. with N=32 it compresses pretty well and is not too slow; the 359279 bytes on book1 above were produced with N=512 which is the slowest (and I can't imagine a scenario where using a match 512 bytes shorter than the longest available match actually improves compression). On 1 run, it took me 0.39825 secs to compress book1 (for comparison, LZ4HC took 0.06274 secs and miniz took 0.10053 secs). The biggest file I've tried it on is a 22716480-byte exe file (the setup exe from GuildWars2), LZ4HC took 1.08394 seconds and produced an output size of 11653902 (51.30153%) while my LZ4 parser takes 31.00071 seconds and produced an output size of 11610063 (51.10855%). I actually wanted to write one for an LZNib-like format after I read your description of it. I decided to write a parser for LZ4 first since it has a ready-made decompressor I could test it with. I don't quite have the LZNib compressor working yet (there's some bug I didnt get around to tracking down, its probably not in the decompressor because I kept that super simple). But anyway, it uses the same literal-counting thing I used in my LZ4 parser.. it doesn't have any separate "states" and only stores one possible suffix for each position (one total cost to end, and match offset,length which are zero if its coded as literals). Even though you have to pay the cost of the nibble to start the run of literals, I don't think anything special is needed to handle that. My code is crappy experimental code, I mostly just want to see what kind of compression is possible with these things, on the datasets I have at work (day job = game developer).

wgarvin said...

nothings said...

"Why does lz4 decode in half the time of lznib?"

There's two reasons for this.

(1) LZNib is more branchy than LZ4, it has a branch to decide between literals and match while LZ4 does not. Also it needs extra length bytes slightly more often (making those branch(es) more unpredictable than they are in LZ4) and if I interpreted charles' posts correctly, in LZNib his match offset is also coded in a variable-size way which involves a branch there.

(2) Cache misses. Because LZ4 uses a 64K window, matches being copied are almost always still in L1 cache. If not, then certainly they're in L2. With LZNib a match might be copied from a couple megabytes earlier in the text; not in L1 anyway. Unless its really long, the x86 auto-prefetching hardware doesn't have time to get going.

wgarvin said...

Going back to the literal-counting / states question:

In my LZ4 parser, I keep a count of the exact cost of the run of literals, no matter how long it is: I have a counter which adds 1 byte of cost every 255 literals (and starts at the proper value so the first byte is charged on the 15th literal, IIRC). In effect, I only consider one "state" for each position, which is the number of literals until the next match (next in file order, I mean). So when I find a match at a position, I store the cost and length and offset for that match. Then I consider the previous pos, and start by assuming it will be a run of 1 literal + cost of following match. And try to beat that with some other match + cost of following stuff. If no match is found, I'll store the cost of those literals, and then consider the pos before that and now assume it is a run of 2 literals. Etc. So I think this is equivalent to only considering whichever "state" would represent a run of literals from the current pos to the next match, and not any others.

As I wrote the above paragraph I noticed, that my parser will NOT consider a longer run of literals which ends at the start of a better match (i.e. cheaper suffix, so that that encoding is cheaper overall). But I think it compensates by rarely (or never?) getting into that situation in the first place; it always starts at each position by assuming it will code literal(s) there and figuring out what that would cost to the end of the file. A match has to actually be better than that or it won't get stored.

Of course it only works because there's no entropy coder and no last offset, etc. The encoded sizes of matches and literals are so uniform that once you find the best suffix for a given position, there's almost no situation where it would make sense to code something else at that spot.

Anyway, I've been reading your blog for a long time and I'm interested in a lot of the technical topics you post about (compression, threading, etc.) So thanks for posting all of that stuff. Your recent LZ benchmarks and LZNib posts were really timely, since I'm a game developer in my day job and was just thinking about replacing our existing compressors with some more modern stuff. LZNib looks especially interesting, being so simple and fast to decompress, yet still giving kinda decent compression on large files just because of the large window. For a game that doesn't want to spend much CPU on decompression, there certainly might be a niche for that.

...Sorry about the tl,dr length of these posts.

wgarvin said...

LZ4 formats in general, don't have the optimal subgraph property. But I think LZ4 almost has this property.

At least I'm struggling to come up with a case where pretending it does actually gives worse compression. (I think LZNib almost has this property too, at least until you start adding last offset in there..)

So my LZ4 parser is just doing a simple backwards walk, dynamic programming, like the old LZSS optimal parse you described in that post from 4 years ago.

wgarvin said...

I thought of another way to summarize the difference:

If I understand correctly, when you consider a run of literals K bytes long and you consider extending it to K+1 bytes, you charge any "extra cost" of adding another length byte to the end of that run of literals (by it ending in a different state).

But since I'm scanning backwards and "extending" at the front of that run of literals, not the end.. I charge the "extra cost" of extending the length to K+1 to the first byte in the new run of literals. It works for LZ4 because the coding cost of a run of literals is completely determined by its length. So when I evaluate possible matches at that position, I already know exactly what it would cost to just use literals instead.

cbloom said...

@Unknown - first the small points :

1. Yeah, suffixarray is not great for windowed matching. If you just do 128k suffix arrays that overlap by 64k, that's not bad.

2. You're right I don't do the special end-of-file stuff that LZ4's decoder wants, so that is probably the source of the few bytes difference in size.

3. For selection of match lengths, you only need to consider reducing to ML=18 in LZ4

4. Who are you? Send me an email if you prefer.

Next, about the parsing.

cbloom said...

About the parsing -

I think you are right, but I'm not entirely sure yet, needs a bit more thought.

I believe that in LZ4, any time the optimal subgraph is different, the code cost is a tie, so it doesn't matter.

That is, if literals cost (1 + epsilon) bytes instead of exactly 1 byte, then it would *not* have the optimal subgraph property.

Also, if you break the tie in a strict way (eg. always prefer matches in case of tie), then you would conclude that LZ4 does *not* have state-independent optimal subgraphs (that is, the chosen subgraph with strict tie breaking would be different depending on preceding LRL).

But if all you care about is total cost to code, then I think you are right.

To be quite concrete, here's an explicit example :

Say at some point in the file, pos P :

You find a match of length 4
After 4 bytes, there is a match of length 16
After 20 bytes, EOF

So you can send {M4,M16} in 3+3 = 6 bytes

Alternatively, you can send 1 literal then a match of length 19 to get to EOF {L1,M19} in 1+4 = 5 bytes

So {L1,M19} looks like a strictly better way to code the current subgraph.

However, if you are preceded by 14 literals, then the cost of that subgraph changes to 6 bytes. So you had :

{M4,M16} = 6
{L1,M19} = 5

but

{L14,M4,M16} = 20
{L15,M19} = 20

the cost has changed and L1,M19 is no longer strictly the best way to encode that subgraph -

*However* the change has only made it into a tie, so you didn't actually make the wrong choice by assuming that it was best.

I can't come up with a case where assuming an optimal subgraph actually makes you take the wrong decision, it seems to only change strictly better choices into ties, which is benign.

wgarvin said...

Yeah, I was thinking only about the output sizes and not about the actual subgraphs. And thats a good concrete example. I also can't think of a case where it makes me take a sub-optimal decision.

I do sometimes get sub-optimal results if I don't consider enough match lengths though. Even if I try up to 256 different lengths (max, max-1, max-2...) I still get a few bytes less compression on large files.

I guess there's some edge cases to do with adding extra length bytes around every 255 bytes of length... careful tweaking of two overlapping long matches might be able to save a byte there, occasionally. If I try up to 512 possible match lengths I don't see any difference in compressed size vs. just trying all possible lengths for the match (except that is disasterously slow for really long matches). There's probably a much smarter way to choose candidate lengths (I read some of your posts about that) but the speed hit even from comparing 512 lengths seemed small, compared to just the time spent in the string matcher to find the match in the first place, so I decided to ignore that for now. I mean, it only finds the match once and then considers all the pre-computed costs-to-end for each possible length.

BTW, the string matcher I'm currently using is basically your "SuffixArray3" with your interval-tree code in it, which I got from one of your old posts. I tried "SuffixArray1" also, just for kicks--and I tried replacing divsufsort with Yuta Mori's SAIS, but it was slower. The dates are confusing--divsufsort seems to be older, much more complicated, and just about the best possible thing for computing a suffix array. The world needs more people like Yuta Mori, divsufsort is awesome.

btw I emailed you, from this address (wylie.garvin_at_gmail.com).

cbloom said...

"I do sometimes get sub-optimal results if I don't consider enough match lengths though"

Yeah I didn't mean that only considering 18 is optimal, but the loss is quite small and the speed gain is quite large. On my "lztestset" the size difference is 14773304 vs 14773045 (259 bytes)

So I have implemented your parser and can confirm it produces the exact same output sizes, and is much faster. Thanks!

(I can also confirm that divsufsort is awesome; really superb piece of code; maybe the best single piece of free code I have ever seen in my life, simple interface, it just works, and it's actually world-class)

One note which may not have been obvious to all from the preceding discussion :

In Wylie's parser you must break ties in favor of matches. The reason is that matches will never increase in cost due to preceding choices, but literals can increase in cost.

wgarvin said...
This comment has been removed by the author.
wgarvin said...

So I have implemented your parser and can confirm it produces the exact same output sizes, and is much faster. Thanks!"

That's good news. I hope it will work for your LZNib parser too.

[I told blogger my name, yay, I'm no longer "Unknown".]

wgarvin said...
This comment has been removed by the author.
wgarvin said...

Ahhh, I think I get it now! You're right.

Example:

At a position P, I might look at my two choices and say "literals or match, the cost is the same" and pick either one. Say I pick literals.

But then at position P-3, suppose I have the option of a match which is 4 bytes more expensive than the match I considered at position P. But because I didn't choose the match at position P (where the score was tied) I'm still thinking of my current run of literals as being length 3+N, rather than just length 3. And the cost of that run of 3+N literals can be strictly larger, so I end up with a sub-optimal parse! The best option at P-3 is really "code 3 literals, then at pos P code that match that looked like it was an equally good suffix". But I might not discover that unless I break ties between literal and match, by always choosing the match.

And I just checked and my code does not do this properly, so that makes it an "almost-optimal" parser :P I just have to fix the bug. Thanks.

wgarvin said...

After fixing the bug, I also now get book1 -> 359276 bytes. (and book2 -> 235311, and pic -> 66127).

wgarvin said...

On my setup exe from GuildWars2, it now takes about 31.5 seconds and compressed size is now 11607793 (51.09855% of original size, slightly better than before).

cbloom said...

For the historical record, responding to one of your deleted comments :

I actually disagree with you about what the "clever part" is.

Incrementally counting the cost to send the LRL code is not really significant.

When you inc the LRL, if you do it naively you would write

running_cost += LRL_Cost(LRL+1) - LRL_Cost(LRL)

which, sure, is not maximally efficient, but is not a major issue, and it's easy to change that one line to instead do incremental cost.

(the fastest way is to start lrl_counter at -15 and inc cost when it hits zero, then change lrl_counter to -255)

The clever part is the fact that it's okay to use the optimal subgraph from any pos P to the end of the file (if you break ties for matches) regardless of arrival state/lrl ; eg. ignoring what might precede pos P.

The fact that that is okay is/was not at all obvious (to me) and is really the major difference.

wgarvin said...

I wouldn't say it was obvious to me either. I think I just assumed it was true after reading your old page about LZSS optimal parse. I said to myself "that sounds like it will work for LZ4, for more or less the same reasons", but I must have skipped some mental steps in there. LZSS does not have the varying cost per byte of literals that LZ4 has, which adds a wrinkle to things. I completely missed that bit.

Turns out the "prefer matches over literals" rule does make it size-optimal, but you're right that this is not exactly obvious... its pretty cool that it actually works!