Some learnings from ZStd

I've spent some time in the last month looking into cases where ZStd beats Kraken & Mermaid.

Most of the time Kraken gets better ratio than ZStd, but there were exceptions (mainly text), and that always kind of bothered me. Kraken is roughly a superset of ZStd (not exactly) and the differences are small, so ZStd shouldn't have been winning by more than 1% (which is the variation I'd expect due to small differences). On text files I have no edge over ZStd; all my advantages are moot, so we're both reduced to being pretty basic LZ-Huffs. We should be equal, but I was losing. So I dug in to see what was going on.

Thanks of course to Yann for making his great work open source so that I'm able to look at it; open source and sharing code is a wonderful and helpful thing when people choose to do so voluntarily, not so nice when your work is stolen from you against your will and shown to the world like phone-hacked dick-pics *cough* *assholes*. Since I'm learning from open source, I figured I should give back, so I'm posting what I learned.

A lot of the differences are a question of binary vs. text focus. ZStd has some tweaking that clearly comes from testing on text and corpora with a lot of text (like silesia). On the other hand, I've been focusing very much on binary and that has caused me to miss some important things that only show up when you look closely at text performance.

This is what I found :

Long hashes are good for text, bad for binary

ZStd's non-optimal levels use hash lengths of 5, or even 6 or 7 at the fastest levels. This helps on text because text has many long matches, so it's important to have a hash long enough that it can differentiate between "boogie" and "booger" and put them in different hash table bins. (This matters most at the fastest levels, which use a cache table with no ways.)

On binary you really want hash len 4, because there are important matches of exactly len 4, and longer hashes can make you miss them.
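A minimal sketch of the hash length effect, assuming a plain multiplicative hash over the first `len` bytes. The function names and the multiplier constant are made up for illustration; this is not ZStd's or Oodle's hash. With len 4, "boogie" and "booger" produce the same key ("boog") and so must land in the same bin; with len 6 the keys differ, so the hash has a chance to separate them.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the first `len` bytes (1..8) at p into a little-endian integer key.
   Only `len` bytes are read, so short buffers are safe. */
static uint64_t hash_key(const unsigned char *p, int len)
{
    assert(len >= 1 && len <= 8);
    uint64_t key = 0;
    for (int i = 0; i < len; i++)
        key |= (uint64_t)p[i] << (8 * i);
    return key;
}

/* Multiplicative hash of the first `len` bytes into a table of 2^bits slots.
   The multiplier is an arbitrary odd 64-bit constant (an assumption, not
   taken from any particular codec). */
static uint32_t hash_bytes(const unsigned char *p, int len, int bits)
{
    assert(bits >= 1 && bits <= 32);
    const uint64_t mult = 0x9E3779B97F4A7C15ULL;
    return (uint32_t)((hash_key(p, len) * mult) >> (64 - bits));
}
```

The flip side is the binary case: a len 6 hash only finds a candidate when 6 bytes agree, so a match of exactly len 4 never gets looked up at all.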

zstd2 hash len 6 :
PD3D    : zstd2 : 31,941,800 ->11,342,055 =  2.841 bpb =  2.816 to 1 

zstd2 hash len 4 :
PD3D    : zstd2 : 31,941,800 ->10,828,309 =  2.712 bpb =  2.950 to 1 

zstd2 hash len 6 :
dickens : zstd2 : 10,192,446 -> 3,909,882 =  3.069 bpb =  2.607 to 1 

zstd2 hash len 4 :
dickens : zstd2 : 10,192,446 -> 4,387,536 =  3.444 bpb =  2.323 to 1 

Longer hashes help the fast modes a *lot* on text. If you care about fast compression of text you really want those longer hashes.

This is a big issue and because of it ZStd fast modes will continue to be better than Oodle on text (and Oodle will be better on binary); or we have to find a good way to detect the data type and tune the hash length to match.

lazy2 is helpful on text

Standard lazy parsing looks for a match at ptr, if one is found it also looks at ptr+1 to see if something better is there. Lazy2 also looks at ptr+2.
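A rough sketch of the difference between 1-step and 2-step lazy, assuming a brute-force match finder and a "strictly longer wins" rule. Real parsers (ZStd included) also weigh offset cost and keep re-checking after they move forward, so treat this as the idea only; all names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Brute-force longest match for buf[pos..n) against any earlier start.
   Returns the match length, 0 if nothing matches. */
static size_t longest_match(const unsigned char *buf, size_t n, size_t pos)
{
    size_t best = 0;
    for (size_t cand = 0; cand < pos; cand++) {
        size_t len = 0;
        while (pos + len < n && buf[cand + len] == buf[pos + len])
            len++;
        if (len > best)
            best = len;
    }
    return best;
}

/* Decide where the next match should start. depth = 1 is standard lazy
   (also look at pos+1), depth = 2 is lazy2 (also look at pos+2).
   A later start only wins if its match is strictly longer. */
static size_t lazy_pick_start(const unsigned char *buf, size_t n, size_t pos,
                              int depth, size_t min_len)
{
    assert(pos < n && depth >= 0);
    size_t best_pos = pos;
    size_t best_len = longest_match(buf, n, pos);
    if (best_len < min_len)
        return pos; /* no usable match here: caller emits a literal */
    for (int ahead = 1; ahead <= depth && pos + (size_t)ahead < n; ahead++) {
        size_t len = longest_match(buf, n, pos + (size_t)ahead);
        if (len > best_len) {
            best_len = len;
            best_pos = pos + (size_t)ahead;
        }
    }
    return best_pos;
}
```

On the buffer "abcXcdefghYabcdefgh", starting at the second "abc" (pos 11), the match at pos is len 3, pos+1 is len 2, and pos+2 is len 6. 1-step lazy keeps the len 3 match; 2-step lazy sends two literals and takes the len 6 match.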

I wasn't doing 2-ahead lazy parsing, because on binary it doesn't help much. But on text it's a nice little win :

Zstd level 9 has 2-step lazy normally :

zstd9 : 41,458,703 ->10,669,424 =  2.059 bpb =  3.886 to 1 

disabled : (1-step lazy) :

zstd9 : 41,458,703 ->10,825,637 =  2.089 bpb =  3.830 to 1 

trying all len reductions in the optimal parser helps on text

I once wrote that in codecs that do strong rep0 exclusion (rep0len1 literal can't occur immediately after a match), you can just always send max-length matches, and not have to consider match length reductions. (because max-length matches maintain rep0 exclusion but shorter ones violate it).

That is not quite right. It tends to be true on binary, but is wrong on text. The issue is that you only get the rep0 exclusion benefit if you actually send a literal after the match.

That happens often on binary. Binary frequently goes match-literal-match-literal , with some near-random bytes between predictable regions. Text has very few literals. Many text files go match-match-match which means the rep0 literal exclusion does nothing for you.

On text files you often have many short & medium length overlapping matches, and trying len reductions is important to find the parse that traces through them optimally.


and the optimal parse might be


which you would only find if you tried the len reduction of A

this kind of thing. Text is all about making the best normal-match decisions.
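To make the len reduction point concrete, here is a toy backward DP parse with fixed costs. Everything in it is an assumption for illustration: a literal always costs 9 "bits", any match costs 12, MML is 3, and the available matches are given as a table of max lengths per position. A real optimal parser uses adaptive statistics, reps and offsets; this only shows why restricting to max-length matches can miss the best path.

```c
#include <assert.h>

enum { LITERAL_COST = 9, MATCH_COST = 12, MIN_MATCH_LEN = 3, MAX_POSITIONS = 64 };

/* max_len[i] = longest match available at position i (0 if none).
   Returns the cheapest toy cost of coding positions [0, n).
   try_all_lens = 0 : only the max-length match is considered at each position.
   try_all_lens = 1 : every length from MIN_MATCH_LEN up to the max is tried. */
static int parse_cost(const int *max_len, int n, int try_all_lens)
{
    assert(n >= 0 && n < MAX_POSITIONS);
    int cost[MAX_POSITIONS];
    cost[n] = 0;
    for (int i = n - 1; i >= 0; i--) {
        cost[i] = LITERAL_COST + cost[i + 1];   /* option: send a literal */
        int longest = max_len[i];
        if (longest > n - i)
            longest = n - i;                    /* clamp to end of buffer */
        if (longest < MIN_MATCH_LEN)
            continue;
        int shortest = try_all_lens ? MIN_MATCH_LEN : longest;
        for (int len = shortest; len <= longest; len++) {
            int c = MATCH_COST + cost[i + len]; /* option: match of this len */
            if (c < cost[i])
                cost[i] = c;
        }
    }
    return cost[0];
}
```

With max_len = {6, 0, 0, 0, 4, 3, 0, 0}, the max-length-only parse takes the len 6 match and then two literals (cost 30). Allowing reductions finds "match cut to len 4, then the len 4 match at position 4" (cost 24), which is only reachable by shortening the first match.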

with all len reductions :

zstd22 : 10,000,000 -> 2,800,209 =  2.240 bpb =  3.571 to 1 

without :

zstd22 : 10,000,000 -> 2,833,168 =  2.267 bpb =  3.530 to 1 

Getting len 3 matches right in the optimal parser is really important on text

Part of the "text is all matches" issue. My codecs are mostly MML 4 in the non-optimal modes, then I switch to MML3 at level 7 (Optimal3). Adding MML3 generally lets you get a bit more compression ratio, but hurts decode speed a bit.

(BTW MML3 in the non-optimal modes generally *hurts* compression ratio, because they can't make the decision correctly about when to use it. A len 3 match is always marginal, it's only slightly cheaper than 3 literals (depending on the literals), and you probably don't want it if you can find any longer match within those next 3 bytes. Non-optimal parsers just make these decisions wrong and muck it all up, they do better with MML 4 or even higher sometimes. (there are definitely files where you can crank up MML to 6 or 8 and improve ratio))

So, I was doing that *but* I was using the statistics from a greedy pre-pass to seed the optimal parse decisions, and the greedy pre-pass was MML 4, which was biasing the optimal against len 3 matches. It was just a fuckup, and it wasn't hurting me on binary, but when I compared to ZStd's optimal parse on text I could immediately see it had a lot more len 3 matches than me.

(this is also an example of the parse-statistics feedback problem, which I believe is the most important problem in LZ compression)


zstd22 : 10,192,446 -> 2,856,038 =  2.242 bpb =  3.569 to 1

before :
ooKraken7 : 10,192,446 -> 2,905,719 =  2.281 bpb =  3.508 to 1

after  :
ooKraken7 : 10,192,446 -> 2,862,710 =  2.247 bpb =  3.560 to 1 

ZStd is full of small clever bits

There are lots of little clever nuggets that are hard to see. They aren't generally commented and they're buried in chunks of copy-pasted code that all looks the same so it's easy to gloss over the variations.

I looked over this code many times :

        if ((offset_1 > 0) & (MEM_read32(ip+1-offset_1) == MEM_read32(ip+1))) {
            mLength = ZSTD_count(ip+1+4, ip+1+4-offset_1, iend) + 4;
            ZSTD_storeSeq(seqStorePtr, ip-anchor, anchor, 0, mLength-MINMATCH);
        } else {
            U32 offset;
            if ( (matchIndex <= lowestIndex) || (MEM_read32(match) != MEM_read32(ip)) ) {
                ip += ((ip-anchor) >> g_searchStrength) + 1;
            // [ got match etc... ]

and I thought - okay, look for a 4 byte rep match, if found take it unconditionally and don't look for normal match. That's the same thing I do (I think it came from me?), no biggie.

But there's a wrinkle. The rep check is not at the same position as the normal match. It's at pos+1.

This is actually a mini-lazy-parse. It doesn't do a full match & rep find at pos & (pos+1). It's just scanning through, at each pos it only does one rep find and one match find, but the rep find is offset forward by +1. That means it will take {literal + rep} even if match is available, which a normal non-lazy parser can't do.
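Stripped down to just the decision, the rep-ahead trick looks something like the sketch below. This is a simplified illustration with hypothetical names, not the ZStd code: it assumes a 4-byte check for both the rep and the normal match, and that the hash table candidate is passed in as `match_pos`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { STEP_LITERAL, STEP_REP_AT_NEXT, STEP_MATCH_HERE } step_kind;

/* One step of a non-lazy fast parser at position ip.
   rep_off   : most recent match offset (0 if none yet).
   match_pos : candidate position from the hash table; pass any value >= ip
               to mean "no candidate".
   The rep is tested at ip+1, the normal match at ip, and the rep is taken
   unconditionally if it hits. */
static step_kind fast_step(const unsigned char *buf, size_t n, size_t ip,
                           size_t rep_off, size_t match_pos)
{
    assert(ip < n);
    if (rep_off > 0 && ip + 1 >= rep_off && ip + 1 + 4 <= n &&
        memcmp(buf + ip + 1, buf + ip + 1 - rep_off, 4) == 0)
        return STEP_REP_AT_NEXT;   /* emit literal at ip, then the rep match */
    if (match_pos < ip && ip + 4 <= n &&
        memcmp(buf + ip, buf + match_pos, 4) == 0)
        return STEP_MATCH_HERE;    /* ordinary match starting at ip */
    return STEP_LITERAL;
}
```

In "abcdXYbcdeZabcde" at ip = 11 there is a normal 4-byte match ("abcd" at position 0), but with rep_off = 6 the bytes at ip+1 ("bcde") match the rep, so the step is {literal + rep} and the normal match is never taken.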

(aside : you might think that this misses a rep find. When the literal run starts, right after a match, it starts looking for the first rep at pos+1, so there's a spot where it does no rep find. But that spot is where the rep0 exclusion applies - there can be no rep there, so it's all good!)

This is a solid win and it's totally for free, so very cool.

Seven testset 

with rep-ahead search :

total : zstd3       : 80,000,000 ->34,464,878 =  3.446 bpb =  2.321 to 1 

with rep at same pos as match :

total : zstd3       : 80,000,000 ->34,521,261 =  3.452 bpb =  2.317 to 1 

The end.

ADD : a couple more notes on ZStd (that aren't from the recent investigation) while I'm at it :

ZStd uses a unique approach to the lrl0-rep0 exclusion

After a match (of full length), that same offset cannot match again. If your offsets are in a rep match cache, the most recently used offset is the top (0th) entry, rep0. This is the lrl0-rep0 exclusion.

rep0 is usually the most likely match, so it will get the largest share of the entropy coder probability space. Therefore if you're in an exclusion where that symbol is impossible, you're wasting a lot of bits.

There are two ways that I would call "traditional" or straightforward data compression ways to model the lrl0-rep0 exclusion. One is to use a single bit for (lrl == 0) as context for the rep-index coding event. eg. you have two entropy coding states for offsets, one for lrl == 0 and one for lrl != 0. The other classical method would be to combine lrl with rep-index in a larger alphabet, which allows you to model their correlation using only order-0 entropy coding. The minimum alphabet size here is only 2 bits, 1 bit for (lrl == 0) or not, and one for (match == rep0) or not.

ZStd does not use either of these methods. Instead it shifts the rep index by (lrl == 0). That is, ZStd has 3 reps, and normally they are in match offset slots 0,1,2. But right after the end of a match (when lrl is 0) those offset values change to mean rep 1,2,3 ; and there is no rep3, that's a virtual offset equal to (rep0 - 1).
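The shift described above can be sketched in a few lines. This uses the 0-based slot numbering from the paragraph (slots 0,1,2) rather than the numbering in the ZStd format document, and it leaves out the rep history update; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Resolve a rep slot code (0..2) to an actual offset.
   rep[0] is the most recently used offset.
   When lrl == 0 the slots shift by one: 0,1,2 mean rep1, rep2, and the
   virtual "rep3", which is defined as rep[0] - 1. */
static uint32_t resolve_rep_offset(const uint32_t rep[3], unsigned code,
                                   int lrl_is_zero)
{
    assert(code < 3);
    unsigned slot = code + (lrl_is_zero ? 1u : 0u);
    if (slot < 3)
        return rep[slot];
    assert(rep[0] > 1);   /* an offset of 0 would be invalid */
    return rep[0] - 1;
}
```

So with reps {10, 20, 30}, code 0 means offset 10 after a literal run, but offset 20 right after a match, and code 2 right after a match means offset 9. The impossible rep0 never occupies any code space in the lrl == 0 case.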

The ZStd format documentation is a good reference for these things.

I can't say how well the ZStd method here compares to the alternatives as it's a bit more effort to check than I'd like to do. (if you want to try it, you could double the size of ZStd's offset coding alphabet to put 1 bit of lrl == 0 into the offset coding; then the decode sequence grabs an offset and only pulls an lrl code if the offset bit says so).

ZStd uses TANS in a limited and efficient way

ZStd does not use TANS (FSE) on its literals, which are the largest class of entropy coded symbols. Presumably Yann found, like us, that the compression gains on literals (over Huffman) are small, and the speed cost is not worth it. ZStd only uses TANS on the LZ match components - LRL, offset, ML.

Each of these has a small alphabet (max symbol values 52, 35, 28 for ML, LRL, offset), and therefore can use a small # of bits for the TANS tables (9, 9, 8 respectively). This is a sweet spot for TANS, so it works well in ZStd.

For large alphabets (eg. 256 for literals), TANS needs a higher # of bits for its code tables (at least 11), which means 2048 entries being filled. This makes the table setup time rather large. By cutting the table size to 8 or 9 bits you cut that down by 4-8X. With large alphabets you also may as well just go Huff. But with small alphabets, Huff gets worse and worse. Consider the extreme - in an alphabet of 2 symbols Huff becomes no compression at all, while TANS can still do entropy coding. With small alphabets to use Huffman you need to combine symbols (eg. in a 2-bit alphabet you would code 4 at once as an 8-bit symbol). BUT that means going up to big decoder tables again, which adds to your constant overhead.

FSE uses the prime-scatter method to fill the TANS decode table. (this is using a relatively-prime step to just walk around the circular array, using the property that you can just keep stepping that way and you will eventually hit every slot once and only once). I evaluated the prime-scatter method before and concluded that the compression penalty was unacceptably large. I was mistaken. I had just implemented it wrong, so my results were much worse than they should be.

(the mistake I made was that I did the prime-scatter in one pass; for each symbol, take the steps and fill table entries, increment "from_state" as you step, "to_state" steps around with the prime-modulo. This causes a non-monotonic relationship between from_state and to_state which is very bad. The right way to do it (the way ZStd/FSE does it) is to use some kind of two-pass scheme, so that you do the shuffle-scatter first (which can step around the loop non-monotonically) but then assign the from_state relationship in a second pass which ensures the monotonic relationship).
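A minimal sketch of the two-pass idea, under some assumptions: the table size is a power of two with table_log >= 5, the counts are already normalized to sum to the table size, and the step formula is my recollection of the one FSE uses (treat it as an assumption). It skips the "below 1" special case and only produces the per-slot symbol and state number, not a full decode table. Names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { MAX_TABLE_LOG = 12, MAX_SYMBOLS = 256 };

/* Pass 1 scatters symbols around the table with a step that is odd, hence
   relatively prime to the power-of-two table size.
   Pass 2 walks the table in slot order and hands each symbol its states
   count, count+1, ... so state numbers rise monotonically with slot.
   Returns the final scatter position, which is 0 when every slot was
   visited exactly once. */
static uint32_t spread_two_pass(const uint16_t *counts, int num_symbols,
                                int table_log, uint8_t *symbol_at,
                                uint16_t *state_at)
{
    assert(table_log >= 5 && table_log <= MAX_TABLE_LOG);
    assert(num_symbols >= 1 && num_symbols <= MAX_SYMBOLS);
    const uint32_t size = 1u << table_log;
    const uint32_t mask = size - 1;
    const uint32_t step = (size >> 1) + (size >> 3) + 3;
    uint16_t next_state[MAX_SYMBOLS];
    uint32_t total = 0;
    for (int s = 0; s < num_symbols; s++)
        total += counts[s];
    assert(total == size);          /* counts must be normalized */

    uint32_t pos = 0;
    for (int s = 0; s < num_symbols; s++) {         /* pass 1: scatter */
        next_state[s] = counts[s];
        for (uint32_t i = 0; i < counts[s]; i++) {
            symbol_at[pos] = (uint8_t)s;
            pos = (pos + step) & mask;
        }
    }
    for (uint32_t slot = 0; slot < size; slot++)    /* pass 2: assign states */
        state_at[slot] = next_state[symbol_at[slot]]++;
    return pos;
}
```

The one-pass mistake corresponds to assigning the state number inside the first loop, in scatter order, which is exactly what breaks the monotonic relationship.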

With a correct implementation, prime-scatter's compression ratio is totally fine (*). The two-pass method that ZStd/FSE uses would be slow for large alphabets or large L, but ZStd only uses FSE for small alphabets and small L. The entropy coder and application are well matched. (* = if you special case singletons, as below)

The worst case for prime-scatter is low counts, and counts of 1 are the worst. ZStd/FSE uses a special case for counts of 1 that are "below 1". Back in the "Understanding TANS" series I looked at the "precise sort" method of table building and found that artificially skewing the bias to put counts of 1 at the end was a big win in practice. The issue there is that the counts we see at that point are normalized, and zeros were forced up to 1 for codeability. The true count might be much lower. Say you're coding an array of size 64k and symbol 'x' only occurs 1 time. If you have a TANS L of 1024 , the true probability should be 1/64k , but normalized forces it up to 1/1024. Putting the singleton counts at the end of the TANS array gives them the maximum codelen (end of the array has maximum fractional bits). The sort bias I did before was a hack that relies on the fact that most singleton counts come from below-1 normalized probabilities. ZStd/FSE explicitly signals the difference, it can send a "true 1" (eg. closest normalized probability really is 1/1024 ; eg. in the 64k array, count is near 64), or a "below 1" , some very low count that got forced up to 1. The "below 1" symbols are forced to the end of the TANS array while the true 1's are allowed to prime-scatter like other symbols.

The end.


Jarek Duda said...

Hi Charles,
Thanks for the comments. I don't have experience with LZ, but regarding the last part, the best (still heuristic) method for tANS symbol spread I am aware of is "tuned spread" which uses both quantization and the actual probabilities to correspondingly shift the symbol appearances left or right.
If quantization is p[s] ~ q[s]/L, this symbol has q[s] appearances i \in {q[s],...,2q[s]-1} and
preferred position for i-th appearance is x ~ 1/(p[s] ln(1 + 1/i)).

For a singleton i=1, x ~ 1/(p[s] ln(2)) \in [L,2L-1], so it can well represent probabilities between p[s] ~ 1/(2L ln(2)) ~ 0.72/L for the most-right position (x=2L), to p[s] ~ 1/(L ln(2)) ~ 1.44/L for the most-left position (x=L).

cbloom said...

Jarek, but that would require transmitting the true probability (p), not the normalized probability (q). That may or may not be worth it, as the true probability may take more bits, and it would require the decoder to spend the time to normalize (to recreate the q's since it was sent the p's), which is non-trivial.

Jarek Duda said...

Indeed, in practice there is some compromise needed, for example distinguishing singleton in the center of range from the very low probability symbols.
More sophisticated is storing probability distribution quantized in an optimized way (minimizing cost of header + KL), then also decoder perform proper quantization and symbol spread ... https://encode.ru/threads/1883-Reducing-the-header-optimal-quantization-compression-of-probability-distribution
Another option is storing symbol spread in the header ... anyway, there are many possibilities to optimize among.
