I've spent some time in the last month looking into cases where ZStd beats Kraken & Mermaid.
Most of the time Kraken gets better ratio than ZStd,
but there were exceptions to that (mainly text), and it always kind of bothered me.
Since Kraken is roughly a superset of ZStd (not exactly), and the differences are small,
ZStd shouldn't have been winning by more than 1% (the variation I'd expect from those
small differences). On text files I have no edge over ZStd; all my advantages are moot, so
we're reduced to both being pretty basic LZ-Huffs. We should be roughly equal, but I was losing.
So I dug in to see what was going on.
Thanks of course to Yann for making his great work open source so that I'm able to look at it; open source and sharing
code is a wonderful and helpful thing when people choose to do so voluntarily, not so nice when your work is stolen
from you against your will and shown to the world like phone-hacked dick-pics *cough* *assholes*. Since I'm
learning from open source, I figured I should give back, so I'm posting what I learned.
A lot of the differences are a question of binary vs. text focus. ZStd has some tweaking that clearly
comes from testing on text and corpora with a lot of text (like silesia). On the other hand, I've been
focusing very much on binary and that has caused me to miss some important things that only show up
when you look closely at text performance.
This is what I found :
Long hashes are good for text, bad for binary
ZStd's non-optimal levels use hash lengths of 5, or even 6 or 7 at the fastest levels. This helps on
text because text has many long matches, so it's important to have a hash long enough to
differentiate between "boogie" and "booger" and put them in different hash table bins. (This matters
most at the fastest levels, which use a cache table with no ways.)
On binary you really want to hash len 4 because there are important matches of exactly len 4, and longer hashes
can make you miss them.
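To make the hash-length knob concrete, here's a minimal sketch (not ZStd's actual code; the multiplier is just a 64-bit golden-ratio constant, any good odd multiplier works). The only difference between a "len 4" and a "len 6" matcher is how many bytes feed the hash :

#include <stdint.h>
#include <string.h>

/* minimal sketch, not ZStd's actual code : hash the first "len" bytes at p
   down to "bits" bits. A longer len separates "boogie" from "booger" into
   different bins, but can no longer find matches that are exactly 4 long. */
static uint32_t hash_bytes(const uint8_t *p, int len, int bits)
{
    uint64_t x = 0;
    memcpy(&x, p, 8);                     /* assumes 8 readable bytes at p, little-endian */
    if (len < 8)
        x &= (~0ULL) >> (64 - 8*len);     /* keep only the first len bytes */
    return (uint32_t)((x * 0x9E3779B97F4A7C15ULL) >> (64 - bits));
}

A cache-table matcher then just does inserts like table[hash_bytes(ip,6,tableLog)] = ip (schematic); with no ways, every hash collision evicts the previous candidate, which is why keeping distinct long prefixes in distinct bins matters so much at the fastest levels.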
zstd2 hash len 6 :
PD3D : zstd2 : 31,941,800 ->11,342,055 = 2.841 bpb = 2.816 to 1
zstd2 hash len 4 :
PD3D : zstd2 : 31,941,800 ->10,828,309 = 2.712 bpb = 2.950 to 1
zstd2 hash len 6 :
dickens : zstd2 : 10,192,446 -> 3,909,882 = 3.069 bpb = 2.607 to 1
zstd2 hash len 4 :
dickens : zstd2 : 10,192,446 -> 4,387,536 = 3.444 bpb = 2.323 to 1
Longer hashes help the fast modes a *lot* on text. If you care about fast compression of text
you really want those longer hashes.
This is a big issue, and because of it ZStd's fast modes will continue to be better than Oodle's on text
(and Oodle's will be better on binary), unless we find a good way to detect the data type and
tune the hash length to match.
lazy2 is helpful on text
Standard lazy parsing looks for a match at ptr; if one is found, it also looks at ptr+1 to see if
something better is there. Lazy2 also looks at ptr+2.
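Here's a hedged sketch of the decision rule (not ZStd's or Oodle's actual code; match_gain() and the emit helpers are hypothetical, and LITERAL_COST is just a rough estimate of the bits spent deferring by one position) :

enum { LITERAL_COST = 8 };   /* rough bit cost of deferring the match by one literal */
extern int  match_gain(const unsigned char *p);    /* hypothetical : bits saved by best match at p, 0 if none */
extern void emit_literal(const unsigned char *p);  /* hypothetical */
extern void emit_match(const unsigned char *p);    /* hypothetical */

static void lazy2_decide(const unsigned char *ip)
{
    int g0 = match_gain(ip);
    int g1 = match_gain(ip + 1);
    int g2 = match_gain(ip + 2);

    if (g2 > g1 + LITERAL_COST && g2 > g0 + 2*LITERAL_COST) {
        emit_literal(ip); emit_literal(ip + 1); emit_match(ip + 2);   /* 2-ahead win */
    } else if (g1 > g0 + LITERAL_COST) {
        emit_literal(ip); emit_match(ip + 1);                         /* classic 1-ahead lazy */
    } else {
        emit_match(ip);                                               /* greedy choice */
    }
}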
I wasn't doing 2-ahead lazy parsing, because on binary it doesn't help much. But on text it's
a nice little win :
Zstd level 9 has 2-step lazy normally :
zstd9 : 41,458,703 ->10,669,424 = 2.059 bpb = 3.886 to 1
disabled : (1-step lazy) :
zstd9 : 41,458,703 ->10,825,637 = 2.089 bpb = 3.830 to 1
Trying all len reductions in the optimal parser helps on text
I once wrote that in codecs that do strong rep0 exclusion (a rep0len1 literal can't occur immediately after a
match), you can just always send max-length matches and never have to consider match length reductions
(because max-length matches maintain the rep0 exclusion, while shorter ones violate it).
That is not quite right.
It tends to be true on binary, but is wrong on text.
The issue is that you only get the rep0 exclusion benefit if you actually send a literal after the match.
That happens often on binary. Binary frequently goes match-literal-match-literal , with some near-random
bytes between predictable regions.
Text has very few literals. Many text files go match-match-match which means the rep0 literal exclusion does
nothing for you.
On text files you often have many short & medium length overlapping matches, and trying len reductions is
important to find the parse that traces through them optimally.
AAAADDDGGGGJJJJ
BBBBBFFFHHHHHH
CCCEEEEEIII
and the optimal parse might be
AAABBBFFFHHHHHH
which you would only find if you tried the len reduction of A
this kind of thing. Text is all about making the best normal-match decisions.
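A hedged sketch of the piece of the optimal parser this needs (not Kraken's or ZStd's actual code) : when a match is found, update the arrival cost for every allowed reduction of its length, not just the maximum length. cost[], choice[] and match_cost() are hypothetical.

typedef struct { int offset, len; } Choice;
extern int match_cost(int offset, int len);   /* hypothetical price function, in bits */

/* cost[p] = best known bit cost to code the first p bytes (initialized "infinite") ;
   choice[p] remembers how we arrived at p so the parse can be traced back */
static void try_match_with_reductions(int pos, int offset, int maxlen, int mml,
                                      int *cost, Choice *choice)
{
    for (int len = mml; len <= maxlen; len++) {
        int c = cost[pos] + match_cost(offset, len);
        if (c < cost[pos + len]) {
            cost[pos + len] = c;
            choice[pos + len].offset = offset;
            choice[pos + len].len    = len;
        }
    }
}

Sending only maxlen would never record the cheap arrival a few bytes in, so a parse like the AAABBB... one above is simply unreachable.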
with all len reductions :
zstd22 : 10,000,000 -> 2,800,209 = 2.240 bpb = 3.571 to 1
without :
zstd22 : 10,000,000 -> 2,833,168 = 2.267 bpb = 3.530 to 1
Getting len 3 matches right in the optimal parser is really important on text
Part of the "text is all matches" issue. My codecs are mostly MML 4 in the non-optimal modes,
then I switch to MML3 at level 7 (Optimal3). Adding MML3 generally lets you get a bit more
compression ratio, but hurts decode speed a bit.
(BTW MML3 in the non-optimal modes generally *hurts* compression ratio, because they can't make
the decision correctly about when to use it. A len 3 match is always marginal : it's only
slightly cheaper than 3 literals (depending on the literals), and you probably don't want it if you
can find any longer match within those next 3 bytes. Non-optimal parsers just make these decisions
wrong and muck it all up; they do better with MML 4 or even higher sometimes. (there are definitely
files where you can crank up MML to 6 or 8 and improve ratio))
So, I was doing that *but* I was using the statistics from a greedy pre-pass to seed the optimal
parse decisions, and the greedy pre-pass was MML 4, which was biasing the optimal against len 3 matches.
It was just a fuckup, and it wasn't hurting me on binary, but when I compared to ZStd's optimal parse
on text I could immediately see it had a lot more len 3 matches than me.
(this is also an example of
the parse-statistics feedback problem, which I believe is the most important problem in LZ compression)
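A hedged sketch of how that bias creeps in (hypothetical seeding code, not my actual parser) : if the greedy pre-pass never emits a len 3 match, the seeded statistics price len 3 as nearly impossible, and the optimal parse dutifully avoids it, so the statistics never get a chance to correct themselves.

extern int match_len_histo[256];    /* hypothetical : match length counts from the greedy pre-pass */

/* convert a pre-pass count into a seed cost in bits ; a count of 0 gets a
   fixed escape cost so it isn't literally infinite, but it's still heavily
   penalized relative to lengths the pre-pass actually used */
static int seeded_len_cost(int len, int total)
{
    int count = match_len_histo[len];
    if (count == 0)
        return 20;                  /* arbitrary "never seen" cost, in bits */
    int bits = 0;                   /* roughly -log2(count/total) */
    while ((count << bits) < total)
        bits++;
    return bits;
}
/* with an MML 4 pre-pass, match_len_histo[3] is always 0, so every len 3 match
   pays the escape cost and the optimal parse almost never takes one, even on
   text where len 3 matches are genuinely cheaper than 3 literals. */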
dickens
zstd22 : 10,192,446 -> 2,856,038 = 2.242 bpb = 3.569 to 1
before :
ooKraken7 : 10,192,446 -> 2,905,719 = 2.281 bpb = 3.508 to 1
after :
ooKraken7 : 10,192,446 -> 2,862,710 = 2.247 bpb = 3.560 to 1
ZStd is full of small clever bits
There are lots of little clever nuggets that are hard to see. They aren't generally commented, and they're buried in
chunks of copy-pasted code that all looks the same, so it's easy to gloss over the variations.
I looked over this code many times :
if ((offset_1 > 0) & (MEM_read32(ip+1-offset_1) == MEM_read32(ip+1))) {
    mLength = ZSTD_count(ip+1+4, ip+1+4-offset_1, iend) + 4;
    ip++;
    ZSTD_storeSeq(seqStorePtr, ip-anchor, anchor, 0, mLength-MINMATCH);
} else {
    U32 offset;
    if ( (matchIndex <= lowestIndex) || (MEM_read32(match) != MEM_read32(ip)) ) {
        ip += ((ip-anchor) >> g_searchStrength) + 1;
        continue;
    }
    // [ got match etc... ]
and I thought - okay, look for a 4-byte rep match; if found, take it unconditionally and don't look for
a normal match. That's the same thing I do (I think it came from me?), no biggie.
But there's a wrinkle. The rep check is not at the same position as the normal match. It's at pos+1.
This is actually a mini-lazy-parse. It doesn't do a full match & rep find at both pos & (pos+1). It's just
scanning through; at each pos it only does one rep find and one match find, but the rep find is offset
forward by +1. That means it will take {literal + rep} even when a normal match is available, which a
normal non-lazy parser can't do.
(aside : you might think this misses a rep find when the literal run starts, right after a match : the
first rep find is done at pos+1, so there's one spot where no rep find happens. But that spot is exactly
where the rep0 exclusion applies - there can be no rep there, so it's all good!)
This is a solid win and it's totally for free, so very cool.
Seven testset
with rep-ahead search :
total : zstd3 : 80,000,000 ->34,464,878 = 3.446 bpb = 2.321 to 1
with rep at same pos as match :
total : zstd3 : 80,000,000 ->34,521,261 = 3.452 bpb = 2.317 to 1
The end.
ADD : a couple more notes on ZStd (that aren't from the recent investigation) while I'm at it :
ZStd uses a unique approach to the lrl0-rep0 exclusion
After a match (of full length), that same offset cannot match again at the very next position (if it could, the match
would have been longer). If your offsets are in a rep match cache, the most
recently used offset is the top (0th) entry, rep0. This is the lrl0-rep0 exclusion.
rep0 is usually the most likely match, so it will get the largest share of the entropy coder probability space. Therefore
if you're in an exclusion where that symbol is impossible, you're wasting a lot of bits.
There are two ways that I would call "traditional" or straightforward data compression ways to model the lrl0-rep0 exclusion.
One is to use a single bit for (lrl == 0) as context for the rep-index coding event. eg. you have two entropy coding states
for offsets, one for lrl == 0 and one for lrl != 0. The other classical method would be to combine lrl with rep-index in a
larger alphabet, which allows you to model their correlation using only order-0 entropy coding. The minimum alphabet size here
is only 2 bits, 1 bit for (lrl == 0) or not, and one for (match == rep0) or not.
ZStd does not use either of these methods. Instead it shifts the rep index by (lrl == 0). That is, ZStd has 3 reps, and normally
they are in match offset slots 0,1,2. But right after the end of a match (when lrl is 0) those offset values change to mean rep
1,2,3 ; and there is no rep3, that's a virtual offset equal to (rep0 - 1).
The ZStd format documentation is a good reference for these things.
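In decoder terms, this is roughly the following (a sketch based on the format doc's description above; the repeat-history update and edge cases such as rep0 == 1 are omitted, and the rep indices are 0-based as in the text) :

/* offset_value is the decoded offset symbol's value ; rep[0..2] are the three
   most recent offsets, rep[0] most recent. */
static unsigned decode_offset(unsigned offset_value, unsigned lit_len, const unsigned rep[3])
{
    if (offset_value > 3)
        return offset_value - 3;                /* a normal, explicitly coded offset */

    /* offset values 1..3 select a repeat ; when lit_len == 0 the slots shift by
       one, because rep[0] is impossible right after a match */
    unsigned idx = offset_value - 1 + (lit_len == 0);
    return (idx < 3) ? rep[idx] : rep[0] - 1;   /* idx 3 is the "virtual" rep, only reachable at lrl 0 */
}

So the exclusion is folded into the meaning of the offset code itself; no extra context or enlarged alphabet is needed, and the slot that would have been wasted on the impossible rep0 gets reused for the cheap (rep0 - 1) guess.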
I can't say how well the ZStd method here compares to the alternatives as it's a bit more effort to check than I'd like to do.
(if you want to try it, you could double the size of ZStd's offset coding alphabet to put 1 bit of lrl == 0 into the offset coding;
then the decode sequence grabs an offset and only pulls an lrl code if the offset bit says so).
ZStd uses TANS in a limited and efficient way
ZStd does not use TANS (FSE) on its literals, which are the largest class of entropy coded symbols. Presumably Yann found, like us,
that the compression gains on literals (over Huffman) are small, and the speed cost is not worth it. ZStd only uses TANS on the
LZ match components - LRL, offset, ML.
Each of these has a small alphabet (52,35,28), and therefore can use a small # of bits for the TANS tables (9,9,8). This is a sweet
spot for TANS, so it works well in ZStd.
For large alphabets (eg. 256 for literals), TANS needs a higher # of bits for its code tables (at least 11), which means 2048 entries
being filled. This makes the table setup time rather large. By cutting the table size to 8 or 9 bits you cut that down by 4-8X.
With large alphabets you also may as well just go Huff. But with small alphabets, Huff gets worse and worse. Consider the extreme -
in an alphabet of 2 symbols Huff becomes no compression at all, while TANS can still do entropy coding. With small alphabets to use Huffman you
need to combine symbols (eg. in a 2-bit alphabet you would code 4 at once as an 8-bit symbol). BUT that means going up to big decoder
tables again, which adds to your constant overhead.
FSE uses the prime-scatter method to fill the TANS decode table. (this uses a step that's relatively prime to the
table size to walk around the circular array, relying on the property that if you keep stepping that way you will
eventually hit every slot once and only once).
I evaluated the prime-scatter method before and concluded that the compression penalty was unacceptably large.
I was mistaken. I had just implemented it wrong, so my results were much worse than they should have been.
(the mistake I made was that I did the prime-scatter in one pass; for each symbol, take the steps and fill
table entries, increment "from_state" as you step, "to_state" steps around with the prime-modulo. This causes
a non-monotonic relationship between from_state and to_state which is very bad. The right way to do it
(the way ZStd/FSE does it) is to use some kind of two-pass scheme, so that you do the shuffle-scatter first
(which can step around the loop non-monotonically) but then assign the from_state relationship in a second
pass which ensures the monotonic relationship).
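Here's a hedged sketch of that two-pass structure, in the spirit of FSE's table build (the step constant is the one FSE uses for power-of-2 table sizes; the nbBits/newState computation and the below-1 special case are omitted, and the counts are assumed to sum to tableSize) :

/* pass 1 : prime-scatter each symbol's cells around the circular table.
   pass 2 : walk cells in increasing order and hand out each symbol's
   from_states in that order, which keeps from_state -> to_state monotonic
   per symbol - the part I originally got wrong by fusing the two passes. */
static void build_scatter(unsigned char *table, int *from_state, int tableSize,
                          const int *normCount, int alphabetSize)
{
    int step = (tableSize >> 1) + (tableSize >> 3) + 3;   /* odd, so coprime with a power-of-2 size */
    int mask = tableSize - 1;
    int pos = 0;
    int next_state[256] = {0};                            /* assumes alphabetSize <= 256 */

    for (int s = 0; s < alphabetSize; s++)
        for (int i = 0; i < normCount[s]; i++) {
            table[pos] = (unsigned char)s;                /* cell positions can be as jumbled as they like */
            pos = (pos + step) & mask;
        }

    for (int cell = 0; cell < tableSize; cell++) {
        int s = table[cell];
        from_state[cell] = normCount[s] + next_state[s]++;  /* states handed out in increasing cell order */
    }
}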
With a correct implementation, prime-scatter's compression ratio is totally fine (*). The two-pass method that ZStd/FSE
uses would be slow for large alphabets or large L, but ZStd only uses FSE for small alphabets and small L.
The entropy coder and application are well matched. (* = if you special case singletons, as below)
The worst case for prime-scatter is low counts, and counts of 1 are the worst. ZStd/FSE uses a special case
for counts of 1 that are "below 1". Back in the "Understanding TANS" series I looked at the "precise sort" method
of table building and
found that artificially skewing the bias to put counts of 1 at the end was a big win in practice. The issue
there is that the counts we see at that point are normalized, and zeros were forced up to 1 for codeability.
The true count might be much lower. Say you're coding an array of size 64k and symbol 'x' only occurs 1 time.
If you have a TANS L of 1024, the true probability should be 1/64k, but normalization forces it up to 1/1024.
Putting the singleton counts at the end of the TANS array gives them the maximum codelen (the end of the array
has maximum fractional bits). The sort bias I did before was a hack that relies on the fact that most
singleton counts come from below-1 normalized probabilities. ZStd/FSE explicitly signals the difference : it
can send a "true 1" (eg. the closest normalized probability really is 1/1024 ; eg. in the 64k array, the count is near
64), or a "below 1", some very low count that got forced up to 1. The "below 1" symbols are forced to the end
of the TANS array while the true 1's are allowed to prime-scatter like other symbols.
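Continuing the sketch above (still hedged, in the spirit of FSE, where a normalized count of -1 marks a "below 1" symbol) : those symbols get pinned to the top cells of the table, where codelen is maximal, and only the cells below that take part in the prime-scatter.

static int pin_below_one_symbols(unsigned char *table, int tableSize,
                                 const int *normCount, int alphabetSize)
{
    int highThreshold = tableSize - 1;
    for (int s = 0; s < alphabetSize; s++)
        if (normCount[s] == -1)                           /* "below 1" : true probability < 1/L */
            table[highThreshold--] = (unsigned char)s;    /* one cell at the very top = max codelen */
    return highThreshold;   /* the scatter pass then only uses cells <= highThreshold */
}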
The end.