When I did the Oodle LZH I made a mistake. I used a zip-style combined codeword. Values 0-255 are a literal, and 256+ contain the log2ish of length and offset. The advantage of this is that you only have one Huff table and just do one decode, then if it's a match you also fetch some raw bits. It also models length-offset correlation by putting them in the codeword together. (this scheme is missing a lot of things that you would want in a more modern high compression LZ, like pos-state patterns and so on).
Then I added "rep matches" and just stuck them in the combined codeword as special offset values.
So the codeword was :
{
256 : literal
4*L : 4 rep matches * L length slots (L=8 or whatever = 32 codes)
O*L : O offset slots * L length slots (O=14 and L = 6 = 84 codes or whatevs)
= 256+32+84 = 372
}
The problem is that rep-match-0 can never occur after a match. (assuming you write matches of maximum length). Rep-match-0 is quite important, on binary/structured files it can have very high probability. By using a single codeword which contains rep-match-0 for all coding events, you are incorrectly mixing the statistics of the after match state (where rep-match-0 has zero probability) and after literal state (where rep-match-0 has high probability).
A quick look at the strategies for fixing this :
1. Just use separate statistics. Keep the same combined codeword structure, but have two entropy tables, one for after-match and one for after-literal. This would also let you code the literal-after-match as an xor literal with separate statistics for that.
Whether you do xor-lit or not, there will be a lot of shared probability information between the two entropy tables, so if you do static Huffman or ANS probability transmission, you may need to use the cross-two-tables-similary in that transmission.
In a static Huffman or ANS entropy scheme if rep-match-0 never occurs in the after-match code table, it will be given a codelen of 0 (or impossible) and won't take any code space at all. (I guess it does take a little code space unless you also explicitly special case the knowledge that it must have codelen 0 in your codelen transmitter)
This is the simplest version of the more general case :
2. Context-code the rep-match event using match history. As noted just using "after match" or "after literal" as the context is the simplest version of this, but more detailed history will also affect the rep match event. This is the natural way to fix it in an adaptive arithmetic/ANS coder which uses context coding anyway. eg. this is what LZMA does.
Here we aren't forbidding rep-match after match, we're just using the fact that it never occur to make its probability go to 0 (adaptively) and thus it winds up taking nearly zero code space. In LZMA you actually can have a rep match after match because the matches have a max length of 273, so longer matches will be written as rep matches. Ryg pointed out that after a match that's been limitted by max-length, LZMA should really consider the context for the rep-match coding to be like after-literal , not after-match.
In Oodle's LZA I write infinite match lengths, so this is simplified. I also preload the probability of rep-match in the after-match contexts to be near zero. (I actually can't preload exactly zero because I do sometimes write a rep-match after match due to annoying end-of-buffer and circular-window edge cases). Preconditioning the probability saves the cost of learning that it's near zero, which saves 10-100 bytes.
3. Use different code words.
Rather than relying on statistics, you can explicitly use different code words for the after-match and after-literal case. For example in something like an LZHuf as described above, just use a codeword in the after-match case that omits the rep0 codes, and thus has a smaller alphabet.
This is most clear in something like LZNib. LZNib has three events :
LRL
Match
Rep Match
So naively it looks like you need to write a trinary decision at each coding (L,M,R). But in fact only two of them are
ever possible :
After L - M or R
cannot write another L because we would've just made LRL longer
After M or R - L or M
cannot write an R because we wouldn't just made match longer
So LZNib writes the binary choices (M/R) after L and (L/M) after M or R. Because they're always binary choices, this
allows LZNib to use the simple single-divider method of
encoding values in bytes .
4. Use a combined code word that includes the conditioning state.
Instead of context modeling, you can always take previous events that you need context from and make a combined codeword. (eg. if you do Huffman on the 16-bit bigram literals, you get order-1 context coding on half the literals).
So we can make a combined codeword like :
{
(LRL 0,1,2,3+) * (rep matches , normal matches) * (lengths slots)
= 4 * (4 + 14) * 8 = 576
}
Which is a pretty big alphabet, but also combined length slots so you get lrl-offset-length correlation modeling as well.
In a combined codeword like this you are always writing a match, and any literals that precede it are written with an LRL (may be 0). The forbidden codes are LRL=0 and match =rep0 , so you can either just let those get zero probabilities, or explicitly remove them from the codeword to reduce the alphabet. (there are also other forbidden codes in normal LZ parses, such as low-length high-offset codes, so you would similarly remove or adjust those)
A more minimal codeword is just
{
(LRL 0,1+) * (rep-match-0, any other match)
= 2 * 2 = 4
}
which is enough to get the rep-match-0 can't occur after LRL 0 modeling. Or you can do anything between those two extremes to choose an alphabet size.
I ran into the max-match-len = no rep0 exclusion thing mostly by accident: I was trying to figure out the impact of limiting max match length at various levels and at some point one of my asserts started firing.
ReplyDeleteIn LZMA this is just context dilution in a case that's not terribly important, because it only triggers on highly redundant data. Being sloppy isn't ideal but in this case you're already doing well.
But in my case I was using a literal coder that explicitly excludes rep0, so being wrong about it being excluded is a bitstream-breaking bug.