I believe there's a mistake in LZMA. I could be totally wrong about that because reading the 7z code is very hard; in any case I'm going to talk about Rep0 exclusion. I believe that LZMA does not do this the way it should, and perhaps this change should be integrated into a future version. In general I have found LZMA to be very good and most of its design decisions have been excellent. My intention is not to nitpick it, but to give back to a very nice algorithm which has been generously made public by its author, Igor Pavlov.
LZMA does "literal-after-match" exclusion. I talked a bit about this last time. Basically, after a match you know that the next literal cannot be the one that would have continued the match. If it was you would have just written a longer match. This relies on always writing the maximum length for matches.
To model "LAM" exclusion, LZMA uses a funny way of doing the binary arithmetic model for those symbols. I wrote a bit about that last time, and the LZMA way to do that is good.
LZMA uses LAM exclusion on the first literal after a match, and then does normal 8-bit literal coding if there are more literals.
That all seems fine, and I worked on the Oodle LZ-Arith variant for about month with a similar scheme, thinking it was right.
But there's a wrinkle. LZMA also does "rep0len1" coding.
For background, LZMA, like LZX before it, does "repeat match" coding. A rep match means using one of the last N offsets (usually N = 3) and you flag that and send it in very few bits. I've talked about the surprising importance of repeat matches before (also here and other places ).
LZMA, like LZX, codes rep matches with MML of 2.
But LZMA also has "rep0len1". rep0len1 codes a single symbol at the 0th repeat match offset. That's the last offset matched from. That's the same offset that provides the LAM exclusion. In fact you can state the LAM exclusion as "rep0len1 cannot occur on the symbol after a match". (and in fact LZMA gets that right and doesn't try to code the rep0len1 bit after a match).
rep0len1 is not a win on text, but it's a decent little win on binary structured data (see example at bottom of this post ). It lets you get things like the top byte of a series of floats (off=4 len1 match, then 3 literals).
The thing is, if you're doing the rep0len1 coding, then normal literals also cannot be the rep0len1 symbol. If they were, then you would just code them with the rep0len1 flag.
So *every* literal should be coded with rep0 exclusion. Not just the literal after a match. And in fact the normal 8-bit literal coding path without exclusion is never used.
To be concrete, coding a literal in LZMA should look like this :
cur_lit = buffer[ pos ];
rep0_lit = buffer[ pos - rep0_offset ];
if ( after match )
{
    // LAM exclusion means cur_lit should never be = rep0_lit
    ASSERT( cur_lit != rep0_lit );
}
else
{
    if ( cur_lit == rep0_lit )
    {
        // lit is rep0, send rep0len1 :
        ... encode rep0len1 flag ...
        // do not drop through
        return;
    }
}
// either way, we now have exclusion :
ASSERT( cur_lit != rep0_lit );
encode_with_exclude( cur_lit, rep0_lit );
and that provides a pretty solid win.  Of all the things I did to try to beat LZMA, this was the only clear winner.
ADDENDUM : some notes on this.
Note that the LZMA binary-exclude coder is *not* just doing exclusion. It's also using the exclude symbol as modelling context. Pure exclusion would just take the probability of the excluded symbol and distribute it to the other symbols, in proportion to their probability.
It turns out that the rep0 literal is an excellent predictor, even beyond just exclusion.
That is, say you're doing normal 8-bit literal coding with no exclusion. You are allowed to use an 8-bit literal as context. You can either use the order-1 literal (that's buffer[pos-1]) or the rep0 literal (that's buffer[pos-rep0_offset]).
It's better to use the rep0 literal!
Of course the rep0 literal becomes a weaker predictor as you get away from the end of the match. It's very good on the literal right after a match (lam exclusion), and still very good on the next literal, and then steadily weaker.
It turns out the transition point is 4-6 literals away from the end of the match; that's the point at which the o1 symbol becomes more correlated to the current symbol than the rep0 lit.
One of the ideas that I had for Oodle LZA was to remove the rep0len1 flag completely and instead get the same effect from context modeling. You can instead take the rep0 lit and use it as an 8-bit context for literal coding, and should get the same benefit. (the coding of the match flag is implicit in the probability model).
I believe the reason I couldn't find a win there is because it turns out that LZ literal coding needs to adapt very fast. You want very few context bits, you want super fast adaptation of the top bits. Part of the issue is that you don't see LZ literals very often; there are big gaps where you had matches, so you aren't getting as much data to adapt to the file. But I can't entirely explain it.
You can intuitively understand why the rep0 literal is such a strong predictor even when it doesn't match.
You've taken a string from earlier in the file, and blacked out one symbol.  You're told what the symbol
was before, and you're told that in the new string it is not that symbol.  It's something like :
"Ther" matched before
'r' is to be substituted
"The?"
What is ? , given that it was 'r' but isn't 'r' here.
Given only the o1 symbol ('e') and the substituted symbol ('r'), you can make a very good guess of what
should be there ('n' probably, maybe 'm', or 's', ...).  Obviously more context would help, but with
limited information, the substituted symbol (rep0 lit) sort of gives you a hint about the whole area.
An ever simpler case is given just the fact that rep0lit is upper or lower case - you're likely to substitute it with a character of the same case. Similarly if it's a vowel or consonant, you're likely to substitute with one of the same. etc. and of course I'm just using English text because it's intuitive, it works just as well on binary structured data.
There's another very small flaw in LZMA's exclude coder, which is more of a minor detail, but I'll include it here.
The LZMA binary exclude coder is equivalent to this clearer version that I posted last time :
void BinaryArithCodeWithExclude( ArithEncoder * arith, int val, int exclude )
{
    int ctx = 1; // place holder top bit
    // first loop in the "matched" part of the tree :
    for(;;)
    {
        int exclude_bit = (exclude >> 7) & 1; exclude <<= 1;
        int bit = (val >> 7) & 1; val <<= 1;
        ASSERT( ctx < 256 );
        m_bins[256 + ctx + (exclude_bit<<8)].encode(arith,bit);
        ctx = (ctx<<1) | bit;
        if ( ctx >= 256 )
            return;
        if ( bit != exclude_bit )
            break;
    }
    
    // then finish bits that are unmatched :
    // non-matched
    do
    {
        int bit = (val >> 7) & 1; val <<= 1;
        m_bins[ctx].encode(arith,bit);
        ctx = (ctx<<1) | bit;
    }
    while( ctx < 256 );
}
This codes up to 8 bits while the bits of "val" match the bits of "exclude" , and up to 8 bits while the bits of "val" don't match.
Now, obviously the very first bit can never be coded in the unmatched phase. So we could eliminate that from the unmatched bins. But that only saves us one slot.
(and actually I'm already wasting a slot intentionally; the "ctx" with place value bit like this is always >= 1 , so you should be indexing at "ctx-1" if you want a tight packed array. I intentionally don't do that, so that I have 256 bins instead of 255, because it makes the addressing work with "<<8" instead of "*255")
More importantly, in the "matched" phase, you don't actually need to code all 8 bits. If you code 7 bits, then you know that "val" and "exclude" match in all top 7 bits, so it must be that val == exclude^1. That is, it's just one bit flip away; the decoder will also know that so you can just not send it.
The fixed encoder is :
void BinaryArithCodeWithExclude( ArithEncoder * arith, int val, int exclude )
{
    int ctx = 1; // place holder top bit
    // first loop in the "matched" part of the tree :
    for(;;)
    {
        int exclude_bit = (exclude >> 7) & 1; exclude <<= 1;
        int bit = (val >> 7) & 1; val <<= 1;
        ASSERT( ctx < 128 );
        m_bins[256 + ctx + (exclude_bit<<7)].encode(arith,bit);
        ctx = (ctx<<1) | bit;
        if ( bit != exclude_bit )
            break;
        if ( ctx >= 128 )
        {
            // I've coded 7 bits
            // and they all match
            // no need to code the last one
            return;
        }
    }
    
    // then finish bits that are unmatched :
    // non-matched
    do
    {
        int bit = (val >> 7) & 1; val <<= 1;
        m_bins[ctx].encode(arith,bit);
        ctx = (ctx<<1) | bit;
    }
    while( ctx < 256 );
}
Note that now ctx in the matched phase can only reach 128.  That means this coder actually only needs 2*256 bins, not 3*256 bins
as stated last time.
This is a little speed savings (tiny because we only get one less arith coding event on a rare path), a little compression savings (tiny because that bottom bit models very well), and a little memory use savings.
 
 
3 comments:
"Of all the things I did to try to beat LZMA, this was the only clear winner."
So how much did it help?
A lot (*). And best of all, with almost no speed penalty (and a savings of memory use).
For example :
LZMA max : 9344463
Oodle LZA1 : 9268615
Oodle LZA2 : 9188185
The difference from LZA1 to LZMA is like a month of hard tweakage, and largely comes from my better string matcher and optimal parser, which has the disadvantage of being slower to encode.
The difference from LZA2 to LZA1 is entirely from rep0 exclusion and is almost free.
(* = a lot is relative to what is possible in this domain; in absolute terms it's only about 1% which to non-compression-people doesn't seem like a lot)
Added some notes to the post.
Post a Comment