cbloom rants: 06-12-14 - Some LZMA Notes

I've been working on an LZ-Arith for Oodle, and of course the benchmark to beat is LZMA, so I've had a look at a few things.

Some previous posts related to things I'll discuss today :

cbloom rants 09-27-08 - On LZ and ACB
cbloom rants 10-01-08 - First Look at LZMA
cbloom rants 10-05-08 - 5 Rant on New Arithmetic Coders
cbloom rants 08-20-10 - Deobfuscating LZMA
cbloom rants 09-03-10 - LZ and Exclusions

Some non-trivial things I have noticed :

1. The standard way of coding literals with a binary arithmetic coder has a subtle quirk to it.

LZMA uses the now standard fractional update method for binary probability modeling. That's p0 -= (p0 >> updshift) and so on. See for example : 10-05-08 - 5 : Rant on New Arithmetic Coders .

The fractional update method is an approximation of a standard {num0,num1} binary model in which you are kept right at the renormalization threshold. That is, a counting model does :

P0 = num0 / (num0+num1);
... do coding ...
if ( bit ) num1++;
else num0++;
if ( (num0+num1) > renorm_threshold )
{
  // scale down somehow; traditionally num0 >>= 1;
}

The fractional shift method is equivalent to :

num0 = P0;
num1 = (1<<frac_tot) - P0;
if ( bit ) num1++;
else num0++;

// num0+num1 is now ((1<<frac_tot)+1); rescale :
P0 = num0 * (1<<frac_tot)/((1<<frac_tot)+1);

That is, it assumes you're right at renormalization threshold and keeps you there.

The important thing about this is adaptation speed.

A traditional {num0,num1} model adapts very quickly at first. Each observed bit causes a big change to P0 because total is small. As total gets larger, it becomes more stable, it has more inertia and adapts more slowly. The renorm_threshold sets a minimum adaptation speed; that is, it prevents the model from becoming too full of old data and too slow to respond to new data.

Okay, that's all background. Now let's look at coding literals.

The standard way to code an N bit literal using a binary arithmetic coder is to code each bit one by one, either top down or bottom up, and use the previous coded bits as context, so that each subtree of the binary tree gets its own probability models. Something like :


ctx = 1;
while( ctx < 256 ) // 8 codings
{
    int bit = (val >> 7)&1; // get top bit
    val <<= 1; // slide val for next coding
    BinaryCode( bit, p0[ctx-1] );
    // put bit in ctx for next event
    ctx = (ctx<<1) | bit;
}

Okay.

Now first of all there is a common misconception that binary coding is somehow different than N-ary arithmetic coding, or that it will work better on "binary data" that is somehow organized "bitwise" vs text-like data. That is not strictly true.

If we use a pure counting model for our N-ary code and our binary code, and we have not reached the renormalization threshold, then they are in fact *identical*. Exactly identical.

For example, say we're coding two-bit literals :


The initial counts are :

0: 3
1: 1
2: 5
3: 4
total = 13

we code a 2 with probability 5/13 in log2(13/5) bits = 1.37851
and its count becomes 6

With binary modeling the counts are :

no ctx
0: 4
1: 9

ctx=0
0: 3
1: 1

ctx=1
0: 5
1: 4

to code a "2"
we first code a 1 bit with no context
with probability 9/13 in log2(13/9) bits = 0.53051
and the counts become {4,10}

then we code a 0 bit with a 1 context
with probability 5/9 in log2(9/5) bits = 0.84800
and the counts become {6,4}

And of course 1.37851 = 0.53051 + 0.84800

The coding is exactly the same. (and furthermore, binary coding top down or bottom up is also exactly the same).

However, there is a difference, and this is where the quirk of LZMA comes in. Once you start hitting the renormalization threshold, so that the adaptation speed is clamped, they do behave differently.

In a binary model, you will see many more events at the top bit. The exact number depends on how spread your statistics are. If all 256 symbols are equally likely, then the top bit is coded 128X more often than the bottom bits (and each of the next bits is coded 64X, etc.). If only one symbol actually occurs then all the bit levels will be coded the same number of times. In practice it's somewhere in between.

If you were trying to match the normal N-ary counting model, then the binary model should have much *slower* adaptation for the top bit than it does for the bottom bit. With a "fractional shift" binary arithmetic coder that would mean using a different "update shift".

But LZMA, like most code I've seen that implements this kind of binary coding of literals, does not use different adaptation rates for each bit level. Instead they just blindly use the same binary coder for each bit level.

This is wrong, but it turns out to be right. I tested a bunch of variations and found that the LZMA way is best on my test set. It seems that having much faster adaptation of the top bits is a good thing.

Note that this is a consequence of using unequal contexts for the different bit levels. The top bit has 0 bits of context, while the bottom bit has 7 bits of context, which means its statistics are diluted 128X (or less). If you do an order-1 literal coder this way, the top bit has 8 bits of context while the bottom bit gets 15 bits.

2. The LZMA literal-after-match coding is just an exclude

I wrote before (here : cbloom rants 08-20-10 - Deobfuscating LZMA ) about the "funny xor thing" in the literal-after-match coder. Turns out I was wrong, it's not really funny at all.

In LZ coding, there's a very powerful exclusion that can be applied. If you always output matches of the maximum length (more on this later), then you know that the next symbol cannot be the one that followed in the match. Eg. if you just copied a match from "what" but only took 3 symbols, then you know the next symbol cannot be "t", since you would have just done a length-4 match in that case.

This is a particularly good exclusion because the symbol that followed in the match is what you would predict to be the most probable symbol at that spot!

That is, say you need to predict the MPS (most probable symbol) at any spot in the file. Well, what you do is look at the preceding context of symbols and find the longest previous match of the context, and take the symbol that follows that context. This is "PPM*" essentially.

So when you code a literal after a match in LZ, you really want to do exclusion of the last-match predicted symbol. In a normal N-ary arithmetic coder, you would simply set the count of that symbol to 0. But it's not so simple with the binary arithmetic coder.

With a binary arithmetic coder, let's say you have the same top 7 bits as the exclude symbol. Well then, you know exactly what your bottom bit must be without doing any coding at all - it must be the bit that doesn't match the exclude symbol. At the next bit level above that, you can't strictly exclude, but you can probabilistically, exclude. That is :


Working backwards from the bottom :

At bit level 0 :

if symbol top 7 bits == exclude top 7 bits
then full exclude

that is, probability of current bit == exclude bit is zero

At bit level 1 :

if symbol top 6 bits == exclude top 6 bits
then

if symbol current bit matches exclude current bit, I will get full exclusion in the next level
so chance of that path is reduced but not zero

the other binary path is unaffected

that is, we're currently coding to decide between 4 symbols.  Something like :

0 : {A,B}
1 : {C,D}

we should have P0 = (PA+PB)/(PA+PB+PC+PD)

but we exclude one; let's say B, so instead we want to code with P0 = PA/(PA+PC+PD)

etc..

That is, the exclude is strongest at the bottom bit level, and becomes less strong as you go back up to higher bit levels, because there are more symbols on each branch than just the exclude symbol.

The LZMA implementation of this is :


  static void LitEnc_EncodeMatched(CRangeEnc *p, CLzmaProb *probs, UInt32 symbol, UInt32 matchByte)
  {
    UInt32 offs = 0x100;
    symbol |= 0x100;
    do
    {
      matchByte <<= 1;
      RangeEnc_EncodeBit(p, probs + (offs + (matchByte & offs) + (symbol >> 8)), (symbol >> 7) & 1);
      symbol <<= 1;
      offs &= ~(matchByte ^ symbol);
    }
    while (symbol < 0x10000);
  }

I rewrote it to understand it; maybe this is clearer :

void BinaryArithCodeWithExclude( ArithEncoder * arith, int val, int exclude )
{
    // same thing but maybe clearer :
    bool matched = true;        
    val |= 0x100; // place holder top bit

    for(int i=0;i<8;i++) // 8 bit literal
    {
        int exclude_bit = (exclude >> (7-i)) & 1;
        int bit = (val >> (7-i)) & 1;

        int context = val >> (8-i);
        if ( matched )
            context += exclude_bit?512:256;

        m_probs[context].encode(arith,bit);

        if ( bit != exclude_bit )
            matched = false;
    }
}

We're tracking a running flag ("matched" or "offs") which tells us if we are on the same path of the binary tree as the exclude symbol. That is, do all prior bits match. If so, that steps us into another group of contexts, and we add the current bit from the exclude symbol to our context.

Now of course "matched" always starts true, and only turns to false once, and then stays false. So we can instead implement this as two loops with a break :


void BinaryArithCodeWithExclude( ArithEncoder * arith, int val, int exclude )
{
    int ctx = 1; // place holder top bit

    // first loop in the "matched" part of the tree :
    for(;;)
    {
        int exclude_bit = (exclude >> 7) & 1; exclude <<= 1;
        int bit = (val >> 7) & 1; val <<= 1;
        m_bins[256 + ctx + (exclude_bit<<8)].encode(arith,bit);
        ctx = (ctx<<1) | bit;
        if ( ctx >= 256 )
            return;
        if ( bit != exclude_bit )
            break;
    }
    
    // then finish bits that are unmatched :
    // non-matched
    do
    {
        int bit = (val >> 7) & 1; val <<= 1;
        m_bins[ctx].encode(arith,bit);
        ctx = (ctx<<1) | bit;
    }
    while( ctx < 256 );
}

It's actually not weird at all, it's just the way to do symbol exclusion with a binary coder.

ADDENDUM : maybe I'm going to0 far saying it's not weird. It is a bit weird, sort of like point 1, it's actually not right, but in a good way.

The thing that's weird is that when coding the top bits, it's only using the bits seen so far of the exclude symbol. If you wanted to do a correct probability exclusion, you need *all* the bits of the exclude symbol, so that you can see exactly what symbol it is, how much probability it contributes to that side of the binary tree.

The LZMA way appears to work significantly better than doing the full exclude.

That is, it's discarding some bits of the exclude as context, and that seems to help due to some issue with sparsity and adaptation rates. The LZMA uses 3*256 binary probabilities, while full exclusion uses 9*256. (though in both cases, not all probs are actually used; eg. the first bit is always coded from the "matched" probs, not the "un-matched").

ADDENDUM2 : Let me say it again perhaps clearer.

The way to code a full exclude using binary modeling is :


coding "val" with exclusion of "exclude"

while bits of val coded so far match bits of exclude coded so far :
{
  N bits coded so far
  use 8 bits of exclude as context
  code current bit of val
  if current bit of val != same bit of exclude
    break;
}

while there are bits left to code in val
{
  N bits coded so far
  use N bits of val as context
  code current bit of val
}

The LZMA way is :

coding "val" with exclusion of "exclude"

while bits of val coded so far match bits of exclude coded so far :
{
  N bits coded so far
  use N+1 bits of exclude as context   // <- only difference is here
  code current bit of val
  if current bit of val != same bit of exclude
    break;
}

while there are bits left to code in val
{
  N bits coded so far
  use N bits of val as context
  code current bit of val
}

I also tried intermediate schemes like using N+2 bits of exclude (past bits+current bit+one lower bit) which should help a little to identify the exclusion probability without diluting statistics too much - they all hurt.

3. Optimal parsing and exclusions are either/or and equal

There are two major options for coding LZ-arith :

I. Do literal-after-match exclusion and always output the longest match. Use a very simplified optimal parser that only considers literal vs match (and a few other things). Essentially just a fancier lazy parse (sometimes called a "flexible parse").

II. Do not do literal-after-match exclusion, and consider many match lengths in an optimal parser.

It turns out that these give almost identical compression.

Case II has the simpler code stream because it doesn't require the literal-after-match special coder, but it's much much slower to encode at high compression because the optimal parser has to work much harder.

I've seen this same principle many times and it always sort of intrigues me. Either you can make a code format that explicitly avoids redundancy, or you can exploit that redundancy by writing an encoder that aggressively searches the coding space.

In this case the coding of exclude-after-match is quite simple, so it's definitely preferable to do that and not have to do the expensive optimal parse.

4. LZMA is very Pareto

I can't really find any modification to it that's a clear win. Obviously you can replace the statistical coders with either something faster (ANS) or something that gives more compression (CM) and you can move the space/speed tradeoff, but not in a clearly beneficial way.

That is, on the compression_ratio / speed / memory_use three-way tradeoff, if you hold any two of those constant, there's no improvement to be had in the other.

.. except for one flaw, which we'll talk about in the next post.

cbloom rants

6/12/2014

06-12-14 - Some LZMA Notes

No comments:

old rants