10/05/2008

10-05-08 : Rant on New Arithmetic Coders

Rant on New Arithmetic Coders

Can we fucking stop calling arithmetic coding "range coding" !? It's fucking arithmetic coding. "Range coding" differs from the ancient CACM arithmetic coder in exactly two ways :

1. It uses a "virtual queue" for the carry propagation, exactly as I described in my VirtQ paper . (warning : ancient!)

2. It introduces an approximation to the range update which lets it use more bits by changing the order of multiplication :

CACM :

range = range_high - range_low + 1;
range_high = range_low + (symhigh * range) / symtot - 1;
range_low += (symlow * range) / symtot;

"range coder" :

range = range_high - range_low;
range_high = range_low + symhigh * (range / symtot);
range_low += symlow * (range / symtot);

This small rearrangement of the math does in fact make a huge difference, but dude look, it's literally just moving where you put the parentheses.

Of course it means you save a divide, and it means range can go up to 31 bits since you never multiply by it. Getting range up to 31 bits means you can afford to shuffle out whole bytes at a time. In CACM the multiplicataion has to fit in 32 bits so your range can only be 16 bits so outputting bytes is a no go; range stays between 15 and 16 bits all the time in CACM while a "range coder" can go down to 23 bits; when it's down at 23 bits you only have 9 bits of coding precision because of the approximation of doing the divide before the multiply.

I'm sick of all these web pages that are like "LZMA uses a statistical algorithm called range coding". No, it uses a fucking entropy coder called arithmetic coding.

Okay, this was just an angry rant, but it reminds me of something interesting. I recently stumbled on some new arithmetic coders and they blew my mind for a minute until I figured out how they work. (BTW For reference, the canonical "range coder" is by Michael Schindler )

(BTW #2 it's sort of curious to me historically that I missed the range coder; I was so close to inventing it myself, but somehow got right next to it and didn't see it. All the pieces of the range coder were thought of previously by various people, including the original CACM guys, such as trying to use 32 bits, shifting out bytes, approximating the division, and the virtual queue, but on their own each of those ideas doesn't quite work or make a big difference, nobody put those ideas together and realized that together they make this "range coder" thing that's a big improvement).

The new arithmetic coders I've discovered are the fpaq0p binary range coder and "rc" the "russian range coder". FPAQ0p is by Ilia Muraviev, who's done lots of good stuff; you can find it on Matt Mahoney's PAQ page . The "russian range coder" used to have a web page but it appears to have disappeared. I found the code in MRP (Minimum Rate Predictors, a decent lossless image coder).

The "fpaq0p" arithmetic encoder is brutally simple. It's what you would write if you just heard of arithmetic encoding; in fact it was my first idea for a coder ever, but it didn't work, so everybody dropped it. Well, it's back. If you remember the super basic idea of an arithmetic coder is you're specifying a code stream by writing an infinite precision number; at any time you know your desired number is in some interval, low and high, you shrink the interval down to specify the symbol, and to do it in finite precision you shift out top bits of low & high as they become the same. (See my ancient write up of basic arithcoding )

So what would you write?

range_low = 0;
range_high = 0xFFFFFFFF; // 32 bits on

range = range_high - range_low;
range_high = range_low + symhigh * (range / symtot);
range_low += symlow * (range / symtot);

while( (range_low>>24) == (range_high>>24) )
{
 *outPtr+= = (range_low>>24);
 range_low <<= 8;
 range_high <<= 8;
 range_high += 0xFF;
}

Okay, good job, but this is badly broken. First of all, you're overflowing your 32 bit code. Second of all, it doesn't handle carry propagation, and it doesn't handle underflow. The problem with carries is that range can step up to the next bit and push a bit over code size. The problem with underflow is that we aren't handling the case of the range getting very very small but while the top bytes are different.

In particular, when range_high comes down close to range_low from above and range_low comes up close to range_high from below, but the top bytes are different, eg, they're like 0x800000AA and 0x7FFFFFBB , the top bytes are different so they don't get shifted out, but the range can get arbitrarily small. (in particular the top byte of "low" is X, the top byte of "hi" is X+1, the second byte of low is 0xFF and the second byte of hi is 0x00).

But fpaq0p works, so how does it do it?

The main trick is to only do binary coding. Then you just don't worry about underflow. With the underflow problem, your range can get as small as only 1 , with 0x01000000 and 0x00FFFFFF, but you implicitly add 1 to the top, so you still have enough range to code 2 symbols, which is what you need for your binary alphabet to be decodable. Any alphabet larger than binary is not codable there.

The other minor trick is just about where you do your "+1". The high is always implicitly +1 above what you actually store in range_high. We started it with 0xFFFFFFFF but we want an implicit FFFFF... repeating to infinity after it. That's why we shift in FF, but we also have to do it right in the range update in a way that doesn't blow the 32 bits or ever generate a carry.

So the working fpaq0p coder is :


range_low = 0;
range_high = 0xFFFFFFFF; // 32 bits on

range = range_high - range_low;
range_mid = range_low + symp0 * (range / symtot);
if ( bit )
 range_low = range_mid+1;
else
 range_high = range_mid;

while( (range_low>>24) == (range_high>>24) )
{
 *outPtr+= = (range_low>>24);
 range_low <<= 8;
 range_high <<= 8;
 range_high += 0xFF;
}

Which looks like it's doing the same trivial thing, but is subtly different.

Note that in the horrible underflow case your coding precision can get as bad as only 1 bit (that is, the probabilities are forced to 50/50 regardless of what they really are). Obviously that hurts coding, but it's actually not that bad, because it's statistically incredibly rare. In fact it happens something like 2^-24 of the time, since all spots on the code range are equally likely. And even when it does happen you only lose a few bits of coding efficiency.

BTW :


To do this faster :

 while( (range_low>>24) == (range_high>>24) )

You can do :

 while( ((x1 ^ x2)>>24) == 0 )

or

 while( ((x1 ^ x2)&0xFF000000) == 0 )

or

 while( (x1 ^ x2) < (1<<24) )

but in practice it doesn't really matter because you're stalled out on branches anyway.

Also in a binary arithmetic coder you never actually divide by symtot. You use symtot = some fixed power of 2, and you do a probability update that keeps tot the same. My code's got this old rung/ladder thing that does different styles of adaptation with a power of 2 total. Or you can just do the super simple thing that everybody seems to do these days :

#define symbits  (12) // or whatever < 14
#define symtot  (1<< symbits)
#define updshift (6) // or anything in {1,symbits-1}

void update(int & p0,bool bit)
{

 if ( bit )
 {
  p0 -= p0 >> updshift;
 }
 else
 {
  p0 += (symtot - p0) >> updshift;
 }
}

Here "updshift" controls how fast your adaptation is vs. how precise you become in the long term. This update makes no distinction between the initial rollup when you have no stats and the adapted state, which the rung/ladder does. This is an approximation of a normal n0/n1 counter with an implicit max count of (1 << updshift). Compare to an n0/n1 update :


void update(int & p0,bool bit)
{
 int n0 = p0;
 int n1 = symtot - p0;
 int increment = symtot >> updshift;
 if ( bit )
 {
  n1 += increment;
 }
 else
 {
  n0 += increment;
 }
 p0 = (n0 * symtot) / (n0 + n1);
}

We can see the equivalence and approximation by looking at the bit on case :


 I'll use := to mean assignment while I do maths :

 p0 := (n0 * symtot) / (n0 + n1);
 p0 := (p0 * symtot) / (symtot + incr);
 p0 := (p0 * symtot)*(symtot - incr) / (symtot + incr)*(symtot - incr);
  now approximate incr^2 is much less than symtot^2
 p0 := (p0 * symtot)*(symtot - incr) / (symtot^2);
 p0 := (p0)*(symtot - incr) / (symtot);
 p0 := p0 - (p0*incr)/symtot);
 p0 -= p0/(symtot/incr);
 p0 -= p0>>updshift;

BTW this approximation goes back to the ancient stuff by Howard & Vitter on fast arithmetic coding. Of course in practice you also want to merge your modelling with arithmetic coding since they do the same random branch on bit.

Most people these days are good at designing their compressor to drive the binary arithmetic coder well. A well driven binary arithmetic coder is generally coding probabilities close to 50%. That's better because it keeps precision high and keeps you in the accurate part of the coarse probability update approximation. It's also better because it means you're calling the coder many fewer times. If you only call the arithmetic coder with probablities close to 50% it means you're only doing coding roughly as many times as the number of bits you output, which is low. You accomplish this by coding symbols that are "natural" for the problem. For example, for LZ77 offsets you code in the log2 of the offset (similarly with BWT MTF codes). etc. etc.

So I also found this range coder implementation called the "Russian Range Coder". It's extremely similar to the standard Michael Schindler coder except the renorm & bit output loop is slightly different :

"Russian Range Coder" :

    while((rc->low ^ (rc->low + rc->range)) < (1<<24))
    {
  *rc->ptr++ = (U8) (rc->low >> 24);
  rc->range <<= 8;
  rc->low <<= 8;
    }
    
    while(rc->range < (1 << 16))
    {
  *rc->ptr++ = (U8) (rc->low >> 24);
  rc->range = ((~rc->low) & 0xFFFF) << 8;
  rc->low <<= 8;
    }

So what's going on here? The first part we can recognize is exactly the simple loop from fpaq0p that just outputs the top byte when it's the same. But if you just do that you're fucked for multisymbol alphabets because of underflow, so looky here, the second part handles that. It checks if the range is getting too small to resolve symbols. But normally in the virtual queue / schindler coder that would mean setting up a carry queue and such. This coder avoids that. It does it by just giving up a huge chunk of the range.

The bad case occurs when the range is small and the top bytes are off by 1. The second bytes are 0x00 and 0xFF. With a virtual queue, you would put the top byte of "low" in the virtual queue spot, and count the number of 0xFF bytes but don't write them yet. If you get a carry, you propagate the carry by writing the top of low+1, then implicitly flipping the 0xFF's to 0x00's and writing them. This is the virtual Q / schindler way.

Here we don't check for underflow and do the carry and such, we just smash down the "high" to the "low". We know the top bytes are not equal, so the next bytes must be 00 and FF, eg. we have something like 0x0B00XXXX and 0x0AFFXXXX , we just smash the high down to 0x0AFFFFFF. This bit :


 rc->range = ((~rc->low) & 0xFFFF) << 8;
 rc->low <<= 8;

is the same as :

 high = low | 0xFFFF;
 range = high - low;
 range <<= 8;
 low <<= 8;

which I think makes it clearer what's happening. You just chop down to make your top bytes the same, eliminating the bad straddle situation which caused underflow. Now you can renormalize the simple way by shifting out equal top bytes and get a big enough range back.

Now obviously this is an approximation that loses some efficiency, you're throwing away on average half your range in the cases that you have this bad underflow, but again it's pretty rare. I don't know the exact probabilities, but the measured efficiency is very close to the full coder.

In practice, however, this coder is not any faster than the full correct schindler coder, so there's no point to using it. It is however, very simple and elegant once you see how it works.

ps. the other new thing is the Jarek Duda lattice coder thing which I have yet to understand. It seems to offer no advantages and many disadvantages, but I would like to grok it for my own brain's happiness.

10/01/2008

10-01-08 : First Look at LZMA

So I looked into LZMA / 7-Zip a bit. So far as I know there's no real description of the algorithm available, but there is source code here . This is from just browsing the source code for a bit, so take it with a bit of skepticism.

LZMA is an arithmetic coded LZ + context coder. It has a couple of clever ideas that make it perform very well. In fact it compresses better than Quantum and is also faster (at compression - Quantum is faster at decompression, in fact the only disadvantage of LZMA is that its decomp is not super fast).

The match finder is very fast. I can't figure out what it's doing exactly. It makes strong hashes using the CRC. The docs claim it then goes to a patricia trie, but my god I can't see that in the code.

It does lots of binary arithcoding. In fact I don't think it has a multisymbol arithcoder at all, it just breaks larger alphabets into a series of binary codes. He does a pretty good job of breaking up the binary tree so that the common cases of short lengths and low offsets get written in very few binary coding ops (that is, the coding choices of 0 or 1 are close to equal probability).

There are some clever models. In general it's well tweaked to use few contexts but capture the salient information in those contexts. For example, the order-1 literal model only uses the top N bits of the previous char as context, since for most data the top bits are the good predictors and the bottom bits are semi-random. One clever model is using the bottom 2 bits of position as part of the literal context. This is great for capturing the very common pattern on binary data of a 0 every fourth byte, or any every-2 or every-4 bytes kind of patterns (such as when you write a bunch of 32 bit floats and the exponent part is the same on all of them).

It also does a semi-optimal parse. From what I can tell it looks like it finds the match possibilities N steps ahead, and then does a kind of semi-backward cost evaluation from the end of that sequence to pick the best choice at the current coding spot. I believe N is only 4, so this is sort of like the old "lazy matching" but it looks ahead 4, and rather than a rough heuristic like most people use in lazy matching he actually evaluates a coding cost like a true optimal parse.

On text it's very good, but on binary it really kicks ass, I'm guessing the biggest win is because of the pos-conditioned contexts which are really good for binary.

9/27/2008

09-27-08 : On LZ and ACB


09-27-08

On LZ and ACB

LZ77 coders are hugely inefficient. Everybody knows the basic idea of LZ77. You compress data using back references to previous data. You either write a match, consisting of an offset & length, or a literal, consisting of one raw character.

The inefficiencies are intuitively obvious. If the coding was perfect, then if you were the decoder, the encoded stream should be totally random, you shouldn't be able to predict the code stream, but of course you can. One way is with context modeling. If you see a context like "what the ", then it's highly likely the next sequence is "fuck" so offsets which point to that are far more likely. Similarly, certain offsets point at certain words, and there are natural likely match lengths for those as well.

Someday soon I'm going to write about how I'm doing "Optimal Parsing". But the very fact that you can do optimal parsing is a sign of the flaw of LZ77. The encoder has many choices of different ways to write the code stream, which means there is massive redundancy in the code space. There are many different LZ77 code streams which all correspond to the same input. The optimal parser tries to pick the best one of these, but it doesn't change the fact that all the ways to encode that you did not choose are just wasted probability space in the coded bits.

Let's look at how you might reduce the redundancy in LZ77 if you for some reason had to do LZ coding and wanted to go as far with it as you can :

First of all, whether or not a match will be found is highly predictable. Most high compression LZ coders do a lot of context modeling for the match flag, because you can tell by looking at the local neighborhood whether it's an area that's likely to match. There are also finite-state patterns that arise in many types of data (particularly binary data), where it may be very likely that the data goes match-literal-match-literal , or mat-lit-lit-mat-lit-lit-mat , or whatever. Apparently LZMA does this.

Modeling the redundancy in the offset and the length are more interesting and difficult. The first thing we see is that they are not really two separate parameters, they are highly correlated, but not in a trivial way. For example, say we simply find the longest possible match and send the {offset,length} in the normal LZ77 way. There were very similar shorter matches that we passed up. When we send just the {offset} , the decoder knows what the match string we want looks like, and it knows that we passed up all the other matches, so this offset must be the longest one. That means it can mostly infer the match length. The decoder gets the offset, that specifies the match string, then the decoder finds the other string in the window that has the longest prefix length with the match string. The actual match must be at least the prefix length + 1. You can then either just output that match, or you could send the amount that the lengths exceeds that.

Let me show an example for clarity :

T A B C X X A B C D T T * A B C D ...

current code pointer is at (*)

encoder send match offset "-6"

decoder sees the match string "ABCDTT"

it looks for the longest other matching prefix and finds "ABCXX" with shared length 3

therefore we must match of at least length 4

decoder outputs "ABCD"
(then possibly also an additional match length starting at 0)

Now, we're still sending the offsets raw, but of course the offset is highly predictable too. Given the current context (or whatever type of model you want to use), you're much more likely to match certain offsets than others. You shouldn't just code the offset as a backward index into the window, you should assign a probability to each. The "ROLZ" and "PMR" LZ coders of today do this to some extent.

(aside : ROLZ = reduced offset LZ; basically means only code matches that share some amount of context, perhaps also MTF the offsets within the contexts; you code many fewer matches, but that's a good thing because ROLZ coders generally use a strong context model for literals; in fact if you use a powerful PPM for literals then the fewer matches you code the better your compression is. PMR = previous match reference, just codes references to previous matches in fewer bits because they are more likely, encoder also favors choosing PMR's over other offsets).

We can code an arbitrary offset with some modeling by reordering them. Make offset 0 be the most likely, up to offset N the least likely. One way to do this is to order by context match length. Say the longest context match is L symbols, then all the offsets that match with a context of length L are first, then all the offsets that match at length (L-1), down to length (0). Within each context, order so that more recently seen & used offsets are lower. So to code an offset, you still find the longest match, then you see where that offset is in the list ordered by likelihood, and output that index. (alternatively you could just context code the offset in the normal context way).

This is very closely related to LZP, Block Sorting, and Fenwick's Symbol Ranking stuff ( for example ).

We can do something similar in a slightly different way by sorting. Consider all previous possible match spots. Sort them by their context string, backwards, like for PPM. Now to match your current string, find the longest match with your forward string. This is the offset you want to send. To send it, find your current context in the sorted list - you'll find the longest match (actually you'll be between two strings in the list). Send the difference between your position in the list and the position of the longest match. Hopefully this delta is usually small, because the best match usually comes from a context similar to yours. The length of match is like before - the offset implicility send the length, you send the additional length after the implicit.

Obviously our goal has been to make the offset and match length both very close to zero. I mentioned block sort and symbol ranking before; the statistics of the codes we're generating here should look very much like those statistics.

Well, holy crap, this is exactly ACB !! (ACB = Associative Coding by Buyanovsky). I found a paper on ACB by Tomas Skopal or you can read this or this on my site.

ACB was never given much attention for a few reasons. 1) George's english is awful and he could never write a decent paper on it, so it never got into academia, his terminology is all non-standard and very confusing, 2) it's way the hell off the optimal curve in terms of speed vs. compression. It's slower than PPM and can't compress as well. It is however, interesting to me in a theoretical sense, because it's sort of like as good as you can do with dictionary compression, and as we see it's extremely similar to LZP, block sorting, and symbol ranking. In fact all of these coders wind up sucking in the same way, because by simply turning your correlation knowledge into an index with low mean, you're giving away a lot of good local information that a plain context modeller could do more with.

old rants