# cbloom rants

## Huffman to Arithmetic coder transformation

This is something well known by "practitioners of the art", but I've never seen it laid out explicitly, so here we go. We're talking about arbitrary-alphabet decoding here (not binary), and mostly static probability models.

Let's start with our Huffman decoder. For simplicity and symmetry, we will use a Huffman decoder that can handle code lengths up to 16, and we will use a table-accelerated decoder. The decoder looks like this :

```
// look at next 16 bits (but don't consume them)
U32 peek = BitInput_Peek(16);

// use peek to look in decode tables :
int sym = huffTable_symbol[peek];

// use symbol to get actual code length :
int bits = symbol_codeLength[sym];

// and remove that code length from the bit stream :
BitInput_Consume(bits);

```
this is very standard (more normally the huffTable would only accelerate the first 8-12 bits of decode, and you would then fall back to some other method for codes longer than that). Let's expand out what Peek and Consume do exactly. For symmetry to the arithcoder I'm going to keep my bit buffer right-aligned in a big-endian word.
```
int bits_bitLen = // # of bits in word
U32 bits_code = // current bits in word

BitInput_Peek(16) :
{
ASSERT( bits_bitLen >= 16 );
U32 ret = bits_code >> (bits_bitLen - 16);
return ret;
}

BitInput_Consume(bits) :
{
bits_bitLen -= bits;
bits_code &= (1 << bits_bitLen)-1;
while ( bits_bitLen < 16 )
{
bits_code <<= 8;
bits_code |= *byteStream++;
bits_bitLen += 8;
}
}
```
it should be obvious what these do; _Peek grabs the top 16 bits of code for you to snoop. Consume removes the top "bits" from code, and then streams in bytes to refill the bits while we are under count. (to repeat again, this is not how you should actually implement bit streaming, it's slower than necessary).
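For concreteness, a minimal runnable version of that bit reader might look like this. This is a sketch; the struct layout and names like `BitInput_Refill` are mine, not from any shipping codebase :

```c
#include <stdint.h>
#include <assert.h>

/* Sketch of the bit reader above : code word right-aligned, big-endian,
   kept at least 16 bits full so Peek(16) is always valid. */
typedef struct {
    uint32_t code;          /* current bits in word, right-aligned */
    int      bitLen;        /* # of valid bits in code */
    const uint8_t *stream;  /* remaining input bytes */
} BitInput;

/* stream in bytes until we have at least 16 bits again */
static void BitInput_Refill(BitInput *bi) {
    while (bi->bitLen < 16) {
        bi->code = (bi->code << 8) | *bi->stream++;
        bi->bitLen += 8;
    }
}

/* look at the next 16 bits without consuming them */
static uint32_t BitInput_Peek16(const BitInput *bi) {
    assert(bi->bitLen >= 16);
    return bi->code >> (bi->bitLen - 16);
}

/* remove the top "bits" bits, then refill */
static void BitInput_Consume(BitInput *bi, int bits) {
    bi->bitLen -= bits;
    bi->code &= (1u << bi->bitLen) - 1;
    BitInput_Refill(bi);
}
```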

Okay, now let's look at an Arithmetic decoder. First let's start with the totally generic case. Arithmetic Decoding consists of getting the probability target, finding what symbol that corresponds to, then removing that symbol's probability range from the stream. This is :

```
AC_range = size of current arithmetic interval
AC_code  = value in range specified

Arithmetic_Peek(cumulativeProbabilityTotal) :
{
r = AC_range / cumulativeProbabilityTotal;
target = AC_code / r;
return target;
}

Arithmetic_Consume(cumulativeProbabilityLow, probability, cumulativeProbabilityTotal)
{
AC_range /= cumulativeProbabilityTotal;
AC_code  -= cumulativeProbabilityLow * AC_range;
AC_range *= probability;

while ( AC_range < minRange )
{
AC_code <<= 8;
AC_range <<= 8;
AC_code |= *byteStream++;
}
}

```
Okay it's not actually obvious that this is a correct arithmetic decoder (the details are quite subtle) but it is; and in fact this is just about the fastest arithmetic decoder in the world (the only thing you would do differently in real code is share the divide by cumulativeProbabilityTotal so it's only done once).

Now, the problem of taking the Peek target and finding what symbol it specifies is actually the slowest part; there are various solutions : Fenwick trees, Deferred Summation, etc. For now we are talking about *static* coding, so we will use a table lookup.

To decode with a table we need a table of size cumulativeProbabilityTotal which maps a probability target in [0,cumulativeProbabilityTotal) to a symbol. So when we get a value from _Peek we look it up in a table to get the symbol, cumulativeProbabilityLow, and probability.
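Filling that table is straightforward; here's an illustrative sketch (the helper name and the cumProb[] layout, numSyms+1 entries starting at 0 and ending at cumulativeProbabilityTotal, are my assumptions) :

```c
#include <stdint.h>

/* Every Peek target in [cumProb[s], cumProb[s+1]) decodes to symbol s,
   so the table just repeats each symbol across its probability range. */
static void BuildArithTable(uint8_t *arithTable_symbol,
                            const uint32_t *cumProb, int numSyms)
{
    for (int s = 0; s < numSyms; s++)
        for (uint32_t t = cumProb[s]; t < cumProb[s + 1]; t++)
            arithTable_symbol[t] = (uint8_t)s;
}
```

Note the table has cumulativeProbabilityTotal entries; for a total of 2^16 that's 64K, which is why in practice you'd only accelerate the top bits this way.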

To speed things up, we can use cumulativeProbabilityTotal = a power of two to turn the divide into a shift. We choose cumulativeProbabilityTotal = 2^16. (the longest code we can write with our arithmetic coder then has code length -log2(1/cumulativeProbabilityTotal) = 16 bits).

So now our static table-based arithmetic decode is :

```
Arithmetic_Peek() :
{
r = AC_range >> 16;
target = AC_code / r;
}

int sym = arithTable_symbol[target];

int cumProbLow  = cumProbTable[sym];
int cumProbHigh = cumProbTable[sym+1];

Arithmetic_Consume()
{
AC_range >>= 16;
AC_code  -= cumProbLow * AC_range;
AC_range *= (cumProbHigh - cumProbLow);

while ( AC_range < minRange )
{
AC_code <<= 8;
AC_range <<= 8;
AC_code |= *byteStream++;
}
}

```
Okay, not bad, and we still allow arbitrary probabilities within [0,cumulativeProbabilityTotal], so this is more general than the Huffman decoder. But we still have a divide, which is very slow. So if we want to get rid of that, we have to constrain a bit more :

Make each symbol probability a power of 2, so (cumProbHigh - cumProbLow) is always a power of 2 (< cumulativeProbabilityTotal). We will then store the log2 of that probability range. Let's do that explicitly :

```
Arithmetic_Peek() :
{
r = AC_range >> 16;
target = AC_code / r;
}

int sym = arithTable_symbol[target];

int cumProbLow  = cumProbTable[sym];
int cumProbLog2 = log2Probability[sym];

Arithmetic_Consume()
{
AC_range >>= 16;
AC_code  -= cumProbLow * AC_range;
AC_range <<= cumProbLog2;

while ( AC_range < minRange )
{
AC_code  <<= 8;
AC_range <<= 8;
AC_code |= *byteStream++;
}
}

```
Now the key thing is that since we only ever >> to shift AC_range down or << to shift it up, if it starts as a power of 2, it stays a power of 2. So we will replace AC_range with its log2 :
```
Arithmetic_Peek() :
{
r = AC_log2Range - 16;
target = AC_code >> r;
}

int sym = arithTable_symbol[target];

int cumProbLow  = cumProbTable[sym];
int cumProbLog2 = log2Probability[sym];

Arithmetic_Consume()
{
AC_code  -= cumProbLow << (AC_log2Range - 16);
AC_log2Range += (cumProbLog2 - 16);

while ( AC_log2Range < min_log2Range )
{
AC_code  <<= 8;
AC_log2Range += 8;
AC_code |= *byteStream++;
}
}

```
we only need a tiny bit more now. First observe that an arithmetic symbol of log2Probability is written in (16 - log2Probability) bits, so let's call that "codeLen". And we'll rename AC_log2Range to AC_bitlen :

```
Arithmetic_Peek() :
{
peek = AC_code >> (AC_bitlen - 16);
}

int sym = arithTable_symbol[peek];

int codeLen = sym_codeLen[sym];
int cumProbLow  = sym_cumProbTable[sym];

Arithmetic_Consume()
{
AC_code   -= cumProbLow << (AC_bitlen - 16);
AC_bitlen -= codeLen;

while ( AC_bitlen < 16 )
{
AC_code  <<= 8;
AC_bitlen += 8;
AC_code |= *byteStream++;
}
}

```
let's compare this to our Huffman decoder (just copying down from the top of the post and reorganizing a bit) :
```
BitInput_Peek() :
{
peek = bits_code >> (bits_bitLen - 16);
}

// use peek to look in decode tables :
int sym = huffTable_symbol[peek];

// use symbol to get actual code length :
int codeLen = sym_codeLen[sym];

BitInput_Consume() :
{
bits_bitLen -= codeLen;
bits_code &= (1 << bits_bitLen)-1;

while ( bits_bitLen < 16 )
{
bits_code <<= 8;
bits_bitLen += 8;
bits_code |= *byteStream++;
}
}

```
you should be able to see the equivalence.
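As a sanity check on the equivalence, the final shift-based decoder can be run on a concrete stream. This is an illustrative specialization (my own framing, not shipping code) to a 3-symbol alphabet with probabilities 32768/16384/16384 out of 2^16, i.e. code lengths 1/2/2; with so few symbols the 64K-entry table is replaced by two compares :

```c
#include <stdint.h>

/* Sketch of the final decoder, specialized to 3 symbols with
   power-of-2 probabilities 32768,16384,16384 (code lengths 1,2,2). */
typedef struct {
    uint32_t code;          /* AC_code */
    int      bitLen;        /* AC_bitlen */
    const uint8_t *stream;  /* remaining input bytes */
} ACDec;

static void AC_Refill(ACDec *ac) {
    while (ac->bitLen < 16) {
        ac->code = (ac->code << 8) | *ac->stream++;
        ac->bitLen += 8;
    }
}

static int AC_DecodeSymbol(ACDec *ac) {
    static const uint32_t cumProbLow[3] = { 0, 32768, 49152 };
    static const int      codeLen[3]    = { 1, 2, 2 };

    /* peek the next 16 bits, map target to symbol by interval */
    uint32_t peek = ac->code >> (ac->bitLen - 16);
    int sym = (peek < 32768) ? 0 : (peek < 49152) ? 1 : 2;

    /* remove the consumed range, then refill */
    ac->code   -= cumProbLow[sym] << (ac->bitLen - 16);
    ac->bitLen -= codeLen[sym];
    AC_Refill(ac);
    return sym;
}
```

Decoding the bit string 0 10 11 0 10 11 (the canonical codes for symbols 0,1,2,0,1,2) reproduces exactly the Huffman decode of that stream.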

There's only a small difference left. To remove the consumed bits, the arithmetic coder does :

```
int cumProbLow  = sym_cumProbTable[sym];

AC_code   -= cumProbLow << (AC_bitlen - 16);

```
while the Huffman coder does :
```
bits_code &= (1 << bits_bitLen)-1;

```
which is obviously simpler. Note that the Huffman remove can be written as :
```
code = peek >> (16 - codeLen);

bits_code -= code << (bits_bitLen - codeLen);

```
What's happening here : peek is 16 bits long; it's a window on the next 16 bits of "bits_code". First we make "code", which is the top "codeLen" bits of "peek". "code" is our actual Huffman code for this symbol. Then we know the top bits of bits_code are equal to code, so to turn them off, rather than masking we can subtract. The equivalent cumProbLow is code<<(16-codeLen). This is the equivalence of the Huffman code to taking the arithmetic probability range [0,65536] and dividing it in half at each tree branch.

The arithmetic coder had to look up cumProbLow in a table because it is still actually a bit more general than the Huffman decoder. In particular our arithmetic decoder can still handle probabilities like [1,2,4,1] (with cumProbTot = 8). Because of that the cumProbLows don't hit the nice bit boundaries. If you require that your arithmetic probabilities are always sorted [1,1,2,4], then since they are power of two and sum to a power of two, each partial power of two must be present, so the cumProbLows must all hit bit boundaries like the huffman codes, and the equivalence is complete.
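That last condition can be checked mechanically. This hypothetical helper walks symbols in order, gives each one an interval of size 1 << (16 - codeLen), and verifies that every interval low lands on a bit boundary for its code length (and that the probabilities sum to 2^16) :

```c
#include <stdint.h>

/* For the Huffman equivalence to hold, each symbol's cumProbLow must be
   expressible as code << (16 - codeLen), i.e. its low (16 - codeLen)
   bits must be zero.  Sorted power-of-2 probabilities guarantee this. */
static int CheckHuffmanArithEquivalence(const int *codeLen, int numSyms)
{
    uint32_t cumProbLow = 0;   /* running arithmetic interval low */
    for (int s = 0; s < numSyms; s++) {
        /* canonical code = interval low right-justified to codeLen bits */
        uint32_t code = cumProbLow >> (16 - codeLen[s]);
        if ((code << (16 - codeLen[s])) != cumProbLow)
            return 0;  /* low not on a bit boundary for this length */
        cumProbLow += 1u << (16 - codeLen[s]);
    }
    return cumProbLow == (1u << 16);  /* probabilities sum to 2^16 */
}
```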

So, you should now see clearly that a Huffman and Arithmetic coder are not completely different things. They are a continuum on the same scale. If you start with a fully general Arithmetic coder it is flexible, but slow. You then constrain it in various ways step by step, it gets faster and less general, and eventually you get to a Huffman coder. But those are not the only coders in the continuum, you also have things like "Arithmetic coder with fixed power of two probability total but non-power-of-2 symbol probabilities" which is somewhere in between in space and speed.

BTW not directly on topic, but I found this in my email and figure it should be in public :

```

Well, Adaptive Huffman is awful, nobody does it.  So you have a few options :

Static Huffman  -
very fast
code lengths must be transmitted
can use table-based decode

Arithmetic with static probabilities scaled with total = a power of 2
very fast
can use table-based decode
must transmit probabilities
decode must do a divide

"Deferred Summation"
doesn't transmit probabilities

must use Fenwick tree or something like that
much slower, coder time no longer dominates
(symbol search in tree time dominates)

Arithmetic by binary decomposition
can use fast binary arithmetic coder
speed depends on how many binary events it takes to code symbols on average

It just depends on your situation so much. With something like image or
audio coding you want to do special-cased things like turn amplitudes
into log2+remainder, use a binary coder for the zero, perhaps do
zero-run coding, etc. stuff to avoid doing the fully general case of a
multisymbol large alphabet coder.

```

Autodidactic Asphyxiation said...

This is a great illustration. It even suggests a template-parameterized implementation for table-based decoders. An obvious-but-interesting thing to me: this continuum extends to coding simpler than table-based Huffman.

I wonder if it would be faster to do static huffman by compiling the tables directly into code. Maybe it is a lose because your branch targets are unpredictable, but maybe it is a win because you can more easily do things like decode multiple symbols, use immediate shifts, and have shorter load dependency chains. A fast implementation would likely look similar to a really simple indirect-threaded bytecode interpreter.

So this would be a compiled-table approach, but now what could you do if your bits->symbols mapping had structure that was easier to transpose into code? Specifically, let's get rid of the switch dispatch. I believe this leads you to Rice/Golomb/Gamma/etc. codes.

cbloom said...

"I wonder if it would be faster to do static huffman by compiling the tables directly into code. Maybe it is a lose because your branch targets are unpredictable, but maybe it is a win because you can more easily do things like decode multiple symbols, use immediate shifts, and have shorter load dependency chains. A fast implementation would likely look similar to a really simple indirect-threaded bytecode interpreter."

Yeah we talked about doing this. If you had to actually interpret commands it would be a lose, but if you could compile directly into machine code and insert it in your decoder, it would be a win.

"So this would be a compiled-table approach, but now what could you do if your bits->symbols mapping had structure that was easier to transpose into code? Specifically, let's get rid of the switch dispatch. I believe this leads you to Rice/Golomb/Gamma/etc. codes."

Ah yeah, that's an interesting idea. The most compelling case would come if your encoder was a bit flexible, eg. you had some space/speed tradeoff parameter and it could choose to build tables that were not quite optimal and had more directly encodable structure.

For example, one case that's very common is to have a fast decode table for the first 8-12 bits of the huffman code. Then, for the remainder of the symbols instead of decoding the true huffman code, you could just use a variable-length code of some sort, with very little loss because those symbols are rare anyway. But then speeding up that branch is also not very important. However one nice win of that approach is that sending the code lens can be done smaller and simpler because you don't send any of the code lens > fastLen. Hmmm...

Anonymous said...

Whoa, color! It's like a rainbow, all the way through the blog.

Double colors, oh my god. It's a double-colored blog, all the way. Oh my god. It's so bright and vivid! Oh! Ah! What does this mean?

Anonymous said...

by the way, since it's not clear, that is not me bandwagonning on that meme, I actually hate that meme.

it is actually me making fun of myself, because my first reaction to this post HONESTLY was "holy crap, it's in color, Charles is using color, unbelievable". And when I went to post that I realized that perhaps I was overdoing it a bit, and should make fun of myself for it.

cbloom said...

Oh, I didn't even notice the rainbow meme reference.

I knew I would blow some reader's minds with the use of this new "rich text" technology.

Autodidactic Asphyxiation said...

"Yeah we talked about doing this. If you had to actually interpret commands it would be a lose, but if you could compile directly into machine code and insert it in your decoder, it would be a win."

It would be fun to JIT your Huffman decoder. I imagine a simple table-based codegen (which I'm sure you have lying around) would be easy to throw together.

A few random questions about table-based decoders.

What is the size/speed tradeoff of looking up symbol length after decoding the symbol? If you looked up both together, you might get some more cache misses, but your load dependency chain shrinks. I suppose you mentioned that on in-order PPC there's some advantage to having 8-bit arrays, but you're going to be eating 3-5 cycles even if your L1 hits.

It seems like it might be advantageous to be able to decode many small symbols simultaneously. If Huffman is able to compress 8-bit bytes to 4 bits on average, you expect to be able to decode 2-3 bytes from 8-12 bits. Have you ever tried this? Dealing with these special cases might not be easy in table-based implementations, but might be more feasible with codegen.

"However one nice win of that approach is that sending the code lens can be done smaller and simpler because you don't send any of the code lens greater than fastLen. Hmmm..."

You could make it so that rare symbols are transmitted as literals with a common prefix. Then you send the rare symbols + code lens, and the code lens build a huffman tree that excludes the rare symbols, but includes the rare symbol marker. This was just the naive and obvious thing I thought of just now, so I'm sure you could do much better.

Whenever I see "code lens" I think that there is some deep philosophical implication. Like: "through the lens of the algorithm" instead of "I am too lazy to type gth."

cbloom said...

"It would be fun to JIT your Huffman decoder. I imagine a simple table-based codegen (which I'm sure you have lying around) would be easy to throw together."

Yeah there are lots of fun things you could do with dynamic code. Unfortunately, it's forbidden on most consoles for security reasons.

"What is the size/speed tradeoff of looking up symbol length after decoding the symbol? If you looked up both together, you might get some more cache misses"

Yes, the exact details of this are platform dependent. We have tried looking them up together, looking them up in two arrays both indexed by peek, looking up sym then looking up codelen from sym, etc. The code posted is not what we ship with. There are platform quirks but generally the biggest issue is the dependency chain and when you need which values.

"It seems like it might be advantageous to be able to decode many small symbols simultaneously."

Yeah, this is in the old Huffman2 that's in crblib. It's not very useful in general because you generally have a mixed stream, like huffman-raw bits-huffman.

"You could make it so that rare symbols are transmitted as literals with a common prefix."

Indeed. There's a whole spectrum to be explored where mutating the coding a bit to be not perfectly Huffman could reduce memory use and improve speed, and then you could play with the space/speed tradeoff.

ryg said...

"Yeah, this is in the old Huffman2 that's in crblib. It's not very useful in general because you generally have a mixed stream, like huffman-raw bits-huffman."
Depends on the anatomy of your code stream (and your decoder). If the raw bits are part of the same (logical) symbols, it's definitely a win to include them in the table.

Two examples: First one is LZ offsets. You typically have a couple different symbols to encode the magnitude of the offset followed by a couple of raw bits to specify it fully. If the full code (prefix + extra bits) is shorter than your full table size, you can just decode the whole thing at once. Basically, instead of doing

```
slot = DecodeHuffSymbol()
offs = Base[slot]
if (ExtraBits[slot])
    offs += GetBits(ExtraBits[slot])
```

you store Base/ExtraBits directly in your fast decode table. Then you do something like

```
entry = DecodeHuffSymbol()
offs = entry.offs
if (entry.extraBits) {
    slot = entry.offs
    // code as above, except we know
    // that extraBits != 0
}
```

Second example: JPEG-style value coding. Similar to the above, you have a huffman tree encoding run length and log2(abs(value)) followed by extra bits to specify sign and "mantissa". You can put decoding for sign+mantissa into the table for small values (which typically have short codes anyway). Now you have 3 paths: fast path (decode huff bits + trailing bits in one table lookup), normal path (decode huff bits in one table lookup) and slow path (multiple table lookups or one table lookup + linear scan afterwards). For something JPEG-ish, it's not unusual for 85% or more of all codes to go through the fast path even with moderate table sizes (did this with 9 bits index and 16 bits/table entry, so a 1k table total... pretty good deal).

So this definitely works if you can fold multi-step decoding of a single logical symbol into one table lookup. It's useless for anything that actually controls execution flow in the decoder, though. Reading two symbols that influence control flow separately adds N+M states your decoder can be in. Parsing them together introduces (N x M) states - usually a bad idea, unless a significant fraction is trivial or similar and can be unified.

cbloom said...

These are good points.