05-08-16 | Order-1 Huffman

This is a simple idea that's rarely written down, so I thought I'd do a quick summary.

To my knowledge I was the first person to write about it (in "New Techniques in Context Modeling and Arithmetic Encoding" (PDF) ) but it's one of those simple ideas that probably a lot of people had and didn't write about (like Deferred Summation). It's also one of those ideas that keeps being forgotten and rediscovered over the years.

(I don't know much about the details of how Brotli does this; it may differ. I'll be talking about how I did it.)

(also by "Huffman" I pretty much always mean "static Huffman" where you measure the histogram of a block and transmit the code lengths, not "adaptive Huffman" (modifying codelens per symbol (bleck)) or "deferred summation Huffman" (codelens computed from histogram of previous data with no explicit codelen transmission))

Let's start with just the case of order-1 8-bit literals. So you're coding a current 8-bit symbol with an 8-bit previous symbol as context. You can do this naively by just having 256 Huffman tables, one for each 8-bit context. The decoder looks like this :


256 times :
    read codelens from file
    build huffman decode table

per symbol :
    o1 = ptr[-1];
    ptr[0] = huff_decode( bitstream , huff_table[o1] );

and on a very large file (*) that might be okay.

(* actually it's only okay on a very large file with completely stable statistics, which never happens in practice. In the real world "very large files" don't usually exist; instead they act like a sequence of small/medium files tacked together. That is, you want a decoder that works well on small files, and you want to be able to reset it periodically (re-transmit huffman codelens in this case) so that it can adapt to local statistics).
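Concretely, a minimal C-style sketch of that naive decoder; huff_table, huff_read_codelens, huff_build_decoder, and huff_decode are hypothetical stand-ins for your Huffman routines, and the context for the first byte is left as a detail :

// naive order-1 decode : one Huffman table per 8-bit context
huff_table tables[256];

for (int c = 0; c < 256; c++)
{
    int codelens[256];
    huff_read_codelens(bitstream, codelens);   // 256 codelen sets transmitted
    huff_build_decoder(&tables[c], codelens);  // 256 table builds, even for
}                                              //  contexts that barely occur

for (intptr_t i = 1; i < count; i++)           // ptr[0] seeded from some
{                                              //  agreed-on initial context
    int o1 = ptr[i-1];
    ptr[i] = (unsigned char) huff_decode(bitstream, &tables[o1]);
}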

On small files, it's disastrous. You're sending 256 sets of codelens, which can be a lot of wasted data. Worst of all, it's a huge decode-time overhead to parse out the codelens and build the decode tables if you're only going to decode a few symbols in that context.

So you want to reduce the count of huffman tables. A rough guideline is to make the number of tables proportional to the number of bytes. Maybe 1 table per 1024 bytes is tolerable to you; it depends.

(another goal for reduction might be to get all the huff tables to fit in L2 cache)

So we want to merge the Huffman tables. You want to find pairs of contexts that have the most similar statistics and merge those. If you don't mind the poor encoder-time speed, a good solution is a best-first merge :


for each pair {i,j} (i<j) :
    merge_cost(i,j) = Huffman_Cost( symbols_i + symbols_j ) - Huffman_Cost( symbols_i ) - Huffman_Cost( symbols_j )

Huffman_Cost( symbols ) = bits to send codelens + bits to encode symbols using those codelens


while # of contexts > target , or lowest merge_cost <= 0 :
    pop lowest merge_cost
    merge context j onto i
    delete all merge costs involving j
    recompute all merge costs involving i

If the cost were just entropy (H) instead of Huffman_Cost, then merge_cost would always be >= 0 (separate statistics are never more expensive than combined). But since the Huffman codelen transmission is not free, the first merges will actually reduce encoded size. So you should always do merges that are free or beneficial, even if the huffman table count is already low enough.

So contexts with similar statistics will get merged together, since coding them with a combined set of codelens either doesn't hurt or hurts only a little (or outright helps, once the cost of codelen transmission is counted). In this way, context distinctions that weren't really helping get merged away.
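If you want that loop more concrete, here's a C-style sketch; histo[], alive[], num_alive, target_tables, and the heap helpers (pair_t, pop_cheapest_valid_pair, push_pair) are all hypothetical scaffolding, and huffman_cost() is assumed to charge for both the codelen transmission and the coded symbols :

// best-first merging of order-1 contexts (encoder side only)
// histo[c][s] = count of symbol s observed in context c

double merge_cost(const uint32_t * hi, const uint32_t * hj)
{
    uint32_t combined[256];
    for (int s = 0; s < 256; s++)
        combined[s] = hi[s] + hj[s];
    // negative = merging is a net win, counting codelen transmission
    return huffman_cost(combined) - huffman_cost(hi) - huffman_cost(hj);
}

// seed a min-heap with merge_cost(i,j) for all pairs i<j , then :
for (;;)
{
    pair_t best = pop_cheapest_valid_pair(heap);   // skips stale entries
    if (num_alive <= target_tables && best.cost > 0)
        break;       // few enough tables and no free/beneficial merges left
    for (int s = 0; s < 256; s++)                  // merge context j onto i
        histo[best.i][s] += histo[best.j][s];
    alive[best.j] = 0;  num_alive--;               // costs involving j go stale
    for (int k = 0; k < 256; k++)                  // recompute costs involving i
        if (alive[k] && k != best.i)
            push_pair(heap, best.i, k, merge_cost(histo[best.i], histo[k]));
}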

Once this is done, the decoder becomes :


get n = number of huffman tables

n times :
    read codelens from file
    build huffman decode table

for i = 0 to 255 :
    read tableindex from file
    merged_huff_table_ptr[i] = huff_table[ tableindex ]

per symbol :
    o1 = ptr[-1];
    ptr[0] = huff_decode( bitstream , merged_huff_table_ptr[o1] );

So merged_huff_table_ptr[] is still a [256] array, but it points at only [n] unique Huffman tables.
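Fleshed out slightly, with the same hypothetical huff_* helpers and a hypothetical get_bits() as before (the exact coding of n and of the table indices is up to you) :

// decode n unique tables, then the 256-entry context -> table map
int n = get_bits(bitstream, 8);                    // number of huffman tables

huff_table tables[256];
for (int t = 0; t < n; t++)
{
    int codelens[256];
    huff_read_codelens(bitstream, codelens);
    huff_build_decoder(&tables[t], codelens);      // only n builds now
}

const huff_table * merged_huff_table_ptr[256];
for (int i = 0; i < 256; i++)
{
    int tableindex = get_bits(bitstream, 8);       // could be coded tighter
    merged_huff_table_ptr[i] = &tables[tableindex];
}

for (intptr_t i = 1; i < count; i++)
{
    int o1 = ptr[i-1];
    ptr[i] = (unsigned char) huff_decode(bitstream, merged_huff_table_ptr[o1]);
}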

That's order-1 Huffman!

In the modern world, we know that o1 = the previous literal is not usually the best use of an 8-bit context. You might do something like the top 3 bits of ptr[-1], the top 2 bits of ptr[-2], and 2 bits of position, to make a 7-bit context.
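One possible packing of that context; the exact bit layout is arbitrary, and pos here stands for the current output position :

// 7-bit context : 3 bits of ptr[-1] , 2 bits of ptr[-2] , 2 bits of position
int context = ((ptr[-1] >> 5) << 4)   // top 3 bits of previous byte -> bits 4-6
            | ((ptr[-2] >> 6) << 2)   // top 2 bits of byte before that -> bits 2-3
            | ((int)(pos & 3));       // low 2 bits of position -> bits 0-1
// context is in [0,128) ; code the symbol with huff_table[context]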

One of the cool things order-1 Huffman can do is to adaptively figure out the context for you.

For example, with LZMA you have the option of setting the # of literal context bits (lc) and literal pos bits (lp). You want them to be as low as possible for better statistics, and there's no good way to choose them per file. (usually lc=2 or lp=2 ; usually just one or the other, not both)

With order-1 Huffman, you just make a context with 3 bits of lc and 3 bits of lp, so you have a 6-bit context with [64] states. Then you let the merger throw away states that don't help. For example, if it's a file where pos bits are irrelevant (like text), they will just get merged out: all the lc contexts that have different lp values will merge together.
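A sketch of that combined context (again, the bit packing is arbitrary) :

// 6-bit context : 3 lc bits (top of previous byte) + 3 lp bits (position)
int context = ((ptr[-1] >> 5) << 3) | ((int)(pos & 7));   // in [0,64)
// build 64 histograms, then let the merger collapse the ones that don't
// pay for themselves ; on text the lp distinctions typically merge away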


05-03-16 | Brotli signed int mode

Brotli's signed int context mode looks like a good idea. My guess is this is what's helping Brotli10 on the binary files I wrote about in the previous post (horse.vipm and so on).

Signed int takes the previous two bytes and forms a 6-bit context from them thusly :


  Context = (Lut2[b2]<<3) | Lut2[b1];

      Lut2 :=
         0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
         5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
         5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
         6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7

So it's roughly categorizing the two values into ranges, which means it can act as a kind of linear predictor (if that fits the data), e.g. two preceding values in group 4 = my value probably is too, or if b2 is a "3" and b1 is a "4" then I'm likely a "5". Or not, if the data isn't linear like that. Or maybe there's only correlation to b1 and b2 gets ignored, which the order-1 huff can also model (contexts differing only in b2 just merge together).
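A worked example, reading the bucket numbers straight off the Lut2 listing above :

// b2 = 0x50 falls in the 0x40-0x7F band -> Lut2[0x50] = 3
// b1 = 0x90 falls in the 0x80-0xBF band -> Lut2[0x90] = 4
int context = (3 << 3) | 4;   // = 28 , one of 64 possible contexts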

One thing I like is holding out the values 00 and FF as special cases that get their own unique buckets. This lets you detect the special cases of the last two bytes being 0000, FFFF, FF00, or 00FF, which can be pretty important on binary.

I think that for the type of data we get in games, which often has floats, it might be worth it to single out 7F and 80 as well (so that 00, 7F, 80, and FF each get their own bucket), something like :

0,1111....
1..
22...
2........2,3
4,5555...
5...
66.....
6........6,7
but who knows, would have to test to see.
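For what it's worth, here's my reading of that sketch as code; the band boundaries are guesses at the intent, not tested values :

// variant Lut2 : 00, 7F, 80, FF each get their own bucket , with the
// remaining range bands mirrored in each half (boundaries are guesses)
unsigned char lut[256];
for (int b = 0; b < 256; b++)
{
    if      (b == 0x00) lut[b] = 0;
    else if (b <  0x40) lut[b] = 1;
    else if (b <  0x7F) lut[b] = 2;
    else if (b == 0x7F) lut[b] = 3;
    else if (b == 0x80) lut[b] = 4;
    else if (b <  0xC0) lut[b] = 5;
    else if (b <  0xFF) lut[b] = 6;
    else                lut[b] = 7;   // b == 0xFF
}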
