Engel Coding and Length Limited Huffman Heuristic Revisited

Joern Engel has written about a technique he described to me for forming prefix code lengths here on his blog.

Engel Coding is a fast/approximate way of forming length limited prefix code lengths.

I'm going to back up first and remind us what a prefix code is and the use of the Kraft inequality.

We want to entropy code some alphabet using integer code lengths. We want those codes to be decodeable without side information, eg. by only seeing the bits themselves, not transmitting the length in bits.

0  : a
10 : b
11 : c
is a prefix code. If you see the sequence "11100" it can only decode to "11=c,10=b,0=a".
0  : a
1  : b
11 : c
is not a prefix code. Any occurance of "11" can either be two b's or one c. They can't be resolved without knowing the length of the code. There is no bit sequence you can possibly assign to c that works. The fact that this is impossible is a function of the code lengths.

You can construct a prefix code from code lengths if and only if the lengths satisfy the Kraft inequality :

Sum{ 2^-L_i } <= 1

It's pretty easy to understand this intuitively if you think like an arithmetic coder. 2^-L is the effective probability for a code length, so this is just saying the probabilities must sum to <= 100%

That is, think of the binary code as dividing the range [0,1] like an arithmetic coder does. The first bit divides it in half, so a single bit code would take half the range. A two bit code takes half of that, so a quarter of the original range, eg. 2^-L.

The two numbers that we care about are the Kraft code space used by the code lengths, and the total code length of the alphabet under this encoding :

Kraft code space :

K = Sum{ 2^-L_i }

Total code length :

CL = Sum{ C_i * L_i }

L_i = code length in bits of symbol i
C_i = count of symbol i

Minimize CL subject to K <= 1 (the "Kraft inequality")

We want the minimum total code length subject to the prefix code constraint.

The well known solution to this problem is Huffman's algorithm. There are of course lots of other ways to make prefix code lengths which do not minimize CL. A famous historical one is Shannon-Fano coding, but there have been many others, particularly in the early days of data compression before Huffman's discovery.

Now for a length-limited code we add the extra constraint :

max{ L_i } <= limit

now Huffman's standard algorithm can't be used. Again the exact solution is known; to minimize CL under the two constraints of the Kraft inequality and the maximum codelength limit, the algorithm is "Package Merge".

In Oodle we (uniquely) actually use Package Merge at the higher compression levels, but it is too slow and complicated to use when you want fast encoding, so at the lower compression levels we use a heuristic.

The goal of the heuristics is to find a set of code lengths that satisfy the contraints and get CL reasonably close to the minimum (what Package Merge would find).

The Oodle heuristic works by first finding the true Huffman code lengths, then if any are over the limit, they are changed to equal the limit. This now violates the Kraft inequality (they are not prefix decodeable), so we apply corrections to get them to K = 1. ZStd uses a similar method (and I imagine lots of other people have in the past; this is pretty much how length-limited near-Huffman is done). My previous post on the heuristic length limited code is below, with some other Huffman background :

cbloom rants: 07-03-10 - Length-Limitted Huffman Codes Heuristic
cbloom rants 07-02-10 - Length-Limitted Huffman Codes
cbloom rants 05-22-09 - A little more Huffman
cbloom rants 08-12-10 - The Lost Huffman Paper
cbloom rants Huffman Performance
cbloom rants Huffman Correction

(Engel correctly points out that most of the places where I say "Huffman coding" I should really be saying "prefix coding". The decoding methods and canonical code assignment and so on can be done with any prefix code. A Huffman code is only the special case of a prefix code with optimal lengths. That is, Huffman's algorithm is only the part about code length assignment; the rest is just prefix coding.)

So Engel's idea is : if we're going to limit the code lengths and muck them up with some heuristic anyway, don't bother with first finding the optimal non-length-limited Huffman code lengths. Just start with heuristic code lengths.

His heuristic is (conceptually) :

L_i = round( log2( P_i ) )

which is intuitively a reasonable place to start. If your code lengths didn't need to be an integer number of bits, then you would want them to be as close to log(P) as possible.

Then apply the limit and fix the lengths to satisfy the Kraft inequality. Note that in this case the tweaking of lens to satisfy Kraft is not just caused by lens that exceed the limit. After the heuristic codelens are made, even if they are short, they might not be Kraft. eg. you can get code lengths like { 2,3,3,3,3,3,4,4,4 } which are not prefix (one of the 3's need to be changed to a 4). The idea is that unlike Huffman or Shannon-Fano which explicitly work by creating a prefix code by construction, Engel coding instead makes code lengths which could be non-prefix and relies on a fix up phase.

When Joern told me about this it reminded me of "Polar Coding" (Andrew Polar's, not the more common use of the term for error correction). Andrew Polar's code is similar in the sense that it tries to roughly assign log2(P) codelens to symbols, and then uses a fix-up phase to make them prefix. The details of the heuristic are not the same. (I suspect that there are lots of these heuristic entropy coders that have been invented over the years and usually not written down).

Obviously you don't actually want to do a floating log2; for the details of Engel's heuristic see his blog.

But actually the details of the initial codelen guess is not very important to Engel coding. His codelen adjustment phase is what actually determines the codelens. You can start the codelens all at len 1 and let the adjustment phase do all the work to set them, and in fact that gives the same final codelens!

I tried four methods of initial codelen assignment and they all produced the exact same final codelens. The only difference is how many steps of the iterative refinement were needed to get them to Kraft equality.

all initial codelens = 1    : num_adjustment_iterations = 2350943

codelens = floor( log2(P) ) : num_adjustment_iterations = 136925

codelens = round( log2(P) ) : num_adjustment_iterations = 25823

Engel Coding heuristic      : num_adjustment_iterations = 28419

The crucial thing is how the refinement is done.

To get to the refinement, let's go over some basics. I'll first describe the way we were actually doing the length limit heuristic in Oodle (which is not the same as what I described in the old blog post above).

In the Oodle heuristic, we start with Huffman, then clamp the lens to the limit. At this point, the Kraft K is too big. That is, we are using more code space than we are allowed to. We need to raise some codelens somewhere to free up more code space. But raising codelens increases the total codelen (CL). So the goal is to bump up some codelens to get K = 1, with a minimum increase to CL.

If you do L_i ++

K -= 2^(-L_i)
K += 2^(-(L_i+1))

for a net change of :

K -= 2^(-(L_i+1))

(shorter codelen symbols makes a bigger change to K)

and CL does :

CL -= C_i * L_i
CL += C_i * (L_i + 1)

net :

CL += C_i

(lower count symbols hurt total codelen less)

K_delta_i = 2^(-(L_i+1))
CL_delta_i = C_i

To get under the K budget, we want to find the lowest CL_delta with the maximum K_delta. That is, code space (K) is the resource you want to buy, and code len (CL) is the currency you use to pay for that code space. You want the best price :

price = CL_delta / K_delta

price = C_i * 2^(L_i+1)

What I was doing in Oodle was taking the step with the best "price" that didn't overshoot the target of K = 1.

If your symbols are sorted by count (which they usually are for Huffman codelen algorithms), then you don't need to compute "price" for all your symbols. The minimum price will always occur at the lowest count (first in the sorted list) at each codelen. So rather than making a full heap of up to 256 symbols (or whatever your alphabet size is), you only need a heap of the 16 (or whatever codelen limit is) lowest count symbols at each codelen.

The big improvement in Engel's refinement heuristic is that it allows overshooting K if the absolute value of the distance to K decreases.

Consider K in fixed point with a 12 bit codelen limit. Then "one" is at K = 4096. Say you had K = 4099. It's 3 too big. My heuristic could only consider K steps of -=2 and -=1 (only power of two steps are possible). Engel can also take a step of -= 4 , changing K to 4095. It's now 1 too small (codeable but wasteful) and rather than increasing codelens to fit in the code space, we can decrease a symbol codelen somewhere to gain some total codelen.

Engel converges to K = one by (possibly) taking successively smaller overshooting steps, so K wiggles around the target, delta going positive & negative. This does not always converge, so a bail out to a simpler approach is needed. This overshooting lets it get to K by doing a combination of positive and negative steps (eg. 3 = 4 - 1 , not just 3 = 1 + 2), which is a little bit of a step towards Package Merge (the difference being that package merge find the cheapest path to get the desired sum, while Engel's heuristic is greedy, taking the single cheapest step each time).

In practice this turns out to be much better than only taking non-overshooting steps.

Time to look at the results on some real data :

"seven" test set, cut into 64k chunks, order 0 entropy coded

comparing code len to package merge (ideal)

The length in excess (percent) is :

excess percent = (codelen(heuristic) - codelen(packagemerge))*100/codelen(packagemerge)

Huffman then limit monotonic      mean : 0.027% max : 2.512%
Huffman then limit overshooting   mean : 0.002% max : 0.229%
Engel coding                      mean : 0.016% max : 10.712%

(codelen limit of 12, 256 symbol byte alphabet)

The heuristic (overshooting) limit is really very good, extremely close to package merge and even the maximum excess len is small. Engel coding (non-Huffman initial code lengths) works fine on average but does have (very rare) bad cases. This is not surprising; there's reason we use Huffman's algorithm to get the code lengths right.

In that bad 10.71% excess case, the package merge average code len is 1.604 but Engel coding produces an average code length of 1.776

Note that many of the blocks in this test did not hit the codelen limit; in that case "Huffman then limit" produces the best possible codelens, but Engel coding might not.

For most applications, it's probably best to make the true Huffman code and then limit the lengths with a heuristic. The time saved from the approximate initial code lengths is pretty small compared to other operations needed to do entropy coding (histogramming for example is annoyingly slow). Nevertheless I found this technique to be an interesting reminder to keep an open mind about approximations and understand where our algorithms came from and why we use the ones we do.

Another thing I find interesting is how Engel Coding points back to Package Merge again.

First there's the fact that you can start Engel Coding with just all the codelens set at 1 , and let the Kraft fixup make the codelens. That's how Package Merge works. It starts all codelens at 1, and the increments the cheapest ones until it gets up to Kraft. The log2ish starting guess for Engel Coding is just a way of jump-starting the codelens closer to the final answer to avoid lots of steps.

Engel Coding's overshooting heuristic improves on the monotonic heuristic by allowing you to take some +- steps. That is, increment one codelen and decrement another. In particular it can do things like : rather than increment a len 3 codelen , instead increment a len 2 codelen and decrement a len 3 codelen. This is the kind of move that you need to make to get to real optimal code lens.

The key missing element is considering the costs of all possible steps and finding a path to the desired K. That is, Engel coding takes greedy (locally cheapest price) steps, which may not give the optimal path overall. The way to turn this greedy algorithm into an optimal one is dynamic programming. Lo and behold, that's what Package Merge is.


Cyan said...

Interesting read!
A few years ago, I experimented with overshooting correction :
but the conclusion was, it rarely paid off.

So either the test was incorrectly setup, which is quite possible,
or the benefit of this strategy depends on distribution, and the samples used were not reflecting that properly.

cbloom said...

Well, even the monotonic heuristic is not bad if you look at averages :

mean : 0.027%

only a quarter of 0.1% excess size over package merge.

Improving these heuristics is really about avoiding the rare bad cases :

max : 2.512%

that's bad enough that we'd like to avoid it.

Aside from running these tests on lots of real data to try to find real-world-occuring worst cases, I also set up a synthetic test to just create histograms to try to find worst cases.

It's easy to make a histogram of 6 symbols or so and just try all possible symbol counts from 1-100. You can restrict to sorted order of counts without loss of generality.

old rants