cbloom rants: The Kraft Number, Binary Arithmetic, and Incremental Package Merge

The Kraft number that we compute for length limited Huffman codelen construction can be thought of as a set of bit flags that tell you which codelens to change.

Recall


K = Sum{ 2^-L_i }

K <= 1 is prefix codeable

you can think of K as the sum of effective coding probabilities (P = 2^-L), so K over 1 is a total probability over 100%.

when we initially make Huffman codelens and then apply a limit, we use too much code space. That corresponds to K > 1.
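Here's a quick numeric sketch of that. The helper names `kraft_number` and `limit_codelens` are made up for illustration, and the naive clamp is only an assumed stand-in for whatever limiting step you actually use.

```python
from fractions import Fraction

def kraft_number(codelens):
    """K = Sum{ 2^-L_i }, computed exactly."""
    return sum(Fraction(1, 2 ** l) for l in codelens)

def limit_codelens(codelens, max_len):
    """Naive limit (an assumption for this sketch): clamp every codelen to max_len."""
    return [min(l, max_len) for l in codelens]

huffman_lens = [1, 2, 3, 4, 4]                  # a complete prefix code
limited_lens = limit_codelens(huffman_lens, 3)  # becomes [1, 2, 3, 3, 3]

k_before = kraft_number(huffman_lens)
k_after = kraft_number(limited_lens)
print(k_before, k_after)
```

The unlimited lens sum to exactly 1; after clamping, the two len 4 codes each take twice the code space they did before, which pushes K over 1.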

If you write K as a binary decimal, it will be something like :


K = 1.00101100

Those 1 bits that are below K = 1.0 are exactly the excess codelens that we need to correct.

That is, if you have an extra symbol of length L, K is too high by 2^-L , and that shows up as a 1 bit in the binary at position L to the right of the decimal.


codelen set of {1,2,2,3}

a : len 1
b : len 2
c : len 2
d : len 3

K = 2^-1 + 2^-2 + 2^-2 + 2^-3

K = 0.100 +
    0.010 +
    0.010 +
    0.001

K = 1.001

take (K-1) , the part below the decimal :

K-1 = .001

we have an excess K of 2^-3 ; that's one len 3 too many.

To fix that we can change a code of len 2 to len 3

that does 

-= 0.010
+= 0.001

same as

-= 0.001

That is, when (K-1) has 2 leading zeros, you correct it by promoting a code of len 2 to len 3

so if we compute K in fixed point (integer), we can just take K - one and do a count leading zeros on it, and it tells you which code len to correct. The bits that are on in K tell us exactly what's wrong with our code.
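A minimal sketch of that fixed point trick, assuming 1 < K < 2 and that no codelen exceeds `max_len`. The function name `codelen_to_correct` is hypothetical, and `int.bit_length` stands in for a hardware count leading zeros.

```python
def codelen_to_correct(codelens, max_len):
    """Return (from_len, to_len) for the promotion that clears the top
    excess bit of K, or None if K <= 1 (nothing to correct)."""
    one = 1 << max_len                    # fixed point 1.0
    k = sum(one >> l for l in codelens)   # each code adds 2^-l
    excess = k - one                      # the part below the decimal
    if excess <= 0:
        return None
    top_bit = excess.bit_length() - 1     # index of the highest set bit
    to_len = max_len - top_bit            # that bit is worth 2^-to_len
    leading_zeros = to_len - 1            # zeros after the decimal point
    return leading_zeros, to_len

print(codelen_to_correct([1, 2, 2, 3], 3))
```

For the {1,2,2,3} example this gives (2, 3): two leading zeros, so promote a code of len 2 to len 3.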

Now, similarly, we can think of the compound operations in the same way.

Any time we need to do a correction of K by 2^-L we could do one code from (L-1) -> L , or we could do two codes from L -> (L+1), or ... that can be seen as just an expression of the mathematics of how you can add bits together to make the desired delta.

That is :


You want 0.001 (increment codelen 2 to 3)

that can be :

0.001 = 0.0001
       +0.0001

(increment two codes from len 3 to len 4)

or :

0.001 = 0.01
       -0.001

(increment a shorter codelen from 1 to 2, then decrement one from 3 back to 2)

Now, thinking about it this way we can try to enumerate all the possible moves.

To reduce the space of all possible moves, we need a few assumptions :

1. No non-len-group changing moves are profitable. That is, the set of symbols for the current len groups are the best possible set. eg. it's not profitable to do something like { symbol a from len 2 to 3 and symbol b from len 3 to 2 } . If there are any profitable moves like that, do them separately. What this means is the counts are sorted; eg. if a symbol is at a higher codelen, its count is less than or equal to the count of any symbol at a lower codelen.

2. I only need to enumerate the moves that can be the cheapest (in terms of total code len).

In that case I think that you can enumerate all the moves thusly :


a change of 2^-L can be accomplished via

inc(L-1)   (that means increment a codelen at L-1)

inc(L) can be substituted with { inc(L+1), inc(L+1) }

And each of those inc(L+1) can also be substituted with a pair at L+2.
You take either a codelen at L or two at the next higher len, 
whichever has a smaller effect on CL (total codelen).

a change of 2^-L can also be accomplished via :

inc(L-2) and dec(L-1)   (dec(X) here undoes an inc(X) : a codelen goes from X+1 back to X)

OR

inc(L-3) and dec(L-2) and dec(L-1)

again these can be understood as binary decimals :

0.0001 = 0.0010 - 0.0001
0.0001 = 0.0100 - 0.0011
0.0001 = 0.1000 - 0.0111

and finally the decs are also a tree of pairs :

dec(L) can instead be { dec(L+1), dec(L+1) }

this seems like a lot, but compared to all possible ways to make the number X from adds & subtractions of any power of two, it's quite small. The reason we can consider such a reduced set of moves is because we only need the one best way to toggle a bit of K, not all possible ways.

Really we just do :


enumerate the position of the lowest codelen to inc
between 1 and (L-1)

decrement at all codelens below the one you incremented
down to (L-1)

this produces the desired change in K of 2^-L

each "inc" & "dec" can either be at that code len, or a pair of next codelen
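The enumeration above can be checked mechanically. This is only a sketch: it assumes dec(X) is the exact inverse of inc(X) (a codelen going from X+1 back to X), which is the reading that matches the binary decimals earlier, and it ignores the pair substitutions. The names `enumerate_moves` and `k_change` are made up.

```python
from fractions import Fraction

def delta_inc(x):
    """inc(x): a codelen goes from x to x+1, so K drops by 2^-(x+1)."""
    return -Fraction(1, 2 ** (x + 1))

def delta_dec(x):
    """dec(x), assumed to undo inc(x): x+1 back to x, K rises by 2^-(x+1)."""
    return Fraction(1, 2 ** (x + 1))

def enumerate_moves(L):
    """One move per choice of the shortest codelen to inc, between 1 and L-1,
    followed by a dec at every len after it down to L-1."""
    moves = []
    for start in range(1, L):
        move = [("inc", start)]
        move += [("dec", x) for x in range(start + 1, L)]
        moves.append(move)
    return moves

def k_change(move):
    """Net change in K from applying every step of the move."""
    return sum(delta_inc(x) if op == "inc" else delta_dec(x) for op, x in move)

for move in enumerate_moves(4):
    print(move, k_change(move))
```

For L = 4 this lists three moves, inc(1)+dec(2)+dec(3), inc(2)+dec(3), and inc(3), each with a net change of -1/16.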

(somebody smarter than me : prove that these are in fact all the necessary moves)

Let's look at how dynamic programming reduces the amount of work we have to do.


Say we need to do an inc() at L = 3

(inc means increment a codelen, that decreases K by 2^-4)

We can either increment a single symbol at L = 3
or a pair from L = 4

(this is just the same kind of operation you do in normal Huffman tree building)

The best symbol at L = 3 is just the lowest count symbol (if any exists)

Ask for the best two nodes at L = 4

Those can also be a single symbol, or a pair from L = 5

When you ask for the first node at L = 4, it gets the best two at L = 5

but then imagine the single symbol at L = 4 was lower count and is taken

Now you ask for the second node at L = 4, it again needs the best two at L = 5

we already have them, no more work is needed.

Any time you choose a symbol rather than a pair of higher len, the 2^n tree stops growing.

[3] -> { [4a], [4b] }

[4a] -> sym(4) or { [5a], [5b] }

[4a] takes sym(4)

[4b] -> sym2(4) or { [5a], [5b] }

[4b] doesn't need a new evaluation at level 5
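Here is one way the caching might look in code. It's a sketch under assumptions, not the actual implementation: the `Level` class is hypothetical, it tracks only costs (symbol counts), and it ignores symbols moving between lens as they get incremented.

```python
class Level:
    """Lazily built, cached list of the cheapest nodes at one codelen."""

    def __init__(self, symbol_counts, child=None):
        self.symbols = sorted(symbol_counts)  # counts of symbols at this len
        self.child = child                    # Level for the next longer len
        self.nodes = []                       # cached node costs, cheapest first
        self.next_sym = 0                     # next unused symbol
        self.next_pair = 0                    # next unused pair of child nodes

    def node(self, i):
        """Cost of the i-th cheapest node here, or None if there are too few."""
        while len(self.nodes) <= i:
            sym = None
            if self.next_sym < len(self.symbols):
                sym = self.symbols[self.next_sym]
            pair = None
            if self.child is not None:
                a = self.child.node(2 * self.next_pair)
                b = self.child.node(2 * self.next_pair + 1)
                if a is not None and b is not None:
                    pair = a + b
            if sym is None and pair is None:
                return None
            if pair is None or (sym is not None and sym <= pair):
                self.nodes.append(sym)    # single symbol wins, pair stays cached
                self.next_sym += 1
            else:
                self.nodes.append(pair)   # pair from the next len wins
                self.next_pair += 1
        return self.nodes[i]

level5 = Level([4, 4, 6])
level4 = Level([3, 5], child=level5)

first = level4.node(0)     # compares sym(4) against { [5a], [5b] }
built_after_first = len(level5.nodes)
second = level4.node(1)    # compares sym2(4) against the same cached pair
built_after_second = len(level5.nodes)
print(first, second, built_after_first, built_after_second)
```

Both requests at len 4 end up taking a symbol, and the len 5 list is built only once: it holds two nodes after the first request and still two after the second.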

Another issue I need to mention is that as you increment and decrement codelens, symbols move between the lists (for example, you want to keep the symbols at each len sorted by count). Does that mean the cached dynamic programming lists cannot be retained?

In fact the accounting for symbols moving is simple and doesn't need to invalidate the cached lists.


When you do an inc(L) , that symbol moves to L+1 and is now available for a further inc(L+1)

(this does not occur with dec(L) since it moves in the opposite direction)

Say you wanted an inc(3) ; you consider doing a pair of { inc(4), inc(4) }

One of the inc(4)'s can be a pair of inc(5)'s , and one of those len 5 symbols can be the one you just did inc(4) on.

That is, say you have 'A' at len 4 and 'B' at len 5

inc(3) <- { inc('A' @ 4) , { inc('A' @ 5) , inc('B' @ 5) } }

This is a legal move and something you have to consider.

But the rule for it is quite simple - if a symbol occurs earlier in the list of chosen increments, it is available at the next level.

If you're familiar with the way Package Merge makes its lists, this is quite similar. It just means when you choose the lowest count symbol at the current level, you can also draw from the previous increments in your list if they have lower count.

These queues we are building are exactly the same thing you would need to do the full package merge algorithm. The difference is, in traditional Package Merge you would start with all the symbols at codelen = max (K is too low), and then incrementally apply the best decrements to increase K. Here we are starting with K pretty close to 1 , with K greater than 1. The result is that in many cases we can do far fewer package merge steps. I call this Incremental Package Merge. It allows you to start from a nearly-Kraft set of codelens and get to the same optimal solution as if you did full package merge.

Let's look at a concrete example or two :


codelen : symbol+count

len 3 : a+7
len 4 : b+3 , c+3
len 5 : d+1 , e+1

you need an inc(3) to get K -= 2^-4

you can :

inc(a) ; cost 7
inc(b + c) ; cost 6
inc(b + { d + e } ) ; cost 5

The best inc(3) is {b,d,e}

Another example :

len 3 : a+7
len 4 : b+2
len 5 : c+1

again say you want an inc(3)

the best is

inc( b + { b + c } ) ; cost 5

here the best option is to inc b twice
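Both examples can be checked with a small brute-force search. This is a sketch for tiny inputs only, not the queue-based method: `best_inc` is a hypothetical name, the `max_len` of 6 is an assumed limit picked just so the recursion stops, and at each level it simply tries taking the k lowest-count symbols and pushing the remainder down as pairs.

```python
def best_inc(level, units, lens, counts, max_len):
    """Cheapest way to get `units` incs' worth of K change at `level`.
    Each unit is one symbol at `level` or two units at level+1.
    Returns (cost, moves) with moves as (symbol, from_len), or None."""
    if units == 0:
        return 0, []
    if level >= max_len:
        return None  # a code already at max_len cannot get longer
    here = sorted((s for s in lens if lens[s] == level), key=lambda s: counts[s])
    best = None
    for k in range(min(units, len(here)), -1, -1):
        chosen = here[:k]
        new_lens = dict(lens)
        for s in chosen:
            new_lens[s] = level + 1  # now available for an inc at level+1
        rest = best_inc(level + 1, 2 * (units - k), new_lens, counts, max_len)
        if rest is None:
            continue
        cost = sum(counts[s] for s in chosen) + rest[0]
        if best is None or cost < best[0]:
            best = (cost, [(s, level) for s in chosen] + rest[1])
    return best

counts1 = {"a": 7, "b": 3, "c": 3, "d": 1, "e": 1}
lens1 = {"a": 3, "b": 4, "c": 4, "d": 5, "e": 5}
cost1, moves1 = best_inc(3, 1, lens1, counts1, 6)

counts2 = {"a": 7, "b": 2, "c": 1}
lens2 = {"a": 3, "b": 4, "c": 5}
cost2, moves2 = best_inc(3, 1, lens2, counts2, 6)
print(cost1, moves1)
print(cost2, moves2)
```

Because each chosen symbol is moved to the next len before recursing, the second example is free to pick b twice, once at len 4 and again at len 5.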

And finally let's think again about how Package Merge is related to the "numismatic" or "coin collector problem".

If you play with this you will see what we're really doing is working on a two-cost accountancy. Each symbol has a cost in K which is determined only by its current codelen (2^-L). It also has a cost in total codelen CL which is determined only by its symbol count. We are trying to pay off a debt in K (or spend a credit in K) and maximize the value we get in CL.


4/18/2018
