
05-21-09 - Entropy Coding by Counting

There have been a number of interesting works in the last few years about different ways to do coding. Some references :

L. Oktem : Hierarchical Enumerative Coding and Its Applications in Image Compressing, Ph.D. thesis
"Group Testing for Wavelet-Based Image Compression"
"Entropy Coding using Equiprobable Partitioning"
"Binary Combinatorial Coding"

The common pattern in all these is the idea of coding probabilistic events using counting. Basically, rather than taking some skew-probability event and trying to create a code whose length matches the probability like L = - log2(P) , instead you try to combine events such that you create coding decision points that have 50/50 or at least power-of-2 probabilities.

Let's look at a simple example. Say you want to code a bunch of binary independent events, some 001100 sequence with a certain P(1) and P(0) that's constant. For now let's say p = P(1) is very low. (this is the common case in image coding, or really any predictive coding where the prediction is very good - a 1 means the prediction was wrong, 0 means it was right; e.g. high bands of wavelets are like this (though they aren't independent - more on that later)).

If you just try to code each bit one by one, the ideal code length is much less than one bit per symbol, so you'd have to arithmetic code or something. To create an event that's genuinely 50/50 we instead code a group of symbols. We want the probability that the whole group is off to be close to 50%, so :


0.50 = P(0) ^ N

N = log(0.50) / log(P(0)) = -1 / log2( P(0) )

for example if P(1) = 5% , then N = 13.5
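
Something like this computes it (just a sketch, the function name is made up) :

#include <math.h>

// pick the group size N such that P(all off) = (1-p)^N is close to 50%
int GroupSize(double p) // p = P(1) , assumed small
{
    double N = -1.0 / log2( 1.0 - p );  // solves (1-p)^N = 0.5
    return (int)( N + 0.5 );            // round to the nearest whole count
}

// GroupSize(0.05) = 14  (the exact value is about 13.5)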

so you take a group of 13 or 14 symbols and code whether they are all off with just 1 raw bit. That's an efficient coding operation because by design you've made it a 50/50 event. Now we have a bit of a puzzle for the next step. If you sent "it's all off" then you're done, but if you sent "something is on" you now have to send more data to specify what is on - and you've already conveyed some information by saying that something is on, so you don't want to waste it.

There's some crazy difficult combinatorics math to do if you wanted to figure out what the next event to code would be that was a perfect 50/50, but in the end our hand is forced because there's not really anything else we can do - we have to send a bit to say "is the first half on?". I'll present the group testing method without justification then try to justify it after the fact :

You know something in the group of N is on. First send a bit for whether the first half (the first [N/2]) is on. If it is, recurse into that range and repeat; once you're done with it you still have to handle the neighboring [N/2], which you don't know is on or off, so that costs another bit. If the first half was off, then the neighbor must be on, so recurse into it without sending a bit.

Now the justification. First, if there were only 1 value on in the range of N, it would have equal probability of being in the first or second half of the group, so it would be correct to send a bit to specify which half it's in - that's equivalent to signalling whether the first half is active. Now note that by design of our group size it's by far most likely that there is exactly 1 value on. For our example case above where P(1) = 5%, the chance that exactly 1 value is on (given that the group isn't all off) is around 70%. In that case our code is perfect; for higher counts it's not ideal, but those are less likely, so the net loss is small.

(BTW the binary code to locate a single item in a group is just the binary coding of the index; eg. first you send a bit to indicate if it's in the top half or bottom half, etc.)
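
In code the recursion looks something like this (just a sketch; put_bit stands in for whatever bit output you have, and the base case would hand off to coding the actual coefficient) :

void put_bit(int b);  // assumed bit-output function, not shown

// does any bit in [lo,hi) equal 1 ?
static int any_on(const int * bits, int lo, int hi)
{
    for(int i=lo;i<hi;i++)
        if ( bits[i] ) return 1;
    return 0;
}

// code a range that is known to contain at least one 1
static void code_known_on(const int * bits, int lo, int hi)
{
    if ( hi - lo == 1 )
        return; // single position : it must be the 1, nothing more to send

    int mid = (lo + hi)/2;

    int first_on = any_on(bits,lo,mid);
    put_bit(first_on); // "is the first half on ?"

    if ( first_on )
    {
        code_known_on(bits,lo,mid);

        // the second half could be anything, so it costs a bit :
        int second_on = any_on(bits,mid,hi);
        put_bit(second_on);
        if ( second_on )
            code_known_on(bits,mid,hi);
    }
    else
    {
        // first half empty -> the second half must be on, no bit needed
        code_known_on(bits,mid,hi);
    }
}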

The enumerative coding approaches take a similar but slightly different tack. Basically they code the # of values on in the sequence. Once you have the # of values on, then all arrangements are equally likely, so you just need to code the permutation id with a fixed length equiprobable code.


Say for example you want to code 16 binary events

You first send the # on.

If there's 1 on, there are 16 ways for that to happen - just send 4 bits

If there're 2 on, there are 16*15/2 = 120 ways
    you should send a 6.906 bit code, but you can just send a 7 bit code

If there're 3 on, you want to send one of 16*15*14/6 = 560 ways
    a 10 bit code would suck here, so you want a simple prefix code
    that usually writes 9 bits and occasionally 10

This is pretty efficient, particularly when the probability of a bit being on is very low so that you are usually in the low-count cases. In the high count cases it starts to suck because counting the permutations gets pretty complex. The other problem with this method is how to encode the # that are on. The problem is the probability of some # being on is a binomial distribution. There's no simple fixed-bit code for a binomial distribution (I don't think).
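
For the "which arrangement" part, one concrete way to get the equiprobable index is the combinatorial number system; roughly like this (names made up, assumes the count fits in 64 bits) :

#include <stdint.h>

// binomial coefficient C(n,k) ; each partial product is itself a binomial,
// so the integer division is exact
static uint64_t Binomial(int n, int k)
{
    if ( k < 0 || k > n ) return 0;
    uint64_t c = 1;
    for(int i=1;i<=k;i++)
        c = c * (uint64_t)(n - k + i) / (uint64_t)i;
    return c;
}

// bits[] has n entries with c of them on ; returns an index in [0, C(n,c))
// identifying which arrangement it is ; you'd then write that index with
// roughly log2(C(n,c)) bits (or a simple prefix code as above)
static uint64_t ArrangementIndex(const int * bits, int n)
{
    uint64_t index = 0;
    int seen = 0;
    for(int pos=0;pos<n;pos++)
    {
        if ( bits[pos] )
        {
            seen++;
            index += Binomial(pos,seen); // combinadic : add C(position, rank of this one)
        }
    }
    return index;
}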

Now, for N large, the binomial distribution approaches a normal distribution (or a Poisson when p is small). In that case we can get away with a Golomb code for the # of values on. Furthermore, for N large, the coding inefficiency of the # on becomes less important. In the real world I don't think N large is very practical.

Anyhoo I'm not sure any of this is super useful because in practice bits are never independent, so you want to do context coding to capture that, and it's a mess to do with these methods. The GTW (Group Testing for Wavelets) guys capture context information by classifying each sample into classes of similar expected statistics, and by interleaved scanning so that related coefficients don't wind up in the same class together. The result is impressive - they get compression comparable to ECECOW - but all the work of shuffling values around and doing multiple scans means they're surely slower than just arithmetic coding.

BTW not related but while I was looking at this I was confused by the following stupidity :


Exponential Distribution == Laplacian == Bernoulli Process == Geometric

All of them are P(n) ~= r ^ n

(or P(n) ~= e ^ ( - lambda * n ) if you prefer)

~ means proportional, ignoring normalization.

(Poisson is similar but different - P(n) ~= r ^ n / n! )

BTW the maximum likelihood estimate of the Laplacian parameter comes from the L1 norm :

P(n) ~= e ^ ( - lambda * n )  ; lambda = 1 / (mean absolute value of the samples, ie. (L1 norm)/count)

yeah yeah some of them are continuous and some discrete, some are one sided, some two sided, but people in compression play fast & loose with all of those things, so it's very confusing when you switch from one paper to another and they switch terms. Most people in compression describe the AC coefficients as having a "Laplacian" distribution.

BTW Golomb codes are optimal for coding Laplacian/Geometric distributions.


Mean of geometric distribution with ratio r is  m = r / (1 - r)
  (given the mean, r = m / (1+m) )

The optimal Golomb parameter k satisfies the relation :

r^k + r^(k+1) <= 1 < r^k + r^(k-1)

or, factoring out r^k :

r^k <= 1/(1+r) < r^(k-1)

You can find this k a few ways :

k = froundint( 0.5 - log(1+r)/log(r) );

k = ceil( log(1+r)/log(1/r) );

or

int GetK(double r)
{
    // smallest k such that r^k <= 1/(1+r) , ie. (1+r) * r^k <= 1
    double rk = r;  // r^k , starting at k = 1
    for(int k=1;;k++)
    {
        if ( (1.0 + r)*rk <= 1.0 )
            return k;
        rk *= r;
    }
}

and furthermore we often use the nice approximation

k = froundint( mean )

however, that is not really correct for large mean.  It works fine for mean up to around 5.

At mean = 6 , the true optimum k is 5

true entropy : 4.144196
k=5 , Golomb H : 4.174255
k=6 , Golomb H : 4.219482

you can see that at k=5 Golomb matches the entropy very well, but with the incorrect k = mean it drifts off.

At larger mean it's much worse -

at mean = 19, the correct k is = 14

true entropy : 5.727806
golomb_k : 14 , H : 5.761475
golomb_k : 19 , H : 5.824371
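
A quick way to check numbers like these is to just sum the expected Golomb code length over the geometric distribution; something like this throwaway sketch :

#include <math.h>
#include <stdio.h>

// length in bits of the Golomb code for n with parameter k :
// unary quotient plus truncated binary remainder
static int GolombLen(int n, int k)
{
    int q = n / k;
    int rem = n % k;
    int b = 0; while( (1<<b) < k ) b++;  // b = ceil(log2(k))
    int t = (1<<b) - k;                  // remainders below t use b-1 bits
    return (q + 1) + ( rem < t ? b - 1 : b );
}

int main()
{
    double m = 6.0;             // mean of the geometric source
    double r = m / (1.0 + m);   // geometric ratio
    int    nmax = 1000;         // the tail beyond this is negligible

    double H = 0, p = 1.0 - r;  // p = P(n) , starting at n = 0
    for(int n=0;n<nmax;n++) { H += - p * log2(p); p *= r; }
    printf("true entropy : %f\n",H);

    for(int k=5;k<=6;k++)
    {
        double L = 0; p = 1.0 - r;
        for(int n=0;n<nmax;n++) { L += p * GolombLen(n,k); p *= r; }
        printf("golomb_k : %d , avg len : %f\n",k,L);
    }
    return 0;
}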

At low mean of course Golomb gets bad because it can't code the 0 in less than 1 bit. You would obviously use run-lengths or something to fix that in practice. Golomb works very well down to a mean of about 1.0 (that's a geometric r of 0.5).

One neat thing about geometric distributions is that if you code the first symbol a different way, the remainder is still geometric. So you can code the 0 symbol with RLE, then if it's not 0, you code the symbols >= 1 with Golomb, and it's still geometric with the same r and k (conditioning on n >= 1 just shifts it : P(n | n>=1) ~= r^(n-1), same ratio). (though the Golomb code is still inefficient there, it's at least being used less often).


BTW I tried expressing k in terms of the mean :

k = ceil( log(1+r)/log(1/r) );

k = ( log( (1+2m) / (1+m))/log((1+m)/m) )

k = ( log( 1+2m ) - log(1+m) ) / ( log (1+m) - log(m) )

which is sort of handy - if you have 'm' as an integer you can use table lookups for the logs and just do one divide - but it's not really super simple.
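
or directly as code (the function name is made up; note you still want the ceil from the earlier expression) :

#include <math.h>

// optimal Golomb k straight from the mean m of the geometric source
int GolombKFromMean(double m)
{
    return (int) ceil( ( log(1.0 + 2.0*m) - log(1.0 + m) ) /
                       ( log(1.0 + m)     - log(m)       ) );
}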

what I'd like is just a Taylor expansion of this that works for m < 30 or so.  In particular it should be something like

k = m - 0.2 * m^2  (but not this)

k is like m for m small, but then it needs to correct down.  I couldn't work it out, maybe you can.

4 comments:

  1. Yeah, Golomb codes came into fashion during the early 90s. Two other interesting algorithms using them are PWC (very simple embedded Wavelet coder by Malvar, http://research.microsoft.com/pubs/69701/tr-99-26.pdf) and LOCO-I/JPEG-LS (http://www.hpl.hp.com/loco/). Both are based on Golomb residual coders combined with Golomb run-length coders for well-predicted areas, with different models for r. Malvar also has a nice variant on the theme here: http://research.microsoft.com/pubs/70033/tr-2003-95.pdf (it's patented, though).

  2. Yes... if you download the software at that HP-Labs page you will find that I wrote it ;)

  3. BTW a funny story about that - I worked at HP Labs for the team that invented LOCO and JPEG-LS (Weinberger, Seroussi, et al.). I wrote the first real implementation of JPEG-LS while I was there. The team that invented it didn't have real working code for their algorithm.

    I think that's actually quite common in academia and it explains why so many weird choices are made, and so many algorithms are published that just don't make sense as real programs.

    It kind of blows my mind - if you can't actually implement something and test its speed vs. performance, how are you supposed to know that it's useful?

  4. The "group testing for wavelets" thing involves a really powerful insight. And it seems trivially obvious in hindsight...
