Visualizing Binary Probability Updates

Some time ago I noted that the standard way we use binary probabilities to model statistics has some odd quirks : 06-12-14 - Some LZMA Notes

I thought I would make some pictures to make that more intuitive.

What I have here is a 4 bit symbol (alphabet of 16). It is coded with 4 binary probabilities in a tree. That is :

first code a bit for sym >= 8 or not (is it in [0-7] or [8-15])
then go to [0-3] vs [4-7] (for example)
then [4-5] vs [6-7]
lastly [6] vs [7]

One way you might implement this is something like :
U32 p0[16];

// sym is given
sym |= 16; // set a place-value marker bit

for 4 times
  int ctx = sym >> 4; int bit = (sym>>3)&1;
  arithmetic_code( p0[ctx] , bit );
  sym <<= 1;
and note that only 15 p0's are actually used, p0[0] is not accessed; p0[1] is the root probability for [0-7] vs [8-15] , p0[2] is [0-3] vs [4-7] , etc.

The standard binary probability update is exponential decay (the simplest geometric IIR filter) :

if ( bit )
    P0 -= P0 * lambda;
    P0 += (1.0 - P0) * lambda;
So what actually happens to the symbol probabilities when you do this?

Something a bit strange.

When you update symbol [5] (for example), his probability goes up. But so does the probability of his neighbor, [4]. And also his next neighbors [6 and 7]. And so on up the binary tree.

Now the problem is not the binary decomposition of the coding. Fundamentally binary decomposition and full alphabet coding are equivalent and should be able to get the same results if you form the correct correspondence between them.

(aside : For example, if you use a normal n0/n1 count model for probabilities, AND you initialize the counts such that the parents = the sum of the children, then what you get is : visualizing_binary_probability_updates_n0_n1.html - only the symbols you observe increase in probability. Note this is a binary probability tree, not full alphabet, but it acts exactly the same way.)

The problem (as noted in the previous LZMA post) is that the update rate at the top levels is too large.

Intuitively, when a 5 occurs you should update the binary choice [45] vs [67] , because a 5 is more likely, that pair is more likely. The problem is that [4] did not get more likely, and the probability of the group [45] represents the chance of either occuring. One of those things did get more likely, but one did not. The 4 in the group should act as a drag that keeps the group [45] probability from going up so much. Approximately, it should go up by half the speed of the leaf update.

The larger the group, the greater the drag of symbols that did not get more likely. eg. in [0-7] vs [8-15] when a 5 occurs, all of 0123467 did not get more likely, so 7 symbols act as drag and the rate should be around 1/8 the speed.

(see appendix at end for the exact speeds you should use if you wanted to adjust only one probability)

Perhaps its best to just look at the charts. These are histograms of probabilities (in percent). It starts all even, then symbol 5 occurs 20 X, then symbol 12 occurs 20 X. The probabilities are updated with the binary update scheme. lambda here is 1.0/8 , which is rather fast, used to make the visualization more dramatic.

What you should see : when 5 first starts getting incremented, the probabilities of its neighbors go way up, 4, and 6-7, and the whole 0-7 side. After 5 occurs many times, the probabilities of its neighbors goes down. Then when 12 starts beeing seen, the whole 8-15 side shoots up.

Go scroll down through the charts then we'll chat more.

This is a funny thing. Most people in the data compression community think of binary coding symbols this way as just a way to encode an alphabet using binary arithmetic coding. They aren't thinking about the fact that what they're actually doing is a strange sort of group-probability update.

In fact, in many applications if you take a binary arithmetic coder like this and replace it with an N-ary full alphabet coder, results get *worse*. Worse !? WTF !? It's because this weird group update thing that the binary coder is doing is often actually a good thing.

You can imagine scenarios where that could be the case. In some types of data, when a new symbol X suddenly starts occuring (when it had been rare before), then that means (X-1) and (X+2) may start to be seen as well. We're getting some kind of complicated modeling that novel symbols imply their neighbors novel probability should go up. In some type of data (such as BWT post-MTF) the probabilities act very much in binary tree groups. (see Fenwick for example). In other types of data that is very bit structured (such as ascii text and executable code), when a symbol with some top 3 bits occurs, then other symbols with those top bits are also more likely. That is, many alphabets actually have a natural binary decomposition where symbol groups in the binary tree do have joint probability.

This is one of those weird things that happens all the time in data compression where you think you're just doing an implementation choice ("I'll use binary arithmetic coding instead of full alphabet") but that actually winds up also doing modeling in ways you don't understand.

The visuals :

5 x 1

5 x 2

5 x 3

5 x 4

5 x 5

5 x 6

5 x 7

5 x 8

5 x 9

5 x 10

5 x 11

5 x 12

5 x 13

5 x 14

5 x 15

5 x 16

5 x 17

5 x 18

5 x 19

5 x 20

12 x 1

12 x 2

12 x 3

12 x 4

12 x 5

12 x 6

12 x 7

12 x 8

12 x 9

12 x 10

12 x 11

12 x 12

12 x 13

12 x 14

12 x 15

12 x 16

12 x 17

12 x 18

12 x 19

12 x 20

Appendix :

How to do a binary probability update without changing your neighbor :

Consider the simplest case : 4 symbol alphabet :

[0 1 2 3]

A = P[0 1] vs [2 3]

B = P[1] vs [1]

P(0) = A * B
P(1) = A * (1 - B)

now a 0 occurs

alpha = update rate for A
beta = update rate for B

A' = A + (1-A) * alpha
B' = B + (1-B) * beta

we want P(1)/P(23) to be unchanged by the update

(if that is unchanged, then P(1)/P(2) and P(1)/P(3) is unchanged)

that is, the increase to P(0) should scale down all other P's by the same factor

P(1)/P(23) = A * (1-B) / (1-A)

so require :

A' * (1-B') / (1-A') = A * (1-B) / (1-A)

solve for alpha [...algebra...]

alpha = A * beta / ( 1 - (1 - A) * beta )

that's as opposed to just using the same speed at all levels, alpha = beta.

In the limit of small beta, (slow update), this is just alpha = A * beta.

The update rate at the higher level is decreased by the probability of the updated subsection.

In the special case where all the probabilities start equal, A = 0.5, so this is just alpha = beta / 2 - the update rates should be halved at each level, which was the intuitive rule of thumb that I hand-waved about before.

No comments:

old rants