Visualizing Probability Update Schemes Part 2

A little bonus so we can look at the weird properties of the geometric / PAQ mixer.

Recall from the tour of mixing :

Geometric mixing is a product of experts. In the binary case, this reduces to linear mixing in logit (codelen difference) domain; this is what's used by PAQ. The coefficients of a geometric mixer are not really "weights" , in that they don't sum to one, and they can be negative.

In fact the combination of experts in geometric mixing is not convex; that is, the mixer does not necessarily interpolate them. Linear mixing stays within the simplex of the original experts, it can't extrapolate (because weights are clamped in [0,1]).

For example, say your best expert always gets the polarity of P right (favors bit 0 or 1 at the right time), but it always predicts P a bit too low. It picks P of 0.7 when it should be 0.8. The linear mixer can't fix that. It can at most give that expert a weight of 100%. The geometric mixer can fix that. It can apply an amplification factor that says - yes I like your prediction, but take it farther.

The geometric mixer coefficients are literally just a scaling of the experts' codelen difference. The gradient descent optimizes that coefficient to make output probabilities that match the observed data; to get there it can apply amplification or suppression of the codelen difference.

Let's see this in a very simple case : just one expert.

The expert here is "geo 5" , (a 1/32 geometric probability update). That's pretty fast for real world use but it looks very slow in these little charts. We apply a PAQ style logit mixer with a *very* fast "learning rate" to exaggerate the effect (1000X faster than typical).

Note the bit sequence here is different than the last post; I've simplified it here to just 30 1's then 10 0's to make the effect more obvious.

The underlying expert adapts slowly : (P(1) in green, codelen difference in blue)

Note that even in the 0000 range, geo 5 is still favoring P(1) , it hasn't forgotten all the 1's at the start. Codelen difference is still positive (L(0) > L(1)).

With the PAQ mixer applied to just a single expert :

In the 111 phase, the mixer "weight" (amplification factor) goes way up; it stabilizes around 4. It's learning that the underlying expert has P(1) on the right side, so our weight should be positive, but it's P(1) is way too low, so we're scaling up the codelen difference by 4X.

In the 000 phase, the mixer quickly goes "whoah wtf this expert is smoking crack" and the weight goes *negative*. P(1) goes way down to around 15% even though the underlying expert still has a P(1) > 50%

Now in practice this is not how you use mixers. The learning rate in the real world needs to be way lower (otherwise you would be shooting your weights back and forth all the time, overreacting to the most recent coding). In practice the weight converge slowly to an ideal and stay there for long periods of time.

But this amplification compensation property is real, just more subtle (more like 1.1X rather than 4X).

For example, perhaps one of your models is something like a deterministic context (PPM*) model. You find the longest context that has seen any symbols before. That maximum-length context usually has very sparse statistics but can be a good predictor; how good it is exactly depends on the file. So that expert contributes some P fo the next symbol based on what it sees in the deterministic context. It has to just make a wild guess because it has limited observations (perhaps it uses secondary statistics). Maybe it guesses P = 0.8. The mixer can learn that no, on this particular file the deterministic model is in fact better than that, so I like you and amplify you even by a bit more, your coefficient might converge to 1.1 (on another file, maybe the deterministic expert is not so great, its weight might go to 0.7, you're getting P in the right direction, but it's not as predictable as you think).


Visualizing Probability Update Schemes

Unlike the last "visualizing" post, I don't have a real clear point to this one. This is more like "let's look at some pretty pictures".

I do think it helps to get intuition by actually looking at charts & graphs, rather than just always look at the end result score for something like compression.

We're going to look at binary probability estimation schemes.

Binary probability estimation is just filtering the history of bits seen in some way.

Each bit seen is a digital signal of value 0 or 1. You just want some kind of smoothing of that signal. Boom, that's your probability estimate, P(1). All this nonsense about Poisson and Bernoulli processes and blah blah, what a load of crap. It's just filtering.

For example, the "Laplace" estimator

P(1) = (n1 + c)/(n0 + n1 + 2c)

That's just the average of the entire signal seen so far. Add up all the bits, divide by number. (countless papers have been written about what c should be (c= 1? c = 1/2?), why not 1/4? or 3/4? Of course there's no a-priori reason for any particular value of c and in the real world it should just be tweaked to maximize results.)

That's the least history-forgetting estimator of all, it weights everything in the past equally (we never want an estimator where older stuff is more important). In the other direction you have lots of options - FIR and IIR filters. You could of course take the last N bits and average them (FIR filter), or take the last N and average with a weight that smoothly goes from 0 to 1 across the window (sigmoidal or just linear). You can of course take an IIR average, that's

average <- (average*N + last)/(N+1)

Which is just the simplest IIR filter, aka geometric decay, "exponential smoothing" blah blah.

Anyway, that's just to get us thinking in terms of filters. Let's look at some common filters and how they behave.

Green bars = probability of a 1 bit after the bit at bottom is processed.

Blue bars = log odds = codelen(0) - codelen(1)

Laplace : count n0 & n1 :

After 10 1's and 10 0's, P is back to 50/50 ; we don't like the way this estimator has no recency bias.

Geometric IIR with updshift 2 (very fast) :

When a fast geometric gets to the 010101 section is wiggles back and forth wildly. If you look at the codelen difference in that region you can see it's wasting something like 1.5 bits on each coding (because it's always wiggled in just the wrong way, eg. biased towards 0 right before a 1 occurs).

Note that geometric probabilities have an exponential decay shape. (specifically , 0.75^n , where 0.75 is from (1 - 1/4) and the 4 is because shift 2). HOWEVER the codelen difference steps are *linear*.

The geometric probability update (in the direction of MPS increasing probability) is very close to just adding a constant to codelen difference (logit). (this just because P *= lambda , so log(P) += log(lambda)). You can see after the 111 region, the first 0 causes a big negative step in codelen difference, and then once 0 becomes the MPS the further 0 steps are the same constant linear step.

Geometric IIR with updshift 5 (fast) :

Shift 5 is still fast by real world standards but looks slow compared to the crazy speed of geo 2. You can see that the lack of an initial adaptation range hurts the ramp up on the 1111 portion. That is, "geo 5" acts like it has 33 symbols in its history; at the beginning it actually has 0, so it has a bogus history of 33 times P = 0.5 which gives it a lot of intertia.

FIR 8-tap window flat weight :

Just take the last 8 and average. (I initialize the window to 01010 which is why it has the two-tap stair step in the beginning). In practice you can't like the probabilities get to 0 or 1 completely, you have to clamp at some epsilon, and in fact you need a very large epsilon here because over-estimating P(1) and then observing a 0 bit is very very bad (codelen of 0 goes to infinity fast as P->1). The "codelen difference" chart here uses a P epsilon of 0.01

bilateral filter :

It's just filtering, right? Why not? This bilateral filter takes a small local average (4 tap) and weights contribution of those local averages back through time. The weight of each sample is multiplied by e^-dist * e^-value_difference. That is, two terms (hence bilateral), weight goes down as you go back in time, but also based on how far away the sample is from the most recent one in value. ("sample" is the 4-tap local average)

What the bilateral does is that as you get into each region, it quickly forgets the previous region. So as you get into 000 it forgets the 111, and once it gets into 0101 it pretty solidly stabilizes at P = 0.5 ; that is, it's fast in forgetting the past when the past doesn't match the present (fast like geo 2), BUT it's not fast in over-responding to the 0101 wiggle like geo 2.

There are an infinity of different funny impulse responses you could do here. I have no idea if any of them would actually help compression (other than by giving you more parameters to be able to train to your data, which always helps, sort of).

mixed :

Linear mix of geo 2 (super fast) and geo 5 (still quite fast). mixing weight starts at 0.5 and is updated with online gradient descent. The mixer here has an unrealistically fast learning rate to exaggerate the effect.

The weight shown is the weight of geo 2, the faster model. You can see that in the 111 and 000 regions the weight of geo 2 shoots way up (because geo 5 is predicting P near 0.5 still), and then in the 0101 region the weight of geo 2 gradually falls off because the slow geo 5 does better in that region.

The mixer immediately does something that none of the other estimators did - when the first 0 bit occurs, it takes a *big* step down, almost down to 50/50. It's an even faster step than geo 2 did on its own. (same thing with the first 1 after the 000 region).

Something quite surprising popped out to me. The probability steps in the 111 and 000 regions wind up linear. Note that both the underlying estimators (geo 2 and geo 5) has exponential decay curving probabilities, but the interaction with the mixing weight cancels that out and we get linear. I'm not sure if that's a coincidence of the particular learning rate, but it definitely illustrates to us that a mixed probability can behave unlike any of its underlying experts!


Visualizing Binary Probability Updates

Some time ago I noted that the standard way we use binary probabilities to model statistics has some odd quirks : 06-12-14 - Some LZMA Notes

I thought I would make some pictures to make that more intuitive.

What I have here is a 4 bit symbol (alphabet of 16). It is coded with 4 binary probabilities in a tree. That is :

first code a bit for sym >= 8 or not (is it in [0-7] or [8-15])
then go to [0-3] vs [4-7] (for example)
then [4-5] vs [6-7]
lastly [6] vs [7]

One way you might implement this is something like :
U32 p0[16];

// sym is given
sym |= 16; // set a place-value marker bit

for 4 times
  int ctx = sym >> 4; int bit = (sym>>3)&1;
  arithmetic_code( p0[ctx] , bit );
  sym <<= 1;
and note that only 15 p0's are actually used, p0[0] is not accessed; p0[1] is the root probability for [0-7] vs [8-15] , p0[2] is [0-3] vs [4-7] , etc.

The standard binary probability update is exponential decay (the simplest geometric IIR filter) :

if ( bit )
    P0 -= P0 * lambda;
    P0 += (1.0 - P0) * lambda;
So what actually happens to the symbol probabilities when you do this?

Something a bit strange.

When you update symbol [5] (for example), his probability goes up. But so does the probability of his neighbor, [4]. And also his next neighbors [6 and 7]. And so on up the binary tree.

Now the problem is not the binary decomposition of the coding. Fundamentally binary decomposition and full alphabet coding are equivalent and should be able to get the same results if you form the correct correspondence between them.

(aside : For example, if you use a normal n0/n1 count model for probabilities, AND you initialize the counts such that the parents = the sum of the children, then what you get is : visualizing_binary_probability_updates_n0_n1.html - only the symbols you observe increase in probability. Note this is a binary probability tree, not full alphabet, but it acts exactly the same way.)

The problem (as noted in the previous LZMA post) is that the update rate at the top levels is too large.

Intuitively, when a 5 occurs you should update the binary choice [45] vs [67] , because a 5 is more likely, that pair is more likely. The problem is that [4] did not get more likely, and the probability of the group [45] represents the chance of either occuring. One of those things did get more likely, but one did not. The 4 in the group should act as a drag that keeps the group [45] probability from going up so much. Approximately, it should go up by half the speed of the leaf update.

The larger the group, the greater the drag of symbols that did not get more likely. eg. in [0-7] vs [8-15] when a 5 occurs, all of 0123467 did not get more likely, so 7 symbols act as drag and the rate should be around 1/8 the speed.

(see appendix at end for the exact speeds you should use if you wanted to adjust only one probability)

Perhaps its best to just look at the charts. These are histograms of probabilities (in percent). It starts all even, then symbol 5 occurs 20 X, then symbol 12 occurs 20 X. The probabilities are updated with the binary update scheme. lambda here is 1.0/8 , which is rather fast, used to make the visualization more dramatic.

What you should see : when 5 first starts getting incremented, the probabilities of its neighbors go way up, 4, and 6-7, and the whole 0-7 side. After 5 occurs many times, the probabilities of its neighbors goes down. Then when 12 starts beeing seen, the whole 8-15 side shoots up.

Go scroll down through the charts then we'll chat more.

This is a funny thing. Most people in the data compression community think of binary coding symbols this way as just a way to encode an alphabet using binary arithmetic coding. They aren't thinking about the fact that what they're actually doing is a strange sort of group-probability update.

In fact, in many applications if you take a binary arithmetic coder like this and replace it with an N-ary full alphabet coder, results get *worse*. Worse !? WTF !? It's because this weird group update thing that the binary coder is doing is often actually a good thing.

You can imagine scenarios where that could be the case. In some types of data, when a new symbol X suddenly starts occuring (when it had been rare before), then that means (X-1) and (X+2) may start to be seen as well. We're getting some kind of complicated modeling that novel symbols imply their neighbors novel probability should go up. In some type of data (such as BWT post-MTF) the probabilities act very much in binary tree groups. (see Fenwick for example). In other types of data that is very bit structured (such as ascii text and executable code), when a symbol with some top 3 bits occurs, then other symbols with those top bits are also more likely. That is, many alphabets actually have a natural binary decomposition where symbol groups in the binary tree do have joint probability.

This is one of those weird things that happens all the time in data compression where you think you're just doing an implementation choice ("I'll use binary arithmetic coding instead of full alphabet") but that actually winds up also doing modeling in ways you don't understand.

The visuals :

5 x 1

5 x 2

5 x 3

5 x 4

5 x 5

5 x 6

5 x 7

5 x 8

5 x 9

5 x 10

5 x 11

5 x 12

5 x 13

5 x 14

5 x 15

5 x 16

5 x 17

5 x 18

5 x 19

5 x 20

12 x 1

12 x 2

12 x 3

12 x 4

12 x 5

12 x 6

12 x 7

12 x 8

12 x 9

12 x 10

12 x 11

12 x 12

12 x 13

12 x 14

12 x 15

12 x 16

12 x 17

12 x 18

12 x 19

12 x 20

Appendix :

How to do a binary probability update without changing your neighbor :

Consider the simplest case : 4 symbol alphabet :

[0 1 2 3]

A = P[0 1] vs [2 3]

B = P[1] vs [1]

P(0) = A * B
P(1) = A * (1 - B)

now a 0 occurs

alpha = update rate for A
beta = update rate for B

A' = A + (1-A) * alpha
B' = B + (1-B) * beta

we want P(1)/P(23) to be unchanged by the update

(if that is unchanged, then P(1)/P(2) and P(1)/P(3) is unchanged)

that is, the increase to P(0) should scale down all other P's by the same factor

P(1)/P(23) = A * (1-B) / (1-A)

so require :

A' * (1-B') / (1-A') = A * (1-B) / (1-A)

solve for alpha [...algebra...]

alpha = A * beta / ( 1 - (1 - A) * beta )

that's as opposed to just using the same speed at all levels, alpha = beta.

In the limit of small beta, (slow update), this is just alpha = A * beta.

The update rate at the higher level is decreased by the probability of the updated subsection.

In the special case where all the probabilities start equal, A = 0.5, so this is just alpha = beta / 2 - the update rates should be halved at each level, which was the intuitive rule of thumb that I hand-waved about before.

old rants