Visualizing Probability Update Schemes

Unlike the last "visualizing" post, I don't have a real clear point to this one. This is more like "let's look at some pretty pictures".

I do think it helps to get intuition by actually looking at charts & graphs, rather than just always look at the end result score for something like compression.

We're going to look at binary probability estimation schemes.

Binary probability estimation is just filtering the history of bits seen in some way.

Each bit seen is a digital signal of value 0 or 1. You just want some kind of smoothing of that signal. Boom, that's your probability estimate, P(1). All this nonsense about Poisson and Bernoulli processes and blah blah, what a load of crap. It's just filtering.

For example, the "Laplace" estimator

P(1) = (n1 + c)/(n0 + n1 + 2c)

That's just the average of the entire signal seen so far. Add up all the bits, divide by number. (countless papers have been written about what c should be (c= 1? c = 1/2?), why not 1/4? or 3/4? Of course there's no a-priori reason for any particular value of c and in the real world it should just be tweaked to maximize results.)

That's the least history-forgetting estimator of all, it weights everything in the past equally (we never want an estimator where older stuff is more important). In the other direction you have lots of options - FIR and IIR filters. You could of course take the last N bits and average them (FIR filter), or take the last N and average with a weight that smoothly goes from 0 to 1 across the window (sigmoidal or just linear). You can of course take an IIR average, that's

average <- (average*N + last)/(N+1)

Which is just the simplest IIR filter, aka geometric decay, "exponential smoothing" blah blah.

Anyway, that's just to get us thinking in terms of filters. Let's look at some common filters and how they behave.

Green bars = probability of a 1 bit after the bit at bottom is processed.

Blue bars = log odds = codelen(0) - codelen(1)

Laplace : count n0 & n1 :

After 10 1's and 10 0's, P is back to 50/50 ; we don't like the way this estimator has no recency bias.

Geometric IIR with updshift 2 (very fast) :

When a fast geometric gets to the 010101 section is wiggles back and forth wildly. If you look at the codelen difference in that region you can see it's wasting something like 1.5 bits on each coding (because it's always wiggled in just the wrong way, eg. biased towards 0 right before a 1 occurs).

Note that geometric probabilities have an exponential decay shape. (specifically , 0.75^n , where 0.75 is from (1 - 1/4) and the 4 is because shift 2). HOWEVER the codelen difference steps are *linear*.

The geometric probability update (in the direction of MPS increasing probability) is very close to just adding a constant to codelen difference (logit). (this just because P *= lambda , so log(P) += log(lambda)). You can see after the 111 region, the first 0 causes a big negative step in codelen difference, and then once 0 becomes the MPS the further 0 steps are the same constant linear step.

Geometric IIR with updshift 5 (fast) :

Shift 5 is still fast by real world standards but looks slow compared to the crazy speed of geo 2. You can see that the lack of an initial adaptation range hurts the ramp up on the 1111 portion. That is, "geo 5" acts like it has 33 symbols in its history; at the beginning it actually has 0, so it has a bogus history of 33 times P = 0.5 which gives it a lot of intertia.

FIR 8-tap window flat weight :

Just take the last 8 and average. (I initialize the window to 01010 which is why it has the two-tap stair step in the beginning). In practice you can't like the probabilities get to 0 or 1 completely, you have to clamp at some epsilon, and in fact you need a very large epsilon here because over-estimating P(1) and then observing a 0 bit is very very bad (codelen of 0 goes to infinity fast as P->1). The "codelen difference" chart here uses a P epsilon of 0.01

bilateral filter :

It's just filtering, right? Why not? This bilateral filter takes a small local average (4 tap) and weights contribution of those local averages back through time. The weight of each sample is multiplied by e^-dist * e^-value_difference. That is, two terms (hence bilateral), weight goes down as you go back in time, but also based on how far away the sample is from the most recent one in value. ("sample" is the 4-tap local average)

What the bilateral does is that as you get into each region, it quickly forgets the previous region. So as you get into 000 it forgets the 111, and once it gets into 0101 it pretty solidly stabilizes at P = 0.5 ; that is, it's fast in forgetting the past when the past doesn't match the present (fast like geo 2), BUT it's not fast in over-responding to the 0101 wiggle like geo 2.

There are an infinity of different funny impulse responses you could do here. I have no idea if any of them would actually help compression (other than by giving you more parameters to be able to train to your data, which always helps, sort of).

mixed :

Linear mix of geo 2 (super fast) and geo 5 (still quite fast). mixing weight starts at 0.5 and is updated with online gradient descent. The mixer here has an unrealistically fast learning rate to exaggerate the effect.

The weight shown is the weight of geo 2, the faster model. You can see that in the 111 and 000 regions the weight of geo 2 shoots way up (because geo 5 is predicting P near 0.5 still), and then in the 0101 region the weight of geo 2 gradually falls off because the slow geo 5 does better in that region.

The mixer immediately does something that none of the other estimators did - when the first 0 bit occurs, it takes a *big* step down, almost down to 50/50. It's an even faster step than geo 2 did on its own. (same thing with the first 1 after the 000 region).

Something quite surprising popped out to me. The probability steps in the 111 and 000 regions wind up linear. Note that both the underlying estimators (geo 2 and geo 5) has exponential decay curving probabilities, but the interaction with the mixing weight cancels that out and we get linear. I'm not sure if that's a coincidence of the particular learning rate, but it definitely illustrates to us that a mixed probability can behave unlike any of its underlying experts!

No comments:

old rants