5/03/2018

Whirlwind Tour of Mixing Part 3

On to online gradient descent.

First we need to talk about how our experts are mixed. There are two primary choices :


Linear mixing :

M(x) = Sum{ w_i P_i(x) } / Sum{ w_i }

or = Sum{ w_i P_i(x) } if the weights are normalized

Geometric mixing :

M(x) = Prod{ P_i(x) ^ c_i } / Sum(all symbols y){ Prod{ P_j(y) ^ c_j } }

(the coefficients c_i do not need to be normalized)

P = probability from each expert
M = mixed probability

require

Sum(all symbols y){ M(y) } = 1

Linear mixing is just a weighted linear combination. Geometric mixing is also known as "product of expert" and "logarithmic opinion pooling".
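
For concreteness, here's a tiny sketch of both mixes over a general alphabet. This is my own throwaway Python, not from any particular codebase; expert_probs[i][y] is P_i(y). The geometric version works in log space only to avoid underflow when there are many experts.

import math

def linear_mix(weights, expert_probs):
    # M(y) = Sum{ w_i * P_i(y) } / Sum{ w_i }
    nsym = len(expert_probs[0])
    wsum = sum(weights)
    return [sum(w * P[y] for w, P in zip(weights, expert_probs)) / wsum
            for y in range(nsym)]

def geometric_mix(coeffs, expert_probs):
    # M(y) = Prod{ P_i(y)^c_i } / Sum(all z){ Prod{ P_j(z)^c_j } }
    # computed in log space to avoid underflow with many experts
    nsym = len(expert_probs[0])
    logs = [sum(c * math.log(P[y]) for c, P in zip(coeffs, expert_probs))
            for y in range(nsym)]
    hi = max(logs)
    e = [math.exp(l - hi) for l in logs]
    total = sum(e)
    return [x / total for x in e]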

Geometric mixing can be expressed as a linear combination of the logs, if normalization is ignored :


log( M_unnormalized ) = Sum{ c_i * log( P_i ) }

In the large alphabet case, there is no simple logarithmic form of geometric M (with normalization included).

In the special case of a binary alphabet, Geometric mixing has a special simple form (even with normalization) in log space :


M = Prod{ P_i ^ c_i } / ( Prod{ P_i ^ c_i } + Prod{ (1 - P_i) ^ c_i } )

M = 1 / ( 1 + Prod{ ((1 - P_i)/P_i) ^ c_i } )

M = 1 / ( 1 + Prod{ e^( c_i * log((1 - P_i)/P_i) ) } )

M = 1 / ( 1 + e^ Sum { c_i * log((1 - P_i)/P_i) } )

M = sigmoid( Sum { c_i * logit(P_i) } )

which we looked at in the previous post, getting some background on logit and sigmoid.
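
As a sanity check on that identity, here's the sigmoid-of-logits form next to the brute-force product form (again my own throwaway Python; probs[i] is P_i(bit=1)). Both should return the same M(bit=1) for any coefficients.

import math

def logit(p):   return math.log(p / (1.0 - p))
def sigmoid(x): return 1.0 / (1.0 + math.exp(-x))

def geo_mix_binary(coeffs, probs):
    # M(bit=1) = sigmoid( Sum{ c_i * logit(P_i) } )
    return sigmoid(sum(c * logit(p) for c, p in zip(coeffs, probs)))

def geo_mix_binary_direct(coeffs, probs):
    # M(bit=1) = Prod{ P_i^c_i } / ( Prod{ P_i^c_i } + Prod{ (1-P_i)^c_i } )
    top = math.prod(p ** c for c, p in zip(coeffs, probs))
    bot = math.prod((1.0 - p) ** c for c, p in zip(coeffs, probs))
    return top / (top + bot)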

We looked before at how logit space strongly preserves skewed probabilities. We can see the same thing directly in the geometric mixer.

Obviously if an expert has a P near 0, that multiplies into the product and produces an output near 0. Zero times anything = zero. Less obviously, if an expert has a P near 1, the other terms in the denominator sum (the ones that aren't the numerator) go to zero, so the mix becomes 1, regardless of what the other P's are.

So assume we are using these mixtures. We want to optimize (minimize) code length. The general procedure for gradient descent is :


the code length cost of a previous coding event is -log2( M(x) )
evaluated at the actually coded symbol x

(in compression the total output file length is just the sum of all these event costs)

To improve the w_i coefficients over time :

after coding symbol x, see how w_i could have been improved to reduce -log2( M(x) )

take a step of w_i in that direction

the direction is the negative gradient :

delta(w_i) ~= - d/dw_i { -log2( M(x) ) }

delta(w_i) ~= (1/M) * d/dw_i{ M(x) }

(~ means proportional; the actual delta we want is scaled by some time step rate)

So to find the steps we should take, we just have to do some derivatives.

I won't work through them in ascii, but they're relatively simple.

(the geometric mixer update for the non-binary alphabet case is complex and I won't cover it any more here. For practical reasons we will assume linear mixing for non-binary alphabets, and either choice is available for binary alphabets. For the non-binary geometric step see Mattern "Mixing Strategies in Data Compression")

When you work through the derivatives what you get is :


Linear mixer :

delta(w_i) ~= ( P_i / M ) - 1


Geometric mixer (binary) :

delta(w_i) ~= (1 - M) * logit(P_i)

Let's now spend a little time getting familiar with both of them.

First the linear mixer :


put r_i = P_i(x) / M(x)

ratio of expert i's probability to the mixed probability, on the symbol we just coded (x)

Note : Sum { w_i * r_i } = 1  (if the weights are normalized)

Your goal is to have M = 1

That is, you gave 100% probability to the symbol that actually occurred

The good experts are the ones with high P(x)  (x actually occured)

The experts that beat the current mix have P > M

Note that because of Sum { w_i * r_i } = 1

Whenever there is an expert with r_i > 1 , there must be another with r_i < 1

The linear mixer update is just :

delta(w_i) ~= r_i - 1

experts that were better than the current mix get more weight
experts worse than the current mix get less weight

I think it should be intuitive that this is a reasonable update.

Note that delta(w) does not preserve normalization. That is :


Sum{ delta(w_i) } != 0 , and != constant

however

Sum{ w_i * delta(w_i) } = Sum{ w_i * P_i / M } - Sum { w_i } = Sum {w_i} - Sum{w_i} = 0

the deltas *do* sum to zero when weighted.

We'll come back to that later.

If we write the linear update as :


delta(w_i) ~= ( P_i - M ) / M

it's obvious that the delta is proportional to (1/M). That is, when we got the overall mixed prediction M right, M is near 1 and the step is small. When we got the prediction wrong, M is near 0 and the step we take can be very large. (note that P_i does not have to also be near zero when M is near zero; the w_i for that expert might have been small). In particular, say most of the experts were confused and predicted the wrong thing, so for the actual coded event x their P(x) is near zero. One guy said no, what about x! His P(x) was higher. But everyone ignores the one guy piping up (his weight is low). The mixed M(x) will be very low, we pay a huge codelen (very bad log cost), the step (1/M) will be huge, and it bumps up the one guy who got P(x) right.
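
One full step of the linear mixer might look something like this (a minimal sketch; the rate, the weight floor, and the re-normalize are my own arbitrary choices) :

def linear_mixer_step(weights, expert_probs, x, rate=0.01, w_min=1e-6):
    # x = the symbol that was actually coded ; expert_probs[i][y] = P_i(y)
    M = sum(w * P[x] for w, P in zip(weights, expert_probs)) / sum(weights)
    for i, P in enumerate(expert_probs):
        r = P[x] / M                          # ratio of expert i to the mix
        weights[i] += rate * (r - 1.0)        # delta(w_i) ~= r_i - 1
        weights[i] = max(weights[i], w_min)   # keep weights positive
    total = sum(weights)                      # re-normalize
    for i in range(len(weights)):
        weights[i] /= total
    return M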

Now playing with the geometric update :


Geometric mixer (binary) :

delta(w_i) ~= (1 - M) * logit(P_i)

this is evaluated for the actually coded symbol x (x = 0 or 1)

let's use the notation that ! means "for the opposite symbol"
eg. if you coded a 0 , then ! means "for a 1"

eg. !P = 1-P and !M = 1-M

(in the previous post I put up a graph of the relationship between L and !L, codelens of opposing bits)

We can rewrite delta(w_i) as :

delta(w_i) ~= !M * ( !L_i - L_i )

recalling logit is the codelen difference

that's

delta(w_i) ~= (mixed probability of the opposite symbol) * (ith model's codelen of opposite - codelen of coded symbol)

I find this form to be more intuitive; let's talk through it with words.

The step size is proportional to !M , the opposite symbol mixed probability. This is a bit like the (1/M) scaling in the linear mixer. When we got the estimate right, the M(x) of the coded symbol will go to 1, so !M goes to zero, our step size goes to zero. That is, if our mixed probability M was right, we take no step. If it ain't broke don't fix it. If we got the M very wrong, we thought a 0 bit would not occur but then it did, M is near zero, !M goes to one.

While the general direction of scaling here is the same as the linear mixer, the scaling is very different. For example in the case where we were grossly wrong, M -> 0 , the linear mixer steps like (1/M) - it can be a huge step. The geometric mixer steps like (1-M) , it hits a finite maximum.

The next term (!L - L) means weights go up for experts with a large opposite codelen. eg. if the codelen of the actually coded symbol was low, that's a good expert, then the codelen of the opposite must be high, so his weight goes up.

Again this is similar to the linear mixer, which took a step ~ P(x) , so experts who were right go up. Again the scaling is different, and in the opposite way. In the linear mixer, whether experts get the P right or wrong, the step just scales in [0,1] ; it varies, but not drastically. Conversely, in the geometric case, because the step is proportional to the codelen difference, it goes to infinity as P gets extremely wrong.

Say you're an expert who gets it all wrong and guesses P(x) = 0 (near zero), and then x occurs. In the linear case the step is just (-1). In the geometric case, L(x) -> inf , !L -> 0 , your weight takes a negative infinite step. Experts who get it grossly wrong are not gradually diminished, they are immediately crushed. (in practice we clamp weights to some chosen finite range, or clamp the step to a maximum)
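
A sketch of one step of the binary geometric mixer, with the clamping just mentioned (the rate and clamp values here are arbitrary choices of mine, not recommendations) :

import math

def logit(p):   return math.log(p / (1.0 - p))
def sigmoid(x): return 1.0 / (1.0 + math.exp(-x))

def geometric_mixer_step(coeffs, probs, bit, rate=0.02, max_logit=16.0):
    # probs[i] = P_i(bit=1) ; bit = the actually coded bit (0 or 1)
    logits = [max(-max_logit, min(max_logit, logit(p))) for p in probs]
    M1 = sigmoid(sum(c * l for c, l in zip(coeffs, logits)))   # M(bit=1)
    M = M1 if bit == 1 else 1.0 - M1                           # M of the coded bit
    for i, l in enumerate(logits):
        l_x = l if bit == 1 else -l          # logit of P_i(coded bit)
        coeffs[i] += rate * (1.0 - M) * l_x  # delta(c_i) ~= (1 - M) * logit(P_i)
    return M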

Repeating again what the difference is :

In linear mixing, if the ensemble makes a big joint mistake, the one guy out of the set who was right can get a big weight boost. If the ensemble does okay, individual experts that were way off do not get big adjustments.

In geometric mixing, individual experts that make a big mistake take a huge weight penalty. This is only mildly scaled by whether the ensemble was good or not.

Another funny difference :

With geometric mixing, experts that predict a 50/50 probability (equal chance of 0 or 1) do not change w at all. If you had a static expert that just always said "I dunno, 50/50?" , his weight would never ever change. But he would also never contribute to the mixture either, since as noted in the previous post equivocal experts simply have no effect on the sum of logits. (you might think that if you kept seeing 1 bits over and over, P(1) should be near 100% and experts that keep saying "50%" should have their weight go to crap; that does not happen in geometric mixing).

In contrast, with linear mixing, 50/50 experts would either increase or decrease in weight depending on whether they were better or worse than the previous net mixture. eg. if M was 0.4 on the symbol that occurred, the weight of a P = 0.5 expert would go up, since he's better than the mix and somebody must be worse than him.

Note again that in geometric mixing (for binary alphabets), there is no explicit normalization needed. The w_i's are not really weights, they don't sum to 1. They are not in [0,1] but go from [-inf,inf]. It's not so much "weighting of experts" as "adding of experts" but in delta-codelen space.

Another funny thing about geometric mixing is what the "learning rate" really does.

With the rate explicitly in the steps :


Linear mixer :

w_i += alpha * ( ( P_i / M ) - 1 )

(then w_i normalized)


Geometric mixer (binary) :

c_i += alpha * ( (1 - M) * logit(P_i) )

(c_i not normalized)

In the linear case, alpha really is a learning rate. It controls the speed of how fast a better expert takes weight from a worse expert. They redistribute the finite weight resource because of normalization.

That's not the case in the geometric mixer. In fact, alpha just acts as an overall scale to all the weights. Because all the increments to w over time have been scaled by alpha, we can just factor it out :


no alpha scaling :

c_i += ( (1 - M) * logit(P_i) )

factor out alpha :

M = sigmoid( alpha * Sum { c_i * logit(P_i) } )

The "learning rate" in the geometric mixer really just scales how the sum of logits is stretched back to probability. In fact you can think of it as a normalization factor more than a learning rate.

The geometric mixer has yet another funny property in that it doesn't obviously pass through the best predictor. It doesn't even obviously pass through the case of a single expert. (with linear mixing it's obvious that crap predictors' weights go to zero, and a single expert would just pass through at weight=1)


aside you may ignore; quick proof that it does :

say you have only one expert P
say that P is in fact perfectly correct
0 bits occur at rate P and P is constant

initially you have some mixing coefficient 'c' (combining alpha in with c)

The mixed probability is :

M = sigmoid( c * logit(P) )

which is crap.  c is applying a weird scaling we don't want.

c should go to 1, because P is the true probability.

Does it?

delta(c) after one step of symbol 0 is (1-M) * logit(P)
for symbol 1 it's M * logit(1-P)

since P is the true probability, the probabilistic average step is :

average_delta(c) = P * ( (1-M) * logit(P) ) + (1-P) * ( (M) * logit(1-P) )

logit(1-P) = - logit(P)

average_delta(c) = P * logit(P) - M * logit(P) = (P - M) * logit(P)

in terms of l = logit(P)
using P = sigmoid(l)

average_delta(c) = ( sigmoid(l) - sigmoid( c * l ) ) * l

this is a differential equation for c(t)

you could solve it (anyone? anyone?)

but even without solving it's clear that it does go to c = 1 as t -> inf

if c < 1 , average_delta(c) is > 0 , so c goes up
if c > 1 , average_delta(c) is < 0 , so c goes down

in the special case of l small (P near 50/50), sigmoid is near linear,
then the diffeq is easy and c decays exponentially to 1, like c ~= 1 + const * e^(-t)

Note that this was done with alpha set to 1 (included in c). If we leave alpha separate, then c should converge to 1/alpha in order to pass through a single predictor. That's what I mean by alpha not really being a "learning rate" but more of a normalization factor.
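
If you don't feel like solving the diffeq, a quick stochastic simulation shows the same behavior (my own toy code; here P is the true probability of a 1 bit and alpha is folded into c) :

import math, random

def logit(p):   return math.log(p / (1.0 - p))
def sigmoid(x): return 1.0 / (1.0 + math.exp(-x))

P = 0.8          # true probability of a 1 bit ; the single expert predicts exactly this
c = 3.0          # initial mixing coefficient, deliberately wrong
rate = 0.01
random.seed(1)
for t in range(500000):
    bit = 1 if random.random() < P else 0
    M1 = sigmoid(c * logit(P))            # mixed prediction from the single expert
    M = M1 if bit == 1 else 1.0 - M1
    l_x = logit(P) if bit == 1 else -logit(P)
    c += rate * (1.0 - M) * l_x
print(c)   # hovers around 1.0 ; with a finite rate it jitters rather than settling exactly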

So anyway, despite geometric mixing being a bit odd, it works in practice; it also works in theory (the gradient descent can be proved to converge with reasonable bounds and so on). In practice it has the nice property of not needing any divides for normalization (in the binary case). It does of course require evaluating logit & sigmoid, which are typically done by table lookup.


Bonus section : Soft Bayes

The "Soft Bayes" method is not a gradient descent, but it is very similar, so I will mention it here. Soft Bayes uses a weighted linear combination of the experts, just like linear gradient descent mixing.


Linear mixer :

d_i = ( P_i / M ) - 1

w_i += rate * d_i

Soft Bayes :

w_i *= ( 1 + rate * d_i )

Instead of adding on the increment, Soft Bayes multiplies by 1 + the increment.

Multiplying by 1 + delta is obviously similar to adding delta and the same intuitive arguments we made before about why this works still apply. Maybe I'll repeat them here :

General principles we want in a mixer update step :


The step should be larger when the mixer produced a bad result

  the worse M(x) predicted the observed symbol x, the larger the step should be
  this comes from scaling like (1/M) or (1-M)

The step should scale up weights of experts that predicted the symbol well

  eg. larger P_i(x) should go up , smaller P_i(x) should go down

  a step proportional to (P-M) does this in a balanced way (some go up, some go down)
  but a step of (P - 0.5) works too
  as does logit(P)

We can see a few properties of Soft Bayes :

Soft Bayes :

w_i *= ( 1 + rate * d_i )

w_i' = w_i + w_i * rate * d_i

w_i += rate * w_i * d_i

This is the same as linear gradient descent step, but each step d_i is scaled by w_i

Soft Bayes is inherently self-normalizing :

Sum{w_i'} = Sum{w_i} + Sum{w_i * rate * d_i}

Sum{ w_i * d_i } = Sum{ ( w_i * P_i / M ) - w_i } = ( M/M - 1 ) * Sum{ w_i } = 0

therefore :

Sum{w_i'} = Sum{w_i} 

the sum of weights is not changed

if they started normalized, they stay normalized
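
A sketch of one Soft Bayes step (the rate and weight floor are my own arbitrary choices; the re-normalize is only there to undo the clamp, since the multiplicative update itself preserves the sum) :

def soft_bayes_step(weights, expert_probs, x, rate=0.02, w_min=1e-6):
    # weights assumed normalized ; expert_probs[i][y] = P_i(y) ; x = coded symbol
    M = sum(w * P[x] for w, P in zip(weights, expert_probs))
    for i, P in enumerate(expert_probs):
        d = P[x] / M - 1.0
        weights[i] = max(weights[i] * (1.0 + rate * d), w_min)
    # the multiplicative update preserves Sum{w_i} by itself ; only the clamp
    # above can perturb it, so this re-normalize is just tidying up
    total = sum(weights)
    for i in range(len(weights)):
        weights[i] /= total
    return M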

The reason that "soft bayes" is so named can be seen thusly :

Soft Bayes :

w_i *= ( 1 + rate * d_i )

d_i = ( P_i / M ) - 1

w_i *= (1-rate) + rate * ( P_i / M )

at rate = 1 :

w_i *= P_i / M

ignoring normalization this is 

w_i *= P_i

which is "Beta Weighting" , which is aka "naive Bayes".

So Soft Bayes uses "rate" as a lerp factor to blend in a naive Bayes update. When rate is 1, it's "fully Bayes", when rate is lower it keeps some amount of the previous weight lerped in.

Soft Bayes seems to work well in practice. Soft Bayes could have the bad property that if weights get to zero they can never rise, but in practice we always clamp weights to a non-zero range so this doesn't occur.

A quick aside on why Beta weighting is the naive Bayesian solution :


ignoring normalization through this aside
(also ignoring priors and all the proper stuff you should do if you're Bayesian)
(see papers for thorough details)

Mix experts :

M = Sum{ w_i * P_i }

P_i(x) is the probability of x given expert i ; P(x|i)

The w_i should be the probability of expert i, given past data , that is :

P(x|past) = Sum{ P(i|past) * P(x|i) }

P(i|past) is just ~= P(past|i)
  (ignoring prior P(i) and normalizing terms P(past) that factor out in normalization)

Now *if* (big if) the past events are all independent random events, then :

P(past|i) = P(x0|i)*P(x1|i)*P(x2|i)... for past symbols x0,x1,...

since P("ab") = P(a)*P(b) etc.

which you can compute incrementally as :

w_i *= P_i

the Beta Weight update.

In words, the weight for the expert should be the probability of that expert having made the data we have seen so far. That probability is just the P_i that expert has predicted for all past symbols, multiplied together.

Now this is obviously very wrong. Our symbols are very much not independent; they have strong conditional dependence, which is exactly how we get compression out of modeling the stream. The failure of that assumption may be why true beta weighting is poor in practice.
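
In code the incremental Beta Weight update is just this (a sketch; the re-normalize is mine, to keep the running product from underflowing, and doesn't change the mix) :

def beta_weight_step(weights, expert_probs, x):
    # w_i accumulates the product of expert i's P over all past coded symbols
    for i, P in enumerate(expert_probs):
        weights[i] *= P[x]
    # re-normalize so the running product doesn't underflow ;
    # dividing all weights by a common factor doesn't change the normalized mix
    total = sum(weights)
    for i in range(len(weights)):
        weights[i] /= total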


This Whirlwind Tour has been brought to you by the papers of Mattern & Knoll, PAQ of Mahoney, and the radness of RAD.

Small note about practical application :

You of course want to implement this with integers and store things like weights and probabilities in fixed point. I have found that the exact way you handle truncation of that fixed point is very important. In particular you need to be careful about rounding & biasing and try not to just lose small values. In general mixing appears to be very "fiddly"; it has a very small sweet spot of tuned parameters and small changes to them can make results go very bad.
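
For example, one small thing that matters: when multiplying two fixed-point values, round to nearest rather than truncating. A toy illustration (the 16.16 format here is just an example, not a recommendation) :

ONE = 1 << 16   # 1.0 in 16.16 fixed point

def fixed_mul_trunc(a, b):
    # plain truncation : always drops the low bits downward, which slowly
    # biases small weights and probabilities toward zero over many updates
    return (a * b) >> 16

def fixed_mul_round(a, b):
    # round to nearest : adds half an LSB before shifting, removing that bias
    return (a * b + (1 << 15)) >> 16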
