Whirlwind Tour of Mixing Part 2

Before we dive into online gradient descent, let's look at a couple of building blocks.

In many cases we've been talking about mixers that work with arbitrary alphabets, but let's specialize now to binary. In that case, each expert just makes one P, say P(0), and implicitly P(1) = 1 - P(0).

We will make use of the logit ("stretch") and logistic sigmoid ("squash"), so let's look at them a bit first.

logit(x) = ln( x / (1-x) )

sigmoid(x) = 1 / (1 + e^-x )  (aka "logistic")

logit is "log odds" of a probability

logit takes [0,1] -> [-inf,inf]

sigmoid takes [-inf,inf] -> [0,1]

logit( sigmoid(x) ) = x  (they are inverses)
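A quick sketch of the pair in Python (function names are mine):

```python
import math

def logit(p):
    # "stretch" : log odds, maps (0,1) -> (-inf,+inf)
    return math.log(p / (1.0 - p))

def sigmoid(x):
    # "squash" : logistic, maps (-inf,+inf) -> (0,1)
    return 1.0 / (1.0 + math.exp(-x))

# inverses of each other :
for p in (0.1, 0.5, 0.9):
    assert abs(sigmoid(logit(p)) - p) < 1e-12
```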

I'm a little bit fast and loose with the base of the logs and exponents sometimes, because the difference comes down to scaling factors which wash out in constants in the end. I'll try to be not too sloppy about it.

You can see plots of them here for example. The "logistic sigmoid" is an annoyingly badly named function; "logistic" is way too similar to "logit", which is very confusing, and "sigmoid" is just a general shape description, not a specific function.

What's more intuitive in data compression typically is the base-2 variants :

log2it(P) = log2( P / (1-P) )

log2istic(t) = 1 / (1 + 2^-t )

which just differs by a scale of the [-inf,inf] range. (it's a matter of whether you measure your codelen in units of "bits" or "e's", i.e. nats).
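The base-2 pair, and the scale relationship to the natural-log versions, in a small sketch (again the names are mine):

```python
import math

def log2it(p):
    # base-2 log odds : a codelen difference measured in bits
    return math.log2(p / (1.0 - p))

def log2istic(t):
    return 1.0 / (1.0 + 2.0 ** (-t))

p = 0.9
# differs from the natural-log logit only by a factor of ln(2) :
assert abs(log2it(p) * math.log(2.0) - math.log(p / (1.0 - p))) < 1e-12
# still inverses :
assert abs(log2istic(log2it(p)) - p) < 1e-12
```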

In data compression there's an intuitive way to look at the logit. It's the difference of codelens of symbol 0 and 1 arising from the probability P.

log2it(P) = log2( P / (1-P) )

= log2( P ) - log2( 1 - P )

if P = P0

then L0 = codelen of 0 = - log2(P0)

L1 = - log2(1-P0)

log2it(P0) = L1 - L0

which I like to write as :

log2it = !L - L

where !L means the len of the opposite symbol
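Checking that identity numerically (P0 = 0.8 is just an arbitrary example value):

```python
import math

P0 = 0.8                     # P(symbol 0), arbitrary example
L0 = -math.log2(P0)          # codelen of symbol 0, in bits
L1 = -math.log2(1.0 - P0)    # codelen of symbol 1, in bits

log2it_P0 = math.log2(P0 / (1.0 - P0))
assert abs(log2it_P0 - (L1 - L0)) < 1e-12   # log2it = !L - L
```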

Working with the logit in data compression is thus very natural and has nice properties :

1. It is antisymmetric under exchange of the symbol alphabet 0->1. Under that exchange P -> 1-P , so logit -> -logit. The magnitude stays the same, so mixers that act on logits behave the same way. eg. you aren't favoring 0 or 1, which of course you shouldn't be. This is why using the *difference* of codelens between L0 and L1 is the right thing to do, not just L0, since the 0->1 exchange acts on L0 alone in a strange way.

Any mixer we construct should behave the same way if we swap 0 and 1 bits. If you tried to construct a valid mixer like that using only functions of L0, it would be a mess. Doing it with the logit = (!L - L), it works automatically.

2. It's linear in codelen. Our goal in compression is to minimize total codelen, so having a parameter that is in the space we want makes that natural.

3. It's unbounded. Working with P's is awkward because of their finite range, which requires clamping and theoretically projection back into the valid cube.

4. It's centered on zero for P = 0.5 and large for skewed P. We noted before that what we really want is to gather models that are skewed, and logit does this naturally.

Let's look at the last property.

Doing mixing in logit domain implicitly weights skewed experts, as we looked at before in "functional weighting". With logit, skewed P's near 0 or 1 are stretched out toward -inf or +inf. This gives them very high implicit "weight", that is, even after mixing with a P near 0.5, the result is still skewed.

Mix two probabilities with equal weight :

0.5 & 0.9

linear : 0.7
logit  : 0.75

0.5 & 0.99

linear : 0.745
logit  : 0.90867

0.5 & 0.999

linear : 0.74950
logit  : 0.96933

Logit blending is sticky at skewed P.
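The table above is easy to reproduce; here's an equal-weight two-expert blend both ways (a sketch, names mine):

```python
import math

def logit(p):   return math.log(p / (1.0 - p))
def sigmoid(x): return 1.0 / (1.0 + math.exp(-x))

def mix_linear(p1, p2):
    return 0.5 * (p1 + p2)

def mix_logit(p1, p2):
    # equal-weight blend in logit space, squash back to a probability
    return sigmoid(0.5 * (logit(p1) + logit(p2)))

print(round(mix_linear(0.5, 0.9),   5), round(mix_logit(0.5, 0.9),   5))  # 0.7 0.75
print(round(mix_linear(0.5, 0.99),  5), round(mix_logit(0.5, 0.99),  5))  # 0.745 0.90867
print(round(mix_linear(0.5, 0.999), 5), round(mix_logit(0.5, 0.999), 5))  # 0.7495 0.96933
```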

A little plot of what the L -> !L function looks like :

Obviously it is symmetric around (1,1) where both codelens are 1 bit, P is 0.5 ; as P gets skewed either way one len goes to zero and the other goes to infinity.

logistic sigmoid is the map back from codelen delta to probability. It has the nice property for us that no matter what you give it, the output is a valid normalized probability in [0,1] ; you can scale or screw up your logits in whatever way you want and you still get a normalized probability (which makes it codeable).

Linear combination of logits (codelen deltas) is equal to geometric (multiplicative) mixing of probabilities :

M = mixed probability

linear combination in logit space, then transform back to probability :

M = sigmoid( Sum{ c_i * logit( P_i ) } )

M = 1 / ( 1 + e ^ - Sum{ c_i * logit( P_i ) } )

e ^ logit(P) = P/(1-P)
e ^ -logit(P) = (1-P)/P
e ^ - c * logit(P) = ((1-P)/P) ^ c

M = 1 / ( 1 + Prod{ (( 1 - P_i ) / P_i)^c_i } )

M = Prod{ P_i ^ c_i } / ( Prod{ P_i ^ c_i } + Prod{ ( 1 - P_i )^c_i } )

What we have is a geometric blend (like a geometric mean, but with arbitrary exponents, which are the mixing factors), and the denominator is just the normalizing factor so that M0 and M1 sum to one.

Note that the blend factors (c_i) do not need to be normalized here. We get a normalized probability even if the c_i's are not normalized. That is an important property for us in practice.
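A numeric check of the equivalence (a sketch; the coefficients are deliberately unnormalized to show they don't need to be):

```python
import math

def logit(p):   return math.log(p / (1.0 - p))
def sigmoid(x): return 1.0 / (1.0 + math.exp(-x))

def mix_logit_space(ps, cs):
    # linear combination of logits, squashed back
    return sigmoid(sum(c * logit(p) for p, c in zip(ps, cs)))

def mix_geometric(ps, cs):
    # geometric (multiplicative) mix : Q / (Q + !Q)
    q  = math.prod(p ** c         for p, c in zip(ps, cs))
    nq = math.prod((1.0 - p) ** c for p, c in zip(ps, cs))
    return q / (q + nq)

ps = [0.9, 0.3, 0.6]
cs = [1.3, 0.7, 2.0]   # unnormalized, don't sum to 1
assert abs(mix_logit_space(ps, cs) - mix_geometric(ps, cs)) < 1e-12
```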

This is a classic machine learning method called "product of experts". In the most basic form, all the c_i = 1, the powers are all 1, the probabilities from each expert are just multiplied together.

I'm calling the c_i "coefficients" not "weights" to emphasize the fact that they are not normalized.

Note that while we don't need to normalize the c_i , and in practice we don't (and in the "product of experts" usage with all c_i = 1 they are of course not normalized), that has weird consequences, and let's not sweep them under the rug.

For one thing, the overall scale of c_i changes the blend. In many cases we hand-waved about scaling because normalization will wipe out any overall scaling factors of weights. In this case the magnitude scale of the c_i absolutely matters. Furthermore, the scale of the c_i has nothing to do with a "learning rate" or blend strength. It's just an extra exponent applied to your probabilities that does some kind of nonlinear mapping to how they are mixed. e.g. doing c_i *= 2 is like mixing squared probabilities instead of linear probabilities.

For another, with unnormalized c_i - only skewed probabilities contribute. Mixing in more 50/50 experts does *nothing*. It does not bias your result back towards 50/50 at all!

We can see this both in the logit formulation and in the geometric mean formulation :

M = sigmoid( Sum{ c_i * logit( P_i ) } )

Now add on a new expert P_k with some coefficient c_k

M' = sigmoid( Sum{ c_i * logit( P_i ) + c_k * logit( P_k ) } )

if P_k = 50% , then logit(P_k) = 0

M' = M

even if c_k is large.

(if you normalized the c's, the addition of c_k would bring down the coefficients of the other contributions)

The sum of logits only adds up codelen deltas.

Experts with large codelen deltas add their scores to the vote;
experts with small codelen deltas simply don't contribute at all.
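Seeing that in code: tack a P = 0.5 expert onto a logit-space mix, with a huge coefficient, and nothing changes (a sketch, names mine):

```python
import math

def logit(p):   return math.log(p / (1.0 - p))
def sigmoid(x): return 1.0 / (1.0 + math.exp(-x))

def mix(ps, cs):
    return sigmoid(sum(c * logit(p) for p, c in zip(ps, cs)))

ps, cs = [0.9, 0.3], [1.5, 0.8]
m  = mix(ps, cs)
m2 = mix(ps + [0.5], cs + [100.0])  # add a 50/50 expert, huge coefficient
assert abs(m - m2) < 1e-12          # M' = M : logit(0.5) = 0, it contributes nothing
```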

In the geometric mean :

Q = Prod{ P_i ^ c_i }

M = Q / ( Q + !Q )

(where ! means "of the opposing symbol", eg. P -> (1-P) )

Add on another expert with P = 0.5 :

Q' = Q * 0.5 ^ c_k

!Q' = !Q * 0.5 ^ c_k

M' = Q * 0.5 ^ c_k / ( Q * 0.5 ^ c_k + !Q * 0.5 ^ c_k)

the same term multiplied around just factors out

M' = M

This style of mixing is like adding experts, not blending them. Loud shouty experts with skewed codelen deltas add on.

Next up, how to use this for gradient descent mixing in compression.
