Whirlwind Tour of Mixing Part 1

I'm going to try to run through mixing techniques in data compression in a wild quick tour. Part 1 will cover techniques up to but not including online gradient descent.

First some background :

We have probabilities output by some "experts" , P_i(x) for the ith expert for symbol x.

We wish to combine these in some way to create a mixed probability.

The experts are a previous stage of probability estimation. In data compression these are usually models; they could be different orders of context model, or the same order at different speeds, or other probability sources.

(you could also just have two static experts, one that says P(0) = 100% and one that says P(1) = 100%, then the way you combine them is actually modeling the binary probability; in that case the mixer is the model. What I'm getting at is there's actually a continuum between the model & the mixer; the mixer is doing some of the modeling work. Another common synthetic case used for theoretical analysis is to go continuous and have an infinite number of experts with P(0) = x for all x in [0,1], then the weight w(x) is again the model.)

Now, more generally, the experts could also output a confidence. This is intuitively obviously something that should be done, but nobody does it, so we'll just ignore it and say that our work is clearly flawed and always has room for improvement. (for example, a context that has seen n0 = 1 and n1 = 0 might give P(0) = 90% , but that is a far less confident expert that one that has n0 = 90 and n1 = 10 , which should get much higher weight). What I'm getting at here is the experts themselves have a lot more information than just P, and that information should be used to feed the mixer. We will not do that. We treat the mixer as independent of the experts and only give it the probabilities.

(you can fix this a bit by using side information (not P) from the model stages as context for the mixer coefficients; more on that later).

Why mixing? A lot of reasons. One is that your data may be switching character over time. It might switch from very-LZ-like to very-order-3-context like to order-0-like. One option would be to try to detect those ranges and transmit switch commands to select models, but that is not "online" (one pass) and the overhead to send the switch commands winds up being comparable to just mixing all those models (much research shows that these alternatives are asymptotically bounded). Getting perfect switch transmission right is actually way harder than mixing; you have to send the switch commands efficiently, you have to find the ideal switch ranges with consideration of the cost of transmitting the switch (which is a variant of a "parse-statistics feedback" problem which is one of the big unsolved bugbears in compression).

Another reason we like mixing is that our models might be good for part of the character of the data, but not all of it. For example it's quite common to have some data that's text-like (has strong order-N context properties) but is also LZ-like (perhaps matches at offset 4 or 8 are more likely, or at rep offset). It's hard to capture both those correlations with one model. So do two models and mix.

On to techniques :

1. Select

The most elementary mixer just selects one of the models. PPMZ used this for LOE (Local Order Estimation). PPMZ's primary scheme was to just select the order with highest MPS (most probable symbol) probability. Specifically if any order was deterministic (MPS P = 100%) it would be selected.

2. Fixed Weights

The simplest mixer is just to combine stages using fixed weights. For example :

P = (3 * P1 + P2)/4

This is used very effectively in BCM, for example, which uses it in funny ways like mixing the probability pre-APM with the probability post-APM using fixed weight ratios.

Fixed linear combinations like this of multi-speed adaptive counters is not really mixing, but is a way to create higher order IIR filters. (as in the "two speed" counter from BALZ; that really makes a 2-tap IIR from 1-tap IIR geometric decay filters)

3. Per-file/chunk transmitted Fixed Weights

Instead of fixed constants for mixing, you can of course optimize the weights per file (or per chunk) and transmit them.

If you think of the goal of online gradient descent as finding the best fit of parameters and converging to a constant - then this makes perfect sense. Why do silly neural net online learning when we can just statically optimize the weights?

Of course we don't actually want our online learners to converge to constants - we want them to keep adapting; there is no steady state in real world data compression. All the online learning research which assumes convergence and learning rate decreasing over time and various asymptotic gaurantees - it's all sort of wrong for us.

M1 (Mattern) uses optimized weights and gets quite high compression. This method is obviously very asymmetric, slow to encode & fast(er) to decode. You can also approximate it by making some pre-optimized weights for various types of data and just doing data type detection to pick a weight set.

4. Functional Weighting

You can combine P's using only functions of the P's, with no separate state (weight variables that are updated over time). The general idea here is that more skewed P's are usually much more powerful (more likely to be true) than P's near 50/50.

If you have one expert on your panel saying "I dunno, everything is equally likely" and another guy saying "definitely blue!" , you are usually best weighting more to the guy who seems sure. (unless the positive guy is a presidential candidate, then you need to treat lack of considering both sides of an argument or admitting mistakes as huge character flaws that make them unfit to serve).

In PPMZ this was used for combining the three SEE orders; the weighting used there was 1/H (entropy).

weight as only a function of P

the idea is to give more weight to skewed P's

W = 1 / H(P) = 1 / ( P * log(P) + (1-P) * log(1-P) )

e^ -H would be fine too

very simple skews work okay too :

W = ABS(P - 0.5)

just higher weight for P away from 0.5

then linearly combine models :

P_mixed = Sum{ W_i * P_i } / Sum{ W_i }

This method has generally been superceded by better approaches in most cases, but it is still useful for creating a prior model for initial seeding of mixers, such as for 2d APM mixing (below).

I'll argue in a future post that Mahoney's stretch/squash (logit/logistic sigmoid) is a form of this (among other things).

5. Secondary Estimation Mixing / 2d APM Mixing

As mentioned at the end of the last post, you can use a 2d APM for mixing. The idea is very simple, instead of one P as input to the APM, you take two probabilities from different models and look them up as a 2d matrix.

You may want to use the stretch remap and bilinear filtering so that as few bits as possible are needed to index the APM table, in order to preserve as much density as possible. (eg. 6 bits from each P = 12 bit APM table)

This allows you to model completely arbitrary mixing relationships, not just "some weight to model 1, some to model 2". It can do the fully P-input dependent mixing, like if model 1 P is high, it gets most of the influence, but if model 1 P is low, then model 2 takes over.

The big disadvantage of this method is sparsity. It's really only viable for mixing two or three models, you can't do a ton. It's best on large stable data sources, not small rapidly changing data, because the mixing weight takes longer to respond to changes.

One of the big advantages of single-scalar weighted mixing is that it provides a very fast adapting stage to your compressor. Say you have big models with N states, N might be a million, so each individual context only gets an update every millionth byte. Then you might have a (1d) APM stage to adapt the P's; this has 100 entries or so, so it only gets 1/100 updates. But then you blend the models with a scalar mixing weight - that updates after every byte. The scalar mixing weight is always 100% up to date with recent changes in the data, it can model switches very quickly while the rest of the big model takes much more time to respond. 2d APM Mixing loses this advantage.

6. Beta Weighting

Beta Weighting was introduced by CTW ; see also the classic work of Volf on switching, etc. (annoyingly this means something different in stock trading).

Beta Weighting is an incremental weight update. It does :

P_mixed = Sum W_i * P_i / Sum W_i

(linear weighted sum of model probabilities, normalized)

after seeing symbol X

W_i *= P_i(X)

We can understand this intuitively :

If a model had a high probability for the actually occuring symbol (it was a good model),
it will have P_i large

W_i *= 0.9 or whatever

if the model had low probability for the actual symbol (it was wrong, claiming that symbol was unlikely),
it will get scaled down :

W_i *= 0.1 for example

Furthermore, the weights we get this way are just negative exponentials of the cost in each model. That is :

P = 2^ - L

L = codelen

The weights winds up being the product of all coded probabilities seen so far :

W at time t = P_(t-1) * P(t-2) * P(t-3) * P(t_4) ...

W = 2^( - Sum_t{ L_(t-1) + L(t-2) + L(t-3) ... } )

W = 2^( - codelen if you used this model for all symbols )

W ~ e^ - cost

If you like we can also think of this another way :

since W is normalized, overall scaling is irrelevant

at each update, find the model that produced the lowest codelen (or highest P)

this is the model we would have chosen with an ideal selector

so the cost of any other model is :

cost_i = L_i - L_best

the CTW Beta weight rule is :

W_i *= 2^ - cost_i

(and the best model weight stays the same, its cost is zero)


W = 2^ - total_cost

total_cost += cost_i

In the special case of two models we can write :

The normalized weight for model 1 is :

W1 = ( 2^-C1 ) / ( 2^-C1 + 2^-C2 ) = 1 / ( 1 + 2^(C1-C2) ) = log2istic( C2-C1 )

W2 = 1 - W1 = log2istic( C1-C2 )

"log2istic" = logistic sigmoid with base 2 instead of e

We only need to track the delta of costs , dC = C1-C2

P = log2istic(-dC) * P1 + log2istic(dC) * P2

(equivalently, just store W1 and do W1 *= P1/P2)

We'll see in the next post how logistic is a transform from delta-codelens to probabilities, and this will become neater then.

Now Beta weighting is obviously wrong. It weights every coding event equally, no matter how old it is. In compression we always want strong recency modeling, so a better cost update is something like :

dC' = dC * aging + rate * (C1-C2)

where aging = 0.95 or whatever , something < 1
and rate is not so much the learning rate (it could be normalized out) as it is an exponential modifier to the weighting.

Beta weighting as used in CTW stores a weight per node. Each node weights counts as seen at that level vs. the next deeper level. In contrast, in context mixers like PAQ the weight for order N vs order N+1 is stored outside of the model, not per context but as a global property of the data. The PAQ way is much more flexible because you can always use some context for the weight if you like.

Next post we'll get into online gradient descent.

No comments:

old rants