Recall from the tour of mixing :

Geometric mixing is a product of experts. In the binary case, this reduces to linear mixing in logit (codelen difference) domain; this is what's used by PAQ. The coefficients of a geometric mixer are not really "weights" , in that they don't sum to one, and they can be negative.

In fact the combination of experts in geometric mixing is not convex; that is, the mixer does not necessarily interpolate them. Linear mixing stays within the simplex of the original experts, it can't extrapolate (because weights are clamped in [0,1]).

For example, say your best expert always gets the polarity of P right (favors bit 0 or 1 at the right time), but it always predicts P a bit too low. It picks P of 0.7 when it should be 0.8. The linear mixer can't fix that. It can at most give that expert a weight of 100%. The geometric mixer can fix that. It can apply an amplification factor that says - yes I like your prediction, but take it farther.

The geometric mixer coefficients are literally just a scaling of the experts' codelen difference. The gradient descent optimizes that coefficient to make output probabilities that match the observed data; to get there it can apply amplification or suppression of the codelen difference.

Let's see this in a very simple case : just one expert.

The expert here is "geo 5" , (a 1/32 geometric probability update). That's pretty fast for real world use but it looks very slow in these little charts. We apply a PAQ style logit mixer with a *very* fast "learning rate" to exaggerate the effect (1000X faster than typical).

Note the bit sequence here is different than the last post; I've simplified it here to just 30 1's then 10 0's to make the effect more obvious.

The underlying expert adapts slowly : (P(1) in green, codelen difference in blue)

Note that even in the 0000 range, geo 5 is still favoring P(1) , it hasn't forgotten all the 1's at the start. Codelen difference is still positive (L(0) > L(1)).

With the PAQ mixer applied to just a single expert :

In the 111 phase, the mixer "weight" (amplification factor) goes way up; it stabilizes around 4. It's learning that the underlying expert has P(1) on the right side, so our weight should be positive, but it's P(1) is way too low, so we're scaling up the codelen difference by 4X.

In the 000 phase, the mixer quickly goes "whoah wtf this expert is smoking crack" and the weight goes *negative*. P(1) goes way down to around 15% even though the underlying expert still has a P(1) > 50%

Now in practice this is not how you use mixers. The learning rate in the real world needs to be way lower (otherwise you would be shooting your weights back and forth all the time, overreacting to the most recent coding). In practice the weight converge slowly to an ideal and stay there for long periods of time.

But this amplification compensation property is real, just more subtle (more like 1.1X rather than 4X).

For example, perhaps one of your models is something like a deterministic context (PPM*) model. You find the longest context that has seen any symbols before. That maximum-length context usually has very sparse statistics but can be a good predictor; how good it is exactly depends on the file. So that expert contributes some P fo the next symbol based on what it sees in the deterministic context. It has to just make a wild guess because it has limited observations (perhaps it uses secondary statistics). Maybe it guesses P = 0.8. The mixer can learn that no, on this particular file the deterministic model is in fact better than that, so I like you and amplify you even by a bit more, your coefficient might converge to 1.1 (on another file, maybe the deterministic expert is not so great, its weight might go to 0.7, you're getting P in the right direction, but it's not as predictable as you think).

## No comments:

Post a Comment