5/03/2018

Whirlwind Tour of Mixing Part 3

On to online gradient descent.

First we need to talk about how our experts are mixed. There are two primary choices :


Linear mixing :

M(x) = Sum{ w_i P_i(x) } / Sum{ w_i }

or = Sum{ w_i P_i(x) } if the weights are normalized

Geometric mixing :

M(x) = Prod{ P_i(x) ^ c_i } / Sum(all symbols y){ Prod{ P_j(y) ^ c_j } }

(the coefficients c_i do not need to be normalized)

P = probability from each expert
M = mixed probability

require

Sum(all symbols y){ M(y) } = 1

Linear mixing is just a weighted linear combination. Geometric mixing is also known as "product of experts" and "logarithmic opinion pooling".
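As a concrete sketch (hypothetical Python, names mine), here are both mixes for a general alphabet; P[i][y] is expert i's probability of symbol y :

```python
import math

def linear_mix(w, P):
    # M(y) = Sum_i{ w_i * P_i(y) } / Sum_i{ w_i }
    wsum = sum(w)
    return [sum(wi * Pi[y] for wi, Pi in zip(w, P)) / wsum
            for y in range(len(P[0]))]

def geometric_mix(c, P):
    # M(y) = Prod_i{ P_i(y)^c_i } / Sum_z{ Prod_i{ P_i(z)^c_i } }
    # the c_i need not be normalized; the denominator renormalizes
    q = [math.prod(Pi[y] ** ci for ci, Pi in zip(c, P))
         for y in range(len(P[0]))]
    z = sum(q)
    return [qy / z for qy in q]
```

Either way the output M sums to one over the alphabet, which is what makes it codeable.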

Geometric mixing can be expressed as a linear combination of the logs, if normalization is ignored :


log( M_unnormalized ) = Sum{ c_i * log( P_i ) }

In the large alphabet case, there is no simple logarithmic form of geometric M (with normalization included).

In the special case of a binary alphabet, Geometric mixing has a special simple form (even with normalization) in log space :


M = Prod{ P_i ^ c_i } / ( Prod{ P_i ^ c_i } + Prod{ (1 - P_i) ^ c_i } )

M = 1 / ( 1 + Prod{ ((1 - P_i)/P_i) ^ c_i } )

M = 1 / ( 1 + Prod{ e^( c_i * log((1 - P_i)/P_i) ) } )

M = 1 / ( 1 + e^ Sum { c_i * log((1 - P_i)/P_i) } )

M = sigmoid( Sum { c_i * logit(P_i) } )

which we looked at in the previous post, getting some background on logit and sigmoid.
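In code the binary form is tiny; a sketch (my own Python, checked against the explicit product form above) :

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def geometric_mix_binary(c, P):
    # M = sigmoid( Sum{ c_i * logit(P_i) } ) ; P_i = expert i's P(bit)
    return sigmoid(sum(ci * logit(Pi) for ci, Pi in zip(c, P)))
```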

We looked before at how logit space strongly preserves skewed probabilities. We can see the same thing directly in the geometric mixer.

Obviously if an expert has a P near 0, that multiplies into the product and produces an output near 0; zero times anything is zero. Less obviously, if an expert has a P near 1, his probability for the other symbols is near 0, which kills the denominator terms for those symbols (the ones that aren't the numerator), so the mix goes to 1 regardless of what the other experts' P's are.

So assume we are using these mixtures. We want to optimize (minimize) code length. The general procedure for gradient descent is :


the code length cost of a previous coding event is -log2( M(x) )
evaluated at the actually coded symbol x

(in compression the total output file length is just the sum of all these event costs)

To improve the w_i coefficients over time :

after coding symbol x, see how w_i could have been improved to reduce -log2( M(x) )

take a step of w_i in that direction

the direction is the negative gradient :

delta(w_i) ~= - d/dw_i { -log2( M(x) ) }

delta(w_i) ~= (1/M) * d/dw_i{ M(x) }

(~ means proportional; the actual delta we want is scaled by some time step rate)

so to find the steps we should take we just have to do some derivatives.

I won't work through them in ascii, but they're relatively simple.

(the geometric mixer update for the non-binary alphabet case is complex and I won't cover it any more here. For practical reasons we will assume linear mixing for non-binary alphabets, and either choice is available for binary alphabets. For the non-binary geometric step see Mattern "Mixing Strategies in Data Compression")

When you work through the derivatives what you get is :


Linear mixer :

delta(w_i) ~= ( P_i / M ) - 1


Geometric mixer (binary) :

delta(w_i) ~= (1 - M) * logit(P_i)
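A sketch of both update steps as code (hypothetical Python; the clamp and renormalize are my own practical choices, not from a reference implementation) :

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def linear_mixer_step(w, P, x, rate):
    # delta(w_i) ~= (P_i(x) / M(x)) - 1 , for the coded symbol x
    M = sum(wi * Pi[x] for wi, Pi in zip(w, P)) / sum(w)
    for i in range(len(w)):
        w[i] = max(w[i] + rate * (P[i][x] / M - 1.0), 1e-6)
    s = sum(w)
    for i in range(len(w)):     # renormalize
        w[i] /= s

def geometric_mixer_step(c, P, bit, rate):
    # delta(c_i) ~= (1 - M(x)) * logit(P_i(x)) , binary alphabet
    l = [logit(Pi) for Pi in P]            # logits of P(bit==1)
    M1 = 1.0 / (1.0 + math.exp(-sum(ci * li for ci, li in zip(c, l))))
    Mx = M1 if bit == 1 else 1.0 - M1
    for i in range(len(c)):
        lx = l[i] if bit == 1 else -l[i]   # logit of the coded bit
        c[i] += rate * (1.0 - Mx) * lx
```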

Let's now spend a little time getting familiar with both of them.

First the linear mixer :


put r_i = P_i(x) / M(x)

ratio of expert i's probability to the mixed probability, on the symbol we just coded (x)

Note : Sum { w_i * r_i } = 1  (if the weights are normalized)

Your goal is to have M = 1

That is, you gave 100% probability to the symbol that actually occurred

The good experts are the ones with high P(x)  (x actually occured)

The experts that beat the current mix have P > M

Note that because of Sum { w_i * r_i } = 1

Whenever there is an expert with r_i > 1 , there must be another with r_i < 1

The linear mixer update is just :

delta(w_i) ~= r_i - 1

experts that were better than the current mix get more weight
experts worse than the current mix get less weight

I think it should be intuitive this is a reasonable update.

Note that delta(w) does not preserve normalization. That is :


Sum{ delta(w_i) } != 0 , and != constant

however

Sum{ w_i * delta(w_i) } = Sum{ w_i * P_i / M } - Sum { w_i } = Sum {w_i} - Sum{w_i} = 0

the deltas *do* sum to zero when weighted.

We'll come back to that later.

If we write the linear update as :


delta(w_i) ~= ( P_i - M ) / M

it's obvious that delta is proportional to (1/M). That is, when we got the overall mixed prediction M right, M is near 1 and the step is small. When we got the prediction wrong, M is near 0, and the step we take can be very large. (note that P_i does not have to also be near zero when M is near zero; the w_i for that expert might have been small). In particular, say most of the experts were confused and predicted the wrong thing, so for the actual coded event x their P(x) is near zero. One guy said "no, what about x!" ; his P(x) was higher. But everyone ignored the one guy piping up (his weight is low). The mixed M(x) will be very low, we emit a huge codelen (very bad log cost), the step (1/M) will be huge, and it bumps up the one guy who got P(x) right.

Now playing with the geometric update :


Geometric mixer (binary) :

delta(w_i) ~= (1 - M) * logit(P_i)

this is evaluated for the actually coded symbol x (x = 0 or 1)

let's use the notation that ! means "for the opposite symbol"
eg. if you coded a 0 , then ! means "for a 1"

eg. !P = 1-P and !M = 1-M

(in the previous post I put up a graph of the relationship between L and !L, codelens of opposing bits)

We can rewrite delta(w_i) as :

delta(w_i) ~= !M * ( !L_i - L_i )

recalling logit is the codelen difference

that's

delta(w_i) ~= (mixed probability of the opposite symbol) * (ith model's codelen of opposite - codelen of coded symbol)

I find this form to be more intuitive; let's talk through it with words.

The step size is proportional to !M , the opposite symbol mixed probability. This is a bit like the (1/M) scaling in the linear mixer. When we got the estimate right, the M(x) of the coded symbol will go to 1, so !M goes to zero, our step size goes to zero. That is, if our mixed probability M was right, we take no step. If it ain't broke don't fix it. If we got the M very wrong, we thought a 0 bit would not occur but then it did, M is near zero, !M goes to one.

While the general direction of scaling here is the same as the linear mixer, the scaling is very different. For example in the case where we were grossly wrong, M -> 0 , the linear mixer steps like (1/M) - it can be a huge step. The geometric mixer steps like (1-M) , it hits a finite maximum.

The next term (!L - L) means weights go up for experts with a large opposite codelen. eg. if the codelen of the actually coded symbol was low, that's a good expert, then the codelen of the opposite must be high, so his weight goes up.

Again this is similar to the linear mixer, which took a step ~ P(x) , so experts who were right go up. Again the scaling is different, and in the opposite way. In the linear mixer, if experts get the P right or wrong, the step just scales in [0,1] ; it varies, but not drastically. Conversely in the geometric case because step is proportional to codelen, it goes to infinity as P gets extremely wrong.

Say you're an expert who gets it all wrong and guesses P(x) = 0 (near zero) and then x occurs. In the linear case the step is just (-1). In the geometric case, L(x) -> inf , !L -> 0 , your weight takes a negative infinite step. Experts who get it grossly wrong are not gradually diminished, they are immediately crushed. (in practice we clamp weights to some chosen finite range, or clamp the step to a maximum)

Repeating again what the difference is :

In linear mixing, if the ensemble makes a big joint mistake, the one guy out of the set who was right can get a big weight boost. If the ensemble does okay, individual experts that were way off do not get big adjustments.

In geometric mixing, individual experts that make a big mistake take a huge weight penalty. This is only mildly scaled by whether the ensemble was good or not.

Another funny difference :

With geometric mixing, experts that predict a 50/50 probability (equal chance of 0 or 1) do not change w at all. If you had a static expert that just always said "I dunno, 50/50?" , his weight would never ever change. But he would also never contribute to the mixture either, since as noted in the previous post equivocal experts simply have no effect on the sum of logits. (you might think that if you kept seeing 1 bits over and over, P(1) should be near 100% and experts that keep saying "50%" should have their weight go to crap; that does not happen in geometric mixing).

In contrast, with linear mixing, 50/50 experts would either increase or decrease in weight depending on if they were better or worse than the previous net mixture. eg. if M was 0.4 on the symbol that occurred, the weight of an 0.5 P expert would go up, since he's better than the mix and somebody must be worse than him.

Note again that in geometric mixing (for binary alphabets), there is no explicit normalization needed. The w_i's are not really weights, they don't sum to 1. They are not in [0,1] but go from [-inf,inf]. It's not so much "weighting of experts" as "adding of experts" but in delta-codelen space.

Another funny thing about geometric mixing is what the "learning rate" really does.

With the rate explicitly in the steps :


Linear mixer :

w_i += alpha * ( ( P_i / M ) - 1 )

(then w_i normalized)


Geometric mixer (binary) :

c_i += alpha * ( (1 - M) * logit(P_i) )

(c_i not normalized)

In the linear case, alpha really is a learning rate. It controls the speed of how fast a better expert takes weight from a worse expert. They redistribute the finite weight resource because of normalization.

That's not the case in the geometric mixer. In fact, alpha just acts as an overall scale to all the weights. Because all the increments to w over time have been scaled by alpha, we can just factor it out :


no alpha scaling :

c_i += ( (1 - M) * logit(P_i) )

factor out alpha :

M = sigmoid( alpha * Sum { c_i * logit(P_i) } )

The "learning rate" in the geometric mixer really just scales how the sum of logits is stretched back to probability. In fact you can think of it as a normalization factor more than a learning rate.

The geometric mixer has yet another funny property that it doesn't obviously pass through the best predictor. It doesn't even obviously pass through the case of a single expert. (with linear mixing it's obvious that crap predictors weight goes to zero, and a single expert would just pass through at weight=1)


aside you may ignore; quick proof that it does :

say you have only one expert P
say that P is in fact perfectly correct
0 bits occur at rate P and P is constant

initially you have some mixing coefficient 'c' (combining alpha in with c)

The mixed probability is :

M = sigmoid( c * logit(P) )

which is crap.  c is applying a weird scaling we don't want.

c should go to 1, because P is the true probability.

Does it?

delta(c) after one step of symbol 0 is (1-M) * logit(P)
for symbol 1 it's M * logit(1-P)

since P is the true probability, the probabilistic average step is :

average_delta(c) = P * ( (1-M) * logit(P) ) + (1-P) * ( (M) * logit(1-P) )

logit(1-P) = - logit(P)

average_delta(c) = P * logit(P) - M * logit(P) = (P - M) * logit(P)

in terms of l = logit(P)
using P = sigmoid(l)

average_delta(c) = ( sigmoid(l) - sigmoid( c * l ) ) * l

this is a differential equation for c(t)

you could solve it (anyone? anyone?)

but even without solving it's clear that it does go to c = 1 as t -> inf

if c < 1 , average_delta(c) is > 0 , so c goes up
if c > 1 , average_delta(c) is < 0 , so c goes down

in the special case of l small (P near 50/50), sigmoid is near linear
( sigmoid(x) ~= 1/2 + x/4 ), so the diffeq becomes dc/dt ~= (l^2/4) * (1 - c) ,
and c decays to 1 like c ~= 1 + k * e^(-t * l^2 / 4)

Note that this was done with alpha set to 1 (included in c). If we leave alpha separate, then c should converge to 1/alpha in order to pass through a single predictor. That's what I mean by alpha not really being a "learning rate" but more of a normalization factor.
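We can check the fixed point numerically by iterating the probabilistic-average step (a sketch in Python; the starting c and the rate are arbitrary choices of mine) :

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def converge_c(P, rate=0.05, steps=20000):
    # iterate average_delta(c) = ( sigmoid(l) - sigmoid(c*l) ) * l
    l = math.log(P / (1.0 - P))   # logit of the true probability
    c = 3.0                       # start far from the fixed point
    for _ in range(steps):
        c += rate * (sigmoid(l) - sigmoid(c * l)) * l
    return c
```

As the sign argument above predicts, c marches to 1 for any skewed P.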

So anyway, despite geometric mixing being a bit odd, it works in practice; it also works in theory (the gradient descent can be proved to converge with reasonable bounds and so on). In practice it has the nice property of not needing any divides for normalization (in the binary case). It does of course require evaluation of logit & sigmoid, which are typically done by table lookup.


Bonus section : Soft Bayes

The "Soft Bayes" method is not a gradient descent, but it is very similar, so I will mention it here. Soft Bayes uses a weighted linear combination of the experts, just like linear gradient descent mixing.


Linear mixer :

d_i = ( P_i / M ) - 1

w_i += rate * d_i

Soft Bayes :

w_i *= ( 1 + rate * d_i )

Instead of adding on the increment, Soft Bayes multiplies by 1 + the increment.
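A sketch of the step (hypothetical Python) :

```python
def soft_bayes_step(w, P, x, rate):
    # w : normalized weights ; P[i][x] = expert i's P of the coded symbol x
    # w_i *= 1 + rate * d_i  with  d_i = (P_i / M) - 1
    M = sum(wi * Pi[x] for wi, Pi in zip(w, P))
    for i in range(len(w)):
        w[i] *= 1.0 + rate * (P[i][x] / M - 1.0)
```

Note there is no renormalize step; the update preserves the sum of the weights.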

Multiplying by 1 + delta is obviously similar to adding delta and the same intuitive arguments we made before about why this works still apply. Maybe I'll repeat them here :

General principles we want in a mixer update step :


The step should be larger when the mixer produced a bad result

  the worse M(x) predicted the observed symbol x, the larger the step should be
  this comes from scaling like (1/M) or (1-M)

The step should scale up weights of experts that predicted the symbol well

  eg. larger P_i(x) should go up , smaller P_i(x) should go down

  a step proportional to (P-M) does this in a balanced way (some go up, some go down)
  but a step of (P - 0.5) works too
  as does logit(P)

We can see a few properties of Soft Bayes :

Soft Bayes :

w_i *= ( 1 + rate * d_i )

w_i' = w_i + w_i * rate * d_i

w_i += rate * w_i * d_i

This is the same as linear gradient descent step, but each step d_i is scaled by w_i

Soft Bayes is inherently self-normalizing :

Sum{w_i'} = Sum{w_i} + Sum{w_i * rate * d_i}

Sum{ w_i * d_i } = Sum{ ( w_i * P_i / M ) - w_i } = ( M/M - 1 ) * Sum{ w_i } = 0

therefore :

Sum{w_i'} = Sum{w_i} 

the sum of weights is not changed

if they started normalized, they stay normalized

The reason that "soft bayes" is so named can be seen thusly :

Soft Bayes :

w_i *= ( 1 + rate * d_i )

d_i = ( P_i / M ) - 1

w_i *= (1-rate) + rate * ( P_i / M )

at rate = 1 :

w_i *= P_i / M

ignoring normalization this is 

w_i *= P_i

which is "Beta Weighting" , which is aka "naive Bayes".

So Soft Bayes uses "rate" as a lerp factor to blend in a naive Bayes update. When rate is 1, it's "fully Bayes", when rate is lower it keeps some amount of the previous weight lerped in.

Soft Bayes seems to work well in practice. Soft Bayes could have the bad property that if weights get to zero they can never rise, but in practice we always clamp weights to a non-zero range so this doesn't occur.

A quick aside on why Beta weighting is the naive Bayesian solution :


ignoring normalization through this aside
(also ignoring priors and all the proper stuff you should do if you're Bayesian)
(see papers for thorough details)

Mix experts :

M = Sum{ w_i * P_i }

P_i(x) is the probability of x given expert i ; P(x|i)

The w_i should be the probability of expert i, given past data , that is :

P(x|past) = Sum{ P(i|past) * P(x|i) }

P(i|past) is just ~= P(past|i)
  (ignoring prior P(i) and normalizing terms P(past) that factor out in normalization)

Now *if* (big if) the past events are all independent random events, then :

P(past|i) = P(x0|i)*P(x1|i)*P(x2|i)... for past symbols x0,x1,...

since P("ab") = P(a)*P(b) etc.

which you can compute incrementally as :

w_i *= P_i

the Beta Weight update.

In words, the weight for the expert should be the probability of that expert having made the data we have seen so far. That probability is just the P_i that expert has predicted for all past symbols, multiplied together.

Now this is obviously very wrong. Our symbols are very much not independent, they have strong conditional probability, that's how we get compression out of modeling the stream. The failure of that assumption may be why true beta weighting is poor in practice.


This Whirlwind Tour has been brought to you by the papers of Mattern & Knoll, PAQ of Mahoney, and the radness of RAD.

Small note about practical application :

You of course want to implement this with integers and store things like weights and probabilities in fixed point. I have found that the exact way you handle truncation of that fixed point is very important. In particular you need to be careful about rounding & biasing and try not to just lose small values. In general mixing appears to be very "fiddly"; it has a very small sweet spot of tuned parameters and small changes to them can make results go very bad.

5/02/2018

Whirlwind Tour of Mixing Part 2

Before we dive into online gradient descent, let's look at a couple of building blocks.

In many cases we've been talking about mixers that work with arbitrary alphabets, but let's specialize now to binary. In that case, each expert just makes one P, say P(0), and implicitly P(1) = 1 - P(0).

We will make use of the logit ("stretch") and logistic sigmoid ("squash"), so let's look at them a bit first.


logit(x) = ln( x / (1-x) )

sigmoid(x) = 1 / (1 + e^-x )  (aka "logistic")

logit is "log odds" of a probability

logit takes [0,1] -> [-inf,inf]

sigmoid takes [-inf,inf] -> [0,1]

logit( sigmoid(x) ) = x  (they are inverses)

I'm a little bit fast and loose with the base of the logs and exponents sometimes, because the difference comes down to scaling factors which wash out in constants in the end. I'll try to be not too sloppy about it.

You can see plots of them here for example. The "logistic sigmoid" is an annoyingly badly named function; "logistic" is way too similar to "logit", which is very confusing, and "sigmoid" is just a general shape description, not a specific function.

What's more intuitive in data compression typically is the base-2 variants :


log2it(P) = log2( P / (1-P) )

log2istic(t) = 1 / (1 + 2^-t )

which is just different by a scale of the [-inf,inf] range. (it's a difference of whether you measure your codelen in units of "bits" or "e's" (nats)).
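The base-2 pair is a one-liner each; a quick sketch in Python :

```python
import math

def log2it(P):
    # log odds in bits : log2( P / (1-P) )
    return math.log2(P / (1.0 - P))

def log2istic(t):
    # inverse map, from codelen delta in bits back to probability
    return 1.0 / (1.0 + 2.0 ** -t)
```

log2istic(log2it(P)) round-trips back to P.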

In data compression there's an intuitive way to look at the logit. It's the difference of codelens of symbol 0 and 1 arising from the probability P.


log2it(P) = log2( P / (1-P) )

= log2( P ) - log2( 1 - P )

if P = P0

then L0 = codelen of 0 = - log2(P0)

L1 = - log2(1-P0)

log2it(P0) = L1 - L0

which I like to write as :

log2it = !L - L

where !L means the len of the opposite symbol

Working with the logit in data compression is thus very natural and has nice properties :

1. It is antisymmetric under exchange of the symbol alphabet 0->1. In that case P -> 1-P , so logit -> -logit. The magnitude stays the same, so mixers that act on logit behave the same way. eg. you aren't favoring 0 or 1, which of course you shouldn't be. This is why using the *difference* of codelens L0 and L1 is the right thing to do, not just using L0, since the 0->1 exchange acts on L0 in a strange way.

Any mixer that we construct should behave the same way if we swap 0 and 1 bits. If you tried to construct a valid mixer like that using only functions of L0, it would be a mess. Doing it in the logit = (!L-L) it works automatically.

2. It's linear in codelen. Our goal in compression is to minimize total codelen, so having a parameter that is in the space we want makes that natural.

3. It's unbounded. Working with P's is awkward because of their finite range, which requires clamping and theoretically projection back into the valid cube.

4. It's centered on zero for P = 0.5 and large for skewed P. We noted before that what we really want is to gather models that are skewed, and logit does this naturally.

Let's look at the last property.

Doing mixing in logit domain implicitly weights skewed experts, as we looked at before in "functional weighting". With logit, skewed P's near 0 or 1 are stretched out to -inf or +inf. This gives them very high implicit "weight"; that is, even after mixing with a P near 0.5, the result is still skewed.


Mix two probabilities with equal weight :

0.5 & 0.9

linear : 0.7
logit  : 0.75

0.5 & 0.99

linear : 0.745
logit  : 0.90867

0.5 & 0.999

linear : 0.74950
logit  : 0.96933

Logit blending is sticky at skewed P.
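The numbers above are easy to reproduce (a sketch; equal weights of one half each) :

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear_blend(a, b):
    # equal-weight linear mix
    return 0.5 * (a + b)

def logit_blend(a, b):
    # equal-weight mix in logit space, squashed back to a probability
    return sigmoid(0.5 * (logit(a) + logit(b)))
```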

A little plot of what the L -> !L function looks like :

Obviously it is symmetric around (1,1) where both codelens are 1 bit, P is 0.5 ; as P gets skewed either way one len goes to zero and the other goes to infinity.

logistic sigmoid is the map back from codelen delta to probability. It has the nice property for us that no matter what you give it, the output is a valid normalized probability in [0,1] ; you can scale or screw up your logits in whatever way you want and you still get a normalized probability (which makes it codeable).

Linear combination of logits (codelen deltas) is equal to geometric (multiplicative) mixing of probabilities :


M = mixed probability

linear combination in logit space, then transform back to probability :

M = sigmoid( Sum{ c_i * logit( P_i ) } )

M = 1 / ( 1 + e ^ - Sum{ c_i * logit( P_i ) } )

e ^ logit(P) = P/(1-P)
e ^ -logit(P) = (1-P)/P
e ^ - c * logit(P) = ((1-P)/P) ^ c

M = 1 / ( 1 + Prod{ (( 1 - P_i ) / P_i)^c_i } )

M = Prod{ P_i ^ c_i } / ( Prod{ P_i ^ c_i } + Prod{ ( 1 - P_i )^c_i } )

What we have is a geometric blend (like a geometric mean, but with arbitrary exponents, which are the mixing factors), and the denominator is just the normalizing factor so that M0 and M1 sum to one.

Note that the blend factors (c_i) do not need to be normalized here. We get a normalized probability even if the c_i's are not normalized. That is an important property for us in practice.

This is a classic machine learning method called "product of experts". In the most basic form, all the c_i = 1, the powers are all 1, the probabilities from each expert are just multiplied together.

I'm calling the c_i "coefficients" not "weights" to emphasize the fact that they are not normalized.

Note that while we don't need to normalize the c_i , and in practice we don't (and in the "product of experts" usage with all c_i = 1 they are of course not normalized), that has weird consequences and let us not brush it under the table.

For one thing, the overall scale of c_i changes the blend. In many cases we hand-waved about scaling because normalization will wipe out any overall scaling factors of weights. In this case the magnitude scale of the c_i absolutely matters. Furthermore, the scale of the c_i has nothing to do with a "learning rate" or blend strength. It's just an extra exponent applied to your probabilities that does some kind of nonlinear mapping to how they are mixed. e.g. doing c_i *= 2 is like mixing squared probabilities instead of linear probabilities.

For another, with unnormalized c_i - only skewed probabilities contribute. Mixing in more 50/50 experts does *nothing*. It does not bias your result back towards 50/50 at all!

We can see this both in the logit formulation and in the geometric mean formulation :


M = sigmoid( Sum{ c_i * logit( P_i ) } )

Now add on a new expert P_k with some coefficient c_k

M' = sigmoid( Sum{ c_i * logit( P_i ) + c_k * logit( P_k ) } )

if P_k = 50% , then logit(P_k) = 0

M' = M

even if c_k is large.

(if you normalized the c's, the addition of c_k would bring down the coefficients of the other contributions)

The Sum of logits only adds up codelen deltas

Experts with large codelen deltas are adding their scores to the vote
small codelen deltas simply don't contribute at all

In the geometric mean :

Q = Prod{ P_i ^ c_i }

M = Q / ( Q + !Q )

(where ! means "of the opposing symbol" ; eg. P -> (1-P))

Add on another expert with P = 0.5 :

Q' = Q * 0.5 ^ c_k

!Q' = !Q * 0.5 ^ c_k

M' = Q * 0.5 ^ c_k / ( Q * 0.5 ^ c_k + !Q * 0.5 ^ c_k)

the same term multiplied around just factors out

M' = M

This style of mixing is like adding experts, not blending them. Loud shouty experts with skewed codelen deltas add on.

Next up, how to use this for gradient descent mixing in compression.

Whirlwind Tour of Mixing Part 1

I'm going to try to run through mixing techniques in data compression in a wild quick tour. Part 1 will cover techniques up to but not including online gradient descent.

First some background :

We have probabilities output by some "experts" , P_i(x) for the ith expert for symbol x.

We wish to combine these in some way to create a mixed probability.

The experts are a previous stage of probability estimation. In data compression these are usually models; they could be different orders of context model, or the same order at different speeds, or other probability sources.

(you could also just have two static experts, one that says P(0) = 100% and one that says P(1) = 100%, then the way you combine them is actually modeling the binary probability; in that case the mixer is the model. What I'm getting at is there's actually a continuum between the model & the mixer; the mixer is doing some of the modeling work. Another common synthetic case used for theoretical analysis is to go continuous and have an infinite number of experts with P(0) = x for all x in [0,1], then the weight w(x) is again the model.)

Now, more generally, the experts could also output a confidence. This is intuitively obviously something that should be done, but nobody does it, so we'll just ignore it and say that our work is clearly flawed and always has room for improvement. (for example, a context that has seen n0 = 1 and n1 = 0 might give P(0) = 90% , but that is a far less confident expert than one that has n0 = 90 and n1 = 10 , which should get much higher weight). What I'm getting at here is the experts themselves have a lot more information than just P, and that information should be used to feed the mixer. We will not do that. We treat the mixer as independent of the experts and only give it the probabilities.

(you can fix this a bit by using side information (not P) from the model stages as context for the mixer coefficients; more on that later).

Why mixing? A lot of reasons. One is that your data may be switching character over time. It might switch from very-LZ-like to very-order-3-context-like to order-0-like. One option would be to try to detect those ranges and transmit switch commands to select models, but that is not "online" (one pass), and the overhead to send the switch commands winds up being comparable to just mixing all those models (much research shows that mixing is within a bounded overhead of ideal switching, asymptotically). Getting perfect switch transmission right is actually way harder than mixing; you have to send the switch commands efficiently, and you have to find the ideal switch ranges with consideration of the cost of transmitting the switch (which is a variant of the "parse-statistics feedback" problem, one of the big unsolved bugbears in compression).

Another reason we like mixing is that our models might be good for part of the character of the data, but not all of it. For example it's quite common to have some data that's text-like (has strong order-N context properties) but is also LZ-like (perhaps matches at offset 4 or 8 are more likely, or at rep offset). It's hard to capture both those correlations with one model. So do two models and mix.

On to techniques :


1. Select

The most elementary mixer just selects one of the models. PPMZ used this for LOE (Local Order Estimation). PPMZ's primary scheme was to just select the order with highest MPS (most probable symbol) probability. Specifically if any order was deterministic (MPS P = 100%) it would be selected.

2. Fixed Weights

The simplest mixer is just to combine stages using fixed weights. For example :


P = (3 * P1 + P2)/4

This is used very effectively in BCM, for example, which uses it in funny ways like mixing the probability pre-APM with the probability post-APM using fixed weight ratios.

Fixed linear combinations like this of multi-speed adaptive counters are not really mixing, but a way to create higher order IIR filters. (as in the "two speed" counter from BALZ; that really makes a 2-tap IIR from 1-tap IIR geometric decay filters)
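A sketch of what that looks like (hypothetical Python; the update rates and the 3:1 weight are made-up examples, not from BALZ or BCM) :

```python
def two_speed_bit_model(fast_rate=1/16.0, slow_rate=1/256.0):
    # two 1-tap IIR (geometric decay) estimates of P(bit==1),
    # combined with a fixed weight like P = (3*P1 + P2)/4
    state = {"fast": 0.5, "slow": 0.5}
    def update(bit):
        state["fast"] += (bit - state["fast"]) * fast_rate
        state["slow"] += (bit - state["slow"]) * slow_rate
        return (3.0 * state["fast"] + state["slow"]) / 4.0
    return update
```

The fast counter tracks recent bits, the slow one the long-run statistics; the fixed blend is effectively a 2-tap IIR on the bit stream.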

3. Per-file/chunk transmitted Fixed Weights

Instead of fixed constants for mixing, you can of course optimize the weights per file (or per chunk) and transmit them.

If you think of the goal of online gradient descent as finding the best fit of parameters and converging to a constant - then this makes perfect sense. Why do silly neural net online learning when we can just statically optimize the weights?

Of course we don't actually want our online learners to converge to constants - we want them to keep adapting; there is no steady state in real world data compression. All the online learning research which assumes convergence and learning rate decreasing over time and various asymptotic guarantees - it's all sort of wrong for us.

M1 (Mattern) uses optimized weights and gets quite high compression. This method is obviously very asymmetric, slow to encode & fast(er) to decode. You can also approximate it by making some pre-optimized weights for various types of data and just doing data type detection to pick a weight set.

4. Functional Weighting

You can combine P's using only functions of the P's, with no separate state (weight variables that are updated over time). The general idea here is that more skewed P's are usually much more powerful (more likely to be true) than P's near 50/50.

If you have one expert on your panel saying "I dunno, everything is equally likely" and another guy saying "definitely blue!" , you are usually best weighting more to the guy who seems sure. (unless the positive guy is a presidential candidate, then you need to treat lack of considering both sides of an argument or admitting mistakes as huge character flaws that make them unfit to serve).

In PPMZ this was used for combining the three SEE orders; the weighting used there was 1/H (entropy).


weight as only a function of P

the idea is to give more weight to skewed P's

W = 1 / H(P) = - 1 / ( P * log(P) + (1-P) * log(1-P) )

e^ -H would be fine too

very simple skews work okay too :

W = ABS(P - 0.5)

just higher weight for P away from 0.5

then linearly combine models :

P_mixed = Sum{ W_i * P_i } / Sum{ W_i }
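A sketch of the 1/H weighting for binary P's (hypothetical Python; the epsilon guard against fully skewed experts is my addition) :

```python
import math

def entropy_weight_mix(Ps):
    # W_i = 1 / H(P_i) , then P_mixed = Sum{ W_i * P_i } / Sum{ W_i }
    eps = 1e-9
    def H(p):
        p = min(max(p, eps), 1.0 - eps)
        return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))
    Ws = [1.0 / (H(p) + eps) for p in Ps]
    return sum(w * p for w, p in zip(Ws, Ps)) / sum(Ws)
```

A skewed expert pulls the mix strongly toward itself, while a 50/50 expert gets the minimum weight.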

This method has generally been superseded by better approaches in most cases, but it is still useful for creating a prior model for initial seeding of mixers, such as for 2d APM mixing (below).

I'll argue in a future post that Mahoney's stretch/squash (logit/logistic sigmoid) is a form of this (among other things).

5. Secondary Estimation Mixing / 2d APM Mixing

As mentioned at the end of the last post, you can use a 2d APM for mixing. The idea is very simple, instead of one P as input to the APM, you take two probabilities from different models and look them up as a 2d matrix.

You may want to use the stretch remap and bilinear filtering so that as few bits as possible are needed to index the APM table, in order to preserve as much density as possible. (eg. 6 bits from each P = 12 bit APM table)

This allows you to model completely arbitrary mixing relationships, not just "some weight to model 1, some to model 2". It can do the fully P-input dependent mixing, like if model 1 P is high, it gets most of the influence, but if model 1 P is low, then model 2 takes over.

The big disadvantage of this method is sparsity. It's really only viable for mixing two or three models, you can't do a ton. It's best on large stable data sources, not small rapidly changing data, because the mixing weight takes longer to respond to changes.

One of the big advantages of single-scalar weighted mixing is that it provides a very fast adapting stage to your compressor. Say you have big models with N states, N might be a million, so each individual context only gets an update every millionth byte. Then you might have a (1d) APM stage to adapt the P's; this has 100 entries or so, so it only gets 1/100 updates. But then you blend the models with a scalar mixing weight - that updates after every byte. The scalar mixing weight is always 100% up to date with recent changes in the data, it can model switches very quickly while the rest of the big model takes much more time to respond. 2d APM Mixing loses this advantage.

6. Beta Weighting

Beta Weighting was introduced by CTW ; see also the classic work of Volf on switching, etc. (annoyingly this means something different in stock trading).

Beta Weighting is an incremental weight update. It does :


P_mixed = Sum W_i * P_i / Sum W_i

(linear weighted sum of model probabilities, normalized)

after seeing symbol X

W_i *= P_i(X)
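In code, for two experts, that update is just a couple of lines (a minimal sketch; the names are mine):

```c
#include <assert.h>

#define NUM_EXPERTS 2

/* linear mix : P_mixed = Sum W_i * P_i / Sum W_i */
static double beta_mix(const double w[], const double p[])
{
    double num = 0.0, den = 0.0;
    for (int i = 0; i < NUM_EXPERTS; i++) { num += w[i] * p[i]; den += w[i]; }
    return num / den;
}

/* after coding : p_seen[i] = probability expert i gave the actual symbol */
static void beta_update(double w[], const double p_seen[])
{
    for (int i = 0; i < NUM_EXPERTS; i++) w[i] *= p_seen[i];
}
```

After a run of symbols where expert 0 was consistently better, its weight dominates the mix.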

We can understand this intuitively :

If a model had a high probability for the actually occurring symbol (it was a good model),
it will have P_i large

W_i *= 0.9 or whatever

if the model had low probability for the actual symbol (it was wrong, claiming that symbol was unlikely),
it will get scaled down :

W_i *= 0.1 for example

Furthermore, the weights we get this way are just negative exponentials of the cost in each model. That is :

P = 2^ - L

L = codelen

The weight winds up being the product of all coded probabilities seen so far :

W at time t = P_(t-1) * P_(t-2) * P_(t-3) * P_(t-4) ...

W = 2^( - ( L_(t-1) + L_(t-2) + L_(t-3) + ... ) )

W = 2^( - codelen if you used this model for all symbols )

W ~ e^ - cost

If you like we can also think of this another way :

since W is normalized, overall scaling is irrelevant

at each update, find the model that produced the lowest codelen (or highest P)

this is the model we would have chosen with an ideal selector

so the cost of any other model is :

cost_i = L_i - L_best

the CTW Beta weight rule is :

W_i *= 2^ - cost_i

(and the best model weight stays the same, its cost is zero)

or

W = 2^ - total_cost

total_cost += cost_i

In the special case of two models we can write :


The normalized weight for model 1 is :

W1 = ( 2^-C1 ) / ( 2^-C1 + 2^-C2 ) = 1 / ( 1 + 2^(C1-C2) ) = log2istic( C2-C1 )

W2 = 1 - W1 = log2istic( C1-C2 )

"log2istic" = logistic sigmoid with base 2 instead of e

We only need to track the delta of costs , dC = C1-C2

P = log2istic(-dC) * P1 + log2istic(dC) * P2

(equivalently, just store W1 and do W1 *= P1/P2)

We'll see in the next post how logistic is a transform from delta-codelens to probabilities, and this will become neater then.

Now Beta weighting is obviously wrong. It weights every coding event equally, no matter how old it is. In compression we always want strong recency modeling, so a better cost update is something like :


dC' = dC * aging + rate * (C1-C2)

where aging = 0.95 or whatever , something < 1
and rate is not so much the learning rate (it could be normalized out) as it is an exponential modifier to the weighting.

Beta weighting as used in CTW stores a weight per node. Each node weights the counts seen at that level vs. the next deeper level. In contrast, in context mixers like PAQ the weight for order N vs order N+1 is stored outside of the model, not per context but as a global property of the data. The PAQ way is much more flexible because you can always use some context for the weight if you like.

Next post we'll get into online gradient descent.

5/01/2018

Secondary Estimation : From PPMZ SEE to PAQ APM

I want to ramble a bit about how Matt Mahoney uses Secondary Estimation in APM's (in his compressors PAQ and BBB), but to get there I thought I'd go through a little story about what secondary estimation is all about.

Secondary Estimation began as a way to apply a (small) correction to a (large) primary model, but we will see in our journey that it has transformed into a way of doing heavy duty modeling itself.

The simplest and perhaps earliest form of secondary estimation may have been bias factors applied to linear predictors.

Say you are predicting some signal like audio. A basic linear predictor is of the form :


to code x[t]

pred = 2*x[t-1] - x[t-2]

delta = x[t] - pred

transmit delta

Now it's common to observe that "delta" has remaining predictability. The simplest case is if it just has an overall bias; it tends to be > 0. Then you track the average observed delta and subtract it off to correct the primary predictor. A more interesting case is to use some context to track different corrections. For example :

Previous delta as context :

context = delta > 0 ? 1 : 0;

pred = 2*x[t-1] - x[t-2]
delta = x[t] - pred

bias = secondary_estimation[ context ]

transmit (delta - bias)

secondary_estimation[ context ].update(delta)

Now we are tracking the average bias that occurs when the previous delta was positive or negative, so if they tend to occur in clumps we will remove that bias. ("update" means "average in, in some way", which we will perhaps revisit).
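A sketch of that context-conditioned bias tracker (the 0.25 update rate and the geometric-decay "update" are my choices, not prescribed):

```c
#include <assert.h>

static double bias_table[2];          /* context = last delta > 0 ? 1 : 0 */
static int    last_context;

static int predict(int xm1, int xm2)  /* linear predictor 2*x[t-1] - x[t-2] */
{
    return 2*xm1 - xm2;
}

static double corrected_predict(int xm1, int xm2)
{
    return predict(xm1, xm2) + bias_table[last_context];
}

static void update(int actual, int xm1, int xm2)
{
    double delta = actual - predict(xm1, xm2);
    /* average the observed delta into the bias for this context */
    bias_table[last_context] += 0.25 * (delta - bias_table[last_context]);
    last_context = (delta > 0) ? 1 : 0;
}
```

On data where positive deltas clump together, the context-1 bias converges toward the clump's average delta and the correction removes it.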

Another possibility is to use local shape information, something like :


for last 4 values , track if delta was > 0
each of those is a bit
this makes a 4 bit context

look up bias using this context

Now you can detect patterns like the transition from flats to edges and apply different biases in each case.

The fundamental thing that we are trying to do here, which is a common theme is :

The primary model is a mathematical predictor, which extracts some fundamental information about the data.

The secondary estimator tracks how well that primary model is fitting the current observation set and corrects for it.

In CALIC a form of secondary estimation very similar to this is used. CALIC uses a shape-adaptive gradient predictor (DPCM lossless image coding). This is the primary model; essentially it tries to fit the local pixels into one of a few classes (smooth, 45 degree edge, 90 degree edge, etc.) and uses different linear predictors for each case. The prediction error from that primary model is then corrected using a table lookup in a shape context. The table lookup tracks the average bias in that shape context. While the primary model uses gradient predictors that are hard-coded (don't vary from image to image), the secondary model can correct for how well those gradient predictors fit the current image.

I'm going to do another hand wavey example before we get into modern data compression.

A common form of difficult prediction that everyone is familiar with is weather prediction.

Modern weather prediction is done using ensembles of atmospheric models. They take observations of various atmospheric variables (temperature, pressure, vorticity, clouds, etc.) and do a physical simulation to evolve the fluid dynamics equations forward in time. They form a range of possible outcomes by testing how variations in the input lead to variations in the output; they also create confidence intervals by using ensembles of slightly different models. The result is a prediction with a probability for each possible outcome. The layman is only shown a simplified prediction (eg. predict 80 degrees Tuesday) but the models actually assign a probability to each possibility (80 degrees at 30% , 81 degrees at 15%, etc.)

Now this physically based simulation is a primary model. We can adjust it using a secondary model.

Again the simplest form would be if the physical simulation just has a general bias. eg. it predicts 0.5 degrees too warm on average. A more subtle case would be if there is a bias per location.


primary = prediction from primary model

context = location

track (observed - primary)

store bias per context

secondary = primary corrected by bias[context]

Another correction we might want is the success of yesterday's prediction. You can imagine that the prediction accuracy is usually better if yesterday's prediction was spot on. If we got yesterday grossly wrong, it's more likely we will do so again. Now in the primary physical model, we might be feeding in yesterday's atmospheric conditions as context already, but it is not using yesterday's prediction. We are doing secondary estimation using side data which is not in the primary model.

You might also imagine that there is no overall bias to the primary prediction, but there is bias depending on what value it output. And furthermore that bias might depend on location. And it might depend on how good yesterday's prediction was. Now our secondary estimator is :


make primary prediction

context = value of primary, quantized into N bits

(eg. maybe we map temperature to 5 bits, chance of precipitation to 3 bits)

context += location
context += success of yesterday's prediction

secondary = primary corrected by bias[context]

Now we can perform quite subtle corrections. Say we have a great overall primary predictor. But in Seattle, when the primary model predicts a temperature of 60 and rain at 10% , it tends to actually rain 20% of the time (and more likely if we got yesterday wrong).


SEE in PPMZ

I introduced Secondary Estimation (for escapes) to context modeling in PPMZ.

(See "Solving the Problems of Context Modeling" for details; I was obviously quite modest about my work back then!)

Let's start with a tiny bit of background on PPM and what was going on in PPM development back then.

PPM takes a context of some previous characters (order-N = N previous characters) and tracks what characters have been seen in that context. eg. after order-4 context "hell" we may have seen space ' ' 3 times and 'o' 2 times (and if you're a bro perhaps 'a' too many times).

Classic PPM starts from deep high order contexts (perhaps order-8) and if the character to be coded has not been seen in that context order, it sends an "escape" and drops down to the next lower context.

The problem of estimating the escape probability was long a confounding one. The standard approach was to use the counts of characters seen in the context to form the escape probability.


A nice summary of classical escape estimators from
"Experiments on the Zero Frequency Problem" :

n = the # of tokens seen so far
u = number of unique tokens seen so far
ti = number of unique tokens seen i times

PPMA : 1 / (n+1)

PPMB : (u - t1) / n

PPMC : u / (n + u)

PPMD : (u/2) / n

PPMP : t1/n - t2/n^2 - t3/n^3 ...

PPMX : t1 / n

PPMXC : t1 / n if t1 < n
        u / (n + u) otherwise

eg. for PPMA you just leave the escape count at 1. For PPMC, you increment the escape count each time a novel symbol is seen. For PPMD, you increment the escape count by 1/2 on a novel symbol (and also set the novel characters count to 1/2) so the total always goes up by 1.
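Those estimators as direct formulas, for concreteness (a trivial transcription of the table above):

```c
#include <assert.h>

/* n = total tokens seen, u = unique tokens, t1 = tokens seen exactly once */
static double esc_ppma(int n)         { return 1.0 / (n + 1); }
static double esc_ppmc(int n, int u)  { return (double)u / (n + u); }
static double esc_ppmd(int n, int u)  { return (u * 0.5) / n; }
static double esc_ppmx(int n, int t1) { return (double)t1 / n; }
```

eg. after seeing 'a','a','b' (n=3, u=2, t1=1) : PPMA gives 1/4, PPMC 2/5, PPMD 1/3, PPMX 1/3.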

Now some of these have theoretical justifications based on Poisson process models and Laplacian priors and so on, but the correspondence of those priors to real world data is generally poor (at low counts).

The big difficulty with PPM (and data compression in general) is that low counts are where we do our business. We can easily get contexts with high counts (sufficient statistics) and accurate estimators by dropping to lower orders. eg. if you did PPM and limit your max order to order-2 , you wouldn't have this problem. But we generally want to push our orders to the absolute highest breaking point, where our statistical density goes to shit. Big compression gains can be found there, so we need to be able to work where we may have only seen 1 or 2 events in a context before.

Around the time I was doing PPMZ, the PPM* and PPMD papers by Teahan came out ("Unbounded length contexts for PPM" and "Probability estimation for PPM"). PPMD was a new better estimator, but to me it just screamed that something was wrong. All these semi-justified mathematical forms for the escape probability seemed to just be the wrong approach. At the same time, PPM* showed the value of long deterministic contexts, where by definition "u" (the number of unique symbols) is always 1 and "n" (total) count is typically low.

We need to be able to answer a very important and difficult question. In a context that has only seen one symbol one time, what should the escape probability be? (equivalently, what is the probability that one symbol is right?). There is no simple formula of observed counts in that context that can solve that.

Secondary Estimation is the obvious answer. We want to know :


P(esc) when only 1 symbol has been seen in a context

We might first observe that this P(esc) varies between files. On more compressible files, it is much lower. On near-random files, it can be quite high. This is a question of to what extent this novel deterministic context is a reflection of the true underlying statistics, or just a mirage due to sparse statistics.

That is,


In an order-8 context like "abcdefgh" we have only ever seen the symbol 'x' once

Is that a reflection of a true pattern?  That "abcdefgh" will always predict 'x' ?

Or is the following character in fact completely random, and if we saw more symbols we
would see all of them with equal probability

Again you don't have the information within the context to tell the difference. Using an average across the whole file is a very rough guess, but obviously you could do better.

You need to use out-of-context observations to augment the primary model. Rather than just tracking the real P(esc) for the whole file we can obviously use some context to find portions of the data where it behaves differently. For example P(esc) (when only 1 symbol has been seen in a context) might be 75% if the parent context has high entropy; it might be 50% if the order is 6, it might be 10% if order is 8 and the previous symbols were also coded from high order with no escape.

In PPMZ, I specifically handled only low escape counts and totals; that is, it was specifically trying to solve this sparse statistics problem. The more modern approach (APM's, see later) is to just run all probabilities through secondary statistics.

The general approach to secondary estimation in data compression is :


create a probability from the primary model

take {probability + some other context} and look that up in a secondary model

use the observed statistics in the secondary model for coding

The "secondary model" here is typically just a table. We are using the observed statistics in one model as the input to another model. In PPMZ, the primary model is large and sparse while the secondary model is small and dense, but that need not always be the case.

The secondary model is usually initialized so that the input probabilities pass through if nothing has been observed yet. That is,


initially :

secondary_model[ probability + context ] = probability
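A minimal sketch of such a table (sizes, the quantization, and the update rate are all illustrative choices of mine):

```c
#include <assert.h>
#include <math.h>

#define SE_CTXS      4
#define SE_PROB_BITS 5
#define SE_PROB_SIZE (1 << SE_PROB_BITS)

static double se_table[SE_CTXS][SE_PROB_SIZE];

/* pass-through init : each slot starts at the P that indexes it */
static void se_init(void)
{
    for (int c = 0; c < SE_CTXS; c++)
        for (int i = 0; i < SE_PROB_SIZE; i++)
            se_table[c][i] = (i + 0.5) / SE_PROB_SIZE;
}

static int se_index(double p)
{
    int i = (int)(p * SE_PROB_SIZE);
    return i < SE_PROB_SIZE ? i : SE_PROB_SIZE - 1;
}

static double se_lookup(int ctx, double p) { return se_table[ctx][se_index(p)]; }

/* shift the slot toward each observed bit (geometric decay average) */
static void se_update(int ctx, double p, int bit)
{
    double *slot = &se_table[ctx][se_index(p)];
    *slot += 0.05 * (bit - *slot);
}
```

Before any updates, lookup returns (approximately) the input probability; once a {context, P} slot sees coding events, its output drifts toward the observed rate.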

Secondary estimation in cases like PPM can be thought of as a way of sharing information across sparse contexts that are not connected in the normal model.

That is, in PPM contexts that are suffixes have a parent-child relationship and can easily share information; eg. "abcd" and "bcd" and "cd" are connected and can share observations. But some other context "xyzw" is totally unconnected in the tree. Despite that, by having the same statistics they may be quite related! That is


"abcd" has seen 3 occurrences of symbol 'd' (and nothing else)

"xyzw" has seen 3 occurrences of symbol 'd' (and nothing else)

these are probably very similar contexts even though they have no connection in the PPM tree. The next coding event that happens in either context, we want to communicate to the other. eg. say a symbol 'e' now occurs in "abcd" - that makes it more likely that "xyzw" will have an escape. That is, 3-count 'd' contexts are now less likely to be deterministic. The secondary estimation table allows us to accumulate these disparate contexts together and merge their observations.

The PPMZ SEE context is made from the escape & total count, as well as three different orders of previous symbol bits. The three orders are then blended together (an early form of weighted mixing!). I certainly don't recommend the exact details of how PPMZ does it today. The full details of the PPMZ SEE mechanism can be found in PPMZ (Solving the Problems of Context Modeling) (PDF) . You may also wish to consult the later work by Shkarin "PPM: one step to practicality" (PPMd/PPmonstr/PPMii) which is rather more refined. Shkarin adds some good ideas to the context used for SEE, such as using the recent success so that switches between highly compressible and less compressible data can be modeled.

Let me repeat myself to be clear. Part of the purpose & function of secondary estimation is to find patterns where the observed counts in a state do not linearly correspond to the actual probability in any way.

That is, secondary estimation can model things like :


Assuming a binary coder
The coder sees 0's and 1's
Each context state tracks n0 and n1 , the # or 0's and 1's seen in that context

On file A :

if observed { n0 = 0, n1 = 1 } -> actual P1 is 60%
if observed { n0 = 0, n1 = 2 } -> actual P1 is 70%
if observed { n0 = 0, n1 = 3 } -> actual P1 is 90%

On file B :

if observed { n0 = 0, n1 = 1 } -> actual P1 is 70%
if observed { n0 = 0, n1 = 2 } -> actual P1 is 90%
if observed { n0 = 0, n1 = 3 } -> actual P1 is 99%


The traditional models are of the form :

P1 = (n1 + c) / (n0 + n1 + 2c)

and you can make some Bayesian prior arguments to come up with different values of c (1/2 and 1 are common)

The point is there is *no* prior that gets it right. If the data actually came from a Laplacian or Poisson source, then observing counts you could make estimates of the true P in this way. But context observations do not work that way.

There are lots of different effects happening. One is that there are multiple "sources" in a file. There might be one source that's pure random, one source that's purely predictable (all 0's or all 1's), another source that is in fact pretty close to a Poisson process. When you have only a few observations in a context, part of the uncertainty is trying to guess which of the many sources in the file that context maps to.


Adaptive Probability Map (APM) ; secondary estimation & beyond.

Matt Mahoney's PAQ (and many other modern compressors) make heavy use of secondary estimation, not just for escapes, but for every coding event. Matt calls it an "APM" , and while it is essentially the same thing as a secondary estimation table, the typical usage and some details are a bit different, so I will call them APMs here to distinguish.

PPM-type compressors hit a dead end in their compression ratio due to the difficulty of doing things like mixing and secondary estimation in character alphabets. PAQ, by using exclusively binary coding, only has one probability to work with (the probability of a 0 or 1, and then the other is inferred), so it can be easily transformed. This allows you to apply secondary estimation not just to the binary escape event, but to all symbol coding.

An APM is a secondary estimation table. You take a probability from a previous stage, look it up in a table (with some additional context, optionally), and then use the observed statistics in the table, either for coding or as input to another stage.

By convention, there are some differences in typical implementation choices. I'll describe a typical APM implementation :


probability P from previous stage

P is transformed nonlinearly with a "stretch"

[stretch(P) + context] is looked up in APM table

observed P there is passed to next stage

optionally :

look up both floor(P) and ceil(P) in some fixed point to get two adjacent APM entries
linearly interpolate them using fractional bits

A lot of the details here come from the fact that we are passing general P's through the APM, rather than restricting to only low counts as in PPMZ SEE. That is, we need to handle a wide range of P's.

Thus, you might want to use a fixed point to index the APM table and then linearly interpolate (standard technique for tabulated function lookup); this lets you use smaller, denser tables and still get fine precision.

The "stretch" function is to map P so that we quantize into buckets where we need them. That is, if you think in terms of P in [0,1] we want to have smaller buckets at the endpoints. The reason is that P steps near 0 or 1 make a very large difference in codelen, so getting them right there is crucial. That is, a P difference of 0.90 to 0.95 is much more important than from 0.50 to 0.55 ; instead of having variable bucket sizes (smaller buckets at the end) we use uniform quantization but stretch P first (the ends spread out more). One nice option for stretch is the "logit" function.


LPS symbol codelen (LPS = less probable symbol)

steps of P near 0.5 produce small changes in codelen :

P = 0.5  : 1 bit
P = 0.55 : 1.152 bits

steps of P near 1.0 produce big changes in codelen :

P = 0.9  : 3.32193 bits
P = 0.95 : 4.32193 bits

The reason why you might want to do this stretch(P) indexing to the APM table (rather than just look up P) is again density. With the stretch(P) distortion, you can get away with only 5 or 6 bits of index, and still get good resolution where you need it. With linear P indexing you might need a 10 bit table size for the same quality. That's 32-64 entries instead of 1024 which is a massive increase in how often each slot gets updated (or a great decrease in the average age of statistics; we want them to be as fresh as possible). With such coarse indexing of the APM, the two adjacent table slots should be used (above and below the input P) and linearly interpolated.

APM implementations typically update the observed statistics using the "probability shift" method (which is equivalent to constant total count or geometric decay IIR).
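For reference, the probability shift update in integer fixed point (a common sketch; the 12-bit precision and shift rate of 5 are arbitrary example values):

```c
#include <assert.h>

#define PROB_ONE   (1 << 12)    /* 12-bit fixed point probability */
#define PROB_SHIFT 5            /* rate = 1/32 */

/* move p a constant fraction of the way toward the observed bit ;
   equivalent to a geometric-decay IIR average of recent bits */
static int prob_update(int p, int bit)
{
    int target = bit ? PROB_ONE : 0;
    return p + ((target - p) >> PROB_SHIFT);
}
```

This update gives constant adaptation speed regardless of how many events have been seen, which is exactly the recency behavior an APM wants.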

An APM should be initialized such that it passes through the input probability. If you get a probability from the previous stage, and nothing has yet been observed in the APM's context, it passes through that probability. Once it does observe something it shifts the output probability towards the observed statistics in that context.

We can see something interesting that APM's can do already :

I wrote before about how secondary estimation is crucial in PPM to handle sparse contexts with very few events. It allows you to track the actual observed rates in that class of contexts. But we thought of SEE as less important in dense contexts.

That is true only in non-time-varying data. In the real world almost all data is time varying. It can shift character, or go through different modes or phases. In that sense all contexts are sparse, even if they have a lot of observed counts, because they are sparse in observation of recent events. That is, some order-4 context in PPM might have 100 observed counts, but most of those were long in the past (several megabytes ago). Only a few are from the last few bytes, where we may have shifted to a different character of data.

Running all your counts through an APM picks this up very nicely.


During stable phase :

look up primary model

secondary = APM[ primary ]

when primary is dense, secondary == primary will pass through


Now imagine we suddenly transition to a chunk of data that is nearly random

the primary model still has all the old counts and will only slowly learn about the new random data

but the APM sees every symbol coded, so it can learn quickly

APM[] will quickly tend towards APM[P] = 0.5 for all P

that is, the input P will be ignored and all states will have 50/50 probability

This is quite an extreme example, but more generally the APM can pick up regions where we move to more or less compressible data. As in the PPM SEE case, what's happening here is sharing of information across parts of the large/sparse primary model that might not otherwise communicate.

Let me repeat that to be clear :

Say you have a large primary model, with N context nodes. Each node is only updated at a rate of (1/N) times per byte processed. The APM on the other hand is updated every time. It can adapt much faster - it's sharing information across all the nodes of the primary model. You may have very mature nodes in the primary model with counts like {n0=5,n1=10} (which we would normally consider to be quite stable). Now you move to a region where the probability of a 1 is 100%. It will take *tons* of steps for the primary model to pick that up and model it, because the updates are scattered all over the big model space. Any one node only gets 1/N of the updates, so it takes ~N*10 bytes to get good statistics for the change.

An interesting common way to use APM's is in a cascade, rather than a single step with a large context.

That is, you want to take some P from the primary model, and you have additional contexts C1 and C2 to condition the APM step. You have two options :


Single step, large context :

P_out = APM[ {C1,C2,P} ]

Two steps, small context, cascade :

P1 = APM1[ {C1,P} ]

P_out = APM2[ {C2,P1} ]

That is, feed P through an APM with one context, then take that P output and pass it through another APM with a different context, as many times as you like.
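A sketch of that two-step cascade (tiny pass-through tables; the sizes and names are mine):

```c
#include <assert.h>
#include <math.h>

#define PBITS 5
#define PSIZE (1 << PBITS)
#define CTXS  2

static double stage1[CTXS][PSIZE], stage2[CTXS][PSIZE];

/* pass-through init : an unhelpful stage is (approximately) a NOP */
static void cascade_init(void)
{
    for (int c = 0; c < CTXS; c++)
        for (int i = 0; i < PSIZE; i++)
            stage1[c][i] = stage2[c][i] = (i + 0.5) / PSIZE;
}

static int qp(double p)          /* quantize P to a table index */
{
    int i = (int)(p * PSIZE);
    return i < PSIZE ? i : PSIZE - 1;
}

static double cascade(double p, int c1, int c2)
{
    double p1 = stage1[c1][qp(p)];       /* P1 = APM1[ {C1,P} ] */
    return stage2[c2][qp(p1)];           /* P_out = APM2[ {C2,P1} ] */
}
```

Each stage would be updated toward the observed bits in its own {context, P} slot (update not shown), so a stage whose context is uncorrelated with P stays near pass-through.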

Why use a cascade instead of a large context? The main reason is density.

If you have N bits of extra context information to use in your APM lookup, the traditional big table approach dilutes your statistics by 2^N. The cascade approach only dilutes by *N.

The smaller tables of the APM cascade mean that it cannot model certain kinds of correlations. It can only pick up ways that each context correlated to biases on P. It cannot pick up joint context correlations.


The APM cascade can model things like :

if C1 = 0 , high P's tend to skew higher  (input P of 0.7 outputs 0.9)

if C2 = 1 , low P's tend to skew lower

But it can't model joint things like :

if C1 = 0 and C2 = 1 , low P's tend towards 0.5

One nice thing about APM cascades is that they are nearly a NOP when you add a context that doesn't help. (as opposed to the single table large context method, which has a negative diluting effect when you add context bits that don't help). For example if C2 is just not correlated to P, then APM2 will just pass P through unmodified. APM1 can model the correlation of C1 and P without being molested by the mistake of adding C2. They sort of work independently and stages that aren't helping turn themselves off.

One funny thing you can do with an APM cascade is to just use it for modeling with no primary model at all. This looks like :


Traditional usage :

do order-3 context modeling to generate P
modify observed P with a single APM :
P = APM[P]

APM Cascade as the model :

P0 = order-0 frequency
P1 = APM1[ o1 | P0 ];
P2 = APM2[ o2 | P1 ];
P3 = APM3[ o3 | P2 ];

That is, just run the probability from each stage through the next to get order-N modeling.

We do this from low order to high, so that each stage can add its observation to the extent it has seen events. Imagine that the current order2 and order3 contexts have not been observed at all yet. Then we get P1 from APM1 and pass it through APM2; since that has seen nothing it just passes P1 through, and then APM3 does too. So we wind up getting the lowest order that has actually observed things.

This method of using an APM cascade as the primary model is used by Mahoney in BBB.

A 2d APM can also be used as a mixer, which I think I will leave for the next post.

4/18/2018

Race in EOF on Windows Pipes

There appears to be a race in the EOF check on Windows pipes. I haven't seen this written about elsewhere, so here it is.

TLDR : don't call eof on pipes in Windows! Use read returning zero instead to detect eof.

We observed that semi-randomly pipes would report EOF too soon and we'd get a truncated stream. (this is not the issue with ctrl-Z ; of course you have to make sure your pipe is set to binary).

To reproduce the issue you may use this simple piper :


// piper :

#include <fcntl.h>
#include <io.h>
#include <stdio.h>
#include <errno.h>

int main(int argc,char *argv[])
{
    int f_in  = _fileno(stdin);
    int f_out = _fileno(stdout);
    // f_in & out are always 0 and 1

    _setmode(f_in,  _O_BINARY);
    _setmode(f_out, _O_BINARY);

    char buf[4096];
    
    // NO : sees early eof!
    //  don't ask about eof until _read returns 0
    while ( ! _eof(f_in) )
    // YES : just for loop here
    //for(;;)
    {
        int n = _read(f_in,buf,sizeof(buf));
        if ( n > 0 )
        {
            _write(f_out,buf,n);
        }
        else if ( n == 0 )
        {
            // I think eof is always true here :
            if ( _eof(f_in) )
                break;
        }
        else
        {
            int err = errno;
            
            if ( err == EAGAIN )
                continue;
                
            // respond to other errors?
            
            if ( _eof(f_in) )
                break;
        }
    }

    return 0;
}

the behavior of piper is this :

vc80.pdb                 1,044,480

redirect input & output :

piper.exe < vc80.pdb > r:\out

r:\out                      1,044,480

consistently copies whole stream, no problems.

Now using a pipe :

piper.exe < vc80.pdb | piper > r:\out

r:\out                         16,384
r:\out                         28,672
r:\out                         12,288

semi-random output sizes due to hitting eof early

If the eof check marked with "NO" in the code is removed, and the for loop is used instead, piping works fine.

I can only guess at what's happening, but here's a shot :

If the pipe reader asks about EOF and there is nothing currently pending to read in the pipe, eof() returns true.

Windows anonymous pipes are unbuffered. That is, they are lock-step between the reader & writer. When the reader calls read() it blocks until the writer puts something, and vice-versa. The bytes are copied directly from writer to reader without going through an internal OS buffer.

In this context, what this means is if the reader drains out everything the writer had to put, and then races ahead and calls eof() before the writer puts anything, it sees eof true. If the writer puts something first, it sees eof false. It's just a straight up race.


time    reader                          writer
-----------------------------------------------------------
0                                       _write blocks waiting for reader
1       _eof returns false
2       _read consumes from writer
3                                       _write blocks waiting for reader
4       _eof returns false
5       _read consumes from writer
6       _eof returns true
7                                       _write blocks waiting for reader

at times 1 and 4, the eof check returns false because the writer had gotten to the pipe first. At time 6 the reader runs faster and checks eof before the writer can put anything, so now it sees eof true.

As a check of this hypothesis : if you add a Sleep(1) call immediately before the _eof check (in the loop), there is no early eof observed, because the writer always gets a chance to put data in the pipe before the eof check.

Having this behavior be a race is pretty nasty. To avoid the problem, never ask about eof on pipes in Windows, instead use the return value of read(). The difference is that read() blocks on the writer process, waiting for it to either put some bytes or terminate. I believe this is a bug in Windows. Pipes should never be returning EOF as long as the process writing to them is alive.

The Kraft Number, Binary Arithmetic, and Incremental Package Merge

The Kraft number that we compute for length limited Huffman codelen construction can be thought of as a set of bit flags that tell you which codelens to change.

Recall


K = Sum{ 2^-L_i }

K <= 1 is prefix codeable

you can think of K as the sum of effective coding probabilities (P = 2^-L), so K over 1 is a total probability over 100%.

when we initially make Huffman codelens and then apply a limit, we use too much code space. That corresponds to K > 1.

If you write K as a binary decimal, it will be something like :


K = 1.00101100

Those 1 bits that are below K = 1.0 are exactly the excess codelens that we need to correct.

That is, if you have an extra symbol of length L, K is too high by 2^-L ; that's a 1 bit in the binary expansion at position L to the right of the decimal.


codelen set of {1,2,2,3}

a : len 1
b : len 2
c : len 2
d : len 3

K = 2^-1 + 2^-2 + 2^-2 + 2^-3

K = 0.100 +
    0.010 +
    0.010 +
    0.001

K = 1.001

take (K-1) , the part below the decimal :

K-1 = .001

we have an excess K of 2^-3 ; that's one len 3 too many.

To fix that we can change a code of len 2 to len 3

that does 

-= 0.010
+= 0.001

same as

-= 0.001

That is, when (K-1) has 2 leading zeros, you correct it by promoting a code of len 2 to len 3

so if we compute K in fixed point (integer), we can just take K - one and do a count leading zeros on it, and it tells you which code len to correct. The bits that are on in K tell us exactly what's wrong with our code.
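This bit trick can be sketched in a few lines of Python (a hypothetical sketch, not Oodle's actual code), computing K in fixed point for the {1,2,2,3} example above and reading off the codelen to correct from the position of the excess bit :

```python
# Compute the Kraft number K in fixed point and find which codelen to
# correct from the bit position of the excess. Hypothetical sketch.

LIMIT = 16
ONE = 1 << LIMIT              # fixed point 1.0

def kraft_fixed(lens):
    return sum(ONE >> L for L in lens)

lens = [1, 2, 2, 3]           # the {1,2,2,3} example above
K = kraft_fixed(lens)
excess = K - ONE              # K = 1.001 binary -> excess = 0.001

# position of the excess bit below the decimal = the codelen to correct
# (equivalent to the count-leading-zeros described above)
L = LIMIT - excess.bit_length() + 1
print(L)                      # -> 3 : promote a code of len (L-1)=2 to len L=3
```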

Now, similarly, we can think of the compound operations in the same way.

Any time we need to do a correction of K by 2^-L we could do one code from (L-1) -> L , or we could do two codes from L -> (L+1), or ... that can be seen as just an expression of the mathematics of how you can add bits together to make the desired delta.

That is :


You want 0.001 (increment codelen 2 to 3)

that can be :

0.001 = 0.0001
       +0.0001

(increment two lower lens)

or :

0.001 = 0.01
       -0.001

increment a lower codelen then decrement one at your level

Now, thinking about it this way we can try to enumerate all the possible moves.

To reduce the space of all possible moves, we need a few assumptions :

1. No non-len-group changing moves are profitable. That is, the set of symbols for the current len groups is the best possible set. eg. it's not profitable to do something like { symbol a from len 2 to 3 and symbol b from len 3 to 2 }. If there are any profitable moves like that, do them separately. What this means is the counts are sorted; eg. if a symbol is at a higher codelen, its count is less than or equal to the count of any symbol at a lower codelen.

2. I only need to enumerate the moves that can be the cheapest (in terms of total code len).

In that case I think that you can enumerate all the moves thusly :


a change of 2^-L can be accomplished via

inc(L-1)   (that means increment a codelen at L-1)

inc(L) can be substituted with { inc(L+1), inc(L+1) }

And each of those inc(L+1) can also be substituted with a pair at L+2.
You take either a codelen at L or two at the next higher len, 
whichever has a smaller effect on CL (total codelen).

a change of 2^-L can also be accomplished via :

inc(L-2) and dec(L-1)

OR

inc(L-3) and dec(L-2) and dec(L-1)

again these can be understood as binary decimals :

0.0001 = 0.0010 - 0.0001
0.0001 = 0.0100 - 0.0011
0.0001 = 0.1000 - 0.0111

and finally the decs are also a tree of pairs :

dec(L) can instead be { dec(L+1), dec(L+1) }

this seems like a lot, but compared to all possible ways to make the number X from adds & subtractions of any power of two, it's quite small. The reason we can consider such a reduced set of moves is because we only need the one best way to toggle a bit of K, not all possible ways.

Really we just do :


enumerate the position of the lowest codelen to inc,
call it j, for j between 1 and (L-1)

decrement one codelen at each level after the one you incremented,
that is at levels (j+1) through (L-1)

this produces the desired change in K of 2^-L

each "inc" & "dec" can either be at that code len, or a pair of next codelen

(somebody smarter than me : prove that these are in fact all the necessary moves)
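Short of a proof, the arithmetic of the moves is easy to sanity-check exactly. Using the convention from above that inc(L) changes K by -2^-(L+1) and dec(L) changes it by +2^-(L+1), a quick Python check (a sketch of the arithmetic only, not of the move search) :

```python
from fractions import Fraction

def inc(L):   # increment a codelen at level L : K -= 2^-(L+1)
    return -Fraction(1, 2 ** (L + 1))

def dec(L):   # decrement a codelen at level L : K += 2^-(L+1)
    return Fraction(1, 2 ** (L + 1))

L = 5
target = -Fraction(1, 2 ** L)   # the desired change of 2^-L in K

# inc at some level j, then dec at every level from (j+1) up to (L-1) :
for j in range(1, L):
    delta = inc(j) + sum((dec(d) for d in range(j + 1, L)), Fraction(0))
    assert delta == target

# and a pair at the next level substitutes for one inc :
assert inc(L) == inc(L + 1) + inc(L + 1)
print("all moves check out")
```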

Let's look at how dynamic programming reduces the amount of work we have to do.


Say we need to do an inc() at L = 3

(inc means increment a codelen, that decreases K by 2^-4)

We can either increment a single symbol at L = 3
or a pair from L = 4

(this is just the same kind of operation you do in normal Huffman tree building)

The best symbol at L = 3 is just the lowest count symbol (if any exists)

Ask for the best two nodes at L = 4

Those can also be a single symbol, or a pair from L = 5

When you ask for the first node at L = 4, it gets the best two at L = 5

but then imagine the single symbol at L = 4 was lower count and is taken

Now you ask for the second node at L = 4, it again needs the best two at L = 5

we already have them, no more work is needed.

Any time you choose a symbol rather than a pair of higher len, the 2^n tree stops growing.

[3] -> { [4a], [4b] }

[4a] -> sym(4) or { [5a], [5b] }

[4a] takes sym(4)

[4b] -> sym2(4) or { [5a], [5b] }

[4b] doesn't need a new evaluation at level 5

Another issue I need to mention is that as you increment and decrement codelens, they move between the lists, so the cached dynamic programming lists cannot be retained, or can they? (for example, you want to keep the symbols at each len sorted by count)

In fact the accounting for symbols moving is simple and doesn't need to invalidate the cached lists.


When you do an inc(L) , that symbol moves to L+1 and is now available for a further inc(L+1)

(this does not occur with dec(L) since it moves in the opposite direction)

Say you wanted an inc(3) ; you consider doing a pair of { inc(4), inc(4) }

One of the inc(4)'s can be a pair of inc(5)'s , and one of those len 5 symbols can be the one you did inc 4 on.

That is, say you have 'A' at len 4 and 'B' at len 5

inc(3) <- { inc( 'A' @ 4 ) , { inc( 'A' @ 5 ) , inc( 'B' @ 5 ) } }

This is a legal move and something you have to consider.

But the rule for it is quite simple - if a symbol occurs earlier in the list of chosen increments, it is
available at the next level.

If you're familiar with the way Package Merge makes its lists, this is quite similar. It just means when you choose the lowest count symbol at the current level, you can also draw from the previous increments in your list if they have lower count.

These queues we are building are exactly the same thing you would need to do the full package merge algorithm. The difference is, in traditional Package Merge you would start with all the symbols at codelen = max (K is too low), and then incrementally apply the best decrements to increase K. Here we are starting with K pretty close to 1 , with K greater than 1. The result is that in many cases we can do far fewer package merge steps. I call this Incremental Package Merge. It allows you to start from a nearly-Kraft set of codelens and get to the same optimal solution as if you did full package merge.

Let's look at a concrete example or two :


codelen : symbol+count

len 3 : a+7
len 4 : b+3 , c+3
len 5 : d+1 , e+1

you need an inc(3) to get K -= 2^-4

you can :

inc(a) ; cost 7
inc(b + c) ; cost 6
inc(b + { d + e } ) ; cost 5

The best way to do the inc(3) is {b,d,e} , for a total cost of 5

Another example :

len 3 : a+7
len 4 : b+2
len 5 : c+1

again say you want an inc(3)

the best is

inc( b + { b + c } ) ; cost 5

here the best option is to inc b twice
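The K accounting behind these candidates can be verified exactly (a quick sketch; the costs are just the summed symbol counts) :

```python
from fractions import Fraction

def k_delta(lens_incremented):
    # net change in K from incrementing each of these codelens by one :
    # -2^-L + 2^-(L+1) = -2^-(L+1)
    return sum(-Fraction(1, 2 ** (L + 1)) for L in lens_incremented)

target = -Fraction(1, 16)     # the effect of an inc at len 3 : K -= 2^-4

# example 1 : a@3 count 7, b,c@4 count 3, d,e@5 count 1
assert k_delta([3]) == target          # inc(a)              : cost 7
assert k_delta([4, 4]) == target       # inc(b) + inc(c)     : cost 6
assert k_delta([4, 5, 5]) == target    # inc(b) + inc(d,e)   : cost 5

# example 2 : after inc'ing b from 4 to 5, b is available again at len 5
assert k_delta([4, 5, 5]) == target    # inc(b@4) + inc(b@5) + inc(c@5) : cost 5

print("ok")
```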

And finally let's think again about how Package Merge is related to the "numismatic" or "coin collector problem".

If you play with this you will see what we're really doing is working on a two-cost accountancy. Each symbol has a cost in K which is determined only by its current codelen (2^-L). It also has a cost in total codelen CL which is determined only by its symbol count. We are trying to pay off a debt in K (or spend a credit in K) and maximize the value we get in CL.

4/04/2018

Engel Coding and Length Limited Huffman Heuristic Revisited

Joern Engel has written about a technique he described to me for forming prefix code lengths here on his blog.

Engel Coding is a fast/approximate way of forming length limited prefix code lengths.

I'm going to back up first and remind us what a prefix code is and the use of the Kraft inequality.

We want to entropy code some alphabet using integer code lengths. We want those codes to be decodeable without side information, eg. by only seeing the bits themselves, not transmitting the length in bits.

0  : a
10 : b
11 : c
is a prefix code. If you see the sequence "11100" it can only decode to "11=c,10=b,0=a".
0  : a
1  : b
11 : c
is not a prefix code. Any occurrence of "11" can either be two b's or one c. They can't be resolved without knowing the length of the code. There is no bit sequence you can possibly assign to c that works. The fact that this is impossible is a function of the code lengths.

You can construct a prefix code from code lengths if and only if the lengths satisfy the Kraft inequality :


Sum{ 2^-L_i } <= 1

It's pretty easy to understand this intuitively if you think like an arithmetic coder. 2^-L is the effective probability for a code length, so this is just saying the probabilities must sum to <= 100%

That is, think of the binary code as dividing the range [0,1] like an arithmetic coder does. The first bit divides it in half, so a single bit code would take half the range. A two bit code takes half of that, so a quarter of the original range, eg. 2^-L.
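The check itself is a one-liner; here it is on the two example codes above (a sketch, using exact rational arithmetic) :

```python
from fractions import Fraction

def kraft(lens):
    # the Kraft number : sum of the effective probabilities 2^-L
    return sum(Fraction(1, 2 ** L) for L in lens)

print(kraft([1, 2, 2]))   # {0, 10, 11} : K = 1   -> prefix codeable
print(kraft([1, 1, 2]))   # the bad code above : K = 5/4 > 1 -> impossible
```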

The two numbers that we care about are the Kraft code space used by the code lengths, and the total code length of the alphabet under this encoding :


Kraft code space :

K = Sum{ 2^-L_i }

Total code length :

CL = Sum{ C_i * L_i }

L_i = code length in bits of symbol i
C_i = count of symbol i


Minimize CL subject to K <= 1 (the "Kraft inequality")

We want the minimum total code length subject to the prefix code constraint.

The well known solution to this problem is Huffman's algorithm. There are of course lots of other ways to make prefix code lengths which do not minimize CL. A famous historical one is Shannon-Fano coding, but there have been many others, particularly in the early days of data compression before Huffman's discovery.
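For reference, Huffman codelen construction can be sketched in a few lines (a toy version that merges the two lowest-count nodes until one remains; real implementations work from a sorted histogram rather than heap entries carrying symbol lists) :

```python
import heapq
from itertools import count

def huffman_codelens(freqs):
    # Optimal prefix code lengths for the given symbol counts.
    # Assumes at least two symbols, all counts > 0.
    tie = count()  # tiebreaker so the heap never compares the lists
    heap = [(f, next(tie), [i]) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    lens = [0] * len(freqs)
    while len(heap) > 1:
        fa, _, syms_a = heapq.heappop(heap)
        fb, _, syms_b = heapq.heappop(heap)
        merged = syms_a + syms_b
        for s in merged:       # everything under the merged node gets one bit deeper
            lens[s] += 1
        heapq.heappush(heap, (fa + fb, next(tie), merged))
    return lens

lens = huffman_codelens([7, 3, 3, 1, 1])
print(lens)                                  # -> [1, 3, 2, 4, 4]
assert sum(2.0 ** -L for L in lens) == 1.0   # exactly Kraft
assert sum(c * L for c, L in zip([7, 3, 3, 1, 1], lens)) == 30   # minimal CL
```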

Now for a length-limited code we add the extra constraint :


max{ L_i } <= limit

now Huffman's standard algorithm can't be used. Again the exact solution is known; to minimize CL under the two constraints of the Kraft inequality and the maximum codelength limit, the algorithm is "Package Merge".

In Oodle we (uniquely) actually use Package Merge at the higher compression levels, but it is too slow and complicated to use when you want fast encoding, so at the lower compression levels we use a heuristic.

The goal of the heuristic is to find a set of code lengths that satisfies the constraints and gets CL reasonably close to the minimum (what Package Merge would find).

The Oodle heuristic works by first finding the true Huffman code lengths, then if any are over the limit, they are changed to equal the limit. This now violates the Kraft inequality (they are not prefix decodeable), so we apply corrections to get them to K = 1. ZStd uses a similar method (and I imagine lots of other people have in the past; this is pretty much how length-limited near-Huffman is done). My previous post on the heuristic length limited code is below, with some other Huffman background :

cbloom rants: 07-03-10 - Length-Limitted Huffman Codes Heuristic
cbloom rants 07-02-10 - Length-Limitted Huffman Codes
cbloom rants 05-22-09 - A little more Huffman
cbloom rants 08-12-10 - The Lost Huffman Paper
cbloom rants Huffman Performance
cbloom rants Huffman Correction

(Engel correctly points out that most of the places where I say "Huffman coding" I should really be saying "prefix coding". The decoding methods and canonical code assignment and so on can be done with any prefix code. A Huffman code is only the special case of a prefix code with optimal lengths. That is, Huffman's algorithm is only the part about code length assignment; the rest is just prefix coding.)

So Engel's idea is : if we're going to limit the code lengths and muck them up with some heuristic anyway, don't bother with first finding the optimal non-length-limited Huffman code lengths. Just start with heuristic code lengths.

His heuristic is (conceptually) :


L_i = round( -log2( P_i ) )

which is intuitively a reasonable place to start. If your code lengths didn't need to be an integer number of bits, then you would want them to be as close to -log2(P) as possible.

Then apply the limit and fix the lengths to satisfy the Kraft inequality. Note that in this case the tweaking of lens to satisfy Kraft is not just caused by lens that exceed the limit. After the heuristic codelens are made, even if they are short, they might not be Kraft. eg. you can get code lengths like { 2,3,3,3,3,3,4,4,4 } which are not prefix (one of the 3's needs to be changed to a 4). The idea is that unlike Huffman or Shannon-Fano, which explicitly create a prefix code by construction, Engel coding instead makes code lengths which could be non-prefix and relies on a fix-up phase.
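The conceptual initial guess looks something like this (a sketch of the idea only; Engel's actual heuristic avoids the floating point log, see his blog) :

```python
import math

def initial_codelens(counts, limit=12):
    # conceptual guess : L_i = round( -log2( P_i ) ), clamped into [1, limit]
    total = sum(counts)
    return [max(1, min(limit, round(-math.log2(c / total)))) for c in counts]

lens = initial_codelens([7, 3, 3, 1, 1])
print(lens)                    # -> [1, 2, 2, 4, 4]
K = sum(2.0 ** -L for L in lens)
print(K)                       # -> 1.125 : over budget, needs the fix-up phase
```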

When Joern told me about this it reminded me of "Polar Coding" (Andrew Polar's, not the more common use of the term for error correction). Andrew Polar's code is similar in the sense that it tries to roughly assign log2(P) codelens to symbols, and then uses a fix-up phase to make them prefix. The details of the heuristic are not the same. (I suspect that there are lots of these heuristic entropy coders that have been invented over the years and usually not written down).

Obviously you don't actually want to do a floating log2; for the details of Engel's heuristic see his blog.

But actually the details of the initial codelen guess are not very important to Engel coding. His codelen adjustment phase is what actually determines the codelens. You can start the codelens all at len 1 and let the adjustment phase do all the work to set them, and in fact that gives the same final codelens!

I tried four methods of initial codelen assignment and they all produced the exact same final codelens. The only difference is how many steps of the iterative refinement were needed to get them to Kraft equality.


all initial codelens = 1    : num_adjustment_iterations = 2350943

codelens = floor( log2(P) ) : num_adjustment_iterations = 136925

codelens = round( log2(P) ) : num_adjustment_iterations = 25823

Engel Coding heuristic      : num_adjustment_iterations = 28419

The crucial thing is how the refinement is done.

To get to the refinement, let's go over some basics. I'll first describe the way we were actually doing the length limit heuristic in Oodle (which is not the same as what I described in the old blog post above).

In the Oodle heuristic, we start with Huffman, then clamp the lens to the limit. At this point, the Kraft K is too big. That is, we are using more code space than we are allowed to. We need to raise some codelens somewhere to free up more code space. But raising codelens increases the total codelen (CL). So the goal is to bump up some codelens to get K = 1, with a minimum increase to CL.


If you do L_i ++

K -= 2^(-L_i)
K += 2^(-(L_i+1))

for a net change of :

K -= 2^(-(L_i+1))

(shorter codelen symbols makes a bigger change to K)

and CL does :

CL -= C_i * L_i
CL += C_i * (L_i + 1)

net :

CL += C_i

(lower count symbols hurt total codelen less)

K_delta_i = 2^(-(L_i+1))
CL_delta_i = C_i

To get under the K budget, we want to find the lowest CL_delta with the maximum K_delta. That is, code space (K) is the resource you want to buy, and code len (CL) is the currency you use to pay for that code space. You want the best price :

price = CL_delta / K_delta

price = C_i * 2^(L_i+1)

What I was doing in Oodle was taking the step with the best "price" that didn't overshoot the target of K = 1.

If your symbols are sorted by count (which they usually are for Huffman codelen algorithms), then you don't need to compute "price" for all your symbols. The minimum price will always occur at the lowest count symbol (first in the sorted list) at each codelen. So rather than making a full heap of up to 256 symbols (or whatever your alphabet size is), you only need a heap of 16 entries (or whatever your codelen limit is) : the lowest count symbol at each codelen.
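This non-overshooting greedy fix-up can be sketched as follows (hypothetical code, not Oodle's; for simplicity it scans all symbols for the best price instead of keeping the per-codelen heap just described) :

```python
def limit_fixup(lens, counts, limit):
    # Greedy Kraft fix-up : while K > 1, increment the codelen with the
    # lowest price C_i * 2^(L_i+1) whose step does not overshoot K = 1.
    one = 1 << limit                   # fixed point : "one" = 2^limit
    K = sum(one >> L for L in lens)
    while K > one:
        best = None
        for i, (L, c) in enumerate(zip(lens, counts)):
            if L >= limit:
                continue
            step = one >> (L + 1)      # K delta of doing L_i ++
            if K - step < one:
                continue               # would overshoot below K = 1
            price = c << (L + 1)       # CL_delta / K_delta
            if best is None or price < best[0]:
                best = (price, i, step)
        if best is None:
            break                      # stuck; a real implementation bails out
        _, i, step = best
        lens[i] += 1
        K -= step
    return lens

# a not-quite-Kraft set of lens (K = 1.125) with sorted counts, limit 4 :
print(limit_fixup([1, 2, 2, 4, 4], [7, 3, 3, 1, 1], 4))  # -> [1, 3, 2, 4, 4]
```

On this small example the fix-up lands on codelens with total codelen 30, which happens to be optimal for these counts.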

The big improvement in Engel's refinement heuristic is that it allows overshooting K if the absolute value of the distance to K decreases.

Consider K in fixed point with a 12 bit codelen limit. Then "one" is at K = 4096. Say you had K = 4099. It's 3 too big. My heuristic could only consider K steps of -=2 and -=1 (only power of two steps are possible). Engel can also take a step of -= 4 , changing K to 4095. It's now 1 too small (codeable but wasteful) and rather than increasing codelens to fit in the code space, we can decrease a symbol codelen somewhere to gain some total codelen.

Engel converges to K = one by (possibly) taking successively smaller overshooting steps, so K wiggles around the target, delta going positive & negative. This does not always converge, so a bail out to a simpler approach is needed. This overshooting lets it get to K by doing a combination of positive and negative steps (eg. 3 = 4 - 1 , not just 3 = 1 + 2), which is a little bit of a step towards Package Merge (the difference being that package merge find the cheapest path to get the desired sum, while Engel's heuristic is greedy, taking the single cheapest step each time).

In practice this turns out to be much better than only taking non-overshooting steps.

Time to look at the results on some real data :


"seven" test set, cut into 64k chunks, order 0 entropy coded

comparing code len to package merge (ideal)

The length in excess (percent) is :

excess percent = (codelen(heuristic) - codelen(packagemerge))*100/codelen(packagemerge)

Huffman then limit monotonic      mean : 0.027% max : 2.512%
Huffman then limit overshooting   mean : 0.002% max : 0.229%
Engel coding                      mean : 0.016% max : 10.712%

(codelen limit of 12, 256 symbol byte alphabet)

The heuristic (overshooting) limit is really very good, extremely close to package merge and even the maximum excess len is small. Engel coding (non-Huffman initial code lengths) works fine on average but does have (very rare) bad cases. This is not surprising; there's a reason we use Huffman's algorithm to get the code lengths right.

In that bad 10.71% excess case, the package merge average code len is 1.604 but Engel coding produces an average code length of 1.776

Note that many of the blocks in this test did not hit the codelen limit; in that case "Huffman then limit" produces the best possible codelens, but Engel coding might not.

For most applications, it's probably best to make the true Huffman code and then limit the lengths with a heuristic. The time saved from the approximate initial code lengths is pretty small compared to other operations needed to do entropy coding (histogramming for example is annoyingly slow). Nevertheless I found this technique to be an interesting reminder to keep an open mind about approximations and understand where our algorithms came from and why we use the ones we do.


Another thing I find interesting is how Engel Coding points back to Package Merge again.

First there's the fact that you can start Engel Coding with just all the codelens set at 1 , and let the Kraft fixup make the codelens. That's how Package Merge works. It starts all codelens at 1, and then increments the cheapest ones until it gets up to Kraft. The log2ish starting guess for Engel Coding is just a way of jump-starting the codelens closer to the final answer to avoid lots of steps.

Engel Coding's overshooting heuristic improves on the monotonic heuristic by allowing you to take some +- steps. That is, increment one codelen and decrement another. In particular it can do things like : rather than increment a len 3 codelen , instead increment a len 2 codelen and decrement a len 3 codelen. This is the kind of move that you need to make to get to real optimal code lens.

The key missing element is considering the costs of all possible steps and finding a path to the desired K. That is, Engel coding takes greedy (locally cheapest price) steps, which may not give the optimal path overall. The way to turn this greedy algorithm into an optimal one is dynamic programming. Lo and behold, that's what Package Merge is.

3/26/2018

Compression of Android Game APK Packages with Oodle

This is a look at compression of Android Game APK Packages. In this study I'm mainly looking at the issue of transmission of the APK to the client, not storage on the client and decompression at runtime. The key difference between the two is that for transmission over the network, you want to compress the package as a "tar", that is without division into files, so that the compressor can use cross-file correlation. For storage on disk and runtime loading, you might want to store files individually (or perhaps combined and/or split into paging units), and you might want some files uncompressed.

The Android APK package is just a zip (thanks to them for just using zip and not changing the header so that it can be easily manipulated with standard tools).

I chose the list of games from this article : Google's instant app tech now lets you try games before you buy which is : Clash Royale, Words With Friends 2, Bubble Witch 3 Saga, Final Fantasy XV: A New Empire, Mighty Battles and -- of course -- Solitaire

I discovered that "Mighty Battles" internally contains a large pre-compressed pak file. (it's named "content.mp3" but is not really an mp3, it's some sort of compressed archive. They use the mp3 extension to get the APK package system to store it without further zlib compression.) Because of that I excluded Mighty Battles from the test; it would be about the same size with every compressor, and is not reflective of how it should be compressed (their internal package should be uncompressed if we're testing how well the outer compressor does). Later I also saw that "Clash Royale" is also mostly pre-compressed content. Clash Royale has its content in ".sc" files that are opaque compressed data. I left it in the test set, but it should also have those files uncompressed for real use with an outer compressor. I wasn't sure which Solitaire to test; I chose the one by Zynga.

The "tar" is made by unpacking the APK zip and concatenating all the files together. I also applied PNGz0 to turn off zlib compression on any PNGs. I then tested various compressors on the game tars.

                      original           tar          zlib     Leviathan
BubbleWitch3        78,032,875   304,736,621    67,311,666    54,443,823
ClashRoyale        101,702,690   124,031,098    98,386,824    93,026,161
FinalFantasyXV      58,933,554   144,668,500    57,104,802    41,093,459
Solitaire           14,814,888   139,177,140    14,071,999     8,337,863
WordsWithFriends2   78,992,339   570,621,614    78,784,623    53,413,494
total              332,476,346 1,283,234,973   315,659,914   250,314,800

original  = size of the source APK (per-file zip with some files stored uncompressed)
tar       = unzipped files, with PNGz0, concatenated together
zlib      = zip -9 applied to the tar ; slightly smaller than original
Leviathan = Oodle Leviathan level 8 (Optimal4) applied to the tar

You can see that Clash Royale doesn't change much because it contains large amounts of pre-compressed data internally. The other games all get much smaller with Leviathan on a tar (relative to the original APK, or zlib on the tar). eg. BubbleWitch3 was 78 MB, Leviathan can send it in 54.4 MB ; Solitaire can be sent in almost half the size.

Leviathan is very fast to decode on ARM. Despite getting much more compression than zlib, it is faster to decode. More modern competitors (ZStd, brotli, LZMA) are also slower to decode than Leviathan on ARM, and get less compression.

For reference, here is the performance on this test set of a few compressors (speeds on Windows x64 Core i7-3770) :

Note that some of the wins here are not accessible to game developers. When a mobile game developer uses Oodle on Android, they can apply Oodle to their own content and get the size and time savings there. But they can't apply Oodle to their executable or Java files. The more they shrink their content, the larger the proportion of their APK that is made up of files they can't compress. To compress all the content in the APK (both system and client files, as well as cross-file tar compression) requires support from the OS or transport layer.

I'll also take this chance to remind clients that when using Oodle, you should always try to turn off any previous compression on your data. For example, here we didn't just try the compressors directly on the APK files (which are zip archives and have previous zlib compression), we first unpacked them. We then further took the zlib compression off the PNG's so that the outer compressors in the test could have a chance to compress that data better. The internal compressors used on Clash Royale and Mighty Battles should also have been ideally turned off to maximize compression. On the other hand, turning off previous compression does not apply to data-specific lossy compressors such as audio, image, and video codecs. That type of data should be passed through with no further compression.

3/25/2018

Improving the compression of block-compressed textures Revisited

ADD: the best solution for this now is our new product Oodle Texture

I'm leaving this post for historical reference, but it is now obsolete and you should look at the Oodle Texture results instead.


The Oodle LZ compressors are especially good on binary data, such as the block compressed textures used in games.

(by "block compressed textures" I mean BCn, ETC1, ETC2, etc. textures in fixed size blocks for use with GPU's. I do *not* mean already compressed textures such as JPEG, PNG, or BCn that has already been compressed with crunch. You should not be applying Oodle or any other generic compressor on top of already compressed textures of that type. If you have a lot of PNG data consider PNG without ZLib or look for the upcoming Oodle Lossless Image codec.)

See the Appendix at the bottom for a comparison of modern LZ compressors on BCn data. Oodle LZ gets more compression and/or much faster decode speeds on BCn data.

So you can certainly just create your texture as usual (at maximum quality) and compress it with Oodle. That's fine and gives you the best visual quality.

If you need your texture data to be smaller for some reason, you can use a data-specific lossy compressor like crunch (or Basis), or you could use RDO texture creation followed by Oodle LZ compression.

(I've written about this before, here : Improving the compression of block-compressed textures , but I'm trying to do a rather cleaner more thorough job this time).

RDO texture creation is a modification of the step that creates the block compressed texture (BCn or whatever) from the original (RGBA32 or whatever). Instead of simply choosing the compressed texture blocks that minimize error, blocks are chosen to minimize rate + distortion. That is, sometimes larger error is intentionally chosen when it improves rate. In this case, we want to minimize the rate *after* a following LZ compressor. The block compressed textures always have the same size, but some choices are more compressible than others. The basic idea is to choose blocks that have some relation to preceding blocks, thereby making them more compressible. Common examples are trying to reuse selector bits, or to choose endpoints that match neighbors.

RDO encoding of block compressed textures should always be done from the original non-compressed version of the texture, *not* from a previous block compressed encoding. eg. don't take something already in BC1 and try to run RDO to shrink it further. Doing that would cause the errors to add up, a bit like taking a JPEG and lowering its "quality" setting to make it smaller - that should always be done from original data.

Now, block compressed textures are already lossy. BC1 is quite bad; BC7 and ASTC less so. So adding more error may not be acceptable at all. If large amounts of error are acceptable in your texture, you may not ever be seeing the largest mip levels. Sending mip levels that are too large and never visible is a *far* larger waste of size than anything we do here, so it's important to have a process in your game to find those textures and shrink them.

The best tool I know of at the moment to do RDO texture creation is crunch by Rich Geldreich / Binomial. I'm told that their newer Basis product has an improved RDO-for-LZ but I don't have a copy to test. What I actually run is Unity's improvement to crunch. The way you use it is something like :

crunch_x64_unity.exe -DXT1 -fileformat dds -file input.png -maxmips 1 -quality 200 -out output.dds

That is, tell it to make fileformat DDS, it will do normal block texture compression, but with rate-favoring decisions.

NOTE : we're talking about lossy compression here, which is always a little subtle to investigate because there are two axes of performance : both size and quality. Furthermore "quality" is hard to measure well, and there is no substitute for human eyes examining the images to decide what level of loss is acceptable. Here I am reporting "imdiff" scores with my "combo" metric. The "imdiff" scores are not like an RMSE; they're roughly on a scale of 0-100 where 0 = no difference and 100 = complete garbage, like a percent difference (though not really).

Some results :


act3cracked_colour
1024x1024

non-RDO fast BC1 : 524,416 bytes
then Leviathan   : -> 416,981
imdiff     : 33.26

crunch RDO quality 200 , then Leviathan : -> 354,203
imdiff     : 36.15

file size 85% of non-RDO
error 109% of non-RDO

adventurer_colour 
1024x1024

non-RDO fast BC1 : 524,416 bytes
then Leviathan   : -> 409,874
imdiff     : 32.96

crunch RDO quality 200 , then Leviathan : -> 334,342
imdiff     : 33.48

file size 81% of non-RDO
error 102% of non-RDO

Personally I like crunch's RDO DDS at these high quality levels, 200 or above. It introduces relatively little error and the file size savings are still significant.

At lower quality levels use of crunch can be problematic in practice. Unfortunately it's hard to control how much error it introduces. You either have to manually inspect textures for damage, or run an external process to measure quality and feed that back into the crunch settings. Another problem is that crunch's quality setting doesn't scale with texture size; smaller size textures get less error and larger size textures get more error at the same "quality" setting, which means you need to choose a quality setting per texture size. (I think the issue is that crunch's codebook size doesn't scale with texture size, which makes it particularly bad for textures at 2048x2048 or above, or for large texture atlases).

Your other option besides doing RDO texture creation followed by LZ is to just use crunch's native "crn" format for textures.

Let's compare RDO+LZ vs crn for size. I will do this by dialing the quality setting until they get the same imdiff "combo" score, so we are comparing a line of equal distortion (under one metric).


act3cracked_colour
1024x1024

crunch crn 255 : -> 211,465
imdiff     : 42.33

crunch rdo dds 95 : -> 264,206
imdiff     : 42.36


adventurer_colour
1024x1024

crunch crn 255 : -> 197,644
imdiff     : 38.48

crunch rdo dds 101 : -> 244,402
imdiff     : 38.67

The native "crn" format is about 20% smaller than RDO + LZ on both of these textures. It is to be expected that custom compressors, well designed for one type of data, should beat general purpose compressors. Note that comparing "crn" sizes to just doing BCn + LZ (without RDO) is not a valid comparison, since they are at different error levels.

If you look at the quality settings, the "crn" mode at maximum quality is still introducing a lot of error. That "quality" setting is not on the same scale for crn mode and dds mode. Maximum quality (255) in crn mode is roughly equivalent to quality = 100 in dds mode. Unfortunately there seems to be no way to get higher quality in the crn mode.

This has been an attempt to provide some facts to help you make a good choice. There are three options : BCn (without RDO) + LZ , RDO BCn + LZ , or a custom compressed texture format like CRN. They have different pros and cons and the right choice depends on your app and pipeline.

Now we haven't looked at decode speed in this comparison. I've never measured crunch's decode speed (of CRN format), but I suspect that Oodle's LZ decoders are significantly faster. Another possible speed advantage for Oodle LZ is that you can store your BCn data pre-swizzled for the hardware, which may let you avoid more CPU work. I should also note that you should never LZ decompress directly into uncached graphics memory. You either need to copy it over after decoding (which is very fast and recommended) or start the memory as cached for LZ decoding and then change it to uncached GPU memory after the decode is done.


Appendix : Performance of LZ compressors on BCn data

Comparison of Oodle to some other compressors on samples of game texture data.

Repeating the "Game BCn" test from Oodle 2.6.0 : Leviathan detailed performance report : A mix of BCn textures from a game (mostly BC1, BC4, and BC7) :

"Game BCn" :

lzma 16.04 -9           : 3.692 to 1 :   64.85 MB/s
brotli24 2017-12-12 -11 : 3.380 to 1 :  237.78 MB/s
zlib 1.2.11 -9          : 2.720 to 1 :  282.78 MB/s
zstd 1.3.3 -22          : 3.170 to 1 :  485.97 MB/s

Kraken8                 : 3.673 to 1 :  880.99 MB/s
Leviathan8              : 3.844 to 1 :  661.93 MB/s

A different set : "test_data\image\dds" is mostly BC1 with some BC3 and some RGBA32

test_data\image\dds :

lzma 16.04 -9           : 2.354 to 1 :   39.53 MB/s
brotli24 2017-12-12 -11 : 2.161 to 1 :  161.40 MB/s
zlib 1.2.11 -9          : 1.894 to 1 :  222.70 MB/s
zstd 1.3.3 -22          : 2.066 to 1 :  443.96 MB/s

Kraken8                 : 2.320 to 1 :  779.84 MB/s
Leviathan8              : 2.386 to 1 :  540.90 MB/s
(note this is lzma with default settings; lzma with settings tweaked for BCn can sometimes get more compression than Leviathan)

While brotli and ZStd are competitive with Kraken's compression ratio on text (and text-like) files, they lag behind on many types of binary data, such as these block compressed textures.

old rants