cbloom rants: Order 0 estimators for data compression

I thought this might be an interesting simple survey topic to illustrate an introduction to data compression.

Assume that we are writing a compressor with only order-0 modeling, and that we are working on a binary alphabet, so we are just modeling the count of 0's and 1's. Maybe we have some binary data that we believe only has order-0 correlation in it, or maybe this is the back-end of some other stage of a compressor.

If the data is in fact stationary (the probabilites don't change over time) and truly order-0, then the best we can do is to count the # of 0's and 1's in the whole sequence to get the best possible estimate of the true probability of 0's and 1's in the source.

The first option is a static coder : (using the nomenclature of "static huffman" vs "adaptive huffman" ; eg. static means non-streaming, probabilities or counts transmitted at the start of the buffer)


Encoder counts n0 and n1 in the whole sequence
Encoder transmits n0 and n1 exactly

Encoder & Decoder both make
p0 = n0 / (n0+n1)
p1 = n1 / (n0+n1)

Lots of little notes here already. We didn't have to do any +1's to ensure we had non-zero probabilities as you often see, because we have the exact count of the whole stream. eg. if n0 is 0, that's okay because we won't have any 0's to code so it's okay that they're impossible to code.

Now, how do you do your entropy coding? You could feed p0&p1 to arithmetic coding, ANS, to an enumerative coder (since we know n0 and n1, we are just selecting one of the arrangements of those bits, of which there are (n0+n1)!/n0!n1! , and those are all equally likely, so just send an integer that selects one of those), you could group up bits and use huffman. For now we don't care how the back end works, we're just trying to model the probability to feed to the back end.

If n0 and n1 are large, they are probably specifying more precision than the coder can use, which is wasting bits. So maybe you want to send an approximation of just p0 in 14 bits or whatever your back-end can use.

If you do send n0 and n1 exactly, then obviously you don't need to send the file length (it's n0+n1), and furthermore you can gain some efficiency by decrementing n0 or n1 as you go, so that the last symbol you see is known exactly.

Okay, so moving on to adaptive estimators. Instead of transmitting p0 up front, we will start with no a-priori knowledge of the stream (hence p0 = 50%), and as we encounter symbols, we will update p0 to make it the best estimate based on what we've seen so far. The standard solution is :


Encoder & Decoder start with n0 and n1 = 0

Encoder & Decoder form a probability from the n0 and n1 seen so far

p0 = (n0 + B)/(n0+n1 + 2B)

symbols are coded with the current estimate of p0
after which n0 or n1 is incremented and a new p0 is formed

B is a constant bias factor
if B = 1/2 this is a KT estimator (optimal in a specific synthetic case, irrelevant)
if B = 1 this is a laplace estimator

Note that the bias B must be > 0 so that we can encode a novel symbol, eg. coding the first 0 bit when n0 = 0.

There's stuff in the literature about "optimal estimators" but it's all a bit silly, because the optimal estimator depends on the source and what the distribution of possible sources is.

That is, say you actually are getting bits from a stationary source that has a true (unknown) probability of 0 bits, T0. You could see a wide variety of sources with different values of T0, which occur with probability P(T0). After you see some bits, n0 and n1, you wish to compute a p0 which minimizes the expected codelen of the next symbol you see. To do that, you can compute the relative probability of seeing n0 and n1 events from a source of probability T0. But to form the correct final estimate you must have the information about the a-priori likelihood of each source P(T0) which in practice you never have.

So we have these estimators for stationary sources, but in the real world you almost never have a stationary source. So let's start looking at estimators we might actually want to use in the real world.

(it may actually be a pretty stationary source, but it could be stationary only under a more complex model, and any time you are not fully modeling the data, stationary sources appear to be dynamic. This is like flatlanders in 2d watching a 3d object move through their plane - it may actually be a rigid body in a higher dimension, but it looks dynamic when you have an incomplete view of it. For example data that has Order-1 correlation (probability depends on the previous symbol) will appear to have dynamic statitics under only an order-0 model (the probabilities will seem to change after each symbol is coded))

Let's start with the "static" case, transmitting p0 or n0/n1. We can improve it by just breaking the source into chunks and transmitting a model on each chunk, rather than a single count for the whole buffer. These chunks could be fixed size, but there are sometimes large wins by finding the ideal places to put the chunk boundaries. This is an unsolved problem in general, I don't know of any algorithm to do it optimally (other than brute force, which is O(N!) or something evil like that), we use hacky heuristics. Obviously chunks have a cost in that you must spend bits to indicate where the chunk boundaries are, and what the probabilities are in each chunk, so you must count the cost to send the chunk information vs. the bits saved by coding with different probabilities.

(the most extreme case is a buffer that has n0=n1, which would take n0+n1 bits to send as a single chunk, but if in fact all the 0's are at the start, and all the 1's are the end, then you can cut it into two chunks, in the first chunk p0=100% so the bits are sent in zero bits, in the second chunk p0=0% , so the total size is only the overhead of specifying the chunks and probabilities)

A slightly more sophisticated version of this scheme is to have several probability groups and to be able to switch between them from chunk to chunk, that is :


send the # of models, M
send the models
  in the binary case, p0 or n0/n1 for each model

send the # of chunks
for each chunk :
  send its length
  send a model selection m in [0,M)
  send the data in that chunk using model m

In a binary coder this is a bit silly, but in a general alphabet coder, the model might be very large (100 bytes or so), so sending the model selection m is much cheaper than sending the whole model. This method allows you to switch rapidly between models at a lower cost. eg. if your data is like 000000111111111100000001111111000000 - the runs of different-character data are best coded by switching between models. (we're still assuming we can only use order-0 coding). (this is what Brotli does)

Now moving on to adaptive estimators.

The basic approach is that instead of forming an estimate of future probabilities by counting all n0 and n1 events we have seen in the past, we will count based on what we've seen in the recent past, or weight more recent events higher than old ones.

This is rarely done in practice, but you can simply count the # of each symbol in a finite window and update it incrementally :


at position p
code bit[p]

p0 = (n0 + B)/(n0+n1 + 2B)

after coding, increment n0 or n1

if (n0+n1) = T , desired maximum total
  remove the bit b[p - T]
  by subtracting one from n0 or n1

this has the advantage of keeping the sum constant (once the sum reaches T), which you could use to make the sum power of 2. But it requires you actually have the previous T bits, which you usually don't if you are using the adaptive coder as part of a larger model.

This does illustrate a problem we will face with many of these adaptive estimators. There's an initial run-up phase. They start empty with no symbols seen, then count normally up to T, at which point they reach steady state.

A common old-fashioned approach is to renormalize the total to T/2 once it reaches T. This was originally done as a way of limitting the sum T inside the range allowed by the entropy coder (eg. it must fit in 14 bits in old arithmetic coders so that the multiplies fit in 32 bits). It was found that applying limits like this didn't hurt compression, they in fact help in practice, because they make the statistics more adaptive to local changes.


after coding increment n0 or n1

if (n0+n1) = T
    n0 /= 2 , n1 /= 2;

This is actually the same as a piecewise-linear approximation of geometric falloff of counts. A true geometric update is like this :


once steady state is reached :
n0+n1 == T always

after coding
n0 or n1 += 1 
n0+n1 == T+1 now

n0 *= T/(T+1)
n1 *= T/(T+1)

now n0+n1 == T again

this is equivalent to doing :

n0 or n1 += inc
inc *= (T+1)/T

let G = (T+1)/T is the geometric growth factor

events contribute with weights :

1,G,G^2,G^3,etc..

now, nobody does a geometric update quite like this because it requires high precision counts (though you can do piecewise linear approximations of this and fixed point versions, which can be interesting). There is a way to do a geometric update in finite precision that's extremely common :


p0 probability is fixed point (12-14 bits is common)

at steady state

after coding a 1 : p0 -= p0 >> updshift
after coding a 0 : p0 += (one - p0) >> updshift

this is equivalent to the "renorm every step to keep n0+n1 = T" with T = 1<<updshift

This gives an efficient way to do a very recency-biased (geometric) estimator. For most of the estimators I'm talking about, the non-binary alphabet extension is obvious, and I'm just doing binary here for simplicitly, but in this case the non-binary alphabet version is non trivial. Fabian works it out here : Mixing discrete probability distributions , and Models for adaptive arithmetic coding .

For people familiar with filtering, it should be obvious that what we're really doing here is running filters over the previous events. The "window" estimator is a simple FIR filter with a box response. The geometric estimator is the simplest IIR filter.

In all our (adaptive) estimators, we have ensured that P0 and P1 are never zero - we need to be able to code either bit even if we've never seen one before.

To do this, we often add on a count to n0 and n1 (+B above), or ensure it's non-zero.

In the binary updshift case, the minimum of p0 is where (p0 >> updshift) is zero, that's


p0min = (1 << updshift) - 1

which in practice is actually quite a large minimum probability of the novel symbol. That turns out to be desirable in very local fast-adaptive estimators. What you want is if the last 4 events were all 1 bits, you want the probability P1 to go very high very fast - but you don't want to be over-confident about that local model matching future bits, so you want P0 to stay at some floor.

Essentially what we are doing here is blending in the unknown or "flat" model (50/50 probability of 0 or 1 bit) with some desired weight. So you might have a very jerky strongly adapting local model, but then you also blend in the flat model as a hedge.

The geometric update be extended to "two speed" :


track two running estimators, p0_a and p0_b

make p0 = (p0_a + p0_b)/2
use p0 for coding

after the event is see update each with different speeds :

after coding a 1 : p0_a -= p0_a >> updshift_a
after coding a 0 : p0_a += (one - p0_a) >> updshift_a

and p0_b with updshift_b

eg. you might use
updshift_a = 4 (a very fast model)
updshift_b = 8 (a slower model)
(with one = 1<<14)

Naively this looks like an interesting blend of two models. Actually since it's all just linear, it's in fact still just an IIR filter. It's simply a slightly more general IIR filter; the previous one was a one-tap filter (previous total and new event), this one is a two-tap filter (two previous totals and new event).

But this leads us to somethat that is interesting, which is more general blending.

You could have something like 3 models : flat (all symbols likely), a very fast estimator that strongly adapts to local statistics, and a slow estimator (perhaps n0/n1 counts for the whole file) that is more accurate if the file is in fact stationary.

Then blend the 3 models based on local performance. The blend weight for a simple log-loss system is simply the multiple of probabilities of that model on the preceding symbols.

Now, a common problem with these IIR type filters is that they assume steady state. You may recall previously we talked about the renormalization-based adaptive coder that has two phases :


track n0,n1

ramp-up phase , while (n0+n1) < T
  initialize n0=n1=0
  n0 or n1 += 1

stready-state :
  when n0+n1 = T , renorm total to T/2
  n0 or n1 += 1

If you're doing whole-file entropy coding (eg. lots of events) then maybe the ramp-up phase is not important to you and you can just ignore it, but if you're doing context modeling (lots of probability estimators in each node of the tree, which might not see very many events), then the ramp-up phase is crucial and can't be ignored.

If you want something efficient (like the updshift geometric model), but that accounts for ramp-up vs steady state, the answer is table lookups. (the key difference in the ramp-up phase is that adaptation early on is much faster than once you reach steady state)

This actually goes back to the ancient days of arithmetic codec, in the work of people like Howard & Vitter, and things like the Q-coder from IBM.

The idea is that you have a compact state variable which is your table index. It starts at an index for no events (n0=0,n1=0), and counts up through the ramp-up phase. Then once you reach steady state the index ticks up and down on a line like the p0 in updshift. Each index has a state transition for "after a 0" and "after a 1" to adapt. Something like :


ramp-up :

0: {0,0} -> 1 or 2
1: {1,0} -> 3 or 4
2: {0,1} -> 3 or 5
3: {1,1} -> 6 or 7
4: {2,0} -> 
5: {0,2} -> 

etc.

then say T = 16 is steady state, you have

{0,16} {1,15} {2,14} ... {16,0}

that just transitions up and down

And obviously you don't need to actually store {n0,n1}, you just store p0 in fixed point so you can do divide-free arithmetic coding. So there's like a tree of states for the ramp-up phase, then just a line back and forth at steady state.

And those states are not actually what you want at steady state. Actually finding the ideal probabilities for steady state is complex and in the end can only be solved by iteration. I won't go into the details but just quickly touch on the issues.

You might start with a mid point at p0=0.5 , at simulated T=16 that corresponds to {8,8} , so you consider stepping to {8,9} after seeing a 1 and renormalize to T=16, that gives p0=8/17 = 0.47059 ; that corresponds to a geometric update with scaling factor G = 17/16. If you keep seeing 1's, then p0 keeps going down like that. But if you saw a 0, then p0 -> p0 + (1 - p0) * (1 - 1/G) , so 0.47059 -> 0.50173 , which is not back to where you were.

This should be intuitive because with geometric recency, if you see a 1 bit then a 0 bit, the 0 you just saw counts a bit more than the 1 before, so you don't get back to the midpoint. With geometry recency the p0 estimated for seeing bits 01 is not the same as after seeing 10 - the order matters. This is also good intuition why simple counting estimators like KT are not very useful in data compression - the probability of 0 after seeing "11110000" is most likely the not the same after seeing "00001111" . Now you might argue that we're asking our order-0 estimator to do non-order-0 things, we aren't giving it a memoryless bit, we should have used some higher order statistics or a transform or something first, but in practice that's not helpful.

The short answer is you just need lots of states on the steady-state line, and you have to numerically optimize what the probability in each state is by simulating what the desired probability is when you arive there in various ways and averaging them; a kind of k-means quantization type of thing.

Another issue is how you do the state transition graph on the steady-state line. When you are out at the ends, say very low p0 so a 1 bit is highly predicted - if you see another 1 bit, then p0 does not change very much, but if you see a 0 bit (unexpected), then p0 should change a lot. This is actually information theory in a microcosm - when you see the expected events, they don't jar your model very much, because they are what you expected, they contain little new information, when you see events that had very low probability, that's huge information and jars p0 a lot.

(there's some ancient code for a coder like this and a table in crblib ; "rungae.c" / "ladder.c")

You could store the # of steps to take up or down after seeing a 0 or 1 bit. One of them could be implicit. For example when you see a more probable symbol, always take 1 step, when you see a less probable symbol, take many steps (maybe 3). Another clever way to do it is used in the Q-coder (and QM and MQ). They have a steady state line of states, but only change state when the arithmetic coder outputs a bit. This means you have to see 1/log2(P) events before you change states, which is exactly what you want - when P is very high, log2(P) is tiny and you won't step until you see several. This method cannot be used in modern arithmetic coders that output byte by byte, it requires bitwise renormalization. It's neat because it lets you use a very tiny table (53 states) and you can put the density where you need it (mostly around p0=0.5) but still have states way out at the extreme probabilities to code them efficiently.

The next step in the evolution is secondary statistics.

If you have this {n0,n1} state transition table in the last section, that's a state index. The straightforward way to do it is that each state has a p0 precomputed that corresponds to n0,n1 and you use that for coding.

With secondary statistics, instead of using the p0 that you *expected* to observe for given past counts, you use the p0 that you actually *observed* in that same state in the past.


Say you're in a given state S after seeing bits 0100
(n0 =3, n1 =1 , but order matters too)

You could compute the p0 that should be seen after that sequence with some standard estimator
(geometric or KT or whatever)

Or, screw them.  Instead use S as a lookup to a secondary model.

SecondaryStatistics[S]

contains the n0 and n1 actually coded from the state S
previous times that you were in state S

This was the SEE idea from PPMZ (then Shkarin's PPMD (different from Teahan's PPMD) and Mahoney's PAQ (sometimes called APM there)). In the real world there are weird nonlinearities in the actual probabilities of states that can't be expressed well with simple estimators. Furthermore, those change from file to file, so you can't just tabulate them, you need to observe them.

A common hacky thing to do is to use a different estimator if n0=0 or n1=0 ; eg. if one of the possible symbols has never been seen at all, special case it and don't use something like a standard KT estimator that gives it a bias to non-zero probability. This is done because in practice it's been observed that deterministic contexts have very different statistics. Really this is just a special case version of something more general like secondary statistics.

The other big step you could take is mixing. But that's rather going beyond simple order-0 estimators so I think it's time to stop.

cbloom rants

1/24/2017

Order 0 estimators for data compression

No comments:

old rants