On the entropy of distance coding

Let's look at a little math problem that I think is both amusing and illuminating, and is related to why LZ77 distance coding is fundamentally inefficient for sending a sequence of repeated tokens.

The toy problem we will study is this :

Given an array of random bytes, you for some silly reason decide to compress them by sending the distance to the most recent preceding occurance of that byte.

So to send 'A' you count backward from your current position P ; if there's an A at P-1, you send a 0, if it's at P-2 send a 1, etc.

What is the entropy of this distance encoding?

This "compressor" is silly, it expands (it will send more than 8 bits per byte). How much does it expand?

First of all, why are we looking at this. It's a boiled down version of LZ77 on a specific type of data. LZ77 sends repeated strings by sending the distance to the previous occurance of that string. Imagine your data consists of a finite set of tokens. Each token is several bytes and imagine the tokens are drawn from a random source, and there are 256 of them. If you had the indices of the tokens, you would just have a random byte sequence. But LZ77 does not have that and cannot convert the stream back to the token indexes. Instead the best it can do is to parse on token boundaries, and send the distance to the previous occurance of that token.

This does occur in the real world. For example consider something like a palettized image. The palette indices are bytes that refer to a dictionary of 256 colors. If you are given the image to compress after de-palettizing, you would see something like 32-bit RGBA tokens that repeat. Once you have seen the 256 unique colors, you no longer need to send anything but distances to previous occurances of a color. English text is also a bit like this, with the tokens equal to a word+punction string of variable length.

So back to our toy problem. To get the entropy of the distance coding, we need the probability of the distances.

To find that, I think about coding the distance with a sequence of binary coding events. Start at distance = 0. Either the current byte matches ours, or it doesn't. Chance of matching is (1/256) for random bytes. The chance of no-match is (255/256). So we multiply our current probability by (1/256) and the probability of all future distances by (255/256), then move to the next distance. (perhaps it's easier to imagine that the preceding bytes don't exist yet; rather you are drawing a random number as you step back in distance; any time you get your own value you stop)

This gives you the probability distribution :

P(0) = (1/256)
P(1) = (255/256) * (1/256)
P(2) = (255/256)^2 * (1/256)
P(n) = (255/256)^n * (1/256)
an alternative way to get the same distribution is to look at distance D. First multiply by the probability that it is not a lower distance (one minus the sum of all lower distance probabilities). Then the probability that it is here is (1/256). That's :
P(0) = (1/256)
P(1) = (1 - P(0)) * (1/256)
P(2) = (1 - P(0) - P(1)) * (1/256)
P(n) = (1 - P(0) - P(1) .. - P(n-1)) * (1/256)

which is equal to the first way.

Given this distribution we can compute the entropy :

H = - Sum_n P(n) * log2( P(n) )

starting at n = 0

let x = (255/256)
P(n) = (1-x) * x^n

log2( P(n) ) = log2( 1-x ) + n * log2( x )

H = - Sum_n (1-x) * x^n * ( log2( 1-x ) + n * log2( x ) )

terms that don't depend on n pull out of the sum :

H = - (1-x) * log2( 1-x ) * Sum_n x^n 
    - (1-x) * log2( x ) * Sum_n n * x^n

we need two sums :

G = Sum_n x^n 
S = Sum_n n * x^n

G is just the geometric series
G = 1 / (1 - x)

recall the trick to find G is to look at G - x*G
we can use the same trick to find S
S - x*S = G - 1
the other standard trick to find S is to take the d/dx of G
either way you get:

S = x / (1-x)^2

H = - (1-x) * log2( 1-x ) * G
    - (1-x) * log2( x ) * S

H = - (1-x) * log2( 1-x ) * ( 1 / (1-x) )
    - (1-x) * log2( x ) * ( x / (1-x)^2 )

H = - log2( 1-x ) - log2( x ) * ( x / (1-x) )

putting back in x = (255/256)
1-x = 1/256

the first term "- log2( 1-x )" is just 8 bits, send a byte

H = 8 + 255 * log2( 256 / 255 )

H = 8 + 1.43987 bits

about 9.44 bits

or if we look at general alphabet size N now instead of source bytes

H = log2(N) + (N-1) * log2( N / (N-1) )

recalling of course log2(N) is just the number of bits it would take to code a random symbol of N possible values
we'll look at the expansion due to distance coding, H - log2(N)

if we now take the limit of N large

H - log2(N) -> (N-1) * log2( N / (N-1) )
H - log2(N) -> (N-1) * log2( 1 + 1 / (N-1) )
H - log2(N) -> log2( ( 1 + 1 / (N-1) ) ^ (N-1) )
H - log2(N) -> log2( ( 1 + 1/N ) ^ N )

( 1 + 1/N ) ^ N -> e  !!

H - log2(N) -> log2( e ) = 1.44269 bits

for large alphabet, the excess bits sent due to coding distances instead of indices is log2(e) !!

I thought it was pretty entertaining for Euler to show up in this problem.

Why is distance coding fundamentally inefficient (vs just coding the indices of these repeated tokens) ? It's due to repetitions of values.

If our preceding bytes were "ABCB" and we now wish to code an "A" , it's at distance = 3 because we had to count the two B's. Our average distance is getting bumped up because symbols may occur multiple times before we find our match. If we did an LZ77 coder that made all substrings of the match length unique, we would not have this inefficiency. (for example LZFG which sends suffix trie node indices rather than distances does this)

We can see where this inefficiency appears in the probabilities:

if you are drawing a random number from 256 at each step
keep stepping until you get a match
each time you have to draw from all 256 possibilities (repeats allowed)

P(0) = (1/256)
P(1) = (255/256) * (1/256)
P(2) = (255/256)^2 * (1/256)
P(n) = (255/256)^n * (1/256)

(as above)

instead imagine drawing balls from a hat
once you draw a ball it is discarded so it cannot repeat
stop when you match your byte
first draw has the same 1/256 chance of match :

P(0) = (1/256)

as before multiply by 1-P to get the probability of continuing
but now we only have 255 balls left, so the chance of a match is (1/255)

P(1) = (255/256) * (1/255) = (1/256)

current P was (1/255) so multiply the next by (254/255)
now we have 254 , so we match (1/254) of the time :

P(2) = (255/256) * (254/255) * (1/254) = (1/256)
P(n) = (1/256)

it's just a flat probability.

decrementing the alphabet size by one each time makes a big difference.

This theoretical coding loss is also observed exactly in the real world.

experiment :

make 100,000 random bytes

make a palette of 256 32-bit random dwords

use the 100,000 random bytes to index the palette to output 400,000 bytes

what can LZ77 on these 400,000 bytes?

Our theoretical analysis says the best possible is 9.44 bits per palette index

plus we have to send 1024 bytes of the palette data as well

best = 100,000 * 9.44/8 + 1024 = 119,024

real world :

Leviathan Optimal5
random_bytes_depal.bin :    400,000 ->   119,034 =  2.381 bpb =  3.360 to 1

it turns out this theoretical limit is almost exactly achieved by Oodle Leviathan, only 10 bytes over due to headers and so on.

Leviathan is able to hit this limit (unlike simpler LZ77's) because it will entropy code all the match signals to nearly zero bits (all the matches will be "match len 4" commands, so they will have a probability near 1 and thus entropy code to zero bits). Also the offsets are in multiples of 4, but Leviathan will see the bottom bits are always zero and not send them. The result is that Leviathan sends the matches in this kind of data just as if it was sending a distance in tokens (as opposed to bytes) back to find the repeated token, which is exactly what our toy problem looked at.

We cannot do better than this with LZ77-style distance coding of matches. If we want to beat this we must induce the dictionary and send id references to whole words. Leviathan and similar LZ's cannot get as close to the optimum on text, because we don't handle tokens of varying length as well. In this kind of depalettized data, the match length is totally redundant and should be coded in zero bits. With text there is content-length correlation which we don't model at all.

Also note that we assume random tokens here. The loss due to sending distances instead of token id's gets even worse if the tokens are not equiprobable, as the distances are a poor way to capture the different token probabilities.

Summary :

The coding loss due to sending repeated tokens by distance rather than by index is at least log2(e) bits for large alphabet. This theoretical limit acts as a real world limit on the compression that LZ77 class algorithms can achieve on depalettized data.


Charles said...

Nice article; relatively easy to follow :). Let me see how much I got right here.

I think your argument boils down too:

Algorithms which say things like 'repeat that sequence we saw 2 tokens back' are less efficient than 'repeat that sequence at token 5'.

And you present some evidence centered on a dataset with 256 tokens of 4 bytes each.

An obvious argument; if we increase the size of the file in an unbounded way (so, 256 tokens of 4 bytes each, but 50GB of data), the distance between the sequence will, on average, remain the same, while the absolute index value most recently seen would increase an unbounded (but much slower) way.

Unless, we cheat and assume we can reference the first value seen (the index of which will probably be quite small) for all instances of a value. You do seem to be making that assumption here, and I don't really see a strong reason why you shouldn't do that, especially with the additional constraint 'your data consists of a finite set of tokens'. You call it out a bit in the hat example.

As an interesting extension to this; your assumptions here would not hold up as well if data was not random and tended towards 'chunking'. For example, if we had 4 billion bytes of 'black', then 4 billion of 'white', sending the index of a white pixel would use 4 bytes or so, where as sending the 'last reference' would be (on average) 1 byte.

So if there is a time component to the data (i.e., it changes from predominately one color to another over time), sending distance might work better. I wonder if that is more common in practice? How strong does this tendency have to be to make distance more attractive?

PS; the lack of an open-source decompression library for Oodle makes me sad. I really want to extract images from one of my favorite games, but they use Oodle and cause me much sadness.

cbloom said...

No, you're not sending the index as in a position. The alternative to distance coding is indexing by *content*.

Instead of coding "repeat the thing at 200 bytes back" you say "repeat the 10th previous thing with unique contents of length 4"

It's really just about removing repeats from the indexing.

Another way to see it is, say you want to send length 4 tokens. As your file goes to 50 GB, the number of distances you can send goes up to 50 GB, a large number. But if you only count the number of unique 4-byte strings that occur, that number goes up much more slowly, because some repeat.

How much more slowly depends on the entropy. On high entropy (random bytes) they go up at the same rate, on maximal low entropy (all bytes zero) the number of unique strings doesn't go up at all.

old rants