cbloom rants: 09-02-12 - Encoding Values in Bytes Part 2

A short post for a small concept : encoding using division of words into ranges.

If you're trying to put two different things in a byte (eg. in Part 1 we were putting either "the value fits in this byte and here it is" or "the value does not fit in this byte and here's a portion of it"), you can obviously use a flag bit.

eg. say you're trying to send either some number of apples (A's) or some number of bananas (B's) in a byte. You could send :


0 bit + 7 bits more = 7 bits of A's
1 bit + 7 bits more = 7 bits of B's

and you can write the decoder in a bit-twiddling manner :

{

    int byte = *ptr++;
    int isB   = byte & 0x80;
    int count = byte & 0x7F;

}

but you can also write the same decoder in a value-checking manner :

{
    int byte = *ptr++;

    if ( byte >= 128 )
    {
        isB = true;
        count = byte - 128;
    }
    else
    {
        isB = false;
        count = byte;
    }

}

(of course you should follow with a count++ in all cases unless you want to be able to send a count of 0)

But writing it in a value-checking manner makes it obvious that we don't need to put the divider in the middle. We could just as easily do :

{
    int byte = *ptr++;

    if ( byte >= 90 )
    {
        isB = true;
        count = byte - 90;
    }
    else
    {
        isB = false;
        count = byte;
    }

}

and now we have [0-89] for A's and [90-255] for B's ; this distribution may be better if your data is not symmetric.

To first approximation, the division point should be the probability of A's vs. B's ; eg. midpoint = P(A)*256. (this is not exactly right; it depends on the distribution of magnitudes of A's and B's ; plus there are some ugly quantization boundary effects, but it is roughly right).

Of course flag-value encodings that we saw last time are a special case of this with the divider shoved all the way to one end.

{
    int byte = *ptr++;

    if ( byte >= 255 )
    {
        isB = true;
        count = byte - 255;
    }
    else
    {
        isB = false;
        count = byte;
    }

}

and a "B" is something that doesn't fit in one byte.

Now this is obviously a kind of proto-arithmetic-coder. It doesn't carry left over range from one byte to the next, but it's trying to divide the output range of the current byte based on the probability of the symbol.

In particular for one simple case, it does exactly what you expect :

Say you have a random source of 0's and 1's (eg. bits) with some probability P0 of 0 (and 1-P0 of 1). You decide to encode them using RLE, that is you'll send a run length of either 0's or 1's. You want to put the run length in a byte (or nibble, or whatever). You can do it by sending a word which is broken into two ranges; one range for a run of 0's and another range for a run of 1's. Thusly :


EncodeRun( bool isRun0 , int runLen , int divider )
{
    ASSERT( runLen > 0 );

    if ( isRun0 )
    {
        // I get [0,divider-1] in the output

        for(;;)
        {
            if ( runLen <= divider )
            {
                *output++ = runLen-1;
                return;
            }
            else
            {
                *output++ = divider-1;
                runLen -= (divider-1);
                ASSERT( runLen >= 1 );
            }
        }
    }
    else
    {
        // I get [divider,base-1] in the output
        //  (eg. base = 256 for bytes, 16 for nibbles, etc.)

        int range = base-divider;

        for(;;)
        {
            if ( runLen <= range )
            {
                *output++ = divider + runLen-1;
                return;
            }
            else
            {
                *output++ = base-1; // divider + (range-1)
                runLen -= (range-1);
                ASSERT( runLen >= 1 );
            }
        }
    }
}

Note there's no flag for "doesn't fit in one byte" here; if we have more 0's than fit in one run, we just send another run of 0's following the first. eg. runs of 0's and 1's don't strictly alternate in this encoding.

Anyway, this encoder has just been a very concrete demonstration to get us to this statement :

For a random source with probability P0 of a 0, the optimal value of "divider" is divider = P0 * base. That is, you split the range proportional to the probability of each type of value.

(it may be either the floor or ceil of that float). (and it may not be true very close to the edges, eg. at divider=1 or base-2).

This was just an exercise to prove that our intuition is correct - the divider corresponds to the probability of each type of thing.

I'm sure that this technique has been well known to practitioners of the art forever. Back when LZRW was written, the tiniest last ounce of efficiency was needed, so the bit-packing approaches were preferred. With better CPU's now we can afford to do a branch or table lookup, which lets us put the divider away from the middle. So far as I know this technique was first used in a mainstream product in LZO (*). Recently Sean helped clarify my thinking about it.

In the next part we will bring together parts 1 and 2.

* = ADDENDUM : LZO uses this scheme to send a match or literal(s) in a byte. I don't recall the exact scheme in LZO, but say for example you wanted to send either a literal run len or a match length in an LZ encoder in a 4 bit nibble. You could send 1 bit for a match/literal flag, and then 3 bits of match len or literal run len. Instead, we now see that's equal to a threshold of 8, and we can put that threshold somewhere else to minimize our output. eg. it could be 10, then you have [0,9] for matches and [10,15] for literals.

In LZO the threshold is a constant, presumably tweaked over some test set, but other ideas are obvious. The threshold could be transmitted in the header of the file and the encoder could optimize it for the file in question. The threshold could also be dynamic, then you have an adaptive bytewise entropy coder of sorts. The adaptation schemes are obvious because as we have shown the threshold acts very much like a probability, so you can use the standard schemes that are known for binary arithmetic coders (eg. threshold += (threshold>>5) type of stuff, or rung-ladder, etc). (rung-ladder is a table-driven state machine, eg. threshold = table[state].threshold ; state = table[state].after_match and then optimize the tables somehow).

If I recall correctly, LZO also does some semi-arithmetic-codery stuff, in the sense that if not all of the range is needed to flag what you are currently sending, that value range can be carried over into a later encoding. Again, this not the exact LZO scheme but you could imagine a decoder that works like this : grab a nibble; a value in [0-11] is a match (with perhaps 1 bit of offset and the rest is match lengths), a value in [12-15] flags a literal; now obviously you don't need 4 values to send a single bit flag, so we keep (nibble-12) and add it to the next match length we decode.

ADDENDUM 2 : just to emphasize what I'm saying in the last paragraph because I think it's important.

When designing heuristic / bit-packing compression codes, you should *not* take the view point of "hey a literal flag takes 1 bit so I will allocate one bit of space for it".

What you should do is find the probability of events; (eg. literal vs. match); and allocate space in the code based on probability. Then, once you have done that, figure out a way to use the values in the code productively.

cbloom rants

9/02/2012

09-02-12 - Encoding Values in Bytes Part 2

No comments:

old rants