9/02/2012

09-02-12 - Encoding Values in Bytes Part 1

A very basic level of data compression is just sending values in bytes. I'm gonna do a few posts on this topic, first some review.

A common case is you want to send some value which has a large max (or perhaps infinite or unknown max) but has a much lower mean. Obviously you don't want to just output all the bits/bytes needed for the max value all the time.

(to be quite concrete : one standard situation is transmitting a file size. A file size could take up to 64 bits but most file sizes fit in 16 bits. Sending 64 bits all the time is silly, you want some kind of variable length encoding that using 2 bytes when possible, then more if necessary. Of course because of Kraft inequality you will no longer be able to send a value of (2^64 -1) in 8 bytes, because you are compressing some of the values you must expand some)

Let's run through the standard ways to do this :

Flag Value Encodings

These reserve a flag value in the range to mean "more bytes follow". The most basic form is like :

{
    U8 *ptr; int value; // given

    while( value >= 255 )
    {
        value -= 255;
        *ptr++ = 255;
    }
    *ptr++ = value;
}
That is, we reserve one value in the byte (255) to be a flag meaning "more bytes follow". Values lower than 255 can be sent right away. If more bytes follow, we try again to just send it in one byte.

There are various other options for flag-value encodings. I like to name them by the number of bytes send in each step. So the example above is "1-1-1-1-..." encoding. (it's the "unary" of byte encoding; that is, 1-1-1-1 encoding sends small values in the fewest possible bytes, but it sends large values in the most possible bytes (of any prefix code)).

But there are other options : 1-2-4-8 (try to send just 1 byte, then try to send 2 bytes (flag value 65535), then try in 4), 1-1-2-3-4-5 , etc. (and you don't have to start with 1; for file sizes you might use 2-3-5-8 (though really for file sizes a flag bit encoding is probably better)).

In fact while the simplicity of 1-1-1-1 encoding is appealing, it's almost never the best way to go, because in the rare chance that you have some degenerate data with a very large value, it can expand a lot. (eg. to send 100,000 in 1-1-1-... encoding takes 393 bytes). Even if 1-1-1- is theoretically optimal on your data, you should use 1-1-1-2-4 or something like that to limit the maximum output in bad cases.

Here's the number of bytes sent for some flag value encodings :


  value :   1-1-1-1   1-1-2-3   1-2-3-4
    200 :         1         1         1
    400 :         2         2         3
    600 :         3         4         3
   1000 :         4         4         3
   1600 :         7         4         3
   2600 :        11         4         3
   4200 :        17         4         3
   6800 :        27         4         3
  11000 :        44         4         3
  17800 :        70         4         3
  28800 :       113         4         3
  46600 :       183         4         3
  75400 :       296         7         6

Flag Bit Encodings

The next very common method is a flag bit (or bits) to indicate more bytes follow. This lets you send large values in fewer bytes at the cost of sending small values in more bytes.

The flag bits can be sent with any prefix code (eg. Huffman, unary, etc) of course. Some common simple cases :

One flag bit at a time

(aka unary flag bits, or 7-bit words). You do this encoding thusly :

{
    U8 *ptr; int value; // given

    while( value >= 128 )
    {
        value -= 128;
        *ptr++ = value & 0x7F;
        value >>= 7;
    }
    *ptr++ = value;
}
To decode it, you check if the top bit of each byte is on. The top bit is being on means "more bytes follow". Whether or not the top bit is on, you always have 7 bits of payload to add to your value. That is, the encoding is like :

0     - 127      :  0 + 7 bits
128   - 16511    :  1 + 7 bits ; 0 + 7 bits    [or 10 + 14 bits]
16512 - 0x20407F :  1 + 7, 1 + 7, 0 + 7        [or 110 + 21 bits]

Another common case is :

N flag bits

In this case you just send a fixed number of flag bits; eg. 2 bits to 1-5 bytes. The encoding is :
0        - 63      : 00 + 6 bits
0x40     - 16447   : 01 + 6+8 bits
0x4040   - 4210751 : 10 + 6+8+8 bits
0x404040 - 1.078e9 : 11 + 6+8+8+8 bits

Now of course you can combine the two, and you can use the same kind of nomenclature as before, but this time referring to the number of flag bits ; eg. 1 flag bit, then 1 more, then 2, then 3. But as noted previously this is really just a prefix code for the number of bits.

Flag bit and flag value are really just two extremes of a related continuum, which we shall see in a later post.

No comments:

old rants