# cbloom rants

## 9/02/2012

### 09-02-12 - Encoding Values in Bytes Part 3

Okay, now that we have some background we can bring together the two ends of the story : encoding variable length numbers, and encodings in bytes using variable ranges.

What we want to do should be obvious. We want to use some amount of range, like [0 , (T-1)] to send a value <= (T-1) immediately, and use the rest of the range [T, base-1] to send some part of a value that's >= T. (eg. base = 256 for bytewise encoding)

With our background this should be pretty obvious so I'll jump straight to the code. The trick is in the case that it doesn't all fit in the current byte, we'll use modulo to output part of the value in the space we're given.

First a recursive implementation which is more obviously symmetric between the encoder and decoder :

(In this implementation I put the "can send immediately" in the top part of the word, and I put the "here's part of the value but there's more to follow" in the bottom part of the word, which is the opposite of the description above).

```
typedef uint8 U8;
typedef uint8 * U8P;
typedef const uint8 * U8PC;

U8 * EncodeMod(U8 * to, int val, int mod)
{
ASSERT( mod >= 1 && mod <= 255 );
const int upper = 256 - mod;
if ( val < upper )
{
*to++ = (U8)(mod + val);
}
else
{
// val >= upper
val -= upper;
int top    = val / mod;
int bottom = val % mod;
*to++ = (U8)bottom;
return EncodeMod(to,top,mod);
}
}

int DecodeMod(U8PC & from, int mod)
{
const int upper = 256 - mod;
int byte = *from++;
if ( byte >= mod )
{
return (byte - mod);
}
else
{
int val = DecodeMod(from,mod);
val *= mod;
val += byte + upper;
return val;
}
}

```
but of course you don't actually want to use recursive functions; the less obviously symmetric non-recursive implementation is :
```
U8 * EncodeMod(U8 * to, int val, int mod)
{
ASSERT( mod >= 1 && mod <= 255 );
const int upper = 256 - mod;
for(;;)
{

if ( val < upper )
{
*to++ = (U8)(mod + val);
}
else
{
// val >= upper
val -= upper;
int lower = val % mod;
*to++ = (U8)lower;
val /= mod;
}
}
}

int DecodeMod(U8PC & from, int mod)
{
const int upper = 256 - mod;
int mul = 1;
int val = 0;
for(;;)
{
int byte = *from++;
if ( byte >= mod )
{
val += (byte - mod)*mul;
break;
}
else
{
val += (byte+upper)*mul;
mul *= mod;
}
}
return val;
}

```
and furthermore in practice you probably don't want to eat the cost of the divide and modulo in the encoder (it's only a mul in the decoder, but still). So you can require that mod is a power of 2. The implementation is obvious :
```
U8 * EncodeModPow2(U8 * to, int val, int bits)
{
ASSERT( bits >= 0 && bits < 8 );
const int mod = (1<`<`bits);
const int upper = 256 - mod;
for(;;)
{

if ( val < upper )
{
*to++ = (U8)(mod + val);
}
else
{
// val >= upper
val -= upper;
int lower = val & (mod-1);
*to++ = (U8)lower;
val >>= bits;
}
}
}

int DecodeModPow2(U8PC & from, int bits)
{
const int mod = (1<`<`bits);
const int upper = 256 - mod;
int shift = 0;
int val = 0;
for(;;)
{
int byte = *from++;
if ( byte >= mod )
{
val += (byte - mod)<`<`shift;
return val;
}
else
{
val += (byte+upper)<`<`shift;
shift += bits;
}
}
}

```

Now in all this code we're outputing a byte at a time, but of course you could start with 2 bytes, or go 1-1-2 or whatever.

EncodeMod is a superset of some of the previous encodings we've looked at.

For example, mod = 1 is a 1-1-1-1 flag value encoding. That is, we reserve only one value in the range to indicate more bytes follow.

mod = 128 is a unary flag-bit encoding (the 1 bit flag + 7 bit payload scheme).

(mod over 128 is valid but a weird thing to do (*!*). mod = 0 is send a byte. mod = 256 is send one byte from a multi-byte word; eg. to send a uint16 in 16 bits is encodemod 256 then encodemod 0 ).

In between EncodeMod can generate intermediate encodings. Here is a sampling of the values at which EncodeMod steps up to using the next # of bytes :

```
mod   1 : 255,510,765,1020,1275,1530,1785,2040,2295,
mod   2 : 254,762,1778,3810,7874,16002,32258,64770,129794,
mod   3 : 253,1012,3289,10120,30613,92092,276529,
mod   5 : 251,1506,7781,39156,196031,
mod   8 : 248,2232,18104,145080,
mod  13 : 243,3402,44469,578340,
mod  21 : 235,5170,108805,
mod  34 : 222,7770,264402,
mod  55 : 201,11256,619281,
mod  89 : 167,15030,1337837,
mod 144 : 112,16240,2338672,
mod 233 : 23,5382,1254029,

bits  0 : 255,510,765,1020,1275,1530,1785,2040,2295,
bits  1 : 254,762,1778,3810,7874,16002,32258,64770,129794,
bits  2 : 252,1260,5292,21420,85932,343980,
bits  3 : 248,2232,18104,145080,
bits  4 : 240,4080,65520,1048560,
bits  5 : 224,7392,236768,
bits  6 : 192,12480,798912,
bits  7 : 128,16512,2113664,

```
eg. mod = 1 (the old 1-1-1-1 encoding) sends values < 255 in 1 byte, < 510 in 2 bytes, etc. So for example at mod 13 we van only get up to 242 in 1 byte, but we can get a lot more in 2 bytes (< 3402).

EncodeMod is flexible but it cannot generate all possible variable length integer encodings in bytes. As a counterexample, if we try to reproduce the 2-flag-bits thresholds (0,0x40,0x4040,0x404040) the closest I can get with EncodeMod is to do mod 192, then 170, then 127, which gives thresholds of (0,0x40,0x40C0,0x408040). One of the differences here is that EncodeMod has no upper limit, it can encode infinite values (just like flag-value and unary-flag-bits encodings), but the fixed-number-of-flag bit (or finite length prefix code) type encodings do have an upper limit, so are not reserving any space for "there are still bigger values".

Finally, some charts showing the number of bytes required to send a value with various mods. The charts all show the same data, just at different zooms of the X axis. You should be able to see that low mod is best for small values, high mod is best for large values, and they cross somewhere in between. (the rightmost point is not the next byte step, it's a generated clip with the chart boundary; all the other points indicate a step up in the number of bytes output).

ADDENDUM : *!* : actually encodemod with a mod over the midpoint is totally useful and reasonable.

Fabian 'ryg' Giesen said...

"furthermore in practice you probably don't want to eat the cost of the divide and modulo in the encoder"
For fixed divisors (i.e. "mod" isn't a function parameter, or when inlining) and with speed optimization enabled, all compilers you're likely to be using will synthesize div/mod using integer multiplies. It's only with variable divisors that they actually generate divide instructions these days.

Turtle said...

Assuming the usage context is compression using byte based coding then Google Snappy has an interesting approach for literal lengths.

They have 6 bits in the byte for the length. Values up to 60 are stored as length-1. For values above that the code stores the number of bytes that follow - 1 to 4 (for codes 60-63). Those bytes form a little endian length.

Though oddly that little endian number stores length-1 - I'd have thought length-60 would have been better so the single extra byte would encode lengths of 1-256 + 60 rather than just 1-256. Doesn't make much difference after the first extra byte.

For literal runs the fixed break @ 60 makes sense - longer runs of literals are uncompressed data anyway.

Only makes sense for literals though. For match lengths or offsets (assuming an LZ77 type encoder), improving the range while still keeping byte based coding is interesting.

cbloom said...

@Turtle - 6 bits for the initial literal runlen sounds like bad code design. You really only want 3-4 for the initial value.

Maybe I'll add two notes to the original post.

won3d said...

There's LZ4, which seems to be generally better than Snappy.

http://fastcompression.blogspot.com/2011/05/lz4-explained.html

One IMO important kind of byte-oriented variable-length encoding is UTF-8, if only because it is ubiqutious. There are many inefficiencies, but they are generally reasonable.

I actually use it as a variable-length encoding for my JS compressed mesh format, since it is "free" in a browser. Being byte-oriented like this is also nice if the data might be further compressed by a text compressor like GZIP (which is also "free" in a browser).

Great blog series.

cbloom said...

Reminding myself : UTF-8 is basically the unary-flag-bit encoding (mod 128) but with some extra redundancy added to make payload bytes distinguishable from first bytes.

won3d said...

So I think Protocol Buffer varints or MIDI encoding is more similar to MOD 128, in that both use the sign bit to indicate whether more bytes are coming. Proto is in LSB order and MIDI is in MSB order (In general, you don't really have endian issues, since you only ever read one byte at a type, which is another nice property of byte encodings as a serialization format).

Anyway, UTF-8 is different because the length is unary encoded in the first byte (we call this a prefix varint encoding). You probably know this. As a result, UTF-8 gets only 5 more bits per byte because you lose a bit from the first byte and you only get 6 more bits per new byte. For text processing, it is nice to be able to rewind to the previous character.

You can take the prefix varint coding one more step by grouping them together into a shared tag byte, or a group varint. For a while, we were using this:

We don't use it that much anymore, but if it were important, decoding might be faster using a tag lookup table and pshufb to decode 4 words in parallel. Haven't tried this.

cbloom said...

Ah, I hadn't though about the fact that web people are interested in these codes for putting numbers in web protocols and such.

Yeah, for these purposes I've been discounting the arrangement of the bits and considering any encoding scheme which produces the same output lengths to be the same.

Unary prefix codes are especially fast to decode because you can use countlz. Perhaps an addendum on this is warranted.

won3d said...

Well, I wouldn't really say for "web" coding, but certainly for "network." This distinction is getting somewhat blurred by things binary AJAX requests, WebRTC, etc. Of course, you're still having to do the decoding in JS, which makes bit-bashing somewhat painful enough that I would still avoid it.

Interestingly, the varint codings that we tend to use don't actually do the obvious exclusion you're talking about. UTF-8 is like this; the following byte sequence (I put a gap between tag and payload):

110 00000
10 000000

is actually an illegal encoding of 0, not a legal encoding of 128.