cbloom rants: 11/2015

A while ago I wrote a series on Encoding Values in Bytes in which I talk about the "EncodeMod" varint encoding.

EncodeMod is just the idea that you send each token (byte, word, nibble, whatever) with two ranges; in one range the values are terminal (no more tokens), while in the other range it means "this is part of the value" but more tokens follow. You can then optimize the division point for a wide range of applications.

In my original pseudo-code I was writing the ranges with the "more tokens" follow at the bottom, and terminal values at the top. That is :


Specifically for the case of byte tokens and pow2 mod

mod = 1<<bits

in each token we send "bits" of values that don't currently fit

upper = 256 - mod

"upper" is the number of terminal values we can send in the current token

I was writing

[0,mod) = bits of value + more tokens follow
[mod,256) = terminal value

Fabian spotted that the code is slightly simpler if you switch the ranges. Use the low range [0,upper) for terminal values and [upper,256) for non-terminal values. The ranges are the same, so you get the same encoded lengths.

(BTW it also occurred to me when learning about ANS that EncodeMod is reminiscent of simple ANS. You're trying to send a bit - "do more bytes follow". You're putting that bit in a token, and you have some extra information you can send with that bit - so just put some of your value in there. The number of slots for bit=0 and 1 should correspond to the probability of each event.)

The switched encodemod is :


U8 *encmod(U8 *to, int val, int bits)
{
    const int upper = 256 - (1<<bits); // binary, this is 1110000 or similar (8-bits ones, bits zeros)
    while (val >= upper)
    {
        *to++ = (U8) (upper | val);
        val = (val - upper) >> bits;
    }

    *to++ = (U8) val;
    return to;
}


const U8 *decmod(int *outval, const U8 *from, int bits)
{
    const int upper = 256 - (1<<bits);
    int shift = 0;
    int val = 0;

    for (;;)
    {
        int byte = *from++;
        val += byte << shift;
        if (byte < upper)
            break;
        shift += bits;
    }

    *outval = val;
    return from;
}

The simplification of the encoder here :


    *to++ = (U8) (upper | val);
    val = (val - upper) >> bits;

written in long-hand is :


    low = val & ((1<<bits)-1);
    *to++ = upper + low;  // (same as upper | low, same as upper | val)
    val -= upper;
    val >>= bits;

or

    val -= upper;
    low = val & ((1<<bits)-1);
    *to++ = upper + low;  // (same as upper | low, same as upper | val)
    val >>= bits;

and the val -= upper can be done early or late because val >= upper it doesn't touch "low"

Basically by using "upper" like this, the mask of low bits and add of upper is done in one op.

cbloom rants

11/13/2015

Flipped encodemod

old rants