Okay, now that we have some background we can bring together the two ends of the story : encoding variable length numbers,
and encodings in bytes using variable ranges.
What we want to do should be obvious. We want to use some amount of range, like [0 , (T-1)] to send a value <= (T-1) immediately,
and use the rest of the range [T, base-1] to send some part of a value that's >= T. (eg. base = 256 for bytewise encoding)
With our background this should be pretty obvious so I'll jump straight to the code. The trick is in the case that it doesn't all fit in the current byte,
we'll use modulo to output part of the value in the space we're given.
First a recursive implementation which is more obviously symmetric between the encoder and decoder :
(In this implementation I put the "can send immediately" in the top part of the word, and I put the "here's part
of the value but there's more to follow" in the bottom part of the word, which is the opposite of the description above).
typedef uint8 U8;
typedef uint8 * U8P;
typedef const uint8 * U8PC;
U8 * EncodeMod(U8 * to, int val, int mod)
{
ASSERT( mod >= 1 && mod <= 255 );
const int upper = 256 - mod;
if ( val < upper )
{
*to++ = (U8)(mod + val);
return to;
}
else
{
// val >= upper
val -= upper;
int top = val / mod;
int bottom = val % mod;
*to++ = (U8)bottom;
return EncodeMod(to,top,mod);
}
}
int DecodeMod(U8PC & from, int mod)
{
const int upper = 256 - mod;
int byte = *from++;
if ( byte >= mod )
{
return (byte - mod);
}
else
{
int val = DecodeMod(from,mod);
val *= mod;
val += byte + upper;
return val;
}
}
but of course you don't actually want to use recursive functions; the less obviously symmetric non-recursive
implementation is :
U8 * EncodeMod(U8 * to, int val, int mod)
{
ASSERT( mod >= 1 && mod <= 255 );
const int upper = 256 - mod;
for(;;)
{
if ( val < upper )
{
*to++ = (U8)(mod + val);
return to;
}
else
{
// val >= upper
val -= upper;
int lower = val % mod;
*to++ = (U8)lower;
val /= mod;
}
}
}
int DecodeMod(U8PC & from, int mod)
{
const int upper = 256 - mod;
int mul = 1;
int val = 0;
for(;;)
{
int byte = *from++;
if ( byte >= mod )
{
val += (byte - mod)*mul;
break;
}
else
{
val += (byte+upper)*mul;
mul *= mod;
}
}
return val;
}
and furthermore in practice you probably don't want to eat the cost of the divide and modulo in the encoder
(it's only a mul in the decoder, but still). So you can require that mod is a power of 2. The implementation
is obvious :
U8 * EncodeModPow2(U8 * to, int val, int bits)
{
ASSERT( bits >= 0 && bits < 8 );
const int mod = (1<<bits);
const int upper = 256 - mod;
for(;;)
{
if ( val < upper )
{
*to++ = (U8)(mod + val);
return to;
}
else
{
// val >= upper
val -= upper;
int lower = val & (mod-1);
*to++ = (U8)lower;
val >>= bits;
}
}
}
int DecodeModPow2(U8PC & from, int bits)
{
const int mod = (1<<bits);
const int upper = 256 - mod;
int shift = 0;
int val = 0;
for(;;)
{
int byte = *from++;
if ( byte >= mod )
{
val += (byte - mod)<<shift;
return val;
}
else
{
val += (byte+upper)<<shift;
shift += bits;
}
}
}
Now in all this code we're outputing a byte at a time, but of course you could start with 2 bytes,
or go 1-1-2 or whatever.
EncodeMod is a superset of some of the previous encodings we've looked at.
For example, mod = 1 is a 1-1-1-1 flag value encoding. That is, we reserve only one value in the range to indicate
more bytes follow.
mod = 128 is a unary flag-bit encoding (the 1 bit flag + 7 bit payload scheme).
(mod over 128 is valid but a weird thing to do (*!*). mod = 0 is send a byte. mod = 256 is send one byte from a multi-byte word;
eg. to send a uint16 in 16 bits is encodemod 256 then encodemod 0 ).
In between EncodeMod can generate intermediate encodings. Here is a sampling of the values at which
EncodeMod steps up to using the next # of bytes :
mod 1 : 255,510,765,1020,1275,1530,1785,2040,2295,
mod 2 : 254,762,1778,3810,7874,16002,32258,64770,129794,
mod 3 : 253,1012,3289,10120,30613,92092,276529,
mod 5 : 251,1506,7781,39156,196031,
mod 8 : 248,2232,18104,145080,
mod 13 : 243,3402,44469,578340,
mod 21 : 235,5170,108805,
mod 34 : 222,7770,264402,
mod 55 : 201,11256,619281,
mod 89 : 167,15030,1337837,
mod 144 : 112,16240,2338672,
mod 233 : 23,5382,1254029,
bits 0 : 255,510,765,1020,1275,1530,1785,2040,2295,
bits 1 : 254,762,1778,3810,7874,16002,32258,64770,129794,
bits 2 : 252,1260,5292,21420,85932,343980,
bits 3 : 248,2232,18104,145080,
bits 4 : 240,4080,65520,1048560,
bits 5 : 224,7392,236768,
bits 6 : 192,12480,798912,
bits 7 : 128,16512,2113664,
eg. mod = 1 (the old 1-1-1-1 encoding) sends values < 255 in 1 byte, < 510 in 2 bytes, etc. So for example
at mod 13 we van only get up to 242 in 1 byte, but we can get a lot more in 2 bytes (< 3402).
EncodeMod is flexible but it cannot generate all possible variable length integer encodings in bytes. As a counterexample,
if we try to reproduce the 2-flag-bits thresholds (0,0x40,0x4040,0x404040) the closest I can get with EncodeMod is to do
mod 192, then 170, then 127, which gives thresholds of (0,0x40,0x40C0,0x408040). One of the differences here is that EncodeMod
has no upper limit, it can encode infinite values (just like flag-value and unary-flag-bits encodings), but the fixed-number-of-flag bit
(or finite length prefix code) type encodings do have an upper limit, so are not reserving any space for "there are still bigger values".
Finally, some charts showing the number of bytes required to send a value with various mods.
The charts all show the same data, just at different zooms of the X axis. You should be able to see that
low mod is best for small values, high mod is best for large values, and they cross somewhere in between.
(the rightmost point is not the next byte step, it's a generated clip with the chart boundary; all the other
points indicate a step up in the number of bytes output).
ADDENDUM : *!* : actually encodemod with a mod over the midpoint is totally useful and reasonable.