cbloom rants: 09-04-12 - Encoding Values in Bytes Part 4

Just some random addenda. The main post series is here :

cbloom rants 09-02-12 - Encoding Values in Bytes Part 1
cbloom rants 09-02-12 - Encoding Values in Bytes Part 2
cbloom rants 09-02-12 - Encoding Values in Bytes Part 3

1. I mentioned prefix byte length codes in the first post. This is a code in which you first send the # of bytes using a bit prefix code, then the remainder. A small note on that :

Obviously you have unary :

1 + 7 bits
10 + 14 bits
110 + 21 bits
1110 + 28 bits

If the max you send is 4 bytes then we're actually wasting a bit on the last code; it should just be 111 + 29 bits :

1 + 7 bits
10 + 14 bits
110 + 21 bits
111 + 29 bits

and obviously you have the fixed two bit code :

00 + 6 bits
01 + 14 bits
10 + 22 bits
11 + 30 bits

what I didn't mention before is that for a max of 4 bytes these are the *only* prefix codes. All other prefix codes are just the xor of these or are strictly worse.

2. The unary prefix code is always nice, because it can be decoded branchlessly with countlz (BSF/BSR). (see for example Elias Gamma or Exp-Golomb coding).

The decoder for mod-128 (with the flag bits at the head, and little endian) is :


const uint32 mod128_mask[] = { (1UL<<7)-1, (1UL<<14)-1, (1UL<<21)-1, (1UL<<29)-1 };
const uint32 mod128_base[] = { 0, (1UL<<7)-1, ((1UL<<14)-1) + ((1UL<<7)-1), 
                                ((1UL<<21)-1) + ((1UL<<14)-1) + ((1UL<<7)-1) };

int DecodeMod128LEBranchless(U8PC & from)
{
    uint32 dw = *( (uint32 *) from );
    unsigned long index;
    _BitScanForward(&index,dw);
    index = MIN(index,3);
    int count = index+1;
    int shift = MIN(count,3);
    from += count;
    dw >>= shift;
    dw &= mod128_mask[index];
    dw += mod128_base[index];
    return (int) dw;
}

and the encoder is left as an exercise for the reader. (as is an endian-independent version)

The code above provides the maximum code space; it adds the base of each range and it doesn't waste a bit for the top range. Obviously it can be faster (no MIN's) if you do waste the bit.

3. I was playing with LZ4 offsets and it's an obvious place to use EncodeMod.

Say you have an existing LZ codec that writes offsets in 2 bytes. You want to leave it alone except for the offset IO.

You can add > 65536 offset support by changing the offset to EncodeMod.

One obvious scheme is : 0 + 15 bit offsets, 1 + 23 bit offsets. This is EncodeMod( 15 bits ) on the first word followed by a raw byte (EncodeMod(0)).

But that is not ideal; you want more of the offsets in the 32768 to 65536 range to get into the first 2 bytes. Well, just lower your mod. EncodeMod( 14 bits ) gives you up to (32768+16384) in the first 2 bytes, and 22 bits for longer offsets. EncodeMod( 13 bits ) gives you up to 57344 (65536-8192) in the first 2 bytes, and 21 bits for long offsets.

The next thing you could do is be able to send a few low offsets in just 1 byte. The most important one is actually just the RLE offset (offset=1). We do not want to use something like 1 bit + 7 bit payload, because that puts too many offsets in 1 byte and it takes away too much of our 2 byte range (we still want as many offsets as possible to be sent in 2 bytes). We can get that in using EncodeMod. Instead of outputing 2 bytes at first, we do 1 byte with a high mod - something like mod 254. This saves just a few values to fit in 1 byte and most of that byte range is left over for the second byte.

So to send an LZ offset you might do : mod 255 , mod 64, mod 0 ; then the thresholds are

1,48961,4226881

for 1-3 byte offset encoding.

(I don't suggest you actually do this for LZ offsets; it's intended as an example of the kind of thinking that goes into good code design).

4. File sizes are a common case for this, so I did a brute force optimize.

I gathered all the file sizes on my computer. (you may question whether my computer is a representative sample).

(BTW reminder to self : beware circular symbolic links in dir traversals these days; lots of old software that does dir traversals can infinite loop these days)

Anyhoo I tried mod byte-by-byte encoding and optimized the parameters for a 3-step series. The optimal was :

B-B-B-mod :
best : 251, 27, 15 : 1228288
cost per :  2.129383

you can see it has reserved a little space for tiny file sizes to go in 1 byte, then most go in 2 bytes, and not much space is reserved for the very large rare files.

More efficient is an encodemod which goes word-then-bytes , and uses only pow2 mods. The optimal for that is :

W-B-pow2
best : 13 , 4 : 1232113
cost per :  2.136096

only barely higher average cost. (13,4 are bit counts; eg. mods of 1<<13 and 1<<4)

Compare to a simple scheme I've used in the past :

write 16 bits if size < (1<<15) , else 32 bits, else 64 bits :

16-32-64 : cost per : 2.33855

this is not a great scheme and yet it's only slightly worse on average; clearly not a lot of win to be had on this front.

cbloom rants

9/04/2012

09-04-12 - Encoding Values in Bytes Part 4

No comments:

old rants