10-16-12 - Thoughts on Bit-Packing Structs Before Compression

If you're trying to transmit some data compactly, and you are *not* using any back-end compression, then it's relatively straightforward: pack the structs with ad-hoc "bit packing" - you just want to squeeze them into as few bits as possible. But if you are going to apply a standard compressor after bit packing, it's a little less clear. In particular, a lot of people make mistakes that result in larger final data than necessary.

To be clear, there are two compression steps :

{ Raw structs } --[ad hoc]--> { Bit Packed } --[compressor]--> { Transmitted Data }

What you actually want to minimize is the size of the final transmitted data, which is not necessarily achieved with the smallest bit packed data.

The ideal scenario is when you know your back-end compressor: simply try a variety of ways of packing and measure the final size. You should always start with completely un-packed data, which is often a reasonable way to go. It's also important to keep in mind the speed hit of bit packing. Compressors (in particular, decompressors) are very fast, so even though your bit-packing may just consist of some simple math, it can easily be much slower than the back-end decompressor. Many people incorrectly spend CPU time doing pre-compression bit-packing when they would be better off spending that same CPU time just running a stronger compressor and not doing any twiddling themselves.
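A sketch of the measure-first approach, using zlib as a stand-in back end (the data here - 12-bit values - and the packing scheme are made-up assumptions purely for illustration) :

```python
import random
import struct
import zlib

random.seed(0)

# Hypothetical data : 1000 values that each fit in 12 bits.
values = [random.getrandbits(12) for _ in range(1000)]

# Un-packed form : one 16-bit little-endian word per value.
unpacked = struct.pack("<%dH" % len(values), *values)

# Tightly packed form : 12 bits per value, crossing byte boundaries.
acc, bits = 0, 0
packed = bytearray()
for v in values:
    acc |= v << bits
    bits += 12
    while bits >= 8:
        packed.append(acc & 0xFF)
        acc >>= 8
        bits -= 8
if bits:
    packed.append(acc & 0xFF)

# What actually matters is the size after the back-end compressor :
size_unpacked = len(zlib.compress(unpacked, 9))
size_packed = len(zlib.compress(bytes(packed), 9))
print(size_unpacked, size_packed)
```

The point is just that you compare the two numbers at the end, not the sizes of the intermediate buffers.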

The goal of bit-packing should really be to put the data in a form that the compressor can model efficiently. Almost all compressors assume an 8-bit alphabet, so you want your data to stay in 8-bit form (eg. use byte-aligned packing; don't use non-power-of-2 multiplies to tightly pack values if they will cross a byte boundary). Also, almost all compressors, even the best in the world (PAQ, etc.), primarily achieve compression by modeling correlation between neighboring bytes. That means if your data does not have the property of maximum correlation to its immediate neighbor (with steady falloff), some swizzling may help - just rearranging bytes to put the correlated bytes near each other and the uncorrelated bytes far away.

Some issues to consider :

1. Lossy bit packing.

Any time you can throw away bits completely, you have a big opportunity that you should exploit (which no back end compressor can ever do, because it sends data exactly). The most common case of this is if you have floats in your struct. Almost always there are several bits in a float which are pure garbage, just random noise which is way below the error tolerance of your app. Those bits are impossible to compress and if you can throw them away, that's pure win. Most floats are better transmitted as something like a 16 bit fixed point, but this requires application-specific knowledge about how much precision is really needed.
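A minimal sketch of the fixed-point idea, assuming the float is known to lie in [-1,1] and that 16 bits of precision is acceptable (both are app-specific assumptions) :

```python
# Quantize a float known to lie in [-1, 1] to 16-bit fixed point,
# throwing away the noise bits below the app's error tolerance.

def float_to_fixed16(f):
    # clamp, then scale to the full signed 16-bit range
    f = max(-1.0, min(1.0, f))
    return int(round(f * 32767.0))

def fixed16_to_float(q):
    return q / 32767.0

x = 0.123456789
q = float_to_fixed16(x)
x2 = fixed16_to_float(q)
# round-to-nearest bounds the error by half a quantization step
assert abs(x - x2) <= 0.5 / 32767.0
```

The transmitted value is now 2 bytes instead of 4, and more importantly it no longer carries incompressible noise bits.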

Even if you decide you can't throw away those bits, it can still help just to get them out of the main stream. Having some random bytes mixed into an otherwise nicely compressible stream really mucks up the order-0 statistics, so just putting them on the side is a nice way to go. eg. you might take the bottom 4 or 8 bits out of each float and just pass them uncompressed.
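One way to sketch that split (assuming little-endian IEEE-754 floats, so byte 0 holds the lowest mantissa bits) :

```python
import struct

# Split each 32-bit float into its top 3 bytes (fed to the compressor)
# and its bottom mantissa byte (passed raw on the side).

def split_floats(floats):
    main, side = bytearray(), bytearray()
    for f in floats:
        b = struct.pack("<f", f)
        side.append(b[0])       # noisiest mantissa bits, nearly random
        main.extend(b[1:])      # the compressible part
    return bytes(main), bytes(side)

def join_floats(main, side):
    floats = []
    for i in range(len(side)):
        b = bytes([side[i]]) + main[3 * i : 3 * i + 3]
        floats.append(struct.unpack("<f", b)[0])
    return floats
```

The "main" stream keeps its nice statistics and the "side" stream is sent as-is, since it wouldn't have compressed anyway.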

(in practical bone-head tips, it's pretty common for un-initialized memory to be passed to compressors; eg. if your structs are padded by C so there are gaps between values, put something highly compressible in the gap, like zero or a duplicate of the neighboring byte)

2. Relationships between values.

Any time you have a struct where the values are not completely independent, you have a good opportunity for packing. Obviously there are cases where one value in a struct can be computed from another and should just not be sent.

There are more subtle cases, like if A = 1 then B has certain statistics (perhaps it's usually high), while if A = 0 then B has other statistics (perhaps it's usually low). In these cases there are a few options. One is just to rearrange the transmission order so that A and B are adjacent. Most back end compressors model correlation between values that are adjacent, so putting the most-related values in a struct next to each other will let the back end find that correlation.

There are also often complicated mathematical relationships. A common case is a normalized vector; the 3 values are constrained in a way that the compressor will never be able to figure out (proof that current compressors are still very far away from the ideal of perfect compression). When possible you want to reduce these related values to their minimal set; another common case is rotation matrices, where 9 floats (36 bytes) can be reduced to 3 fixed points (6-9 bytes).

This is really exactly the same as the kind of change of variables that you want to do in physics; when you have a lot of values in a struct that are constrained together in some way, you want to identify the true number of degrees of freedom, and try to convert your values into independent unconstrained variables.
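For the normalized-vector case, the reduction to its two degrees of freedom can be sketched like this (the choice of spherical angles and 16 bits per angle are assumptions, not the only option - there are better quantizations in practice) :

```python
import math

# A unit 3-vector has only 2 degrees of freedom, so send two angles
# instead of three floats.

def pack_unit_vec(x, y, z):
    theta = math.acos(max(-1.0, min(1.0, z)))   # polar angle in [0, pi]
    phi = math.atan2(y, x)                       # azimuth in [-pi, pi]
    qt = int(round(theta / math.pi * 65535))
    qp = int(round((phi + math.pi) / (2 * math.pi) * 65535))
    return qt, qp                                # 4 bytes total

def unpack_unit_vec(qt, qp):
    theta = qt / 65535 * math.pi
    phi = qp / 65535 * 2 * math.pi - math.pi
    s = math.sin(theta)
    return (s * math.cos(phi), s * math.sin(phi), math.cos(theta))
```

12 bytes of floats become 4 bytes of independent unconstrained values, and the redundancy the compressor could never find is simply gone.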

When numerical values are correlated to their neighbors, delta transformation may help. (this particularly helps with larger-than-byte values where a compressor will have a harder time figuring it out)
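A minimal sketch of a reversible delta transform on 16-bit values :

```python
# Delta-transform neighbor-correlated values so the compressor sees
# many small, highly skewed numbers instead of large ones.

def delta_encode(values):
    prev, out = 0, []
    for v in values:
        out.append((v - prev) & 0xFFFF)   # wraparound keeps it reversible
        prev = v
    return out

def delta_decode(deltas):
    prev, out = 0, []
    for d in deltas:
        prev = (prev + d) & 0xFFFF
        out.append(prev)
    return out
```

If the values really are neighbor-correlated, most deltas are small, so the high byte of each delta is almost always zero - exactly the kind of skew a byte-oriented compressor eats up.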

3. Don't mash together statistics.

A common mistake is to get too aggressive with mashing together values into bits in a way that wrecks the back-end statistical model. Most back end compressors work best if the bytes in the file all have the same probability histogram; that is, they are drawn from the same "source". (as noted in some of the other points, if there are multiple unrelated "sources" in your one data stream, the best thing to do is to separate them from each other in the buffer)

Let me give a really concrete example of this. Say you have some data which has lots of unused space in its bytes, something like :

bytes in the original have values :

0000 + 4 bits from source "A"
0001 + 4 bits from source "B"

(when I say "from source" I mean a random value drawn under a certain probability distribution)

You might be tempted to bit-pack these to compact them before the back end compressor. You might do something like this :

Take the top 4 bits to make a flag bit
Take 8 flag bits and put them in a byte

Then take the 4 bits of either A or B and put them together in the high and low nibble of a byte

eg, in nibbles :

0A 1B 1B 0A 0A 0A 1B 0A 

--[bit packed]-->

01100010 (binary) + ABBAAABA (nibbles)

(and A and B are not the hex numbers but mean 4 bits drawn from that source)

It looks like you have done a nice job of packing, but in fact you've really wrecked the data. The sources A and B had different statistics, and in the original form the compressor would have been able to learn that, because the flag bit was right there in the byte with the payload. But by packing it up tightly what you have done is made a bunch of bytes whose probability model is a mix of {bit flags},{source A},{source B}, which is a big mess.
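The packing described above can be sketched, and - more importantly - measured; the source distributions here (A skewed low, B skewed high) are made-up assumptions just to give the two sources different statistics :

```python
import random
import zlib

random.seed(1)

# Build bytes as in the example : high nibble 0 -> low nibble from
# source A (skewed low), high nibble 1 -> low nibble from source B
# (skewed high).
original = bytearray()
for _ in range(4096):
    if random.random() < 0.5:
        original.append(0x00 | min(random.getrandbits(2), random.getrandbits(2)))
    else:
        original.append(0x10 | (15 - min(random.getrandbits(2), random.getrandbits(2))))

# The "tempting" packing : 1 flag byte per 8 values, then payload
# nibbles paired into bytes, mixing A and B statistics together.
packed = bytearray()
for i in range(0, len(original), 8):
    group = original[i : i + 8]
    flags = 0
    for j, b in enumerate(group):
        flags |= ((b >> 4) & 1) << (7 - j)
    packed.append(flags)
    for j in range(0, 8, 2):
        packed.append(((group[j] & 0xF) << 4) | (group[j + 1] & 0xF))

# The raw packed form is smaller, but the final sizes are what count :
size_original = len(zlib.compress(bytes(original), 9))
size_packed = len(zlib.compress(bytes(packed), 9))
print(size_original, size_packed)
```

Which form wins depends on the data and the back end, which is exactly why you measure instead of trusting the smaller intermediate buffer.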

I guess a related point is :

4. Even straightforward bit packing doesn't work for the reasons you think it does.

Say for example you have a bunch of bytes which only take on the values 0-3 (eg. use 2 bits). You might think that it would be a big win to do your own bit packing before the compressor and cram 4 bytes together into one. Well, maybe.

The issue is that the back end compressor will be able to do that exact same thing just as well. It can see that the bytes only take values 0-3 and thus will send them as 2 bits. It doesn't really need your help to see that. (you could help it if you had say some values that you knew were in 0-3 and some other values you knew were in 0-7, you might de-interleave those values so they are separated in the file, or somehow include their identity in the value so that their statistics don't get mixed up; see #5)

However, packing the bytes down can help in some cases. One is if the values are correlated to their neighbors; by packing them you get more of them near each other, so the correlation is modeled at an effective higher order. (eg. if the back end only used order-0 literals, then by packing you get order-3 (for one of the values anyway)). If the values are not neighbor-correlated, then packing will actually hurt.
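The packing itself is trivial - a sketch, assuming the values really are all in 0-3 and the count is a multiple of 4 :

```python
# Cram four 2-bit values (each in 0-3) into one byte, high bits first.

def pack_2bit(values):
    assert len(values) % 4 == 0
    out = bytearray()
    for i in range(0, len(values), 4):
        b = 0
        for v in values[i : i + 4]:
            b = (b << 2) | (v & 3)
        out.append(b)
    return bytes(out)

def unpack_2bit(data):
    out = []
    for b in data:
        for shift in (6, 4, 2, 0):
            out.append((b >> shift) & 3)
    return out
```

Whether this helps or hurts the final size depends entirely on whether the values are neighbor-correlated, per the point above.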

(with a Huffman back end packing can also help because it allows you to get fractional bits per original value)

Also for small window LZ, packing down effectively increases the window size. Many people see advantages to packing data down before feeding it to Zip, but largely that is just reflective of the tiny 32k window in Zip (left over from the DOS days and totally insane that we're still using it).

5. Separating values that are independent :

I guess I've covered this in other points but it's significant enough to be redundant about. If you have two different sources (A and B); and there's not much correlation between the two, eg. A's and B's are unrelated, but the A's are correlated to other A's - you should try to deinterleave them.

A common simple case is AOS vs SOA. When you have a bunch of structs, often each value in the struct is more related to the same value in its neighbor struct than to other values within its own struct (eg. struct0.x is related to struct1.x more than to struct0.y). In this case, you should transform from array-of-structs to struct-of-arrays ; that is, put all the .x's together.
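A byte-level sketch of the AOS-to-SOA transpose, for fixed-size records (record size 4 in the test is just an illustrative choice) :

```python
# Transpose array-of-structs bytes to struct-of-arrays layout so each
# field's bytes sit next to their correlated neighbors.

def aos_to_soa(data, record_size):
    assert len(data) % record_size == 0
    n = len(data) // record_size
    out = bytearray(len(data))
    for field in range(record_size):
        for i in range(n):
            out[field * n + i] = data[i * record_size + field]
    return bytes(out)

def soa_to_aos(data, record_size):
    n = len(data) // record_size
    out = bytearray(len(data))
    for field in range(record_size):
        for i in range(n):
            out[i * record_size + field] = data[field * n + i]
    return bytes(out)
```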

For example, it's well known that DXT1 compresses better if you de-interleave the end point colors from the palette interpolation indices. Note that AOS-SOA transformation is very slow if done naively so this has to be considered as a tradeoff in the larger picture.

More generally when given a struct you want to use app-specific knowledge to pack together values that are strongly correlated and de-interleave values that are not.
