cbloom rants: 02-10-15 - LZ Offset Modeling Rambles

Canonical LZ offset coding :

The standard way to send LZ offsets in a modern LZ coder is like this :

1. Remove the bottom bits. The standard number is 4-7 bits.


low = offset & ((1<<lowbits)-1);
offset >>= lowbits;

then you send "low" entropy coded, using a Huffman table or arithmetic or whatever. (offsets less than the mask are treated separately).

The point of this is that offset bottom bits have useful patterns in structured data. On text this does nothing for you and could be skipped. On structured data you get probability peaks for the low bits of offset at 4,8,12 for typical dword/qword data.

You also get a big peak at the natural structure size of the data. eg. if it's a D3DVertex or whatever and it's 36 or 72 bytes there will be big probability peaks at 36,72,108,144.

2. Send the remaining high part of offset in a kind of "# sig bits" (Elias Gamma) coding.

Count the number of bits on. Send the # of bits on using an entropy coder.

Then send the bits below the top bit raw, non-entropy coded. Like this :


topbitpos = bit_scan_top_bit_pos( offset );
ASSERT( offset >= (1<<topbitpos) && offset < (2<<topbitpos) );

rawbits = offset & ((1<<topbitpos)-1);

send topbitpos entropy coded
send rawbits in topbitpos bits

Optionally you might also entropy-code the bit under the top bit. You could just arithmetic code that bit (conditioned on the position of the top bit as context). Or you can make an expanded top-bit-position code :


slot = 2*topbitpos + ((offset>>(topbitpos-1))&1)

send "slot" entropy coded
send only topbitpos-1 of raw bits

More generally, you can define slots by the number of raw bits being sent in each level. We've done :


Straight #SB slots :

{0,1,2,3,4,5,...}

Slot from #SB plus one lower bit :

{0,1,1,2,2,3,3,4,4...}

More generally it helps a little to put more slots near the bottom :

{0,0,1,1,1,2,2,2,3,3,4,4,5,6,7,8...}

but in the more general method you can't use a simple bitscan to find your slot.

The intermediate bits that are sent raw do have a slight probability decline for larger values. But there's very little to win there, and a lot of noise in the modeling. In something like a Huffman-coded-LZ, sending the code lengths for extra slots there costs more than the win you get from modeling it.

With this we're just modeling the general decrease in probability for larger offsets. Something that might be interesting that I've never heard of anyone doing is to fit a parameteric probability function, laplacian or something else, to offset. Instead of counting symbols to model, instead fit the parameteric function, either adaptively or statically.

So, a whole offset is sent something like this :


offset bits (MSB on left) :

  SSRRRRRRRRLLLLL

S = bit sent in slot index, entropy coded
R = raw bit
L = low bit, entropy coded

Now some issues and followup.

I. The low bits.

The low-bits-mask method actually doesn't handle the structure size very well. It does well for the 4-8 dword/qword stuff, because those are generally divide into the low bit mask evenly. (a file that had trigram structure, like an RGB bitmap, would have problems).

The problem is the structure size is rarely an exact multiple of the power of two "lowbits" mask. For example :


 36&0xF = 0x04 = 0100
 72&0xF = 0x08 = 1000
108&0xF = 0x0C = 1100
144&0xF = 0x00 = 0000

A file with structure size of 36 will make four strong peaks if the lower 4 bits are modeled.

If instead of doing (% 16) to get low bits, you did (% 36), you would get a perfect single peak at zero.

Any time the structure size doesn't divide your low bits, you're using extra bits that you don't need to.

But this issue also gets hidden in a funny way. If you have "repeat match" codes then on strongly structured data your repeat offsets will likely contain {36,72,...} which means those offsets don't even go into the normal offset coder. (the redundancy between low-bits modeling and repeat matches as a way of capturing structure is another issue that's not quite right).

II. Low bit structure is not independent of high bits.

Sometimes you get a file that just has the exact same struct format repeated throughout the whole file. But not usually.

It's unlikely that the same structure that occurs in your local region (4-8-36 for example) occurs back at offset one million. It might be totally different structure there. Or, it might even be the same structure, but shifted by 1 because there's an extra byte somewhere, which totally mucks up the low bits.

The simplest way to model this is to make the lowbits entropy coder be conditioned by the high bits slot. That is, different models for low offsets vs higher offsets. The higher offsets will usually wind up with low bits that are nearly random (equiprobable) except in the special case that your whole file is the same structure.

More generally, you could remember the local structure that you detect as you scan through the file. Then when you match back to some prior region, you look at what the structure was there.

An obvious idea is to use this canonical coding for offsets up to 32768 (or something), and then for higher offsets use LZSA.

So essentially you have a small sliding LZ77 window immediately preceding your current pointer, and as strings fall out of the LZ77 window they get picked up in a large LZSA dictionary behind.

cbloom rants

2/10/2015

02-10-15 - LZ Offset Modeling Rambles

No comments:

old rants