
08-19-13 - Sketch of multi-Huffman Encoder

A simple way to do compression of small network packets.

Train N different Huffman code sets offline. Encoder and decoder must each have a copy of the N static code sets.

For each packet, take a histogram. Use the histogram to measure the transmitted length with each code set, and choose the smallest. Transmit the selection and then the encoding under that selection.
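Roughly, the encode side looks like this (just a sketch; the names and table count are made up, and it assumes every table gives a nonzero code length to every symbol, or has an escape, since a packet can contain symbols that were rare in training) :

// Per-packet table selection (sketch; names and table count are made up).
// Assumes codeLen[t][s] = Huffman code length in bits of symbol s under
// static table t, built offline and shared by encoder and decoder.

#include <stdint.h>
#include <stddef.h>

enum { NUM_TABLES = 8, NUM_SYMBOLS = 256 };

extern const uint8_t codeLen[NUM_TABLES][NUM_SYMBOLS];

// returns the table index that codes this packet smallest and fills *pBits
// with the coded size in bits (not counting the table selector)
int ChooseTable(const uint8_t * packet, size_t len, uint32_t * pBits)
{
    // histogram the packet
    uint32_t histo[NUM_SYMBOLS] = { 0 };
    for (size_t i = 0; i < len; i++)
        histo[packet[i]]++;

    // measure the coded length under each static table, keep the smallest
    int best = 0;
    uint32_t bestBits = ~(uint32_t)0;
    for (int t = 0; t < NUM_TABLES; t++)
    {
        uint32_t bits = 0;
        for (int s = 0; s < NUM_SYMBOLS; s++)
            bits += histo[s] * codeLen[t][s];
        if (bits < bestBits) { bestBits = bits; best = t; }
    }

    *pBits = bestBits;
    return best; // transmit 'best', then the packet coded with table 'best'
}

The decoder only needs the N static decode tables and the transmitted selector, which is where the zero per-channel memory property comes from.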

All the effort is in the offline training. Sketch of training :

Given a large number of training packets, do this :


for - many trials - 

select 1000 or so seed packets at random
(you want roughly 4X N of them)

each of the seeds is a current hypothesis of a huffman code set
each seed has a current total histogram and codelens

for each packet in the training set -
add it to one of the seeds
the one which has the most similar histogram
one way to measure that is by counting the huffman code length

after all packets have been accumulated onto all the seeds,
start merging seeds

each seed has a cost to transmit; the size of the huffman tree
plus the data in the seed, huffman coded

merge seeds with the lowest cost to merge
(it can be negative when merging makes the total cost go down)

keep merging the best pairs until you are down to N seeds

once you have the N seeds, reclassify all the packets by lowest-cost-to-code
and rebuild the histograms for the N trees using only the new classification

those are your N huffman trees

measure the score of this trial by encoding some other training data with those N trees.

It's just k-means with random seeds and bottom-up cluster merging. Very heuristic and non-optimal but provides a starting point anyway.
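To pin down the two costs the clustering uses, something like this (again just a sketch; all names are hypothetical, and BuildCodeLens / TreeTransmissionCost are assumed to exist) :

// Training-side cost helpers (sketch; names are hypothetical).
// A "seed" here is just an accumulated histogram plus the Huffman code
// lengths built from it.

#include <stdint.h>

enum { NUM_SYMBOLS = 256 };

struct Seed
{
    uint32_t histo[NUM_SYMBOLS];
    uint8_t  codeLen[NUM_SYMBOLS];   // rebuilt from histo whenever it changes
};

// assumed to exist : build Huffman code lengths from a histogram
void BuildCodeLens(const uint32_t * histo, uint8_t * codeLen);

// assumed to exist : bits to transmit the Huffman tree / code-length table
uint32_t TreeTransmissionCost(const uint8_t * codeLen);

// bits to code 'histo' using the code lengths of 'seed';
// this is also how a packet is assigned to its most similar seed
// (in practice symbols the seed hasn't seen need a nonzero escape length)
uint32_t CostToCode(const uint32_t * histo, const Seed & seed)
{
    uint32_t bits = 0;
    for (int s = 0; s < NUM_SYMBOLS; s++)
        bits += histo[s] * seed.codeLen[s];
    return bits;
}

// total cost of a seed on its own : the tree plus its accumulated data
uint32_t SeedCost(const Seed & seed)
{
    return TreeTransmissionCost(seed.codeLen) + CostToCode(seed.histo, seed);
}

// change in total cost if a and b are merged into one seed;
// can be negative, which means the merge is a pure win
int64_t MergeCostDelta(const Seed & a, const Seed & b)
{
    Seed m;
    for (int s = 0; s < NUM_SYMBOLS; s++)
        m.histo[s] = a.histo[s] + b.histo[s];
    BuildCodeLens(m.histo, m.codeLen);

    return (int64_t) SeedCost(m) - (int64_t) SeedCost(a) - (int64_t) SeedCost(b);
}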

The compression ratio will not be great on most data. The advantage of this scheme is that there is zero memory use per channel. The huffman trees are const and shared by all channels. For N reasonable (4-16 would be my suggestion) the total shared memory use is quite small as well (less than 64k or so).

Obviously there are many possible ways to get more compression at the cost of more complexity and more memory use. For packets that have dword-aligned data, you might do a huffman per mod-4 byte position. For text-like stationary sources you can do order-1 huffman (that is, 256 static huffman trees, selected by the previous byte), but this takes rather more const shared memory. Of course you can do multi-symbol huffman, and there are lots of ways to do that. If your data tends to be runny, an RLE transform would help, etc. I don't think any of those are worth pursuing in general; if you want more compression then just use a totally different scheme.
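For what it's worth, those variants are just a different rule for picking which static table codes the next byte; hypothetical selectors might look like this :

// Context selectors for the variants above (hypothetical; each index picks
// one of a set of static Huffman tables built offline for that context).

#include <stdint.h>
#include <stddef.h>

// order-1 : one table per previous byte value (256 static tables)
static inline int SelectTableOrder1(uint8_t prevByte) { return prevByte; }

// dword-aligned data : one table per byte position mod 4 (4 static tables)
static inline int SelectTableMod4(size_t bytePos) { return (int)(bytePos & 3); }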


Oh yeah this also reminds me of something -

Any static huffman encoder in the vernacular style (e.g. one that does periodic retransmission of the table, more or less optimized in the encoder, like Zip) can be improved by keeping the last N huffman tables. That is, rather than just throwing away the history when you send a new one, keep them. Then when you do retransmission of a new table, you can just send "select table 3" or "send new table as delta from table 5".

This lets you use locally specialized tables far more often, because the cost to send a table is drastically reduced. That is, in the standard vernacular style it costs something like 150 bytes to send the new table. That means you can only get a win from sending new tables every 4k or 16k or whatever, not too often because there's big overhead. But there might be little chunks of very different data within those ranges.

For example you might have one Huffman table that only has {00,FF,7F,80} as literals (or whatever, specific to your data). Any time you encounter a run where those are the only characters, you send a "select table X" for that range, then when that range is over you go back to using the previous table for the wider range.
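A sketch of that decision, with hypothetical names and cost functions; the point is just that each block takes the cheapest of select / delta-from-old / send-new :

// "Keep the last N tables" retransmission decision (sketch; hypothetical
// names and cost model).

#include <stdint.h>
#include <algorithm>

enum { HISTORY = 8, NUM_SYMBOLS = 256 };

struct CodeSet { uint8_t codeLen[NUM_SYMBOLS]; };

// assumed to exist :
uint32_t CostToCodeBlock(const uint32_t * histo, const CodeSet & cs); // sum of histo[s]*codeLen[s]
uint32_t CostToSendNew  (const CodeSet & cs);                         // bits for a whole new table (~150 bytes in the vernacular style)
uint32_t CostToSendDelta(const CodeSet & from, const CodeSet & to);   // bits for a delta, usually much smaller

static const uint32_t kSelectBits = 8; // a few bits to say "select table i" (made up)

// history[] = the last HISTORY tables both encoder and decoder have kept.
// freshTable = a table optimized for the upcoming block.
// Returns the cheapest total bit cost for coding that block.
uint32_t CheapestOption(const uint32_t * blockHisto,
                        const CodeSet history[HISTORY],
                        const CodeSet & freshTable)
{
    uint32_t best = ~(uint32_t)0;

    // option 1 : "select table i" , reuse an old table as-is
    for (int i = 0; i < HISTORY; i++)
        best = std::min(best, kSelectBits + CostToCodeBlock(blockHisto, history[i]));

    // option 2 : send freshTable as a delta from one of the old tables
    for (int i = 0; i < HISTORY; i++)
        best = std::min(best, CostToSendDelta(history[i], freshTable)
                              + CostToCodeBlock(blockHisto, freshTable));

    // option 3 : send freshTable whole, as in the standard scheme
    best = std::min(best, CostToSendNew(freshTable)
                          + CostToCodeBlock(blockHisto, freshTable));

    return best;
}

The decoder keeps the same rolling history of tables, so "select table i" is just an index into it.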

3 comments:

ryg said...

How's the speed of this compared to say your MMO LZP variants?

In my experience histogramming isn't cheap. And I'd expect your LZP thingy to do better on the compression too.

cbloom said...

Yeah, histogramming is trouble. On LHS (load-hit-store) platforms it's a complete disaster, because histogramming is one giant LHS.

LZP gets way better compression, but it's using a ton more memory. It's got a big shared dictionary, and also a per-channel adaptive state. Not a fair contest.

The nice thing about multi-huffman is zero per-channel memory use. Also the decode is quite fast; all the time is in the encode.

It also works okay even on very tiny packets, under 16 bytes.

Those properties make it good for something like an MMO for the client->server traffic. The server only decodes those packets and doesn't want to spend any memory on a channel adaptive state. They also tend to be quite small, and the data rate is so low on the client end that taking a bit of time per byte is not a big deal.

Unknown said...

This is basically what we do for our MMO Vendetta Online.
We have a set of static tables that we choose based on the context of the data being sent. For example, if a readable text string is sent, a static table that is 'optimized' for ASCII is used, but if a 3d position is sent, a 'floating-point' optimized table is used. Overall I think we have 16 tables to choose from that have been optimized with actual gameplay data.
