06-15-10 - Top down - Bottom up

"Top down bottom up" sounds like a ZZ-top song, but is actually different ways of attacking tree-like problems. Today I am thinking about how to decide where to split a chunk of data for Huffman compression. A quick overview of the problem :

You have an array of N characters. You want to Huffman compress them, which means you send the code lengths and then the codes. The idea is that you may do better if you split this array into several. For each array you send new code lengths. Obviously this lets you adapt to the local statistics better, but costs the space to send those code lengths. (BTW one way to do this in practice is to code using an N+1 symbol alphabet, where the extra symbol means "send new codes"). You want to pick the split points that give you lowest total length (and maybe also a penalty for more splits to account for the speed of decode).

The issue is how do you pick the splits? For N characters the possible number of splits is 2^(N-1) (eg. if N = 3, the ways are [012],[0][12],[01][2],[0][1][2]) so brute force is impossible.

The standard ways to solve this problem is either top down or bottom up. I always think of these in terms of Shannon-Fano tree building (top down) or Huffman tree building (bottom up).

The top down way is : take the N elements and pick the single best split point. Then repeat on each half. This is recursive top-down tree building. One disadvantage of this approach is that it can fail to make good splits when the good splits are "hidden". eg. data like {xxxxxoooooxxxxxoooooxxxxxoooooxxxxx} might not pick any splits, because the gain from one split is not significant enough to offset the cost of sending new codes, but if you actually went all the way down the tree you would find good splits by completely segregating each unique bit. I also sometimes call this the "greedy" approach.

So the next solution is bottom up. Start with every character as a group (merge up runs of identical characters first). Then find the two groups which give you the biggest gain to merge and merge those groups. Repeat until no merge is beneficial.

In general it is often the case that bottom up solutions to these problems are better than top-down, but I don't there is any strong reason why that should be true.

There are a whole class of problems like this, such as BSP/kd tree building and category tree building in the machine learning literature (actually this is identical to category tree building in machine learning, since entropy is pretty good way to rate the quality of category trees, and an "overfit" penalty for making too many splits is equivalent to the code cost transmission penalty here). One of the good tricks if using the top-down approach is do an extra "non-greedy" step. When you hit the point where no more splits are beneficial, go ahead and do another split anyway speculatively and then split its children. Sometimes by going ahead one step you get past the local minimum, eg. the classic case is trying to build a kd-tree for this kind of data :

  X  O

  O  X

Where do you put a split? You currently have 50/50 entropies, no split is beneficial. If you do one split and then force another, you find a perfect tree.

Anyway, because this class of problems is unsolvable, most of the techniques are just heuristic and ad-hoc. For example in many cases it can be helpful just to always do a few mid-point splits for N levels to get things started. For the specific case of kd-trees there have been a lot of great papers in the last 5 years out of the interactive ray tracing community.

1 comment:

Jeff Roberts said...

I think you can look at this as a dynamic programming problem similar to trellis encoding. You lay out the frequencies in a sorted array, and then you are just sliding the split points (where you would edge the huffman tree) along each bitlevel, so there are far fewer choices than you think - especially if you are limiting the maximum depth of the tree. We can talk about it tomorrow. Jeff

old rants