5/18/2009

05-18-09 - Lagrange Space-Speed Optimization

We just did an interesting thing at RAD that's kind of related to what I've been writing about.

A while ago Dmitry Shkarin (ppmii inventor) posted this code sketch for his Huffman decoder :


Dmitry Shkarin's Huffman reader :

Suppose We have calculated lengths of rrHuffman codes and minimal
codelength is N then We can read N bits and to stop for most probable
symbols and to repeat reading for other symbols. Decoding procedure will
be something similar to:

extern UINT GetBits(UINT NBits);

struct HUFF_UNPACK_ITEM {
INT_TYPE NextTable, BitsToRead;
} Table[2*ALPHABET_SIZE];

inline UINT DecodeSymbol()
{
const HUFF_UNPACK_ITEM* ptr=Table;
do {
ptr += ptr->NextTable+GetBits(ptr->BitsToRead);
} while (ptr->BitsToRead != 0);
return ptr->NextTable;
} 

this will seem rather opaque to you if you don't know about Huffman codes; you can ignore it and read on and still get the point.

Dmitry's decoder reads the minimum # of bits to get to the next resolved branch at each step, then increments into the table by that branch. Obviously the value of the bits you read is equal to the branch number. So like if you read two bits, 00 = 0, 01 = 1, 10 = 2, 11 = 3 - you just add the bits you read to the base index and that's the branch you take.

Okay, that's pretty simple and nice, but it's not super fast. It's well known that a good way to accelerate Huffman decoding is just to have a big table of how to decode a large fixed-size bit read. Instead of reading variable amounts, you always just read 12 bits to start (for example), and use that 12 bit value to look up in a 4096 member table. That table tells you how many bits were actually needed and what symbol you decoded. If more than 12 bits are needed, it gives you a pointer to a followup table to resolve the symbol exactly. The crucial thing about that is that long symbols are very unlikely (the probability of each symbol is like 2^-L for a bit length of L) so you rarely need the long decode path.

It's pretty obvious that you could extend Dmitry's method to encompass read-aheads like this acceleration table. Instead of just doing GetBits on "BitsToRead" , instead you scan ahead BitsToRead , and then when you take a path you add an extra field like "BitsConsumed" which tells you how many of those bits were actually needed. This lets you make initial jump tables that read a whole bunch of bits in one go.

More generally, in the tree building, at any point you could decide to make a big fast node that wastes memory, or a small binary treeish kind of node. This is kind of like a Judy tree design, or a Patricia Trie where the nodes can switch between linked-lists of children or an array of child pointers. One nice thing here is our decoder doesn't need to switch on the node type, it always uses the same decode code, but the tree is just bigger or smaller.

To be concrete here's a simply Huffman code and possible trees for it :


Huffman codes :

0   : a
10  : b
110 : c
111 : d

Possible trees :

(* means read one bit and branch)

Tree 1)

*----- a
 \
  *--- b
   \
    *- c
     \
       d

4 leaves
3 internal nodes
= 7 nodes total

Tree 2)

[2bit] 
[ 00 ]--- a
[ 01 ]--- a
[ 10 ]--- b
[ 11 ]- *- c
         \ d

5 leaves
2 internal nodes
= 7 nodes total

Tree 3)

[3bit] -
[ 000 ] -- a
...
[ 110 ] -- c
[ 111 ] -- d

8 leaves
1 internal node
= 9 nodes total

Tree 4)

*------- a
 \
  [2bit]
  [ 00 ] - b
  [ 01 ] - b
  [ 10 ] - c
  [ 11 ] - d

5 leaves
2 internal nodes
= 7 nodes total

We have these four trees. They have different memory sizes. We can also make an estimate of what the decode time for each tree would be. In particular for the case of Huffman decoding, the expected time is something like the number of branches weighted by 2^-depth of each branch. Reading more bits in a given branch isn't significantly slower than reading 1 bit, we just want as few branches as possible to decode a symbol.

I'm going to get away from the Huffman details now. In general when we are trying to make fast data structures, what we want is as much speedup as possible for a given memory use. Obviously we could throw 64k of table slots at it and read 16 bits all at once and be very fast. Or we could use the minimum-memory table. Usually we want to be somewhere in between, we want a sweet spot where we give it some amount of memory and get a good speedup. It's a trade off problem.

If we tried all possible trees, you could just measure the Mem use & Time for each tree and pick the one you like best. You would see there is some graph Time(Mem) - Time as a function of Mem. For minimum Mem, Time is high, as you give it more Mem, Time should go down. Obviously that would be very slow and we don't want to do that.

One way to think about it is like this : start with the tree that consumes minimum memory. Now we want to let it have a little bit more memory. We want the best bang-for-the-buck payoff for that added memory, so we want the tree change that gives us the best speedup per extra byte consumed. That's the optimum d(Time)/d(Mem). Keep doing improvements until d(Time)/d(Mem) doesn't give you a big enough win to be worth it.

Some of you may already be recognizing this - this is just a Rate-Distortion problem, and we can solve it efficiently with a Lagrange multiplier.

Create a Lagrange cost like :


J = Time + lambda * Mem

Now try to build the tree that minimizes J for a given lambda. (ignore how lambda was picked for now, just give it some value).

If you find the tree that minimizes J, then that is the tree that is on the optimal Mem-Time curve at a certain spot of the slope d(Time)/d(Mem).

You should be able to see this is true. First of all - when J is minimum, you must be on the optimal Time(Mem) curve. If you weren't, then you could hold Mem constant and improve Time by moving towards the optimal curve and thus get a lower J cost.

Now, where are you on the curve? You must be at the spot where lambda = - d(Time)/d(Mem). One way to see this is algebraically :


J is at a minimum, therefore :

d(J)/d(Mem) = 0

d(J)/d(Mem) = d(Time)/d(Mem) + lambda * 1 = 0

therefore lambda = - d(Time)/d(Mem)

You can also see this intuitively. If I start out anywhere on the optimal Time(Mem) curve, I can improve J by trading Mem for Time as long as the gain I get in Time is exceeding lamda * what I lose in Mem. That is, if d(Time) > - lambda * d(Mem) , then I should do the step. Obviously you keep doing that until they are equal. QED.

Since I'm nice I drew a purty picture :

The blue curve is the set of optimal solutions for various Mem parameters. The hard to see yellow tangent is a lambda parameter which is selecting a specific spot on the trade-off curve. The green region above the curve is the space of all possible solutions - they are inefficient solutions because you can improve them by getting to the blue curve. The red region is not possible.

This kind of lagrange space-time optimization has all sorts of applications in games. One example would be your spatial acceleration structures, or something like kD trees or BVH hierarchies for ray tracing. Too often we use hacky heuristics to build these. What you should really do is create a lagrange cost that weighs the cost of more memory use vs. the expected speedup.

One of the nice wins of this approach is that you can often get away with doing a greedy forward optimization for J (the lagrange cost), and it's just a single decision as you build your tree. You just evaluate your current choices for J and pick the best one. You do then have to retry and dial lambda to search for a given Mem target. If you didn't use the lagrange multiplier approach, you would have to try all the approaches and record the Time/Mem of every single possibility, then you would pick the one that has the best Time for the desired Mem.

In the past I've talked about algorithms being dumb because they are "off the curve". That means they are in the green region of the picture - there's just no reason to use that tradeoff. In general in algorithms you can't fault someone for selecting something on the blue curve. eg. a linked list hash table vs. a reprobing hash table might be at different spots on the blue curve - but they're both optimal. The only time you're making a big mistake is when you're way out in green space of wanton inefficiency.

Back to the specifics of the Huffman decoder - what we've found is kind of interesting and maybe I'll write about it in more detail later when we understand it better. (or it might be one of our magic secrets).

ADDENDUM : I should be clear that the with the J-cost optimization you still have to consider greedy tree building vs. optimal tree building. What you're gaining is a way to drive yourself towards the blue curve, but it's not like the tree building is suddenly super easy. You also have to deal with searching around in the lagrange multiplier to hit the target you want. For many practical problems you can create experimental heuristics that will give you a formula for the lambda that gives you a certain mem use or rate or whatever.

When we make advances in the art, there are two main ways we do it. One is to push farther along the blue curve than anyone has before. For example in data compression we usually have a run time vs. compression curve. You can run slower and slower algorithms and get more and more compression. You might find a new algorithm that runs even slower and gets even more compression than anyone has before. (PAQ is playing this game now; I used to play this game with PPMZ back in the day). That extends the curve out into areas unexplored. The other advance is to find something that's below the previously known blue curve. You find a way to get a better trade-off and you set a new optimal curve point. You might find a new compressor that gets the same compression ratio as old stuff, but runs way faster.

5/15/2009

05-15-09 - Trellis Quantization

There's a big gap in all technical literature. You can pretty easily find hand-wavey overviews of things for non-specialists in any field that make your feel all happy, but don't get into any details so they're useless if you actually need to do something in that field. Then there's also an abundance of technical papers that are steeped in the terminology and notation of that particular sub-field and are completely opaque. It's almost impossible to find transition material.

One case of this for me was so-called "Trellis Quantization". Part of the problem was that there's a whole body of old literature on "trellis coded quantization" which is an entirely different thing. TCQ is about designing non-linear or VQ quantizers for sources. Almost all of that research from the old days is totally worthless, because it assumed *NO ENTROPY CODING* on the post-quantization coefficients. Thus it was trying to make quantization buckets which had equal probability. Even at the time that research was happening it was already pointless, but it's especially pointlessnow. We know that in the presence of an entropy coder, independent variables are R-D optimal with a fix-step-size deadzone quantizer.

But in the real world our entropy coder is not independent. That is, we use things like run length codes, or adaptive arithmetic coders, or other schemes, such that the way we code one variable affects the code length for the next variable. This is where so-called "trellis quantization" in the modern sense comes in.

I really hate the modern use of the term "trellis quantization" because it's really not about the trellis or the quantization. A better term would be "dynamic programming code stream output optimization". If somebody had just told me that's what it was at the beginning it would have saved me weeks of confusion. It's basically the same thing you do in an LZ optimal parser (though not exactly).

This technique is used mainly in lossy image coders to optimize the coding of a certain bunch of pixels. It can be used for bitplane truncation in zerotree type coders, but it's mainly used for block-transforms, and mainly for video coders (it was made prominent in H263 and is used in H264).

Basically it goes like this : for a given block, you do normal quantization and you get a bunch of coefficients like {17,3,9,4,0,1,2,0}. You could just output that, and you would have a certain rate and distortion. But you could also change some of those coefficients. That would dicrease your rate and increase your distortion. (note that the increase in distortion may be very small in some cases - if the original values were very close to a quantization bucket edge, then you can shift them without much pain). You might be able to output a better Rate-Distortion optimal block by choosing some other output, such as {17,3,9,4,0,1,1,0} or {17,0,8,4,0,0,0,0} depending on where you are on the R-D curve.

The "trellis" comes in because you don't need to consider all possible outputs. It's easiest to think of like this :

Say you have 4 binary coding decisions. That is, you could make choices 0000 or 0001, etc. there are 16 possible choices. Naively, you would have to consider all 16 sequences and rate each one and pick the best. If you draw this as a graph, it looks like a tree - each node has two kids, it branches out and you have 16 leaves. But in many coding scenarios, your current coding cost does not depend on the entire history of your sequence - it only depends on the current state. For example, say you are doing order-1 context modeling and binary arithmetic encoding. Then there is a cost to encode a 0 after a 0, a 0 after a 1, a 1 after 0 and a 1 after a 1 (c00,c01,c10,c11). Each path in the graph is a coding action. The graph you need to consider is like this :


  [ 1 ]----[ 1 ]----[ 1 ]----[ 1 ]
 /     \  /     \  /     \  /     \ 
/       \/       \/       \/       \
\       /\       /\       /\       / 
 \     /  \     /  \     /  \     /  
  [ 0 ]----[ 0 ]----[ 0 ]----[ 0 ]

you start at the far left, as you go along each edge that's a coding output. Any given state, it doesn't matter how you got there, the transitions out of it have the same cost regardless of history. To fill out the graph you start on the far left with a cost of zero, you walk each link, and you fill in the node if the cost you are carrying is lower than what's already in there. Each node only needs to remember the cheapest way to get to that node.

To find the optimal coding you start at the far right and you walk backwards along the links that led you to the cheapest cost.

This graph looks like a "trellis" so these kind of problems are called "trellis quantization" or "joint trellis subband optimization" or whatever. The key thing about the trellis shape is that for N coding decisions the size of the graph is O(K*N) (if there are K options at each decision point), whereas the full branching factor is K^N if you had to consider all possible sequences.

This is apparently the "Viterbi algorithm" or some such thing but all the descriptions I've found of that are weird and confusing.

For game developers, this is very similar to A* path finding. A* is actually a form of Lazy Dynamic Programming. Path finding has the character we need because the cost to go between two nodes only depends on those two nodes, not the whole history, thus at each node you only need to remember the shortest path to get to that node.

In H264 this is a nice way to really optimize the output for a given block. It's sort of weirdly called "quantization" because it often consists of jamming values to zero which is kind of like using a larger adaptive deadzone. I really don't like that terminology, because it is in fact NOT a variable quantizer, since it doesn't affect the reconstruction levels in the decoder at all.

Note that in video coding this is a very small bit of a tiny optimization, and the rest of the optimization is a big heuristic mess. The total optimization looks something like this :


Assign bit rate to various frames
  (try to optimize R-D ; meet channel overflow and buffer underflow constraints )

Within a frame, assign bits to motion vectors vs. residuals & control codes

Iterate :

    choose best motion vectors for current motion vector bit rate allocation

    optimize block mode decisions

    code residuals to given bit allocation

        "trellis" optimize the coding of each block to a lagrangian R-D (R + lambda * D)

    oh and also iteratively search around on the lagrange multiplier lambda
        to hit the rate you were supposed to

    then hold residuals constant and optimize block & motion vector decisions for those residuals

  then shift bit allocation between modes, residuals & vectors & repeat

yuck. Oh, and of course per-block coding can't really be independently optimized since context model state carries between blocks. And in intra frames blocks are used to predict other blocks, so if you change one it affects all future ones. Oh, and this optimization is super important.

05-15-09 - Image Compression Rambling Part 3 - HDR

Next part : HDR (High Dynamic Range) coding. One of the things I'd like to do well for the future is store HDR data well, since it's more and more important to games.

The basic issue is that the range of values you need to store is massive. This means 8-bit is no good, but really it means any linear encoding is no good. The reason is that the information content of values is relative to their local average (or something like that).

For example, imagine you have an image of the lighting in a room during the day with the windows open. The areas directly hit by the sun are blazing bright, 10^6 nits or whatever. The areas that are in shadow are very dark. However, if you rotate and look at the image with your back to the sun, your eyes adjust and you can see a lot of detail in the shadow area still. If you encoded linearly you would have thrown that all away.

There are a few possibilities for storing this. JPEG-XR (HD Photo) seems to mainly just advocate using floating point pixels, such as F16-RGB (48 bits per pixel). One nice thing about this "half" 48 bit format is that graphics cards now support it directly, so it can be used in textures with no conversion. They have another 32 bit mode RGBE which uses an 8-bit shared exponent and 3 8-bit mantissas. RGBE is not very awesome; it gives you more dynamic range than you need, and not enough precision.

I haven't been able to figure out exactly how they wind up dealing with the floating point in the coder though. You don't want to encode the exponent and mantissa seperately, because the mantissa jumps around weirdly if you aren't aware of the exponent. At some point you need to quantize and make integer output levels.

One option would be do a block DCT on floating point values, send the DC as a real floating point, and then send the AC as fixed-size steps, where the step size is set from the magnitude of the DC. I think this is actually an okay approach, except for the fact that the block artifacts around something like the sun would be horrific in absolute value (though not worse after tone mapping).

Some kind of log-magnitude seems natural because relative magnitude is what really matters. That is, in areas of the image near 0, you want very small steps, but right near the sun where the intensity is 10^8 or whatever you only need steps of 10^5.

This leads to an interesting old idea : relative quantization. This is most obvious on a DPCM type coder. For each pixel you are encoding, you predict the pixel from the previous, and subtract from the prediction and send the error. To get a lossy coder, you quantize the error before encoding it. Rather than quantize with a fixed size step, you quantize relative to the magnitude of the local neighborhood (or the prediction). You create a local scale S and quantize in steps of {S,2S} - but then you also always include some very large steps. So for example in a neighborhood that's near black you might use steps like {1,2,3,4,5...16,20,...256,512,1024,...10^5,10^6,10^7}. You allow a huge dynamic range step away from black, but you don't give many values to those huge steps. Once you get up into a very high magnitude, the low steps would be relative.

This takes advantage of two things : 1. we see variation relative to the local average, and 2. in areas of very large very high frequency variation, we have extremely little sensitivity to exactly what the magnitude of that step is. Basically if you're look at black and you suddenly look at the sun you can only tell "that's fucking bright" , not how bright. In fact you could be off by an order of magnitude or more.

Note that this is all very similar to gamma-correction and is sort of redundant with it. I don't want to get myself tangled up in doing gamma and de-gamma and color space corrections and so on. I just want to be a pixel bucket that people can jam whatever they want in. Variable step sizing like this has been around as an idea in image compression for a very long time. It's a good basic idea - even for 8 bit low dynamic range standard images it is interesting, for example if you have a bunch of pixels like {1,3,2,1,200,1,3,2} - the exact value of that very big one is not very important at all. The bunch of small ones you want to send precisely, but the big one you could tolerate 10 steps of error.

While that's true and seems very valuable, it's very hard to use in practice, which is why no one is doing it. The problem is that you would need to account for non-local effects to actually get away with it. For example - changing a 200 to a 190 in that example above might be okay. But what if that was a row if pixels, and the 200 is actually a vertical edge with lots of pixel values at 200. In that case, you would be able to tell if a 200 suddenly jumped to a 190. Similarly, if you have a bunch of values like - {1,3,2,1,200,200,200,1,3,2} - again the eye is very bad at seeing that the 200 is actually a 200 - you could replace all three of those with a 190 and it would be fine. However if your row was like this : {1,3,2,1,200,200,200,1,3,2,1,200,200,2} - and you changed some of the 200's but not the others - then that would be very visible. The eye can't see absolute intensity very well at all, but it can see relative intensity and repetitions of a value, and it can see edges and shapes. So you have to do some big nonlocal analysis to know how much error you can really use in any given spot.

Greg Ward has an interesting approach for backward compatible JPEG HDR . Basically he makes a tone-mapped version of the image, stores that in normal RGB, divides that by the full range original image, and stores the ratio in a seperate gray scale channel. This is a good compromise, but it's not how you would actually want to store HDR if you had a custom pixel format. (it's bad because it presumes some tone mapping, and the ratio image is very redundant with the luma of the RGB image).

I'm leaning towards Log-Y-UV of some kind. Probably with only Log(Y) and UV linear or something like that. Greg Ward has a summary of a bunch of different HDR formats . Apparently game devs are actually using his LogLuv : see Christer or Deano or Matt Pettineo & Marco .

One nasty thing about LogLuv is that there's some problems near zero, and everybody is doing slightly different hacks to deal with that. Yay. It also doesn't bilerp nicely.

ADDENDUM : I should clarify cuz this got really rambly. LogLuv is an *integer* encoding. As a pixel bucket I can of course handle 3 integers of various scales and signs, no problemo. Also, "floating points" that are in linear importance scale are not really a difficult issue either.

The tricky issue comes from "real" floating points. That is, real HDR image data that you don't want to integerize in some kind of LogLuv type function for whatever reason. In that case, the actual floating point representation is pretty good. Storing N bits of mantissa gives you N bits of information relative to the overall scale of the thing.

The problems I have with literally storing floating point exponent and mantissa are :

1. Redundancy beteen M & E = worse compression. Could be fixed by using one as the context for the other.

2. Extra coding ops. Sending M & E instead of just a value is twice as slow, twice as many arithmetic code ops, whatever.

3. The mantissa does these weird step things. If you just compress M on its own as a plane, it does these huge steps when the exponent changes. Like for the values 1.8,1.9,1.99,2.01,2.10 - M goes from 0.99 to 0.01. Then you do a wavelet or whatever and lossy transform it and it smooths out that zigzag all wrong. Very bad.

So clearly just sending M & E independently is terrible.

5/14/2009

05-14-09 - Image Compression Rambling Part 2

Okay we talked a bit about block transforms, now let's talk about some of the somewhat weird variants of block transforms that are used in modern standard coders.

With an 8x8 block we're at a big disadvantage. An 8x8 block is like a 3 level wavelet. That's not much, wavelet coders rely on a 5 or 6 level transform normally, which would correspond to a 32x32 block or better. Large block transforms like that are bad because they're computationally complex, but also because they are visually bad. Large blocks create worse blocking artifacts, and also increase ringing, because it makes the high frequency shapes very non-local.

Basically by only doing 8x8 we are leaving a lot of redundancy between neighboring blocks. There's moderate correlation within a block, but also strong correlation across blocks for coefficients of the same type.

H264 Intra frame coding is actually really excellent; it outperforms JPEG-2000 for example. There're a few papers on this idea of using H264 intra coding just for still images, and a project called AIC . (AIC performs worse than real H264 for a few reasons I'll get into later).

"AIC" basically just does 8x8 block DCT's - but it does this interesting thing of pre-predicting the block before the transform. It works on blocks in scan order, and for each block before it does the DCT it creates a prediction from the already transmitted neighbors and subtracts that off. This is a nice page with the details . What this accomplishes does is greatly reduce correlation between blocks. It subtracts off predicted DC so the DC is usually small, and also often subtracts off predicted shapes, so for example if you're in a smooth gradient region it subtracts off that gradient.

Real H264 intra beats "AIC" pretty well. I'm not sure exactly why that is, but I have a few guesses. H264 uses integer transforms, AIC uses floating point (mainly a big deal at very high bit rates). H264 uses macroblocks and various sub-block sizes; in particular it can choose 8x8 or 4x4 sub-blocks, AIC always uses 8x8. Choosing smaller blocks in high detail areas can be a win. I think the biggest difference is probably that the H264 implementations tested do some RDO while AIC does not. I'm not sure exactly how they do RDO on the Intra blocks because each block affects the next one, but I guess they could at least sequentially optimize each block as they go with a "trellis quantizer" (see next post on this).

Okie doke. JPEG XR has similar issues but solves them in different ways. JPEG XR fundamentally uses a 4x4 transform similar to a DCT. 4x4 is too small to remove a lot of correlation, so neighboring blocks are very highly correlated. To address this, JPEG XR groups 4x4 groups of blocks together, so it has a 16x16 macroblock. The DC's of each of the 4x4 blocks gets another pass of the 4x4 transform. This is a lot like doing a wavelet transform but getting 4:1 reduction instead of 2:1. Within the 16x16 macroblock, each coefficient is predicted from its neighbor using gradient predictors similar to H264's.

In H264 the gradient predictor is chosen in the encoder and transmitted. In JPEG XR the gradient predictor is adaptively chosen by some decision that's made in the encoder & decoder. (I haven't found the exact details on this). Also in JPEG XR the delta-from-prediction is done *post* transform, while in H264 it was done pre-transform.

If you think about it, there's a whole world of possibilities here. You could do 4x4 transforms again on all the coefficients. That would be very similar to doing a 16x16 DCT (though not exactly the same - you would have to apply some twiddle factors and butterflies to make it really the same). You could do various types of deltas in pre-transform space and post-transform space. Basically you can use the previous transmitted data in any way you want to reduce what you need to send.

One way to think about all this is that we're trying to make the reconstruction look better when we send all zeros. That is, at low bit rates, we will very often have the case that the entire block of AC coefficients goes to zero. What does our output look like in that case? With plain old JPEG we will make a big solid 8x8 block. With H264 we will make some kind of gradient as chosen by the neighbor predictor mode. With JPEG XR we will get some predicted AC's values untransformed, and it will also be smoothed into the neighbors by "lapping".

So, let's get into lapping. Lapping basically gives us a nicer output signal when all the AC's are zero. I wrote a bit about lapping before . That post described lapping in terms of being a double-size invertable transform. That is, it's a transform that takes 2N taps -> N coefficients and back -> 2N , such that if you overlap with neighbors you get exact reconstruction. The nice thing is you can make it a smooth window that goes to zero at the edges, so that you have no hard block edge boundaries.

Amusingly there are a lot of different ways to construct lapped transforms. There are a huge family of them (see papers on VLGBT or some fucking acronym or other). There are lots of approaches that all give you the same thing :


2N -> N windowed basis functions as above
  (nobody actually uses this approach but it's nice theoretically)

Pre & post filtering on the image values (time domain or spatial domain)
  basically the post-filter is a blur and the pre-filter is a sharpen that inverts the blur
  (this can be formulated with integer lifting)

Post-DCT filtering (aka FLT - Fast Lapped Transform)
  basically do the NxN DCT as usual
  then swizzle the DCT coefficients into the neighboring DCT's

Post-DCT filtering can either be done on all the coefficients, or just on the first few (DC and primary AC coefficients).

Lapping is good and bad. It's not entirely an awesome win. For one thing, the pre-filter that the lap does is basically a sharpen, so it actually makes your data harder to compress. That's sort of balanced by having a better reconstruction shape for any given bit rate, but not always. The fundamental reason for this is that lapping relies on larger local smoothness. eg. for 8x8 blocks you're doing 16-tap lapped transforms. If your signal is actually smooth over 16 taps then it's all good, but when it's not, the lapped transform needs *larger* AC coefficients to compensate than a plain blocked 8-tap DCT would.

The interesting thing to me is to open this up and consider our options. Think just about the decoder. When I get the DC coefficient at a given spot in the image - I don't need to plop down the shape of any certain transform coefficient. What I should do is use that coefficient to plop down my best guess of what the pixels here were in the original. When I get the next AC coefficient, I should use that to refine.

One way to think about this is that we could in fact create an optimized local basis. The encoder and decoder should make the same local basis based on past transmitted data only. For example, you could take all the previously sent nearby blocks in the image, run PCA on them to create the local KLT ! This is obviously computationally prohibitive, but it gives us an idea of what's possible and how far off. Basically what this is doing is making the DC coefficient multiply a shape which is our best guess for what the block will be. Then the 1st AC coefficient multiplies our best guess for how the block might vary from that first guess, etc.

5/13/2009

05-13-09 - Image Compression Rambling Part 1

What exactly is a block transform like a DCT or whatever, and why do we do it? This is sort of rambling socratic questioning for my self.

We have some 1d signal of length N (that's just N floats). We want to transform it and preserve L2 norm. (Why exactly do we want to preserve L2 norm? It's not strictly necessary but it means that quantizing in transformed space is the same as transforming in pre-transform space which is handy.)

Well, preserving L2 norm is just the same as pretending our signal is a vector in N-d space and preserving its length. That means that our transform is just a rotation (our transform is real and invertable). In particular an N-point transform is a member of SO(N).

That's most obvious in 2d so let's start there. I wrote before about the 2d Haar/Hadamard/S-transform and whatnot.

What are all the possible 2d matrices that can transform a signal and preserve length ?


I : (identity)
[1 0]
[0 1]

J : (exchange)
[0 1]
[1 0]

Z : (mirror)
[1  0]
[0 -1]

ZZ = I , JJ = I

K = -JZ = ZJ =
[0  1]
[-1 0]

KK = -1
K^T = -K
K^T K = 1

H = Hadamard = J + Z :
[1  1]
[1 -1]

is all you can make; (you can stick arbitrary factors of J or K on there, or -1's). Let's see what kind of linear combination we can make :

R(c,s) = c * I + s * K

R^T * R = ( c * I + s * K ) ^T * ( c * I + s * K ) 

R^T * R = ( c * I ^T + s * K ^T ) * ( c * I + s * K ) 

R^T * R = ( c * I - s * K ) * ( c * I + s * K ) 

R^T * R = ( c^2 * I - s^2 * K*K ) 

R^T * R = ( c^2 * I + s^2 * I ) 

R^T * R = ( c^2 + s^2 )  * I

therefore ( c^2 + s^2 ) = 1

therefore c & s are a cosine & sine and R is a rotation

You see a lot of signal processing papers drawing these fucking signal flow diagrams that I hate. One of the things they like to draw is a "butterfly" . There's some ambiguity about what people mean by a "butterfly". The Wikipedia page is using the convention that it's just a Hadamard. People will usually put a little "-" on the line that gets the negative to disambiguate.

Sometimes you'll see a variable written next to a line of a butterfly. That means multiply by that number. We saw in my earlier post how rotations can be made from shears. If you ignore normalization for a moment, we can see that 2d rotations can be made just by applying one of our matrices (such as H), then multiplying one variable by a scalar. In terms of circuit diagrams, I and J are just lines moving around, Z is a negative applied to one line, H is a butterfly. That's all you need to do any 2d rotation.

Some of the lapped transform people confusingly use the "mirrored rotation" matrix :


M(a) = Z * R(a)

M =
[ c   s ]
[ s  -c ]

M^T = M
M * M = I

M is its own transpose and its own inverse
btw you can see this geometrically because Z * R(a) = R(-a) * Z

Haar = M( 45 degrees ) = sqrt(1/2) * Hadamard

Now any N-d rotation can be made from a series of 2d planar rotations. That is in fact how you build the FFT.

Since any transform is a rotation, the 8-tap DCT must be an 8-d rotation. We could figure out all the angles by thinking about what it does to various vectors. All constant vectors [c,c,c,c,c,c,c,c] are rotated to [1,0,0,0,0,0,0,0] ; that's one angle, etc. Each of those rotations is just a 2d rotation in some plane or other.

We can bring this back to the PCA/KLT I wrote about before . The PCA finds the axes of principal variation. The KLT is then the rotation matrix which rotates the principal axes to the coordinate axes (that is, it's the spatial frame where the covariance is diagonal).

Now we can ask why one set of axes or the other. Apparently the DCT is the KLT for a simple model of neighbor-correlated data. (specifically, if the correlation matrix is symmetric and tri-diagonal). I'd like to find a simple proof of this, I haven't been able to find it yet. (help). Recently there have been some papers on using the Tchebichef transform (DTT) instead of the DCT. Personally I think this is rather moot because nobody just does transforms and sends the coefficients directly any more. We always take deltas from neighbors or use correlations in other ways, so having transforms that decorrelate more is somewhat irrelevant.

(BTW there is one big win with the DTT - it's based on polynomials so the first AC coefficient is in fact a straight line ramp; with the DCT the first AC is a cosine shape. For synthetic images this could be a big win because the DTT can capture linear gradients exactly in only 2 coefficients).

But let's back up a second. Why are we doing a block transform at all ? It's not obvious. We want to exploit correlation. But there are plenty of other ways to do that. We could use a DPCM method (delta from neighbors), or just a probability-model method (predict similarity to neighbors). That could capture the correlation perfectly well. So it's not that.

Sometimes it's said that it's good for quantization. Hmm, maybe. You can also quantize the error that you would code in a DPCM method (like lossy CALIC). In practice quantizing transforms does work better at high loss, though it's a bit mysterious why that is exactly. There are two main factors I think :

1. Separating the DC and the AC. People make a lot of claims about the DCT basis being a natural fit for the human visual system. I think that's largely hogwash. Certainly once you cut it into 8x8 blocks it's wrong. But one thing that is valuable is the seperation of DC (low-frequency intesity - basically just a mipped-down version of the signal), and AC (higher frequency detail). The transform lets you code the DC carefully and throw away the AC.

Now, you could say making a lower res mip version and sending that first is not really inherent to transform coding. You'd be wrong. The problem is you want to make a lower res "DC" version without introducing redundancy. Say you have a 2x2 block of pixels -


| a b |
| c d |

You want to send a lower res version first, such as (a+b+c+d)/4 , and then send the higher res version. Say you want do this with some kind of DPCM/predictor scheme. Even if you use the lower res to predict the higher res, you are creating redundancy, you're now coding 5 values instead of 4. Of course the way to fix this is to do a 2d Haar transform to create the average & deltas from average in a reversible way that only makes 4 values instead of 5 !!

(note you could just send {a} as the low res version then send {b,c,d} later - that avoids redundancy at the cost of not having as good of a low res version).

2. Efficiency (lots of zeros). Block transforms are handy in practice just for a purely practical reason. When you quantize them you get lots of zeros. That lets you do run-length or end-of-block codes that cut off a ton of zeros and save lots of coding ops. Note that I'm distinguishing this from the "energy compaction" which they often talk about in the literature as being a huge win for compression. No, that's not really why we like block transforms - you can get the exact same entropy gain by using the correlation in other ways. The big win is the practical issue.

5/07/2009

05-07-09 - Integer Function Inversions

Question : how do you find/make functions that are exactly invertable in integers?

In particular, what I'd like is a family of cumulative probability distribution functions for arithmetic coding that can be inverted in integers.

Even more specifically, the main cases are semi-laplacian or semi-gaussian functions. Precisely the goal is something like this :


Probably of a symbol is modeled like P(x) ~= e^ ( - lambda * x ) = k ^ -x

(or something like that; P(0) is large, P(255) is small; x in [0,255] and)

Cumulative probability is the sum of probability of all lower symbols :

C(x) = Sum { y <= x } P(y)

To encode x we send something in [ C(x-1) , C(x) )

We want C(x) to be scaled such that C(255) = 16384 or some other power of 2 constant

We want C(x) to be integers, and we for decodability we must have C(x) >= C(x-1) + 1

That is, the integer math truncation must never make a C(x) = C(x-1)

Now, to decode we get back some target number T that's in in [ C(x-1) , C(x) ) for some X

We'd like to have an analytic function that gives us x directly from T :

x = D( T )

Now before you say "that's impossible" ; it's obviously not. You can certainly trivially find solutions such as :


C(x) = (C(255)/256) * (x+1)

D(T) = 256 * T / C(255);

The question is, are there more useful solutions, and in particular can you construct a whole family of solutions that are parameterized to give you different shapes.

Obviously you can precompute table lookups, but I'd rather not have a whole mess of tables.

I don't really know anything about integer->integer function theory; it seems to me there are two possible approaches to this. One is "constructive" ; start with simple funcionts that you know you can invert, then there are many operations you can do on them to compose or distort them and still have them invertable.

5/05/2009

05-05-09 - AutoReflect

I wrote before about autogenerating prefs for C++ . Well, I went and did it.

It's almost exactly what was discussed before. There's a code generator "AutoReflect" that makes a little include file. You mark variables with //$ to get them processed.

Here's an actual example from Galaxy4 :


class DriverTestPref : public Prefs
{
public :

    AUTO_REFLECT_FULL(DriverTestPref);

    float m_sharedConvergeTime; //$ = 2.2f;
    float m_cubicMaxAccelScale; //$ = 1.75f;
    float m_pdTimeScale; //$ = 2.f;
    float m_pdDamping; //$ = 1.f;
    float m_pdMinVel; //$ = 0.005f;
    float m_interceptConfidence; //$ = 0.5f;
    
    //float m_test; //$ = 1.f;
};

#include "gApp_DriverTest.aup"

To integrate this with VC 2003, I made AutoReflect just run as a global pre build step. It recurses directories and looks for all the ".aup" files. It checks their modtime against the corresponding .cpp and only runs if the cpp is newer. Even then, it's quite common that the cpp was changed but not in a way that affects the AutoReflect, so I generate the new aup file to a temp name, diff it against the current aup file, and don't touch it if it's the same.

That way I don't have to worry about adding custom build steps to lots of files or anything, it's one global pre-build you put in your project once. There are a few minor disadvantages with that :

1. You have to make an "aup" file manually once to get it started. You can do this just by creating the file by hand, or you can run "AutoReflect -f" for "full process" in which case it changes the enumeration to look for all "cpp" files instead of looking forb all "aup" files.

2. Fucking MSDev Pre/Post Build Events don't use the machine %PATH% to look for executables !?!?! URG WTF. It means I can't just put "AutoReflect" in there and have it work on various systems, I have to hard code the full path, or put it in one of the MSDev "Executable Directories" paths.

I gather than in VC 2008 the custom build capabilities are much enhanced so maybe there's a better way there, but this is mostly working very nicely. One good thing about doing it as a pre-build is that it doesn't interfere with the normal Make incremental build at all. That is, when the cpp is modified, first I make a new aup, then the cpp is compiled (which includes the new aup). There's no funny business where the cpp gets compiled twice, or it gets compiled before the aup is made or any of those kinds of problems.

ADDENDUM : actually I just realized there is a problem with this method. Because the "pre build" is only run for an F7 "build" and not for a ctrl-F7 "compile" you can compile your file and it doesn't get the new AUP. That's not a disaster, but it mildly sucks, I'd like it to AutoReflect before the compile when I hit ctrl-F7.

For the example above, the actual "aup" generated is :


template <class T>
void DriverTestPref::Auto_Reflection(T & functor)
{
    REFLECT(m_sharedConvergeTime);
    REFLECT(m_cubicMaxAccelScale);
    REFLECT(m_pdTimeScale);
    REFLECT(m_pdDamping);
    REFLECT(m_pdMinVel);
    REFLECT(m_interceptConfidence);
}

void DriverTestPref::Auto_SetDefaults()
{
    m_sharedConvergeTime = 2.2f;
    m_cubicMaxAccelScale = 1.75f;
    m_pdTimeScale = 2.f;
    m_pdDamping = 1.f;
    m_pdMinVel = 0.005f;
    m_interceptConfidence = 0.5f;
}

While I was at it I also put my Prefs & TweakVars into a "DirChangeWatcher" so that I get automatic hot reloads and made that all happen by default in Galaxy4. Pleasing.

I plan to not check in the aups to source control. Since they are generated each time you build, I'll treat them like obj's. Again the only problem with this is when someone syncs and doesn't have the aups yet - I can't do my incremental build method until they exist. What I would really like is for the MSDev "Full Rebuild" or "Clean" to run my "AutoReflect -f" for me that would generate the aups.

There's one stupid thing that's still not done in this, which is handling .h vs .cpp ; since you can have autoreflected classes in xxx.h and xxx.cpp , both would generate xxx.aup and I'd have to merge them or something. I could make it generate two seperates aups, "xxx.h.aup" and "xxx.cpp.aup" , not sure if that's the right thing to do. (ADDENDUM : yeah, I just did that, I think it's the way to go, it also makes it work with .c or whatever, because for any .aup file I can find the source file by just cutting off the .aup ; it removes all assumptions about the source file extension).

Of course I talk about AutoReflect mainly in terms of "prefs", but it's useful for other things. It basically gives you reflection in C++. One thing I'd like to use it for is to bring back the "IOZ" automatic IO system we did at Oddworld (basically a templated visitor IO that let's you stream things in and out trivially).

Unofficial early releases :

AutoReflect.zip (zip 62k)

cblib.zip (zip 500k)

galaxy4.zip (zip 1.5M)

Also in galaxy4 :

Now on Dx9. Now shares math/core code with cblib so it's not duped. New OBB & Hull code as written about earlier (in gApp_HullTest). New SmoothDriver and test app (gApp_DriverTest) (cubic & pd controller stuff written about long ago). Some other random shit, like new gFont,

4/29/2009

04-29-09 - QCD

Someone made me think briefly about QCD (Quantum Chromo Dynamics).

I never really got my head around the standard model in grad school. I think I understood QED pretty well, and Weak isn't really that bad either, but then you get into QCD and the maths gets really tough and there's this sea of particles and I had no idea what's going on. Part of the problem is that a lot of the texts go through a historical perspective and teach you the stages of understanding and the experiments that led to modern QCD. I think that's a big mistake and I discourage anyone from reading that. I was always really confused by all the talk of the various mesons and baryons. Often the classes would start with talking about K+ transitions or pion decay or scattering coefficients for "Omegas" and I'd be like "WTF are these particles and who cares what they do?".

I think it's way better just to say "we have quarks and gluons". And yes, the quarks can combine together into these various things, but we don't even really need to talk about them because nobody fucking cares about what exactly the meson made from (strange-antistrange) is called.

I much prefer a purely modern approach to QFT based on symmetry. In particular I really like Weinberg's approach in his textbook which is basically - we expect to observe every phenomenon in the universe which is *possible* to exist. If something is possible but doesn't ever happen, that is quite strange and we should wonder why. In particular with QFT - every possible Lagrangian which leads to a consistent theory should correspond to something in nature. When you start to write these down it turns out that very few are actually possible (given a few constraints, such as the postulate that relativity is required, etc.).

Anyway, I was never really happy with my intuition for QFT. Part of the problem is the math is just so hard, you can't do a lot of problems and get really comfortable with it. (David Politzer at Caltech once gave me a standard model homework problem to actually compute some real scattering coefficients that had been experimentally tested. It took me about 50 pages and I got it horribly wrong).

The whole gauge-field symmetry-group idea seems like it should be very elegant and lead to some intuition, but I just don't see it. You can say hand wavey things, like : electromagnetism is the presence of an extra U(1) symmetry; you can think of this as an extra circular dimension that's rolled up tiny so it has no spatial size, or if you like you can do the Feynman way and say that everything flying around is a clock that is pointing in some direction (that's the U(1) angle). In this picture, the coupling of a "charge" to the field is the fact that the charge distorts the U(1) dimension. If you're familiar with the idea of general relativity where masses distort spacetime and thus create the gravity force, it's the same sort of thing, but instead of distorting spacetime, charge distorts the U(1) fiber. As charges move around in this higher-D space, if they are pushed by variation of the U(1) fiber clock angle, that pushes them in real space, which is how they get force. Charges are a pole in the curvature of the fiber angle; in a spacetime sense it's a pinched spot that can't be worked out by any stretching of the space fabric. Okay this is sort of giving us a picture, but it's super hand wavey and sort of wrong, and it's hard to reconcile with the real maths.

Anyway, the thing I wanted to write about QCD is the real problem of non-perturbative analysis.

When you're taught QED, the thing people latch onto are the simple Feynman diagrams where two electrons fly along and exchange a photon. This is appealingly classical and easy to understand. The problem is, it's sort of a lie. For one thing, the idea that the photon is "thrown" between the electrons and thus exchanges momentum and forces them apart is a very appealing picture, but kind of wrong, since the photon can actually have negative momentum (eg. for an electron and positron, the photon exchanged between them pulls them together, so the sort of spacemen playing catch kind of picture just doesn't work).

First of all, let's back up a bit. QFT is formulated using the sum of all complex exponential actions mechanism. Classically this would reduce to "least action" paths, which is equivalent to Lagragian classical mechanics. There's a great book which teaches ordinary Quantum Mechanics using this formulation : Quantum Mechanics and Path Integrals by Feynman & Hibbs (this is a serious textbook for physics undergrads who already know standard QM ; it's a great bridge from standard QM to QFT, because it introduces the sum-on-action formalism in the more familiar old QM). Anyway, the math winds up as a sum of all possible ways for a given interaction to happen. The Feynman diagram is a nice way to write down these various ways and then you still integrate over all possible ways each diagram can happen.

Now let's go back to the simple QED diagram that I mentioned. This is often shown as your first diagram, and you can do the integral easily, and you get a nice answer that's simple and cute. But what happened? We're supposed to sum on *all* ways that the interaction can happen, and we only did one. In fact, there are tons of other possibilities that produce the same outcome, and we really need to either sum them all, or show that they are small.

One thing we need to add is all the ways that you can add vacuum -> vacuum graphs. You can make side graphs that start from nothing, particles pop out of the vacuum, interact, then go back to the vacuum. These are conveniently not mentioned because if you add them all up they have an infinite contribution, which would freak out early students. Fortunately we have the renormalization mechanism that sweeps this under the rug just fine, but it's quite complex.

The other issue is that you can add more and more complex graphs; instead of just one photon exchange, what about two? The more complex graphs have higher powers of the coupling constant (e in this case). If the coupling constant is small, this is like a Taylor expansion, each term is higher powers of e, and e is small, so we can just go up to 3rd order accuracy or whatever we want. The problem with this is that even when e is small, as the graphs get more complex there are *more* of them. As you allow more couplings, there are more and more ways to make a graph of N couplings. In order for this kind of Taylor expansion to be right, the number of graphs must go up more slowly than 1/e. Again it's quite complex to prove that.

Starting with a simple problem that we can solve exactly, and then adding terms that make us progressively more accurate is the standard modus operandi in physics. Usually the full system is too hard to solve analytically, and too hard to get intuition for, so we rely on what's called a perturbation expansion. Take your complex system that you can't solve, and expand it into Simple + C * Complex1 + C^2 * Complex2 + ... - higher and higher powers of C, which should be small.

And with QCD we get a real problem. Again you can start with a simple graph of quarks flying along passing gluons. First of all, unlike photons, there are gluon-gluon couplings which means we need to add a bunch more graphs where gluons interact with other gluons. Now when we start adding these higher order terms, we have a problem. In QCD, the coupling constant is not small enough, and the number of graphs that are possible for each order of the coupling constant is too high - the more complex terms are not less important. In fact in some cases, they're *more* important than the simpler terms.

This makes QCD unlike any other field theory. Our sort of classical intuition of particles flying around exchanging bosons completely breaks down. Instead the quarks live in a foaming soup of gluons. I don't really even want to describe it in hand wavey terms like that because any kind of picture you might have like that is going to be wrong and misleading. Even the most basic of QCD problems is too hard to do analytically; in practice people do "lattice QCD" numerical computations (in some simple cases you can do the summations analytically and then take the limit of the lattice size going to zero).

The result is that even when I was doing QFT I never really understood QCD.

4/28/2009

04-28-09 - Quadratic

I'm doing a little refinement of my old cubic interpolator ("Smooth Driver") thing. (see also : here and here and here ).

One thing I'm trying to do is fix up all the nasty epsilon robustness issues. A small part of that is solving a quadratic. "Easy!" I hear you say. Everyone knows how to solve a quadratic, right? Not so.

I found this page which has a nice summary of the issues, written by a sour old curmudgeon who just whines about how dumb we all are but doesn't actually provide us with a solution.

You can also find the Wikipedia page or the Numerical Recipes (5.6) snippet about the more robust numerical way to find the roots that avoids subtracting two nearly identical numbers. Okay, that's all well and good but there's a lot more code to write to deal with all the degenerate cases.

This is what I have so far : (I'm providing the case where the coefficients are real but the solutions may be complex; you can obviously modify to complex coefficients or only real solutions)


// A t^2 + B t + C = 0;
// returns number of solutions
int SolveQuadratic(const double A,const double B,const double C,
                    ComplexDouble * pT0,ComplexDouble * pT1)
{
    // first invalidate :
    *pT0 = FLT_MAX;
    *pT1 = FLT_MAX;
        
    if ( A == 0.0 )
    {
        if ( B == 0.0 )
        {
            if ( C == 0.0 )
            {
                // degenerate - any value of t is a solution
                *pT0 = 0.0;
                *pT1 = 0.0;
                return -1;
            }
            else
            {       
                // no solution
                return 0;
            }
        }
        
        double t = - C / B;
        *pT0 = t;
        *pT1 = t;
        return 1;
    }
    else if ( B == 0.0 )
    {
        if ( C == 0.0 )
        {
            // A t^2 = 0;
            *pT0 = 0.0;
            *pT1 = 0.0;
            return 1;
        }
        
        // B is 0 but A isn't
        double discriminant = -C / A;
        ComplexDouble t = ComplexSqrt(discriminant);
        *pT0 = t;
        *pT1 = - t;
        return 2;
    }
    else if ( C == 0.0 )
    {
        // A and B are not zero
        // t = 0 is one solution
        *pT0 = 0.0;
        // A t + B = 0;
        *pT1 = -B / A;
        return 2;
    }

    // Numerical Recipes 5.6 : 

    double discriminant = ( B*B - 4.0 * A * C );
    
    if ( discriminant == 0.0 )
    {
        double t = - 0.5 * B / A;
        *pT0 = t;
        *pT1 = t;
        return 1;
    }
    
    ComplexDouble sqrtpart = ComplexSqrt( discriminant );
    
    sqrtpart *= - 0.5 * fsign(B);
    
    ComplexDouble Q = sqrtpart + (- 0.5 * B);
    
    // Q cannot be zero
    
    *pT0 = Q / A;
    *pT1 = C / Q;
        
    return 2;
}

One thing that is missing is refinement of roots by Newton-Raphson. The roots computed this way can still have large error, but gradient descent can improve that.

4/24/2009

04-24-09 - Convex Hulls and OBB's

I wrote last month a bit about OBB fitting. I mentioned at the time that it would be nice to have an implementation of the exact optimal OBB code, and also the bounded-best OBB in reasonable time. I found the Barequet & Har-Peled work on this topic but didn't read it at the time.

Well, I finally got through it. Their paper is pretty ugly. Let me briefly explain their method :

They, like my old OBB stuff, take heavy advantage of the fast rotating-calipers method to find the optimal rectangle of a convex hull in 2d (it's O(n)). Also finding the convex hull is O(nlogn). What that means is, given one axis of an OBB, you can find the optimal other two axes in O(nlogn). So the problem just comes down to finding one of the optimal axes.

Now, as I mentioned before, the number of actual axes you must consider to be truly optimal is O(n^2) , making your total run time O(n^3). These axes are the face normals of the convex hull, plus axes where the OBB is supported by two edges and a vert (I don't know an easy way to even enumerate these).

It's now pretty well known around the industry that you can get a very good OBB by just trying a bunch of scattered initial axes instead of doing O(n^2). If you try some fixed number of axes, like say 256, it doesn't count against your big-O at all, so your whole OBB fit is still O(nlogn).

Well this is exactly what the Barequet & Har-Peled method is. They try a fixed number of directions for the seed axis, then do the rectangle fit for the other two axes. The main contribution of the paper is the proof that if you try enough fixed directions, you can get the error of the bbox within whatever tolerance you want. That's sort of intuitively obvious - if you try more and more fixed directions you must get closer and closer to the optimal box. Their construction also provides a specific method for enumerating enough directions.

Their enumeration goes like this :

Start with some seed box S. They use the box that is made from taking one axis to be the "diameter" of the point set (the vector between the two most seperated points). Using that box is important to their proof, but I don't think which seed box you use is actually terribly important in practice.

The seed box S has normalized edge vectors S.x , S.y, S.z (the three axes that define the box).

Enumerate all sets of 3 (non-negative) integers whose sum is <= K , that is {i,j,k} such that (i+j+k) <= K

Construct the normal N = i * S.x + j * S.y + k * S.z ; normalize it, and use this as the direction to fit a new OBB. (note that there are a lot of points {ijk} that generate the same normal - any points that are integer multiples of another; those can be skipped).

Barequet & Har-Peled prove that the OBB made this way is within (1/K) to some power or other of optimal, so as you increase K you get ever closer.

Now, this is almost identical to my old "OptimalOBBFixedDirections" which tried various static directions and then optimized the box from there. My OptimalOBBFixedDirections always tests 42 directions in the {+,+,+} octant which I made by subdividing an octahedron. I have found that the Barequet & Har-Peled method does in fact find better boxes with fewer tests, but the difference is very very small (thousandths of a percent). I'll show numbers in a second.

First I want to mention two other things.

1. There's a "common wisdom" around that net that while it is bad to make an OBB from the covariance matrix of the *points* , it is good to make an OBB from the covariance matrix of the *volume*. That is, they claim if you have a closed mesh, you can use the Mirtich (or Eberly or Blow/Binstock) method to compute the covariance matrix of the solid body, and use that for your OBB axes.

I have seen no evidence that this is true. Yes, the covariance matrix of the points is highly dependent on the tesselation, while the covariance matrix of the solid is more a property of the actually shape of the object, so that is intuitively pleasing. In practice it appears to be completely random which one is actually better. And you just shouldn't use the covariance matrix method anyway.

2. Barequet & Har-Peled mention the iterative refinement of OBB's using the caliper-fit. This is something I've known a while, but I've never seen it published before; I think it's one of those gems of wisdom that lots of people know but don't consider worth a paper. They mention it almost in passing, but it's actually perhaps the most valuable thing in their whole paper.

Recall if you have one axis of the OBB fixed, you can easily find the optimal directions of the other two axes using rotating calipers to fit a rectangle. The thing is, once you do that, you can then hold one of those new axes fixed, and fit the other two. So like, fix X, then caliper to get YZ, then fix Y, and caliper to get XZ. Each step of the iteration either improves your OBB or does nothing. That means you descend to a local minimum in a finite number of steps. (in practice I find you usually get there in only 3 steps, in fact that might be provable (?)).

Assuming your original seed box is pretty close to optimal, this iteration is kind of like taking your OBB and trying to spin it along one axis and pinch it to see if you get a tighter fit; it's sort of like wiggling your key as you put it into a lock. If your seed OBB is close to being right, but isn't supported by one of the necessary support conditions, this will wiggle it tighter until it is supported by a face or an edge pair.

The methods shown in the test below are :


True convex hull (within epsilon) :
    note that area optimization is usually what you want
    but volume shows a bigger difference between methods

Hull simplificiation :

    I simplify the hull by just doing PM on the triangles
    Then convert the triangles to planes
    Push the planes out until all points are behind them
    Then clip the planes against each other to generate new faces
    This is a simpler hull that strictly contains the original hull

k-dop :
    Fits 258 planes in fixed directions on the sphere
    Just pushes each plane to the edge of the point set
    Clips them all against each other
    strictly this is O(n) but in practice it's slower than the true convex hull
        and much worse quality
    seems pointless to me

The rating we show on all the OBB's is surface area

AxialOBB  :
    axis-aligned box

OBBByCovariance :
    vertex covariance matrix sets axes

OBBByCovariance+it :
    OBBByCovariance followed by iterative greedy optimization

OBBByCovarianceOptimized :
    like OBBByCovariance+it, but tries all 3 initial fixed axes

OptimalOBBFixedDirections :
    tries 42 fixed directions

OptimalOBBFixedDirections+it :
    tries 42 fixed directions, takes the best, then optimizes

OptimalOBBFixedDirections opt :
    tries 42 fixed directions, optimizes each one, then takes the best

OBBGoodHeuristic :
    takes the best of OBBByCovarianceOptimized and "OptimalOBBFixedDirections opt"

OBBGivenCOV :
    this is OBBByCovarianceOptimized but using the solid body covariance instead of points  

OptimalOBB :
    tries all face normals of the convex hull (slow)
    I'd like to also try all the edge-support directions here, but haven't figured it out

OptimalOBBBarequetHarPeled 5     
 kLimit : 5 , numBuilds : 19
    BarequetHarPeled method with (i+j+k) <= 5
    causes it to try 19 boxes
    optimizes each one, then picks the best
    very similar to "OptimalOBBFixedDirections opt"

And the results :


-----------------------------------------

dolphin.x :

    Made Hull with 206 faces
    hull1 volume : 1488557 , area : 95330

Making hull from k-dop planes...
    Made Hull with 142 faces
    hull2 volume : 2081732 , area : 104951

Making OBB...
    AxialOBB                         : 193363.109
    OBBByCovariance                  : 190429.594
    OBBByCovariance+it               : 179504.734
    OBBByCovarianceOptimized         : 179504.719
    OptimalOBBFixedDirections        : 181693.297
    OptimalOBBFixedDirections+it     : 181693.297
    OptimalOBBFixedDirections opt    : 176911.750
    OBBGoodHeuristic                 : 179504.719
    OBBGivenCOV                      : 178061.406
    OptimalOBB                       : 176253.359

     kLimit : 3 , numBuilds : 3
    OptimalOBBBarequetHarPeled 3     : 179504.703
     kLimit : 5 , numBuilds : 19
    OptimalOBBBarequetHarPeled 5     : 178266.047
     kLimit : 10 , numBuilds : 160
    OptimalOBBBarequetHarPeled 10    : 176508.109
     kLimit : 20 , numBuilds : 1222
    OptimalOBBBarequetHarPeled 20    : 176218.344
     kLimit : 50 , numBuilds : 18037
    OptimalOBBBarequetHarPeled 50    : 176116.156

-----------------------------------------

teapot.x :

hull1 faces : 612
    hull1 volume : 3284935 , area : 117470

simplified hull2 faces : 366
    hull2 volume : 3384222 , area : 120357

Making hull from k-dop planes...
    Made Hull with 234 faces
    hull2 volume : 3761104 , area : 129271

Making OBB...
    AxialOBB                         : 253079.797
    OBBByCovariance                  : 264091.344
    OBBByCovariance+it               : 222514.219
    OBBByCovarianceOptimized         : 220723.844
    OptimalOBBFixedDirections        : 219071.703
    OptimalOBBFixedDirections+it     : 218968.844
    OBBGoodHeuristic                 : 218968.844
    OptimalOBB                       : 218968.844
    OBBGivenCOV                      : 220762.766

     kLimit : 3 , numBuilds : 3
    OptimalOBBBarequetHarPeled 3     : 220464.766
     kLimit : 5 , numBuilds : 19
    OptimalOBBBarequetHarPeled 5     : 219540.203
     kLimit : 10 , numBuilds : 160
    OptimalOBBBarequetHarPeled 10    : 218968.000
     kLimit : 20 , numBuilds : 1222
    OptimalOBBBarequetHarPeled 20    : 218965.406
     kLimit : 50 , numBuilds : 18037
    OptimalOBBBarequetHarPeled 50    : 218963.109

-----------------------------------------

Some highlights :

OBBByCovariance is quite bad. OptimalOBBFixedDirections is the only other one that doesn't do the iterative optimization, and it can be bad too, though not nearly as bad.

Any of the methods that do the iterative optimization is perfectly fine. The differences are very small.

"OptimalOBBBarequetHarPeled 7" does about the same number of tests as "OptimalOBBFixedDirections opt" , and it's very microscopically better because of the way the directions are distributed.

OBBGivenCOV (the solid mass covariance) is worse than OBBByCovarianceOptimized (point covariance) on teapot.

Also - the Convex Hull simplification thing I did was just pulled out of my ass. I did a quick Google to see if I could find any reference, and I couldn't find any. I'm surprised that's not a solved problem, it seems like something right up the Geometers' alley.

Problem : Find the convex bounding volume made of N faces (or N planes) that strictly encloses the original mesh, and has minimum surface area (or volume).

In general, the optimal N-hull can not be reached by greedy simplification from the full-detail convex hull. In practice I found my hacky PM solution to work fine for moderate simplification levels. To make it more correct, the "push out to enclose" step should be done in each PM collapse to keep the hull valid as you go (instead of at the end). Also the PM collapse metric should be the metric you are trying to optimize - surface area or volume (I just used my old geometric error collapser).

The main thing I was interested in with convex hull simplification was eating away highly tesselated bits. The mesh I mainly tested on was "fandisk" because it's got these big flat surfaces, and then some rounded bits. If you imagine a mesh like a big cube minowski summed with a small sphere, you get a cube with rounded edges and corners. If the sphere is highly tessleated, you can get a hull with tons and tons of faces, but they are very unimportant faces. You want to sort of polygonate those corners, replaces the rounded sphere with a less tesselated one that's pushed out.

4/23/2009

04-23-09 - Telling Time

.. telling time is a huge disaster on windows.

To start see Jon Watte's old summary that's still good .

Basically you have timeGetTime() , QPC, or TSC.

TSC is fast (~ 100 clocks) and high precision. The problems I know of with TSC :

TSC either tracks CPU clocks, or time passing. On older CPUs it actually increments with each cpu cycle, but on newer CPUs it just tracks time (!). The newer "constant rate" TSC on Intel chips runs at some frequency which so far as I can tell you can't query.

If TSC tracks CPU cycles, it will slow down when the CPU speedsteps. If the CPU goes into a full sleep state, the TSC may stop running entirely. These issues are bad on single core, but they're even worse on multi-proc systems where the cores can independently sleep or speedstep. See for example these linux notes or tsc.txt .

Unfortunately, if TSC is constant rate and tracking real time, then it no longer tracks cpu cycles, which is actually what you want for measuring performance (you should always report speeds of micro things in # of clocks, not in time).

Furthermore on some multicore systems, the TSC gets out of sync between cores (even without speedsteps or power downs). If you're trying to use it as a global time, that will hose you. On some systems, it is kept in sync by the hardware, and on some you can get a software patch that makes rdtsc do a kernel interrupt kind of thing which forces the TSC's of the cores to sync.

See this email I wrote about this issue :

Apparently AMD is trying to keep it hush hush that they fucked up and had to release a hotfix. I can't find any admission of it on their web site any more ;

this is the direct download of their old utility that forces the cores to TSC sync : TscSync

they now secretly put this in the "Dual Core Optimizer" : Dual Core Optimizer Oh, really AMD? it's not a bug fix, it's an "optimizer". Okay.

There's also a seperate issue with AMD C&Q (Cool & Quiet) if you have multiple cores/processors that decide to clock up & down. I believe the main fix for that now is just that they are forbidden from selecting different clocks. There's an MS hot fix related to that : MS hotfix 896256

I also believe that the newest version of the "AMD Processor Driver" has the same fixes related to C&Q on multi-core systems : AMD Driver I'm not sure if you need both the AMD "optimizer" and processor driver, or if one is a subset of the other.

Okay, okay, so you decide TSC is too much trouble, you're just going to use QPC, which is what MS tells you to do anyway. You're fine, right?

Nope. First of all, on many systems QPC actually is TSC. Apparently Windows evaluates your system at boot and decides how to implement QPC, and sometimes it picks TSC. If it does that, then QPC is fucked in all the ways that TSC is fucked.

So to fix that you can apply this : MS hotfix 895980 . Basically this just puts /USEPMTIMER in boot.ini which forces QPC to use the PCI clock instead of TSC.

But that's not all. Some old systems had a bug in the PCI clock that would cause it to jump by a big amount once in a while.

Because of that, it's best to advance the clock by taking the delta from previous and clamping that delta to be in valid range. Something like this :


U64 GetAbsoluteQPC()
{
    static U64 s_lastQPC = GetQPC();
    static U64 s_lastAbsolute = 0;

    U64 curQPC = GetQPC();

    U64 delta = curQPC - s_lastQPC;

    s_lastQPC = curQPC;

    if ( delta < HUGE_NUMBER )
        s_lastAbsolute += delta;

    return s_lastAbsolute;
}

(note that "delta" is unsigned, so when QPC jumps backwards, it will show up as as very large positive delta, which is why we compare vs HUGE_NUMBER ; if you're using QPC just to get frame times in a game, then a reasonable thing is to just get the raw delta from the last frame, and if it's way out of reasonable bounds, just force it to be 1/60 or something).

Urg.

BTW while I'm at I think I'll evangelize a "best practice" I have recently adopted. Both QPC and TSC have problems with wrapping. They're in unsigned integers and as your game runs you can hit the end and wrap around. Now, 64 bits is a lot. Even if your TSC frequency is 1000 GigaHz (1 THz), you won't overflow 64 bits for 194 days. The problem is they don't start at 0. (

Unsigned int wrapping works perfectly when you do subtracts and keep them in unsigned ints. That is :


in 8 bits :

U8 start = 250;
U8 end = 3;

U8 delta = end - start;
delta = 8;

That's cool, but lots of other things don't work with wrapping :


U64 tsc1 = rdtsc();

... some stuff ...

U64 tsc2 = rdtsc();

U64 avg = ( tsc1 + tsc2 ) /2;

This is broken because tsc may have wrapped.

The one that usually gets me is simple compares :


if ( time1 < time2 )
{
    // ... event1 was earlier
}

are broken when time can wrap. In fact with unsigned times that wrap there is no way to tell which one came first (though you could if you put a limit on the maximum time delta that you consider valid - eg. any place that you compare times, you assume they are within 100 days of each other).

But this is easily fixed. Instead of letting people call rdtsc raw, you bias it :


uint64  Timer::GetAbsoluteTSC()
{
    static uint64 s_first = rdtsc();
    uint64 cur = rdtsc();
    return (cur - s_first);
}

this gives you a TSC that starts at 0 and won't wrap for a few years. This lets you just do normal compares everywhere to know what came before what. (I used the TSC as an example here, but you mainly want QPC to be the time you're passing around).

04-23-09 - iPod Hardware - the harbinger of suck

Is god fucking awful. Stop touting it as the greatest example of product design in this century. Yes, yes, the screen is nice, and the basic shape and weight of it is appealing. If it was just a paperweight I would be pretty pleased with it. But when you actually try to *use* it, it's rubbish. (it's the Angelina Jolie of product design if you will - it's the canonical example that everyone uses of something that's great, but it's actually awful).

Try to actually browse through the menus with that fucking wheel. Scan through a big list of artists, back up, change to album view, scan down, it's awful.

The wheel is a disaster for volume control. The right thing for volume is a knob, or a dial. Something physical that you rotate. And it shouldn't be a fucking digital dial that just spins like all the shit that they're giving us on computers now. It should be an *absolute* dial with an actual zero point, so that I can turn the volume down to zero when the thing is off. Hitting play on an iPod is fucking ear drum roulette, you never know when it's going to explode your head.

You're playing a song, you pause it, you go browse to some other song. You want to just resume the original song. How do you even do that !? I suppose it must be possible, but I don't know. It should just be a button.

Design for music playing devices has been perfect for a long time. You have play, pause, skip, volume. You have those on different buttons that are ALWAYS those buttons. They're physical buttons you can touch, so you can use the device while it's in your pocket or your eyes are closed. They should be rubber (or rubberized) so they're tactile, and each button should have a different shape so you know you're on the right one.

It's just fucking rubbish. It's like so many cars these days, changing user interfaces for no good reason and making them worse. Don't give me fucking digital buttons to increment and decrement the air conditioning, that's awful! Give me a damn dial or a slider that has an absolute scale. I don't want to be hitting plus-plus-plus.

The damn "Start" button that's on so many cars now really pisses me off. You used to stick in a key, then turn it. What's wrong with that? It works perfectly fucking fine. Now I have to stick in a key, make sure I'm pressing the brake or the clutch or whatever, then press a start button. Why !? It's more steps, it's just worse.

The worst of course is the menu shit like iDrive that's basically an iPod style interface with a fucking wheel and menus and context-dependent actions. Context-dependent actions are fucking horrible user interface design, quit it. With consumer electronic devices there should just be a few buttons, and those buttons always execute the same action and do it immediately. I know some jack-hole is going to run into me because he was trying to mate his bluetooth and was browsing around the menus.

4/17/2009

04-17-09 - Oodle File Page Cache

I'm trying to figure something out, maybe someone out there has a good idea.

This is about the Oodle File Page Cache that I mentioned previously. I'm not doing the fully general page cache thing yet (I might not ever because it's one of those things you have to buy into my philosophy which is what I'm trying to avoid).

Anyway, the issue is about how to prioritize pages for reclamation. Assume you're in a strictly limitted memory use scenario, you have a fixed size pool of say 50 pages or so.

Now obviously, pages that are actually current locked get memory. And the next highest priority is probably sequentially prefetched pages (the pages that immediately follow the currently locked pages) (assuming the file is flagged for sequential prefetching, which it would be by default).

But after that you have to decide how to use the remaining pages. The main spot where you run into a question is : I need to grab a new page to put data into, but all the pages are taken - which one do I drop and recycle? (Or, equivalently : I'm thinking about prefetching page X, the least important page currently in the pool is page Y - should I reclaim page Y to prefetch page X, or should I just wait and not do the prefetch right now).

The main ambiguity comes from "past" pages vs. "prefetched" pages (and there's also the issue of old prefetched pages that were never used).

A "past" page is one that the client has unlocked. It was paged in, the client locked it, did whatever, then unlocked it. There's one simple case, if the client tells me this file is strictly forward-scan streaming, then the page can be dropped immediately. If not, then the past page is kept around for some amount of time. (there's another sequence point when the file containing the page is closed - again optionally you can say "just drop everything when I close the file" or the pages can be kept around for a while after the close to make sure you were serious about closing it).

A "prefetched" page obviously can be prefetched by sequential scan in an open file. It could also be from a file in the prefetch-ahead file list that was generated by watching previous runs.

Prefetches pages create two issues : one is how far ahead do I prefetch. Basically you prefetch ahead until you run out of free pages, but when you have no free pages, the question is do I reclaim past pages to do new prefetches?

The other issue with prefetches is what do you do with prefetched pages that were never actually used. Like I prefetched some pages but then the client never locked them, so they are still sitting around - at what point do I reclaim those to do new prefetches?

To make it more clear here's a sort of example :


Client gives me a prefetch file list - {file A, file B, file C, file D}

Client opens file A and touches a bunch of pages.  So I pull in {A:0,A:1,A:2} (those are the page numbers in the file).

I also start prefetching file B and file C , then I run out of free pages, so I get {B:0,B:1,C:0}.

Client unlocks the pages in file A but doesn't close file A or tell me I can drop the pages.

Client now starts touching pages in file C.  I give him C:0 that I already prefetched.

Now I want to prefetch C:1 for sequential scan, I need to reclaim a page.

Do I reclaim a page from B (prefetched but not yet used) or a page from A (past pages) ?  Or not prefetch at all?

When client actually asks for C:1 to lock then I must reclaim something.

Should I now start prefetching {D:0} ?  I could drop a page from {A} to get it.

Anyway, this issue just seems like a big mess so I'm hoping someone has a clever idea about how to make it not so awful.

There's also very different paradigms for low-page-count vs high-page-count caches. On something like the PS2 or XBox 1 where you are super memory limitted, you might in fact run with only 4 pages or something tiny like that. In that case, I really want to make sure that I am using each page for the best purpose at all times. In that scenario, each time I need to reclaim a page, I should reevaluate all the priorities so they are fresh and make the best decision I can.

On something like Windows you might run with 1024 pages. (64k page * 1024 pages = 64 MB page cache). In that case I really don't want to be walking every single page to try to pick the best one all the time. I can't just use a heap or something, because page priorities are not static - they can change just based on time passing (if I put any time-based prioritization in the cache). Currently I'm using a sort of cascaded priority queue where I have different pools of priority groups, and I only reevaluate the current lowest priority group. But that's rather complicated.

4/15/2009

04-15-09 - Oodle Page Cache

So I'm redoing the low level file IO part of Oodle. Actually I may be retargetting a lot of Oodle. One thing I took from GDC and something I've been thinking about a long time is how to make Oodle simpler. I don't want to be making a big structure that you have to buy into and build your game on. Rather I want to make something "leafy" that's easy for people to plug in at the last minute to solve specific problems.

Pursuant to that, the new idea is to make Oodle a handful of related pieces. You can use one or more of the pieces, and each is easy to plug in at the last minute.

1. Async File IO ; the new idea with this is that it's cross platform, all nice and async, does the LZ decompression on a thread, can do DVD packaging and DVD emulation, can handle the PS3/Xenon console data transfers - but it just looks like regular files to you. This is less ambitious than the old system ; it no longer directly provides things like paging data in & out, or hot-loading artist changes; you could of course still do those things but it leaves it more up to the client to do that.

The idea is that if you just write your game on the PC using lazy loose file loads, boom you pop in Oodle and hardly touch the code at all, and you automatically get your files packed up nice and tight into like an XBLA downloadable pack, or a DVD for PS3, or whatever, and it's all fast and good. Oh, and it also integrates nicely with Bink and Miles and Granny so that you can things like play a Bink video while loading a level, and the data streamers share the bandwidth and schedule seeks correctly.

2. Texture goodies. We'll provide the most awesome threaded JPEG decoders, and also probably a better custom lossy and custom lossless texture compressors that are specifically designed for modern games (with features like good alpha-channel support, support for various bit depths and strange formats like X16Y16 , etc.). Maybe some nice DXTC realtime encoding stuff and quality offline encoding stuff. Maybe also a whole custom texture cache thing, so you can say Oodle use 32 MB for textures and do all the paging and decompression and such.

3. Threading / Async utilities. You get the threaded work manager, the thread profiler, we'll probably do "the most awesome" multithreaded allocator. We'll try to give these specific functions that address something specific that people will need to ship a game. eg. a customer is near ship and their allocator is too slow and taking too much memory so they don't fit in the 256 MB of the console. Boom plug in Oodle and you can ship your game.

Anyway, that's just the idea, it remains to be worked out a bit. One thing I'm definitely doing is the low level IO is now going through a page cache.

As I'm writing it I've been realizing that the page cache is a super awesome paradigm for games in general these days. Basically the page cache is just like an OS virtual memory manager. There's a certain limited amount of contiguous physical memory. You divide it into pages, and dynamically assign the pages to the content that's wanted at the time.

Now, the page cache can be used just for file IO, and then it's a lot like memory mapped files. The client can use the stdio look-alike interface, and if they do that, then the page cache just automatically does the "right thing", prefetching ahead pages as they sequentially read through a file, etc.

But since we're doing this all custom and we're in a video game environment where people are willing to get more manual and lower to the bone, we can do a lot more. For example, you can tell me whether a file should sequential prefetch or not. You can manually prefetch at specific spots in the file that you expect to jump to. You can prefetch whole other files that you haven't opened yet. And perhaps most importantly, you can assign priorities to the various pages, so that when you are in a low memory situation (as you always are in games), the pages will be used for the most important thing. For example you can prefetch the whole next file that you expect to need, but you would do that at very low priority so it only uses pages if they aren't needed for anything more urgent.

The next awesome thing I realized about the page cache is that - hey, that can just be the base for the whole game memory allocator. So maybe you give 32 MB to page cache. That can be used for file IO, or video playback - or maybe you want to use it to pop up your in game "pause menu" GUI. Or say you want to stream in a compressed file - you map pages to read in the packed bits, and then you map pages as you decompress; as you decompress you toss the pages with the packed bits and leave the uncompressed pages in the cache.

Say you have some threaded JPEG decompress or something - it grabs a page to decompress to. The other cool thing is because all this is manual - it can grab the page with a certain priority level. If no page is available, it can do various things depending on game knowledge to decide how to respond to that. For example if it's just decompressing a high res version of a texture you already have in low res, it can just abort the decompress. If it's a low priority prefetch, it can just go to sleep and wait for a page to become available. If it's high priority, like you need this texture right now, that can cause other pages that are less important to get dropped.

Pages could also cache procedurally generated textures, data from a network, optional sounds, etc. etc.

I think about it this way - in the future of multicore and async and threading and whatnot, you might have 100 jobs to run per frame. Some of those jobs need large temp work memory. You can't just statically assign memory to various purposes, you want it to be used by the jobs that need it in different ways.

4/06/2009

04-06-09 - Multi-threaded Allocators

I've been reading a bit about Multi-threaded Allocators. A quick survery :

The canonical base allocator is Hoard . Hoard is a "slab allocator"; it takes big slabs from the OS and assigns each slab to a fixed-size block. In fact it's extremely similar to my "Fixed Restoring" allocator that's been in Galaxy3 forever and we shipped in Stranger. The basic ideas seem okay, though it appears to use locks when it could easily be lock-free for most ops.

One thing I really don't understand about Hoard is the heuristic for returning slabs to the shared pool. The obvious way to do a multi-threaded allocator is just to have a slab pool for each thread. The problem with that is that if each thread does 1 alloc of size 8, every thread gets a whole slab for itself and the overhead is proportional to the number of threads, which is a bad thing going into the massively multicore future. Hoard's heuristic is that it checks the amount of free space in a slab when you do a free. If the slab is "mostly" free by some measure, it gets returned to the main pool.

My problem with this is it seems to have some very bad cases that IMO are not entirely uncommon. For example a usage like this :


for( many )
{
    allocate 2 blocks
    free 1 block
}

will totally break Hoard. From what I can tell what hoard will do is allocate you a slab for your thread the first time you go in, then when you free() it will be very empty, so it will return the slab to the shared pool. Then after that when you allocate you will either get from the shared pool, or pull the slab back. Very messy.

There are two questions : 1. how greedy is new slab creation? That is, when a thread allocates an object of a given size for the first time, does it immediately get a brand new slab, or do you first give it objects from a shared slab and wait to see if it does more allocs before you give it its own slab. 2. how greedy is slab recycling? eg. when a thread frees some objects, when do you give the slab back to the shared pool. Do you wait for the slab to be totally empty, or do you do it right away.

The MS "Low Fragmentation Heap" is sort of oddly misnamed. The interesting bit about it is not really the "low fragmentation", it's that it has better multi-threaded performance. So far as I can tell, the LFH is 99.999% identical to Hoard. See MS PPT on the Low Fragmentation Heap

We did a little test and it appears that the MS LFH is faster than Hoard. My guess is there are two things going on : 1. MS actually uses lock-free linked lists for the free lists in each slab (Hoard uses locks), and 2. MS makes a slab list *per processor*. When a thread does an alloc it gets the heap for the processor it's on. They note that the current processor is only available to the kernel (KeGetCurrentProcessorNumber ). Hoard also makes P heaps for P processors (actually 2P), but they can't get current processor so they use the thread Id and hash it down to P which leads to occasional contention and collisions.

The other canonical allocator is tcmalloc . I still haven't been able to test tcmalloc because it doesn't build on VS2003. tcmalloc is a bit different than Hoard or LFH. Instead of slab lists, it uses traditional SGI style free lists. There are free lists for each thread, and they get "garbage collected" back to the shared pool.

One issue I would be concerned about with tcmalloc is false sharing. Because the free lists get shared back to a global pool and then redistributed back to threads, there is no real provision to prevent little items from going to different threads, which is bad. Hoard and LFH don't have this problem because they assign whole slabs to threads.

The Hoard papers make some rather ridiculous claims about avoiding false sharing, however. The fact is, if you are passing objects between threads, then no general purpose allocator can avoid false sharing. The huge question for the allocator is - if I free an object on thread 1 that was allocated on thread 2 , should I put it on the freelist for thread 1, or on the freelist for thread 2 ? One or the other will make false sharing bad, but you can't answer it unless you know the usage pattern. (BTW I think tcmalloc puts it on thread 1 - it uses the freelist of the thread that did the freeing - and Hoard puts it on thread 2 - the freelist of the thread that did the allocation; neither one is strictly better).

Both tcmalloc and the LFH have high memory overhead. See here for example . They do have better scalability of overhead to high thread counts, but the fact remains that they may hold a lot of memory that's not actually in use. That can be bad for consoles.

In fact for video games, what you want in an allocator is a lot of tweakability. You want to be able to tweak the amount of overhead it's allowed to have, you want to be able to tweak how fast it recycles pages for different block types or shares them across threads. If you're trying to ship and you can't fit in memory because your allocator has too much overhead, that's a disaster.

BTW false sharing and threaded allocators are clearly a place that a generational copying garbage collector would be nice. (this does not conflict with my contention that you can always be faster by doing very low level manual allocation - the problem is that you may have to give tons of hints to the allocator for it to do the right thing). With GC it can watch the usage and be smart. For example if a thread does a few allocs, they can come from a global shared pool. It does some more allocs of that type, then it gets its own slab and the previous allocs are moved to that slab. If those objects are passed by a FIFO to another thread, then they can be copied to a slab that's local to that thread. Nice.

It seems clear to me that the way to go is a slab allocator. Nice because it's good for cache-coherence and false sharing. The ops inside a slab can be totally thread-local and so no need to worry about multithread issues. Slabs are large enough that slab management is rare and most allocations only take the time do inside-slab work.

One big question for me is always how to get from an object to its owning slab (eg. how is free implemented). Obviously it's trivial if you're okay with adding 4 bytes to every allocation. The other obvious way is to have some kind of page table. Each slab is page-aligned, so you take the object pointer and truncate it down to slab alignment and look that up. If for example you have 32 bit pointers (actually 31), and you pages are 4k, then you need 2 to the 19 page table entries (512k). If a PTE is 4 bytes that's 2 MB of overhead. It's more reasonable if you use 64k pages, but that means more waste inside the page. It's also a problem for 64-bit pointers.

There are some other options that are a bit more hard core. One is to reserve a chunk of virtual address for slabs. Then whenever you see a free() you check if the pointer is in that range, and if so you know you can just round the pointer down to get the slab head. The problem is the reserving of virtual address.

Another option is to try to use pointer alignment. You could do something like make all large allocs be 4k aligned. Then all small allocs are *not* 4k aligned (this happens automatically because they come from inside a 4k page), and to get the base of the page you round down to the next lowest 4k.

old rants