5/27/2009

05-27-09 - Optimal Local Bases

I've had this idea forever but didn't want to write about it because I wanted to try it, and I hate writing about things before I try them. Anyway, I'm probably never going to do it, so here it is :

It's obvious that we are now at a point where we could use the actual optimal KLT for 4x4 transforms. That is, for a given image, take all the 4x4 blocks and turn them into a 16-element vector. Do the PCA to find the 16 basis vectors that span that 16-element space optimally. Tack those together to make a 16x16 matrix. This is now your KLT transform for 4x4 blocks.

(in my terminology, the PCA is the solve to find the spanning axes of maximum variance of a set of vectors; the KLT is the orthonormal transform made by using the PCA basis vectors as the rows of a matrix).

BTW there's another option which is to do it separably - a 4-tap optimal horizontal transform and a 4-tap optimal vertical transform. That would give you two 4x4 KLT matrices instead of one 16x16 , so it's a whole lot less work to do, but it doesn't capture any of the shape information in the 2d regions, so I conjecture you would lose almost all of the benefit. If you think about it, there's not much else you can do in a 4-tap transform other than what the DCT or the Hadamard already does, which is basically {++++,++--,-++-,+-+-}.

Now, to transform your image you just take each 4x4 block, multiply it by the 16x16 KLT matrix, and away you go. You have to transmit the KLT matrix, which is a bit tricky. There would seem to be 256 coefficients, but in fact there are only 15+14+13+... = 16*15/2 = 120. This is because you know the matrix is a rotation matrix: each row is normal - that's one constraint - and each row is perpendicular to all previous rows, so the first row only has 15 free parameters, the second has 14, etc.
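That count is easy to sanity-check; here's a tiny Python sketch of the rotation-matrix free-parameter argument (nothing here beyond the counting in the paragraph above) :

```python
def rotation_free_params(n):
    # Row i of an n x n rotation matrix has n entries, minus 1
    # normalization constraint, minus i orthogonality constraints
    # (one per previous row) : (n-1) + (n-2) + ... + 0 = n*(n-1)/2
    return sum((n - 1) - i for i in range(n))

print(rotation_free_params(16))  # 4x4 blocks -> 120 coefficients
print(rotation_free_params(64))  # 8x8 blocks -> 2016
```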

If you want to go one step crazier, you could do local adaptive fitting like glicbawls. For each 4x4 block that you want to send, take the blocks in the local neighborhood. Do the PCA to find the KLT, weighting each block by its proximity to the current one. Use that optimal local KLT for the current block. The encoder and decoder perform the same optimization, so the basis vectors don't need to be transmitted. This solve will surely be dangerously under-conditioned, so you would need to use a regularizer that gives you the DCT in degenerate cases.

I conjecture that this would rock ass on weird images like "barb" that have very specific weird patterns that are repeated a lot, because a basis vector will be optimized that exactly matches those patterns. But there are still some problems with this method. In particular, 4x4 transforms are too small.

We'd like to go to 8x8, but we really can't. The first problem is that the computational complexity is like (size)^8 , so 8x8 is 256X slower than 4x4 (a basis vector has (size^2) elements, there are (size^2) basis vectors, and a typical PCA is O(N^2)). Even if speed weren't a problem though, it would still suck. If we had to transmit the KLT matrix, it would be 64*63/2 = 2016 coefficients to transmit - way too much overhead except on very large images. If we tried to local fit, the problem is there are too many coefficients to fit, so we would be severely undertrained.

So our only hope is to use the 4x4 and hope we can fix it using something like the 2nd-pass Hadamard a la H264/JPEG-XR. That might work but it's an ugly heuristic addition to our "optimal" bases.

The interesting thing about this to me is that it's sort of the right way to do an image LZ thing, and it unifies transform coding and context/predict coding. The problem with image LZ is that the things you can match from are an overcomplete set - there are lots of different match locations that give you the exact same current pixel. What you want to do is consider all the possible match locations - merge up the ones that are very similar, but give those higher probability - hmmm.. that's just the PCA!

You can think of the optimal local bases as predictions from the context. The 1st basis is the one we predicted would have most of the energy - so first we send our dot product with that basis and hopefully it's mostly correct. Now we have some residual if that basis wasn't perfect; well, the 2nd basis is what we predicted the residual would be, so now we dot with that and send it. You see, it's just like context modeling making a prediction. Furthermore, when you do the PCA to build the optimal local KLT, the eigenvalues of the PCA step tell you how much confidence to have in the quality of your prediction - they tell you what probability model to use on each of the coefficients. In a highly predictable area, the 1st eigenvalue will be close to 100% of the energy, so you should code strongly predicting the higher residuals to be zero; in a highly random area, the eigenvalues of the PCA will be almost all the same, so you should expect to code multiple residuals.

5/26/2009

05-26-09 - Some Study of DCT coefficients

First of all : The number of non-zero values in the lower-diagonal area of the 8x8 block after quantization to a reasonable/typical value.

(36 of the 64 values are considered to be "lower diagonal" in the bottom right area) The actual number :


avg : 3.18 : 
 0 : 37.01
 1 : 16.35
 2 :  9.37
 3 :  6.48
 4 :  5.04
 5 :  3.88
 6 :  3.22
 7 :  3.02
 8 :  2.49
 9 :  2.34
10 :  2.09
...

The interesting thing about this is that it has a very flat tail, much flatter than you might expect. For example, if the probability of a given coefficient being zero or nonzero were an independent random event, the distribution would be binomial; its peak is flatter and then it falls to zero much faster :

avg : 3.18 : 
 0 : 13.867587
 1 : 28.162718
 2 : 27.802496
 3 : 17.775123
 4 : 8.272518
 5 : 2.986681
 6 : 0.870503
 7 : 0.210458
 8 : 0.043037
 9 : 0.007553
10 : 0.001150
...

What this tells us is that the probability of a given coefficient being zero is highly *not* independent. They are strongly correlated - the more values that are on, the more likely it is that the next will be on. In fact we see that if there are 6 values on, it's almost equally likely there are 7, etc., that is : P(n)/P(n-1) goes toward 1.0 as n gets larger.

Also, amusingly, the first two ratios P(1)/P(0) and P(2)/P(1) are both very close to 0.5 in every image I've tried (in 0.4 to 0.6 generally). What this means is it wouldn't be too awful just to code the # of values on with unary, at least for the first bit (you could use something like an Elias Gamma code which uses unary at first then adds more raw bits).
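Just to check those claims against the table above, here's a Python sketch that computes the ratios and what the unary / Elias Gamma idea would cost (the length function is the textbook Elias Gamma, not anything tuned to this data) :

```python
# P(n) = percent of blocks with n nonzero lower-diagonal
# coefficients, from the table above.
P = [37.01, 16.35, 9.37, 6.48, 5.04, 3.88, 3.22, 3.02, 2.49, 2.34, 2.09]

ratios = [P[i] / P[i - 1] for i in range(1, len(P))]
print([round(r, 2) for r in ratios])  # first two near 0.5, tail toward 1.0

def elias_gamma_len(n):
    # Elias Gamma codes n >= 1 : unary length prefix, then raw bits
    b = n.bit_length()
    return 2 * b - 1

# cost in bits of coding (count+1) with Elias Gamma :
print([elias_gamma_len(n + 1) for n in range(len(P))])
```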

Now for pretty pictures. Everyone has seen graphics like this, showing the L2 energy of each coefficient in the DCT : (none of these pictures include the DC because it's weird and different)

This shows the percentage of the time the value is exactly zero :

Now for some more interesting stuff. This shows the percent correlation to the block above the current one (to the north) :

Note in particular the strong correlation of the first row.

The next one is the correlation to the block to the left (west) :

Finally the fun one. This one shows the correlation of each coefficient to the other 63 coefficients in the same block :

The self-correlation is 100% which makes it a white pixel obviously. Black means 0% correlation. This is absolute-value correlation in all cases (no signs). There are a lot of patterns that should be pretty obvious to the eye. Beware of over-modeling on these patterns a bit, because they do change from image to image, but the general trend stays the same.

And another one from a different image :

This one's from Lena. A few things I think are particularly interesting - in the upper left area, which is where most of the important energy is, the correlation is strongest diagonally. That is, you see these "X" shape patterns where the center pixel is correlated mainly to its diagonal neighbors, not the ones directly adjacent to it.

05-26-09 - Storing Floating Points

I rambled a bit before about how to store floating point images. I think I have the awesome solution.

First of all there are two issues :

Goal 1. Putting the floating point data into a format that I can easily run through filters, deltas from previous, etc. Even if you're doing lossless storage you want to be able to delta from previous values and have the deltas make sense and preserve the original values.

Goal 2. Quantizing in such a way that the bits are used where the precision is wanted. Obviously you want to be sending bits only for values that are actually possible in the floating point number.

Also to be clear before I start, the issue here is with data that's "true floating point". That is, it's not just data that happens to be in floating point but could be equally well converted to ints inside some [min,max] range. For example, the coordinates of most geometry in video games really isn't meant to be floating point, the extra precision near zero is not actually wanted. The classic example where you actually want floating point is for HDR images where you want a lot of range, and you actually want less absolute precision for higher values. That is, the difference between 999991 and 999992 is not as important as the difference between 1 and 2.

Now, we are going to do some kind of lossy storage. The loss might be very small, but it's kind of silly to talk about storing floating points without talking about lossy storage, because you don't really intend to keep all 23 bits of mantissa or whatever. To be lossy, we want to just do a simple linear quantization, which means we want to transform into a space where the values have equal significance.

Using some kind of log-scale is the obvious approach. Taking the log transforms the value such that even size steps in log-space are even multipliers in original space. That's good, it's the kind of scaling we want. It means the step from 1 to 2 is the same as the step from 100 to 200. The only problem is that it doesn't really match the original floating point representation - it's more continuous than we need.

What we want is a transform that gives us even size steps within one exponent, and then when the exponent goes up one, the step size doubles. That makes each step of equal importance. So, the quantizer for each exponent should just be Q ~= 2^E.

But that's easy ! The mantissa of the floating point that we already have is already quantized like that. We can get exactly what we want by just pretending our floating point is fixed point !


value = (1 + M)* 2^E

(mantissa with implicit one bit at head, shifted by exponent)

fixed point = {E.M}

That is, just take the exponent's int value and stick it in the significant bits above the mantissa. The mantissa is already quantized for you with the right variable step size. Now you can further quantize to create more loss by right shifting (aka only keeping N bits of the mantissa) or by dividing by any number.

This representation also meets Goal 1 - it's now in a form that we can play with. Note that it's not the same as just taking the bytes of a floating point in memory - we actually have an integer now that we can do math with and it makes sense.


when you cross an exponent :

1.99 = {0.1111}
2.01 = {1.0001}

So you can just subtract those and you get a sensible answer. The steps in the exponent are the correct place value to compensate for the mantissa jumping down. It means we can do things like wavelets and linear predictors in this space.
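This transform is easy to play with in Python without poking at raw IEEE bits - math.frexp hands us the mantissa and exponent. MANTISSA_BITS is just for illustration; a real codec would keep more :

```python
import math

MANTISSA_BITS = 4  # illustration only

def float_to_EM(x, mbits=MANTISSA_BITS):
    # Represent positive x as the integer {E.M} :
    # exponent in the high bits, quantized mantissa below it.
    assert x > 0.0            # note x < 1 would give a negative E here
    m, e = math.frexp(x)      # x = m * 2^e with m in [0.5, 1)
    m = m * 2.0 - 1.0         # implicit-one mantissa in [0, 1)
    e = e - 1                 # so that x = (1 + m) * 2^e
    return (e << mbits) | int(m * (1 << mbits))

def EM_to_float(v, mbits=MANTISSA_BITS):
    e = v >> mbits
    m = (v & ((1 << mbits) - 1)) / float(1 << mbits)
    return (1.0 + m) * (2.0 ** e)

# crossing an exponent : 1.99 and 2.01 land on adjacent codes,
# so deltas in {E.M} space stay sensible.
print(float_to_EM(1.99), float_to_EM(2.01))  # prints : 15 16
```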

Now there is a bit of a funny issue with negative numbers vs. negative exponents. First the general solution and what's wrong with it :

Negatives Solution 1 : allow both negative numbers and negative exponents. This creates a "mirrored across zero" precision spectrum. What you do is add E_max to E to make it always positive (just like the IEEE floating point), so the actual zero exponent is biased up. The spectrum of values now looks like :


  (positives)      real step size
  3.M                     /
  2.M                    / 
  1.M                   /
  0.M                  /
 -1.M                 /
 -2.M                /  
 -3.M               /  
(zero)             +    
 -3.M               \   
 -2.M                \  
 -1.M                 \ 
  0.M                  \
  1.M                   \
  2.M                    \
  3.M                     \
  (negatives)

What we have is a huge band of values with very small exponents on each side of the zero. Now, if this is actually what you want, then fine, but I contend that it pretty much never is. The issue is, if you actually have positives and negatives, eg. you have values that cross zero, you didn't really intend to put half of your range between -1 and 1. In particular, the difference between 1 and 2^10 is the same as the difference between 1 and 2^-10. That just intuitively is obviously wrong. If you had a sequence of float values like :

{ 2.0 , 1.8 , 1.4, 1.0, 0.6, 0.2, -0.1, -0.3, -0.6, -1.0, -1.4 }

That looks nice and smooth and highly compressible right? NO ! Hidden in the middle there is a *MASSIVE* step from 0.2 to -0.1 ; that seems benign but it's actually a step past almost your entire floating point range. (BTW you might be thinking - just add a big value to get your original floating point data away from zero. Well, if I did that it would shift where the precision was and throw away lots of bits; if it's okay to add a big value to get away from zero, then our data wasn't "true" floating point data to begin with and you should've just done a [min,max] scale instead).

So I contend that is almost always wrong.

Negatives Solution 2 : allow negative numbers and NOT negative exponents. I contend that you almost never actually want negative exponents. If you do want precision below 1.0, you almost always just want some fixed amount of it - not more and more as you get smaller. That can be represented better just with the bits of the mantissa, or by scaling up your values by some fixed amount before transforming.

To forbid negative exponents we make everything in [0,1.0] into a kind of "denormal". We just give it a linear scale. That is, we make a slightly modified representation :


Stored val = {E.M} (fixed point)

Float val =
    
    if E >= 1 : (1 + M) *2^(E-1)
                makes numbers >= 1.0

    if E == 0 : M
                makes numbers in [0.0,1.0)

(M is always < 1, that is we pretend it has a decimal point all the way to the left)

Now the low values are like :

0    : 0.0000
0.5  : 0.1000
0.99 : 0.1111
1.01 : 1.0001

Of course we can do negative values with this by just putting the sign of the floating point value onto our fixed point value, and crossing zero is perfectly fine.
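Here's the whole Solution 2 mapping as a Python sketch (MBITS and the function names are mine, and a real version would keep more mantissa bits) :

```python
import math

MBITS = 4  # mantissa bits kept; illustration only

def encode(x):
    # Solution 2 : no negative exponents; [0,1) is a linear
    # "denormal" band with E == 0, values >= 1.0 use E >= 1.
    s = -1 if x < 0 else 1
    x = abs(x)
    if x < 1.0:
        em = int(x * (1 << MBITS))                 # E = 0, M = x
    else:
        m, e = math.frexp(x)                       # x = m * 2^e, m in [0.5,1)
        m = m * 2.0 - 1.0                          # x = (1+m) * 2^(e-1)
        em = (e << MBITS) | int(m * (1 << MBITS))  # stored E = e, always >= 1
    return s * em

def decode(v):
    s = -1 if v < 0 else 1
    v = abs(v)
    e = v >> MBITS
    m = (v & ((1 << MBITS) - 1)) / float(1 << MBITS)
    if e == 0:
        return s * m
    return s * (1.0 + m) * (2.0 ** (e - 1))

# crossing zero is now a small step :
print([encode(x) for x in (0.2, -0.1)])  # prints : [3, -1]
```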

05-26-09 - Automatic Multithreading

Bartosz made an interesting post about extending D for automatic multithreading with some type system additions.

It made me think about how you could do guaranteed-safe multithreading in C++. I think it's actually pretty simple.

First of all, every variable must be owned by a specific "lock". It can only be accessed if the current thread owns that lock. Many of the ownership relationships could be implicit. For example there is an implicit lock for every thread for stuff that is exclusively owned by that thread. That thread always holds that lock, so you never actually generate a lock/unlock, but conceptually those variables still have a single lock that owns them.

So, stack variables for example are implicitly automatically owned by the thread they are made on. Global variables are implicitly owned by the "main thread" lock if not otherwise specified. If some other thread tries to touch a global, it tries to take a lock that it doesn't own and you fail.

 

Lock gate1;

int shared : gate1; // specifies "shared" is accessed by gate1

int global; // implicitly owned by the main thread

void thread1()
{
    int x; // owned by thread1

    x = shared; // fail! you must lock gate1

    {
        lock(gate1);
        x = shared; // ok
    }

    y = global; // fail! you don't own global
}

Mkay, that's nice and all. Single threaded programs still just work without any touches; everything is owned by the main thread. Another goal, I think, of any threading syntax additions should be that going from a large single threaded code base to adding a few threading bits is easy. It is here.

There are a few things you would need to make this really work. One is a clean transfer of ownership method as Bartosz talks about. Something like auto_ptr or unique_ptr, but actually working in the language, so that you can pass objects from one owner to another and ensure that no refs leak out during the passage.

You can also of course extend this if you don't want the constraint that everything is protected by a lock. For example you could easily add "atomic" as a qualifier instead of a lock owner. If something is marked atomic, then it can be accessed without taking a lock, but it's only allowed to be accessed by atomic operations like cmpx/CAS.

This is a nice start, but it doesn't prevent deadlocks and still requires a lot of manual markup.


I also finally read a bit about Sun's Rock. It's very interesting, I encourage you to read more about it at the Sun Scalable Synchronization page.

Rock is actually a lot like LRB in many ways. It's 16 lightweight cores, each of which has 2 hardware threads (they call them strands). It's basically a simple in-order core, but it can do a sort of state-save push/pop and speculative execution. They have cleverly multi-purposed that functionality for both the Transactional Memory, and also just for improving performance of single threaded code. The state-save push-pop is a ghetto form of out-of-order-execution that we all know and love so much. It means that the chip can execute past a branch or something and then if you go the other way on the branch, it pops the state back to the branch. This is just like checkpointing a transaction and then failing the commit !

The key thing for the transactional memory is that Rock is NOT a full hardware transactional memory chip. It provides optimistic hardware transactions with some well designed support to help software transactional memory implementations. The optimistic hardware transactions basically work by failing to commit if any other core touches a cache line you're writing. Thus if you do work in cache lines that you own, you can read data, write it out, it gets flushed out of the cache to the global consistent view and there are no problems. If someone touches that cache line it invalidates the transaction - even though it might not necessarily need to. That's what makes it optimistic and not fully correct (there are other things too). If it allows a transaction through, then it definitely was okay to do, but it can fail a transaction that didn't need to fail.

There are a lot of problems with that; it can fail in cases that are perfectly valid transactions, so obviously you cannot rely on it as a TM implementation. However, it does let a lot of simple transactions successfully complete quickly. In particular for simple transactions with no contention, the optimistic hardware transaction completes no problemo. If it fails, you need to have a fallback mechanism - in which case you should fall back to your real STM implementation, which should have forward progress guarantees. So one way of looking at the HTM in Rock is as just a common case optimization for your STM software. The commit in Rock has a "jump on fail" so that you can provide the failure handling code block; you could jump back to your checkpoint and try again, but eventually you have to do something else.

Perhaps more interestingly, the HTM in Rock is useful in other ways too. It gives you a way to do a more general ll-sc (load linked store conditional) kind of thing, so even if you're not using STM, you can build up larger "atomic" operations for your traditional lock/atomic C++ style multithreading. It can also be used for "lock elision" - avoiding actually doing locks in traditional lock-based code. For example if you have code like :


    lock(CS)

    x = y;

    unlock(CS)

that can be transparently converted to something like :

    checkpoint; // begin hardware transaction attempt

    x = y;

    commit  // try to end hardware transaction, if fails :
    failure {

        lock(CS)
    
        x = y;

        unlock(CS)
    }

So you avoid stalling if it's not needed. There are of course tons of scary subtleties with all this stuff. I'll let the Sun guys work it out.

It's also actually a more efficient way of doing the simple "Interlocked" type atomic ops. On x86 or in a strongly ordered language (such as Java's volatiles) the Interlocked ops are fully sequentially consistent, which means they go in order against each other. That actually is a pretty hard operation. We think of the CMPX or CAS as being very light weight, but it has to sync the current core against all other cores to ensure ops are ordered. (I wrote about some of this stuff before in more detail in the threading articles). The "transaction" hardware with a checkpoint/commit lets your core flow through its ops without doing a big sync on interlocked ops. Now, obviously the checkpoint/commit itself needs to be synchronized against all other cores, but it's done on a per cache-line basis, and it uses the cache line dirtying communication hardware that we need anyway. In particular in the case of no contention or the case of doing multiple interlocked ops within one cache line, it's a big win.

I'm sure that Sun will completely cock things up somehow as they always do, but it is very interesting, and I imagine that most future chips will have models somewhat like this, as we all go forward and struggle with the issue of writing software for these massively parallel architectures.

5/25/2009

05-25-09 - Using DOT graphviz

I wrote a while ago about DOT, the graphviz svg thing.

I thought I'd write a quick advice on using it from what I've learned. The manual dotguide.pdf and the online help are okay, so start there, but keep this in mind :

DOT was designed to make graphs for printing on paper. That leads to some weird quirks. For one thing, all the sizes are in *inches*. It also means that all the fitting support and such is not really what you want, so just disable all of that.

I have the best success when I use the Google method (what I stole from pprof in perftools). Basically they just set the font size and then let DOT make all the decisions about layout and sizing. There's one caveat in that : the way that DOT writes its font size is not supported right by FF3. See : 1 or 2 . There's a pretty easy way to fix this, basically for every font size tag, you need to add a "px" on it. So change "font-size:14.00;" to "font-size:14.00px;" . This change needs to be done on the SVG after dot. I do it by running a grep to replace ",fontcolor=blue" with "px". So in the DOT code I make all my text "blue", it's not actually blue, the grep just changes that to the px I need for my font sizes. So you'll see me output text attributes like "fontsize=24,fontcolor=blue".
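If you'd rather patch the SVG directly than abuse a color attribute, a regex over the output does the same job. This is a sketch, assuming the styles look like "font-size:14.00;" as described above :

```python
import re

def fix_svg_font_sizes(svg_text):
    # FF3 wants units on font-size; add "px" to any bare
    # "font-size:<number>;" that DOT emitted.
    return re.sub(r'(font-size:\d+(?:\.\d+)?)\s*;', r'\1px;', svg_text)

print(fix_svg_font_sizes('style="font-size:14.00;fill:black;"'))
# prints : style="font-size:14.00px;fill:black;"
```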

The other big thing is that DOT seems to have zero edge label layout code. And in fact the edge layout code is a bit weak in general. It's pretty good at sizing the nodes and positioning the nodes - it has lots of fancy code for that, but then the edge labels are just slapped on, and the text will often be in horrible places. The solution to this is just to use edge labels as little as possible and make them short.

Another trick to getting better edge positioning is to explicitly supply the edge node ports and put them on a record. I do this in my huffman graphs. The other good edge trick is to use edgeweight. It doesn't change the appearance of the edge; it changes how the edge is weighted for importance in the layout algorithm, so it becomes straighter and shorter.

For educational purposes I just made a little program to parse MSVC /showIncludes output to DOT :

my dotincludes zip .
dotincludes sample output from running on itself.

Here are some Huffman trees (SVG - you need Firefox 3+ to view them nicely) optimized with various Lagrange multipliers :

Huff 1
Huff 2
Huff 3

Remember it's control-wheel to zoom SVG's in FF.

5/22/2009

05-22-09 - A little more Huffman

.. okay since I got started on this, a few other notes. This is ancient stuff, in fact it's been in the "Huffman2" in crblib for 10+ years, and it used to be common knowledge, but it's some of that art that's sort of drifted away.

For one thing, Huffman codes are specified only by the code lens. You don't need to store the bit patterns. Of course then you need a convention for how the bit patterns are made.

The "canonical" (aka standard) way to make the codes is to put numerically increasing codes in order with numerically increasing symbols of the same code length. It helps to think about the bit codes as if the top bit is transmitted first, even if you don't actually do that. That is, if the Huffman code is 011 that means first we send a 0 bit, then two ones, and that has value "3" in the integer register.

You can create the canonical code from the lengths in O(N) for N symbols using a simple loop like this :


nextCode = 0;
for( len between min and max )
{
    codePrefix[len] = nextCode;
    nextCode = (nextCode + numCodesOfLen[len]) << 1;
}

for( all symbols )
{
    code[sym] = codePrefix[ codeLen[sym] ];
    codePrefix[ codeLen[sym] ] ++;
}

this looks a bit complex, but it's easy to see what's going on. If you imagine that you had your symbols sorted first by code length, and then by symbol Id, then you are simply handing out codes in numerical order, and shifting left one each time the code length bumps up. That is :

curCode = 0;
for( all symbols sorted by codelen, then by symbol : )
{
    code[sym] = curCode;
    curCode ++;
    if ( sym is not the last )
        curCode <<= codeLen[sym+1] - codeLen[sym];
}

An example :

code lens :

c : 1
b : 2
a : 3
d : 3

[code=0]

c : [0]
    code ++ [1]
    len diff = (2-1) so code <<= 1 [10]
b : [10]
    code ++ [11]
    len diff = (3-2) so code <<= 1 [110]
a : [110]
    code ++ [111]
    no len diff
d : [111]

very neat.
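The whole construction is short enough to run; here's a Python sketch of the sorted-symbol version, on the example above :

```python
def canonical_codes(code_lens):
    # code_lens : {symbol : length} ; returns {symbol : (code, len)}.
    # Walk symbols sorted by (len, symbol), handing out numerically
    # increasing codes and shifting left when the length bumps up.
    syms = sorted(code_lens, key=lambda s: (code_lens[s], s))
    codes = {}
    cur = 0
    for i, s in enumerate(syms):
        codes[s] = (cur, code_lens[s])
        cur += 1
        if i + 1 < len(syms):
            cur <<= code_lens[syms[i + 1]] - code_lens[s]
    return codes

lens = {'c': 1, 'b': 2, 'a': 3, 'd': 3}
for s, (c, l) in canonical_codes(lens).items():
    print(s, format(c, '0%db' % l))
# prints : c 0 / b 10 / a 110 / d 111
```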

The other thing is to decode you don't actually need to build any tree structure. Huffman trees are specified entirely by the # of codes of each length, so you only need that information, not all the branches. It's like a balanced heap kind of a thing; you don't actually want to build a tree structure, you just store it in an array and you know the tree structure.

You can do an O(H) (H is the entropy, which is the average # of bits needed to decode a symbol) decoder using only 2 bytes per symbol (for 12 bit symbols and max code lengths of 16). (note that H <= log(N) - a perfectly balanced alphabet is the slowest case and it takes log(N) bits).


a node is :

Node
{
  CodeLen : 4 bits
  Symbol  : 12 bits
}

You just store them sorted by codelen and then by code within that len (aka "canonical Huffman order"). So they are sitting in memory in same order you would draw the tree :

0 : c
10 : b
11 : a

would be

[ 1 : c ]
[ 2 : b ]
[ 2 : a ]

then the tree descent is completely determined, you don't need to store any child pointers, the bits that you read are the index into the tree. You can find the node you should be at any time by doing :

index = StartIndex[ codeLen ] + code;

StartIndex[codeLen] contains the index of the first code of that length *minus* the value of the first code bits at that len. You could compute StartIndex in O(1) , but in practice you should put them in a table; you only need 16 of them, one for each code len.

So the full decoder looks like :


code = 0;
codeLen = 0;

for(;;)
{
    code <<= 1;
    code |= readbit();
    codeLen ++;

    if ( Table[ StartIndex[ codeLen ] + code ].codeLen == codeLen )
        return Table[ StartIndex[ codeLen ] + code ].Symbol
}

obviously you can improve this in various ways, but it's interesting to understand the Huffman tree structure this way.
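A runnable Python sketch of that decoder, on the 0 / 10 / 11 example above; build_decoder computes the StartIndex table exactly as described (first index of a length, minus the numeric value of the first code at that length) :

```python
def build_decoder(table):
    # table : list of (codeLen, symbol) in canonical Huffman order.
    start = {}
    code = 0
    i = 0
    for length in range(1, max(l for l, _ in table) + 1):
        code <<= 1                                # first code at this length
        start[length] = i - code
        n = sum(1 for l, _ in table if l == length)
        i += n
        code += n
    return start

def decode_one(table, start, bits):
    code = 0
    length = 0
    for b in bits:
        code = (code << 1) | b
        length += 1
        idx = start[length] + code
        if 0 <= idx < len(table) and table[idx][0] == length:
            return table[idx][1], length
    raise ValueError("bad code")

table = [(1, 'c'), (2, 'b'), (2, 'a')]   # 0 : c , 10 : b , 11 : a
start = build_decoder(table)
print(decode_one(table, start, [1, 0]))  # prints : ('b', 2)
```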

The thing is if you just list out the huffman codes in order, they numerically increase :


0
10
110
111

= 0, 2, 6,7

If you have a prefix that doesn't yet decode to anything, eg if you have "11" read so far - that will always be a code value that numerically doesn't exist in the table.

Finally, if your symbols actually naturally occur in sorted order, eg. 0 is most likely, then 1, etc.. (as they do with AC coefficients in the DCT which have geometric distribution) - then you can do a Huffman decoder that's O(H) with just the # of codes of each length !

05-22-09 - Fast Means

I know of three fast ways to track the mean of a variable. At time t you observe a value x_t. You want to track the mean of x over time quickly (eg. without a divide!). I generally want some kind of local mean, not the full mean since time 0.

1. "weighted decay". You want to do something like M_t = 0.9 * M_t-1 + 0.1 * x_t;


int accum = 0;

accum := ((15 * accum)>>4) + x_t;

mean = (accum>>4);

This is the same as doing the true (total/count) arithmetic mean, except that once count gets to some value (here 16) then instead of doing total += x_t, instead you do total = (15/16) * total + x_t; so that count sticks at 16.

2. "sliding window". Basically just track a window of the last N values, then you can add them to get the local mean. If N is a power of 2 you don't divide. Of course the fast way is to only track values entering & leaving the window, not to redo the sum all the time.


int window[32];
int window_i = 0;
int accum = 0;

accum += x_t - window[window_i];
window[window_i] = x_t;
window_i = (window_i + 1) & 0x1F;

mean = (accum>>5);

3. "deferred summation". This is the technique I invented for arithmetic coding long ago (I'm sure it's an old technique in other fields). Basically you just do a normal arithmetic mean, but you wait until count is a power of two before doing the divide.


int accum = 0;
int next_shift = 3;
int next_count = 1 << next_shift; // = 8

accum += x_t;

if ( --next_count == 0 )
{
    mean = accum >> next_shift;
    accum = 0;
    next_shift = MIN( 10 , next_shift+1 );
    next_count = 1 << next_shift;
}

mean is not changing during each chunk, basically you use the mean from the last chunk and wait for another power-of-2 group to be done before you update mean. (also this codelet is resetting accum to zero each time but in practice you probably want to carry over some of the last accum).

For most purposes "deferred summation" is actually really bad. For the thing I invented it for - order 0 coding of stable sources - it's good, but that application is rare and not useful. For real context coding you need fast local adaptation and good handling of sparse contexts with very few statistics, so delaying until the next pow-2 of symbols seen is not okay.

The "weighted decay" method has a similar sort of flaw. The problem is you only have one tuning parameter - the decay multiplier (which here became the shift down amount). If you tune it one way, it adapts way too fast to short term fluctuations, but if you tune it the other way it clings to values from way way in the past much too long.

The best almost always is the sliding window, but the sliding window is not exactly cheap. It takes a lot of space, and it's not awesomely fast to update.

5/21/2009

05-21-09 - Entropy Coding by Counting

There have been a number of interesting works in the last few years about different ways to do coding. Some references :

L. Oktem : Hierarchical Enumerative Coding and Its Applications in Image Compressing, Ph.D. thesis
"Group Testing for Wavelet-Based Image Compression"
"Entropy Coding using Equiprobable Partitioning"
"Binary Combinatorial Coding"

The common pattern in all these is the idea of coding probabilistic events using counting. Basically, rather than taking some skew-probability event and trying to create a code whose length matches the probability like L = - log2(P) , instead you try to combine events such that you create coding decision points that have 50/50 or at least power of 2 probabilities.

Let's look at a simple example. Say you want to code a bunch of binary independent events, some 001100 sequence with a certain P(1) and P(0) that's constant. For now let's say p = P(1) is very low. (this is the common case in image coding, or really any predictive coding where the prediction is very good - a 1 means the prediction was wrong, a 0 means it was right; eg. high bands of wavelets are like this (though they aren't independent - more on that later)).

If you just try to code each bit one by one, the average code length should be much less than one bit, so you'd have to arithmetic code or something. To create an event that's just 50/50 we code a group of symbols. We want to make the probability that the whole group is off be close to 50%, so :


0.50 = P(0) ^ N

N = log(0.50) / log(P(0)) = -1 / log2( P(0) )

for example if P(1) = 5% , then N = 13.5

so you take a group of 13 or 14 symbols and then code whether they are all off with just 1 raw bit. That's an efficient coding operation because by design you've made that a 50/50 event. Now we have a bit of a puzzle for the next step. If you sent "it's all off" then you're done, but if you sent "something is on" you now have to send more data to specify what is on - but you have already sent some information by sending that something is on, so you don't want to just waste that.

There's some crazy difficult combinatorics math to do if you wanted to figure out what the next event to code would be that was a perfect 50/50, but in the end our hand is forced because there's not really anything else we can do - we have to send a bit to say "is the first half on?". I'll present the group testing method without justification then try to justify it after the fact :

You know something in the group of N is on. First send a bit for whether the first half ([N/2]) range is on. If the first half is on, recurse into that range and repeat; when you're done with it you still have to code the neighboring (N/2), which you don't know is on or off, so send a bit for it too. If the first half was off, then the neighbor must be on, so recurse into him without sending a bit.

Now the justification. First suppose there is only 1 value on in the range N; then it has equal probability of being in the first or second half of the group, so it is correct to send a bit to specify which half it's in - that's equivalent to signalling whether the first half is active. Now note that by design of our group size it's by far most likely that there is only 1 value on. For our example case above where P(1) = 5%, the chance of there being exactly 1 value on is around 70%. In that case our code is perfect; for higher counts it's not ideal, but those are less likely, so the net loss is small.

(BTW the binary code to locate a single item in a group is just the binary coding of the index; eg. first you send a bit to indicate if it's in the top half or bottom half, etc.)
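Here's a sketch of the encoder side of that recursion (helper names made up; out[] just collects the raw bits that would go to the stream) :

```c
// encoder side of the group-testing recursion ; out[] collects raw bits.
// precondition : at least one bit in [lo,hi) is on.
static int AnyOn(const int * bits, int lo, int hi)
{
    for(int i=lo;i<hi;i++)
        if ( bits[i] ) return 1;
    return 0;
}

void CodeGroup(const int * bits, int lo, int hi, int * out, int * n_out)
{
    if ( hi - lo == 1 ) return; // a known-on range of size 1 is fully located

    int mid = (lo + hi)/2;
    int first_on = AnyOn(bits,lo,mid);

    out[(*n_out)++] = first_on; // "is the first half on?"

    if ( first_on )
    {
        CodeGroup(bits,lo,mid,out,n_out);

        // the second half's status is unknown - send a bit for it :
        int second_on = AnyOn(bits,mid,hi);
        out[(*n_out)++] = second_on;
        if ( second_on )
            CodeGroup(bits,mid,hi,out,n_out);
    }
    else
    {
        // first half off -> second half must be on ; no bit needed
        CodeGroup(bits,mid,hi,out,n_out);
    }
}
```

Note the asymmetry : an on value in the last position of a group of 4 costs only 2 bits (each "first half on?" answer is no, and the free recursion does the rest), while one in the first position costs 4, because you also pay to say the sibling halves are off.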

The enumerative coding approaches take a similar but slightly different tack. Basically they code the # of values on in the sequence. Once you have the # of values on, then all arrangements are equally likely, so you just need to code the permutation id with a fixed length equiprobable code.


Say for example you want to code 16 binary events

You first send the # on.

If there's 1 on, there are 16 ways for that to happen - just send 4 bits

If there're 2 on, there are 16*15/2 = 120 ways
    you should send a 6.906 bit code, but you can just send a 7 bit code

If there're 3 on, you want to send one of 16*15*14/6 = 560 ways
    a 10 bit code would suck here, so you want a simple prefix code
    that usually writes 9 bits and occasionally 10

This is pretty efficient, particularly when the probability of a bit being on is very low so that you are usually in the low-count cases. In the high count cases it starts to suck because counting the permutations gets pretty complex. The other problem with this method is how to encode the # that are on. The problem is the probability of some # being on is a binomial distribution. There's no simple fixed-bit code for a binomial distribution (I don't think).
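The "permutation id" can be computed with the standard combinatorial number system; a sketch (this is the textbook ranking method, not code from the papers) :

```c
// C(n,k) ; the division is exact at every step
static unsigned Comb(int n, int k)
{
    if ( k < 0 || k > n ) return 0;
    unsigned r = 1;
    for(int i=0;i<k;i++)
        r = r * (n - i) / (i + 1);
    return r;
}

// rank of this mask among all nbits-bit masks with the same # of bits on :
// if the on positions are p1 < p2 < ... < pk , rank = sum of C(p_i, i)
unsigned RankMask(unsigned mask, int nbits)
{
    unsigned rank = 0;
    int seen = 0;
    for(int p=0;p<nbits;p++)
        if ( mask & (1u << p) )
            rank += Comb(p, ++seen);
    return rank;
}
```

With 2 bits on in 16, the ranks run 0..119, so the 7-bit code above covers them; with 3 on they run 0..559.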

Now, for N large, the binomial distribution approaches a normal or geometric distribution. In that case we can use a Golomb code for the # of values on. Furthermore for N large, the coding inefficiency of the # on becomes less important. In the real world I don't think N large is very practical.

Anyhoo I'm not sure any of this is super useful because in practice bits are never independent, so you want to do context coding to capture that, and it's a mess to do with these methods. The GTW (Group Testing for Wavelets) guys capture context information by classifying each sample into classes of similar expected statistics, and also by interleaved scanning so that coefficients with relations are not put into the same class together. The result is impressive, they get compression comparable to ECECOW, but all the work of shuffling values around and multiple scans means they're surely slower than just arithmetic coding.

BTW not related but while I was looking at this I was confused by the following stupidity :


Exponential Distribution == Laplacian == Bernoulli Process == Geometric

All of them are P(n) ~= r ^ n

(or P(n) ~= e ^ ( - lambda * n ) if you prefer)

~ means proportional, ignoring normalization.

(Poisson is similar but different - P(n) ~= r ^ n / n! )

BTW the maximum likelihood estimate of the Laplacian parameter comes from the L1 norm :

P(n) ~= e ^ ( - lambda * |n| )  ; lambda = 1 / (mean of |samples|)

yeah yeah some of them are continuous and some discrete, some are one sided, some two sided, but people in compression play fast & loose with all of those things, so it's very confusing when you switch from one paper to another and they switch terms. Most people in compression describe the AC coefficients as having a "Laplacian" distribution.

BTW Golomb codes are optimal for coding Laplacian/Geometric distributions.


Mean of geometric distribution with ratio r is = r / (1.0 - r)
  (r given the mean is r = m / (1+m) )

The optimal Golomb parameter k satisfies the relation :

r^k + r^(k+1) <= 1 < r^k + r^(k-1)

or

r^k <= 1/(1+r) < r^(k-1)

You can find this k a few ways :

k = froundint( 0.5 - log(1+r)/log(r) );

k = ceil( log(1+r)/log(1/r) );

or

int GetK(double r)
{
    // find the smallest k with (1+r) * r^k <= 1 , ie. r^k <= 1/(1+r)
    double rk = r;
    for(int k=1;;k++)
    {
        if ( (1.0 + r)*rk <= 1.0 )
            return k;
        rk *= r;
    }
}

and furthermore we often use the nice approximation

k = froundint( mean )

however, that is not really correct for large mean.  It works fine for mean up to around 5.

At mean = 6 , the true optimum k is 5

true entropy : 4.144196
k=5 , Golomb H : 4.174255
k=6 , Golomb H : 4.219482

you can see at k=5 Golomb does very well matching entropy, but using the incorrect k = mean = 6 it gets off.

At larger mean it's much worse -

at mean = 19, the correct k is = 14

true entropy : 5.727806
golomb_k : 14 , H : 5.761475
golomb_k : 19 , H : 5.824371

At low mean of course Golomb gets bad because it can't code the 0 in less than 1 bit. You would obviously use run-lengths or something to fix that in practice. Golomb works very well down to a mean of about 1.0 (that's a geometric r of 0.5).

One neat thing about geometric distributions is if you code the first symbol a different way, the remainder is still geometric. So you can code the 0 symbol with RLE, then if it's not 0, you code the symbols >= 1 with Golomb, and it's still geometric with the same r and k. (though Golomb is still bad there, at least it's being done less often).


BTW I tried expressing k in terms of the mean :

k = ceil( log(1+r)/log(1/r) );

k = ( log( (1+2m) / (1+m))/log((1+m)/m) )

k = ( log( 1+2m ) - log(1+m) ) / ( log (1+m) - log(m) )

which is sort of handy if you have 'm' as an integer you can use table lookups for the logs and just do one divide, but it's
not really super simple.

what I'd like is just a Taylor expansion of this that works for m < 30 or so.  In particular it should be something like

k = m - 0.2 * m^2  (but not this)

k is like m for m small, but then it needs to correct down.  I couldn't work it out, maybe you can.

5/20/2009

05-20-09 - Some Huffman notes

For the record, you can do a very fast Huffman decoder by always reading 8 bits and storing the bit-reading state in the table pointer. The way you do this is by creating a separate 8-bit lookup table for each possible current bit-reading state.

You have some # of bits [0,8] that have not yet been used for output. For each # of bits there are various values those bits can have: for 0 bits it's {} , for 1 bit it's {0,1}, etc. So your total number of states is 1+2+4+...+256 = 511.

So at each step you have your current table pointer which is your decode table plus holds your current bit buffer state. You read 8 more bits and look up in that table. The table tells you what symbols can be made from the combination of {past bit buffer + new 8 bits}. Those symbols may not use up all the bits that you gave it. Whatever bits are left specify a new state.

I'll show an example for clarity :


Say you read 3 bits at a time instead of 8 (obviously we picked 8 because it's byte aligned)

If your Huffman code is :

0 - a
10 - b
11 - c

You make these tables :

{no bits} :

000 : aaa + {no bits}
001 : aa  + {1}
010 : ab  + {no bits}
011 : ac  + {no bits}
100 : ba  + {no bits}
101 : b   + {1}
110 : ca  + {no bits}
111 : c   + {1}

{1} :  (means there's 1 bit pending of value 1)

000 : baa + {no bits}
001 : ba  + {1}
010 : bb  + {no bits}
011 : bc  + {no bits}
100 : caa + {no bits}
101 : ca  + {1}
110 : cb  + {no bits}
111 : cc  + {no bits}

The decoder code looks like this :


struct DecodeTableItem
{
    U8 numOut;                   // # of symbols decoded by this 8-bit read
    U8 output[MAX_OUT];          // the decoded symbols
    DecodeTableItem * nextTable; // table for the leftover-bits state
};

DecodeTableItem * pTable = table_no_bits_pending;

for(;;)
{
    U8 byte = *inptr++; // always read 8 more bits

    int num = pTable[byte].numOut;

    for(int i=0;i < num;i++)
        *outptr++ = pTable[byte].output[i];
    
    pTable = pTable[byte].nextTable; // leftover bits select the next table
}

It's actually a tiny bit more complicated than this because you have to handle codes longer than 8 bits. That just means "numOut" is 0 and you have to fall out to a brute force loop. You could handle that case in this same system if you make tables for 9,10,11+ bits in the queue, but you don't want to make all those tables. (you also don't actually need to make the 8 bits in the queue table either). The "numOut = 0" loop will read more bytes until it has enough bits to decode a symbol, output that symbol, and then there will be 0-7 remainder bits that will select the next table.

BTW in practice this is pretty useless because we almost never want to do one giant huffman array decode, we have our huffman codes interleaved with other kinds of coding, or other huffman trees, dependent lookups, etc. (this is also useless because it's too many tables and they are too slow to make on the fly unless you are decoding a ton of data).

BTW old-school compression people might recognize that this is basically a Howard-Vitter Quasi-Arithmetic coder. In the end, every decompressor is just a state machine that's fed by input bits or bytes from the coded stream. What we are doing here is basically explicitly turning our algorithm into that state machine. The Howard-Vitter QA did the same thing for a simplified version of arithmetic coding.

There are a lot of wins for explicitly modeling your decoder as a state machine triggered on input bits. By design of the compressor, the bits in the coded stream are random, which means any branch on them in the decoder is totally unpredictable. You have to eat that unpredictable branch - but you only want one of them, not a whole bunch. In normal function-abstracted decompressors you read the bits in some coder and generate a result and then act on that result outside - you're essentially eating the branch twice or more.

A typical arithmetic decoder does something like :


Read bits from stream into "Code"

Test "Code" against Range and probabilities to find Symbol
    -> this is the main branch on the bit stream

if ( Range too small )
    do Renormalization

branch on Symbol to update probabilities

do conditional output using Symbol
    eg. Symbol might be a runlength or something that makes us branch again

The state-based binary arithmetic coders like the QM or MQ coder combine the bit-reading, renormalization, and model updates into a single state machine.

5/18/2009

05-18-09 - Lagrange Space-Speed Optimization

We just did an interesting thing at RAD that's kind of related to what I've been writing about.

A while ago Dmitry Shkarin (ppmii inventor) posted this code sketch for his Huffman decoder :


Dmitry Shkarin's Huffman reader :

Suppose We have calculated lengths of Huffman codes and minimal
codelength is N then We can read N bits and to stop for most probable
symbols and to repeat reading for other symbols. Decoding procedure will
be something similar to:

extern UINT GetBits(UINT NBits);

struct HUFF_UNPACK_ITEM {
    INT_TYPE NextTable, BitsToRead;
} Table[2*ALPHABET_SIZE];

inline UINT DecodeSymbol()
{
    const HUFF_UNPACK_ITEM* ptr=Table;
    do {
        ptr += ptr->NextTable+GetBits(ptr->BitsToRead);
    } while (ptr->BitsToRead != 0);
    return ptr->NextTable;
} 

this will seem rather opaque to you if you don't know about Huffman codes; you can ignore it and read on and still get the point.

Dmitry's decoder reads the minimum # of bits to get to the next resolved branch at each step, then increments into the table by that branch. Obviously the value of the bits you read is equal to the branch number. So like if you read two bits, 00 = 0, 01 = 1, 10 = 2, 11 = 3 - you just add the bits you read to the base index and that's the branch you take.

Okay, that's pretty simple and nice, but it's not super fast. It's well known that a good way to accelerate Huffman decoding is just to have a big table of how to decode a large fixed-size bit read. Instead of reading variable amounts, you always just read 12 bits to start (for example), and use that 12 bit value to look up in a 4096 member table. That table tells you how many bits were actually needed and what symbol you decoded. If more than 12 bits are needed, it gives you a pointer to a followup table to resolve the symbol exactly. The crucial thing about that is that long symbols are very unlikely (the probability of each symbol is like 2^-L for a bit length of L) so you rarely need the long decode path.

It's pretty obvious that you could extend Dmitry's method to encompass read-aheads like this acceleration table. Instead of just doing GetBits on "BitsToRead" , instead you scan ahead BitsToRead , and then when you take a path you add an extra field like "BitsConsumed" which tells you how many of those bits were actually needed. This lets you make initial jump tables that read a whole bunch of bits in one go.

More generally, in the tree building, at any point you could decide to make a big fast node that wastes memory, or a small binary treeish kind of node. This is kind of like a Judy tree design, or a Patricia Trie where the nodes can switch between linked-lists of children or an array of child pointers. One nice thing here is our decoder doesn't need to switch on the node type, it always uses the same decode code, but the tree is just bigger or smaller.

To be concrete here's a simple Huffman code and possible trees for it :


Huffman codes :

0   : a
10  : b
110 : c
111 : d

Possible trees :

(* means read one bit and branch)

Tree 1)

*----- a
 \
  *--- b
   \
    *- c
     \
       d

4 leaves
3 internal nodes
= 7 nodes total

Tree 2)

[2bit] 
[ 00 ]--- a
[ 01 ]--- a
[ 10 ]--- b
[ 11 ]- *- c
         \ d

5 leaves
2 internal nodes
= 7 nodes total

Tree 3)

[3bit] -
[ 000 ] -- a
...
[ 110 ] -- c
[ 111 ] -- d

8 leaves
1 internal node
= 9 nodes total

Tree 4)

*------- a
 \
  [2bit]
  [ 00 ] - b
  [ 01 ] - b
  [ 10 ] - c
  [ 11 ] - d

5 leaves
2 internal nodes
= 7 nodes total

We have these four trees. They have different memory sizes. We can also make an estimate of what the decode time for each tree would be. In particular for the case of Huffman decoding, the expected time is something like the number of branches weighted by 2^-depth of each branch. Reading more bits in a given branch isn't significantly slower than reading 1 bit, we just want as few branches as possible to decode a symbol.

I'm going to get away from the Huffman details now. In general when we are trying to make fast data structures, what we want is as much speedup as possible for a given memory use. Obviously we could throw 64k of table slots at it and read 16 bits all at once and be very fast. Or we could use the minimum-memory table. Usually we want to be somewhere in between, we want a sweet spot where we give it some amount of memory and get a good speedup. It's a trade off problem.

If we tried all possible trees, you could just measure the Mem use & Time for each tree and pick the one you like best. You would see there is some graph Time(Mem) - Time as a function of Mem. For minimum Mem, Time is high, as you give it more Mem, Time should go down. Obviously that would be very slow and we don't want to do that.

One way to think about it is like this : start with the tree that consumes minimum memory. Now we want to let it have a little bit more memory. We want the best bang-for-the-buck payoff for that added memory, so we want the tree change that gives us the best speedup per extra byte consumed. That's the optimum d(Time)/d(Mem). Keep doing improvements until d(Time)/d(Mem) doesn't give you a big enough win to be worth it.

Some of you may already be recognizing this - this is just a Rate-Distortion problem, and we can solve it efficiently with a Lagrange multiplier.

Create a Lagrange cost like :


J = Time + lambda * Mem

Now try to build the tree that minimizes J for a given lambda. (ignore how lambda was picked for now, just give it some value).
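The selection step is trivial; here's a sketch with made-up candidate numbers - given several (Time, Mem) options, just minimize J :

```c
typedef struct { double time; double mem; } Candidate;

// index of the candidate minimizing J = Time + lambda * Mem
int PickBest(const Candidate * c, int n, double lambda)
{
    int best = 0;
    for(int i=1;i<n;i++)
    {
        double J  = c[i].time    + lambda * c[i].mem;
        double Jb = c[best].time + lambda * c[best].mem;
        if ( J < Jb ) best = i;
    }
    return best;
}
```

Dialing lambda up makes memory expensive and favors small-Mem candidates; dialing it down favors fast ones - that's the tangent line sweeping along the curve.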

If you find the tree that minimizes J, then that is the tree that is on the optimal Mem-Time curve at a certain spot of the slope d(Time)/d(Mem).

You should be able to see this is true. First of all - when J is minimum, you must be on the optimal Time(Mem) curve. If you weren't, then you could hold Mem constant and improve Time by moving towards the optimal curve and thus get a lower J cost.

Now, where are you on the curve? You must be at the spot where lambda = - d(Time)/d(Mem). One way to see this is algebraically :


J is at a minimum, therefore :

d(J)/d(Mem) = 0

d(J)/d(Mem) = d(Time)/d(Mem) + lambda * 1 = 0

therefore lambda = - d(Time)/d(Mem)

You can also see this intuitively. If I start out anywhere on the optimal Time(Mem) curve, I can improve J by trading Mem for Time as long as the gain I get in Time exceeds lambda * what I lose in Mem. That is, if d(Time) > - lambda * d(Mem) , then I should do the step. Obviously you keep doing that until they are equal. QED.

Since I'm nice I drew a purty picture :

The blue curve is the set of optimal solutions for various Mem parameters. The hard to see yellow tangent is a lambda parameter which is selecting a specific spot on the trade-off curve. The green region above the curve is the space of all possible solutions - they are inefficient solutions because you can improve them by getting to the blue curve. The red region is not possible.

This kind of lagrange space-time optimization has all sorts of applications in games. One example would be your spatial acceleration structures, such as kd-trees or BVH hierarchies for ray tracing. Too often we use hacky heuristics to build these. What you should really do is create a lagrange cost that weighs the cost of more memory use vs. the expected speedup.

One of the nice wins of this approach is that you can often get away with doing a greedy forward optimization for J (the lagrange cost), and it's just a single decision as you build your tree. You just evaluate your current choices for J and pick the best one. You do then have to retry and dial lambda to search for a given Mem target. If you didn't use the lagrange multiplier approach, you would have to try all the approaches and record the Time/Mem of every single possibility, then you would pick the one that has the best Time for the desired Mem.

In the past I've talked about algorithms being dumb because they are "off the curve". That means they are in the green region of the picture - there's just no reason to use that tradeoff. In general in algorithms you can't fault someone for selecting something on the blue curve. eg. a linked list hash table vs. a reprobing hash table might be at different spots on the blue curve - but they're both optimal. The only time you're making a big mistake is when you're way out in green space of wanton inefficiency.

Back to the specifics of the Huffman decoder - what we've found is kind of interesting and maybe I'll write about it in more detail later when we understand it better. (or it might be one of our magic secrets).

ADDENDUM : I should be clear that with the J-cost optimization you still have to consider greedy tree building vs. optimal tree building. What you're gaining is a way to drive yourself towards the blue curve, but it's not like the tree building is suddenly super easy. You also have to deal with searching around in the lagrange multiplier to hit the target you want. For many practical problems you can create experimental heuristics that will give you a formula for the lambda that gives you a certain mem use or rate or whatever.

When we make advances in the art, there are two main ways we do it. One is to push farther along the blue curve than anyone has before. For example in data compression we usually have a run time vs. compression curve. You can run slower and slower algorithms and get more and more compression. You might find a new algorithm that runs even slower and gets even more compression than anyone has before. (PAQ is playing this game now; I used to play this game with PPMZ back in the day). That extends the curve out into areas unexplored. The other advance is to find something that's below the previously known blue curve. You find a way to get a better trade-off and you set a new optimal curve point. You might find a new compressor that gets the same compression ratio as old stuff, but runs way faster.
