6/15/2009

06-15-09 - Blog Roll

It's time now for me to give a shout out to all the b-boys in the werld.

mischief.mayhem.soap.
Adventures of a hungry girl
Atom
Aurora
Beautiful Pixels
Birth of a Game
bouliiii's blog
Braid
C0DE517E
Capitol Hill Triangle
cbloom rants
Cessu's blog
Culinary Fool
David Lebovitz
Diary of a Graphics Programmer
Diary Of An x264 Developer
Eat All About It
EntBlog
fixored?
Game Rendering
GameArchitect
garfield minus garfield
Graphic Rants
Graphics Runner
Graphics Size Coding
Gustavo Duarte
His Notes
Humus
I Get Your Fail
Ignacio Castaño
Industrial Arithmetic
John Ratcliff's Code Suppository
Lair Of The Multimedia Guru
Larry Osterman's WebLog
level of detail
Lightning Engine
limegarden.net
Lost in the Triangles
Mark's Blog
Married To The Sea
meshula.net
More Seattle Stuff
My Green Paste, Inc.
Nerdblog.com
not a beautiful or unique snowflake
NVIDIA Developer News
Nynaeve
onepartcode.com
Pete Shirley's Graphics Blog
Pixels, Too Many..
Real-Time Rendering
realtimecollisiondetection.net - the blog
RenderWonk
Ryan Ellis
Seattle Daily Photo
snarfed.org
Some Assembly Required
stinkin' thinkin'
Stumbling Toward 'Awesomeness'
surly gourmand
Sutter's Mill
Syntopia
Thatcher's rants and musings
The Atom Project
The Big Picture
The Data Compression News Blog
The Ladybug Letter
The software rendering world
TomF's Tech Blog
View
VirtualBlog
Visual C++ Team Blog
Void Star: Ares Fall
What your mother never told you about graphics development
Wright Eats
xkcd.com
Bartosz Milewski's Programming Cafe

autogen from the Google Reader xml output. I would post the code right here but HTML EATS MY FUCKING LESS THAN SIGNS and it's pissing me off. God damn you.

SAVED : Thanks Wouter for linking to htmlescape.net ; I might write a program to automatically do that to anything inside a PRE chunk when I upload the block.


int main(int argc,char *argv[])
{
    char * in = ReadWholeFile(argv[1]);
    
    while( in && *in )
    {
        in = skipwhitespace(in);
        // (HTML ate the angle-bracket lines here; reconstructed approximately :)
        if ( stripresame(in,"<outline") )
        {
            char * url   = strextract(in,"htmlUrl=\"","\"");
            char * title = strextract(in,"title=\"","\"");
            printf("<a href=\"%s\"> %s </a><br>\n",url,title);
        }
        in = strnextline(in);
    }
    return 0;
}

6/13/2009

06-13-09 - Integer division by constants

.. can be turned into multiplies and shifts as I'm sure you know. Often on x86 this is done most efficiently through use of the "mul high" capability (the fact that you get 64 bit multiply output for free and can just grab the top dword).

Sadly, C doesn't give you a way to do mul high, AND stupidly MS/Intel don't seem to provide any intrinsic to do it in a clean way. After much fiddling I've found that this usually works on MSVC :


uint32 Mul32High(uint32 a,uint32 b)
{
    return (uint32)( ((uint64) a * b) >>32);
}

but make sure to check your assembly. This should assemble to just a "mul" ; the high dword comes back in edx, which becomes the return value.
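Mul32High also lets you do full-range constant division, not just the limited-range versions below. For example, dividing by 3 over the entire 32-bit range uses the standard magic number 0xAAAAAAAB = ceil(2^33/3) (this particular magic constant is well known, not derived in this post) :

```cpp
#include <cstdint>
#include <cassert>

typedef uint32_t uint32;
typedef uint64_t uint64;

uint32 Mul32High(uint32 a,uint32 b)
{
    return (uint32)( ((uint64) a * b) >> 32 );
}

// full-range x/3 for any uint32 :
// 3 * 0xAAAAAAAB = 2^33 + 1 , so (x * 0xAAAAAAAB) >> 33 == x/3 exactly
uint32 Div3(uint32 x)
{
    return Mul32High(x, 0xAAAAAAABu) >> 1;
}
```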

Now, for the actual divider lookup, there are lots of references out there on the net, but most of them are not terribly useful because 1) lots of them assume expensive multiplies, and 2) most of them are designed to work for the full range of arguments. Often you know that you only need to work on a limited range of one parameter, and you can find much simpler versions for limited ranges.

One excellent page is : Jones on Reciprocal Multiplication (he actually talks about the limited range simplifications, unlike the canonical references).

The best reference I've found is the AMD Optimization Guide. They have a big section about the theory, and also two programs "sdiv.exe" and "udiv.exe" that spit out code for you! Unfortunately it's really fucking hard to find on their web site. sdiv and udiv were shipped on AMD developer CD's and appear to have disappeared from the web. I've found one FTP Archive here . You can find the AMD PDF's on their site, as these links may break : PDF 1 , PDF 2 .

And actually this little CodeProject FindMulShift is not bad; it just does a brute force search for simple mul shift approximations for limited ranges of numerators.

But it's written in a not-useful way. You should just find the shift that gives you maximum range. So I did that, it took two seconds, and here's the result for you :



__forceinline uint32 IntegerDivideConstant(uint32 x,uint32 divisor)
{
    ASSERT( divisor > 0 && divisor <= 16 );
    
    if ( divisor == 1 )
    {
        return x;
    }
    else if ( divisor == 2 )
    {
        return x >> 1;
    }
    else if ( divisor == 3 )
    {
        ASSERT( x < 98304 );  
        return ( x * 0x0000AAAB ) >> 17; 
    }
    else if ( divisor == 4 )
    {
        return x >> 2;
    }
    else if ( divisor == 5 )
    {
        ASSERT( x < 81920 );  
        return ( x * 0x0000CCCD ) >> 18; 
    }
    else if ( divisor == 6 )
    {
        ASSERT( x < 98304 );  
        return ( x * 0x0000AAAB ) >> 18; 
    }
    else if ( divisor == 7 )
    {
        ASSERT( x < 57344 );  
        return ( x * 0x00012493 ) >> 19; 
    }
    else if ( divisor == 8 )
    {
        return x >> 3;
    }
    else if ( divisor == 9 )
    {
        ASSERT( x < 73728 );  
        return ( x * 0x0000E38F ) >> 19; 
    }
    else if ( divisor == 10 )
    {
        ASSERT( x < 81920 );  
        return ( x * 0x0000CCCD ) >> 19; 
    }
    else if ( divisor == 11 )
    {
        ASSERT( x < 90112 );  
        return ( x * 0x0000BA2F ) >> 19; 
    }
    else if ( divisor == 12 )
    {
        ASSERT( x < 98304 );  
        return ( x * 0x0000AAAB ) >> 19; 
    }
    else if ( divisor == 13 )
    {
        ASSERT( x < 212992 );  
        return ( x * 0x00004EC5 ) >> 18; 
    }
    else if ( divisor == 14 )
    {
        ASSERT( x < 57344 );  
        return ( x * 0x00012493 ) >> 20; 
    }
    else if ( divisor == 15 )
    {
        ASSERT( x < 74909 );  
        return ( x * 0x00008889 ) >> 19; 
    }
    else if ( divisor == 16 )
    {
        return x >> 4;
    }
    else
    {
        CANT_GET_HERE();
        return x / divisor;
    }
}

Note : an if/else tree is better than a switch() because we're branching on constants. This should all get removed by the compiler. Some compilers get confused by large switches, even on constants, and fail to remove them.

7 seems to be the worst number for all of these methods. It only works here up to 57344 (not quite 16 bits).
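The brute-force search itself really is a two-second job; a sketch of it (it just scans for the first x where the mul-shift disagrees with real division, trying a couple of candidate multipliers per shift) :

```cpp
#include <cstdint>
#include <cassert>

struct MulShift { uint32_t mul; int shift; uint32_t range; };

// for a given divisor, try a few shifts and the two nearest multipliers,
// keeping whichever (mul,shift) is exact over the largest range [0,range)
MulShift FindMulShift(uint32_t divisor)
{
    MulShift best = { 0, 0, 0 };
    for(int s = 16; s <= 20; s++)
    {
        uint64_t lo = ( uint64_t(1) << s ) / divisor;
        for(uint64_t m = lo; m <= lo + 1; m++)
        {
            // scan for the first x where the approximation breaks :
            uint32_t x = 0;
            while( x < (1u<<20) && uint32_t((x * m) >> s) == x / divisor )
                x++;
            if ( x > best.range )
                best = { uint32_t(m), s, x };
        }
    }
    return best;
}
```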

06-13-09 - A little mathemagic

I'm doing some fun math that I'll post in the next entry, but first some little random shit I've done along the way.

First of all this is just a handy tiny C++ functor dealy for doing numerical integration. I tried to be a little bit careful about floating point issues (note for example the alternative version of the "t" interpolation) but I'm sure someone with more skills can fix this. Obviously the accumulation into "sum" is not awesome for floats if you have a severely oscillating cancelling function (like if you try to integrate a high frequency cosine for example). I suppose the best way would be to use cascaded accumulators (an accumulator per exponent). Anyhoo here it is :


template < typename functor >
double Integrate( double lo, double hi, int steps, functor f )
{
    double sum = 0.0;
    double last_val = f(lo);
    
    double step_size = (hi - lo)/steps;
    
    for(int i=1;i <= steps;i++)
    {
        //double t = lo + i * (hi - lo) / steps;
        double t = ( (steps - i) * lo + i * hi ) / steps;
        
        double cur_val = f(t);
        
        // trapezoid :
        double cur_int = (1.0/2.0) * (cur_val + last_val);
        
        // Simpson :
        // double mid_val = f(t + step_size * 0.5);
        //double cur_int = (1.0/6.0) * (cur_val + 4.0 * mid_val + last_val);
        
        sum += cur_int;
        
        last_val = cur_val;
    }
    
    sum *= step_size;
    
    return sum;
}

BTW I think it might help to use steps = an exact power of two (like 2^16), since that is exactly representable in floats.
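A quick sanity check of the functor (repeated here so the snippet stands alone) : trapezoid on a smooth function should be accurate to roughly step_size^2 :

```cpp
#include <cmath>
#include <cassert>

// Integrate repeated from above so this compiles standalone :
template < typename functor >
double Integrate( double lo, double hi, int steps, functor f )
{
    double sum = 0.0;
    double last_val = f(lo);
    double step_size = (hi - lo)/steps;
    for(int i=1;i <= steps;i++)
    {
        double t = ( (steps - i) * lo + i * hi ) / steps;
        double cur_val = f(t);
        sum += (1.0/2.0) * (cur_val + last_val); // trapezoid
        last_val = cur_val;
    }
    return sum * step_size;
}

const double PI = 3.14159265358979323846;

// usage : Integrate(0.0, PI, 1<<16, sin) should come out very close to 2.0
```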

I also did something that's quite useless but kind of educational. Say you want a log2() and for some reason you don't want to call log(). (note that this is dumb because most platforms have fast native log or good libraries, but whatever, let's continue).

Obviously the floating point exponent is close, so if we factor our number :


X = 2^E * (1 + M)

log2(X) = log2( 2^E * (1 + M) )

log2(X) = log2( 2^E ) + log2(1 + M)

log2(X) = E + log2(1 + M)

log2(X) = E + ln(1 + M) / ln(2)

M in [0,1]

Now, obviously ln(1 + M) is the perfect kind of thing to do a series expansion on.

We know M is "small" so the obvious thing that a junior mathematician would do is the Taylor expansion :


ln(1+x) ~= x - x^2/2 + x^3/3 - x^4/4 + ...

that would be very wrong. There are a few reasons why. One is that our "x" (M) is not actually "small". M is equally likely to be anywhere in [0,1] , and for M -> 1 , the error of this expansion is huge.

More generally, Taylor series are just *NEVER* the right way to do functional approximation. They are very useful mathematical constructs for doing calculus and limits, but they should only be used for solving equations, not in computer science. Way too many people use them for function approximation which is NOT what they do.

If for some reason we want to use a Taylor-like expansion for ln() we can fix it. First of all, we can bring our x into range better.


if ( 1+x > 4.0/3.0 )
{
  E ++;
  x = (1+x)/2 - 1;
}

if (1+x) is large, we divide it by two and compensate in the exponent. Now instead of having x in [0,1] we have x in [-1/3,1/3] which is better.

The next thing you can do is find the optimal last coefficient. That is :


ln(1+x) ~= x - x^2/2 + x^3/3 - x^4/4 + k5 * x^5

for a 5-term polynomial. For x in [-epsilon,epsilon] the optimal value for k5 is 1/5 , the Taylor expansion. But that's not the range of x we're using. We're using either [-1/3,0] or [0,1/3]. The easiest way to find a better k5 is to take the difference from a higher order Taylor :

delta = k5 * x^5 - ( x^5/5 - x^6/6 + x^7/7 )

Integrate delta^2 over [0,1/3] to find the L2 norm error, then take the derivative with respect to k5 to minimize the error. What you get is :

x in [0,1/3] :

k5 = (1/5 - 11/(18*12))

x in [-1/3,0] :

k5 = (1/5 + 11/(18*12))
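Where that k5 comes from, written out (keeping only the x^6 correction term; x^7 and up are dropped) :

minimize over k5 , with a = 1/3 :

  Err(k5) = Integral{0 to a} ( k5 * x^5 - (x^5/5 - x^6/6) )^2 dx

setting d Err / d k5 = 0 :

  k5 * a^11/11 = a^11/55 - a^12/72

  k5 = 1/5 - 11*a/72 = 1/5 - 11/216 = 1/5 - 11/(18*12)

on [-1/3,0] the odd x^11 integral flips sign, hence the "+" version.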

it's intuitive what's happening here; if you just truncate a Taylor series at some order, you're doing the wrong thing. The N-term Taylor series is not the best N-term approximation. What we've done here is found the average of what all the future terms add up to and put them in as a compensation. In particular in the ln case the terms swing back and forth positive,negative, and each one is smaller than the last, so when you truncate you are overshooting the last value, so you need to compensate down in the x > 0 case and up in the x < 0 case.

using k5 instead of 1/5 we improve the error by over 10X :
        
basic    : 1.61848e-008
improved : 1.31599e-009

The full code is :

float log2_improved( float X )
{
    ASSERT( X > 0.0 );
    
    ///-------------
    // approximate log2 by getting the exponent from the float
    //  and then using the mantissa to do a taylor series
    
    // get the exponent :
    uint32 X_as_int = FLOAT_AS_INT(X);
    int iLogFloor = (X_as_int >> 23) - 127;

    // get the mantissa :
    FloatAnd32 fi;
    fi.i = (X_as_int & ( (1<<23)-1 ) ) | (127<<23);
    double frac = fi.f;
    ASSERT( frac >= 1.0 && frac < 2.0 );

    double k5;

    if ( frac > 4.0/3.0 )
    {
        // (frac/2) is closer to 2.0 than frac is
        //  push the iLog up and our correction will now be negative
        // the branch here sucks but this is necessary for accuracy
        //  when frac is near 2.0, t is near 1.0 and the Taylor is totally invalid
        frac *= 0.5;
        iLogFloor++;
        
        k5 = (1/5.0 + 11.0/(18.0*12.0));
    }
    else
    {    
        k5 = (1/5.0 - 11.0/(18.0*12.0));
    }
    
    // X = 2^iLogFloor * frac
    double t = frac - 1.0;

    ASSERT( t >= -(1.0/3.0) && t <= (1.0/3.0) );

    double lnFrac = t - t*t*0.5 + (t*t*t)*( (1.0/3.0) - t*(1.0/4.0) + t*t*k5 );
        
    float approx = (float)iLogFloor + float(lnFrac) * float(1.0 / LN2);

    return approx;
}
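The code above uses FLOAT_AS_INT and FloatAnd32 without defining them; plausible minimal versions (the names and shapes here are assumed from the usage, not from the post) :

```cpp
#include <cstdint>
#include <cstring>

// a float/uint32 pun, shaped to match the usage above (assumed) :
union FloatAnd32
{
    float    f;
    uint32_t i;
};

// bit-cast via memcpy, which avoids strict-aliasing trouble :
static uint32_t FLOAT_AS_INT(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof(u));
    return u;
}
```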

In any case, this is still not right. What we actually want is the best N-term approximation on a certain interval. There's no need to mess about, because that's a well defined thing to find.

You could go at it brute force, start with an arbitrary N-term polynomial and optimize the coefficients to minimize L2 error. But that would be silly because this has all been worked out by mathemagicians in the past. The answer is just the "Shifted Legendre Polynomials" . Legendre Polynomials are defined on [-1,1] ; you can shift them to any range [a,b] that you're working on. They are an orthonormal transform basis for functions on that interval.

The good thing about Legendre Polynomials is that the best coefficients for an N-term expansion in Legendre polynomials are just the Hilbert dot products (integrals) with the Legendre basis functions. Also, if you do the infinite-term expansion in the Legendre basis, then the best expansion in the first N polynomials is just the first N terms of the infinite term expansion (note that this was NOT true with Taylor series). (the same is true of Fourier or any orthonormal series; obviously it's not true for Taylor because Taylor series are not orthonormal - that's why I could correct for higher terms by adjusting the lower terms, because < x^5 * x^7 > is not zero). BTW another way to find the Legendre coefficients is to start with the Taylor series, and then do a least-squares best fit to compensate each lower coefficient for the ones that you dropped off.

(note the standard definition of Legendre polynomials makes them orthogonal but not orthonormal ; you have to correct for their norm as we show below :)

To approximate a function f(t) we just have to find the coefficients : C_n = < P_n * f > / < P_n * P_n >. For our function f = log2( 1 + x) , they are :


0.557304959,
0.492127684,
-0.056146683,
0.007695622,
-0.001130694,
0.000172362

which you could find by exact integration but I got lazy and just did numerical integration. Now our errors are :

basic    : 1.62154e-008
improved : 1.31753e-009
legendre : 1.47386e-009

Note that the legendre error reported here is slightly higher than the "improved" error - that's because the Legendre version just uses the mantissa M directly on [0,1] - there's no branch test with 4/3 and exponent shift. If I did that for the Legendre version then I should do new fits for the ranges [-1/3,0] and [0,1/3] and it would be much much more accurate. Instead the Legendre version just does an unconditional 6-term fit and gets almost the same results.
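The numerical integration that produces those coefficients can be sketched like so - shifted Legendre polynomials via the standard recurrence, and C_n = < P_n * f > / < P_n * P_n > , where < P_n * P_n > on [0,1] is 1/(2n+1) :

```cpp
#include <cmath>
#include <cassert>

// shifted Legendre P~_n on [0,1] via the recurrence
// (k+1) P~_{k+1}(x) = (2k+1)(2x-1) P~_k(x) - k P~_{k-1}(x)
double ShiftedLegendre(int n, double x)
{
    double u = 2.0*x - 1.0;
    if ( n == 0 ) return 1.0;
    double p0 = 1.0, p1 = u;
    for(int k=1; k < n; k++)
    {
        double p2 = ( (2*k+1)*u*p1 - k*p0 ) / (k+1);
        p0 = p1; p1 = p2;
    }
    return p1;
}

// C_n = < P_n * f > / < P_n * P_n > for f(x) = log2(1+x) on [0,1] ;
// midpoint-rule integration, and < P_n * P_n > = 1/(2n+1)
double LegendreCoefficient(int n, int steps)
{
    double sum = 0.0;
    for(int i=0; i < steps; i++)
    {
        double x = (i + 0.5) / steps;
        sum += ShiftedLegendre(n,x) * log2(1.0 + x);
    }
    return (sum / steps) * (2*n + 1);
}
```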

Anyway, like I said don't actually use this for log2 , but these are good things to remember any time you do functional approximation.

5/27/2009

05-27-09 - Optimal Local Bases

I've had this idea forever but didn't want to write about it because I wanted to try it, and I hate writing about things before I try them. Anyway, I'm probably never going to do it, so here it is :

It's obvious that we are now at a point where we could use the actual optimal KLT for 4x4 transforms. That is, for a given image, take all the 4x4 blocks and turn them into a 16-element vector. Do the PCA to find the 16 bases vectors that span that 16-element space optimally. Tack those together to make a 16x16 matrix. This is now your KLT transform for 4x4 blocks.

(in my terminology, the PCA is the solve to find the spanning axes of maximum variance of a set of vectors; the KLT is the orthonormal transform made by using the PCA basis vectors as the rows of a matrix).

BTW there's another option which is to do it separably - a 4-tap optimal horizontal transform and a 4-tap optimal vertical transform. That would give you two 4x4 KLT matrices instead of one 16x16 , so it's a whole lot less work to do, but it doesn't capture any of the shape information in the 2d regions, so I conjecture you would lose almost all of the benefit. If you think about it, there's not much else you can do in a 4-tap transform other than what the DCT or the Hadamard already does, which is basically {++++,++--,-++-,+-+-}.

Now, to transform your image you just take each 4x4 block, multiply it by the 16x16 KLT matrix, and away you go. You have to transmit the KLT matrix, which is a bit tricky. There would seem to be 256 coefficients, but in fact there are only 15+14+13.. = 16*15/2 = 120. This because you know the matrix is a rotation matrix, each row is normal - that's one constraint, and each row is perpendicular to all previous, so the first only has 15 free parameters, the second has 14, etc.
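The PCA setup is just a 16x16 covariance accumulation over all the 4x4 blocks; a sketch of that first step (none of this is from the post; feed the resulting matrix to any symmetric eigensolver and the eigenvectors, largest eigenvalue first, are the rows of the KLT) :

```cpp
#include <cmath>
#include <cassert>
#include <vector>

// accumulate the 16x16 covariance of all 4x4 blocks of an image
// (w and h assumed to be multiples of 4) ; the eigenvectors of this
// matrix are the 16x16 KLT rows
std::vector<double> BlockCovariance(const std::vector<double> & image, int w, int h)
{
    std::vector<double> mean(16, 0.0), cov(16*16, 0.0);
    int count = 0;
    for(int by=0; by+4 <= h; by += 4)
    for(int bx=0; bx+4 <= w; bx += 4, count++)
    {
        // flatten the 4x4 block into a 16-element vector :
        double v[16];
        for(int i=0;i<16;i++)
            v[i] = image[ (by + i/4)*w + bx + (i%4) ];
        for(int i=0;i<16;i++) mean[i] += v[i];
        for(int i=0;i<16;i++)
        for(int j=0;j<16;j++)
            cov[i*16+j] += v[i]*v[j];
    }
    // covariance = E[xy] - E[x]E[y] :
    for(int i=0;i<16;i++)
    for(int j=0;j<16;j++)
        cov[i*16+j] = cov[i*16+j]/count - (mean[i]/count)*(mean[j]/count);
    return cov;
}
```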

If you want to go one step crazier, you could do local adaptive fitting like glicbawls. For each 4x4 block that you want to send, take the blocks in the local neighborhood. Do the PCA to find the KLT, weighting each block by its proximity to the current. Use that optimal local KLT for the current block. The encoder and decoder perform the same optimization, so the basis vectors don't need to be transmitted. This solve will surely be dangerously under-conditioned, so you would need to use a regularizer that gives you the DCT in degenerate cases.

I conjecture that this would rock ass on weird images like "barb" that have very specific weird patterns that are repeated a lot, because a basis vector will be optimized that exactly matches those patterns. But there are still some problems with this method. In particular, 4x4 transforms are too small.

We'd like to go to 8x8, but we really can't. The first problem is that the computation complexity is like (size)^8 , so 8x8 is 256 X slower than 4x4 (a basis vector has (size^2) elements, there are (size^2) basis vectors, and a typical PCA is O(N^2)). Even if speed wasn't a problem though, it would still suck. If we had to transmit the KLT matrix, it would be 64*63/2 = 2016 coefficients to transmit - way too much overhead except on very large images. If we tried to local fit, the problem is there are too many coefficients to fit so we would be severely undertrained.

So our only hope is to use the 4x4 and hope we can fix it using something like the 2nd-pass Hadamard ala H264/JPEG-XR. That might work but it's an ugly heuristic addition to our "optimal" bases.

The interesting thing about this to me is that it's sort of the right way to do an image LZ thing, and it unifies transform coding and context/predict coding. The problem with image LZ is that the things you can match from are an overcomplete set - there are lots of different match locations that give you the exact same current pixel. What you want to do is consider all the possible match locations - merge up the ones that are very similar, but give those higher probability - hmmm.. that's just the PCA!

You can think of the optimal local bases as predictions from the context. The 1st basis is the one we predicted would have most of the energy - so first we send our dot product with that basis and hopefully it's mostly correct. Now we have some residual if that basis wasn't perfect, well the 2nd basis is what we predicted the residual would be, so now we dot with that and send that. You see, it's just like context modeling making a prediction. Furthermore when you do the PCA to build the optimal local KLT, the eigenvalues of the PCA step tell you how much confidence to have in the quality of your prediction - they tell you what probability model to use on each of the coefficients. In a highly predictable area, the 1st eigenvalue will be close to 100% of the energy, so you should code predicting the higher residuals to be zero strongly; in a highly random area, the eigenvalues of the PCA will be almost all the same, so you should expect to code multiple residuals.

5/26/2009

05-26-09 - Some Study of DCT coefficients

First of all : The number of non-zero values in the lower-diagonal area of the 8x8 block after quantization to a reasonable/typical value.

(36 of the 64 values are considered to be "lower diagonal" in the bottom right area) The actual number :


avg : 3.18 : 
 0 : 37.01
 1 : 16.35
 2 :  9.37
 3 :  6.48
 4 :  5.04
 5 :  3.88
 6 :  3.22
 7 :  3.02
 8 :  2.49
 9 :  2.34
10 :  2.09
...

The interesting thing about this is that it has a very flat tail, much flatter than you might expect. For example, if the probability of a given coefficient being zero or nonzero was an independent random event, the distribution would be binomial; it has a flatter peak and then drops to zero much faster :

avg : 3.18 : 
 0 : 13.867587
 1 : 28.162718
 2 : 27.802496
 3 : 17.775123
 4 : 8.272518
 5 : 2.986681
 6 : 0.870503
 7 : 0.210458
 8 : 0.043037
 9 : 0.007553
10 : 0.001150
...

What this tells us is the probability of a given coefficient being zero is highly *not* independent. They are strongly correlated - the more values that are on, the more likely it is that the next will be on. In fact we see that if there are 6 values on, it's almost equally likely there are 7, etc. , that is : P(n)/P(n-1) goes toward 1.0 as n gets larger.

Also, amusingly the first two ratios P(1)/P(0) and P(2)/P(1) are both very close to 0.5 in every image I've tried (in 0.4 to 0.6 generally). What this means is it wouldn't be too awful just to code the # of values on with unary, at least for the first bit (you could use something like an Elias Gamma code which uses unary at first then adds more raw bits).
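An Elias gamma coder like the one suggested is tiny; the generic scheme (not from the post - and note gamma codes n >= 1 , so a count starting at 0 would be sent as n+1) :

```cpp
#include <cstdint>
#include <cassert>
#include <string>

// Elias gamma code for n >= 1 : floor(log2(n)) zeros as a unary length
// prefix, then n in binary MSB-first ; returned as a '0'/'1' string
// purely for illustration
std::string EliasGamma(uint32_t n)
{
    int nbits = 0;
    while( (n >> nbits) > 1 ) nbits++;      // nbits = floor(log2(n))
    std::string out(nbits, '0');            // unary length prefix
    for(int b = nbits; b >= 0; b--)
        out += ((n >> b) & 1) ? '1' : '0';  // n in binary, MSB first
    return out;
}
```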

Now for pretty pictures. Everyone has seen graphics like this, showing the L2 energy of each coefficient in the DCT : (none of these pictures include the DC because it's weird and different)

This shows the percentage of the time the value is exactly zero :

Now for some more interesting stuff. This shows the percent of correlation to the block above the current one : (to the north) :

Note in particular the strong correlation of the first row.

The next one is the correlation to the block to the left (west) :

Finally the fun one. This one shows the correlation of each coefficient to the other 63 coefficients in the same block :

The self-correlation is 100% which makes it a white pixel obviously. Black means 0% correlation. This is absolute-value correlation in all cases (no signs). There are a lot of patterns that should be pretty obvious to the eye. Beware a bit in over-modeling on these patterns because they do change a bit from image to image, but the general trend stays the same.

And another one from a different image :

This one's from Lena. A few things I think are particularly interesting - in the upper left area, which is where most of the important energy is, the correlation is strongest diagonally. That is, you see these "X" shape patterns where the center pixel is correlated mainly to its diagonal neighbors, not the ones directly adjacent to it.

05-26-09 - Storing Floating Points

I rambled a bit before about how to store floating point images . I think I have the awesome solution.

First of all there are two issues :

Goal 1. Putting the floating point data into a format that I can easily run through filters, deltas from previous, etc. Even if you're doing lossless storage you want to be able to delta from previous and have the deltas make sense and preserve the original values.

Goal 2. Quantizing in such a way that the bits are used where the precision is wanted. Obviously you want to be sending bits only for values that are actually possible in the floating point number.

Also to be clear before I start, the issue here is with data that's "true floating point". That is, it's not just data that happens to be in floating point but could be equally well converted to ints inside some [min,max] range. For example, the coordinates of most geometry in video games really isn't meant to be floating point, the extra precision near zero is not actually wanted. The classic example where you actually want floating point is for HDR images where you want a lot of range, and you actually want less absolute precision for higher values. That is, the difference between 999991 and 999992 is not as important as the difference between 1 and 2.

Now, we are going to do some kind of lossy storage. The loss might be very small, but it's kind of silly to talk about storing floating points without talking about lossy storage, because you don't really intend to have 23 bits of mantissa or whatever. To be lossy, we want to just do a simple linear quantization, which means we want to transform into a space where the values have equal significance.

Using some kind of log-scale is the obvious approach. Taking the log transforms the value such that even size steps in log-space are even multipliers in original space. That's good, it's the kind of scaling we want. It means the step from 1 to 2 is the same as the step from 100 to 200. The only problem is that it doesn't match the original floating point representation really, it's more continuous than we need.

What we want is a transform that gives us even size steps within one exponent, and then when the exponent goes up one, the step size doubles. That makes each step of equal importance. So, the quantizer for each exponent should just be Q ~= 2^E.

But that's easy ! The mantissa of the floating point that we already have is already quantized like that. We can get exactly what we want by just pretending our floating point is fixed point !


value = (1 + M)* 2^E

(mantissa with implicit one bit at head, shifted by exponent)

fixed point = {E.M}

That is, just take the exponent's int value and stick it in the significant bits above the mantissa. The mantissa is already quantized for you with the right variable step size. Now you can further quantize to create more loss by right shifting (aka only keeping N bits of the mantissa) or by dividing by any number.

This representation also meets Goal 1 - it's now in a form that we can play with. Note that it's not the same as just taking the bytes of a floating point in memory - we actually have an integer now that we can do math with and it makes sense.


when you cross an exponent :

1.99 = {0.1111}
2.01 = {1.0001}

So you can just subtract those and you get a sensible answer. The steps in the exponent are the correct place value to compensate for the mantissa jumping down. It means we can do things like wavelets and linear predictors in this space.

Now there is a bit of a funny issue with negative numbers vs. negative exponents. First the general solution and what's wrong with it :

Negatives Solution 1 : allow both negative numbers and negative exponents. This creates a "mirrored across zero" precision spectrum. What you do is add E_max to E to make it always positive (just like the IEEE floating point), so the actual zero exponent is biased up. The spectrum of values now looks like :


  (positives)      real step size
  3.M                     /
  2.M                    / 
  1.M                   /
  0.M                  /
 -1.M                 /
 -2.M                /  
 -3.M               /  
(zero)             +    
 -3.M               \   
 -2.M                \  
 -1.M                 \ 
  0.M                  \
  1.M                   \
  2.M                    \
  3.M                     \
  (negatives)

What we have is a huge band of values with very small exponents on each side of the zero. Now, if this is actually what you want, then fine, but I contend that it pretty much never is. The issue is, if you actually have positives and negatives, eg. you have values that cross zero, you didn't really intend to put half of your range between -1 and 1. In particular, the difference between 1 and 2^10 is the same as the difference between 1 and 2^-10. That just intuitively is obviously wrong. If you had a sequence of float values like :

{ 2.0 , 1.8 , 1.4, 1.0, 0.6, 0.2, -0.1, -0.3, -0.6, -1.0, -1.4 }

That looks nice and smooth and highly compressible right? NO ! Hidden in the middle there is a *MASSIVE* step from 0.2 to -0.1 ; that seems benign but it's actually a step past almost your entire floating point range. (BTW you might be thinking - just add a big value to get your original floating point data away from zero. Well, if I did that it would shift where the precision was and throw away lots of bits; if it's okay to add a big value to get away from zero, then our data wasn't "true" floating point data to begin with and you should've just done a [min,max] scale instead).

So I contend that is almost always wrong.

Negatives Solution 2 : allow negative numbers and NOT negative exponents. I contend that you almost never actually want negative exponents. If you do want precision below 1.0, you almost always just want some fixed amount of it - not more and more as you get smaller. That can be represented better just with the bits of the mantissa, or by scaling up your values by some fixed amount before transforming.

To forbid negative exponents we make everything in [0,1.0] into a kind of "denormal". We just give it a linear scale. That is, we make a slightly modified representation :


Stored val = {E.M} (fixed point)

Float val =
    
    if E >= 1 : (1 + M) *2^(E-1)
                makes numbers >= 1.0

    if E == 0 : M
                makes numbers in [0.0,1.0)

(M is always < 1, that is we pretend it has a decimal point all the way to the left)

Now the low values are like :

0    : 0.0000
0.5  : 0.1000
0.99 : 0.1111
1.01 : 1.0001

Of course we can do negative values with this by just putting the sign of the floating point value onto our fixed point value, and crossing zero is perfectly fine.
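Solution 2 in code, roughly - a sketch with a 16-bit mantissa (the function names and the use of frexp/ldexp are my choices, not from the post; sign handling is left off for brevity) :

```cpp
#include <cstdint>
#include <cmath>
#include <cassert>

// {E.M} fixed point with M_BITS of mantissa, no negative exponents :
// E >= 1 : value = (1+M) * 2^(E-1)  ;  E == 0 : value = M (linear "denormal")
const int M_BITS = 16;

uint32_t FloatToEM(double x)  // x >= 0
{
    if ( x < 1.0 )
        return (uint32_t)( x * (1 << M_BITS) );  // E=0 : linear below 1.0
    int e; double m = frexp(x, &e);  // x = m * 2^e , m in [0.5,1)
    double frac = m * 2.0 - 1.0;     // x = (1+frac) * 2^(e-1) , frac in [0,1)
    return ( (uint32_t)e << M_BITS ) | (uint32_t)( frac * (1 << M_BITS) );
}

double EMToFloat(uint32_t v)
{
    uint32_t E = v >> M_BITS;
    double   M = ( v & ((1u << M_BITS) - 1) ) / (double)(1 << M_BITS);
    if ( E == 0 ) return M;
    return ldexp(1.0 + M, (int)E - 1);
}
```

Note the mapping is continuous and monotonic across 1.0 (the E=0 denormal band tops out exactly where E=1 begins), so deltas across that boundary behave sensibly.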

05-26-09 - Automatic Multithreading

Bartosz made an interesting post about extending D for automatic multithreading with some type system additions.

It made me think about how you could do guaranteed-safe multithreading in C++. I think it's actually pretty simple.

First of all, every variable must be owned by a specific "lock". It can only be accessed if the current thread owns that lock. Many of the ownership relationships could be implicit. For example there is an implicit lock for every thread for stuff that is exclusively owned by that thread. That thread always holds that lock, so you never actually generate a lock/unlock, but conceptually those variables still have a single lock that owns them.

So, stack variables for example are implicitly automatically owned by the thread they are made on. Global variables are implicitly owned by the "main thread" lock if not otherwise specified. If some other thread tries to touch a global, it tries to take a lock that it doesn't own and you fail.

 

Lock gate1;

int shared : gate1; // specifies "shared" is accessed by gate1

int global; // implicitly owned by the main thread

void thread1()
{
    int x; // owned by thread1

    x = shared; // fail! you must lock gate1

    {
        lock(gate1);
        x = shared; // ok
    }

    y = global; // fail! you don't own global
}

Mkay that's nice and all. Single threaded programs still just work without any touches, everything is owned by the main thread. Another goal I think of threading syntax additions should be that going from a large single threaded code base to adding a few threading bits should be easy. It is here.

There are a few things you would need to make this really work. One is a clean transfer of ownership method as Bartosz talks about. Something like auto_ptr or unique_ptr, but actually working in the language, so that you can pass objects from one owner to another and ensure that no refs leak out during the passage.

You can also of course extend this if you don't want the constraint that everything is protected by a lock. For example you could easily add "atomic" as a qualifier instead of a lock owner. If something is marked atomic, then it can be accessed without taking a lock, but it's only allowed to be accessed by atomic operations like cmpx/CAS.

This is a nice start, but it doesn't prevent deadlocks and still requires a lot of manual markup.


I also finally read a bit about Sun's Rock. It's very interesting, I encourage you to read more about it at the Sun Scalable Synchronization page.

Rock is actually a lot like LRB in many ways. It's 16 lightweight cores, each of which has 2 hardware threads (they call them strands). It's basically a simple in-order core, but it can do a sort of state-save push/pop and speculative execution. They have cleverly multi-purposed that functionality for both the Transactional Memory, and also just for improving performance of single threaded code. The state-save push-pop is a ghetto form of out-of-order-execution that we all know and love so much. It means that the chip can execute past a branch or something and then if you go the other way on the branch, it pops the state back to the branch. This is just like checkpointing a transaction and then failing the commit !

The key thing for the transactional memory is that Rock is NOT a full hardware transactional memory chip. It provides optimistic hardware transactions with some well designed support to help software transactional memory implementations. The optimistic hardware transactions basically work by failing to commit if any other core touches a cache line you're writing. Thus if you do work in cache lines that you own, you can read data and write it out, it gets flushed out of the cache to the globally consistent view, and there are no problems. If someone touches that cache line it invalidates the transaction - even though it might not necessarily need to. That's what makes it optimistic and not fully correct (there are other things too). If it allows a transaction through, then it definitely was okay to do, but it can fail a transaction that didn't need to fail.

There are a lot of problems with that - it can fail in cases that are perfectly valid transactions, so obviously you cannot rely on it alone as a TM implementation. However, it does let a lot of simple transactions complete successfully and quickly. In particular, for simple transactions with no contention, the optimistic hardware transaction completes no problemo. If it fails, you need to have a fallback mechanism - in which case you should fall back to your real STM implementation, which should have forward progress guarantees. So one way of looking at the HTM in Rock is as just a common case optimization for your STM software. The commit in Rock has a "jump on fail" so that you can provide the failure handling code block; you could jump back to your checkpoint and try again, but eventually you have to do something else.

Perhaps more interestingly, the HTM in Rock is useful in other ways too. It gives you a way to do a more general ll-sc (load linked store conditional) kind of thing, so even if you're not using STM, you can build up larger "atomic" operations for your traditional lock/atomic C++ style multithreading. It can also be used for "lock elision" - avoiding actually doing locks in traditional lock-based code. For example if you have code like :


    lock(CS)

    x = y;

    unlock(CS)

that can be transparently converted to something like :

    checkpoint; // begin hardware transaction attempt

    x = y;

    commit  // try to end hardware transaction, if fails :
    failure {

        lock(CS)
    
        x = y;

        unlock(CS)
    }

So you avoid stalling if it's not needed. There are of course tons of scary subtleties with all this stuff. I'll let the Sun guys work it out.

It's also actually a more efficient way of doing the simple "Interlocked" type atomic ops. On x86 or in a strongly ordered language (such as Java's volatiles) the Interlocked ops are fully sequentially consistent, which means they go in order against each other. That actually is a pretty hard operation. We think of the CMPX or CAS as being very light weight, but it has to sync the current core against all other cores to ensure ops are ordered. (I wrote about some of this stuff before in more detail in the threading articles). The "transaction" hardware with a checkpoint/commit lets your core flow through its ops without doing a big sync on interlocked ops. Now, obviously the checkpoint/commit itself needs to be synchronized against all other cores, but it's done on a per cache-line basis, and it uses the cache line dirtying communication hardware that we need anyway. In particular in the case of no contention or the case of doing multiple interlocked ops within one cache line, it's a big win.

I'm sure that Sun will completely cock things up somehow as they always do, but it is very interesting, and I imagine that most future chips will have models somewhat like this, as we all go forward and struggle with the issue of writing software for these massively parallel architectures.

5/25/2009

05-25-09 - Using DOT graphviz

I wrote a while ago about DOT, the graphviz svg thing.

I thought I'd write up some quick advice on using it, from what I've learned. The manual dotguide.pdf and the online help are okay, so start there, but keep this in mind :

DOT was designed to make graphs for printing on paper. That leads to some weird quirks. For one thing, all the sizes are in *inches*. It also means that all the fitting support and such is not really what you want, so just disable all of that.

I have the best success when I use the Google method (what I stole from pprof in perftools) : basically they just set the font size and then let DOT make all the decisions about layout and sizing. There's one caveat in that : the way that DOT writes its font size is not supported right by FF3. See : 1 or 2 . There's a pretty easy way to fix this : for every font-size attribute, you need to add a "px" on it. So change "font-size:14.00;" to "font-size:14.00px;" . This change needs to be done on the SVG after dot. I do it by running a grep to replace ",fontcolor=blue" with "px". So in the DOT code I make all my text "blue" - it's not actually blue, the grep just changes that to the px I need for my font sizes. That's why you'll see me output text attributes like "fontsize=24,fontcolor=blue".

The other big thing is that DOT seems to have zero edge label layout code. And in fact the edge layout code is a bit weak in general. It's pretty good at sizing the nodes and positioning the nodes - it has lots of fancy code for that, but then the edge labels are just slapped on, and the text will often be in horrible places. The solution to this is just to use edge labels as little as possible and make them short.

Another trick to getting better edge positioning is to explicitly supply the edge node ports and put them on a record. I do this in my Huffman graphs. The other good edge trick is to use edge weight. It doesn't change the appearance of the edge; it changes how the edge is weighted for importance in the layout algorithm, so it becomes straighter and shorter.

For educational purposes I just made a little program to parse MSVC /showIncludes output to DOT :

my dotincludes zip .
dotincludes sample output from running on itself.

Here are some Huffman trees (SVG - you need Firefox 3+ to view them nicely) optimized with various Lagrange multipliers :

Huff 1
Huff 2
Huff 3

Remember it's control-wheel to zoom SVG's in FF.

5/22/2009

05-22-09 - A little more Huffman

.. okay since I got started on this, a few other notes. This is ancient stuff, in fact it's been in the "Huffman2" in crblib for 10+ years, and it used to be common knowledge, but it's some of that art that's sort of drifted away.

For one thing, Huffman codes are specified only by the code lens. You don't need to store the bit patterns. Of course then you need a convention for how the bit patterns are made.

The "canonical" (aka standard) way to make the codes is to assign numerically increasing codes in order to numerically increasing symbols of the same code length. It helps to think about the bit codes as if the top bit is transmitted first, even if you don't actually do that. That is, if the Huffman code is 011, that means first we send a 0 bit, then two ones, and that has value "3" in the integer register.

You can create the canonical code from the lengths in O(N) for N symbols using a simple loop like this :


nextCode = 0;
for( len between min and max )
{
    codePrefix[len] = nextCode;
    nextCode = (nextCode + numCodesOfLen[len]) << 1;
}

for( all symbols )
{
    code[sym] = codePrefix[ codeLen[sym] ];
    codePrefix[ codeLen[sym] ] ++;
}

this looks a bit complex, but it's easy to see what's going on. If you imagine that you had your symbols sorted first by code length, and then by symbol Id, then you are simply handing out codes in numerical order, and shifting left one each time the code length bumps up. That is :

curCode = 0;
for( all symbols sorted by codelen )
{
    code[sym] = curCode;
    curCode ++;
    curCode <<= codeLen[sym+1] - codeLen[sym]; // (no shift after the last symbol)
}

An example :

code lens :

c : 1
b : 2
a : 3
d : 3

[code=0]

c : [0]
    code ++ [1]
    len diff = (2-1) so code <<= 1 [10]
b : [10]
    code ++ [11]
    len diff = (3-2) so code <<= 1 [110]
a : [110]
    code ++ [111]
    no len diff
d : [111]

very neat.

The other thing is to decode you don't actually need to build any tree structure. Huffman trees are specified entirely by the # of codes of each length, so you only need that information, not all the branches. It's like a balanced heap kind of a thing; you don't actually want to build a tree structure, you just store it in an array and you know the tree structure.

You can do an O(H) (H is the entropy, which is the average # of bits needed to decode a symbol) decoder using only 2 bytes per symbol (for 12 bit symbols and max code lengths of 16). (note that H <= log(N) - a perfectly balanced alphabet is the slowest case and it takes log(N) bits).


a node is :

Node
{
  CodeLen : 4 bits
  Symbol  : 12 bits
}

You just store them sorted by codelen and then by code within that len (aka "canonical Huffman order"). So they are sitting in memory in the same order you would draw the tree :

0 : c
10 : b
11 : a

would be

[ 1 : c ]
[ 2 : b ]
[ 2 : a ]

then the tree descent is completely determined, you don't need to store any child pointers, the bits that you read are the index into the tree. You can find the node you should be at any time by doing :

index = StartIndex[ codeLen ] + code;

StartIndex[codeLen] contains the index of the first code of that length *minus* the value of the first code bits at that len. You could compute StartIndex in O(1) , but in practice you should put them in a table; you only need 16 of them, one for each code len.

So the full decoder looks like :


code = 0;
codeLen = 0;

for(;;)
{
    code <<= 1;
    code |= readbit();
    codeLen ++;

    if ( Table[ StartIndex[ codeLen ] + code ].codeLen == codeLen )
        return Table[ StartIndex[ codeLen ] + code ].Symbol
}

obviously you can improve this in various ways, but it's interesting to understand the Huffman tree structure this way.

The thing is if you just list out the huffman codes in order, they numerically increase :


0
10
110
111

= 0, 2, 6,7

If you have a prefix that doesn't yet decode to anything, eg if you have "11" read so far - that will always be a code value that numerically doesn't exist in the table.

Finally, if your symbols actually naturally occur in sorted order, eg. 0 is most likely, then 1, etc.. (as they do with AC coefficients in the DCT which have geometric distribution) - then you can do a Huffman decoder that's O(H) with just the # of codes of each length !

05-22-09 - Fast Means

I know of three fast ways to track the mean of a variable. At time t you observe a value x_t. You want to track the mean of x over time quickly (eg. without a divide!). I generally want some kind of local mean, not the full mean since time 0.

1. "weighted decay". You want to do something like M_t = 0.9 * M_t-1 + 0.1 * x_t;


int accum = 0;

accum = ((15 * accum)>>4) + x_t;

mean = (accum>>4);

This is the same as doing the true (total/count) arithmetic mean, except that once count gets to some value (here 16), instead of doing total += x_t, you do total = (15/16) * total + x_t, so that count sticks at 16.

2. "sliding window". Basically just track a window of the last N values, then you can add them to get the local mean. If N is power of 2 you don't divide. Of course the fast way is to only track inputs & exits, don't do the sum all the time.


int window[32];
int window_i = 0;
int accum = 0;

accum += x_t - window[window_i];
window[window_i] = x_t;
window_i = (window_i + 1) & 0x1F;

mean = (accum>>5);

3. "deferred summation". This is the technique I invented for arithmetic coding long ago (I'm sure it's an old technique in other fields). Basically you just do a normal arithmetic mean, but you wait until count is a power of two before doing the divide.


int accum = 0;
int next_shift = 3;
int next_count = 1 << next_shift;

accum += x_t;

if ( --next_count == 0 )
{
    mean = accum >> next_shift;
    accum = 0;
    next_shift = MIN( 10 , next_shift+1 );
    next_count = 1 << next_shift;
}

mean is not changing during each chunk, basically you use the mean from the last chunk and wait for another power-of-2 group to be done before you update mean. (also this codelet is resetting accum to zero each time but in practice you probably want to carry over some of the last accum).

For most purposes "deferred summation" is actually really bad. For the thing I invented it for - order 0 coding of stable sources - it's good, but that application is rare and not useful. For real context coding you need fast local adaptation and good handling of sparse contexts with very few statistics, so delaying until the next pow-2 of symbols seen is not okay.

The "weighted decay" method has a similar sort of flaw. The problem is you only have one tune parameter - the decay multiplier (which here became the shift down amount). If you tune it one way, it adapts way too fast to short term fluctuations, but if you tune it the other way it clings on to values from way way in the past much too long.

The best almost always is the sliding window, but the sliding window is not exactly cheap. It takes a lot of space, and it's not awesomely fast to update.
