6/22/2009

06-22-09 - Redraw Dilemma

This apartment searching is really annoying me. I can't handle having "many balls in the air" ; when I put something on my todo list, I like to work at it until it's gone. God I fucking hate shit on my todo list (the fucking health care keeps reinserting itself on my todo list and it's pissing me off; they got me again today with some billing fuckup, but I digress...).

Anyway, it's reminding me of a concept I often think about. I'll call it "the redrawer's dilemma" but there must be a better/standard name for this.

The hypothetical game goes something like this :

You are given a bag with 100 numbers in it. You know the numbers are in [0,1000] but don't know how many of each number there are in the bag. You start by drawing a random number from the bag.

At each turn of play, you can either keep your current number (in which case that is your final score), or you can put your current number back in the bag and draw again, but drawing again costs you 1, which will be subtracted from your final score.

How do you play this game optimally?
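
Just to make the tradeoff concrete, here's a toy sketch of the threshold logic under the big simplifying assumption that the bag is uniform on [0,1000] (the real game has an unknown distribution, which is the whole point) - redraw while the expected gain of one more draw beats the -1 cost :

#include <stdio.h>

int main()
{
    // for X uniform on [0,1000] , E[ max(X - v, 0) ] = (1000-v)^2 / 2000 ;
    // keep redrawing while that expected gain exceeds the redraw cost of 1
    for (int v = 0; v <= 1000; v++)
    {
        double expectedGain = (1000.0 - v)*(1000.0 - v) / 2000.0;
        if ( expectedGain < 1.0 )
        {
            printf("stand on anything >= %d\n", v); // prints 956 under this assumption
            break;
        }
    }
    return 0;
}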

There are two things that are interesting to me about this game in real life. One is that humans almost always play it incredibly badly, and the second is that when you finally decide to stop redrawing you're almost always unhappy about it (unless you got super lucky and drew a 900+ number).

The two classic human player errors in this game are the "I just started drawing, I shouldn't stand yet" and the "I can't stop now, I already passed on something better than this". The "I just started drawing, I shouldn't stand yet" guy draws something like an 800 on one of his early draws. He thinks dang that's really good, but maybe this bag just has lots of high numbers in it, I just started drawing, I should put some time into it. Now of course that reasoning is based in correct logic - if you have reason to believe that your chance of drawing higher is good enough to merit the cost of continued looking, then yes, do so, but just drawing more because "it's early" makes no sense - the game is totally non temporal, the cost of continuing drawing doesn't go up over time. This often leads into the "I can't stop now, I already passed on something better than this" guy, who's mainly motivated by pride and shame - he doesn't want to admit to himself that he made a big mistake passing early when he got a high number, so he has to keep drawing until he gets something better. He might draw an 800, then a whole mess of single digit numbers and he's thinking "oh fuck I blew it" and then he draws a 400. At that point he should stand and quit redrawing, but he can't, so he draws again.

The thing is, even if he played correctly and just took the 400 after passing on the 800, he would be really unhappy about it. And if the early termination guy played correctly and just got an early 800 and didn't draw, he would be unhappy too, because he'd always be wondering if he could've done better.

The other game theory / logical fallacy that plagues me in these kinds of things is "I'm already spending X, I may as well spend a little more than X". First I was looking for places around $1500, then I bumped it to $1700, then $1900. Now I'm looking at places for $2500 cuz fuck it they're nicer and I was looking at places for $2000 so it's only $500 more.

In other news, hotpads is actually a pretty cool apartment search site. It seems they are just scraping craigslist and maybe some other classifieds sites, so it's not like they have anything new, but the map interface and search features and such are solid. One thing is really annoying me about it though - the wheel zooming in the map is totally broken, I keep trying to wheel zoom and it sends the map off to never never land. Urg!

In more random news, I've really enjoyed the "Wallander" series on PBS ; the crime stories are pretty dumb/ridiculous, but I like the muddled contemplative pace of it, and the washed out monochrome color palette.

6/21/2009

06-21-09 - Fast Exp & Log

So in an earlier post I wrote about approximation of log2 and Ryg commented with links to Robin Green's great GDC 2003 talk : part1 (pdf) and part2 (pdf) ( main page here ).

It's mostly solid, but in part 2 around page 40 he talks about "fastexp" and "bitlog" and my spidey senses got tingling. Either I don't understand, or he was just smoking crack through that section.

Let's look at "bitlog" first. Robin writes it very strangely. He writes :


A Mathematical Oddity: Bitlog
  A real mathematical oddity
  The integer log2 of a 16 bit integer
  Given an N-bit value, locate the leftmost nonzero bit.
  b = the bitwise position of this bit, where 0 = LSB.
  n = the NEXT three bits (ignoring the highest 1)

    bitlog(x) = 8x(b-1) + n

  Bitlog is exactly 8 times larger than log2(x)-1

Bitlog Example
 For example take the number 88
88 = 1011000
b = 6th bit
n = 011 = 3
bitlog(88) = 8*(6-1)+3
= 43
  (43/8)+1 = 6.375
  Log2(88) = 6.4594
  This relationship holds down to bitlog(8)

Okay, I just don't follow. He says it's "exact" but then shows an example where it's not exact. He also subtracts off 1 and then just adds it back on again. Why would you do this :

    bitlog(x) = 8x(b-1) + n

  Bitlog is exactly 8 times larger than log2(x)-1

When you could just say :

    bitlog(x) = 8xb + n

  Bitlog is exactly 8 times larger than log2(x)

??? Weird.

Furthermore this seems neither "exact" nor an "oddity". Obviously the position of the MSB is the integer part of the log2 of a number. As for the fractional part of the log2, this is not a particularly good way to get it. Basically what's happening here is he takes the next 3 bits and uses them for linear interpolation to the next integer.

Written out verbosely :


x = int to get log2 of
b = the bitwise position of top bit, where 0 = LSB.

x >= (1 << b) && x < (2 << b)

fractional part :
f = (x - (1 << b)) / (1 << b)

f >= 0 && f < 1

x = 2^b * (1 + f)

correct log2(x) = b + log2(1+f)

approximate with b + f

note that "f" and "log2(1+f)" both go from 0 to 1, so it's exact at the endpoints
but wrong in the middle

So far as I can tell, Robin's method is actually like this :

uint32 bitlog_x8(uint32 val)
{
    if ( val <= 8 )
    {
        static const uint32 c_table[9] = { (uint32)-1 , 0, 8, 13, 16, 19, 21, 22, 24 };
        return c_table[val];
    }
    else
    {
        unsigned long index;
        
        _BitScanReverse(&index,(unsigned long)val);
    
        ASSERT( index >= 3 );
    
        uint32 bottom = (val >> (index - 3)) & 0x7;
        uint32 blog = (index << 3) | bottom;

        return blog;
    }
}

where I've removed the weird offsets of 1 and this just returns log2 times 8. You need the check for val <= 8 because shifting by negative amounts is fucked.

But you might well ask - why only use 3 bits ? And in fact you're right, I see no reason to use only 3 bits. In fact we can do a fixed point up to 27 bits : (we need to save 5 bits at the top to store the max possible integer part of the log2)


float bitlogf(uint32 val)
{
    unsigned long index;
    
    _BitScanReverse(&index,(unsigned long)val);

    uint32 vv = (val << (27 - index)) + ((index-1) << 27);

    return vv * (1.f/134217728); // 134217728 = 2^27
}

what we've done here is find the pos of the MSB, shift val up so the MSB is at bit 27, then we add the index of the MSB (we subtract one because the MSB itself starts the counting at one in the 27th bit pos). This makes a fixed point value with 27 bits of fractional part; the bits below the MSB act as the fractional bits. We scale to return a float, but you could of course do this with any # of fixed point bits and return a fixed point int.

But of course this is exactly the same kind of thing done in an int-to-float so we could use that too :


float bitlogf2(float fval)
{
    FloatAnd32 fi;
    
    fi.f = fval;
    
    float vv = (float) (fi.i - (127 << 23));
    
    return vv * (1.f/8388608); // 8388608 = 2^23
}

which is a lot like what I wrote about before. The int-to-float does the exact same thing we did manually above, finding the MSB and making the log2 and fractional part.
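
(The FloatAnd32 union and the ftoi / FLOAT_AS_INT helpers that show up in these snippets are never defined in the posts; presumably they're something like this - an assumed sketch, not the author's actual code :)

typedef unsigned int uint32;
typedef int          int32;

union FloatAnd32
{
    float f;
    int32 i;
};

inline uint32 FLOAT_AS_INT(float x)
{
    FloatAnd32 fi; fi.f = x; return (uint32) fi.i;
}

inline int ftoi(float x)  // truncating float-to-int, however you like to do it
{
    return (int) x;
}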

One note - all of these versions are exact for the true powers of 2, and they err consistently low for all other values. If you want to minimize the maximum error, you should bias them.

The maximum error of ( log2( 1 + f) - f ) occurs at f = ( 1/ln(2) - 1 ) = 0.442695 ; that error is 0.08607132 , so the correct bias is half that error : 0.04303566
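
So for example a bias-corrected bitlogf2 is just (using the FloatAnd32 sketch above; this variant isn't in the post) :

float bitlogf2_biased(float fval)
{
    FloatAnd32 fi;
    fi.f = fval;
    
    float vv = (float) (fi.i - (127 << 23));
    
    // add half the max error of the log2(1+f) ~= f approximation
    // so we straddle the true value instead of erring consistently low
    return vv * (1.f/8388608) + 0.04303566f;
}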

Backing up in Robin's talk we can now talk about "fastexp". "fastexp" is doing "e^x" by using the floating point format again, basically he's just sticking x into the exponent part to get the int-to-float to do the 2^x. To make it e^x instead of 2^x you just scale x by 1/ln(2) , and again we use the same trick as with bitlog : we can do exact integer powers of two, to get the values in between we use the fractional bits for linear interpolation. Robin's method seems sound, it is :


float fastexp(float x)
{
    int i = ftoi( x * 8.f );
        
    FloatAnd32 f;
    f.i = i * 1512775 + (127 << 23) - 524288;
    
    // 1512775 = (2^20)/ln(2)
    // 524288 = 0.5*(2^20)

    return f.f;
}

for 3 bits of fractional precision. (note that Robin says to bias with 0.7*(2^20) ; I don't know where he got that; I get minimum relative error with 0.5).

Anyway, that's all fine, but once again we can ask - why just 3 bits? Why not use all the bits of x as fractional bits? And if we put the multiply by 1/ln(2) in the float math before we convert to ints, it would be more accurate.

What we get is :


float fastexp2(float x)
{
    // 12102203.16156f = (2^23)/ln(2)
    int i = ftoi( x * 12102203.16156f );
    
    FloatAnd32 f;
    f.i = i + (127 << 23) - 361007;
    
    // 361007 = (0.08607133/2)*(2^23)

    return f.f;
}

and indeed this is much much more accurate. (max_rel_err = 0.030280 instead of 0.153897 - about 5X better).

I guess Robin's fastexp is preferable if you already have your "x" in a fixed point format with very few fractional bits (3 bits in that particular case, but it's good for <= 8 bits). The new method is preferred if you have "x" in floating point or if "x" is in fixed point with a lot of fractional bits (>= 16).

ADDENDUM :

I found the Google Book where bitlog apparently comes from; it's Math toolkit for real-time programming By Jack W. Crenshaw ; so far as I can tell this book is absolute garbage and that section is full of nonsense and crack smoking.

ADDENDUM 2 :

it's obvious that log2 is something like :


x = 2^I * (1+f)

(I is an int, f is the mantissa)

log2(x) = I + log2(1+f)

log2(1+f) = f + f * (1-f) * C

We've been using log2(1+f) ~= f , but we know that's exact at the ends and wrong in the middle,
so obviously we should add a term that humps in the middle.

If we solve for C we get :

C = ( log2(1+x) - x ) / ( x*(1-x) )

Integrating on [0,1] gives C = 0.346573583
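
(A quick numerical check of that constant - my own throwaway code, midpoint rule to dodge the 0/0 at the endpoints :)

#include <math.h>
#include <stdio.h>

int main()
{
    const int steps = 1000000;
    double sum = 0.0;
    for (int i = 0; i < steps; i++)
    {
        double x = (i + 0.5) / steps;               // midpoint of each sub-interval
        double num = log(1.0 + x)/log(2.0) - x;     // log2(1+x) - x
        sum += num / ( x * (1.0 - x) );
    }
    printf("C = %.9f\n", sum / steps);  // ~0.346573590 , which is ln(2)/2
    return 0;
}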

hence we can obviously do a better bitlog something like :

float bitlogf3(float fval)
{
    FloatAnd32 fi;
    
    fi.f = fval;
    
    float vv = (float) (fi.i - (127<<23));
    
    vv *= (1.f/8388608);
    
    //float frac = vv - ftoi(vv);
    
    fi.i = (fi.i & 0x7FFFFF) | (127<<23);
    
    float frac = fi.f - 1.f;
    
    const float C = 0.346573583f;
        
    return vv + C * frac * (1.f - frac);
}

6/17/2009

06-17-09 - Inverse Box Sampling - Part 1.5

In the previous post we attacked the problem :

If you are given a low res signal L and a known down-sampler D() (in particular, box down sampling), find an up sampler U() such that :

L = D ( U( L ) )

and U( L ) is as close as possible to the actual high res signal that L was made from (unknown).

I'm also interested in the opposite problem :

If you are given a high res signal H, and a known up-sampler U() (in particular, bilinear filtering), find a down sampler D() such that :

E = ( H - U( D( H ) ) )^2 is minimized

This is a much more concrete and tractable problem. In particular in games/3d we know we are forced to use bilinear filtering as our up-sampler. If you use box down-sampling for D() as many people do, that's horrible, because bilinear filtering and box-downsampling are both interpolating and variance reducing. That is, they both take noisy signals and force them towards gray. If you know that U() is going to be bilinear filtering, then you should use a D() that compensates for that. It's intuitively obvious that D should be something a bit like a sinc to bring in some neighbors with negative lobes to compensate for the blurring aspect of bilinear upsample, but what exactly I don't know yet.

(note that this is a different problem than making mips - in making mips you are actually going to be viewing the mip at a 1:1 resolution, it will not be upsampled back to the original resolution; you would use this if you were trying to substitute a lower res texture for a higher one).

I haven't tried my hand at solving this yet, maybe it's been done? Much like the previous problem, I'm surprised this isn't something well known and standard, but I haven't found anything on it.

06-17-09 - DXTC More Followup

I finally came back to DXTC and implemented some of the new slightly different techniques. ( summary of my old posts )

See the : NVidia Article or NVTextureTools Wiki for details.

Briefly :

DXT1 = my DXT1 encoder with annealing. (version reported here is newer and has some more small improvements; the RMSE's are slightly better than last time). DXT1 is 4 bits per pixel (bpp)

Humus BC4BC5 = Convert to YCoCg, Put Y in a single-channel BC4 texture (BC4 = the alpha part of DXT5, it's 4 bpp). Put the CoCg in a two-channel BC5 texture - downsampled by 2X. BC5 is two BC4's stuck together; BC5 is 8 bpp, but since it's downsampled 2x, this is 2bpp per original pixel. The net is a 6 bpp format

DXT5 YCoCg = the method described by JMP and Ignacio. This is 8 bpp. I use arbitrary CoCg scale factors, not the limited ones as in the previously published work.
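
For reference, the YCoCg transform both of those start from looks like this (my sketch of the standard transform in floats; the actual shaders add the scale/bias and packing details) :

void RGB_to_YCoCg(float R, float G, float B, float * Y, float * Co, float * Cg)
{
    *Y  =  0.25f*R + 0.5f*G + 0.25f*B;
    *Co =  0.5f*R - 0.5f*B;
    *Cg = -0.25f*R + 0.5f*G - 0.25f*B;
}

void YCoCg_to_RGB(float Y, float Co, float Cg, float * R, float * G, float * B)
{
    *R = Y + Co - Cg;
    *G = Y + Cg;
    *B = Y - Co - Cg;
}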


Here are the results in RMSE (per pixel) : (modified 6-19 with new better results for Humus from improved down filter)

name DXT1 Humus DXT5-YCoCg
kodim01.bmp 8.2669 3.9576 3.8355
kodim02.bmp 5.2826 2.7356 2.643
kodim03.bmp 4.644 2.3953 2.2021
kodim04.bmp 5.3889 2.5619 2.4477
kodim05.bmp 9.5739 4.6823 4.5595
kodim06.bmp 7.1053 3.4543 3.2344
kodim07.bmp 5.6257 2.6839 2.6484
kodim08.bmp 10.2165 5.0581 4.8709
kodim09.bmp 5.2142 2.519 2.4175
kodim10.bmp 5.1547 2.5453 2.3435
kodim11.bmp 6.615 3.1246 2.9944
kodim12.bmp 4.7184 2.2811 2.1411
kodim13.bmp 10.8009 5.2525 5.0037
kodim14.bmp 8.2739 3.9859 3.7621
kodim15.bmp 5.5388 2.8415 2.5636
kodim16.bmp 5.0153 2.3028 2.2064
kodim17.bmp 5.4883 2.7981 2.5511
kodim18.bmp 7.9809 4.0273 3.8166
kodim19.bmp 6.5602 3.2919 3.204
kodim20.bmp 5.3534 3.0838 2.6225
kodim21.bmp 7.0691 3.5069 3.2856
kodim22.bmp 6.3877 3.5222 3.0243
kodim23.bmp 4.8559 3.045 2.4027
kodim24.bmp 8.4261 5.046 3.8599
clegg.bmp 14.6539 23.5412 10.4535
FRYMIRE.bmp 6.0933 20.0976 5.806
LENA.bmp 7.0177 5.5442 4.5596
MONARCH.bmp 6.5516 3.2012 3.4715
PEPPERS.bmp 5.8596 4.4064 3.4824
SAIL.bmp 8.3467 3.7514 3.731
SERRANO.bmp 5.944 17.4141 3.9181
TULIPS.bmp 7.602 3.6793 4.119
lena512ggg.bmp 4.8137 2.0857 2.0857
lena512pink.bmp 4.5607 2.6387 2.3724
lena512pink0g.bmp 3.7297 3.8534 3.1756
linear_ramp1.BMP 1.3488 0.8626 1.1199
linear_ramp2.BMP 1.2843 0.7767 1.0679
orange_purple.BMP 2.8841 3.7019 1.9428
pink_green.BMP 3.1817 1.504 2.7461


And here are the results in SSIM :

Note this is an "RGB SSIM" computed by doing :

SSIM_RGB = ( SSIM_R * SSIM_G ^2 * SSIM_B ) ^ (1/4)

That is, G gets 2X the weight of R & B. The SSIM is computed at a scale of 6x6 blocks which I just randomly picked out of my ass.

I also convert the SSIM to a "percent similar". The number you see below is a percent - 100% means perfect, 0% means completely unrelated to the original (eg. random noise gets 0%). This percent is :

SSIM_Percent_Similar = 100.0 * ( 1 - acos( ssim ) * 2 / PI )

I do this because the normal "ssim" is like a dot product, and showing dot products is not a good linear way to show how different things are (this is the same reason I show RMSE instead of PSNR like other silly people). In particular, when two signals are very similar, the "ssim" gets very close to 0.9999 very quickly even though the differences are still pretty big. Almost any time you want to see how close two vectors are using a dot product, you should do an acos() and compare the angle.
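
(The two formulas above as code, just so there's no ambiguity - acos in radians :)

#include <math.h>

double SSIM_RGB(double ssimR, double ssimG, double ssimB)
{
    // G gets 2X the weight of R & B
    return pow( ssimR * ssimG*ssimG * ssimB, 0.25 );
}

double SSIM_PercentSimilar(double ssim)
{
    const double PI = 3.14159265358979323846;
    return 100.0 * ( 1.0 - acos(ssim) * 2.0 / PI );
}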

name DXT1 Humus DXT5-YCoCg
kodim01.bmp 84.0851 92.6253 92.7779
kodim02.bmp 82.2029 91.7239 90.5396
kodim03.bmp 85.2678 92.9042 93.2512
kodim04.bmp 83.4914 92.5714 92.784
kodim05.bmp 83.6075 92.2779 92.4083
kodim06.bmp 85.0608 92.6674 93.2357
kodim07.bmp 85.3704 93.2551 93.5276
kodim08.bmp 84.5827 92.4303 92.7742
kodim09.bmp 84.7279 92.9912 93.5035
kodim10.bmp 84.6513 92.81 93.3999
kodim11.bmp 84.0329 92.5248 92.9252
kodim12.bmp 84.8558 92.8272 93.4733
kodim13.bmp 83.6149 92.2689 92.505
kodim14.bmp 82.6441 92.1501 92.1635
kodim15.bmp 83.693 92.0028 92.8509
kodim16.bmp 85.1286 93.162 93.6118
kodim17.bmp 85.1786 93.1788 93.623
kodim18.bmp 82.9817 92.1141 92.1309
kodim19.bmp 84.4756 92.7702 93.0441
kodim20.bmp 87.0549 90.5253 93.2088
kodim21.bmp 84.2549 92.2236 92.8971
kodim22.bmp 82.6497 91.0302 91.9512
kodim23.bmp 84.2834 92.4417 92.4611
kodim24.bmp 84.6571 92.3704 93.2055
clegg.bmp 77.4964 70.1533 83.8049
FRYMIRE.bmp 91.3294 72.2527 87.6232
LENA.bmp 77.1556 80.7912 85.2508
MONARCH.bmp 83.9282 92.5106 91.6676
PEPPERS.bmp 81.6011 88.7887 89.0931
SAIL.bmp 83.2359 92.4974 92.4144
SERRANO.bmp 89.095 75.7559 90.7327
TULIPS.bmp 81.5535 90.8302 89.6292
lena512ggg.bmp 86.6836 95.0063 95.0063
lena512pink.bmp 86.3701 92.1843 92.9524
lena512pink0g.bmp 89.9995 79.9461 84.3601
linear_ramp1.BMP 92.1629 94.9231 93.5861
linear_ramp2.BMP 92.8338 96.1397 94.335
orange_purple.BMP 89.0707 91.6372 92.1934
pink_green.BMP 87.4589 93.5702 88.4219


Conclusion :

DXT5 YCoCg and "Humus" are both significantly better than DXT1.

Note that DXT5-YCoCg and "Humus" encode the luma in exactly the same way. For gray images like "lena512ggg.bmp" you can see they produce identical results. The only difference is how the chroma is encoded - either a DXT1 block (+scale) at 4 bpp, or a downsampled 2X BC4 block at 2 bpp.

In RGB RMSE , DXT5-YCoCg is measurably better than Humus-BC4BC5 , but in SSIM they are nearly identical. This is because almost all of the RMSE loss in Humus comes from the YCoCg lossy color conversion and the CoCg downsampling. The actual BC4BC5 compression is very near lossless. (as much as I hate DXT1, I really like BC4 - it's very easy to produce near optimal output, unlike DXT1 where you have to run a really fancy compressor to get good output). The CoCg loss hurts RMSE a lot, but doesn't hurt actual visual quality or SSIM much in most cases.

In fact on an important class of images, Humus actually does a lot better than DXT5-YCoCg. That class is simple smooth ramp images, which we use very often in the form of lightmaps. The test images at the bottom of the table (linear_ramp and pink_green) show this.

On a few images where the CoCg downsample kills you, Humus does very badly. It's bad on orange_purple because that image is specifically designed to be primarily in Chroma not Luma ; same for lena512pink0g.bmp ; note that normal chroma downsampling compressors like JPEG have this same problem. You could in theory choose a different color space for these images and use a different reconstruction shader.

Since Humus is only 6 bpp, size is certainly not a reason to prefer DXT1 over it. However, it does require two texture fetches in the shader, which is a pretty big hit. (BTW the other nice thing about Humus is that it's already down-sampled in CoCg, so if you are using something like a custom JPEG in YCoCg space with downsampled CoCg - you can just directly transcode that into Humus BC4BC5, and there's no scaling up or down or color space changes in the realtime recompress). I think this is probably what will be in Oodle because I really can't get behind any other realtime recompress.


I also tried something else, which is DXT1 optimized for SSIM. The idea is to use a little bit of neighbor information. The thing is, in my crazy DXT1 encoder, I'm just trying various end points and measuring the quality of each choice. The normal thing to do is to just take the MSE vs the original, but of course you could do other error metrics.

One such error metric is to decompress the block you're working on into its context - decompress into a chunk of neighbors that have already been DXT1 compressed & decompressed as well. Then compare that block and its neighbors to the original image in that neighborhood. In my case I used 2 pixels around the block I was working on, making a total region of 8x8 pixels (with the 4x4 DXT1 block in the middle).

You then compare the 8x8 block to the original image and try to optimize that. If you just used MSE in this comparison, it would be the same as before, but you can use other things. For example, you could add a term that penalizes not changes in values, but changes in *slope*.
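
To be concrete, one hypothetical version of that slope idea in 1d (none of this is from the post, and lambda is a knob you'd have to tune) :

double ValueAndSlopeError(const unsigned char * orig, const unsigned char * dec,
                          int count, double lambda)
{
    // squared error on values :
    double err = 0.0;
    for (int i = 0; i < count; i++)
    {
        double d = (double) orig[i] - dec[i];
        err += d * d;
    }
    // plus a weighted squared error on adjacent deltas ("slope") :
    for (int i = 0; i+1 < count; i++)
    {
        double sOrig = (double) orig[i+1] - orig[i];
        double sDec  = (double) dec[i+1]  - dec[i];
        double d = sOrig - sDec;
        err += lambda * d * d;
    }
    return err;
}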

Another approach would be to take the DCT of the 8x8 block and the DCT of the 8x8 original. If you then just take the L2 difference in DCT domain, that's no different than the original method, because the DCT is unitary. But you can apply non-uniform quantizers at this step using the JPEG visual quantization weights.

The approach I used was to use SSIM (using a 4x4 SSIM block) on the 8x8 windows. This means you are checking the error not just on your block, but on how your block fits into the neighborhood.

For example if the original image is all flat color - you want the output to be all flat color. Just using MSE won't give you that, eg. MSE considers 4444 -> 3535 to be just as good as 4444 -> 5555 , but we know the latter is better.

This does in fact produce slightly better looking images - it hurts RMSE of course because you're no longer optimizing for RMSE.

06-17-09 - Inverse Box Sampling - Part 2

Okay, in Part 1.5 I asked about the downsample that was the best inverse of bilinear upsampling. I have a solution that pleases me.

Sean reminded me that he tackled this before; I dunno if he has any notes about it on the public net, he can link them. His basic idea was to do a full solve for the entire down-sampled image. It's quite simple if you think about it. Consider the case of 2X up & down sampling. The bilinear filter upsample will make a high res image where each pixel is a simple linear combo of 4 low res pixels. You take the L2 error :

E = Sum[all high res pixel] ( Original - Upsampled ) ^2

For Sean's full solution approach, you set Upsampled = Bilinear_Upsample( X) , and just solve this for X without any assumption of how X is made from Original. For an N-pixel low res image you have 4N error terms, so it's plenty dense (you could also artificially regularize it more by starting with a low res image that's equal to the box down-sample, and then solve for the deltas from that, and add an extra "Tikhonov" regularization term that presumes small deltas - this would fix any degenerate cases).

I didn't do that. Instead I assumed that I want a discrete local linear filter and solved for what it should be.

A discrete local linear filter is just a bunch of coefficients. It must be symmetric, and it must sum to 1.0 to be mean-preserving (flat source should reproduce flat exactly). Hence it has the form {C2,C1,C0,C0,C1,C2} with C0+C1+C2 = 1/2. (this example has two free coefficients). Obviously the 1-wide case must be {0.5,0.5} , then you have {C1,0.5-C1,0.5-C1,C1} etc. as many taps as you want. You apply it horizontally and then vertically. (in general you could consider asymmetric filters, but I assume H & V use the same coefficients).

A 1d application of the down-filter is like :

L_n = Sum[k] { C_k * [ H_(2*n-k) + H_(2*n+1+k) ] }

That is : Low pixel n = filter coefficients times High res samples centered at (2*n + 0.5) going out both directions.

Then the bilinear upsample is :

U_(2n) = (3/4) * L_n + (1/4) * L_(n-1)

U_(2n+1) = (3/4) * L_n + (1/4) * L_(n+1)

Again we just make a squared error term like the above :

E = Sum[n] ( H_n - U_n ) ^2

Substitute the form of L_n into U_n and expand so you just have a matrix equation in terms of H_n and C_k. Then do a solve for the C_k. You can do a least-squares solve here, or you can just directly solve it because there are generally few C's (the matrix is # of C's by # of pixels).
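
Before the solve, here's the round trip itself spelled out in 1d (a sketch of my own, not the author's solver). The C[] passed in are the {C0,C1,C2,...} that sum to 1/2 per the form above (the c_downCoef arrays quoted below sum to 1.0, so they look like they're stored at 2x this scale); edges are just clamped here, which the post doesn't specify, and the high res length is assumed even :

#include <vector>
#include <algorithm>

double RoundTripError1d(const std::vector<double> & H, const double * C, int numC)
{
    int lowN = (int) H.size() / 2;
    std::vector<double> L(lowN);

    // down-filter : L_n = Sum[k] C_k * ( H_(2n-k) + H_(2n+1+k) )
    for (int n = 0; n < lowN; n++)
    {
        double sum = 0.0;
        for (int k = 0; k < numC; k++)
        {
            int lo = std::max( 2*n - k    , 0 );
            int hi = std::min( 2*n + 1 + k, (int)H.size() - 1 );
            sum += C[k] * ( H[lo] + H[hi] );
        }
        L[n] = sum;
    }

    // bilinear upsample : U_2n = 3/4 L_n + 1/4 L_(n-1) , U_2n+1 = 3/4 L_n + 1/4 L_(n+1)
    double err = 0.0;
    for (int n = 0; n < lowN; n++)
    {
        double Lm = L[ std::max(n-1,0) ];
        double Lp = L[ std::min(n+1,lowN-1) ];
        double u0 = 0.75 * L[n] + 0.25 * Lm;
        double u1 = 0.75 * L[n] + 0.25 * Lp;
        err += (H[2*n  ] - u0)*(H[2*n  ] - u0);
        err += (H[2*n+1] - u1)*(H[2*n+1] - u1);
    }
    return err;
}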

Here's how the error varies with number of free coefficients (zero free coefficients means a pure box downsample) :


r:\>bmputil mse lenag.256.bmp bilinear_down_up_0.bmp  rmse : 15.5437 psnr : 24.3339

r:\>bmputil mse lenag.256.bmp bilinear_down_up_1.bmp  rmse : 13.5138 psnr : 25.5494

r:\>bmputil mse lenag.256.bmp bilinear_down_up_2.bmp  rmse : 13.2124 psnr : 25.7454

r:\>bmputil mse lenag.256.bmp bilinear_down_up_3.bmp  rmse : 13.0839 psnr : 25.8302
 
you can see there's a big jump from 0 to 1 but then only gradually increasing quality after that (though it does keep getting better as it should).

Two or three free terms (which means a 6 or 8 tap filter) seems like the ideal width to me - wider than that and you're getting very nonlocal which means ringing and overfitting. Optimized on all my test images the best coefficients I get are :


// 8 taps :

static double c_downCoef[4] = { 1.31076, 0.02601875, -0.4001217, 0.06334295 };

// 6 taps :

static double c_downCoef[3] = { 1.25 , 0.125, - 0.375 };

(the 6-tap one was obviously so close to those perfect fractions that I just manually rounded it; I assume that if I solved this analytically that's what I would get. It's not so obvious to me what the 8-tap one would be).

Now, how do these static ones compare to doing the lsqr fit to make coefficients per image ? They're 99% of the benefit. For example :


// solve :
lena.512.bmp : doing solve exact on 3 x 524288
{ 1.342242526 , -0.028240414 , -0.456030369 , 0.142028257 }  // rmse : 10.042138

// static fit :
lena.512.bmp :  // rmse : 10.116388

------------

// static fit :
clegg.bmp :  // rgb rmse : 50.168 , gray rmse : 40.506

// solve :
fitting : clegg.bmp : doing lsqr on 3 x 1432640 , c_lsqr_damping = 0.010000
{ 1.321164423 , 0.002458499 , -0.381711250 , 0.058088329 }  // rgb rmse : 50.128 , gray rmse : 40.472

So it seems to me this is in fact a very simple and high quality way to down-sample to make the best reproduction after bilinear upsampling.

I'm not even gonna touch the issue of the [0,255] range clamping or the fact that your low res image should actually be considered discrete, not continuous.

ADDENDUM : it just occurred to me that you might do the bilinear 2X upsampling using offset-taps instead of centered taps. That is, centered taps reconstruct like :


+---+    +-+-+
|   |    | | |
|   | -> +-+-+
|   |    | | |
+---+    +-+-+

That is, the area of four high res pixels lies directly on one low res pixel. Offset taps do :

+---+     | |
|   |    -+-+-
|   | ->  | |
|   |    -+-+-
+---+     | |

that is, the center of a low res pixel corresponds directly to a high res pixel.

With centered taps, the bilinear upsample weights in 1d are always (3/4,1/4) then (1/4,3/4) , (so in 2d they are 9/16, etc.)

With offset taps, the weights in 1d are (1) (1/2,1/2) (1) etc... that is, one pixel is just copied and the tweeners are averages.

Offset taps have the advantage that they aren't so severely variance decreasing. Offset taps should use a single-center down-filter of the form :

{C2,C1,C0,C1,C2}

(instead of {C2,C1,C0,C0,C1,C2} ).

My tests show single-center/offset up/down is usually slightly worse than symmetric/centered , and occasionally much better. On natural/smooth images (such as the entire Kodak set) it's slightly worse. Picking one at random :


symmetric :
kodim05.bmp : { 1.259980122 , 0.100375561 , -0.378468204 , 0.018112521 }   // rmse : 25.526521

offset :
kodim05.bmp : { 0.693510045 , 0.605009745 , -0.214854612 , -0.083665178 }  // rgb rmse : 26.034 

that pattern holds for all. However, on weird images it can be better, for example :

symmetric :
c:\src\testproj>Release\TestProj.exe t:\test_images\color\bragzone\clegg.bmp f
{ 1.321164423 , 0.002458499 , -0.381711250 , 0.058088329 }  // rgb rmse : 50.128 , gray rmse : 40.472

offset :
c:\src\testproj>Release\TestProj.exe t:\test_images\color\bragzone\clegg.bmp f
{ 0.705825115 , 0.561705835 , -0.267530949 }  // rgb rmse : 45.185 , gray rmse : 36.300

so ideally you would choose the best of the two. If you're decompressing in a pixel shader you need another parameter for whether to offset your sampling UV's by 0.5 of a pixel or not.


ADDENDUM : I got Humus working with a KLT color transform. You just do the matrix transform in the shader after fetching "YUV" (not really YUV any more). It helps on the bad cases, but still doesn't make it competitive. It's better just to go with DXT1 or DXT5-YCoCg in those cases. For example :

On a pure red & blue texture :


Humus YCoCg :

rmse : 11.4551 , psnr : 26.9848
ssim : 0.9529 , perc : 80.3841%


Humus KLT with forced Y = grey :

KLT : Singular values : 56.405628,92.022781,33.752548
 KLT : 0.577350,0.577350,0.577350
 KLT : -0.707352,0.000491,0.706861
 KLT : 0.407823,-0.816496,0.408673

rmse : 11.4021 , psnr : 27.0251
ssim : 0.9508 , perc : 79.9545%


Humus KLT  :

KLT : Singular values : 93.250313,63.979282,0.230347
 KLT : -0.550579,0.078413,0.831092
 KLT : -0.834783,-0.051675,-0.548149
 KLT : -0.000035,-0.995581,0.093909

rmse : 5.6564 , psnr : 33.1140
ssim : 0.9796 , perc : 87.1232%

(note the near perfect zero in the last singular value, as it should be)


DXT1 :

rmse : 3.0974 , psnr : 38.3450
ssim : 0.9866 , perc : 89.5777%

DXT5-YCoCg :

rmse : 2.8367 , psnr : 39.1084
ssim : 0.9828 , perc : 88.1917%

So, obviously a big help, but not enough to be competitive. Humus also craps out pretty bad on some images that have single pixel checkerboard patterns. (again, any downsampling format, such as JPEG, will fail on the same cases). Not really worth it to mess with the KLT, better just to support one of the other formats as a fallback.

One thing I'm not sure about is just how bad the two texture fetches are these days.

6/16/2009

06-16-09 - Inverse Box Sampling

A while ago I posed this problem to the world :

Say you are given the box-downsampled version of a signal (I may use "image" and "signal" interchangeably cuz I'm sloppy). Box-downsampled means groups of N values in the original have been replaced by the average in that group and then downsampled N:1. You wish to find an image which is the same resolution as the source and if box-downsampled by N, exactly reproduces the low resolution signal you were given. This high resolution image you produce should be "smooth" and close to the expected original signal.

Examples of this are say if you're given a low mip and you wish to create a higher mip such that downsampling again would exactly reproduce the low mip you were given. The particular case I mainly care about is if you are given the DC coefficients of a JPEG, which are the averages on 8x8 blocks, you wish to produce a high res image which has the exact same average on 8x8 blocks.

Obviously this is an under-constrained problem (for N > 1) because I haven't clearly spelled out "smooth" etc. There are an infinity of signals that when downsampled produce the same low resolution version. Ideally I'd like to have a way to upsample with a parameter for smoothness vs. ringing that I could play with. (if you're nitty, I can constrain the problem precisely : The correlation of the output image and the original source image should be maximized over the space of all real world source images (eg. for example over the space of all images that exist on the internet)).

Anyway, after trying a whole bunch of heuristic approaches which all failed (though Sean's iterative approach is actually pretty good), I found the mathemagical solution, and I thought it was interesting, so here we go.

First of all, let's get clear on what "box downsample" means in a form we can use in math.

You have an original signal f(t) . We're going to pretend it's continuous because it's easier.

To make the "box downsample" what you do is apply a convolution with a rectangle that's N wide. Since I'm treating t as continuous I'll just choose coordinates where N = 1. That is, "high res" pixels are 1/N apart in t, and "low res" pixels are 1 apart.

Convolution { f , g } (t) = Integral{ ds * f(s) * g(t - s) }

The convolution with rect gives you a smoothed signal, but it's still continuous. To get the samples of the low res image, you multiply this by "comb". comb is a sum of dirac delta functions at all the integer coordinates.

F(t) = Convolve{ rect , f(t) }

low res = comb * F(t)

low res = Sum[n] L_n * delta_n

Okay ? We now have a series of low res coefficients L_n just at the integers.

This is what is given to us in our problem. We wish to try to guess what "f" was - the original high res signal. Well, now that we've written it this way, it's obvious ! We just have to undo the comb filtering and undo the convolution with rect !

First to undo the comb filter - we know the answer to that. We are given discrete samples L_n and we wish to reproduce the smooth signal F that they came from. That's just Shannon sampling theorem reconstruction. The smooth reconstruction is made by just multiplying each sample by a sinc :

F(t) = Sum[n] L_n * sinc( t - n )

This is using the "normalized sinc" definition : sinc(x) = sin(pi x) / (pi x).

sinc(x) is 1.0 at x = 0 and 0.0 at all other integer x's and it oscillates around a lot.

So this F(t) is our reconstruction of the rect-filtered original - not the original. We need to undo the rect filter. To do that we rely on the Convolution Theorem : convolution in the spatial domain is just multiplication in the Fourier domain. That is :

Fou{ Convolution { f , g } } = Fou{ f } * Fou{ g }

So in our case :

Fou{ F } = Fou{ Convolution { f , rect } } = Fou{ f } * Fou{ rect }

Fou{ f } = Fou{ F } / Fou{ rect }

Recall F(t) = Sum[n] L_n * sinc( t - n ) , so :

Fou{ f } = Sum[n] L_n * Fou{ sinc( t - n ) } / Fou{ rect }

Now we need some Fourier transform knowledge. The easiest way for me to find this stuff is just to do the integrals myself. Integrals are really fun and easy. I won't copy them here because it sucks in ASCII so I'll leave it as an exercise to the reader. You can easily figure out the Fourier translation principle :

Fou{ sinc( t - n ) } = e^(-2 pi i n v) * Fou{ sinc( t ) }

As well as the Fourier sinc / rect symmetry :

Fou{ rect(t) } = sinc( v )

Fou{ sinc(t) } = rect( v )

All that means for us :

Fou{ f } = Sum[n] L_n * e^(-2 pi i n v) * rect(v) / sinc(v)

So we have the Fourier transform of our signal and all that's left is to do the inverse transform !

f(t) = Sum[n] L_n * Fou^-1{ e^(-2 pi i n v) * rect(v) / sinc(v) }

because of course constants pull out of the integral. Again you can easily prove a Fourier translation principle : the e^(-2 pi i n v) term just acts to translate t by n, so we have :

f(t) = Sum[n] L_n * h(t - n)

h(t) = Fou^-1{ rect(v) / sinc(v) }

First of all, let's stop and see what we have here. h(t) is a function centered on zero and symmetric around zero - it's a reconstruction shape. Our final output signal, f(t), is just the original low res coefficients multiplied by this h(t) shape translated to each integer point n. That should make a lot of sense.

What is h exactly? Well, again we just go ahead and do the Fourier integral. The thing is, "rect" just acts to truncate the infinite range of the integral down to [-1/2, 1/2] , so :

h(t) = Integral[-1/2,1/2] { dv e^(2 pi i t v) / sinc(v) }

Since sinc is symmetric around zero, let's take the two halves of the range around zero and add them together :

h(t) = Integral[0,1/2] { dv ( e^(2 pi i t v) + e^(- 2 pi i t v) ) / sinc(v) }

h(t) = Integral[0,1/2] { dv 2 * cos ( 2 pi t v ) * pi * v / sin( pi v) }

(note we lost the c - sinc is now sin). Let's change variables to w = pi v :

h(t) = (2 / pi ) * Integral[ 0 , pi/2 ] { dw * w * cos( 2 t w ) / sin( w ) }

And.. we're stuck. This is an integral function; it's a pretty neat form, it sure smells like some kind of Bessel function or something like that, but I can't find this exact form in my math books. (if anyone knows what this is, help me out). (actually I think it's a type of elliptic integral).

One thing we can do with h(t) is prove that it is in fact exactly what we want. It has the box-unit property :

Integral[ N - 1/2 , N + 1/2 ] { h(t) dt } = 1.0 if N = 0 and 0.0 for all other integer N

That is, the 1.0 wide window box filter of h(t) centered on integers is exactly 1.0 on its own unit interval, and 0 on others. In other words, h(t) reconstructs its own DC perfectly and doesn't affect any others. (prove this by just going ahead and doing the integral; you should get sin( N * pi ) / (N * pi ) ).

While I can't find a way to simplify h(t) , I can just numerically integrate it. It looks like this :

[ plot of h(t) ]

You can see it sort of looks like sinc, but it isn't. The value at 0 is > 1. The height of the central peak vs. the side peaks is more extreme than sinc, the first negative lobes are deeper than sinc. It actually reminds me of the appearance of a wavelet.

Actually the value h(0) is exactly 4 G / pi = 1.166243... , where "G" is Catalan's constant.
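
The numerical integration itself is trivial ; a sketch (my own throwaway code, midpoint rule on the last form of the integral above, with h(0) = 4G/pi as the sanity check) :

#include <math.h>
#include <stdio.h>

double h_numeric(double t)
{
    const double pi = 3.14159265358979323846;
    const int steps = 100000;
    const double range = pi / 2.0;
    double sum = 0.0;
    for (int i = 0; i < steps; i++)
    {
        double w = (i + 0.5) * range / steps;   // midpoint ; integrand -> 1 as w -> 0
        sum += w * cos(2.0 * t * w) / sin(w);
    }
    return (2.0 / pi) * sum * (range / steps);
}

int main()
{
    printf("h(0) = %f\n", h_numeric(0.0));   // ~1.166243 = 4 * CatalanG / pi
    printf("h(1) = %f\n", h_numeric(1.0));
    return 0;
}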

Anyway, this is all very amusing and it actually "works" in the sense that if you blow up a low-res image using this h(t) basis shape, it does in fact make a high res image that is smooth and upon box-down sampling exactly reproduces the low-res original.

It is, however, not actually useful. For one thing, it's computationally ridiculous. Of course you would precompute the h(t) and store it in a table, but even then, the reach of h(t) is infinite, and it doesn't get small until very large t (beyond the edges of any real image), so in practice every output pixel must be a weighted sum of every single DC value in the low res image. Even without that problem, it's useless because it's just too ringy on real data. Looking at the shape above it should be obvious it will ring like crazy.

I believe these problems basically go back to the issue of using the ideal Shannon reconstruction when I did the step of "undoing the comb". By using the sinc to reproduce I doomed myself to non-local effect and ringing. The next obvious question is - can you do something other than sinc there? Why yes you can, though you have to be careful.

Say we go back to the very beginning and make this reconstruction :

F(t) = Sum[n] L_n * B( t - n )

We're making F(t) which is our reconstruction of the smooth box-filter of the original. Now B(t) is some reconstruction basis function (before we used sinc). In order to be a reconstruction, B(t) must be 1.0 at t = 0, and 0.0 at all other integer t. Okay.

If we run through the math with general B, we get :

again :

f(t) = Sum[n] L_n * h(t - n)

but with :

h(t) = Fou^-1{ Fou{ B } / sinc(v) }

For example :

If B(t) = "triangle" , then F(t) is just the linear interpolation of the L_n

Fou{ triangle } = sinc^2 ( v)

h(t) = Fou^-1{ sinc^2 ( v) / sinc(v) } = Fou^-1{ sinc } = rect(t)

Our basis functions are rects ! In fact this is the reconstruction where the L_n is just made a constant over each DC domain. In fact if you think about it that should be obvious. If you take the L_n and make them constant on each domain, then you run a rectangle convolution over that - as you slide the rectangle window along, you get linear interpolation, which is our F(t).

That's not useful, but maybe some other B(t) is. In particular I think the best line of approach is for B(t) to be some kind of windowed sinc. Perhaps a Gaussian-windowed sinc. Any real world window I can think of leads to a Fourier transform of B(t) that's too complex to do analytically, which means our only approach to finding h is to do a double-numerical-integration which is rather a disastrous thing to do, even for precomputing a table.

So I guess that's the next step, though I think this whole approach is a practical dead end and is now just a scientific curiosity. I must say it was a lot of fun to actually bust out pencil and paper and do some math and real thinking. I really miss it.

6/15/2009

06-15-09 - Blog Roll

It's time now for me to give a shout out to all the b-boys in the werld.

mischief.mayhem.soap.
Adventures of a hungry girl
Atom
Aurora
Beautiful Pixels
Birth of a Game
bouliiii's blog
Braid
C0DE517E
Capitol Hill Triangle
cbloom rants
Cessu's blog
Culinary Fool
David Lebovitz
Diary of a Graphics Programmer
Diary Of An x264 Developer
Eat All About It
EntBlog
fixored?
Game Rendering
GameArchitect
garfield minus garfield
Graphic Rants
Graphics Runner
Graphics Size Coding
Gustavo Duarte
His Notes
Humus
I Get Your Fail
Ignacio Castaño
Industrial Arithmetic
John Ratcliff's Code Suppository
Lair Of The Multimedia Guru
Larry Osterman's WebLog
level of detail
Lightning Engine
limegarden.net
Lost in the Triangles
Mark's Blog
Married To The Sea
meshula.net
More Seattle Stuff
My Green Paste, Inc.
Nerdblog.com
not a beautiful or unique snowflake
NVIDIA Developer News
Nynaeve
onepartcode.com
Pete Shirley's Graphics Blog
Pixels, Too Many..
Real-Time Rendering
realtimecollisiondetection.net - the blog
RenderWonk
Ryan Ellis
Seattle Daily Photo
snarfed.org
Some Assembly Required
stinkin' thinkin'
Stumbling Toward 'Awesomeness'
surly gourmand
Sutter's Mill
Syntopia
Thatcher's rants and musings
The Atom Project
The Big Picture
The Data Compression News Blog
The Ladybug Letter
The software rendering world
TomF's Tech Blog
View
VirtualBlog
Visual C++ Team Blog
Void Star: Ares Fall
What your mother never told you about graphics development
Wright Eats
xkcd.com
Bartosz Milewski's Programming Cafe

autogen from the Google Reader xml output. I would post the code right here but HTML EATS MY FUCKING LESS THAN SIGNS and it's pissing me off. God damn you.

SAVED : Thanks Wouter for linking to htmlescape.net ; I might write a program to automatically do that to anything inside a PRE chunk when I upload the block.


int main(int argc,char *argv[])
{
    char * in = ReadWholeFile(argv[1]);
    
    while( in && *in )
    {
        in = skipwhitespace(in);
        if ( stripresame(in,"<outline") )
        {
            // the guts of this if were eaten by the HTML (everything between a '<'
            // and the next '>'), so this is a guess at the intent ; strextract() is
            // a hypothetical helper that pulls out a quoted attribute value
            char * title = strextract(in,"title=\"");
            char * url   = strextract(in,"htmlUrl=\"");
            printf("<a href=\"%s\"> %s </a>\n",url,title);
        }
        in = strnextline(in);
    }
    return 0;
}

6/13/2009

06-13-09 - Integer division by constants

.. can be turned into multiplies and shifts as I'm sure you know. Often on x86 this is done most efficiently through use of the "mul high" capability (the fact that you get 64 bit multiply output for free and can just grab the top dword).

Sadly, C doesn't give you a way to do mul high, AND stupidly MS/Intel don't seem to provide any intrinsic to do it in a clean way. After much fiddling I've found that this usually works on MSVC :


uint32 Mul32High(uint32 a,uint32 b)
{
    return (uint32)( ((uint64) a * b) >>32);
}

but make sure to check your assembly. This should assemble to just a "mul" and "ret edx".

Now, for the actual divider lookup, there are lots of references out there on the net, but most of them are not terribly useful because 1) lots of them assume expensive multiplies, and 2) most of them are designed to work for the full range of arguments. Often you know that you only need to work on a limited range of one parameter, and you can find much simpler versions for limited ranges.

One excellent page is : Jones on Reciprocal Multiplication (he actually talks about the limited range simplifications, unlike the canonical references).

The best reference I've found is the AMD Optimization Guide. They have a big section about the theory, and also two programs "sdiv.exe" and "udiv.exe" that spit out code for you! Unfortunately it's really fucking hard to find on their web site. sdiv and udiv were shipped on AMD developer CD's and appear to have disappeared from the web. I've found one FTP Archive here . You can find the AMD PDF's on their site, as these links may break : PDF 1 , PDF 2 .

And actually this little CodeProject FindMulShift is not bad; it just does a brute force search for simple mul shift approximations for limited ranges of numerators.

But it's written in a not-useful way. You should just find the shift that gives you maximum range. So I did that, it took two seconds, and here's the result for you :



__forceinline uint32 IntegerDivideConstant(uint32 x,uint32 divisor)
{
    ASSERT( divisor > 0 && divisor <= 16 );
    
    if ( divisor == 1 )
    {
        return x;
    }
    else if ( divisor == 2 )
    {
        return x >> 1;
    }
    else if ( divisor == 3 )
    {
        ASSERT( x < 98304 );  
        return ( x * 0x0000AAAB ) >> 17; 
    }
    else if ( divisor == 4 )
    {
        return x >> 2;
    }
    else if ( divisor == 5 )
    {
        ASSERT( x < 81920 );  
        return ( x * 0x0000CCCD ) >> 18; 
    }
    else if ( divisor == 6 )
    {
        ASSERT( x < 98304 );  
        return ( x * 0x0000AAAB ) >> 18; 
    }
    else if ( divisor == 7 )
    {
        ASSERT( x < 57344 );  
        return ( x * 0x00012493 ) >> 19; 
    }
    else if ( divisor == 8 )
    {
        return x >> 3;
    }
    else if ( divisor == 9 )
    {
        ASSERT( x < 73728 );  
        return ( x * 0x0000E38F ) >> 19; 
    }
    else if ( divisor == 10 )
    {
        ASSERT( x < 81920 );  
        return ( x * 0x0000CCCD ) >> 19; 
    }
    else if ( divisor == 11 )
    {
        ASSERT( x < 90112 );  
        return ( x * 0x0000BA2F ) >> 19; 
    }
    else if ( divisor == 12 )
    {
        ASSERT( x < 98304 );  
        return ( x * 0x0000AAAB ) >> 19; 
    }
    else if ( divisor == 13 )
    {
        ASSERT( x < 212992 );  
        return ( x * 0x00004EC5 ) >> 18; 
    }
    else if ( divisor == 14 )
    {
        ASSERT( x < 57344 );  
        return ( x * 0x00012493 ) >> 20; 
    }
    else if ( divisor == 15 )
    {
        ASSERT( x < 74909 );  
        return ( x * 0x00008889 ) >> 19; 
    }
    else if ( divisor == 16 )
    {
        return x >> 4;
    }
    else
    {
        CANT_GET_HERE();
        return x / divisor;
    }
}

Note : an if/else tree is better than a switch() because we're branching on constants. This should all get removed by the compiler. Some compilers get confused by large switches, even on constants, and fail to remove them.

7 seems to be the worst number for all of these methods. It only works here up to 57344 (not quite 16 bits).
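
For completeness, here's a sketch of the kind of brute-force search that generates that table (not the author's actual tool, but it reproduces the constants and ranges above; the range is limited both by the 32-bit overflow of x*m and by the approximation drifting off of x/d) :

#include <stdio.h>

typedef unsigned int        uint32;
typedef unsigned long long  uint64;

int main()
{
    for (uint32 d = 3; d <= 16; d++)
    {
        uint32 bestMul = 0, bestShift = 0, bestRange = 0;
        for (uint32 s = 16; s <= 20; s++)
        {
            // round-up multiplier so the approximation never undershoots
            uint32 m = (uint32)( ((1ull << s) + d - 1) / d );
            
            uint32 x = 1;
            while ( (uint64)x * m < (1ull << 32) &&
                    (uint32)(((uint64)x * m) >> s) == x / d )
            {
                x++;
            }
            if ( x > bestRange ) { bestRange = x; bestMul = m; bestShift = s; }
        }
        printf("divisor %2u : ( x * 0x%08X ) >> %u , valid for x < %u\n",
                d, bestMul, bestShift, bestRange);
    }
    return 0;
}

Powers of two fall out of this search too, but of course you just use a plain shift for those, as in the function above.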

06-13-09 - A little mathemagic

I'm doing some fun math that I'll post in the next entry, but first some little random shit I've done along the way.

First of all this is just a handy tiny C++ functor dealy for doing numerical integration. I tried to be a little bit careful about floating point issues (note for example the alternative version of the "t" interpolation) but I'm sure someone with more skills can fix this. Obviously the accumulation into "sum" is not awesome for floats if you have a severely oscillating cancelling function (like if you try to integrate a high frequency cosine for example). I suppose the best way would be to use cascaded accumulators (an accumulator per exponent). Anyhoo here it is :


template < typename functor >
double Integrate( double lo, double hi, int steps, functor f )
{
    double sum = 0.0;
    double last_val = f(lo);
    
    double step_size = (hi - lo)/steps;
    
    for(int i=1;i <= steps;i++)
    {
        //double t = lo + i * (hi - lo) / steps;
        double t = ( (steps - i) * lo + i * hi ) / steps;
        
        double cur_val = f(t);
        
        // trapezoid :
        double cur_int = (1.0/2.0) * (cur_val + last_val);
        
        // Simpson :
        // double mid_val = f(t + step_size * 0.5);
        //double cur_int = (1.0/6.0) * (cur_val + 4.0 * mid_val + last_val);
        
        sum += cur_int;
        
        last_val = cur_val;
    }
    
    sum *= step_size;
    
    return sum;
}

BTW I think it might help to use steps = an exact power of two (like 2^16), since that is exactly representable in floats.
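
(A trivial usage example - a plain function pointer works fine as the "functor" :)

#include <stdio.h>

double recip(double x) { return 1.0 / x; }

int main()
{
    // ln(2) = integral of 1/x over [1,2]
    printf("%.7f\n", Integrate( 1.0, 2.0, 1<<16, recip ) );  // ~0.6931472
    return 0;
}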

I also did something that's quite useless but kind of educational. Say you want a log2() and for some reason you don't want to call log(). (note that this is dumb because most platforms have fast native log or good libraries, but whatever, let's continue).

Obviously the floating point exponent is close, so if we factor our number :


X = 2^E * (1 + M)

log2(X) = log2( 2^E * (1 + M) )

log2(X) = log2( 2^E ) + log2(1 + M)

log2(X) = E + log2(1 + M)

log2(X) = E + ln(1 + M) / ln(2)

M in [0,1]

Now, obviously ln(1 + M) is the perfect kind of thing to do a series expansion on.

We know M is "small" so the obvious thing that a junior mathematician would do is the Taylor expansion :


ln(1+x) ~= x - x^2/2 + x^3/3 - x^4/4 + ...

that would be very wrong. There are a few reasons why. One is that our "x" (M) is not actually "small". M is equally likely to be anywhere in [0,1] , and for M -> 1 , the error of this expansion is huge.

More generally, Taylor series are just *NEVER* the right way to do functional approximation. They are very useful mathematical constructs for doing calculus and limits, but they should only be used for solving equations, not in computer science. Way too many people use them for function approximation which is NOT what they do.

If for some reason we want to use a Taylor-like expansion for ln() we can fix it. First of all, we can bring our x into range better.


if ( 1+x > 4/3 )
{
  E ++
  x = (1+x)/2 - 1;
}

if (1+x) is large, we divide it by two and compensate in the exponent. Now instead of having x in [0,1] we have x in [-1/3,1/3] which is better.

The next thing you can do is find the optimal last coefficient. That is :


ln(1+x) ~= x - x^2/2 + x^3/3 - x^4/4 + k5 * x^5

for a 5-term polynomial. For x in [-epsilon,epsilon] the optimal value for k5 is 1/5 , the Taylor expansion. But that's not the range of x we're using. We're using either [-1/3,0] or [0,1/3]. The easiest way to find a better k5 is to take the difference from a higher order Taylor :

delta = k5 * x^5 - ( x^5/5 - x^6/6 + x^7/7 )

Integrate delta^2 over [0,1/3] to find the L2 norm error, then take the derivative with respect to k5 to minimize the error. What you get is :

x in [0,1/3] :

k5 = (1/5 - 11/(18*12))

x in [-1/3,0] :

k5 = (1/5 + 11/(18*12))

it's intuitive what's happening here; if you just truncate a Taylor series at some order, you're doing the wrong thing. The N-term Taylor series is not the best N-term approximation. What we've done here is found the average of what all the future terms add up to and put them in as a compensation. In particular in the ln case the terms swing back and forth positive,negative, and each one is smaller than the last, so when you truncate you are overshooting the last value, so you need to compensate down in the x > 0 case and up in the x < 0 case.

using k5 instead of 1/5 we improve the error by over 10X :
        
basic    : 1.61848e-008
improved : 1.31599e-009

The full code is :

float log2_improved( float X )
{
    ASSERT( X > 0.0 );
    
    ///-------------
    // approximate log2 by getting the exponent from the float
    //  and then using the mantissa to do a taylor series
    
    // get the exponent :
    uint32 X_as_int = FLOAT_AS_INT(X);
    int iLogFloor = (X_as_int >> 23) - 127;

    // get the mantissa :
    FloatAnd32 fi;
    fi.i = (X_as_int & ( (1<<23)-1 ) ) | (127<<23);
    double frac = fi.f;
    ASSERT( frac >= 1.0 && frac < 2.0 );

    double k5;

    if ( frac > 4.0/3.0 )
    {
        // (frac/2) is closer to 2.0 than frac is
        //  push the iLog up and our correction will now be negative
        // the branch here sucks but this is necessary for accuracy
        //  when frac is near 2.0, t is near 1.0 and the Taylor is totally invalid
        frac *= 0.5;
        iLogFloor++;
        
        k5 = (1/5.0 + 11.0/(18.0*12.0));
    }
    else
    {    
        k5 = (1/5.0 - 11.0/(18.0*12.0));
    }
    
    // X = 2^iLogFloor * frac
    double t = frac - 1.0;

    ASSERT( t >= -(1.0/3.0) && t <= (1.0/3.0) );

    double lnFrac = t - t*t*0.5 + (t*t*t)*( (1.0/3.0) - t*(1.0/4.0) + t*t*k5 );
        
    float approx = (float)iLogFloor + float(lnFrac) * float(1.0 / LN2);

    return approx;
}

In any case, this is still not right. What we actually want is the best N-term approximation on a certain interval. There's no need to mess about, because that's a well defined thing to find.

You could go at it brute force, start with an arbitrary N-term polynomial and optimize the coefficients to minimize L2 error. But that would be silly because this has all been worked out by mathemagicians in the past. The answer is just the "Shifted Legendre Polynomials" . Legendre Polynomials are defined on [-1,1] ; you can shift them to any range [a,b] that you're working on. They are an orthonormal transform basis for functions on that interval.

The good thing about Legendre Polynomials is that the best coefficients for an N-term expansion in Legendre polynomials are just the Hilbert dot products (integrals) with the Legendre basis functions. Also, if you do the infinite-term expansion in the Legendre basis, then the best expansion in the first N polynomials is just the first N terms of the infinite term expansion (note that this was NOT true with Taylor series). (the same is true of Fourier or any orthonormal series; obviously it's not true for Taylor because Taylor series are not orthonormal - that's why I could correct for higher terms by adjusting the lower terms, because < x^5 * x^7 > is not zero). BTW another way to find the Legendre coefficients is to start with the Taylor series, and then do a least-squares best fit to compensate each lower coefficient for the ones that you dropped off.

(note the standard definition of Legendre polynomials makes them orthogonal but not orthonormal ; you have to correct for their norm as we show below :)

To approximate a function f(t) we just have to find the coefficients : C_n = < P_n * f > / < P_n * P_n >. For our function f = log2( 1 + x) , they are :


0.557304959,
0.492127684,
-0.056146683,
0.007695622,
-0.001130694,
0.000172362

which you could find by exact integration but I got lazy and just did numerical integration. Now our errors are :

basic    : 1.62154e-008
improved : 1.31753e-009
legendre : 1.47386e-009

Note that the legendre error reported here is slightly higher than the "improved" error - that's because the Legendre version just uses the mantissa M directly on [0,1] - there's no branch test with 4/3 and exponent shift. If I did that for the Legendre version then I should do new fits for the ranges [-1/3,0] and [0,1/3] and it would be much much more accurate. Instead the Legendre version just does an unconditional 6-term fit and gets almost the same results.
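
For reference, here's a little sketch of the numerical integration that produces those Legendre coefficients (shifted Legendre via the Bonnet recurrence ; my own scaffolding, not the author's code) :

#include <math.h>
#include <stdio.h>

// shifted Legendre P~_n on [0,1] is just P_n(2x-1) , built with the Bonnet recurrence
double ShiftedLegendre(int n, double x)
{
    double u = 2.0*x - 1.0;
    if ( n == 0 ) return 1.0;
    double pPrev = 1.0;  // P_0
    double p = u;        // P_1
    for (int k = 1; k < n; k++)
    {
        // (k+1) P_{k+1} = (2k+1) u P_k - k P_{k-1}
        double pNext = ( (2*k+1)*u*p - k*pPrev ) / (k+1);
        pPrev = p;
        p = pNext;
    }
    return p;
}

int main()
{
    const int steps = 200000;
    for (int n = 0; n < 6; n++)
    {
        double dot = 0.0;
        for (int i = 0; i < steps; i++)
        {
            double x = (i + 0.5) / steps;   // midpoint integration on [0,1]
            dot += ShiftedLegendre(n,x) * log(1.0 + x) / log(2.0);
        }
        dot /= steps;                        // < P_n * f >
        double norm = 1.0 / (2*n + 1);       // < P_n * P_n > on [0,1]
        printf("C_%d = %.9f\n", n, dot/norm);
    }
    return 0;
}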

Anyway, like I said don't actually use this for log2 , but these are good things to remember any time you do functional approximation.