8/25/2009

08-25-09 - Oodle Image Compression Looking Back

I did a little image compressor for RAD/Oodle. The goal was to make something with quality comparable to a good modern wavelet coder, but using a block-based scheme so that it's more compact and simple in memory use so that it will be easy to stream through the SPU and SIMD and all that good stuff, we also wanted an internal floating point core algorithm so that it extends to HDR and arbitrary bit depths. I wrote about it before, see : here or here . That's been done for a while but there were some interesting bits I never wrote about so I thought I'd note them quickly :

1. I did lagrange R-D optimization to do "trellis quantization" (see previous ). There are some nasty things about this though, and it's actually turned off by default. It usually gives you a pretty nice win in terms of RMSE (because it's measuring "D" (distortion) in terms of MSE, so by design it optimizes that for a given rate), but I find in practice that it actually hurts perceptual quality pretty often. By "perceptual" here I just mean my own eyeballing (because as I'll mention later, I found SSIM to be pretty worthless). The problem is that the R-D trellis optimization is basically taking coefficients and slamming them to zero where the distortion cost of doing that is not worth the rate it would take to have that coefficient. In practice what this does is take individual blocks and makes them very smooth. In some cases that's great, because it lets you put more bits where they're more important (for example on images of just human faces it works great because it takes bits away from the interior patches of skin and gives those bits to the edges and eyes and such).

One of the test images I use is the super high res PNG "moses.png" that I found here . Moses is wearing a herring bone jacket. At low bit rates with R-D Trellis enabled, what happens is the coder just starts tossing out entire blocks in the jacket because they are so expensive in terms of rate. The problem with that is it's not uniform. Perceptually the block that gets killed stands out very strongly and looks awful.

Obviously this could be fixed by using a better measure of "D" in the R-D optimization. This is almost a mantra for me : when you design a very aggressive optimizer and just let it run, you better be damn sure you are rating the target criterion correctly, or it will search off into unexpected spaces and give you bad results (even though they optimize exactly the rating that you told it to optimize).

2. It seems DCT-based coders are better than wavelets on very noisy images (eg. images with film grain, or just images with lots of energy in high frequency, such as images of grasses, etc). This might not be true with fancy shape-adaptive wavelets and such, but with normal wavelets the "prior" model is that the image has most of its energy in the smooth bands, and has important high frequency detail only in isolated areas like edges. When you run a wavelet coder at low bit rate, the result is a very smoothed looking version of the image. That's good in most cases, but on the "noisy" class of images, a good modern wavelet coder will actually look worse than JPEG. The reason (I'm guessing) is that DCT coders have those high frequency pattern basis functions. It might get the detail wrong, but at least there's still detail.

In some cases it makes a big difference to specifically inject noise in the decoder. One way to do this is to do a noisey restore of the quantization buckets. That is, coefficient J with quantizer Q would normally restore to Q*J. Instead we restore to something random in the range [ Q*(J-0.5) , Q*(J+0.5) ]. This ensures that the noisey output would re-encode to the same bit stream the decoder saw. I wound up not using this method for various reasons, instead I optionally inject noise directly in image space, using a simplified model of film grain noise. The noise magnitude can be manually specified by the user, or you can have the encoder measure how noisey the original is and compare to the baseline decoder output and see how much energy we lost, and have the noise injector restore that noise level.

To really do this in a rigorous and sophisticated way you should really have location-variable noise levels, or even context-adaptive noise levels. For example, an image of a smooth sphere on a background of static should detect the local neighborhood and only add noise on the staticy background. Exploring this kind of development is very difficult because any noise injection hurts RMSE a lot, and developing new algorithms without any metric to rate them is a fool's errand. I find that in some cases reintroducing noise clearly looks better to my eye, but there's no good metric that captures that.

3. As I mentioned in the earlier posts, lapping just seems to not be the win. A good post process unblocking filter gives you all the win of lapping without the penalties. Another thing I noticed for the first time is that the JPEG perceptual quantization matrix actually has a built-in bias against blocking artifacts. The key thing is that the AC10 and AC01 (the simplest horizontal and vertical ramps) are quantized *less* than the DC. That guarantees that if you have two adjacent blocks in a smooth gradient area, if the DC's quantize to being one step apart, then you will have at least one step of AC10 linear ramp to bridge between them.

If you don't use the funny JPEG perceptual quantization matrix (which I don't think you should) then a good unblocking filter is crucial at low bit rate. The unblocking filter was probably the single biggest perceptual improvement in the entire codec.

4. I also somewhat randomly found a tiny trick that's a huge improvement. We've long noticed that at high quantization you get this really nasty chroma drift problem. The problem occurs when you have adjacent blocks with very similar colors, but not quite the same, and they sit on different sides of quantization boundary, so one block shifts down and the neighbor shifts up. For example with Quantizer = 100 you might have two neighbors with values {49, 51} and they quantize to {0,1} which restores to {0,100} and becomes a huge step. This is just what quantization does, but when you apply quantization separately to the channels of a color (RGB or YUV or whatever), when one of the channels shifts like that, it causes a hue rotation. So rather than seeing a stair step, what you see is that a neighboring block has become a different color.

Now there are a lot of ideas you might have about how to address this. To really attack it thoroughly, you would need a stronger perceptual error metric, in particular one which can measure non-local patterns, which is something we don't have. The ideal perceptual error metric needs to be able to pick up on things like "this is a smooth gradient patch in the source, and the destination has a block that stands out from the others".

Instead we came up with just a simple hack that works amazingly well. Basically what we do is adaptively resize the quantization of the DC component, so that when you are in a smooth region ("smooth" meaning neighboring block DC's are similar to each other), then we use finer quantization bucket sizes. This lets you more accurately represent smooth gradients and avoid the chroma shift. Obviously this hurts RMSE so it's hard to measure the behavior analytically, but it looks amazingly much better to our eyes.

Of course while this is an exciting discovery it's also terrifying. It reminded me how bad our image quality metrics are, and the fact that we're optimizing for these broken metrics means we are making broken algorithms. There's a whole plethora of possible things you could do along this vein - various types of adaptive quantizer sizes, maybe log quantizers? maybe more coarse quantizers in noisy parts of the image? it's impossible to explore those ideas because we have no way to rate them.

As I mentioned previously, this experiment also convinced me that SSIM is just worthless. I know in the SSIM papers they show examples where it is slightly better than RMSE at telling which image is higher quality, but in practice within the context of a DCT-based image coder I find it almost never differs from RMSE; that is, if you do something like R-D optimized quantization of DCT coefficients with Distortion measured by RMSE, you will produce an image that has almost exactly the same SSIM as if you did R-D with D measured by SSIM. If RMSE and SSIM were significantly different, that would not be the case. I say this within the context of DCT-based image coding because obviously RMSE and SSIM can disagree a lot, but that axis of freedom is not explored by DCT image coders. The main thing is that SSIM is really not measuring anything important visual at all. A real visual metric needs to use global/neighborhood information, and knowledge of shapes and what is important about the image. For example, changing a pixel that's part of a perfect edge is way more important than changing an image that's in some noise. Changing a block from grey to pink is way worse than changing a block from green to blue-green, even if it's a smaller value change. etc. etc.

It seems to me there could very easily be massive improvements possible in perceptual quality without any complexity increase that we just can't get because we can't measure it.

8/05/2009

08-05-09 - Relacy License Notes

The lock-free code I posted with Relacy has a clarification to the license agreement added. If you have downloaded this please read and make sure you are in compliance. I've copied the added text here :

ADDENDUM ON RELACY LICENSE : (revised 9-14-09)

Relacy is now released under the BSD license :


    1 /*  Relacy Race Detector
    2  *  Copyright (c) 2009, Dmitry S. Vyukov
    3  *  All rights reserved.
    4  *  Redistribution and use in source and binary forms, with or without modification,
    5  *  are permitted provided that the following conditions are met:
    6  *    - Redistributions of source code must retain the above copyright notice,
    7  *      this list of conditions and the following disclaimer.
    8  *    - Redistributions in binary form must reproduce the above copyright notice, this list of conditions
    9  *      and the following disclaimer in the documentation and/or other materials provided with the distribution.
   10  *    - The name of the owner may not be used to endorse or promote products derived from this software
   11  *      without specific prior written permission.
   12  *  THIS SOFTWARE IS PROVIDED BY THE OWNER "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
   13  *  THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
   14  *  IN NO EVENT SHALL THE OWNER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY,
   15  *  OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
   16  *  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
   17  *  STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
   18  *  EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
   19  */

My work with Relacy is 100% free for any use. However, the original Relacy license still applies to all work product made with Relacy, such as my code above.

The version of Relacy that I built my code with was released under a previous less clear and restrictive license. Dmitry says the new BSD license applies.

8/04/2009

08-04-09 - CINIT

Two questions I can't find answers to :

1. Is there a way to tell from a piece of code that you are being called from cinit ? eg. in C++ when a constructor causes some code to run, and that calls some function, and then I get called, is there anything I can check to see that I'm currently in cinit, not main?

(obviously a very evil thing I could do is run a stack trace and see what's at the top of the stack). I can't find anything in C that I can check, because the C stdlib is initialized before me, so to my cinit code it looks just like I'm in the app run.

The reason I want this is mainly for asserting & validation - I want to make sure that my own cinit code isn't calling certain things (such as memory allocation) so I want to put in checks like ASSERT( ! in_cinit() );

2. Is there a way to disallow cinit code in certain modules? For example, having cinit stuff in any library is very unsafe because you have to be wary that your OBJ could get dropped from the link and the cinit stuff will not be run, plus you have order of run issues, it's something I can handle fine in my own projects, but not something I want to force on clients. So I want to make sure that my actual deliverable libraries have no cinit stuff - but I do want cinit stuff in my test apps, so I don't want to just break it entirely. I'd really like a compiler error so I know I did a booboo right when I write it.

7/07/2009

07-07-09 - Small Image Compression Notes , Part 2

Deblocking survey :

There are a few different ways to come at the problem theoretically.

One is to work on post-decode data in spatial domain. These approaches basically work by explicitly trying to detect block edges and just filter them. This is the approach, for example, of the H264 "in loop deblocking filter" which there is a lot of literature on. See for example "Adaptive Deblocking Filter" by List, Joch, et.al. For an example of the filter-based approach on the 8x8 DCT case see "DCT-Based Image Compression using Wavelet-Based Algorithm with Efficient Deblocking Filter" by Yan and Chen. (BTW the JPEG standard contains a "block smoother" which basically predicts AC1 as a linear function from neighboring block coefficients. This is okay for the specific case of smooth images and very high quantization, but is generally not awesome and is an ancient technique. Ignore.)

A more hardcore version of the filtering approach is "Combined Frequency and Spatial Domain Algorithm for the Removal of Blocking Artifacts" which does adaptively-offset and adaptively-directed gaussian filters ; this is sort of like the image denoising stuff that creates pixel gradient flow vectors - the filters are local gradient adaptive so they don't go across real edges. This appears to perform quite well but is very expensive.

The other general approach is a more abstract maximum-likelihood idea. You received a lossy compressed image I. You know the original image was one of the many which when compressed produces I. You want to output the one that was most likely the true original image. This is a maximum likelihood problem, and requires some a-priori model of what you think "natural" images look like. In particular, for the case of quantized DCT coefficients, you have a quantized DCT coefficient C ; instead of just reproducing Q*C you can reproduce anything in the range { Q*C - Q/2 , Q*C + Q/2 } , and you should choose the thing in that range that makes the "best" image.

"Optimal JPEG Decoding" (1998) by Jung, Antonini, Barlaud takes this approach directly. Their results are not awesome though; presumably because their prior is not good. A more modern version of the same idea is "Block Artifact Reduction Using a Transform-Domain Markov Random Field Model" by Li and Delp which uses a better model for image likelihood, but is in the same vein of doing a brute force search in the allowed coefficient space to find the maximum-likelihood reproduction.

A related method that was popular for a while is "Projection onto Convex Sets". This is basically just a method of satisfying simple convex constraints in an optimization. Here our constraint is that the quantized coefficient stay the same, that is, repro in { Q*C - Q/2 , Q*C + Q/2 } . You then apply some target function, such as you want smoothness or something, and take iterative steps towards that goal and project onto the constraints one by one. There are a lot more details to this, I haven't paid too much attention to it because these are all crazy expensive and I want something realtime.

"Blocking Artifact Detection and Reduction in Compressed Data" by Triantafyllidis etal (2002) is in the same vein but simpler and more analytical. It again worse directly in DCT space on coefficients within their quantization range, but it directly solves for the ideal reconstruction value as a function of neighbors based on minimization of specific simple deblocking metric. You wind up with just some equations for how to modify each coefficient in terms of neighbor coefficients. While the paper is good, I think one of their base assumptions - that the frequencies can be dealt with independently - is not sound, and most other people do not make that assumption.

"Derivation of Prediction Equations for Blocking Effect Reduction" by Gopal Lakhani and Norman Zhong (1999) is an older, simpler still version of the Triantafyllidis paper. They only correct the first few coefficients and solve for optimal reconstruction to minimize MSDS (mean squared difference of slopes). You can actually look at the equations here and they're very intuitively obviously right. For example, the first AC coefficient should be corrected using the difference of the neighboring DC coefficients. In case you don't see that that's obviously right, if you have DC's like [8],[16],[24] after dequantization at Q=8, and your AC's all got quantized to zero, obviously the original image most likely had a smooth slope, so the first AC in the middle block should be predicted to be the linear interpolation.

An interesting one I found that's related to the stuff I tried with smooth reconstruction of the DC band is : "Improvement of DCT-based Compression Algorithms Using Poisson�s Equation" by Yamatani and Saito (2006) .

BTW a related issue that often comes up is the incorrectness of center dequantization of AC coefficients. I've written about this before and lots of these papers mention it; the best full note on it is : "Biased Reconstruction for JPEG Decoding" by Price.

The very modern stuff has gotten quite arcane. People now are doing things like directional overcomplete wavelets on the reproduced image; with this they can detect both block artifacts and also ringing and other quantized transform artifacts. They then use maximum-likelihood markov models to guess what the source image was that produced this output. This stuff is extremely complex and I haven't really followed it because it's nowhere near realtime, but probably the best solution for offline very high quality JPEG decoders.

An interesting outlier is John Costella's Unblock . It's based on a clever simple idea that I've never seen anywhere else. Unblock is based on the assumption that pixels near the block boundaries come from the same model as pixels in the centers of blocks. That sounds obvious but it's quite profound. It means that pixels near the edges of blocks should have the same statistics as pixels in the centers (in the maximum likelihood lingo, this is a prior we can use to choose an optimal output). In particular, it's useful because in the DCT the interior pixels are much more accurate than the edge pixels. What Unblock does is looks at the statistics of the decompressed interior pixels and assumes those are our goal, and then it forces the pixels near the edge to match the statistics of the interior. The corrections are applied as wide smooth filters.

7/06/2009

07-06-09 - Small Image Compression Notes

Lapping appears to be a complete red herring. I've wasted a lot of time on it and I'm very angry. I've been trying to work up a lapped block DCT image coder. The idea is that block-DCT-based is good for speed and parallelization for micro-core architectures, good for memory bandwidth, etc. and the lapping theoretically lets you avoid some of the nasty block artifacts by effectively extending your basis functions.

In practice it just doesn't work. I've tried lots of different lapping methods, and in all of them if I make a parameterized lap amount based on a kaiser-bessel-derived window and then tweak the lap amount to maximize SSIM, it tunes to no lapping at all. Basically what's happening is that the extra bit rate cost caused by the forward lap scrambling things up is too great for the win of smoother basis functions on decompress to make up. Obviously in a few contrived cases it does help, such as on very smooth images at very high compression. (of course the large lap basis functions are a form of modeling - they will help any time the image is smooth over the larger area, and hurt when it is not).

The really silly thing about this is that areas where the image is very smooth over a large area are the cases we already handle very well!! Yeah sure naive JPEG looks awful, but even a deblocking filter after decompress can fix that case very easily. In areas that aren't smooth, lapping actually makes artifacts like ringing worse.

The other issue is I'm having a little trouble with lagrange bitstream optimization. Basically my DCT block coder does a form of "trellis quantization" (which I wrote about before) where it can selectively zero coefficients if it decides it gets an R/D win by doing so. Obviously this gives you a nice RMSE win at a given rate (by design it does so - any time it finds a coefficient to zero, it steps up the R/D slope). But what does this actually do?

Think about trying to make the best bit stream for a given rate. Say two bits per pixel. If we don't do any lagrange optimization at all, we might pick some quantizer, say Q = 16. Now we turn on lagrange optimization, it finds some coefficients to zero, that reduces the bit rate, so to get back to the target bit rate, we can use a lower quantizer. It searches for the right lagrange lambda by iterating a few times and we wind up with something like Q = 12 , and some values zeroed, and a better RMSE. What's happened is we got to use a lower quantizer, so we made more, larger, nonzero coefficients, and then we selectively zeroed a few that took the most R/D.

But what does this actually do to the image qualitatively? What it does is increase the quality everywhere (Q =16 goes to Q=12) , but then it stomps on the quality in a few isolated spots (trellis quantization zeros some coefficients). If you compare the two images, the lagrange optimized one looks better everywhere, but then is very smooth and blurred out in a few spots. Normally this is not a big deal and it's just a win, but sometimes I've found it actually looks really awful.

Even if you optimize for some perceptual metric like SSIM it doesn't detect how bad this is, because SSIM is still a local measurement and this is a nonlocal artifact. Your eyes very quickly pick out that part of the image has been blurred way more than the rest of it. (in other cases it does the same thing, but it's actually good; it sort of acts like a bilateral filter actually, it will give bits to the high contrast edges and kill coefficients in the texture part, so for like images of skin it does a nice job of keeping the edges sharp and just smoothing out the interior, as opposed to non-lagrange-optimized JPEG which allocates bits equally and will preserve the skin pore detail and make the edges all ringy and chopped up).

I guess the fix to this is some hacky/heuristic way to just force the lagrange optimization not to be too aggressive.

I guess this is also an example of a computer problem that I've observed many times in various forms : when you let a very aggressive optimizer run wild seeking some path to maximize some metric, it will do so, and if your metric does not perfectly measure exactly the thing that you actually want to optimize, you can get some very strange/bad results.

6/22/2009

06-22-09 - Redraw Dilemma

This apartment searching is really annoying me. I can't handle having "many balls in the air" ; when I put something on my todo list, I like to work at it until it's gone. God I fucking hate shit on my todo list (the fucking health care keeps reinserting itself on my todo list and it's pissing me off; they got me again today with some billing fuckup, but I digress...).

Anyway, it's reminding me of a concept I often think about. I'll call it "the redrawer's dilemma" but there must be a better/standard name for this.

The hypothetical game goes something like this :

You are given a bag with 100 numbers in it. You know the numbers are in [0,1000] but don't know how many of each number there are in the bag. You start by drawing a random number from the bag.

At each turn of play, you can either keep your current number (in which case that is your final score), or you can put your current number back in the bag and draw again, but drawing again costs you -1 that will be subtracted from your final score.

How do you play this game optimally?

There are two things that are interesting to me about this game in real life. One is that humans almost always play it incredibly badly, and the second is that when you finally decide to stop redrawing you're almost always unhappy about it (unless you got super lucky and draw a 900+ number).

The two classic human player errors in this game are the "I just started drawing, I shouldn't stand yet" and the "I can't stop now, I already passed on something better than this". The "I just started drawing, I shouldn't stand yet" guy draws something like an 800 on one of his early draws. He thinks dang that's really good, but maybe this bag just has lots of high numbers in it, I just started drawing, I should put some time into it. Now of course that reasoning is based in correct logic - if you have reason to believe that your chance of drawing higher is good enough to merit the cost of continued looking, then yes, do so, but just drawing more because "it's early" makes no sense - the game is totally non temporal, the cost of continuing drawing doesn't go up over time. This often leads into the "I can't stop now, I already passed on something better than this" guy, who's mainly motivated by pride and shame - he doesn't want to admit to himself that he made a big mistake passing early when he got a high number, so he has to keep drawing until he gets something better. He might draw an 800, then a whole mess of single digit numbers and he's thinking "oh fuck I blew it" and then he draws a 400. At that point he should stand and quit redrawing, but he can't, so he draws again.

The thing is, even if he played correctly and just took the 400 after passing on the 800, he would be really unhappy about. And if the early termination guy played correctly and just got an early 800 and didn't draw, he would be unhappy too, because he'd always be wondering if he could've done better.

The other game theory / logical fallacy that plagues me in these kind of things is "I'm already spending X I may as well spend X". First I was looking for places around $1500, then I bumped it to $1700, then $1900. Now I'm looking at places for $2500 cuz fuck it they're nicer and I was looking at places for $2000 so it's only $500 more.

In other news, hotpads is actually a pretty cool apartment search site. It seems they are just scraping craigslist and maybe some other classifieds sites, so it's not like they have anything new, but the map interface and search features and such are solid. One thing is really annoying me about it though - the wheel zooming in the map is totally broken, I keep trying to wheel zoom and it sends the map off the never never land. Urg!

In more random news, I've really enjoyed the "Wallander" series on PBS ; the crime stories are pretty dumb/ridiculous, but I like the muddled contemplative pace of it, and the washed out monochrome color palette.

6/21/2009

06-21-09 - Fast Exp & Log

So in an earlier post I wrote about approximation of log2 and Ryg commented with links to Robin Green's great GDC 2003 talk : part1 (pdf) and part2 (pdf) ( main page here ).

It's mostly solid, but in part 2 around page 40 he talks about "fastexp" and "bitlog" and my spidey senses got tingling. Either I don't understand, or he was just smoking crack through that section.

Let's look at "bitlog" first. Robin writes it very strangely. He writes :


A Mathematical Oddity: Bitlog
  A real mathematical oddity
  The integer log2 of a 16 bit integer
  Given an N-bit value, locate the leftmost nonzero bit.
  b = the bitwise position of this bit, where 0 = LSB.
  n = the NEXT three bits (ignoring the highest 1)

    bitlog(x) = 8x(b-1) + n

  Bitlog is exactly 8 times larger than log2(x)-1

Bitlog Example
 For example take the number 88
88 = 1011000
b = 6th bit
n = 011 = 3
bitlog(88) = 8*(6-1)+3
= 43
  (43/8)+1 = 6.375
  Log2(88) = 6.4594
  This relationship holds down to bitlog(8)

Okay, I just don't follow. He says it's "exact" but then shows an example where it's not exact. He also subtracts off 1 and then just adds it back on again. Why would you do this :

    bitlog(x) = 8x(b-1) + n

  Bitlog is exactly 8 times larger than log2(x)-1

When you could just say :

    bitlog(x) = 8xb + n

  Bitlog is exactly 8 times larger than log2(x)

??? Weird.

Furthermore this seems neither "exact" nor an "oddity". Obviously the position of the MSB is the integer part of the log2 of a number. As for the fractional part of the log2, this is not a particular good way to get it. Basically what's happening here is he takes the next 3 bits and uses them for linear interpolation to the next integer.

Written out verbosely :


x = int to get log2 of
b = the bitwise position of top bit, where 0 = LSB.

x >= (1 << b) && x < (2 << b)

fractional part :
f = (x - (1 << b)) / (1 << b)

f >= 0 && f < 1

x = 2^b * (1 + f)

correct log2(x) = b + log2(1+f)

approximate with b + f

note that "f" and "log2(1+f)" both go from 0 to 1, so it's exact at the endpoints
but wrong in the middle

So far as I can tell, Robin's method is actually like this :

uint32 bitlog_x8(uint32 val)
{
    if ( val <= 8 )
    {
        static const uint32 c_table[9] = { (uint32)-1 , 0, 8, 13, 16, 19, 21, 22, 24 };
        return c_table[val];
    }
    else
    {
        unsigned long index;
        
        _BitScanReverse(&index,(unsigned long)val);
    
        ASSERT( index >= 3 );
    
        uint32 bottom = (val >> (index - 3)) & 0x7;
        uint32 blog = (index << 3) | bottom;

        return blog;
    }
}

where I've removed the weird offsets of 1 and this just returns log2 times 8. You need the check for val <= 8 because shifting by negative amounts is fucked.

But you might well ask - why only use 3 bits ? And in fact you're right, I see no reason to use only 3 bits. In fact we can do a fixed point up to 27 bits : (we need to save 5 bits at the top to store the max possible integer part of the log2)


float bitlogf(uint32 val)
{
    unsigned long index;
    
    _BitScanReverse(&index,(unsigned long)val);

    uint32 vv = (val << (27 - index)) + ((index-1) << 27);

    return vv * (1.f/134217728); // 134217728 = 2^27
}

what we've done here is find the pos of the MSB, shift val up so the MSB is at bit 27, then we add the index of the MSB (we subtract one because the MSB it self starts the counting at one in the 27th bit pos). This makes a fixed point value with 27 bits of fractional part, the bits below the MSB act as the fractional bits. We scale to return a float, but you could of course do this with any # of fixed point bits and return a fixed point int.

But of course this is exactly the same kind of thing done in an int-to-float so we could use that too :


float bitlogf2(float fval)
{
    FloatAnd32 fi;
    
    fi.f = fval;
    
    float vv = (float) (fi.i - (127 << 23));
    
    return vv * (1.f/8388608); // 8388608 = 2^23
}

which is a lot like what I wrote about before. The int-to-float does the exact same thing we did manually above, finding the MSB and making the log2 and fractional part.

One note - all of these versions are exact for the true powers of 2, and they err consistently low for all other values. If you want to minimize the maximum error, you should bias them.

The maximum error of ( log2( 1 + f) - f ) occurs at f = ( 1/ln(2) - 1 ) = 0.442695 ; that error is 0.08607132 , so the correct bias is half that error : 0.04303566

Backing up in Robin's talk we can now talk about "fastexp". "fastexp" is doing "e^x" by using the floating point format again, basically he's just sticking x into the exponent part to get the int-to-float to do the 2^x. To make it e^x instead of 2^x you just scale x by 1/ln(2) , and again we use the same trick as with bitlog : we can do exact integer powers of two, to get the values in between we use the fractional bits for linear interpolation. Robin's method seems sound, it is :


float fastexp(float x)
{
    int i = ftoi( x * 8.f );
        
    FloatAnd32 f;
    f.i = i * 1512775 + (127 << 23) - 524288;
    
    // 1512775 = (2^20)/ln(2)
    // 524288 = 0.5*(2^20)

    return f.f;
}

for 3 bits of fractional precision. (note that Robin says to bias with 0.7*(2^20) ; I don't know where he got that; I get minimum relative error with 0.5)).

Anyway, that's all fine, but once again we can ask - why just 3 bits? Why not use all the bits of x as fractional bits? And if we put the multiply by 1/ln(2) in the float math before we convert to ints, it would be more accurate.

What we get is :


float fastexp2(float x)
{
    // 12102203.16156f = (2^23)/ln(2)
    int i = ftoi( x * 12102203.16156f );
    
    FloatAnd32 f;
    f.i = i + (127 << 23) - 361007;
    
    // 361007 = (0.08607133/2)*(2^23)

    return f.f;
}

and indeed this is much much more accurate. (max_rel_err = 0.030280 instead of 0.153897 - about 5X better).

I guess Robin's fastexp is preferrable if you already have your "x" in a fixed point format with very few fractional bits (3 bits in that particular case, but it's good for <= 8 bits). The new method is preferred if you have "x" in floating point or if "x" is in fixed point with a lot of fractional bits (>= 16).

ADDENDUM :

I found the Google Book where bitlog apparently comes from; it's Math toolkit for real-time programming By Jack W. Crenshaw ; so far as I can tell this book is absolute garbage and that section is full of nonsense and crack smoking.

ADDENDUM 2 :

it's obvious that log2 is something like :


x = 2^I * (1+f)

(I is an int, f is the mantissa)

log2(x) = I + log2(1+f)

log2(1+f) = f + f * (1-f) * C

We've been using log2(1+f) ~= f , but we know that's exact at the ends and wrong in the middle
so obvious we should add a term that humps in the middle.

If we solve for C we get :

C = ( log2(1+x) - x ) / x*(1-x)

Integrating on [0,1] gives C = 0.346573583

hence we can obviously do a better bitlog something like :

float bitlogf3(float fval)
{
    FloatAnd32 fi;
    
    fi.f = fval;
    
    float vv = (float) (fi.i - (127<<23));
    
    vv *= (1.f/8388608);
    
    //float frac = vv - ftoi(vv);
    
    fi.i = (fi.i & 0x7FFFFF) | (127<<23);
    
    float frac = fi.f - 1.f;
    
    const float C = 0.346573583f;
        
    return vv + C * frac * (1.f - frac);
}

6/17/2009

06-17-09 - Inverse Box Sampling - Part 1.5

In the previous post we attacked the problem :

If you are given a low res signal L and a known down-sampler D() (in particlar, box down sampling), find an up sampler U() such that :

L = D ( U( L ) )

and U( L ) is as close as possible to the actual high res signal that L was made from (unknown).

I'm also interested in the opposite problem :

If you are given a high res signal H, and a known up-sampler U() (in particular, bilinear filtering), find a down sampler D() such that :

E = ( H - U( D( H ) ) )^2 is minized

This is a much more concrete and tractable problem. In particular in games/3d we know we are forced to use bilinear filtering as our up-sampler. If you use box down-sampling for D() as many people do, that's horrible, because bilinear filtering and box-downsampling are both interpolating and variance reducing. That, they both take noisey signals and force them towards gray. If you know that U() is going to be bilinear filtering, then you should use a D() that compensates for that. It's intuitively obvious that D should be something a bit like a sinc to bring in some neighbors with negative lobes to compensate for the blurring aspect of bilinear upsample, but what exactly I don't know yet.

(note that this is a different problem than making mips - in making mips you are actually going to be viewing the mip at a 1:1 resolution, it will not be upsampled back to the original resolution; you would use this if you were trying to substitute a lower res texture for a higher one).

I haven't tried my hand at solving this yet, maybe it's been done? Much like the previous problem, I'm surprised this isn't something well known and standard, but I haven't found anything on it.

06-17-09 - DXTC More Followup

I finally came back to DXTC and implemented some of the new slightly different techniques. ( summary of my old posts )

See the : NVidia Article or NVTextureTools Wiki for details.

Briefly :

DXT1 = my DXT1 encoder with annealing. (version reported here is newer and has some more small improvements; the RMSE's are slightly better than last time). DXT1 is 4 bits per pixel (bpp)

Humus BC4BC5 = Convert to YCoCg, Put Y in a single-channel BC4 texture (BC4 = the alpha part of DXT5, it's 4 bpp). Put the CoCg in a two-channel BC5 texture - downsampled by 2X. BC5 is two BC4's stuck together; BC5 is 8 bpp, but since it's downsampled 2x, this is 2bpp per original pixel. The net is a 6 bpp format

DXT5 YCoCg = the method described by JMP and Ignacio. This is 8 bpp. I use arbitrary CoCg scale factors, not the limited ones as in the previously published work.


Here are the results in RMSE (per pixel) : (modified 6-19 with new better results for Humus from improved down filter)

name DXT1 Humus DXT5 YCoCg
kodim01.bmp 8.2669 3.9576 3.8355
kodim02.bmp 5.2826 2.7356 2.643
kodim03.bmp 4.644 2.3953 2.2021
kodim04.bmp 5.3889 2.5619 2.4477
kodim05.bmp 9.5739 4.6823 4.5595
kodim06.bmp 7.1053 3.4543 3.2344
kodim07.bmp 5.6257 2.6839 2.6484
kodim08.bmp 10.2165 5.0581 4.8709
kodim09.bmp 5.2142 2.519 2.4175
kodim10.bmp 5.1547 2.5453 2.3435
kodim11.bmp 6.615 3.1246 2.9944
kodim12.bmp 4.7184 2.2811 2.1411
kodim13.bmp 10.8009 5.2525 5.0037
kodim14.bmp 8.2739 3.9859 3.7621
kodim15.bmp 5.5388 2.8415 2.5636
kodim16.bmp 5.0153 2.3028 2.2064
kodim17.bmp 5.4883 2.7981 2.5511
kodim18.bmp 7.9809 4.0273 3.8166
kodim19.bmp 6.5602 3.2919 3.204
kodim20.bmp 5.3534 3.0838 2.6225
kodim21.bmp 7.0691 3.5069 3.2856
kodim22.bmp 6.3877 3.5222 3.0243
kodim23.bmp 4.8559 3.045 2.4027
kodim24.bmp 8.4261 5.046 3.8599
clegg.bmp 14.6539 23.5412 10.4535
FRYMIRE.bmp 6.0933 20.0976 5.806
LENA.bmp 7.0177 5.5442 4.5596
MONARCH.bmp 6.5516 3.2012 3.4715
PEPPERS.bmp 5.8596 4.4064 3.4824
SAIL.bmp 8.3467 3.7514 3.731
SERRANO.bmp 5.944 17.4141 3.9181
TULIPS.bmp 7.602 3.6793 4.119
lena512ggg.bmp 4.8137 2.0857 2.0857
lena512pink.bmp 4.5607 2.6387 2.3724
lena512pink0g.bmp 3.7297 3.8534 3.1756
linear_ramp1.BMP 1.3488 0.8626 1.1199
linear_ramp2.BMP 1.2843 0.7767 1.0679
orange_purple.BMP 2.8841 3.7019 1.9428
pink_green.BMP 3.1817 1.504 2.7461


And here are the results in SSIM :

Note this is an "RGB SSIM" computed by doing :

SSIM_RGB = ( SSIM_R * SSIM_G ^2 * SSIM_B ) ^ (1/4)

That is, G gets 2X the weight of R & B. The SSIM is computed at a scale of 6x6 blocks which I just randomly picked out of my ass.

I also convert the SSIM to a "percent similar". The number you see below is a percent - 100% means perfect, 0% means completely unrelated to the original (eg. random noise gets 0%). This percent is :

SSIM_Percent_Similar = 100.0 * ( 1 - acos( ssim ) * 2 / PI )

I do this because the normal "ssim" is like a dot product, and showing dot products is not a good linear way to show how different things are (this is the same reason I show RMSE instead of PSNR like other silly people). In particular, when two signals are very similar, the "ssim" gets very close to 0.9999 very quickly even though the differences are still pretty big. Almost any time you want to see how close two vectors are using a dot product, you should do an acos() and compare the angle.

name DXT1 Humus DXT5 YCoCg
kodim01.bmp 84.0851 92.6253 92.7779
kodim02.bmp 82.2029 91.7239 90.5396
kodim03.bmp 85.2678 92.9042 93.2512
kodim04.bmp 83.4914 92.5714 92.784
kodim05.bmp 83.6075 92.2779 92.4083
kodim06.bmp 85.0608 92.6674 93.2357
kodim07.bmp 85.3704 93.2551 93.5276
kodim08.bmp 84.5827 92.4303 92.7742
kodim09.bmp 84.7279 92.9912 93.5035
kodim10.bmp 84.6513 92.81 93.3999
kodim11.bmp 84.0329 92.5248 92.9252
kodim12.bmp 84.8558 92.8272 93.4733
kodim13.bmp 83.6149 92.2689 92.505
kodim14.bmp 82.6441 92.1501 92.1635
kodim15.bmp 83.693 92.0028 92.8509
kodim16.bmp 85.1286 93.162 93.6118
kodim17.bmp 85.1786 93.1788 93.623
kodim18.bmp 82.9817 92.1141 92.1309
kodim19.bmp 84.4756 92.7702 93.0441
kodim20.bmp 87.0549 90.5253 93.2088
kodim21.bmp 84.2549 92.2236 92.8971
kodim22.bmp 82.6497 91.0302 91.9512
kodim23.bmp 84.2834 92.4417 92.4611
kodim24.bmp 84.6571 92.3704 93.2055
clegg.bmp 77.4964 70.1533 83.8049
FRYMIRE.bmp 91.3294 72.2527 87.6232
LENA.bmp 77.1556 80.7912 85.2508
MONARCH.bmp 83.9282 92.5106 91.6676
PEPPERS.bmp 81.6011 88.7887 89.0931
SAIL.bmp 83.2359 92.4974 92.4144
SERRANO.bmp 89.095 75.7559 90.7327
TULIPS.bmp 81.5535 90.8302 89.6292
lena512ggg.bmp 86.6836 95.0063 95.0063
lena512pink.bmp 86.3701 92.1843 92.9524
lena512pink0g.bmp 89.9995 79.9461 84.3601
linear_ramp1.BMP 92.1629 94.9231 93.5861
linear_ramp2.BMP 92.8338 96.1397 94.335
orange_purple.BMP 89.0707 91.6372 92.1934
pink_green.BMP 87.4589 93.5702 88.4219


Conclusion :

DXT5 YCoCg and "Humus" are both significantly better than DXT1.

Note that DXT5-YCoCg and "Humus" encode the luma in exactly the same way. For gray images like "lena512ggg.bmp" you can see they produce identical results. The only difference is how the chroma is encoded - either a DXT1 block (+scale) at 4 bpp, or a downsampled 2X BC4 block at 2 bpp.

In RGB RMSE , DXT5-YCoCg is measurably better than Humus-BC4BC5 , but in SSIM they are are nearly identical. This is because almost all of the RMSE loss in Humus comes from the YCoCg lossy color conversion and the CoCg downsampling. The actual BC4BC5 compression is very near lossless. (as much as I hate DXT1, I really like BC4 - it's very easy to produce near optimal output, unlike DXT1 where you have to run a really fancy compressor to get good output). The CoCg loss hurts RMSE a lot, but doesn't hurt actual visual quality or SSIM much in most cases.

In fact on an important class of images, Humus actually does a lot better than DXT5-YCoCg. That class is simple smooth ramp images, which we use very often in the form of lightmaps. The test images at the bottom of the table (linear_ramp and pink_green) show this.

On a few images where the CoCg downsample kills you, Humus does very badly. It's bad on orangle_purple because that image is specifically designed to be primarily in Chroma not Luma ; same for lena512pink0g.bmp ; note that normal chroma downsampling compressors like JPEG have this same problem. You could in theory choose a different color space for these images and use a different reconstruction shader.

Since Humus is only 6 bpp, size is certainly not a reason to prefer DXT1 over it. However, it does require two texture fetches in the shader, which is a pretty big hit. (BTW the other nice thing about Humus is that it's already down-sampled in CoCg, so if you are using something like a custom JPEG in YCoCg space with downsampled CoCg - you can just directly transcode that into Humus BC4BC5, and there's no scaling up or down or color space changes in the realtime recompress). I think this is probably what will be in Oodle because I really can't get behind any other realtime recompress.


I also tried something else, which is DXT1 optimized for SSIM. The idea is to use a little bit of neighbor information. The thing is, in my crazy DXT1 encoder, I'm just trying various end points and measuring the quality of each choice. The normal thing to do it to just take the MSE vs the original, but of course you could do other error metrics.

One such error metric is to decompress the block you're working on into its context - decompress into a chunk of neighbors that have already been DXT1 compressed & decompressed as well. Then compare that block and its neighbors to the original image in that neighborhood. In my case I used 2 pixels around the block I was working on, making a total region of 8x8 pixels (with the 4x4 DXT1 block in the middle).

You then compare the 8x8 block to the original image and try to optimize that. If you just used MSE in this comparison, it would be the same as before, but you can use other things. For example, you could add a term that penalizes not changes in values, but changes in *slope*.

Another approach would be to take the DCT of the 8x8 block and the DCT of the 8x8 original. If you then just take the L2 difference in DCT domain, that's no different than the original method, because the DCT is unitary. But you can apply non-uniform quantizers at this step using the JPEG visual quantization weights.

The approach I used was to use SSIM (using a 4x4 SSIM block) on the 8x8 windows. This means you are checking the error not just on your block, but on how your block fits into the neighborhood.

For example if the original image is all flat color - you want the output to be all flat color. Just using MSE won't give you that, eg. MSE considers 4444 -> 3535 to be just as good as 4444 -> 5555 , but we know the latter is better.

This does in fact produce slightly better looking images - it hurts RMSE of course because you're no longer optimizing for RMSE.

06-17-09 - Inverse Box Sampling - Part 2

Okay, in Part 1.5 I asked about the downsample that was the best inverse of bilinear upsampling. I have a solution that pleases me.

Sean reminded me that he tackled this before; I dunno if he has any notes about it on the public net, he can link them. His basic idea was to do a full solve for the entire down-sampled image. It's quite simple if you think about. Consider the case of 2X up & down sampling. The bilinear filter upsample will make a high res image where each pixel is a simple linear combo of 4 low res. You take the L2 error :

E = Sum[all high res pixel] ( Original - Upsampled ) ^2

For Sean's full solution approach, you set Upsampled = Bilinear_Upsample( X) , and just solve this for X without any assumption of how X is made from Original. For an N-pixel low res image you have 4N error terms, so it's plenty dense (you could also artificially regularize it more by starting with a low res image that's equal to the box down-sample, and then solve for the deltas from that, and add an extra "Tikhonov" regularization term that presumes small deltas - this would fix any degenerate cases).

I didn't do that. Instead I assumed that I want a discrete local linear filter and solved for what it should be.

A discrete local linear filter is just a bunch of coefficients. It must be symmetric, and it must sum to 1.0 to be mean-preserving (flat source should reproduce flat exactly). Hence it has the form {C2,C1,C0,C0,C1,C2} with C0+C1+C2 = 1/2. (this example has two free coefficients). Obviously the 1-wide case must be {0.5,0.5} , then you have {C1,0.5-C1,0.5-C1,C1} etc. as many taps as you want. You apply it horizontally and then vertically. (in general you could consider asymetric filters, but I assume H & V use the same coefficients).

A 1d application of the down-filter is like :

L_n = Sum[k] { C_k * [ H_(2*n-k) + H_(2*n+1+k) ] }

That is : Low pixel n = filter coefficients times High res samples centered at (2*n * 0.5) going out both directions.

Then the bilinear upsample is :

U_(2n) = (3/4) * L_n + (1/4) * L_(n-1)

U_(2n+1) = (3/4) * L_n + (1/4) * L_(n+1)

Again we just make a squared error term like the above :

E = Sum[n] ( H_n - U_n ) ^2

Substitute the form of L_n into U_n and expand so you just have a matrix equation in terms of H_n and C_k. Then do a solve for the C_k. You can do a least-squares solve here, or you can just directly solve it because there are generally few C's (the matrix is # of C's by # of pixels).

Here's how the error varies with number of free coefficients (zero free coefficients means a pure box downsample) :


r:\>bmputil mse lenag.256.bmp bilinear_down_up_0.bmp  rmse : 15.5437 psnr : 24.3339

r:\>bmputil mse lenag.256.bmp bilinear_down_up_1.bmp  rmse : 13.5138 psnr : 25.5494

r:\>bmputil mse lenag.256.bmp bilinear_down_up_2.bmp  rmse : 13.2124 psnr : 25.7454

r:\>bmputil mse lenag.256.bmp bilinear_down_up_3.bmp  rmse : 13.0839 psnr : 25.8302
 
you can see there's a big jump from 0 to 1 but then only gradually increasing quality after that (though it does keep getting better as it should).

Two or three free terms (which means a 6 or 8 tap filter) seems like the ideal width to me - wider than that and you're getting very nonlocal which means ringing and overfitting. Optimized on all my test images the best coefficients I get are :


// 8 taps :

static double c_downCoef[4] = { 1.31076, 0.02601875, -0.4001217, 0.06334295 };

// 6 taps :

static double c_downCoef[3] = { 1.25 , 0.125, - 0.375 };

(the 6-tap one was obviously so close to those perfect fractions that I just manually rounded it; I assume that if I solved this analytically that's what I would get. The 8-tap one is not so obvious to me what it would be).

Now, how do these static ones compare to doing the lsqr fit to make coefficients per image ? They're 99% of the benefit. For example :


// solve :
lena.512.bmp : doing solve exact on 3 x 524288
{ 1.342242526 , -0.028240414 , -0.456030369 , 0.142028257 }  // rmse : 10.042138

// static fit :
lena.512.bmp :  // rmse : 10.116388

------------

// static fit :
clegg.bmp :  // rgb rmse : 50.168 , gray rmse : 40.506

// solve :
fitting : clegg.bmp : doing lsqr on 3 x 1432640 , c_lsqr_damping = 0.010000
{ 1.321164423 , 0.002458499 , -0.381711250 , 0.058088329 }  // rgb rmse : 50.128 , gray rmse : 40.472

So it seems to me this is in fact a very simple and high quality way to down-sample to make the best reproduction after bilinear upsampling.

I'm not even gonna touch the issue of the [0,255] range clamping or the fact that your low res image should actually be considered discrete, not continuous.

ADDENDUM : it just occured to me that you might do the bilinear 2X upsampling using offset-taps instead of centered taps. That is, centered taps reconstruct like :


+---+    +-+-+
|   |    | | |
|   | -> +-+-+
|   |    | | |
+---+    +-+-+

That is, the area of four high res pixels lies directly on one low res pixel. Offset taps do :

+---+     | |
|   |    -+-+-
|   | ->  | |
|   |    -+-+-
+---+     | |

that is, the center of a low res pixel corresponds directly to a high res pixel.

With centered taps, the bilinear upsample weights in 1d are always (3/4,1/4) then (1/4,3/4) , (so in 2d they are 9/16, etc.)

With offset taps, the weights in 1d are (1) (1/2,1/2) (1) etc... that is, one pixel is just copied and the tweeners are averages.

Offset taps have the advantage that they aren't so severely variance decreasing. Offset taps should use a single-center down-filter of the form :

{C2,C1,C0,C1,C2}

(instead of {C2,C1,C0,C0,C1,C2} ).

My tests show single-center/offset up/down is usually slightly worse than symmetric/centered , and occasionally much better. On natural/smooth images (such as the entire Kodak set) it's slightly worse. Picking one at random :


symmetric :
kodim05.bmp : { 1.259980122 , 0.100375561 , -0.378468204 , 0.018112521 }   // rmse : 25.526521

offset :
kodim05.bmp : { 0.693510045 , 0.605009745 , -0.214854612 , -0.083665178 }  // rgb rmse : 26.034 

that pattern holds for all. However, on weird images it can be better, for example :

symmetric :
c:\src\testproj>Release\TestProj.exe t:\test_images\color\bragzone\clegg.bmp f
{ 1.321164423 , 0.002458499 , -0.381711250 , 0.058088329 }  // rgb rmse : 50.128 , gray rmse : 40.472

offset :
c:\src\testproj>Release\TestProj.exe t:\test_images\color\bragzone\clegg.bmp f
{ 0.705825115 , 0.561705835 , -0.267530949 }  // rgb rmse : 45.185 , gray rmse : 36.300

so ideally you would choose the best of the two. If you're decompressing in a pixel shader you need another parameter for whether to offset your sampling UV's by 0.5 of a pixel or not.


ADDENDUM : I got Humus working with a KLT color transform. You just do the matrix transform in the shader after fetching "YUV" (not really YUV any more). It helps on the bad cases, but still doesn't make it competitive. It's better just to go with DXT1 or DXT5-YCoCg in those cases. For example :

On a pure red & blue texture :


Humus YCoCg :

rmse : 11.4551 , psnr : 26.9848
ssim : 0.9529 , perc : 80.3841%


Humus KLT with forced Y = grey :

KLT : Singular values : 56.405628,92.022781,33.752548
 KLT : 0.577350,0.577350,0.577350
 KLT : -0.707352,0.000491,0.706861
 KLT : 0.407823,-0.816496,0.408673

rmse : 11.4021 , psnr : 27.0251
ssim : 0.9508 , perc : 79.9545%


Humus KLT  :

KLT : Singular values : 93.250313,63.979282,0.230347
 KLT : -0.550579,0.078413,0.831092
 KLT : -0.834783,-0.051675,-0.548149
 KLT : -0.000035,-0.995581,0.093909

rmse : 5.6564 , psnr : 33.1140
ssim : 0.9796 , perc : 87.1232%

(note the near perfect zero in the last singular value, as it should be)


DXT1 :

rmse : 3.0974 , psnr : 38.3450
ssim : 0.9866 , perc : 89.5777%

DXT5-YCoCg :

rmse : 2.8367 , psnr : 39.1084
ssim : 0.9828 , perc : 88.1917%

So, obviously a big help, but not enough to be competitive. Humus also craps out pretty bad on some images that have single pixel checkerboard patterns. (again, any downsampling format, such as JPEG, will fail on the same cases). Not really worth it to mess with the KLT, better just to support one of the other formats as a fallback.

One thing I'm not sure about is just how bad the two texture fetches is these days.

old rants