9/25/2009

09-25-09 - Motion Search

I just read the paper on Patch Match and it makes me angry so I figure I'll write about the motion search method I'm developing for possible future use in the new RAD video stuff. PatchMatch is just so incredibly trivial and obvious, it's one of those things that never should have been a paper and never should have been accepted in a journal. It's a great thing for someone to write on their blog because you can describe it in about one sentence, and most experts in the field already know the idea and are probably doing it already. (I will say the good thing about the paper is they do a good job of gathering references to other papers that are related, such as stuff in texture synth and hole filling and so on which I find interesting).

Here's the one sentence version of PatchMatch : Seed your match field with some random guess or shitty initial matches; improve by incrementally propagating match offsets to neighbors and trying small random deltas to find improvements. (it's an absolute classic spin network magnetic moment relaxation kind of problem).

Here's what I've been doing : start with a match field set to all nulls (no match found yet). Then incrementally fill it in with matches and propagate them to neighbors. It proceeds in a few steps like this :

Step 1. Use computer vision methods to find feature points in a frame. Match feature points to the previous frame. This bit is a bit tricky and tweaky, you only want to make matches that you're pretty confident in. Note that this matching is done based on a "characteristic" of the feature point which has no distance limit, and is also somewhat immune to rotation and scaling and such. Sometimes this step finds some very good correspondences between the frames, but it's sparse - it only has high confidence at a few places in the frame, so you can't use it to find all the block matches (and you wouldn't want to even if you could). Generally this finds around 100 vectors.

Step 2. Find "distinctive" spots in the frame. The goal is to find some spots that are not degenerate - eg. not flat patches, not straight edges. The idea is that these are places where we can likely find a good motion vector with high confidence, unlike degenerate areas where there are lots of equally good match vectors. I use two mechanisms to find distinctive spots : one is the computer vision feature points that were not already used in the first matching step. The second is to take the "cornerness" map of the image using a Harris or Hessian operator on the derivative of gaussians (this is a lot like an edge map, but it kills straight edges). Find the top 5% highest cornerness values that are local maxima, and use those as distinctive spots. All of the distinctive spots do a long distance brute force block match (something like radius = 16 or 32) to try to find a good motion vector for them.

Step 3 : Flood fill to fill in the gaps. We now have presumably good motion vectors at a few key points in the frame. Go to their neighbors and search for match vectors that are close to the neighboring one that we already found. Put that in the blank and push its neighbors to the queue to continue the flood fill.

Step 4 : Relaxation pass. (this is not critical). We now have a motion vector everywhere in the frame. For each match vector in the frame, look at its 4 neighbors. Examine match vectors that are near my 4 neighboring vectors. If one is better, replace self. Continue to next. Theoretically you should do this pass a few times, but I find 1 or 2 is very close to infinite.

The key thing is that motion is usually semi-coherent (but not fully coherent, because we are not really trying to find true motion here, but rather just the best matching block, which is a lot more random than true motion is). By finding very good motion vectors in seed spots where we have high confidence, we can propagate that good information out to places where we don't have as much confidence. This lets us avoid doing large brute-force searches.

BTW I really do not understand the point of all the "diamond search" type shit in the video compression literature. It seems to just find really shitty motion vectors and is not making good use of the possibilities in the bit stream. Especially with GPU video encoding in this modern age, doing plain old big chunks of brute force motion search is preferrable. (yes, I know it's for speed, but it's a poor way to optimize, and the high quality encoders are still non-realtime anyway, so if you're not realtime you may as well take some more time and do better; plus the vast majority of use of non-realtime video encoders is in an encode-once decode-many type of scenario which means you should spend a lot of cpu and encode as well as possible).

With this method I find motion vectors using local searches of radius 8-16 that are the same quality as brute force searches of radius 50-100, which makes it about two orders of magnitude faster (and higher quality, since nobody does brute force searches that far).


ADDENDUM : To give this post a bit more weight, here are some numbers on quality from my video coder vs. brute force search radius :


 -s16  : rmse : 9.3725 , psnr : 28.7277
 -s26  : rmse : 9.2404 , psnr : 28.8510
 -s48  : rmse : 9.0279 , psnr : 29.0531
 -s64  : rmse : 8.9171 , psnr : 29.1603
 -s100 : rmse : 8.7842 , psnr : 29.2907
 -s9999: rmse : 8.5294 , psnr : 29.5465

(-s16 means it's searching a 33x33 grid for motion vectors) (-s9999 means it searches full frame).

The above described iterative feature point propagation method gets


 -sfast: rmse : 8.8154 , psnr : 29.2600

BTW for doing full-frame brute force search you obviously should use a block-space acceleration structure for high dimensional nearest neighbor search, like a kd-tree, a bd-tree (box decomposition) or vp-tree (vantage point). High dimensional spaces are nasty though; the typical idea of "find a cell then walk to its immediate neighbors" is not fast in high D because you have O(D) neighbors.

9/08/2009

09-08-09 - DXTC Addendum

Ryg pointed out that there are a few very important little details that I took for granted and didn't mention in my original DXTC postings , or was just not clear about :

One is that when I try all ways of hitting two given endpoints, I try both 4-color and 3-color versions. That is, given two endpoint colors C0 and C1, I quantize them to 16 bits, then try the DXT1 palette that you get from {C0,C1} and also the one from {C1,C0} (order of DXT1 determines whether it is 3-color or 4-color).

The second and related crucial thing, is that in 3-color mode, the extra color is transparent black. If the texture has no alpha at all, I assume the user will not be using it as an alpha source, so I treat the transparent black as just black. That is, I do color palette selection with alpha just ignored.

Apparently this is pretty important. I suspect this especially helps with the "4 means" method; if a bunch of the colors are near black, you want them to be classed together and then just ignored for the endpoint selection, so that they will go to the hard-coded black in 3-color mode and your interp end points will be chosen from the remaining colors.

8/27/2009

08-27-09 - Oodle Image Compression Looking Back Pictures

I thought for the record I should put up some pictures about what I talked about last time.

First of all the R/D trellis quantization issue. Very roughly what we're doing here is coding to a certain bit rate. The "RDO" lets us use a smaller quantization bucket size, which initially lowers distrortion and increases our rate, but then we hammer on some of the values - mainly we just force them to zero, which causes some distortion and decreases rate; we choose to hammer the values that save us the most rate per distortion. (99% of the time all you're doing is turning 1's into 0's, so it's a matter of picking the 1 to squash to 0 which saves you most the rate).

Here are the results on "Moses" at 0.5 bits per pixel :

No R/D : RMSE = 9.9381 :

Photobucket

Unconstrained R/D : RMSE = 9.7843 :

Photobucket

You should be able to see in the R/D image that some of the image looks better, but other parts look much worse. The RDO has stolen rate from places where it was expensive in terms of rate to encode a certain distortion, and moved those bits to parts of the image where you can get more distortion win at a cheaper rate. This is awesome if your goal is to minimize RMSE, but it's unclear to me whether this is *ever* good perceptually.

In this particular case, the RDO Moses image actually has a worse SSIM than the No-RD image; this type of mistake is actually something that SSIM is okay at detecting.

In practice I use some hacks to limit how much the RDO can do to any one block. With those hacks I almost always get an SSIM improvement from RDO, but it's still unclear to me whether or not it's actually a perceptual improvement on many images (in some cases it's a very clear win; images like kodim09 or kodim20 where you have big flat patches in some spots and then a lot of edge detail in other spots, the RDO does a good job of stealing from the flats to give to the edges, which the eye likes, because we don't mind it if an almost perfectly smooth area becomes perfectly smooth).

Now for the hacky perceptual smooth DC issue.

This is "kodim04" at 0.25 bpp ; no RDO ; no unblock , no perceptual DC quantization ; basically a naive DCT coder :

Photobucket

Now we turn on the hacky perceptual quantization that gives more precision to smooth DC's : (unblock still off) :

Photobucket

Note that the perceptual quant of DC means that we are using more of our bitrate for the DC band, so we give less bits to AC, which means using a larger quantizer for AC to match the bit rate constraint.

Now with unblocking , no perceptual DC quant : (RMSE = 12.8565 , SSIM = 58.62%)

Photobucket

With unblocking and perceptual DC quant : (RMSE = 12.9666, SSIM = 57.88%)

Photobucket

I think the improvement is clearest on the unblocked images - the perceptual DC quant one actually looks okay, the parts that are supposed to be smooth still look smooth. The one with uniform DC quant looks disgustingly bumpy. Note that the SSIM of the better image is actually quite a bit worse. Of course RMSE gets worse any time you do a perceptual improvement. You should also be able to see that the detail in the hat thatching is better in the nonperceptual version, but that doesn't bother the eye nearly as much as breaking smoothness.

ADDENDUM : some close up pictures of Moses' waddle area showing the R/D artifacts better. You should zoom these to full screen with a box filter and toggle between them to see most clearly. You should see the RDO killing blocks in the collar area very clearly. All you really need to do is look at the last picture of these four and you should be able to see what I'm talking about with the RDO :

Portion of Moses at 0.75 bpp : No lagrange optimization :

Photobucket

With Lagrange RDO :

Photobucket

Crop of No-L :

Photobucket

Crop of RDO :

Photobucket

8/25/2009

08-25-09 - Oodle Image Compression Looking Back

I did a little image compressor for RAD/Oodle. The goal was to make something with quality comparable to a good modern wavelet coder, but using a block-based scheme so that it's more compact and simple in memory use so that it will be easy to stream through the SPU and SIMD and all that good stuff, we also wanted an internal floating point core algorithm so that it extends to HDR and arbitrary bit depths. I wrote about it before, see : here or here . That's been done for a while but there were some interesting bits I never wrote about so I thought I'd note them quickly :

1. I did lagrange R-D optimization to do "trellis quantization" (see previous ). There are some nasty things about this though, and it's actually turned off by default. It usually gives you a pretty nice win in terms of RMSE (because it's measuring "D" (distortion) in terms of MSE, so by design it optimizes that for a given rate), but I find in practice that it actually hurts perceptual quality pretty often. By "perceptual" here I just mean my own eyeballing (because as I'll mention later, I found SSIM to be pretty worthless). The problem is that the R-D trellis optimization is basically taking coefficients and slamming them to zero where the distortion cost of doing that is not worth the rate it would take to have that coefficient. In practice what this does is take individual blocks and makes them very smooth. In some cases that's great, because it lets you put more bits where they're more important (for example on images of just human faces it works great because it takes bits away from the interior patches of skin and gives those bits to the edges and eyes and such).

One of the test images I use is the super high res PNG "moses.png" that I found here . Moses is wearing a herring bone jacket. At low bit rates with R-D Trellis enabled, what happens is the coder just starts tossing out entire blocks in the jacket because they are so expensive in terms of rate. The problem with that is it's not uniform. Perceptually the block that gets killed stands out very strongly and looks awful.

Obviously this could be fixed by using a better measure of "D" in the R-D optimization. This is almost a mantra for me : when you design a very aggressive optimizer and just let it run, you better be damn sure you are rating the target criterion correctly, or it will search off into unexpected spaces and give you bad results (even though they optimize exactly the rating that you told it to optimize).

2. It seems DCT-based coders are better than wavelets on very noisy images (eg. images with film grain, or just images with lots of energy in high frequency, such as images of grasses, etc). This might not be true with fancy shape-adaptive wavelets and such, but with normal wavelets the "prior" model is that the image has most of its energy in the smooth bands, and has important high frequency detail only in isolated areas like edges. When you run a wavelet coder at low bit rate, the result is a very smoothed looking version of the image. That's good in most cases, but on the "noisy" class of images, a good modern wavelet coder will actually look worse than JPEG. The reason (I'm guessing) is that DCT coders have those high frequency pattern basis functions. It might get the detail wrong, but at least there's still detail.

In some cases it makes a big difference to specifically inject noise in the decoder. One way to do this is to do a noisey restore of the quantization buckets. That is, coefficient J with quantizer Q would normally restore to Q*J. Instead we restore to something random in the range [ Q*(J-0.5) , Q*(J+0.5) ]. This ensures that the noisey output would re-encode to the same bit stream the decoder saw. I wound up not using this method for various reasons, instead I optionally inject noise directly in image space, using a simplified model of film grain noise. The noise magnitude can be manually specified by the user, or you can have the encoder measure how noisey the original is and compare to the baseline decoder output and see how much energy we lost, and have the noise injector restore that noise level.

To really do this in a rigorous and sophisticated way you should really have location-variable noise levels, or even context-adaptive noise levels. For example, an image of a smooth sphere on a background of static should detect the local neighborhood and only add noise on the staticy background. Exploring this kind of development is very difficult because any noise injection hurts RMSE a lot, and developing new algorithms without any metric to rate them is a fool's errand. I find that in some cases reintroducing noise clearly looks better to my eye, but there's no good metric that captures that.

3. As I mentioned in the earlier posts, lapping just seems to not be the win. A good post process unblocking filter gives you all the win of lapping without the penalties. Another thing I noticed for the first time is that the JPEG perceptual quantization matrix actually has a built-in bias against blocking artifacts. The key thing is that the AC10 and AC01 (the simplest horizontal and vertical ramps) are quantized *less* than the DC. That guarantees that if you have two adjacent blocks in a smooth gradient area, if the DC's quantize to being one step apart, then you will have at least one step of AC10 linear ramp to bridge between them.

If you don't use the funny JPEG perceptual quantization matrix (which I don't think you should) then a good unblocking filter is crucial at low bit rate. The unblocking filter was probably the single biggest perceptual improvement in the entire codec.

4. I also somewhat randomly found a tiny trick that's a huge improvement. We've long noticed that at high quantization you get this really nasty chroma drift problem. The problem occurs when you have adjacent blocks with very similar colors, but not quite the same, and they sit on different sides of quantization boundary, so one block shifts down and the neighbor shifts up. For example with Quantizer = 100 you might have two neighbors with values {49, 51} and they quantize to {0,1} which restores to {0,100} and becomes a huge step. This is just what quantization does, but when you apply quantization separately to the channels of a color (RGB or YUV or whatever), when one of the channels shifts like that, it causes a hue rotation. So rather than seeing a stair step, what you see is that a neighboring block has become a different color.

Now there are a lot of ideas you might have about how to address this. To really attack it thoroughly, you would need a stronger perceptual error metric, in particular one which can measure non-local patterns, which is something we don't have. The ideal perceptual error metric needs to be able to pick up on things like "this is a smooth gradient patch in the source, and the destination has a block that stands out from the others".

Instead we came up with just a simple hack that works amazingly well. Basically what we do is adaptively resize the quantization of the DC component, so that when you are in a smooth region ("smooth" meaning neighboring block DC's are similar to each other), then we use finer quantization bucket sizes. This lets you more accurately represent smooth gradients and avoid the chroma shift. Obviously this hurts RMSE so it's hard to measure the behavior analytically, but it looks amazingly much better to our eyes.

Of course while this is an exciting discovery it's also terrifying. It reminded me how bad our image quality metrics are, and the fact that we're optimizing for these broken metrics means we are making broken algorithms. There's a whole plethora of possible things you could do along this vein - various types of adaptive quantizer sizes, maybe log quantizers? maybe more coarse quantizers in noisy parts of the image? it's impossible to explore those ideas because we have no way to rate them.

As I mentioned previously, this experiment also convinced me that SSIM is just worthless. I know in the SSIM papers they show examples where it is slightly better than RMSE at telling which image is higher quality, but in practice within the context of a DCT-based image coder I find it almost never differs from RMSE; that is, if you do something like R-D optimized quantization of DCT coefficients with Distortion measured by RMSE, you will produce an image that has almost exactly the same SSIM as if you did R-D with D measured by SSIM. If RMSE and SSIM were significantly different, that would not be the case. I say this within the context of DCT-based image coding because obviously RMSE and SSIM can disagree a lot, but that axis of freedom is not explored by DCT image coders. The main thing is that SSIM is really not measuring anything important visual at all. A real visual metric needs to use global/neighborhood information, and knowledge of shapes and what is important about the image. For example, changing a pixel that's part of a perfect edge is way more important than changing an image that's in some noise. Changing a block from grey to pink is way worse than changing a block from green to blue-green, even if it's a smaller value change. etc. etc.

It seems to me there could very easily be massive improvements possible in perceptual quality without any complexity increase that we just can't get because we can't measure it.

8/05/2009

08-05-09 - Relacy License Notes

The lock-free code I posted with Relacy has a clarification to the license agreement added. If you have downloaded this please read and make sure you are in compliance. I've copied the added text here :

ADDENDUM ON RELACY LICENSE : (revised 9-14-09)

Relacy is now released under the BSD license :


    1 /*  Relacy Race Detector
    2  *  Copyright (c) 2009, Dmitry S. Vyukov
    3  *  All rights reserved.
    4  *  Redistribution and use in source and binary forms, with or without modification,
    5  *  are permitted provided that the following conditions are met:
    6  *    - Redistributions of source code must retain the above copyright notice,
    7  *      this list of conditions and the following disclaimer.
    8  *    - Redistributions in binary form must reproduce the above copyright notice, this list of conditions
    9  *      and the following disclaimer in the documentation and/or other materials provided with the distribution.
   10  *    - The name of the owner may not be used to endorse or promote products derived from this software
   11  *      without specific prior written permission.
   12  *  THIS SOFTWARE IS PROVIDED BY THE OWNER "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
   13  *  THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
   14  *  IN NO EVENT SHALL THE OWNER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY,
   15  *  OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
   16  *  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
   17  *  STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
   18  *  EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
   19  */

My work with Relacy is 100% free for any use. However, the original Relacy license still applies to all work product made with Relacy, such as my code above.

The version of Relacy that I built my code with was released under a previous less clear and restrictive license. Dmitry says the new BSD license applies.

8/04/2009

08-04-09 - CINIT

Two questions I can't find answers to :

1. Is there a way to tell from a piece of code that you are being called from cinit ? eg. in C++ when a constructor causes some code to run, and that calls some function, and then I get called, is there anything I can check to see that I'm currently in cinit, not main?

(obviously a very evil thing I could do is run a stack trace and see what's at the top of the stack). I can't find anything in C that I can check, because the C stdlib is initialized before me, so to my cinit code it looks just like I'm in the app run.

The reason I want this is mainly for asserting & validation - I want to make sure that my own cinit code isn't calling certain things (such as memory allocation) so I want to put in checks like ASSERT( ! in_cinit() );

2. Is there a way to disallow cinit code in certain modules? For example, having cinit stuff in any library is very unsafe because you have to be wary that your OBJ could get dropped from the link and the cinit stuff will not be run, plus you have order of run issues, it's something I can handle fine in my own projects, but not something I want to force on clients. So I want to make sure that my actual deliverable libraries have no cinit stuff - but I do want cinit stuff in my test apps, so I don't want to just break it entirely. I'd really like a compiler error so I know I did a booboo right when I write it.

7/07/2009

07-07-09 - Small Image Compression Notes , Part 2

Deblocking survey :

There are a few different ways to come at the problem theoretically.

One is to work on post-decode data in spatial domain. These approaches basically work by explicitly trying to detect block edges and just filter them. This is the approach, for example, of the H264 "in loop deblocking filter" which there is a lot of literature on. See for example "Adaptive Deblocking Filter" by List, Joch, et.al. For an example of the filter-based approach on the 8x8 DCT case see "DCT-Based Image Compression using Wavelet-Based Algorithm with Efficient Deblocking Filter" by Yan and Chen. (BTW the JPEG standard contains a "block smoother" which basically predicts AC1 as a linear function from neighboring block coefficients. This is okay for the specific case of smooth images and very high quantization, but is generally not awesome and is an ancient technique. Ignore.)

A more hardcore version of the filtering approach is "Combined Frequency and Spatial Domain Algorithm for the Removal of Blocking Artifacts" which does adaptively-offset and adaptively-directed gaussian filters ; this is sort of like the image denoising stuff that creates pixel gradient flow vectors - the filters are local gradient adaptive so they don't go across real edges. This appears to perform quite well but is very expensive.

The other general approach is a more abstract maximum-likelihood idea. You received a lossy compressed image I. You know the original image was one of the many which when compressed produces I. You want to output the one that was most likely the true original image. This is a maximum likelihood problem, and requires some a-priori model of what you think "natural" images look like. In particular, for the case of quantized DCT coefficients, you have a quantized DCT coefficient C ; instead of just reproducing Q*C you can reproduce anything in the range { Q*C - Q/2 , Q*C + Q/2 } , and you should choose the thing in that range that makes the "best" image.

"Optimal JPEG Decoding" (1998) by Jung, Antonini, Barlaud takes this approach directly. Their results are not awesome though; presumably because their prior is not good. A more modern version of the same idea is "Block Artifact Reduction Using a Transform-Domain Markov Random Field Model" by Li and Delp which uses a better model for image likelihood, but is in the same vein of doing a brute force search in the allowed coefficient space to find the maximum-likelihood reproduction.

A related method that was popular for a while is "Projection onto Convex Sets". This is basically just a method of satisfying simple convex constraints in an optimization. Here our constraint is that the quantized coefficient stay the same, that is, repro in { Q*C - Q/2 , Q*C + Q/2 } . You then apply some target function, such as you want smoothness or something, and take iterative steps towards that goal and project onto the constraints one by one. There are a lot more details to this, I haven't paid too much attention to it because these are all crazy expensive and I want something realtime.

"Blocking Artifact Detection and Reduction in Compressed Data" by Triantafyllidis etal (2002) is in the same vein but simpler and more analytical. It again worse directly in DCT space on coefficients within their quantization range, but it directly solves for the ideal reconstruction value as a function of neighbors based on minimization of specific simple deblocking metric. You wind up with just some equations for how to modify each coefficient in terms of neighbor coefficients. While the paper is good, I think one of their base assumptions - that the frequencies can be dealt with independently - is not sound, and most other people do not make that assumption.

"Derivation of Prediction Equations for Blocking Effect Reduction" by Gopal Lakhani and Norman Zhong (1999) is an older, simpler still version of the Triantafyllidis paper. They only correct the first few coefficients and solve for optimal reconstruction to minimize MSDS (mean squared difference of slopes). You can actually look at the equations here and they're very intuitively obviously right. For example, the first AC coefficient should be corrected using the difference of the neighboring DC coefficients. In case you don't see that that's obviously right, if you have DC's like [8],[16],[24] after dequantization at Q=8, and your AC's all got quantized to zero, obviously the original image most likely had a smooth slope, so the first AC in the middle block should be predicted to be the linear interpolation.

An interesting one I found that's related to the stuff I tried with smooth reconstruction of the DC band is : "Improvement of DCT-based Compression Algorithms Using Poisson�s Equation" by Yamatani and Saito (2006) .

BTW a related issue that often comes up is the incorrectness of center dequantization of AC coefficients. I've written about this before and lots of these papers mention it; the best full note on it is : "Biased Reconstruction for JPEG Decoding" by Price.

The very modern stuff has gotten quite arcane. People now are doing things like directional overcomplete wavelets on the reproduced image; with this they can detect both block artifacts and also ringing and other quantized transform artifacts. They then use maximum-likelihood markov models to guess what the source image was that produced this output. This stuff is extremely complex and I haven't really followed it because it's nowhere near realtime, but probably the best solution for offline very high quality JPEG decoders.

An interesting outlier is John Costella's Unblock . It's based on a clever simple idea that I've never seen anywhere else. Unblock is based on the assumption that pixels near the block boundaries come from the same model as pixels in the centers of blocks. That sounds obvious but it's quite profound. It means that pixels near the edges of blocks should have the same statistics as pixels in the centers (in the maximum likelihood lingo, this is a prior we can use to choose an optimal output). In particular, it's useful because in the DCT the interior pixels are much more accurate than the edge pixels. What Unblock does is looks at the statistics of the decompressed interior pixels and assumes those are our goal, and then it forces the pixels near the edge to match the statistics of the interior. The corrections are applied as wide smooth filters.

7/06/2009

07-06-09 - Small Image Compression Notes

Lapping appears to be a complete red herring. I've wasted a lot of time on it and I'm very angry. I've been trying to work up a lapped block DCT image coder. The idea is that block-DCT-based is good for speed and parallelization for micro-core architectures, good for memory bandwidth, etc. and the lapping theoretically lets you avoid some of the nasty block artifacts by effectively extending your basis functions.

In practice it just doesn't work. I've tried lots of different lapping methods, and in all of them if I make a parameterized lap amount based on a kaiser-bessel-derived window and then tweak the lap amount to maximize SSIM, it tunes to no lapping at all. Basically what's happening is that the extra bit rate cost caused by the forward lap scrambling things up is too great for the win of smoother basis functions on decompress to make up. Obviously in a few contrived cases it does help, such as on very smooth images at very high compression. (of course the large lap basis functions are a form of modeling - they will help any time the image is smooth over the larger area, and hurt when it is not).

The really silly thing about this is that areas where the image is very smooth over a large area are the cases we already handle very well!! Yeah sure naive JPEG looks awful, but even a deblocking filter after decompress can fix that case very easily. In areas that aren't smooth, lapping actually makes artifacts like ringing worse.

The other issue is I'm having a little trouble with lagrange bitstream optimization. Basically my DCT block coder does a form of "trellis quantization" (which I wrote about before) where it can selectively zero coefficients if it decides it gets an R/D win by doing so. Obviously this gives you a nice RMSE win at a given rate (by design it does so - any time it finds a coefficient to zero, it steps up the R/D slope). But what does this actually do?

Think about trying to make the best bit stream for a given rate. Say two bits per pixel. If we don't do any lagrange optimization at all, we might pick some quantizer, say Q = 16. Now we turn on lagrange optimization, it finds some coefficients to zero, that reduces the bit rate, so to get back to the target bit rate, we can use a lower quantizer. It searches for the right lagrange lambda by iterating a few times and we wind up with something like Q = 12 , and some values zeroed, and a better RMSE. What's happened is we got to use a lower quantizer, so we made more, larger, nonzero coefficients, and then we selectively zeroed a few that took the most R/D.

But what does this actually do to the image qualitatively? What it does is increase the quality everywhere (Q =16 goes to Q=12) , but then it stomps on the quality in a few isolated spots (trellis quantization zeros some coefficients). If you compare the two images, the lagrange optimized one looks better everywhere, but then is very smooth and blurred out in a few spots. Normally this is not a big deal and it's just a win, but sometimes I've found it actually looks really awful.

Even if you optimize for some perceptual metric like SSIM it doesn't detect how bad this is, because SSIM is still a local measurement and this is a nonlocal artifact. Your eyes very quickly pick out that part of the image has been blurred way more than the rest of it. (in other cases it does the same thing, but it's actually good; it sort of acts like a bilateral filter actually, it will give bits to the high contrast edges and kill coefficients in the texture part, so for like images of skin it does a nice job of keeping the edges sharp and just smoothing out the interior, as opposed to non-lagrange-optimized JPEG which allocates bits equally and will preserve the skin pore detail and make the edges all ringy and chopped up).

I guess the fix to this is some hacky/heuristic way to just force the lagrange optimization not to be too aggressive.

I guess this is also an example of a computer problem that I've observed many times in various forms : when you let a very aggressive optimizer run wild seeking some path to maximize some metric, it will do so, and if your metric does not perfectly measure exactly the thing that you actually want to optimize, you can get some very strange/bad results.

6/22/2009

06-22-09 - Redraw Dilemma

This apartment searching is really annoying me. I can't handle having "many balls in the air" ; when I put something on my todo list, I like to work at it until it's gone. God I fucking hate shit on my todo list (the fucking health care keeps reinserting itself on my todo list and it's pissing me off; they got me again today with some billing fuckup, but I digress...).

Anyway, it's reminding me of a concept I often think about. I'll call it "the redrawer's dilemma" but there must be a better/standard name for this.

The hypothetical game goes something like this :

You are given a bag with 100 numbers in it. You know the numbers are in [0,1000] but don't know how many of each number there are in the bag. You start by drawing a random number from the bag.

At each turn of play, you can either keep your current number (in which case that is your final score), or you can put your current number back in the bag and draw again, but drawing again costs you -1 that will be subtracted from your final score.

How do you play this game optimally?

There are two things that are interesting to me about this game in real life. One is that humans almost always play it incredibly badly, and the second is that when you finally decide to stop redrawing you're almost always unhappy about it (unless you got super lucky and draw a 900+ number).

The two classic human player errors in this game are the "I just started drawing, I shouldn't stand yet" and the "I can't stop now, I already passed on something better than this". The "I just started drawing, I shouldn't stand yet" guy draws something like an 800 on one of his early draws. He thinks dang that's really good, but maybe this bag just has lots of high numbers in it, I just started drawing, I should put some time into it. Now of course that reasoning is based in correct logic - if you have reason to believe that your chance of drawing higher is good enough to merit the cost of continued looking, then yes, do so, but just drawing more because "it's early" makes no sense - the game is totally non temporal, the cost of continuing drawing doesn't go up over time. This often leads into the "I can't stop now, I already passed on something better than this" guy, who's mainly motivated by pride and shame - he doesn't want to admit to himself that he made a big mistake passing early when he got a high number, so he has to keep drawing until he gets something better. He might draw an 800, then a whole mess of single digit numbers and he's thinking "oh fuck I blew it" and then he draws a 400. At that point he should stand and quit redrawing, but he can't, so he draws again.

The thing is, even if he played correctly and just took the 400 after passing on the 800, he would be really unhappy about. And if the early termination guy played correctly and just got an early 800 and didn't draw, he would be unhappy too, because he'd always be wondering if he could've done better.

The other game theory / logical fallacy that plagues me in these kind of things is "I'm already spending X I may as well spend X". First I was looking for places around $1500, then I bumped it to $1700, then $1900. Now I'm looking at places for $2500 cuz fuck it they're nicer and I was looking at places for $2000 so it's only $500 more.

In other news, hotpads is actually a pretty cool apartment search site. It seems they are just scraping craigslist and maybe some other classifieds sites, so it's not like they have anything new, but the map interface and search features and such are solid. One thing is really annoying me about it though - the wheel zooming in the map is totally broken, I keep trying to wheel zoom and it sends the map off the never never land. Urg!

In more random news, I've really enjoyed the "Wallander" series on PBS ; the crime stories are pretty dumb/ridiculous, but I like the muddled contemplative pace of it, and the washed out monochrome color palette.

6/21/2009

06-21-09 - Fast Exp & Log

So in an earlier post I wrote about approximation of log2 and Ryg commented with links to Robin Green's great GDC 2003 talk : part1 (pdf) and part2 (pdf) ( main page here ).

It's mostly solid, but in part 2 around page 40 he talks about "fastexp" and "bitlog" and my spidey senses got tingling. Either I don't understand, or he was just smoking crack through that section.

Let's look at "bitlog" first. Robin writes it very strangely. He writes :


A Mathematical Oddity: Bitlog
  A real mathematical oddity
  The integer log2 of a 16 bit integer
  Given an N-bit value, locate the leftmost nonzero bit.
  b = the bitwise position of this bit, where 0 = LSB.
  n = the NEXT three bits (ignoring the highest 1)

    bitlog(x) = 8x(b-1) + n

  Bitlog is exactly 8 times larger than log2(x)-1

Bitlog Example
 For example take the number 88
88 = 1011000
b = 6th bit
n = 011 = 3
bitlog(88) = 8*(6-1)+3
= 43
  (43/8)+1 = 6.375
  Log2(88) = 6.4594
  This relationship holds down to bitlog(8)

Okay, I just don't follow. He says it's "exact" but then shows an example where it's not exact. He also subtracts off 1 and then just adds it back on again. Why would you do this :

    bitlog(x) = 8x(b-1) + n

  Bitlog is exactly 8 times larger than log2(x)-1

When you could just say :

    bitlog(x) = 8xb + n

  Bitlog is exactly 8 times larger than log2(x)

??? Weird.

Furthermore this seems neither "exact" nor an "oddity". Obviously the position of the MSB is the integer part of the log2 of a number. As for the fractional part of the log2, this is not a particular good way to get it. Basically what's happening here is he takes the next 3 bits and uses them for linear interpolation to the next integer.

Written out verbosely :


x = int to get log2 of
b = the bitwise position of top bit, where 0 = LSB.

x >= (1 << b) && x < (2 << b)

fractional part :
f = (x - (1 << b)) / (1 << b)

f >= 0 && f < 1

x = 2^b * (1 + f)

correct log2(x) = b + log2(1+f)

approximate with b + f

note that "f" and "log2(1+f)" both go from 0 to 1, so it's exact at the endpoints
but wrong in the middle

So far as I can tell, Robin's method is actually like this :

uint32 bitlog_x8(uint32 val)
{
    if ( val <= 8 )
    {
        static const uint32 c_table[9] = { (uint32)-1 , 0, 8, 13, 16, 19, 21, 22, 24 };
        return c_table[val];
    }
    else
    {
        unsigned long index;
        
        _BitScanReverse(&index,(unsigned long)val);
    
        ASSERT( index >= 3 );
    
        uint32 bottom = (val >> (index - 3)) & 0x7;
        uint32 blog = (index << 3) | bottom;

        return blog;
    }
}

where I've removed the weird offsets of 1 and this just returns log2 times 8. You need the check for val <= 8 because shifting by negative amounts is fucked.

But you might well ask - why only use 3 bits ? And in fact you're right, I see no reason to use only 3 bits. In fact we can do a fixed point up to 27 bits : (we need to save 5 bits at the top to store the max possible integer part of the log2)


float bitlogf(uint32 val)
{
    unsigned long index;
    
    _BitScanReverse(&index,(unsigned long)val);

    uint32 vv = (val << (27 - index)) + ((index-1) << 27);

    return vv * (1.f/134217728); // 134217728 = 2^27
}

what we've done here is find the pos of the MSB, shift val up so the MSB is at bit 27, then we add the index of the MSB (we subtract one because the MSB it self starts the counting at one in the 27th bit pos). This makes a fixed point value with 27 bits of fractional part, the bits below the MSB act as the fractional bits. We scale to return a float, but you could of course do this with any # of fixed point bits and return a fixed point int.

But of course this is exactly the same kind of thing done in an int-to-float so we could use that too :


float bitlogf2(float fval)
{
    FloatAnd32 fi;
    
    fi.f = fval;
    
    float vv = (float) (fi.i - (127 << 23));
    
    return vv * (1.f/8388608); // 8388608 = 2^23
}

which is a lot like what I wrote about before. The int-to-float does the exact same thing we did manually above, finding the MSB and making the log2 and fractional part.

One note - all of these versions are exact for the true powers of 2, and they err consistently low for all other values. If you want to minimize the maximum error, you should bias them.

The maximum error of ( log2( 1 + f) - f ) occurs at f = ( 1/ln(2) - 1 ) = 0.442695 ; that error is 0.08607132 , so the correct bias is half that error : 0.04303566

Backing up in Robin's talk we can now talk about "fastexp". "fastexp" is doing "e^x" by using the floating point format again, basically he's just sticking x into the exponent part to get the int-to-float to do the 2^x. To make it e^x instead of 2^x you just scale x by 1/ln(2) , and again we use the same trick as with bitlog : we can do exact integer powers of two, to get the values in between we use the fractional bits for linear interpolation. Robin's method seems sound, it is :


float fastexp(float x)
{
    int i = ftoi( x * 8.f );
        
    FloatAnd32 f;
    f.i = i * 1512775 + (127 << 23) - 524288;
    
    // 1512775 = (2^20)/ln(2)
    // 524288 = 0.5*(2^20)

    return f.f;
}

for 3 bits of fractional precision. (note that Robin says to bias with 0.7*(2^20) ; I don't know where he got that; I get minimum relative error with 0.5)).

Anyway, that's all fine, but once again we can ask - why just 3 bits? Why not use all the bits of x as fractional bits? And if we put the multiply by 1/ln(2) in the float math before we convert to ints, it would be more accurate.

What we get is :


float fastexp2(float x)
{
    // 12102203.16156f = (2^23)/ln(2)
    int i = ftoi( x * 12102203.16156f );
    
    FloatAnd32 f;
    f.i = i + (127 << 23) - 361007;
    
    // 361007 = (0.08607133/2)*(2^23)

    return f.f;
}

and indeed this is much much more accurate. (max_rel_err = 0.030280 instead of 0.153897 - about 5X better).

I guess Robin's fastexp is preferrable if you already have your "x" in a fixed point format with very few fractional bits (3 bits in that particular case, but it's good for <= 8 bits). The new method is preferred if you have "x" in floating point or if "x" is in fixed point with a lot of fractional bits (>= 16).

ADDENDUM :

I found the Google Book where bitlog apparently comes from; it's Math toolkit for real-time programming By Jack W. Crenshaw ; so far as I can tell this book is absolute garbage and that section is full of nonsense and crack smoking.

ADDENDUM 2 :

it's obvious that log2 is something like :


x = 2^I * (1+f)

(I is an int, f is the mantissa)

log2(x) = I + log2(1+f)

log2(1+f) = f + f * (1-f) * C

We've been using log2(1+f) ~= f , but we know that's exact at the ends and wrong in the middle
so obvious we should add a term that humps in the middle.

If we solve for C we get :

C = ( log2(1+x) - x ) / x*(1-x)

Integrating on [0,1] gives C = 0.346573583

hence we can obviously do a better bitlog something like :

float bitlogf3(float fval)
{
    FloatAnd32 fi;
    
    fi.f = fval;
    
    float vv = (float) (fi.i - (127<<23));
    
    vv *= (1.f/8388608);
    
    //float frac = vv - ftoi(vv);
    
    fi.i = (fi.i & 0x7FFFFF) | (127<<23);
    
    float frac = fi.f - 1.f;
    
    const float C = 0.346573583f;
        
    return vv + C * frac * (1.f - frac);
}

old rants