08-25-09 - Oodle Image Compression Looking Back

I did a little image compressor for RAD/Oodle. The goal was to make something with quality comparable to a good modern wavelet coder, but using a block-based scheme so that it's more compact and simple in memory use, which makes it easy to stream through the SPU and SIMD and all that good stuff. We also wanted an internal floating point core algorithm so that it extends to HDR and arbitrary bit depths. I wrote about it before, see here or here. That's been done for a while but there were some interesting bits I never wrote about so I thought I'd note them quickly :

1. I did Lagrange R-D optimization to do "trellis quantization" (see previous). There are some nasty things about this though, and it's actually turned off by default. It usually gives you a pretty nice win in terms of RMSE (because it's measuring "D" (distortion) in terms of MSE, so by design it optimizes that for a given rate), but I find in practice that it actually hurts perceptual quality pretty often. By "perceptual" here I just mean my own eyeballing (because as I'll mention later, I found SSIM to be pretty worthless). The problem is that the R-D trellis optimization is basically taking coefficients and slamming them to zero wherever the distortion saved by keeping a coefficient is not worth the rate it would take to code it. In practice what this does is take individual blocks and make them very smooth. In some cases that's great, because it lets you put more bits where they're more important (for example on images of just human faces it works great because it takes bits away from the interior patches of skin and gives those bits to the edges and eyes and such).

One of the test images I use is the super high res PNG "moses.png" that I found here. Moses is wearing a herringbone jacket. At low bit rates with R-D Trellis enabled, what happens is the coder just starts tossing out entire blocks in the jacket because they are so expensive in terms of rate. The problem with that is it's not uniform: some blocks keep their texture and some don't. Perceptually the block that gets killed stands out very strongly and looks awful.

Obviously this could be fixed by using a better measure of "D" in the R-D optimization. This is almost a mantra for me : when you design a very aggressive optimizer and just let it run, you better be damn sure you are rating the target criterion correctly, or it will search off into unexpected spaces and give you bad results (even though they optimize exactly the rating that you told it to optimize).
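The "slamming to zero" decision can be sketched per coefficient as a minimal Lagrangian choice, J = D + lambda * R, with D as squared error. This is only an illustration, not the Oodle code: a real trellis quantizer also accounts for how each choice changes the cost of coding later coefficients, and `toy_rate`, `lam` and the sample values are hypothetical stand-ins.

```python
def toy_rate(level):
    # Hypothetical rate model (not a real entropy coder): zero is cheap,
    # larger magnitudes cost more bits.
    return 1.0 if level == 0 else 2.0 + 2.0 * abs(level).bit_length()

def rd_choose_level(coef, q, lam, rate_bits=toy_rate):
    """Return the quantized level minimizing J = D + lam * R for one coefficient."""
    nearest = int(round(coef / q))
    # Candidate levels: the nearest one, one step closer to zero, and zero itself.
    toward_zero = nearest - (nearest > 0) + (nearest < 0)
    candidates = sorted({nearest, toward_zero, 0}, key=abs)
    best_level, best_cost = None, float("inf")
    for level in candidates:
        distortion = (coef - level * q) ** 2          # D measured as squared error
        cost = distortion + lam * rate_bits(level)    # J = D + lambda * R
        if cost < best_cost:                          # ties keep the smaller magnitude
            best_level, best_cost = level, cost
    return best_level

def rd_quantize_block(coefs, q, lam):
    # Independent per-coefficient decisions; a simplification of a true trellis.
    return [rd_choose_level(c, q, lam) for c in coefs]
```

With `lam = 0` this is plain nearest-level quantization. As `lam` grows, small coefficients whose squared-error saving is less than `lam` times their bit cost get zeroed, which is exactly the behavior that flattens textured blocks.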

2. It seems DCT-based coders are better than wavelets on very noisy images (eg. images with film grain, or just images with lots of energy in high frequency, such as images of grasses, etc). This might not be true with fancy shape-adaptive wavelets and such, but with normal wavelets the "prior" model is that the image has most of its energy in the smooth bands, and has important high frequency detail only in isolated areas like edges. When you run a wavelet coder at low bit rate, the result is a very smoothed looking version of the image. That's good in most cases, but on the "noisy" class of images, a good modern wavelet coder will actually look worse than JPEG. The reason (I'm guessing) is that DCT coders have those high frequency pattern basis functions. It might get the detail wrong, but at least there's still detail.

In some cases it makes a big difference to specifically inject noise in the decoder. One way to do this is to do a noisy restore of the quantization buckets. That is, coefficient J with quantizer Q would normally restore to Q*J. Instead we restore to something random in the range [ Q*(J-0.5) , Q*(J+0.5) ]. This ensures that the noisy output would re-encode to the same bit stream the decoder saw. I wound up not using this method for various reasons; instead I optionally inject noise directly in image space, using a simplified model of film grain noise. The noise magnitude can be manually specified by the user, or you can have the encoder measure how noisy the original is, compare that to the baseline decoder output to see how much energy we lost, and have the noise injector restore that noise level.
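A minimal sketch of the noisy bucket restore described above, assuming a plain uniform quantizer with no dead zone. The function names and the `spread` safety margin (which keeps the random point slightly away from the bucket edges so floating point rounding cannot push it into a neighbor) are assumptions for illustration.

```python
import math
import random

def quantize(x, q):
    """Uniform quantizer: bucket J covers [q*(J-0.5), q*(J+0.5))."""
    return int(math.floor(x / q + 0.5))

def restore_plain(j, q):
    """Normal restore: the center of bucket J."""
    return q * j

def restore_noisy(j, q, rng, spread=0.999):
    """Restore to a uniformly random point inside bucket J.

    spread < 1 is an assumed safety margin away from the bucket edges.
    """
    return q * (j + (rng.random() - 0.5) * spread)

rng = random.Random(1234)
for j in (-2, 0, 3):
    x = restore_noisy(j, 8.0, rng)
    # Re-quantizing the noisy value lands back in the same bucket.
    print(j, round(x, 3), quantize(x, 8.0))
```

Because every restored value stays inside its own bucket, re-encoding the noisy output with the same quantizer reproduces the same indices.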

To do this in a rigorous and sophisticated way you should really have location-variable noise levels, or even context-adaptive noise levels. For example, on an image of a smooth sphere on a background of static, the coder should detect the local neighborhood and only add noise on the staticky background. Exploring this kind of development is very difficult because any noise injection hurts RMSE a lot, and developing new algorithms without any metric to rate them is a fool's errand. I find that in some cases reintroducing noise clearly looks better to my eye, but there's no good metric that captures that.

3. As I mentioned in the earlier posts, lapping just seems to not be the win. A good post process unblocking filter gives you all the win of lapping without the penalties. Another thing I noticed for the first time is that the JPEG perceptual quantization matrix actually has a built-in bias against blocking artifacts. The key thing is that the AC10 and AC01 (the simplest horizontal and vertical ramps) are quantized *less* than the DC. That guarantees that if you have two adjacent blocks in a smooth gradient area, if the DC's quantize to being one step apart, then you will have at least one step of AC10 linear ramp to bridge between them.
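To put rough numbers on that, here is a small sketch using the top-left corner of the example luminance table from Annex K of the JPEG standard, where DC is 16 and the two lowest AC entries are 11 and 12. It assumes the usual JPEG 8x8 DCT scaling and converts one quantization step of each coefficient into a pixel-domain change; the helper names are made up for illustration.

```python
import math

# Top-left corner of the example luminance quantization table (JPEG Annex K).
# Entry [0][0] is DC; [0][1] and [1][0] are the two lowest AC coefficients.
JPEG_LUMA_Q = [[16, 11],
               [12, 12]]

def dc_pixel_step(q_dc):
    """Pixel-domain shift of a whole block from one DC step (JPEG 8x8 IDCT scaling)."""
    return q_dc / 8.0

def first_ac_ramp(q_ac):
    """Pixel-domain profile across the 8 samples from one step of the lowest AC."""
    scale = q_ac / (4.0 * math.sqrt(2.0))
    return [scale * math.cos((2 * x + 1) * math.pi / 16.0) for x in range(8)]

dc_step = dc_pixel_step(JPEG_LUMA_Q[0][0])
ramp = first_ac_ramp(JPEG_LUMA_Q[0][1])
ramp_span = ramp[0] - ramp[-1]
print("one DC step shifts the block by", dc_step, "levels")
print("one AC step spans", round(ramp_span, 2), "levels across the block")
```

Under these assumptions one DC step moves a block by 2 gray levels, while a single step of the lowest AC coefficient produces a ramp spanning roughly 3.8 levels end to end, so the ramp is fine enough to bridge a one-step DC difference between neighbors.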

If you don't use the funny JPEG perceptual quantization matrix (which I don't think you should) then a good unblocking filter is crucial at low bit rate. The unblocking filter was probably the single biggest perceptual improvement in the entire codec.

4. I also somewhat randomly found a tiny trick that's a huge improvement. We've long noticed that at high quantization you get this really nasty chroma drift problem. The problem occurs when you have adjacent blocks with very similar colors, but not quite the same, and they sit on different sides of a quantization boundary, so one block shifts down and the neighbor shifts up. For example with Quantizer = 100 you might have two neighbors with values {49, 51} and they quantize to {0,1} which restores to {0,100} and becomes a huge step. This is just what quantization does, but when you apply quantization separately to the channels of a color (RGB or YUV or whatever), when one of the channels shifts like that, it causes a hue rotation. So rather than seeing a stair step, what you see is that a neighboring block has become a different color.
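The {49, 51} example can be reproduced with a plain round-to-nearest quantizer applied to each channel independently. The three-channel block values below are hypothetical, chosen so that only one channel straddles a bucket boundary.

```python
import math

def quantize(x, q):
    # Round to nearest bucket index.
    return int(math.floor(x / q + 0.5))

def restore(j, q):
    return j * q

def code_color(color, q):
    """Quantize and restore each channel independently."""
    return tuple(restore(quantize(c, q), q) for c in color)

Q = 100
print([restore(quantize(v, Q), Q) for v in (49, 51)])   # the single-channel step

# Two neighboring block averages that differ by 2 in one channel only
# (hypothetical values).
block_a = (149, 120, 130)
block_b = (151, 120, 130)
print(code_color(block_a, Q), code_color(block_b, Q))
```

The two inputs are nearly the same color, but after coding one block comes out neutral and its neighbor comes out strongly tinted in the first channel, which reads as a color change rather than a brightness step.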

Now there are a lot of ideas you might have about how to address this. To really attack it thoroughly, you would need a stronger perceptual error metric, in particular one which can measure non-local patterns, which is something we don't have. The ideal perceptual error metric needs to be able to pick up on things like "this is a smooth gradient patch in the source, and the destination has a block that stands out from the others".

Instead we came up with just a simple hack that works amazingly well. Basically what we do is adaptively resize the quantization of the DC component, so that when you are in a smooth region ("smooth" meaning neighboring block DC's are similar to each other), then we use finer quantization bucket sizes. This lets you more accurately represent smooth gradients and avoid the chroma shift. Obviously this hurts RMSE so it's hard to measure the behavior analytically, but it looks amazingly much better to our eyes.
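A minimal sketch of the idea, not the actual Oodle implementation. The post does not say how smoothness is tested or how much finer the buckets get, so `smooth_thresh` and `fine_scale` are hypothetical parameters, and the neighbor DCs are assumed to be already-decoded values so that a decoder could make the same decision.

```python
import math

def dc_step(base_q, neighbor_dcs, smooth_thresh=1.0, fine_scale=0.25):
    """Pick the DC quantizer step for a block from its neighbors' DC values.

    smooth_thresh and fine_scale are assumed tuning values, not from the post.
    """
    if len(neighbor_dcs) < 2:
        return base_q                      # not enough context: use the normal step
    spread = max(neighbor_dcs) - min(neighbor_dcs)
    if spread <= smooth_thresh * base_q:   # neighbors agree, so the region is "smooth"
        return base_q * fine_scale         # finer buckets for smooth regions
    return base_q

def code_dc(dc, step):
    """Quantize and restore a DC value with the given step."""
    return math.floor(dc / step + 0.5) * step

neighbors = [48.0, 52.0]                   # hypothetical decoded neighbor DCs
step = dc_step(100.0, neighbors)
print(step, code_dc(49.0, step), code_dc(51.0, step))
```

With the base step of 100 the values 49 and 51 restore to 0 and 100; with the finer step chosen here both restore to 50, so the visible jump between the two blocks goes away at the cost of spending more bits on DC.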

Of course while this is an exciting discovery it's also terrifying. It reminded me how bad our image quality metrics are, and the fact that we're optimizing for these broken metrics means we are making broken algorithms. There's a whole plethora of possible things you could do in this vein: various types of adaptive quantizer sizes, maybe log quantizers, maybe coarser quantizers in noisy parts of the image. It's impossible to explore those ideas because we have no way to rate them.

As I mentioned previously, this experiment also convinced me that SSIM is just worthless. I know in the SSIM papers they show examples where it is slightly better than RMSE at telling which image is higher quality, but in practice within the context of a DCT-based image coder I find it almost never differs from RMSE; that is, if you do something like R-D optimized quantization of DCT coefficients with Distortion measured by RMSE, you will produce an image that has almost exactly the same SSIM as if you did R-D with D measured by SSIM. If RMSE and SSIM were significantly different, that would not be the case. I say this within the context of DCT-based image coding because obviously RMSE and SSIM can disagree a lot in general, but that axis of freedom is not explored by DCT image coders. The main thing is that SSIM is really not measuring anything visually important at all. A real visual metric needs to use global/neighborhood information, and knowledge of shapes and what is important about the image. For example, changing a pixel that's part of a perfect edge is way more important than changing a pixel that's in some noise. Changing a block from grey to pink is way worse than changing a block from green to blue-green, even if it's a smaller value change. etc. etc.

It seems to me there could very easily be massive improvements possible in perceptual quality without any complexity increase that we just can't get because we can't measure it.


Timothy Farrar said...

As per our bad image quality metrics, sounds like you are suggesting we need the image analog of what we have for audio compression (psychoacoustical Bark scale, equal-loudness contours, model of frequency masking, etc).

cbloom said...

Yes, though even with audio I'm not aware of automated programs that rate the quality with those metrics.

What we really need is a function call that can compare two images and give you back a quality rating.

Actually a lot of that psycho-perceptual stuff does not work well for images, because it presumes a certain viewing method and viewing environment. For example, the original JPEG visibility threshold research is actually quite good, but it is specific to a certain pixel size, a certain monitor type, a certain sitting distance, etc. It models visual masking in the eyes, and the result is that it's actually very bad when the image is used as a texture.

I actually think the visual masking and relative importance of different colors and things like that is not as important as the global effects. Maybe I'll try to make some example images to show what I mean.

Timothy Farrar said...

I was thinking something more along the lines of a function which would provide a perceptual weight per DCT coef given a source block and the surrounding 8 blocks as context.

Looking at the micro context (just block encoding with context) could be a good start.

Macro context stuff, like ability to shift around content to match basis functions as long as things like edges and relative tonal gradients are preserved globally, just gets awfully complicated (but might indeed provide better overall gains in perceptual quality per bitrate).

cbloom said...

Yeah that's a start, and certainly would be easier to integrate into existing frameworks which rely on linear error, but my intuition is that the global stuff actually matters a lot. The simplest example is just if you have a big solid color patch of value 137 surrounded by noise, then changing any one pixel in there to 138 is very bad visually, but changing the whole patch to 138 is almost zero perceptual error.
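The solid patch example is easy to check numerically. This sketch uses a hypothetical 16x16 patch stored as a flat list and ignores the surrounding noise; it shows that RMSE ranks the two changes in the opposite order from the perceptual ranking described above.

```python
import math

def rmse(a, b):
    """Root mean squared error between two equal-length pixel lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

patch = [137] * 256              # 16x16 solid patch (size is an assumption)

one_pixel = list(patch)
one_pixel[100] = 138             # a single pixel changed: a visible speck

whole_patch = [138] * 256        # entire patch shifted by one level: nearly invisible

print(rmse(patch, one_pixel))    # small error
print(rmse(patch, whole_patch))  # 16x larger error
```

RMSE scores the uniform shift as sixteen times worse than the single odd pixel, even though the odd pixel is the change a viewer would actually notice.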

cbloom said...

I should say : using immediate neighbors is what I am doing now for my hacky improvement in Oodle, and it does in fact get you a huge win. I just feel like that's only the tip of the iceberg.

ryg said...

I don't see how audio compression is much better in that regard. Yes, they incorporate psychoacoustic models at the encoder side, but only in a very local form of quantizer optimization, not unlike R-D optimization in image/video coders (though by now with relatively consistent success). They don't have any global quality optimization either, and compared with e.g. video encoders, their bitrate allocation strategies are very short-sighted. They also have basically the same problems as image/video coders: for example, like the individual over-smoothed blocks that Charles mentioned, perceptual audio codecs have the tendency to mess up transients, causing pre-echoes and overall mushiness (what would be called "blurriness" for visual signals, basically).

Finally, audio and image codecs are also in the same league compression ratio-wise. If you take 24bit RGB images and 44.1kHz 16-bit stereo CD audio (comes out at 1411kbit/s), you can see what I mean: at a ratio of 12:1 (2bits/pixel for images, 120kbit/s for audio) you get decent quality, 32:1 (0.75bits/pixel or 45kbit/s for audio) is recognizable but has very notable artifacts.

If anything, images actually do a bit better than audio: 1bit/pixel tends to look okay for most natural images, while the corresponding 60kbit/s for audio is already well into the region where most audio codecs produce really annoying artifacts for music.
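The ratios quoted above can be checked with a few lines of arithmetic. This sketch only uses the figures given in the comment (24-bit RGB, 44.1 kHz 16-bit stereo); the function names are made up for illustration.

```python
# CD audio: 44100 samples/s * 16 bits * 2 channels, in kbit/s.
CD_KBITS = 44100 * 16 * 2 / 1000.0

def image_bpp(ratio, source_bpp=24):
    """Bits per pixel after compressing 24-bit RGB by the given ratio."""
    return source_bpp / ratio

def audio_kbits(ratio):
    """Audio bitrate in kbit/s after compressing CD audio by the given ratio."""
    return CD_KBITS / ratio

for ratio in (12, 24, 32):
    print(ratio, image_bpp(ratio), round(audio_kbits(ratio), 1))
```

The exact audio figures are 117.6, 58.8 and 44.1 kbit/s for 12:1, 24:1 and 32:1, which the comment rounds to 120, 60 and 45.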

cbloom said...

Yeah, let's not even get started on audio, Jeff and I could do a long rant on how bad the audio coders suck. In compression there are two separate but related ways that you get wins :

1. Modeling the data for prediction ; eg. using self-similarity and global similarity to make more likely streams smaller. In general audio codecs suck pretty bad at this; they know nothing about instruments or music theory.

2. Modeling the data for lossiness; eg. knowing how you are free to mutate the data in ways that make it perceptually similar to the original. In audio there are obviously tons and tons of degrees of freedom in how you could make different bit streams that sound the same, but codecs don't have enough sophistication to know this. For example, have someone hit a cymbal 1000 times. Every one of those sound segments will sound perceptually interchangeable to the human ear, but will code very differently.

ryg said...

Exactly; it's really the same problem in both audio and image compression. All the lossiness is targeted at processes happening at the lowest levels of information processing in the brain; the models are all on a signal processing level. For both audio and image coding, the gains between the first algorithms to do this and the subsequent refinements we're using now are really quite small; AAC compresses maybe 2x as well as MPEG-1 Layer 2, and current still image coders have less than that on JPEG for most images. Video has seen larger improvements than that, but video coding efficiency is severely constrained by the requirement that it should be playable in realtime, which severely limited options until the late 90s.

I'm pretty certain that it's possible to gain at least one order of magnitude in all of these applications, but that's not gonna happen by tweaking current state of the art approaches; in fact, all of them (JPEG 2k, AAC, H.264) are over-engineered and pretty far down the curve into diminishing returns already.

What's really necessary to make a big dent is to get away from the signal level and into higher-level semantic properties like the cymbal example, or the "all the gazillion different instances of uniform random noise blocks look exactly the same to the human eye" example that's already been mentioned.

Of course, everyone dealing with lossy compression knows this, and nobody really has a good solution.

A big problem is that we don't know all that much about how these parts of human perception work internally, either; the "plumbing" for both the aural and visual systems was researched a long time ago and is well understood by now, but e.g. methods to determine the similarity between two different shapes are still a very active research area, and that's after you've distilled them into a relatively concise, abstract representation. It's either a set of genuinely hard problems (which seems likely), or there's something subtle but crucial going on that everyone's been missing so far.

Jon Olick said...

I've done exactly what you described in a previous project... Classified regions as noisy, smooth, etc. and then changed the quantization to match. In a noisy area, the quantization can be harsh and you won't notice, but in a smooth area, you want finer quantization steps. I had a table of quantization characteristics which the pixels could get classified into. It very much improved the compression ratio. I would also recommend trying out injecting noise into certain areas on a block basis, as some parts of the image can be helped by it and others hurt. That was a big win in perceptual quality as well.

Jaba Adams said...

Uh, so I know nothing about compression ...

What about borrowing from the machine vision community? Look for feature points in the source image, then look for feature points in the compressed image.

Wave hands about a suitable definition of perceptual features.

Naturally, I'm wary of any proposed solution that is congruent with General AI.

cbloom said...

I'm actually using some machine vision stuff in my new video work, I'll write about it soon. A lot of their stuff is very hacky.
