I did a little image compressor for RAD/Oodle. The goal was to make something with
quality comparable to a good modern wavelet coder, but using a block-based scheme so that it's more compact and simpler in memory use,
which makes it easy to stream through the SPU, use SIMD, and all that good stuff. We also wanted an internal floating-point core algorithm so
that it extends to HDR and arbitrary bit depths. I wrote about it before, see
here or
here. That's been done for a while, but
there were some interesting bits I never wrote about, so I thought I'd note them quickly :
1. I did Lagrange R-D optimization to do "trellis quantization" (see previous).
There are some nasty things about this though, and it's actually turned off by default. It usually gives you a pretty nice win in terms
of RMSE (because it's measuring "D" (distortion) in terms of MSE, so by design it optimizes that for a given rate), but I find in practice that it
actually hurts perceptual quality pretty often. By "perceptual" here I just mean my own eyeballing (because as I'll mention later, I found SSIM to
be pretty worthless). The problem is that the R-D trellis optimization is basically taking coefficients and slamming them to zero where the distortion
cost of doing that is not worth the rate it would take to have that coefficient. In practice what this does is take individual blocks and make them
very smooth. In some cases that's great, because it lets you put more bits where they're more important (for example on images of just human faces
it works great because it takes bits away from the interior patches of skin and gives those bits to the edges and eyes and such).
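To make that zeroing decision concrete, here's a minimal greedy sketch (not the actual Oodle code; a real trellis also accounts for how zeroing a coefficient changes run-length/EOB costs). bitsToCode() is a hypothetical rate model standing in for whatever the real entropy coder charges, and distortion is plain squared error, which is exactly why this can win on RMSE while looking worse perceptually :

```
// Sketch of per-coefficient Lagrangian zeroing, J = D + lambda*R.
double bitsToCode(int quantizedCoeff);  // assumed rate model, not shown

void zeroCheapCoefficients(int * qcoeff, const float * orig, int count,
                           float quantizer, float lambda)
{
    for (int i = 1; i < count; i++)  // skip DC at index 0
    {
        if (qcoeff[i] == 0)
            continue;

        // distortion added by slamming this coefficient to zero :
        float errKeep = orig[i] - qcoeff[i] * quantizer;
        float errZero = orig[i];  // a zero coefficient restores to 0
        float deltaD  = errZero * errZero - errKeep * errKeep;

        // rate saved by coding a zero instead :
        double deltaR = bitsToCode(qcoeff[i]) - bitsToCode(0);

        // zero it when the rate saved outweighs the distortion added :
        if (deltaD < lambda * deltaR)
            qcoeff[i] = 0;
    }
}
```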
One of the test images I use is the super high res PNG "moses.png" that I found here.
Moses is wearing a herringbone jacket. At low bit rates with R-D trellis enabled, what happens is the coder just starts tossing out entire blocks in
the jacket because they are so expensive in terms of rate. The problem with that is it's not uniform. Perceptually, the block that gets killed stands
out very strongly and looks awful.
Obviously this could be fixed by using a better measure of "D" in the R-D optimization. This is almost a mantra for me : when you design a very
aggressive optimizer and just let it run, you better be damn sure you are rating the target criterion correctly, or it will search off into
unexpected spaces and give you bad results (even though they optimize exactly the rating that you told it to optimize).
2. It seems DCT-based coders are better than wavelets on very noisy images (e.g. images with film grain, or just images with lots of energy in
high frequency, such as images of grasses, etc). This might not be true with fancy shape-adaptive wavelets and such, but with normal wavelets
the "prior" model is that the image has most of its energy in the smooth bands, and has important high frequency detail only in isolated areas
like edges. When you run a wavelet coder at low bit rate, the result is a very smoothed looking version of the image. That's good in most
cases, but on the "noisy" class of images, a good modern wavelet coder will actually look worse than JPEG. The reason (I'm guessing)
is that DCT coders have those high frequency pattern basis functions. They might get the detail wrong, but at least there's still detail.
In some cases it makes a big difference to specifically inject noise in the decoder. One way to do this is to do a noisy restore of the
quantization buckets. That is, coefficient J with quantizer Q would normally restore to Q*J. Instead we restore to something random in the
range [ Q*(J-0.5) , Q*(J+0.5) ]. This ensures that the noisy output would re-encode to the same bit stream the decoder saw. I wound up
not using this method for various reasons; instead I optionally inject noise directly in image space, using a simplified model of film
grain noise. The noise magnitude can be manually specified by the user, or you can have the encoder measure how noisy the original is,
compare to the baseline decoder output to see how much energy we lost, and have the noise injector restore that noise level.
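The quantization-bucket version of the idea is tiny; a sketch assuming a plain uniform quantizer (illustration only, not the actual codec's quantizer) :

```
#include <random>

// Noisy dequantization restore : coefficient index J with quantizer Q
// would normally restore to Q*J; instead restore to a uniformly random
// point in [ Q*(J-0.5), Q*(J+0.5) ), which re-quantizes to the same J.
float noisyRestore(int J, float Q, std::mt19937 & rng)
{
    std::uniform_real_distribution<float> dist(-0.5f, 0.5f);
    return Q * ((float)J + dist(rng));
}
```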
To really do this in a rigorous and sophisticated way you should really have location-variable noise levels, or even context-adaptive noise
levels. For example, given an image of a smooth sphere on a background of static, the injector should look at the local neighborhood and only add noise on the
static-y background. Exploring this kind of development is very difficult because any noise injection hurts RMSE a lot, and developing new
algorithms without any metric to rate them is a fool's errand. I find that in some cases reintroducing noise clearly looks better to my eye,
but there's no good metric that captures that.
3. As I mentioned in the earlier posts, lapping just doesn't seem to be the win. A good post-process unblocking filter gives you all the win
of lapping without the penalties. Another thing I noticed for the first time is that the JPEG perceptual quantization matrix actually has
a built-in bias against blocking artifacts. The key thing is that AC10 and AC01 (the simplest horizontal and vertical ramps) are
quantized *less* than the DC. That guarantees that if you have two adjacent blocks in a smooth gradient area and their DCs quantize to
one step apart, then you have at least one step of AC10 linear ramp available to bridge between them.
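For reference, the top-left corner of the standard JPEG luminance quantization table (Annex K of the JPEG spec) shows that bias directly :

```
// Top-left corner of the standard JPEG luminance quantization table
// (ITU-T T.81, Annex K) :
//
//        DC   AC01
//       AC10  AC11
static const int kJpegLumaQuantTopLeft[2][2] =
{
    { 16, 11 },   // DC step 16, first horizontal ramp step 11
    { 12, 12 },   // first vertical ramp step 12
};
// AC01 and AC10 are quantized *finer* than the DC, so two adjacent
// blocks whose DCs land one step apart in a smooth gradient still have
// at least one step of linear ramp available to bridge the seam.
```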
If you don't use the funny JPEG perceptual quantization matrix (which I don't think you should) then a good unblocking filter is crucial at
low bit rate. The unblocking filter was probably the single biggest perceptual improvement in the entire codec.
4. I also somewhat randomly found a tiny trick that's a huge improvement. We've long noticed that at high quantization you get this really
nasty chroma drift problem. The problem occurs when you have adjacent blocks with very similar colors, but not quite the same, and they sit
on different sides of a quantization boundary, so one block shifts down and the neighbor shifts up. For example with Quantizer = 100 you might
have two neighbors with values {49, 51} and they quantize to {0,1} which restores to {0,100} and becomes a huge step. This is just what
quantization does, but when you apply quantization separately to the channels of a color (RGB or YUV or whatever), when one of the channels
shifts like that, it causes a hue rotation. So rather than seeing a stair step, what you see is that a neighboring block has become a different
color.
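Worked through as a tiny standalone sketch (round-to-nearest uniform quantizer, illustrative values from the example above) :

```
#include <cmath>
#include <cstdio>

// Two neighboring blocks with nearly equal values land on opposite
// sides of a quantization boundary and restore a huge step apart.
int   quantize(float v, float Q) { return (int)std::lround(v / Q); }
float restore (int j,   float Q) { return j * Q; }

int main()
{
    const float Q = 100.0f;          // quantizer step
    float a = 49.0f, b = 51.0f;      // nearly identical neighbors

    int ja = quantize(a, Q);         // -> 0
    int jb = quantize(b, Q);         // -> 1

    std::printf("restored : %g vs %g\n", restore(ja, Q), restore(jb, Q)); // 0 vs 100

    // When this happens in only one channel of an RGB or YUV color,
    // the visible result is a hue shift, not just a luminance step.
    return 0;
}
```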
Now there are a lot of ideas you might have about how to address this. To really attack it thoroughly, you would need a stronger perceptual
error metric, in particular one which can measure non-local patterns, which is something we don't have. The ideal perceptual error metric
needs to be able to pick up on things like "this is a smooth gradient patch in the source, and the destination has a block that stands out
from the others".
Instead we came up with just a simple hack that works amazingly well. Basically what we do is adaptively resize the quantization of the DC
component, so that when you are in a smooth region ("smooth" meaning neighboring block DC's are similar to each other), then we use finer
quantization bucket sizes. This lets you more accurately represent smooth gradients and avoid the chroma shift. Obviously this hurts RMSE, so
it's hard to measure the behavior analytically, but to our eyes it looks amazingly better.
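A minimal sketch of how such an adaptive DC quantizer could look (made-up thresholds and scale factors, not the shipping code; it assumes the bucket size is derived from previously decoded neighbor DCs, so encoder and decoder stay in sync with no extra signaling) :

```
#include <algorithm>

// Pick a finer DC bucket size when the already-coded neighbor DCs are
// close together ("smooth" region); leave busy regions alone.
float adaptiveDcQuantizer(float baseQ, float leftDC, float upDC, float upleftDC)
{
    float lo = std::min(leftDC, std::min(upDC, upleftDC));
    float hi = std::max(leftDC, std::max(upDC, upleftDC));
    float spread = hi - lo;   // local DC activity

    if (spread < 0.5f * baseQ) return 0.25f * baseQ;  // very smooth
    if (spread < 2.0f * baseQ) return 0.5f  * baseQ;  // somewhat smooth
    return baseQ;                                     // busy, leave alone
}
```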
Of course while this is an exciting discovery it's also terrifying. It reminded me how bad our image quality metrics are, and the fact that
we're optimizing for these broken metrics means we are making broken algorithms. There's a whole plethora of possible things you could do
along this vein : various types of adaptive quantizer sizes, maybe log quantizers, maybe coarser quantizers in noisy parts of the image?
It's impossible to explore those ideas because we have no way to rate them.
As I mentioned previously, this experiment also convinced me that SSIM is just worthless. I know in the SSIM papers they show examples where it
is slightly better than RMSE at telling which image is higher quality, but in practice within the context of a DCT-based image coder I find it
almost never differs from RMSE; that is, if you do something like R-D optimized quantization of DCT coefficients with Distortion measured by
RMSE, you will produce an image that has almost exactly the same SSIM as if you did R-D with D measured by SSIM. If RMSE and SSIM were significantly
different, that would not be the case. I say this within the context of DCT-based image coding because obviously RMSE and SSIM can disagree a lot,
but that axis of freedom is not explored by DCT image coders. The main thing is that SSIM is really not measuring anything visually important at
all. A real visual metric needs to use global/neighborhood information, and knowledge of shapes and what is important about the image.
For example, changing a pixel that's part of a perfect edge is way more important than changing a pixel that's in some noise. Changing a block
from grey to pink is way worse than changing a block from green to blue-green, even if it's a smaller value change. etc. etc.
It seems to me there could very easily be massive improvements possible in perceptual quality, without any complexity increase,
that we just can't get to because we can't measure them.