With an 8x8 block we're at a big disadvantage. An 8x8 block is like a 3 level wavelet transform. That's not much; wavelet coders normally rely on a 5 or 6 level transform, which would correspond to a 32x32 block or larger. Large block transforms like that are bad, partly because they're computationally complex, but mainly because they're visually bad: large blocks create worse blocking artifacts, and also increase ringing, because they make the high frequency shapes very non-local.

Basically by only doing 8x8 we are leaving a lot of redundancy between neighboring blocks. There's moderate correlation within a block, but also strong correlation across blocks for coefficients of the same type.
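As a quick illustration of that cross-block correlation, here's a small sketch on a made-up smooth test image (the DC of an orthonormal DCT is just a scaled block mean, so block means stand in for the DC coefficients here):

```python
import numpy as np

# Smooth synthetic 64x64 image, slowly varying in both directions.
i = np.arange(64)
img = 50.0 * np.add.outer(np.sin(i / 10.0), np.cos(i / 13.0))

# Per-8x8-block means (proportional to each block's DCT DC coefficient).
means = img.reshape(8, 8, 8, 8).mean(axis=(1, 3))

# Correlate each block's "DC" with its right neighbor's.
left, right = means[:, :-1].ravel(), means[:, 1:].ravel()
corr = np.corrcoef(left, right)[0, 1]
print(round(corr, 2))  # strongly positive on smooth content
```

On real photographic content the effect is the same: same-type coefficients of adjacent blocks track each other closely, which is exactly the redundancy an 8x8 blocked transform leaves on the table.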

H264 intra frame coding is actually really excellent; it outperforms JPEG-2000, for example. There are a few papers on this idea of using H264 intra coding just for still images, and a project called AIC. (AIC performs worse than real H264 for a few reasons I'll get into later.)

"AIC" basically just does 8x8 block DCT's - but it does this interesting thing of pre-predicting the block before the transform. It works on blocks in scan order, and for each block before it does the DCT it creates a prediction from the already transmitted neighbors and subtracts that off. This is a nice page with the details . What this accomplishes does is greatly reduce correlation between blocks. It subtracts off predicted DC so the DC is usually small, and also often subtracts off predicted shapes, so for example if you're in a smooth gradient region it subtracts off that gradient.

Real H264 intra beats "AIC" pretty well. I'm not sure exactly why that is, but I have a few guesses. H264 uses integer transforms, AIC uses floating point (mainly a big deal at very high bit rates). H264 uses macroblocks and various sub-block sizes; in particular it can choose 8x8 or 4x4 sub-blocks, AIC always uses 8x8. Choosing smaller blocks in high detail areas can be a win. I think the biggest difference is probably that the H264 implementations tested do some RDO while AIC does not. I'm not sure exactly how they do RDO on the Intra blocks because each block affects the next one, but I guess they could at least sequentially optimize each block as they go with a "trellis quantizer" (see next post on this).

Okie doke. JPEG XR has similar issues but solves them in different ways. JPEG XR fundamentally uses a 4x4 transform similar to a DCT. 4x4 is too small to remove a lot of correlation, so neighboring blocks are very highly correlated. To address this, JPEG XR groups blocks into 4x4 sets, giving it a 16x16 macroblock. The DCs of the sixteen 4x4 blocks get another pass of the 4x4 transform. This is a lot like doing a wavelet transform, but getting a 4:1 reduction per level instead of 2:1. Within the 16x16 macroblock, each coefficient is predicted from its neighbor using gradient predictors similar to H264's.
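A sketch of that two-stage structure, using a floating-point orthonormal DCT in place of JPEG XR's actual integer PCT (which is only DCT-like): transform each 4x4 block, then run the same 4x4 transform over the grid of sixteen DCs.

```python
import numpy as np

def dct_mat(n):
    """Orthonormal DCT-II basis matrix (stand-in for JPEG XR's integer PCT)."""
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    M = np.cos(np.pi * (2 * j + 1) * i / (2 * n)) * np.sqrt(2.0 / n)
    M[0] /= np.sqrt(2.0)
    return M

M4 = dct_mat(4)

def two_stage(mb):
    """16x16 macroblock -> sixteen 4x4 transforms, then a second 4x4
    transform over the grid of sixteen DC coefficients."""
    assert mb.shape == (16, 16)
    coeffs = np.zeros_like(mb, dtype=float)
    dcs = np.zeros((4, 4))
    for by in range(4):
        for bx in range(4):
            c = M4 @ mb[by*4:(by+1)*4, bx*4:(bx+1)*4] @ M4.T
            coeffs[by*4:(by+1)*4, bx*4:(bx+1)*4] = c
            dcs[by, bx] = c[0, 0]
    return coeffs, M4 @ dcs @ M4.T

# On a flat macroblock, all the energy ends up in one coefficient.
mb = np.full((16, 16), 10.0)
coeffs, dc2 = two_stage(mb)
print(dc2[0, 0])  # 16 * 10 = 160.0 for an orthonormal transform
```

Which is the wavelet-like behavior described above: smooth regions collapse into the second-stage low band instead of leaving sixteen correlated DCs to code.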

In H264 the gradient predictor is chosen in the encoder and transmitted. In JPEG XR the gradient predictor is adaptively chosen by some decision that's made in the encoder & decoder. (I haven't found the exact details on this). Also in JPEG XR the delta-from-prediction is done *post* transform, while in H264 it was done pre-transform.

If you think about it, there's a whole world of possibilities here. You could do 4x4 transforms again on all the coefficients. That would be very similar to doing a 16x16 DCT (though not exactly the same - you would have to apply some twiddle factors and butterflies to make it really the same). You could do various types of deltas in pre-transform space and post-transform space. Basically you can use the previous transmitted data in any way you want to reduce what you need to send.

One way to think about all this is that we're trying to make the reconstruction look better when we send all zeros. That is, at low bit rates, we will very often have the case that the entire block of AC coefficients goes to zero. What does our output look like in that case? With plain old JPEG we will make a big solid 8x8 block. With H264 we will make some kind of gradient as chosen by the neighbor predictor mode. With JPEG XR we will get some predicted AC values untransformed, and it will also be smoothed into the neighbors by "lapping".

So, let's get into lapping. Lapping basically gives us a nicer output signal when all the ACs are zero. I wrote a bit about lapping before. That post described lapping in terms of a double-size invertible transform. That is, it's a transform that takes 2N taps -> N coefficients and back -> 2N, such that if you overlap with neighbors you get exact reconstruction. The nice thing is you can make it a smooth window that goes to zero at the edges, so that you have no hard block edge boundaries.
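The classic concrete instance of a 2N -> N lapped transform with exact overlapped reconstruction is the MDCT/MLT with the Princen-Bradley sine window. A small numpy sketch (the block size here is arbitrary):

```python
import numpy as np

N = 8                                   # 16 taps in -> 8 coefficients out
n, k = np.arange(2 * N), np.arange(N)
w = np.sin(np.pi * (n + 0.5) / (2 * N))  # smooth window, ~zero at the edges
# MDCT/MLT basis: N windowed cosines, each 2N taps long.
Phi = np.sqrt(2.0 / N) * w[:, None] * np.cos(
    np.pi / N * np.outer(n + 0.5 + N / 2, k + 0.5))

rng = np.random.default_rng(0)
x = rng.standard_normal(6 * N)
xp = np.concatenate([np.zeros(N), x, np.zeros(N)])  # pad half a block each end
y = np.zeros_like(xp)
for s in range(0, len(xp) - N, N):                  # blocks overlap by N taps
    block = xp[s:s + 2 * N]
    coeffs = Phi.T @ block                          # 2N -> N
    y[s:s + 2 * N] += Phi @ coeffs                  # N -> 2N, overlap-add
print(np.allclose(y[N:-N], x))                      # time-domain aliasing cancels
```

Each block on its own is lossy (you kept only N of 2N numbers), but the aliasing from one block exactly cancels against its neighbor's, which is the "overlap with neighbors gives exact reconstruction" property.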

Amusingly, there are a lot of different ways to construct lapped transforms. There's a huge family of them (see papers on VLGBT or some fucking acronym or other). There are lots of approaches that all give you the same thing:

1. 2N -> N windowed basis functions, as above (nobody actually uses this approach, but it's nice theoretically)

2. Pre & post filtering on the image values (time domain / spatial domain); basically the post-filter is a blur and the pre-filter is a sharpen that inverts the blur (this can be formulated with integer lifting)

3. Post-DCT filtering (aka FLT - Fast Lapped Transform); basically do the NxN DCT as usual, then swizzle the DCT coefficients into the neighboring DCTs

Post-DCT filtering can either be done on all the coefficients, or just on the first few (DC and primary AC coefficients).
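Here's a toy version of the pre/post-filter formulation (a made-up 2-tap boundary filter of my own, not the actual JPEG XR overlap operator): the post-filter blurs across each block boundary, the pre-filter is its exact inverse, and the pipeline is pre-filter -> block transform -> ... -> inverse transform -> post-filter. Applying just the post-filter to an all-AC-zero (piecewise constant) reconstruction shows how it softens the block edges.

```python
import numpy as np

B, a = 8, 0.35                             # block size; blur strength (a < 0.5)
blur = np.array([[1 - a, a], [a, 1 - a]])  # acts on the 2 pixels at a boundary
sharpen = np.linalg.inv(blur)              # pre-filter exactly inverts the blur

def boundary_filter(x, M):
    y = x.astype(float).copy()
    for b in range(B, len(y), B):          # each interior block boundary
        y[b - 1:b + 1] = M @ y[b - 1:b + 1]
    return y

x = np.arange(32, dtype=float)
roundtrip = boundary_filter(boundary_filter(x, sharpen), blur)
print(np.allclose(roundtrip, x))           # pre + post is exactly invertible

blocky = np.repeat([0.0, 10.0, 20.0, 30.0], B)  # all-AC-zero reconstruction
smoothed = boundary_filter(blocky, blur)
print(abs(smoothed[B] - smoothed[B - 1]) < abs(blocky[B] - blocky[B - 1]))
```

Note the sharpen half: on the encoder side it boosts exactly the boundary differences, which is the "makes your data harder to compress" cost discussed below.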

Lapping is good and bad. It's not entirely an awesome win. For one thing, the pre-filter that the lap does is basically a sharpen, so it actually makes your data harder to compress. That's sort of balanced by having a better reconstruction shape at any given bit rate, but not always. The fundamental reason is that lapping relies on smoothness over a larger local area: e.g. for 8x8 blocks you're doing 16-tap lapped transforms. If your signal is actually smooth over 16 taps then it's all good, but when it's not, the lapped transform needs *larger* AC coefficients to compensate than a plain blocked 8-tap DCT would.

The interesting thing to me is to open this up and consider our options. Think just about the decoder. When I get the DC coefficient at a given spot in the image, I don't need to plop down the shape of any particular transform coefficient. What I should do is use that coefficient to plop down my best guess for what the pixels here were in the original. When I get the next AC coefficient, I should use that to refine.

One way to think about this is that we could in fact create an optimized local basis. The encoder and decoder should make the same local basis based on past transmitted data only. For example, you could take all the previously sent nearby blocks in the image and run PCA on them to create the local KLT! This is obviously computationally prohibitive, but it gives us an idea of what's possible and how far off we are. Basically what this is doing is making the DC coefficient multiply a shape which is our best guess for what the block will be. Then the 1st AC coefficient multiplies our best guess for how the block might vary from that first guess, etc.
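A sketch of the local-KLT idea (with synthetic stand-in data; a real codec would train on the actual decoded neighborhood): run PCA on previously decoded blocks, use the eigenvectors as the basis, and the early coefficients then multiply the shapes that explain the most local variance.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-ins for already-decoded neighboring 8x8 blocks (flattened to 64-vectors),
# with made-up decaying per-dimension variance so there's structure to learn.
neighbors = rng.standard_normal((20, 64)) @ np.diag(np.linspace(2.0, 0.1, 64))

mean = neighbors.mean(axis=0)
cov = np.cov(neighbors - mean, rowvar=False)
evals, evecs = np.linalg.eigh(cov)
klt = evecs[:, ::-1]          # columns sorted by decreasing explained variance

# "Coefficient 0" now multiplies the best-guess shape for a block here,
# coefficient 1 the most likely deviation from that guess, and so on.
block = rng.standard_normal(64)
coeffs = klt.T @ (block - mean)
recon = klt @ coeffs + mean
print(np.allclose(recon, block))  # full basis -> exact reconstruction
```

Both sides derive `klt` from transmitted data only, so nothing about the basis needs to be signaled; the cost is recomputing an eigendecomposition per block, which is why this is a thought experiment rather than a codec.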

## 7 comments:

There's a major caveat with H264 beating JPEG2000 at intra coding in terms of PSNR: I've never found a comparison that states which color space it was done in, which is really important. H264 PSNR is usually specified in terms of PSNR on the decoded YUV signal, since the standard doesn't cover getting from YUV to RGB. J2k decoders, however, usually give you the decoded data back in RGB. The correct way to test this would be to use the same YUV source data for both H264 and J2k, and turn off any color space conversions on the output data, but unless it's explicitly mentioned it's safe to assume it wasn't done.

Instead, the easiest way to compare them is to just convert the decoded RGB data to YUV using the same process you used to get the YUV for H264 in the first place. This puts J2k at a disadvantage since its data goes through two (slightly lossy) transforms before the PSNR gets computed.

There's an even bigger problem though - still image coding normally uses YCbCr normalized so that Y is in [0,255] and Cb, Cr are in [-128,127]. Video coding, however, customarily uses the D1 standard ranges, which is Y in [16,235] and Cb, Cr in [16,240]. (D1 directly recorded NTSC signals, so it needed blacker-than-black levels for the blanking intervals and used whiter-than-white levels for sync markers and the like).
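For concreteness, the range mappings are just affine rescales; a sketch of the BT.601-style conversion (real conversions would also round and clamp):

```python
def full_to_video_luma(y):
    """Full-range luma [0,255] -> D1/video range [16,235] (219 steps)."""
    return 16.0 + y * (219.0 / 255.0)

def full_to_video_chroma(c):
    """Full-range chroma [-128,127] -> video range ~[16,240], centered at 128."""
    return 128.0 + c * (224.0 / 255.0)

print(full_to_video_luma(0), full_to_video_luma(255))  # 16.0 235.0
```

The squeeze from 255 levels to 219 (luma) and 224 (chroma) is exactly the dynamic-range reduction at issue in the PSNR comparison below.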

In short, if you don't explicitly make sure that J2k and H264 work on exactly the same YUV input data and are compared based on the YUV outputs they produce, H264 is automatically at an advantage because it gets values with a slightly lower dynamic range; and unless you compare in RGB, J2k doesn't benefit from its extra precision in the PSNR calculation.

That said, aside from the other issues with AIC that you mentioned, it also leaves out the deblocking filter and it fixes the quantizer once for the whole image (H264 can adapt it over the course of the image if beneficial).

"There's a major caveat with H264 beating JPEG2000 at intra coding in terms of PSNR: I've never found a comparision that states which color space it was done in, which is really important."

Yeah you make an excellent point; I swore long ago to never trust the PSNR numbers in papers because they mess things up so often.

You see so much stuff that's just ridiculous, like comparing JPEG with the perceptual quantizer matrix using a plain RMSE/PSNR metric vs. something like JPEG-2000 without perceptual quantizers.

I was assuming they compared errors in RGB, but even if they did that has its own problems.

"(H264 can adapt it over the course of the image if beneficial)."

Do you know how that works? I haven't found any papers on it. Does it signal and actually send a new Qp with a macroblock?

Yep. Whenever a macroblock with a nonzero number of residual coefficients is sent, the encoder also writes "mb_qp_delta", which is the difference between the old and new qp value, encoded as a signed Exp-Golomb code (same encoding as used for motion vectors).

The primary purpose of this is definitely rate control, but since the qp->quantization error curve is anything but monotonic, this is something you'd want to consider during RD optimization as well.

In the long long ago I made graphs like this:

http://www.cbloom.com/src/lena_25_err.gif

demonstrating just how extremely nonlinear the RD variation with quantizer can be.

One more note on lapping: while lapping solves blocking artifacts "on paper", it falls a bit short in practice. What lapping boils down to is replacing your DCT basis functions with wider basis functions that smoothly decay towards zero some distance from the block. This is fine for the AC bands, but for DC, this means you still have per-block averages (now with smoother transitions between them). For medium-to-high quantization settings, the difference between DCs of adjacent blocks in smooth regions (blue or cloudy skies for example), can still be significant, so now you have slightly blurred visible block boundaries instead of hard block boundaries. Definitely an improvement, but still very visible.

The background of the house.bmp sample in this comparison is a good example. Also, in general, the PSNR results of HD Photo/JPEG XR are very mediocre in that comparison and others. Certainly not "very close to JPEG2000" as advertised. This SSIM-based evaluation (with the usual caveats) is outright dismal, with HD Photo lagging far behind even ordinary JPEG in quality, and also being considerably worse than JPEG-LS in the lossless benchmarks.

Yeah, I agree completely.

I did a lot of experiments with lapping. In some cases it's a big win, but usually not. It does create a very obvious sort of "bilinear filtered" look. Of course it should, it's still just taking the LL band and upsampling it in a very simple local way.

Lapping also actually *hurts* compression at most bit rates because it munges together parts of the image that aren't smooth.

The obvious solution is adaptive bases. You would like adaptive bases that :

1. detect high noise areas and reduce their size to localize more

2. detect smooth areas and increase their size so it's a broader smooth up-filter for the LL.

Of course the best known way to do this is just wavelets!
