With an 8x8 block we're at a big disadvantage. An 8x8 block is like a 3 level wavelet (8 = 2^3). That's not much; wavelet coders normally rely on a 5 or 6 level transform, which would correspond to a 32x32 block or better. Large block transforms like that are bad because they're computationally complex, but also because they are visually bad. Large blocks create worse blocking artifacts, and they also increase ringing, because they make the high frequency shapes very non-local.
Basically by only doing 8x8 we are leaving a lot of redundancy between neighboring blocks. There's moderate correlation within a block, but also strong correlation across blocks for coefficients of the same type.
H264 Intra frame coding is actually really excellent; it outperforms JPEG-2000 for example. There are a few papers on this idea of using H264 intra coding just for still images, and a project called AIC. (AIC performs worse than real H264 for a few reasons I'll get into later).
"AIC" basically just does 8x8 block DCTs, but it does this interesting thing of pre-predicting the block before the transform. It works on blocks in scan order, and for each block, before it does the DCT, it creates a prediction from the already transmitted neighbors and subtracts that off. This is a nice page with the details. What this accomplishes is to greatly reduce correlation between blocks. It subtracts off predicted DC so the DC is usually small, and also often subtracts off predicted shapes, so for example if you're in a smooth gradient region it subtracts off that gradient.
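To make the predict-then-transform idea concrete, here is a minimal sketch. It is not AIC's or H264's actual predictor set: it only implements a flat "DC" prediction (mean of the pixels just above and to the left of the block), and the names `predict_dc` and `residual_coeffs` are made up for illustration. It also predicts from the original pixels, where a real codec would have to predict from the decoded neighbors.

```python
import math

def dct_1d(v):
    """Orthonormal DCT-II of a list of floats."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct_2d(block):
    """Separable 2D DCT: transform the rows, then the columns."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def predict_dc(img, y0, x0, n):
    """Hypothetical flat predictor: mean of the pixels just above and left of the block."""
    neighbors = []
    if y0 > 0:
        neighbors += img[y0 - 1][x0:x0 + n]
    if x0 > 0:
        neighbors += [img[y][x0 - 1] for y in range(y0, y0 + n)]
    mean = sum(neighbors) / len(neighbors) if neighbors else 128.0
    return [[mean] * n for _ in range(n)]

def residual_coeffs(img, y0, x0, n=8):
    """Subtract the prediction from the block, then DCT what is left."""
    pred = predict_dc(img, y0, x0, n)
    resid = [[img[y0 + y][x0 + x] - pred[y][x] for x in range(n)] for y in range(n)]
    return dct_2d(resid)

# Smooth horizontal ramp: each pixel is 100 + x.
img = [[100.0 + x for x in range(16)] for _ in range(16)]
plain = dct_2d([row[8:16] for row in img[0:8]])   # no prediction
resid = residual_coeffs(img, 0, 8)                # with prediction
print(plain[0][0], resid[0][0])
```

On this ramp the DC of the second block drops from roughly 892 to roughly 36 once the neighbor mean is subtracted. A directional predictor would also remove most of the remaining gradient, which is the "subtracts off predicted shapes" part.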
Real H264 intra beats "AIC" pretty well. I'm not sure exactly why that is, but I have a few guesses. H264 uses integer transforms, AIC uses floating point (mainly a big deal at very high bit rates). H264 uses macroblocks and various sub-block sizes; in particular it can choose 8x8 or 4x4 sub-blocks, AIC always uses 8x8. Choosing smaller blocks in high detail areas can be a win. I think the biggest difference is probably that the H264 implementations tested do some RDO while AIC does not. I'm not sure exactly how they do RDO on the Intra blocks because each block affects the next one, but I guess they could at least sequentially optimize each block as they go with a "trellis quantizer" (see next post on this).
Okie doke. JPEG XR has similar issues but solves them in different ways. JPEG XR fundamentally uses a 4x4 transform similar to a DCT. 4x4 is too small to remove a lot of correlation, so neighboring blocks are very highly correlated. To address this, JPEG XR groups 4x4 groups of blocks together, so it has a 16x16 macroblock. The DC's of the sixteen 4x4 blocks get another pass of the 4x4 transform. This is a lot like doing a wavelet transform but getting 4:1 reduction instead of 2:1. Within the 16x16 macroblock, each coefficient is predicted from its neighbor using gradient predictors similar to H264's.
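The two-level structure can be sketched like this. A plain floating point 4x4 DCT stands in for JPEG XR's actual integer transform (that substitution is an assumption, as is the function name `two_level_transform`); only the "transform the blocks, then transform their DC's again" layout is the point.

```python
import math

def dct_1d(v):
    """Orthonormal DCT-II of a list of floats."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct_2d(block):
    """Separable 2D DCT: transform the rows, then the columns."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def two_level_transform(mb):
    """mb is a 16x16 macroblock (list of 16 rows).

    Returns (second, first): `first` is a 4x4 grid of 4x4 coefficient blocks,
    `second` is the 4x4 transform of the sixteen first-level DC's.
    """
    first = [[dct_2d([row[bx * 4:bx * 4 + 4] for row in mb[by * 4:by * 4 + 4]])
              for bx in range(4)] for by in range(4)]
    dcs = [[first[by][bx][0][0] for bx in range(4)] for by in range(4)]
    second = dct_2d(dcs)
    return second, first

# A smooth diagonal ramp: neighboring 4x4 blocks have strongly related DC's.
ramp = [[float(x + y) for x in range(16)] for y in range(16)]
second, first = two_level_transform(ramp)
for row in second:
    print([round(c, 3) for c in row])
```

For the ramp, the second pass leaves energy only in the first row and first column of `second`; the sixteen correlated DC's collapse to a handful of values, which is the 4:1 wavelet-like step described above.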
In H264 the gradient predictor is chosen in the encoder and transmitted. In JPEG XR the gradient predictor is adaptively chosen by some decision that's made in the encoder & decoder. (I haven't found the exact details on this). Also in JPEG XR the delta-from-prediction is done *post* transform, while in H264 it was done pre-transform.
If you think about it, there's a whole world of possibilities here. You could do 4x4 transforms again on all the coefficients. That would be very similar to doing a 16x16 DCT (though not exactly the same - you would have to apply some twiddle factors and butterflies to make it really the same). You could do various types of deltas in pre-transform space and post-transform space. Basically you can use the previous transmitted data in any way you want to reduce what you need to send.
One way to think about all this is that we're trying to make the reconstruction look better when we send all zeros. That is, at low bit rates, we will very often have the case that the entire block of AC coefficients goes to zero. What does our output look like in that case? With plain old JPEG we will make a big solid 8x8 block. With H264 we will make some kind of gradient as chosen by the neighbor predictor mode. With JPEG XR we will get some predicted AC values run back through the inverse transform, and it will also be smoothed into the neighbors by "lapping".
So, let's get into lapping. Lapping basically gives us a nicer output signal when all the AC's are zero. I wrote a bit about lapping before. That post described lapping in terms of being a double-size invertible transform. That is, it's a transform that takes 2N taps -> N coefficients and back -> 2N, such that if you overlap with neighbors you get exact reconstruction. The nice thing is you can make it a smooth window that goes to zero at the edges, so that you have no hard block edge boundaries.
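One well-known member of this 2N -> N -> 2N family is the MDCT with a sine window, so here is a sketch of that to show the overlap trick actually works. It is an illustration of the property, not necessarily the transform from the earlier post, and the helper name `lapped_roundtrip` is made up. Each frame on its own does not reconstruct its input; only the overlap-add of neighboring frames does.

```python
import math

def window(n2):
    """Sine window over 2N taps; satisfies w[i]^2 + w[i+N]^2 = 1."""
    return [math.sin(math.pi * (i + 0.5) / n2) for i in range(n2)]

def mdct(frame):
    """2N windowed taps -> N coefficients."""
    n2 = len(frame)
    n = n2 // 2
    w = window(n2)
    return [sum(w[i] * frame[i] * math.cos(math.pi / n * (i + 0.5 + n / 2) * (k + 0.5))
                for i in range(n2))
            for k in range(n)]

def imdct(coeffs):
    """N coefficients -> 2N windowed taps (still aliased until overlap-added)."""
    n = len(coeffs)
    n2 = 2 * n
    w = window(n2)
    return [2.0 / n * w[i] * sum(coeffs[k] * math.cos(math.pi / n * (i + 0.5 + n / 2) * (k + 0.5))
                                 for k in range(n))
            for i in range(n2)]

def lapped_roundtrip(signal, n):
    """Transform overlapping 2N frames with hop N, invert, and overlap-add.

    Assumes len(signal) is a multiple of n. Pads with n zeros on each side so
    every real sample is covered by exactly two frames.
    """
    padded = [0.0] * n + list(signal) + [0.0] * n
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - n, n):
        rec = imdct(mdct(padded[start:start + 2 * n]))
        for i, v in enumerate(rec):
            out[start + i] += v
    return out[n:-n]

signal = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
rec = lapped_roundtrip(signal, 4)
print(max(abs(a - b) for a, b in zip(signal, rec)))  # floating point noise only
```

Note the coefficient count: 8 samples in two blocks of N = 4 produce 3 frames of 4 coefficients here only because of the zero padding at the ends; in the interior it is N coefficients per N samples, so the lapping costs nothing in data rate.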
Amusingly there are a lot of different ways to construct lapped transforms. There is a huge family of them (see papers on VLGBT or some fucking acronym or other). There are lots of approaches that all give you the same thing:
1. 2N -> N windowed basis functions as above (nobody actually uses this approach but it's nice theoretically).
2. Pre & post filtering on the image values (time domain or spatial domain). Basically the post-filter is a blur and the pre-filter is a sharpen that inverts the blur (this can be formulated with integer lifting).
3. Post-DCT filtering (aka FLT - Fast Lapped Transform). Basically do the NxN DCT as usual then swizzle the DCT coefficients into the neighboring DCT's.
Post-DCT filtering can either be done on all the coefficients, or just on the first few (DC and primary AC coefficients).
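The pre/post filtering approach is the easiest one to play with. The following is a toy version, assuming a single scalar `gain` on the differences across each block boundary; real lapped transform designs use a small matrix there and integer lifting steps, neither of which is shown. The function names are made up. It does show the key property: the pre-filter exaggerates differences across the block edge (a sharpen), and the post-filter with the reciprocal gain undoes it exactly (a blur).

```python
def prefilter(x, n, gain=1.5):
    """Toy pre-filter for block size n: for each interior block boundary,
    pair up samples mirrored about the boundary and scale their difference.
    gain > 1 sharpens across the edge; the sum of each pair is left alone."""
    y = list(x)
    half = n // 2
    for edge in range(n, len(x), n):
        for k in range(half):
            i, j = edge - 1 - k, edge + k      # mirrored pair straddling the edge
            s = (y[i] + y[j]) / 2.0            # average, unchanged
            d = (y[i] - y[j]) / 2.0 * gain     # half-difference, scaled
            y[i], y[j] = s + d, s - d
    return y

def postfilter(y, n, gain=1.5):
    """Inverse of prefilter: same structure with the reciprocal gain (a blur)."""
    return prefilter(y, n, 1.0 / gain)

# A step that lands exactly on a block boundary.
x = [10.0] * 8 + [20.0] * 8
pre = prefilter(x, 8)
back = postfilter(pre, 8)
print(pre)
print(back)
```

The block transform would go between the two calls: pre-filter, blocked DCT, quantize, inverse DCT, post-filter. With coarse quantization the post-filter blur is what smooths the decoded block edges into each other.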
Lapping is good and bad. It's not entirely an awesome win. For one thing, the pre-filter that the lap does is basically a sharpen, so it actually makes your data harder to compress. That's sort of balanced by having a better reconstruction shape for any given bit rate, but not always. The fundamental reason for this is that lapping relies on larger local smoothness. eg. for 8x8 blocks you're doing 16-tap lapped transforms. If your signal is actually smooth over 16 taps then it's all good, but when it's not, the lapped transform needs *larger* AC coefficients to compensate than a plain blocked 8-tap DCT would.
The interesting thing to me is to open this up and consider our options. Think just about the decoder. When I get the DC coefficient at a given spot in the image - I don't need to plop down the shape of any certain transform coefficient. What I should do is use that coefficient to plop down my best guess of what the pixels here were in the original. When I get the next AC coefficient, I should use that to refine.
One way to think about this is that we could in fact create an optimized local basis. The encoder and decoder should make the same local basis based on past transmitted data only. For example, you could take all the previously sent nearby blocks in the image and run PCA on them to create the local KLT! This is obviously computationally prohibitive, but it gives us an idea of what's possible and how far off we are. Basically what this is doing is making the DC coefficient multiply a shape which is our best guess for what the block will be. Then the 1st AC coefficient multiplies our best guess for how the block might vary from that first guess, etc.
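Here is a small sketch of the first step of that idea, under some loud assumptions: blocks are flattened to short 1D vectors, the basis comes from the uncentered second-moment matrix of the neighbor blocks (so the first vector is the "best guess shape" rather than a deviation from a mean), and only the leading vector is found, by plain power iteration. The function names are made up, and a real version would need the full basis and would have to run identically in encoder and decoder.

```python
import math

def second_moment(blocks):
    """Sum of outer products b * b^T over the previously decoded blocks."""
    d = len(blocks[0])
    m = [[0.0] * d for _ in range(d)]
    for b in blocks:
        for i in range(d):
            for j in range(d):
                m[i][j] += b[i] * b[j]
    return m

def leading_vector(m, iters=100):
    """Dominant eigenvector by power iteration, starting from a flat vector.
    (This would stall if the start vector were orthogonal to the answer.)"""
    d = len(m)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Previously sent neighbor blocks: all scaled copies of the same ramp shape.
neighbors = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [0.5, 1.0, 1.5, 2.0]]
shape = leading_vector(second_moment(neighbors))

# The new block is coded with ONE coefficient against the learned shape.
new_block = [3.0, 6.0, 9.0, 12.0]
coeff = dot(new_block, shape)
approx = [coeff * s for s in shape]
print(coeff, approx)
```

In this contrived case the single "DC" coefficient reproduces the whole ramp, where a standard DCT's flat DC shape would only give the mean and leave the gradient to the AC's.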