It reminded me of some things I've been talking about for a long time, but I couldn't find where I'd written it up on my blog. Anyway, internally at RAD I've been saying that we should be sending AC coefficients with more precision given to the *sum* of the values, and less given to the *distribution* of the values.
ADD : some searching finds these which are relevant -
cbloom rants 08-25-09 - Oodle Image Compression Looking Back
cbloom rants 08-27-09 - Oodle Image Compression Looking Back Pictures
cbloom rants 03-10-10 - Distortion Measure
cbloom rants 10-30-10 - Detail Preservation in Images
I've never actually pursued this, so I don't have the answers. But it is something that's been nibbling at my brain for a long time, and I still think it's a good idea, so that tells me there's something there.
I'll scribble a few notes -
1. I've long believed that blocks should be categorized into "smooth", "detail" and "edge". For most of this discussion we're going to ignore smooth and edge and just talk about detail. That is, blocks that are not entirely smooth, and don't have a dominant edge through them (perhaps because that edge was predicted).
2. The most important thing in detail blocks is preserving the amount of energy in the various frequency subbands. This is something that I've talked about before in terms of perceptual metrics. In a standard DCT you make categories something like :

    DC  A   A   B   B   C   C   D
    A   A   B   B   C   C   D   D
    A   B   B   C   C   D   D   E
    B   B   C   C   D   D   E   E
    B   C   C   D   D   E   E   E
    C   C   D   D   E   E   E   E
    C   D   D   E   E   E   E   E
    D   D   E   E   E   E   E   E

and what's most important is preserving the sum in each category. (that chart was pulled out of my ass but you get the idea - the letters are just frequency bands, with E the highest-frequency corner).
The sums should be preserved in a kind of wavelet subband quadtree type of way. Like preserve the sum of each of the 4x4 blocks; then go only to the upper-left 4x4 and divide it into 2x2's and preserve those sums, and then go to only the upper-left 2x2, etc.
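Something like this as a sketch of the distortion measure (all the details here - the grouping, the use of squared coefficients, the function names - are made up, it's just to show the shape of the thing) :

    // compare original and reconstructed 8x8 DCT blocks on their subband sums
    // rather than coefficient-by-coefficient
    #include <cmath>

    static double band_energy(const double b[8][8], int y0, int x0, int n)
    {
        double s = 0;
        for (int y = y0; y < y0 + n; y++)
            for (int x = x0; x < x0 + n; x++)
                s += b[y][x] * b[y][x];
        return s;
    }

    double subband_sum_distortion(const double orig[8][8], const double recon[8][8])
    {
        double d = 0;

        // the four 4x4 quadrants :
        for (int q = 0; q < 4; q++)
        {
            int y0 = (q / 2) * 4, x0 = (q % 2) * 4;
            d += fabs(band_energy(orig, y0, x0, 4) - band_energy(recon, y0, x0, 4));
        }

        // then the four 2x2's of the upper-left 4x4 :
        for (int q = 0; q < 4; q++)
        {
            int y0 = (q / 2) * 2, x0 = (q % 2) * 2;
            d += fabs(band_energy(orig, y0, x0, 2) - band_energy(recon, y0, x0, 2));
        }

        // then the individual coefficients of the upper-left 2x2, skipping DC
        // (the quadrants above that contain DC should really exclude it too) :
        d += fabs(band_energy(orig, 0, 1, 1) - band_energy(recon, 0, 1, 1));
        d += fabs(band_energy(orig, 1, 0, 1) - band_energy(recon, 1, 0, 1));
        d += fabs(band_energy(orig, 1, 1, 1) - band_energy(recon, 1, 1, 1));

        return d;
    }

An encoder could minimize that kind of thing instead of plain coefficient-wise L2.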
3. You can take a standard type of codec and optimize the encoding towards this type of perceptual metric, and that helps a bit, but it's the wrong way to go. Because you're still spending bits to exactly specify the noise in the high frequency area. (doing the RD optimization just lets you choose the cheapest way to specify that noise).
What you really want is a perceptual quantizer that fundamentally gives up information in the right way as you reduce bitrate. At low bits you just want to say "hey there's some noise" and not spend bits specifying the details of it.
The normal scalar quantizers that we use are just not right. As you remove bits, they kill energy, which looks bad. It looks better for that energy to be there, but wrong.
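Toy example of the energy-killing, with made-up numbers :

    // a deadzone-ish scalar quantizer wipes out the energy of a group of
    // small high-frequency coefficients entirely
    #include <cstdio>

    int main()
    {
        const int ac[6] = { 3, -2, 2, -3, 1, 2 };   // small high-freq coefficients
        const int Q = 8;                            // scalar quantizer step
        int orig_energy = 0, recon_energy = 0;
        for (int i = 0; i < 6; i++)
        {
            int q = ac[i] / Q;          // all of these quantize to 0
            int r = q * Q;              // reconstruct
            orig_energy  += ac[i] * ac[i];
            recon_energy += r * r;
        }
        printf("original energy %d , reconstructed energy %d\n", orig_energy, recon_energy);
        // prints : original energy 31 , reconstructed energy 0
        return 0;
    }

That patch of the block goes dead flat in the reconstruction, which reads as smearing.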
3 1/2. The normal zig-zag coding schemes we use are really bad.
In order to specify any energy way out in the highest-frequency region (E in the chart above) you have to send a ton of zeros to get there. This makes it prohibitively costly in bits.
One of the first things that you notice when implementing an R/D optimized coder with TQ is that it starts killing all your high frequency detail. This is because under any kind of normal D norm, with zig-zag-like schemes, the R to send those values is just not worth it.
But perceptually that's all wrong. It makes you over-smooth images.
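Here's roughly the decision the R/D optimizer is making for a lone high-frequency coefficient (all the numbers here are made up, it's just the shape of the tradeoff) :

    // toy RD decision : keep a small high-frequency coefficient or zero it
    #include <cstdio>

    int main()
    {
        double c = 6.0;          // small high-frequency coefficient
        double lambda = 20.0;    // distortion-per-bit tradeoff
        double R_keep = 14.0;    // bits to code the run of zeros + the value
        double D_keep = 2.0*2.0; // residual L2 error if we keep it
        double D_drop = c*c;     // L2 error if we just zero it

        double J_keep = D_keep + lambda * R_keep;
        double J_drop = D_drop;  // dropping costs no bits
        printf("J_keep = %.1f , J_drop = %.1f -> %s\n", J_keep, J_drop,
               J_keep < J_drop ? "keep it" : "kill it");
        return 0;
    }

With zigzag-style run costs and any kind of L2-ish D norm, "kill it" wins almost every time out in the high frequencies, which is exactly the over-smoothing.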
3 3/4. Imagine that the guy allocating bits is standing on the shore. The shore is the DC coefficient at 00. He's looking out to sea. Way out at the horizon are the super-high-frequency coefficients in the bottom right of the DCT. In the foreground (that's AC10 and AC01) he can see individual waves and where rocks are. Way out at sea he shouldn't be spending bandwidth trying to describe individual waves, but he should still be saying things like "there are a lot of big waves out there" or "there's a big swell but no breakers". Yeah.
4. What you really want is a joint quantizer of summed energy and the distribution of that energy. At max bit rate you send all the coefficients exactly. As you reduce bitrate, the sum is preserved pretty well, but the distribution of the lower-right (highest frequency) coefficients becomes lossy. As you reduce bit rate more, the total sum is still pretty good and the overall distribution of energy is mostly right, but you get more loss in where the energy is going in the lower frequency subbands, and you also get more scalar quantization of the lower frequency subbands, etc.
Like by the time that AC11 has a scalar quantizer of 8, the highest frequency lower-right area has its total energy sent with a quantizer of 32, and zero bits are sent for the location of that energy.
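A sketch of what that might look like on the encode side (the scheme, the grouping, and the constants are all made up) :

    // low frequencies get ordinary scalar quantization; the high-frequency
    // remainder of the block gets only its total, coarsely quantized, and
    // zero bits for where that total lives
    #include <cmath>

    struct CodedBlock
    {
        int low[4][4];      // scalar-quantized low-frequency AC's (upper-left 4x4)
        int hi_energy_q;    // quantized total of everything else
    };

    void encode_block(const double blk[8][8], CodedBlock * out)
    {
        const int Qlow = 8;     // scalar quantizer for the low-frequency coefficients
        const int Qhi  = 32;    // much coarser quantizer for the high-frequency total

        double hi_total = 0;
        for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
        {
            if (y < 4 && x < 4)
            {
                if (y == 0 && x == 0) { out->low[0][0] = 0; continue; }  // DC handled elsewhere
                out->low[y][x] = (int)floor(blk[y][x] / Qlow + 0.5);
            }
            else
            {
                hi_total += fabs(blk[y][x]);   // using the abs-sum as the "energy" here
            }
        }
        out->hi_energy_q = (int)floor(hi_total / Qhi + 0.5);
        // note : no bits at all are spent on *where* the high-frequency energy is
    }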
5. When the energy is unspecified, you'd like to restore in some nice way. That is, don't just restore to the same quantization vector every time ("center" of the quantization bucket), since that could create patterns. I dunno. Maybe restore with some randomness; restore based on prediction from the neighborhood; restore to maximum likelihood? (ML based on neighborhood/prediction/image not just a global ML)
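eg. something like this on the decode side, picking up the high-frequency total from the sketch above (again made up; the pseudo-random scatter is just one of the options, and it only matches the total on average) :

    // scatter the unspecified high-frequency energy pseudo-randomly, seeded
    // off the block position so neighboring blocks don't repeat a pattern
    #include <cstdlib>

    void restore_hi(double blk[8][8], double hi_total, int block_x, int block_y)
    {
        srand((unsigned)(block_x * 7919 + block_y * 104729));  // arbitrary primes

        const int nslots = 64 - 16;             // everything outside the upper-left 4x4
        double per_slot = hi_total / nslots;    // average abs value per slot

        for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
        {
            if (y < 4 && x < 4) continue;
            double sign = (rand() & 1) ? 1.0 : -1.0;
            double mag  = per_slot * (0.5 + (rand() % 100) / 100.0);  // jitter 0.5x .. 1.5x
            blk[y][x] = sign * mag;
        }
        // better : restore from a prediction off the neighborhood, or to a
        // maximum-likelihood pattern given the local context
    }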
6. An idea I've tossed around for a while is a quadtree/wavelet-like coding scheme. Take the 8x8 block of coefficients (and as always exclude DC in some way). Send the sum of the whole thing. Divide into four children. So now you have to send a (lossy) distribution of that sum onto the 4 child slots. Go to the upper left (LL band) and do it again, etc.
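A sketch of that walk (the actual coding of each split is left as a placeholder; the abs-sum and the printf's are just stand-ins) :

    // send the total of the block, then at each level the (lossy) split of
    // that total onto the 4 children, recursing only into the upper-left
    // (lowest-frequency) child, wavelet LL style
    #include <cstdio>
    #include <cmath>

    static double region_sum(const double blk[8][8], int y0, int x0, int size)
    {
        double s = 0;
        for (int y = y0; y < y0 + size; y++)
            for (int x = x0; x < x0 + size; x++)
                s += fabs(blk[y][x]);
        return s;
    }

    void send_quadtree(const double blk[8][8], int y0, int x0, int size)
    {
        if (size == 1)
            return;  // single coefficient; its magnitude was implied by the last split

        int half = size / 2;
        double child[4];
        child[0] = region_sum(blk, y0,        x0,        half);  // lowest frequencies
        child[1] = region_sum(blk, y0,        x0 + half, half);
        child[2] = region_sum(blk, y0 + half, x0,        half);
        child[3] = region_sum(blk, y0 + half, x0 + half, half);  // highest frequencies

        // placeholder : code the lossy distribution of the parent sum onto these 4 slots
        printf("split of %dx%d at (%d,%d) : %.1f %.1f %.1f %.1f\n",
               size, size, y0, x0, child[0], child[1], child[2], child[3]);

        send_quadtree(blk, y0, x0, half);   // refine only the upper-left child
    }

    // top level : send region_sum(blk,0,0,8) first (excluding DC somehow, as
    // always), then call send_quadtree(blk, 0, 0, 8)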
7. The more energy you have, the less important its exact distribution, due to masking. As you have more energy to distribute, the number of vectors you need goes up a lot, but the loss you can tolerate also goes up. In terms of the bits to send a block, it should still increase as a function of the energy level of that block, but it should increase less quickly than naive log2(distributions) would indicate.
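For scale, the naive exact cost grows something like this (toy computation : integer energy units spread over 4 slots; obviously real coefficients aren't like this) :

    // log2 of the number of ways to distribute E units over 'slots' slots,
    // ie. log2( C(E + slots - 1, slots - 1) )
    #include <cstdio>
    #include <cmath>

    double log2_compositions(int E, int slots)
    {
        double lg = 0;
        for (int i = 1; i < slots; i++)
            lg += log2((double)(E + i) / i);
        return lg;
    }

    int main()
    {
        for (int E = 1; E <= 256; E *= 4)
            printf("E = %3d : exact distribution needs %.1f bits over 4 slots\n",
                   E, log2_compositions(E, 4));
        return 0;
    }

The claim in 7 is that the bits you actually spend should grow slower than that curve, because at high energy the masking lets you merge lots of those distributions into one codebook entry.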
8. Not all AC's are equally likely or equally perceptually important. Specifically the vector codebook should contain more entries that preserve values in the upper-left (low frequency) area.
9. The interaction with prediction is ugly. (eg. I don't know how to do it right). The nature of AC values after mocomp or intra-prediction is not the same as the nature of AC's after just transform (as in JPEG). Specifically, ideas like variance masking and energy preservation apply to the transformed AC values, *not* to the deltas that you typically see in video coding.
10. You want to send the information about the AC in a useful order. That is, the things you send first should be very strong classifiers of the entropy of that block for coding purposes, and of the masking properties for quantization purposes.
For example, packJPG first sends the (quantized) location of the last non-zero coefficient in Z-scan order. This turns out to be the best classifier of blocks for normal JPEG, so that is used as the primary bit of context for coding further information about the block.
You don't want sending the "category" or "masking" information to be separate side-band data. It should just be the first part of sending the coefficients. So your category is maybe something like the bit-vector of which coefficient groups have any non-zero coefficients. Something like that, which is not redundant with sending them - it's just the first gross bit of information about their distribution.
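eg. something like this (the grouping into 4x4 quadrants is made up; use whatever grouping your coder uses) :

    // the "category" is just a bit-vector of which coefficient groups are
    // non-empty, derived from the coefficients rather than sent as side data
    #include <cstdint>

    uint32_t block_category(const int coeff[8][8])
    {
        uint32_t bits = 0;
        for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
        {
            if (y == 0 && x == 0) continue;   // DC not part of this
            if (coeff[y][x] == 0) continue;
            int gy = y / 4, gx = x / 4;       // which 4x4 quadrant
            bits |= 1u << (gy * 2 + gx);      // 4-bit category
        }
        return bits;
    }

You'd send those few bits first, then code the coefficients with them as context; groups whose bit is off need nothing further at all.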
I found this in email; I think there are some more somewhere ...
(there were some good responses to that mail as well...)
(If someone wants to give me a professorship and a few grad students so that I can solve these problems, just let me know anytime...)