12-03-14 - Lossy AC Location Quantization

Okay, I want to ramble a bit about the idea of lossy AC location quantization.

This is the idea that as you are quantizing the AC's, throwing out some information, one of the things you should be throwing out is the *location* of AC signals, not just their energy levels (as you do with a scalar quantizer).

Now, Daala's PVQ scheme (P = Predictive or Pyramid depending on context, confusingly) does this a bit. They send the L2 norm ("gain") of each subband (their "subbands" being chunks of similar AC coefficients), and they send a Pyramid VQ unit vector from a codebook describing the distribution of that gain within the subband. When they do the VQ index selection, that can wind up decreasing some AC values and increasing others, because you rotated a bit in the N-dimensional space. Unlike normal AC scalar quantization, this is energy preserving because it's a rotation - when one value goes down, others go up.
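To make the shape part concrete, here's a minimal sketch of PVQ shape quantization - not Daala's actual code, just the standard greedy pulse placement. K is the number of unit pulses; the decoder normalizes y to unit length and scales by the separately coded gain.

#include <math.h>

// Sketch of PVQ shape quantization (not Daala's actual code):
// find an integer vector y with sum(|y[i]|) == K pointing in nearly
// the same direction as x, by placing K unit pulses greedily.
void pvq_quantize_shape(const float *x, int n, int K, int *y)
{
    float xy = 0, yy = 0;   // running dot(|x|,y) and dot(y,y)
    for (int i = 0; i < n; i++) y[i] = 0;

    for (int k = 0; k < K; k++) {
        int best = 0;
        float best_score = -1;
        for (int i = 0; i < n; i++) {
            // adding a pulse at slot i changes the correlation to
            //   (xy + |x[i]|) / sqrt(yy + 2*y[i] + 1)
            // ; compare it squared to avoid the sqrt
            float nxy = xy + fabsf(x[i]);
            float score = nxy * nxy / (yy + 2*y[i] + 1);
            if (score > best_score) { best_score = score; best = i; }
        }
        xy += fabsf(x[best]);
        yy += 2*y[best] + 1;
        y[best]++;
    }

    // give each slot the sign it has in x
    for (int i = 0; i < n; i++)
        if (x[i] < 0) y[i] = -y[i];
}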

But Daala is not really super lossy on the location part of the AC's in the way my imaginary coder is. In particular the step from sending no AC's to sending one (at K=1) is a very large bit rate step, and there's no way to send one AC in a somewhat-specified location. The step from K=1 to K=2 is also large.

To be clear, this idea is all about low bit rates. At high bit rates, K is high, this all doesn't matter very much.

When I was doing video, I was looking at low bit rates. Well, typical internet video bit rates. It's around 4k bytes per frame. For 720p that's around 2.4 bits per 8x8 block. For 480p it's 7.3 bits per 8x8 block. The bit rates for video are low!

At those low bit rates, when you get to send any AC's at all, it's usually just AC10 and/or AC01. If a block has energy out in the higher AC's, an R/D coder usually just won't send them at all: they cost too many bits (you have to send a long zigzag-scan run length) and they don't help D enough. (of course I'm talking about mocomp residuals here)

But perceptually this is bad. As noted previously, it makes it too big of a step in R to get to the next D option. There needs to be something in between.

That is, you start by sending just { AC10 , end-of-block } ; then your next step is { AC10 , AC45 , eob } - that's like a 10 bit step or something. You need something in between that's a few bits and provides some D benefit, like { AC10 , AC-somewhere }. That is, the location of that AC needs to be lossy.

Now, none of the old coders do this because under RMSE it doesn't make any sense. Taking some AC signal energy and putting it on a different basis function is just a big WTF in terms of classic transform theory.

But perceptually it is right.

Rather than pulling something out of my ass, this time I'll post the actual DCT categories that I found to be optimal in my perceptual metric studies -


// category of each coefficient in the 8x8 DCT block, in the same
// raster order as the block; entry [0] is the DC, alone in category 0
const int c_dctCategory[64] = 
{
   0, 1, 1, 3, 3, 3, 3, 6,
   2, 5, 5, 5, 6, 6, 6, 6,
   2, 5, 5, 6, 6, 6, 6, 6,
   4, 5, 7, 5, 6, 6, 6, 6,
   4, 7, 7, 7, 8, 8, 8, 8,
   4, 7, 7, 7, 8, 8, 8, 8,
   4, 7, 7, 7, 8, 8, 8, 8,
   7, 7, 7, 7, 8, 8, 8, 8
};

(there's also inter-band masking; categories 1+2 mask 3-5, and then categories 1-5 mask 6-8).

If you can't get the AC's all right, the next best thing is to get the sums within the categories right.
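Concretely, the category sums are just something like this sketch, with coeffs[] being the 8x8 DCT block in the same order as the table above. I'm using L1 sums of AC magnitude purely for illustration; which norm to use as the "energy" is an open question, as discussed below.

#include <stdlib.h>

// Sketch: per-category sums of AC magnitude for one 8x8 block.
// L1 is used here just for illustration.
void category_sums(const int *coeffs, int sums[9])
{
    for (int c = 0; c < 9; c++) sums[c] = 0;
    for (int i = 1; i < 64; i++)        // skip DC at i == 0
        sums[c_dctCategory[i]] += abs(coeffs[i]);
}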

The categories sort of grossly tell you what kind of detail is in the block perceptually: is it high frequency, is it vertical detail or horizontal detail. It's what the eye picks up most of all in low bit rate video. Hey, there's a big chunky edge here where it should have been smooth, or hey, this textured water is smooth in one block. The most important thing is for there to be *some* texture in the water, not for it to be the right texture.

Point being, there is some benefit to filling in the gaps in the R-D space. Between sending no AC's at all, and sending "AC64 is 1" you can send "something in group 8 is 1". (in Pyramid VQ speak, there are big gaps to fill between K=0 and K=1 and K=2)

(note that under a normal D metric, this won't be good, so RDO won't work. You either need a D metric that counts this right, or fudge it by interpolating)

(also note that I'm hand-waving in a nasty way here; the perceptual property that "subband sums make sense" is true when applied to the original block - not the residual after subtracting out mocomp, while the property that "almost all AC's are zero" is a property of the residual! This is a difficulty that leads to the Daala PVQ (P = Predictive) Householder business. We'll ignore it for now.)

So, how might you actually send AC's with lossy positions?

There are a few ways. I'll sketch one or two.

First of all, let's work one subband at a time. We'll assume for now that we want the "lossy positioning" to only move values around within a subband, where they look rather similar. (perhaps at very low bit rates being even more lossy would be good). So, we want to send the sum of energy on each subband, and then send the way that energy is distributed within the subband.

It's an open question to me what the best "energy" on the subband is to preserve; L2, L1 or something else? Some power of those norms? In any case, it's scalar quantized. Then we have to distribute the energy, knowing the sum.

At higher bit rates, you quantize the distribution somehow and send it. One method is the Pyramid VQ (PVQ) that Daala uses. Perhaps I'll describe this more later, or you can follow the links and read old reference material. For small subbands and low quantized sums, you can also just store a codebook.
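For reference, the PVQ codebook size - the number of integer vectors of dimension N with L1 norm K - follows the standard recurrence, so it's easy to check when just storing a codebook is practical:

// Number of PVQ codewords: integer vectors of dimension n with
// sum(|y[i]|) == k. Standard recurrence:
//   V(n,k) = V(n-1,k) + V(n,k-1) + V(n-1,k-1)
// e.g. V(4,1) = 8 (one +-1 pulse in 4 slots), V(4,2) = 32.
long long pvq_codebook_size(int n, int k)
{
    if (k == 0) return 1;
    if (n == 0) return 0;
    return pvq_codebook_size(n-1, k) + pvq_codebook_size(n, k-1)
         + pvq_codebook_size(n-1, k-1);
}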

When the quantized sum is 1 ("K" in PVQ speak), then you just need to encode the position of where that 1 pulse is; eg. in a subband of N slots you need to code the spot that gets the energy, which takes log2(N) bits for a flat code. (of course a flat code is wrong because the slots aren't equally likely; the lower frequencies are slightly more likely, even within a subband that has been selected to be somewhat uniform).

So, what about sending less than log2(N) bits?

Well, at K=1 you can think of the bits as being binary splitting planes, cutting the subband in half with each bit you send. If you send all log2(N) bits then you specify a value precisely.


eg.

subband 3 is { a b c d }

K = 0 ; no bits sent , position = { abcd }
1 bit : { ab | cd }
K = 1 : 2 bits : { a | b | c | d }

We now have a continuous way of sending more bits and getting more precision.
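A sketch of that, assuming for simplicity that the subband size is a power of two; "bits" is how many splitting-plane bits you choose to send:

// Sketch of lossy K=1 position coding in a subband of 2^full_bits
// slots. Sending "bits" of the position narrows it to a bucket of
// 2^(full_bits - bits) consecutive slots; bits == full_bits is exact,
// bits == 0 says nothing about the location.
int position_code(int pos, int full_bits, int bits)
{
    return pos >> (full_bits - bits);       // coarse bucket index
}

int bucket_first_slot(int code, int full_bits, int bits)
{
    return code << (full_bits - bits);      // lowest slot in the bucket
}

int bucket_size(int full_bits, int bits)
{
    return 1 << (full_bits - bits);
}

In the { a b c d } example, full_bits = 2 ; sending 1 bit gives you the buckets { ab } and { cd }.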

On larger subbands, the splitting planes could be not in the middle (binary encoding) but rather at the point where they divide the probability 50/50, so that the index is variable length and doesn't require entropy coding.
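eg. a sketch: given per-slot probabilities, put each split at the point where the cumulative mass crosses half, so each split bit carries close to one bit of information:

// Sketch: split slots [lo,hi) at the first point where the left
// part holds at least half the probability mass, instead of at the
// middle slot.
int balanced_split(const float *p, int lo, int hi)
{
    float total = 0, acc = 0;
    for (int i = lo; i < hi; i++) total += p[i];
    for (int i = lo; i < hi - 1; i++) {
        acc += p[i];
        if (acc >= 0.5f * total)
            return i + 1;   // buckets are [lo, i+1) and [i+1, hi)
    }
    return hi - 1;
}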

When you've specified a lossy range of possible energy locations, eg. at 1 bit we specified "either a or b", you have a lot of options for how to restore that in the decoder. It could be just to the most probable spot (a), or pseudo-randomly to either one. If the encoder and decoder use the same pseudo-random permutation, that allows the encoder to make an RD choice based on whether the restoration will match reality or not.
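A sketch of the shared pseudo-random restoration; the hash and the idea of seeding it from a block index are my own illustration, not anything specified:

// Sketch: restore the pulse pseudo-randomly within the bucket. The
// encoder and decoder hash the same thing (here a hypothetical block
// index) so they agree, which lets the encoder evaluate D against the
// actual restoration when making the RD choice.
unsigned hash32(unsigned x)
{
    x ^= x >> 16; x *= 0x7feb352dU;
    x ^= x >> 15; x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

int restore_position(int bucket_lo, int bucket_size, unsigned block_index)
{
    return bucket_lo + (int)(hash32(block_index) % (unsigned)bucket_size);
}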

There's a tricky issue above K=1 when you want to add lossy detail for the next value distribution. It needs to be done such that D actually goes down, which is not trivial. You want D to be as continuous as possible as you add more lossy detail. The "embedded" style of truncating away bits of location is also not so obvious to me when you have two values to distribute.

More on this whole topic later.

1 comment:

  1. "If the encoder and decoder use the same pseudo-random permutation, that allows the encoder to make an RD choice based on whether the restoration will match reality or not."

    For image coding just not specifying how that energy is restored is an option. I don't think you can get away with it in video coding though because you're in a feedback loop - the decoder will use whatever it decodes as source for your next frame.
