First, I want to note that the PVQ Demo page has good links at the bottom with
more details in them, worth reading.

Also, the Main Daala page has more links, including the
"Intro to Video" series, which is rather more than an intro and is a good read. It's a broad survey of
modern video coding.

Now, a big raw dump of emails between me, ryg, and JM Valin. I'm gonna try to color them to make it a bit
easier to follow. Thusly :

cbloom

ryg

valin

And this all starts with me being not very clear on PVQ so the beginning is a little fuzzy.

I will be following this up with a "summary of PVQ as I now understand it" which is probably more
useful for most people. So, read that, not this.

(also jebus the internet is ridonkulous. Can I have BBS's back? Like literally 1200 baud text
is better than the fucking nightmare that the internet has become. And I wouldn't mind playing
a little Trade Wars once a day...)

Hi,
I'm the author of the latest Daala demo on PVQ on which you commented
recently. Here's some comments on your comments. I wasn't able to submit
this to your blog, but feel free to copy them there.
> 1. I've long believed that blocks should be categorized into
> "smooth", "detail" and "edge". For most of this discussion we're
> going to ignore smooth and edge and just talk about detail. That is,
> blocks that are not entirely smooth, and don't have a dominant edge
> through them (perhaps because that edge was predicted).
I agree here, although right now we're treating "smooth" and "detail" in
the same way (smooth is just low detail). Do you see any reason to treat
those separately?
> 2. The most important thing in detail blocks is preserving the
> amount of energy in the various frequency subbands. This is something
> that I've talked about before in terms of perceptual metrics.
This is exactly what I had in mind with this PVQ work. Before Daala, I
worked on the CELT part of the Opus codec, which has strict preservation
of the energy. In the case of Daala, it looks so far like we want to
relax the energy constraint a little.
Right now, the codebook has an energy-preserving structure, but the
actual search is R/D optimized with the same weight given to the amount
of energy and its location. It's pretty easy to change the code to give
more weight to energy preservation. I could even show you how to play
with it if you're interested.
> 3. You can take a standard type of codec and optimize the encoding
> towards this type of perceptual metric, and that helps a bit, but
> it's the wrong way to go. Because you're still spending bits to
> exactly specify the noise in the high frequency area.
Correct, hence the gain-shape quantization in my post.
> 4. What you really want is a joint quantizer of summed energy and
> the distribution of that energy. At max bit rate you send all the
> coefficients exactly. As you reduce bitrate, the sum is preserved
> pretty well, but the distribution of the lower-right (highest
> frequency) coefficients becomes lossy. As you reduce bit rate more,
> the total sum is still pretty good and the overall distribution of
> energy is mostly right, but you get more loss in where the energy is
> going in the lower frequency subbands, and you also get more scalar
> quantization of the lower frequency subbands, etc.
Well, my experience with both Opus and CELT is that you want the same
resolution for the energy as you use for the "details". That being said,
having an explicit energy still means you can better preserve it in the
quantization process (i.e. it won't increase or decrease too much due to
the quantization).
> 5. When the energy is unspecified, you'd like to restore in some nice
> way. That is, don't just restore to the same quantization vector
> every time ("center" of the quantization bucket), since that could
> create patterns. I dunno. Maybe restore with some randomness; restore
> based on prediction from the neighborhood; restore to maximum
> likelihood? (ML based on neighborhood/prediction/image not just a
> global ML)
I experimented a little bit with adding noise at the correct energy and
while it slightly improved the quality on still images, it wasn't clear
how to apply it to video because then you have the problem of static vs
dynamic noise.
> 6. An idea I've tossed around for a while is a quadtree/wavelet-like
> coding scheme. Take the 8x8 block of coefficients (and as always
> exclude DC in some way). Send the sum of the whole thing. Divide into
> four children. So now you have to send a (lossy) distribution of that
> sum onto the 4 child slots. Go to the upper left (LL band) and do it
> again, etc.
I considered something along these lines, but it would not be easy to do
because the lowest frequencies would kind of drown out the high frequencies.
> 7. The more energy you have, the less important its exact
> distribution, due to masking. As you have more energy to distribute,
> the number of vectors you need goes up a lot, but the loss you can
> tolerate also goes up. In terms of the bits to send a block, it
> should still increase as a function of the energy level of that
> block, but it should increase less quickly than naive
> log2(distributions) would indicate.
Yes, that is exactly what the PVQ companding does.
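(Editor's aside: the companding JM refers to can be sketched like so. This is a toy model; beta and the step size are illustrative constants, not Daala's actual parameters.)

```python
def quantize_gain(gain, step, beta=1.5):
    # Sketch of companded gain quantization: quantize g^(1/beta)
    # uniformly. Back in the linear domain the steps get coarser as the
    # gain grows, which is the activity-masking behavior described
    # above. beta and step are illustrative, not Daala's actual values.
    idx = round(gain ** (1.0 / beta) / step)
    recon = (idx * step) ** beta
    return idx, recon
```

So a block with more energy gets a coarser effective quantizer without any per-block quantizer being signalled.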
> 8. Not all AC's are equally likely or equally perceptually important.
> Specifically the vector codebook should contain more entries that
> preserve values in the upper-left (low frequency) area.
This is the equivalent of the quantization matrix, which PVQ has as well
(though I didn't really talk about it).
> 9. The interaction with prediction is ugly. (eg. I don't know how to
> do it right). The nature of AC values after mocomp or
> intra-prediction is not the same as the nature of AC's after just
> transform (as in JPEG). Specifically, ideas like variance masking and
> energy preservation apply to the transformed AC values, *not* to the
> deltas that you typically see in video coding.
Handling the prediction is exactly what the whole Householder reflection
in PVQ is about (see the 6 steps figure). The PVQ gain encoding scheme
is always done on the input and not on the prediction. So the activity
masking is applied on the input energy and not based on the energy of
the residual.
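(Editor's aside: the Householder step can be sketched like this. The axis/sign convention here is one common choice, not necessarily Daala's exact one.)

```python
import numpy as np

def reflect_to_axis(x, r):
    # Sketch of the PVQ prediction step: build a Householder reflection
    # that maps the unit prediction direction r/|r| onto a coordinate
    # axis, and apply that same reflection to the input x. Afterwards,
    # "x is close to the prediction" becomes "x is close to that axis",
    # so no residual is ever formed.
    u = r / np.linalg.norm(r)
    s = 1.0 if u[0] >= 0 else -1.0
    v = u.copy()
    v[0] += s                      # v = u + sign(u_0) * e_0
    return x - 2.0 * v * (v @ x) / (v @ v)
```

Note that reflections preserve length, so the gain of the input is untouched by this step.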
> 10. You want to send the information about the AC in a useful order.
> That is, the things you send first should be very strong classifiers
> of the entropy of that block for coding purposes, and of the masking
> properties for quantization purposes.
Well, coding the energy first achieves most of this.
> You don't want sending the "category" or "masking" information to be
> separate side-band data. It should just be the first part of sending
> the coefficients. So your category is maybe something like the
> bit-vector of which coefficient groups have any non-zero
> coefficients. Something like that which is not redundant with sending
> them, it's just the first gross bit of information about their
> distribution.
Well, the masking information is tied to the gain. For now, the category
information is only tied to the block size decision (based on the
assumption that edges will be 4x4), but it's not ideal and it's
something I'd like to improve.
On the topic of lapped transform, this has indeed been causing us all
sorts of headaches, but it also has interesting properties. Jury's still
out on that one, but so far I think we've managed to make reasonably
good use of it.
Cheers,
Jean-Marc

Thanks for writing! Before I address specific points, maybe you can teach me a bit about PVQ and how you use it? I can't find any good resources on the web (your abstract is rather terse). Maybe you can point me at some relevant reference material. (the CELT paper is rather terse too!)
Are you constructing the PVQ vector from the various AC's within a single block? Or gathering the same subband from spatial neighbors? (I think the former, but I've seen the latter in papers)
Assuming the former -
Isn't it just wrong? The various AC's have different Laplacian distributions (lower frequencies more likely) so using PVQ just doesn't seem right.
In particular PVQ assumes all coefficients are equally likely and equally distributed.
In your abstract you seem to describe a coding scheme which is not a uniform length codeword like traditional PVQ. It looks like it assigns shorter codes to vectors that have their values early on in some kind of z-scan order.
How is K chosen?

Hi,
On 02/12/14 08:52 PM, Charles Bloom wrote:
> Thanks for writing! Before I address specific points, maybe you can
> teach me a bit about PVQ and how you use it? I can't find any good
> resources on the web (your abstract is rather terse). Maybe you can
> point me at some relevant reference material. (the CELT paper is rather
> terse too!)
I'm currently writing a longer paper for a conference in February, but
for now there isn't much more than the demo and the abstract I link to
at the bottom. I have some notes that describe some of the maths, but
it's a bit all over the place right now:
http://jmvalin.ca/video/video_pvq.pdf
> Are you constructing the PVQ vector from the various AC's within a
> single block? Or gathering the same subband from spatial neighbors? (I
> think the former, but I've seen the latter in papers)
>
> Assuming the former -
Correct. You can see the grouping (bands) in Fig. 1 of:
http://jmvalin.ca/video/spie_pvq_abstract.pdf
> Isn't it just wrong? The various AC's have different Laplacian
> distributions (lower frequencies more likely) so using PVQ just doesn't
> seem right.
>
> In particular PVQ assumes all coefficients are equally likely and
> equally distributed.
>
>
> In your abstract you seem to describe a coding scheme which is not a
> uniform length codeword like traditional PVQ. It looks like it assigns
> shorter codes to vectors that have their values early on in some kind of
> z-scan order.
One thing to keep in mind is that the P in PVQ now stands for
"perceptual". In Daala we are no longer using the indexing scheme from
CELT (which does assume identical distribution). Rather, we're using a
coding scheme based on Laplace distribution of unequal variance. You can
read more about the actual encoding process in another document:
http://jmvalin.ca/video/pvq_encoding.pdf
> How is K chosen?
The math is described (poorly) in section 6.1 of
http://jmvalin.ca/video/video_pvq.pdf
Basically, the idea is to have the same resolution in the direction of
the gain as in any other direction. In the no prediction case, it's
roughly proportional to the gain times the square root of the number of
dimensions. Because K only depends on values that are available to the
decoder, we don't actually need to signal it.
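(Editor's aside: a hypothetical sketch of that K rule; the scale constant is mine, not Daala's exact formula.)

```python
import math

def choose_k(gain, n_dims, scale=0.5):
    # Hypothetical sketch (not Daala's exact formula): pick K so the
    # resolution along the sphere roughly matches the gain quantizer's
    # radial resolution, i.e. K ~ gain * sqrt(n_dims). 'scale' is an
    # illustrative constant. Since gain and n_dims are both known to
    # the decoder, K never needs to be signalled.
    return max(1, round(scale * gain * math.sqrt(n_dims)))
```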
Hope this helps,
Jean-Marc

Thanks for the responses and the early release papers, yeah I'm figuring most of it out.
K is chosen so that distortion from the PVQ (P = Pyramid) quantization is the same as distortion from gain quantization. Presumably under a simple D metric like L2.
The actual PVQ (P = Pyramid) part is the simplest and least ambiguous. The predictive stuff is complex.
Let me make sure I understand this correctly -
You never actually make a "residual" in the classic sense by subtracting the prediction off.
You form the prediction in transformed space. (perhaps by having a motion vector, taking the pixels it points to and transforming them, dealing with lapping, yuck!)
The gain of the current block is sent (for each subband). Not the gain of the delta. The gain of the prediction in the same band is used as coding context? (the delta of the quantized gains could be sent).
The big win that you guys were after in sending the gain seems to have been the non-linear quantization levels; essentially you're getting "variance adaptive quantization" without explicitly sending per block quantizers.
The Householder reflection is the way that vectors near the prediction are favored. This is the only way that the predicted block is used!? Madness!
(presumably if the prediction had detail that was finer than the quantization level of the current block that could be used to restore within the quantization bucket; eg. for "golden frames")

On 03/12/14 12:18 AM, Charles Bloom wrote:
> K is chosen so that distortion from the PVQ (P = Pyramid) quantization
> is the same as distortion from gain quantization. Presumably under a
> simple D metric like L2.
Yes, it's an L2 metric, although since the gain is already warped, the
distortion is implicitly weighted by the activity masking, which is
exactly what we want.
> You never actually make a "residual" in the classic sense by subtracting
> the prediction off.
Correct.
> You form the prediction in transformed space. (perhaps by having a
> motion vector, taking the pixels it points to and transforming them,
> dealing with lapping, yuck!)
We have the input image and we have a predicted image. We just transform
both. Lapping doesn't actually cause any issues there (unlike many other
places). As far as I can tell, this part is similar to what a wavelet
coder would do.
> The gain of the current block is sent (for each subband). Not the gain
> of the delta.
Correct.
> The gain of the prediction in the same band is used as
> coding context? (the delta of the quantized gains could be sent).
Yes, the gain is delta-coded, so coding "same gain" is cheap.
In particular, there's a special symbol for gain=0, theta=0, which means
"skip this band and use the prediction as is".
> The big win that you guys were after in sending the gain seems to have
> been the non-linear quantization levels; essentially you're getting
> "variance adaptive quantization" without explicitly sending per block
> quantizers.
Exactly. Not only that but it's adaptive based on the variance of the
current band, not just an entire macroblock.
> The Householder reflection is the way that vectors near the prediction
> are favored. This is the only way that the predicted block is used!?
> Madness!
Well, the reference is used to compute the reflection *and* the gain. In
the end, we're using exactly the same amount of information, just in a
different space.
> (presumably if the prediction had detail that was finer than the
> quantization level of the current block that could be used to restore
> within the quantization bucket; eg. for "golden frames")
Can you explain what you mean here?
Jean-Marc

So one thing that strikes me is that at very low bit rate, it would be nice to go below K=1. In the large high-frequency subbands, the vector dimension N is very large, so even at K=1 it takes a lot of bits to specify where the energy should go. It would be nice to be more lossy with that location.
It seems that for low K you're using a zero-runlength coder to send the distribution, with a kind of Z-scan order, which makes it very similar to standard MPEG.
(maybe you guys aren't focusing on such low bit rates; when I looked at low bit rate video the K=1 case dominated)
At 09:42 PM 12/2/2014, you wrote:
> (presumably if the prediction had detail that was finer than the
> quantization level of the current block that could be used to restore
> within the quantization bucket; eg. for "golden frames")
Can you explain what you mean here?
If you happen to have a very high quality previous block (much better than
your current quantizer / bit rate should give you) - with normal mocomp you can easily carry that block forward, and perhaps apply corrections to it, but the high detail of that block is preserved.
With the PVQ scheme it's not obvious to me that that works.
When you send the quantized gain of the subbands you're losing precision (it looks like you guys have a special fudge to fix this, by offsetting the gain based on the prediction's gain?)
But for the VQ part, you can't really "carry forward" detail in the same way.
I guess the reflection vector can be higher precision than the quantizer, so in a sense that preserves detail, but it doesn't carry forward the same values, because they drift due to rotation and staying a unit vector, etc.

Some more questions -
Is the Householder reflection method also used for Intra prediction? (do you guys do the directional Intra like H26x ?)
How much of this scheme is because you believe it's the best thing to do vs. you have to avoid H26x patents?
If you're not sending any explicit per-block quantizer, it seems like that removes a lot of freedom for future encoders to do more sophisticated perceptual optimization. (ROI bit allocation or whatever)

On 03/12/14 02:17 PM, Charles Bloom wrote:
> So one thing that strikes me is that at very low bit rate, it would be
> nice to go below K=1. In the large high-frequency subbands, the vector
> dimension N is very large, so even at K=1 it takes a lot of bits to
> specify where the energy should go. It would be nice to be more lossy
> with that location.
Well, for large N, the first gain step already has K>1, which I believe
is better than K=1. I've considered adding an extra gain step with K=1
or below, but never had anything that was really worth it (didn't try
very hard).
> It seems that for low K you're using a zero-runlength coder to send the
> distribution, with a kind of Z-scan order, which makes it very similar
> to standard MPEG.
>
> (maybe you guys aren't focusing on such low bit rates; when I looked at
> low bit rate video the K=1 case dominated)
We're also targeting low bit-rates, similar to H.265. We're not yet at
our target level of performance though.
> Is the Householder reflection method also used for Intra prediction?
> (do you guys do the directional Intra like H26x ?)
We also use it for intra prediction, though right now our intra
prediction is very limited because of the lapped transform. Except for
chroma which we predict from the luma. PVQ makes this particularly easy.
We just use the unit vector from luma as chroma prediction and code the
gain.
> How much of this scheme is because you believe it's the best thing to
> do vs. you have to avoid H26x patents?
The original goal wasn't to avoid patents, but it's a nice added benefit.
> If you're not sending any explicit per-block quantizer, it seems like
> that removes a lot of freedom for future encoders to do more
> sophisticated perceptual optimization. (ROI bit allocation or
> whatever)
We're still planning on adding some per-block/macroblock/something
quantizers, but we just won't need them for activity masking.
Cheers,
Jean-Marc

Hi,
Just read your "smooth blocks" post and I thought I'd mention one thing
we do in Daala to improve the quality of smooth regions. It's called
"Haar DC" and the idea is basically to apply a Haar transform to all
the DCs in a superblock. This has the advantage of getting us much
better quantization resolution at large scales.
Unfortunately, there's absolutely no documentation about it, so you'd
have to look at the source code, mostly in od_quantize_haar_dc() and a
bit of od_compute_dcts()
http://git.xiph.org/?p=daala.git;a=blob;f=src/encode.c;h=879dda;hb=HEAD
Cheers,
Jean-Marc
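(Editor's aside: one 2x2 step of the Haar-DC idea can be sketched like this. Illustrative only; it is not the actual od_quantize_haar_dc() transform or its scaling.)

```python
def haar2x2(a, b, c, d):
    # One 2x2 Haar step on four block DCs: one average-like term plus
    # three differences. Applied recursively over a superblock, the
    # top-level term covers a large area and can be quantized more
    # finely than the individual DCs, which is where the "better
    # resolution at large scales" comes from.
    s = (a + b + c + d) / 2.0   # scaled average
    h = (a - b + c - d) / 2.0   # horizontal difference
    v = (a + b - c - d) / 2.0   # vertical difference
    g = (a - b - c + d) / 2.0   # diagonal difference
    return s, h, v, g

def haar2x2_inv(s, h, v, g):
    # exact inverse of haar2x2
    a = (s + h + v + g) / 2.0
    b = (s - h + v - g) / 2.0
    c = (s + h - v - g) / 2.0
    d = (s - h - v + g) / 2.0
    return a, b, c, d
```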

Yeah I definitely can't follow that code without digging into it.
But this : "much better quantization resolution at large scales." is interesting.
When I did the DLI test :
http://cbloomrants.blogspot.com/2014/08/08-31-14-dli-image-compression.html
something I noticed in both JPEG and DLI (and in everything else, I'm sure) is :
Because everyone just does naive scalar quantization on DC's, large regions of solid color will shift in a way that is very visible.
That is, it's a very bad perceptual RD allocation. Some bits should be taken away from AC detail and put into making that large region DC color more precise.
The problem is that DC scalar quantization assumes the blocks are independent and random and so on. It models the distortion of each block as being independent, etc.
But it's not. If you have the right scalar quantizer for the DC when the blocks are in a region of high variation (lots of different DC's) then that is much too large a quantizer for regions where blocks all have roughly the same DC.
This is true even when there is a decent amount of AC energy, eg. the image I noticed it in was the "Porsche640" test image posted on that page - the greens of the bushes all color shift in a very bad way. The leaf detail does not mask this kind of perceptual error.

Two more questions -
1. Do you use a quantization matrix (ala JPEG CSF or whatever) ? If so, how does that work with gain preservation and the Pyramid VQ unit vector?
2. Do you mind if I post all these mails publicly?

On 11/12/14 02:03 PM, Charles Bloom wrote:
> 1. Do you use a quantization matrix (ala JPEG CSF or whatever) ? If so,
> how does that work with gain preservation and the Pyramid VQ unit vector?
Right now, we just set a different quantizer value for each "band", so
we can't change resolution on a coefficient-by-coefficient basis, but it
still looks like a good enough approximation. If needed we might try
doing something fancier at some point.
> 2. Do you mind if I post all these mails publicly?
I have no problem with that and in fact I encourage you to do so.
Cheers,
Jean-Marc

ryg:
Don't wanna post this to your blog because it's a long comment and will probably fail Blogger's size limit.
Re "3 1/2. The normal zig-zag coding schemes we use are really bad."
Don't agree here about zig-zag being the problem. Doesn't it just boil down to what model you use for the run lengths?
Classic JPEG/MPEG style coding rules (H.264 and later are somewhat different) 1. assume short runs are more probable than long ones and 2. give a really cheap way to end blocks early. The result is that the coder likes blocks with a fairly dense cluster in the first few coded components (and only this is where zig-zag comes in) and truncated past that point.
Now take Fischer-style PVQ (original paper is behind a paywall, but this:
http://www.nul.com/pbody17.pdf
covers what seems to be the proposed coding scheme). You have two parameters, N and K. N is the dimensionality of the data you're coding (this is a constant at the block syntax level and not coded) and K is the number of unit pulses (=your "energy"). You code K and then send an integer (with a uniform model!) that says which of all possible arrangements of K unit pulses across N dimensions you mean. For 16-bit ACs in an 8x8 block so N=63, there's on the order of 2^(63*16) = 2^1008 different values you could theoretically code, so clearly for large K this integer denoting the configuration can get quite huge.
Anyway, suppose that K=1 (easiest case). Then the "configuration number" will tell us where the pulse goes and what sign it has, uniformly coded. That's essentially a run length with *uniform* distribution plus sign. K=2: we have two pulses. There's N*2 ways to code +-2 in one AC and the rest zeros (code AC index, code sign), and (N choose 2) * 2^2 ways to code two slots at +-1 each. And so forth for higher K.
From there, we can extrapolate what the general case looks like. I think the overall structure ends up being isomorphic to this:
1. You code the number M (<=N) of nonzero coefficients using a model derived from the combinatorics given N and K (purely counting-based). (K=1 implies M=1, so nothing to code in that case.)
2. Code the M sign bits.
3. Code the positions of the M nonzero coeffs - (N choose M) options here.
4. Code another number denoting how we split the K pulses among the M coeffs - that's an integer partition of K into exactly M parts, not sure if there's a nice name/formula for that.
This is close enough to the structure of existing AC entropy coders that we can meaningfully talk about the differences. 1) and 2) are bog-standard (we use a different model knowing K than a regular codec that doesn't know K would, but that's it). You can view 3) in terms of significance masks, and the probabilities have a reasonably simple form (I think you can adapt the Reservoir sampling algorithm to generate them) - or, by looking at the zero runs, in term of run lengths. And 4) is a magnitude coder constrained by knowing the final sum of everything.
So the big difference is that we know K at the start, which influences our choice of models forthwith. But it's not actually changing the internal structure that much!
That said, I don't think they're actually doing Fischer-style PVQ of "just send a uniform code". The advantage of breaking it down like above is that you have separate syntax elements that you can apply additional modeling on separately. Just having a giant integer flat code is not only massively unwieldy, it's also a bit of a dead end as far as further modeling is concerned.
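(Editor's aside: the counting decomposition above is easy to sanity-check numerically. A sketch; pvq_count is the Fischer codebook size, pvq_count_by_support follows steps 1-4, and the function names are mine.)

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def pvq_count(n, k):
    # Size of the Fischer PVQ codebook: integer vectors of dimension n
    # with L1 norm exactly k, via the recursion over the first coeff.
    if k == 0:
        return 1
    if n == 0:
        return 0
    return sum(pvq_count(n - 1, k - abs(i)) for i in range(-k, k + 1))

def pvq_count_by_support(n, k):
    # Same count via the decomposition above: choose the M nonzero
    # slots, their signs, and a composition of k into M positive parts
    # (there are C(k-1, M-1) of those).
    return sum(comb(n, m) * (2 ** m) * comb(k - 1, m - 1)
               for m in range(1, min(n, k) + 1))
```

The two agree, which is the isomorphism claim: the flat uniform code and the factored significance/sign/position/magnitude code enumerate the same codebook.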
-Fabian

cbloom:
At 01:36 PM 12/2/2014, you wrote:
Don't wanna post this to your blog because it's a long comment and will probably fail Blogger's size limit.
Re "3 1/2. The normal zig-zag coding schemes we use are really bad."
Don't agree here about zig-zag being the problem. Doesn't it just boil down to what model you use for the run lengths?
My belief is that for R/D optimization, it's bad when there's a big R step that doesn't correspond to a big D step.
You want the prices of things to be "fair".
So the problem is cases like :
XX00
X01
0
vs
XX00
X001
000
00
0
which is not a very big D change at all, but is a very big R step.
I think it's easy to see that even keeping something equivalent to the zigzag, you could change it so that the position of the next coded value is sent in a way such that the rates better match entropy and distortion.
But of course, really what you want is to send the positions of those later values in a lossy way.
Even keeping something zigzagish you can imagine easy ways to do it, like you send a zigzag RLE that's something like {1,2,3-4,5-7,8-13} whatever.

ryg:
Actually the Fischer (magnitude enumeration) construction corresponds pretty much directly to a direct coder: from the IEEE paper, l = dim, k = number of pulses, then number of code words N(l,k) is
N(l,k) = sum_{i=-k}^k N(l-1, k-|i|)
This is really direct: N(l,k) just loops over all possible values i for the first AC coeff. The remaining uncoded ACs are then l-1 dimensional with k-|i| pulses left. Divide through by N(l,k) and you have a probability distribution for coding a single AC coeff. Splitting out the i=0 case and sign, we get:
N(l,k) = N(l-1,k) + 2 * sum_{j=1}^k N(l-1,k-j)
=: N(l-1,k) + 2 * S(l-1,k)
which corresponds 1:1 to this encoder:
// While energy (k) left
for (i = 0; k > 0; i++) {
    assert(i < Ndims); // shouldn't get to Ndims with leftover energy
    int l = Ndims - i; // remaining dims
    int coeff = coeffs[i];
    // encode significance
    code_binary(coeff == 0, N(l-1,k) / N(l,k));
    if (coeff != 0) {
        // encode sign
        code_binary(coeff < 0, 0.5);
        int mag = abs(coeff);
        // encode magnitude (multi-symbol)
        // prob(mag=j) = N(l-1,k-j) / S(l-1,k)
        code_magnitude(mag, l, k);
        // then:
        k -= mag;
    }
}
and this is probably how you'd want to implement it given an arithmetic back end anyway. Factoring it into multiple decisions is much more convenient (and as said before, easier to do secondary modeling on) than the whole "one giant bigint" mess you get if you're not low-dimensional.
Having the high-dimensional crap in there blows because the probabilities can get crazy. Certainly Ndims=63 would suck to work with directly. Separately, I'd expect that for k "large" (k >= Ndims? Ndims*4? More? Less?) you can use a simpler coder and/or fairly inaccurate probabilities because that's gonna be infrequent.
Maybe given k = AC_sum(1,63) = sum_{i=1}^63 |coeff_i|, there's a reasonably nice way to figure out say AC_sum(1,32) and AC_sum(33,63). And if you can do that once, you can do it more than once. Kind of a top-down approach: you start with "I have k energy for this block" and first figure out which subband groups that energy goes into. Then you do the "detail" encode like above within each subband of maybe 8-16 coeffs; with l<=Ndim<=8 and k<=Ndim*small, you would have reasonable (practical) model sizes.
-Fabian

cbloom:
No, I don't think that's right.
The N recursion is just for counting the number of codewords, it doesn't imply a coding scheme.
It explicitly says that the pyramid vector index is coded with a fixed length word, using ceil( log2(N) ) bits. Your coding scheme is variable length.
I need to find the original Fischer paper because this isn't making sense to me. The AC's aren't equally probable and don't have the same Laplacian distribution so PVQ just seems wrong.
I did find this paper ("Robust image and video coding with pyramid vector quantisation") which uses PVQ and is making the vectors not from within the same block, but within the same *subband* in different spatial locations. eg. gathering all the AC20's from lots of neighboring blocks. That does make sense to me but I'm not sure if that's what everyone means when they talk about PVQ ?
(paper attached to next email)

ryg:
On 12/2/2014 5:46 PM, Charles Bloom {RAD} wrote:
No, I don't think that's right.
The N recursion is just for counting the number of codewords, it doesn't
imply a coding scheme.
It explicitly says that the pyramid vector index is coded with a fixed
length word, using ceil( log2(N) ) bits. Your coding scheme is variable length.
I wasn't stating that Fischer's scheme is variable-length; I was stating that the decomposition as given implies a corresponding way to encode it that is equivalent (in the sense of exact same cost). It's not variable length. It's variable number of symbols but the output length is always the same (provided you use an exact multi-precision arithmetic coder that is, otherwise it can end up larger due to round-off error).
log2(N(l,k)) is the number of bits we need to spend to encode which one out of N(l,k) equiprobable codewords we use. The ceil(log2(N)) is what you get when you say "fuck it" and just round it to an integral number of bits, but clearly that's not required. So suppose we're coding to the exact target rate using a bignum rationals and an exact arithmetic coder.
Say I have a permutation of 3 values and want to encode which one it is. I can come up with a canonical enumeration (doesn't matter which) and send an index stating which one of the 6 candidates it is, in log2(6) bits. I can send one bit stating whether it's an even or odd permutation, which partitions my 6 cases into 2 disjoint subsets of 3 cases each, and then send log2(3) bits to encode which of the even/odd permutations I am, for a total of log2(2) + log2(3) = log2(6) bits.
Or I can get fancier. In the general case, I can (arbitrarily!) partition my N values into disjoint subsets with k_1, k_2, ..., k_m elements, respectively, sum_i k_i = N. To code a number, I then first code the number of the subset it's in (using probability p_i = k_i/N) and then send a uniform integer denoting which element it is, in log2(k_i) bits. Say I want to encode some number x, and it falls into subset j. Then I will spend
-log2(p_j) + log2(k_j) = -log2(k_j / N) + log2(k_j)
= log2(N / k_j) + log2(k_j) = log2(N)
bits (surprise... not). I'm just partitioning my uniform distribution into several distributions over smaller sets, always setting probabilities exactly according to the number of "leaves" (=final coded values) below that part of the subtree, so that the product along each path is still a uniform distribution. I can nest that process of course, and it's easy to do so in some trees but not others meaning I get non-uniform path lengths, but at no point am I changing the size of the output bitstream.
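(Editor's aside: that identity is worth seeing with numbers. A trivial sketch:)

```python
import math

def split_bits(n, subset_sizes):
    # Cost of coding one of n equiprobable values by first coding which
    # subset it falls into (p_i = k_i / n), then a uniform index within
    # that subset (log2(k_i) bits). The per-element total collapses to
    # log2(n) bits no matter how you partition.
    assert sum(subset_sizes) == n
    return [-math.log2(k / n) + math.log2(k) for k in subset_sizes]
```

Every entry comes out to log2(n): partitioning a uniform code never changes the output size, it only changes how the decisions are factored.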
That's exactly what I did in the "coder" given below. What's the value of the first AC coefficient? It must obey
-k <= ac_0 <= k
per definition of k, and I'm using that to partition our codebook C into 2k+1 disjoint subsets, namely
C_x = { c in C | ac0(c) = x }
and nicely enough, by the unit-pulse definition that leads to the enumeration formula, each of the C_x corresponds to another PVQ codebook, namely with dimension l-1 and energy k-|x|. Which implies the whole thing decomposes into "send x and then do a PVQ encode of the rest", i.e. the loop I gave.
That said, one important point that I didn't cover in my original mail: for the purposes of coding this is really quite similar to a regular AC coder, but of course the values being coded don't mean the same thing. In a JPEG/MPEG style entropy coder, the values I'm emitting are raw ACs. PVQ works (for convenience) with code points on an integer lattice Z^N, but the actual AC coeffs coded aren't those lattice points, they're (gain(K) / len(lattice_point)) * lattice_point (len here being Euclidean and not 1-norm!).
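(Editor's aside: that reconstruction rule in one line, as a sketch:)

```python
import numpy as np

def pvq_dequant(y, gain):
    # Reconstruct coefficients from an integer lattice point y and the
    # decoded gain: normalize y to unit Euclidean length, then scale by
    # the gain. The lattice lives on the L1 pyramid, but reconstruction
    # uses the L2 norm, so the decoded block's energy is exactly the gain.
    y = np.asarray(y, dtype=float)
    return gain * y / np.linalg.norm(y)
```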
I need to find the original Fischer paper because this isn't making sense to me. The ACs aren't equally probable and don't have the same Laplacian distribution, so PVQ just seems wrong.
I did find this paper ("Robust image and video coding with pyramid vector quantisation") which uses PVQ but builds the vectors not from within the same block, but from the same *subband* across different spatial locations, e.g. gathering all the AC20s from lots of neighboring blocks. That does make sense to me, but I'm not sure if that's what everyone means when they talk about PVQ?
(paper attached to next email)
The link to the extended abstract for the Daala scheme (which covers this) is on the Xiph demo page:
http://jmvalin.ca/video/spie_pvq_abstract.pdf
Page 2 has the assignment of coeffs to subbands. They're only using a handful, and notably they treat 4x4 blocks as a single subband.
-Fabian

cbloom:
Ah yeah, you are correct of course. I didn't see how you had the probabilities in the coding.
There are a lot of old papers I can't get about how to do the PVQ enumeration in an efficient way. I'm a bit curious about what they do.
But as I'm starting to understand it all a bit now, that just seems like the least difficult part of the problem.
Basically the idea is something like - divide the block into subbands. Let's say the standard wavelet tree for concreteness -
01447777
23447777
55667777
55667777
8..
Send the sum in each subband ; this is the "gain" ; let's say g_s
g_s is sent with some scalar quantizer (how do you choose q_s ?)
(in Daala a non-linear quantizer is used)
For each subband, scale the vector to an L1 length K_s (how do you choose K_s?)
Quantize the vector to a PVQ lattice point; send the lattice index
So PVQ (P = Pyramid) solves this problem of how to enumerate the distribution given the sum. But that's sort of the trivial part.
The how do you send the subband gains, what is K, etc. is the hard part. Do the subband gains mask each other?
Then there's the whole issue of PVQ where P = Predictive. This Householder reflection business. Am I correct in understanding that Daala doesn't subtract off the motion prediction and make a residual? The PVQ (P = predictive) scheme is used instead? That's quite amazing. And it seems that Daala sends the original gain, not the gain of the residual (and uses the gain of the prediction as context).
The slides (reference #4) clear things up a bit.

ryg:
On 12/2/2014 8:21 PM, Charles Bloom {RAD} wrote:
Ah yeah, you are correct of course. I didn't see how you had the
probabilities in the coding.
There are a lot of old papers I can't get about how to do the PVQ
enumeration in an efficient way. I'm a bit curious about what they do.
Well, the one I linked to has a couple of variants already. But it's pretty much beside the point. You can of course turn this into a giant combinatorial circle-jerk, but I don't see the use. For example (that's one of the things in the paper I linked to), if you're actually assigning indexes to values then yeah, the difference between assigning codewords in order { 0, -1, 1, -2, 2, ... } versus { -k, -k+1, ..., -1, 0, 1, 2, ..., k } matters, but once you decompose it into several syntax elements, most of that incidental complexity just disappears completely.
But as I'm starting to understand it all a bit now, that just seems like
the least difficult part of the problem.
Yeah, agreed.
Basically the idea is something like - divide the block into subbands.
Let's say the standard wavelet tree for concreteness -
01447777
23447777
55667777
55667777
8..
Yup.
Send the sum in each subband ; this is the "gain" ; let's say g_s
No, the gain isn't sum (1-norm), it's the Euclidean (2-norm) length. If you used 1-norm you wouldn't deform the integer lattice, meaning you're still just a scalar quantizer, just one with a funky backend.
E.g. in 2D, just sending k = q(|x| + |y|) (with q being a uniform scalar quantizer without dead zone for simplicity) and then coding where the pulses go is just using the same rectangular lattice as you would have if you were sending q(x), q(y) directly. (Once you add a dead zone that's not true any more; scalar favors a "+" shape around the origin whereas the 1-norm PVQ doesn't. But let's ignore that for now.)
With an ideal vector quantizer you make the "buckets" (=Voronoi regions) approximately equally likely. For general arbitrary 2D points that means the usual hex lattice. The PVQ equivalent of that is the NPVQ pattern:
https://people.xiph.org/~jm/daala/pvq_demo/quantizer4.png
That's clearly suboptimal (not a real hex lattice at all), but it has the nice gain/shape-separation: the circles are all equal-gain. You unwrap each circle by normalizing the point in the 1-norm, and then sending the corresponding AC pulses.
g_s is sent with some scalar quantizer (how do you choose q_s ?)
(in Daala a non-linear quantizer is used)
q_s would come from the rate control, as usual.
g codes overall intensity. You would want that to be roughly perceptually uniform. And you're not sending g at all, you're sending K.
CIELab gamma (which is ~perceptually uniform) is 3, i.e. linear->CIELab is pow(x, 1/3). The Daala gain compander uses, surprise, 1/3. This would make sense except for the part where the CIE gamma deals in *linear* values and Daala presumably works on a gamma-infested color space, because that's what you get.
My theory is this: the thing they're companding is not g_s, but g_s^2, i.e. sum of squares of AC coeffs. That makes for a total companding curve of (g_s)^(2/3). Display gamma is ~2, perceptually uniform gamma is ~3, so this would be in the right range to actually work out. They're not doing a good job of describing this though!
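A trivial numeric check of that theory: companding g^2 with exponent 1/3 is the same curve as companding g with exponent 2/3 (numbers here are illustrative only):

```python
# if the quantity being companded with exponent 1/3 is g^2 (the sum
# of squared AC coeffs), the total curve in terms of g is g^(2/3)
acs = [3.0, -1.0, 2.0]
g2 = sum(a * a for a in acs)   # squared 2-norm gain = 14.0
g = g2 ** 0.5                  # the gain itself
assert abs(g2 ** (1.0 / 3.0) - g ** (2.0 / 3.0)) < 1e-9
```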
For each subband, scale the vector to an L1 length K_s (how do you
choose K_s?)
You don't. You have your companded g's. The companding warps the space so now we're in NPVQ land (the thing I sent the image URL for). The companded g is the radius of the circle you're actually on. But of course this is a quantizer so your choices of radius are discrete and limited.
You look at circles with a radius in the right neighborhood (most obviously, just floor(g) and ceil(g), though you might want to widen the search if you're doing RDO). You find the closest lattice points on both circles (this is convex, so no risk of getting stuck in a local min). Choose whichever of the two circles is better. (All points *on* the same circle have the same cost, at least with vanilla PVQ. So the only RD trade-off you do is picking which circle.)
K_s is the index of the circle you're on. The origin is K_s=0, the first real circle is K_s=1 (and has 2N points where N is your dimensionality), and so forth.
Quantize the vector to a PVQ lattice point; send the lattice index
Finding that is the convex search.
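A rough sketch of that shape-quantization step, assuming a simple project-and-round search with a greedy fix-up (this is an illustration of the idea, not Daala's actual search):

```python
def pvq_quantize(x, K):
    """Return an integer vector y with sum(|y_i|) == K that
    approximates the direction of x (x must be nonzero)."""
    l1 = sum(abs(v) for v in x)
    # project onto the L1 sphere of radius K and round
    y = [int(round(K * v / l1)) for v in x]
    # greedy fix-up so the pulse count is exactly K:
    while sum(abs(v) for v in y) < K:
        # add a pulse where the magnitude deficit is largest
        i = max(range(len(x)), key=lambda i: abs(x[i]) * K / l1 - abs(y[i]))
        y[i] += 1 if x[i] >= 0 else -1
    while sum(abs(v) for v in y) > K:
        # remove a pulse where the magnitude excess is largest
        i = max(range(len(x)), key=lambda i: abs(y[i]) - abs(x[i]) * K / l1)
        y[i] -= 1 if y[i] > 0 else -1
    return y

y = pvq_quantize([0.9, -0.1, 0.3, 0.0], K=4)
assert sum(abs(v) for v in y) == 4   # exactly K pulses
```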
So PVQ (P = Pyramid) solves this problem of how to enumerate the
distribution given the sum. But that's sort of the trivial part.
Well, that's the combinatorial part. The actual vector quantizer is the idea of warping the 1-norm diamonds into sensibly spaced 2-norm circles. The regular structure enables the simplified search.
The how do you send the subband gains, what is K, etc. is the hard
part. Do the subband gains mask each other?
Not sure if they're doing any additional masking beyond that. If they do, they're not talking about it.
Then there's the whole issue of PVQ where P = Predictive. This
Householder reflection business. Am I correct in understanding that
Daala doesn't subtract off the motion prediction and make a residual?
The PVQ (P = predictive) scheme is used instead? That's quite amazing.
And it seems that Daala sends the original gain, not the gain of the
residual (and uses the gain of the prediction as context).
As far as I can tell, yeah. And yes, definitely gain of the overall block, not of the residual! Again, you have the separation into gain and shape here.
The gains are coded separately, and hence out of the equation. What remains is unit vectors for both your target block and your prediction.
That means your points are on a sphere. You do a reflection that aligns your prediction vector with the 1st AC coefficient. This rotates (well, reflects...) everything around but your block is still a unit vector on a sphere.
1st AC will now contain block_gain * dot(block_unit_vec, prediction_unit_vec). You already know block_gain. They send the dot product (cosine of the angle, but speaking about this in terms of angles is just confusing IMO; it's a correlation coefficient, period). This tells you how good the prediction is. If it's 0.9, the prediction already accounts for 0.9^2 = 81% of the energy, which you no longer need to code. You need to quantize this appropriately - you want to make sure the quantizer resolution here is reasonably matched to quantizer resolution of points on your sphere, or you're wasting bits. Now you turn whatever is left of g into K (as above).
You can iterate this as necessary. If you do bi-prediction, you can do another Householder reflection to align the energy of pred2 that was orthogonal to pred1 (the rest is gone already!) with the 2nd AC. You code another correlation coefficient and then deal with the residuals.
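The reflection step above can be sketched as follows (illustrative Python, not Daala's implementation; the function and test vectors are made up):

```python
import math

def householder_to_axis0(r):
    """Build the Householder reflection H with H @ r = s * e0.
    Returns a function applying H, plus the scalar s."""
    n = math.sqrt(sum(v * v for v in r))
    s = -n if r[0] >= 0 else n           # sign chosen for stability
    v = list(r)
    v[0] -= s                            # reflection axis: v = r - s*e0
    vv = sum(w * w for w in v)
    def apply(x):
        c = 2 * sum(a * b for a, b in zip(v, x)) / vv
        return [a - c * b for a, b in zip(x, v)]
    return apply, s

pred = [0.6, 0.8, 0.0]                   # unit prediction vector
block = [0.36, 0.48, 0.8]                # unit block vector
H, s = householder_to_axis0(pred)
hb = H(block)
dot = sum(a * b for a, b in zip(block, pred))
# the first reflected coeff carries the correlation (up to sign)
assert abs(abs(hb[0]) - abs(dot)) < 1e-12
# the reflection is unitary: the block is still a unit vector
assert abs(sum(v * v for v in hb) - 1.0) < 1e-12
```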
Fade-in / fade-out just kind of fall out when you do prediction like this. It's not a special thing. The "ACs" don't change, just the gains.
Handling cross-fades with one predictor is still shitty, but if you're doing bipred they kinda fall out as well.
It all sounds pretty cool. But I have no clue at all how well it works in practice or where it stands cost/benefit-wise.
-Fabian

ryg:
That means your points are on a sphere. You do a reflection that aligns
your prediction vector with the 1st AC coefficient. This rotates (well,
reflects...) everything around but your block is still a unit vector on
a sphere.
Important note for this and all that follows:
For this to work as I described it, your block and the prediction need to be in the same space, which in this context has to be frequency (DCT) space (since that's what you eventually want to code with PVQ), so you need to DCT your reference block first.
This combined with the reflections etc. make this pretty pricey, all things considered.
If you weren't splitting by subbands, I believe you could finesse your way around this: (normalized) DCT and Householder reflections are both unitary, so they preserve both the L2 norm and dot products. Which means you could calculate both the overall gain and the correlation coeffs for your prediction *before* you do the DCT (and hence in the decoder, add that stuff back in post-IDCT, without having to DCT your reference).
But with the subband splitting, that no longer works, at least not directly. You could still do it with a custom filter bank that just passes through precisely the DCT coeffs we're interested in for each subband, but eh, somehow I have my doubts that this is gonna be much more efficient than just eating the DCT. It would certainly add yet another complicated mess to the pile.
-Fabian

cbloom:
At 10:23 PM 12/2/2014, Fabian Giesen wrote:
For this to work as I described it, your block and the prediction need to be in the same space, which in this context has to be frequency (DCT) space (since that's what you eventually want to code with PVQ), so you need to DCT your reference block first.
This combined with the reflections etc. make this pretty pricey, all things considered.
Yeah, I asked Valin about this.
They form an entire predicted *image* rather than block-by-block because of lapping.
They transform the predicted image the same way as the current frame.
Each subband gain is sent as a delta from the predicted image subband gain.
Crazy!
His words :
> You form the prediction in transformed space. (perhaps by having a
> motion vector, taking the pixels it points to and transforming them,
> dealing with lapping, yuck!)
We have the input image and we have a predicted image. We just transform
both. Lapping doesn't actually cause any issues there (unlike many other
places). As far as I can tell, this part is similar to what a wavelet
coder would do.

cbloom:
At 09:31 PM 12/2/2014, you wrote:
q_s would come from the rate control, as usual.
Yeah, I just mean that the details of that are actually one of the most important issues. e.g. how does Q vary for the different subbands? Is there inter-subband masking, etc.
In Daala the Q is non-linear (variance adaptive quantizer)
g codes overall intensity. You would want that to be roughly perceptually uniform. And you're not sending g at all, you're sending K.
In Daala they send g and derive K.
CIELab gamma (which is ~perceptually uniform) is 3, i.e. linear->CIELab is pow(x, 1/3). The Daala gain compander uses, surprise, 1/3. This would make sense except for the part where the CIE gamma deals in *linear* values and Daala presumably works on a gamma-infested color space, because that's what you get.
My theory is this: the thing they're companding is not g_s, but g_s^2, i.e. sum of squares of AC coeffs. That makes for a total companding curve of (g_s)^(2/3). Display gamma is ~2, perceptually uniform gamma is ~3, so this would be in the right range to actually work out. They're not doing a good job of describing this though!
Err, yeah maybe. What they actually did was take the x264 VAQ and try to reproduce it.
For each subband, scale the vector to an L1 length K_s (how do you
choose K_s?)
You don't. You have your companded g's. The companding warps the space so now we're in NPVQ land (the thing I sent the image URL for). The companded g is the radius of the circle you're actually on. But of course this is a quantizer so your choices of radius are discrete and limited.
No, that's not right.
K is effectively your "distribution" quantizer. It should be proportional to g in some way (or some power of g) but it's not just g.
As the quantizer for g goes up, K goes down.
In Daala they choose K such that the distortion due to PVQ is the same as the distortion due to gain scalar quantization.
1st AC will now contain block_gain * dot(block_unit_vec, prediction_unit_vec). You already know block_gain. They send the dot product (cosine of the angle, but speaking about this in terms of angles is just confusing IMO; it's a correlation coefficient, period).
I think that in Daala they actually send the angle, not the cosine, which is important because of the non-linear quantization buckets.
It's difficult for me to intuit what the Householder reflection is doing to the residuals. But I guess it doesn't matter much.
It also all seems to fall apart a bit if the prediction is not very good. Then the gains might mismatch quite a bit, and even though you had some pixels that matched well, they will be scaled differently when normalized. It's a bit blah.

ryg:
Yeah, I asked Valin about this.
They form an entire predicted *image* rather than block-by-block because
of lapping.
That doesn't have anything to do with the lapping, I think - that's because they don't use regular block-based mocomp.
At least their proposal was to mix overlapping-block MC and Control Grid Interpolation (CGI, essentially you specify a small mesh with texture coordinates and do per-pixel tex coord interpolation). There's no nice way to do this block-per-block in the first place, not with OBMC in the mix anyway; if you chop it up into tiles you end up doing a lot of work twice.

ryg:
On 12/03/2014 10:26 AM, Charles Bloom {RAD} wrote:
g codes overall intensity. You would want that to be roughly
perceptually uniform. And you're not sending g at all, you're sending K.
In Daala they send g and derive K.
Ah, my bad.
For each subband, scale the vector to an L1 length K_s (how do you
choose K_s?)
You don't. You have your companded g's. The companding warps the space
so now we're in NPVQ land (the thing I sent the image URL for). The
companded g is the radius of the circle you're actually on. But of
course this is a quantizer so your choices of radius are discrete and
limited.
No, that's not right.
K is effectively your "distribution" quantizer. It should be
proportional to g in some way (or some power of g) but it's not just g.
As the quantizer for g goes up, K goes down.
In Daala they choose K such that the distortion due to PVQ is the same
as the distortion due to gain scalar quantization.
Ah OK, that makes sense.
1st AC will now contain block_gain * dot(block_unit_vec,
prediction_unit_vec). You already know block_gain. They send the dot
product (cosine of the angle, but speaking about this in terms of
angles is just confusing IMO; it's a correlation coefficient, period).
I think that in Daala they actually send the angle, not the cosine,
which is important because of the non-linear quantization buckets.
Didn't check how they send it. I do find thinking of this in terms of cross-correlation between block and pred a lot simpler than phrasing it in terms of angles.
It's difficult for me to intuit what the Householder reflection is doing
to the residuals. But I guess it doesn't matter much.
The reflection itself doesn't do anything meaningful. Your normalized points were on a unit sphere before, and still are after. You're just spinning it around.
It does mean that your coefficient numbering really loses all meaning. After one such reflection, you're already scrambled. Overall energy is still the same (because it's unitary) but the direction is completely different. Since PVQ already assumes that the directions are equiprobable (well, more or less, since the PVQ doesn't actually uniformly cover the sphere), they don't care.
It also all seems to fall apart a bit if the prediction is not very
good. Then the gains might mismatch quite a bit, and even though you
had some pixels that matched well, they will be scaled differently when
normalized. It's a bit blah.
Well, it's just a different goal for the predictors. Regular motion search tries to minimize SAD or similar, as do the H.264 spatial predictors. For this kind of scheme you don't care about differences at all, instead you want to maximize the normalized correlation coeff between image and reference. (You want texture matches, not pixel matches.)
-Fabian

cbloom:
The other thing I note is that it doesn't seem very awesome at low bit rate.
Their subband chunks are very large. Even at K=1, the number N of slots that could hold that single pulse is large, so sending the index of that one slot is a lot of bits.
At that point, the way you model the zeros and the location of the 1 is the most important thing.
What I'm getting at is a lossy way of sending that.

ryg:
On 12/3/2014 1:04 PM, Charles Bloom {RAD} wrote:
The other thing I note is that it doesn't seem very awesome at low bit
rate.
Their subband chunks are very large. Even at K=1 the N slots that could
have that one value is very large, so sending the index of that one slot
is a lot of bits.
Yeah, the decision to send a subband *at all* means you have to code gain, theta and your AC index. For N=16 that's gonna be hard to get below 8 bits even for trivial signals. At which point you get a big jump in the RD curve, which is bad.
Terriberry has a few slides that explain how they're doing inter-band activity masking currently:
https://people.xiph.org/~tterribe/daala/pvq201404.pdf
The example image is kind of terrible though. The "rose" dress (you'll see what I mean) is definitely better in the AM variant, but the rest is hard to tell for me unless I zoom in, which is cheating.
At that point, the way you model the zeros and the location of the 1 is
the most important thing.
What I'm getting at is a lossy way of sending that.
This is only really interesting at low K, where the PVQ codebook is relatively small.
So, er, let's just throw this one in: suppose you're actually sending codebook indices. You just have a rate allocation function that tells you how many bits to send, independent of how big the codebook actually is.
If you truly believe that preserving narrowband energy is more important than getting the direction right, then getting a random vector with the right energy envelope is better than nothing.
Say K=1, Ndim=16. You have N=32 codewords, so a codebook index stored directly is 5 bits. Rate function says "you get 0 bits". So you don't send an index at all, and the decoder just takes codeword 0. Or rate function says "you get 2 bits" so you send two bits of the codebook index, and take the rest as zero.
This is obviously biased. So the values you send aren't raw codebook indices. You have some random permutation function family
p_x(i) : { 0, ..., N-1 } -> { 0, ..., N-1 }
where x is a per-block value that both the encoder and decoder know (position or something), and what you send is not the codebook id but p_x(id).
For any given block (subband, whatever), this doesn't help you at all. You either guess right or you guess wrong.
But statistically, suppose you shaved 2 bits off the codebook IDs for 1000 blocks. Then you'd expect about 250 of these blocks to reconstruct the right ACs. For the rest, you reconstructed garbage ACs, but it's garbage with the right energy levels at least! :)
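A toy sketch of this scheme (purely hypothetical; the XOR-by-hash permutation and all names are made up for illustration, and N must be a power of two here):

```python
import hashlib

def perm(i, key, N):
    """Keyed permutation of {0..N-1}, shared by encoder and decoder.
    XOR with a hashed key is self-inverse, so one function does both
    directions."""
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16) % N
    return i ^ h

N = 32                    # codebook size for K=1, dim 16: 2*16 codewords
idx = 23                  # "true" codeword index (5 bits)
key = "block_7_3"         # per-block value both sides know, e.g. position

# rate function says "you get 2 bits": send only the top 2 of 5 bits
sent = perm(idx, key, N) >> 3
# decoder fills the missing low bits with zeros and un-permutes;
# 1-in-8 chance of hitting the right codeword, garbage otherwise
guess = perm(sent << 3, key, N)
assert 0 <= guess < N
# with the full 5 bits, reconstruction is exact
assert perm(perm(idx, key, N), key, N) == idx
```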
No clue if this is actually a good idea at all. It definitely allows you to remove a lot of potholes from the RD curve.
-Fabian

ryg:
On 12/3/2014 1:48 PM, Fabian Giesen wrote:
> [..]
This is obviously biased. So the values you send aren't raw codebook
indices. You have some random permutation function family
p_x(i) : { 0, ..., N-1 } -> { 0, ..., N-1 }
where x is a per-block value that both the encoder and decoder know
(position or something), and what you send is not the codebook id but
p_x(id).
Now this is all assuming you either get the right code or you get garbage, and living with whichever one it is.
You can also go in the other direction and try to get the direction at least mostly right. You can try to determine an ordering of the code book so that distortion more or less smoothly goes down as you add extra bits. (First bit tells you which hemisphere, that kind of thing.)
That way, if you get 4 bits out of 5, it's not a 50:50 chance between right vector and some random other vector, it's either the right vector or another vector that's "close". (Really with K=1 and high dim it's always gonna be garbage, though, because you just don't have any other vector in the code book that's even close; this is more interesting at K=2 or up).
This makes the per-block randomization (you want that to avoid systematic bias) harder, though. One approach that would work is to do a Householder reflection with a random vector (again hashed from position or similar).
All that said, I don't believe in this at all. It's "solving" a problem by "reducing" it to a more difficult unsolved problem (in this case, "I want a VQ codebook that's close to optimal for embedded coding"). Of course, even if you do a bad job here, it's still not gonna be worse than the direct "random permutation" stuff. But I doubt it's gonna be appreciably better either, and it's definitely more complex.
-Fabian

cbloom:
At 02:17 PM 12/3/2014, Fabian Giesen wrote:
That way, if you get 4 bits out of 5, it's not a 50:50 chance between right vector and some random other vector, it's either the right vector or another vector that's "close".
Yes, this is the type of scheme I imagine. Sort of like a wavelet significant bit thing. As you send fewer bits the location gets coarser.
The codebook for K=1 is pretty obvious. You're just sending a location; you want the top bits to grossly classify the AC and the bottom bits to distinguish neighbors (H neighbors for H-type AC's, and V neighbors for V-type AC's)
For K=2 and up it's more complex. You could just train them and store them (major over-train risk) up to maybe K=3 but then you have to switch to an algorithmic method.
Really the only missing piece for me is how you get the # of bits used to specify the locations. It takes too many bits to actually send it, so it has to be implicit from some other factors like the block Q and K and I'm not sure how to get that.

## 11 comments:

About PVQ indexing schemes: lots of these are possible. You can see the first one I came up with at , before I'd even heard of Fischer or found his 1986 paper. Eventually for CELT we switched to the original Fischer indexing scheme (because it was simpler to implement and being published over 20 years ago trumped any other minor advantages in bit error robustness another scheme might have). If you look at Section 4.3.4.2, you can see it's pretty simple to decode, and in Section 5.3.8.2, it's _really_ simple to encode.

You don't want to plug these kind of dimension-at-a-time schemes directly into an arithmetic coder, though, because for large dimensions the probabilities become highly skewed. Enough that it's hard to model with typical arithmetic coder precision (and since you pay computational overhead per-symbol, highly skewed distributions mean you spend a lot of cycles without learning much new about your signal). Instead, you'd want to do something like we describe for SILK in Section 4.2.7.8.3, where you split the vector and say how many pulses lie in each half. That gives you much more balanced probabilities.
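A toy version of that split coding (it only produces the split-count symbols; the actual SILK entropy coding and sign handling are omitted):

```python
def encode_splits(vec):
    """Recursively record how many pulses fall in the left half of
    each subvector; the counts fully determine the magnitudes."""
    syms = []
    def rec(v, k):
        if len(v) == 1 or k == 0:
            return                       # magnitude is implied by k
        half = len(v) // 2
        left = sum(abs(x) for x in v[:half])
        syms.append(left)                # symbol in {0..k}, balanced-ish
        rec(v[:half], left)
        rec(v[half:], k - left)
    rec(vec, sum(abs(x) for x in vec))
    return syms

# e.g. an 8-dim vector with K = 4 pulses
v = [0, 2, 0, 0, -1, 0, 1, 0]
syms = encode_splits(v)
assert syms[0] == 2                      # 2 of the 4 pulses in the left half
```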

But as Jean-Marc mentioned, we don't use these uniformly distributed schemes in Daala. They're a fun aside (highly relevant for audio, not so much for video).

Inter-band masking: we were doing this back in April (when I made the slides you point to). However, we removed it in October: . Its effect was always pretty minor, it was fairly expensive (as originally implemented, I'm sure it could've been simplified), and removing it actually improved metrics slightly. More importantly, it introduced dependencies between the entropy coder and the "signal processing" parts of PVQ, which prevented deeply pipelining these stages in hardware.

Also, apologies for the example image I used in those slides. It was literally the first image I tried.

Directional intra: I'm sure you've seen . After spending far too much time trying to get this to work at reasonable computational complexity, we eventually disabled it. We now use a much simpler scheme that can only predict pure horizontal or pure vertical edges.

Lossy AC encoding: I suspect for very low bitrates, rather than trying to fill with random noise or using splitting planes or whatever, what you really want to do is just use small, trained VQ codebooks. At low rates the codebooks shouldn't be very large, and any other scheme is going to wind up equivalent to some form of VQ, without the benefit of training. One thing we've discussed is trying to classify blocks based on the first (few) band(s), and then using the classification to select from one of several different VQ codebooks for the remaining bands. For inter, you could potentially do the classification on the predictor instead of the LF, so you could apply this scheme to the LF bands, too. For intra, where we don't have good directional predictors anymore, you could even use the trained VQ codeword as the PVQ "reference" for the diagonal bands, letting you extend this scheme to higher rates.

"Lossy AC encoding: I suspect for very low bitrates, rather than trying to fill with random noise or using splitting planes or whatever, what you really want to do is just use small, trained VQ codebooks."

Yep, Charles and I talked about this in person right after that thread, and that's basically what we ended up with. Pre-baked codebooks do mean that you want to have a small set of dimensionalities you're coding.

That one's a bit awkward; take the Daala subband partitioning (as in the SPIE abstract), which has (up to 16x16 blocks) subbands with 15 coeffs (top-left 4x4 minus DC), 8 coeffs (8x8 top/left), 32 coeffs (8x8 bottom-right and 16x16 top/left), and 128 coeffs (16x16 rest).

The 32- and 128-coeff subbands, you just wouldn't bother trying to find a short code for, I think. Coding anything meaningful in there is just gonna take a bunch of bits, and the codebooks (even with a small number of entries) would be large and very hard to train without getting a lot of bias from your training set. But for the 8-coeff and 15-coeff subbands, a specialized codebook for the low-rate cases is potentially interesting. Except with prediction and biprediction (which remove 1 coeff each), you really have 6 cases: [6,8] coeffs and [13,15] coeffs. That's a lot of codebooks...

It might seem that this is over-thinking it, but at least for Bink 2 (which is a *super*-simple codec optimized for low complexity and low memory use, not state of the art low-bitrate performance, mind), it really paid off visually to tweak the coding so that basically at any point in the code stream, you could spend another 5 bits and get a reduction in distortion, even when doing so made the coding "asymptotically worse" in the sense of being suboptimal in the medium-to-high bit rate regime.

A typical problem with RD optimization is the "one blurry block in the middle of a sharp region" case. This is perceptually really bad (you notice it immediately), and it usually happens when a block is coded intra but you have no way of coding more than DC and maybe 1 AC coeff without going over your bit budget. Giving your codec (perceptually) good options in that case is definitely worthwhile.

"Directional intra: I'm sure you've seen"

Presumably this is one of the difficulties in working with lapped transforms, and predicting in transformed-space, etc.

Directional intra is really just great for edges. I've often wondered if a more explicit edge-adaptive predictor that uses the neighbors and no explicit signalling would be better.

"Lossy AC encoding: I suspect for very low bitrates, rather than trying to fill with random noise or using splitting planes or whatever, what you really want to do is just use small, trained VQ codebooks."

Yeah. My concern about this is that if you're always restoring the missing high-frequency data in the same way, the pattern will become visible. But that remains to be seen, and perhaps a few alternating codebooks could mitigate it, etc.

This is getting a bit into the realm of fantasy land, but in a lot of cases what you'd really like to do is send some parameters for a noise generating function. You just send the total energy of high frequency noise, and you have some trained noise generators that you select from. eg. obviously for film grain. Even for things like grass and leaves, what you want to do is send a small patch once at full precision, and then make all future blocks have the same spectrum as the sample.

"One thing we've discussed is trying to classify blocks based on the first (few) band(s), and then using the classification to select from one of several different VQ codebooks for the remaining bands. For inter, you could potentially do the classification on the predictor instead of the LF, so you could apply this scheme to the LF bands, too."

Right. I've used a similar idea in wavelet experiments. When a high-band is cut off, rather than fill with zeros, fill from the parent, scaled down by some factor. Because of the subband octave correlation property. You can tell from just the first few AC's what kind of detail is in the block (usually).

You could have a VQ codebook that comes from PCA/KLT of the previous frame.

Well, as I said, classifying things by directionality in the LF and using that to select a VQ codebook for the HF is one potential way to get small codebooks that still do something useful. I.e., if you have a straight edge with a known orientation, the HF is not empty, but it does not actually have that many degrees of freedom.

That means lots of codebooks for the large bands, though, which is exactly where you don't need them. It may wind up just too expensive.

One other approach we have (by which I mean Monty has) been playing with is to train filters that will compact the energy from straight edges in the HF into the LF, and then predict the HF from just the LF coefficients. In VQ terms, you could think of this like an adaptive codebook, as is used in speech codecs (though the mechanism is very different). The problem is that you've traded memory complexity of large codebook tables for the computational complexity of the filters (which are not nice separable things). So it's not clear yet any of this will be worth the cost.

"At least their proposal was to mix overlapping-block MC and Control Grid Interpolation (CGI, essentially you specify a small mesh with texture coordinates and do per-pixel tex coord interpolation)."Actually, we removed CGI fairly early on. It was just too expensive. You basically can't vectorize the loads without a chip with N MMUs available, and not even x86 with its new gather instructions really has that... a GPU-style 2D texture cache would work, but that's also a metric boatload of transistors. That might have been something we could live with, but didn't have any clear gains. It could do something _different_, but it wasn't clear the encoder I had for it was taking advantage of that to do something _useful_, nor how to write an encoder that could. I definitely could not reproduce the gains that other researchers reported by switching methods, though that may be related to the extra constraints added to avoid blocking artifacts where the switch occurred.

One thought Jean-Marc had instead is to have a codeword that signals, "Make this whole region 4x4 blocks and fill in the MVs with interpolation instead of coding them." I.e., it's CGI, but at the block level instead of the pixel level, so you get at least 4-way parallelism in your loads.
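
A sketch of that block-level MV interpolation (hypothetical helper; the four corner MVs would be the coded ones, and everything in between is filled in for free):

```python
import numpy as np

def interpolate_mvs(mv00, mv01, mv10, mv11, n):
    """Fill an n x n grid of per-4x4-block motion vectors by bilinearly
    interpolating the four corner MVs instead of coding each one.
    This is CGI at block granularity, so loads stay 4-way parallel."""
    mvs = np.zeros((n, n, 2))
    for i in range(n):
        for j in range(n):
            u, v = i / (n - 1), j / (n - 1)
            mvs[i, j] = ((1 - u) * (1 - v) * np.asarray(mv00)
                         + (1 - u) * v * np.asarray(mv01)
                         + u * (1 - v) * np.asarray(mv10)
                         + u * v * np.asarray(mv11))
    return mvs

grid = interpolate_mvs((0, 0), (4, 0), (0, 4), (4, 4), n=5)
```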

"I've often wondered if a more explicit edge-adaptive predictor that uses the neighbors and no explicit signalling would be better."Well, at the very least you need to be able to shut something like this off, because when it's wrong it will be disastrously so.

"This is getting a bit into the realm of fantasy land, but in a lot of cases what you'd really like to do is send some parameters for a noise generating function."This has been done for still images. See Marcus Nadenau's 2000 Ph.D thesis, "Integration of Human Color Vision Models into High Quality Image Compression" for one example with wavelets. Extending this to video is hard, for the reasons Jean-Marc said above.

"Even for things like grass and leaves, what you want to do is send a small patch once at full precision, and then make all future blocks have the same spectrum as the sample.""Same spectrum" here has to be a bit vague, because of phase offsets and leakage and various other things that will keep it from being "the same". However, my canonical example of "video compression is nowhere near the limit of what it could do if we had infinite CPU available" is to apply some of the dynamic texture analysis research that's been done to generate low-dimensional models for things like water, leaves, etc., and run that analysis

in the decoder, so that the encoder can just start sending a few model parameters to get you waves or fluttering leaves or whatever. The kinds of things that are predicted very poorly with traditional motion compensation, and for which local metrics like MSE are completely irrelevant.(Using texture synthesis to fill in detail)

In the "infinite CPU" mindset, you probably wouldn't even bother sending a lot of model coefficients, but go for a predictor-corrector kind of structure on both the encoder and decoder side: the decoder "learns" structure from frames (slices, tiles...) it's seen before. After one good keyframe with useful texture the decoder could learn it and replicate that structure later. Given (say) a motion vector with high weight, the decoder would know that the target area should have similar texture properties to the region it was copied from. Hmm.

You could still use them as a parametric model for synthesis, but purely "appearance-preserving mocomp" would probably help reduce the generation artifacts (and resulting wasted bit rate) for edge-heavy structure. Alas, not cheap at all.

Another thing that would need a lot more CPU than we're willing to throw at it right now: dictionary-based schemes (linear algebra dictionaries, i.e. over-complete "bases"). Orthogonal matching pursuit, K-SVD etc. That's been done for still images (either sending the dictionary atoms in the bit stream, or having a pre-baked dict for special-purpose applications like faces) but with video you could conceivably start with a generic basis (DCT-derived or similar) and then keep adapting it to match the data you actually have.
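
For concreteness, plain matching pursuit over a dictionary of unit-norm atoms looks like this (a sketch; OMP additionally re-solves a least-squares fit over all selected atoms at each step):

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_iter):
    """Greedy matching pursuit: repeatedly pick the dictionary atom
    (unit-norm column) best correlated with the residual and subtract
    its projection. Returns sparse coefficients and the residual."""
    residual = signal.astype(float).copy()
    coeffs = np.zeros(dictionary.shape[1])
    for _ in range(n_iter):
        corr = dictionary.T @ residual
        k = np.argmax(np.abs(corr))
        coeffs[k] += corr[k]
        residual -= corr[k] * dictionary[:, k]
    return coeffs, residual

D = np.eye(4)                        # trivial orthonormal dictionary
x = np.array([3.0, 0.0, -1.0, 0.0])
c, r = matching_pursuit(x, D, n_iter=2)
```

With an adaptive scheme, `D` would start as something DCT-like and then be updated (e.g. by K-SVD) to match the video actually being coded.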

There's no shortage of things to try, all we need is a couple orders of magnitude increase in cycles per pixel. :)

"In the "infinite CPU" mindset, you probably wouldn't even bother sending a lot of model coefficients, but go for a predictor-corrector kind of structure on both the encoder and decoder side: the decoder "learns" structure from frames (slices, tiles...) it's seen before."

Yeah, but the problem is you probably never sent a great sample of the texture in a previous frame. (if you did, it would be weird because it would be too high quality for the bit rate).

So you'd have to send a patch of 32x32 pixels as side data to be used as the sample for generation for a certain type of texture. Then you do standard texture-from-example kind of stuff.

"Another thing that would need a lot more CPU than we're willing to throw at it right now: dictionary-based schemes "

I think there are fundamental theoretical problems with these schemes that are unsolved. It's not CPU holding them back; there are definite steps in this direction that could be done right now. For example the 4x4 spatial transforms can be done as a matrix multiply, so custom matrices could be sent per frame.
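
A sketch of the transform-as-matrix-multiply point: here the matrix is the orthonormal DCT, but any orthogonal 4x4 matrix signaled per frame would slot in the same way (names are illustrative):

```python
import numpy as np

def dct_matrix(n=4):
    """Orthonormal DCT-II basis matrix; a custom per-frame orthogonal
    matrix could be substituted here without changing the pipeline."""
    T = np.zeros((n, n))
    for k in range(n):
        for i in range(n):
            T[k, i] = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    T[0] *= np.sqrt(1.0 / n)
    T[1:] *= np.sqrt(2.0 / n)
    return T

def transform_4x4(block, T):
    """Separable 2-D transform as two matrix multiplies: Y = T X T^T."""
    return T @ block @ T.T

T = dct_matrix()
X = np.arange(16, dtype=float).reshape(4, 4)
Y = transform_4x4(X, T)
```

Orthogonality is what lets the decoder invert with the transpose, and it's also what the perceptual machinery (subbands, CSFs, quantizers) implicitly assumes about the basis.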

The problem is once you start making the bases adaptive, then you can no longer hand-tweak your codec for them. Where are the subbands, what should the CSF's be, or the quantizers, how do you bit-rate allocate for them. All the perceptual kludges fall apart.

I believe you can only go to adaptive bases when you have a really good software perceptual metric to evaluate them, so that we aren't relying on offline evaluation for encoder tweaks. And we're still quite far off there.

(and running the theoretical super-perceptual-metric in-loop to make coding decision is insanely CPU prohibitive)

"Yeah, but the problem is you probably never sent a great sample of the texture in a previous frame. (if you did, it would be weird because it would be too high quality for the bit rate)."

Not really true. Things like x264 MBTree already spend time to identify areas that are going to be frequently used as motion references, and boost their bit budget accordingly. That's not experimental, that's been shipping (and on by default for two-pass encoding) for years.

You're not gonna have a pristine reference, but you can definitely assume that a) often-referenced areas will start out higher quality than the rest (and thus definitely useful candidate references to take "probes" from), b) that working somewhat harder to "leak" less of that quality over time would save bits overall.

Well I don't agree. (this is getting way off topic though)

The whole point of the texture synthesis (or the lossy AC stuff) is to send detail at levels *beyond* what you can otherwise send.

By the very definition of the problem, you don't have that detail in previous frames.

You can try to force that detail into previous frames, which is really rather more like VP8's "golden frames" than MB-tree, because it's a big difference, not a small tweak, and because you need source regions larger than one block.

But doing that is a very nasty non-local optimization problem. Should I put way more bits on these blocks than usual so that they can be used as texture source for future frames? You have to try putting more bits on those blocks and see how it affects N frames in the future. Totally unpossible. Also really hard to not make it a weird pop in quality.

MB-tree is a very small bit rate adjustment. Not really the same thing.

Of course there's a general problem with this idea, which is the binary nature of "this block gets texture synth and this one doesn't" that would also create weird visual quality variation. All of a sudden you get tweeds that look really amazing because it gets good texture synth, but everything else is lower quality. I guess this is an issue with any category based S/D/E type coder, you have to be very careful to not be creating perceptual quality level differences in the different block types.

Hi,

Just two comments:

- To get a more uniform lattice on the l2 sphere, it is worth adding some deformation; e.g., a coordinate-wise power x^p allows reducing MSE by up to 26 percent ( https://arxiv.org/pdf/1705.05285 ).
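
A sketch of the coordinate-wise power deformation (illustrative only; the referenced paper tunes p, and the full scheme differs in detail). Assumes a nonzero input vector:

```python
import numpy as np

def deform(x, p):
    """Sign-preserving coordinate-wise power |x_i|^p, then re-projection
    to the unit l2 sphere. p = 1 leaves a unit vector unchanged."""
    y = np.sign(x) * np.abs(x) ** p
    return y / np.linalg.norm(y)

x = np.array([0.6, -0.8])            # already on the unit sphere
warped = deform(x, 0.75)             # spreads mass toward small coords
```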

- Enumerative coding often requires large-integer arithmetic, and it also loses ~half a bit in the final flush. For speed, and to avoid this inefficiency, it can be folded into the entropy coding. For this purpose we need the combinatorial formulas to find the probabilities of the first digit, for all possible (L,K).
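
The combinatorial counts in question are the standard PVQ codebook sizes; a sketch of deriving first-digit probabilities from them (illustrative names):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def V(L, K):
    """Number of integer vectors of length L whose absolute values sum
    to K, i.e. the PVQ codebook size for (L, K)."""
    if K == 0:
        return 1
    if L == 0:
        return 0
    return V(L - 1, K) + V(L, K - 1) + V(L - 1, K - 1)

def p_first(L, K, v):
    """Probability, under a uniform codebook, that the first coordinate
    equals v: the remaining L-1 coordinates must absorb K-|v| pulses.
    These are the per-digit probabilities a range coder would use
    instead of doing full enumerative coding."""
    return V(L - 1, K - abs(v)) / V(L, K)

assert V(2, 1) == 4     # the codewords (±1, 0) and (0, ±1)
```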

Best,

Jarek
