First consider Intra, that is the case of no prediction.
We have a block of some variable size (4x4 to 16x16) that has been transformed. The DC is sent by some other mechanism, so we are left with some AC's.
The AC's are grouped into subbands. These are just collections of likely-similar AC coefficients; ideally they should be statistically similar. The current subbands in Daala are:
The AC coefficients in each subband are considered as a vector. We first send the "gain" of the vector, which is its L2 norm (the "length" of the vector). (*1)
The gain can be sent with a non-uniform scalar quantizer. This allows you to build variance-adaptive quantization (VAQ) into the codec without side data. (H264 achieves this by sending per-block quantizer changes based on activity, which costs extra bits in signalling.) Essentially you want to use larger quantization bins for larger gains, because specifying the exact amount of a large energy is not as important. Daala seems to take the gain to the 2/3 power and uniformly quantize that. (*2) (*3)
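As a concrete illustration, here's a minimal sketch of such a companded gain quantizer. The step size `Q_G` and treating the 2/3 power as the exact companding exponent are my assumptions for illustration, not Daala's actual parameters:

```python
# Hedged sketch of a companded (non-uniform) scalar gain quantizer.
Q_G = 1.0           # assumed step size in the companded domain
BETA = 2.0 / 3.0    # the companding exponent the text attributes to Daala

def quantize_gain(gain):
    """Map gain through the power companding, then uniform-quantize."""
    return int(round(gain ** BETA / Q_G))

def dequantize_gain(index):
    """Invert: scale by the step, then undo the companding."""
    return (index * Q_G) ** (1.0 / BETA)
```

The companding is visible in the reconstruction levels: the spacing between adjacent levels grows with gain, i.e. large energies get coarser absolute resolution, exactly the "larger bins for larger gains" behavior described above.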
Once we've sent the gain, we need to send how that energy is distributed on the coefficients. I won't go into the details of how it's done. Of course you have already sent the total energy so you don't want to just send the coefficient values, that would be highly redundant. You may think of it as dividing the vector by the gain, we are left with a unit vector. The Fischer "Pyramid Vector Quantizer" is one good starting point.
Note that Fischer PVQ is not quite applicable here, because it is only optimal if the coefficients in the vector are independent and identically distributed, which is not the case in images. Because of that, just sending the index into the PVQ codebook without entropy coding is probably wrong, and a way of coding a PVQ codebook selection that uses the properties of the AC coefficients is preferable. (eg. code one by one in a z-scan order so you know the likelihood is geometrically decreasing, etc. Daala has some proposals along these lines).
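To make the codebook concrete: a Fischer pyramid codepoint is any integer vector whose absolute values sum to K. A hedged sketch of quantizing a vector's direction onto that codebook, using a simple greedy pulse allocation (not Daala's actual search):

```python
def pvq_quantize(x, K):
    """Quantize the direction of x onto the Fischer pyramid codebook:
    integer vectors y with sum(|y_i|) == K. Greedy sketch, assumed."""
    l1 = sum(abs(v) for v in x)
    if l1 == 0:
        y = [0] * len(x)
        y[0] = K  # arbitrary direction for the zero vector
        return y
    # Initial guess: scale onto the pyramid, truncate toward zero,
    # so sum(|y_i|) <= K is guaranteed.
    y = [int(K * v / l1) for v in x]
    # Greedily place the missing pulses where the projection has the
    # largest remaining deficit.
    for _ in range(K - sum(abs(v) for v in y)):
        best = max(range(len(x)), key=lambda i: abs(x[i]) * K / l1 - abs(y[i]))
        y[best] += 1 if x[best] >= 0 else -1
    return y
```

The decoder reconstructs the unit vector by normalizing y, then scales by the separately-sent gain.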
A key issue is determining the PVQ codebook resolution. With PVQ this is K, the L1 norm of the unnormalized (integer-coefficient) codebook vectors. Daala computes K rather than sending it: K is chosen such that the error due to PVQ quantization equals the error due to gain quantization. (*4) This makes K a function of the overall quantization level, and also of the gain of that block - more gain = more K = more resolution for where that gain goes.
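The balancing act can be sketched crudely: the gain quantizer resolves the radial direction to roughly its step size, while K pulses resolve the tangential directions to roughly gain/K, so equating the two gives K proportional to gain divided by the gain quantizer step. The function below, and its constant `c`, are my guess at the shape of such a rule, not Daala's actual formula:

```python
def compute_k(gain, q_gain, c=2.0):
    """Pick the PVQ resolution K so tangential (direction) error is
    comparable to radial (gain) error. c is an assumed constant."""
    # Tangential spacing ~ gain / K, radial spacing ~ q_gain.
    # Equating them (up to c) and solving for K:
    return max(1, int(round(c * gain / q_gain)))
```

This reproduces the behavior described above: more gain, or a finer overall quantizer, yields more K.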
A non-uniform quantization matrix is a bit murky in this scheme (ala the JPEG CSF factors) because it changes the scaling of each axis in the vector, as well as the average magnitude, which violates the assumptions of Pyramid VQ. Applying a different quantizer per subband is an easy compromise, but pretty coarse. (*5)
And that's it for intra.
Advantages vs. traditional residual coding:
1. You send the gain up front, which is a good summary of the character of the block. This allows for good modeling of that gain (eg. from neighbors, previous gains). It also allows that gain to be used as a context or coding model selector for the rest of the block.
2. Because you send the gain explicitly, it is better preserved than with traditional coding (especially with Trellis Quant which tends to kill gain). (Deadzone quantizers also tend to kill gain; here a deadzone quantizer might be appropriate for the total gain, which is better than applying one to each coefficient independently.) Preserving gain is good perceptually. It also allows for separate loss factors for gain and distribution of energy, which may or may not be useful.
3. Because you send the gain explicitly, you can use a non-linear quantizer on it.
4. You have a simple way of sending large subbands = zero without coding an extra redundant flag.
5. No patents! (unbelievable that this is even an issue, all you software patenters out there plz DIAGF)
*1 = In my studies I found preserving L1 norm of AC activity to be more visually important than preserving L2 norm. That doesn't mean it's better to code L1 norm though.
*2 = One advantage of this scheme is having VAQ built in. A disadvantage is that it's hard-coded into the codec, so the encoder isn't free to do adaptive quantization based on better human visual studies. Of course the encoder can send corrections to the built-in quantizer, but you'd better be careful about tweaking the VAQ that's in the standard!
*3 = I always found variable quantization of log(gain) to be appealing.
*4 = My perceptual studies indicate that there should probably be more error due to the distribution quantization (K) than from gain quantization.
*5 = Of course, applying a CSF-like matrix to *residuals* as is done in most video coders is also very murky.
Okay, but now we have prediction, from motion compensation or intra prediction or whatever. You have a current block to encode and a predicted block that you think is likely similar.
The problem is we can't just subtract off the predicted block and make a residual the way normal video codecs do. If you did that, you would no longer have a "gain" which was the correct energy level of the block - it would be an energy level of the *residual*, which is not a useful thing either for perceptual quality control or for non-uniform quantization to mimic VAQ.
Sending the gain is easy. You have the gain for each predicted subband, so you can just predict the gain for the current subband to be similar to that (send the delta, context code, etc.). You want to send the delta in quantized space (after the power mapping). The previous block might have more information than you can preserve at the current quantization level (it may have a finely specified gain which your quantizer cannot represent). With a normal residual method, we could just send zero residuals and keep whatever detail was in the previous frame. Daala makes it possible to carry this forward by centering a quantization bucket exactly on the previous gain.
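A sketch of the gain predictor with the "bucket centered on the previous gain" trick: quantize in the companded domain, but offset the lattice by the previous reconstructed gain's companded value, so a zero index reproduces it exactly. `Q_G` and the 2/3 exponent are assumed parameters as before:

```python
Q_G = 1.0           # assumed step size in the companded domain
BETA = 2.0 / 3.0    # assumed companding exponent

def comp(g):
    """Companding into the domain where quantization is uniform."""
    return g ** BETA

def encode_gain_delta(gain, prev_gain):
    """Quantize gain on a lattice offset so that index 0 reconstructs
    exactly to prev_gain (centering a bucket on the previous gain)."""
    return int(round((comp(gain) - comp(prev_gain)) / Q_G))

def decode_gain_delta(index, prev_gain):
    return (comp(prev_gain) + index * Q_G) ** (1.0 / BETA)
```

Note that sending a zero delta carries the previous gain forward losslessly, even when that gain isn't representable on the absolute lattice - which is the point: it mimics the "all-zero residual" refresh that normal residual coders get for free.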
Now we need the distribution of coefficients. The issue is you can't just send the delta using Pyramid VQ: the gain we already sent is the length of the current coefficient N-vector, not the length of the delta vector.
Geometrically, we are on an N-sphere (since we know the length of the current vector) and we have a point on that sphere where the predicted block was. So we need to send our current location on the sphere relative to that known previous position. Rather than mess around with spherical coordinate systems, we can take the two vectors, and then essentially just send the parallel part (the dot product, or angle between them), and then the perpendicular part.
JM Valin's solution for Daala using the Householder reflection is just a way of getting the "perpendicular part" in a convenient coordinate system, where you can isolate the N-2 degrees of freedom. You know the length of the perpendicular part (it's gain*sin(theta)), so after normalizing by that length it's a unit vector, and we can use Pyramid VQ to send it.
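A sketch of the decomposition, including the Householder step: build the reflection that maps the prediction onto a coordinate axis, apply it to the current vector, and read off theta plus the perpendicular part in explicit coordinates. This follows the standard Householder construction; the exact conventions (axis choice, signs) are my assumptions, not necessarily Daala's:

```python
import math

def householder_decompose(current, predicted):
    """Return (gain, theta, perp) where perp is the perpendicular part
    of 'current' expressed in N-1 explicit coordinates. Assumes both
    vectors are nonzero. Sketch only."""
    g = math.sqrt(sum(c * c for c in current))
    gp = math.sqrt(sum(p * p for p in predicted))
    # Reflection vector v = predicted + sign * gp * e_m, with m the
    # largest component of the prediction (for numerical stability).
    m = max(range(len(predicted)), key=lambda i: abs(predicted[i]))
    s = 1.0 if predicted[m] >= 0 else -1.0
    v = list(predicted)
    v[m] += s * gp
    # Apply H(x) = x - 2 v (v.x) / (v.v) to the current vector.
    scale = 2.0 * sum(a * b for a, b in zip(current, v)) / sum(a * a for a in v)
    r = [a - scale * b for a, b in zip(current, v)]
    # H maps 'predicted' to -s*gp on axis m, so the parallel part of
    # 'current' is isolated in r[m]; clamp for float safety.
    theta = math.acos(max(-1.0, min(1.0, -s * r[m] / g)))
    # The rest of r is the perpendicular part; its length is g*sin(theta).
    perp = [x for i, x in enumerate(r) if i != m]
    return g, theta, perp
```

The `perp` list is what gets normalized to a unit vector and sent with Pyramid VQ.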
So, to transmit our coefficients we send the gain (as delta from previous gain), we send the extent to which the vectors are parallel (eg. by sending theta (*6)), we then know the length of the perpendicular part and just need to send the remaining N-2 directional DOF using Pyramid VQ.
As in the intra case, the K (quantizer resolution) of the Pyramid VQ is computed from other factors. Here obviously rather than being proportional to the total gain, it should be proportional to the length of the perpendicular part, eg. gain*sin(theta). In particular if you send a theta near zero, K goes toward zero.
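Following that proportionality, the intra-case K rule changes only by replacing the gain with the perpendicular length gain*sin(theta). As before, the constant and exact form are assumptions, not Daala's formula:

```python
import math

def compute_k_predicted(gain, theta, q_gain, c=2.0):
    """PVQ resolution for the predicted case: driven by the length of
    the perpendicular part, gain*sin(theta). c is an assumed constant."""
    # theta near zero -> the prediction already nails the direction,
    # so few (or zero) pulses are needed.
    return max(0, int(round(c * gain * math.sin(theta) / q_gain)))
```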
One funny thing caused by the Householder reflection (coordinate system change to get the perpendicular part) is that you've scrambled up the axes in a way you can't really work with. So custom trained knowledge of the axes, like expected magnitudes and Z-scan order and things like that are gone. eg. with a normal coefficient delta, you know that it's very likely the majority of the delta is in the first few AC's, but after the rotation that's lost. (you can still build that in at the subband level, just not within the subbands).
Another funny thing is the degeneracy of polar conversion around the origin. When two vectors are very close (by Euclidean distance) they have a small angle between them, *if* they are long enough. Near the origin, the polar conversion has a pole (ha ha punny). This occurs for subbands near zero, eg. nearly flat, low energy blocks. Since the gain has previously been sent, it's possible that could be used to change to a different coder for gains near zero (eg. just use the Intra coder). (in Daala you would just send a "noref" flag). To be clear, the issue is that the vectors might be very close in Euclidean distance, and thus seem like good matches based on SAD motion search, but could easily be vectors pointing in completely opposite directions, hence be very bad to code using this theta scheme.
And I think that's about it.
The big goal of this funny business is to be able to send the gain (length) of the current subband vector, rather than sending the length of the delta. This gives you the advantages as discussed previously in the simpler Intra case.
*6 = send theta, or sin(theta) ? They have slightly different quantization bucket scaling. Most of this assumes that we have a reasonably good prediction so theta is small and theta ~= sin(theta).
Geometrically, with whiteboard drawing:
Traditional video coders just form the Euclidean "Delta" vector and send its components.
Daala Intra (no prediction) takes the "Current" vector, sends its length, and then its unit vector (direction) using Pyramid VQ.
Daala Predictive VQ sends the length of "Current", the angle (theta) from Predicted to Current, and the unit vector (direction) of the "Perpendicular" vector.
(of course Daala doesn't send the "Perpendicular" vector in the coordinate space drawn above; it's reflected into the space where "Predicted" is aligned with an axis, so that "Perpendicular" is known to have a zero in that direction and is simply a vector in N-1 dimensions (and you've already sent the length, so it has N-2 DOF))