I'm going to talk about video coding and code stream design. It's important to remember the end goal - we want to make a coded bit
stream which is <= N bytes and plays back with the highest quality possible. In all cases I assume the decoder is a fixed known
quantity.
First of all it's useful to consider what you would do if you had infinite CPU time. The answer is you should try all coded bit
streams. That is, to make the optimal N byte stream, you should make 256 ^ N streams, run each through the decoder, measure the error
somehow, and choose the best. That sounds ridiculous, but it is our goal; everything else we talk about will be approximations that
are trying to get as close as possible to that.
Let's try to get into the realm of reality. You have some encoder that you think produces good code streams for your given decoder.
In various places in your encoder you have decision points - what quantizer to use, what macroblock mode to use, whether to send this
value as its true self or as any other value, etc. If you had lots of CPU power you should do a brute force search on all decisions -
try every decision all possible ways; if there are M decisions and each has K choices, that's K^M encodings. Reject any code stream that comes out bigger than N bytes. Measure the error of the results and choose
the best.
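For concreteness, here is a minimal C++ sketch of that decision brute force. "encode_with" is a hypothetical hook standing in for your real encoder run with a fixed vector of decisions; obviously you would never actually run this, it just pins down what we are approximating.

    #include <cstddef>
    #include <functional>
    #include <limits>
    #include <vector>

    // One full encoding attempt : the coded size plus its measured distortion.
    struct Encoding { size_t bytes; double distortion; };

    // Try every one of the K^M decision vectors, reject streams bigger than
    // max_bytes, keep the lowest-distortion survivor. Returns an empty vector
    // if nothing fit.
    std::vector<int> brute_force_decisions(size_t M, int K, size_t max_bytes,
        const std::function<Encoding(const std::vector<int>&)>& encode_with)
    {
        std::vector<int> decisions(M, 0), best;
        double bestD = std::numeric_limits<double>::infinity();
        for (;;)
        {
            Encoding e = encode_with(decisions);
            if (e.bytes <= max_bytes && e.distortion < bestD)
            {
                bestD = e.distortion;
                best = decisions;
            }
            // advance the decision vector like an odometer in base K
            size_t i = 0;
            while (i < M && ++decisions[i] == K) { decisions[i] = 0; ++i; }
            if (i == M) break;   // wrapped : all K^M combinations have been tried
        }
        return best;
    }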
Okay, that's still ridiculously impractical. But let's consider a simpler case. Say you only have two coding decisions, and most importantly -
they are independent, that is one decision does not affect the other. For example say you are coding two separate images, and you are only
allowed to choose the quantizer on each image, and your goal is to minimize the total error. If there are K quantizer choices, a full search
would mean trying K^2 ways. But in fact that's not necessary, because of independence. All you have to do is code the first image K ways & remember the
rate (R) and distortion (D) of each way, then the second image K ways and remember R/D, so you have 2K codings. Now to find the optimal combination
you could do K^2 table lookups, but even here you can usually do much less, because the R/D tables are monotonic (or very nearly monotonic) which
means rather than doing K^2 you only need to pick an entry on each side and then slide each one up and down to search for improved D given the R
constraint, which is O(K). This obviously extends to M independent coding choices; rather than K^M we can do K*M codings.
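Here's a sketch of that O(K) combine for the two-image case, assuming you've already built the two R/D tables (one entry per quantizer, finest first, rate falling and distortion rising with the index per the monotonicity assumption); "best_pair" and the "RD" struct are just names for this sketch.

    #include <cstddef>
    #include <limits>
    #include <utility>
    #include <vector>

    // R/D of one image at one quantizer setting.
    struct RD { double rate; double dist; };

    // Given the two per-image R/D tables (K codings each), find the pair of
    // quantizer indices minimizing total distortion under a total rate budget.
    // The two-pointer scan is the O(K) "slide up and down" search.
    std::pair<int,int> best_pair(const std::vector<RD>& a, const std::vector<RD>& b,
                                 double rate_budget)
    {
        const int K = (int)a.size();
        double bestD = std::numeric_limits<double>::infinity();
        std::pair<int,int> best(-1, -1);
        int qb = (int)b.size() - 1;           // start image b at its coarsest (cheapest) choice
        for (int qa = 0; qa < K; ++qa)        // image a from finest to coarsest
        {
            double budget_b = rate_budget - a[qa].rate;
            // as qa coarsens, budget_b only grows, so qb only ever moves finer
            while (qb > 0 && b[qb - 1].rate <= budget_b) --qb;
            if (b[qb].rate > budget_b) continue;        // nothing fits alongside this qa
            double D = a[qa].dist + b[qb].dist;
            if (D < bestD) { bestD = D; best = std::make_pair(qa, qb); }
        }
        return best;   // (-1,-1) if no combination fits the budget
    }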
The simplest way to do this in practice is with a lagrange multiplier framework. Rather than look for optimal D given R, we look for an optimal
lagrange cost : J = D + lambda * R.
If J is minimized by our coding choices, it means dJ/dCode = 0. That means 0 = dD/dC + lambda * dR/dC , or dD/dR = -lambda.
That is, lambda has selected a slope of the D(R) curve. If D(R) and dD/dR are both monotonic, then this has uniquely selected a rate & encoding.
We shall henceforth assume that this monotonicity requirement is true. In practice it is not quite true but it is "mostly" true which we can deal
with in hacky ways (* perhaps more on this niggle some day). With this assumption, if your goal is to hit a certain rate, you can monotonically
search lambda to find the encoding that minimizes J and lands at that rate.
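In code the decision rule is almost nothing; here's a minimal sketch, where a "Candidate" is just the locally measured rate and distortion of one way of coding the current decision (the names here are mine, not from any particular codec):

    #include <cstddef>
    #include <vector>

    // One candidate way of coding the current decision (a quantizer, a mode, ...),
    // with its locally measured rate and distortion.
    struct Candidate { double rate; double dist; };

    // lagrange cost : J = D + lambda * R. At the optimum dD/dR = -lambda, so
    // lambda is the exchange rate between distortion and bits.
    inline double lagrange_cost(const Candidate& c, double lambda)
    {
        return c.dist + lambda * c.rate;
    }

    // Pick the candidate with the smallest J. This is the whole decision rule;
    // everything else is about generating candidates and measuring R and D.
    size_t pick_by_lagrange(const std::vector<Candidate>& candidates, double lambda)
    {
        size_t best = 0;
        for (size_t i = 1; i < candidates.size(); ++i)
            if (lagrange_cost(candidates[i], lambda) < lagrange_cost(candidates[best], lambda))
                best = i;
        return best;
    }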
Now the key point is that minimizing J (for a given lambda) gives you the optimal encoding at the rate it produces, for independent coding choices. Consider our example of
the two images above. Instead of searching jointly, I make a quantizer choice from the K choices on each image independently. On each one I measure J and choose
the option that minimizes J. (note that since R and D are monotonic I can actually just do a binary search here that's O( log(K) ) not O(K) ).
Because the choices are independent, I have found the minimum J for the full stream. This is the optimal encoding. Why? Because it has made
the slope dD/dR equal at each of the coding choices. That is, I can't move bits from one image to the other and get a better distortion at
the same rate. If the slopes were not equal, then I would get a win from moving the bits away from where they affect D less to where they
affect D more.
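The O( log(K) ) search mentioned above might look something like this. "code_at" is a hypothetical hook that codes one image at quantizer index q and reports its R and D, and the search assumes J(q) is unimodal in q, which is what the monotone / convex D(R) assumption buys you.

    #include <functional>

    struct RD { double rate; double dist; };

    // Binary search for the quantizer index minimizing J = D + lambda * R.
    // In practice you'd memoize code_at, since each step evaluates two indices.
    int best_quantizer(int K, double lambda, const std::function<RD(int)>& code_at)
    {
        auto J = [&](int q) { RD x = code_at(q); return x.dist + lambda * x.rate; };
        int lo = 0, hi = K - 1;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (J(mid + 1) < J(mid)) lo = mid + 1;   // the minimum lies to the right
            else                     hi = mid;       // the minimum is at mid or to the left
        }
        return lo;   // O( log(K) ) codings instead of K
    }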
This is the same as before, but the win in terms of code flow and structure should be obvious. Before I had to do a bunch of encodings and
then go back and consider different bit allocations to find the best. Now I can do one linear pass through my codings and make the optimal J
decision at each step, and then after a decision is made I can forget about it and move on to the next and never revisit it. The disadvantage
is that my encoding is now parameterized by the unintuitive lambda parameter rather than just R, the rate; if I want to hit a specific rate I
have to search lambda in an outer loop (* perhaps more on this later).
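That outer loop is just a bisection on lambda. A sketch, assuming a hypothetical "encode_at" hook that runs the whole one-pass encoder at a given lambda and returns the total bytes produced, and assuming the achieved size is (nearly) monotone in lambda and the bracket below actually contains a workable value:

    #include <cmath>
    #include <functional>

    // Bisect lambda geometrically (useful lambdas span orders of magnitude)
    // until the one-pass encoder lands at or under the target size.
    double find_lambda_for_rate(double target_bytes,
                                const std::function<double(double)>& encode_at,
                                int iterations = 20)
    {
        double lo = 1e-6, hi = 1e+6;   // assumed bracket : lo codes too big, hi too small
        for (int i = 0; i < iterations; ++i)
        {
            double mid = std::sqrt(lo * hi);
            double bytes = encode_at(mid);
            if (bytes > target_bytes) lo = mid;   // stream too big : penalize rate harder
            else                      hi = mid;   // fits : try spending more bits
        }
        return hi;   // smallest lambda seen whose encoding fit (or the initial hi if none did)
    }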
Now we're going to make a big leap of faith and see what happens if we try to use this sort of one-pass lagrange optimization on more complicated
real world scenarios. What goes wrong? Two big issues: one is that D might not be independent; the other is that the coding itself is not independent.
To be clear, our proposal is to do one-pass lagrange coding decision making. Encode the data in a streaming one-pass scan. At each decision point, you
try all possible ways for the current subset of the data only. You measure J on the current subset and take the decision that maximizes J on
that subset. You then do the encoding using that decision and output the encoded bits, and move on to the next subset. R (the rate) can of course
be measured on each subset independently (if you're using arithmetic coding, you can count the fractional bits left in the state as well). D (the distortion)
must be some measure that can be done locally.
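The skeleton of that one-pass coder, with the actual trial coding and bit emission left as hypothetical hooks ("trial_code" and "commit"), looks like this:

    #include <cstddef>
    #include <functional>
    #include <limits>
    #include <vector>

    // One trial coding of the current subset : locally measured rate and
    // distortion for one option (one mode, one quantizer, ...).
    struct Trial { double rate; double dist; int option; };

    // One-pass greedy lagrange coding. trial_code(s) codes every option for
    // subset s and measures R and D locally; commit(s, option) emits the bits
    // for the winning option and updates coder state. Decisions are never revisited.
    void encode_one_pass(size_t num_subsets, double lambda,
        const std::function<std::vector<Trial>(size_t)>& trial_code,
        const std::function<void(size_t, int)>& commit)
    {
        for (size_t s = 0; s < num_subsets; ++s)
        {
            std::vector<Trial> trials = trial_code(s);
            double bestJ = std::numeric_limits<double>::infinity();
            int bestOpt = -1;
            for (const Trial& t : trials)
            {
                double J = t.dist + lambda * t.rate;   // local lagrange cost
                if (J < bestJ) { bestJ = J; bestOpt = t.option; }
            }
            commit(s, bestOpt);
        }
    }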
What about D (the distortion measure) ? Well, if D is SSD (sum of squared differences) (aka L2 error or MSE), then it is in fact independent from
pixel to pixel or block to block, which means it is fully correct to measure D only locally on each decision. But you might object that this
is not a good way to measure distortion. I'm going to ignore this for now and just posit that we choose D to be SSD and we're going to optimize
for that. (* more on this later).
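For reference, SSD over a block is just this; because squared errors simply add, summing per-block SSDs gives exactly the whole-image SSD, so measuring D locally is not an approximation for this particular D:

    #include <cstddef>
    #include <cstdint>

    // SSD of one block against the original (stride is the row pitch of both buffers).
    double block_ssd(const uint8_t* orig, const uint8_t* recon,
                     size_t width, size_t height, size_t stride)
    {
        double ssd = 0.0;
        for (size_t y = 0; y < height; ++y)
            for (size_t x = 0; x < width; ++x)
            {
                double d = (double)orig[y * stride + x] - (double)recon[y * stride + x];
                ssd += d * d;
            }
        return ssd;
    }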
The big issue is that in the real world our coding decisions are not independent. A huge effect in video coding is of course motion compensation -
blocks can be used as reference for later blocks. That means a coding decision now can have effects far in the future. Even if we ignore that and
just talk about image coding, blocks affect the coding of other blocks through statistical coding - be it huffman, arithmetic, or context coding.
Typically this has an effect of the form : if I choose mode A for the current block, that has the side effect of making mode A cheaper for future
blocks, and all other modes slightly more expensive. I contend that the statistical effects on the future are not a huge issue. (you may recall
when we talked about LZ77 optimal parsing, I made the same hand-wavey argument to dismiss this issue). There is one big issue with the
statistical effects - early in coding, when statistics are sparse, decision 1 can have a huge effect on the statistics, which heavily biases decision 2
and leads you far away from the true optimal choice. To stop this, the statistics used for decision making should be preconditioned with some
"normal" data to reduce the effect of strange early decisions. Note that the statistics used for actual coding need not be the same as the statistics
used for decision making.
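A toy sketch of what I mean by preconditioning : keep a separate adaptive model for decision making, seeded with pseudo-counts as if it had already seen some "normal" data, while the model used for actual coding starts from whatever initial state the format defines. (The uniform pseudo-count here is just a stand-in; real preconditioning would use counts gathered from typical content.)

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // A toy adaptive frequency model.
    struct FreqModel
    {
        std::vector<double> counts;
        double total;

        FreqModel(size_t num_symbols, double precondition)
            : counts(num_symbols, precondition), total(precondition * (double)num_symbols) {}

        // estimated cost in bits of coding symbol s under the current statistics
        double cost_bits(size_t s) const { return -std::log2(counts[s] / total); }

        void update(size_t s) { counts[s] += 1.0; total += 1.0; }
    };

    // e.g. :
    //   FreqModel decision_model(num_modes, 8.0);   // seeded as if "normal" data was seen
    //   FreqModel coding_model(num_modes, 1.0);     // the format's real initial state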
Sometimes a purely greedy coding decision can be very bad, though. Consider the case of something like trellis quantization of DCT coefficients.
The true output of the DCT is some set of coefficients, but you need not transmit those exact values - you can transmit anything, and it will just have some rate and
distortion. For example, say the output is 0,2,3,0,X,0,0,0,0 . How should you transmit X ? If you're using something like run-length coding
or end-of-block signalling, there is a big advantage to sending X as 0. You can't know that unless you can see the future past X; if you
only saw 0,2,3,0,X, and didn't know what followed, you wouldn't see it. This is a well known issue with quantization of trailing DCT coefficients,
but it's also an issue with things like macroblock mode decisions in video coding. The reason is that "mode equal to previous mode" is coded so
much more cheaply than any other mode that when you make a decision you very strongly affect not only your current rate, but also the rate of the following block.
One remedy to this is "semi-non-greedy lookahead" type coding as is well known from LZ77 coders; that is, instead of just doing one coding decision,
you try all ways for the next 2 or 3 or 4, choose the one that optimizes all of them together, and then discard all but the current one and step
ahead one step.
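A sketch of that lookahead, with "window_cost" as a hypothetical hook that trial-codes the next few decisions with a given set of options and returns their combined J; we search all K^L combinations in the window but commit only the first decision:

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <limits>
    #include <vector>

    // Semi-non-greedy lookahead : at position pos, try all option combinations for
    // the next L decisions (fewer near the end of the stream), pick the combination
    // with the lowest combined J, commit only the first decision, then step forward one.
    int lookahead_decide(size_t pos, size_t num_decisions, int K, size_t L,
        const std::function<double(size_t, const std::vector<int>&)>& window_cost)
    {
        size_t len = std::min(L, num_decisions - pos);   // assumes pos < num_decisions
        std::vector<int> opts(len, 0);
        double bestJ = std::numeric_limits<double>::infinity();
        int bestFirst = 0;
        for (;;)
        {
            double J = window_cost(pos, opts);   // combined J of decisions [pos, pos+len)
            if (J < bestJ) { bestJ = J; bestFirst = opts[0]; }
            // odometer over the K^len combinations in the window
            size_t i = 0;
            while (i < len && ++opts[i] == K) { opts[i] = 0; ++i; }
            if (i == len) break;
        }
        return bestFirst;   // caller commits this option for pos, then calls again at pos+1
    }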
Okay I'm going to sort of ignore this issue for now, but we should keep it in the back of our mind that purely greedy forward-scan decision
making can be pretty far from optimal.
The other big issue for video is motion compensation. Yeah, that's a big issue, but it's not specific to this type of code stream optimization,
so I'm not going to discuss it now. It's an issue that you must address in any video coder that makes decisions. The way to address it in this
framework is to choose some scaling factor for "D" on each block. You compute J = c * D + lambda * R , where c would normally be 1; instead
you bias it up or down depending on whether you decide the current block is more or less important than average. (blocks which will be
sources for future motion compensation should be considered more important in the forward greedy pass decision making). (* perhaps more on
this later).
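In code that's just a weight on D; how you choose the weight (estimating how much a block will be referenced in the future) is the real problem and is not addressed here:

    // Per-block lagrange cost with an importance weight : J = c * D + lambda * R.
    // c is 1.0 for an average block and gets biased up for blocks that are likely
    // to be sources for future motion compensation.
    inline double weighted_lagrange_cost(double dist, double rate,
                                         double lambda, double c)
    {
        return c * dist + lambda * rate;
    }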
To be continued ...