Blurg, the complexity wheel turns. In the end, all the issues with video come down to two huge fundamental problems :
1. Lack of a true distortion metric. That is, we make decisions to optimize for some D, but that D is not really what
humans perceive as quality. So we try to bias the coder to make the right kind of error in a black
art hacky way.
2. Inability to do full non-greedy optimization. eg. on each coding decision we make a local greedy choice and hope
that it is close to globally optimal, but in fact our decisions have major effects on the future in numerous and complex ways.
So we try to account for how current decisions might affect the future using ugly
heuristics.
These two major issues underlie all the difficulties and hacks in video coding, and they are unfortunately nigh intractable.
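To make the greedy part concrete, here's a minimal sketch of the per-block decision loop that essentially every coder runs; the names (ModeResult, ChooseModeGreedy) are mine, not from any particular codec, but the shape is the standard J = D + lambda*R minimization with no look-ahead :

    #include <vector>

    struct ModeResult { double D; double R; };   // distortion and rate (bits) if this block is coded in a given mode

    // Pick the mode minimizing J = D + lambda * R for this one block,
    // completely ignoring what the choice does to later blocks.
    // Assumes modes is non-empty.
    static int ChooseModeGreedy(const std::vector<ModeResult> & modes, double lambda)
    {
        int best = 0;
        double bestJ = modes[0].D + lambda * modes[0].R;
        for (int i = 1; i < (int)modes.size(); i++)
        {
            double J = modes[i].D + lambda * modes[i].R;
            if (J < bestJ) { bestJ = J; best = i; }
        }
        return best;
    }

Both problems live inside that little loop : D is not what the eye sees, and minimizing J per block is not minimizing J (or quality) for the whole stream.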
Because of these issues, you get really annoying spurious results in coding. Some of the annoying shit I've seen :
A. I greatly improve my R/D optimizer. Global J goes up !! (lower J is better, it should have gone down).
WTF happened !? On any one block, my R/D optimizer now has
much more ability to make decisions and reach a J minimum on that block. The problem is that the local greedy optimization
is taking my code stream to weird places that then hurt later blocks in ways I am not accounting for.
B. I introduce a new block type. I observe that the R/D chooser picks it reasonably often and global J goes down. All signs
indicate this is good for coding. No, visual quality goes down! Urg. This can come from any number of problems - maybe the
new block type has artifacts that are visually annoying. One that has bitten me is just that certain block
types will have their J minimum on the R/D curve at very different places - the result of that is a lot of quality variation
across the frame, which is visually annoying. eg. the block type might be good in a strict numerical sense, but its optimum
point is at much higher or much lower quality than your other block types, which makes it stand out.
C. I completely screw up a block type, quality goes UP ! eg. I introduce a bug or some major coding inefficiency so a certain
block type really sucks. But global quality is better, WTF. Well this can happen if that block type was actually bad.
For one thing, block types can actually be bad for global J even if they are good for greedy local J, because they produce
output that is not good as a future mocomp source, or even simply because they are redundant with other block types and are a
waste of code space. A more complex problem which I ran into is that a broken block type can change the amount of bits allocated
to various parts of the video, and that can randomly give you better bit allocation, which can make quality go up even though you
broke your coder a bit. Specifically, I broke my Intra ("I") block (no mocomp) coder, which caused more bits to go to I-like frames, which actually
improved quality.
D. I improve my movec finder, so I'm more able to find truly optimal movecs in an R/D sense (eg. find the movec that actually
optimizes J on the current block). Global J goes up! The problem here is that optimizing the current movec can make that
movec very weird - eg. make the movec far from the "true motion". That then hurts future coding greatly.
In most cases these problems can be patched with hacks and heuristics. The goal of the hacks and heuristics is basically to
address the two fundamental issues above. Going back to the numbering of those two issues, what the hacks do is :
1. Try to force distortion to be "good distortion". Forbid too much quality variation between neighboring blocks. Forbid
block mode decisions that you somehow decide are "ugly distortion" even if they optimize J. Try to tweak your D metric to
make visual quality better. Note that the D tweakage here is a pretty nasty black art - you are NOT actually trying to make
a D that approximates a true human visual D, you are trying to make a D under which your codec will make decisions that produce
good global output.
2. To account for the greedy/non-greedy problem, you try to bias the greedy decisions towards things that you guess will be
good for the future. This guess might be based on actual future data from a previous run. Basically you decide not to
make the choice that is locally optimal if you have reason to believe it will hurt too much in the future. This is largely
based on intuition and familiarity with the codec.
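As a concrete (and completely hypothetical) illustration of what these hacks look like in code, here's the kind of thing that gets bolted onto J; the penalty terms and the 0.25 weight are invented placeholders, and tuning them is exactly the black-art part :

    #include <cmath>

    // J with heuristic bias terms. neighborD and futurePenalty are things the
    // encoder has to guess or measure; nothing here falls out of theory.
    static double BiasedJ(double D, double R, double lambda,
                          double neighborD,      // distortion of already-coded neighbor blocks
                          double futurePenalty)  // guess at how much this choice hurts future blocks
    {
        double J = D + lambda * R;
        J += 0.25 * std::fabs(D - neighborD);    // discourage big quality jumps between neighbors
        J += futurePenalty;                      // bias against choices believed to be bad for the future
        return J;
    }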
Now I'll mention a few random particular issues, but really these themes occur again and again.
I. Very simple block modes. Most coders have something like a "direct block copy" mode, or even a "solid single color",
eg. DIRECT or "skip" or whatever. These types of blocks are generally quite high distortion and very low rate. The
problem occurs when your lambda is sort of near the threshold for whether to prefer these blocks or not. Oddly the
alternate choice mode might have much higher rate and much lower distortion. The result is that a bunch of very similar
blocks near each other in an image might semi-randomly select between the high quality and low quality modes (which
happen to have very similar J's at the current lambda). This is obviously ugly. Furthermore, there's a non-greedy
optimization issue with these type of block modes. If we compare two choices that have similar J, one is a skip type
block with high distortion, another is some detailed block mode - the skip type is bad for information conveyance to the
future. That is, it doesn't add anything useful for future blocks to refer to. It just copies existing pixels (or even
wipes some information out in some cases).
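One common patch (sketched below with invented names and constants) is to add hysteresis : don't let the skip-type mode win unless it wins by a clear margin, and make it easier to pick whatever the neighbors already picked, so you don't get the semi-random speckle of high and low quality blocks :

    // Returns true if we should take the cheap/high-distortion (skip-type) mode.
    static bool PreferSkip(double J_skip, double J_detail, bool neighborsUsedSkip)
    {
        // If the neighbors skipped, let skip win a near-tie; otherwise demand a clear win.
        double margin = neighborsUsedSkip ? 1.10 : 0.95;
        return J_skip < J_detail * margin;
    }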
II. Gradual changes need to be sent gradually. That is, if there is some part of the video which is slowly, steadily
changing, such as a slow cross fade, or very slow scale/rotate type motion, or whatever - you really need to send it as
such. If you make a greedy best-J decision, at low bit rate you will sometimes decide to send zero delta, zero delta,
for a while because each frame's difference is so small, and then the accumulated error gets too big, you have to correct it, and you send a
big delta. You've turned the gradual shift into a stutter and pop. Of course the decision to make a big correction
won't happen all across the frame at the same time, so you'll see blocks speckle and move in waves. Very ugly.
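One hedge against this (a sketch, with an invented accumulatedD state variable) is to carry the deferred error along and charge it against the zero-delta choice, so the coder can't keep kicking the can down the road until it pops :

    // accumulatedD : per-block state carried across frames, how much error we have
    // already deferred by sending zero deltas. Purely illustrative.
    static double SkipJWithDrift(double D_skip, double R_skip, double lambda,
                                 double accumulatedD)
    {
        return (D_skip + accumulatedD) + lambda * R_skip;
    }
    // If skip is chosen this frame : accumulatedD += D_skip;  otherwise : accumulatedD = 0;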
III. Rigid translations need to be preserved. The eye is very sensitive to rigid translations. If you just let the movec
chooser optimize for J or D it can screw this up. One reason is that very small motions or movements of monotonous
objects might slip to movec = 0 for code size purposes. That is, rather than send the correct small movec, it might decide
that J is better by incorrectly sending a zero delta movec with a higher distortion. Another reason is that the actual
best pixel match might not correspond to the motion, you can get anomalies, especially on sliding patterned or semi-patterned
objects like grass. In these cases, it actually looks better to use the true motion movec even if it has larger numerical
distortion D to do so. Furthermore there is another greedy/non-greedy issue. Sometimes some non-true-motion movec might
give you the best J on the current block by reaching out and grabbing some random pixels that match really well. But
that screws up your motion field for the future. That movec will be used to condition predictions of future movecs. So say
you have some big slowly translating field - if everyone picks nice true motion movecs they will also be coherent, but if
people just get to pick the best match for themselves greedily, they will be a big mess and not predict each other. That
movec might also be used by the next frame, the previous B frame, etc.
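The usual hack (again just a sketch, with a made-up weight) is to charge movecs for deviating from the predicted / neighborhood movec, so coherent near-true-motion fields beat lucky random matches :

    #include <cstdlib>

    struct MV { int x, y; };   // motion vector

    // Movec cost = ordinary J plus a penalty for straying from the predicted movec.
    // The 2.0 * lambda weight is an invented tuning constant.
    static double MovecCost(double D, double R, double lambda,
                            const MV & mv, const MV & predicted)
    {
        int deviation = std::abs(mv.x - predicted.x) + std::abs(mv.y - predicted.y);
        return D + lambda * R + 2.0 * lambda * (double)deviation;
    }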
IV. Full pel vs. half/quarter/sub-pel is a tricky issue. Sub-pel movecs often win in a strict SSD sense; this is partly
because when the match is imperfect, sub-pel movecs act to sort of average two guesses together; they produce a blurred
prediction, which is optimal under L2 norm. There are some problems with this though; sub-pel movecs act to blur the
image, they can stand out visually as blurrier bits; they also act to "destroy information" in the same way that simple
block modes do. Full pel movecs have the advantage of giving you straight pixel copies, so there is no blur or destruction
of information. But full pel movecs can have their own problem if the true motion is subpel - they can produce wiggling.
eg. if an area should really have movecs around 0.5, you might make some blocks where the movec is +0 and some where it is +1.
The result is a visible dilation and contraction that wiggles along, rather than a perfect rigid motion.
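To see where the blur comes from, here's the simplest possible half-pel sample; real codecs use longer filter taps, but the averaging character - and hence the L2-friendly smoothing - is the same :

    typedef unsigned char uint8;

    // Half-pel sample between two full-pel neighbors : a rounded average.
    // Averaging is exactly why an imperfect sub-pel match comes out blurred.
    static uint8 HalfPelSample(uint8 a, uint8 b)
    {
        return (uint8)((a + b + 1) >> 1);
    }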
V. A good example of all this evil is the movec search in x264. They observed that allowing very large movec search ranges
actually decreases quality (vs. more local incremental searches). In theory, if your movec chooser is using the right
criterion, this should not happen - more choices should never hurt; it should simply not choose them if they are worse.
Their problem is twofold - 1. their movec chooser is obviously not perfect in that it doesn't account for current cost
completely correctly, 2. of course it doesn't account for all the effects on the future. The result is that using some
heuristic seed spots for the search which you believe are good coherent movecs for various reasons, and then doing small
local searches actually gives you better global quality. This is a case where using "broken" code gives better results.
In fact it is a general pattern that using very accurate local decisions often hurts global quality, and often using
some tweaked heuristic is better. eg. instead of using true code cost R in your J decision, you make some functional
fit to the norms of the residuals; you then tweak that fit to optimize global quality - not to fit R. The result is
that the fit can wind up compensating for the greedy/non-greedy and other weird factors, and the approximation can
actually be better than the more accurate local criterion.
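For example, the rate approximation might be nothing more than a linear fit on a residual norm (sketched below; the coefficients a and b are the knobs you twist while looking at global output quality, not at how well they predict the real R) :

    // Approximate the rate from a norm of the residual (eg. SATD) instead of
    // actually entropy coding it. a and b are tuned for global quality.
    static double EstimateRate(double residualSATD, double a, double b)
    {
        return a * residualSATD + b;
    }
    // J_approx = D + lambda * EstimateRate(satd, a, b);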
Did I say blurg?