1. Lack of the true distortion metric. That is, we make decisions to optimize for some D, but that D is not really what humans perceive as quality. So we try to bias the coder to make the right kind of error in a black art hacky way.
2. Inability to do full non-greedy optimization. eg. on each coding decision we try to do a local greedy decision and hope that is close to globally optimal, but in fact our decisions do have major effects on the future in numerous and complex ways. So we try to account for how current decisions might affect the future using ugly heuristics.
These two major issues underlie all the difficulties and hacks in video coding, and they are unfortunately nigh intractable. Because of these issues, you get really annoying spurious results in coding. Some of the annoying shit I've seen :
A. I greatly improve my R/D optimizer. Global J goes up !! (lower J is better, it should have gone down). WTF happened !? On any one block, my R/D optimizer now has much more ability to make decisions and reach a J minimum on that block. The problem is that the local greedy optimization is taking my code stream to weird places that then hurt later blocks in ways I am not accounting for.
B. I introduce a new block type. I observe that the R/D chooser picks it reasonably often and global J goes down. All signs indicate this is good for coding. No, visual quality goes down! Urg. This can come from any number of problems, maybe the new block type has artifacts that are visually annoying. One that I have run into that's a bother is just that certain block types will have their J minimum on the R/D curve at very different places - the result of that is a lot of quality variation across the frame, which is visually annoying. eg. the block type might be good in a strict numerical sense, but its optimum point is at much higher or much lower quality than your other block types, which makes it stand out.
C. I completely screw up a block type, quality goes UP ! eg. I introduce a bug or some major coding inefficiency so a certain block type really sucks. But global quality is better, WTF. Well this can happen if that block type was actually bad. For one thing, block types can actually be bad for global J even if they are good for greedy local J, because they produce output that is not good as a future mocomp source, or even simply because they are redundant with other block types and are a waste of code space. A more complex problem which I ran into is that a broken block type can change the amount of bits allocated to various parts of the video, and that can randomly give you better bit allocation, which can make quality go up even though you broke your coder a bit. Most specifically, I broke my Intra ("I") block (no mocomp) coder, which caused more bits to go to I-like frames, which actually improved quality.
D. I improve my movec finder, so I'm more able to find truly optimal movecs in an R/D sense (eg. find the movec that actually optimizes J on the current block). Global J goes up. The problem here is that optimizing the current movec can make that movec very weird - eg. make the movec far from the "true motion". That then hurts future coding greatly.
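Cases A and D can be reproduced in a toy model. The sketch below assumes the usual Lagrangian cost J = D + lambda*R and uses two blocks with made-up (D, R) numbers, where the second block gets more expensive if the first block picked a mode that is a poor predictor. The mode names, the numbers and the helper functions are all hypothetical, not taken from any real codec.

```python
from itertools import product

LAMBDA = 1.0  # hypothetical Lagrange multiplier
MODES = ("safe", "aggressive")

# Made-up (distortion, rate) for block 0 under each mode.
BLOCK0 = {"safe": (2.0, 6.0), "aggressive": (1.0, 6.0)}


def block1_dr(mode0, mode1):
    """(D, R) of block 1; its rate depends on what block 0 chose."""
    # Assumption: an "aggressive" block 0 is a poor predictor for block 1.
    penalty = 4.0 if mode0 == "aggressive" else 0.0
    d, r = {"safe": (2.0, 5.0), "aggressive": (1.5, 5.0)}[mode1]
    return d, r + penalty


def j(d, r, lam=LAMBDA):
    """Lagrangian cost J = D + lambda * R."""
    return d + lam * r


def total_j(mode0, mode1):
    return j(*BLOCK0[mode0]) + j(*block1_dr(mode0, mode1))


def greedy():
    """Minimize J on block 0 alone, then on block 1 given that choice."""
    m0 = min(MODES, key=lambda m: j(*BLOCK0[m]))
    m1 = min(MODES, key=lambda m: j(*block1_dr(m0, m)))
    return (m0, m1), total_j(m0, m1)


def exhaustive():
    """Try every joint choice; only feasible because the toy is tiny."""
    best = min(product(MODES, repeat=2), key=lambda ms: total_j(*ms))
    return best, total_j(*best)


if __name__ == "__main__":
    print("greedy    ", greedy())
    print("exhaustive", exhaustive())
```

With these numbers the greedy pass takes the locally better mode on block 0 and ends with a higher total J than the exhaustive search, which accepts a slightly worse block 0 to make block 1 cheaper. A real encoder cannot enumerate joint choices like this, which is the whole problem.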
In most cases these problems can be patched with hacks and heuristics. The goal of hacks and heuristics is basically to try to address the first two issues. Going back to the numbering of the two issues, what the hacks do is :
1. Try to force distortion to be "good distortion". Forbid too much quality variation between neighboring blocks. Forbid block mode decisions that you somehow decide are "ugly distortion" even if they optimize J. Try to tweak your D metric to make visual quality better. Note that the D tweakage here is a pretty nasty black art - you are NOT actually trying to make a D that approximates a true human visual D, you are trying to make a D under which your codec will make decisions that produce good global output.
2. To account for the greedy/non-greedy problem, you try to bias the greedy decisions towards things that you guess will be good for the future. This guess might be based on actual future data from a previous run. Basically you decide not to make the choice that is locally optimal if you have reason to believe it will hurt too much in the future. This is largely based on intuition and familiarity with the codec.
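One way to picture this kind of hack is as an extra bias term added to the local cost. The sketch below again assumes J = D + lambda*R and adds a penalty for a distortion jump relative to an already-coded neighbor. The function name `choose_mode`, the `smooth_weight` knob and the numbers are hypothetical. It only illustrates the shape of such a heuristic and is not a recommended formula.

```python
def choose_mode(candidates, lam, neighbor_d=None, smooth_weight=0.0):
    """Pick the mode minimizing J = D + lam*R plus a heuristic bias.

    candidates:    dict mapping mode name -> (D, R)
    neighbor_d:    distortion of an already-coded neighboring block, or None
    smooth_weight: hypothetical knob penalizing quality jumps vs the neighbor
    """
    def cost(name):
        d, r = candidates[name]
        bias = 0.0
        if neighbor_d is not None:
            # Discourage blocks whose quality differs a lot from the neighbor.
            bias = smooth_weight * abs(d - neighbor_d)
        return d + lam * r + bias

    return min(candidates, key=cost)


if __name__ == "__main__":
    modes = {"skip": (40.0, 1.0), "detail": (10.0, 17.0)}
    # Pure greedy J: skip = 42, detail = 44, so skip wins.
    print(choose_mode(modes, lam=2.0))
    # With a high quality neighbor (D = 10), the bias flips the choice.
    print(choose_mode(modes, lam=2.0, neighbor_d=10.0, smooth_weight=0.2))
```

Note that the bias makes the chooser pick a mode with worse local J on purpose. The weight would have to be tuned by looking at whole encodes, not derived from anything.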
Now I'll mention a few random particular issues, but really these themes occur again and again.
I. Very simple block modes. Most coders have something like a "direct block copy" mode, or even a "solid single color", eg. DIRECT or "skip" or whatever. These types of blocks are generally quite high distortion and very low rate. The problem occurs when your lambda is sort of near the threshold for whether to prefer these blocks or not. Oddly the alternate choice mode might have much higher rate and much lower distortion. The result is that a bunch of very similar blocks near each other in an image might semi-randomly select between the high quality and low quality modes (which happen to have very similar J's at the current lambda). This is obviously ugly. Furthermore, there's a non-greedy optimization issue with these types of block modes. If we compare two choices that have similar J, one is a skip type block with high distortion, another is some detailed block mode - the skip type is bad for information conveyance to the future. That is, it doesn't add anything useful for future blocks to refer to. It just copies existing pixels (or even wipes some information out in some cases).
II. Gradual changes need to be sent gradually. That is, if there is some part of the video which is slowly steadily changing, such as a slow cross fade, or very slow scale/rotate type motion, or whatever - you really need to send it as such. If you make a greedy best J decision, at low bit rate you will sometimes decide to send zero delta, zero delta, for a while because the difference is so small, and then it becomes too big where you need to correct it and you send a big delta. You've turned the gradual shift into a stutter and pop. Of course the decision to make a big correction won't happen all across the frame at the same time, so you'll see blocks speckle and move in waves. Very ugly.
III. Rigid translations need to be preserved. The eye is very sensitive to rigid translations. If you just let the movec chooser optimize for J or D it can screw this up. One reason is that very small motions or movements of monotonous objects might slip to movec = 0 for code size purposes. That is, rather than send the correct small movec, it might decide that J is better by incorrectly sending a zero delta movec with a higher distortion. Another reason is that the actual best pixel match might not correspond to the motion; you can get anomalies, especially on sliding patterned or semi-patterned objects like grass. In these cases, it actually looks better to use the true motion movec even if it has larger numerical distortion D to do so. Furthermore there is another greedy/non-greedy issue. Sometimes some non-true-motion movec might well give you the best J on the current block by reaching out and grabbing some random pixels that match really well. But that screws up your motion field for the future. That movec will be used to condition predictions of future movecs. So say you have some big slowly translating field - if everyone picks nice true motion movecs they will also be coherent, but if people just get to pick the best match for themselves greedily, they will be a big mess and not predict each other. That movec might also be used by the next frame, the previous B frame, etc.
IV. Full pel vs. half/quarter/sub-pel is a tricky issue. Sub-pel movecs often win in a strict SSD sense; this is partly because when the match is imperfect, sub-pel movecs act to sort of average two guesses together; they produce a blurred prediction, which is optimal under L2 norm. There are some problems with this though; sub-pel movecs act to blur the image, they can stand out visually as blurrier bits; they also act to "destroy information" in the same way that simple block modes do. Full pel movecs have the advantage of giving you straight pixel copies, so there is no blur or destruction of information. But full pel movecs can have their own problem if the true motion is subpel - they can produce wiggling. eg. if an area should really have movecs around 0.5 , you might make some blocks where the movec is +0 and some where it is +1. The result is a visible dilation and contraction that wiggles along, rather than a perfect rigid motion.
V. A good example of all this evil is the movec search in x264. They observed that allowing very large movec search ranges actually decreases quality (vs more local incremental searches). In theory if your movec chooser is using the right criterion, this should not be - more choices should never hurt, it should simply not choose them if they are worse. Their problem is twofold - 1. their movec chooser is obviously not perfect in that it doesn't account for current cost completely correctly, 2. of course it doesn't account for all the effects on the future. The result is that using some heuristic seed spots for the search which you believe are good coherent movecs for various reasons, and then doing small local searches actually gives you better global quality. This is a case where using "broken" code gives better results.
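The "seed spots plus small local search" idea from point V can be sketched roughly as follows. This is NOT x264's actual algorithm. It is a minimal full-pel illustration that assumes SAD as the distortion, a crude made-up rate proxy `mv_bits` (distance from a predicted movec), and hypothetical function and parameter names throughout.

```python
def sad(cur, ref, bx, by, mvx, mvy, n):
    """Sum of absolute differences between an n x n block of cur at (bx, by)
    and the block of ref displaced by (mvx, mvy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + y + mvy][bx + x + mvx])
    return total


def mv_bits(mv, pred):
    """Hypothetical rate proxy: movecs far from the predictor cost more."""
    return abs(mv[0] - pred[0]) + abs(mv[1] - pred[1])


def seeded_search(cur, ref, bx, by, n, seeds, pred, lam, radius=1):
    """Only test movecs within `radius` of each seed, minimizing
    J = SAD + lam * mv_bits. Returns (best_mv, best_j)."""
    h, w = len(ref), len(ref[0])
    best, best_j = None, float("inf")
    for sx, sy in seeds:
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                mv = (sx + dx, sy + dy)
                x0, y0 = bx + mv[0], by + mv[1]
                # Skip candidates that would read outside the reference frame.
                if x0 < 0 or y0 < 0 or x0 + n > w or y0 + n > h:
                    continue
                cost = sad(cur, ref, bx, by, mv[0], mv[1], n) + lam * mv_bits(mv, pred)
                if cost < best_j:
                    best, best_j = mv, cost
    return best, best_j


if __name__ == "__main__":
    # Toy 8x8 reference frame; the current frame is the same content
    # moved one pixel to the left, so the true movec is (+1, 0).
    ref = [[x * x + 3 * y for x in range(8)] for y in range(8)]
    cur = [[ref[y][min(x + 1, 7)] for x in range(8)] for y in range(8)]
    print(seeded_search(cur, ref, 2, 2, 2, seeds=[(0, 0)], pred=(0, 0), lam=1.0))
```

The seeds would typically be movecs you already believe are coherent (neighbors, the predictor, the co-located block in the previous frame). Restricting the search to their neighborhoods is what keeps the chooser from reaching out and grabbing some far away pixels that happen to match.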
In fact it is a general pattern that using very accurate local decisions often hurts global quality, and often using some tweaked heuristic is better. eg. instead of using true code cost R in your J decision, you make some functional fit to the norms of the residuals; you then tweak that fit to optimize global quality - not to fit R. The result is that the fit can wind up compensating for the greedy/non-greedy and other weird factors, and the approximation can actually be better than the more accurate local criterion.
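As a rough illustration of that last idea, here is a minimal sketch where the rate term in J is replaced by a linear function of the L1 norm of the residual. The linear form, the parameter names `a` and `b`, and their default values are assumptions for the example. The point is only that `a` and `b` are free knobs you would tune against global quality over whole encodes, not regress against the true bit count.

```python
def estimated_rate(residual, a=0.5, b=2.0):
    """Hypothetical rate model: R_est = a * (L1 norm of residual) + b."""
    return a * sum(abs(v) for v in residual) + b


def approx_j(distortion, residual, lam, a=0.5, b=2.0):
    """J with the true code cost R replaced by the fitted estimate."""
    return distortion + lam * estimated_rate(residual, a, b)


if __name__ == "__main__":
    # L1 norm = 6, so R_est = 0.5 * 6 + 2 = 5 and J = 10 + 2 * 5 = 20.
    print(approx_j(10.0, [1, -2, 0, 3], lam=2.0))
```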
Did I say blurg?