And the encoder is where the real pain lies. H264 like many of the standards that I hate, is not a one-to-one transform from decompressed streams to code streams. There is no explicit algorithm to find the optimal stream for a given bit rate. With all the different choices that the encoder has of different block types, different bit allocation, different motion vectors, etc. there's a massive amount of search space, and getting good compression quality hinges entirely on having a good encoder that searches that space well.
All of this stifles innovation, and also means that there are very few decent implementations available because it's so damn hard to make a good implementation. It's such a big arcane standard that's tweaked to the Nth degree, there are literally thousands of papers about it (and the Chinese seem to have really latched on to working on H264 improvements which mean there are thousands of papers written by non-English speakers, yay).
I really don't like overcomplex standards, especially this style that specifies the decoder but not the encoder. Hey, it's a nice idea in theory, it sounds good - you specify the decoder, and then over the years people can innovate and come up with better encoders that are still compatible with the same decoder. Sounds nice, but it doesn't work. What happens in the real world is that a shitty encoder gains acceptance in the mass market and that's what everyone uses. Or NO encoder ever takes hold, such as with the so-called "MPEG 4" layered audio spec, for which there exists zero mainstream encoders because it's just too damn complex.
Even aside from all that annoyance, it also just bugs me because it's not optimal. There are lots of ways to encode the exact same decoded video, and that means you're wasting code space. Any time the encoder has choices that let it produce the same output with different code streams, it means you're wasting code space. I talked about this a bit in the past in the LZ optimal parser article, but it should be intuitively obvious - you could take some of those redundant code streams and make them decode to something different, which would give you more output possibilities and thus reduce error at the same bit rate. Obviously H264 still performs well so it's not a very significant waste of code space, but you could make the coder simpler and more efficient by eliminating those choices.
Furthermore, while the motion compensation and all that is very fancy, it's still "ghetto". It's still a gross approximation of what we really want to do, which is *predict* the new pixel from its neighbors and from the past sequence of frames. That is, don't just create motion vectors and subtract the value and encode the difference - doing subtraction is a very primitive form of prediction.
Making a single predicted value and subtracting is okay *if* the predicted probability spectrum is unimodal laplacian, and you also use what information you can to predict the width of the laplacian. But it often isn't. Often there are pixels that you can make a good prediction that this pixel is very likely either A or B, and each is laplacian, but making a single prediction you'd have to guess (A+B)/2 which is no good. (an example of this is along the edges of moving objects, where you can very strongly predict any given pixel to be either from the still background or from the edge of the moving object - a bimodal distribution).