03-08-10 - Distortion and Bit Allocation

I now know that rate allocation is by far the most important thing in video. It's obviously important in a lot of things, but in video you just have so many bits and so much flexibility in where you put them, and there are lots of psychovisual phenomena that don't exist in still images (motion, eye adaptation, feature tracking, etc., all because the eye notices changes over time). In fact I conjecture that you could take a really shitty old coder like MPEG2 and make videos that beat anything currently in existence with just better rate allocation.

What can rate allocation do ?

1. Move bits to the source of predictions. That is, code some frame (or part of a frame) better than normal because it will be used as a mocomp source in the future. This is actually a purely mathematical win and would apply without any psychovisual consideration. A lot of people do this in semi-heuristic ways, but of course those can make lots of mistakes (for example there may be cases where increasing the bit assignment to a block might actually make it a worse source for the future, eg. the future might be a better match to the block with more distortion; also starving the future might cause it to no longer choose that block as a source, etc). Some people move bits around while holding all the block mode decisions and movecs constant, which at least lets you converge, but of course you should consider all possible bit moves and all possible mode changes.

2. Move bits from frame to frame to make some frames look better and some look worse. Move bits around the frame to make parts look better and parts look worse. In general choosing where to put your error.

There's also a related issue which is not exactly rate allocation but is very similar. In lossy coders like video coders you often have a choice of what your error looks like. That is, for the same distortion (in a numerical sense) you could make different shapes of error, through choosing different block modes, choosing different movecs, or more globally choosing quantizers or quantization matrices. This often ties into rate allocation because it involves how you make your free choices in the encoder :

3. What the distortion looks like. In particular, if you make some amount of error (in an SSD or SAD sense, aka L2 or L1 norm), what does that error look like? What is the shape of it?

Now, in a lagrangian framework the main thing driving all these decisions is just the D metric in J = R + lambda D. If you change D, it changes where bits get put. D determines how important you think one type of error is vs. another type of error.
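To make that concrete, here's a minimal sketch of a lagrangian choice for a single block, using the J = R + lambda D form above. The function name and the (bits, SSD) numbers are made up for illustration; a real encoder gets its candidates by actually trying modes and quantizers.

```python
def pick_operating_point(candidates, lam):
    """Return the (rate, distortion) pair that minimizes J = R + lam * D."""
    return min(candidates, key=lambda rd: rd[0] + lam * rd[1])

# Hypothetical (bits, SSD) choices for one block, invented for this example.
block_choices = [(10, 400.0), (40, 150.0), (120, 60.0), (300, 20.0)]

for lam in (0.05, 0.5, 2.0):
    rate, dist = pick_operating_point(block_choices, lam)
    print(f"lambda={lam}: spend {rate} bits, accept distortion {dist}")
```

With J written this way, a bigger lambda means distortion costs more relative to bits, so the block gets more bits. Anything that scales D for a region has the same effect as raising lambda just for that region.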

Just as an example, say you ran face detection on your video, then you could assign face regions to all your frames, and any error in the face region could be counted as extra important - if you put this into your "D" metric, then the lagrangian coder automatically gives those areas more bits. But that kind of example is rather banal. There are obviously tons of human error-importance issues that you could try to account for, having to do with what objects are most important in the frame, where the motion is, what kind of errors are particularly appalling, etc etc.
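A rough sketch of what "counted as extra important" could mean, assuming a per-pixel importance map (the 4x weight on face pixels and the pixel values are arbitrary, and `weighted_ssd` is a hypothetical helper, not anybody's real API):

```python
def weighted_ssd(orig, recon, importance):
    """Sum of squared differences with a per-pixel importance weight."""
    return sum(w * (o - r) ** 2 for o, r, w in zip(orig, recon, importance))

# Four pixels; pretend the first two were flagged by a face detector.
orig       = [100, 100, 100, 100]
importance = [4, 4, 1, 1]            # assumed weights, purely illustrative
flat       = [1, 1, 1, 1]            # plain SSD for comparison

err_in_face    = [104, 100, 100, 100]  # same size error, inside the face
err_in_backgnd = [100, 100, 100, 104]  # same size error, outside the face

print(weighted_ssd(orig, err_in_face, flat),
      weighted_ssd(orig, err_in_backgnd, flat))
print(weighted_ssd(orig, err_in_face, importance),
      weighted_ssd(orig, err_in_backgnd, importance))
```

Plain SSD can't tell the two reconstructions apart, while the weighted D charges four times as much for the error in the face, so any J-minimizing decision will spend bits there first.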

Purely numerical error distribution can be important : say you have an error of 3 somewhere and an error of 20 somewhere else. You have bits to change each by 1. Should you change the 3 to 2 or the 20 to 19 ? Well, it depends on their neighborhoods, but I think more often than not you should do the 3->2. That will be more visually noticeable. Using L1 or L2 (or L-N for whatever other N's) causes you to make different decisions in these cases. Most simplistically you can see it as a continuum between minimizing the total abs error (L1) vs minimizing the maximum error (L-infinity). That is, the issue of whether you have clumpy error or spread out error is a pretty big one.
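You can check the 3 vs. 20 case directly. This is just arithmetic on a sum-of-|e|^p distortion; the `gain` helper is mine, and p = 0.5 is thrown in only to show what an exponent below 1 does:

```python
def gain(before, after, p):
    """Drop in sum-of-|e|^p distortion when one error goes from before to after."""
    return abs(before) ** p - abs(after) ** p

for p in (0.5, 1.0, 2.0):
    small = gain(3, 2, p)     # fix the small error
    large = gain(20, 19, p)   # fix the large error
    if small > large:
        verdict = "prefers 3->2"
    elif small < large:
        verdict = "prefers 20->19"
    else:
        verdict = "is indifferent"
    print(f"p={p}: gain(3->2)={small:.3f}, gain(20->19)={large:.3f}, {verdict}")
```

L1 sees the two moves as identical (each gains 1), L2 strongly prefers fixing the 20 (39 vs. 5), and the preference for the big error only grows as p heads toward infinity. A metric of this form picks 3->2 only when p is below 1, so if 3->2 really is the better visual choice, neither L1 nor L2 will find it for you.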

The thing holding back development is the lack of a procedure for measuring real "quality". The problem is that changing distortion to change your bit allocation for psychovisual purposes will by definition hurt your abstract measures. (Hacky changes to D might hurt RMSE but help SSIM, but in that case I would say some of the change was not "psychovisual" - the part of the change which helps SSIM is in fact an analytical change to improve a certain metric.) At some point you have to be able to make a decision that you will allocate bits in such a way that your video will look worse to computers (with our current shitty computer analysis models), but will look better to humans.

x264 and others have a bit of a solution for this - they use a kind of "crowd sourcing" (bleck, web 2.0 buzzword, I feel like I just vomited in my own mouth a little). They can put beta features in their code and they have mobs of fan-boys who will download betas and try them on lots of videos and then post results on the forums. This gives you lots of real human eyes saying "this looks better" or not for attempts at psychovisual tuning. But I don't think you can really make big developments using that technique - you can only make small heuristic stabs in the dark and then find out if they were okay, because the turnaround time for results from the crowd is too long, and if you release too many dead ends for them to test they will stop doing it, so you have to be reasonably sure it is a good change before publishing it to the crowd, etc. It's not the kind of thing a researcher needs, which is a black box where I can throw in two videos and ask "which looks better to a human?".

The result is that we are mostly stabbing in the dark and occasionally getting lucky.
