One idea is to train a statistical fit to make good guesses. For any given video (or other data to compress), the relationship between lambda and R can be captured by a pretty simple functional fit. There are a few ways to do this. One way would be to just go ahead and compress the video 16 times with different lambda values and observe R each time. That gives you a bunch of data points; use those to fit a function lambda(R), and now you can dial a lambda for any rate. If you're going to be working on the same video a lot, this might not be so bad; in particular in the games or DVD publishing business this might be tolerable.
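Here is a minimal sketch of that brute-force fit. The power-law form (a straight line in log-log space) is my assumption for illustration, not something the data is guaranteed to follow, and `fake_compress` is a hypothetical stand-in for a real encoder run.

```python
import math

def fit_lambda_of_rate(samples):
    """Least-squares fit of log(lambda) = a + b*log(R) to (lambda, R) samples.

    The power-law form is an assumption of this sketch; any simple
    monotonic family could be substituted.
    """
    xs = [math.log(rate) for _, rate in samples]
    ys = [math.log(lam) for lam, _ in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Return the fitted function lambda(R).
    return lambda rate: math.exp(a + b * math.log(rate))

def fake_compress(lam):
    # Hypothetical stand-in for a real encode: returns the rate R at this lambda.
    return 5.0e6 * lam ** -0.7

# 16 trial compressions at geometrically spaced lambdas.
trial_lambdas = [0.01 * 2 ** k for k in range(16)]
samples = [(lam, fake_compress(lam)) for lam in trial_lambdas]
lambda_for_rate = fit_lambda_of_rate(samples)

target_rate = 1.0e6
guess = lambda_for_rate(target_rate)  # dial a lambda for the rate we want
```

With a real encoder the samples will not sit exactly on the curve, so the guess is only a starting point for one of the refinements discussed later.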

A slight tweak on that idea is, rather than making a fit per video, to fit to some characteristics of the video. Sample some kind of moments from the video; for example, do one pass with simple mocomp and measure the L2 norm of the motion vectors and the L2 norm of the residuals. Those are just two moments. Use those to build up a fit lambda(R,m1,m2). To make this fit, we could run on a variety of sample videos to build a database of some good seeds. Furthermore, whenever a user does a run, it would add a seed to their local database. That way if they run on the same video multiple times, those moments would be matched exactly and the fit would get better. If we don't find a close enough sample point in the database to make a good fit, then we just force a test compression of the video at a given lambda and measure the R. So the end result is that rather than doing a bunch of test compressions to build a model, we just do a quick scan to sample some moments, then take the R parameter, look up the fit, and get a guess for lambda. Note that rather than doing this fit per video, you could do it *per frame*, which would give you much denser sample data very quickly.
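One way the seed-database lookup might look, as a rough sketch. Everything specific here is my assumption: the seed tuple layout, measuring distance in log space, inverse-distance weighting of the nearest seeds, and the `max_dist` and `k` defaults. It also assumes the rate and both moments are positive.

```python
import math

def guess_lambda(seeds, rate, m1, m2, max_dist=1.0, k=3):
    """Guess lambda from a seed database, or return None if no seed is close.

    seeds: list of (rate, m1, m2, lam) tuples from earlier runs.
    Distances are measured in log space so each quantity is compared by
    ratio; that choice, and max_dist / k, are assumptions of this sketch.
    """
    query = (math.log(rate), math.log(m1), math.log(m2))
    scored = []
    for s_rate, s_m1, s_m2, s_lam in seeds:
        point = (math.log(s_rate), math.log(s_m1), math.log(s_m2))
        scored.append((math.dist(query, point), s_lam))
    scored.sort(key=lambda t: t[0])
    if not scored or scored[0][0] > max_dist:
        return None  # caller should force a test compression and add a seed
    if scored[0][0] == 0.0:
        return scored[0][1]  # exact match, e.g. the same video run again
    # Inverse-distance-weighted average of log(lambda) over the close seeds.
    nearest = [(d, lam) for d, lam in scored[:k] if d <= max_dist]
    weights = [1.0 / d for d, _ in nearest]
    log_lam = sum(w * math.log(lam) for w, (_, lam) in zip(weights, nearest))
    return math.exp(log_lam / sum(weights))

# Illustrative seeds only: (rate, m1, m2, lambda)
seeds = [
    (1.0e6, 4.0, 20.0, 0.5),
    (2.0e6, 4.0, 20.0, 0.2),
    (1.0e6, 40.0, 200.0, 3.0),
]
exact = guess_lambda(seeds, 1.0e6, 4.0, 20.0)   # same video, same rate
near = guess_lambda(seeds, 1.1e6, 4.0, 20.0)    # blends the two close seeds
far = guess_lambda(seeds, 1.0e6, 0.01, 0.01)    # nothing close: returns None
```

A `None` result is the "no close enough sample point" case, where you fall back to a test compression and store the result as a new seed.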

Another way to get a decent guess for lambda is with interpolation search. The procedure would be something like this :

1. Pick good brackets for lambda: high and low lambdas that you're pretty sure the desired lambda is between, but that are as close together as possible (requires some kind of moment sampling or heuristic training).
2. Evaluate R(lambda) at the two end points and halfway between them. That's 3 compression runs.
3. From the 3 points, fit a quadratic R(lambda).
4. Use the quadratic to choose a 4th point lambda (bias it so it's not too close to the 3 we already did).
5. Run compression at the 4th point. If that R(lambda) is close enough to the target -> done.
6. Make a line between the 4th point and the neighbor that straddles the target R.
7. Use lerp to choose a 5th lambda.

That's 4 or 5 evaluations of R(lambda). In many cases with 4 runs your R may be close enough to the desired rate and you are just done. The quality of interpolation search depends on how simple/smooth the R(lambda) function is. In general it is a good fit to simple curves, but you can get unlucky in your sample point choices.
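The interpolation search procedure can be sketched roughly as follows. `rate_at` is a hypothetical callback that runs one compression and returns R; the 2% tolerance and the 5% minimum gap used for the bias step are arbitrary choices of mine. The sketch assumes R decreases with lambda and that the target is bracketed.

```python
import math

def quadratic_through(p0, p1, p2):
    """Coefficients (a, b, c) of y = a*x^2 + b*x + c through three points."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    d1 = (y1 - y0) / (x1 - x0)
    d2 = (y2 - y1) / (x2 - x1)
    a = (d2 - d1) / (x2 - x0)
    b = d1 - a * (x0 + x1)
    c = y0 - a * x0 * x0 - b * x0
    return a, b, c

def solve_in_bracket(a, b, c, target, lo, hi):
    """Solve a*x^2 + b*x + c = target, preferring a root inside [lo, hi]."""
    c -= target
    if abs(a) < 1e-12:
        roots = [-c / b]  # nearly linear; assumes R is not flat (b != 0)
    else:
        sq = math.sqrt(max(b * b - 4.0 * a * c, 0.0))
        roots = [(-b - sq) / (2.0 * a), (-b + sq) / (2.0 * a)]
    inside = [r for r in roots if lo <= r <= hi]
    return inside[0] if inside else min(max(roots[0], lo), hi)

def interpolation_search(rate_at, target, lo, hi, tol=0.02):
    """Return (lambda, rate, evaluations) using at most 5 compression runs."""
    mid = 0.5 * (lo + hi)
    pts = [(x, rate_at(x)) for x in (lo, mid, hi)]  # 3 compression runs
    for x, r in pts:
        if abs(r - target) <= tol * target:
            return x, r, 3  # got lucky with one of the initial samples
    a, b, c = quadratic_through(*pts)
    x4 = solve_in_bracket(a, b, c, target, lo, hi)
    # Bias the 4th point away from the samples we already have.
    min_gap = 0.05 * (hi - lo)
    for x, _ in pts:
        if abs(x4 - x) < min_gap:
            x4 = x + min_gap if (x4 >= x and x + min_gap <= hi) else x - min_gap
    r4 = rate_at(x4)
    if abs(r4 - target) <= tol * target:
        return x4, r4, 4
    # Lerp between the 4th point and the neighbor that straddles the target.
    pts.append((x4, r4))
    pts.sort()
    i = pts.index((x4, r4))
    for j in (i - 1, i + 1):
        if 0 <= j < len(pts):
            xn, rn = pts[j]
            if (rn - target) * (r4 - target) <= 0.0:
                x5 = x4 + (target - r4) / (rn - r4) * (xn - x4)
                return x5, rate_at(x5), 5
    return x4, r4, 4  # target was not bracketed; return the best guess so far

# Hypothetical smooth rate curve standing in for real compression runs.
lam, rate, evals = interpolation_search(lambda x: 1000.0 / (1.0 + x), 250.0, 2.0, 5.0)
```

Note that the fifth run is returned as-is without checking the tolerance, matching the "4 or 5 evaluations" budget; a real implementation would probably want to report whether it actually got close.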

Finally, if you want to hit a rate more exactly, you can do so more efficiently by relaxing the global uniform lambda requirement. Say you have some lambda which is pretty close to giving you the right rate and you want to encode to hit that rate more exactly. You let lambda drift a little bit as you code to try to adjust towards hitting the exact bit rate.

I'm not sure exactly what the ideal algorithm is for this, so if you have better ideas let me know. It's a little bit of a nasty black box feedback problem, because I'm trying to hit R by dialing L, and I have this unknown function R(L). One assumption that we will use, which most video coders use in one way or another, is that the video is locally self-similar; that is, frame N is usually similar to frame N+1, so if I observe something about the function R(L) on frame N, it is a good guess that it behaves the same way on frame N+1. Obviously this is grossly untrue for major cuts, but those are "rare" and we just accept the inaccuracy there (you could also have a panic mode when you see you got it really wrong).

So the idea is that rather than search L around on a given frame, we use previous frames to just make a guess for L and only encode the frame once. If we get it wrong by a little bit, no big deal, we'll make up for it on the next frame. This only works when we are only trying to make small corrections.

Specifically : you did a previous pass at lambda L1 and that gave you total size R1. It also gave you a size for each frame : F1(i). You now want to hit size R2 (which is very close to R1).

At frame i you have already written W2 bits, so you have (R2 - W2) remaining. In pass one you had (R1 - W1) remaining at the same spot. Compute the desired size for the current frame as :

F2 = F1 * (R2 - W2) / (R1 - W1)
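As a quick worked example of that formula (the numbers are purely illustrative, and I'm reading W1 and W2 as the bits written before the current frame in each pass):

```python
def target_frame_size(f1, r1, w1, r2, w2):
    """F2 = F1 * (R2 - W2) / (R1 - W1).

    Scales the pass-one frame size by the ratio of bits remaining in this
    pass to bits that were remaining at the same spot in pass one.
    """
    return f1 * (r2 - w2) / (r1 - w1)

# Pass one totalled 1,000,000 bits; we now want 980,000.
# At this frame, pass one had written 400,000 bits and we have written 395,000.
f2 = target_frame_size(f1=10_000, r1=1_000_000, w1=400_000,
                       r2=980_000, w2=395_000)
# 585,000 bits remain now versus 600,000 in pass one, so the frame budget
# shrinks by the same ratio: f2 is 9750.0
```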

To hit frame size F2, you know that if F2 is close to F1 then you should use L = L1. If F2 is a little bigger or smaller, you should adjust L slightly. To do that, we track a running estimate of dL/dF; call it M for the slope. So we use :

L = L1 + (F2 - F1) * M

We then compress with this L which gives us a frame size F(L). We then compute the actual observed slope :

S = (L - L1) / ( F(L) - F1 )

and then update M using S via IIR or FIR.

There are a lot of kludgy things you'd need to do here, like seed M with a decent guess, don't blend in S updates if you get a weird result ( like F(L) = F1 ), forbid L from making big steps away from L1 even if the estimate really wants to, etc.
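Putting the pieces together, here is a rough sketch of the one-encode-per-frame loop with a few of those kludges. `encode_frame` is a hypothetical callback, the slope update is a simple IIR blend, and the `blend` and `max_step` defaults plus the "reject non-negative slopes" rule are my own guesses, not tuned values.

```python
def second_pass(frame_sizes_1, lambda_1, target_total, encode_frame,
                slope_seed, blend=0.25, max_step=0.1):
    """Re-encode once per frame, drifting lambda to hit target_total bits.

    frame_sizes_1: F1(i), the per-frame sizes from the pass at lambda_1.
    encode_frame(i, lam): hypothetical callback; encodes frame i at lam and
        returns its size.
    slope_seed: initial guess for M = dL/dF (normally negative).
    Returns the total size actually written.
    """
    r1 = sum(frame_sizes_1)
    w1 = w2 = 0.0
    m = slope_seed
    lo = lambda_1 * (1.0 - max_step)
    hi = lambda_1 * (1.0 + max_step)
    for i, f1 in enumerate(frame_sizes_1):
        # Desired size for this frame from the bits remaining in each pass.
        f2 = f1 * (target_total - w2) / (r1 - w1)
        # First-order guess for lambda, forbidden from straying far from L1.
        lam = lambda_1 + (f2 - f1) * m
        lam = min(max(lam, lo), hi)
        size = encode_frame(i, lam)
        # Blend in the observed slope S, skipping weird results.
        if size != f1 and lam != lambda_1:
            s = (lam - lambda_1) / (size - f1)
            if s < 0.0:  # a bigger lambda should give a smaller frame
                m += blend * (s - m)
        w1 += f1
        w2 += size
    return w2

# Hypothetical encoder model: frame size falls off as 1/sqrt(lambda).
base = [10_000.0, 12_000.0, 9_000.0, 11_000.0] * 25

def fake_encode(i, lam):
    return base[i] / lam ** 0.5

target = 0.98 * sum(base)  # ask for 2% fewer bits than pass one produced
total = second_pass(base, 1.0, target, fake_encode, slope_seed=-2.0e-4)
```

Because each frame's budget is recomputed from the bits actually remaining, errors on one frame get folded into the budgets of the frames after it, which is the "we'll make up for it on the next frame" behavior.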

Also I'm not sure if dF/dL is really the right variable to estimate ; maybe it would be better to do d(F/F1)/ d(L/L1) , or something. That is, for frames of different characters, what is the most consistent response of frame size to L variation?
