2/23/2010

02-23-10 - Image Compresson - Color , ScieLab

The last time I wrote about anything technical it was to comment on image coding perceptual targets and chroma . Let's get into that a bit more.

There are these standard weapons available to us : 1. Colorspace transform (lossy or lossless) , 2. Relative scaling of color channels, 3. Downsampling , 4. Non-flat quantization matrices.

Many image compressors use some combination of these. For example, JPEG uses YCbCr colorspace, which has a built-in down scaling of the chroma channels, also optionally downsamples chroma, and also usually uses a very high-frequency-killing quantization matrix. The result is that chroma is attacked in many ways - the DC accuracy is destroyed by the scaling in the color conversion as well as the [0] entry of the quantization matrix, and high frequency info is killed both by downsampling and the high entries in the quantization matrix.

But is this good? Obviously it's all bad in terms of RMSE (* not completely true, but close enough), so we need something that approximates the human eye's less sensitie chroma receptors.

For a long time I put off this question because it seemed the only way to attack it was by showing a ton of images to test subjects and asking "is this better?". (Furthermore, there's the ugly problem that any perceptual metric is heavy tied to viewing conditions, and without knowing the viewing conditions you may be optimizing for the wrong thing). But maybe I found a solution.

Let me be clear briefly that I am here only trying to address the issue of how the human eye sees chroma vs luma. This is not a full "psychovisual perceptual metric" which would have to account for the brain identifying areas of noise vs. areas of smoothness, repeated patterns, linear ramps, etc. Basically the only thing I'm trying to capture here is the importance of luma bits vs. chroma bits.

Well, it turns out there's this thing from color research called SCIELAB . You may be familiar with "CIE LAB" aka the "Lab color space" which is considered to be pretty close to "perceptually uniform" , that is 1 unit of distance between two Lab colors has the same perceptual error importance no matter what the two colors are. Well SCIELAB is the extension of CIELAB to images (not just single colors). You can read the paper at that link (or see links below), but the basic thing it does is very simple :

SCIELAB takes the image and transforms it to "opponent color" (luma, red-green, and blue-yellow) , which is roughly the color space that eyes use to see light (rods see luma, cones see chroma) (note that here we are transforming "pixel values" into real light values, so we have to make an assumption about the brightness and color calibration of the viewing device). In opponent color space, each channel is filtered. The filter used for each channel represents the angular resolution that a rod or cone has. Basically this is a gaussian whose sdev is proportional to the angular resolution in that channel. This depends on the DPI of the viewing device and the viewing distance (eg. how many pixels fit into one degree at the eye). The gaussian is narrow for luma, indicating good precision, and wider for chroma. The filter also has a wide negative lobe around the center peak, which captures the fact that we see values as relative to their neighborhood - eg. 100 on a background of 10 looks brighter than 100 on a background of 50.

The gaussian filters represent the probability of a photon from a given pixel hitting and activating a rod or cone. The wider filters for chroma indicate that a half-toned image in red-green or blue-yellow will be indistiguishable from the original at a much shorter distance than a half-toned luma image.

One you do this filtering, you transform back to CIELAB and then you can just do a normal MSE to create a "delta E". (CIE also defines a more elaborate more uniform "delta E" metric for LAB , but for our purposes the plain L2 distance is very close and much simpler). The result is a "SCIELAB delta E" metric that is analytic and can be used in place of MSE for comparing images. Having this SCIELAB metric now lets us try various things and it tells us whether they are perceptually better or not (in terms of optical perception of color, anyway).

So far as I know this has never been used in the mainstream image compression literature ; the only place I found it was this Stanford school project tech report : Direction-Adaptive Partitioned Block Transform for Color Image Coding . This paper is pretty interesting; they aren't actually doing anything with the DA-PBT , they're just evaluating color spaces and how to do color coding starting with a grayscale image compressor.

Let's go through the EE398 paper in detail.

First they use YCbCr because they claim it produces better scielab results than RGB. True enough, but there were a lot of other color spaces to try. Furthermore, they don't mention this, but they are using the JPEG style YCbCr, which has a built in 0.5 scaling of the chroma channels (chroma should have a range of [-256,256] but JPEG offsets and scales to put it back into [0,256]) - they have effectively killed the chroma precision by using YCBCr.

They then look at whether sub-sampling helps or not. They find it to be roughly neutral - but when you try subsampling or not subsampling you should also try optimizing all other free options (scaling of the chroma channels, quantization matrix).

The most interesting part to me is "Rate Allocation". They try giving different fractions of the bit budget to Y or CbCr. They find that optimal delta E almost always occurs somewhere around Y bits = 66% of the total , that is the bit ratios are like [4:1:1]. In order to acheive this ratio they had to use small quantization step sizes for CbCr than Y, but that is an anomaly because of the fact that the YCbCr they use has killed the chroma - if you use a non-scaling YCbCr you would find that the chroma quantization values should be *larger* than luma to acheive the 66% bit allocation. (note that using different quantization values on each channel is equivalent to scaling the channels relative to each other).

They also found that using non-uniform quantization matrices (ala JPEG) hurt. I believe this was just an anomaly of their flawed testing methodology.

This paper was the most serious study of color in image compression that I've ever seen, but is still flawed in some simple ways that we can fix. The big problem is that they make the classic blunder of many people working in compression of optimizing parameters one by one. That is, say you have a compressor with options {A,B,C}. The blunderer finds the optimal value for option A and holds that fixed, then the optimal for B, then the optimal for C. They then try out some experimental new mode for step A, and their tests show it doesn't help - but they failed to retry every option for B and C in the new mode for A. eg. for example something like downsampling might hurt if you're using YCbCr, but say you use some other color space, or scale your colors in some way, or whatever, then downsampling might help and the result of doing all those steps together may be the best configuration.

Let's go back through it carefully :

First of all, the color conversion. Let me note that we use the color conversion in image compression for really two separate purposes which are mixed up. One use is for decorrelation (or energy compaction if you prefer) - this helps compression even for lossless mode. The second is for perceptual separation of chroma from luma so that we can smack the chroma around. Obviously here we need a color transform which gives us {luma/chroma} separation - that is, we cannot use something like the KLT which doesn't necessarilly have a perceptual "luma" axis.

From my earlier color studies, I found that YCoCg produces good results, usually within 1% of the best color transform on each image, so we'll just use that. But we will be careful and use a float <-> float YCoCg which doesn't scale any of the channels.

We will then scale Y relative to CoCg. This scaling is equivalent to variable quantizers and is (one of the ways) how we will control the bit allocation to Y vs. Chroma. This scaling gives you a difference in "value resolution" , it doesn't kill high frequencies.

You can then optionally downsample chroma. Note that in naive tests I have found in the past that downsampling chroma sometimes helps visual quality; and in fact in some cases it even helps MSE measured on the RGB data. I now know that that was just an anomaly due to the fact that I wasn't considering chroma scaling. That is, downsampling was just a crude way of allocating fewer bits to chroma, which does in fact sometimes help, but if you also have the ability to change the chroma bit allocation by relative scaling of the channels, the advantage of downsampling vanishes.

I optimized the scaling of CoCg relative to Y on lots of images. Obviously the true optimum value is highly image dependent (you could compute this per image and store it with the image of course), but in most cases a scale near 0.7 is optimal if you are not downsampling, and a scale near 1.1 is close to optimal when downsampling ( 1.0 is not bad when downsampling ). When not downsampling, the optimal bit allocation is usually in the area of Y ~= 66% of the bits, as seen in the EE398 paper. When downsampling, the optimal bit allocation tends to be closer to Y = 80% of the bits. Downsampling generally hurts RGB MSE and SCIELAB delta E, but I find it sometimes helps RGB SSIM.

Obviously downsampling is resulting in more bits being used on luma, which means you'll have sharper edges and better preservation of texture and a visual appearance of more "detail", at the cost of the color values being far off. By my own examination, I often will find that if I just stare at the image made from downsampled chroma it looks "better" - eg. I see more edge detail, and it has less of that obvious appearance of being compressed, eg. less ringing artifacts, halos, stair-steps, etc. However, when I switch back and forth between the original and the compressed, the version made from downsampled chroma shows obvious color errors. The version made from non-downsampled chroma obviously has much better color preservation, but appears generally blurrier, has more block artifacts, etc. The non-downsampled version wins according to "delta E" , but by my eyes I can't really clearly say one is better than the other, they're just different errors.

The last tool we have is a non-uniform quantization matrix. NUQM lets us give more bits to the low frequencies vs. the high frequencies. Generally NUQM hurts MSE, but it might help "delta E" , because SCIELAB accounts for the "fuzziness" of human visual (insensitivity to high frequency pattern). To test this, what we need to try is various different NUQM's for both luma and chroma, as well as optimizing the relative scaling value in each case. I haven't completed this yet, but early results show that NUQM's do in fact help delta E. Note that I'm not talking about doing a per-image optimal NUQM like "dctopt" does or something, just finding something like the JPEG style skewed matrix to use globally.

Some numbers for example :


On a 512x512 color image of a face , at 1.0 bits per pixel , 
optimizing quality at constant bit rate


baseline : delta E = 2.2933

not downsampled , optimal CoCg scale = 0.625 : delta E = 2.155  (bits Y = 72%)

    downsampled , optimal CoCg scale = 1.188 : delta E = 2.381  (bits Y = 80%)

best NUQM and scaling (no downsampling) : delta E = 1.899  (bits Y = 61%)

( JPEG delta E = 2.7339 )

One thing I notice that NUQM does obviously is give a lot more bits to the DC's. In this case :


not downsampled, same cases as previous
UQM = uniform quantization matrix

 UQM , bits DC = 15.4% , Q = 14.0  , delta E = 2.155 , bits Y = 72%

NUQM , bits DC = 19.4% , Q =  7.25 , delta E = 1.899 , bits Y = 61%

Here Q is the quantizer of the DC component of Y - in the UQM case all Q's are the same (though the Q for chroma is effectively scaled). In the NUQM case the higher frequency AC components get much higher Q's. We can see from the above that because of NUQM, the quantizer for the DC can be much lower at the same bit rate.

Personal visual inspection indicates that the NUQM images just have much more "JPEG-like" artifacts. That is, they generally look more speckly. They obviously preserve flat areas and simple ramps somewhat better. The tradeoff is much worse ringing artifacts and destruction of high frequency detail like fine edges. (in my case the lower Q from NUQM also means a much weaker deblocking filter is used which may be part of the reason for more speckly appearance).

In any case, NUQM clearly helps delta E due to the ability to take bits away from the high frequency chroma data - much better than just scaling and downsampling can.

This is all very interesting and promising, but we have to ask ourselves at some point - how much do we trust this "scielab delta E" ? eg. by optimizing for this metric are we actually making better results? More and more I am convinced that the biggest thing missing from data compression is a better image quality metric (and then once you have that, you need to go back to basics and re-test all your assumptions against it in the correct way).

Color links :

Working Space Comparison sRGB vs. Adobe RGB 1998
Welcome to IEEE Xplore 2.0 Using SCIELAB for image and video quality evaluation
Video compression's quantum leap - 12112003 - EDN
Useful Color Equations
Useful Color Data
Standard illuminant - Wikipedia, the free encyclopedia
SpringerLink - Book Chapter
S-CIELAB Matlab implementation
References related to S-CIELAB
Lab color space - Wikipedia, the free encyclopedia
IEEE Xplore - Login
help - sRGB versus Adobe RGB (1998)
efg's Chromaticity Diagrams Lab Report
CIECAM02 - Wikipedia, the free encyclopedia
Chromatic Adaptation
Brian A. Wandell -- Reference Page
Ask a Color Scientist!
A top down description of S-CIELAB and CIEDE2000. Garrett M. Johnson. 2003; Color Research & Application - Wiley InterScienc
A proposal for the modification of s-CIELAB

2/10/2010

02-10-10 - Some little image notes

1. Code stream structure implies a perceptual model. Often we'll say that uniform quantization is optimal for RMSE but is not optimal for perceptual quality. We think of JPEG-style quantization matrices that crush high frequencies as being better for human-visual perceptual quality. I want to note and remind myself that actually just the coding structure actually targets perceptual quality even if you are using uniform quantizers. (obviously there are gross ways this is true such as if you subsample chroma but I'm not talking about that).

1.A. One way is just with coding order. In something like a DCT with zig-zag scan, we are assuming there will be more zeros in the high frequency. Then when you use something like an RLE coder or End of Block codes, or even just a context coder that will correlate zeros to zeros, the result is that you will want to crush values in the high frequencies when you do RDO or TQ (rate distortion optimization and trellis quantization). This is sort of subtle and important; RDO and TQ will pretty much always kill high frequency detail, not because you told it anything about the HVS or any weighting, but just because that is where it can get the most rate back for a given distortion gain - and this is just because of the way the code structure is organized (in concert with the statistics of the data). The same thing happens with wavelet coders and something like a zerotree - the coding structure is not only capturing correlation, it's also implying that we think high frequencies are less important and thus where you should crush things. These are perceptual coders.

1.B. Any coder that makes decisions using a distortion metric (such as any lagrange RD based coder) is making perceptual decisions according to that distortion metric. Even if the sub-modes are not overtly "perceptual" if the decision is based on some distortion other than MSE you can have a very perceptual coder.

2. Chroma. It's widely just assumed that "chroma is less important" and that "subsampling is a good way to capture this". I think that those contentions are a bit off. What is true, is that subsampling chroma is *okay* on *most* images, and it gives you a nice speedup and sometimes a memory use reduction (half as many samples to code). But if you don't care about speed or memory use, it's not at all clear that you should be subsampling chroma for human visual perceptual gain.

It is true that we see high frequencies of chroma worse than we see high frequencies of luma. But we are still pretty good at locating a hard edge, for example. What is true is that a half-tone printed image in red or blue will appear similar to the original at a closer distance than one in green.

One funny thing with JPEG for example is that the quantization matrices are already smacking the fuck out of the high frequencies, and then they do it even harder for chroma. It's also worth noting that there are two major ways you can address the importance of chroma : one is by killing high frequencies in some way (quantization matrices or subsampling) - the other is how fine the DC value of the chroma should be; eg. how should the chroma planes be scaled vs. the luma plane (this is equivalent to asking - should the quantizers be the same?).

1/22/2010

01-22-10 - Exponential

One of my all time pet-peeves is people who say things are "increased exponentially". Of course the worst use of all is just when people use it to mean "a lot" , eg. they're not even talking about a trend or a response curve, eg. "the 911 Turbo provides an exponential increase in power over the base spec". This abortion of usage is becoming more and more common. But even scientists and mathematicians will use "exponential" to describe a trend that's faster than linear (it's quite common in NYT math/economics/science articles for example).

Today I was reading this blog on software development organizations and my hackles immediately went up when I read :

"The real cost of complexity increases exponentially."

I started to write a snarky post about how he almost certainly meant "geometrically". But then I started thinking about it a bit more. (* correction : by "geometrically" I mean "polynomially").

Maybe software development time actually is exponential in number of features?

If you're trying to write N features and they are all completely independent, then time is O(N), eg. linear.

If each feature can be used with only one other feature, and that interaction is completely custom but independent, then time is O(N^2), eg. geometric.

Already I think there's a bit of a myth we can tackle. A lot of optimistic software devs think that they can get under even this O(N^2) complexity by making the features interact through some generic well defined pathways. eg. rather than specificially code how feature (A) and feature (B) and feature (A+B) work, they try to just write (A) and (B) but make them both aware of their circumstances and handle various cases so that (A+B) works right. The problem is - you didn't really avoid the O(N^2). At the very least, you still had to test (A+B) to make sure it worked, which meant trying all N^2 ways, so your time is still O(N^2). The code might look like it's O(N) because there's only a function for each feature, but within each function is implicit O(N^2) logic !!

What I mean by implicit logic is the blank lines which testing reveals you don't have to write! eg. :


void Feature_A( )
{

    DoStuff()

    if ( SelectionMode() )
    {
        // ... custom stuff here to make (A+C) and (A+D) work
    }

    // !!! blank lines here !
    //  this is implicit knowledge that (A+B) and (A+E) don't need custom code

}

You might argue that this is slightly less than quadratic complexity for the developer, and tester time is cheaper so we don't care if that's quadratic. But in any case it's geometric and above linear.

But is it actually exponential? What if all the features could be enabled or disabled, so the state of your code is a binary string indicating what features are on or off; eg. 1100 = A on, B on, C off, D off. Then there are in fact 2^N feature states, which is in fact exponential.

Another possibility is if the features can only be enabled one by one, but they have lingering effects. You have some shared state, like a data file you're processing, and you can do A then C then B then E on it. In that case the number of sequences is something like N! which is exponential for large N (actually super-exponential)

Let's concretely consider a real video game coding scenario.

You're trying to write N features. You are "sophisticated" so rather than writing N^2 hard-coded interactions, you make each feature interact with a shared world state via C "channels". (in the old Looking Glass speak, these C channels might be a set of standard "properites" on objects and ways to interact with those channels; in the old Oddworld Munch codebase there were C "component" types that could be on objects such as "SoundTrigger "Pressable" etc.). So your initial code writing time something like O(N*C).

But for the N features to really be meaningful, the C is ~= N (roughly proportional). (or at least C ~= log(N) , but certainly C is not constant as N increases - as you add features you always find you need more channels for them to communicate with each other). So just code writing time is something between O(NlogN) and O(N^2).

But your features also affect shared state - e.g. the "world" that the game takes place in, be that physical state, or state variables that can be flipped on other objects. If you have N objects each with K internal states, this creates K^N world states that have to be tested. Even with very small world state, if the features are order dependent, you're back to N! test cases.

If the bug rate was a constant percentage of test cases (eg. 0.1% of test cases produce a bug), then you are back to exponential number of bugs = exponential coder time. But I'm not sure that model of bugs is correct. If the bug rate was a constant percentage of lines of code, then bug rate would only be geometric.

1/17/2010

01-17-10 - Nob or Knob -

Is there a difference between Nob and Knob or are they just alternate spellings of the same thing ?

It appears that "knob" is generally considered more correct now, though in old english "nob" was more common. There are many meanings, I'll show with the spelling I prefer :

knob : dial/wheel control
knob : bump or protuberance
knob : small amount, usually of butter
knob : head of the penis
nob  : head of a man (archaic)
nob  : wealthy or upper class person (archaic)

Some people seem to think that either "nob" or "knob" are exclusively correct for the slang meaning of penis (they also disagree on whether it refers to the whole penis or just the head). It's unclear to me where the origin of this slang came from, since "nob" can either mean a person's head (archaic), which would suggest the slang came to refer to head of the penis, or "knob" can mean any protuberance.

Wiktionary seems to think "knob" is a common way to refer to a hill; perhaps this is British, I've never heard it in America.

Some weird uses :

"Nob Hill" - my guess is this common name refers to a hill where the wealthy people lived, not the fact that the hill was a knob, but I could be totally wrong about that. Some people seem to point Nob Hill at nabob but since nabob basically means the same thing as nob I don't see why you would point "Nob" at "nabob" when you could just point it at "nob".

"Hob Nob" - apparently this is completely unrelated if you believe this etymology it came from habbe nabbe

"For his nobs" in Cribbage is a funny one ; it appears to also be completely unrelated, coming from the game noddy which means simpleton, and since the knave of the same suit was important to the game it was referred to as the "knave noddy" or just "noddy" which must have become nob. (unrelated but the story of John Suckling, purported inventor of modern Cribbage, seems pretty fascinating; he was apparently a master gambler and cheater at cards who used his skill to get money beyond his station; he was involved in a plot to spring a prisoner from the tower of london, and received at least one beating at the handle of a nobleman tradgames ; ezinearticles ; wikipedia )

There are some funny uses : Nob Hill Knob Set and Nifty Nob Inc. maker of fine Knobs good job on the consistency, guys.

1/14/2010

01-14-10 - A small note on Trellis quantization

See reference .

I guess this is obvious, but you do get a pretty nice win from using the true floating point Dct results rather than the quantized Dct results when you do TQ.

I believe the standard practice (what I was doing before anyway) is to do your normal fast Dct + quantization, which takes your integer pixels and makes quantized integer post-Dct output. You then apply TQ on that quantized-and-transformed output, which means instead of sending the true output { 4,3,0,0 } you also consider {4,2,0,0 } and {4,0,0,0 } and so on, measure J(R,D) of each and pick the best.

Okay, but the distortion of changing "1" to 0 is not the same as the distortion of another 1 to 0 , if those were not really the same 1 before quantization.

For example, say you're quantizing with a quantizer = 1.0 for simplicity, and no deadzone, even bucket sizes, so you have quantization buckets :


[ -0.5, 0.5 ] -> 0
[ 0.5 , 1.5 ] -> 1
etc.

In that case, when TQ decides to take a quantized "1" and send instead a 0, if the true value was 0.51 , that's not so bad. If the true original value was 1.49 , that's a lot worse.

However, interestingly, if the true original value was 1.49 , then we could send it as a "2". If the value is near a quantization boundary, then the distortion doesn't care whether you kick the value up or down, but the rate might be significantly different, in which case you should make a choice based on J and get a win.

So my Dct now also does the true float -> float transform just for use in the distortion measurement for TQ. It's also useful in this application to make sure your Dct doesn't do any scaling, so that the transform is Unitary, that is, L2 norm is preserved. That way the distortion measure in post-Dct space is the same as the distortion measure in pre-Dct space which means you can use the same lambda for lagrange J decisions.

1/13/2010

01-13-10 - Oodle Revisited

I got some emails from a friend recently that made me start thinking about Oodle again. Friend is an indie 360 developer who just shot me a query like (paraphrasing) :

"Hey, I'm loading stuff in my game and only getting like 20 MB/sec , what gives?"

So we started trying to dig into what he was doing exactly - are you opening/reading the files with all the right flags, are you on 4k alignment, are you on a thread so you aren't just stalling out for the seeks, etc. ? In the end I think we figured out that his problem was he was reading too many small files, to which he replied :

"Oh yeah, duh, I've always packed files into bundles and loaded big chunks in previous games, but I just hadn't gotten around to it yet in this and it was bugging me that my level loads were taking so long."

! Ah ha ! To me this is where Oodle comes in as a product. It's not super hard stuff, but it's something that people always put off until the very end, which is kind of a shame because it means their level loads take forever during dev. So for three years while you're making your game you suffer through annoying slow level loads. Instead you buy Oodle on day 1 and your level loads are automagically fast.

I believe in this as a product, in the sense that I think we can make it, and it would be extremely valuable, but I'm not convinced that the game industry is mature enough to buy it. The game industry has always suffered from the mistaken thinking of "I could write that myself, so I won't pay for it". That's silly. What you should look at is what's the full cost of writing it yourself, including debugging, and perhaps most importantly including the opportunity cost of spending that time on this instead of something else, or the opportunity cost of not having feature X until you've gotten around to writing it.

For example, say you're on a one man dev team and you lay out your coding tasks for your project in order {A,B,C,D,E} . All during dev you are suffering a penalty from not having feature E done. If it would make dev a lot easier to have feature E done up front, you should pay a *lot* of money to get it immediately. Some of the classic mistakes in this vein are things like profile HUDs which people put off until the end, but would provide huge benefit if you had from the beginning; another one people aren't aware of is level save/load and memory card support - you think of that as a minor detail to do at the end, but as soon as you do it level designers are walking around with scenarios saved on memory cards to show to each other and they get a huge productivity boost, so it would have been a nice win to do early (though then you have maintenance pain).

To me the big win of Oodle is :

Client just writes code to load files one by one with plain old fopen/fread/whatever.

Behind the scenes Oodle magically robustly incrementally syncs PC files to consoles, packages files into bundles, makes prefetch lists and prefetches bundles, compresses & decompresses bundles on threads. You have ship quality fast loading from the beginning.

That's a small, simple, great product I think. The problem is to make it something compelling enough to sell we start to add lots of features : high performance paging in/out for seamless worlds, hot reloading of changed content, smooth IO integration with streaming data files like audio/video, DVD layout optimization, texture compressors, etc. which really just wind up clouding the tiny little valuable product at the core.

01-13-10 - Lagrange Rate Control Part 3

Lambda seeking. How do we choose lambda to match a desired rate? I have a few ideas.

One idea is to train up a statistical fit to make good guesses. For any given video (or other data to compress), you can create a relationship between lambda and R that's a pretty simple functional fit. There are a few ways to do this. One way would be to just go ahead and compress the video 16 times with different lambda values, observe R each time, that gives you a bunch of data points, now use those to fit a function to lambda(R) , now you can dial a lambda for any rate. If you're going to be working on the same video a lot, this might not be so bad; in particular in the games or DVD publishing business this might be tolerable.

A slight tweak on that idea is instead of just making a fit per video, instead fit to some characteristics of that video. Sample some kind of moments from the video, like do one pass with simple mocomp and measure the L2 norm of the movec and the L2 norm of the residuals. Those are just two moments. Use those to build up a fit lambda(R,m1,m2). To make this fit, we could run on a variety of sample videos to build a database of some good seeds. Furthermore, whenever a user does a run, it would add a seed to their local database. That way if they run on the same video multiple times, those moments would be matched exactly and the fit would get better. If we don't find a close enough sample point in the database to make a good fit, then we just force a test compression of the video at a given lambda and measure the R. So the end result is that rather than doing a bunch of test compressions to build a model, we just do a quick scan to sample some moments, then take the R parameter and look up the fit and get a guess for lambda. Note that rather than doing this fit per video, you could do it *per frame* which would give you a lot more dense sample data very quickly.

Another way to get a decent guess for lambda is with interpolation search. The procedure would be something like this :


pick good brackets for lambda
    high and low lambdas that you're pretty sure the desired lambda is between, but are as close together as possible
    (requires some kind of moment sampling or heuristic training)

evalue R(lambda) at the two end points and half way between them
    that's 3 compression runs

from 3 points, fit a quadratic R(lambda)

use quadratic to choose 4th point lambda
    (bias it so its not too close to the 3 we already did)

run compression at 4th point
    if that R(lambda) is close enough to target -> done

make line between 4th point and neighbor that straddles target R
use lerp to choose 5th lambda

4 or 5 evaluations of R(lambda)

In many cases with 4 runs your R may be close enough to desired rate and you are just done. The quality of interpolation search depends on how simple/smooth the R(lambda) function is. In general it is a good fit to simple curves, but you can get unlucky in your sample point choices.

Finally, if you want to hit a rate more exactly, you can do so more efficiently by relaxing the global uniform lambda requirement. Say you have some lambda which is pretty close to giving you the right rate and you want to encode to hit that rate more exactly. You let lambda drift a little bit as you code to try to adjust towards hitting the exact bit rate.

I'm not sure exactly what the ideal algorithm is for this, so if you have better ideas let me know. It's a little bit of a nasty black box feedback problem, because I'm trying to hit R by dialing L, and I have this unknown function R(L). One assumption that we will use, which most video coders use in one way or another, is that the video is locally self similar, that is, frame N is usually similar to frame N+1 , so if I observe something about the function R(L) on frame N, it is a good guess that it behaves the same way on frame N+1 ; obviously this is grossly untrue for major cuts, but those are "rare" and we just accept the innaccuracy there (you could also have a panic mode when you see you got it really wrong).

So the idea is that rather than search L around on a given frame, we use previous frames to just make a guess for L and only encode the frame once. If we get it wrong by a little bit, no big deal, we'll make up for it on the next frame. This only works when we are only trying to make small corrections.

Specifically : you did a previous pass at lambda L1 and that gave you total size R1. It also gave you a size for each frame : F1(i). You now want to hit size R2 (which is very close to R1).

At frame i you have already written W2 bits, so you have (R2 - W2) remaining. In pass one you had (R1 - W1) remaining at the same spot. Compute the desired size for the current frame as :

F2 = F1 * (R2 - W2) / (R1 - W1)

To hit frame size F2 , you know if F2 is close to F1 then you should use L = L1. If F2 is a little bigger or smaller either way you should adjust L slightly. To do that, we track a running estimate of dL/dF , call it M for the slope. So we use :

L = L1 + (F2 - F1) * M

We then compress with this L which gives us a frame size F(L). We then compute the actual observed slope :

S = (L - L1) / ( F(L) - F1 )

and then update M using S via IIR or FIR.

There are a lot of kludgy things you'd need to do here, like seed M with a decent guess, don't blend in S updates if you get a weird result ( like F(L) = F1 ), forbid L from making big steps away from L1 even if the estimate really wants to, etc.

Also I'm not sure if dF/dL is really the right variable to estimate ; maybe it would be better to do d(F/F1)/ d(L/L1) , or something. That is, for frames of different characters, what is the most consistent response of frame size to L variation?

1/12/2010

01-12-10 - Lagrange Rate Control Part 2

Okay, so we've talked a bit about lagrange coding decisions. What I haven't mentioned yet is that we're implicitly talking about rate control for video coders. "Rate control" just means deciding how many bits to put in each frame. That's just a coding decision. If the frames were independent (eg. as in Motion JPEG - no mocomp) then we know that lagrange multiplier R/D decisions would in fact be the optimal way to allocate bits to frames. Thus, if we ignore the frame-dependence issue, and if we pretend that all distortions are equal - lagrange rate control should be optimal.

What does lagrange rate control mean for video? It means you pick a single global lambda for the whole video. This lambda tells you how much a bit of rate should be worth in terms of distortion gain. Any bit which doesn't give you at least lambda worth of distortion gain will not be coded. (we assume bits are of monotonically decreasing value - the first bit is the most important, the more bits you send the less value they have). On each frame of video you want to code the frame to maximize J. The biggest single control parameter to do this is the quantizer. The quantizer will largely control the size of the frame. So you dial Q to maximize J on that frame, this sets a frame size. (maximizing J will also affect movec choices, macroblock mode choices, etc).

Frames of different content will wind up getting different Q's and very different sizes. So this is very far from constant bit rate or constant quantizer. What it is is "constant bit value". That is, all frames are of the size where adding one more bit does not help by lambda or more. Harder to code (noisy, fast motion) frames will thus be coded with more error, because it takes more bits to get the same amount of gain in that type of frame. Easy to code (smooth, flat) frames will be coded with much less error. Whether or not this is good perceptually is unclear, but it's how you get optimal D or a given R assuming our D choice is what we want.

Ideally you want to use a single global lambda for your whole video. In practice that might not be possible. Usually the user wants to specificy a certain bit rate, either because they actually need to meet a bit rate maximum (eg. for DVD streaming) , or because they want to specify a maximum total size (eg. for fitting on your game's ship DVD), or simply because that's an intuitive way of specifying "quality" that people are familiar with from MP3 audio and such. So your goal is to hit a rate. To do that with a single global lambda, you would have to try various lambdas, search them up and down, re-encode the whole video each time. You could use binary search (or maybe interpolation search), but this is still a lot of re-encodings of the whole video to try to hit rate. (* more on this later)

Aside : specifying lambda is really how people should encode videos for distribution as downloads via torrents or whatever. When I'm storing a bunch of videos on my hard disk, the limitting factor is my total disk size and the total download time - I don't need to limit how big each individual movie is. What I want is for the bits to go where they help me most. That's exactly what lambda does for you. It makes no sense that I have some 700 MB half hour show that would look just fine in 400 MB , while I have some other 700 MB show that looks like shit and could really use some more bits. Lambda is the right way to allocate hard drive bytes for maximum total viewing quality.

Okay. The funny thing is that I can't find anyone else on the web or in papers talking about lagrange video rate control. It's such an obvious thing that I expected it to be the standard way, but it's not.

What do other people do? The de-facto standard seems to be what x264 and FFMPEG do, which I'll try to roughly outline (though I can't say I get all the details since the only documentation is the very messy source code). Their good mode is two pass, so I'll only talk about that.

The primary thing they do is pick a size for each frame somehow, and then try to hit that size. To hit that frame size, they search QP ( the quantization parameter) a bit. The specifically only search QP in the local neighborhood of the previous QP because they want to limit QP variation between frames (the range of search is a command line parameter - in fact almost everything in this is a command line parameter so I'll stop saying that). When they choose a QP, there's a heuristic formula for H264 which specifies a lambda for lagrange decisions that corresponds to that QP. Note that this lambda is only used for inside-the-frame coding decisions, not for choosing QP or global rate allocation. Also note that the lambda-QP relationship is not necessarily optimal; it's a formula (there are a bunch of H264 papers about making good lambda-QP functional fits and searches). They also do additional funny things like run a blurring pass on QP to smooth out variation; presumably this is a heuristic for perceptual purposes.

So the main issue is how do they pick this frame size to target? So far as I can tell it's a mess of heuristics. For each frame they have a "complexity" measure C. On the first pass C is computed from entropy of the delta or some such thing, raised to the 0.8 power (purely heuristic I believe). The C's are then munged by some fudge factors (that can be set on the command line) - I frame sizes are multiplied by a factor > 1 that makes them bigger, B frame sizes are multipled by a factor < 1 that makes them smaller. Once all the "complexities" are chosen, they are all scaled by (target total size) / (total complexity) to make them add up to the total target size. This sets the desired size for each frame.

Note that this heuristic method has many of the same qualitative properties as full lagrangian allocation - that is, more complex frames will get more bytes than simpler frames, but not *enough* more bytes to give them the same error, so more complex frames will have larger errors than simpler frames. However, quantitatively there's no gaurantee that it's doing the same thing.

So far as I can tell the lagrange method is just better (I'm a little concerned about this because it just seems to vastly obviously better that it disturbs me that not everyone is doing it). Ignoring real world issues we've glossed over, the next big problem is the fact that we have to do this search for lambda, so we'll talk about that next time.

ADDENDUM : x264/ffmpeg rate control references :

ratecontrol.txt - Loren is the main dev but this is very cursory
FFmpeg RateControlContext Struct Reference
FFmpeg libavcodecratecontrol.c Source File
[Ffmpeg-user] changing bitrate on the fly - detailed rate control document, but not written by one of the main devs so beware
x264 Macroblock Tree Ratecontrol testing (committed) - Doom9's Forum - this is about the dependency issue that we haven't discussed

01-12-10 - Lagrange Rate Control Part 1

I'm going to talk about video coding and code stream design. It's important to remember the end goal - we want to make a coded bit stream which is <= N bytes and plays back with the highest quality possible. In all cases I assume the decoder is a fixed known quantity.

First of all it's useful to consider what you would do if you had infinite CPU time. The answer is you should try all coded bit streams. That is, to make the optimal N byte stream, you should make 256 ^ N streams, run each through the decoder, measure the error somehow, and choose the best. That sounds ridiculous, but it is our goal; everything else we talk about will be approximations that are trying to get as close as possible to that.

Let's try to get into the realm of reality. You have some encoder that you think produces good code streams for your given decoder. In various places in your encoder you have decision points - what quantizer to use, what macroblock mode to use, whether to send this value as its true self or as any other value, etc. If you had lots of CPU power you should do a brute force search on all decisions - try every decision all possible ways; if there are M decisions and each has K choices this is K^M. Reject code streams produced that are bigger than N. Measure the error of the results and choose the best.

Okay, that's still ridiculously impractical. But let's consider a simpler case. Say you only have two coding decisions, and most importantly - they are independent, that is one decision does not affect the other. For example say you are coding two separate images, and you are only allowed to choose the quantizer on each image, and your goal is to minimize the total error. If there are K quantizer choices, a full search would mean trying K^2 ways. But in that's not necessary because of independence. All you have to do code the first image K ways & remember the rate (R) and distortion (D) of each way, then the second image K ways and remember R/D, so you have 2K codings. Now to find the optimal combination you could do K^2 table lookups, but even here you can usually do much less, because the R/D tables are monotonic (or very nearly monotonic) which means rather than doing K^2 you only need to pick an entry on each side and then slide each one up and down to search for improved D given the R constraint, which is O(K). This obviously extends to N independent coding choices ; rather than K^N we can do K*N codings.

The simplest way to do this in practice is with a lagrange multiplier framework. Rather than look for optimal D given R, we look for an optimal lagrange cost : J = D + lambda * R.

If J is maximized by our coding choices, it means dJ/dCode = 0. That means 0 = dD/dC + lambda * dR/dC , or dD/dR = - lambda

That is, lambda has selected a slope of the D(R) curve. If D(R) and dD/dR are both monotonic, then this has uniquely selected a rate & encoding. We shall henceforth assume that this monotonicity requirement is true. In practice it is not quite true but it is "mostly" true which we can deal with in hacky ways (* perhaps more on this niggle some day). With this assumption, if your goal is to hit a certain rate, you can monotonically search lambda to find an encoding that maximizes J at that rate.

Now the key point is that maximizing J for a given rate gives you the optimal encoding for independent coding choices. Consider our example of the two images above. Instead I make a quantizer choice from the K choices on each image independently. On each one I measure J and choose the one that maximizes J. (note that since R and D are monotonic I can actually just do a binary search here that's O( log(K) ) not O(K) ). Because they are independent, I have found the maximum J for the full stream. This is the optimal encoding. Why? Because it has made the slope dD/dR equal at each of the coding choices. That is, I can't move bits from one image to the other and get a better distortion at the same rate. If the slopes were not equal, then I would get a win from moving the bits away from where they affect D less to where they affect D more.

This is the same as before, but the win in terms of code flow and structure should be obvious. Before I had to do a bunch of encodings and then go back and consider different bit allocations to find the best. Now I can do one linear pass through my codings and make the optimal J decision at each step, and then after a decision is made I can forget about it and move on to the next and never revisit it. The disadvantage is now that my encoding is parameterized by the unintuitive lambda parameter rather than just R, the rate; if I want to hit a specific rate I have to search lambda in an outer loop (* perhaps more on this later).

Now we're going to make a big leap of faith and see what happens if we try to use this sort of one-pass lagrange optimization on more complicated real world scenarios. What goes wrong? Two big issues. One is that D might not be independent, the other is that coding is not independent.

To be clear, our proposal is to do one-pass lagrange coding decision making. Encode the data in streaming one pass scan. At each decision point, you try all possible ways for the current subset of the data only. You measure J on the current subset and take the decision that maximizes J on that subset. You then do the encoding using that decision and output the encoded bits, and move on to the next subset. R (the rate) can of course be measured on each subset independently (if arithmetic coding, you can count the fractional bits left in the state as well). D (the distortion) must be some measure that can be done locally.

What about D (the distortion measure) ? Well, if D is SSD (sum of square differences) (aka L2 error or MSE), then it is in fact independent from pixel to pixel or block to block, which means it is fully correct to measure D only locally on each decision. But you might object that this is not a good way to measure distortion. I'm going to ignore this for now and just posit that we choose D to be SSD and we're going to optimize for that. (* more on this later).

The big issue is that in the real world our coding decisions are not independent. A huge affect with video coding is of course motion compensation - blocks can be used as reference for later blocks. That means a coding decision now can have affects far in the future. Even if we ignore that and just talk about image coding, blocks affect the coding of other blocks through statistical coding - be it huffman, arithmetic, or context coding. Typically this has an affect of the form : if I chose mode A for the current block, that has the side effect of making mode A cheaper for future blocks, and all other modes slightly more expensive. I contend that the statistical affects on the future are not a huge issue. (you may recall when we talked about LZ77 optimal parsing, I made the same hand-wavey argument to dismiss this issue). There is one big issue about the statistical effects - early on in coding when statistics are sparse, decision 1 can have a huge affect on the statistics which heavily biases decision 2 and leads you far away from the true optimal choice. To stop this, the statistics for decision making purposes should be preconditioned with some "normal" data to reduce the affect of strange early decisions. Note that the statistics used for actual coding need not be the same as the statistics for decision making.

Sometimes a purely greedy coding decision can be very bad, though. Consider the case of something like trellis quantization of DCT coefficients. The true output of the DCT is X,Y,Z. But you need not transmit those values, you can transmit anything and it will just have some rate and distortion. For example, say the output is 0,2,3,0,X,0,0,0,0 . How should you transmit X ? If you're using something like run-length coding or end-of-block signalling, there is a big advantage to sending X as 0. You can't know that unless you could see the future after X. If you only saw 0,2,3,0,X, and didn't know what followed you wouldn't see that. This is a well known issue with quantization of trailing DCT coefficients, but it's also an issue with things like macroblock mode decisions in video coding. The reason is that mode equal to previous mode is coded so much smaller than any other mode, when you make a decision you very strongly affect not only your current rate, but also the following block. One remedy to this is "semi-non-greedy lookahead" type coding as is well known from LZ77 coders; that is, instead of just doing one coding decision, you try all ways for the next 2 or 3 or 4, choose the one that optimizes all of them together, and then discard all but the current one and step ahead one step. Okay I'm going to sort of ignore this issue for now, but we should keep it in the back of our mind that purely greedy forward-scan decision making can be pretty far from optimal.

The other big issue for video is motion compensation. Yeah, that's a big issue, but it's not specific to this type of code stream optimization, so I'm not going to discuss it now. It's an issue that you must address in any video coder that makes decisions. The way to address it in this framework is to choose some scaling factor for "D" on each block. You compute J = c * D + lambda * R , where normally c would be 1, instead you bias it up or down depending on whether you decide the current block is more or less important than average. (blocks which will be sources for future motion compensation should be considered more important in the forward greedy pass decision making). (* perhaps more on this later).

To be continued ...

12/18/2009

12-18-09 - The ACM

There's a bunch of hubbub lately in the game/graphics community about the anti-scientific actions of the ACM.

I find it all much ado about nothing. Of *course* the ACM is a bunch of cocks that are against scientific progress and dissemination of information. Ooo the group of evil bastards who claim all the rights to the paper that you yourself wrote are being dicks !? Well big fucking surprise. Woop-tee-do.

Everyone should boycott the ACM and the IEEE. Duh. The paper publications have absolutely zero purpose now. The conferences have almost no purpose, it's just a bunch of hob-nobbing and back-slapping. The internet is how real information is conveyed. I'm not quite sure why people are so enamored of conferences. I can get way more information in 10 minutes reading papers on the internet than you can get from a 4 day conference, because I am pulling the information I want at the pace that I want, not having it pushed to me.

Now, peer review and collection of papers does in fact provide a service. You don't just want researchers putting up papers on their own web sites with no organization and no peer review. (for example, Arxiv is great and all, but the lack of organization and peer review makes it weak as a primary publishing location).

If you actually cared about getting away from the evil iron grip of the ACM and IEEE bastards, you should work harder to organize and promote the free online journals, of which there are many, but they haven't caught on because the majority of major researchers are smitten by the prestige of the ACM/IEEE names.

Go to DOAJ , pick a journal, support that. Or you know, start a new one focused on higher performance / realtime computer graphics.

old rants