10-22-10 - Some notes on Chroma Sampling

First some motivation before we dig into this :

Take bmp, downsample convert to YUV 420 (specifically YCbCr 601), upsample and convert back to RGB, measure RMSE. (yes I know RGB rmse is not really what you want, more on this later).

Testing "my_soup" :

ffmpeg bmp -> y4m -> bmp , default options :
rmse : 5.7906 

ffmpeg -sws_flags +accurate_rnd+full_chroma_int+full_chroma_inp
rmse : 2.3052

my_y4m : 

RGB -> chroma then boc :
rmse : 3.3310

box then RGB -> chroma :
rmse : 3.3310

box rgb FindBestY :
rmse : 3.1129

lsqr solved YCbCr for box :
rmse : 3.1129

box rgb (decoder spill) :
rmse : 3.0603

linear-down linear-up :
rmse : 2.7562

down-for-linear-up :
rmse : 2.0329

down-for-linear-up _RGB FindBestY :
rmse : 1.8951

solve lsqr for linear up :
//float solution rmse : 1.6250
rmse : 1.7400

Clearly there is a lot of win to be had from good chroma sampling. Let's talk a bit about what's going on.

(BTW the ffmpeg results don't directly compare to my_y4m, they're just there for reference and sanity check; for example I use symmetric-centered (JPEG style) 420 subsampling and I think they use offset MPEG-style 420 subsampling ; I'm also still not sure if they are in 16-235 or 0-255 , I am definitely in 0-255 ).

First of all just in terms of basic filtering, you might do your operation like this :

"encode" :

RGB -> floats
float RGB -> matrix multiply -> YUV
UV plane -> filters -> downsampled
quantize & clamp to [0,255] ints
transmit YUV 420

"decode" :

UV plane -> floats -> filters -> upsampled
YUV -> matrix multiply -> float RGB
quantize & clamp to [0,255] ints
write RGB bmp

So, you might think that's a reasonable way to go, and try various filters. I did experiments on this before (see this blog post and in particular the comment at the end). I found that fancy filters don't really help much, that the best thing is just doing bilinear reconstruction and a special optimized downsample filter that I call "down for bilinear up".

But once you think about it, there's no reason that we should follow this particular process for making our YUV. In particular, downsampling in chroma space does some very weird things that you might not expect. A lot of the weirdness comes from the fact that the RGB we care about is clamped in the range [0-255]. And our advantage is that we have Y at higher resolution.

So let's start looking at better ways to do chroma sampling.

Our first good link is this : Twibright Luminaplex and Hyperluma

His writing is very weird, so I'll paraphrase briefly. One of the problems with chroma subsampling is that it's not light linear. eg. averaging Cb and Cr does not produce a resulting color which is the average of what your eye will see. Instead of subsampling CbCr , you should instead solve for the YCbCr which produces the light-linear color that you would see for the average of those 2x2 chromas. The easiest way to do this is just to subsample CbCr is in some way, and then instead of computing Y from the original RGB, you turn it into a solve to find the Y, given CbCr, that produces the best result.

The next good idea from the "Twibright" page is just to abandom the idea of computing the YCbCr from a matrix in general. We know what the decoder will do to upsample, so instead we just take the idea that our encoder should output the coefficients which will make the best result after upsampling. In the "Hyperluma" algorithms, he sets his goal as preserving constant actual luma ( luma is not "Y" , it's actual perceived brightness). Basically he does the chroma subsample using RGB, and then given the low-res chroma and the original high-res RGB, solve for the Y that gives you the correct luma.

The "Luminaplex" algorithm takes this one step further and does a brute force optimization. In the end with compression if you want absolute optimality it always comes down to this - wiggle the discrete values, see what the output is, and take the best. (we saw this with DXT1 as well).

Implementing these ideas gives me the "down-for-linear-up _RGB FindBestY" results above.

On the Twibright page he claims that he can implement Luminaplex for box upsampling and still get great results. I found that to not be true on most real images, you need at least bilinear upsampling. To solve for the optimal YCbCr for a given RGB target image, you have a large coupled linear system. That is, for each pixel, you have to consider the current CbCr and also the neighbors, and for those neighbors, you have to consider their neighbors. This is a sparse semi-diagonal linear system. In particular, you are solving a linear system like this :

find x to minimize | A x - b |

b = the target RGB values
 (b has 3 * # of pixels values)

x = the YCbCr values for the output
 w*h of Y, (w*h)/4 each of Cb and Cr

A = the YCbCr -> RGB matrix plus upsample filter
    3*w*h rows
    (3/2)*w*h columns

for each R and B row, there are 5 non-zero entries
  (1 for Y and 4 for Cr or Cb)
for each G row there are 9 non-zero entries
  (1 for Y and 4 for Cr and 4 for Cb)

you can solve this reasonably easily with a standard sparse matrix linear solver. Note that in this form we are directly solving for minimum RGB RMSE , but you can of course solve for other metrics (that are linear transformations of RGB anyway). The fours in the number of terms come from the four-tap box filter; bigger filters have more non-zero terms and so make the matrix much fatter in memory and slower to solve.

If you implement this you get "solve lsqr for linear up" , in the results above. Note that this has the problem mentioned in the last post. I actually want to solve for discrete and clamped YCbCr, but it's too hard, so I just solve the equation as if they are continuous, and then round to the nearest int and clamp to 0-255. To improve this, I actually re-find Y from Chroma after the round-and-clamp. The loss from going discrete is this bit :

//float solution rmse : 1.6250
rmse : 1.7400

I believe that this is as good as you can do for a decoder which operates in the simple linear way. That is, up to now we have assumed that we have a generic decoder, so we can use these techniques to make optimal baseline-compatible JPEGs or H264 videos or whatever. But there are things we can do on the decode side as well, so let's look at that.

The most interesting link I've seen is this page by Glenn Chan : Towards Better Chroma Subsampling

Glenn makes the realization that many of the YCbCr produced by normal subsampling are actually impossible. That is, they can't come from any RGB in [0-255]. When you see an impossible YCbCr, you know it was caused by downsampling, which means you know that some chroma was put on your pixel that should have been on a neighbor's pixel. For the moment let's pretend we have box sampling. In a 2x2 block we have some black pixels and some red pixels. When you box downsample you will get the average of the red chroma (call it Cr = 1) and the black chroma (Cr = 0). When you box upsample you will get Cr = 0.5 over all four pixels. But not you have a pixel with Y = 0 and Cr = 0.5 ; the only way to make a zero Y but with some red chroma would be for it to have negative G and/or B. So this must be a mistake - when we see Y = 0 and Cr = 0.5, we know that the chroma on this pixel must haved "spilled" onto us from our neighbor incorrectly. To fix it, we just take our unwanted Cr and push it over to our neighbor, and we get a perfect result - the Y = 0 gets Cr = 0 and is black, and the Cr = 0.5 red pixel gets pushed up to Cr = 1.

Glenn works out how much chroma a pixel can hold for a given Y. One way to think about this is to think of the RGB->YCbCr as a rotation (+ shear and scale, but you can think of it as rotation for our purposes). You've taken the RGB axial box and have put a new box around it in a rotated space. To completely cover the range of the original box, we have to use a much bigger box in this new space. The result is a large amount of empty space in the YCbCr box which did not come from the original RGB box. Handling this better is a general problem for all compressors with color conversions - we often code YCbCr as if they have full range, but in fact after we have seen Y we know that the range for CbCr might be much smaller.

There's another way of getting the same result, which is to use the fact that we know our Y is more reliable than our CbCr. That is, use your YCbCr to reproduce RGB. Now see if the RGB are all in [0-255] , if they are you are fine. If not, you have to clamp them. Now recompute Y from RGB (something like 0.2R + 0.7G + 0.1B). Because of the clamping, this will now be wrong, eg. not match the transmitted Y. So what we are doing is ensuring that the Y of the output RGB is equal to the Y we transmitted. To acheive that, we adjust CbCr so that we are not clamping the RGB.

On some very bad cases, the win from the "spill" decoder is massive :

on "redlines" :
alternating vertical black and red lines :

ffmpeg default :
rmse : 144.9034

ffmpeg -sws_flags +accurate_rnd+full_chroma_int+full_chroma_inp
rmse : 94.7088

my_y4m box filter :
rmse : 101.9621

my_y4m bilinear :
rmse : 101.3658

my_y4m box spill :
rmse : 7.5732


The main limitation of Glenn's method is that it only helps when you are pushing pixels into illegal values. eg. the black next to red example above was helped enormously, but if it was instead grey next to red, then no illegal value would have been made and we would have done nothing. (eg. on my_soup it only gave us 3.1129 -> 3.0603)

The other problem with Glenn's method is that it is rather slow in the decoder, too slow for something like video (but certainly okay for a JPEG loader in Photoshop or something like that).

There are some simplifications/approximations of Glenn's method which would probably be viable in video.

One is to compute an approximate "chroma capacity" based on Y, and then for each 2x2 box upsample, instead of putting the chroma onto each pixel with equal weight, you do it weighted by the chroma capacity. Chroma capacity is a triangle function of Y, something like min( Y, 255-Y ). So a 2x2 box upsample adjusted for capacity is just :

given subsampled chroma Cr & Cb
and non-subsampled Y's (Y_0,1,2,3)

unadjusted box upsample is just :
Cr_n = Cr
Cb_n = Cb

adjusted box upsample is :

CC_n = min( Y_n, 255 - Y_n )
CC_n *= 4 / ( CC_0 + CC_1 + CC_2 + CC_3 )

Cr_n = CC_n * Cr
Cb_n = CC_n * Cb

(this is similar to his "proportion" method but much simplified). On the synthetic case of the red and black quad, this produces the same results as the more expensive method. On real images it's slightly worse.

Another approach to accelerating Glenn's method would be to just go ahead and do the YCbCr->RGB on each pixel, and then when you do the clamp (which you must do anyway), use that branch to spill to neighbors, and compute the spill amount directly from how far your RGB is pushed out of [0..255] , eg. if B is -4 , then (-4 / Cb_to_B) worth of Cb goes onto the neighbor.

I've only implemented the "spill" method for box sampling, but you can do it for any time of upsampling filter. It's a little awkward though, as you have to implement your upsampler in a sort of backwards way; rather than iterating on the high res pixels and sampling from the subsampled CbCr plane with some filter and accumulating into each output pixel only once, instead you need to iterate over the low res subsampled CbCr and create the filter output and add it into various target pixels.

There's one final method to look at which we've not implemented yet. Glenn's method is of course a form of "luma aided chroma upsample", but it's not what I'm usually refering to when I say that. What we usually mean is using the luma edge information to select the chroma upfilter. That is, rather than always doing bilinear chroma upsample or bicubic or sinc or whatever, you do a decision tree on the luma pixels and choose one of several filters which have various shapes. This is actually a variant of the old "super resolution" problem. We have a low res signal and wish to make the best possible high res output, possibly with the help of some side information. In this case we have the high res Y plan as side information which we believe is well correlated; in many of the "super resolution" papers in the literature what they have is previous video frames, but the techniques are quite similar. I've seen some papers on this but no implementation (though I hear some TV's actually have a hacky version of this called "warpscan" or something like that).

Training filters for various edge scenarios is relatively straightforward, so I might do that in the next few days.

BTW one scary issue for all these fancy filters is if you don't control the decoder or know its exact spec. In particular, TV's and such that are doing fancy chroma upsamplings now means your manipulations could make things worse.

Also something I have observed with chroma sampling is that minimizing any simple analytic metric like RMSE can lead to ringing. This is a very generic issue in lossy compression (it's roughly the same reason that you get ringing in wavelets). In order to reduce a very large single pixel error, the encoder will apply something like a sinc filter shape to that pixel. That might cut the error at that pixel from 100 to 50, which is a huge win, but it also adds a low magnitude ringing around that pixel (maybe magnitude 5 or so). In RMSE terms this is a big win, but visually it's very bad to create that ringing around the single bad pixel, better to just leave its big error alone. (nonlinear error metrics might help this, but nonlinear error metrics don't lead to simple to solve linear matrix equations)

The best links :

Twibright Luminaplex and Hyperluma
Towards Better Chroma Subsampling

Some good links related to chroma sampling and color : psx h4x0rz in teh wired YCbCr to RGB Conversion Showdown
psx h4x0rz in teh wired Immaculate Decoding
Marty Reddy Color FAQ
Chroma Sampling An Investigation
hometheaterhifi - chroma upsampling error
COMPSCI708S1T CBIR - Colour Features
CiteSeerX � Optimal image scaling using pixel classification

And some really unrelated links that I just happened to have found in the last few days :

LIVE lab on image quality
Live Chart Playground - Google Chart Tools Image Charts (aka Chart API) - Google Code
JPEG Post Processing
Shape-Adaptive Transforms Filtering Pointwise SA-DCT algorithms
Perl jpegrescan - Dark Shikari - Pastebin.com


Anonymous said...

We know what the decoder will do to upsample, so instead we just take the idea that our encoder should output the coefficients which will make the best result after upsampling.

I've long argued this as an approach to consider for mipmap generation: we know they're reconstructed with bilerp, so optimize the downsample of the mipmaps around reconstructing as best as possible after bilerp.

Of course, we don't actually display texture mipmaps upsampled this way, so it's maybe not actually a good argument (instead they are better off with the bilerp resampling being a good approximation to a good-looking downsampling/blur of the source data, in which case we're back where we started).

Anonymous said...

Also, I should mention I actually *tried* this, using RMSE, and observed the ringing problem you reported.

I tried augmenting the linear programming problem with monotinicity constraints:

If x_0 < x_1 < x_2 < x_3, then require downsampled x_01 < x_23, but I don't recall ever getting any good results from this. Also monotinicity may not be good enough, maybe you want to make sure that if a gradient is smooth it stays smooth, not just monotonic; e.g. if the derivative (finite difference) over six samples is montonically increasing, than so should the derivative (finite difference) over the the "corresponding" downsampled ones. (Using sloppy pixel-is-a-little-square-box-filter terminology.)

old rants