3/18/2010

03-18-10 - Physics

Gravity is a force that acts proportionally to the product of the two masses. (let's just assume classical Newtonian gravity is in fact the way the universe works)

People outside of science often want to know "but why?" or "how exactly? what is the mechanism? what carries the force?". At first this seems like a reasonable question : you don't just want to have these rules, you want to know where they come from, what they mean exactly. But if you think a bit more, it should be clear that these questions are absurd.

Let's say you know the fundamental physical laws. These are expressed as mathematical rules that tell you the behavior of objects. Say for example we lived in a world with only Newtonian dynamics and gravity and that is all the laws. Someone asks "but what *is* gravity exactly?". I ask : How could you ever know? What could there ever be that "is" gravity? If something were facilitating the force of gravity, there would have to be some description of that thing, some new law to describe this carrier of gravity. Then you would ask "well where does this rule for the carrier of gravity come from?" and you would need yet another rule. Say you said "gravity is carried by the exchange of gravitons" ; then of course they could ask "why is there a graviton, what makes gravitons exactly, why do they couple in this way?" etc.

The fundamental physical laws cannot be explained by anything else.

That's almost a tautology because that's what I mean by "fundamental" - you take all the behavior of the universe, every little thing, like "I pushed on this rock with 10 pounds of force and it went 5 meters per second". You strip away every single law that can be explained with some other law. You strip and strip and finally you are left with a few laws that cannot be explained by anything else. These are the fundamental laws and there is no "why" or "how" for them. In fact the whole human question of "how" is imprecise; what we really should say is "what simpler physical law can explain this phenomenon?". And at some point there is no more answer to that.

Of course this is assuming that there *is* a fundamental physical law. Most physicists assume that to be true without questioning it, but I wrote here at cbloom.com long ago that in fact the set of physical laws might well be infinite - that is, maybe we will find some day that the electrical and gravitational force can be explained in terms of some new law which also adds some new behaviors at very small scale (if it didn't add new behaviors it would simply be a new expression of the same law and not count), and then maybe that new law is explained in terms of another new law which also adds new behaviors, etc. ad infinitum - a Russian doll of physical laws that never ends. This is possible, and furthermore I contend that it is irrelevant.

There is a certain human need to know "why" the physical laws are as they are, or to know the "absolute" "fundamental" laws - but I don't believe there's really much merit to that at all. What if they do finally work out string theory, and it explains all known phenomena for a while, but then we find that there is a small error in the mass of the Higgs boson on the order of ten to the minus one billion, which tells us there must be some other physical law that we don't yet know? The fact that string theory then is only a very good model of the universe and not the "absolute law" of the universe changes nothing except our own silly human emotions in response to it (and surely crackpots would rise up and say that since it's not "100% right" then there must be angels and thetans at work).

What if we found laws that explained all phenomena that we know of perfectly? We might well think those laws are the "absolute fundamental" laws of the universe. But how would we ever know? Maybe there are other phenomena that can't be explained by those laws that we simply haven't seen yet. Maybe those other phenomena could *never* be seen! (for example there may be another entire set of particles and processes which have zero coupling to our known matter). The existence of these unexplained phenomena does not reduce the merit of the laws you know, even though they are now "not complete" or "don't describe all of nature".

It's funny to think about how our intuition of "mass" was screwed up by the fact that we evolved on the earth in a high gravity environment where we inherently think of mass as "weight" - eg. something is heavy. There's this thing which I will call K. It's the coefficient of inertia, it's how hard something is to move when you apply a certain force to it. F = K A if you will. Imagine we grew up in outer space with lots of large electrical charges around. If we apply an electric field to two charges of different K, one moves fast and one moves slow, the difference is the constant K. It's a very funny thing that this K, this resistance to changes of motion, is also the coupling to the gravitational field.

3/10/2010

03-10-10 - Distortion Measure

What are the things we might put in an ideal distortion measure? This is going to be rather stream-of-consciousness rambling, so beware. Our goal is to make output that "looks like" the input, and also that just looks "good". Most of what I talk about will assume that you are running "D" on something like 4x4 or 8x8 blocks and comparing it to "D" on other blocks, but of course it could be run on a gaussian windowed patch, just some way of localizing distortion on a region.

I'm going to ignore the more "macroscopic" issues of which frame is more important than another frame, or even which object within a frame is more important - those are very important issues I'm sure, but they can be added on later, and are beyond the scope of current research anyway. I want to talk about the microscopic local distortion rating D. The key thing is that the numerical value of D assigns a way to "score" one distortion against another. This not only lets you choose the way your error looks on a given block (choosing the one with lowest score obviously), it also determines how your bits are allocated around the frame in an R/D framework (bits will go to places that D says are more important).

It should be intuitively obvious that just using D = SSD or SAD is very crude and badly broken. One pixel step of numerical error clearly has very different importance depending on where it is. How might we do better ?

1. Value Error. Obviously the plain old metric of "output value - input value" is useful even just as a sanity check and regularizer ; it's the background distortion metric that you will then add your other biasing factors to. All other things being equal you do want output pixels to exactly match input pixels. But even here there's a funny issue of what measure you use. Probably something in the L-N norms (L1 = SAD, L2 = SSD). The standard old world metric is L2, because if you optimize for D = L2, then you will minimize your MSE and maximize your PSNR, which is the goal of old literature.

The L-N norms behave differently in the way they rate one error vs another. The higher N is, the more importance it puts on the largest error. eg. L-infinity only cares about the largest error. L-2 cares more about big errors than small ones. That is, L2 makes it better to change 100->99 than 1->0. Obviously you could also do hybrid things like use L1 and then add a penalty term for the largest error if you think minimizing the maximum error is important. I believe that *where* the error occurs is more important than what its value is, as we will discuss later.
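Just to make the norm behavior concrete, here's a tiny sketch (standalone toy code, not from any codec; the error patterns are made up) that rates two error sets with the same total absolute error under different L-N norms :

#include <cstdio>
#include <cmath>
#include <vector>
#include <initializer_list>

// Generalized L-N "distortion" over per-pixel absolute errors.
// N = 1 is SAD, N = 2 is the square root of SSD, large N approaches L-infinity.
static double LN_norm(const std::vector<double> & absErr, double N)
{
    double sum = 0.0;
    for (double e : absErr)
        sum += std::pow(e, N);
    return std::pow(sum, 1.0 / N);
}

int main()
{
    // two hypothetical error patterns, both with total abs error = 16 :
    std::vector<double> spread  = { 4, 4, 4, 4 };    // error spread evenly
    std::vector<double> clumped = { 13, 1, 1, 1 };   // one big clumped error

    for (double N : { 1.0, 2.0, 8.0 })
        printf("N=%g : spread=%.2f clumped=%.2f\n", N, LN_norm(spread, N), LN_norm(clumped, N));

    // L1 can't tell them apart at all; as N grows the clumped pattern
    // gets penalized more and more heavily.
    return 0;
}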

2. DC Preservation. Changes in DC are very noticeable. Particularly in video, the eye is usually tracking mainly one or two foreground objects; what that means is that most of the frame we are only seeing with our secondary vision (I don't know a good term for this, it's not exactly peripheral vision since it's right in front of you, but it's not what your brain is focused on, so you see it at way lower detail). All this stuff that we see with secondary vision we are only seeing the gross properties of, and one of those is the DC. Another issue is that if a bunch of blocks in the source have the same DC, and you change one of them in the output, that is sorely noticeable.

I'm not sure if it's most important to preserve the median or the mean or what exactly. Usually people preserve the mean, but there are certainly cases where that can screw you up. eg. if you have a big field of {80} with a single pixel spike on it, you want to preserve that background {80} everywhere no matter what the spike does in the output. eg. {80,80,255,80,80} -> {80,80,240,80,80} is better than making it go -> {83,83,240,83,83} even though the latter has better mean preservation.
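Here's a little sketch of that example in code (same numbers as above; "mean vs median preservation" as a framing is my own shorthand) :

#include <cstdio>
#include <cmath>
#include <vector>
#include <algorithm>

static double mean(std::vector<int> v)
{
    double s = 0; for (int x : v) s += x;
    return s / v.size();
}

static double median(std::vector<int> v)
{
    std::sort(v.begin(), v.end());
    return v[v.size() / 2];
}

int main()
{
    std::vector<int> src  = { 80, 80, 255, 80, 80 };
    std::vector<int> outA = { 80, 80, 240, 80, 80 };  // background preserved exactly
    std::vector<int> outB = { 83, 83, 240, 83, 83 };  // background shifted to chase the mean

    printf("mean error   : A=%.1f B=%.1f\n", std::fabs(mean(outA) - mean(src)), std::fabs(mean(outB) - mean(src)));
    printf("median error : A=%.1f B=%.1f\n", std::fabs(median(outA) - median(src)), std::fabs(median(outB) - median(src)));

    // mean preservation prefers B, but the median (the flat background)
    // says A is the right answer, which matches what the eye sees.
    return 0;
}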

3. Edge Preservation. Hard edges, especially long straight lines or smooth curves, are very visible to humans and any artifact in them stands out. The importance of edges varies though; it has something to do with the length of the edge (longer edges are more major visual features) and with the contrast range of the region around the edge : eg. an edge that separates two very smooth sections is super visible, but an edge that's one of many in a bunch of detail is less important (preserving the detail there is important, but the exact shape of the edge is not). eg. a patch of grass or leaves might have tons of edges, but their exact shape is not crucial. An image of hardwood floor has tons of long straight parallel edges and preserving those exactly is very important. The border between objects is typically very important.

Obviously there's the issue of keeping the edges that were in the original and also the issue of not making new edges that weren't in the original. eg. introducing edges at block boundaries or from ringing artifacts or whatever. As with edge preservation, the badness of these introduced edges depends on the neighborhood - it's much worse to make them in a smooth patch than one that's already noisy. (in fact in a noisy patch, ringing artifacts are sort of what you want, which is why JPEG can look better than naive wavelet coders on noisy data).

4. Smooth -> Smooth (and Flat -> Flat). Changing smooth input to not smooth is very bad. Old coders failed hard on this by making block boundaries. Most new coders now handle this easily inherently either because they are wavelet or use unblocking or something. There are still some tricky cases though, such as if you have a smooth ramp with a bit of gaussian noise speckle added to it. Visually the eye still sees this as "smooth ramp" (in fact if you squint your eyes the noise speckle goes away completely). It's very important for the output to preserve this underlying smooth ramp; many good modern coders see the noise speckle as "detail" that should be preserved and wind up screwing up the smooth ramp.

5. Detail/Energy Preservation. The eye is very sensitive to whether a region is "noisy" or "detailed", much more so than exactly what that detail is. Some of the JPEG style "threshold of visibility" stuff is misleading because it makes you think the eye is not sensitive to high frequency shapes - true, but you do see that there's "something" there. The usual solution to this is to try to preserve the amount of high frequency energy in a block.

There are various sub-cases of this. There's true noise (or real life video that's very similar to true noise) in which case the exact pixel values don't matter much at all as long as the frequency spectrum and distribution of the noise are reproduced. There's detail that is pretty close to noise, like tree leaves, grass, water, where again the exact pixels are not very important as long as the character of the source is preserved. Then there's "false noise" ; things like human hair or burlap or bricks can look a lot like noise to naive analysis metrics, but are in fact patterned texture, in which case messing up the pattern is very visible.

There are two issues here - obviously there's trying to match the source, but there's also the issue of matching your neighbors. If you have a bunch of neighboring source blocks with a certain amount of energy, you want to reproduce that same energy across the patch in the output - you don't want to have single blocks with very different energy, because they will stand out. Block energy is almost like DC level in this way.
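A crude version of an energy preservation term might look something like this (my own toy formulation : compare the AC energy of source vs output block, regardless of the exact pixels) :

#include <cstdio>
#include <cmath>
#include <vector>

static double block_mean(const std::vector<double> & b)
{
    double s = 0; for (double x : b) s += x;
    return s / b.size();
}

// AC energy = sum of squared deviations from the DC (mean).
static double ac_energy(const std::vector<double> & b)
{
    double m = block_mean(b), s = 0;
    for (double x : b) s += (x - m) * (x - m);
    return s;
}

// candidate distortion term : how much detail/noise energy was lost or added
static double energy_error(const std::vector<double> & src, const std::vector<double> & out)
{
    return std::fabs( std::sqrt(ac_energy(src)) - std::sqrt(ac_energy(out)) );
}

int main()
{
    std::vector<double> noisy    = { 90, 110, 95, 105, 88, 112, 93, 107 };
    std::vector<double> smoothed = { 100, 100, 100, 100, 100, 100, 100, 100 };

    // same DC, modest per-pixel errors, but all the "detail" energy is gone :
    printf("energy error = %.2f\n", energy_error(noisy, smoothed));
    return 0;
}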

6. Dynamic range / sdev Preservation. Of course related to previous metrics, but you can definitely see when the dynamic range of a region changes. On an edge it's very easy to see if a high contrast edge becomes lower contrast. Also in noise/detail areas the main things you notice are the DC, the amount of noise, and the range of the noise. One reason it's so visible is because of optical fusion and its effect on DC brightness. That is, if you remove the bright specks from a background it makes the whole region look darker. Because of gamma correction, {80,120} is not the same brightness as {100,100}. Now theoretically you could do gamma-corrected DC preservation checks, but there are difficulties in trying to be gamma correct in your error metrics since the gamma remapping sort of does what you want in terms of making changes of dark values relatively more important; maybe you could do gamma-correct DC preservation and then scale it back using gamma to correct for that.
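The {80,120} vs {100,100} thing is easy to verify; a quick sketch (assuming a plain 2.2 power-law gamma, which is an approximation) :

#include <cstdio>
#include <cmath>

// convert an 8-bit code value to linear light with a simple power-law gamma
static double to_linear(double codeValue, double gamma = 2.2)
{
    return std::pow(codeValue / 255.0, gamma);
}

int main()
{
    double a = 0.5 * ( to_linear(80)  + to_linear(120) );
    double b = 0.5 * ( to_linear(100) + to_linear(100) );
    printf("linear-light mean of {80,120}  = %.4f\n", a);  // ~0.134
    printf("linear-light mean of {100,100} = %.4f\n", b);  // ~0.128

    // the high-contrast pair is actually brighter in linear light, which is
    // why killing the bright specks visibly darkens a region even when the
    // code-value mean is preserved.
    return 0;
}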

It's unclear to me whether the important thing is the absolute [low,high] range, or the statistical width [mean-sdev,mean+sdev]. Another option would be to sort the values from lowest to highest and look at the distribution; the middle is the median, then you have the low and high tails on each side; you sort of want to preserve the shape of that distribution. For example the input might have the high values in a kind of gaussian falloff tail with most values near median and fewer as it gets higher; then the output should have a similar distribution, but exactly matching the high value is not important. The same block might have all of its low values at exactly 0 ; in that case the output should also have those values at exactly 0.
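One cheap way to compare the shape of the distribution (again just a sketch of the idea, not something I've tuned) is to sort both blocks and diff the sorted arrays - that ignores *where* the values are and only looks at the distribution and its tails :

#include <cstdio>
#include <cmath>
#include <vector>
#include <algorithm>

// assumes a and b are blocks of the same size
static double distribution_error(std::vector<int> a, std::vector<int> b)
{
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    double s = 0;
    for (size_t i = 0; i < a.size(); i++)
        s += std::fabs( (double)a[i] - (double)b[i] );
    return s / a.size();
}

int main()
{
    std::vector<int> src  = { 0, 0, 0, 90, 100, 110, 120, 200 };

    std::vector<int> out1 = { 200, 120, 110, 100, 90, 0, 0, 0 };    // same values shuffled
    std::vector<int> out2 = { 20, 20, 20, 90, 100, 110, 115, 150 }; // zeros lifted, tails pulled in

    printf("shuffled : %.2f\n", distribution_error(src, out1));  // 0 : distribution identical
    printf("squashed : %.2f\n", distribution_error(src, out2));  // big : lost the exact 0's and the range
    return 0;
}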


Whatever all the final factors are, you are left with how to scale them and combine them. There are two issues on scaling : power and coefficient. Basically you're going to combine the sub-distortions something like this :


D = Sum_n { Cn * Dn^Pn }

Dn = distortion sub type n
Cn = coefficient n
Pn = power n

The power Pn lets you change the units that Dn are measured in; it lets you change how large values of Dn contribute vs. how small values contribute. The coefficient Cn obviously just overall scales the importance of each Dn vs. the other D's.

It's actually not that hard to come up with a good set of candidate distortion terms like I did above, the problem is once you have them (the various Dn) - what are the Cn and Pn to combine them?
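Just so the shape of the thing is clear, the combine step itself is trivial; the hard part is the numbers you feed it. Something like this (the sub-terms and the Cn/Pn values here are pure placeholders) :

#include <cstdio>
#include <cmath>
#include <vector>

struct DistortionTerm
{
    const char * name;
    double D;   // Dn : raw sub-distortion value
    double C;   // Cn : relative importance
    double P;   // Pn : power, controls how big vs small values of Dn count
};

// D = Sum_n { Cn * Dn^Pn }
static double combine(const std::vector<DistortionTerm> & terms)
{
    double D = 0;
    for (const DistortionTerm & t : terms)
        D += t.C * std::pow(t.D, t.P);
    return D;
}

int main()
{
    std::vector<DistortionTerm> terms = {
        { "value error (L1)",     12.0, 1.0, 1.0 },
        { "DC error",              2.0, 4.0, 2.0 },  // small DC shifts punished hard
        { "edge error",            3.0, 2.0, 1.5 },
        { "energy (detail) loss",  5.0, 1.5, 1.0 },
    };
    printf("combined D = %.2f\n", combine(terms));
    return 0;
}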

3/08/2010

03-08-10 - Distortion and Bit Allocation

I now know that rate allocation is by far the most important thing in video. It's obviously important in a lot of things, but in video you just have so many bits and so much flexibility in where you put them, and there are lots of psychovisual phenomena that don't exist in images (due to motion, eye adaptation, feature tracking, and so on, because the eye notices changes over time). In fact I conjecture that you could take a really shitty old coder like MPEG2 and make videos that beat anything currently in existence with just better rate allocation.

What can rate allocation do ?

1. Move bits to the source of predictions. That is, code some frame (or part of a frame) better than normal because it will be used as a mocomp source in the future. This is actually a purely mathematical win and would apply without any psychovisual consideration. A lot of people do this in semi-heuristic ways, but of course those can make lots of mistakes (for example there may be cases where increasing the bit assignment to a block might actually make it a worse source for the future, eg. the future might be a better match to the block with more distortion; also starving the future might cause it to no longer choose that block as a source, etc). Some people move bits around while holding all the block mode decisions and movecs constant, which at least lets you converge, but of course you should consider all possible bit moves and all possible mode changes.

2. Move bits from frame to frame to make some frames look better and some look worse. Move bits around the frame to make parts look better and parts look worse. In general choosing where to put your error.

There's also a related issue which is not exactly rate allocation but is very similar. In lossy coders like video coders you often have a choice of what your error looks like. That is, for the same distortion (in a numerical sense) you could make different shapes of error, through choosing different block modes, choosing different movecs, or more globally choosing quantizers or quantization matrices. This often ties into rate allocation because it involves how you make your free choices in the encoder :

3. What the distortion looks like. In particular, if you make some amount of error (in an SSD or SAD sense (aka L2 or L1 norm)) what does that error look like? what is the shape of it?

Now, in a lagrangian framework the main thing driving all these decisions is just the D metric in J = R + lambda D. If you change D, it changes where bits get put. D determines how important you think one type of error is vs. another type of error.
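To be concrete about what that means mechanically (toy code with made-up numbers; a real encoder obviously gets R and D by actually coding each candidate) :

#include <cstdio>
#include <vector>
#include <initializer_list>

struct Candidate
{
    const char * name;
    double R;   // bits this coding choice costs
    double D;   // whatever distortion measure you believe in
};

// pick the candidate minimizing J = R + lambda * D
static const Candidate & rd_decide(const std::vector<Candidate> & c, double lambda)
{
    size_t best = 0;
    for (size_t i = 1; i < c.size(); i++)
        if ( c[i].R + lambda * c[i].D < c[best].R + lambda * c[best].D )
            best = i;
    return c[best];
}

int main()
{
    // hypothetical block mode candidates :
    std::vector<Candidate> modes = {
        { "skip",         2.0, 900.0 },
        { "inter 16x16", 40.0, 150.0 },
        { "intra 4x4",  180.0,  30.0 },
    };

    // change lambda - or change how D is computed - and the winner changes :
    for (double lambda : { 0.05, 0.5, 5.0 })
        printf("lambda=%.2f -> %s\n", lambda, rd_decide(modes, lambda).name);
    return 0;
}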

Just as an example, say you ran face detection on your video, then you could assign face regions to all your frames, and any error in the face region could be counted as extra important - if you put this into your "D" metric, then the lagrangian coder automatically gives those areas more bits. But that kind of example is rather banal. There are obviously tons of human error-importance issues that you could try to account for, having to do with what objects are most important in the frame, where the motion is, what kind of errors are particularly appalling, etc etc.

Purely numerical error distribution can be important : say you have an error of 3 somewhere and an error of 20 somewhere else. You have bits to change each by 1. Should you change the 3 to 2 or the 20 to 19 ? Well, it depends on their neighborhoods, but I think more often than not you should do the 3->2. That will be more visually noticeable. Using L1 or L2 (or L-N for whatever other N's) causes you to make different decisions in these cases. Most simplistically you can see it as a continuum between minimizing the total abs error (L1) vs minimizing the maximum error (L-infinity). That is, the issue of whether you have clumpy error or spread out error is a pretty big one.
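You can see the norms disagreeing right in this example. Here's the marginal gain of fixing each error by one step (toy arithmetic, using the un-rooted sum of |e|^N) :

#include <cstdio>
#include <cmath>
#include <initializer_list>

// how much the total sum of |e|^N drops if an error of magnitude e shrinks by 1
static double gain(double e, double N)
{
    return std::pow(e, N) - std::pow(e - 1.0, N);
}

int main()
{
    for (double N : { 1.0, 2.0, 4.0 })
        printf("N=%g : fixing 3->2 gains %.0f , fixing 20->19 gains %.0f\n", N, gain(3, N), gain(20, N));

    // L1 is indifferent (1 vs 1), L2 strongly prefers 20->19 (39 vs 5), and
    // higher norms prefer it even more - so if the 3->2 fix is the one that
    // is actually visible, none of these norms will make it for you.
    return 0;
}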

The thing holding back development is the lack of a procedure for measuring real "quality". The problem is that changing distortion to change your bit allocation for psychovisual purposes will by definition hurt your abstract measures. (hacky changes to D might hurt RMSE but help SSIM, but in that case I would say some of the change was not "psychovisual" - the part of the change which helps SSIM is in fact an analytical change to improve a certain metric). At some point you have to be able to make a decision that you will allocate bits in such a way that your video will look worse to computers, but will look better to humans. (with our current shitty computer analysis models).

x264 and others have a bit of a solution for this - they use a kind of "crowd sourcing" (bleck web 2.0 buzz word, I feel like I just vomited in my own mouth a little). They can put beta features in their code and they have mobs of fan-boys who will download betas and try them on lots of videos and then post results on the forums. This gives you lots of real human eyes saying "this looks better" or not for attempts at psychovisual improvements. But I don't think you can really make big developments using that technique - you can only make small heuristic stabs in the dark and then find out if they were okay, because the turnaround time for results from the crowd is too long, and if you release too many dead ends for them to test they will stop doing it, so you have to be reasonably sure it is a good change before publishing it to the crowd, etc. It's not the kind of thing a researcher needs, which is a black box where I can throw videos and say "which looks better to a human".

The result is that we are mostly stabbing in the dark and occasionally getting lucky.

3/03/2010

03-03-10 - Image Compression - Color , ScieLab - Part 2

Follow up to the last post on color.

First a correction : what I said about downsampling there is mostly wrong. I made the classic amateur's blunder of testing on too small a data set and drawing conclusions from it. I'm a little embarrassed to make that mistake, but hey this is a blog not a research journal. Any expectations of rigor are unfounded. For example this is one of the test images I ran on that convinced me that downsample was bad :


aikmi
-i7 qtable ; CoCg optimized joint for min SCIELAB

downsample :

   262,144 ->    32,823 =  1.001 bpb =  7.986 to 1 (per pixel)
Q : 11.0000  Co scale = Cg Scale = 1.525
bits DC : 19636|5151|3832 , bits AC : 175319|38483|19879
bits DC = 10.9% bits AC = 89.1%
bits Y = 74.3% bits CoCg = 25.7%
rmse : 7.3420 , psnr : 30.8485
ssim : 0.9134 , perc : 73.3109%
scielab rmse : 2.200

no downsample :

   262,144 ->    32,679 =  0.997 bpb =  8.021 to 1 (per pixel)
Q : 12.0000  Co scale = Cg Scale = 0.625
bits DC : 19185|13535|9817 , bits AC : 160116|39407|19091
bits DC = 16.3% bits AC = 83.7%
bits Y = 68.7% bits CoCg = 31.3%
rmse : 6.9877 , psnr : 31.2781
ssim : 0.9111 , perc : 72.9532%
scielab rmse : 1.980

you can see that on this image downsample is just much worse - worse in RMSE/PSNR, and severely worse in SCIELAB, even though SCIELAB doesn't weight chroma differences as heavily as luma. This particular image has a lot of high-detail color content, and the downsampled version looks significantly worse; it's easy to pick out visually.

However, in general this is not true, and in fact downsample is often a small win.

Without further ado I present lots of stats :

Each row is : file name, then these column groups, left to right :

  i0 Cg=1 Co=1                                     : rmse, scielab
  i0 Cg=0.6 Co=0.575                               : rmse, scielab
  i7 Cg=0.6 Co=0.575                               : rmse, scielab
  i4/i7 opt per image                              : rmse, scielab
  i7 CoCg optimized independently per image        : Co, Cg, rmse, scielab
  i7 CoCg optimized jointly per image, downsampled : Co/Cg, rmse, scielab
kodim01 12.6809 4.8898 12.5848 4.8413 12.6567 4.3415 12.7018 4.238 0.455 0.455 12.623 4.3153 1.225 12.486 4.2525
kodim02 6.235 2.1961 6.1733 2.1793 6.2836 2.0519 6.2544 1.9542 0.58 0.58 6.2285 1.978 1.3375 6.4866 1.9841
kodim03 4.0098 1.7135 3.974 1.7173 4.0621 1.5587 3.9778 1.5883 0.705 0.83 4.0853 1.5359 1.6 4.1235 1.6102
kodim04 6.3981 2.4661 6.3658 2.4929 6.4083 2.2579 6.4083 2.2579 0.705 0.705 6.4092 2.248 1.5625 6.3698 2.1977
kodim05 14.2903 7.2293 14.0531 7.1756 14.1613 6.5253 14.2296 6.452 0.58 0.58 14.167 6.5291 1.5625 13.9658 6.4326
kodim06 8.9416 3.6338 8.836 3.5923 8.9622 3.2131 9.0316 3.1608 0.455 0.58 8.9664 3.2184 1.3 8.8455 3.1733
kodim07 5.147 2.316 5.1145 2.1919 5.2338 2.0167 5.2388 1.9815 0.58 0.58 5.202 2.0047 1.225 5.1601 1.9462
kodim08 14.6964 7.5082 14.5479 7.5237 14.5675 6.8769 14.6411 6.7521 0.58 0.83 14.5726 6.8285 1.4875 14.3053 6.692
kodim09 4.4789 1.8149 4.439 1.8574 4.5303 1.675 4.5303 1.675 0.705 0.955 4.5467 1.6359 1.4125 4.5389 1.6906
kodim10 4.9926 2.0932 4.9477 2.1196 5.0678 1.9887 5.0398 1.9514 0.58 0.955 5.0585 1.9109 1.6 5.0449 1.9556
kodim11 7.9484 3.2677 7.9006 3.2315 8.0441 2.9234 8.0441 2.9234 0.58 0.58 8.0478 2.9276 1.375 7.939 2.858
kodim12 4.6495 1.8486 4.6326 1.8529 4.7335 1.6862 4.7259 1.6663 0.58 0.705 4.7041 1.6776 1.2625 4.7001 1.6457
kodim13 18.5372 8.3568 18.3502 8.2634 18.5334 7.2841 18.6579 7.1262 0.455 0.58 18.5013 7.2697 1.1125 18.381 7.2327
kodim14 11.076 4.8628 10.972 4.7473 11.0146 4.3268 11.064 4.2636 0.58 0.58 11.0151 4.3308 1.3 10.9818 4.3614
kodim15 5.8269 2.4099 5.8082 2.4665 5.9134 2.2246 5.8383 2.2457 0.705 0.705 5.9158 2.2098 1.525 5.8699 2.1497
kodim16 5.689 2.3266 5.6289 2.3199 5.7372 2.0534 5.7372 2.0534 0.58 0.58 5.7373 2.055 1.375 5.6667 2.0276
kodim17 5.5166 2.3244 5.47 2.2994 5.6716 2.0774 5.5853 2.0874 0.455 0.705 5.6523 2.0574 1.4125 5.6014 2.037
kodim18 10.8501 4.8609 10.7131 4.7903 10.9517 4.3169 10.9639 4.2627 0.58 0.705 10.9266 4.3006 1.3375 10.8048 4.2189
kodim19 7.1545 2.8338 7.0872 2.8518 7.2311 2.4977 7.2637 2.4362 0.58 0.705 7.2158 2.4758 1.5625 7.1314 2.4396
kodim20 4.7872 1.8258 4.7183 1.8042 4.9208 1.6441 4.863 1.6524 0.455 0.83 4.9265 1.6306 1.1875 4.9427 1.656
kodim21 7.7757 3.3671 7.6338 3.3427 7.9293 3.0078 7.8541 3.0018 0.705 0.705 7.9204 2.95 1.3 7.7688 2.9302
kodim22 8.279 3.2205 8.1788 3.1253 8.3292 2.8656 8.3542 2.8114 0.455 0.58 8.3026 2.8379 1.45 8.267 2.8436
kodim23 3.917 1.5567 3.8968 1.5138 3.953 1.4315 3.961 1.4157 0.58 0.58 3.9481 1.4146 1.6 4.3382 1.573
kodim24 10.9877 5.2479 10.8105 5.0477 11.0256 4.6141 11.0435 4.5882 0.455 0.455 11.0413 4.6005 1.3375 10.9372 4.503
totals : 194.86 84.17   192.84 83.35   195.92 75.46   196.01 74.54   195.71 74.94   194.65 74.41   (rmse, scielab sums for each column group)

explanation :

output bit rate 1 bpb in all cases
parameters are optimized to minimize E = ( 2 * SCIELAB + 1 * RMSE )
RMSE is on RGB
SCIELAB is perceptual color difference metric

i0 = flat quantization matrix
i7 = tweaked perceptual quantization matrix to minimize E
i4/i7 = optimized blend of flat to perceptual matrices


The table reads roughly left to right in terms of decreasing perceptual error.  

"i0 Cg=1 Co=1" : flat q-matrix, standard lossless YCoCg transform without extra scaling

"i0 Cg=0.6 Co=0.575" ; optimize CoCg scale for E ; interestingly this also helps RMSE

"i7 Cg=0.6 Co=0.575" ; non-flat constant Q-matrix ; hurts RMSE a bit, helps SCIELAB a lot

"i4/i7 opt per image" ; per-image non-flat Q-matrix ; not a big difference

"i7 CoCg optimized independently per image" : independently optimize Co and Cg for each image

"i7 CoCg optimized jointly per image downsampled" : downsample test, CoCg optimized with Co=Cg

On the full kodak set, downsampling is a slight net win. There are a few cases (kodim03,kodim23) where it hurts a lot like I saw before, but in most cases it is a slight win or close to neutral. The conclusion is that given the speed benefit, you should downsample. However there are occasional cases where it will hurt a lot.

I think most of the results are pretty intuitive and not extremely dramatic.

It's a little non-intuitive what exactly is going on with the per-image customized chroma scales. Your first thought might be "well those images have different colors in them, so the color space scale is adapting to the color content in the image". That's not so. For one thing, more or less content of a certain color doesn't mean you need a different color space - it just means that that band of the color space will get more energy, and thus more bits. e.g. an image that has lots of "Co" component colors will simply have more energy in the Co plane - that doesn't mean scaling Co either up or down will help it.

If you think about the scaling another way it's more obvious what's going on. Scaling the color planes is equivalent to using different quantizers per plane. Optimizing the scalings is equivalent to doing an R/D optimization of the quantizer of each plane. Thus we see what the scaling is doing : it's taking bits away from hard to code planes and moving them to easier to code planes (in an R/D slope sense).
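Here's a cartoon of that equivalence (completely made-up R/D numbers, and a simple greedy allocator rather than anything real) : give each plane a ladder of (rate, distortion) operating points and repeatedly refine whichever plane currently buys the most distortion reduction per bit. A plane that is expensive and pays off badly - the "ocean" plane - gets starved automatically :

#include <cstdio>
#include <vector>

struct Plane
{
    const char * name;
    // operational points, coarsest quantizer first : (rate, distortion) pairs
    std::vector<double> rate;
    std::vector<double> dist;
    size_t level;   // current operating point
};

int main()
{
    std::vector<Plane> planes = {
        // an "ocean-like" plane : expensive, little payoff per bit
        { "Y-noisy", { 100, 180, 300, 500 }, { 90, 80, 72, 66 }, 0 },
        { "Co",      {  20,  40,  70, 120 }, { 60, 40, 28, 22 }, 0 },
        { "Cg",      {  15,  30,  55,  95 }, { 50, 32, 20, 15 }, 0 },
    };

    double budget = 300;   // total extra bits to hand out
    double spent  = 0;

    for (;;)
    {
        int bestPlane = -1; double bestSlope = 0;
        for (int i = 0; i < (int)planes.size(); i++)
        {
            Plane & p = planes[i];
            if ( p.level + 1 >= p.rate.size() ) continue;
            double dR = p.rate[p.level + 1] - p.rate[p.level];
            double dD = p.dist[p.level]     - p.dist[p.level + 1];
            double slope = dD / dR;    // distortion bought per bit
            if ( spent + dR <= budget && slope > bestSlope ) { bestSlope = slope; bestPlane = i; }
        }
        if ( bestPlane < 0 ) break;
        Plane & p = planes[bestPlane];
        spent += p.rate[p.level + 1] - p.rate[p.level];
        p.level++;
        printf("refine %s (slope %.3f), spent %.0f\n", p.name, bestSlope, spent);
    }
    return 0;
}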

In particular, when I visually inspected some of the more extreme cases (cases where the per-image optimized scales were a big win vs. a constant overall scale, such as kodim10) what I found was that the optimized scalings were taking bits *away* from the dominant colors. One very obvious case was on photos of the ocean. The ocean is mostly one color and is very hard to code (expensive in an R/D sense) because it's all choppy and random. The optimized scaling took bits away from the ocean and moved them to other colors that had more R/D payoff.

(BTW rambling a bit : I've noticed that x264 Psy VAQ tends to do the same kind of thing - it takes bits away from areas that are a really noisy mess, such as water, and moves them to areas that have smooth pattern and edges. Intuitively, you can guess that if an area is a mess and just really hard to code then you should just say "fuck it" and starve it for bits even if MSE R/D tells you it wants bits. I think also that improving an area from an RMSE of 4 to 2 is better than improving from 10 to 7, even though it's less of a distortion win. Visually there's a big difference that occurs when an area goes from "looks good" to "looks noisy" , but not much of a difference when an area goes from "looks bad" to "looks really bad").

So this is in fact not really a surprising result. We know already that heavy R/D bit allocation can do wonders for lossy compressors. There are lots more areas to explore - optimization of every coefficient in the quantization matrix, optimization of the color transform, optimization of the transform basis functions, etc. etc. - and in each case you need to be clever about the way you encode the extra rate control side information.

ADDENDUM : I thought I should write up what I think are the useful takeaway conclusions :

1. It is crucial to do the right kind of scaling to Co/Cg (or chroma more generally) depending on whether you downsample or not. In particular the way most people just turn downsample on or off and don't compensate by scaling chroma is a mistake, eg. not a fair comparison, because their scaling will be tuned for one or the other.

2. Downsample vs. no-downsample is pretty close to neutral. If you downsample for speed, that's probably fine. There are rare cases where it does hurt a whole lot though.

3. Using a non-flat Q matrix does in fact help perceptual quality significantly. And it doesn't hurt RGB RMSE nearly as much as it helps SCIELAB (helps SCIELAB by 10.35 % , hurts RMSE by 1.58 % ).

4. It does appear acceptable to use global tweaked values all the time rather than custom tweaking to each image. Custom tweaks do give you another bit of benefit, but it's not huge, thus not worth the very slow optimization step. (see DCTune eg)