5/12/2010

05-12-10 - P4 By Dir

(ADDENDUM : see comments, I am dumb).

I mentioned this before :

(Currently that's not a great option for me because I talk to both my home P4 server and my work P4 server, and P4 stupidly does not have a way to set the server by local directory. That is, if I'm working on stuff in c:\home I want to use one env spec and if I'm in c:\work, use another env spec. This fucks up things like NiftyPerforce and p4.exe because they just use a global environment setting for server, so if I have some work code and some home code open at the same time they shit their pants. I think that I'll make my own replacement p4.exe that does this the right way at some point; I guess the right way is probably to do something like CVS/SVN does and have a config file in dirs, and walk up the dir tree and take the first config you find).

But I'm having second thoughts, because putting little config shitlets in my source dirs is one of the things I hate about CVS. Granted it would be much better in this case - I would only need a handful of them in my top level dirs, but another disadvantage is my p4bydir app would need to scan up the dir tree all the time to find config files.

And there's a better way. The thing is, the P4 Client specs already have the information of what dirs on my local machine go with what depot mappings. The problem is the client spec is not actually associated with a server. What you need is a "port client user" setting. These are stored as favorites in P4Win, but there is no authoritative list of the valid/good "port client user" setups on a machine.

So, my new idea is that I store my own config file somewhere that lists the valid "port client user" sets that I want to consider in p4bydir. I load that and then grab all the client specs. I use the client specs to identify what dirs to map to where, and the "port client user" settings to tell what p4 environment to set for that dir.

I then replace the global p4.exe with my own p4bydir so that all apps (like NiftyPerforce) will automatically talk to the right connection whenever they do a p4 on a file.

05-12-10 - Cleartype

Since I ranted about Cleartype I thought I'd go into a bit more detail. this article on Cleartype in Win7 is interesting, though also willfully dense.

Another research question we�ve asked ourselves is why do some people prefer bi-level rendering over ClearType? Is it due to hardware issues or is there some other attribute that we don�t understand about visual systems that is playing a role. This is an issue that has piqued our curiosity for some time. Our first attempt at looking further into this involved doing an informal and small-scale preference study in a community center near Microsoft.

Wait, this is a research question ? Gee, why would I prefer perfect black and white raster fonts to smudged and color-fringed cleartype. I just can't imagine it! Better do some community user testing...

1. 35 participants. 2. Comments for bi-level rendering: Washed out; jiggly; sketchy; if this were a printer, I�d say it needed a new cartridge; fading out � esp. the numbers, I have to squint to read this, is it my glasses or it is me?; I can�t focus on this; broken up; have to strain to read; jointed. 3. Comments for ClearType: More defined, Looks bold (several times), looks darker, clearer (4 times), looks like it�s a better computer screen (user suggested he�d pay $500 more for the better screen on a $2000 laptop), sort of more blue, solid, much easier to read (3 times), clean, crisp, I like it, shows up better, and my favorite: from an elderly woman who was rather put out that the question wasn�t harder: this seems so obvious (said with a sneer.)

Oh my god, LOL, holy crap. They are obviously comparing Cleartyped anti-aliased fonts to black-and-white rendered TrueType fonts, NOT to raster fonts. They're probably doing big fonts on a high DPI screen too. Try it again on a 24" LCD with an 8 point font please, and compare something that has an unhinted TrueType and an actual hand-crafted raster font. Jesus. Oh, but I must be wrong because the community survey says 94% prefer cleartype!

Anyway, as usual the annoying thing is that in pushing their fuck-tard agenda, they refuse to acknowledge the actual pros and cons of each method and give you the controls you really want. What I would like is a setting to make Windows always prefer bitmap fonts when they exist, but use ClearType if it is actually drawing anti-aliased fonts. Even then I still might not use it because I fucking hate those color fringes, but it would be way more reasonable. Beyond that obviously you could want even more control like switching preferrence for cleartype vs. bitmap per font, or turning on and off hinting per font or per app, etc. but just some more reasonable global default would get you 90% of the way there. I would want something like "always prefer raster font for sizes <= 14 point" or something like that.

Text editors are a simple case because you just to let the user set the font and get what they want, and it doesn't matter what size the text is because it's not layed out. PDF's and such I guess you go ahead and use TT all the time. The web is a weird hybrid which is semi-formatted. The problem with the web is that it doesn't tell you when formatting is important or not important. I'd like to override the basic firefox font to be my own choice nice bitmap font *when formatting is not important* (eg. in blocks of text like I make). But if you do that globally it hoses the layout of some pages. And then other pages will manually request fonts which are blurry bollocks.

CodeProject has a nice font survey with Cleartype/no-Cleartype screen caps.

GDI++ is an interesting hack to GDI32.dll to replace the font rendering.

Entropy overload has some decent hinted TTF fonts for programmers you can use in VS 2010.

Electronic Dissonance has the real awesome solution : sneak raster fonts into asian fonts so that VS 2010 / WPF will use them. This is money if you use VS 2010.

5/11/2010

05-11-10 - Some New Cblib Apps

Coded up some new goodies for myself today and released them in a new cblib and chuksh .

RunOrActivate : useful with a hot key program, or from the CLI. Use RunOrActivate [program name]. If a running process of that program exists, it will be activated and made foreground. If not, a new instance is started. Similar to the Windows built-in "shortcut key" functionality but not horribly broken like that is.

(BTW for those that don't know, Windows "shortcut keys" have had huge bugs ever since Win 95 ; they sometimes work great, basically doing RunOrActivate, but they use some weird mechanism which causes them to not work right with some apps (maybe they use DDE?), they also have bizarre latency semi-randomly, usually they launch the app instantly but occasionally they just decide to wait for 10 seconds or so).

RunOrActivate also has a bonus feature : if multiple instances of that process are running it will cycle you between them. So for example my Win-E now starts an explorer, goes to existing one if there was one, and if there were a few it cycles between explorers. Very nice. Also works with TCC windows and Firefox Windows. This actually solves a long-time useability problem I've had with shortcut keys that I never thought about fixing before, so huzzah.

WinMove : I've been using this forever, lets you move and resize the active window in various ways, either by manual coordinate or with some shorthands for "left half" etc. Anyway the new bit is I just added an option for "all windows" so that I can reproduce the Win-M minimize all behavior and Win-Shift-M restore all.

I think that gives me all Win-Key functions I actually want.

ADDENDUM : One slightly fiddly bit is the question of *which* window of a process to activate in RunOrActivate. Windows refuses to give you any concept of the "primary" window of a process, simply sticking to the assertion that processes can have many windows. However we all know this is bullshit because Alt-Tab picks out an isolated set of "primary" windows to switch between. So how do you get the list of alt-tab windows? You don't. It's "undefined", so you have to make it up somehow. Raymond Chen describes the algorithm used in one version of Windows.

5/09/2010

05-09-10 - Some Win7 Shite

Perforce Server was being a pain in my ass to start up because the fucking P4S service doesn't get my P4ROOT environment variable. Rather than try to figure out the fucking Win 7 per-user environment variable shite, the easy solution is just to move your P4S.exe into your P4ROOT directory, that way when it sees no P4ROOT setting it will just use current directory.

New P4 Installs don't include P4Win , but you can just copy it from your old install and keep using it.

This is not a Win7 problem so much as a "newer MS systems" problem, but non-antialiased / non-cleartype text rendering is getting nerfed. Old stuff that uses GDI will still render good old bitmap fonts fine, but newer stuff that uses WPF has NO BITMAP FONT SUPPORT. That is, they are always using antialiasing, which is totally inappropriate for small text (especially without cleartype). (For example MSVC 2010 has no bitmap font support (* yes I know there are some workarounds for this)).

This is a huge fucking LOSE for serious developers. MS used to actually have better small text than Apple, Apple always did way better at smooth large blurry WYSIWYG text shit. Now MS is just worse all around because they have intentionally nerfed the thing they were winning at. I'm very disappointed because I always run no-cleartype, no-antialias because small bitmap fonts are so much better. A human font craftsman carefully choosing which pixels should be on or off is so much better than some fucking algorithm trying to approximate a smooth curve in 3 pixels and instead giving me fucking blue and red fringes.

Obviously anti-aliased text is the *future* of text rendering, but that future is still pretty far away. My 24" 1920x1200 that I like to work on is 94 dpi (a 30" 2560x2600 is 100 dpi, almost the same). My 17" lappy at 1920x1200 has some of the highest pixel density that you can get for a reasonable price, it's pretty awesome for photos, but it's still only 133 dpi which is shit for text (*). To actually do good looking antialiased text you need at least 200 dpi, and 300 would be better. This is 5-10 years away for consumer price points. (In fact the lappy screen is the unfortunate uncanny valley; the 24" at 1920x1200 is the perfect res where non-atialiased stuff is the right size on screen and has the right amount of detail. If you just go to slightly higher dpi, like 133, then everything is too small. If you then scale it up in software to make it the right size for the eye, you don't actually have enough pixels to do that scale up. The problem is that until you get above 200 dpi where you can do arbitrary scaling of GUI elements, the physical size of the pixel is important, and the 100 dpi pixel is just about perfect). (* = shit for anti-aliased text, obviously great for raster fonts at 14 pels or so).

( ADDENDUM : Urg I keep trying to turn on Cleartype and be okay with it. No no no it's not okay. They should call it "Clear Chromatic Abberation" or "Clearly the Developers who thing this is okay are colorblind". Do they think our eyes only see luma !? WTF !? Introducing colors into my black and white text is just such a huge visual artifact that no amount of improvement to the curve shapes can make up for that. )

It's actually pretty sweet right now living in a world where our CPU's are nice and multi-core, but most apps are still single core. It means I can control the load on my machine myself, which is damn nice. For example I can run 4 apps and know that they will all be pretty nice and snappy. These days I am frequently keeping 3 copies of my video test app running various tests all the time, and since it's single core I know I have one free core to still fuck around on the computer and it's full speed. The sad thing is that once apps actually all go multi-core this is going to go away, because when you actually have to share cores, Windows goes to shit.

Christ why is the registry still so fucking broken? 1. If you are a developer, please please make your apps not use the registry. Put config files in the same dir as your .exe. 2. The Registry is just a bunch of text strings, why is it not fucking version controlled? I want a log of the changes and I want to know what app made the change when. WTF.

The only decent way to get environment variables set is with TCC "set /S" or "set /U".

"C:\Program Files (x86)" is a huge fucking annoyance. Not only does it break by muscle memory and break a ton of batch files I had that looked for program files, but now I have a fucking quandary every time I'm trying to hunt down a program.. err is it in x86 or not? I really don't like that decision. I understand it's needed for if you actually have an x86 and x64 version of the same app installed, but that is very rare, and you should have only bifurcated paths on apps that actually do have a dual install. (also because lots of apps hard code to c:\program files , they have a horrible hack where they let 32 bit apps think they are actually in c:\program files when they are in "C:\Program Files (x86)"). Blurg.


[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\PrefetchParameters]
"EnableSuperfetch"=dword:00000000
"EnablePrefetcher"=dword:00000000

[HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Policies\Explorer]
"NoWinKeys"=dword:00000001

[HKEY_CURRENT_USER\Control Panel\Desktop]
"MenuShowDelay"="10"

Some links :

Types - Vista/Win7 has borked the "File Associations" setup. You need a 3rd party app like Types now to configure your file types (eg. to change default icons).

Shark007.net - Windows 7 Codecs - WMP12 Codecs - seem to work.

Pismo Technic Inc. - Pismo File Mount - nicest ISO mounter I've found (Daemon tools feels like it's made out of spit and straw).

Hot Key Plus by Brian Apps - ancient app that still works and I like because it's super simple.

Change Windows 7 Default Folder Icon - Windows 7 Forums ; presumably you have the Preview stuff for Folders turned off, so now make the icon not so ugly.

- how to move your perforce depot Annoyingly I used a different machine name for new lappy and thus a different clientview, so MSVC P4SCC fails to make the connection and wants to rebind every project. The easiest way to fix this is just to not use P4SCC and kill all your bindings and just use NiftyPerforce without P4SCC.

(Currently that's not a great option for me because I talk to both my home P4 server and my work P4 server, and P4 stupidly does not have a way to set the server by local directory. That is, if I'm working on stuff in c:\home I want to use one env spec and if I'm in c:\work, use another env spec. This fucks up things like NiftyPerforce and p4.exe because they just use a global environment setting for server, so if I have some work code and some home code open at the same time they shit their pants. I think that I'll make my own replacement p4.exe that does this the right way at some point; I guess the right way is probably to do something like CVS/SVN does and have a config file in dirs, and walk up the dir tree and take the first config you find).

allSnap make all windows snap - AllSnap for x64/Win7 seems to be broken, but the old 32 bit one seems to work just fine still. (ADDENDUM : nope, old allsnap randomly crashes in Win 7, do not use)

KeyTweak Homepage - I used KeyTweak to remap my Caps Lock to Alt.

Firefox addons :


DownThemAll
Flashblock
PDF Download
Stop Autoplay
Adblock Plus
DownloadHelper

ADDENDUM : I found the last few guys who are ticking my disk :

One that you obviously want to disable is Windows Media Player Media sharing serivcee : Fix wmpnetwk.exe In Windows 7 . It just constantly scans your media dirs for shit to serve. Fuck you.

The next big culprit is the Windows Reliability stuff. Go ahead and disable the RAC scheduled task, but that's not the real problem. The nasty one is the "last alive stamp" which windows writes once a minute by default. This is to help diagnose crashes. You could change TimeStampInterval to 30 or so to make it once every 30 minutes, but I set it to zero to disable it. See :
Shutdown Event Tracker Tools and Settings System Reliability
How to disable hard disk thrashing with Vista - Page 7

And this is a decent summary/repeat of what I've said : How to greatly reduce harddisk grinding noises in Vista .

4/07/2010

04-07-10 - Video

Blurg, the complexity wheel turns. In the end, all the issues with video come down two huge fundamental problems :

1. Lack of the true distortion metric. That is, we make decisions to optimize for some D, but that D is not really what humans perceive as quality. So we try to bias the coder to make the right kind of error in a black art hacky way.

2. Inability to do full non-greedy optimization. eg. on each coding decision we try to do a local greedy decision and hope that is close to globally optimal, but in fact our decisions do have major effects on the future in numerous and complex ways. So we try to account for how current decisions might affect the future using ugly heuristics.

These two major issues underly all the difficulties and hacks in video coding, and they are unfortunately nigh intractable. Because of these issues, you get really annoying spurious results in coding. Some of the annoying shit I've seen :

A. I greatly improve my R/D optimizer. Global J goes up !! (lower J is better, it should have gone down). WTF happened !? On any one block, my R/D optimizer now has much more ability to make decisions and reach a J minimum on that block. The problem is that the local greedy optimization is taking my code stream to weird places that then hurt later blocks in ways I am not accounting for.

B. I introduce a new block type. I observe that the R/D chooser picks it reasonably often and global J goes down. All signs indicate this is good for coding. No, visual quality goes down! Urg. This can come from any number of problems, maybe the new block type has artifacts that are visually annoying. One that I have run into that's a bother is just that certain block types will have their J minimum on the R/D curve at very different places - the result of that is a lot of quality variation across the frame, which is visually annoying. eg. the block type might be good in a strict numerical sense, but its optimum point is at much higher or much lower quality than your other block types, which makes it stand out.

C. I completely screw up a block type, quality goes UP ! eg. I introduce a bug or some major coding inefficiency so a certain block type really sucks. But global quality is better, WTF. Well this can happen if that block type was actually bad. For one thing, block types can actually be bad for global J even if they are good for greedy local J, because they produce output that is not good as a future mocomp source, or even simply because they are redundant with other block types and are a waste of code space. A more complex problem which I ran into is that a broken block type can change the amount of bits allocated to various parts of the video, and that can randomly give you better bit allocation, which can make quality go up even though you broke your coder a bit. Most specifically, I broke my Intra ("I") block (no mocomp) coder, which caused more bits to go to I-like frames, which actually improved quality.

D. I improve my movec finder, so I'm more able to find truly optimal movecs in an R/D sense (eg. find the movec that actually optimizes J on the current block). Global J goes down. The problem here is that optimizing the current movec can make that movec very weird - eg. make the movec far from the "true motion". That then hurts future coding greatly.

In most cases these problems can be patched with hacks and heuristics. The goal of hacks and heuristics is basically to try to address the first two issues. Going back to the numbering of the two issues, what the hacks do is :

1. Try to force distortion to be "good distortion". Forbid too much quality variation between neighboring blocks. Forbid block mode decisions that you somehow decide is "ugly distortion" even if it optimized J. Try to tweak your D metric to make visual quality better. Note that the D tweakage here is a pretty nasty black art - you are NOT actually trying to make a D that approximates a true human visual D, you are trying to make a D under which your codec will make decisions that produce good global output.

2. To account for the greedy/non-greedy problem, you try to bias the greedy decisions towards things that you guess will be good for the future. This guess might be based on actually future data from a previous run. Basically you decide not to make the choice that is locally optimal if you have reason to believe it will hurt too much in the future. This is largely based on intuition and familiarity with the codec.

Now I'll mention a few random particular issues, but really these themes occur again and again.

I. Very simple block modes. Most coders have something like a "direct block copy" mode, or even a "solid single color", eg. DIRECT or "skip" or whatever. These type of blocks are generally quite high distortion and very low rate. The problem occurs when your lambda is sort of near the threshold for whether to prefer these blocks or not. Oddly the alternate choice mode might have much higher rate and much higher distortion. The result is that a bunch of very similar blocks near each other in an image might semi-randomly select between the high quality and low quality modes (which happen to have very similar J's at the current lambda). This is obviously ugly. Furthermore, there's a non-greedy optimization issue with these type of block modes. If we compare two choices that have similar J, one is a skip type block with high distortion, another is some detailed block mode - the skip type is bad for information conveyance to the future. That is, it doesn't add anything useful for future blocks to refer to. It just copies existing pixels (or even wipes some information out in some cases).

II. Gradual changes need to be send gradually. That is, if there is some part of the video which is slowly steadily changing, such as a slow cross fade, or very slow scale/rotate type motion, or whatever - you really need to send it as such. If you make a greedy best J decision, at low bit rate you will some times decide to send zero delta, zero delta, for a while because the difference is so small, and then it becomes too big where you need to correct it and you send a big delta. You've turned the gradual shift into a stutter and pop. Of course the decision to make a big correction won't happen all across the frame at the same time, so you'll see blocks speckle and move in waves. Very ugly.

III. Rigid translations need to preserved. The eye is very sensitive to rigid translations. If you just let the movec chooser optimize for J or D it can screw this up. One reason is that very small motions or movements of monotonous objects might slip to movec = 0 for code size purposes. That is, rather than send the correct small movec, it might decide that J is better by incorrectly sending a zero delta movec with a higher distortion. Another reason is that the actual best pixel match might not correspond to the motion, you can get anomalies, especialy on sliding patterned or semi-patterned objects like grass. In these cases, it actually looks better to use the true motion movec even if it has larger numerical distortion D to do so. Furthermore there is another greedy/non-greedy issue. Sometimes some non-true-motion movec might give you well the best J on the current block by reaching out and grabbing some random pixels that match really well. But that screws up your motion field for the future. That movec will be used to condition predictions of future movecs. So say you have some big slowly translating field - if everyone picks nice true motion movecs they will also be coherent, but if people just get to pick the best match for themselves greedily, they will be a big mess and not predict each other. That movec might also be used by the next frame, the previous B frame, etc.

IV. Full pel vs. half/quarter/sub-pel is a tricky issue. Sub-pel movecs often win in a strict SSD sense; this is partly because when the match is imperfct, sub-pel movecs act to sort of average two guess together; they produce a blurred prediction, which is optimal under L2 norm. There are some problems with this though; sub-pel movecs act to blur the image, they can stand out visually as blurrier bits; they also act to "destroy information" in the same way that simple block modes do. Full pel movecs have the advantage of giving you straight pixel copies, so there is no blur or destruction of information. But full pel movecs can have their own problem if the true motion is subpel - they can produce wiggling. eg. if an area should really have movecs around 0.5 , you might make some blocks where the movec is +0 and some where it is +1. The result is a visible dilation and contraction that wiggles along, rather than a perfect rigid motion.

V. A good example of all this evil is the movec search in x264. They observed that allowing very large movec search ranges actually decreases quality (vs a more local incremental searches). In theory if your movec chooser is using the right criterion, this should not be - more choices should never hurt, it should simply not choose them if they are worse. Their problem is twofold - 1. their movec chooser is obviously not perfect in that it doesn't account for current cost completely correctly, 2. of course it doesn't account for all the effects on the future. The result is that using some heuristic seed spots for the search which you believe are good coherent movecs for various reasons, and then doing small local searches actually gives you better global quality. This is a case where using "broken" code gives better results.

In fact it is a general pattern that using very accurate local decisions often hurts global quality, and often using some tweaked heuristic is better. eg. instead of using true code cost R in your J decision, you make some functional fit to the norms of the residuals; you then tweak that fit to optimize global quality - not to fit R. The result is that the fit can wind up compensating for the greedy/non-greedy and other weird factors, and the approximation can actually be better than the more accurate local criterion.

Did I say blurg?

3/18/2010

03-18-10 - Physics

Gravity is a force that acts proportionally to the two masses. (let's just assume classical Newtonian gravity is in fact the way the universe works)

People outside of science often want to know "but why?" or "how exactly? what is the mechanism? what carries the force?" . At first this seems like a reasonable question, you don't just want to have these rules, you want to know where they come from, what they mean exactly. But if you think a bit more, it should be clear that these questions are absurd.

Let's say you know the fundamental physical laws. These are expressed as mathematical rules that tell you the behavior of objects. Say for example we lived in a world with only Newtonian dynamics and gravity and that is all the laws. Someone asks "but what *is* gravity exactly?". I ask : How could you ever know? What could there ever be that "is" gravity? If something was faccilitating the force of gravity, there would have to be some description of that thing, some new law to describe it. That would mean some new rule to describe this thing that carried gravity. Then you would ask "well where does this rule for the carrier of gravity come from?" and you would need a new rule. Say you said "gravity is carried by the exchange of gravitons" ; then of course they could ask "why is there a graviton, what makes gravitons exactly, why do they couple in this way?" etc.

The fundamental physical laws cannot be explained by anything else.

That's almost a tautology because that's what I mean by "fundamental" - you take all the behavior of the universe, every little thing, like "I pushed on this rock with 10 pounds of force and it went 5 meters per second". You strip away every single law that can be explained with some other law. You strip and strip and finally you are left with a few laws that cannot be explained by anything else. These are the fundamental laws and there is no "why" or "how" for them. In fact the whole human question of "how" is imprecise; what we really should say is "what simpler physical law can explain this phenomenon?". And at some point there is no more answer to that.

Of course this is assuming that there *is* a fundamental physical law. Most physicists assume that to be true without questioning it, but I wrote here at cbloom.com long ago that in fact the set of physical laws might well be infinite - that is, maybe we will find some day that the electrical and gravitational force can be explained in terms of some new law which also adds some new behaviors at very small scale (if it didn't add new behaviors it would simply be a new expression of the same law and not count), and then maybe that new law is explained in terms of another new law which also adds new behaviors, etc. ad infinitum - a russian doll of physcial laws that never ends. This is possible, and furthermore I contend that it is irrelevant.

There is a certain human need to know "why" the physical laws are as they are, or to know the "absolute" "fundamental" laws - but I don't believe there's really much merit to that at all. What if they do finally work out string theory, and it explains all known phenomena for a while, but then we find that there is a small error in the mass of the Higgs Boson on the order of ten to the minus one billion, which tells us there must be some other physical law that we don't yet know. The fact that string theory then is only a very good model of the universe and not the "absolute law" of the universe changes nothing except our own silly human emotions in response to it (and surely crackpots would rise up and say that since it's not "100% right" then there must be angels and thetans at work).

What if we found laws that explained all phenomena that we know of perfectly. We might well think those laws are the "absolute fundamental" laws of the universe. But how would we ever know? Maybe there are other phenomena that can't be explained by those laws that we simply haven't seen yet. Maybe those other phenomena could *never* be seen! (for example there may be another entire set of particles and processes which have zero coupling to our known matter). The existance of this unexplained phenomena does not reduce the merit of the laws you know, even though they are now "not complete" or "don't describe all of nature".

It's funny to think about how our intuition of "mass" was screwed up by the fact that we evolved on the earth in a high gravity environment where we inherently think of mass as "weight" - eg. something is heavy. There's this thing which I will call K. It's the coefficient of inertia, it's how hard something is to move when you apply a certain force to it. F = K A if you will. Imagine we grew up in outer space with lots of large electrical charges around. If we apply an electric field to two charges of different K, one moves fast and one moves slow, the difference is the constant K. It's a very funny thing that this K, this resistance to changes of motion, is also the coupling to the gravitational field.

3/10/2010

03-10-10 - Distortion Measure

What are the things we might put in an ideal distortion measure? This is going to be rather stream of conscious rambling, so beware. Our goal is to make output that "looks like" the input, and also that just looks "good". Most of what I talk about will assume that you are running "D" on something like 4x4 or 8x8 blocks and comparing it to "D" on other blocks, but of course it could be run on a gaussian windowed patch, just some way of localizing distortion on a region.

I'm going to ignore the more "macroscopic" issues of which frame is more important than another frame, or even which object within a frame is more important - those are very important issues I'm sure, but they can be added on later, and are beyond the scope of current research anyway. I want to talk about the microscopic local distortion rating D. The key thing is that the numerical value of D assigns a way to "score" one distortion against another. This not only lets you choose the way your error looks on a given block (choosing the one with lowest score obviously), it also determines how your bits are allocated around the frame in an R/D framework (bits will go to places that D says are more important).

It should be intuitively obvious that just using D = SSD or SAD is very crude and badly broken. One pixel step of numerical error clearly has very different importance depending on where it is. How might we do better ?

1. Value Error. Obviously the plain old metric of "output value - input value" is useful even just as a sanity check and regularizer ; it's the background distortion metric that you will then add your other biasing factors to. All other things being equal you do want output pixels to exactly match input pixels. But even here there's a funny issue of what measure you use. Probably something in the L-N norms, (L1 = SAD, L2 = SSD). The standard old world metric is L2, because if you optimize for D = L2, then you will minimize your MSE and maximize your PSNR, which is the goal of old literature.

The L-N norms behave differently in the way they rate one error vs another. The higher N is, the more importance it puts on the largest error. eg. L-infinity only cares about the largest error. L-2 cares more about big errors than small ones. That is, L2 makes it better to change 100->99 than 1->0. Obviously you could also do hybrid things like use L1 and then add a penalty term for the largest error if you think minimizing the maximum error is important. I believe that *where* the error occurs is more important than what its value is, as we will discuss later.

2. DC Preservation. Changes in DC are very noticeable. Particularly in video, the eye is usually tracking mainly one or two foreground objects; what the means is that most of the frame we are only seeing with our secondary vision (I don't know a good term for this, it's not exactly peripheral vision since it's right in front of you, but it's not what your brain is focused on, so you see it at way lower detail). All this stuff that we see with secondary vision we are only seeing the gross properties of it, and one of those is the DC. Another issue is that if a bunch of blocks in the source have the same DC, and you change one of them in the output, that is sorely noticeable.

I'm not sure if it's most important to preserve the median or the mean or what exactly. Usually people preserve the mean, but there are certainly cases where that can screw you up. eg. if you have a big field of {80} with a single pixel spike on it, you want to preserve that background {80} everywhere no matter what the spike does in the output. eg. {80,80,255,80,80} -> {80,80,240,80,80} is better than making it go -> {83,83,240,83,83} even though the latter has better mean preservation.

3. Edge Preservation. Hard edges, especially long straight lines or smooth curves, are very visible to humans and any artifact in them stands out. The importance of edges varies though; it has something to do with the length of the edge (longer edges are more major visual features) and with the contrast range of the region around the edge : eg. an edge that separates two very smooth sections is super visible, but an edge that's one of many in a bunch of detail is less important (preserving the detail there is important, but the exact shape of the edge is not). eg. a patch of grass or leaves might have tons of edges, but their exact shape is not crucial. An image of hardwood floor has tons of long straight parallel edges and preserving those exactly is very important. The border between objects is typically very important.

Obviously there's the issue of keeping the edges that were in the original and also the issue of not making new edges that weren't in the original. eg. introducing edges at block boundaries or from ringing artifacts or whatever. As with edge preservation, the badness of these introduces edges depends on the neighborhood - it's much worse to make them in a smooth patch than once that's already noisy. (in fact in a noisy patch, ringing artifacts are sort of what you want, which is why JPEG can look better than naive wavelet coders on noisy data).

4. Smooth -> Smooth (and Flat -> Flat). Changing smooth input to not smooth is very bad. Old coders failed hard on this by making block boundaries. Most new coders now handle this easily inherently either because they are wavelet or use unblocking or something. There are still some tricky cases though, such as if you have a smooth ramp with a bit of gaussian noise speckle added to it. Visually the eye still sees this as "smooth ramp" (in fact if you squint your eyes the noise speckly goes away completely). It's very important for the output to preserve this underlying smooth ramp; many good modern coders see the noise speckle as "detail" that should be preserved and wind up screwing up the smooth ramp.

5. Detail/Energy Preservation. The eye is very sensitive to whether a region is "noisy" or "detailed", much more so than exactly what that detail is. Some of the JPEG style "threshold of visibility" stuff is misleading because it makes you think the eye is not sensitive to high frequency shapes - true, but you do see that there's "something" there. The usual solution to this is to try to preserve the amount of high frequency energy in a block.

There are various sub-cases of this. There's true noise (or real life video that's very similar to true noise) in which case the exact pixel values don't matter much at all as long as the frequency spectrum and distribution of the noise is reproduced. There's detail that is pretty close to noise, like tree leaves, grass, water, where again the exact pixels are not very important as long as the character of the source is preserved. Then there's "false noise" ; things like human hair or burlap or bricks can look a lot like noise to naive analysis metrics, but are in fact patterned texture in which case messing up the pattern is very visible.

There are two issues here - obviously there's trying to match the source, but there's also the issue of matching your neighbors. If you have a bunch of neighboring source blocks with a certain amount of energy, you want to reproduce that same patch in the output - you don't want to have single blocks with very different energy, because they will stand out. Block energy is almost like DC level in this way.

6. Dynamic range / sdev Preservation. Of course related to previous metrics, but you can definitely see when the dynamic range of a region changes. On an edge it's very easy to see if a high contrast edge becomes lower contrast. Also in noise/detail areas the main things you notice are the DC, the amount of noise, and the range of the noise. One reason its so visible is because of optical fusion and affects on DC brightness. That is, if you remove the bright specks from a background it makes the whole region look darker. Because of gamma correction, {80,120} is not the same brightness as {100,100}. Now theoretically you could do gamma-corrected DC preservation checks, but there are difficulties in trying to be gamma correct in your error metrics since the gamma remapping sort of does what you want in terms of making changes of dark values relatively more important; maybe you could do gamma-correct DC preservation and then scale it back using gamma to correct for that.

It's unclear to me whether the important thing is the absolute [low,high] range, or the statistical width [mean-sdev,mean+sdev]. Another option would be to sort the values from lowest to highest and look at the distribution; the middle is the median, then you have the low and high tails on each side; you sort of want to preserve the shape of that distribution. For example the input might have the high values in a kind of gaussian falloff tail with most values near median and fewer as it gets higher; then the output should have a similar distribution, but exactly matching the high value is not important. The same block might have all of its low values at exactly 0 ; in that case the output should also have those values at exactly 0.


Whatever all the final factors are, you are left with how to scale them and combine them. There are two issues on scaling : power and coefficient. Basically you're going to combine the sub-distortions something like this :


D = Sum_n { Cn * Dn^Pn }

Dn = distortion sub type n
Cn = coefficient n
Pn = power n

The power Pn lets you change the units that Dn are measured in; it lets you change how large values of Dn contribute vs. how small values contribute. The cofficient Cn obviously just overall scales the importance of each Dn vs. the other D's.

It's actually not that hard to come up with a good set of candidate distortion terms like I did above, the problem is once you have them (the various Dn) - what are the Cn and Pn to combine them?

3/08/2010

03-08-10 - Distortion and Bit Allocation

I now know that rate allocation is by far the most important thing in video. It's obviously important in a lot of things, but in video you just have so many bits and so much flexibility in where you put them, and there are lots of psychovisual phenomena that don't exist in images (due to motion, eye adaptation, feature tracking, etc. because the eye notices changes over time, etc). In fact I conjecture that you could take a really shitty old coder like MPEG2 and make videos that beat anything currently in existance with just better rate allocation.

What can rate allocation do ?

1. Move bits to the source of predictions. That is, code some frame (or part of a frame) better than normal because it will be used as a mocomp source in the future. This is actually a purely mathematical win and would apply without any psychovisual consideration. A lot of people do this in semi-heuristic ways, but of course those can make lots of mistakes (for example there may be cases where increasing the bit assignment to a block might actually make it a worse source for the future, eg. the future might be a better match to the block with more distortion; also starving the future might cause it to no longer choose that block as a source, etc). Some people move bits around while holding all the block mode decisions and movecs constant, which at least lets you converge, but of course you should consider all possible bit moves and all possible mode changes.

2. Move bits from frame to frame to make some frames look better and some look worse. Move bits around the frame to make parts look better and parts look worse. In general choosing where to put your error.

There's also a related issue which is not exactly rate allocation but is very similar. In lossy coders like video coders you often you have a choice of what your error looks like. That is, for the same distortion (in a numerical sense) you could make different shapes of error, through choosing different block modes, choosing different movecs, or more globally choosing quantizers or quantization matrices. This often ties into rate allocation because it involves how you make your free choices in the encoder :

3. What the distortion looks like. In particular, if you make some amount of error (in an SSD or SAD sense (aka L2 or L1 norm)) what does that error look like? what is the shape of it?

Now, in a lagrangian framework the main thing driving all these decisions is just the D metric in J = R + lambda D. If you change D, it changes where bits get put. D determines how important you think one type of error is vs. another type of error.

Just as an example, say you ran face detection on your video, then you could assign face regions to all your frames, and any error in the face region could be counted as extra important - if you put this into your "D" metric, then the lagrangian coder automatically gives those areas more bits. But that kind of example is rather banal. There are obviously tons of human error-importance issues that you could try to account for, having to do with what objects are most important in the frame, where the motion is, what kind of errors are particularly appalling, etc etc.

Purely numerical error distribution can be important : say you have an error of 3 somewhere and an error of 20 somewhere else. You have bits to change each by 1. Should you change the 3 to 2 or the 20 to 19 ? Well, it depends on their neighborhoods, but I think more often than not you should do the 3->2. That will be more visually noticeable. Using L1 or L2 (or L-N for whatever other N's) causes you to make different decisions in these cases. Most simplistically you can see it as a continuum between minimizing the total abs error (L1) vs minimizing the maximum error (L-infinity). That is, the issue of whether you have clumpy error or spread out error is a pretty big one.

The thing holding back development is a lack of a procedure for measuring real "quality". The problem is changing distortion to change your bit allocation for psychovisual purposes will by definition hurt your abstract measures. (hacky changes to D might hurt RMSE but help SSIM, but in that case I would say some of the change was not "psychovisual" - the part of the change which helps SSIM is in fact an analytical change to improve a certain metric). At some point you have to be able to make a decision that you will allocate bits in such a way that your video will look worse to computers, but will look better to humans. (with our current shitty computer analysis models).

x264 and others have a bit of a solution for this - they use a kind of "crowd sourcing" (bleck web 2.0 buzz word, I feel like I just vomitted in my own mouth a little). They can put beta features in their code and they have mobs of fan-boys who will download betas and try them on lots of videos and then post results on the forums. This gives you lots of real human eyes saying "this looks better" or not for attempts at psychovisual. But I don't think you can really make big developments using that technique - you can only make small heuristic stabs in the dark and then find out if they were okay, because the turnaround time for results from the crowd is too long, and if you release too many dead ends for them to test they will stop doing it, so you have to be reasonably sure it is a good change before publishing it to the crowd, etc. It's not the kind of thing a researcher needs, which is a black box where I can throw videos and say "which looks better to a human".

The result is that we are mostly stabbing in the dark and occasionally getting lucky.

3/03/2010

03-03-10 - Image Compresson - Color , ScieLab - Part 2

Follow up to the last post on color .

First a correction : what I said about downsampling there is mostly wrong. I made the classic amateur's blunder of testing on too small a data set and drawing conclusions from it. I'm a little embarassed to make that mistake, but hey this is a blog not a research journal. Any expectations of rigor are unfounded. For example this is one of the test images I ran on that convinced me that downsample was bad :


aikmi
-i7 qtable ; CoCg optimized joint for min SCIELAB

downsample :

   262,144 ->    32,823 =  1.001 bpb =  7.986 to 1 (per pixel)
Q : 11.0000  Co scale = Cg Scale = 1.525
bits DC : 19636|5151|3832 , bits AC : 175319|38483|19879
bits DC = 10.9% bits AC = 89.1%
bits Y = 74.3% bits CoCg = 25.7%
rmse : 7.3420 , psnr : 30.8485
ssim : 0.9134 , perc : 73.3109%
scielab rmse : 2.200

no downsample :

   262,144 ->    32,679 =  0.997 bpb =  8.021 to 1 (per pixel)
Q : 12.0000  Co scale = Cg Scale = 0.625
bits DC : 19185|13535|9817 , bits AC : 160116|39407|19091
bits DC = 16.3% bits AC = 83.7%
bits Y = 68.7% bits CoCg = 31.3%
rmse : 6.9877 , psnr : 31.2781
ssim : 0.9111 , perc : 72.9532%
scielab rmse : 1.980

you can see that downsample is just much worse in every way, including severely worse in SCIELAB which doesn't care about chroma differences as much as luma. In this particular image, there's a lot of high detail color bits, and the downsampled version looks significantly worse, it's easy to pick out visually.

However, in general this is not true, and in fact downsample is often a small win.

Without further ado I present lots of stats :

i0 Cg=1 Co=1 i0 Cg = 0.6 Co = 0.575 i7 Cg = 0.6 Co = 0.575 i4/i7 opt per image i7 CoCg optimized independently per image i7 CoCg optimized jointly per image downsampled
file rmse scielab rmse scielab rmse scielab rmse scielab Co Cg rmse scielab Co / Cg rmse scielab
kodim01 12.6809 4.8898 12.5848 4.8413 12.6567 4.3415 12.7018 4.238 0.455 0.455 12.623 4.3153 1.225 12.486 4.2525
kodim02 6.235 2.1961 6.1733 2.1793 6.2836 2.0519 6.2544 1.9542 0.58 0.58 6.2285 1.978 1.3375 6.4866 1.9841
kodim03 4.0098 1.7135 3.974 1.7173 4.0621 1.5587 3.9778 1.5883 0.705 0.83 4.0853 1.5359 1.6 4.1235 1.6102
kodim04 6.3981 2.4661 6.3658 2.4929 6.4083 2.2579 6.4083 2.2579 0.705 0.705 6.4092 2.248 1.5625 6.3698 2.1977
kodim05 14.2903 7.2293 14.0531 7.1756 14.1613 6.5253 14.2296 6.452 0.58 0.58 14.167 6.5291 1.5625 13.9658 6.4326
kodim06 8.9416 3.6338 8.836 3.5923 8.9622 3.2131 9.0316 3.1608 0.455 0.58 8.9664 3.2184 1.3 8.8455 3.1733
kodim07 5.147 2.316 5.1145 2.1919 5.2338 2.0167 5.2388 1.9815 0.58 0.58 5.202 2.0047 1.225 5.1601 1.9462
kodim08 14.6964 7.5082 14.5479 7.5237 14.5675 6.8769 14.6411 6.7521 0.58 0.83 14.5726 6.8285 1.4875 14.3053 6.692
kodim09 4.4789 1.8149 4.439 1.8574 4.5303 1.675 4.5303 1.675 0.705 0.955 4.5467 1.6359 1.4125 4.5389 1.6906
kodim10 4.9926 2.0932 4.9477 2.1196 5.0678 1.9887 5.0398 1.9514 0.58 0.955 5.0585 1.9109 1.6 5.0449 1.9556
kodim11 7.9484 3.2677 7.9006 3.2315 8.0441 2.9234 8.0441 2.9234 0.58 0.58 8.0478 2.9276 1.375 7.939 2.858
kodim12 4.6495 1.8486 4.6326 1.8529 4.7335 1.6862 4.7259 1.6663 0.58 0.705 4.7041 1.6776 1.2625 4.7001 1.6457
kodim13 18.5372 8.3568 18.3502 8.2634 18.5334 7.2841 18.6579 7.1262 0.455 0.58 18.5013 7.2697 1.1125 18.381 7.2327
kodim14 11.076 4.8628 10.972 4.7473 11.0146 4.3268 11.064 4.2636 0.58 0.58 11.0151 4.3308 1.3 10.9818 4.3614
kodim15 5.8269 2.4099 5.8082 2.4665 5.9134 2.2246 5.8383 2.2457 0.705 0.705 5.9158 2.2098 1.525 5.8699 2.1497
kodim16 5.689 2.3266 5.6289 2.3199 5.7372 2.0534 5.7372 2.0534 0.58 0.58 5.7373 2.055 1.375 5.6667 2.0276
kodim17 5.5166 2.3244 5.47 2.2994 5.6716 2.0774 5.5853 2.0874 0.455 0.705 5.6523 2.0574 1.4125 5.6014 2.037
kodim18 10.8501 4.8609 10.7131 4.7903 10.9517 4.3169 10.9639 4.2627 0.58 0.705 10.9266 4.3006 1.3375 10.8048 4.2189
kodim19 7.1545 2.8338 7.0872 2.8518 7.2311 2.4977 7.2637 2.4362 0.58 0.705 7.2158 2.4758 1.5625 7.1314 2.4396
kodim20 4.7872 1.8258 4.7183 1.8042 4.9208 1.6441 4.863 1.6524 0.455 0.83 4.9265 1.6306 1.1875 4.9427 1.656
kodim21 7.7757 3.3671 7.6338 3.3427 7.9293 3.0078 7.8541 3.0018 0.705 0.705 7.9204 2.95 1.3 7.7688 2.9302
kodim22 8.279 3.2205 8.1788 3.1253 8.3292 2.8656 8.3542 2.8114 0.455 0.58 8.3026 2.8379 1.45 8.267 2.8436
kodim23 3.917 1.5567 3.8968 1.5138 3.953 1.4315 3.961 1.4157 0.58 0.58 3.9481 1.4146 1.6 4.3382 1.573
kodim24 10.9877 5.2479 10.8105 5.0477 11.0256 4.6141 11.0435 4.5882 0.455 0.455 11.0413 4.6005 1.3375 10.9372 4.503
194.86 84.17 192.84 83.35 195.92 75.46 196.01 74.54 195.71 74.94 194.65 74.41

explanation :

output bit rate 1 bpb in all cases
parameters are optimized to minimize E = ( 2 * SCIELAB + 1 * RMSE )
RMSE is on RGB
SCIELAB is perceptual color difference metric

i0 = flat quantization matrix
i7 = tweaked perceptual quantization matrix to minimize E
i4/i7 = optimized blend of flat to perceptual matrices


The table reads roughly left to right in terms of decreasing perceptual error.  

"i0 Cg=1 Co=1" : flat q-matrix, standard lossless YCoCg transform without extra scaling

"i0 Cg=0.6 Co=0.575" ; optimize CoCg scale for E ; interestingly this also helps RMSE

"i7 Cg=0.6 Co=0.575" ; non-flat constant Q-matrix ; hurts RMSE a bit, helps SCIELAB a lot

"i4/i7 opt per image" ; per-image non-flat Q-matrix ; not a big difference

"i7 CoCg optimized independently per image" : independently optimize Co and Cg for each image

"i7 CoCg optimized jointly per image downsampled" : downsample test, CoCg optimized with Co=Cg

On the full kodak set, downsampling is a slight net win. There are a few cases (kodim03,kodim23) where it hurts a lot like I saw before, but in most cases it is a slight win or close to neutral. The conclusion is that given the speed benefit, you should downsample. However there are occasional cases where it will hurt a lot.

I think most of the results are pretty intuitive and not extremely dramatic.

It's a little non-inuitive what exactly is going on with the per-image customized chroma scales. Your first thought might be "well those images have different colors in them, so the color space scale is adapting to the color content in the image". That's not so. For one thing, more or less content of a certain color doesn't mean you need a different color space - it just means that that band of the color space will get more energy, and thus more bits. e.g. an image that has lots of "Co" component colors will simply have more energy in the Co plane - that doesn't mean scaling Co either up or down will help it.

If you think about the scaling another way it's more obvious what's going on. Scaling the color planes is equivalent to using different quantizers per plane. Optimizing the scalings is equivalent to doing an R/D optimization of the quantizer of each plane. Thus we see what the scaling is doing : it's taking bits away from hard to code planes and moving them to easier to code planes (in an R/D slope sense).

In particular, when I visually inspected some of the more extreme cases (cases where the per-image optimized scales were a big win vs. a constant overall scale, such as kodik10) what I found was that the optimized scalings were taking bits *away* from the dominant colors. One very obvious case was on photos of the ocean. The ocean is mostly one color and is very hard to code (expensive in an R/D sense) because it's all choppy and random. The optimized scaling took bits away from the ocean and moved them to other colors that had more R/D payoff.

(BTW rambling a bit : I've noticed that x264 Psy VAQ tends to do the same kind of thing - it takes bits away from areas that are really noisy mess, such as water, and moves them to areas that have smooth pattern and edges. Intuitively you can guess if an area is a mess and just really hard to code then you should just say "fuck it" and starve it for bits even if MSE R/D tells you it wants bits. I think also that improving an area from an RMSE of 4 to 2 is better than improving from 10 to 7, even though it's less of a distortion win. Visually there's a bit difference that occurs when an area goes from "looks good" to "looks noisy" , but not much of a difference when an area goes from "looks bad" to "looks really bad").

So this is in fact not really a surprising result. We know already that heavy R/D bit allocation can do wonders for lossy compressors. That are lots more areas to explore - optimization of every coefficient in the quantization matrix, optimization of the color transform, optimization of the transform basis functions, etc. etc. - and in each case you need to be clever about the way you encode the extra rate control side information.

ADDENDUM : I thought I should write up what I think are the useful takeaway conclusions :

1. It is crucial to do the right kind of scaling to Co/Cg (or chroma more generally) depending on whether you downsample or not. In particular the way most people just turn downsample on or off and don't compensate by scaling chroma is a mistake, eg. not a fair comparison, because their scaling will be tuned for one or the other.

2. Downsample vs. no-downsample is pretty close to neutral. If you downsample for speed, that's probably fine. There are rare cases where it does hurt a whole lot though.

3. Using a non-flat Q matrix does in fact help perceptual quality significantly. And it doesn't hurt RGB RMSE nearly as much as it helps SCIELAB (helps SCIELAB by 10.35 % , hurts RMSE by 1.58 % ).

4. It does appear acceptable to use global tweaked values all the time rather than custom tweaking to each image. Custom tweaks do give you another bit of benefit, but it's not huge, thus not worth the very slow optimization step. (see DCTune eg)

2/23/2010

02-23-10 - Image Compresson - Color , ScieLab

The last time I wrote about anything technical it was to comment on image coding perceptual targets and chroma . Let's get into that a bit more.

There are these standard weapons available to us : 1. Colorspace transform (lossy or lossless) , 2. Relative scaling of color channels, 3. Downsampling , 4. Non-flat quantization matrices.

Many image compressors use some combination of these. For example, JPEG uses YCbCr colorspace, which has a built-in down scaling of the chroma channels, also optionally downsamples chroma, and also usually uses a very high-frequency-killing quantization matrix. The result is that chroma is attacked in many ways - the DC accuracy is destroyed by the scaling in the color conversion as well as the [0] entry of the quantization matrix, and high frequency info is killed both by downsampling and the high entries in the quantization matrix.

But is this good? Obviously it's all bad in terms of RMSE (* not completely true, but close enough), so we need something that approximates the human eye's less sensitie chroma receptors.

For a long time I put off this question because it seemed the only way to attack it was by showing a ton of images to test subjects and asking "is this better?". (Furthermore, there's the ugly problem that any perceptual metric is heavy tied to viewing conditions, and without knowing the viewing conditions you may be optimizing for the wrong thing). But maybe I found a solution.

Let me be clear briefly that I am here only trying to address the issue of how the human eye sees chroma vs luma. This is not a full "psychovisual perceptual metric" which would have to account for the brain identifying areas of noise vs. areas of smoothness, repeated patterns, linear ramps, etc. Basically the only thing I'm trying to capture here is the importance of luma bits vs. chroma bits.

Well, it turns out there's this thing from color research called SCIELAB . You may be familiar with "CIE LAB" aka the "Lab color space" which is considered to be pretty close to "perceptually uniform" , that is 1 unit of distance between two Lab colors has the same perceptual error importance no matter what the two colors are. Well SCIELAB is the extension of CIELAB to images (not just single colors). You can read the paper at that link (or see links below), but the basic thing it does is very simple :

SCIELAB takes the image and transforms it to "opponent color" (luma, red-green, and blue-yellow) , which is roughly the color space that eyes use to see light (rods see luma, cones see chroma) (note that here we are transforming "pixel values" into real light values, so we have to make an assumption about the brightness and color calibration of the viewing device). In opponent color space, each channel is filtered. The filter used for each channel represents the angular resolution that a rod or cone has. Basically this is a gaussian whose sdev is proportional to the angular resolution in that channel. This depends on the DPI of the viewing device and the viewing distance (eg. how many pixels fit into one degree at the eye). The gaussian is narrow for luma, indicating good precision, and wider for chroma. The filter also has a wide negative lobe around the center peak, which captures the fact that we see values as relative to their neighborhood - eg. 100 on a background of 10 looks brighter than 100 on a background of 50.

The gaussian filters represent the probability of a photon from a given pixel hitting and activating a rod or cone. The wider filters for chroma indicate that a half-toned image in red-green or blue-yellow will be indistiguishable from the original at a much shorter distance than a half-toned luma image.

One you do this filtering, you transform back to CIELAB and then you can just do a normal MSE to create a "delta E". (CIE also defines a more elaborate more uniform "delta E" metric for LAB , but for our purposes the plain L2 distance is very close and much simpler). The result is a "SCIELAB delta E" metric that is analytic and can be used in place of MSE for comparing images. Having this SCIELAB metric now lets us try various things and it tells us whether they are perceptually better or not (in terms of optical perception of color, anyway).

So far as I know this has never been used in the mainstream image compression literature ; the only place I found it was this Stanford school project tech report : Direction-Adaptive Partitioned Block Transform for Color Image Coding . This paper is pretty interesting; they aren't actually doing anything with the DA-PBT , they're just evaluating color spaces and how to do color coding starting with a grayscale image compressor.

Let's go through the EE398 paper in detail.

First they use YCbCr because they claim it produces better scielab results than RGB. True enough, but there were a lot of other color spaces to try. Furthermore, they don't mention this, but they are using the JPEG style YCbCr, which has a built in 0.5 scaling of the chroma channels (chroma should have a range of [-256,256] but JPEG offsets and scales to put it back into [0,256]) - they have effectively killed the chroma precision by using YCBCr.

They then look at whether sub-sampling helps or not. They find it to be roughly neutral - but when you try subsampling or not subsampling you should also try optimizing all other free options (scaling of the chroma channels, quantization matrix).

The most interesting part to me is "Rate Allocation". They try giving different fractions of the bit budget to Y or CbCr. They find that optimal delta E almost always occurs somewhere around Y bits = 66% of the total , that is the bit ratios are like [4:1:1]. In order to acheive this ratio they had to use small quantization step sizes for CbCr than Y, but that is an anomaly because of the fact that the YCbCr they use has killed the chroma - if you use a non-scaling YCbCr you would find that the chroma quantization values should be *larger* than luma to acheive the 66% bit allocation. (note that using different quantization values on each channel is equivalent to scaling the channels relative to each other).

They also found that using non-uniform quantization matrices (ala JPEG) hurt. I believe this was just an anomaly of their flawed testing methodology.

This paper was the most serious study of color in image compression that I've ever seen, but is still flawed in some simple ways that we can fix. The big problem is that they make the classic blunder of many people working in compression of optimizing parameters one by one. That is, say you have a compressor with options {A,B,C}. The blunderer finds the optimal value for option A and holds that fixed, then the optimal for B, then the optimal for C. They then try out some experimental new mode for step A, and their tests show it doesn't help - but they failed to retry every option for B and C in the new mode for A. eg. for example something like downsampling might hurt if you're using YCbCr, but say you use some other color space, or scale your colors in some way, or whatever, then downsampling might help and the result of doing all those steps together may be the best configuration.

Let's go back through it carefully :

First of all, the color conversion. Let me note that we use the color conversion in image compression for really two separate purposes which are mixed up. One use is for decorrelation (or energy compaction if you prefer) - this helps compression even for lossless mode. The second is for perceptual separation of chroma from luma so that we can smack the chroma around. Obviously here we need a color transform which gives us {luma/chroma} separation - that is, we cannot use something like the KLT which doesn't necessarilly have a perceptual "luma" axis.

From my earlier color studies, I found that YCoCg produces good results, usually within 1% of the best color transform on each image, so we'll just use that. But we will be careful and use a float <-> float YCoCg which doesn't scale any of the channels.

We will then scale Y relative to CoCg. This scaling is equivalent to variable quantizers and is (one of the ways) how we will control the bit allocation to Y vs. Chroma. This scaling gives you a difference in "value resolution" , it doesn't kill high frequencies.

You can then optionally downsample chroma. Note that in naive tests I have found in the past that downsampling chroma sometimes helps visual quality; and in fact in some cases it even helps MSE measured on the RGB data. I now know that that was just an anomaly due to the fact that I wasn't considering chroma scaling. That is, downsampling was just a crude way of allocating fewer bits to chroma, which does in fact sometimes help, but if you also have the ability to change the chroma bit allocation by relative scaling of the channels, the advantage of downsampling vanishes.

I optimized the scaling of CoCg relative to Y on lots of images. Obviously the true optimum value is highly image dependent (you could compute this per image and store it with the image of course), but in most cases a scale near 0.7 is optimal if you are not downsampling, and a scale near 1.1 is close to optimal when downsampling ( 1.0 is not bad when downsampling ). When not downsampling, the optimal bit allocation is usually in the area of Y ~= 66% of the bits, as seen in the EE398 paper. When downsampling, the optimal bit allocation tends to be closer to Y = 80% of the bits. Downsampling generally hurts RGB MSE and SCIELAB delta E, but I find it sometimes helps RGB SSIM.

Obviously downsampling is resulting in more bits being used on luma, which means you'll have sharper edges and better preservation of texture and a visual appearance of more "detail", at the cost of the color values being far off. By my own examination, I often will find that if I just stare at the image made from downsampled chroma it looks "better" - eg. I see more edge detail, and it has less of that obvious appearance of being compressed, eg. less ringing artifacts, halos, stair-steps, etc. However, when I switch back and forth between the original and the compressed, the version made from downsampled chroma shows obvious color errors. The version made from non-downsampled chroma obviously has much better color preservation, but appears generally blurrier, has more block artifacts, etc. The non-downsampled version wins according to "delta E" , but by my eyes I can't really clearly say one is better than the other, they're just different errors.

The last tool we have is a non-uniform quantization matrix. NUQM lets us give more bits to the low frequencies vs. the high frequencies. Generally NUQM hurts MSE, but it might help "delta E" , because SCIELAB accounts for the "fuzziness" of human visual (insensitivity to high frequency pattern). To test this, what we need to try is various different NUQM's for both luma and chroma, as well as optimizing the relative scaling value in each case. I haven't completed this yet, but early results show that NUQM's do in fact help delta E. Note that I'm not talking about doing a per-image optimal NUQM like "dctopt" does or something, just finding something like the JPEG style skewed matrix to use globally.

Some numbers for example :


On a 512x512 color image of a face , at 1.0 bits per pixel , 
optimizing quality at constant bit rate


baseline : delta E = 2.2933

not downsampled , optimal CoCg scale = 0.625 : delta E = 2.155  (bits Y = 72%)

    downsampled , optimal CoCg scale = 1.188 : delta E = 2.381  (bits Y = 80%)

best NUQM and scaling (no downsampling) : delta E = 1.899  (bits Y = 61%)

( JPEG delta E = 2.7339 )

One thing I notice that NUQM does obviously is give a lot more bits to the DC's. In this case :


not downsampled, same cases as previous
UQM = uniform quantization matrix

 UQM , bits DC = 15.4% , Q = 14.0  , delta E = 2.155 , bits Y = 72%

NUQM , bits DC = 19.4% , Q =  7.25 , delta E = 1.899 , bits Y = 61%

Here Q is the quantizer of the DC component of Y - in the UQM case all Q's are the same (though the Q for chroma is effectively scaled). In the NUQM case the higher frequency AC components get much higher Q's. We can see from the above that because of NUQM, the quantizer for the DC can be much lower at the same bit rate.

Personal visual inspection indicates that the NUQM images just have much more "JPEG-like" artifacts. That is, they generally look more speckly. They obviously preserve flat areas and simple ramps somewhat better. The tradeoff is much worse ringing artifacts and destruction of high frequency detail like fine edges. (in my case the lower Q from NUQM also means a much weaker deblocking filter is used which may be part of the reason for more speckly appearance).

In any case, NUQM clearly helps delta E due to the ability to take bits away from the high frequency chroma data - much better than just scaling and downsampling can.

This is all very interesting and promising, but we have to ask ourselves at some point - how much do we trust this "scielab delta E" ? eg. by optimizing for this metric are we actually making better results? More and more I am convinced that the biggest thing missing from data compression is a better image quality metric (and then once you have that, you need to go back to basics and re-test all your assumptions against it in the correct way).

Color links :

Working Space Comparison sRGB vs. Adobe RGB 1998
Welcome to IEEE Xplore 2.0 Using SCIELAB for image and video quality evaluation
Video compression's quantum leap - 12112003 - EDN
Useful Color Equations
Useful Color Data
Standard illuminant - Wikipedia, the free encyclopedia
SpringerLink - Book Chapter
S-CIELAB Matlab implementation
References related to S-CIELAB
Lab color space - Wikipedia, the free encyclopedia
IEEE Xplore - Login
help - sRGB versus Adobe RGB (1998)
efg's Chromaticity Diagrams Lab Report
CIECAM02 - Wikipedia, the free encyclopedia
Chromatic Adaptation
Brian A. Wandell -- Reference Page
Ask a Color Scientist!
A top down description of S-CIELAB and CIEDE2000. Garrett M. Johnson. 2003; Color Research & Application - Wiley InterScienc
A proposal for the modification of s-CIELAB

2/10/2010

02-10-10 - Some little image notes

1. Code stream structure implies a perceptual model. Often we'll say that uniform quantization is optimal for RMSE but is not optimal for perceptual quality. We think of JPEG-style quantization matrices that crush high frequencies as being better for human-visual perceptual quality. I want to note and remind myself that actually just the coding structure actually targets perceptual quality even if you are using uniform quantizers. (obviously there are gross ways this is true such as if you subsample chroma but I'm not talking about that).

1.A. One way is just with coding order. In something like a DCT with zig-zag scan, we are assuming there will be more zeros in the high frequency. Then when you use something like an RLE coder or End of Block codes, or even just a context coder that will correlate zeros to zeros, the result is that you will want to crush values in the high frequencies when you do RDO or TQ (rate distortion optimization and trellis quantization). This is sort of subtle and important; RDO and TQ will pretty much always kill high frequency detail, not because you told it anything about the HVS or any weighting, but just because that is where it can get the most rate back for a given distortion gain - and this is just because of the way the code structure is organized (in concert with the statistics of the data). The same thing happens with wavelet coders and something like a zerotree - the coding structure is not only capturing correlation, it's also implying that we think high frequencies are less important and thus where you should crush things. These are perceptual coders.

1.B. Any coder that makes decisions using a distortion metric (such as any lagrange RD based coder) is making perceptual decisions according to that distortion metric. Even if the sub-modes are not overtly "perceptual" if the decision is based on some distortion other than MSE you can have a very perceptual coder.

2. Chroma. It's widely just assumed that "chroma is less important" and that "subsampling is a good way to capture this". I think that those contentions are a bit off. What is true, is that subsampling chroma is *okay* on *most* images, and it gives you a nice speedup and sometimes a memory use reduction (half as many samples to code). But if you don't care about speed or memory use, it's not at all clear that you should be subsampling chroma for human visual perceptual gain.

It is true that we see high frequencies of chroma worse than we see high frequencies of luma. But we are still pretty good at locating a hard edge, for example. What is true is that a half-tone printed image in red or blue will appear similar to the original at a closer distance than one in green.

One funny thing with JPEG for example is that the quantization matrices are already smacking the fuck out of the high frequencies, and then they do it even harder for chroma. It's also worth noting that there are two major ways you can address the importance of chroma : one is by killing high frequencies in some way (quantization matrices or subsampling) - the other is how fine the DC value of the chroma should be; eg. how should the chroma planes be scaled vs. the luma plane (this is equivalent to asking - should the quantizers be the same?).

1/22/2010

01-22-10 - Exponential

One of my all time pet-peeves is people who say things are "increased exponentially". Of course the worst use of all is just when people use it to mean "a lot" , eg. they're not even talking about a trend or a response curve, eg. "the 911 Turbo provides an exponential increase in power over the base spec". This abortion of usage is becoming more and more common. But even scientists and mathematicians will use "exponential" to describe a trend that's faster than linear (it's quite common in NYT math/economics/science articles for example).

Today I was reading this blog on software development organizations and my hackles immediately went up when I read :

"The real cost of complexity increases exponentially."

I started to write a snarky post about how he almost certainly meant "geometrically". But then I started thinking about it a bit more. (* correction : by "geometrically" I mean "polynomially").

Maybe software development time actually is exponential in number of features?

If you're trying to write N features and they are all completely independent, then time is O(N), eg. linear.

If each feature can be used with only one other feature, and that interaction is completely custom but independent, then time is O(N^2), eg. geometric.

Already I think there's a bit of a myth we can tackle. A lot of optimistic software devs think that they can get under even this O(N^2) complexity by making the features interact through some generic well defined pathways. eg. rather than specificially code how feature (A) and feature (B) and feature (A+B) work, they try to just write (A) and (B) but make them both aware of their circumstances and handle various cases so that (A+B) works right. The problem is - you didn't really avoid the O(N^2). At the very least, you still had to test (A+B) to make sure it worked, which meant trying all N^2 ways, so your time is still O(N^2). The code might look like it's O(N) because there's only a function for each feature, but within each function is implicit O(N^2) logic !!

What I mean by implicit logic is the blank lines which testing reveals you don't have to write! eg. :


void Feature_A( )
{

    DoStuff()

    if ( SelectionMode() )
    {
        // ... custom stuff here to make (A+C) and (A+D) work
    }

    // !!! blank lines here !
    //  this is implicit knowledge that (A+B) and (A+E) don't need custom code

}

You might argue that this is slightly less than quadratic complexity for the developer, and tester time is cheaper so we don't care if that's quadratic. But in any case it's geometric and above linear.

But is it actually exponential? What if all the features could be enabled or disabled, so the state of your code is a binary string indicating what features are on or off; eg. 1100 = A on, B on, C off, D off. Then there are in fact 2^N feature states, which is in fact exponential.

Another possibility is if the features can only be enabled one by one, but they have lingering effects. You have some shared state, like a data file you're processing, and you can do A then C then B then E on it. In that case the number of sequences is something like N! which is exponential for large N (actually super-exponential)

Let's concretely consider a real video game coding scenario.

You're trying to write N features. You are "sophisticated" so rather than writing N^2 hard-coded interactions, you make each feature interact with a shared world state via C "channels". (in the old Looking Glass speak, these C channels might be a set of standard "properites" on objects and ways to interact with those channels; in the old Oddworld Munch codebase there were C "component" types that could be on objects such as "SoundTrigger "Pressable" etc.). So your initial code writing time something like O(N*C).

But for the N features to really be meaningful, the C is ~= N (roughly proportional). (or at least C ~= log(N) , but certainly C is not constant as N increases - as you add features you always find you need more channels for them to communicate with each other). So just code writing time is something between O(NlogN) and O(N^2).

But your features also affect shared state - e.g. the "world" that the game takes place in, be that physical state, or state variables that can be flipped on other objects. If you have N objects each with K internal states, this creates K^N world states that have to be tested. Even with very small world state, if the features are order dependent, you're back to N! test cases.

If the bug rate was a constant percentage of test cases (eg. 0.1% of test cases produce a bug), then you are back to exponential number of bugs = exponential coder time. But I'm not sure that model of bugs is correct. If the bug rate was a constant percentage of lines of code, then bug rate would only be geometric.

1/17/2010

01-17-10 - Nob or Knob -

Is there a difference between Nob and Knob or are they just alternate spellings of the same thing ?

It appears that "knob" is generally considered more correct now, though in old english "nob" was more common. There are many meanings, I'll show with the spelling I prefer :

knob : dial/wheel control
knob : bump or protuberance
knob : small amount, usually of butter
knob : head of the penis
nob  : head of a man (archaic)
nob  : wealthy or upper class person (archaic)

Some people seem to think that either "nob" or "knob" are exclusively correct for the slang meaning of penis (they also disagree on whether it refers to the whole penis or just the head). It's unclear to me where the origin of this slang came from, since "nob" can either mean a person's head (archaic), which would suggest the slang came to refer to head of the penis, or "knob" can mean any protuberance.

Wiktionary seems to think "knob" is a common way to refer to a hill; perhaps this is British, I've never heard it in America.

Some weird uses :

"Nob Hill" - my guess is this common name refers to a hill where the wealthy people lived, not the fact that the hill was a knob, but I could be totally wrong about that. Some people seem to point Nob Hill at nabob but since nabob basically means the same thing as nob I don't see why you would point "Nob" at "nabob" when you could just point it at "nob".

"Hob Nob" - apparently this is completely unrelated if you believe this etymology it came from habbe nabbe

"For his nobs" in Cribbage is a funny one ; it appears to also be completely unrelated, coming from the game noddy which means simpleton, and since the knave of the same suit was important to the game it was referred to as the "knave noddy" or just "noddy" which must have become nob. (unrelated but the story of John Suckling, purported inventor of modern Cribbage, seems pretty fascinating; he was apparently a master gambler and cheater at cards who used his skill to get money beyond his station; he was involved in a plot to spring a prisoner from the tower of london, and received at least one beating at the handle of a nobleman tradgames ; ezinearticles ; wikipedia )

There are some funny uses : Nob Hill Knob Set and Nifty Nob Inc. maker of fine Knobs good job on the consistency, guys.

1/14/2010

01-14-10 - A small note on Trellis quantization

See reference .

I guess this is obvious, but you do get a pretty nice win from using the true floating point Dct results rather than the quantized Dct results when you do TQ.

I believe the standard practice (what I was doing before anyway) is to do your normal fast Dct + quantization, which takes your integer pixels and makes quantized integer post-Dct output. You then apply TQ on that quantized-and-transformed output, which means instead of sending the true output { 4,3,0,0 } you also consider {4,2,0,0 } and {4,0,0,0 } and so on, measure J(R,D) of each and pick the best.

Okay, but the distortion of changing "1" to 0 is not the same as the distortion of another 1 to 0 , if those were not really the same 1 before quantization.

For example, say you're quantizing with a quantizer = 1.0 for simplicity, and no deadzone, even bucket sizes, so you have quantization buckets :


[ -0.5, 0.5 ] -> 0
[ 0.5 , 1.5 ] -> 1
etc.

In that case, when TQ decides to take a quantized "1" and send instead a 0, if the true value was 0.51 , that's not so bad. If the true original value was 1.49 , that's a lot worse.

However, interestingly, if the true original value was 1.49 , then we could send it as a "2". If the value is near a quantization boundary, then the distortion doesn't care whether you kick the value up or down, but the rate might be significantly different, in which case you should make a choice based on J and get a win.

So my Dct now also does the true float -> float transform just for use in the distortion measurement for TQ. It's also useful in this application to make sure your Dct doesn't do any scaling, so that the transform is Unitary, that is, L2 norm is preserved. That way the distortion measure in post-Dct space is the same as the distortion measure in pre-Dct space which means you can use the same lambda for lagrange J decisions.

1/13/2010

01-13-10 - Oodle Revisited

I got some emails from a friend recently that made me start thinking about Oodle again. Friend is an indie 360 developer who just shot me a query like (paraphrasing) :

"Hey, I'm loading stuff in my game and only getting like 20 MB/sec , what gives?"

So we started trying to dig into what he was doing exactly - are you opening/reading the files with all the right flags, are you on 4k alignment, are you on a thread so you aren't just stalling out for the seeks, etc. ? In the end I think we figured out that his problem was he was reading too many small files, to which he replied :

"Oh yeah, duh, I've always packed files into bundles and loaded big chunks in previous games, but I just hadn't gotten around to it yet in this and it was bugging me that my level loads were taking so long."

! Ah ha ! To me this is where Oodle comes in as a product. It's not super hard stuff, but it's something that people always put off until the very end, which is kind of a shame because it means their level loads take forever during dev. So for three years while you're making your game you suffer through annoying slow level loads. Instead you buy Oodle on day 1 and your level loads are automagically fast.

I believe in this as a product, in the sense that I think we can make it, and it would be extremely valuable, but I'm not convinced that the game industry is mature enough to buy it. The game industry has always suffered from the mistaken thinking of "I could write that myself, so I won't pay for it". That's silly. What you should look at is what's the full cost of writing it yourself, including debugging, and perhaps most importantly including the opportunity cost of spending that time on this instead of something else, or the opportunity cost of not having feature X until you've gotten around to writing it.

For example, say you're on a one man dev team and you lay out your coding tasks for your project in order {A,B,C,D,E} . All during dev you are suffering a penalty from not having feature E done. If it would make dev a lot easier to have feature E done up front, you should pay a *lot* of money to get it immediately. Some of the classic mistakes in this vein are things like profile HUDs which people put off until the end, but would provide huge benefit if you had from the beginning; another one people aren't aware of is level save/load and memory card support - you think of that as a minor detail to do at the end, but as soon as you do it level designers are walking around with scenarios saved on memory cards to show to each other and they get a huge productivity boost, so it would have been a nice win to do early (though then you have maintenance pain).

To me the big win of Oodle is :

Client just writes code to load files one by one with plain old fopen/fread/whatever.

Behind the scenes Oodle magically robustly incrementally syncs PC files to consoles, packages files into bundles, makes prefetch lists and prefetches bundles, compresses & decompresses bundles on threads. You have ship quality fast loading from the beginning.

That's a small, simple, great product I think. The problem is to make it something compelling enough to sell we start to add lots of features : high performance paging in/out for seamless worlds, hot reloading of changed content, smooth IO integration with streaming data files like audio/video, DVD layout optimization, texture compressors, etc. which really just wind up clouding the tiny little valuable product at the core.

old rants