11/18/2009

11-18-09 - Raw Conversion

CR2 format is a big mess of complication. My god I hate TIFF. The main thing is the sensor data. It appears to be stored as "lossless JPEG" which is a new format that uses the JPEG-LS predictor but then just codes the residual with normal JPEG Huffman coding. The sensor data is RGGB which they either store as a 4-channel per pixel [RGGB per pixel] or as 2-channel [GR or GB]. Either way is clearly not optimal. One interesting thing I could do if I cracked the CR2 format is store all these raws smaller with a better compressor. The RAWs from the S90 are around 11M on average, it uses the 2-channel mode; the RAWs are 1872x2784 = 3744x2784 samples and 12 bits per sample. That means the JPEG is getting to 8.85 bits per sample. Not very good.

Of course I probably have to use dcraw to read it for me, but dcraw is just about the worst piece of code I've ever seen in my life. It's a miracle to me that people are able to write functioning software from code like that.

Paul Lee has a modified dcraw and some nice sample pictures of how demosaicing can go wrong (click the Moire or Aliasing links).

My idea for high quality RAW processing :


First of all, abandon your idea of an "image" as a matrix (grid) of colors (aka a bitmap).

The S90 sensor has barrel distortion that's corrected in software.

It also samples colors in an RGGB Bayer mosaic pattern (like most cameras).

The two of those things combined mean that you really just have a collection of independent R's, G's, and B's at
irregular positions (not on a grid due to barrel distortion).

Now, you should also know that you need to do things like denoising on these original samples, NOT on
the grid of colors after conversion to a bitmap.

So I want to denoise directly on the source data of irregular color samples.
Denoising R & B should make use of the higher quality G data.

Denoising should of course use edge detection and other models of the image prior to make a Bayesian
maximum likelihood estimate of the sample without noise.

To output a bitmap you need to sample from this irregular lattice of samples (mosaic'ed and distorted).

Resampling creates aliasing and loss of information, so you only want to do it once ever on an image.

There's absolutely no a-priori reason why we should be resampling to the same resolution as the sensor
here.  You should resample at this point directly to the final resolution that you want your image.

For example with the S90 rather than outputting the stupid resolution 3648x2736, I would just output 3200x2400
which would let me view images at 1600x1200 on monitors with a box down-filter which will make them appear
much higher quality in practice (vs 3648x2736 viewed at 1600x1200 which involves a nasty blurring down-filter).

The output from this should be a floating point bitmap so that we don't throw away any color resolution
information.

Exposure correction can then be done on the floating point bitmap without worrying about the irregular
lattice or any further resampling issues.

11/06/2009

11-06-09 - IsSameFile

I found myself wanting to know if two file names were the same file on disk. It's hard to check that just by looking at the name. Obviously you have issues like one might be absolute, one might be relative. Even if you fix that, they could be different A-code-pageizations of unicode names. And something I hit often is one of them might be on a "subst" or even a hard link. I want to know if they are actually the same file.

This appears to work :


bool IsSameFile(char * Name1,char * Name2)
{
    HANDLE f1,f2;
    BY_HANDLE_FILE_INFORMATION info1;
    BY_HANDLE_FILE_INFORMATION info2;
    
    f1 = CreateFile(Name1,GENERIC_READ,FILE_SHARE_READ,0,OPEN_EXISTING,0,0);
    if ( f1 == INVALID_HANDLE_VALUE )
        return false;
    
    f2 = CreateFile(Name2,GENERIC_READ,FILE_SHARE_READ,0,OPEN_EXISTING,0,0);
    if ( f2 == INVALID_HANDLE_VALUE )
    {
        CloseHandle(f1);
        return false;
    }   
    
    GetFileInformationByHandle(f1,&info1);
    GetFileInformationByHandle(f2,&info2);
    
    CloseHandle(f1);
    CloseHandle(f2);
    
    // BY_HANDLE_FILE_INFORMATION has a unique file ID and Volume Serial Number in it
    //  check those are the same
    //  heh fuck it just check they are all the same
    
    // confirmed : this does work across substs
    
    return memcmp(&info1,&info2,sizeof(info1)) == 0;
}

11/04/2009

11-04-09 - Video is not for Windows

Holy crap video is a flaming mess.

First, if you don't know, there are two main things : packages & streams. The packages are AVI, MKV, MP4, MOV ; they put together image data and audio data in some way (the layer that unpacks these is often called a "splitter" or "demuxer"). The packages send their streams to codecs which convert them to some format for display. On Windows this is all supposed to go through DirectShow which is supposed to use the 4CC codes and some priority information to automatically find the right handler for the various streams and packages. In theory.

The first problem you hit on Windows is that AVI packages are handled pretty well, but AVI can't hold H264 video because AVI can't handle the flexible B-frame ordering that H264 can generate. (limitted profiles of H264 can be put in AVI, and there are hacks around this problem, but you're getting into a world of hurt). So you need MKV or MP4 boxes, and those are handled poorly; some apps handle them okay, some don't. Some apps "cheat" and don't trust DirectShow like they're supposed to (the cheating apps often work better).

Things I've installed lately :


MP4Box : MP4 stream boxer/unboxer ; pretty decent app, recommended, but help is poor

YAMB : GUI for MP4Box.  Useful to help figure out command lines for MP4Box because the help is bad.
    YAMB has bad bugs though and will fail to launch MP4Box, so you have to copy out the command line
    and run it yourself

MKVVerify : MKV stream checker.  Useful because MKV support is so fucking borked.

MediaInfo : Media info reporter.  Questionable usefulness because I don't trust it and don't know where
    it's getting it's info for.

Graphedit : DirectShow graph visualizer and tester from MS

GSpot : AVI info tool.  Useless.

MSU VMT : Moscow State University Video Quality Measurement Tool.  This is pretty neat when it works,
    but far too often it fails to get the frames correctly, so you get totally bogus results.

MSU LS Codec : Moscow State University Lossless Codec.  Best compressing lossless codec, seems nice
    but crashes some tools when you try to use videos compressed with this.  Thus useless.

Lagarith Codec : This appears to be the one good working lossless codec.  Recommended.

HuffYUV Codec : Videos made with this crash me on read.  Jeff says it works great for him.  Avoid.

MeGUI : GUI for "mencoder" which can driver AviSynth and x264 ; like all of these big GUIs that try to
    run a bunch of other products, this mysteriously fails for me.  It seems to set everything up right
    and then it launches ten other programs and they fail to hook up in the way MeGUI expected them to.
    Garbage.

Handbrake : see MeGUI

FFDShow : hooks up the Linux video decoders (ffmpeg , libavc, etc.) to DirectShow.  This thing is
    pretty evil and fails to report frame rate and media info sometimes, but is also the only real
    choice. 

Haali Media Splitter : MKV unboxer, works with FFDShow.  Difficult to install correct manually.
    Even when installed correctly, does some weird shit with framerate; doesn't seem to report it
    correctly through DirectShow.  Probably best to get a codec pack like :

K-Lite Codec Pack : works for me but generally is considered malware
Matroska Codec Pack : didn't work for me
CCCP Codec Pack : not tried

MPlayer : Linux media player, now ported to Windows ; very flexible command line control of everything,
    alternate audio/video in/out.  Highly recommended.

MEncoder : video encode/decode partner to MPlayer.  I've had more success running mplayer and x264 manually
    than using this.  Still I can't complain about MEncoder from the command line.

MPUI : GUI for MPlayer.  This is horrific malware.  When you install it, it takes over your system without
    asking.  They do provide some tools for you to change this after the fact, but still should be avoided.
    Use Media Player Classic or VLC.

AviSynth : script thing to pipe video to other programs that read AVS scripts.  Dear lord.

Basically I've found that all the GUI's are broken, and all the video containers (AVI,MP4,MKV) are broken. The thing I've finally discovered that actually works is using MPlayer and X264 from the command line, and only working with split frames. Trying to work with video containers caused me all kinds of hurt because so many of these apps fail to unbox the containers right and screw up the frame rate or drop frames or other mistakes. Instead now if I want to work on a video I use MPlayer to convert it to raw frames.


mplayer -benchmark -ao null -vo png:z=5 video.avi

to dump frames to PNG

mplayer -benchmark -ao null -vo yuv4mpeg:file=test.y4m video.avi

to dump the video to YUV4MPEG format in "test.y4m" for input to x264

x264.exe --bitrate 10000 --output "out.mp4" test.y4m

x264 compress to "out.mp4"

Then use mp4box to put the audio back if wanted.

The cool thing about mplayer is that its audio/video decoders are the same ones used to view the video. So you can watch it, and if it plays right in the viewer, then it will extract correctly. I've found lots of videos that I can watch in MPC or VLC, but then fail to load the same way in whatever encoder/decoder when I try to process something.

The sucky thing about this method is you make ginormous temp files on your disk, which also slows things down a lot. But avoiding the fuckups of going through the borked DShow codecs and splitters is worth it.

Most of these tools now are originally Linux tools that are getting moved back to Windows. One very promising development is that many of them have the option to directly load libs for the codecs Linux-style (eg. just load libavc to play video) and avoid DirectShow completely. I haven't really tried that yet but it seems like it's almost possible to work with video just by manually picking a few of these libs and then you avoid the whole Windows borked media layer.

ADDENDUM : one of the difficulties I've seen in a lot of tools is reading the frame rate wrong. This is presumably due to the demuxers not reporting things back totally right. But there are also two related fundamental problems that make everything harder :

1. Most of these formats don't have real/useful headers. (if they do have header info, it's just added as "comment" information). This was done originally because theoretically if your AVI is being broadcast on TV and you change the channel into it, you will just start getting bytes in the middle and never see the header, thus they didn't put good headers on at all.

2. It's almost impossible to really reliably just get the frames out of video. DirectShow doesn't have a reliable call that's just "give me the next frame". Instead you have to ask for "when is the next frame" and then "give me an image at this time". The problem is that the "when" can get fucked up in various ways, and then when you say "give me an image at this time" you can either skip frames or get duplicate frames. (this is what fucks up the MSU VMT tool for me so badly, they are getting the time sampling all wrong quite often).

Even if it's not way off, this still causes subtle bugs because people don't agree exactly on how to represent the frame rate. Some people treat broadcast as exactly = 30000/1001 fps and use rational arithmetic for all timing. Some people use floats for frame rate and use 29.97003 and then wind up with floating point precision problems at high frame numbers. Many of the containers store the frame rate as a number of microseconds between frames, eg. 33367 ; so if they store "33367" in the header, should I use that as my frame time increment exactly, or should I use 33366.666666 ?

I'm guessing that tons of people get duplicate and/or dropped frames because of this and just don't notice it.

10/19/2009

10-19-09 - Against asynchronous GUIs

Most new Microsoft products have super asynchronous GUIs. This means they do GUI refreshes and window popups and threads and so on. This sounds like a nice thing, it removes big stalls and freezes in the GUIs and makes them "respond" more quickly, in the sense that when you request some new view it pops the window up immediately and then gradually fills it in, whereas before it would just freeze and wait for the new view to pop up.

In practice for power users this just sucks absolute balls. Power users use the keyboard. We use our peripherial vision. We don't sit around and wait for windows to pop up, we hit key combos that we know and we memorize sequences. So for example you might memorize that hitting Alt-F opens a find dialog and puts the focus in the entry window, so you can highlight the text you want to find, Ctrl-C, Alt-F, Ctrl-V, Enter, all in a row boom you're searching for that text.

BUT NO! No you aren't, instead you just pasted that text into your document because the fucking finder window was being built on a thread and wasn't opened in time to receive your ctrl-v so your ctrl-v went to the old window.

There are few obvious GUI design fails here. One is that asynchronicity and focus stealing should never go together. If a window opening is async then it should not transfer focus to that window. Also, if a key pops up a new window and transfers focus, do that immediately and queue up the key presses and don't process them until the action associated with each key press is complete. eg. if you get input like A,B,C, if you process A and it fires something async, then don't respond to B until the action from A is done if it's possible that they affect each other.

More generally, this programming pattern of finding clever complicated ways to hide the fact that your systems are overly bloated and slow is just not the win. You will only make the failure cases less common but more ugly. For example in this particular case I'd rather have a slow modal popup than an unpredictable async popup.

Now, there certainly is a place for asynchronicity in GUIs, namely in tasks spawned from GUIs that run in a pane. When you start some long process, that should be async from the GUI and run on its own thread, but I would argue that the entire GUI should run on one single thread and GUI drawing updates of things like icons and buttons should not be async (not any aspect of it that affects user input - purely visual updates could be async). So like for example, when you open a new folder, the population of the list for that folder should NOT be async because the user may be typing characters to navigate to names within the folder, and if you populate async you will screw up the response to their typing (it could be okay to be async here *if* you make it respond in the same way that it would if it was synchronous!). Ideally GUI operations should just be fast and not need to be async.


How do we not have a fucking universal laptop docking multi-port thing !? WTF.

The delay of Windows/PC's recognizing USB devices is really tilting. Every time I move my laptop or plug in the keyboard, it's wait .. wait .. wait .. okay now I can type. Urg.

I'm trying this new thing where I put my computer on automatic standby after 2 hours. I've tried lots of reminder apps before to make myself get up and stretch, but I always just ignore them. Standby is a hard enough obstacle that I think it might make me actually get up - if nothing else I'll get up in anger because the standby will fuck up what I'm doing. But when I come out of standby I have to deal with the fucking USB.


Computer-laid-out maps are abominations. Hand drawn maps are beautiful and convey so much more information. (what follows are just links related to maps, not examples of great hand-drawn maps)

Metskers Maps in Seattle is pretty great.

FeetFirst is a non-profit that supports walking in Seattle. Their maps are only okay.

See also : City of Seattle Map of public art and Historic Places in Seattle .

9/29/2009

09-29-09 - Aliasing is Pretty

I made some images of aliasing/moiray patterns : (BTW inherent to these images is the fact that they don't look at all right except at full zoom since the image structure is created by the pixel grid sampling aliasing, so click through). You may be surprised to know what these are an image of.

Photobucket

Photobucket

Photobucket

Photobucket

Photobucket

Photobucket

Photobucket

....

Answer : these are pictures of sin( x^2 + y^2 ) , that is, sin( r^2 ) a very simple radial trig function. You don't see the function at all, the images are 100% aliasing as the radial sin is scaled to very high frequency.

9/26/2009

09-26-09 - Habits - Code Inlining

There's been an off and on discussion with some peers in email about the issue of putting small code functionality directly inline in a bigger function vs. splitting it out. I don't think it's actually an interesting topic so I don't want to repeat it here, but it made me think of something I want to write about.

Basically the contention is that even if you have a logically separate piece of work, sometimes it's better to go ahead and write it inline in the larger function, because it makes the code flow more linear and imperative and thus easier to read and debug. There are also obvious disadvantages to writing code inline - less strong separation of functional bits, temptation to copy-paste code, less ability to debug and test functions independently, etc. etc. All the pros and cons are obvious and thus not interesting.

The key point to me is the issue that while code inline may be a win sometimes - it's a big loss at other times, and it requires you to make a careful smart logical decision about which way to go.

That's very bad. I think most smart people as they get older come to the realization that you shouldn't put yourself in positions where you have to make careful logical decisions over and over. Instead you should craft habits and rules for yourself that help you do the right thing without having to make careful decisions.

One obvious case that most smart people figure out is diet and exercise, or booze or other indulgences. Yes, it's perfectly fine to have a some dessert once in a while, but if you open that door for yourself you're putting yourself in a situation where you are consciously making a decision "is it okay for me to have cake today?" and you will inevitably get lazy and make the wrong choice sometimes.

As usual the best analogy is poker, and it's how this point was really made real for me. Smart people often start out playing poker trying to logically reason out every single action and think they don't need to be constrained by simple rules or habits. That's great if you really are correctly thinking through every situation, but you inevitably get tired or lazy or make mistakes, and if you're not constraining yourself with rules, you can make huge mistakes.

For example, there might be cases where the best play is to limp aces up front, or to not reraise with aces, but correctly making that decision requires a lot of careful thought, and the upside to making that decision is pretty small, while the downside to doing it in the wrong situation is very big. It's best to just make a rule for yourself that you always raise or reraise. It simplifies things, it removes decision points and lets you focus on more important issues. It might be +EV to call raises sometimes with hands like 68o, but it's best to just give yourself a rule that you never do that.

To be clear - these rules are specifically non-optimal. By making rules for yourself you are intentionally choosing to not try to be 100% optimal, so of course someone can say "you could have done something better there". That's not the point. The point is that if you try to make perfect decisions all the time you will occasionally fail very badly.

Winning poker play (or good coding, or good life living) are largely about making it easy for yourself to do the right thing.

9/25/2009

09-25-09 - Motion Search

I just read the paper on Patch Match and it makes me angry so I figure I'll write about the motion search method I'm developing for possible future use in the new RAD video stuff. PatchMatch is just so incredibly trivial and obvious, it's one of those things that never should have been a paper and never should have been accepted in a journal. It's a great thing for someone to write on their blog because you can describe it in about one sentence, and most experts in the field already know the idea and are probably doing it already. (I will say the good thing about the paper is they do a good job of gathering references to other papers that are related, such as stuff in texture synth and hole filling and so on which I find interesting).

Here's the one sentence version of PatchMatch : Seed your match field with some random guess or shitty initial matches; improve by incrementally propagating match offsets to neighbors and trying small random deltas to find improvements. (it's an absolute classic spin network magnetic moment relaxation kind of problem).

Here's what I've been doing : start with a match field set to all nulls (no match found yet). Then incrementally fill it in with matches and propagate them to neighbors. It proceeds in a few steps like this :

Step 1. Use computer vision methods to find feature points in a frame. Match feature points to the previous frame. This bit is a bit tricky and tweaky, you only want to make matches that you're pretty confident in. Note that this matching is done based on a "characteristic" of the feature point which has no distance limit, and is also somewhat immune to rotation and scaling and such. Sometimes this step finds some very good correspondences between the frames, but it's sparse - it only has high confidence at a few places in the frame, so you can't use it to find all the block matches (and you wouldn't want to even if you could). Generally this finds around 100 vectors.

Step 2. Find "distinctive" spots in the frame. The goal is to find some spots that are not degenerate - eg. not flat patches, not straight edges. The idea is that these are places where we can likely find a good motion vector with high confidence, unlike degenerate areas where there are lots of equally good match vectors. I use two mechanisms to find distinctive spots : one is the computer vision feature points that were not already used in the first matching step. The second is to take the "cornerness" map of the image using a Harris or Hessian operator on the derivative of gaussians (this is a lot like an edge map, but it kills straight edges). Find the top 5% highest cornerness values that are local maxima, and use those as distinctive spots. All of the distinctive spots do a long distance brute force block match (something like radius = 16 or 32) to try to find a good motion vector for them.

Step 3 : Flood fill to fill in the gaps. We now have presumably good motion vectors at a few key points in the frame. Go to their neighbors and search for match vectors that are close to the neighboring one that we already found. Put that in the blank and push its neighbors to the queue to continue the flood fill.

Step 4 : Relaxation pass. (this is not critical). We now have a motion vector everywhere in the frame. For each match vector in the frame, look at its 4 neighbors. Examine match vectors that are near my 4 neighboring vectors. If one is better, replace self. Continue to next. Theoretically you should do this pass a few times, but I find 1 or 2 is very close to infinite.

The key thing is that motion is usually semi-coherent (but not fully coherent, because we are not really trying to find true motion here, but rather just the best matching block, which is a lot more random than true motion is). By finding very good motion vectors in seed spots where we have high confidence, we can propagate that good information out to places where we don't have as much confidence. This lets us avoid doing large brute-force searches.

BTW I really do not understand the point of all the "diamond search" type shit in the video compression literature. It seems to just find really shitty motion vectors and is not making good use of the possibilities in the bit stream. Especially with GPU video encoding in this modern age, doing plain old big chunks of brute force motion search is preferrable. (yes, I know it's for speed, but it's a poor way to optimize, and the high quality encoders are still non-realtime anyway, so if you're not realtime you may as well take some more time and do better; plus the vast majority of use of non-realtime video encoders is in an encode-once decode-many type of scenario which means you should spend a lot of cpu and encode as well as possible).

With this method I find motion vectors using local searches of radius 8-16 that are the same quality as brute force searches of radius 50-100, which makes it about two orders of magnitude faster (and higher quality, since nobody does brute force searches that far).


ADDENDUM : To give this post a bit more weight, here are some numbers on quality from my video coder vs. brute force search radius :


 -s16  : rmse : 9.3725 , psnr : 28.7277
 -s26  : rmse : 9.2404 , psnr : 28.8510
 -s48  : rmse : 9.0279 , psnr : 29.0531
 -s64  : rmse : 8.9171 , psnr : 29.1603
 -s100 : rmse : 8.7842 , psnr : 29.2907
 -s9999: rmse : 8.5294 , psnr : 29.5465

(-s16 means it's searching a 33x33 grid for motion vectors) (-s9999 means it searches full frame).

The above described iterative feature point propagation method gets


 -sfast: rmse : 8.8154 , psnr : 29.2600

BTW for doing full-frame brute force search you obviously should use a block-space acceleration structure for high dimensional nearest neighbor search, like a kd-tree, a bd-tree (box decomposition) or vp-tree (vantage point). High dimensional spaces are nasty though; the typical idea of "find a cell then walk to its immediate neighbors" is not fast in high D because you have O(D) neighbors.

9/08/2009

09-08-09 - DXTC Addendum

Ryg pointed out that there are a few very important little details that I took for granted and didn't mention in my original DXTC postings , or was just not clear about :

One is that when I try all ways of hitting two given endpoints, I try both 4-color and 3-color versions. That is, given two endpoint colors C0 and C1, I quantize them to 16 bits, then try the DXT1 palette that you get from {C0,C1} and also the one from {C1,C0} (order of DXT1 determines whether it is 3-color or 4-color).

The second and related crucial thing, is that in 3-color mode, the extra color is transparent black. If the texture has no alpha at all, I assume the user will not be using it as an alpha source, so I treat the transparent black as just black. That is, I do color palette selection with alpha just ignored.

Apparently this is pretty important. I suspect this especially helps with the "4 means" method; if a bunch of the colors are near black, you want them to be classed together and then just ignored for the endpoint selection, so that they will go to the hard-coded black in 3-color mode and your interp end points will be chosen from the remaining colors.

8/27/2009

08-27-09 - Oodle Image Compression Looking Back Pictures

I thought for the record I should put up some pictures about what I talked about last time.

First of all the R/D trellis quantization issue. Very roughly what we're doing here is coding to a certain bit rate. The "RDO" lets us use a smaller quantization bucket size, which initially lowers distrortion and increases our rate, but then we hammer on some of the values - mainly we just force them to zero, which causes some distortion and decreases rate; we choose to hammer the values that save us the most rate per distortion. (99% of the time all you're doing is turning 1's into 0's, so it's a matter of picking the 1 to squash to 0 which saves you most the rate).

Here are the results on "Moses" at 0.5 bits per pixel :

No R/D : RMSE = 9.9381 :

Photobucket

Unconstrained R/D : RMSE = 9.7843 :

Photobucket

You should be able to see in the R/D image that some of the image looks better, but other parts look much worse. The RDO has stolen rate from places where it was expensive in terms of rate to encode a certain distortion, and moved those bits to parts of the image where you can get more distortion win at a cheaper rate. This is awesome if your goal is to minimize RMSE, but it's unclear to me whether this is *ever* good perceptually.

In this particular case, the RDO Moses image actually has a worse SSIM than the No-RD image; this type of mistake is actually something that SSIM is okay at detecting.

In practice I use some hacks to limit how much the RDO can do to any one block. With those hacks I almost always get an SSIM improvement from RDO, but it's still unclear to me whether or not it's actually a perceptual improvement on many images (in some cases it's a very clear win; images like kodim09 or kodim20 where you have big flat patches in some spots and then a lot of edge detail in other spots, the RDO does a good job of stealing from the flats to give to the edges, which the eye likes, because we don't mind it if an almost perfectly smooth area becomes perfectly smooth).

Now for the hacky perceptual smooth DC issue.

This is "kodim04" at 0.25 bpp ; no RDO ; no unblock , no perceptual DC quantization ; basically a naive DCT coder :

Photobucket

Now we turn on the hacky perceptual quantization that gives more precision to smooth DC's : (unblock still off) :

Photobucket

Note that the perceptual quant of DC means that we are using more of our bitrate for the DC band, so we give less bits to AC, which means using a larger quantizer for AC to match the bit rate constraint.

Now with unblocking , no perceptual DC quant : (RMSE = 12.8565 , SSIM = 58.62%)

Photobucket

With unblocking and perceptual DC quant : (RMSE = 12.9666, SSIM = 57.88%)

Photobucket

I think the improvement is clearest on the unblocked images - the perceptual DC quant one actually looks okay, the parts that are supposed to be smooth still look smooth. The one with uniform DC quant looks disgustingly bumpy. Note that the SSIM of the better image is actually quite a bit worse. Of course RMSE gets worse any time you do a perceptual improvement. You should also be able to see that the detail in the hat thatching is better in the nonperceptual version, but that doesn't bother the eye nearly as much as breaking smoothness.

ADDENDUM : some close up pictures of Moses' waddle area showing the R/D artifacts better. You should zoom these to full screen with a box filter and toggle between them to see most clearly. You should see the RDO killing blocks in the collar area very clearly. All you really need to do is look at the last picture of these four and you should be able to see what I'm talking about with the RDO :

Portion of Moses at 0.75 bpp : No lagrange optimization :

Photobucket

With Lagrange RDO :

Photobucket

Crop of No-L :

Photobucket

Crop of RDO :

Photobucket

8/25/2009

08-25-09 - Oodle Image Compression Looking Back

I did a little image compressor for RAD/Oodle. The goal was to make something with quality comparable to a good modern wavelet coder, but using a block-based scheme so that it's more compact and simple in memory use so that it will be easy to stream through the SPU and SIMD and all that good stuff, we also wanted an internal floating point core algorithm so that it extends to HDR and arbitrary bit depths. I wrote about it before, see : here or here . That's been done for a while but there were some interesting bits I never wrote about so I thought I'd note them quickly :

1. I did lagrange R-D optimization to do "trellis quantization" (see previous ). There are some nasty things about this though, and it's actually turned off by default. It usually gives you a pretty nice win in terms of RMSE (because it's measuring "D" (distortion) in terms of MSE, so by design it optimizes that for a given rate), but I find in practice that it actually hurts perceptual quality pretty often. By "perceptual" here I just mean my own eyeballing (because as I'll mention later, I found SSIM to be pretty worthless). The problem is that the R-D trellis optimization is basically taking coefficients and slamming them to zero where the distortion cost of doing that is not worth the rate it would take to have that coefficient. In practice what this does is take individual blocks and makes them very smooth. In some cases that's great, because it lets you put more bits where they're more important (for example on images of just human faces it works great because it takes bits away from the interior patches of skin and gives those bits to the edges and eyes and such).

One of the test images I use is the super high res PNG "moses.png" that I found here . Moses is wearing a herring bone jacket. At low bit rates with R-D Trellis enabled, what happens is the coder just starts tossing out entire blocks in the jacket because they are so expensive in terms of rate. The problem with that is it's not uniform. Perceptually the block that gets killed stands out very strongly and looks awful.

Obviously this could be fixed by using a better measure of "D" in the R-D optimization. This is almost a mantra for me : when you design a very aggressive optimizer and just let it run, you better be damn sure you are rating the target criterion correctly, or it will search off into unexpected spaces and give you bad results (even though they optimize exactly the rating that you told it to optimize).

2. It seems DCT-based coders are better than wavelets on very noisy images (eg. images with film grain, or just images with lots of energy in high frequency, such as images of grasses, etc). This might not be true with fancy shape-adaptive wavelets and such, but with normal wavelets the "prior" model is that the image has most of its energy in the smooth bands, and has important high frequency detail only in isolated areas like edges. When you run a wavelet coder at low bit rate, the result is a very smoothed looking version of the image. That's good in most cases, but on the "noisy" class of images, a good modern wavelet coder will actually look worse than JPEG. The reason (I'm guessing) is that DCT coders have those high frequency pattern basis functions. It might get the detail wrong, but at least there's still detail.

In some cases it makes a big difference to specifically inject noise in the decoder. One way to do this is to do a noisey restore of the quantization buckets. That is, coefficient J with quantizer Q would normally restore to Q*J. Instead we restore to something random in the range [ Q*(J-0.5) , Q*(J+0.5) ]. This ensures that the noisey output would re-encode to the same bit stream the decoder saw. I wound up not using this method for various reasons, instead I optionally inject noise directly in image space, using a simplified model of film grain noise. The noise magnitude can be manually specified by the user, or you can have the encoder measure how noisey the original is and compare to the baseline decoder output and see how much energy we lost, and have the noise injector restore that noise level.

To really do this in a rigorous and sophisticated way you should really have location-variable noise levels, or even context-adaptive noise levels. For example, an image of a smooth sphere on a background of static should detect the local neighborhood and only add noise on the staticy background. Exploring this kind of development is very difficult because any noise injection hurts RMSE a lot, and developing new algorithms without any metric to rate them is a fool's errand. I find that in some cases reintroducing noise clearly looks better to my eye, but there's no good metric that captures that.

3. As I mentioned in the earlier posts, lapping just seems to not be the win. A good post process unblocking filter gives you all the win of lapping without the penalties. Another thing I noticed for the first time is that the JPEG perceptual quantization matrix actually has a built-in bias against blocking artifacts. The key thing is that the AC10 and AC01 (the simplest horizontal and vertical ramps) are quantized *less* than the DC. That guarantees that if you have two adjacent blocks in a smooth gradient area, if the DC's quantize to being one step apart, then you will have at least one step of AC10 linear ramp to bridge between them.

If you don't use the funny JPEG perceptual quantization matrix (which I don't think you should) then a good unblocking filter is crucial at low bit rate. The unblocking filter was probably the single biggest perceptual improvement in the entire codec.

4. I also somewhat randomly found a tiny trick that's a huge improvement. We've long noticed that at high quantization you get this really nasty chroma drift problem. The problem occurs when you have adjacent blocks with very similar colors, but not quite the same, and they sit on different sides of quantization boundary, so one block shifts down and the neighbor shifts up. For example with Quantizer = 100 you might have two neighbors with values {49, 51} and they quantize to {0,1} which restores to {0,100} and becomes a huge step. This is just what quantization does, but when you apply quantization separately to the channels of a color (RGB or YUV or whatever), when one of the channels shifts like that, it causes a hue rotation. So rather than seeing a stair step, what you see is that a neighboring block has become a different color.

Now there are a lot of ideas you might have about how to address this. To really attack it thoroughly, you would need a stronger perceptual error metric, in particular one which can measure non-local patterns, which is something we don't have. The ideal perceptual error metric needs to be able to pick up on things like "this is a smooth gradient patch in the source, and the destination has a block that stands out from the others".

Instead we came up with just a simple hack that works amazingly well. Basically what we do is adaptively resize the quantization of the DC component, so that when you are in a smooth region ("smooth" meaning neighboring block DC's are similar to each other), then we use finer quantization bucket sizes. This lets you more accurately represent smooth gradients and avoid the chroma shift. Obviously this hurts RMSE so it's hard to measure the behavior analytically, but it looks amazingly much better to our eyes.

Of course while this is an exciting discovery it's also terrifying. It reminded me how bad our image quality metrics are, and the fact that we're optimizing for these broken metrics means we are making broken algorithms. There's a whole plethora of possible things you could do along this vein - various types of adaptive quantizer sizes, maybe log quantizers? maybe more coarse quantizers in noisy parts of the image? it's impossible to explore those ideas because we have no way to rate them.

As I mentioned previously, this experiment also convinced me that SSIM is just worthless. I know in the SSIM papers they show examples where it is slightly better than RMSE at telling which image is higher quality, but in practice within the context of a DCT-based image coder I find it almost never differs from RMSE; that is, if you do something like R-D optimized quantization of DCT coefficients with Distortion measured by RMSE, you will produce an image that has almost exactly the same SSIM as if you did R-D with D measured by SSIM. If RMSE and SSIM were significantly different, that would not be the case. I say this within the context of DCT-based image coding because obviously RMSE and SSIM can disagree a lot, but that axis of freedom is not explored by DCT image coders. The main thing is that SSIM is really not measuring anything important visual at all. A real visual metric needs to use global/neighborhood information, and knowledge of shapes and what is important about the image. For example, changing a pixel that's part of a perfect edge is way more important than changing an image that's in some noise. Changing a block from grey to pink is way worse than changing a block from green to blue-green, even if it's a smaller value change. etc. etc.

It seems to me there could very easily be massive improvements possible in perceptual quality without any complexity increase that we just can't get because we can't measure it.

old rants