11-14-07 - 4

ImDoub notes

The basic approach of the trained image doubler works like this :

The goal is to take an image I and pretend it was created by downsampling an image of 2X the res, which we'll call 2I.

Of course there are many images 2I that when downsampled would produce I. Of those many possible 2I images we want to choose the most likely. To define "most likely" we assume that all images are generated by a statistical source.
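To make the "many possible 2I" point concrete, here's a tiny numeric sketch. It assumes a simple 2x2 box-filter downsample, which is my assumption; the post doesn't specify the filter:

```python
def box_down(block):
    # Average a 2x2 high-res block down to a single low-res pixel.
    return sum(sum(row) for row in block) / 4.0

# Two different high-res blocks that downsample to the same pixel value,
# so the doubler must pick between them (and infinitely many others).
flat   = [[10, 10], [10, 10]]
checks = [[0, 20], [20, 0]]

print(box_down(flat), box_down(checks))  # both give 10.0
```

Both blocks are equally valid preimages of the pixel 10; "most likely" is what breaks the tie.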

So, how do we model this statistical source? We simply gather a big library of images {L} and say they are likely to come from the universal image source (UIS). For each of these images we have an example "doubling", we say that the downsampled image (L/2) when doubled should be L.

So we have this huge space of training samples, { L/2 -> L }, and we basically want to find our image I in that space and interpolate. That space is way too big and too sparse for this to ever work, so we make the assumption that the UIS is *local* and also *unoriented* and also *relative*.

Local means that a pixel's probability is only affected by its neighbors. In practice we'll use the 2-ring (the immediate neighbors and their neighbors).
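A minimal sketch of grabbing the 2-ring (a 5x5 block), assuming images are row-major lists of pixel values; clamping coordinates at the edges is my choice, the post doesn't say how borders are handled:

```python
def ring2_neighborhood(img, x, y):
    # Gather the 5x5 block (the 2-ring) centered on (x, y).
    # Out-of-bounds coordinates are clamped to the image edge.
    h, w = len(img), len(img[0])
    return [[img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
             for dx in range(-2, 3)]
            for dy in range(-2, 3)]
```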

Unoriented means that if you take a neighborhood and flip it in X or Y or rotate it 90 degrees, the probabilities are the same.

Relative means that if you add or subtract a constant value to the whole neighborhood it doesn't affect the probabilities.

Now, all 3 of these assumptions are in fact obviously wrong. Locality is obviously wrong because images may have repeated features, like a wallpaper pattern, and obviously having the same pattern in another part of the image far away affects the probabilities in the repeated area. Unoriented is obviously wrong if you just consider images of text that obviously treat X and Y differently, and even in photos they are much more likely to have the horizon line horizontal. Relative is also obviously wrong when you get near the neighborhood of 0 and 255 where you have clamping; perhaps logarithmic-scale infinite range images would be close to being "relative" but real images obviously are not.

Regardless, we're going to use these 3 assumptions (as do almost all image transform algorithms). Furthermore note that we are NOT assuming *scale* independence, either in spatial scale or value scale. That is, we are not assuming that you can multiply a neighborhood by a constant and get the same result, nor are we assuming that if you zoom out and take a wider set of neighbors you will get the same result.

Anyway, using these symmetries we can now create a much denser training set. Instead of using a whole image from the library {L} we take a local neighborhood of pixels N (the 2-ring) around each sample in L/2. We put the neighborhood into canonical form. First subtract off the average of the neighborhood; this removes the constant offset. Then flip in X and Y until the values are growing in the +X direction and in the +Y direction. This orients all neighborhoods in the same way.
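Here's one possible sketch of the canonicalization step. The post doesn't specify exactly how to decide the flips, so comparing the summed values of the two halves is my assumption; the returned mean and flip flags are kept so a prediction can later be mapped back out of canonical form:

```python
def canonicalize(n):
    # Put a square neighborhood into canonical form:
    # 1) subtract the mean (the "relative" assumption),
    # 2) flip in X and/or Y so values grow toward +X and +Y
    #    (the "unoriented" assumption).
    # Returns (canonical, mean, flip_x, flip_y) for mapping back.
    size = len(n)
    mean = sum(sum(row) for row in n) / (size * size)
    n = [[v - mean for v in row] for row in n]
    # Decide the X flip: compare the left-half mass to the right-half mass.
    left  = sum(row[i] for row in n for i in range(size // 2))
    right = sum(row[i] for row in n for i in range(size - size // 2, size))
    flip_x = left > right
    if flip_x:
        n = [row[::-1] for row in n]
    # Decide the Y flip the same way on the rows.
    top    = sum(sum(n[i]) for i in range(size // 2))
    bottom = sum(sum(n[i]) for i in range(size - size // 2, size))
    flip_y = top > bottom
    if flip_y:
        n = n[::-1]
    return n, mean, flip_x, flip_y
```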

This canonical form of the neighborhood now predicts a certain doubled pixel; these pairs make up the training set {N -> L}. Again the procedure is simply: take the image you want to double {I}, form the neighborhoods in canonical form, look them up in the training set {N}, and interpolate between the closest ones you find. Once you have the prediction, you take it back out of canonical form, putting the right local offset back in to create an output pixel.

In practice how do you do this interpolation? Neural Nets would work fine. So does a pure dense sample search like k-nearest-neighbors, where you have a distance metric to find the closest samples. Perhaps the best way is a Support Vector Machine (SVM), which is precisely designed to be ideal for this type of work; all of the neighborhood samples {N} are the support vectors and you can optimize for a desired machine size (which is equivalent to the sensitivity to noise in the training samples).
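A toy sketch of the nearest-neighbor interpolation. The inverse-distance weighting is my assumption, not from the post; the training set is a list of (flattened canonical neighborhood, target value) pairs:

```python
def predict(query, training_set, k=4):
    # Find the k closest canonical neighborhoods and blend their
    # doubled-pixel targets, weighting each by inverse squared distance.
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(training_set, key=lambda s: dist2(s[0], query))[:k]
    weights = [1.0 / (1e-6 + dist2(n, query)) for n, _ in nearest]
    total = sum(weights)
    return sum(w * t for w, (_, t) in zip(weights, nearest)) / total
```

In a real doubler this linear scan would be replaced by an approximate nearest-neighbor structure (a kd-tree or similar) over millions of samples.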

How does this ideal doubler work differently than any fancy filtering algorithm? First a little review. Normal image filters work by creating a linear combination of the local neighborhood. Any rotationally-symmetric filter can be decomposed into the product of two 1d filters, which is what everyone does (actually that's not true, needs investigation), so you can work first on the rows then on the columns. It's an inherent problem of sampling that you cannot create a filter which is ideal in terms of "ringing" and "blur" - you either have to tolerate one or the other. I'm not really going to get into this right now but it's an interesting topic with lots of papers on it.
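For reference, separable filtering (rows first, then columns) looks like this in a minimal sketch; handling edges by clamping is an arbitrary choice here:

```python
def filter_1d(row, taps):
    # Convolve a 1d signal with a filter, clamping at the edges.
    n, r = len(row), len(taps) // 2
    return [sum(taps[j] * row[min(max(i + j - r, 0), n - 1)]
                for j in range(len(taps)))
            for i in range(n)]

def filter_separable(img, taps):
    # Apply a separable 2d filter: filter the rows, then the columns.
    rows = [filter_1d(row, taps) for row in img]
    cols = list(map(list, zip(*rows)))           # transpose
    cols = [filter_1d(col, taps) for col in cols]
    return list(map(list, zip(*cols)))           # transpose back
```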

So the first thing that the trained doubler can do is make a per-pixel decision about what filter to use - a more blurry filter or a more ringy filter. (of course it isn't explicitly making this decision, it's just finding neighborhoods where the training set L/2 -> L either looks like a blurry upsample or a ringy upsample).

But the trained doubler can do more than that. It can see patterns in the downsampled image which typically come from hard curves of various types in the higher res image, such as hard diagonal stair-steps, or even rasterized circular curves.

BTW the old-fashioned style image doublers work by explicitly looking for these types of patterns. They basically work by assuming that all neighborhoods of an image are either smooth (locally polynomial) or a hard edge of various types (horizontal, vertical, diagonal 45, diagonal 135). They best-fit the neighborhood to one of those groups and then use a different anisotropic linear filter for each case. This is how the old DPCM / CALIC image compressors work and how Wm Withers' "Augural Image Zooming" worked. These semi-heuristic approaches actually work very well.
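A toy sketch of the classify step in those classify-then-filter doublers: crude directional gradients pick one of the classes, and each class would then get its own anisotropic linear filter. The gradient measures and thresholds here are made up for illustration, not taken from CALIC or Augural Zooming:

```python
def classify_block(n):
    # Crudely bucket a square neighborhood into smooth / edge classes
    # by comparing summed directional differences.
    size = len(n)
    gx = sum(abs(n[y][x + 1] - n[y][x])
             for y in range(size) for x in range(size - 1))
    gy = sum(abs(n[y + 1][x] - n[y][x])
             for y in range(size - 1) for x in range(size))
    if gx + gy < 1.0:                # nearly flat -> locally smooth
        return "smooth"
    if gx > 2 * gy:                  # change mostly along x -> vertical edge
        return "vertical_edge"
    if gy > 2 * gx:
        return "horizontal_edge"
    # Otherwise diagonal: the smaller along-diagonal difference wins.
    d45  = sum(abs(n[y + 1][x + 1] - n[y][x])
               for y in range(size - 1) for x in range(size - 1))
    d135 = sum(abs(n[y + 1][x] - n[y][x + 1])
               for y in range(size - 1) for x in range(size - 1))
    return "diagonal_45" if d45 < d135 else "diagonal_135"
```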

Another way to think about it is this : for any image that we are given I, imagine we downsample it to I/2 with a filter, and then we double it through the trained doubler, J = 2*(I/2). The result J should be as close to the original I as possible. This is how we can formulate the problem as a standard machine learning optimization.
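The objective can be sketched directly. The 2x2 box downsample and mean-squared error are my assumptions; the post just says "a filter" and "as close as possible":

```python
def downsample_2x(img):
    # Box-downsample by 2: average each 2x2 block.
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1] +
              img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(len(img[0]) // 2)]
            for y in range(len(img) // 2)]

def doubler_loss(img, double):
    # Mean squared error between the original I and J = double(I/2);
    # this is the quantity a trained doubler is optimized to minimize.
    j = double(downsample_2x(img))
    return sum((a - b) ** 2
               for ra, rb in zip(img, j)
               for a, b in zip(ra, rb)) / (len(img) * len(img[0]))
```

Summing this loss over the whole library {L} gives the training objective.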

Addendum : Well, back when I worked on this I did a literature search and didn't find much; Won has pointed me at a paper by William Freeman, "Example-based super-resolution", which is almost exactly this approach.
