05-15-09 - Image Compression Rambling Part 3 - HDR

Next part : HDR (High Dynamic Range) coding. One of the things I'd like to do well for the future is store HDR data well, since it's more and more important to games.

The basic issue is that the range of values you need to store is massive. This means 8-bit is no good, but really it means any linear encoding is no good. The reason is that the information content of values is relative to their local average (or something like that).

For example, imagine you have an image of the lighting in a room during the day with the windows open. The areas directly hit by the sun are blazing bright, 10^6 nits or whatever. The areas that are in shadow are very dark. However, if you turn your back to the sun and look into the shadows, your eyes adjust and you can still see a lot of detail there. If you encoded linearly you would have thrown all of that away.

There are a few possibilities for storing this. JPEG-XR (HD Photo) seems to mainly just advocate using floating point pixels, such as F16-RGB (48 bits per pixel). One nice thing about this "half" 48 bit format is that graphics cards now support it directly, so it can be used in textures with no conversion. They have another 32 bit mode RGBE which uses an 8-bit shared exponent and 3 8-bit mantissas. RGBE is not very awesome; it gives you more dynamic range than you need, and not enough precision.
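For concreteness, here's a sketch of the classic Radiance-style RGBE packing (I'm assuming JPEG-XR's RGBE mode works the same way as Radiance's): three 8-bit mantissas plus one shared 8-bit exponent, biased by 128.

```python
import math

def rgbe_encode(r, g, b):
    # shared exponent comes from the largest channel; the other two
    # channels lose precision because they're forced onto its scale
    m = max(r, g, b)
    if m < 1e-32:
        return (0, 0, 0, 0)
    frac, exp = math.frexp(m)          # m = frac * 2^exp, frac in [0.5, 1)
    scale = frac * 256.0 / m
    return (int(r * scale), int(g * scale), int(b * scale), exp + 128)

def rgbe_decode(rb, gb, bb, eb):
    if eb == 0:
        return (0.0, 0.0, 0.0)
    scale = math.ldexp(1.0, eb - 128 - 8)
    return (rb * scale, gb * scale, bb * scale)
```

Note that each channel gets at most 8 bits of mantissa (and fewer for channels much dimmer than the max), which is exactly the "not enough precision" complaint.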

I haven't been able to figure out exactly how they wind up dealing with the floating point in the coder though. You don't want to encode the exponent and mantissa separately, because the mantissa jumps around weirdly if you aren't aware of the exponent. At some point you need to quantize and make integer output levels.

One option would be to do a block DCT on floating point values, send the DC as a real floating point value, and then send the AC in fixed-size steps, where the step size is set from the magnitude of the DC. I think this is actually an okay approach, except for the fact that the block artifacts around something like the sun would be horrific in absolute value (though not worse after tone mapping).
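A minimal sketch of that quantizer (the function names and the 2% relative step are made up):

```python
def quantize_ac(dc, ac, rel_step=0.02, min_step=1e-6):
    # step size is proportional to the block's DC magnitude, so bright
    # blocks get coarse steps and dark blocks get fine ones
    step = max(abs(dc) * rel_step, min_step)
    levels = [round(a / step) for a in ac]
    return levels, step

def dequantize_ac(levels, step):
    return [l * step for l in levels]
```

A block with DC = 100 gets AC steps of 2.0; a block with DC = 0.1 gets steps of 0.002, preserving shadow detail.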

Some kind of log-magnitude seems natural because relative magnitude is what really matters. That is, in areas of the image near 0, you want very small steps, but right near the sun where the intensity is 10^8 or whatever you only need steps of 10^5.

This leads to an interesting old idea : relative quantization. This is most obvious in a DPCM type coder. For each pixel you are encoding, you predict the pixel from the previous one, subtract the prediction, and send the error. To get a lossy coder, you quantize the error before encoding it. Rather than quantize with a fixed step size, you quantize relative to the magnitude of the local neighborhood (or the prediction). You create a local scale S and quantize in steps of {S,2S,3S,...} - but then you also always include some very large steps. So for example in a neighborhood that's near black you might use steps like {1,2,3,4,5...16,20,...256,512,1024,...10^5,10^6,10^7}. You allow a huge dynamic range step away from black, but you don't give many values to those huge steps. Once you get up into a very high magnitude, the low steps would be relative to that magnitude.
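A toy version of that DPCM coder (the 5% relative step and the floor of 1.0 are invented knobs, and a real coder would use an explicit step table with the escape values described above rather than a pure multiplicative step):

```python
def dpcm_relative(pixels, rel=0.05, min_step=1.0):
    # predict each pixel from the previous reconstruction, and quantize
    # the prediction error with a step that scales with the prediction
    levels, recon = [], []
    pred = 0.0
    for p in pixels:
        step = max(abs(pred) * rel, min_step)
        l = round((p - pred) / step)
        levels.append(l)
        pred = pred + l * step   # mirror the decoder's reconstruction
        recon.append(pred)
    return levels, recon
```

On a row like {1,3,2,1,200,207} the dark pixels come through exactly, while the 207 lands on 210 - a 3-unit error that's invisible next to a 200.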

This takes advantage of two things : 1. we see variation relative to the local average, and 2. in areas of very large very high frequency variation, we have extremely little sensitivity to exactly what the magnitude of that step is. Basically if you're looking at black and you suddenly look at the sun you can only tell "that's fucking bright" , not how bright. In fact you could be off by an order of magnitude or more.

Note that this is all very similar to gamma-correction and is sort of redundant with it. I don't want to get myself tangled up in doing gamma and de-gamma and color space corrections and so on. I just want to be a pixel bucket that people can jam whatever they want in. Variable step sizing like this has been around as an idea in image compression for a very long time. It's a good basic idea - even for 8 bit low dynamic range standard images it is interesting, for example if you have a bunch of pixels like {1,3,2,1,200,1,3,2} - the exact value of that very big one is not very important at all. The bunch of small ones you want to send precisely, but the big one you could tolerate 10 steps of error.

While that's true and seems very valuable, it's very hard to use in practice, which is why no one is doing it. The problem is that you would need to account for non-local effects to actually get away with it. For example - changing a 200 to a 190 in that example above might be okay. But what if that was a row of pixels, and the 200 is actually part of a vertical edge with lots of pixel values at 200? In that case, you would be able to tell if one 200 suddenly jumped to a 190. Similarly, if you have a bunch of values like {1,3,2,1,200,200,200,1,3,2} - again the eye is very bad at seeing that the 200 is actually a 200 - you could replace all three of those with a 190 and it would be fine. However if your row was like this : {1,3,2,1,200,200,200,1,3,2,1,200,200,2} - and you changed some of the 200's but not the others - then that would be very visible. The eye can't see absolute intensity very well at all, but it can see relative intensity and repetitions of a value, and it can see edges and shapes. So you have to do some big nonlocal analysis to know how much error you can really tolerate in any given spot.

Greg Ward has an interesting approach for backward compatible JPEG HDR. Basically he makes a tone-mapped version of the image, stores that in normal RGB, divides the full range original image by it, and stores the ratio in a separate gray scale channel. This is a good compromise, but it's not how you would actually want to store HDR if you had a custom pixel format. (It's bad because it presumes some tone mapping, and the ratio image is very redundant with the luma of the RGB image.)
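The split looks roughly like this on luminance alone (I'm using a Reinhard-style y/(1+y) curve as a stand-in for whatever tone map Ward actually uses):

```python
def split_jpeg_hdr(luma):
    # tone-mapped base layer, fits in [0,1) - this would go in the
    # ordinary backward-compatible JPEG
    base = [y / (1.0 + y) for y in luma]
    # ratio image: original / tone-mapped, stored as an extra channel
    ratio = [y / b if b > 0.0 else 1.0 for y, b in zip(luma, base)]
    return base, ratio

def rejoin_jpeg_hdr(base, ratio):
    return [b * r for b, r in zip(base, ratio)]
```

Notice that with this tone curve the ratio is just 1+y, a deterministic function of the luma - which is exactly the redundancy complained about above.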

I'm leaning towards Log-Y-UV of some kind. Probably with only Log(Y) and UV linear or something like that. Greg Ward has a summary of a bunch of different HDR formats. Apparently game devs are actually using his LogLuv : see Christer or Deano or Matt Pettineo & Marco.
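The log-luminance half of Ward's 32-bit LogLuv, written from memory of the LogLuv TIFF encoding (treat the constants as a sketch): 16 bits covering roughly 38 orders of magnitude at about 0.3% relative precision.

```python
import math

def logy_encode(Y):
    # 16-bit log-luminance: Le = 256 * (log2(Y) + 64), clamped;
    # covers Y from about 2^-64 up to 2^64
    Le = int(256.0 * (math.log2(Y) + 64.0))
    return max(0, min(65535, Le))

def logy_decode(Le):
    # decode to the center of the quantization bucket
    return 2.0 ** ((Le + 0.5) / 256.0 - 64.0)
```

Note that Y = 0 blows up the log, which is the near-zero problem below.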

One nasty thing about LogLuv is that there are some problems near zero, and everybody is doing slightly different hacks to deal with that. Yay. It also doesn't bilerp nicely.

ADDENDUM : I should clarify cuz this got really rambly. LogLuv is an *integer* encoding. As a pixel bucket I can of course handle 3 integers of various scales and signs, no problemo. Also, "floating points" that are in linear importance scale are not really a difficult issue either.

The tricky issue comes from "real" floating points. That is, real HDR image data that you don't want to integerize in some kind of LogLuv type function for whatever reason. In that case, the actual floating point representation is pretty good. Storing N bits of mantissa gives you N bits of information relative to the overall scale of the thing.

The problems I have with literally storing floating point exponent and mantissa are :

1. Redundancy between M & E = worse compression. Could be fixed by using one as the context for the other.

2. Extra coding ops. Sending M & E instead of just a value is twice as slow, twice as many arithmetic code ops, whatever.

3. The mantissa does these weird step things. If you just compress M on its own as a plane, it does these huge steps when the exponent changes. Like for the values 1.8,1.9,1.99,2.01,2.10 - M goes from 0.99 to 0.01. Then you do a wavelet or whatever and lossy transform it and it smooths out that zigzag all wrong. Very bad.
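You can see the discontinuity by pulling the raw IEEE-754 fields apart:

```python
import struct

def float_fields(x):
    # unpack an IEEE-754 single into (biased exponent, mantissa fraction)
    u = struct.unpack('<I', struct.pack('<f', x))[0]
    return (u >> 23) & 0xFF, (u & 0x7FFFFF) / float(1 << 23)

# the signal moves smoothly, but the mantissa plane crashes from
# ~0.99 down to ~0.005 the instant the exponent ticks over at 2.0
for x in [1.8, 1.9, 1.99, 2.01, 2.10]:
    e, m = float_fields(x)
    print(x, e, round(m, 3))
```

Any smooth lossy transform run over the mantissa plane alone will smear that cliff into garbage.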

So clearly just sending M & E independently is terrible.
