# cbloom rants

## 7/26/2011

### 07-26-11 - Pixel int-to-float options

There are a few different reasonable ways to turn pixel ints into floats. Let's have a glance.

Pixels arrive as ints in [0,255]. When you put your ints in floats there is then a range of floats which corresponds to each int value. The total float range shown is the range of values that will map back to [0,255]. In practice you usually clamp, so in fact further out values will also map to 0 or 255.

I'll try to use the scientific notation for ranges, where [ means "inclusive" and ( means "not including the end value". With floats rounding of 0.5's I will always use ( because the rounding behavior for floats is undefined and varies.

On typical images, exact preservation of black (int 0) and white (int 255) is more important than any other value.

```
int-to-float :  f = i;

float-to-int :  i = round( f ) = floor( f + 0.5 );

float range is (-0.5,255.5)
black : 0.0
white : 255.0

```
commentary : quantization buckets are centered on each integer value. Black can drift into negatives, which may or may not be an annoyance.

```
int-to-float :  f = i + 0.5;

float-to-int :  i = floor( f );

float range is [0.0,256.0)
black : 0.5
white : 255.5

```
commentary : quantization buckets span from one integer to the next. There's some "headroom" below black and above white in the [0,256) range. That's not actually a bad thing, and one interesting option here is to actually use a non-linear int-to-float. If i is 0, return f = 0, and if i is 255 return f = 256.0 ; that way the full black and full white are pushed slightly away from all other pixel values.

```
int-to-float :  f = i * (256/255.0);

float-to-int :  i = round( f * (255/256.0) );

float range is (-0.50196,256.50196)

black : 0.0
white : 256.0

```
commentary : scaling white to be 256 is an advantage if you will be doing things like dividing by 32, because it stays an exact power of 2. Of course instead of 256 you could use 1.0 or any other power of two (floats don't care), the important thing is just that white is a pure power of two.

other ?

ADDENDUM : oh yeah; one issue I rarely see discussed is maximum-likelihood filling of the missing bits of the float.

That is, you treat it as some kind of hidden/bayesian process. You imagine there is a mother image "M" which is floats. You are given an integer image "I" which is a simple quantization from M ; I = Q(M). Q is destructive of course. You wish to find the float image F which is the most likely mother image under the probability distribution given I is known, and a prior model of what images are likely.

For example if you have ints like [ 2, 2, 3, 3 ] that most likely came from floats like [ 1.9, 2.3, 2.7, 3.1 ] or something like that.

If you think of the float as a fixed point and you only are given the top bits (the int part), you don't have zero information about what the bottom bits are. You know something about what they probably were, based on the neighbors in the image, and other images in general.

One cheezy way to do this would be to run something like a bilateral filter (which is all the rage in games these days (all the "hacky AA" methods are basically bilateral filters)) and clamp the result to the quantization constraint. BTW this is the exact same problem as optimal JPEG decompression which I have discussed before (and still need to finish).

This may seem like obscure academics to you, but imagine this : what if you took a very dark photograph into photoshop and multiplied up the brightness X100 ? Would you like to see pixels that step by 100 and look like shit, or would you like to see a maximum likelihood reconstruction of the dark area? (And this precision matters even in operations where it's not so obvious, because sequences of different filters and transforms can cause the integer step between pixels to magnify)

ryg said...

FWIW, D3D10+ graphics HW uses

int-to-float: f = i / 255.0f;
float-to-int: i = round(f * 255.0f + 0.5f);

It's usually not computed correctly to 0.5 ULP, but it's close, and required to be exact at the end points, and implementations are also required to guarantee that ftoi(itof(x)) == x for obvious reasons. This is really just an exponent-shifted version of your option 3.

In large(-ish) code bases it's best to pick one convention and stick with it, even if it may not be optimal for everything, just to avoid confusion (and the friction that arises from it). If the data ends up in textures, that's how the GPU will unpack it; might as well use this convention on the SW side. Has the added advantage that you can then freely move computations (or even share code) between shaders and C++ code.

cbloom said...

I'm still a little torn about whether 1.f is nice for white or not (it's sort of cool to be able to write code where "255" means the same thing whether you are on int pixels or float pixels).

But yeah for games you're probably right.

When you're writing image code though (such as compressors) you can get a benefit from using one or the other of these, largely depending on what the rest of your pipeline does. (in particular your quantizer).