# cbloom rants

## 5/26/2009

### 05-26-09 - Storing Floating Points

I rambled a bit before about how to store floating point images . I think I have the awesome solution.

First of all there are two issues :

Goal 1. Putting the floating point data into a format that I can easily run through filters, deltas from previous, etc. Even if you're doing lossless storage your want to be able to delta from previous and have the deltas make sense and preserve the original values.

Goal 2. Quantizing in such a way that the bits are used where the precision is wanted. Obviously you want to be sending bits only for values that are actually possible in the floating point number.

Also to be clear before I start, the issue here is with data that's "true floating point". That is, it's not just data that happens to be in floating point but could be equally well converted to ints inside some [min,max] range. For example, the coordinates of most geometry in video games really isn't meant to be floating point, the extra precision near zero is not actually wanted. The classic example where you actually want floating point is for HDR images where you want a lot of range, and you actually want less absolutely precision for higher values. That is, the difference between 999991 and 999992 is not as important as the difference between 1 and 2.

Now, we are going to be some kind of lossy storage. The loss might be very small, but it's kind of silly to talk about storing floating points without talking about lossy storage, because you don't really intend to have 23 bits of mantissa or whatever. To be lossy, we want to just do a simple linear quantization, which means we want to transform into a space where the values have equal significance.

Using some kind of log-scale is the obvious approach. Taking the log transforms the value such that even size steps in log-space are even multipliers in original space. That's good, it's the kind of scaling we want. It means the step from 1 to 2 is the same as the step from 100 to 200. The only problem is that it doesn't match the original floating point representation really, it's more continuous than we need.

What we want is a transform that gives us even size steps within one exponent, and then when the exponent goes up one, the step size doubles. That makes each step of equal importance. So, the quantizer for each exponent should just be Q ~= 2^E.

But that's easy ! The mantissa of the floating point that we already have is already quantized like that. We can get exactly what we want by just pretending our floating point is fixed point !

```
value = (1 + M)* 2^E

(mantissa with implicit one bit at head, shifted by exponent)

fixed point = {E.M}

```
That is, just take the exponent's int value and stick it in the significant bits above the mantissa. The mantissa is already quantized for you with the right variable step size. Now you can further quantize to create more loss by right shifting (aka only keeping N bits of the mantissa) or by dividing by any number.

This representation also meets Goal 1 - it's now in a form that we can play with. Note that it's not the same as just taking the bytes of a floating point in memory - we actually have an integer now that we can do math with and it makes sense.

```
when you cross an exponent :

1.99 = {0.1111}
2.01 = {1.0001}

```
So you can just subtract those and you get a sensible answer. The steps in the exponent are the correct place value to compensate for the mantissa jumping down. It means we can do things like wavelets and linear predictors in this space.

Now there is a bit of a funny issue with negative numbers vs. negative exponents. First the general solution and what's wrong with it :

Negatives Solution 1 : allow both negative numbers and negative exponents. This creates a "mirrored across zero" precision spectrum. What you do is add E_max to E to make it always positive (just like the IEEE floating point), so the actual zero exponent is biased up. The spectrum of values now looks like :

```
(positives)      real step size
3.M                     /
2.M                    /
1.M                   /
0.M                  /
-1.M                 /
-2.M                /
-3.M               /
(zero)             +
-3.M               \
-2.M                \
-1.M                 \
0.M                  \
1.M                   \
2.M                    \
3.M                     \
(negatives)

```
What we have is a huge band of values with very small exponents on each side of the zero. Now, if this is actually what you want, then fine, but I contend that it pretty much never is. The issue is, if you actually have positives and negatives, eg. you have values that cross zero, you didn't really intend to put half of your range between -1 and 1. In particular, the difference between 1 and 2^10 is the same as the difference between 1 and 2^-10. That just intuitively is obviously wrong. If you had a sequence of float values like :

{ 2.0 , 1.8 , 1.4, 1.0, 0.6, 0.2, -0.1, -0.3, -0.6, -1.0, -1.4 }

That looks nice and smooth and highly compressible right? NO ! Hidden in the middle there is a *MASSIVE* step from 0.2 to -0.1 ; that seems benign but it's actually a step past almost your entire floating point range. (BTW you might be thinking - just add a big value to get your original floating poin tdata away from zero. Well, if I did that it would shift where the precision was and throw away lots of bits; if it's okay to add a big value to get away from zero, then our data wasn't "true" floating point data to begin with and you should've just done a [min,max] scale instead).

So I contend that is almost always wrong.

Negatives Solution 2 : allow negative numbers and NOT negative exponents. I content that you almost never actually want negative exponents. If you do want precision below 1.0, you almost always just want some fixed amount of it - not more and more as you get smaller. That can be represented better just with the bits of the mantissa, or by scaling up your values by some fixed amount before transforming.

To forbid negative exponents we make everything in [0,1.0] into a kind of "denormal". We just give it a linear scale. That is, we make a slightly modified representation :

```
Stored val = {E.M} (fixed point)

Float val =

if E >= 1 : (1 + M) *2^(E-1)
makes numbers >= 1.0

if E == 0 : M
makes numbers in [0.0,1.0)

(M is always < 1, that is we pretend it has a decimal point all the way to the left)

Now the low values are like :

0    : 0.0000
0.5  : 0.1000
0.99 : 0.1111
1.01 : 1.0001

```
Of course we can do negative values with this by just putting the sign of the floating point value onto our fixed point value, and crossing zero is perfectly fine.

won3d said...

The observation that floats form a piecewise linear approximation of lg is a neat one, and is how Blinn and the old school guys used to do wacky approximations of various transcendentals. Always a crowd pleaser.

But regarding negative exponents, exponents are stored in IEE754 as biased numbers, not 2s-complement. So, isn't your "gap at 0" a question of degrees? Like, correctable via a power-of-2 multiplicative bias (aka scalb). Then, wouldn't your "denorms" just be actual denorms?

cbloom said...

Yeah if I understand you correctly I believe you are right.

Basically with a normal IEEE 32 bit float you have 127*2^23 values between 1.0 and 0. You can reduce that gap by multiplying your number by a large negative power of two.

So, like if you want almost no gap, you multiply your number by 2^-127 first, then I can just grab the E & M straight out of the floating pointer number and it will do the denorm for me.

ryg said...

That should work, but that limits your code to platforms and environments that are really IEEE compliant; e.g. SPUs don't have denormals IIRC and on PCs, you can switch SSE computations to "flush to zero" mode where denormals are automatically truncated to zero instead of going through a painfully slow microcoded path. Other CPUs have a similar mode. You'd have to be careful to make sure that you're in IEEE compliant mode before using real denormals, and even so, brace yourself for a serious performance impact - denormals are very rare in normal computations, so the hardware implementations are seriously slow, as in "microcoded with no concurrent completion of other FP ops" slow. If there's a significant percentage of values in that range (and after all that's the whole point), you're most likely better off implementing this manually. That's something I'd check, though, but at least a few years ago, the penalty for having lots of denormals in your calculation were severe - I've seen slowdowns of factor 200 and more. (DSP apps like IIR filters and comb filters are notorious for this: if your source signal goes to zero and stays there for a few seconds, the time spent in these filters will suddenly explode as the state variables all go denormal around the same time, unless you manually clamp them to zero).

won3d said...

Yeah, it is probably no good to depend on IEE754 compliance or the lack thereof. Then again, I think something like "flush to zero" is pretty common on the platforms you likely care about. Maybe you can just use that as part of the quantizer.

cbloom said...

I just discovered this is basically the Lindstrom "fast efficient floating point" method.

cbloom said...

see:

http://www.cc.gatech.edu/~lindstro/

http://www.cs.unc.edu/~isenburg/

in particular :

http://www.cs.unc.edu/~isenburg/lcpfpv/

Tom Forsyth said...

The graphics formats like float16, float11/10, and the 360-specific one that is 7e3 all use tricks like this to focus precision where it's needed - though there's some argument as to whether 7e3 did a good job or if 6e4 would have been a better choice.

As well as changing bit sizes, many of these have biases on the exponents to focus the precision away from 0-1, and all require full-speed support of denormals.