cbloom rants: Arithmetic coder division-free maps

We have now narrowed in on the key issue that forces us to use division in arithmetic coders : the inverse map from arithmetic domain to CDF domain. Let's now start to look at division free options for that map.

Often henceforth I will talk about the map after normalizing the ratio to [1,2) by shifting CDFs. That is :


while cdf_tot*2 < range
    cdf's *= 2

same as :

left shift cdf's to put the top bit of CDF aligned with the top bit of range

Recall that in a typical implementation "range" has its top bit go from position 24-31. cdf_tot = (1 << cdf_shift) we have chosen to be a power of 2, and cdf_shift is typically in the 11-14 range.

We now think in terms of a fixed point fraction. cdf_tot after normalizing is "one" , and "range" is in [1,2) , that is :


one   = 1.00000000000000000

range = 1.01001010101100000

essentially all we're getting at here is that in the scaling from CDF domain to "arithmetic domain", the binary powers of 2 to get us as close to range as possible are the easy part. We just shift the CDF's up. The fractional part of range in [1,2) is the hard part that makes us need a multiply/divide.

And that leads us directly to our first division free map :


identity map : aka "just shift" map :

r_bits = bsr(range) = position of top bit of range
shift = r_bits - cdf_shift

forward(cdf) = cdf << shift

inverse(code) = code >> shift

If we look at how this map performs for range in [1,2) , when range is near 1 ("one" in fixed point, that is, range nearly 1 << r_bits), then this map is exactly right. As range grows towards 2, this map gets worse and worse. Near 2, this map is costing us exactly 1 bit of coding loss per symbol (because we're only using half of the range that we could).

(aside : in the ancient days, Rissanen actually used the "identity map")

The problem with this map is that it is failing to scale up cdf to use all of range when range is over 1. Obviously we can just check for that :


"one or 1.5 map" :

If range is in [1,1.5) , scale by 1X
if range is in [1.5,2) , scale by 1.5X 

forward(cdf) :

cdf_norm = cdf << shift;
ret = cdf_norm;

if ( range (fixed point) >= 1.5 ) (test 0.1 bit)
  ret += cdf_norm >> 1;

inverse(code) :

if ( range (fixed point) >= 1.5 )
  ret = (code * 2)/3;
else
  ret = code;

so we are now testing 1 fractional bit of range below the leading bit. This divides the "range" interval into two scaling zones. We could obviously test further fractional bits of range to divide the range into scaling zones like :


test 2 fractional bits of range (below top 1) :  [ 1X  ,  1.25X , 1.5X , 1.75X ]

but that's just the same as doing the normal range coder scaling, but only using the top 3 bits of range (top one + 2 fractional bits). And that is the same as the recip_arith forward map.

Note that in the "one or 1.5 map" it may look like we are still doing a divide to get from arithmetic domain back to cdf domain. But it's divide by the constant 3, which is implemented by the compiler (or us) as a reciprocal multiply. As we use more (fractional) bits of range, we need various other divides by constants (1/5,1/7,etc.) which are just reciprocal multiplies. Rather than branching, we can just use a table and take the top bits of range to look them up. This leads us directly to recip_arith, which we will flush out in the next post.

In these maps we are always using less than the full range. eg. say we do the "2 fractional bits of range" , if range is in the [1.25,1.5) fixed point interval, we will use 1.25X to scale from CDF to arithmetic domain. That is the correct scaling when range is the bottom of [1.25,1.5) but does not use all of range when range is larger. The reason is we can't do something like use a scaling factor that's the middle of the bucket (1.375), since when range is low that would give us forward(cdf) > range , which is uncodeable. Therefore we must also use the round-down or truncation of range to fewer top bits. This approach of using some number of fractional bits of range means that the map never makes use of all of range; as you add more bits, you can use more and more of range, but you are slowly approaching the limit from below.

There's an alternative approach due to Stuiver & Moffat ("Piecewise Integer Mapping for Arithmetic Coding"), commonly called SM98.

The SM98 map says : consider range and CDF normalized, so range is in [1,2) fixed point. If we just scale CDF by 1X everywhere ("identity map" above) we are not using all of range. We can't *ever* scale CDF by 2X uniformly, because range is strictly < 2 , that would make forward(cdf_tot) exceed range. What we can do is scale CDF by 2X for a portion of the interval, and 1X for the remainder, so that we use all of range :


Choose some CDF threshold t such that when we make the map :

cdf < t -> scale by 1X
cdf >= t -> scale by 2X

then we use the whole range in our map, eg. forward(cdf_tot) = range

The map is :

forward(cdf) :
if ( cdf < t ) ret = cdf
else ret = t + (cdf - t)*2 = cdf*2 - t

t = cdf_tot*2 - range

note that cdf_tot is normalized to be equal to range's top bit here,
so 't' is the same as "2 - range" in fixed point
that's the same as the ~ bit inverse of range's fractional bits

forward(cdf) :
if ( cdf < t ) ret = cdf;
else ret = range - (cdf_tot - cdf)*2;

forward(cdf) :
ret = MAX( cdf , range - (cdf_tot - cdf)*2 )

This final form with the branchless MAX is nicer for implementation, but it's also an alternate way to see the map. What we're doing is a 1X map for the early cdf's, starting at the left side of the arithmetic range. If we stuck with that map the whole way, it would not reach the end of range (and thus waste coding quality). We're simultaneously doing a 2X mapping of the late CDF's, starting at the *right* side of the arithmetic range. If we stuck with that map the whole way, it would overshoot zero and thus not be codeable. Where those two maps cross, we switch between them, thus using the more generous 2X mapping as much as possible.

(the inverse map is done similarly, just with >>1 instead of *2)

So, the SM98 mapping uses all of range, thus does not introduce coding loss due to failure to use all of range. It does, however, not assign intervals propertional to the probability.

When range is near 1, SM98 does the 1X map nearly everywhere, so its scaling is correct. When range is near 2, SM98 does the 2x map nearly everwhere, so again there is little coding loss. Intervals are proportional to the probability. The problem is when range is in the middle. Then some symbols will get a 1X scaling, and others will get a 2X scaling, distorting the probabilities, causing coding loss.

(aside: I did an experiment and confirmed that you can compensate for this somewhat with a skewed probability estimate. SM98 in this form gives too much code space to later symbols (you can of course choose to reverse that). To compensate you should increase the frequency more when you see early symbols than when you see later ones, so that the probability estimate is skewed in the opposite way of the coder. This does in fact cut the coding loss of SM98, roughly in half, from about 1.0% to 0.5%. Note this is totally evil and I'm not actually recommending this, just writing it down for the record.)

And I'll finish with a drawing :

Teaser : you can of course combine the ideas of "fractional bits of range" map and the SM98 map. When range is in the interval [1,1.5) you could use a 1X scaling for low CDF and 1.5X scaling for high CDF; in the [1.5,2) interval use an SM98 threshold to split the CDF's into intervals with 1.5X and 2X scaling. This was tried by Daala as the "reduced overhead" coder. We will come back to this later.

Oh, and another connection :

If you did the "fractional bits of range" scaling thing; first 1 bit giving you a 1.5X zone, then two bits adding 1.25X and 1.75X zones, etc. If you keep doing that all the way down, in a range coder framework you are trying to compute (range / cdf_tot). That means you need to look at (r_bits - cdf_bits). If you simply keep testing that number of bits and adding in the forward() map - you will wind up with the full range coder map.

That process is the Moffat-Neal-Witten DCC95 multiplication free coder. In that context you might want to choose r_bits = 14 or 15, bit renormalization. cdf_bits = 11 or so. The difference (r_bits - cdf_bits) is your coding precision (larger = less coding loss), and it's also the number of times you have to test bits of r and possibly do (ret += cdf>>n) in the forward map.

ADD :

I brought up thinking of range normalized to [1,2) as a conceptual aid, but that can also be a good implementation choice, particularly for coders like SM98 where you spend most of your time doing clz to find the top bit position of range. Instead of letting range float, like in Michael Schindler range coder 24-31 bits, you instead keep range pegged at 31 bits. That lets you avoid the clz to find the top bit, at the cost of doing a clz after encoding to find out how much range has shrunk to normalize it back up.

Now you might think this requires bit renorm, but it does not. Instead you can still do byte renorm, and keep a count of the number of spare bits at the *bottom*. So you are still doing 24-31 bit renorm, but the space is at the bottom instead of the top.

This implementation style is not shown in recip_arith for clarity, but I figure I better mention everything so Google can't patent it.

cbloom rants

10/21/2018

Arithmetic coder division-free maps

No comments:

old rants