10/18/2018

A Multi-Symbol Division Free Arithmetic Coder with Low Coding Loss using Reciprocal Multiplication

A standard multi-symbol arithmetic coder necessarily requires a costly divide to map the arithmetic code target back to a cdf (even when you have chosen the cdf sum to be a power of 2).

Doing binary arithmetic coding without a divide is trivial. Previous methods for division free multi-symbol arithmetic coding have been at the cost of low coding precision, that is coding in more bits than -log2 of the symbol probabilities (for example see DCC95 and SM98).

code is here :

recip_arith on github

recip_arith is public domain.

Our method uses an approximate map from the cdf interval to the range interval, scaling by only a few top bits of range. This approximate map introduces coding loss, but that can be made quite small with just 8 bits from range.

The advantage of this approximate map is that it can be exactly inverted using a table of reciprocal multiplies :

x / y == (x * recip_table[y]) >> 32

recip_table[y] = ((1<<32) + y-1) / y
x/y in integers is the floor divide, and recip_table is the ceil reciprocal. (see Alverson, "Integer Division using reciprocals"). The choice of a 32 bit numerator is somewhat arbitrary, but we do want the reciprocals to fit in 32 bit words to minimize the size of recip_table in L1 cache.

Crucially because only the top 8 bits of "range" are used in the forward map, then "y" needed in the divide is only 8 bits, so the recip_table can be quite small. Furthermore, 8 bits + 24 bits of cdf in the numerator "x" can be inverted exactly.

Coding Efficiency

The coding loss of "recip_arith" with 8 bit tables is reliably under 0.005 bpb (around 0.1%) , in constrast to SM98 where the coding loss can be 10X higher (0.05 bpb or 1.0%)

Calgary Corpus file "news" , len=377,109

cdf_bits = 13  (symbol frequencies sum to 1<<13)

from smallest to largest output :

recip_arith coder (down/up 8-bit) :                                244,641 = 5.190 bpb
cacm87:                                                            244,642 = 5.190 bpb
range coder:                                                       244,645 = 5.190 bpb
recip_arith coder (with end of range map) :                        244,736 = 5.192 bpb
recip_arith coder:                                                 244,825 = 5.194 bpb
recip_arith coder (down/up 2-bit) (aka Daala "reduced overhead") : 245,488 = 5.208 bpb
SM98 :                                                             245,690 = 5.212 bpb
(the down/up and "end of range map" variants will be discussed later)

The crucial step of recip_arith is the way the arithmetic coding interval is shrunk to encode a symbol.

The standard range coder does :

    uint32_t r_norm = ac->range >> cdf_bits;

    ac->low += cdf_low * r_norm;
    ac->range = cdf_freq * r_norm;
while recip_arith does :
    uint32_t range = ac->range;

    int range_clz = clz32(range);
    uint32_t r_top = range >> (32 - range_clz - RECIP_ARITH_TABLE_BITS);
    uint32_t r_norm = r_top << ( 32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);
            
    ac->low += cdf_low * r_norm;
    ac->range = cdf_freq * r_norm;
where "r_top" is just the highest RECIP_ARITH_TABLE_BITS bits of range, and r_norm is those bits back in their original position, then shifted down by cdf_bits. In the end the way the interval is shrunk is exactly the same as the range coder, except that all bits below the top RECIP_ARITH_TABLE_BITS of "r_norm" are turned off.

Is it useful ?

It's unclear if there are really killer practical advantages to recip_arith at the moment. On many modern high end CPUs, 32 bit divides are pretty fast, so if you just take recip_arith and use it to replace a standard range coder you might not see much difference. On CPUs with slow divides there is a more significant advantage.

recip_arith provides an interesting continuum that connects very low precision coders like SM98 to the full precision range coder. The "up/down" variant with a very small reciprocal table (4 bits?) might be interesting for hardware implementations, where division is not desirable.

recip_arith works with 64-bit coding state (the standard range coder would require a 64-bit divide, which is often much slower than a 32-bit divide), which can provide advantages. recip_arith also works with the encoder & decoder not in lock step; eg. you could encode with 32-bit state and decode with 64-bit state, allowing you independence of the bitstream & implementations, which is very desirable; not all arithmetic coders have this property.

I think primarily it is theoretically interesting at the moment, and it remains to be seen if that turns into a compelling practical application.

What's Next

I'll be doing a couple of followup posts going into the details of how/why this works. We'll talk about the theory of arithmetic coders and how to think about them. Then we'll look at the up/down and "map end" variants.

Index of posts

recip_arith on github
1. cbloom rants About arithmetic coders and recip_arith in particular
2. cbloom rants Arithmetic coder division-free maps
3. cbloom rants Working through recip_arith
4. cbloom rants recip_arith without unused range

No comments:

old rants