10/22/2018

recip_arith without unused range

Finishing up, today we'll go through how to make recip_arith take advantage of the un-mapped portion of the coding range, and see some more connections to past work.

recip_arith uses only the top bits of "range" in the map, rounding down the cdf to range scaling factor when it converts to fixed point. Thus it always maps less than range.

The classic "range coder" does something similar. With the map :

uint32_t r_norm = ac->range >> cdf_bits;

forward(cdf) = cdf * r_norm;
the bottom "cdf_bits" of range are not used in this map, so a small portion of range is unused.

The standard way to make use of this extra range is to assign to the last symbol. Most of the older papers show this as the default. They'll do something like :

(range coder "map end")

uint32_t r_norm = ac->range >> cdf_bits;

forward(cdf) :
if ( cdf == cdf_tot ) return range;
else return cdf * r_norm;

that is, explicitly taking the last interval and making it go all the way to range. You may see this in other "range coder" implementations.

I do not do this in my range coder. The benefit is around 0.001 bpb (or less), and it does cost some speed. The benefit depends on whether you can give that extra range to a useful symbol. Ideally you would give it to the most probable symbol. What you are doing is reducing the code length of one symbol, you get the most benefit by reducing the code length of the symbol that occurs most often. The higher the probability of the MPS, the more benefit you will get, assuming you put the MPS at the end of the alphabet where it gets this range. (conversely if the end of the alphabet is a symbol that never/rarely occurs, you get no benefit).

I call this variant "map end". In recip_arith the tradeoff is a little more biased towards doing "map end", because the unmapped portion of range is larger.

(note that the "map end" excess range is less than the amount of range assigned per cdf; that is, it's equivalent to increasing that integer symbol frequency by some fractional amount, so it's not a huge distortion of the symbol probabilities)

But in recip_arith, just mapping that extra range to one symbol is more of a probability distortion than it is in the range coder, because the extra space is larger.

We can do something better. The scaling factor from cdf to range in recip_arith is the *floor* to fixed point. When range is near the top of that fractional truncation, you would much rather round up. But you can't round up because that would lead to using more than all of range, which is uncodeable :

int range_clz = clz32(range);
uint32_t r_top_down = range >> (32 - range_clz - RECIP_ARITH_TABLE_BITS);
uint32_t r_norm_down = r_top_down << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);

uint32_t r_top_up = r_top_down+1;
uint32_t r_norm_up = r_top_up << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);
r_norm_up == r_norm_down + (1 << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits));

r_norm_down * cdf_tot <= range
r_norm_up * cdf_tot > range

if you just use r_norm_down , you use too little of range

unmapped_range = range - r_norm_down * cdf_tot = range & ( (1 << (32 - range_clz - RECIP_ARITH_TABLE_BITS)) - 1 )

conversely if you used r_norm_up for scaling, you would use too much range
the amount is the ~ bit reverse of those bottom bits
So inspired by SM98 we can use a mix of round-down & round-up of the scaling factor, such that all of range is covered.

What we want is to use round_up as much as possible, because it gives lower codelens, and then for enough symbols to avoid over-using range, switch to round_down. Just like SM98 we can find a threshold such that we place the transmission so that all of range is used.

The nice way to think about it conceptually is that we construct two maps : we use r_norm_down as the scaling factor from the left side of range (starting at 0), and we use r_norm_up as the scaling factor from the right side, and we switch over whether they cross.

The implementation is simple :

r_norm_down & r_norm_up as above

uint32_t map_up_excess = (r_norm_up << cdf_bits) - range;

forward(cdf) :

uint32_t cdf_down = cdf * r_norm_down;
int64_t cdf_up = cdf * r_norm_up - (int64_t) map_up_excess; // signed

ret = MAX( cdf_down, cdf_up );
and this can be simplified further with some algebra which I won't show.

Perhaps a picture is clearer :

I call this map "recip_arith down/up". The decoder is straightforward except for one point :

The reciprocal for both r_top and (r_top+1) have to be looked up. range is conceptually in [1,2) in fixed point, r_top can be exactly 1 but never 2. (r_top+1) can be exactly 2. The reciprocal table needs one more entry than (1< The coding loss of "recip_arith down/up" (at RECIP_ARITH_TABLE_BITS = 8) is extremely low. It's typically better than the "range coder" map, and comparable to the CACM87 map.

"recip_arith down/up" is even extremely good at RECIP_ARITH_TABLE_BITS = 4. That might be interesting for hardware implementations, because it means very tiny tables can be used, and the multipliers needed (for encoding) are also very low bit count.

Note that if you reduce RECIP_ARITH_TABLE_BITS all the way down to 1, then r_top_down is always just 1, and r_top_up is just 2. Then "recip_arith down/up" is choosing between a 1X and 2X mapping. This is exactly SM98 !

At RECIP_ARITH_TABLE_BITS = 2, then "recip_arith down/up" is choosing 1X or 1.5X or 2X. (the top bits are either 10 or 11 , we're getting only 1 fractional bit of range, the top bit is always on). This is exactly the same as the "reduced overhead" coder of Daala. (see OD_EC_REDUCED_OVERHEAD, in entcode.h)

OBJ2
range coder:                    246,814 ->   193,172 =  6.261 bpb =  1.278 to 1 
recip_arith (8 bits):           246,814 ->   193,282 =  6.265 bpb =  1.277 to 1 
recip_arith down/up (8 bits):   246,814 ->   193,171 =  6.261 bpb =  1.278 to 1 
recip_arith down/up (4 bits):   246,814 ->   193,240 =  6.264 bpb =  1.277 to 1 
recip_arith down/up (3 bits):   246,814 ->   193,436 =  6.270 bpb =  1.276 to 1 

PAPER3
range coder:                     46,526 ->    27,133 =  4.665 bpb =  1.715 to 1 
recip_arith (8 bits):            46,526 ->    27,156 =  4.669 bpb =  1.713 to 1 
recip_arith down/up (8 bits):    46,526 ->    27,133 =  4.665 bpb =  1.715 to 1 
recip_arith down/up (4 bits):    46,526 ->    27,127 =  4.664 bpb =  1.715 to 1 
recip_arith down/up (3 bits):    46,526 ->    27,155 =  4.669 bpb =  1.713 to 1 

PIC
H : 1.21464
range coder:                    513,216 ->    78,408 =  1.222 bpb =  6.545 to 1 
recip_arith (8 bits):           513,216 ->    78,651 =  1.226 bpb =  6.525 to 1 
recip_arith down/up (8 bits):   513,216 ->    78,408 =  1.222 bpb =  6.545 to 1 
recip_arith down/up (4 bits):   513,216 ->    78,395 =  1.222 bpb =  6.547 to 1 
recip_arith down/up (3 bits):   513,216 ->    78,806 =  1.228 bpb =  6.512 to 1 

PROGL
range coder:                     71,646 ->    42,723 =  4.770 bpb =  1.677 to 1 
recip_arith (8 bits):            71,646 ->    42,757 =  4.774 bpb =  1.676 to 1 
recip_arith down/up (8 bits):    71,646 ->    42,721 =  4.770 bpb =  1.677 to 1 
recip_arith down/up (4 bits):    71,646 ->    42,724 =  4.771 bpb =  1.677 to 1 
recip_arith down/up (3 bits):    71,646 ->    42,731 =  4.771 bpb =  1.677 to 1 

TRANS
range coder:                     93,695 ->    64,806 =  5.533 bpb =  1.446 to 1 
recip_arith (8 bits):            93,695 ->    64,851 =  5.537 bpb =  1.445 to 1 
recip_arith down/up (8 bits):    93,695 ->    64,806 =  5.533 bpb =  1.446 to 1 
recip_arith down/up (4 bits):    93,695 ->    64,820 =  5.535 bpb =  1.445 to 1 
recip_arith down/up (3 bits):    93,695 ->    64,884 =  5.540 bpb =  1.444 to 1 

recip_arith down/up (4 bits) has very low coding loss, it's very close to the standard range coder. (note that "range coder" here is not doing "map end" which would improve it slightly, particularly if end was assigned to the MPS, particularly so on "PIC" which is the only file where range coder noticeably misses the entropy)

Base recip_arith loss vs range coder is exactly 0.004 bpb in all cases here. This is with cdf_bits = 13 and a crappy frequency normalization heuristic, so there are some anomalies where eg. recip_arith down/up 4 bits gets more compression than 8 bits. recip_arith vanilla version at 4 bits gets very poor; coding loss is comparable to SM98, around 1.0%

recip_arith down/up is probably not practical for software implementation, but might be interesting for hardware. It's also interesting to me as a theoretical superset that connects recip_arith to SM98.

No comments:

old rants