10/18/2018

About arithmetic coders and recip_arith in particular

An arithmetic coder encodes a sequence of symbols as an infinite precision number. You divide the range of that number (I will call this the "arithmetic domain") into successively smaller intervals; each interval corresponds to a symbol. After encoding a symbol, your current coding interval shrinks to the low/high range of that symbol, and you encode further symbols within that range. The minimum number of bits required to distinguish the desired sequence from another is the code length of the sequence.

If you make each symbol's interval proportional to the probability of that symbol, the arithmetic coder can produce a code length equal to the entropy. (we're still assuming infinite precision encoding). If your estimated probabilities do not match the true symbol probabilities (they never do) you have some loss due to modeling. The difference between the arithmetic coder's output length and the sum of -log2(P) for all model probabilities is the coding loss.

In order to do arithmetic coding in finite precision, we track the low/high interval of the arithmetic coder. As the top bits (or bytes) of low & high become equal, we stream them out. This is like "zooming in" on a portion of the infinite number line where the top & bottom of the interval are in the same binary power of two buckets.

We must then confront the "underflow problem". That is, sometimes (high - low) ("range") can get very small, but the top bits of high and low never match, eg. if they happen to straddle a binary power of 2. They can be something like


high = 1000000000aaaaa..
low  = 0111111111bbbbb..

There are several solutions to the underflow problem. For example CACM87 "bit plus follow" or the MACM / VirtQ approach which you can read about elsewhere, also the "just force a shrink" method (see links at end).

The method I use in "recip_arith" rather hides the underflow problem. Rather than checking for the top bits (bytes) of low & high being equal, we simply zoom in regardless. The renormalization step is :

    while ( ac->range < (1<<24) )
    {
        *(ac->ptr)++ = (uint8_t)(ac->low>>24);
        ac->low <<= 8;
        ac->range <<= 8;
    }
low and range are 32 bit here, when range is less than 2^24 the top byte of low & high is the same and can be shifted out, *except* in the underflow case. In the underflow case, we could have the top byte of low is = FF , and high is actually 100 with an implicit bit above the 32nd bit (eg. low + range exceeds 2^32). What we do is go ahead and output the FF, then if we later find that we made a mistake we correct it by propagating a carry into the already sent bits.

(note that you could do range += FF here for slightly less coding loss, but the difference is small; the actual "high" of our arithmetic interval is 1 above range, range can approach that infinitely closely from below but never touch it; the coding interval is [low,high) inclusive on the bottom & exclusive on the top. Coders that don't quite get this right have a lot of +1/-1 adjustments around low & high)

Renormalization means we can send an infinite length number while only working on a finite precision portion of that number down in the active range of bits at the bottom. Renormalization also means that "range" is kept large enough to be able to distinguish symbols with only integer subdivision of the range, which we shall now address. Renormalization in and of itself does not introduce any coding loss; it is perfectly accurate (though failing to add FF is coding loss, and schemes like the "force shrink" method or "just dont renormalize" method of fpaq0p do contribute to coding loss).

The other way we must adapt to finite precision is the division of the interval into ranges proportional to symbol probabilities. The infinite precision refinement would be (real numbers!) :

arithmetic_low += CDF_low * arithmetic_range / CDF_sum;

arithmetic_range *= CDF_freq / CDF_sum;

(real numbers, no floor divide)

(CDF = cumulative distribution function, aka cumulative probability, sum of previous symbol frequencies)
CDF_freq = CDF_high - CDF_low for the current symbol ; CDFs in [0,CDF_sum]
We don't want to do real numbers, so we just approximate them with integer math. But how exactly?

The crucial distinguishing aspect of an arithmetic coder is how you map the CDF domain to the arithmetic domain

The CDF domain is controlled by you; you have modeled probabilities somehow. The CDF domain always starts at 0 and goes to CDF_sum, which is under your control. In the decoder, you must search in the CDF domain to find what symbol is specified. Working in the CDF domain is easy. In contrast, the arithmetic interval is always changing; "low/range" is being shrunk by coding, and then zoomed in again by renormalization.

The forward map takes you from CDF domain to arithmetic domain. Adding on the arithmetic "low" is trivial and we will not include it in the map. The crucial thing is just scaling by (arithmetic_range / CDF_sum).

We can now write a very general arithmetic encoder :

arithmetic_low += forward(CDF_low,arithmetic_range,CDF_sum);

arithmetic_range = forward(CDF_high,arithmetic_range,CDF_sum) - forward(CDF_low,arithmetic_range,CDF_sum);
our "forward" map will be working on integers. Some properties forward must have :
forward(x) should be monotonic

forward(x+1) > forward(x) strictly  (so that range can never shrink to zero)

this may only be true for arithmetic_range > CDF_sum or some similar constraint

forward(0) >= 0
forward(CDF_sum) <= arithmetic_range

forward map of the CDF end points does not need to hit the end points of range, but it must be within them
(failure to use all of range does contribute to coding loss)
The coding loss of our approximation is caused by the difference in forward(high) - forward(low) and the ideal scaling (which should be proportional to range & symbol probability).

The integer forward map with lowest coding loss is the "CACM87 map" :


forward(cdf,range,cdf_sum) = ( cdf * range ) / cdf_sum;

this is now integers (eg. floor division)

CACM87 has
forward(cdf_sum) = range ; eg. it uses the full range

coding loss is just due to the floor division not exactly matching the real number divide. (note that you might be tempted to say, hey add (cdf_sum/2) to get a rounded integer division instead of floor; the exact form here is needed to be able to construct an inverse map with the right properties, which we will get to later).

Sketch of full arithmetic coding process

A quick overview of what the decoder has to do. Most of the decoder just replicates the same work as the encoder; it narrows the arithmetic interval in exactly the same way. Rather than streaming out bytes in renormalization, the decoder streams them in. The decoder sees the arithmetic code value that the encoder sent, to some precision ahead. It needs enough bits fetched to be able to resolve the correct symbol (to tell which CDF bin is selected).

In implementation, rather than track the low/high arithmetic interval and the arithmetic number within that interval, we instead just track (arithmetic - low), the offset inside the interval. I call this the "code" in the decoder.

The decoder needs an extra step that the encoder doesn't do : given the current "code" , figure out what symbol that specifies. To do that, we have to take the "code" (in the arithmetic interval domain), map it back to CDF domain, then scan the CDF intervals to find which symbol's bin it falls in.

To do so requires an "inverse" map (arithmetic domain -> CDF domain), the opposite of the "forward" map (CDF -> arithmetic) we just introduced.

A full general purpose (multi-symbol) arithmetic coder is :


(in integers now)

Encode :

look up CDF of the symbol you want to encode
map CDF interval to range inverval :

lo = forward(cdf_low,range,cdf_sum);
hi = forward(cdf_high,range,cdf_sum);

arithmetic_low += lo;
arithmetic_range = hi - lo;

propagate carry in "arithmetic_low" if necessary
renormalize if necessary

Decode :

take current arithmetic "code"
map it back to CDF domain :

target = inverse(arithmetic_code,range,cdf_sum);

find symbol from CDF target such that :
CDF_low <= target < CDF_high

rest proceeds like encoder:

lo = forward(cdf_low,range,cdf_sum);
hi = forward(cdf_high,range,cdf_sum);

arithmetic_code -= lo;
arithmetic_range = hi - lo;

renormalize if necessary

The encoder needs "forward" twice, the decoder needs "forward" twice plus "inverse" once.

Naive implementation of forward & inverse both need division, which would mean 2 and 3 divides for encode & decode, respectively.

The inverse map and when you don't need it

First of all, why do you need the inverse map, and when do you not need it?

One common case where you don't need the inverse map at all is binary arithmetic coding. In that case it is common to just do the forward map and resolve the symbol in arithmetic domain, rather than CDF domain.

That is :


binary decoder :

arithetmetic_code is known

map the threshold between bits 0 & 1 to arithmetic domain :

arihmetic_mid = forward(cdf_min,range,cdf_sum);

find bin in arithmetic domain :

symbol = arithetmetic_code >= arithmetic_mid;

lo/hi = { 0 , arithmetic_mid , range }

(in the binary case we also only need one forward map, not two, since one of the end points is always 0 or range).

Now, you can do the same technique for small alphabet multi-symbol, for 4 or 8 or 16 symbols (in SIMD vectors); rather than make a CDF target to look up the symbol, instead take all the symbol CDF's and scale them into arithmetic domain. In practice this means a bunch of calls to "forward" (ie. a bunch of multiplies) rather than one call to "inverse" (a divide).

But for full alphabet (ie 8 bit, 256 symbol), you don't want to scale all the CDF's. (exception: Fenwick Tree style descent). Typically you want a static (or semi-static, defsum) probability model, then you can do the CDF -> symbol lookup using just a table. In that case we can construct the symbol lookup in CDF domain, we need the map from arithmetic domain back to CDF domain.

The inverse map must have properties :


assume range >= cdf_sum
so the forward map is a stretch , inverse map is a contraction

should invert exactly at the CDF end points :

y = forward(x);
inverse(y) == x

the CDF buckets should map to the lower CDF :

lo = forward(x)
hi = forward(x+1)

(hi > lo , it can be much greater than 1 apart)

inverse( anything in [lo,hi) ) = x

in hand wavey terms, you need inverse to act like floor division. eg :

CDF domain [012]  -> arithmetic domain [00011122]

For example, for the CACM87 forward map we used above, the inverse is :
CACM87

forward(cdf,range,cdf_sum) = ( cdf * range ) / cdf_sum;

inverse(code,range,cdf_sum) = ( code * cdf_sum + cdf_sum-1 ) / range;

(integers, floor division)
In most general form, both forward & inverse map require division. The forward map is easy to make divide-free, but the inverse map not so.

Getting rid of division and cdf_sum power of 2

We'll now start getting rid of pesky division.

The first thing we can do, which we will adopt henceforth, is to choose cdf_sum to be a power of 2. We can choose our static model to normalize the cdf sum to a power of 2, or with adaptive modeling use a scheme that maintains power of 2 sums. (defsum, constant sum shift, sliding window, etc.)


cdf_sum = 1<<cdf_shift;

CACM87 forward(cdf,range,cdf_sum) = ( cdf * range ) >> cdf_shift;

So we have eliminated division from the forward map, but it remains in the inverse map. (the problem is that the CDF domain is our "home" which is under our control, while the arithmetic domain is constantly moving around, stretching and shrinking, as the "range" interval is modified).

We're fundamentally stuck needing something like "1 / range" , which is the whole crux of the problem that recip_arith is trying to attack.

I think we'll come back to that next time, as we've come far enough.

While I'm going into these forward/inverse maps lets go ahead and mention the standard "range coder" :


range coder :

forward(cdf,range,cdf_sum) = cdf * ( range / cdf_sum );

inverse(code,range,cdf_sum) = code / ( range / cdf_sum );

(integer, floor division)

or with power of 2 cdf_sum and in a more C way :

r_norm = range >> cdf_shift;

forward(cdf,range,cdf_sum) = cdf * r_norm;

inverse(code,range,cdf_sum) = code / r_norm;

where I'm introducing "r_norm" , which is just the ratio between "range" and "cdf_sum" , or the scaling from CDF to arithmetic domain.

Historically, the "range coder" map was attractive (vs CACM87) because it allowed 32 bit arithmetic state. In the olden days we needed to do the multiply and stay in 32 bits. In CACM87 you have to do (cdf * range) in the numerator, so each of those was limited to 16 bits. Because the arithmetic state (code/range) was only 16 bits, you had to do bit renormalization (in order to keep range large enough to do large CDF sums (actually byte renormalization was done as far back as 1984, in which case CDF sum was capped at 8 bits and only binary arithmetic coding could be done)). By adopting the "range coder" map, you could put the arithmetic state in 32 bits and still use just 32 bit registers. That meant byte renormalization was possible.

So, with modern processors with 64-bit registers there's actually very little advantage to the "range coder" map over the CACM87 map.

The range coder map has some coding loss. The fundamental reason is that the forward() map scaling is not exactly (Probability * range). Another way to think of that is that not all of range is used. In the "range coder" :


forward(cdf_sum) = cdf_sum * r_norm = r_norm << cdf_shift = (range >> cdf_shift) << cdf_shift

forward(cdf_sum) = range with bottom cdf_shift bits turned off

unused region is (range % cdf_sum)

The coding loss is small in practice (because we ensure that range is much larger than cdf_sum). In typical use, range is 24-31 bits and cdf_shift is in 11-14 , then the coding loss is on the order of 0.001 bpb. You can make the coding loss of the range coder arbitrarily small by using larger range (eg. 56-63 bits) with small cdf_sum.

The "range coder" map is actually much simpler than the CACM87 map. It simply takes each integer step in the CDF domain, and turns that into a step of "r_norm" in the arithmetic domain. The inverse map then just does the opposite, each run of "r_norm" steps in the arithmetic domain maps to a single integer step in the CDF domain.

.. and that's enough background for now.

The entry page for recip_arith : cbloom rants A Multi-Symbol Division Free Arithmetic Coder with Low Coding Loss using Reciprocal Multiplication

Links :

cbloom rants 10-05-08 Rant on New Arithmetic Coders
cbloom rants 10-06-08 Followup on the Russian Range Coder
cbloom rants 10-07-08 A little more on arithmetic coding ...
cbloom rants 10-10-08 On the Art of Good Arithmetic Coder Use
cbloom rants 08-16-10 - Range Coder Revisited .. oh wait, nevermind
cbloom rants 09-16-10 - Modern Arithmetic Coding from 1987
cbloom rants 10-08-08 Arithmetic coders throw away accuracy in lots of little places.
cbloom rants 08-11-10 - Huffman - Arithmetic Equivalence
On the Overhead of Range Coders

A Multi-Symbol Division Free Arithmetic Coder with Low Coding Loss using Reciprocal Multiplication

A standard multi-symbol arithmetic coder necessarily requires a costly divide to map the arithmetic code target back to a cdf (even when you have chosen the cdf sum to be a power of 2).

Doing binary arithmetic coding without a divide is trivial. Previous methods for division free multi-symbol arithmetic coding have been at the cost of low coding precision, that is coding in more bits than -log2 of the symbol probabilities (for example see DCC95 and SM98).

code is here :

recip_arith on github

recip_arith is public domain.

Our method uses an approximate map from the cdf interval to the range interval, scaling by only a few top bits of range. This approximate map introduces coding loss, but that can be made quite small with just 8 bits from range.

The advantage of this approximate map is that it can be exactly inverted using a table of reciprocal multiplies :

x / y == (x * recip_table[y]) >> 32

recip_table[y] = ((1<<32) + y-1) / y
x/y in integers is the floor divide, and recip_table is the ceil reciprocal. (see Alverson, "Integer Division using reciprocals"). The choice of a 32 bit numerator is somewhat arbitrary, but we do want the reciprocals to fit in 32 bit words to minimize the size of recip_table in L1 cache.

Crucially because only the top 8 bits of "range" are used in the forward map, then "y" needed in the divide is only 8 bits, so the recip_table can be quite small. Furthermore, 8 bits + 24 bits of cdf in the numerator "x" can be inverted exactly.

Coding Efficiency

The coding loss of "recip_arith" with 8 bit tables is reliably under 0.005 bpb (around 0.1%) , in constrast to SM98 where the coding loss can be 10X higher (0.05 bpb or 1.0%)

Calgary Corpus file "news" , len=377,109

cdf_bits = 13  (symbol frequencies sum to 1<<13)

from smallest to largest output :

recip_arith coder (down/up 8-bit) :                                244,641 = 5.190 bpb
cacm87:                                                            244,642 = 5.190 bpb
range coder:                                                       244,645 = 5.190 bpb
recip_arith coder (with end of range map) :                        244,736 = 5.192 bpb
recip_arith coder:                                                 244,825 = 5.194 bpb
recip_arith coder (down/up 2-bit) (aka Daala "reduced overhead") : 245,488 = 5.208 bpb
SM98 :                                                             245,690 = 5.212 bpb
(the down/up and "end of range map" variants will be discussed later)

The crucial step of recip_arith is the way the arithmetic coding interval is shrunk to encode a symbol.

The standard range coder does :

    uint32_t r_norm = ac->range >> cdf_bits;

    ac->low += cdf_low * r_norm;
    ac->range = cdf_freq * r_norm;
while recip_arith does :
    uint32_t range = ac->range;

    int range_clz = clz32(range);
    uint32_t r_top = range >> (32 - range_clz - RECIP_ARITH_TABLE_BITS);
    uint32_t r_norm = r_top << ( 32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);
            
    ac->low += cdf_low * r_norm;
    ac->range = cdf_freq * r_norm;
where "r_top" is just the highest RECIP_ARITH_TABLE_BITS bits of range, and r_norm is those bits back in their original position, then shifted down by cdf_bits. In the end the way the interval is shrunk is exactly the same as the range coder, except that all bits below the top RECIP_ARITH_TABLE_BITS of "r_norm" are turned off.

Is it useful ?

It's unclear if there are really killer practical advantages to recip_arith at the moment. On many modern high end CPUs, 32 bit divides are pretty fast, so if you just take recip_arith and use it to replace a standard range coder you might not see much difference. On CPUs with slow divides there is a more significant advantage.

recip_arith provides an interesting continuum that connects very low precision coders like SM98 to the full precision range coder. The "up/down" variant with a very small reciprocal table (4 bits?) might be interesting for hardware implementations, where division is not desirable.

recip_arith works with 64-bit coding state (the standard range coder would require a 64-bit divide, which is often much slower than a 32-bit divide), which can provide advantages. recip_arith also works with the encoder & decoder not in lock step; eg. you could encode with 32-bit state and decode with 64-bit state, allowing you independence of the bitstream & implementations, which is very desirable; not all arithmetic coders have this property.

I think primarily it is theoretically interesting at the moment, and it remains to be seen if that turns into a compelling practical application.

What's Next

I'll be doing a couple of followup posts going into the details of how/why this works. We'll talk about the theory of arithmetic coders and how to think about them. Then we'll look at the up/down and "map end" variants.

10/13/2018

If you have had difficulty sending me email

My email has been extremely flaky over the last few months due to problems at dreamhost (who host cbloom.com).

If you sent me an email and I did not reply, it's possible I didn't get it. Check your spam for bounced emails, since in my experience many undelivered mail responses incorrectly wind up in spam folders these days.

I will be changing to a new host (because dreamhost sucks and has consistently failed to fix their email issues). There may be more missed emails in the transition.

If you want to send me something reliably, please try carrier pigeon or can on string, since technology has only gotten worse since then. (grumble grumble, we all as an industry fucking epically suck and computers are monumentally awful and depressing, grumble grumble)


Transition in progress... looks like I'll be moving cbloom.com web hosting off dreamhost as well. Let me know if you see any problems.


... done. Now on fastmail.

10/01/2018

Oodle Lossless Image

We have released Oodle Lossless Image Compression (aka OLI) v1.0

OLI is built on the blazing fast Oodle Data Compression engine, with a new very efficient image specific front end. OLI has a very simple native API, and it also has drop in look-alike APIs to replace stb_image or libpng, so it's very easy to integrate.

OLI gets much more compression than PNG. Its compression ratio is similar to higher compression codecs like webp-ll or FLIF. The big advantage of OLI is its blazing fast decode speed. OLI decodes way faster than any of the competition (3-10X faster), so you can compress big images smaller and load them faster.

(OLI is mostly by Jon Olick who did an awesome job on it)

OLI is currently for full color lossless images only. It's not for 3d game textures, it doesn't do BCn GPU compressed textures, it doesn't do things like half-float normal maps. It's for 24 bit RGB and 32 bit RGBA images, the kind of things you would have used PNG for before. OLI currently has only limited support for low color images (eg. palettized, 1-bit, and gray scale images). It's early days for OLI still and that support will get better, particularly if we hear from customers who need it.

For more about OLI visit the RAD Game Tools web site.

Image sizes on my "good" test set, smallest size for each image in bold :

oli webpll pngcrush png
4069138464_193981e0e2_o 3140944 3319222 4019282 4025911
9xkFD 2524927 2772626 3850985 3856541
A-Girl-With-Kind-Eyes 1212598 1307518 1568931 1620105
Beach_Hurricane_WaLp_TW2011 1970283 1997532 2543245 2606013
Beach_Starfish_and_Conch_Shell_Wp_TW 2266789 2299242 2591019 2701580
CliffSideA01_cm0001 3475407 3340540 4302971 4309175
CliffSideA03_cm0001 3626390 3560460 4697715 4704495
Girl_in_white 1253577 1313662 1726569 1729053
IMG_0060 2324810 2601634 3767787 3773211
IMG_0152 2166388 2389052 3668094 3673386
IMG_0155 1526228 1613594 2369898 2373318
IMG_0195 1998851 2115548 2910248 2914448
IMG_0339 1815852 2021904 3153675 3158211
IMG_0341 1196634 1267144 1865148 1867836
LOVE1 1170354 1217124 1598967 1601271
Lily+Allen+smile 1156702 1291656 1850890 1853554
PDI_1200 921293 959710 1260621 1262433
POMSep2007 1061705 1066320 1520546 1522742
Portal2Win 2992069 2904416 3695920 3701248
Spring_Beauty 1749534 1788812 2041309 2118280
Sunset_Rocky_Beach_Walp_TW 2700270 2665954 3137206 3192653
ainokura_bibble 817631 817516 1085462 1087022
artificial 721090 578668 845572 846796
big_building 3772059 3795954 4569456 4579407
big_tree 3476811 3466330 3747685 3753085
bmw 1408411 1429148 1790859 1793439
bridge 1401722 1400592 1535838 1538094
c77b75083b34 2516018 2657536 3818229 3823737
cathedral 1914743 1926612 2212063 2215255
church-in-the-prairie 1955457 1984848 2973585 2977869
cool-orb 379704 326258 492951 493671
deer 4753833 4454234 4757131 4763983
flower_foveon 1701007 1672546 2194479 2213494
hdr 1588182 1610458 1942456 2011759
highclass_by_DivineError_lossless 591236 588602 848270 849494
leaves_iso_200 2537504 2613286 3120814 3125314
moses 2137994 2183616 3220571 3225215
mysoup 2473925 2717346 3631056 3636300
nails2 2181876 2300032 3228688 3233344
nightshot_iso_100 2023458 1948254 2293315 2397815
phoenix 110389 59800 83692 131474
pulchra2.1 2421925 2536570 3482309 3487337
record_only 789644 790842 1211780 1282677
school-shoot-moodboard-number-1 1587878 1638888 2199389 2202557
space-desktop 561563 545412 724571 725615
spider_web 2119921 2203508 2603659 2715256
windmill_evsm 1470104 1487660 2090287 2093311
yahoo 1999722 1700342 2710221 2849650
total 91665412 93248528 121555414 122618434

oli is oli_enc --super
webpll is cwebp -lossless -z 9
pngcrush is default options
png is made by bmp2png


Oodle is an SDK for high performance lossless data compression. For more about Oodle, or licensing inquiries, visit the RAD Game Tools web site. This is my personal blog where I post supplemental material about Oodle.

old rants