09-16-10 - Modern Arithmetic Coding from 1987

I was just reading through my old paper "New Techniques in Context Modeling and Arithmetic Encoding" because I wanted to read what I'd written about PPMCB (since I'm playing with similar things now and PPMCB is very similar to the way modern context mixers do this - lossy hashing and all that). (thank god I wrote some things down or I would have lost everything; I only wish I'd written more, so many little heuristics and experimental knowledge has been lost).

Anyway, in there I found this little nugget that I had completely forggoten about. I wrote :

    Step 3 uses a method introduced by F. Rubin [3], and popularized by G.V. Cormack�s DMC
    implementation [1].  It is:
    while ( R < 256 )
       R <<= 8;
       L = (L << 8) & 0xFFFFFF;
       if ( (L + R) > 0xFFFFFF ) R = 0xFFFFFF - L;
    This method is approximate, and therefore produces slightly longer output than the canonical
    CACM-87 style normalization [5].  However, the output is byte-aligned, and the normalization loop is
    almost always only performed once; these properties make this method extremely fast.
    The line:
    if ( (L + R) > 0xFFFFFF ) R = 0xFFFFFF - L;
    is where the extra output comes from.  It can be seen that R is shrunk to less than it need be, so
    extra normalizations are needed, and more bytes are output.  However, this only happens 1/65536th of
    the time, so the excess output is negligible.
Well what do you fucking know. This is a "carryless range coder" (Rant on New Arithmetic Coders) which was apparently rediscovered 20+ years later by the russians. ( see also (Followup on the "Russian Range Coder") ). The method of reducing range to avoid the carry is due to Rubin :

F. Rubin, "Arithmetic Stream Coding Using Fixed Precision Registers", IEEE Trans. Information Theory IT-25 (6) (1979), p. 672 - 675

And I found it in Cormack's DMC which amazingly is still available at waterloo : dmc.c

Cormack in 1987 wrote in the dmc.c code :

    while ((max-min) < 256)
        putchar(min >> 16);
        min = (min << 8) & 0xffff00;
        max = (max << 8) & 0xffff00;
        if (min >= max) max = 0x1000000;
which is a little uglier than my version above, but equivalent (he has a lot of ugly +1/-1 stuff cuz he didn't get that quite right).

And actually in the Data Compression Using Dynamic Markov Modelling paper that goes with the dmc.c code, they describe the arithmetic coder due to Guazzo and it is in fact an "fpaq0p" style carryless arithmetic coder (it avoids carries by letting range get very small and only works on binary alphabets). I don't have the original paper due to Guazzo so I can't confirm that attribution. The one in the paper does bit by bit output, but as I've stressed before that doesn't characterize the coder, and Cormack then did bytewise output in the implementation.

Anyway, I had completely forgotten about this stuff, and it changes my previous attribution of byte-wise arithmetic coding to Schindler '98 ; apparently I knew about it in '95 and Cormack did it in '87. (The only difference between the Schindler '98 coder and the NTiCM '95 coder is that I was still doing (Sl*L)/R while Schindler moved to Sl*(L/R) which is a small but significant change).

Apparently the "russian range coder" is actually a "Rubin arithmetic coder" (eg. avoid carries by shrinking range), and the "fpaq0p binary carryless range coder" is actually a "Guazzo arithmetic coder" (avoid carries by being binary and ensuring only range >= 2).


ryg said...

Fits in with the general pattern in maths where most of the important concepts are discovered at least twice and tend to be named after the second or third person to write about them :)

Ethatron said...

Here are all arithmetic coders I got source from:


When I took a look at F0also detected it's the Guazzo-Coder, as one can see here:



I tried to make the code appear as similar as possible.

old rants