3/30/2014

03-30-14 - Decoding GIF

So I'm writing a little image viewer for myself because I got fed up with ACDSee sucking so bad. Anyway, I had almost every image format except GIF, so I've been adding that.

It's mostly straightforward except for a few odd quirks, so I'm writing them down.

Links :

spec of gif89a
A breakdown of a GIF decoder
The GIFLIB Library
Frame Delay Times for Animated GIFs by humpy77 on deviantART
gif_timing test
ImageMagick - Animation Basics -- IM v6 Examples
(Optional) Building zlib, libjpeg, libpng, libtiff and giflib � Leptonica & Visual Studio 2008
theimage.com gif Disposal Methods 2
theimage.com GIF Table of Contents

My notes :

A GIF is a "canvas" and the size of the canvas is in the file header as the "screen width/height". There are then multiple "images" in the file drawn into the canvas.

In theory multiple images could be used even for non-animated gifs. Each image can have its own palette, which lets you do true color gifs by assigning different palettes to different parts of the image. So you should not assume that a GIF decodes into an 8-bit palettized image. I have yet to see any GIF in the wild that does this. (and if you had one, most viewers would interpret it as an animated gif, since delay=0 is not respected literally)

(one hacky way around that which I have seen suggested elsewhere : a gif with multiple images but no GCE blocks should be treated as compositing to form a single still image, whereas GCE blocks even with delay of 0 must be interpreted as animation frames)

Note that animation frames which only update part of the image *is* common. Also the transparency in a GIF must be used when drawing frames onto the canvas - it does not indicate that the final pixel should be transparent. That is, an animation frame may mark some pixels as transparent, and that means don't update those pixels.

There is an (optional) global palette and an (optional) per-image palette. In the global header there is a "background color". That is an index to the global palette, if it exists. The background color will be visible in parts of the canvas where there is no image rectangle, and also where images are transparent all the way down to the canvas. However, the ImageMagick link above has this note :


        There is some thinking that rather than clearing the overlaid area to
        the transparent color, this disposal should clear it to the 'background'
        color meta-data setting stored in the GIF animation. In fact the old
        "Netscape" browser (version 2 and 3), did exactly that. But then it also
        failed to implement the 'Previous' dispose method correctly.

        On the other hand the initial canvas should also be set from the formats
        'background' color too, and that is also not done. However all modern
        web browsers clear just the area that was last overlaid to transparency,
        as such this is now accepted practice, and what IM now follows. 
        
which makes me think that many decoders (eg. web browsers) ignore background and just make those pixels transparent.

(ADD : I've seen quite a few cases where the "background" value is larger than the global palette. eg. global palette has 64 colors, and "background" is 80 or 152.)

In the optional GCE block, each image can have a transparent color set. This is a palette index which acts as a color-key transparency. Tranparent pixels should show whatever was beneath them in the canvas. That is, they do not necessarily result in transparent pixels in the output if there was a solid pixel beneath them in the canvas from a previous image.

Each image has a "delay" time and "dispose" setting in the optional GCE block (which occurs before the image data). These apply *after* that frame.

Delay is the frame time; it can vary per frame, there is no global constant fps. Delay is in centiseconds, and the support for delay < 10 is nasty. In practice you must interpret a delay of 0 or 1 centiseconds to mean "not set" rather than to mean they actually wanted a delay of 0 or 1. (people who take delay too literally are why some gifs play way too fast in some viewers).

Dispose is an action to take on the image after it has been displayed and held for delay time. Note that it applies to the *image* not the canvas (the image may be just a rectangular portion of the canvas). It essentially tells how that image's data will be committed to the canvas for future frames. Unfortunately the dispose value of 0 for "not specified" is extremely common. It appears to be correct to treat that the same as a value of 1 (leave in place).

(ADD : I've seen several cases of a dispose value of "5". Dispososal is supposed to be a 3 bit value, of which only the values 0-3 are defined (and fucking 0 means "undefined"). Values 4-7 are supposed to be reserved.)

The ImageMagick and "theimage.com" links above are very handy for testing disposal and other funny animation details.

It's a shame that zero-delay is so widely mis-used and not supported, because it is the most interesting feature in GIF for encoder flexibility.

3/25/2014

03-25-14 - deduper

So here's my little dedupe :

dedupe.zip (x64 exe)

This is a file level deduper (not a block or sector deduper) eg. it finds entire files that are identical.

dedupe.exe does not delete dupes or do anything to them. Instead it outputs a batch file which contains commands to do something to the dupes. The intention is that you then go view/edit that batch file and decide which files you actually want to delete or link or whatever.

It runs on Oodle, using multi-threaded dir enumeration and all that. (see also tabdir )

It finds possible dedupes first by checking file sizes. For any file size where there is more than one file of that size, it then hashes and collects possible dupes that have the same hash. Those are then verified to be actual byte-by-byte dupes.

Obviously deleting a bunch of dupes is dangerous, be careful, it's not my fault, etc.

Possible todos to make it faster :

1. Use an (optional) cache of hashes and modtimes to accelerate re-runs. One option is to store the hash of each file in an NTFS extra data stream on that file (which allows the file to be moved or renamed and retain its hash); probably better just to have an explicit cache file and not muck about with rarely-used OS features.

2. If there are only 2 or 3 files of a given size, I should just jump directly to the byte-by-byte compare. Currently they are first opened and hashed, and then if the hashes match they are byte-by-byte compared, which means an extra scan of those files.

3/15/2014

03-15-14 - Bit IO of Elias Gamma

Making a record of something not new :

Elias Gamma codes are made by writing the position of the top bit using unary, and then the lower bits raw. eg to send 30, 30 = 11110 , the top bit is in position 4 so you send that as "00001" and then the bottom 4 bits "1110". The first few values (starting at 1) are :


1 =   1 : 1
2 =  10 : 01 + 0
3 =  11 : 01 + 1
4 = 100 : 001 + 00
5 = 101 : 001 + 01
...

The naive way to send this code is :

void PutEliasGamma( BITOUT, unsigned val )
{
    ASSERT( val >= 1 );

    int topbit = bsr(val);

    ASSERT( val >= (1<<topbit) && val < (2<<topbit) );

    PutUnary( BITOUT, topbit );

    val -= (1<<topbit); // or &= (1<<topbit)-1;

    PutBits( BITOUT, val, topbit );
} 

But it can be done more succinctly.

We should notice two things. First of all PutUnary is very simple :


void PutUnary( BITOUT, unsigned val )
{
    PutBits( BITOUT , 1, val+1 );
}

That is, it's just putting the value 1 in a variable number of bits. This gives you 'val' leading zero bits and then a 1 bit, which is the unary encoding.

The next is that the 1 from the unary is just the same as the 1 we remove from the top position of 'val'. That is, we can think of the bits thusly :


5 = 101 : 001 + 01

unary of two + remaining 01
or

5 = 101 : 00 + 101

two zeros + the value 5

The result is a much simplified elias gamma coder :

void PutEliasGamma( BITOUT, unsigned val )
{
    ASSERT( val >= 1 );

    int topbit = bsr(val);

    ASSERT( val >= (1<<topbit) && val < (2<<topbit) );

    PutBits( BITOUT, val, 2*topbit+1 );
} 

note that if your bit IO is backwards then this all works slightly differently (I'm assuming you can combine two PutBits into one with the first PutBits in the top of the second; that is
PutBits(a,na)+PutBits(b,nb) == PutBits((a<<nb)|b,na+nb)

Perhaps more importantly, we can do a similar transformation on the reading side.

The naive reader is :


int GetEliasGamma( BITIN )
{
    int bits = GetUnary( BITIN );

    int ret = (1<<bits) + GetBits( BITIN, bits );

    return ret;
}

(assuming your GetBits can handle getting zero bits, and returns a value >= 1). The naive unary reader is :

int GetUnary( BITIN )
{
    int ret = 0;
    while( GetOneBit( BITIN ) == 0 )
    {
        ret++;
    }
    return ret;
}

but if your bits are top-justified in your bit input word (as in ans_fast for example, or see the end of this post for a reference implementation), then you can use count_leading_zeros to read unary :

int GetUnary( BITIN )
{
    int clz = count_leading_zeros( BITIN );

    ASSERT( clz < NumBitsAvailable(BITIN) );

    int one = GetBits( BITIN, clz+1 );
    ASSERT( one == 1 );

    return clz;
}

here the GetBits is just consuming the zeros and the one on bit of the unary code. Just like in the Put case, the key thing is that the trailing 1 bit of the unary is the same as the top bit value ( "(1<<bits)" ) that we added in the naive reader. That is :

int GetEliasGamma( BITIN )
{
    int bits = count_leading_zeros( BITIN );

    ASSERT( bits < NumBitsAvailable(BITIN) );

    int one = GetBits( BITIN, bits+1 );
    ASSERT( one == 1 );

    int ret = (1<<bits) + GetBits( BITIN, bits );

    return ret;
}

can be simplified to combine the GetBits :

int GetEliasGamma( BITIN )
{
    int bits = count_leading_zeros( BITIN );

    ASSERT( bits < NumBitsAvailable(BITIN) );

    int ret = GetBits( BITIN, 2*bits + 1 );

    ASSERT( ret >= (1<<bits) && ret < (2<<bits) );

    return ret;
}

again assuming that your GetBits combines like big-endian style.

You can do the same for "Exp Golomb" of course, which is just Elias Gamma + some raw bits. (Exp Golomb is the special case of Golomb codes with a power of two divisor).

Summary :

//===============================================================================

// Elias Gamma works on vals >= 1
// these assume that the value fits in your bit word
// and your bit reader is big-endian and top-justified

#define BITOUT_PUT_ELIASGAMMA(bout_bits,bout_numbits,val) do { \
    ASSERT( val >= 1 ); \
    uint32 topbit = bsr64(val); \
    BITOUT_PUT(bout_bits,bout_numbits, val, 2*topbit + 1 ); \
    } while(0)

#define BITIN_GET_ELIASGAMMA(bitin_bits,bitin_numbits,val) do { \
    uint32 nlz = clz64(bitin_bits); \
    uint32 nbits = 2*nlz+1; \
    BITIN_GET(bitin_bits,bitin_numbits,nbits,val); \
    } while(0)

//===============================================================================
// MSVC implementations of bsr and clz :

static inline uint32 bsr64( uint64 val )
{
    ASSERT( val != 0 );
    unsigned long b = 0;
    _BitScanReverse64( &b, val );
    return b;
}

static inline uint32 clz64(uint64 val)
{
    return 63 - bsr64(val);
}

//===============================================================================
// and for completeness, reference bitio that works with those functions :
//  (big endian ; bit input word top-justified)

#define BITOUT_VARS(bout_bits,bout_numbits,bout_ptr) \
    uint64 bout_bits; \
    int64 bout_numbits; \
    uint8 * bout_ptr;

#define BITOUT_START(bout_bits,bout_numbits,bout_ptr,buf) do { \
        bout_bits = 0; \
        bout_numbits = 0; \
        bout_ptr = (uint8 *)buf; \
    } while(0)

#define BITOUT_PUT(bout_bits,bout_numbits,val,nb) do { \
        ASSERT( (bout_numbits+nb) <= 64 ); \
        ASSERT( (val) < (1ULL<<(nb)) ); \
        bout_numbits += nb; \
        bout_bits |= ((uint64)(val)) << (64 - bout_numbits); \
    } while(0)
    
#define BITOUT_FLUSH(bout_bits,bout_numbits,bout_ptr) do { \
        *((uint64 *)bout_ptr) = _byteswap_uint64( bout_bits ); \
        bout_bits <<= (bout_numbits&~7); \
        bout_ptr += (bout_numbits>>3); \
        bout_numbits &= 7; \
    } while(0)
    
#define BITOUT_END(bout_bits,bout_numbits,bout_ptr) do { \
        BITOUT_FLUSH(bout_bits,bout_numbits,bout_ptr); \
        if ( bout_numbits > 0 ) bout_ptr++; \
    } while(0)

//===============================================================

#define BITIN_VARS(bitin_bits,bitin_numbits,bitin_ptr) \
    uint64 bitin_bits; \
    int64 bitin_numbits; \
    uint8 * bitin_ptr;

#define BITIN_START(bitin_bits,bitin_numbits,bitin_ptr,begin_ptr) do { \
        bitin_ptr = (uint8 *)begin_ptr; \
        bitin_bits = _byteswap_uint64( *( (uint64 *)bitin_ptr ) ); \
        bitin_ptr += 8; \
        bitin_numbits = 64; \
    } while(0)

#define BITIN_REFILL(bitin_bits,bitin_numbits,bitin_ptr) do { if ( bitin_numbits <= 56 ) { \
        ASSERT( bitin_numbits > 0 && bitin_numbits <= 64 ); \
        uint64 next8 = _byteswap_uint64( *( (uint64 *)bitin_ptr ) ); \
        int64 bytesToGet = (64 - bitin_numbits)>>3; \
        bitin_ptr += bytesToGet; \
        bitin_bits |= next8 >> bitin_numbits; \
        bitin_numbits += bytesToGet<<3; \
        ASSERT( bitin_numbits >= 56 && bitin_numbits <= 64 ); \
    } } while(0)

#define BITIN_GET(bitin_bits,bitin_numbits,nb,ret) do { \
        ASSERT( nb <= bitin_numbits ); \
        ret = (bitin_bits >> 1) >> (63 - nb); \
        bitin_bits <<= nb; \
        bitin_numbits -= nb; \
    } while(0)

//=========================================================

and yeah yeah I know this bitio could be faster. It's a reference implementation that's trying to avoid obfuscations. GTFO.

Added exp-golomb. The naive put is :


PutEliasGamma( val >> r );
PutBits( val & ((1<<r)-1) , r );

but you do various reductions and get to :
//===============================================================================

// this Exp Golomb is for val >= 0
// Exp Golomb is Elias Gamma + 'r' raw bits

#define BITOUT_PUT_EXPGOLOMB(bout_bits,bout_numbits,r,val) do { \
    ASSERT( val >= 0 ); \
    uint64 up = (val) + (1ULL<<(r)); \
    uint32 topbit = bsr64(up); \
    ASSERT( topbit >= (uint32)(r) ); \
    BITOUT_PUT(bout_bits,bout_numbits, up, 2*topbit + 1 - r ); \
    } while(0)
    
#define BITIN_GET_EXPGOLOMB(bitin_bits,bitin_numbits,r,val) do { \
    uint32 nbits = 2*clz64(bitin_bits)+1+r; \
    BITIN_GET(bitin_bits,bitin_numbits,nbits,val); \
    ASSERT( val >= (1ULL<<r) ); \
    val -= (1ULL<<r); \
    } while(0)

//=========================================================

3/14/2014

03-14-14 - Fold Up Negatives

Making a record of something not new :

Say you want to take the integers {-inf,inf} and map them to just the non-negatives {0,1,..inf}. (and/or vice-versa)

(this is common for example when you want to send a signed value using a variable length code, like unary or golomb or whatever; yes yes there are other ways, for now assume you want to do this).

We need to generate a number line with the negatives "folded up" and interleaved with the positives, like


0,-1,1,-2,2,...

The naive way is :

// fold_up makes positives even
//   and negatives odd

unsigned fold_up_negatives(int i)
{
    if ( i >= 0 )
        return i+i;
    else
        return (unsigned)(-i-i-1); 
}

int unfold_negatives(unsigned i)
{
    if ( i&1 ) 
        return - (int)((i+1)>>1);
    else
        return (i>>1);
}

Now we want to do it branchless.

Let's start with folding up. What we want to achieve mathematically is :


fold_up_i = 2*abs(i) - 1 if i is negative

To do this we will use some tricks on 2's complement integers.

The first is getting the sign. Assuming 32-bit integers for now, we can use :


minus_one_if_i_is_negative = (i >> 31);

= 0 if i >= 0
= -1 if i < 0

which works by taking the sign bit and replicating it down. (make sure to use signed right shift, and yes this is probably undefined blah blah gtfo etc).

The other trick is to use the way a negative is made in 2's complement.


(x > 0)

-x = (x^-1) + 1

or

-x = (x-1)^(-1)

and of course x^-1 is the same as (~x), that is flip all the bits. This also gives us :

x^-1 = -x -1

And it leads obviously to a branchless abs :


minus_one_if_i_is_negative = (i >> 31);
abs_of_i = (i ^ minus_one_if_i_is_negative) - minus_one_if_i_is_negative;

since if i is negative this is

-x = (x^-1) + 1

and if i is positive it's

x = (x^0) + 0

So we can plug this in :

fold_up_i = 2*abs(i) - 1 if i is negative

fold_up_i = abs(2i) - 1 if i is negative

minus_one_if_i_is_negative = (i >> 31);
abs(2i) = (2i ^ minus_one_if_i_is_negative) - minus_one_if_i_is_negative;

fold_up_i = abs(2i) + minus_one_if_i_is_negative

fold_up_i = (2i) ^ minus_one_if_i_is_negative

or in code :

unsigned fold_up_negatives(int i)
{
    unsigned two_i = ((unsigned)i) << 1;
    int sign_i = i >> 31;
    return two_i ^ sign_i;
}

For unfold we use the same tricks. I'll work it backwards from the answer for variety and brevity. The answer is :

int unfold_negatives(unsigned i)
{
    unsigned half_i = i >> 1;
    int sign_i = - (int)( i & 1 );
    return half_i ^ sign_i;
}

and let's prove it's right :

if i is even

half_i = i>>1;
sign_i = 0;

return half_i ^ 0 = i/2;
// 0,2,4,... -> 0,1,2,...

if i is odd

half_i = i>>1; // i is odd, this truncates
sign_i = -1;

return half_i ^ -1 
 = -half_i -1
 = -(i>>1) -1
 = -((i+1)>>1)
// 1,3,5,... -> -1,-2,-3,...

which is what we wanted.

Small note : on x86 you might rather use cdq to get the replicated sign bit of an integer rather than >>31 ; there are probably similar instructions on other architectures. Is there a neat way to make C generate that? I dunno. Not sure it ever matters. In practice you should use an "int32" type or compiler_assert( sizeof(int) == 4 );

Summary :


unsigned fold_up_negatives(int i)
{
    unsigned two_i = ((unsigned)i) << 1;
    int sign_i = i >> 31;
    return two_i ^ sign_i;
}

int unfold_negatives(unsigned i)
{
    unsigned half_i = i >> 1;
    int sign_i = - (int)( i & 1 );
    return half_i ^ sign_i;
}

3/13/2014

03-13-14 - Hilbert Curve Testing

So I stumbled on this blog post about counting the rationals which I found rather amusing.

The snippet that's relevant here is that if you iterate through the rationals naively by doing something like

1/1 2/1 3/1 4/1 ...
1/2 2/2 3/2 4/2 ...
1/3 2/3 ...
...
then you will never even reach 1/2 because the first line is infinitely long. But if you take a special kind of path, you can reach any particular rational in a finite number of steps. Much like the way a Hilbert curve lets you walk the 2d integer grid using only a 1d path.

It reminded me of something that I've recently changed in my testing practices.

When running code we are always stepping along a 1d path, which you can think of as discrete time. That is, you run test 1, then test 2, etc. You want to be able to walk over some large multi-dimensional space of tests, but you can only do so by stepping along a 1d path through that space.

I've had a lot of problems testing Oodle, because there are lots of options on the compressors, lots of APIs, lots of compression levels and different algorithms - it's impossible to test all combinations, and particularly impossible to test all combinations on lots of different files. So I keep getting bitten by some weird corner case.

(Total diversion - actually this is a good example of why I'm opposed to the "I tested it" style of coding in general. You know when you stumble on some code that looks like total garbage spaghetti, and you ask the author "wtf is going on here, do you even understand it?" and they "well it works, I tested it". Umm, no, wrong answer. Maybe it passed the test, but who knows how it's going to be used down the road and fail in mysterious ways? Anyway, that's an old school cbloom coding practices rant that I don't bother with anymore ... and I haven't been great about following my own advice ...)

If you try to set up a big test by just iterating over each option :

for each file
  for each compressor
    for each compression_level
      for each chunking
        for each parallel branching
          for each dictionary size
            for each sliding window size
              ...

then you'll never even get to the second file.

The better approach is to get a broad sampling of options. An approach I like is to enumerate all the tests I want to run, using a loop like the above, put them all in a list, then randomly permute the list. Because it's just a permutation, I will still only run each test once, and will cover all the cases in the enumeration, but by permuting I get a broader sampling more quickly.

(you could also add the nice technique that we worked out here long ago - generate a consistent permutation using a pseudorandom generator with known seed, and save your place in the list with a file or the registry or something. That way when you stop and start the tests, you will resume where you left off, and eventually cover more of the space (also when a test fails you will automatically repeat the same test if you restart)).

The other trick that's useful in practice is to front-load the quicker tests. You do want to have a test where you run on 8 GB files to make sure that works, but if that's your first test you have to wait forever to get a failure. This is particularly for the case that something dumb is broken, it should show up as quickly as possible so you can just cancel the test and fix it. So you want an initial broad sampling of options on very quick tests.

3/03/2014

03-03-14 - Windows Links

I wrote a deduper in Oodle today. I was considering making the default action be to replace duplicate files with a link to the original.

I wasn't sure whether to use "hard" or "soft" links, so I did a little research.

In Windows a "hard link" means that multiple file names all point at the same file data. It's a bit of a misnomer, it's not really a "link" to the original. There is no "original" or base file - all instances of a hard link are equal peers.

A "soft link" is just an OS-level shortcut. There is an original base file, and the soft links point at it.

Both are ridiculously broken concepts and IMO should almost never be used.

With "hard links" the problem is that if you accidentally edit any of the links, you have editted the master data. If you did not intend that, you may have fucked up something severely.

Hard links are reasonable *if* the files linked are read-only (and somehow actually kept read-only, not just attrib'ed away).

The problem with "soft links" is that the links are not protected; if you rename or move or delete the original file, all the links are broken, and again you may have just severely fucked something up.

The big problem is that you get no warning in either case. Clearly what you want when you try to rename a file which has a bunch of soft links pointing at it is some kind of dialog that says "hey all these links point at this file, do you really want to rename it and break the links? or should I update the links to point at the new name?". Similarly with hard links, obviously what you want is some prompt like "hey if you modify this, so you want these hard links to see the new version or the old version?".

Now obviously you can't solve this problem in general without user prompts. But I believe that a refcounted copy-on-write link would have been a much more robust and safe solution. Open for write should have done a COW by default unless you passed a special flag to indicate you intend to edit shared data.

Even ignoring the fundamental logic of how links work, there are some very bad practical issues for links in windows.

1. Soft links show a file size of 0 in the dir enumeration file info. This breaks the assumption that most apps make that the file size they get from the dir enumeration will be the same as the file size they get if they open that file handle and ask for its size. It can also screw up enumerations that are just trying to skip zero-size files.

Hard link file sizes are out of date. If the file data is modified, only the directory entry for the one that was used to modify the data is updated. All other links still have the old file sizes, mod times, etc.

2. Hard links break the assumption that saving to a file is the same as saving to a temp and then renaming onto the file. Many apps may or may not use the "write to temp then rename" pattern; what you get is massively different results in a very unexpected way.

3. Mod times are hosed. In general attributes are hosed; neither type of link reflects the attributes of the actual file data in the link - until they are opened, then they get updated. Mod times are particularly bad because many apps use them to detect changes, and with links the file data can be changed but the mod time won't reflect it.

Dear lord. So non-robust.

2/25/2014

02-25-14 - WiFi

So our WiFi stopped working recently, and I discovered a few things which I will now write down.

First of all, WiFi is fucked. 2.4 GHz is way overcrowded and just keeps getting more crowded. Lots of fucking routers now are offering increased bandwidth by using multiple channels simultaneously, etc. etc. It's one big interference fest.

The first issue I found was baby monitors. Baby monitors, like many wireless devices, are also in the 2.4 GHz band and just crap all over your wifi. Turning them off helped our signal a bit, but we were still getting constant problems.

Next issue is interference from neighbors'ses wifises. This is what inSSIDer looks like at my house :

We are *way* away from any neighbors, at least 50 feet in every direction, and we still get this amount of shit from them. Each of my cock-ass-fuck neighbors seems to have four or five wifi networks. Good job guys, way to fix your signal strength issues by just piling more shit in the spectrum.

What you can't see from the static image is that lots of the fucking neighbor wifis are not locked to a specific channel, many of them are constantly jumping around trying to find a clear channel, which just makes them crud them all up.

(I'd love to get some kind of super industrial strength wifi for my house and just crank it up to infinity and put it on every channel so that nobody for a mile around gets any wifi)

I've long had our WiFi on channel 8 because it looked like the most clear spot to be. Well, it turns out that was a classic newb mistake. Apparently it's worse to be slightly offset from a busy channel than it is to be right on it. When you're offset, you get signal leakage from the other channel that just looks like noise; being on the channel you're fighting with other people, but at least you are seing their data as real data that you can ignore. Anyway, switching our network to channel 11 fixed it.

It looks like in practice channel 6 and 11 are the only usable ones in noisy environments (eg. everywhere).

The new 802.11ac on 5 GHz should be a nice clean way to go for a few years until it too gets crudded up.

02-25-14 - ANS Applications

Some rambling on where I see ANS being useful.

In brief - anywhere you used Huffman in the past, you can use ANS instead.

ANS (or ABS) are not very useful for high end compressors. They are awkward for adaptive modeling. Even if all you care about is decode speed (so you don't mind the buffering up the models to make the encode work backwards) it's just not a big win over arithmetic coding. Things like PAQ/CM , LZMA, H264, all the high compression cases that use adaptive context models, there's no real win from ANS/ABS.

Some specific cases where I see ANS being a good win :

JPEG-ANS obviously. Won't be competitive with sophisticated coders like "packjpg" but will at least fix the cliff at low bit rate caused by Huffman coding.

JPEGNEXT-ANS. I've been thinking for a long time about writing a "JPEGNEXT". Back end coefficient classification; coefficients in each group sent by value with ANS. Front end 32x32 macroblocks with various DCT sizes. Try to keep it as simple as possible but be up to modern quality levels.

LZ-ANS. An "LZX class" (which means large window, "repeat match", entropy resets) LZ with ANS back end should be solid. Won't be quite LZMA compression levels, but way way faster.

Lossless image DPCM. ANS on prediction residual values is a clear killer. Should have a few statistics groups with block classification. No need to transmit the ANS counts if you use a parametric model ala TMW etc. Should be easy, fast, and good.

blocksort-ANS. Should replace bzip. Fast to decode.

MPEG-class video coders. Fast wavelet coders. Anywhere you are using minimal context modelling (only a few contexts) and are sending values by their magnitude (not bit plane coders).

Other?

2/18/2014

02-18-14 - ans_fast implementation notes

Some notes about the ans_fast I posted earlier .

ans_fast contains a tANS (table-ANS) implementation and a rANS (range-ANS) implementation.

First, the benchmarking. You can compare to the more naive implementations I posted earlier . However, do not compare this tANS impl to Huffman or arithmetic and conclude "ANS is faster" because the tANS impl here has had rather more loving than those. Most of the tricks used on "ans_fast" can be equally used for other algorithms (though not all).

Here L=4096 to match the 12-bits used in the previous test. This is x64 on my lappy (1.7 Ghz Core i7 with turbo disabled). Compressed sizes do not include sending the counts. Time "withtable" includes all table construction but not histogramming or count normalization (that affects encoder only). ("fse" and "huf" on the previous page included table transmission and histogramming time)


book1

tANS 768771 -> 435252.75

ticks to encode: 4.64 decode: 3.39
mbps encode: 372.92 decode: 509.63

withtable ticks to encode: 4.69 decode: 3.44
withtable mbps encode: 368.65 decode: 501.95

rANS 768771 -> 435980 bytes (v2)

ticks to encode: 6.97 decode: 5.06
mbps encode: 248.02 decode: 341.63

withtable ticks to encode: 6.97 decode: 5.07
withtable mbps encode: 247.92 decode: 341.27

pic

tANS 513216 -> 78856.88

ticks to encode: 4.53 decode: 3.47
mbps encode: 382.02 decode: 497.75

withtable ticks to encode: 4.62 decode: 3.56
withtable mbps encode: 374.45 decode: 485.40

rANS 513216 -> 79480 bytes (v2)

ticks to encode: 5.62 decode: 3.53
mbps encode: 307.78 decode: 490.32

withtable ticks to encode: 5.63 decode: 3.54
withtable mbps encode: 307.26 decode: 488.88

First a note on file sizes : rANS file sizes are a few bytes larger than the "rans 12" posted last time. That's because that was a 32-bit impl. The rANS here is 64-bit and dual-state so I have to flush 16 bytes instead of 4. There are ways to recover some of those bytes.

The tANS file sizes here are smaller than comparable coders. The win comes from the improvements to normalizing counts and making the sort order. In fact, the +1 bias heuristic lets me beat "arith 12" and "rans 12" from the last post, which were coding nearly perfectly to the expected codelen of the normalized counts.

If you run "ans_learning" you will often see that the written bits are less than the predicted codelen :

H = 1.210176
CL = 1.238785
wrote 1.229845 bpb
this is because the +1 bias heuristic lets the codelens match the data better than the normalized counts do.

Okay, so on to the speed.

The biggest thing is that the above reported speeds are for 2x interleaved coders. That is, two independent states encoding the single buffer to a single compressed stream. I believe ryg will talk about this more soon. You can read his paper on arxiv now. Note that this is not just unrolling. Because the states are independent they allow independent execution chains to be in flight at the same time.

The speedup from interleaving is pretty huge (around 1.4X) :


book1

rANS non-interleaved (v1)

ticks to encode: 26.84 decode: 7.33
mbps encode: 64.41 decode: 235.97

withtable ticks to encode: 26.85 decode: 7.38
withtable mbps encode: 64.41 decode: 234.19

rANS 2x interleaved (v1)

ticks to encode: 17.15 decode: 5.16
mbps encode: 100.84 decode: 334.95

withtable ticks to encode: 17.15 decode: 5.22
withtable mbps encode: 100.83 decode: 331.31


tANS non-interleaved

ticks to encode: 6.43 decode: 4.68
mbps encode: 269.10 decode: 369.44

withtable ticks to encode: 6.48 decode: 4.73
withtable mbps encode: 266.86 decode: 365.39

tANS 2x interleaved

ticks to encode: 4.64 decode: 3.39
mbps encode: 372.92 decode: 509.63

withtable ticks to encode: 4.69 decode: 3.44
withtable mbps encode: 368.65 decode: 501.95

But even non-interleaved it's fast. (note that interleaved tANS is using only a single shared bit buffer). The rest of the implementation discussion will use the non-interleaved versions for simplicity.

The tANS implementation is pretty straightforward.

Decoding one symbol is :


    struct decode_entry { uint16 next_state; uint8 num_bits; uint8 sym; };

    decode_entry * detable = table - L;

    #define DECODE_ONE() do { \
        de = detable + state; \
        nb = de->num_bits; \
        state = de->next_state; \
        BITIN_OR(bitin_bits,bitin_numbits,nb,state); \
        *toptr++ = (uint8) de->sym; \
    } while(0)

where BITIN_OR reads "nb" bits and ors them onto state.

With a 64-bit bit buffer, I can ensure >= 56 bits are in the buffer. That means with L up to 14 bits, I can do four decodes before checking for more bits needed. So the primary decode loop is :


        // I know >= 56 bits are available  
        // each decode consumes <= 14 bits

        DECODE_ONE();
        DECODE_ONE();
        DECODE_ONE();
        DECODE_ONE();
            
        BITIN_REFILL(bitin_bits,bitin_numbits,bitin_ptr);
        // now >= 56 bits again

The fastest way I could find to do the bit IO was "big endian style". That's the next bits at the top of the word. Bits in the word are in order of bits in the file. This lets you unconditionally grab the next 8 bytes to refill, but requires a bswap (on little endian platforms). eg :

#define BITIN_REFILL(bitin_bits,bitin_numbits,bitin_ptr) do { \
        ASSERT( bitin_numbits > 0 && bitin_numbits <= 64 ); \
        int64 bytesToGet = (64 - bitin_numbits)>>3; \
        uint64 next8 = _byteswap_uint64( *( (uint64 *)bitin_ptr ) ); \
        bitin_ptr += bytesToGet; \
        bitin_bits |= (next8 >> 1) >> (bitin_numbits-1); \
        bitin_numbits += bytesToGet<<3; \
        ASSERT( bitin_numbits >= 56 && bitin_numbits <= 64 ); \
    } while(0)

The other nice thing about the bits-at-top style is that the encoder can put bits in the word without any masking. The encoder is :

    #define ENCODE_ONE() do { \
        sym = *bufptr--; ee = eetable+sym; \
        msnb = ee->max_state_numbits; \
        msnb += ( state >= ee->max_state_thresh ); \
        BITOUT_PUT(bout_bits,bout_numbits, state,msnb); \
        state = ee->packed_table_ptr[ state>>msnb ]; \
        } while(0)

    #define BITOUT_PUT(bout_bits,bout_numbits,val,nb) do { \
        ASSERT( (bout_numbits+nb) <= 64 ); \
        bout_bits >>= nb; \
        bout_bits |= ((uint64)val) << (64 - nb); \
        bout_numbits += nb; \
    } while(0)

the key interesting part being that the encoder just does BITOUT_PUT with "state", and by shifting it up to the top of the word for the bitio, it gets automatically masked. (and rotate-right is a way to make that even faster).

Similarly to the decoder, the encoder can write 4 symbols before it has to check if the bit buffer needs any output.

The other crucial thing for fast tANS is the sort order construction. I do a real sort, using a radix sort. I do the first step of radix sorting (generating a histogram), and then I directly build the tables from that, reading out of the radix histogram. There's no need to explicitly generate the sorted symbol list as an intermediate step. I use only an 8-bit radix here (256 entries) but it's not significantly different (in speed or compression) than using a larger radix table.

The rANS implementation is pretty straightforward and I didn't spend much time on it, so it could probably be faster (particularly encoding which I didn't spend any time on (ADDENDUM : v2 rANS now sped up and encoder uses fast reciprocals)). I use a 64-bit state with 32-bit renormalization. The basic decode operation is :


        uint64 xm = x & mask;   
        const rans_decode_table::entry & e = table[xm];
            
        x = e.freq * (x >> cumprobtot_bits) + e.xm_minus_low;
    
        buffer[i] = (uint8) e.sym;
        
        if ( x < min_x )
        {
            x <<= 32;
            x |= *((uint32 *)comp_ptr);
            comp_ptr += 4;
        }

One thing I should note is that my rANS decode table is 2x bigger than the tANS decode table. I found it was fastest to use an 8-byte decode entry for rANS :

    // 8-byte decode entry
    struct entry { uint16 freq; uint16 xm_minus_low; uint8 sym; uint16 pad; };
obviously you can pack that a lot smaller (32 bits from 12+12+8) but it hurts speed.

For both tANS and rANS I make the encoder write backwards and the decoder read forwards to bias in favor of decoder speed. I make "L" a variable, not a constant, which hurts speed a little.

02-18-14 - Understanding ANS - Conclusion

I think we can finally say that we understand ANS pretty well, so this series will end. I may cover some more ANS topics but they won't be "Understanding ANS".

Here is the index of all posts on this topic :

cbloom rants 1-30-14 - Understanding ANS - 1
cbloom rants 1-31-14 - Understanding ANS - 2
cbloom rants 02-01-14 - Understanding ANS - 3
cbloom rants 02-02-14 - Understanding ANS - 4
cbloom rants 02-03-14 - Understanding ANS - 5
cbloom rants 02-04-14 - Understanding ANS - 6
cbloom rants 02-05-14 - Understanding ANS - 7
cbloom rants 02-06-14 - Understanding ANS - 8
cbloom rants 02-10-14 - Understanding ANS - 9
cbloom rants 02-11-14 - Understanding ANS - 10
cbloom rants 02-14-14 - Understanding ANS - 11
cbloom rants 02-18-14 - Understanding ANS - 12

And here is some source code for my ANS implementation : (v2 02/21/2014)

ans_learning.cpp
ans_fast.cpp
ans.zip - contains ans_fast and ans_learning
cblib.zip is required to build my code

My home code is MSVC 2005/2008. Port if you like. Email me if you need help.

NOTE : this release is not a library you just download and use. It is intended as documentation of research. If you want some ANS code you can just use off the shelf, go get FiniteStateEntropy . You may also want ryg_rans .

I think I'll do a followup post with the performance of ans_fast and some optimization notes so it doesn't crud up this index post. Please put implementation speed discussion in that followup post .

02-18-14 - Understanding ANS - 12

A little note about sorts and tables.

AAAAAAABBBBBBBBBBBBBCCCCCCCCCCCD

What's wrong with that sort?

(That's the naive rANS sort order; it's just a "cum2sym" table. It's each symbol Fs times in consecutive blocks. It has M=32 entries. M = sum{Fs} , L = coding precision)

(here I'm talking about a tANS implementation with L=M ; the larger (L/M) is, the more you preserve the information in the state x)

Think about what the state variable "x" does as you are coding. In the renormalization range it's in [32,63]. Its position in that range is a slider for the number of fraction bits it contains. At the bottom of the range, log2(x) is 5, at the top log2(x) is 6.

Any time you want to encode a "D" you must go back to a singleton precursor state, Is = [1]. That means you have to output all the bits in x, so all fractional bits are thrown away. All information about where you were in that I range is gone. Then from that singleton Is range you jump to the end of the I range.

(if Fs=2 , then you quantize the fractional bits up to 0.5 ; is Fs=3, you quantize to 1/3 of a bit, etc.)

Obviously the actual codelen for a "D" is longer than that for an "A". But so is the codelen for a "C", and the codelen for "A" is too short. Another way to think of it is that you're taking an initial state x that spans the whole interval [32,63] and thus has variable fractional bits, and you're mapping it into only a portion of the interval.

In order to preserve the fractional bit state size, you want to map from the whole interval back to the whole interval. In the most extreme case, something like :

ACABACACABACABAD

(M=16) , when you encode an A you go from [16,31] to [8,15] and then back the A's in that string. The net result is that state just lost its bottom bit. That is, x &= ~1. You still have the full range of possible fractional bits from [0,1] , you just lost the bottom bit of precision.

I was thinking about this because I was making some weird alternative tANS tables. In fact I suppose not actually ANS tables, but more general coding tables.

For background, you can make one of the heuristic tANS tables thusly :


shuffle(s) = some permutation function
shuffle is one-to-one over the range [0,L-1]
such as Yann's stepping prime-mod-L
or bit reverse

make_tans_shuffle()
{
    int next_state[256];    
    uint8 permutation[MAX_L];
    
    // make permutation :
    uint32 cumulative_count = 0;    
    for LOOP(sym,alphabet)
    {
        uint32 count = normalized_counts[sym];
        if ( count == 0 ) continue;
        
        next_state[sym] = count;
        
        for LOOP(c,(int)count)
        {
            uint32 index = shuffle(cumulative_count);
            cumulative_count++;
            
            permutation[index] = (uint8)sym;
        }
    }
    ASSERT( cumulative_count == (uint32)L );

    // permutation is now our "output string"   

    for LOOP(i,L) // iterating over destination state
    {
        int sym = permutation[i];
        
        // step through states for this symbol
        int from_state = next_state[sym];
        next_state[sym] ++;
                
        int to_state = L + i;
                    
        encode_packed_table_ptr[sym][from_state] = to_state;
    }
}

which is all well and good. But I started thinking - can I eliminate the intermediate permutation[] table entirely? Well, yes. There are a few ways.

If you have a "cum2sym" table already handy, then you can just use shuffle() to look up directly into cum2sym[], and that is identical to the above. But you probably don't have cum2sym.

Well what if we just use shuffle() to make the destination state? Note that this is calling it in the opposite direction (from cum2sym index to to_state , rather than from to_state to cum2sym). If your shuffle is self-inverting like bit reversal is, then it's the same.

It gives you a very simple table construction :


make_tans_shuffle_direct()
{
    uint32 cumulative_count = 0;    
    for LOOP(sym,alphabet)
    {
        uint32 count = normalized_counts[sym];
        if ( count == 0 ) continue;
                
        for LOOP(c,(int)count)
        {
            uint32 index = shuffle(cumulative_count);
            cumulative_count++;

            uint32 to_state = index + L;
            int from_state = count + c; 

            encode_packed_table_ptr[sym][from_state] = to_state;
        }
    }
    ASSERT( cumulative_count == (uint32)L );
}

make_tans_shuffle_direct walks the Fs in a kind of cum2sym order and then scatters those symbols out to semi-random target locations using the shuffle() function.

It doesn't work. Or rather, it works, it encodes & decodes data correctly, but the total coded size is worse.

The problem is that the encode table is no longer monotonic. That is, as "from_state" increases, "to_state" does not necessarily increase. The Fs encode table entries for each symbol are not numerically in order.

In the images we've been picturing from earlier in the post we can see the problem. Some initial state x is renormalized down to the Is coding range. We then follow the state transition back to the I range - but we go somewhere random. We don't go to the same neighborhood where we started, so we randomly get more or less fractional bits.

You can fix it thusly :


make_tans_shuffle_direct_fixed()
{
    uint32 cumulative_count = 0;    
    for LOOP(sym,alphabet)
    {
        uint32 count = normalized_counts[sym];
        if ( count == 0 ) continue;
                
        for LOOP(c,(int)count)
        {
            uint32 index = shuffle(cumulative_count);
            cumulative_count++;

            uint32 to_state = index + L;
            int from_state = count + c; 

            encode_packed_table_ptr[sym][from_state] = to_state;
        }

        // fix - to_states not monotonic
        // sort the destination states for this symbol :
        std::sort( encode_packed_table_ptr[sym]+count, encode_packed_table_ptr[sym]+2*count );
    }
    ASSERT( cumulative_count == (uint32)L );
}

and then it is identical to "make_tans_shuffle" (identical if shuffle is self-inverting, and if not then it's different but equal, since shuffle is really just a random number generator so running it backwards doesn't hurt compression).

For the record the compression penalty for getting the state transition order wrong is 1-2% :


CCC total bytes out :

correct sort : 1788631
shuffle fixed: 1789655
shuffle bad  : 1813450

2/14/2014

02-14-14 - Understanding ANS - 11

I want to do some hand waving about the different ways you can conceptually look at ANS.

Perhaps the best way to understand ANS mathematically is via the analogy with arithmetic coding . While ANS is not actually building up an arithmetic coder interval for the file, each step is very much like a LIFO arithmetic coder, and the (x/P) scaling is what makes x grow the right amount for each symbol. This is the most useful way to think about rANS or uANS, I find.

But there are other ways to think about it.

One is Duda's "asymmetric numeral system", which is how he starts the explanation in the paper, and really confused me to begin with. Now that we've come at ANS from the back way we can understand what he was on about.

The fundamental op in ANS is :


integer x contains some previous value

make x' = x scaled up in some way + new value 

with a normal "symmetric numeral system" , you would just do base-b math :

x' = x * b + v

which gives you an x' where the old value (x) is distributed evenly, and the v's just cycle :

b = 3 for example

x':  0  1  2  3  4  5  6  7  8 ... 
x :  0  0  0  1  1  1  2  2  2
v :  0  1  2  0  1  2  0  1  2

x' is a way of packing the old value x and the new value v together. This symmetric packing corresponds to the output string "012" in the parlance of this series. The growth factor (x'/x) determines the number of bits required to send our value, and it's uniform.

But it doesn't need to be uniform.


0102 :

x':  0  1  2  3  4  5  6  7  8 ... 
x :  0  0  1  0  2  1  3  1  4 
v :  0  1  0  2  0  1  0  2  0

Intuitively, the more often a symbol occurs in the output string, the more slots there are for the previous value (x) to get placed; that is, more bits of x can be sent in lower values of x' when the symbol occurs in many slots. Hence x' grows less. If you're thinking in terms of normalized x's, growing less means you have to output fewer bits to stay in the renormalization range.

You can draw these asymmetric numeral lines in different ways, which Duda does in the paper. For example :


input x as the axis line,
output x' in the table :

"AB"
  0 1 2 3 4 5 6  x
A 0 2 4          x'
B 1 3 5

"AAB"

  0 1 2 3 4 5 6  x
A 0 1 3 4 6 7 9  x'
B 2 5 8 11

output x' as the axis line
input x in the table :

"AB"
  0 1 2 3 4 5 6  x'
A 0   1   2   3  x
B   0   1   2

"AAB"
  0 1 2 3 4 5 6  x'
A 0 1   2 3   4  x
B     0     1

output x' line implicit
show x and output symbol :

"AB"
0 0 1 1 2 2 3
A B A B A B A

"AAB"
0 1 0 2 3 1 4
A A B A A B A

That is, it's a funny way of just doing base-b math; we're shifting up the place value and adding our value in, but we're in an "asymmetric numeral system", so the base is nonuniform. I find this mental image not very useful when thinking about how the coding actually works.

There's another way to think about tANS in particular (tANS = table-based ANS), which is what Yann is talking about .

To get there mentally, we actually need to optimize our tANS code.

When I covered tANS encoding before , I described it something like this :


x is the current state
x starts in the range I = [L,2L-1]

to encode the next symbol s
we need to reach the precursor range Is = [Fs,2Fs-1]

to do that, output bits from x
b = x&1; x >>= 1;
until x is lowered to reach Is

then take the state transition C()
this takes x back into I

this should be familiar and straightforward.

To optimize, we know that x always starts in a single power of 2 interval [L,2L-1] , and it always lands in a power of 2 interval [Fs,2Fs-1]. That means the minimum number of bits we ever output is from L to 2Fs-1 , and the maximum number of bits is only 1 more than that. So the renormalization can be written as :


precompute :

max_state = 2Fs - 1;
min_num_bits = floor(log2(L/Fs));

to renormalize :

x in [L,2L-1]
output min_num_bits from x
x >>= min_num_bits

now ( x >= Fs && x < 2*max_state );

if ( x > max_state ) output 1 more bit; x>>= 1;

now x in [Fs,2Fs-1]

But you can move the check for the extra output bit earlier, before shifting x down :

precompute :

min_num_bits = log2(L) - log2ceil(Fs);  // if L is power of 2
threshold = (2*Fs)<<num_bits;

to renormalize :

x in [L,2L-1]
num_bits = min_num_bits;
if ( x >= threshold ) num_bits ++;
output num_bits from x
x >>= num_bits

x in [Fs,2Fs-1]

and then use C(x) since x is now in Is.

It's just straightforward optimization, but it actually allows us to picture the whole process in a different way. Let's write the same encoder, but just in terms of a table index :


let t = x - L
t in [0,L-1]

t is a table index.


to encode s :

num_bits = min_num_bits[s] + ( t >= threshold[s] );
bitoutput( t, num_bits );
t = encode_state_table[s][ (t+L)>>num_bits ];

That is, we're going from a table index to another table index. We're no longer thinking about going back to the [Fs,2Fs-1] precursor range at all.

Before we got our desired code len by the scaling of the intervals [L,2L)/[Fs,2Fs) , now the code len is the stored number of bits. We can see that we get fractional bits because sometimes we output one more.

Let's revisit an example that we went through previously , but with this new image.


L = 8
Fs = {3,3,2}
output string = ABCABCAB

We can see right away that our table index t is 3 bits. To encode a 'C' there will be only two slots on our numeral line that correspond to a lower digit of C, so we must output 2 bits and keep 1 bit of t. To encode an 'A' we can keep 3 values, so we can output 1 bit for t in [0,3] and 2 bits for t in [4,7] ; that will give us 2 retained values in the first region and 1 retained value in the second.

Explicitly :


t in [0,7]
I want to encode an A
so I want to reach {AxxAxxAx}

t in [0,3]
  output t&1
  index = (t+L)>>1 = 4 or 5
  take the last two A's {xxxAxxAx}
  so state -> 3 or 6

t in [4,7]
  output t&3
  index = (t+L)>>2 = 3
  take the first A {Axxxxxxx}
  state -> 0

Note that the way we're doing it, high states transition to low states, and vice versa. These comes up because of the +L sentry bit method to separate the subranges produced by the shift.

The tANS construction creates this encode table :


encode:
A : b=1+(t>=4) : {0,3,6}
B : b=1+(t>=4) : {1,4,7}
C : b=2+(t>=8) : {2,5}

It should be obvious that we can now drop all our mental ideas about "ANS" and just make these coding tables directly. All you need is an output string, and you think about doing these kinds of mapping :

t in [0,7]

I want to encode a B

[xxxxxxxx] -> [xBxxBxxB]

output bits to reduce the 3 values
and transition to one of the slots with a B

The decode table is trivial to make from the inverse :

decode:
 0: A -> 4 + getbits(2)
 1: B -> 4 + getbits(2)
 2: C -> 0 + getbits(2)
 3: A -> 0 + getbits(1)
 4: B -> 0 + getbits(1)
 5: C -> 4 + getbits(2)
 6: A -> 2 + getbits(1)
 7: B -> 2 + getbits(1)

Note that each symbol's decode covers the entire origin state range :

decode:
 0: A -> 4 + getbits(2)  from [4,7]
 3: A -> 0 + getbits(1)  from [0,1]
 6: A -> 2 + getbits(1)  from [2,3]

 1: B -> 4 + getbits(2)  from [4,7]
 4: B -> 0 + getbits(1)  from [0,1]
 7: B -> 2 + getbits(1)  from [2,3]

 2: C -> 0 + getbits(2)  from [0,3]
 5: C -> 4 + getbits(2)  from [4,7]

During decode we can think about our table index 't' as containing two pieces of information : one is the current symbol to output, but there's also some information about the range where t will be on the next step. That is, the current t contains some bits of the next t. The number of bits depends on where we are in the table. eg. in the example above; when t=4 we specify a B, but we also specify 2 bits worth of the next t.

Doing another example from that earlier post :


Fs={7,6,3}

ABCABABACBABACBA

encode:
A : b=1+(t>=12) : {0,3,5,7,10,12,15}
B : b=1+(t>=8) : {1,4,6,9,11,14}
C : b=2+(t>=8) : {2,8,13}

decode:
 0: A -> 12 + getbits(2)
 1: B -> 8 + getbits(2)
 2: C -> 8 + getbits(3)
 3: A -> 0 + getbits(1)
 4: B -> 12 + getbits(2)
 5: A -> 2 + getbits(1)
 6: B -> 0 + getbits(1)
 7: A -> 4 + getbits(1)
 8: C -> 0 + getbits(2)
 9: B -> 2 + getbits(1)
10: A -> 6 + getbits(1)
11: B -> 4 + getbits(1)
12: A -> 8 + getbits(1)
13: C -> 4 + getbits(2)
14: B -> 6 + getbits(1)
15: A -> 10 + getbits(1)

and this concludes our conception of tANS in terms of just an [0,t-1] table.

I'm gonna be super redundant and repeat myself some more. I think it's intriguing that we went through all this ANS entropy coder idea, scaling values by (x/P) and so on, and from that we constructed tANS code. But you can get to the exact same tANS directly from the idea of the output string!

Let's invent tANS our new way, starting from scratch.

I'm given normalized frequencies {Fs}. Sum{Fs} = L. I want a state machine with L entries. Take each symbol and scatter it into our output string in some way.

To encode each symbol, I need to map the state machine index t in [0,L-1] to one of its occurances in the output string.


There are Fs occurances in the output string

I need to map an [0,L-1] value to an [0,Fs-1] value
by outputting either b or b+1 bits

now clearly if (L/Fs) is a power of 2, then the log2 of that is just b and we always output that many bits. (eg L=16, Fs=4, we just output 2 bits). In general if (L/Fs) is not a power of 2, then

b = floor(log2(L/Fs))
b+1 = ceil(log2(L/Fs))

so we just need two sub-ranges of L such that the total adds up to Fs :

threshold T
values < T output b bits
values >= T output b+1 bits

total of both ranges after output should equal Fs :

(T>>b) + (L-T)>>(b+1) = Fs

(2T + L-T)>>(b+1) = Fs

L+T = Fs<<(b+1)

T = (Fs<<(b+1)) - L

and that's it! We've just made a tANS encoder without talking about anything related to the ANS ideas at all.

The funny thing to me is that we got the exact same condition before from "b-uniqueness". That is, in order to be able to encode symbol s from any initial state, we worked out that the only valid precursor range was Is = [Fs,2*Fs-1] . That leads us to the renormalization loop :


while x > (2*Fs-1)
  output a bit from x; x>>= 1;

And from that we computed a minimum number of output bits, and a threshold state for one more. That threshold we computed was

(max_state + 1)<<min_num_bits

= (2*Fs-1 + 1)<<b
= Fs<<(b+1)

which is the same.

2/11/2014

02-11-14 - Understanding ANS - 10 - Optimal normalized counts

Not really an ANS topic, but a piece you need for ANS so I've had a look at it.

For ANS and many other statistical coders (eg. arithmetic coding) you need to create scaled frequencies (the Fs in ANS terminology) from the true counts.

But how do you do that? I've seen many heuristics over the years that are more or less good, but I've never actually seen the right answer. How do you scale to minimize total code len? Well let's do it.

Let's state the problem :


You are given some true counts Cs

Sum{Cs} = T  the total of true counts

the true probabilities then are

Ps = Cs/T

and the ideal code lens are log2(1/Ps)

You need to create scaled frequencies Fs
such that

Sum{Fs} = M

for some given M.

and our goal is to minimize the total code len under the counts Fs.

The ideal entropy of the given counts is :

H = Sum{ Ps * log2(1/Ps) }

The code len under the counts Fs is :

L = Sum{ Ps * log2(M/Fs) }

The code len is strictly worse than the entropy

L >= H

We must also meet the constraint

if ( Cs != 0 ) then Fs > 0

That is, all symbols that exist in the set must be codeable. (note that this is not actually optimal; it's usually better to replace all rare symbols with a single escape symbol, but we won't do that here).

The naive solution is :


Fs = round( M * Ps )

if ( Cs > 0 ) Fs = MAX(Fs,1);

which is just scaling up the Ps by M. This has two problems - one is that Sum{Fs} is not actually M. The other is that just rounding the float does not actually distribute the integer counts to minimize codelen.

The usual heuristic is to do something like the above, and then apply some fix to make the sum right.

So first let's address how to fix the sum. We will always have issues with the sum being off M because of integer rounding.

What you will have is some correction :


correction = M - Sum{Fs}

that can be positive or negative. This is a count that needs to be added onto some symbols. We want to add it to the symbols that give us the most benefit to L, the total code len. Well that's simple, we just measure the affect of changing each Fs :

correction_sign = correction > 0 ? 1 : -1;

Ls_before = Ps * log2(M/Fs)
Ls_after = Ps * log2(M/(Fs + correction_sign))

Ls_delta = Ls_after - Ls_before
Ls_delta = Ps * ( log2(M/(Fs + correction_sign)) - log2(M/Fs) )
Ls_delta = Ps * log2(Fs/(Fs + correction_sign))

so we need to just find the symbol that gives us the lowest Ls_delta. This is either an improvement to total L, or the least increase in L.

We need to apply multiple corrections. We don't want a solution thats O(alphabet*correction) , since that can be 256*256 in bad cases. (correction is <= alphabet and typically in the 1-50 range for a typical 256-symbol file). The obvious solution is a heap. In pseudocode :


For all s
    push_heap( Ls_delta , s )

For correction
    s = pop_heap
    adjust Fs
    compute new Ls_delta for s
    push_heap( Ls_delta , s )

note that after we adjust the count we need to recompute Ls_delta and repush that symbol, because we might want to choose the same symbol again later.

In STL+cblib this is :


to[] = Fs
from[] = original counts

struct sort_sym
{
    int sym;
    float rank;
    sort_sym() { }
    sort_sym( int s, float r ) : sym(s) , rank(r) { }
    bool operator < (const sort_sym & rhs) const { return rank < rhs.rank; }
};

---------

    if ( correction != 0 )
    {
        //lprintfvar(correction);
        int32 correction_sign = (correction > 0) ? 1 : -1;

        vector<sort_sym> heap;
        heap.reserve(alphabet);

        for LOOP(i,alphabet)
        {
            if ( from[i] == 0 ) continue;
            ASSERT( to[i] != 0 );
            if ( to[i] > 1 || correction_sign == 1 )
            {
                double change = log( (double) to[i] / (to[i] + correction_sign) ) * from[i];
            
                heap.push_back( sort_sym(i,change) );
            }
        }
        
        std::make_heap(heap.begin(),heap.end());
        
        while( correction != 0 )
        {
            ASSERT_RELEASE( ! heap.empty() );
            std::pop_heap(heap.begin(),heap.end());
            sort_sym ss = heap.back();
            heap.pop_back();
            
            int i = ss.sym;
            ASSERT( from[i] != 0 );
            
            to[i] += correction_sign;
            correction -= correction_sign;
            ASSERT( to[i] != 0 );
        
            if ( to[i] > 1 || correction_sign == 1 )
            {
                double change = log( (double) to[i] / (to[i] + correction_sign) ) * from[i];
            
                heap.push_back( sort_sym(i,change) );
                std::push_heap(heap.begin(),heap.end());
            }               
        }
    
        ASSERT( cb::sum(to,to+alphabet) == (uint32)to_sum_desired );
    }

You may have noted that the above code is using natural log instead of log2. The difference is only a constant scaling factor, so it doesn't affect the heap order; you may use whatever log base is fastest.

Errkay. So our first attempt is to just use the naive scaling Fs = round( M * Ps ) and then fix the sum using the heap correction algorithm above.

Doing round+correct gets you 99% of the way there. I measured the difference between the total code len made that way and the optimal, and they are less than 0.001 bpb different on every file I tried. But it's still not quite right, so what is the right way?

To guide my search I had a look at the cases where round+correct was not optimal. When it's not optimal it means there is some symbol a and some symbol b such that { Fa+1 , Fb-1 } gives a better total code len than {Fa,Fb}. An example of that is :


count to inc : (1/1024) was (1866/1286152 = 0.0015)
count to dec : (380/1024) was (482110/1286152 = 0.3748)
to inc; cl before : 10.00 cl after : 9.00 , true cl : 9.43
to dec; cl before : 1.43 cl after : 1.43 , true cl : 1.42

The key point is on the 1 count :

count to inc : (1/1024) was (1866/1286152 = 0.0015)
to inc; cl before : 10.00 cl after : 9.00 , true cl : 9.43

1024*1866/1286152 = 1.485660
round(1.485660) = 1

so Fs = 1 , which is a codelen of 10

but Fs = 2 gives a codelen (9) closer to the true codelen (9.43)

And this provided the key observation : rather than rounding the scaled count, what we should be doing is either floor or ceil of the fraction, whichever gives a codelen closer to the true codelen.

BTW before you go off hacking a special case just for Fs==1, it also happens with higher counts :


count to inc : (2/1024) was (439/180084) scaled = 2.4963
to inc; cl before : 9.00 cl after : 8.42 , true cl : 8.68

count to inc : (4/1024) was (644/146557) scaled = 4.4997
to inc; cl before : 8.00 cl after : 7.68 , true cl : 7.83

though obviously the higher Fs, the less likely it is because the rounding gets closer to being perfect.

So it's easy enough just to solve exactly, simply pick the floor or ceil of the ratio depending on which makes the closer codelen :


Ps = Cs/T from the true counts

down = floor( M * Ps )
down = MAX( down,1)

Fs = either down or (down+1)

true_codelen = -log2( Ps )
down_codelen = -log2( down/M )
  up_codelen = -log2( (down+1)/M )

if ( |down_codelen - true_codelen| < |up_codelen - true_codelen| )
  Fs = down
else
  Fs = down+1

And since all we care about is the inequality, we can do some maths and simplify the expressions. I won't write out all the algebra to do the simplification because it's straightforward, but there are a few key steps :

| log(x) | = log( MAX(x,1/x) )

log(x) >= log(y)  is the same as x >= y

down <= M*Ps
down+1 >= M*Ps

the result of the simplification in code is :

from[] = original counts (Cs) , sum to T
to[] = normalized counts (Fs) , will sum to M

    double from_scaled = from[i] * M/T;

    uint32 down = (uint32)( from_scaled );
                
    to[i] = ( from_scaled*from_scaled <= down*(down+1) ) ? down : down+1;

Note that there's no special casing needed to ensure that (from_scaled < 1) gives you to[i] = 1 , we get that for free with this expression.

I was delighted when I got to this extremely simple final form.

And that is the conclusion. Use that to find the initial scaled counts. There will still be some correction that needs to be applied to reach the target sum exactly, so use the heap correction algorithm above.

As a final note, if we look at the final expression :


to[i] = ( from_scaled*from_scaled < down*(down+1) ) ? down : down+1;

to[i] = ( test < 0 ) ? down : down+1;

test = from_scaled*from_scaled - down*(down+1); 

from_scaled = down + frac

test = (down + frac)^2 - down*(down+1);

solve for frac where test = 0

frac = sqrt( down^2 + down ) - down

That gives you the fractional part of the scaled count where you should round up or down. It varies with floor(from_scaled). The actual values are :

1 : 0.414214
2 : 0.449490
3 : 0.464102
4 : 0.472136
5 : 0.477226
6 : 0.480741
7 : 0.483315
8 : 0.485281
9 : 0.486833
10 : 0.488088
11 : 0.489125
12 : 0.489996
13 : 0.490738
14 : 0.491377
15 : 0.491933
16 : 0.492423
17 : 0.492856
18 : 0.493242
19 : 0.493589

You can see as Fs gets larger, it goes to 0.5 , so just using rounding is close to correct. It's really in the very low values where it's quite far from 0.5 that errors are most likely to occur.

2/10/2014

02-10-14 - Understanding ANS - 9

If you just want to understand the basics of how ANS works, you may skip this post. I'm going to explore some unsolved issues about the sort order.

Some issues about constructing the ANS sort order are still mysterious to me. I'm going to try to attack a few points.

One thing I said wrote last time needs some clarification - "Every slot has an equal probability of 1/M."

What is true is that every character of the output string is equiprobable (assuming again that the Fs are the true probabilities). That is, if you have the string S[] with L symbols, each symbol s occurs Fs times, then you can generate symbols with the correct probability by just drawing S[i] with random i.

The output string S[] also corresponds to the destination state of the encoder in the renormalization range I = [L,2L-1]. What is not true is that all states in I are equally probable.

To explore this I did 10,000 random runs of encoding 10,000 symbols each time. I used L=1024 each time, and gathered stats from all the runs.

This is the actual frequency of the state x having each value in [1024,2047] (scaled so that the average is 1000) :

The lowest most probable states (x=1024) have roughly 2X the frequency of the high least probable states (x=2047).

Note : this data was generated using Duda's "precise initialization" (my "sort by sorting" with 0.5 bias). Different table constructions will create different utilization graphs. In particular the various heuristics will have some weird bumps. And we'll see what different bias does later on.

This is the same data with 1/X through it :

This probability distribution (1/X) can be reproduced just from doing this :


            x = x*b + irandmod(b); // for any base b
            
            while( x >= 2*K ) x >>= 1;
            
            stats_count[x-K] ++;            

though I'd still love to see an analytic proof and understand that better.

So, the first thing I should correct is : final states (the x' in I) are not equally likely.

How that should be considered in sort construction, I do not know.

The other thing I've been thinking about was why did I find that the + 1.0 bias is better in practice than the + 0.5 bias that Duda suggests ("precise initialization") ?

What the +1 bias does is push low probability symbols further towards the end of the sort order. I've been contemplating why that might be good. The answer is not that the end of the sort order makes longer codelens, because that kind of issue has already been accounted for.

My suspicion was that the +1 bias was beating the +0.5 bias because of the difference between normalized counts and unnormalized original counts.

Recall that to construct the table we had to make normalized frequences Fs that sum to L. These, however, are not the true symbol frequencies (except in synthetic tests). The true symbol frequencies had to be scaled to sum to L to make the Fs.

The largest coding error from frequency scaling is on the least probable symbols. In fact the very worst case is symbols that occur only once in a very large file. eg. in a 1 MB file a symbol occurs once; its true probability is 2^-20 and it should be coded in 20 bits. But we scale the frequencies to sum to 1024 (for example), it still must get a count of 1, so it's coded in 10 bits.

What the +1 bias does is take the least probable symbols and push them to the end of the table, which maximizes the number of bits they take to code. If the {Fs} were the true frequencies, this would be bad, and the + 0.5 bias would be better. But the {Fs} are not the true frequencies.

This raises the question - could we make the sort order from the true frequencies instead of the scaled ones? Yes, but you would then have to either transmit the true frequencies to the decoder, or transmit the sort order. Either way takes many more bits than transmitting the scaled frequencies. (in fact in the real world you may wish to transmit even approximations of the scaled frequencies). You must ensure the encoder and decoder use the same frequencies so they build the same sort order.

Anyway, I tested this hypothesis by making buffers synthetically by drawing symbols from the {Fs} random distribution. I took my large testset, for each file I counted the real histogram, made the scaled frequencies {Fs}, then regenerated the buffer from the frequencies {Fs} so that the statistics match the data exactly. I then ran tANS on the synthetic buffers and on the original file data :


synthetic data :

total bytes out : 146068969.00  bias=0.5
total bytes out : 146117818.63  bias=1.0

real data :

total bytes out : 144672103.38  bias=0.5
total bytes out : 144524757.63  bias=1.0

On the synthetic data, bias=0.5 is in fact slightly better. On the real data, bias=1.0 is slightly better. This confirms that the difference between the normalized counts & unnormalized counts is in fact the origin of 1.0's win in my previous tests, but doesn't necessarily confirm my guess for why.

An idea for an alternative to the bias=1 heuristic is you could use bias=0.5 , but instead of using the Fs for the sort order, use the estimated original count before normalization. That is, for each Fs you can have a probability model of what the original count was, and select the maximum-likelihood count from that. This is exactly analoguous to restoring to expectation rather than restoring to middle in a quantizer.

Using bias=1.0 and measuring state occurance counts, we get this :

Which mostly has the same 1/x curve, but with a funny tail at the end. Note that these graphs are generated on synthetic data.

I'm now convinced that the 0.5 bias is "right". It minimizes measured output len on synthetic data where the Fs are the true frequencies. It centers each symbol's occurances in the output string. It reproduces the 1/x distribution of state frequencies. However there is still the missing piece of how to derive it from first principles.


BTW

While I was at it, I gathered the average number of bits output when coding from each state. If you're following along with Yann's blog he's been explaining FSE in terms of this. tANS outputs bits to get the state x down into the coding range Is for the next symbol. The Is are always lower than I (L), so you have to output some bits to scale down x to reach the Is. x starts in [L,2L) and we have to output bits to reach [Fs,2Fs) ; the average number of bits required is like log2(L/Fs) which is log2(1/P) which is the code length we want. Because our range is [L,2L) we know the average output bit count from each state must differ by 1 from the top of the range to the bottom. In fact it looks like this :

Another way to think about it is that at state=L , the state is empty. As state increases, it is holding some fractional bits of information in the state variable. That number of fraction bits goes from 0 at L up to 1 at 2L.


Ryg just pointed me at a proof of the 1/x distribution in Moffat's "Arithmetic Coding Revisited" (DCC98).

The "x" in ANS has the same properties as the range ("R") in an arithmetic coder.

The bits of information in x is I ~= log( x )

I is in [0,1] and is a uniform random value, Pr(I) ~= 1

if log(x) has Pr ~= 1 , then Pr(x) must be ~= 1/x

The fact that I is uniform is maybe not entirely obvious; Moffat just hand-waves about it. Basically you're accumulating a random variable into I ( -log2(P_sym) ) and then dropping the integer part; the result is a fractional part that's random and uniform.

2/06/2014

02-06-14 - Understanding ANS - 8

Time to address an issue that we've skirted for some time - how do you make the output string sort order?

Recall : The output string contains Fs occurances of each symbol. For naive rANS the output string is just in alphabetical order (eg. "AAABBBCCC"). With tANS we can use any permutation of that string.

So what permutation should we use? Well, the output string determines the C() and D() encode and decode tables. It is in fact the only degree of freedom in table construction (assuming the same constraints as last time, b=2 and L=M). So we should choose the output string to minimize code length.

The guiding principle will be (x/P). That is, we achieve minimum total length when we make each code length as close to log2(1/P) as possible. We do that by making the input state to output state ratio (x'/x) as close to (1/P) as possible.

(note for the record : if you try to really solve to minimize the error, it should not just be a distance between (x'/x) and (1/P) , it needs to be log-scaled to make it a "minimum rate" solution). (open question : is there an exact solution for table building that finds the minimum rate table that isn't NP (eg. not just trying all permutations)).

Now we know that the source state always come from the precursor ranges Is, and we know that


destination range :
I = [ M , 2*M - 1]

source range :
Is = [ Fs, 2*Fs - 1 ] for each symbol s

and Ps = Fs/M

so the ideal target for the symbols in each source range is :

target in I = (1/Ps) * (Is) = (M/Fs) * [ Fs, 2*Fs - 1 ] = 

and taking off the +M bias to make it a string index in the range [0,M-1] :

Ts = target in string = target in I - M

Ts = { 0 , M * 1/Fs , M * 2/Fs) , ... }

Essentially, we just need to take each symbol and spread its Fs occurances evenly over the output string.

Now there's a step that I don't know how to justify without waving my hands a bit. It works slightly better if we imagine that the source x was not just an integer, but rather a bucket that covers the unit range of that integer. That is, rather that starting exactly at the value "x = Fs" you start in the range [Fs,Fs+1]. So instead of just mapping up that integer by 1/P we map up the range, and we can assign a target anywhere in that range. In the paper Duda uses a bias of 0.5 for "precise initialization" , which corresponds to assuming the x's start in the middle of their integer buckets. That is :


Ts = { M * (b/Fs), M* (1+b)/Fs, M * (2+b)/Fs , ... }

with b = 0.5 for Duda's "precise initialization". Obviously b = 0.5 makes T centered on the range [0,M] , but I see no reason why that should be preferred.

Now assuming we have these known target locations, you can't just put all the symbols into the target slots that they want, because lots of symbols want the same spot.

For example :


M=8
Fs={3,3,2}

T_A = { 8 * 0.5/3 , 8 * 1.5 / 3 , 8 * 2.5 / 3 } = { 1 1/3 , 4 , 6 2/3 }
T_B = T_A
T_C = { 8 * 0.5/2 , 8 * 1.5/2 } = { 2 , 6 }

One way to solve this problem is to start assigning slots, and when you see that one is full you just look in the neighbor slot, etc. So you might do something like :

initial string is empty :

string = "        "

put A's in 1,4,6

string = " A  A A "

put B's in 1,4,6 ; oops they're taken, shift up one to find empty slots :

string = " AB ABAB"

put C's in 2,6 ; oops they're taken, hunt around to find empty slots :

string = "CABCABAB"

now obviously you could try to improve this kind of algorithm, but there's no point. It's greedy so it makes mistakes in the overall optimization problem (it's highly dependant on order). It can also be slow because it spends a lot of time hunting for empty slots; you'd have to write a fast slot allocator to avoid degenerate bad cases. There are other ways.

Another thing I should note is that when doing these target slot assignments, there's no reason to prefer the most probable symbol first, or the least probable or whatever. The reason is every symbol occurance is equally probable. That is, symbol s has frequency Fs, but there are Fs slots for symbol s, so each slot has a frequency of 1. Every slot has an equal probability of 1/M.

An alternative algorithm that I have found to work well is to sort the targets. That is :


make a sorting_array of size M

add { Ts, s } to sorting_array for each symbol  (that's Fs items added)

sort sorting_array by target location

the symbols in sorting_array are in output string order

I believe that this is identical to Duda's "precise initialization" which he describes using push/pop operations on a heap; the result is the same - assigning slots in the order of desired target location.

Using the sort like this is a little weird. We are no longer explicitly trying to put the symbols in their target slots. But the targets (Ts) span the range [0, M] and the sort is an array of size M, so they wind up distributed over that range. In practice it works well, and it's fast because sorting is fast.

A few small notes :

You want to use a "stable" sort, or bias the target by some small number based on the symbol. The reason is you will have lots of ties, and you want the ties broken consistently. eg. for "AABBCC" you want "ABCABC" or "CBACBA" but not "ABCCAB". One way to get a stable sort is to make the sorting_array work on U32's, and pack the sort rank into the top 24 bits and the symbol id into the bottom 8 bits.

The bias = 0.5 that Duda uses is not strongly justified, so I tried some other numbers. bias = 0 is much worse. It turns out that bias = 1.0 is better. I tried a bunch of values on a large test set and found that bias = 1 is consistently good.

One very simple way to get a decent sort is to bit-reverse the rANS indexes. That is, start from a rANS/alphabetical order string ("AABB..") and take the index of each element, bit-reverse that index (so 0001 -> 1000) , and put the symbol in the bit reversed slot. While this is not competitive with the proper sort, it is simple and one pass.

Another possible heuristic is to just scatter the symbols by doing steps that are prime with M. This is what Yann does in fse.c


All the files in Calgary Corpus :
(compression per file; sum of output sizes)

M = 1024

rANS/alpahabetical : 1824053.75

bit reverse : 1805230.75

greedy search for empty slots : 1801351

Yann's heuristic in fse.c : 1805503.13

sort , bias = 0.0 : 1817269.88

sort , bias = 0.5 : 1803676.38  (Duda "precise")

sort , bias = 1.0 : 1798930.75

Before anyone throws a fit - yes, I tested on my very large test set, not just calgary. The results were consistent on all the test sets I tried. I also tested with larger M (4096) and the results were again the same, though the differences are smaller the larger you make M.

For completeness, here is what the sorts actually do :


rANS/alphabetical : AAAAAAABBBBBBCCC

bit reverse :   ABABABACABACABBC

greedy search : CABABACABABACABB

greedy search, LPS first :  ABCABAACBABACBAB

Yann fse :          AAABBCAABBCAABBC

sort , bias = 0.0 : ABCABABCABABCABA

sort , bias = 0.5 : ABCABABACBABACBA

sort , bias = 1.0 : ABABCABABCABAABC

but I caution against judging sorts by whether they "look good" since that criteria does not seem to match coding performance.

Finally for clarity, here's the code for the simpler sorts :


void make_sort(int * sorted_syms, int sort_count, const uint32 * normalized_counts, int alphabet)
{
    ASSERT( (int) cb::sum(normalized_counts,normalized_counts+alphabet) == sort_count );
    
    const int fse_step = (sort_count>>1) + (sort_count>>3) + 1;
    
    int fse_pos = 0;
    int s = 0;
    for LOOP(a,alphabet)
    {
        int count = normalized_counts[a];

        for LOOP(c,count)
        {
            // choose one :

            // rANS :
            sorted_syms[s] = a;

            // fse :
            sorted_syms[fse_pos] = a;
            fse_pos = (fse_pos + step) % sort_count;

            // bitreverse :
            sorted_syms[ bitreverse(s, numbits(sort_count)) ] = a;

            s++;
        }
    }
}

and the code for the actual sorting sort (recommended) :

struct sort_sym
{
    int sym;
    float rank;
    bool operator < (const sort_sym & rhs) const
    {
        return rank < rhs.rank;
    }
};

void make_sort(int * sorted_syms, int sort_count, const uint32 * normalized_counts, int alphabet)
{
    ASSERT( (int) cb::sum(normalized_counts,normalized_counts+alphabet) == sort_count );

    vector<sort_sym> sort_syms;
    sort_syms.resize(sort_count);

    int s = 0;

    for LOOP(sym,alphabet)
    {
        uint32 count = normalized_counts[sym];
        if ( count == 0 ) continue;
        
        float invp = 1.f / count;
        
        float base =  1.f * invp; // 0.5f for Duda precise

        for LOOP(c,(int)count)
        {
            sort_syms[s].sym = sym;
            sort_syms[s].rank = base + c * invp;
            s++;
        }
    }
    
    ASSERT_RELEASE( s == sort_count );
    
    std::stable_sort(sort_syms.begin(),sort_syms.end());
    
    for LOOP(s,sort_count)
    {
        sorted_syms[s] = sort_syms[s].sym;
    }
}

and for the greedy search :

void make_sort(int * sorted_syms, int sort_count, const uint32 * normalized_counts, int alphabet)
{
    ASSERT( (int) cb::sum(normalized_counts,normalized_counts+alphabet) == sort_count );

    // make all slots empty :
    for LOOP(s,sort_count)
    {
        sorted_syms[s] = -1;
    }
    
    for LOOP(a,alphabet)
    {
        uint32 count = normalized_counts[a];
        if ( count == 0 ) continue;
        
        uint32 step = (sort_count + (count/2) ) / count;
        uint32 first = step/2;
        
        for LOOP(c,(int)count)
        {
            uint32 slot = first + step * c;
            
            // find an empty slot :
            for(;;)
            {
                if ( sorted_syms[slot] == -1 )
                {
                    sorted_syms[slot] = a;
                    break;
                }
                slot = (slot + 1)%sort_count;
            }
        }
    }
}

small note : the reported results use a greedy search that searches away from slot using +1,-1,+2,-2 , instead of the simpler +1,+2 in this code snippet. This simpler version is very slightly worse.

2/05/2014

02-05-14 - Understanding ANS - 7

And we're ready to cover table-based ANS (or "tANS") now.

I'm going to be quite concrete and consider a specific choice of implementation, rather than leaving everything variable. But extrapolation to the general solution is straightforward.

You have integer symbol frequences Fs. They sum to M. The cumulative frequencies are Bs.

I will stream the state x in bits. I will use the smallest possible renormalization range for this example , I = [ M , 2*M - 1]. You can always use any integer multiple of M that you want (k*M, any k), which will give you more coding resolution (closer to entropy). This is equivalent to scaling up all the F's by a constant factor, so it doesn't change the construction here.

Okay. We will encode/decode symbols using this procedure :


ENCODE                      DECODE

|                           ^
V                           |

stream out                  stream in

|                           ^
V                           |

C(s,x) coding function      D(x) decoding function

|                           ^
V                           |

x'                          x'

We need tables for C() and D(). The constraints are :

D(x') = { x , s }  outputs a state and a symbol

D(x) must be given for x in I = [ M , 2*M - 1 ]

D(x) in I must output each symbol s Fs times

that is, D(x in I) must be an output string made from a permutation of "AA..BB.." , each symbol Fs times

D( C( s, x ) ) = { x , s }  decode must invert coding

C(s,x) = x'  outputs the following state

C(s,x) must be given for x' in I
 that's x in Is

The precursor ranges Is = { x : C(s,x) is in I }
must exist and be of the form Is = [ k , 2k-1 ] for some k

Now, if we combine the precursor range requirement and the invertability we can see :

D(x in I) outputs each s Fs times

C(s,x) with x' in I must input each s Fs times

the size of Is must be Fs

the precursor ranges must be Is = [ Fs, 2*Fs - 1 ]

C(s,x) must given in M slots

And I believe that's it; those are the necessary and sufficient conditions to make a valid tANS system. I'll go over some more points and fill in some details.

Here's an example of the constraint for an alphabet of "ABC" and M = 8 :

Now, what do you put in the shaded area? You just fill in the output states from 8-15. The order you fill them in corresponds to the output string. In this case the output string must be some permutation of "AAABBBCC".

Here's one way : (and in true Duda style I have confusingly used different notation in the image, since I drew this a long time ago before I started this blog series. yay!)

In the image above I have also given the corresponding output string and the decode table. If you're following along in Duda's paper arxiv 1311.2540v2 this is figure 9 on page 18. What you see in figure 9 is a decode table. The "step" part of figure 9 is showing one method of making the sort string. The shaded bars on the right are showing various permutations of an output string, with a shading for each symbol.

Before I understood ANS I was trying tables like this :


M=16
Fs = {7,6,3}

 S |  0|  1|  2
---|---|---|---
  1|  2|  3|  4
  2|  5|  6| 10
  3|  7|  8| 15
  4|  9| 11| 20
  5| 12| 13| 26
  6| 14| 16| 31
  7| 17| 19|   
  8| 18| 22|   
  9| 21| 24|   
 10| 23| 27|   
 11| 25| 29|   
 12| 28|   |   
 13| 30|   |   

This table does not work. If you're in state x = 7 and you want to encode symbol 2, you need to stream out bits to get into the precursor range I2. So you stream out from x=7 and get to x=3. Now you look in the table and you are going to state 15 - that's not in the range I=[16,31]. No good!

A correct table for those frequencies is :


 S |  0|  1|  2
---|---|---|---
  3|   |   | 18
  4|   |   | 24
  5|   |   | 29
  6|   | 17|   
  7| 16| 20|   
  8| 19| 22|   
  9| 21| 25|   
 10| 23| 27|   
 11| 26| 31|   
 12| 28|   |   
 13| 30|   |   

Building the decode table from the encode table is trivial.

Note that the decode table D(x) only has to be defined for x in I - that's M entries.

C(x,s) also only has M entries. If you made it naively as a 2d array, it would be |alphabet|*M . eg. something like (256*4096) slots, but most of it would be empty. Of course you don't want to do that.

The key observation is that C(x,s) is only defined over consecutive ranges of x for each s. In fact it's defined over [Fs, 2*Fs-1]. So, we can just pack these ranges together. The starting point in the packed array is just Bs - the cumulative frequency of each symbol. That is :


PC = packed coding table
PC has M entries

C(x,s) = PC[ Bs + (x - Fs) ]


eg. for the {3,3,2} table shown in the image above :

PC = { 8,11,14, 9,12,15, 10,13 }

this allows you to store the coding table also in an array of size M.

There are a few topics on tANS left to cover but I'll leave them for the next post.

2/04/2014

02-04-14 - Understanding ANS - 6

Okay, let's get to streaming.

For illustration let's go back to the simple example of packing arbitrary base numbers into an integer :


// encode : put val into state
void encode(int & state, int val, int mod)
{
    ASSERT( val >= 0 && val < mod );
    state = state*mod + val;
}

// decode : remove a value from state and return it
int decode(int & state, int mod )
{
    int val = state % mod;
    state /= mod;
    return val;
}

as you encode, state grows, and eventually gets too big to fit in an integer. So we need to flush out some bits (or bytes).

But we can't just stream out bits. The problem is that the decoder does a modulo to get the next value. If we stream in and out high bits, that's equivalent to doing something like +65536 on the value. When you do a mod-3 (or whatever) on that, you have changed what you decode.

If you only ever did mod-pow2's, you could stream bits out of the top at any time, because the decoding of the low bits is not affected by the high bits. This is how the Huffman special case of ANS works. With Huffman coding you can stream in and out any bits that are above the current symbol, because they don't affect the mask at the bottom.

In general we want to stream bits (base 2) or bytes (base 256). To do ANS in general we need to mod and multiply by arbitrary values that are not factors of 2 or 256.

To ensure that we get decodability, we have to stream such that the decoder sees the exact value that the encoder made. That is :


ENCODE                      DECODE

|                           ^
V                           |

stream out                  stream in

|                           ^
V                           |

C(s,x) coding function      D(x) decoding function

|                           ^
V                           |

x'                          x'

The key thing is that the value of x' that C(s,x) makes is exactly the same that goes into D(x).

This is different from Huffman, as noted above. It's also different than arithmetic coding, which can have an encoder and decoder that are out of sync. An arithmetic decoder only uses the top bits, so you can have more or less of the rest of the stream in the low bits. While the basic ANS step (x/P + C) is a kind of arithmetic coding step, the funny trick we did to take some bits of x and mod it back down to the low bits (see earlier posts) means that ANS is *not* making a continuous arithmetic stream for the whole message that you can jump into anywhere.

Now it's possible there are multiple streaming methods that work. For example with M = a power of 2 in rANS you might be able to stream high bytes. I'm not sure, and I'm not going to talk about that in general. I'm just going to talk about one method of streaming that does work, which Duda describes.

To ensure that our encode & decode streaming produce the same value of x', we need a range to keep it in. If you're streaming in base b, this range is of the form [L, b*L-1] . So, I'll use Duda's terminology and call "I" the range we want x' to be in for decoding, that is


I = [L, b*L-1]

Decoder streams into x :

x <- x*b + get_base_b();

until x is in I

but the encoder must do something a bit funny :

stream out from x

x' = C(s,x)  , coding function

x' now in I

that is, the stream out must be done before the coding function, and you must wind up in the streaming range after the coding function. x' in the range I ensures that the encoder and decoder see exactly the same value (because any more streaming ops would take it out of I).

To do this, we must know the "precursor" ranges for C(). That is :


Is = { x : C(s,x) is in I }

that is, the values of x such that after coding with x' = C(s,x), x' is in I

these precursor ranges depend on s. So the encoder streaming is :

I'm about to encode symbol s

stream out bits from x :

put_base_b( x % b )
x <- x/b

until x is in Is

so we get into the precursor range, and then after the coding step we are in I.

Now this is actually a constraint on the coding function C (because it determines what the Is are). You must be able to encode any symbol from any state. That means you must be able to reach the Is precursor range for any symbol from any x in the output range I. For that to be true, the Is must span a power of b, just like "I" does. That is,


all Is must be of the form

Is = [ K, b*K - 1 ]

for some K

eg. to be concrete, if b = 2, we're streaming out bits, then Is = { 3,4,5 } is okay, you will be able to get there from any larger x by streaming out bits, but Is = {4,5,6} is not okay.


I = [8, 15]

Is = {4,5,6}

x = 14

x is out of Is, so stream out a bit ; 14 -> 7

x is out of Is, so stream out a bit ; 7 -> 3

x is below Is!  crap!

this constraint will be our primary guide in building the table-based version of ANS.

old rants