3/03/2014

03-03-14 - Windows Links

I wrote a deduper in Oodle today. I was considering making the default action be to replace duplicate files with a link to the original.

I wasn't sure whether to use "hard" or "soft" links, so I did a little research.

In Windows a "hard link" means that multiple file names all point at the same file data. It's a bit of a misnomer, it's not really a "link" to the original. There is no "original" or base file - all instances of a hard link are equal peers.

A "soft link" is just an OS-level shortcut. There is an original base file, and the soft links point at it.

Both are ridiculously broken concepts and IMO should almost never be used.

With "hard links" the problem is that if you accidentally edit any of the links, you have editted the master data. If you did not intend that, you may have fucked up something severely.

Hard links are reasonable *if* the files linked are read-only (and somehow actually kept read-only, not just attrib'ed away).

The problem with "soft links" is that the links are not protected; if you rename or move or delete the original file, all the links are broken, and again you may have just severely fucked something up.

The big problem is that you get no warning in either case. Clearly what you want when you try to rename a file which has a bunch of soft links pointing at it is some kind of dialog that says "hey all these links point at this file, do you really want to rename it and break the links? or should I update the links to point at the new name?". Similarly with hard links, obviously what you want is some prompt like "hey if you modify this, so you want these hard links to see the new version or the old version?".

Now obviously you can't solve this problem in general without user prompts. But I believe that a refcounted copy-on-write link would have been a much more robust and safe solution. Open for write should have done a COW by default unless you passed a special flag to indicate you intend to edit shared data.

Even ignoring the fundamental logic of how links work, there are some very bad practical issues for links in windows.

1. Soft links show a file size of 0 in the dir enumeration file info. This breaks the assumption that most apps make that the file size they get from the dir enumeration will be the same as the file size they get if they open that file handle and ask for its size. It can also screw up enumerations that are just trying to skip zero-size files.

Hard link file sizes are out of date. If the file data is modified, only the directory entry for the one that was used to modify the data is updated. All other links still have the old file sizes, mod times, etc.

2. Hard links break the assumption that saving to a file is the same as saving to a temp and then renaming onto the file. Many apps may or may not use the "write to temp then rename" pattern; what you get is massively different results in a very unexpected way.

3. Mod times are hosed. In general attributes are hosed; neither type of link reflects the attributes of the actual file data in the link - until they are opened, then they get updated. Mod times are particularly bad because many apps use them to detect changes, and with links the file data can be changed but the mod time won't reflect it.

Dear lord. So non-robust.

2/25/2014

02-25-14 - WiFi

So our WiFi stopped working recently, and I discovered a few things which I will now write down.

First of all, WiFi is fucked. 2.4 GHz is way overcrowded and just keeps getting more crowded. Lots of fucking routers now are offering increased bandwidth by using multiple channels simultaneously, etc. etc. It's one big interference fest.

The first issue I found was baby monitors. Baby monitors, like many wireless devices, are also in the 2.4 GHz band and just crap all over your wifi. Turning them off helped our signal a bit, but we were still getting constant problems.

Next issue is interference from neighbors'ses wifises. This is what inSSIDer looks like at my house :

We are *way* away from any neighbors, at least 50 feet in every direction, and we still get this amount of shit from them. Each of my cock-ass-fuck neighbors seems to have four or five wifi networks. Good job guys, way to fix your signal strength issues by just piling more shit in the spectrum.

What you can't see from the static image is that lots of the fucking neighbor wifis are not locked to a specific channel, many of them are constantly jumping around trying to find a clear channel, which just makes them crud them all up.

(I'd love to get some kind of super industrial strength wifi for my house and just crank it up to infinity and put it on every channel so that nobody for a mile around gets any wifi)

I've long had our WiFi on channel 8 because it looked like the most clear spot to be. Well, it turns out that was a classic newb mistake. Apparently it's worse to be slightly offset from a busy channel than it is to be right on it. When you're offset, you get signal leakage from the other channel that just looks like noise; being on the channel you're fighting with other people, but at least you are seing their data as real data that you can ignore. Anyway, switching our network to channel 11 fixed it.

It looks like in practice channel 6 and 11 are the only usable ones in noisy environments (eg. everywhere).

The new 802.11ac on 5 GHz should be a nice clean way to go for a few years until it too gets crudded up.

02-25-14 - ANS Applications

Some rambling on where I see ANS being useful.

In brief - anywhere you used Huffman in the past, you can use ANS instead.

ANS (or ABS) are not very useful for high end compressors. They are awkward for adaptive modeling. Even if all you care about is decode speed (so you don't mind the buffering up the models to make the encode work backwards) it's just not a big win over arithmetic coding. Things like PAQ/CM , LZMA, H264, all the high compression cases that use adaptive context models, there's no real win from ANS/ABS.

Some specific cases where I see ANS being a good win :

JPEG-ANS obviously. Won't be competitive with sophisticated coders like "packjpg" but will at least fix the cliff at low bit rate caused by Huffman coding.

JPEGNEXT-ANS. I've been thinking for a long time about writing a "JPEGNEXT". Back end coefficient classification; coefficients in each group sent by value with ANS. Front end 32x32 macroblocks with various DCT sizes. Try to keep it as simple as possible but be up to modern quality levels.

LZ-ANS. An "LZX class" (which means large window, "repeat match", entropy resets) LZ with ANS back end should be solid. Won't be quite LZMA compression levels, but way way faster.

Lossless image DPCM. ANS on prediction residual values is a clear killer. Should have a few statistics groups with block classification. No need to transmit the ANS counts if you use a parametric model ala TMW etc. Should be easy, fast, and good.

blocksort-ANS. Should replace bzip. Fast to decode.

MPEG-class video coders. Fast wavelet coders. Anywhere you are using minimal context modelling (only a few contexts) and are sending values by their magnitude (not bit plane coders).

Other?

2/18/2014

02-18-14 - ans_fast implementation notes

Some notes about the ans_fast I posted earlier .

ans_fast contains a tANS (table-ANS) implementation and a rANS (range-ANS) implementation.

First, the benchmarking. You can compare to the more naive implementations I posted earlier . However, do not compare this tANS impl to Huffman or arithmetic and conclude "ANS is faster" because the tANS impl here has had rather more loving than those. Most of the tricks used on "ans_fast" can be equally used for other algorithms (though not all).

Here L=4096 to match the 12-bits used in the previous test. This is x64 on my lappy (1.7 Ghz Core i7 with turbo disabled). Compressed sizes do not include sending the counts. Time "withtable" includes all table construction but not histogramming or count normalization (that affects encoder only). ("fse" and "huf" on the previous page included table transmission and histogramming time)


book1

tANS 768771 -> 435252.75

ticks to encode: 4.64 decode: 3.39
mbps encode: 372.92 decode: 509.63

withtable ticks to encode: 4.69 decode: 3.44
withtable mbps encode: 368.65 decode: 501.95

rANS 768771 -> 435980 bytes (v2)

ticks to encode: 6.97 decode: 5.06
mbps encode: 248.02 decode: 341.63

withtable ticks to encode: 6.97 decode: 5.07
withtable mbps encode: 247.92 decode: 341.27

pic

tANS 513216 -> 78856.88

ticks to encode: 4.53 decode: 3.47
mbps encode: 382.02 decode: 497.75

withtable ticks to encode: 4.62 decode: 3.56
withtable mbps encode: 374.45 decode: 485.40

rANS 513216 -> 79480 bytes (v2)

ticks to encode: 5.62 decode: 3.53
mbps encode: 307.78 decode: 490.32

withtable ticks to encode: 5.63 decode: 3.54
withtable mbps encode: 307.26 decode: 488.88

First a note on file sizes : rANS file sizes are a few bytes larger than the "rans 12" posted last time. That's because that was a 32-bit impl. The rANS here is 64-bit and dual-state so I have to flush 16 bytes instead of 4. There are ways to recover some of those bytes.

The tANS file sizes here are smaller than comparable coders. The win comes from the improvements to normalizing counts and making the sort order. In fact, the +1 bias heuristic lets me beat "arith 12" and "rans 12" from the last post, which were coding nearly perfectly to the expected codelen of the normalized counts.

If you run "ans_learning" you will often see that the written bits are less than the predicted codelen :

H = 1.210176
CL = 1.238785
wrote 1.229845 bpb
this is because the +1 bias heuristic lets the codelens match the data better than the normalized counts do.

Okay, so on to the speed.

The biggest thing is that the above reported speeds are for 2x interleaved coders. That is, two independent states encoding the single buffer to a single compressed stream. I believe ryg will talk about this more soon. You can read his paper on arxiv now. Note that this is not just unrolling. Because the states are independent they allow independent execution chains to be in flight at the same time.

The speedup from interleaving is pretty huge (around 1.4X) :


book1

rANS non-interleaved (v1)

ticks to encode: 26.84 decode: 7.33
mbps encode: 64.41 decode: 235.97

withtable ticks to encode: 26.85 decode: 7.38
withtable mbps encode: 64.41 decode: 234.19

rANS 2x interleaved (v1)

ticks to encode: 17.15 decode: 5.16
mbps encode: 100.84 decode: 334.95

withtable ticks to encode: 17.15 decode: 5.22
withtable mbps encode: 100.83 decode: 331.31


tANS non-interleaved

ticks to encode: 6.43 decode: 4.68
mbps encode: 269.10 decode: 369.44

withtable ticks to encode: 6.48 decode: 4.73
withtable mbps encode: 266.86 decode: 365.39

tANS 2x interleaved

ticks to encode: 4.64 decode: 3.39
mbps encode: 372.92 decode: 509.63

withtable ticks to encode: 4.69 decode: 3.44
withtable mbps encode: 368.65 decode: 501.95

But even non-interleaved it's fast. (note that interleaved tANS is using only a single shared bit buffer). The rest of the implementation discussion will use the non-interleaved versions for simplicity.

The tANS implementation is pretty straightforward.

Decoding one symbol is :


    struct decode_entry { uint16 next_state; uint8 num_bits; uint8 sym; };

    decode_entry * detable = table - L;

    #define DECODE_ONE() do { \
        de = detable + state; \
        nb = de->num_bits; \
        state = de->next_state; \
        BITIN_OR(bitin_bits,bitin_numbits,nb,state); \
        *toptr++ = (uint8) de->sym; \
    } while(0)

where BITIN_OR reads "nb" bits and ors them onto state.

With a 64-bit bit buffer, I can ensure >= 56 bits are in the buffer. That means with L up to 14 bits, I can do four decodes before checking for more bits needed. So the primary decode loop is :


        // I know >= 56 bits are available  
        // each decode consumes <= 14 bits

        DECODE_ONE();
        DECODE_ONE();
        DECODE_ONE();
        DECODE_ONE();
            
        BITIN_REFILL(bitin_bits,bitin_numbits,bitin_ptr);
        // now >= 56 bits again

The fastest way I could find to do the bit IO was "big endian style". That's the next bits at the top of the word. Bits in the word are in order of bits in the file. This lets you unconditionally grab the next 8 bytes to refill, but requires a bswap (on little endian platforms). eg :

#define BITIN_REFILL(bitin_bits,bitin_numbits,bitin_ptr) do { \
        ASSERT( bitin_numbits > 0 && bitin_numbits <= 64 ); \
        int64 bytesToGet = (64 - bitin_numbits)>>3; \
        uint64 next8 = _byteswap_uint64( *( (uint64 *)bitin_ptr ) ); \
        bitin_ptr += bytesToGet; \
        bitin_bits |= (next8 >> 1) >> (bitin_numbits-1); \
        bitin_numbits += bytesToGet<<3; \
        ASSERT( bitin_numbits >= 56 && bitin_numbits <= 64 ); \
    } while(0)

The other nice thing about the bits-at-top style is that the encoder can put bits in the word without any masking. The encoder is :

    #define ENCODE_ONE() do { \
        sym = *bufptr--; ee = eetable+sym; \
        msnb = ee->max_state_numbits; \
        msnb += ( state >= ee->max_state_thresh ); \
        BITOUT_PUT(bout_bits,bout_numbits, state,msnb); \
        state = ee->packed_table_ptr[ state>>msnb ]; \
        } while(0)

    #define BITOUT_PUT(bout_bits,bout_numbits,val,nb) do { \
        ASSERT( (bout_numbits+nb) <= 64 ); \
        bout_bits >>= nb; \
        bout_bits |= ((uint64)val) << (64 - nb); \
        bout_numbits += nb; \
    } while(0)

the key interesting part being that the encoder just does BITOUT_PUT with "state", and by shifting it up to the top of the word for the bitio, it gets automatically masked. (and rotate-right is a way to make that even faster).

Similarly to the decoder, the encoder can write 4 symbols before it has to check if the bit buffer needs any output.

The other crucial thing for fast tANS is the sort order construction. I do a real sort, using a radix sort. I do the first step of radix sorting (generating a histogram), and then I directly build the tables from that, reading out of the radix histogram. There's no need to explicitly generate the sorted symbol list as an intermediate step. I use only an 8-bit radix here (256 entries) but it's not significantly different (in speed or compression) than using a larger radix table.

The rANS implementation is pretty straightforward and I didn't spend much time on it, so it could probably be faster (particularly encoding which I didn't spend any time on (ADDENDUM : v2 rANS now sped up and encoder uses fast reciprocals)). I use a 64-bit state with 32-bit renormalization. The basic decode operation is :


        uint64 xm = x & mask;   
        const rans_decode_table::entry & e = table[xm];
            
        x = e.freq * (x >> cumprobtot_bits) + e.xm_minus_low;
    
        buffer[i] = (uint8) e.sym;
        
        if ( x < min_x )
        {
            x <<= 32;
            x |= *((uint32 *)comp_ptr);
            comp_ptr += 4;
        }

One thing I should note is that my rANS decode table is 2x bigger than the tANS decode table. I found it was fastest to use an 8-byte decode entry for rANS :

    // 8-byte decode entry
    struct entry { uint16 freq; uint16 xm_minus_low; uint8 sym; uint16 pad; };
obviously you can pack that a lot smaller (32 bits from 12+12+8) but it hurts speed.

For both tANS and rANS I make the encoder write backwards and the decoder read forwards to bias in favor of decoder speed. I make "L" a variable, not a constant, which hurts speed a little.

02-18-14 - Understanding ANS - Conclusion

I think we can finally say that we understand ANS pretty well, so this series will end. I may cover some more ANS topics but they won't be "Understanding ANS".

Here is the index of all posts on this topic :

cbloom rants 1-30-14 - Understanding ANS - 1
cbloom rants 1-31-14 - Understanding ANS - 2
cbloom rants 02-01-14 - Understanding ANS - 3
cbloom rants 02-02-14 - Understanding ANS - 4
cbloom rants 02-03-14 - Understanding ANS - 5
cbloom rants 02-04-14 - Understanding ANS - 6
cbloom rants 02-05-14 - Understanding ANS - 7
cbloom rants 02-06-14 - Understanding ANS - 8
cbloom rants 02-10-14 - Understanding ANS - 9
cbloom rants 02-11-14 - Understanding ANS - 10
cbloom rants 02-14-14 - Understanding ANS - 11
cbloom rants 02-18-14 - Understanding ANS - 12

And here is some source code for my ANS implementation : (v2 02/21/2014)

ans_learning.cpp
ans_fast.cpp
ans.zip - contains ans_fast and ans_learning
cblib.zip is required to build my code

My home code is MSVC 2005/2008. Port if you like. Email me if you need help.

NOTE : this release is not a library you just download and use. It is intended as documentation of research. If you want some ANS code you can just use off the shelf, go get FiniteStateEntropy . You may also want ryg_rans .

I think I'll do a followup post with the performance of ans_fast and some optimization notes so it doesn't crud up this index post. Please put implementation speed discussion in that followup post .

02-18-14 - Understanding ANS - 12

A little note about sorts and tables.

AAAAAAABBBBBBBBBBBBBCCCCCCCCCCCD

What's wrong with that sort?

(That's the naive rANS sort order; it's just a "cum2sym" table. It's each symbol Fs times in consecutive blocks. It has M=32 entries. M = sum{Fs} , L = coding precision)

(here I'm talking about a tANS implementation with L=M ; the larger (L/M) is, the more you preserve the information in the state x)

Think about what the state variable "x" does as you are coding. In the renormalization range it's in [32,63]. Its position in that range is a slider for the number of fraction bits it contains. At the bottom of the range, log2(x) is 5, at the top log2(x) is 6.

Any time you want to encode a "D" you must go back to a singleton precursor state, Is = [1]. That means you have to output all the bits in x, so all fractional bits are thrown away. All information about where you were in that I range is gone. Then from that singleton Is range you jump to the end of the I range.

(if Fs=2 , then you quantize the fractional bits up to 0.5 ; is Fs=3, you quantize to 1/3 of a bit, etc.)

Obviously the actual codelen for a "D" is longer than that for an "A". But so is the codelen for a "C", and the codelen for "A" is too short. Another way to think of it is that you're taking an initial state x that spans the whole interval [32,63] and thus has variable fractional bits, and you're mapping it into only a portion of the interval.

In order to preserve the fractional bit state size, you want to map from the whole interval back to the whole interval. In the most extreme case, something like :

ACABACACABACABAD

(M=16) , when you encode an A you go from [16,31] to [8,15] and then back the A's in that string. The net result is that state just lost its bottom bit. That is, x &= ~1. You still have the full range of possible fractional bits from [0,1] , you just lost the bottom bit of precision.

I was thinking about this because I was making some weird alternative tANS tables. In fact I suppose not actually ANS tables, but more general coding tables.

For background, you can make one of the heuristic tANS tables thusly :


shuffle(s) = some permutation function
shuffle is one-to-one over the range [0,L-1]
such as Yann's stepping prime-mod-L
or bit reverse

make_tans_shuffle()
{
    int next_state[256];    
    uint8 permutation[MAX_L];
    
    // make permutation :
    uint32 cumulative_count = 0;    
    for LOOP(sym,alphabet)
    {
        uint32 count = normalized_counts[sym];
        if ( count == 0 ) continue;
        
        next_state[sym] = count;
        
        for LOOP(c,(int)count)
        {
            uint32 index = shuffle(cumulative_count);
            cumulative_count++;
            
            permutation[index] = (uint8)sym;
        }
    }
    ASSERT( cumulative_count == (uint32)L );

    // permutation is now our "output string"   

    for LOOP(i,L) // iterating over destination state
    {
        int sym = permutation[i];
        
        // step through states for this symbol
        int from_state = next_state[sym];
        next_state[sym] ++;
                
        int to_state = L + i;
                    
        encode_packed_table_ptr[sym][from_state] = to_state;
    }
}

which is all well and good. But I started thinking - can I eliminate the intermediate permutation[] table entirely? Well, yes. There are a few ways.

If you have a "cum2sym" table already handy, then you can just use shuffle() to look up directly into cum2sym[], and that is identical to the above. But you probably don't have cum2sym.

Well what if we just use shuffle() to make the destination state? Note that this is calling it in the opposite direction (from cum2sym index to to_state , rather than from to_state to cum2sym). If your shuffle is self-inverting like bit reversal is, then it's the same.

It gives you a very simple table construction :


make_tans_shuffle_direct()
{
    uint32 cumulative_count = 0;    
    for LOOP(sym,alphabet)
    {
        uint32 count = normalized_counts[sym];
        if ( count == 0 ) continue;
                
        for LOOP(c,(int)count)
        {
            uint32 index = shuffle(cumulative_count);
            cumulative_count++;

            uint32 to_state = index + L;
            int from_state = count + c; 

            encode_packed_table_ptr[sym][from_state] = to_state;
        }
    }
    ASSERT( cumulative_count == (uint32)L );
}

make_tans_shuffle_direct walks the Fs in a kind of cum2sym order and then scatters those symbols out to semi-random target locations using the shuffle() function.

It doesn't work. Or rather, it works, it encodes & decodes data correctly, but the total coded size is worse.

The problem is that the encode table is no longer monotonic. That is, as "from_state" increases, "to_state" does not necessarily increase. The Fs encode table entries for each symbol are not numerically in order.

In the images we've been picturing from earlier in the post we can see the problem. Some initial state x is renormalized down to the Is coding range. We then follow the state transition back to the I range - but we go somewhere random. We don't go to the same neighborhood where we started, so we randomly get more or less fractional bits.

You can fix it thusly :


make_tans_shuffle_direct_fixed()
{
    uint32 cumulative_count = 0;    
    for LOOP(sym,alphabet)
    {
        uint32 count = normalized_counts[sym];
        if ( count == 0 ) continue;
                
        for LOOP(c,(int)count)
        {
            uint32 index = shuffle(cumulative_count);
            cumulative_count++;

            uint32 to_state = index + L;
            int from_state = count + c; 

            encode_packed_table_ptr[sym][from_state] = to_state;
        }

        // fix - to_states not monotonic
        // sort the destination states for this symbol :
        std::sort( encode_packed_table_ptr[sym]+count, encode_packed_table_ptr[sym]+2*count );
    }
    ASSERT( cumulative_count == (uint32)L );
}

and then it is identical to "make_tans_shuffle" (identical if shuffle is self-inverting, and if not then it's different but equal, since shuffle is really just a random number generator so running it backwards doesn't hurt compression).

For the record the compression penalty for getting the state transition order wrong is 1-2% :


CCC total bytes out :

correct sort : 1788631
shuffle fixed: 1789655
shuffle bad  : 1813450

2/14/2014

02-14-14 - Understanding ANS - 11

I want to do some hand waving about the different ways you can conceptually look at ANS.

Perhaps the best way to understand ANS mathematically is via the analogy with arithmetic coding . While ANS is not actually building up an arithmetic coder interval for the file, each step is very much like a LIFO arithmetic coder, and the (x/P) scaling is what makes x grow the right amount for each symbol. This is the most useful way to think about rANS or uANS, I find.

But there are other ways to think about it.

One is Duda's "asymmetric numeral system", which is how he starts the explanation in the paper, and really confused me to begin with. Now that we've come at ANS from the back way we can understand what he was on about.

The fundamental op in ANS is :


integer x contains some previous value

make x' = x scaled up in some way + new value 

with a normal "symmetric numeral system" , you would just do base-b math :

x' = x * b + v

which gives you an x' where the old value (x) is distributed evenly, and the v's just cycle :

b = 3 for example

x':  0  1  2  3  4  5  6  7  8 ... 
x :  0  0  0  1  1  1  2  2  2
v :  0  1  2  0  1  2  0  1  2

x' is a way of packing the old value x and the new value v together. This symmetric packing corresponds to the output string "012" in the parlance of this series. The growth factor (x'/x) determines the number of bits required to send our value, and it's uniform.

But it doesn't need to be uniform.


0102 :

x':  0  1  2  3  4  5  6  7  8 ... 
x :  0  0  1  0  2  1  3  1  4 
v :  0  1  0  2  0  1  0  2  0

Intuitively, the more often a symbol occurs in the output string, the more slots there are for the previous value (x) to get placed; that is, more bits of x can be sent in lower values of x' when the symbol occurs in many slots. Hence x' grows less. If you're thinking in terms of normalized x's, growing less means you have to output fewer bits to stay in the renormalization range.

You can draw these asymmetric numeral lines in different ways, which Duda does in the paper. For example :


input x as the axis line,
output x' in the table :

"AB"
  0 1 2 3 4 5 6  x
A 0 2 4          x'
B 1 3 5

"AAB"

  0 1 2 3 4 5 6  x
A 0 1 3 4 6 7 9  x'
B 2 5 8 11

output x' as the axis line
input x in the table :

"AB"
  0 1 2 3 4 5 6  x'
A 0   1   2   3  x
B   0   1   2

"AAB"
  0 1 2 3 4 5 6  x'
A 0 1   2 3   4  x
B     0     1

output x' line implicit
show x and output symbol :

"AB"
0 0 1 1 2 2 3
A B A B A B A

"AAB"
0 1 0 2 3 1 4
A A B A A B A

That is, it's a funny way of just doing base-b math; we're shifting up the place value and adding our value in, but we're in an "asymmetric numeral system", so the base is nonuniform. I find this mental image not very useful when thinking about how the coding actually works.

There's another way to think about tANS in particular (tANS = table-based ANS), which is what Yann is talking about .

To get there mentally, we actually need to optimize our tANS code.

When I covered tANS encoding before , I described it something like this :


x is the current state
x starts in the range I = [L,2L-1]

to encode the next symbol s
we need to reach the precursor range Is = [Fs,2Fs-1]

to do that, output bits from x
b = x&1; x >>= 1;
until x is lowered to reach Is

then take the state transition C()
this takes x back into I

this should be familiar and straightforward.

To optimize, we know that x always starts in a single power of 2 interval [L,2L-1] , and it always lands in a power of 2 interval [Fs,2Fs-1]. That means the minimum number of bits we ever output is from L to 2Fs-1 , and the maximum number of bits is only 1 more than that. So the renormalization can be written as :


precompute :

max_state = 2Fs - 1;
min_num_bits = floor(log2(L/Fs));

to renormalize :

x in [L,2L-1]
output min_num_bits from x
x >>= min_num_bits

now ( x >= Fs && x < 2*max_state );

if ( x > max_state ) output 1 more bit; x>>= 1;

now x in [Fs,2Fs-1]

But you can move the check for the extra output bit earlier, before shifting x down :

precompute :

min_num_bits = log2(L) - log2ceil(Fs);  // if L is power of 2
threshold = (2*Fs)<<num_bits;

to renormalize :

x in [L,2L-1]
num_bits = min_num_bits;
if ( x >= threshold ) num_bits ++;
output num_bits from x
x >>= num_bits

x in [Fs,2Fs-1]

and then use C(x) since x is now in Is.

It's just straightforward optimization, but it actually allows us to picture the whole process in a different way. Let's write the same encoder, but just in terms of a table index :


let t = x - L
t in [0,L-1]

t is a table index.


to encode s :

num_bits = min_num_bits[s] + ( t >= threshold[s] );
bitoutput( t, num_bits );
t = encode_state_table[s][ (t+L)>>num_bits ];

That is, we're going from a table index to another table index. We're no longer thinking about going back to the [Fs,2Fs-1] precursor range at all.

Before we got our desired code len by the scaling of the intervals [L,2L)/[Fs,2Fs) , now the code len is the stored number of bits. We can see that we get fractional bits because sometimes we output one more.

Let's revisit an example that we went through previously , but with this new image.


L = 8
Fs = {3,3,2}
output string = ABCABCAB

We can see right away that our table index t is 3 bits. To encode a 'C' there will be only two slots on our numeral line that correspond to a lower digit of C, so we must output 2 bits and keep 1 bit of t. To encode an 'A' we can keep 3 values, so we can output 1 bit for t in [0,3] and 2 bits for t in [4,7] ; that will give us 2 retained values in the first region and 1 retained value in the second.

Explicitly :


t in [0,7]
I want to encode an A
so I want to reach {AxxAxxAx}

t in [0,3]
  output t&1
  index = (t+L)>>1 = 4 or 5
  take the last two A's {xxxAxxAx}
  so state -> 3 or 6

t in [4,7]
  output t&3
  index = (t+L)>>2 = 3
  take the first A {Axxxxxxx}
  state -> 0

Note that the way we're doing it, high states transition to low states, and vice versa. These comes up because of the +L sentry bit method to separate the subranges produced by the shift.

The tANS construction creates this encode table :


encode:
A : b=1+(t>=4) : {0,3,6}
B : b=1+(t>=4) : {1,4,7}
C : b=2+(t>=8) : {2,5}

It should be obvious that we can now drop all our mental ideas about "ANS" and just make these coding tables directly. All you need is an output string, and you think about doing these kinds of mapping :

t in [0,7]

I want to encode a B

[xxxxxxxx] -> [xBxxBxxB]

output bits to reduce the 3 values
and transition to one of the slots with a B

The decode table is trivial to make from the inverse :

decode:
 0: A -> 4 + getbits(2)
 1: B -> 4 + getbits(2)
 2: C -> 0 + getbits(2)
 3: A -> 0 + getbits(1)
 4: B -> 0 + getbits(1)
 5: C -> 4 + getbits(2)
 6: A -> 2 + getbits(1)
 7: B -> 2 + getbits(1)

Note that each symbol's decode covers the entire origin state range :

decode:
 0: A -> 4 + getbits(2)  from [4,7]
 3: A -> 0 + getbits(1)  from [0,1]
 6: A -> 2 + getbits(1)  from [2,3]

 1: B -> 4 + getbits(2)  from [4,7]
 4: B -> 0 + getbits(1)  from [0,1]
 7: B -> 2 + getbits(1)  from [2,3]

 2: C -> 0 + getbits(2)  from [0,3]
 5: C -> 4 + getbits(2)  from [4,7]

During decode we can think about our table index 't' as containing two pieces of information : one is the current symbol to output, but there's also some information about the range where t will be on the next step. That is, the current t contains some bits of the next t. The number of bits depends on where we are in the table. eg. in the example above; when t=4 we specify a B, but we also specify 2 bits worth of the next t.

Doing another example from that earlier post :


Fs={7,6,3}

ABCABABACBABACBA

encode:
A : b=1+(t>=12) : {0,3,5,7,10,12,15}
B : b=1+(t>=8) : {1,4,6,9,11,14}
C : b=2+(t>=8) : {2,8,13}

decode:
 0: A -> 12 + getbits(2)
 1: B -> 8 + getbits(2)
 2: C -> 8 + getbits(3)
 3: A -> 0 + getbits(1)
 4: B -> 12 + getbits(2)
 5: A -> 2 + getbits(1)
 6: B -> 0 + getbits(1)
 7: A -> 4 + getbits(1)
 8: C -> 0 + getbits(2)
 9: B -> 2 + getbits(1)
10: A -> 6 + getbits(1)
11: B -> 4 + getbits(1)
12: A -> 8 + getbits(1)
13: C -> 4 + getbits(2)
14: B -> 6 + getbits(1)
15: A -> 10 + getbits(1)

and this concludes our conception of tANS in terms of just an [0,t-1] table.

I'm gonna be super redundant and repeat myself some more. I think it's intriguing that we went through all this ANS entropy coder idea, scaling values by (x/P) and so on, and from that we constructed tANS code. But you can get to the exact same tANS directly from the idea of the output string!

Let's invent tANS our new way, starting from scratch.

I'm given normalized frequencies {Fs}. Sum{Fs} = L. I want a state machine with L entries. Take each symbol and scatter it into our output string in some way.

To encode each symbol, I need to map the state machine index t in [0,L-1] to one of its occurances in the output string.


There are Fs occurances in the output string

I need to map an [0,L-1] value to an [0,Fs-1] value
by outputting either b or b+1 bits

now clearly if (L/Fs) is a power of 2, then the log2 of that is just b and we always output that many bits. (eg L=16, Fs=4, we just output 2 bits). In general if (L/Fs) is not a power of 2, then

b = floor(log2(L/Fs))
b+1 = ceil(log2(L/Fs))

so we just need two sub-ranges of L such that the total adds up to Fs :

threshold T
values < T output b bits
values >= T output b+1 bits

total of both ranges after output should equal Fs :

(T>>b) + (L-T)>>(b+1) = Fs

(2T + L-T)>>(b+1) = Fs

L+T = Fs<<(b+1)

T = (Fs<<(b+1)) - L

and that's it! We've just made a tANS encoder without talking about anything related to the ANS ideas at all.

The funny thing to me is that we got the exact same condition before from "b-uniqueness". That is, in order to be able to encode symbol s from any initial state, we worked out that the only valid precursor range was Is = [Fs,2*Fs-1] . That leads us to the renormalization loop :


while x > (2*Fs-1)
  output a bit from x; x>>= 1;

And from that we computed a minimum number of output bits, and a threshold state for one more. That threshold we computed was

(max_state + 1)<<min_num_bits

= (2*Fs-1 + 1)<<b
= Fs<<(b+1)

which is the same.

2/11/2014

02-11-14 - Understanding ANS - 10 - Optimal normalized counts

Not really an ANS topic, but a piece you need for ANS so I've had a look at it.

For ANS and many other statistical coders (eg. arithmetic coding) you need to create scaled frequencies (the Fs in ANS terminology) from the true counts.

But how do you do that? I've seen many heuristics over the years that are more or less good, but I've never actually seen the right answer. How do you scale to minimize total code len? Well let's do it.

Let's state the problem :


You are given some true counts Cs

Sum{Cs} = T  the total of true counts

the true probabilities then are

Ps = Cs/T

and the ideal code lens are log2(1/Ps)

You need to create scaled frequencies Fs
such that

Sum{Fs} = M

for some given M.

and our goal is to minimize the total code len under the counts Fs.

The ideal entropy of the given counts is :

H = Sum{ Ps * log2(1/Ps) }

The code len under the counts Fs is :

L = Sum{ Ps * log2(M/Fs) }

The code len is strictly worse than the entropy

L >= H

We must also meet the constraint

if ( Cs != 0 ) then Fs > 0

That is, all symbols that exist in the set must be codeable. (note that this is not actually optimal; it's usually better to replace all rare symbols with a single escape symbol, but we won't do that here).

The naive solution is :


Fs = round( M * Ps )

if ( Cs > 0 ) Fs = MAX(Fs,1);

which is just scaling up the Ps by M. This has two problems - one is that Sum{Fs} is not actually M. The other is that just rounding the float does not actually distribute the integer counts to minimize codelen.

The usual heuristic is to do something like the above, and then apply some fix to make the sum right.

So first let's address how to fix the sum. We will always have issues with the sum being off M because of integer rounding.

What you will have is some correction :


correction = M - Sum{Fs}

that can be positive or negative. This is a count that needs to be added onto some symbols. We want to add it to the symbols that give us the most benefit to L, the total code len. Well that's simple, we just measure the affect of changing each Fs :

correction_sign = correction > 0 ? 1 : -1;

Ls_before = Ps * log2(M/Fs)
Ls_after = Ps * log2(M/(Fs + correction_sign))

Ls_delta = Ls_after - Ls_before
Ls_delta = Ps * ( log2(M/(Fs + correction_sign)) - log2(M/Fs) )
Ls_delta = Ps * log2(Fs/(Fs + correction_sign))

so we need to just find the symbol that gives us the lowest Ls_delta. This is either an improvement to total L, or the least increase in L.

We need to apply multiple corrections. We don't want a solution thats O(alphabet*correction) , since that can be 256*256 in bad cases. (correction is <= alphabet and typically in the 1-50 range for a typical 256-symbol file). The obvious solution is a heap. In pseudocode :


For all s
    push_heap( Ls_delta , s )

For correction
    s = pop_heap
    adjust Fs
    compute new Ls_delta for s
    push_heap( Ls_delta , s )

note that after we adjust the count we need to recompute Ls_delta and repush that symbol, because we might want to choose the same symbol again later.

In STL+cblib this is :


to[] = Fs
from[] = original counts

struct sort_sym
{
    int sym;
    float rank;
    sort_sym() { }
    sort_sym( int s, float r ) : sym(s) , rank(r) { }
    bool operator < (const sort_sym & rhs) const { return rank < rhs.rank; }
};

---------

    if ( correction != 0 )
    {
        //lprintfvar(correction);
        int32 correction_sign = (correction > 0) ? 1 : -1;

        vector<sort_sym> heap;
        heap.reserve(alphabet);

        for LOOP(i,alphabet)
        {
            if ( from[i] == 0 ) continue;
            ASSERT( to[i] != 0 );
            if ( to[i] > 1 || correction_sign == 1 )
            {
                double change = log( (double) to[i] / (to[i] + correction_sign) ) * from[i];
            
                heap.push_back( sort_sym(i,change) );
            }
        }
        
        std::make_heap(heap.begin(),heap.end());
        
        while( correction != 0 )
        {
            ASSERT_RELEASE( ! heap.empty() );
            std::pop_heap(heap.begin(),heap.end());
            sort_sym ss = heap.back();
            heap.pop_back();
            
            int i = ss.sym;
            ASSERT( from[i] != 0 );
            
            to[i] += correction_sign;
            correction -= correction_sign;
            ASSERT( to[i] != 0 );
        
            if ( to[i] > 1 || correction_sign == 1 )
            {
                double change = log( (double) to[i] / (to[i] + correction_sign) ) * from[i];
            
                heap.push_back( sort_sym(i,change) );
                std::push_heap(heap.begin(),heap.end());
            }               
        }
    
        ASSERT( cb::sum(to,to+alphabet) == (uint32)to_sum_desired );
    }

You may have noted that the above code is using natural log instead of log2. The difference is only a constant scaling factor, so it doesn't affect the heap order; you may use whatever log base is fastest.

Errkay. So our first attempt is to just use the naive scaling Fs = round( M * Ps ) and then fix the sum using the heap correction algorithm above.

Doing round+correct gets you 99% of the way there. I measured the difference between the total code len made that way and the optimal, and they are less than 0.001 bpb different on every file I tried. But it's still not quite right, so what is the right way?

To guide my search I had a look at the cases where round+correct was not optimal. When it's not optimal it means there is some symbol a and some symbol b such that { Fa+1 , Fb-1 } gives a better total code len than {Fa,Fb}. An example of that is :


count to inc : (1/1024) was (1866/1286152 = 0.0015)
count to dec : (380/1024) was (482110/1286152 = 0.3748)
to inc; cl before : 10.00 cl after : 9.00 , true cl : 9.43
to dec; cl before : 1.43 cl after : 1.43 , true cl : 1.42

The key point is on the 1 count :

count to inc : (1/1024) was (1866/1286152 = 0.0015)
to inc; cl before : 10.00 cl after : 9.00 , true cl : 9.43

1024*1866/1286152 = 1.485660
round(1.485660) = 1

so Fs = 1 , which is a codelen of 10

but Fs = 2 gives a codelen (9) closer to the true codelen (9.43)

And this provided the key observation : rather than rounding the scaled count, what we should be doing is either floor or ceil of the fraction, whichever gives a codelen closer to the true codelen.

BTW before you go off hacking a special case just for Fs==1, it also happens with higher counts :


count to inc : (2/1024) was (439/180084) scaled = 2.4963
to inc; cl before : 9.00 cl after : 8.42 , true cl : 8.68

count to inc : (4/1024) was (644/146557) scaled = 4.4997
to inc; cl before : 8.00 cl after : 7.68 , true cl : 7.83

though obviously the higher Fs, the less likely it is because the rounding gets closer to being perfect.

So it's easy enough just to solve exactly, simply pick the floor or ceil of the ratio depending on which makes the closer codelen :


Ps = Cs/T from the true counts

down = floor( M * Ps )
down = MAX( down,1)

Fs = either down or (down+1)

true_codelen = -log2( Ps )
down_codelen = -log2( down/M )
  up_codelen = -log2( (down+1)/M )

if ( |down_codelen - true_codelen| < |up_codelen - true_codelen| )
  Fs = down
else
  Fs = down+1

And since all we care about is the inequality, we can do some maths and simplify the expressions. I won't write out all the algebra to do the simplification because it's straightforward, but there are a few key steps :

| log(x) | = log( MAX(x,1/x) )

log(x) >= log(y)  is the same as x >= y

down <= M*Ps
down+1 >= M*Ps

the result of the simplification in code is :

from[] = original counts (Cs) , sum to T
to[] = normalized counts (Fs) , will sum to M

    double from_scaled = from[i] * M/T;

    uint32 down = (uint32)( from_scaled );
                
    to[i] = ( from_scaled*from_scaled <= down*(down+1) ) ? down : down+1;

Note that there's no special casing needed to ensure that (from_scaled < 1) gives you to[i] = 1 , we get that for free with this expression.

I was delighted when I got to this extremely simple final form.

And that is the conclusion. Use that to find the initial scaled counts. There will still be some correction that needs to be applied to reach the target sum exactly, so use the heap correction algorithm above.

As a final note, if we look at the final expression :


to[i] = ( from_scaled*from_scaled < down*(down+1) ) ? down : down+1;

to[i] = ( test < 0 ) ? down : down+1;

test = from_scaled*from_scaled - down*(down+1); 

from_scaled = down + frac

test = (down + frac)^2 - down*(down+1);

solve for frac where test = 0

frac = sqrt( down^2 + down ) - down

That gives you the fractional part of the scaled count where you should round up or down. It varies with floor(from_scaled). The actual values are :

1 : 0.414214
2 : 0.449490
3 : 0.464102
4 : 0.472136
5 : 0.477226
6 : 0.480741
7 : 0.483315
8 : 0.485281
9 : 0.486833
10 : 0.488088
11 : 0.489125
12 : 0.489996
13 : 0.490738
14 : 0.491377
15 : 0.491933
16 : 0.492423
17 : 0.492856
18 : 0.493242
19 : 0.493589

You can see as Fs gets larger, it goes to 0.5 , so just using rounding is close to correct. It's really in the very low values where it's quite far from 0.5 that errors are most likely to occur.

2/10/2014

02-10-14 - Understanding ANS - 9

If you just want to understand the basics of how ANS works, you may skip this post. I'm going to explore some unsolved issues about the sort order.

Some issues about constructing the ANS sort order are still mysterious to me. I'm going to try to attack a few points.

One thing I said wrote last time needs some clarification - "Every slot has an equal probability of 1/M."

What is true is that every character of the output string is equiprobable (assuming again that the Fs are the true probabilities). That is, if you have the string S[] with L symbols, each symbol s occurs Fs times, then you can generate symbols with the correct probability by just drawing S[i] with random i.

The output string S[] also corresponds to the destination state of the encoder in the renormalization range I = [L,2L-1]. What is not true is that all states in I are equally probable.

To explore this I did 10,000 random runs of encoding 10,000 symbols each time. I used L=1024 each time, and gathered stats from all the runs.

This is the actual frequency of the state x having each value in [1024,2047] (scaled so that the average is 1000) :

The lowest most probable states (x=1024) have roughly 2X the frequency of the high least probable states (x=2047).

Note : this data was generated using Duda's "precise initialization" (my "sort by sorting" with 0.5 bias). Different table constructions will create different utilization graphs. In particular the various heuristics will have some weird bumps. And we'll see what different bias does later on.

This is the same data with 1/X through it :

This probability distribution (1/X) can be reproduced just from doing this :


            x = x*b + irandmod(b); // for any base b
            
            while( x >= 2*K ) x >>= 1;
            
            stats_count[x-K] ++;            

though I'd still love to see an analytic proof and understand that better.

So, the first thing I should correct is : final states (the x' in I) are not equally likely.

How that should be considered in sort construction, I do not know.

The other thing I've been thinking about was why did I find that the + 1.0 bias is better in practice than the + 0.5 bias that Duda suggests ("precise initialization") ?

What the +1 bias does is push low probability symbols further towards the end of the sort order. I've been contemplating why that might be good. The answer is not that the end of the sort order makes longer codelens, because that kind of issue has already been accounted for.

My suspicion was that the +1 bias was beating the +0.5 bias because of the difference between normalized counts and unnormalized original counts.

Recall that to construct the table we had to make normalized frequences Fs that sum to L. These, however, are not the true symbol frequencies (except in synthetic tests). The true symbol frequencies had to be scaled to sum to L to make the Fs.

The largest coding error from frequency scaling is on the least probable symbols. In fact the very worst case is symbols that occur only once in a very large file. eg. in a 1 MB file a symbol occurs once; its true probability is 2^-20 and it should be coded in 20 bits. But we scale the frequencies to sum to 1024 (for example), it still must get a count of 1, so it's coded in 10 bits.

What the +1 bias does is take the least probable symbols and push them to the end of the table, which maximizes the number of bits they take to code. If the {Fs} were the true frequencies, this would be bad, and the + 0.5 bias would be better. But the {Fs} are not the true frequencies.

This raises the question - could we make the sort order from the true frequencies instead of the scaled ones? Yes, but you would then have to either transmit the true frequencies to the decoder, or transmit the sort order. Either way takes many more bits than transmitting the scaled frequencies. (in fact in the real world you may wish to transmit even approximations of the scaled frequencies). You must ensure the encoder and decoder use the same frequencies so they build the same sort order.

Anyway, I tested this hypothesis by making buffers synthetically by drawing symbols from the {Fs} random distribution. I took my large testset, for each file I counted the real histogram, made the scaled frequencies {Fs}, then regenerated the buffer from the frequencies {Fs} so that the statistics match the data exactly. I then ran tANS on the synthetic buffers and on the original file data :


synthetic data :

total bytes out : 146068969.00  bias=0.5
total bytes out : 146117818.63  bias=1.0

real data :

total bytes out : 144672103.38  bias=0.5
total bytes out : 144524757.63  bias=1.0

On the synthetic data, bias=0.5 is in fact slightly better. On the real data, bias=1.0 is slightly better. This confirms that the difference between the normalized counts & unnormalized counts is in fact the origin of 1.0's win in my previous tests, but doesn't necessarily confirm my guess for why.

An idea for an alternative to the bias=1 heuristic is you could use bias=0.5 , but instead of using the Fs for the sort order, use the estimated original count before normalization. That is, for each Fs you can have a probability model of what the original count was, and select the maximum-likelihood count from that. This is exactly analoguous to restoring to expectation rather than restoring to middle in a quantizer.

Using bias=1.0 and measuring state occurance counts, we get this :

Which mostly has the same 1/x curve, but with a funny tail at the end. Note that these graphs are generated on synthetic data.

I'm now convinced that the 0.5 bias is "right". It minimizes measured output len on synthetic data where the Fs are the true frequencies. It centers each symbol's occurances in the output string. It reproduces the 1/x distribution of state frequencies. However there is still the missing piece of how to derive it from first principles.


BTW

While I was at it, I gathered the average number of bits output when coding from each state. If you're following along with Yann's blog he's been explaining FSE in terms of this. tANS outputs bits to get the state x down into the coding range Is for the next symbol. The Is are always lower than I (L), so you have to output some bits to scale down x to reach the Is. x starts in [L,2L) and we have to output bits to reach [Fs,2Fs) ; the average number of bits required is like log2(L/Fs) which is log2(1/P) which is the code length we want. Because our range is [L,2L) we know the average output bit count from each state must differ by 1 from the top of the range to the bottom. In fact it looks like this :

Another way to think about it is that at state=L , the state is empty. As state increases, it is holding some fractional bits of information in the state variable. That number of fraction bits goes from 0 at L up to 1 at 2L.


Ryg just pointed me at a proof of the 1/x distribution in Moffat's "Arithmetic Coding Revisited" (DCC98).

The "x" in ANS has the same properties as the range ("R") in an arithmetic coder.

The bits of information in x is I ~= log( x )

I is in [0,1] and is a uniform random value, Pr(I) ~= 1

if log(x) has Pr ~= 1 , then Pr(x) must be ~= 1/x

The fact that I is uniform is maybe not entirely obvious; Moffat just hand-waves about it. Basically you're accumulating a random variable into I ( -log2(P_sym) ) and then dropping the integer part; the result is a fractional part that's random and uniform.

2/06/2014

02-06-14 - Understanding ANS - 8

Time to address an issue that we've skirted for some time - how do you make the output string sort order?

Recall : The output string contains Fs occurances of each symbol. For naive rANS the output string is just in alphabetical order (eg. "AAABBBCCC"). With tANS we can use any permutation of that string.

So what permutation should we use? Well, the output string determines the C() and D() encode and decode tables. It is in fact the only degree of freedom in table construction (assuming the same constraints as last time, b=2 and L=M). So we should choose the output string to minimize code length.

The guiding principle will be (x/P). That is, we achieve minimum total length when we make each code length as close to log2(1/P) as possible. We do that by making the input state to output state ratio (x'/x) as close to (1/P) as possible.

(note for the record : if you try to really solve to minimize the error, it should not just be a distance between (x'/x) and (1/P) , it needs to be log-scaled to make it a "minimum rate" solution). (open question : is there an exact solution for table building that finds the minimum rate table that isn't NP (eg. not just trying all permutations)).

Now we know that the source state always come from the precursor ranges Is, and we know that


destination range :
I = [ M , 2*M - 1]

source range :
Is = [ Fs, 2*Fs - 1 ] for each symbol s

and Ps = Fs/M

so the ideal target for the symbols in each source range is :

target in I = (1/Ps) * (Is) = (M/Fs) * [ Fs, 2*Fs - 1 ] = 

and taking off the +M bias to make it a string index in the range [0,M-1] :

Ts = target in string = target in I - M

Ts = { 0 , M * 1/Fs , M * 2/Fs) , ... }

Essentially, we just need to take each symbol and spread its Fs occurances evenly over the output string.

Now there's a step that I don't know how to justify without waving my hands a bit. It works slightly better if we imagine that the source x was not just an integer, but rather a bucket that covers the unit range of that integer. That is, rather that starting exactly at the value "x = Fs" you start in the range [Fs,Fs+1]. So instead of just mapping up that integer by 1/P we map up the range, and we can assign a target anywhere in that range. In the paper Duda uses a bias of 0.5 for "precise initialization" , which corresponds to assuming the x's start in the middle of their integer buckets. That is :


Ts = { M * (b/Fs), M* (1+b)/Fs, M * (2+b)/Fs , ... }

with b = 0.5 for Duda's "precise initialization". Obviously b = 0.5 makes T centered on the range [0,M] , but I see no reason why that should be preferred.

Now assuming we have these known target locations, you can't just put all the symbols into the target slots that they want, because lots of symbols want the same spot.

For example :


M=8
Fs={3,3,2}

T_A = { 8 * 0.5/3 , 8 * 1.5 / 3 , 8 * 2.5 / 3 } = { 1 1/3 , 4 , 6 2/3 }
T_B = T_A
T_C = { 8 * 0.5/2 , 8 * 1.5/2 } = { 2 , 6 }

One way to solve this problem is to start assigning slots, and when you see that one is full you just look in the neighbor slot, etc. So you might do something like :

initial string is empty :

string = "        "

put A's in 1,4,6

string = " A  A A "

put B's in 1,4,6 ; oops they're taken, shift up one to find empty slots :

string = " AB ABAB"

put C's in 2,6 ; oops they're taken, hunt around to find empty slots :

string = "CABCABAB"

now obviously you could try to improve this kind of algorithm, but there's no point. It's greedy so it makes mistakes in the overall optimization problem (it's highly dependant on order). It can also be slow because it spends a lot of time hunting for empty slots; you'd have to write a fast slot allocator to avoid degenerate bad cases. There are other ways.

Another thing I should note is that when doing these target slot assignments, there's no reason to prefer the most probable symbol first, or the least probable or whatever. The reason is every symbol occurance is equally probable. That is, symbol s has frequency Fs, but there are Fs slots for symbol s, so each slot has a frequency of 1. Every slot has an equal probability of 1/M.

An alternative algorithm that I have found to work well is to sort the targets. That is :


make a sorting_array of size M

add { Ts, s } to sorting_array for each symbol  (that's Fs items added)

sort sorting_array by target location

the symbols in sorting_array are in output string order

I believe that this is identical to Duda's "precise initialization" which he describes using push/pop operations on a heap; the result is the same - assigning slots in the order of desired target location.

Using the sort like this is a little weird. We are no longer explicitly trying to put the symbols in their target slots. But the targets (Ts) span the range [0, M] and the sort is an array of size M, so they wind up distributed over that range. In practice it works well, and it's fast because sorting is fast.

A few small notes :

You want to use a "stable" sort, or bias the target by some small number based on the symbol. The reason is you will have lots of ties, and you want the ties broken consistently. eg. for "AABBCC" you want "ABCABC" or "CBACBA" but not "ABCCAB". One way to get a stable sort is to make the sorting_array work on U32's, and pack the sort rank into the top 24 bits and the symbol id into the bottom 8 bits.

The bias = 0.5 that Duda uses is not strongly justified, so I tried some other numbers. bias = 0 is much worse. It turns out that bias = 1.0 is better. I tried a bunch of values on a large test set and found that bias = 1 is consistently good.

One very simple way to get a decent sort is to bit-reverse the rANS indexes. That is, start from a rANS/alphabetical order string ("AABB..") and take the index of each element, bit-reverse that index (so 0001 -> 1000) , and put the symbol in the bit reversed slot. While this is not competitive with the proper sort, it is simple and one pass.

Another possible heuristic is to just scatter the symbols by doing steps that are prime with M. This is what Yann does in fse.c


All the files in Calgary Corpus :
(compression per file; sum of output sizes)

M = 1024

rANS/alpahabetical : 1824053.75

bit reverse : 1805230.75

greedy search for empty slots : 1801351

Yann's heuristic in fse.c : 1805503.13

sort , bias = 0.0 : 1817269.88

sort , bias = 0.5 : 1803676.38  (Duda "precise")

sort , bias = 1.0 : 1798930.75

Before anyone throws a fit - yes, I tested on my very large test set, not just calgary. The results were consistent on all the test sets I tried. I also tested with larger M (4096) and the results were again the same, though the differences are smaller the larger you make M.

For completeness, here is what the sorts actually do :


rANS/alphabetical : AAAAAAABBBBBBCCC

bit reverse :   ABABABACABACABBC

greedy search : CABABACABABACABB

greedy search, LPS first :  ABCABAACBABACBAB

Yann fse :          AAABBCAABBCAABBC

sort , bias = 0.0 : ABCABABCABABCABA

sort , bias = 0.5 : ABCABABACBABACBA

sort , bias = 1.0 : ABABCABABCABAABC

but I caution against judging sorts by whether they "look good" since that criteria does not seem to match coding performance.

Finally for clarity, here's the code for the simpler sorts :


void make_sort(int * sorted_syms, int sort_count, const uint32 * normalized_counts, int alphabet)
{
    ASSERT( (int) cb::sum(normalized_counts,normalized_counts+alphabet) == sort_count );
    
    const int fse_step = (sort_count>>1) + (sort_count>>3) + 1;
    
    int fse_pos = 0;
    int s = 0;
    for LOOP(a,alphabet)
    {
        int count = normalized_counts[a];

        for LOOP(c,count)
        {
            // choose one :

            // rANS :
            sorted_syms[s] = a;

            // fse :
            sorted_syms[fse_pos] = a;
            fse_pos = (fse_pos + step) % sort_count;

            // bitreverse :
            sorted_syms[ bitreverse(s, numbits(sort_count)) ] = a;

            s++;
        }
    }
}

and the code for the actual sorting sort (recommended) :

struct sort_sym
{
    int sym;
    float rank;
    bool operator < (const sort_sym & rhs) const
    {
        return rank < rhs.rank;
    }
};

void make_sort(int * sorted_syms, int sort_count, const uint32 * normalized_counts, int alphabet)
{
    ASSERT( (int) cb::sum(normalized_counts,normalized_counts+alphabet) == sort_count );

    vector<sort_sym> sort_syms;
    sort_syms.resize(sort_count);

    int s = 0;

    for LOOP(sym,alphabet)
    {
        uint32 count = normalized_counts[sym];
        if ( count == 0 ) continue;
        
        float invp = 1.f / count;
        
        float base =  1.f * invp; // 0.5f for Duda precise

        for LOOP(c,(int)count)
        {
            sort_syms[s].sym = sym;
            sort_syms[s].rank = base + c * invp;
            s++;
        }
    }
    
    ASSERT_RELEASE( s == sort_count );
    
    std::stable_sort(sort_syms.begin(),sort_syms.end());
    
    for LOOP(s,sort_count)
    {
        sorted_syms[s] = sort_syms[s].sym;
    }
}

and for the greedy search :

void make_sort(int * sorted_syms, int sort_count, const uint32 * normalized_counts, int alphabet)
{
    ASSERT( (int) cb::sum(normalized_counts,normalized_counts+alphabet) == sort_count );

    // make all slots empty :
    for LOOP(s,sort_count)
    {
        sorted_syms[s] = -1;
    }
    
    for LOOP(a,alphabet)
    {
        uint32 count = normalized_counts[a];
        if ( count == 0 ) continue;
        
        uint32 step = (sort_count + (count/2) ) / count;
        uint32 first = step/2;
        
        for LOOP(c,(int)count)
        {
            uint32 slot = first + step * c;
            
            // find an empty slot :
            for(;;)
            {
                if ( sorted_syms[slot] == -1 )
                {
                    sorted_syms[slot] = a;
                    break;
                }
                slot = (slot + 1)%sort_count;
            }
        }
    }
}

small note : the reported results use a greedy search that searches away from slot using +1,-1,+2,-2 , instead of the simpler +1,+2 in this code snippet. This simpler version is very slightly worse.

old rants