# cbloom rants

## 2/06/2014

### 02-06-14 - Understanding ANS - 8

Time to address an issue that we've skirted for some time - how do you make the output string sort order?

Recall : The output string contains Fs occurances of each symbol. For naive rANS the output string is just in alphabetical order (eg. "AAABBBCCC"). With tANS we can use any permutation of that string.

So what permutation should we use? Well, the output string determines the C() and D() encode and decode tables. It is in fact the only degree of freedom in table construction (assuming the same constraints as last time, b=2 and L=M). So we should choose the output string to minimize code length.

The guiding principle will be (x/P). That is, we achieve minimum total length when we make each code length as close to log2(1/P) as possible. We do that by making the input state to output state ratio (x'/x) as close to (1/P) as possible.

(note for the record : if you try to really solve to minimize the error, it should not just be a distance between (x'/x) and (1/P) , it needs to be log-scaled to make it a "minimum rate" solution). (open question : is there an exact solution for table building that finds the minimum rate table that isn't NP (eg. not just trying all permutations)).

Now we know that the source state always come from the precursor ranges Is, and we know that

```
destination range :
I = [ M , 2*M - 1]

source range :
Is = [ Fs, 2*Fs - 1 ] for each symbol s

and Ps = Fs/M

```
so the ideal target for the symbols in each source range is :
```
target in I = (1/Ps) * (Is) = (M/Fs) * [ Fs, 2*Fs - 1 ] =

and taking off the +M bias to make it a string index in the range [0,M-1] :

Ts = target in string = target in I - M

Ts = { 0 , M * 1/Fs , M * 2/Fs) , ... }

```
Essentially, we just need to take each symbol and spread its Fs occurances evenly over the output string.

Now there's a step that I don't know how to justify without waving my hands a bit. It works slightly better if we imagine that the source x was not just an integer, but rather a bucket that covers the unit range of that integer. That is, rather that starting exactly at the value "x = Fs" you start in the range [Fs,Fs+1]. So instead of just mapping up that integer by 1/P we map up the range, and we can assign a target anywhere in that range. In the paper Duda uses a bias of 0.5 for "precise initialization" , which corresponds to assuming the x's start in the middle of their integer buckets. That is :

```
Ts = { M * (b/Fs), M* (1+b)/Fs, M * (2+b)/Fs , ... }

```
with b = 0.5 for Duda's "precise initialization". Obviously b = 0.5 makes T centered on the range [0,M] , but I see no reason why that should be preferred.

Now assuming we have these known target locations, you can't just put all the symbols into the target slots that they want, because lots of symbols want the same spot.

For example :

```
M=8
Fs={3,3,2}

T_A = { 8 * 0.5/3 , 8 * 1.5 / 3 , 8 * 2.5 / 3 } = { 1 1/3 , 4 , 6 2/3 }
T_B = T_A
T_C = { 8 * 0.5/2 , 8 * 1.5/2 } = { 2 , 6 }

```
One way to solve this problem is to start assigning slots, and when you see that one is full you just look in the neighbor slot, etc. So you might do something like :
```
initial string is empty :

string = "        "

put A's in 1,4,6

string = " A  A A "

put B's in 1,4,6 ; oops they're taken, shift up one to find empty slots :

string = " AB ABAB"

put C's in 2,6 ; oops they're taken, hunt around to find empty slots :

string = "CABCABAB"

```
now obviously you could try to improve this kind of algorithm, but there's no point. It's greedy so it makes mistakes in the overall optimization problem (it's highly dependant on order). It can also be slow because it spends a lot of time hunting for empty slots; you'd have to write a fast slot allocator to avoid degenerate bad cases. There are other ways.

Another thing I should note is that when doing these target slot assignments, there's no reason to prefer the most probable symbol first, or the least probable or whatever. The reason is every symbol occurance is equally probable. That is, symbol s has frequency Fs, but there are Fs slots for symbol s, so each slot has a frequency of 1. Every slot has an equal probability of 1/M.

An alternative algorithm that I have found to work well is to sort the targets. That is :

```
make a sorting_array of size M

add { Ts, s } to sorting_array for each symbol  (that's Fs items added)

sort sorting_array by target location

the symbols in sorting_array are in output string order

```
I believe that this is identical to Duda's "precise initialization" which he describes using push/pop operations on a heap; the result is the same - assigning slots in the order of desired target location.

Using the sort like this is a little weird. We are no longer explicitly trying to put the symbols in their target slots. But the targets (Ts) span the range [0, M] and the sort is an array of size M, so they wind up distributed over that range. In practice it works well, and it's fast because sorting is fast.

A few small notes :

You want to use a "stable" sort, or bias the target by some small number based on the symbol. The reason is you will have lots of ties, and you want the ties broken consistently. eg. for "AABBCC" you want "ABCABC" or "CBACBA" but not "ABCCAB". One way to get a stable sort is to make the sorting_array work on U32's, and pack the sort rank into the top 24 bits and the symbol id into the bottom 8 bits.

The bias = 0.5 that Duda uses is not strongly justified, so I tried some other numbers. bias = 0 is much worse. It turns out that bias = 1.0 is better. I tried a bunch of values on a large test set and found that bias = 1 is consistently good.

One very simple way to get a decent sort is to bit-reverse the rANS indexes. That is, start from a rANS/alphabetical order string ("AABB..") and take the index of each element, bit-reverse that index (so 0001 -> 1000) , and put the symbol in the bit reversed slot. While this is not competitive with the proper sort, it is simple and one pass.

Another possible heuristic is to just scatter the symbols by doing steps that are prime with M. This is what Yann does in fse.c

```
All the files in Calgary Corpus :
(compression per file; sum of output sizes)

M = 1024

rANS/alpahabetical : 1824053.75

bit reverse : 1805230.75

greedy search for empty slots : 1801351

Yann's heuristic in fse.c : 1805503.13

sort , bias = 0.0 : 1817269.88

sort , bias = 0.5 : 1803676.38  (Duda "precise")

sort , bias = 1.0 : 1798930.75

```

Before anyone throws a fit - yes, I tested on my very large test set, not just calgary. The results were consistent on all the test sets I tried. I also tested with larger M (4096) and the results were again the same, though the differences are smaller the larger you make M.

For completeness, here is what the sorts actually do :

```
rANS/alphabetical : AAAAAAABBBBBBCCC

bit reverse :   ABABABACABACABBC

greedy search : CABABACABABACABB

greedy search, LPS first :  ABCABAACBABACBAB

Yann fse :          AAABBCAABBCAABBC

sort , bias = 0.0 : ABCABABCABABCABA

sort , bias = 0.5 : ABCABABACBABACBA

sort , bias = 1.0 : ABABCABABCABAABC

```
but I caution against judging sorts by whether they "look good" since that criteria does not seem to match coding performance.

Finally for clarity, here's the code for the simpler sorts :

```
void make_sort(int * sorted_syms, int sort_count, const uint32 * normalized_counts, int alphabet)
{
ASSERT( (int) cb::sum(normalized_counts,normalized_counts+alphabet) == sort_count );

const int fse_step = (sort_count>>1) + (sort_count>>3) + 1;

int fse_pos = 0;
int s = 0;
for LOOP(a,alphabet)
{
int count = normalized_counts[a];

for LOOP(c,count)
{
// choose one :

// rANS :
sorted_syms[s] = a;

// fse :
sorted_syms[fse_pos] = a;
fse_pos = (fse_pos + step) % sort_count;

// bitreverse :
sorted_syms[ bitreverse(s, numbits(sort_count)) ] = a;

s++;
}
}
}

```
and the code for the actual sorting sort (recommended) :
```
struct sort_sym
{
int sym;
float rank;
bool operator < (const sort_sym & rhs) const
{
return rank < rhs.rank;
}
};

void make_sort(int * sorted_syms, int sort_count, const uint32 * normalized_counts, int alphabet)
{
ASSERT( (int) cb::sum(normalized_counts,normalized_counts+alphabet) == sort_count );

vector`<`sort_sym> sort_syms;
sort_syms.resize(sort_count);

int s = 0;

for LOOP(sym,alphabet)
{
uint32 count = normalized_counts[sym];
if ( count == 0 ) continue;

float invp = 1.f / count;

float base =  1.f * invp; // 0.5f for Duda precise

for LOOP(c,(int)count)
{
sort_syms[s].sym = sym;
sort_syms[s].rank = base + c * invp;
s++;
}
}

ASSERT_RELEASE( s == sort_count );

std::stable_sort(sort_syms.begin(),sort_syms.end());

for LOOP(s,sort_count)
{
sorted_syms[s] = sort_syms[s].sym;
}
}

```
and for the greedy search :
```
void make_sort(int * sorted_syms, int sort_count, const uint32 * normalized_counts, int alphabet)
{
ASSERT( (int) cb::sum(normalized_counts,normalized_counts+alphabet) == sort_count );

// make all slots empty :
for LOOP(s,sort_count)
{
sorted_syms[s] = -1;
}

for LOOP(a,alphabet)
{
uint32 count = normalized_counts[a];
if ( count == 0 ) continue;

uint32 step = (sort_count + (count/2) ) / count;
uint32 first = step/2;

for LOOP(c,(int)count)
{
uint32 slot = first + step * c;

// find an empty slot :
for(;;)
{
if ( sorted_syms[slot] == -1 )
{
sorted_syms[slot] = a;
break;
}
slot = (slot + 1)%sort_count;
}
}
}
}

```
small note : the reported results use a greedy search that searches away from slot using +1,-1,+2,-2 , instead of the simpler +1,+2 in this code snippet. This simpler version is very slightly worse.