# cbloom rants

## 9/16/2010

### 09-16-10 - A small followup on Cumulative Probability Trees

See original note . Then :

I mentioned the naive tree where each level has buckets of 2^L sums. For clarity, the contents of that tree are :

 C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 CC13 C14 C15 C0-C1 C2-C3 C4-C5 C6-C7 C8-C9 C10-C11 C12-C13 C14-C15 C0-C3 C4-C7 C8-C11 C12-C15 C0-C7 C8-C15 C0-C15

I said before :

```
But querying a cumprob is a bit messy. You can't just go up the tree and add, because you may already be counted in a parent. So you have to do something like :

sum = 0;

if ( i&1 ) sum += Tree[i-1];
i>>=1;
if ( i&1 ) sum += Tree[i-1];
i>>=1;
if ( i&1 ) sum += Tree[i-1];
..

This is O(logN) but rather uglier than we'd like.

```

A few notes on this. It's easier to write the query if our array is 1 based. Then the query is equivalent to taking a value from each row where the bit of "i" is on. That is, line up the bits of "i" with the bottom bit at the top row.

That is :

```
sum = 0;

if ( i&1 ) sum += Tree[i];
if ( i&2 ) sum += Tree[i>>1];
if ( i&4 ) sum += Tree[i>>2];
...

```
Some SIMD instructions sets have the ability to take an N-bit mask and turn it into an N-channel mask. In that case you could compute Tree[n][i>>n] in each channel and then use "i" to mask all N, then do a horizontal sum.

Also the similarity to Fenwick trees and his way of walking the tree can be made more obvious. In particular, we obviously only have to do the sums where i has bits that are on. On modern CPU's the fastest way is just to always do 8 sums, but in the old days it was beneficial to do a minimum walk. The way you do that is by only walking to the bits that are on, something like :

```
sum = 0;
while ( i )
{
int k = BitScan(i);
sum += Tree[k][i>>k];
i -= (1 << k);
}

```
we can write this more like a Fenwick query :
```
sum = 0;
while ( i )
{
int b = i & -i; // isolate the bottom bit of i
int k = BitScan(b); // b -> log2(b)
sum += Tree[k][i>>k];
i = i & (i-1); // same as i -= b; turns off bottom bit
}

```
but then we recall that we made the FenwickTree structure by taking these entries and merging them. In particular, one way to build a Fenwick Tree is to take entries from simple table and do "if my bottom bit is zero, do i>>=1 and step to the next level of the table".

What that means is when we find the bottom bit pos is at bit 'k' and we look up at (i>>k) - that is the entry that will just be in slot "i" in the Fenwick structure :

```
sum = 0;
while ( i )
{
sum += FenwickTree[i];
i = i & (i-1); // turn off bottom bit
}

```
so the relationship and simplification is clear.

Fast decoding would involve a branchless binary search. I leave that as an exercise for the reader.

Not that any of this is actually useful. If you're just doing order-0 "deferred summation" is better, and if you're doing high order, then special structures for small/sparse contexts are better, and if your doing low orders escaped from high order it doesn't really work because you have to handle exclusion.

A better way to do excluded order-0 would be to have a 256-bit flag chunk for excluded or not excluded, then you just keep all the character counts in an array, and you sum them on demand with SIMD using the binary-tree summing method. At each step you sum adjacent pairs, so in step one you take the 256 counts and sum neighbors and output 128. Repeat. One tricky thing is that you have to output two sums - the sum above symbol and the sum below symbol (both excluded by the flags). And the decoder is a bit tricky, but maybe you can just do that sum tree and then binary search down it. Unfortunately the ability to do horizontal pair adds (or the way to swizzle into doing that) and the ability to byte arithmetic is one of the areas where the SIMD instruction sets on the various platforms differ greatly, so you'd have to roll custom code for every case.