9/30/2010

09-30-10 - Coder News

Ignacio wrote a nice article on GPU radiosity via hemicubes . I always wanted to try that, I'm jealous.

I suspect that you could be faster & higher quality now by doing GPU ray tracing instead of render-to-textures. The big speed advantage of GPU ray tracing would come from the fact that you can make the rays to sample into your lightmaps for lots of lightmap texels and run them all in one big batch. Quality comes from the fact that you can use all the fancy raytracing techniques for sampling, monte carlo methods, etc. and it's rather easier to do variable-resolution sampling (eg. start with 50 rays per lightmap texel and add more if there's high variance). In fact, you could set up all the rays to sample into your lightmap texels for your whole world, and then run N bounces by justing firing the whole gigantic batch N times. Of course the disadvantage to this is you have to implement your whole renderer twice, once as a raytracer and once for normal rendering, avoiding that is the whole point of using the hemicube render to texture method.

Was talking to Dave and randomly mentioned that I thought the new Rvalue references were not that interesting because you can basically do everything they give you using RVO and swaps. But then I did some more reading and have changed my mind. Rvalue references are awesome!

Want Speed Pass by Value. � C++Next (Dave Abrahams series on value semantics)
C++ Rvalue References Explained (good intro)
A Brief Introduction to Rvalue References (Howard E. Hinnant, Bjarne Stroustrup, and Bronek Kozicki)
Rvalue References C++0x Features in VC10, Part 2 - Visual C++ Team Blog - Site Home - MSDN Blogs
rvalue reference (the proposal document)
InformIT C++ Reference Guide The rvalue Reference Proposal, Part I

Basically it lets you write templates that know an object is a temporary, so you can mutate it or move from it without making a copy. I think one problem we sometime have as coders is that we think about our existing code and think "bah I don't need that" , but it's only because we have been so conditioned not to write code in that way because we don't have the right tools. C++0x makes it possible to do things in templates that you always wished you could do. That doesn't necessarily mean doing lots more complicated things, it means doing the little things and getting them exactly right. "auto" , "decltype" , "[[base_check]]", etc. all look pretty handy.

9/28/2010

09-28-10 - Branchless LZ77 Decoder

It occurred to me that if you do an LZ77 where the literal/match flag is sent via a run length of literals, then a run of literals is the same as a match, but the source is the next few bytes of the compressed buffer, rather than some previous location in the decompressed buffer.

That is, you're decompressing from "comp" buffer into "dest" buffer. A "match" is just a copy from the "dest" buffer, and literals are just at copy from the "comp" buffer.

So, let's say we do a byte-wise LZ77 , use one bit to flag literal or match, then 7 bits for the length. Our branchless decoder is something like :


{
    U8 control = *comp++;
    int source = control >> 7;
    int length = (control & 127) + 1;
    U8 * lit_ptr = comp;
    U8 * mat_ptr = dest - *((U16 *)comp);
    U8 * copy_from_ptr = select( lit_ptr, mat_ptr, source );
    memcpy( dest, copy_from_ptr, length );
    dest += length;
    comp += (source<<1);
}

Where "select(a,b,c)" = c ? ( a : b ); or something like that. (sometimes better implemented with a negative and; on the PC you might use cmov, on the spu you might use sel, etc.)

While this should be very fast, compression will not be awesome because the division for literals and matches is not ideal and 7 bits of length is a lot for literals but not enough for matches, and offsets are always 16 bits which is too little. We can do a slightly better version using a 256 entry lookup table for the control :


{
    U8 control = *comp++;
    int source = source_table[control];
    int length = length_table[control];
    U8 * lit_ptr = comp;
    U8 * mat_ptr = dest - *((U16 *)comp);
    U8 * copy_from_ptr = select( lit_ptr, mat_ptr, source );
    memcpy( dest, copy_from_ptr, length );
    dest += length;
    comp += (source<<1);
}

for example with the table you could let the match lengths be larger and sparse. But it would probably be better to just have a branch that reads more bytes for long lengths.

Adding things like optionally larger offsets starts to make the branchless code complex enough that eating one branch is better. If you're willing to do a lot of variable shifts it's certainly possible, for example you could grab 1 control byte, look it up in various tables. The tables tell you some # of bits for match length, some # for match offset, and some # for literal run length (they add up a multiple of 8 and use some portion of the control byte as well). Unfortunately variable shifting is untenably slow on many important platforms.

BTW one useful trick for reducing branches in your LZ decoder is to put the EOF case out in some rare case, rather than as your primary loop condition, and you change your primary loop to be an unconditional branch. On PC's this doesn't change much but on some architectures an unconditional branch is much cheaper than a conditional one, even it's predictable. That is, instead of :


while ( ! eof )
{
   .. do one decode step ..
   if ( mode 1 )
     ..
   if ( mdoe 2 )
    ...
}

You do :

for(;;)
{
   .. do one decode step ..
   if ( mode 1 )
     ..
   if ( mode 2 )
   {
     if ( rare case )
     {
        if ( eof )
          break;
     }
   }
}

Also, obviously you don't actually use "memcpy" , but whatever you use for the copy has an internal branch. And of course we have turned our branch into a tight data dependency. On some platforms that's not much better than a branch, but on many it is much better. Unfortunately unrolling doesn't help the data dependency much because of the fact that LZ can copy from its own output, so you have to wait for the copy to be done before you can start the next one (there's also an inherent LHS here, though that stall is the least of our worries).

9/21/2010

09-21-10 - Waiting on Thread Events Part 2

So we've seen the problem, let's look at some solutions.

First of all let me try to convince you that various hacky solutions won't work. Say you're using Semaphores. Instead of doing Up() to signal the event, you could do Increment(1000); That will cause the Down() to just run through immediately until it's pulled back down, so the next bunch of tests will succeed. This might in fact make the race never happen in real life, but the race is still there.

1. Put a Mutex around the Wait. The mutex just ensures that only one thread at a time is actually in the wait on the event. In particular we do :


WaitOnMessage_4( GUID )
{

    {
        MutexInScope(ReceiveMutex);
        pop everything on the receive FIFO
        mark everything I received as done
        if ( the GUID I wanted is done )
            return;
    }

    for(;;)
    {
        MutexInScope( WaiterMutex );

        {
            MutexInScope(ReceiveMutex);
            pop everything on the receive FIFO
            mark everything I received as done
            if ( the GUID I wanted is done )
                return;
        }

        Wait( ReceiveEvent );
    }
}

note that the WaiterMutex is never taken by the Worker thread, only by "main" threads. Also note that it must be held around the whole check for the GUID being done or you have another race. Also we have the separate ReceiveMutex for the pop/mark phase so that when we are in the wait we don't block other threads from checking status.

Now when multiple threads try to wait, the first will get in the mutex and actually wait on the Event, the second will try to lock the mutex and stall out there and wait that way.

The problem with this is that threads that can proceed don't always get to. It's not broken, you will make progress, but it's nowhere near optimal. Consider if N threads come in and wait on different GUIDs. Now the worker finishes something, so that wakes up a Wait and he goes around the loop which releases the Mutex - if you just so happened to wake up the one guy who could proceed, then it's great, he returns, but if you woke up the wrong guy, he will take the mutex and go back to sleep. (also note when the mutex unlocks it might immediately switch thread execution to one of the people waiting on locking it because of the Window fair scheduling heuristics). So you only have a 1/N chance of making optimal progress, and in the worst case you might make all N threads wait until all N items are done. That's bad. So let's look at other solutions :

2. Put the Event/Semaphore on each message instead of associated with the whole Receive queue. You wait on message->event , and the worker signals each message's event when it's done.

This also works fine. It is only race-free if GUIDs can only be used from one thread, if multiple threads can wait on the same GUID you are back to the same old problem. This also requires allocating a Semaphore per message, which may or may not be a big deal depending on how fine grained your work is.

On the plus side, this lets the Wait() be on the exact item you want being done, not just anything being done, so your thread doesn't wake up unnecessarily to check if its item was done.


WaitOnMessage_5A( GUID )
{
    Event ev;

    {
        MutexInScope(ReceiveMutex);
        pop everything on the receive FIFO
        mark everything I received as done
        if ( the GUID I wanted is done )
            return;

        ev = message(GUID)->event
    }

    Wait( ev );
}

we're glossing over the tricky part of this one which is the maintenance of the event on each message. Let's say we know that GUIDs are only touched by one thread at a time. Then we can do the event destruction/recycling when we receive a done message. That way we know that event is always safe for us to wait on outside of the mutex because the only person who will ever Clear or delete that event is us.

I think you can make this safe for multiple threads thusly :


WaitOnMessage_5B( GUID )
{
    Semaphore ev;

    {
        MutexInScope(ReceiveMutex);
        pop everything on the receive FIFO
        mark everything I received as done
        if ( the GUID I wanted is done )
            return;

        message(GUID)->users ++;
        ev = message(GUID)->semaphore;
    }

    Down( ev );

    {
        MutexInScope(ReceiveMutex);
        pop everything on the receive FIFO
        mark everything I received as done
        ASSERT( the GUID I wanted is done );

        message(GUID)->users --;
        if ( message(GUID)->users == 0 )
            delete message(GUID)->semaphore;
    }
}

and you make worker signal the event by doing Up(1000); you don't have to worry about that causing weird side effects, since each event is only ever signalled once, and the maximum number of possible waits on each semaphore is the number of threads. Since the semaphore accesses are protected by mutex, I think you could also do lazy semaphore creation, eg. don't make them on messages by default, just make them when somebody tries to wait on that message.

3. Make the Event per thread. You either put the ReceiveEvent in the TLS, or you just make it a local on the stack or somewhere for each thread that wants to wait. Then we just use WaitOnMessage_3 , but the ReceiveEvent is now for the thread. The tricky bit is we need to let the worker thread know what events it needs to signal.

The easiest/hackiest way is just to have the worker thread always signal N events that you determine at startup. This way will in fact work fine for many games. A slightly cleaner way is to have a list of the events that need signalling. But now you need to protect access to that with a mutex, something like :


WaitOnMessage_6( GUID , LocalEvent )
{
    {
    MutexInScope(EventListMutex)
        add LocalEvent to EventList
    }

    for(;;)
    {
        {
        MutexInScope(ReceiveMutex);
        pop everything on the receive FIFO
        mark everything I received as done
        if ( the GUID I wanted is done )
            return;
        }

        Wait( LocalEvent );
    }

    {
    MutexInScope(EventListMutex)
        remove LocalEvent from EventList
    }

}

and worker does :

    do work
    push message to receive FIFO
    
    {
    MutexInScope(EventListMutex)
        signal everything in EventList
    }

one nice thing about this is that you can push GUID with LocalEvent and then only signal events that want to hear about that specific message.

Note with the code written as-is it works but has very nasty thread thrashing. When the worker signals the events in the list, it immediately wakes up the waiting threads, which then try to lock the EventListMutex and immediately go back to sleep, which wakes the worker back up again. That's pretty shitty so a slightly better version is if the worker does :


    do work
    push message to receive FIFO
    
    TempList;
    {
    MutexInScope(EventListMutex)
        copy everything in EventList to TempList
    }
    -*-
    signal everything in TempList

this fixes our thread thrashing, but it now gives us the usual lock-free type of restrictions - events in TempList are no longer protected by the Mutex, so that memory can't be reclaimed until we are sure that the worker is no longer touching them. (in practice the easiest way to do this is just to use recycling pooled allocators which don't empty their pool until application exit). Note that if an event is recycled and gets a different id, this might signal it, but that's not a correctness error because extra signals are benign. (extra signals just cause a wasteful spin around the loop to check if your GUID is done, no big deal, not enough signals means infinite wait, which is a big deal).

NOTE : you might think there's a race at -*- ; the apparent problem is that the worker has got the event list in TempList, and it gets swapped out, then on some main thread I add myself to EventList, run through and go to sleep in the Wait. Then worker wakes up at -*- and signals TempList - and I'm not in the list to signal! Oh crap! But this can't happen because if that was the work item I needed, it was already put in the receive FIFO and I should have seen it and returned before going into the Wait.

We can also of course get rid of the Mutex on EventList completely by doing the usual lockfree gymnastics; instead make it a message passing thing :


WaitOnMessage_6B( GUID , LocalEvent )
{
    Send Message {Add,LocalEvent,GUID}

    for(;;)
    {
        {
        MutexInScope(ReceiveMutex);
        pop everything on the receive FIFO
        mark everything I received as done
        if ( the GUID I wanted is done )
            break;
        }

        Wait( LocalEvent );
    }

    Send Message {Remove,LocalEvent,GUID}
}

and worker does :

    pop message from "send FIFO" to get work
    do work
    push message to "receive FIFO"

    pop all event messages
        process Add/Remove commands 

    signal everything in EventList

but this isn't tested and I have very little confidence in it. I would want to Relacey this before using it in real code.

4. Just don't do it. Sometimes domain-specific solutions are best. In particular there's a very simple solution I can use for the current problem I'm having.

The thing that made me hit this issue is that I made my work-stealing Worklet system able to wait on IO's. In this case the "worker" is the IO service thread and the messages are IO requests. So now the main thread can wait on IO and so can the Worklets. It's important that they really go to a proper sleep if they are blocked on IO.

But I can solve it in a simple way in that case. The Worklets already have a way to go sleep when they have no work to do. So all I do is if the Worklets only have work which depends on an uncompleted IO, they push their Work back onto the main work dispatch list and go to sleep on the "I have no work" event. Then, whenever the main thread receives any IO wakeup event, it immediately goes and checks the Worklet dispatch list and sees if anything was waiting on an IO completion. If it was, then it hands out the work and wakes up some Worklets to do it.

This solution is more direct and also has the nice property of not waking up unnecessary threads and so on.

09-21-10 - Waiting on Thread Events

A small note on a common threading pattern.

You have some worker thread which takes messages and does something or other and then puts out results. You implement the message passing in some way or other, either with a mutex or a lock-free FIFO or whatever, it doesn't matter for our purposes here.

So your main thread makes little work messages, sends them to the worker, he does them, sends them back. The tricky bit comes when the main thread wants to wait on a certain message being done. Let's draw the structure :


   main      ---- send --->     worker
  thread    <--- receive ---    thread

First assume our messages have some kind of GUID. If the pointers are never recycled it could be the pointer, but generally that's a bad idea, but a pointer + a counter would be fine. So the main thread says Wait on this GUID being done. The first simple implementation would be a spin loop :


WaitOnMessage_1( GUID )
{

    for(;;)
    {
        pop everything on the receive FIFO
        mark everything I received as done
    
        is the GUID I wanted in the done set?
           if so -> return

        yield processor for a while
    }

}

and that works just fine. But now you want to be a little nicer and not just spin, you want to make your thread actually go to sleep.

Well you can do it in Win32 using Events. (on most other OS'es you would use Semaphores, but it's identical here). What you do is you have the worker thread set an event when it pushes a message, and now we can wait on it. We'll call this ReceiveEvent, and our new WaitOnMessage is :


WaitOnMessage_2( GUID )
{

    for(;;)
    {
        Clear( ReceiveEvent );

        pop everything on the receive FIFO
        mark everything I received as done
    
        is the GUID I wanted in the done set?
           if so -> return

        Wait( ReceiveEvent );
    }

}

that is, we clear the receive event so it's unsignalled, we see if our GUID is done, if not we wait until the worker has done something, then we check again. We aren't sleeping on our specific work item, but we are sleeping on the worker doing something.

In the Worker thread it does :


    pop from "send FIFO"

    do work

    push to "receive FIFO"
    
    Signal( ReceiveEvent );

note the order of the push and the Signal (Signal after push) is important or you could have a deadlock due to a race because of the Clear. While the code as-is works fine, the Clear() is considered dangerous and is actually unnecessary - if you remove it, the worst that will happen is you will run through the Wait() one time that you don't need to, which is not a big deal. Also Clear can be a little messy to implement for Semaphores in a cross-platform way. In Semaphore speak, Wait() is Down() , Signal() is Up(), and Clear() is trylock().

Okay, so this is all fine and we're happy with ourselves, until one day we try to WaitOnMessage() from two different threads. I've been talking as if we have one main thread and one worker thread, but the basic FIFO queues work fine for N "main" threads, and we may well might want to have multiple threads generating and waiting on work. So what's the problem ?

First of all, there's a race in the status check, so we put a Mutex around it :


WaitOnMessage_3A( GUID )
{

    for(;;)
    {
        Mutex 
        {
        pop everything on the receive FIFO
        mark everything I received as done
        if ( the GUID I wanted is done )
            break;
        }

        Wait( ReceiveEvent );
    }

}

you need that because one thread could come in and pop the queue but not process it, and then another thread comes in and sees its GUID is not done and goes to sleep incorrectly. We need the pop and the flagging done to act like they are atomic.

Now you might be inclined to make this safe by just putting the Mutex around the whole function. In fact that works, but it means you are holding the mutex while you go into your Wait sleep. Then if another thread comes in to check status, it will be forced to sleep too - even if its GUID is already done. So that's very bad, we don't want to block status checking while we are asleep.

Why is the _3A version still broken ? Consider this case : we have two threads, thread 1 and 2, they make their own work and send it off then each call WaitOnMessage which is :


WaitOnMessage_3( GUID )
{

    for(;;)
    {
        Mutex 
        {
        pop everything on the receive FIFO
        mark everything I received as done
        if ( the GUID I wanted is done )
            break;
        }

        [A]

        Wait( ReceiveEvent );

        [B]
    }

}

Thread 1 runs through to [A] and then swaps out. Thread 2 runs through to [B], which waits on the event, the worker thread then does all the work, sets the event, then thread 2 gets the signal and clears the event. Now thread 1 runs again at [A] and immediately goes into an infinite Wait.

D'oh !

I think this is long enough just describing the problem, so I'll get at some solutions in the next post.

9/20/2010

09-20-10 - A small followup on Fast Means

I wrote previously about various ways of tracking a local mean over time .

I read about a new way that is somewhat interesting : Probabilistic Sliding Windows. I saw this in Ryabko / Fionov where they call it "Imaginary Sliding Windows", but I like Probabilistic better.

This is for the case where you want to have a large window but don't want to store the whole thing. Recall a normal sliding window mean update is :


U8 window[WINDOW_SIZE];
int window_i = 0;
int accum = 0;

// add in head and remove tail :
accum += x_t - window[window_i];
window[window_i] = x_t;
window_i = (window_i + 1) & (WINDOW_SIZE-1);

mean = (accum/WINDOW_SIZE);

// WINDOW_SIZE is power of 2 for speed of course

that's all well and good except when you want WINDOW_SIZE to be 16k or something. In that case you can use histogram probabilitic sliding window. You keep a histogram and accumulator :

int count[256] = { 0 };
int accum = 0;

at each step you add the new symbol and randomly remove something that you have :


// randomly remove something :
r = random_from_histogram( count );
count[r] --;
accum -= r;

//add new :
count[x_t] ++;
accum += x_t;

It's very simple and obvious - instead of knowing which symbol leaves the sliding window at the tail, we generate one randomly from the histogram that we have of symbols in the window.

Now, random_from_histogram is a bit of a problem. If you just do a flat random draw :


// randomly remove something :
for(;;)
{
    int r = randmod(256);
    if ( count[r] > 0 )
    {
        count[r] --;
        accum -= r;
        break;
    }
}

then you will screw up the statistics in a funny way; you will draw unlikely symbols too often, so you will skew towards more likely symbols. Maybe you could compute exactly what that skewing is and compensate for it somehow. To do an unbiased draw you basically have to do an arithmetic decode. You generate a random in [0,accum-1) and then look it up in the histogram.

Obviously this method is not fast and is probably useless for compression, but it is an interesting idea. More generally, probabilistic approximate statistics updates are an interesting area. In fact we do this already quite a lot in all modern compressors (for example there's the issue of hash table collisions). I know some of the PAQs also do probabilistic state machine updates for frequency counts.

There's also a whole field of research on this topic that I'd never seen before. You can find it by searching for "probabilistic quantiles" or "approximate quantiles". See for example : "Approximate Counts and Quantiles over Sliding Windows" or "On Probabilistic Properties of Conditional Medians and Quantiles" (PDF) . This stuff could be interesting for compression, because tracking things like the median of a sliding window or the N most common symbols in a window are pretty useful things for compressors.

For example, what I would really like is a fast probabilistic way of keeping the top 8 most common symbols in the last 16k, and semi-accurately counting the frequency of each and the count of "not top 8".

9/16/2010

09-16-10 - Modern Arithmetic Coding from 1987

I was just reading through my old paper "New Techniques in Context Modeling and Arithmetic Encoding" because I wanted to read what I'd written about PPMCB (since I'm playing with similar things now and PPMCB is very similar to the way modern context mixers do this - lossy hashing and all that). (thank god I wrote some things down or I would have lost everything; I only wish I'd written more, so many little heuristics and experimental knowledge has been lost).

Anyway, in there I found this little nugget that I had completely forggoten about. I wrote :

    
    Step 3 uses a method introduced by F. Rubin [3], and popularized by G.V. Cormack�s DMC
    implementation [1].  It is:
    
    while ( R < 256 )
       {
       output(L>>16);
       R <<= 8;
       L = (L << 8) & 0xFFFFFF;
       if ( (L + R) > 0xFFFFFF ) R = 0xFFFFFF - L;
       }
    
    This method is approximate, and therefore produces slightly longer output than the canonical
    CACM-87 style normalization [5].  However, the output is byte-aligned, and the normalization loop is
    almost always only performed once; these properties make this method extremely fast.
    
    The line:
    if ( (L + R) > 0xFFFFFF ) R = 0xFFFFFF - L;
    is where the extra output comes from.  It can be seen that R is shrunk to less than it need be, so
    extra normalizations are needed, and more bytes are output.  However, this only happens 1/65536th of
    the time, so the excess output is negligible.
    
    
Well what do you fucking know. This is a "carryless range coder" (Rant on New Arithmetic Coders) which was apparently rediscovered 20+ years later by the russians. ( see also (Followup on the "Russian Range Coder") ). The method of reducing range to avoid the carry is due to Rubin :

F. Rubin, "Arithmetic Stream Coding Using Fixed Precision Registers", IEEE Trans. Information Theory IT-25 (6) (1979), p. 672 - 675

And I found it in Cormack's DMC which amazingly is still available at waterloo : dmc.c

Cormack in 1987 wrote in the dmc.c code :

    
    while ((max-min) < 256)
    {
        if(bit)max--;
        putchar(min >> 16);
        outbytes++;
        min = (min << 8) & 0xffff00;
        max = (max << 8) & 0xffff00;
        if (min >= max) max = 0x1000000;
    }
    
    
which is a little uglier than my version above, but equivalent (he has a lot of ugly +1/-1 stuff cuz he didn't get that quite right).

And actually in the Data Compression Using Dynamic Markov Modelling paper that goes with the dmc.c code, they describe the arithmetic coder due to Guazzo and it is in fact an "fpaq0p" style carryless arithmetic coder (it avoids carries by letting range get very small and only works on binary alphabets). I don't have the original paper due to Guazzo so I can't confirm that attribution. The one in the paper does bit by bit output, but as I've stressed before that doesn't characterize the coder, and Cormack then did bytewise output in the implementation.

Anyway, I had completely forgotten about this stuff, and it changes my previous attribution of byte-wise arithmetic coding to Schindler '98 ; apparently I knew about it in '95 and Cormack did it in '87. (The only difference between the Schindler '98 coder and the NTiCM '95 coder is that I was still doing (Sl*L)/R while Schindler moved to Sl*(L/R) which is a small but significant change).

Apparently the "russian range coder" is actually a "Rubin arithmetic coder" (eg. avoid carries by shrinking range), and the "fpaq0p binary carryless range coder" is actually a "Guazzo arithmetic coder" (avoid carries by being binary and ensuring only range >= 2).

09-16-10 - A small followup on Cumulative Probability Trees

See original note . Then :

I mentioned the naive tree where each level has buckets of 2^L sums. For clarity, the contents of that tree are :

C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 CC13 C14 C15
C0-C1 C2-C3 C4-C5 C6-C7 C8-C9 C10-C11 C12-C13 C14-C15
C0-C3 C4-C7 C8-C11 C12-C15
C0-C7 C8-C15
C0-C15

I said before :


But querying a cumprob is a bit messy. You can't just go up the tree and add, because you may already be counted in a parent. So you have to do something like :

sum = 0;

if ( i&1 ) sum += Tree[0][i-1];
i>>=1;
if ( i&1 ) sum += Tree[1][i-1];
i>>=1;
if ( i&1 ) sum += Tree[1][i-1];
..

This is O(logN) but rather uglier than we'd like. 

A few notes on this. It's easier to write the query if our array is 1 based. Then the query is equivalent to taking a value from each row where the bit of "i" is on. That is, line up the bits of "i" with the bottom bit at the top row.

That is :


sum = 0;

if ( i&1 ) sum += Tree[0][i];
if ( i&2 ) sum += Tree[1][i>>1];
if ( i&4 ) sum += Tree[2][i>>2];
...

Some SIMD instructions sets have the ability to take an N-bit mask and turn it into an N-channel mask. In that case you could compute Tree[n][i>>n] in each channel and then use "i" to mask all N, then do a horizontal sum.

Also the similarity to Fenwick trees and his way of walking the tree can be made more obvious. In particular, we obviously only have to do the sums where i has bits that are on. On modern CPU's the fastest way is just to always do 8 sums, but in the old days it was beneficial to do a minimum walk. The way you do that is by only walking to the bits that are on, something like :


sum = 0;
while ( i )
{
  int k = BitScan(i);
  sum += Tree[k][i>>k];
  i -= (1 << k);
}

we can write this more like a Fenwick query :

sum = 0;
while ( i )
{
  int b = i & -i; // isolate the bottom bit of i
  int k = BitScan(b); // b -> log2(b)
  sum += Tree[k][i>>k];
  i = i & (i-1); // same as i -= b; turns off bottom bit
}

but then we recall that we made the FenwickTree structure by taking these entries and merging them. In particular, one way to build a Fenwick Tree is to take entries from simple table and do "if my bottom bit is zero, do i>>=1 and step to the next level of the table".

What that means is when we find the bottom bit pos is at bit 'k' and we look up at (i>>k) - that is the entry that will just be in slot "i" in the Fenwick structure :


sum = 0;
while ( i )
{
  sum += FenwickTree[i];
  i = i & (i-1); // turn off bottom bit
}

so the relationship and simplification is clear.

Fast decoding would involve a branchless binary search. I leave that as an exercise for the reader.

Not that any of this is actually useful. If you're just doing order-0 "deferred summation" is better, and if you're doing high order, then special structures for small/sparse contexts are better, and if your doing low orders escaped from high order it doesn't really work because you have to handle exclusion.

A better way to do excluded order-0 would be to have a 256-bit flag chunk for excluded or not excluded, then you just keep all the character counts in an array, and you sum them on demand with SIMD using the binary-tree summing method. At each step you sum adjacent pairs, so in step one you take the 256 counts and sum neighbors and output 128. Repeat. One tricky thing is that you have to output two sums - the sum above symbol and the sum below symbol (both excluded by the flags). And the decoder is a bit tricky, but maybe you can just do that sum tree and then binary search down it. Unfortunately the ability to do horizontal pair adds (or the way to swizzle into doing that) and the ability to byte arithmetic is one of the areas where the SIMD instruction sets on the various platforms differ greatly, so you'd have to roll custom code for every case.

9/14/2010

09-14-10 - Challenges in Data Compression 2.5 - More on Sparsity

Sparsity & Temporality show up as big issues in lots of funny ways.

For example, they are the whole reason that we have to do funny hand-tweaks of compressors.

eg. exactly what you use as your context for various parts of the compressor will make a huge difference. In PPM the most sensitive part is the SEE ; in CM it looks like the most sensitive part is the contexts that are used for the mixer itself (or the APM or SSE) (not the contexts that are looked up to get statistics to mix).

In these sensitive parts of the coder, you obviously want to use as much context as possible, but if you use too much your statistics become too sparse and you start making big coding mistakes.

This is why these parts of the model have to be carefully hand tweaked; eg. use a few bits from here, compact it into a log2-ish scale, bucketize these things. You want only 10-16 bits of context or something but you want as much good information in that context as possible.

The problem is that the ideal formation of these contexts depends on the file of course, so it should be figured out adaptively.

There are various possibilities for hacky solutions to this. For example something that hasn't been explored very much in general in text compression is severely asymmetric coders. We do this in video coding for example, where the encoder spends a long time figuring things out then transmits simple commands to the decoder. So for example the encoder could do some big processing to try to figure out the ideal compaction of statistics and sends it to the decoder. (* maybe some of these do exist)

If sparsity wasn't an issue, you would just throw every bit of information you have at the model. But in fact we have tons of information that we don't use because we aren't good enough at detecting what information is useful and merging up information from various sources, and so on.

For example an obvious one is : in each context we generally store only something like the number of times each character has occurred. We might do something like scale the counts so that more recent characters count more. eg. you effictively do {all counts} *= 0.9 and then add in the count of the new character as ++. But actually we have way more information than that. We have the exact time that each character occurred (time = position in the file). And, for each time we've used the context in the past, we know what was predicted from it and whether that was right. All of that information should be useful to improve coding, but it's just too much because it makes secondary statistics too sparse.

BTW it might pop into your head that this can be attacked using the very old-school approaches to sparsity that were used in Rissanen's "Context" or DMC for example. Their approach was to use a small context, then as you see more data you split the context, so you get richer contexts over time. That does not work, because it is too conservative about not coding from sparse contexts; as I mentioned before, you cannot tell whether a sparse context is good or not from information seen in that context, you need information from an outside source, and what Context/DMC do is exactly the wrong thing - they try to use the counts within a context to decide whether it should be split or not.

09-14-10 - A small note on structured data

We have a very highly structured file at work. A while ago I discovered that I can improve compression of primitive compressors by transposing the data as if it were a matrix with rows of the structure size.

Apparently this is previously known :


J. Abel, "Record Preprocessing for Data Compression", 
Proceedings of the IEEE Data Compression Conference 2004, 
Snowbird, Utah, Storer, J.A. and Cohn, M. Eds. 521, 2004.

A small note :

He suggests finding the period of the data by looking at the most frequent character repeat distance. That is, for each byte find the last time it occurred, count that distance in a histogram, find the peak of the histogram (ignoring distances under some minimum), that's your guess of hte record length. That works pretty well, but some small improvements are definitely possible.

Lets look at some distance histograms. First of all, on non-record-structured data (here we have "book1") our expectation is that correlation roughly goes down with distance, and in fact it does :

book1

(maximum is at 4)

On record-structured data, the peaks are readily apparently. This file ("struct72") is made of 72-byte structures, hence the peaks out at 72 and 144. But we also see strong 4-8-12 correlations, as there are clearly 4-byte words in the structs :

Vertical bar chart

The digram distance histogram makes the structure even more obvious, if you ignore the peak at 1 (which is not so much due to "structure" as just strong order-1 correlation), the peak at 72 is very strong :

Vertical bar chart

When you actually run an LZ77 on the file (with min match len of 3 and optimal parse) the pattern is even stronger; here are the 16 most used offsets on one chunk :


 0 :       72 :      983
 1 :      144 :      565
 2 :      216 :      282
 3 :      288 :      204
 4 :      432 :      107
 5 :      360 :      106
 6 :      720 :       90
 7 :      504 :       88
 8 :      792 :       78
 9 :      648 :       77
10 :      864 :       76
11 :      576 :       73
12 :     1008 :       64
13 :     1080 :       54
14 :     1152 :       51
15 :     1368 :       49

Every single one is a perfect multiple of 72.

A slightly more robust way to find structure than Jurgen's approach is to use the auto-correlation of the histogram of distances. This is a well known technique from audio pitch detection. You take the "signal" which is here out histogram of distance occurances, and find the intensity of auto-correlation for each translation of the signal (this can be done using the fourier transform). You will then get strong peaks only at the fundamental modes. In particular, in our example "struct72" file you would get a peak at 72, and also a strong peak at 4, because in the fourier transform the smaller peaks at 8,12,16,20, etc. will all add onto the peak at 4. That is, it's correctly handling "harmonics". It will also detect cases where a harmonic happened to be a stronger peak than the fundamental mode. That is, the peak at 144 might have been stronger than the one at 72, in which case you would incorrectly think the fundamental record length was 144.

Transposing obviously helps with compressors that do not have handling of structured data, but it hurts compressors that inherently handle structured data themselves.

Here are some compressors on the struct72 file :


struct72                                 3,471,552

struct72.zip                             2,290,695
transpos.zip                             1,784,239

struct72.bz2                             2,136,460
transpos.bz2                             1,973,406

struct72.pmd                             1,903,783
transpos.pmd                             1,864,028

struct72.pmm                             1,493,556
transpos.pmm                             1,670,661

struct72.lzx                             1,475,776
transpos.lzx                             1,701,360

struct72.paq8o                           1,323,437
transpos.paq8o                           1,595,652

struct72.7z                              1,262,013
transpos.7z                              1,642,304

Compressors that don't handle structured data well (Zip,Bzip,PPMd) are helped by transposing. Compressors that do handle structured data specifically (LZX,PAQ,7Zip) are hurt quite a lot by transposing. LZMA is the best compressor I've ever seen for this kind of record-structured data. It's amazing that it beats PAQ considering it's an order of magnitude faster. It's also impressive how good LZX is on this type of data.

BTW I'm sure PAQ could easily be adapted to manually take a specification in which the structure of the file is specified by the user. In particular for simple record type data, you could even have a system where you give it the C struct, like :


struct Record
{
  float x,y,z;
  int   i,j,k;
  ...
};

and it parses the C struct specification to find what fields are where and builds a custom model for that.

ADDENDUM :

It's actually much clearer with mutual information.


I(X,Y) = H(X) + H(Y) - H(XY)

In particular, the mutual information for offset D is :


I(D) = 2*H( order0 ) - H( bigrams separated by distance D )

This is much slower to compute than the character repeat period, but gives much cleaner data.

On "struct72" : (note : the earlier graphs showed [1,144] , here I'm showing [1,256] so you can see peaks at 72,144 and 216).

struct72

Here's "book1" for comparison :

Vertical bar chart

(left column is actually in bits for these)

and hey, charts are fun, here's book1 after BWT :

Vertical bar chart

09-14-10 - Threaded Stdio

Many people know that fgetc is slow now because we have to link with multithreaded libs, and so it does a mutex and all that. Yes, that is true, but the solution is also trivial, you just use something like this :

#define macrogetc(_stream)     (--(_stream)->_cnt >= 0 ? ((char)*(_stream)->_ptr++) : _filbuf(_stream))

when you know that your FILE is only being used by one thread.

In general there's a nasty issue with multithreaded API design. When you have some object that you know might be accessed from various threads, should you serialize inside every access to that object? Or should you force the client to serialize on the outside?

If you serialize on the inside, you can have severe performance penalties, like with getc.

If you serialize on the outside, you make it very prone to bugs because it's easy to access without protection.

One solution is to introduce an extra "FileLock" object. So the "File" itself has no accessors except "Lock" which gives you a FileLock, and then the "getc" is on the FileLock. That way somebody can grab a FileLock and then locally do fast un-serialized access through that structure. eg:


instead of :

c = File.getc();

you have :

FileLock fl = File.lock();

c = fl.getc();

Another good solution would be to remove the buffering from the stdio FILE and introduce a "FileView" object which has a position and a buffer. Then you can have mutliple FileViews per FILE which are in different threads, but the FileViews themselves must be used only from one thread at a time. Accesses to the FILE are serialized, accesses to the FileView are not. eg :


eg. FILE * fp is shared.

Thread1 has FileView f1(fp);
Thread2 has FileView f2(fp);

FileView contains the buffer

you do :

c = f1.getc();

it can get a char from the buffer without synchronization, but buffer refill locks.

(of course this is a mess with readwrite files and such; in general I hate dealing with mutable shared data, I like the model that data is either read-only and shared, or writeable and exclusively locked by one thread).

(The FileView approach is basically what I've done for Oodle; the low-level async file handles are thread-safe, but the buffering object that wraps it is not. You can have as many buffering objects as you want on the same file on different threads).

Anyway, in practice, macrogetc is a perfectly good solution 99% of the time.

Stdio is generally very fast. You basically can't beat it except in the trivial way (trivial = use a larger buffer). The only things you can do to beat it are :

1. Double-buffer the streaming buffer filling and use async IO to fill the buffer (stdio uses synchronous IO and only fills when the buffer is empty).

2. Have an accessor like Peek(bytes) to the buffer that just gives you a pointer directly into the buffer and ensures it has bytes amount of data for you to read. This eliminates the branch to check fill on each byte input, and eliminates the memcpy from fread.

(BTW in theory you could avoid another memcpy, because Windows is doing IO into the disk cache pages, so you could just lock those and get a pointer and process from them directly. But they don't let you do this for obvious security reasons. Also if you knew in advance that your file was in the disk cache (and you were never going to use it again so you don't want it to get into the cache) you could do uncached IO, which is microscopically faster for data not in the cache, because it avoids that memcpy and page allocation. But again they don't let you ask "is this in the cache" and it's not worth sweating about).

Anyway, the point is you can't really beat stdio for byte-at-a-time streaming input, so don't bother. (on Windows, where the disk cache is pretty good, eg. it does sequential prefetching for you).

old rants