9/20/2010

09-20-10 - A small followup on Fast Means

I wrote previously about various ways of tracking a local mean over time.

I read about a new way that is somewhat interesting : Probabilistic Sliding Windows. I saw this in Ryabko / Fionov where they call it "Imaginary Sliding Windows", but I like Probabilistic better.

This is for the case where you want to have a large window but don't want to store the whole thing. Recall a normal sliding window mean update is :


U8 window[WINDOW_SIZE];
int window_i = 0;
int accum = 0;

// add in head and remove tail :
accum += x_t - window[window_i];
window[window_i] = x_t;
window_i = (window_i + 1) & (WINDOW_SIZE-1);

mean = (accum/WINDOW_SIZE);

// WINDOW_SIZE is power of 2 for speed of course

that's all well and good except when you want WINDOW_SIZE to be 16k or something. In that case you can use a histogram-based probabilistic sliding window. You keep a histogram and accumulator :

int count[256] = { 0 };
int accum = 0;

at each step you add the new symbol and randomly remove something that you have :


// randomly remove something :
r = random_from_histogram( count );
count[r] --;
accum -= r;

//add new :
count[x_t] ++;
accum += x_t;

It's very simple and obvious - instead of knowing which symbol leaves the sliding window at the tail, we generate one randomly from the histogram that we have of symbols in the window.

Now, random_from_histogram is a bit of a problem. If you just do a flat random draw :


// randomly remove something :
for(;;)
{
    int r = randmod(256);
    if ( count[r] > 0 )
    {
        count[r] --;
        accum -= r;
        break;
    }
}

then you will screw up the statistics in a funny way; you will remove unlikely symbols too often, so the histogram will skew towards the more likely symbols. Maybe you could compute exactly what that skewing is and compensate for it somehow. To do an unbiased draw you basically have to do an arithmetic decode. You generate a random number in [0,total count) and then look it up in the cumulative histogram. (The total count is the sum of count[], which just stays at the window size; note it is not "accum", which is the sum of the symbol values.)
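Here's a minimal sketch of the unbiased version, to make the "arithmetic decode" lookup concrete. The names (ProbWindow, psw_draw, psw_update, PSW_WINDOW_SIZE) are made up for illustration, rand() is just a stand-in random source (its modulo bias is ignored), and the window is assumed to simply fill up before any removal starts. The draw is a plain linear walk of the cumulative counts; a cumulative probability tree would make it O(logN).

```c
#include <assert.h>
#include <stdlib.h>

#define PSW_WINDOW_SIZE 16384   /* hypothetical window size */

typedef struct {
    int  count[256];  /* histogram of symbols currently "in" the window */
    int  total;       /* sum of count[]; stays at PSW_WINDOW_SIZE once full */
    long accum;       /* sum of the symbol values in the window */
} ProbWindow;

static void psw_init(ProbWindow *w)
{
    int i;
    for (i = 0; i < 256; i++) w->count[i] = 0;
    w->total = 0;
    w->accum = 0;
}

/* Draw symbol s with probability count[s]/total by walking cumulative counts. */
static int psw_draw(const ProbWindow *w)
{
    int target, s;
    assert(w->total > 0);
    target = rand() % w->total;   /* stand-in RNG, slight modulo bias */
    for (s = 0; s < 255; s++) {
        if (target < w->count[s]) break;
        target -= w->count[s];
    }
    return s;   /* if the loop ran out, the target lies in symbol 255 */
}

static void psw_update(ProbWindow *w, unsigned char x)
{
    if (w->total == PSW_WINDOW_SIZE) {
        /* window is full : randomly remove something we have */
        int r = psw_draw(w);
        w->count[r]--;
        w->total--;
        w->accum -= r;
    }
    /* add new */
    w->count[x]++;
    w->total++;
    w->accum += x;
}

static double psw_mean(const ProbWindow *w)
{
    return w->total ? (double)w->accum / w->total : 0.0;
}
```

Usage is just psw_init once, then psw_update(&w, x_t) per symbol and psw_mean(&w) whenever you want the estimate.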

Obviously this method is not fast and is probably useless for compression, but it is an interesting idea. More generally, probabilistic approximate statistics updates are an interesting area. In fact we do this already quite a lot in all modern compressors (for example there's the issue of hash table collisions). I know some of the PAQs also do probabilistic state machine updates for frequency counts.

There's also a whole field of research on this topic that I'd never seen before. You can find it by searching for "probabilistic quantiles" or "approximate quantiles". See for example : "Approximate Counts and Quantiles over Sliding Windows" or "On Probabilistic Properties of Conditional Medians and Quantiles" (PDF) . This stuff could be interesting for compression, because tracking things like the median of a sliding window or the N most common symbols in a window are pretty useful things for compressors.

For example, what I would really like is a fast probabilistic way of keeping the top 8 most common symbols in the last 16k, and semi-accurately counting the frequency of each and the count of "not top 8".

9/16/2010

09-16-10 - Modern Arithmetic Coding from 1987

I was just reading through my old paper "New Techniques in Context Modeling and Arithmetic Encoding" because I wanted to read what I'd written about PPMCB (since I'm playing with similar things now and PPMCB is very similar to the way modern context mixers do this - lossy hashing and all that). (thank god I wrote some things down or I would have lost everything; I only wish I'd written more, so many little heuristics and experimental knowledge has been lost).

Anyway, in there I found this little nugget that I had completely forgotten about. I wrote :

    
    Step 3 uses a method introduced by F. Rubin [3], and popularized by G.V. Cormack's DMC
    implementation [1].  It is:
    
    while ( R < 256 )
       {
       output(L>>16);
       R <<= 8;
       L = (L << 8) & 0xFFFFFF;
       if ( (L + R) > 0xFFFFFF ) R = 0xFFFFFF - L;
       }
    
    This method is approximate, and therefore produces slightly longer output than the canonical
    CACM-87 style normalization [5].  However, the output is byte-aligned, and the normalization loop is
    almost always only performed once; these properties make this method extremely fast.
    
    The line:
    if ( (L + R) > 0xFFFFFF ) R = 0xFFFFFF - L;
    is where the extra output comes from.  It can be seen that R is shrunk to less than it need be, so
    extra normalizations are needed, and more bytes are output.  However, this only happens 1/65536th of
    the time, so the excess output is negligible.
    
    
Well what do you fucking know. This is a "carryless range coder" (Rant on New Arithmetic Coders) which was apparently rediscovered 20+ years later by the russians. ( see also (Followup on the "Russian Range Coder") ). The method of reducing range to avoid the carry is due to Rubin :

F. Rubin, "Arithmetic Stream Coding Using Fixed Precision Registers", IEEE Trans. Information Theory IT-25 (6) (1979), p. 672 - 675

And I found it in Cormack's DMC which amazingly is still available at waterloo : dmc.c

Cormack in 1987 wrote in the dmc.c code :

    
    while ((max-min) < 256)
    {
        if(bit)max--;
        putchar(min >> 16);
        outbytes++;
        min = (min << 8) & 0xffff00;
        max = (max << 8) & 0xffff00;
        if (min >= max) max = 0x1000000;
    }
    
    
which is a little uglier than my version above, but equivalent (he has a lot of ugly +1/-1 stuff cuz he didn't get that quite right).

And actually in the Data Compression Using Dynamic Markov Modelling paper that goes with the dmc.c code, they describe the arithmetic coder due to Guazzo and it is in fact an "fpaq0p" style carryless arithmetic coder (it avoids carries by letting range get very small and only works on binary alphabets). I don't have the original paper due to Guazzo so I can't confirm that attribution. The one in the paper does bit by bit output, but as I've stressed before that doesn't characterize the coder, and Cormack then did bytewise output in the implementation.

Anyway, I had completely forgotten about this stuff, and it changes my previous attribution of byte-wise arithmetic coding to Schindler '98 ; apparently I knew about it in '95 and Cormack did it in '87. (The only difference between the Schindler '98 coder and the NTiCM '95 coder is that I was still doing (Sl*L)/R while Schindler moved to Sl*(L/R) which is a small but significant change).

Apparently the "russian range coder" is actually a "Rubin arithmetic coder" (eg. avoid carries by shrinking range), and the "fpaq0p binary carryless range coder" is actually a "Guazzo arithmetic coder" (avoid carries by being binary and ensuring only range >= 2).

09-16-10 - A small followup on Cumulative Probability Trees

See original note . Then :

I mentioned the naive tree where each level has buckets of 2^L sums. For clarity, the contents of that tree are :

C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15
C0-C1 C2-C3 C4-C5 C6-C7 C8-C9 C10-C11 C12-C13 C14-C15
C0-C3 C4-C7 C8-C11 C12-C15
C0-C7 C8-C15
C0-C15

I said before :


But querying a cumprob is a bit messy. You can't just go up the tree and add, because you may already be counted in a parent. So you have to do something like :

sum = 0;

if ( i&1 ) sum += Tree[0][i-1];
i>>=1;
if ( i&1 ) sum += Tree[1][i-1];
i>>=1;
if ( i&1 ) sum += Tree[2][i-1];
..

This is O(logN) but rather uglier than we'd like. 

A few notes on this. It's easier to write the query if our array is 1 based. Then the query is equivalent to taking a value from each row where the bit of "i" is on. That is, line up the bits of "i" with the bottom bit at the top row.

That is :


sum = 0;

if ( i&1 ) sum += Tree[0][i];
if ( i&2 ) sum += Tree[1][i>>1];
if ( i&4 ) sum += Tree[2][i>>2];
...

Some SIMD instruction sets have the ability to take an N-bit mask and turn it into an N-channel mask. In that case you could compute Tree[n][i>>n] in each channel and then use "i" to mask all N, then do a horizontal sum.

Also the similarity to Fenwick trees and his way of walking the tree can be made more obvious. In particular, we obviously only have to do the sums where i has bits that are on. On modern CPU's the fastest way is just to always do 8 sums, but in the old days it was beneficial to do a minimum walk. The way you do that is by only walking to the bits that are on, something like :


sum = 0;
while ( i )
{
  int k = BitScan(i);
  sum += Tree[k][i>>k];
  i -= (1 << k);
}

we can write this more like a Fenwick query :

sum = 0;
while ( i )
{
  int b = i & -i; // isolate the bottom bit of i
  int k = BitScan(b); // b -> log2(b)
  sum += Tree[k][i>>k];
  i = i & (i-1); // same as i -= b; turns off bottom bit
}

but then we recall that we made the FenwickTree structure by taking these entries and merging them. In particular, one way to build a Fenwick Tree is to take entries from the simple table and do "if my bottom bit is zero, do i>>=1 and step to the next level of the table".

What that means is when we find the bottom bit pos is at bit 'k' and we look up at (i>>k) - that is the entry that will just be in slot "i" in the Fenwick structure :


sum = 0;
while ( i )
{
  sum += FenwickTree[i];
  i = i & (i-1); // turn off bottom bit
}

so the relationship and simplification is clear.
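To check that claim concretely, here's a small sketch that builds the naive per-level table (1-based in every row), fills each Fenwick slot from the level of its bottom bit, and queries both ways. The names (naive, fenwick, build_trees, lowest_bit_index) and the 16-entry size are just for illustration; lowest_bit_index is a slow stand-in for BitScan.

```c
#include <assert.h>

#define NSYM    16   /* number of counts; power of 2 */
#define NLEVELS 5    /* log2(NSYM) + 1 */

/* naive[k][j], j 1-based : sum of the j-th bucket of 2^k counts */
static unsigned naive[NLEVELS][NSYM + 1];
/* fenwick[i], i 1-based */
static unsigned fenwick[NSYM + 1];

static int lowest_bit_index(unsigned v)   /* stand-in for BitScan; v != 0 */
{
    int k = 0;
    assert(v != 0);
    while (!(v & 1)) { v >>= 1; k++; }
    return k;
}

static void build_trees(const unsigned counts[NSYM])
{
    int k, j;
    unsigned i;
    for (j = 1; j <= NSYM; j++) naive[0][j] = counts[j - 1];
    for (k = 1; k < NLEVELS; k++)
        for (j = 1; j <= (NSYM >> k); j++)
            naive[k][j] = naive[k - 1][2 * j - 1] + naive[k - 1][2 * j];
    /* Fenwick slot i is the naive entry at the level of i's bottom bit */
    for (i = 1; i <= NSYM; i++) {
        k = lowest_bit_index(i);
        fenwick[i] = naive[k][i >> k];
    }
}

/* sum of the first i counts, i in [0,NSYM], walking only the set bits */
static unsigned query_naive(unsigned i)
{
    unsigned sum = 0;
    while (i) {
        int k = lowest_bit_index(i);
        sum += naive[k][i >> k];
        i &= i - 1;   /* turn off bottom bit */
    }
    return sum;
}

static unsigned query_fenwick(unsigned i)
{
    unsigned sum = 0;
    while (i) {
        sum += fenwick[i];
        i &= i - 1;   /* turn off bottom bit */
    }
    return sum;
}
```

Both queries return the sum of the first i counts, so query(0) is 0 and query(NSYM) is the total.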

Fast decoding would involve a branchless binary search. I leave that as an exercise for the reader.
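For what it's worth, here is one way that exercise might go; this is a sketch of the standard Fenwick descent, not necessarily what I had in mind back then. The names (fw_build, fw_decode, FW_N) are made up. The compare result is turned into a mask so there are no data-dependent branches in the loop as written (whether the compiler keeps it branchless is up to the compiler).

```c
#include <assert.h>

#define FW_N 256   /* alphabet size; power of 2 */

/* Build a 1-based Fenwick tree : tree[i] = sum of counts over (i - lowbit(i), i]. */
static void fw_build(unsigned tree[FW_N + 1], const unsigned counts[FW_N])
{
    unsigned i, parent;
    tree[0] = 0;
    for (i = 1; i <= FW_N; i++) tree[i] = counts[i - 1];
    for (i = 1; i <= FW_N; i++) {
        parent = i + (i & (0u - i));   /* next slot that also covers slot i */
        if (parent <= FW_N) tree[parent] += tree[i];
    }
}

/* Return the 0-based symbol s with cum(s) <= target < cum(s+1),
   where cum(s) = counts[0] + ... + counts[s-1]. Requires target < total. */
static unsigned fw_decode(const unsigned tree[FW_N + 1], unsigned target)
{
    unsigned pos = 0, step;
    /* start at FW_N/2 : pos + step can then never exceed FW_N - 1 */
    for (step = FW_N >> 1; step != 0; step >>= 1) {
        unsigned next = pos + step;
        unsigned mask = 0u - (unsigned)(tree[next] <= target);  /* all ones if we step */
        target -= tree[next] & mask;
        pos    += step & mask;
    }
    return pos;
}
```

After the loop, the leftover "target" is the offset within the decoded symbol's range, which is what an arithmetic decoder wants anyway.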

Not that any of this is actually useful. If you're just doing order-0, "deferred summation" is better, and if you're doing high order, then special structures for small/sparse contexts are better, and if you're doing low orders escaped from high order it doesn't really work because you have to handle exclusion.

A better way to do excluded order-0 would be to have a 256-bit flag chunk for excluded or not excluded, then you just keep all the character counts in an array, and you sum them on demand with SIMD using the binary-tree summing method. At each step you sum adjacent pairs, so in step one you take the 256 counts and sum neighbors and output 128. Repeat. One tricky thing is that you have to output two sums - the sum above symbol and the sum below symbol (both excluded by the flags). And the decoder is a bit tricky, but maybe you can just do that sum tree and then binary search down it. Unfortunately the ability to do horizontal pair adds (or the way to swizzle into doing that) and the ability to do byte arithmetic is one of the areas where the SIMD instruction sets on the various platforms differ greatly, so you'd have to roll custom code for every case.

9/14/2010

09-14-10 - Challenges in Data Compression 2.5 - More on Sparsity

Sparsity & Temporality show up as big issues in lots of funny ways.

For example, they are the whole reason that we have to do funny hand-tweaks of compressors.

eg. exactly what you use as your context for various parts of the compressor will make a huge difference. In PPM the most sensitive part is the SEE ; in CM it looks like the most sensitive part is the contexts that are used for the mixer itself (or the APM or SSE) (not the contexts that are looked up to get statistics to mix).

In these sensitive parts of the coder, you obviously want to use as much context as possible, but if you use too much your statistics become too sparse and you start making big coding mistakes.

This is why these parts of the model have to be carefully hand tweaked; eg. use a few bits from here, compact it into a log2-ish scale, bucketize these things. You want only 10-16 bits of context or something but you want as much good information in that context as possible.

The problem is that the ideal formation of these contexts depends on the file of course, so it should be figured out adaptively.

There are various possibilities for hacky solutions to this. For example something that hasn't been explored very much in general in text compression is severely asymmetric coders. We do this in video coding for example, where the encoder spends a long time figuring things out then transmits simple commands to the decoder. So for example the encoder could do some big processing to try to figure out the ideal compaction of statistics and sends it to the decoder. (* maybe some of these do exist)

If sparsity wasn't an issue, you would just throw every bit of information you have at the model. But in fact we have tons of information that we don't use because we aren't good enough at detecting what information is useful and merging up information from various sources, and so on.

For example an obvious one is : in each context we generally store only something like the number of times each character has occurred. We might do something like scale the counts so that more recent characters count more. eg. you effectively do {all counts} *= 0.9 and then add in the count of the new character as ++. But actually we have way more information than that. We have the exact time that each character occurred (time = position in the file). And, for each time we've used the context in the past, we know what was predicted from it and whether that was right. All of that information should be useful to improve coding, but it's just too much because it makes secondary statistics too sparse.

BTW it might pop into your head that this can be attacked using the very old-school approaches to sparsity that were used in Rissanen's "Context" or DMC for example. Their approach was to use a small context, then as you see more data you split the context, so you get richer contexts over time. That does not work, because it is too conservative about not coding from sparse contexts; as I mentioned before, you cannot tell whether a sparse context is good or not from information seen in that context, you need information from an outside source, and what Context/DMC do is exactly the wrong thing - they try to use the counts within a context to decide whether it should be split or not.

09-14-10 - A small note on structured data

We have a very highly structured file at work. A while ago I discovered that I can improve compression of primitive compressors by transposing the data as if it were a matrix with rows of the structure size.

Apparently this is previously known :


J. Abel, "Record Preprocessing for Data Compression", 
Proceedings of the IEEE Data Compression Conference 2004, 
Snowbird, Utah, Storer, J.A. and Cohn, M. Eds. 521, 2004.

A small note :

He suggests finding the period of the data by looking at the most frequent character repeat distance. That is, for each byte find the last time it occurred, count that distance in a histogram, find the peak of the histogram (ignoring distances under some minimum), that's your guess of the record length. That works pretty well, but some small improvements are definitely possible.
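As a rough sketch of that character repeat distance method (my reading of the description above, not Abel's actual code), something like the following would do. The function name and the min_dist / max_dist parameters are made up for illustration; it returns 0 when no repeat distance in range was seen at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Guess the record length as the most frequent distance between a byte
   and the previous occurrence of the same byte value, considering only
   distances in [min_dist, max_dist]. Returns 0 if there is no guess. */
static size_t guess_record_length(const unsigned char *buf, size_t len,
                                  size_t min_dist, size_t max_dist)
{
    size_t last_pos[256];
    int seen[256] = {0};
    size_t *hist;
    size_t i, d, best = 0;

    assert(min_dist >= 1 && min_dist <= max_dist);
    hist = calloc(max_dist + 1, sizeof *hist);
    if (!hist) return 0;

    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];
        if (seen[c]) {
            d = i - last_pos[c];
            if (d <= max_dist) hist[d]++;
        }
        seen[c] = 1;
        last_pos[c] = i;
    }

    /* hist[0] is always 0, so best stays 0 unless some distance was counted */
    for (d = min_dist; d <= max_dist; d++)
        if (hist[d] > hist[best]) best = d;

    free(hist);
    return best;
}
```

On a file like "struct72" you would call it with a minimum distance of a few bytes, so the short-range peaks don't win.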

Lets look at some distance histograms. First of all, on non-record-structured data (here we have "book1") our expectation is that correlation roughly goes down with distance, and in fact it does :

[chart : character repeat distance histogram for "book1" ; maximum is at 4]

On record-structured data, the peaks are readily apparent. This file ("struct72") is made of 72-byte structures, hence the peaks out at 72 and 144. But we also see strong 4-8-12 correlations, as there are clearly 4-byte words in the structs :

[chart : character repeat distance histogram for "struct72"]

The digram distance histogram makes the structure even more obvious, if you ignore the peak at 1 (which is not so much due to "structure" as just strong order-1 correlation), the peak at 72 is very strong :

[chart : digram repeat distance histogram for "struct72"]

When you actually run an LZ77 on the file (with min match len of 3 and optimal parse) the pattern is even stronger; here are the 16 most used offsets on one chunk :


 0 :       72 :      983
 1 :      144 :      565
 2 :      216 :      282
 3 :      288 :      204
 4 :      432 :      107
 5 :      360 :      106
 6 :      720 :       90
 7 :      504 :       88
 8 :      792 :       78
 9 :      648 :       77
10 :      864 :       76
11 :      576 :       73
12 :     1008 :       64
13 :     1080 :       54
14 :     1152 :       51
15 :     1368 :       49

Every single one is a perfect multiple of 72.

A slightly more robust way to find structure than Jurgen's approach is to use the auto-correlation of the histogram of distances. This is a well known technique from audio pitch detection. You take the "signal" which is here our histogram of distance occurrences, and find the intensity of auto-correlation for each translation of the signal (this can be done using the fourier transform). You will then get strong peaks only at the fundamental modes. In particular, in our example "struct72" file you would get a peak at 72, and also a strong peak at 4, because in the fourier transform the smaller peaks at 8,12,16,20, etc. will all add onto the peak at 4. That is, it's correctly handling "harmonics". It will also handle cases where a harmonic happened to be a stronger peak than the fundamental mode. That is, the peak at 144 might have been stronger than the one at 72, in which case just taking the histogram maximum would make you incorrectly think the fundamental record length was 144.
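A minimal sketch of the auto-correlation idea, done the dumb direct O(N^2) way rather than with a fourier transform (which is what you'd want for big histograms). The function name and min_lag parameter are made up. Note this uses the raw histogram; on real data you would probably subtract the mean first, otherwise a flat background favors short lags.

```c
#include <assert.h>
#include <stddef.h>

/* Return the lag in [min_lag, n) with the largest auto-correlation of
   hist[0..n-1], or 0 if every lag has zero correlation. */
static size_t autocorr_peak(const double *hist, size_t n, size_t min_lag)
{
    size_t lag, i, best_lag = 0;
    double best = 0.0;
    assert(min_lag >= 1);
    for (lag = min_lag; lag < n; lag++) {
        double acc = 0.0;
        for (i = 0; i + lag < n; i++)
            acc += hist[i] * hist[i + lag];
        if (acc > best) { best = acc; best_lag = lag; }
    }
    return best_lag;
}
```

For example, with peaks of 50, 100 and 40 at distances 72, 144 and 216, the histogram maximum is at 144, but the correlation at lag 72 is 50*100 + 100*40 = 9000 versus 50*40 = 2000 at lag 144, so the fundamental at 72 wins.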

Transposing obviously helps with compressors that do not have handling of structured data, but it hurts compressors that inherently handle structured data themselves.
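For reference, the transpose itself is trivial. This sketch (function name made up) treats the buffer as num_records rows of record_size bytes and writes it out column by column; it assumes the length is an exact multiple of the record size, so any trailing partial record is the caller's problem.

```c
#include <assert.h>
#include <stddef.h>

/* Treat src as num_records rows of record_size bytes and write the
   columns out one after another : all the byte 0's, then all the byte 1's, etc. */
static void transpose_records(unsigned char *dst, const unsigned char *src,
                              size_t num_records, size_t record_size)
{
    size_t r, c;
    assert(dst != src);   /* not an in-place transpose */
    for (r = 0; r < num_records; r++)
        for (c = 0; c < record_size; c++)
            dst[c * num_records + r] = src[r * record_size + c];
}
```

The inverse is the same function with the two sizes swapped, eg. transpose_records(out, transposed, record_size, num_records).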

Here are some compressors on the struct72 file :


struct72                                 3,471,552

struct72.zip                             2,290,695
transpos.zip                             1,784,239

struct72.bz2                             2,136,460
transpos.bz2                             1,973,406

struct72.pmd                             1,903,783
transpos.pmd                             1,864,028

struct72.pmm                             1,493,556
transpos.pmm                             1,670,661

struct72.lzx                             1,475,776
transpos.lzx                             1,701,360

struct72.paq8o                           1,323,437
transpos.paq8o                           1,595,652

struct72.7z                              1,262,013
transpos.7z                              1,642,304

Compressors that don't handle structured data well (Zip,Bzip,PPMd) are helped by transposing. Compressors that do handle structured data specifically (LZX,PAQ,7Zip) are hurt quite a lot by transposing. LZMA is the best compressor I've ever seen for this kind of record-structured data. It's amazing that it beats PAQ considering it's an order of magnitude faster. It's also impressive how good LZX is on this type of data.

BTW I'm sure PAQ could easily be adapted to manually take a specification in which the structure of the file is specified by the user. In particular for simple record type data, you could even have a system where you give it the C struct, like :


struct Record
{
  float x,y,z;
  int   i,j,k;
  ...
};

and it parses the C struct specification to find what fields are where and builds a custom model for that.

ADDENDUM :

It's actually much clearer with mutual information.


I(X,Y) = H(X) + H(Y) - H(XY)

In particular, the mutual information for offset D is :


I(D) = 2*H( order0 ) - H( bigrams separated by distance D )

This is much slower to compute than the character repeat period, but gives much cleaner data.

On "struct72" : (note : the earlier graphs showed [1,144] , here I'm showing [1,256] so you can see peaks at 72,144 and 216).

[chart : mutual information vs. offset D in [1,256] for "struct72"]

Here's "book1" for comparison :

[chart : mutual information vs. offset D for "book1"]

(left column is actually in bits for these)

and hey, charts are fun, here's book1 after BWT :

[chart : mutual information vs. offset D for "book1" after BWT]

09-14-10 - Threaded Stdio

Many people know that fgetc is slow now because we have to link with multithreaded libs, and so it does a mutex and all that. Yes, that is true, but the solution is also trivial, you just use something like this :

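#define macrogetc(_stream)     (--(_stream)->_cnt >= 0 ? (0xff & *(_stream)->_ptr++) : _filbuf(_stream))

when you know that your FILE is only being used by one thread. (Note the 0xff mask rather than a cast to char; with a signed char cast, a 0xFF byte would come back as -1 and look like EOF.)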

In general there's a nasty issue with multithreaded API design. When you have some object that you know might be accessed from various threads, should you serialize inside every access to that object? Or should you force the client to serialize on the outside?

If you serialize on the inside, you can have severe performance penalties, like with getc.

If you serialize on the outside, you make it very prone to bugs because it's easy to access without protection.

One solution is to introduce an extra "FileLock" object. So the "File" itself has no accessors except "Lock" which gives you a FileLock, and then the "getc" is on the FileLock. That way somebody can grab a FileLock and then locally do fast un-serialized access through that structure. eg:


instead of :

c = File.getc();

you have :

FileLock fl = File.lock();

c = fl.getc();

Another good solution would be to remove the buffering from the stdio FILE and introduce a "FileView" object which has a position and a buffer. Then you can have multiple FileViews per FILE which are in different threads, but the FileViews themselves must be used only from one thread at a time. Accesses to the FILE are serialized, accesses to the FileView are not. eg :


eg. FILE * fp is shared.

Thread1 has FileView f1(fp);
Thread2 has FileView f2(fp);

FileView contains the buffer

you do :

c = f1.getc();

it can get a char from the buffer without synchronization, but buffer refill locks.

(of course this is a mess with readwrite files and such; in general I hate dealing with mutable shared data, I like the model that data is either read-only and shared, or writeable and exclusively locked by one thread).

(The FileView approach is basically what I've done for Oodle; the low-level async file handles are thread-safe, but the buffering object that wraps it is not. You can have as many buffering objects as you want on the same file on different threads).

Anyway, in practice, macrogetc is a perfectly good solution 99% of the time.

Stdio is generally very fast. You basically can't beat it except in the trivial way (trivial = use a larger buffer). The only things you can do to beat it are :

1. Double-buffer the streaming buffer filling and use async IO to fill the buffer (stdio uses synchronous IO and only fills when the buffer is empty).

2. Have an accessor like Peek(bytes) to the buffer that just gives you a pointer directly into the buffer and ensures it has bytes amount of data for you to read. This eliminates the branch to check fill on each byte input, and eliminates the memcpy from fread.
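To show what I mean by the Peek(bytes) style of accessor in item 2, here's a small sketch. The names (PeekReader, peek, peek_advance, PEEK_BUF_SIZE) are made up, and to keep it portable it just refills with fread, so it only illustrates the interface; a real version would fill the buffer straight from the OS read call (and could do the async double-buffering of item 1).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PEEK_BUF_SIZE 65536   /* hypothetical buffer size */

typedef struct {
    FILE *fp;
    unsigned char buf[PEEK_BUF_SIZE];
    size_t pos;   /* next unread byte in buf */
    size_t end;   /* one past the last valid byte in buf */
} PeekReader;

static void peek_init(PeekReader *r, FILE *fp)
{
    r->fp = fp;
    r->pos = r->end = 0;
}

/* Ensure `bytes` contiguous bytes are buffered and return a pointer to them.
   *avail gets the number actually available (less than `bytes` only at EOF). */
static const unsigned char *peek(PeekReader *r, size_t bytes, size_t *avail)
{
    assert(bytes <= PEEK_BUF_SIZE);
    if (r->end - r->pos < bytes) {
        size_t have = r->end - r->pos;
        memmove(r->buf, r->buf + r->pos, have);   /* slide unread tail to the front */
        r->pos = 0;
        r->end = have + fread(r->buf + have, 1, PEEK_BUF_SIZE - have, r->fp);
    }
    *avail = r->end - r->pos;
    return r->buf + r->pos;
}

/* Consume bytes that were previously returned by peek. */
static void peek_advance(PeekReader *r, size_t bytes)
{
    assert(bytes <= r->end - r->pos);
    r->pos += bytes;
}
```

The client does one peek for a chunk, reads through the returned pointer with no per-byte fill check, then calls peek_advance with however much it consumed.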

(BTW in theory you could avoid another memcpy, because Windows is doing IO into the disk cache pages, so you could just lock those and get a pointer and process from them directly. But they don't let you do this for obvious security reasons. Also if you knew in advance that your file was not in the disk cache (and you were never going to use it again so you don't want it to get into the cache) you could do uncached IO, which is microscopically faster for data not in the cache, because it avoids that memcpy and page allocation. But again they don't let you ask "is this in the cache" and it's not worth sweating about).

Anyway, the point is you can't really beat stdio for byte-at-a-time streaming input, so don't bother. (on Windows, where the disk cache is pretty good, eg. it does sequential prefetching for you).

9/12/2010

09-12-10 - The deficiency of Windows' multi-processor scheduler

Windows' scheduler is generally pretty good, but there appears to be a bit of shittiness with its behavior on multicore systems. I'm going to go over everything I know about it and then we'll hit the bad point at the end.

Windows as I'm sure you know is a strictly exclusive high priority scheduler. That means if there are higher priority threads that can run, lower priority threads will get ZERO cpu time, not just less cpu time. (* this is not quite true because of the "balanced set manager" , see later).

Windows' scheduler is generally "fair", and most of its lists are FIFO. The scheduler is preemptive and happens immediately upon events that might change the scheduling conditions. That is, there's no delay or no master scheduler code that has to run. For example, when you call SetThreadPriority() , if that affects scheduling the changes will happen immediately (eg. if you make the current thread lower than some other thread that can run, you will stop running right then, not at the end of your quantum). Changing processor affinity, Waits, etc. all cause immediate reschedule. Waits, Locks, etc. always put your thread onto a FIFO list.

The core of the scheduler is a list of threads for each priority, on each processor. There are 32 priorities, 16 fixed "realtime" priorities, and 16 "dynamic" priorities (for normal apps). Priority boosts (see later) only affect the dynamic priority group.

When there are no higher priority threads that can run, your thread will run for a quantum (or a few, depending on quantum boosts and other things). When quantum is done you might get switched for another thread of same priority. Quantum elapsed decision is now based on CPU cycles elapsed (TSC), not the system timer interrupt (change was made in Vista or sumfin), and is based on actual amount of CPU cycles your thread got, so if your cycles are being stolen you might run for more seconds than a "quantum" might indicate.

Default quantum is an ungodly 10-15 millis. But amusingly enough, there's almost always some app on your system that has done timeBeginPeriod(1), which globally sets the system quantum down to 1 milli, so I find that I almost always see my system quantum at 1 milli. (this is a pretty gross thing BTW that apps can change the scheduler like that, but without it the quantum is crazy long). (really instead of people doing timeBeginPeriod(1) what they should have done is make the threads that need more responsiveness be very high priority and then go to sleep when not getting input or events).

So that's the basics and it all sounds okay. But then Windows has a mess of hacks designed to fix problems. Basically these are problems with people writing shitty application code that doesn't thread right, and they're covering everyone's mistakes, and really they do a hell of a good job with it. The thing they do is temporary priority boosts and temporary quantum boosts.

Threads in the foreground process always get a longer quantum (something like 60 millis on machines with default quantum length!). Priority boosts affect a thread temporarily, each quantum the priority boost wears off one step, and the thread gets extra quantums until the priority boost is gone. So a boost of +8 gives you 8 extra quantums to run (and you run at higher priority). You get priority boosts for :


 IO completion (+variable)

 wakeup from Wait (+1)
   special boost for foreground process is not disableable

 priority inversion (boosts owner of locked resource that other thread is waiting on)
   this is done periodically in the check for deadlock
   boosts to 14

 GUI thread wakeup (+2) eg. from WaitMsg

 CPU starvation ("balanced set manager")
    temporary kick to 15

boost for IO completion is :
    disk is 1
    network 2
    key/mouse input 6
    sound 8
(boost amount comes from the driver at completion time)

Look at, for example, the GUI thread wakeup boost. What really should have been happening there is you should have made a GUI message processing thread that was super minimal and very high priority. It should have just done WaitMsg to get GUI input and then responded to it as quickly as possible, maybe queued some processing on a normal priority thread, then gone back to sleep. The priority boost mechanism is basically emulating this for you.

A particularly nasty example is the priority inversion boost. When a low priority thread is holding a mutex and a high priority thread tries to lock it, the high priority thread goes to sleep, but the low priority thread might never run if there are medium priority threads that want to run, so the high priority thread will be stuck forever. To fix this, Windows checks for this case in its deadlock checker. All of the "INFINITE" waits in windows are not actually infinite - they wait for 5 seconds or so (delay is settable in the registry), after which time Windows checks them for being stalled out and might give you a "process deadlocked" warning; in this check it looks for the priority inversion case and if it sees it, it gives the low priority thread the big boost to 14. This has the weird side effect of making the low priority thread suddenly get a bunch of quanta and high priority.

The other weird case is the "balanced set manager". This thing is really outside of the normal scheduler; it sort of sits on the side and checks every so often for threads that aren't running at all and gives them a temporary kick to 15. This kick is different than the others in that it doesn't decay (it would get 15 quanta which is a lot), it just runs a few quanta at 15 then goes back to its normal priority.

You can use SetThreadPriorityBoost to disable some of these (such as IO completion) but not all (the special foreground process stuff for example is not disabled by this, and probably not the balanced set or priority inversion stuff either is my guess).

I'm mostly okay with this boosting shit, but it does mean that actually accounting for what priority your thread has exactly and how long you expect it to run is almost impossible in windows. Say you make some low-priority worker thread at priority 6 and your main thread at priority 8. Is your main thread actually higher priority than the background worker? Well, no, not if he got boosted for any of various reasons, he might actually be at higher priority right now.

Okay, that's the summary of the normal scheduler, now let's look at the multi-processor deficiency.

You can understand what's going on if you see what their motivation was. Basically they wanted scheduling on each individual processor to still be as fast as possible. To do this, each processor gets its own scheduling data; that is there are the 32 priority lists of threads on each processor. When a processor has to do some scheduling, it tries to just work on its own list and schedule on itself. In particular, simple quantum expiration thread switches can happen just on the local processor without taking a system-wide lock. (* this was changed for Vista/Win7 or sumpin; old versions of the SMP scheduler took a system-wide spinlock to dispatch; see references at bottom).

(mutex/event waits take the system-wide dispatch lock because there may be threads from other processors on the wait lists for those resources)

But generally Windows does not move threads around between processors, and this is what can create the problems. In particular, there is absolutely zero code to try to distribute the work load evenly. That's up to you, or luck. Even just based on priorities it might not run the highest priority threads. Let's look at some details :

Each thread has the idea of an "ideal" processor. This is set at thread creation in a global round-robin and a processor round-robin. This is obviously a hacky attempt to balance load which is sometimes good enough. In particular if you create a thread per core, this will give you an "ideal" processor on each core, so that's what you want. It also does assign them "right" for hyperthreading, that is to minimize thread overlap, eg. it will assign core0-hyperthread0 then core1-hyperthread0 then core0-hyperthread1 then core1-hyperthread1. You can also change it manually with SetThreadIdealProcessor , but it seems to me it mostly does the right thing so there's not much need for that. (note this is different than Affinity , which forces you to run only on those processors , you don't have to run on your ideal proc, but we will see some problems later).

When the scheduler is invoked and wants to run a thread - there's a very big difference between the cases when there exist any idle processors or not. Let's look at the two cases :

If there are any idle processors : it pretty much does exactly what you would want. It tries to make use of the idle processors. It prefers whole idle cores over just idle hyperthreads. Then from among the set of good choices it will prefer your "ideal", then the ones it last ran on and the one the scheduler is running on right now. Pretty reasonable.

If there are not any idle processors : this is the bad stuff. The thread is only scheduled against its "ideal" processor. Which means it just gets stuffed on the "ideal" processor's list of threads and then schedules on that processor by normal single-proc rules.

This has a lot of bad consequences. Say proc 0 is running something at priority 10. Proc 1-5 are all running something at priority 2. You are priority 6 and you want to run, but your ideal proc happened to be 0. Well, tough shit, you get stuck on proc 0's list and you don't run even though there's lots of priority 2 work getting done.
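To make that scenario concrete, here is a toy model of the rule as I've described it. This is a sketch and absolutely not the real dispatcher code: toy_pick_processor, toy_runs_now, NUM_PROCS and PRIO_IDLE are all made-up names, and the idle-processor preference is reduced to "ideal first, then the first idle one".

```c
#include <stdbool.h>

#define NUM_PROCS 6
#define PRIO_IDLE (-1)  /* hypothetical marker : nothing is running on this processor */

/* Toy model only, NOT the real Windows dispatcher :
   if any processor is idle, use one (preferring the ideal processor);
   otherwise the thread just goes on its ideal processor's list. */
static int toy_pick_processor(const int running_prio[NUM_PROCS], int ideal)
{
    if (running_prio[ideal] == PRIO_IDLE)
        return ideal;
    for (int p = 0; p < NUM_PROCS; p++)
        if (running_prio[p] == PRIO_IDLE)
            return p;
    return ideal; /* no idle processors : ideal proc only */
}

/* The thread runs right away only if the chosen processor is idle
   or is running something of lower priority. */
static bool toy_runs_now(const int running_prio[NUM_PROCS], int ideal, int my_prio)
{
    int p = toy_pick_processor(running_prio, ideal);
    return running_prio[p] == PRIO_IDLE || my_prio > running_prio[p];
}
```

With running priorities { 10, 2, 2, 2, 2, 2 }, a priority 6 thread whose ideal proc is 0 does not run, while the same thread with ideal proc 1 preempts immediately.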

This can happen just by bad luck, when you happen to run into other apps' threads that have the same ideal proc as you. But it can also happen due to Affinity masks. If you or somebody else has got a thread locked to some proc via the Affinity mask, and your "ideal" processor is set to try to schedule you there, you will run into them over and over.

The other issue is that even when threads are redistributed, it only happens at thread ready time. Say you have 5 processors that are idle, on the other processor there is some high priority thread gobbling the CPU. There can be threads waiting to run on that processor just sitting there. They will not get moved over to the idle processors until somebody kicks a scheduling event that tries to run them (eg. the high priority guy goes to sleep or one of the boosting mechanisms kicks in). This is a transient effect and you should never have long term scenarios where one processor is overloaded and other processors are idle, but it can happen for a few quanta.

References :

Windows Administration Inside the Windows Vista Kernel Part 1
Windows Administration Inside the Windows Vista Kernel Part 2
Sysinternals Freeware - Information for Windows NT and Windows 2000 - Publications
Inside the Windows NT Scheduler, Part 1 (access locked)
Available here : "Inside the Windows NT Scheduler" .doc version but make sure you read the errata

Stupid annoying audio-visual formatted information. (god damnit, just use plain text people) :

Mark Russinovich Inside Windows 7 Going Deep Channel 9
Dave Probert Inside Windows 7 - User Mode Scheduler (UMS) Going Deep Channel 9
Arun Kishan Inside Windows 7 - Farewell to the Windows Kernel Dispatcher Lock Going Deep Channel 9

(I haven't actually watched these, so if there's something good in them, please let me know, in text form).

Also "Windows Internals" book, 5th edition.

ADDENDUM : The real interesting question is "can we do anything about it?". In particular, can you detect cases where you are badly load-balanced and kick Windows in some way to adjust things? Obviously you can do some things to load-balance within your own app, but when other threads are taking time from you it becomes trickier.

09-12-10 - PPM vs CM

Let me do a quick sketch of how PPM works vs how CM works to try to highlight the actual difference, sort of like I did for Huffman to Arithmetic , but I won't do as good of a job.

PPM :

make a few contexts of previous characters
  order4 = * ((U32 *)(ptr-4));
  etc.. 

look up observed counts from each context :

counts_o4 = lookup_stats( order4 );
counts_o2 = lookup_stats( order2 );
counts_o1 = lookup_stats( order1 );
counts_o0 = lookup_stats( order0 );

estimate escape probability from counts at each context :

esc_o4 = estimate_escape( order4 );
esc_o2 = estimate_escape( order2 );
...

code in order from most likely best to least :

if ( arithmetic_code( (1-esc_o4) * counts_o4 ) ) return; else arithmetic_code( esc_o4 );
exclude counts_o4
if ( arithmetic_code( (1-esc_o2) * counts_o2 ) ) return; else arithmetic_code( esc_o2 );
exclude counts_o2
...

update counts :
counts_o4 += sym;
...

Now let's do context mixing :

CM :

make a few contexts of previous characters
  order4 = * ((U32 *)(ptr-4));
  etc.. 

look up observed counts from each context :

counts_o4 = lookup_stats( order4 );
counts_o2 = lookup_stats( order2 );
counts_o1 = lookup_stats( order1 );
counts_o0 = lookup_stats( order0 );

estimate weights from counts at each context :

w_o4 = estimate_weight( order4 );
w_o2 = estimate_weight( order2 );
...

make blended counts :
counts = w_o4 * counts_o4 + w_o2 * counts_o2 + ...

now code :
arithmetic_code( counts );
...

update counts :
counts_o4 += sym;
...

It should be clear we can put them together :

make a few contexts of previous characters
  order4 = * ((U32 *)(ptr-4));
  etc.. 

look up observed counts from each context :

counts_o4 = lookup_stats( order4 );
counts_o2 = lookup_stats( order2 );
counts_o1 = lookup_stats( order1 );
counts_o0 = lookup_stats( order0 );
if ( CM )
{
    estimate weights from counts at each context :

    w_o4 = estimate_weight( order4 );
    w_o2 = estimate_weight( order2 );
    ...

    make blended counts :
    counts = w_o4 * counts_o4 + w_o2 * counts_o2 + ...

    // now code :
    arithmetic_code( counts );
    ...
}
else PPM
{
    estimate escape probability from counts at each context :

    esc_o4 = estimate_escape( order4 );
    esc_o2 = estimate_escape( order2 );
    ...

    code in order from most likely best to least :

    if ( arithmetic_code( (1-esc_o4) * counts_o4 ) ) return; else arithmetic_code( esc_o4 );
    exclude counts_o4
    if ( arithmetic_code( (1-esc_o2) * counts_o2 ) ) return; else arithmetic_code( esc_o2 );
    exclude counts_o2
    ... 
}

update counts :
counts_o4 += sym;
...

In particular if we do our PPM in a rather inefficient way we can make them very similar :

make a few contexts of previous characters
  order4 = * ((U32 *)(ptr-4));
  etc.. 

look up observed counts from each context :

counts_o4 = lookup_stats( order4 );
counts_o2 = lookup_stats( order2 );
counts_o1 = lookup_stats( order1 );
counts_o0 = lookup_stats( order0 );

accumulate into blended_counts :
blended_counts = 0;
if ( CM )
{
    estimate weights from counts at each context :

    w_o4 = estimate_weight( order4 );
    w_o2 = estimate_weight( order2 );
    ...


}
else PPM
{
    estimate escape probability from counts at each context :

    esc_o4 = estimate_escape( order4 );
    esc_o2 = estimate_escape( order2 );
    ...

    do exclude :
    exclude counts_o4 from counts_o2
    exclude counts_o4,counts_o2 from counts_o1
    ...

    make weights :

    w_o4 = (1 - esc_o4);
    w_o2 = esc_o4 * (1 - esc_o2);
    ...

}

make blended counts :
blended_counts += w_o4 * counts_o4;
blended_counts += w_o2 * counts_o2;
...

arithmetic_code( blended_counts );

update counts :
counts_o4 += sym;
...

Note that I haven't mentioned whether we are doing binary alphabet or large alphabet or any other practical issues, because it doesn't affect the algorithm in a theoretical way.
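Here's a small runnable sketch of the "PPM as a blend" form, assuming a 4 symbol alphabet, just two orders, and a flat fallback over symbols that nothing has seen. The escape estimate (distinct / (total + distinct)) is picked only for illustration, and toy_escape and ppm_as_blend are made-up names, not anything from a real PPM.

```c
#define ALPHABET 4  /* tiny alphabet, for illustration only */

/* Illustrative escape estimate : distinct / (total + distinct).
   An empty context always escapes. */
static double toy_escape(const double counts[ALPHABET])
{
    double total = 0.0, distinct = 0.0;
    for (int s = 0; s < ALPHABET; s++) {
        total += counts[s];
        if (counts[s] > 0.0) distinct += 1.0;
    }
    return (total > 0.0) ? distinct / (total + distinct) : 1.0;
}

/* PPM escaping written as a weighted blend of the excluded counts. */
static void ppm_as_blend(const double counts_o4[ALPHABET],
                         const double counts_o2[ALPHABET],
                         double probs[ALPHABET])
{
    double o2_excl[ALPHABET];
    double o4_total = 0.0, o2_total = 0.0;
    int n_unseen = 0;

    /* do exclude : drop from o2 anything that o4 has already seen */
    for (int s = 0; s < ALPHABET; s++) {
        o4_total += counts_o4[s];
        o2_excl[s] = (counts_o4[s] > 0.0) ? 0.0 : counts_o2[s];
        o2_total += o2_excl[s];
        if (counts_o4[s] == 0.0 && o2_excl[s] == 0.0) n_unseen++;
    }

    /* make weights from the escape probabilities */
    double esc_o4 = toy_escape(counts_o4);
    double esc_o2 = toy_escape(o2_excl);
    double w_o4   = 1.0 - esc_o4;
    double w_o2   = esc_o4 * (1.0 - esc_o2);
    double w_flat = esc_o4 * esc_o2; /* wasted if every symbol was already seen */

    /* make blended probabilities */
    for (int s = 0; s < ALPHABET; s++) {
        double p = 0.0;
        if (o4_total > 0.0) p += w_o4 * counts_o4[s] / o4_total;
        if (o2_total > 0.0) p += w_o2 * o2_excl[s] / o2_total;
        if (counts_o4[s] == 0.0 && o2_excl[s] == 0.0 && n_unseen > 0)
            p += w_flat / n_unseen;
        probs[s] = p;
    }
}
```

With counts_o4 = {3,1,0,0} and counts_o2 = {5,2,2,0} this gives {1/2, 1/6, 2/9, 1/9}, which is what you get from walking the escape chain by hand.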

While I'm at it, let me take the chance to mark up the PPM pseudocode with where "modern" PPM differs from "classical" PPM : (by "modern" I mean 2002/PPMii and by "classical" I mean 1995/"before PPMZ").

PPM :

make a few contexts of previous characters
  order4 = * ((U32 *)(ptr-4));
  etc.. 
also make non-continuous contexts like
skip contexts : AxBx
contexts containing only a few top bits from each byte
contexts involving a word dictionary
contexts involving current position in the stream 

look up observed counts from each context :

counts_o4 = lookup_stats( order4 );
counts_o2 = lookup_stats( order2 );
counts_o1 = lookup_stats( order1 );
counts_o0 = lookup_stats( order0 );

possibly rescale counts using a "SEE" like operator
eg. use counts as an observation which you then model to predict coding probabilities

estimate escape probability from counts at each context :

esc_o4 = estimate_escape( order4 );
esc_o2 = estimate_escape( order2 );
...
secondary estimate escape using something like PPMZ
also not just using current context but also other contexts and side information

code in order from most likely best to least :

use LOE to choose best order to start from, not necessarily the largest context
also don't skip down through the full set, rather choose a reduced set

if ( arithmetic_code( (1-esc_o4) * counts_o4 ) ) return; else arithmetic_code( esc_o4 );
exclude counts_o4
if ( arithmetic_code( (1-esc_o2) * counts_o2 ) ) return; else arithmetic_code( esc_o2 );
exclude counts_o2
...

update counts :
counts_o4 += sym;
...
do "partial exclusion" like PPMii, do full update down to coded context
  and then reduced update to parents to percolate out information a bit
do "inheritance" like PPMii - novel contexts updated from parents
do "fuzzy updates" - don't just update your context but also neighbors
  which are judged to be similar in some way


And that's all folks.

09-12-10 - Context Weighting vs Escaping

The defining characteristic of PPM (by modern understanding) is that you select a context, try to code from it, and if you can't you escape to another context. By contrast, context weighting selects multiple contexts and blends the probabilities. These are actually not as different as they seem, because escaping is the same as multiplying probabilities. In particular :

    context 0 gives probabilities P0 (0 is the deeper context)
    context 1 gives probabilities P1 (1 is the parent)
    how do I combine them ?

    escaping :

    P = (1 - E) * P0 + E * (P1 - 0)

    P1 - 0 = Probabilities from 1 with chars from 0 excluded

    weighting :

    P = (1 - W) * P0 + W * P1

    with no exclusion in P1

So the only difference is the exclusion. Specifically, the probabilities of non-shared symbols are the same, but the probabilities of symbols that occur in both contexts are different. The flaw with escaping is probably that it gives too low a weight to symbols that occur in both contexts. More generally you should probably be considering something like :

    I = intersection of contexts 0 and 1

    P = a * (P0 - I) + b * (P1 - I) + c * PI

that is, some weighting for the unique parts of each context and some weighting for the intersection.
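To see the escaping vs weighting difference with numbers, here is a minimal sketch of the two formulas side by side. It assumes that exclusion means zeroing the symbols seen in context 0 and renormalizing what's left of P1; mix_weighting, mix_escaping and NSYM are made-up names.

```c
#define NSYM 4  /* tiny alphabet, for illustration only */

/* weighting : P = (1 - W) * P0 + W * P1 , no exclusion */
static void mix_weighting(const double p0[NSYM], const double p1[NSYM],
                          double w, double out[NSYM])
{
    for (int s = 0; s < NSYM; s++)
        out[s] = (1.0 - w) * p0[s] + w * p1[s];
}

/* escaping : P = (1 - E) * P0 + E * (P1 - 0)
   Assumption : (P1 - 0) zeroes every symbol that context 0 has seen
   and renormalizes the rest of P1 to sum to one. */
static void mix_escaping(const double p0[NSYM], const double p1[NSYM],
                         double e, double out[NSYM])
{
    double kept = 0.0; /* mass of P1 that survives the exclusion */
    for (int s = 0; s < NSYM; s++)
        if (p0[s] == 0.0) kept += p1[s];

    for (int s = 0; s < NSYM; s++) {
        double p1_excl = (p0[s] == 0.0 && kept > 0.0) ? p1[s] / kept : 0.0;
        out[s] = (1.0 - e) * p0[s] + e * p1_excl;
    }
}
```

With P0 = {0.75, 0.25, 0, 0}, P1 = {0.5, 0.25, 0.125, 0.125} and E = W = 0.25, the shared symbol 0 gets 0.5625 from escaping but 0.6875 from weighting. Under the renormalizing assumption, the mass that escaping takes away from shared symbols goes to the symbols seen only in the parent.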

Note that PPMii is sort of trying to compensate for this because when it creates context 0, it seeds it with counts from context 1 in the overlap.

Which obviously raises the question : rather than the context initialization of PPMii, why not just look in your parent and take some of the counts for shared symbols?

(note that there are practical issues about how you compute your weight amount or escape probability, and how exactly you mix, but those methods could be applied to either PPM or CW, so they aren't a fundamental difference).

09-12-10 - Challenges in Data Compression 1 - Finite State Correlations

One of the classic shortcomings of all known data compressors is that they can only model "finite context" information and not "finite state" data. It's a little obtuse to make this really formally rigorous, but you could say that structured data is data which can be generated by a small "finite state" machine, but cannot be generated by a small "finite context" machine. (or I should say, transmitting the finite state machine to generate the data is much smaller than transmitting the finite context machine to generate the data, along with selections of probabilistic transitions in each machine).

For example, maybe you have some data where after each occurrence of 011 it becomes more likely to repeat. To model that with an FSM you only need a state for 011, and it loops back to itself and increases P. To model it with finite contexts you need an 011 state, an 011011 , 011011011 , etc. But you might also have correlations like : every 72 bytes there is a dword which is equal to the dword at -72 bytes xor'ed with the dword at -68 bytes, plus a random number which is usually small.

The point is not that these correlations are impossible to model using finite contexts, but the correct contexts to use at each spot might be infinitely large.

Furthermore, you can obviously model FSM's by hard-coding them into your compressor. That is, you assume a certain structure and make a model of that FSM, and then context code from that hard-coded FSM. What we can't do is learn new FSM's from the data.

For example, say you have data that consists of a random dword, followed by some unknown number of 0's, and then that same dword repeated, like

 
    DEADBEEF0000000000000000DEADBEEF

you can model this perfectly with an FCM if you create a special context where you cut out the run of zeros. So you make a context like

    DEADBEEF00

and then if you keep seeing zeros you leave the context alone, if it's not a zero you just use normal FCM (which will predict DEADBEEF). What you've done here is to hard-code the finite state structure of the data into your compressor so that you can model it with finite contexts.
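A minimal sketch of that hard-coded context might look like this. context_skipping_zeros is a made-up helper, and it simplifies things a bit : it doesn't carry the trailing 00 flag, it just skips back over whatever zero run ends at the current position and uses the dword in front of it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hard-coded context for "dword, run of zeros, same dword" :
   skip back over the zero run that ends at pos, then use the 4 bytes
   in front of it. The context does not change as the run gets longer.
   Returns 0 when there are fewer than 4 bytes in front of the run. */
static uint32_t context_skipping_zeros(const uint8_t *buf, size_t pos)
{
    size_t p = pos;
    while (p > 0 && buf[p - 1] == 0)
        p--;
    if (p < 4)
        return 0;
    return ((uint32_t)buf[p - 4] << 24) | ((uint32_t)buf[p - 3] << 16) |
           ((uint32_t)buf[p - 2] << 8)  |  (uint32_t)buf[p - 1];
}
```

On the example data the context is 0xDEADBEEF at every position inside the zero run, no matter how long the run is, so stats gathered under that context can predict the repeated dword.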

In real life we actually do have this kind of weird "finite state" correlation in a lot of data. One common example is "structured data". "Structured data" is data where there is a strong position-based pattern. That is, maybe a sequence of 32 bit floats, so there's strong correlation to (pos&3), or maybe a bunch of N-byte C structures with different types of junk in that.
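Here's a small sketch of what the (pos&3) correlation buys you, under heavy assumptions : instead of a real coder it just counts how often an adaptive "most frequent byte so far" predictor is right, with its statistics split by position lane. lane_predictor_hits and MAX_LANES are made-up names.

```c
#include <stddef.h>

#define MAX_LANES 8

/* Counts how often an adaptive "most frequent byte so far" predictor is
   right when its statistics are split by (pos & (lanes - 1)).
   lanes must be a power of two, at most MAX_LANES.
   lanes == 1 is a plain order-0 model with no position information. */
static size_t lane_predictor_hits(const unsigned char *data, size_t n, unsigned lanes)
{
    unsigned count[MAX_LANES][256] = { { 0 } };
    unsigned char best[MAX_LANES] = { 0 };
    size_t hits = 0;

    for (size_t pos = 0; pos < n; pos++) {
        unsigned ctx = (unsigned)(pos & (lanes - 1));
        unsigned char sym = data[pos];

        if (best[ctx] == sym)
            hits++;

        /* update : bump the count, keep track of the most frequent byte in this lane */
        count[ctx][sym]++;
        if (count[ctx][sym] > count[ctx][best[ctx]])
            best[ctx] = sym;
    }
    return hits;
}
```

On 4-byte records where two of the bytes are constant, calling it with lanes = 4 predicts both constant lanes, while lanes = 1 can only ever lock onto one of them.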

Note that in this sense, the trivial mode changes of something like HTML or XML or even English text is not really "structured data" in our sense, even though they obviously have structure, because that structure is largely visible through finite contexts. That is, the state transitions of the structure are given to us in actual bytes in the data, so we can find the structure with only finite context modeling. (obviously English text does have a lot of small-scale and large-scale finite-state structure in grammars and context and so on).

General handling of structured data is the big unsolved problem of data compression. There are various heuristic tricks floating around to try to capture a bit of it. Basically they come down to hard coding a specific kind of structure and then using blending or switching to benefit from that structure model when it applies. In particular, 4 or 8 byte aligned patterns are the most common and easy to model structure, so people build specific models for that. But nobody is doing general adaptive structure detection and modeling.
