9/14/2010

09-14-10 - Challenges in Data Compression 2.5 - More on Sparsity

Sparsity & Temporality show up as big issues in lots of funny ways.

For example, they are the whole reason that we have to do funny hand-tweaks of compressors.

eg. exactly what you use as your context for various parts of the compressor will make a huge difference. In PPM the most sensitive part is the SEE ; in CM it looks like the most sensitive part is the contexts that are used for the mixer itself (or the APM or SSE) (not the contexts that are looked up to get statistics to mix).

In these sensitive parts of the coder, you obviously want to use as much context as possible, but if you use too much your statistics become too sparse and you start making big coding mistakes.

This is why these parts of the model have to be carefully hand tweaked; eg. use a few bits from here, compact it into a log2-ish scale, bucketize these things. You want only 10-16 bits of context or something but you want as much good information in that context as possible.

The problem is that the ideal formation of these contexts depends on the file of course, so it should be figured out adaptively.

There are various possibilities for hacky solutions to this. For example something that hasn't been explored very much in general in text compression is severely asymmetric coders. We do this in video coding for example, where the encoder spends a long time figuring things out then transmits simple commands to the decoder. So for example the encoder could do some big processing to try to figure out the ideal compaction of statistics and sends it to the decoder. (* maybe some of these do exist)

If sparsity wasn't an issue, you would just throw every bit of information you have at the model. But in fact we have tons of information that we don't use because we aren't good enough at detecting what information is useful and merging up information from various sources, and so on.

For example an obvious one is : in each context we generally store only something like the number of times each character has occurred. We might do something like scale the counts so that more recent characters count more. eg. you effictively do {all counts} *= 0.9 and then add in the count of the new character as ++. But actually we have way more information than that. We have the exact time that each character occurred (time = position in the file). And, for each time we've used the context in the past, we know what was predicted from it and whether that was right. All of that information should be useful to improve coding, but it's just too much because it makes secondary statistics too sparse.

BTW it might pop into your head that this can be attacked using the very old-school approaches to sparsity that were used in Rissanen's "Context" or DMC for example. Their approach was to use a small context, then as you see more data you split the context, so you get richer contexts over time. That does not work, because it is too conservative about not coding from sparse contexts; as I mentioned before, you cannot tell whether a sparse context is good or not from information seen in that context, you need information from an outside source, and what Context/DMC do is exactly the wrong thing - they try to use the counts within a context to decide whether it should be split or not.

09-14-10 - A small note on structured data

We have a very highly structured file at work. A while ago I discovered that I can improve compression of primitive compressors by transposing the data as if it were a matrix with rows of the structure size.

Apparently this is previously known :


J. Abel, "Record Preprocessing for Data Compression", 
Proceedings of the IEEE Data Compression Conference 2004, 
Snowbird, Utah, Storer, J.A. and Cohn, M. Eds. 521, 2004.

A small note :

He suggests finding the period of the data by looking at the most frequent character repeat distance. That is, for each byte find the last time it occurred, count that distance in a histogram, find the peak of the histogram (ignoring distances under some minimum), that's your guess of hte record length. That works pretty well, but some small improvements are definitely possible.

Lets look at some distance histograms. First of all, on non-record-structured data (here we have "book1") our expectation is that correlation roughly goes down with distance, and in fact it does :

book1

(maximum is at 4)

On record-structured data, the peaks are readily apparently. This file ("struct72") is made of 72-byte structures, hence the peaks out at 72 and 144. But we also see strong 4-8-12 correlations, as there are clearly 4-byte words in the structs :

Vertical bar chart

The digram distance histogram makes the structure even more obvious, if you ignore the peak at 1 (which is not so much due to "structure" as just strong order-1 correlation), the peak at 72 is very strong :

Vertical bar chart

When you actually run an LZ77 on the file (with min match len of 3 and optimal parse) the pattern is even stronger; here are the 16 most used offsets on one chunk :


 0 :       72 :      983
 1 :      144 :      565
 2 :      216 :      282
 3 :      288 :      204
 4 :      432 :      107
 5 :      360 :      106
 6 :      720 :       90
 7 :      504 :       88
 8 :      792 :       78
 9 :      648 :       77
10 :      864 :       76
11 :      576 :       73
12 :     1008 :       64
13 :     1080 :       54
14 :     1152 :       51
15 :     1368 :       49

Every single one is a perfect multiple of 72.

A slightly more robust way to find structure than Jurgen's approach is to use the auto-correlation of the histogram of distances. This is a well known technique from audio pitch detection. You take the "signal" which is here out histogram of distance occurances, and find the intensity of auto-correlation for each translation of the signal (this can be done using the fourier transform). You will then get strong peaks only at the fundamental modes. In particular, in our example "struct72" file you would get a peak at 72, and also a strong peak at 4, because in the fourier transform the smaller peaks at 8,12,16,20, etc. will all add onto the peak at 4. That is, it's correctly handling "harmonics". It will also detect cases where a harmonic happened to be a stronger peak than the fundamental mode. That is, the peak at 144 might have been stronger than the one at 72, in which case you would incorrectly think the fundamental record length was 144.

Transposing obviously helps with compressors that do not have handling of structured data, but it hurts compressors that inherently handle structured data themselves.

Here are some compressors on the struct72 file :


struct72                                 3,471,552

struct72.zip                             2,290,695
transpos.zip                             1,784,239

struct72.bz2                             2,136,460
transpos.bz2                             1,973,406

struct72.pmd                             1,903,783
transpos.pmd                             1,864,028

struct72.pmm                             1,493,556
transpos.pmm                             1,670,661

struct72.lzx                             1,475,776
transpos.lzx                             1,701,360

struct72.paq8o                           1,323,437
transpos.paq8o                           1,595,652

struct72.7z                              1,262,013
transpos.7z                              1,642,304

Compressors that don't handle structured data well (Zip,Bzip,PPMd) are helped by transposing. Compressors that do handle structured data specifically (LZX,PAQ,7Zip) are hurt quite a lot by transposing. LZMA is the best compressor I've ever seen for this kind of record-structured data. It's amazing that it beats PAQ considering it's an order of magnitude faster. It's also impressive how good LZX is on this type of data.

BTW I'm sure PAQ could easily be adapted to manually take a specification in which the structure of the file is specified by the user. In particular for simple record type data, you could even have a system where you give it the C struct, like :


struct Record
{
  float x,y,z;
  int   i,j,k;
  ...
};

and it parses the C struct specification to find what fields are where and builds a custom model for that.

ADDENDUM :

It's actually much clearer with mutual information.


I(X,Y) = H(X) + H(Y) - H(XY)

In particular, the mutual information for offset D is :


I(D) = 2*H( order0 ) - H( bigrams separated by distance D )

This is much slower to compute than the character repeat period, but gives much cleaner data.

On "struct72" : (note : the earlier graphs showed [1,144] , here I'm showing [1,256] so you can see peaks at 72,144 and 216).

struct72

Here's "book1" for comparison :

Vertical bar chart

(left column is actually in bits for these)

and hey, charts are fun, here's book1 after BWT :

Vertical bar chart

09-14-10 - Threaded Stdio

Many people know that fgetc is slow now because we have to link with multithreaded libs, and so it does a mutex and all that. Yes, that is true, but the solution is also trivial, you just use something like this :

#define macrogetc(_stream)     (--(_stream)->_cnt >= 0 ? ((char)*(_stream)->_ptr++) : _filbuf(_stream))

when you know that your FILE is only being used by one thread.

In general there's a nasty issue with multithreaded API design. When you have some object that you know might be accessed from various threads, should you serialize inside every access to that object? Or should you force the client to serialize on the outside?

If you serialize on the inside, you can have severe performance penalties, like with getc.

If you serialize on the outside, you make it very prone to bugs because it's easy to access without protection.

One solution is to introduce an extra "FileLock" object. So the "File" itself has no accessors except "Lock" which gives you a FileLock, and then the "getc" is on the FileLock. That way somebody can grab a FileLock and then locally do fast un-serialized access through that structure. eg:


instead of :

c = File.getc();

you have :

FileLock fl = File.lock();

c = fl.getc();

Another good solution would be to remove the buffering from the stdio FILE and introduce a "FileView" object which has a position and a buffer. Then you can have mutliple FileViews per FILE which are in different threads, but the FileViews themselves must be used only from one thread at a time. Accesses to the FILE are serialized, accesses to the FileView are not. eg :


eg. FILE * fp is shared.

Thread1 has FileView f1(fp);
Thread2 has FileView f2(fp);

FileView contains the buffer

you do :

c = f1.getc();

it can get a char from the buffer without synchronization, but buffer refill locks.

(of course this is a mess with readwrite files and such; in general I hate dealing with mutable shared data, I like the model that data is either read-only and shared, or writeable and exclusively locked by one thread).

(The FileView approach is basically what I've done for Oodle; the low-level async file handles are thread-safe, but the buffering object that wraps it is not. You can have as many buffering objects as you want on the same file on different threads).

Anyway, in practice, macrogetc is a perfectly good solution 99% of the time.

Stdio is generally very fast. You basically can't beat it except in the trivial way (trivial = use a larger buffer). The only things you can do to beat it are :

1. Double-buffer the streaming buffer filling and use async IO to fill the buffer (stdio uses synchronous IO and only fills when the buffer is empty).

2. Have an accessor like Peek(bytes) to the buffer that just gives you a pointer directly into the buffer and ensures it has bytes amount of data for you to read. This eliminates the branch to check fill on each byte input, and eliminates the memcpy from fread.

(BTW in theory you could avoid another memcpy, because Windows is doing IO into the disk cache pages, so you could just lock those and get a pointer and process from them directly. But they don't let you do this for obvious security reasons. Also if you knew in advance that your file was in the disk cache (and you were never going to use it again so you don't want it to get into the cache) you could do uncached IO, which is microscopically faster for data not in the cache, because it avoids that memcpy and page allocation. But again they don't let you ask "is this in the cache" and it's not worth sweating about).

Anyway, the point is you can't really beat stdio for byte-at-a-time streaming input, so don't bother. (on Windows, where the disk cache is pretty good, eg. it does sequential prefetching for you).

9/12/2010

09-12-10 - The defficiency of Windows' multi-processor scheduler

Windows' scheduler is generally pretty good, but there appears to be a bit of shittiness with its behavior on multicore systems. I'm going to go over everything I know about it and then we'll hit the bad point at the end.

Windows as I'm sure you know is a strictly exclusive high priority scheduler. That means if there are higher priority threads that can run, lower priority threads will get ZERO cpu time, not just less cpu time. (* this is not quite true because of the "balanced set manager" , see later).

Windows' scheduler is generally "fair", and most of its lists are FIFO. The scheduler is preemptive and happens immediately upon events that might change the scheduling conditions. That is, there's no delay or no master scheduler code that has to run. For example, when you call SetThreadPriority() , if that affects scheduling the changes will happen immediately (eg. if you make the current thread lower than some other thread that can run, you will stop running right then, not at the end of your quantum). Changing processor affinitiy, Waits, etc. all cause immediate reschedule. Waits, Locks, etc. always put your thread onto a FIFO list.

The core of the scheduler is a list of threads for each priority, on each processor. There are 32 priorities, 16 fixed "realtime" priorities, and 16 "dynamic" priorities (for normal apps). Priority boosts (see later) only affect the dynamic priority group.

When there are no higher priority threads that can run, your thread will run for a quantum (or a few, depending on quantum boosts and other things). When quantum is done you might get switched for another thread of same priority. Quantum elapsed decision is now based on CPU cycles elapsed (TSC), not the system timer interrupt (change was made in Vista or sumfin), and is based on actual amount of CPU cycles your thread got, so if your cycles are being stolen you might run for more seconds than a "quantum" might indicate.

Default quantum is an ungodly 10-15 millis. But amusingly enough, there's almost always some app on your system that has done timeBeginPeriod(1), which globally sets the system quantum down to 1 milli, so I find that I almost always see my system quantum at 1 milli. (this is a pretty gross thing BTW that apps can change the scheduler like that, but without it the quantum is crazy long). (really instead of people doing timeBeginPeriod(1) what they should have done is make the threads that need more responsiveness be very high priority and then go to sleep when not getting input or events).

So that's the basics and it all sounds okay. But then Windows has a mess of hacks designed to fix problems. Basically these are problems with people writing shitty application code that doesn't thread right, and they're covering everyone's mistakes, and really they do a hell of a good job with it. The thing they do is temporary priority boosts and temporary quantum boosts.

Threads in the foreground process always get a longer quantum (something like 60 millis on machines with default quantum length!). Priority boosts affect a thread temporarily, each quantum the priority boost wears off one step, and the thread gets extra quantums until the priority boost is gone. So a boost of +8 gives you 8 extra quantums to run (and you run at higher priority). You get priority boosts for :


 IO completion (+variable)

 wakeup from Wait (+1)
   special boost for foreground process is not disableable

 priority inversion (boosts owner of locked resource that other thread is waiting on)
   this is done periodically in the check for deadlock
   boosts to 14

 GUI thread wakeup (+2) eg. from WaitMsg

 CPU starvation ("balanced set manager")
    temporary kick to 15

boost for IO completion is :
    disk is 1
    network 2
    key/mouse input 6
    sound 8
(boost amount comes from the driver at completion time)

Look at like for example the GUI thread wakeup boost. What really should have been happening there is you should have made a GUI message processing thread that was super minimal and very high priority. It should have just done WaitMsg to get GUI input and then responded to it as quickly as possible, maybe queued some processing on a normal priority thread, then gone back to sleep. The priority boost mechanism is basically emulating this for you.

A particularly nasty example is the priority inversion boost. When a low priority thread is holding a mutex and a high priority thread tries to lock it, the high priority thread goes to sleep, but the low priority thread might never run if there are medium priority threads that want to run, so the high priority thread will be stuck forever. To fix this, Windows checks for this case in its deadlock checker. All of the "INFINITE" waits in windows are not actually infinite - they wait for 5 seconds or so (delay is setable in the registry), after which time Windows checks them for being stalled out and might give you a "process deadlocked" warning; in this check it looks for the priority inversion case and if it sees it, it gives the low priority thread the big boost to 14. This has the wierd side effect of making the low priority thread suddenly get a bunch of quanta and high priority.

The other weird case is the "balanced set manager". This thing is really outside of the normal scheduler; it sort of sits on the side and checks every so often for threads that aren't running at all and gives them a temporary kick to 15. This kick is different than the others in that it doesn't decay (it would get 15 quanta which is a lot), it just runs a few quanta at 15 then goes back to its normal priority.

You an use SetThreadPriorityBoost to disable some of these (such as IO completion) but not all (the special foreground process stuff for example is not disabled by this, and probably not the balanced set or priority inversion stuff either is my guess).

I'm mostly okay with this boosting shit, but it does mean that actually accounting for what priority your thread has exactly and how long you expect it to run is almost impossible in windows. Say you make some low-priority worker thread at priority 6 and your main thread at priority 8. Is your main thread actually higher priority than the background worker? Well, no, not if he got boosted for any of various reasons, he might actually be at higher priority right now.

Okay, that's the summary of the normal scheduler, now let's look at the multi-processor defficiency.

You can understand what's going on if you see what their motivation was. Basically they wanted scheduling on each individual processor to still be as fast as possible. To do this, each processor gets its own scheduling data; that is there are the 32 priority lists of threads on each processor. When a processor has to do some scheduling, it tries to just work on its own list and schedule on itself. In particular, simple quantum expiration thread switches can happen just on the local processor without taking a system-wide lock. (* this was changed for Vista/Win7 or sumpin; old versions of the SMP scheduler took a system-wide spinlock to dispatch; see references at bottom).

(mutex/event waits take the system-wide dispatch lock because there may be threads from other processors on the wait lists for those resoures)

But generally Windows does not move threads around between processors, and this is what can create the problems. In particular, there is absolutely zero code to try to distribute the work load evenly. That's up to you, or luck. Even just based on priorities it might not run the highest priority threads. Let's look at some details :

Each thread has the idea of an "ideal" processor. This is set at thread creation in a global round-robin and a processor round-robin. This is obviously a hacky attempt to balance load which is sometimes good enough. In particular if create a thread per core, this will give you an "ideal" processor on each core, so that's what you want. It also does assign them "right" for hyperthreading, that is to minimize thread overlap, eg. it will assign core0-hyperthread0 then core1-hyperthread0 then core0-hyperthread1 then core1-hyperthread1. You can also change it manually with SetThreadIdealProcessor , but it seems to me it mostly does the right thing so there's not much need for that. (note this is different than Affinity , which forces you to run only on those processors , you don't have to run on your ideal proc, but we will see some problems later).

When the scheduler is invoked and wants to run a thread - there's a very big difference between the cases when there exist any idle processors or not. Let's look at the two cases :

If there are any idle processors : it pretty much does exactly what you would want. It tries to make use of the idle processors. It prefers whole idle cores over just idle hyperthreads. Then from among the set of good choices it will prefer your "ideal", then the ones it last ran on and the one the scheduler is running on right now. Pretty reasonable.

If there are not any idle processors : this is the bad stuff. The thread is only schedules against its "ideal" processor. Which means it just gets stuffed on the "ideal" processor's list of threads and then schedules on that processor by normal single-proc rules.

This has a lot of bad consequences. Say proc 0 is running something at priority 10. Proc 1-5 are all running something at priority 2. You are priority 6 and you want to run, but your ideal proc happened to be 0. Well, tough shit, you get stuck on proc 0's list and you don't run even though there's lots of priority 2 work getting done.

This can happen just by bad luck, when you happen to run into other apps threads that have the same ideal proc as you. But it can also happen due to Affinity masks. If you or somebody else has got a thread locked to some proc via the Affinity mask, and your "ideal" processor is set to try to schedule you there, you will run into them over and over.

The other issue is that even when threads are redistributed, it only happens at thread ready time. Say you have 5 processors that are idle, on the other processor there is some high priority thread gobbling the CPU. There can be threads waiting to run on that processor just sitting there. They will not get moved over to the idle processors until somebody kicks a scheduling event that tries to run them (eg. the high priority guy goes to sleep or one of the boosting mechanisms kicks in). This is a transient effect and you should never have long term scenarios where one processor is overloaded and other processors are idle, but it can happen for a few quanta.

References :

Windows Administration Inside the Windows Vista Kernel Part 1
Windows Administration Inside the Windows Vista Kernel Part 2
Sysinternals Freeware - Information for Windows NT and Windows 2000 - Publications
Inside the Windows NT Scheduler, Part 1 (access locked)
Available here : "Inside the Windows NT Scheduler" .doc version but make sure you read the this errata

Stupid annoying audio-visual formatted information. (god damnit, just use plain text people) :

Mark Russinovich Inside Windows 7 Going Deep Channel 9
Dave Probert Inside Windows 7 - User Mode Scheduler (UMS) Going Deep Channel 9
Arun Kishan Inside Windows 7 - Farewell to the Windows Kernel Dispatcher Lock Going Deep Channel 9

(I haven't actually watched these, so if there's something good in them, please let me know, in text form).

Also "Windows Internals" book, 5th edition.

ADDENDUM : The real interesting question is "can we do anything about it?". In particular, can you detect cases where you are badly load-balanced and kick Windows in some way to adjust things? Obviously you can do some things to load-balance within your own app, but when other threads are taking time from you it becomes trickier.

09-12-10 - PPM vs CM

Let me do a quick sketch of how PPM works vs how CM works to try to highlight the actual difference, sort of like I did for Huffman to Arithmetic , but I won't do as good of a job.

PPM :

make a few contexts of previous characters
  order4 = * ((U32 *)(ptr-4));
  etc.. 

look up observed counts from each context :

counts_o4 = lookup_stats( order4 );
counts_o2 = lookup_stats( order2 );
counts_o1 = lookup_stats( order1 );
counts_o0 = lookup_stats( order0 );

estimate escape probability from counts at each context :

esc_o4 = estimate_escape( order4 );
esc_o2 = estimate_escape( order2 );
...

code in order from most likely best to least :

if ( arithmetic_code( (1-esc_o4) * counts_o4 ) ) return; else arithmetic_code( esc_o4 );
exclude counts_o4
if ( arithmetic_code( (1-esc_o2) * counts_o2 ) ) return; else arithmetic_code( esc_o2 );
exclude counts_o2
...

update counts :
counts_o4 += sym;
...

Now let's do context mixing :

CM :

make a few contexts of previous characters
  order4 = * ((U32 *)(ptr-4));
  etc.. 

look up observed counts from each context :

counts_o4 = lookup_stats( order4 );
counts_o2 = lookup_stats( order2 );
counts_o1 = lookup_stats( order1 );
counts_o0 = lookup_stats( order0 );

estimate weights from counts at each context :

w_o4 = estimate_weight( order4 );
w_o2 = estimate_weight( order2 );
...

make blended counts :
counts = w_o4 * counts_o4 + w_o2 * counts_o2 + ...

now code :
arithmetic_code( counts );
...

update counts :
counts_o4 += sym;
...

It should be clear we can put them together :

make a few contexts of previous characters
  order4 = * ((U32 *)(ptr-4));
  etc.. 

look up observed counts from each context :

counts_o4 = lookup_stats( order4 );
counts_o2 = lookup_stats( order2 );
counts_o1 = lookup_stats( order1 );
counts_o0 = lookup_stats( order0 );
if ( CM )
{
    estimate weights from counts at each context :

    w_o4 = estimate_weight( order4 );
    w_o2 = estimate_weight( order2 );
    ...

    make blended counts :
    counts = w_o4 * counts_o4 + w_o2 * counts_o2 + ...

    // now code :
    arithmetic_code( counts );
    ...
}
else PPM
{
    estimate escape probability from counts at each context :

    esc_o4 = estimate_escape( order4 );
    esc_o2 = estimate_escape( order2 );
    ...

    code in order from most likely best to least :

    if ( arithmetic_code( (1-esc_o4) * counts_o4 ) ) return; else arithmetic_code( esc_o4 );
    exclude counts_o4
    if ( arithmetic_code( (1-esc_o2) * counts_o2 ) ) return; else arithmetic_code( esc_o2 );
    exclude counts_o2
    ... 
}

update counts :
counts_o4 += sym;
...

In particular if we do our PPM in a rather inefficient way we can make them very similar :
make a few contexts of previous characters
  order4 = * ((U32 *)(ptr-4));
  etc.. 

look up observed counts from each context :

counts_o4 = lookup_stats( order4 );
counts_o2 = lookup_stats( order2 );
counts_o1 = lookup_stats( order1 );
counts_o0 = lookup_stats( order0 );

accumulate into blended_counts :
blended_counts = 0;
if ( CM )
{
    estimate weights from counts at each context :

    w_o4 = estimate_weight( order4 );
    w_o2 = estimate_weight( order2 );
    ...


}
else PPM
{
    estimate escape probability from counts at each context :

    esc_o4 = estimate_escape( order4 );
    esc_o2 = estimate_escape( order2 );
    ...

    do exclude :
    exclude counts_o4 from counts_o2
    exclude counts_o4,counts_o2 from counts_o1
    ...

    make weights :

    w_o4 = (1 - esc_04);
    w_o2 = esc_04 * (1 - esc_02);
    ...

}

make blended counts :
blended_counts += w_04 * counts_o4;
blended_counts += w_02 * counts_o2;
...

arithmetic_code( blended_counts );

update counts :
counts_o4 += sym;
...

Note that I haven't mentioned whether we are doing binary alphabet or large alphabet or any other practical issues, because it doesn't affect the algorithm in a theoretical way.

While I'm at it, let me take the chance to mark up the PPM pseudocode with where "modern" PPM differs from "classical" PPM : (by "modern" I mean 2002/PPMii and by "classical" I mean 1995/"before PPMZ").

PPM :

make a few contexts of previous characters
  order4 = * ((U32 *)(ptr-4));
  etc.. 
also make non-continuous contexts like
skip contexts : AxBx
contexts containing only a few top bits from each byte
contexts involving a word dictionary
contexts involving current position in the stream 

look up observed counts from each context :

counts_o4 = lookup_stats( order4 );
counts_o2 = lookup_stats( order2 );
counts_o1 = lookup_stats( order1 );
counts_o0 = lookup_stats( order0 );

possibly rescale counts using a "SEE" like operator
eg. use counts as an observation which you then model to predict coding probabilities

estimate escape probability from counts at each context :

esc_o4 = estimate_escape( order4 );
esc_o2 = estimate_escape( order2 );
...
secondary estimate escape using something like PPMZ
also not just using current context but also other contexts and side information

code in order from most likely best to least :

use LOE to choose best order to start from, not necessarily the largest context
also don't skip down through the full set, rather choose a reduced set

if ( arithmetic_code( (1-esc_o4) * counts_o4 ) ) return; else arithmetic_code( esc_o4 );
exclude counts_o4
if ( arithmetic_code( (1-esc_o2) * counts_o2 ) ) return; else arithmetic_code( esc_o2 );
exclude counts_o2
...

update counts :
counts_o4 += sym;
...
do "partial exclusion" like PPMii, do full update down to coded context
  and then reduced update to parents to percolate out information a bit
do "inheritance" like PPMii - novel contexts updated from parents
do "fuzzy updates" - don't just update your context but also neighbors
  which are judged to be similar in some way


And that's all folks.

09-12-10 - Context Weighting vs Escaping

The defining characteristic of PPM (by modern understanding) is that you select a context, try to code from it, and if you can't you escape to another context. By contrast, context weighting selects multiple contexts and blends the probabilities. These are actually not as different as they seem, because escaping is the same as multiplying probabilities. In particular :

    context 0 gives probabilities P0 (0 is the deeper context)
    context 1 gives probabilities P1 (1 is the parent)
    how do I combine them ?

    escaping :

    P = (1 - E) * P0 + E * (P1 - 0)

    P1 - 0 = Probablities from 1 with chars from 0 excluded

    weighting :

    P = (1 - W) * P0 + W * P1

    with no exclusion in P1

In particular, the only difference is the exclusion. Specifically, the probabilities of non-shared symbols are the same, but the probabilities of symbols that occur in both contexts are different. In particular the flaw with escaping is probably that it gives too low a weight to symbols that occur in both contexts. More generally you should probably be considering something like :

    I = intersection of contexts 0 and 1

    P = a * (P0 - I) + b * (P1 - I) + c * PI

that is, some weighting for the unique parts of each context and some weighting for the intersection.

Note that PPMii is sort of trying to compensate for this because when it creates context 0, it seeds it with counts from context 1 in the overlap.

Which obviously raises the question : rather than the context initialization of PPMii, why not just look in your parent and take some of the counts for shared symbols?

(note that there are practical issues about how your compute your weight amount or escape probability, and how exactly you mix, but those methods could be applied to either PPM or CW, so they aren't a fundamental difference).

09-12-10 - Challenges in Data Compression 1 - Finite State Correlations

One of the classic shortcomings of all known data compressors is that they can only model "finite context" information and not "finite state" data. It's a little obtuse to make this really formally rigorous, but you could say that structured data is data which can be generated by a small "finite state" machine, but cannot be generated by a small "finite context" machine. (or I should say, transmitting the finite state machine to generate the data is much smaller than transmitting the finite context machine to generate the data, along with selections of probabilistic transitions in each machine).

For example, maybe you have some data where after each occurance of 011 it becomes more likely to repeat. To model that with an FSM you only need a state for 011, and it loops back to itself and increases P. To model it with finite contexts you need an 011 state, an 011011 , 011011011 , etc. But you might also have correlations like : every 72 bytes there is a dword which is equal to dword at -72 bytes xor'ed with the dword at -68 bytes and plus a random number which is usually small.

The point is not that these correlations are impossible to model using finite contexts, but the correct contexts to use at each spot might be infinitely large.

Furthermore, you can obviously model FSM's by hard-coding them into your compressor. That is, you assume a certain structure and make a model of that FSM, and then context code from that hard-coded FSM. What we can't do is learn new FSM's from the data.

For example, say you have data that consists of a random dword, followed by some unknown number of 0's, and then that same dword repeated, like

 
DEADBEEF0000000000000000DEADBEEF 
you can model this perfectly with an FCM if you create a special context where you cut out the run of zeros. So you make a context like
DEADBEEF00
and then if you keep seeing zeros you leave the context alone, if it's not a zero you just use normal FCM (which will predict DEADBEEF). What you've done here is to hard-code the finite state structure of the data into your compressor so that you can model it with finite contexts.

In real life we actually do have this kind of weird "finite state" correlation in a lot of data. One common example is "structured data". "Structured data" is data where there is a strong position-based pattern. That is, maybe a sequence of 32 bit floats, so there's strong correlation to (pos&3), or maybe a bunch of N-byte C structures with different types of junk in that.

Note that in this sense, the trivial mode changes of something like HTML or XML or even English text is not really "structured data" in our sense, even though they obviously have structure, because that structure is largely visible through finite contexts. That is, the state transitions of the structure are given to us in actual bytes in the data, so we can find the structure with only finite context modeling. (obviously English text does have a lot of small-scale and large-scale finite-state structure in grammars and context and so on).

General handling of structured data is the big unsolved problem of data compression. There are various heuristic tricks floating around to try to capture a bit of it. Basically they come down to hard coding a specific kind of structure and then using blending or switching to benefit from that structure model when it applies. In particular, 4 or 8 byte aligned patterns are the most common and easy to model structure, so people build specific models for that. But nobody is doing general adaptive structure detection and modeling.

09-12-10 - Challenges in Data Compression 2 - Sparse Contexts

A "sparse" context is one with very little information in it. Exactly what defines "sparse" depends on the situation, but roughly it means it hasn't been seen enough times that you can trust the evidence in it. Obviously a large-alphabet context needs a lot more information to become confident than a binary context, etc.

In the most extreme case, a sparse context might only have 1 previous occurance. Let's take the binary case for clarity, so we have {n0=1,n1=0} or {n0=0,n1=1} in a single context.

There are two issues with a sparse context :

1. What should our actual estimate for P0 (the probability of a zero) be? There's a lot of theory about ideal estimators and you can go read papers on the unbiased estimator or the KT estimator or whatever that will give you formulas like


P0 = (n0 + k) / (n0 + n1 + 2k)

k = 0 or 1/2 or whatever depending on the flavor of the month

but this stuff is all useless in the sparse case. At {n0=1,n1=0} the actual probabilitiy of a 0 might well be 99% (or it could be 10%) and these estimators won't help you with that.

2. Making an estimate of the confidence you have in the sparse context. That is, deciding whether to code from it at all, or how to weight it vs other contexts. Again the important thing is that the statistics in the context itself do not really help you solve this problem.

Now let me take a second to convince you how important this problem is.

In classical literature this issue is dismissed because it is only a matter of "initialization" and the early phase, and over infinite time all your contexts become rich, so asymptotically it doesn't affect ratio at all. Not only is that dismissive of practical finite size data, but it's actually wrong even for very large data. In fact sparse contexts are not just an issue for the early "learning" phase of a compressor.

The reason is that a good compressor should be in the "learning" phase *always*. Basically to be optimal, you want to be as close to the end of sparse contexts as possible without stepping over the cliff into your statistics becoming unreliable. There are two reasons why.

1. Model growth and the edges of the model. The larger your context, or the more context information you can use, the better your prediction will be. As you get more data, the small contexts might become non-sparse, but that just means that you should be pushing out more into longer contexts. You can think of the contexts as tree structured (even if they actually aren't). Around the root you will have dense information, the further out you go to the leaves the sparser they will be.

For example, PPM* taught us that using the longest context possible is helpful. Quite often this context has only occurred once. As you get further into the file, you longest possible context simply gets longer, it doesn't get less sparse.

2. Locality and transients. Real files are not stationary and it behooves you to kill old information. Furthermore, even in dense contexts there is a subset of the most recent information which is crucial and sparse.

For example, in some context you've seen 50 A's and 20 B's. Now you see two C's. Suddenly you are in a sparse situation again. What you have is an insufficient sample sitation. Should I still use those 50 A's I saw, or am I now 99% likely to make more C's ? I don't have enough statistics in this context to tell, so I'm in a sparse situation again.

There are various attempts at addressing this, but like all the problems in this serious there's been no definitive attack on it.

9/06/2010

09-06-10 - WeightedDecayRand

I'm sure I've talked to people about this many times but I don't think I've ever written it up. Anyway, here's one from the files of game developer arcana. I'm sure this is common knowledge, but one of those things that's too trivial to write an article about.

When you use random numbers in games, you almost never actually want random numbers. Rather you want something that *feels* sort of random to the player. In particular the biggest difference is that real random numbers can be very bunchy, which almost always feels bad to the player.

Common cases are like random monster spawns or loot drops. It's really fucking annoying when you're trying to cross some terrain in an Ultima type game and you get unlucky with the random number generator and it wants to spawn a random encounter on every single fucking tile. Or if you're killing some mobs who are supposed to drop a certain loot 1 in 10 times, and you kill 100 of them and they don't fucking drop it because you keep getting unlucky in rand().

What we want is not actually rand. If you design a monster to drop loot roughly 1 in 10 times, you want the player to get it after 5 - 20 kills. You don't really want them to get it on first kill, nor do you want them to sit around forever not getting it.

In all cases, to do this what we need is a random generator which has *state* associated with the event. You can't use a stateless random number generator. So for example with loot drop, the state remembers that it has dropped nothing N times and that can start affecting the probability of next drop. Basically what you want to do for something like that is start with a very low probability to drop loop (maybe even 0) and then have it ratchet up after non-drop. Once the loot is dropped, you reset the probability back to 0. So it's still random, but a more controllable experience for the designer.

Now, this state generally should decay in some way. That is, if you kill mobs and get no loot and then go away and do a bunch of stuff, when you come back it should be back to baseline - it doesn't remember what happened a long time ago. And then the next important issue is how does that state decay? Is it by time or by dicrete events? Whether time or event driven makes the most sense depends on the usage.

The simplest version of a stateful semi-random generator is simply one that forbids repeats. Something like :


int Draw()
{
    for(;;)
    {
        int i = RandMod(m_num);
        if ( i != m_last )
        {
            m_last = i;
            return i;
        }
    }
}

which is an okay solution in some cases. But more generally you want to not just forbid repeats, rather allow them but make them less likely. And for N-ary events you don't just care about the last one, but all the last ones.

A simple binary stateful semirandom generator is like this :


int Draw()
{
    Tick();

    float p = frandunit() * ( m_p0 + m_p1 );
    if ( p < m_p0 )
    {
        m_p0 -= m_repeatPenalty;
        m_p0 = MAX(m_p0,0.f);
        return 0;
    }
    else
    {
        m_p1 -= m_repeatPenalty;
        m_p1 = MAX(m_p1,0.f);       
        return 1;
    }
}

You should be able to see that this is a simple weighted coin flip and we penalize repeats by some repeat parameter. Here we have introduced the idea of a Tick() - the tick is some function that push the probabilities back towards 50/50 , either by time evolution or by a fixed step.

More generally you want N-ary with various parameters. Here's some code :


//-----------------------------------------------------------

class WeightedDecayRand
{
public:

    // redrawPenalty in [0,1] - 0 is a true rand, 1 forbids repeats
    // restoreSpeed is how much chance of redraw is restored per second or per draw
    explicit WeightedDecayRand(int num,float redrawPenalty,float restoreSpeed);
    ~WeightedDecayRand();

    int Draw();
    void Tick(float dt);
    
    int DrawFixedDecay();
    int DrawTimeDecay(double curTime);

private:
    int     m_num;
    float * m_weights;  
    float   m_redrawPenalty;
    float   m_restoreSpeed;
    float   m_weightSum;
    double  m_lastTime;

    FORBID_CLASS_STANDARDS(WeightedDecayRand);
};

//-----------------------------------------------------------

WeightedDecayRand::WeightedDecayRand(int num,float redrawProbability,float restoreSpeed) :
    m_num(num),
    m_redrawPenalty(redrawProbability),
    m_restoreSpeed(restoreSpeed)
{
    m_weights = new float [num];
    for(int i=0;i < num;i++)
    {
        m_weights[i] = 1.f;
    }
    m_weightSum = (float) num;
}

WeightedDecayRand::~WeightedDecayRand()
{
    delete [] m_weights;
}

int WeightedDecayRand::Draw()
{
    float p = frandunit() * m_weightSum;
    for(int i=0;;i++)
    {
        if ( i == m_num-1 || p < m_weights[i] )
        {
            // accepted !
            m_weightSum -= m_weights[i];
            m_weights[i] -= m_redrawPenalty;
            m_weights[i] = MAX(m_weights[i],0.f);
            m_weightSum += m_weights[i];
            return i;
        }
        else
        {
            p -= m_weights[i];
        }
    }
}

void WeightedDecayRand::Tick(float dt)
{
    m_weightSum = 0;
    for(int i=0;i < m_num;i++)
    {
        if ( m_weights[i] < 1.f )
        {
            m_weights[i] += dt * m_restoreSpeed;
            m_weights[i] = MIN(m_weights[i],1.f);
        }
        m_weightSum += m_weights[i];
    }
}

int WeightedDecayRand::DrawFixedDecay()
{
    int ret = Draw();
    
    Tick( 1.f );
    
    return ret;
};

int WeightedDecayRand::DrawTimeDecay(double curTime)
{
    if ( curTime != m_lastTime )
    {
        Tick( (float)(curTime - m_lastTime) );
        m_lastTime = curTime;
    }

    int ret = Draw();
        
    return ret;
};

//-----------------------------------------------------------

09-06-10 - Cross Platform SIMD

I did a little SIMD "Lookup3" (Bob Jenkin's hash), and as a side effect, it made me realize that you can almost get away with cross platform SIMD these days. All the platforms do 16-byte SIMD, so that's nice and standard. The capabilities and best ways of doing things aren't exactly the same, but it's pretty close, and you can mostly cover the differences. Obviously to get super maximum optimal you would want to special case per platform, but even then having a base cross-platform SIMD implementation to start from would let you get all your platforms going easier and identify the key-spots to do platform specific work.

Certainly for "trivial SIMDifications" this works very well. Trivial SIMDification is when you have a scalar op that you are doing a lot of, and you change that to doing 4 of them at a time in parallel with SIMD. That is, you never do horizontal ops or anything else funny, just a direct substitution of scalar ops to 4-at-a-time vector ops. This works very uniformly on all platforms.

Basically you have something like :

U32 x,y,z;

    x += y;
    y -= z;
    x ^= y;
and all you do is change the data type :
simdU32 x,y,z;

    x += y;
    y -= z;
    x ^= y;
and now you are doing four of them at once.

The biggest issue I'm not sure about is how to define the data type.

From a machine point of view, the SIMD register doesn't have a type, so you might be inclined to just expose a typeless "qword" and then put the type information in the operator. eg. have a generic qword and then something like AddQwordF32() or AddQwordU16() . But this is sort of a silly argument. *All* registers are typeless, and yet we find data types in languages to be a convenient way of generating the right instructions and checking program correctness. So it seems like the ideal thing is to really have something like a qwordF32 type, etc for each way to use it.

The problem is how you actually do that. I'm a little scared that anything more than a typedef might lead to bad code generation. The simple typedef method is :


#if SPU
typedef qword simdval;
#if XENON
typedef __vector4 simdval;
#else if SSE
typedef __m128 simdval;
#endif

But the problem is if you want to make them have their type information, like say :

typedef __vector4 simdF32;
typedef __vector4 simdU32;

then when you make an "operator +" for F32 and one for U32 - the compiler can't tell them apart. (if for no good reason you don't like operator +, you can pretend that says "Add"). The problem is the typedef is not really a first class type, it's just an alias, so it can't be used to select the right function call.

Of course one solution is to put the type in the function name, like AddF32,AddU32,etc. but I think that is generally bad code design because it ties operation to data, which should be as indepednent as possible, and it just creates unnecessary friction in the non-simd to simd port.

If you actually make them a proper type, like :


struct simdF32 { __vector4 m; };
struct simdU32 { __vector4 m; };

then you can do overloading to get the right operation from the data type, eg :

RADFORCEINLINE simdU32 operator + ( const simdU32 lhs, const simdU32 rhs )
{
    return _mm_add_epi32(lhs,rhs);
}

RADFORCEINLINE simdF32 operator + ( const simdF32 lhs, const simdF32 rhs )
{
    return _mm_add_ps(lhs,rhs);
}

The problem is that there is some reason to believe that anything but the fundamental type is not handled as well by the compiler. That is, qword,__vector4, etc. get special very good handling by the compiler, and anything else, even a struct which consists of nothing but that item, gets handled worse. I haven't actually seen this myself, but there are various stories around the net indicating this might be true.

I think the typedef way is too just too weak to actually be useful, I have to go the struct way and hope that modern compilers can handle it. Forunately GCC has the very good "vector" keyword thingy, so I don't have to do anything fancy there, and MSVC is now generally very good at handling mini objects (as long as everything inlines).

Another minor annoying issue is how to support reinterprets. eg. I have this simdU32 and I want to use it as a simdU16 with no conversion. You can't use the standard C method of value at/address of, because that might go through memory which is a big disaster.

And the last little issue is whether to provide conversion to a typeless qword. One argument for that would be that things like Load() and Store() could be implemented just from qword, and then you could fill all your data types from that if they have conversion. But if you allow implicit conversion to/from typeless, then all your types can be directly made into each other. That usually confuses the hell out of function overloading among other woes.

9/04/2010

09-04-10 - Holy fuck balls

MSVC 2005 x64 turns this :

static RADFORCEINLINE void rrMemCpySmall(void * RADRESTRICT to,const void * RADRESTRICT from, SINTa size)
{
    U8 * RADRESTRICT pto = (U8 * RADRESTRICT) to;
    const U8 * RADRESTRICT pfm = (const U8 * RADRESTRICT) from;
    for(SINTa i=0;i < size;i++) pto[i] = pfm[i];
}

into

call memcpy

god damn you, you motherfuckers. The whole reason I wrote it out by hand is because I know that size is "small" ( < 32 or so) so I don't want the function call overhead and the rollup/rolldown of a full memcpy. I just want you to generate rep stosb. If I wanted memcpy I'd fucking call memcpy god dammit. Friggle fracking frackam jackam.


In other news, I know it's "cool" to run with /W4 or MSVC or /Wall on GCC (or even /Wextra), but I actually think it's counterproductive. The difference between /W3 and /W4 is almost entirely warnings that I don't give a flying fuck about, shit like "variable is unused". Okay fine, it's not used, fucking get rid of it and don't tell me about it.

Shit like "variable initialized but not used" , "conditional is constant", "unused static function removed" is completely benign and I don't want to hear about it.

I've always been peer-pressured into running with max warning settings because it's the "right" thing to do, but actually I think it's just a waste of my time.


MoveFileTransacted is pretty good. There's other transactional NTFS shit but really this is the only one you need. You can write your changes to a temp file (on your ram drive, for example) then use MoveFileTransacted to put them back on the real file, and it's nice and safe.

BTW while I'm ranting about ram drives; how the fuck is that not in the OS? And my god people are getting away with charging $80 for them. AND even those expensive ones don't support the only feature I actually want - dynamic sizing. It should be the fucking size of the data on it, not some static predetermined size.

9/03/2010

09-03-10 - Page file on Ramdrive = 3LIT3

One of the awesome new "tweaks" that computer buffoons are doing is making ramdisks and putting their windows swap file on the ram disk. Yes, brilliant, that speeds up your swap file alright. So awesome.

(BTW yes I do know that it actually makes sense in one context : some of the fancy ram drives can access > 3 GB RAM in 32 bit windows, in which case it gives you a way to effectively access your excess RAM in an easy way on the old OS; but these geniuses are on Win7-64).

(And of course the right thing to do is to disable page file completely; sadly Windows has not kept up with this modern era of large RAMs and the page file heuristics are not well suited to use as an emergency fallback).


What I have copied from the tweakers is putting my firefox shite on a ram disk :

SuperSpeed Firefox By Putting Profile and SQLite Database in RAMDisk Raymond.CC Blog
Guide Move 99% of All Firefox Writes off your SSD

I can get away with a 128M ram disk and it massive reduces the amount of writes to my SSD. Reducing writes to the SSD is nice for my paranoia (it also speeds up firefox startup and browsing), but the really strong argument for doing this is that I've caught SQLite eating its own ass a few times. I really hate it when I'm just sitting there doing nothing and all of a sudden my fan kicks in and my CPU shoots up, I'm like "WTF is going on god dammit". Well the last few times that's happened, it's been the SQL service fucking around on its files for no god damn good reason. I'm not sure if putting the firefox SQL DB files on the ramdisk actually fixes this, but it makes it cheaper when it does happen anyway.

Also, don't actually use the above linked methods. Instead do this :

Do "about:config" and add "browser.cache.disk.parent_directory" . Obviously also check browser.cache.disk.enable and browser.cache.disk.capacity ; capacity is in KB BTW.

Run "firefox -ProfileManager". Make a new profile and call it "ramprofile" or whatever. Change the dir of that profile to somewhere on the ramdisk. Look at your other profile (probably "default") and see what dir it's in. Make "ramprofile" the default startup profile.

Quit firefox and copy the files from your default profile dir to your ramprofile dir. Run firefox and you are good to go.

Do not set your ramdisk to save & restore itself (as some of fancy ramdisks can do these days). Instead put the profile dir copy operation in your machine startup bat. It's somewhat faster to zip up your default profile dir and have your machine startup bat copy the zip over to the ramdisk and unzip it there.

There's another advantage of this, which is that your whole Firefox profile is now temporary for the session, unless you explicitly copy it back to your hard disk. That's nice. If you pick up tracking cookies or whatever while browsing, or some malicious script changes your home page or whatever, they go away when you reboot.

09-03-10 - LZ and Exclusions

I've been thinking a lot about LZ and exclusions.

In particular, LZMA made me think about this idea of excluding the character that follows from the previous match. eg. after a match, if the string we were matching from was ABCD... and we just wrote ABC of length 3, we know the next character is not a D. LZMA captures this information by using XOR (there are possibly other effects of the xor, I'm not sure).

An earlier LZ77 variant named "LZPP" had even more ideas about exclusions. I won't detail that, but I'll try to do a full treatment here.

There are various ways that exclusions arise from LZ77 coding. First of all we will simplify and assume that you always code the longest possible match at each location, and you always take a match if a match is possible. These assumptions are in fact not valid in modern optimal parse coders, but we will use them nontheless because it makes it easier to illustrate the structure of the LZ77 code.

First let's look at the exclusion after matches. Any time you are in the character after a match, you know the next character cannot be the one that continues the match, or you would just written a longer match. But, that is not just true of the match string you chose, but also of all the match strings you could have chosen.

Say for example you just wrote match of length 3 and got the string [abc]. Now you are on the next character. But previously occuring in the file are [abcd] and [abce] and [abcf]. ALL of those following characters can be excluded, not just the one that was in your actual match string. In particular, in a context coding sense, if the match length was L, you are now effectively coding from a context of length L and you know the next character must be novel in that context.

Also note that this applies not only to literal coding, but also to match coding. All offsets you could code a match from that start with one of the excluded characters can be skipped from the coding alphabet.

There's another type of exclusion in LZ coding that is powerful, and that is exclusion due to failure to match. Say your min match length is 2. You've written a literal and now you are about to write another literal. You can exclude all characters which would have been a match at the last position. That is,

You've just written [a] as a literal. The digrams [ab], [ac], [ad] occur in the file. You can thus exclude {b,c,d} from the current literal, because if it was one of those you would have written a match at the previous spot.

There are two primary states of an LZ coder : after a match and after a literal (there's also an implicit state which is inside a match, continuing the match).

Coding with general exclusions is difficult to do fast, and difficult to do with a bitwise binary style coder. You can code one exclusion using the xor method, but getting beyond that seems impossible. It looks like you have to go to a full general alphabet coder and a tree structure like Fenwick or some relative.


Another issue for LZ is the redundancy of offsets. In particular the most obvious issue is that when you write short matches, many offsets are actually identical, so having the full set to choose from is just wasting code space.

eg. say you code a match of length 2, then offsets which point to a string which is the same as another string up to length 2 are redundant.

There are two ways to take advantage of this :

1. Code length first. Walk offsets backwards from most recent (lowest offset) to highest. Each time you see a substring of the [0,length) prefix that you've never seen before, count that as offset++, otherwise just step over it without incrementing offset. Basically you are just considering a list of the unique [length] substrings that have occured in the file, and they are sorted in order of occurance.

2. Code offset first. Walk from that offset back towards the current position (from large offset down to small). Match each string against the chosen one at offset. The largest substring match length is the base match length. Code (length - base). This relies on the fact that if there was a match of that substring with a lower offset, we would have coded from the lower offset. So whenever we code a larger offset, we know it must be a longer match than we could have gotten somewhere later in the list.

These seem outrageously expensive, but they actually can be incorporated into an ROLZ type coder pretty directly. Method 1 can also incorporate exclusions easily, you just skip over offsets that point to a substring that begins with an excluded character. One disadvantage of method 1 is that it ruins offset structure which we will talk about in the next post.

8/31/2010

08-31-10 - LOOP

Every once in a while I think this is a good idea :

#define LOOP(var,count) for(int var=0;(var)<(int)(count);var++)
#define LOOPBACK(var,count) for(int var=(int)(count)-1;(var)>=0;var--)

but then I never wind up using it, and I find the code that does use it looks really ugly. Maybe if I got in the habit that ugliness would go away.

Certainly adding a "loop" keyword would've been a good idea in C99. Instead we have GCC trying to optimize out signed int wrapping, and many compilers now specifically look for the for(;;) construct and special-case handling it as if it was a loop() keyword.


In other minor news, I'm running with the "NoScript" addon now. I've used FlashBlock for a long time to much happiness, so this is just the next level. It does break 90% of web sites out there, but it has also made me shockingly aware of how many random sites are trying to run scripts that are highly questionable (mainly data-mining and ad-serving).


People sometimes ask me about laptops because I wrote about them before :

cbloom rants 05-07-10 - New Lappy
cbloom rants 04-15-10 - Laptop Part 2
cbloom rants 04-12-10 - Laptop search
cbloom rants 01-20-09 - Laptops Part 3

A small update : I'm still very happy with the Dell Precision 6500. They've updated it with some newer GPU options, so you can now get it with an ATI 7820 , which is a full DX11 part and close to as fast as you can get mobile. Other than that all my advice remains the same - get USB 3 of course, quad i7, LED backlighting, install your own SSD and RAM and do a clean windows install. Do NOT get RAID disks, and do NOT install any Intel or Dell drivers. I have no DPC latency problems or any shite like that. The only thing about it that's slightly less than perfect is the temperature / fan control logic is not quite right.

It looks like there's a problem with the NV GTX 280 / Dual chip / Powermizer stuff. See :

My Solution for Dell XPS M1530 DPC Latency
Dell, DPC Latency, and You - Direct2Dell - Direct2Dell - Dell Community
Dell Latitude DPC Latency Issues

So I recommend staying away from that whole family. The new Vaio Z looks really awesome for a small thin/light, however there is a bit of a nasty problem. They only offer it with SSD's in RAID, and they are custom non-replaceable SSD's, and it appears that the reason for the 4 SSD's in RAID is because they are using cheapo low end flash. There's lots of early reports of problems with this setup, and the fact that you have to run RAID and can't change the drives is a big problem. Also, last time I checked Windows still can't send through TRIM to RAID, so that is borked, but theoretically that will be addressed some day.

8/27/2010

08-27-10 - Cumulative Probability Trees

I've been thinking about Cumulative Probability Trees today; if you prefer, it's partial-sum trees. Basically you have a count for each index C[i] , and you want to be able to quickly get the sum of all counts between 0 and i = S[i]. You want to be able to query and update both in <= logN time. So for example just using S[i] as an array makes query be O(1), but then update is O(N), no good.

The standard solution is Fenwick Trees . They are compact (take no more room than the C[i] table itself). They are O(log(N)) fast. In code :


F[i] contains a partial sum of some binary range
the size of the binary range is equal to the bottom bit on of i
if i&1  - it contains C[i]
if i&3 == 2 - contains Sum of 2 ending with i (eg. C[i]+C[i-1] )
if i&7 == 4 - contains Sum of 4 ending with i
if i&F == 8 - contains Sum of 8 ending with i
(follow the link above to see pictures). To get C[i] from F[i] you basically have to get the partial sums S[i] and S[i-1] and subtract them (you can terminate early when you can tell that the rest of their sum is a shared walk). The logN walk to get S from F is very clever :

sum = 0;
while (i > 0)
  {
    sum += F[i];
    i = i & (i-1);
  }

The i & (i-1) step turns off the bottom bit of i, which is the magic of the Fenwick Tree structure being the same as the structure of binary integers. (apparently this is the same as i -= i & -i , though I haven't worked out how to see that clearly).

If you put F[0] = 0 (F starts indexing at 1), then you can do this branchless if you want :


sum = 0;
UNROLL8( sum += F[i]; i = i & (i-1); );

(for an 8-bit symbol, eg 256 elements in tree).

You can't beat this. The only sucky thing is that just querying a single probability is also O(logN). There are some cases where you want to query probability more often than you do anything else.

One solution to that is to just store the C[i] array as well. That doesn't hurt your update time much, and give you O(1) query for count, but it also doubles your memory use (2*N ints needed instead of N).

One option is to keep C[i], and throw away the bottom level of the Fenwick tree (the odd indexes that just store C[i]). Now your memory use is (3/2)*N ; it's just as fast but a little ugly.

But I was thinking what if we start over. We have the C[i], what if we just build a tree on it?

The most obvious thing is to build a binary partial sum tree. At level 0 you have the C[i], at level 1 you have the sum of pairs, at level 2 you have the sum of quartets, etc :


showing the index that has the sum of that slot :

0:01234567
1:00112233
2:00001111
3:00000000

So update is very simple :

Tree[0][i] ++;
Tree[1][i>>1] ++;
Tree[2][i>>2] ++;
...

But querying a cumprob is a bit messy. You can't just go up the tree and add, because you may already be counted in a parent. So you have to do something like :

sum = 0;

if ( i&1 ) sum += Tree[0][i-1];
i>>=1;
if ( i&1 ) sum += Tree[1][i-1];
i>>=2;
if ( i&1 ) sum += Tree[1][i-1];
..

This is O(logN) but rather uglier than we'd like.

So what if we instead design our tree to be good for query. So we by construction say that our query for cumprob will be this :


sum = Tree[0][i];
sum += Tree[1][i>>1];
sum += Tree[2][i>>2];
...

That is, at each level of the tree, the index (shifted down) contains the amount that should be added on to get the partial sum that preceeds you. That is, if i is >= 64 , then Tree[6][1] will contain the sum from [0..63] and we will add that on.

In particular, at level L, if (i>>L)is odd , it should contain the sum of the previous 2^L items. So how do we do the update for this?


Tree[0][i] ++;
i >>= 1;
if ( i&1 ) Tree[1][i] ++;
i >>= 1;
if ( i&1 ) Tree[2][i] ++;
...

or

Tree[0][i] ++;
if ( i&2 ) Tree[1][i>>1] ++;
if ( i&4 ) Tree[2][i>>2] ++;
...

or

Tree[0][i] ++;
i >>= 1;
Tree[1][i] += i&1;
i >>= 1;
Tree[2][i] += i&1;
...

this is exactly complementary to the query in the last type of tree; we've basically swapped our update and query.

Now if you draw what the sums look like for this tree you get :

These are the table indexes :

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 1 2 3 4 5 6 7
0 1 2 3
0 1
0

These are the table contents :

C0 C0-C1 C2 C2-C3 C4 C4-C5 C6 C6-C7 C8 C8-C9 C10 C10-C11 C12 C12-C13 C14 C14-C15
0 C0-C1 0 C4-C5 0 C8-C9 0 C12-C13
0 C0-C3 0 C8-C11
0 C0-C7
0
it should be obvious that you can take a vertical sum anywhere and that will give you the partial sum to the beginning. eg. if I take a vertical sum in the 13th slot I get (C12-C13)+(0)+(C8-C11)+(C0-C7) = (C0-C13) , which is what I want.

Now what if we slide everything left so we don't have all those zeros in the front, and we'll go ahead and stick the total sum in the top :

C0 C0-C1 C2 C2-C3 C4 C4-C5 C6 C6-C7 C8 C8-C9 C10 C10-C11 C12 C12-C13 C14 C14-C15
C0-C1 0 C4-C5 0 C8-C9 0 C12-C13 0
C0-C3 0 C8-C11 0
C0-C7 0
C0-C15

Now, for each range, we stuff the value into the top slot that it covers, starting with the largest range. So (C0-C7) goes in slot 7 , (C0-C3) goes in slot 3, etc :

C0 -v- C2 -v- C4 -v- C6 -v- C8 -v- C10 -v- C12 -v- C14 -v-
C0-C1 -v- C4-C5 -v- C8-C9 -v- C12-C13 -v-
C0-C3 -v- C8-C11 -v-
C0-C7 -v-
C0-C15

(the -v- is supposed to be a down-arrow meaning you get those cell contents from the next row down). Well, this is exactly a Fenwick Tree.

I'm not sure what the conclusion is yet, but I thought that was like "duh, cool" when I saw it.

8/25/2010

08-25-10 - ProfileParser

I realized a long time ago that the work I did for my AllocParser could basically be applied as-is to parsing profiles. It's the same thing, you have a counter (bytes or clocks), you want to see who counted up that counter, get hierarchical info, etc.

Also Sean and I talked long ago about how to do the lowest impact possible manually-instrumented profiler with full information. Basically you just record the trace and don't do any processing on it during recording. All you have to do is :


Profiler_Push(label)  *ptr++ = PUSH(#label); *ptr++ = rdtsc();
Profiler_Pop( label)  *ptr++ = POP( #label); *ptr++ = rdtsc();

#define PUSH(str)  (U64)(str)
#define POP (str) -(S64)(str)

where ptr is some big global array of U64's , and we will later use the stringized label as a unique id to merge traces. Once your app is done sampling, you have this big array of pushes and pops, you can then parse that to figure out all the hierarichical timing information. In practice you would want to use this with a scoper class to do the push/pop for you, like :

class rrScopeProfiler { rrScopeProfiler() { Push; } ~rrScopeProfiler() { Pop; } };

#define PROFILE(label)  rrScopeProfiler RR_STRING_JOIN(label,_scoper) ( #label );

Very nice! (obviously the pop marker doesn't have to provide the label, but it's handy for consistency checking).

(if you want to get fancy, the profiler push/pop array should really be write-combining memory, and the stores should be movnti's on x64 (*), that way the profiler push/pop wouldn't even pollute your cache, which makes it very low impact indeed)

(* BTW movnti is exposed as _mm_stream_si64 ; unfortunately, this does not exist in VC2005, you need 2008; the fact that they took away inline asm and then failed to expose all the instructions in intrinsics was a huge "fuck you" to all low-level developers; it was terrible in the early days, they've caught up a bit more with each rev ) (note that movntq is not the same, movnti comes from normal registers).

So I did this, and made my AllocParser able to parse that kind of input and turn it into AllocRecords. (the normal mode of AllocParser is to handle stack traces of memory allocations).

So now I can make tabview output for either top-down or bottom-up hierarchies, and also graphiz output like : lz_profile.svg .

There are a few minor things to clean up, like I'd like to be able to show either seconds or clocks, I'd like to be able to divide clocks by the total # of bytes to make it a clocks per byte, and if a node only has a self time, I'd like to delete the redundant +self node.

Another option is to show the hierarchy in a totally different graphviz style, using the boxes-within-boxes method. I tried that before for allocs and it was hideous, but it might be okay for timing. If I do that, then I could combine this with a global timeline to show thread profiles over time, and then use the arrows to connect dependencies.

Then I have to buy that old plotter I've been talking about ...

8/24/2010

08-24-10 - Free Trail Maps

Don't support those Green Trails cocksuckers who charge $400 for their maps. (if you do have the Green Trails maps, I encourage you to post them for public use on the net; you fuckers have abused your right for me to respect your copyright; in fact I vaguely considered buying a set just to scan it and post it, but they're not even worth that).

It's most appalling because all the maps are based on the USGS data, which is paid for by our fucking tax dollars. Fortunately there are some perfectly good free map sites :

libremap.org : Libre Map Project - Free Maps and GIS data
digital-topo-maps.com : Free Printable Topo Maps - Instant Access to Topographic Maps
ACME Mapper 2.0

Of these, digital-topo-maps.com is the easiest to browse around cuz it just uses Google maps (actually, it seems to be like 10X faster than Google's own interface, so it's actually just a nice way to browse normal maps too).

Libre Maps is the most useful for hiking with because it has nice printer-ready pages in high quality.

Also, I sometimes forget that Google still has the Terrain maps because they are hidden under More... now. I'm sure somebody has done trail overlays for Google Terrain, but I haven't found it.

08-24-10 - AutoPrintf Release

Okay, I think AutoPrintf is close enough to final to release it. The nicer, more readable version is in : cblib.zip , but I understand you all like little autonomous files, so I grabbed the shit it needs and crammed it together in an ugly mess here :

apf.h
apf.cpp
apf.inc

(among other advantages, the cblib version doesn't pull vector.h or windows.h into the apf headers, both of which I consider to be very severe sins)

See older posts for a description of how it works and earlier not-good way of doing it and initial announcement .

The basic way it works is :

  • non-basic types are converted into basic types by the template using autoArgConvert; base types are passed through and others get ToString() called on them.

  • "%a" in the format string is turned into the type of the variable it points at (eg %d , %s, whatever). So you can of course use "%7a" or anything else that printf can handle.

  • I just call normal printf to do the actual printing.

that's it, very simple!

Here's my main.cpp for an example of usage :

#include < float.h >
#include < stdio.h >     
#include < string.h >

//#include "autoprintf.h"
#include "apf.h"

#include < windows.h >

struct Vec3 { float x,y,z; };

namespace apf 
{
inline const String ToString( const Vec3 & v )
{
    return StringPrintf("{%.3f,%.3f,%.3f}",v.x,v.y,v.z);
}

/*
inline size_t autoArgConvert(const HWND arg)
{
    return (size_t)arg;
}
*/

inline const String ToString( const HWND v )
{
    return StringPrintf("[%08X]",v);
}
};
    
int main(int argc,const char * argv[])
{   
    //MakeAutoPrintfINL();
    
    //autoprintf("test bad %d",3.f);
    
    autoprintf("hello %-7a",(size_t)400,"|\n");
    
    //*
    //autoprintf("100%");
    autoprintf("percent %% %s","100%"," stupid","!\n");
    autoprintf("hello ","world %.1f",3.f,"\n");
    autoprintf("hello ",7," world\n");
    autoprintf("hello %03d\n",7);
    autoprintf("hello %d",3," world %.1f\n",3.f);
    autoprintf("hello ",(size_t)400,"\n");
    autoprintf("hello ",L"unicode is balls"," \n");
    autoprintf("hello %a ",L"unicode is balls"," \n");
    //autoprintf("hello %S ",L"unicode is balls"," \n");
    autoprintf("hello ",apf::String("world")," \n");
//  autoprintf("hello ",LogString()); // compile error
    
    autoprintf("top bit ",(1UL<<31)," \n");
    autoprintf("top bit %d",(1UL<<31)," \n");
    autoprintf("top bit %a",(1UL<<31)," \n");
    autoprintf("top bit %a\n",(size_t)(1UL<<31));
    
    HANDLE h1 = (HANDLE) 77;
    HWND h2 = (HWND) 77;
    autoprintf("HANDLE %a\n",h1);
    autoprintf("HWND %a\n",h2);
    
    char temp[1024];
    autosnprintf(temp,1023,"hello %a %a %a",4.f,7,apf::String("world"));
    printf("%s\n",temp);
    
    Vec3 v = { 3, 7, 1.5f };
    autoprintf("vector ",v," \n");
    autoprintf("vector %a is cool %a\n",v,(size_t)100);
    /**/
        
    return 0;
}

The normal way to make user types autoconvert is to add a ToString() call for your type, but you could also use autoArgConvert. If you use autoArgConvert, then you will wind up going through a normal %d or whatever.

One nice thing is that this autoprintf is actually even safer than my old safeprintf. If you mismatch primitive types, (eg. you put a %d in your format string but pass a float), it will check it using the same old safeprintf method (that is, a runtime failure). But if you put a std::string in the list when you meant to put a char *, you will get a compile error now, which is much nicer.

Everything in cblib now uses this (I made Log.h be an autoprintf) and I haven't noticed a significant hit to compile time or exe size since the templates are all now deterministic and non-recursive.

Yes it does a lot of dynamic allocation. Get with the fucking 20th century. And it's fucking printf. Printf is slow. I don't want to hear one word about it.

8/23/2010

08-23-10 - AutoPrintf v2

Okay, v2 works and confirmed doesn't give massive exe size bloat or compiler time death the way v1 does.

At the heart of v2 is a "fixed" way of doing varargs. The problem with varargs in C is that you don't get the types of the variables passed in, or the number of them. Well there's no need to groan about that because it's actually really trivial to fix. You just make a bunch of functions like :


template < typename T1, typename T2, typename T3 >
inline String autoToStringSub( T1 arg1, T2 arg2, T3 arg3)
{
    return autoToStringFunc( 3,
            safeprintf_type(arg1), safeprintf_type(arg2), safeprintf_type(arg3), 
            arg1, arg2, arg3, 0 );
}

for various number of args. Here autoToStringFunc(int nArgs, ...) is the basic vararg guy who will do all the work, and we just want to help him out a bit. This kind of adapter could be used very generally to make enhanced varargs functions. Here I only care about the "printf_type" of the variable, but more generaly you could use type_info there. (you could also easily make abstract Objects to encapsulate the args and pass through an array of Objects, so that the called function wouldn't have to be a stupid C vararg function at all, but then it's harder to pass through to the old C funcs that still want varargs).

On top of this we have a type-change adapter :


template < typename T1, typename T2 >
inline String autoToString( T1 arg1, T2 arg2)
{
    return autoToStringSub( 
        autoprintf_StringToChar( autoArgConvert(arg1) ), 
        autoprintf_StringToChar( autoArgConvert(arg2) ));
}

autoToString calls down to autoToStringSub, and uses autoArgConvert. autoArgConvert is a template that passes through basic types and calls ToString() on other types. ToString is a template that knows the basic types, and the client can extend it by adding ToString for their own types. If they don't, it will be a compile error. StringToChar is a helper that turns a String into a char * and passes through anything else. We have to do it in that double-call way so that the String can get allocated and stick around as a temporary until our whole call is done.

The next piece is how to implement autoToStringFunc() , which takes "enhanced varargs". We need to figure out which pieces are format strings and do various types of printfs (including supporting %a for auto-typed printf). The only tricky part of this is how to step around in the varargs. Here is the only place we have to use a little bit of undefined behavior. First of all, think of the va_list as a pointer to a linked list. Calling va_arg essentially advances the pointer one step. That's fine and stanard. But I assume that I can then take that pointer and pass it on as a va_list which is the remaining args (see note *).

The key way we deal with the varargs is with functions like this :


static inline void SkipVAArg(ESafePrintfType argtype, va_list & vl)
{
    switch(argtype)
    {
    case safeprintf_charptr:    { va_arg(vl,char *); return; }
    case safeprintf_wcharptr:   { va_arg(vl,wchar_t *); return; }
    case safeprintf_int32:      { va_arg(vl,int); return; }
    case safeprintf_int64:      { va_arg(vl,__int64); return; }
    case safeprintf_float:      { va_arg(vl,double); return; }
    case safeprintf_ptrint:     { va_arg(vl,int*); return; }
    case safeprintf_ptrvoid:    { va_arg(vl,void*); return; }
    default:
        // BAD
        safeprintf_throwsyntaxerror("SkipVAArg","unknown arg type");
        return;
    }
}

And the remainder is easy!

* : actually it looks like this is okay by the standard, I just have to call va_end after each function call then SkipArgs back to where I was. I believe this is pointless busywork, but you can add it if you want to be fully standard compliant.

8/22/2010

08-22-10 - AutoPrintf v1

Well autoprintf v1 appears to be all working. The core element is a bunch of functions like this :

template < typename T1, typename T2, typename T3, typename T4 >
inline String autoToString( T1 arg1, T2 arg2, T3 arg3, T4 arg4 )
{
    return ToString(arg1) + autoToString( arg2,arg3,arg4);
}


template < typename T2, typename T3 >
inline String autoToString( const char *fmt, T2 arg2, T3 arg3 )
{
    autoFormatInfo fmtInfo = GetAutoFormatInfo(fmt);
    if ( fmtInfo.autoArgI )
    {
        String newFmt = ChangeAtoS(fmt,fmtInfo);
        if ( 0 ) ;
        else if ( fmtInfo.autoArgI == 1 ) return autoToString(newFmt.CStr(), ToString(arg2).CStr(),arg3);
        else if ( fmtInfo.autoArgI == 2 ) return autoToString(newFmt.CStr(), arg2,ToString(arg3).CStr());
        else return autoPrintf_BadAutoArgI(fmt,fmtInfo);
    }

         if ( fmtInfo.numPercents == 0 )    return ToString(fmt) + autoToString(arg2,arg3);
    else if ( fmtInfo.numPercents == 1 )    return StringPrintf(fmt,arg2) + autoToString(arg3);
    else if ( fmtInfo.numPercents == 2 )    return StringPrintf(fmt,arg2,arg3);
    else return autoPrintf_TooManyPercents(fmt,fmtInfo);
};

you have an autoToString that takes various numbers of template args. If the first arg is NOT a char *, it calls ToString on it then repeats on the remaning args. Any time the first arg is a char *, it uses the other specialization which looks in fmt to see if it's a printf format string, then splits the args based on how many percents they are. I also added the ability to use "%a" to mean auto-typed args, which is what the first part of the function is doing.

That's all dandy, but you should be able to see that for large numbers of args, it generates a massive amount of code.

The real problem is that even though the format string is usually a compile-time constant, I can't parse it at compile time, so I generate code for each arg being %a or not being %a, and for each possible number of percents. The result is something like 2^N codegens for N args. That's bad.

So, I know how to fix this, so I don't think I'll publish v1. I have a method for v2 that moves most of the work out of the template. It's much simpler actually, and it's a very obvious idea, all you have to do is make a template like :


autoprintf(T1 a1, T2 a2, T3 a3)
{
    autoPrintfSub( autoType(a1), autoArg(a1) ,autoType(a2), autoArg(a2) , .. )
}

where autoType is a template that gives you the type info of the arg, and autoArg does conversions on non-basic types for you, and then autoPrintfSub can be a normal varargs non-template function and take care of all the hard work.

... yep new style looks like it will work. It requires a lot more fudging with varargs, the old style didn't need any of that. And I'm now using undefined behavior, though I think it always works in all real-world cases. In particular, in v2 I'm now relying on the fact that I can do :


  va_start(vl)
  va_arg(vl) .. a few types to grab some args from vl
  vsnprintf(  vl);

that is, I rely on the fact that va_arg advances me one step in the va_list, and that I then still have a valid va_list for remaining args which I can pass on. This is not allowed by the standard technically but I've never seen a case where it doesn't work (unless GCC decided to get pedantic and forceably make it fail for no good reason).

old rants