9/30/2011

09-30-11 - String Match Results Part 2

The first and simplest set of results are the ones where non-O(N) algorithms make themselves known.

Optimal parsing, large window.

The candidates are :

SuffixArray1 : naive matcher built on divsufsort
SuffixArray2 : using the forward-offset elimination from this post
Suffix2 : Suffix Trie with follows and path compression but missing the bits in this post
Suffix3 : Suffix Trie without follow
Suffix5 : fully working Suffix Trie
MMC1 : MMC with max match length of 256
MMC2 : MMC with no limit
LzFind1 : LzFind (LZMA HC4 - binary tree) with max ML of 256 and step limit of 32
LzFind2 : LzFind with no max ML or step limit

Note : LzFind was modified from LZMA to not record all matches, just the longest, to make it more like the competitors. MMC was modified to make window size a variable.

In all cases I show :


Test_Matcher : average match length per byte , average clocks per byte

And with no further ado :

got arg : window_bits = 24
working on : m:\test_data\lz_stress_tests

loading file : m:\test_data\lz_stress_tests\stress_many_matches
Test_SuffixArray1 : 32.757760 , 164.087948 
Test_SuffixArray2 : 32.756953 , 199.878476 
Test_Suffix2 : 32.757760 , 115.846130 
Test_Suffix3 : 31.628279 , 499.722569 
Test_Suffix5 : 32.757760 , 184.172167 
Test_MMC2 : 32.757760 , 1507.818166 
Test_LzFind2 : 32.757760 , 576.154370 

loading file : m:\test_data\lz_stress_tests\stress_search_limit
Test_SuffixArray1 : 823.341331 , 182.789064 
Test_SuffixArray2 : 823.341331 , 243.492241 
Test_Suffix2 : 823.341331 , 393.930504 
Test_Suffix3 : 807.648294 , 2082.447274 
Test_Suffix5 : 823.341331 , 91.699276 
Test_MMC2 : 823.341331 , 6346.400206 
Test_LzFind2 : 823.341331 , 1807.516994 

loading file : m:\test_data\lz_stress_tests\stress_sliding_follow
Test_SuffixArray1 : 199.576550 , 189.029462 
Test_SuffixArray2 : 199.573198 , 220.316868 
Test_Suffix2 : 199.576550 , 95.225780 
Test_Suffix3 : 198.967622 , 2110.521111 
Test_Suffix5 : 199.576550 , 106.019526 
Test_MMC2 : 199.576550 , 36571.382020 
Test_LzFind2 : 199.576550 , 1249.184412 

loading file : m:\test_data\lz_stress_tests\stress_suffix_forward
Test_SuffixArray1 : 5199.164464 , 6138.802402 
Test_SuffixArray2 : 5199.164401 , 213.675569 
Test_Suffix2 : 5199.164464 , 12901.429712 
Test_Suffix3 : 5199.075953 , 32152.812339 
Test_Suffix5 : 5199.164464 , 145.684678 
Test_MMC2 : 5199.016562 , 6652.666440 
Test_LzFind2 : 5199.164464 , 11739.369336 

loading file : m:\test_data\lz_stress_tests\stress_all_as
Test_SuffixArray1 : 21119.499148 , 40938.612689 
Test_SuffixArray2 : 21119.499148 , 127.520147 
Test_Suffix2 : 21119.499148 , 88178.094886 
Test_Suffix3 : 21119.499148 , 104833.677202 
Test_Suffix5 : 21119.499148 , 119.676823 
Test_MMC2 : 21119.499148 , 25951.480871 
Test_LzFind2 : 21119.499148 , 38581.431558 

loading file : m:\test_data\lz_stress_tests\twobooks
Test_SuffixArray1 : 192196.571348 , 412.356092 
Test_SuffixArray2 : 192196.571348 , 420.437773 
Test_Suffix2 : 192196.571348 , 268.524287 
Test_Suffix3 : DNF
Test_Suffix5 : 192196.571348 , 292.777726 
Test_MMC2 : DNF 
Test_LzFind2 : DNF 

(DNF = Did Not Finish = over 100k clocks per byte).

Conclusion : SuffixArray2 and Suffix5 both actually work and are correct with no blowup cases.

SuffixArray1 looks good on the majority of files (and is slightly faster than SuffixArray2 on those files), but "stress_suffix_forward" clearly calls it out and shows the break down case.

Suffix2 almost works except on the degenerate tests due to failure to get some details of follows quite right ( see here ).

Suffix3 just shows that a Suffix Trie without follows is some foolishness.

We won't show SuffixArray1 or Suffix2 or Suffix3 again.

MMC2 and LZFind2 both have bad failure cases. Both are simply not usable if you want to find the longest match at every byte. We will revisit them later in other usages though and see that they are good for what they're designed for.

I've not included any of the hash chain type matchers in this test because they all obviously crap their pants in this scenario.

09-30-11 - String Match Results Part 1

I was hoping to make some charts and graphs, but it's just not that interesting. Anyhoo, let's get into it.

What am I testing? String matching for an LZ-type compressor. Matches must start before current pos but can run past current pos. I'm string matching only, not compressing. I'm counting the total time and total length of matches found.

I'm testing match length >= 4. Matches of length 2 & 3 can be found trivially by table lookup (though on small files this is not a good way to do it). Most of the matchers can handle arbitrary min lengths, but this is just easier/fairer for comparison.

I'm testing both "greedy" (when you find a match step ahead its length) and "optimal" (find matches at every position). Some matchers like the suffix tree ones don't really support greedy parsing, since they have to do all the work at every position even if you don't want the match there.

I'm testing windowed and non-windowed matchers.

I'm testing approximate and non-approximate (exact) matchers. Exact matchers find all matches possible, approximate matchers find some amount less. I'm not sure the best way to show the approximation vs. speed trade off. I guess you want a "pareto frontier" type of graph, but what should the axes be?

Also, while I'm at it, god damn it!

MAKE YOUR CODE FREE PEOPLE !!

(and GPL is not free). And some complicated personal license is a pain in the ass. I used to do this myself, I know it's tempting. Don't fucking do it. If you post code just make it 100% free for all uses. BSD license is an okay choice.

Matchers I'm having trouble with :


Tornado matchers from FreeArc - seem to be GPL (?)

MMC - GPL

LzFind from 7zip appears to be public domain. divsufsort is free. Larsson's slide is free.

09-30-11 - Don't use memset to zero

A very common missed optimization is letting the OS zero large chunks of memory for you.

Everybody just writes code like this :


U32 * bigTable = malloc(20<<20);
memset(bigTable,0,20<<20);

but that's a huge waste. (eg. for large hash table on small files the memset can dominate your time).

Behind your back, the operating system is actually running a thread all the time as part of the System Idle Process which grabs free pages and writes them with zero bytes and puts them on the zero'ed page list.

When you call VirtualAlloc, it just grabs a page from the zeroed page list and hands it to you. (if there are none available it zeroes it immediately).

!!! Memory you get back from VirtualAlloc is always already zeroed ; you don't need to memset it !!!

The OS does this for security, so you can never see some other app's bytes, but you can also use it to get zero'ed tables quickly.

(I'm not sure if any stdlib has a fast path to this for "calloc" ; if so that might be a reason to prefer that to malloc/memset; in any case it's safer just to talk to the OS directly).

ADDENDUM : BTW to be fair none of my string matchers do this, because other people's don't and I don't want to win from cheap technicalities like that. But all string match hash tables should use this.

9/29/2011

09-29-11 - Suffix Tries 3 - On Follows with Path Compression

Some subtle things that it took me a few days to track down. Writing for my reference.

1. Follows updates should be a bit "lazy". With path compression you aren't making all the nodes on a suffix. So when you match at length 5, the follow at length 4 might not exist. (I made a small note on the consequences of this previously . Even if the correct follow node doesn't exist, you should still link in to the next longest follow node possible (eg. length 3 if a 4 doesn't exist). Later on the correct follow might get made, and then if possible you want to update it. So you should consider the follows links to be constantly under lazy update; just because a follow link exists it might not be the right one, so you may want to update it.

eg. say you match 4 bytes of suffix [abcd](ef..) at the current spot. You want to follow to [bcd] but there is no length 3 node of that suffix currently. Instead you follow to [bc] (the next best follow available) , one of whose children is [dxy], you now split the [dxy] to [d][xy] and add [ef] under [d]. You can then update the follow from the previous node ([abcd]) to point at the new [bc][d] node.

2. It appears that you only need to update one follow per byte to get O(N). I don't see that this is obvious from a theoretical standpoint, but all my tests pass. Say you trace down a long suffix. You may encounter several nodes that don't have fully up to date follow pointers. You do not have to track them all and update them all at the next byte. It seems you can just update the deepest one (not the deepest node, but the deepest node that needs an update). (*)

3. Even if your follow is not up to date, you can still use the gauranteed (lastml-1) match len to good advantage. This was a big one that I missed. Say you match 4096 bytes and you take the follow pointer, and it takes you to a node of depth 10. You've lost a lot of depth - you know you must match at least 4095 bytes and you only have 10 of them. But you still have an advantage. You can descend the tree and skip all string compares up to 4095 bytes. In particular, when you get to a leaf you can immediately jump to matching 4095 of the leaf pointer.

4. Handling of EOF in suffix algorithms is annoying; it needs to act like a value outside the [0,255] range. The most annoying case is when you have a degenerate suffix like aaaa...aaaEOF , because the "follow" for that suffix might be itself (eg. what follows aaa... is aa..) depending on how you handle EOF. This can only happen with the degenerate RLE case so just special casing the RLE-to-EOF case avoids some pain.

ADDED : There's a trivial solution to handling EOF easily : 09-10-14 - Suffix Trie EOF handling

(* = #2 is the thing I have the least confidence in; I wonder if there could be a case where the single node update doesn't work, or if maybe you could get non-O(N) behavior unless you have a more clever/careful update node selection algorithm)

9/28/2011

09-28-11 - String Matching with Suffix Arrays

A suffix sorter (such as the excellent divsufsort by Yuta Mori) provides a list of the suffix positions in an array in sorted order. Eg. sortedSuffixes[i] is the ith suffix in order.

You can easily invert this table to make sortLookup such that sortLookup[ sortedSuffix[i] ] == i . eg. sortLookup[i] is the sort order for position i.

Now at this point, for each suffix sort position i, you know that the longest match with another suffix is either at i-1 or i+1.

Next we need the neighboring pair match lengths for the suffix sort. This can be done in O(N) as previously described here . So we now have a sortSameLen[] array such that sortSameLen[i] tells you the match length between (sorted order) elements i and i+1.

Using just these you can find all the match lengths for any suffix in the array thusly :


For a suffix start at index pos
Find its sort order : sortIndex = sortLookup[pos]
In each direction (+1 and -1)
current_match_len = infinite
step to next sort index
current_match_len = MIN(current_match_len,sortSameLen[sort index])

Okay. This is all old news. But it has a problem that has been discussed previously .

When matching strings for LZ and such, we don't want the longest match in the array, we want the longest match that occurs earlier. Handled naively this ruins the great O() performance of suffix array string matching. But you can do better.

Run Algorithm Next Index with Lower Value on the sortedSuffix[] array. This provides an array nextSuffixPreceding[]. This is exactly what you need - it provides the next closest suffix with a preceding index.

Now instead of the longest match being at +1 and -1, the longest match is at nextSuffixPreceding[i] and priorSuffixPreceding[i].

There's one last problem - if my current suffix is at position pos, and I look up si = sortIndex[pos] and from that nextSuffixPreceding[si] - I need to walk up to that position one by one doing MIN() on the adjacent pair match lengths (sortSameLen). That ruins my O() win.

But there's a solution - simply build the match length as well when you run "next index with lower value". This can be done easily by tracking the match length back to the preceding "fence". This adds no complexity to the algorithm.

The total sequence of operations is :


sort suffixes : O(NlogN) or so

build sort lookup : O(N)

build sort pair same len : O(N)

build next/prior pos preceding with match lengths : O(N)

now to find a match length :

at position "pos"
si = sortLookup[pos]
for each direction (following and preceding)
  matchpos = nextSuffixPreceding[si]
  matchlen = nextSuffixPreceding_Len[si]

that is, the match length lookup is a very simple O(1) per position (or O(N) for all positions).

One minor annoyance remains, which is that the suffix array string searcher does not provide the lowest offset for a given length of match. It gives you the closest in suffix order, which is not what you want.

09-28-11 - Algorithm - Next Index with Lower Value

You are given an array of integers A[]

For each i, find the next entry j (j > i) such that the value is lower (A[j] < A[i]).

Fill out B[i] = j for all i.

For array size N this can be done in O(N).

Here's how :

I'll call this algorithm "stack of fences". Walk the array A[] from start to finish in one pass.

At i, if the next entry (A[i+1]) is lower than the current (A[i]) then you have the ordering you want immediately and you just assign B[i] = i+1.

If not, then you have a "fence", a value A[i] which is seeking a lower value. You don't go looking for it immediately, instead you just set the current fence_value to A[i] and move on via i++.

At each position you visit when you have a fence, you check if the current A[i] < fence_value ? If so, you set B[fence_pos] = i ; you have found the successor to that fence.

If you have a fence and find another value which needs to be a fence (because it's lower than its successor) you push the previous fence on a stack, and set the current one as the active fence. Then when you find a value that satisfies the new fence, you pop off the fence stack and also check that fence to see if it was satisfied as well. This stack can be stored in place in the B[] array, because the B[] is not yet filled out for positions that are fences.

The pseudocode is :


fence_val = fence_pos = none

for(int i=1;i<size;i++)
{
    int prev = A[i-1];
    int cur = A[i];

    if ( cur > prev )
    {
        // make new fence and push stack
        B[i_prev] = fence_pos;
        fence_pos = i_prev;
        fence_val = prev;
    }
    else
    {
        // descending, cur is good :
        B[i_prev] = i;

        while( cur < fence_val )
        {
            prev_fence = B[fence_pos];
            B[fence_pos] = i;
            fence_pos = prev_fence;
            if ( fence_pos == -1 )
            {
                fence_val = -1;
                break;
            }
            fence_val = A[fence_pos];
        }
    }
}

This is useful in string matching, as we will see forthwith.

9/27/2011

09-27-11 - String Match Stress Test

Take a decent size file like "book1" , do :

copy /b book1 + book1 twobooks

then test on "twobooks".

There are three general classes of how string matchers respond to a case like "twobooks" :

1. No problemo. Time per byte is roughly constant no matter what you throw at it (for both greedy and non-greedy parsing). This class is basically only made up of matchers that have a correct "follows" implementation.

2. Okay with greedy parsing. This class craps out in some kind of O(N^2) way if you ask them to match at every position, but if you let them do greedy matching they are okay. This class does not have a correct "follows" implementation, but does otherwise avoid O(N^2) behavior. For example MMC seems to fall into this class, as does a suffix tree without "follows".

Any matcher with a small constant number of maximum compares can fall into this performance class, but at the cost of an unknown amount of match quality.

3. Craps out even with greedy parsing. This class fails to avoid O(N^2) trap that happens when you have a long match and also many ways to make it. For example simple hash chains without an "amortize" limit fall in this class. (with non-greedy parsing they are O(N^3) on degenerate cases like a file that's all the same char).


Two other interesting stress tests I'm using are :

Inspired by ryg, "stress_suffix_forward" :

4k of aaaaa...
then paper1
then 64k of aaaa...
obviously when you first reach the second part of "aaaa..." you need to find the beginning of the file, but a naive suffix sort will have to look through 64k of following a's before it finds it.

Another useful one to check on the "amortize" behavior is "stress_search_limit" :

book1
then, 1000 times :
  128 random bytes
  the first 128 bytes of book1
book1 again
obviously when you encounter all of book1 for the second time, you should match the whole book1 at the head of the file, but matchers which use some kind of simple search limit (eg. amortized hashing) will see the 128 byte matches first and may never get back to the really long one.

9/26/2011

09-26-11 - Tiny Suffix Note

Obviously there are lots of analogies between suffix tries and suffix arrays.

This old note about suffix arrays which provides O(N) neighbor pair match lengths is exactly analogous to using "follow pointers" for O(N) string matching in suffix tries.

(their paper also contains a proof of O(N)'ness , though it is obvious if you think about it a bit; see comments on previous post about this).

Doing Judy-ish stuff for a suffix tree is exacly analogous to the "introspective" stuff that's done in good suffix array sorters like divsufsort.

By Judy-ish I mean using a variety of tree structures and selecting one for the local area based on its properties. (eg. nodes with > 100 children switch to just using a radix array of 256 direct links to kids).

Suffix tries are annoying because it's easy to slide the head (adding nodes) but hard to slide the tail (removing nodes). Suffix arrays are even worse in that they don't slide at all.

The normal way to adapt suffix arrays to LZ string matching is just to use chunks of arrays (possibly a power-of-2 cascade). There are two problems I haven't found a good solution to. One is how to look up a string in the chunk that it is not a member of (eg. a chunk that's behind you). The other is how to deal with offsets that are in front of you.

If you just put your whole file in one suffix array, I believe that is unboundedly bad. If you were allowed to match forwards, then finding the best match would be O(1) - you only have to look at the two slots before you and after you in the sort order. But since we can't match forward, you have to scan. The pseudocode is like this :


do both forward and backward :
start at the sort position of the string I want to match
walk to the next closest in sort order (this is an O(1) table lookup)
if it's a legal match (eg. behind me) - I'm done, it's the best
if not, keep walking

the problem is the walk is unbounded. When you are somewhere early in the array, there can be an arbitrary number (by which I mean O(N)) of invalid matches between you and your best match in the sort order.

Other than these difficulties, suffix arrays provide a much simpler way of getting the advantages of suffix tries.

Suffix arrays also have implementation advantages. Because you separate the suffix string work from the rest of your coder it makes it easier to optimize each one in isolation, you get better cache use and better register allocation. Also, the suffix array can use more memory during the sort, or use scratch space, while a trie has to hold its structure around all the time. For example some suffix sorts will do things like use a 2-byte radix in parts of the sort where that makes sense (and then they can get rid of it and use it on another part of the sort), and that's usually impossible for a tree that you're holding in memory as you scan.

9/25/2011

09-25-11 - More on LZ String Matching

This might be a series until I get angry at myself and move on to more important todos.

Some notes :

1. All LZ string matchers have to deal with this annoying problem of small files vs. large ones (and small windows vs large windows). You really want very different solutions, or at least different tweaks. For example, the size of the accelerating hash needs to be tuned for the size of data or you can spend all your time initializing a 24 bit hash to find matches in 10 byte file.

2. A common trivial case degeneracy is runs of the same character. You can of course add special case handling of this to any string matcher. It does help a lot on benchmarks of course, because this case is common, but it doesn't help your worst case in theory because there are still bad degenerate cases. It's just very rare to have long degenerate matches that aren't simple runs.

One easy way to do this is to special case just matches that start with a degenerate char. Have a special index of [256] slots which correspond to starting with >= 4 of that char.

3. A general topic that I've never seen explored well is the idea of approximate string matching.

Almost every LZ string matcher is approximate, they consider less than the full set of matches. Long ago someone referred to this as "amortized hashing" , which refers to the specific implemntation of a hash chain (hash -> linked list) in which you simply stop searching after visiting some # of links. (amortize = minimize the damage from the worst case).

Another common form of approximate string searching is to use "cache tables" (that is, hash tables with overwrites). Many people use a cache tables with a few "ways".

The problem with both these approaches is that the penalty is *unbounded*. The approximate match can be arbitrarily worse than the best match. That sucks.

What would be ideal is some kind of tuneable and boundable approximate string match. You want to set some amount of loss you can tolerate, and get more speedup for more loss.

(there are such data structures for spatial search, for example; there are nice aproximate-nearest-neighbors and high-dimensional-kd-trees and things like that which let you set the amount of slop you tolerate, and you get more speedup for more slop. So far as I know there is nothing comparable for strings).

Anyhoo, the result is that algorithms with approximations can look very good in some tests, because they find 99% of the match length but do so much faster. But then on another test they suddenly fail to find even 50% of the match length.

9/24/2011

09-24-11 - Suffix Tries 2

Say you have a suffix trie with path compression.

So, for example if you had "abxyz" , "abymn" and "abxyq" then you would have :


[ab]   (vertical link is a child)
|
[xy]-[ymn]  (horizontal link is a sibling)
|
z-q

only the first character is used for selecting between siblings, but then you may need to step multiple characters to get to the next branch point.

(BTW I just thought of an interesting alternative way to do suffix tries in a b-tree/judy kind of way. Make your node always have 256 slots. Instead of always matching the first character to find your child, match N. That way for sparse parts of the tree N will be large and you will have many levels of the tree in one 256-slot chunk. In dense parts of the tree N becomes small, down to 1, in which case you get a radix array). Anyhoo..

So there are substrings that don't correspond to any specific node. For example "abx" is between "ab" and "abxy" which have definite spots in the tree. If you want to add "abxr" you have to first break the "xy" and then add the new node.

Okay, this is all trivial and just tree management, but there's something interesting about it :

If you have a "follow" pointer and the length you want does not correspond to a specific node (ie it's one of those between lengths), then there can be no longer match possible.

So, you had a previous match of length "lastml". You step to the next position, you know the best match is at least >= lastml-1. You use a follow pointer to jump into the tree and find the node for the following suffix. You see that the node does not have length "lastml-1", but some other length. You are done! No more tree walking is needed, you know the best match length is simply lastml-1.

Why is this? Consider if there was a longer match possible. Let's say our string was "sabcdt..." at the last position we matched 5 ("sabcd"). So we now have "abcdt..." and know match is >= 4. We look up the follow node for "abcd" and find there is no length=4 node in the tree. That means that the only path in the tree had "dt" in it - there has been no character other than "t" after "d" or there would be a branching node there. But I know that I cannot match "t" because if I did then the previous match would have been longer. Therefore there is no longer match possible.

This turns out to be very common. I'm sure if I actually spent a month or so on suffix tries I would learn lots of useful properties (there are lots of papers on this topic).

09-24-11 - Suffix Tries 1

To make terminology clear I'm going to use "trie" to mean a tree in which as you descend the length of character match always gets longer, and "suffix trie" to indicate the special case where a trie is made from all suffixes *and* there are "follow" pointers (more on this later).

Just building a trie for LZ string searching is pretty easy. Using the linked-list method (which certainly has disadvantages), internal nodes only need a child & sibling pointer, and some bit of data. If you always descend one char at a time that data is just one char. If you want to do "path compression" (multi-char steps in a single link) you need some kind of pointer + length.

(it's actually much easier to write the code with path compression, since when you add a new string you only have to find the deepest match in the tree then add one node; with single char steps you may have to add many nodes).

So for a file of length N, internal nodes are something like 10 bytes, and you need at most N nodes. Leaves can be smaller or even implicit.

With just a normal trie, you have a nice advantage for optimal parsing, which is that when you find the longest match, you also automatically walk past all shorter matches. At each node you could store the most recent position that that substring was seen, so you can find the lowest offset for each length of match for free. (this requires more storage in the nodes plus a lot more memory writes, but I think those memory writes are basically free since they are to nodes in cache anyway).

The Find and Insert operations are nearly identical so they of course should be done together.

A trie could be given a "lazy update". What you do is on Insert you just tack the nodes on somewhere low down in the tree. Then on Find, when you encounter nodes that have not been fully inserted you pick them up an carry them with you as you descend. Whenever you take a path that your baggage can't take, you leave that baggage behind. This could have advantages under certain usage patterns, but I haven't actually tried it.

But it's only when you get the "follow" pointers that a suffix trie really makes a huge difference.

A follow pointer is a pointer in the tree from any node (substring) to the location in the tree of the substring without the first character. That is, if you are at "banana" in the tree, the follow pointer should point at the location of "anana" in the tree.

When you're doing LZ compression and you find a match at pos P of length L, you know that at pos P+1 there must be a match of at least length L-1 , simply by using the same offset and matching one less character. (there could be a longer match, though). So, if you know the suffix node that was used to find the match of length L at pos P, then you can jump in directly to match of length L-1 at the next position.

This is huge. Consider for example the fully degenerate case, a file of length N of all the same character. (yes obviously there are special case solutions to the fully degenerate case, but that doesn't fix the problem, it just makes it more complex to create the problem). A naive string matcher is actually O(N^3) !!

For each position in the file (*N)
Consider all potential matches (*N)
Compare all the characters in that potential match (*N)
A normal trie makes this O(N^2) , because the comparing characters in the string is combined with finding all potential matches, so the tree descent + string compares combined are just O(N).

But a true suffix trie with follow pointers is only O(N) for the whole parse. Somewhere early on would find a match of length O(N) and then each subsequent one just finds a match of L-1 in O(1) time using the follow pointer. (the O(N) whole parse only works if you are just finding the longest length at each position; if you are doing the optimal parse where you find the lowest offset for each length it's O(N^2))

Unfortunately, it seems that when you introduce the follow pointer this is when the code for the suffix trie gets rather tricky. It goes from 50 lines of code to 500 lines of code, and it's hard to do without introducing parent pointers and lots more tree maintenance. It also makes it way harder to do a sliding window.

9/23/2011

09-23-11 - Morphing Matching Chain

"MMC" is a lazy-update suffix tree.

mmc - Morphing Match Chain - Google Project Hosting
Fast Data Compression MMC - Morphing Match Chain
Fast Data Compression BST Binary Search Tree
encode.ru : A new match searching structure
Ultra-fast LZ

(I'm playing a bit loose with the term "suffix tree" as most people do; in fact a suffix tree is a very special construction that uses the all-suffixes property and internal pointers to have O(N) construction time; really what I'm talking about is a radix string tree or patricia type tree). (also I guess these trees are tries)

Some background first. You want to match strings for LZ compression. Say you decide to use a suffix tree. At each level of the tree, you have already matched L characters of the search string; you just look up your next character and descend that part of the tree that has that character as a prefix. eg. to look up string str, if you've already decended to level L, you find the child for character str[L] (if it exists) and descend into that part of the tree. One way to implement this is to use a linked list for all the characters that have been seen at a given level (and thus point to children at level +1).

So your nodes have two links :


child = subtree that matches at least L+1 characters
sibling = more nodes at current level (match L characters)

the tree for "bar","band",bang" looks like :

b
|  (child links are vertical)
a
|
r-n  (sibling links are horizontal)
| |
* d-g
  | |
  * *

where * means leaf or end of string (and is omitted in practice).

Okay, pretty simple. This structure is not used much in data compression because we generally want sliding windows, and removal of strings as they fall out of the sliding window is difficult.

(Larsson and others have shown that it is possible to do a true sliding suffix tree, but the complexity has prevented use in practice; this would be a nice project if someone wants to make an actual fast implementation of the sliding suffix trie)

Now let's look at the standard way you do a hash table for string matching in the LZ sliding window case.

The standard thing is to use a fixed size hash to a linked list of all strings that share that hash. The linked list can just be an array of positions where that hash value last occured. So :


pos = hashTable[h] contains the position where h last occured
chain[pos] contains the lat position before pos where that same hash h occurred

the nice thing about this is that chain[] can just be an array of the size of the sliding window, and you modulo the lookup into it. In particular :

//search :
h = hash desired string
next = hashTable[h];
while ( next > cur - window_size )
{
  // check match len of next vs cur
  next = chain[next & (window_size-1) ];
}

note that the links can point outside the sliding window (eg. either hashTable[] or chain[] may contain values that go outside the window), but we detect those and know our walk is done. (the key aspect here is that the links are sorted by position, so that when a link goes out of the window we are done with the walk; this means that you can't do anything like MTF on the list because it ruins the position sort order). Also note that there's no check for null needed because we can just initial the hash table with a negative value so that null is just a position outside the window.

To add to the hash table when we slide the window we just tack onto the list :


// add string :
chain[ cur & (window_size)-1 ] = hashTable[h];
hashTable[h] = cur;

and there's the sort of magic bit - we also removed a node right there. We actually popped the node off the back of the sliding window. That was okay because it must have been the last node on its list, so we didn't corrupt any of our lists.

That's it for hash-chain review. It's really nice how simple the add/remove is, particularly for "Greedy" type LZ parsers where you do Insert much more often than you do Find. (there are two general classes of LZ parers - "Optimal" which generally do a Find & Insert at every position, and "Greedy" which when they find a match, step ahead by the match len and only do Inserts).

So, can we get the advantages of hash chains and suffix trees?

Well, we need another idea, and that is "lazy updates". The idea is that we let our tree get out of sorts a bit, and then fix it the next time we visit it. This is a very general idea and can be applied to almost any tree type. I think the first time I encountered it was in the very cool old SurRender Umbra product, where they used lazy updates of their spatial tree structures. When objects moved or spawned they got put on a list on a node. When you descend the tree later on looking for things, if a node has child nodes you would take the list of objects on the node and push them to the children - but then you only descend to the child that you care about. This can save a lot of work under certain usage patterns; for example if objects are spawning off in some part of the tree that you don't visit, they just get put in a high up node and never pushed down to the leaves.

Anyhoo, so our suffix tree requires a node with two links. Like the hash table we will implement our links just as positions :

struct SuffixNode { int sibling; int child; }
like the hash table, our siblings will be in order of occurance, so when we see a position that's out of the window we know we are done walking.

Now, instead of maintaining the suffix tree when we add a node, we're just going to tack the new node on the front of the list. We will then percolate in an update the next time we visit that part of the tree. So when you search the tree, you can first encounter some unmaintained nodes before you get to the maintained section.

For example, say we had "bar" and "band" in our tree, and we add "bang" at level 2 , we just stick it on the head and don't descend the tree to put it in the right place :


b
|  (child links are vertical)
a
|
NG-r-n  (sibling links are horizontal)
     |
     d

(caps indicates unmaintained portion)

now the next time we visit the "ba" part of the tree in a retrieval, we also do some maintenance. We remember the first time we see each character (using a [256] array), and if we see that same character again we know that it's because part of the tree was not maintained.

Say we come in looking for "bank". If see a node with an "n" (that's a maintained n) we know we are done and we go to the child link - there can't be any more n's behind that node. If we see an "N" (no child link), we remember it but we have to keep walking siblings. We might see more "N"s and we are done if we see an "n". Then we update the links. We remove the "n" (of band) from the sibling link and connect it to the "N" instead :


b
|  (child links are vertical)
a
|
n-r
|   
g---d

And this is the essence of MMC (lazy update suffix trie = LUST).

A few more details are significant. Like the simple hash chain, we always add nodes to the front of the list. The lazy update also always adds nodes to the head - that is, the branch that points to more children is always at the most recent occurance of that substring. eg. if you see "danger" then "dank" then "danish" you know that the "dan" node is either unmaintained, or points are the most recent occurance of "dan" (the one in "danish"). What this means is that the simple node removal method of the hash chain works - when the window slides, we just let nodes fall out of the range that we consider valid and they drop off the end of the tree. We don't have to worry about those nodes being an internal node to the tree that we are removing, because they are always the last one on a list.

In practice the MMC incremental update becomes complex because you may be updating multiple levels of the tree at once as you scan. When you first see the "NG" you haven't seen the "n" yet and you don't want to scan ahead the list right away, you want to process it when you see it; so you initially promote NG to a maintained node, but link it to a temporary invalid link that points back to the previous level. Then you keep walking the list and when you see the "n" you fix up that link to complete the maintenance.

It does appear that MMC is a novel and interesting way of doing a suffix trie for a sliding window.

8/14/2011

08-14-11 - A note on convex hull simplification

I wrote this in email and thought it worth recording.

A while ago I wrote mainly about OBB algorithms but a little note about convex hull simplification

It's a little unclear, so I clarified :

My algorithm is very simple and by no means optimal.

I construct a standard (exact) convex hull, then make a mesh from it. I then run a normal mesh simplifier (see for example Garland Heckbert Quadric Error Metrics) to simplify the CH as if it was a mesh. This can ruin inclusion. I then fix it by taking all the face planes of the simplified mesh and pushing them out past any vert in the original mesh.

Stan's (Melax - Convex Hull Simplification With Containment By Successive Plane Removal) way is similar but better. He uses a BSP engine to create the hull. First he finds a normal convex hull. Then he considers only the planes that make up that hull. The working hull is the volume that is on the "front" side of all planes. He then considers removing planes one by one. When you remove a plane, the cost to remove it is the volume that is added to the hull, which is the volume of the space that is on the back side of that plane but is on the front side of all other planes. You create a heap to do this so that the total cost to simplify is only N log N. This requires good BSP code which I don't have, which is why I used the mesh-simplifier approach.

An alternative in the literature is the "progressive hull" technique. This is basically using PM methods but directly considering the mesh as a hull during simplification instead of fixing it after the fact as I do. Probably a better way is to use a real epsilon-hull finder from the beginning rather than finding the exact hull and then simplifying.

My code is in Galaxy4 / gApp_HullTest which is available here ; You should be able to run "Galaxy4.exe hull" ; Hit the "m" key to see various visualations ; give it a mesh argument if you have one (takes .x, .m , .smf etc.)

BTW to summarize : I don't really recommend my method. It happens to be easy to implement if you have a mesh simplifier lying around. Stan's method is also certainly not optimal but is easy to implement if you have good BSP code lying around (and is better than mine (I suspect)).

The technique I actually prefer is to just use k-dops. k-dops are the convex hull made from the touching planes in a fixed set of k directions. Maybe find the optimal OBB and use that as the axis frame for the k directions. Increase k until you are within the desired error tolerance (or k exceeds the number of faces in the exact hull).

ASIDE : I have some BSP code but I hate it; I hate all floating point geometry code. I love integer geometry code. The problem with integers in BSP's is that clipping creates rational points. Maybe I'll write some BSP routines based on rational Vec3's. The only big problem is that the precision requirement goes up with each clip. So you either need arbitrary precision rationals or you have to truncate the precision at some maximum, and then handle the errors created by that (like the truncated point could move onto the back side of a plane that you said you were in front of). (this is better than the errors in floating points, because at least the truncated point is at a definite well defined location, floating points move around depending on how you look at them, those wiggly bastards) (I'm tempted to say that they're like quantum mechanics in that they change when you measure them, except that they really aren't at all, and that's the type of pseudo-scientific-mumbo-jumbo that pseudo-intellectual fucktards love and I so despise, so no, I won't say it).

8/12/2011

08-12-11 - The standard cinit trick

Sometimes I like to write down standard tricks that I believe are common knowledge but are rarely written down.

Say you have some file that does some "cinit" (C++ class constructors called before main) time work. A common example is like a factory that registers itself at cinit time.

The problem is if nobody directly calls anything in that file, it will get dropped by the linker. That is, if all uses are through the factory or function pointers or something like that, the linker doesn't know it gets called that way and so drops the whole thing out.

The standard solution is to put a reference to the file in its header. Something like this :


Example.cpp :

int example_cpp_force = 0;

AT_STARTUP( work I wanted to do );


Example.h :

extern int example_cpp_force;

AT_STARTUP( example_cpp_force = 1 );

where AT_STARTUP is just a helper that puts the code into a class so that it runs at cinit, it looks like this :

#define AT_STARTUP(some_code)   \
namespace { static struct STRING_JOIN(AtStartup_,__LINE__) { \
STRING_JOIN(AtStartup_,__LINE__)() { some_code; } } STRING_JOIN( NUMBERNAME(AtStartup_) , Instance ); };

Now Example.obj will be kept in the link if any file that includes Example.h is kept in the link.

This works so far as I know, but it's not really ideal (for one thing, if Example.h is included a lot, you get a whole mess of little functions doing example_cpp_force = 1 in your cinit). This is one of those dumb little problems that I wish the C standards people would pay more attention to. What we really want is a way within the code file to say "hey never drop this file from link, it has side effects", which you can do in certain compilers but not portably.

8/11/2011

08-11-11 - Free Internet

I mean "free" in a liberty sense, not a monetary sense.

Recent Seattle Weekly article got me thinking about trying to encrypt and anonymize all my internet access. The whole torrent model is just like fish in a barrel for copyright trolls. You can just hop on the net and get a list of infringers any time you want.

So whatever reason, say you want to be able to work on the net and do as you please without your actions being monitored.

Apparently the major US-based services like FindNot and Anonymizer are not to be trusted (they provide logs to the US government and to subpoenas by the RIAA etc).

Really what you want is something like Tor that takes all your traffic and bounces it around a bunch of other machines and then puts out portions of requests from all over. Currently none of those services seem to be quite ready for prime time; Tor for example kicks you out if you try to do high-bandwidth things like torrents.

Some links :

Web-based DNS Randomness Test DNS-OARC
Tor Project Anonymity Online
SwissVPN - Surf the safer way!
Public IP Swiss VPN - Page 2 - Wilders Security Forums
OneSwarm - Private P2P Data Sharing
I2P - Wikipedia, the free encyclopedia
How To Not Get Sued for File Sharing Electronic Frontier Foundation
Free Anonymous BitTorrent Becomes Reality With BitBlinder TorrentFreak
Chilling Effects Clearinghouse
Anonymous P2P - Wikipedia, the free encyclopedia

In general I'm not sure if dark-nets like Tor can survive. I don't trust the internet providers or the US government to allow you to have that freedom. I suspect that if they ever caught on en masse they would be blocked by the standard extra-judicial mechanisms that they used to shut down online poker and funding WikiLeaks (where the government nicely asks the service provider to block that traffic and the provider complies, even though it's not clear the law is on their side).

The only way to get past that (and into places like china) is to hide encrypted packets inside benign packets. That may be fine for little text messages, but you can never get high bandwidth that way.

8/09/2011

08-09-11 - Threading Links

For reference, some of the links I consulted for the recent postings :

[concurrency-interest] fast semaphore
[C++] Chris M Thomasson - Pastebin.com
[C++] Chris M Thomasson - Pastebin.com -rwmutex eventcount
[C++] Chris M Thomasson - Pastebin.com - wsdequeue
[C++] Chris M Thomasson - Pastebin.com - semaphore and mpmc
[C++] Chris M Thomasson - Pastebin.com - mpsc in relacy
[C++] Chris M Thomasson - Pastebin.com - eventcount from cond_Var
[C++] Chris M Thomasson - Pastebin.com - cond_Var from waitset
[C#] Chris M Thomasson - Pastebin.com - eventcount in C#
yet another win32 condvar implementation - comp.programming.threads Computer Group
yet another (tiny) implementation of condvars - comp.programming.threads Google Groups
Would this work on even one platform - about mutex reordering
Windows NT Keyed Events
Win32 Kernel Experimental WaitLock-Free Fast-Path Event-Count for Windows... anZ2dnUVZ, InterlockedLoadFence, and aPOdnXp1l6
win32 condvar futex - NOT! - 29464
Win32 condition variables redux - Thomasson thread list version
Win32 condition variables redux - comp.programming.threads Google Groups
Usenet - Lock-free queue SPMC + MPMC
Usenet - Condition variables signal with or without mutex locked
Time-Published Queue-Based Spin Locks
Ticket spinlocks [LWN.net]
ThreadSanitizer - data-race-test - ThreadSanitizer is a Valgrind-based detector of data races - Race detection tools and mor
Thin Lock vs. Futex � ��Bartosz Milewski's Programming Cafe
The Inventor of Portable DCI-aka-DCL (using TSD) is... ;-) - comp.programming.threads Google Groups
TEREKHOV - Re win32 conditions sem+counter+event = broadcast_deadlock + spur.wake
TBB Thomasson's MPMC
TBB Thomasson - rwmutex
TBB Thomason aba race
TBB Raf on spinning
TBB eventcount posting Dmitry's code
TBB Download Versions
TBB Dmitry on memory model
Task Scheduling Strategies - Scalable Synchronization Algorithms Google Groups
Subtle difference between C++0x MM and other MMs - seq_cst fence weird
Strong Compare and Exchange
Strategies for Implementing POSIX Condition Variables on Win32
Starvation-free, bounded- ... - Intel� Software Network
spinlocks XXXKSE What to do
Spinlocks and Read-Write Locks
SourceForge.net Repository - [relacy] Index of relacy_1_0rrdinttbb_eventcount
Some notes on lock-free and wait-free algorithms Ross Bencina
So what is a memory model And how to cook it - 1024cores
Sleeping Read-Write Locks
Simple condvar implementation for Win32 - comp.programming.threads Google Groups
Simple condvar implementation for Win32 (second attempt)
SignalObjectAndWait Function (Windows)
sequential consistency � Corensic
search for Thomasson - Pastebin.com
search for Relacy - Pastebin.com
sched_setscheduler
Scalable Synchronization
Scalable Synchronization MCS lock
Scalable Synchronization Algorithms Google Groups
Scalable Queue-Based Spin Locks with Timeout
Relacy Race Detector - 1024cores
really simple portable eventcount... - comp.programming.threads Google Groups
really simple portable eventcount... - 2
really simple portable eventcount... - 1
re WaitForMultipleObjects emulation with pthreads
Re sem_post() and signals
Re Portable eventcount (try 2)
Re Intel x86 memory model question
Re C++ multithreading yet another Win32 condvar implementation
race-condition and sub-optimal performance in lock-free queue ddj code...
Race in TBB - comp.programming.threads Google Groups
QPI Quiescence (David Dice's Weblog)
pthread_yield() vs. pthread_yield_np()
pthread_cond_ implementation questions - comp.programming.threads Google Groups
POSIX Threads (pthreads) for Win32
Porting of Win32 API WaitFor to Solaris Platform
Portable eventcount
Portable eventcount - Scalable Synchronization Algorithms Google Groups
Portable eventcount - comp.programming.threads Google Groups
Portable eventcount (try 2) - comp.programming.threads Google Groups
Parallel Disk IO - 1024cores
Obscure Synchronization Primitives
New implementation of condition variables on win32
my rwmutex algorithm for Linux... - this is good
Mutexes and Condition Variables using Futexes
Multithreading in C++0x part 1 Starting Threads Just Software Solutions - Custom Software Development and Website Developmen
Multithreaded File IO Dr Dobb's Journal
Multi-producermulti-consumer SEH-based queue � Intel Software Network Blogs - Intel� Software Network
MSDN Compound Synchronization Objects
MPMC Unbounded FIFO Queue w 1 CASOperation. No jokes. - comp.programming.threads Computer Group
Memory Consistency Models
low-overhead mpsc queue - Scalable Synchronization Algorithms
Lockless Inc Articles on computer science and optimization.
Lockingunlocking SysV semaphores - comp.unix.programmer Google Groups
Lockfree Algorithms - 1024cores
lock-free read-write locks - comp.programming.threads Google Groups
Lock-free bounded fifo-queue on top of vector - comp.programming.threads Google Groups
Linux x86 ticket spinlock
JSS Petersons
JSS Dekker
Joe Seighs awesome rw-spinlock with a twist; the beauty of eventcounts... - comp.programming.threads Google Groups
joe seigh on eventcount fences
Joe Seigh Fast Semaphore
Joe Duffy's Weblog - keyed events
Implementing a Thread-Safe Queue using Condition Variables (Updated) Just Software Solutions - Custom Software Development a
How to use priority inheritance
High-Performance Synchronization for Shared-Memory Parallel Programs University of Rochester Computer Science
good discussion of work stealing
good discussion of a broken condvar implementation
git.kernel.org - linuxkernelgittorvaldslinux-2.6.gitcommit
GCC-Inline-Assembly-HOWTO
futex(2) - Linux manual page
FlushProcessWriteBuffers Function (Windows)
First Things First - 1024cores
Fine-grained condvareventcount
fast-pathed mutex with eventcount for the slow-path... - comp.programming.threads Google Groups
experimental fast-pathed rw-mutex algorithm... - comp.programming.threads Google Groups
eventcount needs storeload
eventcount example of seq_cst fence problem
Effective Go - The Go Programming Language
duffy page that's down meh
Dr. Dobb's Journal Go Parallel QuickPath Interconnect Rules of the Revolution Dr. Dobb's and Intel Go Parallel Programming
Don�t rely on memory barriers for synchronization� Only if you don�t aware of Relacy Race Detector! � Intel Software Network
dmitry's eventcount for TBB
Distributed Reader-Writer Mutex - 1024cores
Discussion of Culler Singh sections 5.1 - 5.3
Developing Lightweight, Statically Initializable C++ Mutexes Dr Dobb's Journal
Derevyago derslib mt_threadimpl.cpp Source File
Derevyago - C++ multithreading yet another Win32 condvar implementation - comp.programming.threads Google Groups
Dekker's algorithm - Wikipedia, the free encyclopedia
David's Wikiblog
data-race-test - Race detection tools and more - Google Project Hosting
condvars signal with mutex locked or not Lo�c OnStage
Concurrent programming on Windows - Google Books
concurrency-induced memory-access anomalies - comp.std.c Google Groups
CONCURRENCY Synchronization Primitives New To Windows Vista
comp.programming.threads Google Groups
comp.lang.c++ Google Groups - thomasson event uses
Common threads POSIX threads explained, Part 3
Chris M. Thomasson - Pastebin.com
Chris M. Thomasson - Pastebin.com - win_condvar
Chapter�22.�Thread - Boost 1.46.1
cbloom rants 07-18-10 - Mystery - Does the Cell PPU need Memory Control -
cbloom rants 07-18-10 - Mystery - Do Mutexes need More than Acquire-Release -
Causal consistency - Wikipedia, the free encyclopedia
C++1x lock-free algos and blocking - comp.lang.c++ Google Groups
C++0x sequentially consistent atomic operations - comp.programming.threads Google Groups
C++0x memory_order_acq_rel vs memory_order_seq_cst
C++ native-win32 waitset class for eventcount... - comp.programming.threads Google Groups
C++ native-win32 waitset class for eventcount... - broken for condvar
C++ N1525 Memory-Order Rationale - nice
C++ multithreading yet another Win32 condvar implementation
Bug-Free Mutexs and CondVars w EventCounts... - comp.programming.threads Google Groups
Break Free of Code Deadlocks in Critical Sections Under Windows
Boost rwmutex 2
Boost rwmutex 1
boost atomics Usage examples - nice
Blog Archive Just Software Solutions - Custom Software Development and Website Development in West Cornwall, UK
Atomic Ptr Plus Project
Asymmetric Dekker
appcoreac_queue_spsc - why eventcount needs fence
AppCore A Portable High-Performance Thread Synchronization Library
Advanced Cell Programming
A word of caution when juggling pthread_cond_signalpthread_mutex_unlock - comp.programming.threads Google Groups
A theoretical question on synchronization - comp.programming.threads Google Groups
A race in LockSupport park() arising from weak memory models (David Dice's Weblog)
A garbage collector for C and C++
A futex overview and update [LWN.net]
A Fair Monitor (Condition Variables) Implementation for Win32

08-09-11 - The Lobster

(this coinage is so obvious I must have stolen it from somewhere, anyway...)

I've been thinking a lot recently about "the lobster".

I've always thought it was bizarre how you can pull into any podunk town in America and go to the scary local diner / steak house, and there will be the regular items - burger, chicken fried steak, what have you, all under $10, and then there's the lobster, for $30, ridiculously overpriced, tucked in the corner of the menu with decorative squiggles around it (as if it needs velvet ropes to separate the VIP section of the menu from the plebian fare).

The thing is, the lobster is not actually good. They probably can't remember the last time anybody actually ordered the lobster. No local would; if the waitress likes you she would warn you not to get, the chefs roll their eyes when the order comes in. Why is it on the menu at all?

I guess it's just there as a trap, for some sucker who doesn't know better, for someone wanting to show off the money they just won, or someone on an expense account to waste money on. You're really just humiliating yourself when you order it, and the restaurant is laughing at you.

I think most people know that you don't actually ever order the lobster in restaurants (other than lobster-specializing places in like Maine or something). But "the lobster" can pop up in many other guises. Expensive watches are obvious lobsters, expensive cars can be less obvious lobsters (is a Maserati a lobster? an Alfa? an Aston? a Porsche?), certainly some of the options and special editions are obvious lobsters, for example the recent Porsche "Speedster" special edition that cost $250k and was just a regular Carrera other than a few colored bits, that's clearly a lobster and Porsche laughs and rolls their eyes at the Seinfelds of the world who are stupid enough to buy the Porsche lobster just because it was on the menu with squiggly lines around it.

I feel like a lot of salesmen try to slip the lobster on you when you're not paying attention. Like when the contractor asks if you want your counters in wood or stone or italian marble - hey wait, contractor, that's the lobster! okay, yeah, you got me, I don't even know where to get italian marble but I thought I'd try to slip it in there. Home improvement in general is full of lobsters. Home theatre stores usually carry a lobster; car wheels ("rims") are rife with lobsters.

The thing that makes the nouveau riche so hilarious is they are constantly getting suckered into buying the lobster and then have the stupidity to brag about it. Ooo look at my gold plated boat ; you fool, you bought the lobster, hide your shame!


One of the things that's so satisfying about video games is that you get a clear reward for more work. You kill some monsters, you get experience, you go up a level; you collect 200 gems, now you can buy the red shield, and it is objectively better than the blue shield you had before. It's very simple and satisfying.

Life is not so clear. More expensive things are not always better. Doing more work doesn't necessarily improve your life. This can be frustrating and confusing.

One of the things that makes me lose it is video game designers who think it's a good idea to make games more realistic in this sense, like providing items in the stores that are expensive but not actually very good. No! I don't want to have to try to suss out "the lobster" in the video game blacksmith, you want video game worlds to be an escapist utopia in which it's always clear that spending more money gets you better stuff. (the other thing I can't stand is games that take away your items; god dammit, don't encourage me to do the work for that if you're going to take it away, don't inject the pains of real life into games, it does not make them better!)

8/01/2011

08-01-11 - Non-mutex priority inversion

An issue I don't see discussed much is non-mutex priority inversion.

First a review of mutex priority inversion. A low priority thread locks a mutex, then loses execution. A high priority thread then tries to lock that mutex and blocks. It gives up its time slice, but a bunch of medium priority threads are available to run, so they take all the time and the low priority thread doesn't get to run. We call it "priority inversion" because the high priority thread is getting CPU time as if it was the same as the low priority thread.

Almost all operating systems have some kind of priority-inversion-protection built into their mutex. The usual mechanism goes something like this : when you block on a mutex, find the thread that currently owns it and either force execution to go to that thread immediately, or boost its priority up to the same priority as the thread trying to get the lock. (for example, Linux has "priority inheritance").

The thing is, there are plenty of other ways to get priority inversion that don't involve a mutex.

The more general scenario is : a high priority thread is waiting on some shared object to be signalled ; a low priority thread will eventually signal that object ; medium priority threads take all the time so the low priority thread can't run, and the high priority thread stays blocked.

For example, this can happen with Semaphores, Events, etc. etc.

The difficulty is that in these cases, unlike with mutexes, the OS doesn't know which thread will eventually signal the shared object to let the high priority thread go, so it doesn't know who to boost.

Windows has panic mechanisms like the "balance set manager" which look for any thread which is not waiting on a waitable handle, but is getting no CPU time, then they force it to get some CPU time. This will save you if you are in one of these non-mutex priority-inversions, but it takes quite a long time for that to kick in, so it's really a last ditch panic save, if it happens you regret it.

Sometimes I see people talking about mutex priority inversion as if that's a scary issue; it's really not on any modern OS. But non-mutex priority inversion *is*.

Conclusion : beware using non-mutex thread flow control primitives on threads that are not of equal priority !

08-01-11 - Double checked wait

Something that we have touched on a few times is the "double checked wait" pattern. It goes like this :
consumer :

if ( not available )
{
    prepare_wait();

    if ( not available )
    {
        wait();
    }
    else
    {
        cancel_wait();
    }
}

producer :

make available
signal_waiters();
now, why do we do this? Well, if you did just a naive check like this :

consumer :

if ( not available )
{
    // (*1)
    wait();
}

producer :

make available
signal_waiters();

you have a race. What happens is you check available and see none, so you step in to *1 ; then the producer runs, publishes whatever and signals - but there are no waiters yet so the signal is lost. Then you go into the wait() and deadlock. This is the "lost wakeup" problem.

So, the double check avoids this race. What must the semantics of prepare_wait & wait be for it to work? It's something like this :

Any signal that happens between "prepare_wait" and "wait" must cause "wait" to not block (either because the waitable handle is signalled, or through some other mechanism).

Some implementations of a prepare_wait/wait mechanism may have spurious signals; eg. wait might not block even though you shouldn't really have gotten a signal; because of that you will usually loop in the consumer.

Now let's look at a few specific solutions to this problem :

condition variables

This is the locking solution to the race. It doesn't use double-checked wait, instead it uses a mutex to protect the race; the naive producer/consumer is replaced with :


consumer :

mutex.lock();
if ( not available )
{
    unlock_wait_lock();
}

producer :

mutex.lock();
make available
signal_waiters();
mutex.unlock();

which prevents the race because you hold the mutex in the consumer across the condition check and the decision to go into the wait.

waitset

Any simple waitset can be used in this scenario with a double-checked wait. For example a trivial waitset based on Event is like this :


waitset.prepare_wait :
    add current thread's Event to list of waiters

waitset.wait :
    WaitForSingleObject(my Event)

waitset.signal_waiters :
    signal all events in list of waiters

for instance, "waitset" could be a vector of handles with a mutex protecting access to that vector. This would be a race without the prepare_wait and double checking.

In this case we ensure the double-checked semantics works because the current thread is actually added to the waitset in prepare_wait. So any signal that happens before we get into wait() will set our Event, and our wait() will not actually block us, because the event is already set.

eventcount

Thomasson's eventcount accomplishes the same thing but in a different way. A simplified version of it works like this :


eventcount.prepare_wait :
    return key = m_count

eventcount.wait :
    if ( key == m_count )
        Wait(event)

eventcount.signal_waiters :
    m_count++;
    signal event;

(note : event is a single shared broadcast event here)

in this case, prepare_wait doesn't actually add you to the waitset, so signals don't go to you, but it still works, because if signal was called in the gap, the count will increase and no longer match your key, so you will not do the wait.

That is, it specifically detects the race - it sees "was there a signal between when I did prepare_wait and wait?" , and if so, it doesn't go into the wait. The consumer should loop, so you keep trying to enter the wait until you get to check your condition without a signal firing.

futex

It just occurred to me yesterday that futex is actually another solution to this exact same problem. You may recall - futex does an internal check of your pointer against a value, and only goes into the wait if the value matches.

producer/consumer with futex is like this :


consumer :

if ( value = not_available )
{
    futex_wait(&value,not_available);
}

producer :

value = available
futex_signal(&value);

this may look like just a single wait at a glance, but if we blow out what futex_wait is doing :

consumer :

if ( value == not_available )
{
    //futex_wait(&value,not_available);

    futex_prepare_wait(&value);
    if ( value == not_available )
        futex_commit_wait(&value);
    else
        futex_cancel_wait(&value);
}

producer :

value = available
futex_signal(&value);

we see can clearly see that futex is just double-checked-wait in disguise.

That is, futex is really our beloved prepare_wait -> wait pattern , but only for the case that the wait condition is of the form *ptr == something.


Do we like the futex API? Not really. I mean it's nice that the OS provides it, but if you are designing your own waitset you would never make the API like that. It confines you to only working on single ints, and your condition has to be int == value. A two-call API like "prepare_wait / wait" is much more flexible, it lets you check conditions like "is this lockfree queue empty" which are impossible to do with futex (what you wind up doing is just doing the double-check yourself and use futex just as an "Event", either that or duplicating the condition into an int for futex's benefit (but that is risky, it can race if not done right, so not recommended)).

BTW some of the later extensions of futex are very cool, like bitset waiting and requeue.

08-01-11 - A game threading model

Some random ideas.

There is no "main" thread at all, just a lot of jobs. (there is a "main job" in a sense, that runs once a frame kicks off the other jobs needed to complete that frame)

Run 1 worker thread per core; all workers just run "jobs", they are all interchangeable. This is a big advantage for many reasons; for example if one worker gets swapped out (or some outside process takes over that CPU), the other workers just take over for it, there is never a stall on a specific thread that is swapped out. You don't have to switch threads just to run some job, you can run it directly on yourself. (caveat : one issue is the lost worker problem which we have mentioned before and needs more attention).

You also need 1 thread per external device that can stall (eg. disk IO, GPU IO). If the API's to these calls were really designed well for threading this would not be necessary - we need a thread per device simply to wrap the bad API's and provide a clean one out to the workers. What makes a clean API? All device IO needs to just be enqueue'd immediately and then provide a handle that you can query for results or completion. Unfortunately real world device IO calls can stall the calling thread for a long time in unpredictable ways, so they are not truly async on almost any platform. These threads should be high priority, do almost no CPU work, and basically just act like interrupts.

A big issue is how you manage locking game objects. I think the simplest thing conceptually is to do the locking at "game object" granularity, that may not be ideal for performance but it's the easiest way for people to get it right.

You clearly want some kind of reader/writer lock because most objects are read many more times than they are written. In the ideal situation, each object only updates itself (it may read other objects but only writes itself), and you have full parallelism. That's not always possible, you have to handle cross-object updates and loops; eg. A writes A and also writes B , B writes B and also writes A ; the case that can cause deadlock in a naive system.

So, all game objects are referenced through a weak-reference opaque handle. To read one you do something like :

    const Object * rdlock(ObjectHandle h)
and then rely on C's const system to try to ensure that people aren't writing to objects they only have read-locked (yes, I know const is not ideal, but if you make it a part of your system and enforce it through coding convention I think this is probably okay).

In implementation rdlock internally increments a ref on that copy of the object so that the version I'm reading sticks around even if a new version is swapped in by wrlock.

There are various ways to implement write-lock. In all cases I make wrlock take a local copy of the object and return you the pointer to that. That way rdlocks can continue without blocking, they just get the old state. (I assume it's okay for reads to get one-frame-old data) (see note *). wrunlock always just exchanges in the local object copy into the table. rdlocks that were already in progress still hold a ref to the old data, but subsequent rdlocks and wrlocks will get the new data.

One idea is like this : Basically semi-transactional. You want to build up a transaction then commit it. Game object update looks something like this :

    Transaction t;
    vector<ObjectHandle> objects_needed;
    objects_needed = self; 
    for(;;)
    {
        wrlock on all objects_needed;

        .. do your update code ..
        .. update code might find it needs to write another object, then do :

        add new_object to objects_needed
        if ( ! try_wrlock( new_object ) )
            continue; // aborts the current update and will restart with new_object in the objects_needed set

        wrunlock all objects locked
        if ( unlocks committed )
            break; // update done
    }

(in actual C++ implementation the "continue" should be a "throw" , and the for(;;) should be try/catch , because the failed lock could happen down inside some other function; also the throw could tell you what lock caused the exception).

There's two sort of variants here that I believe both work, I'm not sure what the tradeoffs are :

1. More mutex like. wrlock is exclusive, only one thread can lock an object at a time. wrunlock at the end of the update always proceeds unconditionally - if you got the locks you know you can just unlock them all, no problem. The issues is deadlock for different lock orders, we handle that with the try_lock, we abort all the locks and go back to the start of the update and retake the locks in a standardized order.

2. More transaction like. wrlock always proceeds without blocking, multiple threads can hold wrlock at the same time. When you wrunlock you check to see that all the objects have the same revision number as when you did the wrlock, and if not then it means some other commit has come in while you were running, so you abort the unlock and retry. So there's no abort/retry at lock time, it's now at unlock time.

In this simplistic approach I believe that #1 is always better. However, #2 could be better if it checked to see if the object was not actually changed (if it's a common case to take a wrlock because you thought you needed it, but then not actually modify the object).

Note that in both cases it helps to separate a game object's mutable portion from its "definition". eg. the things about it that will never change (maybe its mesh, some AI attributes, etc.) should be held to the side somehow and not participate in the wrlock mechanism. This is easy to do if you're willing to accept another pointer chase, harder to do if you want it to just be different portions of the same continuous memory block.

Another issue with this is if the game object update needs to fire off things that are not strictly in the game object transaction system. For example, say it wants to start a Job to do some path finding or something. You can't fire that right away because the transaction might get aborted. So instead you put it in the "Transation t" thing to delay it until the end of your update, and only if your unlocks succeed then the jobs and such can get run.

(* = I believe it's okay to read one frame old data. Note that in a normal sequential game object update loop, where you just do :


for each object
    object->update();

each object is reading a mix of old and new data; if it reads an item in the list before itself, it reads new data, if it reads an item after itself, it reads old data; thus whether it gets old or new data is a "race" anyway, and your game must be okay with that. Any time you absolutely must read the most recent data you can always do a wrlock instead of a rdlock ;

You can also address this in the normal way we do in games, which is separate objects in a few groups and update them in chunks like "phase 1", then "phase2" ,etc. ; objects that are all within the same phase can't rely on their temporal order, but objects in a later phase do know that they see the latest version of the earlier phase. This is the standard way to make sure you don't have one-frame-latency issues.

*).

The big issue with all this is how to ensure that you are writing correct code. The rules are :

1. rdlock returns a const * ; never cast away const

2. game object updates must only mutate data in game objects - they must not mutate global state or anything outside of the limitted transaction system. This is hard to enforce; one way might be to make it absolutely clear with a function name convention which functions are okay to call from inside object updates and which are not.

For checking this, you could set a TLS flag like "in_go_update" when you are in the for {} loop, then functions that you know are not safe in the GO loop can just do ASSERT( ! in_go_update ); which provides a nice bit of safety.

3. anything you want to do in game object update which is not just mutating some GO variables needs to be put into the Transaction buffer so it can be delayed until the commit goes through. Delayed transaction stuff cannot fail; eg. it doesn't get to participate in the retry/abort, so it must not require multiple mutexes that could deadlock. eg. they should pretty much always just be Job creations or destructions that are just pushes/pops from queues.

Another issue that I haven't touched on is the issue of dependencies. A GO update could be dependent on another GO or on a Job completion. You could use the freedom of the scheduling order to reschedule GOs whose dependencies aren't done for later in the tick, rather than stalling.

old rants