10/01/2011

10-01-11 - String Match Results Part 6

You knew that couldn't be the end.

SuffixArray3 : suffix array string matcher which uses a min/max tree to find allowed offsets.

The min/max tree is a binary hierarchy ; at level L there are (size>>L) entries, and each entry covers a range of size (1 << L). Construction is O(N) because N/2+N/4+N/8 ... = N

The min/max tree method is generally slightly slower than the elegant "chain of fences" approach used for SuffixArray2, but it's close. The big advantage is the min/max tree can also be used for windowed matching, which is not easy to integrate in SA2.

(ADDENDUM : this is super unclear, see more at end)

First check that it satisfies the O(N) goal on the stress tests :

0 = stress_all_as
1 = stress_many_matches
2 = stress_search_limit
3 = stress_sliding_follow
4 = stress_suffix_forward
5 = twobooks

Yep. Then check optimal parse, large window vs. the good options :

0 = ares_merged.gr2_sec_5.dat
1 = bigship1.x
2 = BOOK1
3 = combined.gr2_sec_3.dat
4 = GrannyRocks_wot.gr2
5 = Gryphon_stripped.gr2
6 = hetero50k
7 = lzt20
8 = lzt21
9 = lzt22
10 = lzt23
11 = lzt24
12 = lzt25
13 = lzt27
14 = lzt28
15 = lzt29
16 = movie_headers.bin
17 = orange_purple.BMP
18 = predsave.bin

The good Suffix Trie is clearly the best, but we're in the ballpark.

Now optimal parse, 17 bit (128k) window :


totals:
Test_MMC2 : DNF 
Test_LzFind2 : 506.954418 , 1355.992865 
Test_SuffixArray3 : 506.954418 , 514.931740 
Test_MMC1 : 13.095260 , 1298.507490 
Test_LzFind1 : 12.674177 , 226.796123 
Test_Hash1 : 503.319301 , 1094.570022 

Finally greedy parse, 17 bit window :


totals:
Test_MMC2 : 0.663028 , 110.373098 
Test_LzFind2 : DNF 
Test_SuffixArray3 : 0.663036 , 236.896551 
Test_MMC1 : 0.663028 , 222.626069 
Test_LzFind1 : 0.662929 , 216.912409 
Test_Hash1 : 0.662718 , 62.385071 

average match length :

And once more for clarity :

Greedy parse, 16 bit window , just the good candidates :

totals:
Test_SuffixArray3 : 0.630772 , 239.280605 
Test_LzFind1 : 0.630688 , 181.430093 
Test_MMC2 : 0.630765 , 88.413339 
Test_Hash1 : 0.630246 , 51.980073 

It should be noted that LzFind1 is approximate, and Hash1 is even more approximate. Though looking at the match length chart you certainly can't see it.

ADDENDUM : an email I wrote trying to explain this better :


First a reminder of how the normal suffix array searcher works :

We're at some file position F
We look up sortIndex[F] to find our sort position S
we know our longest matches must be at sort positions S-1 or S+1
we step to our neighbors

The problem with windowed matching is our neighbors may be way out of the window

So what we ant is a way to step more than 1 away to find the closest neighbor in the sort order that is within our desired window.

So really all we are doing is adding another search structure on top of the suffix array.  It's a search structure to go in either direction from S and find the closest spot to S that has file position in the range [F-window,F-1] 

To do this I just build a tree.  It's indexed by sort position, and its content is file positions.

Level L of the tree contains nodes that cover an interval of size 1<<L 

That is, level 0 has intervals of size 1, that's just

Tree_0[S] = sortIndexInverse[S] 
  (eg. it's just the file position of that sort pos)
  (in fact we don't make this level of the tree, we just use sortIndexInverse which we already have)

Tree_1[i] covers steps of size 2, that is :
  file positions sortIndexInverse[i*2] and sortIndexInverse[i*2+1]
  I store the min and max of that

Tree_2[i] covers steps of size 4, that is min and max of Tree_1[i*2] and Tree_1[i*2+1]

Now once you have this binary interval tree you can do the normal kind of binary tree walk to find the closest neighbor that's in the range you want.

Also see the following blog on how I walk binary interval trees and the released source code for StringMatchTest contains the full code for this matcher.

10-01-11 - Seattle Stop Shitting on my Face

I've been thinking about upgrading my neoprene gear so that I can swim in the lake through the fall/spring, not just the summer.

I fucking hate swimming laps in a pool during official lap swim hours, with all your "rules" and your system keeping me down. And it just doesn't make sense to go into some nasty indoor crowded box when I'm literally surrounded by miles of beautiful open water.

But there's a bit problem with this idea. Seattle is shitting on my face.

The fall/spring, when a wet suit would help, is when it rains. When it rains, the sewers overflow and drain into the lake. Then you get itchy bumps and vomiting and so on.

Seattle is basically doing nothing about it. There are some little programs to do "rain gardens", but those are sort of like using a tampon to stop elephant piss. What we need is serious fucking civil engineering. ("rain gardens" are also better known as "mosquito breeders"; we Houstonians are always amazed and delighted about the lack of mosquitos here; it's because Seattle is hilly and surrounded by lakes so the water doesn't pool, but the city is doing their best to ruin that). (a much bigger impact than piddly residential rain gardens would be to outlaw concrete parking lots; grass/gravel/lattice parking lots work perfectly fine for holding cars).

Of course this is a problem that is occuring all over the US. I hear NY has a major sewer problem as well. The population and development of most US cities has outstripped their infrastructure, and in our shitty faux-libertarian plutocracy of course there's no money for basic civil engineering. Only the heavy hand of EPA orders is forcing these dumb ass local governments to do anything.

The real solution is something like :

1. Separate the sewage and rain runoff. Run sewage to treatment plants in a closed system so it can never get out. (probably not realistic; alternatively, add a new clean storm water only system)

2. Since you will allow non-sewer storm water to drain to the lake, make the pollutants that run off to the lakes illegal. Fertilizers, pesticides, etc. everything that's water soluble and washes into the lakes is illegal right fucking now.

10-01-11 - More Reliable Timing on Windows

When profiling little code chunks on Windows, one of the constant annoyances is the unreliability of times due to multithreading.

Historically the way you address this is run lots of trials (like 100) and take the MIN time of any trial.

(* important note : if you aren't trying to time "hot cache" performance, you need to wipe the cache between each run. I dunno if there's an easy instruction or system call that would invalidate all cache pages; what I usually do is have a routine that goes and munges over some big array).

It's a bit better these days because of many cores. Now you can quite often find a core which is unmolested by annoying services popping up and stealing CPU time and messing up your profile. But sometimes you get unlucky, and your process runs on an IdealProc that has some other shite.

So a simple loop helps :


template <typename t_func>
uint64 GetFuncTime( t_func * pfunc )
{
    HANDLE proc = GetCurrentProcess();
    HANDLE thread = GetCurrentThread();
    
    DWORD_PTR affProc,affSys;
    GetProcessAffinityMask(proc,&affProc,&affSys);
    
    uint64 tick_range = 1ULL << 62;
    
    for(int rep=0;rep<24;rep++)
    {
        DWORD mask = 1UL<<rep;
        if ( mask & affProc )
            SetThreadAffinityMask(thread,mask);
        else
            continue;   

        uint64 t1 = __rdtsc();
        (*pfunc)();
        uint64 t2 = __rdtsc();

        uint64 cur_tick_range = t2 - t1;
        tick_range = MIN(tick_range,cur_tick_range);

    }

    SetThreadAffinityMask(thread,0xFFFFFFFFUL);

    return tick_range;
}

which makes it reasonably probable that you get a clean run on some core. For published results you will still want to repeat the whole thing N times.

9/30/2011

09-30-11 - String Match Results Part 5 + Conclusion

Finally for completeness, some of the matchers from Tornado in FreeArc. These are basically all standard "cache table" style matchers, originally due to LZRW, made popular by LZP and LZO. The various Tornado settings select different amounts of hash rows and ways.

As they should, they have very constant time operation that goes up pretty steadily from Tornado -3 to -7, because there's a constant number of hash probes per match attempt.


totals : match len : clocks
Test_MMC1 : 0.663028 , 231.254818 
Test_Hash1 : 0.662718 , 64.888003 
Test_Tornado_3 : 0.630377 , 19.658834 
Test_Tornado_4 : 0.593174 , 28.456055 
Test_Tornado_5 : 0.586540 , 40.546146 
Test_Tornado_6 : 0.580042 , 56.841156 
Test_Tornado_7 : 0.596584 , 141.432393 

There may be something wrong with my Tornado wrapper as the -3 matcher actually finds the longest total length. I dunno. The speeds look reasonable. I don't really care much about these approximate matchers because the loss is hard to quantify, so there you go (normally when I see an anomaly like that I would investigate it to make sure I understand why it's happening).

0 = ares_merged.gr2_sec_5.dat
1 = bigship1.x
2 = BOOK1
3 = combined.gr2_sec_3.dat
4 = GrannyRocks_wot.gr2
5 = Gryphon_stripped.gr2
6 = hetero50k
7 = lzt20
8 = lzt21
9 = lzt22
10 = lzt23
11 = lzt24
12 = lzt25
13 = lzt27
14 = lzt28
15 = lzt29
16 = movie_headers.bin
17 = orange_purple.BMP
18 = predsave.bin


Conclusion : I've got to get off string matching so this is probably the end of posts on this topic.

MMC looks promising but has some flaws. There are some cases that trigger a slowness spike in it. Also it has some bad O(N^2) with unbounded match length ("MMC2") so I have to run it with a limit ("MMC1") which removes some of its advantage over LzFind and Hash1 and other approximate matchers. (without the limit it has the advantage of being exact). It's also a GPL at the moment which is a killer.

LzFind doesn't have anything going for it really.

For approximate/small-window matching I don't see any reason to not use the classic Zip hash chain method. I tried a few variants of this, like doing a hash chain to match the first 4 bytes and then link listing off that, and all the variants were worse than the classic way.

For large window / exact matching / optimal parsing, a correct O(N) matcher is the way to go. The suffix-array based matcher is by far the easiest for your kids to implement at home.

09-30-11 - String Match Results Part 4

Okay, finally on to greedy parsing. Note with greedy parsing the average match length per byte is always <= 1.0 (it's actually the % of bytes matched in this case).

Two charts for each , the first is clocks per byte, the second is average match length. Note that Suffix5 is just for reference and is neither windowed nor greedy.

got arg : window_bits = 16

got arg : window_bits = 17

got arg : window_bits = 18

Commentary :

Okay, finally MMC beats Suffix Trie and LzFind, this is what it's good at. Both MMC and LzFind get slower as the window gets larger. Surprisingly, the good old Zip-style Hash1 is significantly faster and finds almost all the matches on these files. (note that LzFind1 and Hash1 both have search limits but MMC does not)


test set :

0 = ares_merged.gr2_sec_5.dat
1 = bigship1.x
2 = BOOK1
3 = combined.gr2_sec_3.dat
4 = GrannyRocks_wot.gr2
5 = Gryphon_stripped.gr2
6 = hetero50k
7 = lzt20
8 = lzt21
9 = lzt22
10 = lzt23
11 = lzt24
12 = lzt25
13 = lzt27
14 = lzt28
15 = lzt29
16 = movie_headers.bin
17 = orange_purple.BMP
18 = predsave.bin


The same matchers ; greedy, 16 bit window, on the stress tests :

LzFind does not do well at all on the stress tests. (note that LzFind1 and MMC1 are length-limitted; LzFind1 and Hash1 are "amortized" (step limitted)).

0 = stress_all_as 
1 = stress_many_matches 
2 = stress_search_limit 
3 = stress_sliding_follow 
4 = stress_suffix_forward 
5 = twobooks

09-30-11 - String Match Results Part 3

Still doing "optimal" (non-greedy parsing) but now lets move on to windowed & non-exact matching.

Windowed, possibly approximate matching.

Note : I will include the Suffix matchers for reference, but they are not windowed.

16 bit window :

Clocks per byte :

Average Match len :

This is what LzFind is designed for and it's okay at it. It does crap out pretty badly on the rather degenerate "particles.max" file, and it also fails to find a lot of matches. (LZFind1 has a maximum match length of 256 and a maximum of 32 search steps, which are the defaults in the LZMA code; LzFind2 which we saw before has those limits removed (and would DNF on many of these files)).

lztest is :

0 = almost_incompressable
1 = bigship1.x
2 = Dolphin1.x
3 = GrannyRocks_wot.gr2
4 = Gryphon_stripped.gr2
5 = hetero50k
6 = movie_headers.bin
7 = orange_purple.BMP
8 = particles.max
9 = pixo_run_animonly_stripped.gr2
10 = poker.bin
11 = predsave.bin
12 = quick.exe
13 = RemoteControl_stripped.gr2
14 = ScriptVolumeMgr.cpp

09-30-11 - String Match Results Part 2b

Still on optimal parsing, exact matching, large window :

Chart of clocks per byte, on each file of a test set :

On my "lztest" data set :


0 = almost_incompressable
1 = bigship1.x
2 = Dolphin1.x
3 = GrannyRocks_wot.gr2
4 = Gryphon_stripped.gr2
5 = hetero50k
6 = movie_headers.bin
7 = orange_purple.BMP
8 = particles.max
9 = pixo_run_animonly_stripped.gr2
10 = poker.bin
11 = predsave.bin
12 = quick.exe
13 = RemoteControl_stripped.gr2
14 = ScriptVolumeMgr.cpp

"lztest" is not a stress test set, it's stuff I've gathered that I think is roughly reflective of what games actually compress. It's interesting that this data set causes lots of DNF's (did not finish) for MMC and LzFind.

Suffix5 (the real suffix trie) is generally slightly faster than the suffix array. It should be, of course, if I didn't do a bonehead trie implementation, since the suffix array method basically builds a trie in the sort, then reads it out to sorted indexes, and then I convert the sorted indexes back to match lengths.

Good old CCC (Calgary Compression Corpus) :


0 = BIB
1 = BOOK1
2 = BOOK2
3 = GEO
4 = NEWS
5 = OBJ1
6 = OBJ2
7 = PAPER1
8 = PAPER2
9 = PAPER3
10 = PAPER4
11 = PAPER5
12 = PAPER6
13 = PIC
14 = PROGC
15 = PROGL
16 = PROGP
17 = TRANS

I won't be showing results on CCC for the most part because it's not very reflective of real world modern data, but I wanted to run on a set where MMC and LzFind don't DNF too much to compare their speed when they do succeed. Suffix Trie is almost always very close to the fastest except on paper4 & paper5 which are very small files.


0 = BIB
1 = BOOK1
2 = BOOK2
3 = GEO
4 = NEWS
5 = OBJ1
6 = OBJ2
7 = PAPER1
8 = PAPER2
9 = PROGC
10 = PROGL
11 = PROGP
12 = TRANS

Two new tests in the mix.

Test_Hash1 : traditional "Zip" style fixed size hash -> linked list. In this run there's no chain limit so matching is exact.

Test_Hash3 : uses cblib::hash_table (a reprobing ("open addressing" or "closed hashing", I prefer reprobing)) to hash the first 4 bytes then a linked list. I was surprised to find that this is almost the same speed as Hash1 and sometimes faster, even though it's a totally generic template hash table (that is not particularly well suited to this usage).

09-30-11 - String Match Results Part 2

The first and simplest set of results are the ones where non-O(N) algorithms make themselves known.

Optimal parsing, large window.

The candidates are :

SuffixArray1 : naive matcher built on divsufsort
SuffixArray2 : using the forward-offset elimination from this post
Suffix2 : Suffix Trie with follows and path compression but missing the bits in this post
Suffix3 : Suffix Trie without follow
Suffix5 : fully working Suffix Trie
MMC1 : MMC with max match length of 256
MMC2 : MMC with no limit
LzFind1 : LzFind (LZMA HC4 - binary tree) with max ML of 256 and step limit of 32
LzFind2 : LzFind with no max ML or step limit

Note : LzFind was modified from LZMA to not record all matches, just the longest, to make it more like the competitors. MMC was modified to make window size a variable.

In all cases I show :


Test_Matcher : average match length per byte , average clocks per byte

And with no further ado :

got arg : window_bits = 24
working on : m:\test_data\lz_stress_tests

loading file : m:\test_data\lz_stress_tests\stress_many_matches
Test_SuffixArray1 : 32.757760 , 164.087948 
Test_SuffixArray2 : 32.756953 , 199.878476 
Test_Suffix2 : 32.757760 , 115.846130 
Test_Suffix3 : 31.628279 , 499.722569 
Test_Suffix5 : 32.757760 , 184.172167 
Test_MMC2 : 32.757760 , 1507.818166 
Test_LzFind2 : 32.757760 , 576.154370 

loading file : m:\test_data\lz_stress_tests\stress_search_limit
Test_SuffixArray1 : 823.341331 , 182.789064 
Test_SuffixArray2 : 823.341331 , 243.492241 
Test_Suffix2 : 823.341331 , 393.930504 
Test_Suffix3 : 807.648294 , 2082.447274 
Test_Suffix5 : 823.341331 , 91.699276 
Test_MMC2 : 823.341331 , 6346.400206 
Test_LzFind2 : 823.341331 , 1807.516994 

loading file : m:\test_data\lz_stress_tests\stress_sliding_follow
Test_SuffixArray1 : 199.576550 , 189.029462 
Test_SuffixArray2 : 199.573198 , 220.316868 
Test_Suffix2 : 199.576550 , 95.225780 
Test_Suffix3 : 198.967622 , 2110.521111 
Test_Suffix5 : 199.576550 , 106.019526 
Test_MMC2 : 199.576550 , 36571.382020 
Test_LzFind2 : 199.576550 , 1249.184412 

loading file : m:\test_data\lz_stress_tests\stress_suffix_forward
Test_SuffixArray1 : 5199.164464 , 6138.802402 
Test_SuffixArray2 : 5199.164401 , 213.675569 
Test_Suffix2 : 5199.164464 , 12901.429712 
Test_Suffix3 : 5199.075953 , 32152.812339 
Test_Suffix5 : 5199.164464 , 145.684678 
Test_MMC2 : 5199.016562 , 6652.666440 
Test_LzFind2 : 5199.164464 , 11739.369336 

loading file : m:\test_data\lz_stress_tests\stress_all_as
Test_SuffixArray1 : 21119.499148 , 40938.612689 
Test_SuffixArray2 : 21119.499148 , 127.520147 
Test_Suffix2 : 21119.499148 , 88178.094886 
Test_Suffix3 : 21119.499148 , 104833.677202 
Test_Suffix5 : 21119.499148 , 119.676823 
Test_MMC2 : 21119.499148 , 25951.480871 
Test_LzFind2 : 21119.499148 , 38581.431558 

loading file : m:\test_data\lz_stress_tests\twobooks
Test_SuffixArray1 : 192196.571348 , 412.356092 
Test_SuffixArray2 : 192196.571348 , 420.437773 
Test_Suffix2 : 192196.571348 , 268.524287 
Test_Suffix3 : DNF
Test_Suffix5 : 192196.571348 , 292.777726 
Test_MMC2 : DNF 
Test_LzFind2 : DNF 

(DNF = Did Not Finish = over 100k clocks per byte).

Conclusion : SuffixArray2 and Suffix5 both actually work and are correct with no blowup cases.

SuffixArray1 looks good on the majority of files (and is slightly faster than SuffixArray2 on those files), but "stress_suffix_forward" clearly calls it out and shows the break down case.

Suffix2 almost works except on the degenerate tests due to failure to get some details of follows quite right ( see here ).

Suffix3 just shows that a Suffix Trie without follows is some foolishness.

We won't show SuffixArray1 or Suffix2 or Suffix3 again.

MMC2 and LZFind2 both have bad failure cases. Both are simply not usable if you want to find the longest match at every byte. We will revisit them later in other usages though and see that they are good for what they're designed for.

I've not included any of the hash chain type matchers in this test because they all obviously crap their pants in this scenario.

09-30-11 - String Match Results Part 1

I was hoping to make some charts and graphs, but it's just not that interesting. Anyhoo, let's get into it.

What am I testing? String matching for an LZ-type compressor. Matches must start before current pos but can run past current pos. I'm string matching only, not compressing. I'm counting the total time and total length of matches found.

I'm testing match length >= 4. Matches of length 2 & 3 can be found trivially by table lookup (though on small files this is not a good way to do it). Most of the matchers can handle arbitrary min lengths, but this is just easier/fairer for comparison.

I'm testing both "greedy" (when you find a match step ahead its length) and "optimal" (find matches at every position). Some matchers like the suffix tree ones don't really support greedy parsing, since they have to do all the work at every position even if you don't want the match there.

I'm testing windowed and non-windowed matchers.

I'm testing approximate and non-approximate (exact) matchers. Exact matchers find all matches possible, approximate matchers find some amount less. I'm not sure the best way to show the approximation vs. speed trade off. I guess you want a "pareto frontier" type of graph, but what should the axes be?

Also, while I'm at it, god damn it!

MAKE YOUR CODE FREE PEOPLE !!

(and GPL is not free). And some complicated personal license is a pain in the ass. I used to do this myself, I know it's tempting. Don't fucking do it. If you post code just make it 100% free for all uses. BSD license is an okay choice.

Matchers I'm having trouble with :


Tornado matchers from FreeArc - seem to be GPL (?)

MMC - GPL

LzFind from 7zip appears to be public domain. divsufsort is free. Larsson's slide is free.

09-30-11 - Don't use memset to zero

A very common missed optimization is letting the OS zero large chunks of memory for you.

Everybody just writes code like this :


U32 * bigTable = malloc(20<<20);
memset(bigTable,0,20<<20);

but that's a huge waste. (eg. for large hash table on small files the memset can dominate your time).

Behind your back, the operating system is actually running a thread all the time as part of the System Idle Process which grabs free pages and writes them with zero bytes and puts them on the zero'ed page list.

When you call VirtualAlloc, it just grabs a page from the zeroed page list and hands it to you. (if there are none available it zeroes it immediately).

!!! Memory you get back from VirtualAlloc is always already zeroed ; you don't need to memset it !!!

The OS does this for security, so you can never see some other app's bytes, but you can also use it to get zero'ed tables quickly.

(I'm not sure if any stdlib has a fast path to this for "calloc" ; if so that might be a reason to prefer that to malloc/memset; in any case it's safer just to talk to the OS directly).

ADDENDUM : BTW to be fair none of my string matchers do this, because other people's don't and I don't want to win from cheap technicalities like that. But all string match hash tables should use this.

old rants