10/12/2011

10-12-11 - Post-backpacking Recalibration

1. 65 degree house is fucking blazing hot at night.

2. Of course I'll walk a mile in the rain to get lunch. No big deal.

3. Everything is fucking DELICIOUS! Jar of spaghetti sauce, om nom nom. Beef stew yum.

4. None of the usual hesitation in sitting on the semi-shit-covered office toilet after having had to sit on the very-shit-covered wilderness toilet.

10/06/2011

10-06-11 - Fiberglass

Don't put this shit in your house. It's toxic, it's poison, it's the modern day asbestos. There are plenty of alternatives.

Even in the attic or walls, sure it's sealed up most of the time, but any time you have to go in there to work you stir it up, then you get glass shards in the air which get in your eyes and lungs, so you have to wear safety suits and respirators and so on just to do basic maintenace work.

If anything ever goes wrong with it, it's a nightmare to dispose of.

Worst of all is using it to wrap ducts. The problem is that all ducts will leak eventually. Maybe not right away, but in 10 years cracks will form. Then the air return ducts will start sucking in at the seams, and eventually that will be sucking glass fibers in, then blowing them out all over the house.

Just say no to toxic shit in your home.

10/03/2011

10-03-11 - Amortized Hashing

The "hash1" simple Zip-style matcher was very fast. So why don't we love it?

The problem is amortized hashing. "Amortized" = we just stop walking after A steps. This limits the worst case. Technically it makes hash1 O(N) (or O(A*N)).

Without it, hash1 has disastrous worst cases, which makes it just not viable. The problem is that the "amortize" can hurt compression quite a bit and unpredictably so.

One perhaps surprising thing is that the value of A (the walk limit) doesn't actually matter that much. It can be 32 or 256 and either way you will save your speed from the cliff. Surpisingly even an A of 1024 on a 128k window helps immensely.


not "amortized" :

 file,   encode,   decode, out size
lzt00,  225.556,   35.223,     4980
lzt01,  143.144,    1.925,   198288
lzt02,  202.040,    3.067,   227799
lzt03,  272.027,    2.164,  1764774
lzt04,  177.060,    5.454,    11218
lzt05,  756.387,    3.940,   383035
lzt06,  227.348,    6.078,   423591
lzt07,  624.699,    4.179,   219707
lzt08,  311.441,    7.329,   261242
lzt09,  302.741,    3.746,   299418
lzt10,  777.609,    1.647,    10981
lzt11, 1073.232,    4.999,    19962
lzt12, 3250.114,    3.134,    25997
lzt13,  101.577,    5.644,   978493
lzt14,  278.619,    6.026,    47540
lzt15, 1379.194,    9.396,    10911
lzt16,  148.908,   12.558,    10053
lzt17,  135.530,    5.693,    18517
lzt18,  171.413,    6.003,    68028
lzt19,  540.656,    3.433,   300354
lzt20,  109.678,    5.488,   900001
lzt21,  155.648,    3.605,   147000
lzt22,  118.776,    6.671,   290907
lzt23,  103.056,    6.350,   822619
lzt24,  218.596,    4.439,  2110882
lzt25,  266.006,    2.498,   123068
lzt26,  118.093,    7.062,   209321
lzt27,  913.469,    4.340,   250911
lzt28,  627.070,    2.576,   322822
lzt29, 1237.463,    4.090,  1757619
lzt30,   75.217,    0.646,   100001

"amortized" to 128 steps :

 file,   encode,   decode, out size
lzt00,  216.417,   30.567,     4978
lzt01,   99.315,    1.620,   198288
lzt02,   85.209,    3.556,   227816
lzt03,   79.299,    2.189,  1767316
lzt04,   90.983,    7.073,    12071
lzt05,   86.225,    4.715,   382841
lzt06,   91.544,    6.930,   423629
lzt07,  127.232,    4.502,   220087
lzt08,  161.590,    7.725,   261366
lzt09,  119.749,    4.696,   301442
lzt10,   55.662,    1.980,    11165
lzt11,  108.619,    6.072,    19978
lzt12,  112.264,    3.119,    26977
lzt13,  103.460,    6.215,   978493
lzt14,   87.520,    5.529,    47558
lzt15,   98.902,    7.568,    10934
lzt16,   90.138,   12.503,    10061
lzt17,  115.166,    6.016,    18550
lzt18,  176.121,    5.402,    68035
lzt19,  272.349,    3.310,   304212
lzt20,  107.739,    5.589,   900016
lzt21,   68.255,    3.568,   147058
lzt22,  108.045,    5.867,   290954
lzt23,  108.023,    6.701,   822619
lzt24,   78.380,    4.631,  2112700
lzt25,   93.013,    2.554,   123219
lzt26,  108.348,    6.143,   209321
lzt27,  103.226,    3.468,   249081
lzt28,  145.280,    2.658,   324569
lzt29,  199.174,    4.063,  1751916
lzt30,   75.093,    1.019,   100001

The times are in clocks per byte. In particular let's look at some files that are really slow without amortize :

no amortize :

 file,   encode,   decode, out size
lzt11, 1073.232,    4.999,    19962
lzt12, 3250.114,    3.134,    25997
lzt15, 1379.194,    9.396,    10911
lzt27,  913.469,    4.340,   250911
lzt29, 1237.463,    4.090,  1757619

amortize 128 :

 file,   encode,   decode, out size
lzt11,  108.619,    6.072,    19978
lzt12,  112.264,    3.119,    26977
lzt15,   98.902,    7.568,    10934
lzt27,  103.226,    3.468,   249081
lzt29,  199.174,    4.063,  1751916

Massive speed differences. The funny thing is the only file where the compression ratio is drastically changes is lzt12. It's changed by around 4%. (lzt29 is a bigger absolute difference but only 0.34%)

So amortized hashing saved us massive time, and only cost us 4% on one file in the test case. Let me summarize the cases. There are three main classes of file :

1. 80% : Not really affected at all by amortize or not. These files don't have lots of degeneracy so there aren't a ton of links in the same hash bucket.

2. 15% : Very slow without amortize, but compression not really affected. These files have some kind of degeneracy (such as long runs of one character) but the most recent occurances of that hash are the good ones, so the amortize doesn't hurt compression.

3. 5% : Has lots of hash collisions, and we do need the long offsets to make good matches. This case is rare.

So obviously it should occur to us that some kind of "introspective" algorithm could be used. Somehow monitor what's happening and adjust your amortize limit (it doesn't need to be a constant) or switch to another algorithm for part of the hash table.

The problem is that we can't tell between classes 2 and 3 without running the slow compressor. That is, it's easy to tell if you have a lot of long hash chains, but you can't tell if they are needed for good compression or not without running the slow mode of the compressor.

10/02/2011

10-02-11 - How to walk binary interval tree

I noted previously that SA3 uses a binary interval tree. It's obvious how that works but it always takes me a second to figure it out so let's write it down.

This is going to be all very similar to previous notes on cumulative probability trees (and followup ).

Say you have an array of 256 entries. You build a min tree. Each entry in the tree stores the minimum value over an interval. The top entry covers the range [0,255] , then you have [0,127][128,255] , etc.

If you're at index i and you want to find the next entry which has a value lower than yours. That is,


find j such that A[j] < A[i] and j > i

you just have to find an interval in the min tree whose min is < A[i]. To make the walk fast you want to step by the largest intervals possible. (once you find the interval that must contain your value you do a normal binary search within that interval)

You can't use the interval that overlaps you, because you only want to look at j > i.

It's easy to do this walk elegantly using the fact that the binary representation of integers is a sort of binary interval tree.

Say we start at i = 37 (00100101). We need to walk from 37 to 256. Obviously we want to use the [128,256) range to make a big step. And the [64,128). We can't use the [32,64) because we're inside that range - this corresponds to the top on bit of 37. We can use [48,64) and [40,48) because those bits are off. We can't use [36,40) but we can use [38,40) (and the bottom on bit corresponds to [37,38) which we are in).

Doing it backwards, you start from whatever index (such as i=37). Find the lowest on bit. That is the interval you can step by (in this case 1). So step by 1 to i=38. Stepping by the lowest bit always acts to clear that bit and push the lowest on bit higher up (sometimes more than 1 level). Now find the next lowest on bit. 38 = 00100110 , so step by 2 to 40 = 00101000 , now step by 8 to 48 = 00110000 , now step by 16 to 64 = 01000000. etc.

In pseudo code :


Start at index i

while ( i < end )
{
  int step = i & (-i); // isolate bottom bit
  // I'm looking at the range [i,i+step]
  int level = BitPos(step);
  check tree[level][i>>level];
  i += step;
}
 
Now this should all be pretty obvious, but here comes the juju.

I've written tree[][] as if it is layed out in the intuitive way, that is :


tree[0] has 256 entries for the 1-step ranges
tree[1] has 128 entries for 2-step ranges
...

The total is 512 entries which is O(N). But notice that tree[0] is only actually ever used for indexes that have the bottom bit on. So the half of them that have the bottom bit off are not needed. Then tree[1] is only used for entries that have the second bit on (but bottom bit off). So the tree[1] entries can actually slot right into the blanks of tree[0], and half the blanks are left over. And so on...

It's a Fenwick tree!

So our revised iteration is :


// searching :
Start at index i
(i starts at 1)

while ( i < end )
{
  int step = i & (-i); // isolate bottom bit
  // I'm looking at the range [i,i+step]
  check fenwick_tree[i];
  i += step;
}

// building :

for all i
{
  int step = i & (-i); // isolate bottom bit
  fenwick_tree[i] = min on range [i,i+step]
}

(in practice you need to build with a binary recursion; eg. level L is built from two entries of level L-1).

Note that to walk backwards you need the opposite entries. That is, at level 7 (steps of 128) you only need [128,256) to walk forward, never [0,128) because a value in that range can't take that step. To walk backwards, you only need [0,128) , never [128,256). So in fact to walk forward or backward you need all the entries. When we made the "Fenwick compaction" for the forward walk, we threw away half the values - those are exactly the values that need to be in the backward tree.

For the three bit case , range [0,8) , the trees are :


Forward :

+-------------------------------+
|              0-8              |
+-------------------------------+
|       ^       |      4-8      |
+---------------+---------------+
|   ^   |  2-4  |   ^   |  6-8  |
+---------------|---------------+
| ^ |1-2| ^ |3-4| ^ |5-6| ^ |7-8| 
+-------------------------------+

where ^ means go up a level to find your step
the bottom row is indexed [0,7] and 8 is off the end on the right
so eg if you start at 5 you do 5->6 then 6->8 and done

Backward :

+-------------------------------+
|              8-0              |
+-------------------------------+
|      4-0      |       ^       |
+---------------+---------------+
|  2-0  |   ^   |  6-4  |   ^   |
+---------------|---------------+
|1-0| ^ |3-2| ^ |5-4| ^ |7-6| ^ | 
+-------------------------------+

the bottom row is indexed [1,8] and 0 is off the end of the left
so eg if you start at 5 you go 5->4 then 4->0 and done

You collapse the indexing Fenwick style on both by pushing the values down the ^ arrows. It should be clear you can take the two tables and lay them on top of each other to get a full set.

10/01/2011

10-01-11 - String Match Results Part 6

You knew that couldn't be the end.

SuffixArray3 : suffix array string matcher which uses a min/max tree to find allowed offsets.

The min/max tree is a binary hierarchy ; at level L there are (size>>L) entries, and each entry covers a range of size (1 << L). Construction is O(N) because N/2+N/4+N/8 ... = N

The min/max tree method is generally slightly slower than the elegant "chain of fences" approach used for SuffixArray2, but it's close. The big advantage is the min/max tree can also be used for windowed matching, which is not easy to integrate in SA2.

(ADDENDUM : this is super unclear, see more at end)

First check that it satisfies the O(N) goal on the stress tests :

0 = stress_all_as
1 = stress_many_matches
2 = stress_search_limit
3 = stress_sliding_follow
4 = stress_suffix_forward
5 = twobooks

Yep. Then check optimal parse, large window vs. the good options :

0 = ares_merged.gr2_sec_5.dat
1 = bigship1.x
2 = BOOK1
3 = combined.gr2_sec_3.dat
4 = GrannyRocks_wot.gr2
5 = Gryphon_stripped.gr2
6 = hetero50k
7 = lzt20
8 = lzt21
9 = lzt22
10 = lzt23
11 = lzt24
12 = lzt25
13 = lzt27
14 = lzt28
15 = lzt29
16 = movie_headers.bin
17 = orange_purple.BMP
18 = predsave.bin

The good Suffix Trie is clearly the best, but we're in the ballpark.

Now optimal parse, 17 bit (128k) window :


totals:
Test_MMC2 : DNF 
Test_LzFind2 : 506.954418 , 1355.992865 
Test_SuffixArray3 : 506.954418 , 514.931740 
Test_MMC1 : 13.095260 , 1298.507490 
Test_LzFind1 : 12.674177 , 226.796123 
Test_Hash1 : 503.319301 , 1094.570022 

Finally greedy parse, 17 bit window :


totals:
Test_MMC2 : 0.663028 , 110.373098 
Test_LzFind2 : DNF 
Test_SuffixArray3 : 0.663036 , 236.896551 
Test_MMC1 : 0.663028 , 222.626069 
Test_LzFind1 : 0.662929 , 216.912409 
Test_Hash1 : 0.662718 , 62.385071 

average match length :

And once more for clarity :

Greedy parse, 16 bit window , just the good candidates :

totals:
Test_SuffixArray3 : 0.630772 , 239.280605 
Test_LzFind1 : 0.630688 , 181.430093 
Test_MMC2 : 0.630765 , 88.413339 
Test_Hash1 : 0.630246 , 51.980073 

It should be noted that LzFind1 is approximate, and Hash1 is even more approximate. Though looking at the match length chart you certainly can't see it.

ADDENDUM : an email I wrote trying to explain this better :


First a reminder of how the normal suffix array searcher works :

We're at some file position F
We look up sortIndex[F] to find our sort position S
we know our longest matches must be at sort positions S-1 or S+1
we step to our neighbors

The problem with windowed matching is our neighbors may be way out of the window

So what we ant is a way to step more than 1 away to find the closest neighbor in the sort order that is within our desired window.

So really all we are doing is adding another search structure on top of the suffix array.  It's a search structure to go in either direction from S and find the closest spot to S that has file position in the range [F-window,F-1] 

To do this I just build a tree.  It's indexed by sort position, and its content is file positions.

Level L of the tree contains nodes that cover an interval of size 1<<L 

That is, level 0 has intervals of size 1, that's just

Tree_0[S] = sortIndexInverse[S] 
  (eg. it's just the file position of that sort pos)
  (in fact we don't make this level of the tree, we just use sortIndexInverse which we already have)

Tree_1[i] covers steps of size 2, that is :
  file positions sortIndexInverse[i*2] and sortIndexInverse[i*2+1]
  I store the min and max of that

Tree_2[i] covers steps of size 4, that is min and max of Tree_1[i*2] and Tree_1[i*2+1]

Now once you have this binary interval tree you can do the normal kind of binary tree walk to find the closest neighbor that's in the range you want.

Also see the following blog on how I walk binary interval trees and the released source code for StringMatchTest contains the full code for this matcher.

10-01-11 - Seattle Stop Shitting on my Face

I've been thinking about upgrading my neoprene gear so that I can swim in the lake through the fall/spring, not just the summer.

I fucking hate swimming laps in a pool during official lap swim hours, with all your "rules" and your system keeping me down. And it just doesn't make sense to go into some nasty indoor crowded box when I'm literally surrounded by miles of beautiful open water.

But there's a bit problem with this idea. Seattle is shitting on my face.

The fall/spring, when a wet suit would help, is when it rains. When it rains, the sewers overflow and drain into the lake. Then you get itchy bumps and vomiting and so on.

Seattle is basically doing nothing about it. There are some little programs to do "rain gardens", but those are sort of like using a tampon to stop elephant piss. What we need is serious fucking civil engineering. ("rain gardens" are also better known as "mosquito breeders"; we Houstonians are always amazed and delighted about the lack of mosquitos here; it's because Seattle is hilly and surrounded by lakes so the water doesn't pool, but the city is doing their best to ruin that). (a much bigger impact than piddly residential rain gardens would be to outlaw concrete parking lots; grass/gravel/lattice parking lots work perfectly fine for holding cars).

Of course this is a problem that is occuring all over the US. I hear NY has a major sewer problem as well. The population and development of most US cities has outstripped their infrastructure, and in our shitty faux-libertarian plutocracy of course there's no money for basic civil engineering. Only the heavy hand of EPA orders is forcing these dumb ass local governments to do anything.

The real solution is something like :

1. Separate the sewage and rain runoff. Run sewage to treatment plants in a closed system so it can never get out. (probably not realistic; alternatively, add a new clean storm water only system)

2. Since you will allow non-sewer storm water to drain to the lake, make the pollutants that run off to the lakes illegal. Fertilizers, pesticides, etc. everything that's water soluble and washes into the lakes is illegal right fucking now.

10-01-11 - More Reliable Timing on Windows

When profiling little code chunks on Windows, one of the constant annoyances is the unreliability of times due to multithreading.

Historically the way you address this is run lots of trials (like 100) and take the MIN time of any trial.

(* important note : if you aren't trying to time "hot cache" performance, you need to wipe the cache between each run. I dunno if there's an easy instruction or system call that would invalidate all cache pages; what I usually do is have a routine that goes and munges over some big array).

It's a bit better these days because of many cores. Now you can quite often find a core which is unmolested by annoying services popping up and stealing CPU time and messing up your profile. But sometimes you get unlucky, and your process runs on an IdealProc that has some other shite.

So a simple loop helps :


template <typename t_func>
uint64 GetFuncTime( t_func * pfunc )
{
    HANDLE proc = GetCurrentProcess();
    HANDLE thread = GetCurrentThread();
    
    DWORD_PTR affProc,affSys;
    GetProcessAffinityMask(proc,&affProc,&affSys);
    
    uint64 tick_range = 1ULL << 62;
    
    for(int rep=0;rep<24;rep++)
    {
        DWORD mask = 1UL<<rep;
        if ( mask & affProc )
            SetThreadAffinityMask(thread,mask);
        else
            continue;   

        uint64 t1 = __rdtsc();
        (*pfunc)();
        uint64 t2 = __rdtsc();

        uint64 cur_tick_range = t2 - t1;
        tick_range = MIN(tick_range,cur_tick_range);

    }

    SetThreadAffinityMask(thread,0xFFFFFFFFUL);

    return tick_range;
}

which makes it reasonably probable that you get a clean run on some core. For published results you will still want to repeat the whole thing N times.

9/30/2011

09-30-11 - String Match Results Part 5 + Conclusion

Finally for completeness, some of the matchers from Tornado in FreeArc. These are basically all standard "cache table" style matchers, originally due to LZRW, made popular by LZP and LZO. The various Tornado settings select different amounts of hash rows and ways.

As they should, they have very constant time operation that goes up pretty steadily from Tornado -3 to -7, because there's a constant number of hash probes per match attempt.


totals : match len : clocks
Test_MMC1 : 0.663028 , 231.254818 
Test_Hash1 : 0.662718 , 64.888003 
Test_Tornado_3 : 0.630377 , 19.658834 
Test_Tornado_4 : 0.593174 , 28.456055 
Test_Tornado_5 : 0.586540 , 40.546146 
Test_Tornado_6 : 0.580042 , 56.841156 
Test_Tornado_7 : 0.596584 , 141.432393 

There may be something wrong with my Tornado wrapper as the -3 matcher actually finds the longest total length. I dunno. The speeds look reasonable. I don't really care much about these approximate matchers because the loss is hard to quantify, so there you go (normally when I see an anomaly like that I would investigate it to make sure I understand why it's happening).

0 = ares_merged.gr2_sec_5.dat
1 = bigship1.x
2 = BOOK1
3 = combined.gr2_sec_3.dat
4 = GrannyRocks_wot.gr2
5 = Gryphon_stripped.gr2
6 = hetero50k
7 = lzt20
8 = lzt21
9 = lzt22
10 = lzt23
11 = lzt24
12 = lzt25
13 = lzt27
14 = lzt28
15 = lzt29
16 = movie_headers.bin
17 = orange_purple.BMP
18 = predsave.bin


Conclusion : I've got to get off string matching so this is probably the end of posts on this topic.

MMC looks promising but has some flaws. There are some cases that trigger a slowness spike in it. Also it has some bad O(N^2) with unbounded match length ("MMC2") so I have to run it with a limit ("MMC1") which removes some of its advantage over LzFind and Hash1 and other approximate matchers. (without the limit it has the advantage of being exact). It's also a GPL at the moment which is a killer.

LzFind doesn't have anything going for it really.

For approximate/small-window matching I don't see any reason to not use the classic Zip hash chain method. I tried a few variants of this, like doing a hash chain to match the first 4 bytes and then link listing off that, and all the variants were worse than the classic way.

For large window / exact matching / optimal parsing, a correct O(N) matcher is the way to go. The suffix-array based matcher is by far the easiest for your kids to implement at home.

09-30-11 - String Match Results Part 4

Okay, finally on to greedy parsing. Note with greedy parsing the average match length per byte is always <= 1.0 (it's actually the % of bytes matched in this case).

Two charts for each , the first is clocks per byte, the second is average match length. Note that Suffix5 is just for reference and is neither windowed nor greedy.

got arg : window_bits = 16

got arg : window_bits = 17

got arg : window_bits = 18

Commentary :

Okay, finally MMC beats Suffix Trie and LzFind, this is what it's good at. Both MMC and LzFind get slower as the window gets larger. Surprisingly, the good old Zip-style Hash1 is significantly faster and finds almost all the matches on these files. (note that LzFind1 and Hash1 both have search limits but MMC does not)


test set :

0 = ares_merged.gr2_sec_5.dat
1 = bigship1.x
2 = BOOK1
3 = combined.gr2_sec_3.dat
4 = GrannyRocks_wot.gr2
5 = Gryphon_stripped.gr2
6 = hetero50k
7 = lzt20
8 = lzt21
9 = lzt22
10 = lzt23
11 = lzt24
12 = lzt25
13 = lzt27
14 = lzt28
15 = lzt29
16 = movie_headers.bin
17 = orange_purple.BMP
18 = predsave.bin


The same matchers ; greedy, 16 bit window, on the stress tests :

LzFind does not do well at all on the stress tests. (note that LzFind1 and MMC1 are length-limitted; LzFind1 and Hash1 are "amortized" (step limitted)).

0 = stress_all_as 
1 = stress_many_matches 
2 = stress_search_limit 
3 = stress_sliding_follow 
4 = stress_suffix_forward 
5 = twobooks

09-30-11 - String Match Results Part 3

Still doing "optimal" (non-greedy parsing) but now lets move on to windowed & non-exact matching.

Windowed, possibly approximate matching.

Note : I will include the Suffix matchers for reference, but they are not windowed.

16 bit window :

Clocks per byte :

Average Match len :

This is what LzFind is designed for and it's okay at it. It does crap out pretty badly on the rather degenerate "particles.max" file, and it also fails to find a lot of matches. (LZFind1 has a maximum match length of 256 and a maximum of 32 search steps, which are the defaults in the LZMA code; LzFind2 which we saw before has those limits removed (and would DNF on many of these files)).

lztest is :

0 = almost_incompressable
1 = bigship1.x
2 = Dolphin1.x
3 = GrannyRocks_wot.gr2
4 = Gryphon_stripped.gr2
5 = hetero50k
6 = movie_headers.bin
7 = orange_purple.BMP
8 = particles.max
9 = pixo_run_animonly_stripped.gr2
10 = poker.bin
11 = predsave.bin
12 = quick.exe
13 = RemoteControl_stripped.gr2
14 = ScriptVolumeMgr.cpp

old rants