10/27/2011

10-27-11 - Tiny LZ Decoder

I get 17 bytes for the core loop (not including putting the array pointers in registers, because presumably they're already there if you care about size).

My x86 is rusty but certainly the trick to being small is to use the ancient 1 byte instructions, which conveniently the string instructions are. For example you might be tempted to read out length & offset like this :


        mov cl,[esi]    // len
        mov al,[esi+1]  // offset
        add esi,2

but it's smaller to do

        lodsb  // len
        mov cl,al
        lodsb  // offset

because it keeps you in 1 byte instructions. (and of course any cleverness with lea is right out). (naively just using lodsw, which gives you len and offset in al and ah at once, seems even better, but in practice I can't make that smaller)
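
For reference, the byte counts (worked out by hand from the encodings, so double-check me) :

        mov cl,[esi]    // 8A 0E    : 2 bytes
        mov al,[esi+1]  // 8A 46 01 : 3 bytes
        add esi,2       // 83 C6 02 : 3 bytes = 8 total

vs.

        lodsb           // AC       : 1 byte
        mov cl,al       // 88 C1    : 2 bytes
        lodsb           // AC       : 1 byte  = 4 total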

Anyhoo, here it is. I'm sure someone cleverer with free time could do better.


__declspec(noinline) void TinyLZDecoder(char * to,char * fm,char * to_end)
{
    __asm
    {
        mov esi,fm
        mov edi,to
        mov edx,to_end
        xor eax,eax
        xor ecx,ecx
    more:
        movsb   // literal
        lodsb   // len
        mov cl,al
        lodsb   // offset
        push esi
        mov esi,edi
        sub esi,eax
        rep movsb   // match
        pop esi
        cmp edi,edx
        jne more
    }

}

------------------------------------------------------

    more:
        movsb   // literal
00401012  movs        byte ptr es:[edi],byte ptr [esi] 
        lodsb   // len
00401013  lods        byte ptr [esi] 
        mov cl,al
00401014  mov         cl,al 
        lodsb   // offset
00401016  lods        byte ptr [esi] 
        push esi
00401017  push        esi  
        mov esi,edi
00401018  mov         esi,edi 
        sub esi,eax
0040101A  sub         esi,eax 
        rep movsb   // match
0040101C  rep movs    byte ptr es:[edi],byte ptr [esi] 
        pop esi
0040101E  pop         esi  
        cmp edi,edx
0040101F  cmp         edi,edx 
        jne more
00401021  jne         more (401012h) 

Also obviously you would get much better compression with a literal run length instead of a single literal every time, and it only costs a few more bytes of instructions. You would get even better compression if the run len could be either a match len or a literal run len and that's just another few bytes. (ADDENDUM : see end)


A small "Predictor/Finnish" is something like this :


    __asm
    {
ByteLoop:   lodsb   // al = *esi++ // control byte
            mov edx, 0x100
            mov dl, al
BitLoop:
            shr edx, 1
            jz  ByteLoop
            jnc zerobit
            lodsb
            mov [ebx], al
zerobit:    mov al, [ebx]
            mov bh, bl
            mov bl, al
            stosb  // *edi++ = al
            jmp BitLoop
    }

The fast version of Finnish of course copies the bit loop 8 times to avoid looping, but you can't do that if you want to be small.

I'm quite sure this could be smaller using some clever {adc esi} or such. Also the sentry bit looping is taking a lot of instructions and so on.

Note the crucial trick of "Finnish" is that the hash table must be 64k in size and 64k aligned, so you can do the hash update and the table address computation just by cycling the bottom two bytes of ebx. (I use the name "Predictor" for the generic idea of the single bit prediction/literal style compressor; "Finnish" is the specific variant of Predictor that uses this bx trick).
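
In C the same decoder is something like this (a sketch with my own names, not tested code; in C you index the table explicitly, so the 64k alignment trick isn't needed) :

void PredictorDecode(unsigned char * to, unsigned char * to_end, const unsigned char * fm)
{
    static unsigned char table[65536] = { 0 };  // the 64k hash table
    unsigned int cntx = 0;  // bottom two bytes = last two output bytes (the "bx" of the asm)

    while ( to < to_end )
    {
        unsigned int control = 0x100 | *fm++;   // 0x100 is the sentry bit
        for(;;)
        {
            int bit = control & 1;
            control >>= 1;
            if ( control == 0 ) break;  // hit the sentry ; 8 data bits done

            if ( bit )  // prediction miss : the correct byte follows in the stream
                table[cntx] = *fm++;

            unsigned char out = table[cntx];    // predicted (or just corrected) byte
            cntx = ((cntx<<8) | out) & 0xFFFF;  // cycle the context bytes (mov bh,bl ; mov bl,al)
            *to++ = out;
            if ( to == to_end ) return;
        }
    }
}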

(note that this is not remotely the fast way to write these on modern CPU's)


ADDENDUM : Better (in the sense of more compression) LZ decoder in 22 bytes (core loop only) :


__declspec(noinline) void TinyLZDecoder(char * to,char * fm,char * to_end)
{
    __asm
    {
        mov esi,fm
        mov edi,to
        mov edx,to_end
        xor eax,eax
        xor ecx,ecx
    more:
        mov cl,[esi]    // len
        inc esi
        shr cl,1        // bottom bit is flag
        jc literals
        
    //match:
        lodsb   // offset -> al
        push esi
        mov esi,edi
        sub esi,eax
        rep movsb   // copy match
        pop esi
    
        // ecx is zero, just drop through
    literals:
        rep movsb  // copy literal run
    
        cmp edi,edx
        jne more
    }

}

Note that obviously if your data is bigger than 256 bytes you can use a two byte match offset by doing lodsw instead of lodsb.

x86 of course is a special case that's particularly amenable to small LZ decoders; it's not really fair, all other instruction sets will be much larger. (which is sort of ironic because RISC is supposed to make instruction encoding smaller) (addendum : not actually true, that last bit)

10/26/2011

10-26-11 - Tons

In the US, a "ton" = 2000 pounds.

In the UK a "ton" is 2240 pounds (which comes from twenty "hundredweights" where a "hundredweight" is eight stone, and a stone is 14 pounds, WTF Britain).

A "metric ton" is obviously 1000 kg. In the UK this is officially called a "tonne" which you will see in technical documents, but I don't see that used much in casual writing, and it's certainly confusing when spoken since it sounds the same. (but a UK ton is very close to a metric ton (2204.6 pounds) so the mixup here surely happens all the time and is not a huge problem).

(when you hear someone in the UK phonetically say "ton" do they mean "tonne" or imperial ton?)

To differentiate the US ton vs UK ton they can be called "short ton" or "long ton".


On a related note, a pint is not a pound *anywhere* in the world.

In the UK, 1 oz by volume of water = 1 oz of weight. But a "pint" in the UK is 20 oz. So a pint is 1.25 pounds (a gallon is exactly 10 pounds).

In the US, 1 oz by volume of water = 1.041 oz of weight, so a pint = 1.041 pounds. (and a gallon = 8.33 pounds).

(neither liquid ounce is anything neat in terms of volume; the only nice whole number unit is the US gallon which is 231 cubic inches)

(the weight measures are the same in the US and UK, it's the US volume measure which went weird (1.041), and I believe it was done in order to make the gallon an integer number of cubic inches)

If you want to get technical, a (US) "pint's a pound" at some high temperature. (...some digging...) actually it's very close just before boiling. It looks like 98 C water is almost exactly a pound per pint (US).

Actually there is a sort of cute book-end of the ranges of water density there :

Very close to freezing (4 C) water is 1 g/ml , and very close to boiling (98 C) it's a pound per (US) pint. The difference is a factor of about 0.96.
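
(the arithmetic, using the standard conversions of 453.59 g per pound and 473.18 ml per US pint :)

    1 lb per US pint = 453.59 g / 473.18 ml = 0.9586 g/ml
    water near 98 C  = about 0.959 g/ml
    0.9586 / 1.0000  = about 0.96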

10-26-11 - The Eight Month Cruise

If you're buying a home in Seattle, you should always try to do so in early spring. This will give you a few months of wetness to see any problems, and then you'll have the whole summer to deal with them before the wet sets in again. (it also lets you see the homes during rain, so you can look for water incursions while you shop). I tried to time it that way, but the home shopping took too long and I didn't wind up buying until late summer.

The problem with Seattle is that once it starts raining (around Oct 1 pretty reliably) it literally does not stop for the next 8 months. Sure maybe it stops for a day or two, but never long enough for the whole house to dry out and then give you a big chunk of dry days to do something like replace the roof or paint the exterior.

Fortunately I don't have any problem so large as that, but even for minor things it's damn annoying. For example some time around September I realized that I really need to get a coat of waterproofer on all my decking. Oh well, it's gonna have to wait 8 months. There's a couple of spots I need to touch up exterior paint, but you can't really do a good job of painting without a solid 5 days of dry and decent warmth.

It's almost like our houses are boats, and we go on an 8 month aquatic voyage every year. You really need to use those 4 months as a chance to get your boat up on dry dock, scrub off the moss, dig out the dry rot, apply epoxy, sand and paint, etc.

10/24/2011

10-24-11 - LZ Optimal Parse with A Star Part 1

First two notes that aren't about the A-Star parse :

1. All good LZ encoders these days that aren't optimal parsers use complicated heuristics to try to bias the parse towards a cheaper one. The LZ parse is massively redundant (many parses can produce the same original data), and those parses don't all cost the same. But in the forward normal parse you can't say for sure which decision is cheapest, so they just make some guesses.

For example the two key crucial heuristics in LZMA are :


// LZMA lastOffset heuristic :
  if (repLen >= 2 && (
        (repLen + 1 >= mainLen) ||
        (repLen + 2 >= mainLen && mainDist >= (1 << 9)) ||
        (repLen + 3 >= mainLen && mainDist >= (1 << 15))))
  {
    // take repLen instead of mainLen
  }

// LZMA lazy heuristic :

        // Truncate current match if match at next position will be better (LZMA's algorithm)
        if (nextlen >= prevlen && nextdist < prevdist/4 ||
            nextlen == prevlen + 1 && !ChangePair(prevdist, nextdist) ||
            nextlen > prevlen + 1 ||
            nextlen + 1 >= prevlen && prevlen >= MINLEN && ChangePair(nextdist, prevdist))
        {
             return MINLEN-1;
        } else {
             return prevlen;
        }

One is choosing a "repeat match" over a normal match, and the other is choosing a lazy match (literal then match) over a greedy one (immediate match).

My non-optimal LZ parser has to make similar decisions; what I did was set up a matrix and train it. I made four categories for a match based on what the offset is : {repeat, low, medium, high } , so to decide between two matches I do :


  cat1 = category of match 1 offset
  cat2 = category of match 2 offset
  diff = len1 - len2

  return diff >= c_threshold[cat1][cat2]
  
so c_threshold is a 4x4 matrix of integers. The values of threshold are in the range [0,3] so it's not too many to just enumerate all possibilities of threshold and see what's best.
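
Fleshed out, the decision is something like this (the category boundaries here are made up for illustration; the real boundaries and the matrix values come out of the training) :

enum { CAT_REPEAT, CAT_LOW, CAT_MEDIUM, CAT_HIGH };

static int OffsetCategory( int offset, int lastOffset )
{
    if ( offset == lastOffset ) return CAT_REPEAT;
    if ( offset < (1<<7) )  return CAT_LOW;     // illustrative boundaries
    if ( offset < (1<<14) ) return CAT_MEDIUM;
    return CAT_HIGH;
}

static int c_threshold[4][4];   // trained offline ; values in [0,3]

// take match 1 instead of match 2 ?
static bool Match1Better( int len1, int off1, int len2, int off2, int lastOffset )
{
    int cat1 = OffsetCategory(off1,lastOffset);
    int cat2 = OffsetCategory(off2,lastOffset);
    int diff = len1 - len2;
    return diff >= c_threshold[cat1][cat2];
}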

Anyway, the thing about these heuristics is that they bias the parse in a certain way. They assume a certain cost tradeoff between literals vs. repeat matches vs. normal matches, or whatever. When the heuristic doesn't match the file they do badly. Otherwise they do amazingly well. One solution might be having several heuristics trained on different files and choosing the one that creates the smallest output.

Also I should note - it's not trivial to tell when you have the heuristic wrong for the file. The problem is that there's a feedback loop between the parse heuristic and the statistical coder. That causes you to get into a local minimum, and you might not see that there's a better global minimum which is only available if you make a big jump in parsing style.

Repeating that more explicitly : the statistical coder will adapt to your parse heuristic; say you have a heuristic that prefers low offsets (like the LZMA heuristic above), that will cause your parse to select more low offsets, that will cause your statistical backend to code low offsets smaller. That's good, that's what you want, that's the whole point of the heuristic, it skews the parse intentionally to get the statistics going in the right direction. The issue then is that if you try to evaluate an alternative parse using the statistics that you have, it will look bad, because your statistics are trained for the parse you have.

2. It's crazy that LZ compression can do so well with so much redundancy in the code stream.

Think about it this way. Enumerate all the compressed bit streams that are created by encoding all raw bit streams up to length N.

Start with the shortest compressed bit stream. See what raw bits that decodes to. Now there may be several more (longer) compressed bit streams that decode to that same raw bit stream. Mark those all as unused.

Move to the next shortest compressed bit stream. First, if there is a shorter slot that was marked unused, move this encoding down into that slot. Then mark all other output streams that make the same raw bits as unused, and repeat.

For example a file like "aaaaaaa" has one encoding that's {a, offset -1 length 6} ; but there are *tons* of other encodings, such as {a,a,[-1,5]} or {a,[-1,3],[-1,3]} or {a,[-1,2],[-2,4]} etc. etc.

All the encodings except the shortest are useless. We don't want them. But even the ones that are not the best are quite short - they are small encodings, and small encodings take up lots of code space (see Kraft Inequality for example - one bit shorter means you take up twice the code space). So these useless but short encodings are real waste. In particular, there are other raw strings that don't have such a short encoding that would like to have that output length.

Anyhoo, surprisingly (just like with video coding) it seems to be good to add even *more* code stream redundancy by using things like the "repeat matches". I don't think I've ever seen an analysis of just how much wasted code space there is in the LZ output, I'm curious how much there is.

Hmm we didn't actually get to the A Star. Next time.

10/18/2011

10-18-11 - To wrap it up or move on

One of the things I've really struggled with in the last few years at RAD is the trade off between wasting time on a side shoot of the main code line vs. really wrapping something up while your focus is on it.

Generally after I've spent a week or two on a topic I start feeling like "okay that's enough time on this I need to move on" ; I guess that's my internal project manager watching my schedule. But on the other hand, there's such a huge advantage to staying on a topic while it fills your mind. You just lose so much momentum if you have to come back to it later.

For example I wish I had finished my JPEG decoder back when I was working on it, that was an important piece of software I believe, but I felt like I needed to move on to more practical tasks, and now it's all out of my head and would take me several weeks to figure out all the nuances again.

Currently I'm working on a new way to do LZ optimal parsing, and I feel like I need to move on because it's not that important to my product and I need to get onto more practical tasks, but at the same time I hate to leave it now because I feel close to a breakthrough and I know that if I stop now I may never come back to it, and if I do it will take forever to get back into the flow.

10-18-11 - StringMatchTest - Hash 1b

For reference :

The good way to do the standard Zip-style Hash -> Linked List for LZ77 parsing.

There are two tables : the hash entry point table, which gives you the head of the linked list, and the link table, which is a circular buffer of ints containing, for each window position, the preceding position where that hash occurred.

That is :


  hashTable[ hash ]  contains the last (most recent preceding) position where that hash occurred
  chainTable[ pos & (window_size-1) ]  contains the previous position (before pos) where the hash at pos occurred

To walk the table you do :

  i = hashTable[ hash ];
  while ( i in window )
    i = chainTable[ i & (window_size-1) ]

To update the table you do :

  head = hashTable[ hash ];
  hashTable[hash] = pos;
  chainTable[ pos & (window_size-1) ] = head;

And now for some minor details that are a bit subtle. We're going to go through "Hash1" from StringMatchTest which I know I still haven't posted.

int64 Test_Hash1(MATCH_TEST_ARGS)
{
    uint8 * buf = (uint8 *)charbuf;
    const int window_mask = window_size-1;
        
    vector<int> chain; // circular buffer on window_size
    chain.resize(window_size);
    int * pchain = chain.data();
    
    const int hash_size = MIN(window_size,1<<20);
    const int hash_mask = hash_size-1;
    
For small files or small windows, you can only get good per-byte speed if you scale the hash size down with that MIN. (What you can't see is that outside the Test_ function I also make window_size no bigger than the smallest power of 2 that encloses the file size.)


    vector<int> hashTable; // points to pos of most recent occurance of this hash
    hashTable.resize(hash_size);
    int * phashTable = hashTable.data();
    
    memset(phashTable,0,hash_size*sizeof(int));

As noted previously, for large hashes you can get a big win by using a malloc that gives you zeros. I don't do it here for fairness to the other tests. I do make sure that my initialization value is zero so you can switch to VirtualAlloc/calloc.

    int64 total_ml = 0;
    
    // fiddle the pointers so that position 0 counts as being out of the window
    int pos = window_size+1;
    buf -= pos;
    ASSERT( (char *)&buf[pos] == charbuf );
    size += pos;

I don't want to do two checks in the inner loop for whether a position is a null link vs. out of the window. So I make the "null" value of the linked list (zero) be out of the window.

    for(;pos<(size-TAIL_SPACE_NEEDED);)
    {
        MATCH_CHECK_TIME_LIMIT();

        // grab 4 bytes (@ could slide here)
        uint32 cur4 = *((uint32 *)&buf[pos]);
        //ASSERT( cur4 == *((uint32 *)&buf[pos]) );

On PC's it's fastest just to grab the unaligned dword. On PowerPC it's faster to slide bytes through the dword. Note that endian-ness changes the value, so you may need to tweak the hash function differently for the two endian cases.
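
(the "slide" is something like this, in big-endian terms - update instead of re-read :)

        pos++;
        cur4 = (cur4<<8) | buf[pos+3];  // cur4 = the dword at the new pos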

        // hash them 
        uint32 hash = hashfour(cur4) & hash_mask;
        int hashHead =  phashTable[hash];
        int nextHashIndex = hashHead;
        int bestml = 0;
        int windowStart = pos-window_size;
        ASSERT( windowStart >= 0 );
        
        #ifdef MAX_STEPS
        int steps = 0;
        #endif

Start the walk. Not interesting.

        while( nextHashIndex >= windowStart )
        {
            uint32 vs4 = *((uint32 *)&buf[nextHashIndex]);
            int hi = nextHashIndex&window_mask;
            if ( vs4 == cur4 )
            {
                int ml = matchlenbetter(&buf[pos],&buf[nextHashIndex],bestml,&buf[size]);
                    
                if ( ml != 0 )
                {
                    ASSERT( ml > bestml );
                    bestml = ml;
                    
                    // meh bleh extra check cuz matchlenbetter can actually go past end
                    if ( pos+bestml >= size )
                    {
                        bestml = size - pos;
                        break;
                    }
                }
            }

            #ifdef MAX_STEPS
            if ( ++steps > MAX_STEPS )
                break;
            #endif
                                
            nextHashIndex = pchain[hi];
        }

This is the simple walk of the chain of links. Min match len is 4 here which is particularly fast, but other lens can be done similarly.

"MAX_STEPS" is the optimal "amortize" (walk limit) which hurts compression in an unbounded way but is necessary for speed.

"matchlenbetter" is a little trick ; it's best to check the character at "bestml" first because it is the most likely to differ. If that char doesn't match, we know we can't beat the previous match, so we can stop immediately. After that I check the chars in [4,bestml) to ensure that we really do match >= bestml (the first 4 are already checked) and lastly the characters after, to extend the match.

The remainder just updates the hash and is not interesting :


        ASSERT( bestml == 0 || bestml >= 4 );
        total_ml += bestml;
        
        // add self :
        //  (this also implicitly removes the node at the end of the sliding window)
        phashTable[hash] = pos;
        int ci = pos & window_mask;
        pchain[ci] = hashHead;
                
        if ( greedy && bestml > 0 )
        {
            int end = pos+bestml;
            pos++;
            ASSERT( end <= size );
            for(;pos<end;pos++)
            {               
                uint32 cur4 = *((uint32 *)&buf[pos]);
                
                // hash them 
                uint32 hash = hashfour(cur4) & hash_mask;
                int hashHead =  phashTable[hash];
                phashTable[hash] = pos;
                int ci = pos & window_mask;
                pchain[ci] = hashHead;      
            }
            pos = end;
        }
        else
        {
            pos++;
        }
    }
    
    return total_ml;
}

Note that for non-greedy parsing you can help the O(N^2) in some cases by setting bestml to lastml-1. This helps enormously in practice because of the heuristic of matchlenbetter but does not eliminate the O(N^2) in theory. (the reason it helps in practice is because typically when you find a very long match, then the next byte will not have any match longer than lastml-1).

(but hash1 is really just not the right choice for non-greedy parsing; SuffixArray or SuffixTrie are vastly superior)

10/12/2011

10-12-11 - Post-backpacking Recalibration

1. 65 degree house is fucking blazing hot at night.

2. Of course I'll walk a mile in the rain to get lunch. No big deal.

3. Everything is fucking DELICIOUS! Jar of spaghetti sauce, om nom nom. Beef stew yum.

4. None of the usual hesitation in sitting on the semi-shit-covered office toilet after having had to sit on the very-shit-covered wilderness toilet.

10/06/2011

10-06-11 - Fiberglass

Don't put this shit in your house. It's toxic, it's poison, it's the modern day asbestos. There are plenty of alternatives.

Even in the attic or walls, sure it's sealed up most of the time, but any time you have to go in there to work you stir it up, then you get glass shards in the air which get in your eyes and lungs, so you have to wear safety suits and respirators and so on just to do basic maintenance work.

If anything ever goes wrong with it, it's a nightmare to dispose of.

Worst of all is using it to wrap ducts. The problem is that all ducts will leak eventually. Maybe not right away, but in 10 years cracks will form. Then the air return ducts will start sucking in at the seams, and eventually that will be sucking glass fibers in, then blowing them out all over the house.

Just say no to toxic shit in your home.

10/03/2011

10-03-11 - Amortized Hashing

The "hash1" simple Zip-style matcher was very fast. So why don't we love it?

The problem is that it needs amortized hashing. "Amortized" = we just stop walking after A steps. This limits the worst case. Technically it makes hash1 O(N) (or O(A*N)).

Without it, hash1 has disastrous worst cases, which makes it just not viable. The problem is that the "amortize" can hurt compression quite a bit and unpredictably so.

One perhaps surprising thing is that the value of A (the walk limit) doesn't actually matter that much. It can be 32 or 256 and either way you will save your speed from the cliff. Surprisingly, even an A of 1024 on a 128k window helps immensely.


not "amortized" :

 file,   encode,   decode, out size
lzt00,  225.556,   35.223,     4980
lzt01,  143.144,    1.925,   198288
lzt02,  202.040,    3.067,   227799
lzt03,  272.027,    2.164,  1764774
lzt04,  177.060,    5.454,    11218
lzt05,  756.387,    3.940,   383035
lzt06,  227.348,    6.078,   423591
lzt07,  624.699,    4.179,   219707
lzt08,  311.441,    7.329,   261242
lzt09,  302.741,    3.746,   299418
lzt10,  777.609,    1.647,    10981
lzt11, 1073.232,    4.999,    19962
lzt12, 3250.114,    3.134,    25997
lzt13,  101.577,    5.644,   978493
lzt14,  278.619,    6.026,    47540
lzt15, 1379.194,    9.396,    10911
lzt16,  148.908,   12.558,    10053
lzt17,  135.530,    5.693,    18517
lzt18,  171.413,    6.003,    68028
lzt19,  540.656,    3.433,   300354
lzt20,  109.678,    5.488,   900001
lzt21,  155.648,    3.605,   147000
lzt22,  118.776,    6.671,   290907
lzt23,  103.056,    6.350,   822619
lzt24,  218.596,    4.439,  2110882
lzt25,  266.006,    2.498,   123068
lzt26,  118.093,    7.062,   209321
lzt27,  913.469,    4.340,   250911
lzt28,  627.070,    2.576,   322822
lzt29, 1237.463,    4.090,  1757619
lzt30,   75.217,    0.646,   100001

"amortized" to 128 steps :

 file,   encode,   decode, out size
lzt00,  216.417,   30.567,     4978
lzt01,   99.315,    1.620,   198288
lzt02,   85.209,    3.556,   227816
lzt03,   79.299,    2.189,  1767316
lzt04,   90.983,    7.073,    12071
lzt05,   86.225,    4.715,   382841
lzt06,   91.544,    6.930,   423629
lzt07,  127.232,    4.502,   220087
lzt08,  161.590,    7.725,   261366
lzt09,  119.749,    4.696,   301442
lzt10,   55.662,    1.980,    11165
lzt11,  108.619,    6.072,    19978
lzt12,  112.264,    3.119,    26977
lzt13,  103.460,    6.215,   978493
lzt14,   87.520,    5.529,    47558
lzt15,   98.902,    7.568,    10934
lzt16,   90.138,   12.503,    10061
lzt17,  115.166,    6.016,    18550
lzt18,  176.121,    5.402,    68035
lzt19,  272.349,    3.310,   304212
lzt20,  107.739,    5.589,   900016
lzt21,   68.255,    3.568,   147058
lzt22,  108.045,    5.867,   290954
lzt23,  108.023,    6.701,   822619
lzt24,   78.380,    4.631,  2112700
lzt25,   93.013,    2.554,   123219
lzt26,  108.348,    6.143,   209321
lzt27,  103.226,    3.468,   249081
lzt28,  145.280,    2.658,   324569
lzt29,  199.174,    4.063,  1751916
lzt30,   75.093,    1.019,   100001

The times are in clocks per byte. In particular let's look at some files that are really slow without amortize :

no amortize :

 file,   encode,   decode, out size
lzt11, 1073.232,    4.999,    19962
lzt12, 3250.114,    3.134,    25997
lzt15, 1379.194,    9.396,    10911
lzt27,  913.469,    4.340,   250911
lzt29, 1237.463,    4.090,  1757619

amortize 128 :

 file,   encode,   decode, out size
lzt11,  108.619,    6.072,    19978
lzt12,  112.264,    3.119,    26977
lzt15,   98.902,    7.568,    10934
lzt27,  103.226,    3.468,   249081
lzt29,  199.174,    4.063,  1751916

Massive speed differences. The funny thing is that the only file where the compression ratio drastically changes is lzt12. It changes by around 4%. (lzt29 is a bigger absolute difference but only 0.34%)

So amortized hashing saved us massive time, and only cost us 4% on one file in the test case. Let me summarize the cases. There are three main classes of file :

1. 80% : Not really affected by the amortize at all. These files don't have lots of degeneracy so there aren't a ton of links in the same hash bucket.

2. 15% : Very slow without amortize, but compression not really affected. These files have some kind of degeneracy (such as long runs of one character) but the most recent occurrences of that hash are the good ones, so the amortize doesn't hurt compression.

3. 5% : Has lots of hash collisions, and we do need the long offsets to make good matches. This case is rare.

So obviously it should occur to us that some kind of "introspective" algorithm could be used. Somehow monitor what's happening and adjust your amortize limit (it doesn't need to be a constant) or switch to another algorithm for part of the hash table.

The problem is that we can't tell between classes 2 and 3 without running the slow compressor. That is, it's easy to tell if you have a lot of long hash chains, but you can't tell if they are needed for good compression or not without running the slow mode of the compressor.

10/02/2011

10-02-11 - How to walk binary interval tree

I noted previously that SA3 uses a binary interval tree. It's obvious how that works but it always takes me a second to figure it out so let's write it down.

This is going to be all very similar to previous notes on cumulative probability trees (and followup ).

Say you have an array of 256 entries. You build a min tree. Each entry in the tree stores the minimum value over an interval. The top entry covers the range [0,255] , then you have [0,127][128,255] , etc.

Say you're at index i and you want to find the next entry which has a value lower than yours. That is,


find j such that A[j] < A[i] and j > i

you just have to find an interval in the min tree whose min is < A[i]. To make the walk fast you want to step by the largest intervals possible. (once you find the interval that must contain your value you do a normal binary search within that interval)

You can't use the interval that overlaps you, because you only want to look at j > i.

It's easy to do this walk elegantly using the fact that the binary representation of integers is a sort of binary interval tree.

Say we start at i = 37 (00100101). We need to walk from 37 to 256. Obviously we want to use the [128,256) range to make a big step. And the [64,128). We can't use the [32,64) because we're inside that range - this corresponds to the top on bit of 37. We can use [48,64) and [40,48) because those bits are off. We can't use [36,40) but we can use [38,40) (and the bottom on bit corresponds to [37,38) which we are in).

Doing it backwards, you start from whatever index (such as i=37). Find the lowest on bit. That is the interval you can step by (in this case 1). So step by 1 to i=38. Stepping by the lowest bit always acts to clear that bit and push the lowest on bit higher up (sometimes more than 1 level). Now find the next lowest on bit. 38 = 00100110 , so step by 2 to 40 = 00101000 , now step by 8 to 48 = 00110000 , now step by 16 to 64 = 01000000. etc.

In pseudo code :


Start at index i

while ( i < end )
{
  int step = i & (-i); // isolate bottom bit
  // I'm looking at the range [i,i+step]
  int level = BitPos(step);
  check tree[level][i>>level];
  i += step;
}
 
Now this should all be pretty obvious, but here comes the juju.

I've written tree[][] as if it is layed out in the intuitive way, that is :


tree[0] has 256 entries for the 1-step ranges
tree[1] has 128 entries for 2-step ranges
...

The total is 512 entries which is O(N). But notice that tree[0] is only actually ever used for indexes that have the bottom bit on. So the half of them that have the bottom bit off are not needed. Then tree[1] is only used for entries that have the second bit on (but bottom bit off). So the tree[1] entries can actually slot right into the blanks of tree[0], and half the blanks are left over. And so on...

It's a Fenwick tree!

So our revised iteration is :


// searching :
Start at index i
(i starts at 1)

while ( i < end )
{
  int step = i & (-i); // isolate bottom bit
  // I'm looking at the range [i,i+step]
  check fenwick_tree[i];
  i += step;
}

// building :

for all i
{
  int step = i & (-i); // isolate bottom bit
  fenwick_tree[i] = min on range [i,i+step]
}

(in practice you need to build with a binary recursion; eg. level L is built from two entries of level L-1).
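
Here's the whole thing as a compilable sketch (my names; in SA3 the array contents are file positions and the test is against the window, but the shape is the same) :

#define MIN(a,b) ( (a)<(b) ? (a) : (b) )

// A[] and fen[] are 1-based ; N is a power of 2
// fen[i] = min of A over [i,i+step) , step = i & (-i)

void build_fenwick_min(const int * A, int * fen, int N)
{
    // go from high indexes down so fen[i+k] is ready when we need it
    for(int i=N-1;i>=1;i--)
    {
        int b = i & (-i);
        fen[i] = A[i];
        // [i,i+b) = {i} + [i+1,i+2) + [i+2,i+4) + ... + [i+b/2,i+b)
        for(int k=1;k<b;k<<=1)
            fen[i] = MIN( fen[i], fen[i+k] );
    }
}

// find the smallest j in [start,N) with A[j] < x , or -1 if none
// (call with start = i and x = A[i] to get "next entry lower than yours")
int find_next_below(const int * A, const int * fen, int N, int start, int x)
{
    int i = start;  // must be >= 1
    while( i < N )
    {
        int step = i & (-i);    // fen[i] covers [i,i+step)
        if ( fen[i] < x )
        {
            // descend : find the first position < x inside [i,i+step)
            for(;;)
            {
                if ( A[i] < x ) return i;
                // else it's in [i+1,i+step) ; walk that range's decomposition
                int k = 1;
                while ( fen[i+k] >= x ) k <<= 1;    // some k < step must hit
                i += k; step = k;
            }
        }
        i += step;
    }
    return -1;
}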

Note that to walk backwards you need the opposite entries. That is, at level 7 (steps of 128) you only need [128,256) to walk forward, never [0,128) because a value in that range can't take that step. To walk backwards, you only need [0,128) , never [128,256). So in fact to walk forward or backward you need all the entries. When we made the "Fenwick compaction" for the forward walk, we threw away half the values - those are exactly the values that need to be in the backward tree.

For the three bit case , range [0,8) , the trees are :


Forward :

+-------------------------------+
|              0-8              |
+-------------------------------+
|       ^       |      4-8      |
+---------------+---------------+
|   ^   |  2-4  |   ^   |  6-8  |
+---------------+---------------+
| ^ |1-2| ^ |3-4| ^ |5-6| ^ |7-8| 
+-------------------------------+

where ^ means go up a level to find your step
the bottom row is indexed [0,7] and 8 is off the end on the right
so eg if you start at 5 you do 5->6 then 6->8 and done

Backward :

+-------------------------------+
|              8-0              |
+-------------------------------+
|      4-0      |       ^       |
+---------------+---------------+
|  2-0  |   ^   |  6-4  |   ^   |
+---------------+---------------+
|1-0| ^ |3-2| ^ |5-4| ^ |7-6| ^ | 
+-------------------------------+

the bottom row is indexed [1,8] and 0 is off the end of the left
so eg if you start at 5 you go 5->4 then 4->0 and done

You collapse the indexing Fenwick style on both by pushing the values down the ^ arrows. It should be clear you can take the two tables and lay them on top of each other to get a full set.

10/01/2011

10-01-11 - String Match Results Part 6

You knew that couldn't be the end.

SuffixArray3 : suffix array string matcher which uses a min/max tree to find allowed offsets.

The min/max tree is a binary hierarchy ; at level L there are (size>>L) entries, and each entry covers a range of size (1 << L). Construction is O(N) because N/2+N/4+N/8 ... = N

The min/max tree method is generally slightly slower than the elegant "chain of fences" approach used for SuffixArray2, but it's close. The big advantage is the min/max tree can also be used for windowed matching, which is not easy to integrate in SA2.

(ADDENDUM : this is super unclear, see more at end)

First check that it satisfies the O(N) goal on the stress tests :

0 = stress_all_as
1 = stress_many_matches
2 = stress_search_limit
3 = stress_sliding_follow
4 = stress_suffix_forward
5 = twobooks

Yep. Then check optimal parse, large window vs. the good options :

0 = ares_merged.gr2_sec_5.dat
1 = bigship1.x
2 = BOOK1
3 = combined.gr2_sec_3.dat
4 = GrannyRocks_wot.gr2
5 = Gryphon_stripped.gr2
6 = hetero50k
7 = lzt20
8 = lzt21
9 = lzt22
10 = lzt23
11 = lzt24
12 = lzt25
13 = lzt27
14 = lzt28
15 = lzt29
16 = movie_headers.bin
17 = orange_purple.BMP
18 = predsave.bin

The good Suffix Trie is clearly the best, but we're in the ballpark.

Now optimal parse, 17 bit (128k) window :


totals:
Test_MMC2 : DNF 
Test_LzFind2 : 506.954418 , 1355.992865 
Test_SuffixArray3 : 506.954418 , 514.931740 
Test_MMC1 : 13.095260 , 1298.507490 
Test_LzFind1 : 12.674177 , 226.796123 
Test_Hash1 : 503.319301 , 1094.570022 

Finally greedy parse, 17 bit window :


totals:
Test_MMC2 : 0.663028 , 110.373098 
Test_LzFind2 : DNF 
Test_SuffixArray3 : 0.663036 , 236.896551 
Test_MMC1 : 0.663028 , 222.626069 
Test_LzFind1 : 0.662929 , 216.912409 
Test_Hash1 : 0.662718 , 62.385071 

average match length : (chart not reproduced here)

And once more for clarity :

Greedy parse, 16 bit window , just the good candidates :

totals:
Test_SuffixArray3 : 0.630772 , 239.280605 
Test_LzFind1 : 0.630688 , 181.430093 
Test_MMC2 : 0.630765 , 88.413339 
Test_Hash1 : 0.630246 , 51.980073 

It should be noted that LzFind1 is approximate, and Hash1 is even more approximate. Though looking at the match length chart you certainly can't see it.

ADDENDUM : an email I wrote trying to explain this better :


First a reminder of how the normal suffix array searcher works :

We're at some file position F
We look up sortIndex[F] to find our sort position S
we know our longest matches must be at sort positions S-1 or S+1
we step to our neighbors

The problem with windowed matching is our neighbors may be way out of the window

So what we want is a way to step more than 1 away, to find the closest neighbor in the sort order that is within our desired window.

So really all we are doing is adding another search structure on top of the suffix array.  It's a search structure to go in either direction from S and find the closest spot to S that has file position in the range [F-window,F-1] 

To do this I just build a tree.  It's indexed by sort position, and its content is file positions.

Level L of the tree contains nodes that cover an interval of size 1<<L 

That is, level 0 has intervals of size 1, that's just

Tree_0[S] = sortIndexInverse[S] 
  (eg. it's just the file position of that sort pos)
  (in fact we don't make this level of the tree, we just use sortIndexInverse which we already have)

Tree_1[i] covers steps of size 2, that is :
  file positions sortIndexInverse[i*2] and sortIndexInverse[i*2+1]
  I store the min and max of that

Tree_2[i] covers steps of size 4, that is min and max of Tree_1[i*2] and Tree_1[i*2+1]
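
In code the build is just (sketch, with my names; MIN/MAX as usual) :

struct MinMax { int lo, hi; };

// tree[L][i] = min & max file position over sort positions [i<<L, (i+1)<<L)
// level 0 is sortIndexInverse itself, so start at level 1 :
for(int L=1; (size>>L) >= 1; L++)
{
    for(int i=0; i < (size>>L); i++)
    {
        if ( L == 1 )
        {
            int a = sortIndexInverse[i*2];
            int b = sortIndexInverse[i*2+1];
            tree[L][i].lo = MIN(a,b);
            tree[L][i].hi = MAX(a,b);
        }
        else
        {
            tree[L][i].lo = MIN( tree[L-1][i*2].lo, tree[L-1][i*2+1].lo );
            tree[L][i].hi = MAX( tree[L-1][i*2].hi, tree[L-1][i*2+1].hi );
        }
    }
}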

Now once you have this binary interval tree you can do the normal kind of binary tree walk to find the closest neighbor that's in the range you want.

Also see the following blog post on how I walk binary interval trees; the released source code for StringMatchTest contains the full code for this matcher.

10-01-11 - Seattle Stop Shitting on my Face

I've been thinking about upgrading my neoprene gear so that I can swim in the lake through the fall/spring, not just the summer.

I fucking hate swimming laps in a pool during official lap swim hours, with all your "rules" and your system keeping me down. And it just doesn't make sense to go into some nasty indoor crowded box when I'm literally surrounded by miles of beautiful open water.

But there's a big problem with this idea. Seattle is shitting on my face.

The fall/spring, when a wet suit would help, is when it rains. When it rains, the sewers overflow and drain into the lake. Then you get itchy bumps and vomiting and so on.

Seattle is basically doing nothing about it. There are some little programs to do "rain gardens", but those are sort of like using a tampon to stop elephant piss. What we need is serious fucking civil engineering. ("rain gardens" are also better known as "mosquito breeders"; we Houstonians are always amazed and delighted about the lack of mosquitos here; it's because Seattle is hilly and surrounded by lakes so the water doesn't pool, but the city is doing their best to ruin that). (a much bigger impact than piddly residential rain gardens would be to outlaw concrete parking lots; grass/gravel/lattice parking lots work perfectly fine for holding cars).

Of course this is a problem that is occurring all over the US. I hear NY has a major sewer problem as well. The population and development of most US cities has outstripped their infrastructure, and in our shitty faux-libertarian plutocracy of course there's no money for basic civil engineering. Only the heavy hand of EPA orders is forcing these dumb ass local governments to do anything.

The real solution is something like :

1. Separate the sewage and rain runoff. Run sewage to treatment plants in a closed system so it can never get out. (probably not realistic; alternatively, add a new clean storm water only system)

2. Since you will allow non-sewer storm water to drain to the lake, make the pollutants that run off to the lakes illegal. Fertilizers, pesticides, etc. everything that's water soluble and washes into the lakes is illegal right fucking now.

10-01-11 - More Reliable Timing on Windows

When profiling little code chunks on Windows, one of the constant annoyances is the unreliability of times due to multithreading.

Historically the way you address this is to run lots of trials (like 100) and take the MIN time of any trial.

(* important note : if you aren't trying to time "hot cache" performance, you need to wipe the cache between each run. I dunno if there's an easy instruction or system call that would invalidate all cache pages; what I usually do is have a routine that goes and munges over some big array).
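
(something like this, say; the size is a guess, just make it bigger than your caches :)

static void WipeCache()
{
    static volatile unsigned char s_munge[16*1024*1024];
    for(size_t i=0;i<sizeof(s_munge);i+=64)   // one touch per cache line
        s_munge[i]++;
}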

It's a bit better these days because of many cores. Now you can quite often find a core which is unmolested by annoying services popping up and stealing CPU time and messing up your profile. But sometimes you get unlucky, and your process runs on an IdealProc that has some other shite.

So a simple loop helps :


template <typename t_func>
uint64 GetFuncTime( t_func * pfunc )
{
    HANDLE proc = GetCurrentProcess();
    HANDLE thread = GetCurrentThread();
    
    DWORD_PTR affProc,affSys;
    GetProcessAffinityMask(proc,&affProc,&affSys);
    
    uint64 tick_range = 1ULL << 62;
    
    for(int rep=0;rep<24;rep++)
    {
        DWORD mask = 1UL<<rep;
        if ( mask & affProc )
            SetThreadAffinityMask(thread,mask);
        else
            continue;   

        uint64 t1 = __rdtsc();
        (*pfunc)();
        uint64 t2 = __rdtsc();

        uint64 cur_tick_range = t2 - t1;
        tick_range = MIN(tick_range,cur_tick_range);

    }

    SetThreadAffinityMask(thread,affProc); // restore the original affinity (an out-of-mask value can just fail)

    return tick_range;
}

which makes it reasonably probable that you get a clean run on some core. For published results you will still want to repeat the whole thing N times.