11/09/2011

11-09-11 - Weird shite about Exceptions in Windows

What happens when an exception is thrown in Windows ? (please fill in any gaps, I haven't researched this in great detail).

1. The VectoredExceptionHandlers are called. One of these you may not be aware of is the "first chance" exception handler that the MSVC debugger installs. If you have the flags set in a certain way, this will cause you to breakpoint at the spot of the throw without passing the exception on to the SEH chain.

2. The list of __except() handlers is walked and those filters are invoked; if the filter takes the exception then they handle it.

* of note here is the change from x86 to x64. Under x86 SEH handlers were made on the stack and then tacked onto the list as you descended (basically the __try corresponds to tacking on the handler); under x64 that is all removed and the SEH filter walk relies on being able to trace back up the function call stack. Normally there's no difference, however under x64 if your function call stack can't be walked for some reason, then your SEH filters won't get called! This can happen for a few reasons; it can happen due to the 32-64 thunk layer, it can happen if you manually create some ASM or "naked" functions and don't maintain the stack trace info correctly, and it can happen of course if you stomp the return addresses in the stack. See for example : The case of the disappearing OnLoad exception � user-mode callback exceptions in x64 at Thursday Night . (stomping the stack can of course ruin SEH on x86 as well since the exception registration structures are on the stack).

More info on the x64 SEH trace : at osronline and at nynaeve .

3. If no filter wanted the exception, it goes up to the UnhandledExceptionFilter. In MSVC's CRT this is normally set to __CxxUnhandledExceptionFilter, that function itself will check if a debugger is present and do different things (eg. breakpoint).

4. If UnhandledExceptionFilter still didn't handle the exception and passes it on, the OS gets it and you get the application critical error popup box. Depending on your registry settings this may invoke automatic debugging. However as noted here : SetUnhandledExceptionFilter and VC8 - Jochen Kalmbach's WebLog there is a funny bypass where the OS will pass the exception directly to the OS handler and not call your filter.

Automatic debugging is controlled by

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AeDebug].  
when it was first introduced it defaulted to Dr Watson. At some point (Vista?) that was changed to point it to the Windows Troubleshooting engine instead. I believe that when you install DevStudio this registry key is changed to point to vsjitdebugger. The key "Auto" is set to 0 by default which means ask before popping into the debugger.

To clarify a bit about what happens with unhandled exceptions : your unhandled exception callback is not called first, and is not necessarily called at all. After all the SEH filters get their chance, the OS calls its own internal "UnhandledExceptionFilter" - not your callback. This OS function checks if you are in a debugger and might just pass off the exception to the debugger (this is *not* the "first chance" check which is done based on your MSVC check boxes). This function also might just decide that the exception is a security risk and pass it straight to the AeDebug. If none of those things happen, then your filter may get called. (this is where the CRT CxxUnhandledExceptionFilter would get called if you didn't install anything).

Another note : the standard application error popup box just comes from UnhandledExceptionFilter. One of the ways you can get a silent application exit with no popup is if the OS detects that your SEH chain is corrupted, it will just TerminateProcess on your ass and drop you out. Similarly if you do something bad from inside one of your exception handlers. (another way you can get a silent TerminateProcess is if you touch things during thread or process destruction; eg. from a DLL_THREAD_DETACH or something like that, if you try to enter crit secs that are being destroyed you can get a sudden silent process exit).


Some links :

DebugInfo.com - Unexpected user breakpoint in NTDLL.DLL
Under the Hood New Vectored Exception Handling in Windows XP
SetUnhandledExceptionFilter Anti Debug Trick � Evilcodecave�s Weblog
SetUnhandledExceptionFilter and VC8 - Jochen Kalmbach's WebLog
SetErrorMode function
C++ tips AddVectoredExceptionHandler, AddVectoredContinueHandler and SetUnhandledExceptionFilter - Zhanli's tech notes - Sit
A Crash Course on theDepths of Win32 Structured Exception Handling, MSJ January 1997

A bit of interesting stuff about how the /RTC run time checks are implemented :

Visual C++ Debug Builds��Fast Checks� Cause 5x Slowdowns Random ASCII

A bit about stack sizes on windows, in particular there are *two* thread stack sizes (the reserved and initial commit) and people don't usually think about that carefully when they pass a StackSize to CreateThread :

Thread Stack Size

Not directly related but interesting :

Pushing the Limits of Windows Processes and Threads - Mark's Blog - Site Home - TechNet Blogs
Postmortem Debugging Dr Dobb's
John Robbins' Blog How to Capture a Minidump Let Me Count the Ways
Collecting User-Mode Dumps
Automatically Capturing a Dump When a Process Crashes - .NET Blog - Site Home - MSDN Blogs

11/08/2011

11-08-11 - Differences Running in Debugger

Bugs that won't repro under debugging are the worst. I'm not talking about "debug" vs "release" builds; I mean the exact same exe, run in the debugger vs. not in the debugger.

What I'd like is to assemble a list of the differences between running under the debugger and not under the debugger. I don't really know the answers to this so this is a community participation post. eg. you fill in the blanks.

Differences in running under the debugger :

1. Timing. A common problem now with heavily threaded apps are bugs due to timing variation. But where do the timing differences come from exactly?

1.a. OutputDebugString. Duh, affects timing massively. Similarly anything you do dependent on IsDebuggerPresent().

1.b. VC-generated messages about thread creation etc. These obviously affect timing. You can disable them being shown by right-clicking in the output window of the debugger, but the notification is still being sent so you can never completely eliminate the timing difference for creating/destroying threads. (and the debugger does a lot more work for thread accounting anyway, so create/destroy will always have significant timing variation).

2. Exceptions. (not C++ exceptions, which are handled pretty uniformly, but more the low level SEH exceptions like access violations and such). Obviously in the debugger you can toggle the handling of various exceptions and that can change behavior. One thing I'm not sure of is if there are any registry settings or other variables that control exception behavior in NON-debugged runs? (* more on this in another post)

3. Stack. Is there a difference here? Not that I know of.

4. Debug Heap. This is probably the biggest one. Processes run in the debugger on windows *always* get the debug heap, even if you didn't ask for it. You can turn this off by setting _NO_DEBUG_HEAP as an environment variable or starting MSVC with -hd. See Behavior of Spawned Processes .

Note that this isn't coming from MSVC, it's actually in ntdll. When you create your process heap, ntdll does a "QueryInformationProcess" and sees if it's being debugged, and if so it stuffs in the debug heap. The important thing is that this is at heap creation time, which leads to a solution.

5. Child Process issues. Because the debugged process is a child process of the debugger, it inherits its process properties. (the same issue can occur for running under "cmd" vs. spawning from explorer). Two specifics are "permissions" and environment variables. Another inherited value is the "ErrorMode" as in "GetErrorMode/SetErrorMode".

There's a solution to #4 and #5 which is this :

Start your app outside of the debugger. Make it do an int 3 so it pauses. Then attach the debugger. You can now debug bug you don't get some of the ran-from-debugger differences.

(note to self about attaching : for some reason the MSVC "attach to running process" seems to fail a lot; there are other ways to do it though, when you get an int 3 message box popup you can click "debug" there, or from task manager or procexp you can find the task and click "debug" there).

11/03/2011

11-03-11 - BoolYouMustCheck

I've got a lot of functions that return error codes (rather than throw or something). The problem with that is that it's very easy to just not check the error code and then you have incorrect code that can possibly break in a nasty way if the error case is hit and not detected.

One way to test this is like this :


class BoolYouMustCheck
{
private:
    bool m_b;
    mutable bool m_checked;

public :

    //BoolYouMustCheck() : m_b(false), m_checked(false) { }
    BoolYouMustCheck(bool b) : m_b(b), m_checked(false) { }
    
    ~BoolYouMustCheck()
    {
        ASSERT( m_checked );
    }
    
    operator bool () const
    {
        m_checked = true;
        return m_b;
    }

};

it's just a proxy for bool which will assert if it is assigned and never read.

So now you can take a function that returns an error condition, for example :


bool func1(int x)
{
    return ( x > 7 );
}

normally you could easily just call func1() and not check the value. But you change it to :

BoolYouMustCheck func2(int x)
{
    return ( x > 7 );
}

(in practice you probably just want to do #define bool BoolYouMustCheck)

Now you get :


{

    int y = clock();

    // asserts:
    func1(y);

    // asserts :
    bool b1 = func1(y);
    
    // okay :
    bool b2 = func1(y);
    if ( b2 )
        y++;
    
    // okay :
    if ( func1(y) )
        y++;
        
    return y;
}

which is kind of nice.

The only ugly thing is that the assert can be rather far removed from the line of code that caused the problem. In the first case (just calling func1 and doing nothing with the return value), you get an assert right away, because the returned class is destructed right away. But in the second case where you assign to b1, you don't get the assert until the end of function scope. I guess you could fix that by taking a stack trace in the constructor.

(note : if you want to intentionally ignore the return value b1 you can just add a line like (int) b1; to surpress the assert.

11/02/2011

11-02-11 - StringMatchTest Release

Code for my string match testbed discussed previously. I'm not gonna do the work to turn this into a clean standalone, so it's a big mess and you can take what you like out of it.

stringmatchtest.zip (45k)

Note : the stringmatchtest.vcproj project refers to some files that are not included in this distribution. Just delete them from the project.

Requires cblib.zip (633k)

You may also need STLPort (I haven't tried building with the VC STL , I use STLPort 5.1.5 or 5.2.1). (BTW I had to modify the STLPort headers to make it build on VS 2008 ; the mods should be obvious).

Tested with VC 2005 and 2008. Does not build with VC 2010 currently.

The most interesting bit is probably in test_suffixarray, which implements the three suffix-array based string searchers previously described on this blog. See previous posts :

cbloom rants 06-17-10 - Suffix Array Neighboring Pair Match Lens
cbloom rants 09-23-11 - Morphing Matching Chain
cbloom rants 09-25-11 - More on LZ String Matching
cbloom rants 09-27-11 - String Match Stress Test
cbloom rants 09-28-11 - Algorithm - Next Index with Lower Value
cbloom rants 09-28-11 - String Matching with Suffix Arrays
cbloom rants 10-02-11 - How to walk binary interval tree
cbloom rants 09-24-11 - Suffix Tries 1
cbloom rants 09-24-11 - Suffix Tries 2
cbloom rants 09-26-11 - Tiny Suffix Note
cbloom rants 09-29-11 - Suffix Tries 3 - On Follows with Path Compression

cbloom rants 09-30-11 - String Match Results Part 1
cbloom rants 09-30-11 - String Match Results Part 2
cbloom rants 09-30-11 - String Match Results Part 2b
cbloom rants 09-30-11 - String Match Results Part 3
cbloom rants 09-30-11 - String Match Results Part 4
cbloom rants 09-30-11 - String Match Results Part 5 + Conclusion
cbloom rants 10-01-11 - String Match Results Part 6

StringMatchTest includes :


/*
 * divsufsort.c for libdivsufsort-lite
 * Copyright (c) 2003-2008 Yuta Mori All Rights Reserved.
 *

/* LzFind.c -- Match finder for LZ algorithms
2009-04-22 : Igor Pavlov : Public domain */

/*
    MMC (Morphing Match Chain)
    Match Finder
    Copyright (C) Yann Collet 2010-2011

StringMatchTest like all cbloom.com software is released under zlib license (basically free for all uses).

10/27/2011

10-27-11 - Tiny LZ Decoder

I get 17 bytes for the core loop (not including putting the array pointers in registers because presumably they already are there if you care about size) .

My x86 is rusty but certainly the trick to being small is to use the ancient 1 byte instructions, which conveniently the string instructions are. For example you might be tempted to read out length & offset like this :


        mov cl,[esi]    // len
        mov al,[esi+1]  // offset
        add esi,2

but it's smaller to do

        lodsb  // len
        mov cl,al
        lodsb  // offset

because it keeps you in 1 byte instructions. (and of course any cleverness with lea is right out). (naively just using lodsw and then you have len and offset in al and ah is even better, but in practice I can't make that smaller)

Anyhoo, here it is. I'm sure someone cleverer with free time could do better.


__declspec(noinline) void TinyLZDecoder(char * to,char * fm,char * to_end)
{
    __asm
    {
        mov esi,fm
        mov edi,to
        mov edx,to_end
        xor eax,eax
        xor ecx,ecx
    more:
        movsb   // literal
        lodsb   // len
        mov cl,al
        lodsb   // offset
        push esi
        mov esi,edi
        sub esi,eax
        rep movsb   // match
        pop esi
        cmp edi,edx
        jne more
    }

}

------------------------------------------------------

    more:
        movsb   // literal
00401012  movs        byte ptr es:[edi],byte ptr [esi] 
        lodsb   // len
00401013  lods        byte ptr [esi] 
        mov cl,al
00401014  mov         cl,al 
        lodsb   // offset
00401016  lods        byte ptr [esi] 
        push esi
00401017  push        esi  
        mov esi,edi
00401018  mov         esi,edi 
        sub esi,eax
0040101A  sub         esi,eax 
        rep movsb   // match
0040101C  rep movs    byte ptr es:[edi],byte ptr [esi] 
        pop esi
0040101E  pop         esi  
        cmp edi,edx
0040101F  cmp         edi,edx 
        jne more
00401021  jne         more (401012h) 

Also obviously you would get much better compression with a literal run length instead of a single literal every time, and it only costs a few more bytes of instructions. You would get even better compression if the run len could be either a match len or a literal run len and that's just another few bytes. (ADDENDUM : see end)


A small "Predictor/Finnish" is something like this :


    __asm
    {
ByteLoop:   lodsb   // al = *esi++ // control byte
            mov edx, 0x100
            mov dl, al
BitLoop:
            shr edx, 1
            jz  ByteLoop
            jnc zerobit
            lodsb
            mov [ebx], al
zerobit:    mov al, [ebx]
            mov bh, bl
            mov bl, al
            stosb  // *edi++ = al
            jmp BitLoop
    }

the fast version of Finnish of course copies the bit loop 8 times to avoid looping but you can't do that if you want to be small.

I'm quite sure this could be smaller using some clever {adc esi} or such. Also the sentry bit looping is taking a lot of instructions and so on.

Note the crucial trick of "Finnish" is that the hash table must be 64k in size and 64k aligned, so you can do the hash update and the table address computaton just by cycling the bottom two bytes of ebx. (I use the name "Predictor" for the generic idea of the single bit prediction/literal style compressor; "Finnish" is the specific variant of Predictor that uses this bx trick).

(note that this is not remotely the fast way to write these on modern CPU's)


ADDENDUM : Better (in the sense of more compression) LZ decoder in 22 bytes (core loop only) :


__declspec(noinline) void TinyLZDecoder(char * to,char * fm,char * to_end)
{
    __asm
    {
        mov esi,fm
        mov edi,to
        mov edx,to_end
        xor eax,eax
        xor ecx,ecx
    more:
        mov cl,[esi]    // len
        inc esi
        shr cl,1        // bottom bit is flag
        jc literals
        
    //match:
        lodsb   // offset -> al
        push esi
        mov esi,edi
        sub esi,eax
        rep movsb   // copy match
        pop esi
    
        // ecx is zero, just drop through
    literals:
        rep movsb  // copy literal run
    
        cmp edi,edx
        jne more
    }

}

Note that obviously if your data is bigger than 256 bytes you can use a two byte match offset by doing lodsw instead of lodsb.

x86 of course is a special case that's particularly amenable to small LZ decoders; it's not really fair, all other instruction sets will be much larger. (which is sort of ironic because RISC is supposed to make instruction encoding smaller) (addendum : not actually true, that last bit)

10/26/2011

10-26-11 - Tons

In the US, a "ton" = 2000 pounds.

In the UK a "ton" is 2240 pounds (which comes from twenty "hundredweights" where a "hundredweight" is eight stone, and a stone is 14 pounds, WTF Britain).

A "metric ton" is obviously 1000 kg. In the UK this is officially called a "tonne" which you will see in technical documents, but I don't see that used much in casual writing, and it's certainly confusing when spoken since it sounds the same. (but a UK ton is very close to a metric ton (2204.6 pounds) so the mixup here surely happens all the time and is not a huge problem).

(when you hear someone in the UK phonetically say "ton" do they mean "tonne" or imperial ton?)

To differentiate the US ton vs UK ton they can be called "short ton" or "long ton".


On a related note, a pint is not a pound *anywhere* in the world.

In the UK, 1 oz by volume of water = 1 oz of weight. But a "pint" in the UK is 20 oz. So a pint is 1.25 pounds (a gallon is exactly 10 pounds)

In the US, 1 oz by volume of water = 1.041 oz of weight, so a pint = 1.041 pounds. (and a gallon = 8.33 pounds).

(neither liquid ounce is anything neat in terms of volume; the only nice whole number unit is the US gallon which is 231 cubic inches)

(the weight measures are the same in the US and UK, it's the US volume measure which went weird (1.041), and I believe it was done in order to make the gallon an integer number of cubic inches)

If you want to get technical, a (US) "pint's a pound" at some high temperature. (...some digging...) actually it's very close just before boiling. It looks like 98 C water is almost exactly a pound per pint (US).

Actually there is a sort of cute book-end of the ranges of water density there :

Very close to freezing (4 C) water is 1 g/ml , and very close to boiling (98 C) it's a pound per (US) pint. The difference is a factor of about 0.96.

10-26-11 - The Eight Month Cruise

If you're buying a home in Seattle, you should always try to do so in early spring. This will give you a few months of wetness to see any problems, and then you'll have the whole summer to deal with them before the wet sets in again. (it also lets you see the homes during rains, which lets you look for water incursions while you shop). I tried to time it that way, but the home shopping took too long and I didn't wind up buying until late summer.

The problem with Seattle is that once it starts raining (around Oct 1 pretty reliably) it literally does not stop for the next 8 months. Sure maybe it stops for a day or two, but never long enough for the whole house to dry out and then give you a big chunk of dry days to do something like replace the roof or paint the exterior.

Fortunately I don't have any problem so large as that, but even for minor things it's damn annoying. For example some time around September I realized that I really need to get a coat of waterproofer on all my decking. Oh well, it's gonna have to wait 8 months. There's a couple of spots I need to touch up exterior paint, but you can't really do a good job of painting without a solid 5 days of dry and decent warmth.

It's almost like our houses are boats, and we go on an 8 month aquatic voyage every year. You really need to use those 4 months as a chance to get your boat up on dry dock, scrub off the moss, dig out the dry rot, apply epoxy, sand and paint, etc.

10/24/2011

10-24-11 - LZ Optimal Parse with A Star Part 1

First two notes that aren't about the A-Star parse :

1. All good LZ encoders these days that aren't optimal parsers use complicated heuristics to try to bias the parse towards a cheaper one. The LZ parse is massively redundant (many parses can produce the same original data) and the cost is not the same. But in the forward normal parse you can't say for sure which decision is cheapest, so they just make some guesses.

For example the two key crucial heuristics in LZMA are :


// LZMA lastOffset heuristic :
  if (repLen >= 2 && (
        (repLen + 1 >= mainLen) ||
        (repLen + 2 >= mainLen && mainDist >= (1 << 9)) ||
        (repLen + 3 >= mainLen && mainDist >= (1 << 15))))
  {
    // take repLen instead of mainLen
  }

// LZMA lazy heuristic :

        // Truncate current match if match at next position will be better (LZMA's algorithm)
        if (nextlen >= prevlen && nextdist < prevdist/4 ||
            nextlen == prevlen + 1 && !ChangePair(prevdist, nextdist) ||
            nextlen > prevlen + 1 ||
            nextlen + 1 >= prevlen && prevlen >= MINLEN && ChangePair(nextdist, prevdist))
        {
             return MINLEN-1;
        } else {
             return prevlen;
        }

One is choosing a "repeat match" over a normal match, and the next is choosing a lazy match (literal then match) over greedy (immediate match).

My non-optimal LZ parser has to make similar decisions; what I did was set up a matrix and train it. I made four categories for a match based on what the offset is : {repeat, low, medium, high } , so to decide between two matches I do :


  cat1 = category of match 1 offset
  cat2 = category of match 2 offset
  diff = len1 - len2

  return diff >= c_threshold[cat1][cat2]
  
so c_threshold is a 4x4 matrix of integers. The values of threshold are in the range [0,3] so it's not too many to just enumerate all possibilities of threshold and see what's best.

Anyway, the thing about these heuristics is that they bias the parse in a certain way. They assume a certain cost tradeoff between literals vs. repeat matches vs. normal matches, or whatever. When the heuristic doesn't match the file they do badly. Otherwise they do amazingly well. One solution might be having several heuristics trained on different files and choose the one that creates the smallest output.

Also I should note - it's not trivial to tell when you have the heuristic wrong for the file. The problem is that there's a feedback loop between the parse heuristic and the statistical coder. That causes you to get into a local minimum, and you might not see that there's a better global minimum which is only available if you make a big jump in parsing style.

Repeating that more explicitly : the statistical coder will adapt to your parse heuristic; say you have a heuristic that prefers low offsets (like the LZMA heuristic above), that will cause your parse to select more low offsets, that will cause your statistical backend to code low offsets smaller. That's good, that's what you want, that's the whole point of the heuristic, it skews the parse intentionally to get the statistics going in the right direction. The issue then is that if you try to evaluate an alternative parse using the statistics that you have, it will look bad, because your statistics are trained for the parse you have.

2. It's crazy that LZ compression can do so well with so much redundancy in the code stream.

Think about it this way. Enumerate all the compressed bit streams that are created by encoding all raw bit streams up to length N.

Start with the shortest compressed bit stream. See what raw bits that decodes to. Now there may be several more (longer) compressed bit streams that decode to that same raw bit stream. Mark those all as unused.

Move to the next shortest compressed bit stream. First if there is a shorter unused one, move it into that slot. Then mark all other output streams that make the same raw bits as unused, and repeat.

For example a file like "aaaaaaa" has one encoding that's {a, offset -1 length 6} ; but there are *tons* of other encodings, such as {a,a,[-1,5]} or {a,[-1,3],[-1,3]} or {a,[-1,2],[-2,4]} etc. etc.

All the encodings except the shortest are useless. We don't want them. But even the ones that are not the best are quite short - they are small encodings, and small encodings take up lots of code space (see Kraft Inequality for example - one bit shorter means you take up twice the code space). So these useless but short encodings are real waste. In particular, there are other raw strings that don't have such a short encoding that would like to have that output length.

Anyhoo, surprisingly (just like with video coding) it seems to be good to add even *more* code stream redundancy by using things like the "repeat matches". I don't think I've ever seen an analysis of just how much wasted code space there is in the LZ output, I'm curious how much there is.

Hmm we didn't actually get to the A Star. Next time.

10/18/2011

10-18-11 - To wrap it up or move on

One of the things I've really struggled with in the last few years at RAD is the trade off between wasting time on a side shoot of the main code line vs. really wrapping something up while your focus is on it.

Generally after I've spent a week or two on a topic I start feeling like "okay that's enough time on this I need to move on" ; I guess that's my internal project manager watching my schedule. But on the other hand, there's such a huge advantage to staying on a topic while it fills your mind. You just lose so much momentum if you have to come back to it later.

For example I wish I had finished my JPEG decoder back when I was working on it, that was an important piece of software I believe, but I felt like I needed to move on to more practical tasks, and now it's all out of my head and would take me several weeks to figure out all the nuances again.

Currently I'm working on a new way to do LZ optimal parsing, and I feel like I need to move on because it's not that important to my product and I need to get onto more practical tasks, but at the same time I hate to leave it now because I feel close to a breakthrough and I know that if I stop now I may never come back to it, and if I do it will take forever to get back into the flow.

10-18-11 - StringMatchTest - Hash 1b

For reference :

The good way to do the standard Zip-style Hash -> Linked List for LZ77 parsing.

There are two tables : the hash entry point table, which gives you the head of the linked list, and the link table, which is a circular buffer of ints which contain the next position where that hash occured.

That is :


  hashTable[ hash ]  contains the last (most recent preceding) position that hash occured
  chainTable[ pos & (window_size-1) ]  contains the last position of the hash at pos before pos

To walk the table you do :

  i = hashTable[ hash ];
  while ( i in window )
    i = chainTable[ i & (window_size-1) ]

To update the table you do :

  head = hashTable[ hash ];
  hashTable[hash] = pos;
  chainTable[ pos & (window_size-1) ] = head;

And now for some minor details that are a bit subtle. We're going to go through "Hash1" from StringMatchTest which I know I still haven't posted.

int64 Test_Hash1(MATCH_TEST_ARGS)
{
    uint8 * buf = (uint8 *)charbuf;
    const int window_mask = window_size-1;
        
    vector<int> chain; // circular buffer on window_size
    chain.resize(window_size);
    int * pchain = chain.data();
    
    const int hash_size = MIN(window_size,1<<20);
    const int hash_mask = hash_size-1;
    
for small files or small windows, you can only get good per-byte speed if you make hash size proportional with that MIN. (what you can't see is outside the Test_ I also made window_size be no bigger than the smallest power of 2 that encloses file size).


    vector<int> hashTable; // points to pos of most recent occurance of this hash
    hashTable.resize(hash_size);
    int * phashTable = hashTable.data();
    
    memset(phashTable,0,hash_size*sizeof(int));

As noted previously, for large hashes you can get a big win by using a malloc that gives you zeros. I don't do it here for fairness to the other tests. I do make sure that my initialization value is zero so you can switch to VirtualAlloc/calloc.

    int64 total_ml = 0;
    
    // fiddle the pointers so that position 0 counts as being out of the window
    int pos = window_size+1;
    buf -= pos;
    ASSERT( (char *)&buf[pos] == charbuf );
    size += pos;

I don't want to do two checks in the inner loop for whether a position is a null link vs. out of the window. So I make the "null" value of the linked list (zero) be out of the window.

    for(;pos<(size-TAIL_SPACE_NEEDED);)
    {
        MATCH_CHECK_TIME_LIMIT();

        // grab 4 bytes (@ could slide here)
        uint32 cur4 = *((uint32 *)&buf[pos]);
        //ASSERT( cur4 == *((uint32 *)&buf[pos]) );

On PC's it's fastest just to grab the unaligned dword. On PowerPC it's faster to slide bytes through the dword. Note that endian-ness changes the value, so you may need to tweak the hash function differently for the two endian cases.

        // hash them 
        uint32 hash = hashfour(cur4) & hash_mask;
        int hashHead =  phashTable[hash];
        int nextHashIndex = hashHead;
        int bestml = 0;
        int windowStart = pos-window_size;
        ASSERT( windowStart >= 0 );
        
        #ifdef MAX_STEPS
        int steps = 0;
        #endif

Start the walk. Not interesting.

        while( nextHashIndex >= windowStart )
        {
            uint32 vs4 = *((uint32 *)&buf[nextHashIndex]);
            int hi = nextHashIndex&window_mask;
            if ( vs4 == cur4 )
            {
                int ml = matchlenbetter(&buf[pos],&buf[nextHashIndex],bestml,&buf[size]);
                    
                if ( ml != 0 )
                {
                    ASSERT( ml > bestml );
                    bestml = ml;
                    
                    // meh bleh extra check cuz matchlenbetter can actually go past end
                    if ( pos+bestml >= size )
                    {
                        bestml = size - pos;
                        break;
                    }
                }
            }

            #ifdef MAX_STEPS
            if ( ++steps > MAX_STEPS )
                break;
            #endif
                                
            nextHashIndex = pchain[hi];
        }

This is the simple walk of the chain of links. Min match len is 4 here which is particularly fast, but other lens can be done similarly.

"MAX_STEPS" is the optimal "amortize" (walk limit) which hurts compression in an unbounded way but is necessary for speed.

"matchlenbetter" is a little trick ; it's best to check the character at "bestml" first because it is the most likely to differ. If that char doesn't match, we know we can't beat the previous match, so we can stop immediately. After that I check the chars in [4,bestml) to ensure that we really do match >= bestml (the first 4 are already checked) and lastly the characters after, to extend the match.

The remainder just updates the hash and is not interesting :


        ASSERT( bestml == 0 || bestml >= 4 );
        total_ml += bestml;
        
        // add self :
        //  (this also implicitly removes the node at the end of the sliding window)
        phashTable[hash] = pos;
        int ci = pos & window_mask;
        pchain[ci] = hashHead;
                
        if ( greedy && bestml > 0 )
        {
            int end = pos+bestml;
            pos++;
            ASSERT( end <= size );
            for(;pos<end;pos++)
            {               
                uint32 cur4 = *((uint32 *)&buf[pos]);
                
                // hash them 
                uint32 hash = hashfour(cur4) & hash_mask;
                int hashHead =  phashTable[hash];
                phashTable[hash] = pos;
                int ci = pos & window_mask;
                pchain[ci] = hashHead;      
            }
            pos = end;
        }
        else
        {
            pos++;
        }
    }
    
    return total_ml;
}

Note that for non-greedy parsing you can help the O(N^2) in some cases by setting bestml to lastml-1. This helps enormously in practice because of the heuristic of matchlenbetter but does not eliminate the O(N^2) in theory. (the reason it helps in practice is because typically when you find a very long match, then the next byte will not have any match longer than lastml-1).

(but hash1 is really just not the right choice for non-greedy parsing; SuffixArray or SuffixTrie are vastly superior)

old rants