Fuck this shit. I'm going to Hawaii.
11/23/2011
11/22/2011
11-22-11 - The Mature Programmer
The mature programmer manages their own time and productivity well. The MP knows that maintenance is as much work as the initial writing and code always takes longer than you think. The MP knows that any changes to code can introduce bugs, no matter how seemingly trivial. The MP knows that premature optimization is foolish and dangerous. The MP knows that sexy coding like writing big complex systems from scratch is rarely the best way to go. The MP does not get into ego competitions about who has the prettiest code. The MP acheives the best final result in the minimum amount of time.
When I started at Oddworld, I was watching lots of game companies get into dick-waving contests about who had the greatest home-rolled graphics engine, and I was also watching lots of indie developers spend massive amounts of time on their "one true" product and never actually ship it. I resolved that we would not fall into those traps - we would be humble and not reach too far, we would not let our egos stop us from licensing code or using old fashioned solutions to problems, we would stay focused on the end product - any sexy code that didn't produce a visible benefit in the actual shipping game was nixed. For the most part I think we succeeded in that (there were a few digressions that were mainly due to me).
But the way of the Mature Programmer can be a trap which comes back to bite you.
The problem is that writing code in this way is not very much fun. Sure there's the fun of making the product - and if you're working on a game and believe in the game and the team, then just seeing the good product come out can give you motivation. But if you don't have that, it can be a real slog.
Most of us got into programming not for the end products that we create, but because the programming itself is a joy. Code can be beautiful. Code can be a clever, artistic, exciting creation, like a good mathematical proof. The Mature Programmer would say that "clever code is almost always dangerous code". But fuck him. The problem is that when you get carried away with being "mature" you suck the joy right out coding.
You need to allow yourself a certain amount of indescretions to keep yourself happy with your code. Sure those templates might not actually be a good idea, but you enjoy writing the code that way - fine, do it. Yes, you are optimizing early and it just makes the code harder to maintain and harder to read and more buggy - but you love to do that, fine, do it.
Obviously you can't go overboard with this, but I think that I (and many others) have gone overboard with being mature. Basically in the last ten years of my evolution as a coder I have become less of a wild card "hot shot" and more of a productivity manager, an efficient task analyzer and proactive coordinater of code-actualizing solutions. It's like a management beaurocracy of one inside my head. It's horrible.
I think there are two factors to consider : first is that being "mature" and productive can cause burnout which winds up hurting your productivity, or it can just make coding unpleasant so you spend fewer hours at it. Most "mature" coders brag about the fact that they can get as much done in 6 hours as they used to do in 14. But those 14 hours were FUN, you coded that long because you loved it, you couldn't get to sleep at night because you wanted to code more; now the 6 hours is all sort of unpleasant because instead of rolling your own solution you're just tying together some java and perl packages. Second is that being productive is not the only goal. We are coding to get some task done and to make money, but we're also coding because we enjoy it, and actually being less productive but enjoying your coding more may be a net +EV.
2. The healthy opposition of a producer
Many programmers in normal coding jobs hate having the interference of a producer (or corporate management, or the publisher, or whatever). This person enforces stupid schedules and won't let us do the features we want, and urrgh we hate them! These coders long to be able to make their own schedules and choose their own tasks and be free.
It's actually a very healthy and much more relaxing in many ways to have that opposition. When you have to schedule yourself or make your own decisions about tasks, you become responsible for both the creative "reach for the sky" side and the responsible "stay in budget" side. It's almost impossible to do a good job of both sides. This can happen if you are an indie or even if you are a powerful lead with a weak producer.
Most creative industries know that there is a healthy opposition in having the unconscrained creative dreamer vs. the budget-enforcing producer. You don't want the dreamer to be too worried about thinking about schedules or what's possible - you just want them to make ideas and push hard to get more.
When you have to cut features or crunch or whatever, it's nice to have that come from outside - some other force makes you do it and you can hate them and get on with it. It's nice to have that external force to blame that's not on your team; it gives you a target of your frustration, helps bonding, and also gives you an excuse to get the job done (because they told you to).
When you have to balance dreams vs schedules on your own, it adds an intellectual burden to every task - as you do each task you have to consider "is this worth the time? is this the right task to do now? should I do a simpler version of this?" which greatly reduces your ability to focus just on the task itself.
3. Coding standards
It's kind of amazing to me how many experienced programmers still just don't understand programming. The big difficulty in programming is that the space of the ways to write something are too large. We can get lost in that space.
One of the problems is simply the intellectual overload. Smart coders can mistakenly believe that they can handle it, but it is a burden on everyone. Every time you write a line of code, if you have to think "should I use lower case or mixed caps?" , or "should I make this a function or just write it in line?" , your brain is spending masses of energy on syntactic decisions and doesn't have its full power for the functionality. Strict coding standards are actually an intellectual relief because they remove all those decisions and give you a specific way to do the syntax. (The same of course goes for reading other people's code - your eyes can immediately start looking at the functionality, not try to figure out the current syntax)
The other big benefit of coding standards is creating a "meta language" which is smaller than the parent language and enforces certain invariants. By doing that you again reduce the space that the brain has to consider. For example you might require that all C macros behave like functions (eg. don't eat scopes and don't declare variables). Now when I see one I know I don't have to worry about those things. Or you might require that globals are never externed and only get accessed through functions called "GetGlobal_blah". It doesn't really matter what they are as long as they are simple, clear, uniform, and strictly enforced, because only if they are totally reliable can you stop thinking about them.
4. The trap of "post-Mature Programmer" ism.
Many great coders of my generation have gone through the strict clean rules-following coder phase and have moved onto the "post" phase. The "post-mature programmer" knows the importance of following strict coding style rules or not indulging themselves too much, but also sees the benefit of bending those rules and believes that they can be a bit more free about deciding on what to do for each situation.
I believe that they/we mostly get this wrong.
The best analogy I can think of is poker. Most successful poker players go through several phases. First you think you're ever so clever and you can bluff and trap people and play all sorts of weird lines. Once you move up levels and start playing serious poker this delusion is quickly wiped out and you realize you need to go back to fundemantals. So then most people will go through the TAG "standard line" phase where they learn the right thing to do in each situation and the standard way to analyze hands, and they will be quite successful with this. (note that "standard line" doesn't mean nitty, it involves things like squeeze plays and even check-shove river bluffs, but it's based on playing a balanced range and studying EV). But then they are so successful with their solid play that they start to think they can get away with "mixing it up", playing hands that are almost certainly not profitable because they think they are good enough post-flop to make up for it (eg. Durrr style), or imagining that by playing some minus EV hands it helps their image and pays off later.
This is almost always wrong. Limping AA is almost always wrong, opening 72o UTG is almost always wrong - maybe you've done some analysis and you've decided it's the right thing at this table at this moment (for example limping AA because the people behind you attack limpers way too much and they think you would never limp AA so they will get stuck easily). It's wrong.
(telling yourself that your current bad play is made up for with later "image value" is one of the great rationalizations that poker players use an excuse to justify their bad play. programmers due to same with a set of excuses like "performance" that are really just rationalizing justifications for their bad practices; with poker, EV in the hand is worth more than EV in the bush; that is, the later image value you might win is so small and dubious and depends on various things working out just right that it's almost never correct to give up known current value for possible future value. (a particularly simple case of this is "implied odds" which bad players use an excuse to chase hands they shouldn't))
The problem is that when you open yourself up to making any possible move at any moment, there is simply too much to consider. You can't possibly go through all those decisions from first principles and make the right choice. Even if you could, there's no way you can sustain it for thousands of hands. You're going to make mistakes.
The same is true in coding; the post-MP knows the value of encapsulating a bit of functionality into a struct + helpers (or a class), but they think I'm smart enough I can decide not to do that in this particular case. No! You are wrong. I mean, maybe you are in fact right in this particular case, but it's not a good use of your brain energy to make that decision, and you will make it wrong some times.
There is a great value in having simple rules. Like "any time I enter a pot preflop, I come in for a raise". It may not always be the best thing to do, but it's not bad, and it saves you from making possibly big mistakes, and most importantly it frees up your brain for other things.
The same thing happens with life decision making. There's a standard set of cliches :
Don't marry the first person you sleep with Don't get in a serious relationship off a rebound Don't buy anything if the salesman is pushing it really hard Take a day to sleep on any big decision Don't lend money to poor friends etc.you may think "I'm smart, I'm mature, I don't need these rules, I can make my own decision correctly based on the specifics of the current situation". But you are wrong. Sure, following the rules you might miss out on the truly optimum decision once in a while. But it's foolish arrogance to think that your mind is so strong that you don't need the protection and simplicity that the rules provide.
In poker the correct post-solid-player adjustment is very very small. You don't go off making wild plays all the time, that's over-confidence in your abilities and just "spew". A correctly evolved player basically sticks to the solid line and the standard way of evaluating, but knows how to indentify situations where a very small correction is correct. Maybe the table is playing too tight preflop, so in the hijack position you start opening the top 35% of hands instead of the top 25% of hands. You don't just start opening every hand. You stay within the scope of the good play that you understand and can do without rethinking your whole approach.
The same is true in programming I believe; the correct adjustment for post-mature coding is very small; you don't have to be totally dogmatic about making every member variable private, but you also don't just stop encapsulating classes at all.
11/09/2011
11-09-11 - Weird shite about Exceptions in Windows
1. The VectoredExceptionHandlers are called. One of these you may not be aware of is the "first chance" exception handler that the MSVC debugger installs. If you have the flags set in a certain way, this will cause you to breakpoint at the spot of the throw without passing the exception on to the SEH chain.
2. The list of __except() handlers is walked and those filters are invoked; if the filter takes the exception then they handle it.
* of note here is the change from x86 to x64. Under x86 SEH handlers were made on the stack and then tacked onto the list as you descended (basically the __try corresponds to tacking on the handler); under x64 that is all removed and the SEH filter walk relies on being able to trace back up the function call stack. Normally there's no difference, however under x64 if your function call stack can't be walked for some reason, then your SEH filters won't get called! This can happen for a few reasons; it can happen due to the 32-64 thunk layer, it can happen if you manually create some ASM or "naked" functions and don't maintain the stack trace info correctly, and it can happen of course if you stomp the return addresses in the stack. See for example : The case of the disappearing OnLoad exception � user-mode callback exceptions in x64 at Thursday Night . (stomping the stack can of course ruin SEH on x86 as well since the exception registration structures are on the stack).
More info on the x64 SEH trace : at osronline and at nynaeve .
3. If no filter wanted the exception, it goes up to the UnhandledExceptionFilter. In MSVC's CRT this is normally set to __CxxUnhandledExceptionFilter, that function itself will check if a debugger is present and do different things (eg. breakpoint).
4. If UnhandledExceptionFilter still didn't handle the exception and passes it on, the OS gets it and you get the application critical error popup box. Depending on your registry settings this may invoke automatic debugging. However as noted here : SetUnhandledExceptionFilter and VC8 - Jochen Kalmbach's WebLog there is a funny bypass where the OS will pass the exception directly to the OS handler and not call your filter.
Automatic debugging is controlled by
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AeDebug].
when it was first introduced it defaulted to Dr Watson. At some point (Vista?) that was changed to point it to the Windows Troubleshooting
engine instead. I believe that when you install DevStudio this registry key is changed to point to vsjitdebugger. The key "Auto" is set to 0
by default which means ask before popping into the debugger.
To clarify a bit about what happens with unhandled exceptions : your unhandled exception callback is not called first, and is not necessarily called at all. After all the SEH filters get their chance, the OS calls its own internal "UnhandledExceptionFilter" - not your callback. This OS function checks if you are in a debugger and might just pass off the exception to the debugger (this is *not* the "first chance" check which is done based on your MSVC check boxes). This function also might just decide that the exception is a security risk and pass it straight to the AeDebug. If none of those things happen, then your filter may get called. (this is where the CRT CxxUnhandledExceptionFilter would get called if you didn't install anything).
Another note : the standard application error popup box just comes from UnhandledExceptionFilter. One of the ways you can get a silent application exit with no popup is if the OS detects that your SEH chain is corrupted, it will just TerminateProcess on your ass and drop you out. Similarly if you do something bad from inside one of your exception handlers. (another way you can get a silent TerminateProcess is if you touch things during thread or process destruction; eg. from a DLL_THREAD_DETACH or something like that, if you try to enter crit secs that are being destroyed you can get a sudden silent process exit).
Some links :
DebugInfo.com - Unexpected user breakpoint in NTDLL.DLL
Under the Hood New Vectored Exception Handling in Windows XP
SetUnhandledExceptionFilter Anti Debug Trick � Evilcodecave�s Weblog
SetUnhandledExceptionFilter and VC8 - Jochen Kalmbach's WebLog
SetErrorMode function
C++ tips AddVectoredExceptionHandler, AddVectoredContinueHandler and SetUnhandledExceptionFilter - Zhanli's tech notes - Sit
A Crash Course on theDepths of Win32 Structured Exception Handling, MSJ January 1997
A bit of interesting stuff about how the /RTC run time checks are implemented :
Visual C++ Debug Builds��Fast Checks� Cause 5x Slowdowns Random ASCII
A bit about stack sizes on windows, in particular there are *two* thread stack sizes (the reserved and initial commit) and people don't usually think about that carefully when they pass a StackSize to CreateThread :
Not directly related but interesting :
Pushing the Limits of Windows Processes and Threads - Mark's Blog - Site Home - TechNet Blogs
Postmortem Debugging Dr Dobb's
John Robbins' Blog How to Capture a Minidump Let Me Count the Ways
Collecting User-Mode Dumps
Automatically Capturing a Dump When a Process Crashes - .NET Blog - Site Home - MSDN Blogs
11/08/2011
11-08-11 - Differences Running in Debugger
What I'd like is to assemble a list of the differences between running under the debugger and not under the debugger. I don't really know the answers to this so this is a community participation post. eg. you fill in the blanks.
Differences in running under the debugger :
1. Timing. A common problem now with heavily threaded apps are bugs due to timing variation. But where do the timing differences come from exactly?
1.a. OutputDebugString. Duh, affects timing massively. Similarly anything you do dependent on IsDebuggerPresent().
1.b. VC-generated messages about thread creation etc. These obviously affect timing. You can disable them being shown by right-clicking in the output window of the debugger, but the notification is still being sent so you can never completely eliminate the timing difference for creating/destroying threads. (and the debugger does a lot more work for thread accounting anyway, so create/destroy will always have significant timing variation).
2. Exceptions. (not C++ exceptions, which are handled pretty uniformly, but more the low level SEH exceptions like access violations and such). Obviously in the debugger you can toggle the handling of various exceptions and that can change behavior. One thing I'm not sure of is if there are any registry settings or other variables that control exception behavior in NON-debugged runs? (* more on this in another post)
3. Stack. Is there a difference here? Not that I know of.
4. Debug Heap. This is probably the biggest one. Processes run in the debugger on windows *always* get the debug heap, even if you didn't ask for it. You can turn this off by setting _NO_DEBUG_HEAP as an environment variable or starting MSVC with -hd. See Behavior of Spawned Processes .
Note that this isn't coming from MSVC, it's actually in ntdll. When you create your process heap, ntdll does a "QueryInformationProcess" and sees if it's being debugged, and if so it stuffs in the debug heap. The important thing is that this is at heap creation time, which leads to a solution.
5. Child Process issues. Because the debugged process is a child process of the debugger, it inherits its process properties. (the same issue can occur for running under "cmd" vs. spawning from explorer). Two specifics are "permissions" and environment variables. Another inherited value is the "ErrorMode" as in "GetErrorMode/SetErrorMode".
There's a solution to #4 and #5 which is this :
Start your app outside of the debugger. Make it do an int 3 so it pauses. Then attach the debugger. You can now debug bug you don't get some of the ran-from-debugger differences.
(note to self about attaching : for some reason the MSVC "attach to running process" seems to fail a lot; there are other ways to do it though, when you get an int 3 message box popup you can click "debug" there, or from task manager or procexp you can find the task and click "debug" there).
11/03/2011
11-03-11 - BoolYouMustCheck
One way to test this is like this :
class BoolYouMustCheck
{
private:
bool m_b;
mutable bool m_checked;
public :
//BoolYouMustCheck() : m_b(false), m_checked(false) { }
BoolYouMustCheck(bool b) : m_b(b), m_checked(false) { }
~BoolYouMustCheck()
{
ASSERT( m_checked );
}
operator bool () const
{
m_checked = true;
return m_b;
}
};
it's just a proxy for bool which will assert if it is assigned and never read.
So now you can take a function that returns an error condition, for example :
bool func1(int x)
{
return ( x > 7 );
}
normally you could easily just call func1() and not check the value. But you change it to :
BoolYouMustCheck func2(int x)
{
return ( x > 7 );
}
(in practice you probably just want to do #define bool BoolYouMustCheck)
Now you get :
{
int y = clock();
// asserts:
func1(y);
// asserts :
bool b1 = func1(y);
// okay :
bool b2 = func1(y);
if ( b2 )
y++;
// okay :
if ( func1(y) )
y++;
return y;
}
which is kind of nice.
The only ugly thing is that the assert can be rather far removed from the line of code that caused the problem. In the first case (just calling func1 and doing nothing with the return value), you get an assert right away, because the returned class is destructed right away. But in the second case where you assign to b1, you don't get the assert until the end of function scope. I guess you could fix that by taking a stack trace in the constructor.
(note : if you want to intentionally ignore the return value b1 you can just add a line like (int) b1; to surpress the assert.
11/02/2011
11-02-11 - StringMatchTest Release
Note : the stringmatchtest.vcproj project refers to some files that are not included in this distribution. Just delete them from the project.
Requires cblib.zip (633k)
You may also need STLPort (I haven't tried building with the VC STL , I use STLPort 5.1.5 or 5.2.1). (BTW I had to modify the STLPort headers to make it build on VS 2008 ; the mods should be obvious).
Tested with VC 2005 and 2008. Does not build with VC 2010 currently.
The most interesting bit is probably in test_suffixarray, which implements the three suffix-array based string searchers previously described on this blog. See previous posts :
cbloom rants 06-17-10 - Suffix Array Neighboring Pair Match Lens
cbloom rants 09-23-11 - Morphing Matching Chain
cbloom rants 09-25-11 - More on LZ String Matching
cbloom rants 09-27-11 - String Match Stress Test
cbloom rants 09-28-11 - Algorithm - Next Index with Lower Value
cbloom rants 09-28-11 - String Matching with Suffix Arrays
cbloom rants 10-02-11 - How to walk binary interval tree
cbloom rants 09-24-11 - Suffix Tries 1
cbloom rants 09-24-11 - Suffix Tries 2
cbloom rants 09-26-11 - Tiny Suffix Note
cbloom rants 09-29-11 - Suffix Tries 3 - On Follows with Path Compression
cbloom rants 09-30-11 - String Match Results Part 1
cbloom rants 09-30-11 - String Match Results Part 2
cbloom rants 09-30-11 - String Match Results Part 2b
cbloom rants 09-30-11 - String Match Results Part 3
cbloom rants 09-30-11 - String Match Results Part 4
cbloom rants 09-30-11 - String Match Results Part 5 + Conclusion
cbloom rants 10-01-11 - String Match Results Part 6
StringMatchTest includes :
/*
* divsufsort.c for libdivsufsort-lite
* Copyright (c) 2003-2008 Yuta Mori All Rights Reserved.
*
/* LzFind.c -- Match finder for LZ algorithms
2009-04-22 : Igor Pavlov : Public domain */
/*
MMC (Morphing Match Chain)
Match Finder
Copyright (C) Yann Collet 2010-2011
StringMatchTest like all cbloom.com software is released under zlib license (basically free for all uses).
10/27/2011
10-27-11 - Tiny LZ Decoder
My x86 is rusty but certainly the trick to being small is to use
the ancient 1 byte instructions, which conveniently the string instructions are. For example you
might be tempted to read out length & offset like this :
mov cl,[esi] // len
mov al,[esi+1] // offset
add esi,2
but it's smaller to do
lodsb // len
mov cl,al
lodsb // offset
because it keeps you in 1 byte instructions. (and of course any cleverness with lea is right out).
(naively just using lodsw and then you have len and offset in al and ah is even better, but in practice
I can't make that smaller)
Anyhoo, here it is. I'm sure someone cleverer with free time could do better.
__declspec(noinline) void TinyLZDecoder(char * to,char * fm,char * to_end)
{
__asm
{
mov esi,fm
mov edi,to
mov edx,to_end
xor eax,eax
xor ecx,ecx
more:
movsb // literal
lodsb // len
mov cl,al
lodsb // offset
push esi
mov esi,edi
sub esi,eax
rep movsb // match
pop esi
cmp edi,edx
jne more
}
}
------------------------------------------------------
more:
movsb // literal
00401012 movs byte ptr es:[edi],byte ptr [esi]
lodsb // len
00401013 lods byte ptr [esi]
mov cl,al
00401014 mov cl,al
lodsb // offset
00401016 lods byte ptr [esi]
push esi
00401017 push esi
mov esi,edi
00401018 mov esi,edi
sub esi,eax
0040101A sub esi,eax
rep movsb // match
0040101C rep movs byte ptr es:[edi],byte ptr [esi]
pop esi
0040101E pop esi
cmp edi,edx
0040101F cmp edi,edx
jne more
00401021 jne more (401012h)
Also obviously you would get much better compression with a literal run length instead of a single
literal every time, and it only costs a few more bytes of instructions.
You would get even better compression if the
run len could be either a match len or a literal run len and that's just another few bytes.
(ADDENDUM : see end)
A small "Predictor/Finnish" is something like this :
__asm
{
ByteLoop: lodsb // al = *esi++ // control byte
mov edx, 0x100
mov dl, al
BitLoop:
shr edx, 1
jz ByteLoop
jnc zerobit
lodsb
mov [ebx], al
zerobit: mov al, [ebx]
mov bh, bl
mov bl, al
stosb // *edi++ = al
jmp BitLoop
}
the fast version of Finnish of course copies the bit loop 8 times to avoid looping but you can't do that if you want to be small.
I'm quite sure this could be smaller using some clever {adc esi} or such. Also the sentry bit looping is taking a lot of instructions and so on.
Note the crucial trick of "Finnish" is that the hash table must be 64k in size and 64k aligned, so you can do the hash update and the table address computaton just by cycling the bottom two bytes of ebx. (I use the name "Predictor" for the generic idea of the single bit prediction/literal style compressor; "Finnish" is the specific variant of Predictor that uses this bx trick).
(note that this is not remotely the fast way to write these on modern CPU's)
ADDENDUM :
Better (in the sense of more compression) LZ decoder in 22 bytes (core loop only) :
__declspec(noinline) void TinyLZDecoder(char * to,char * fm,char * to_end)
{
__asm
{
mov esi,fm
mov edi,to
mov edx,to_end
xor eax,eax
xor ecx,ecx
more:
mov cl,[esi] // len
inc esi
shr cl,1 // bottom bit is flag
jc literals
//match:
lodsb // offset -> al
push esi
mov esi,edi
sub esi,eax
rep movsb // copy match
pop esi
// ecx is zero, just drop through
literals:
rep movsb // copy literal run
cmp edi,edx
jne more
}
}
Note that obviously if your data is bigger than 256 bytes you can use a two byte match offset by doing lodsw instead
of lodsb.
x86 of course is a special case that's particularly amenable to small LZ decoders; it's not really fair, all other instruction sets will be much larger. (which is sort of ironic because RISC is supposed to make instruction encoding smaller) (addendum : not actually true, that last bit)
10/26/2011
10-26-11 - Tons
In the UK a "ton" is 2240 pounds (which comes from twenty "hundredweights" where a "hundredweight" is eight stone, and a stone is 14 pounds, WTF Britain).
A "metric ton" is obviously 1000 kg. In the UK this is officially called a "tonne" which you will see in technical documents, but I don't see that used much in casual writing, and it's certainly confusing when spoken since it sounds the same. (but a UK ton is very close to a metric ton (2204.6 pounds) so the mixup here surely happens all the time and is not a huge problem).
(when you hear someone in the UK phonetically say "ton" do they mean "tonne" or imperial ton?)
To differentiate the US ton vs UK ton they can be called "short ton" or "long ton".
On a related note, a pint is not a pound *anywhere* in the world.
In the UK, 1 oz by volume of water = 1 oz of weight. But a "pint" in the UK is 20 oz. So a pint is 1.25 pounds (a gallon is exactly 10 pounds)
In the US, 1 oz by volume of water = 1.041 oz of weight, so a pint = 1.041 pounds. (and a gallon = 8.33 pounds).
(neither liquid ounce is anything neat in terms of volume; the only nice whole number unit is the US gallon which is 231 cubic inches)
(the weight measures are the same in the US and UK, it's the US volume measure which went weird (1.041), and I believe it was done in order to make the gallon an integer number of cubic inches)
If you want to get technical, a (US) "pint's a pound" at some high temperature. (...some digging...) actually it's very close just before boiling. It looks like 98 C water is almost exactly a pound per pint (US).
Actually there is a sort of cute book-end of the ranges of water density there :
Very close to freezing (4 C) water is 1 g/ml , and very close to boiling (98 C) it's a pound per (US) pint. The difference is a factor of about 0.96.
10-26-11 - The Eight Month Cruise
The problem with Seattle is that once it starts raining (around Oct 1 pretty reliably) it literally does not stop for the next 8 months. Sure maybe it stops for a day or two, but never long enough for the whole house to dry out and then give you a big chunk of dry days to do something like replace the roof or paint the exterior.
Fortunately I don't have any problem so large as that, but even for minor things it's damn annoying. For example some time around September I realized that I really need to get a coat of waterproofer on all my decking. Oh well, it's gonna have to wait 8 months. There's a couple of spots I need to touch up exterior paint, but you can't really do a good job of painting without a solid 5 days of dry and decent warmth.
It's almost like our houses are boats, and we go on an 8 month aquatic voyage every year. You really need to use those 4 months as a chance to get your boat up on dry dock, scrub off the moss, dig out the dry rot, apply epoxy, sand and paint, etc.
10/24/2011
10-24-11 - LZ Optimal Parse with A Star Part 1
1. All good LZ encoders these days that aren't optimal parsers use complicated heuristics to try to bias the parse towards a cheaper one. The LZ parse is massively redundant (many parses can produce the same original data) and the cost is not the same. But in the forward normal parse you can't say for sure which decision is cheapest, so they just make some guesses.
For example the two key crucial heuristics in LZMA are :
// LZMA lastOffset heuristic :
if (repLen >= 2 && (
(repLen + 1 >= mainLen) ||
(repLen + 2 >= mainLen && mainDist >= (1 << 9)) ||
(repLen + 3 >= mainLen && mainDist >= (1 << 15))))
{
// take repLen instead of mainLen
}
// LZMA lazy heuristic :
// Truncate current match if match at next position will be better (LZMA's algorithm)
if (nextlen >= prevlen && nextdist < prevdist/4 ||
nextlen == prevlen + 1 && !ChangePair(prevdist, nextdist) ||
nextlen > prevlen + 1 ||
nextlen + 1 >= prevlen && prevlen >= MINLEN && ChangePair(nextdist, prevdist))
{
return MINLEN-1;
} else {
return prevlen;
}
One is choosing a "repeat match" over a normal match, and the next is choosing a lazy match (literal then match) over greedy (immediate match).
My non-optimal LZ parser has to make similar decisions; what I did was set up a matrix and train it. I made four categories for a match based on what the offset is : {repeat, low, medium, high } , so to decide between two matches I do :
cat1 = category of match 1 offset cat2 = category of match 2 offset diff = len1 - len2 return diff >= c_threshold[cat1][cat2]so c_threshold is a 4x4 matrix of integers. The values of threshold are in the range [0,3] so it's not too many to just enumerate all possibilities of threshold and see what's best.
Anyway, the thing about these heuristics is that they bias the parse in a certain way. They assume a certain cost tradeoff between literals vs. repeat matches vs. normal matches, or whatever. When the heuristic doesn't match the file they do badly. Otherwise they do amazingly well. One solution might be having several heuristics trained on different files and choose the one that creates the smallest output.
Also I should note - it's not trivial to tell when you have the heuristic wrong for the file. The problem is that there's a feedback loop between the parse heuristic and the statistical coder. That causes you to get into a local minimum, and you might not see that there's a better global minimum which is only available if you make a big jump in parsing style.
Repeating that more explicitly : the statistical coder will adapt to your parse heuristic; say you have a heuristic that prefers low offsets (like the LZMA heuristic above), that will cause your parse to select more low offsets, that will cause your statistical backend to code low offsets smaller. That's good, that's what you want, that's the whole point of the heuristic, it skews the parse intentionally to get the statistics going in the right direction. The issue then is that if you try to evaluate an alternative parse using the statistics that you have, it will look bad, because your statistics are trained for the parse you have.
2. It's crazy that LZ compression can do so well with so much redundancy in the code stream.
Think about it this way. Enumerate all the compressed bit streams that are created by encoding all raw bit streams up to length N.
Start with the shortest compressed bit stream. See what raw bits that decodes to. Now there may be several more (longer) compressed bit streams that decode to that same raw bit stream. Mark those all as unused.
Move to the next shortest compressed bit stream. First if there is a shorter unused one, move it into that slot. Then mark all other output streams that make the same raw bits as unused, and repeat.
For example a file like "aaaaaaa" has one encoding that's {a, offset -1 length 6} ; but there are *tons* of other encodings, such as {a,a,[-1,5]} or {a,[-1,3],[-1,3]} or {a,[-1,2],[-2,4]} etc. etc.
All the encodings except the shortest are useless. We don't want them. But even the ones that are not the best are quite short - they are small encodings, and small encodings take up lots of code space (see Kraft Inequality for example - one bit shorter means you take up twice the code space). So these useless but short encodings are real waste. In particular, there are other raw strings that don't have such a short encoding that would like to have that output length.
Anyhoo, surprisingly (just like with video coding) it seems to be good to add even *more* code stream redundancy by using things like the "repeat matches". I don't think I've ever seen an analysis of just how much wasted code space there is in the LZ output, I'm curious how much there is.
Hmm we didn't actually get to the A Star. Next time.