8/31/2010

08-31-10 - LOOP

Every once in a while I think this is a good idea :

#define LOOP(var,count) for(int var=0;(var)<(int)(count);var++)
#define LOOPBACK(var,count) for(int var=(int)(count)-1;(var)>=0;var--)

but then I never wind up using it, and I find the code that does use it looks really ugly. Maybe if I got in the habit that ugliness would go away.

Certainly adding a "loop" keyword would've been a good idea in C99. Instead we have GCC trying to optimize out signed int wrapping, and many compilers now specifically look for the for(;;) construct and special-case handling it as if it was a loop() keyword.


In other minor news, I'm running with the "NoScript" addon now. I've used FlashBlock for a long time to much happiness, so this is just the next level. It does break 90% of web sites out there, but it has also made me shockingly aware of how many random sites are trying to run scripts that are highly questionable (mainly data-mining and ad-serving).


People sometimes ask me about laptops because I wrote about them before :

cbloom rants 05-07-10 - New Lappy
cbloom rants 04-15-10 - Laptop Part 2
cbloom rants 04-12-10 - Laptop search
cbloom rants 01-20-09 - Laptops Part 3

A small update : I'm still very happy with the Dell Precision 6500. They've updated it with some newer GPU options, so you can now get it with an ATI 7820 , which is a full DX11 part and close to as fast as you can get mobile. Other than that all my advice remains the same - get USB 3 of course, quad i7, LED backlighting, install your own SSD and RAM and do a clean windows install. Do NOT get RAID disks, and do NOT install any Intel or Dell drivers. I have no DPC latency problems or any shite like that. The only thing about it that's slightly less than perfect is the temperature / fan control logic is not quite right.

It looks like there's a problem with the NV GTX 280 / Dual chip / Powermizer stuff. See :

My Solution for Dell XPS M1530 DPC Latency
Dell, DPC Latency, and You - Direct2Dell - Direct2Dell - Dell Community
Dell Latitude DPC Latency Issues

So I recommend staying away from that whole family. The new Vaio Z looks really awesome for a small thin/light, however there is a bit of a nasty problem. They only offer it with SSD's in RAID, and they are custom non-replaceable SSD's, and it appears that the reason for the 4 SSD's in RAID is because they are using cheapo low end flash. There's lots of early reports of problems with this setup, and the fact that you have to run RAID and can't change the drives is a big problem. Also, last time I checked Windows still can't send through TRIM to RAID, so that is borked, but theoretically that will be addressed some day.

8/27/2010

08-27-10 - Cumulative Probability Trees

I've been thinking about Cumulative Probability Trees today; if you prefer, it's partial-sum trees. Basically you have a count for each index C[i] , and you want to be able to quickly get the sum of all counts between 0 and i = S[i]. You want to be able to query and update both in <= logN time. So for example just using S[i] as an array makes query be O(1), but then update is O(N), no good.

The standard solution is Fenwick Trees . They are compact (take no more room than the C[i] table itself). They are O(log(N)) fast. In code :


F[i] contains a partial sum of some binary range
the size of the binary range is equal to the bottom bit on of i
if i&1  - it contains C[i]
if i&3 == 2 - contains Sum of 2 ending with i (eg. C[i]+C[i-1] )
if i&7 == 4 - contains Sum of 4 ending with i
if i&F == 8 - contains Sum of 8 ending with i
(follow the link above to see pictures). To get C[i] from F[i] you basically have to get the partial sums S[i] and S[i-1] and subtract them (you can terminate early when you can tell that the rest of their sum is a shared walk). The logN walk to get S from F is very clever :

sum = 0;
while (i > 0)
  {
    sum += F[i];
    i = i & (i-1);
  }

The i & (i-1) step turns off the bottom bit of i, which is the magic of the Fenwick Tree structure being the same as the structure of binary integers. (apparently this is the same as i -= i & -i , though I haven't worked out how to see that clearly).

If you put F[0] = 0 (F starts indexing at 1), then you can do this branchless if you want :


sum = 0;
UNROLL8( sum += F[i]; i = i & (i-1); );

(for an 8-bit symbol, eg 256 elements in tree).

You can't beat this. The only sucky thing is that just querying a single probability is also O(logN). There are some cases where you want to query probability more often than you do anything else.

One solution to that is to just store the C[i] array as well. That doesn't hurt your update time much, and give you O(1) query for count, but it also doubles your memory use (2*N ints needed instead of N).

One option is to keep C[i], and throw away the bottom level of the Fenwick tree (the odd indexes that just store C[i]). Now your memory use is (3/2)*N ; it's just as fast but a little ugly.

But I was thinking what if we start over. We have the C[i], what if we just build a tree on it?

The most obvious thing is to build a binary partial sum tree. At level 0 you have the C[i], at level 1 you have the sum of pairs, at level 2 you have the sum of quartets, etc :


showing the index that has the sum of that slot :

0:01234567
1:00112233
2:00001111
3:00000000

So update is very simple :

Tree[0][i] ++;
Tree[1][i>>1] ++;
Tree[2][i>>2] ++;
...

But querying a cumprob is a bit messy. You can't just go up the tree and add, because you may already be counted in a parent. So you have to do something like :

sum = 0;

if ( i&1 ) sum += Tree[0][i-1];
i>>=1;
if ( i&1 ) sum += Tree[1][i-1];
i>>=2;
if ( i&1 ) sum += Tree[1][i-1];
..

This is O(logN) but rather uglier than we'd like.

So what if we instead design our tree to be good for query. So we by construction say that our query for cumprob will be this :


sum = Tree[0][i];
sum += Tree[1][i>>1];
sum += Tree[2][i>>2];
...

That is, at each level of the tree, the index (shifted down) contains the amount that should be added on to get the partial sum that preceeds you. That is, if i is >= 64 , then Tree[6][1] will contain the sum from [0..63] and we will add that on.

In particular, at level L, if (i>>L)is odd , it should contain the sum of the previous 2^L items. So how do we do the update for this?


Tree[0][i] ++;
i >>= 1;
if ( i&1 ) Tree[1][i] ++;
i >>= 1;
if ( i&1 ) Tree[2][i] ++;
...

or

Tree[0][i] ++;
if ( i&2 ) Tree[1][i>>1] ++;
if ( i&4 ) Tree[2][i>>2] ++;
...

or

Tree[0][i] ++;
i >>= 1;
Tree[1][i] += i&1;
i >>= 1;
Tree[2][i] += i&1;
...

this is exactly complementary to the query in the last type of tree; we've basically swapped our update and query.

Now if you draw what the sums look like for this tree you get :

These are the table indexes :

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 1 2 3 4 5 6 7
0 1 2 3
0 1
0

These are the table contents :

C0 C0-C1 C2 C2-C3 C4 C4-C5 C6 C6-C7 C8 C8-C9 C10 C10-C11 C12 C12-C13 C14 C14-C15
0 C0-C1 0 C4-C5 0 C8-C9 0 C12-C13
0 C0-C3 0 C8-C11
0 C0-C7
0
it should be obvious that you can take a vertical sum anywhere and that will give you the partial sum to the beginning. eg. if I take a vertical sum in the 13th slot I get (C12-C13)+(0)+(C8-C11)+(C0-C7) = (C0-C13) , which is what I want.

Now what if we slide everything left so we don't have all those zeros in the front, and we'll go ahead and stick the total sum in the top :

C0 C0-C1 C2 C2-C3 C4 C4-C5 C6 C6-C7 C8 C8-C9 C10 C10-C11 C12 C12-C13 C14 C14-C15
C0-C1 0 C4-C5 0 C8-C9 0 C12-C13 0
C0-C3 0 C8-C11 0
C0-C7 0
C0-C15

Now, for each range, we stuff the value into the top slot that it covers, starting with the largest range. So (C0-C7) goes in slot 7 , (C0-C3) goes in slot 3, etc :

C0 -v- C2 -v- C4 -v- C6 -v- C8 -v- C10 -v- C12 -v- C14 -v-
C0-C1 -v- C4-C5 -v- C8-C9 -v- C12-C13 -v-
C0-C3 -v- C8-C11 -v-
C0-C7 -v-
C0-C15

(the -v- is supposed to be a down-arrow meaning you get those cell contents from the next row down). Well, this is exactly a Fenwick Tree.

I'm not sure what the conclusion is yet, but I thought that was like "duh, cool" when I saw it.

8/25/2010

08-25-10 - ProfileParser

I realized a long time ago that the work I did for my AllocParser could basically be applied as-is to parsing profiles. It's the same thing, you have a counter (bytes or clocks), you want to see who counted up that counter, get hierarchical info, etc.

Also Sean and I talked long ago about how to do the lowest impact possible manually-instrumented profiler with full information. Basically you just record the trace and don't do any processing on it during recording. All you have to do is :


Profiler_Push(label)  *ptr++ = PUSH(#label); *ptr++ = rdtsc();
Profiler_Pop( label)  *ptr++ = POP( #label); *ptr++ = rdtsc();

#define PUSH(str)  (U64)(str)
#define POP (str) -(S64)(str)

where ptr is some big global array of U64's , and we will later use the stringized label as a unique id to merge traces. Once your app is done sampling, you have this big array of pushes and pops, you can then parse that to figure out all the hierarichical timing information. In practice you would want to use this with a scoper class to do the push/pop for you, like :

class rrScopeProfiler { rrScopeProfiler() { Push; } ~rrScopeProfiler() { Pop; } };

#define PROFILE(label)  rrScopeProfiler RR_STRING_JOIN(label,_scoper) ( #label );

Very nice! (obviously the pop marker doesn't have to provide the label, but it's handy for consistency checking).

(if you want to get fancy, the profiler push/pop array should really be write-combining memory, and the stores should be movnti's on x64 (*), that way the profiler push/pop wouldn't even pollute your cache, which makes it very low impact indeed)

(* BTW movnti is exposed as _mm_stream_si64 ; unfortunately, this does not exist in VC2005, you need 2008; the fact that they took away inline asm and then failed to expose all the instructions in intrinsics was a huge "fuck you" to all low-level developers; it was terrible in the early days, they've caught up a bit more with each rev ) (note that movntq is not the same, movnti comes from normal registers).

So I did this, and made my AllocParser able to parse that kind of input and turn it into AllocRecords. (the normal mode of AllocParser is to handle stack traces of memory allocations).

So now I can make tabview output for either top-down or bottom-up hierarchies, and also graphiz output like : lz_profile.svg .

There are a few minor things to clean up, like I'd like to be able to show either seconds or clocks, I'd like to be able to divide clocks by the total # of bytes to make it a clocks per byte, and if a node only has a self time, I'd like to delete the redundant +self node.

Another option is to show the hierarchy in a totally different graphviz style, using the boxes-within-boxes method. I tried that before for allocs and it was hideous, but it might be okay for timing. If I do that, then I could combine this with a global timeline to show thread profiles over time, and then use the arrows to connect dependencies.

Then I have to buy that old plotter I've been talking about ...

8/24/2010

08-24-10 - Free Trail Maps

Don't support those Green Trails cocksuckers who charge $400 for their maps. (if you do have the Green Trails maps, I encourage you to post them for public use on the net; you fuckers have abused your right for me to respect your copyright; in fact I vaguely considered buying a set just to scan it and post it, but they're not even worth that).

It's most appalling because all the maps are based on the USGS data, which is paid for by our fucking tax dollars. Fortunately there are some perfectly good free map sites :

libremap.org : Libre Map Project - Free Maps and GIS data
digital-topo-maps.com : Free Printable Topo Maps - Instant Access to Topographic Maps
ACME Mapper 2.0

Of these, digital-topo-maps.com is the easiest to browse around cuz it just uses Google maps (actually, it seems to be like 10X faster than Google's own interface, so it's actually just a nice way to browse normal maps too).

Libre Maps is the most useful for hiking with because it has nice printer-ready pages in high quality.

Also, I sometimes forget that Google still has the Terrain maps because they are hidden under More... now. I'm sure somebody has done trail overlays for Google Terrain, but I haven't found it.

08-24-10 - AutoPrintf Release

Okay, I think AutoPrintf is close enough to final to release it. The nicer, more readable version is in : cblib.zip , but I understand you all like little autonomous files, so I grabbed the shit it needs and crammed it together in an ugly mess here :

apf.h
apf.cpp
apf.inc

(among other advantages, the cblib version doesn't pull vector.h or windows.h into the apf headers, both of which I consider to be very severe sins)

See older posts for a description of how it works and earlier not-good way of doing it and initial announcement .

The basic way it works is :

  • non-basic types are converted into basic types by the template using autoArgConvert; base types are passed through and others get ToString() called on them.

  • "%a" in the format string is turned into the type of the variable it points at (eg %d , %s, whatever). So you can of course use "%7a" or anything else that printf can handle.

  • I just call normal printf to do the actual printing.

that's it, very simple!

Here's my main.cpp for an example of usage :

#include < float.h >
#include < stdio.h >     
#include < string.h >

//#include "autoprintf.h"
#include "apf.h"

#include < windows.h >

struct Vec3 { float x,y,z; };

namespace apf 
{
inline const String ToString( const Vec3 & v )
{
    return StringPrintf("{%.3f,%.3f,%.3f}",v.x,v.y,v.z);
}

/*
inline size_t autoArgConvert(const HWND arg)
{
    return (size_t)arg;
}
*/

inline const String ToString( const HWND v )
{
    return StringPrintf("[%08X]",v);
}
};
    
int main(int argc,const char * argv[])
{   
    //MakeAutoPrintfINL();
    
    //autoprintf("test bad %d",3.f);
    
    autoprintf("hello %-7a",(size_t)400,"|\n");
    
    //*
    //autoprintf("100%");
    autoprintf("percent %% %s","100%"," stupid","!\n");
    autoprintf("hello ","world %.1f",3.f,"\n");
    autoprintf("hello ",7," world\n");
    autoprintf("hello %03d\n",7);
    autoprintf("hello %d",3," world %.1f\n",3.f);
    autoprintf("hello ",(size_t)400,"\n");
    autoprintf("hello ",L"unicode is balls"," \n");
    autoprintf("hello %a ",L"unicode is balls"," \n");
    //autoprintf("hello %S ",L"unicode is balls"," \n");
    autoprintf("hello ",apf::String("world")," \n");
//  autoprintf("hello ",LogString()); // compile error
    
    autoprintf("top bit ",(1UL<<31)," \n");
    autoprintf("top bit %d",(1UL<<31)," \n");
    autoprintf("top bit %a",(1UL<<31)," \n");
    autoprintf("top bit %a\n",(size_t)(1UL<<31));
    
    HANDLE h1 = (HANDLE) 77;
    HWND h2 = (HWND) 77;
    autoprintf("HANDLE %a\n",h1);
    autoprintf("HWND %a\n",h2);
    
    char temp[1024];
    autosnprintf(temp,1023,"hello %a %a %a",4.f,7,apf::String("world"));
    printf("%s\n",temp);
    
    Vec3 v = { 3, 7, 1.5f };
    autoprintf("vector ",v," \n");
    autoprintf("vector %a is cool %a\n",v,(size_t)100);
    /**/
        
    return 0;
}

The normal way to make user types autoconvert is to add a ToString() call for your type, but you could also use autoArgConvert. If you use autoArgConvert, then you will wind up going through a normal %d or whatever.

One nice thing is that this autoprintf is actually even safer than my old safeprintf. If you mismatch primitive types, (eg. you put a %d in your format string but pass a float), it will check it using the same old safeprintf method (that is, a runtime failure). But if you put a std::string in the list when you meant to put a char *, you will get a compile error now, which is much nicer.

Everything in cblib now uses this (I made Log.h be an autoprintf) and I haven't noticed a significant hit to compile time or exe size since the templates are all now deterministic and non-recursive.

Yes it does a lot of dynamic allocation. Get with the fucking 20th century. And it's fucking printf. Printf is slow. I don't want to hear one word about it.

8/23/2010

08-23-10 - AutoPrintf v2

Okay, v2 works and confirmed doesn't give massive exe size bloat or compiler time death the way v1 does.

At the heart of v2 is a "fixed" way of doing varargs. The problem with varargs in C is that you don't get the types of the variables passed in, or the number of them. Well there's no need to groan about that because it's actually really trivial to fix. You just make a bunch of functions like :


template < typename T1, typename T2, typename T3 >
inline String autoToStringSub( T1 arg1, T2 arg2, T3 arg3)
{
    return autoToStringFunc( 3,
            safeprintf_type(arg1), safeprintf_type(arg2), safeprintf_type(arg3), 
            arg1, arg2, arg3, 0 );
}

for various number of args. Here autoToStringFunc(int nArgs, ...) is the basic vararg guy who will do all the work, and we just want to help him out a bit. This kind of adapter could be used very generally to make enhanced varargs functions. Here I only care about the "printf_type" of the variable, but more generaly you could use type_info there. (you could also easily make abstract Objects to encapsulate the args and pass through an array of Objects, so that the called function wouldn't have to be a stupid C vararg function at all, but then it's harder to pass through to the old C funcs that still want varargs).

On top of this we have a type-change adapter :


template < typename T1, typename T2 >
inline String autoToString( T1 arg1, T2 arg2)
{
    return autoToStringSub( 
        autoprintf_StringToChar( autoArgConvert(arg1) ), 
        autoprintf_StringToChar( autoArgConvert(arg2) ));
}

autoToString calls down to autoToStringSub, and uses autoArgConvert. autoArgConvert is a template that passes through basic types and calls ToString() on other types. ToString is a template that knows the basic types, and the client can extend it by adding ToString for their own types. If they don't, it will be a compile error. StringToChar is a helper that turns a String into a char * and passes through anything else. We have to do it in that double-call way so that the String can get allocated and stick around as a temporary until our whole call is done.

The next piece is how to implement autoToStringFunc() , which takes "enhanced varargs". We need to figure out which pieces are format strings and do various types of printfs (including supporting %a for auto-typed printf). The only tricky part of this is how to step around in the varargs. Here is the only place we have to use a little bit of undefined behavior. First of all, think of the va_list as a pointer to a linked list. Calling va_arg essentially advances the pointer one step. That's fine and stanard. But I assume that I can then take that pointer and pass it on as a va_list which is the remaining args (see note *).

The key way we deal with the varargs is with functions like this :


static inline void SkipVAArg(ESafePrintfType argtype, va_list & vl)
{
    switch(argtype)
    {
    case safeprintf_charptr:    { va_arg(vl,char *); return; }
    case safeprintf_wcharptr:   { va_arg(vl,wchar_t *); return; }
    case safeprintf_int32:      { va_arg(vl,int); return; }
    case safeprintf_int64:      { va_arg(vl,__int64); return; }
    case safeprintf_float:      { va_arg(vl,double); return; }
    case safeprintf_ptrint:     { va_arg(vl,int*); return; }
    case safeprintf_ptrvoid:    { va_arg(vl,void*); return; }
    default:
        // BAD
        safeprintf_throwsyntaxerror("SkipVAArg","unknown arg type");
        return;
    }
}

And the remainder is easy!

* : actually it looks like this is okay by the standard, I just have to call va_end after each function call then SkipArgs back to where I was. I believe this is pointless busywork, but you can add it if you want to be fully standard compliant.

8/22/2010

08-22-10 - AutoPrintf v1

Well autoprintf v1 appears to be all working. The core element is a bunch of functions like this :

template < typename T1, typename T2, typename T3, typename T4 >
inline String autoToString( T1 arg1, T2 arg2, T3 arg3, T4 arg4 )
{
    return ToString(arg1) + autoToString( arg2,arg3,arg4);
}


template < typename T2, typename T3 >
inline String autoToString( const char *fmt, T2 arg2, T3 arg3 )
{
    autoFormatInfo fmtInfo = GetAutoFormatInfo(fmt);
    if ( fmtInfo.autoArgI )
    {
        String newFmt = ChangeAtoS(fmt,fmtInfo);
        if ( 0 ) ;
        else if ( fmtInfo.autoArgI == 1 ) return autoToString(newFmt.CStr(), ToString(arg2).CStr(),arg3);
        else if ( fmtInfo.autoArgI == 2 ) return autoToString(newFmt.CStr(), arg2,ToString(arg3).CStr());
        else return autoPrintf_BadAutoArgI(fmt,fmtInfo);
    }

         if ( fmtInfo.numPercents == 0 )    return ToString(fmt) + autoToString(arg2,arg3);
    else if ( fmtInfo.numPercents == 1 )    return StringPrintf(fmt,arg2) + autoToString(arg3);
    else if ( fmtInfo.numPercents == 2 )    return StringPrintf(fmt,arg2,arg3);
    else return autoPrintf_TooManyPercents(fmt,fmtInfo);
};

you have an autoToString that takes various numbers of template args. If the first arg is NOT a char *, it calls ToString on it then repeats on the remaning args. Any time the first arg is a char *, it uses the other specialization which looks in fmt to see if it's a printf format string, then splits the args based on how many percents they are. I also added the ability to use "%a" to mean auto-typed args, which is what the first part of the function is doing.

That's all dandy, but you should be able to see that for large numbers of args, it generates a massive amount of code.

The real problem is that even though the format string is usually a compile-time constant, I can't parse it at compile time, so I generate code for each arg being %a or not being %a, and for each possible number of percents. The result is something like 2^N codegens for N args. That's bad.

So, I know how to fix this, so I don't think I'll publish v1. I have a method for v2 that moves most of the work out of the template. It's much simpler actually, and it's a very obvious idea, all you have to do is make a template like :


autoprintf(T1 a1, T2 a2, T3 a3)
{
    autoPrintfSub( autoType(a1), autoArg(a1) ,autoType(a2), autoArg(a2) , .. )
}

where autoType is a template that gives you the type info of the arg, and autoArg does conversions on non-basic types for you, and then autoPrintfSub can be a normal varargs non-template function and take care of all the hard work.

... yep new style looks like it will work. It requires a lot more fudging with varargs, the old style didn't need any of that. And I'm now using undefined behavior, though I think it always works in all real-world cases. In particular, in v2 I'm now relying on the fact that I can do :


  va_start(vl)
  va_arg(vl) .. a few types to grab some args from vl
  vsnprintf(  vl);

that is, I rely on the fact that va_arg advances me one step in the va_list, and that I then still have a valid va_list for remaining args which I can pass on. This is not allowed by the standard technically but I've never seen a case where it doesn't work (unless GCC decided to get pedantic and forceably make it fail for no good reason).

8/21/2010

08-21-10 - autoprintf

AH HAHA HA ! I think I finally I have the one true printf solution. I can now do :

    autoprintf("hello world\n");
    autoprintf("hello ",7," world\n");
    autoprintf("hello %03d\n",7);
    autoprintf("hello ","world %.1f",3.f,"\n");
    autoprintf("hello %d",3," world %.1f\n",3.f);
    autoprintf("hello ",(size_t)400,"\n");
    autoprintf("hello ",L"unicode is balls"," \n");
    autoprintf("hello ",String("world")," \n");

In particular, all of the following things work :
  • Regular old printf stuff works just like always.

  • .. except that it does type validation against the format string using my old "safeprintf" system

  • You can put args anywhere on the line, not just at the end. This is handy when you get really long printfs going, and makes it easier to copy-paste and rearrange. Similarly you can put format strings anywhere on the line, which is handy when you have a long printf set up and you want to just add something on the end.

  • Does automatic type deduction and output of types that are not explicitly qualified in the format string. In particular a handy one is (size_t) which is not properly supported in a cross-platform way. Any type that doesn't have a % format thingy provided by the caller gets an automatic one.

  • Supports non-basic types in a clean template overriding way. So things like that line that pass a String() as an argument - you will get a compile error if you haven't made a compatible template for that type, or if you have you get nice autoconversion. Basically all the advantages of the C++ iostream << dealy but without the horribleness of having to do your printing that way.

I'm gonna clean up the code a bit and try to extricate it from cblib (meh or maybe not) and I'll post it in a few days.

It does pretty much everything I've always wanted from a printf. There is one thing missing, which is formatting for arbitrary types. Currently you can only format the basic types, and the non-basic types go through a different system. eg. you can either do :

autoprintf("hello %5d ",anInt); 
or
autoprintf("hello ",(size_t)anInt); 
but you can't yet do
autoprintf("hello %5",(size_t)anInt); 
(note that the type specifier is left off, only format specifiers are on the %). I know how to make this work, but it makes the implementation a lot more complicated, so I might punt on it.

The more complicated version is to be able to pass through the format spec into the templated converter. For example, you might have a ToString() for your Vec3 type which makes output like ("{%f,%f,%f}",x,y,z) . With the current system you can do :

Vec3 v;
autoprintf("v = ",v);
and it will call your ToString, but it would be groovy if you could do :
Vec3 v;
autoprintf("v = %.1",v);
as well and have that format apply to the conversion for the type. But that's probably more complication than I want to get into.

Another thing that might be nice is to have an explicit "%a" or something for auto-typed, so you can use them at the end like normal printf args. eg :

autoprintf("hello %d %a %f\n", 3, String("what"), 7.5f );

08-21-10 - Adler32

Sean pointed out that I should try Adler32 for corruption checking. For reference I did some study of file hashes before and I'll use the same test set now, so you can compare to that. I'm using the Adler32 from zlib which looks like a decent implementation.

Testing on 10M arrays of average length 192 (random in [128,256]).


count : 10000000
totalBytes : 1920164768
clocks per byte :
burtle               : 1.658665
crc32                : 10.429893
adler32              : 1.396631
murmur               : 1.110712
FNV                  : 2.520380

So Adler is in fact decently fast, not as fast as Murmur but a bit faster than Burtle. (everything is crazy fast on my x64 lappy; the old post was on my work machine, everything is 2-3X faster on this beast; it's insane how much Core i7 can do per clock).

BTW I wasn't going to add Murmur and FNV to this test - I didn't test them before because they are really not "corruption detection" hashes, they are hashes for hash tables, in particular they don't really try to specifically gaurantee the one bit flips will change the hash or whatever it is that CRC's gaurantee, but after I saw how non-robust Adler was I figured I should add them to the test, and we will see that they do belong...

Now when I count collisions in the same way as before, a problem is evident :


collisions :
rand32               : 11530
burtle               : 0
crc32                : 11774
adler32              : 1969609
murmur               : 11697
FNV                  : 11703

note that as before, rand32 gives you a baseline on how many collisions a perfect 32 bit hash should give you - those collisions are just due to running into the limitted space of the 32 bit word. Burtle here is a 64 bit hash and never collides. (I think I screwed up my CRC a little bit, it's colliding more than it should. But anyhoo). Adler does *terribly*. But that's actually a known problem for short sequences.

How does it do on longer sequences ? On arrays of random length between 2k and 4k (average 3k) :


num hashes : 10000000
totalBytes : 30722620564
clocks per byte :
burtle               : 1.644675
crc32                : 11.638417
adler32              : 1.346784
murmur               : 1.027105
FNV                  : 2.999243
collisions :
rand32               : 11530
burtle               : 0
crc32                : 11586
adler32              : 12335
murmur               : 11781
FNV                  : 11653

it's better, but still the worst of the group.

BTW I should note that the adler32 implementation does unrolling and rollup/rolldown and all that kind of stuff and none of the other ones do. So it's speed advantage is a bit unfair. All these sort of informal speed surveys should be taken with a grain of salt, since to really fairly compare them I would have to spend a few weeks on each one making sure I got it as fast as possible, and of course testing on various platforms. In particular FNV and Murmur use multiplies with is a no-go, but you could probably use shift and add to replace the multiplies, and you'd get something like Bob's "One at a Time" hash.

So I figured I'd test on what is more like my real usage scenario.

In the RAD LZH , I compress 16k data quanta, and check the CRC of each compressed chunk before decompressing. So compressed chunks are between 0 and 16k bytes. Since they are compressed they are near random bits. Corruption will take various forms, either complete stompage with random shite, or some bit flips, or tail or head stomps. Complete stompage has been tested in the above runs (it's the same as checking the collision rate for two unrelated sequences), so I tested incremental stomps.

I made random arrays between 256 and 16k bytes long. I then found the hash of that array, did some randomized incremental stomping, and took the hash after the changes. If the hashes were the same, it counts as a collision. The results are :


numTests : 13068402
burtle               : 0 : 0.00000%
crc32                : 0 : 0.00000%
adler32              : 3 : 0.00002%
murmur               : 0 : 0.00000%
FNV                  : 0 : 0.00000%

Adler32 is the only one that fails to detect these incremental stomps. Granted the failure rate is pretty low (3/13068402) but that's not secure. Also, the hashes which are completely not designed for this (Murmur and FNV) do better. (BTW you might think the Adler32 failures are all on very short arrays; not quite, it does fail on a 256 byte case, then twice at 3840 bytes).

ADDENDUM : Ok I tested Fletcher32 too.


cycles :
rand32               : 0.015727
burtle               : 1.364066
crc32                : 4.527377
adler32              : 1.107550
fletcher32           : 0.697941
murmur               : 0.976026
FNV                  : 2.439253

large buffers :
num hashes : 10000000
totalBytes : 15361310411
rand32               : 11530
burtle64             : 0
crc32                : 11710
adler32              : 12891
fletcher32           : 11645
murmur               : 11792
FNV                  : 11642

small buffers :
num hashes : 10000000
totalBytes : 1920164768
rand32               : 11530
burtle64             : 0
crc32                : 11487
adler32              : 24377
fletcher32           : 11793
murmur               : 11673
FNV                  : 11599

difficult small buffers :
num hashes : 10000000
totalBytes : 1920164768
rand32               : 11530
burtle64             : 0
burtle32             : 11689
crc32                : 11774
adler32              : 1969609
fletcher32           : 11909
murmur               : 11665
FNV                  : 11703

Conclusion : Adler32 is very bad and unsafe. Fletcher32 looks perfectly solid and is very fast.

ADDENDUM 2 : a bit more testing. I re-ran the test of munging the array with incremental small changes of various types again. Running on lengths from 256 up to N, I get :


munge pattern 1 :
length : 6400
numTests             : 25069753
rand32               : 0
burtle64             : 0
burtle32             : 0
crc32                : 0
adler32              : 14
fletcher32           : 22
murmur               : 0
FNV                  : 0

munge pattern 2 :
length : 4096
numTests             : 31322697
rand32               : 0
burtle64             : 0
burtle32             : 0
crc32                : 0
adler32              : 9
fletcher32           : 713
murmur               : 0
FNV                  : 0

So I strike my conclusion that Fletcher is okay. Fletcher and Adler are both bad.

ADDENDUM 3 : Meh, it depends what kind of "corruption" you expect. The run above in which Fletcher is doing very badly includes some "munges" which tend to fill the array with lots of zeros, in which area it does very badly.

If you look at really true random noise type errors, and you always start your array full of random bits, and then you make random bit flips or random byte changes (between 1 and 7 of them), and then refill the array with rand, they perform as expected over a very large number of runs :


numTests : 27987536
rand32               : 3 : 0.00001%
burtle64             : 2 : 0.00001%
burtle32             : 2 : 0.00001%
crc32                : 1 : 0.00000%
adler32              : 1 : 0.00000%
fletcher32           : 2 : 0.00001%
murmur               : 2 : 0.00001%
FNV                  : 1 : 0.00000%

8/20/2010

08-20-10 - Deobfuscating LZMA

I've been trying to figure out LZMA for a while, if anyone can help please chime in.

LZMA is very good, and also very obscure. While the code is free and published, it's completely opaque and the algorithm is not actually described anywhere. In particular, it's very good on structured data - even better than PPM. And, superficially, it looks very much like any other LZ+arithmetic+optimal parse, which there were many of before LZMA, and yet it trounces them all.

So, what's going on in LZMA? First, a description of the basic coder. Most of LZMA is very similar to LZX - LZX uses a forward optimal parse, log2-ish encoded lengths and offsets, and the most recent 3 offsets in an MTF/LRU which are coded as special "repeat match" codes. (LZX was made public when Microsoft made it part of CAB and published a full spec ).

LZMA can code a literal, a match, or a recent offset match - one of the three most recent offsets (like LZX). This is pretty standard. It also has two coding modes that are unusual : "Delta Literal" coding, and the 0th most recent offset match can code a single character match.

Everything it codes is context-coded binary arithmetic coded. Literals are coded as their bits; the initial context is the previous character and a few bits of position, and as literal bits are coded they are shifted into the context for future bits (top to bottom). This is pretty standard.

Using a few bits of position as part of the context lets it have different statistics at each byte position in a dword (or whatever). This is very useful for coding structured data such as arrays of floats. This idea has been around for a long time, but older coders don't do it and it certainly is part of the advantage on array/structured data. The bottom bits of position are also used as part of the context for the match flag, and also the "is last match 0 long" flag. Other match-related coding events don't use it.

In theory you should figure out what the local repetition period is and use that; LZMA doesn't make any effort to do that and just always uses N bits of position (I think N=2 is a typical good value).

Lengths and Offsets are coded in what seems to be mostly a pretty standard log2-ish type coding (like Zip and others). Offsets are coded as basically the position of their MSB and then the remaining bits. The MSB is context-coded with the length of the match as context; this capture length-offset correlation. Then, the bottom 4 bits of the offset are sent, binary arithmetic coded on each other in reverse order (bottom bit first). This lets you capture things like a fixed structure in offsets (eg. all offsets are multiples of 4 or 8). The bits between the MSB and the bottom 4 are sent without compression.

The binary match flags are context coded using the "state" , which is the position in an internal finite state machine. It is :


LZMA state machine :

Literal :

  state < 7 :
    normal literal
  state >= 7 :
    delta literal

  state [0-3] -> state = 0
  state [4-9] -> state -= 3 ([1-6])
  else state -= 6 [10-11] -> ([4-5])


Match :

  rep0
   len 1 : 
     state ->   < 7 ? 9 : 11
   len > 1 :
     state ->   < 7 ? 8 : 11

  rep12
     state ->   < 7 ? 8 : 11

  normal  
     state ->   < 7 ? 7 : 10


// or from Igor Pavlov's code :

static const int kLiteralNextStates[kNumStates] = {0, 0, 0, 0, 1, 2, 3, 4,  5,  6,   4, 5};
static const int kMatchNextStates[kNumStates]   = {7, 7, 7, 7, 7, 7, 7, 10, 10, 10, 10, 10};
static const int kRepNextStates[kNumStates]     = {8, 8, 8, 8, 8, 8, 8, 11, 11, 11, 11, 11};
static const int kShortRepNextStates[kNumStates]= {9, 9, 9, 9, 9, 9, 9, 11, 11, 11, 11, 11};

Okay, this is the first funny unique thing. State basically tells you what the last few coding operations have been. As you send matches, state gets larger, as you send literals, state gets smaller. In particular, after any literal encoding state is < 7, and after any match encoding it is > 7. Then above and below that it tells you something about how many literals or matches you've recently encoded. For example :


initial state = 5
code a normal match -> 7
code a rep match -> 11
code a literal -> 5
code a literal -> 2
code a literal -> 0

Now it's unclear to me whether this funny state machine thing is really a huge win as a context; presumably it is tweaked out to be an advantage, but other coders have used the previous match flags as context for the match flag coding (eg. was the last thing a match is one bit, take the last three that gives you 8 states of previous context), which seems to me to have about the same effect.

There is one funny and clever thing here though, and that's the "delta literal". Any time you code a literal immediately after a match, the state will be >= 7 so you will code a delta literal. After that state will fall below 7 so you will code normal literals. What is a delta literal ?

Delta literals are coded as :


    char literal = *ptr;
    char lastPosPtr = ptr - lastOffset;
    char delta = literal ^ *lastPosPtr;

that is, the character is xor'ed with the character at the last coded match offset away from current pointer (not at the last coded pos, the last offset, that's important for structured data).

When I first saw this I thought "oh it's predicting that the char at the last offset is similar, so the xor makes equal values zero" , but that's not the case at all. For one thing, xor is not a great way to handle correlated values, subtract mod 256 would be better. For another, these character are in fact gauranteed to *NOT* match. If they did match, then the preceeding match would have just been one longer. And that's what's cool about this.

Immediately after a match, you are in a funny position : you have a long preceding context which matches some other long context in the file (where the match was). From PPM* and LZP we know that long contexts are very strong predictors - but we also know that we have failed to match that character! If we just use the normal literal coder, we expect the most likely character to be the one that we just failed to match, so that would be a big waste of code space. So instead, we use this delta literal coder which will let us statistically exclude the zero.

Okay, I think that's it for how the coding works. A few more tidbits :

The match finder in 7zip appears to be a pretty standard hash-to-binary-tree. It uses a hash to find the most recent occurance of the first few chars of the current string, that points to a node in the binary tree, and then it walks the binary tree to find all matches. The details of this are a little bit opaque to me, but I believe it walks backwards in order, and it only finds longer matches as it walks back. That is, it starts at the lowest offset occurance of the substring and finds the match length for that, then it steps to the next later one along the binary tree, and finds a match *if longer*. So it doesn't find all offsets, it presumes that larger offsets are only interesting if their matches are longer. (I'm a little unclear on this so this could be wrong).

One thing I can't figure out is how the binary tree is maintained with the sliding window.

ADDENDUM : I just found it described in "Handbook of Data Compression By David Salomon, Giovanni Motta, David (CON) Bryant". My description above of the binary tree was basically right. It is built in the "greedy" way : new nodes are added at the top of the tree, which means that when you are searching down the tree, you will always see the lowest offset possible for a given match length first, so you only need to consider longer lengths. Also since older nodes are always deeper in the tree, you can slide the window by just killing nodes and don't have to worry about fixing the tree. Of course the disadvantage is the tree can be arbitrarily unbalanced, but that's not a castrophe, it's never worse than just a straight linked list, which is the alternative.

The big piece I'm missing is how the optimal parse works. It's a forward optimal parse which explores a limitted branch space (similar to previous work that was done in Quantum and LZX). When it saves state in the optimal parse tree, it only updates the FSM "state" variable and the last 3 offsets, it doesn't update the whole context-arithmetic state. At each position it appears to consider the cost of either a literal, a match, or a "lazy" match (that's a literal and then the following match), but I haven't figured out the details yet. It seems to optimal parse in 4k chunks, maybe it updates the arithmetic state on those boundaries. I also see there are lots of heuristics to speed up the optimal parse, assumptions about certain coding decisions being cheaper than others without really testing them, hard-coded things like (if offset > (1 << 14) && length < 7) which surely helps. If anyone has figured it out, please help me out.


ADDENDUM : here's an illustration of how the special LZMA modes help on structured data. Say you have a file of structs; the structs are 72 bytes long. Within each struct are a bunch of uint32, floats, stuff like that. Within something like a float, you will have a byte which is often very correlated, and some bytes that are near random. So we might have something like :


[00,00,40,00] [7F 00 3F 71] ... 72-8 bytes ... [00,00,40,00] [7E 00 4C 2F]
... history ...                                * start here

we will encode :

00,00,40,00 : 
  4 byte match at offset 72
  (offset 72 is probably offset0 so this is a rep0 match)

7E :
  delta literal
  encode 7E ^ 7F = 1

00 :
  one byte match to offset 72 (rep0)

4C :
  delta literal
  encode 4C ^ 3F = 0x73

2F :
  regular literal

Also because of the position and state-based coding, if certain literals occur often in the same spot in the pattern, that will be captured very well.

Note that this is not really the "holy grail" of compression which is a compressor that figures out the state-structure of the data and uses that, but it is much closer than anything in the past. (eg. it doesn't actually figure out that the first dword of the structure is a float, and you could easily confuse it, if your struct was 73 bytes long for example, the positions would no longer work in simple bottom-bits cycles).

old rants