07-31-10 - GCC Scheduling Barrier

When implementing lock-free threading, you sometimes need a compiler scheduling barrier, which is weaker than a CPU instruction scheduling barrier, or a cache temporal ordering memory barrier.

There's a common belief that an empty volatile asm in GCC is a scheduling barrier :

 __asm__ volatile("")

but that appears to not actually be the case (* or rather, it is actually the case, but not according to spec). What I believe it does do is it splits the "basic blocks" for compilation, but then after initial optimization there's another merge pass where basic blocks are combined and they can in fact then schedule against each other.

The GCC devs seem to be specifically defending their right to schedule across asm volatile by refusing to give gaurantees : gcc.gnu.org/bugzilla/show_bug.cgi?id=17884 or gcc-patches/2004-10/msg01048.html . In fact it did appear (in an unclear way) in the docs that they wouldn't schedule across asm volatile, but they claim that's a documentation error.

Now they also have the built-in "__sync" stuff . But I don't see any simple compiler scheduling barrier there, and in fact __sync has problems.

__sync is defined to match the Itanium memory model (!?) but then was ported to other platforms. They also do not document the semantics well. They say :

"In most cases, these builtins are considered a full barrier"

What do you mean by "full barrier" ? I think they mean LL,LS,SL,SS , but do they also mean Seq_Cst ? ( there also seem to be some bugs in some of the __sync implementations bugzilla/show_bug.cgi?id=36793 )

For example on PS3 __sync_val_compare_and_swap appears to be :

   .. loop ..

which means it is a full Seq_Cst operation like a lock xchg on x86. That would be cool if I actually knew it always was that on all platforms, but failing to clearly document the gauranteed memory semantics of the __sync operations makes them dangerous. (BTW as an aside, it looks like the GCC __sync intrinsics are generating isync for Acquire unlike Xenon which uses lwsync in that spot).

(note of course PS3 also has the atomic.h platform-specific implementation; the atomic.h has no barriers at all, which pursuant to previous blog Mystery - Does the Cell PPU need Memory Control might actually be the correct thing).

I also stumbled on this thread where Linus rants about GCC .

I think the example someone gives in there about signed int wrapping is pretty terrible - doing a loop like that is fucking bananas and you deserve to suffer for it. However, signed int wrapping does come up in other nasty unexpected places. You actually might be wrapping signed ints in your code right now and not know about it. Say I have some code like :

char x;
x -= y;
x += z;
What if x = -100 , y = 90 , and z = 130 ? You should have -100 - 90 + 130 = -60. But your intermediate was -190 which is too small for char. If your compiler just uses an 8 bit machine register it will wrap and it will all do the right thing, but under gcc that is not gauranteed.

See details on signed-integer-overflow in GCC and notes about wrapv and wrapv vs strict-overflow


07-27-10 - 2d arrays

Some little things I often forget and have to search for.

Say you need to change a stack two dimensional array :

int array[7][3];

into a dynamically allocated one, and you don't want to change any code. Do you know how? It's :

int (*array) [3] = (int (*) [3]) malloc(sizeof(int)*3*7);

It's a little bit cleaner if you use a typedef :

typedef int array_t [3];

int array[7][3];

array_t array[7];

array_t * array = (array_t *) malloc(7*sizeof(array_t));

those are all equivalent (*).

You can take a two dimensional array as function arg in a few reasonable ways :

void func(int array[7][3]) { }

void func(int (*array)[3]) { }

void func(int array[][3]) { }

function arg arrays are always passed by address.

2-d arrays are indexed [row][col] , which means the first index takes big steps and the second index takes small steps in memory.

(* = if your compiler is some fucking rules nazi, they are microscopically not quite identical, because array[rows][cols] doesn't actually have to be rows*cols ints all in a row (though I'm not sure how this would ever actually not be the case in practice)).


07-26-10 - Virtual Functions

Previous post on x86/PPC made me think about virtual functions.

First of all, let's be clear than any serious game code base must have *some* type of dynamic dispatch (that is, data-dependent function calls). When people say "avoid virtual functions" it just makes me roll my eyes. Okay, assume I'm not just over-abstracting and I'm doing the dynamic dispatch for a good reason, because I actually need to do different things on different objects. The issue is just how you implement it.

How C++ virtual functions are implemented on most compilers :

    There's a "vtable" of function pointers associated with each class type. Individual objects have a pointer to their vtable. The advantage of this is that vtables are shared, but the disadvantage is that you get an extra cache miss. What actually happens when you make a virtual call? (multiple and virtual inheritance ignored here for now).
    vtable pointer = obj->vtable;
    func_pointer = vtable->func
    jump func_pointer

Why does this hurt ?

    vtable pointer = obj->vtable;
    The load of vtable pointer may be a cache miss, but that doesn't count against us since you are working on obj anyway you have to have one cache miss there.
    func_pointer = vtable->func
    Then you fetch the func pointer, which is maybe a cache miss (if this type of object has not been used recently).
    jump func_pointer
    Then you jump to variable, which may or may not be able to use branch prediction or fuck up your CPU's pipelining.

How can virtual calls be removed ?

  • A good practice is to never expose virtuals directly. In the base class you should have something like :
    void DoStuff(int x);
    as the public interface, and then hidden lower down, something like :
    virtual void v_DoStuff(int x) = 0;
    void DoStuff(int x) { v_DoStuff(x); }
    this gives you a single controlled call-through point which you can then make non-virtual or whatever at some point. (though, this screws up the next issue:)

  • The compiler can do it better than you (in theory anyway). Virtual functions that aren't actually overriden anywhere can obviously replaced with direct calls. But it's better than that - if you hold a pointer to some type, as long as nothing *derived* from that type override the virtual, it can be called direct. Even if sibling or cousin or parent classes override that virtual, by holding it by a derived type you can know that you are in a leaf part of the virtual call.

    No C++ compiler that I know of actually does this, because it requires knowledge of the class heirarchy in the whole program. But Java and C# are doing this now, so hopefully we will get it in C++ soon.

    When/if we get this, it is a very powerful optimization, because you can make then make functions that take concrete types and get joint-dispatch where you turn many virtual calls into just one. An example to be clear :

    class Base { virtual int Func() { return 0; } };
    class A : public Base { virtual int Func() { return 1; } };
    class B : public A { int m_data; };
    class C : public Base { virtual int Func() { return 2; } };
    void Test(A * obj)
    in this case the virtual call to Func() can be replaced with the direct call A::Func() by the compiler because no child of A overrides Func.

  • Don't use the method of virtual calls to hide implementation from interface. Some C++ coders will make a naked pure virtual interface class for their D3DHardware or ResourceManager or whatever, and then make a concrete class that impelements those virtuals. This is pretty nice for code cleanliness, but gives you virtual calls all over that are unnecessary. Note that if we had compiler-virtual-elimination this method would be fine since all those virtuals could be eliminated as there is only one implementation of that interface in the program.

  • Large/rare classes could use in-object vtables. In some cases, the space savings of sharing the vtable for all instances of a given class is not a huge win, and in that case you'd rather have the vtable directly in line in the object, because it gives you one less cache miss. There's no way to make the compiler do this for you, but you can do it reasonably easily with function pointers.

  • They can be converted into normal branches. In the DoStuff() example, instead of calling v_DoStuff, you could just branch on object type and call some concrete functions. Like :
    void DoStuff(int x)
        if ( this is Actor ) Actor::DoStuff(x);
        else Base::DoStuff(x);
    the advantage of this is that you avoid a cache miss for the vtable, and you use a normal branch which can be predicted even on shitty chips.

    This is only practical if you have only a few overrides for the function. The shitty thing about this (and many of these patterns) is that it pushes knowledge of the derived class up to the base. One nice variant of this is :

  • Rare overrides can use a test. If DoStuff() usually calls the default implementation, but once in a rare while needs an override, you can do that more efficiently with a pattern like :
    void DoStuff(int x)
        if ( base is okay ) Base::DoStuff(x);
        else v_DoStuff(x);
    so we check a bool (could be in a bitfield) and if so we use direct call, else we call the virtual.

  • Pushing data up to parent instead of calling down to child. A good method is simply to eliminate the virtual through code flow redesign.

    For example, a lot of bad game engines might have something like "virtual GetPosition()" on the base object. This at first seems like a good idea - not all objects have positions, and they might want to implement different ways of reporting it. In fact it is a bad idea, and you should just store m_position as public data, then have the different object types push their different position into m_position. (For example silly people implement attachments by making GetPosition() query the parent position then add some offset; instead you should do that computation only once in your object update and store it into the shared m_position).

    Most good games have a chunk of the most common data stored directly in the base object type so it can be accessed without virtuals.

  • Specialize for common cases. I sort of noted about this before, but when you're calling a bunch of virtuals from something, you can change many virtuals into one by dispatching to a func that calls them concretely. eg. Say you have a function like :
    void func(obj * o)
    that does a bunch of virtual calls. Instead make a concrete call version t_func which does no virtuals :
    template < typename T >
    void t_func(T * o)
    void func(obj * o)
        dispatch to actual type of obj :
           t_func < typeof(o) >( o );
    the difficulty is the dispatching from func to t_func. C++ has no mechanism to do type-dependent dispatch other than the vtable mechanism, which is annoying when you want to write a helper function and not add it to your class definition. There are general solutions to this (see for example in cblib the PrefDispatcher which does this by creating a separate but parallel class heirarchy to do dispatch) but they are all a bit ugly. A better solution for most games is to either add func() to the vtable or to just to know what concrete types you have and do manual dispatch.
  • Note that just flattening your hierarchy is not generally a great solution. For example, rather than have a bunch of different types of Crate (ExplodingCrate, FlammableCreate) you might decide to just make one single UberCrate that can be any type of crate. This eliminates virtual calls when you are working on UberCrate since it is just one concrete type, but it adds a lot of branches (if (crate.isExploding)) , and often makes the classes fatter in memory, etc. Making objects data-polymorphic instead of type-polymorphic may or may not be a win depending on the details.
  • In general it's good practice to make queries as fast and non-virtual as possible, and push the virtuals to the updates.

07-26-10 - Code Issues

How do I make a .c/.cpp file that's optional? eg. if you don't build it into your project, then you just don't get the functionality in it, but if you do, then it magically turns itself on and gives you more goodies.

I'll give you a particular example to be concrete, though this is something I often want to do. In the RAD LZH stuff I have various compressors. One is a very complex optimal parser. I want to put that in a separate file. People should be able to just include rrLZH.cpp and it will build and run fine, but the optimal parser will not be available. If they build in rrLZHOptimal, it should automatically provide that option.

I know how to do this in C++. First rrLZH has a function pointer to the rrLZHOptimal which is statically initialized to NULL. The rrLZHOptimal has a CINIT class which registers itself and sets that function pointer to the actual implementation.

This works just fine (it's a standard C++ self-registration paradigm), but it has a few problems in practice :

1. You can run into order-of-initialization issues if you aren't careful. (this is not a problem if you are a good C++ programmer and embrace proper best practices; in that case you will be initializing everything with singletons and so on).

2. It's not portable because of the damn linkers that don't recognize CINIT as a binding function call, so the module can get dropped if it's in a lib or whatever. (this is the main problem ; it would have been nice if C++0x had defined a way to mark CINIT constructions as being required in the link (not by default, but with a __noremove__ modifier or something)). There are various tricks to address this but I don't think any of them is very nice. (*)

I general I like this pattern a lot. The more portable version of this is to have an Install() function that you have to manually call. I *HATE* the Install function pattern. It causes a lot of friction to making a new project because you have to remember to call all the right Installs, and you get a lot of mysterious failures where some function just doesn't work and you have to go back and install the right thing, and you have to install them in the right order (C++ singleton installs mostly take care of order for you). etc.

(* : this is one of those issues that's very annoying to deal with as a library provider vs. an end-application developer. As an app developer you just decide "this is how we're doing it for our product" and you have a method that works for you. As a library developer you have to worry about people not wanting to use the method you have found that works, and how things might behave under various compilers and usage patterns. It sucks.)

ADDENDUM : the problem with the manually-calling Install() pattern is that it puts the selection of features in the wrong & redundant place. There is one place that I want to select my modules, and that is in my build - eg. which files I compile & link, not in the code. The problem with it being in the code is that I can't create shared & generic startup code that just works. I wind up having to duplicate startup code to every app, which is very very bad for maintainability. And of course you can't make a shared "Startup()" function because that would force you to link in every module you might want to use, which is the thing you want to avoid.

For the PS3 people : what would be the ideal way for me expose bits of code that can be run on the SPU? I'm just not sure what people are actually using and how they would like things to be presented to them. eg. should I provide a PPU function that does all the SPU dispatching for you and do all my own SPU management? Is it better if I go through SPURS or some such? Or should I just provide code that builds for SPU and let you do your management?

I've been running into a problem with the MSVC compiler recently where it is incorrectly merging functions that aren't actually the same. The repro looks like this. In some header file I have a function sort of like this :

StupidFunction.h :

inline int StupidFunction()

Then in two different files I have :
A.cpp :

#define SOME_POUND_DEFINE  (0)
#include "StupidFunction.h"


and :
B.cpp :

#define SOME_POUND_DEFINE  (1)
#include "StupidFunction.h"


and what I get is that both printfs print the same thing (random whether its 0 or 1 depending on build order).

If I put "static" on StupidFunction() it fixes this and does the right thing. I have no idea what the standard says about compilation units and inlines and merging and so on, so for all I know their behavior might be correct, but it's damn annoying. It appears that the exact definition of inline changed in C99, and in fact .cpp and .c have very different rules about inlines (apparently you can extern an inline which is pretty fucked up). (BTW the whole thing with C99 creating different rules that apply to .c vs .cpp is pretty annoying).

ADDENDUM : see comments + slacker.org advice about inline best practice (WTF, ridiculous) , and example of GCC inline rules insanity

In other random code news, I recently learned that the C preprocessor (CPP) is not what I thought.

I always thought of CPP as just a text substitution parser. Apparently that used to be the case (and still is the case for many compilers, such as Comeau and MSVC). But at some point some new standard was introduced that makes the CPP more tightly integrated with the C language. And of course those standards-nazis at GCC now support the standard.

The best link that summarizes it IMO is the GCC note on CPP Traditional Mode that describes the difference between the old and new GCC CPP behavior. Old CPP was just text-sub, New CPP is tied to C syntax, in particular it does tokenization and is allowed to pass that tokenization directly to the compiler (which does not need to retokenize).

I guess the point of this is to save some time in the compile, but IMO it's annoying. It means that abuse of the CPP for random text-sub tasks might not work anymore (that's why they have "traditional mode", to support that use). It also means you can't do some of the more creative string munging things in the CPP that I enjoy.

In particular, in every CPP except GCC, this works :

#define V(x) x
#define CAT(a,b)  V(a)V(b)

to concatenate two strings. Note that those strings can be *anything* , unlike the "##" operator which under GCC has very specific annoying behavior in that it must take a valid token on each side and produce a valid token as output (one and only one!).

In further "GCC strictness is annoying", it's fucking annoying that they enforce the rule that only ints can be constants. For example, lots of code bases have something like "offsetof" :

/* Offset of member MEMBER in a struct of type TYPE. */
#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

well, that's illegal under GCC for no damn good reason at all. So you have to do :

/* Offset of member MEMBER in a struct of type TYPE. */
#ifndef __GNUC__
#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
/* The cast to "char &" below avoids problems with user-defined
   "operator &", which can appear in a POD type.  */
#define offsetof(TYPE, MEMBER)                                  \
  (__offsetof__ (reinterpret_cast <size_t>                      \
                 (&reinterpret_cast <const volatile char &>     \
                  (static_cast<TYPE *> (0)->MEMBER))))
#endif /* C++ */

damn annoying. (code stolen from here ). The problem with this code under GCC is that a "type *" cannot be used in a constant expression.

A similar problem comes up in templates. On every other compiler, a const pointer can be used as a template value argument, because it's just the same as an int. Not on GCC! In fact because they actually implement the standard, there's a new standard for C++0x which is going to make NULL okay, but only NULL which is also annoying because there are places I would use arbitrary values. (see for example 1 or 2 ).

ADDENDUM : a concrete example where I need this is my in-place hash table template. It's something like :

template < typename t_key,t_key empty_val,t_key deleted_val >
class hash_table

that is, I hash keys of type t_key and I need a value for "empty" and "deleted" special keys for the client to set. This works great (and BTW is much faster than the STL style of hash_map for many usage patterns), but on GCC it doesn't work if t_key is "char *" because you can't template const pointer values. My work-around for GCC is to take those template args as ints and cast them to t_key type internally, but that fucking blows.

In general I like to use template args as a way to make the compiler generate different functions for various constant values. It's a much cleaner way than the #define / #include method that I used above in the static/inline problem example.


07-21-10 - x86

x86 is really fucking wonderful and it's a damn shame that we don't have it on all platforms. (addendum : I don't really mean x86 the ISA, I mean x86 as short hand for the family of modern processors that run x86; in particular P-Pro through Core i7).

It's not just that the chips are super fast and easy to use. It's that they actually encourage good software engineering. In order to make the in-order PPC chips you have to abandon everything you've learned about good software practices in the last 20 years. You can't abstract or encapsulate. Everything has to be in locals, every function has to be inline.

1. Complex addressing.

This is way more huge than I ever expected. There are two important subaspects here :

1.A. being able to do addition and simple muls in the addressing, eg. [eax + ecx*2]

1.B. being able to use memory locations directly in instructions instead of moving through registers.

Together these work together to make it so that on x86 you don't have to fuck around with loading shit out to temporaries. It makes working on variables in structs almost exactly the same speed as working on variables in a local.


  x = y;


  mov eax, ecx;


  x = s.array[i];


  mov eax, [eax + ecx*4 + 48h]

and those run at almost the same speed !

This is nice for C and accessing structs and arrays of course, but it's especially important for C++ where lots of things are this-> based. The compiler keeps "this" in a register, and this-> references run at the same speed as locals!

ADDENDUM : the really bad issue with the current PPC chips is that the pipeline from integer computations to load/stores is very bad, it causes a full latency stall. If you have to compute an address and then load from it, and you can't get other instructions in between, it's very slow. The great thing about x86 is not that it's one instruction, it's that it's fast. Again, to be clear, the point here is not that CISC is better or whatever, it's simply that having fast complex addressing you don't have to worry about changes the way you write code. It lets you use structs, it lets you just use for(i) loops and index off i and not worry about it. Instead on the PPC you have to worry about things like indexing byte arrays is faster than any other size, and if you're writing loops and accessing dword arrays maybe you should be iterating with a pointer instead of an index, or maybe iterate with index*4, or whatever.

2. Out of order execution.

Most people thing of OOE as just making things faster and letting you be a bit lazy about code gen and stalls and so on. Yes, that is true, but there's a more significant point I think : OOE makes C++ fast.

In particular, the entire method of referencing things through pointers is impossible in even moderate performant code without OOE.

The nasty thing that C++ (or any modern language really, Java ,C#, etc. are actually much much worse in this regard) does is make you use a lot of pointers, because your data types may be polymorphic or indeterminate, it's often hard to hold them by value. Many people think that's a huge performance problem, but on the PC/x86 it actually doesn't hurt much. Why?

Typical C++ code may be something like :

.. stuff ..
.. stuff ..

this may involve several dependent memory fetches. On an in-order chip this is stall city. With OOE it can get rearranged to :

..stuff ..
temp = this->m_obj;
.. stuff ..
vtable = temp->vtable;
.. stuff ..
.. stuff ..

And as long as you have enough stuff to do in between it's no problem. Now obviously doing lots of random calls through objects and vtables in a row will still make you slow, but that's not a common C++ pattern and it's okay if that's slow. But the common pattern of just getting a class pointer from somewhere then doing a bunch of stuff on it is fast (or fast enough for not-super-low-level code anyway).

ADDENDUM : obviously if your code path was completely static, then a compile-time scheduler could do the same thing. But your code path is not static, and the caches have basically random delays because other threads might be using them too, so no static scheduling can ever be as good. And even beyond that, the compiler is just woefully unable to handle scheduling for these things. For example, to schedule as well as OOP can, you would have to do things like speculatively read ptr and *ptr even if it might only be needed if a certain branch is taken (because if you don't do the prefetching the stall will be horrific) etc. Furthermore, the scheduling can only really compete when all your functions are inline; OOP sort of inlines your functions for you since it can schedule functions across the jump. etc. etc.

ADDENDUM : 3. Another issue that I think might be a big one is the terrible penalty for "jump to variable" on PPC. This hits you when you do a switch() and also when you make virtual calls. It can only handle branch prediction for static branches, there's no "branch target predictor" like modern x86 chips have. Maybe I'll write a whole post about virtual functions.

Final addendum :

Anyway, the whole point of this post was not to make yet another rant about how current consoles are slow or bad chips. Everyone knows that and it's old news and boring.

What I have realized and what I'm trying to say is that these bad old chips are not only slow - much worse than that! They cause a regression in software engineering practice back to the bad old days when you have to worry about shit like whether you pre-increment or post-increment your pointers. They make clean, robust, portable programming catastrophically inefficient. All the things we have made progress on in the last 20 years, since I started coding on Amigas and 286's where we had to worry about this shit, we moved into an enlightened age where algorithms were more important than micro bullsit, and now we have regressed.

At the moment, the PowerPC console targets are *SO* much slower than the PC, that the correct way to write code is just to write with only the PowerPC in mind, and whatever speed you get on x86 will be fine. That is, don't think about the PC/x86 performance at all, just 100% write your code for the PPC.

There are lots of little places where they differ - for example on x86 you should write code to take use of complex addressing, you can have fewer data dependencies if you just set up one base variable and then do lots of referencing off it. On PPC this might hurt a lot. Similarly there are quirks about how you organize your branches or what data types you use (PPC is very sensitive to the types of variables), alignment, how you do loops (preincrement is better for PPC), etc.

Rather than bothering with #if __X86 and making fast paths for that, the right thing is just to write it for PPC and not sweat the x86, because it will be like a bazillion times faster than the PPC anyway.

Some other PPC notes :

1. False load-hit-stores because of the 4k aliasing is an annoying and real problem (only the bottom bits of the address are used for LHS conflict detection). In particular, it can easily come up when you allocate big arrays, because the allocators will tend to give you large memory blocks on 4k alignment. If you then do a memcpy between two large arrays you will get a false LHS on every byte! WTF !?!? The result is that you can get wins by randomly offsetting your arrays when you know they will be used together. Some amount of this is just unavoidable.

2. The (unnamed console) compiler just seems to be generally terrible about knowing when it can keep things in registers and when it can't. I noted before about the failure to load array base addresses, but it also fucks up badly if you *EVER* call a function using common variables. For example, say you write a function like this :

int x = 0;

  for( ... one million .. )
    .. do lots of stuff using x ..
    x = blah;


the correct thing of course is to just keep x in a register through the whole function and not store its value back to the stack until right before the function :

//int x; // x = r7
r7 = 0;

  for( ... one million .. )
    .. do lots of stuff using r7 ..
    r7 = blah;

stack_x = r7

Instead what I see is that a store to the stack is done *every time* x is manipulated in the function :

//int x; // x = r7
r7 = 0;
stack_x = r7;

  for( ... one million .. )
    .. do lots of stuff using r7 - stores to stack_x every time ! ..
    r7 = blah;
   stack_x = r7;


The conclusion is the same one I came to last time :

When you write performance-critical code, you need to completely isolate it from function calls, setup code, etc. Try to pass in everything you need as a function argument so you never had to load from globals or constants (even loading static constants seems to be compiled very badly, you have to pass them in to make sure they get into registers), and do everything inside the function on locals (which you never take the address of). Never call external functions.


07-18-10 - Mystery - Why no isync for Acquire on Xenon -

The POWER architecture docs say that to implement Acquire memory constraint, you should use "isync". The Xbox 360 claims they use "lwsync" to enforce Acquire memory constraint. Which is right? See :

Lockless Programming Considerations for Xbox 360 and Microsoft Windows
Example POWER Implementation for C/C++ Memory Model
PowerPC storage model and AIX programming

Review of the PPC memory control instructions in case you're a lazy fucker who wants to butt in but not actually read the links that I post :

First of all review of the PPC memory model. Basically it's very lazy. We are dealing with in-order cores, so the load/store instructions happen in order, but the caches and store buffers are not kept temporally in order. That means an earlier load can get a newer value, and stores can be delayed in the write queue. The result is that loads & stores can go out of order arbitrarily unless you specifically control them. (* one exception is that "consume" order is guaranteed, as it is on all chips but the Alpha; that is, *ptr is always newer than ptr). To control ordering you have ;

lwsync = #LoadLoad barrier, #LoadStore barrier, #StoreStore barrier ( NOT #StoreLoad barrier ) ( NOT Sequential Consistency ).

lwsync gives you all the ordering that you have automatically all the time on x86 (x86 gives you every barrier but #StoreLoad for free). If you put an lwsync after every instruction you would have a nice x86-like semantics.

In a hardware sense, lwsync basically affects only my own core; it makes me sequentialize my write queue and my cache reads, but doesn't cause me to make a sync point with all other cores.

sync = All barriers + Sequential Consistency ; this is equivalent to a lock xchg or mfence on x86.

Sync makes all the cores agree on a single sync point (it creates a "total order"), so it's very expensive, especially on very-many-core systems.

isync = #LoadLoad barrier, in practice it's used with a branch and causes a dependency on the load used in the branch. (note that atomic ops use loadlinked-storeconditional so they always have a branch there for you to isync on). In a hardware sense it causes all previous instructions to finish their loads before any future instructions start (it flushes pipelines).

isync seems to be the perfect thing to implement Acquire semantics, but the Xbox 360 doesn't seem to use it and I'm not sure why. In the article linked above they say :

"PowerPC also has the synchronization instructions isync and eieio (which is used to control reordering to caching-inhibited memory). These synchronization instructions should not be needed for normal synchronization purposes."

All that "Acquire" memory semantics needs to enforce is #LoadLoad. So lwsync certainly does give you acquire because it has a #LoadLoad, but it also does a lot more that you don't need.

ADDENDUM : another Xenon mystery : is there a way to make "volatile" act like old fashioned volatile, not new MSVC volatile? eg. if I just want to force the compiler to actually do a memory load or store (and not optimize it out or get from register or whatever), but don't care about it being acquire or release memory ordered.

07-18-10 - Mystery - Does the Cell PPU need Memory Control -

Is memory ordering needed on the PPU at all ?

I'm having trouble finding any information about this, but I did notice a funny thing in Mike Acton's Multithreading Optimization Basics :


// Pentium
#define  AtomicStoreFence() __asm { sfence }
#define  AtomicLoadFence()  __asm { lfence }

// PowerPC
#define  AtomicStoreFence() __asm { lwsync }
#define  AtomicLoadFence()  __asm { lwsync }

// But on the PPU
#define  AtomicStoreFence() 
#define  AtomicLoadFence()

Now, first of all, I should note that his Pentium defines are wrong. So that doesn't inspire a lot of confidence, but Mike is more of a Cell expert than an x86 expert. (I've noted before that thinking sfence/lfence are needed on x86 is a common mistake; this is just another example of the fact that "You should not try this at home!" ; even top experts get the details wrong; it's pretty sick how many random kids on gamedev.net are rolling their own lock-free queues and such these days; just say no to lock-free drugs mmmkay). (recall sfence and lfence only have any effect on non-temporal memory such as write-combined or SSE; normal x86 doesn't need them at all).

Anyway, the issue on the PPU is you have two hardware threads, but only one core, and more importantly, only one cache (and only one store queue (I think)). The instructions are in order, all of the out-of-orderness of these PowerPC chips comes from the cache, so since they are on the same cache, maybe there is no out-of-orderness ? Does that mean that memory accesses act sequential on the PPU ?

Hmmm I'm not confident about this and need more information. The nice thing about Cell being open is there is tons of information about it from IBM but it's just a mess and very hard to find what you want.

Of note - thread switches on the Cell PPU are pretty catastrophically slow, so doing a lot of micro-threading doesn't really make much sense on that platform anyway.

ADDENDUM : BTW I should note that even if the architecture doesn't require memory ordering (such as on x86), doing this :

#define  AtomicStoreFence() 
#define  AtomicLoadFence()

is a bad idea, because the compiler can still reorder things on you. Better to do :

#define  AtomicStoreFence()  _CompilerWriteBarrier() 
#define  AtomicLoadFence()   _CompilerReadBarrier()

07-18-10 - Mystery - Do Mutexes need More than Acquire-Release -

What memory order constraints do Mutexes really need to enforce ?

This is a surprisingly unclear topic and I'm having trouble finding any good information on it. In particular there are a few specific questions :

1. Does either Mutex Lock or Unlock need to be Sequential Consistent? (eg. a global sync/ordering point) (and followup : if they don't *need* be Seq_Cst , is there a good argument for them to be Seq_Cst anyway?)

2. Does either Lock or Unlock need to keep memory accesses from moving IN , or only keep them from moving OUT ? (eg. can Lock just be Acquire and Unlock just be Release ?)

Okay, let's get into it a bit. BTW by "mutex" I mean "CriticalSection" or "Monitor". That is, something which serializes access to a shared variable.

In particular, it should be clear that instructions moving *OUT* is bad. The main point of the mutex is to do :

y = 1;


  load x
  x ++;
  store x;


y = 2;

and obviously the load should not move out the top nor should the store move out the bottom. This just means the Lock must be Acquire and the Unlock must be Release. However, the y=1 could move inside from the top, and the y=2 could move inside from the bottom, so in fact the y=1 assignment could be completely eliminated.

Hans Boehm : Reordering Constraints for Pthread-Style Locks goes into this question in a bit of detail, but it's fucking slides so it's hard to understand (god damn slides). Basically he argues that moving code into the Mutex (Java style) is fine, *except* if you allow a "try_lock" type function, which allows you to invert the mutex; with try_lock, then lock() must be a full barrier, but unlock() still doesn't need to be.

Joe Duffy mentions this subject but doesn't come to any conclusions. He does argue that it can be confusing if they are not full barriers . I think he's wrong about that and his example is terrible. You can always cook up very nasty examples if you touch shared variables inside mutexes and also outside mutexes. I would like to see an example where *well written code* behaves badly.

One argument for making them full barriers is that CriticalSection provides full barriers on Windows, so people are used to it, so it's good to give people what they are used to. Some coders may see "Lock" and think code neither moves in or out. But on some platforms it does make the mutex much more expensive.

To be concrete, is this a good SpinLock ?

    while ( ! CAS( lock , 0 , 1 , memory_order_seq_cst )


    StoreRelease( lock , 0 );

    // AtomicExchange( lock, 0 , memory_order_seq_cst );

One issue that Joe mentions is the issue of fairness and notifying other processors. If you use the non-fencing Unlock, then you aren't immediately giving other spinning cores a change to grab your lock; you sort of bias towards yourself getting the lock again if you are in high contention. IMO this is a very nasty complex issue and is a good reason not to roll your own mutexes; the OS has complex mechanisms to prevent live locks and starvation and all that shit.

For more concreteness - Viva64 has a nice analysis of Dmitriy V'jukov's implementation of the Peterson Lock . This is a specific implementation of a lock which does not have *any* sequence point; the Lock() is Acquire_Release ordered (so loads inside can't move up and stores before it can't move in) and Unlock is only Release ordered.

The question is - would using a minimally-ordering Lock implementation such as Dmitriy's cause problems of any kind? Obviously Dmitriy's lock is correct in the sense of providing mutual exclusion and data race freedom, so the issue is not that; it's more a question of whether it causes practical programming problems or severely unexpected behavior. What about interaction with file IO or other non-simple-memory-access resources? Is there a good reason not to use such a minimally-ordering lock?


07-13-10 - Tech Blurg

How do I atomically store or load 128 bit data on x64 ?

One option is just to use cmpxch16b to do loads and stores. That's atomic, but seems a bit excessive. I dunno, maybe it's fine. For loads that's simple enough, you just do a cmpxch16b with 0 and it gives you the value that was there. For stores it's a bit uglier because you have to do a loop and do at least two cmps (one to load, then one to store, which will only succeed if nobody else stored since the load).

The other option is to use the SSE 128 bit load/store. I *believe* that it is atomic (assuming no cache lines are straddled), however it is important to note that SSE memory ops on x86/x64 are weakly ordered, unlike normal memory ops which are all strongly ordered (every x86 load is #LoadLoad and every store is #StoreStore). So, to make strongly ordered 128 bit load/store from the SSE load store you have to do something like

load :
    sse load 128

store :
    sse store 128

or such. I'm not completely sure that's right though and I'm having trouble finding any information on this. What I need is load_acquire_128 and store_release_128. (yes I know MSVC has intrinsics for LoadAcq_128 and StoreRel_128, but those are only for Itanium). (BTW a lot of people mistakenly think they need to use lfence or sfence with normal code; no no, those are only for SSE and write combined memory).

(ADDENDUM : yes, I think this is correct; movdqa (a = aligned) appears to be the correct atomic way to load/store 128 bits on x86; I'm a little worried that getting the weaker SSE memory model involved will break some of the assumptions about the x86 behavior of access ordering).

In other news, the random differences between GCC and MSVC are fucking annoying. Basically it's the GCC guys being annoying cocks; you know MS is not going to copy your syntax, but you could copy theirs. If you would get your heads out of your asses and stop trying to be holier than Redmond, you would realize it's better for the world if you provide compatible declarations. Shit like making me do __attribute__((always_inline)) instead of just supporting __forceinline is just annoying and pointless. Also, you all need to fix up your damn stdlib to be more like MS. Extensions like vsnprintf should be named _vsnprintf (MS style, correct) (* okay maybe not).

You also can't just use #defines to map the MSVC stuff to GCC, because often the qualifiers have to go in different places, so it's a real mess. BTW not having pragma warning disable is pretty horrendous. And no putting it on the command line is nowhere near equivalent, you want to be able to turn them on and off for specific bits of code where you know the warning is bogus or innocuous.

The other random problem I have is the printf format for 64 bit int (I64d) appears to be MS only. God damn it.


07-10-10 - PowerPC Suxors

I finally have done my first hard core optimization for PowerPC and discovered a lot of weird quirks, so I'm going to try to write them up so I have a record of it all. I'm not gonna talk about the issues that have been well documented elsewhere (load-hit-store and restrict and all that nonsense).

X. The code gen is just not very good. I'm spoiled by MSVC on the PC, not only is the code gen for the PC quite good, but any mistakes that it makes are magically hidden by out of order PC cores. On the PC if it generates a few unnecessary moves because it didn't do the best possible register assignments, those just get hidden and swallowed up by out-of-order when you have a branch or memory load to hide them.

In contrast, on the PPC consoles, the code gen is quirky and also very important, because in-order execution means that things like unnecessary moves don't get hidden. You have to really manually worry about shit like what variables get put into registers, how the branch is organized (non-jumping case should be most likely), and even exactly what instructions are done for simple operations.

Basically you wind up in this constant battle with the compiler where you have to tweak the C, look at the assembly, tweak the C, back and forth until you convince it to generate the right code. And that code gen is affected by stuff that's not in the immediate neighborhood - eg. far away in the function - so if you want to be safe you have to extract the part you want to tweak into its own function.

X. No complex addressing (lea). One consequence of this is that byte arrays are special and much faster than arrays of larger objects, because it has to do an actual multiply or shift. So for example if you have a struct of several byte members, you should use SOA (several structs) instead of AOS (one array of large struct).

X. Inline ASM kills optimization. You think with the code gen being annoying and flaky you could win by doing some manual inline ASM, but on Xenon inline ASM seems to frequently kick the compiler into "oh fuck I give up" no optimization mode, the same way it did on the PC many years ago before that got fixed.

X. No prefetching. On the PC if you scan linearly through an array it will be decently prefetched for you. (in some cases like memcpy you can beat the automatic prefetcher by doing 4k blocks and prefetching backwards, but in general you just don't have to worry about this). On PPC there is no automatic prefetch even for the simplest case so you have to do it by hand all the time. And of course there's no out-of-order so the stall can't be hidden. Because of this you have to rearrange your code manually to create a minimum of dependencies to give it a time gap between figuring out the address you want (then prefetch it) and needing the data at that address.

X. Sign extension of smaller data into registers. This one was surprising and bit me bad. Load-sign-extend (lha) is super expensive, while load-unsigned-zero-extend (lhz) is normal cost. That means all your variables need to be unsigned, which fucking sucks because as we know unsigned makes bugs. (I guess this is a microcoded instruction so if you use -mwarn-microcode you can get warnings about it).

PS3 gcc appears to be a lot better than Xenon at generating an lhz when the sign extension doesn't actually matter. eg. I had cases like load an S16 and immediately stuff it into a U8. Xenon would still generate an lha there, but PS3 would correctly just generate an lhz.

-mwarn-microcode is not really that awesome because of course you do have to use lots of microcode (shift,mul,div) so you just get spammed with warnings. What you really want is to be able to comment up your source code with the spots that you *know* generate microcode and have it warn only when it generates microcode where it's unexpected. And actually you really want to mark just the areas you care about with some kind of scope, like :

__speed_critical {
  .. code ..

and then it should warn about microcode and load hit stores and whatever else within that scope.

X. Stack variables don't get registered. There appears to be a quirk of the compiler that if you have variables on the stack, it really want to reference them from the stack. It doesn't matter if they are used a million times in a loop, they won't get a register (and of course "register" keyword does nothing). This is really fucking annoying. It's also an aspect of #1 - whether or not it gets registered depends on the phase of the moon, and if you sneeze the code gen will turn it back into a load from the stack. The same is actually true of static globals, the compiler really wants to generate a load from the static base mem, it won't cache that.

Now you might think "I'll just copy it into a local" , but that doesn't work because the compiler completely eliminates that unnecessary copy. The most reliable way we found to make the compiler register important variables is to copy them into a global volatile (so that it can't eliminate the copy) then back into a local, which then gets registered. Ugh.

You might think this is not a big deal, but because the chips are so damn slow, every instruction counts. By not registering the variables, they wind up doing extra loads and adds to get the values out of static of stack mem and generate the offsets and so on.

X. Standard loop special casing. On Xenon they seem to special case the standard

for(int i=0;i < count;i++) { }

kind of loop. If you change that at all, you get fucked. eg. if you just do the same thing but manually, like :

for(int i=0;;)
    if ( i == count ) break;

that will be much much slower because it loses the special case loop optimization. Even the standard paradigm of backward looping :

for(int i=count;i--;) { }

appears to be slower. This just highlights the need for a specific loop() construct in C which would let the compiler do whatever it wants.

X. Clear top 32s. The PS3 gcc wants to generate a ton of clear-top-32s. Dunno if there's a trick to make this go away.

X. Rotates and shifts. PPC has a lot of instructions for shifting and masking. If you just write the C, it's generally pretty good at figuring out that some combined operation can be turned into one instruction. eg. something like this :

x = ( y >> 4 ) & 0xFF;

will get turned into one instruction. Obviously this only works for constant shifts.

X. The ? : paradigm. As usual on the PC we are spoiled by our fucking wonderful compiler which almost always recognizes ? : as a case it can generate without branches. The PPC seems to have nothing like cmov or a good setge variant, so you have to generate it manually . The clean solution to this is to write your own SELECT , that's like :

#define SELECT(condition,val_if_true,val_if_false)  ( (condition) ? (val_if_true) : (val_if_false) )

and replace it with Mike's bit masky version for PPC.


07-09-10 - Backspace

#define _WIN32_WINNT 0x0501 
#include <windows.h>
#include <psapi.h>

static bool strsame(const char * s1,const char * s2)
    for(int i=0;;i++)
        if ( s1[i] != s2[i] )
            return false;
        if ( s1[i] == 0 )
            return true;

// __declspec(dllimport) void RtlFillMemory( void *, size_t count, char );
extern "C" 
void * __cdecl memset(void *buf,int val,size_t count)
    char * p = (char *) buf;
    for(size_t i=0;i < count;i++)
        p[i] = val;
    return buf;

//#undef RtlCopyMemory
//NTSYSAPI void NTAPI RtlCopyMemory( void *, const void *, size_t count );
extern "C" 
void * __cdecl memcpy(void *dst,const void *src,size_t count)
    char * d = (char *) dst;
    char * s = (char *) src;
    for(size_t i=0;i < count;i++)
        d[i] = s[i];
    return dst;

//int CALLBACK WinMain ( IN HINSTANCE hInstance, IN HINSTANCE hPrevInstance, IN LPSTR lpCmdLine, IN int nShowCmd )
int my_WinMain(void)
    bool isExplorer = false;
    HWND active = GetForegroundWindow();
    DWORD procid;
    DWORD otherThread = GetWindowThreadProcessId(active,&procid);
    if ( active )
        HWND cur = active;
            HWND p = GetParent(cur);
            if ( ! p ) break;
            cur = p;
        char name[1024];
        name[0] = 0;
        if ( GetClassNameA(cur,name,sizeof(name)) )
            //lprintf("name : %s\n",name);
            isExplorer = strsame(name,"CabinetWClass") ||
    if ( isExplorer )
        //lprintf("sending alt-up\n");
        INPUT inputs[4] = { 0 }; // this calls memset
        inputs[0].type = INPUT_KEYBOARD;
        inputs[0].ki.wVk = VK_LMENU;
        inputs[1].type = INPUT_KEYBOARD;
        inputs[1].ki.wVk = VK_UP;
        // send keyups in reverse order :
        inputs[2] = inputs[1]; // this generates memcpy
        inputs[3] = inputs[0];
        inputs[2].ki.dwFlags |= KEYEVENTF_KEYUP;
        inputs[3].ki.dwFlags |= KEYEVENTF_KEYUP;
        //lprintf("sending backspace\n");
        // can't use SendInput here cuz it will run me again
        // find the actual window and send a message :
        DWORD myThread = GetCurrentThreadId();
        HWND focus = GetFocus();
        if ( ! focus )
            focus = active;
        // the HotKey thingy that I use will still send the KeyUp for Backspace,
        //  so I only need to send the key down :
        //  (some apps respond to keys on KeyUp and would process backspace twice- eg. Firefox)
        // also, if I send the KeyDown/Up together the WM_CHAR gets out of order oddly
        int vk = VK_BACK;
        int ScanKey = MapVirtualKey(vk, 0);
        const LPARAM lpKD = 1 | (ScanKey << 16);
        //const LPARAM lpKU = lpKD | (1UL << 31) | (1UL << 30);
    return 0;

extern "C"
int WinMainCRTStartup(void)
    return my_WinMain();

Bug fixed 07/12 : don't send key up message. Also, build without CRT and the EXE size is under 4k.

Download the EXE


07-08-10 - Remote Dev

I think the OnLive videogames-over-the-internet thing is foolish and unrealistic.

There is, however, something it would be awesome for : dev kits.

If I'm a multi-platform videogame developer, I don't want to have to buy dev kits of every damn system for every person in my company. They cost a fortune and it's a mess to administer. It would be perfectly reasonable to have a shared farm of devkits somewhere out on the net, you can get to them whenever you want and run your game and get semi-interactive gameplay ala OnLive.

It would be vastly superior to the current situation, wherein poor developers wind up buying 2 dev kits for their 40 person team and you have to constantly be going around asking if you can use it. Instead you would always get instant access to some dev kit in the world.

Obviously this isn't a big deal for 1st party, but for the majority of small devs who do little ports it would be awesome. It would also rock for me because then I could work from home and instantly test my RAD stuff on any platform (and not have to ask someone at the office to turn on the dev kit, or make sure noone is using it, or whatever).


07-03-10 - Length-Limitted Huffman Codes Heuristic

In the last post if you look at the comments you can find some comparison of optimal length limitted vs. heuristic length limitted.

I thought I would describe the heuristic algorithm. It is O(N) with no additional storage (it can work in place, which goes nicely with Moffat's in place Huffman codelen builder ). Here's the algorithm :

1. Build Huffman code lengths using Moffat INPLACE. You observe some of those code lengths are > maxCodeLen. We will work only on the code lengths, and we are given the symbol counts. We are given the symbol counts in sorted order (this was already done for INPLACE; if they were not originally sorted a simple NlogN sort will make them so).

2. Set all code lengths > max to be = maxCodeLen. We now have invalid code lengths, they are not "prefix". That is, they do not satisfy the kraft inequality K <= 1 for decodability.

3. Compute the Kraft number, K = Sum { 2 ^ - L_i } ; we currently have K > 1 and want to shrink it down to K = 1 by increasing some code lengths.

4. PASS 1. Walk over the symbols in sorted order (from lowest count to highest) while K > 1. Do :

  while ( codeLen[ s ] < max && K > 1 )
    codeLen[ s ] ++;

    // adjust K for change in codeLen
    K -= 2 ^ - codeLen[ s ]

5. PASS 2. Walk over the symbols backwards (from highest to lowest count) while K < 1. Do :

  while ( (K + 2^-codeLen[ s ]) <= 1 )
    // adjust K for change in codeLen
    K += 2 ^ - codeLen[ s ]

    codeLen[ s ] --;

6. You now have a set of codelens with K = 1 and all codeLens <= max. Fini.

Okay, so what's happening here ?

There's one forward pass and one backwards pass. First we truncate the code lengths that were too long. We are now in trouble and we need to find some other code lengths that we can make longer so that we are prefix again. The best code to make longer is the one with the lowest symbol count. It doesn't matter how long the current code length is, the cost of doing L += 1 is always the symbol count. So we just walk forward from the lowest symbol count. (*).

After step 4 you have a code with K <= 1 , if it's == 1 you're done, but sometimes it is < 1 because you bumped a lower codelen than necessary and you have a bit of space in your prefix code. To take advantage of this you want to find the highest count symbol whose length you can decrease and still have a prefix code.

As noted in the previous post this can be far from optimal, but in the standard case it just doesn't matter much because these are the very rare symbols.

footnotes :

(* while it is true that the cost is independent of L, the benefit to K is not independent of L, so adjusting shorter code lens is probably better. Instead of the minimum symbol count (C) you want to minimize the cost per benefit, which is C * 2^L . So you'd have to maintain a priority queue (**).)

(** it's actually more complex than that (I just tried it). In this step you will often be overshooting K, when considering overshooting you have to consider the penalty from doing len++ in the step that does the overshoot vs. how much you can get back by doing len-- elsewhere to come back to K=1. That is, you need merge step 4 and 5 such that you create a single priority queue which consists of some plain len++ ops, and also some ops that do one len++ some number of other len--'s, and pick the best of those options which doesn't overshoot K. Keep doing ops while K > 1 and you will wind up with K = 1. ).

Actually I wonder if this is a way to reconcile Huffman code building with Package-Merge ?

What would the correct priority queue op be for the (**) footnote ?

Say you're considering some op that does a len++ somewhere and overshoots K. You need compensate with some amount of K value to correct. Say that value you need to correct is 2^L. You can either do len-- on a code of length L, or you can do it on two codes of length L+1. Or one of length L+1 and two of length L+2.

Yep, I see it. Construct a priority queue for each length L. In the queue are symbols of code length L, and also pairs of items of length L+1 (an item is either a symbol or a pair). To correct K by 2^L you pick the best item from the L'th queue.

But rather than mess with this making an initial K and then doing corrections, you can just start with all L = 0 and K = N and then consider doing L++ on each code, that is, so you start by taking the best items from the L = 1 list. Which is just the package-merge algorithm !

Note that seeing this equivalence relies on some properties of the package-merge algorithm that aren't obvious. When you are pulling nodes at the final list (the L = 1 list), you can either pick a symbol; picking a symbol means its length was 0 and you are making it 1. That means that symbol was never picked before. (this is true because a coin i is never picked in an earlier list before it is made active in the final list). Or, if you don't pick a symbol you can pick a pair from the next L list. This corresponds to doing L++ on those code lengths. The key thing is : if a tree item has child i at level L, then child i already occurs L times as a raw symbol. This must be true because the cost of the tree item containing child i is > the cost of child i itself, so at all those levels child i would have chosen before the tree item.

For example :

L=3:   A  B

L=2:   A  B  {AB}  C

L=1:   A  B  {AB}  C  {AB|C}

At the point where we select {AB} in the L=1 list, A and B must already have occured once so their length is already 1. So {AB} means change both their lengths from 1 to 2; this adds them to the active set on the 2 list.


07-02-10 - Length-Limitted Huffman Codes

I have something interesting to write about Huffman decoders, but that will have to wait a bit.

In the mean time I finally wrote a length-limitted huffman code builder. Everyone uses the "package-merge" algorithm (see Wikipedia , or the paper "A Fast and Space-Economical Algorithm for Length-Limited Coding" by Moffat et.al ; the Larmore/Hirschberg paper is impenetrable).

Here's my summary :

Explicity what we are trying to do is solve :

Cost = Sum[ L_i * C_i ]

C_i = count of ith symbol
L_i = huffman code length

given C_i, find L_i to minimize Cost

contrained such that L_i <= L_max


Sum[ 2 ^ - L_i ] = 1
(Kraft prefix constraint)

This is solved by construction of the "coin collector problem"

The Cost that we minimize is the real (numismatic) value of the coins that the collector pays out
C_i is the numismatic value of the ith coin
L_i is the number of times the collector uses a coin of type i
so Cost = Sum[ L_i * C_i ] is his total cost.

For each value C_i, the coins have face value 2^-1, 2^-2, 2^-4, ...
If the coin collector pays out total face value of (n-1) , then he creates a Kraft correct prefix code

The coin collector problem is simple & obvious ; you just want to pay out from your 2^-1 value items ;
an item is either a 2^-1 value coin, or a pair of 2^-2 value items ; pick the one with lower numismatic value

The fact that this creates a prefix code is a bit more obscure
But you can prove it by the kraft property

If you start with all lenghts = 0 , then K = sum[2^-L] = N
Now add an item from the 2^-1 list
if it's a leaf, L changes from 0 to 1, so K does -= 1/2
if it's a node, then it will bring in two nodes at a lower level
    equivalent to to leaves at that level, so L changes from 1 to 2 twice, so K does -= 1/2 then too
so if the last list has length (2N-2) , you get K -= 1/2 * (2N-2) , or K -= N-1 , hence K = 1 afterward

BTW you can do this in a dynamic programming sort of way where only the active front is needed; has same
run time but less storage requirements.

You start at the 2^-1 (final) list.  You ask : what's the next node of this list?  It's either a symbol or
  made from the first two nodes of the prior list.  So you get the first two nodes of the prior list.
When you select a node into the final list, that is committed, and all its children in the earlier lists
  become final; they can now just do their increments onto CodeLen and be deleted.
If you select a symbol into the final list, then the nodes that you looked at earlier stick around so you
  can look at them next time.

Okay, so it all works fine, but it bothers me.

I can see that "package-merge" solves the "coin collector problem". In fact, that's obvious, it's the obvious way to solve that problem. I can also see that the minimization of the real value cost in "coin collector problem" can be made equivalent to the minimization of the total code length, which is what we want for Huffman code building. Okay. And I can understand the proof that the codes built in this way are prefix. But it's all very indirect and round-about.

What I can't see is a way to understand the "package-merge" algorithm directly in terms of building huffman codes. Obviously you can see certain things that are suggestive - the making pairs of items with minimum cost is a lot like how you would build a huffman tree. The funny thing is that the pairing here is not actually building the huffman tree - the huffman tree is never actually made; instead we make code lengths by counting the number of times the symbol appears in the active set. Even that we can sort of understand intuitively - if a symbol has very low count, it will appear in all L lists, so it will have a code length of L, the max. If it has a higher count, it will get bumped out of some of the lists by packages of lower-count symbols, so it will have a length < L. So that sort of makes sense, but it just makes me even more unhappy that I can't quite see it.

old rants