7/27/2010

07-27-10 - 2d arrays

Some little things I often forget and have to search for.

Say you need to change a stack two dimensional array :


int array[7][3];

into a dynamically allocated one, and you don't want to change any code. Do you know how? It's :

int (*array) [3] = (int (*) [3]) malloc(sizeof(int)*3*7);

It's a little bit cleaner if you use a typedef :

typedef int array_t [3];

int array[7][3];

array_t array[7];

array_t * array = (array_t *) malloc(7*sizeof(array_t));

those are all equivalent (*).

You can take a two dimensional array as function arg in a few reasonable ways :


void func(int array[7][3]) { }

void func(int (*array)[3]) { }

void func(int array[][3]) { }

function arg arrays are always passed by address.

2-d arrays are indexed [row][col] , which means the first index takes big steps and the second index takes small steps in memory.

(* = if your compiler is some fucking rules nazi, they are microscopically not quite identical, because array[rows][cols] doesn't actually have to be rows*cols ints all in a row (though I'm not sure how this would ever actually not be the case in practice)).

7/26/2010

07-26-10 - Virtual Functions

Previous post on x86/PPC made me think about virtual functions.

First of all, let's be clear than any serious game code base must have *some* type of dynamic dispatch (that is, data-dependent function calls). When people say "avoid virtual functions" it just makes me roll my eyes. Okay, assume I'm not just over-abstracting and I'm doing the dynamic dispatch for a good reason, because I actually need to do different things on different objects. The issue is just how you implement it.

How C++ virtual functions are implemented on most compilers :

    There's a "vtable" of function pointers associated with each class type. Individual objects have a pointer to their vtable. The advantage of this is that vtables are shared, but the disadvantage is that you get an extra cache miss. What actually happens when you make a virtual call? (multiple and virtual inheritance ignored here for now).
    
    vtable pointer = obj->vtable;
    
    func_pointer = vtable->func
    
    jump func_pointer
    
    

Why does this hurt ?

    vtable pointer = obj->vtable;
    
    
    The load of vtable pointer may be a cache miss, but that doesn't count against us since you are working on obj anyway you have to have one cache miss there.
    func_pointer = vtable->func
    
    
    Then you fetch the func pointer, which is maybe a cache miss (if this type of object has not been used recently).
    jump func_pointer
    
    
    Then you jump to variable, which may or may not be able to use branch prediction or fuck up your CPU's pipelining.

How can virtual calls be removed ?

  • A good practice is to never expose virtuals directly. In the base class you should have something like :
    
    void DoStuff(int x);
    
    
    as the public interface, and then hidden lower down, something like :
    
    virtual void v_DoStuff(int x) = 0;
    void DoStuff(int x) { v_DoStuff(x); }
    
    
    this gives you a single controlled call-through point which you can then make non-virtual or whatever at some point. (though, this screws up the next issue:)

  • The compiler can do it better than you (in theory anyway). Virtual functions that aren't actually overriden anywhere can obviously replaced with direct calls. But it's better than that - if you hold a pointer to some type, as long as nothing *derived* from that type override the virtual, it can be called direct. Even if sibling or cousin or parent classes override that virtual, by holding it by a derived type you can know that you are in a leaf part of the virtual call.

    No C++ compiler that I know of actually does this, because it requires knowledge of the class heirarchy in the whole program. But Java and C# are doing this now, so hopefully we will get it in C++ soon.

    When/if we get this, it is a very powerful optimization, because you can make then make functions that take concrete types and get joint-dispatch where you turn many virtual calls into just one. An example to be clear :

    
    class Base { virtual int Func() { return 0; } };
    
    class A : public Base { virtual int Func() { return 1; } };
    
    class B : public A { int m_data; };
    
    class C : public Base { virtual int Func() { return 2; } };
    
    void Test(A * obj)
    {
      obj->Func();
    }
    
    
    in this case the virtual call to Func() can be replaced with the direct call A::Func() by the compiler because no child of A overrides Func.

  • Don't use the method of virtual calls to hide implementation from interface. Some C++ coders will make a naked pure virtual interface class for their D3DHardware or ResourceManager or whatever, and then make a concrete class that impelements those virtuals. This is pretty nice for code cleanliness, but gives you virtual calls all over that are unnecessary. Note that if we had compiler-virtual-elimination this method would be fine since all those virtuals could be eliminated as there is only one implementation of that interface in the program.

  • Large/rare classes could use in-object vtables. In some cases, the space savings of sharing the vtable for all instances of a given class is not a huge win, and in that case you'd rather have the vtable directly in line in the object, because it gives you one less cache miss. There's no way to make the compiler do this for you, but you can do it reasonably easily with function pointers.

  • They can be converted into normal branches. In the DoStuff() example, instead of calling v_DoStuff, you could just branch on object type and call some concrete functions. Like :
    
    void DoStuff(int x)
    {
        if ( this is Actor ) Actor::DoStuff(x);
        else Base::DoStuff(x);
    }
    
    
    the advantage of this is that you avoid a cache miss for the vtable, and you use a normal branch which can be predicted even on shitty chips.

    This is only practical if you have only a few overrides for the function. The shitty thing about this (and many of these patterns) is that it pushes knowledge of the derived class up to the base. One nice variant of this is :

  • Rare overrides can use a test. If DoStuff() usually calls the default implementation, but once in a rare while needs an override, you can do that more efficiently with a pattern like :
    
    void DoStuff(int x)
    {
        if ( base is okay ) Base::DoStuff(x);
        else v_DoStuff(x);
    }
    
    
    so we check a bool (could be in a bitfield) and if so we use direct call, else we call the virtual.

  • Pushing data up to parent instead of calling down to child. A good method is simply to eliminate the virtual through code flow redesign.

    For example, a lot of bad game engines might have something like "virtual GetPosition()" on the base object. This at first seems like a good idea - not all objects have positions, and they might want to implement different ways of reporting it. In fact it is a bad idea, and you should just store m_position as public data, then have the different object types push their different position into m_position. (For example silly people implement attachments by making GetPosition() query the parent position then add some offset; instead you should do that computation only once in your object update and store it into the shared m_position).

    Most good games have a chunk of the most common data stored directly in the base object type so it can be accessed without virtuals.

  • Specialize for common cases. I sort of noted about this before, but when you're calling a bunch of virtuals from something, you can change many virtuals into one by dispatching to a func that calls them concretely. eg. Say you have a function like :
    
    void func(obj * o)
    {
      o->v1();
      o->v2();
      o->v3();
    }
    
    
    that does a bunch of virtual calls. Instead make a concrete call version t_func which does no virtuals :
    
    template < typename T >
    void t_func(T * o)
    {
      o->T::v1();
      o->T::v2();
      o->T::v3();
    }
    
    void func(obj * o)
    {
        dispatch to actual type of obj :
           t_func < typeof(o) >( o );
    }
    
    
    the difficulty is the dispatching from func to t_func. C++ has no mechanism to do type-dependent dispatch other than the vtable mechanism, which is annoying when you want to write a helper function and not add it to your class definition. There are general solutions to this (see for example in cblib the PrefDispatcher which does this by creating a separate but parallel class heirarchy to do dispatch) but they are all a bit ugly. A better solution for most games is to either add func() to the vtable or to just to know what concrete types you have and do manual dispatch.
  • Note that just flattening your hierarchy is not generally a great solution. For example, rather than have a bunch of different types of Crate (ExplodingCrate, FlammableCreate) you might decide to just make one single UberCrate that can be any type of crate. This eliminates virtual calls when you are working on UberCrate since it is just one concrete type, but it adds a lot of branches (if (crate.isExploding)) , and often makes the classes fatter in memory, etc. Making objects data-polymorphic instead of type-polymorphic may or may not be a win depending on the details.
  • In general it's good practice to make queries as fast and non-virtual as possible, and push the virtuals to the updates.

07-26-10 - Code Issues

How do I make a .c/.cpp file that's optional? eg. if you don't build it into your project, then you just don't get the functionality in it, but if you do, then it magically turns itself on and gives you more goodies.

I'll give you a particular example to be concrete, though this is something I often want to do. In the RAD LZH stuff I have various compressors. One is a very complex optimal parser. I want to put that in a separate file. People should be able to just include rrLZH.cpp and it will build and run fine, but the optimal parser will not be available. If they build in rrLZHOptimal, it should automatically provide that option.

I know how to do this in C++. First rrLZH has a function pointer to the rrLZHOptimal which is statically initialized to NULL. The rrLZHOptimal has a CINIT class which registers itself and sets that function pointer to the actual implementation.

This works just fine (it's a standard C++ self-registration paradigm), but it has a few problems in practice :

1. You can run into order-of-initialization issues if you aren't careful. (this is not a problem if you are a good C++ programmer and embrace proper best practices; in that case you will be initializing everything with singletons and so on).

2. It's not portable because of the damn linkers that don't recognize CINIT as a binding function call, so the module can get dropped if it's in a lib or whatever. (this is the main problem ; it would have been nice if C++0x had defined a way to mark CINIT constructions as being required in the link (not by default, but with a __noremove__ modifier or something)). There are various tricks to address this but I don't think any of them is very nice. (*)

I general I like this pattern a lot. The more portable version of this is to have an Install() function that you have to manually call. I *HATE* the Install function pattern. It causes a lot of friction to making a new project because you have to remember to call all the right Installs, and you get a lot of mysterious failures where some function just doesn't work and you have to go back and install the right thing, and you have to install them in the right order (C++ singleton installs mostly take care of order for you). etc.

(* : this is one of those issues that's very annoying to deal with as a library provider vs. an end-application developer. As an app developer you just decide "this is how we're doing it for our product" and you have a method that works for you. As a library developer you have to worry about people not wanting to use the method you have found that works, and how things might behave under various compilers and usage patterns. It sucks.)

ADDENDUM : the problem with the manually-calling Install() pattern is that it puts the selection of features in the wrong & redundant place. There is one place that I want to select my modules, and that is in my build - eg. which files I compile & link, not in the code. The problem with it being in the code is that I can't create shared & generic startup code that just works. I wind up having to duplicate startup code to every app, which is very very bad for maintainability. And of course you can't make a shared "Startup()" function because that would force you to link in every module you might want to use, which is the thing you want to avoid.


For the PS3 people : what would be the ideal way for me expose bits of code that can be run on the SPU? I'm just not sure what people are actually using and how they would like things to be presented to them. eg. should I provide a PPU function that does all the SPU dispatching for you and do all my own SPU management? Is it better if I go through SPURS or some such? Or should I just provide code that builds for SPU and let you do your management?


I've been running into a problem with the MSVC compiler recently where it is incorrectly merging functions that aren't actually the same. The repro looks like this. In some header file I have a function sort of like this :

StupidFunction.h :

inline int StupidFunction()
{
  return SOME_POUND_DEFINE;
}

Then in two different files I have :
A.cpp :

#define SOME_POUND_DEFINE  (0)
#include "StupidFunction.h"

printf("%d\n",StupidFunction());

and :
B.cpp :

#define SOME_POUND_DEFINE  (1)
#include "StupidFunction.h"

printf("%d\n",StupidFunction());

and what I get is that both printfs print the same thing (random whether its 0 or 1 depending on build order).

If I put "static" on StupidFunction() it fixes this and does the right thing. I have no idea what the standard says about compilation units and inlines and merging and so on, so for all I know their behavior might be correct, but it's damn annoying. It appears that the exact definition of inline changed in C99, and in fact .cpp and .c have very different rules about inlines (apparently you can extern an inline which is pretty fucked up). (BTW the whole thing with C99 creating different rules that apply to .c vs .cpp is pretty annoying).

ADDENDUM : see comments + slacker.org advice about inline best practice (WTF, ridiculous) , and example of GCC inline rules insanity


In other random code news, I recently learned that the C preprocessor (CPP) is not what I thought.

I always thought of CPP as just a text substitution parser. Apparently that used to be the case (and still is the case for many compilers, such as Comeau and MSVC). But at some point some new standard was introduced that makes the CPP more tightly integrated with the C language. And of course those standards-nazis at GCC now support the standard.

The best link that summarizes it IMO is the GCC note on CPP Traditional Mode that describes the difference between the old and new GCC CPP behavior. Old CPP was just text-sub, New CPP is tied to C syntax, in particular it does tokenization and is allowed to pass that tokenization directly to the compiler (which does not need to retokenize).

I guess the point of this is to save some time in the compile, but IMO it's annoying. It means that abuse of the CPP for random text-sub tasks might not work anymore (that's why they have "traditional mode", to support that use). It also means you can't do some of the more creative string munging things in the CPP that I enjoy.

In particular, in every CPP except GCC, this works :


#define V(x) x
#define CAT(a,b)  V(a)V(b)

to concatenate two strings. Note that those strings can be *anything* , unlike the "##" operator which under GCC has very specific annoying behavior in that it must take a valid token on each side and produce a valid token as output (one and only one!).


In further "GCC strictness is annoying", it's fucking annoying that they enforce the rule that only ints can be constants. For example, lots of code bases have something like "offsetof" :


/* Offset of member MEMBER in a struct of type TYPE. */
#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

well, that's illegal under GCC for no damn good reason at all. So you have to do :

/* Offset of member MEMBER in a struct of type TYPE. */
#ifndef __GNUC__
#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
#else
/* The cast to "char &" below avoids problems with user-defined
   "operator &", which can appear in a POD type.  */
#define offsetof(TYPE, MEMBER)                                  \
  (__offsetof__ (reinterpret_cast <size_t>                      \
                 (&reinterpret_cast <const volatile char &>     \
                  (static_cast<TYPE *> (0)->MEMBER))))
#endif /* C++ */

damn annoying. (code stolen from here ). The problem with this code under GCC is that a "type *" cannot be used in a constant expression.

A similar problem comes up in templates. On every other compiler, a const pointer can be used as a template value argument, because it's just the same as an int. Not on GCC! In fact because they actually implement the standard, there's a new standard for C++0x which is going to make NULL okay, but only NULL which is also annoying because there are places I would use arbitrary values. (see for example 1 or 2 ).

ADDENDUM : a concrete example where I need this is my in-place hash table template. It's something like :


template < typename t_key,t_key empty_val,t_key deleted_val >
class hash_table
{

that is, I hash keys of type t_key and I need a value for "empty" and "deleted" special keys for the client to set. This works great (and BTW is much faster than the STL style of hash_map for many usage patterns), but on GCC it doesn't work if t_key is "char *" because you can't template const pointer values. My work-around for GCC is to take those template args as ints and cast them to t_key type internally, but that fucking blows.

In general I like to use template args as a way to make the compiler generate different functions for various constant values. It's a much cleaner way than the #define / #include method that I used above in the static/inline problem example.

7/21/2010

07-21-10 - x86

x86 is really fucking wonderful and it's a damn shame that we don't have it on all platforms. (addendum : I don't really mean x86 the ISA, I mean x86 as short hand for the family of modern processors that run x86; in particular P-Pro through Core i7).

It's not just that the chips are super fast and easy to use. It's that they actually encourage good software engineering. In order to make the in-order PPC chips you have to abandon everything you've learned about good software practices in the last 20 years. You can't abstract or encapsulate. Everything has to be in locals, every function has to be inline.

1. Complex addressing.

This is way more huge than I ever expected. There are two important subaspects here :

1.A. being able to do addition and simple muls in the addressing, eg. [eax + ecx*2]

1.B. being able to use memory locations directly in instructions instead of moving through registers.

Together these work together to make it so that on x86 you don't have to fuck around with loading shit out to temporaries. It makes working on variables in structs almost exactly the same speed as working on variables in a local.

e.g.


  x = y;

 is

  mov eax, ecx;


while

  x = s.array[i];

 is

  mov eax, [eax + ecx*4 + 48h]

and those run at almost the same speed !

This is nice for C and accessing structs and arrays of course, but it's especially important for C++ where lots of things are this-> based. The compiler keeps "this" in a register, and this-> references run at the same speed as locals!

ADDENDUM : the really bad issue with the current PPC chips is that the pipeline from integer computations to load/stores is very bad, it causes a full latency stall. If you have to compute an address and then load from it, and you can't get other instructions in between, it's very slow. The great thing about x86 is not that it's one instruction, it's that it's fast. Again, to be clear, the point here is not that CISC is better or whatever, it's simply that having fast complex addressing you don't have to worry about changes the way you write code. It lets you use structs, it lets you just use for(i) loops and index off i and not worry about it. Instead on the PPC you have to worry about things like indexing byte arrays is faster than any other size, and if you're writing loops and accessing dword arrays maybe you should be iterating with a pointer instead of an index, or maybe iterate with index*4, or whatever.

2. Out of order execution.

Most people thing of OOE as just making things faster and letting you be a bit lazy about code gen and stalls and so on. Yes, that is true, but there's a more significant point I think : OOE makes C++ fast.

In particular, the entire method of referencing things through pointers is impossible in even moderate performant code without OOE.

The nasty thing that C++ (or any modern language really, Java ,C#, etc. are actually much much worse in this regard) does is make you use a lot of pointers, because your data types may be polymorphic or indeterminate, it's often hard to hold them by value. Many people think that's a huge performance problem, but on the PC/x86 it actually doesn't hurt much. Why?

Typical C++ code may be something like :


.. stuff ..
this->m_obj->func();
.. stuff ..

this may involve several dependent memory fetches. On an in-order chip this is stall city. With OOE it can get rearranged to :

..stuff ..
temp = this->m_obj;
.. stuff ..
vtable = temp->vtable;
.. stuff ..
vtable->func();
.. stuff ..

And as long as you have enough stuff to do in between it's no problem. Now obviously doing lots of random calls through objects and vtables in a row will still make you slow, but that's not a common C++ pattern and it's okay if that's slow. But the common pattern of just getting a class pointer from somewhere then doing a bunch of stuff on it is fast (or fast enough for not-super-low-level code anyway).

ADDENDUM : obviously if your code path was completely static, then a compile-time scheduler could do the same thing. But your code path is not static, and the caches have basically random delays because other threads might be using them too, so no static scheduling can ever be as good. And even beyond that, the compiler is just woefully unable to handle scheduling for these things. For example, to schedule as well as OOP can, you would have to do things like speculatively read ptr and *ptr even if it might only be needed if a certain branch is taken (because if you don't do the prefetching the stall will be horrific) etc. Furthermore, the scheduling can only really compete when all your functions are inline; OOP sort of inlines your functions for you since it can schedule functions across the jump. etc. etc.

ADDENDUM : 3. Another issue that I think might be a big one is the terrible penalty for "jump to variable" on PPC. This hits you when you do a switch() and also when you make virtual calls. It can only handle branch prediction for static branches, there's no "branch target predictor" like modern x86 chips have. Maybe I'll write a whole post about virtual functions.

Final addendum :

Anyway, the whole point of this post was not to make yet another rant about how current consoles are slow or bad chips. Everyone knows that and it's old news and boring.

What I have realized and what I'm trying to say is that these bad old chips are not only slow - much worse than that! They cause a regression in software engineering practice back to the bad old days when you have to worry about shit like whether you pre-increment or post-increment your pointers. They make clean, robust, portable programming catastrophically inefficient. All the things we have made progress on in the last 20 years, since I started coding on Amigas and 286's where we had to worry about this shit, we moved into an enlightened age where algorithms were more important than micro bullsit, and now we have regressed.


At the moment, the PowerPC console targets are *SO* much slower than the PC, that the correct way to write code is just to write with only the PowerPC in mind, and whatever speed you get on x86 will be fine. That is, don't think about the PC/x86 performance at all, just 100% write your code for the PPC.

There are lots of little places where they differ - for example on x86 you should write code to take use of complex addressing, you can have fewer data dependencies if you just set up one base variable and then do lots of referencing off it. On PPC this might hurt a lot. Similarly there are quirks about how you organize your branches or what data types you use (PPC is very sensitive to the types of variables), alignment, how you do loops (preincrement is better for PPC), etc.

Rather than bothering with #if __X86 and making fast paths for that, the right thing is just to write it for PPC and not sweat the x86, because it will be like a bazillion times faster than the PPC anyway.

Some other PPC notes :

1. False load-hit-stores because of the 4k aliasing is an annoying and real problem (only the bottom bits of the address are used for LHS conflict detection). In particular, it can easily come up when you allocate big arrays, because the allocators will tend to give you large memory blocks on 4k alignment. If you then do a memcpy between two large arrays you will get a false LHS on every byte! WTF !?!? The result is that you can get wins by randomly offsetting your arrays when you know they will be used together. Some amount of this is just unavoidable.

2. The (unnamed console) compiler just seems to be generally terrible about knowing when it can keep things in registers and when it can't. I noted before about the failure to load array base addresses, but it also fucks up badly if you *EVER* call a function using common variables. For example, say you write a function like this :


{
int x = 0;

  for( ... one million .. )
  {
    .. do lots of stuff using x ..
    x = blah;
  }

  external_func(&x);
}

the correct thing of course is to just keep x in a register through the whole function and not store its value back to the stack until right before the function :

{
//int x; // x = r7
r7 = 0;

  for( ... one million .. )
  {
    .. do lots of stuff using r7 ..
    r7 = blah;
  }

stack_x = r7
external_func(&stack_x);
}

Instead what I see is that a store to the stack is done *every time* x is manipulated in the function :


{
//int x; // x = r7
r7 = 0;
stack_x = r7;

  for( ... one million .. )
  {
    .. do lots of stuff using r7 - stores to stack_x every time ! ..
    r7 = blah;
   stack_x = r7;
  }

external_func(&stack_x);
}

The conclusion is the same one I came to last time :

When you write performance-critical code, you need to completely isolate it from function calls, setup code, etc. Try to pass in everything you need as a function argument so you never had to load from globals or constants (even loading static constants seems to be compiled very badly, you have to pass them in to make sure they get into registers), and do everything inside the function on locals (which you never take the address of). Never call external functions.

7/18/2010

07-18-10 - Mystery - Why no isync for Acquire on Xenon -

The POWER architecture docs say that to implement Acquire memory constraint, you should use "isync". The Xbox 360 claims they use "lwsync" to enforce Acquire memory constraint. Which is right? See :

Lockless Programming Considerations for Xbox 360 and Microsoft Windows
Example POWER Implementation for C/C++ Memory Model
PowerPC storage model and AIX programming

Review of the PPC memory control instructions in case you're a lazy fucker who wants to butt in but not actually read the links that I post :

First of all review of the PPC memory model. Basically it's very lazy. We are dealing with in-order cores, so the load/store instructions happen in order, but the caches and store buffers are not kept temporally in order. That means an earlier load can get a newer value, and stores can be delayed in the write queue. The result is that loads & stores can go out of order arbitrarily unless you specifically control them. (* one exception is that "consume" order is guaranteed, as it is on all chips but the Alpha; that is, *ptr is always newer than ptr). To control ordering you have ;

lwsync = #LoadLoad barrier, #LoadStore barrier, #StoreStore barrier ( NOT #StoreLoad barrier ) ( NOT Sequential Consistency ).

lwsync gives you all the ordering that you have automatically all the time on x86 (x86 gives you every barrier but #StoreLoad for free). If you put an lwsync after every instruction you would have a nice x86-like semantics.

In a hardware sense, lwsync basically affects only my own core; it makes me sequentialize my write queue and my cache reads, but doesn't cause me to make a sync point with all other cores.

sync = All barriers + Sequential Consistency ; this is equivalent to a lock xchg or mfence on x86.

Sync makes all the cores agree on a single sync point (it creates a "total order"), so it's very expensive, especially on very-many-core systems.

isync = #LoadLoad barrier, in practice it's used with a branch and causes a dependency on the load used in the branch. (note that atomic ops use loadlinked-storeconditional so they always have a branch there for you to isync on). In a hardware sense it causes all previous instructions to finish their loads before any future instructions start (it flushes pipelines).

isync seems to be the perfect thing to implement Acquire semantics, but the Xbox 360 doesn't seem to use it and I'm not sure why. In the article linked above they say :

"PowerPC also has the synchronization instructions isync and eieio (which is used to control reordering to caching-inhibited memory). These synchronization instructions should not be needed for normal synchronization purposes."

All that "Acquire" memory semantics needs to enforce is #LoadLoad. So lwsync certainly does give you acquire because it has a #LoadLoad, but it also does a lot more that you don't need.

ADDENDUM : another Xenon mystery : is there a way to make "volatile" act like old fashioned volatile, not new MSVC volatile? eg. if I just want to force the compiler to actually do a memory load or store (and not optimize it out or get from register or whatever), but don't care about it being acquire or release memory ordered.

07-18-10 - Mystery - Does the Cell PPU need Memory Control -

Is memory ordering needed on the PPU at all ?

I'm having trouble finding any information about this, but I did notice a funny thing in Mike Acton's Multithreading Optimization Basics :

!

// Pentium
#define  AtomicStoreFence() __asm { sfence }
#define  AtomicLoadFence()  __asm { lfence }

// PowerPC
#define  AtomicStoreFence() __asm { lwsync }
#define  AtomicLoadFence()  __asm { lwsync }

// But on the PPU
#define  AtomicStoreFence() 
#define  AtomicLoadFence()

Now, first of all, I should note that his Pentium defines are wrong. So that doesn't inspire a lot of confidence, but Mike is more of a Cell expert than an x86 expert. (I've noted before that thinking sfence/lfence are needed on x86 is a common mistake; this is just another example of the fact that "You should not try this at home!" ; even top experts get the details wrong; it's pretty sick how many random kids on gamedev.net are rolling their own lock-free queues and such these days; just say no to lock-free drugs mmmkay). (recall sfence and lfence only have any effect on non-temporal memory such as write-combined or SSE; normal x86 doesn't need them at all).

Anyway, the issue on the PPU is you have two hardware threads, but only one core, and more importantly, only one cache (and only one store queue (I think)). The instructions are in order, all of the out-of-orderness of these PowerPC chips comes from the cache, so since they are on the same cache, maybe there is no out-of-orderness ? Does that mean that memory accesses act sequential on the PPU ?

Hmmm I'm not confident about this and need more information. The nice thing about Cell being open is there is tons of information about it from IBM but it's just a mess and very hard to find what you want.

Of note - thread switches on the Cell PPU are pretty catastrophically slow, so doing a lot of micro-threading doesn't really make much sense on that platform anyway.

ADDENDUM : BTW I should note that even if the architecture doesn't require memory ordering (such as on x86), doing this :


#define  AtomicStoreFence() 
#define  AtomicLoadFence()

is a bad idea, because the compiler can still reorder things on you. Better to do :

#define  AtomicStoreFence()  _CompilerWriteBarrier() 
#define  AtomicLoadFence()   _CompilerReadBarrier()

07-18-10 - Mystery - Do Mutexes need More than Acquire-Release -

What memory order constraints do Mutexes really need to enforce ?

This is a surprisingly unclear topic and I'm having trouble finding any good information on it. In particular there are a few specific questions :

1. Does either Mutex Lock or Unlock need to be Sequential Consistent? (eg. a global sync/ordering point) (and followup : if they don't *need* be Seq_Cst , is there a good argument for them to be Seq_Cst anyway?)

2. Does either Lock or Unlock need to keep memory accesses from moving IN , or only keep them from moving OUT ? (eg. can Lock just be Acquire and Unlock just be Release ?)

Okay, let's get into it a bit. BTW by "mutex" I mean "CriticalSection" or "Monitor". That is, something which serializes access to a shared variable.

In particular, it should be clear that instructions moving *OUT* is bad. The main point of the mutex is to do :


y = 1;

Lock(x)

  load x
  x ++;
  store x;

Unlock(x)

y = 2;

and obviously the load should not move out the top nor should the store move out the bottom. This just means the Lock must be Acquire and the Unlock must be Release. However, the y=1 could move inside from the top, and the y=2 could move inside from the bottom, so in fact the y=1 assignment could be completely eliminated.

Hans Boehm : Reordering Constraints for Pthread-Style Locks goes into this question in a bit of detail, but it's fucking slides so it's hard to understand (god damn slides). Basically he argues that moving code into the Mutex (Java style) is fine, *except* if you allow a "try_lock" type function, which allows you to invert the mutex; with try_lock, then lock() must be a full barrier, but unlock() still doesn't need to be.

Joe Duffy mentions this subject but doesn't come to any conclusions. He does argue that it can be confusing if they are not full barriers . I think he's wrong about that and his example is terrible. You can always cook up very nasty examples if you touch shared variables inside mutexes and also outside mutexes. I would like to see an example where *well written code* behaves badly.

One argument for making them full barriers is that CriticalSection provides full barriers on Windows, so people are used to it, so it's good to give people what they are used to. Some coders may see "Lock" and think code neither moves in or out. But on some platforms it does make the mutex much more expensive.

To be concrete, is this a good SpinLock ?


Lock()
{
    
    while ( ! CAS( lock , 0 , 1 , memory_order_seq_cst )
    ;

}

Unlock()
{
    StoreRelease( lock , 0 );

    // AtomicExchange( lock, 0 , memory_order_seq_cst );
}

One issue that Joe mentions is the issue of fairness and notifying other processors. If you use the non-fencing Unlock, then you aren't immediately giving other spinning cores a change to grab your lock; you sort of bias towards yourself getting the lock again if you are in high contention. IMO this is a very nasty complex issue and is a good reason not to roll your own mutexes; the OS has complex mechanisms to prevent live locks and starvation and all that shit.

For more concreteness - Viva64 has a nice analysis of Dmitriy V'jukov's implementation of the Peterson Lock . This is a specific implementation of a lock which does not have *any* sequence point; the Lock() is Acquire_Release ordered (so loads inside can't move up and stores before it can't move in) and Unlock is only Release ordered.

The question is - would using a minimally-ordering Lock implementation such as Dmitriy's cause problems of any kind? Obviously Dmitriy's lock is correct in the sense of providing mutual exclusion and data race freedom, so the issue is not that; it's more a question of whether it causes practical programming problems or severely unexpected behavior. What about interaction with file IO or other non-simple-memory-access resources? Is there a good reason not to use such a minimally-ordering lock?

7/13/2010

07-13-10 - Tech Blurg

How do I atomically store or load 128 bit data on x64 ?

One option is just to use cmpxch16b to do loads and stores. That's atomic, but seems a bit excessive. I dunno, maybe it's fine. For loads that's simple enough, you just do a cmpxch16b with 0 and it gives you the value that was there. For stores it's a bit uglier because you have to do a loop and do at least two cmps (one to load, then one to store, which will only succeed if nobody else stored since the load).

The other option is to use the SSE 128 bit load/store. I *believe* that it is atomic (assuming no cache lines are straddled), however it is important to note that SSE memory ops on x86/x64 are weakly ordered, unlike normal memory ops which are all strongly ordered (every x86 load is #LoadLoad and every store is #StoreStore). So, to make strongly ordered 128 bit load/store from the SSE load store you have to do something like


load :
    sse load 128
    lfence

store :
    sfence
    sse store 128

or such. I'm not completely sure that's right though and I'm having trouble finding any information on this. What I need is load_acquire_128 and store_release_128. (yes I know MSVC has intrinsics for LoadAcq_128 and StoreRel_128, but those are only for Itanium). (BTW a lot of people mistakenly think they need to use lfence or sfence with normal code; no no, those are only for SSE and write combined memory).

(ADDENDUM : yes, I think this is correct; movdqa (a = aligned) appears to be the correct atomic way to load/store 128 bits on x86; I'm a little worried that getting the weaker SSE memory model involved will break some of the assumptions about the x86 behavior of access ordering).

In other news, the random differences between GCC and MSVC are fucking annoying. Basically it's the GCC guys being annoying cocks; you know MS is not going to copy your syntax, but you could copy theirs. If you would get your heads out of your asses and stop trying to be holier than Redmond, you would realize it's better for the world if you provide compatible declarations. Shit like making me do __attribute__((always_inline)) instead of just supporting __forceinline is just annoying and pointless. Also, you all need to fix up your damn stdlib to be more like MS. Extensions like vsnprintf should be named _vsnprintf (MS style, correct) (* okay maybe not).

You also can't just use #defines to map the MSVC stuff to GCC, because often the qualifiers have to go in different places, so it's a real mess. BTW not having pragma warning disable is pretty horrendous. And no putting it on the command line is nowhere near equivalent, you want to be able to turn them on and off for specific bits of code where you know the warning is bogus or innocuous.

The other random problem I have is the printf format for 64 bit int (I64d) appears to be MS only. God damn it.

7/10/2010

07-10-10 - PowerPC Suxors

I finally have done my first hard core optimization for PowerPC and discovered a lot of weird quirks, so I'm going to try to write them up so I have a record of it all. I'm not gonna talk about the issues that have been well documented elsewhere (load-hit-store and restrict and all that nonsense).

X. The code gen is just not very good. I'm spoiled by MSVC on the PC, not only is the code gen for the PC quite good, but any mistakes that it makes are magically hidden by out of order PC cores. On the PC if it generates a few unnecessary moves because it didn't do the best possible register assignments, those just get hidden and swallowed up by out-of-order when you have a branch or memory load to hide them.

In contrast, on the PPC consoles, the code gen is quirky and also very important, because in-order execution means that things like unnecessary moves don't get hidden. You have to really manually worry about shit like what variables get put into registers, how the branch is organized (non-jumping case should be most likely), and even exactly what instructions are done for simple operations.

Basically you wind up in this constant battle with the compiler where you have to tweak the C, look at the assembly, tweak the C, back and forth until you convince it to generate the right code. And that code gen is affected by stuff that's not in the immediate neighborhood - eg. far away in the function - so if you want to be safe you have to extract the part you want to tweak into its own function.

X. No complex addressing (lea). One consequence of this is that byte arrays are special and much faster than arrays of larger objects, because it has to do an actual multiply or shift. So for example if you have a struct of several byte members, you should use SOA (several structs) instead of AOS (one array of large struct).

X. Inline ASM kills optimization. You think with the code gen being annoying and flaky you could win by doing some manual inline ASM, but on Xenon inline ASM seems to frequently kick the compiler into "oh fuck I give up" no optimization mode, the same way it did on the PC many years ago before that got fixed.

X. No prefetching. On the PC if you scan linearly through an array it will be decently prefetched for you. (in some cases like memcpy you can beat the automatic prefetcher by doing 4k blocks and prefetching backwards, but in general you just don't have to worry about this). On PPC there is no automatic prefetch even for the simplest case so you have to do it by hand all the time. And of course there's no out-of-order so the stall can't be hidden. Because of this you have to rearrange your code manually to create a minimum of dependencies to give it a time gap between figuring out the address you want (then prefetch it) and needing the data at that address.

X. Sign extension of smaller data into registers. This one was surprising and bit me bad. Load-sign-extend (lha) is super expensive, while load-unsigned-zero-extend (lhz) is normal cost. That means all your variables need to be unsigned, which fucking sucks because as we know unsigned makes bugs. (I guess this is a microcoded instruction so if you use -mwarn-microcode you can get warnings about it).

PS3 gcc appears to be a lot better than Xenon at generating an lhz when the sign extension doesn't actually matter. eg. I had cases like load an S16 and immediately stuff it into a U8. Xenon would still generate an lha there, but PS3 would correctly just generate an lhz.

-mwarn-microcode is not really that awesome because of course you do have to use lots of microcode (shift,mul,div) so you just get spammed with warnings. What you really want is to be able to comment up your source code with the spots that you *know* generate microcode and have it warn only when it generates microcode where it's unexpected. And actually you really want to mark just the areas you care about with some kind of scope, like :


__speed_critical {
  .. code ..
}

and then it should warn about microcode and load hit stores and whatever else within that scope.

X. Stack variables don't get registered. There appears to be a quirk of the compiler that if you have variables on the stack, it really want to reference them from the stack. It doesn't matter if they are used a million times in a loop, they won't get a register (and of course "register" keyword does nothing). This is really fucking annoying. It's also an aspect of #1 - whether or not it gets registered depends on the phase of the moon, and if you sneeze the code gen will turn it back into a load from the stack. The same is actually true of static globals, the compiler really wants to generate a load from the static base mem, it won't cache that.

Now you might think "I'll just copy it into a local" , but that doesn't work because the compiler completely eliminates that unnecessary copy. The most reliable way we found to make the compiler register important variables is to copy them into a global volatile (so that it can't eliminate the copy) then back into a local, which then gets registered. Ugh.

You might think this is not a big deal, but because the chips are so damn slow, every instruction counts. By not registering the variables, they wind up doing extra loads and adds to get the values out of static of stack mem and generate the offsets and so on.

X. Standard loop special casing. On Xenon they seem to special case the standard


for(int i=0;i < count;i++) { }

kind of loop. If you change that at all, you get fucked. eg. if you just do the same thing but manually, like :

for(int i=0;;)
{
    i++;
    if ( i == count ) break;
}

that will be much much slower because it loses the special case loop optimization. Even the standard paradigm of backward looping :

for(int i=count;i--;) { }

appears to be slower. This just highlights the need for a specific loop() construct in C which would let the compiler do whatever it wants.

X. Clear top 32s. The PS3 gcc wants to generate a ton of clear-top-32s. Dunno if there's a trick to make this go away.

X. Rotates and shifts. PPC has a lot of instructions for shifting and masking. If you just write the C, it's generally pretty good at figuring out that some combined operation can be turned into one instruction. eg. something like this :


x = ( y >> 4 ) & 0xFF;

will get turned into one instruction. Obviously this only works for constant shifts.

X. The ? : paradigm. As usual on the PC we are spoiled by our fucking wonderful compiler which almost always recognizes ? : as a case it can generate without branches. The PPC seems to have nothing like cmov or a good setge variant, so you have to generate it manually . The clean solution to this is to write your own SELECT , that's like :


#define SELECT(condition,val_if_true,val_if_false)  ( (condition) ? (val_if_true) : (val_if_false) )

and replace it with Mike's bit masky version for PPC.

7/09/2010

07-09-10 - Backspace

#define _WIN32_WINNT 0x0501 
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <psapi.h>

static bool strsame(const char * s1,const char * s2)
{
    for(int i=0;;i++)
    {
        if ( s1[i] != s2[i] )
            return false;
        if ( s1[i] == 0 )
            return true;
    }
}

// __declspec(dllimport) void RtlFillMemory( void *, size_t count, char );
extern "C" 
void * __cdecl memset(void *buf,int val,size_t count)
{
    char * p = (char *) buf;
    for(size_t i=0;i < count;i++)
    {
        p[i] = val;
    }
    return buf;
}

//#undef RtlCopyMemory
//NTSYSAPI void NTAPI RtlCopyMemory( void *, const void *, size_t count );
extern "C" 
void * __cdecl memcpy(void *dst,const void *src,size_t count)
{
    char * d = (char *) dst;
    char * s = (char *) src;
    for(size_t i=0;i < count;i++)
    {
        d[i] = s[i];
    }
    return dst;
}

//int CALLBACK WinMain ( IN HINSTANCE hInstance, IN HINSTANCE hPrevInstance, IN LPSTR lpCmdLine, IN int nShowCmd )
int my_WinMain(void)
{
    bool isExplorer = false;
    
    HWND active = GetForegroundWindow();
    
    DWORD procid;
    DWORD otherThread = GetWindowThreadProcessId(active,&procid);
            
    if ( active )
    {   
        HWND cur = active;
        for(;;)
        {
            HWND p = GetParent(cur);
            if ( ! p ) break;
            cur = p;
        }
        
        char name[1024];
        name[0] = 0;
        
        if ( GetClassNameA(cur,name,sizeof(name)) )
        {
            //lprintf("name : %s\n",name);
        
            isExplorer = strsame(name,"CabinetWClass") ||
                strsame(name,"ExploreWClass"); 
        }
    }
    
    if ( isExplorer )
    {   
        //lprintf("sending alt-up\n");
    
        INPUT inputs[4] = { 0 }; // this calls memset
        inputs[0].type = INPUT_KEYBOARD;
        inputs[0].ki.wVk = VK_LMENU;
        inputs[1].type = INPUT_KEYBOARD;
        inputs[1].ki.wVk = VK_UP;
        
        // send keyups in reverse order :
        inputs[2] = inputs[1]; // this generates memcpy
        inputs[3] = inputs[0];
        inputs[2].ki.dwFlags |= KEYEVENTF_KEYUP;
        inputs[3].ki.dwFlags |= KEYEVENTF_KEYUP;
        
        SendInput(4,inputs,sizeof(INPUT));
    }
    else
    {
        //lprintf("sending backspace\n");
        // can't use SendInput here cuz it will run me again
        
        // find the actual window and send a message :
        
        DWORD myThread = GetCurrentThreadId();
        
        AttachThreadInput(otherThread,myThread,TRUE);
        
        HWND focus = GetFocus();
        
        if ( ! focus )
            focus = active;
            
        // the HotKey thingy that I use will still send the KeyUp for Backspace,
        //  so I only need to send the key down :
        //  (some apps respond to keys on KeyUp and would process backspace twice- eg. Firefox)
        // also, if I send the KeyDown/Up together the WM_CHAR gets out of order oddly
        //PostMessageKeyUpDown(focus,VK_BACK);
        int vk = VK_BACK;
        int ScanKey = MapVirtualKey(vk, 0);
        const LPARAM lpKD = 1 | (ScanKey << 16);
        //const LPARAM lpKU = lpKD | (1UL << 31) | (1UL << 30);
        
        PostMessage(focus,WM_KEYDOWN,vk,lpKD);
        //PostMessage(w,WM_KEYUP,vk,lpKU);  
        
        AttachThreadInput(otherThread,myThread,FALSE);
    }   
    
    return 0;
}

extern "C"
{
int WinMainCRTStartup(void)
{
    return my_WinMain();
}
}

Bug fixed 07/12 : don't send key up message. Also, build without CRT and the EXE size is under 4k.

Download the EXE

7/08/2010

07-08-10 - Remote Dev

I think the OnLive videogames-over-the-internet thing is foolish and unrealistic.

There is, however, something it would be awesome for : dev kits.

If I'm a multi-platform videogame developer, I don't want to have to buy dev kits of every damn system for every person in my company. They cost a fortune and it's a mess to administer. It would be perfectly reasonable to have a shared farm of devkits somewhere out on the net, you can get to them whenever you want and run your game and get semi-interactive gameplay ala OnLive.

It would be vastly superior to the current situation, wherein poor developers wind up buying 2 dev kits for their 40 person team and you have to constantly be going around asking if you can use it. Instead you would always get instant access to some dev kit in the world.

Obviously this isn't a big deal for 1st party, but for the majority of small devs who do little ports it would be awesome. It would also rock for me because then I could work from home and instantly test my RAD stuff on any platform (and not have to ask someone at the office to turn on the dev kit, or make sure noone is using it, or whatever).

7/03/2010

07-03-10 - Length-Limitted Huffman Codes Heuristic

In the last post if you look at the comments you can find some comparison of optimal length limitted vs. heuristic length limitted.

I thought I would describe the heuristic algorithm. It is O(N) with no additional storage (it can work in place, which goes nicely with Moffat's in place Huffman codelen builder ). Here's the algorithm :

1. Build Huffman code lengths using Moffat INPLACE. You observe some of those code lengths are > maxCodeLen. We will work only on the code lengths, and we are given the symbol counts. We are given the symbol counts in sorted order (this was already done for INPLACE; if they were not originally sorted a simple NlogN sort will make them so).

2. Set all code lengths > max to be = maxCodeLen. We now have invalid code lengths, they are not "prefix". That is, they do not satisfy the kraft inequality K <= 1 for decodability.

3. Compute the Kraft number, K = Sum { 2 ^ - L_i } ; we currently have K > 1 and want to shrink it down to K = 1 by increasing some code lengths.

4. PASS 1. Walk over the symbols in sorted order (from lowest count to highest) while K > 1. Do :


  while ( codeLen[ s ] < max && K > 1 )
  {
    codeLen[ s ] ++;

    // adjust K for change in codeLen
    K -= 2 ^ - codeLen[ s ]
  }

5. PASS 2. Walk over the symbols backwards (from highest to lowest count) while K < 1. Do :


  while ( (K + 2^-codeLen[ s ]) <= 1 )
  {
    // adjust K for change in codeLen
    K += 2 ^ - codeLen[ s ]

    codeLen[ s ] --;
  }

6. You now have a set of codelens with K = 1 and all codeLens <= max. Fini.

Okay, so what's happening here ?

There's one forward pass and one backwards pass. First we truncate the code lengths that were too long. We are now in trouble and we need to find some other code lengths that we can make longer so that we are prefix again. The best code to make longer is the one with the lowest symbol count. It doesn't matter how long the current code length is, the cost of doing L += 1 is always the symbol count. So we just walk forward from the lowest symbol count. (*).

After step 4 you have a code with K <= 1 , if it's == 1 you're done, but sometimes it is < 1 because you bumped a lower codelen than necessary and you have a bit of space in your prefix code. To take advantage of this you want to find the highest count symbol whose length you can decrease and still have a prefix code.

As noted in the previous post this can be far from optimal, but in the standard case it just doesn't matter much because these are the very rare symbols.

footnotes :

(* while it is true that the cost is independent of L, the benefit to K is not independent of L, so adjusting shorter code lens is probably better. Instead of the minimum symbol count (C) you want to minimize the cost per benefit, which is C * 2^L . So you'd have to maintain a priority queue (**).)

(** it's actually more complex than that (I just tried it). In this step you will often be overshooting K, when considering overshooting you have to consider the penalty from doing len++ in the step that does the overshoot vs. how much you can get back by doing len-- elsewhere to come back to K=1. That is, you need merge step 4 and 5 such that you create a single priority queue which consists of some plain len++ ops, and also some ops that do one len++ some number of other len--'s, and pick the best of those options which doesn't overshoot K. Keep doing ops while K > 1 and you will wind up with K = 1. ).

Actually I wonder if this is a way to reconcile Huffman code building with Package-Merge ?

What would the correct priority queue op be for the (**) footnote ?

Say you're considering some op that does a len++ somewhere and overshoots K. You need compensate with some amount of K value to correct. Say that value you need to correct is 2^L. You can either do len-- on a code of length L, or you can do it on two codes of length L+1. Or one of length L+1 and two of length L+2.

Yep, I see it. Construct a priority queue for each length L. In the queue are symbols of code length L, and also pairs of items of length L+1 (an item is either a symbol or a pair). To correct K by 2^L you pick the best item from the L'th queue.

But rather than mess with this making an initial K and then doing corrections, you can just start with all L = 0 and K = N and then consider doing L++ on each code, that is, so you start by taking the best items from the L = 1 list. Which is just the package-merge algorithm !

Note that seeing this equivalence relies on some properties of the package-merge algorithm that aren't obvious. When you are pulling nodes at the final list (the L = 1 list), you can either pick a symbol; picking a symbol means its length was 0 and you are making it 1. That means that symbol was never picked before. (this is true because a coin i is never picked in an earlier list before it is made active in the final list). Or, if you don't pick a symbol you can pick a pair from the next L list. This corresponds to doing L++ on those code lengths. The key thing is : if a tree item has child i at level L, then child i already occurs L times as a raw symbol. This must be true because the cost of the tree item containing child i is > the cost of child i itself, so at all those levels child i would have chosen before the tree item.

For example :


L=3:   A  B

L=2:   A  B  {AB}  C

L=1:   A  B  {AB}  C  {AB|C}

At the point where we select {AB} in the L=1 list, A and B must already have occured once so their length is already 1. So {AB} means change both their lengths from 1 to 2; this adds them to the active set on the 2 list.

7/02/2010

07-02-10 - Length-Limitted Huffman Codes

I have something interesting to write about Huffman decoders, but that will have to wait a bit.

In the mean time I finally wrote a length-limitted huffman code builder. Everyone uses the "package-merge" algorithm (see Wikipedia , or the paper "A Fast and Space-Economical Algorithm for Length-Limited Coding" by Moffat et.al ; the Larmore/Hirschberg paper is impenetrable).

Here's my summary :


Explicity what we are trying to do is solve :

Cost = Sum[ L_i * C_i ]

C_i = count of ith symbol
L_i = huffman code length

given C_i, find L_i to minimize Cost

contrained such that L_i <= L_max

and 

Sum[ 2 ^ - L_i ] = 1
(Kraft prefix constraint)

This is solved by construction of the "coin collector problem"

The Cost that we minimize is the real (numismatic) value of the coins that the collector pays out
C_i is the numismatic value of the ith coin
L_i is the number of times the collector uses a coin of type i
so Cost = Sum[ L_i * C_i ] is his total cost.

For each value C_i, the coins have face value 2^-1, 2^-2, 2^-4, ...
If the coin collector pays out total face value of (n-1) , then he creates a Kraft correct prefix code

The coin collector problem is simple & obvious ; you just want to pay out from your 2^-1 value items ;
an item is either a 2^-1 value coin, or a pair of 2^-2 value items ; pick the one with lower numismatic value

The fact that this creates a prefix code is a bit more obscure
But you can prove it by the kraft property

If you start with all lenghts = 0 , then K = sum[2^-L] = N
Now add an item from the 2^-1 list
if it's a leaf, L changes from 0 to 1, so K does -= 1/2
if it's a node, then it will bring in two nodes at a lower level
    equivalent to to leaves at that level, so L changes from 1 to 2 twice, so K does -= 1/2 then too
    
so if the last list has length (2N-2) , you get K -= 1/2 * (2N-2) , or K -= N-1 , hence K = 1 afterward

-----------------------------------
BTW you can do this in a dynamic programming sort of way where only the active front is needed; has same
run time but less storage requirements.

You start at the 2^-1 (final) list.  You ask : what's the next node of this list?  It's either a symbol or
  made from the first two nodes of the prior list.  So you get the first two nodes of the prior list.
When you select a node into the final list, that is committed, and all its children in the earlier lists
  become final; they can now just do their increments onto CodeLen and be deleted.
If you select a symbol into the final list, then the nodes that you looked at earlier stick around so you
  can look at them next time.

Okay, so it all works fine, but it bothers me.

I can see that "package-merge" solves the "coin collector problem". In fact, that's obvious, it's the obvious way to solve that problem. I can also see that the minimization of the real value cost in "coin collector problem" can be made equivalent to the minimization of the total code length, which is what we want for Huffman code building. Okay. And I can understand the proof that the codes built in this way are prefix. But it's all very indirect and round-about.

What I can't see is a way to understand the "package-merge" algorithm directly in terms of building huffman codes. Obviously you can see certain things that are suggestive - the making pairs of items with minimum cost is a lot like how you would build a huffman tree. The funny thing is that the pairing here is not actually building the huffman tree - the huffman tree is never actually made; instead we make code lengths by counting the number of times the symbol appears in the active set. Even that we can sort of understand intuitively - if a symbol has very low count, it will appear in all L lists, so it will have a code length of L, the max. If it has a higher count, it will get bumped out of some of the lists by packages of lower-count symbols, so it will have a length < L. So that sort of makes sense, but it just makes me even more unhappy that I can't quite see it.

6/21/2010

06-21-10 - RRZ PNG-alike notes

Okay I've futzed around with heuristics today and it's a bit of a dead end. There are just a few cases that you cannot ever catch with a heuristic.

There is one good heuristic that works 90% of the time :


do normal filter
try compress with LZ fast and MinMatchLen = 8
try compress with LZ normal and MinMatchLen = 3

if the {fast,8} wins then the image is almost certainly a "natural" image. Natural images do very well with long min match len, smooth filters, and simple LZ matchers.

If not, then it's some kind of weird synethetic thing. At that point, pretty much all bets are off. Synthetic images have the damnable property that certain patterns repeat, so they are very sensitive to whether the LZ can find those patterns after filtering and such. But, a good start is to try the no-filter case with normal LZ, and perhaps try the Adaptive, and you can use Loco or Non-Loco depending on whether the normal filter chose loco post-filter or not.

But there are some bitches. kodak_12 is a natural image, and we detect that right. The problem is the best mode {N,N+L,A,A+L} changes when you optimize the LZ parameters, and it changes by quite a lot actually. Many modes will show N+L or A+L willing by quite a lot, but the right mode is N and it wins by 10k.

ryg_t.train.03.bmp is the worst. It is a solid 10% better with the "Normal" mode, but this only shows up when you do the LZ optimal parse; at any other setting of LZ all the modes are very close, but for some reason there are some magic patterns that only occur in Normal mode which are very good for the optimal parse - all the other modes stay about the same size when you turn LZ up to optimal, but Normal filter gets way smaller.

Okay, some actually useful notes :

There are some notes on the PNG web pages that say the best way to choose the filter per row is with sum of abs. Oh yeah? I can beat it. First of all, doing sum of abs but adding a small penalty for non-zero helps a tiny bit. But the best thing is to do entropy per row, and add a penalty for non-zero. You're welcome.

The filters N and (N+W)/2 are almost never best as whole-image filters, but are actually helpful in the adaptive filter loop.

I reduced my filter set down to only 5 and it hurt very little. Having the extra filters is basically free in terms of the format, but is a pain in the ass to maintain if you need to write optimized SIMD decoders for every filter on every platform. So for my own sanity, a minimum set of filters is preferrable.

BTW I should note that the fact that you have to tune minMatchLen and lzLevel is an indicator of the limitations of the optimal parse. If the optimal parse really found the best LZ stream, you should just run Optimal and let it pick what matches it wants. This is an example of it finding a local minimum which is not the global minimum. The problem is that the Huffman codes are severely different if you run with MML = 3 or 5 for example. Maybe there's a way around this problem; it requires more thought.

6/20/2010

06-20-10 - Struct annoyance

It's a very annoying fact of coding life that references through a struct are much slower than loading the variable out to temps. So for example this is slow :

void rrMTF_Init(rrMTFData * pData)
{
    for (int i = 0; i < 256;  i++)
    {
        pData->m_MTFimage[i] = pData->m_MTFmap[i] = (U8) i;
    }
}

and you have to do this by hand :

void rrMTF_Init(rrMTFData * pData)
{
    U8 * pMTFimage = pData->m_MTFimage;
    U8 * pMTFMap = pData->m_MTFmap;
    for (int i = 0; i < 256;  i++)
    {
        pMTFimage[i] = pMTFmap[i] = (U8) i;
    }
}

Unfortunately there are plenty of cases where this is actually a significant big deal.

06-20-10 - Some notes on PNG

Before we go further, lets have a look at PNG for reference.

Base PNG by default does :


Filter 5 = "adaptive" - choose from [0..4] per row using minimum sum of abs

Zlib strategy "FILTER" , which just means minMatchLength = 4

So the pngcrush guys did some clever things. Basically the idea is to try all possible ways to write a PNG and see which is smallest. The things you can play with are :


Filter 0..5

Zlib Strategy ; they have a few hard-coded but really it comes down to min match length

Zlib compression level (oddly highest level is not always best)

PNG scan options (progressive, interleaves, bottom up vs top down, etc.)

It's well known that on weird synthetic images "no filter" beats any filter. The only way you can detect that is actually by trying an LZ compress, you cannot tell from statistics.

The clever thing in pngcrush is that they don't search that whole space, but still usually find the optimal (or close to optimal) settings. The way they do it is with a heuristic guided search; they identify things that they have to always test (the 5 default strategies they try) with LZ, then depending on which of those is best they try a few others, and then maybe a few more, then you're done. It's like based on which branch of the search space you walk off initially they know from testing where the optimum likely is.

"loco" here is pngcrush with the LOCO color space conversion (RGB -> (R-G),G,(B-G) ) from JPEG-LS. This is the only lossless color conversion you can do that is not range expanding (eg. stays in bytes) (* correction : not quite true, see comments, of course any lifting style transform can be non-range-expanding using modulo arithmetic; it does appear to be the only *useful* byte-to-byte color conversion though). (BTW LOCO is not allowed in compliant PNG, but it's such a big win that it's unfair to them not to pretend that PNG can do LOCO for purposes of this comparison).

name png pngcrush loco advpng crush+adv best
ryg_t.yello.01.bmp 421321 412303 386437 373573 373573 373573
ryg_t.train.03.bmp 47438 37900 36003 34260 34260 34260
ryg_t.sewers.01.bmp 452540 451880 429031 452540 451880 429031
ryg_t.font.01.bmp 44955 37857 29001 22514 22514 22514
ryg_t.envi.colored03.bmp 113368 97203 102343 113368 97203 97203
ryg_t.envi.colored02.bmp 63241 55036 65334 63241 55036 55036
ryg_t.concrete.cracked.01.bmp 378383 377831 309126 378383 377831 309126
ryg_t.bricks.05.bmp 506528 486679 375964 478709 478709 375964
ryg_t.bricks.02.bmp 554511 553719 465099 554511 553719 465099
ryg_t.aircondition.01.bmp 29960 29960 23398 20320 20320 20320
ryg_t.2d.pn02.bmp 29443 26025 27156 24750 24750 24750
kodak_24.bmp 705730 704710 572591 705730 704710 572591
kodak_23.bmp 557596 556804 483865 557596 556804 483865
kodak_22.bmp 701584 700576 580566 701584 700576 580566
kodak_21.bmp 680262 650956 547829 646806 646806 547829
kodak_20.bmp 505528 504796 439993 500885 500885 439993
kodak_19.bmp 671356 670396 545636 671356 670396 545636
kodak_18.bmp 780454 779326 631000 780454 779326 631000
kodak_17.bmp 624331 623431 510131 615723 615723 510131
kodak_16.bmp 573671 541748 481190 517978 517978 481190
kodak_15.bmp 612134 611258 516741 612134 611258 516741
kodak_14.bmp 739487 703036 590108 739487 703036 590108
kodak_13.bmp 890577 859429 688072 866745 859429 688072
kodak_12.bmp 569219 533864 477151 535591 533864 477151
kodak_11.bmp 635794 634882 519918 635794 634882 519918
kodak_10.bmp 593590 592738 500082 593590 592738 500082
kodak_09.bmp 583329 582489 493958 558418 558418 493958
kodak_08.bmp 787619 786491 611451 787619 786491 611451
kodak_07.bmp 566085 565281 486421 566085 565281 486421
kodak_06.bmp 667888 631478 540442 644928 631478 540442
kodak_05.bmp 807702 806538 638875 807702 806538 638875
kodak_04.bmp 637768 636856 532209 637768 636856 532209
kodak_03.bmp 540788 506321 464434 514336 506321 464434
kodak_02.bmp 617879 616991 508297 596342 596342 508297
kodak_01.bmp 779475 760251 588034 706982 706982 588034
bragzone_TULIPS.bmp 680124 679152 591881 680124 679152 591881
bragzone_SERRANO.bmp 153129 107167 107759 96932 96932 96932
bragzone_SAIL.bmp 807933 806769 623437 807933 806769 623437
bragzone_PEPPERS.bmp 426419 424748 376799 426419 424748 376799
bragzone_MONARCH.bmp 614974 614098 526754 614974 614098 526754
bragzone_LENA.bmp 474968 474251 476524 474968 474251 474251
bragzone_FRYMIRE.bmp 378228 252423 253967 230055 230055 230055
bragzone_clegg.bmp 483752 483056 495956 483752 483056 483056

There's no adv+loco because advpng and advmng both refuse to work on the "loco" bastardized semi-PNG.

BTW I should note that I should eat my hat a little bit over my own "PNG sucks" post. The thing is, yes basic PNG is easy to beat and it has a lot of mistakes in the design, but the basic idea is fine, they did a good job on the standard pretty quickly, but the thing that really seals the deal is that once you make a flexible open standard, people will step in and find ways to play with it, and while base PNG is pretty bag, PNG after optimization is not bad at all.

06-20-10 - Searching my PNG-alike

Okay so stealing some ideas from pngcrush let's check out the search space. I decide I'm going to examine various filters, Loco or no loco, LZ min match len, and LZ compress "level" (level = how hard it looks for matches).

Here are the results for my PNG-alike with the modes :

0 = no filter
Loco = no filter (in loco space)
Normal = select a single DPCM filter for the whole image
N+L = Normal in LOCO space
Adaptive = per row best DPCM filter
A+L = you get it by now

The left six columns are these modes with default LZ parameters (Fast match, min match len = 4). The right six columns are the same modes with LZ parameters optimized for each mode.

name 0 Loco Normal N+L Adaptive A+L optimized LZs 0 Loco Normal N+L Adaptive A+L
ryg_t.yello.01.bmp 458255 435875 423327 427031 438946 431678 372607 359799 392963 395327 415370 401618
ryg_t.train.03.bmp 56455 55483 69635 76211 68022 67678 36399 35195 31803 40155 37582 36638
ryg_t.sewers.01.bmp 599803 610463 452287 452583 466154 452834 593935 593759 421779 420091 443786 421166
ryg_t.font.01.bmp 42239 32207 38855 38855 53070 36746 33119 26911 35383 35383 40998 33798
ryg_t.envi.colored03.bmp 297631 309347 150183 165803 142658 157046 265755 278103 102487 114923 95394 109022
ryg_t.envi.colored02.bmp 109179 112623 100687 113867 89178 98374 90039 93535 62139 68027 54662 57514
ryg_t.concrete.cracked.01.bmp 481115 407759 355575 356911 408054 361602 384795 353235 299963 301907 342810 310342
ryg_t.bricks.05.bmp 551475 485271 418907 417655 492622 418406 469063 448279 372195 370459 429066 373310
ryg_t.bricks.02.bmp 665315 632623 482367 481347 538670 483106 590319 577699 455431 455203 522158 455426
ryg_t.aircondition.01.bmp 41635 29759 26011 26011 32866 25738 29023 25103 20547 20547 23946 20522
ryg_t.2d.pn02.bmp 25075 26539 28303 29259 28046 28974 22147 22723 25915 26179 26194 25790
kodak_24.bmp 826797 771829 640137 634141 726892 633308 723681 712285 567693 558085 684060 559564
kodak_23.bmp 835449 783941 576481 569981 608476 565172 724641 712001 481577 478041 551292 479240
kodak_22.bmp 898101 879949 655461 651213 722096 651000 803429 804073 577433 571301 689664 574252
kodak_21.bmp 781077 708861 617565 629401 705608 618424 647881 633069 549865 549025 665724 545584
kodak_20.bmp 609705 561957 495509 500161 537692 494484 501293 490849 434865 431745 506592 429556
kodak_19.bmp 822045 733053 630793 624897 697020 619064 673953 658345 550669 541845 660444 541424
kodak_18.bmp 941565 912081 691161 692425 789736 693804 850705 848353 618961 619949 764004 622628
kodak_17.bmp 758089 678233 597225 592941 660016 590292 617097 606045 507169 504961 610092 508672
kodak_16.bmp 650557 587537 543001 543001 607916 545244 522829 511833 466277 466277 536280 468136
kodak_15.bmp 759109 697257 595353 590321 648656 586324 643385 628481 511193 504213 593304 506728
kodak_14.bmp 891629 793553 661505 657357 745816 649928 749085 731569 584645 580301 707596 581520
kodak_13.bmp 997161 891637 729425 730901 878212 729580 853557 829057 677041 678613 802224 680196
kodak_12.bmp 672361 602825 545921 562749 606004 549292 539305 526793 465297 472693 539532 467088
kodak_11.bmp 758197 691697 604869 597125 665364 587264 639537 625145 523649 512685 608388 510200
kodak_10.bmp 747121 681213 592637 589561 635972 578888 625961 611553 504573 499753 576284 497400
kodak_09.bmp 688365 629429 587233 583245 627916 571652 565449 562329 501377 495637 567772 491896
kodak_08.bmp 1001257 882153 686961 684177 792916 684056 860269 825757 615657 610505 766620 610524
kodak_07.bmp 709829 673917 568649 563561 616820 561636 605177 600157 480705 473233 551136 473500
kodak_06.bmp 779709 694229 600145 600145 687824 601564 642525 626449 534037 534037 637180 534804
kodak_05.bmp 962357 873793 700581 695257 810688 694828 845905 823025 633581 623341 783400 624368
kodak_04.bmp 813869 729865 613241 606849 672176 607280 677345 660057 531533 522061 622904 528064
kodak_03.bmp 639809 586873 522049 542681 581880 527856 519549 510309 437765 452965 511900 443948
kodak_02.bmp 729069 649709 603853 591781 654408 585916 598913 584941 515437 502941 602364 500964
kodak_01.bmp 872669 747333 661481 655597 772808 653388 699945 682005 593001 582389 689436 586328
bragzone_TULIPS.bmp 1032589 1021309 646905 652213 701508 652128 966881 969913 565997 571377 662504 570796
bragzone_SERRANO.bmp 150142 151706 169982 169302 173229 175217 103462 103566 139306 139074 142609 143457
bragzone_SAIL.bmp 983473 941993 686013 686117 795152 686204 892301 887097 613845 609953 762420 610008
bragzone_PEPPERS.bmp 694247 679795 424603 423679 451006 424262 655987 650771 369291 366611 416426 368106
bragzone_MONARCH.bmp 887917 868533 600373 598985 654864 598976 810325 805725 507937 508085 598348 508096
bragzone_LENA.bmp 737803 733251 493703 498299 498274 502150 710215 704763 467103 475179 471586 477638
bragzone_FRYMIRE.bmp 342667 344807 419859 420811 420506 419026 241899 242355 335063 336859 335990 335894
bragzone_clegg.bmp 760493 799717 525077 541329 523580 536244 557529 571413 445897 468265 444736 465376

One important note :

The "Normal" filters include the option to do a post-filter Loco conversion. This is different than the "loco space" option in the modes above. Let me elaborate. "Loco" in the modes listed above means transform the image into Loco colorspace, and then proceed with filtering and LZ. Loco built into a filter mode means, on each pixel do the filter delta, then do loco conversion. This can be integrated directly into the DPCM pixel delta code, so it's just considered a filter type. In particular, in "loco space", then the N,W,NW neighbors are already in loco colorspace. When loco is part of the filter, the neighbors are in RGB space and the delta is converted after the fact. If everything was linear, these would be equivalent.

Okay, what do we see?

It's very surprising to me how much LZ optimization helps. In particular it surprises me that making the LZ search *worse* (by turning down compression "level") helps a lot; as does increasing match len; on natural images a min match len around 8 or 10 is usually best. (or even more, I forbid a min match len > 10 because it starts to hurt decode speed).

Well we were hoping that we could pick the mode based on the default LZ parameters, and then just optimize the LZ parameters for that one mode. It is often the case the the best mode after optimization is the same as the best mode before optimization, but not always. When it is not the case, it is usually a small mistake. However, there is one case where it's a very bad mistake - on ryg_t.yello.01.bmp you would make output of 393k instead of 360k.

Natural images are the easiest; for them you can pretty much just pick A+L (or N+L) and you're very close to best if you didn't get the best. Synthetic images are harder, they are very sensitive to the exact mode.

We can also say that no filter + loco is almost always wrong, except for that same annoying one case. Unfortunately I don't see any heuristic that can detect when 0+loco needs to be checked. Similarly for adaptive + noloco.

Obviously there's a fundamental problem when the initial sizes are very close together, you can't really differentiate between the modes. When the sizes are very far apart then it is a reliable guess.

Let me briefly note things I could be searching that I'm not :

Rearranging pixels in various ways, eg. RGBRGB -> RRRGGGBB , or to whole planes; interleaving lines, different scan orders, etc. etc.

LSQR fits for predictors. This doesn't hurt decode speed a ton so it would fit in my design spec, I'm just getting sick of wasting my time on this so I'm not bothering with it.

Predictors on different regions instead of per row. eg. a predictor type per 16x16 tile or something.

Anything that hurts decode speed, eg. bigger predictors, adaptive predictors, non-LZ coder, etc.

Oh I'm also not looking at any pixel format conversions; I assume the client has put it in the pixel format they want and won't change it. Obviously some of the PNG optimizers can win by palettizing when not many colors are actually used, and of course there are lots of other pixel formats that might help, blah blah.

Oh while I'm at it, I should also note that my LZ is actually kind of crippled for this comparison. I divide the data stream into 256k chunks and compress them completely independently (no LZ window across the chunk boundary). This lets me seek on compressed data and decompress portions of it independently, but it is quite a bad penalty.

old rants