1/25/2009

01-24-09 - linkies

Complexification Gallery - wow really gorgeous images. All made algorithmically with Processing, and most driven by semi-physical-mathematical models. Lots of applets to play with. This is seriously fucking amazing and inspiring.

Killzone 2 making of video is pretty awesome.

Mischief and Ivo Beltchev have some crazy debugger database plugin thing for the string-CRC model. IMO this is fucking nuts. But I do love me some autoexp.dat

We were talking about this the other day and I realized I've forgotten what "Koenig lookup" is and why you need it. Basically it just means that functions are looked up in the namespace of their arguments. So :


namespace my_ns
{
    class Vec3 { ... };

    void func( Vec3 & x )
    {
        ...
    }

};

int main()
{
    my_ns::Vec3 v;
    func(v);
}

works, even though it's calling a func() that's not in the global namespace.

If you think about it a second this is nice, because it means that non-member functions on a class act like they are in the same namespace as that class. This is pretty crucial for non-member operators; it would be syntactically horrific to have to call the operators in the right namespace. But it's nice for other stuff too, it means you don't need to jam everything into member functions in order to make it possible to hand a class out to another namespace.


namespace my_ns
{
    class Vec3 { ... };

    bool operator == (const Vec3 &a , const Vec3 & b) { ... }
};


int main()
{
    my_ns::Vec3 a,b;

    if ( a == b )  // ***
    {
        ...
    }
}
At the line marked *** we're calling my_ns::operator == (Vec3,Vec3) - but how did we get to call a function in my_ns when we're not in that namespace? Koenig lookup.

Now, this really becomes crucial when you start doing generic programming with templates and namespaces. The reason is that the STL containers and algorithms are in the std:: namespace, but the objects you pass in are in your namespace. Obviously the STL containers and algorithms need to get access to the operations in your namespace. The only way they can do that is Koenig lookup - they use the namespace of the type they are operating on. For example, for std::sort to make use of your "operator <" it has to get to your namespace.

See Herb's good articles :

What's In a Class - The Interface Principle
Namespaces and the Interface Principle
Argument dependent name lookup - Wikipedia, the free encyclopedia

Not exactly related but other good C++ stuff : Koenig has a Blog.

And if you want to get really befuddled about C++ name lookup you can mix this in : Why Not Specialize Function Templates .

Finally, I've had "dos here" for a long time, but of course it really should be "tcc here". Just follow these registry instructions but for the "command" put :


"C:\Program Files\JPSoft\TCCLE9\tcc.exe" /IS cdd %1

1/21/2009

01-21-09 - Towards Lock-Free Threading

Currently Oodle threads communicate through plain old mutex locking, but I'd like to go towards lock-free communication eventually. Now, lock-free coding doesn't have the normal deadlock problems, but it has a whole new host of much scarier and harder to debug thread timing problems, because you don't have the simplicity of mutex blocking out concurrency during shared data access.

It occurs to me there's a pretty simple way to make the transition and sort of have a middle ground.

Start by writing your threading using plain old mutexes and a locked "communication region". The communication area can only be accessed while the mutex is held. This is just the standard easy old way :


{Thread A} <---> {Communication Region} <---> {Thread B}

Thread A - lock mutex on CR
    change values ; 
    unlock
--- Thread B lock mutex
    read values
    unlock

Now find yourself a lockfree stack (aka singly linked list LIFO). The good old "SList" in Win32 is one fine choice. Now basically pretend that Thread A and Thread B are like over a network, and send messages to each other via the lock-free stacks.

To keep the code the same, they both get copies of the Communication Region :


{Thread A | Communication Region} <---> {Communication Region | Thread B}

and they send messages via two stacks :

{Thread A | Communication Region} --- stack AtoB --> {Communication Region | Thread B}
{Thread A | Communication Region} <-- stack BtoA --- {Communication Region | Thread B}

The messages you send across the stacks apply the deltas to the communication regions to keep them in sync, just like a networked game.

So for example, if Thread A is the "main thread" and Thread B is a "job doer thread" , a common case might be that the communication region is a list of pending jobs and a list of completed jobs.


Main Thread
    {Pending Jobs}
    {Completed Jobs}

Request Job :
    add to my {Pending Jobs}
    push message on stackAtoB


Worker Thread
    {Pending Jobs}
    {Completed Jobs}

Sit and pop jobs off stackAtoB
put in pending jobs list
work on pending jobs
put result on completed jobs list
push completion message on stackBtoA

The nice thing is that Main Thread can at any time poke around in his own Pending and Completed list to see if various jobs are still pending or done yet awaiting examination.

Obviously if you were architecting for lock-free from the start you wouldn't do things exactly like this, but I like the ability to start with a simple old mutex-based system and debug it and make sure everything is solid before I start fucking around with lock-free. This way 99% of the code is identical, but it still just talks to a "Communication Region".


ADDENDUM :

I should note that this is really neither new nor interesting. This is basically what every SPU programmer does. SPU "threads" get a copy of a little piece of data to work on, they do work on their own copy, and then they send it back to the main thread. They don't page pieces back to main memory as they go.

While the SPU thread is working, the main thread can either not look at the "communication region", or it can look at it but know that it might be getting old data. For many applications that's fine. For example, if the SPU is running the animation system and you want the main thread to query some bone positions to detect if your punch hit somebody - you can just go ahead and grab the bone positions without locking, and you might get new ones or you might get last frame's and who cares. (a better example is purely visual things like particle systems)

Now I should also note that "lock free" is a bit of a false grail. The performance difference of locks vs. no locks is very small. That is, whether you use "CriticalSection" or "InterlockedExchange" is not a big difference. The big difference comes from the communication model. Having lots of threads contending over one resource is slow, whether that resource is "lock free" or locked. Obviously holding locks for a long time is bad, but you can implement a "lock free" model using locks and it's plenty fast.

That is, this kind of model :


[ xxxxxxxxxx giant shared data xxxxxxxxxx ]

    |          |         |         |
    |          |         |         |
    |          |         |         |

[thread A] [thread B] [thread C] [thread D]

is slow regardless of whether you use locks or not. And this kind of model :

[data A  ] [data B  ] [data C  ] [data D  ] 
     
    |     /    |     /   |       / |
    |    /     |    /    |      /  |
    |   /      |   /     |     /   |

[thread A] [thread B] [thread C] [thread D]

is fast regardless of whether you use locks or not. Okay I've probably made this way more wrong and confusing now.


ADDENDUM #2 :

Let me try to express it another way. The "message passing" model that I described is basically a way of doing a large atomic memory write. The message that you pass can contain various fields, and it is processed synchronously by the receiver. That makes common unsafe lock-free methods safe. Let me try to make this clear with an example :

You want Thread B to do some work and set a flag when it's done. Thread A is waiting to see that flag get set and then will process the work. So you have a communication region like :

// globals :
    bool isDone;
    int workParams[32];
    int workResults[32];

Now a lot of people try to do lock-free work passing trivially by going :

Thread A :

    isDone = false;
    // set up workParams
    // tell Thread B to start

    ... my stuff ...

    if ( isDone )
    {
        // use workResults
    }   

Thread B :

    // read workParams

    // fill out workResults

    isDone = true;

Now it is possible to make code like this work, but it's processor & compiler dependent and can be very tricky and causes bugs. (I think I'll write some about this in a new post, see later). (the problem is that the reads & writes of isDone and the params and results don't all happen together and in-order). Instead we can just pass the object :

struct ThreadMessage
{
    LinkNode link; // intrusive node for the message stack
    bool isDone;
    int workParams[32];
    int workResults[32];
};

Thread A :
    ThreadMessage myLocalCopy; // in TLS or stack
    // fill out myLocalCopy
    // push myLocalCopy to thread message queue to thread B

    ... my stuff ...

    if ( pop my message queue )
    {
        myLocalCopy = copy from queue
        // use myLocalCopy.workResults
    }
    
Thread B :
    ThreadMessage myLocalCopy; // in TLS or stack
    myLocalCopy = pop message queue

    // read myLocalCopy.workParams

    // fill out myLocalCopy.workResults
        
    push myLocalCopy to queue for thread A

Okay. Basically we have taken the separate variables and linked them together, so that as far as our thread is concerned they get written and read in one big chunk. That is, we move the shared data from one large consistent state to another.

1/17/2009

01-17-09 - Float to Int

A while ago the Some Assembly Required blog wrote some good notes about float-to-int. I posted some notes there but I thought I'd try to summarize my thoughts coherently.

What I'd like is a pretty fast float-to-int (ftoi) conversion. The most useful variants are "truncate" (like C, fractions go towards zero), and "round" , that is, fractions go towards the nearest int. We'd like both to be available all the time, and both to be fast. So I want ftoi_trunc and ftoi_round.

First let me say that I hate the FPU control word with a passion. I've had so many bugs because of that fucker over the years. I write some code and test it and everything is fine, and then we actually put in some game and all hell breaks loose. WTF happened? Oh well, I tested it with the default word setup, and now it's running with the FPU set to single precision. The other classic bug is people changing the rounding mode. D3D used to be really bad about this (you could use FPU_PRESERVE but it was a pretty big hit back in the old days with software T&L, not a big deal any more). Or even worse is people who write code intentionally designed to work with the FPU in a non-standard rounding mode (like round to nearest). Then if you call other code that's meant for the normal rounding mode, it fails.

Ok, rant over. Don't mess with the FPU control word.

That means the classic /QIfist really doesn't do that much for us. Yes, it makes ftoi_trunc faster :


int ftoi_trunc(float f) { return (int) f; }

that's fast enough with /QIfist, but you still need round :

int ftoi_round(float f) { return ( f >= 0.f ) ? (int)(f + 0.5f) : (int)(f - 0.5f); }

note that a lot of people just do + 0.5 to round - that's wrong, for negatives you need to go the other way, because the C truncation is *toward zero* not *down*.

Even if you could speed up the round case, I really don't like using compiler options for crucial functionality. I like to make little code snippets that work the way I want regardless of the compiler settings. In particular if I make some code that relies on ftoi being fast I don't want to use C casts and hope they set the compiler right. I want the code to enforce its correctness.

Fortunately the xs routines at stereopsis by Sree Kotay are really good. The key piece is a fast ftoi_round (which I have slightly rejiggered to use the union method of aliasing) :


union DoubleAnd64
{
  uint64    i;
  double    d;
};

static const double floatutil_xs_doublemagic = (6755399441055744.0); // 2^52 * 1.5

inline int ftoi_round(const float val)
{
  DoubleAnd64 dunion;
  dunion.d = val + floatutil_xs_doublemagic;
  return (int) dunion.i; // just cast to grab the bottom bits
}

in my tests this runs at almost exactly the same speed as FISTp (both around 7 clocks), and it always works regardless of the FPU control word setting or the compiler options.

Note that this is a "banker's round" not a normal arithmetic rounding where 0.5 always goes up or down - 0.5 goes to the nearest *even* value. So 2.5 goes to 2.0 and 3.5 goes to 4.0 ; eg. 0.5's go up half the time and down half the time. To be more precise, ftoi_round will actually round the same way that bits that drop out of the bottom of the FPU registers during addition round. We can see that's why making a banker_round routine was so easy, because that's what the FPU addition does.

But, we have a problem. We need a truncate (ftoi_trunc). Sree provides one, but it uses a conditional, so it's slow (around 11 clocks in my tests). A better way to get the truncate is to use the SSE intrinsic :


inline int ftoi_trunc(const float f)
{
  return _mm_cvtt_ss2si( _mm_set_ss( f ) );
}

Note that the similar _mm_cvt_ss2si (one t) conversion does banker rounding, but the "magic number" xs method is faster because it pipelines better, and because I'm building for x86 so the cvt stuff has to move the value from FPU to SSE. If you were building with arch:sse and all that, then obviously you should just use the cvt's. (but then you dramatically change the behavior of your floating point code by making it run through float temporaries all the time, instead of implicit long doubles like in x86).

So, that's the system I have now and I'm pretty happy with it. SSE for trunc, magic number for round, and no reliance on the FPU rounding mode or compiler settings, and they're both fast.

For completeness I'll include my versions of the alternatives :


#include <xmmintrin.h>

typedef unsigned __int64 uint64;

union DoubleAnd64
{
  uint64  i;
  double  d;
};

static const double floatutil_xs_doublemagic = (6755399441055744.0); // 2^52 * 1.5
static const double floatutil_xs_doublemagicdelta = (1.5e-8);                         //almost .5f = .5f + 1e^(number of exp bit)
static const double floatutil_xs_doublemagicroundeps = (0.5f - floatutil_xs_doublemagicdelta);       //almost .5f = .5f - 1e^(number of exp bit)

// ftoi_round : *banker* rounding!
inline int ftoi_round(const double val)
{
  DoubleAnd64 dunion;
  dunion.d = val + floatutil_xs_doublemagic;
  return (int) dunion.i; // just cast to grab the bottom bits
}

inline int ftoi_trunc(const float f)
{
  return _mm_cvtt_ss2si( _mm_set_ss( f ) );
}

inline int ftoi_round_sse(const float f)
{
  return _mm_cvt_ss2si( _mm_set_ss( f ) );
}

inline int ftoi_floor(const double val)
{
    return ftoi_round(val - floatutil_xs_doublemagicroundeps);
}

inline int ftoi_ceil(const double val)
{
    return ftoi_round(val + floatutil_xs_doublemagicroundeps);
}

// ftoi_trunc_xs = Sree's truncate
inline int ftoi_trunc_xs(const double val)
{
  return (val<0) ? ftoi_round(val+floatutil_xs_doublemagicroundeps) : 
                   ftoi_round(val-floatutil_xs_doublemagicroundeps);
}

BTW note the ceil and floor from Sree's XS stuff which are both quite handy and hard to do any other way. Note you might think that you can easily make ceil and floor yourself from the C-style trunc, but that's not true, remember floor is *down* even on negatives. In fact Sree's truncate is literally saying "is it negative ? then ceil, else floor".

Finally : if you're on a console where you have read-modify-write aliasing stall problems the union magic number trick is probably not good for you. But really on a console you're locked into a specific CPU that you completely control, so you should just directly use the right intrinsic for you.

AND : regardless of what you do, please make an ftoi() function of some kind and call that for your conversions, don't just cast. That way it's clear where you're converting, it's easy to see and search for, it's easy to change the method, and if you use ftoi_trunc and ftoi_round like me it makes it clear what you wanted.

ASIDE : in fact I'm starting to think *all* casts should go through little helper functions to make them very obvious and clear. Two widgets I'm using to do casts are :


// same_size_bit_cast casts the bits in memory
//  eg. it's not a value cast
template <typename t_to, typename t_fm>
t_to same_size_bit_cast( const t_fm & from )
{
    COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );
    return *( (const t_to *) &from );
}

// check_value_cast just does a static_cast and makes sure you didn't wreck the value
//  eg. for numeric casting to make sure the value fits in the new type
template <typename t_to, typename t_fm>
t_to check_value_cast( const t_fm & from )
{
    t_to to = static_cast<t_to>(from);
    ASSERT( static_cast<t_fm>(to) == from );
    return to;
}

intptr_t pointer_to_int(void * ptr)
{
    return (intptr_t) ptr;
}

void * int_to_pointer(intptr_t i)
{
    return (void *) i;
}

ADDENDUM : I'm told this version of same_size_bit_cast is better on various platforms. That's why we want it gathered up in one place so it can be easily changed ;)


// same_size_bit_cast casts the bits in memory
//  eg. it's not a value cast
template <typename t_to, typename t_fm>
t_to same_size_bit_cast( const t_fm & from )
{
    COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );
    
    union
    {
        t_fm    fm;
        t_to    to;
    } temp;
    
    temp.fm = from;
    return temp.to;
}

1/16/2009

01-16-09 - push_macro & pop_macro

WTF MSVC has macro push & pop !? How did I not know this? It's so superior. It actually makes #defining new & delete a possibly okay option. (normally I get sucked into a hell of having to #undef them and redef them back to the right thing)


#define test 1

#pragma PRAGMA_MESSAGE( STRINGIZE(test) )

#pragma push_macro("test")

#undef test
#define test 2

#pragma PRAGMA_MESSAGE( STRINGIZE(test) )

#pragma pop_macro("test")

#pragma PRAGMA_MESSAGE( STRINGIZE(test) )

outputs : 1 , 2 , 1

Wow!

BTW this demo used these tricks :


#define _Stringize( L )            #L
#define _DoMacro1( M, X )        M(X)

#define STRINGIZE(M)            _DoMacro1( _Stringize, M )
#define LINE_STRING                STRINGIZE( __LINE__ )

#define PRAGMA_MESSAGE(str)        message( __FILE__ "(" LINE_STRING ") : message: " str)

Of course I can't use it at work where multi-platform support is important, but I can use it at home where I don't give a flying fuck about things that don't work in MSVC and it makes life much easier.

01-16-09 - Virtual Memory

So I just had kind of a weird issue that took me a while to figure out and I thought I'd write up what I learned so I have it somewhere. (BTW I wrote some stuff last year about VirtualAlloc and the zeroer.)

The problem was this Oodle bundler app I'm working on was running out of memory at around 1.4 GB of memory use. I've got 3 GB in my machine, I'm not dumb, etc. I looked into some things - possible virtual address space fragmentation? No. Eventually by trying various allocation patterns I figured it out :

dwAllocationGranularity

On Windows XP all calls to VirtualAlloc get rounded up to the next multiple of 64k. Pages are 4k - and pages will actually be allocated to your process on 4k granularity - but the virtual address space is reserved in 64k chunks. I don't know if there's any fundamental good reason for this or if it's just a simplification for them to write a faster/smaller allocator because it only deals with big aligned chunks.

Anyway, my app happened to be allocating a ton of memory that was (64k + 4k) bytes (there was a texture that was exactly 64k bytes, and then a bit of header puts you into the next page, so the whole chunk was 68k). With VirtualAlloc that actually reserves two 64k pages, so you are wasting almost 50% of your virtual address space.

NOTE : that blank space you didn't get in the next page is just *gone*. If you do a VirtualQuery it tells you that your region is 68k bytes - not 128k. If you try to do a VirtualAlloc and specify an address in that range, it will fail. If you do all the 68k allocs you can until VirtualAlloc returns NULL, and then try some more 4k allocs - they will all fail. VirtualAlloc will never give you back the 60k bytes wasted on granularity.

The weird thing is there doesn't seem to be any counter for this. Here's what the TaskMgr & Procexp readings mean :

TaskMgr "Mem Usage" = Procexp "Working Set"

This is the amount of memory whose pages are actually allocated to your app. That means the pages have actually been touched! Note that pages from an allocated range may not all be assigned.

For example, if you VirtualAlloc a 128 MB range , but then only go and touch 64k of it - your "Mem Usage" will show 64k. Those pointer touches are essentially page faults which pull pages for you from the global zero'ed pool. The key thing that you may not be aware of is that even when you COMMIT the memory you have not actually got those pages yet - they are given to you on demand in a kind of "COW" pattern.

TaskMgr "VM Size" = Procexp "Private Bytes"

This is pretty simple - it's just the amount of virtual address space that's COMMITed for your app. This should equal the total "Commit Charge" in the TaskMgr Performance view.

ProcExp "Virtual Size" =

This one had me confused a bit and seems to be undocumented anywhere. I tested and figured out that this is the amount of virtual address space RESERVED by your app, which is always >= COMMIT. BTW I'm not really sure why you would ever reserve mem and not commit it, or who exactly is doing that, maybe someone can fill in that gap.

Thus :

2GB >= "Virtual Size" >= "Private Bytes" >= "Working Set".

Okay, that's all cool. But none of those counters shows that you have actually taken all 2 GB of your address space through the VirtualAlloc granularity.

ADDENDUM : while I'm explaining mysteriously named counters, the "Page File Usage History" in the Performance tab of task manager has absolutely nothing to do with the page file. It's just your total "Commit Charge" (which, recall, is the same as the "VM Size" or "Private Bytes"). Total Commit Charge is technically limited by the size of physical ram + the size of the paging file. (which BTW, should be zero - Windows runs much better with no paging file).


To be super clear I'll show you some code and what the numbers are at each step :


int main(int argc,char *argv[])
{

    lprintf("UseAllMemory...\n");

    vector<void *>  mems;
    
    #define MALLOC_SIZE     ((1<<16) + 4096)
    
    lprintf("reserving:\n");
    
    uint32 total = 0;
    
    for(;;)
    {       
        void * ptr = VirtualAlloc( NULL, MALLOC_SIZE , MEM_RESERVE, PAGE_READWRITE );
        
        if ( ! ptr )
        {
            break;
        }
        
        total += MALLOC_SIZE;
        mems.push_back(ptr);
        
        lprintf("%d\r",total);
    }
    lprintf("%u\n",total);

    lprintf("press a key :\n");
    getch();

This does a bunch of VirtualAlloc reserves with a stupid size. It prints :

UseAllMemory...
reserving:
1136463872
press a key :

The ProcExp Performance tab shows :

Private Bytes : 2,372 K
Virtual Size : 1,116,736 K
Working Set : 916 K

Note we only got around 1.1 GB. If you change MALLOC_SIZE to be a clean power of two you should get all 2 GB.

Okay, so let's do the next part :



    lprintf("committing:\n");
    
    for(int i=0;i < mems.size();i++)
    {
        VirtualAlloc( mems[i], MALLOC_SIZE, MEM_COMMIT, PAGE_READWRITE );
    }
    
    lprintf("press a key :\n");
    getch();

We committed it so we now see :

Private Bytes : 1,112,200 K
Virtual Size : 1,116,736 K
Working Set : 2,948 K

(Our working set also grew - not sure why that happened, did Windows just alloc a whole bunch? It would appear so. It looks like roughly 128 bytes are needed for each commit).

Now let's actually make that memory get assigned to us. Note that it is implicitly zero'ed, so you can read from it any time and pull a zero.


    lprintf("touching:\n");
    
    for(int i=0;i < mems.size();i++)
    {
        *( (char *) mems[i] ) = 1;
    }

    lprintf("press a key :\n");
    getch();

We now see :

Private Bytes : 1,112,200 K
Virtual Size : 1,116,736 K
Working Set : 68,296 K

Note that the Working Set is still way smaller than the Private Bytes because we have only actually been given one 4k page from each of the chunks that we allocated.

And wrap up :


    lprintf("freeing:\n");
    
    while( ! mems.empty() )
    {
        VirtualFree( mems.back(), 0, MEM_RELEASE );
        
        mems.pop_back();
    }   

    lprintf("UseAllMemory done.\n");

    return 0;
}


For background now you can go read some good links about Windows Virtual memory :

Page table - Wikipedia - good intro/background
RAM, Virtual Memory, Pagefile and all that stuff
PAE and 3GB and AWE oh my...
Mark's Blog : Pushing the Limits of Windows Virtual Memory
Managing Virtual Memory in Win32
Chuck Walbourn Gamasutra 64 bit gaming
Brian Dessent - Re question high virtual memory usage
Tom's Hardware - My graphics card stole my memory !

I'm assuming you all basically know about virtual memory and so on. It kind of just hit me for the first time, however, that our problem now (in 32 bit apps) is the amount of virtual address space. Most of us have 3 or 4 GB of physical RAM for the first time in history, so you actually cannot use all your physical RAM - and in fact you'd be lucky to even use 2 GB of virtual address space.

Some issues you may not be aware of :

By default Windows apps get 2 GB of address space for user data and 2 GB is reserved for mapping to the kernel's memory. You can change that by putting /3GB in your boot.ini , and you must also set the LARGEADDRESSAWARE option in your linker. I tried this and it in fact worked just fine. On my 3 GB work system I was able to allocate 2.6 GB to my app. HOWEVER I was also able to easily crash my app by making the kernel run out of memory. /3GB means the kernel only gets 1 GB of address space and apparently something that I do requires a lot of kernel address space.

If you're running graphics, the AGP window is mirrored into your app's virtual address space. My card has 256MB and it's all mirrored, so as soon as I init D3D my memory use goes down by 256MB (well, actually more because of course D3D and the driver take memory too). There are 1GB cards out there now, but mapping that whole video mem seems insane, so they must not do that. Somebody who knows more about this should fill me in.

This is not even addressing the issue of the "memory hole" that device mapping to 32 bits may give you. Note that PAE could be used to map your devices above 4G so that you can get to the full 4G of memory, if you also turn that on in the BIOS, and your device drivers support it; apparently it's not recommended.

There's also the Address Windowing Extensions (AWE) stuff. I can't imagine a reason why any normal person would want to use that. If you're running on a 64-bit OS, just build 64-bit apps.

VirtualQuery tells me something about what's going on with granularity. It may not be obvious from the docs, but you can call VirtualQuery with *ANY* pointer. You can call VirtualQuery( rand() ) if you want to. It doesn't have to be a pointer to the base of an allocation range. From that pointer it gives you back the base of the allocation. My guess is that they do this by stepping back through buckets of size 64k. To cover 2 GB of address space you need 32k chunks of 64k bytes. Each chunk has something like a MEMORY_BASIC_INFORMATION, which is about 32 bytes. To hold 32k of those would take 1 MB. This is just pure guessing.

SetSystemFileCacheSize is interesting to me but I haven't explored it.

Oh, some people apparently have problems with DLL's that load to fixed addresses fragmenting virtual memory. It's an option in the DLL loader to specify a fixed virtual address. This is naughty but some people do it. This could make it impossible for you to get a nice big 1.5 GB virtual alloc or something. Apparently you can see the fixed address in the DLL using "dumpbin.exe" and you can modify it using "rebase.exe".

ADDENDUM : I found a bunch of links about /3GB and problems with Exchange Server fragmenting virtual address space. Most interestingly to me these links also have a lot of hints about the way the kernel manages the PTE's (Page Table Entries). The crashes I was getting with /3GB were most surely running out of PTE's ; apparently you can tell the OS to make more room for PTE's with the /USERVA flag. Read here :

The number of free page table entries is low, which can cause system instability
How to Configure the Paged Address Pool and System Page Table Entry Memory Areas
Exchange Server memory management with 3GB, USERVA and PAE
Clint Huffman's Windows Performance Blog Free System Page Table Entries (PTEs)


I found this GameFest talk by Chuck Walbourn : Why Your Windows Game Won't Run In 2,147,352,576 Bytes , that covers some of these same issues. In particular he goes into detail about the AGP and memory mirroring and all that. Also in Vista with the new WDDM apparently you can make video-memory only resources that don't take any app virtual address space, so that's a pretty huge win.

BTW to be clear - the real virtual address pressure is in the tools. For Oodle, my problem is that to set up the paging for a region, I want to load the whole region, and it can easily be > 2 GB of content. Once I build the bundles and make paging units, then you page them in and out and you have nice low memory use. It just makes the tools much simpler if they can load the whole world and not worry about it. Obviously that will require 64 bit for big levels.

I'm starting to think of the PC platform as just a "big console". For a console you have maybe 10 GB of data, and you are paging that through 256 MB or 512 MB of memory. You have to be careful about memory use and paging units and so on. In the past we thought of the PC as "so much bigger" where you can be looser and not worry about hitting limits, but really the 2 GB Virtual Address Space limit is not much bigger (and in practice it's more like 1.5 GB). So you should think of the PC as having a "small" 1 GB of memory, and you're paging 20 GB of data through it.

1/15/2009

01-14-09 - Allocator Alignment

I like it when allocations of size N are aligned to the next lowest power of 2 below N.

So eg. an allocation of 4000 bytes is aligned to 2048. It means you can do things like just malloc a 128-bit vector and it's aligned and you never have to worry about it. You never have to manually ask for alignment as long as the alignment you want is <= the size of your object (which it almost always is).

eg. if you want to malloc some MAT4x4 objects, you just do it and you know that they are aligned to sizeof(MAT4x4).

Is there any disadvantage to this? (eg. does it waste a lot of memory compared to more conservative alignment schemes?)


Also, I used to always do my malloc debug tracking "intrusively" , that is, by sticking an info header at the front of the block and allocating a bigger piece for each alloc, then linking them together. The advantage of this is that it's very fast - when you free you just go to (ptr - sizeof(info_header)).

I think I am now convinced that that is the wrong way to do it. It's better to have a separate tracker which hashes from the pointer address to an info struct. The big advantage of the "non-intrusive" way like this is that it doesn't change the behavior of the allocator at all. So things like alignment aren't affected, and neither is cache usage or optimization issues (for example if you're using a GC-type arena allocator and adjacency of items is important to performance).

In general now I'm more eager for debugging and instrumentation schemes like this which have *zero* effect on the behavior of the core functionality, but basically just watch it from the outside.

(For example on consoles where you have 256M of memory in the real consoles and an extra 256M in the dev kits, it's ideal to make two separate allocators, one in the lower 256 where all your real data goes and one for the upper 256 where all your non-intrusive debug extra data goes; in practice this is a pain in the butt, but it is the ideal way to do things, so that you have the real final version of the game running all the time in the lower 256).

1/13/2009

01-13-09 - Strings

I just had another idea for strings that I think is rather appealing. I've ranted here before about refcounted strings and the suckitude of std::string and bstring and so on. Anyway, here's my new idea :

Mutable strings are basically a vector< char > like std::string or whatever. They go through a custom allocator which *never frees*. What that means is you can always just take a c_str char * off the string and hold onto it forever.

Thus the readable string is just char *, and you can store those in your hashes or whatever. Mutable string is a String thingy that supports operator += and whatnot, but you just hold those temporarily to do edits and then grab the char * out.

So the usage is that you always just pass around char *'s , your objects all store char *'s, nobody ever worries about who owns it and whether to free it, you can pass it across threads and not worry. To make strings you put a String on the stack and munge it all you want, then pull the char * out and rock with that.

Obviously this wastes memory, BUT in typical gamedev usage I think the waste is usually microscopic. I almost always just read const strings out of config files and then never edit them.

One exception that I'd like to handle better is frequently mutated strings. For example, you might have something in a spawner that does something like this :


for(int variant=0;variant < numVariants;variant++)
{
    // char name[80];
    // sprintf(name,"spawnmobj_%d",variant);
    String name("spawnmobj_");
    name += variant;

    Spawn(name);
}

I don't love making names programmatically like this, but lots of people do it and it's quite convenient, so it should be supported. With the model that I have proposed here, this would do allocs every time you spawn and memory use would increase forever. One way to fix this is to use a global string pool and merge duplicates at the time they are converted to char *. That way you don't ever increase your memory use when you make strings you made before - only when you make new strings.

With the string pool model, the basic op becomes :


    const char * StringPool::GetPermanentPointer( String & str );

in which the 'str' is added to the pool (or an existing one is found), and the char * you get back will never go away.
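A sketch of how that might look, with std::string standing in for the mutable String type (the implementation is my guess, only the GetPermanentPointer API is from the post) :

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// String pool sketch : interns a string and hands back a const char *
// that is valid forever, because pool entries are never erased.
class StringPool
{
public:
    const char * GetPermanentPointer(const std::string & str)
    {
        // insert() finds an existing duplicate or adds a new entry.
        // Rehashing an unordered_set never moves its elements, so the
        // c_str() pointer is stable for the life of the pool.
        return m_pool.insert(str).first->c_str();
    }
private:
    std::unordered_set<std::string> m_pool;
};
```

So the spawner loop above would make its allocs only on the first pass through the variants; after that every GetPermanentPointer call returns the pointer it already made.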

ADDENDUM : to be clear, this is not intended as an optimization at all, it's simply a way to make the programming easy without being too awful about crazy memory use. (eg. not just making String a char [512])

01-10-09 - Simple Image

libPNG is such overcomplicated shizzle. I could easily make a little image format that was just BMP with a LZ that was like maybe 1000 lines of code and just put it in a header, STB style. Hell I could toss in a lossy wavelet image coder for another 1000 lines of code. It wouldn't be the best in the world, but it would be good enough and super simple. Unfortunately I guess there's no point cuz it's not a standard and whatnot. (Using arithcoding instead of Huffman is part of what could make both of those so easy to write).

WIC is obviously a good thing in theory - it's silly that every app has its own image readers & writers (I mean Amiga OS did this perfectly back in 1892 so get with the last century bitches). On the other hand, the fact that it's closed source and MS-only and in fact requires .NET 3.5 makes it pretty much ass.

A simple open-source multi-platform generic image IO library that's based on pluggable components and supports every format is such an obvious thing that we should have. It should be super simple C. You shouldn't have to recompile apps to get new plug-ins, so it should be DLL's in a dir in Windows and whatever similar thing on other OS'es. (but you should also easily be able to just compile all the plug-ins into your app as a static lib if you want to do that).

One thing that mucks it up is that many of the image formats allow all kinds of complicated nonsense in them, and if you want to support all that then your API starts getting really complicated and ugly. Personally I'm inclined to only support rectangular bit-plane images (eg. N components, each component is B bits, and they are interleaved in one of a handful of simple ways, like ABCABC or AAABBBCCC ).

All compressors unfortunately have this problem that they start off super simple but become huge messes when you add lots of features.

ADDENDUM : Oh fucking cock hole. I put a PNG loader in my image thing and now all my apps depend on libPNG.dll and zlib.dll , so I have to distribute those with every damn exe, and worry about dll's being in path and so on. They also cause failures at startup if they're not found, when really they should just fail if I actually try to load a PNG. (of course I could do that by doing LoadLibrary by hand and querying for the functions, but I have to call like 100 functions to load a PNG so doing that would be a pain). Urg bother.

1/10/2009

01-09-09 - LRB Video

I'm kind of excited about the possibilities for video compression with Larrabee. The chip is very powerful and flexible and obviously well suited to video. Having that much power in the decoder would let you do things that are impossible today - mainly doing motion-comp on the decoder side. That lets you achieve the old dream of basically not sending motion vectors at all (or just sending corrections from what's predicted).

In fact, it would let you send video just more like a normal predictive context coder. For each pixel, you predict a probability for each value. That probability is done with context matching, curve fitting, motion compensation etc. It has to be reproduced in the decoder. This is a basic context coder. These kind of coders take a lot of CPU power, but are actually much simpler conceptually and architecturally than something like H264. Basically you are just doing a kind of Model-Coder paradigm thing which is very well understood. You use an arithmetic coder, so your goal is just to make more accurate probabilities in your model.

Other slightly less ambitious possibilities are just using things like 3d directional wavelets for the transform. Again you're eliminating the traditional "mocomp" step but building it into your transform instead.

Another possibility is to do true per-pixel optical flow, and frame to frame step the pixels forward along the flow lines like an incompressible fluid (eg. not only do the colors follow the flow, but so do their velocities). Then of course you also send deltas.

Unfortunately this is all a little bit pointless because no other architecture is anywhere close to as flexible and powerful, so you would be making a video format that can only be played back on LRB. There's also the issue that we're getting to the point where H264 is "good enough" in the sense that you can do HD video at near-lossless quality, and the files may be bigger than you'd like, but disks keep getting bigger and cheaper so who cares.

01-09-09 - Image Stuff

Most of this is related to RAW. The GUILLERMO LUIJK stuff in particular is very good. dcraw seems to be the best freeware raw importer, but god help you working with that. UFRaw and LibRaw are conversions of dcraw into more usable forms, though they tend to lag his updates. I've given up on WIC because I can't get it (the new Windows SDK) to install on all my machines.

The commercial RAW processors that I've looked at are so freaking slow, this is definitely a problem that could use some bad ass optimizer programmer love. Hell even just the viewers are slow as balls.

ImageMagick is pretty cool BTW ; it's really similar to the old DOS program "Image Alchemy" which I used to use lots in the 386 days. It's all command line so you can set up batch files to do the processing you want on various images.

DCRAW and mods :

Dave Coffin's Home Page
About LibRaw LibRaw
UFRaw - Home
RAWHide Image Converter
RAW decoder comparison dcraw vs. dcraw_ahd vs. Bibble

Good articles on photos :

GUILLERMO LUIJK - b i t s & p h o t o g r a p h y
GUILLERMO LUIJK TUTORIALS DCRAW TUTORIAL
perfectRAW 0.65 Buscando la perfección
Photographic Tone Reproduction

Windows Imaging Components (WIC) :

How to Write a WIC-Enabled CODEC and Get Full Platform Support for Your Image Format
Windows with C++ Decoding Windows Vista Icons with WIC
Windows Imaging Component Overview
WIC-Enabled Codecs C++ Tutorial Loading an Image File
Canon download WIC codec

Other MS junk :

Microsoft Research Image Composite Editor (ICE)
Microsoft Professional Photography SyncToy v2.0
Microsoft Professional Photography Downloads RAW SyncToy ProPhoto Shoot
Microsoft Professional Photography Codecs

Other misc image stuff :

LibTIFF - TIFF Library and Utilities
ImageMagick Convert, Edit, and Compose Images
Framewave Project Homepage
FastStone Image Viewer - Powerful and Intuitive Photo Viewer, Editor and Batch Converter

DNG (Adobe Digital Negative) :

DNG specification
DNG ProfilesEditor - Adobe Labs
Adobe - Digital Negative (DNG)

1/08/2009

01-07-09 - Oodle Rambling

One of the hard things with turning resources into bundles is that it's hard to define an exact metric to optimize. Generally with computers if you can come up with a simple fast measure of whether a certain configuration is best, then you can throw various techniques at it and do well under that metric (see for example all the work on the Surface Area Heuristic for geometry).

Anyway, the problem with bundling is the optimal setup depends very much on how they're used.

If you just do a standard "load everything and immediately stall" kind of whole level loading, then it's pretty easy. In fact the optimal is just to jam everything in the level into one bundle.

On the other hand, if you actually do paging and drop bundles in and out to reduce memory load, it's harder. Finest grain paging minimizes your memory use, because it means you never hold a single object in memory that isn't needed. In that case, again it's easy to make optimal bundles - you just merge together resources which are always either loaded or not loaded at the same time (eg. the list of "sets" which want them is identical).

More generally, you might load a big chunk for your level, then page pieces. To optimize for that you want the level load to be a big fast chunk, then the rest to be in pages. Also, you might not want the pages to be all finest possible grain, if you have a little memory to waste you probably want to merge some of them up to reduce seeks, at least to merge up many very small resources together.

One of the main ways to make bundles for Oodle is just to let Oodle watch your data loads and let it make the bundles for you. (if you don't like that, you can always completely manually specify what resource goes in which bundle). In order to make it possible for Oodle to figure out a good bundling you can also push & pop markers on a stack to define "sets" and maybe also do some tagging.

One thing that's occurred to me is that even the basic idea of making bundles to just load the data fast is dependent on how you use it. In particular, when you start trying to use the resource, and whether they always work as a group, etc.

For the straight loading path, I believe the key feature to optimize is the amount of time that the main thread spends waiting for resources; eg. make that time as small as possible. That's not the same as maximizing throughput or minimizing latency - it depends on the usage pattern.

For example :

Case 1.

Request A,B,C
... do stuff ...
Block on {A,B,C}
/Process {A,B,C}

Case 2.

Request A,B,C
... do stuff ...
Block on A
Process A
Block on B
Process B
Block on C
Process C

These are similar load paths, but they're actually very different. In Case 1 the bundle optimizer should try to make {ABC} all available as soon as possible. That means they should be blocked together as one unit to ensure there's no seek between them, and there's no need to make them available one by one. In Case 2 you should again try to make ABC linear to avoid seeks, but really the most important thing is to make A available as quickly as possible, because if there is some time needed to get B it will be somewhat hidden by processing A.

Anyway, I'm kind of rambling and I'm not happy with all of this.

I've got a lot of complication currently because I'm trying to support a lot of different usages. One of those usages that I've been trying to support is the "flat load".

"Flat Load" = a game resource compiler knows how to write the bits to disk exactly the way that the game wants them in memory. This lets you load up the resource and just point directly at it (maybe fix some pointers internal to the resource, but ideally not). This was a big deal back in the old days; it allows "zero CPU use" streaming - you just fire an async disk op, then when it's done you point at it. We did this on Xbox 1 for Stranger and it was crucial to being able to seamlessly page so much of the game.

This is obviously the super fast awesome way to load resources, but I don't think that many people actually do this any more. And it's less and less important all the time. For one thing, almost everybody is compressing data now to have better bandwidth, so the "flat load" is a bit of a myth - you're actually streaming the data through a decompressor. Almost every system now is multi-core, both on PC's and consoles, and people can afford to devote maybe 25% of one core to loading work, so the necessity of having "zero CPU use" streaming which the Flat Load offers is going away.

Anyway, the reason the Flat Load is so complicated is because it means I have to support the case that the application is pointing into the bundle memory. That means a lot of things. For systems with "cpu" and "gpu" separate memory regions, it means I have to load the pieces of the bundle into the right regions. It means I need to communicate with the app to know when it's okay for me to free that memory.

It also creates a huge issue for Bundle unloading and paging because you have to deal with the "resource transfer" issue. I won't even get into this, but it's the source of much ridicule from Casey and the cause of some of our biggest pain at Oddworld.

Anyway, I think I might get rid of the "Flat Load" completely and just assume that my job is just to page in the resource bits, and the game will do its processing on those bits, and then I can throw them away. That lets me make things a lot simpler at the low level, which would let me make the high level easier to use and neater.

I think my selling point will really be in all the neat friendly high level tools, like the load profiler, memory use visualizer, disk-watcher for demand loading, console file transfers, all that kind of jazz.

BTW :

"Data Streaming" = Bringing in bits of data incrementally and processing or showing the bits as they come (eg. not just waiting until all the data is in before it is shown). eg. Videos "stream".

"Data Paging" = Bringing in and out bits of data based on what's needed. When you run around a big seamless world game like GTA it is "paging" in objects - NOT STREAMING.

1/05/2009

01-05-09 - Paging

Somebody sent me this paper :

"I3D09 A Novel Page-Based Data Structure for Interactive Walkthroughs.pdf"

a little while ago. The basic ideas in it are totally fine, but they page on *4K* granularity. That's so completely bone-headed (on disks where seek time dominates) that it invalidates the whole paper.

There's a natural "critical size" that you can figure for paging which is the only possible combination of your fundamental variables. (this is sort of a physics style argument - there's only one way you can combine your variables and get the right units, therefore it must be the right answer, up to a constant multiplier).

Your two variables inherent to the system are seek time and throughput speed of your disk. You combine them thusly :


typical hard disk :

seek time 10 ms
load speed 50 MB/sec

10 ms = X / (50 MB/sec)

X = (10/1000) sec * 50 MB/sec

X = 0.5 MB


typical DVD :

seek time 100 ms
load speed 5 MB/sec

X = 100*5/1000 MB = 0.5 MB

Pleasantly, for both DVD and HD paging the critical paging unit size is close to the same !! Note that I'm not saying 0.5 MB is actually the ideal size, there may be some constant on there, so maybe it's 0.1 MB or 1.0 MB , but it's certainly somewhere close to that range (the exact optimum depends on a lot of things like how important latency vs. throughput is to you, etc).
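The arithmetic above is trivial but worth writing as code, since it's the one formula you'd actually use :

```cpp
// The "critical size" is the only quantity with units of bytes you can form
// from the two disk parameters : size = seek_time * throughput.
// At this size, seeking and reading cost about the same.
static double CriticalPageSizeMB(double seek_ms, double throughput_MB_per_sec)
{
    return (seek_ms / 1000.0) * throughput_MB_per_sec;
}
```

For example CriticalPageSizeMB(10, 50) and CriticalPageSizeMB(100, 5) both come out to 0.5 MB, which is the pleasant coincidence noted above.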

Of course, Solid State Disks change that in the future. They are really just like an extra level of slower RAM, and they make fine-scale paging practical. The possibilities for infinite geometry & infinite texture are very exciting with SSD's. Unfortunately you won't be able to target that mainstream for a long time.

BTW : the Giga-Voxels paper is sort of interesting. The basic method of using a tree with indirections to voxel bricks is an old idea (see earlier papers) - the main addition in the GigaVoxels paper is the bit to have the GPU pass back the information about what new pages need to be loaded, so that the render gives you the update information and it's just one pass. That bit is really ugly IMO with the current GPU architecture, but it's progress anyway. Personally I still kinda think voxel rendering is a red herring and surface reps are the way to go for the foreseeable future.

Jeff had a good idea for streaming loads a while ago. A lot of people play a Bink movie when they do level loads. During that time you're hitting the IO system hard for your level data and also for your movie load. Right now that causes a lot of seeking and requires big buffering for the movie data and so on. With Oodle doing the IO for Bink and for your level load, we could create an interleaved load stream that has like [200 ms of Bink Data][some level load data][200 ms of Bink data][ ... ] ; that eliminates all seeking and lets the DVD just jam fast.

In fact, in the ideal Oodle future that might not even be a special case. The Bink data could just be split into packets, and then it could all just be handed to the generic Oodle packet profiler & layout system, and Oodle would automatically lay out the packets to minimize total load time. This all probably won't be in Oodle V1 , but it's good to keep a picture in my head of where we want to be some day.

1/01/2009

01-01-09 - Console vs GUI

Raymond Chen's blog is often informative and interesting, but he's very obstinate and often has blinders on. The latest one about console vs GUI mode apps in Windows is a classic example of Microsoftian "our way is right" thinking which is just completely off base and also ignores what users and developers want. It doesn't matter how "right" you are (and BTW, you're not right), if lots of users (developers) want something, such as the ability to make apps that are either console apps or GUI apps, then you should give it to them.

In particular, I would say that 90% of the apps I write I would like to be GUI/console hybrid apps. That is, depending on the command line switches, they either run attached to the console or not. There's also another major usage pattern that I like : if the app is launched by running from an icon or something like that, run as a GUI app, but if it's run from a command line, then run as a console app. This can be accomplished with the devenv ".com" trick (works because from cmd a .com of the same name is run before the .exe), but that means you have to build two targets which is annoying.

In practice what I'm doing now is that I make my apps console apps, and then depending on how they are run (command line switches and such) it can respawn itself with DETACHED_PROCESS. (this also makes sure you detach from the console - I hate fucking GUI interactive apps that I run from the console that don't detach themselves and thus lock up my console). BTW this last bit is another important hybrid usability thing - I want my app to tell if it's going into interactive mode or not. If it's a non-interactive thing (like just showing help or doing some batch processing) then it should run as a non-detached console app. If it pops a GUI and runs interactive, it should detach.

This is all reasonably easy with these calls :


IsDebuggerPresent() (needed because you don't want to do any funny respawning if you're debugging)

GetConsoleWindow()

CreateProcess()

AllocConsole()

freopen("CONIN$","rb",stdin); // etc...
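How they fit together, as a rough sketch : the decision policy is the portable part, so the Windows-specific respawn flow is left in comments. (WantsInteractive and its switch names are made up for illustration.)

```cpp
#include <cassert>
#include <cstring>

// Decide whether this run should behave as a GUI app (detach from console)
// or as a console app (stay attached). Hypothetical policy for illustration :
// batch-ish switches mean console mode, otherwise assume interactive GUI use.
static bool WantsInteractive(int argc, const char ** argv)
{
    for (int i = 1; i < argc; i++)
        if (strcmp(argv[i], "--help") == 0 || strcmp(argv[i], "--batch") == 0)
            return false;
    return true;
}

// In the real app the flow is roughly :
//   if ( ! IsDebuggerPresent() && WantsInteractive(argc,argv) && GetConsoleWindow() )
//       respawn self via CreateProcess with DETACHED_PROCESS and exit ;
//   the respawned copy can AllocConsole() later if it wants a console back,
//   then freopen("CONIN$","rb",stdin) etc. to reattach stdio.
```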

I've written in this blog in the past about some of the very bizarre weirdness about how Windows manages the console system; in fact just go read my post from last year if you care.

BTW if you do want to do the .com/.exe thing to be cleaner, you should use the "subsystem not set" trick and just make two mains like this .

Also, for my GUI apps in "final" builds, I still have the code in there to write log info out to a normal console with printf. I just make the app start up detached - hence no console window. Then I give the user a button to toggle the console visible. The nice thing is that you can just GetConsoleWindow() and tell it to HIDE or be visible. Oh, make sure you create the console right at start up and then hide it even if you don't think you want a console, that way if the user chooses to show it, all the logs will be there in the history.

12/27/2008

12-26-08 - In Defense of Less Typing

In the C++ rant we got a bit into the discussion of "that just reduces typing". It's become a common wisdom these days that "anything that just reduces typing is not of significant importance in programming". This is sort of a reaction to the bad old days where developers put too much emphasis on reducing the time it took to make a first draft of the code. Now people like to repeat the platitudes that "first draft is only 10% of dev time, debuggability and readability and maintainability are what really matter".

Yes, yes, that is all true, but I think people miss the forest for the trees here in some cases.

Every character you type is a chance for a bug or a mistake. For one thing there are typos, but there are also just brain farts. The less you type when writing a piece of code, the more likely it is to be correct.

(I should emphasize the fact that reducing code duplication is very good for many reasons that I don't go into detail much in this rant, and those are the main reason to merge duplicate code; I'm talking about cases where the code is not exactly duplicated, but is similar, or where you have a choice between making a very simple API which requires a lot of typing by the client, or a more complex API which has very simple usage for the client).

A lot of good programmers now are adopting the idea of exposing simple minimal C-style APIs that leave usage up to the client. There are a lot of things to like about that (for example, Sean's stb.h style thing for simple functionality is in fact wonderfully easy to integrate and steal snippets from), but there are also bad things about it. I think good programmers overestimate their ability to write simple usage code without mistakes. For example, you might think that you don't need a class to encapsulate a 32-bit Color, you can easily just use a 4-byte array or pass around a dword and do the shifts by hand - but I have seen bug after bug from small mistakes in that kind of code, because if you write the same thing over and over, or copy-paste and try to change a bunch of code by hand, there is some small chance of mistake each time you do it.

It's funny to me that good programmers in game dev are going in two directions at the same time. One direction is to make code very easy to follow by simple visual inspection. Many major people in game dev are pretty high on this these days. The basic idea is to use C-style imperative programming, make the action of each line obvious to simple visual inspection, reduce segmentation of code into function calls and prefer simple long linear code blocks (rather than factored out functions, conditionals, objects, etc). There are certainly merits to that. The other direction people are going is custom metaprogramming and local language redefinition. Many of these people want coding languages where you can locally redefine the rules of the language and do custom markup for things like network mirroring, thread fiber spawning, local-global state memory variable differentiation, etc. This kind of stuff would make code completely impossible to understand by simple visual inspection without intimate understanding of how all the extra meta-language rules work. These ideas also have a lot of merit, because writing micro-threaded massively parallel code in plain C-style C++ is really difficult and ugly, and having custom features would make it much better - but these two directions are totally conflicting.

While I'm ranting about opposite directions, let me also rant about the idea that something is a good idea for "library design" but not for within your app (or vice versa). IMO Coding = Library Design. Most of the complex code that I write is in "libraries" for myself. Libraries are just chunks of functionality that you want to expose in a compact interface. Well, that's what you should be doing all the time. Coding is just writing a bunch of libraries, then the "app" is just tying together the "libraries".

So, for example, the ideas in Casey's excellent talk about good library design (things like exposing multiple levels of interface from very simple to nitty gritty, and not forcing a usage pattern on the client) are just good ways to write code *always*.

I don't trust the me of one year ago, nor do I trust the me of one year in the future. I need to write API's for myself that make me write the right code. Part of that is all the things I've often written about before (self-validating API's, API's that are impossible to use wrong), but part of it is just plain less typing. If the API makes me (the client) write a whole bunch of code to do the simple things I often want to do - that makes it far more likely I will do it wrong.

Also I believe the illusion of choice is a horrible thing. If there's really only one or two reasonable ways to use a system, then just expose that to me. Don't give me what looks like a bunch of flexible components, but they only really work right if you do one specific thing.


Addendum : okay I'm bored of this topic and I'm sure you are too, but I feel like I started it so I should wrap it up a bit more.

Paul Graham has this thing "Succinctness is Power" that's sort of similar to this rant. As usual he writes it well, but I think he's a little bit wrong. The issue that I believe is important, which is what I'm trying to talk about here is :

Reducing the number of choices that the programmer has to make in order to write correct code.

Part of that is reducing typing - but the crucial thing is reducing typing when there is actual choice in the alternatives. That is, if it's something you *have* to type, that's not bad. For example a very verbose language like COBOL is not inherently bad due to its verbosity (cobol is horrible for other reasons).

Making code that works correctly with minimal typing (and makes compile errors if you use it wrong) is the goal. So part of what I'm getting at here is using short common words that it's easy for the programmer to get right, using highly constrained classes instead of very general ones, etc.

Part of the credo goes like this :


remove the option to do things that you never want to do

make it hard to do things that you rarely want to do

make it easy to do the right thing

As an example - iterators are cool even when they save almost no work. Say for example you have something like a doubly linked list class. Many of the simple C guys would say "hey you can just write code to traverse the linked list", and you write client code like :

for(Node * n = list.head->next; n != list.head; n = n->next) // circular list with sentinel head
{
    ...
}

That's easy right, you're a good programmer, you can just write that loop. No. You can't, that's what I'm trying to say with this rant. I mean, yes, you can, but you had a 1% chance of introducing a bug because you had to write code.

Writing code is bad.

Instead you could make your Node look like an iterator. You could give it standard names like begin() and end(). Now instead of writing code you can just use for_each , or even just copy-paste or spit out a loop that you've done a million times :

for(Node::iterator it = list.begin(); it != list.end(); ++it)
{
    ...
}

Is safer because it's standard. On a similar note, using a constrained object like an iterator is safer than using an int, because every time you use an int people get tempted to do evil things. How many bugs have I seen because people try to get clever with their loop iteration? Maybe they count backwards for efficiency and use an unsigned type by mistake. Or they pull the ++i out of the for() and then forget to do it due to a continue. Or they use the "i" outside of the for loop and bugs get introduced.

Lots of people are anti-OOP these days. I love OOP ; no, not deep inheritance trees and lots of data inheritance, and whatnot, but the basic idea of coding in terms of objects that impose constraints and conditions on their use. The best way for me to program is to build components and helpers which make expressing the program easy.

12/18/2008

12-18-08 - Open Source Just Doesn't Work

There's a Netflix Plugin for Media Portal . I dare you to even try to figure out how to install it and get it to run. And then there are some big problems with control and integration once you get it running. I'm trying to get the developer to fix it; fingers crossed, it would be nice to have that. I keep hearing the Xbox Netflix integration is really awesome. I might have to get an Xbox just for that. Yay.

12/16/2008

12-16-08 - Libraries and Cinit

I need a kind of mini class-factory for Oodle. This is for when I load a paging bundle that's full of various resources, I need to make each one. (BTW there will be a "low level" usage pattern for Oodle where you just load a paging bundle and the game gets a handle to the bundle and does whatever it wants with it. This is for the "high level" layer that automates a lot of things for you but is optional).

I guess what I'm going to have to do is require the game to give me a creator function pointer that knows how to make all the objects. eg. it would give me something like


void * (*fp_factory) (int objectType);

and fp_factory would point to something like

void * GameObjectFactory(int type)
{
    switch(type)
    {
    case OBJECT_PLAYER : return (void *) new Player;
    case OBJECT_ENEMY: ...
    default: return NULL; // unknown type
    }
}

Or something. As a game developer I hate that kind of global registry, because when you add a new object type you have to remember to go update this global registry file, which becomes a frequent merge disaster for source control, etc. I really like having everything you need to make a game object in a single CPP file. That means objects should be self-registering.

The way I usually do self-registering objects is with a little class that runs at cinit. Something like :


#define IMPLEMENT_FACTORY_PRODUCT(classname)    namespace { ClassFactory::Registrar classname##registrar( classname::Factory , typeid(classname) ); }

then in Player.cpp you have :

IMPLEMENT_FACTORY_PRODUCT(Player);
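
For concreteness, here's a minimal sketch of what the ClassFactory::Registrar machinery might look like. These names and the string-keyed map are my own invention for illustration (the macro above keys on typeid), not a real Oodle API :

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical minimal class factory ; "ClassFactory" and "Registrar"
// are sketch names, not a real API.
struct ClassFactory
{
    typedef void * (*CreateFunc)();

    static std::map<std::string, CreateFunc> & Registry()
    {
        // function-local static : instantiated on first use,
        // so cinit order across translation units doesn't matter
        static std::map<std::string, CreateFunc> s_registry;
        return s_registry;
    }

    struct Registrar
    {
        Registrar(CreateFunc func, const char * name)
        {
            Registry()[name] = func;
        }
    };

    static void * Create(const char * name)
    {
        std::map<std::string, CreateFunc>::iterator it = Registry().find(name);
        return it == Registry().end() ? 0 : it->second();
    }
};

struct Player
{
    static void * Factory() { return new Player; }
};

// roughly what IMPLEMENT_FACTORY_PRODUCT(Player) expands to :
namespace { ClassFactory::Registrar s_playerRegistrar(Player::Factory, "Player"); }
```

Note the registry itself is a function-local static, so it's constructed on first use regardless of cinit ordering - which is exactly the instantiate-on-use discipline discussed below.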

That's all fine and dandy, but it's not so cool for me as a library maker.

For one thing, doing work during cinit is kind of naughty as a library. I've been trying to make it a rule for myself that I don't use object constructors to do work at cinit the way I sometimes like to. It's a perfectly normal thing to do in C++, and if you are writing C++ code you should make all your systems instantiate-on-use like proper singletons so that cinit objects work - BUT as a library maker I can't require the clients to have done all that right. In particular if I do something like allocations during cinit, they might run before the client does its startup code to install custom allocators or whatever.

For another thing, there are problems with the linker and cinit in libraries. The linker can drop objects even though they are doing cinit calls that register them in global factory databases. There are various tricks to prevent this, but they are platform dependent, and it's an evil little bit of spaghetti to get the client involved in.

I guess I probably also shouldn't rely on "typeid" or "dynamic_cast" or any other runtime type information existing either since people like to turn that off in the compiler options for no good reason (it has basically zero cost if you don't use it). So without that stuff I pretty much just have to rely on the game to give me type info manually anyway.

Bleh, I'm just rambling now...

12/15/2008

12-15-08 - Denoising

I've been playing around with denoising images a tiny bit. There's a ton of papers on this and I've barely scratched the surface, but it strikes me that the majority of the researchers seem to be doing silly things that are ill-conceived.

Almost all of them work in the same basic way. They create a prediction of what the current pixel should be with a local smooth predictor, let's call that 'S'. Then they take the difference from the actual pixel value 'A'. If the difference is greater than a certain threshold, |S - A| > T , they replace the pixel value with 'S'.

That's just very wrong. It assumes that images can be modeled with a single-lobe Gaussian probability distribution. In fact, images are better modeled as a blend of several lobes with different means. That is, there is not one single smooth generator, but many, which are switched or blended between based on some state.

Any single-lobe predictor will incorrectly identify some valid image detail as noise.

I like to make it clear that the problem has two parts : deciding if a pixel is noise or not noise, and then filling in a replacement value if you decide that the pixel is noise.

My feeling is that the second part is actually not the important or difficult part. Something like a median filter or a bilateral filter is probably an okay way to fill in a value once you decide that a pixel is noise. But the first part is tricky and as I've said any simple weighted linear predictor is no good.

Now, ideally we would have a rich model for the statistical generation of images. But I wrote about that before when talking about Image Doubling (aka Super-Resolution), and we're still very far from that.

In the meantime, the best thing we have at the moment, I believe, is the heuristic models of something like CALIC, or the Augural Image Zooming paper, or Pyramid Workshop or TMW. Basically these methods have 6 - 100 simple models of local image neighborhoods. For example the basic CALIC models are : {locally smooth}, {vertical edge}, {horizontal edge}, {45 degree edge}, {135 degree edge}, {local digital}, {pattern/texture}. The neighborhood is first classified to one of the heuristic models, and then a prediction is made using that model.

We can thus propose a simple heuristic noise detection algorithm :

Bayesian Noise Detection :

N = current local neighborhood
A = actual pixel value

P(M|N) = probability of model M given neighborhood N
P(A|M) = probability of pixel A was generated by model M

let

P(A|N) = argmax{M} P(A|M) * P(M|N)

then classify A as noise if

P(A|N) < T

for some threshold T

(I don't specify how the P's are normalized because it just changes the scaling of T,
but they should be normalized in the same way for the whole image)

Note that a very crucial thing is that we are using the argmax on models, NOT the average on models. What we're saying is that if *any* of our heuristic local models had a high likelihood of generating the value A, then we do not consider it noise. The only values that are noise are ones which are unlikely under *all* models.

In a totally hand-wavey heuristic way, this is just saying that if a pixel is within threshold of being a locally smooth value, or an edge value, or a texture, etc. then it's not noise. If it fits none of those models within threshold, it's noise.
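
To make the hand-waving concrete, here's a toy sketch of the argmax-over-models test on a 3x3 neighborhood. The three models here (smooth, vertical edge, horizontal edge) are placeholders I made up for the example, not CALIC's actual predictors :

```cpp
#include <cassert>
#include <cstdlib>

// Toy argmax-over-models noise test on a 3x3 neighborhood.
// n[0..7] are the 8 neighbors, a is the center pixel :
//   n layout :  0 1 2
//               3 . 4
//               5 6 7
// A real implementation would use CALIC-style classified predictors;
// these three models are illustrative placeholders.
static bool IsNoise(const int n[8], int a, int threshold)
{
    int predSmooth = (n[1] + n[3] + n[4] + n[6] + 2) / 4; // average of 4-neighbors
    int predVEdge  = (n[1] + n[6] + 1) / 2;               // predict along a vertical edge
    int predHEdge  = (n[3] + n[4] + 1) / 2;               // predict along a horizontal edge

    // the pixel is noise only if it's far from *every* model's prediction ;
    // being close to any one model is enough to keep it
    int best = abs(a - predSmooth);
    if ( abs(a - predVEdge) < best ) best = abs(a - predVEdge);
    if ( abs(a - predHEdge) < best ) best = abs(a - predHEdge);

    return best > threshold;
}
```

A pixel sitting on a clean vertical edge is far from the smooth prediction but close to the vertical-edge prediction, so it survives; an isolated spike is far from all three, so it gets flagged.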

I started by looking at the Median Filter and the Bilateral Filter. There have been some cool recent papers on fast Median Filtering :
Constant Time Median Filter
Weiss Log(r) Median Filter
Fast Bilateral Filter ; Sylvain Paris and Frédo Durand + good references
Siggraph Course on Bilateral Filtering

Those are all very worth reading even though I don't think it's actually super useful. The fast median filter approaches use cool ways of turning an operation over a sliding window into incremental operations that only process values getting added in and removed as you step the window. Median filter is a weird thing that works surprisingly well actually, but it does create a weird sort of Nagel-drawing type of look, with nothing but smooth gradients and hard edges. In fact it's a pretty good toon-shading process.

BTW the fast median filters seem useless for denoising, since they really only matter for large r (radius of the filter), and for denoising you really only want something like a 5x5 window, at which size a plain brute force median is faster.

Bilateral filter actually sort of magically does some of the heuristic cases that I've talked about. Basically it makes a prediction using a filter weighted by distance and also value difference. So similar values contribute and disparate values don't. That actually does a sort of "feature selection". That is, if your pixel value is close to other pixel values in a vertical edge, then the bilateral filter will weight strongly on those other similar pixel values and ignore the other off-edge pixels. That's pretty good, and the results are in fact decent, but if you think about it more you see the bilateral filter is just a ghetto approximation of what you really want. Weighting based on pixel value difference is not the right way to blend models, it makes no distinction about the context of that value difference - eg. it doesn't care if the value difference comes from a smooth gradient or a sudden step. As others have noted, the Bilateral Filter makes the image converge towards piecewise-constant, which is obviously wrong. Getting towards piecewise linear would be better, piecewise bicubic would be better still - but even that is just the very easiest beginning of what the heuristic estimator can do.
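
For reference, the textbook bilateral filter at a single pixel looks like this (a direct implementation, not one of the fast approximations from the papers above) :

```cpp
#include <cassert>
#include <cmath>

// Textbook bilateral filter evaluated at one pixel :
// weight = spatial gaussian * range (value-difference) gaussian.
// Pixels close in space AND close in value dominate the sum, which is
// what produces the implicit "feature selection" described above.
static double BilateralPixel(const double * img, int w, int h,
                             int x, int y, int radius,
                             double sigmaSpace, double sigmaRange)
{
    double center = img[y*w + x];
    double sum = 0.0, wsum = 0.0;
    for (int dy = -radius; dy <= radius; dy++)
    for (int dx = -radius; dx <= radius; dx++)
    {
        int sx = x + dx, sy = y + dy;
        if ( sx < 0 || sy < 0 || sx >= w || sy >= h ) continue;
        double v  = img[sy*w + sx];
        double ws = exp( -(dx*dx + dy*dy) / (2.0*sigmaSpace*sigmaSpace) );
        double wr = exp( -(v - center)*(v - center) / (2.0*sigmaRange*sigmaRange) );
        sum  += ws * wr * v;
        wsum += ws * wr;
    }
    return sum / wsum;
}
```

On a hard step edge with a small sigmaRange, the off-edge samples get essentially zero weight - which is exactly the piecewise-constant convergence complained about above.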

NL Means is a denoising algorithm which is a bit closer to the right idea; he's definitely aware of the issues. However, the actual NL means method is poor. It relies on closely matching neighborhoods to form good predictions, which anyone who's worked in image compression or super-resolution knows is not a good approach. The problem is there are simply too many possible values in reasonably sized neighborhoods. That is, even for a moderately sized neighborhood like 5x5, you have 2^8^25 possible values = 2^200. No matter how much you train, the space is too sparse. It may seem from the NL Means formulation that you're weighting in various neighbors, but in reality in practice you only find a few neighbors that are reasonably close and they get almost all of the weight, and they might not even be very close. It's like doing K-means with 2^200 possible values - not good.

There's a lot of work on Wavelet Denoising which I haven't really read. There are some obvious appealing aspects of that. With wavelets you can almost turn an image into a sum of {smooth}+{edges}+{texture}+{noise} and then just kill the noise. But I don't really like the idea of working in wavelet domain, because you wind up affecting a lot of pixels. Most of the noise I care about comes from cameras, which means the noise is in fact isolated to 1 or 2 adjacent pixels. I also don't like all the research papers that want to use 9x9 or larger local windows. Real images are very local, their statistics change very fast, and pulling in shit from a 9x9 window is just going to mess them up. IMO a 5x5 window is the reasonable size for typical image resolutions of today.

BTW one thing I've noticed with my camera noise images is that the fucking camera JPEG makes the noise so much harder to remove. The noise looks like it's originally just true single pixel noise, but when it goes through the camera JPEG, that single-pixel peak is really unfriendly to the DCT, so it gets smeared out, and you wind up having noise lumps that look like little DCT shapes. To specifically denoise photos that have gone through JPEG you probably have to work on 8x8 blocks and work directly on the DCT coefficients. (also the Bayer pattern demosaic obviously spreads noise as well; ideally you'd get to work on the raw taps before jpeg, before the camera denoise, and before the Bayer demosaic).

ADDENDUM : a lot of the denoise people seem to be trying to perfect the Playboy Centerfold algorithm, that makes photos look extremely smoothed and airbrushed. Often if you're not sure a pixel is noise it's best to leave it alone. Also, all the methods that use a pixel-magnitude threshold value for noise are wrong. The threshold for noise needs to be context sensitive. That is, in smooth parts of the image, you might be able to say that a pixel is probably noise when it's off from expectation by only 1 or 2 pixel values. In chaotic textures parts of the image, a pixel might be off by 20 values or more and you still might not be able to say it's noise. The correct parameter to expose to the user is a *confidence*. That is, I want to do something like replace all pixels which the algorithm is >= 90% confident it can improve.

Another problem I've seen with the majority of the denoisers is that they create structures from noise. If you run them on just staticky junk, they will form little flat chunks, or linear bits or weird feathery patterns. This is because even in random noise there will be little bits that have similar values so they become seeds to create structures. This is very bad, the weird structures that are formed by this "denoising" are much worse than the original static, which is pretty inoffensive to the eye.

Marc sent me the link to GREYCstoration , a free denoiser based on the image manifold PDE research. I don't like that this technique is becoming known as "PDE" - PDE just means partial differential equation; in this case it's a diffusion equation, in particular a variant of anisotropic diffusion with different diffusion rates based on the local curvature - eg. it diffuses across smooth areas and not across edges. (that's actually an old basic technique; the new thing he does is that the diffusion follows contour lines (but is actually 2d, just weighted) and works on all components). It looks pretty decent. It's actually more exciting to me for super-resolution, it looks like it does a pretty good job of image super-resolution.

12/08/2008

12-08-08 - File Hashes

So I'm sticking a file hash in Oodle and thought I'd test some of the stuff out there. Test candidates :

SHA1 (Sean's stb.h implementation)

MD5 (OpenSSL implementation)

BurtleBurtle Lookup3 (hashlittle2)

Cessu's SSE2 hash (+ extra tail code I added)

CRC32

CRC32+32

In all cases I create a 64 bit hash. Hey, it's plenty of bits, it's easier to pass around cuz I have a 64 bit type, and it makes it a fair competition. SHA1 makes 160 bits (= 5 dwords), MD5 makes 128 bits (= 4 dwords), so I use Bob's Mix method to get that down to 64 bits.
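
Folding a wider digest down to 64 bits can be as simple as this. Plain XOR folding is shown for illustration only - it is NOT Bob's mix(), which diffuses bits across words rather than just XORing them - but it does preserve the uniformity of an already-good digest :

```cpp
#include <cassert>

typedef unsigned int U32;
typedef unsigned long long U64;

// Fold a 160-bit SHA-1 digest (5 dwords) to 64 bits by XORing words together.
// XOR of uniformly random words stays uniformly random ; Bob's mix() would be
// the better choice, this is just the simplest sketch of the folding idea.
static U64 Fold160To64(const U32 digest[5])
{
    U32 lo = digest[0] ^ digest[2] ^ digest[4];
    U32 hi = digest[1] ^ digest[3];
    return ((U64)hi << 32) | lo;
}
```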

A lot of people think SHA1 or MD5 or something is the "right thing" to do for file hashes. That's not really true. Those hashes were designed for cryptography which is not the same purpose. In particular, they are slow *on purpose* because they want to be hard to invert. They also make tons of bits, not because you need tons of bits to tell files apart, but again to make them hard to invert by brute force attack. I don't care about my file hashes being vulnerable to attack, I just want the chance of accidental collisions to be microscopic.

CRC32+32 means doing CRC32 on alternating bytes and jamming them together to make 64 bits. This is not a true "CRC64" but I might refer to it as CRC64 sometimes. (suggestion by "Joe Smith" ; Joe Smith? Is that a pseudonym?)
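
In code, the alternating-byte scheme looks something like this. The bitwise crc32_update here stands in for your favorite table-driven CRC32 (standard reflected polynomial); the structure is the point :

```cpp
#include <cassert>
#include <stddef.h>

typedef unsigned int U32;
typedef unsigned long long U64;

// bitwise CRC32 (reflected poly 0xEDB88320) ; use a table-driven one for real
static U32 crc32_update(U32 crc, unsigned char byte)
{
    crc ^= byte;
    for (int i = 0; i < 8; i++)
        crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1)));
    return crc;
}

// "CRC32+32" : run CRC32 over the even bytes and the odd bytes separately,
// then jam the two 32 bit results together into 64 bits.
static U64 crc32_plus_32(const unsigned char * buf, size_t len)
{
    U32 c0 = 0xFFFFFFFFu, c1 = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
    {
        if ( i & 1 ) c1 = crc32_update(c1, buf[i]);
        else         c0 = crc32_update(c0, buf[i]);
    }
    return ((U64)(c0 ^ 0xFFFFFFFFu) << 32) | (U64)(c1 ^ 0xFFFFFFFFu);
}
```

Since the two halves never see each other's bytes, this is not a true CRC64 - which is exactly why it's worth testing whether its 64 bits behave like 64 real bits.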

Just for background, if the 64 bit hashes are "perfect" - that is the 64 bit words coming out of them are random in every bit, even for input that is very non-random - then the chance of collision is indeed microscopic. (in fact you only need maybe 40 bits). The number of items you can hash in B bits is around 2^(B/2) , so B = 32 is not enough bits since 2^16 = 64k and you may in fact run on 64k files. But even at B = 40, 2^20 = 1 Million is a lot, and certainly B = 64, means 2^32 = 4 Billion items before you expect a collision. So, anyway, the point is to test whether these hashes are actually close enough to perfect on real data that they don't generate an unexpected high number of collisions.

I ran these hashes on every file on my hard drive. I threw out files that were exactly equal to some other file so there would be no false collisions due to the files actually being identical. I have 24963 files. I made 2 variants of every file, one by taking a random byte and changing it to a random value, and another variant by flipping 4 random bits. So in total 74853 arrays were hashed.

First the speed numbers :

sha1                 : 48.218018
md5                  : 19.837351
burtle               : 7.122040
Cessu                : 6.370848
crc32+32             : 15.055287
crc32                : 21.550138

These are in clocks per byte. The CRC numbers are a little funny because the CRC32+32 loop is a bit unrolled, but the CRC32 loop goes byte by byte. In any case, even though CRC is very simple, it is not fast, because even unrolled it still works byte by byte and there's a hard data dependency - you have to completely process each byte before you can work on the next byte.

Cessu's hash is only barely faster than Bob's lookup3 even though it uses SSE2 and works on 16 bytes at a time. Bob's hash is really damn good. When I tested it on strings it did not perform well for me because I'm so close to the bone on strings that the rollup & rolldown overhead killed me, but on larger arrays or even long strings, lookup3 kicks butt. ( Bob's page )

So... how many total collisions in the hashes do you think I had? (only testing collisions versus hashes of the same type of course). Remember I tested on 74853 different arrays, made from 24963 files and 24963+24963 more tiny modifications.

One.

One collision. Of course it was in CRC32. None of the 64 bit hashes had any collisions.

I then tried making 8 variants of each file by 8 different random byte jams, so I was running 199704 arrays. Again zero collisions for any 64 bit hash.

So, in an attempt to actually get a collision, I made 10,000,000 test arrays by sprintf'ing the digits from 1 to 10,000,000 , and then tacked on 2 random bytes. (note : you should not test hashes by making random arrays, because any decent hash will return random output bits from random input bits; the test I am interested in is how close the hash output is to random on highly *nonrandom* input). I ran the hashes on all those strings and got :

collisions on 10,000,000 tests :

sha1                 : 0
md5                  : 0
burtle               : 0
Cessu                : 0
crc64                : 0
rand32               : 11,530
crc32                : 11,576

Again none of the 64 bit hashes has any collisions. CRC32 had quite a few of course - but only as many as a 32 bit random number generator !! That means the CRC is in fact performing as a perfect hash. It is perfectly randomizing the nonrandom input.
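
Those CRC32 and rand32 numbers sit right on top of the birthday-bound estimate. For n random items thrown into m buckets, the expected number of colliding pairs is about n(n-1)/2m :

```cpp
#include <cassert>

// Expected number of colliding pairs for n uniformly random items in m buckets.
// For n = 10,000,000 and m = 2^32 this gives ~11,641 - right in line with the
// ~11,5xx collisions observed above, i.e. CRC32 is behaving like a perfect hash.
static double ExpectedCollisions(double n, double m)
{
    return n * (n - 1.0) / (2.0 * m);
}
```

The same formula with m = 2^64 gives about 3e-6 expected collisions on ten million items, which is why none of the 64 bit hashes ever showed one.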

So, I have no idea which of the 64 bit hashes is "best" in terms of randomizing bits and detecting errors. Obviously if they are actually perfectly making 64 bits, the chance of me ever seeing a collision is tiny. I thought maybe the "crc32+32" might not have 64 real bits of quality and might fail sooner, but it's not bad enough to fail in any kind of real world scenario apparently.

So, anyway, I'm gonna use "lookup3" because it's super fast, plenty good, and it has the Bob Jenkins seal of approval which means it should actually be "robust".

HOWEVER : SSE4.2 has a CRC32 instruction. If you were in some application where you could rely on having that, then that would definitely be the fastest way to go, and a CRC32+32 appears to be plenty high quality for file identification.

BTW I keep hearing that CRC32 has degeneracy failures on real world data, but I have yet to see it.

12-08-08 - DXTC Summary

I thought I should fix some things that were wrong or badly said in my original DXTC posts :

DXTC Part 1
DXTC Part 2
DXTC Part 3
DXTC Part 4

Added later :

DXTC Addendum
DXTC More Followup
DXTC is not enough
DXTC is not enough part 2

On the "ryg" coder : there was a bug/typo in the implementation I was using which gave bogus results, so you should just ignore the numbers in those tables. See for correction : Molly Rocket Forum on Ryg DXTC coder . Also I should note he does have some neat code in there. The color finder is indeed very fast; it is an approximation (not 100% right) but the quality is very close to perfect. Also his "RefineBlock" which does the Simon Brown style optimize-end-from-indices is a very clever fast implementation that collapses a lot of work. I like the way he uses the portions of one 32 bit word to accumulate three counters at a time.

I also mentioned in those posts that the optimized version of the Id "real time DXTC" bit math was "wrong". Well, yes, it is wrong in the sense that it doesn't give you exactly the same indices, but apparently that was an intentional approximation by JMP, and in fact it's a very good one. While it does pick different indices than the exact method, it only does so in cases where the error is zero or close to zero. On most real images the actual measured error of this approximation is exactly zero, and it is faster.

So, here are some numbers on a hard test set for different index finders :


    exact : err:  31.466375 , clocks: 1422.256522

    id    : err:  31.466377 , clocks: 1290.232239
            diff:  0.000002

    ryg   : err:  31.466939 , clocks:  723.051241
            diff:  0.000564

    ryg-t : err:  31.466939 , clocks:  645.445860
            diff:  0.000564

You can see the errors for all of them are very small. "ryg-t" is a new one I made which uses a table to turn the dot product checks into indexes, so that I can eliminate the branches. Start with the "ryg" dot product code and change the inner loop to :

    const int remap[8] = { 0 << 30,2 << 30,0 << 30,2 << 30,3 << 30,3 << 30,1 << 30,1 << 30 };

    for(int i=0;i < 16;i++)
    {
        int dot = block.colors[i].r*dirr + block.colors[i].g*dirg + block.colors[i].b*dirb;

        int bits =( (dot < halfPoint) ? 4 : 0 )
                | ( (dot < c0Point) ? 2 : 0 )
                | ( (dot < c3Point) ? 1 : 0 );

        mask >>= 2;
        mask |= remap[bits];
    }

I should note that these speed numbers are for pure C obvious implementations and if you really cared about speed you should use SSE and who knows what would win there.


Now this last bit is a little half baked but I thought I'd toss it up. It's a little bit magical to me that Ryg's Mul8Bit (which is actually Jim Blinn's) also happens to produce perfect quantization to 565 *including the MSB shifting into LSB reproduction*.

I mentioned before that the MSB shifting into LSB thing is actually "wrong" in that it would hurt RMSE on purely random data, because it is making poor use of the quantization bins. That is, for random data, to quantize [0,255] -> 32 values (5 bits) you should have quantization bins that each hold 8 values, and whose reconstruction is at the middle of each bin. That is, you should reconstruct to {4,12,20,...} Instead we reconstruct to {0,8,...247,255} - the two buckets at the edges only get 4 values, and there are some other ones that get 9 values. Now in practice this is a good thing because your original data is *not* random - it's far more likely to have exact 0 and 255 values in the input, so you want to reproduce those exactly. So anyway, it's not a uniform quantizer on the range [0,255]. In fact, it's closer to a uniform quantizer on the range [-4,259].


I think it might actually just be a numerical coincidence in the range [0,255].

The correct straight-forward quantizer for the DXTC style colors is

    return (32*(val+4))/(256+8);

for R.  Each quantization bin gets 8 values except the top and bottom which only get 4.  That's equivalent to quantizing the range [-4,256+4) with a uniform 8-bin quantizer.

Now

1/(256 + 8) = 1/256 * 1/(1 + 8/256)

We can do the Taylor series expansion of 1/(1+x) for small x on the second term and we get ( 1 - 8/256 + 64/256/256) up to powers of x^2

So we have

    ( (32*val+128) * ( 1 - 8/256 + 64/256/256) ) >> 8

And if we do a bunch of reduction we get 

    return ( 31*val+124 + ((8*val+32)>>8) ) >> 8

If we compare this to Mul8bit :

        return ( 31*val+128 + ((val*31 + 128)>>8)) >> 8;

it's not exactly the same math, and in fact they're not quite the same function either : a brute-force check over [0,255] finds a few stray values near the top of the range where they disagree by one step (val = 210, for example), though they agree everywhere else
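
The comparison is easy to brute-force over all 256 inputs. The two formulas below are copied from above; the loop just counts where they differ :

```cpp
#include <cassert>

// the Taylor-reduced uniform quantizer derived above
static int QuantA(int val) { return ( 31*val + 124 + ((8*val + 32) >> 8) ) >> 8; }

// the Mul8Bit-style quantizer
static int QuantB(int val) { return ( 31*val + 128 + ((val*31 + 128) >> 8) ) >> 8; }

// count values in [0,255] where the two quantizers disagree
static int CountMismatches()
{
    int mismatches = 0;
    for (int val = 0; val < 256; val++)
        if ( QuantA(val) != QuantB(val) )
            mismatches++;
    return mismatches;
}
```

The disagreements come from the two quantizers' bin boundaries slowly drifting apart (bin spacing of 264/32 = 8.25 versus 255/31 ≈ 8.226), so the mismatches cluster at larger val.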

But I dunno. BTW another way to think about all this is to imagine your pixels are an 8.inf fixed point rep of an original floating point pixel, and you should replicate the 8 bit value continuously. So 0 = 0, 255 = FF.FFFFFF.... = 1.0 exactly , 127 = 7F.7F7F7F..

BTW this reminds me : When I do Bmp -> FloatImage conversions I used to do (int + 0.5)/256 , that is 0 -> 0.5/256 , 255 -> 255.5/256 . I no longer do that, I do 0->0, and 255 -> 1.0 , with a 1/255 quantizer.

12/05/2008

12-05-08 - lrotl

Well I found one x86 ASM widget. I've always known you could do nice fast barrel rotations on x86 but thought they were inaccessible from C. Huzzah! Stdlib has a function "_lrotl()" which is exactly what you want, and happily it is one of the magic functions that MSVC recognizes in the compiler and turns into a single rotate instruction. (They also have custom handling for strcmp, memcpy, etc.)

Oh, I noticed lrotl in OpenSSL which seems to have a ton of good code for different hashes/checksums/digests/whatever-the-fuck-you-call-them's.

As a test I tried it on Sean's hash, which is quite good and fast for C strings :


RADINLINE U32 stb_hash(const char *str)
{
   U32 hash = 0;
   while (*str)
   {
      hash = (hash << 7) + (hash >> 25) + *str++;
   }
   return hash;
}

RADINLINE U32 stb_hash_rot(const char *str)
{
   U32 hash = 0;
   while (*str)
   {
      hash = _lrotl(hash,7) + *str++;
   }
   return hash;
}

stb_hash : 6.43 clocks per char
stb_hash_rot : 3.24 clocks per char

Woot! Then I also remembered something neat I saw today at Paul Hsieh's Assembly Lab . A quick check for whether a 32 bit word has any null byte in it :

#define has_nullbyte(x) (((x) - 0x01010101) & ( ~(x) & 0x80808080 ) )

Which can of course be used to make an unrolled stb_hash :

RADINLINE U32 stb_hash_rot_dw(const char *str)
{
   U32 hash = 0;
   
   // note : this reads the string a dword at a time, so it can read up to
   // 3 bytes past the terminator ; fine on x86 with padded allocations,
   // but technically naughty (and it assumes unaligned loads are okay)
   while ( ! has_nullbyte( *((U32 *)str) ) )
   {
      hash = _lrotl(hash,7) + str[0];
      hash = _lrotl(hash,7) + str[1];
      hash = _lrotl(hash,7) + str[2];
      hash = _lrotl(hash,7) + str[3];
      str += 4;
   }
   while (*str)
   {
      hash = _lrotl(hash,7) + *str++;
   }
   return hash;
}

stb_hash_rot_dw : 2.50 clocks

So anyway, I'm getting distracted by pointless nonsense, but it's nice to know lrotl works. (and yes, yes, you could be faster still by switching the hash algorithm to something that works directly on dwords)
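
If you'd rather not lean on the MSVC-specific _lrotl, the standard portable idiom also gets pattern-matched to a single rotate instruction by MSVC and GCC (for shift amounts in 1..31; k = 0 would make the right shift undefined) :

```cpp
#include <cassert>

typedef unsigned int U32;

// portable rotate-left ; compilers recognize this and emit a single rol.
// k must be in [1,31] to avoid the undefined (x >> 32) case.
static inline U32 rotl32(U32 x, int k)
{
    return (x << k) | (x >> (32 - k));
}
```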

old rants