12/06/2012

12-6-12 - Theoretical Oodle Rewrite Continued

So, continuing on the theme of a very C++-ish API with "futures" , ref-counted buffers, etc. :

cbloom rants 07-19-12 - Experimental Futures in Oodle
cbloom rants 10-26-12 - Oodle Rewrite Thoughts

It occurs to me that this could massively simplify the giant API.

What you do is treat "array data" as a special type of object that can be linearly broken up. (I noted previously about having RW locks in every object and special-casing arrays by letting them be RW-locked in portions instead of always locking the whole buffer).

Then arrays could have two special ways of running async :

1. Stream. A straightforward futures sequence to do something like read-compress-write would wait for the whole file read to be done before starting the compress. What you could do instead is have the read op immediately return a "stream future" which would be able to dole out portions of the read as it completed. Any call that processes data linearly can be a streamer, so "compress" could also return a stream future, and "write" would then be able to write out compressed bits as they are made, rather than waiting on the whole op.

2. Branch-merge. This is less of an architectural thing than just a helper (you can easily write it client-side with normal futures); it takes an array and runs the future on portions of it, rather than running on the whole thing. But having this helper in from the beginning means you don't have to write lots of special case branch-merges to do things like compress large files in several pieces.

So you basically just have a bunch of simple APIs that don't look particularly Async. Read just returns a buffer (future). ReadStream returns a buffer stream future. They look like simple buffer->buffer APIs and you don't have to write special cases for all the various async chains, because it's easy for the client to chain things together as they please.

To be redundant, the win is that you can write a function like Compress() and you write it just like a synchronous buffer-to-buffer function, but it's arguments can be futures and its return value can be a future.

Compress() should actually be a stackful coroutine, so that if the input buffer is a Stream buffer, then when you try to access bytes that aren't yet available in that buffer, you Yield the coroutine (pending on the stream filling).


Functions take futures as arguments and return futures.

Every function is actually run as a stackful coroutine on the worker threadpool.

Functions just look like synchronous code, but things like file IO cause a coroutine Yield rather than a thread Wait.

All objects are ref-counted and create automatic dependency chains.

All objects have built-in RW locks, arrays have RW locks on regions.

Parallelism is achieved through generic Stream and Branch/Merge facilities.

While this all sounds very nice in theory, I'm sure in practice it wouldn't work. What I've found is that every parallel routine I write requires new hacky special-casing to make it really run at full efficiency.

12-6-12 - The Oodle Dilemma

Top two Oodle (potential) customer complaints :

1. The API is too big, it's too complicated. There are too many ways of doing basically the same thing, and too many arguments and options on the functions.

2. I really want to be able to do X but I can't do it exactly the way I want with the API, can you add another interface?

Neither one is wrong.

11/23/2012

11-23-12 - Global State Considered Harmful

In code design, a frequent pattern is that of singleton state machines. eg. a module like "the log" or "memory allocation" which has various attributes you set up that affect its operation, and then subsequent calls are affected by those attributes. eg. things like :


Log_SetOutputFile( FILE * f );

then

Log_Printf( const char * fmt .... );

or :

malloc_setminimumalignment( 16 );

then

malloc( size_t size );

The goal of this kind of design is to make the common use API minimal, and have a place to store the settings (in the singleton) so they don't have to be passed in all the time. So, eg. Log_Printf() doesn't have to pass in all the options associated with logging, they are stored in global state.

I propose that global state like this is the classic mistake of improving the easy case. For small code bases with only one programmer, they are mostly okay. But in large code bases, with multi-threading, with chunks of code written independently and then combined, they are a disaster.

Let's look at the problems :

1. Multi-threading.

This is an obvious disaster and pretty much a nail in the coffin for global state. Say you have some code like :


pcb * previous_callback = malloc_setfailcallback( my_malloc_fail_callback );

void * ptr = malloc( big_size ); 

malloc_setfailcallback( previous_callback );

this is okay single threaded, but if other threads are using malloc, you just set the "failcallback" for them as well during that span. You've created a nasty race. And of course you have no idea whether the failcallback that you wanted is actually set when you call malloc because someone else might change it on another thread.

Now, an obvious solution is to make the state thread-local. That fixed the above snippet, but some times you want to change the state so that other threads are affected. So now you have to have thread-local versions and global versions of everything. This is a viable, but messy, solution. The full solution is :

There's a global version of all state variables. There are also thread-local copies of all the global state. The thread-local copies have a special value that means "inherit from global state". The initial value of all the thread-local state should be "inherit". All state-setting APIs must have a flag for whether they should set the global state or the thread-local state. Scoped thread-local state changes (such as the above example) need to restore the thread-local state to "inherit".

This can be made to work (I'm using for the Log system in Oodle at the moment) but it really is a very large conceptual burden on the client code and I don't recommend it.

There's another way that these global-state singletons are horrible for multi-threading, and that's that they create dependencies between threads that are not obvious or intentional. A little utility function that just calls some simple functions picks up these ties to shared variables and needs synchronization protection with the global state. This is related to :

2. Non-local effects.

The global state makes the functions that use it non-"pure" in a very hidden way. It means that innocuous functions can break code that's very far away from it in hidden ways.

One of the classic disasters of global state is the x87 (FPU) control word. Say you have a function like :


void func1()
{

    set x87 CW

    do a bunch of math that relies on that CW

    func2();

    do more math that relies on CW

    restore CW
}

Even without threading problems (the x87 CW is thread-local under any normal OS), this code has nasty non-local effects.

Some branch of code way out in func2() might rely on the CW being in a certain state, or it might change the CW and that breaks func1().

You don't want to be able to break code very far away from you in a hidden way, which is what all global state does. Particularly in the multi-threaded world, you want to be able to detect pure functions at a glance, or if a function is not pure, you need to be able to see what it depends on.

3. Undocumented and un-asserted requirements.

Any code base with global state is just full of bugs waiting to happen.

Any 3d graphics programmer knows about the nightmare of the GPU state machine. To actually write robust GPU code, you have to check every single render state at the start of the function to ensure that it is set up the way you expect. Good code always expresses (and checks) its requirements, and global state makes that very hard.

This is a big problem even in a single-source code base, but even worse with multiple programmers, and a total disaster when trying to copy-paste code between different products.

Even something like taking a function that's called in one spot in the code and calling it in another spot can be a hidden bug if it relied on some global state that was set up in just the right way in that original spot. That's terrible, as much as possible functions should be self-contained and work the same no matter where they are called. It's sort of like "movement of call site invariance symmetry" ; the action of a function should be determined only by its arguments (as much as possible) and any memory locations that it reads should be as clearly documented as possible.

4. Code sharing.

I believe that global state is part of what makes C code so hard to share.

If you take a code snippet that relies on some specific global state out of its content and paste it somewhere else, it no longer works. Part of the problem is that nobody documents or checks that the global state they need is set. But a bigger issue is :

If you take two chunks of code that work independently and just link them together, they might no longer work. If they share some global state, either intentionally or accidentally, and set it up differently, suddenly they are stomping on each other and breaking each other.

Obviously this occurs with anything in stdlib, or on the processor, or in the OS (for example there are lots of per-Process settings in Windows; eg. if you take some libraries that want a different time period, or process priority class, or priviledge level, etc. etc. you can break them just by putting them together).

Ideally this really should not be so. You should be able to link together separate libs and they should not break each other. Global state is very bad.


Okay, so we hate global state and want to avoid it. What can we do? I don't really have the answer to this because I've only recently come to this conclusion and don't have years of experience, which is what it takes to really make a good decision.

One option is the thread-local global state with inheritance and overrides as sketched above. There are some nice things about the thread-local-inherits-global method. One is that you do still have global state, so you can change the options somewhere and it affects all users. (eg. if you hit 'L' to toggle logging that can change the global state, and any thread or scope that hasn't explicitly sets it picks up the global option immediately).

Other solutions :

1. Pass in everything :

When it's reasonable to do so, try to pass in the options rather than setting them on a singleton. This may make the client code uglier and longer to type at first, but is better down the road.

eg. rather than


malloc_set_alignment( 16 );

malloc( size );

you would do :

malloc_aligned( size , 16 );

One change I've made to Oodle is taking state out of the async systems and putting in the args for each launch. It used to be like :

OodleWork_SetKickImmediate( OodleKickImmediate_No );
OodleWork_SetPriority( OodlePriority_High );
OodleWork_Run( job );

and now it's :

OodleWork_Run( job , OodleKickImmediate_No, OodlePriority_High );

2. An options struct rather than lots of args.

I distinguish this from #3 because it's sort of a bridge between the two. In particular I think of an "options struct" as just plain values - it doesn't have to be cleaned up, it could be const or made with an initializer list. You just use this when the number of options is too large and if you frequently set up the options once and then use it many times.

So eg. the above would be :


OodleWorkOptions wopts = { OodleKickImmediate_No, OodlePriority_High  };
OodleWork_Run( job , &wopts );

Now I should emphasize that we already have given ourselves great power and clarity. The options struct could just be global, and then you have the standard mess with that. You could have it in the TLS so you have per-thread options. And then you could locally override even the thread-local options in some scope. Subroutines should take OodleWorkOptions as a parameter so the caller can control how things inside are run, otherwise you lose the ability to affect child code which a global state system has.

Note also that options structs are dangerous for maintenance because of the C default initializer value of 0 and the fact that there's no warning for partially assigned structs. You can fix this by either making 0 mean "default" for every value, or making 0 mean "invalid" (and assert) - do not have 0 be a valid value which is anything but default. Another option is to require a magic number in the last value of the struct; unfortunately this is only caught at runtime, not compile time, which makes it ugly for a library. Because of that it may be best to only expose Set() functions for the struct and make the initializer list inaccessible.

The options struct can inherit values when its created; eg. it might fill any non-explicitly given values (eg. the 0 default) by inheriting from global options. As long as you never store options (you just make them on the stack), and each frame tick you get back to a root for all threads that has no options on the stack, then global options percolate out at least once a frame. (so for example the 'L' key to toggle logging will affect all threads on the next frame).

3. An initialized state object that you pass around.

Rather than a global singleton for things like The Log or The Allocator, this idea is to completely remove the concept that there is only one of those.

Instead, Log or Allocator is a struct that is passed in, and must be used to do those options. eg. like :


void FunctionThatMightLogOrAllocate( Log * l, Allocator * a , int x , int y )
{
    if ( x )
    {
        Log_Printf( l , "some stuff" );
    }

    if ( y )
    {
        void * p = malloc( a , 32 );

        free( a , p );
    }
}

now you can set options on your object, which may be a per-thread object or it might be global, or it might even be unique to the scope.

This is very powerful, it lets you do things like make an "arena" allocator in a scope ; the arena is allocated from the parent allocator and passed to the child functions. eg :


void MakeSuffixTrie( Allocator * a , U8 * buf, int bufSize )
{

    Allocator_Arena arena( a , bufSize * 4 );

    MakeSuffixTrie_Sub( &arena, buf, bufSize );
}

The idea is there's no global state, everything is passed down.

At first the fact that you have to pass down a state pointer to use malloc seems like an excessive pain in the ass, but it has advantages. It makes it super clear in the signature of a function which subsystems it might use. You get no more surprises because you forgot that your Mat3::Invert function logs about degeneracy.

It's unclear to me whether this would be too much of a burden in real world large code bases like games.

11/13/2012

11-13-12 - Another Oodle Rewrite Note

Of course this is not going to happen. But in the imaginary world in which I rewrite from scratch :

I've got a million (actually several hundred) APIs that start an Async op. All of those APIs take a bunch of standard arguments that they all share, so they all look like :


OodleHandle Oodle_Read_Async(

                // function-specific args :
                OodleIOQFile file,void * memory,SINTa size,S64 position,

                // standard args on every _Async function :
                OodleHandleAutoDelete autoDelete,OodlePriority priority,const OodleHandle * dependencies,S32 numDependencies);

The idea was that you pass in everything needed to start the op, and when it's returned you get a fully valid Handle which is enqueued to run.

What I should have done was make all the little _Async functions create an incomplete handle, and then have a standard function to start it. Something like :


// prep an async handle but don't start it :
OodleHandleStaging Oodle_Make_Read(
                OodleIOQFile file,void * memory,SINTa size,S64 position
                );

// standard function to run any op :
OodleHandle Oodle_Start( OodleHandleStaging handle,
                OodleHandleAutoDelete autoDelete,OodlePriority priority,const OodleHandle * dependencies,S32 numDependencies);

it would remove a ton of boiler-plate out of all my functions, and make it a lot easier to add more standard args, or have different ways of firing off handles. It would also allow things like creating a bunch of "Staging" handles that aren't enqueued yet, and then firing them off all at once, or even just holding them un-fired for a while, etc.

It's sort of ugly to make clients call two functions to run an async op, but you can always get client code that looks just like the old way by doing :


OodleHandle Oodle_Start( Oodle_Make_Read( OodleIOQFile file,void * memory,SINTa size,S64 position ) ,
                OodleHandleAutoDelete autoDelete,OodlePriority priority,const OodleHandle * dependencies,S32 numDependencies);

and I could easily make macros that make that look like one function call.

Having that interval of a partially-constructed op would also let me add more attributes that you could set on the Staging handle before firing it off.

(for example : when I was testing compresses on enwik, some of the tasks could allocate something like 256MB each; it occurred to me that a robust task system should understand limitting the number of tasks that run at the same time if their usage of some resource exceeds the max. eg. for memory usage, if you know you have 2 GB free, don't run more than 8 of those 256 MB tasks at once, but you could run other non-memory-hungry tasks during that time. (I guess the general way to do that would be to make task groups and assign tasks to groups and then limit the number from a certain group that can run simultaneously))

11/08/2012

11-08-12 - Job System Task Types

I think I've written this before, but a bit more verbosely :

It's useful to have at least 3 classes of task in your task system :

1. "Normal" tasks that want to run on one of the worker threads. These take some amount of time, you don't want them to block other threads, they might have dependencies and create other jobs.

2. "IO" tasks. This is CPU work, but the main thing it does is wait on an IO and perhaps spawn another one. For example something like copying a file through a double-buffer is basically just wait on an IO op then do a tiny bit of math, then start another IO op. This should be run on the IO thread, because it avoids all thread switching and thread communication as it mediates the IO jobs.

(of course the same applies to other subsytems if you have them)

3. "Tiny" / "Run anywhere" tasks. These are tasks that take very little CPU time and should not be enqueued onto worker threads, because the time to wake up a thread for them dwarfs the time to just run them.

The only reason you would run these as async tasks at all is because they depend on something. Generally this is something trivial like "set a bool after this other job is done".

These tasks should be run immediately when they become ready-to-run on whatever thread made them rtr. So they might run on the IO thread, or a worker thread, or the main thread (any client thread). eg. if a tiny task is enqueued on the main thread and its dependencies are already done, it should just be run immediately.

It's possible that class #2 could be merged into class #3. That is, eliminate the IO-tasks (or GPU-tasks or whatever) and just call them all "tiny tasks". You might lose a tiny bit of efficiency from that, but the simplicity of having only two classes of tasks is probably preferable. If the IO tasks are made into just generic tiny tasks, then it's important that the IO thread be able to execute tiny tasks from the generic job system itself, otherwise it might go to sleep thinking there is no IO to be done, when a pending tiny IO task could create new IO work for it.

Okay.

Beyond that, for "normal" tasks there's the question of typical duration, which tells you whether it's worth it to fire up more threads.

eg. say you shoot 10 tasks at your thread-pool worker system. Should you wake up 1 thread and let it do all 10 ? Or wake up 10 threads and give each one a task? Or maybe wake 2?

One issue that still is bugging me is when you have a worker thread, and in doing some work it makes some more tasks ready-to-run. Should it fire up new worker threads to take those tasks, or should it just finish its task and then do them itself? You need two pieces of information : 1. are the new tasks significant enough to warrant a new thread? and 2. how close to the end of my current task am I? (eg. if I'm in the middle of some big work I might want to fire up a new thread even though the new RTR tasks are tiny).

When you have "tiny" and "normal" tasks at the same priority level, it's probably worth running all the tinies before any normals.

Good lord.

11/06/2012

11-06-12 - Using your own malloc with operator new

In Oodle, I use my own allocator, and I wish to be able to still construct & destruct classes. (Oodle's public API is C only, but I use C++ internally).

The traditional way to do this is to write your own "operator new" implementation which will link in place of the library implementation. This way sucks for various reasons. The important one to me is that it changes all the news of any other statically-linked code, which is just not an okay thing for a library to do. You may want to have different mallocs for different purposes; the whole idea of a single global allocator is kind of broken in the modern world.

(the presence of global state in the C standard lib is part of what makes C code so hard to share. The entire C stdlib should be a passed-in vtable argument. Perhaps more on this in a later post.)

Anyway, what I want is a way to do a "new" without interfering with client code or other libraries. It's relatively straightforward (*), but there are a few little details that took me a while to get right, so here they are.

(* = ADDENDUM = not straightforward at all if multiple-inheritance is used and deletion can be done on arbitrary parts of the MI class)

//==================================================================

/*

subtlety : just calling placement new can be problematic; it's safer to make an explicit selection
of placement new.  This is how we call the constructor.

*/

enum EnumForPlacementNew { ePlacementNew };

// explicit access to placement new when there's ambiguity :
//  if there are *any* custom overrides to new() then placement new becomes ambiguous   
inline void* operator new   (size_t, EnumForPlacementNew, void* pReturn) { return pReturn; }
inline void  operator delete(void*,  EnumForPlacementNew, void*) { }

#ifdef __STRICT_ALIGNED
// subtlety : some stdlibs have a non-standard operator new with alignment (second arg is alignment)
//  note that the alignment is not getting passed to our malloc here, so you must ensure you are
//    getting it in some other way
inline void* operator new   (size_t , size_t, EnumForPlacementNew, void* pReturn) { return pReturn; }
#endif

// "MyNew" macro is how you do construction

/*

subtlety : trailing the arg list off the macro means we don't need to do this kind of nonsense :

    template <typename Entry,typename t_arg1,typename t_arg2,typename t_arg3,typename t_arg4,typename t_arg5,typename t_arg6,typename t_arg7,typename t_arg8,typename t_arg9>
    static inline Entry * construct(Entry * pEntry, t_arg1 arg1, t_arg2 arg2, t_arg3 arg3, t_arg4 arg4, t_arg5 arg5, t_arg6 arg6, t_arg7 arg7, t_arg8 arg8, t_arg9 arg9)

*/

//  Stuff * ptr = MyNew(Stuff) (constructor args); 
//  eg. for void args :
//  Stuff * ptr = MyNew(Stuff) ();
#define MyNew(t_type)   new (ePlacementNew, (t_type *) MyMalloc(sizeof(t_type)) ) t_type 

// call the destructor :
template <typename t_type>
static inline t_type * destruct(t_type * ptr)
{
    RR_ASSERT( ptr != NULL );
    ptr->~t_type();
    return ptr;
}

// MyDelete is how you kill a class

/*

subtlety : I like to use a Free() which takes the size of the object.  This is a big optimization
for the allocator in some cases (or lets you not store the memory size in a header of the allocation).
*But* if you do this, you must ensure that you don't use sizeof() if the object is polymorphic.
Here I use MSVC's nice __has_virtual_destructor() extension to detect if a type is polymorphic.

*/

template <typename t_type>
void MyDeleteNonVirtual(t_type * ptr)
{
    RR_ASSERT( ptr != NULL );
    #ifdef _MSC_VER
    RR_COMPILER_ASSERT( ! __has_virtual_destructor(t_type) );
    #endif
    destruct(ptr);
    MyFree_Sized((void *)ptr,sizeof(t_type));
}

template <typename t_type>
void MyDeleteVirtual(t_type * ptr)
{
    RR_ASSERT( ptr != NULL );
    destruct(ptr);
    // can't use size :
    MyFree_NoSize((void *)ptr);
}

#ifdef _MSC_VER

// on MSVC , MyDelete can select the right call at compile time

template <typename t_type>
void MyDelete(t_type * ptr)
{
    if ( __has_virtual_destructor(t_type) )
    {
        MyDeleteVirtual(ptr);
    }
    else
    {
        MyDeleteNonVirtual(ptr);
    }
}

#else

// must be safe and use the polymorphic call :

#define MyDelete    MyDeleteVirtual

#endif

and the end result is that you can do :

    foo * f = MyNew(foo) ();

    MyDelete(f);

and you get normal construction and destruction but with your own allocator, and without polluting (or depending on) the global linker space. Yay.

11-06-12 - Protect Ad-Hoc Multi-Threading

Code with un-protected multi-threading is bad code just waiting to be a nightmare of a bug.

"ad-hoc" multi-threading refers to sharing of data across threads without an explicit sharing mechanism (such as a queue or a mutex). There's nothing wrong per-se with ad-hoc multi-threading, but too often people use it as an excuse for "comment moderated handoff" which is no good.

The point of this post is : protect your threading! Use name-changes and protection classes to make access lifetimes very explicit and compiler (or at least assert) moderated rather than comment-moderated.

Let's look at some examples to be super clear. Ad-Hoc multi-threading is something like this :


int shared;


thread0 :
{

shared = 7; // no atomics or protection or anything

// shared is now set up

start thread1;

// .. do other stuff ..

kill thread1;
wait thread1;

print shared;

}

thread1 :
{

shared ++;

}

this code works (assuming that thread creation and waiting has some kind of memory barrier in it, which it usually does), but the hand-offs and synchronization are all ad-hoc and "comment moderated". This is terrible code.

I believe that even with something like a mutex, you should make the protection compiler-enforced, not comment-enforced.

Comment-enforced mutex protection is something like :


struct MyStruct s_data;
Mutex s_data_mutex;
// lock s_data_mutex before touching s_data

That's okay, but comment-enforced code is always brittle and bug-prone. Better is something like :

struct MyStruct s_data_needs_mutex;
Mutex s_data_mutex;
#define MYSTRUCT_SCOPE(name)    MUTEX_IN_SCOPE(s_data_mutex); MyStruct & name = s_data_needs_mutex;

assuming you have some kind of mutex-scoper class and macro. This makes it impossible to accidentally touch the protected stuff outside of a lock.

Even cleaner is to make a lock-scoper class that un-hides the data for you. Something like :

//-----------------------------------

template <typename t_data> class ThinLockProtectedHolder;

template <typename t_data>
class ThinLockProtected
{
public:

    ThinLockProtected() : m_lock(0), m_data() { }
    ~ThinLockProtected() { }    

protected:

    friend class ThinLockProtectedHolder<t_data>;
    
    OodleThinLock   m_lock;
    t_data          m_data;
};

template <typename t_data>
class ThinLockProtectedHolder
{
public:

    typedef ThinLockProtected<t_data>   t_protected;
    
    ThinLockProtectedHolder(t_protected * ptr) : m_protected(ptr) { OodleThinLock_Lock(&(m_protected->m_lock)); }
    ~ThinLockProtectedHolder() { OodleThinLock_Unlock(&(m_protected->m_lock)); }
    
    t_data & Data() { return m_protected->m_data; }
    
protected:

    t_protected *   m_protected;

};

#define TLP_SCOPE(t_data,ptr,data)  ThinLockProtectedHolder<t_data> RR_STRING_JOIN(tlph,data) (ptr); t_data & data = RR_STRING_JOIN(tlph,data).Data();

//--------
/*

// use like :

    ThinLockProtected<int>  tlpi;
    {
    TLP_SCOPE(int,&tlpi,shared_int);
    shared_int = 7;
    }

*/

//-----------------------------------
Errkay.

So the point of this whole post is that even when you are just doing ad-hoc thread ownership, you should still use a robustness mechanism like this. For example by direct analogy you could use something like :

//=========================================================================

template <typename t_data> class AdHocProtectedHolder;

template <typename t_data>
class AdHocProtected
{
public:

    AdHocProtected() : 
    #ifdef RR_DO_ASSERTS
        m_lock(0), 
    #endif
        m_data() { }
    ~AdHocProtected() { }

protected:

    friend class AdHocProtectedHolder<t_data>;
    
    #ifdef RR_DO_ASSERTS
    U32             m_lock;
    #endif
    t_data          m_data;
};

#ifdef RR_DO_ASSERTS
void AdHoc_Lock( U32 * pb)  { U32 old = rrAtomicAddExchange32(pb,1); RR_ASSERT( old == 0 ); }
void AdHoc_Unlock(U32 * pb) { U32 old = rrAtomicAddExchange32(pb,-1); RR_ASSERT( old == 1 ); }
#else
#define AdHoc_Lock(xx)
#define AdHoc_Unlock(xx)
#endif


template <typename t_data>
class AdHocProtectedHolder
{
public:

    typedef AdHocProtected<t_data>  t_protected;
    
    AdHocProtectedHolder(t_protected * ptr) : m_protected(ptr) { AdHoc_Lock(&(m_protected->m_lock)); }
    ~AdHocProtectedHolder() { AdHoc_Unlock(&(m_protected->m_lock)); }
    
    t_data & Data() { return m_protected->m_data; }
    
protected:

    t_protected *   m_protected;

};

#define ADHOC_SCOPE(t_data,ptr,data)    AdHocProtectedHolder<t_data> RR_STRING_JOIN(tlph,data) (ptr); t_data & data = RR_STRING_JOIN(tlph,data).Data();

//==================================================================
which provides scoped checked ownership of variable hand-offs without any explicit mutex.

We can now revisit our original example :


AdHocProtected<int> ahp_shared;


thread0 :
{

{
ADHOC_SCOPE(int,&ahp_shared,shared);

shared = 7; // no atomics or protection or anything

// shared is now set up
}

start thread1;

// .. do other stuff ..

kill thread1;
wait thread1;

{
ADHOC_SCOPE(int,&ahp_shared,shared);
print shared;
}

}

thread1 :
{
ADHOC_SCOPE(int,&ahp_shared,shared);

shared ++;

}

And now we have code which is efficient, robust, and safe from accidents.

10/26/2012

10-26-12 - Oodle Rewrite Thoughts

I'm getting increasingly annoyed with the C-style Oodle threading code. It's just such a nightmare to manually manage things like object lifetimes in an async / multi-threaded environment.

Even something as simple as "write part of this buffer to a file" constantly causes me pain, because implied in that operation is "the buffer must not be freed until the write is done" , "the buffer should not be changed in the area being written until the write is done" , and "the file should not be closed until the write is done".

When you first start out and aren't doing a lot of complicated ops, it doesn't seem too bad, you can keep those things in your head; they become "comment-enforced" rules; that is, the code doesn't make itself correct, you have to write comments like "// write is pending, don't free buffer yet" (often you don't actually write the comments, but they're still "comment-enforced" as opposed to "code-enforced").

I think the better way is the very-C++-y Oodle futures .

Oodle futures rely on every object they take as inputs having refcounts, so there is no issue of free before exit. Some key points about the Oodle futures that I think are good :

A. Dependencies are automatic based on your arguments. You depend on anything you take as arguments. If the arguments themselves depend on async ops, then you depend on the chain of ops automatically. This is super-sweet and just removes a ton of bugs. You are then required to write code such that all your dependencies are in the form of function arguments, which at first is a pain in the ass, but actually results in much cleaner code overall because it makes the expression of dependencies really clear (as opposed to just touching some global deep inside your function, which creates a dependency in a really nasty way).

B. Futures create implicit async handles; the async handles in Oodle future are all ref-counted so they clean themselves automatically when you no longer care about them. This is way better than the manual lifetime management in Oodle right now, in which you either have to hold a bunch of handles.

C. It's an easy way to plug in the result of one async op into the input of the next one. It's like an imperative way of using code to do that graph drawing thing ; "this op has an output which goes into this input slot". Without an automated system for this, what I'm doing at the moment is writing lots of little stub functions that just wait on one op, gather up its results and starts the next op. There's no inefficiency in this, it's the same thing the future system does, but it's a pain in the ass.

If I was restarting from scratch I would go even further. Something like :

1. Every object has a refcount AND a read-write lock built into. Maybe the refcount and RW lock count go together in one U32 or U64 which is maintained by lockfree ops.

Refcounting is obvious. Lifetimes of async ops are way too complicated without it.

The RW lock in every object is something that sophomoric programmers don't see the need for. They think "hey it's a simple struct, I fill it on one thread, then pass it to another thread, and he touches it". No no no, you're a horrible programmer and I don't want to work with you. It seems simple at first, but it's just so fragile and prone to bugs any time you change anything, it's not worth it. If every object doesn't just come with an RW lock it's too easy to be lazy and skip adding one, which is very bad. If the lock is uncontended, as in the simple struct handoff case above, then it's very cheap, so just use it anyway.

2. Whenever you start an async op on an object, it takes a ref and also takes either a read lock or write lock.

3. Buffers are special in that you RW lock them in ranges. Same thing with textures and such. So you can write non-overlapping ranges simultaneously.

4. Every object has a list of the ops that are pending on that object. Any time you start a new op on an object, it is delayed until those pending ops are done. Similarly, every op has a list of objects that it takes as input, and won't run until those objects are ready.

The other big thing I would do in a rewrite from scratch is the basic architecture :

1. Write all my own threading primitives (semaphore, mutex, etc) and base them on a single waitset. (I basically have this already).

2. Write stack-ful coroutines.

3. When the low level Wait() is called on a stackful coroutine, instead yield the coroutine.

That way the coroutine code can just use Semaphore or whatever, and when it goes to wait on the semaphore, it will yield instead. It makes the coroutine code exactly the same as non-coroutine code and makes it "composable" (eg. you can call functions and they actually work), which I believe is crucial to real programming. This lets you write stackful coroutine code that does file IO or waits on async ops or whatever, and when you hit some blocking code it just automatically yields the coroutine (instead of blocking the whole worker thread).

This would mean that you could write coroutine code without any special syntax; so eg. you can call the same functions from coroutines as you do from non-coroutines and it Just Works the way you want. Hmm I think I wrote the same sentence like 3 times, but it's significant enough to bear repetition.

10/22/2012

10-22-12 - Windows 7 Start Menu Input Race

I've been super annoyed by some inconsistent behavior in the Windows 7 start menu for a while now. Sometimes I hit "Start - programname - enter" and nothing happens. I just sort of put it down to "god damn Windows is flakey and shit" but I finally realized yesterday exactly what's happening.

It's an input race , as previously discussed here

What happens is, you hit Start, and you get your focus in the type-in-a-program edit box. That part is fine. You type in a program name. At that point it does the search in the start menu thing in the background (it doesn't stall after each key press). In many cases there will be a bit of a delay before it updates the list of matching programs found.

If you hit Enter before it finds the program and highlights it, it just closes the dialog and doesn't run anything. If you wait a beat before hitting enter, the background program-finder will highlight the thing and hitting enter will work.

Very shitty. The start menu should not have keyboard input races. In this case the solution is obvious and trivial - when you hit enter it should wait on the background search task before acting on that key (but if you hit escape it should immediately close the window and abort the task without waiting).

I've long been an advocate of video game programmers doing "flakiness" testing by playing the game at 1 fps, or capturing recordings of the game at the normal 30 fps and then watching them play back at 1 fps. When you do that you see all sorts of janky shit that should be eliminated, like single frame horrible animation pops, or in normal GUIs you'll see things like the whole thing redraw twice in a row, or single frames where GUI elements flash in for 1 frame in the wrong place, etc.

Things like input races can be very easily found if you artificially slow down the program by 100X or so, so that you can see what it's actually doing step by step.

I'm a big believer in eliminating this kind of flakiness. Almost nobody that I've ever met in development puts it as a high priority, and it does take a lot of work for apparently little reward, and if you ask consumers they will never rate it highly on their wish list. But I think it's more important than people realize; I think it creates a feeling of solidness and trust in the application. It makes you feel like the app is doing what you tell it to, and if your avatar dies in the game it's because of your own actions, not because the stupid game didn't jump even though you hit the jump button because there was one frame where it wasn't responding to input.

10-22-12 - LZ-Bytewise conclusions

Wrapping this up + index post. Previous posts in the series :

cbloom rants 09-02-12 - Encoding Values in Bytes Part 1
cbloom rants 09-02-12 - Encoding Values in Bytes Part 2
cbloom rants 09-02-12 - Encoding Values in Bytes Part 3
cbloom rants 09-04-12 - Encoding Values in Bytes Part 4
cbloom rants 09-04-12 - LZ4 Optimal Parse
cbloom rants 09-10-12 - LZ4 - Large Window
cbloom rants 09-11-12 - LZ MinMatchLen and Parse Strategies
cbloom rants 09-13-12 - LZNib
cbloom rants 09-14-12 - Things Most Compressors Leave On the Table
cbloom rants 09-15-12 - Some compression comparison charts
cbloom rants 09-23-12 - Patches and Deltas
cbloom rants 09-24-12 - LZ String Matcher Decision Tree
cbloom rants 09-28-12 - LZNib on enwik8 with Long Range Matcher
cbloom rants 09-30-12 - Long Range Matcher Notes
cbloom rants 10-02-12 - Small note on LZHAM
cbloom rants 10-04-12 - Hash-Link match finder tricks
cbloom rants 10-05-12 - OodleLZ Encoder Speed Variation with Worker Count
cbloom rants 10-07-12 - Small Notes on LZNib
cbloom rants: 10-16-12 - Two more small notes on LZNib

And some little additions :

First a correction/addendum on cbloom rants 09-04-12 - LZ4 Optimal Parse :

I wrote before that going beyond the 15 states needed to capture the LRL overflowing the control byte doesn't help much (or at all). That's true if you only go up to 20 or 30 or 200 states, but if you go all the way to 270 states, so that you capture the transition to needing another byte, there is some win to be had (LZ4P-LO-332 got lztestset to 12714031 with small optimal state set, 12492631 with large state set).

If you just do it naively, it greatly increases memory use and run time. However, I realized that there is a better way. The key is to use the fact that there are so many code-cost ties. In LZ-Bytewise with the large state set, often the coding decision in a large number of states will have the same cost, and furthermore often the end point states will all have the same cost. When this happens, you don't need to make the decision independently for each state, instead you make one decision for the entire block, and you store a decision for a range of states, instead of one for each state.

eg. to be explicit, instead of doing :


in state 20 at pos P
consider coding a literal (takes me to state 21 at pos P+1)
consider various matches (takes me to state 0 at pos P+L)
store best choice in table[P][20]

in state 21 ...

do :

in states 16-260 at pos P
consider coding a literal (takes me to states 17-261 at pos P+1 which I saw all have the same cost)
consider various matches (takes me to state 0 at pos P+L)
store in table[P] : range {16-260} makes decision X

in states 261-263 ...

so you actually can do the very large optimal parse state set with not much increase in run time or memory use.

Second : I did a more complex variant of LZ4P (large window). LZ4P-LO includes "last offset". LZ4P-LO-332 uses a 3-bit-3-bit-2-bit control word (as described previously here : cbloom rants 09-10-12 - LZ4 - Large Window ) ; the 2 bit offset reserves one value for LO and 3 values for normal offsets.

(I consider this an "LZ4" variant because (unlike LZNib) it sends LZ codes as a strictly alternating LRL-ML pairs (LRL can be zero) and the control word of LRL and ML is in one byte)

Slightly better than LZ4P-LO-332 is LZ4P-LO-695 , where the numbering has switched from bits to number of values (so 332 should be 884 for consistency). You may have noticed that 6*9*5 = 270 does not fit in a byte, but that's fixed easily by forbidding some of the possibilities. 6-9-5 = 6 values for literals, 9 for match lengths, and 5 for offsets. The 5 offsets are LO + 2 bits of normal offset. So for example one of the ways that the 270 values is reduced is because an LO match can never occur after an LRL of 0 (the previous match would have just been longer), so those combinations are removed from the control byte.

LZ4P-LO-695 is not competitive with LZNib unless you spill the excess LRL and ML (the amount that is too large to fit in the control word) to nibbles, instead of spilling to bytes as in the original LZ4 and LZ4P. Even with spilling to nibbles, it's no better than LZNib. Doing LZ4P-LO-695, I found a few bugs in LZNib, so its results also got better.

Thirdly, current numbers :

raw lz4 lz4p332 lz4plo695 lznib d8 zlib OodleLZHLW
lzt00 16914 6473 6068 6012 5749 4896 4909
lzt01 200000 198900 198880 198107 198107 198199 198271
lzt02 755121 410695 292427 265490 253935 386203 174946
lzt03 3471552 1820761 1795951 1745594 1732491 1789728 1698003
lzt04 48649 16709 15584 15230 14352 11903 10679
lzt05 927796 460889 440742 420541 413894 422484 357308
lzt06 563160 493055 419768 407437 398780 446533 347495
lzt07 500000 265688 248500 240004 237120 229426 210182
lzt08 355400 331454 322959 297694 302303 277666 232863
lzt09 786488 344792 325124 313076 298340 325921 268715
lzt10 154624 15139 13299 11774 11995 12577 10274
lzt11 58524 25832 23870 22381 22219 21637 19132
lzt12 164423 33666 30864 29023 29214 27583 24101
lzt13 1041576 1042749 1040033 1039169 1009055 969636 923798
lzt14 102400 56525 53395 51328 51522 48155 46422
lzt15 34664 14062 12723 11610 11696 11464 10349
lzt16 21504 12349 11392 10881 10889 10311 9936
lzt17 53161 23141 22028 21877 20857 18518 17931
lzt18 102400 85659 79138 74459 76335 68392 59919
lzt19 768771 363217 335912 323886 299498 312257 268329
lzt20 1179702 1045179 993442 973791 955546 952365 855231
lzt21 679936 194075 113461 107860 102857 148267 83825
lzt22 400000 361733 348347 336715 331960 309569 279646
lzt23 1048576 1040701 1035197 1008638 989387 777633 798045
lzt24 3471552 2369885 1934129 1757927 1649592 2289316 1398291
lzt25 1029744 324190 332747 269047 230931 210363 96745
lzt26 262144 246465 244990 239816 239509 222808 207600
lzt27 857241 430350 353497 315394 328666 333120 223125
lzt28 1591760 445806 388712 376137 345343 335243 259488
lzt29 3953035 2235299 1519904 1451801 1424026 1805289 1132368
lzt30 100000 100394 100393 100010 100013 100020 100001
total 24700817 14815832 13053476 12442709 12096181 13077482 10327927

And comparison charts on the aggregated single file lzt99 :

Speeds are the best of 20 trials on each core; speed is the best of either x86 or x64 (usually x64 is faster). The decode times measured are slightly lower for everybody in this post (vs the last post of this type) because of the slightly more rigorous timing runs. For reference the decode speeds I measured are (mb/s) :


LZ4 :      1715.10235
LZNib :     869.1924302
OodleLZHLW: 287.2821629
zlib :      226.9286645
LZMA :       31.41397495

Also LZNib current enwik8 size : (parallel chunking (8 MB chunks) and LRM 12/12 with bubble)

LZNib enwik8 mml3 : 30719351
LZNib enwik8 stepml : 30548818

(all other LZNib results are for mml3)

old rants