7/09/2011

07-09-11 - LockFree - Thomasson's simple MPMC

Warming back up into this stuff, here's some very simple algos to study.

first fastsemaphore :


class fastsemaphore
{
    rl::atomic<long> m_count;
    rl::rl_sem_t m_waitset;

public:
    fastsemaphore(long count = 0)
    :   m_count(count)
    {
        RL_ASSERT(count > -1);
        sem_init(&m_waitset, 0, 0);
    }

    ~fastsemaphore()
    {
        sem_destroy(&m_waitset);
    }

public:
    void post()
    {
        if (m_count($).fetch_add(1) < 0)
        {
            sem_post(&m_waitset);
        }
    }

    void wait()
    {
        if (m_count($).fetch_add(-1) < 1)
        {
            // loop because sem_wait returns non-zero for spurious failure
            while (sem_wait(&m_waitset));
        }
    }
};

Most code I post will be in Relacy notation, which is just modified C++0x. Note that C++0x atomics without explicit memory ordering specifications (such as the fetch_adds here) default to memory_order_seq_cst (sequential consistency).

Basically your typical OS "semaphore" is a very heavy kernel-space object (on Win32 for example, semaphores are cross-process). Just doing P or V on it even when you don't modify wait states is very expensive. This is just a user-space wrapper which only calls to the kernel semaphore if it is at an edge transition that will cause a thread to either go to sleep or wake up.

So this is a simple thing that's nice to have. Note that m_count can be positive or negative. When it's positive, it's the number of times you can wait() without sleeping; when it's negative, it's the (minus) number of threads that are sleeping on that semaphore. (between post and wakeup a thread can be sleeping but no longer counted, so more precisely we should say that threads which are not counted in a negative m_count are either running or pending-running).


Now we can look at Thomasson's very simple MPMC bounded blocking queue :


template<typename T, std::size_t T_depth>
class mpmcq
{
    rl::atomic<T*> m_slots[T_depth];
    rl::atomic<std::size_t> m_push_idx;
    rl::atomic<std::size_t> m_pop_idx;
    fastsemaphore m_push_sem;
    fastsemaphore m_pop_sem;

public:
    mpmcq() 
    :   m_push_idx(T_depth), 
        m_pop_idx(0),
        m_push_sem(T_depth),
        m_pop_sem(0)
    {
        for (std::size_t i = 0; i < T_depth; ++i)
        {
            m_slots[i]($).store(NULL);
        }
    }

public:
    void push(T* ptr)
    {
        m_push_sem.wait();

        std::size_t idx = m_push_idx($).fetch_add(1) & (T_depth - 1);

        rl::backoff backoff;

        while (m_slots[idx]($).load())
        {
            backoff.yield($);
        }

        RL_ASSERT(! m_slots[idx]($).load());

        m_slots[idx]($).store(ptr);

        m_pop_sem.post();
    }


    T* pop()
    {
        m_pop_sem.wait();

        std::size_t idx = m_pop_idx($).fetch_add(1) & (T_depth - 1);

        T* ptr;
        rl::backoff backoff;

        while ( (ptr = m_slots[idx]($).load()) == NULL )
        {
            backoff.yield($);
        }
        
        m_slots[idx]($).store(NULL);

        m_push_sem.post();

        return ptr;
    }
};

First let's understand what's going on here. It's just an array of slots with a reader index and writer index that loop around. "pop_sem" counts the number of filled slots - so the popper waits on that semaphore to see filled slots be non-zero. "push_sem" counts the number of available slots - so the pusher waits on that being greater than zero to be able to fill a slot.

So the producer and consumer both nicely go to sleep and wake each other when they should. Also because we use "fastsemaphore" they have reasonably low overhead when they are in the non-sleeping case.
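One detail worth making explicit: the index math in push and pop relies on T_depth being a power of two, since fetch_add just counts up forever and the mask folds it back into range. A small sketch (the compile-time guard is my addition, not in the original queue):

```cpp
#include <cstddef>

// "idx & (T_depth - 1)" is only equivalent to "idx % T_depth"
// when T_depth is a power of two :
template<std::size_t N>
struct is_pow2
{
    static const bool value = (N != 0) && ((N & (N - 1)) == 0);
};

// e.g. inside mpmcq you could add :
//   static_assert(is_pow2<T_depth>::value, "T_depth must be a power of 2");

// the fetch_add counter keeps growing; the mask folds it into [0, depth) :
inline std::size_t slot_index(std::size_t counter, std::size_t depth)
{
    return counter & (depth - 1);
}
```

Note that m_push_idx starts at T_depth in the constructor, which is harmless because it masks to 0.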

Now, why is the weird backoff logic there? It's because of the "M" (for multiple) in MPMC. If this was an SPSC queue then it could be much simpler :


    void push(T* ptr)
    {
        m_push_sem.wait();

        std::size_t idx = m_push_idx($).fetch_add(1) & (T_depth - 1);

        RL_ASSERT(! m_slots[idx]($).load());

        m_slots[idx]($).store(ptr);

        m_pop_sem.post();
    }


    T* pop()
    {
        m_pop_sem.wait();

        std::size_t idx = m_pop_idx($).fetch_add(1) & (T_depth - 1);

        /* (*1) */

        T* ptr = m_slots[idx]($).exchange(NULL);
        RL_ASSERT( ptr != NULL );

        m_push_sem.post();

        return ptr;
    }

which should be pretty obviously correct for SPSC.

But, now consider you have multiple consumers and the queue is completely full.

Consumer 1 gets a pop_idx = 2. But then at (*1) it swaps out and doesn't run any more.

Consumer 2 gets a pop_idx = 3 and runs through and posts to the push semaphore.

Now a producer runs and gets push_idx = 2. It believes there is an empty slot it can write to, but it looks in slot 2 and there's still something there (because consumer 1 hasn't cleared its slot yet). So, it has to do the backoff-yield loop to give consumer 1 some CPU time to let it run.

So the MPMC with backoff-yield works, but it's not great. As long as the queue is near empty it works reasonably well, but when it's full it acts like a mutex-based queue, in that one consumer being swapped out can block all your pushers from running. And because it's just a busy wait, the normal OS hacks that rescue you from this (like Windows priority boosts) don't apply. (This kind of thing is exactly why the Windows scheduler has so many hacks, and why, despite your whining, you really do want it to be like that.)

7/08/2011

07-08-11 - Who ordered Event Count -

Say you have something like a lock-free queue , and you wish to go into a proper OS-level thread sleep when it's empty (eg. not just busy wait).

Obviously you try something like :


popper :

item = queue.pop();
if ( item == NULL )
{
  Wait(handle);
}


pusher :

was_empty = queue.push();
if ( was_empty )
{
  Signal(handle);
}

where we have extended queue.push to atomically check if the queue was empty and return a bool (this is easy to do for most lock-free queue implementations).

This doesn't work. The problem is that between the pop() and the Wait(), somebody could push something on the queue. That means the queue is non-empty at that point, but you go to sleep anyway, and nobody will ever wake you up.

Now, one obvious solution is to put a Mutex around the whole operation and use a "Condition Variable" (which is just a way of associating a sleep/wakeup with a mutex). That way the mutex prevents the state of the queue from changing while you decide to go to sleep. But we don't want to do that today because the whole point of our lock-free queue is that it's lock-free, and a mutex would spoil that. (actually I suppose the classic solution to this problem would be to use a counting semaphore and inc it on each push and dec it on each pop, but that has even more overhead if it's a kernel semaphore). Basically we want these specific fast paths and slow paths :


fast path :
  when popper can get an item without sleeping
  when pusher adds an item and there are no waiters

slow path :
  when popper has no work to do and goes to sleep
  when pusher needs to wake a popper

we're okay with the sleep/wakeup part being slow because that involves thread transitions anyway, so it's always slow and complicated. But in the other case where a core is just sending a message to another running core it should be mutex-free.

So, the obvious answer is to do a kind of double-check, something like :


popper :

item = queue.pop(); 
if ( item ) return item;  // *1
atomic_waiters ++;   // *2
item = queue.pop();  // *3
if ( item ) { atomic_waiters--; return item; }
if ( item == NULL )
{
  Wait(handle);
}
atomic_waiters--;


pusher :

queue.push();
if ( atomic_waiters > 0 )
{
  Signal(handle);
}

this gets the basics right. First, the popper has a fast path at *1 - if it gets an item, it's done. It then registers in a shared variable, at *2, the fact that it's going to wait. Then it double-checks at *3. The double check is not an optimization to avoid sleeping your thread, it's crucial for correctness. The issue is that the popper could swap out between *1 and *2, and the pusher could then run completely, and it would see waiters == 0 and not do a signal. So the double-check at *3 catches this.

There's a performance bug with this code as written - if the queue goes empty and then you do lots of pushes, every one of those pushes sends the signal. You might be tempted to fix that by moving the "atomic_waiters--" line to the pusher side (before the signal), but that creates a race. You could fix that, but then you spot a bigger problem :

This code doesn't work at all if you have multiple pushers or poppers. The problem is "lost wakeups". Basically if there are multiple poppers going into wait at the same time, the pusher may think it's done the wakeups it needs to, but it hasn't, and a popper goes into a wait forever.

To fix this you need a real proper "event count". What a proper event_count does is register the waiter at a certain point in time. The usage is like this :


popper :

item = queue.pop(); 
if ( item ) return item;
count = event_count.get_event_count();
item = queue.pop();
if ( item ) { event_count.cancel_wait(count); return item; }
if ( item == NULL )
{
  event_count.wait_on_count(count);
}


pusher :

queue.push();
event_count.signal();

Now, as before, get_event_count() registers in an atomic visible variable that I want to wait on something (most people call this function prepare_wait()), but it also records the current "event count" to identify the wait (this is just the number of pushes, or the number of signals if you like). Then wait_on_count() only actually does the wait if the event_count is still on the same count as when I did get_event_count() - if the internal counter has advanced, the wait is not done. signal() is a fast nop if there are no waiters, and increments the internal count.

This eliminates the lost wakeup problem, because if the "event_count" has advanced (and signaled some other popper, and won't signal again) then you will simply not go into the Wait.

Basically it's exactly like a Windows Event in that you can wait and signal, but with the added feature that you can record a place in time on the event, and then only do the Wait if that time has not advanced between your recording and the call to Wait.

It turns out that event_count and condition variables are closely related; in particular, one can very easily be implemented in terms of the other. (I should note that the exact semantics of pthread cond_var are *not* easy to implement in this way, but a "condition variable" in the broader sense need not comply with their exact specs).

Maybe in the future I'll get into how to implement event_count and condition_var.

BTW eventcount is the elegant solution to the problem of Waiting on Thread Events discussed previously.


ADDENDUM : if you like, eventcount is a way of doing Windows' PulseEvent correctly.

"Event" on Win32 is basically a "Gate" ; it's either open or closed. SetEvent makes it open. When you Wait() on the Event, if the gate is open you walk through, if it's closed you stop. The normal way to use it is with an auto-reset event, which means when you walk through the gate you close it behind yourself (atomically).

The idea of the Win32 PulseEvent API is to briefly open the gate and let through someone who was previously waiting on the gate, and then close it again. Unfortunately, PulseEvent is horribly broken by design and almost always causes a race, which leads most people to recommend against ever using PulseEvent (or manual reset events). (I'm sure it is possible to write correct code using PulseEvent, for example the race may be benign and just be a performance bug, but it is wise to follow this advice and not use it).

For example the standard Queue code using PulseEvent :

popper :
  node = queue.pop();
  if ( ! node ) Wait(event);

pusher :
  queue.push(node);
  PulseEvent(event);

is a totally broken race (if the popper is between getting a null node and entering the wait, it doesn't see the event), and most PulseEvent code is similarly broken.

eventcount's Signal is just like PulseEvent - it only signals people who were previously waiting, it doesn't change the eventcount into a signalled state. But it doesn't suffer from the race of PulseEvent because it has a consistent way of defining the moment in "time" when the event fires.

07-08-11 - Event Count and Condition Variable

If you have either event_count or condition_variable, it's pretty straightforward to get the other from the one you have.

eventcount from condition_variable :

by Chris M Thomasson :

class eventcount {  
public:  
    typedef unsigned long key_type;  

private:  
    mutable rl::atomic<key_type> m_count;  
    rl::mutex m_mutex;  
    rl::condition_variable m_cond;  

    void prv_signal(key_type key) {  
        if (key & 1) {  
            m_mutex.lock($);  
            while (! m_count($).compare_exchange_weak(key, (key + 2) & ~1,   
                rl::memory_order_seq_cst));  
            m_mutex.unlock($);  
            m_cond.notify_all($);  
        }  
    }  

public:  
    eventcount() {  
        m_count($).store(0, rl::memory_order_relaxed);  
    }  

public:  
    key_type get() const {   // aka prepare_wait
        return m_count($).fetch_or(1, rl::memory_order_acquire);  
    }  

    void signal() {  // aka notify_one
        prv_signal(m_count($).fetch_add(0, rl::memory_order_seq_cst));  
    }  

    void signal_relaxed() {  
        prv_signal(m_count($).load(rl::memory_order_relaxed));  
    }  

    void wait(key_type cmp) {  
        m_mutex.lock($);  
        if ((m_count($).load(rl::memory_order_seq_cst) & ~1) == (cmp & ~1)) {  
            m_cond.wait(m_mutex, $);  
        }  
        m_mutex.unlock($);  
    }  
};  
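For concreteness, here's that same eventcount in plain C++11 (my translation: rl:: types become std:: and the Relacy `$` annotations are dropped; treat it as a sketch, not validated code). The low bit of m_count is the "someone is waiting" flag; the rest is the count.

```cpp
#include <atomic>
#include <mutex>
#include <condition_variable>

class eventcount
{
public:
    typedef unsigned long key_type;

private:
    std::atomic<key_type> m_count;
    std::mutex m_mutex;
    std::condition_variable m_cond;

    void prv_signal(key_type key)
    {
        if (key & 1)   // low bit set => someone called get() since last signal
        {
            {
                std::lock_guard<std::mutex> lock(m_mutex);
                // advance the count by 2 and clear the waiter bit :
                while (!m_count.compare_exchange_weak(
                    key, (key + 2) & ~(key_type)1)) {}
            }
            m_cond.notify_all();
        }
    }

public:
    eventcount() : m_count(0) {}

    key_type get()   // aka prepare_wait
    {
        return m_count.fetch_or(1, std::memory_order_acquire);
    }

    void signal()
    {
        prv_signal(m_count.fetch_add(0, std::memory_order_seq_cst));
    }

    void wait(key_type cmp)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        // only sleep if the count hasn't advanced past our snapshot :
        if ((m_count.load() & ~(key_type)1) == (cmp & ~(key_type)1))
            m_cond.wait(lock);
    }
};
```

Note how wait() compares counts with the low bit masked off: a signal that arrived between get() and wait() makes the comparison fail, so the sleep is simply skipped. That is exactly the no-lost-wakeup property.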
and condition variable from event count :
by Dmitry V'jukov :

class condition_variable  
{  
    eventcount ec_;  

public:  
    void wait(mutex& mtx)  
    {  
        int count = ec_.prepare_wait();  
        mtx.unlock();  
        ec_.wait(count);  
        mtx.lock();  
    }  

    void signal()  
    {  
        ec_.notify_one();  
    }  

    void broadcast()  
    {  
        ec_.notify_all();  
    }  
};   
(note this is a simplified condition variable without all the POSIX compliance crud).

In C++0x you have condition_variable at the stdlib level, so that is probably the best approach for the future. Unfortunately that future is still far away. On Pthreads you also have condition_variable (though a rather more complex one). Unfortunately, on Win32 (pre-Vista) you don't have condition_variable at all, so you have to build one of these from OS primitives.

(BTW there are various sources for good condition_var implementations for Win32, such as boost::thread and Win32 pthreads by Alex Terekhov).

ADDENDUM : really eventcount is the more primitive of the two; it's sort of a mistake that C++0x has provided "condition_var" as a primitive. They are not trying to provide a full set of OS-level thread control types (eg. they don't provide semaphore, event, what have you) - they are trying to provide the minimal basic set, and they chose condition_var. They should have done mutex and eventcount, as you can build everything from that.

(actually there's something perhaps even more primitive than eventcount, which is "waitset", which can be easily used to build any of the basic blocking thread control devices).

7/06/2011

07-06-11 - Who ordered Condition Variables -

I'm getting back into some low level threading stuff for a week or two and I'll try to write about it because it's very confusing and I always forget the basics.

Condition Variables are explained very strangely around the net. You'll find sites that say they "let you wait on a variable being set to certain value" (not true), "let you avoid polling" (not true), or "the mutex is a left-over from pre-pthreads implementations" (not true).

What a "Condition Variable" really is is a way to receive a Signal and enter a Mutex at the same time ("atomically" if you like).

Why do we want this?

The typical case is we are waiting on some state. When the state is not true, we want our thread to go into a real sleep. So we use an OS Wait() on the thread that's waiting for that state, and an OS Signal() when the state is set to wake the thread. (Wait and Signal might be "Event" on win32, or a semaphore in pthreads, or a futex, etc). Basically :


Waiting thread :

if ( state I want is not set )
{
  Wait(handle);
  // state I want should be set now
  // **


Signalling thread :

if ( I changed state to the one wanted )
  Signal(handle)

simple enough. The problem is, the invariant at (**) is not true. The state you want is NOT necessarily set there, because there's a race. Immediately after you receive the signal, you could be swapped out, and someone else could change the state, and then it would not be what you wanted.

So the obvious thing is to put a mutex around "state". To be less abstract I'll talk about a one item queue (aka a mailbox).


Waiting thread :

Lock mutex;
if ( mailbox empty )
{
  atomically { Unlock(mutex);   Wait(handle); (**) Lock(mutex) }
  // mailbox must be full now
}

Signaling thread :

Lock mutex
if ( mailbox empty )
{
  mailbox = some stuff;
  Signal(handle); Unlock(mutex);
}

now when the mailbox filler signals the handle, the waiter immediately wakes up and tries to lock the mutex and can't; the filler then unlocks and the waiter can run, and it is guaranteed to have an item.

Note that the line marked {atomically} *must* be atomic, otherwise you have a race at (**) just like before.

And this :

  atomically { Unlock(mutex);   Wait(handle); (**) Lock(mutex) }
is exactly what pthread_cond_wait() is.

Personally, rather than introducing a new synchronization data type, I would have preferred to just get a function that does unlock_wait_lock(). The "cond_var" in pthreads has nothing to do with a "condition"; it's just a waitable handle (and associated mutex and other wiring) in Windows lingo.

There's one more point worth talking about. In the thread that filled the mailbox, we did :


lock mutex
set state
signal
unlock

That's good because it guarantees that the receiver gets the state it wants and is race free. It's bad because it causes some unnecessary "thread thrashing".

(thread thrashing is a lot like input latency that I mentioned a while ago; it's something you just want to constantly watch and be vigilant about tracking and removing. Any time a thread wakes up and does nothing and goes back to sleep, you are "thrashing" and just wasting massive amounts of CPU. You want to minimize the number of useless thread wakeups)

The alternative is :


lock mutex
set state
unlock
(**)
signal

now, there's a race at (**) where the state can get changed before you signal, so the invariant in the receiver is no longer true.

In most cases, however, this is not actually bad, and this form is actually preferred because of its increased efficiency. You have to change the receiver to a "double-check" type of pattern. Something like :


Waiting thread :

Lock mutex;
if ( mailbox empty )
{
  retry:
  cond_wait(mutex,handle); //  atomically { Unlock(mutex); Wait(handle); Lock(mutex) }
  // mailbox may or may not be full now
  if ( mailbox empty ) goto retry;
  // now do work
}

Signaling thread :

Lock mutex
if ( mailbox empty )
{
  mailbox = some stuff;
  Unlock(mutex);
  // intentional race here
  Signal(handle);
}

In general I believe it's a safer design to treat the signal as meaning "wake up and check this condition" instead of "wake up and this condition is definitely set". Then you engineer to minimize the number of wakeups when the condition is not set.

BTW a better design of the primitive would have allowed the signalling thread to do


atomically { Unlock(mutex); Signal(handle); }

which would be the ideal thing. Unfortunately a normal cond_var is not expressed this way. However, apparently on some modern UNIXes cond_var actually *acts* this way. What you do is signal inside the mutex, but the other thread isn't actually woken up until you unlock the mutex. Unfortunately this is a hidden optimization (they should have just provided unlock_and_signal() as one call) and you can't rely on it if you're cross-platform.

Some links on this topic :
Usenet - Condition variables signal with or without mutex locked
condvars - signal with mutex locked or not - Loïc OnStage
A word of caution when juggling pthread_cond_signal / pthread_mutex_unlock - comp.programming.threads - Google Groups

6/28/2011

06-28-11 - String Extraction

So I wrote a little exe to extract static strings from code. It's very simple.

StringExtractor just scans some dirs of code and looks for a tag which encloses a string with parens. eg. :

    MAKE_STRING( BuildHuff )
it takes all the strings it finds in that way and makes a table of indexes and contents, like :


enum EStringExtractor
{
    eSE_Null = 0,
    eSE_BuildHuff = 1,
    eSE_DecodeOneQ = 2,
    eSE_DecodeOneQ_memcpy = 3,
    eSE_DecodeOneQ_memset = 4,
    eSE_DecodeOneQuantum = 5, ...


const char * c_strings_extracted[] = 
{
    0,
    "BuildHuff",
    "DecodeOneQ",
    "DecodeOneQ_memcpy",
    "DecodeOneQ_memset",
    "DecodeOneQuantum", ...

it outputs this to a generated .C and .H file, which you can then include in your project.
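The scanning step itself is trivial. A minimal sketch of it (hypothetical code, not the real stringextractor, which does proper dir iteration and output-file handling): find every MAKE_STRING( label ) in a source blob and collect the labels in first-seen order.

```cpp
#include <regex>
#include <string>
#include <vector>

// scan source text for MAKE_STRING( label ) and collect the labels :
std::vector<std::string> extract_strings(const std::string & source)
{
    std::vector<std::string> labels;
    std::regex pat("MAKE_STRING\\(\\s*(\\w+)\\s*\\)");
    for (std::sregex_iterator it(source.begin(), source.end(), pat), end;
         it != end; ++it)
    {
        labels.push_back((*it)[1].str());
    }
    return labels;
}
```

From that list, generating the enum and the c_strings_extracted table is just string formatting (and in the real tool you'd also de-dupe labels that appear more than once).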

The key then is what MAKE_STRING means. There are various ways to set it up, depending on whether you are replacing an old system that uses char * everywhere or not. Basically you want to make a header that's something like :


#if DO_STRINGS_RAW

#define MAKE_STRING( label )   (string_type)( #label )
#define GET_STRING(  index )   (const char *)( index )

#else

#include "code_gen_strings.h"

#define MAKE_STRING( label )  (string_type)( eSE_ ## label )
#define GET_STRING(  index )  c_strings_extracted[ (int)( index ) ]

#endif

(string_type can either be const char * to replace an old system, or if you're doing this from scratch it's cleaner to make it a typedef).

If DO_STRINGS_RAW is on, you run with the strings in the code as normal. With DO_STRINGS_RAW off, all static strings in the code are replaced with indexes and the table lookup is used.

It's important to me that the code gen doesn't actually touch any of the original source files, it just makes a file on the side (I hate code gen that modifies source because it doesn't play nice with editors); it's also important to me that you can set DO_STRINGS_RAW and build just fine without the code gen (I hate code gen that is a necessary step in the build).

Now, why would you do this? Well, for one thing it's just cleaner to get all static strings in one place so you can see what they are, rather than having them scattered all over. But some real practical benefits :

You can make builds that don't have the string table; eg. for SPU or other low-memory console situations, you can run the string extraction to turn strings into indeces, but then just don't link the table in. Now they can send back indeces to the host and you can do the mapping there.

You can load the string table from a file rather than building it in. This makes it optional and also allows localization etc. (not a great way to do this though).

For final builds, if you are using these strings just for debug info, you can easily get rid of all of them in one place just by #defining MAKE_STRING and GET_STRING to nulls.

Anyhoo, here's the EXE :

stringextractor.zip (84k)

(stringextractor is also the first cblib app that uses my new standardized command line interface; all cblib apps in the future will have a common set of -- options; also almost all cblib apps now take either files or dirs on the command line and if you give them dirs they iterate on contents).

(stringextractor also importantly uses a helper to not change the output file if the contents don't change; this means that it doesn't mess up the modtime of the generated file and cause rebuilds that aren't necessary).

Obviously one disadvantage is you can't have spaces or other non-C-compatible characters in the string. But I guess you could fix this by using printf style codes and do printf when you generate the table.

6/24/2011

06-24-11 - Regression

Oodle now can run batch files and generate this :

                     test_cdep.done  test_huff.done  test_ioqueue.done  test_lzp.done    test_oodlelz.done
r:\test_done_xenon
r:\test_done_ps3     pass            pass            pass               pass : 128.51    pass : 450.50
r:\test_done_win32   pass            pass            fail               pass : 341.94    pass : 692.58

                     test_cdep.done  test_huff.done  test_ioqueue.done  test_lzp.done    test_oodlelz.done
r:\test_done_xenon
r:\test_done_ps3     pass            pass            pass               pass : 128.03    pass : 450.73
r:\test_done_win32   pass            pass            pass               pass : 335.55    pass : 686.90

Yay!

Something that's important for me is doing constant runs of the speeds of the optimized bits on all the platforms, because it's so easy to break the optimization with an innocuous check-in, and then you're left trying to find what slowed you down.

Two niggles continue to annoy me :

1. Damn Xenon doesn't have a command line interface (by which I mean you can't actually interact with running programs from a console; you can start programs; the specific problem is that you can't tell if a program is done or still running or crashed from a console). I have my own hacky work-around for this which is functional but not ideal. (I know I could write my own nice xenon-runner with the Dm API's but I haven't bitten that off yet).

2. Damn PS3TM doesn't provide "force connect" from the command line. They provide most of the commands as command line switches, but not the one that I actually want. Because of this I frequently have problems connecting to the PS3 during the regression run, and I have to open up the damn GUI and do it by hand. This is in fact the only step that I can't automate and that's annoying. I mean, why do they even fucking bother with providing the "connect" and "disconnect" options? They never fucking work, the only thing that works is "force disconnect". Don't give me options that just don't work dammit.

(and the whole idea of PS3TM playing nice and not disconnecting other users is useless because it doesn't disconnect idle people, so when someone else is "connected" that usually means they were debugging two days ago and just never disconnected)

(there is a similar problem with the Xenon (similar in the sense that it can't be automated); it likes to get itself into a state where it needs to be cold-booted by physically turning it off; I'm not sure why the "cold boot" xbreboot is not the same as physically power cycling it, so that's mildly annoying too).

6/23/2011

06-23-11 - Map File Graphviz

What I want :

Something that parses the .map and obj's and creates a graph of the size of the executable. Graph nodes should be the size they take in the exe, and connections should be dependencies.

I have all this code for generating graphviz/dotty because I do it for my allocation grapher, but I don't know a good way to get the dependencies in the exe. Getting the sizes of things in the MAP is relatively easy.

To be clear, what you want to see is something like :

s_log_buf is 1024 bytes
s_log_buf is used by Log() in Log.cpp
Log() is called from X and Y and Z
...
just looking at the .map file is not enough; you want to know why a certain symbol got dragged in. (This happens a lot; for example, some CRT function like strcpy suddenly shows up in your map and you're like "where the fuck did that come from?")

Basically I need the static call-graph or link tables , a list of the dependencies from each function. The main place I need this is on the SPU, because it's a constant battle to keep your image size minimal to fit in the 256k.

I guess I can get it from "objdump" pretty easily, but that only provides dependency info at the obj level, not at the function level, which is what I really want.

Any better solutions?

6/17/2011

06-17-11 - C casting is the devil

C-style casting is super dangerous, as you know, but how can you do better?

There are various situations where I need to cast just to make the compiler happy that aren't actually operational casts. That is, if I was writing ASM there would be no cast there. For example something like :

U16 * p;
p[0] = 1;
U8 * p2 = (U8 *)p;
p2[1] = 7;
is a cast that changes the behavior of the pointer (eg. "operational"). But, something like :
U16 * p;
*p = 1;
U8 * p2 = (U8 *)p;
p2 += step;
p = (U16 *) p2;
*p = 2;
is not really a functional cast, but I have to do it because I want to increment the pointer by some step in bytes, and there's no way to express that in C without a cast.

Any time I see a C-style cast in code I think "that's a bug waiting to happen" and I want to avoid it. So let's look at some ways to do that.

1. Well, since we did this as an example already, we can hide those casts with something like ByteStepPointer :


template<typename T>
T * ByteStepPointer(T * ptr, ptrdiff_t step)
{
    return (T *)( ((intptr_t)ptr) + step );
}

our goal here is to hide the nasty dangerous casts from the code we write every day, and bundle it into little utility functions where it's clear what the purpose of the cast is. So now we can write out example as :
U16 * p;
*p = 1;
p = ByteStepPointer(p,step);
*p = 2;
which is much prettier and also much safer.
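If you're nervous about the pointer-to-integer round trip, it's easy to sanity-check. Here's a self-contained variant (my choice of uintptr_t, which is the blessed integer type for holding a pointer) with the stepping verified in both directions:

```cpp
#include <cstdint>
#include <cstddef>

// ByteStepPointer, self-contained : step a typed pointer by a byte count
template<typename T>
T * ByteStepPointer(T * ptr, std::ptrdiff_t step)
{
    return (T *)( ((std::uintptr_t)ptr) + step );
}
```

A typical use is walking an image by its row pitch: `row = ByteStepPointer(row, pitch);` where pitch is in bytes regardless of the pixel type.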

2. The fact that "void *" in C++ doesn't cast to arbitrary pointers the way it does in C is really fucking annoying. It means there is no "generic memory location" type. I've been experimenting with making the casts in and out of void explicit :


template<typename T>
T * CastVoid(void * ptr)
{
    return (T *)( ptr );
}

template<typename T>
void * VoidCast(T * ptr)
{
    return (void *)( ptr );
}

but it sucks that it's so verbose. In C++0x you can do this neater because you can template specialize based on the left-hand-side. So in current C++ you have to write
Actor * a = CastVoid<Actor>( memory );
but in 0x you will be able to write just
Actor * a = CastVoid( memory );

There are a few cases where you need this, one is to call basic utils like malloc or memset - it's not useful to make the cast clear in this case because the fact that I'm calling memset is clear enough that I'm treating this pointer as untyped memory; another is if you have some generic "void *" payload in a node or message.

Again you don't want just a play C-style cast here, for example something like :

Actor * a = (Actor *) node->data;
is a bug waiting to happen if you change "data" to an int (among other things).

3. A common annoying case is having to cast signed/unsigned. It should be obvious that when I write :

U32 set = blah;
U32 mask = set & (-set);
that I want the "-" operator to act as (~set + 1) on the bits and I don't care that it's unsigned, but C won't let you do that. (see previous rants about how what I really want in this scenario is a "#pragma requires(twos_complement)" ; warning me about the sign is fucking useless for portability because it just makes me cast, if you want to make a real portable language you have to be able to express capabilities of the platform and constraints of the algorithm).

So, usually what you want is a cast that gives you the signed type of the same register size, and that doesn't exist. So I made my own :


static inline S8  Signed(U8 x)  { return (S8) x; }
static inline S16 Signed(U16 x) { return (S16) x; }
static inline S32 Signed(U32 x) { return (S32) x; }
static inline S64 Signed(U64 x) { return (S64) x; }

static inline U8  Unsigned(S8 x)  { return (U8) x; }
static inline U16 Unsigned(S16 x) { return (U16) x; }
static inline U32 Unsigned(S32 x) { return (U32) x; }
static inline U64 Unsigned(S64 x) { return (U64) x; }

So for example, this code :
mask = set & (-(S32)set);
is a bug waiting to happen if you switch to 64-bit sets. But this :
mask = set & (-Signed(set));
is robust. (well, robust if you include a compiler assert that you're 2's complement)

4. Probably the most common case is where you "know" a value is small and need to put it in a smaller type. eg.

int x = 7;
U8 small = (U8) x;
But all integer-size-change casts are super unsafe, because you can later change the code such that x doesn't fit in "small" anymore.

(often you were just wrong or lazy about "knowing" that the value fit in the smaller type. One of the most common cases for this right now is putting file sizes and memory sizes into 32-bit ints. Lots of people get annoying compiler warnings about that and think "oh, I know this is less than 2 GB so I'll just C-style cast". Oh no, that is a huge maintenance nightmare. In two years you try to run on a larger file and suddenly you have bugs all over and you can't find them because you used C-style casts. Start checking your casts!).

You can do this with a template thusly :


// check_cast just does a static_cast and makes sure you didn't wreck the value
template <typename t_to, typename t_fm>
t_to check_cast( const t_fm & from )
{
    t_to to = static_cast<t_to>(from);
    ASSERT( static_cast<t_fm>(to) == from );
    return to;
}

but it is so common that I find the template a bit excessively verbose (again C++0x with LHS specialization would help; you could then write just :
small = check( x );
or
small = clamp( x );
which is much nicer).

To do clamp casts with a template is difficult. You can use std::numeric_limits to get the ranges of the dest type :

template <typename t_to, typename t_fm>
t_to clamp_cast( const t_fm & from )
{
    t_to lo = std::numeric_limits<t_to>::min();
    t_to hi = std::numeric_limits<t_to>::max();
    if ( from < lo ) return lo; // !
    if ( from > hi ) return hi; // !
    t_to to = static_cast<t_to>(from);
    ASSERT( static_cast<t_fm>(to) == from ); 
    return to;
}
however, the compares inherent in clamping (at !) are problematic; for example if you're trying to clamp_cast from signed to unsigned you may get sign-compare warnings there (you can also get the unsigned-compare-against-zero warning when lo is 0). (? is there a nice solution to this ? you want to do the compare in the larger range of the two types, so you could make some template helpers that do the compare in the wider of the two types, but that seems a right mess).

Rather than try to fix all that I just use non-template versions for our basic types :


static inline U8 S32ToU8Clamp(S32 i)    { return (U8) CLAMP(i,0,0xFF); }
static inline U8 S32ToU8Check(S32 i)    { ASSERT( i == (S32)S32ToU8Clamp(i) ); return (U8)i; }

static inline U16 S32ToU16Clamp(S32 i)  { return (U16) CLAMP(i,0,0xFFFF); }
static inline U16 S32ToU16Check(S32 i)  { ASSERT( i == (S32)S32ToU16Clamp(i) ); return (U16)i; }

static inline U32 S64ToU32Clamp(S64 i)  { return (U32) CLAMP(i,0,0xFFFFFFFFUL); }
static inline U32 S64ToU32Check(S64 i)  { ASSERT( i == (S64)S64ToU32Clamp(i) ); return (U32)i; }

static inline U8 U32ToU8Clamp(U32 i)    { return (U8) CLAMP(i,0,0xFF); }
static inline U8 U32ToU8Check(U32 i)    { ASSERT( i == (U32)U32ToU8Clamp(i) ); return (U8)i; }

static inline U16 U32ToU16Clamp(U32 i)  { return (U16) CLAMP(i,0,0xFFFF); }
static inline U16 U32ToU16Check(U32 i)  { ASSERT( i == (U32)U32ToU16Clamp(i) ); return (U16)i; }

static inline U32 U64ToU32Clamp(U64 i)  { return (U32) CLAMP(i,0,0xFFFFFFFFUL); }
static inline U32 U64ToU32Check(U64 i)  { ASSERT( i == (U64)U64ToU32Clamp(i) ); return (U32)i; }

static inline S32 U64ToS32Check(U64 i)  { S32 ret = (S32)i; ASSERT( (U64)ret == i ); return ret; }
static inline S32 S64ToS32Check(S64 i)  { S32 ret = (S32)i; ASSERT( (S64)ret == i ); return ret; }

which is sort of marginally okay. Maybe it would be nicer if I left off the type it was casting from in the name.

6/16/2011

06-16-11 - Optimal Halve for Doubling Filter

I've touched on this topic several times in the past. I'm going to wrap up a loose end.

Say you have some given linear doubling filter (linear in the operator sense, not that it's a line). You wish to halve your image in the best way such that the round trip has minimum error.

For a given discrete doubling filter (non-interpolating) find the optimal halving filter that minimizes L2 error. I did it numerically, not analytically, and measured the actual error of down->up vs. original on a large test set.

I generated halving filters for half-widths of 3, 4, and 5. Larger filters always produce lower error, but also more ringing, so you may not want the largest-width halving filter.


upfilter :  linear  :
const float c_filter[4] = { 0.12500, 0.37500, 0.37500, 0.12500 };

 downFilter : 
const float c_filter[6] = { -0.15431, 0.00162, 0.65269, 0.65269, 0.00162, -0.15431 };
fit err = 17549.328

 downFilter : 
const float c_filter[8] = { 0.05429, -0.21038, -0.01115, 0.66724, 0.66724, -0.01115, -0.21038, 0.05429 };
fit err = 17238.310

 downFilter : 
const float c_filter[10] = { 0.05159, 0.00138, -0.21656, -0.00044, 0.66402, 0.66402, -0.00044, -0.21656, 0.00138, 0.05159 };
fit err = 16959.596

upfilter :  mitchell1  :
const float c_filter[8] = { -0.00738, -0.01172, 0.12804, 0.39106, 0.39106, 0.12804, -0.01172, -0.00738 };

 downFilter : 
const float c_filter[6] = { -0.13475, 0.02119, 0.61356, 0.61356, 0.02119, -0.13475 };
fit err = 17496.548

 downFilter : 
const float c_filter[8] = { 0.05595, -0.19268, 0.00985, 0.62688, 0.62688, 0.00985, -0.19268, 0.05595 };
fit err = 17131.069

 downFilter : 
const float c_filter[10] = { 0.05239, 0.00209, -0.19664, 0.01838, 0.62379, 0.62379, 0.01838, -0.19664, 0.00209, 0.05239 };
fit err = 16811.168

upfilter :  lanczos4  :
const float c_filter[8] = { -0.00886, -0.04194, 0.11650, 0.43430, 0.43430, 0.11650, -0.04194, -0.00886 };

 downFilter : 
const float c_filter[6] = { -0.09637, 0.05186, 0.54451, 0.54451, 0.05186, -0.09637 };
fit err = 17332.452

 downFilter : 
const float c_filter[8] = { 0.04290, -0.14122, 0.04980, 0.54852, 0.54852, 0.04980, -0.14122, 0.04290 };
fit err = 17054.006

 downFilter : 
const float c_filter[10] = { 0.03596, 0.00584, -0.13995, 0.05130, 0.54685, 0.54685, 0.05130, -0.13995, 0.00584, 0.03596 };
fit err = 16863.054

upfilter :  lanczos5  :
const float c_filter[10] = { 0.00551, -0.02384, -0.05777, 0.12982, 0.44628, 0.44628, 0.12982, -0.05777, -0.02384, 0.00551 };

 downFilter : 
const float c_filter[6] = { -0.08614, 0.07057, 0.51557, 0.51557, 0.07057, -0.08614 };
fit err = 17323.692

 downFilter : 
const float c_filter[8] = { 0.05112, -0.13959, 0.06782, 0.52065, 0.52065, 0.06782, -0.13959, 0.05112 };
fit err = 16899.712

 downFilter : 
const float c_filter[10] = { 0.04554, 0.00403, -0.13655, 0.06840, 0.51857, 0.51857, 0.06840, -0.13655, 0.00403, 0.04554 };
fit err = 16566.352

------------------------------

6/14/2011

06-14-11 - ProcessSuicide

The god damn lagarith DLL has some crash in its shutdown, so any time I play an AVI with an app that uses lagarith, it hangs on exit.

(this is one of the reasons that I need to write my own lossless video format; another reason is that lagarith can't play back at 30 fps even on ridiculously fast modern machines; and the other standard, HuffYUV, frequently crashes for me and is very hard to make support RGB correctly)

Anyhoo, I started using this to shut down my app, which doesn't have the stupid "wait forever for hung DLL's to unload" problem :


void ProcessSuicide()
{
    DWORD myPID = GetCurrentProcessId();

    lprintf("ProcessSuicide PID : %d\n",myPID);

    HANDLE hProcess = OpenProcess (PROCESS_ALL_ACCESS, FALSE, myPID); 
        
    if ( hProcess == NULL ) // note : OpenProcess returns NULL on failure, not INVALID_HANDLE_VALUE
    {
        lprintf("Couldn't open my own process!\n");
        // ?? should probably do something else here, but never happens
        return;
    }
        
    TerminateProcess(hProcess,0);
    
    // ... ?? ... should not get here
    ASSERT(false);
    
    CloseHandle (hProcess);
}

At first I thought this was a horrible hack, but I've been using it for months now and it doesn't cause any problems, so I'm sort of tempted to call it not a hack, but rather just a nice way to quit your app on Windows without ever getting that stupid thing where an app hangs in shutdown (which is a common problem for big apps like MSDev and Firefox).
