07-25-11 - Semaphore from CondVar

Semaphore from CondVar is quite trivial :

// (VAR_T / VAR / RL_ASSERT are Relacy race-detector wrappers ;
//  read VAR_T(int) as "int" and VAR(x) as "x" in a normal build)
struct semaphore_from_condvar
{
private:
    VAR_T(int) m_count;
    t_cond_var m_cv;
public:
    semaphore_from_condvar(int initialCount) : m_count(initialCount)
    {
    }
    
    void post() // "increment"
    {
        t_cond_var::guard g;
        m_cv.lock(g);
        VAR(m_count) ++;
        m_cv.signal_unlock(g);
    }
    
    void post(int count)
    {
        t_cond_var::guard g;
        m_cv.lock(g);
        VAR(m_count) += count;
        m_cv.broadcast_unlock(g);
    }   
    
    void wait() // "decrement"
    {
        t_cond_var::guard g;
        m_cv.lock(g);
        while( VAR(m_count) <= 0 )
        {
            m_cv.unlock_wait_lock(g);
        }
        VAR(m_count)--;
        m_cv.unlock(g);
    }
};

the only thing that's less than ideal is when you have lots of waiters and try to wake N of them. Ideally you would wake exactly N threads. Here we have to wake them all; they will all try to decrement the count, N will pass through, and the rest will go back to sleep - all with heavy contention on the CV lock.
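
One partial mitigation is to signal count times instead of broadcasting. Here's a hedged sketch in plain C++11 (using std::condition_variable rather than t_cond_var, since I'm only showing part of that API ; "semaphore_from_std_condvar" is an invented name). notify_one() on a condvar with no waiters is simply lost, so this wakes at most count threads - though still not exactly MIN(count, # waiters) :

#include <condition_variable>
#include <mutex>

struct semaphore_from_std_condvar
{
private:
    int m_count;
    std::mutex m_mutex;
    std::condition_variable m_cv;
public:
    explicit semaphore_from_std_condvar(int initialCount) : m_count(initialCount)
    {
    }

    void post(int count)
    {
        {
            std::lock_guard<std::mutex> g(m_mutex);
            m_count += count;
        }
        // wake at most count waiters ; extra notifies are lost, not queued
        for(int i=0;i<count;i++)
            m_cv.notify_one();
    }

    void wait()
    {
        std::unique_lock<std::mutex> g(m_mutex);
        while( m_count <= 0 )
            m_cv.wait(g);
        m_count--;
    }
};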

I noted once that a Windows auto-reset "Event" acts just like a semaphore with a max count of 1. We can see that very explicitly if we implement an event using a condvar :


struct event_from_condvar
{
private:
    VAR_T(int) m_count;
    t_cond_var m_cv;
public:
    event_from_condvar() : m_count(0) // must start un-signalled
    {
    }
    
    void signal()
    {
        t_cond_var::guard g;
        m_cv.lock(g);
        VAR(m_count) = 1;
        m_cv.signal_unlock(g);
    }
        
    void wait()
    {
        t_cond_var::guard g;
        m_cv.lock(g);
        while( VAR(m_count) == 0 )
        {
            m_cv.unlock_wait_lock(g);
        }
        RL_ASSERT( VAR(m_count) == 1 );
        VAR(m_count) = 0;
        m_cv.unlock(g);
    }
};

which is a very silly way to implement an event, but may be useful if you are limited to C++0x only.

(the lack of low-level wake/wait primitives in C++0x is a bit annoying; perhaps one solution is to use condition_variable_any - the templated variant of CV - and give it a mutex that is just a NOP in lock/unlock; that would let you use the notify() and wait() mechanisms of CV without all the mutex luggage that you don't need. But it remains to be seen what actual implementations do).
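
For concreteness, here is a hedged sketch of that NOP-mutex idea in C++11 terms ("null_mutex" and "lightweight_event" are invented names, not standard types ; condition_variable_any typically carries its own internal mutex, so a no-op external lock is safe at the object level). The big caveat : with a no-op mutex a notify can be lost between checking the flag and entering the wait, so this sketch uses a timed wait to bound that race instead of pretending it isn't there :

#include <atomic>
#include <chrono>
#include <condition_variable>

struct null_mutex
{
    void lock() { }
    bool try_lock() { return true; }
    void unlock() { }
};

struct lightweight_event
{
    std::atomic<int> m_signalled;
    null_mutex m_mutex;
    std::condition_variable_any m_cv;

    lightweight_event() : m_signalled(0) { }

    void signal()
    {
        m_signalled.store(1, std::memory_order_release);
        m_cv.notify_one();
    }

    void wait()
    {
        // auto-reset : atomically consume the signal
        while( m_signalled.exchange(0, std::memory_order_acquire) == 0 )
        {
            // a notify between the exchange above and this wait is lost ;
            // the timeout turns that into a bounded stall, not a hang :
            m_cv.wait_for(m_mutex, std::chrono::milliseconds(10));
        }
    }
};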

More on events in the next post.

13 comments:

Drawknob said...

So how can semaphores be implemented efficiently on Windows?
There's a fast mutex implementation using keyedevents at http://locklessinc.com/articles/keyed_events/ but I'm not sure it would be as simple as using m->owned as a counter. There's also a similar question (with no answers) at http://stackoverflow.com/questions/8408217/fast-counting-semaphore-on-windows

Drawknob said...

Hi, thanks for the reply.
Looking at the most recent glibc sem_wait() and sem_post(), it seems those would be faster than condvar-based versions (each is a single futex call with no mutex used, and they already do a userspace check to see whether the futex call is needed). I'll benchmark to check, but I'm thinking it might be worth attempting a keyed events version. The hard part would be making sure there are no races.

http://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/sysdeps/unix/sysv/linux/sem_wait.c

http://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/sysdeps/unix/sysv/linux/sem_post.c

cbloom said...

I deleted my reply cuz it was wrong.

Futex is sort of perfect for doing semaphores because it has the double-check waitset logic built in.

Keyed event is a nice way to do a lightweight waitable object, but doesn't have the extra logic of futex.
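
To make the "double-check" point concrete, here's a minimal Linux-only semaphore sketch (it assumes std::atomic<int> shares the layout of a plain int, which holds in practice but is not guaranteed by the standard). FUTEX_WAIT only puts the thread to sleep if the value is still 0 when the kernel checks it under its own lock, which is exactly the waitset logic you'd otherwise have to build yourself :

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <atomic>

static long futex_call(std::atomic<int> * uaddr, int op, int val)
{
    // raw futex syscall ; glibc provides no wrapper
    return syscall(SYS_futex, reinterpret_cast<int *>(uaddr), op, val, nullptr, nullptr, 0);
}

struct futex_semaphore
{
    std::atomic<int> m_count;

    explicit futex_semaphore(int initialCount) : m_count(initialCount) { }

    void wait()
    {
        for(;;)
        {
            int c = m_count.load(std::memory_order_relaxed);
            while( c > 0 )
            {
                if ( m_count.compare_exchange_weak(c,c-1,std::memory_order_acquire) )
                    return; // got a count on the fast path, no kernel call
            }
            // count was 0 ; sleep, but the kernel re-checks that m_count
            // is still 0 before sleeping, so a racing post can't be missed :
            futex_call(&m_count, FUTEX_WAIT, 0);
        }
    }

    void post()
    {
        m_count.fetch_add(1, std::memory_order_release);
        // a real implementation would track waiters and skip this
        // syscall when nobody is asleep :
        futex_call(&m_count, FUTEX_WAKE, 1);
    }
};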

cbloom said...

The thing is, as long as you use something like "fastsemaphore" as a wrapper, the slow underlying semaphore only gets used when threads are waking or sleeping, so timing that is sort of pointless.

It is true that the Win32 "Semaphore" is cross-process, and a single-process-only variant can be a lot faster. But either way you're going into the kernel.

Also the true Semaphore provides a "Wake N" in one kernel call; I'm not aware of any other way to do a Wake N on Win32 (just Wake 1, N times).
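
For reference, a minimal sketch of the "fastsemaphore" wrapper idea : "base_semaphore" is an assumed type with post()/wait() (eg. a thin wrapper on the Win32 Semaphore), created with an initial count of 0. The atomic count goes negative to record waiters, so the slow kernel semaphore is only touched when a thread actually has to sleep or be woken :

#include <atomic>

template <typename base_semaphore>
class fastsemaphore
{
    std::atomic<int> m_count;
    base_semaphore m_sem; // assumed slow kernel semaphore, initial count 0

public:
    explicit fastsemaphore(int initialCount = 0) : m_count(initialCount) { }

    void post()
    {
        // if the count was negative there is a waiter blocked (or about
        // to block) on the kernel sem ; pass one wake through to it :
        if ( m_count.fetch_add(1, std::memory_order_release) < 0 )
            m_sem.post();
    }

    void wait()
    {
        // fast path : count was positive, we took one without the kernel
        if ( m_count.fetch_add(-1, std::memory_order_acquire) < 1 )
            m_sem.wait(); // slow path : block on the kernel sem
    }
};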

Drawknob said...

Why wake N? Even POSIX semaphores only wake one.

cbloom said...

Looping to wake N by doing wake 1 N times can be *massively* slower than a true kernel wake N. For one thing it's N kernel calls. But worst of all, you could lose execution on any of those wakes.

I'm not sure what you mean by POSIX sems being wake 1. POSIX sem_post provides wake N at the API level; the fact that the Linux implementation using futex does a loop of Wake 1's is just a shitty thing about that implementation.

I don't think it's a good idea to pay too much attention to the Linux implementation of threading primitives. They are known to be a not very good POSIX implementation.

Futex is awesome because it provides some good machinery to the user mode, but if you were actually writing an OS you would of course implement semaphore in the kernel rather than using a call to futex that doesn't really do what you want. I'm not sure what they're thinking with NPTL.

Drawknob said...

> I'm not sure what you mean by POSIX sem's being wake 1. Posix sem_post provides wake N at the api level

Huh?

http://www.csc.villanova.edu/~mdamian/threads/posixsem.html#post
or for example
http://docs.oracle.com/cd/E19963-01/html/821-1601/sync-11157.html

Nothing to do with Linux/NPTL.

cbloom said...

Well, I am corrected. That's retarded, BTW.

My comments about NPTL were in regards to the fact that they seem to be trying to build threading in user-space on top of futex rather than just doing it in the kernel.

Also apparently Boost.Threads and C++0x both don't provide semaphore. That's also retarded.

Drawknob said...

Why is it retarded? If there are multiple waiters, if several are woken, all but one will block again right away--unless you increment the semaphore value by the number you're trying to wake. However, semaphore semantics have always been that of incrementing or decrementing the counter by one.

There's an excellent reason for building synchronization in user-space--having a fast path that avoids a kernel call as much as possible. In both Windows and Linux, switching to kernel mode costs on the order of a thousand CPU cycles, whereas atomic instructions are 10x faster.

cbloom said...

"Why is it retarded? If there are multiple waiters, if several are woken, all but one will block again right away--unless you increment the semaphore value by the number you're trying to wake."

The correct wake is MIN(count,# waiters).

eg. if I have 4 waiters and I post 3 to my sem it should wake 3.

A correctly designed system should never wake up threads just to have them go back to sleep again.

"However, semaphore semantics have always been that of incrementing or decrementing the counter by one."

The vast majority of OSes actually provide sem inc by N.

"There's an excellent reason for building synchronization in user-space--having a fast path that avoids a kernel call as much as possible. In both Windows and Linux, switching to kernel mode costs on the order of a thousand CPU cycles, whereas atomic instructions are 10x faster."

That's a terrible reason. I can always put a fast path in user-mode using atomic instructions on top of a kernel primitive.

There are a lot of things in NPTL that are absurdly complex and delicate (such as wait-morphing from condvar to mutex, respecting priorities correctly, etc) which would be totally trivial if they were in the kernel.

Basically any time you are doing thread sleeping or waking you should be in the kernel, because you have to go to the kernel for that anyway. If you are on a fast-path that doesn't involve wake/sleep then user mode is great, but it's always easy to add that fast path on the outside.

cbloom said...

I should also re-iterate at this point that micro-profiling the performance of threading primitives is generally at best pointless and at worst quite harmful.

(see previous blogs on this rant topic)

Drawknob said...

Ah, I see now you were referring to the logic in cases where a thread is going to (likely) wait anyway.

Well, you can always add a new kernel module. I think one of the reasons for the current design of futex is that it was only added as a way to improve the performance of the POSIX subset implemented by glibc on Linux. It's also a better design than keyed events due to the check of the condition after the thread is put on the wait queue. The keyed event has to have the release function block to prevent missed wakes, and a blocking wake is just not good (the benefit of keyed events over a futex is that they're alertable, so you can safely cancel a thread even if blocked in the kernel by queueing an APC and throwing an exception that unwinds the stack).

It would be nice if there were a more convenient way for a user to add such lower-level functionality. The libOS in an exokernel approach looks ever more attractive.

I agree about using realistic benchmarks.

So what's your recommendation for the best semaphore implementation to use for Windows, and for Linux?

cbloom said...

"So what's your recommendation for the best semaphore implementation to use for Windows, and for Linux?"

I think as long as you use a "fastsemaphore" wrapper it basically doesn't matter what the base semaphore is.
