
06-01-12 - On C++ Atomic Fences Part 3

Finally, a small note of confusion. There are cases where I think "what I need is a #StoreLoad", but a seq_cst fence doesn't work, while changing the load to an RMW does. Let's look into one of those cases a bit.

I will use an example that I used before, a very simple "futex" (not really) waitset-based exchange-mutex. I'm gonna use the exact same code here as I did there, but I'm putting the futex_system ops into the mutex to make it all a bit leaner :


struct futex_mutex3
{
    std::atomic<int> m_state; // mutex locked flag

    // waitset :
    HANDLE          m_handle; // actual wait event
    atomic<int>     m_count;  // waiter count

    /*************/
                
    futex_mutex3() : m_state(0), m_count(0)
    {
        m_handle = CreateEvent(NULL,0,0,NULL);
    }
    ~futex_mutex3()
    {
        CloseHandle(m_handle);  
    }

    void lock()
    {
        // try to grab the mutex by exchanging in a 1
        //  if it returns 0, we got the lock
        if ( m_state($).exchange(1,rl::mo_acquire) )
        {
            // prepare_wait :
            // add one to waiter count
            m_count($).fetch_add(1,mo_acquire);
        
            // (*1)
        
            // double check :
            while ( m_state($).exchange(1,rl::mo_acquire) )
            {
                // wait :
                WaitForSingleObject(m_handle, INFINITE);
            }
            
            // retire_wait :
            m_count($).fetch_add(-1,mo_relaxed);
        }
    }

    void unlock()
    {
        m_state($).store(0,rl::mo_release);

        //notify_one :

        // need #StoreLoad before loading m_count

        // (*2)

        int c = m_count($).load(mo_acquire);        
            
        if ( c > 0 )
        {       
            SetEvent(m_handle);
        }
    }
};

The code as written does not work. The problem is the interaction of spots *1 and *2.

(mutual exclusion is guaranteed by our actions on m_state, which also provide the necessary acquire & release to make the mutex valid. So when I say it "doesn't work", I mean the waitset interaction with the mutex doesn't work, eg. we deadlock by failing to wake a waiter; it's a "missed wakeup" problem).

(irrelevant aside : if you want this to be a real mutex implementation, then the lock() operations on m_state should probably be acq_rel to prevent mutex overlap deadlocks; see this blog post among others; but the purpose of this post is not to make a real mutex, it's to demonstrate an issue, so let's not get bogged down)

In brief, the problem is that the unlocker at *2 can load a waiter count of 0 (and thus not signal), even though the waiter has passed point *1 (and thus count should not be zero).

The bad execution goes like this :


1. thread 0 holds the lock on the mutex

thread 1 calls lock()

2. thread 1 tries to lock the mutex, sees m_state=1 and goes into prepare_wait
3. thread 1 does m_count ++
4. thread 1 tries the exchange again, sees m_state=1 and goes into the wait

thread 0 calls unlock()

5. thread 0 stores a 0 to m_state
6. thread 0 loads m_count and gets 0 (out of date value)

Now, you might think the problem is that the load can act like it hoists above the store. That is, we know the store happens after the exchange (#4), because the exchange didn't see a zero. Therefore #3 (the inc to count) must already have happened. But the load at #6 is seeing the value from before #3; sure, that's allowed, the load has no ordering constraint that stops it from moving back in time.

So clearly we want a #StoreLoad between 5 and 6 to prevent that load from backing up. You cannot express that in C++0x; that's what I meant in this original blog post when I said that the C++0x seq_cst fence is not really a #StoreLoad, and that there's no way to directly express this kind of StoreLoad in C++0x. Specifically, just adding a seq_cst fence where you want a StoreLoad does not work :


    void unlock()
    {
        m_state($).store(0,rl::mo_release);

        //notify_one :

        // need #StoreLoad before loading m_count

        std::atomic_thread_fence(mo_seq_cst);

        int c = m_count($).load(mo_acquire);        
    
        if ( c > 0 )
        {       
            SetEvent(m_handle);
        }
    }

The problem, as I understand it, is that a single fence in C++0x doesn't really do what you want. I kind of got at this in Part 2 as well: you can't just use a fence as a way to publish and have relaxed loads receive it. You need another fence in the receiving thread, so that the fences can "synchronize with" each other. Also, if you go back to Part 1 and look at most of the rules about fences, they only provide ordering if they have the right kind of object to connect through; you need something to carry a transitive relationship.

Note : I believe that on every real CPU, putting a MemoryBarrier() there where you want the #StoreLoad would make this code work. This example is actually very similar to the Peterson lock we saw in part 2.

Boiled down, the problem is like this :


A & B initially zero

thread 0 :
    A = 1
    #StoreLoad
    load B

thread 1 :
    B = 1
    #StoreLoad
    load A

It should not be possible for both threads to load 0; one or both threads should see a 1. Now if you make both #StoreLoads into atomic_thread_fence(seq_cst) then it works - but not because the fence is a #StoreLoad. It works because the two seq_cst fences must have a definite order against each other in the total order S, and that provides a reference point for all the other ops to be "happens before/after" each other.

To be very clear, this code works :


    thread 0:

        A($).store(1,mo_relaxed);
        std::atomic_thread_fence(mo_seq_cst,$);
        r1($) = B($).load(mo_relaxed);

    thread 1:

        B($).store(1,mo_relaxed);
        std::atomic_thread_fence(mo_seq_cst,$);
        r2($) = A($).load(mo_relaxed);

    after :

        r1+r2 == 1 or 2 (never 0)

(BTW this is actually the classic WFMO case in semi-disguise; you're waiting on the two conditions A and B to become true; if two separate threads set the conditions, then at least one should see the joint A && B be true, but that only works with the appropriate #StoreLoad; see this blog post )
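The same two-fence pattern can be run for real with plain std::atomic. Here's a self-contained C++11 sketch (the dekker_trial name is mine) that does one trial on two actual threads :

```cpp
#include <atomic>
#include <thread>
#include <utility>

// One trial of the two-thread store / seq_cst fence / load pattern.
// Returns (r1, r2) ; the fences guarantee r1 + r2 >= 1.
std::pair<int, int> dekker_trial()
{
    std::atomic<int> A(0), B(0);
    int r1 = 0, r2 = 0;

    std::thread t0([&] {
        A.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r1 = B.load(std::memory_order_relaxed);
    });
    std::thread t1([&] {
        B.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r2 = A.load(std::memory_order_relaxed);
    });

    t0.join();
    t1.join();
    return std::make_pair(r1, r2);
}
```

Per the C++11 fence rules, the two seq_cst fences have a definite order in S, so whichever thread's fence comes second must see the other thread's store; r1 + r2 == 0 is impossible.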

But what if we remove the need for one of the threads to have a seq_cst fence :


    thread 0:

        A($).store(1,mo_relaxed);
        std::atomic_thread_fence(mo_seq_cst,$); // #StoreLoad ?
        r1($) = B($).load(mo_relaxed);

    thread 1:

        B($).store(1,mo_relaxed);
        // load A will be after store B via StoreStore release ordering
        r2($) = A($).fetch_add(0,mo_acq_rel);

    after :

        r1+r2 == 1 or 2 (never 0)

Here thread 1 uses an RMW to ensure StoreLoad ordering, so we get rid of the fence. This code no longer works. Now the B.load in thread 0 can hoist above the B.store in thread 1. The reason is that the seq_cst fence in thread 0 is not acting as a StoreLoad any more because it has nothing to synchronize against in the other thread.

Take away : atomic_thread_fence(mo_seq_cst) does NOT really act as a #StoreLoad. It can be used in spots where you need a #StoreLoad, but only in the right situation, eg. if it has the right other stuff to synchronize with.
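For contrast, one thing C++0x does let you say directly: if you make the conflicting operations themselves mo_seq_cst, they all participate in the single total order S and you get the Dekker-style guarantee with no standalone fences at all. A minimal std::atomic sketch (the seq_cst_trial name is mine) :

```cpp
#include <atomic>
#include <thread>
#include <utility>

// Same pattern, but the operations themselves are seq_cst, so no
// standalone fences are needed ; r1 + r2 >= 1 is guaranteed by the
// single total order over all seq_cst operations.
std::pair<int, int> seq_cst_trial()
{
    std::atomic<int> A(0), B(0);
    int r1 = 0, r2 = 0;

    std::thread t0([&] {
        A.store(1, std::memory_order_seq_cst);
        r1 = B.load(std::memory_order_seq_cst);
    });
    std::thread t1([&] {
        B.store(1, std::memory_order_seq_cst);
        r2 = A.load(std::memory_order_seq_cst);
    });

    t0.join();
    t1.join();
    return std::make_pair(r1, r2);
}
```

This is the safe fallback when you can't set up matching fences on both sides: pay for seq_cst on the operations themselves.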

So, getting back to our futex_mutex example, you can use a seq_cst fence at (*2) to act as your StoreLoad, but only if you also have one at (*1) for it to synchronize with :


struct futex_mutex3
{
    std::atomic<int> m_state; // mutex locked flag

    // waitset :
    HANDLE          m_handle; // actual wait event
    atomic<int>     m_count;  // waiter count

    /*************/
                
    futex_mutex3() : m_state(0), m_count(0)
    {
        m_handle = CreateEvent(NULL,0,0,NULL);
    }
    ~futex_mutex3()
    {
        CloseHandle(m_handle);  
    }

    void lock()
    {
        // try to grab the mutex by exchanging in a 1
        //  if it returns 0, we got the lock
        if ( m_state($).exchange(1,rl::mo_acquire) )
        {
            // prepare_wait :
            // add one to waiter count
            m_count($).fetch_add(1,mo_relaxed);
        
            // (*1)
            std::atomic_thread_fence(mo_seq_cst,$);
        
            // double check :
            while ( m_state($).exchange(1,rl::mo_acquire) )
            {
                // wait :
                WaitForSingleObject(m_handle, INFINITE);
            }
            
            // retire_wait :
            m_count($).fetch_add(-1,mo_relaxed);
        }
    }

    void unlock()
    {
        m_state($).store(0,rl::mo_release);

        //notify_one :

        // (*2)
        // need #StoreLoad before loading m_count

        std::atomic_thread_fence(mo_seq_cst,$);

        int c = m_count($).load(mo_acquire);        
            
        if ( c > 0 )
        {       
            SetEvent(m_handle);
        }
    }
};

That is, the two fences let you set up a transitive relationship : the m_count.load in unlock() is definitely after the fence in unlock, the fence in unlock is after the fence in lock, and the fence in lock is after the m_count.fetch_add ; therefore the m_count.load must see count > 0.

Or, alternatively, if you leave the fence out of lock(), you can fix it just in unlock by changing either the store or the load into an RMW :


variant "apple" :

    void unlock()
    {
        // release is for the mutex innards
        m_state($).exchange(0,rl::mo_acq_rel);

        // #StoreLoad is achieved as a #LoadLoad on m_state

        int c = m_count($).load(mo_relaxed);        
            
        if ( c > 0 )
        {       
            SetEvent(m_handle);
        }
    }

variant "banana" :

    void unlock()
    {
        // release is for the mutex innards
        m_state($).store(0,rl::mo_release);

        // #StoreLoad is achieved as a #StoreStore on m_count :

        int c = m_count($).fetch_add(0,mo_release);  //(*3)
            
        if ( c > 0 )
        {
            SetEvent(m_handle);
        }
    }

(variant "apple" only works if the double check on m_state in lock is acq_rel).

(variant "banana" appears to work if *3 is only mo_relaxed, which is a bit of a mystery! We'll leave that as an exercise for the reader). (update : it doesn't actually work, see comments).

7 comments:

  1. I think banana with a relaxed fetch_add is actually wrong. I
    believe that relacy is just missing the bug because of the way it
    works.

    You are essentially interested in the execution in which the lock
    does the exchange, then the unlock is recorded to do the
    fetch_add and read 0, then lock does its add, fence, double
    check, gets stuck waiting, and then finally unlock does its store
    release.

    The problem is that relacy executes the store statements in
    program order (and later adds the additional behaviors), and so
    when it gets to the relaxed RMW in the unlock operation relacy
    can't make the fetch_add return a count of zero, because that
    could potentially change the value returned by the already
    executed fetch_add in the execution trace (the lock's
    fetch_add)... You need to use backtracking-based search in your
    model checker to make this work.

    See this excerpt from an email from Dmitry for a related issue:

    >For the simplest example, suppose x=y=0 initially with the following code:
    >T1:
    >r1=x.load();
    >y.store(1, relaxed);
    >
    >T2:
    >r2=y.load();
    >x.store(1,relaxed);
    >
    >Can it simulate the execution in which r1=r2=1 at
    >the end?

    It is one of few situations that Relacy can't model. The problem
    is that (1) Relacy is based on dynamic verification (it actually
    executes the program and can observe only what has actually
    happened) and (2) it executes the program in a sequentially
    consistent way and applies possible relaxations afterwards. For
    the example you provided it means that when the first load is
    executed (either r1 or r2) no stores have happened yet, but Relacy
    must already return some value and the only value it can return
    is 0. However it definitely must model r1==1||r2==1. I was
    thinking about how it's possible to model such a situation, but
    did not find any realistic solution.

    ReplyDelete
  2. BTW. We're working on a tool like Relacy that uses backtracking based search to handle these cases plus partial order reduction to optimize the search.

    ReplyDelete
  3. Hmm, without anything at (*1), what rule prevents the compiler from reordering m_count.fetch_add() past m_state.exchange()?

    ReplyDelete
  4. @Brian - thanks for the comment. Yeah that makes sense and clears up that false positive from Relacy.

    I guess any time you do a relaxed store or RMW you have to be very careful about trusting Relacy. And it's a good reminder to me that Relacy can only give definite negatives, not definite positives.

    I'm looking forward to your tool! It is so hard to reason about this stuff that everything is suspect until we have rigorous simulators.

    ReplyDelete
  5. @Preshing - yeah I definitely think you are right. I'll change the post.

    I actually had all the ops on m_count as acq_rels in the original version of this example (http://cbloomrants.blogspot.com/2011/07/07-31-11-example-that-needs-seqcst_31.html) and then kept backing off the constraints to try to find the minimal set and went too far.

    (I believe the decrement on m_count can be relaxed, but the increment cannot).

    ReplyDelete
  6. You said:

    "a single fence in C++0x doesn't really do what you want. (...) you can't just use a fence as a way to publish and have relaxed loads receive it. You need another fence in the receiving thread, so that the fences can "synchronize with" each other."

    I don't think this is a C++0x limitation, I think this is a CPU/compiler/logic reality. In any lockless programming there should be barriers on both the reader and the writer. Looking for these pairs helps to identify the need for less obvious barriers.

    ReplyDelete
  7. No Bruce, that's exactly the point: the naive reader expects a single fence to be enough, and that is why I wrote this extensive series to clarify the subtle issue. Perhaps the previous two posts in the series are more clear about it.

    BTW I don't encourage anyone to spend too much time thinking about these subtle corner cases unless you are obsessive like me. Most people will do better to just stick to the C++0x "rules" and make sure you have good transitive "happens before" relationships.

    ReplyDelete