5/08/2013

05-08-13 - A Lock Free Weak Reference Table

It's very easy (almost trivial (*)) to make the table-based {index/guid} style of weak reference lock free.

(* = obviously not trivial if you're trying to minimize the memory ordering constraints, as evidenced by the revisions to this post that were required; it is trivial if you just make everything seq_cst)

Previous writings on this topic :

Smart & Weak Pointers - valuable tools for games - 03-27-04
cbloom rants 03-22-08 - 6
cbloom rants 07-05-10 - Counterpoint 2
cbloom rants 08-01-11 - A game threading model
cbloom rants 03-05-12 - Oodle Handle Table

The primary ops conceptually are :


Add object to table; gives it a WeakRef id

WeakRef -> OwningRef  (might be null)

OwningRef -> naked pointer

OwningRef construct/destruct = ref count inc/dec

The full code is in here : cbliblf.zip , but you can get a taste for how it works from the ref count maintenance code :


    // IncRef looks up the weak reference; returns null if lost
    //   (this is the only way to resolve a weak reference)
    Referable * IncRef( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];

        handle_type guid = handle_get_guid(h);

        // this is just an atomic inc of state
        //  but checking guid each time to check that we haven't lost this slot
        handle_type state = s->m_state.load(mo_acquire);
        for(;;)
        {
            if ( state_get_guid(state) != guid )
                return NULL;
            // assert refcount isn't hitting max
            LF_OS_ASSERT( state_get_refcount(state) < state_max_refcount );
            handle_type incstate = state+1;
            if ( s->m_state.compare_exchange_weak(state,incstate,mo_acq_rel,mo_acquire) )
            {
                // did the ref inc
                return s->m_ptr;
            }
            // state was reloaded, loop
        }
    }

    // IncRefRelaxed can be used when you know a ref is held
    //  so there's no chance of the object being gone
    void IncRefRelaxed( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];
        
        handle_type state_prev = s->m_state.fetch_add(1,mo_relaxed);
        (void) state_prev; // only referenced by the asserts below; avoids an unused warning in release builds
        // make sure we were used correctly :
        LF_OS_ASSERT( handle_get_guid(h) == state_get_guid(state_prev) );
        LF_OS_ASSERT( state_get_refcount(state_prev) >= 0 );
        LF_OS_ASSERT( state_get_refcount(state_prev) < state_max_refcount );
    }

    // DecRef
    void DecRef( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];
        
        // no need to check guid because I must own a ref
        handle_type state_prev = s->m_state.fetch_add((handle_type)-1,mo_release);
        LF_OS_ASSERT( handle_get_guid(h) == state_get_guid(state_prev) );
        LF_OS_ASSERT( state_get_refcount(state_prev) >= 1 );
        if ( state_get_refcount(state_prev) == 1 )
        {
            // I took refcount to 0
            //  slot is not actually freed yet; someone else could IncRef right now
            //  the slot becomes inaccessible to weak refs when I inc guid :
            // try to inc guid with refcount at 0 :
            handle_type old_guid = handle_get_guid(h);
            handle_type old_state = make_state(old_guid,0); // == state_prev-1
            handle_type new_state = make_state(old_guid+1,0); // == old_state + (1<<handle_guid_shift)
  
            if ( s->m_state.compare_exchange_strong(old_state,new_state,mo_acq_rel,mo_relaxed) )
            {
                // I released the slot
                // cmpx provides the acquire barrier for the free :
                FreeSlot(s);
                return;
            }
            // somebody else mucked with me
        }
    }

The maintenance of ref counts only requires relaxed atomic increment & release atomic decrement (except when the pointed-at object is initially made and finally destroyed, then some more work is required). Even just the relaxed atomic incs could get expensive if you did a ton of them, but my philosophy for how to use this kind of system is that you inc & dec refs as rarely as possible. The key thing is that you don't write functions that take owning refs as arguments, like :


void bad_function( OwningRefT<Thingy> sptr )
{
    more_bad_funcs(sptr);
}

void Stuff::bad_caller()
{
    OwningRefT<Thingy> sptr( m_weakRef );
    if ( sptr != NULL )
    {
        bad_function(sptr);
    }
}

hence doing lots of inc & decs on refs all over the code. Instead you write all your code with naked pointers, and only use the smart pointers where they are needed to ensure ownership for the lifetime of usage. eg. :

void good_function( Thingy * ptr )
{
    more_good_funcs(ptr);
}

void Stuff::good_caller()
{
    OwningRefT<Thingy> sptr( m_weakRef );
    Thingy * ptr = sptr.GetPtr();
    if ( ptr != NULL )
    {
        good_function(ptr);
    }
}

If you like formal rules, they're something like this :


1. All stored variables are either OwningRef or WeakRef , depending on whether it's
an "I own this" or "I see this" relationship.  Never store a naked pointer.

2. All variables in function call args are naked pointers, as are variables on the
stack and temp work variables, when possible.

3. WeakRef to pointer resolution is only provided as WeakRef -> OwningRef.  Naked pointers
are only retrieved from OwningRefs.

And obviously there are lots of enhancements to the system that are possible. A major one that I recommend is to put more information in the reference table state word. If you use a 32-bit weak reference handle and a 64-bit state word, then you have 32 bits of extra space that you can check for free with the weak reference resolution. You could put some mutex bits in there (or an rwlock) so that the state contains the lock for the object, but I'm not sure that is a big win (the only advantage of having the lock built into the state is that you could atomically take the lock and inc the refcount in a single op). A better use is to put some object information in there that can be retrieved without chasing the pointer and inc'ing the ref and so on.

For example in Oodle I store the status of the object in the state table. (Oodle status is a progression through Invalid->Pending->Done/Error). That way I can take a weak ref and query status in one atomic load. I also store some lock bits, and you aren't allowed to get back naked pointers unless you have a lock on them.

The code for the weak ref table is now in the cbliblf.zip that I made for the last post. Download : cbliblf.zip

( The old cblib has a non-LF weak reference table that's similar for comparison. It's also more developed with helpers and fancier templates and such that could be ported to this version. Download : cblib.zip )

ADDENDUM : alternative DecRef that uses CAS instead of atomic decrement. Removes the two-atomic free path. Platforms that implement atomic add as a CAS loop should probably just use this form. Platforms that have true atomic add should use the previously posted version.


    // DecRef
    void DecRef( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];
        
        // no need to check guid because I must own a ref
        handle_type state_prev = s->m_state.load(mo_relaxed);
        
        handle_type old_guid  = handle_get_guid(h);

        for(;;)
        {
            // I haven't done my dec yet, guid must still match :
            LF_OS_ASSERT( state_get_guid(state_prev) == old_guid );
            // check current refcount :
            handle_type state_prev_rc = state_get_refcount(state_prev);
            LF_OS_ASSERT( state_prev_rc >= 1 );
            if ( state_prev_rc == 1 )
            {
                // I'm taking refcount to 0
                // also inc guid, which releases the slot :
                handle_type new_state = make_state(old_guid+1,0);

                if ( s->m_state.compare_exchange_weak(state_prev,new_state,mo_acq_rel,mo_relaxed) )
                {
                    // I released the slot
                    // cmpx provides the acquire barrier for the free :
                    FreeSlot(s);
                    return;
                }
            }
            else
            {
                // this is just a decrement
                // but have to do it as a CAS to ensure state_prev_rc doesn't change on us
                handle_type new_state = state_prev-1;
                LF_OS_ASSERT( new_state == make_state(old_guid,  state_prev_rc-1) );
                
                if ( s->m_state.compare_exchange_weak(state_prev,new_state,mo_release,mo_relaxed) )
                {
                    // I dec'ed a ref
                    return;
                }
            }
        }
    }

17 comments:

Anonymous said...

What is the purpose of the loop, i.e. what is the state==old_state case?

If the semantics of compare_exchange_weak is both to check whether the existing value is state, and if it is exchange, otherwise write the existing value to state, then by definition surely if the compare fails, state must != old_state.

cbloom said...

Well compare_exchange_weak can fail spuriously, so the loop actually is necessary there. But you're right, I should just change it to compare_exchange_strong and then not loop, that's simpler/clearer.

Anonymous said...

Ah, makes sense, thanks

Brian said...

Your library seems to be missing the code for IncRef/DecRef.

What is the usage scenario?

I'm a little worried that since the DecRef atomic decrement is relaxed, its effects could move above the thread's usage of the object, such that another thread sees the first thread's decrement, decrements the count to zero, and frees the object before the first thread is done using it.

cbloom said...

"Your library seems to be missing the code for IncRef/DecRef."

Uh, yup. Fixed.

"I'm a little worried that since the DecRef atomic decrement is relaxed, that its effects could move above the threads usage of the object such that another thread sees the first thread's decrement, decrements the count to zero, and frees the object before the first thread is done using it."

Yeah, you have a point. I'll have to think about that a little more.

cbloom said...

Yep, of course you're right - DecRef needs to be at least mo_release. That keeps it from moving up before ops on the pointer it protects. I think that's it (?)

That need is hidden if the object is internally protected by a mutex, or if the shared reference is protected by a mutex.

I really need to get CDSChecker running cuz Relacy is no good at modeling relaxed atomics.

Some threads I found on minimal memory ordering required for refcounts :

https://groups.google.com/forum/?fromgroups=#!topic/comp.programming.threads/Wne5asVbNfA

https://groups.google.com/forum/?fromgroups=#!msg/comp.programming.threads/Sev_8xKh3RU/wEkEqnOhs_oJ

Brian said...

I would think you'd need a release on the normal decrement path and load acquire (or consume??) of the count on the path that you actually do the deletion...

You have to force a synchronization between the computation that uses the object and the computation that frees it.

I expect that the release alone would keep compilers in practice from breaking your code, but it isn't technically correct.

johnb said...

Herb Sutter briefly discussed the memory ordering requirements for refcounts in his "atomic<> Weapons" talk:

Part 1:
http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-

Part 2:
http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-2-of-2

It's a long talk. The bit about ref-counts is near the end of part 2, (timecode about 1h 20)

He says: relaxed increment. Acquire+Release decrement. Having said that, he wasn't talking about weak pointers; I don't know if that adds any extra twists.

johnb said...

Sorry, first URL got truncated. Should be:

http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2

Brian said...

An acquire fence would also work on the free path of DecRef.

cbloom said...

Right. The actual ref decrement here is release, but then before an object is freed there's also the acq_rel to change the guid which is in the same atomic word. The guid change provides the acquire for the freeing of the object and slot.

(I suppose it's not obvious in the posted code, but state_get_guid and state_get_refcount are just pulling bit fields out of a single word.)

cbloom said...

Addendum to post : alternative DecRef.

Thom said...

(Sorry in advance for the long comment on an old post)

I had been pretty confused about this for a while, since — most of the implementations for refcounting I've seen before (for example, C++11's `std::shared_ptr`, Rust's `std::sync::Arc`, or `mozilla::RefPtr`) support weak references while also being lock-free. These implementations also avoid needing a big fixed-size table and have simpler implementations, etc. So I figured I was missing something.

Anyway, I actually downloaded the code and read it, and I think see the differences now:

1. In the other implementations, atomic operations are required when copying and destroying the weak references too. These operations are still lock-free... but in yours it's just an opaque integer handle. That's super useful — you can hand it out over a C ABI and let callers do whatever they want with it, or use it in other cases where `weak_ptr` would be annoying.

2. Also in them, memory for the storage of a small struct holding strong/weak refcounts isn't released until the last weak reference is destroyed. For a while I thought this was the point, and so I didn't get it (after all, both approaches have some amount of memory that can't be freed while weak refs still exist; for yours it's the table itself).

Anyway, I think the first is way more useful/important! Actually, I've implemented things like this many times but never lock-free, and never with builtin refcounting.

---

Unfortunately, the fixed size max seems like an issue for some use-cases... But unless I'm wrong, I think you can handle this by keeping a number of these fixed-size tables in a lockfree slist. Each table would have `{slots, freelist, id}`, and you'd reserve some bits in the handle for the "table id". When you run out of slots, you just make a new table, and push it on. Issues:

- I guess there is a race in this algorithm, where 2 threads both add a table at the same time. It should just result in some wasted memory, and not anything dangerous if it's programmed right, and that memory can still eventually get used.

- I don't think you can ever free the tables themselves, but maybe someone clever (or who has an RCU-type of thing) can. I think it's probably not worth the hassle unless I'm missing a nice way to do it.

- Each table has its own freelist, but maybe you can share them? I haven't thought about it much, since it's more obvious to me that per-table freelists will be able to use roughly the same algorithm without issues (most functions just need to find the table with the right ID first, except alloc, which might need to check multiple freelists).

- Arguably, this just prolongs the inevitable, since eventually your table ID will overflow the number of bits that the handle has for it. Such is life.

That said, maybe I'm missing another approach.

cbloom said...

Thom - yep, pretty much agree on all points.

Yes in practice the fixed size table is annoying. It worked well for Oddworld because we had very constrained runtime environments, knew exactly what the content needed, etc, but in general it is a problem for variable-size work-loads.

Some kind of lock-free expanding table there is probably possible (as you outline, some kind of list of tables). I think semi-lock-free is probably fine there as well, you could just mutex and block during table growth moments and be "mostly lock free".

And yes as you point out the big advantage of this method is that the weak reference is just a U64.

This means you can have 100's of weak references pointing at an object, the object dies, and there's zero cleanup work, the weak refs just automatically become null and the object and all reference memory is gone.

I also just like writing code with POD data types whenever possible, you can clone weak refs and it's just assignment, no copy construct necessary.

The way I believe that strong/weak refs should be used is that only the "owner" has a strong ref, so that's typically only one place, and everyone else points via weak ref, which makes the GC very light weight.

Making this system work with expanding table and open sourcing it is something I'd like to do some day.

cbloom said...

I think that an expanding handle table could be done with a simple RW lock in most cases.

To get the handle table pointer, take the read lock, if it needs to expand, take the write lock.

Assuming expansion is rare and reads are common, this can be made quite fast. (for true wait-free-ness you could do something like RCW where you prepare the expanded table and leave the old one live, then swap over the pointers in one exchange, but for my use cases that isn't necessary; I don't need to do things like guarantee that handle status queries are always less than 100 cycles or whatever)

RW locks can be made to favor reads so that the read lock is very minimal overhead. I think this is a simple, robust, easy to understand and performant solution.

Thom said...

Hm, yeah, in practice "mostly lock-free" is probably good enough, and with a sufficiently good rwlock it's probably faster to take the lock than to make it work without.

I often also write lock-free code for a couple other reasons:
- So that it works in a signal handler/crash reporter/etc (ofc this needs additional care beyond just thread safety).
- Somewhat better portability, since I sometimes care about non-windows/posix oses, but if nothing else, they tend to be annoying to test on, so nice to reduce the amount of platform-specific code they need.

(Sadly, the portability argument gets less straightforward here, since the trick of adding ABA tags to pointers is getting less and less portable over time...).

> for true wait-free-ness you could do something like RCW where you prepare the expanded table and leave the old one live, then swap over the pointers in one exchange

Oh, hm. I think I see what you mean here, but in my mind you'd have a race between multiple threads trying to grow the table — this could be solved by ABA-tagging the pointer to the table and CASing, but since you said "one exchange", maybe I'm missing something clever.

cbloom said...

BTW for the record - I wound up doing a growing handle table for Oodle because the fixed size table is annoying in practice.

The Oodle handle table now uses a table of tables. The outer array is pointers to tables, the inner array is dense tables. You start with one dense table, when that is full you add more to the outer array. There's no read-lock ever needed, you can always access existing handles without any atomic gate. There is a write-lock needed only when you add another dense table to the outer array (actually that could be lock free too but it's better with a lock I think).

It's very simple but works well for this purpose. Maybe I'll write a full blog post to document it...
