
12-03-11 - Worker Thread system with reverse dependencies

In the previous episode we looked at a system for doing work with dependencies.

That system is okay; I believe it works, but it has two disadvantages :

1. It requires some non-standard synchronization primitives, such as OR waits.

2. It can fail to do work as soon as possible; that is, there can be moments when work could be done but the worker that could do it is asleep. One of our design goals is to never let that happen, so let's see why it does :

The problem is basically the NR (not ready) queue. When we have no RTR (ready to run) work, we pop one item from the NR queue and wait on its dependencies. But there could be other items later in the NR queue which become ready sooner. If the items in the NR queue become ready to run in order, this doesn't occur; but if they can become ready in different orders, we miss chances to do work.

Anyhoo, both of these problems go away and everything becomes much simpler if we reformulate our system in terms of "forward dependencies" instead of "backward dependencies".

Normal "dependencies" are backwards; that is, A depends on B and C, which were created earlier in time. The opposite direction link I will call "permits" (is there a standard term for this?). That is, B and C permit A. A needs 2 permissions before it can run.

I propose that it is conceptually easier to set up work in terms of "dependencies", so the client still formulates work items with dependencies, but when they are submitted to be run, they are converted into "permissions". That is, A --> {B,C} is changed into B --> {A} and C --> {A}.

The main difference is that there is no longer any "not ready" queue at all. NR items are not held in any global list; they are pointed to only by their dependencies. Some dependency further back in the tree must be ready to run, and it acts as the root that reaches the various NR items through permission links.

Without further ado, let's look at the implementation.

The worker thread becomes much simpler :


worker :

wait( RTR_sem );

pop RTR_queue and do work
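
Fleshed out slightly (just a sketch ; the semaphore and queue ops are assumed primitives, and run() is a stand-in for the item's payload, not something specified here) :

void worker_main()
{
    for(;;)
    {
        wait( RTR_sem );                  // sleep until something is ready to run
        work_item * wi = RTR_queue.pop(); // can't be empty : one sem post per push
        run( wi );                        // do the actual work
        complete( wi );                   // grant permissions ; see below
    }
}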

that's it! Massively simpler. All the work is now in the permissions maintenance, so let's look at that :

How do we maintain permissions? Each NR (not ready) item holds a (negative) count of the number of permissions it still needs before it can run. Whenever an item finishes, it walks its permission list and increments the permit count on each target item. When a count reaches zero, all permissions have been granted and that item can now run.

A work item now has to have a list of permissions. In my old system I had just a fixed-size array for dependencies; I found that [3] was always enough. It's simply the nature of work that you rarely need lots of dependencies (and in the very rare case that you do need more than 3, you can create a dummy item which only marks itself complete when many others are done). But this is not true for permissions; one item can permit many others.

For example, a common case is you do a big IO, and then spawn lots of work on that buffer. You might have 32 work items which depend on the IO. That needs only [1] array slot per work item when expressed as dependencies, but [32] list entries on the IO item when expressed as permissions. So a fixed-size array is out, and we will use a linked list.
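
Concretely, the structs might look something like this (a sketch, not the actual layout ; note you need one link node per dependency edge, since an item with several dependencies sits on several permits lists at once) :

struct work_item; // fwd

struct permit_link
{
    work_item *   target; // the not-ready item being permitted
    permit_link * next;
};

struct work_item
{
    atomic<int>   permits;      // negative count of permissions still needed
    bool          is_done;      // set under the per-item lock in complete()
    permit_link * permits_list; // head of list of items to permit when this finishes
    // ... plus the actual work payload
};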

The maintenance looks like this :


submit item for work :

void submit( work_item * wi , work_item * deps[] , int num_deps )
{

    wi->permits = - num_deps;

    if ( num_deps == 0 )
    {
        // no dependencies : ready to run right away
        RTR_queue.push( wi );
        RTR_sem.post();
        return;
    }

    for(int i=0;i<num_deps;i++)
    {
        // lock the dependency so its completion can't race with
        // us adding wi to its permits list :
        deps[i]->lock();

        if ( ! deps[i]->is_done )
        {
            deps[i]->permits_list.push( wi );
        }
        else
        {
            // dependency already done : its permission is granted now
            int prev = wi->permits.fetch_add(1); // needs to be atomic
            if ( prev == -1 ) // that was the last permission needed
            {
                RTR_queue.push( wi );
                RTR_sem.post();
            }
        }

        deps[i]->unlock();
    }

}


when an item is completed :

void complete( work_item * wi )
{
    wi->lock();

    wi->is_done = true;

    // move the list out so we can walk it after unlocking :
    swap( wi->permits_list , local_permits_list );

    wi->unlock();

    for each p in local_permits_list
    {
        int prev = p->permits.fetch_add(1);

        if ( prev == -1 )
        {
            // p is now permitted
            RTR_queue.push( p );
            RTR_sem.post();
        }
    }
}

The result is that when you submit not-ready items, they go onto the permits lists of their dependencies; then, as those dependencies get done, each item's permits count climbs toward zero; when it hits zero, the item goes into the RTR queue and gets picked up by a worker.
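
For example, the IO-then-work case above gets set up like this (make_item and the job names are hypothetical, just for illustration) :

work_item * io = make_item( read_file_job );
submit( io , NULL , 0 );    // no deps : straight to the RTR queue

work_item * deps[1] = { io };
for(int i=0;i<32;i++)
{
    work_item * w = make_item( process_chunk_job );
    submit( w , deps , 1 ); // each one waits on the IO ;
                            // io's permits_list ends up with 32 entries
}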

The behavior is exactly the same as in the previous system, except that workers who are asleep because they have no RTR work now wake up when any NR item becomes RTR, not just when the single item they popped becomes RTR.

One annoyance with this scheme is that you need to lock the item to maintain the permits_list ; that's not really a big deal (I use an indexed lock system similar to NT Keyed Events, I don't actually put a lock object on each item), but I think it's possible to maintain that list correctly, and simply, lock-free, so maybe we'll revisit that.
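
For reference, a minimal sketch of an indexed lock, assuming the simple hash-the-pointer-into-a-table variant (not necessarily the exact scheme meant here) :

static mutex s_item_locks[64];

mutex & lock_for_item( work_item * wi )
{
    size_t h = ((size_t)wi >> 4) * 2654435761u; // cheap multiplicative pointer hash
    return s_item_locks[ h & 63 ];
}

// wi->lock() above then just means lock_for_item(wi).lock()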

ADDENDUM : hmm, not easy to do lock-free. Actually maintaining the list is not hard, and even doing it while avoiding races against the permitted count is not hard; the problem is that the list lives in the work item, and items can be deleted at any time, so you either need to hold a lock on the item to prevent deletion, or you need something like RCU or SMR.
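
One way out, sketched here under the assumption that you're willing to impose refcounting on all items (Brian describes the same idea in the comments below ; this is not what the system currently does) : turn delete into release, so an item can't be freed while anyone still holds a reference :

// hypothetical ; assumes each work_item gets an atomic ref count, and
// anyone about to touch an item (submit, complete, the owner) holds a ref

void item_addref( work_item * wi )
{
    wi->refs.fetch_add(1);
}

void item_release( work_item * wi )
{
    if ( wi->refs.fetch_add(-1) == 1 )
    {
        // last reference gone ; recycle rather than free()
        recycle( wi );
    }
}

// the hard part remains taking the *first* ref on an item you found
// through a lock-free structure ; that's where RCU/SMR or a lock comes in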

12 comments:

Martin Ecker said...

Nice post. This is exactly how our task/worker system worked in the last commercial title I worked on. I can attest to this working really well in practice.

ryg said...

"The opposite direction link I will call "permits" (is there a standard term for this?)"
The standard term is "anti-dependency". It's awful.

A somewhat nicer and also reasonably standard pair is "input dependency" (A depends on its inputs) vs. "output dependency" (someone depends on A's outputs).

  said...

Clever!

Miles said...

This is very similar to what we do, except we call permits "continuations". A task only has one continuation, but it can manually decrement the ref count on other tasks during its own run() method.

That works for most cases but is a bit more manual to set up.

Brian said...

We also do exactly this (we have a runtime library that our compiler-generated C code calls into).

One way of avoiding the lock on the permits list, is to maintain a lock-free list of dependence objects in each task. When a task retires (finishes), it sets a finished flag in its record, goes through this list, does an atomic XCHG on the dependence item to set a flag, and if the XCHG succeeds decrements the dependence count of the dependent task.

When a task adds a dependence record to the task it depends on, it adds the record to the lock free list, does an MFENCE, checks if the task with the list has finished, and if so tries an atomic XCHG on the dependence record (and if it succeeds doesn't count the dependence in its local count).

One additional trick we do is to use a large bias in the dependence count of the task we are setting up, keep track of the dependences the task has, and then decrement the bias minus the actual count from its count. It avoids a bunch of more expensive atomic decrements...

Debugging all of this and getting it right was a major, major pain...

cbloom said...

@ Brian - that stuff is much easier to get right with Relacy.

My problem is that objects can get deleted, so I basically have to hold a lock on them to prevent deletion. I started to make the "permits list" lockfree and realized it was pointless because I need a lock anyway to ensure lifetime.

I've got a nice way to make the fast path (no dependencies) lock free, so I'm happy with that.

Brian said...

BTW, it isn't clear to me that making it lock-free is a performance win. The places where we really suffered (under Linux, as the amount of work per task became small) were (1) the memory allocator, as you generate a whole bunch of task records, and (2) cache line contention on the work queue as tasks become ready (in practice one thread makes most tasks ready in our system).

We bought into lock-free reference counting for freeing objects to avoid the deleting issue you describe. And we don't actually free these types of objects, but simply add them to a queue to be recycled (see system allocator performance issues)...

cbloom said...

On point (2) -

multiplexing the queue is the standard solution to reduce contention in the high load case; in the low load scenario (queues nearly empty) you have some amount of contention that you cannot avoid, but in the high load scenario (# items >> #workers) you can make contention as small as possible (one cache line handoff is inevitable)

"We bought into lock-free reference counting for freeing objects"

I haven't wanted to impose things like that; I want the system to have a minimum of requirements for correct use, but I may regret this later. I also considered building a RW lock into my handle system but decided against it for the moment.

Brian said...

The problem we have is that we don't control the structure of the tasks (we're compiling arbitrary code to use our library). Every core actually has its own work queue. But if one core is producing most of the work at a high enough rate (and a bunch of worker cores are completing the tasks fast enough), the cache line contention from the work stealers can be problematic... This was the problem we hit when we had a single core spinning off a new task every 700 clock cycles on an 8 core Nehalem, or every 1600 clock cycles on a 24 core Opteron. We've thought about trying to batch items to amortize the cache line miss on the insert operation, but ensuring timely delivery of tasks would become a pain...

bouliiii said...

Hey :-)
This is exactly what I implemented with yaTS. You have a "start" dependency that prevents a task from running and an "end" dependency that delays the task's end.
Ben

Jesse James Lactin said...

Sounds really good. Have you considered implementing this with fibers instead of threads? The first problem I see is you pretty much have to roll your own synchronization primitives. To my knowledge, under Windows, stock mutexes and semaphores will put the whole thread to sleep, which isn't what you want.
