You may recall, I use a worker thread system with forward "permits" (reversed dependencies) . When any handle completes it sees if that completion should trigger any followup handles, and if so those are then launched. Handles may be SPU jobs or IOs or CPU jobs or whatever. The problem I will talk about occurred when the predessor and the followup were both CPU jobs.
I'll talk about a specific case to be concrete : decoding compressed data while reading it from disk.
To decode each chunk of LZ data, a chunk-decompress job is made. That job depends on the IO(s) that read in the compressed data for that chunk. It also depends on the previous chunk if the chunk is not a seek-reset point. So in the case of a non-reset chunk, you have a dependency on an IO and a previous CPU job. Your job will be started by one or the other, whichever finishes last.
Now, when decompression was IO bound, then the IO completions were kicking off the decompress jobs, and everything was fine.
In these timelines, the second line is IO and the bottom four are workers. (click images for higher res)
LZ Read and Decompress without seek-resets, IO bound :
You can see the funny fans of lines that show the dependency on the previous decompress job and also the IO. Yellow is a thread that's sleeping.
You may notice that the worker threads are cycling around. That's not really ideal, but it's not related to the problem I'm talking about today. (that cycling is caused by the fact that the OS semaphore is FIFO. For something like worker threads, we'd actually rather have a LIFO semaphore, because it makes it more likely that you get a thread with something useful still hot in cache. Someday I'll replace my OS semaphore with my own LIFO one, but for now this is a minor performance bug). (Win32 docs say that they don't gaurantee any particular order, but in my experience threads of equal priority are always FIFO in Win32 semaphores)
Okay, now for the problem. When the IO was going fast, so we were CPU bound, it's the prior decompress job that triggers the followup work.
But something bad happened due to the forward permit system. The control flow was something like this :
On worker thread 0
wake from semaphore
do on an LZ decompress job
mark job done
completion change causes a permits check
permits check sees that there is a pending job triggered by this completion
-> fire off that handle
handle is pushed to worker thread system
no worker is available to do it, so wake a new worker and give him the job
finalize (usually delete) job I just finished
look for more work to do
there is none because it was already handed to a new worker
And it looked like this :
LZ Read and Decompress without seek-resets, CPU bound, naive permits :
You can see each subsequent decompress job is moving to another worker thread. Yuck, bad.
So the fix in Oodle is to use the "delay-kick" mechanism, which I'd already been using for coroutine refires (which had a similar problem; the problem occurred when you yielded a coroutine on something like an IO, and the IO was done almost immediately; the coroutine would get moved to another worker thread instead of just staying on the same one and continuing from the yield as if it wasn't there).
The scheme is something like this :
On each worker thread :
Try to pop a work item from the "delay kick queue"
if there is more than one item in the DKQ,
take one for myself and "kick" the remainder
(kick means wake worker threads to do the jobs)
If nothing on DKQ, pop from the main queue
if nothing on main queue, wait on work semaphore
Do your job
Set "delay kick" = true
("delay kick" has to be in TLS of course)
Mark job as done
Permit system checks for successor handles that can now run
if they exist, they are put in the DKQ instead of immediately firing
Set "delay kick" = false
In brief : work that is made runnable by the completion of work is not fired until the worker
thread that did the completion gets its own shot at grabbing that new work. If the completion made 4 jobs
runnable, the worker will grab 1 for itself and kick the other 3. The kick is no longer in the completion
phase, it's in the pop phase.
And the result is :
LZ Read and Decompress without seek-resets, CPU bound, delay-kick permits :
These last two timelines are on the same time scale, so you can see just from the visual that eliminating the unnecessary thread switching is about a 10% speedup.
Anyway, this particular issue may not apply to your worker thread system, or you may have other solutions. I think the main take-away is that while worker thread systems seem very simple to write at first, there's actually a huge amount of careful fiddling required to make them really run well. You have to be constantly vigilant about doing test runs and checking threadprofile views like this to ensure that what's actually happening matches what you think is happening. Err, oops, I think I just accidentally wrote an advertisement for Telemetry .