02-18-09 - Perfectionists are Good Bosses

I pulled Jeff in to show off my awesome thread work dispatcher that I was all proud of. I pulled up my threadprofile viewer that shows the concurrency, and he immediately goes "what is this time gap at the end where you don't show anything happening?". I was like "hmm god dammit I'm trying to show off my cool shit here, don't point out the flaws". But today I realized that on the main thread I was doing a sleep-loop to wait for all the work to get done, when really I should be using an event to wake the main thread when the workers run out of work. With a sleep loop you can waste time up to the length of your sleep interval. With a wait event you wake instantly when it's time to wake.

I already knew about the evils of sleep-loops and the goodness of events for doing thread waiting, I just got lazy in this particular loop and did it the sleep way because it's easier. Perfectionist bosses can be annoying when you feel like you did something well enough, but they're also damn helpful, and they push you to be honest about the quality of your work. It's the role I used to play at OW, and I missed having it when I was working on my own. Anyway I fixed it and the missing time gap disappeared and now I am happy. (this could also be a blog about the awesomeness of good tools - I never would have known about the time gap without the threadprofiler and viewer).
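The difference between the two waiting styles can be sketched with a portable analogue. This is not the actual dispatcher code - just an illustrative sketch using std::condition_variable in place of a Windows event, with made-up names :

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Shared completion flag; in a real dispatcher this would be the
// count of outstanding work packets.
static std::mutex g_mutex;
static std::condition_variable g_done_cv;
static bool g_done = false;

void worker()
{
    // ... do the actual work, then signal completion :
    std::lock_guard<std::mutex> lock(g_mutex);
    g_done = true;
    g_done_cv.notify_one(); // wakes the waiter immediately
}

bool run_and_wait()
{
    std::thread t(worker);

    // BAD (sleep-loop) : wastes up to a full sleep interval after the
    // work is done before the main thread notices :
    //   while ( ! done ) Sleep(1);

    // GOOD (event-style wait) : wakes the instant the worker signals :
    {
        std::unique_lock<std::mutex> lock(g_mutex);
        g_done_cv.wait(lock, []{ return g_done; });
    }

    t.join();
    return g_done;
}
```

The sleep-loop's cost is hidden on a loaded machine but shows up exactly as that time gap at the end of the profile.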

BTW about windows events : something that has been hammered home for me recently is that Windows manual reset events are just a race condition waiting to happen. !! Always use auto-reset events !!

For example, say the worker thread is sending you signals once in a while to let you know certain things have happened. If you have code like this in the main thread with a manual reset event :

WaitForSingleObject( hEvent, INFINITE );
  (*) race here
ResetEvent( hEvent );

That's a race. If your thread is swapped out after the Wait but before the Reset, other threads can run along and signal the event. Then it switches back to you, you reset the event, and you lose any signals that were set. (With an auto-reset event the reset happens atomically inside the Wait, so there's no window for this.)

I have often used manual reset events in the past successfully, because I use them to signal that a shared data item has changed, which is then protected by a critical section. As long as you take the critical section and use the shared data to reevaluate whether to wait, that pattern is safe from races even with a manual reset event (it doesn't matter that you may miss some event signals because you will see it when you check the data). But now that I am doing more lock-free stuff, it's more important to get the event right, and hell there's no reason to not just use auto-reset always, so do!! (when in doubt, be safer).

And now a teaser/brag. This is the performance of the work dispatcher :

linear        : 0.770 seconds, 2312158485 clocks   100.000% of linear time
1 threads     : 0.770 seconds, 2313504188 clocks   100.058%
2 threads     : 0.386 seconds, 1160138775 clocks    50.175% , speedup = 1.993 = 0.99650 per core
3 threads     : 0.258 seconds, 774830107 clocks     33.511% , speedup = 2.984 = 0.99467
4 threads     : 0.194 seconds, 582231143 clocks     25.181% , speedup = 3.971 = 0.99275
5 threads     : 0.194 seconds, 582824062 clocks     25.207%
6 threads     : 0.194 seconds, 583144837 clocks     25.221%
7 threads     : 0.194 seconds, 583397962 clocks     25.231%
8 threads     : 0.194 seconds, 582415282 clocks     25.189%

Making a work dispatcher with almost perfect scaling is not really that impressive (there are plenty of existing ones around the net - and actually if you are Windows-only there are some good ones built directly into Windows), but there are some things that I am happy about :

The loss from going from single-thread linear to dispatched in packets is very close to zero.

The loss from running lots of small packets as opposed to a few big ones is very close to zero. (I tried packets of between 4 and 64 items and it made almost no difference).

There's no slowdown when you have too many threads (obviously this is a 4 core machine), which is a common flaw in broken work dispatchers (they start getting slower and slower with too many threads because they have contention issues and they fight each other).

(I should say this testing is on a quad-core chip, which doesn't show scalability problems nearly as much as a true 4-proc machine with separate caches, so there may still be scalability flaws I don't know about)
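For flavor, the packet-dispatch idea can be sketched in a few lines - this is not the actual dispatcher, just a minimal portable sketch where workers pull fixed-size packets off a shared atomic cursor, so each packet costs only one atomic increment :

```cpp
#include <algorithm>
#include <atomic>
#include <numeric>
#include <thread>
#include <vector>

// Workers grab packets of kPacketSize items from a shared cursor.
// Small packets are cheap because claiming one is a single fetch_add.
static const size_t kPacketSize = 16;

long long parallel_sum(const std::vector<int>& data, int num_threads)
{
    std::atomic<size_t> cursor(0);
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> threads;

    for (int t = 0; t < num_threads; ++t)
    {
        threads.emplace_back([&, t]{
            for (;;)
            {
                size_t begin = cursor.fetch_add(kPacketSize);
                if (begin >= data.size()) break; // no more packets
                size_t end = std::min(begin + kPacketSize, data.size());
                for (size_t i = begin; i < end; ++i)
                    partial[t] += data[i];
            }
        });
    }
    for (auto& th : threads) th.join(); // event-style wait, no sleep-loop
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Each worker writes only to its own partial sum, which is one way to avoid the contention that makes broken dispatchers get slower with more threads.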


MaciejS said...

Actually, the main thread shouldn't just sit there waiting for the event (signaling that the workers are done) either. If possible it should participate in performing tasks as well (fetching them from a central queue / stealing from workers - depends on the system), if it has nothing better to do anyway.

cbloom said...

Yeah, though I think that could only hurt in my system. I think the ideal setup for me is to make a worker per core, and then also have the "main thread" which can run along and keep doing user code or whatever. So if the main thread does work there it will only cause cache contention with a worker.

However there are certainly cases where having the main thread participate could be a big win. One is if the cores are not fully committed. But another is if one of the workers is asleep or idle, you'd rather let the main thread go ahead and take over his slot so that you can just do work without the thread switch penalty.

Another case is if the main thread blocks on a small work item, it's better to just find it and immediately execute it on the main thread rather than wait for a worker to do it.
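That last case can be sketched simply - instead of blocking on the completion event, the main thread drains work items itself until the queue is empty (hypothetical names, not code from either commenter's system) :

```cpp
#include <deque>
#include <functional>
#include <mutex>

static std::mutex q_lock;
static std::deque<std::function<void()>> q;

// Rather than wait on the done-event, the main thread pulls work
// items and runs them in place - no thread switch penalty.
void main_thread_help_until_empty()
{
    for (;;)
    {
        std::function<void()> item;
        {
            std::lock_guard<std::mutex> lock(q_lock);
            if (q.empty()) return; // workers finish any in-flight items
            item = std::move(q.front());
            q.pop_front();
        }
        item(); // execute on the main thread
    }
}
```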

old rants