
10-12-11 - Bad profiling

I've seen quite a few posts recently that purport to do some profiling and show you which option is faster. The worst is probably "spin locks are faster than mutexes", but this kind of cache line study is not awesome either.

(bouliiii's test is not actually a demonstration of cache line sharing at all; it's a demonstration that threading speedup depends on data independence. To demonstrate cache line effects you should have lots of threads atomically incrementing *different* variables that are within the same cache line, not *one* variable. If you did that, you would see that cache line sharing causes almost as much slowdown as variable sharing.)
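
To be concrete, here's a minimal sketch of that experiment (my code, not bouliiii's; the thread and iteration counts are arbitrary). Four threads each atomically increment a *different* counter; in the first run the counters are packed into one cache line, in the second each counter is padded onto its own line. You should see the packed case run dramatically slower even though no variable is shared:

    // build with eg. : g++ -O2 -std=c++14 -pthread false_sharing.cpp
    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    constexpr int  kThreads = 4;
    constexpr long kIters   = 10000000;

    // Time kThreads threads, each hammering its own counter.
    static long long run_ms(std::atomic<long>** counters)
    {
        std::vector<std::thread> threads;
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < kThreads; i++)
            threads.emplace_back([counters, i] {
                for (long n = 0; n < kIters; n++)
                    counters[i]->fetch_add(1, std::memory_order_relaxed);
            });
        for (auto& t : threads) t.join();
        auto dt = std::chrono::steady_clock::now() - t0;
        return std::chrono::duration_cast<std::chrono::milliseconds>(dt).count();
    }

    int main()
    {
        // Case 1 : distinct variables that share one cache line (false sharing).
        std::atomic<long> packed[kThreads] = {};
        // Case 2 : distinct variables, one full cache line each.
        struct alignas(64) Padded { std::atomic<long> v{0}; };
        Padded padded[kThreads] = {};

        std::atomic<long>* shared_line[kThreads];
        std::atomic<long>* own_lines[kThreads];
        for (int i = 0; i < kThreads; i++) {
            shared_line[i] = &packed[i];
            own_lines[i]   = &padded[i].v;
        }

        printf("same cache line : %lld ms\n", run_ms(shared_line));
        printf("separate lines  : %lld ms\n", run_ms(own_lines));
    }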

This kind of profiling is deeply flawed and I believe is in fact worse than nothing.

Timing threading primitives in isolation or on synthetic task sets is not useful.

Obviously if all you measure is the number of clocks to take a lock, then the simplest primitives (like a spinlock) will appear the fastest. Many blogs in the last few years have posted ignorant things like "critical section takes 200 clocks and spin lock is only 50 clocks, so spin lock is faster". Doof! (bang head). What you care about in real usage are things like fairness, starvation, whether the threads that have CPU time are the ones that are able to get real work done, etc.
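
For concreteness, this is roughly what that kind of naive micro-benchmark looks like (a sketch of the flawed methodology, not any particular blog's code). It times an uncontended lock/unlock pair in a tight loop on one thread; the spinlock will "win", and the number tells you nothing about fairness, starvation, or behavior under real contention:

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <mutex>

    // The simplest possible spinlock.
    struct SpinLock {
        std::atomic_flag f = ATOMIC_FLAG_INIT;
        void lock()   { while (f.test_and_set(std::memory_order_acquire)) {} }
        void unlock() { f.clear(std::memory_order_release); }
    };

    template <class Lock>
    static double ns_per_lock(Lock& l)
    {
        constexpr int kIters = 1000000;
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < kIters; i++) { l.lock(); l.unlock(); }
        auto dt = std::chrono::steady_clock::now() - t0;
        return std::chrono::duration<double, std::nano>(dt).count() / kIters;
    }

    int main()
    {
        std::mutex m;
        SpinLock   s;
        // One thread, zero contention : exactly the misleading measurement
        // this post is complaining about.
        printf("mutex    : %.1f ns per lock/unlock\n", ns_per_lock(m));
        printf("spinlock : %.1f ns per lock/unlock\n", ns_per_lock(s));
    }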

So say you're not completely naive and instead you actually cook up some synthetic work for some worker threads to do, and you test your primitives that way. Well, that's better. It does tell you what the best primitive is for *that particular work load*. But that might not reflect real work at all. In particular, homogeneous vs. heterogeneous threads (worker threads that are interchangeable and can do any work vs. threads that are dedicated to different things) will behave very differently. What other thread loads are on the system? Are you measuring how well your system releases CPU time to other threads when yours can't do any more? (A fair threading benchmark should always have a low priority thread that counts the number of excess cycles that are released; sketched below.) (Spinning will always seem faster than giving up execution if you ignore the ability of the CPU to do other work.)
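
Here's a sketch of that low priority probe (the details are my assumption; THREAD_PRIORITY_IDLE is the Windows way to ask for a lowest-priority thread, and on pthreads you'd use the scheduler policy instead). The probe only runs when every benchmark thread has given up the CPU, so its final count measures how much time your primitives released to the rest of the system; a spin-heavy primitive will starve it:

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #ifdef _WIN32
    #include <windows.h>
    #endif

    std::atomic<bool>      g_done(false);
    std::atomic<long long> g_spare_work(0);

    // Lowest-priority spinner ; it gets the CPU only when nobody else wants it.
    void spare_cycle_probe()
    {
    #ifdef _WIN32
        SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_IDLE);
    #endif
        while (!g_done.load(std::memory_order_relaxed))
            g_spare_work.fetch_add(1, std::memory_order_relaxed);
    }

    int main()
    {
        std::thread probe(spare_cycle_probe);

        // ... run the real benchmark workload here ...
        std::this_thread::sleep_for(std::chrono::seconds(1)); // placeholder

        g_done = true;
        probe.join();
        printf("spare work done by idle-priority thread : %lld\n",
               g_spare_work.load());
    }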

Furthermore the exact sharing structure and how the various workers share cache lines is massively important to the results.

In general, it is very easy for bad threading primitives to look very good in synthetic benchmarks, but to be disastrous in the real world, because they suffer from things like thundering herd or priority inversion or what I call "thread thrashing" (wasted or unnecessary thread context switches).
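
For example, here's a sketch of one classic thread-thrashing pattern (my illustration, not from any of the posts in question): notify_all on a condition variable wakes every waiter even though only one of them can take the item, so the rest context-switch in just to go back to sleep. It looks harmless in a toy benchmark but wastes switches in proportion to your thread count:

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    std::mutex              g_mx;
    std::condition_variable g_cv;
    std::queue<int>         g_q;
    bool                    g_quit = false;

    void worker(int id)
    {
        for (;;) {
            std::unique_lock<std::mutex> lk(g_mx);
            // Losers of the herd wake up inside wait(), see an empty queue,
            // and go right back to sleep - a wasted context switch each.
            g_cv.wait(lk, [] { return !g_q.empty() || g_quit; });
            if (g_q.empty()) return; // quitting
            int item = g_q.front(); g_q.pop();
            lk.unlock();
            printf("worker %d got item %d\n", id, item);
        }
    }

    int main()
    {
        std::vector<std::thread> pool;
        for (int i = 0; i < 8; i++) pool.emplace_back(worker, i);

        for (int item = 0; item < 4; item++) {
            { std::lock_guard<std::mutex> g(g_mx); g_q.push(item); }
            g_cv.notify_all(); // thundering herd : 8 wake, 1 wins.
                               // notify_one() avoids the waste.
        }

        { std::lock_guard<std::mutex> g(g_mx); g_quit = true; }
        g_cv.notify_all();
        for (auto& t : pool) t.join();
    }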

You may have noticed that when I posted lots of threading primitives a month or two ago there was not one benchmark. What I did talk about were the questions you should actually ask:

- can this threading primitive spin the CPU for a long time waiting on a data item? (eg. potentially for an entire time slice)

- can this threading primitive suffer from priority inversion, or even "priority deadlock" (where the process can stall until the OS balance set manager saves you)?

- how many context switches are needed to transfer control? is it just one, or are there extras?

This has been thoroughly discussed over the years w.r.t. memory allocators, so we should all know better by now.

Also, as has been discussed thoroughly w.r.t. allocators, it is almost impossible to cook up a thorough benchmark which will tell you the one true answer. The correct way to decide these things is:

1. Understand the issues and what the trade-offs are.

2. Make sure you pick an implementation that does not have some kind of disastrous corner case. For example, make sure your allocator has decent O() behavior with no blow-up scenario (or understand the blow-up scenario and know that it can't happen for your application). Particularly with threading, I believe that for 99.99% of us (eg. everyone except LRB) it's more important to have primitives that work correctly than to save 10 clocks of micro efficiency.

3. Try to abstract them cleanly so that they are easily swappable and test them in your real final application with your real use case. The only legitimate test is to test in the final use case.
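
For example, one way to do the abstraction (just a sketch; the names are made up): give every candidate primitive the std::mutex lock()/unlock() interface and route the whole app through a single typedef, so swapping primitives for an in-application test is a one-line change:

    #include <atomic>
    #include <mutex>

    // A candidate primitive with the same interface as std::mutex :
    struct SpinLock {
        std::atomic_flag f = ATOMIC_FLAG_INIT;
        void lock()   { while (f.test_and_set(std::memory_order_acquire)) {} }
        void unlock() { f.clear(std::memory_order_release); }
    };

    // Flip this one line to swap the primitive everywhere :
    using AppLock = std::mutex; // or : using AppLock = SpinLock;

    struct WorkQueue {
        AppLock lock;
        void push(/* work item */) {
            std::lock_guard<AppLock> g(lock);
            // ... real work, profiled in the real final application ...
        }
    };

    int main() { WorkQueue q; q.push(); }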

5 comments:

ryg said...

I can assure you that even on LRB, correctness trumps everything :). And on any system, if you're honestly fighting over a few dozen clock cycles in your lock implementation, you're either worrying about the wrong thing or fixing the wrong problem: if your perf actually depends on this, there's very likely something deeply wrong with your dataflow.

cbloom said...

Yup.

bouliiii said...

Hey guys, do not take this small post as a study. This is just a random remark about sharing data between cores, demonstrating the slowdown you may witness.

Technically, it is all about cache line sharing, because the lines do travel from core to core, and that causes the slowdown. I agree a false sharing example would be a bit more explicit.

Also, there is nothing bad about micro-benchmarking an application. Better to do it than to do nothing, if you can.
But clearly, it is not sufficient.

bouliiii said...

Changed the title finally :)

cbloom said...

Yeah your post is fine actually; I've seen a ton of fine grained threading profiles and just grabbed the most recent example that I thought fit. My blog's research department is very lazy sometimes.
