tag:blogger.com,1999:blog-5246987755651065286.post8444470007240734272..comments2024-02-22T16:15:42.388-08:00Comments on cbloom rants: 01-25-09 - Low Level Threading Junk Part 2cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.comBlogger7125tag:blogger.com,1999:blog-5246987755651065286.post-1416279844341752682009-01-27T10:50:00.000-08:002009-01-27T10:50:00.000-08:00Everything I've found says that "dependen...Everything I've found says that "dependent reads" are guaranteed order on every modern CPU (see the many links provided).<BR/><BR/>Now, you are right that it's a little mysterious exactly how that would happen in the circuitry; I've been thinking about the exact same issue that you outline.<BR/><BR/>I haven't yet found any real clear description of what exactly is going on, but I think the key thing is :<BR/><BR/>on every current architecture (eg. not Alpha) the ordering of cache transfers is made to appear to happen in the same order as the reads or writes are executed by the CPU.<BR/><BR/>If you look at the use of smp_read_barrier_depends in the Linux kernel it's for exactly this case. They do stuff like :<BR/><BR/>ptr = shared_var;<BR/>smp_read_barrier_depends();<BR/>i = ptr->val;<BR/><BR/>and as I said before smp_read_barrier_depends is a nop on everything but alpha.<BR/><BR/>If you find any info about what's actually going on here I'd be interested to hear it. 
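The kernel pattern above can be sketched with C11 atomics, where memory_order_consume is the standard's later formalization of the dependent-read guarantee that smp_read_barrier_depends relies on (compilers in practice strengthen consume to acquire). The Thing type and the publish/consume names are invented for illustration; this is a sketch of the pattern, not kernel code:

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct { int val; } Thing;   /* hypothetical shared object */

static _Atomic(Thing *) shared_var = NULL;

/* Writer: fill in the object, then publish the pointer. The release
 * store plays the role of the "memory write barrier between the two
 * moves" discussed in the thread. */
void publish(Thing *t, int v)
{
    t->val = v;
    atomic_store_explicit(&shared_var, t, memory_order_release);
}

/* Reader: load the pointer, then dereference it. The data dependency
 * (the second address is computed from the first load's result) is
 * what orders the two reads on everything but Alpha. */
int consume(void)
{
    Thing *ptr = atomic_load_explicit(&shared_var, memory_order_consume);
    if (!ptr)
        return -1;                   /* nothing published yet */
    return ptr->val;                 /* dependent read */
}
```

A reader that checks for a non-NULL pointer and then dereferences it is exactly the shape of the double-checked pattern the rest of the thread argues about.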
Obviously on x86 there's just a total order which is kind of mad, but on Power there's not a total order and they must do some kind of wait from the cpu to the cache to ensure order here.cbloomhttps://www.blogger.com/profile/10714564834899413045noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-52628794400696098092009-01-27T02:39:00.000-08:002009-01-27T02:39:00.000-08:00I agree it works on x86, I'm not sure the theo...I agree it works on x86, but I'm not sure the theory 'every platform in existence guarantees ordering of dependent reads' is actually true.<BR/><BR/>If I create an object by filling out its "m_x" field through a pointer (temp->m_x = 1), and then write a pointer to it in a global variable 'foo', I generate two instructions:<BR/><BR/>move mem[base_reg + offset(m_x)],1<BR/>move mem[foo],base_reg<BR/><BR/>and (in this scenario) there will be a memory write barrier between them.<BR/><BR/>On the reading side, it just does the same thing in the opposite direction:<BR/><BR/>move base_reg, mem[foo]<BR/>move xvalreg, mem[base_reg + offset(m_x)]<BR/><BR/>but without the memory barrier between them.<BR/><BR/>Now, I agree, you can't reorder those two reads as generated by the compiler--it needs the value from mem[foo] to even generate the address to fetch the xvalreg from, so even the CPU can't reorder it.<BR/><BR/>But when we're talking about cache coherency across multiple cores and multiple processors, the fact that it has to *generate* and "compute" the addresses needed in order--that it can't compute the 2nd address until the value at the first address has been loaded--isn't a guarantee (as I understand it).<BR/><BR/>Say the first address is 0x20000, and the second address (computed by loading the first address) is 0x30000.<BR/><BR/>We're guaranteed that the CPU is going to pull the value out of the cache for 0x20000 before it pulls the value out of the cache for 0x30000--it can't reorder those reads as they come into the cache.<BR/><BR/>But 
from the POV of the cache, here's what happens over time:<BR/><BR/>- read request of 0x20000<BR/>- read result for 0x20000 sent to CPU<BR/>- read request of 0x30000<BR/>- read result for 0x30000 sent to CPU<BR/><BR/>The question is, are those two read results required to be enforced against a global ordering if the CPU doesn't issue a read memory barrier? It doesn't matter that the results have to be returned in that order, and that one address was computed from the other; all that matters is, given that pair of requests, what can happen -- how coherent a view of the imaginary consistent global memory store the cache has to provide. If it's allowed to use stale data for 0x30000, then the double-checked locking can still break.<BR/><BR/>So, I don't know. My vague guess at this is that, in fact, without the read barrier--with only the write barrier--there's no actual guarantee here. Obviously it works on x86, so I'm not sure how exactly you would have a system that this doesn't work on, but this is apparently e.g. all that the java memory model promises.Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-81936795417899617942009-01-26T23:37:00.000-08:002009-01-26T23:37:00.000-08:00Note that Hans Boehm has a paper in PLDI that exam...Note that Hans Boehm has a paper in PLDI that examines a possible memory model for C++. It might be interesting to take a look at that.Brianhttps://www.blogger.com/profile/18061165495812067689noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-87817762869168032922009-01-26T16:27:00.000-08:002009-01-26T16:27:00.000-08:00Okay Sean here's my new understanding :1. It almos...Okay Sean here's my new understanding :<BR/><BR/>1. It almost always does work as-is because "s_instance" is a pointer, and any access to it will be dependent, and every platform in existence guarantees ordering of dependent reads.<BR/><BR/>2. 
Even barring that fact it would still work on x86 because every read on x86 is read-acquire.<BR/><BR/>In terms of Linux-speak , it should be protected by smp_read_barrier_depends() , which is a nop on every platform but Alpha.<BR/><BR/>See :<BR/><BR/>http://www.linuxjournal.com/article/8211<BR/>http://www.linuxjournal.com/article/8212<BR/><BR/>I'm gonna try to make sure I have this right then I'll revise the original.cbloomhttps://www.blogger.com/profile/10714564834899413045noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-91245032433507079662009-01-26T14:18:00.000-08:002009-01-26T14:18:00.000-08:00"In other words, you memory barrier to make sure t..."In other words, you memory barrier to make sure the writes go out in the right order on thread A, but there's no guarantee that thread B still sees them in the same order unless it does a read memory barrier as well."<BR/><BR/>I'm pretty sure that MemoryBarrier() does exactly that - it guarantees that other cores see the same barrier.<BR/><BR/>Anyway, I think you're right that the check of "s_instance" for NULL should probably be an Acquire.cbloomhttps://www.blogger.com/profile/10714564834899413045noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-65036957755561216022009-01-26T13:40:00.000-08:002009-01-26T13:40:00.000-08:00When implementing own spinning locks it's probably...When implementing your own spinning locks it's probably better to introduce exponential backoff instead of just sleeping for a predefined duration. Not that it'll matter much, as spinning locks should only be taken for _very_ quick operations.MaciejShttps://www.blogger.com/profile/15783093211220278613noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-34371104063526819722009-01-26T13:26:00.000-08:002009-01-26T13:26:00.000-08:00This is a "double-checked lock" that works.Uh, no ...<I>This is a "double-checked lock" that works.</I><BR/><BR/>Uh, no it doesn't. 
At least not according to the java guys, for the java memory model (since C++ doesn't have a memory model)...<BR/><BR/>http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html<BR/><BR/>"However, even with a full memory barrier being performed by the thread that initializes the helper object, it still doesn't work.<BR/><BR/>The problem is that on some systems, the thread which sees a non-null value for the helper field also needs to perform memory barriers."<BR/><BR/>In other words, you memory barrier to make sure the writes go out in the right order on thread A, but there's no guarantee that thread B still sees them in the same order unless it does a read memory barrier as well.<BR/><BR/>So say the java guys; I can't say that I have any actual clue.Anonymousnoreply@blogger.com
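For what it's worth, the shape the thread seems to converge on (release on the publishing store, plus an Acquire on the first check of s_instance, which is the reader-side barrier the linked Java write-up says is missing) can be sketched with C11 atomics and a pthread mutex. Everything here besides the s_instance name is invented for illustration; this is a sketch of the pattern under those assumptions, not the post's actual code:

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <pthread.h>

typedef struct { int value; } Singleton;   /* hypothetical singleton type */

static _Atomic(Singleton *) s_instance = NULL;
static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

Singleton *get_instance(void)
{
    /* First check: an acquire load, so that once we see a non-NULL
     * pointer, the fields written before the release store below are
     * also visible. This is the "read memory barrier on thread B". */
    Singleton *p = atomic_load_explicit(&s_instance, memory_order_acquire);
    if (p)
        return p;

    pthread_mutex_lock(&s_lock);
    /* Second check, under the lock; relaxed is enough here because the
     * mutex itself orders this load against any prior publisher. */
    p = atomic_load_explicit(&s_instance, memory_order_relaxed);
    if (!p) {
        p = malloc(sizeof *p);
        p->value = 42;            /* construct the object fully... */
        /* ...then publish with release: the writer-side barrier that
         * keeps the field writes ahead of the pointer write. */
        atomic_store_explicit(&s_instance, p, memory_order_release);
    }
    pthread_mutex_unlock(&s_lock);
    return p;
}
```

On x86 the acquire load costs nothing extra (every load is read-acquire, as noted above), so writing it this way documents the requirement without penalizing the common platform.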