05-29-10 - Lock Free in x64

I mentioned long ago in the low level threading articles that some of the algorithms are a bit problematic on with 64 bit pointers because we don't have large enough atomic operations.

The basic problem is that for many of the lock-free algorithms we need to be able to do a DCAS , that is a CAS of two pointer-sized values, or a pointer and a counter. When our pointer was 32 bits, we could use a 64 bit CAS to implement DCAS. If our pointer is 64 bits then we need a 128 bit CAS to implement DCAS the same way. There are various solutions to this :

1. Use 128 bit CAS. x64 has cmpxchg16b now which is exactly what you need. This is obviously simple and nice. There are a few minor problems :

1.A. There are not other 128 bit atomics, eg. Exchange and Add and such are missing. These can be implemented in terms of loops of CAS, but that is a very minor suckitude.

1.B. Early AMD64 chips do not have cmpxchg16b. You have to check for its presence with a CPUID call. If it doesn't exist you are seriously fucked. Fortunately these chips are pretty rare, so you can just use a really evil fallback to keep working on them : either disable threading completely on them, or simply run the 32 bit version of your app. The easiest way to do that is to have your installer check the CPUID flag and install the 32 bit x86 version of your app instead of the 64 bit version.

1.C. All your lock-free nodes become 16 bytes instead of 8 bytes. This does things like make your minimum alloc size 16 bytes instead of 8 bytes. This is part of the general bloating of 64 bit structs and mildly sucks. (BTW you can see this in winnt.h as MEMORY_ALLOCATION_ALIGMENT is 16 on Win64 and 8 on Win32).

1.D. _InterlockedCompareExchange128 only exists on newer versions of MSVC so you have to write it yourself in ASM for older versions. Urg.

So #1 is an okay solution, but what are the alternative ?

2. Pack {Pointer,Count} into 64 bits. This is of course what Windows does for SLIST, so doing this is actually very safe. Currently pointers on Windows are only 44 bits because of this. They will move to 48 and then 52. You can easily store a 52 bit pointer + a 16 bit count in 64 bits (the 52 bit pointer has the bottom four bits zero so you actually have 16 bits to work with). Then you can just keep using 64 bit CAS. This has no disadvantage that I know of other than the fact that twenty years from now you'll have to touch your code again.

3. You can implement arbitrary-sized CAS in terms of pointer CAS. The powerful standard paradigm for this type of thing is to use pointers to data instead of data by value, so you are just swapping pointers instead of swapping values. It's very simple, when you want to change a value, you malloc a copy of it and change the copy, and then swap in the pointer to the new version. You CAS on the pointer swap. The "malloc" can just be taking data from a recycled buffer which uses hazard pointers to keep threads from using the same temp item at the same time. This is a somewhat more complex way to do things conceptually, but it is very powerful and general, and for anyone doing really serious lockfree work, a hazard pointer system is a good thing to have. See for example "Practical Lock-Free and Wait-Free LL/SC/VL Implementations Using 64-Bit CAS".

You could also of course use a hybrid of 2 & 3. You could use a packed 64 bit {pointer,count} until your pointer becomes more than 52 bits, and then switch to a pointer to extra data.

No comments:

old rants