In general, when I'm porting atomics to a new platform, what I would really like is a document from the hardware maker that describes their memory semantics and cache model in detail. Unfortunately that can be very hard to find. You also need to know how the compiler interacts with that memory model (e.g. what does volatile mean, how do I generate compiler reorder barriers, etc). Again that can be very hard to find, particularly because so many compilers now are based on GCC, and the GCC guys are generally stubborn punk-asses about clearly defining how they behave in the "undefined" parts of C which are so crucial to real code.
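To make the "compiler reorder barrier" question concrete, here is a minimal sketch of the usual GCC-style idiom, assuming a GCC or Clang compatible compiler (with a C11 fallback). The names `COMPILER_BARRIER`, `payload`, `ready`, `publish` and `try_consume` are made up for illustration. Note that this only constrains the compiler; it emits no instruction, so it says nothing about what the hardware may reorder.

```c
#include <stdatomic.h>

/* Compiler-only reorder barrier: generates no instruction, but the compiler
   may not move memory accesses across it. The asm form is the GCC/Clang
   spelling (an assumption about your toolchain); the fallback is C11. */
#if defined(__GNUC__)
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")
#else
#define COMPILER_BARRIER() atomic_signal_fence(memory_order_seq_cst)
#endif

static int payload;          /* hypothetical shared data */
static volatile int ready;   /* hypothetical flag; volatile forces a real load/store each time */

static void publish(int value)
{
    payload = value;
    COMPILER_BARRIER();      /* keep the compiler from sinking the payload store below the flag store */
    ready = 1;
}

static int try_consume(int *out)
{
    if (!ready)
        return 0;            /* nothing published yet */
    COMPILER_BARRIER();      /* keep the compiler from hoisting the payload load above the flag load */
    *out = payload;
    return 1;
}
```

On a single core this is enough to keep the two stores in program order. On SMP you would also need a hardware barrier at the same two spots.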
So assuming you can't find any decent documentation, the next place to look is some of the large cross-platform codebases. Probably the best one of all is Linux, because it's decently well tested. Unfortunately you can't just copy code from these since most of them are GPL, but you can use them as educational material to figure out the memory semantics of a platform. So some things we learn from Linux straight off :
1. Before ARMv6 you're fucked. There are no real atomic ops (SWP is not good enough; it's an unconditional swap, not a compare-and-swap) so you have to use some locking/critical-section mechanism to do atomics. Linux in the kernel does this by blocking interrupts, doing the atomic, then turning them back on. If you're not in the kernel, on Linux there's a secret syscall function pointer you can use (the kernel's user-helper cmpxchg entry at a fixed address), or on non-Linux platforms you have to use SWP to implement a spinlock which you then use to do CAS and such.
2. With ARMv6 and later you can use ldrex/strex, which seems to be your standard LL-SC kind of thing.
3. If you're SMP you need full memory barriers for memory ordering; ldrex/strex on their own don't order the surrounding loads and stores.
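The three points above can be sketched in portable C11 rather than ARM assembly. This is an illustration under assumptions, not what Linux actually does: `atomic_flag_test_and_set` stands in for the SWP-style swap, and the weak compare-exchange stands in for an ldrex/strex pair (which is what I'd expect a compiler targeting ARMv6+ to emit for it, but check the generated assembly). The names `cas_lock`, `locked_cas` and `llsc_fetch_add` are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Item 1: no usable atomic read-modify-write, so guard the CAS with a
   spinlock built from an atomic swap of a flag (the role SWP plays). */
static atomic_flag cas_lock = ATOMIC_FLAG_INIT;   /* hypothetical global lock */

static bool locked_cas(int *addr, int expected, int desired)
{
    while (atomic_flag_test_and_set_explicit(&cas_lock, memory_order_acquire)) {
        /* spin until the current owner clears the flag */
    }
    bool ok = (*addr == expected);
    if (ok)
        *addr = desired;
    atomic_flag_clear_explicit(&cas_lock, memory_order_release);
    return ok;
}

/* Items 2 and 3: LL-SC style retry loop, bracketed by full barriers. */
static int llsc_fetch_add(atomic_int *addr, int delta)
{
    atomic_thread_fence(memory_order_seq_cst);    /* full barrier before the op */
    int old = atomic_load_explicit(addr, memory_order_relaxed);
    /* The weak CAS may fail spuriously, like a strex that lost its
       reservation; on failure 'old' is refreshed and we simply retry. */
    while (!atomic_compare_exchange_weak_explicit(addr, &old, old + delta,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
        /* retry with the refreshed value of 'old' */
    }
    atomic_thread_fence(memory_order_seq_cst);    /* full barrier after the op */
    return old;                                   /* value before the add */
}
```

The lock-based version is only atomic with respect to other code that takes the same lock, so every access to the guarded variable has to go through it.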
One thing I don't know is whether any of the Apple/Android consumer ARM multi-core chips are actually SMP; e.g. do they have separate caches, or do they share a single cache between multiple execution units?
Some references I've found :
[pulseaudio-discuss] Atomic operations on ARM
Wandering Coder - ARM
QEmu - Commit - ViewGit
pulseaudio-discuss Atomic operations on ARM 1
Old Nabble - gcc - Dev - atomic accesses
Linux Kernel Locking Techniques
Linux Kernel ARM atomics
Debian -- Details of package libatomic-ops-dev in sid
Data alignment Straighten up and fly right
Broken ARM atomic ops wrt memory barriers (was [PATCH] Add cmpxchg support for ARMv6+ systems) - Patchwork
Atomic - GCC Wiki
ARM Technical Resources
ARM Information Center
ARM RealView compiler has some interesting intrinsics that are not documented very well :
__force_stores __memory_changed __schedule_barrier __yield __strexeq
One trick in this kind of work is to find a compiler that has intrinsics you want and then just look at what assembly is generated so that you can see how to generate the op you want on that platform.
(But beware, because the intrinsics are not always correct. In particular the GCC __sync ops are not all right: sometimes they have bugs, and sometimes their behavior is "correct" but doesn't match the documentation. It's very hard to find correct documentation on what memory semantics the GCC __sync ops actually guarantee.)
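As a sketch of the look-at-the-assembly trick, a tiny probe file like the one below keeps each intrinsic in its own function so it is easy to find in the listing. The `probe_*` names and the file name are made up; the `__sync` builtins themselves are the real GCC ones. Compile it to assembly with your ARM toolchain's equivalent of `gcc -O2 -S probe.c` (which cross-compiler and `-march` setting you use is up to you) and read what comes out, keeping the caveat above in mind.

```c
/* probe.c: one intrinsic per function, so each one stands alone in the
   generated assembly. */

int probe_cas(int *addr, int expected, int desired)
{
    /* Returns the value that was in *addr before the operation. */
    return __sync_val_compare_and_swap(addr, expected, desired);
}

int probe_fetch_add(int *addr, int delta)
{
    /* Returns the value that was in *addr before the add. */
    return __sync_fetch_and_add(addr, delta);
}

void probe_full_barrier(void)
{
    /* Documented by GCC as a full memory barrier; check what it really emits. */
    __sync_synchronize();
}
```

Comparing the output for different `-march` settings is a cheap way to see where the compiler switches from a library or kernel-helper call to inline ldrex/strex and barrier instructions.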
Anyway, maybe I'll update this when I get some more information / do some more research.