4/06/2009

04-06-09 - Multi-threaded Allocators

I've been reading a bit about Multi-threaded Allocators. A quick survery :

The canonical base allocator is Hoard . Hoard is a "slab allocator"; it takes big slabs from the OS and assigns each slab to a fixed-size block. In fact it's extremely similar to my "Fixed Restoring" allocator that's been in Galaxy3 forever and we shipped in Stranger. The basic ideas seem okay, though it appears to use locks when it could easily be lock-free for most ops.

One thing I really don't understand about Hoard is the heuristic for returning slabs to the shared pool. The obvious way to do a multi-threaded allocator is just to have a slab pool for each thread. The problem with that is that if each thread does 1 alloc of size 8, every thread gets a whole slab for itself and the overhead is proportional to the number of threads, which is a bad thing going into the massively multicore future. Hoard's heuristic is that it checks the amount of free space in a slab when you do a free. If the slab is "mostly" free by some measure, it gets returned to the main pool.

My problem with this is it seems to have some very bad cases that IMO are not entirely uncommon. For example a usage like this :


for( many )
{
    allocate 2 blocks
    free 1 block
}

will totally break Hoard. From what I can tell what hoard will do is allocate you a slab for your thread the first time you go in, then when you free() it will be very empty, so it will return the slab to the shared pool. Then after that when you allocate you will either get from the shared pool, or pull the slab back. Very messy.

There are two questions : 1. how greedy is new slab creation? That is, when a thread allocates an object of a given size for the first time, does it immediately get a brand new slab, or do you first give it objects from a shared slab and wait to see if it does more allocs before you give it its own slab. 2. how greedy is slab recycling? eg. when a thread frees some objects, when do you give the slab back to the shared pool. Do you wait for the slab to be totally empty, or do you do it right away.

The MS "Low Fragmentation Heap" is sort of oddly misnamed. The interesting bit about it is not really the "low fragmentation", it's that it has better multi-threaded performance. So far as I can tell, the LFH is 99.999% identical to Hoard. See MS PPT on the Low Fragmentation Heap

We did a little test and it appears that the MS LFH is faster than Hoard. My guess is there are two things going on : 1. MS actually uses lock-free linked lists for the free lists in each slab (Hoard uses locks), and 2. MS makes a slab list *per processor*. When a thread does an alloc it gets the heap for the processor it's on. They note that the current processor is only available to the kernel (KeGetCurrentProcessorNumber ). Hoard also makes P heaps for P processors (actually 2P), but they can't get current processor so they use the thread Id and hash it down to P which leads to occasional contention and collisions.

The other canonical allocator is tcmalloc . I still haven't been able to test tcmalloc because it doesn't build on VS2003. tcmalloc is a bit different than Hoard or LFH. Instead of slab lists, it uses traditional SGI style free lists. There are free lists for each thread, and they get "garbage collected" back to the shared pool.

One issue I would be concerned about with tcmalloc is false sharing. Because the free lists get shared back to a global pool and then redistributed back to threads, there is no real provision to prevent little items from going to different threads, which is bad. Hoard and LFH don't have this problem because they assign whole slabs to threads.

The Hoard papers make some rather ridiculous claims about avoiding false sharing, however. The fact is, if you are passing objects between threads, then no general purpose allocator can avoid false sharing. The huge question for the allocator is - if I free an object on thread 1 that was allocated on thread 2 , should I put it on the freelist for thread 1, or on the freelist for thread 2 ? One or the other will make false sharing bad, but you can't answer it unless you know the usage pattern. (BTW I think tcmalloc puts it on thread 1 - it uses the freelist of the thread that did the freeing - and Hoard puts it on thread 2 - the freelist of the thread that did the allocation; neither one is strictly better).

Both tcmalloc and the LFH have high memory overhead. See here for example . They do have better scalability of overhead to high thread counts, but the fact remains that they may hold a lot of memory that's not actually in use. That can be bad for consoles.

In fact for video games, what you want in an allocator is a lot of tweakability. You want to be able to tweak the amount of overhead it's allowed to have, you want to be able to tweak how fast it recycles pages for different block types or shares them across threads. If you're trying to ship and you can't fit in memory because your allocator has too much overhead, that's a disaster.

BTW false sharing and threaded allocators are clearly a place that a generational copying garbage collector would be nice. (this does not conflict with my contention that you can always be faster by doing very low level manual allocation - the problem is that you may have to give tons of hints to the allocator for it to do the right thing). With GC it can watch the usage and be smart. For example if a thread does a few allocs, they can come from a global shared pool. It does some more allocs of that type, then it gets its own slab and the previous allocs are moved to that slab. If those objects are passed by a FIFO to another thread, then they can be copied to a slab that's local to that thread. Nice.

It seems clear to me that the way to go is a slab allocator. Nice because it's good for cache-coherence and false sharing. The ops inside a slab can be totally thread-local and so no need to worry about multithread issues. Slabs are large enough that slab management is rare and most allocations only take the time do inside-slab work.

One big question for me is always how to get from an object to its owning slab (eg. how is free implemented). Obviously it's trivial if you're okay with adding 4 bytes to every allocation. The other obvious way is to have some kind of page table. Each slab is page-aligned, so you take the object pointer and truncate it down to slab alignment and look that up. If for example you have 32 bit pointers (actually 31), and you pages are 4k, then you need 2 to the 19 page table entries (512k). If a PTE is 4 bytes that's 2 MB of overhead. It's more reasonable if you use 64k pages, but that means more waste inside the page. It's also a problem for 64-bit pointers.

There are some other options that are a bit more hard core. One is to reserve a chunk of virtual address for slabs. Then whenever you see a free() you check if the pointer is in that range, and if so you know you can just round the pointer down to get the slab head. The problem is the reserving of virtual address.

Another option is to try to use pointer alignment. You could do something like make all large allocs be 4k aligned. Then all small allocs are *not* 4k aligned (this happens automatically because they come from inside a 4k page), and to get the base of the page you round down to the next lowest 4k.

10 comments:

NeARAZ said...

While you're on the subject of multithreaded allocators - it might make sense to take a look at nedmalloc (http://www.nedprod.com/programs/portable/nedmalloc/)

cbloom said...

Thanks. I wish people would write a bit more about their algorithm, I don't want to have dig around in code if I can avoid it.

There's also jemalloc and Intel TBB's malloc, both of which are supposed to be good and neither of which have much description.

There's also "Miser" which is an implementation of Hoard.

Autodidactic Asphyxiation said...

I think tc_malloc is supposed build on VS2003 (at least 32-bit) if you build tcmalloc_minimal. You might want to be aware of this patch:

http://code.google.com/p/google-perftools/issues/detail?id=120#c6

FWIW, folks care about tcmalloc, so it gets some attention. You can expect it to steadily improve over time, despite the fact it has some funky issues in Windows, ATM.

cbloom said...

The impression I got was that the Linux version was well tested, but the Windows version was the bastard step child.

Plus it seem to me that the method used in tcmalloc is always going to be more of a problem for false sharing.

(I also don't see anything about tcmalloc reclaiming free pages when all the small blocks on a page are freed?)

Autodidactic Asphyxiation said...

While tcmalloc certainly gets a lot more exercise in Linux, I wouldn't go so far as to call it a bastard stepchild. Maybe the sexual deviant daughter. Anyway, the Chromium folks are experimenting with tcmalloc. As you would guess it is a performance win but a space efficiency loss.

Could you talk more about false sharing? Normally, I hear about it when two threads (say different threads are always on different CPUs in this discussion) write two different pieces of memory that happen to share a cache line. Then writes to that cache line makes things slow. People get this when they do naive things like allocate an array of accumulators and distribute each of them to different threads.

But false sharing with respect to free lists seem less damning, since by the time some freed object is recycled, it isn't going to stay in the other CPU's cache, and certainly it shouldn't be getting writes from there.

cbloom said...

The issue with false sharing due to allocators is when small blocks are handed out to different threads.

Say you have 8-byte allocations like

[A B C D]

in a row in memory (32 bytes).

Initially the allocator hands out A,B,C,D to thread 1.

The thread 1 frees B and D but keeps A & C. Thread 1 then frees a bunch of other stuff.

tcmalloc sees all that freed memory and decides to reclaim it and hand it out to thread 2 who is now doing lots of allocations.

Thread 2 does some 8 byte allocations and gets handed B & D.

Now Thread 1 works on A & C and Thread 2 works on B & D , and we have very bad false sharing.

BTW I hate the term "false sharing" , I should just write "interthread cache line contention".

I think it's reasonable to demand an allocator that *never* gives you any false sharing as long as you use objects on the same thread that allocates them. Both Hoard and the MS LFH satisfy this constraint, but I believe tcmalloc does not, because it works on freelists, not free slabs.

(I haven't actually tested tcmalloc so take all this with a grain of salt)

Autodidactic Asphyxiation said...

OK, I see. I'm guessing tcmalloc shouldn't be hard to test, though, and if it is then it is a bug.

nothings said...

The idea of a fixed-size slab allocator kind of scares me, because it can induce so much overhead in the worst case... although I'm not sure how likely it is to approach the worst case (which is if, say, 90% of a given size of allocation are short term, and 10% are long term).

The overhead of the page-table is pretty huge, especially if you never use that much memory, which is another big source of overhead for tcmalloc-style allocators. I was just thinking that for something like G, we could *enforce* a model where the client *has* to pass in a fixed block of memory and we only allocate out of that (since enough people want totally predictable memory usage, and that seems the best way to do it). If you know up front you're starting with a single fixed block of memory, then you can just size the page table to match and the overhead is a lot more bearable. Like, if you're going to hand a G-like library 2MB to run from, that's only 512 4KB pages, so the page table entries could be 8 bytes and the overhead would still be less than 1%. (But that's few enough pages and there are enough variable-sized structures that the page-per-size for each multiple of 4 smaller than 256 means you could easily eat up 64 pages just on every unique size, which is part of why I didn't go this route for rrMallocGeneric.)

cbloom said...

"The idea of a fixed-size slab allocator kind of scares me, because it can induce so much overhead in the worst case... "

Well, the overhead is (# of size bins) * (size of slab). It depends on your application domain whether that is big for you. Basically it doesn't make a lot of sense unless your number of allocations is reasonably large.

"The overhead of the page-table is pretty huge, especially if you never use that much memory, which is another big source of overhead for tcmalloc-style allocators."

Well tcmalloc is not actually page based; it has its own kind of worse possible case similar to the SGI freelist allocators. But yeah, the big problem with the page table is that you want it to size more proportionally to the actual number of pages used. Your example of a small fixed-size chunk is a good use case for page tables I think. If you make a page table for a large address space, but then only actually allocate a tiny bit, your overhead percent is arbitrarily high.


"page-per-size for each multiple of 4 smaller than 256 "

BTW nobody actually does it like this, it's more like logarithmic, steps of 4, then steps of 8, then steps of 16. So you might only have 16 bins for sizes <= 256 (instead of 64 if you did one for each multiple of 4).

nothings said...

I'm pretty sure when I read the source to tcmalloc, all small allocations got dedicated pages to just that size, with a page table thing to indicate the size of the thing, so I'm not sure why you're saying it's not page based. (It doesn't keep separate pages per thread, though.)

I'm also pretty sure when I read through the tcmalloc source it wasn't logarthimic-y, it was in strict multiples of 8 or 16. But I might be just misremembering it due to having re-read dlmalloc more recently (which definitely has distinct free lists per size in fine granularity, but that's not a big deal since it's not fixed-size slabs, it's basically a way of implementing best-fit efficiently). Or it might have changed since I read it.

old rants