Comments on cbloom rants: 04-06-09 - Multi-threaded Allocators

---
Anonymous (2009-04-10 00:10):

I'm pretty sure that when I read the source to tcmalloc, all small allocations got pages dedicated to just that size, with a page-table thing to indicate the size of the allocation, so I'm not sure why you're saying it's not page based. (It doesn't keep separate pages per thread, though.)

I'm also pretty sure that when I read through the tcmalloc source it wasn't logarithmic-y; it was in strict multiples of 8 or 16. But I might just be misremembering it, due to having re-read dlmalloc more recently (which definitely has distinct free lists per size at fine granularity, but that's not a big deal, since it's not fixed-size slabs; it's basically a way of implementing best-fit efficiently). Or it might have changed since I read it.

---
cbloom (2009-04-09 13:59):

"The idea of a fixed-size slab allocator kind of scares me, because it can induce so much overhead in the worst case..."

Well, the overhead is (# of size bins) * (size of slab). It depends on your application domain whether that is big for you.
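[To put a rough number on that bound, here is a back-of-envelope sketch. The bin count, slab size, and thread count are illustrative assumptions, not any real allocator's configuration: each size bin can pin at most one mostly-empty slab, so the worst-case waste is just the product.]

```python
# Worst-case memory pinned by a fixed-size slab allocator:
# each size bin can hold one nearly empty slab.
# (Illustrative numbers, not any real allocator's configuration.)

NUM_BINS = 16       # e.g. logarithmic size classes up to 256 bytes
SLAB_SIZE = 4096    # one 4 KB page per slab

worst_case_waste = NUM_BINS * SLAB_SIZE
print(worst_case_waste)            # 65536 bytes, i.e. 64 KB per heap

# With per-thread slab caches, the bound scales with thread count:
THREADS = 8
print(THREADS * worst_case_waste)  # 524288 bytes, i.e. 512 KB
```

[64 KB is negligible against a multi-megabyte heap but painful for a tiny embedded arena, which is the "depends on your application domain" point.]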
Basically it doesn't make a lot of sense unless your number of allocations is reasonably large.

"The overhead of the page-table is pretty huge, especially if you never use that much memory, which is another big source of overhead for tcmalloc-style allocators."

Well, tcmalloc is not actually page based; it has its own kind of worst possible case, similar to the SGI freelist allocators. But yeah, the big problem with the page table is that you want it to scale in proportion to the number of pages actually used. Your example of a small fixed-size chunk is a good use case for page tables, I think. If you make a page table for a large address space but then only actually allocate a tiny bit, your overhead percentage is arbitrarily high.

"page-per-size for each multiple of 4 smaller than 256"

BTW, nobody actually does it like this; it's more like logarithmic: steps of 4, then steps of 8, then steps of 16. So you might have only 16 bins for sizes <= 256 (instead of the 64 you'd get with one bin for each multiple of 4).

---
Anonymous (2009-04-09 13:47):

The idea of a fixed-size slab allocator kind of scares me, because it can induce so much overhead in the worst case... although I'm not sure how likely it is to approach the worst case (which is if, say, 90% of allocations of a given size are short-term and 10% are long-term).

The overhead of the page table is pretty huge, especially if you never use that much memory, which is another big source of overhead for tcmalloc-style allocators.
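[To make the "arbitrarily high overhead" point concrete, here is a sketch of a flat page table's fixed cost versus memory actually allocated. The sizes are illustrative assumptions: a 32-bit address space, 4 KB pages, and 4-byte table entries.]

```python
# A flat page table covering the whole address space is paid for up
# front, so its overhead percentage depends entirely on how much you
# actually allocate. (All sizes here are illustrative assumptions.)

PAGE = 4096
ADDRESS_SPACE = 1 << 32   # 32-bit address space
ENTRY = 4                 # bytes per page-table entry

table_bytes = (ADDRESS_SPACE // PAGE) * ENTRY
print(table_bytes)        # 4194304 bytes: 4 MB, always resident

def overhead_percent(bytes_allocated):
    return 100.0 * table_bytes / bytes_allocated

print(overhead_percent(256 << 20))  # 1.5625  -> fine if you use 256 MB
print(overhead_percent(64 << 10))   # 6400.0  -> absurd if you use 64 KB
```

[Sizing the table to a known fixed arena instead of the whole address space is exactly the fix discussed below for a library handed a fixed block of memory.]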
I was just thinking that for something like G, we could *enforce* a model where the client *has* to pass in a fixed block of memory and we only allocate out of that (since enough people want totally predictable memory usage, and that seems the best way to do it). If you know up front that you're starting with a single fixed block of memory, then you can size the page table to match, and the overhead is a lot more bearable. Like, if you're going to hand a G-like library 2MB to run from, that's only 512 4KB pages, so the page-table entries could be 8 bytes each and the overhead would still be less than 1%. (But that's few enough pages, and there are enough variable-sized structures, that page-per-size for each multiple of 4 smaller than 256 means you could easily eat up 64 pages just on unique sizes, which is part of why I didn't go this route for rrMallocGeneric.)

---
won3d (2009-04-08 11:00):

OK, I see. I'm guessing tcmalloc shouldn't be hard to test, though, and if it is then that is a bug.

---
cbloom (2009-04-08 09:21):

The issue with false sharing due to allocators arises when small blocks are handed out to different threads.

Say you have 8-byte allocations like

[A B C D]

in a row in memory (32 bytes).

Initially the allocator hands out A, B, C, D to thread 1.

Thread 1 frees B and D but keeps A & C.
Thread 1 then frees a bunch of other stuff.

tcmalloc sees all that freed memory, decides to reclaim it, and hands it out to thread 2, which is now doing lots of allocations.

Thread 2 does some 8-byte allocations and gets handed B & D.

Now thread 1 works on A & C and thread 2 works on B & D, and we have very bad false sharing.

BTW, I hate the term "false sharing"; I should just write "interthread cache-line contention".

I think it's reasonable to demand an allocator that *never* gives you any false sharing as long as you use objects on the same thread that allocated them. Both Hoard and the MS LFH satisfy this constraint, but I believe tcmalloc does not, because it works on freelists, not free slabs.

(I haven't actually tested tcmalloc, so take all this with a grain of salt.)

---
won3d (2009-04-07 22:15):

While tcmalloc certainly gets a lot more exercise on Linux, I wouldn't go so far as to call it a bastard stepchild. Maybe the sexual deviant daughter. Anyway, the Chromium folks are experimenting with tcmalloc. As you would guess, it is a performance win but a space-efficiency loss.

Could you talk more about false sharing? Normally I hear about it when two threads (say that different threads are always on different CPUs in this discussion) write two different pieces of memory that happen to share a cache line. Then writes to that cache line make things slow.
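[The cache-line arithmetic behind the A/B/C/D hand-out above can be checked directly. This sketch assumes 64-byte cache lines and a made-up, line-aligned base address:]

```python
# Why recycled neighbors cause interthread cache-line contention:
# four 8-byte blocks carved from one 32-byte run all land on a
# single cache line. (64-byte lines and the base address are
# illustrative assumptions.)

LINE = 64

def line_of(addr):
    return addr // LINE

base = 0x1000                        # line-aligned start of the run
A, B, C, D = (base + 8 * i for i in range(4))

# All four blocks map to the same cache line...
assert len({line_of(x) for x in (A, B, C, D)}) == 1

# ...so if thread 1 keeps A and C while the allocator recycles B and
# D to thread 2, every write by either thread forces the line out of
# the other core's cache, even though no byte is actually shared.
print(line_of(A))  # 64
```

[A slab-ownership allocator avoids this by construction: a whole slab (hence every cache line in it) is handed out to one thread at a time.]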
People get this when they do naive things like allocating an array of accumulators and distributing them to different threads.

But false sharing with respect to free lists seems less damning, since by the time some freed object is recycled, it isn't going to stay in the other CPU's cache, and it certainly shouldn't be getting writes from there.

---
cbloom (2009-04-07 17:55):

The impression I got was that the Linux version was well tested, but the Windows version was the bastard stepchild.

Plus it seems to me that the method used in tcmalloc is always going to be more of a problem for false sharing.

(I also don't see anything about tcmalloc reclaiming free pages when all the small blocks on a page are freed?)

---
won3d (2009-04-07 17:46):

I think tcmalloc is supposed to build on VS2003 (at least 32-bit) if you build tcmalloc_minimal. You might want to be aware of this patch:

http://code.google.com/p/google-perftools/issues/detail?id=120#c6

FWIW, folks care about tcmalloc, so it gets some attention. You can expect it to steadily improve over time, despite the fact that it has some funky issues on Windows ATM.

---
cbloom (2009-04-07 10:00):

Thanks.
I wish people would write a bit more about their algorithms; I don't want to have to dig around in code if I can avoid it.

There's also jemalloc and Intel TBB's malloc, both of which are supposed to be good and neither of which has much description.

There's also "Miser", which is an implementation of Hoard.

---
NeARAZ (2009-04-07 01:04):

While you're on the subject of multithreaded allocators, it might make sense to take a look at nedmalloc (http://www.nedprod.com/programs/portable/nedmalloc/).