9/30/2011

09-30-11 - Don't use memset to zero

A very common missed optimization is letting the OS zero large chunks of memory for you.

Everybody just writes code like this :


U32 * bigTable = malloc(20<<20);
memset(bigTable,0,20<<20);

but that's a huge waste. (eg. for large hash table on small files the memset can dominate your time).

Behind your back, the operating system is actually running a thread all the time as part of the System Idle Process which grabs free pages and writes them with zero bytes and puts them on the zero'ed page list.

When you call VirtualAlloc, it just grabs a page from the zeroed page list and hands it to you. (if there are none available it zeroes it immediately).

!!! Memory you get back from VirtualAlloc is always already zeroed ; you don't need to memset it !!!

The OS does this for security, so you can never see some other app's bytes, but you can also use it to get zero'ed tables quickly.

(I'm not sure if any stdlib has a fast path to this for "calloc" ; if so that might be a reason to prefer that to malloc/memset; in any case it's safer just to talk to the OS directly).

ADDENDUM : BTW to be fair none of my string matchers do this, because other people's don't and I don't want to win from cheap technicalities like that. But all string match hash tables should use this.

7 comments:

  1. http://stackoverflow.com/questions/2688466/why-mallocmemset-slower-than-calloc

    It looks like glibc likely uses virtual memory tricks for calloc. This isn't surprising since it will use mremap for large, aligned reallocs.

    ReplyDelete
  2. Relying on the OS to clear pages is very expensive if it results in a pagefault on every initial page acess. It's much cheaper to use hot pages already in your heap than to map new ones in that case, even if you have to memset them yourself.

    memset is not bad if you're going to use the memory soon afterwards anyway, because it just warms the cache for you.
    (BTW, offline page zeroing daemons are often a bad idea because of this; they zero the pages then let them cool down before using them; much better to zero the page just before you're going to use it.)

    The benchmark in the stackoverflow question is completely stupid because it doesn't actually touch the memory after calloc. If it did, it would be about the same as the memset one because of all the pagefaults.

    ReplyDelete
  3. "memset is not bad if you're going to use the memory soon afterwards anyway, because it just warms the cache for you."

    This is only true for tiny allocations. And I'm clearly not talking about tiny allocations.


    BTW It's particularly compelling for uses like "cache tables" ala LZRW or LZP matching, because you might allocate a 512MB cache table, and then large sections of it might never be touched at all. (especially on small files)

    ReplyDelete
  4. Fast match-tables can be small enough to fit into the stack, since they are designed to use L1 cache.

    In such case, i'm not sure memset() can be avoided. Stack memory is typically not zeroed by OS.

    ReplyDelete
  5. Sure sure; but if it fits in L1 it's tiny tiny. You can't rely on L1 being bigger than 32k or so, and you have to share that with lots of other stuff, so you can only fit maybe a 4k table reliably in L1.

    Also, I try to always write code that doesn't use the stack because unfortunately other platforms are not as stack-friendly as Windows.

    ReplyDelete
  6. Also, if you look at the example in the original post it's a malloc of 20 MB. I'm really talking about large tables here.

    ReplyDelete