1/16/2009

01-16-09 - Virtual Memory

So I just had kind of a weird issue that took me a while to figure out and I thought I'd write up what I learned so I have it somewhere. (BTW I wrote some stuff last year about VirtualAlloc and the zeroer.)

The problem was this Oodle bundler app I'm working on was running out of memory at around 1.4 GB of memory use. I've got 3 GB in my machine, I'm not dumb, etc. I looked into some things - possible virtual address space fragmentation? No. Eventually by trying various allocation patterns I figured it out :

dwAllocationGranularity

On Windows XP all calls to VirtualAlloc get rounded up to the next multiple of 64k. Pages are 4k - and pages will actually be allocated to your process on 4k granularity - but the virtual address space is reserved in 64k chunks. I don't know if there's any fundamental good reason for this or if it's just a simplification for them to write a faster/smaller allocator because it only deals with big aligned chunks.

Anyway, my app happened to be allocating a ton of memory that was (64k + 4k) bytes (there was a texture that was exactly 64k bytes, and then a bit of header puts you into the next page, so the whole chunk was 68k). With VirtualAlloc that actually reserves two 64k pages, so you are wasting almost 50% of your virtual address space.

NOTE : that blank space you didn't get in the next page is just *gone*. If you do a VirtualQuery it tells you that your region is 68k bytes - not 128k. If you try to do a VirtualAlloc and specify an address in that range, it will fail. If you do all the 68k allocs you can until VirtualAlloc returns NULL, and then try some more 4k allocs - they will all fail. VirtualAlloc will never give you back the 60k bytes wasted on granularity.

The weird thing is there doesn't seem to be any counter for this. Here are the TaskMgr & Procexp reading meanings :

TaskMgr "Mem Usage" = Procexp "Working Set"

This is the amount of memory whose pages are actually allocated to your app. That means the pages have actually been touched! Note that pages from an allocated range may not all be assigned.

For example, if you VirtualAlloc a 128 MB range , but then only go and touch 64k of it - your "Mem Usage" will show 64k. Those pointer touches are essentially page faults which pull pages for you from the global zero'ed pool. The key thing that you may not be aware of is that even when you COMMIT the memory you have not actually got those pages yet - they are given to you on demand in a kind of "COW" pattern.

TaskMgr "VM Size" = Procexp "Private Bytes"

This is pretty simple - it's just the amount of virtual address space that's COMMITed for your app. This should equal to the total "Commit Charge" in the TaskMgr Performance view.

ProcExp "Virtual Size" =

This one had me confused a bit and seems to be undocumented anywhere. I tested and figured out that this is the amount of virtual address space RESERVED by your app, which is always >= COMMIT. BTW I'm not really sure why you would ever reserve mem and not commit it, or who exactly is doing that, maybe someone can fill in that gap.

Thus :

2GB >= "Virtual Size" >= "Private Bytes" >= "Working Set".

Okay, that's all cool. But none of those counters shows that you have actually taken all 2 GB of your address space through the VirtualAlloc granularity.

ADDENDUM : while I'm explaining mysteriously named counters, the "Page File Usage History" in Performance tab of task manager has absolutely nothing to do with page file. It's just your total "Commit Charge" (which recall the same as the "VM Size" or "Private Bytes"). Total Commit Charge is technically limited by the size of physical ram + the size of the paging file. (which BTW, should be zero - Windows runs much better with no paging file).


To be super clear I'll show you some code and what the numbers are at each step :


int main(int argc,char *argv[])
{

    lprintf("UseAllMemory...\n");

    vector<void *>  mems;
    
    #define MALLOC_SIZE     ((1<<16) + 4096)
    
    lprintf("reserving:\n");
    
    uint32 total = 0;
    
    for(;;)
    {       
        void * ptr = VirtualAlloc( NULL, MALLOC_SIZE , MEM_RESERVE, PAGE_READWRITE );
        
        if ( ! ptr )
        {
            break;
        }
        
        total += MALLOC_SIZE;
        mems.push_back(ptr);
        
        lprintf("%d\r",total);
    }
    lprintf("%u\n",total);

    lprintf("press a key :\n");
    getch();

This does a bunch of VirtualAlloc reserves with a stupid size. It prints :

UseAllMemory...
reserving:
1136463872
press a key :

The ProcExp Performance tab shows :

Private Bytes : 2,372 K
Virtual Size : 1,116,736 K
Working Set : 916 K

Note we only got around 1.1 GB. If you change MALLOC_SIZE to be a clean power of two you should get all 2 GB.

Okay, so let's do the next part :



    lprintf("comitting:\n");
    
    for(int i=0;i < mems.size();i++)
    {
        VirtualAlloc( mems[i], MALLOC_SIZE, MEM_COMMIT, PAGE_READWRITE );
    }
    
    lprintf("press a key :\n");
    getch();

We committed it so we now see :

Private Bytes : 1,112,200 K
Virtual Size : 1,116,736 K
Working Set : 2,948 K

(Our working set also grew - not sure why that happened, did Windows just alloc a whole bunch? It would appear so. It looks like roughly 128 bytes are needed for each commit).

Now let's actually make that memory get assigned to us. Note that it is implicity zero'ed, so you can read from it any time and pull a zero.


    lprintf("touching:\n");
    
    for(int i=0;i < mems.size();i++)
    {
        *( (char *) mems[i] ) = 1;
    }

    lprintf("press a key :\n");
    getch();

We now see :

Private Bytes : 1,112,200 K
Virtual Size : 1,116,736 K
Working Set : 68,296 K

Note that the Working Set is still way smaller than the Private Bytes because we have only actually been given one 4k page from each of the chunks that we allocated.

And wrap up :


    lprintf("freeing:\n");
    
    while( ! mems.empty() )
    {
        VirtualFree( mems.back(), 0, MEM_RELEASE );
        
        mems.pop_back();
    }   

    lprintf("UseAllMemory done.\n");

    return 0;
}


For background now you can go read some good links about Windows Virtual memory :

Page table - Wikipedia - good intro/background
RAM, Virtual Memory, Pagefile and all that stuff
PAE and 3GB and AWE oh my...
Mark's Blog : Pushing the Limits of Windows Virtual Memory
Managing Virtual Memory in Win32
Chuck Walbourn Gamasutra 64 bit gaming
Brian Dessent - Re question high virtual memory usage
Tom's Hardware - My graphics card stole my memory !

I'm assuming you all basically know about virtual memory and so on. It kind of just hit me for the first time, however, that our problem now (in 32 bit aps) is the amount of virtal address space. Most of us have 3 or 4 GB of physical RAM for the first time in history, so you actually cannot use all your physical RAM - and in fact you'd be lucky to even use 2 GB of virtual address space.

Some issues you may not be aware of :

By default Windows apps get 2 GB of address space for user data and 2 GB is reserved for mapping to the kernel's memory. You can change that by putting /3GB in your boot.ini , and you must also set the LARGEADDRESSAWARE option in your linker. I tried this and it in fact worked just fine. On my 3 GB work system I was able to allocated 2.6 GB to my app. HOWEVER I was also able to easily crash my app by making the kernel run out of memory. /3GB means the kernel only gets 1 GB of address space and apparently something that I do requires a lot of kernel address space.

If you're running graphics, the AGP window is mirrored into your app's virtual address space. My card has 256MB and it's all mirrored, so as soon as I init D3D my memory use goes down by 256MB (well, actually more because of course D3D and the driver take memory too). There are 1GB cards out there now, but mapping that whole video mem seems insane, so they must not do that. Somebody who knows more about this should fill me in.

This is not even addressing the issue of the "memory hole" that device mapping to 32 bits may give you. Note that PAE could be used to map your devices above 4G so that you can get to the full 4G of memory, if you also turn that on in the BIOS, and your device drivers support it; apparently it's not recommended.

There's also the Address Windowing Extensions (AWE) stuff. I can't imagine a reason why any normal person would want to use that. If you're running on a 64-bit OS, just build 64-bit apps.

VirtualQuery tells me something about what's going on with granularity. It may not be obvious from the docs, but you can call VirtualQuery with *ANY* pointer. You can call VirtualQuery( rand() ) if you want to. It doesn't have to be a pointer to the base of an allocation range. From that pointer it gives you back the base of the allocation. My guess is that they do this by stepping back through buckets of size 64k. To make 2G of ram you need 32k chunks of 64k bytes. Each chunk has something like MEMORY_BASIC_INFORMATION, which is about 32 bytes. To hold 32k of those would take 1 MB. This is just pure guessing.

SetSystemFileCacheSize is interesting to me but I haven't explored it.

Oh, some people apparently have problems with DLL's that load to fixed addresses fragmenting virtual memory. It's an option in the DLL loader to specify a fixed virtual address. This is naughty but some people do it. This could make it impossible for you to get a nice big 1.5 GB virtual alloc or something. Apparently you can see the fixed address in the DLL using "dumpbin.exe" and you can modify it using "rebase.exe"

ADDENDUM : I found a bunch of links about /3GB and problems with Exchange Server fragmenting virtual address space. Most interestingly to me these links also have a lot of hints about the way the kernel manages the PTE's (Page Table Entries). The crashes I was getting with /3GB were most surely running out of PTE's ; apparently you can tell the OS to make more room for PTE's with the /USERVA flag. Read here :

The number of free page table entries is low, which can cause system instability
How to Configure the Paged Address Pool and System Page Table Entry Memory Areas
Exchange Server memory management with 3GB, USERVA and PAE
Clint Huffman's Windows Performance Blog Free System Page Table Entries (PTEs)


I found this GameFest talk by Chuck Walkbourn : Why Your Windows Game Won�t Run In 2,147,352,576 Bytes that covers some of these same issues. In particular he goes into detail about the AGP and memory mirroring and all that. Also in Vista with the new WDDM apparently you can make video-memory only resources that don't take any app virtual address space, so that's a pretty huge win.

BTW to be clear - the real virtual address pressure is in the tools. For Oodle, my problem is that to set up the paging for a region, I want to load the whole region, and it can easily be > 2 GB of content. Once I build the bundles and make paging units, then you page them in and out and you have nice low memory use. It just makes the tools much simpler if they can load the whole world and not worry about it. Obviously that will require 64 bit for big levels.

I'm starting to think of the PC platform as just a "big console". For a console you have maybe 10 GB of data, and you are paging that through 256 MB or 512 MB of memory. You have to be careful about memory use and paging units and so on. In the past we thought of the PC as "so much bigger" where you can be looser and not worry about hitting limits, but really the 2 GB Virtual Address Space limit is not much bigger (and in practice it's more like 1.5 GB). So you should think of the PC as have a "small" 1 GB of memory, and you're paging 20 GB of data through it.

4 comments:

castano said...

Isn't 256MB your AGP aperture size? You should be able to change that in the BIOS. That memory does not mirror the GPU memory, but extends it. OK, one section of the aperture size is also used for write-combining, which does mirror video memory, but not all of it. One of the advantages of PCIe is that this memory is allocated on demand.

nothings said...

I ranted somewhere (possibly in the comments of rrSystemMalloc) about the fact that VirtualAlloc doesn't make any mention of dwAllocGranularity affecting it; it's only mentioned in the win32 docs for some OTHER function (I forget which, now).

Sly said...

Something usefull I've found around this 3.x gb memory limit in WinXP and PAE:
There's a few ramdrive driver that are able to create themselves using PAE, in the >3.x gb hidden&useless memory.

One of which is Gavotte Ramdisk: http://www.badongo.com/file/12113169

I now have a 1gb pagefile + 3gb of memory, and at last the PS3 builds are not killing my PC. I can even surf the web during these devil compilation, woohoo!

kimberly said...
This comment has been removed by a blog administrator.

old rants