3/12/2009

03-12-09 - ERROR_NO_SYSTEM_RESOURCES

ERROR_NO_SYSTEM_RESOURCES is the fucking devil.

So far as I can tell, the only people in the history of the universe who have actually pushed the Windows IO system really hard are me and SQL Server. When I go searching around the web for my problems they are always in relation to SQL Server.

I don't have a great understanding of this problem yet. Hopefully someone will chime in with a better link. This is what I have found so far :

You receive error 1450 ERROR_NO_SYSTEM_RESOURCES when you try to create a very large file in Windows XP
SystemPages Core Services
Sysinternals Forums - not enough resources problem - Page 1
Overlapped WriteFile fails with code 1450 [Archive] - CodeGuru Forums
Novell Eclipse FTK file io
How to use the userva switch with the 3GB switch to tune the User-mode space to a value between 2 GB and 3 GB
GDI Usage - Bear
Error Message ERROR_NO_SYSTEM_RESOURCES (1450)
Download details Detection, Analysis, and Corrective Actions for Low Page Table Entry Issues
Counter of the Week Symptoms Lack of Free System Page Table Entries (PTEs) and Error Message ERROR_NO_SYSTEM_RESOURCES (1450)
Comparison of 32-bit and 64-bit memory architecture for 64-bit editions of Windows XP and Windows Server 2003

Basically the problem looks like this :

The Windows kernel has a bunch of internal fixed-size buffers. It has fixed-size (or small max-size) buffers for Handles, for the "Paged Pool" and "Non-Paged Pool", oh and for PTEs (page table entries). You can cause these resources to run out at any time and then you start getting weird errors. The exact limits are unknowable, because they are affected by what other processes are running, and also by registry settings and boot.ini settings.

I could make the error go away by playing with those settings to give myself more of a given resource, but of course you can't expect consumers to do that, so you have to work flawlessly in a variety of circumstances.

In terms of File IO, this can hit you in a whole variety of crazy ways :

1. There's a limit on the number of file handles. When you try to open a file you can get an out-of-resources error.

2. There's a limit on the number of Async ops pending, because the Kernel needs to allocate some internal resources and can fail.

3. There's a limit on how many pages of disk cache you can get. Windows secretly runs everything you do through the cache, so any IO op can fail because Windows couldn't allocate a page to the disk cache (even though you already have memory allocated in user space for the buffer). Note that this is even true to some extent if you use FILE_FLAG_NO_BUFFERING - there are a lot of subtleties to when you actually get to do direct IO, which I have written about before.

4. Even ignoring the disk cache issue, Windows has to mirror your memory buffer for the IO into kernel address space. I guess this is because the disk drivers talk to kernel memory, so your user virtual address has to be moved to kernel space for the disk to fill it. This can fail if the kernel can't find a block of kernel address space.

5. When you are sure that you are doing none of the above, you can still run into some other mysterious shit about the kernel failing to allocate internal pages for its own book-keeping of IOs. This is error 1450 (0x5AA), ERROR_NO_SYSTEM_RESOURCES.

The errors you may see are :

ERROR_NOT_ENOUGH_MEMORY = too many async IOs pending
Solution : wait until some finish and try again

ERROR_NOT_ENOUGH_QUOTA = single IO call too large
Solution : break large IOs into many smaller ones (but then beware the above)

ERROR_NO_SYSTEM_RESOURCES = failure to alloc pages in the kernel address space for the IO
Solution : ???
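
The table above can be turned into a small dispatch on the value of GetLastError(). This is only a sketch: the error values are the ones from winerror.h (redefined here so it compiles without the Windows headers), while the io_action enum and classify_io_error are names made up for illustration. Mapping ERROR_NO_SYSTEM_RESOURCES to "sleep and retry" is the hack workaround, not a real fix.

```c
/* Win32 error values as found in winerror.h. They are redefined here only so
   the sketch builds without <windows.h>. */
#ifndef ERROR_NOT_ENOUGH_MEMORY
#define ERROR_NOT_ENOUGH_MEMORY   8UL
#endif
#ifndef ERROR_NO_SYSTEM_RESOURCES
#define ERROR_NO_SYSTEM_RESOURCES 1450UL
#endif
#ifndef ERROR_NOT_ENOUGH_QUOTA
#define ERROR_NOT_ENOUGH_QUOTA    1816UL
#endif

/* Hypothetical set of reactions; not part of any Windows API. */
typedef enum {
    IO_GIVE_UP,           /* some other error: report it to the caller      */
    IO_WAIT_FOR_PENDING,  /* too many async ops in flight: drain some first */
    IO_SPLIT_REQUEST,     /* single request too large: break it up          */
    IO_SLEEP_AND_RETRY    /* kernel out of resources: back off, try again   */
} io_action;

/* Map an error code (as returned by GetLastError) to a reaction. */
static io_action classify_io_error(unsigned long err)
{
    switch (err) {
    case ERROR_NOT_ENOUGH_MEMORY:   return IO_WAIT_FOR_PENDING;
    case ERROR_NOT_ENOUGH_QUOTA:    return IO_SPLIT_REQUEST;
    case ERROR_NO_SYSTEM_RESOURCES: return IO_SLEEP_AND_RETRY;
    default:                        return IO_GIVE_UP;
    }
}
```

In real code the argument would be the result of GetLastError() taken right after a failed ReadFile or WriteFile call.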

So I have made sure I don't have too many handles open. I have made sure I don't have too many IO ops pending. I have made sure my IO ops are not too big. I have done all that, and I still randomly get ERROR_NO_SYSTEM_RESOURCES depending on what else is happening on my machine. I sort of have a solution, which seems to be the standard hack solution around the net - just sleep for a few millis and try the IO again. Eventually it magically clears up and works.
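
A minimal sketch of that sleep-and-retry hack, assuming the IO attempt and the sleep are passed in as callbacks so the loop itself has no Windows dependency. All the names here (retry_on_no_resources, io_attempt_fn, sleep_ms_fn, flaky_io) are hypothetical, and the doubling delay is my own choice, not something the kernel asks for. The flaky_io pieces at the bottom just simulate an operation that fails a few times before clearing up.

```c
#ifndef ERROR_NO_SYSTEM_RESOURCES
#define ERROR_NO_SYSTEM_RESOURCES 1450UL  /* value from winerror.h */
#endif

/* One IO attempt: returns 0 on success, otherwise a Win32 error code. */
typedef unsigned long (*io_attempt_fn)(void *ctx);
/* Sleep for the given number of milliseconds (e.g. a wrapper around Sleep). */
typedef void (*sleep_ms_fn)(unsigned ms);

/* Retry only while the attempt reports ERROR_NO_SYSTEM_RESOURCES; any other
   result, success or failure, is returned right away. The delay doubles
   after each failure, capped at max_delay_ms. */
static unsigned long retry_on_no_resources(io_attempt_fn attempt, void *ctx,
                                           sleep_ms_fn sleep_ms,
                                           unsigned max_tries,
                                           unsigned first_delay_ms,
                                           unsigned max_delay_ms)
{
    unsigned long err = ERROR_NO_SYSTEM_RESOURCES;
    unsigned delay = first_delay_ms;
    unsigned i;

    for (i = 0; i < max_tries; i++) {
        err = attempt(ctx);
        if (err != ERROR_NO_SYSTEM_RESOURCES)
            return err;
        if (i + 1 < max_tries) {          /* no point sleeping after the last try */
            sleep_ms(delay);
            delay = (delay * 2 > max_delay_ms) ? max_delay_ms : delay * 2;
        }
    }
    return err;
}

/* Stand-ins for demonstration: an "IO" that fails a set number of times. */
typedef struct { unsigned failures_left; unsigned calls; } flaky_io;

static unsigned long flaky_attempt(void *ctx)
{
    flaky_io *io = (flaky_io *)ctx;
    io->calls++;
    if (io->failures_left > 0) {
        io->failures_left--;
        return ERROR_NO_SYSTEM_RESOURCES;
    }
    return 0;
}

static unsigned total_slept_ms;
static void fake_sleep(unsigned ms) { total_slept_ms += ms; }
```

In a real program the attempt callback would issue the ReadFile or WriteFile and return GetLastError() on failure, and the sleep callback would call Sleep. Capping the number of tries matters, because if a driver is leaking pool the error may never clear.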

BTW while searching for this problem I found this code snippet : Novell Eclipse FTK file io . It's quite good. It's got a lot of the little IO magic that I've only recently learned, such as using "SetFileValidData" when extending files for async writes, and it also has a retry loop for ERROR_NO_SYSTEM_RESOURCES.

Further investigation reveals that this problem is not caused by me at all - the kernel is just out of paged pool. If I do a very small IO (64k or less) or if I do non-overlapped IO, or if I just wait and retry later, I can get the IO to go through. Oh, and if you use no buffering, that also succeeds.
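
One way to act on the "very small IO" observation is to fall back to issuing the request as a series of pieces of 64k or less, one at a time. The sketch below assumes the actual transfer is done by a callback; io_in_small_chunks, chunk_io_fn and chunk_log are hypothetical names, and the 64k figure is just the size that happened to work for me, not a documented limit. Because the pieces go out one after another, this does not pile up pending async ops.

```c
#include <stddef.h>

#define SMALL_IO_BYTES ((size_t)64 * 1024)  /* assumed "safe" size, see text */

/* Transfer `bytes` bytes at file position `offset`.
   Returns 0 on success, otherwise a Win32 error code. */
typedef unsigned long (*chunk_io_fn)(void *ctx, unsigned long long offset,
                                     size_t bytes);

/* Issue one large request as sequential pieces of at most SMALL_IO_BYTES.
   *done receives the number of bytes completed, even on failure. */
static unsigned long io_in_small_chunks(chunk_io_fn do_chunk, void *ctx,
                                        unsigned long long offset,
                                        size_t total, size_t *done)
{
    *done = 0;
    while (*done < total) {
        size_t n = total - *done;
        unsigned long err;
        if (n > SMALL_IO_BYTES)
            n = SMALL_IO_BYTES;
        err = do_chunk(ctx, offset + *done, n);
        if (err != 0)
            return err;   /* caller can retry starting at offset + *done */
        *done += n;
    }
    return 0;
}

/* Stand-in for demonstration: records what was requested, transfers nothing. */
typedef struct { unsigned calls; size_t largest; } chunk_log;

static unsigned long log_chunk(void *ctx, unsigned long long offset,
                               size_t bytes)
{
    chunk_log *log = (chunk_log *)ctx;
    (void)offset;
    log->calls++;
    if (bytes > log->largest)
        log->largest = bytes;
    return 0;
}
```

For example, a 200k request through log_chunk comes out as four calls (three of 64k and one of 8k).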

2 comments:

kevin77 said...

I get the ERROR_NO_SYSTEM_RESOURCES (1450) after opening a 445 MB file, seeking to some offset within that file, and then trying to read 4 bytes. I have an open ticket with Microsoft now to try and resolve the problem, but it appears the "solution" will more than likely be to increase the PagedPoolSize registry setting as described at http://support.microsoft.com/kb/312362

cbloom said...

Probably you have a buggy device driver on your system which is leaking paged pool. There's a kernel-memory debug monitoring tool you can use to investigate who is allocating memory, I forget what it's called.
