7/13/2020

Robust Win32 IO

I see far too much code in production that does not use Win32 IO robustly. Some of the issues are subtle and tricky, but many of them just come down to checking error codes and return values. You cannot assume :
ReadFile(size) either successfully reads all "size" bytes, or fails mysteriously and we should abort
What you actually need to be handling is :
ReadFile(size)
succeeded but got less than size
failed but failed due to being already at EOF
failed but failed due to a temporary system condition that we should retry
succeeded but is not asynchronous the way we expected
succeeded and was asynchronous but then GetOverlapped result does not wait as we expected
failed but failed due to IO size being too big and we should cut it into pieces

In a surely pointless attempt to improve matters, I've tried to make easy to use clean helpers that do all this for you, so you can just include this code and have robust IO :

robustwin32io.zip

Even if you are being careful and checking all the error codes, some issues you may not be handling :

  • Cut large IOs into pieces. You may have memory allocated for your large IO buffer, but when you create an IO request for that, the OS needs to mirror that into the disk cache, and into kernel memory space for the IO driver. If your buffer is too large, that can fail due to running out of resources.

    (this is now rarely an issue on 64-bit windows, but was common on 32-bit windows, and can still happen on the MS consoles)

  • Retry IOs in case of (some) failures. One of the causes of IO failure if too many requests in the queue, for example if you are spamming the IO system generating lots of small request. If you get these failures you should wait a bit for the queue to drain out then retry.

  • Always call GetOverLappedResult(FALSE) (no wait) before GetOverLappedResult(TRUE) (wait) to reset the event. If you don't do this, GetOverLappedResult(TRUE) can return without waiting for the IO to return, causing a race against the IO. This behavior was changed in Windows 7 so this might not be necessary any more, but there's some dangerous behavior with the manual-reset Event in the OVERLAPPED struct. When you start an async IO it is not necessarily reset to unsignaled. When you GetOverLappedResult(TRUE) it is supposed to be waiting on an event that is signalled when the IO completes, but if the event was already set to signalled before you called, it will just return immediately.

    NOTE this is not the issue with trying to do GetOverLappedResult on the same OVERLAPPED struct from multiple threads - that is just wrong; access to the OVERLAPPED struct should be mutex protected if you will query results from multiple threads, and you should also track your own "is io async" status to check before calling GetOverLappedResult.

  • Always call SetLastError(0) before calling any Windows API and then doing GetLastError. See previous blog on this topic : 10-03-13 - SetLastError(0). This particular bug was fixed in Windows Vista (so a while ago), but I'm paranoid about it and it's harmless to do, so I still do it. GetLastError/SetLastError is just a variable in your thread-info-block, so it's only a few instructions to access it. It's best practice to always SetLastError(0) at the start of a sequence of operations, that way you know you aren't getting errors that were left over from before.

For example, here's how to call GetOverlappedResult : (only call if st == win32_io_started_async)

BOOL win32_get_async_result(HANDLE handle,
                            OVERLAPPED * data,
                            DWORD * pSize)
{
    // only call this if you got "win32_io_started_async"
    // so you know IO is actually pending
            
    DWORD dwSize = 0;
    
    // first check result with no wait
    //  this also resets the event so that the next call to GOR works :
    
    if ( GetOverlappedResult(handle,data,&dwSize,FALSE) )
    {
        if ( dwSize > 0 )
        {
            *pSize = (DWORD) dwSize;
            return true;
        }
    }   
    
    // if you don't do the GOR(FALSE)
    //  then the GOR(TRUE) call here can return even though the IO is not actually done
    
    // call GOR with TRUE -> this yields our thread if the IO is still pending
    if ( ! GetOverlappedResult(handle,data,&dwSize,TRUE) )
    {
        DWORD err = GetLastError();
        
        if ( err == ERROR_HANDLE_EOF )
        {
            if ( dwSize > 0 )
            {
                *pSize = (DWORD) dwSize;
                return true;
            }
        }
        
        return false;       
    }
        
    *pSize = (DWORD) dwSize;
    
    return true;    
}           

Get the code :

robustwin32io.zip

Also note that I don't recommend trying to do unbuffered writes on Windows. It is possible but complex and requires privilege elevation which puts up a UAC prompt, so it's not very usable in practice. Just do buffered writes. See also : 01-30-09 - SetFileValidData and async writing and 03-12-09 - ERROR_NO_SYSTEM_RESOURCES

No comments:

old rants