Good article on Windows file caching . The Windows file cache architecture is pretty sweet really. Files are cached in 256k blocks. The caching is also file "buffering". That is, file IO in the application doesn't get any buffer just for that app - you get access to the cache pages, that is your buffer. It means multiple apps on the same file share buffers. This is all pretty sweet. The caching is just a file mapping of pages in the system address space. When you create a memory mapped file, you are just getting direct access to those pages, which you can then lock to get pointers in your address space. You can lock & unlock the pages just like any pages using VirtualAlloc and VirtualFree. When you write to pages, you can control them getting flushed out the write cache by unlocking them.
Windows file caching is *extremely* greedy. It will basically throw away anything it has cached in order to cache the latest request. This makes it look very good in benchmarks, but can be pretty retarded in practice. For example, I was testing by doing linear streaming IO across a 2 GB file. Windows caches every single byte of the 2 GB file so that future touches to it are instant; in order to do that, it will evict the baby and the bathwater from the cache (including stuff like the cache of the NTFS MFT table).
One thing I haven't been able to figure out is a way to manually control the prefetching that windows does. When you do a bunch of ReadFile calls in a row, the cache will asynchronously prefetch ahead the next 256k chunk. This means that even if you just call the synchronous ReadFile (or fread) to get bits of the file, the return is almost immediate. What I would like, however, is a way to manually say "prefetch at this byte". The only way I can do that at the moment is to issue a read at that byte on a thread, but that's not really ideal, it would be better to be able to signal the underlying cache directly.
Also it appears that if you use memory mapped files, windows stops doing the smart async prefetch ahead. Say you memory map a huge file, then you run through streaming touching pages in order. Windows uses the virtual page fault from the memory manager to interrupt and pull in pages from disk. It seems to not prefetch because you get a lot of faults and your app just stalls when you get to pages that aren't in memory yet. Memory mapped files are pretty awesome, it's cool that you just get access to the cache, but without the ability to request prefetches they're pretty worthless since the plain old ReadFile beats them so badly. (if you are in fact doing totally random file accesses on a huge file, then they are just about perfect).
Stdio in MSVC has a 4k buffer and then calls straight to ReadFile to fill it. The 4k buffer isn't really providing you any "buffering" it just hides function call overhead and makes the reads page aligned; the real buffering is inside ReadFile.
If you're linking with the multithreaded libs, stdio becomes outrageously slow. Personally I like my threading model to work such that individual objects, such as a "FILE" are not protected from concurrent use by the library. That is, once the API hands out an atomic object to the app, it is up to the app if it wants to protect that object from multiple threads accessing it or not. The library should only do protection from access to shared/global data structures. (I know they can't do this with stdio because they want to make sure it's safe, and they want to protect stdout/stdin/stdierr). Anyway, the result is that "getc" goes from being around 10 clocks to being around 170 clocks. Fortunately, you can fix that pretty easily by just bringing back the macro yourself. Just define this :
#define macrogetc(_stream) (--(_stream)->_cnt >= 0 ? ((U8)*(_stream)->_ptr++) : _filbuf(_stream))
And use it to do your IO when you know the FILE is only being used by one thread.
Something I just discovered today : while Windows buffering is in fact very good for reads, it sucks for writes. If I write out a stream of data through a buffered file, I only get around 25 MB/sec. If I write out chunks with async non-buffered calls, I get 100 MB/sec. (conversely, buffered streaming input is actually faster than doing a bunch of async non-buffered reads).