9/14/2010

09-14-10 - Threaded Stdio

Many people know that fgetc is slow now because we have to link with multithreaded libs, and so it does a mutex and all that. Yes, that is true, but the solution is also trivial, you just use something like this :

#define macrogetc(_stream)     (--(_stream)->_cnt >= 0 ? ((char)*(_stream)->_ptr++) : _filbuf(_stream))

when you know that your FILE is only being used by one thread.

In general there's a nasty issue with multithreaded API design. When you have some object that you know might be accessed from various threads, should you serialize inside every access to that object? Or should you force the client to serialize on the outside?

If you serialize on the inside, you can have severe performance penalties, like with getc.

If you serialize on the outside, you make it very prone to bugs because it's easy to access without protection.

One solution is to introduce an extra "FileLock" object. So the "File" itself has no accessors except "Lock" which gives you a FileLock, and then the "getc" is on the FileLock. That way somebody can grab a FileLock and then locally do fast un-serialized access through that structure. eg:


instead of :

c = File.getc();

you have :

FileLock fl = File.lock();

c = fl.getc();

Another good solution would be to remove the buffering from the stdio FILE and introduce a "FileView" object which has a position and a buffer. Then you can have mutliple FileViews per FILE which are in different threads, but the FileViews themselves must be used only from one thread at a time. Accesses to the FILE are serialized, accesses to the FileView are not. eg :


eg. FILE * fp is shared.

Thread1 has FileView f1(fp);
Thread2 has FileView f2(fp);

FileView contains the buffer

you do :

c = f1.getc();

it can get a char from the buffer without synchronization, but buffer refill locks.

(of course this is a mess with readwrite files and such; in general I hate dealing with mutable shared data, I like the model that data is either read-only and shared, or writeable and exclusively locked by one thread).

(The FileView approach is basically what I've done for Oodle; the low-level async file handles are thread-safe, but the buffering object that wraps it is not. You can have as many buffering objects as you want on the same file on different threads).

Anyway, in practice, macrogetc is a perfectly good solution 99% of the time.

Stdio is generally very fast. You basically can't beat it except in the trivial way (trivial = use a larger buffer). The only things you can do to beat it are :

1. Double-buffer the streaming buffer filling and use async IO to fill the buffer (stdio uses synchronous IO and only fills when the buffer is empty).

2. Have an accessor like Peek(bytes) to the buffer that just gives you a pointer directly into the buffer and ensures it has bytes amount of data for you to read. This eliminates the branch to check fill on each byte input, and eliminates the memcpy from fread.

(BTW in theory you could avoid another memcpy, because Windows is doing IO into the disk cache pages, so you could just lock those and get a pointer and process from them directly. But they don't let you do this for obvious security reasons. Also if you knew in advance that your file was in the disk cache (and you were never going to use it again so you don't want it to get into the cache) you could do uncached IO, which is microscopically faster for data not in the cache, because it avoids that memcpy and page allocation. But again they don't let you ask "is this in the cache" and it's not worth sweating about).

Anyway, the point is you can't really beat stdio for byte-at-a-time streaming input, so don't bother. (on Windows, where the disk cache is pretty good, eg. it does sequential prefetching for you).

6 comments:

castano said...

Instead of macrogetc you could also use getc_nolock (on windows) or getc_unlocked (on unix).

On unix you can easily build a FileLock using flockfile/funlockfile. I suppose there must be something like that on windows too.

cbloom said...

Interesting!

It looks like it's "_fgetc_nolock" in my version of the VC headers.

VC also has

#define _CRT_DISABLE_PERFCRIT_LOCKS

and

_lock_file()

it looks like these appeared in VC2005

nothings said...

Anyway, the point is you can't really beat stdio for byte-at-a-time streaming input, so don't bother. (on Windows, where the disk cache is pretty good).

Yeah, but this comes up because I write portable code (both my free libs and my work code are supposed to be ultra-portable).

I mean, I'm getting bug reports in stb_image from people who compilers that don't allow "//" comments in C, because they're on some exotic embedded platform. And stb_image is where I had this problem and had to write my own buffering layer.

I don't think reaching into the hidden innards of FILE is viable, and I don't think fgets_nolock() or whatever are viable portable solutions. Maybe the new Cxx will add this, but it'll be another decade before we can assume it for max-portability.

As to whether you can go any faster and the whole disk page cacheing memcpy thing, well, obviously there's always the possibility of mmap.

cbloom said...

Yeah portability is just a huge disaster in general. The solution to that is for people to stop buying anything but Intel/Windows , which is the one true platform. I blame all of you who like Linux and iPhones and game consoles for my pain. Quit buying that shit.

"I don't think reaching into the hidden innards of FILE is viable"

It's about as viable as relying on getc_nolock ;) Actually maybe more viable since I know it works on all MSVC versions since VC6 and that's not true of nolock.

"obviously there's always the possibility of mmap."

Somebody always has to mention memmap. It's like the Loch Ness monster, the situation where mmap is actually fast, somebody thinks they saw it once but can never reproduce it. I dunno if it works better on UNIX but on Windows it's almost always slower. (not that that matters)

nothings said...

I definitely have seen speedups from using it on windows, I forget for what. Maybe my disk database, maybe the netflix prize stuff (although there it's also because you're avoiding needing separate memory for the disk cache).

Like, I always figured the speedup would be miniscule--just the savings of touching twice as much memory--but instead I actually saw significant, inexplicable speedups, as if e.g. windows was prefetching when I went through the mmap and not when I went through read/fread.

nothings said...

Actually, or maybe it was the same thing again -- maybe it was just because accesses didn't need to be locked and in the file-IO version I was doing fine-grained fread()s.

But it's not like mmap is portable, so.

old rants