
01-05-09 - Paging

Somebody sent me this paper :

"I3D09 A Novel Page-Based Data Structure for Interactive Walkthroughs.pdf"

a little while ago. The basic ideas in it are totally fine, but they page on *4K* granularity. That's so completely bone-headed (on disks where seek time dominates) that it invalidates the whole paper.

There's a natural "critical size" that you can figure for paging, which is the only possible combination of your fundamental variables. (This is sort of a physics-style argument - there's only one way you can combine your variables and get the right units, therefore it must be the right answer, up to a constant multiplier.)

Your two variables inherent to the system are seek time and throughput speed of your disk. You combine them thusly :


typical hard disk :

seek time 10 ms
load speed 50 MB/sec

10 ms = X / (50 MB/sec)

(10/1000) sec = X / (50 MB/sec)

X = (10*50/1000) MB

X = 0.5 MB


typical DVD :

seek time 100 ms
load speed 5 MB/sec

X = 100*5/1000 MB = 0.5 MB

Pleasantly, the critical paging unit size comes out about the same for both DVD and HD !! Note that I'm not saying 0.5 MB is actually the ideal size; there may be some constant on there, so maybe it's 0.1 MB or 1.0 MB, but it's certainly somewhere close to that range (the exact optimum depends on a lot of things, like how important latency vs. throughput is to you, etc.).
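
To make the units explicit, here's the same dimensional-analysis computation as a tiny C snippet (the numbers are just the example figures above, not measurements of any particular drive):

#include <stdio.h>

/* critical paging size = seek time * throughput ; the only combination
   of the two disk parameters that has units of bytes */
static double critical_page_size_MB( double seek_time_ms, double throughput_MB_per_sec )
{
    return ( seek_time_ms / 1000.0 ) * throughput_MB_per_sec;
}

int main( void )
{
    printf( "HD  : %.2f MB\n", critical_page_size_MB(  10.0, 50.0 ) ); /* 0.50 MB */
    printf( "DVD : %.2f MB\n", critical_page_size_MB( 100.0,  5.0 ) ); /* 0.50 MB */
    return 0;
}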

Of course, Solid State Disks change that in the future. They are really just like an extra level of slower RAM, and they make fine-scale paging practical. The possibilities for infinite geometry & infinite texture are very exciting with SSD's. Unfortunately you won't be able to target that as mainstream for a long time.

BTW : the Giga-Voxels paper is sort of interesting. The basic method of using a tree with indirections to voxel bricks is an old idea (see earlier papers) - the main addition in the GigaVoxels paper is the bit to have the GPU pass back the information about what new pages need to be loaded, so that the render gives you the update information and it's just one pass. That bit is really ugly IMO with the current GPU architecture, but it's progress anyway. Personally I still kinda think voxel rendering is a red herring and surface reps are the way to go for the foreseeable future.

Jeff had a good idea for streaming loads a while ago. A lot of people play a Bink movie when they do level loads. During that time you're hitting the IO system hard for your level data and also for your movie load. Right now that causes a lot of seeking and requires big buffering for the movie data and so on. With Oodle doing the IO for Bink and for your level load, we could create an interleaved load stream that has like [200 ms of Bink data][some level load data][200 ms of Bink data][ ... ] ; that eliminates all seeking and lets the DVD just jam fast.
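
A minimal sketch of what that interleaved layout could look like - this is not the real Bink or Oodle API, just hypothetical names to illustrate alternating ~200 ms movie chunks with level-load chunks so the drive never seeks between the two streams:

#include <stdio.h>
#include <stddef.h>

typedef struct { const unsigned char * data; size_t size; } Chunk;

/* write [movie chunk][level chunk][movie chunk][level chunk]... so playback
   and the level load both stream linearly off one contiguous run of disk */
static void write_interleaved( FILE * out,
                               const Chunk * movie, int num_movie,
                               const Chunk * level, int num_level )
{
    int m = 0, l = 0;
    while ( m < num_movie || l < num_level )
    {
        if ( m < num_movie ) { fwrite( movie[m].data, 1, movie[m].size, out ); m++; }
        if ( l < num_level ) { fwrite( level[l].data, 1, level[l].size, out ); l++; }
    }
}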

In fact, in the ideal Oodle future that might not even be a special case. The Bink data could just be split into packets, and then it could all just be handed to the generic Oodle packet profiler & layout system, and Oodle would automatically lay out the packets to minimize total load time. This all probably won't be in Oodle V1 , but it's good to keep a picture in my head of where we want to be some day.

7 comments:

Tom Forsyth said...

0.5MB assumes that all seeks are average case. If you have a decent-sized queue of seeks, things like elevator algorithms help reduce the average quite a bit - even without relying on disk locality (not sure how practical disk locality is unless you defragment your HDD quite often).
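
(For illustration, a minimal elevator-style pass is just a sort of the pending requests by offset before issuing them - assuming logical offsets track physical position, which as Charles notes below is increasingly dubious:)

#include <stdlib.h>

typedef struct { long long offset; size_t size; } ReadRequest;

static int cmp_offset( const void * a, const void * b )
{
    long long oa = ((const ReadRequest *)a)->offset;
    long long ob = ((const ReadRequest *)b)->offset;
    return (oa > ob) - (oa < ob);
}

/* one SCAN pass : service queued reads in increasing offset order ;
   requests behind the head just wait for the next pass */
static void elevator_order( ReadRequest * reqs, int count )
{
    qsort( reqs, count, sizeof(ReadRequest), cmp_offset );
}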

Annoyingly I believe DVDs have a flatter profile - there's a constant amount of time after any seek to refocus & find which track they're on and other arcane stuff I don't really understand. So while there's certainly variability according to seek amount, it's not as large as with HDDs. I seem to recall that the constant time is about the same as a full-traverse seek, i.e. your worst seek time is only 2x your best seek.

The other factor is memory under-utilisation - if your blocks are too big, you fill up memory with junk, so less fits in it, and you have to do more reads - which means more seeks.

Also, just because their finest granularity is 4kbytes, doesn't mean they always load single 4kbyte chunks (maybe it does - I haven't read the paper yet). When demand-streaming textures on an Xbox 1, I had a granularity of 64bytes, but of course I'd load an entire series of mipmaps at any one time - I never did a single transfer that small.

So there's a balance. Anyway, 4kbytes - yeah, too small. We at Larrabee HQ believe that 64kbytes is right about the sweet spot at the moment.

cbloom said...

The fact is, Windows caches reads in 256k chunks, so anything smaller than 256k is a little weird and silly (basically you're paging inside a paging system which is already paging on 256k granularity).

But yeah, there's sort of two competing factors:

larger chunks = faster throughput
smaller chunks = better usage of memory for what you really need it for (high relevance of bits)

So there's obviously a key issue of how much working memory you have, which also affects ideal block size; with a very small working space (like Xbox 1) you need to page on fine granularity so that you use your available memory well, while with a larger working space you should page on bigger chunks, which gives you higher throughput.

Anonymous said...

That value of seek time * bandwidth is the cost of a seek in terms of unused bandwidth. I don't think it's related to the optimal paging unit.

MSN

Anonymous said...

...the cost of a seek in terms of unused bandwidth...

Short pages will waste bandwidth, though, in a way that long pages won't. The "optimal" size Charles suggests is at the knee of the curve; as your pages get smaller, you start to waste massively more bandwidth, whereas pages that are larger have similar amounts of bandwidth wastage.

But, yeah, if you don't need the bandwidth, you might use the smaller pages to get you finer granularity for some reason, e.g. what's mentioned above for Xbox1. This is also why the situation with the DVD may not be the same as that with the hard drive in practice; your bandwidth to the hard drive is so much higher, you may not need it all.

But the point is that 4KB is far, far removed from the critical point. At 10 ms seeks, you only get 100 reads per second, so a 4KB page size gives you only 400 KB/sec of bandwidth to the hard drive, i.e. less than 1%. So the point of the "critical size" for pages is that anybody writing a paper using a 4KB page size seems just totally removed from reality; you don't need to actually measure anything to realize how ineffective that size is.
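
To put numbers on that knee, here's the effective throughput if every page costs one full seek - effective = page_size / (seek_time + page_size/raw_speed) - using the 10 ms / 50 MB/sec hard disk from the post:

#include <stdio.h>

static double effective_MB_per_sec( double page_KB, double seek_ms, double raw_MB_per_sec )
{
    double page_MB = page_KB / 1024.0;
    double seconds = seek_ms / 1000.0 + page_MB / raw_MB_per_sec;
    return page_MB / seconds;
}

int main( void )
{
    printf( "4 KB   : %5.2f MB/sec\n", effective_MB_per_sec(    4.0, 10.0, 50.0 ) ); /* ~0.39 , under 1% */
    printf( "64 KB  : %5.2f MB/sec\n", effective_MB_per_sec(   64.0, 10.0, 50.0 ) ); /* ~5.6 */
    printf( "512 KB : %5.2f MB/sec\n", effective_MB_per_sec(  512.0, 10.0, 50.0 ) ); /* ~25 , half of raw */
    printf( "4 MB   : %5.2f MB/sec\n", effective_MB_per_sec( 4096.0, 10.0, 50.0 ) ); /* ~44 */
    return 0;
}

The critical size (0.5 MB here) is exactly where you recover half the raw throughput; below it the curve falls off fast, above it you're just trading memory for the remaining half.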

Tom Forsyth said...

"so a 4KB page size gives you only 400KB bandwidth to the hard drive"

Nonono - only if you only ever read a single page per seek. Which is crazy. Nobody should ever do that. Every time you do a "seek", you should aggressively grab adjacent data. And as Charles points out, Windows actively prevents dumb apps from doing this by growing requests to 256kbytes whether they like it or not.

Again, your granularity does not have to have anything whatsoever to do with your average access size.

Anonymous said...

The only thing that I've seen that remotely mimics truly random reads is audio, and even most of that can be clustered with other things.

MSN

cbloom said...

"
Nonono - only if you only ever read a single page per seek. Which is crazy. Nobody should ever do that. Every time you do a "seek", you should aggressively grab adjacent data. And as Charles points out, Windows actively prevents dumb apps from doing this by growing requests to 256kbytes whether they like it or not.

Again, your granularity does not have to have anything whatsoever to do with your average access size."

Yeah, though this is more true for DVD than HD these days.

Trying to elevator-sort seeks for HDs has become very hard, since most of the new large disks have aliased clusters in various ways; you can't easily tell whether you are seeking forward or back without being very intimate with the disk and the file system.

If you're not memory limited, on Windows you should probably just go ahead and do the 256k reads yourself, rather than doing smaller reads (even 64k) and letting Windows group them for you. Like, if my small read is getting promoted to 256k, then I can do a better job myself of figuring out what data should be in that read.
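
A minimal sketch of doing that on Windows (buffered, synchronous, error handling omitted) - just align the request down to a 256k boundary and issue the whole 256k yourself:

#include <windows.h>

#define READ_CHUNK (256*1024)

/* read the 256k-aligned chunk of the file containing byte 'pos' */
static BOOL read_chunk_256k( HANDLE file, unsigned __int64 pos, void * buffer, DWORD * bytes_read )
{
    LARGE_INTEGER li;
    li.QuadPart = (LONGLONG)( pos & ~(unsigned __int64)(READ_CHUNK - 1) );
    if ( ! SetFilePointerEx( file, li, NULL, FILE_BEGIN ) )
        return FALSE;
    return ReadFile( file, buffer, READ_CHUNK, bytes_read, NULL );
}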
