4/15/2009

04-15-09 - Oodle Page Cache

So I'm redoing the low level file IO part of Oodle. Actually I may be retargetting a lot of Oodle. One thing I took from GDC and something I've been thinking about a long time is how to make Oodle simpler. I don't want to be making a big structure that you have to buy into and build your game on. Rather I want to make something "leafy" that's easy for people to plug in at the last minute to solve specific problems.

Pursuant to that, the new idea is to make Oodle a handful of related pieces. You can use one or more of the pieces, and each is easy to plug in at the last minute.

1. Async File IO ; the new idea with this is that it's cross platform, all nice and async, does the LZ decompression on a thread, can do DVD packaging and DVD emulation, can handle the PS3/Xenon console data transfers - but it just looks like regular files to you. This is less ambitious than the old system ; it no longer directly provides things like paging data in & out, or hot-loading artist changes; you could of course still do those things but it leaves it more up to the client to do that.

The idea is that if you just write your game on the PC using lazy loose file loads, boom you pop in Oodle and hardly touch the code at all, and you automatically get your files packed up nice and tight into like an XBLA downloadable pack, or a DVD for PS3, or whatever, and it's all fast and good. Oh, and it also integrates nicely with Bink and Miles and Granny so that you can things like play a Bink video while loading a level, and the data streamers share the bandwidth and schedule seeks correctly.

2. Texture goodies. We'll provide the most awesome threaded JPEG decoders, and also probably a better custom lossy and custom lossless texture compressors that are specifically designed for modern games (with features like good alpha-channel support, support for various bit depths and strange formats like X16Y16 , etc.). Maybe some nice DXTC realtime encoding stuff and quality offline encoding stuff. Maybe also a whole custom texture cache thing, so you can say Oodle use 32 MB for textures and do all the paging and decompression and such.

3. Threading / Async utilities. You get the threaded work manager, the thread profiler, we'll probably do "the most awesome" multithreaded allocator. We'll try to give these specific functions that address something specific that people will need to ship a game. eg. a customer is near ship and their allocator is too slow and taking too much memory so they don't fit in the 256 MB of the console. Boom plug in Oodle and you can ship your game.

Anyway, that's just the idea, it remains to be worked out a bit. One thing I'm definitely doing is the low level IO is now going through a page cache.

As I'm writing it I've been realizing that the page cache is a super awesome paradigm for games in general these days. Basically the page cache is just like an OS virtual memory manager. There's a certain limited amount of contiguous physical memory. You divide it into pages, and dynamically assign the pages to the content that's wanted at the time.

Now, the page cache can be used just for file IO, and then it's a lot like memory mapped files. The client can use the stdio look-alike interface, and if they do that, then the page cache just automatically does the "right thing", prefetching ahead pages as they sequentially read through a file, etc.

But since we're doing this all custom and we're in a video game environment where people are willing to get more manual and lower to the bone, we can do a lot more. For example, you can tell me whether a file should sequential prefetch or not. You can manually prefetch at specific spots in the file that you expect to jump to. You can prefetch whole other files that you haven't opened yet. And perhaps most importantly, you can assign priorities to the various pages, so that when you are in a low memory situation (as you always are in games), the pages will be used for the most important thing. For example you can prefetch the whole next file that you expect to need, but you would do that at very low priority so it only uses pages if they aren't needed for anything more urgent.

The next awesome thing I realized about the page cache is that - hey, that can just be the base for the whole game memory allocator. So maybe you give 32 MB to page cache. That can be used for file IO, or video playback - or maybe you want to use it to pop up your in game "pause menu" GUI. Or say you want to stream in a compressed file - you map pages to read in the packed bits, and then you map pages as you decompress; as you decompress you toss the pages with the packed bits and leave the uncompressed pages in the cache.

Say you have some threaded JPEG decompress or something - it grabs a page to decompress to. The other cool thing is because all this is manual - it can grab the page with a certain priority level. If no page is available, it can do various things depending on game knowledge to decide how to respond to that. For example if it's just decompressing a high res version of a texture you already have in low res, it can just abort the decompress. If it's a low priority prefetch, it can just go to sleep and wait for a page to become available. If it's high priority, like you need this texture right now, that can cause other pages that are less important to get dropped.

Pages could also cache procedurally generated textures, data from a network, optional sounds, etc. etc.

I think about it this way - in the future of multicore and async and threading and whatnot, you might have 100 jobs to run per frame. Some of those jobs need large temp work memory. You can't just statically assign memory to various purposes, you want it to be used by the jobs that need it in different ways.

4/06/2009

04-06-09 - Multi-threaded Allocators

I've been reading a bit about Multi-threaded Allocators. A quick survery :

The canonical base allocator is Hoard . Hoard is a "slab allocator"; it takes big slabs from the OS and assigns each slab to a fixed-size block. In fact it's extremely similar to my "Fixed Restoring" allocator that's been in Galaxy3 forever and we shipped in Stranger. The basic ideas seem okay, though it appears to use locks when it could easily be lock-free for most ops.

One thing I really don't understand about Hoard is the heuristic for returning slabs to the shared pool. The obvious way to do a multi-threaded allocator is just to have a slab pool for each thread. The problem with that is that if each thread does 1 alloc of size 8, every thread gets a whole slab for itself and the overhead is proportional to the number of threads, which is a bad thing going into the massively multicore future. Hoard's heuristic is that it checks the amount of free space in a slab when you do a free. If the slab is "mostly" free by some measure, it gets returned to the main pool.

My problem with this is it seems to have some very bad cases that IMO are not entirely uncommon. For example a usage like this :


for( many )
{
    allocate 2 blocks
    free 1 block
}

will totally break Hoard. From what I can tell what hoard will do is allocate you a slab for your thread the first time you go in, then when you free() it will be very empty, so it will return the slab to the shared pool. Then after that when you allocate you will either get from the shared pool, or pull the slab back. Very messy.

There are two questions : 1. how greedy is new slab creation? That is, when a thread allocates an object of a given size for the first time, does it immediately get a brand new slab, or do you first give it objects from a shared slab and wait to see if it does more allocs before you give it its own slab. 2. how greedy is slab recycling? eg. when a thread frees some objects, when do you give the slab back to the shared pool. Do you wait for the slab to be totally empty, or do you do it right away.

The MS "Low Fragmentation Heap" is sort of oddly misnamed. The interesting bit about it is not really the "low fragmentation", it's that it has better multi-threaded performance. So far as I can tell, the LFH is 99.999% identical to Hoard. See MS PPT on the Low Fragmentation Heap

We did a little test and it appears that the MS LFH is faster than Hoard. My guess is there are two things going on : 1. MS actually uses lock-free linked lists for the free lists in each slab (Hoard uses locks), and 2. MS makes a slab list *per processor*. When a thread does an alloc it gets the heap for the processor it's on. They note that the current processor is only available to the kernel (KeGetCurrentProcessorNumber ). Hoard also makes P heaps for P processors (actually 2P), but they can't get current processor so they use the thread Id and hash it down to P which leads to occasional contention and collisions.

The other canonical allocator is tcmalloc . I still haven't been able to test tcmalloc because it doesn't build on VS2003. tcmalloc is a bit different than Hoard or LFH. Instead of slab lists, it uses traditional SGI style free lists. There are free lists for each thread, and they get "garbage collected" back to the shared pool.

One issue I would be concerned about with tcmalloc is false sharing. Because the free lists get shared back to a global pool and then redistributed back to threads, there is no real provision to prevent little items from going to different threads, which is bad. Hoard and LFH don't have this problem because they assign whole slabs to threads.

The Hoard papers make some rather ridiculous claims about avoiding false sharing, however. The fact is, if you are passing objects between threads, then no general purpose allocator can avoid false sharing. The huge question for the allocator is - if I free an object on thread 1 that was allocated on thread 2 , should I put it on the freelist for thread 1, or on the freelist for thread 2 ? One or the other will make false sharing bad, but you can't answer it unless you know the usage pattern. (BTW I think tcmalloc puts it on thread 1 - it uses the freelist of the thread that did the freeing - and Hoard puts it on thread 2 - the freelist of the thread that did the allocation; neither one is strictly better).

Both tcmalloc and the LFH have high memory overhead. See here for example . They do have better scalability of overhead to high thread counts, but the fact remains that they may hold a lot of memory that's not actually in use. That can be bad for consoles.

In fact for video games, what you want in an allocator is a lot of tweakability. You want to be able to tweak the amount of overhead it's allowed to have, you want to be able to tweak how fast it recycles pages for different block types or shares them across threads. If you're trying to ship and you can't fit in memory because your allocator has too much overhead, that's a disaster.

BTW false sharing and threaded allocators are clearly a place that a generational copying garbage collector would be nice. (this does not conflict with my contention that you can always be faster by doing very low level manual allocation - the problem is that you may have to give tons of hints to the allocator for it to do the right thing). With GC it can watch the usage and be smart. For example if a thread does a few allocs, they can come from a global shared pool. It does some more allocs of that type, then it gets its own slab and the previous allocs are moved to that slab. If those objects are passed by a FIFO to another thread, then they can be copied to a slab that's local to that thread. Nice.

It seems clear to me that the way to go is a slab allocator. Nice because it's good for cache-coherence and false sharing. The ops inside a slab can be totally thread-local and so no need to worry about multithread issues. Slabs are large enough that slab management is rare and most allocations only take the time do inside-slab work.

One big question for me is always how to get from an object to its owning slab (eg. how is free implemented). Obviously it's trivial if you're okay with adding 4 bytes to every allocation. The other obvious way is to have some kind of page table. Each slab is page-aligned, so you take the object pointer and truncate it down to slab alignment and look that up. If for example you have 32 bit pointers (actually 31), and you pages are 4k, then you need 2 to the 19 page table entries (512k). If a PTE is 4 bytes that's 2 MB of overhead. It's more reasonable if you use 64k pages, but that means more waste inside the page. It's also a problem for 64-bit pointers.

There are some other options that are a bit more hard core. One is to reserve a chunk of virtual address for slabs. Then whenever you see a free() you check if the pointer is in that range, and if so you know you can just round the pointer down to get the slab head. The problem is the reserving of virtual address.

Another option is to try to use pointer alignment. You could do something like make all large allocs be 4k aligned. Then all small allocs are *not* 4k aligned (this happens automatically because they come from inside a 4k page), and to get the base of the page you round down to the next lowest 4k.

04-06-09 - The Work Dispatcher

I thought I'd write a bit about my multithreaded "Worklet" dispatcher before I forget about it. I call little units of work "worklets" just because I like to be nonstandard and confusing.

The basic idea is that the main thread can at any time fire up a bunch of worklets. The worklets then go to a bunch of threads and get done. The main thread can then wait on the worklets.

There are a few things I do differently in the design from most people. The most common "thread pool" and "job swarm" things are very simple - jobs are just an isolated piece of independent work, and often once a bunch is fired you can only wait on them all being done. I think these are too limiting to be really generally useful, so I added a few things.

1. Worker threads that are not in use should go completely to sleep and take 0 cpu time. There should be a worker thread per processor. There might also be other threads on these processors, and we should play nice with arbitrary other programs running and taking some of the cpu time. Once a worker thread wakes up to do work, it should stay awake as long as possible and do all the work it can, that is, it shouldn't have to go to sleep and wait for more work to get fired so it can wake up again.

2. Worklets can have dependencies on other worklets. That way you can set up a dependency tree and fire it, and it will run in the right order. eg. if I want to run A, then B, then run C only after A & B are both done, you fire worklets {A}, {B} and {C: dependent on AB}. Dependencies can be evaluated by the worker threads. That's crucial because it means they don't need to stall and wait for the main thread to fire new work to them.

3. The main thread can block or check status on any Worklet. The main thread (or other game threads) might keep running along, and the worker threads may be doing the work, and we want to be able to see if they are done or not. In particular we don't want to just support the OpenMP style "parallel for" where we fork a ton of work and then immediately block the main thread on it - we want real asynchronous function calls. Often in games we'll want to make the main thread(s) higher priority than the worker threads, so that the worker threads only run in idle time.

The actual worker threads I implemented with a work-stealing scheme. It's not true work-stealing at all, because there is a concept of a "main thread" that runs the game and pushes work, and there's also the dependencies that need to be evaluated. All of the thread-thread communication is lock free. When the main thread adds new work items it just jams them onto queues. When the worker threads pop off work items they just pop them off queues. I do currently use a lock for dependency evaluation.

Traditional work stealing (and the main papers) are designed for operating system threads where the threads themselves are the ones making work. In that environment, the threads push work onto queues and then pop it off for themselves or steal from peers. There are custom special lock-free data structures designed for this kind of operation - they are fast to push & pop at one end, but also support popping at the other end (stealing) but more slowly. What I'm doing is not traditional work stealing. I have external threads (the "game threads") that do not participate in the work doing, but they can push work to the workers. In my world the workers currently can never make new work (that could be added if it's useful).

There are a lot of nice things about work stealing. One is you don't need a seperate dispatcher thread running all the time (which would hurt you with more context switches). Another is that workers who have cpu time can just keep jamming along by stealing work. They sort of do their own dispatching to themselves, so the threads that have the cpu do the work of the dispatching. It also offloads the dispatching work from the main thread. In my system, the workers do all work they can that's known to be depency-okay. Once that work is exhausted they reevaluate dependencies to see if more work can be done, so they do the dependency checking work for themselves off the main thread.

Another nice thing about work stealing is that it's self-balancing in the face of external activity. Anything running on a PC has to face lots of random CPU time being stolen. Even in a console environment you have to deal with the other threads taking variable amounts of time. For example if you have 6 cores, you want 6 workers threads. But you might also have 3 other threads, like a main thread, a gpu-feeding thread, and a sound thread. The main thread might usually take 90% of the cpu, so the worker on that core rarely gets any time, the gpu-feeder might usually take 50% of the cpu time on that thread, but in phases, like it takes 100% of the cpu for half the frame then goes idle for the other half. With work stealing your worker thread will automatically kick in and use that other time.

In order for the self balancing to work as well as possible you need small worklets. In fact, the possible wasted time is equal to the duration of the longest task. The time waste case happens like this :


Fire N work items , each taking time T to N cores
For some reason one of the cores is busy with something else so only (N-1) of them do work
Now you block on the work being done and have to wait for that busy core, so total time is 2T

Total time should have been N*T/(N-1)
For N large the waste approaches T

Basically the smaller T is (the duration of longest work) the more granular the stealing self-allocation is. Another easy way to see it is :

You are running N workers and a main thread
The main thread takes most of the time on core 0
You fire (N-1) very tiny work items and the (N-1) non-main cores pick them up
You fire a very large work item and core 0 worker picks it up

That's a disaster. The way to avoid it is to never fire single large work items - if you would have split that into lots of little work items it would have self-balanced, because the stealing nature means that only worker threads that have time take tasks.

For example, with something like a DXTC encoder, rather than split the image into something like N rectangles for N cores and fire off 1 work item to each core, you should go ahead and split it into lots of tiny blocks. Of course this requires that the per-work-item overhead is extremely low, which of course it is because we are all lock-free and goodness.

There are some things I haven't done yet that I might. One is to be a bit smarter about the initial work dispatching, try to assign to the CPU that will have the most idle time. If you actually did make nice tiny worklets all the time, that wouldn't be an issue, but that's not always possible. In the case that you do make large work items, you want those to go the cores that aren't being used by other threads in your game.

Another issue is the balance of throughput vs latency. That is, how fast does the system retire work, vs. how long does it take any individual work item to get through. Currently everything is optimized for throughput. Work is done in a roughly FIFO order, but with dependencies it's not gauranteed to be FIFO, and with the work stealing and variations in CPU Time assignment you can have individual work items that take a lot longer to get through the system than is strictly necessary. Usually this isn't a big deal, but sometimes you fire a Worklet and you need it to get done as quickly as possible. Or you might need to get done inside a certain deadline, such as before the end of the frame. For example you might fire a bunch of audio decompression, but set a deadline to ensure it's done before the audio buffers run out of decompressed data. Handling stuff like that in a forward-dispatched system is pretty easy, but in work-stealing it's not so obvious.

Another similar issue is when the main thread decides to block on a given work item. You want that item to get done as soon as possible by the thread that has the highest probability of getting a lot of CPU time. Again not easy with work-stealing since some worker thread may have that item in its queue but not be getting much CPU time for some reason.

3/23/2009

03-23-09 - Oodle GDI and Strings

Some random Oodle thoughts/questions :

My ThreadViewer is currently using GDI. It's pretty good, I like it. I did originally because I consider it a favor to the user. There are some nice things about using GDI instead of a 3D API - it plays nicer like a normal windows app, you know it gets Paint messages with rects and only repaints what it needs to, it can be dragged around and respects the user's paint while dragging or not request, and of course it doesn't gobble CPU when it's just sitting there. And of course it doesn't interfere with other 3D apps using the same device driver and memory and whatnot.

There are some disadvantages though. GDI is butt slow. If I draw much at all it really dogs. It's possible I could speed that up. For example I'm drawing individual lines and it might be faster to draw line-lists. I also wonder if it would be a win to batch up states. Right now I do SetPen/SetBrush and then draw, over and over, I could batch up draws with the same settings to avoid state changes. Another disadvantage is fancier blend modes and such are tricky or slow with GDI.

One option is to do my own 2d drawing by locking a DIB and using my own rasterizer. Probably crazy talk. The other option is just to be rude and use 3d.


The issue of strings in Oodle is quite interesting. The way you refer to resources is by name (string). Obviously the file names that I load have to be strings because that's what the OS uses. The main data structures that map resources to files are thus big string->string maps. (at some point I'm going to rewrite this to make it a bidirectional map, a database that can go resource->file or file->resource, which is just string<->string with hashes both ways).

Currently what I have is a ref-counted string that can optionally also just be a plain C-string. It's a 4-byte pointer and I stuff a flag in the bottom bit to indicate if it's got a ref-count or not. If not, it's assumed to point at const memory. If it does have a ref count, the count is right before the string data. Thus getting to the characters is always fast. I also almost always pass around a 4-byte hash with the string, which lets me avoid doing string compares in almost all cases. I'm usually passing around 8-byte (64 bit) "HashedString" objects. I only actually do a string compare if the hashes match but the pointers don't, which in my world is basically never. This is very good for the cache because it means you never actually follow the pointer to the string data.

One problem with this is that the strings can still wind up taking a lot of memory. The most obvious solution to that we've discussed before here. It's to explicitly use {Path/Name} instead of {Full Path}. That is, break off the directory from the file name and share the directory string for all files in the same dir. That would definitely be a big win and is one option.

Another option is to have a "Final mode" where the strings go away completely. The 4-byte hash would be kept, and instead of a 4-byte pointer to string data there would be a uniquifier to break hash ties. In order to make the right uniquifier you would have to have a full list of all the strings used in the game, which you of course do have with Oodle if your game is done. Then the uniquifier has to get baked into all the files, so there would have to be a "destring" pass over all the data that may or may not be reversible. This could be made relatively transparent to the client. One annoyance we had with this kind of thing at Oddworld is that once you bake out the strings, if you have any errors you get messages like "Resource 0x57EA10BF failed to load!". Still this would make the super-low-memory console wonks happy.

Another option is to combine the strings very aggressively. I already am running all the strings through a single global string table (the Game Dictionary that has the resource<->file map also makes all strings unique - this is a big win with the refcounted string - whenever you read a string from a file, rather than allocating a new buffer or that string you see if you have it already and just use the one you already have and bump the refcount). I'm not actually enforcing that to be required, but a lot of possibilities open up if you do require that.

For one thing, a "string" can just become an index or handle to the global string table. I almost never care about the contents of the string, I just use them as a key to find things, so if the char data is converted to a unique index, I can just compare index equality. Then, that index in the string table need not point at plain old char data. For example it could be an index to a leaf of a Trie. Each string gets added to the Trie and only unshared parts of the string make new chars. Each unique string is a leaf, those get labels and that's the handle that is given back to index the global string table.

Another option would be just some sort of LZ type compression. Some strings get full data, but other strings are just prefix matches, like "use the first N bytes from string M and then add these chars".

I think in the end I'll probably have to do the zero-strings option to please the masses. It also has the advantage of making your memory use more controlled and calculable. Like, you use 32 bytes per resource, not 32 bytes + enough to hold the strings whatever that is.

3/13/2009

03-13-09 - Automatic Prefs

I wrote briefly at molly about the way I do "hot vars" or "tweak vars". I don't really like them. I had a bunch in my GDC app for tweaking, but it means I have to edit the code to tweak things and I don't like that for various reasons. It's way better to have a real text pref file. There are just so many wins. I can source control it seperately. I can copy it off and save good snapshots of prefs I like for different purposes, like a development prefs vs. final run prefs. I can copy it and change it to make new instances. And I can edit it and have my final app load the changes without a recompile (hot var can do this too if you keep the C file around as "data")

Fortunately I have a prefs system in cblib, so I switched to that. But it made me realize that I'd really like to have an automatic pref system. Basically I want to write something like :


struct MyPref
{

int i = 7;
float x = 1.3;
String str = "hello world";
ColorDW color(200,50,177);
Vec3 v(3.4,0,1.7);

};

and have it automatically get IO to the pref file and construction with those values as defaults.

Now, that all is actually pretty easy if I just make some custom syntax and run a source code preprocess over the file before compiling. I could use a syntax like :


struct MyPref
{

int i; //$ = 7;
float x; //$ = 1.3;
String str; //$ = "hello world";
ColorDW color; //$ (200,50,177);
Vec3 v; //$ (3.4,0,1.7);

};

where anything with a //$ automatically becomes a "pref var" that gets IO'd and tweakability and so on.

The annoyance and something I haven't figured out is just how to deal with generated code in a build setting. Should the code generator stick the generated code back into the original file? Should it make another seperate C file on the side and put the generated code in there? Maybe all the generated code from all the C files in the whole project should go together in one big spot?

I dunno but it seems like a mess. Maybe the easiest thing to do would be to put the autogenerated code in the same file, and run the generator as a pre-build step.

It would also be annoying to have to put the code generator in as a pre-build step on every file one by one. So annoying as to be unacceptable actually. I would want it to automatically run on all my files any time I build, so that if I just go and put the autogen markup in any file it does the stuff and I don't have to open msdev options dialogs.

I know people out there are doing things like this, so I'm curious how you deal with the mod time issues and builds and source control.

BTW the autogenerated code will look something like this :


void MyPref::AutoGen_SetDefault()
{
    i = int(7);
    x = float(1.3);
    str = String("hello world");
    color =  ColorDW(200,50,177);
    v = Vec3(3.4,0,1.7);
}

template <typename functor>
void MyPref::AutoGen_Reflection()
{
    REFLECT(i);
    REFLECT(x);
    REFLECT(str);
    REFLECT(color);
    REFLECT(v);
}

pretty easy to generate just by some text movement. The big win of the shortened syntax is that you only have to write a variable once instead of 3 times.

03-13-09 - Paged Pool Leaks

So I tracked down the paged pool leak. It appears to be a bug in my ATI driver (or perhaps a bug in the DX interface that shows up as a leak in the driver). It's caused by locking POOL_MANAGED textures that are currently in the GPU push buffer. When you do that it causes the hardware texture to get aliased to avoid contention, and it appears something in there is leaking. I don't think that whole textures are leaking because the leak is pretty slow, it must just be some kind of texture book-keeping object (eg. maybe it actually is the "texture" struct, just not the actual surface bits).

Anyhoo, nobody cares about bugs that I see via Directx8, but what is interesting is the cool Poolmon utility.

Poolmon lets you see the kernel allocations and thus track down leaks.

You need to turn on some registry settings .

Run Poolmon in a command line with lots of vertical lines.

Use the keys "d" and "p" to get the view you want

Then track away.

3/12/2009

03-12-09 - Fixed Memory Pools

The Windows Kernel Paged Pool shit I'm dealing with is reminding me how much I hate fixed size memory pools.

I have fucking 3 Gigs of RAM in this machine and I'm using less than 1 G ! How can you be running out of memory !!!! Oh, because you have a fucking stupid fixed size page thing. Now, okay, maybe *MAYBE* the OS kernel is one case in which fixed size pools is a good thing.

It's common wisdom in game development that dynamic allocations in games are bad and that "mature" people use fixed size pools because it's more stable and safe from fragmentation and robust and so on. Hog wash!

It should be obvious why you would want variable memory allocation. Memory is one of our primary limiting factors these days, and it should be allocated to whatever needs it right now. When you have fixed pools it means that you are preventing the most important thing from getting memory in some case.

For example, your artists want to make a super high poly detailed background portion of the game with no NPC's. Oh, no, sorry, you can't do that, we're reserving that memory for 32 NPC's all the time even though you have none here. In another part of the game, the artists want to have super simple everything and then 64 NPC's. Oh no, sorry, you only get 32 even though you could run more because we're reserving space for lots of other junk that isn't in this part of the game.

Now, I'm not saying that budgets for artists is a bad thing. Obviously artists need clear guidelines about what will run fast enough and fit in memory. But having global fixed limits is a weak cop out way to do that.

Furthermore, recycling pools and maximum counts for spawned items is a perfectly reasonable thing to do. But I don't think of that as a way of dividing up the available memory - it's a way of preventing buggy art from screwing up the system, or just lazy artists from making mistakes. For example, having a maximum particle count doesn't mean you should go ahead and preallocate all those particles, cuz you might want to use that memory for something else in other cases (and of course the hard-fixed-size pool thing can

In general I'm not talking here about *dynamic* variation. I'm talking about *static* variation. Like anything that can be spawned or trigger from scripts or whatever, stuff that can be created by the player - that stuff should be premade. Anything that *could* exist at a given spot *should* exist. That way you know that no player action can crash you. Note that this really just a way to avoid testing all the combinatorics of different play possibilities.

By static variation I mean, in room 1 you might have resource allocation like {16 NPC's, 100 MB of textures} , in room 2 you might have {8 NPC's, 150 MB of textures}.

Fixed sized budgets is like if you partitioned your hard disk in half for programs and data. People used to do things like that, but we all now realize it's dumb, it's better just to have one big disk and that way you can change how you are using things as need arises.

Now, people sometimes worry about fragmentation. That may or may not be an issue for you. On Stranger on XBox it basically wasn't an issue because we had 64M or physical memory and 2G of virtual address space, so you have tons of slack. Again now with 64 bit pointers you have the same kind of safety and don't have to worry. Sadly, 32-bit Windows right now is actually in a really bad spot where the amount of physical memory roughly matches the address space, and we actually want to use most of that. That is fragmentation danger land.

However, doing variable size budgets doesn't necessarily increase your fragmentation at all. The only thing that would give you fragmentation is if you are dynamically allocating and freeing things of different sizes. Now of course you shouldn't do that !

One option is just to tear things all the way down and build them all the way back up for each level. That way you allocate {A,B,C,D} in order, then you free it all so you get back to empty {} , then next level you allocate {C,B,B,A,E} and there's no fragmentation worry. (if you tried to do a minimal transition between those sets by doing like -D +B+E then you could have problems).

Another option is relocatable memory. We did this for Stranger for the "Contiguous" physical memory. Even though virtual address fragmentation wasn't an issue, physical memory (for graphics) fragmentation was. But all our big resources were relocatable anyway because they were designed for paging. So when the contiguous memory got fragmented we just slid down the blogs to defrag it, just like you defrag a disk. Our resources were well designed for fast paging, so this was very fast - only a few thousand clocks to defrag the memory, and it only had to be done at paging area transitions.

Note that "relocatable resources" is kind of a cool handy thing to have in any case. It lets you load them "flat" into memory and then just rebase the whole thing and boom it's ready to use.

Personally after being a console dev and now seeing the shit I'm seeing with Oodle, I would be terrified of releasing a game on a PC. Even if you are very good about your memory use, your textures and VB's and so on create driver resources and you have no idea how big those are, and it will vary from system to system. The AGP aperture and the Video-RAM shadow eat out huge pieces of your virtual address space. The kernel has an unknown amount of available mem, and of course who knows what other apps are running (if it's a typical consumer machine it probably has antivirus and the whole MS bloatware installed and running all the time).

I don't see how you can use even 512 MB on a PC and get reliable execution. I guess the only robust solution is to be super scalable and not count on anything. Assume that mallocs or any system call can fail at any time, and downgrade your functionality to cope.

Now, certainly doing prereserved buckets and zero allocations and all that does have its merits, mainly in convenience. It's very easy as a developer to verify that what you're doing fits the "rules" - you just look at your allocation count and if it's not zero that's a bug.


It's just very frustrating when someone is telling you "out of memory" when you're sitting there staring at 1 GB of free memory. WTF, I have memory for you right here, please use it.

The other important thing is efficiency is irrelevant if you can't target it where you need it. Having a really efficient banana picking operation doesn't do you a lick of good when everyone wants apples. A lot of game coders miss the importance of this point. It's better to run at 90% efficiency or so, but be flexible enough to target your power at exactly what's needed at any moment.

Like with a fixed system maybe you can handle 100 MB of textures and 50 MB of geometry very efficiently. That's awesome if that's what the scene really needs. But usually that is not the exactly ideal use of memory for a given spot. Maybe some spot really only needs 10 MB of geometry data. You still only provide 100 MB of textures. I'm variable and can now provide 139 MB of textures. (I lose 1 MB due to overhead from being variable).

Many game devs see that and think "ZOMG you have 1 MB of overhead that's unacceptable". In reality, you have 39 MB less of the actual resource your artists want in that spot.

03-12-09 - ERROR_NO_SYSTEM_RESOURCES

ERROR_NO_SYSTEM_RESOURCES is the fucking devil.

So far as I can tell, the only people in the history of the universe who have actually pushed the Windows IO system really hard are me and SQL Server. When I go searching around the web for my problems they are always in relation to SQL Server.

I don't have a great understanding of this problem yet. Hopefully someone will chime in with a better link. This is what I have found so far :

You receive error 1450 ERROR_NO_SYSTEM_RESOURCES when you try to create a very large file in Windows XP
SystemPages Core Services
Sysinternals Forums - not enough resources problem - Page 1
Overlapped WriteFile fails with code 1450 [Archive] - CodeGuru Forums
Novell Eclipse FTK file io
How to use the userva switch with the 3GB switch to tune the User-mode space to a value between 2 GB and 3 GB
GDI Usage - Bear
Error Message ERROR_NO_SYSTEM_RESOURCES (1450)
Download details Detection, Analysis, and Corrective Actions for Low Page Table Entry Issues
Counter of the Week Symptoms Lack of Free System Page Table Entries (PTEs) and Error Message ERROR_NO_SYSTEM_RESOURCES (1450
Comparison of 32-bit and 64-bit memory architecture for 64-bit editions of Windows XP and Windows Server 2003

Basically the problem looks like this :

Windows Kernel has a bunch of internal fixed-size buffers. It has fixed-size (or small max-size) buffers for Handles, for the "Paged Pool" and "Non-Paged Pool", oh and for PTEs (page table entries). You can cause these resources to run out at any time and then you start getting weird errors. The exact limit is unknowable, because they are affected by what other processes are running, and also by registry settings and boot.ini settings.

I could make the error go away by playing with those settings to give myself more of a given resource, but of course you can't expect consumers to do that, so you have to work flawlessly in a variety of circumstances.

In terms of File IO, this can hit you in a whole variety of crazy ways :

1. There's a limit on the number of file handles. When you try to open a file you can get an out-of-resources error.

2. There's a limit on the number of Async ops pending, because the Kernel needs to allocate some internal resources and can fail.

3. There's a limit on how many pages of disk cache you can get. Because windows secretly runs everything you do through the cache (note that this is even true to some extent if you use FILE_FLAG_NO_BUFFERING - there are a lot of subtleties to when you actually get to do direct IO which I have written about before), any IO op can fail because windows couldn't allocate a page to the disk cache (even though you already have memory allocated in user space for the buffer).

4. Even ignoring the disk cache issue, windows has to mirror your memory buffer for the IO into kernel address space. I guess this is because the disk drivers talk to kernel memory so you user virtual address has to be moved to kernel for the disk to fill it. This can fail if the kernel can't find a block of kernel address space.

5. When you are sure that you are doing none of the above, you can still run into some other mysterious shit about the kernel failing to allocate internal pages for its own book-keeping of IOs. This is error 1450 (0x5AA) , ERROR_NO_SYSTEM_RESOURCES.

The errors you may see are :

ERROR_NOT_ENOUGH_MEMORY = too many AsyncIO 's pending
Solution : wait until some finish and try again

ERROR_NOT_ENOUGH_QUOTA = single IO call too large
Solution : break large IOs into many smaller ones (but then beware the above)

ERROR_NO_SYSTEM_RESOURCES = failure to alloc pages in the kernel address space for the IO
Solution : ???

So I have made sure I don't have too many handles open. I have made sure I don't have too many IO ops pending. I have made sure my IO ops are not too big. I have done all that, and I still randomly get ERROR_NO_SYSTEM_RESOURCES depending on what else is happening on my machine. I sort of have a solution, which seems to be the standard hack solution around the net - just sleep for a few millis and try the IO again. Eventually it magically clears up and works.

BTW while searching for this problem I found this code snippet : Novell Eclipse FTK file io . It's quite good. It's got a lot of the little IO magic that I've only recently learned, such as using "SetFileValidData" when extending files for async writes, and it also has a retry loop for ERROR_NO_SYSTEM_RESOURCES.

Further investigation reveals that this problem is not caused by me at all - the kernel is just out of paged pool. If I do a very small IO (64k or less) or if I do non-overlapped IO, or if I just wait and retry later, I can get the IO to go through. Oh, and if you use no buffering, that also succeeds.

03-12-09 - Debugging Question

If I have a process under debug in MSVC , and I want to switch to debugging with WinDbg - how do I do that? WinDbg refuses to attach because it's already being debugged. If I "detach" with MSVC, the process immediately runs free (ceases to be stopped) which makes me lose the moment I was trying to debug. Urg. WTF ?

I need this because I want to write out a block of memory from my process to disk, which it seems MSVC won't do, but WinDbg with .writemem .

Another question is : how do I get a *full* crash dump of a process? (with all its memory). If you use "write crash dump" from MSVC, it only writes a "minidump" which is just registers, call stack, that kind of stuff.

WinDbg can write a full crash dump very easily (with .dump) , but I can't figure out how to change debuggers.

3/10/2009

03-10-09 - Snark Suppository

I posted a pretty snyde comment on John Ratcliff's Code Suppository which I feel a bit guilty about, so I thought I would defend and clarify my position a bit.

First of all, to the specific question of bounding volumes : there is decent code for them in Eberly (though it's really not great and the sphere and OBB code is very far from optimal). There is very good code for all the common primitives in Galaxy3 - though you do have to buy into the whole system.

BTW this is sort of a side note but I really like John's Code Suppository in general, or Sean's "one file" way of distributing code. I'm glad they do it, because I can't and don't want to. I love it when other people make their code available to me like that, but I can't write code like that, and I find it impossible to maintain code like that in my own libraries.

As for the exact flaws and prior art in John's post, I sent him this :


1. Exact sphere fitting is well known.  See for example this very good source code :

 miniball 

2. Exact fitting of the best-rectangle via rotating calipers is well known.  See for example
any computational geometry textbook.

3. To fit an OBB the first step should always be computing the convex hull via something like QHull :

 qhull 

Once you have the convex hull, brute force is in fact a good solution.  (being O(N^2) in the number of 
hull faces is much better than being O(N^2) in the number of original faces).  

And there's more at : RAPID OBB-Tree or Computational Geometry Code or O'Rourke's page

(those are just the code references, you can find more papers on citeseer)

But the real answer is : pick up ANY academic text on computational geometry and there will be very good algorithms for all this stuff. The correct algorithms are also fast! And the error in the approximate algorithms is not small, the "put sphere center at the average of all points" heuristic can be very very bad, as can the OBB fit heuristics. (BTW the good computational geometry code in Galaxy3 is basically from me reading through O'Rourke's Computational Geometry textbook and implementating algorithms I thought were cool and useful - at his website, tons of good code is available for free ; the good sphere code in galaxy is just miniball).

For example, both Eberly and John use the PCA (covariance) axes to fit the OBB. That sort of works okay quite often, but occasionally it is unboundedly bad. (BTW all comments on Eberly's stuff are based on several year old code that was free; I haven't touched his stuff since he published that book and started trying to make money off it, so he may have fixed some of his stuff).

BTW also all primitive-fitting code should be able to find either the minimum-volume primitive or the minimum-surface-area primitive. In most cases minimum-surface-area is actually what you want, because that is what gives you the minimum chance of random ray intersection. It also tends to make nicer looking volumes in practice IMO (it forces them to be more "compact" - surface tension turns spread out things into spheres).

BTW #2 rotating calipers is one of those super simple and perfect algorithms that I believe everyone should know. It's one of those "ah ha" algorithms that makes you realize doing anything else is silly, because you cannot beat it for speed and it gets the answer exactly right. ( there's a whole page on it )

I want to say again that I love what John is doing with Code Suppository, but I keep seeing bad bounding volume code recycled again and again in the game development community. Lots of people use fundamentally wrong image processing or audio processing code without reading any of the DSP literature. People do hacky heuristic UV unwraps without reading any of the papers. People write their own hashes and data structures or sorting algorithms without understanding their big-O or actually comparing to reference implementations.

This is bull! And there's no excuse for it any more. Game developers have long secretly treasured the hack and the heuristic. Everyone loves coming up with their own little simple trick. No more. We have long used "speed/optimization" or "time pressure" as excuses to do things wrong. That's BS now that we are many thousands of people working many years on multi-million dollar projects. And especially for code we are sharing on the internet.

Part of the reason why this stuff is even hard to find on Google these days is because so many people keep posting bad versions! If you search for "bounding box fit" on Google half of what you find will be amateur game developer newsgroups and such with people posting ass-tastic versions. It was almost better 10 years about when the only thing you got was university research web sites.

Part of the problem is that too many game developers only talk to other game developers, and lose sight of how strong all the research out there in the world is. Yes, yes, we are hot shit. But for any algorithm you can think of, there are 1000 Chinese PhD's working on it right now. You can't beat them. The best you can do is go to CiteSeer and read their papers and steal their work. If you think you can beat them, then you better be damn sure and prove it by actually testing against their work.

I'm calling everyone out. There's plenty of un-tested, un-benchmarked, un-researched code out there on the net. We don't need more. We need less! I don't want to see another code snippet that somebody tried and just visually looked at and thought it looked "pretty good". I want exact error bounds. I want complexity analysis. I want profiles against the other major examples in the field. I want to see references to prior art. I want to see rigor.

The next time somebody gives you their hash_map implementation, you ask them - how is it better than stlport hash_map ? how does it compare to google dense or sparse hash map? What are the memory use and execution time limits? Show me your reference list (does it include RDE, khash, and Paul Hsieh?). If they stare at you blankly - kick them in the nuts.

I should also note, and I think everyone knows - that I welcome and encourage similar criticism of my own work. I mean often in this space I am just ranting about nonsense that I obviously have no idea WTF I'm talking about. But when I post code - the whole point is for you to tell me ways in which it is wrong so that I can fix it. That's a huge part of why I post code on the internet, I want feedback and fixes and even just "this sucks, there's much better code for that here".


And finally, I'll end with a bit of a challenge :

I don't actually have perfect OBB code, and to my knowledge it doesn't exist. The code in Galaxy3 is much better than the common covariance method. The main improvement there is that I use rotating calipers to find the truely optimal 2d rectangle if one of the axes is fixed. However, I don't do the correct full search of axes. I only check all the normals of the faces of the convex hull. That is not the full set - you should also consider boxes which are only supported by edges and vertices (this is sort of like doing seperating axis tests and only using face normals and not doing edge-edge axes). O'Rourke has described an exact O(N^3) solution, but it's messy and to my knowledge no simple implementation exists.

So I have a few challenges that I'll leave for the reader (or myself in the future) :

1. What is the maximum error (vs the exact solution) of the Galaxy3 method of only considering OBB's which are supported by at least one face of the convex hull?

2. Is there a good implementation of the O(N^3) exact OBB algorithm?

3. Is there an O(N^2) or O(NlogN) OBB algorithm which comes within a (useful) finite bounded error of the exact solution?

old rants