cbloom rants: 03/2009

3/23/2009

03-23-09 - Oodle GDI and Strings

Some random Oodle thoughts/questions :

My ThreadViewer is currently using GDI. It's pretty good, I like it. I did originally because I consider it a favor to the user. There are some nice things about using GDI instead of a 3D API - it plays nicer like a normal windows app, you know it gets Paint messages with rects and only repaints what it needs to, it can be dragged around and respects the user's paint while dragging or not request, and of course it doesn't gobble CPU when it's just sitting there. And of course it doesn't interfere with other 3D apps using the same device driver and memory and whatnot.

There are some disadvantages though. GDI is butt slow. If I draw much at all it really dogs. It's possible I could speed that up. For example I'm drawing individual lines and it might be faster to draw line-lists. I also wonder if it would be a win to batch up states. Right now I do SetPen/SetBrush and then draw, over and over, I could batch up draws with the same settings to avoid state changes. Another disadvantage is fancier blend modes and such are tricky or slow with GDI.

One option is to do my own 2d drawing by locking a DIB and using my own rasterizer. Probably crazy talk. The other option is just to be rude and use 3d.

The issue of strings in Oodle is quite interesting. The way you refer to resources is by name (string). Obviously the file names that I load have to be strings because that's what the OS uses. The main data structures that map resources to files are thus big string->string maps. (at some point I'm going to rewrite this to make it a bidirectional map, a database that can go resource->file or file->resource, which is just string<->string with hashes both ways).

Currently what I have is a ref-counted string that can optionally also just be a plain C-string. It's a 4-byte pointer and I stuff a flag in the bottom bit to indicate if it's got a ref-count or not. If not, it's assumed to point at const memory. If it does have a ref count, the count is right before the string data. Thus getting to the characters is always fast. I also almost always pass around a 4-byte hash with the string, which lets me avoid doing string compares in almost all cases. I'm usually passing around 8-byte (64 bit) "HashedString" objects. I only actually do a string compare if the hashes match but the pointers don't, which in my world is basically never. This is very good for the cache because it means you never actually follow the pointer to the string data.

One problem with this is that the strings can still wind up taking a lot of memory. The most obvious solution to that we've discussed before here. It's to explicitly use {Path/Name} instead of {Full Path}. That is, break off the directory from the file name and share the directory string for all files in the same dir. That would definitely be a big win and is one option.

Another option is to have a "Final mode" where the strings go away completely. The 4-byte hash would be kept, and instead of a 4-byte pointer to string data there would be a uniquifier to break hash ties. In order to make the right uniquifier you would have to have a full list of all the strings used in the game, which you of course do have with Oodle if your game is done. Then the uniquifier has to get baked into all the files, so there would have to be a "destring" pass over all the data that may or may not be reversible. This could be made relatively transparent to the client. One annoyance we had with this kind of thing at Oddworld is that once you bake out the strings, if you have any errors you get messages like "Resource 0x57EA10BF failed to load!". Still this would make the super-low-memory console wonks happy.

Another option is to combine the strings very aggressively. I already am running all the strings through a single global string table (the Game Dictionary that has the resource<->file map also makes all strings unique - this is a big win with the refcounted string - whenever you read a string from a file, rather than allocating a new buffer or that string you see if you have it already and just use the one you already have and bump the refcount). I'm not actually enforcing that to be required, but a lot of possibilities open up if you do require that.

For one thing, a "string" can just become an index or handle to the global string table. I almost never care about the contents of the string, I just use them as a key to find things, so if the char data is converted to a unique index, I can just compare index equality. Then, that index in the string table need not point at plain old char data. For example it could be an index to a leaf of a Trie. Each string gets added to the Trie and only unshared parts of the string make new chars. Each unique string is a leaf, those get labels and that's the handle that is given back to index the global string table.

Another option would be just some sort of LZ type compression. Some strings get full data, but other strings are just prefix matches, like "use the first N bytes from string M and then add these chars".

I think in the end I'll probably have to do the zero-strings option to please the masses. It also has the advantage of making your memory use more controlled and calculable. Like, you use 32 bytes per resource, not 32 bytes + enough to hold the strings whatever that is.

3/13/2009

03-13-09 - Automatic Prefs

I wrote briefly at molly about the way I do "hot vars" or "tweak vars". I don't really like them. I had a bunch in my GDC app for tweaking, but it means I have to edit the code to tweak things and I don't like that for various reasons. It's way better to have a real text pref file. There are just so many wins. I can source control it seperately. I can copy it off and save good snapshots of prefs I like for different purposes, like a development prefs vs. final run prefs. I can copy it and change it to make new instances. And I can edit it and have my final app load the changes without a recompile (hot var can do this too if you keep the C file around as "data")

Fortunately I have a prefs system in cblib, so I switched to that. But it made me realize that I'd really like to have an automatic pref system. Basically I want to write something like :


struct MyPref
{

int i = 7;
float x = 1.3;
String str = "hello world";
ColorDW color(200,50,177);
Vec3 v(3.4,0,1.7);

};

and have it automatically get IO to the pref file and construction with those values as defaults.

Now, that all is actually pretty easy if I just make some custom syntax and run a source code preprocess over the file before compiling. I could use a syntax like :


struct MyPref
{

int i; //$ = 7;
float x; //$ = 1.3;
String str; //$ = "hello world";
ColorDW color; //$ (200,50,177);
Vec3 v; //$ (3.4,0,1.7);

};

where anything with a //$ automatically becomes a "pref var" that gets IO'd and tweakability and so on.

The annoyance and something I haven't figured out is just how to deal with generated code in a build setting. Should the code generator stick the generated code back into the original file? Should it make another seperate C file on the side and put the generated code in there? Maybe all the generated code from all the C files in the whole project should go together in one big spot?

I dunno but it seems like a mess. Maybe the easiest thing to do would be to put the autogenerated code in the same file, and run the generator as a pre-build step.

It would also be annoying to have to put the code generator in as a pre-build step on every file one by one. So annoying as to be unacceptable actually. I would want it to automatically run on all my files any time I build, so that if I just go and put the autogen markup in any file it does the stuff and I don't have to open msdev options dialogs.

I know people out there are doing things like this, so I'm curious how you deal with the mod time issues and builds and source control.

BTW the autogenerated code will look something like this :


void MyPref::AutoGen_SetDefault()
{
    i = int(7);
    x = float(1.3);
    str = String("hello world");
    color =  ColorDW(200,50,177);
    v = Vec3(3.4,0,1.7);
}

template <typename functor>
void MyPref::AutoGen_Reflection()
{
    REFLECT(i);
    REFLECT(x);
    REFLECT(str);
    REFLECT(color);
    REFLECT(v);
}

pretty easy to generate just by some text movement. The big win of the shortened syntax is that you only have to write a variable once instead of 3 times.

03-13-09 - Paged Pool Leaks

So I tracked down the paged pool leak. It appears to be a bug in my ATI driver (or perhaps a bug in the DX interface that shows up as a leak in the driver). It's caused by locking POOL_MANAGED textures that are currently in the GPU push buffer. When you do that it causes the hardware texture to get aliased to avoid contention, and it appears something in there is leaking. I don't think that whole textures are leaking because the leak is pretty slow, it must just be some kind of texture book-keeping object (eg. maybe it actually is the "texture" struct, just not the actual surface bits).

Anyhoo, nobody cares about bugs that I see via Directx8, but what is interesting is the cool Poolmon utility.

Poolmon lets you see the kernel allocations and thus track down leaks.

You need to turn on some registry settings .

Run Poolmon in a command line with lots of vertical lines.

Use the keys "d" and "p" to get the view you want

Then track away.

3/12/2009

03-12-09 - Fixed Memory Pools

The Windows Kernel Paged Pool shit I'm dealing with is reminding me how much I hate fixed size memory pools.

I have fucking 3 Gigs of RAM in this machine and I'm using less than 1 G ! How can you be running out of memory !!!! Oh, because you have a fucking stupid fixed size page thing. Now, okay, maybe *MAYBE* the OS kernel is one case in which fixed size pools is a good thing.

It's common wisdom in game development that dynamic allocations in games are bad and that "mature" people use fixed size pools because it's more stable and safe from fragmentation and robust and so on. Hog wash!

It should be obvious why you would want variable memory allocation. Memory is one of our primary limiting factors these days, and it should be allocated to whatever needs it right now. When you have fixed pools it means that you are preventing the most important thing from getting memory in some case.

For example, your artists want to make a super high poly detailed background portion of the game with no NPC's. Oh, no, sorry, you can't do that, we're reserving that memory for 32 NPC's all the time even though you have none here. In another part of the game, the artists want to have super simple everything and then 64 NPC's. Oh no, sorry, you only get 32 even though you could run more because we're reserving space for lots of other junk that isn't in this part of the game.

Now, I'm not saying that budgets for artists is a bad thing. Obviously artists need clear guidelines about what will run fast enough and fit in memory. But having global fixed limits is a weak cop out way to do that.

Furthermore, recycling pools and maximum counts for spawned items is a perfectly reasonable thing to do. But I don't think of that as a way of dividing up the available memory - it's a way of preventing buggy art from screwing up the system, or just lazy artists from making mistakes. For example, having a maximum particle count doesn't mean you should go ahead and preallocate all those particles, cuz you might want to use that memory for something else in other cases (and of course the hard-fixed-size pool thing can

In general I'm not talking here about *dynamic* variation. I'm talking about *static* variation. Like anything that can be spawned or trigger from scripts or whatever, stuff that can be created by the player - that stuff should be premade. Anything that *could* exist at a given spot *should* exist. That way you know that no player action can crash you. Note that this really just a way to avoid testing all the combinatorics of different play possibilities.

By static variation I mean, in room 1 you might have resource allocation like {16 NPC's, 100 MB of textures} , in room 2 you might have {8 NPC's, 150 MB of textures}.

Fixed sized budgets is like if you partitioned your hard disk in half for programs and data. People used to do things like that, but we all now realize it's dumb, it's better just to have one big disk and that way you can change how you are using things as need arises.

Now, people sometimes worry about fragmentation. That may or may not be an issue for you. On Stranger on XBox it basically wasn't an issue because we had 64M or physical memory and 2G of virtual address space, so you have tons of slack. Again now with 64 bit pointers you have the same kind of safety and don't have to worry. Sadly, 32-bit Windows right now is actually in a really bad spot where the amount of physical memory roughly matches the address space, and we actually want to use most of that. That is fragmentation danger land.

However, doing variable size budgets doesn't necessarily increase your fragmentation at all. The only thing that would give you fragmentation is if you are dynamically allocating and freeing things of different sizes. Now of course you shouldn't do that !

One option is just to tear things all the way down and build them all the way back up for each level. That way you allocate {A,B,C,D} in order, then you free it all so you get back to empty {} , then next level you allocate {C,B,B,A,E} and there's no fragmentation worry. (if you tried to do a minimal transition between those sets by doing like -D +B+E then you could have problems).

Another option is relocatable memory. We did this for Stranger for the "Contiguous" physical memory. Even though virtual address fragmentation wasn't an issue, physical memory (for graphics) fragmentation was. But all our big resources were relocatable anyway because they were designed for paging. So when the contiguous memory got fragmented we just slid down the blogs to defrag it, just like you defrag a disk. Our resources were well designed for fast paging, so this was very fast - only a few thousand clocks to defrag the memory, and it only had to be done at paging area transitions.

Note that "relocatable resources" is kind of a cool handy thing to have in any case. It lets you load them "flat" into memory and then just rebase the whole thing and boom it's ready to use.

Personally after being a console dev and now seeing the shit I'm seeing with Oodle, I would be terrified of releasing a game on a PC. Even if you are very good about your memory use, your textures and VB's and so on create driver resources and you have no idea how big those are, and it will vary from system to system. The AGP aperture and the Video-RAM shadow eat out huge pieces of your virtual address space. The kernel has an unknown amount of available mem, and of course who knows what other apps are running (if it's a typical consumer machine it probably has antivirus and the whole MS bloatware installed and running all the time).

I don't see how you can use even 512 MB on a PC and get reliable execution. I guess the only robust solution is to be super scalable and not count on anything. Assume that mallocs or any system call can fail at any time, and downgrade your functionality to cope.

Now, certainly doing prereserved buckets and zero allocations and all that does have its merits, mainly in convenience. It's very easy as a developer to verify that what you're doing fits the "rules" - you just look at your allocation count and if it's not zero that's a bug.

It's just very frustrating when someone is telling you "out of memory" when you're sitting there staring at 1 GB of free memory. WTF, I have memory for you right here, please use it.

The other important thing is efficiency is irrelevant if you can't target it where you need it. Having a really efficient banana picking operation doesn't do you a lick of good when everyone wants apples. A lot of game coders miss the importance of this point. It's better to run at 90% efficiency or so, but be flexible enough to target your power at exactly what's needed at any moment.

Like with a fixed system maybe you can handle 100 MB of textures and 50 MB of geometry very efficiently. That's awesome if that's what the scene really needs. But usually that is not the exactly ideal use of memory for a given spot. Maybe some spot really only needs 10 MB of geometry data. You still only provide 100 MB of textures. I'm variable and can now provide 139 MB of textures. (I lose 1 MB due to overhead from being variable).

Many game devs see that and think "ZOMG you have 1 MB of overhead that's unacceptable". In reality, you have 39 MB less of the actual resource your artists want in that spot.

03-12-09 - ERROR_NO_SYSTEM_RESOURCES

ERROR_NO_SYSTEM_RESOURCES is the fucking devil.

So far as I can tell, the only people in the history of the universe who have actually pushed the Windows IO system really hard are me and SQL Server. When I go searching around the web for my problems they are always in relation to SQL Server.

I don't have a great understanding of this problem yet. Hopefully someone will chime in with a better link. This is what I have found so far :

You receive error 1450 ERROR_NO_SYSTEM_RESOURCES when you try to create a very large file in Windows XP
SystemPages Core Services
Sysinternals Forums - not enough resources problem - Page 1
Overlapped WriteFile fails with code 1450 [Archive] - CodeGuru Forums
Novell Eclipse FTK file io
How to use the userva switch with the 3GB switch to tune the User-mode space to a value between 2 GB and 3 GB
GDI Usage - Bear
Error Message ERROR_NO_SYSTEM_RESOURCES (1450)
Download details Detection, Analysis, and Corrective Actions for Low Page Table Entry Issues
Counter of the Week Symptoms Lack of Free System Page Table Entries (PTEs) and Error Message ERROR_NO_SYSTEM_RESOURCES (1450
Comparison of 32-bit and 64-bit memory architecture for 64-bit editions of Windows XP and Windows Server 2003

Basically the problem looks like this :

Windows Kernel has a bunch of internal fixed-size buffers. It has fixed-size (or small max-size) buffers for Handles, for the "Paged Pool" and "Non-Paged Pool", oh and for PTEs (page table entries). You can cause these resources to run out at any time and then you start getting weird errors. The exact limit is unknowable, because they are affected by what other processes are running, and also by registry settings and boot.ini settings.

I could make the error go away by playing with those settings to give myself more of a given resource, but of course you can't expect consumers to do that, so you have to work flawlessly in a variety of circumstances.

In terms of File IO, this can hit you in a whole variety of crazy ways :

1. There's a limit on the number of file handles. When you try to open a file you can get an out-of-resources error.

2. There's a limit on the number of Async ops pending, because the Kernel needs to allocate some internal resources and can fail.

3. There's a limit on how many pages of disk cache you can get. Because windows secretly runs everything you do through the cache (note that this is even true to some extent if you use FILE_FLAG_NO_BUFFERING - there are a lot of subtleties to when you actually get to do direct IO which I have written about before), any IO op can fail because windows couldn't allocate a page to the disk cache (even though you already have memory allocated in user space for the buffer).

4. Even ignoring the disk cache issue, windows has to mirror your memory buffer for the IO into kernel address space. I guess this is because the disk drivers talk to kernel memory so you user virtual address has to be moved to kernel for the disk to fill it. This can fail if the kernel can't find a block of kernel address space.

5. When you are sure that you are doing none of the above, you can still run into some other mysterious shit about the kernel failing to allocate internal pages for its own book-keeping of IOs. This is error 1450 (0x5AA) , ERROR_NO_SYSTEM_RESOURCES.

The errors you may see are :

ERROR_NOT_ENOUGH_MEMORY = too many AsyncIO 's pending
Solution : wait until some finish and try again

ERROR_NOT_ENOUGH_QUOTA = single IO call too large
Solution : break large IOs into many smaller ones (but then beware the above)

ERROR_NO_SYSTEM_RESOURCES = failure to alloc pages in the kernel address space for the IO
Solution : ???

So I have made sure I don't have too many handles open. I have made sure I don't have too many IO ops pending. I have made sure my IO ops are not too big. I have done all that, and I still randomly get ERROR_NO_SYSTEM_RESOURCES depending on what else is happening on my machine. I sort of have a solution, which seems to be the standard hack solution around the net - just sleep for a few millis and try the IO again. Eventually it magically clears up and works.

BTW while searching for this problem I found this code snippet : Novell Eclipse FTK file io . It's quite good. It's got a lot of the little IO magic that I've only recently learned, such as using "SetFileValidData" when extending files for async writes, and it also has a retry loop for ERROR_NO_SYSTEM_RESOURCES.

Further investigation reveals that this problem is not caused by me at all - the kernel is just out of paged pool. If I do a very small IO (64k or less) or if I do non-overlapped IO, or if I just wait and retry later, I can get the IO to go through. Oh, and if you use no buffering, that also succeeds.

03-12-09 - Debugging Question

If I have a process under debug in MSVC , and I want to switch to debugging with WinDbg - how do I do that? WinDbg refuses to attach because it's already being debugged. If I "detach" with MSVC, the process immediately runs free (ceases to be stopped) which makes me lose the moment I was trying to debug. Urg. WTF ?

I need this because I want to write out a block of memory from my process to disk, which it seems MSVC won't do, but WinDbg with .writemem .

Another question is : how do I get a *full* crash dump of a process? (with all its memory). If you use "write crash dump" from MSVC, it only writes a "minidump" which is just registers, call stack, that kind of stuff.

WinDbg can write a full crash dump very easily (with .dump) , but I can't figure out how to change debuggers.

3/10/2009

03-10-09 - Snark Suppository

I posted a pretty snyde comment on John Ratcliff's Code Suppository which I feel a bit guilty about, so I thought I would defend and clarify my position a bit.

First of all, to the specific question of bounding volumes : there is decent code for them in Eberly (though it's really not great and the sphere and OBB code is very far from optimal). There is very good code for all the common primitives in Galaxy3 - though you do have to buy into the whole system.

BTW this is sort of a side note but I really like John's Code Suppository in general, or Sean's "one file" way of distributing code. I'm glad they do it, because I can't and don't want to. I love it when other people make their code available to me like that, but I can't write code like that, and I find it impossible to maintain code like that in my own libraries.

As for the exact flaws and prior art in John's post, I sent him this :


1. Exact sphere fitting is well known.  See for example this very good source code :

 miniball 

2. Exact fitting of the best-rectangle via rotating calipers is well known.  See for example
any computational geometry textbook.

3. To fit an OBB the first step should always be computing the convex hull via something like QHull :

 qhull 

Once you have the convex hull, brute force is in fact a good solution.  (being O(N^2) in the number of 
hull faces is much better than being O(N^2) in the number of original faces).

And there's more at : RAPID OBB-Tree or Computational Geometry Code or O'Rourke's page

(those are just the code references, you can find more papers on citeseer)

But the real answer is : pick up ANY academic text on computational geometry and there will be very good algorithms for all this stuff. The correct algorithms are also fast! And the error in the approximate algorithms is not small, the "put sphere center at the average of all points" heuristic can be very very bad, as can the OBB fit heuristics. (BTW the good computational geometry code in Galaxy3 is basically from me reading through O'Rourke's Computational Geometry textbook and implementating algorithms I thought were cool and useful - at his website, tons of good code is available for free ; the good sphere code in galaxy is just miniball).

For example, both Eberly and John use the PCA (covariance) axes to fit the OBB. That sort of works okay quite often, but occasionally it is unboundedly bad. (BTW all comments on Eberly's stuff are based on several year old code that was free; I haven't touched his stuff since he published that book and started trying to make money off it, so he may have fixed some of his stuff).

BTW also all primitive-fitting code should be able to find either the minimum-volume primitive or the minimum-surface-area primitive. In most cases minimum-surface-area is actually what you want, because that is what gives you the minimum chance of random ray intersection. It also tends to make nicer looking volumes in practice IMO (it forces them to be more "compact" - surface tension turns spread out things into spheres).

BTW #2 rotating calipers is one of those super simple and perfect algorithms that I believe everyone should know. It's one of those "ah ha" algorithms that makes you realize doing anything else is silly, because you cannot beat it for speed and it gets the answer exactly right. ( there's a whole page on it )

I want to say again that I love what John is doing with Code Suppository, but I keep seeing bad bounding volume code recycled again and again in the game development community. Lots of people use fundamentally wrong image processing or audio processing code without reading any of the DSP literature. People do hacky heuristic UV unwraps without reading any of the papers. People write their own hashes and data structures or sorting algorithms without understanding their big-O or actually comparing to reference implementations.

This is bull! And there's no excuse for it any more. Game developers have long secretly treasured the hack and the heuristic. Everyone loves coming up with their own little simple trick. No more. We have long used "speed/optimization" or "time pressure" as excuses to do things wrong. That's BS now that we are many thousands of people working many years on multi-million dollar projects. And especially for code we are sharing on the internet.

Part of the reason why this stuff is even hard to find on Google these days is because so many people keep posting bad versions! If you search for "bounding box fit" on Google half of what you find will be amateur game developer newsgroups and such with people posting ass-tastic versions. It was almost better 10 years about when the only thing you got was university research web sites.

Part of the problem is that too many game developers only talk to other game developers, and lose sight of how strong all the research out there in the world is. Yes, yes, we are hot shit. But for any algorithm you can think of, there are 1000 Chinese PhD's working on it right now. You can't beat them. The best you can do is go to CiteSeer and read their papers and steal their work. If you think you can beat them, then you better be damn sure and prove it by actually testing against their work.

I'm calling everyone out. There's plenty of un-tested, un-benchmarked, un-researched code out there on the net. We don't need more. We need less! I don't want to see another code snippet that somebody tried and just visually looked at and thought it looked "pretty good". I want exact error bounds. I want complexity analysis. I want profiles against the other major examples in the field. I want to see references to prior art. I want to see rigor.

The next time somebody gives you their hash_map implementation, you ask them - how is it better than stlport hash_map ? how does it compare to google dense or sparse hash map? What are the memory use and execution time limits? Show me your reference list (does it include RDE, khash, and Paul Hsieh?). If they stare at you blankly - kick them in the nuts.

I should also note, and I think everyone knows - that I welcome and encourage similar criticism of my own work. I mean often in this space I am just ranting about nonsense that I obviously have no idea WTF I'm talking about. But when I post code - the whole point is for you to tell me ways in which it is wrong so that I can fix it. That's a huge part of why I post code on the internet, I want feedback and fixes and even just "this sucks, there's much better code for that here".

And finally, I'll end with a bit of a challenge :

I don't actually have perfect OBB code, and to my knowledge it doesn't exist. The code in Galaxy3 is much better than the common covariance method. The main improvement there is that I use rotating calipers to find the truely optimal 2d rectangle if one of the axes is fixed. However, I don't do the correct full search of axes. I only check all the normals of the faces of the convex hull. That is not the full set - you should also consider boxes which are only supported by edges and vertices (this is sort of like doing seperating axis tests and only using face normals and not doing edge-edge axes). O'Rourke has described an exact O(N^3) solution, but it's messy and to my knowledge no simple implementation exists.

So I have a few challenges that I'll leave for the reader (or myself in the future) :

1. What is the maximum error (vs the exact solution) of the Galaxy3 method of only considering OBB's which are supported by at least one face of the convex hull?

2. Is there a good implementation of the O(N^3) exact OBB algorithm?

3. Is there an O(N^2) or O(NlogN) OBB algorithm which comes within a (useful) finite bounded error of the exact solution?

3/07/2009

03-07-09 - Malloc

Arseny Kapoulkine has a nice new blog post about replacing CRT malloc .

I looked into it a while ago and gathered these links on the subject :

What your mother never told you about graphics development Fighting against CRT heap and winning
TRET The Reverse Engineering Toolkit
The Bag of Holding
somecode Memtracer
Practical Efficient Memory Management EntBlog
New Fun Blog - Scott Bilas Figuring out how to override malloc and new
MemTracer strikes back .mischief.mayhem.soap.
Lightning Engine � Blog Archive � Tracking Memory Allocations
Gamasutra - Monitoring Your PC's Memory Usage For Game Development
Fast File Loading EntBlog
Detours - Microsoft Research
CodeProject Visual Leak Detector - Enhanced Memory Leak Detection for Visual C++. Free source code and programming help
A Cross-Platform Memory Leak Detector

then I decided it was a pain in the ass and I wasn't going to do it. This is one place where C++ got it so much better - you can just override new/delete and it all works. I never use malloc/free anyway, so WTF the few people who do use them can go to the CRT heap, what do I care?

The new cblib has a simple lock-free paged small block allocator. It's very primitive and not as fast or efficient as it could be (it's very fast for single threaded but has very poor cache line sharing behavior), but on the plus side it is nice and simple and self-contained unlike tcmalloc or hoard or something which are like a huge mess of code.

BTW related to this all and the threading stuff is : LeapHeap

Hmm these were in that link bag and don't belong, but they are quite fun -
HAKMEM -- CONTENTS -- DRAFT, NOT YET PROOFED
Hacker's Delight

3/03/2009

03-03-09 - Oodle Update

Oodle currently has two modes - "Incremental Build" and "Final". In Final mode it's assumed that you have made all the packs the way you want, and all the content is valid. This is presumably how you will actually ship your game. In Final mode, I don't check bundles against the original files or any of that stuff, it's much simpler and faster.

The problem of course is that Final is a big pain during development. Incremental lets you change any individual file at any time. It lets you delete any of the packed bundles. It automatically tries to use the best packed bundle - but only if it contains the latest version of the file. As I've written about before, we can't use modtime for this reliably, so I now use the hash of the source file (and use modtime to tell when I need to recompute the hash).

The goal of the incremental build system is that you can just do whatever you want during dev, and everything "just works" - you never see the wrong version fo the content or anything weird. I consider it better to be correct and robust than to be hyper efficient. But I also have the simultaneous goal that performance should be as close to the Final mode as possible. There are two big reasons for this - one is just speed, you still want your loads and iteration to be fast during development, but the other is even more significant : the way you run during dev should be as close as possible to the way you run when you ship. It also hopefully makes it easier to write code for Oodle, because you just write to one API and it automatically acts the right way.

Okay, that all sounds good, but dear god it gets complex in practice.

The general thing that makes it really tricky is I want to allow the client to do absolutely anything to their content at any time, without telling me, and things should still work. Oodle maintains a lightweight database of what it thinks of the content, but at any time that database can be wrong because the client may have done things without telling me. So I constantly need to be verifying it and reupdating it.

For example, clients might go and change source content while the game's not running. When the game runs next I want it to automatically update to the newer source content. (if they change it while the game is running, I see it and reload immediately). Clients might just delete bundles. They might sync to bundles that were made on a server. They might run the bundler tool and add or remove files from individual bundles. The next time the game runs it should just all work (and it shouldn't have to spin over every file and load everything up and validate it either).

I think I have this largely fixed for most normal cases. Basically it involves a lot of "lazy evaluation" of database correctness; as things are requested or brought in, inconsistencies are detected and fixed. A typical load pattern goes like this :

Client code asks to make resource A resident. (there are various ways that clients can make resources resident, this is the "loosest" path - you could also manually tell a whole Bundle to come resident, or you could ask for a named "paging set" to come resident).

First I check if A is already resident and just give it back. But if there's been an error in the previous load of A, I invalidate the database entry that was used and don't return the bad version.

I see if A has a valid map in the database. If so I try to use one of the existing bundles. Now there's a heuristic for which bundle to load. If any of the bundles that contain A is already in memory or already in the progress of loading, I use that. Else, if A is in a header-aggregate bundle, I use that to get the header. (header aggregates contain the headers of lots of resources and generally are made resident very early so you can get the headers before the data comes in). Next if A is in a compiled bundle, I use that (compiled bundles are presumably made by the packing tool so are preferred if possible). Finally I fall back to the single bundle.

If there is no valid database map, I invoke the "baker" for A which knows how to load the data raw and turn it into a single bundle. This is the only pathway that knows how to load raw data, and this is where the client must plug in their file format parsers.

That all catches a lot of cases. For example, if some bundle has been corrupted, I will go ahead and fire the load, when the load fails I will invalidate the map from the resource to that bundle. If you retry the load, it will see there's no mapping to a bundle and fire the baker to make a new one.

When bundles load, they contain various objects, it should contain the resource that caused the load request, but it also might contain others. Those others may have also been previously loaded another way. Our new data may or may not be inteded to replace the old data. The heuristic that I use is that the raw source file on disk is always authoritative. I believe that is intuitive and what artists and users would expect - if you have a certain BMP on disk, then that is what you will see in the game. If you have some packed bundles that contain older versions of that file, they will be ignored - when they try to register their objects they will be rejected. If you have packed bundles that contain newer versions (eg. from a server or other developer) those will also be ignored if you don't have the newer source files, I'm not sure if this is good or bad, but it's an inevitable consequence of not trusting mod time.

That's all relatively simple (LOL?).

One problem remains - because I'm primarily keying off the source file hash, I can't distinguish between different processings of the same original source file. This happens if you're doing something nontrivial in the baker, which I believe is a common case. The baker might convert textures to DXTC, it might optimize geometry for vertex cache, that kind of stuff. You might want to have a "fast bake" that just gets the data into some useable format, and a "heavy bake" that really does a good job. If there are bundles which contain the content in various levels of bake, I always want to load the heavier bake.

Currently I'm using the heuristic that newer bakes are "heavier" , so I always prefer the newest bundle that contains the valid content. But there are various problems with that. One is that of course it's not necessarily correct and modtime isn't reliable. But even if it *was* always correct, it wouldn't work for me, because I can't tell when two bakings are the same. eg. say I load up bundle X that contains resources {A,B,C}. Now client asks for resource C. I see that I already have it because I loaded bundle X. But I also see in the game content map that bundle X is not the newest bundle for resource C. There's some newer bundle Y. Can I return the X->C that I already have? Or do I have to fire a load of Y to get the newer version? I can't tell.

I guess the easy solution is to add an extra field to the baking of each resource that indicates the baking level, and require the client to manually set that field. I guess that field should really be a bit-field so that the client can mark up to 32 types of processing that resource might have and then I insist that you get the "most processed" resource at any time.

I hate exposing more fields that the client has to set right for things to work. My goal is that all this complexity is completely hidden, the client doesn't really have to worry about it, you just do things and it all magically works.

Semi-related : I wrote a while ago about the speeds that I think you should have with an LZ decompressor for file IO. In that post I assume that the game has some CPU work to do for loading, so the CPU used by the decompressor is stealing from someone else.

It turns out that's changing fast. On a new 8-core chip, it's highly unlikely that the game is actually using all the CPU time *ever* , and even more unlikely that it's using it during load. That means CPU is basically free, which totally changes the equation. Having a slow decompressor like LZMA might still hurt your *latency* but it will improve your throughput.

Of course this only true on modern CPUs. We sort of have a problem now with CPUs similar to the problem with GPUs - there's a huge difference in capabilities between high spec and low spec, and the way you target them is quite different. The biggest step is probably from 1 core to 2 core, because the difference between having a true simultaneous background thread and just switching with one is huge. But the difference from 2 to 8 is also pretty huge. With 8 cores you can do a *lot* very fast in the background. For example you could just do masses of decompression and procedural content generation.

Steam Hardware Survey shows 1 cpu is going away pretty fast, so hopefully soon we will be able to target min specs of >= 2 cpus and >= dx9.

3/02/2009

03-02-09 - Low Level Threading - Part 6

Wrap up and Relacy

I know this stuff is scary, but I believe it's necessary and it's time to embrace it. In the coming years, we will all be writing multi-threaded code. Most of you are probably writing lock free code already. I believe that using solid lock free data structures is the right way to do this. It lets you write them carefully, understand them, profile them, test them, simulate them in Relacy, and then you can be very confident about them, and the client code that uses them is pretty simple and doesn't have to worry too much about weird threading issues.

Now, if you like, you can use message passing between threads with locking. In fact, I encourage you to leave that as an option. In fact in all my code I have #defines to switch the lock-free queue with just a plain old critsec locking queue. Furthermore, I have #defines to remove threading altogether (and just pump the thread routine periodically from the main thread). This lets me debug problems without threading when possible. If the locking queue overhead is fine for you, then go ahead and use it.

People who don't like the lock-free data structures usually wind up using one of two alternatives : 1. "ad-hoc" multithreading, or 2. "parallel for". I believe both of these are wrong in different ways.

"ad-hoc" multi-threading is in widespread use currently in game development. This involves a whole host of dangerous techniques, such as just setting variables in one thread and checking them in another, or randomly tossing in some Interlocked ops to make things "safe". This stuff scares the shit out of me, is not generally well understood or isolated, and can cause really evil bugs. I strongly encourage you all to stop doing this. You are highly likely to be writing rare race conditions when you do this, that will be very hard to find or reproduce. Any time you communicate between threads without using a critsec you are writing lock-free code. Just because you didn't use a lock-free FIFO or something doesn't mean you made it simpler.

"parallel for" is being pushed by a lot of people as a solution to multi-threading. Don't get me wrong, I think "parallel for" is a perfectly good thing - however, it is a very high level abstraction that is not capable of addressing the hard threading problems. It basically only has one use - taking some large task that is very easy to seperate into work pieces, doing them on threads, then waiting for them all to be done. That is a common pattern, especially in tools, but it is only one specific work pattern - and it is BY FAR THE EASIEST. I think it's a huge mistake to have these big software products (like .NET Parallel Extensions, Open MP, Intel TBB, etc.) all basically targetted at "parallel for" type work patterns - because that is a *leaf* usage pattern. Threading libraries should be targetted at the *core* usage pattern, and then you can build your own leaves on it.

Basically "parallel for" is just a syntactic sugar for wrapping up independent operations into "Worklets" that can be run on Workers in a thread pool job scheme. I don't need syntactic sugar. Converting a for loop into Worklets is not the hard part of threading. I'm perfectly happy to write a little code to turn my for loop into a function pointer that can be run from the worker swarm. If somebody gave you a bunch of lock-free queues and lock-free hash maps and cross-platform signals and events, you could code up a job swarm parallel-for thing in an afternoon. I'm serious. It's just not the hard part of the problem at all.

And it's not very useful in realtime apps. We have threads that run together over long periods of time and need to check on each other's status and coordinate latency and timing, etc.

Okay, now let's talk about how you establish confidence in your threading routines. The only way is through exhaustive testing. And I don't mean some little unit test you code up. I mean Relacy Race Detector or an equivalent.

In all my lockfree code, I avoid directly doing things like spinning or even storing variables. I call functions like "LoadAcquire" or "StoreRelaxed" or "SpinBackOff". That way I can just change those for each platform and the code should stay the same. Another advantage is that I can change these functions to use C++0x or Relacy.

Relacy Race Detector is a dynamic thread race checker. It's based on the C++0x model, so adapting the code to work in Relacy basically just means making it use the C++0x model, plus a bit of extra juju.

Relacy replaces the C++0x types like std::atomic with its own versions. The Relacy objects contain a lot of extra information, like what thread last read & wrote the value, and a history of what the value has been over the past. It can tell you things like reading uninitialized variables, data races, etc.

Relacy simulates the possible thread execution orders. It does this by testing thread switches at all the "sync points" in the code. (sync points are places you touch shared variables). Normal C code that you write is assumed to touch only thread-local stuff - it is run without checking. If you touch either a std::atomic or an rl::var (in my code these are ATOMIC_VAR or NOT_THREAD_SAFE) it creates a sync point. (addition syncs are at any mutex entry/exit, at thread yield, and at memory fences).

The sync points automatically invoke the Relacy scheduler which might do a thread switch. In Relacy you can easily see all the sync points because they have a "$" at them. (the $ is actually just a #define to debug file/line, but it's useful as a quick visual way to see all possible thread switches).

Relacy actually runs your code using fibers. It makes a fiber for each thread and then does the switching between fibers manually. That means that in a Relacy test, your code actually runs single threaded. Plain C code that is supposed to be thread-local is run atomically (with no thread switches) which is nice for debugging.

Relacy has two main simulation modes : "context_bound" and "random". Random is obviously a monte carlo simulation. At each sync point it randomly picks a thread switch. You want to run 10^8 or so iterations of random, and you should make the test pretty complex to try to stress the algorithms. "context_bound" is a full enumeration of all possible program execution orders (bounded by the context limit, which is some kind of tree depth limit). A test with context_bound can find very obscure races that you might never see in a million years of real world testing, such as switching back and forth between two threads on exactly alternating lines of code or whatever. Generally you want to run a very simple test with context_bound such as just doing one push and one pop.

Note that Relacy does not check that your code is actually producing the right output - you need to do that yourself using RL_ASSERT. What Relacy checks is that the code is 1. free of obvious bugs like races, memory leaks, etc., and 2. produces the same output every time regardless of thread switch pattern.

This #2 is obviously crucial - it's the whole goal of writing correct threaded code : the action is the same regardless of the execution thread switch pattern. Obviously to check this, your program must be deterministic. That means, no using rand (use rl::rand instead), no using statics (use test_suite::before() or test_suite::invariant() instead). (note that if you use rl::rand, all possible rand values will be simulated, so generally use a small count).

Relacy also tracks the full execute state and can save and restore the state at any point. Basically it journals the entire thread switch history. I haven't actually used this yet, but it could be awesomely handy for finding really tricky threading bugs.

Personally I believe that lock free code that has not been thoroughly tested with Relacy is toxic and I want nothing to do with it. (as I've written previously, if you actually plug a lot of the respected published lock-free algorithms into Relacy you will find races) (BTW another warning : do not just grab the example code that comes with Relacy and try to use it in your project - some of Dmitriy's examples are intentionally broken so that you can see how Relacy finds errors).

Also, there are some other race checkers out there. CHESS by Microsoft Research is an okay one if you are on Windows. The Intel Thread Checker is similar. I believe they are both inferior to Relacy. There are also some that work by static analysis; the static analysis tools generally work by having you write up your program in a custom syntax. I haven't tried these but I don't like the idea of having to translate between a custom syntax, because it creates the possibility that the code as written in the custom tool is okay, but after translating into real code it's broken. The ideal thread checker IMO works directly on code in its real syntax so there is no translation danger, and is also comprehensive, not sampling.

You may now download code to my lock free data structures in Relacy wrapper : (MSVC >= 7)

myrelacy.zip (111k)

(this is based on Relacy 1_3_0 ; the official version of Relacy 1_3_0 has some bugs that are fixed in here; Dmitriy will be releasing an official fixed version soon)

ADDENDUM ON RELACY LICENSE : (revised 9-14-09)

Relacy is now released under the BSD license :


    1 /*  Relacy Race Detector
    2  *  Copyright (c) 2009, Dmitry S. Vyukov
    3  *  All rights reserved.
    4  *  Redistribution and use in source and binary forms, with or without modification,
    5  *  are permitted provided that the following conditions are met:
    6  *    - Redistributions of source code must retain the above copyright notice,
    7  *      this list of conditions and the following disclaimer.
    8  *    - Redistributions in binary form must reproduce the above copyright notice, this list of conditions
    9  *      and the following disclaimer in the documentation and/or other materials provided with the distribution.
   10  *    - The name of the owner may not be used to endorse or promote products derived from this software
   11  *      without specific prior written permission.
   12  *  THIS SOFTWARE IS PROVIDED BY THE OWNER "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
   13  *  THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
   14  *  IN NO EVENT SHALL THE OWNER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY,
   15  *  OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
   16  *  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
   17  *  STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
   18  *  EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
   19  */

My work with Relacy is 100% free for any use. However, the original Relacy license still applies to all work product made with Relacy, such as my code above.

The version of Relacy that I built my code with was released under a previous less clear and restrictive license. Dmitry says the new BSD license applies.

03-02-09 - Sleep Sucks and VSync Woes

Ok, first let me describe the issue :

You want to hit 60 fps (or whatever) pretty reliably. Each frame there is some amount of work you *must* do (eg. rendering, respond to input), and there is some amount of work that you would like to do (eg. streaming in data, decompressing, etc.). You want to do as much of the optional work as possible each frame, without using too much time.

You would also like to behave well with other apps. eg. you don't want to just spin and eat 100% of the CPU when you're idle. (this is especially important for casual games where the user is likely browsing the web while playing). To do this you want to Sleep() or something with your idle time.

So, ideally your frame loop looks something like :


-- page flip --

start = timer();

[ do necessary work ]

remaining = start + frame_duration - timer()

[ do optional work as remaining time allows ]

remaining = start + frame_duration - timer()

[ if remaining is large , sleep a while ]

-- page flip --

You want the duration of these three things to add up :


(necessary work) + (optional work) + (sleep time) <= frame_duration

but as close to frame_duration as possible without going over

Okay. There is one specific case where this is pretty easy : if you are in full screen, using V-sync page flips, *AND* your graphics driver is actually polite and sleeps for V-sync instead of just spinning. I don't know what the state of modern drivers is, but in the past when I looked into this I found that many major companies don't actually sleep-to-vsync on Windows, they busy spin. (this is not entirely their fault, as there is not a very good vsync interrupt system on windows).

Sleep-to-vsync is important not just because you want to give time to other apps, but because the main thread wants to give time to the background threads. eg. if you want to put streaming on a lower priority background thread, then you write your loop just like :


-- page flip --
[ do necessary work ]
-- request page flip --
    puts main thread to sleep
    [ optional work runs until vsync interrupt switches me back ]
-- page flip --

and you assume the page flip will sleep you to vsync which will give the background loader time to run.

Even if your driver is nice and gives you a sleep-to-vsync, you can't use it in windowed mode.

A lot of games have horrible tearing in windowed mode. It seems to me that you could fix this by using the Dx9+ GetRasterStatus calls. You simply don't do a flip while the raster scan line is inside your window (or right near the top of your window). You wait for it to be past the bottom or in the vblank area or far enough above you. That should eliminate tearing and also make you hit the frame time more exactly because you sync to the raster.

(if you just use a timer without using the raster you can get the worst possible tearing, where a flip line scans up and down on your image because you're just barely missing sync each frame).

But we still have the problem that we can't sleep to vsync, we basically have to spin and poll the raster.

Now you might think you could look at the remaining time and try to sleep that amount. Ha! Sleep(millis) is a huge disaster. Here's a sampling of how long you actually sleep when you call Sleep(0) or Sleep(1) :


sleepMillis : 0 , actualMillis : 28.335571
sleepMillis : 0 , actualMillis : 22.029877
sleepMillis : 0 , actualMillis : 3.383636
sleepMillis : 0 , actualMillis : 18.398285
sleepMillis : 0 , actualMillis : 22.335052
sleepMillis : 0 , actualMillis : 6.214142
sleepMillis : 0 , actualMillis : 3.513336
sleepMillis : 0 , actualMillis : 15.213013
sleepMillis : 0 , actualMillis : 16.242981
sleepMillis : 0 , actualMillis : 8.148193
sleepMillis : 1 , actualMillis : 5.401611
sleepMillis : 0 , actualMillis : 15.296936
sleepMillis : 0 , actualMillis : 16.204834
sleepMillis : 1 , actualMillis : 7.736206
sleepMillis : 0 , actualMillis : 3.112793
sleepMillis : 0 , actualMillis : 16.194344
sleepMillis : 0 , actualMillis : 6.464005
sleepMillis : 0 , actualMillis : 3.378868
sleepMillis : 0 , actualMillis : 22.548676
sleepMillis : 0 , actualMillis : 4.644394
sleepMillis : 0 , actualMillis : 28.266907
sleepMillis : 0 , actualMillis : 51.839828
sleepMillis : 0 , actualMillis : 7.650375
sleepMillis : 0 , actualMillis : 19.336700
sleepMillis : 0 , actualMillis : 5.380630
sleepMillis : 0 , actualMillis : 6.172180
sleepMillis : 0 , actualMillis : 30.097961
sleepMillis : 0 , actualMillis : 17.053604
sleepMillis : 0 , actualMillis : 3.190994
sleepMillis : 0 , actualMillis : 27.353287
sleepMillis : 0 , actualMillis : 23.557663
sleepMillis : 0 , actualMillis : 27.498245
sleepMillis : 0 , actualMillis : 26.178360
sleepMillis : 0 , actualMillis : 29.123306
sleepMillis : 0 , actualMillis : 23.521423
sleepMillis : 0 , actualMillis : 6.811142

You basically cannot call Sleep() at all EVER in a Windows game if you want to reliably hit your frame rate.

And yes yes this is with timeBeginPeriod(1) and this is even with setting my priority to THREAD_PRIORITY_TIME_CRITICAL. And no nothing at all is running on my machine (I'm sure the average user machine that has fucking Windows Crippleware grinding away in the background all the time is far worse).

So I have no solution and I am unhappy.

There definitely seems to be some weird stuff going on in the scheduler triggering this. Like if I just sit in my main thread and do render-flip , with no thread switches and no file IO, then my sleep times seem to be pretty reliable. If I kick some streaming work that activates the other threads and does file IO, then my sleep times suddenly go nuts - and they stay nuts for about a second after all the streaming work is done. It's like the scheduler increases my time slice quantum and then lets it come back down later.

My links on this topic :

Using Waitable Timers with an Asynchronous Procedure Call (Windows)
Timing in Win32 - Geiss
Priority Boosts (Windows)
Molly Rocket View topic - Why Win32 is a bad platform for games - part 1
IDirectDrawGetScanLine
How to fight tearing - virtualdub.org
Examining and Adjusting Thread Priority
Detecting Vertical Retrace with crazy vxd
CodeProject Tearing-free drawing with mmtimer
CodeGuru Creating a High-Precision, High-Resolution, and Highly Reliable Timer, Utilising Minimal CPU Resources

I am the linkenator. I am the linkosaurus rex. Linkotron 5000. I have strong Link-Fu. I see your Links are as big as mine. My name is Inlinko Montoya, you linked my father, prepare to be linked!

ADDENDUM : The situation seems to be rather better in Dx9+. You can use VSync in Windowed Mode, and the VSync does in fact seem to sleep your app. (older versions of Dx could only Vsync in full screen, I didn't expect this to actually work, but it does).

However, I am still seeing some really weird stuff with it. For one thing, it seems to frequently miss VSync for no good reason. If you take one of the trivial Dx9 sample apps from the SDK, in their framework you will find that they don't display the frame rate when Vsync is On. Sneaky little bastards, it hides their flaw.

In any sample app using the DXUT framework you can find a line like :


txtHelper.DrawTextLine( DXUTGetFrameStats( DXUTIsVsyncEnabled() ) );

This line shows the frame rate only if Vsync if Off. Go ahead and try turning off VSync - you should see a framerate like 400 - 800 fps. Now change it to :


txtHelper.DrawTextLine( DXUTGetFrameStats( true ) );

so that you show framerate even if Vsync is on. You will see a framerate of 40 !!!

40 fps is what you get when you alternate between a 30 fps and a 60 fps frame. ((1/30) + (1/60))/2 = 1/40 . That means they are missing vsync on alternating frames.

If you go and stick a timer around just the Present() call to measure the duration spent in Present, you should see something like :


no vsync :
    duration : lo : 1.20 millis , hi : 1.40 millis

vsync :

    duration : lo : 15.3 millis , hi : 31.03 millis

Yuck. Fortunately, the samples are really easy to fix. Just put this line anywhere in the sample app :


timeBeginPeriod(1);

to change the scheduler granularity. This says to me that Present() is actually just calling the OS Sleep() , and is not using a nice interrupt or anything like that, so if your scheduler granularity is too high you will sleep too long and miss vsync a lot. In fact I would not be surprised if Present just had a loop like :

    for(;;)
    {
        if ( IsRasterReady() ) break;

        Sleep(1);
    }

which sucks donkey balls for various reasons. (* addendum : yes they are doing this, see below). If in fact they are doing what I think they are doing, then you will hit vsync more reliably if you pump up your thread priority before calling Sleep :


    int oldp = GetThreadPriority( GetCurrentThread() );
    SetThreadPriority( GetCurrentThread() , THREAD_PRIORITY_SOMETHING_REALLY_FUCKING_HIGH );

    Present( ... );

    SetThreadPriority( GetCurrentThread() , oldp );

this tries to ensure that the scheduler will actually switch back to you for your vsync the way you want. (this has no effect in the samples cuz they have no other threads, but may be necessary if you're actually stressing the system).

Note the samples also freak out if they don't have focus. That's because they actually call Sleep(50) when they don't have focus. That's a cool thing and not a bug ;)

To wrap up, here's the approach that I believe works :


Use dx9+

Use Vsync and 1 back buffer

Bump up thread priority & scheduler interval to make sure Present() doesn't miss

Make main thread higher priority than worker threads so that workers run when main is sleeping
(must also check that workers are not starving)

On machines with speed-step , *and* if the app has focus :
Run a very low priority thread that just spins to keep the cpu clocked up

and cross your fingers and pray.

BTW whenever you show something like framerate you should show four numbers :


Instant current value this frame.
Average over last N frames.
Low over last N frames.
High over last N frames.

with N like 10 or something.
obviously you can do this trivially with a little circular buffer.

You can see a lot of things that way. For example when you're doing something like missing vsync on alternating frames, you will see the average is perfectly stable, a rock solid 40.0 , but the instantaneous value is jumping up and down like mad.

BTW cb::circular_array does this pretty well.

ADDENDUM : I found a note in the Dx9 docs that is quite illuminating. They mention that if you use PRESENT_INTERVAL_DEFAULT, the action is to sync to vblank. If you use PRESENT_INTERVAL_ONE, the action is the same - but there is a difference, it automatically does a timeBeginPeriod(1) for you to help you hit vsync more precisely. This tells me that they are in fact doing the Sleep()/GetRasterStatus loop that I suspected.

3/01/2009

03-01-09 - Clean Up

It does actually help to clean up your machine from time to time. One issue is that NTFS dir listing on large dirs is really slow, so huge dirs full of temp files makes dir listing chug. Another issue is that I believe Firefox has an N^2 kind of bug with temporary files where it eventually becomes outrageously slow.

Anyhoo, cleaning up is also a very good idea before you do a defrag. It's silly to defrag with a huge mess of temporary files when it would be better just to kill them and make them all from scratch afterward.

This is the kind of thing that you probably only need to do like once a year. BTW not a bad idea to do before backing up your machine since it's kind of silly to back up a bunch of temp files. (this stuff was literally 20% of the bytes on my disk - 10 GB out of 50). It's a little bit of a catch-22 because of course you'd like to back up *before* you run around deleting lots of junk. More on backups later..

Here's my cleanup routine :

On every drive :


call dele *.chk
call dele ffast*
call dele -y recycled

And try to get the temp dirs :


call zdel -y -r  %TEMP%\*
call zdel -y -r  %TMP%\*
call zdel -y -r "%USERPROFILE%\local settings\temp\*"
call zdel -y -r "%USERPROFILE%\local settings\temporary internet files\*"

Then go to your source code and wipe out the crud there :


zdel -r "*.(ncb|sbr|pch|bak|pdb|ilk|idb|opt|plg|bsc|obj|suo|log_prev)"

My god the ncb's and the suo's and all that stuff are a huge waste of space. Note I have "obj" and "pdb" in there - you might not want to delete those. I do. If I can't rebuild the obj I figure I don't care about that code any more.

Oh but there's more. Lots of programs keep their own huge temp dirs. (there was one really bad culprit of this and now I forget what it was; damn). One is "Visual Assist".


zdel -r "C:\Program Files\Visual Assist.NET\vc7\cache\*"
zdel -r "C:\Program Files\Visual Assist.NET\vc7\history\*"

I assume you run a P4 server or something at home, so go ahead and checkpoint your journal too :


p4d -jc -z

And I don't have a good non-GUI way to do this, but we've missed some shite from Firefox, so use the GUI to delete all history and temporary files.

And finally we've still missed a ton of garbage in "Documents and Settings" but it's too much of a pain and too much of a risk to run around deleting more, so that's it. NOW we can defrag.

After all that I backed up my lappy. It took 6 bloody hours. I have a 60 GB disk in lappy and it can do nearly 60 MB/sec. That means it should take 1000 seconds. 17 minutes. To just copy the whole disk. Sigh.

I've tried a few backup programs in the day and they all have this severe order-of-magnitude slowness problem because they use the damn file system which is butt slow rather than just mirroring the whole disk. Yes I know there are programs to just do a full disk mirror but from what I can tell, none of them let you then treat the results as just a bunch of files you can browse into.

I never want a backup system that creates some custom file format on the backup server - I just want files. I never want to have to do a "recovery" - I just want to go browse to the backups and find the old files that I backed up and copy them out manually.

What a backup program should do : 1. Read the whole disk as a raw image so it's super fast. 2. Compress the disk with a fast LZP or something in memory as it goes. (compression here is not really to save space - it should actually make things faster because compressing is faster than disk write speed). 3. Write the compressed image to the server. 4. Provide an explorer plugin so that you can browse into the stored image as it was just a dir and copy files out.

The whole thing should take maybe 30 minutes max.

Oh, while I'm at it - HYACHACHCA - "robocopy" and "xcopy" both have huge terrible bugs. There are certain file names that they can enumerate in their FindFirst/FindNext which they then fail to find when they actually try to copy the files. I've written before about the disaster of file system names on Windows but this is ridiculous. Also robocopy has this retry-wait thing, which is nice if it's because of a file being in use or something, but when it's a hard error like "file not found" it's dumb to spin in retry-wait for 30 minutes. And there's no key you can press to just make it give up and skip that file, you have to ctrl-C and start over at which time you just give up and use your own copy prompt because nobody else seems capable of writing the simplest piece of software.