11/19/2010

11-19-10 - Games without Strings

Quite a number of major video games have shipped with embarassingly large amounts of time spent in "strcmp" and embarassingly large amounts of memory spent on strings.

I'm of the opinion that games should not ship with strings. However, strings are very useful for development. How should we reconcile this? The basic strategy in all cases will be to replace the string with just an integer GUID or index. Then instead of strcmp you just do == test on the ints, and instead of storing lots of strings in memory, you just have ints to refer to other objects.

First of all, let's say that the premature optimization of making all your references into compact indices from the beginning is a mistake IMO. The main place that strings come up and where they are super useful is as the names of content. That is, textures, models, prefs, NPC types, weapon types, levels, attributes, etc. If I want to make an NPC it's nice to just be able to say, okay use the "ToughMarine" model and give him "AggressiveEnemy" behavior and "BigShotgun" weapon. Often those strings correspond to file names, either directly or indirectly, which is very handy because it lets you add new elements by just making new files, and it lets you find the content that a string corresponds to.

The bad premature optimization that I'm talking about is to work with "databases" of content right from the start rather than loose files. Then your references can be indices. So instead my NPC would use model 7, behavior 3, and weapon 41. This makes your references fast and compact right from the start, but the penalties to development process are just too severe. (some of the old game engines did this, but I believe it is now understood in the industry that this is very bad for source control and large teams and is being phased out).

So, we don't want that method, we want to work with strings during development, but ship without strings.

( some example of unnacceptable workflow in my opinion : when artists have to yell out "who has the texture name database checked out? I have to add a new string to it" , or when there's an error in the content hookup and the game prints out "error : npc 13 couldn't find model 9" (though, kudos for at least printing an error) )

One option is to have a "bake" step that converts all the loose content and string references into these packed databases and indexes. Basically it has to load up every piece of content, see what all the strings in the game are and convert them to an index, then order all the data by its index so you can find it. While this does work, it's pretty painful. Whole-game bake operations like this are pretty disastrous for development, because it means you can't test the real shipping game without doing some big time consuming process. I'm a big believer in being able to run at least portions of the game in their real shipping state all the time during development, not just at the very end. It makes it hard to just change one piece of content or add something and have.

Another option is to have a sort of "incremental baking" from a GUID server. I've seen this work in practice, so it is viable, but it's a little scary. Basically the idea is you keep a central string table that maps unique strings to indices (and vice versa). Each time you make a new piece of content or make a new string, you have to send a packet to the GUID server and get the index for that string. Now you can refer to your piece of content by that index, so you have compact references. While this does certainly work, relying on being able to communicate with the GUID server for development is a bit scary. Also if you ever accidentally get a bug in the GUID system you could corrupt a bunch of content. To deal with both of those issues, it would be nice to keep the original strings around in the content files during development as backup for the central repository.

The option we used for Oddworld Stranger was string hashing. In the shipping game, every 32 bit char * was replaced with a 32 bit integer hash of the char *. Using a hash makes the string -> index mapping deterministic and local, you don't have to talk to your neighbor to make sure you are getting a unique int. This method has lots of advantages and only one (rather large) disadvantage : the possibility of hash collisions. When you get a hash collision you wind up with a perplexing but amusing bug, like you try to put "SinisterHelmet" on a guy and instead you get "RubberChicken" attached to his head because those string hashes collided. During development you need to keep both the hash and the string around.

To handle collisions, we had a debug mode that would load up a level with the strings and the hashes and check if any strings had the same hashes. Note that you don't need to check collisions across the whole game, but only for content that can possibly be loaded at the same time. On Stranger I think we wound up with 2 collisions over the whole course of development. The solution to those collisions was simply to rename some content. eg. "FloppyHat" was colliding with "PurpleSparkle" , so we just renamed it to "PurpleSparkle2" and we were hash collision free. I definitely do not advocate the 32-bit hash method for large scale game development. It's okay on a small team where you can check things like that once in a while and handle it, but with big distributed teams and huge amounts of content it's not viable.

The simplest fix is to go to a 64-bit hash. Then the probability of collision becomes so small that you could just deal with it in the unlikely event that it happens. Note that in the shipping game you never actually generate these hashes, so the speed of the hash function is irrelevant; in the final game they are opaque GUIDs that are used to link pieces of content together.

With the hash+string method there's also an issue of how you store your content and what you do with the string in the shipping version of the game. What we did is just to store both inline in the data files (eg. 4 bytes of hash then the string bytes immediately following), and in the shipping game you just read the string in then immediately discard it. But if you want to be really super optimal and flat-load your data, then you would have to factor out the strings at some point. One option is to make all your data files in chunks, and make one of the chunks be "nonfinal" data, and put the hash -> string mapping in the nonfinal data chunk. Then you can "flat load" the rest of the data.

But none of these solutions is definitively awesome, so I'm curious what other people do or if there is a better solution.


ADDENDUM : I knew I'd seen this written about before, but forgot about it.

Mick West wrote about it for game developer but doesn't really address the difficult issues.

Ivo Beltchev describes CRCView - a VC extension to do GetStr() for CRC's. I wonder how well that works; I've always found the VC expression evaluator to stop working right when I need it most.

Mischief.mayhem puts it together. He suggests using an external SQL database for GUID->String mapping and collision resolution.

BTW I'm not sure why you bother with hashing if you are using a global database. Just make the first string be "1", the second string be "2", etc. But I'm not super convinced that the global database is an okay method.

22 comments:

Matt Enright said...

We use strings to ID content, but make sure that all the comparisons happen during loading, where their cost is far outweighed by IO. Everything is references by direct memory pointers after that. Sometimes we get lazy during development and let strings slip into updating or rendering, but profiling passes allow us to find and cull them later.

cbloom said...

Oh yeah, I should have been more explicit, but I assume that in all cases the references are resolved to pointers when possible, you don't keep doing name or hash lookups in inner loops. eg. if I put texture X on my object, I don't keep doing TextureManager::Get(X) every time I render, I cache the pointer.

However, for paging or optional objects, that pointer might not be around, so there's an issue of what you do for references that can't be resolved. One option is to make an empty dummy object for that item in "paged out" state and go ahead and resolve the name lookup to a pointer. The other object is to keep the strings around and do the resolve when it becomes resident.

Julien Koenen said...

Most data in our games is not directly referenced from code but from other data (The LevelSector Resource has references to Models, Collisions, Entities, Lights, ...) The Level sector file is output from a level converter and contains string references to the other resources. This can be loaded into the game directly (which will automatically resolve and load all Resources) or linked into a statically linked file which replaces all resource references with direct offsets (It bakes all referenced resources into one big file). This means we can still have both: The super fast/small runtime format for the final build and the iteration time friendly version which allows fast replacement of single resources.

Resources that are referenced directly from code are loaded in life-time based linked packets as well and we use pregenerated string-CRCs for the reference.

Because the life time of level sectors and "global" stuff is different anyway we have "namespaces" for the different data packs (and for resource-types). So the Id is basically a triplet containing the data-bucket (life-time-dependent), the resource type and a crc of the original filename of the resource.

This works quite fine for us.

Phil Teschner said...

We use string hashes everywhere in our project. We have a code-preprocessing step that generates the hashes at compile time. Basically all you have to do in code is write H("foo bar") and the pre-processing step will modify the code into H("foo bar", 135612361). H is simply a macro:

#define H(x, y) y

All our resource managers keep track of the hash->resource mapping which is essentially the same as your guid->resource lookup.

This also means it's easy to test for hash-collisions within a resource manager. We actually allow collisions across resource managers since you always know the context in which the hash is used in.

For debugging purposes we create a symbol table that stores all hash->string lookups that we need. We have a symbol table per resource manager and several generic ones for hashes that are used outside of the resource system. These symbol tables are loaded on demand in our non-ship builds and are mainly used for error messages and such.

We actually pre-process our Lua scripts to do the same thing.

Aaron said...

Similar to what Phil is doing, except we don't have a preprocessor to do it, we have to run a perl script manually on the source file to convert it.

We also (over me screaming bloody murder) use 32 bit hashes, which cause us a collision about once every two or three days. So we have a remapping table that's a huge pain in the ass to deal with.

Does anyone here use 32 bit hashes in a real, modern, big game?

The other big question is where to keep your backing string for these. At OW we kept the string right in the token class, right. Made it a little bigger, but was great for debugging. Someone wanted to save 200k where I work, so they took that out. Now it's a disaster, you have to copy-paste the hash into this side app that looks it up in the token map. Irritating!

cbloom said...

"The super fast/small runtime format for the final build and the iteration time friendly version which allows fast replacement of single resources."

This is basically the "bake process" method that I mentioned first.

It certainly has its advantages. The big disadvantage is that if the unbaked version can easily be very slow to load or very fat in memory; eg. on a 256MB console if your baked version is actually using all the memory, then the unbaked data won't fit and you're in trouble.

cbloom said...

Phil - sup!

"This also means it's easy to test for hash-collisions within a resource manager."

Only if you keep around the strings all the time, right? I guess even without the strings you could tell that the hash is associated with two different data items.

But the big issue is - what do you do when there is a collision?

"For debugging purposes we create a symbol table that stores all hash->string lookups that we need."

It sounds like this is a file; is it local to each person or shared? Basically its a mini database.

cbloom said...

" Now it's a disaster, you have to copy-paste the hash into this side app that looks it up in the token map. Irritating!"

Oh dear lord.

"The other big question is where to keep your backing string for these. At OW we kept the string right in the token class, right. Made it a little bigger, but was great for debugging."

Yeah, the OW Token was :

struct Token {
U32 m_hash;
#ifndef FINAL
const char * m_pDebugString;
#endif
};

which was indeed handy for debugging, but has the big disadvantage that you can no longer just IO it as binary, because it changes size in the FINAL game.

There's also a sort of separate but related issue of how the strings are associated with the hashes in the data files. At OW we would just write the strings next to the hashes in the data files. If you were running in a "no string backing" build the file reader would just skip over those strings. Then we also had a final bake that would strip the strings out of the files at the very end.

Inevitably that always leads to bugs due to running in a mode you haven't run all along, so you get "problem with 0x5cade7f" which is damn annoying.

cbloom said...

BTW there are advantages of the GUID->String table that I don't see mentioned much.

For example, you can use it as a centralized way to redirect content.

eg. your original string might be :

LoadTexture("c:\\meshes\\billboards\\a.bmp");

you convert this into

LoadTexture(GUID 17)

and somewhere you have an association

17:"c:\\meshes\\billboards\\a.bmp"

but now you can change it in the mapping, eg.

17:"c:\\meshes\\billboards\\a_alt.bmp"

this can used for example to make localized builds without rebuilding the code.

It can also be used to point content at per-platform compiled versions. Eg. on the PC you might just load the BMP as above, but when you run on 360 it uses a different GUID map and finds :

17:"xe:\\mesh360\\billboards\\a.ptc"

or whatever.

ryg said...

If you're working with strings, just use string interning. Even in debug builds, that's a single char* (usually wrapped in a class) that's self-describing and unique (so you can use the pointer as hash index). And you can immediately spot all time spent hashing strings to do the interning (since it's all in one ctor).

You can serialize an interned string table along with the rest of the "bundle", usual pointer fixup takes care of the rest. That gives each bundle its own "namespace", if you want to access stuff from other bundles you need to look up in the namespaces of all bundles currently loaded which isn't a big deal as the count is usually low.

Depending on the size of strings per bundle and the number of explicit asset references in the code, that's all you need. Hashes/GUIDs are worth the hassle if you have tens of thousands of named assets in memory at the same time, but not if it's hundreds.

It depends on how you organize your content I guess. We only ever had a tiny number of actually named objects in the assets that got loaded by the game. Almost all of it got stripped during content build time.

cbloom said...

At OW we had something like 100k named references in our largest levels. I asssume very large world modern games are well beyond that.

"If you're working with strings, just use string interning."

Yeah, I was thinking about mentioning this. It is just a form of index into a string table, however.

It seems to be much less efficient than other methods, however, because it requires you to still load and match strings in
the final build. In fact it's basically a no-go because of that.

"Depending on the size of strings per bundle and the number of explicit asset references in the code"

The issue is cross-bundle references. To do string interning you have to have a global string table that makes unique pointers for each data item.

Paging games my have lots of references that aren't baked together into single bundles.

castano said...

On the Witness we are using strings resolved at load time, with dummy objects for paged out stuff. Fairly straightforward.

If string comparisons turn out to be a problem, the plan is to move to hash+strings and drop the strings in final builds, pretty much like you did at OW.

I haven't really thought much about collisions... our editor is the game, so I guess we could trigger an error as soon as the offensive asset is referenced. The most annoying case is dealing with indirect references.

We have separate tables for each resource type, so I guess reduces the chances of having a collision a little bit. If frequent collisions were a problem in practice, in this game I would probably keep things simple and use larger hashes or any other solution that doesn't require artist intervention.

NeARAZ said...

In Unity we just never use string paths to identify assets => problem solved? :)

Every asset has a GUID, and in any code you never reference any assets by names. All references to everything are set up visually (in the editor); or populated from editor side scripts (which have utility functions to convert between GUIDs and paths etc.).

This has some downsides, but also some advantages: i.e. you can reorganize your assets at any time, and nothing breaks. Rename them, move them around, organize into folders etc. - everything goes. Also, the runtime side never sees any asset path strings, so part of this "lots of strings" problem goes away.

Internal implementation detail: GUIDs uniquely identify an asset in the editor/offline part. When building the game, all identifiers are converted to smaller unique numbers (IIRC, 32 bit) and runtime just uses those.

Phil Teschner said...

"Only if you keep around the strings all the time, right? I guess even without the strings you could tell that the hash is associated with two different data items."

Right we detect it when we build up the lookup structure which is either done at startup or as a deploy step (bake step).

When there is a collision we just require the asset to be renamed. In the last 3 years we've had maybe 8 collisions and we have a lot of assets. Again it helps that the string doesn't have to be unique across all types.

"It sounds like this is a file; is it local to each person or shared? Basically its a mini database"

Yes, everyone basically has a local version of this file that is merged with the global version (checked in version).

So essentially everyone has a mini-db that is a combination of their view of the assets and the depot's view.

For the local table we just keep adding strings until a new build is retrieved from the depot and then we do a clean build of the symbols.

nothings said...

Basically all you have to do in code is write H("foo bar") and the pre-processing step will modify the code into H("foo bar", 135612361).

#define H(x, y) y


No modification of source necesssary:

Restrict all your identifiers to legal C identifiers (no spaces or punctuation), and use code like: H(foobar)

Then the implementation is:

#ifdef RES_STRINGS
#define H(x) #x
#else
#include "resource_map.h"
#define H(x) RES_##x
#endif

Thatcher Ulrich said...

My two random thoughts:

1) this is a ridiculous optimization; just use strings.

2) just use strings, but require them to fit in X bytes, for some reasonable X. An MS GUID is 16 bytes, so let's take that as our acceptable limit. Suppose you have a 64 character alphabet; you can fit 21 characters in 16 bytes. This would involve some manual abbreviation. So

"c:\\meshes\\billboards\\a.bmp" (26 chars)

oops, too big. Use a more compact convention:

"mesh/bbd/a.bmp" (14 chars)

It would probably improve the artists' spelling too.

cbloom said...

You could just us 8 character strings and make the artists use names like "A83CF1A7" ;)

Autodidactic Asphyxiation said...

I agree with Thatcher. In your code, always use strings. Have your offline "baker" have the option to abbreviate your strings so they fit in some small amount of memory. Use a string class with small-string optimization (like tu_string). You can do even better than tu_string, since you are never mutating these values.

cbloom said...

"Have your offline "baker" have the option to abbreviate your strings so they fit in some small amount of memory."

Basically you're suggesting the "bake" method for string removal, which has its pros and cons. I don't think that "abbreviating" strings is any more or less viable than hashing them - you still have to have a method of resolving collisions and mapping back to the full names and so on.

Obviously you shouldn't do this optimization (string elimination) if it's not important to you. Games have good reason to believe/know that it is necessary. When you're working in a memory limit and you have 2 MB spent on strings, that is a clear target to attack.

On PC-only games you clearly don't bother (though PC-only games are where we've seen massive amounts of time in strcmp, so you should still do something like interning or hashing to eliminate the string compares even if you keep the full strings around all the time).

The other option is to simply not allow content to dynamically hook up to other content by name, but I think the advantages of that are so immense that it should not be sacrificed.

cbloom said...

I guess there are other viable options :

You could accelerate the string compares by keeping a hash or interning. But you still need the memory for the full strings.

You can reduce the memory for the full strings by what basically amounts to LZ compression. You can store characters in a 40 symbol alphabet or so, and maybe even use a static Huffman table so they are 3-6 bits per character. But a bigger memory use reduction would be prefix sharing.

If your acceleration method is interning / string table, then the prefix sharing can be done by using a trie to store the strings.

There are lots of papers on tightly compressed tries. So that should be a way to reduce memory use and speed up string compares.

You could also use a string table, but only keep the table during load time, then discard it. This only works if you never load content during play that requires name-based hookup. You could live with that even in a paging system by requiring the "header" part of all content is loaded at level-load time, and only the "data payload" part of content can be paged.

Autodidactic Asphyxiation said...

First, a clarification. When I mean "abbreviate" I mean find anything that you can cram into a few bytes, which could also be a hash. I didn't have time to fully write out my thought, since the train I was in was going into a tunnel. By the idea is that you just store the hash as a string, and rely on your string implementation to deal with it efficiently.

You could have a reasonable string class that fit in 8 bytes. It overlays 2 representations: either be a pointer to data (and length, if you're only 32-bit), or an interned buffer of up to 7 chars. String comparisons in the small case is naturally really fast.

The baker just has to convert all your string names to hashes, and cram that hash into those 7 chars. Even if you only used 7 bits in each of those chars, you have 2^49 hash values which should be enough for millions of IDs. The probability of collision should be really low, that you would never need to check it, except during mondo asset builds.

BTW, using hashing in this manner is called "fingerprinting."

cbloom said...

Why not just use a 64 bit hash then?

(I think that's actually a pretty good final answer, because collisions are so improbable)

But you still have the issue of wanting to map back to the original string for debug/display purposes and to check for the improbable collision, so it doesn't really change anything.

old rants