Comments on cbloom rants: 11-19-10 - Games without Strings

Why not just use a 64 bit hash then? (I think tha...

2010-11-29T09:51:05.713-08:00

Why not just use a 64 bit hash then?

(I think that's actually a pretty good final answer, because collisions are so improbable)

But you still have the issue of wanting to map back to the original string for debug/display purposes and to check for the improbable collision, so it doesn't really change anything.

First, a clarification. When I mean "abbrevia...

2010-11-28T22:51:04.003-08:00

First, a clarification. When I mean "abbreviate" I mean find anything that you can cram into a few bytes, which could also be a hash. I didn't have time to fully write out my thought, since the train I was in was going into a tunnel. By the idea is that you just store the hash as a string, and rely on your string implementation to deal with it efficiently.

You could have a reasonable string class that fit in 8 bytes. It overlays 2 representations: either be a pointer to data (and length, if you're only 32-bit), or an interned buffer of up to 7 chars. String comparisons in the small case is naturally really fast.

The baker just has to convert all your string names to hashes, and cram that hash into those 7 chars. Even if you only used 7 bits in each of those chars, you have 2^49 hash values which should be enough for millions of IDs. The probability of collision should be really low, that you would never need to check it, except during mondo asset builds.

BTW, using hashing in this manner is called "fingerprinting."

I guess there are other viable options : You coul...

2010-11-28T13:34:56.030-08:00

I guess there are other viable options :

You could accelerate the string compares by keeping a hash or interning. But you still need the memory for the full strings.

You can reduce the memory for the full strings by what basically amounts to LZ compression. You can store characters in a 40 symbol alphabet or so, and maybe even use a static Huffman table so they are 3-6 bits per character. But a bigger memory use reduction would be prefix sharing.

If your acceleration method is interning / string table, then the prefix sharing can be done by using a trie to store the strings.

There are lots of papers on tightly compressed tries. So that should be a way to reduce memory use and speed up string compares.

You could also use a string table, but only keep the table during load time, then discard it. This only works if you never load content during play that requires name-based hookup. You could live with that even in a paging system by requiring the "header" part of all content is loaded at level-load time, and only the "data payload" part of content can be paged.

"Have your offline "baker" have the...

2010-11-28T13:15:38.422-08:00

"Have your offline "baker" have the option to abbreviate your strings so they fit in some small amount of memory."

Basically you're suggesting the "bake" method for string removal, which has its pros and cons. I don't think that "abbreviating" strings is any more or less viable than hashing them - you still have to have a method of resolving collisions and mapping back to the full names and so on.

Obviously you shouldn't do this optimization (string elimination) if it's not important to you. Games have good reason to believe/know that it is necessary. When you're working in a memory limit and you have 2 MB spent on strings, that is a clear target to attack.

On PC-only games you clearly don't bother (though PC-only games are where we've seen massive amounts of time in strcmp, so you should still do something like interning or hashing to eliminate the string compares even if you keep the full strings around all the time).

The other option is to simply not allow content to dynamically hook up to other content by name, but I think the advantages of that are so immense that it should not be sacrificed.

I agree with Thatcher. In your code, always use st...

2010-11-28T09:20:02.260-08:00

I agree with Thatcher. In your code, always use strings. Have your offline "baker" have the option to abbreviate your strings so they fit in some small amount of memory. Use a string class with small-string optimization (like tu_string). You can do even better than tu_string, since you are never mutating these values.

You could just us 8 character strings and make the...

2010-11-23T15:03:44.459-08:00

You could just us 8 character strings and make the artists use names like "A83CF1A7" ;)

My two random thoughts: 1) this is a ridiculous o...

2010-11-23T14:46:31.530-08:00

My two random thoughts:

1) this is a ridiculous optimization; just use strings.

2) just use strings, but require them to fit in X bytes, for some reasonable X. An MS GUID is 16 bytes, so let's take that as our acceptable limit. Suppose you have a 64 character alphabet; you can fit 21 characters in 16 bytes. This would involve some manual abbreviation. So

"c:\\meshes\\billboards\\a.bmp" (26 chars)

oops, too big. Use a more compact convention:

"mesh/bbd/a.bmp" (14 chars)

It would probably improve the artists' spelling too.

Basically all you have to do in code is write H(&q...

2010-11-23T12:30:17.923-08:00

Basically all you have to do in code is write H("foo bar") and the pre-processing step will modify the code into H("foo bar", 135612361).

#define H(x, y) y

No modification of source necesssary:

Restrict all your identifiers to legal C identifiers (no spaces or punctuation), and use code like: H(foobar)

Then the implementation is:

#ifdef RES_STRINGS
#define H(x) #x
#else
#include "resource_map.h"
#define H(x) RES_##x
#endif

"Only if you keep around the strings all the ...

2010-11-22T08:34:02.120-08:00

"Only if you keep around the strings all the time, right? I guess even without the strings you could tell that the hash is associated with two different data items."

Right we detect it when we build up the lookup structure which is either done at startup or as a deploy step (bake step).

When there is a collision we just require the asset to be renamed. In the last 3 years we've had maybe 8 collisions and we have a lot of assets. Again it helps that the string doesn't have to be unique across all types.

"It sounds like this is a file; is it local to each person or shared? Basically its a mini database"

Yes, everyone basically has a local version of this file that is merged with the global version (checked in version).

So essentially everyone has a mini-db that is a combination of their view of the assets and the depot's view.

For the local table we just keep adding strings until a new build is retrieved from the depot and then we do a clean build of the symbols.

In Unity we just never use string paths to identif...

2010-11-21T23:03:49.195-08:00

In Unity we just never use string paths to identify assets => problem solved? :)

Every asset has a GUID, and in any code you never reference any assets by names. All references to everything are set up visually (in the editor); or populated from editor side scripts (which have utility functions to convert between GUIDs and paths etc.).

This has some downsides, but also some advantages: i.e. you can reorganize your assets at any time, and nothing breaks. Rename them, move them around, organize into folders etc. - everything goes. Also, the runtime side never sees any asset path strings, so part of this "lots of strings" problem goes away.

Internal implementation detail: GUIDs uniquely identify an asset in the editor/offline part. When building the game, all identifiers are converted to smaller unique numbers (IIRC, 32 bit) and runtime just uses those.

On the Witness we are using strings resolved at lo...

2010-11-21T22:35:29.173-08:00

On the Witness we are using strings resolved at load time, with dummy objects for paged out stuff. Fairly straightforward.

If string comparisons turn out to be a problem, the plan is to move to hash+strings and drop the strings in final builds, pretty much like you did at OW.

I haven't really thought much about collisions... our editor is the game, so I guess we could trigger an error as soon as the offensive asset is referenced. The most annoying case is dealing with indirect references.

We have separate tables for each resource type, so I guess reduces the chances of having a collision a little bit. If frequent collisions were a problem in practice, in this game I would probably keep things simple and use larger hashes or any other solution that doesn't require artist intervention.

At OW we had something like 100k named references ...

2010-11-21T18:29:29.941-08:00

At OW we had something like 100k named references in our largest levels. I asssume very large world modern games are well beyond that.

"If you're working with strings, just use string interning."

Yeah, I was thinking about mentioning this. It is just a form of index into a string table, however.

It seems to be much less efficient than other methods, however, because it requires you to still load and match strings in
the final build. In fact it's basically a no-go because of that.

"Depending on the size of strings per bundle and the number of explicit asset references in the code"

The issue is cross-bundle references. To do string interning you have to have a global string table that makes unique pointers for each data item.

Paging games my have lots of references that aren't baked together into single bundles.

If you're working with strings, just use strin...

2010-11-21T18:22:39.432-08:00

If you're working with strings, just use string interning. Even in debug builds, that's a single char* (usually wrapped in a class) that's self-describing and unique (so you can use the pointer as hash index). And you can immediately spot all time spent hashing strings to do the interning (since it's all in one ctor).

You can serialize an interned string table along with the rest of the "bundle", usual pointer fixup takes care of the rest. That gives each bundle its own "namespace", if you want to access stuff from other bundles you need to look up in the namespaces of all bundles currently loaded which isn't a big deal as the count is usually low.

Depending on the size of strings per bundle and the number of explicit asset references in the code, that's all you need. Hashes/GUIDs are worth the hassle if you have tens of thousands of named assets in memory at the same time, but not if it's hundreds.

It depends on how you organize your content I guess. We only ever had a tiny number of actually named objects in the assets that got loaded by the game. Almost all of it got stripped during content build time.

BTW there are advantages of the GUID->String ta...

2010-11-21T13:42:07.907-08:00

BTW there are advantages of the GUID->String table that I don't see mentioned much.

For example, you can use it as a centralized way to redirect content.

eg. your original string might be :

LoadTexture("c:\\meshes\\billboards\\a.bmp");

you convert this into

LoadTexture(GUID 17)

and somewhere you have an association

17:"c:\\meshes\\billboards\\a.bmp"

but now you can change it in the mapping, eg.

17:"c:\\meshes\\billboards\\a_alt.bmp"

this can used for example to make localized builds without rebuilding the code.

It can also be used to point content at per-platform compiled versions. Eg. on the PC you might just load the BMP as above, but when you run on 360 it uses a different GUID map and finds :

17:"xe:\\mesh360\\billboards\\a.ptc"

or whatever.

" Now it's a disaster, you have to copy-p...

2010-11-21T13:37:46.954-08:00

" Now it's a disaster, you have to copy-paste the hash into this side app that looks it up in the token map. Irritating!"

Oh dear lord.

"The other big question is where to keep your backing string for these. At OW we kept the string right in the token class, right. Made it a little bigger, but was great for debugging."

Yeah, the OW Token was :

struct Token {
U32 m_hash;
#ifndef FINAL
const char * m_pDebugString;
#endif
};

which was indeed handy for debugging, but has the big disadvantage that you can no longer just IO it as binary, because it changes size in the FINAL game.

There's also a sort of separate but related issue of how the strings are associated with the hashes in the data files. At OW we would just write the strings next to the hashes in the data files. If you were running in a "no string backing" build the file reader would just skip over those strings. Then we also had a final bake that would strip the strings out of the files at the very end.

Inevitably that always leads to bugs due to running in a mode you haven't run all along, so you get "problem with 0x5cade7f" which is damn annoying.

Phil - sup! "This also means it's easy t...

2010-11-21T12:58:46.259-08:00

Phil - sup!

"This also means it's easy to test for hash-collisions within a resource manager."

Only if you keep around the strings all the time, right? I guess even without the strings you could tell that the hash is associated with two different data items.

But the big issue is - what do you do when there is a collision?

"For debugging purposes we create a symbol table that stores all hash->string lookups that we need."

It sounds like this is a file; is it local to each person or shared? Basically its a mini database.

"The super fast/small runtime format for the ...

2010-11-21T12:55:53.767-08:00

"The super fast/small runtime format for the final build and the iteration time friendly version which allows fast replacement of single resources."

This is basically the "bake process" method that I mentioned first.

It certainly has its advantages. The big disadvantage is that if the unbaked version can easily be very slow to load or very fat in memory; eg. on a 256MB console if your baked version is actually using all the memory, then the unbaked data won't fit and you're in trouble.

Similar to what Phil is doing, except we don't...

2010-11-21T09:58:13.605-08:00

Similar to what Phil is doing, except we don't have a preprocessor to do it, we have to run a perl script manually on the source file to convert it.

We also (over me screaming bloody murder) use 32 bit hashes, which cause us a collision about once every two or three days. So we have a remapping table that's a huge pain in the ass to deal with.

Does anyone here use 32 bit hashes in a real, modern, big game?

The other big question is where to keep your backing string for these. At OW we kept the string right in the token class, right. Made it a little bigger, but was great for debugging. Someone wanted to save 200k where I work, so they took that out. Now it's a disaster, you have to copy-paste the hash into this side app that looks it up in the token map. Irritating!

We use string hashes everywhere in our project. W...

2010-11-21T06:31:09.582-08:00

We use string hashes everywhere in our project. We have a code-preprocessing step that generates the hashes at compile time. Basically all you have to do in code is write H("foo bar") and the pre-processing step will modify the code into H("foo bar", 135612361). H is simply a macro:

#define H(x, y) y

All our resource managers keep track of the hash->resource mapping which is essentially the same as your guid->resource lookup.

This also means it's easy to test for hash-collisions within a resource manager. We actually allow collisions across resource managers since you always know the context in which the hash is used in.

For debugging purposes we create a symbol table that stores all hash->string lookups that we need. We have a symbol table per resource manager and several generic ones for hashes that are used outside of the resource system. These symbol tables are loaded on demand in our non-ship builds and are mainly used for error messages and such.

We actually pre-process our Lua scripts to do the same thing.

Most data in our games is not directly referenced ...

2010-11-21T02:25:39.695-08:00

Most data in our games is not directly referenced from code but from other data (The LevelSector Resource has references to Models, Collisions, Entities, Lights, ...) The Level sector file is output from a level converter and contains string references to the other resources. This can be loaded into the game directly (which will automatically resolve and load all Resources) or linked into a statically linked file which replaces all resource references with direct offsets (It bakes all referenced resources into one big file). This means we can still have both: The super fast/small runtime format for the final build and the iteration time friendly version which allows fast replacement of single resources.

Resources that are referenced directly from code are loaded in life-time based linked packets as well and we use pregenerated string-CRCs for the reference.

Because the life time of level sectors and "global" stuff is different anyway we have "namespaces" for the different data packs (and for resource-types). So the Id is basically a triplet containing the data-bucket (life-time-dependent), the resource type and a crc of the original filename of the resource.

This works quite fine for us.

Oh yeah, I should have been more explicit, but I a...

2010-11-20T14:49:42.072-08:00

Oh yeah, I should have been more explicit, but I assume that in all cases the references are resolved to pointers when possible, you don't keep doing name or hash lookups in inner loops. eg. if I put texture X on my object, I don't keep doing TextureManager::Get(X) every time I render, I cache the pointer.

However, for paging or optional objects, that pointer might not be around, so there's an issue of what you do for references that can't be resolved. One option is to make an empty dummy object for that item in "paged out" state and go ahead and resolve the name lookup to a pointer. The other object is to keep the strings around and do the resolve when it becomes resident.

We use strings to ID content, but make sure that a...

2010-11-20T14:42:27.317-08:00

We use strings to ID content, but make sure that all the comparisons happen during loading, where their cost is far outweighed by IO. Everything is references by direct memory pointers after that. Sometimes we get lazy during development and let strings slip into updating or rendering, but profiling passes allow us to find and cull them later.