I'm of the opinion that games should not ship with strings. However, strings are very useful for development. How should we reconcile this? The basic strategy in all cases will be to replace the string with just an integer GUID or index. Then instead of strcmp you just do == test on the ints, and instead of storing lots of strings in memory, you just have ints to refer to other objects.
First of all, let's say that the premature optimization of making all your references into compact indices from the beginning is a mistake IMO. The main place that strings come up and where they are super useful is as the names of content. That is, textures, models, prefs, NPC types, weapon types, levels, attributes, etc. If I want to make an NPC it's nice to just be able to say, okay use the "ToughMarine" model and give him "AggressiveEnemy" behavior and "BigShotgun" weapon. Often those strings correspond to file names, either directly or indirectly, which is very handy because it lets you add new elements by just making new files, and it lets you find the content that a string corresponds to.
The bad premature optimization that I'm talking about is to work with "databases" of content right from the start rather than loose files. Then your references can be indices. So instead my NPC would use model 7, behavior 3, and weapon 41. This makes your references fast and compact right from the start, but the penalties to the development process are just too severe. (Some of the old game engines did this, but I believe it is now understood in the industry that this is very bad for source control and large teams, and it is being phased out.)
So, we don't want that method, we want to work with strings during development, but ship without strings.
( some examples of unacceptable workflow in my opinion : when artists have to yell out "who has the texture name database checked out? I have to add a new string to it" , or when there's an error in the content hookup and the game prints out "error : npc 13 couldn't find model 9" (though, kudos for at least printing an error) )
One option is to have a "bake" step that converts all the loose content and string references into these packed databases and indexes. Basically it has to load up every piece of content, see what all the strings in the game are and convert them to an index, then order all the data by its index so you can find it. While this does work, it's pretty painful. Whole-game bake operations like this are pretty disastrous for development, because it means you can't test the real shipping game without doing some big time consuming process. I'm a big believer in being able to run at least portions of the game in their real shipping state all the time during development, not just at the very end. A whole-game bake also makes it hard to just change one piece of content or add something and have it show up in the game right away.
Another option is to have a sort of "incremental baking" from a GUID server. I've seen this work in practice, so it is viable, but it's a little scary. Basically the idea is you keep a central string table that maps unique strings to indices (and vice versa). Each time you make a new piece of content or make a new string, you have to send a packet to the GUID server and get the index for that string. Now you can refer to your piece of content by that index, so you have compact references. While this does certainly work, relying on being able to communicate with the GUID server for development is a bit scary. Also if you ever accidentally get a bug in the GUID system you could corrupt a bunch of content. To deal with both of those issues, it would be nice to keep the original strings around in the content files during development as backup for the central repository.
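The table at the heart of such a server might look like the sketch below. This is only the in-process string <-> index mapping; the network protocol and persistence are left out entirely, and StringTable and its method names are hypothetical, not taken from any real system.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical central table: each unique string gets the next free index.
class StringTable {
public:
    // Returns the existing index for `s`, or assigns a new one.
    std::uint32_t GetOrAssign(const std::string& s) {
        auto it = indexOf_.find(s);
        if (it != indexOf_.end()) return it->second;
        const std::uint32_t index = static_cast<std::uint32_t>(strings_.size());
        strings_.push_back(s);
        indexOf_.emplace(s, index);
        return index;
    }

    // Reverse mapping, for error messages and tools.
    const std::string& Lookup(std::uint32_t index) const {
        assert(index < strings_.size());
        return strings_[index];
    }

    std::size_t Size() const { return strings_.size(); }

private:
    std::vector<std::string> strings_;                        // index -> string
    std::unordered_map<std::string, std::uint32_t> indexOf_;  // string -> index
};
```

The scary part is visible even here: the index a string gets depends on the order in which everyone asked, so there must be exactly one authoritative copy of this table.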
The option we used for Oddworld Stranger was string hashing. In the shipping game, every 32 bit char * was replaced with a 32 bit integer hash of the char *. Using a hash makes the string -> index mapping deterministic and local, you don't have to talk to your neighbor to make sure you are getting a unique int. This method has lots of advantages and only one (rather large) disadvantage : the possibility of hash collisions. When you get a hash collision you wind up with a perplexing but amusing bug, like you try to put "SinisterHelmet" on a guy and instead you get "RubberChicken" attached to his head because those string hashes collided. During development you need to keep both the hash and the string around.
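A minimal sketch of the hash + string reference, assuming FNV-1a as the hash function (I'm not saying that's what Stranger used; any decent 32-bit string hash would do). HashString32, ContentName and the SHIPPING_BUILD define are made-up names for illustration.

```cpp
#include <cstdint>

// FNV-1a, 32-bit. An assumed stand-in for whatever hash you prefer.
constexpr std::uint32_t HashString32(const char* s) {
    std::uint32_t h = 2166136261u;              // FNV offset basis
    while (*s) {
        h ^= static_cast<unsigned char>(*s++);
        h *= 16777619u;                          // FNV prime
    }
    return h;
}

// Hypothetical content reference. Define SHIPPING_BUILD to compile the
// string pointer out, leaving just the 32-bit hash.
struct ContentName {
    std::uint32_t hash;
#ifndef SHIPPING_BUILD
    const char* debugString;                     // development builds only
#endif

    constexpr explicit ContentName(const char* s)
        : hash(HashString32(s))
#ifndef SHIPPING_BUILD
        , debugString(s)
#endif
    {}
};

// Comparing two references is a single integer test, no strcmp.
constexpr bool operator==(const ContentName& a, const ContentName& b) {
    return a.hash == b.hash;
}
```

Because the hash is constexpr, a name written in code like ContentName("ToughMarine") can be hashed at compile time; names that live in data files would get their hashes written out by the tools instead.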
To handle collisions, we had a debug mode that would load up a level with the strings and the hashes and check if any strings had the same hashes. Note that you don't need to check collisions across the whole game, but only for content that can possibly be loaded at the same time. On Stranger I think we wound up with 2 collisions over the whole course of development. The solution to those collisions was simply to rename some content. eg. "FloppyHat" was colliding with "PurpleSparkle" , so we just renamed "PurpleSparkle" to "PurpleSparkle2" and we were hash collision free. I definitely do not advocate the 32-bit hash method for large scale game development. It's okay on a small team where you can check things like that once in a while and handle it, but with big distributed teams and huge amounts of content it's not viable.
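A debug-mode check along those lines could be sketched like this. It assumes the level's development data gives you a flat list of (hash, string) pairs; NamedHash, Collision and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// One entry of the development-only name table for a level.
struct NamedHash {
    std::uint32_t hash;
    std::string name;
};

struct Collision {
    std::uint32_t hash;
    std::string first;   // the name seen earlier with this hash
    std::string second;  // the different name that hashed the same
};

// Reports every pair of *different* names that share a hash.
// The same name appearing twice is not a collision.
inline std::vector<Collision> FindHashCollisions(const std::vector<NamedHash>& loaded) {
    std::unordered_map<std::uint32_t, std::string> seen;
    std::vector<Collision> collisions;
    for (const NamedHash& entry : loaded) {
        auto inserted = seen.emplace(entry.hash, entry.name);
        const std::string& earlier = inserted.first->second;
        if (!inserted.second && earlier != entry.name) {
            collisions.push_back(Collision{entry.hash, earlier, entry.name});
        }
    }
    return collisions;
}

// Guard a debug level loader could call once its name table is read.
inline void AssertNoHashCollisions(const std::vector<NamedHash>& loaded) {
    assert(FindHashCollisions(loaded).empty() && "two content names share a hash; rename one");
}
```

Only the names that can be loaded together need to go into the list, which keeps the check cheap enough to run per level.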
The simplest fix is to go to a 64-bit hash. Then the probability of collision becomes so small that you could just deal with it in the unlikely event that it happens. Note that in the shipping game you never actually generate these hashes, so the speed of the hash function is irrelevant; in the final game they are opaque GUIDs that are used to link pieces of content together.
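To put rough numbers on "so small", here is a sketch using the standard birthday-bound approximation, P ~= 1 - exp(-n(n-1) / 2^(bits+1)), which assumes the hash behaves like a uniform random function. CollisionProbability is a hypothetical helper, and the 100,000-name figure below is just an illustrative content count, not a number from any real game.

```cpp
#include <cassert>
#include <cmath>

// Approximate chance that at least two of `numStrings` distinct names
// collide under an ideal `hashBits`-bit hash (birthday bound).
inline double CollisionProbability(double numStrings, int hashBits) {
    assert(numStrings >= 0.0 && hashBits > 0);
    const double pairs = numStrings * (numStrings - 1.0) / 2.0;
    const double buckets = std::ldexp(1.0, hashBits);  // 2^hashBits
    return -std::expm1(-pairs / buckets);              // 1 - exp(-pairs/buckets)
}
```

With 100,000 names that can be loaded together, CollisionProbability(100000, 32) comes out around 0.69, while CollisionProbability(100000, 64) is around 2.7e-10. Real hashes are not perfectly uniform, so treat these as ballpark figures.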
With the hash+string method there's also an issue of how you store your content and what you do with the string in the shipping version of the game. What we did is just to store both inline in the data files (eg. 4 bytes of hash then the string bytes immediately following), and in the shipping game you just read the string in then immediately discard it. But if you want to be really super optimal and flat-load your data, then you would have to factor out the strings at some point. One option is to make all your data files in chunks, and make one of the chunks be "nonfinal" data, and put the hash -> string mapping in the nonfinal data chunk. Then you can "flat load" the rest of the data.
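The inline layout can be read with something like the sketch below. The details are assumptions: I'm taking the hash to be stored little-endian and the string to be null-terminated, and NameRef, ReadNameRef and the keepStrings flag are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct NameRef {
    std::uint32_t hash;
    const char* debugString;  // nullptr when the string was discarded
};

// Reads one reference at `cursor` and advances past it.
// Assumed layout: 4-byte little-endian hash, then a null-terminated string.
// Pass keepStrings = false in the shipping game to skip over the string.
inline NameRef ReadNameRef(const unsigned char*& cursor, const unsigned char* end,
                           bool keepStrings) {
    assert(end - cursor >= 5);  // 4 hash bytes plus at least the terminator
    NameRef ref;
    ref.hash = static_cast<std::uint32_t>(cursor[0])
             | (static_cast<std::uint32_t>(cursor[1]) << 8)
             | (static_cast<std::uint32_t>(cursor[2]) << 16)
             | (static_cast<std::uint32_t>(cursor[3]) << 24);
    cursor += 4;

    const std::size_t remaining = static_cast<std::size_t>(end - cursor);
    const void* terminator = std::memchr(cursor, 0, remaining);
    assert(terminator != nullptr);  // malformed data: string never ends

    ref.debugString = keepStrings ? reinterpret_cast<const char*>(cursor) : nullptr;
    cursor = static_cast<const unsigned char*>(terminator) + 1;
    return ref;
}
```

Note that the shipping path still has to walk the string bytes to find where the next field starts, which is exactly why this layout can't be flat-loaded.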
But none of these solutions is definitively awesome, so I'm curious what other people do or if there is a better solution.
ADDENDUM : I knew I'd seen this written about before, but forgot about it.
Mick West wrote about it for Game Developer but doesn't really address the difficult issues.
Ivo Beltchev describes CRCView - a VC extension to do GetStr() for CRC's. I wonder how well that works; I've always found the VC expression evaluator to stop working right when I need it most.
Mischief.mayhem puts it together. He suggests using an external SQL database for GUID->String mapping and collision resolution.
BTW I'm not sure why you bother with hashing if you are using a global database. Just make the first string be "1", the second string be "2", etc. But I'm not super convinced that the global database is an okay method.