3/23/2009

03-23-09 - Oodle GDI and Strings

Some random Oodle thoughts/questions :

My ThreadViewer is currently using GDI. It's pretty good, I like it. I did it originally because I consider it a favor to the user. There are some nice things about using GDI instead of a 3D API - it plays nicer like a normal windows app, you know it gets Paint messages with rects and only repaints what it needs to, it can be dragged around and respects the user's choice of whether to paint while dragging, and of course it doesn't gobble CPU when it's just sitting there. It also doesn't interfere with other 3D apps using the same device driver and memory and whatnot.

There are some disadvantages though. GDI is butt slow. If I draw much at all it really dogs. It's possible I could speed that up. For example I'm drawing individual lines and it might be faster to draw line-lists. I also wonder if it would be a win to batch up states. Right now I do SetPen/SetBrush and then draw, over and over; I could batch up draws with the same settings to avoid state changes. Another disadvantage is that fancier blend modes and such are tricky or slow with GDI.
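For example, batching by pen might look roughly like this (a sketch, not the actual ThreadViewer code; the types are made up for illustration):

// Sketch: batch line segments by pen color so each color costs one
// CreatePen/SelectObject plus a single PolyPolyline call, instead of
// a SetPen + LineTo per segment.  (Hypothetical types, not Oodle code.)
#include <windows.h>
#include <map>
#include <vector>

struct LineBatch { std::vector<POINT> pts; };  // two POINTs per segment

void DrawBatchedLines(HDC dc, const std::map<COLORREF, LineBatch> & batches)
{
    for (const auto & it : batches)
    {
        HPEN pen = CreatePen(PS_SOLID, 1, it.first);
        HGDIOBJ old = SelectObject(dc, pen);

        const std::vector<POINT> & pts = it.second.pts;
        // PolyPolyline draws many disjoint segments in one call;
        // each segment here is a 2-point polyline.
        std::vector<DWORD> counts(pts.size() / 2, 2);
        if (!counts.empty())
            PolyPolyline(dc, pts.data(), counts.data(), (DWORD)counts.size());

        SelectObject(dc, old);
        DeleteObject(pen);
    }
}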

One option is to do my own 2d drawing by locking a DIB and using my own rasterizer. Probably crazy talk. The other option is just to be rude and use 3d.


The issue of strings in Oodle is quite interesting. The way you refer to resources is by name (string). Obviously the file names that I load have to be strings because that's what the OS uses. The main data structures that map resources to files are thus big string->string maps. (at some point I'm going to rewrite this to make it a bidirectional map, a database that can go resource->file or file->resource, which is just string<->string with hashes both ways).

Currently what I have is a ref-counted string that can optionally also just be a plain C-string. It's a 4-byte pointer and I stuff a flag in the bottom bit to indicate if it's got a ref-count or not. If not, it's assumed to point at const memory. If it does have a ref count, the count is right before the string data. Thus getting to the characters is always fast. I also almost always pass around a 4-byte hash with the string, which lets me avoid doing string compares in almost all cases. I'm usually passing around 8-byte (64 bit) "HashedString" objects. I only actually do a string compare if the hashes match but the pointers don't, which in my world is basically never. This is very good for the cache because it means you never actually follow the pointer to the string data.
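In rough code the scheme is something like this (a simplified sketch of the idea, not the actual Oodle implementation):

// Sketch: a 4-byte string pointer with a refcount flag in the low bit,
// plus the 4-byte hash carried alongside (8 bytes total on 32-bit).
// (Simplified illustration, not Oodle code.)
#include <stdint.h>
#include <string.h>

struct HashedString
{
    uint32_t  hash;  // hash of the characters
    uintptr_t ptr;   // pointer to chars; low bit set => refcounted

    const char * Chars() const { return (const char *)(ptr & ~(uintptr_t)1); }
    bool IsRefCounted() const  { return (ptr & 1) != 0; }

    // If refcounted, the count sits immediately before the character data.
    int32_t * RefCount() const { return (int32_t *)Chars() - 1; }

    bool operator == (const HashedString & rhs) const
    {
        if (hash != rhs.hash) return false;        // almost always decides it
        if (ptr == rhs.ptr)   return true;         // same storage, no deref
        return strcmp(Chars(), rhs.Chars()) == 0;  // rare: hashes match, pointers don't
    }
};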

One problem with this is that the strings can still wind up taking a lot of memory. The most obvious solution to that we've discussed before here. It's to explicitly use {Path/Name} instead of {Full Path}. That is, break off the directory from the file name and share the directory string for all files in the same dir. That would definitely be a big win and is one option.
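That might look something like this (a hypothetical layout, just to illustrate the split):

// Sketch of the {Path/Name} split: each directory string is stored once,
// and a resource holds just an index into that table plus its own file name.
// (Hypothetical names, for illustration only.)
#include <stdint.h>
#include <string>
#include <vector>

struct FileRef
{
    uint32_t    dirIndex;  // index into the shared directory table
    std::string name;      // file name only, no path
};

struct DirTable
{
    std::vector<std::string> dirs;  // each directory string stored once

    std::string FullPath(const FileRef & f) const
    {
        return dirs[f.dirIndex] + "/" + f.name;
    }
};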

Another option is to have a "Final mode" where the strings go away completely. The 4-byte hash would be kept, and instead of a 4-byte pointer to string data there would be a uniquifier to break hash ties. In order to make the right uniquifier you would have to have a full list of all the strings used in the game, which you of course do have with Oodle if your game is done. Then the uniquifier has to get baked into all the files, so there would have to be a "destring" pass over all the data that may or may not be reversible. This could be made relatively transparent to the client. One annoyance we had with this kind of thing at Oddworld is that once you bake out the strings, if you have any errors you get messages like "Resource 0x57EA10BF failed to load!". Still this would make the super-low-memory console wonks happy.
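The baked id could be as small as this (a sketch of the idea; the uniquifier assignment happens offline, where the full list of strings is known):

// Sketch of a "final mode" resource id: the string is gone, leaving the
// 4-byte hash plus a small uniquifier assigned by the offline tool to
// break ties among strings that share a hash.  (Illustrative only.)
#include <stdint.h>

struct BakedResourceId
{
    uint32_t hash;        // same hash used at tool time
    uint32_t uniquifier;  // 0,1,2,... among the known hash collisions

    bool operator == (const BakedResourceId & rhs) const
    {
        return hash == rhs.hash && uniquifier == rhs.uniquifier;
    }
};
// Once the names are baked out, errors can only say something like
// "Resource 0x57EA10BF (uniquifier 0) failed to load!"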

Another option is to combine the strings very aggressively. I already am running all the strings through a single global string table (the Game Dictionary that has the resource<->file map also makes all strings unique - this is a big win with the refcounted string - whenever you read a string from a file, rather than allocating a new buffer for that string you check whether you already have it, and if so you just use the one you already have and bump the refcount). I'm not actually enforcing that to be required, but a lot of possibilities open up if you do require that.

For one thing, a "string" can just become an index or handle to the global string table. I almost never care about the contents of a string, I just use it as a key to find things, so if the char data is converted to a unique index, I can just compare indices for equality. Then, that index in the string table need not point at plain old char data. For example it could be an index to a leaf of a Trie. Each string gets added to the Trie and only unshared parts of the string make new chars. Each unique string is a leaf, those get labels and that's the handle that is given back to index the global string table.
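A minimal version of the index-handle idea, using a plain hash map instead of a Trie just to show the interface (a sketch, not the real thing):

// Sketch of a global string table where a "string" becomes an index;
// comparing two interned strings is then just an integer compare.
#include <stdint.h>
#include <string>
#include <unordered_map>
#include <vector>

class StringTable
{
public:
    uint32_t Intern(const std::string & s)
    {
        auto it = m_lookup.find(s);
        if (it != m_lookup.end())
            return it->second;                 // already present: reuse the index
        uint32_t index = (uint32_t) m_strings.size();
        m_strings.push_back(s);
        m_lookup.emplace(s, index);
        return index;
    }

    const std::string & Chars(uint32_t index) const { return m_strings[index]; }

private:
    std::vector<std::string> m_strings;
    std::unordered_map<std::string, uint32_t> m_lookup;
};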

Another option would be just some sort of LZ type compression. Some strings get full data, but other strings are just prefix matches, like "use the first N bytes from string M and then add these chars".
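For instance (a sketch, assuming each entry may reference an earlier string in the table):

// Sketch of prefix (LZ-ish) string compression: each entry records how many
// leading bytes it shares with an earlier string, plus only its new suffix.
#include <stdint.h>
#include <string>
#include <vector>

struct PackedString
{
    uint32_t    base;       // index of the string we borrow a prefix from
    uint32_t    prefixLen;  // number of leading bytes reused from 'base'
    std::string suffix;     // the remaining unique bytes
};

std::string Unpack(const std::vector<PackedString> & table, uint32_t i)
{
    const PackedString & p = table[i];
    std::string out;
    if (p.prefixLen > 0)
        out = Unpack(table, p.base).substr(0, p.prefixLen);
    return out + p.suffix;
}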

I think in the end I'll probably have to do the zero-strings option to please the masses. It also has the advantage of making your memory use more controlled and calculable. Like, you use 32 bytes per resource, not 32 bytes + enough to hold the strings whatever that is.

21 comments:

Anonymous said...

One option is to do my own 2d drawing by locking a DIB and using my own rasterizer.

If you do this, there's no need to lock. Just allocate some memory, draw into it, then make a dib out of it and call SetDIBitsToDevice. That's how all my "portable" apps work, like Chromatron. I never quite polished up my 2D drawing API into a releasable header file, though.
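Roughly like this (a sketch of the approach, not Chromatron's actual code):

// Sketch: software-rasterize into a plain 32-bit buffer you allocated
// yourself, then push it to the window with SetDIBitsToDevice - no DIB
// section or locking involved.
#include <windows.h>
#include <stdint.h>

static void BlitBuffer(HDC dc, const uint32_t * pixels, int w, int h)
{
    BITMAPINFO bmi = {};
    bmi.bmiHeader.biSize        = sizeof(BITMAPINFOHEADER);
    bmi.bmiHeader.biWidth       = w;
    bmi.bmiHeader.biHeight      = -h;   // negative height = top-down rows
    bmi.bmiHeader.biPlanes      = 1;
    bmi.bmiHeader.biBitCount    = 32;
    bmi.bmiHeader.biCompression = BI_RGB;

    SetDIBitsToDevice(dc, 0, 0, w, h,   // destination rect
                      0, 0,             // source origin
                      0, h,             // first scanline, scanline count
                      pixels, &bmi, DIB_RGB_COLORS);
}

// Typical use inside WM_PAINT:
//   PAINTSTRUCT ps;
//   HDC dc = BeginPaint(hwnd, &ps);
//   BlitBuffer(dc, myPixels, myWidth, myHeight);
//   EndPaint(hwnd, &ps);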

And as to strings, keeping only one copy of each string actually has a name: interning.

cbloom said...

You know I've actually learned what "interning" means like ten times now and every time I immediately forget it, because it's such a bad name. It's almost the opposite of what it's actually doing, it's more like "externing", storing the string data in an external place and just referring to it. Or "uniquing" or something. Not "interning".

Every time I read "interning" I think "oh is that the thing where you store 4-character or shorter strings directly in the pointer value?".

ryg said...

Just drop the "intern" and refer to it as "String pool". Unambiguous enough for me (where else do you actually end up pooling strings)?

I also have the same problem with open/closed hashing - I'm never sure which is which. Whenever there's confusing terminology, I try to use whatever term describes best what's actually going on - e.g. "chained" instead of "open" hashing.

cbloom said...

Oh yeah, hashing is particularly LOL. From Wikipedia :

"Open addressing, or closed hashing"

whaaa???

I hereby rename them "chained hashing" or "probing hashing".

Anonymous said...

I call them O(N) and O(1) hashing.

Bret said...

There's also a psychological advantage to completely baking out the strings. If the developers know they don't have to worry about string size, they're more likely to use longer, more descriptive names: "AutomaticPlasmaRifleWithLaserSight" instead of "apr+ls", for example. Easier-to-read code makes life better for everyone.

cbloom said...

Yeah Bret, that's true. It would make people feel better emotionally about the whole string thing if I could just say "they go away completely in final mode".

won3d said...

I always thought open/closed for hash tables made sense.

@nothings,

You might think this is a nitpick, but choice of chaining/probing does not change the big-O for most hash table operations since big-O is about worst-case. You could claim hash tables are O(1) with some qualifiers, perhaps.

cbloom said...

I've always disliked the O(1) claims about hash tables. If they're under-full they're O(1), but if they're nearly full they're O(N), so it seems like in practice most of the time they're around O(logN). I suppose there's a big article on this on Wikipedia... :(

Anonymous said...

Big-O notation isn't about "worst case", it's an *upper-bound*. (Also, we often use it as shorthand where we mean big-theta or whichever one it is that means both upper and lower bound.) Thus you can say "this hash table has O(N) average-case performance, and this other one has O(1) average-case performance" and there's no contradiction.

A hash table that doubles in size when a fullness-threshold is reached avoids the concern Charles mentions. You can do this doubling with both open- and closed- hash tables, but in practice I'm not sure anyone does it with external chaining and everybody (should) do it with internal probing. This makes my O(N) vs O(1) assertions not wrong.

Compared to internal probing with secondary hashing, chaining has a problem if a lot of elements hash to the same slot; they all mutually interfere in a way that secondary hashing avoids (but other forms of internal probing don't!). You can try to avoid this by making the external chain something other than a linked list; it could be a binary search tree, for instance, and then it would make the external-chained implementation O(log N), making the whole data structure O(log N). But actually, you know, as long as you're at speeding this up, maybe you should make it use a faster data structure for each chain. Like, I dunno, a hash table. Maybe even an O(1) one.
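For reference, external chaining with load-factor-triggered doubling is only a small amount of code (a minimal sketch, not any particular library's implementation):

// Minimal sketch of an externally-chained hash table that doubles when the
// load factor crosses a threshold, so inserts stay amortized O(1).
#include <stddef.h>
#include <functional>
#include <list>
#include <vector>

template <class Key>
class ChainedSet
{
public:
    ChainedSet() : m_buckets(16), m_count(0) {}

    void Insert(const Key & k)
    {
        if (m_count + 1 > m_buckets.size() * 2)   // max load factor of 2
            Rehash(m_buckets.size() * 2);
        m_buckets[Index(k)].push_back(k);
        ++m_count;
    }

    bool Contains(const Key & k) const
    {
        for (const Key & x : m_buckets[Index(k)])
            if (x == k) return true;
        return false;
    }

private:
    size_t Index(const Key & k) const { return std::hash<Key>()(k) % m_buckets.size(); }

    void Rehash(size_t newSize)
    {
        std::vector< std::list<Key> > old;
        old.swap(m_buckets);
        m_buckets.resize(newSize);
        for (const auto & chain : old)
            for (const Key & k : chain)
                m_buckets[Index(k)].push_back(k);
    }

    std::vector< std::list<Key> > m_buckets;
    size_t m_count;
};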

(Yes, the hash tables all die horribly if everything you insert has the same hash value for all hashes, obviously. This is not an issue in practice, also obviously.)

won3d said...

O(1) claims for hash tables are only correct for perfect hashing or other constrained cases. A big problem with hash tables is the rehashing, which is clearly O(N), but the question is how that affects inserts, since rehashing performance is amortized across many inserts. For open addressing there is also the issue of how long you let link chains get for other probes like find/delete.

cbloom said...

"Big-O notation isn't about "worst case", it's an *upper-bound*."

Okay, what's the difference between "worst case" and "upper bound" exactly ?

"A hash table that doubles in size when a fullness-threshhold is reached avoids the concern Charles mentions."

More about this later.

"You can do this doubling with both open- and closed- hash tables, but in practice I'm not sure anyone does it with external chaining"

Of course they do, in fact the most common hash map in the world (STL) does this.

" and everybody (should) do it with internal probing. This makes my O(N) vs O(1) assertions not wrong."

Well it's rather bogus to compare a probing hash that does resizing vs. a chaining hash that doesn't. If they both do resizing they have the exact same big-O.

"Compared to internal probing with secondary hashing, chaining has a problem if a lot of elements hash to the same slot; they all mutually interfere in a way that secondary hashing avoids"

That's only true if the secondary hash uses bits that weren't used in the original hash. I don't think that actually changes the big-O but yes it can be a win in practice.

cbloom said...

So the funny thing about the O(1) claim for hashing to me is that probing is obviously O(N/M) where N is the occupancy and M is the table size. Then they say that we choose M to be proportional to N, like M = 2*N , so then probing is O(1).

But obviously in fact the table size is not exactly proportional to N. And under common usage in fact you usually wind up growing the table a few times until it's plenty big, and then it settles down to some size, say 256k, and then your N might vary.

When N varies with table size fixed - hash table query time varies like O(N).

I'm not sure if that actually matters in any way. In practice the thing that sucks is if you accidentally get stuck with your hash table just under the resize threshold.

Anonymous said...

Sorry, I was gone for GDC.

O(1) claims for hash tables are only correct for perfect hashing or other constrained cases. A big problem with hash tables is the rehashing, which is clearly O(N), but the question is how that affects inserts, since rehashing performance is amortized across many inserts.

There are two conditions in which you can attack the O(1) claim for hash tables. One has to do with the worst case if the hash values aren't effectively random (which I'll come back to); in the other condition you accept the model that the hash function is effectively random. Your second sentence seems not to be focusing on the first condition; if the first sentence is also talking about the second issue, then that's just wrong (but it's not actually wrong for the first condition). But anyway, I do disagree with the second sentence.

That is, setting aside the fact that the actual worst case is when the hash is ill-behaved, the rehash is amortized over the inserts in exactly the same way a realloc()ing buffer is as long as you double the size each time. We don't have a good way to talk about this outside of big-O, and the hash table's pseudo-randomness interferes here which is why I have to say "setting aside" above, but it's just totally false that the rehashing "is a big problem" for the big-O cost of inserts. That's because the cost of rehashing is the cost of re-inserting the rehashed elements into the rehashed table. The number of rehashed inserts is easily made to be O(N) after N insertions. So if the cost of a hash insert without rehashing is O(1), then the cost of rehashing is amortized O(1). (If the cost of insertion is not O(1), then this isn't true, but then the cost of insertion already isn't O(1), so the rehash isn't the part causing the problem. The rehashing just effectively doubles the number of insertions.) (It's more complicated if you rehash to shrink the table as well, but the idea is you set these thresholds to guarantee amortized O(1).)
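(Concretely, as a back-of-the-envelope: with doubling at each power of two, the total re-insertion work done while reaching N elements is a geometric series,

1 + 2 + 4 + ... + N/2 < N

so rehashing adds less than one extra insertion per element on average.)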

You can see the cost of rehashing in practice if this doesn't make sense by looking at 'hash seq' and 'hash ran32' on this graph of performance-per-insert for a given number of inserts. You can see that the cost goes up slowly due to memory issues, but it's not due to the rehashing, which is the source of the spiky behavior on the graph.

To Charles' various comments.

On the difference between "upper bound" versus "worst case", I mean in the context of "talking about performance of algorithms the way computer scientists do" -- that's the subclass of jargon here -- and I should be clear that I was answering in general and not actually talking about the hashing case so this is kind of tangential. But, let's use quicksort as an example.

Quicksort typically runs in something resembling N log N time. There are worst cases where it goes quadratic. With a simple memory model, it might take something resembling aN^2+bN+c time, but even that can't capture the way the performance varies with the data, so we need a notation that can set aside that unimportant variation and capture the big picture. So we can use big-O notation and say that the worst case is O(N^2). But O() is just saying that it's an upper bound (with, essentially, dropped lower-order terms), so it's also true that quicksort's worst case is O(N^3) or O(N^4), because big-O is just supplying an upper bound. If you want to say your value is tight, not just an upper bound, you use big theta: quicksort's worst case is theta(N^2). (This actually says that the lower bound and the upper bound are both N^2.)

We can also talk about the average case performance of quicksort, averaged over all possible sorts. This is big-theta(N log N), although we (again) just typically say that it's O(N log N). Of course, with big O, we can still say other things, so technically even in the average case performance is O(N^4).

So O(N^4) is an upper-bound but not a worst case. And big-theta(N^2) describes the worst case performance, but O(N log N) is a valid upper bound for average performance (but not the worst case).

But, for hashing, it is sloppy of me to argue that a similar "average case" performance applies, because I've never run the numbers. I don't know what happens if you iterate over all possible subsets, and (unlike quicksort's average being over all possible input permutations) that's not really something that makes sense in the real world anyway. Instead I usually just mean "performance conditional on having good hash functions". This really isn't very hard, since any reasonable hash function will handle basically any data that isn't tuned to destroy it. Look at this graph of the time taken to do 1M insert/deletes on tables of various sizes. Again, the performance changes with table size due to memory hierarchy issues, but it's basically the same performance even with several different models of data.

If STL hash_map actually works with external chaining and doubling, ok, fine. What I meant is that I rarely see that in textbooks or talked about in general, but that 'talked about' predates STL even having hashing at all so maybe that's changed too. It may currently be widely utilized due to STL, but it's not common looking at individual instances of source code.

The reason for what I've seen is simple: an "open addressed" hash table of size N can't handle more than N elements in it at all, so if you actually want to deploy one in the real world you have to make it growable (excepting some circumstance that truly bounds the input). Whereas an externally-chained table of size N totally can grow indefinitely without rehashing; it's still O(N) per operation, but the suppressed constant is 1/table_size which makes it pretty fast until N actually gets big, which you may never see.

(When the popular understanding of how an externally chained hash table works includes table doubling, I will stop referring to them as O(1) and O(N) hash tables. Possibly that time has already come and I just didn't realize it.)

"When N varies with table size fixed - hash table query time varies like O(N)."

Only in the sense that big-O is a loose bound. As long as, say, M >= N*2 (although you can pick any constant, e.g. 1.1), and the hash-is-random criterion is met, then you can show the tighter O(1) bound as well. There isn't some kind of cheaty math going on. If your hash table settles at some size and stops doubling, it's because it's large enough to get the O(1) performance.

I'm not sure why you think this is somehow not actually true, and whether that's in the math or in practice. It seems like you're misunderstanding why the performance is what it is (roughly, the average length of the internal hash chains), but I may be misreading you. It's obviously not an issue in practice if you look at my 1M insert/deletes graph.

"That's only true if the secondary hash uses bits that weren't used in the original hash. I don't think that actually changes the big-O but yes it can be a win in practice."

By definition secondary hashing uses a different hash from the main hash, so by definition it is true. The thing where I just rehash the original value to get at the unused bits is a cheap hack I've done that basically doesn't guarantee my performance very much past tables of, oh, 2^20 or so. My libraries actually do have a function that computes two separate 32-bit values for this (although a single 64-bit value would be just as good).

Anonymous said...

Oh yeah, and I'm sorry I set off this whole thing. The original comment was really just a joke. (Because, to my mind, the popular conception of internal probing incorporates table doubling and the popular conception of external chaining doesn't, thus making those performance behaviors commonplace, regardless of the actual possibilities. If those are no longer the popular conceptions of each, yeah, nevermind.)

cbloom said...

I wrote about how STL hash map works back here :

http://cbloomrants.blogspot.com/2008/10/10-19-08-2.html

as usual with the STL it's a superb way to do things given their constraints. For one thing it's much better for "multi" hashes than a probing hash. The other thing is it makes no assumptions about the data it's holding; the data could be very large and/or copying could be very slow.

cbloom said...

So yeah obviously I know hash lookup time is bounded by O(1).

The thing that's a little weird to me is that for any finite X, the time to do a lookup with occupation (X+1) is > the time to do a lookup with occupation (X).

If you prefer images and real data, it looks like this :

http://farm4.static.flickr.com/3554/3393500048_c5631541f7_o.png

Anonymous said...

I don't understand what that graph is showing... time for an insertion NOT counting rehashing? Because the rehash should show as a huge spike. I guess it's the lookup time at given sizes?

But yeah, you're going to see that variability, but the whole point of that variability is that it's not O(N) at all. Yes, it generally grows as load factor increases, but it's bounded (and, in fact, the bounding constant is pretty small--you're seeing it go from insanely fast to fast, not from fast to slow). Like, this isn't true:

"The time to do a lookup with occupation (X+1) is > that time to lookup with occupation (X)."

That's obviously false on the graph! It's true for MOST X, but it's not true for certain ones in a very important way, which is why it's bounded.

There are other data structures that exhibit similar patterns, where they cluster the data and split periodically, like B-trees. Not only is the cost of splitting amortized over the other operations, but the other operations themselves show variability depending on the state of the splitting.

Anyway, doing the STLport linked list thing is nice for large data so that you don't have to actually copy the data when you resize the table. Of course, that means you're copying that data into these linked list nodes in the first place, though, and e.g. for insert-only hashing you're actually only saving half of the copies because of the magic of doubling. And in practice most of the time you're going to have a pointer instead to avoid the excess data in the hash table anyway.

Now, let's look at the memory overhead from these extra pointers for this algorithm. I don't know how STLport is tuned so I'll just pick something arbitrary. If we look at probe counts on this page you linked to before, we can compute the necessary load factors to achieve the same probe counts (which won't necessarily exactly predict performance, but it's a starting place).

Say we pick a load factor for the secondary-hashed table of 0.80, since that's the way I tend to run because I want to keep performance up. According to that page, secondary-hashing needs 2.01 probes per hit, and 5 probes per miss. For chaining to take the same number of probes, you would need for hits a load factor of 2.01 = 1+L/2 => L=2.02, and for misses a load factor of L=5. (Chaining load factor can be > 1 since you store multiple elements per chain).
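(For reference, those numbers come from the standard expected-probe approximations at load factor L:

double hashing : hits ~= (1/L)*ln(1/(1-L)) , misses ~= 1/(1-L)
chaining : hits ~= 1 + L/2 , misses ~= L

At L = 0.8 the double-hashing numbers work out to about 2.01 and 5, matching the figures above.)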

If we assume hits are more common, let's say we'll use a load factor of 2.5 for the chained hash table.

Now, that means if we store N elements, we'll need N linked list nodes and a hash "jump" table with N/2.5 entries, compared to secondary hashing on a table with N/0.8 entries. (Although obviously when we first double the storage will be a lot bigger.)

If the storage size for an element is S, and for a pointer is P, then we get:

STLport (assuming singly-linked, which only allows iterating in one direction)
N*(S+P) + N/2.5*P

Secondary hashing: N/0.8*S = N*S*1.25

So the difference between them is:

excess = N*S*1.25 - (N*(S+P) + N/2.5*P)
= N*S + N*S/4 - N*S - N*P - N/2.5*P
= N*( S/4 - P - P*0.4)
per-item overhead: S/4 - 1.4*P
crossover point:
S/4 = 1.4*P
S = 5.6*P

So, assuming I didn't screw up the arithmetic, for the chosen number of probes to give equivalent performance, the STLport implementation uses more memory if the size of the item is less than about 5.6 pointers, and less memory if the size of the item is more than that. That's assuming the STLport is singly-linked, and totally ignores the fact that after doubling, the memory usage is way different.

If we assume the above is the best case (right before doubling), then we can count the memory for each right after doubling (secondary hashing uses twice as much, STLport only uses twice as much for the hash table):

STLport (assuming singly-linked)
N*(S+P) + N/2.5*P*2

Secondary hashing: N/0.8*S*2 = N*S*2.5

excess = N*S*2.5 - N*(S+P) - N/2.5*P*2
per-item = S*2.5 - S - P - 0.8P
= 1.5*S - 1.8*P
crossover:
1.5*S = 1.8*P
S = 6/5 *P

so now after doubling the secondary hash table is only smaller than the STLport for a data structure of about 1 pointer in size (which is almost always non-existent, except for hash tables that just store a single pointer or single integer). If you make the linked list doubly-linked, that actually means with this level of performance the internally probed table essentially only wins for the classic two-pointer hash table (e.g. map a string to some other object pointer).

I imagine STLport may use a larger load factor to reduce the memory usage, but maybe not. And all this math is predicated on me randomly picking a probes-per-search target. I imagine if you ran the math for the secondary-probed table having a max load factor of 0.9, the STLport memory usage would be smaller always or something.

And obviously, as soon as the data gets at all large the internally probed table is always larger (although it does place less pressure on the memory allocator).

won3d said...

Hey Sean,

Yeah, as I said I'm being fairly nitpicky here, and agree with all the major points. I'm not questioning your understanding of hash table performance.

My point was that there is a difference between an unqualified O(n) and saying amortized O(n). A single unlucky insert will certainly be an O(n) operation, and this is what I meant. In the case for hash tables, it might make sense to spread the cost (i.e. deamortize) of rehashing a hash table over several operations to reduce the cost of the rare bad case.

Anonymous said...

Ok, fair enough. I almost always mean amortized when I talk about performance, because very few people actually work with a real-time OS etc., so the difference isn't very significant, although I guess games are soft real-time-y and a big hit can suck.

The thing is, while there are dynamic hash table strategies that gradualize the rehashing incrementally, you pay an extra overhead all the time (much as is the case with most real-time algorithms). They don't entirely avoid the O(N) hit on the rehash, either; e.g. if you realloc the table, you may have to copy the old table over. If you alloc a new empty table, windows may zero out the memory for it for security reasons.

But yeah, it could happen.

won3d said...

There are certainly cases when you care about optimizing for worst-case performance. Also, I believe there are certain forms of deamortization that can be done opportunistically. Say, when you have some heuristic reason to believe that you won't be causing very many cache misses.

So a few months ago I was working on that multi-way, cache-aware cuckoo-based hash table. One thing I noticed is that I was able to get excellent performance at very high occupancy (like 95%), even if the initial insert crisis came much sooner (say 70-80% full). The normal crisis management strategy is to realloc and rehash, but it occurred to me something like a multilevel hash table could work much better, due to the rareness of crises. You can have a smaller "second-chance" table that handled overflow. If you are willing to do the peak time/space tradeoff, then you can even prealloc the larger size while delaying the rehashing/movement. What is interesting is that I don't think the "second-chance" table ever needs to get very large, maybe something like O(lg N). In fact, it could just be a fixed-size thing for all practical purposes.
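The shape of that idea might be something like this (a very rough sketch with made-up hash functions, not the actual table being described above):

// Rough sketch of a "second-chance" overflow: a two-choice (cuckoo-style)
// main table, with a small fixed-size overflow array that absorbs the rare
// insert that fails, postponing the full grow-and-rehash.
#include <stddef.h>
#include <stdint.h>
#include <utility>
#include <vector>

struct Slot { uint32_t key; bool used; };

class TwoChoiceTable
{
public:
    explicit TwoChoiceTable(size_t size)
        : m_slots(size), m_overflow(64) {}           // overflow stays tiny

    bool Insert(uint32_t key)
    {
        uint32_t cur = key;
        for (int kicks = 0; kicks < 32; ++kicks)     // bounded displacement
        {
            size_t a = Hash1(cur), b = Hash2(cur);
            if (!m_slots[a].used) { m_slots[a] = { cur, true }; return true; }
            if (!m_slots[b].used) { m_slots[b] = { cur, true }; return true; }
            std::swap(cur, m_slots[a].key);          // cuckoo kick
        }
        // Crisis: park the leftover key in the second-chance table.
        for (Slot & s : m_overflow)
            if (!s.used) { s = { cur, true }; return true; }
        return false;                                // now we really have to rehash
    }

    bool Contains(uint32_t key) const
    {
        const Slot & a = m_slots[Hash1(key)];
        const Slot & b = m_slots[Hash2(key)];
        if ((a.used && a.key == key) || (b.used && b.key == key)) return true;
        for (const Slot & s : m_overflow)
            if (s.used && s.key == key) return true;
        return false;
    }

private:
    size_t Hash1(uint32_t k) const { return (k * 2654435761u) % m_slots.size(); }
    size_t Hash2(uint32_t k) const { return ((k ^ (k >> 16)) * 40503u) % m_slots.size(); }

    std::vector<Slot> m_slots;
    std::vector<Slot> m_overflow;
};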

There's not much point to this rambling, but it does make me want to work on my hash table/function again.
