10/19/2008

10-19-08 - 1

Hashing scares me. Any time I see a hash function I feel great skepticism. I have no confidence that a hash is actually working well unless I test it in the real situation where it's being used. Hashing is fast *if* everything is working perfectly: if the hash table is actually sized right for the data, if the hash function is actually randomly distributing your actual keys, if the reprobe is not walking all over the table unnecessarily, etc.

Preface : What I want to do is hash strings for lookup. I want low memory use. I'm going to store a (char *) and a 32-bit hash value in each element. I only do a full strcmp if the 32-bit hash values are equal. The time that I want to minimize is the combined time of the hashing function and the lookup using the hash, so the hash computation should be fast and also good enough that the lookup performs well. Hashes that don't randomize well will result in table collisions and also in unnecessary strcmp's, which are slow. The 32-bit hash value is used in the elements, but generally a much smaller hash size is used for the table. Ok, now let's get on with it...
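To make that concrete, here's a minimal sketch of the element layout and the compare I mean; the names and the U32 typedef are mine, not from any particular library :

#include <string.h>

typedef unsigned int U32;

struct Element
{
    U32          hash; // full 32-bit hash of the key string
    const char * str;  // pointer to the key (not owned)
};

// only equal 32-bit hashes pay for the strcmp
static bool Matches(const Element & e, U32 queryHash, const char * queryStr)
{
    return e.hash == queryHash && strcmp(e.str, queryStr) == 0;
}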

The STLport hash is a linked list of all the elements, and the hash table gives you jump pointers into that list. When you make a query with a given key, it computes hash(key) and then you know that your element is between

begin = jump_table[ hash(key) ]
end = jump_table[ hash(key) + 1 ]
where begin and end are pointers into the big linked list. This lets it easily make an iterator that can walk over all the data in a deterministic order (though not sorted order, unless the hash function is lexicographic, which it probably isn't). It also makes insert and remove very fast because you just find yourself and then unlink the node. STLport grows the hash table pretty aggressively, and just uses a linear list within each segment. You could grow the jump table more slowly, and then use a sorted list within each segment and do a binary search. This is equivalent to the conceptual model of a hash table where each bucket points at a linked list, a la PkZip.
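A little sketch of that lookup as I read the description above (my names, not the actual STLport source); the jump table has one extra terminator entry so that jump[b+1] is always valid :

#include <string.h>

typedef unsigned int U32;

struct Node { Node * next; U32 hash; const char * str; };

// jump[] has tableSize+1 entries; bucket b owns the list range [ jump[b], jump[b+1] )
Node * Find(Node ** jump, U32 tableSize, U32 h32, const char * key)
{
    U32 b = h32 % tableSize; // STLport sizes the table to a prime and uses mod
    for (Node * n = jump[b]; n != jump[b+1]; n = n->next)
        if (n->hash == h32 && strcmp(n->str, key) == 0)
            return n;
    return 0;
}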

A few things of note :

1. STLport grows the hash table based on *total occupancy*, not the max occupancy of any bucket. That means that if your hash function is badly broken, it can perform really badly and lookups can become O(N). That's really your fault. On the plus side, because it's not a reprober, if one part of the hash gets all screwed up, the other parts will still function fine; screwups don't leak all over the way they do in reprobers.

2. The STLport hash is both faster and uses less memory than map<> or almost any self-balancing binary tree type thing. Pretty much any time your key is something that can be easily hashed, you should use a hash.

3. The STLport hash always makes the table a prime number in size and uses a mod to convert your hash() into a table index. Basically they don't trust you to be making decent hashes (wisely so). For example, if you just have an integer index, you can return that as your hash and it will work fine.

4. For static data you could of course do something neater by just keeping the elements in a vector<> and making a jump table that indexes into the vector. Also, if you know you can make good hashes, you could use a pow-2 table size and a mask instead of a mod.
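For reference, the two indexing schemes that keep coming up below, prime+mod versus pow-2+mask, look like this (illustrative only, my names) :

typedef unsigned int U32;

// prime-sized table : the mod scrambles weak hashes but costs an integer divide
static inline U32 IndexPrime(U32 h32, U32 primeSize) { return h32 % primeSize; }

// power-of-two table : the mask is nearly free but only uses the low bits,
// so the hash itself has to be good
static inline U32 IndexPow2(U32 h32, U32 pow2Size) { return h32 & (pow2Size - 1); }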

A normal table-based hash with reprobing on collision can be faster. I tried making one that's minimum size. I want to use a pow-2 size to avoid a divide, so I just find the next pow-2 up and use that, which means I have very little empty space. A normal reprobing hash can perform very badly when there's not much empty space, because you have to keep walking until you see an empty slot to know you've visited everything that goes to your bucket.

The first thing I tried was reprobing with a step of 1. There's a lot of fancy stuff on reprobing with rehashing and whatnot, but I've read that those things are terrible for cache and that it's better just to reprobe by stepping linearly, so I'm trying that. ADDENDUM : more on this later.

The next thing is to count how many elements really belong in each bucket. When you insert something in a bucket, you bump its count, and then if it's full you reprobe and add the element at the actual first empty spot you find (and don't bump that spot's count). Then when you look up, you only need to keep reprobing until you've seen count elements that hash to the same bucket as your query. For very full hash tables this means much less reprobing; it was about 2X faster for me (though it makes the hash table bigger because you have to store that count).
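Here's a hedged sketch of that lookup; the layout and names are mine, and I'm using a stored hash of 0 to mean "empty slot", which is a simplification a real table would handle properly :

#include <string.h>

typedef unsigned int  U32;
typedef unsigned char U8;

struct CountedHash
{
    U32 *         slotHash;    // full 32-bit hash per slot, 0 = empty (simplification)
    const char ** slotStr;     // key string per slot
    U8 *          bucketCount; // how many elements truly belong to each home bucket
    U32           mask;        // tableSize - 1, tableSize is a power of two
};

// returns the slot index, or -1 if not found
int Find(const CountedHash * t, U32 h32, const char * key)
{
    U32 home = h32 & t->mask;
    int remaining = t->bucketCount[home]; // stop once we've seen this many of "ours"
    for (U32 i = home; remaining > 0; i = (i + 1) & t->mask) // +1 linear reprobe
    {
        U32 sh = t->slotHash[i];
        if (sh == 0) break;                 // hit empty space before seeing them all
        if ((sh & t->mask) == home)         // this element's home bucket is ours
        {
            if (sh == h32 && strcmp(t->slotStr[i], key) == 0)
                return (int)i;
            remaining--;
        }
    }
    return -1;
}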

Test of finding all the words in "book1" and "book2"

sorted list                   : 537.19 clocks
reorg as implicit binary tree : 348.14 clocks

std::set                      : 664.89 clocks
std::hashset                  : 253.01 clocks

hash with 1-reprobe & count   : 212.33 clocks
static vector jump table hash : 152.29 clocks

The winner for me is the "static vector jump table hash", which is basically the STLport hash with modification #4 that I described above; the elements are just stored flat in an array, and the hash bucket tells you the start index of the data that goes in that bucket. I found that a linear search is much faster than a binary search within the bucket, but I do still sort within the bucket so that the linear walk can return early as soon as it sees a value larger than the query. Note that the jump table is actually *smaller* than the hash because it doesn't need any empty space.
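A hedged sketch of that lookup (my names) : elements[] is laid out bucket by bucket and sorted by the full 32-bit hash within each bucket, and jump[] has numBuckets+1 entries so that jump[b+1] is the end of bucket b :

#include <string.h>

typedef unsigned int U32;

struct Elem { U32 hash; const char * str; };

const Elem * JumpFind(const Elem * elements, const U32 * jump,
                      U32 numBuckets, U32 h32, const char * key)
{
    U32 b = h32 & (numBuckets - 1);     // numBuckets is a power of two here
    U32 end = jump[b+1];
    for (U32 i = jump[b]; i < end; i++)
    {
        if (elements[i].hash > h32)     // bucket is sorted by hash : early out
            break;
        if (elements[i].hash == h32 && strcmp(elements[i].str, key) == 0)
            return &elements[i];
    }
    return 0;
}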

Memory use :

num words : 41509

sorted list lookup:
mem use: 332072 = 8.00/per

	sorted list = 8 bytes per element (zero overhead)
	element = 4 byte hash + 4 byte pointer

implicit btree lookup:
mem use: 524288 = 12.63/per

	implicit btree = sorted list with empty spaces to make perfect pow 2 tree
	worst case almost doubles memory use when size = pow2 + 1
	best case is zero overhead when size is exactly a pow2

jumplist lookup:
mem use: 594220 = 14.32/per

	jumplist = 8 bytes per element + 4 bytes per hash entry; 16 bit hash

strhash lookup:
mem use: 786432 = 18.95/per

	hash is 12 bytes per element + overhead for empty space

I should note that I'm hashing words here, which are quite short, which means the overhead of the hash is crucial. Hashes with expensive rollups don't work. If you were hashing something very very long, it might be wise to use a hash like Bob Jenkins' which rolls up to a loop that works on 3 32-bit ints at a time.

Links : Paul Hsieh hash ; I found this to be slow. It's certainly not magic.

Bob Jenkins' page ; a nice page with a summary of tons of hash functions and discussion of what makes a good hash function. His "lookup3" is very complex and slow on small blocks but apparently fast and good on large blocks.

Murmur Hash ; a good simple hash that works on 32 bits at a time, so probably fast for big blocks, but it was slow for me (though a lot faster than Paul Hsieh's).

FNV Hash is the winner. This is the most basic hash and it's also the best for this application. It's just multiply by a prime and xor in the next byte. Apparently the primes he lists here are especially good, and you should take some care in how you make your final hash value from the 32-bit hash. In my experiments it really doesn't matter what prime I use, any prime bigger than 256 seems to work fine.
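For reference, the FNVHashStr in the listing below is presumably something along these lines; this is the standard FNV-1 form with the canonical offset basis and prime, though as noted the exact prime barely matters :

typedef unsigned int U32;

U32 FNVHashStr(const char * str)
{
    U32 h = 2166136261u;                    // standard FNV-1 offset basis
    while (*str)
    {
        h *= 16777619u;                     // multiply by a prime...
        h ^= (U32)(unsigned char)(*str++);  // ...and xor in the next byte
    }
    return h;
}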

I think hashes where you do a bunch of shifts are probably pointless. The reason is that one multiply by a good magic number is equivalent to a whole mess of shifts! e.g. x*5 = (x << 2) + x, etc. On almost all modern chips a multiply and a shift are very nearly the same speed, so hashes with one multiply are faster than hashes that do a bunch of shifts. (Sadly there are some major new architectures where that isn't true; multiply and shift-by-a-variable are both slow, while shift-by-a-constant is much faster.)

Okay, I counted collisions too.


	// 220 clocks :
	// StringHash collisions : 4620821
	//return rrCRC_Block((const U8 *)string,(S32)strlen(string));
	
	// 190 clocks :
	// StringHash collisions : 5520089
	//return PaulHsiehHash(string,(S32)strlen(string));

	// 180 clocks :
	// StringHash collisions : 4372337
	//return MurmurHash2(string,(int)strlen(string),0x12345678);

	// 171 :
	// StringHash collisions : 4815568
	//return oneAtATimeHashStr(string);
	
	// 152 clocks :
	// StringHash collisions : 4192267
	//return 4001 * stb_hash((char *)string);
	
	// 146 :
	// StringHash collisions : 6157277
	//return bernstein(string);

	// 144 :
	// StringHash collisions : 6603083
	return alphaNumHashStr(string);
	
	// 142 :
	// StringHash collisions : 5892286
	//return FNVHashStr(string);
	
The best hash in terms of avoiding collisions is the STB hash, which is :

unsigned int stb_hash(char *str)
{
   unsigned int hash = 0;
   while (*str)
      hash = (hash << 7) + (hash >> 25) + *str++;
   return hash + (hash >> 16);
}

The three fastest hashes overall are among the worst in terms of avoiding collisions. But they're also super simple. The time I'm measuring is the total time to compute the hash & use it for lookup. The lookup is pretty damn fast, so any extra work in computing the hash has to buy a big reduction in collisions to pay for itself.

BTW conventional wisdom says the hash value in stb_hash should not be initialized to zero, because that makes pure-zero inputs of all lengths map to the hash value zero. Of course that is irrelevant if you're hashing strings. In fact, using anything but 0 there seems to hurt the number of collisions. Also note that the shifts by 7 and 25 together form a rotate of the 32 bits, which can be done in one instruction on x86 and makes it very fast. This hash has some bad theoretical degeneracies, and Bob Jenkins actually lists this exact hash as an example of a bad hash, but on my test data it has the fewest collisions of any of them! (I haven't tested Bob's full-strength hard-core lookup3.)

Again the Murmur looks very good; if you're just going to take a hash and drop it in (which I strongly discourage), Murmur might be the way to go. Also I should note that the big flat hash table with reprobing can be very fast if you let it be very big and have lots of empty space. In my experiments it needs around 50% empty space to be fast; at 25% empty space it gets very slow, and beyond 50% empty space doesn't help much more.

I also generally don't like checking performance in ways like this. The real application is queries that happen randomly in the middle of other application processing. A hash function with a table lookup might perform very well in an isolated test, but when you drop it in the middle of a real app, the table is not in cache, and your hash function also dirties the cache, which slows down the rest of your app.

ADDENDUM on collisions : So handling collisions well is actually very important. In fact, with good collision handling the reprobing hash can be almost as fast as the static-jump-list hash. In the latest test the static jump list is 114 clocks and the reprobing hash is 119 clocks. That's with a 75% full hash table. (You may see authors around the net using 25-50% full hash tables; those people are memory wasters and their speed numbers are lies!)

So the +1 reprobing is in fact great for cache, but it does suffer badly in areas of degeneracy. Areas of degeneracy arise where a bunch of values get put in the same bucket. This can happen even if your hash is a perfect hash in the sense that random input = random output, because you might just give it input with many values that hash to the same spot. When you do +1 probing in those spots it's locally N^2, because for each value you walk over all the other values trying to find empty space.

One good way to reprobe, which I got from Sean, is to use bits of the hash that were not used in the hash indexing. That is, you computed a 32-bit hash value but only used 16 bits to index the hash table, usually with something like xor folding :


index = (h >> 16) ^ h;

So, try to make a reprobe step that uses the *other* 16 bits of the hash. For example :

reprobe = (h>>8) * 173 + 1;

or

reprobe = h + (h>>6) + (h>>19);

or something from the FNV page. What this does is take two elements that hash to the same bucket and make them reprobe to different buckets (hopefully), so they no longer collide. Yay. But just using that as a linear reprobe step is not the best thing.
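As a sketch, one way to package that idea (my code; forcing the step to be odd is my addition, so that a fixed step always cycles through the whole pow-2 table) :

typedef unsigned int U32;

// mask = pow-2 table size - 1

U32 FirstIndex(U32 h32, U32 mask)
{
    return ((h32 >> 16) ^ h32) & mask;      // xor folding for the home bucket
}

U32 ReprobeStep(U32 h32, U32 mask)
{
    U32 step = ((h32 >> 8) * 173 + 1) & mask;
    return step | 1;                        // odd step visits every slot of a pow-2 table
}

// the lookup then walks i = (i + ReprobeStep(h32,mask)) & mask starting at FirstIndex(h32,mask)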

Another thing you can do is a quadratic reprobe. Your reprobe step starts at 1 and then grows quadratically. Warning : some quadratic step sequences do not ever cover all values, so they can give you an infinite loop. The Google Sparse Hash guy claims that triangular steps do cover all values, but I hadn't seen anyone else claim that, so YMMV. (ADDENDUM : apparently it's in Knuth and also on Wikipedia - yes, triangular numbers do in fact cover all slots of a power-of-two table.) A triangular number is n*(n+1)/2, that's {1,3,6,10,15,...}. Anyway, what the quadratic step does for you is make neighbors not stomp on each other. That is, if you have one bucket at hash "h" with 10 elements in it and a bucket at hash "h+1" with 10 elements, and you just do +1 reprobing, they'll be all up in each other's business and make 20 solid elements with no space. The quadratic step puts big spaces between them so they don't run into each other.
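A tiny sketch of triangular probing (hash compare only, and again 0 = empty is a simplification) : stepping by 1, then 2, then 3, ... gives cumulative offsets 1, 3, 6, 10, ... which are exactly the triangular numbers :

typedef unsigned int U32;

// returns the slot index, or -1 if not found; mask = pow-2 table size - 1
int TriangularFind(const U32 * slotHash, U32 mask, U32 h32)
{
    U32 i = h32 & mask;
    for (U32 step = 1; step <= mask + 1; step++)   // at most tableSize probes
    {
        if (slotHash[i] == h32) return (int)i;
        if (slotHash[i] == 0)   return -1;         // hit empty : not present
        i = (i + step) & mask;                     // offsets 1,3,6,10,... from home
    }
    return -1;
}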

The best reprobe that I found is a hybrid, basically it goes like this :

reprobe with +1 twice

then reprobe with a rehash like (h>>8) * 173 + 1;

then reprobe with a linearly growing step : 1 then 2,3,4

This is faster than any other reprobe that I tried, by a few clocks. Annotated, what we get is :

reprobe with +1 twice
	-> very good for cache in the cases where we get lucky and have empty space right next to us
	-> basically this is so fast it's free if we got lucky; it helps more for emptier tables

then reprobe with a rehash like (h>>8) * 173 + 1;
	-> breaks the original hash degeneracy, but it's a very evil random cache miss
	-> (btw you could prefetch this address before doing the two +1's, which would probably make us even faster)

then reprobe with a linearly growing step : 1 then 2,3,4
	-> that's a triangular number from the original index

That's the reprobe that makes us almost as fast as the static jump list, even at 75% fill. (There's a sketch of this probe order below, after the numbers.) BTW the rough numbers were :

Reprobe +1 : 170 clocks
Reprobe with rehash : 130
Reprobe with quadratic : 127
Reprobe with rehash then quadratic : 125
Reprobe with +1 then rehash then quadratic : 119

Static jump list : 115
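Here's a sketch of that hybrid probe order as I read it (hash compare only, 0 = empty as a simplification; the exact bookkeeping of where the triangular steps resume from isn't spelled out above, so this is one reasonable reading, not the exact code that was timed) :

typedef unsigned int U32;

// returns the slot index, or -1 if not found; mask = pow-2 table size - 1
int HybridFind(const U32 * slotHash, U32 mask, U32 h32)
{
    U32 home = ((h32 >> 16) ^ h32) & mask;

    // 1. the home slot plus two +1 probes : nearly free when there's empty space nearby
    for (U32 k = 0; k <= 2; k++)
    {
        U32 i = (home + k) & mask;
        if (slotHash[i] == h32) return (int)i;
        if (slotHash[i] == 0)   return -1;
    }

    // 2. one rehash jump : breaks up degenerate buckets, at the cost of a cache miss
    U32 i = (home + (h32 >> 8) * 173 + 1) & mask;

    // 3. then a linearly growing step : 1, 2, 3, ... (triangular offsets from the jump)
    for (U32 step = 1; step <= mask + 1; step++)
    {
        if (slotHash[i] == h32) return (int)i;
        if (slotHash[i] == 0)   return -1;
        i = (i + step) & mask;
    }
    return -1;
}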

ADDENDUM more links : Google Sparse Hash Implementation Details is a nice document; it's a cool idea. Basically it's a big flat put-the-values-in-the-table-and-reprobe kind of hash, but he uses a really neat sparse table thing instead of a plain array, which lets you have more empty space in your hash without wasting memory on the empties. Obviously it's not the fastest thing in the world, but if you're in a memory-constrained situation it's hot (not memory-constrained like small memory, but like hey, my search cache machine is using 3.9 GB already).

Cuckoo Hashing looks pretty interesting. I definitely think this could be faster than regular reprobe hashing, and in fact it's ideal for my case of put-rare-get-common.

Python dictionary implementation notes ; some good stuff, including notes on cache locality in reprobing.

A little page with numbers of reprobes for different load factors.

Attractive Chaos has a big hash table comparison. His hash appears to be very ordinary, but he claims it's the best, so I'll benchmark his hash tomorrow for completeness...

