Writing some things I've been perking on for a while...
There's a hash function test at strchr.com :
Hash functions An empirical comparison - strchr.com
At first this test looks pretty cool, but on further examination it has a lot of flaws. For one thing, they're basically testing only english words.
If what you care about is short english text, then maybe the results are good, but if you care about other things they are not. I
do like the fact that they are testing the total time to compute the hash + the time to lookup. However, when you do that then the
details of the hash table become crucial, and it seems they are using an externally-chained hash table which is a severe bias (I'd
like to see tests using stlport std::hash_map, rdestl, khash, google's "dense hash", or the cblib hash_table, etc. - some standard reference hash
implementations). You also need to talk about your hash fill ratio, whether you store the hash value, whether the data is interned, etc.
These will affect which hash function is "best".
There are also major architectural issues. For example, the large multiplies of Murmur are not fast on all current platforms.
Furthermore, on some platforms you can't do unaligned DWORD reads fast, and you may have to do endian swaps. These issues do not
affect all hashes in the same way, so to compare hashes without considering these issues is a mistake.
Also, to do a fair comparison you really need to understand the details of each hash. Some are designed for use only in prime-size
tables. Some should be collapsed by xor folding, others should not. You cannot simply plug a bunch of different hash functions into
the same code and have a fair comparison.
But perhaps most crucially, the hashes are not optimized to the same amount. Some are unrolled, others are not. Some are reading
dwords, some are reading bytes. This is not a fair comparison. Really the only way to do a perfect comparison is to optimize each
hash function as much as possible, because hashes have a different degree of capacity for optimization. For example, Adler and
Fletcher (which are checksums not hashes, BTW) can easily be made fully 16-byte parallel (see for example
asmcommunity Adler32 ; there are also some very good
Altivec versions floating around), most hashes cannot be directly parallelized the same way (they can only be "trivially" parallized
in the sense of running multiple hashes at once in parallel on interleaved input streams).
(ADDENDUM : actually the biggest problem with the test is probably that the hash functions are called through function pointer, not inlined;
this is a severe penalty for the cheaper hash functions; in particular with properly inlined hash functions I ran a similar test and found
FNV to just destroy Murmur, which is the opposite of what they find)
Anyway, enough being a grouch, let's get to what I've been thinking about.
One piece of news : there's a (beta) update to Murmur2 called (unsurprisingly) Murmur3 :
MurmurHash3 information and brief performance results
Murmur3 improves the basic mixing a bit to fix some flaws in Murmur2. It also works on larger data elements, which is a mixed blessing. That will make
throughput on large keys faster, but hurts performance on small keys. One of the nice things about Murmur2 was how simple it was.
And some links from the past for reference :
What I wrote previously about hash table lookup :
cbloom rants 10-17-08 - 1
cbloom rants 10-19-08 - 1
cbloom rants 10-21-08 - 4
cbloom rants 10-21-08 - 5
My final answer for hashing strings was a variant of FNV, hash value stored in the table, max 70% full, quadratic
reprobing.
File hashing :
cbloom rants 08-21-10 - Adler32
My final answer for file hashing was Bob Jenkin's Lookup3, modified to be 4-way SIMD. Since Lookup3 inherently
works on 12 bytes at a time, the 4-way simd makes it work on 48 bytes at a time, and I then unroll that 8 times
so that I can work on 384 bytes at a time (which is crucially 3 whole 128 byte cache lines).
So anyway. When you're talking about hashes it's important to distinguish what you're using them for. There are lots
of variations, but I think there are three primary categories :
1. Hash for hash table lookup (with full resolution, either by in-table probing or external chains).
In this case,
collisions are only bad in that they cause you to take more CPU time to probe. You will never fail to find what you
are looking up due to collisions. There's a very subtle trade off here between hash quality and the time it takes
to compute the hash - the winner will depend intricately on the exact data you are hashing and what kind of hash table
implementation you are using. I'm sure this is one of those scenarios where different authors will all claim a different
hash function is "best" because they were testing in different scenarios.
2. File checksum/integrity/corruption/hacking testing.
In this case you want the hash function to change any time the file
has been changed. There are then two subsets of this : catching accidental corruption and catching intentional corruption.
To check the ability to catch accidental corruption, you want to make sure your hash is robust to common types of damage,
such as file truncation, injection of zeros, single bit flips, etc. Catching intentional corruption protects you from attacks.
We're not talking about cryptographic security, but a good hash should make it very difficult to construct alternative data
which is both *valid* and has the same hash.
3. Hashes for "cache tables". This is the main thing I want to talk about.
A "cache table" is a hash table in which collisions are not resolved (or they are resolved only some finite number of times).
So far as I know there's not a standard term for this, but we found this Google Code project :
cache-table - Project Hosting on Google Code
And I think "cache table" is a pretty good name for it. It's similar to CPU caches, but without a backing store. That is,
a traditional "cache" is basically a "cache table", but then if the data is not found in the cache, you go and get it out of
the slow location. We say "cache table" to make it clear that in this case there is no slow location - if it's not in the cache
you don't find it.
Cache tables are very popular in data compression. To my knowledge they were first used popularly in LZRW. I then made heavy use
of them for LZP (and PPMCB, which is very similar to the way modern compressors use them). They are now the basis of most modern compressors such as PAQ and related context mixers.
Cache tables in traditional data compression usage never resize, you simply set your memory usage and that's how big the cache table is.
Usage of a cache table is something like this :
h = hash(key)
look up table[h]
(optionally) check if table[h] is right for key
if so return its data
optionally :
reprobe, either by stepping h and/or by searching "ways"
# of reprobe steps is finite
if failed to find, return not in table
There are a few optional places. A crucial one is whether you do exact resolution or simply accept collisions. (in the old language
of LZP, this is "context confirmation"). That is, for minimum memory use, table[h] might not store anything about key. That means when
hashes collide, you can't tell, and you might get back an entry which does not correspond to your key.
Sometimes this is just fine. For example, in LZRW style string matchers, if you get back the wrong data you will simply not get a string
match. In other cases you might want some amount of collision resolution, but only probabilistically. For example, PAQ stores only a few
bits of key in the table, so when hash collisions occur it can detect them most of the time, but accepts a certain amount of false positives.
This issue should always be considered as a trade off of memory usage - is it better to use the memory to resolve collisions more accurately,
or just to have a larger table so that fewer collision occur?
The other issue is how you reprobe, and that is closely related to your insertion strategy. When you try to insert and find a collision,
you have a variety of choices. You can try all the possible reprobes for that hash value and pick the one that is "best" to replace under
some criteria (perhaps such as the last used, or simply the slot that's empty). If you're using "ways" you might just do round-robin insertion
in the ways, which is a probabilistic replacement strategy.
Anyway, I think "cache tables" is an interesting topic which I haven't seen explored much in the literature. It's particularly interesting
in compression, because when you get a false collision (eg. you retreive data from the table and think it applies to you when it doesn't),
that doesn't ruin your compressor, it just makes you code that event with more bits than ideal. So you can measure the cost of collisions as
bit excess, and you ideally want to tune collisions to occur for less important data. Hash function collisions in cache tables affect both
speed and the quality of results - eg. more collisions means more dropped data - but you also usually care exactly *which* data is dropped.
The question of reprobes within a linear tables vs. ways (multiple independent linear tables) is interesting, and sometimes the optimal solution
is to use both. For example, my "medium speed" string matcher for the RAD LZ77 uses two independent hash values (in the same table) and 2 ways. That is, there are two
linear tables, and you look at two spots within each table.
I think that Cuckoo hashing for cache tables is a particularly interesting possibility. "Cuckoo cache tables" could do the normal Cuckoo thing
(where you have two hashes for each key and you swap data around to make space for an insertion),
but you set a maximum number of swaps on insertion (something like 4 or 8), and if you have to do the cuckoo swap step and hit the maximum number, you
just stop and let some item fall out of the cache. You can also easily extend this by keeping track of age or quality or something as you do the
Cuckoo, and choose to drop the worst item instead of simply dropping the last one in the chain. You don't have to worry about the pathological
Cuckoo case of having a cycle you can't resolve - you simply drop an item in that case (in fact you don't even have to detect that case, you'll
just go around the cycle up to your maximum finite number of steps).
(ADDENDUM : I tried cuckoo cache tables )