Ever since I read about Cuckoo Hashing I thought, hmm even if it's not the win for hash tables, maybe it's a win for "cache tables" ?
(A cache table is like a hash table, but it never changes size, and inserts might overwrite previous entries (or fail to insert the new entry, though that's unusual). There may be only a single probe, or multiple probes.)
Let me introduce it as a progression :
1. Single hash cache table with no hash check :
This is the simplest. You hash a key and just look it up in a table to get the data. There is no check
to ensure that you get the right data for your key - if you have collisions you may just get the wrong data
back from lookup, and you will just stomp other people's data when you write.
Data table[HASH_SIZE];

lookup :
    hash = hash_func(key);
    Data & found = table[hash];

insert :
    table[hash] = data;
This variant was used in LZP1 ; it's a good choice in very memory-limited situations where collisions are
either unlikely or not that big a deal (eg. in data compression, a collision just means you code from the
wrong statistics, it doesn't break your algorithm).
2. Single hash cache table with check :
We add some kind of hash-check value to our hash table to try to ensure that the entry we get actually was from our key :
Data table[HASH_SIZE];
int table_check[HASH_SIZE]; // obviously not actually a separate array in practice

lookup :
    hash = hash_func(key);
    check = alt_hash_func(key);
    if ( table_check[hash] == check )
    {
        Data & found = table[hash];
    }

insert :
    table_check[hash] = check;
    table[hash] = data;
In practice, hash_func and alt_hash_func are usually actually the same hash function, and you just grab different
bit ranges from it. eg. you might do a 64-bit hash func and grab the top and bottom 32 bits.
In data compression, the check hash value can be quite small (8 bits is common), because as noted above collisions are not catastrophic, so just reducing the probability of an undetected collision to 1/256 is good enough.
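To make that concrete, here's a hedged C++ sketch that does exactly the bit-range trick described above : one 64-bit hash supplies the table index from its low bits and an 8-bit check value from its top bits (the 64-bit FNV-1a hash and the sizes are placeholder choices) :

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// placeholder : 64-bit FNV-1a
uint64_t hash64(const std::string & key)
{
    uint64_t h = 14695981039346656037ull;
    for ( unsigned char c : key ) { h ^= c; h *= 1099511628211ull; }
    return h;
}

const uint32_t HASH_BITS = 16;

struct CheckedCache
{
    std::vector<std::string> table;
    std::vector<uint8_t>     check; // 8-bit check value per slot
    CheckedCache() : table(1u<<HASH_BITS), check(1u<<HASH_BITS, 0) { }

    void insert(const std::string & key, const std::string & data)
    {
        uint64_t h = hash64(key);
        uint32_t idx = (uint32_t)( h & ((1u<<HASH_BITS)-1) ); // low bits = index
        check[idx] = (uint8_t)( h >> 56 );                    // top 8 bits = check
        table[idx] = data;
    }
    bool lookup(const std::string & key, std::string * out)
    {
        uint64_t h = hash64(key);
        uint32_t idx = (uint32_t)( h & ((1u<<HASH_BITS)-1) );
        if ( check[idx] != (uint8_t)( h >> 56 ) ) return false;
        *out = table[idx]; // ~1/256 chance this is actually a collider's data
        return true;
    }
};
```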
3. Double hash cache table with check :
Of course since you are now making two hashes, you could look up two spots in your table. We're basically
running the primary hash and alt_hash above, but instead of unconditionally using only one of them as the
lookup hash and one as the check, we can use either one.
Data table[HASH_SIZE];
int table_check[HASH_SIZE]; // obviously not actually a separate array in practice

lookup :
    hash1 = hash_func(key);
    hash2 = alt_hash_func(key);
    if ( table_check[hash1] == hash2 )
    {
        Data & found = table[hash1];
    }
    else if ( table_check[hash2] == hash1 )
    {
        Data & found = table[hash2];
    }

insert :
    if ( quality(table[hash1]) <= quality(table[hash2]) )
    {
        table_check[hash1] = hash2;
        table[hash1] = data;
    }
    else
    {
        table_check[hash2] = hash1;
        table[hash2] = data;
    }
Where we now need some kind of quality function to decide which of our two possible insertion locations to
use. The simplest form of "quality" just checks if one of the slots is unused. More complex would be some
kind of recency measure, or whatever is appropriate for your data. Without any quality rating you could
still just use a random bool there or a round-robin, and you essentially have a hash with two ways, but
where the ways are overlapping in a single table.
Note that here I'm showing the check as using the same number of bits as the primary hash, but it's not required for this type of usage, it could be fewer bits.
(also note that it's probably better just to use hash1 and hash1+1 as your two hash check locations, since it's so much better for speed, but we'll use hash1 and hash2 here as it leads more directly to the next -)
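Here's a minimal C++ sketch of the dual-hash table, using the simplest quality function (prefer an empty slot); the Entry struct, the tie-break toward hash1, and the pointer-returning lookup are my choices, not prescribed by the scheme :

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct Entry
{
    std::string data;
    uint32_t    check; // the *other* hash of the key stored here
    bool        used;
};

struct DualCache
{
    std::vector<Entry> table;
    DualCache(uint32_t size) : table(size) { }

    // simplest "quality" : an empty slot is the best place to write
    int quality(const Entry & e) const { return e.used ? 1 : 0; }

    void insert(uint32_t hash1, uint32_t hash2, const std::string & data)
    {
        if ( quality(table[hash1]) <= quality(table[hash2]) )
            table[hash1] = Entry{ data, hash2, true };
        else
            table[hash2] = Entry{ data, hash1, true };
    }

    // found if either slot's check matches the other hash
    const std::string * lookup(uint32_t hash1, uint32_t hash2) const
    {
        if ( table[hash1].used && table[hash1].check == hash2 ) return &table[hash1].data;
        if ( table[hash2].used && table[hash2].check == hash1 ) return &table[hash2].data;
        return nullptr;
    }
};
```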
4. Double hash cache table with Cuckoo :
Once you get to #3 the possibility of running a Cuckoo is obvious.
That is, every entry has two possible hash table indices. You can move an entry to its alternate index and it
will still be found. So when you go to insert a new entry, instead of overwriting, you can push what's
already there to its alternate location. Lookup is as above, but insert is something like :
insert :
    PushCuckoo(table,hash1);
    table_check[hash1] = hash2;
    table[hash1] = data;

PushCuckoo(table,hash1)
{
    // I want to write at hash1; kick out whatever is there
    if ( table[hash1] is empty ) return;
    // move my entry from hash1 to hash2 (after first vacating hash2)
    hash2 = table_check[hash1];
    PushCuckoo(table,hash2);
    table[hash2] = table[hash1];
    table_check[hash2] = hash1;
}
Now of course that's not quite right, because this is a cache table, not a hash table. As written above you have a guaranteed infinite loop, because cache tables are usually run with more unique insertions than slots, so PushCuckoo will keep trying to push things and never find an empty slot.
For cache tables you just want to do a small limited number of pushes (maybe 4?). Hopefully you find an empty slot in that search, and if not you lose the entry that had the lowest "quality" in the sequence of slots you visited. That is, remember the slot with lowest quality, and do all the cuckoo-pushes that precede that entry in the walk.
For example, if you have a sequence like :

    I want to fill index A
    hash2[A] = B
    hash2[B] = C
    hash2[C] = D
    hash2[D] = E
    none are empty
    entry C has the lowest quality of A-E

Then push :

    B -> C
    A -> B
    insert at A

That is,

    table[C] = table[B]
    hash2[C] = B
    table[B] = table[A]
    hash2[B] = A
    table[A],hash2[A] = new entry
The idea is that if you have some very "high quality" entries in your cache table, they won't be
destroyed by bad luck (some unimportant event which happens to have the same hash value and thus overwrites
your high quality entry).
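Putting the bounded-push walk above into a hedged C++ sketch (the Entry layout, the per-entry quality number, and MAX_PUSHES=4 are all illustrative choices; a real implementation would also want to guard against cycles in the chain) :

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

const int MAX_PUSHES = 4; // small bounded walk, as suggested above

struct Entry
{
    std::string data;
    uint32_t    check;   // this entry's alternate table index
    int         quality; // illustrative : higher = more worth keeping
    bool        used;
};

struct CuckooCache
{
    std::vector<Entry> table;
    CuckooCache(uint32_t size) : table(size) { }

    const std::string * lookup(uint32_t hash1, uint32_t hash2) const
    {
        if ( table[hash1].used && table[hash1].check == hash2 ) return &table[hash1].data;
        if ( table[hash2].used && table[hash2].check == hash1 ) return &table[hash2].data;
        return nullptr;
    }

    void insert(uint32_t hash1, uint32_t hash2, const std::string & data, int quality)
    {
        // walk the chain of alternate slots : chain[i+1] is the alternate
        // of the entry currently sitting at chain[i]
        std::vector<uint32_t> chain;
        chain.push_back(hash1);
        size_t victim = 0; // position in chain of the lowest-quality entry seen
        for(int i=0; i < MAX_PUSHES; i++)
        {
            const Entry & e = table[chain[i]];
            if ( ! e.used ) { victim = i; break; } // empty slot ends the walk
            if ( e.quality < table[chain[victim]].quality ) victim = i;
            chain.push_back(e.check);
        }
        // push entries back along the walk, stopping at (and losing) the victim
        for(size_t i = victim; i > 0; i--)
        {
            table[chain[i]] = table[chain[i-1]];
            table[chain[i]].check = chain[i-1]; // its alternate is where it came from
        }
        table[hash1] = Entry{ data, hash2, quality, true };
    }
};
```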
So, I have tried this and in my experiments it's not a win.
To test it I wrote a simple symbol-rank compressor. My SR is order-5-4-3-2-1 with only 4 symbols ranked in each context. (I chose an SR just because I've been working on SR for RAD recently; otherwise there's not much reason to pursue SR, it's generally not Pareto). Contexts are hashed and looked up in a cache table. I compressed enwik8. I tweaked the compressor just enough so that it's vaguely competitive with state of the art (for example, I use a very simple secondary statistics table for coding the SR rank), because otherwise it's not a legitimate test.
For Cuckoo Caching, the hash check value must be the same size as the hash table index, so that's what I've done for most of the testing. (in all the other variants you are allowed to set the size of the check value freely). I also tested an 8-bit check value for the single lookup case.
I'm interested in low memory use and really stressing the cache table, so most of the testing was at 18-bits of hash table index. Even at 20 bits the difference between Cuckoo and no-Cuckoo disappears.
The results :
18 bit hash :
Single hash ; no confirmation :
Compress : 100000000 -> 29409370 : 2.353
Single hash ; 8 bit confirmation :
Compress : 100000000 -> 25169283 : 2.014
Single hash ; hash_bits size confirmation :
Compress : 100000000 -> 25146207 : 2.012
Dual Hash ; hash_bits size confirmation :
Compress : 100000000 -> 24933453 : 1.995
Cuckoo : (max of 10 pushes)
Compress : 100000000 -> 24881931 : 1.991
Conclusion : Cuckoo Caching is not compelling for data compression. Having some confirmation hash is critical, but even 8 bits is plenty. Dual hashing is a good win over single hashing (and surprisingly there's very little speed penalty, at least with small cache tables, where you are less likely to pay bad cache miss penalties).
For the record :
variation of compression with hash table size :
two hashes, no cuckoo :
24 bit o5 hash : (24,22,20,16,8)
Compress : 100000000 -> 24532038 : 1.963
20 bit o5 hash : (20,19,18,16,8)
Compress : 100000000 -> 24622742 : 1.970
18 bit o5 hash : (18,17,17,16,8)
Compress : 100000000 -> 24933453 : 1.995
Also, unpublished result : noncuckoo-dual-hashing is almost as good with the 2nd hash kept within cache page range of the 1st hash. That is, the good thing to do is lookup at [hash1] and [hash1 + 1 + (hash2&0xF)] , or some other form of semi-random nearby probe (as opposed to doing [hash1] and [hash2] which can be quite far apart). Just doing [hash1] and [hash1+1] is not as good.
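As a tiny sketch of that nearby-probe form (the function name and the wrap mask are mine; the &0xF mask is the example from the text, giving a probe within 16 slots after the first) :

```cpp
#include <cassert>
#include <cstdint>

// second lookup location kept near the first, per [hash1 + 1 + (hash2&0xF)] ;
// table_mask wraps the index back into the table
uint32_t nearby_probe(uint32_t hash1, uint32_t hash2, uint32_t table_mask)
{
    return ( hash1 + 1 + (hash2 & 0xF) ) & table_mask;
}
```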