11-23-08 - Hashing & Cache Line Size

A while ago I wrote about hash tables and reprobing and cache sizes. I mentioned Cliff Click's nonblocking hash table which is sort of interesting to me in theory but also scary and not something I really need at the moment. Anyhoo I was looking at : Some dude's C version of the lock free hash table and I noticed a little thing that's relevant to my old discussion.

He explicitly uses the cache line size and does +1 linear stepping along the cache line, then does a complex reprobe by rehashing, then does +1 again while in the cache line, then rehashes, etc. That seems like a pretty reasonable thing to do. I guess I considered that at the time, but didn't want to do it because it requires you to explicitly know the cache line size of the machine you're on. It literally requires a hard coded :

NumLinearSteps = CACHE_LINE_SIZE / sizeof(table_entry);
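As a sketch of what that probe order looks like (all names here are hypothetical; the entry layout and the rehash step are placeholder assumptions, not his actual code), the NumLinearSteps formula above becomes a compile-time constant and drives a +1-then-rehash sequence:

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64

typedef struct { uint32_t key; uint32_t value; } table_entry;

enum { NUM_LINEAR_STEPS = CACHE_LINE_SIZE / sizeof(table_entry) };

/* Hypothetical sketch of the probe order: +1 steps within one cache
   line's worth of slots, then a rehash jump, then +1 steps again, etc.
   The multiply-based "rehash" is a stand-in for whatever the table uses. */
static size_t probe_sequence(size_t hash, size_t attempt, size_t mask)
{
    size_t block = attempt / NUM_LINEAR_STEPS;  /* rehash jumps taken so far */
    size_t step  = attempt % NUM_LINEAR_STEPS;  /* +1 steps within current line */
    size_t base  = hash;
    for (size_t i = 0; i < block; ++i)
        base = base * 2654435761u + 1;          /* stand-in rehash */
    return (base + step) & mask;
}
```

The point is just that the first NUM_LINEAR_STEPS probes are consecutive slots, so after the first miss they're (nearly) free.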

Some other : blogs about the lock free hash have extensive link lists.


won3d said...

"extensive link lists" is almost ambiguous since it sounds like you could be describing those hashes as using open chaining.

Instead of linear probing, you can stick N elements into a cache-aligned bucket. If you pack the memoized hash values together, you can compare them in parallel using SSE (or Cell SPU). Depending on whether you are optimizing for hits (most of the time) or misses (e.g. a Bloom filter), you either want your hashes and pointers to share a cache line, or you want them in separate cache lines.
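A minimal SSE2 sketch of that bucket compare, assuming eight 16-bit memoized hashes packed into one bucket (the layout and names are made up for illustration):

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical bucket compare: eight 16-bit memoized hashes packed
   together, tested against the probe hash in one SSE2 compare.
   Returns a bitmask with bit i set if slot i matches. */
static unsigned match_bucket(const uint16_t hashes[8], uint16_t probe)
{
    __m128i h  = _mm_loadu_si128((const __m128i *)hashes);
    __m128i p  = _mm_set1_epi16((short)probe);
    __m128i eq = _mm_cmpeq_epi16(h, p);
    /* movemask gives 2 bits per 16-bit lane; compress to 1 bit per lane */
    unsigned m = (unsigned)_mm_movemask_epi8(eq);
    unsigned mask = 0;
    for (int i = 0; i < 8; ++i)
        if (m & (1u << (2 * i))) mask |= 1u << i;
    return mask;
}
```

Matching slots come back as a bitmask, so a hit is a single compare plus a bit scan instead of up to eight scalar compares.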

Sean Barrett said...

The one time I tried bucketing a hash table on cache lines, I saw no significant performance gain. I didn't investigate closely, but I can see reasons why that might occur--you increase the number of collisions (counting multiple things in the same bucket as a collision), and that actually causes even more collisions overall to the point where you're still probing just as many cache lines as before.

Sean Barrett said...

Hmm, that non-blocking hash table is interesting theory stuff, but in the end if you look at the actual performance results:


I'm not sure it's worth the effort. I mean, it's twice as fast as the blocking implementation on their crazy gazillion-CPU machines if you crank up the thread count, but on e.g. the Niagara it only shows gains at around 14 threads, and it's only 20% faster on the AMD machine. I mean, if you can just link it in, I guess go for it, but I'm not sure it's worth the engineering effort compared to one that just sucks down the lock.

I guess if you're writing code on Larrabee or such, then maybe, but as a dissenter against the massively multicore future I'm doubtful of its worthwhileness.

won3d said...

re: Bloom filtering example

I realize that this is a dumb example, because in a Bloom filter you don't need anything but the hashes (if that!).

re: bucketing

I think it depends on what the expected occupancy of the hash table is. If you are often finding things during the linear probing but without rehashing, then doing the comparisons in parallel can be a slight win. Also, you have a bit more control over when you incur cache misses. If you are just linear probing a cache-line's worth of hashes, you sometimes cause 2 misses, depending on the offset of the original hash and how long you had to probe.

Sean Barrett said...

Yeah, there's a subtler in-between algorithm where you rehash when your +1s would otherwise cross a cache line, as opposed to after exactly a cache line's worth of +1s; I would try that before trying exactly a cache line's worth of +1s, because basically those +1s are "strictly free".

won3d said...

How is the in-between algorithm better than the bucketized version? Just because the first probe is a bit cheaper?

Azul seems to have some pretty crazy hardware! It's like the LISP machine reborn, for massively parallel Java.

Sean Barrett said...

The first probe is more likely to hit, so we're talking non-cache-performance better, yeah. Otherwise, I don't see much advantage over the bucket, which is why I don't see any advantage to the cache-line's-worth-of-+1 approach vs. bucketing. And since I didn't see any win from bucketing versus not...

cbloom said...

Well, bucketing has pros & cons. I'm not sure under what conditions exactly it would be a win.

But doing at least a single +1 step if it keeps you in the current cache line is definitely a win (*). I mean, I've tested it pretty extensively for my string index case, and it just seems obvious, because the one extra compare is almost free and saves you from the cache miss when you reprobe.

In fact it could be 100% free if you prefetch the rehash. Like :

index = hash()
if ( found at index ) return
next = rehash()
prefetch mem at next
if index+1 is in the same cache line as index
then check at index+1 and if found return
now check at next

You can clearly see that the check at index+1 is just taking time that you would have spent stalled out on the rehash mem fetch anyway, so it's free and gives you a better chance of a hit.
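In C-ish terms, a sketch of that overlap using GCC/Clang's __builtin_prefetch (the table layout, the precomputed next slot, and the function names are all assumptions for illustration):

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64

/* Hypothetical sketch of the overlap above: kick off the prefetch for
   the rehash slot, then do the (nearly free) +1 check while that fetch
   is in flight.  'next' is assumed to be the already-computed rehash
   index; the table stores bare 32-bit keys.  Returns 1 on a hit. */
static int probe_with_prefetch(const uint32_t *table, uint32_t key,
                               size_t index, size_t next)
{
    if (table[index] == key) return 1;           /* hit at the hash slot */

#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(&table[next]);            /* start the rehash fetch now */
#endif

    size_t plus1 = index + 1;
    /* same cache line iff both byte offsets fall in the same 64-byte block */
    if ((index * sizeof table[0]) / CACHE_LINE_SIZE ==
        (plus1 * sizeof table[0]) / CACHE_LINE_SIZE)
    {
        if (table[plus1] == key) return 1;       /* overlaps the prefetch latency */
    }

    return table[next] == key;                   /* likely in cache by now */
}
```

The +1 compare executes in the shadow of the prefetch, which is the sense in which it's free.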

(*) - "definitely" a win under moderately full hash tables, eg. fill <= 80% or so. From talking to Sean I know he's successfully run hashes at 75-95% full, which is a bit of a different scenario because you have so many collisions all the time and the chance of getting your hit with a +1 probe is much lower.

Sean Barrett said...

Yeah, that's a good point, 25% versus 80% are going to be pretty different.

Also, I don't normally store the hash value in the table itself, so if you're talking extra cache misses chasing the string pointer to compare it, then it totally changes (but the bucket case I compared was just hashing 32-bit values that were stored directly, so it didn't have this problem).

won3d said...

I am going to test the theory that you can make a bucketed cuckoo hash table that works well in the full case.

cbloom said...

"Also, I don't normally store the hash value in the table itself, so if you're talking extra cache misses chasing the string pointer to compare it, then it totally changes"

Yeah, this is a big factor and changes whether the +1 is an obvious win or not. If your compare is really expensive for some reason (e.g. strcmp is slow because it's a cache miss) then the +1 might not be good.

Sean Barrett said...

Also, I guess you should basically just always store the hash value. The problem is it bloats the table and you fit fewer items per cache line, but if you're trying for as few cache misses as possible, you'd prefer to avoid fetching through key pointers more than you'd prefer to get more per cache line. (And certainly in my existing stuff that doesn't do the +1 or bucketing.) Mainly I've avoided it for size reasons, but if you actually have keys that are pointers to significantly sized items (e.g. 12-character strings) then the extra storage isn't that big a deal.

cbloom said...

BTW another thing I never got to but is pretty obvious -

if you're going to store the hash value, it's slightly better to use a hash that's different than the one you used to index into the hash table.

Using the same hash is sort of redundant and doesn't maximize your chance to immediately detect equality.

eg. you could make a 32 bit hash, and use 16 bits of it to index and just store the other 16 bits. Or you could make a 64 bit hash and just munge that down to index with 16 bits and store some 32 bit shuffle of it.

I think this is probably a very small effect though. The probability of false hash equality at 32 bits is already microscopic.
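A tiny sketch of that split, using the 16/16 example above (the mixer here is a placeholder stand-in, not a recommendation): the low half indexes the table, the high half is stored as the tag, so the tag carries bits the index doesn't already imply.

```c
#include <stdint.h>

/* Stand-in 32-bit mixer; any decent hash finalizer would do. */
static uint32_t hash32(uint32_t x)
{
    x ^= x >> 16; x *= 0x7feb352du;
    x ^= x >> 15; x *= 0x846ca68bu;
    x ^= x >> 16;
    return x;
}

/* Low 16 bits index the table; high 16 bits are the stored tag, so
   comparing the tag isn't redundant with having landed in this slot. */
static uint32_t table_index(uint32_t key) { return hash32(key) & 0xFFFFu; }
static uint16_t stored_tag (uint32_t key) { return (uint16_t)(hash32(key) >> 16); }
```

With the same bits used for both, every key in a given slot would trivially "match" on those bits; splitting them keeps the full 32 bits of discrimination.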

won3d said...

I was thinking of a variation on this idea with cuckoo hashes, since you could store the rehash value. Of course, you have to be able to do this back and forth, but this is pretty natural with something like your top/bottom 16-bit example. Anyway, it does make your memoized hashes eat up less cache, and if you're silly like I am then you can pack more into an SSE register.

Re: probing v. bucketing, independent of memoization

So Sean's idea has a natural extension. Instead of stopping at a cache line boundary, you can loop back to the beginning of the cache line and continue probing, rehashing after you've gone through them all. Clearly you can do this very cheaply by indexing with some bit masking tricks. This has an analogy with associative caches, which will do "way prediction" to potentially avoid associative lookups in highly-associative caches.
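The bit-masking trick might look like this (slot count and names are assumed): keep the high bits of the index fixed and increment only the low bits, so all of a cache line's slots get visited no matter where in the line the hash landed.

```c
#include <stddef.h>

#define SLOTS_PER_LINE 8   /* e.g. 64-byte line / 8-byte entry; an assumption */

/* Hypothetical wrap-around probe: step through all SLOTS_PER_LINE slots
   of the cache line containing 'start' by masking only the low bits,
   so probing never leaves the line until a rehash. */
static size_t wrap_probe(size_t start, size_t step)
{
    size_t line_base = start & ~(size_t)(SLOTS_PER_LINE - 1);
    size_t offset    = (start + step) & (SLOTS_PER_LINE - 1);
    return line_base + offset;
}
```

Steps 0..7 from any starting slot cover exactly the eight slots of that line, then you rehash; this is the "way prediction" analogy, where the cheap masked probes play the role of checking the other ways.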
