
11-29-10 - Useless hash test

Clocks to do the full hash computation and lookup on char * strings of English words :


hash functions :

 0 : FNV1A_Hash_Jesteress
 1 : FNV1A_HashStr_WHIZ
 2 : FNVHashStr
 3 : HashFNV1
 4 : MurmurHash2_Str
 5 : HashKernighanRitchie
 6 : HashBernstein
 7 : HashLarson
 8 : Hash17
 9 : Hash65599
10 : stb_hash
11 : HashWeinberger
12 : HashRamakrishna
13 : HashOneAtTime
14 : HashCRC

The hash functions are not given the string length; they have to find it themselves. For hashes that work on dwords, the fast way to do that is the classic zero-byte test :


#define DWORD_HAS_ZERO_BYTE(V)       (((V) - 0x01010101UL) & ~(V) & 0x80808080UL)
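
For example, a dword-at-a-time strlen using that test looks something like this (just a sketch : it assumes the pointer can be dword-read and, like the hashes below, it can read up to 3 bytes past the terminating null) :


static inline size_t StrLenDword(const char * s)
{
    const char * p = s;
    // step a dword at a time until some byte of the dword is zero
    while ( ! DWORD_HAS_ZERO_BYTE( *((const uint32 *)p) ) )
        p += 4;
    // then locate the null within the final dword
    while ( *p ) p++;
    return (size_t)(p - s);
}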

Hash functions are all inlined of course. The easy way to do that is to use macros to make functors for your hash functions and then pass them into a templated hash test function, something like this :


#define TEST_HT(hashfunc) \
  struct STRING_JOIN(htops_,hashfunc) : public hash_table_ops_charptr \
   { inline hash_type hash_key(const char_ptr & t) { return hashfunc( t ); } }; \
  typedef hash_table < char_ptr,Null,STRING_JOIN(htops_,hashfunc) > STRING_JOIN(ht_,hashfunc); \
  test_a_hash < STRING_JOIN(ht_,hashfunc)>(STRINGIZE(hashfunc),words,queries);

TEST_HT(FNV1A_HashStr_WHIZ);
TEST_HT(FNVHashStr);

This is using cblib::hash_table, which is a reprobing hash table with the hash values stored in the table, quadratic reprobing, and special key values for empty & deleted (not a separate flag array). cblib::hash_table is significantly faster than khash or std::hash_map (though it takes more memory than khash and other similar hash tables, due to storing the hash value). (BTW this was studied extensively before; see earlier blog posts. The four major deviations from khash are all small wins : 1. don't use separate flags, because they cause an extra cache miss; 2. use a cache-friendly quadratic reprobe, not rehashing; 3. use a pow2 table size, not prime; 4. store the hash value.)
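
The lookup in that kind of table goes something like this (a hypothetical sketch, not the actual cblib code) :


// special key values mark empty & deleted slots, no separate flag array :
#define KEY_EMPTY    ((const char *)0)
#define KEY_DELETED  ((const char *)1)

struct ht_entry { uint32 hash; const char * key; };

const char * ht_lookup(const ht_entry * table, uint32 table_size, // table_size is a pow2
                       uint32 hash, const char * key)
{
    uint32 mask = table_size - 1;
    uint32 i = hash & mask;
    for (uint32 probe = 1; ; probe++)
    {
        if ( table[i].key == KEY_EMPTY )
            return NULL; // miss
        // check the stored hash before the key; a key compare is a likely cache miss
        if ( table[i].key != KEY_DELETED &&
             table[i].hash == hash &&
             strcmp(table[i].key, key) == 0 )
            return table[i].key; // hit
        i = (i + probe) & mask; // quadratic reprobe, stays inside the pow2 table
    }
}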

The four tests in the chart above are :


blue:   StringHash_Test("M:\\CCC\\canterbury\\LCET10.txt","M:\\CCC\\canterbury\\PLRABN12.txt");
yellow: StringHash_Test("m:\\ccc\\book1","m:\\ccc\\book1");
red:    StringHash_Test("M:\\CCC\\words\\Sentence-list_00,032,359_English_The_Holy_Bible.txt","M:\\CCC\\words\\Word-list_00,038,936_English_The Oxford Thesaurus, An A-Z Dictionary of Synonyms.wrd");
green:  StringHash_Test("m:\\ccc\\large\\enwik8","m:\\ccc\\book2");

Each test makes a hash table from the words in the first file, then looks up the words of the second file.

(generally second file is much smaller than first, so we get some hits and some misses)

Anyway, the result is that the Whiz and Jesteress variants of FNV1 by Georgi 'Sanmayce' are in fact quite good in this kind of usage. They rely on having fast unaligned dword reads, so they are pretty much Intel-only, which makes them a bit pointless. But here they are :


U32 HashFNV1(const char *key)
{
    // FNV1 : start from the FNV offset basis; multiply by the FNV prime, then xor in each byte
    U32 hash = 2166136261;
    while(*key)
        hash = (16777619 * hash) ^ (*key++);
    return hash;
}

inline uint32 FNV1A_HashStr_WHIZ(const char *str)
{
    const uint32 PRIME = 709607;//15607;
    uint32 hash32 = 2166136261;
    const char *p = str;

    for(;;)
    {
        uint32 dw = *(uint32 *)p; // note : may read up to 3 bytes past the terminating null
        if ( DWORD_HAS_ZERO_BYTE(dw) )
            break; // the null is somewhere in this dword; finish byte by byte below

        hash32 = (hash32 ^ dw) * PRIME;
        p += sizeof(uint32);
    }
    while( *p )
    {
        hash32 = (hash32 ^ *p) * PRIME;
        p++;
    }
    
    return hash32;
}

U32 FNV1A_Hash_Jesteress(const char *str)
{
    const U32 PRIME = 709607;
    U32 hash32 = 2166136261;
    const char *p = str;

    for(;;)
    {
        uint32 dw1 = *(uint32 *)p;
        if ( DWORD_HAS_ZERO_BYTE(dw1) )
            break;
        
        p += 4;
        hash32 = hash32 ^ _lrotl(dw1,5); // _lrotl : MSVC rotate-left intrinsic
        
        uint32 dw2 = *(uint32 *)p;
        if ( DWORD_HAS_ZERO_BYTE(dw2) )
        {
            // finish dw1 without dw2
            hash32 *= PRIME;
            break;
        }
        
        p += 4;
            
        hash32 = (hash32 ^ dw2) * PRIME;        
    }
    
    while( *p )
    {
        hash32 = (hash32 ^ *p) * PRIME;
        p++;
    }
    
    return hash32;
}

I have no doubt these could be massaged to be a bit faster through careful study of the asm (in particular the handling of the 0-3 byte tail doesn't look awesome). An even better approach would be to ensure all your strings are 4-byte aligned and padded with null bytes after the terminator to fill out the final dword.
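
At string-creation time that's something like this (a sketch, assuming the allocator returns 4-byte-aligned memory) :


// copy str into an aligned buffer, zero-padded to a dword boundary,
// so the dword loops never touch an uninitialized or unowned byte :
char * MakeHashFriendlyString(const char * str)
{
    size_t len = strlen(str);
    size_t padded = (len + 4) & ~(size_t)3; // null plus fill to a dword boundary
    char * copy = (char *) calloc(padded,1); // calloc zero-fills the tail
    memcpy(copy,str,len);
    return copy;
}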

Anyway, you can pretty much ignore all this, because hash functions have to be tested in-situ (eg. on your exact data, on your hash table implementation, on your platform), but it is what it is.

BTW FNV1 is a lot faster than FNV1a. Also the mixing of Whiz and Jesteress is quite poor and they do have a lot more collisions on some values, but that appears not to matter much on this particular kind of test.
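
(That's easy to check; a quick collision count over a list of distinct words, something like this :)


static int cmp_u32(const void * a, const void * b)
{
    uint32 x = *(const uint32 *)a, y = *(const uint32 *)b;
    return (x > y) - (x < y);
}

// count full 32-bit hash collisions among distinct words :
int CountCollisions(const char ** words, int count, uint32 (*hashfunc)(const char *))
{
    uint32 * hashes = (uint32 *) malloc(count * sizeof(uint32));
    int collisions = 0;
    for (int i = 0; i < count; i++) hashes[i] = hashfunc(words[i]);
    qsort(hashes, count, sizeof(uint32), cmp_u32);
    for (int i = 1; i < count; i++) if ( hashes[i] == hashes[i-1] ) collisions++;
    free(hashes);
    return collisions;
}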

10 comments:

ryg said...

"They rely on having fast unaligned dword reads, so they are pretty much Intel-only"
PPC supports unaligned dword reads too. This is fast (same as aligned) if the word doesn't cross a 32-byte boundary and slow (microcoded) if it does - apparently, total cost is 28 cycles in that case (ouch!). Those are measured results, btw.

If you do random unaligned accesses, 29/32 will execute in 1 cycle and 3/32 in 28 cycles, for an average of about 3.53 cycles/load. Best I've managed to come up with for manual load of a dword is 4 cycles, so doing it manually just isn't worth it (unless you're always off-alignment by a known amount).

And of course lwz fares much better if your data is at least 2-byte-aligned (e.g. UTF-16 strings), because 2 of the 3 bad cases are with odd addresses; you average about 2.69 cycles in that case.
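
(The manual load in question is two aligned loads plus shifts, roughly like this - a sketch, big-endian byte order as on PPC :)


static inline uint32 LoadDwordUnaligned(const unsigned char * p)
{
    uintptr_t a = (uintptr_t)p & ~(uintptr_t)3; // aligned base address
    uint32 shift = (uint32)((uintptr_t)p & 3) * 8; // byte offset, in bits
    uint32 lo = *(const uint32 *)a;
    uint32 hi = *(const uint32 *)(a + 4);
    if ( shift == 0 ) return lo; // already aligned; also avoids a shift by 32
    return (lo << shift) | (hi >> (32 - shift));
}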

won3d said...

One additional worry about unaligned loads is that the "NUL in DWORD" trick can read past the end of allocated memory, which means you might be touching a memory page you didn't intend.

So memoizing the hash values is that classic size/speed tradeoff. It is particularly useful if you expect lots of misses, say because you expect lots of probing since your hash tables are full-ish, and comparisons are expensive (e.g. chasing a pointer).

But have we talked about the fact that you don't need to store the entire hash value? Or that the bits that aren't used for addressing (usually the low bits) are less useful than the others?

Say the keys to your hash are malloc'd pointers, which end up having a lot of alignment (like 8 or even 16 bytes). You can use those low bits from the key to store your hash bits, and you get most of the benefit without blowing up the size of your hash table.
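
(Concretely, something like this - hypothetical helpers, assuming keys with 8-byte alignment :)


// stash 3 hash bits in the free low bits of an 8-byte-aligned key pointer :
static inline uintptr_t PackKey(const char * key, uint32 hash)
{
    return (uintptr_t)key | (hash & 7);
}
static inline const char * UnpackKey(uintptr_t packed)
{
    return (const char *)(packed & ~(uintptr_t)7);
}
// a mismatch in the stashed bits rejects without ever touching the key :
static inline bool QuickReject(uintptr_t packed, uint32 hash)
{
    return (packed & 7) != (hash & 7);
}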

Per Vognsen said...

"But have we talked about the fact that you don't need to store the entire hash value? Or that the bits that aren't used for addressing (usually the low bits) are less useful than the others?"

Sometimes your value pointers are also drawn from a relatively narrow contiguous range of memory such as a memory pool of fixed-size objects. Then the base address of the memory range and the object size can be stored in the hash table header so you only need to store the index within that range per key.

If the memory pool has a maximum size of 2^n objects, that leaves w-n free bits (w = 32 or 64) you can use to store a subset of the hash bits for early rejection. You can even allow such hash tables to be rebuilt if the associated memory pool has to grow, although that should generally be avoided.
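
(A sketch of that layout, with a hypothetical pool of at most 2^20 objects packed into a 32-bit slot :)


#define POOL_INDEX_BITS 20
#define POOL_INDEX_MASK ((1u << POOL_INDEX_BITS) - 1)

// 20 bits of pool index plus the low 12 bits of the hash for early rejection :
static inline uint32 PackSlot(uint32 pool_index, uint32 hash)
{
    return (hash << POOL_INDEX_BITS) | (pool_index & POOL_INDEX_MASK);
}
static inline char * SlotToPtr(char * pool_base, size_t object_size, uint32 slot)
{
    return pool_base + (size_t)(slot & POOL_INDEX_MASK) * object_size;
}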

cbloom said...

"So memoizing the hash values is that classic size/speed tradeoffs."

Yeah. Though one test I did was comparing a less-full hash to one that's more full but with the hash value stored.

eg. hash value is 32 bits and the data is a 32 bit char *.

So a 70% full hash with the hash stored takes the same memory as a 35% full hash without it (8 bytes per slot at 70% occupancy and 4 bytes per slot at 35% occupancy both come to about 11.4 bytes per element).

In that case you can ask which has the better speed given constant memory use.

The answer basically comes down to whether the key is a pointer or not, eg. how expensive a key compare is. If key compare is cheap, don't store the hash; if key compare is a cache miss, store the hash.

"But have we talked about the fact that you don't need to store the entire hash value?"

Yeah, I mentioned this in the "cache table" post and I think we talked about it in the original hash table posts. But yeah, there are a lot of issues related to that. Probably just 8-16 bits of hash are enough.

Anonymous said...

Poor stb_hash, worst at both yellow and green.

Pour out a bottle for him.

Sanmayce said...

Hi to all,
thanks for the useful test Mr. Bloom, let me share with you my view on this topic:

First you are welcome on FNV1A_Jesteress/FNV1A_Meiyan/FNV1A_Mantis home page:
http://www.sanmayce.com/Fastest_Hash/index.html

Yesterday, having read your charming rant, I thought: "An informative and useful article, but why is it called 'useless'?"

Strange; maybe your demands are too high. After all, it is words being hashed here, right? Not phrases/sentences/files.

My knowledge of C[++]/rehashing is very limited, and yes, I understand that you are very sensitive to proper methodology; that is why I am glad when more experienced guys can do the tests I am incapable of doing properly.

Every further optimization is worth searching for, but I am joyous about FNV1[A]'s strength and potential at different granularities, not so much the concrete implementation.

Speaking of your "worthless, useless, pointless", I hope I am in error by saying:
"You look at the river
But you forget the sea."

Regards

cbloom said...

The reason I say "Useless" is because I didn't do a thorough enough test for it to be definitive, so I'm just contributing yet another half-assed test. I did try to fix a couple of the issues that bothered me in the strchr test, but I just don't have the time to do it really right.

I think there are far too many half-assed hash tests on the internet (at strchr, my site, khash, rde, paul hsieh, etc.), and me adding one more doesn't really help anyone.

What would be a strong contribution is for somebody to spend the time to do a really thorough comparison.

JJS said...

My conclusion is just that a lot of even very simple functions are simply "acceptable enough" for some typical uses.

Sanmayce said...

Hi Mr. Bloom,
by chance I have come up with the FASTEST lookuper known to me for SMALL KEYS:
https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/824947

If interested, ... look it up.

It would be fun to see whether I am wrong in saying FASTEST; your methodology may once more prove "useless" while being superpractical.

Sanmayce said...

Dummy me, I had to fix v2; now everything is OK. My excuse: yesterday I was distracted the whole day.
So, here comes v3:

```
#define _rotl_KAZE(x, n) (((x) << (n)) | ((x) >> (32-(n))))
#define _PADr_KAZE(x, n) (((x) << (n)) >> (n))
#define ROLInBits 27 // 5 in r.1; Caramba: it should be ROR by 5, not ROL - from the very beginning the idea was to mix two bytes by shifting/masking the first 5 'noisy' bits (ASCII 0-31 symbols).
// CAUTION: Add 8 more bytes to the buffer being hashed, usually malloc(...+8) - to prevent out-of-boundary reads!
uint32_t FNV1A_Hash_Yorikke_v3(const char *str, uint32_t wrdlen)
{
    const uint32_t PRIME = 591798841;
    uint32_t hash32 = 2166136261;
    uint64_t PADDEDby8;
    const char *p = str;
    for (; wrdlen > 2*sizeof(uint32_t); wrdlen -= 2*sizeof(uint32_t), p += 2*sizeof(uint32_t)) {
        hash32 = ( _rotl_KAZE(hash32,ROLInBits) ^ (*(uint32_t *)(p+0)) ) * PRIME;
        hash32 = ( _rotl_KAZE(hash32,ROLInBits) ^ (*(uint32_t *)(p+4)) ) * PRIME;
    }
    // Here 'wrdlen' is 1..8
    PADDEDby8 = _PADr_KAZE(*(uint64_t *)(p+0), (8-wrdlen)<<3); // when (8-8) the QWORD remains intact
    hash32 = ( _rotl_KAZE(hash32,ROLInBits) ^ *(uint32_t *)((char *)&PADDEDby8+0) ) * PRIME;
    hash32 = ( _rotl_KAZE(hash32,ROLInBits) ^ *(uint32_t *)((char *)&PADDEDby8+4) ) * PRIME;
    return hash32 ^ (hash32 >> 16);
}
// Last touch: 2019-Oct-03, Kaze
```
