Comments on cbloom rants: 03-04-15 - LZ Match Finding Redux

hash-ways : yes, # of entries in a hash row. So y...

2015-03-26T07:52:44.161-07:00

hash-ways : yes, # of entries in a hash row. So you look up by the hash and then iterate over the # of hash ways.

2nd-hash : this is using a different hash function, but writing in the same hash table. So it just stomps on the primary hash.

It's most effective if the 2nd hash is on a different length substring. eg. if the primary hash is on length 3, the 2nd hash is on length 4 or 5.

Quick question, to be sure to understand words use...

2015-03-26T06:42:58.761-07:00

Quick question, to be sure to understand words used in this post :
hash-ways : related to the number of pointers or indexes on a hash row
2nd hash : prefix hash-bits into cell ?

Right, but you potentially eat more cache misses w...

2015-03-16T14:35:34.707-07:00

Right, but you potentially eat more cache misses while probing large buckets. You could squeeze high_hash_bits + offset into a U32, and maybe double the number of probes per cache miss compared to interning. You'd also get more associative ways and maybe can find better matches.

Associativity effects aside, you could probably predict the # cache misses per match find based on the LZ dumps. Also, IIRC, short matches also tend to be recent/frequent, so might even not incur a cache miss even though you have to chase the offset.

BTW, I tend to call the "hash high bits" the "syndrome" on loan from CRC terminology.

I did test the U32 prefix and it was faster than c...

2015-03-16T14:34:33.994-07:00

I did test the U32 prefix and it was faster than chasing the pointer on my tests (8-way/16-way), but really not by much (something like 1%). That was testing with somewhat weird data though, so no particular conclusion.

Doing direct len-3 matches from your hash table means you can't be sloppy with cleaning out stale offsets (>=1 wrap-around ago) from the hash table, or you can generate invalid encodings based on aliased match offsets.

The nice thing about the U32 prefix is you can com...

2015-03-16T14:20:49.865-07:00

The nice thing about the U32 prefix is you can completely resolve a len-3 match without chasing a pointer at all.

It gives you sort of proportional speed - any time you eat a cache miss, you know you have at least a len-4 match, which means hey at least you're getting a 4 byte step from that time penalty.

But yeah, it just has to be tested. You're right that only a few high hash bits (eg. 8) would resolve most collisions.

You can also store a few high hash bits (assuming ...

2015-03-16T14:06:53.295-07:00

You can also store a few high hash bits (assuming low bits are used as bucket index), which will filter out a lot of non-matches very quickly. U32 prefix seems like a waste of space, but I guess it depends on how long your matches tend to be.

There's no big win from SIMD in this case sinc...

2015-03-16T13:24:18.537-07:00

There's no big win from SIMD in this case since the values in the cache table are pointers, not the actual characters you're matching against.

As Charles mentions, you can store the first few characters at each offset in the cache table (along with the offset itself), which avoids the cost of potentially missing the cache accessing the window only to find out you had a hash collision. In that case, you might be able to profitably SIMD it.

For multi-way cache tables, do you ever do SIMD pa...

2015-03-16T07:34:13.374-07:00

For multi-way cache tables, do you ever do SIMD parallel probing?