11/28/2008

11-28-08 - Chance of CRC being bad

Say I run the CRC32 on data chunks. I've got them identified by name so I only care about false equality (collisions) on objects of the same name. I'm going to be conservative and assume I only get 31 good bits out of the CRC so there are 2^31 values, not a full 2^32. I think around 20 versions of the data per name max. So the chance of collision is a birthday "paradox" thing (it's not a freaking paradox, it's just probability!) :

P = "1 - e^( - 20*19/(2*(2^31)) )" = 8.847564e-008

Now if you have N names, what's the chance that *any* of them has a collision?


C = 1 - (1 - P)^N

So, for what N of objects is C = 50% ?


log(0.5) = N * log(1 - P)

N = log(0.5)/(log(1-P)

N = 7,834,327

You can have 7 Million files before you get a 50% chance of any one of them having a collision.

For a more reasonable number like N = 100k , what's the chance of collision?


C = "1 - (1 - 8.847564*(10^-8))^100000" = 0.00881 = 0.881 %

These probabilities are very low, and I have been pessimistic so they're even lower, but perhaps they are too high. On any given project, an 0.881% chance of a problem is probably okay, but for me I have to care about what's the chance that *any customer* has a problem, which puts me closer to the 7 M number of files and means it is likely I would see one problem. Of course a collision is not a disaster. You just tell Oodle to refresh everything manually, and the chance is that only happens once to anybody ever in the next few years.

BTW we used CRC32 to hash all the strings in Oddworld : Stranger's Wrath. We had around 100k unique strings which is pretty close to the breaking point for CRC32 (unlike the above case, they all had to be unique against each other). Everything was fine until near the end of the project when we got a few name collisions. Oh noes! Disaster! Not really. I just changed one of the names from "blue_rock.tga" to "blue_rock_1.tga" and it no longer collided. Yep, that was a crazy tough problem. (of course I can't use solutions like that as a library designer).

4 comments:

  1. The problem is that CRC32(mystring) is not actually that random. You would need a cryptographic hash for that property. You get stronger guarantees, like probable universal uniqueness (it is basically a GUID).

    http://mail.python.org/pipermail/python-list/2004-November/291869.html

    ReplyDelete
  2. Yeah I'd be curious to see the actual measured collision rate on various types of data. I almost just did it myself but then said "meh".

    ReplyDelete
  3. Why do you care about the fact that you have 100 customers? You don't care about collisions across customers, so the overall number of assets in one pool is still the same.

    ReplyDelete
  4. "Why do you care about the fact that you have 100 customers? You don't care about collisions across customers, so the overall number of assets in one pool is still the same."

    Yeah, true, it's just that if there's a 1% chance per game that's a fine choice for a one game developer, but with 100 games the chance of *anyone* having a problem makes it very likely (63%) that somebody hits it. Still not really a big deal though.

    ReplyDelete