
10-16-06 - 2

I just found this set of Good Collaborative Filtering Papers (haven't read any yet). I'll release my Netflix code at some point if/when I decide I have no hope of winning.

So, I've read through a bunch of the literature. It seems like I independently invented most of the same techniques. Some interesting tidbits: I'm currently using 2 similar movies & 2 similar users, while the accepted good # in the academic community seems to be 30 (!!). (Using more doesn't increase my run time at all, since I'm looking at all of them anyway to pick the best 2.) Also, the way they compare vectors is generally a dot product, while I've been using a Euclidean distance; I'm not sure that makes much of a difference. They also fudge the similarity to penalize small intersections and such; I have similar fudging, and I've found that the details of those fudge factors are enormously important in practice. I can't tell from the literature whether people have really investigated different similarity functions, or whether everyone is just using the one that was published first and has become standard.
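To make that concrete, here's a minimal sketch of the two similarity flavors, an overlap-shrinkage fudge, and a top-k weighted blend. The shrink constant, the (0,1] squashing of the distance, and the fallback rating are made-up placeholders, not the actual values from my code:

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Ratings of the movies two users have rated in common (already aligned).
    struct Overlap { std::vector<float> a, b; };

    // Dot-product (cosine) similarity, the flavor the papers generally use.
    float CosineSim(const Overlap & o)
    {
        float dot = 0, na = 0, nb = 0;
        for (size_t i = 0; i < o.a.size(); i++)
        {
            dot += o.a[i] * o.b[i];
            na  += o.a[i] * o.a[i];
            nb  += o.b[i] * o.b[i];
        }
        return (na > 0 && nb > 0) ? dot / std::sqrt(na * nb) : 0;
    }

    // Euclidean-distance similarity, the flavor I've been using,
    // squashed into (0,1] so that bigger = more similar.
    float EuclideanSim(const Overlap & o)
    {
        float d2 = 0;
        for (size_t i = 0; i < o.a.size(); i++)
        {
            float d = o.a[i] - o.b[i];
            d2 += d * d;
        }
        return 1.0f / (1.0f + std::sqrt(d2 / o.a.size()));
    }

    // The fudge: shrink similarity toward zero when the overlap is small,
    // so two users who share only a couple of ratings don't look identical.
    // The shrink constant 10 is a placeholder.
    float ShrunkSim(float sim, int overlapCount)
    {
        return sim * overlapCount / (overlapCount + 10.0f);
    }

    struct Neighbor { float sim; float rating; };

    // Predict as the similarity-weighted average of the k most similar
    // neighbors (k = 2 in my code, ~30 in the papers).
    float PredictTopK(std::vector<Neighbor> n, size_t k)
    {
        k = std::min(k, n.size());
        std::partial_sort(n.begin(), n.begin() + k, n.end(),
            [](const Neighbor & x, const Neighbor & y) { return x.sim > y.sim; });
        float num = 0, den = 0;
        for (size_t i = 0; i < k; i++)
        {
            num += n[i].sim * n[i].rating;
            den += n[i].sim;
        }
        return den > 0 ? num / den : 3.0f; // arbitrary middling fallback
    }

    int main()
    {
        Overlap o = { {5,3,4}, {4,3,5} };
        float sim = ShrunkSim(EuclideanSim(o), (int)o.a.size());
        std::vector<Neighbor> n = { {sim, 4.0f}, {0.2f, 2.0f}, {0.1f, 5.0f} };
        printf("sim = %.3f , cosine = %.3f , prediction = %.3f\n",
            sim, CosineSim(o), PredictTopK(n, 2));
    }

Note how much the shrink constant matters: with only 3 common ratings the shrunk similarity is a quarter of the raw one, which completely changes which neighbors make the top 2.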

One cool approach is the "Content-Boosted Collaborative Filtering" paper, though it seems completely unusable on large data sets, because the first step is turning the sparse matrix into a full matrix by filling in the gaps with content-based predictions. That is, it would fill in the full 17k x 480k matrix of ratings, which is about 8 billion entries; storing them is only about 8 GB at one byte per rating (not bad), but computing them would take around 1,000,000 hours.
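For scale, the back-of-envelope goes like this. The one-byte-per-rating storage and the predictions-per-second rate are just back-solved from the numbers above, not measured:

    #include <cstdio>

    int main()
    {
        const double movies  = 17000;
        const double users   = 480000;
        const double entries = movies * users;   // ~8.2e9 filled-in ratings
        const double gb      = entries / 1e9;    // ~8 GB at one byte per rating
        const double perSec  = 2.27;             // rate implied by the 1,000,000-hour figure
        const double hours   = entries / perSec / 3600.0;
        printf("entries = %.2e , storage = %.1f GB , compute = %.2e hours\n",
            entries, gb, hours);
    }

In other words, the approach only becomes feasible if each content-based prediction is thousands of times cheaper than that, or if you only fill in a tiny sampled subset of the matrix.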
