10/17/2006

10-17-06 - 6

I've noticed two really odd things in the CF research papers so far.

1. They all use this "cosine" similarity measure. The cosine similarity is basically the dot product of the normalized rating vectors, which means two movies with ratings like {2,2,2} and {4,4,4} are considered identical by this measure. Now, that's okay if you're compensating for the movie average, since after you subtract each movie's average off, both become {0,0,0} and really are the same. However, the standard way of making an item-based prediction does NOT subtract off the average! It's reported in the literature as

Pred = Sum{other items} Weight(this,other) * Rating(other item) / sum of weights;

If you were to correct for averages it would be :

Pred = Average(this item) + Sum{other items} Weight(this,other) * (Rating(other item) - Average(other item)) / sum of weights;

But that's not what they report. (There's a quick numeric sketch of both points after point 2 below.)

2. The exception to this (not correcting for average) is the "Similarity Fusion" paper (sigir06_similarityfuson.pdf), which DOES correct for averages. However, they specifically claim that with item-only weights their method reduces to the standard item-based predictor, which it in fact does NOT. It reduces to the second formula above, the average-corrected item-based predictor, which is quite different.
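To make both points concrete, here's a tiny sketch in Python (the ratings, weights, and averages are made up for illustration):

import math

def cosine_sim(a, b):
    # dot product of the two rating vectors over the product of their lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Point 1: cosine calls these two movies identical despite very different averages.
print(cosine_sim([2, 2, 2], [4, 4, 4]))   # 1.0

def pred_standard(weights, ratings):
    # Pred = Sum w(this,other) * Rating(other) / sum of weights
    wsum = sum(weights.values())
    return sum(weights[j] * ratings[j] for j in ratings) / wsum

def pred_corrected(this_avg, weights, ratings, avg):
    # Pred = Average(this) + Sum w(this,other) * (Rating(other) - Average(other)) / sum of weights
    wsum = sum(weights.values())
    return this_avg + sum(weights[j] * (ratings[j] - avg[j]) for j in ratings) / wsum

# Point 2: the two predictors give genuinely different numbers.
weights = {"A": 1.0, "B": 1.0}   # similarity of the target item to items A and B
ratings = {"A": 4.0, "B": 2.0}   # this user's ratings of A and B
avg     = {"A": 3.5, "B": 1.5}   # item averages over all users
print(pred_standard(weights, ratings))             # 3.0
print(pred_corrected(2.0, weights, ratings, avg))  # 2.0 + 1.0/2 = 2.5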

It seems to me the cosine weighting is very bad if you don't correct for averages, and yet the standard in the literature is to use cosine weighting without correcting for averages. WTF. The standard "Item-Based Collaborative Filtering" paper (www10_sarwar.pdf) does try using a linear slope+bias to shift ratings between items, but they find that it hurts performance with lots of neighbors or dense data. Weird.
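For reference, a slope+bias shift between two items is just a least-squares line fit over the users who rated both. Something like this (my sketch, with made-up co-ratings; not necessarily exactly how the paper formulates it):

def fit_slope_bias(xs, ys):
    # least-squares fit of ys ~ a * xs + b, where xs and ys are the
    # ratings of two items from the users who rated both
    n = float(len(xs))
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    b = my - a * mx
    return a, b

a, b = fit_slope_bias([1, 2, 3, 4], [2, 3, 5, 6])
print(a, b)   # ~1.4, 0.5 : a rating r on the first item maps to a*r + b on the second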
