10/19/2006

10-19-06 - 3

I just realized that cosine similarity is very closely related to MSE similarity.

Consider two vectors A and B. The "MSE" of their difference is just |A-B|^2, the "rmse" is |A-B|. You can define a scale-invariant version of MSE as (|A-B|^2)/ab (where a = |A|, b = |B|). I'm not sure this scale invariance is a good thing or not, in practice is does weight things to the weightings, but people seem to like it because it's "elegant". Anyway, the Cosine law tells us that :

Cos(theta) = (a/b + b/a)/2 - (|A-B|^2)/2ab

The second term is familiar, the first term is this weird sort of ratio average. In the special case of unit vectors a = b = 1 this reduces to

Cos(theta) = 1 - (|A-B|^2)/2

So for unit vectors a "cosine" similarity and an MSE similarity will produce identical orderings. For non-unit vectors, we have this ratio term. The ratio term is minimized when a=b, so it gets bigger for length differences. In a sense it cancels out the length difference from |A-B|. Consider two parallel vectors of different length. The MSE metric considers these very different, but cosine says they are the same.

In practice with movie ratings there are some very weird things that happen here. The "A" vector is the vector of user A's ratings over N movies *subtracted by his average rating*. This subtraction is important, but it does a weird thing. It makes the vector of random direction (by definition of average), and generally makes it close to zero. If you look at a subset of movies that he rates close to average, his vector will be very close to zero and very random. The "B" vector is the same thing. If you now consider two vectors that are close to {0}, the MSE error between them is of course tiny, but the cosine between them is completely random. Obviously that's not a very sensible weighting. Similarly, consider the "scale invariant" version of MSE. You're dividing by the vector lengths. What that does in practice is make the error much larger for vectors near zero. (dividing by length does sort of work for Cosine because larger cosine = better, the opposite for mse)

Despite all this the "Pearson" correlation for user-user similarity does in fact seem to perform well.

No comments:

old rants