03-11-08 - 4

Netflix Prize notes , Part 2 : Collaborative Filtering

I'm gonna describe basic collaborative filtering along with ways to improve on the standard literature. There are a few steps :
1. Finding similar movies & users
2. Weighting the similar movies & users
3. Using the similar movies & users & weights to make a prediction.

I'm going to try to be sort of vague and general, but then I'll also add in very specific details of what I actually found to be best on the Netflix data, so hopefully we can make that weird mix work.

1. Finding similar movies & users. We're interested in some query movie & user {m,u}. First let's find movies similar to m. We can greatly reduce our search by only finding similar movies that the query user u has rated. Those are the ones that are really useful to us so it's a good speedup. So each movie is a row in this big matrix of ratings, with many blanks. There are two sort of general ways to define similarity, one is a dot product, and the other is a difference/distance or MSE. Of course those are very closely related, if you normalize your vectors, then the dot product is linearly related to the distance squared via the cosine rule. A lot of the papers jump right ahead to using "correlation" but I don't want to do that. "Correlation" is a very specific mathematical thing, and may not be the best thing for our similarity metric.

So I have these two movies that have been rated by various users, they have some users in common (both not blank), some not in common, and many entries that are blank in both. Now, if you just did a distance squared using only the elements where you had ratings in both movies, that would be very bogus, because it actually favors movies that have a very small user base intersection, which are probably in fact not very similar movies.

The other thing we want to do is remove linear bias when comparing movies. If two movies are very similar, but one is just better so that everyone rates it higher on average, they can still be highly correlated and useful predictors. We're going to remove average bias when we do the prediction, so we should do it now. So, basically any time we use a rating, we subtract off the average rating of that movie. NOTE : in general you might also want to remove linear scale. On the Netflix data I found that hurt, that scale contains important information and you should not remove it, but on more general data you should consider removing scale. Of course then you also have to remove it & restore it when you do the prediction.

A note on the average rating though : you don't want to just use the average of the ratings that exist on the row. Rather you want to make a prediction of what the average rating of that movie would be if all the blanks were filled in. There are fancy ways to do this, but it turns out that just pretending you have C more ratings with the global average rating works plenty well.

movie average = [ ( sum of ratings on movie ) + C * ( global average rating ) ] / [ ( num ratings ) + C ]
and C = 12.5 was best for Netflix. You want to do the same thing for the customer average, but for that C = 25 was best.
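
Here's a minimal sketch of that damped average in Python - the names are mine, not anything official, and the C values are just the ones quoted above :

# Pretend we have C extra ratings at the global average, then take the mean.
def damped_average(ratings, global_average, C):
    return (sum(ratings) + C * global_average) / (len(ratings) + C)

C_MOVIE = 12.5   # prior strength for movie averages
C_USER  = 25.0   # prior strength for customer averages

# e.g. a movie with ratings [5,4,5] on a service whose global mean is 3.6 :
# damped_average([5,4,5], 3.6, C_MOVIE) pulls the raw mean 4.67 back toward 3.6 (to about 3.8)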

Now, most of the literature uses a Pearson or "correlation" measure for similarity, which are forms of dot product measure, higher is better. I found using a distance was easier to tweak and gave better results, but of course they are directly related as I noted before.

So, let's create this distance between two movies. First we have the distance squared between ratings where they both have a rating for a given user :

movie errsqr = Sum[u] {  ( ( rating(m1,u) - average(m1) )  - ( rating(m2,u) - average(m2) ) )^2 }

m1 & m2 = movies to compare
u = a user index , sum is over users that have a rating for m1 and m2
The number of users in common is "num shared" and the number of users where one movie has a rating but the other does not is "num unshared". We create a modified movie distance thusly :
movie distance squared = ( errsqr + (num unshared) * A + B ) / (num shared)

where A & B are constants
A = 0.016 and B = 4 were best on Netflix
And the movies with the smallest distance are the most similar.
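
A sketch of that movie-movie distance, with each movie stored as a dict of user -> rating and the averages being the damped averages from above (the names and the storage layout are my own choices) :

A_MOVIE = 0.016
B_MOVIE = 4.0

def movie_distance_sqr(ratings1, avg1, ratings2, avg2, A=A_MOVIE, B=B_MOVIE):
    shared = ratings1.keys() & ratings2.keys()
    num_shared = len(shared)
    if num_shared == 0:
        return float('inf')   # no overlap : treat as maximally dissimilar
    # distance squared over shared users, with each movie's average bias removed
    errsqr = sum(((ratings1[u] - avg1) - (ratings2[u] - avg2)) ** 2
                 for u in shared)
    # users rated by exactly one of the two movies
    num_unshared = len(ratings1) + len(ratings2) - 2 * num_shared
    return (errsqr + num_unshared * A + B) / num_shared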

Okay, now we need a similar thing for customer-customer similarity. We could of course just use the exact same type of thing, but I found something similar was better & faster. We're going to use the similar movies list that we already found and find customers that are similar over those movies, rather than finding customers that are globally similar. In fact this should immediately give you the idea that we could've done the same thing for movies - rather than comparing the movies globally we could compare them only around users similar to the current one. More generally you want to pick rows & columns of the big matrix which produce a sub-matrix that is related to the query. You could do this with more generalized clustering or with something like SVD, but I'm getting away from regular collaborative filtering.

So, for customers, we do a similar thing but only over the similar movies. First, for acceleration we only consider other users which have rated the query movie. So we walk the list of users that rated the query movie and find the ones that are most similar when measured over the similar movies list. It's very similar to before with similar motivation :

user errsqr = Sum[m] {  W(m) * [  ( rating(m,c1) - average(c1) ) - ( rating(m,c2) - average(c2) ) ]^2 }   / Sum[m] { W(m) }

c1 & c2 = customers to compare
m = movie in the "similar movie" list which is rated by c1 & c2
W(m) = weight of movie m
Note that weighting has appeared and we'll talk later about how we weight a movie. The customer average ratings are corrected using the formula previously given. We then make our distance :
user distsqr = [ (user errsqr) * (num shared) + A * (num unshared) + B * max(0, C - num shared) ] / (num shared + num unshared)

num shared = # of movies used in the user errsqr sum
num unshared = # of similar movies without ratings in common (note this includes both-blank here)
num shared + num unshared = # of similar movies, a constant for this movie

for Netflix :
A = 1
B = 1.168
C = 7.7
There's this extra term with the max which makes users with fewer than "C" movies in common get a big penalty. So then we gather the users with smallest distsqr.
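
A sketch of the user-user distance, measured only over the similar-movie list. Here r1/r2 are dicts of movie -> rating for the two customers and weights maps each similar movie to its W(m) ; the constants are the ones above and the names are mine :

A_USER, B_USER, C_MIN_SHARED = 1.0, 1.168, 7.7

def user_dist_sqr(similar_movies, weights, r1, avg1, r2, avg2,
                  A=A_USER, B=B_USER, C=C_MIN_SHARED):
    errsqr, wsum, num_shared = 0.0, 0.0, 0
    for m in similar_movies:
        if m in r1 and m in r2:
            w = weights[m]
            errsqr += w * ((r1[m] - avg1) - (r2[m] - avg2)) ** 2
            wsum += w
            num_shared += 1
    if num_shared == 0 or wsum == 0.0:
        return float('inf')
    errsqr /= wsum   # weighted average squared error, as in the formula above
    num_unshared = len(similar_movies) - num_shared
    return (errsqr * num_shared + A * num_unshared
            + B * max(0.0, C - num_shared)) / (num_shared + num_unshared)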

The set of N most similar movies and M most similar users is the local neighborhood of the query. For Netflix I found the best results with N=30 and M=100. We index the local neighborhood sorted by similarity, so the 0th movie is the query movie, the 1st is the most similar, etc. The [0,0] element of the local matrix is the query spot, it's blank and it's what we want to fill. The whole 0th row and 0th column are fully populated with ratings, by construction - we only considered similar movies & users which came off the query. The [0,1] element for example is the rating of the most similar user on the query movie. The [1,1] is the rating of the most similar user on the most similar movie. Roughly the farther away in this matrix, the lower the correlation, but of course you shouldn't use the indexes but rather the distances we computed above. Note that of course you can't always find a valid N movies or M users, in which case the last few rows or columns are left blank and their weights are set to zero.
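
For concreteness, here's one way to lay out that local neighborhood matrix - rows are movies, columns are users, blanks are None. This layout is my own choice, and it's what the later prediction sketches index into :

# ratings : dict mapping (movie_id, user_id) -> rating
# movies  : [query_movie, most_similar, ...] , users : [query_user, most_similar, ...]
def build_local_matrix(ratings, movies, users):
    return [[ratings.get((m, u)) for u in users] for m in movies]

# L = build_local_matrix(ratings, similar_movies, similar_users)
# L[0][0] is None : that's the blank we're trying to predict.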

2. Weighting the similar movies & users

Now we need to talk about weighting. We already used the movie weights in the user-user similarity metric, and we'll keep using them similarly. We want a weight that will multiply how much a term should contribute, proportional to the similarity to the query. Implicitly all these quantities are related to the query user & movie.

Our weight is going to be based on a cosine / dot product , so let's start with that.

First define rub = rating unbiased = rating - average. If we're comparing two movies, that "average" should be the movie's average; if we're comparing two customers it should be the customer's average. The cosine formula is deceptively simple :

cosine(m1,m2) = rub(m1) * rub(m2) / ( |rub(m1)| * |rub(m2)| )

Remember the ratings are like a row and we just treat them as a vector and do a dot product and magnitude. But there's some subtlety. The dot of rub1 and rub2 is done only on elements where they are both not blank, that is only over "shared" elements. However, the magnitudes |rub1| and |rub2| are the sqrt of dot products over ALL the elements in rub1 and rub2 respectively. That means we are in fact penalizing unshared entries. Note that if you pretended that elements where one was blank might be the same, that should contribute positively to the numerator of the dot product, and here it contributes zero.

The cosine is in [-1,1] , and in theory it's the absolute value of the cosine that you care about - e.g. a -1 would indicate perfect opposing correlation and would also be a good predictor. Some of the papers use a "bipolar" model to exploit this. On Netflix I found the opposing correlations to not be helpful and excluded movies with a cosine < 0.

You do this exact same cosine thing for user-user similarity and it's called the Pearson similarity measure (but the "rub" uses the user average not the movie average to unbias).
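
A sketch of that cosine for two movies, which makes the asymmetry explicit : the dot product runs only over shared users, but each magnitude runs over all of that movie's unbiased ratings, and that's what penalizes the unshared entries. Function name and storage are mine :

import math

def movie_cosine(ratings1, avg1, ratings2, avg2):
    # numerator : dot product over shared users only
    dot = sum((ratings1[u] - avg1) * (ratings2[u] - avg2)
              for u in ratings1.keys() & ratings2.keys())
    # magnitudes : over ALL of each movie's unbiased ratings
    mag1 = math.sqrt(sum((r - avg1) ** 2 for r in ratings1.values()))
    mag2 = math.sqrt(sum((r - avg2) ** 2 for r in ratings2.values()))
    if mag1 == 0.0 or mag2 == 0.0:
        return 0.0
    return dot / (mag1 * mag2)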

Now we have this cosine but it's not actually the weight I found to be best. The weights I used on Netflix were :

weight(m) = cosine(qm,m) / ( distance(qm,m) + 0.00001 )

qm = the query movie
m = movie to weight

weight(u) = cosine(qu,u) / ( distsqr(qu,u) + 2.0 )

qu = the query user
u = the user to weight
There's really no theoretical reason to prefer any particular form of weight. I tried a lot of things. Just the cosine is okay. Just 1/dist is okay. Even something as weird and simple as just using the # of shared ratings is actually very good. These forms were best on Netflix but the exact best form is going to depend on your problem. This is obviously a flawed part of the method - the reason the weird forms work is that they're capturing something about the shared / overlap / sparsity problem which is hard to model directly.

Of course the overall scale of the weights doesn't matter because any time we use them we divide out the sum of the weights. We also force the self-weight to be exactly 1.0 , weight(qm) = 1.0 and weight(qu) = 1.0
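
As a quick sketch, tying together the cosine and the distances from the earlier snippets (the small denominator constants are the values above ; names are mine) :

def movie_weight(cos_qm_m, dist_qm_m):
    # movies with cosine < 0 were excluded entirely (see above)
    return cos_qm_m / (dist_qm_m + 0.00001)

def user_weight(cos_qu_u, distsqr_qu_u):
    return cos_qu_u / (distsqr_qu_u + 2.0)

# the query movie and query user themselves always get weight exactly 1.0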

3. Using the similar movies & users & weights to make a prediction.

Okay, we have our local neighborhood of ratings, let's call it L[] , and our goal is L[0,0]. Let's build up a series of predictors because they will all be useful to us. From now on I'm using indexes sorted by similarity, with 0 being the query, so m0 is the query movie, m1 is the most similar movie, etc.

I. "One Movie" :

pred = average(m0) + ( L[1,0] - average(m1) )
This is the rating of movie 1 by the same user, compensated for the average bias difference between movie 0 and 1.

II. "One User" :

pred = average(u0) + ( L[0,1] - average(u1) )
Similar to One Movie.

III. "N movie" :

pred = Sum[mi>0] { W(mi) * clamp[ L[mi,0] - average(mi) + average(m0) ] } / Sum[mi>0] { W(mi) }
This is just the weighted sum of a bunch of "One Movie" predictors on each of the similar movies. In the future I'm not going to bother writing the denominator - any time there's a weighted sum you of course divide by the sum of weights to normalize. NOTE : on Netflix it's beneficial to clamp the term in brackets into the valid range [0,5]. Of course we always clamp at the end but it's maybe a bit odd that clamping internally is beneficial also, this may or may not be good in general.
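
As a sketch, using the L matrix layout from above - W_m[i] and avg_m[i] are the weight and damped average of the i-th similar movie, and these names are mine :

def clamp(x, lo=0.0, hi=5.0):
    return max(lo, min(hi, x))

def predict_n_movie(L, W_m, avg_m):
    num, den = 0.0, 0.0
    for i in range(1, len(L)):          # skip row 0, the query movie itself
        if L[i][0] is None:             # padding row with no rating / zero weight
            continue
        num += W_m[i] * clamp(L[i][0] - avg_m[i] + avg_m[0])
        den += W_m[i]
    return clamp(num / den) if den > 0.0 else avg_m[0]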

IV. "N user" Just like N movie. Duh.

V. "N movie slope-1" :

pred = Sum[mi>0] { W(mi) * clamp[ L[mi,0] + slope(m0,mi) ] }

slope(m0,mi) = Sum[u>0] { W(u) * ( L[m0,u] - L[mi,u] ) }

and we add a pretend extra term of (average(m0) - average(mi)) with a weight of 0.0075
This is exactly like N Movie, but instead of just using the difference in average bias between the two movies, we use this "slope" thing. If we just used (average(m0) - average(mi)) for Slope it would be exactly the same as N Movie. The "slope" is just the average difference in rating, but summed over the local similar users and weighted by the user weight. Thus instead of using a global average delta we use the delta where it matters to us weighted by how much it matters to us. This is a general trend that we can do - stick weights on everything and make all the deltas and biases local & weighted.

This should be pretty intuitive. If you picture the L matrix, remember our goal is to predict [0,0]. We want to find a way to map other entries in the 0th column onto the [0,0] spot. So, we want to find the way to get from one row (m) down to the 0th row. The answer of course is just the average delta between the mth row and the 0th row.
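
A sketch of the local slope and the "N movie slope-1" predictor under the same L layout - W_u[j] is the weight of the j-th similar user, the 0.0075 pseudo-weight is the value quoted above, the names are mine, and the normalization by the sum of weights is the implied denominator :

def local_slope(L, W_u, avg_m, i):
    # weighted average of ( query-movie rating - movie-i rating ) over the similar users,
    # plus the small "pretend" term pulling toward the global average delta
    num = 0.0075 * (avg_m[0] - avg_m[i])
    den = 0.0075
    for j in range(1, len(L[0])):       # skip column 0, the query user
        if L[0][j] is not None and L[i][j] is not None:
            num += W_u[j] * (L[0][j] - L[i][j])
            den += W_u[j]
    return num / den

def predict_n_movie_slope1(L, W_m, W_u, avg_m):
    num, den = 0.0, 0.0
    for i in range(1, len(L)):
        if L[i][0] is None:
            continue
        num += W_m[i] * max(0.0, min(5.0, L[i][0] + local_slope(L, W_u, avg_m, i)))
        den += W_m[i]
    return num / den if den > 0.0 else avg_m[0]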

VI. "N user slope-1" : Exactly like N movie but switch user <-> movie.

VII. "Off Axis Slope 1" : So far we've only dealt with "on axis" parts of L[] , that is the row and column that go through [0,0]. But we can use the whole thing. To do so, we need a weight for arbitrary users and movies, which we can easily make :

W(m,u) = sqrt( W(m) * W(u) )
That's the geometric average of the movie and user weight, relative to the query. Note that arithmetic averages or sums make no sense on these weights because they are not on the same scale.

pred = Sum[mi>0, ui>0] { W(mi,ui) * clamp[ L[mi,ui] + slope(m0,mi) + slope(u0,ui) ] }
Pretty obvious. Note that this is only the "off axis" terms, we're not including the stuff that we had in N movie and N user; those could definitely be crammed in here but we prefer to keep them separate.
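
A sketch of the off-axis predictor, reusing the local slope idea along both axes and the geometric-mean weight. Note I give the user-axis slope the same 0.0075 pseudo-weight by symmetry - that part is my assumption, not something stated above :

import math

def predict_off_axis_slope1(L, W_m, W_u, avg_m, avg_u):
    nm, nu = len(L), len(L[0])
    # slope_m[i] : weighted delta from row i up to row 0 (the query movie)
    slope_m = [0.0] * nm
    for i in range(1, nm):
        num, den = 0.0075 * (avg_m[0] - avg_m[i]), 0.0075
        for j in range(1, nu):
            if L[0][j] is not None and L[i][j] is not None:
                num += W_u[j] * (L[0][j] - L[i][j]); den += W_u[j]
        slope_m[i] = num / den
    # slope_u[j] : weighted delta from column j over to column 0 (the query user)
    slope_u = [0.0] * nu
    for j in range(1, nu):
        num, den = 0.0075 * (avg_u[0] - avg_u[j]), 0.0075
        for i in range(1, nm):
            if L[i][0] is not None and L[i][j] is not None:
                num += W_m[i] * (L[i][0] - L[i][j]); den += W_m[i]
        slope_u[j] = num / den
    # off-axis weighted sum, skipping row 0 and column 0
    num, den = 0.0, 0.0
    for i in range(1, nm):
        for j in range(1, nu):
            if L[i][j] is None:
                continue
            w = math.sqrt(W_m[i] * W_u[j])
            num += w * max(0.0, min(5.0, L[i][j] + slope_m[i] + slope_u[j]))
            den += w
    return num / den if den > 0.0 else avg_m[0]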

"N movie slope-1" and "N user slope-1" are both good predictors on their own. I should note that traditional Collaborative Filtering is only "N user". That is, the strict definition is using similar users' ratings of the same item to predict my rating of that item. "N movie" is not strictly CF, but it's so analogous that I consider them the same thing, and of course the off axis naturally follows. On Netflix, "N movie" actually performs slightly better than "N user". If you average the two, that's even better.

To make a prediction for each query, you can just pick one of these predictors, or just pick the query movie average rating, or the query user average rating, or some linear combination of all that. In fact you can optimize the linear combo on the L2 norm by just doing an LSQR fit of all these predictors + the averages + a constant 1.0 term, and that will give you the optimal coefficients for averaging these preds up.
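
A sketch of that linear blend, assuming you've run each predictor over a probe set with known ratings. numpy's least-squares solver stands in for LSQR here, and the feature layout is just illustrative :

import numpy as np

def fit_blend(X, y):
    # X : one row per probe rating, columns = [pred1, pred2, ..., movie avg, user avg, 1.0]
    # y : the true probe ratings
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef                      # optimal linear weights under the L2 norm

def blend(coef, features):
    return float(np.clip(np.dot(features, coef), 0.0, 5.0))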

So these are basic collaborative filters. These are all basic linear models making simple "DPCM" style predictions. All this can be done pretty fast and you can beat CineMatch just doing this junk. Next time we'll talk about where I went beyond this, and what some of the other cutting edge people did.
