3/11/2008

03-11-08 - 3

Netflix Prize notes , Part 1 : Intro

DISCLAIMER : this is mainly a brain dump for my records. If you really want to know how the good people are now doing it you should just go read the papers on the Netflix Prize forums, in particular the stuff from BellKor and all the SVD stuff that everyone is doing.

I worked on the Netflix prize from fall 2006 to spring 2007. It's been about a year since I touched it so this will be a little rough, but I realized I never wrote up what I was doing, so I figure I should try to get some notes down before I lose them forever. There are a lot of little quirks that are specific to the Netflix data set (stuff related to having just 5 discrete ratings; trends related to when the rating was done in absolute time, to when the rating was done relative to when the movie was released, and to when the movie was released in absolute time; and others). I'm going to ignore all those factors and just talk about general collaborative filtering topics. Obviously to make a top-tier competitor you have to compensate for all those quirks on the front & back end. Also I haven't really followed developments much since then, so this is well out of date.

The basic Netflix problem goes like this. You have a bunch of users and a bunch of movies. Users have provided ratings for some portion of the movies. Some users have rated only a few; a very few users have rated tons. Generally we think of it as one giant matrix: rows are movies and columns are users, the entries with past data are filled in with a rating, and there are tons of blanks. Your goal is to predict the value of a blank. (Actually in the case of the Netflix Prize it's to minimize the L2 error on the blanks, which is not the same as predicting the most likely value, since you'll predict values in between the integers to reflect uncertainty.)
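A tiny sketch just to make the scoring concrete: the contest metric is RMSE over held-out ratings (the function names here are mine, not anything from the contest kit) :

    import math

    def rmse(predictions, actuals):
        # root mean squared error : the L2 measure the contest scores on.
        # predictions can be fractional (e.g. 3.7) even though true ratings
        # are integers 1..5 ; hedging between the integers is exactly what
        # minimizes L2 when you're uncertain.
        assert len(predictions) == len(actuals)
        total = sum((p - a) ** 2 for p, a in zip(predictions, actuals))
        return math.sqrt(total / len(predictions))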

In the particular case of Netflix you have 480k users, 17k movies, and 100 million previous ratings. That sounds like a lot of ratings, but only about 1.2% of this big matrix is filled in, so it's very sparse. Probably the biggest issue is dealing with the sparsity. The huge dimensions and the sparsity are why you can't just use a standard supervised learning approach.
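The sparsity also tells you how to store the thing; the full dense 480k x 17k matrix is over 8 billion cells, but the ~100M actual ratings fit fine in a sparse layout. Something like this (a rough sketch of my own layout, not the actual contest data format) :

    from collections import defaultdict

    # store the ratings indexed both ways : by movie and by user.
    # ~100M (index, rating) pairs is manageable ; the dense matrix is not.
    ratings_by_movie = defaultdict(dict)   # movie_id -> {user_id: rating}
    ratings_by_user  = defaultdict(dict)   # user_id  -> {movie_id: rating}

    def add_rating(user_id, movie_id, rating):
        ratings_by_movie[movie_id][user_id] = rating
        ratings_by_user[user_id][movie_id] = rating

    def get_rating(user_id, movie_id):
        # returns None for the blanks, which are ~98.8% of the matrix
        return ratings_by_user.get(user_id, {}).get(movie_id)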

Basic "collaborative filtering" is a kind of machine learning or prediction algorithm; the basic operation is like this : a query is a {movie,user} index in the matrix that's blank. You find other similar movies to the query movie, and other similar users to the query user. This gives you a little local matrix. Say you found 20 similar movies and 50 similar users, now you have a 20x50 matrix, which is still sparse, but less so, and small enough to manage. You then use this little matrix to predict the blank. The most obvious way is to find the single most similar non-blank entry and take that rating. Obviously that sucks but it's a start.

I'm gonna go into some quick general notes on training. Of course you need to be careful not to test on the data you trained on. Netflix had this concept of a "probe set", but that's actually a bad way to go because you're just excluding the probe from the training, which gives up information. A good general approach is to take your big data set and divide it into N chunks. Then you do N separate train & test operations, where you train on (N-1) of the chunks and test on the chunk left out. You average the N errors to get your whole-set error. Since you trained N times you have N models; to create your final predictive model you could train on all N chunks, but usually it's better to take the N models, each trained on (N-1) chunks, and average their results. This is better because it smooths out overtraining quirks. Another general training thing: you need to account for the number of free terms in your model. If you have T training samples and a model with M free parameters, a better error estimate is (error total)/(T - M). Basically each term you add to your model should be able to at least cancel out the error from one sample by fitting to it, so you need to pretend your data is one step smaller. This makes it a fairer way to compare models of different sizes. Models that are too big will tend to overfit and not be good on unknown data. Note that in the simple case of linear models, averaging the results is the same as averaging the models.
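A rough sketch of the N-chunk cross-training, including the (T - M) adjustment (the function names and signatures are my own invention, just to show the shape of it) :

    import random

    def n_fold_error(samples, train_fn, test_fn, n=10, num_params=0):
        # samples  : the full training set
        # train_fn : builds a model from a list of samples
        # test_fn  : returns the total squared error of a model on samples
        samples = list(samples)
        random.shuffle(samples)
        chunks = [samples[i::n] for i in range(n)]
        models, total_err, total_count = [], 0.0, 0
        for i in range(n):
            held_out = chunks[i]
            training = [s for j, c in enumerate(chunks) if j != i for s in c]
            model = train_fn(training)
            models.append(model)
            total_err += test_fn(model, held_out)
            total_count += len(held_out)
        # account for model size : each free parameter can "eat" one sample
        adjusted_err = total_err / max(1, total_count - num_params)
        # for final predictions, average the outputs of these N models
        return adjusted_err, models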

I should also note that this is one particular problem where regression & classification both make sense. The Netflix Prize error measure is an L2 norm which should immediately suggest regression with least squares - that way you're optimizing your model on the exact same error measure that you will be judged on, which is good. But classification also has some appeal because of how discrete the ratings are. There are only 5 ratings in Netflix; to use classification you would not just predict the most likely rating, but rather predict a probability for each class, and your final output would be P(1)*1 + P(2)*2 + P(3)*3 + P(4)*4 + P(5)*5. This lets you use whatever kind of discrete classification method you want, as long as you do it in a kind of Bayesian way and generate probabilities of being in each class. For example you can use SVM to classify and use the distance from the SVM boundary curve to generate a probability of being on each side. Perhaps a better way to make probabilities is to use a big ensemble of classifiers built from random subsets of the data and count how many members of the ensemble vote for each class to generate a probability. Most classifiers are binary, but there are many different ways to make binary classifiers work on N classes. These classes happen to have a sensible linear relationship, so just doing a binary tree is perfectly reasonable. For classes that are not necessarily linearly orderable, the binary tree method is a bad way because it imposes a false linear order and you have to do something like an error correcting binary code.
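The classification-to-prediction step is just an expected value over the class probabilities; a trivial sketch :

    def rating_from_probs(probs):
        # probs : rating class (1..5) -> probability, e.g. the fraction of an
        # ensemble voting for that class. output the expected value, which is
        # what minimizes L2, rather than the single most likely class.
        total = sum(probs.values())
        return sum(r * p for r, p in probs.items()) / total

    # e.g. rating_from_probs({1: 0.0, 2: 0.1, 3: 0.2, 4: 0.5, 5: 0.2}) -> 3.8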

Making ensembles of simple predictors and averaging is a really amazing thing in general; I'm not gonna really talk about it, but you can search for "bagging" and "boosting", in particular AdaBoost & the many more modern followups. Also obviously classification and regression are closely related. You can make a binary classifier by doing regression and just saying if the output is > 0 that's class 1, if it's <= 0 that's class -1. If you use a linear regression that's the same as a plane fit classifier ("perceptron"); the difference is the error metric: regression uses an L2 or other norm, while most classifiers are designed to minimize the # of classification errors. Obviously you need to use the training method that matches the error metric you care about. There's also logit or logistic regression but I'm getting off track.
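The regression-to-classifier mapping is literally just a threshold on the regression output; a toy sketch (the linear model here is assumed, not any particular trained perceptron) :

    def classify_with_regression(weights, bias, features):
        # a linear regression output thresholded at zero is a plane-fit
        # binary classifier ; the training criteria differ (L2 residual vs.
        # count of misclassifications) but the decision rule is the same
        score = bias + sum(w * x for w, x in zip(weights, features))
        return 1 if score > 0 else -1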

Let me also digress a second to note some things that are obviously wrong with all the standard literature on collaborative filtering. First of all, everyone makes the "missing at random" assumption. This is the assumption that which movies a user has chosen to rate is totally random, that is, which elements are blank or not is random. Another way of saying this assumption is that the ratings a user has previously provided have the same statistics as the theoretical ratings they would've provided on all the other movies. This is just obviously very wrong, since people do not choose to see movies randomly; there's a lot of information not only in the ratings they have provided, but also in which movies they have provided ratings for. For example, 22% of the users in Netflix have only given ratings of 4 & 5. That means they only rated the things they really like. Obviously if you do something dumb like normalize to their average (4.5 or so) and say all the 4's were things they "didn't like" you would be way off. Though it's also important to note that the Netflix test query is not actually a random movie choice either; it's also something the user chose.

Also another quick note on general learning stuff. I talked briefly about cross-training, the N chunk thing, and also about overfitting. Almost all the learning algorithms have some kind of damping or model size parameter. Even with linear regression you want to use a regularized/damped regression solver. With neural nets there are lots of things that affect overfitting, like how you feed your data through, how fast you train, how many passes you make on the data, etc. With SVM your control is the # of support vectors (usually set indirectly through some kind of error margin parameter). The less you damp, the more you overfit, and the better your results will be on that particular training set, but it will make the results worse when you test on new data, which is what you actually care about. The goal is to optimize on your training set, but not in a way that overfits and thus hurts you on new data. Cross-training is one way to do this, another is the use of an ensemble, as are the damping parameters in the algorithm. Unfortunately there's no theoretical way to optimize the damping parameter; you have to just use trial & error to search for the best one, which you can evaluate through cross-training.
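The damping-parameter search is just brute force over candidates scored by cross-training; a sketch (the candidate values and function names are my own placeholders, and eval_fn would be something like the N-chunk cross-training above) :

    def pick_damping(samples, train_fn_for, eval_fn,
                     candidates=(0.01, 0.1, 1.0, 10.0)):
        # train_fn_for(lam) : returns a training function with damping lam
        # eval_fn : cross-trains and returns the held-out error
        best_lam, best_err = None, float('inf')
        for lam in candidates:
            err = eval_fn(samples, train_fn_for(lam))
            if err < best_err:
                best_lam, best_err = lam, err
        return best_lam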

