
03-11-08 - 5

Netflix Prize notes , Part 3 : Local weighting of predictors

So, last time we built some CF predictors and talked a bit about combining them; let's go into more detail on that. This is going to be even rougher, because it's the stuff I was still actively working on when I stopped.

First, how you would make an optimal global weighting of predictors. This is easy because our metric is an L2 error, so we can just use an LSQR fit. For each query in the test set, we run the CF predictors. We put all the CF predictors, [I] through [VII], into a row vector. We also tack on the movie average, the user average, and a constant 1.0. We take all these rows from all the test queries and stack them together to form a big matrix A, then solve to minimize |Ax - b|^2 , where b are the actual ratings. In reality we want a damped solver as mentioned previously to reduce overtraining. We also want to do this with the N-chunk cross training previously mentioned. The solution "x" gives us the coefficients for each of the predictors in our row. Since this is a totally linear model, if we train N times we can just blend those together to make a single average set of coefficients. This is the optimal global set of coefficients for combining our predictors. Note that this is also a really good way to see which predictors are working well - they will have coefficients that are large, while the ones that are worthless will have coefficients near zero.
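Here's a rough sketch of what that damped fit could look like (numpy; the names and the damping constant are just placeholders, not my actual code) :

# Sketch of the damped (ridge) LSQR fit for global blend coefficients.
# "preds" holds the CF predictors [I]..[VII] per query, plus the movie
# average, the user average, and a constant 1.0 column.
import numpy as np

def fit_global_weights(preds, ratings, damping=1.0):
    # preds: (num_queries, num_predictors) matrix A
    # ratings: (num_queries,) vector b of actual ratings
    # Solve min |Ax - b|^2 + damping*|x|^2 via the normal equations.
    A = np.asarray(preds, dtype=float)
    b = np.asarray(ratings, dtype=float)
    AtA = A.T @ A + damping * np.eye(A.shape[1])
    return np.linalg.solve(AtA, A.T @ b)

# N-chunk cross training: fit on each chunk and average the coefficients,
# which is fine here because the model is purely linear.
def fit_cross_trained(chunks, damping=1.0):
    xs = [fit_global_weights(A, b, damping) for (A, b) in chunks]
    return np.mean(xs, axis=0)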

But it's immediately obvious that a global weighting can be beat. Different queries have different characteristics, and will have different optimal weightings. For example, you might do one query and find that there are neighbor movies that are very, very similar, but no really good similar users for that query. In that case you don't want to use the N-user prediction at all, just the N-movie prediction. Local weighting lets us select the predictor that is better suited to the query, but of course you don't just want to select, you want to do some linear combo.

Now, this is very similar to combining "experts" and there's lots of theory on that. It does not however fit the normal experts model, because we aren't getting feedback as we go along, and we don't really have time continuity of the queries, so we can't track how well experts are performing and adjust their weight using those schemes.

One thing we can do is estimate the error of the various predictors. Think of each predictor as a predictor for data compression. Rather than just predict a value, you need to predict a probability spectrum. Let's make Gaussian probabilities, so we need to predict a center and a width. The center is just the prediction value we already did in Part 2. We need a width which minimizes the entropy. eg. if we're more confident we can have a smaller width, if we're very unsure we must predict a large width. This is the same as estimating an MSE for each predictor.

Once we have an MSE estimate for each predictor, we can combine them in various ways which I've written about in the past here. For example, we could weight each predictor by

e^(- Beta * MSE)

for some constant Beta to be optimized
or by
(1 / MSE)
and of course normalize by dividing through by the sum of the weights.
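A minimal sketch of that weighting, assuming we already have an MSE estimate per predictor (Beta is just a constant to tune on held-out data) :

# Turn per-predictor MSE estimates into normalized blend weights.
import numpy as np

def weights_from_mse(mse, beta=1.0, mode="exp"):
    mse = np.asarray(mse, dtype=float)
    if mode == "exp":
        w = np.exp(-beta * mse)          # e^(- Beta * MSE)
    else:
        w = 1.0 / np.maximum(mse, 1e-9)  # (1 / MSE), guarded against zero
    return w / w.sum()                   # normalize by the sum of the weights

def blend(predictions, mse, beta=1.0, mode="exp"):
    w = weights_from_mse(mse, beta, mode)
    return float(np.dot(w, predictions))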

How do we get an error estimate for each predictor? We train a learner to output MSE given some good conditioning variables. The most obvious thing is the "distance" to the similar movies & users that we computed to find our similar neighbors. Small distance should be a good indicator of confidence. Another pretty obvious one is the sdev of ratings in the L[] matrix along the 0 row and 0 column (after adjusting using slope-1). That is, if you look at the "N movie slope 1" it's just a weighted average of an array. Instead I can take the weighted sdev of that same array, and it tells me how much variation is in it. I don't just use that sdev as my estimate of the MSE of the predictor, I use it as a training variable. So I gather 4 or so interesting variables like this which are good indicators, and now I have a 4 float -> 1 float supervised learner to train. In my research I didn't work out exactly what the best learner is here; you could certainly use a Neural Net or anything like that. I just used a linear fit.
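As a sketch, the linear version of that error estimator could look like this (the exact features and the damping are placeholders) :

# Per-predictor error estimator: a plain damped linear fit from a few
# neighborhood features (distances, weighted sdevs, ...) to the observed
# squared error of that predictor on the training queries.
import numpy as np

def fit_error_model(features, sq_errors, damping=1.0):
    F = np.asarray(features, dtype=float)          # (num_samples, num_features)
    A = np.hstack([F, np.ones((F.shape[0], 1))])   # append a bias column
    AtA = A.T @ A + damping * np.eye(A.shape[1])
    return np.linalg.solve(AtA, A.T @ np.asarray(sq_errors, dtype=float))

def predict_mse(coeffs, feature_row):
    x = np.append(np.asarray(feature_row, dtype=float), 1.0)
    return max(float(x @ coeffs), 0.0)             # clamp: an MSE can't be negative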

BTW I should note that any linear fit can easily be made polynomial by adding cross terms. That is, say I have a 4 float input set to learn on. If I want to try quadratic functions, I just add (4*5)/2 more terms for all the squares and products you can make from the inputs. The actual learner is still linear, but it finds quadratic functions of the inputs. Of course you can do other functions of the inputs besides polynomials, but you can't learn the parameters of those functions.
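Something like this, for the quadratic case (illustrative only) :

# Make a linear learner quadratic by appending product terms: for d inputs
# you add d*(d+1)/2 columns (all squares and pairwise products), then run
# the same linear fit on the expanded rows.
import numpy as np
from itertools import combinations_with_replacement

def add_quadratic_terms(x):
    x = np.asarray(x, dtype=float)
    quads = [x[i] * x[j] for i, j in combinations_with_replacement(range(len(x)), 2)]
    return np.concatenate([x, quads])

# e.g. 4 inputs -> 4 + 10 expanded features; the learner itself stays linear.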

So, to summarize : take measurements of the neighborhood that reflect confidence in the different predictors, feed them to a trained learner that will guess the MSE of a predictor (one learner for each predictor), use the MSE to weight the various predictions.

Aside : when combining all these predictors, you do not really want to get as many good predictors as you can to combine, because they will be very similar and not complementary. What you really want are predictors that offset each other well, or predictors that are very good for certain types of neighborhood which can be reliably identified. This is something I learned from the Volf switching compression work - if you just try to weight together the best compressors in the world, it doesn't help because they're all PPM variants and they don't complement. Instead if you weight in some really shitty compressors that are very different, like LZ77 or LZP or Grammars, you get a big benefit because they complement the main PPM coder well.

Aside #2 : at this point I'll mention there are some weird things with pretending we're predicting a Gaussian, because the values are only discrete 1-5. Instead of just averaging together the centers of the preds, we could treat each one actually as a probability spectrum, and add the spectrums. This is messy, I'm not going to get into it here.

Now let's look at a totally different way to do the same thing. Again we're trying to find local weights for the different predictors. This time we're going to think of it as a local regression problem. Instead of finding global coefficients with an lsqr to weight our predictors, we want to find a bunch of regions, and find coefficients in each region.

There are a few ways to do this kind of local regression. The first is to use a classification learner. First let's simplify and just worry about the "N movie slope-1" and "N user slope-1" predictors, since they are the most valuable and it gives us just two categories, let's call these "M" and "U" here. We look at all our training samples and see if the actual value is closer to the M or the U value. We label each value with an M or a U depending on which is closer. Now our goal is to train a learner which can look at the local neighborhood and predict a category, either M or U.
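The labeling step itself is trivial - something like this (names are made up) :

# Label each training query by which predictor lands closer to the truth:
# 'M' if the N-movie slope-1 prediction is nearer the actual rating,
# 'U' if the N-user slope-1 prediction is. These labels train the classifier.
import numpy as np

def label_m_or_u(movie_preds, user_preds, true_ratings):
    movie_err = np.abs(np.asarray(movie_preds) - np.asarray(true_ratings))
    user_err  = np.abs(np.asarray(user_preds)  - np.asarray(true_ratings))
    return np.where(movie_err <= user_err, "M", "U")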

To train this learner we want to use the same kind of things as we used to estimate the local MSE - stuff like the distance to the similar user and the similar movie, the sdevs, etc. Also note that this learner must be nonlinear - if it's just a linear learner like a simple NNet or LSQR then we may as well just do a global linear fit of coefficients. We have a bunch of conditioning values for the learner; these are the axes of some high dimensional space. In this space are scattered the training samples labelled with M's and U's. We want to find the pockets where there are lots of M's or lots of U's and put down splitting curves. The ideal solution for this is an SVM (Support Vector Machine) with a radial basis kernel. SVM's are pretty ugly to train, so before you try to train one you need to get all your variables nice: get rid of any that are redundant, remove bias & scale. There are also ugly things about this; one is that to use an RBF machine you need to define a distance in this space, but your values are not in the same units and it's not obvious how to combine them.

I'm not gonna really talk about SVM's, you can follow the links below if you like, but I will wave my hands for a second. To do category prediction the simplest thing you can do is find a single plane that puts one category on the front and one category on the back. That's the basic linear categorizer, which can be found with neural nets among other ways. The basic trick of SVM's is to note that the plane test only relies on a dot product, and mathematically any time you have a dot product you can replace it with any other valid "kernel". This is the "kernel trick" and it's very common and useful. So for example a Gaussian RBF is a kernel, so it can be used in the exact same kind of "plane" fit; instead of the dot product you use e^(-k*(a-b)^2).
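For illustration, here's roughly what an RBF-kernel categorizer on the M/U labels could look like using scikit-learn (an assumption on my part - any SVM package with an RBF kernel and standardized inputs would do the same job) :

# RBF-kernel SVM on the M/U labels. Standardize the features first, since
# the RBF distance mixes quantities that aren't in the same units.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_mu_classifier(features, labels):
    # features: neighborhood measurements; labels: 'M'/'U' from the step above
    scaler = StandardScaler().fit(features)
    svm = SVC(kernel="rbf", C=1.0, gamma="scale")
    svm.fit(scaler.transform(features), labels)
    return scaler, svm

def mu_margin(scaler, svm, feature_row):
    # signed distance to the separating surface; the sign picks the class,
    # the magnitude can be turned into a soft weight
    return float(svm.decision_function(scaler.transform([feature_row]))[0])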

Once you have the SVM that makes guesses about where the good "M" and "U" regions are, you still want to actually weight each predictor rather than just selecting. There are a few good ways to do this. One is that the SVM can tell you the distance from a query point to the nearest threshold surface, and you can turn this distance into weights some way or other. Another way is instead of training one SVM, you make a bunch of random subsets of your training data and train N SVM's. Now when you query a value you query all N of them, and your weights are the # that voted M and the # that voted U. This method of making a big randomized ensemble and using it to vote is very powerful. It's also great in practice because SVM's are evil on huge data sets, so doing N separate trains on chunks of (1/N) samples is much better.
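A sketch of that random-subset ensemble, reusing the train_mu_classifier sketch above (subset size and model count are arbitrary) :

# Train several small SVMs on random subsets and use the vote fractions
# as soft M/U weights.
import numpy as np

def train_svm_ensemble(features, labels, n_models=8, subset_frac=0.125, seed=0):
    rng = np.random.default_rng(seed)
    features = np.asarray(features); labels = np.asarray(labels)
    models = []
    for _ in range(n_models):
        size = min(len(features), max(2, int(len(features) * subset_frac)))
        idx = rng.choice(len(features), size=size, replace=False)
        models.append(train_mu_classifier(features[idx], labels[idx]))
    return models

def mu_weights(models, feature_row):
    votes = [svm.predict(scaler.transform([feature_row]))[0] for scaler, svm in models]
    w_m = votes.count("M") / len(votes)
    return w_m, 1.0 - w_m    # weights for the movie and user predictors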

Now the final way I'll talk about is using a Decision Tree to select areas and do local regression. Now, with the whole M/U labeling thing we could've totally used a decision tree there as well, so you could apply the same ideas. All of these techniques are just things in your bag of tricks that you can use anywhere appropriate.

A Decision Tree is basically just a binary tree on various attributes of the neighborhood. If we again think of this high dimensional space where the axes are the useful properties of our neighborhood, the decision tree is just a BSP tree in that space. We want to build a BSP tree that takes us down to leaves where the neighborhoods within a leaf have similar attributes. Trying to do this greedy top-down does not work very well because you have to search tons of directions in high-D space and it's hard to find axes that provide good separation. Instead what we're going to do is just randomly build a deep tree and then prune leaves.

To make our DT, we just start splitting. To make each split, we choose a direction in parameter space at random. We then find the centroid of the values in that direction and put a plane there, then go to each child and repeat. I made 8-deep trees (256 leaves). Now we want to collapse leaves that aren't too useful. The reason we need to do this is we are worried about overtraining. We want our tree as small as possible.
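A sketch of that random construction (the node layout is just one way to write it; rows needs to be a numpy index array) :

# Randomized BSP-style tree: at each node pick a random direction in
# parameter space, split at the centroid of the samples projected onto it,
# and recurse to a fixed depth.
import numpy as np

def build_random_tree(points, rows, depth, rng):
    # points: (n, d) neighborhood features; rows: indices of the samples here
    if depth == 0 or len(rows) < 2:
        return {"leaf": True, "rows": rows}
    direction = rng.normal(size=points.shape[1])
    direction /= np.linalg.norm(direction)
    proj = points[rows] @ direction
    split = proj.mean()                      # centroid along that direction
    left, right = rows[proj <= split], rows[proj > split]
    if len(left) == 0 or len(right) == 0:
        return {"leaf": True, "rows": rows}
    return {"leaf": False, "dir": direction, "split": split,
            "lo": build_random_tree(points, left, depth - 1, rng),
            "hi": build_random_tree(points, right, depth - 1, rng)}

# e.g. an 8-deep tree:
# tree = build_random_tree(P, np.arange(len(P)), depth=8, rng=np.random.default_rng(0))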

What we do is, within each leaf, we do an LSQR to linear fit the predictors and find the best coefficients in that leaf. We store the error for these. Then for each node that's just above the leaves, we look at the error if we pruned it - we put the values together, do an LSQR on the union, and measure the error there. It's very important to account for the parameters of the model when you do this, because the extra parameters of the leaves always let you fit the data better regardless of whether it's a good leaf or not :

C = # of LSQR coefficients to weight predictors
P = # of parameters used in the DT plane
L0,L1 = leaf 0 and leaf 1
N0,N1 = # of items in leaf 0 and leaf 1

Q(leaves) = (sum of errors in L0) / (N0 - C - P) + (sum of errors in L1) / (N1 - C - P)

Q(pruned) = (sum of errors in L0+L1) / (N0 + N1 - C)

prune if Q(pruned) < Q(leaves)
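In code the prune test might look like this, reusing the damped fit sketched earlier (guarding against tiny leaves where the denominators go nonpositive is left out) :

# Fit a damped LSQR in each child leaf and in the merged node, then compare
# the per-degree-of-freedom errors exactly as in the formulas above.
import numpy as np

def sse(A, b, coeffs):
    r = A @ coeffs - b
    return float(r @ r)

def should_prune(A0, b0, A1, b1, num_plane_params, damping=1.0):
    C = A0.shape[1]                      # of LSQR coefficients
    P = num_plane_params                 # of parameters in the DT plane
    x0 = fit_global_weights(A0, b0, damping)
    x1 = fit_global_weights(A1, b1, damping)
    q_leaves = sse(A0, b0, x0) / (len(b0) - C - P) + sse(A1, b1, x1) / (len(b1) - C - P)
    A, b = np.vstack([A0, A1]), np.concatenate([b0, b1])
    xu = fit_global_weights(A, b, damping)
    q_pruned = sse(A, b, xu) / (len(b) - C)
    return q_pruned < q_leaves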

These trees can kind of suck because they are randomly made, but as usual we can throw a big hammer at them. Just randomly make a ton of them. Then we can test how well they work by trying them on some training data. The ones that suck we can just throw out. The rest we can average.

BTW an alternative to the DT thing is a kind of k-Means. You pick seed samples and map each sample to the closest seed; this defines clusters. Then you do the LSQR on each cluster. Then to query you interpolate between the fits at each seed. There are various ways to interpolate. Some good ways are just to weight each seed using the distance to that seed, using either 1/D or e^(-k*D). Again, instead of trying hard to find really good seeds you're probably better off just making a big ensemble by randomly picking seeds and then throwing out the bad ones.
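A sketch of that (seed count and the e^(-k*D) weighting are arbitrary choices; it reuses the damped fit from earlier) :

# Assign samples to random seeds, fit per-cluster blend coefficients, then
# blend the cluster fits at query time by distance to each seed.
import numpy as np

def fit_seed_clusters(points, preds, ratings, n_seeds=16, damping=1.0, seed=0):
    rng = np.random.default_rng(seed)
    points, preds, ratings = map(np.asarray, (points, preds, ratings))
    seeds = points[rng.choice(len(points), size=n_seeds, replace=False)]
    assign = np.argmin(((points[:, None, :] - seeds[None, :, :]) ** 2).sum(-1), axis=1)
    coeffs = [fit_global_weights(preds[assign == s], ratings[assign == s], damping)
              for s in range(n_seeds)]
    return seeds, np.array(coeffs)

def predict_at(point, pred_row, seeds, coeffs, k=1.0):
    d = np.sqrt(((seeds - np.asarray(point)) ** 2).sum(-1))
    w = np.exp(-k * d)            # or 1/D
    w /= w.sum()
    return float((w[:, None] * coeffs).sum(0) @ np.asarray(pred_row))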

The final usage looks like this : build a local neighborhood L[] and compute the basic CF predictors as well as the distances and other good parameters to select predictors with. This defines a point in parameter space. Use the point to walk down the DT (or ensemble of DT's) that we built to find a region. That region tells us coefficients to use to weight all our predictors, which gives us our output prediction value.
