10-13-06

My Netflix contestant is close to Cinematch now, and my program is still mighty dumb. The predictor is quite sensitive to little things, and you do have to be careful not to cheat on the probe data: the probe queries are already in the conditioning set, so they can help you in subtle ways. For example, if you let them be included when computing the average rating on each movie, it actually helps a lot. I was doing that and got 0.91 RMSE and thought I was the winner, but I knew it was too good to be true.
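A minimal sketch of keeping the probe out of the per-movie averages. The tuple layout and the names `movie_averages` and `probe_pairs` are made up for illustration, not the contest's actual file format or my real code:

```python
from collections import defaultdict

def movie_averages(ratings, probe_pairs):
    """Average rating per movie, skipping any (movie, user) pair in the probe.

    ratings:     iterable of (movie_id, user_id, rating) tuples (assumed layout)
    probe_pairs: set of (movie_id, user_id) pairs held out for testing
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for movie_id, user_id, rating in ratings:
        if (movie_id, user_id) in probe_pairs:
            continue  # held out: must not leak into the statistics
        total[movie_id] += rating
        count[movie_id] += 1
    return {m: total[m] / count[m] for m in total}

# Toy data: user 12's rating of movie 1 is a probe query.
ratings = [(1, 10, 4), (1, 11, 2), (1, 12, 5), (2, 10, 3)]
probe = {(1, 12)}

clean = movie_averages(ratings, probe)   # probe rating excluded
leaky = movie_averages(ratings, set())   # probe rating included (the cheat)
```

On the toy data the clean average for movie 1 comes from two ratings and the leaky one from three, so the leaky average already "knows" part of the answer it's being tested on.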

Having the probe data contained in the training set is actually really annoying, because it means I can't fairly precompute very much based on the training set. For example, if you precompute the "most similar" other movie for each movie, you will be using the probe data as part of that, and that helps a *TON*, which totally throws off testing on the probe.

There's one case maybe y'all can give me ideas on. It's surprisingly common to have a movie for which I can't find any good "similar" movies. These "red herring" movies wind up contributing a lot to the RMSE score because they tend to be predicted badly. Right now I'm just using each one's average rating as its prediction, but there must be something better.
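A rough sketch of that fallback, just to pin down what "use the average" means here. The similarity-weighted average, the `min_similarity` cutoff, and all the names are assumptions for illustration; the real predictor isn't shown in this post:

```python
def predict(movie_id, user_ratings, similar, movie_avg, min_similarity=0.5):
    """Predict one user's rating for movie_id.

    user_ratings: dict of movie_id -> rating this user already gave
    similar:      dict of movie_id -> list of (other_movie_id, similarity)
    movie_avg:    dict of movie_id -> average rating (probe excluded)
    min_similarity is a hypothetical cutoff for a "good" similar movie.
    """
    num = 0.0
    den = 0.0
    for other_id, sim in similar.get(movie_id, []):
        # Only use neighbors that are similar enough AND rated by this user.
        if sim >= min_similarity and other_id in user_ratings:
            num += sim * user_ratings[other_id]
            den += sim
    if den == 0.0:
        # "Red herring" case: no usable similar movie, fall back to the average.
        return movie_avg[movie_id]
    return num / den

similar = {1: [(2, 0.9), (3, 0.1)]}   # movie 5 has no neighbors at all
user_ratings = {2: 4, 3: 1}
movie_avg = {1: 3.5, 5: 2.0}

with_neighbor = predict(1, user_ratings, similar, movie_avg)
red_herring = predict(5, user_ratings, similar, movie_avg)
```

Movie 1 gets its prediction from the one neighbor above the cutoff; movie 5 has nothing to go on and just gets its own average back.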

