10-06-06 - 3

I got my first non-trivial predictor working on the Netflix data. It's a simple single-support predictor : it uses the one most similar movie that the user has previously rated.
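The post doesn't spell out the mechanics, but a single-support predictor along these lines might look like this. The `similarity` function and the mean-offset adjustment are my assumptions, not necessarily what I actually ran :

```python
def predict_one_movie(user_ratings, target, similarity, movie_mean):
    """Guess a user's rating for `target` from the single most similar
    movie they have already rated.

    user_ratings : dict of movie -> this user's rating
    similarity   : hypothetical helper, similarity(a, b) -> score
    movie_mean   : dict of movie -> global average rating
    """
    # Pick the single previously-rated movie most similar to the target.
    best = max(user_ratings, key=lambda m: similarity(m, target))
    # Carry the user's rating over, shifted by the difference in movie
    # means so systematically high/low-rated movies don't bias the guess.
    return user_ratings[best] + (movie_mean[target] - movie_mean[best])


if __name__ == "__main__":
    # Toy data, purely illustrative.
    sim_table = {("A", "T"): 0.9, ("B", "T"): 0.3}
    sim = lambda a, b: sim_table[(a, b)]
    means = {"A": 3.5, "B": 3.0, "T": 3.8}
    ratings = {"A": 4.0, "B": 2.0}
    print(predict_one_movie(ratings, "T", sim, means))  # 4.3
```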

RMSE for guessing each movie gets its average rating : 1.051938
RMSE for Netflix's Cinematch : 0.9474
RMSE for simple one-movie predictor : 1.037155
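For reference, RMSE is the Netflix Prize scoring metric, just the root mean squared error over all (prediction, actual) pairs :

```python
import math

def rmse(predicted, actual):
    # Root mean squared error over paired prediction/actual lists.
    assert len(predicted) == len(actual)
    total = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(total / len(actual))


if __name__ == "__main__":
    print(rmse([3.0, 4.0], [3.0, 4.0]))  # 0.0 : perfect predictions
    print(rmse([2.0, 4.0], [4.0, 2.0]))  # 2.0 : off by 2 everywhere
```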

So, that's not very encouraging. Obviously I knew it wasn't going to be close to Cinematch, but the improvement over just using each movie's average is microscopic, and Cinematch is light years ahead.

I'm starting to think that beating Cinematch by 10% is all but impossible. It's a lot like data compression where the closer you get to the true entropy the harder it is to make advances. The Netflix Prize FAQ justifies their 10% criterion by saying the difference between the "average" predictor and Cinematch is about 10%, so they want 10% more. That fails to address two major factors : "average" is actually not an awful predictor at all, and the next 10% is way way harder than the first 10%.
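Running the arithmetic on the numbers above makes the gap concrete (the 0.8527 target assumes the prize's 10% is measured as relative RMSE improvement over Cinematch, which is my reading of the rules) :

```python
avg_rmse = 1.051938   # per-movie average predictor
cinematch = 0.9474    # Netflix's Cinematch
mine = 1.037155       # my one-movie predictor

# Cinematch improves on the average predictor by about 9.9%...
print((avg_rmse - cinematch) / avg_rmse)

# ...while my predictor improves on it by only about 1.4%.
print((avg_rmse - mine) / avg_rmse)

# The prize wants 10% better than Cinematch, i.e. RMSE around 0.8527.
print(cinematch * 0.90)
```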
