10-13-06 - 2

In algorithms like this Netflix dealy, something I found back in the compression days is that there's two steps really to algorithm design - first you create a good general structure based on theory of what you think should happen. Then you find lots of little tweaky cases and start doing lots of heuristic "clever" things, and often those sort of finishing touches wind up being huge improvements. This can perplex the algorithm developer in two ways. First of all, you can sometimes have a pretty poor basic algorithm, but you tweak it and do all sort of cleverness to it, and it actually winds up performing really well. The problem is if you try other algorithms that might be much better, they will appear worse because you haven't put all the time of tweaking into them. You can dig yourself into a local minimum where you have all this fancy heuristic junk which really helps your poor algorithm, but doesn't help the better algorithm. The other funny thing that can happen is that some algorithms that are "better" are really hard to do these clever tweaks with. In data compression you have algorithms like CTW which are really beautiful theoretical structures and certainly base CTW beats base PPM, but PPM is intuitive and easy to play with and you can build all sorts of other ideas into it and you can wind up beating CTW because you can isolate the inefficiencies and attack them with various schemes.

One thing I'm surprised at. 25% of movies fall into a category of "super predictable" cases. In these cases, I gave the same rating to the two most similar movies, and the most similar user gave the same rating to the query movie. The RMSE of these queries is still 0.85 ; I would have thought these cases would be more predictable.

Part of the problem with making a really amazing predictor is that the ratings are so discretized. If somebody felt like a "3.5" about a movie, they will sort of randomly shove it to a "3" or a "4" depending on how they feel, but the difference is a whole 1.0 which is big.

No comments:

old rants