11-05-06 - 1

I'm gonna try to write a cripple's summary of ML. In all cases I care about supervised learning. That is, given a big sample of parameters x and ouputs y you have the training set {y_i,X_i} where X is a vector. The goal is to find the function f() which minimizes the error |y_i - f(x_i)| over all samples. This is a regression problem but you can easily convert it into a "classification" problem or use a different error metric or whatever.

Searching for arbitrary f() is NP hard. In fact, you can see that finding an optimal f() is equivalent to the Kolmogorov compressor. If we use an "MDL" error we think of f() as a coder; to reproduce the output y_i we have to transmit the specification of f() and also the residuals (y_i - f(x_i)).

For all the approaches we will be approximating f() as a combination of simple functions. You can of course get arbitrarily close to any f() by using a linear sum of enough functions. One thing I've come to realize is that all the ML techniques are very very similar. Almost all of them have a form like this :

f(X) = Sum[j to M] w_j * h( A_j * (X - L_j) )

Where h() is some function, "w" are weights, "A" are parameters, and "L" is a location in the same space as X. There are M terms. Possibly there are other transforms or linear combos or whatever, those issues are sort of minor (but messy).

Anyway, all the schemes come down to how you pick these terms and optimize them. Almost always you pick your basis functions h() in advance. This could be your Kernel, your fit function, whatever. Common choices are the sigmoid or radial gaussian. Obviously you could just put a delta function at each sample, but that would give you a large model that would not generalize well. So you sort of want that smallest/smoothest model that fits the training set. This is a very hacky problem with ML and you really just have to tweak and test by cross-training.

Neural Nets are a silly way of talking about basic linear algebra. You choose M and h() in advance and then optimize the parameters by greedy gradient descent. Obviously you try to choose the smallest M possible to get a decent fit on the data, but you can only do this by trial & error with cross-training. Also the gradient descent can get stuck in local minima and you get no gaurantee of maximum margin or other ideas that the model will generalize well. The nice thing about neural nets is they're pretty simple and straightforward. Unfortunately NN's have been conflated with all this brain modeling and circuit diagram nonsense which makes it much harder to just see the linear algebra.

SVM's are actually extremely similar, but instead of greedy optimization, they set up a nice constrained optimmization which solves for maximum margin. The L's are not free parameters but are instead points chosen from the training set (called the support vectors). Instead of controlling M directly, you set a parameter which controls the model size vs. closeness of fit tradeoff, and the SVM solves for the best M. The h() can only be functions that work as "kernels" but that's not too bad, it's a large set.

There are some nasty problems with SVM's. The parameter that controls model size has to be tweaked by cross-training which is hacky and time consuming. For clean seperable data svms do an awesome job of picking a minimal model (eg. for linearly seperable data with a plain dot kernel they pick a single perfect plane), however for noisy or ill fit data, M can get arbitrary large (up to N the training set site), and the time to train and test scale like M*N which is very bad if M ~ N. Finally h() must be chosen in advance and many factors you might want to tweak are actually implicity hidden inside the choice of h() such as the relative scaling of parameters. Most simple Neural Nets are equivalent to SVM's, and the SVM will find better solutions if you can make it behave the way you want.

Another method is local regression. Here the L are chosen to be some regions, and then some sort of standard regression is done on that region, such as linear regression, or splines, or whatever. The regression is easy so the whole problem is the choice of regions. The standard ways are K-nearest-neighbors, K-means, and Decision Trees. These are all pretty solid yet all very hacky in their own way. The big problem is that just like kernels they implicitly rely on a distance measure between points. It's crucial for the similarity of samples to be proportional to distance. With lots of different complicated parameters this is ill defined; you can do thing like SVD or PCA on the paramets and then take distances in that space, but that actually may or may not be good depending on the data. In fact if you knew the ideal distance metric you would have the solution to the whole problem. DT's cut up space to define the regions. With Decision Trees most people just build axis-aligned trees and try all possible splits. This is slow and also ignores the fact that you could get much better splits by trying all possible off-axis splits. Ideally you would use an SVM to pick the optimal split at each level.

K nearest neighbors is pretty simple and solid, of all the methods it seems the least "tweaky" and requires the least a priori knowledge of the data, but in practice it is very sensitive to choice of K, choice of how you define "nearest", and you should probably use a distance-weighting and it will be sensitive to that. As usual these things can only be tested brute force by trial on cross-training.

The choice of distance metric or pre-transformation of the parameters is important, as is possible elimination of useless parameters (usually after SVD). Unfortunately this is basically a hacky trial & error thing and can have huge affects on the model's performance. Also with Neural Nets the exact pre-ops and post-ops you perform can make the problem either much easier or much harder to model which strongly affects performance. Basically if you know a lot about your data (like it basically responds linearly to parameters, or it's generated by independent sources on the parameters, etc.) you can fit it well, but if you don't you will do badly and there aren't good ways to "learn to learn".

Fully Bayesian approaches are much more sound, but are computationally infeasible for anything but trivial models because they basically sum over all possible models & parameters weighted by the likelihood. Basically rather than guessing the params or optimizing them, you give them probabilities and sum on all possibilities. Check out the NIPS tutorial here Basically you correctly consider the model+parameters to be unknown and use the probability that they are the true model.

In general the more modern techniques like SVM don't perform significantly better on complex data than just something like a good local regression. I should add that the truly modern techniques seem to be blend models which are provably good as the number of blends gets large, things like AdaBoost with ensembles of decision trees, etc. It appears to me that these are computationally infeasable except on trivial problems.

Way too much of the literature compares their method to some garbage old method, like plain old k-NN with no distance weighting or outlier discarding or local distance metric. The other big sin is to do comparisons in some really complex domain like image matching where a huge amount of performance comes from how you preprocess the data and take advantage of the specific nature of the problem.

No comments:

old rants