2/01/2008

02-01-08 - 3

The fully Bayesian approach to modeling says you should consider all possible models, and weight each model by the probability that it is the right one.

You've seen past data D. You know for a given model M the chance that it would produce data D is P(D|M), so by Bayes' rule the probability of M being right is P(M|D) = P(D|M)*P(M)/P(D)

Now you want to make some new prediction given your full model. Your full model is actually an ensemble of simple models, each weighted by its posterior probability. The probability of some new event X based on past observations D is thus P(X|D) = Sum on M { P(X|M) * P(M|D) }

P(X|D) = Sum on M { P(X|M) * P(D|M) * P(M) } / P(D)
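
To make that concrete, here's a tiny sketch in Python (everything here - the coin-flip data, the grid of models - is made up for illustration, not from anywhere in particular). The "models" are just candidate coin biases on a grid, D is a string of observed flips, and the prediction is the probability that the next flip is heads:

# Minimal Bayesian model averaging over a discrete grid of models.
# Each model M is a coin with bias m = P(heads); data D is a list of flips.

def likelihood(m, flips):
    # P(D|M) : probability that a coin with bias m produced these flips
    p = 1.0
    for f in flips:
        p *= m if f == 1 else (1.0 - m)
    return p

models = [i / 20.0 for i in range(1, 20)]          # candidate biases 0.05 .. 0.95
prior  = {m: 1.0 / len(models) for m in models}    # flat P(M)

D = [1, 1, 0, 1, 1, 0, 1]                          # past observations (1 = heads)

# P(M|D) = P(D|M) * P(M) / P(D)
post = {m: likelihood(m, D) * prior[m] for m in models}
P_D  = sum(post.values())                          # the P(D) normalizer
post = {m: w / P_D for m, w in post.items()}

# P(X|D) = Sum on M { P(X|M) * P(M|D) } , here X = "next flip is heads"
p_heads = sum(m * post[m] for m in models)
print(p_heads)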

Note that the P(M) term here is actually important and often left out by people like me who do this all heuristically. It's essentially a measure term in your integration over all models, and it's pretty important to get right depending on how you formulate your model. Basically it serves to make the discretization of model space into cells of equal importance, so that you aren't artificially over-weighting regions of model space that your parameterization covers multiple times. You can also use P(M) to control what models are preferred; for example you might make smoother or simpler models more likely. eg. if you're modeling with a sine function you might make lower frequency waves more likely, so you make P(M) like e^(- freq^2) or something. This makes your model less prone to overfit noisy data. This whole step is referred to as "estimating the priors"; that is, estimating the a-priori probabilities of the parameters of the model.
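
In code the prior is just one more factor multiplied into each model's weight before normalizing. A rough sketch for the sine example (the data points, frequency grid, and e^(-sse) likelihood are all invented for illustration):

import math

# Sketch of a smoothness prior over sine models y = sin(freq * x).
# P(M) ~ e^(-freq^2) down-weights high-frequency (wiggly) models.

def sse(freq, data):
    # sum of squared errors of this sine model against the data
    return sum((y - math.sin(freq * x)) ** 2 for x, y in data)

data  = [(0.0, 0.05), (0.5, 0.52), (1.0, 0.80), (1.5, 1.02), (2.0, 0.95)]
freqs = [0.25 * i for i in range(1, 41)]           # candidate frequencies

weights = {}
for f in freqs:
    like  = math.exp(-sse(f, data))                # P(D|M), Gaussian-style
    prior = math.exp(-f * f)                       # P(M) : prefer low frequencies
    weights[f] = like * prior

total   = sum(weights.values())
weights = {f: w / total for f, w in weights.items()}   # normalized model weights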

A common shortcut is just to use the single most likely model; this is aka "maximum likelihood" (or "maximum a posteriori" if you keep the P(M) term). This just means you pick the one model with the highest P(M|D) and use that one to make predictions. This is potentially giving away a lot of accuracy. People often do a lot of hard work and math to find the maximum likelihood model, but we should remember that it's far less desirable than using the full weighted ensemble.
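
In code the shortcut is just an argmax over the same weights instead of a sum. A toy sketch (the posterior numbers are made up):

# Maximum likelihood / MAP shortcut : instead of summing over all models,
# pick the single model with the highest posterior and predict with it alone.

post = {0.3: 0.05, 0.5: 0.20, 0.7: 0.60, 0.9: 0.15}   # toy P(M|D) over coin biases

best = max(post, key=post.get)                        # the one "best" model
p_heads_best = best                                   # its prediction for heads
p_heads_full = sum(m * w for m, w in post.items())    # full ensemble prediction
print(p_heads_best, p_heads_full)                     # 0.7 vs 0.67 : the shortcut loses a bit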

In simple situations you can often write down P(D|M) analytically, and then actually explicitly integrate over all possible models. It's common to use models with Gaussian probability spreads because they are normalizable and integrable (over the infinite domain), eg P(D|M) of the form e^(-(D-M)^2).
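
As a sanity check, here's a sketch that does that integral numerically for a unit-variance Gaussian likelihood with a flat prior on the mean, and compares it against the closed form (the data values are made up; the closed-form answer is the standard Gaussian posterior-predictive result, mean = average of D, variance = 1 + 1/n):

import math

# With P(d|M) ~ e^(-(d-M)^2 / 2) (unit-variance Gaussian around the model mean M)
# and a flat prior on M, integrating over all models gives another Gaussian :
# mean = average of D, variance = 1 + 1/n.

D = [0.8, 1.3, 0.9, 1.1]
n, dbar = len(D), sum(D) / len(D)

def gauss(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def predictive_numeric(x, lo=-10.0, hi=10.0, steps=4000):
    # brute-force Sum on M { P(x|M) * P(D|M) } , then divide by P(D)
    dm = (hi - lo) / steps
    num = den = 0.0
    for i in range(steps):
        m = lo + (i + 0.5) * dm
        p_D_given_m = math.exp(-sum((d - m) ** 2 for d in D) / 2)
        num += gauss(x, m, 1.0) * p_D_given_m * dm
        den += p_D_given_m * dm
    return num / den

x = 1.5
print(predictive_numeric(x), gauss(x, dbar, 1.0 + 1.0 / n))   # these should match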

An alternative to Maximum Likelihood is to use just a few different models and estimate the probability of each one. Again this is sort of a strange discretization of the model space, but it works well in practice if the chosen models span the space of reasonable models well. This is usually called weighting "Experts", also called "Aggregating"; in Machine Learning they do a sort of related thing and call it "Boosting". There are a lot of papers on exactly how to weight your experts, but in practice there's a lot of heuristics involved, because there are factors like how "local" you want your weighting to be (should all prior data weight equally, or should more recent data matter more?).
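
One common heuristic for the "how local" question is to put a forgetting factor on each expert's accumulated loss, so old mistakes fade. A sketch (the decay constant and the toy loss sequence are made up):

import math

# Two ways to accumulate an expert's loss over time :
#  - uniform : every past event counts the same
#  - recency-weighted : old losses are decayed so recent behavior dominates

decay = 0.8                           # forgetting factor, purely a heuristic knob

def update_losses(uniform_loss, recent_loss, event_loss):
    uniform_loss += event_loss
    recent_loss   = decay * recent_loss + event_loss
    return uniform_loss, recent_loss

# toy sequence : this expert was bad early on, then got good
event_losses = [2.0, 2.0, 2.0, 0.1, 0.1, 0.1, 0.1]

u = r = 0.0
for e in event_losses:
    u, r = update_losses(u, r, e)

print(math.exp(-u), math.exp(-r))     # the recency-weighted weight forgives the early mistakes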

In all the weighting stuff you'll often see the "log loss". This is a good way of measuring modeling error, and it's just the compression loss. If you think of each of the models as generating probabilities for data compression purposes, the log loss is the # of bits needed to compress the actual data using that model (and the excess over the best model is your inefficiency). You're trying to minimize the log loss, which is to say that you're trying to find the best compressor. Working with log loss (aka compression inefficiency) is nice because it's additive and you don't have to worry about normalization; eg. after each new data event you can update the log loss of each model by just adding on the loss from that event.
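
The bookkeeping really is just an accumulator per model. A sketch (base-2 logs so the loss is in bits; the model names and probabilities are made up):

import math

# Log loss bookkeeping : each model assigns a probability to each event as it
# happens, and the model's running loss is the number of bits it would have
# spent coding the data so far.  Purely additive, no per-step normalization.

log_loss = {"modelA": 0.0, "modelB": 0.0}

def update(model, p_assigned_to_actual_event):
    # bits to code the event that actually happened, under this model
    log_loss[model] += -math.log2(p_assigned_to_actual_event)

# toy event stream : (prob model A gave the actual event, prob model B gave it)
stream = [(0.9, 0.6), (0.8, 0.5), (0.7, 0.7), (0.95, 0.4)]
for pa, pb in stream:
    update("modelA", pa)
    update("modelB", pb)

print(log_loss)                       # fewer bits = better compressor = better model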

As I said there are various weighting schemes depending on your application, but a general and sort of okay one to start with is the previous-data probability weight. That is, each model M is weighted by the probability that that model would've generated the previously seen data D, eg. P(D|M). If you're working with a loss value on each expert, that's just an exponential weight, e^(-loss). Obviously divide by the sum of all weights to normalize. Roughly this means you can use simple weights like e to the minus (# of errors) or e to the minus (distance squared).
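
The weighting itself is only a couple of lines. A sketch (the loss values and per-expert predictions are made up):

import math

# Exponential-of-loss weighting : weight each expert by e^(-loss), normalize,
# then blend their predictions with those weights.

losses      = {"expertA": 3.2, "expertB": 5.0, "expertC": 4.1}     # accumulated losses
predictions = {"expertA": 0.70, "expertB": 0.40, "expertC": 0.55}  # each expert's P(next event)

weights = {e: math.exp(-l) for e, l in losses.items()}
total   = sum(weights.values())
weights = {e: w / total for e, w in weights.items()}

blended = sum(weights[e] * predictions[e] for e in predictions)
print(weights, blended)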

My favorite tutorial is: Bayesian Methods for Machine Learning by Radford Neal.

If you like to come at it from a data compression viewpoint like me, the canonical reference is: Weighting Techniques in Data Compression, Paul Volf's thesis. This work really shook up the data compression world back in 2002, but you can also look at it from a general modeling point of view, and I think it's an interesting analysis in general of weighting vs. selecting vs. "switching" models. Volf showed that weighting even just two models together can be a huge win. When choosing your two models you don't really want the best two; you want two good models that don't contain any inherent inefficiencies, but you want them to be as different as possible. That is, you want your two models to be a decent span of "model space". The single best model might sit in the middle of model space, but you don't want that guy; you want two guys that average out to the middle.

A Tutorial on Learning With Bayesian Networks by David Heckerman is more mathematical and thorough, but still halfway readable.

More modern research mainly focuses on not just weighting the experts, but learning the topology of the experts. That is, optimizing the number of experts, what parts of the past data each one relies on, what their models are, etc. A common approach is to use a "Simple Bayesian Estimator" as a building block (that's just a simple model that generates random numbers around some mean, like a Gaussian), and then figure out a topology for combining these simple Bayes guys using the product rule of probabilities and so on.
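
As a tiny taste of the "product rule" part: if two of those simple Gaussian estimators are guessing at the same quantity, multiplying their densities and renormalizing gives another Gaussian with a precision-weighted mean. A sketch (the numbers are made up, and this is obviously nowhere near a full topology-learning scheme):

# Combining two "simple Bayes" estimates of the same quantity with the
# product rule : the product of two Gaussian densities, renormalized, is
# another Gaussian with a precision-weighted mean.

def combine(mean1, var1, mean2, var2):
    prec = 1.0 / var1 + 1.0 / var2
    mean = (mean1 / var1 + mean2 / var2) / prec
    return mean, 1.0 / prec

print(combine(1.0, 0.5, 2.0, 2.0))    # lands closer to the more confident estimate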

Here's sort of a cool example that game-tech graphics people should be able to relate to: Bayesian Piecewise Linear Regression Using Multivariate Linear Splines. The details are pretty complex, but basically their simple model of a bunch of data is just a piecewise linear (planar) fit, which is only C0 (value continuous but not derivative continuous). They form an ensemble model in a different way, using a Monte Carlo approach. Rather than trying to weight all possible models, they draw models randomly using the correct probability weighting of the models for the data set. Then you just average the results of the drawn models. The resulting ensemble prediction is much better than the best found linear model, and is roughly C1.
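
I haven't implemented their method, but the "draw models at random, then average their predictions" idea looks roughly like this 1D sketch. To be clear, this is not the paper's algorithm: I use two independent line fits per random knot (not their continuous splines) and plain e^(-sse) likelihood weighting instead of their sampler; the data and constants are invented:

import math, random

# Monte Carlo ensemble of crude two-segment linear fits to noisy 1D data.
# Each candidate model is a random knot position with an independent least
# squares line on each side; models get weight e^(-sse) and their predictions
# are averaged.

random.seed(1)
data = [(i * 0.1, abs(i * 0.1 - 1.0) + random.gauss(0, 0.05)) for i in range(21)]

def fit_line(pts):
    # ordinary least squares y = a*x + b for a small list of points
    n   = len(pts)
    sx  = sum(x for x, _ in pts);     sy  = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts); sxy = sum(x * y for x, y in pts)
    denom = n * sxx - sx * sx
    a = (n * sxy - sx * sy) / denom if denom else 0.0
    b = (sy - a * sx) / n
    return a, b

def make_model(knot):
    left  = fit_line([p for p in data if p[0] <= knot])
    right = fit_line([p for p in data if p[0] >  knot])
    def predict(x):
        a, b = left if x <= knot else right
        return a * x + b
    weight = math.exp(-sum((predict(x) - y) ** 2 for x, y in data))
    return predict, weight

models = [make_model(random.uniform(0.3, 1.7)) for _ in range(200)]
total  = sum(w for _, w in models)

def ensemble(x):
    # weighted average of all the drawn models' predictions
    return sum(w * predict(x) for predict, w in models) / total

print(ensemble(0.5), ensemble(1.0))   # smooth-ish predictions near |x - 1|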

This last paper shows a general thing about these Bayesian ensembles - even when the individual models are really shitty, the ensemble can be very good. In the linear regression case, each of the linear splines is not even close to optimal; they don't search for the best possible planes, they're just randomly picked, but then they're combined and weighted and the results are very good. This was seen with AdaBoost for classification too. With boosting you can take super shitty classifiers like just a single plane classifier (everything in front of the plane is class A, everything behind is class B), but if you average over an ensemble of these planes you can get great fits to all kinds of shapes of data.
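
Here's a from-scratch toy of that last point: boosting axis-aligned "single plane" stumps on 2D points labeled by a circle. This is just the textbook discrete AdaBoost loop, not any particular library or paper's version, and the data and constants are made up:

import math, random

# Minimal AdaBoost with axis-aligned "single plane" stumps on 2D points.
# Labels are +1 inside a circle and -1 outside; no single stump fits that,
# but a weighted ensemble of stumps gets close.

random.seed(2)
pts    = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(300)]
labels = [1 if x * x + y * y < 0.5 else -1 for x, y in pts]

def stump_predict(stump, p):
    axis, thresh, sign = stump
    return sign if p[axis] < thresh else -sign

def best_stump(weights):
    # brute force : the axis / threshold / sign with the lowest weighted error
    best, best_err = None, 1e9
    for axis in (0, 1):
        for thresh in [t / 10.0 for t in range(-10, 11)]:
            for sign in (1, -1):
                stump = (axis, thresh, sign)
                err = sum(w for p, y, w in zip(pts, labels, weights)
                          if stump_predict(stump, p) != y)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err

weights  = [1.0 / len(pts)] * len(pts)
ensemble = []                                    # list of (alpha, stump)
for _ in range(30):
    stump, err = best_stump(weights)
    err   = min(max(err, 1e-9), 1.0 - 1e-9)
    alpha = 0.5 * math.log((1.0 - err) / err)    # how much say this stump gets
    ensemble.append((alpha, stump))
    # re-weight : boost the points this stump got wrong
    weights = [w * math.exp(-alpha * y * stump_predict(stump, p))
               for p, y, w in zip(pts, labels, weights)]
    total   = sum(weights)
    weights = [w / total for w in weights]

def classify(p):
    return 1 if sum(a * stump_predict(s, p) for a, s in ensemble) > 0 else -1

hits = sum(classify(p) == y for p, y in zip(pts, labels))
print(hits / len(pts))                           # far better than any one stump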

I wrote about a way to do simple weighting back on 12-16-06 using very simple variance weighting.
