01-16-09 - Automated NFL Bettor

I was thinking about some NFL betting this weekend and realized I never wrote about my automated NFL bettor and thought I should remedy that.

I was working on the Netflix prize stuff (see my post here then here ) and it occured to me that it's a very similar problem.

If you think about it in terms of the Netflix sparse matrix way of thinking, you have 32 x 32 teams , and each time they play fills out a slot in the matrix with the score delta of that game. The matrix is very sparse - there are lots of blank spots where teams haven't played each other (it's less than half full), and the challenge for an automatic bettor is to fill in the blanks - aka making a prediction for teams that haven't played yet.

Now, it should be obvious that you could use one of the Netflix methods right off the bat. For example "Collaborative Filtering" can be directly applied - it just means using the point delta of teams I've already played to predict the delta vs. a new team. So eg. if team A beat team B by +3 , and team B beat lost to team C by -7, then we predict team A will lose to team C by -4. You can also obviously use an SVD approach. Take the matrix of scores with blanks. Use SVD to find a low-rank generator that approximates the existing values, and makes predictions for the blanks.

Both of these techniques "work" but are terrible. The problem is the NFL scores have lots of randomness. That is, we fundamentally assume that if team A and team B play over and over the average score delta will converge to some number, but in the short term the actual score is drawn from some random source centered on the average. We generally assume that it's Gaussian, but in fact in the NFL it's very non-Gaussian, there are weird peaks at +3/-3 and +7/-7 and a hole at 0 (BTW this is where the "Wong Teaser" comes from - it exploits the fact that oddsmakers use Gaussian models but the distribution is not actually Gaussian).

So, we really need to use some kind of Bayesian or Hidden Model approach. That is, the score that we actually observed cannot be taken as the average expected score of those teams. Instead we must find a model for each team which maximizes the likelihood that we would observe the actually seen scores given that model ; this is an ML approach , and it's crucial that you also include a prior for regularization. Let me say that again in a more Bayesian way -

We saw some set of scores {S} ; there is some team model {M} and the model tells us the expected average score between teams; the model is our predictor. There's some probability that the Model M creates scores S : P(S|M) ; but that is not the right thing to maximize ! We want to find the model that maximizes P(M|S) which is :

P(M|S) = P(S|M) * P(M) / P(S)

Now P(S|M) is known, it's given by our model, but the inclusion of P(M) here is really crucial - Not all models are equally likely!

In particular, you need to make more extreme models less likely or you will get bad overfitting. The data is way too sparse and too random. The prior model should know that all the teams are really very close in strength and extreme score differences are not very reflective of future performance, especially when they're anomalies.

Most of the existing automatic NFL predictors use some kind of "power ranking". This just means assigning a single value to each team and predicting the score of A vs. B to be power(A) - power(B). You can fit these trivially using least squares and you have plenty of data, but that still sucks because of the variance issue. The good people who do this automatically use various forms of prior conditioning and regularization in the least squares fit. For example they use things like fitting to the square roots, which reduces the importance of large differences.

Dr K's Automatic Betting Talk (PPT) has a good summary of the classic ways of doing regularized power rankings.

Anyway I got into that and started doing some reading and realized I wasn't going to beat the stuff that people are already doing. There's also just really not enough data to make good fits in the NFL. I mean the "Power Ranking" is a really pathetic model if you think about it, it gives each team a single "strength" number so it has absolutely no way to model the style that the team plays and what kinds of strenghts and weaknesses it might have. An obvious idea would be to try to fit something like 5 numbers - {rushing offence, rushing defence, passing offence,rushing defence,home field advantage} - but you don't have nearly enough information to fit that.

If you look at The Prediction Tracker he's got aggregated lots of predictions from around the net including some that are formulated with good mathematical models. However, not one of them can reliably beat the spread.

For example : Dr K's preds and history he has a 134-125 record this year. That's actually remarkable, most people do worse. Assuming a very good 105 vig (often called a 105 bet in the NFL because you're putting up 105 to win 100), so a bet pays +1 on a win and -1.05 on a loss, his total profit is 134 - 125*1.05 = 2.75 ; yay he actually made money. If he bet $105 on every game all season he made $275 on the year. (BTW a lot of suckers make bets at 110 vigs which is much worse and pretty much impossible to beat).

But that is assuming that you make every bet. If you look at the automatic predictions, often they predict a score delta that's very close to the spread. Clearly you shouldn't make that bet, because the vig will mean you lose money on average. You should only place a bet when your confidence is greater than the vig. To break even you have to win 105/205 of the time = 51.2 %

So, I thought - is there any way to formulate a *strategy* for betting that gaurantees a profit? My idea was to go to The Prediction Tracker and grab all the predictions from various sources and create a decision tree that tells me whether to bet or not.

You can think about a decision tree as a sort of BSP tree on the space of the parameters. Each parameter (in this case all the predictors) are each a coordinate axis which defines a large dimensional space. In this case the axes are {LINE, SAG, PFZ, ARGH, KAM ... } it's about a 30 dimensional space. The value for each of those is the coordinate in that dimension, so we have specified a point in this big space. The decision tree cuts up the space into regions of {BET} and {NO BET} , basically a 1 or a 0. These spatial regions can capture a lot of rules. For example, it can capture rules like :

If abs( (average pred) - (line) ) < 1.0 
  then don't bet

If ( (average pred) - (pred sdev) ) <= line - 1.0 && ( (average pred) + (pred sdev) ) >= line + 1.0
  then don't bet


These are some rules that seem intuitive to me - if the pred is too close to the line, don't bet, or if the sdev of the various preds is too wide and near the line then don't bet. But the point is not to hard code rules like that, the decision tree finds them automatically and finds the right dividing values. It could also come up with rules that you may not expect, like it might find things like :

if SAG > 10 and L2 > 8 and LINE < 7 then
  do bet

We can train the decision tree based on past performance if we have past scores and past lines. We can make an optimal in-hindsight decision tree. Basically what you do is for every game in the past you make a point in space from all the parameters on that game; you give this point either a category of {bet} if the prediction was right, or {no bet} if the prediction was wrong. Then you try to put down splitting planes to chop up the space to separate the {bet} points from the {no bet} points. You hope that they are clustered together in pockets. BTW it's important to use the right cost function for tree building here because of the vig - it's not just the number of points miscategorized, there's a penalty of 105 for putting a {no bet} into {bet}, but only a penalty of 100 for putting a {bet} into {no bet}. That is, it's better to abstain from possible bets once in a while than it is to make a bet you shouldn't have. In fact I found better trees if I pretended I was betting at 110, eg. make the tree artificially more conservative.

In practice again you can't actually do this directly because you don't have enough history and the space is too big and sparse, but you can fix that in this case by collapsing dimensions. For example you can eliminate most of the predictors because only a few of them actually help, so instead of 30 dimensions you have 4. You can also make optimized linear combinations of predictors, and then make the decision tree on the combined predictors. Another option would be to use PCA to collapse the dimensions rigorously (note that we don't have the sparseness problem because we're doing PCA on all the prediction values, and we have a pred for each axis for every game each week).

Of course in testing your DT you need to do holdouts. eg. if you have 4 seasons of past data, hold out 100 games randomly chosen. Train the DT on the other games, and then test it on the 100 that you held out. That way you're not testing on the data you trained on.

With this approach I was able to find a meta-betting strategy that had a 55% success rate on the random holdouts tested. That's well over the 51.22% needed to break even. If you bet $105 with a 55% success rate you expect $7.75 of profit from each bet.

That's pretty good, but man sports betting sucks as a way to actually make money. You're putting up huge sums at near coinflips to make very small profits.

No comments:

old rants