Using ELO ratings for match result prediction in association football

I occasionally comment on a scientific paper that is of interest to me. This time, it was:

Lars Magnus Hvattum and Halvard Arntzen (2010) Using ELO ratings for match result prediction in association football, International Journal of Forecasting, 26(3), 460-470 (doi).

This paper falls into the broader categories of Football, Forecasting and Sport, which I have blogged about in various ways.

ELO ratings were originally designed for Chess. In brief, if you lose to an opponent who is a lot weaker than you, your rating drops a lot more than if you lose to a player who is only slightly weaker than you. Similarly, if you beat a player who is a lot stronger than you, your rating increases a lot more than if you beat a player who is only slightly stronger than you. It is usually (if not always) a zero sum game, so that whatever points you gain (or lose), the same number of points is lost (or won) by your opponent. There are a number of variations on how ratings are updated, but once you have a rating you can compare yourself to another player (or team) to decide who is the better player.
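To make the update rule concrete, here is a minimal sketch of the classic chess-style ELO update. The scale of 400 and the K-factor of 20 are conventional values that I have assumed for illustration; they are not taken from the paper, and the function names are my own.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B (1 = win, 0.5 = draw, 0 = loss)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def update_ratings(rating_a: float, rating_b: float, score_a: float, k: float = 20.0):
    """Return the new (rating_a, rating_b) after one game.

    score_a is A's actual result: 1.0 win, 0.5 draw, 0.0 loss.
    Whatever A gains, B loses, so the update is zero sum.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Losing to a much weaker opponent costs more than losing to a slightly weaker one.
big_upset = update_ratings(1800, 1400, score_a=0.0)
small_upset = update_ratings(1800, 1750, score_a=0.0)
print(big_upset, small_upset)
```

The stronger player's rating falls further in the first case, and in both cases the two ratings still sum to what they did before the game.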

The idea behind ELO ratings has now been applied to a number of sports, including football. In fact, there is a web site that maintains a list of ELO ratings for world football. At the time of writing, the latest match played (on 15th Aug 2012) was a friendly between Albania and Moldova. The score was 0-0; Moldova gained 5 points and Albania lost 5 points, leaving them on 1461 (Albania) and 1376 (Moldova). Because Albania were the higher ranked team, they lost points when they could only manage a draw against a lower ranked team. They also had home advantage, though I am not sure that this is a factor in this model. Looking at the web page, I don’t think so.
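As a rough check of the arithmetic, the sketch below computes how many points the home side would lose for a draw, with and without a home bonus. The K-factor of 20 and the 100-point home bonus are values I have assumed purely for illustration; I have not confirmed what the web site actually uses. The pre-match ratings are simply the published ratings with the 5-point change undone.

```python
def expected_home_score(rating_home, rating_away, home_bonus=0.0, scale=400.0):
    """Expected score of the home team, optionally boosted by a home bonus."""
    diff = rating_home + home_bonus - rating_away
    return 1.0 / (1.0 + 10.0 ** (-diff / scale))


def change_for_draw(rating_home, rating_away, k, home_bonus):
    """Rating change for the home team after a draw (actual score 0.5)."""
    return k * (0.5 - expected_home_score(rating_home, rating_away, home_bonus))


albania_before = 1461 + 5   # Albania lost 5 points
moldova_before = 1376 - 5   # Moldova gained 5 points
K = 20                      # assumed K-factor, not taken from the site

with_bonus = change_for_draw(albania_before, moldova_before, K, home_bonus=100)
without_bonus = change_for_draw(albania_before, moldova_before, K, home_bonus=0)
print(round(with_bonus, 2), round(without_bonus, 2))
```

Under these assumed values the home side loses about 5 points with the bonus and about 3 without it. That only shows how sensitive the change is to the parameters; it cannot settle the home advantage question without knowing what the site really uses.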

In the paper that is the subject of this post, the authors define two ELO models to rate teams. The first uses a basic ELO to work out ratings and then uses the difference between the two teams' ratings as a covariate (a variable that could be a predictor of the outcome under investigation) in a regression model. This seems reasonable, as the ELO rating is a measure of the current strength of the team. The second ELO model also takes into account the margin of victory. That is, a 3-0 win is better than a 2-1 win.
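The two ideas can be sketched as follows. The post does not say which regression is used, so I have assumed an ordered logit, which is one natural choice when the outcome is ordered (away win, draw, home win). The margin weighting shown is one possible form, and every parameter value (beta, the two cut points, k0, lam) is hypothetical rather than fitted.

```python
import math


def goal_weighted_k(k0: float, goal_margin: int, lam: float = 1.0) -> float:
    """K-factor that grows with the goal margin: k0 * (1 + margin) ** lam."""
    return k0 * (1.0 + abs(goal_margin)) ** lam


def logistic(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def outcome_probabilities(rating_diff, beta, cut_low, cut_high):
    """Ordered logit on the home-minus-away rating difference.

    Returns (p_away_win, p_draw, p_home_win). Requires cut_low < cut_high.
    """
    p_away = logistic(cut_low - beta * rating_diff)
    p_away_or_draw = logistic(cut_high - beta * rating_diff)
    return p_away, p_away_or_draw - p_away, 1.0 - p_away_or_draw


# Hypothetical parameters, chosen only so the output looks football-like.
even_match = outcome_probabilities(0, beta=0.005, cut_low=-0.9, cut_high=0.3)
home_favourite = outcome_probabilities(200, beta=0.005, cut_low=-0.9, cut_high=0.3)
print(even_match, home_favourite)
print(goal_weighted_k(10, 3), goal_weighted_k(10, 1))  # 3-0 win vs 2-1 win
```

In practice the regression parameters would be estimated from past matches, and the goal-weighted K would replace the fixed K in the rating update so that a 3-0 win moves the ratings more than a 2-1 win.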

To compare the two ELO models, the authors define another six prediction methods. The first (which they call UNI) simply chooses an outcome at random. The second (FRQ) uses the observed frequency of home wins, draws and away wins to make its prediction. For example, if two thirds of a range of past matches resulted in a home win, then FRQ would predict a home win for two thirds of the matches it is asked about. The authors note that UNI and FRQ are quite naive prediction methods. The third and fourth benchmarks are derived from a paper by John Goddard (Goddard J. (2005) Regression models for forecasting goals and match results in association football, International Journal of Forecasting, 21(2), 331-340 (doi)). The fifth comparative model is based on bookmakers' odds, taking the average of a set of odds from various bookmakers. The final model is almost the same as the fifth, but instead of using the average odds, the maximum odds are used.
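The naive and odds-based benchmarks are simple enough to sketch. This is my own illustration rather than the authors' code: the function names, the example odds, and the choice to normalise inverse decimal odds into probabilities are all assumptions.

```python
from collections import Counter
from statistics import mean

OUTCOMES = ("H", "D", "A")  # home win, draw, away win

# UNI: every outcome is treated as equally likely.
UNI = {outcome: 1.0 / 3.0 for outcome in OUTCOMES}


def frq_probabilities(past_results):
    """FRQ: use the observed frequency of each outcome in past matches."""
    counts = Counter(past_results)
    total = sum(counts.values())
    return {outcome: counts[outcome] / total for outcome in OUTCOMES}


def combine_odds(bookmakers, how):
    """Combine decimal odds across bookmakers with `how` (e.g. mean or max)."""
    return {o: how([book[o] for book in bookmakers]) for o in OUTCOMES}


def implied_probabilities(decimal_odds):
    """Invert decimal odds and normalise so the probabilities sum to one."""
    inverse = {o: 1.0 / odd for o, odd in decimal_odds.items()}
    total = sum(inverse.values())
    return {o: value / total for o, value in inverse.items()}


# Made-up odds from two hypothetical bookmakers.
books = [{"H": 2.0, "D": 3.4, "A": 3.8}, {"H": 2.1, "D": 3.3, "A": 3.6}]
avg_probs = implied_probabilities(combine_odds(books, mean))
max_odds = combine_odds(books, max)
print(frq_probabilities(["H", "H", "D"]), avg_probs, max_odds)
```

Normalising is needed because the inverse odds of a single bookmaker normally sum to more than one; the excess is the bookmaker's margin.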

The paper also uses various staking strategies: a fixed bet of one unit; a stake set according to the odds so that a winning bet returns one unit; and stakes based on the Kelly system, which is an optimal staking system if you know the exact probabilities.
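The three staking plans can be sketched as below. I have assumed decimal odds (greater than 1), and that "returns one unit" means the total return including the stake; the function names and example numbers are mine. The Kelly formula itself is the standard one for a single bet.

```python
def fixed_stake() -> float:
    """Always bet one unit."""
    return 1.0


def unit_return_stake(decimal_odds: float) -> float:
    """Stake chosen so that stake * odds equals one unit if the bet wins."""
    return 1.0 / decimal_odds


def kelly_fraction(probability: float, decimal_odds: float) -> float:
    """Fraction of the bankroll to stake under the Kelly criterion.

    Returns 0 when the bet has no positive expected value.
    """
    edge = probability * decimal_odds - 1.0
    if edge <= 0.0:
        return 0.0
    return edge / (decimal_odds - 1.0)


# Hypothetical bet: we believe the outcome has a 55% chance, offered at odds of 2.0.
print(fixed_stake(), unit_return_stake(2.0), kelly_fraction(0.55, 2.0))
```

Kelly only bets when the estimated probability beats the odds, which is why it is so sensitive to how good those probability estimates are.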

The various prediction models and staking plans were tested on eight seasons of data (2000-2001 to 2007-2008). Data from previous seasons was used to train the various models.

So, what were the results? The two ELO models were found to be significantly worse than two of the other six models (those based on bookmaker odds), but the ELO models were better than the other four models. The conclusion is that this is a promising result but perhaps more covariates are required to predict even more accurately.