A compound framework for sports results prediction: A football case study

The latest paper that caught my attention, and that I thought I would comment on, is the following (other publications I have commented on can be seen here).

Byungho Min, Jinhyuck Kim, Chongyoun Choe, Hyeonsang Eom and R.I. (Bob) McKay (2008) A compound framework for sports results prediction: A football case study, Knowledge Based Systems, 21(7), 551-562 (doi).

You might be able to download a copy of the paper from here. Note that this link may not be permanent and it may not be an exact copy of the paper that was published (although it does look like it).

The paper presents a framework which is designed to predict the outcome of football matches. They call their system FRES (Football Result Expert System).

The authors note that most previous research focuses on a single factor when predicting the outcome of a football match, and the main factor used is usually the score data. Even when other factors are taken into account, the score still tends to dominate the prediction process.

Within FRES, two machine learning methodologies are utilised: a rule-based system and Bayesian networks. The paper describes how they are used within FRES in enough detail to allow readers to reproduce the system (as all good scientific papers should do).

FRES is tested on the 2002 World Cup tournament. Most football prediction systems are tested on league competitions, where teams (typically) play a double round robin tournament. Testing their approach on the 2002 World Cup means that the system cannot easily be compared to other systems. Where previous approaches have been tested on other tournaments (for example, previous World Cups) not all the data was available to enable FRES to make those predictions. In the words of the authors, “In the case of the few works which predict a tournament such as the World Cup, the available evaluation was conducted with old data, such as the World Cup 1994, 1998, which would unfairly hobble FRES, since some of the data it relies on are not available for these earlier tournaments.”

Although not a scientific term (at least not one I am familiar with!), I do like the term unfairly hobble.

In order to provide some sort of comparison with FRES, the paper implements two other systems, a historic predictor and a discounted historic predictor.

FRES was able to predict six countries out of the top eight in the tournament. The other predictors were able to predict five. Moreover, various statistical tests are conducted which confirm that FRES is statistically better than the other two methods.

One thing I like about the FRES system is that it has a lookahead mechanism. Based on this, England does not rate very highly as, due to the draw, there is a high probability that they will meet Brazil in the quarter finals. Turkey, on the other hand, is rated more highly due to the perceived easier draw.

It would be useful to have FRES tested on league competitions, so that better comparisons could be made with more prediction systems that have been reported in the scientific literature. Perhaps the authors are working on that now? It would, for example, be interesting to see if it beats a random method, or a method which always predicts a home win (as the authors did in the paper I discussed a few days ago).

 

Sports Forecasting: A Comparison of the Forecast Accuracy of Prediction Markets, Betting Odds and Tipsters

In some of my posts I comment on a scientific paper that has caught my eye. There is no particular reason for the papers that I choose; they are just of interest to me. In this post, the paper that caught my eye is the following (comments on other papers can be seen here).

Martin Spann and Bernd Skiera (2009) Sports forecasting: a comparison of the forecast accuracy of prediction markets, betting odds and tipsters, Journal of Forecasting, 28(1), 55-72 (doi).

This paper looks at three different prediction methods, and assesses their effectiveness in predicting the outcome of premier league matches from the German football league. The three methods that are investigated are prediction markets, tipsters and betting odds.

Prediction Markets are based on various people taking a stance on the same event and being willing to back their hunch by paying (or collecting) money should their hunch be wrong (or right). Given that a number of people are taking a stance on the same event, the market can be seen as a predictive model of the event.

Tipsters are (or should be) experts who publish their predictions in newspapers, on web sites etc. The advice from tipsters is often based on their expertise, rather than on applying some system or formal model. The paper (citing Forrest and Simmons, 2000 as its source) says that tipsters can often beat a random selection method, but do worse than simply choosing a home win every time. It also cites Andersson et al., saying that soccer experts often do worse than people who are less well informed about the game.

Betting Odds, in previous work, have been found to be a good forecasting method (not surprising I suppose, seeing as the bookmakers rely on setting the correct prices to make their living). Of course, the bookmakers can change their odds but when they publish fixed odds (on, say, a special betting coupon), this can be seen as a prediction of the match outcome.

The games that are forecast in this paper are those from the German premier league from three seasons (1999-2000, 2000-2001 and 2001-2002). The number of games predicted by each method varied (Prediction Markets and Betting Odds = 837, Tipsters = 721 and Prediction Markets, Betting Odds and Tipsters = 678). The numbers differ simply because of the data that was available. Where a figure is given for two or three methods together, it is the intersection of the games that all of those methods were able to predict.

To evaluate each method, the authors calculate the percentage of correct predictions. They also calculate the root mean squared error, as well as the amount of money that each method would have won (three figures are given: a 25% fee, a 12% fee and no fee). Comparisons are also made with a random selection policy as well as a naive selection policy, which simply assumes a home win.
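To make two of these measures concrete, here is a minimal Python sketch of the percentage of correct predictions and the return from a flat one-unit stake on every game. The function names and the toy data are invented for illustration. In particular, the way the fee is applied (as a cut of the payout on winning bets) is an assumption; the paper may well define its fees differently.

```python
def hit_rate(predictions, outcomes):
    """Fraction of games where the predicted result matched the actual result."""
    correct = sum(1 for p, o in zip(predictions, outcomes) if p == o)
    return correct / len(outcomes)


def flat_stake_return(predictions, outcomes, decimal_odds, fee=0.0):
    """Profit as a fraction of the total staked, betting 1 unit on every game.

    decimal_odds[i] is the decimal price of the outcome predicted for game i.
    ASSUMPTION: the fee is deducted from the payout of winning bets.
    """
    staked = len(predictions)
    payout = sum(
        odds * (1.0 - fee)
        for p, o, odds in zip(predictions, outcomes, decimal_odds)
        if p == o
    )
    return (payout - staked) / staked


# Toy data: H = home win, D = draw, A = away win.
predictions = ["H", "D", "A", "H"]
outcomes = ["H", "A", "A", "D"]
odds = [2.0, 3.0, 4.0, 2.5]

print(hit_rate(predictions, outcomes))                           # 0.5
print(flat_stake_return(predictions, outcomes, odds))            # 0.5
print(flat_stake_return(predictions, outcomes, odds, fee=0.25))  # 0.125
```

Even in this toy example the fee takes a large bite out of the return, which is the same pattern as in the paper's figures.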

So, what did the authors find? Over the 837 games, the prediction market and betting odds were able to correctly predict 52.69% and 52.93% of games respectively. If there was no fee this would have returned a profit of 12.30% and 11.92% respectively. The naive model (pick home wins) predicted 50.42% correctly and returned a profit of 11.79%. The random method only managed to predict 37.98% of games correctly.

If we look at the 678 games that all three methods could predict, then the percentages of correct predictions were 54.28% (prediction market), 53.69% (betting odds), 42.63% (tipsters), 50.88% (naive model) and 37.98% (random). The returns (assuming no fee) were 16.20% (prediction market), 13.49% (betting odds), -0.19% (tipsters) and 12.44% (naive model).

I’m not sure why profit information is not given for the random model, but it would almost certainly result in a loss.

A further test is also carried out. Only games where the methods agree on the selection are bet upon. For example, of the 678 games, there are 380 games where the three methods agree on the result. If we only bet on those games, we get a correct prediction percentage of 57.11%, higher than that of any of the methods used in isolation and betting on every game. The profit return would be 13.86% (no fee), 1.66% (12% fee) and -8.72% (25% fee).
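The agreement filter is simple to express in code. The sketch below is only an illustration (the function name and the toy picks are made up): it keeps the games where every method makes the same selection, and those are the only games that would be bet upon.

```python
def consensus_picks(*method_predictions):
    """Return (game_index, pick) for the games where every method agrees.

    Each argument is one method's list of picks, in the same game order.
    """
    agreed = []
    for index, picks in enumerate(zip(*method_predictions)):
        if len(set(picks)) == 1:  # all methods made the same selection
            agreed.append((index, picks[0]))
    return agreed


# Toy picks for four games: H = home win, D = draw, A = away win.
market = ["H", "D", "A", "H"]
bookmaker = ["H", "H", "A", "H"]
tipster = ["H", "D", "A", "A"]

print(consensus_picks(market, bookmaker, tipster))  # [(0, 'H'), (2, 'A')]
```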

The authors conclude that the prediction market and the betting odds provide the best indication of the outcome. They agree with previous work that tipsters are generally quite poor at prediction.

 

 

Can Forecasters Forecast Successfully?: Evidence from UK Betting Markets

I occasionally blog about a paper that is of interest. Well, of interest to me. The latest paper to catch my eye is the following (other papers I have commented on can be seen here).

Leighton Vaughan Williams (2000) Can Forecasters Forecast Successfully?: Evidence from UK Betting Markets, Journal of Forecasting, 19(6), 505-513 (doi).

The reason that this paper was of interest is that I was reading Leighton’s book (Betting to Win: A Professional Guide to Profitable Betting, High Stakes Publishing, 2002, ISBN: 1-84344-015-6) and this paper was mentioned. Many (many, many) years ago, long before I was an academic, I went through a phase where I collected all sorts of horse racing systems. My idea was to test them all out to see if any of them worked. I never actually placed a bet and I never really tested any systems as they generally involved too much time to collect all the data, process the data etc.

Since then I have still thought that it would be interesting to look at various horse racing systems to see if they worked.

This is what this paper does. Unlike my idea though, it takes tips from services that you subscribe to, either by paying money or by contacting them using a premium rate phone service. This seems a lot more sensible, rather than having to enter all the data yourself.

This paper looks at the performance of tipping services, with the analysis being carried out in 1995. Five services were compared. Four of these were subscription based. That is, a fee is paid and you get tips at various times. In 1995, these services cost at least 99 GBP per month, which seems a lot to me now, let alone in 1995. The other service was a premium rate phone number, where you phone up to receive the tip and the cost of the phone call effectively covers the cost of the service. These five services were chosen as they were amongst the top tipping services as assessed by the Racing Information Database (I have tried to google this and am not sure that it is in existence anymore, but would be willing to be corrected, and update this post to provide a link).

The paper goes through each of the tipping services and evaluates how many tips were provided (and over what period – some, for example, were analysed over three months, others over six months – I think the period was probably chosen to ensure that a sufficient number of tips were analysed as not all services provide tips at the same interval), any conditions associated with the service (for example, only bet if a certain price is available), the profit (or loss) from investing in that service etc.

The good news is that all the tipping services produced a pre-tax profit when used with the relevant staking/price plans. Leighton also makes the point that none of these profits could be said to be significant. It was also interesting to note that increased profits could have been achieved if some of the lesser supported tips were ignored. Of course, this would be a hindsight examination and the obvious question would be, when in play, what tips do you ignore, and what ones do you actively bet upon? There is also evidence that you should use a variable staking plan, rather than a flat stakes method.

If you use a tipping service, there are also other factors to take into account. There is an upfront investment (which you may never recover). Unlike an academic study, you will probably only choose one service, and which one do you choose? There is also (as pointed out by Leighton, in chapter 20 of his book) the fact that you have to take what the tipping services advertise with a pinch of salt. As an example, a service might say only bet if you can get a price of 4-1 or better. What happens if that price is almost impossible (or even actually impossible) to get? Will the service still include that in their results if the horse should win? And what if there was a price available (even for a few seconds) at 6-1? Would the service return that as the price you could have got even though, unless you were very quick, or very lucky, you would have struggled to get on at 6-1?

It is twelve years since the paper was published, and seventeen years since the analysis was carried out and things have moved on. Tipping services (I suspect) come and go, technology has moved on, the tax regime has changed and there are now many other ways to bet which were not so predominant at the time. I am thinking specifically of spread betting and betting exchanges. These have, undoubtedly, made a big difference to the industry.

I am not up to date with the scientific literature in the area of sports forecasting and I suspect that there are many papers out there that provide various comparisons and analysis. If you know of any, or know of a good review paper in this area, I would be very interested if you could post a comment giving the reference.

I would also be interested in hearing from any professional tipping services (no matter what sport, but UK based as I don’t claim to understand American sports or markets) who wish to subject their service to scientific analysis. Note, this is not an open invite to advertise your service on this blog. I get enough spam as it is (and moderate it out) and I don’t want the comments box filled up with lightly disguised adverts to various web sites that claim to make millionaires from people who subscribe. But serious enquiries are welcomed.

 

Prediction of sporting events: A Scientific Approach

My final year undergraduate dissertation project (many years ago) attempted to predict the outcome of horse races using Neural Networks. I briefly blogged about it in June 2009 (http://graham-kendall.com/blog/?p=8/).

The result of the project was (in my view) encouraging but was lacking in a couple of areas. The data was incomplete (the starting prices were not available so I had to make some assumptions and it would have been more useful to have studied a greater number of races). I would also have liked to have tried some other prediction methods, beyond just neural networks.

Since doing that project I have maintained an interest in predicting sporting events, although sports scheduling (e.g. 10.1016/j.cor.2009.05.013 and 10.1057/palgrave.jors.2602382) seems to have taken up more of my time. But I have always wanted to return to prediction, utilising Operations Research methodologies. As such, I maintain a database of any literature that I see on the topic. This includes the scientific literature, as well as any newspaper cuttings, useful web sites etc.

 

One of the problems that serious sports forecasters face is being taken seriously. A quick google for sports prediction (or many other similar terms) will bring up many sites offering services that (supposedly) enable you to make money. The services typically involve investing in some system, or subscribing to a service where you are sent the predictions for you to back in whatever way you see fit.

Of course, if we were sceptical, we might assume that many of these services are really there to make money for the people selling the service, rather than those who are buying. I am sure that there are some services out there that make money for both the seller and the buyer, but the challenge is to find out which services offer value for money before you go bankrupt in the process!

 

Unfortunately, there are not that many scientific papers that consider how to predict the outcome of sporting events, at least as a way to return a monetary profit. There are some, of course. For example, an article that appeared last year in the International Journal of Forecasting

 

S. Lessmann, M-C. Sung, and J.E.V. Johnson (2010) Alternative methods of predicting competitive events: An application in horserace betting markets, 26:518-536, DOI: 10.1016/j.ijforecast.2009.12.013

 

considered how to predict horse races. The motivation of the article was actually to try and predict competitive events such as political elections and (of course) sporting events, although the paper was really a large scale (1000 races, 12,092 runners) study. The paper concluded that their proposed model was able to provide an increase in wealth of just over 528% if using a Kelly (Kelly, J. L. (1956). A new interpretation of information rate. The Bell System Technical Journal, 35, 917–926) strategy, with reinvestment.
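For readers who have not met the Kelly strategy, the textbook rule for a single win/lose bet is to stake the fraction (bp - q)/b of the current bank, where p is your estimated probability of winning, q = 1 - p and b is the net odds (decimal odds minus one). The sketch below shows that rule with reinvestment. It is not the staking scheme from the paper, which has to deal with many runners per race; the function names and the numbers are purely illustrative.

```python
def kelly_fraction(win_probability, decimal_odds):
    """Textbook Kelly stake, as a fraction of the current bank, for one bet."""
    b = decimal_odds - 1.0  # net winnings per unit staked
    if b <= 0:
        raise ValueError("decimal odds must be greater than 1")
    q = 1.0 - win_probability
    fraction = (b * win_probability - q) / b
    return max(0.0, fraction)  # never bet when there is no edge


def reinvest(bank, bets):
    """Apply Kelly stakes bet by bet, reinvesting the running bank.

    bets is a sequence of (win_probability, decimal_odds, won) tuples.
    """
    for win_probability, decimal_odds, won in bets:
        stake = bank * kelly_fraction(win_probability, decimal_odds)
        if won:
            bank += stake * (decimal_odds - 1.0)
        else:
            bank -= stake
    return bank


print(kelly_fraction(0.5, 3.0))  # 0.25
print(reinvest(100.0, [(0.5, 3.0, True), (0.5, 3.0, False)]))  # 112.5
```

The important point is that the stake grows and shrinks with the bank, which is why reported wealth increases under Kelly can look so dramatic.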

 

Considering other sports, such as football (the UK version), a couple of examples of predicting matches can be found in Economics, Management and Optimization in Sports, Springer, 2004, ISBN: 3-540-20712-0

In Using Statistics to Predict Scores in English Premier League Soccer (John S. Croucher, pp 43-57), various models are presented that attempt to fit the number of goals scored by each team. The best model found had a Poisson distribution.
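To show how a Poisson model of goals can be turned into a match prediction, here is a small sketch. It is not Croucher's model. It assumes each team's goals follow an independent Poisson distribution, and the mean goal rates used at the bottom are invented for the example.

```python
from math import exp, factorial


def poisson_pmf(goals, mean):
    """Probability of scoring exactly `goals` when goals are Poisson(mean)."""
    return exp(-mean) * mean ** goals / factorial(goals)


def outcome_probabilities(home_mean, away_mean, max_goals=10):
    """Return (home win, draw, away win) probabilities.

    ASSUMPTION: home and away goals are independent. Scorelines above
    max_goals are ignored, so the three values sum to just under 1.
    """
    home_win = draw = away_win = 0.0
    for home_goals in range(max_goals + 1):
        for away_goals in range(max_goals + 1):
            p = poisson_pmf(home_goals, home_mean) * poisson_pmf(away_goals, away_mean)
            if home_goals > away_goals:
                home_win += p
            elif home_goals == away_goals:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win


# Hypothetical scoring rates: home side 1.5 goals a game, away side 1.1.
home_win, draw, away_win = outcome_probabilities(1.5, 1.1)
print(round(home_win, 3), round(draw, 3), round(away_win, 3))
```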

Another paper from the same book (Modelling and Forecasting Match Results in the English Premier League and Football League (Stephen Dobson and John Goddard, pp 59-77)) considers about 30 seasons of data. This paper also uses a statistical method, but assigns probabilities to win, lose or draw, rather than trying to predict the number of goals scored. This paper also provides a good overview of previous work in this area.

 

I suppose the stock market is one area that has been widely studied with respect to prediction, with an eye on turning a profit. There have been hundreds (if not thousands) of papers that look at ways to predict stock prices, interest rates, inflation etc.

 

Perhaps not surprisingly, there has been limited reporting in the scientific literature as to whether anybody has made (or makes) money from the methodologies that they have developed. After all, if you have a successful system, why tell everybody about it? (This is also one of the major arguments against buying a system or tips from a service on the internet.)

 

What I would actually like to see is a lot more scientific papers not only reporting their predictive systems but also how much money was made, over what period of time and if the system is in daily/weekly use at the time of writing.

Of course, the system needs to be reproducible (as should all good scientific writing).

However, if the system is successful, the author(s) might be unwilling to reveal its secrets but might still want to let the world know about its effectiveness. Under these circumstances I have a few ideas as to how this could be done.

  1. In the run up to publishing the paper, the authors make a series of predictions and lodge them with a reputable source. This could be another scientist, a lawyer, or even a web site that can be verified from a date/time point of view. The important thing is to ensure that the predictions can be verified as being made in advance of the event. If these predictions were made over a period of (say) six months, then this could form part of the results presented in a paper.
  2. It is, of course, understandable that authors do not want to publish the full details of their winning methodology in a scientific paper but, as scientists, we like to publish our work. The scientific community should be understanding of this, in the same way that it accepts that sometimes certain factors must be kept confidential due to commercial sensitivities. Therefore, the general methodology could be described while omitting key points (and being upfront about that). If combined with 1), above, this could still make a contribution to the scientific literature.

 

An attractive alternative would be to run a prediction competition (see Kaggle, who are doing excellent work in this area), where competitors are given a set of data and asked to provide predictions on the outcome of sporting events; ideally those that have not taken place yet.

 

In summary, I would really like to see more reporting (on a scientific basis) of sports predictions which are unashamedly about trying to return a profit, as this is an under-represented area at the moment. Why not have a go?

 

Note: I entered this blog entry into the INFORMS blog competition. The March 2011 competition was O.R. and Sports.

Football Prediction: A decision to be made

Today I have been working on my research, which is investigating whether it is possible to predict the outcome of football matches. The measures I will eventually use to judge whether the predictions can be considered successful will include whether the system can make money at the bookmakers, whether it is more successful than other tipsters, etc.

One of the functions I have in my system is to be able to generate the league table for a given date. That is, taking into account the fixtures played to date, generate the league table for any point in the season.
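As an illustration of the idea (and not the code from the system described here, which is written in C++), a table for a given date can be built by simply ignoring every result after that date. The data layout, the function name and the tie-break order (points, then goal difference, then goals scored, then team name) are assumptions made for this sketch.

```python
from collections import defaultdict
from datetime import date


def league_table(results, as_of):
    """Build a league table from the matches played on or before `as_of`.

    results: iterable of (match_date, home, away, home_goals, away_goals).
    Returns a list of (team, row) pairs, best team first.
    """
    rows = defaultdict(lambda: {"played": 0, "points": 0, "gf": 0, "ga": 0})
    for match_date, home, away, home_goals, away_goals in results:
        if match_date > as_of:
            continue  # this fixture has not been played yet at this point
        for team, scored, conceded in ((home, home_goals, away_goals),
                                       (away, away_goals, home_goals)):
            row = rows[team]
            row["played"] += 1
            row["gf"] += scored
            row["ga"] += conceded
            if scored > conceded:
                row["points"] += 3  # win
            elif scored == conceded:
                row["points"] += 1  # draw
    # ASSUMED tie-break: points, goal difference, goals scored, team name.
    return sorted(
        rows.items(),
        key=lambda item: (-item[1]["points"],
                          -(item[1]["gf"] - item[1]["ga"]),
                          -item[1]["gf"],
                          item[0]),
    )


# Made-up results for three hypothetical teams.
results = [
    (date(2009, 8, 8), "A", "B", 2, 0),
    (date(2009, 8, 15), "B", "C", 1, 1),
    (date(2009, 8, 22), "C", "A", 3, 1),
]

table = league_table(results, date(2009, 8, 16))
print([team for team, _ in table])  # ['A', 'C', 'B']
```

Note that a team only appears in the table once it has played a match, and that points deductions are not handled at all in this sketch.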

I believe that my function is working correctly and today I was carrying out some tests to see if the league tables I generated were:

  1. Correct at the end of the season. That is, taking into account every match played, is my input data correct and does my algorithm process that data correctly?
  2. Correct for a given date. That is, does my algorithm generate the correct table for that point in the season?

I initially thought that point 2 would be very time consuming to check but I found a very useful web site. http://www.statto.com is not only a very useful web site (for all sorts of reasons) but one of the facilities it offers is to generate a league table for a particular point in the season.

When doing my checks (and there are still a lot more to do), I have found some problems where my generated tables are not correct. This is almost certainly down to my inputting the results incorrectly, so I need to check all those.

However, my checks also highlighted another problem. Actually, I knew this was something that I needed to address but I had not really thought it through.

The problem arises when teams have points taken away. I knew that this happened and I had yet to include it in my system so I was expecting the tables not to match exactly.

However, I had assumed that the points were deducted at the start of the season but this does not seem to be the case. It appears that the points can be deducted at any time in the season.
This is not too much of an issue. It just makes the programming more complex than I had hoped.

The real issue is what do I do when a team has had points deducted?

Let me give you an example. A team has won 3 games and drawn 2. That means that they have received 11 points (you get 3 points for a win and 1 point for a draw). But, if they have had 10 points deducted then they will only have 1 point. This obviously affects their league position. If I am using the league position as one of the contributory factors in my predictive model, is this fair, or should I ignore the points deduction for the purposes of prediction?
On the other hand, their league position, with the points deduction, may affect the way they play, and could be a factor in the prediction.
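One way to keep both options open is to store each deduction with the date it was applied, and let the caller decide whether to count it. The sketch below uses the numbers from the example above; the function name, the data layout and the date of the deduction are hypothetical.

```python
from datetime import date

POINTS_FOR_WIN = 3
POINTS_FOR_DRAW = 1


def points_on(as_of, wins, draws, deductions=(), apply_deductions=True):
    """Points for a team on a given date, with or without deductions.

    wins and draws are the team's totals up to `as_of`.
    deductions is a sequence of (date_applied, points) pairs; only those
    applied on or before `as_of` are counted.
    """
    earned = wins * POINTS_FOR_WIN + draws * POINTS_FOR_DRAW
    if not apply_deductions:
        return earned
    deducted = sum(points for applied, points in deductions if applied <= as_of)
    return earned - deducted


# The example from the text: 3 wins, 2 draws and a 10 point deduction.
# The date of the deduction is invented for the example.
deductions = [(date(2009, 9, 1), 10)]

print(points_on(date(2009, 9, 30), 3, 2, deductions))                          # 1
print(points_on(date(2009, 9, 30), 3, 2, deductions, apply_deductions=False))  # 11
print(points_on(date(2009, 8, 31), 3, 2, deductions))                          # 11
```

With something like this, the league position could be fed to the predictive model in both forms and the data left to decide which one is more useful.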

I’m not quite sure what I am going to do yet.

Football (Soccer) Prediction: Data Collection (#002)

If you have read previous versions of this blog you’ll know that a) I have an interest in trying to predict football (soccer) matches and b) I am currently developing a system (as a research project) that I hope to get up and running in the next month or so.

What kept me busy for a lot of last season was collecting the data.

Fixtures
The fixtures were easy to collect, being readily available from various sources. As my basic reference guide I used the Sky Sports Yearbook as this is a piece of published work that will be available to future generations. I could have used many of the web sites that are available but, as scientists, we don’t really like to rely on web sites as they may not stand the test of time. The fact that web sites are not peer reviewed is another factor we also have to consider.

Results
The results are also easy to collect as they are a matter of public record and are reported in the media and, from a research point of view, we can validate them for years to come (e.g. newspapers, next season’s Sky Sports Yearbook etc.)

Bookmakers’ Odds
One of the other things I wanted to collect was the bookmakers’ odds. This proved a lot more challenging. There are a couple of problems. The first is that, as far as I am aware, they are not a matter of public record (at least not one that stands the test of time). Or, to put it another way, can you go and verify what the odds were for a given match at a given time, just by accessing publicly available information? Importantly, if two (or more) people are independently given the same task will they come back with the same answer? And, is it the correct answer anyway?
The second problem is that odds change over time anyway, with the weight of money that has been bet.

Anyway, over the course of last season, I made regular visits to the bookmakers to pick up their fixed odds coupons and I filed them away as evidence of the odds I was using.

Since carrying out the data collection (particularly the odds) I have discovered a couple of interesting other sources. I believe that the Racing Post (which, importantly, is also published as a daily paper, so is a matter of public record) publishes the best odds available on the fixtures for a given day.
I was also pointed to a web site recently (http://www.football-data.co.uk). This is a very good web site that not only has a lot of information but also has at least seven seasons’ worth of fixtures data, including results and odds information from a selection of bookmakers.

I have checked the odds I collected last season against the ones on this web site and they match up, which is encouraging.
It also provides me with more than just last season to carry out initial testing before I use the system in anger on this season.

The downside of the http://www.football-data.co.uk web site is two-fold.

  1. I don’t think they will make their data available until a few hours before the kick off time. This might be a problem for the system that I am developing.
  2. The data is still on a web site so, from a research point of view, I should not really cite it as the web site may not be available in 1/10/100 years’ time.

Please don’t take this as a criticism of the web site. They have done (and are doing) a fantastic job of collating all this data and, for this particular research project, will save me HOURS of data collection and data entry time.

Football (Soccer) Prediction: Development Framework (#001)

As the new football (soccer in the USA) season approaches I am trying to get a football prediction system up and running. I think I will struggle to get it ready for the start of the new season (which starts Aug 7th) but that is not so important as this is mostly a research project. In any case, the system I have in mind will take a few weeks before it is usable as I need to get some results posted for the prediction system to work on.

I did a quick check on how much time I have spent so far on the programming. As a rough estimate, I think it is about 100 hours, mostly (if not all) at weekends. I still have a lot to do but I almost have the “football framework” that I need. That is, I can read in the data that I have been collecting, generate a league table for a given date in the season and collate various other statistics that I will eventually need. I also have various data structures that I will “pass around” the prediction part of the system.

I reckon that I need about another 20 hours and then I’ll have the framework completed. Then I can start to work on the prediction parts of the system.

One thing that I need to implement is an Artificial Neural Network (ANN). I have one from another project I worked on (stock market forecasting) but I want to re-engineer it. At the moment the ANN is only a feed forward network as it was used in an evolutionary setting. That is, the predictions were evolved rather than a more traditional training mechanism.
One thing lacking in my ANN class (I program in C++) is a back propagation (BP) training mechanism. So, apart from tidying up the code, I also want to implement a back propagation method, as this seems one potential way to carry out the prediction.

So I have my work cut out over the coming weeks, but I hope that it will be interesting and, you never know, it might just work.

Football Prediction: Follow up

Whilst searching around the net looking for relevant resources for my plan to predict football matches, I came across The Sports Exchange. It looks like a relatively new web site, but seems very nice.

I posted a comment in their blog (about football pools prediction), and received a number of replies. The blog entry can be seen here.

Predicting the Results of Football Matches

I have recently become interested in trying to predict the results of football matches.

The interest grew from wondering what else I could use the data for, that I had collected for generating football fixtures (see JORS paper). The data included the travel distances between all the teams in each division. I also maintained the fixtures that were actually played over the Christmas/New Year period so that I could compare the fixtures I generated against those that were actually played. Needless to say, it took a long time to collect all this data.

As I had gone to all the time and trouble of collecting the data I wanted to maximise its usage, so I began to look for other uses that I could put it to, and prediction seemed an obvious challenge.

With this in mind, for most of last season I collected additional data that I think might be important in predicting football matches. For example, I have been updating all the scores as each fixture was played. I have also been keeping a record of the odds that bookmakers were offering. Just collecting this data was a large data collection exercise in itself and I certainly did not get the odds on every fixture, but I have enough to be going on with.

I’m not totally sure what I am going to do with this data yet but I know if I don’t collect some of it as and when it is available, it becomes almost impossible (or at least a lot more difficult) to collect.

I have started programming some support functions. For example, given a date and a set of results I can generate the relevant league table for that point in the season.

My ultimate goal is to develop a prediction model and test out how good it is on the 2008-2009 season and, if I find a good model I will try and predict the fixtures for the 2009-2010 season before the matches are actually played.

The new season starts quite soon (kick off is 8th August 2009, but there is one match scheduled on the 7th August 2009).

If I am going to have any chance of testing the prediction model over the coming season, I need to get programming!

Horse Race Prediction with Neural Networks

I was sorting through some old papers recently and I came across my undergraduate final year dissertation. I recall that it started as a project about genetic algorithms but quickly turned into a project that used neural networks to predict the outcome of horse races.

I trained a back propagation network and used the final network to predict the outcome of (selected) races that the network had not been trained on. One of the biggest challenges was finding suitable data. I was lucky enough that a couple of companies (Timeform and Raceform – thank you) sent me their databases which made the data collection side of things a lot easier than it might have been.

One item that was missing from both datasets was the starting prices. Because of this I could not really judge if the predictions would result in a profit. However I did a few calculations and assumed that the average odds were either 2/1 (3.00), evens (2.00) or 1/2 (1.50) (see the note below for a description of the odds calculation). I also made an assumption that the odds would also capture any tax that had to be paid.
Using these figures it was possible to make a profit even when the average odds were as low as 1/2 (1.50).

I wonder if it really is possible to develop a prediction system that can make a profit from backing horses? Although my undergraduate dissertation suggested that it is, it would need a lot more development, testing and analysis.
I would also like to investigate other methodologies, in addition to neural networks – but that needs a little more thinking about.

Of course, it’s not possible to predict the result of every race but you only need to predict enough races, at good enough odds, to show a profit.

One of the issues when betting is the amount of tax you have to pay but with new methods of betting (such as spread betting and betting exchanges) becoming ever more popular, perhaps this might not be so much of an issue.
I know that betting exchanges (such as betfair) still charge a tax but at least you are betting against other punters and are not limited by the odds being offered by the bookmakers.

I’ll keep this one on the back burner for a while, but I think there is some potential in exploring it further.

Note on odds: I have shown the odds in two ways. The UK way of expressing odds is (for example) 2/1 which means you have to place a stake of 1 unit to win 2 units. You also receive back your stake. So if you bet 1 unit at odds of 2/1, and the horse wins you receive 3 units back (the 2 units you won + your 1 unit stake = 3 units; less any tax – but let’s ignore that for the purpose of this discussion).
Another way of expressing odds is the decimal format (which I have shown in brackets). This is used, for example, on betfair. This says how much you will receive for 1 unit, including getting your stake back. So if you bet 1 unit at odds of 3.00, and the horse wins, you get 3 units back.
So the two formats are just different ways of expressing the same thing, but you might be more used to seeing one system over another, depending on where you live/bet.
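The conversion described in this note is easy to check with a few lines of Python. The function names are made up for this sketch. The last function is an extra that follows directly from the arithmetic: at decimal odds d, level stakes break even (ignoring tax) when the fraction of winning bets is 1/d.

```python
from fractions import Fraction


def fractional_to_decimal(fractional):
    """Convert UK odds such as '2/1' to decimal odds such as 3.0."""
    numerator, denominator = fractional.split("/")
    # Decimal odds include the returned stake, hence the +1.
    return float(Fraction(int(numerator), int(denominator)) + 1)


def total_returned(stake, decimal_odds):
    """Units received back (winnings plus stake) if the bet wins."""
    return stake * decimal_odds


def break_even_strike_rate(decimal_odds):
    """Fraction of bets that must win to break even at level stakes."""
    return 1.0 / decimal_odds


for uk_odds in ("2/1", "1/1", "1/2"):  # 1/1 is evens
    print(uk_odds, fractional_to_decimal(uk_odds))  # 3.0, then 2.0, then 1.5

print(total_returned(1, fractional_to_decimal("2/1")))  # 3.0
print(round(break_even_strike_rate(1.5), 3))            # 0.667
```

So at average odds of 1/2 (1.50), roughly two winners in every three bets are needed just to break even.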