Football (Soccer) Prediction: Data Collection (#002)

If you have read previous versions of this blog you’ll know that a) I have an interest in trying to predict football (soccer) matches and b) I am currently developing a system (as a research project) that I hope to get up and running in the next month or so.

What kept me busy for a lot of last season was collecting the data.

The fixtures were easy to collect, being readily available from various sources. As my basic reference guide I used the Sky Sports Yearbook, as this is a piece of published work that will be available to future generations. I could have used many of the web sites that are available but, as scientists, we don’t really like to rely on web sites as they may not stand the test of time. The fact that web sites are not peer reviewed is another factor we have to consider.

The results are also easy to collect as they are a matter of public record and are reported in the media and, from a research point of view, we can validate them for years to come (e.g. newspapers, next season’s Sky Sports Yearbook etc.).

Bookmakers’ Odds
One of the other things I wanted to collect was the bookmakers’ odds. This proved a lot more challenging, and there are a couple of problems. The first is that, as far as I am aware, the odds are not a matter of public record (at least not one that stands the test of time). Or, to put it another way, can you go and verify what the odds were for a given match at a given time, just by accessing publicly available information? Importantly, if two (or more) people are independently given the same task, will they come back with the same answer? And is it the correct answer anyway?
The second problem is that odds change over time, with the weight of money that has been bet.

Anyway, over the course of last season, I made regular visits to the bookmakers to pick up their fixed-odds coupons and I filed them away as evidence of the odds I was using.

Since carrying out the data collection (particularly the odds) I have discovered a couple of other interesting sources. I believe that the Racing Post (which, importantly, is also published as a daily paper, so is a matter of public record) publishes the best odds available on the fixtures for a given day.
I was also pointed to a web site recently ( This is a very good web site that not only has a lot of information but also has at least seven seasons’ worth of fixtures data, including results and odds information from a selection of bookmakers.

I have checked the odds I collected last season against the ones on this web site and they match up, which is encouraging.
It also provides me with more than just last season to carry out initial testing before I use the system in anger on this season.
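
For anyone wanting to repeat this kind of cross-check, here is a minimal sketch in Python of how it could be done. It is not the code I used. The key layout (date, home team, away team, bookmaker), the use of decimal odds, the tolerance and all of the sample values are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical key and value layout, chosen for this sketch only.
MatchKey = Tuple[str, str, str, str]  # (date, home team, away team, bookmaker)
Odds = Tuple[float, float, float]     # decimal odds: (home win, draw, away win)


def compare_odds(
    collected: Dict[MatchKey, Odds],
    reference: Dict[MatchKey, Odds],
    tolerance: float = 0.005,
) -> Tuple[List[MatchKey], List[MatchKey]]:
    """Return (missing, mismatched) keys when checking collected odds
    against a second, independent source."""
    missing: List[MatchKey] = []
    mismatched: List[MatchKey] = []
    for key, odds in collected.items():
        ref = reference.get(key)
        if ref is None:
            # The second source has no entry for this match and bookmaker.
            missing.append(key)
        elif any(abs(a - b) > tolerance for a, b in zip(odds, ref)):
            # At least one of the three prices differs by more than the tolerance.
            mismatched.append(key)
    return missing, mismatched


# Made-up sample data: team names, dates, bookmaker and prices are placeholders.
collected = {
    ("2000-01-01", "Team A", "Team B", "Bookmaker X"): (1.80, 3.40, 4.50),
    ("2000-01-01", "Team C", "Team D", "Bookmaker X"): (2.10, 3.20, 3.50),
    ("2000-01-02", "Team E", "Team F", "Bookmaker X"): (2.50, 3.10, 2.90),
}
reference = {
    ("2000-01-01", "Team A", "Team B", "Bookmaker X"): (1.80, 3.40, 4.50),
    ("2000-01-01", "Team C", "Team D", "Bookmaker X"): (2.10, 3.25, 3.50),
}

missing, mismatched = compare_odds(collected, reference)
print("Not found in second source:", missing)
print("Odds differ:", mismatched)
```

Because odds change over time, a mismatch does not necessarily mean that either source is wrong. It may only mean that the two sets of odds were recorded at different times, so each flagged match still needs checking by hand against the filed coupon.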

The downside of the web site is two-fold.

  1. I don’t think they will make their data available until a few hours before the kick off time. This might be a problem for the system that I am developing.
  2. The data is still only available on a web site so, from a research point of view, I should not really cite it, as the web site may not be available in 1/10/100 years’ time.

Please don’t take this as a criticism of the web site. They have done (and are doing) a fantastic job of collating all this data and, for this particular research project, it will save me HOURS of data collection and data entry time.