Building a football system with data mining



Evening all, I am a first-time poster here, so "Hi!". To cut a long story short: I am a geek. I specialise in data and am looking to build a football 1x2 system. I am here today to get some advice from more knowledgeable people. I have built an SQL database from all the data available at http://www.football-data.co.uk/ and plan on data mining it to find trends which might not otherwise be easily visible. Are there any other data sources on the net available for downloading, viewing on a website or purchasing? Opta is without question the holy grail for this kind of stuff, though I'd imagine they are a little expensive. :) Anyway, my second question: form is normally viewed over the last 6 matches. What are people's thoughts on this? Why 6 matches and not, say, 8 or 3? If analysing home form or away form separately, surely 6 matches is too long a period to view? (After all, it is roughly a third of a season's home games...) Any thoughts on the above are welcome, and if there is interest in this project I'll post progress on this thread going forward. Cheers, AM


Welcome. The website whoscored.com is very detailed and may be worth a look. I actually think form should be 3 matches, not even 6. Usually it's a space issue, hence no more than 8; most publications and websites like to show 6 because, in the old days when the football pools were the major means of gambling on football, the national papers showed 6. The most important thing in form is actually looking at the teams played in the last few games. Sometimes a team will play 4 of the top 8 in a 6-game run, and it's very important to take account of the standard of the opposition. Vice versa, a team may have had a run of low-quality opposition, which is why a ranking for all teams is important. Some teams have bogey teams they just never seem to do well against. Other key factors are missing players: some teams fail to win in a run of games with one or more key players out.

The above are standard factors that most people can tell you about, and this kind of mining has been done numerous times. The data you are using from www.football-data.co.uk has odds attached from many bookmakers. My best advice there is to use one bookmaker and stick to it. You may see, for example, that teams priced at 2-1 only win 20% of their games. So many factors play a role that no system has been devised that takes account of all of them and gives each the right weighting, as it is a dynamic environment. The perfect model would change week to week and would actually include the weather (rain, wind etc.) and even what kit the team was wearing. Some teams can use their home kits for away games when there is a clash of colours; research has shown these teams do better. There are lots of papers on it that go into probability theory with Poisson distributions etc.

Lady Luck plays a role, as there are always surprising results. This may be way out, but it is key: I suggest you model Lady Luck by looking at astrological factors. You would take account of the birthdays of the players you expect to play and see if their zodiac sign is in the ascendancy or not. Do this for each team and you will get a measure of the well-being of the team on a particular match day. Where there is a clear advantage, some weighted variable can be introduced. The piece of research that needs doing is going back over past data where shocks occurred, maybe teams that were bigger than 5-1 to win, then having a look at the actual players that took part and their birthdays (in relation to the time of year). I believe a pattern or advantage will emerge. Yep, I said it was way out!!!!! Good luck


I usually consider the last 5 games home / away and the last 5 games overall. I know 5 home games is a lot, but fewer than that can be very tricky if, for example, they have played at home against #1, #2 and #3... and did not win any of those 3 games...
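One way to experiment with the window length and the strength of opposition at the same time is to report raw form alongside the average quality of the teams faced. This is only a minimal sketch: the function name `recent_form`, the `(result, opponent_ppg)` input shape and the use of the opponent's points per game as a strength measure are all assumptions made for illustration, not anything used by the posters.

```python
# Points awarded for each result from the team's own point of view.
POINTS = {"W": 3, "D": 1, "L": 0}


def recent_form(matches, window=6):
    """Summarise a team's last `window` matches.

    `matches` is a list of (result, opponent_ppg) tuples, oldest first,
    where result is "W", "D" or "L" and opponent_ppg is the opponent's
    points per game (a hypothetical stand-in for opposition strength).

    Returns (points_per_game, mean_opponent_ppg).
    """
    recent = matches[-window:]
    if not recent:
        raise ValueError("no matches to summarise")
    points_per_game = sum(POINTS[result] for result, _ in recent) / len(recent)
    mean_opponent_ppg = sum(ppg for _, ppg in recent) / len(recent)
    return points_per_game, mean_opponent_ppg


# Three straight home defeats look bad until the opposition is considered.
history = [("W", 1.0), ("L", 2.5), ("L", 2.0), ("L", 2.25)]
print(recent_form(history, window=3))
```

Running the same history through several values of `window` (3, 5, 6, 8) gives a quick way to compare how sensitive the form figure is to the number of matches considered.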


Welcome AM. There are probably a lot of inferred attributes you'd want to calculate from the F-D.co.uk data. My tip would be to calculate long-term class and short-term form variables, but look beyond the result. I look on with interest; you are today where I was 10 years ago, and I'm still going. It's such a great challenge. Matt


Thank you for all the replies... this is exactly the sort of information I was hoping to discuss. I am aware of the whoscored.com website, as the owner and creator is a Villa fan, which I am also... hence my cryptic username. I guess there are many layers of consideration which need to go into accurately predicting the probability of a football match result, with the most decisive factor being league position and going all the way down to less significant factors such as shirt colour. For the majority of these, the difficulty won't actually be analysing the data, it will be obtaining it! The man-hours needed to create and maintain a database containing weather, shirt colour, attendance etc. would be huge. So I am hoping that by refining fairly high-level data I may be able to obtain some edge over the bookie (there is also the viewpoint that too much information may be detrimental to the overall system).

I'll try to put some thought into the form conundrum over the next week or so and do some sort of analysis of the number of matches a team should be judged over. However, as some have already stated, you would need to rank the opponents' strengths in these matches to see whether the team was expected to win them or not. If they did win, were they lucky to win? For example, Southampton vs Villa this season: Villa had 20%ish possession and 3 shots, Southampton had 80% and a huge number of shots, and Villa won 3-2. This was indeed a fortunate win for Villa and so should be factored into the form analysis... somehow.

I could consider harvesting whoscored.com to obtain all the information on the players and their ratings so far this season. It would probably take a little while to set up, though if I can think of a reason to do so it might be worth the time investment. The issue I have with looking at individual players is again the amount of time and effort it would take to look at starting lineups for an upcoming match, especially as lineups aren't normally announced until an hour or two before kick-off. Thanks for the responses so far.


Hi, I'm another "geek" and first-time poster, trying to do the same thing you are with the same dataset (actually, I could have written your post myself, except I use LINQ and not SQL ;) ). May I ask how you measure your model's performance? I don't really like benchmarking through a simple accuracy measure (i.e. the best model is the one that forecasts the highest number of correct results), since I strongly believe that any forecast must be formulated in terms of probability. I had the same doubt as you about the number of previous matches that should be considered, and I am currently experimenting to find an answer :( P.S. I'm Italian, sorry for my poor English


I thought you guys may be interested in this:

THE FINK TANK VALUE SYSTEM
Powerful Football Ratings: Fink Tank
Have a play with it here: http://www.dectech.co.uk/football/index.php

Basically, Fink Tank is a free football predictor that is provided by Dectech and sponsored by The Times newspaper. Fink Tank is named after Daniel Finkelstein, who writes regular columns in The Times, although it was originally created by Henry Scott and Alex Morton. On any page you can click "Help!" to display some hints and tips on how to use that page. On the main page, there are predictions for English Premier League matches occurring in the next 7 days. If you want to look at predictions for other divisions, just click on the relevant link in the "divisions" tab on the left. You can sort the predictions by clicking on the relevant heading. Double-clicking on any match will take you to the Game Simulator, where you can look at more information about each match and adjust the team ratings.


Zenagian, the system needs to accurately produce a probability, as a percentage, for each of a home win, draw and away win. The ultimate test is this: if the system's probability for one of these results is higher than the probability implied by the bookies' odds, place a bet. If over a period of time these bets return a positive value, then the system is indeed a winner.

Using the Analysis Services data mining tools in SSMS, the idea is to create as many inputs for a match as possible and then let the data mining decision tree advise which of these inputs have the most relevance to the outcome. For example, I have a list of Premier League matches going back 13 years and the outcome of these matches. I have created an SQL script which provides the league table as of the date of each of these matches, and from this league table you can create some "scoring metrics" to input into the data mining model. For example, if Aston Villa are to play Newcastle, and Villa have an average of 1.2 points per game and Newcastle 0.8, then an input into the model would be "Average Points per Game Diff: 0.4". Once you have calculated this figure for every match going back 13 years, along with many other inputs, you tell the data mining to do its thing; it comes back with the most relevant fields for predicting a result and, based on its findings, assigns a probability to each of the home, draw and away results.

From the very basic mining I have done so far, I can advise that if the home team's average goal difference per game minus the away team's average goal difference per game is greater than 0.4, then there is a 72% chance of the home team winning. If you had placed a bet on every home team where this goal-difference gap is greater than 0.4 and the bookies' odds imply a home-win probability of less than 72%, you would have made money over the last 13 years. Not much money, but some. You can then data mine these results to find any trends behind them, reducing the number of bets made and increasing the profitability. So on and so forth...

Anyway, the key to all of this is the inputs which you put into the data mining model to begin with. The example above is an incredibly straightforward one, and one which will never gain you much of an advantage over the bookies due to its simplicity. The trick, I believe (though I could be wrong), will be adding complexity to find an edge which the bookies may not already have covered in their prices. For example, if you were to devise an input which accurately scored recent form (as mentioned in an earlier post, something along the lines of team-strength weighting and luck), then you may just find an edge when it is combined with other relevant inputs. I apologise if none of the above makes much sense... I've had a very long and stressful 2 weeks at work, am extremely tired and have a young baby currently screaming in my ear, so I'm not on top form when it comes to thinking or communicating clearly, but hopefully you get the gist of what I am trying to say.

Andypaps28 - thank you for that link. All information such as this is important to understand. However, it could be that I am missing something on this link, but it doesn't show the historical performance of how accurate it is, does it? That would be very interesting to see, as it would give an idea as to whether using past games' shots and goals (which would appear to be what they use, according to the info page) can accurately predict the outcome of future matches...
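The betting test described in this post can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the poster's SQL or Analysis Services setup: odds are taken to be decimal odds, the bookmaker's margin is ignored when converting odds to a probability, stakes are a flat one unit, and every function name here is hypothetical.

```python
def implied_probability(decimal_odds):
    """Probability implied by decimal odds, ignoring the bookmaker's margin."""
    if decimal_odds <= 1.0:
        raise ValueError("decimal odds must be greater than 1.0")
    return 1.0 / decimal_odds


def points_per_game_diff(home_points, home_games, away_points, away_games):
    """Model input: home points per game minus away points per game."""
    return home_points / home_games - away_points / away_games


def is_value_bet(model_probability, decimal_odds):
    """Bet only when the model rates the outcome above the odds-implied chance."""
    return model_probability > implied_probability(decimal_odds)


def flat_stake_profit(bets):
    """Profit in units from one-unit stakes.

    `bets` is a list of (decimal_odds, won) tuples.
    """
    return sum((odds - 1.0) if won else -1.0 for odds, won in bets)


# Villa on 1.2 points per game against Newcastle on 0.8 (20 games each).
print(round(points_per_game_diff(24, 20, 16, 20), 2))

# A 72% home-win estimate is value at odds of 1.50, but not at 1.30.
print(is_value_bet(0.72, 1.50), is_value_bet(0.72, 1.30))
```

Summing `flat_stake_profit` over every historical bet the rule would have triggered gives the "positive value over a period of time" check in its simplest form.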


Found the following article on the Fink Tank: http://www.sotdoc.co.uk/an-analysis-of-the-fink-tank-football-bets-18-august-2012-29th-april-2013/ They conclude:

"In conclusion, football predictive models will not make you rich unless you have a large bank and trigger thousands of bets a season."

I love a challenge!! :D

"Andypaps28 - thank you for that link. All information such as this is important to understand. However, it could be that I am missing something on this link but it doesn't show historical performance of how accurate it is does it? That would be very interesting to see as it would give an idea as to whether using past games shots and goals (which would appear to be what they use according to the info page) can accurately predict the outcome of future matches..."

Finktank are profitable. I know this first hand. Use their percentages and convert them to a price, choose the size of value you require, say 10% value, and place the bets. This generally highlights draws and away wins on unfancied teams, but they've landed some absolutely huge prices this year. There have been a few threads on here trying to find profit from the FinkTank ratings. PS - I can help you with some data to understand what angles you may or may not want to pursue. Search for my other recent posts.
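Converting a published percentage to a price and filtering on a minimum value could look something like the sketch below. It assumes decimal odds and reads "10% value" as an expected return of at least 10% of the stake; that reading, the function names and the sample figures are all illustrative assumptions rather than the poster's actual method.

```python
def fair_odds(probability):
    """Decimal price at which a bet with this win probability breaks even."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return 1.0 / probability


def edge(probability, offered_odds):
    """Expected profit per unit staked at the offered decimal odds."""
    return probability * offered_odds - 1.0


def select_bets(predictions, min_edge=0.10):
    """Return the labels of predictions whose edge meets the threshold.

    `predictions` is a list of (label, probability, offered_odds) tuples.
    """
    return [label for label, prob, odds in predictions
            if edge(prob, odds) >= min_edge]


# Made-up example: a 25% away win rated fair at 4.00 but offered at 5.00.
predictions = [("away win", 0.25, 5.0), ("home win", 0.50, 2.0)]
print(fair_odds(0.25), select_bets(predictions))
```

Because long-priced outcomes gain more from a given error in the rating, a filter like this will tend to pick out draws and outsiders, which matches what the post reports.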

"Zenagian, the system needs to accruately create a probability, as a percentage, of a Home win, draw and Away win. The ultimate test being if the probability of one of these results is higher than the probability which the bookies odds would suggest then place a bet. If over a period of time these bets return a positive value then the system is indeed a winner."

Maybe I misunderstood you, but it seems to me that there is a small flaw in your methodology: you mine your dataset for patterns to exploit and you build a betting system on those patterns; so far so good. But then you test the betting system on the same dataset you mined, giving the system an information advantage it would not have in a real application. I suggest you split your dataset into two subsets, a "mining" one and a "test" one; this way you will have a much more realistic idea of the "power" of your system. I apologize if I have not understood what you meant in your previous post, and for any mistakes I made in writing this one. P.S. I did not invent any of this; it is standard methodology when you're dealing with this kind of research. Actually, you should have three datasets: one to be mined, one to test different models and one to validate the best one.

So you use ROI to evaluate your model's performance? I'm trying to obtain a direct measure of the model's ability to approximate the "real" probability of the event outcome. I've read that some researchers use Euclidean mean squared error (you need to label your data properly and store the estimated percentages in a vector) or the geometric mean of the estimated probabilities of the events that occurred, but these measures don't really satisfy me.
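Both measures mentioned here are easy to compute once each forecast is stored as a probability per outcome. The sketch below is only one possible reading of them: the squared error is taken between the forecast vector and a 0/1 vector marking the actual result (often called a Brier score), and the outcome labels "H", "D", "A" and function names are assumptions for illustration.

```python
import math

OUTCOMES = ("H", "D", "A")  # home win, draw, away win


def squared_error(probs, result):
    """Squared distance between a forecast and the 0/1 vector of the result."""
    return sum((probs[o] - (1.0 if o == result else 0.0)) ** 2 for o in OUTCOMES)


def mean_squared_error(forecasts, results):
    """Average squared error over all matches (lower is better)."""
    return sum(squared_error(p, r) for p, r in zip(forecasts, results)) / len(results)


def geometric_mean_probability(forecasts, results):
    """Geometric mean of the probabilities given to what happened (higher is better)."""
    log_sum = sum(math.log(p[r]) for p, r in zip(forecasts, results))
    return math.exp(log_sum / len(results))


forecasts = [
    {"H": 0.50, "D": 0.25, "A": 0.25},
    {"H": 0.25, "D": 0.25, "A": 0.50},
]
results = ["H", "A"]
print(mean_squared_error(forecasts, results))
print(geometric_mean_probability(forecasts, results))
```

The geometric mean is the exponential of the average log probability, so it ranks models the same way as average log loss does; it also drops to zero if any outcome that occurred was given a probability of zero, which heavily penalises overconfident forecasts.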

I've done a bit of reading on the Fink Tank and yes, it seems to be a pretty good product... I'll do some more reading on it over the next couple of days. Zenagian, yes, I plan on using ROI as a test. You're absolutely right about the datasets you use... if you were to build your model on the whole of the data set, then all you would be doing is finding results in this set which may not be a true reflection of reality going forward. The SSMS data mining service allows you to set how much of the data set you wish to use to find trends, with the remainder used to calculate the results based on these findings. By default the amount of data it uses to find trends is 30%, though this is a setting which you can change if you so wish. Once the model has been created, you would then need to retrain it over time to ensure that it is constantly learning and adapting to the latest information. Still not made any progress on this project this week, so no update in terms of finding trends in form yet.
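Outside of the data mining tool, the mining / test / validation split suggested earlier in the thread can be done by hand. The sketch below splits by date so that the later sets only contain matches played after the mining set; splitting chronologically rather than at random, the 60/20/20 proportions, the `date` field name and the function name are all assumptions made for illustration.

```python
def chronological_split(matches, mining_fraction=0.6, test_fraction=0.2):
    """Split matches into (mining, test, validation) sets in date order.

    `matches` is a list of dicts with an ISO-format "date" string
    (YYYY-MM-DD), which sorts correctly as plain text.
    """
    ordered = sorted(matches, key=lambda match: match["date"])
    mining_end = round(len(ordered) * mining_fraction)
    test_end = mining_end + round(len(ordered) * test_fraction)
    return ordered[:mining_end], ordered[mining_end:test_end], ordered[test_end:]


# Ten dummy matches, deliberately out of order.
matches = [{"date": "2013-01-%02d" % day} for day in (5, 1, 9, 3, 7, 2, 10, 4, 8, 6)]
mining, test, validation = chronological_split(matches)
print(len(mining), len(test), len(validation))
```

Any rule found in the mining set is then scored on the test set, and only the single best rule is finally checked against the validation set.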


  • 2 weeks later...

Thanks for that link, uknowsit. When I have exhausted the wealth of data I currently have at my disposal on the main European leagues, I'll certainly start looking into these more obscure leagues. I've been quiet on this thread recently; however, I've spent a lot of time going through this data. I've found a few very small trends in the data which will make you a small amount of money (I'm talking about 1%, so when I say small I mean small). When I have a bit more time I'll come back with an update on the kinda stuff I've been looking at.


  • 2 weeks later...

I am also thinking of building a database, but am still in the planning phase, so I have a few suggestions. You can calculate the correlations in your data, for example finding out whether the last 3, 4 or however many results correlate most strongly with the match result; that way you can find the best values. For the strength of teams, Elo ratings could be appropriate. Furthermore, the Elo differences in the last matches can indicate whether teams are under- or over-achieving. Neural networks are interesting too if you find a good structure, though that may be difficult because of the chaotic behaviour of stats and match results.
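For reference, a basic Elo update for a single match looks like the sketch below. It uses the standard Elo expected-score formula with a draw scored as half a win; the K-factor of 20, the optional home-advantage bonus and the function names are assumptions chosen for illustration and would need tuning against real results.

```python
def expected_score(rating, opponent_rating):
    """Standard Elo expectation: between 0 and 1, 0.5 for equal ratings."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def update_elo(home_rating, away_rating, result, k=20.0, home_advantage=0.0):
    """Return the new (home, away) ratings after one match.

    `result` is "H", "D" or "A". The points the home side gains are
    exactly the points the away side loses.
    """
    actual = {"H": 1.0, "D": 0.5, "A": 0.0}[result]
    expected = expected_score(home_rating + home_advantage, away_rating)
    delta = k * (actual - expected)
    return home_rating + delta, away_rating - delta


print(update_elo(1500.0, 1500.0, "H"))
```

The term `actual - expected` is the over- or under-achievement in that match, so averaging it over a team's last few games gives one possible form indicator of the kind suggested in the post.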


  • 1 year later...

Seeing as this appears to be the last time Fink/Dectech was mentioned here: in case you don't know, the ratings aren't going to be published any more, at least not in their current "free" form. This is a shame, as I used them as a kind of sanity check against my own ratings. Guess I'll have to look elsewhere now.


  • 2 weeks later...
  • 4 weeks later...

I'm guessing not. I'm interested in how you housed and ran it all in SQL though; I've been looking to move more of the heavy lifting to something more scalable at the moment. When you say "model", what do you mean, Yelnahs? A stats model or a data model?


Hi Matthew, thanks for getting back. OK, this may be a long-winded response, so please bear with me. I've been working on this on and off for over a year and am looking to get it up and running in the coming months. I may need assistance with the final piece but will get to that later on. I am very lucky in that I'm a technologist first and foremost and I'm also a huge footy fan. So basically I can program, work with databases and have relevant football knowledge (i.e. I'm a nerd).

Right, the 'model', as I call it, was primarily built for Elo RateForm but has since evolved, as I started to notice that Elo on its own is not sufficient. It's all based in SQL. I get my stats from 12xpert over at football-data.co.uk (Joseph is an absolute gent, the work he does...). So, I have a SQL database with all the leagues supported over at football-data going back to 93/94. I have recently decided to concentrate on all English divisions as well as the top divisions for the SPL, Spain, Italy, Germany and France starting from 08/09, due to the amount of data available (mainly shots) and the way football has changed since 09ish.

What the 'model' currently does is iterate through matches one by one on a per-season basis, calculate a number of stats (current rateform etc.) and output them to another table in a new database. All of the stats below are available for every future fixture. By that I mean, for example, that after week 1 in October the stats are updated, and for week 2's fixtures they can then be recalled to help make a betting decision. The stats it produces per game are as follows (these apply to both home and away teams):

Team - Home and away teams
GamesPlayed - Self-explanatory
RateForm - Current rate forms of both teams at the time of kick-off
Kitty Per Team - The amount of RF points in the kitty per team (7% home, 5% away)
Kitty - Total kitty
Form - Form for the last 6 games, e.g. WWDDLL
ScoredGF - Total goals scored in the last 6 games
ConcededGF - Total goals conceded in the last 6 games
ScoreForm - Difference between goals scored and goals conceded over the last 6 games
GoalScoredForm - Goals scored in the last 6 games in 'string' format, e.g. 116230
GoalConcededForm - Goals conceded in the last 6 games in 'string' format, e.g. 001346
LeaguePosition - Each team's league position
LeaguePoints - Points per team
LeaguePointsPerGame - Points per game average
LeagueShots - Shots so far this season
LeagueShotsPerGame - Shots per game ratio
LeagueShotsOnTarget - Shots on target this season
LeagueShotsOnTargetPerGame - Shots on target per game ratio
LeagueShotConversion - Ratio of how many shots it takes per goal
LeagueShotOnTargetConversion - Ratio of how many shots on target it takes per goal
LeagueGoalsScored - Goals so far this season
LeagueGoalsScoredPerGame - Goals per game ratio
LeagueGoalsConceded - Goals conceded so far this season
LeagueGoalsConcededPerGame - Goals conceded per game ratio
LeagueShotsConceded - Shots conceded so far this season
LeagueShotsConcededPerGame - Per game ratio of shots conceded
LeagueShotsOnTargetConceded - Shots conceded that were on target
LeagueShotsOnTargetConcededPerGame - On-target shots conceded per game ratio
RateFormPosition - Where in the league each team currently stands in my rate form table
HomeOdds - Odds based on Pinnacle, uses Bet 365 if Pinnacle were unavailable
DrawOdds - Odds based on Pinnacle, uses Bet 365 if Pinnacle were unavailable
AwayOdds - Odds based on Pinnacle, uses Bet 365 if Pinnacle were unavailable
Season - What season
Division - What division

Right. Hopefully I haven't lost you. All the above stats are available per team before any game is played. Because the season isn't up and running at the moment, I am using historical data. But I can stop the above analysis mid-season and then simulate the rest of the season by placing dummy bets if I wish (currently my test model).

My problem is a good one, in that I have too much information and I really don't know what to do with it... Previously, when the system was just RateForm with minimal stats, I would (stupidly) run a query looking for all games where the home team is more than 500 rateform points above the away team and see what that brought back. If it made a historical 'profit' based on a season's worth of data, I would then roll it out over a few seasons and other leagues and would 'lose' a shed load. I'm now slightly older and a bit wiser, and I know that such a blanket parameter is not feasible. I need to use the data above to find that elusive 'value'.

So I suppose that's my question: how to use the above information to attempt to find value. I'm thinking of something like adding the defensive and attacking ratios (based on at least 10 games into the new season) and querying all data to find the win, draw, lose ratio, and working out the odds that way. There is also the Poisson distribution option based on the goal stats. Is there any sense in doing both and then taking the average odds? More than happy to share databases and anything else people might want (within reason of course). Hopefully this at least gets a conversation going. I suppose my end goal isn't to make bucket loads of money; I'd like to be able to outsmart the bookies from time to time, however ;)
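The Poisson option mentioned in this post can be sketched as follows. This is a minimal illustration, assuming the two teams' goal counts are independent Poisson variables and that an expected-goals figure per team has already been estimated somehow (for instance from the goals-per-game columns above; how best to do that is left open). The function names and the cut-off of 10 goals per side are illustrative choices.

```python
import math


def poisson_pmf(goals, rate):
    """Probability of exactly `goals` goals when the expected number is `rate`."""
    return math.exp(-rate) * rate ** goals / math.factorial(goals)


def match_probabilities(home_rate, away_rate, max_goals=10):
    """Return (home win, draw, away win) probabilities.

    Sums every scoreline up to max_goals per side, then rescales so the
    three figures add up to 1 despite the truncation.
    """
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_rate) * poisson_pmf(a, away_rate)
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    total = home + draw + away
    return home / total, draw / total, away / total


def average_probabilities(estimates):
    """Blend several (home, draw, away) estimates with equal weight."""
    return tuple(sum(column) / len(estimates) for column in zip(*estimates))


home, draw, away = match_probabilities(1.6, 1.1)
print(round(home, 3), round(draw, 3), round(away, 3))
```

On the question of averaging two methods: if both are used, it is usually cleaner to average the probabilities, as `average_probabilities` does, and convert to odds afterwards, because averaging odds directly does not in general give three prices whose implied probabilities add up to 1.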

