neilovan Posted August 8, 2018 (edited)

Hi All,

I am a systems/data analyst by profession and have been running machine learning algorithms over large datasets of European soccer results. The leagues are: Championship, English Premier, Scotland, Holland Eredivisie, Germany Bundesliga 1, Spain LaLiga, Turkey Super Lig, Belgium Pro, Portugal, Italy Serie A and French Ligue 1.

One of the bets that interests me is the full-time draw, so I have written some machine learning models that have been trained to predict just that. For a model to be successful it must beat the base rate for that particular bet.

Looking at the table below for 2017/2018, Belgium Pro had the highest draw percentage (28%) while Portugal had the lowest (20%). The four columns on the right are 0-0, 1-1, 2-2 and 3-3 draws as a percentage of all results. As an example, the Championship had the highest percentage of 0-0 draws (9%) while LaLiga had the lowest percentage of 1-1 draws (8%).

A decent model must beat the base rate, so Belgium full-time draw predictions should have a strike rate of at least 28%.

2017/2018 DRAW STATS (% of results)

League              Draw %  0-0  1-1  2-2  3-3
Belgium Pro             28    6   13    6    2
Championship            27    9   12    5    1
Germany Bundes 1        27    7   13    6    1
English Premier         26    8   12    5    1
French Ligue 1          25    6   12    6    2
Scottish Premier        25    8   12    4    0
Holland Eredivisie      24    4   11    6    2
Spanish LaLiga          23    7    8    6    1
Turkey Super Lig        22    6   11    4    2
Italy Serie A           22    7   11    3    1
Portugal Primeira       20    6   10    3    1

Seasons 2009 to 2017 draw averages by league looked like this.
Championship  0.2748129675810474  (so 27.48%, etc.)
French        0.2733241188666206
Italy         0.2595656670113754
Turkey        0.25154541131716596
Germany       0.2508269018743109
Belgium       0.24920969441517388
EPL           0.2542722451384797
Portugal      0.24693777560019597
Scotland      0.24314536989136057
Holland       0.232821341956346
Spain         0.22832167832167832

OK, the models are done. I will post predictions for the following three leagues (see model results below) every week. I would expect at least three times as many games per season in each league as the results below show.

2009  Italy          4  2  2   4.80  2   2.80    70.00 %
2010  Italy          6  4  2   8.45  2   6.45   107.50 %
2011  Italy          6  3  3   6.95  3   3.95    65.83 %
2012  Italy          4  0  4   0.00  4  -4.00  -100.00 %
2013  Italy          7  3  4   6.70  4   2.70    38.57 %
2014  Italy          3  2  1   4.50  1   3.50   116.67 %
2015  Italy          3  1  2   2.75  2   0.75    25.00 %
2016  Italy         10  1  9   2.60  9  -6.40   -64.00 %
2017  Italy          6  1  5   2.25  5  -2.75   -45.83 %
Total games 49, wins 17. Total profit or loss 7.0, ROI 14.29 %, strike rate 0.3469387755102041

2009  Germany        7  3  4   7.50  4   3.50    50.00 %
2010  Germany        4  1  3   2.60  3  -0.40   -10.00 %
2011  Germany       11  4  7   9.45  7   2.45    22.27 %
2012  Germany        2  0  2   0.00  2  -2.00  -100.00 %
2013  Germany        8  4  4   9.60  4   5.60    70.00 %
2014  Germany        8  2  6   4.60  6  -1.40   -17.50 %
2015  Germany        7  1  6   2.30  6  -3.70   -52.86 %
2016  Germany        6  0  6   0.00  6  -6.00  -100.00 %
2017  Germany        8  5  3  11.43  3   8.43   105.38 %
Total games 61, wins 20. Total profit or loss 6.48, ROI 10.62 %, strike rate 0.32786885245901637

2009  Championship   6  2  4   4.80  4   0.80    13.33 %
2010  Championship   7  0  7   0.00  7  -7.00  -100.00 %
2011  Championship   7  1  6   2.25  6  -3.75   -53.57 %
2012  Championship   7  2  5   4.80  5  -0.20    -2.86 %
2013  Championship   5  3  2   7.30  2   5.30   106.00 %
2014  Championship   5  2  3   4.80  3   1.80    36.00 %
2015  Championship   9  4  5   9.80  5   4.80    53.33 %
2016  Championship   6  1  5   2.70  5  -2.30   -38.33 %
2017  Championship   6  4  2  10.15  2   8.15   135.83 %
Total games 58, wins 19. Total profit or loss 7.6, ROI 13.10 %, strike rate 0.3275862068965517

Looking forward to your company, opinions etc. in a winning 2018/2019 season. All the best to you.
Edited August 8, 2018 by neilovan
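For anyone who wants to check the summary lines, here is a minimal sketch that recomputes the Italy totals from the season rows in the first post. It is not the model itself. It assumes level 1-unit stakes and reads the columns as games, wins, losses, amount won (net) and amount lost, following the headings the author gives later in the thread; the names `ITALY` and `summarize` are illustrative only.

```python
# Italy season rows as posted: (year, games, wins, losses, amount_won, amount_lost).
# Assumption: 1-unit level stakes, amount_won is the net return on winning bets.
ITALY = [
    (2009, 4, 2, 2, 4.80, 2),
    (2010, 6, 4, 2, 8.45, 2),
    (2011, 6, 3, 3, 6.95, 3),
    (2012, 4, 0, 4, 0.00, 4),
    (2013, 7, 3, 4, 6.70, 4),
    (2014, 3, 2, 1, 4.50, 1),
    (2015, 3, 1, 2, 2.75, 2),
    (2016, 10, 1, 9, 2.60, 9),
    (2017, 6, 1, 5, 2.25, 5),
]

# Italy's 2009-2017 draw average quoted in the post (the base rate to beat).
ITALY_BASE_RATE = 0.2595656670113754


def summarize(rows):
    """Aggregate per-season rows into games, wins, profit, ROI and strike rate."""
    games = sum(r[1] for r in rows)
    wins = sum(r[2] for r in rows)
    profit = sum(r[4] - r[5] for r in rows)
    return {
        "games": games,
        "wins": wins,
        "profit": round(profit, 2),
        "roi_pct": round(100 * profit / games, 2),  # profit per unit staked
        "strike_rate": wins / games,
    }


summary = summarize(ITALY)
print(summary)
print("beats base rate:", summary["strike_rate"] > ITALY_BASE_RATE)
```

The same function can be pointed at the Germany or Championship rows to check those totals against their own league averages.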
neilovan (Author) Posted August 8, 2018

No Championship fixtures fit the criteria this weekend of 9th August. Italy and Spain have not started yet.
real55555 Posted August 9, 2018

It would be more useful if you could include column headings in your model results. From what I can understand, throughout the Italy Serie A seasons 2009 to 2017 only 49 games fit your criteria. I'd say that sample is too small. After 49 games your profit is only 7.0 units, which means that had 2 of the 17 winning draws ended instead as a home or away win, you'd only be breaking even or somewhere around positive 1 unit.

To summarize:
1. Sample size too low / criteria too strict to produce a good sample size for backtesting.
2. Based on current results, a swing of just 2-3 matches could put the result in the red, which I consider to be very risky.
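The small-sample concern can be made concrete with a rough sketch. This is an illustration, not either poster's method. It assumes each selected game is an independent trial whose draw probability equals Italy's 2009-2017 average from the first post, and asks how often 17 or more draws in 49 games would turn up by chance alone. It also reworks the "two wins flipped" point, assuming 1-unit stakes and that every win pays the average net return implied by the posted totals. The function names are hypothetical.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


GAMES, WINS, PROFIT = 49, 17, 7.0   # Italy totals from the first post
BASE_RATE = 0.2595656670113754      # Italy 2009-2017 draw average from the post

# Chance of seeing at least 17 draws in 49 picks if the picks were no better
# than the league base rate (assumes independent games).
p_chance = binomial_tail(GAMES, WINS, BASE_RATE)

# Average net return per winning bet implied by the totals (1-unit stakes):
# profit = net_won - losses, so net_won = profit + losses.
losses = GAMES - WINS
avg_net_win = (PROFIT + losses) / WINS


def profit_if_flipped(n_flipped):
    """Profit if n_flipped winning draws had instead been losing bets."""
    # Each flipped bet gives up its net return and also loses the 1-unit stake.
    return PROFIT - n_flipped * (avg_net_win + 1)


print(f"P(>= {WINS} draws in {GAMES} at base rate) = {p_chance:.3f}")
for n in (1, 2, 3):
    print(f"{n} win(s) flipped -> profit {profit_if_flipped(n):.2f}")
```

Under these assumptions, flipping two wins leaves a profit of roughly 0.4 units and flipping three puts the record in the red, which is in line with the estimate in the post above.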
neilovan (Author) Posted August 9, 2018 (edited)

In reply to real55555:

When you develop ML models you split the data into a training set (to train the model) and an unseen testing set (to see how accurate the model is). My split is 30% test / 70% train, so for all the different leagues you can treble (at least) the number of games selected. In that period the model selected 49 games, but those results come from only about a third of the data, so over the full data it would be closer to 150 games.

Year  League  Games  W  L  Amnt Won  Amnt Lost    P/L        ROI
2009  Italy       4  2  2      4.80          2   2.80    70.00 %
2010  Italy       6  4  2      8.45          2   6.45   107.50 %
2011  Italy       6  3  3      6.95          3   3.95    65.83 %
2012  Italy       4  0  4      0.00          4  -4.00  -100.00 %
2013  Italy       7  3  4      6.70          4   2.70    38.57 %
2014  Italy       3  2  1      4.50          1   3.50   116.67 %
2015  Italy       3  1  2      2.75          2   0.75    25.00 %
2016  Italy      10  1  9      2.60          9  -6.40   -64.00 %
2017  Italy       6  1  5      2.25          5  -2.75   -45.83 %
Total games 49, wins 17. Total profit or loss 7.0, ROI 14.29 %

In 2017 Italy had 6 games, Germany 8 and the Championship only 6. So from this model (I run one model for these three leagues) I would expect 60 or so predictions for the season. The model is only selecting about 5% of the fixtures in a season.

But for me three things are important here:
1) That the model predicts well for unseen data from three divisions, in the same processing run.
2) That the model wins. I don't think you should look at the seasons in isolation, because in the very short term anything can happen.
3) That the model minimizes losses. I would rather have a model with a higher threshold (fewer games selected) than one that loses predictions which could have been avoided. So I have gone for higher thresholds, trusting the intrinsic nature of the model's inputs.

Look, it's only an experiment that I thought I would share throughout the season. Just a bit of interest with a decent result at the end of 2018/2019.

Edited August 9, 2018 by neilovan
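The scaling argument above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions, not the author's code. It assumes the 30% test set is a random sample spread evenly across seasons and leagues, so counts scale by 1/0.3. The fixture counts per season are also an assumption here: they are derived from league sizes of 20, 18 and 24 teams playing home and away, and are not given in the thread.

```python
TEST_FRACTION = 0.30  # 30% test / 70% train split stated in the post


def scale_to_full_data(test_count, test_fraction=TEST_FRACTION):
    """Estimate selections over all data from the count seen in the test set."""
    return test_count / test_fraction


# Italy 2009-2017: 49 selections in the test set.
italy_full = scale_to_full_data(49)

# 2017 test-set selections for Italy, Germany and the Championship.
picks_2017_test = 6 + 8 + 6
season_picks = scale_to_full_data(picks_2017_test)

# Assumed fixtures per season: n teams, home and away, gives n * (n - 1) games.
FIXTURES = {"Italy": 20 * 19, "Germany": 18 * 17, "Championship": 24 * 23}
share_selected = season_picks / sum(FIXTURES.values())

print(f"Italy selections over full data: about {italy_full:.0f}")
print(f"Expected selections per season:  about {season_picks:.0f}")
print(f"Share of fixtures selected:      {share_selected:.1%}")
```

With these inputs the estimates come out at roughly 163 games for Italy, about 67 selections per season and a little over 5% of fixtures, which matches the rough figures quoted in the post.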
Data Posted August 9, 2018

Best of luck with this @neilovan

I've had over 20 years with ML models, wouldn't go anywhere without them, and my betting wallet has seen the benefit.

Just a personal view here, but I find it strange that you've focused your attention on football draws. Your past data will show that, of level-stake bets on each of home win, draw and away win, the biggest loss is in the draw column. Does this suggest that the bookies tend to underprice the draw because it is so difficult to predict with any degree of certainty?

Again, I won't knock your project here, but maybe turning your attention to home wins could (would?) allow a higher win % from more bets? But I'm probably preaching to the converted and you have tried that route and dismissed it.

However, fortune favours the brave, so good hunting. I'll be interested to see how it pans out.
real55555 Posted August 9, 2018

In reply to neilovan:

Best of luck, hope it works out for you.

Still, I would like to point out that sample size is everything when betting on a fixed set of criteria or statistics. I have gone down this route myself (backtesting on draws) but with mixed results, because I think it is not easy to predict a draw, plus it is a bet that tends to be underpriced because it is an unattractive bet. People like to see teams either win or lose rather than draw, so this market tends to be underbought and should offer better value, but somehow I have not been able to get results convincing enough for me to place actual stakes on it. Having said that, I must admit I am not the best at statistics or at coming up with prediction models.
neilovan (Author) Posted August 9, 2018

In reply to Data:

Hello,

This is just one part of what I am working on. My main interest is in over/under 2.5 goals prediction. For me it is a fundamentally stronger bet than a 1X2 bet, because there are only 2 outcomes. I have models that are ready for this, as well as for home wins at long-shot odds. Hopefully all three are winners come season's end.
neilovan (Author) Posted August 9, 2018

In reply to real55555:

I have another model that analysed every game in the EPL from 2009 to 2017 for long-odds home wins. I agree that a larger sample size (for training and testing) cannot hurt. But let's see how these 65 games go.
Xtc12 Posted August 9, 2018

Thanks, will be looking in ...... good luck!
neilovan (Author) Posted August 13, 2018 (edited)

Model predicts the following FT draws for upcoming fixtures:

   country_div      match_date  h_team     a_team       Draw
0  English Premier  8/18/2018   West Ham   Bournemouth  2.58
4  Championship     8/18/2018   Reading    Bolton       2.45
7  Championship     8/21/2018   Rotherham  Hull         N/A
9  Championship     8/22/2018   Bolton     Birmingham   N/A

Edited August 13, 2018 by neilovan
neilovan (Author) Posted August 26, 2018

4 losses out of 4; maybe this needs some more work.
Drawsandmore Posted August 31, 2018

Hi neilovan

4 losses out of 4 is a disappointing start, but it doesn't necessarily mean your system needs more work. If you had started with 4 wins from 4 it would have been a great start... But would you have been convinced that you had found the perfect system, and would never need to change anything about it again? I suspect not.

As others on here have said, it's all about the sample size. If you believed in your method at the outset, you should stick with it, for now at least. And I bet that in both your testing and training data you have experienced longer losing runs.

Good luck with this.
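To put a number on how unremarkable four straight losses are, here is a small sketch. It is an illustration only. It assumes every bet is independent and wins with the Championship backtest strike rate of 19/58 from the first post, which may not hold for the live selections. The helper names are hypothetical.

```python
def prob_all_lose(strike_rate, n_bets):
    """Probability that n_bets independent bets all lose."""
    return (1 - strike_rate) ** n_bets


def prob_losing_run(n_bets, run_length, strike_rate):
    """Probability of at least one losing run of run_length within n_bets."""
    q = 1 - strike_rate
    # state[j] = probability the current losing streak is j and no run
    # of run_length has been completed so far.
    state = [1.0] + [0.0] * (run_length - 1)
    hit = 0.0
    for _ in range(n_bets):
        new = [0.0] * run_length
        new[0] = sum(state) * strike_rate      # a win resets the streak
        for j in range(run_length - 1):
            new[j + 1] = state[j] * q          # a loss extends the streak
        hit += state[run_length - 1] * q       # this loss completes the run
        state = new
    return hit


STRIKE = 19 / 58  # Championship backtest strike rate from the first post

p_first_four = prob_all_lose(STRIKE, 4)
p_run_in_60 = prob_losing_run(60, 4, STRIKE)

print(f"P(first 4 bets all lose)         = {p_first_four:.3f}")
print(f"P(a run of 4 losses in 60 bets)  = {p_run_in_60:.3f}")
```

Under these assumptions the first four bets all lose about one time in five, and a run of four losses somewhere in a 60-bet season is very likely.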
paparainbow Posted November 16, 2018

Do you use logistic regression or maybe random forest? Also, what features did you use?
liero1 Posted November 19, 2018

Not sure if I understand correctly, but are your sample sizes 58, 61 and 49? I don't think you can test anything on a sample that small. But again, I might be misunderstanding something here.