Machine Learning - FT Draws predictions

neilovan · August 8, 2018

Hi All,

I am a systems/data analyst by profession and have been running Machine Learning algorithms over large datasets of European soccer results.

The leagues are: Championship, English Premier, Scotland, Holland Eredivisie, Germany Bundesliga 1, Spain LaLiga, Turkey Super Lig, Belgium Pro, Portugal, Italy Serie A and French Ligue 1. One of the bets that interests me is the full-time draw, so I have written some machine learning models that have been trained to predict just that. For a model to be successful it must beat the base rate for that particular bet. So, looking at the table below for 2017/2018 ...

Belgium Pro had the highest draw % (28%) while Portugal had the lowest (20%). The four columns on the right are 0-0, 1-1, 2-2 and 3-3 draws as a percentage of all results. So, as an example, the Championship had the highest % of 0-0 draws (9%) while LaLiga had the lowest percentage of 1-1 draws (8%). A decent model must beat the base rate, so Belgium full-time draw predictions should have a strike rate of at least 28%.

2017/2018 DRAW STATS

League	Draw %	0-0	1-1	2-2	3-3
Belgium Pro	28	6	13	6	2
Championship	27	9	12	5	1
Germany Bundes 1	27	7	13	6	1
English Premier	26	8	12	5	1
French Ligue 1	25	6	12	6	2
Scottish Premier	25	8	12	4	0
Holland Eredivisie	24	4	11	6	2
Spanish LaLiga	23	7	8	6	1
Turkey Super Lig	22	6	11	4	2
Italy Serie A	22	7	11	3	1
Portugal Primeira	20	6	10	3	1

Seasons 2009 to 2017 draw averages by league looked like this.

Championship  0.2748129675810474  so 27.48% etc
French        0.2733241188666206
Italy         0.2595656670113754 
Turkey        0.25154541131716596
Germany       0.2508269018743109
Belgium       0.24920969441517388
EPL           0.2542722451384797
Portugal      0.24693777560019597 
Scotland      0.24314536989136057
Holland       0.232821341956346
Spain         0.22832167832167832
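
For anyone who wants to reproduce averages like these, here is a minimal sketch of how a per-league draw base rate could be computed. The `results` rows and the helper name `draw_rate_by_league` are made up for illustration; they are not the data or code behind the figures above.

```python
from collections import defaultdict

# Hypothetical sample rows: (league, home_goals, away_goals).
# Illustrative only, not the dataset used in this thread.
results = [
    ("Championship", 1, 1),
    ("Championship", 2, 0),
    ("Championship", 0, 0),
    ("Championship", 1, 3),
    ("Spain", 2, 2),
    ("Spain", 3, 1),
    ("Spain", 0, 1),
    ("Spain", 4, 0),
]


def draw_rate_by_league(rows):
    """Return {league: share of matches that ended level}."""
    games = defaultdict(int)
    draws = defaultdict(int)
    for league, home_goals, away_goals in rows:
        games[league] += 1
        if home_goals == away_goals:
            draws[league] += 1
    return {league: draws[league] / games[league] for league in games}


base_rates = draw_rate_by_league(results)
# Print leagues from highest to lowest draw rate.
for league, rate in sorted(base_rates.items(), key=lambda kv: -kv[1]):
    print(f"{league:<14}{rate:.2%}")
```

A model's strike rate on its selections can then be compared against the matching league's base rate.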

OK, models are done. I will post predictions for the following three leagues (see model results below) every week. Over a full season I would expect at least three times as many selections per league as the results below show.

 2009           Italy   4         2      2   4.80    2     2.80    70.00 %
 2010           Italy   6         4      2   8.45    2     6.45   107.50 %
 2011           Italy   6         3      3   6.95    3     3.95    65.83 %
 2012           Italy   4         0      4   0.00    4    -4.00  -100.00 %
 2013           Italy   7         3      4   6.70    4     2.70    38.57 %
 2014           Italy   3         2      1   4.50    1     3.50   116.67 %
 2015           Italy   3         1      2   2.75    2     0.75    25.00 %
 2016           Italy  10         1      9   2.60    9    -6.40   -64.00 %
 2017           Italy   6         1      5   2.25    5    -2.75   -45.83 %
Total games  49  wins  17   Total Profit or loss 7.0  ROI   14.29 %
Strike Rate  0.3469387755102041

 2009         Germany   7         3      4   7.50    4     3.50    50.00 %
 2010         Germany   4         1      3   2.60    3    -0.40   -10.00 %
 2011         Germany  11         4      7   9.45    7     2.45    22.27 %
 2012         Germany   2         0      2   0.00    2    -2.00  -100.00 %
 2013         Germany   8         4      4   9.60    4     5.60    70.00 %
 2014         Germany   8         2      6   4.60    6    -1.40   -17.50 %
 2015         Germany   7         1      6   2.30    6    -3.70   -52.86 %
 2016         Germany   6         0      6   0.00    6    -6.00  -100.00 %
 2017         Germany   8         5      3  11.43    3     8.43   105.38 %
Total games  61  wins  20   Total Profit or loss 6.48  ROI   10.62 %
Strike Rate  0.32786885245901637

 2009    Championship   6         2      4   4.80    4     0.80    13.33 %
 2010    Championship   7         0      7   0.00    7    -7.00  -100.00 %
 2011    Championship   7         1      6   2.25    6    -3.75   -53.57 %
 2012    Championship   7         2      5   4.80    5    -0.20    -2.86 %
 2013    Championship   5         3      2   7.30    2     5.30   106.00 %
 2014    Championship   5         2      3   4.80    3     1.80    36.00 %
 2015    Championship   9         4      5   9.80    5     4.80    53.33 %
 2016    Championship   6         1      5   2.70    5    -2.30   -38.33 %
 2017    Championship   6         4      2  10.15    2     8.15   135.83 %
Total games  58  wins  19   Total Profit or loss 7.6  ROI   13.10 %
Strike Rate  0.3275862068965517
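
The summary lines above can be re-derived from the totals. This is a small sketch, assuming a level stake of 1 unit per bet (an assumption, though it is consistent with the figures), so that ROI is simply profit divided by the number of bets. The names `totals` and `summarise` are illustrative.

```python
# Totals copied from the summary lines above: (bets, wins, profit or loss).
totals = {
    "Italy": (49, 17, 7.0),
    "Germany": (61, 20, 6.48),
    "Championship": (58, 19, 7.6),
}


def summarise(bets, wins, profit):
    """Strike rate and ROI, assuming a level stake of 1 unit per bet."""
    strike_rate = wins / bets
    roi = profit / bets  # profit divided by total amount staked
    return strike_rate, roi


summary = {league: summarise(*row) for league, row in totals.items()}
for league, (strike_rate, roi) in summary.items():
    print(f"{league:<14}strike rate {strike_rate:.2%}   ROI {roi:.2%}")
```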


Looking forward to your company/opinions etc  in a winning 2018/2019 season.

All the best to you.

Edited August 8, 2018 by neilovan

neilovan · August 8, 2018

No Championship fixtures fit the criteria for the weekend of 9th August. Italy and Spain have not started yet.

real55555 · August 9, 2018

It would be more useful if you could include column headings in your model results. From what I can understand, throughout Italy Serie A seasons 2009 - 2017 only 49 games fit your criteria. I'd say that sample is too small. After 49 games your profit or loss is only 7.0, which means that had 2 of the 17 wins somehow turned into a win for the home or away team, you'd only be breaking even or somewhere around positive 1 unit.

To summarize,

1. Sample size too low / Criteria too strict to produce good sample size for backtesting

2. Based on current results, a swing of just 2-3 match results could put you in the red, which I consider to be very risky.
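
Both points can be checked with a little arithmetic. The sketch below uses the Italy totals (49 bets, 17 wins, profit 7.0) and assumes 1-unit stakes, that any flipped win would have paid the average winning return, and that bets are independent. The function names are illustrative, and the binomial check is a rough sanity test rather than a full significance analysis.

```python
from math import comb

bets, wins, profit = 49, 17, 7.0   # Italy totals from the first post
losses = bets - wins               # 32 losing bets at an assumed 1-unit stake
winnings = profit + losses         # net return from the winning bets: 39.0
avg_win = winnings / wins          # average net return per winning bet


def profit_if_wins_flip(flipped):
    """Profit if `flipped` wins had lost instead (each assumed to pay avg_win)."""
    return profit - flipped * (avg_win + 1)


def prob_at_least(k, n, p):
    """Exact binomial probability of k or more successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


for flipped in (1, 2, 3):
    print(f"{flipped} wins flipped: profit {profit_if_wins_flip(flipped):+.2f}")

# Chance of 17+ draws in 49 bets if every pick were only as good as the
# 2009-2017 Italian draw average of about 26% quoted earlier in the thread.
p_luck = prob_at_least(wins, bets, 0.2596)
print(f"P(>= {wins} wins at the base rate alone) = {p_luck:.3f}")
```

Under these assumptions two flipped results leave the profit at roughly +0.4 units and three put it at about -2.9, which is the swing described above.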

neilovan · August 9, 2018

When you develop ML models you split the data into a training set (i.e. to train the model) and an unseen testing set (to see how accurate your model is). My split is 70% train / 30% test. So for all the different leagues you can treble (at least) the number of games selected. In that period the model selected 49 games, but it only looked at about a third of the data, so the real figure is closer to 150 games.
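
As a rough illustration of the 70/30 idea, here is a minimal sketch using only the standard library. The fixture IDs are placeholders, `split_train_test` is a hypothetical helper, and a random shuffle is an assumption: the post does not say whether the split was random or by date.

```python
import random

# Hypothetical fixture IDs standing in for the full set of matches.
fixtures = list(range(1000))


def split_train_test(items, test_share=0.30, seed=42):
    """Shuffle a copy of `items` and hold out `test_share` as unseen test data."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_train_test(fixtures)
print(len(train), len(test))

# Scaling the 49 Italian selections found in the 30% test set up to the
# whole data set, assuming the selection rate is the same everywhere.
selected_in_test = 49
estimated_total = selected_in_test / 0.30
print(f"about {estimated_total:.0f} selections across all the data")
```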

 year           league  Games     W      L   Amnt   Amnt   P/L      ROI
                                              Won   Lost    
 2009           Italy   4         2      2   4.80    2     2.80    70.00 %
 2010           Italy   6         4      2   8.45    2     6.45   107.50 %
 2011           Italy   6         3      3   6.95    3     3.95    65.83 %
 2012           Italy   4         0      4   0.00    4    -4.00  -100.00 %
 2013           Italy   7         3      4   6.70    4     2.70    38.57 %
 2014           Italy   3         2      1   4.50    1     3.50   116.67 %
 2015           Italy   3         1      2   2.75    2     0.75    25.00 %
 2016           Italy  10         1      9   2.60    9    -6.40   -64.00 %
 2017           Italy   6         1      5   2.25    5    -2.75   -45.83 %
Total games  49  wins  17   Total Profit or loss 7.0  ROI   14.29 %

In 2017 the test set had 6 games for Italy, 8 for Germany and only 6 for the Championship. So from this model (I run one model for these three leagues) I would expect 60 or so predictions for the season. The model is only selecting about 5% of the fixtures in a season. But for me three things are important here:

1) That the model predicts well for unseen data from three divisions, in the same processing run.

2) That the model wins. I don't think you should look at the seasons in isolation, because in the very short term anything can happen.

3) That the model minimizes losses. I would rather have a model with a higher threshold (fewer games selected) than one that loses predictions which could have been avoided. So I have gone for higher thresholds, trusting the intrinsic nature of the model's inputs.
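
The threshold trade-off in point 3 can be sketched as follows. The probabilities and outcomes are invented purely to show the mechanics; they are not output from the model described here, and `select` and `strike_rate` are hypothetical helpers.

```python
# Hypothetical model outputs: (predicted draw probability, was it a draw?).
# Illustrative values only, not output from the model in this thread.
scored = [
    (0.41, True), (0.38, False), (0.36, True), (0.33, False),
    (0.31, False), (0.30, True), (0.28, False), (0.27, False),
    (0.25, True), (0.22, False),
]


def select(predictions, threshold):
    """Keep only fixtures whose predicted draw probability meets the threshold."""
    return [(p, drew) for p, drew in predictions if p >= threshold]


def strike_rate(picks):
    """Share of selected fixtures that ended in a draw (0.0 if none selected)."""
    return sum(drew for _, drew in picks) / len(picks) if picks else 0.0


for threshold in (0.25, 0.30, 0.35):
    picks = select(scored, threshold)
    print(f"threshold {threshold:.2f}: {len(picks)} picks, "
          f"strike rate {strike_rate(picks):.0%}")
```

Raising the threshold shrinks the number of picks, which is the price paid for trying to avoid weaker selections.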

Look, it's only an experiment that I thought I would share throughout the season. Just a bit of interest with a decent result at the end of 2018/2019.

Edited August 9, 2018 by neilovan

Data · August 9, 2018

Best of luck with this @neilovan

I've had over 20 years with ML models, wouldn't go anywhere without them and my betting wallet has seen the benefit. Just a personal view here, but I find it strange that you've focused your attention on football draws. Your past data will show that a level-stake bet on each of home win, draw and away win will show the biggest loss in the draw column. Does this suggest that the bookies tend to underprice the draw because it is so difficult to predict with any degree of certainty?

Again, I won't knock your project here, but maybe turning your attention to home wins could (would?) allow a higher win% from more bets? But I'm probably preaching to the converted and you have tried that route and dismissed it.

However, fortune favours the brave, so good hunting. I'll be interested to see how it pans out.
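
The level-stake comparison mentioned above could be run along these lines. The odds and results are invented, so the printed numbers say nothing about which column really loses most; `level_stake_profit` is a hypothetical helper and decimal odds are assumed.

```python
# Hypothetical matches: (home odds, draw odds, away odds, result),
# with result "H", "D" or "A". Decimal odds; illustrative values only.
matches = [
    (1.80, 3.50, 4.50, "H"),
    (2.40, 3.20, 3.00, "D"),
    (1.50, 4.00, 6.50, "H"),
    (2.90, 3.10, 2.50, "A"),
    (2.10, 3.30, 3.60, "H"),
]


def level_stake_profit(rows, outcome):
    """Profit from staking 1 unit on `outcome` in every match."""
    column = {"H": 0, "D": 1, "A": 2}[outcome]
    profit = 0.0
    for row in rows:
        odds, result = row[column], row[3]
        # A winning bet returns odds - 1; a losing bet costs the 1-unit stake.
        profit += odds - 1 if result == outcome else -1
    return profit


for outcome in "HDA":
    print(outcome, f"{level_stake_profit(matches, outcome):+.2f}")
```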

real55555 · August 9, 2018

Best of luck, hope it works out for you. Still, I would like to point out that sample size is everything when betting on a fixed set of criteria or statistics. I have gone down this route myself (backtesting on draws) but with mixed results, because I think it is not easy to predict a draw. It is also a bet that tends to be underpriced because it is unattractive: people like to see teams either win or lose rather than draw, so this market tends to be underbought and should offer better value. Somehow, though, I have not been able to get results convincing enough for me to place actual stakes on it. Having said that, I must admit I am not the best at statistics or at coming up with prediction models.
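
Whether a draw price offers value comes down to comparing the odds with the chance of the draw actually happening. A minimal sketch, assuming decimal odds and ignoring the bookmaker's margin; the odds and the model probability below are hypothetical numbers, and both function names are illustrative.

```python
def implied_probability(decimal_odds):
    """Probability implied by decimal odds, ignoring the bookmaker's margin."""
    return 1 / decimal_odds


def expected_profit(win_probability, decimal_odds, stake=1.0):
    """Expected profit of one bet: win (odds - 1) * stake or lose the stake."""
    return win_probability * (decimal_odds - 1) * stake - (1 - win_probability) * stake


draw_odds = 3.40          # hypothetical price for the draw
model_probability = 0.33  # hypothetical model estimate for the same match

print(f"break-even strike rate: {implied_probability(draw_odds):.1%}")
print(f"expected profit per unit: {expected_profit(model_probability, draw_odds):+.3f}")
```

A bet only has positive expected profit when the true chance of the draw is higher than the break-even strike rate implied by the odds.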

neilovan · August 9, 2018

Hello,

This is just one part of what I am working on. My main interest is in over/under 2.5 goals prediction. For me it is a fundamentally stronger bet than a 1X2 bet, because there are only two outcomes. I have models that are ready for this, as well as for long-odds home wins. Hopefully all three are winners come season's end.

neilovan · August 9, 2018

I have another model that analysed every game in the EPL from 2009 to 2017 for long odds home wins. I agree that a larger sample size (for training and testing) cannot hurt. But let's see how these 65 games go.

Xtc12 · August 9, 2018

Thanks, will be looking in ...... good luck !

neilovan · August 13, 2018

Model predicts the following FT draws for upcoming fixtures

	country_div	match_date	h_team	a_team	Draw
0	English Premier	8/18/2018	West Ham	Bournemouth	2.58
4	Championship	8/18/2018	Reading	Bolton	2.45
7	Championship	8/21/2018	Rotherham	Hull	N/A
9	Championship	8/22/2018	Bolton	Birmingham	N/A

Edited August 13, 2018 by neilovan

neilovan · August 26, 2018

4 losses out of 4; maybe this needs some more work.

Drawsandmore · August 31, 2018

Hi neilovan

4 losses out of four is a disappointing start, but it doesn't necessarily mean your system needs more work. If you had started with 4 wins from 4 it would have been a great start...But would you have been convinced that you had found the perfect system, and would never need to change anything about it again? I suspect not.

As others on here have said, it's all about the sample size. If you believed in your method at the outset, you should stick with it, for now at least. And I bet that in both your testing and training data you have experienced longer losing runs.

Good luck with this.
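
To put a number on how ordinary a 0-from-4 start is, here is a quick sketch. It assumes the bets are independent and uses 0.33 as a stand-in for the backtest strike rate reported in the first post; `prob_all_lose` is a hypothetical helper.

```python
def prob_all_lose(strike_rate, n_bets):
    """Chance that n_bets independent bets all lose at the given strike rate."""
    return (1 - strike_rate) ** n_bets


# 0.33 is roughly the backtest strike rate reported in the first post.
for run in (4, 6, 8):
    print(f"{run} straight losses: {prob_all_lose(0.33, run):.1%}")
```

At that strike rate, four straight losses from any given four bets happen about one time in five.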

paparainbow · November 16, 2018

Do you use logistic regression or maybe random forest?

Also, what features did you use?

liero1 · November 19, 2018

Not sure if I understand correctly, but are your sample sizes 58, 61 and 49? I don't think you can test anything on that small a sample. But again, I might be misunderstanding something here.
