
ipredict - Football statistical predictions




I have been playing around with a new prediction model for all the leagues that the BBC posts match reports on (all professional English and Scottish leagues). The number of variables you can derive from the text commentary is quite good, especially for the lower leagues. It is still at an early stage, and I'm still tweaking some things.

The final goal is to have a fully automated workflow that:

1) Scrapes every new game from BBC and derives events from the text (Shots, Bookings, Goals, Substitutions, Cards, Offsides),

2) Aggregates the statistics for each game and applies Expected Goals (for each individual shot, the probability that it should have been a goal) and Expected Outcome models (based on the aggregated game statistics, what the result should have been).

4) Computes historical metrics for each team from the game statistics

5) Scrapes fixtures and odds from oddsportal (multiple bookmakers) and appends them to a database

6) Applies the prediction model, suggests recommendations and creates visualizations. Currently I have only an Outcome model (1X2), but I will also attempt to include an Over/Under model and a BTS model.

7) Posts predictions on a webpage (this is for the future, there is still a lot of work to be done)

8) Keeps track of Bankroll

I'm at about 95% of that (excluding 7), but I was hurrying because there are plenty of games today. 


I will monitor the results from the predictions here. From time to time I may also post visualizations and explanatory analysis of games. I also have Offensive and Defensive metrics for each team, so maybe they can tell a nice story.

Results: The cross-validation results looked decent (about 10% of games ended with a loss, median yield was about 5.5%), but I won't get too excited: in the past I had a model that gave good results in cross-validation but did awfully in live use.

Odds, bookmakers: Currently I am scraping best odds from oddsportal, but in practice I will limit myself to: Pinnacle, Marathon, Matchbook

Bet selection: My threshold for recommending a bet is Bookmaker odds / My odds - 1 > 0.125. Only these bets are posted. I could post all of them, but I would need a more convenient solution because the images would become too large. Maybe I'll create an album for every league and embed the thumbnails here.
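The selection rule above can be written as a small Python sketch. The function names are made up for illustration, and it assumes "My odds" means 1 divided by the model probability.

```python
def edge(bookmaker_odds: float, model_probability: float) -> float:
    """Return bookmaker_odds / model_odds - 1, with model_odds = 1 / model_probability."""
    model_odds = 1.0 / model_probability  # assumption: fair odds implied by the model
    return bookmaker_odds / model_odds - 1.0


def is_recommended(bookmaker_odds: float, model_probability: float,
                   threshold: float = 0.125) -> bool:
    """A bet qualifies only when the edge is strictly above the threshold."""
    return edge(bookmaker_odds, model_probability) > threshold


# Illustrative numbers only: the model says 50%, the bookmaker offers 2.50.
print(edge(2.5, 0.5), is_recommended(2.5, 0.5))
```

With these made-up numbers the model's fair price would be 2.00, so a bookmaker price of 2.50 clears the 0.125 threshold while a price of 2.20 would not.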

First round: Still working on visualizations. They were done in a hurry, and quality and size need to be optimized. Also, please suggest other ways of displaying the data if you have any ideas.

Does anyone know how I can embed a table in the forum posts?  I think it is not practical to post tables as screen captures.





Edited by allen29

18 minutes ago, Matthew said:

Good approach.

Can you post more info on the cross-validation you've done and the historical results? You give them little coverage in your post above.

Hi Matthew, 

I spent little time on model-building (1-2 days) because I was running behind schedule, and cleaning, integrating and summarizing the data took more than 90% of the time.

I did repeated cross-validation, each time splitting my data randomly into a 75% train set and a 25% test set. My dataset contained about 5850 matches after removing the first 24 rounds from the first season and the first 6 rounds from each season, so that the historical averages are more reliable. I built the model on the train set and evaluated the yield of the predictions on the test set (I have a function that takes the predictions plus certain thresholds and restrictions and returns the yield), using odds from football-data (maximum odds, but nerfed a bit by adding +1.5% to their adjusted probabilities and then converting them back to European odds (1/prob)). The bookmaker had slightly better accuracy for each type of bet, but I expected that since my model has no information about missing players, tiredness, etc.

I did the cross-validation inside a loop, repeated 100 times, at each iteration building another model (on different data) and validating it. I ignored bets that had under a 20% probability of winning according to my model and got a median yield of about 5.5% (minimum -7%, 25th percentile 3%, 75th percentile 10.5%, maximum 17%). The average number of bets was around 450 across the 100 iterations. I will post the output after I do more tests; I am taking a break from this today.
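As a rough sketch of two pieces described above (the odds "nerfing" and the yield function), here is some Python. It rests on assumptions: the names are hypothetical, "adjusted probabilities" is read as implied probabilities rescaled to sum to 1, and "+1.5%" is read as 1.5 percentage points added to each outcome. The model fitting and the 100 random 75/25 splits are not shown.

```python
def nerf_odds(odds, margin=0.015):
    """Make decimal odds (home, draw, away) slightly worse for the bettor.

    Assumption: rescale implied probabilities to sum to 1, add `margin`
    to each, then convert back to European odds with 1 / prob.
    """
    implied = [1.0 / o for o in odds]
    total = sum(implied)
    adjusted = [p / total for p in implied]  # probabilities summing to 1
    return [1.0 / (p + margin) for p in adjusted]


def bet_yield(bets, min_probability=0.20, stake=1.0):
    """Profit divided by total staked, for level stakes.

    bets: iterable of (model_probability, odds, won) tuples.
    Bets below `min_probability` are ignored, as in the post.
    """
    profit = 0.0
    staked = 0.0
    for probability, odds, won in bets:
        if probability < min_probability:
            continue
        staked += stake
        profit += stake * (odds - 1.0) if won else -stake
    return profit / staked if staked else 0.0
```

Calling `bet_yield` once per test split and collecting the 100 values would give the distribution (median, quartiles, minimum, maximum) quoted above.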

I will definitely investigate modeling further and probably look at ensembles (averaging the outputs of different models). I have to look closely at each predictor to see if its relationship with a certain performance benchmark (not sure which one will indicate "true value") is linear or somewhat curved. The model will probably change a bit once I think I have a superior one.







I really like your approaches to finding value, both with this and your last system.  If I had the technical ability I would be exploring similar metrics especially the 'Expected Goal' route.  A quick scan over the fixtures above and it looks like a great start?   Have you given up on your first system?

Either way, I wish you luck pal.  I'll live my analytical life through you! :)


That was a great first round for the system, totally unexpected!

@knobbo: I haven't. I will try to work on it more during this summer, and maybe I will try to include the other leagues whoscored covers (Netherlands, Turkey, USA, Brazil, Argentina).

There was one invalid bet: "Oldham Athletic vs Chester FC". The away side was Chesterfield and it got into the recommendations because of noisy name merging. The issue is now fixed though. 
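One way to guard against this kind of name collision is an explicit alias table with no fuzzy matching, so that "Chesterfield" can never fall back to "Chester FC". This is only a sketch in Python. The table contents and function names are hypothetical and it is not necessarily how the fix was actually done.

```python
import re

# Hypothetical alias table: every spelling seen in a data source maps to one canonical name.
ALIASES = {
    "chester": "Chester",
    "chester fc": "Chester",
    "chesterfield": "Chesterfield",
    "oldham": "Oldham Athletic",
    "oldham athletic": "Oldham Athletic",
}


def normalise(name: str) -> str:
    """Lower-case the name and collapse whitespace before the lookup."""
    return re.sub(r"\s+", " ", name.strip().lower())


def resolve_team(name: str):
    """Return the canonical team name, or None for unknown spellings.

    Unknown names are not guessed; they can be reviewed by hand and
    added to the table.
    """
    return ALIASES.get(normalise(name))
```

The trade-off is some manual upkeep of the table in exchange for never merging two different clubs silently.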

I am introducing a new chart: "Match analysis". "Prior win expectancy" is the adjusted probability from my model and "Posterior win expectancy" is the expected outcome of the game based on the game statistics. This helps identify teams that were either fortunate or unlucky in a game. In the long run it should help me judge the quality of the bets and, if I'm on a poor run, whether it is an unlucky one or not.




Edited by allen29

Here's what the model likes from matches within the next 7 days. 

I have also created an album with some visualizations of team performances, you can find them here.

The Expected points album presents the average number of points a team should have accumulated per game based on its performances. The formula for this is: Posterior win expectancy * 3 + Posterior draw expectancy. They are presented for games at Home and Away, and you can see some interesting differences for some teams there. For example, Leicester and Arsenal were expected to gather more points on the road than at their home ground. Another interesting fact is that for the Scottish teams the home field advantage is significantly lower than for English teams.
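The expected points formula is simple enough to show directly. A minimal Python sketch, with hypothetical function names and made-up probabilities:

```python
def expected_points(p_win: float, p_draw: float) -> float:
    """Posterior win expectancy * 3 + posterior draw expectancy."""
    return 3.0 * p_win + p_draw


def average_expected_points(games) -> float:
    """Mean expected points over a list of (p_win, p_draw) pairs, e.g. all home games."""
    return sum(expected_points(w, d) for w, d in games) / len(games)


# Made-up example: two home games for one team.
home_games = [(0.50, 0.25), (0.25, 0.25)]
print(average_expected_points(home_games))
```

Running the same average separately over home and away games gives the two values compared in the album.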

The "Expected table" uses boxplots (box and whiskers charts) to present the same phenomenon. The teams are ordered by their mean expected points, although one may argue that the median could be more adequate. A larger box area means that the team had very mixed performances. For example, among Premier League teams Aston Villa has the lowest variance, meaning that they have abnormal performances more rarely than other teams. They are consistently bad.

Can someone please recommend a convenient way to attach thumbnails to the forum or embed images? I tried the "Insert other media" function with an imgur link and the image is not displayed.

#Update: Added new games where odds became available.


Edited by allen29
Wrong image title, had to correct it

Late night post here. This is one of the main arguments for why metrics derived from expectancies rather than actual results are superior for predicting the outcome of a game. You can see from the chart that they are more stable (have a lower variance) and look more like a bell-shaped curve. Awkward shapes may appear over the course of a season, but in the long run they tend to stabilize. The same cannot be said about actual goals: for example, look at the PDFs of Tottenham and Sunderland, which are multi-modal, and that makes prediction much more difficult.



2.36 pts today, but it could have been much better: some late equalizers ruined a lot of profit, plus there were a lot of draws. I'll post the match analysis later today. There were also some games that qualified because the odds changed, but I haven't included them.

I think I've found a way to avoid posting tables, although bookmaker and timestamp are not present.  If this is a problem for reporting purposes, please tell me and I'll switch back to tables or I'll find a way to include them. Games on the left are selections for this week only as well as the P/L chart, but Bank is from the beginning. Maybe I'll change this for the bank to synchronize nicely with the P/L chart.

A lot of the team abbreviations I approximated myself so they are not really standard. Is there a list with official team abbreviations that I could use?

Remaining games are the ones from the previous post that have not yet been played. The rule is that everything that I post where my probabilities are greater than those of the bookmaker at the time of posting qualifies as a selection. You may notice that the probabilities from the bookmaker do not add up to 100%. It looks a bit unaesthetic. I could normalize them, but I think it helps to see what the over-round for a particular game is.
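For readers unfamiliar with the term, the over-round is how far the bookmaker's implied probabilities sum above 100%. A small Python sketch, with made-up odds and hypothetical function names:

```python
def implied_probabilities(odds):
    """Raw implied probabilities (1 / decimal odds) for home, draw, away."""
    return [1.0 / o for o in odds]


def overround(odds) -> float:
    """Amount by which the implied probabilities exceed 1 (0.05 means 5%)."""
    return sum(implied_probabilities(odds)) - 1.0


# Made-up prices for one game.
prices = [1.9, 3.8, 3.8]
print(implied_probabilities(prices), overround(prices))
```

Leaving the probabilities unnormalized, as in the posted charts, keeps this margin visible per game.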


Edited by allen29

It was a pretty good day, although it could have been better, but some really late goals from the opposing side prevented that (e.g. Airdrieonians min 91, Hull min 93, Fleetwood min 91).

So far, the results:

Today: 12.83 pts from 25 bets.

Overall: 18.91 pts from 83 bets (2.30 pts - home, 16.61 pts - away), with a yield of 22.78% (7.93% - home, 30.75% - away).




Very impressive results mate!
This is scarily similar to how I'm planning my own analysis; ie. fully automated, drawing on ELO type ratings (based on goals though).

It's still early days for my own system. I'm coding everything in C#, so it takes quite a while (I'm a newbie and to top it off I'm annoyingly perfectionistic).

Instead of scraping stats from BBC, have you considered xmlsoccer.com? That's a flat monthly tenner covering all major leagues.
When I go live with my system I think I'll try their service, until then I backtest using results from football-data.co.uk.

Anyways, keep up the good work and best of luck :-)


Well, the only reason I use BBC is because they provide very structured match commentaries which you can turn into useful match statistics. For lower leagues I think this is really useful, since not many bettors cover them or look at this information. I think BBC actually sends people to these games so they can record every event, since most of them are not on TV. Also, I noticed that the variation in odds is much higher than in widely covered leagues, so if you monitor these you may catch some really good opportunities.

What would be useful is an API that returns odds from various bookmakers and can also place automatic bets when a signal is triggered, but I think this is not possible since some bookmakers do not have this feature and the integration between several bookmakers would be cumbersome.

There are also drawbacks to the match commentaries. For example, if you watched Sunderland vs Leicester, in the 90th minute Jamie Vardy dribbled past the goalkeeper and had an open goal. This is what the text commentary specifies:

"Goal!  Sunderland 0, Leicester City 2. Jamie Vardy (Leicester City) right footed shot from the centre of the box to the centre of the goal. Assisted by Demarai Gray."

The model sees only this information; it doesn't know that he was through on goal, had dribbled past the goalkeeper and had an empty net in front of him.
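To make the limitation concrete, here is a Python sketch of the kind of fields a parser can pull out of that commentary line. The pattern is an assumption based only on the single line quoted above (including the list of body-part phrases), and the names are hypothetical.

```python
import re

# Assumed layout: "<player> (<team>) <body part> from <location> to <placement>."
SHOT_PATTERN = re.compile(
    r"(?P<player>[^.()]+?) \((?P<team>[^)]+)\) "
    r"(?P<body_part>right footed shot|left footed shot|header) "
    r"from (?P<location>.+?) to (?P<placement>[^.]+)\."
)
ASSIST_PATTERN = re.compile(r"Assisted by (?P<assist>[^.]+)\.")


def parse_shot(text: str):
    """Return a dict of shot features, or None when the line is not a shot."""
    match = SHOT_PATTERN.search(text)
    if match is None:
        return None
    shot = {key: value.strip() for key, value in match.groupdict().items()}
    shot["is_goal"] = text.startswith("Goal!")
    assist = ASSIST_PATTERN.search(text)
    shot["assist"] = assist.group("assist").strip() if assist else None
    return shot


line = ("Goal!  Sunderland 0, Leicester City 2. Jamie Vardy (Leicester City) "
        "right footed shot from the centre of the box to the centre of the goal. "
        "Assisted by Demarai Gray.")
print(parse_shot(line))
```

Everything the parser returns (shooter, team, body part, shot location, placement, assist) is all an Expected Goals model can use here. Nothing in the text says the goalkeeper had already been beaten.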


What's the goal expectancy on this shot? 

The model says 16%, although common sense says it should be 99.9%. 

The only companies that can give you this kind of information at the moment are the big data collection companies, and they are extremely expensive.

They do not cover lower-league games though, so I think this is really where you can have an edge and take advantage of generous prices.


Edited by allen29
Bad formatting

You need to give your model a bit of slack.  Last weekend was probably the first that I would class as the tail of the season.  I have observed that forecasts become very unreliable then, due to the human nature of nothing to play for/head already on the beach/Euros etc.  I would not lose any confidence if it nose dives from here on in; on the other side, if it performs really strongly the same is true.  It's unfortunate that to have any true belief in it we would have to wait till the start of Oct to see how she does.  I really do like your approach/concept, and truly believe you could be onto something, especially in the lower leagues.  Is it something you could replicate for summer leagues without too much workload?


Keep the faith/patience pal.


Thank you for your message @knobbo

I was thinking about the same thing lately: motivation towards the end of the season. I will try to see if in my simulations yield is lower in April and May compared to other months. 

One thing I am trying to apply, but it is very difficult to do, is a motivation factor/variable for each team, based on its chances of achieving an objective. For example, a team that is mathematically relegated or has a very high probability of relegation should have very low motivation (e.g. Aston Villa, Bolton), but a team that still has a decent chance of a safe finish should be much more motivated (e.g. Norwich, Newcastle, Sunderland). That implies I need to predict the probabilities of where a team is most likely to finish (Champion, Promotion, Play-offs, Top Half, Bottom Half, Relegation), using strictly the information available before each game, and come up with a motivation metric based on that. I'm still experimenting with this; it looks like there is something there, but there is still more work to do. I noticed that the market has better accuracy for home wins on clear favorites (low home odds). This is why I filter out bets against strong home favorites (60% or more win probability) at the moment, but I need to fix this and I think this approach should help.

Also, I now have more models, because there are different ways in which you can estimate the strength and form of each team. You get slightly different predictions and it is really difficult to know which one is the closest to reality. I use a kind of voting principle: for example, if 75% or more of the predictions agree that there is value on a bet, the bet is placed. From tests, this increases yield a bit and reduces the number of losing runs. The idea is that the models' weaknesses should cancel out.
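The voting principle could look something like the Python sketch below. It assumes each model "votes" using the same value rule as the bet-selection threshold (bookmaker odds / model odds - 1 > 0.125). The post does not say exactly what each vote is, so that part and the function name are assumptions.

```python
def should_bet(model_probabilities, bookmaker_odds,
               edge_threshold=0.125, agreement=0.75):
    """Place the bet when at least `agreement` of the models see value.

    model_probabilities: one probability per model for the same outcome.
    A model sees value when bookmaker_odds * probability - 1 > edge_threshold,
    which equals bookmaker odds / model odds - 1 with model odds = 1 / probability.
    """
    votes = [bookmaker_odds * p - 1.0 > edge_threshold for p in model_probabilities]
    return sum(votes) / len(votes) >= agreement


# Made-up example: three of four models see value at odds of 2.50.
print(should_bet([0.5, 0.5, 0.5, 0.4], 2.5))
```

Requiring agreement rather than averaging the probabilities means one over-optimistic model cannot trigger a bet on its own.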

I would like to try it for more leagues, but I would need similar data. MLS, Brazil, Argentina could be interesting if I find a decent amount of match statistics.




First time in negative territory. Meanwhile, I've had some time to check for performance by Month and I'm not sure what it means. From 100 simulations, April has a mean negative yield of about -7% which would support the theory that weird things happen towards the end of the season. But May average returns were huge, so I'm not sure if the theory holds or not.




Update: -7.6 pts (-4.7% yield): Home (-19.29 pts, -32.7% yield), Away (11.69 pts, 11.3% yield)

I'll be adding new leagues soon, some of which will be played during the summer, so the thread won't be dead for that period. I found similar data from espn, and for an even longer timeframe. The new additions will be the Mexican First Division, MLS, Australia's A-League, the Brazilian First Division, the Argentinian League (still some data issues) and, starting from October, Bundesliga, Ligue 1, La Liga and Serie A. The thread will probably need a rename.


Edited by allen29
Picture was too large.

Decent week, could have turned back in profit but Sunday's games ruined it.

Again, away bets going strong while homes keep on disappointing: P/L (-2.01: H: -19.94, A: 17.93), Yield (H: -30.32 %, A: 14.81%)

Interesting thing about Swansea - Liverpool: odds at kick-off for a Liverpool win were 2.15 (I caught them 2 days ago at 2.43), which is incredibly low given Liverpool's young and experimental starting 11. I was expecting the market to react the opposite way and Swansea's odds to shorten, or even for them to become favorites to win.



Forgot to post in time here, my model suggested 3 picks tonight.

As proof, they were posted yesterday on another website where you can track your bets (I'm not sure if I'm allowed to post details about competitor websites, so I blacked out my name).

I'm not sure if it is against the rules to include those bets, but if it is I won't include them.

