Machine learning made easy
Building a Simple Football Prediction Model Using Machine Learning
With no prior knowledge
Introduction
Football has always been a challenging sport to model. The most famous model is the Dixon-Coles¹ model, which uses the Poisson distribution to model goal scoring. Rating models based on pairwise² comparisons and ranking³ have emerged as an alternative way of making predictions. For instance, the Elo rating⁴ is used by FIFA to rank national teams. In this article, we present a different approach that requires no knowledge of football and makes no football-specific assumptions, so it can be generalized to other sports. We will show how to train the model and make predictions with the associated probabilities using regularized logistic regression and scikit-learn.
We won’t go into the details of the model, as there are plenty of very good resources on the web that do it better than we could. But we will cover some basics with Python code that can be found on GitHub.
Regularized logistic regression
Logistic regression is a statistical model used for classification. Classification means the variable you want to predict is categorical. For instance, you want to predict who will win a match. In the case of football, you have three classes: the home team wins, the away team wins, or it is a draw.
If you have ever used logistic regression, you know that it is a model for two classes: 0 when the event has not occurred and 1 when it has. To predict the winner of a football match, we therefore need three models, each predicting a different event, unless we use a multinomial loss: one for the home team to win, one for the away team to win, and one for the draw.
The logistic regression model can be explained using probabilities, but you can also see it as a model that predicts 0 and 1. In fact, the prediction is not perfect and falls in the range (0, 1), so it can be interpreted as the probability that the event occurs (prediction close to 1) or does not occur (prediction close to 0). If you want to know the maths behind this idea, take a look at this article; you may also be interested in the original probabilistic model.
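To make the link between a score and a probability concrete, here is a minimal sketch of the logistic (sigmoid) function that squeezes any real number into the range (0, 1). The function name `sigmoid` and the example scores are illustrative and not taken from the article's code.

```python
import math


def sigmoid(score: float) -> float:
    """Map any real-valued score to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))


# A score of 0 means the model has no preference: probability 0.5.
print(sigmoid(0.0))  # 0.5

# Large positive scores approach 1, large negative scores approach 0.
print(sigmoid(4.0), sigmoid(-4.0))
```

In the football setting, the score is the sum of the coefficients of the two teams involved in the match.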
Processing the data
We already know we need to train three logistic models with a binary target y that represents the match result. For instance, for the “draw model”, the target is 1 if the match result is a draw, else it is 0. Fortunately, we do not need to explicitly set up three models, as scikit-learn’s LogisticRegression will do it for us if we specify the right options. So y will be represented in three classes depending on the outcome. To represent the teams, we will use a one-hot encoder.
Assume we have a league with 20 teams playing home and away. Each match is encoded as a vector of length 40 where each entry is 1 or 0. The first 20 entries represent the home team and the remaining 20 the away team. To encode a match between two teams, we just put a 1 at the position of the home team and a 1 at the position of the away team. Using scikit-learn, that step can be done easily:
from sklearn.preprocessing import OneHotEncoder
# team_names must be 2D, shape (n_teams, 1); on scikit-learn < 1.2 use sparse=False instead
team_encoding = OneHotEncoder(sparse_output=False).fit(team_names)
Now we can encode a list of past matches to train the model. For instance, we can use the past three seasons. Having the home and away team names we can encode them to get our dummy features.
home_dummies = team_encoding.transform(home_team_names)
away_dummies = team_encoding.transform(away_team_names)
Say we have 100 past matches; then home_dummies and away_dummies are arrays of 0s and 1s with 100 rows and 20 columns. Now we can concatenate these two arrays to build the features X that each of the logistic models will use:
import numpy as np
X = np.concatenate([home_dummies, away_dummies], 1)
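As a small worked example of the encoding above, here is a sketch on a made-up three-team league, so each match becomes a vector of length 6 instead of 40. The team names and fixtures are hypothetical. The sketch calls `.toarray()` on the default sparse output rather than passing an encoder option, which is an assumption made to keep it independent of the scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical three-team league; the encoder expects a 2D column of names.
team_names = np.array(["Arsenal", "Chelsea", "Everton"]).reshape(-1, 1)
team_encoding = OneHotEncoder().fit(team_names)

# Two past matches: Arsenal vs Chelsea, then Everton vs Arsenal.
home_team_names = np.array(["Arsenal", "Everton"]).reshape(-1, 1)
away_team_names = np.array(["Chelsea", "Arsenal"]).reshape(-1, 1)

# transform() returns a sparse matrix by default; toarray() makes it dense.
home_dummies = team_encoding.transform(home_team_names).toarray()
away_dummies = team_encoding.transform(away_team_names).toarray()

# First three columns: home team. Last three columns: away team.
X = np.concatenate([home_dummies, away_dummies], axis=1)
print(X.shape)  # (2, 6)
print(X[0])     # [1. 0. 0. 0. 1. 0.]
```

Each row contains exactly two 1s: one in the home block and one in the away block.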
The target can be set directly in three classes. From the historical scores of the home and away teams, we encode y in three classes: 1 when the home team wins, 0 for a draw, and -1 when the away team wins:
y = np.sign(home_score - away_score)
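To see what this line produces, here is a tiny sketch with four made-up scorelines. The score arrays are illustrative only.

```python
import numpy as np

# Four hypothetical matches: 2-0, 1-1, 0-2 and 3-3.
home_score = np.array([2, 1, 0, 3])
away_score = np.array([0, 1, 2, 3])

# The sign of the goal difference gives the three classes directly.
y = np.sign(home_score - away_score)
print(y)  # [ 1  0 -1  0]
```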
The number of parameters of each model equals two times the number of teams: each team has one parameter for its home strength and one for its away strength. We have all we need; it is time to train the model.
Training the model
We already mentioned that we are using scikit-learn’s LogisticRegression model. As with many other models in the library, it comes with default options that we may not want to use. We want to train three logistic regression models independently. We do not want a constant term, as the home advantage is already encoded in the features. We want to regularize the model with the ridge (l2) penalty.
The regularization shrinks the coefficients, tempering both the strength of strong teams and the weakness of weak teams. For instance, if a team always wins, the model will be tempted to assign a very large coefficient to that team, leading to over-optimistic predictions.
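The shrinking effect can be illustrated with a minimal sketch on synthetic data, in which one team is made to win all its home matches. Everything here is an assumption for illustration: the data are random, only the binary "home win" model is fitted, and the two values of C are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_teams, n_matches = 4, 200
home_idx = rng.integers(0, n_teams, n_matches)
# Shift by 1..n_teams-1 so that a team never plays itself.
away_idx = (home_idx + rng.integers(1, n_teams, n_matches)) % n_teams

X = np.zeros((n_matches, 2 * n_teams))
X[np.arange(n_matches), home_idx] = 1            # home block
X[np.arange(n_matches), n_teams + away_idx] = 1  # away block

# Binary "home win" target; team 0 is made to win every home match.
home_win = rng.random(n_matches) < 0.45
home_win[home_idx == 0] = True

home_coef = {}
for C in (0.01, 100.0):
    clf = LogisticRegression(fit_intercept=False, C=C, max_iter=1000)
    clf.fit(X, home_win)
    # Column 0 is the home coefficient of the team that always wins.
    home_coef[C] = clf.coef_[0, 0]
    print(C, round(home_coef[C], 2))
```

With weak regularization (large C) the coefficient of the always-winning team grows much larger than with strong regularization (small C).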
The regularization parameter C (in scikit-learn, the inverse of the regularization strength) can be found using cross-validation, but the default value gives decent results. The model we want is then:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(
    penalty="l2", fit_intercept=False, multi_class="ovr", C=1
)
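For readers who want to tune C rather than keep the default, here is a hedged sketch of a cross-validated grid search on synthetic matches. Several things are assumptions and not the article's method: the simulated scores, the grid of C values, the log-loss scoring, and the use of `OneVsRestClassifier` around a binary `LogisticRegression`. The wrapper is swapped in because recent scikit-learn releases deprecate the `multi_class` option; it fits one independent model per outcome, which is the same idea.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
n_teams, n_matches = 6, 300
home_idx = rng.integers(0, n_teams, n_matches)
away_idx = (home_idx + rng.integers(1, n_teams, n_matches)) % n_teams

X = np.zeros((n_matches, 2 * n_teams))
X[np.arange(n_matches), home_idx] = 1
X[np.arange(n_matches), n_teams + away_idx] = 1

# Simulated scores: stronger teams score more, plus a small home edge.
strength = np.linspace(-0.5, 0.5, n_teams)
gap = strength[home_idx] - strength[away_idx]
home_score = rng.poisson(np.exp(0.3 + gap))
away_score = rng.poisson(np.exp(-gap))
y = np.sign(home_score - away_score)

# Try a few values of C and keep the one with the best log loss.
search = GridSearchCV(
    OneVsRestClassifier(LogisticRegression(fit_intercept=False)),
    param_grid={"estimator__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_log_loss",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```

Note that the default 5-fold split shuffles nothing but also ignores match dates; for real match data a time-ordered split would be a more cautious choice.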
Finally, the model is trained with a classic model.fit(X, y). The model parameters are available in model.coef_ as an array with 3 rows and as many columns as two times the number of teams. The first row corresponds to the model for the away team to win, the second row to the draw, and the third to the home team to win. We will give some intuition about those parameters in the next example.
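The shape and row order of the coefficients can be checked on a toy problem. This is a sketch under assumptions: the outcomes are random, and `OneVsRestClassifier` is used in place of the `multi_class="ovr"` option (deprecated in recent scikit-learn releases), so the coefficients are stacked by hand from the three fitted binary models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(1)
n_teams, n_matches = 4, 200
home_idx = rng.integers(0, n_teams, n_matches)
away_idx = (home_idx + rng.integers(1, n_teams, n_matches)) % n_teams

X = np.zeros((n_matches, 2 * n_teams))
X[np.arange(n_matches), home_idx] = 1
X[np.arange(n_matches), n_teams + away_idx] = 1
# Random outcomes, only to have three classes to fit.
y = rng.choice([-1, 0, 1], size=n_matches, p=[0.3, 0.25, 0.45])

model = OneVsRestClassifier(LogisticRegression(fit_intercept=False, C=1.0))
model.fit(X, y)

# One row per outcome, in the order given by model.classes_.
coef = np.vstack([est.coef_ for est in model.estimators_])
print(model.classes_)  # [-1  0  1]
print(coef.shape)      # (3, 8)
```

The classes are sorted, which is why the rows read away win (-1), draw (0), home win (1).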
To make a prediction, we have two choices: we can obtain the probability of each outcome or the expected winner:
probabilities = predict_proba(home_team_name, away_team_name)
winner = predict_winner(home_team_name, away_team_name)
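The two helpers above belong to the article's own class, whose code is not shown here. As a rough idea of what such helpers could look like, here is a hypothetical sketch: the functions `to_column` and `encode`, the toy results, and the use of `OneVsRestClassifier` are all assumptions, not the author's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import OneHotEncoder

# Toy history with made-up results, only to have something to fit.
home = ["A", "B", "C", "A", "B", "C", "A", "C"]
away = ["B", "C", "A", "C", "A", "B", "B", "A"]
home_score = np.array([2, 1, 0, 3, 1, 2, 1, 0])
away_score = np.array([0, 1, 1, 1, 2, 2, 0, 2])


def to_column(names):
    """Turn a list of team names into the 2D column the encoder expects."""
    return np.array(names).reshape(-1, 1)


encoder = OneHotEncoder().fit(to_column(sorted(set(home + away))))


def encode(home_names, away_names):
    """Build the feature rows: home one-hot block, then away one-hot block."""
    h = encoder.transform(to_column(home_names)).toarray()
    a = encoder.transform(to_column(away_names)).toarray()
    return np.concatenate([h, a], axis=1)


model = OneVsRestClassifier(LogisticRegression(fit_intercept=False, C=1.0))
model.fit(encode(home, away), np.sign(home_score - away_score))


def predict_proba(home_team_name, away_team_name):
    """Probabilities ordered as model.classes_: away win, draw, home win."""
    return model.predict_proba(encode([home_team_name], [away_team_name]))[0]


def predict_winner(home_team_name, away_team_name):
    """Most likely outcome: 1 home win, 0 draw, -1 away win."""
    return int(model.predict(encode([home_team_name], [away_team_name]))[0])


print(predict_proba("A", "B"))
print(predict_winner("A", "B"))
```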
The pipeline, from processing the data to making a prediction, is available in a LogisticModel class available here.
Model interpretation: a concrete example
Let’s use some real data from the English Premier League. We train the model following the steps described above using data from 2018–03 to 2021–03, which contains 1110 matches and 27 teams.
The model coefficients help to interpret how the model understands the data. Remember that the model tries to predict 0 and 1 for three different outcomes using the team information we have encoded. The absolute values of the team coefficients are not very useful on their own, but their sign and their magnitude compared to the others are. For instance, let’s have a look at the home-win model. A positive sign on a team coefficient means the team increases the chance of the home team winning. A negative sign means the team decreases it.
We observe that the coefficient of Manchester United playing away is negative (-1.35), which means that Manchester United reduces the home team’s chance of winning. In other words, Manchester United has been really good playing away, and also playing at home, where its coefficient is 1.63.
When Huddersfield Town plays at home, the coefficient is also negative (-1.37), meaning that they have a negative impact on their own chance of winning.
When Norwich City plays away, it increases the home team’s chance of winning, as its coefficient is positive (1.0). We can draw the same conclusion for AFC Bournemouth, but the coefficient is smaller (0.75) and so is the impact. Any team playing at home against Norwich City or AFC Bournemouth will see its chances of winning increase compared to playing against a team like Manchester City or Liverpool, which have negative coefficients.
Performance in the 5 major leagues
Now that we have a model and understand it, it is time to run a test. For that, we use the 5 major leagues in Europe, namely the English Premier League, the French Ligue 1, the German Bundesliga, the Italian Serie A, and the Spanish La Liga. Data were provided by Sportmonks.
In order to estimate the out-of-sample performance of the model, we use 3 years of matches to predict the next round. The last 5 rounds are predicted which yields 500 out-of-sample predictions in total. All numbers are calculated as of 4 March 2021. The performance of the logistic model will be compared to the actual Octosport model performance.
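The round-by-round procedure can be sketched as a loop that, for each of the last 5 rounds, trains on earlier rounds only and predicts that round. This is an illustration under assumptions: the fixtures and outcomes are synthetic, all earlier rounds are used for training rather than a fixed 3-year window, and `OneVsRestClassifier` stands in for the `multi_class="ovr"` option.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
n_teams, n_rounds, per_round = 6, 30, 3

# Synthetic fixtures: every round pairs the six teams into three matches.
home_idx, away_idx, round_id = [], [], []
for r in range(n_rounds):
    teams = rng.permutation(n_teams)
    home_idx.extend(teams[:per_round])
    away_idx.extend(teams[per_round:])
    round_id.extend([r] * per_round)
home_idx, away_idx, round_id = map(np.array, (home_idx, away_idx, round_id))

n = len(round_id)
X = np.zeros((n, 2 * n_teams))
X[np.arange(n), home_idx] = 1
X[np.arange(n), n_teams + away_idx] = 1
y = rng.choice([-1, 0, 1], size=n, p=[0.3, 0.25, 0.45])  # random outcomes

hits = []
for r in range(n_rounds - 5, n_rounds):
    # Train strictly on the past, test on the round being predicted.
    train, test = round_id < r, round_id == r
    model = OneVsRestClassifier(LogisticRegression(fit_intercept=False, C=1.0))
    model.fit(X[train], y[train])
    hits.extend(model.predict(X[test]) == y[test])

print(len(hits))      # 15 out-of-sample predictions
print(np.mean(hits))  # out-of-sample accuracy
```

Because the outcomes here are random, the accuracy itself carries no meaning; the point is only that no match is ever predicted by a model that has seen it.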
In the next table, we show two different measures of performance.
- accuracy: the percentage of correct predictions; the closer to 100%, the better
- log loss: a measure of the quality of the probabilities; the closer to 0, the better.
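Both measures are available in scikit-learn. Here is a minimal sketch on four made-up matches; the outcomes and probabilities are illustrative and are not results from the article.

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

# Made-up outcomes and predicted probabilities for four matches.
# Columns follow the class order [-1, 0, 1]: away win, draw, home win.
labels = [-1, 0, 1]
y_true = np.array([1, 0, -1, 1])
proba = np.array([
    [0.2, 0.3, 0.5],
    [0.3, 0.4, 0.3],
    [0.3, 0.3, 0.4],
    [0.1, 0.2, 0.7],
])

# The predicted outcome is the class with the highest probability.
y_pred = np.array(labels)[proba.argmax(axis=1)]

accuracy = accuracy_score(y_true, y_pred)
loss = log_loss(y_true, proba, labels=labels)
print(accuracy)  # 0.75
print(loss)
```

The third match is the only miss: the model favoured a home win but the away team won, which costs both accuracy and log loss.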
We also added the percentage of matches won by the home team as a benchmark: 41% of the 500 matches.
The logistic model increases that number by 7% while the Octosport model increases it by 14%. The probabilities are also better for the Octosport model, as shown by the log loss. Overall, the logistic model shows a reasonable performance, but going beyond it is another story.
There is a good chance that other simple models like Dixon-Coles or Poisson regression will result in the same level of performance, as they all use the same information: the match results.
What is important in this result is that it takes a lot of effort to consistently beat simple models like the one presented in this article. The Octosport model uses much more complicated machine learning models and infrastructure. It processes a lot of data from multiple sources predicting more than 580 competitions across the globe. We need to process a lot of heterogeneous data, monitor the performance, make the prediction in time, deal with overfitting risk, understand the maths and we haven’t mentioned the infrastructure yet.
Conclusion
In this article, we present a simple but effective model to predict sports events with a focus on soccer. No assumptions have been made and a logistic model is used in conjunction with simple machine-learning tricks like one-hot encoding and ridge regularization. The model can be used on any other team-based sport.
Making predictions in soccer using statistical learning is something anyone can do. But when you want to improve those models, that’s where the difficulties show up. Octosport aims to provide such improvement to anyone through our API, but we also want to educate people on soccer and sports prediction in general.
Prediction plays an important role in betting but beating the bookmaker is not just a prediction game.
The logistic model is available on GitHub; have fun with it.
References
[1] M. J. Dixon and S. G. Coles (1997), Modelling Association Football Scores and Inefficiencies in the Football Betting Market. Journal of the Royal Statistical Society: Series C (Applied Statistics), 46: 265–280.
[2] L. Maystre, V. Kristof and M. Grossglauser (2019), Pairwise Comparisons with Flexible Time-Dynamics. KDD 2019.
[3] R. A. Bradley and M. E. Terry (1952), Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3–4): 502–537.
[4] A. Elo (1978), The Rating Of Chess Players, Past & Present. Arco Publishing.