
ANALYTICS MADE EASY

How To Measure Football Prediction Model Quality

octosport.io
Published in Geek Culture · Dec 14, 2021

Introduction

In predictive modeling, one of the most important steps is measuring your model's quality. More importantly, you need a measure that is understandable and generalizable. There are many metrics you can use to train a model, select its parameters, or run cross-validation. For classification problems, common choices include accuracy, the F1 score, and the log-loss.

While they are relatively easy to use, it is not always simple to tell whether a log-loss of -0.93 or an accuracy of 10% means that a model is a good or a bad predictor.
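For concreteness, here is a minimal sketch (toy data, not from the article) of how these three metrics can be computed with scikit-learn for a three-class 1x2-style problem:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss

# Toy ground truth for five matches: 0 = home win, 1 = draw, 2 = away win
y_true = np.array([0, 2, 0, 1, 0])

# Toy predicted probabilities for (home, draw, away)
y_prob = np.array([
    [0.50, 0.30, 0.20],
    [0.25, 0.30, 0.45],
    [0.60, 0.25, 0.15],
    [0.30, 0.45, 0.25],
    [0.30, 0.30, 0.40],
])
y_pred = y_prob.argmax(axis=1)  # hard class predictions for accuracy/F1

print("accuracy :", accuracy_score(y_true, y_pred))
print("F1 macro :", f1_score(y_true, y_pred, average="macro"))
print("log-loss :", log_loss(y_true, y_prob, labels=[0, 1, 2]))
```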

In this article, we show that the performance numbers of a soccer prediction model are not enough to judge its quality. We introduce the notion of a benchmark, which is necessary to understand whether a model really learns from the data.

The necessity of a benchmark

We have already introduced the log-loss and the accuracy, two performance measures of classification models commonly used in football prediction.

The log-loss measures the quality of the probabilities that the model produces. But this loss scales differently depending on the number of classes, so it is not easy to link model performance to a given log-loss value.
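To see the scale issue concretely: a uniform random guess has a log-loss of ln(K) for a K-class problem, so the "neutral" level shifts with the number of classes. A quick sketch:

```python
import math

# Log-loss of a uniform random guess is ln(K) for K classes,
# so the same log-loss value means different things for 2, 3, or 10 classes.
for k in (2, 3, 10):
    print(f"{k} classes: uniform-guess log-loss = {math.log(k):.3f}")
```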

Accuracy does not have this problem. Indeed, it is easy to understand that 10% of your predictions are correct. But does that mean your model is good or not? For a two-class problem, probably not. What about a 100-class problem? In fact, you cannot answer the question even if you know the number of classes. Let us explain why.

Say 90% of the cars passing in front of your window are white, and we train a deep neural network to predict the next car's color. If the model ends up always predicting "the car is white", its accuracy will be 90%. However, that does not make it a very good model. Besides, you could look out the window, observe that most of the cars are white, and predict that the next car will be white. You would reach the same performance as the deep neural net. In this case, you would conclude the model is bad because you can match its performance: you act as a benchmark.

This is an obvious exaggeration, but it gives you the notion of a benchmark model: a vanilla model, which in this case is you watching through the window.

Having a benchmark is a good way to assess the quality of your model. Not beating the benchmark means either that the model is poor or that what you are trying to predict is too random to be learned from the data.
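Going back to the car example, a tiny sketch shows that the majority-class "benchmark" already reaches 90% accuracy, so a model has to beat that number to demonstrate it learned anything:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated car colours: 90% white (1), 10% other (0)
colours = rng.binomial(1, 0.9, size=10_000)

# The "watching through the window" benchmark: always predict white
benchmark_pred = np.ones_like(colours)
print("benchmark accuracy:", (benchmark_pred == colours).mean())  # ~0.90
```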

A benchmark for football models

In football, we can predict many events: the final score, the half-time score, the first team to score, the number of goals, and so on. Having a benchmark for each of them gives us a common yardstick for model quality.

As we did for the car example, we can build a benchmark for each market. Obviously, you are not going to watch all the matches and count which events occur most often. Instead, we can use the data, look at past matches, and figure out the most common outcomes. This is a simple historical model.

For instance, suppose that over the past 100 matches of a particular league, home teams won 45% of the time, away teams won 30% of the time, and 25% of the matches were draws. Always predicting these frequencies gives a log-loss of -1.06. In comparison, a random guess gives a log-loss of -1.098.

The random model is also a possible benchmark, but it is not a fair one. For instance, we know that home teams win more often than 33% of the time, so that knowledge should be built into the benchmark.
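A minimal sketch reproducing these benchmark numbers (computed here as positive log-losses; the article reports them with a minus sign):

```python
import numpy as np

# Historical benchmark: always predict the league's observed frequencies
p_hist = np.array([0.45, 0.30, 0.25])  # home win, away win, draw

# Expected log-loss when outcomes follow the same frequencies is the entropy
hist_log_loss = -(p_hist * np.log(p_hist)).sum()

# Random benchmark: uniform 1/3 probability on each outcome
rand_log_loss = np.log(3)

print(f"historical benchmark: {hist_log_loss:.3f}")  # ~1.067
print(f"random benchmark:     {rand_log_loss:.3f}")  # ~1.099
```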

Example

Let's work through an example using a model and match results. The data are provided by Sportmonks; we use their probabilities for the 1x2 market.

We collected 15,000 matches played between October 2021 and December 2021, spread over 643 leagues. The matches, and thus the log-loss, are time-ordered. First, we computed the benchmark probabilities at each date using the average results of the previous 1,000 matches. The next plot shows the evolution of these historical probabilities through time.

Evolution of the historical probabilities for 1, X and 2

For instance, at the end of December 2021, we measured a 39.5% probability for the home team to win, 31.5% to lose, and 29% for a draw. The random benchmark would have been 33% for each.
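As a rough sketch of this step (the column and label names are assumptions, not from the article), the rolling historical probabilities could be computed with pandas like this:

```python
import pandas as pd

def rolling_benchmark(df: pd.DataFrame, window: int = 1000) -> pd.DataFrame:
    """Historical 1x2 probabilities from the previous `window` matches.

    `df` is assumed to have one row per match, sorted by date, with a
    'result' column taking values such as 'home', 'draw', 'away'.
    """
    outcomes = pd.get_dummies(df["result"]).astype(float)  # one-hot encode 1, X, 2
    # Average outcome over the previous `window` matches; shift(1) ensures the
    # benchmark for a match only uses information available before kick-off.
    return outcomes.rolling(window, min_periods=window).mean().shift(1)

# Usage sketch: probs = rolling_benchmark(matches)
```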

Then, using the prediction model probabilities and our rolling historical probabilities, we calculated the log-loss for each match and took a rolling average over 1,000 observations. We also added the random benchmark. The evolution of the log-losses is shown in the next figure.

Evolution of the log-loss for each model

We observe that the prediction model outperforms the benchmark, meaning it is able to learn useful information from the data. We also note that the historical model is a much better benchmark than the random guess.
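A minimal sketch of the per-match log-loss behind this figure, again with assumed column and label names:

```python
import numpy as np
import pandas as pd

def per_match_log_loss(probs: pd.DataFrame, results: pd.Series) -> pd.Series:
    """Negative log of the probability assigned to the observed outcome.

    `probs` has one column per outcome (e.g. 'home', 'draw', 'away') and
    `results` holds the observed label for each match; these names are
    assumptions, not taken from the article.
    """
    p_true = np.array([probs.loc[i, r] for i, r in results.items()])
    return pd.Series(-np.log(np.clip(p_true, 1e-15, 1.0)), index=results.index)

# Rolling average over 1,000 matches, as in the figure (usage sketch):
# ll_model = per_match_log_loss(model_probs, matches["result"]).rolling(1000).mean()
# ll_bench = per_match_log_loss(bench_probs, matches["result"]).rolling(1000).mean()
```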

The league predictability index

The league predictability index (LPI) compares a model's log-loss to the historical benchmark and is used by Sportmonks to rate their model. Each day and for each league, the model log-loss is computed for each event over the last 100 matches, or fewer if the data are not available, with a minimum of 50 matches.

The benchmark log-loss is computed over the same matches, but using the league's average market outcome instead of the model's probabilities.

Then, the quality of the model is classified as poor, medium, good, or high, depending on its log-loss level (ℓ) relative to a percentage of the benchmark log-loss. The next table shows the numbers we use for all our models.

Predictability index classification

For instance, if the benchmark log-loss is -1.01 and our model log-loss is -0.98 for a particular league, the predictability index will be medium. If our model log-loss is -0.94, the index will be good, and it will be high if it is above -0.93425.
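As an illustration only, here is a small classification helper. The percentage thresholds are hypothetical placeholders rather than the official table; only the 92.5% level for "high" can be inferred from the worked example (1.01 × 0.925 ≈ 0.934):

```python
def predictability_index(model_ll: float, benchmark_ll: float,
                         good_pct: float = 0.95, high_pct: float = 0.925) -> str:
    """Classify a model log-loss against the benchmark log-loss.

    Works with positive log-losses (lower is better). The percentage
    thresholds are hypothetical placeholders, not the official table.
    """
    if model_ll >= benchmark_ll:
        return "poor"
    if model_ll <= high_pct * benchmark_ll:
        return "high"
    if model_ll <= good_pct * benchmark_ll:
        return "good"
    return "medium"

# Worked example from the article (absolute values of the log-losses):
print(predictability_index(0.98, 1.01))  # medium
print(predictability_index(0.94, 1.01))  # good
print(predictability_index(0.93, 1.01))  # high
```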

Conclusion

In this article, we discussed why raw performance measures are not enough to assess a model's quality in the case of football predictions. Comparing against a vanilla benchmark helps to understand whether a model really learns useful information from the data.

The historical model appears to be an excellent benchmark, as it shows how a model performs compared to the average outcomes of the league. It is universal, easy to compute, and extensible to any market.

Alternatively, you can use another model as your benchmark to understand whether your model's predictions really add value.

Written by octosport.io

I am a data scientist writing about machine learning for football prediction at octosport.io.
