Predicting 2020-2021 English Premier League Table Results Using Machine Learning

By William Daseking


The English Premier League is one of, if not the, top soccer leagues in the world. Millions of people around the world tune in every weekend to see their team play and hopefully win. With big-name players like Kevin De Bruyne, Mohamed Salah, Harry Kane, and Christian Pulisic, the league is full of excitement and glory. Unlike American sports leagues, the Premier League, like most other soccer leagues, does not use a playoff system to determine its champion. Instead, it uses a point system where teams earn points based on the results of their matches. The team with the most points at the end of the season is crowned champion and given the Premier League Trophy. Teams earn 3 points for a win, 1 point for a draw, and 0 points for a loss. Beyond declaring a champion, the 3 teams with the lowest point totals are relegated each season to a lower division, and the top 3 teams from the lower division are promoted to take their places. Additionally, the top 4 teams qualify for the UEFA Champions League, the preeminent European competition with millions in prize money on the line. Because of all this, every game matters: a slip-up could cost a team the championship or a chance to play in the Champions League, or even bring the humiliation of relegation to a lesser competition. The Premier League table is the table of the 20 teams in the league sorted by the number of points they have, and it determines all of the things above. Because of the importance of this table, predicting it can be very valuable, as it gives insight into who is on the right track and who should be worried. It can show you if your team is doing well, or if you should be wearing a paper bag over your head in embarrassment. 
This project attempts to predict the results of the Premier League table by taking advantage of historical statistics from previous seasons and using them in conjunction with machine learning techniques to get a good estimate of how the current 2020-2021 Premier League table will finish.

Importing the Necessary Python Libraries

Before we begin to load and work with the dataset, we first have to import the necessary libraries. The first library we import is Matplotlib’s pylab module, which helps us create our visualizations. The next two are NumPy and Pandas, which help us store and manipulate our dataset. Next is the Seaborn plotting library, which is built on top of Matplotlib’s plotting system and makes it easy to produce beautiful-looking visualizations. The last four imports are from the SciKit-Learn library, which will help us create our models and test how well they work.
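The imports described above can be sketched as follows. The specific SciKit-Learn pieces shown are assumptions based on the models and evaluation used later in this project:

```python
# A minimal sketch of the imports described above; the four SciKit-Learn
# imports are assumptions based on the models used later in the project.
import matplotlib.pylab as plt   # plotting
import numpy as np               # numerical arrays
import pandas as pd              # DataFrames for storing the dataset
import seaborn as sns            # statistical visualizations on top of Matplotlib

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
```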

Loading Our Dataset

After importing all the Python libraries we need, it is time to load the dataset we will be working with. Our dataset was built from data and statistics on FBref, a soccer statistics site run by Sports Reference. It contains statistics dating back to the 1992-1993 Premier League season, the first year the league existed. Each season’s statistics are spread across three CSV files: a league table file that contains the final league table as well as some major statistics for each team that season, a squad standard statistics file that contains standard soccer statistics for each team, and a squad goalkeeping file that contains goalkeeping and defensive statistics for each team. Each of these files was created by going to each season’s page on FBref and clicking the “Get table as CSV” button under the “Share & more” dropdown next to each of the three tables we are pulling data from. The tables are converted into CSV format, which I then copied into an empty file named according to the year the season took place in and which table the data came from. In total, 87 files make up our dataset, and you can find all the files and the Jupyter Notebook for this project in the GitHub repository linked here. In the code below, we have three functions, one for each table type, that load a specified year’s data into a Pandas DataFrame, clean it to remove any duplicate or unneeded columns, and rename any columns that need more descriptive names. We then loop through all the years our dataset covers, call those three functions to load the three files associated with that year, merge them into one DataFrame, and append the result to the end of a single DataFrame containing data from all the years, so we only have to work with one DataFrame.
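The merge-and-append step can be sketched with tiny in-memory stand-ins for the three per-season tables (the real notebook reads the FBref CSV exports with pd.read_csv; the column names and values below are simplified assumptions for illustration):

```python
import pandas as pd

def load_season(year, league_table, squad_stats, goalkeeping):
    """Merge the three per-season tables on the squad name and tag the season year."""
    merged = league_table.merge(squad_stats, on="Squad").merge(goalkeeping, on="Squad")
    merged["Year"] = year
    return merged

# Tiny stand-ins for one season's three CSV files (made-up values)
league_table = pd.DataFrame({"Squad": ["Arsenal", "Leeds"], "Rk": [1, 2], "Pts": [84, 77]})
squad_stats = pd.DataFrame({"Squad": ["Arsenal", "Leeds"], "Gls": [73, 60]})
goalkeeping = pd.DataFrame({"Squad": ["Arsenal", "Leeds"], "Save%": [74.1, 70.3]})

# Appending each season onto one combined DataFrame, as described above
full_data = pd.concat(
    [load_season(2002, league_table, squad_stats, goalkeeping)],
    ignore_index=True,
)
print(full_data.shape)
```

Merging on the squad name gives one row per team-season with all three tables' columns side by side, which is the shape the rest of the analysis works with.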

Data Dictionary

The list below describes the column names of each column in our full dataset:

Exploratory Data Analysis

In this next section, we will conduct an exploratory data analysis of our dataset. We will look at and visualize our dataset to see its trends and properties. This will give us a better understanding of the dataset as a whole as well as give us insight into what statistics might be useful for the model later on.

Scatter Plot of Year vs. Points

Our first visualization is a scatter plot of individual team point totals over the years. Looking at the plot, we see that the regression line has a slightly negative slope, which indicates that point totals are trending downwards over time. This is an interesting trend and one we should look into further. Besides the negative slope, I can’t make out any other trends because of how many points are on the plot. This can be remedied by looking at the point distributions by year, which we will do in the next visualization.
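A plot like this can be drawn with Seaborn's regplot, which overlays a least-squares regression line on a scatter plot. The data below is a small made-up stand-in, not the real dataset:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pylab as plt
import pandas as pd
import seaborn as sns

# Illustrative stand-in; the real plot uses every team-season since 1992-93
full_data = pd.DataFrame({
    "Year": [1993, 1995, 2000, 2005, 2010, 2015, 2019],
    "Pts": [84, 89, 91, 83, 86, 87, 98],
})

# regplot draws the scatter of point totals plus a fitted regression line
ax = sns.regplot(x="Year", y="Pts", data=full_data)
ax.set(xlabel="Season", ylabel="Points")
plt.savefig("points_by_year.png")
```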

Violin Plot of Points by Year

The next visualization we have is a violin plot of points by year, which shows us the distribution of points for each year. Looking at the plot, I notice that the white dot in each distribution, which marks the median point total for that year, does appear to drift down slightly over time. Another thing I notice is that there are periods where teams post very high and very low point totals, and other periods with more parity in point totals. In particular, the period from 1992 to 1998 shows a denser distribution with fewer points far from the center. The mid-to-late 2000s as well as the late 2010s show the opposite trend, with distributions that have wider ranges in point totals. To look further into these trends, we should take a look at wins, draws, and losses over the years.
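The per-year distributions can be drawn with Seaborn's violinplot, whose default inner miniature boxplot includes the white dot at each distribution's center. The data below is a small illustrative stand-in, not the real dataset:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pylab as plt
import pandas as pd
import seaborn as sns

# Made-up point totals for two seasons, just to show the plot call
full_data = pd.DataFrame({
    "Year": [1993] * 4 + [2019] * 4,
    "Pts": [84, 71, 60, 52, 98, 72, 45, 21],
})

# One violin per season showing the distribution of team point totals
ax = sns.violinplot(x="Year", y="Pts", data=full_data)
ax.set(xlabel="Season", ylabel="Points")
plt.savefig("points_violin.png")
```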

Scatter Plot of the Number of Wins for a Team by Year

With our next visualization, we have a scatter plot of the number of wins for a team by year. The first thing that jumps out to me is that the regression line is roughly horizontal, indicating there is little change in the number of wins over time. There are a few high outliers after 2015, which indicates that there must have been some very dominant teams in those years. Additionally, there are some low outliers between 2005 and 2010, so there must have been teams that did not do as well during that period. These might have caused the wider distributions that we noticed in the previous visualization. As for the trend of decreasing point totals from the first visualization, this plot doesn’t give us any explanation for it.

Scatter Plot of the Number of Draws for a Team by Year

The next visualization we have shows us the number of draws for a team by year. The big thing that I notice is that over time, the number of draws appears to be decreasing. This could be what is behind the trend of decreasing point totals over time from the first visualization. Since we didn’t see any drop in the number of wins over time, this reduction in draws should be matched by an increase in the number of losses over time. In the next visualization, we will take a look at that.

Scatter Plot of the Number of Losses for a Team by Year

Our next visualization is a scatter plot of the number of losses for a team by year. Looking at the regression line, we see that the number of losses seems to stay roughly the same over time. This is interesting, as we figured we would see an increase in the number of losses over time to compensate for the reduction in draws that we saw in the last visualization. This indicates that we should look at the number of games played by year, since that reduction in draws and the trend from the first visualization had to come from somewhere.

Scatter Plot of Number of Matches for a Team by Year

This visualization shows us a scatter plot of the number of matches for a team by year. This plot finally shows us where the decreasing trend in the number of points and the number of draws came from. As you can see from the plot, in the first 3 seasons the teams played 42 matches, and after that they only played 38. This reduction in the number of matches means there are fewer points available to a team, which would cause the trends we noticed earlier. Because of this change in the number of matches, we should look at statistics from a per-match point of view to negate the effect 4 additional games would have on total season statistics. Additionally, since each team plays every other team twice, this also means that there were 22 teams in the first three seasons and 20 teams in the seasons after that.
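Converting season totals to per-match figures is a simple division by matches played. A minimal sketch, assuming a column "MP" holds matches played (the column names and values here are illustrative assumptions):

```python
import pandas as pd

# Two stand-in team-seasons: one from a 42-match season, one from a 38-match season
full_data = pd.DataFrame({
    "Squad": ["Man Utd 1993-94", "Arsenal 2003-04"],
    "MP": [42, 38],   # matches played
    "Pts": [92, 90],  # season point total
    "GF": [80, 73],   # goals scored
})

# Divide each season total by matches played to get comparable per-match stats
for col in ["Pts", "GF"]:
    full_data[f"{col}PerMatch"] = full_data[col] / full_data["MP"]

print(full_data[["Squad", "PtsPerMatch", "GFPerMatch"]])
```

On a per-match basis the two seasons become directly comparable despite the different season lengths.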

Scatter Plot of Points per Match for a Team by Year

Because of our discovery in the change in the number of matches played by each team, we should go back and look at the points for a team by year from a per match standpoint to negate the effects of 4 additional games on point totals. We see this working in the visualization above which shows that points per match are stable over time. This shows us that the trend we saw in the original scatter plot of point totals for a team by year was most likely due to the additional matches in the first 3 seasons.

Scatter Plot of Goals per Match vs. Rank in League Table

This next visualization looks at goals per match vs. rank in the league table. As goals are the way you win matches, you would expect that teams that score more goals per match are more likely to be at the top of the league table. We see this in our visualization as we see that the teams ranked the highest typically score the most goals per match and as you go down the league table, teams tend to score less. Because goals per match seems to do a good job indicating how a team will rank in the table, it might be a good statistic to consider for our model.

Scatter Plot of Assists per Match vs. Rank in League Table

Another statistic that I think might make for a useful predictor for our model is assists per match. An assist is when a player passes the ball to a teammate who then scores. Since assists are tied to goals, which we saw above can lead to higher positions in the league table, assists might also be valuable for our model. This visualization plots assists per match vs. rank in the league table to see if it shows us something useful. Looking at the visualization, we again see a decreasing trend in the number of assists per match as you go down the league table, indicating that more assists per match are connected with a higher league table position.

Scatter Plot of Goals Allowed per Match vs. Rank in League Table

The next statistic we are looking at in relation to league table rank is goals allowed per match. The visualization we have here is a scatter plot of goals allowed per match vs. rank in the league table. In this plot we see a trend of increasing goals allowed per match as you go down the rankings. This makes sense: if a team allows a lot of goals, it is harder to win or draw games and earn points, as the team would also have to score more goals to make up for it. Because of this, we should consider using it as a predictor for our model.

Scatter Plot of Goal Difference per Match vs. Rank in League Table

Goal difference per match is another statistic that might give us insight into a team's rank as it looks at the average difference in goals between the team and who they are playing. A negative goal difference per match indicates that they are getting outscored on average which would make it hard to win a lot of games. The visualization above plots goal difference per match vs. rank in the league table. Looking at the plot, we see a trend in that teams that are ranked higher typically have more positive goal differences per match. This might make this statistic a good predictor for the model.

Scatter Plot of Goalkeeper Save Percentage vs. Rank in League Table

Another statistic I would like to take a look at is goalkeeper save percentage since if your goalkeeper is giving up a lot of goals, it is hard to win games. This visualization plots goalkeeper save percentage vs. rank in the league table to see if there is any trend between the two. Looking at the visualization, we do see a slightly decreasing trend between the two. This statistic might be a decent predictor but we will have to do further analysis to see if it is worth adding it to our model.

Scatter Plot of Clean Sheet Percentage vs. Rank in League Table

A clean sheet is a game where the team does not concede a goal. Clean sheet percentage is another defensive statistic that looks at the percentage of matches that a team kept a clean sheet. Our visualization above plots clean sheet percentage against rank in the league table to see if they are connected in any way. The plot does appear to show a connection between the two as teams that are ranked higher tend to have higher clean sheet percentages. Again, because of this trend, this might make a good predictor for our model.

Scatter Plot of Shots on Target Allowed Per Match vs. Rank in League Table

The last statistic we are going to look at is shots on target allowed per match, which measures a team's ability to prevent shots that could go in. The fewer shots on target a team allows, the fewer chances its opponents have to score. This visualization plots the statistic against rank in the league table so we can see if there is a relationship between the two. Again, we find one: teams that concede fewer shots on target per match are typically higher-ranked. This relationship makes it another statistic we should consider for the model.

Deciding Which Predictors To Use

In this section, we will finally get to build our model for predicting league table rankings. Because of the additional matches played in the first three seasons, any statistics considered will need to be calculated on a per-match (or the equivalent per-90-minutes) basis to remove the effects of the additional games. Along with the statistics identified above, I have also included a few additional potential predictors that we will evaluate alongside them: penalty kicks per match, penalty kick attempts per match, yellow cards per match, red cards per match, number of players on a team, and the average age of players on a team. I decided to look at the penalty kick statistics because when one is given, it can be a contentious decision that people talk about long after the game. Because of that contentiousness, I thought it would be interesting to see if there was any correlation between them and ranking in the league table. I chose the yellow and red card statistics because I was curious if there was any correlation between the discipline of a team and their ranking. Lastly, I chose the number of players and average age to see if there was any connection between squad depth and ranking or age and ranking. For each of these potential predictors, we need to ensure that it will actually add to the model, which we can do by checking its Pearson correlation coefficient with league table rank below.

Looking at our heat map of the correlation matrix between our potential predictors and rank in the league table (the variable we are trying to predict), we see a lot of different numbers. To keep things simple, if a statistic’s Pearson correlation coefficient is roughly between −0.3 and +0.3, we will remove it from consideration in our model, since its relationship with what we are trying to predict is weak. With this criterion, we end up throwing out the number of players, average age, the penalty kick statistics, and the yellow/red card statistics. This leaves us with goal difference per match, goals scored per match, assists per match, goals allowed per match, save percentage, clean sheet percentage, and shots on target allowed per match.
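The correlation screen described above can be sketched as follows. The data is a tiny made-up stand-in with one strongly correlated candidate and one weak one; the column names are assumptions for illustration:

```python
import pandas as pd

# Made-up team-seasons: goal difference tracks rank closely, card counts do not
data = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5, 6, 7, 8],
    "GoalDiffPerMatch": [1.5, 1.1, 0.9, 0.4, 0.1, -0.2, -0.5, -0.9],
    "YellowCardsPerMatch": [1.7, 1.4, 1.9, 1.6, 1.5, 1.8, 1.4, 1.6],
})

# Pearson correlation of every candidate predictor with league table rank
correlations = data.corr(method="pearson")["Rank"].drop("Rank")

# Keep only candidates whose |r| with rank exceeds the 0.3 threshold
kept = correlations[correlations.abs() > 0.3].index.tolist()
print(kept)
```

The heat map in the write-up is just a visual rendering of this same correlation matrix (e.g. via sns.heatmap).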

Next, we look at our potential predictors and their correlations with each other. We do not want to include anything that is effectively the same statistic under a different name, as that just adds data to the model without improving it. To prevent this, we will take a look at the Pearson correlation coefficient heat map for the potential predictors.

Looking at the heat map above, we see that for the most part our predictors are not correlated, but there are a few iffy ones. In particular, the coefficients between goals per match, assists per match, and goal difference per match are concerning: they are all 0.86 or higher, indicating strong positive correlations. Because of this, we should keep only one of the three to get rid of any redundancy in our predictors. I chose to stick with goal difference per match, as it captures both the offensive and defensive sides and how they compare. Goals allowed per 90 is also highly correlated with goal difference per match, so I removed it as well. That leaves goal difference per match, goalkeeper save percentage, clean sheet percentage, and shots on target allowed per match as the final set of predictors we will use to build our model.

Building and Testing the Model

Now that we have chosen our predictors, we can finally build and test our model. Since our task is predicting a final league table rank, which looks like a category, it might seem like a classification task, but I have chosen to use regression. This is because a classifier might assign multiple teams in one season to the same rank, which can’t happen in real life. A regressor avoids this by giving us a decimal value that can separate two teams; essentially, our predicted table will be the sorted order of the predicted ranks given to each team. With regression chosen, we now have to pick a few regressors to test against each other and see which one performs best. I ended up choosing linear regression, a decision tree regressor, and a random forest regressor. I chose the linear regressor because of its simplicity and how widespread it is; additionally, the ranks lend themselves to a linear pattern that fits linear regression nicely. Typically you might use a gradient descent version of linear regression when doing machine learning because it can be quicker. I decided against this because the dataset is on the small side, so it’s easy to compute the exact linear regression solution and not much slower than using gradient descent. I chose the decision tree regressor because it is easy to interpret and understand and works decently well on a wide variety of data. Lastly, I chose the random forest regressor because it is often one of the better machine learning methods, thanks to its combination of many decision trees into one prediction and its ability to work well on a wide variety of datasets. To evaluate the three models against each other, we will conduct a ten-fold cross-validation procedure on each one, average the scores across folds, and compare the averages.
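The ten-fold comparison described above can be sketched on a small synthetic dataset (the real features are the four predictors chosen earlier; everything below is illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))  # stand-in for the four chosen predictors
# Synthetic target: a linear combination of the features plus noise
y = X @ np.array([3.0, -1.0, 2.0, 0.5]) + rng.normal(scale=0.5, size=200)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(random_state=0),
}

# For regressors, cross_val_score returns one R^2 score per fold by default;
# we average the ten fold scores for each model and compare the averages.
averages = {
    name: cross_val_score(model, X, y, cv=10).mean()
    for name, model in models.items()
}
print(averages)
```

With the real dataset the same loop produced the fold averages reported in the next paragraph.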

Since our models are regressors, our machine learning library SciKit-Learn will return R squared values for each iteration of the cross-validation procedure. Looking at the results from our ten-fold cross-validation procedure, we see that the random forest regressor did the best with an R squared average of around 0.86. The linear regressor did second best with an R squared average of around 0.82. Last was the decision tree regressor with an R squared average of around 0.78. Based on these results, we will use the random forest regressor to predict the table for the 2020-2021 Premier League season.

Predicting the 2020-2021 Premier League Table

Now that we have selected the method we will use to predict the table, we can load in the current season data. This data came from FBref just like the rest of the data we have been using and was collected from their site on December 14th, 2020. After loading the current season data, we fit the model to the historical data we have been working with. We then pass each team’s data for the predictors we chose into the model, collect the results into an array, and sort it to get the predicted table.
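The fit-predict-sort step can be sketched as follows. The data, team names, and rank target here are made-up stand-ins, not the real historical or current-season values:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X_hist = rng.normal(size=(100, 4))  # stand-in for the four predictors, past seasons
# Stand-in rank target: rank of each row by its first feature (1 = best)
y_hist = np.argsort(np.argsort(-X_hist[:, 0])) + 1.0

# Fit the random forest on the historical team-seasons
model = RandomForestRegressor(random_state=0)
model.fit(X_hist, y_hist)

# Predict a rank value for each current-season team, then sort to get the table
current = pd.DataFrame(rng.normal(size=(4, 4)),
                       index=["Team A", "Team B", "Team C", "Team D"])
current["PredictedRank"] = model.predict(current.values)
predicted_table = current.sort_values("PredictedRank")
print(predicted_table["PredictedRank"])
```

The predicted values are decimals rather than whole ranks, which is exactly what lets us break ties between teams when sorting.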

Using our model, we predict that Tottenham Hotspur will be the team that finishes top of the table and wins the Premier League. On the other hand, we can sadly predict that Fulham, West Brom, and Sheffield United will be relegated to the lower division. Now, if you look at the number next to each team, you might notice that the top team isn’t a 1 and that other teams are only a few tenths of a place away from each other. This is because we are using regression: the model is saying that each team’s statistics look like those of teams that finished around that rank historically. However, by sorting the teams according to their predicted values, we get each team’s rank relative to the other teams that season, which is what we care about in determining the predicted table. So while the top team might not have a predicted rank of exactly 1, relative to the other teams’ predicted ranks, it is on top.

A New Statistic Enters the Field

While the model above used the entire history of the Premier League to predict the league table, the next model uses data from the 2017-2018 season onward. This is because in 2017, new, more advanced statistics began to be collected, such as expected goals, expected assists, and expected goals allowed. These statistics are calculated by estimating the probability that a shot will result in a goal based on the characteristics of play leading up to it, such as the location the shot was taken from, the body part used to take the shot, the type of pass before the shot, and the type of attack the shot came from, among others. Using these probabilities, you can gain insight into a team’s ability to create opportunities to score or concede goals. More information about these statistics can be found here. For our second model, we will utilize the expected goal difference per match statistic alongside the existing predictors to provide an additional data point that might change how our model predicts the table.
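The core of the expected goals idea can be shown with a few lines: a team's xG for a match is the sum of its per-shot scoring probabilities. The probabilities below are made up for illustration; real xG models derive them from the shot characteristics described above:

```python
# One made-up scoring probability per shot a team took in a match
shot_probabilities = [0.76, 0.12, 0.04, 0.33, 0.08]

# Expected goals = sum of the per-shot probabilities
expected_goals = sum(shot_probabilities)
print(round(expected_goals, 2))  # prints 1.33
```

Expected goal difference per match then compares a team's xG against the xG it concedes, averaged over its matches.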

Using this new model utilizing the new advanced statistics, we predict that Chelsea will finish top of the table and win the Premier League. At the bottom of the table, we find Burnley, West Brom, and Sheffield United in the bottom three spots and therefore they are predicted to be relegated.

Conclusion

With our dataset of 28 years of English Premier League history, we created two machine learning models using random forest regressors to predict what the final league table will look like. We analyzed the dataset to identify four predictors that could be used to build our models. According to the first model, which uses the full historical data, we predict that Tottenham Hotspur will win the Premier League. In the second model, which utilizes a much smaller dataset, but includes a new, advanced statistic, we predict that Chelsea will win the Premier League. We will have to wait till the end of the season to see who was right. May the best model win!