Data mining employs mathematical algorithms to search large amounts of data for hidden information and to analyse the underlying patterns and laws; the practice is also known as knowledge discovery in data. The National Basketball Association (NBA) is the highest-level professional basketball league in the world, and many events in an NBA game are recorded for statistical analysis. In this paper, data mining technology was applied to event statistics to quantify the ability of basketball players and teams, with the aim of predicting basketball results. Using NBA competition data from the 2013–2018 seasons, a quantitative evaluation method was first used to establish a player ability evaluation model, and feature variable selection together with a weighting method over historical game data was used to construct a feature system for evaluating team and player ability. Secondly, machine learning algorithms such as linear regression,
- Data Mining
- Linear Regression
- Neural Network
The National Basketball Association (NBA) is a professional basketball league in North America. Its enormous influence attracts fans from all over the world. Because of the large sums of money and the vast number of fans involved, many studies have attempted to predict the outcome of NBA games by simulating the winning team and analysing players’ abilities, so as to assist coaches in team organisation. NBA games have accumulated a large body of historical game data and statistical analysis data. Even so, analysing and predicting these games remains very complicated.
Statistics have always been essential to basketball player evaluation, from simple field-goal averages to overall efficiency indicators such as the offensive rating introduced by Oliver (2004). Generally, a professional sports network employs a professional analysis team, which is responsible for collecting and interpreting data from each game and for establishing statistical indicators, based on player performance in actual play, to measure the performance level of players and teams. In predicting the performance of individual athletes, only a few statistical indicators can be used. To simplify game analysis and to predict game results accurately from data, related technologies such as machine learning have been applied to predict the outcomes of NBA games.
In this paper, machine learning was used to predict the performance of players before they played in regular-season games, using FPTS collected from professional sports sites together with the players’ NBA statistical indicators, and thereby to predict the outcomes of NBA games. Such a prediction identifies only the winning team, without considering the final score of the game. Moreover, the effects of the feature variables of basketball matches on game prediction were analysed, and feature selection was performed. The machine learning models adopted in this paper included linear regression, extreme gradient boosting (
This paper uses machine learning algorithms to predict player performance from past player data, and to predict team performance and the outcome of basketball games. Specifically, the data adopted, the feature variables constructed and the prediction and evaluation objectives were analysed. First, player game statistics, together with player salary and position information, from three NBA regular seasons between 2015 and 2018 were used. For missing values in player data, the mean value or
Second, indicators were defined to quantify a player's ability, including total points (
Other variables also affected player performance, such as home-court advantage, rest days, team sheets, player positions and salary, which were collected from the DraftKings’ algorithm or coaches’ decision-making system. The variable Value of a player was constructed as the ratio of FPTS to salary divided by 1,000 (that is, FPTS per US$ 1,000 of salary), which was treated as a heuristic. A value higher than 5 indicates that the player is in good form, with a higher ability evaluation. Moreover, the high dimensionality that results from a large number of feature variables was addressed by reducing the correlation between variables and selecting important features. In this paper, considering the predictive effectiveness of variables for players, the following models were established to represent the quantitative relationship between variables and players’ ability values. The ability value
Finally, this paper aims to predict players’ ability value
Because the prediction target of this paper is a continuous value, the task is a regression problem, and regression algorithms were therefore adopted. The model comparison in this paper applies relevant machine learning regression algorithms, including linear regression,
For linear regression, if a linear relationship exists between the independent variables and the dependent variable, they satisfy the following equation:
The least squares method is used to estimate the parameters. It yields the optimal parameters from the training data, and the fitted model is then used to predict outcomes by regression.
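As a minimal sketch of least squares estimation, the closed-form solution for a single predictor can be written as follows. The single-predictor restriction is an assumption made for brevity (the model in this paper uses many feature variables), and the function names and toy data are hypothetical.

```python
def fit_least_squares(xs, ys):
    """Return (intercept, slope) minimising the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Predict the dependent variable for a new value of the predictor."""
    return intercept + slope * x


# Toy data lying exactly on y = 1 + 2x (illustrative only, not NBA data).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_least_squares(xs, ys)
```

With several predictors the same criterion is minimised, but the parameters are obtained by solving a linear system rather than from these two scalar formulas.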
Boosting is a method for transforming weak classifiers into a strong classifier, whose function model is additive. To be specific:
The objective function is a feature of
Optimising the solution:
Once the objective function is determined, training proceeds. At each iteration, the objective function for training one tree can be written as follows:
The input is the predicted value after the (
After a series of calculations, the smallest objective function
After putting it into the original equation, the minimum value obtained is:
So, the finally obtained
Neural network algorithms are widely used in all subdomains of artificial intelligence. Both brief introductions and more detailed explanations are available in the literature.
Data were pre-processed by filling in missing values, unifying variable names and standardising variables. Missing data were filled with the mean value. Data from different sources were processed uniformly, and the player data were standardised for
Except for the variables directly used to calculate FPTS, other variables were selected for models to quantify the player ability by using more advanced statistical methods. The details are discussed in part 2.
Using data from recent games allows a more objective and accurate prediction of players’ abilities. Therefore, each variable was computed as the weighted mean over the past 10 games. According to the relevant theory, the weight of a game in the mean increases linearly with the number of games. In this paper, square-root, linear and square weighting modes were compared in the quantitative evaluation. The weights must be normalised so that they sum to 1. As shown in Figure 1, the square weighting mode is the better weighting method, and it was therefore adopted in this paper.
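The three weighting modes can be sketched as follows. This is an illustration under stated assumptions, not the exact implementation used in this paper: it assumes that games are ordered from oldest to newest and that the raw weight of the i-th game is i raised to the power 0.5, 1 or 2. The function names are hypothetical.

```python
def recency_weights(n_games, mode="square"):
    """Weights for games ordered oldest to newest, normalised to sum to 1."""
    # Assumed exponents for the square-root, linear and square modes.
    exponents = {"sqrt": 0.5, "linear": 1.0, "square": 2.0}
    power = exponents[mode]
    raw = [(i + 1) ** power for i in range(n_games)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_mean(values, mode="square"):
    """Weighted mean of one variable over the most recent games."""
    weights = recency_weights(len(values), mode)
    return sum(w * v for w, v in zip(weights, values))


# Example: ten hypothetical FPTS values, oldest first.
recent_fpts = [20.0, 25.0, 18.0, 30.0, 22.0, 27.0, 31.0, 24.0, 29.0, 35.0]
feature_value = weighted_mean(recent_fpts, mode="square")
```

Under the square mode the most recent games dominate the mean, which is consistent with the motivation of emphasising recent form.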
For consistency among the feature variables, the standard deviation of FPTS over the last 10 games was defined as the variable FPTS_std, and the salary information from DraftKings was defined as the variable Salary. Before a game, whether a player will take part cannot be determined from the model. Thus, the values of the model’s feature variables must be calculated according to the player sheet published before the game.
Because features differ in their ability to predict game outcomes and are correlated with one another, they should be filtered and ranked. Taking field goals (FG) as an example, this feature is highly correlated with effective field-goal percentage (eFG%) because the latter considered far fewer free-throw points. Furthermore, some variables exhibit multicollinearity, such as three-pointers made (3P), three-point attempts (3PA) and three-point percentage (3P%). In this paper, the Pearson correlation coefficient was used to screen features. With a correlation threshold of 0.90, six features were screened out, including three-point attempts (3PA), defensive rebounds (DRB), field goals attempted (FGA), field-goal percentage (FG%) and offensive rating (ORtg). In addition, variables without predictive ability were ignored in the models. The gradient descent method was used in the models to evaluate and quantify the feature significance of 34 variables. Based on the feature ranking, features such as the dummy variables of position (SG, F, C), three-pointers made (3P) and triple-doubles (TD) were excluded. Finally, the remaining 29 variables were used as the selected features for regression, gradient boosting and deep learning.
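Correlation-based screening of this kind can be sketched as follows. The sketch assumes a greedy rule in which, of two highly correlated features, the one encountered first is kept; the paper does not state which member of a pair was retained, so this rule, the function names and the toy data are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Constant columns (zero deviation) are not handled in this sketch.
    return cov / (sd_x * sd_y)


def drop_correlated(features, threshold=0.90):
    """Keep a feature only if |r| with every kept feature is <= threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept


# Toy columns: FGA is exactly twice FG, so the pair is perfectly correlated.
features = {
    "FG": [3.0, 5.0, 7.0, 9.0, 4.0],
    "FGA": [6.0, 10.0, 14.0, 18.0, 8.0],
    "DRB": [4.0, 1.0, 6.0, 2.0, 5.0],
}
selected = drop_correlated(features, threshold=0.90)
```

In this toy example FGA is removed because its correlation with FG exceeds the threshold, while DRB is kept.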
The results of linear regression for all variables, using three different datasets, are shown in Table 1. 5-fold cross-validation was used when the linear regression model was trained. According to the linear model regression prediction, the minimum value of
Prediction effect of linear model
MAE, Mean absolute error;
RMSE, Root mean square error.
max_depth: 5, n_estimators: 354, min_child_weight: 0, gamma: 0.8, learning_rate: 0.0152. Using the model with these tuned parameters results in better performance (
For the neural network model, Figure 4 shows the learning process. A total of 20% of the training data was held out as validation data. It can be seen that the model soon starts to overfit, with the validation loss diverging from the training loss. To prevent overfitting, a dropout layer was added, which randomly dropped 40% of the units before feeding forward to the last layer. In addition, the EarlyStopping callback in the keras.callbacks module was applied: if the validation loss does not improve for five consecutive epochs, training is terminated. This solution improved the model performance, with the RMSE reduced from 9.0678 in the original model to 9.0387.
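The early-stopping rule can be illustrated with a plain-Python sketch. It mirrors the idea of stopping after five epochs without improvement in the validation loss, but it is not the Keras implementation; the function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses: the best value is reached at epoch 3.
val_losses = [9.5, 9.2, 9.1, 9.15, 9.2, 9.3, 9.25, 9.4, 9.5]
stop_epoch = early_stopping_epoch(val_losses, patience=5)
```

In Keras, comparable behaviour is usually configured with `EarlyStopping(monitor="val_loss", patience=5)`; the exact arguments used in this paper are not stated, so these are an assumption.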
Finally, the performance of these three models was compared, as shown in Table 2, where the models in the first and third rows calculated the mean value by using the defined linear combination of coefficients. In the fourth and fifth lines, the
Comparison of three machine learning models
MAE, mean absolute error;
RMSE, root mean square error;
XGBoost, extreme gradient boosting.
After prediction by the models under 5-fold cross-validation, consistent RMSEs (8.9910, 9.0522, 9.0148, 8.9351 and 9.0831) were obtained. To examine the robustness of the model further, the effect of small changes in the input data on model performance was also studied. First, Gaussian noise with mean 0 and variance 1 was generated, then scaled to the range [0, 0.2] and added to the continuous variables in the original input. When 5-fold cross-validation was performed on the noise-perturbed independent variables, the losses barely changed, with
For the model,
The statistical significance was 0.1%, with four degrees of freedom and a critical value of 15.54. Therefore, the weighted mean, feature engineering and
In this paper, the steps of solving an application problem with machine learning methods are discussed, i.e., analysing the problem (predicting NBA games), collecting and processing data, feature selection, model training, evaluation and optimisation. The indicators from DraftKings were used to predict how NBA players perform in regular games. The models were trained to minimise the RMSE between the predicted value and the actual FPTS statistics. The work started from a baseline linear model, in which averages of past statistical data from the seasons were used, together with weighting methods, and important features were extracted from the selected features. The key feature variables for predicting a player’s ability included salary information, team, the player sheet provided by DraftKings, and other important statistical factors such as total rebounds and individual points scored. After feature selection and data normalisation, the
Moreover, the predictions of the models were compared with the actual FPTS statistics in relevant games selected from the training data. The algorithm was applied to five games broadcast on 10 March 2018 and produced the following eight-player lineup, with an expected FPTS of 242.2643 and a total salary of US$ 49,900. The blue, orange and green bars stand for the actual FPTS, the predictions of the final model and the predictions of the baseline linear model, respectively. The players’ positions are given below their names. The final model predicted the actual FPTS much better than the baseline linear model for certain players such as Dillon Brooks (SF) and Dwight Powell (C). However, it tends to overestimate the FPTS of players such as Tomas Satoransky (G) and Kobi Simmons (F). Overall, the predictions for these games were very accurate, with losses of 6.2836 (MAE) and 7.6538 (RMSE).
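For reference, the two error measures can be computed as in the following minimal sketch. The function names and the FPTS values are hypothetical and are not the lineup data reported above.

```python
import math


def mae(actual, predicted):
    """Mean absolute error between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical actual and predicted FPTS for four players.
actual_fpts = [30.0, 20.0, 10.0, 40.0]
predicted_fpts = [28.0, 23.0, 10.0, 44.0]
lineup_mae = mae(actual_fpts, predicted_fpts)
lineup_rmse = rmse(actual_fpts, predicted_fpts)
```

Because squaring penalises large errors more heavily, the RMSE is never smaller than the MAE on the same data, which matches the ordering of the two losses reported for the lineup.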
Throughout the data mining exercise, the feature importance and the feature correlation matrix were essential to understanding how statistical indicators affect the predicted outcome of a competition. Most importantly, data processing and feature extraction in this project have a great impact on the predicted results and deserve close attention. In particular, unifying names and handling missing values during data processing, and ranking and combining important features during feature selection, greatly affect the final model training.
While the relative performance of the models used in this paper generally followed the expected ranking of the algorithms themselves, the final improvement in RMSE was less than 10%. If the opponents’ defensive data, such as a team’s defensive rating and the positions of opposing players, are considered, the prediction accuracy of the model might be improved. Furthermore, there is another important factor: the coach determines when a player enters the court. When simulated players are used, it becomes possible to observe how the number of minutes varies under different game managements and, especially, how the formation tactics change during games. These factors can be modelled for quantitative evaluation and included within the models. DraftKings also provides the views of news outlets and professional reviewers, which can be processed with natural language processing and used for performance prediction and formation optimisation. In conclusion, machine learning algorithms are highly valuable in basketball game prediction and in the quantitative evaluation of player ability, and their use is worthy of further research.