
Feature Sorting Algorithm Based on XGBoost and MIC Combination Model



Introduction

With the rapid development of the Internet, the amount of data generated by human activities is increasing exponentially. Big data is generally understood as PB-scale data, including structured, semi-structured and unstructured data, whose scale and complexity greatly exceed the storage and computing capacity of existing hardware. Because much of this data is irregular or incomplete, it brings great difficulties to data processing. Moreover, the traditional GBDT has design defects: it cannot be trained in parallel, has high computational complexity, and is not suitable for high-dimensional sparse features.

In order to tackle the difficulties and challenges that massive data brings to storage and computation, researchers have been trying to build efficient data analysis systems that can extract valuable information from massive data. In recent years, many Internet companies have launched big data processing systems, such as Google's MapReduce system, Microsoft's Cosmos system for large-scale parallel data processing, and Facebook's Haystack system for storing massive numbers of small files. Against this background of massive data, it becomes even more important to grasp the key information in the data quickly. The big data application "ScholarSpace", developed by the Web and Mobile Data Management Laboratory (WAMDM) of Renmin University of China, fully reflects the big data processing pipeline and combines modern big data technology with traditional disciplines such as economics, law and literature to obtain key information. However, because big data is unstructured and its relationships are complex, it is not easy to establish associations between the structure and attributes of data objects. Moreover, because a single model has a high error when calculating importance, algorithms for extracting key information from big data always have certain limitations. Other researchers have also begun to pay attention to feature selection for massive data. Based on the LDA model, Yang Guijun quantified the key information as topic feature vectors used as explanatory variables and built a prediction model by integrating multiple models with sample perturbation and attribute perturbation using the XGBoost algorithm. From film review data, Zhang Hongli extracted features from user reviews as auxiliary prediction indicators and combined them with other factor indicators as independent variables to build a prediction model. Based on the tree construction process of the XGBoost algorithm, Li Zhanshan proposed a new wrapper feature selection algorithm, XGBSFS, which avoids the limitation of a single importance measure by measuring three importance indexes. Ye Qianyi used the XGBoost model to mine shopping mall attribute data, extracting features through cleaning and visual analysis of the raw data in order to predict user behavior and mall sales. Although these works optimize feature ranking with existing algorithms, the resulting models still suffer from low accuracy or excessive complexity.

To solve the above problems, there is an urgent need for an effective data analysis algorithm to extract the key information from massive data. Compared with single models such as logistic regression and decision trees, integrating different models with a suitable strategy yields higher accuracy and better stability. In recent years, the XGBoost algorithm has been widely used in data analysis and provides a way to calculate feature rankings, but the model is complex and its parameters are difficult to tune. The MIC algorithm is often used for feature selection in machine learning; it is simple and clearly outperforms other similar measures. Therefore, this paper combines the XGBoost and MIC models to design an algorithm for extracting the key information of the data. Experimental results on the anti-breast-cancer candidate drug data set show that the algorithm effectively reduces the error caused by a single model and improves the generalization ability of the model.

XGBoost-MIC combination model

The XGBoost model calculates the importance score of each attribute through the gradient boosting algorithm, which has the advantages of preventing over-fitting, producing simple models and being highly flexible. However, it is not well suited to processing high-dimensional feature data, and it has many parameters, so parameter tuning is relatively difficult. The MIC algorithm only requires a sufficiently large data sample and handles high-dimensional features well. When the sample size is large enough, MIC can cover all functional relationships evenly rather than being limited to particular functional forms. In order to combine the advantages of the two models, a new algorithm, the XGBoost-MIC combination model, is obtained by combining and weighting the XGBoost and MIC models.

XGBoost model

XGBoost, short for eXtreme Gradient Boosting, is an optimized distributed gradient boosting library that is efficient, flexible and portable. XGBoost is a tool for large-scale parallel boosting trees: it optimizes the boosting algorithm, which integrates weak classifiers into a strong classifier. The XGBoost algorithm generates a new tree in each iteration to fit the residual of the previous trees; as the number of iterations increases, the accuracy keeps improving. It is currently among the fastest and best open-source boosting tree toolkits, more than ten times faster than common alternatives. In data science, XGBoost has become a standard weapon in major competitions, and a large number of contestants choose it for data mining tasks. For industrial-scale data, the distributed version of XGBoost is highly portable and is supported in various distributed environments such as Kubernetes, Hadoop, SGE, MPI and Dask, which makes it a good solution for industrial large-scale data.
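As a quick illustration, the following is a minimal sketch of training a gradient boosted tree regressor with the xgboost Python package. It is not the configuration used in this paper; the synthetic data and parameter values are placeholders.

```python
import numpy as np
import xgboost as xgb

# Placeholder data: 200 samples, 10 features (not the paper's dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# A small gradient boosted tree ensemble; each new tree fits the
# residual (gradient) of the previous ensemble.
model = xgb.XGBRegressor(n_estimators=100, max_depth=4, learning_rate=0.1)
model.fit(X, y)
print(model.predict(X[:5]))
```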

The model definition of XGBoost is shown in formula (1).

$$\hat y_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i) \tag{1}$$

The goal of formula (1) is to learn the K tree models f(x). In order to learn the models f(x), the objective function shown in formula (2) is defined.

$$L(\phi) = \sum_i l\left(\hat y_i, y_i\right) + \sum_k \Omega(f_k), \qquad \text{where } \Omega(f) = \gamma T + \frac{1}{2}\lambda \|\omega\|^2 \tag{2}$$

Here K is the total number of trees, fk is the model of the k-th tree, ŷi is the predicted value of the model, yi is the category label of the i-th sample, T is the number of leaf nodes of each tree, and ω is the set of leaf node scores of each tree.

The tree model of XGBoost and the method of calculating feature importance are introduced below.

Tree model

A decision tree is a basic classification and regression method: it is a decision analysis approach that judges project risk and feasibility by building a decision tree from the known probabilities of various situations. As a supervised learning model, a decision tree needs no prior assumptions about the data and can quickly find decision rules based on the characteristics of the data. XGBoost adopts the gradient boosting algorithm to continuously reduce the loss left by the previous decision tree and generate new trees, ensuring the reliability of the final model.

In the process of building a tree model on a given data set, a greedy algorithm selects a feature split point at each layer to create new leaf nodes. The split is chosen so that the gain of the whole tree is maximized; the more often a feature is chosen for splitting and the more gain those splits contribute, the more important the feature is. The weight of a leaf node in the process of feature splitting is denoted w(gi, hi), where gi and hi are the first- and second-order gradients of the loss, given by formulas (3) and (4) respectively.

$$g_i = \partial_{\hat y^{(t-1)}}\, l\!\left(y_i, \hat y^{(t-1)}\right) \tag{3}$$

$$h_i = \partial^2_{\hat y^{(t-1)}}\, l\!\left(y_i, \hat y^{(t-1)}\right) \tag{4}$$

Here l denotes the loss, measuring the gap between the target value yi and the predicted value ŷ. Combining the weights of all leaf nodes, each candidate feature split point can be evaluated by its gain, which is the difference between the total weight of the leaf nodes after the split and the total weight before the split; the split that maximizes this gain is chosen.
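To make this concrete, the sketch below evaluates the standard XGBoost split gain from the summed first- and second-order gradients of a squared-error loss. The regularization constants lambda_ and gamma_ and the toy data are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def split_gain(g_left, h_left, g_right, h_right, lambda_=1.0, gamma_=0.0):
    """Gain of splitting one node into left/right children.

    g_*, h_* are the sums of the first- and second-order gradients (g_i, h_i)
    of the samples falling into each child; lambda_ and gamma_ are the
    L2 and complexity regularization terms of the objective.
    """
    def score(g, h):
        return g * g / (h + lambda_)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma_

# For squared-error loss l = (y - y_hat)^2 / 2: g_i = y_hat - y, h_i = 1.
y = np.array([1.0, 1.0, 10.0, 10.0])
y_hat = np.zeros_like(y)                  # predictions of the previous trees
g, h = y_hat - y, np.ones_like(y)
print(split_gain(g[:2].sum(), h[:2].sum(), g[2:].sum(), h[2:].sum()))
```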

Feature importance model

Importance is a measure of how important each feature is within the feature set it belongs to. XGBoost uses three metrics to measure feature importance: Freq, Gain and Cover. Freq is the number (or percentage) of times a specific feature is used for splitting in the model's trees, Gain is the relative contribution of the feature computed from the gain its splits bring to each tree, and Cover is the relative number of observations covered by the splits on that feature.

The calculation formulas of the above three importance metrics are shown in formula (5).

$$Freq = |X|, \qquad Gain = \frac{\sum_{x \in X} Gain_x}{FScore}, \qquad Cover = \frac{\sum_{x \in X} Cover_x}{FScore} \tag{5}$$

Here X is the set of splits (leaf nodes) in which the feature is used, Gain_x is the gain value when node x in X is split, Cover_x is the number of samples falling on node x, and FScore denotes the number of such splits.
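In the xgboost Python package these metrics can be read directly from a trained booster; the snippet below is a self-contained sketch on placeholder data (the importance types 'weight', 'gain' and 'cover' correspond roughly to Freq, Gain and Cover as used here).

```python
import numpy as np
import xgboost as xgb

# Placeholder data, not the paper's dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

model = xgb.XGBRegressor(n_estimators=50, max_depth=3).fit(X, y)
booster = model.get_booster()

freq  = booster.get_score(importance_type='weight')  # number of splits per feature
gain  = booster.get_score(importance_type='gain')    # average gain of those splits
cover = booster.get_score(importance_type='cover')   # average number of samples covered

for name in sorted(gain, key=gain.get, reverse=True):
    print(name, freq.get(name, 0), round(gain[name], 3), round(cover.get(name, 1.0), 1))
```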

In order to get the best model, we construct the loss L = f(θ) and find θ by gradient descent. In the boosting tree setting, the θ we are looking for consists of the parameters of all the trees (the tree structures and the scores of the leaf nodes). In order to optimize the parameters and reduce the amount of computation, we regard one tree at a time as θ, build the relationship between the loss function and that tree, and take the derivative of the loss with respect to θ; this process of finding the trees one by one is the solution process of XGBoost.

In this paper, the XGBoost model requires three groups of parameters to be determined: general parameters, booster parameters and task parameters. General parameters control the framework at the macro level; booster parameters control the model used in each iteration (a tree model or a linear model); task parameters control the learning task and the learning objective.

Among these three groups, the booster parameters have the greatest impact on algorithm performance, and the maximum depth of the trees most strongly affects the final result. Therefore, the maximum tree depth is tuned first. In the tuning process, the other parameters are first given initial values, set to common typical values or defaults. The test results are then compared while varying the tree depth so as to obtain a better parameter choice. After the maximum depth is determined, the best combination of the other parameters is obtained by traversal (grid search), as sketched below.
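A minimal sketch of this two-step tuning strategy, assuming scikit-learn and xgboost are available: max_depth is scanned first with other parameters held at typical defaults, then the remaining parameters are traversed. The parameter grids and placeholder data are illustrative, not the values used in the paper.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Placeholder training data standing in for the preprocessed descriptors.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 20))
y_train = X_train[:, 0] - X_train[:, 1] + rng.normal(scale=0.2, size=300)

base = xgb.XGBRegressor(n_estimators=200, learning_rate=0.1, subsample=0.8)

# Step 1: tune the maximum tree depth with the other parameters fixed.
depth_search = GridSearchCV(base, {'max_depth': [3, 4, 5, 6, 8, 10]},
                            scoring='r2', cv=5)
depth_search.fit(X_train, y_train)
best_depth = depth_search.best_params_['max_depth']

# Step 2: traverse the remaining parameters with the depth fixed.
grid = {'min_child_weight': [1, 3, 5], 'reg_lambda': [0.5, 1.0, 2.0]}
final_search = GridSearchCV(base.set_params(max_depth=best_depth), grid,
                            scoring='r2', cv=5)
final_search.fit(X_train, y_train)
print(best_depth, final_search.best_params_)
```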

Figure 1 shows the process of building decision tree and optimizing parameters. Figure 2 shows the XGBoost algorithm flow.

Figure 1

Building decision tree

Figure 2

XGBoost algorithm flow

MIC correlation coefficient model

MIC (Maximal Information Coefficient) measures the degree of correlation between two variables X and Y and is often used for feature selection in machine learning. Compared with mutual information (MI), MIC has higher accuracy and is an excellent measure of data correlation.

According to its properties, MIC is general, equitable and symmetric. Generality means that, when the sample contains most of the information of the population, MIC can capture all kinds of associations and is not limited to specific function types (such as linear, exponential or periodic functions); in other words, it covers all functional relationships in a balanced way. The complex relationships between variables can be modeled not only by a single function but also by superpositions of functions. Equitability means that, when the sample size is large enough, similar coefficients are given to relationships of different types corrupted by similar amounts of noise. Symmetry means that MIC(X, Y) = MIC(Y, X), i.e. the same value is obtained regardless of which variable is taken as the starting point.

The basic principle of MIC builds on the concept of mutual information. Mutual information is a measure of the degree of interdependence between random variables and can be regarded as the amount of information about one random variable contained in another. Mutual information is defined by formula (6), where p(x, y) is the joint probability of variables x and y.

$$I(x;y) = \int p(x,y)\,\log_2\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy \;\approx\; I[X;Y] = \sum_{X,Y} p(X,Y)\,\log_2\frac{p(X,Y)}{p(X)\,p(Y)} \tag{6}$$

The MIC model is then given by formula (7), where a and b are the numbers of grid cells in the x and y directions and B is an upper bound on the grid size.

$$MIC(x;y) = \max_{a\,b < B}\,\frac{I(x;y)}{\log_2 \min(a,b)} \tag{7}$$

The MIC model considers the relationship between two attributes at a time, discretizes them in a two-dimensional space, and represents them with a scatter plot. The model divides the two-dimensional space into a certain number of intervals in the X and Y directions and then counts how the scatter points fall into each grid cell, thus solving the problem that the joint probability in mutual information is hard to estimate.

The calculation of the MIC value is generally divided into three steps: first, compute the mutual information of the two variables under different grid partitions of the data; then normalize these mutual information values; finally, select the maximum normalized value over the different scales as the MIC value.
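A sketch of this computation using the minepy library, one common implementation of MIC. The parameter values below are the defaults suggested by its authors and the toy data are assumptions for illustration, not the paper's settings.

```python
import numpy as np
from minepy import MINE

def mic(x, y, alpha=0.6, c=15):
    """Maximal information coefficient between two 1-D arrays.

    minepy internally tries different grid partitions of the (x, y)
    scatter plot, normalizes the mutual information of each grid by
    log2(min(a, b)), and returns the maximum over all grids.
    """
    mine = MINE(alpha=alpha, c=c)
    mine.compute_score(x, y)
    return mine.mic()

x = np.random.rand(500)
y = np.sin(4 * np.pi * x) + 0.1 * np.random.randn(500)
print(mic(x, y))   # close to 1 for a strong (even nonlinear) relationship
```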

Combination model

In the XGBoost algorithm, the steps of constructing the decision trees are as follows:

Traverse all features from the root node. For each feature, sort the samples by their values and determine the split point with the best gain.

Among all candidate split points, select the feature with the highest gain and split on it.

Split until the maximum depth is reached, then build the next tree.

Integrate all the trees to complete the model construction.

Calculate the feature importance according to the feature importance metrics.

Following this algorithmic idea, the results of the two algorithms are weighted and combined by the error reciprocal method, as shown in formula (8).

$$f_t = \alpha_1 f_{1t} + \alpha_2 f_{2t}, \quad t = 1,2,3,\ldots,n, \qquad \alpha_1 = \frac{\varepsilon_2}{\varepsilon_1 + \varepsilon_2}, \quad \alpha_2 = \frac{\varepsilon_1}{\varepsilon_1 + \varepsilon_2} \tag{8}$$

Here ɛ1 and ɛ2 are the errors between the results calculated by the XGBoost and MIC algorithms, respectively, and the true values. As can be seen from formula (8), this method weights each model in inverse proportion to its error, thereby reducing the error of the whole combination model and obtaining predictions closer to the real situation, which improves the overall prediction accuracy. The error reciprocal method always gives a larger weight to the method with the smaller absolute error (i.e. the method with higher prediction accuracy). If the numbers of positive and negative errors of the individual predictions are roughly equal, the error reciprocal method can effectively reduce the combined prediction error and achieve a better combination effect.
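The weighting itself is only a few lines. The sketch below combines two feature score vectors with weights inversely proportional to each model's error, following formula (8); the score arrays and error values are placeholders, not results from the paper.

```python
import numpy as np

def error_reciprocal_combine(scores_xgb, scores_mic, eps1, eps2):
    """Combine two score vectors with weights inversely proportional
    to each model's error (formula (8)): the smaller a model's error,
    the larger its weight."""
    alpha1 = eps2 / (eps1 + eps2)   # weight of the XGBoost scores
    alpha2 = eps1 / (eps1 + eps2)   # weight of the MIC scores
    return alpha1 * np.asarray(scores_xgb) + alpha2 * np.asarray(scores_mic)

# Placeholder feature scores (normalized to [0, 1]) and model errors.
scores_xgb = np.array([0.9, 0.4, 0.1])
scores_mic = np.array([0.8, 0.5, 0.2])
print(error_reciprocal_combine(scores_xgb, scores_mic, eps1=0.3, eps2=0.2))
```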

Experimental process and result analysis

This experiment runs under Linux Ubuntu 16.04, and the framework version adopted is XGBoost 1.5.0. In terms of hardware, the CPU used in the experiment is an Intel(R) Xeon W-2133 and the GPU is an NVIDIA TITAN Xp with 12 GB of memory.

The experimental data come from the 2021 China Postgraduate Mathematical Modeling Competition and describe the effect of 729 molecular descriptors of 1,974 compounds on inhibiting the activity of breast cancer cells, including the biological activity values of the compounds against ERα and the pIC50 values obtained by converting the IC50 values. In this data set, the biological activities of the compounds are constant and do not interact with each other.

The flow of this experiment is shown in Figure 3. First, the source data are preprocessed. Second, the processed data are fed to the XGBoost algorithm and the MIC algorithm, and the results are weighted and combined by the error reciprocal method to obtain the combination model (the XGBoost-MIC combination model). Finally, the results of the XGBoost-MIC combination model are compared with those of the single models (the XGBoost or MIC model), and it is decided whether to adjust the parameters: if the results of the combination model are better than those of the single models, the experimental results are analyzed; otherwise, the parameters are adjusted and the model is retrained.

Figure 3

Experimental flow chart

Data preprocessing

In this data set, there are a certain number of single-valued attributes and empty attributes, so the data must be preprocessed before the data set is used.

Attributes that take a single unique value cannot describe the distribution of the samples, so these attributes are deleted directly.

In order to avoid dealing directly with high-dimensional categorical data and to reduce the difficulty of the learning task, feature encoding is needed. The feature encoding method adopted in this paper is One-Hot Encoding. One-Hot Encoding uses an N-bit status register to encode N possible values: each state is represented by its own register bit, and only one of them is valid at any time.

For data with different magnitudes in the data set, the difference in magnitude may greatly affect the calculation results. In order to reduce this effect, the data must be standardized. In this paper, the data are min-max normalized; the conversion function is shown in formula (9).

$$x' = \frac{x - \min}{\max - \min} \tag{9}$$

Finally, the data set is divided into two parts: a training set (80%) and a validation set (20%). The validation set is used to measure the accuracy of the algorithm in order to find the best model parameters. A sketch of the whole preprocessing pipeline follows.
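The sketch below illustrates the described preprocessing steps with pandas and scikit-learn. The DataFrame, column names and values are purely illustrative placeholders standing in for the 1,974 × 729 descriptor table; this is not the authors' code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame; 'pIC50' is the target, the other columns are descriptors.
df = pd.DataFrame({
    'pIC50':  [5.2, 6.8, 7.1, 4.9, 6.0, 5.5],
    'desc_a': [0.1, 0.9, 0.4, 0.3, 0.7, 0.2],
    'desc_b': [12, 30, 25, 18, 22, 27],
    'const':  [1, 1, 1, 1, 1, 1],                         # single-valued, will be dropped
    'ring':   ['yes', 'no', 'yes', 'no', 'yes', 'no'],    # categorical, one-hot encoded
})

# Drop empty attributes and attributes with a single unique value.
df = df.dropna(axis=1, how='all')
df = df.loc[:, df.nunique() > 1]

# One-hot encode categorical descriptors.
df = pd.get_dummies(df, dtype=float)

# Min-max normalization of the features (formula (9)).
y = df.pop('pIC50')
X = (df - df.min()) / (df.max() - df.min())

# 80% training set, 20% validation set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)
```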

Experimental results and analysis

As shown in Figure 3, the XGBoost and MIC models are implemented separately. The two models are tuned by grid search to obtain their optimal parameters, and the combination model is then calculated by the error reciprocal method. The feature importance scores and ranking calculated by this model are shown in Figure 4.

Figure 4

Order of feature importance

In this paper, R2 (the coefficient of determination, describing how well the regression explains the variation of the response) is selected as the main evaluation index of the prediction performance of each model, and the root mean square error (RMSE) is selected as the auxiliary evaluation index. R2 and RMSE are calculated by formulas (10) and (11) respectively.

$$R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_i (y_i - f_i)^2}{\sum_i (y_i - \bar y)^2} \tag{10}$$

$$RMSE = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(y_i - \hat y_i)^2} \tag{11}$$

Here yi is the measured value of the i-th sample, ŷi (written fi in formula (10)) is the value predicted for that sample by the XGBoost-MIC combination model, ȳ is the mean of the measured values, and m is the number of samples.
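Both metrics are available in scikit-learn; the following is a minimal sketch with placeholder arrays standing in for the validation measurements and the combination model's predictions.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Placeholder arrays; in the experiment these would be the measured pIC50
# values of the validation set and the combination model's predictions.
y_val  = np.array([5.2, 6.8, 7.1, 4.9, 6.0])
y_pred = np.array([5.0, 6.5, 7.4, 5.1, 5.8])

r2 = r2_score(y_val, y_pred)                        # formula (10)
rmse = np.sqrt(mean_squared_error(y_val, y_pred))   # formula (11)
print(f'R2 = {r2:.3f}, RMSE = {rmse:.3f}')
```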

The XGBoost-MIC combination model was tested on the validation-set data split off in the preprocessing step, and its R2 and RMSE values were compared with those of the single models. The results are shown in Table 1.

Table 1. Comparison of R2 and RMSE values between the XGBoost-MIC combination model and the single models.

Method                              R2      RMSE
XGBoost-MIC combination model       0.759   8.705
XGBoost model                       0.723   9.453
MIC model                           0.689   8.994

As can be seen from Table 1, the R2 value of the XGBoost-MIC combination model is the highest (the closer the R2 value is to 1, the closer the predicted values are to the real values); the prediction accuracy of the combination model is about 0.04 higher than that of the better single model. Analysis of the RMSE values shows that the combination model also has a lower error than either single model. In general, the combination model has higher accuracy and smaller error, giving it a clear advantage over the single models.

Conclusion

This paper proposes a feature importance selection algorithm based on the combination of XGBoost and MIC. The algorithm uses XGBoost to rank feature importance, and the XGBoost-MIC model combines the XGBoost and MIC models by the error reciprocal weighting method. This method automatically gives more weight to the model with the smaller error, which corrects the larger errors produced by a single model. The combination model and the single models were compared on the data set of anti-breast-cancer candidate drugs. The results show that the combination model outperforms the single models: the XGBoost-MIC combination model effectively improves the accuracy and efficiency of feature ranking, has better generalization ability, and is suitable for dealing with large-scale data.

In future work, we will continue to improve the performance of the proposed algorithm on large data sets, explore more suitable parameter optimization and model fusion methods, process data sets from other fields, and obtain more robust models adapted to concrete application scenarios.
