Defect prediction of big data computer network based on deep learning model

Software defect prediction is an effective way to improve the quality of computer network software, but the performance of a prediction method depends heavily on the characteristics of the data set. To address the problems of severe class imbalance and high feature dimensionality in software defect data sets, the author proposes a software defect prediction method for computer networks based on a deep self-encoding (autoencoder) network that learns deep features of the data. The model first uses a mixed sampling strategy based on unsupervised learning to build training sets from six open-source projects, alleviating the class-imbalance problem in the data; a deep autoencoder network is then used to reduce the dimensionality of the data set, and the reduced-dimension training sets are used to train the classifier at the end of the model.


Introduction
With the arrival of the age of big data, handling massive data places higher requirements on today's data processing technology, so big data management platforms, data warehouse technology and BI (Business Intelligence) tools came into being. Among them, ETL (Extraction, Transformation, Loading) is the core and soul of BI and the data warehouse: it integrates data and improves its value according to unified rules. These data management tools can support business decision analysis and enable people to mine the required information from massive data. Meanwhile, because of the huge volume of data, container technology has opened a new path for the development of computer networks, making the use of system resources ever more efficient. A traditional big data analysis module can perform basic mining and analysis on platform data, but the data usually contains deeper information, so more in-depth processing is required. Unlike traditional machine learning methods, a deep neural network extracts features layer by layer, and the computer extracts them from the data automatically, without human intervention in the extraction process. Especially in the prediction field, deep learning can predict future trends more accurately [1][2], as shown in Figure 1.

Literature review
With the continuous development of today's computer networks, software reliability has become an important factor in software analysis. The scope and functions of network software keep expanding, software complexity keeps increasing, and the probability of network software defects, and hence of software failure, keeps rising. Software defect prediction has therefore become one of the research areas in software engineering and data mining, helping to detect network software defects in time. Defect prediction technology for computer network software can predict whether a software failure will occur, helping the affected team quickly understand the situation and develop appropriate strategies for testing and improvement, thereby improving the reliability and stability of network software. Based on this, many researchers have devoted themselves to research on network software defect prediction technology and have tried to detect software defects through deep learning methods. Traditional software defect prediction relies on manually acquired software metric features for classification learning, and the way features are selected directly affects the accuracy and stability of software defect prediction [3].

Establishment of software defect prediction model
The distribution of network software defect data is extremely unbalanced: only a small fraction of the data is defective. If simple random sampling is adopted, the training set may contain very little or no defect data, making it difficult to train a good prediction model; the lack of enough defect data as training samples is a bottleneck for defect prediction models. By analyzing the potential correlation between software metrics and network software defects, the author proposes a mixed sampling method based on unsupervised learning and random sampling, which ensures that the training set contains a certain proportion of defect data and reduces the impact of class imbalance. On this basis, a deep-learning autoencoder network is used to extract features from the original data set; it reduces the feature dimension, removes much redundant information, and learns features with complex structure, which can significantly improve prediction performance.
The specific steps are as follows: 1) Data preprocessing, including filling in missing data and data standardization; the Z-score method is used for standardization to reduce the impact of large numerical differences between attributes on the prediction model, so that the values of the sample metrics are rescaled to zero mean and unit variance.
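The standardization in step 1) can be sketched as follows. This is a minimal illustration of Z-score standardization on a column-wise feature matrix; the function name and data layout are illustrative assumptions, not the paper's code.

```python
import math

def z_score(columns):
    """Standardize each feature column to zero mean and unit variance."""
    standardized = []
    for col in columns:
        mean = sum(col) / len(col)
        var = sum((x - mean) ** 2 for x in col) / len(col)
        std = math.sqrt(var) or 1.0  # guard against constant columns
        standardized.append([(x - mean) / std for x in col])
    return standardized
```

After this step, every metric contributes on a comparable numeric scale, so attributes with large raw values (e.g. lines of code) no longer dominate the model.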
2) Using the mixed sampling method based on unsupervised learning and random sampling proposed by the author, a training set is drawn from the large, unbalanced software data set so that software defect data accounts for a certain proportion of it; the remainder is used as the test data set.
3) The unlabeled training set is used to train the deep autoencoder network and extract high-level features with complex structure; the extracted features are then given their corresponding labels and used to train the classifier.
4) The trained classifier is used to predict the test data set.
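The feature-extraction core of step 3) can be sketched as a single-layer autoencoder trained by gradient descent; the paper's deep network stacks several such layers. Layer sizes, learning rate and initialization below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(X, hidden_dim, epochs=200, lr=0.05):
    """Train a one-hidden-layer autoencoder; return the learned encoder."""
    n, d = X.shape
    W1 = rng.normal(0, 0.1, (d, hidden_dim)); b1 = np.zeros(hidden_dim)
    W2 = rng.normal(0, 0.1, (hidden_dim, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)   # encoder: low-dimensional features
        X_hat = H @ W2 + b2        # linear decoder: reconstruction
        err = X_hat - X            # reconstruction error
        # backpropagate the squared-error loss
        gW2 = H.T @ err / n; gb2 = err.mean(axis=0)
        dH = (err @ W2.T) * (1 - H ** 2)
        gW1 = X.T @ dH / n; gb1 = dH.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return lambda Z: np.tanh(Z @ W1 + b1)
```

The encoder output (step 3) replaces the raw metrics as classifier input in step 4, which is where the dimensionality reduction described above takes effect.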

Sampling method
Some studies show that the higher the complexity of a metric element, the greater the possibility of a defect. By analyzing box plots of the metric elements in defective and non-defective data, we found that the metric values of defective modules are generally higher than those of non-defective modules. Some scholars have pointed out that the median of an attribute can be used as a threshold to measure its complexity. Based on this assumption, we propose a mixed sampling method based on unsupervised learning and random sampling, which ensures that the samples contain a certain proportion of defect data [4][5].
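The median-threshold idea above can be sketched as follows: the attribute median splits modules into a likely-defective pool and the rest (the unsupervised part), and random sampling then fixes the share of likely-defective modules in the training set. The function name, the dict-based module representation and the 40% share are illustrative assumptions.

```python
import random
import statistics

def mixed_sample(modules, metric, n_train, defect_share=0.4, seed=0):
    """modules: list of dicts holding a numeric complexity metric."""
    rng = random.Random(seed)
    threshold = statistics.median(m[metric] for m in modules)
    likely_defective = [m for m in modules if m[metric] > threshold]
    others = [m for m in modules if m[metric] <= threshold]
    n_defect = min(int(n_train * defect_share), len(likely_defective))
    sample = rng.sample(likely_defective, n_defect)
    sample += rng.sample(others, n_train - n_defect)
    rng.shuffle(sample)
    return sample
```

No labels are consulted when forming the pools, which is what makes the first stage unsupervised.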

Data Set and Evaluation Indicators
The experimental data set comes from several commonly used open-source projects, including the NASA and Eclipse projects in the PROMISE repository and the NetGene Lucene dataset. The metric elements of the NASA dataset mainly include Halstead complexity, McCabe cyclomatic complexity and lines of code; the Eclipse project datasets mainly design metrics around code complexity and the abstract syntax tree; the Lucene dataset mainly includes network and change-genealogy metric elements and is a manually verified, denoised dataset. Table 1 shows the details of the datasets. The number of features falls into three levels: small (21), medium (155) and large (465). The experiments were verified with ten-fold cross-validation on the datasets in Table 1: the data set is divided into ten equal parts, and nine of them are used in turn as the training set with the remaining one as the test set. Ten-fold cross-validation is a very widely used evaluation method in software engineering research.
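The ten-fold protocol described above can be sketched in a few lines; the function name is an illustrative assumption.

```python
def ten_fold_splits(n_samples, n_folds=10):
    """Yield (train, test) index lists: each fold is the test set once."""
    indices = list(range(n_samples))
    fold_size = n_samples // n_folds
    for k in range(n_folds):
        test = indices[k * fold_size:(k + 1) * fold_size]
        train = indices[:k * fold_size] + indices[(k + 1) * fold_size:]
        yield train, test
```

Each sample appears in exactly one test fold, so every module is predicted exactly once per full pass.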
To verify the effectiveness of the proposed method, four mainstream performance indicators are used: recall, precision, F1 score and AUC (Area Under the ROC Curve). These indicators are computed from the confusion matrix in Table 2 [6]. The recall rate is the proportion of correctly predicted defective modules among all truly defective modules, as shown in Formula (1): Recall = TP / (TP + FN) (1). The precision is the proportion of correctly predicted defective modules among all modules predicted as defective, as shown in Formula (2): Precision = TP / (TP + FP) (2).
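Formulas (1) and (2), together with the F1 score, follow directly from the confusion-matrix counts (TP, FP, FN as in Table 2); AUC requires ranked prediction scores and is omitted from this sketch.

```python
def recall(tp, fn):
    """Formula (1): share of true defects that were found."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Formula (2): share of predicted defects that are real."""
    return tp / (tp + fp)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)
```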

Experiment and result analysis
In the software defect prediction experiment with the deep autoencoder network, the mixed sampling method based on unsupervised learning and random sampling proposed by the author requires the N training samples to be drawn randomly from a candidate set of size 2N, so the sampling rate cannot be greater than 0.5. If the sampling rate is too small, the percentage of defect data in the training set becomes too large, which again makes the distribution of the training set unbalanced and hurts defect prediction performance. Therefore, in this experiment the sampling rate is set to u = 0.4, the ten-fold cross-validation is repeated 20 times, and the average of the 20 runs is taken as the final experimental result. Three common classification algorithms are used as classifiers: Softmax, Support Vector Machine (SVM) and Logistic Regression (LR).
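The 20-repetition averaging protocol can be sketched as below; `evaluate_fold` is an assumed placeholder standing in for training and scoring one classifier on one fold, not the paper's code.

```python
def repeated_cv(evaluate_fold, n_repeats=20, n_folds=10):
    """Average a per-fold score over repeated ten-fold cross-validation."""
    scores = [evaluate_fold(r, k)
              for r in range(n_repeats)
              for k in range(n_folds)]
    return sum(scores) / len(scores)
```

Averaging over 20 independent repetitions smooths out the randomness introduced by the sampling step.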
The benchmark methods used for comparison in this experiment are of three types: 1) without any feature extraction method, the defect prediction model is trained directly on the original data set; 2) the common feature extraction method PCA is used to extract features, and then the defect prediction model is trained; 3) a hybrid feature selection method, HFS, is used to extract features, and then the defect prediction model is trained.
Principal component analysis (PCA) is a common feature extraction algorithm whose purpose is to find, through a linear transformation, a low-dimensional projection space that retains the differences between samples, so that a small number of principal components carry most of the information. HFS is a feature selection method that combines a feature-subset evaluator with a feature-ranking evaluator; it can improve feature quality and achieve better prediction performance than general feature selection methods.
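The PCA baseline can be sketched via eigendecomposition of the covariance matrix; the function name and component count are illustrative assumptions.

```python
import numpy as np

def pca_transform(X, n_components):
    """Project centered data onto the top principal components."""
    Xc = X - X.mean(axis=0)                   # center each feature
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]  # largest variance first
    return Xc @ top
```

Unlike the autoencoder, PCA is restricted to linear projections, which is the motivation for comparing the two as feature extractors.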
For the first research question, RQ1, the experiment uses the six datasets in Table 1. The baseline refers to direct classification without feature extraction; it is compared with the author's software defect prediction model based on a deep autoencoder network (DA) under three different classifiers: Softmax, SVM and LR. The experimental results are shown in Tables 3 to 5, where bold numbers indicate the better results [7][8].

Figure 1. Big Data Computer Network Defects

Table 1. Data Set

Table 3. Comparison results between the author's model and the benchmark model when the classifier is Softmax

Table 4. Comparison results between the author's model and the benchmark model when the classifier is SVM