With the development of artificial intelligence, deep learning techniques such as the convolutional neural network (CNN) have attracted increasing attention from researchers and engineers and have been widely applied [1,2,3,4,5]. For example, in the field of image analysis, compared with fully connected neural networks, CNN has greatly improved the accuracy and efficiency of analysis owing to its few connection parameters, shared connection weights and local receptive fields, and it has been widely recognised by experts and technicians in the field. However, several factors can cause a model to overfit during training [6, 7]: an overly complex CNN structure, too many neurons and weight parameters, a training sample set that is too small, or too many training rounds. Overfitting manifests as a training curve that fits well with a high accuracy rate, while the accuracy of the trained model on the test data set is very low [8].
To address the overfitting phenomenon in model training, many scholars have proposed a series of solutions. For example, for the problems in early machine learning, Li et al. [9] discussed three solutions to the overfitting problem of the BP algorithm, namely the early stopping method, the adjustment method and the automatic generation of hidden nodes; this is among the earlier research on the overfitting problem. In 2012, Srivastava et al. [10] and Hinton [11] proposed the dropout method to solve the overfitting problem in machine learning. Chen et al. [12] proposed an improved singular value decomposition (SVD) algorithm by studying the overfitting phenomenon in the score prediction problem. Related research is also reported in other works [13,14,15]. For the overfitting problem of deep-learning-based CNN models, a series of solutions has been proposed [16, 17], including data augmentation, regularisation, ensemble learning and early stopping. However, constrained by factors such as knowledge level, technical application and research time, as well as the still-developing theory of deep learning, these results each have their own advantages but also certain limitations.
Based on the characteristics of CNN, this paper studies the overfitting of the CNN model. We discuss a mask-based maximum pooled dropout method for the pooling layer and combine it with weight decay to reduce overfitting. Theoretical analysis and experimental comparison show that this method can effectively avoid overfitting in CNN model training and improve image detection ability.
In general, a CNN consists of three kinds of layers: convolution, pooling and fully connected layers. A group of convolution operations and a group of pooling operations form one convolution-pooling stage. After one stage is executed, the next convolution-pooling stage continues, completing the extraction of sample features; the final fully connected layer flattens the obtained features into one dimension and completes the classification. Because CNN uses local connections between neurons, shares the local connection parameters and applies downsampling in pooling, it has fewer weight parameters and fewer features than a general neural network during feature extraction, represents objects better, and its representation generalises better.
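As a concrete illustration of local receptive fields and weight sharing (a minimal NumPy sketch of ours, not the authors' code), the loop below reuses one 3*3 kernel at every image position, which is exactly why a CNN needs far fewer parameters than a fully connected network:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one shared kernel over the image (stride 1, no padding)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # The same kernel weights are reused at every position: each
            # output unit sees only a local kh*kw receptive field.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5*5 input
kernel = np.ones((3, 3)) / 9.0                     # one 3*3 shared kernel
print(conv2d(image, kernel).shape)                 # (3, 3) feature map
```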
Assume the $j$-th feature map of convolution layer $l$ is computed in the standard way as $x_j^{l} = f\big(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^{l} + b_j^{l}\big)$, where $M_j$ is the set of input feature maps, $k_{ij}^{l}$ is the convolution kernel, $b_j^{l}$ is the bias term, $*$ denotes convolution and $f(\cdot)$ is the activation function.
After the convolution operation comes a pooling layer, which downsamples the features of the convolution layer to reduce the dimensionality and complexity of the network. The pooling operation is illustrated below.
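Since the original pooling formula is not reproduced in this excerpt, the following sketch shows the common non-overlapping case: max pooling keeps only the largest activation in each k*k region, halving both spatial dimensions when k = stride = 2.

```python
import numpy as np

def max_pool(fmap, k=2, stride=2):
    """Non-overlapping max pooling: keep the largest value per k*k region."""
    oh = (fmap.shape[0] - k) // stride + 1
    ow = (fmap.shape[1] - k) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = fmap[i * stride:i * stride + k,
                             j * stride:j * stride + k].max()
    return out

fmap = np.array([[1., 3., 2., 4.],
                 [5., 6., 7., 8.],
                 [9., 2., 1., 0.],
                 [3., 4., 6., 5.]])
print(max_pool(fmap))   # [[6. 8.] [9. 6.]] -- dimensions halved
```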
Immediately after the pooling operation is completed, the fully connected operation is performed, and the weight parameters are updated by backpropagation through the loss function. The loss function directly determines the training effect and performance of the CNN model, so its design is very important; for an explanation, see 'Weight Decay Calculation' in the next section.
The previous section introduced the basic principles of CNN. Building on them, this section designs a CNN model with maximum pooled dropout and weight decay to avoid overfitting during model training as much as possible.
Maximum pooling is a common method in the CNN pooling layer. In this paper, maximum pooling is combined with dropout to reduce the occurrence of overfitting during model training. Maximum pooled dropout also overcomes the drawback of average pooling, which dilutes neurons with high activation values, and avoids the drawback of plain maximum pooling, which ignores the values of all other pooling units, so it can express the characteristics of the pooling layer effectively.
Before describing the maximum pooled dropout operation, we first explain the mask processing method. Mask processing sets the activation values of some neurons in each pooling region of the pooling layer to 0 with a certain probability, and the maximum pooled dropout operation is then performed over the retained neurons. The design process is described below.
Assume the pooling-layer feature map and its mask are as shown in Figure 1.
In Figure 1, a 2*2 pooling kernel with a stride of 2 is adopted for pooling. If only maximum pooling is applied, the result is [[45], [57]]. If maximum pooling is carried out after applying a mask, the result is [[42], [55]], because some pooling units, including the original maxima, are discarded.
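The mask-then-pool idea can be sketched as follows; the input values here are hypothetical (Figure 1's matrix is not reproduced in this excerpt), and the retention probability p = 0.5 is an assumption:

```python
import numpy as np

def masked_max_pool(fmap, p=0.5, k=2, stride=2, seed=0):
    """Max pooled dropout: zero each unit with probability 1 - p via a
    Bernoulli mask, then take the maximum of the surviving units in
    every k*k pooling region."""
    rng = np.random.default_rng(seed)
    masked = fmap * (rng.random(fmap.shape) < p)   # 1 = keep, 0 = drop
    oh = (fmap.shape[0] - k) // stride + 1
    ow = (fmap.shape[1] - k) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = masked[i * stride:i * stride + k,
                               j * stride:j * stride + k].max()
    return out

fmap = np.array([[12., 45., 9., 31.],
                 [42., 7., 57., 55.],
                 [3., 28., 16., 2.],
                 [19., 33., 24., 41.]])   # hypothetical activations
print(masked_max_pool(fmap))
```

With the mask, a region's plain maximum may be dropped, so a smaller surviving unit becomes the pooled output, which matches the behaviour described for Figure 1.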
The following describes how the maximum pooled dropout is designed and how the maintenance (retention) probability, denoted $p$ below, is obtained.
From the above, the mask follows a Bernoulli probability distribution: each unit in a pooling region is retained with probability $p$ and set to 0 with probability $q = 1 - p$.
Analysing formula (6), it can be seen that when performing maximum pooled dropout in a pooling region, the probability that a given unit becomes the pooled output depends both on its rank among the activations of the region and on the retention probability $p$. Assuming the number of feature maps and pooling regions is larger, the same per-region analysis carries over to the whole pooling layer.
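Formula (6) itself is not reproduced above, but in the usual analysis of max-pooling dropout, which this passage appears to follow, sorting the n activations of a region in ascending order a_1 <= ... <= a_n gives P(output = a_i) = p·q^(n-i) and P(output = 0) = q^n. The sketch below computes these probabilities and checks them by simulation; the formulation and all symbols here are our assumptions:

```python
import numpy as np

# Assumed analysis: a_i is the pooled output exactly when a_i survives
# (probability p) and every larger unit in the region is dropped.
p, n = 0.5, 4
q = 1.0 - p
probs = [p * q ** (n - i) for i in range(1, n + 1)]   # P(output = a_i)
print(probs, q ** n)             # q**n is P(all dropped, output = 0)
print(sum(probs) + q ** n)       # the probabilities sum to 1

# Monte-Carlo check on a hypothetical sorted region.
rng = np.random.default_rng(0)
a = np.array([1.0, 2.0, 3.0, 4.0])
draws = np.where(rng.random((100_000, n)) < p, a, 0.0).max(axis=1)
print([round(float(np.mean(draws == v)), 4) for v in a])   # ~= probs
```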
When training a large CNN, the maximum pooled dropout described above can reduce the occurrence of overfitting well. However, if the loss value of the model changes drastically during training, the weights of the related neurons are too large, so the absolute values of the derivatives obtained by backpropagation are large; that is, the model is too complex. Weight decay reduces the complexity of the model by restraining the weights from growing too large, thereby reducing overfitting to a certain extent.
Usually, the mean squared error function, which is also used in linear regression, is adopted as the loss function for detecting overfitting of a CNN.
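Formula (8) is not shown in this excerpt; a standard form of the mean squared error over $N$ training samples, consistent with the description (the $1/2$ factor is a common convention we assume here), is

```latex
E = \frac{1}{2N}\sum_{k=1}^{N}\left\| y_k - \hat{y}_k \right\|^2
```

where $y_k$ is the label of the $k$-th sample and $\hat{y}_k$ is the network output.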
To reduce the weight values of the neurons in the network, we introduce a penalty term into the loss function shown in formula (8); the regularised loss function then adds a term proportional to the sum of the squared weights.
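Assuming the penalty is the standard L2 (weight decay) term with coefficient $\lambda$, which formulas (9)-(11) presumably instantiate, the regularised loss and the resulting gradient update are

```latex
\tilde{E} = E + \frac{\lambda}{2}\sum_{w} w^{2},
\qquad
w \leftarrow w - \eta\left(\frac{\partial E}{\partial w} + \lambda w\right)
  = (1 - \eta\lambda)\,w - \eta\,\frac{\partial E}{\partial w}
```

Since $0 < 1 - \eta\lambda < 1$ for small $\eta$ and $\lambda$, every update multiplicatively shrinks the weights, which is why the technique is called weight decay.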
Continuing the analysis of Eq. (11): when the penalty coefficient is larger, each iteration shrinks the weights more strongly and the model complexity is restrained more; when it tends to zero, the update degenerates into the ordinary, unregularised gradient step.
Running parameters: the adaptive learning rate algorithm AdaGrad, a gradient-based optimisation algorithm, is applied in model training. The initial learning rate is η = 0.1 and is adjusted automatically thereafter. The network weights are initialised randomly and updated with the weight-decay rule derived above.
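For reference, a minimal AdaGrad step (a generic sketch of the algorithm with η = 0.1 as in the text; the accumulator and ε are standard details we assume, not values from the paper):

```python
import numpy as np

def adagrad_step(w, grad, cache, eta=0.1, eps=1e-8):
    """One AdaGrad update: each parameter's effective step size shrinks as
    its accumulated squared gradient grows, so the rate adjusts itself."""
    cache += grad ** 2
    w -= eta * grad / (np.sqrt(cache) + eps)
    return w, cache

w, cache = np.zeros(3), np.zeros(3)
grad = np.array([1.0, 0.1, 0.0])        # toy constant gradient
for _ in range(5):
    w, cache = adagrad_step(w, grad, cache)
print(w)   # the large-gradient coordinate has taken ever-smaller steps
```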
Experimental environment: the deep learning framework is TensorFlow, the programming language is Python 3.8.4 and the OS is Windows 10. The hardware includes an Intel Core i7 CPU @ 3.00 GHz, 8 GB RAM and an NVIDIA GTX 3080Ti GPU.
Experimental data sets: two data sets are used in this experiment. The first is CVC-ClinicDB [19], collected at a hospital in Barcelona, Spain. It contains polyp case images from 23 patients, 612 clear images in 31 sequences, divided into 4 categories by pathological type: normal images and polyp images 1, 2 and 3. Each image has a resolution of 384*384 with 3 channels (RGB); some samples are shown in Figure 2 (left). Because this data set is small, it is used to check whether the method overfits when applied to small data sets. Before the experiment, it was divided into a training set (500 images) and a test set (112 images), and the data were normalised to the interval [0, 1]. The second, CIFAR-10, is a data set of 60,000 colour images in 10 categories; some samples are shown in Figure 2 (right). Before the experiment, this data set was likewise divided into 50,000 images for training and 10,000 for testing. Each sample is a 32*32, 3-channel RGB image.
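The split and normalisation can be sketched as follows; the arrays are random placeholders (with a reduced image size to keep the sketch lightweight), since the actual loading code is not given in the paper:

```python
import numpy as np

# Placeholder data standing in for the 612 CVC-ClinicDB images; the real
# images are 384*384 RGB, shrunk here only to keep the example small.
images = np.random.randint(0, 256, size=(612, 64, 64, 3)).astype(np.float32)
labels = np.random.randint(0, 4, size=612)        # 4 pathological categories

images /= 255.0                                   # normalise to [0, 1]

rng = np.random.default_rng(0)
idx = rng.permutation(len(images))
x_train, y_train = images[idx[:500]], labels[idx[:500]]   # 500 training
x_test,  y_test  = images[idx[500:]], labels[idx[500:]]   # 112 test
```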
Comparative experiments: for comparison, we selected the GPU-parallel-optimised immune CNN method proposed by Gong et al. [6] (ICNN), the improved neural networks of feature detectors algorithm proposed by Hinton et al. [11] (INN_FD) and the stochastic CNN pooling method proposed by Zeiler and Fergus [20] (SCP). Among them, the ICNN method has a low test accuracy on the data set (only 80.2%), but its performance is stable and it has a solid theoretical basis. The INN_FD method proposed by Hinton is a dropout algorithm combined with a fully connected layer; it is technically mature and simple to implement. The SCP method of Ref. [20] randomly selects pooled mask values while training the model and, when the model is applied to detect samples, uses the probabilities of the pooling units in a pooling region as probability-weighted means, so it reduces the overfitting of the trained model well. It is therefore reasonable to choose these three methods for comparative experiments.
Experimental evaluation indicators: one indicator is the change curves of the loss value and the accuracy; the other is the error rate produced when the four methods are used to detect the data sets. The actual effect of each method is verified by comparing the changes in the curves and the error rates of the sample tests.
For convenience of description, the maximum pooled dropout and weight decay algorithm proposed in this paper is denoted MD_WD. The CNN structure used by MD_WD is 1*28*28->6C3->1P2->12C3->1P2->1000N->4N. The model consists of two convolution-pooling stages and a fully connected layer; its input is a 28*28 single-channel image and its output has 4 categories. The first convolution-pooling stage is 6C3->1P2 and the second is 12C3->1P2, where C means convolution, P means pooling, 6C3 means 6 convolution kernels of size 3*3 each, 1P2 means a stride of 1 with a 2*2 pooling kernel, and 1000N represents a fully connected layer with 1000 neurons. For the network structures of the other three methods (ICNN/INN_FD/SCP), see the relevant references; they are not described here owing to space limitations.
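One possible reading of this structure in Keras is sketched below. The mask-based maximum pooled dropout and the derived weight-decay update are the paper's custom components; here they are only approximated by a standard Dropout placed before each pooling layer and by an L2 kernel regulariser (λ = 1e-4 is an assumed value), and cross-entropy replaces the paper's loss so the sketch runs as an ordinary classifier:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

l2 = regularizers.l2(1e-4)   # weight decay coefficient: an assumed value

model = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),                      # 1*28*28
    layers.Conv2D(6, 3, activation='relu',
                  kernel_regularizer=l2),                 # 6C3
    layers.Dropout(0.5),   # stand-in for mask-based max pooled dropout
                           # (rate = 1 - maintenance probability)
    layers.MaxPooling2D(pool_size=2, strides=1),          # 1P2
    layers.Conv2D(12, 3, activation='relu',
                  kernel_regularizer=l2),                 # 12C3
    layers.Dropout(0.5),
    layers.MaxPooling2D(pool_size=2, strides=1),          # 1P2
    layers.Flatten(),
    layers.Dense(1000, activation='relu',
                 kernel_regularizer=l2),                  # 1000N
    layers.Dense(4, activation='softmax'),                # 4N
])
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1),
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()
```

The CIFAR-10 variant in Section 4.3 only changes the numbers: a 3*32*32 input, 16 kernels of 5*5, stride-2 pooling (2P2), 64 kernels of 3*3, a 2000-neuron fully connected layer and 10 outputs.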
Figure 3 shows the change curves of the loss value and the accuracy of the proposed MD_WD method during model training, with the maximum number of iterations epoch = 20,000. The left panel shows that, as the number of iterations increases, the loss values of both the training set and the validation set decrease; when the number of iterations reaches 10,000, both loss curves level off. Similarly, the right panel shows that the accuracy on the training set and the test set rises as the number of iterations increases and also stabilises at around 10,000 iterations, indicating that our MD_WD method did not overfit in this experiment.
To further verify the performance of the proposed method, we tested the data set with the above four methods under different maintenance probabilities; the resulting error rates are shown in Table 1. Because the ICNN method does not involve a maintenance probability, its error rate is unaffected by it. Comparing the data in Table 1 horizontally, every method except ICNN attains its lowest error rate when the maintenance probability is 0.5, and the error rate of INN_FD varies most sharply with the maintenance probability. Comparing the lowest errors of the four methods vertically, our MD_WD method has the minimum error of only 1.63%, against 2.08% for SCP and 6.20% for INN_FD, indicating that the proposed method reduces overfitting very well on small data sets and thus achieves better detection results.
Table 1. Error rate (%) under different maintenance probabilities on CVC-ClinicDB (columns give the maintenance probability).

| Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|--------|------|------|------|------|------|------|------|------|------|
| ICNN   | 6.36 | 6.36 | 6.36 | 6.36 | 6.36 | 6.36 | 6.36 | 6.36 | 6.36 |
| INN_FD | 7.27 | 6.85 | 6.50 | 6.23 | 6.20 | 6.31 | 6.38 | 6.40 | 6.51 |
| SCP    | 2.23 | 2.17 | 2.10 | 2.12 | 2.08 | 2.26 | 2.31 | 2.37 | 2.49 |
| MD_WD  | 2.21 | 2.15 | 2.03 | 1.95 | 1.63 | 1.88 | 2.19 | 2.26 | 2.27 |
The experiment in Section 4.2 was based on a small data set, whereas CIFAR-10 is a large natural colour image data set with 3 channels, and the differences within each category are larger than in CVC-ClinicDB. Accordingly, the CNN structure is adjusted to 3*32*32->16C5->2P2->64C3->2P2->2000N->10N, where each item has the same meaning as in the previous experiment; the experiments of the other three methods are the same as in Section 4.2. Training is carried out for 5,000 iterations, and the change curves obtained by the MD_WD method are shown in Figure 4. The left panel of Figure 4 shows that, as the number of iterations increases, the loss curve of the validation set keeps declining and levels off after about 4,000 iterations; although the loss value of the training set fluctuates drastically during iteration, it also decreases as the iterations proceed. The accuracy in the right panel rises gradually with the number of iterations and levels off at around 4,000 iterations. This shows that the MD_WD method can still effectively reduce the occurrence of overfitting when applied to large data sets.
As in the previous experiment, we tested this data set with the four methods using their respective trained models under different maintenance probability values; the error rates obtained are shown in Table 2.
Table 2. Error rate (%) under different maintenance probabilities on CIFAR-10 (columns give the maintenance probability).

| Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|--------|------|------|------|------|------|------|------|------|------|
| ICNN   | 7.44 | 7.44 | 7.44 | 7.44 | 7.44 | 7.44 | 7.44 | 7.44 | 7.44 |
| INN_FD | 8.87 | 7.95 | 7.76 | 7.10 | 6.15 | 7.21 | 7.38 | 8.12 | 8.69 |
| SCP    | 4.83 | 4.67 | 4.55 | 4.31 | 4.09 | 3.92 | 4.16 | 4.47 | 5.39 |
| MD_WD  | 3.21 | 3.08 | 2.97 | 2.93 | 2.81 | 2.88 | 3.08 | 3.16 | 3.35 |
The INN_FD method has its minimum error rate of 6.15% when the maintenance probability is 0.5; SCP reaches its minimum of 3.92% at a maintenance probability of 0.6, while the proposed MD_WD method attains the lowest error of all, 2.81% at 0.5. This again indicates that MD_WD resists overfitting best among the four methods, even on a large data set.
One of the challenges in deep neural networks is generalisation, that is, making a model that performs well on the training set also perform well on the test set. This paper introduced maximum pooled dropout and weight decay into an improved CNN structure and fused the two to reduce the occurrence of overfitting during training and thereby improve the generalisation performance of the model. The contributions of this paper are twofold. The first is a dropout method designed inside pooling: after the pooling layer is processed by the mask method, the unit values in each pooling region are retained with a certain probability and the maximum of the surviving values is taken as the pooled output. The second is the introduction of a penalty term into the iterative weight update (weight decay) to reduce the complexity of the network, together with its analysis and derivation.
However, the proposed method has some shortcomings. First, during maximum pooled dropout, after the convolution operation is completed and the rectified linear activation function (ReLU) is applied, the values of non-maximum output units may become 0, so the corresponding neurons fail to update (their gradient is 0), producing dead neural units. Second, the penalty term parameter of the weight decay must be chosen empirically, and an inappropriate value may weaken the regularisation effect; determining it adaptively is left for future work.