
Algorithm of overfitting avoidance in CNN based on maximum pooling and weight decay



Introduction

With the development of artificial intelligence, deep learning techniques such as the convolutional neural network (CNN) have attracted increasing attention from researchers and engineers and have been widely applied [1,2,3,4,5]. In image analysis, for example, compared with fully connected neural networks, the CNN greatly improves accuracy and efficiency thanks to its sparse connections, shared connection weights and local receptive fields, and it has been widely adopted by experts and practitioners in the field. However, if the structure of a CNN model is too complex, if it contains too many neurons and weight parameters, if the training sample set is too small, or if the number of training rounds is too large, the model may fall into overfitting during training [6, 7]: the training curve fits well and the training accuracy is high, but when the trained model is applied to the test set its accuracy is very low [8].

Many solutions have been proposed for the overfitting phenomenon that arises during model training. For the problems of early machine learning, Li et al. [9] discussed three remedies for overfitting in the BP algorithm, namely the early stopping method, the adjustment method and automatic generation of hidden nodes; this is one of the earlier studies of the overfitting problem. In 2012, dropout was proposed by Srivastava et al. [10] and Hinton [11] to alleviate overfitting in machine learning. Chen et al. [12] proposed an improved singular value decomposition (SVD) algorithm by studying overfitting in the rating prediction problem, and related work is reported elsewhere [13,14,15]. For the overfitting of deep-learning-based CNN models, a series of solutions has been put forward [16, 17], including data augmentation, regularisation, ensemble learning and early stopping. However, constrained by factors such as the state of knowledge, the intended applications, the available research time and the still-developing theory of deep learning, these results each have their own advantages but also certain limitations.

Building on the characteristics of CNNs and on this prior work on overfitting, we study a mask-based maximum pooling dropout method for the pooling layer and combine it with weight decay to reduce overfitting. Theoretical analysis and comparative experiments show that the method effectively avoids overfitting during CNN model training and improves image detection performance.

Structure of CNN

In general, a CNN consists of three kinds of layers: convolution, pooling and fully connected layers. A group of convolution operations followed by a group of pooling operations forms one convolution-pooling stage; after one stage is executed, the next stage continues the extraction of sample features, and the final fully connected layer flattens the resulting features into one dimension to complete the classification. Because a CNN uses local connections between neurons, shares the local connection parameters and downsamples through pooling, it has fewer weight parameters and more compact features than a general neural network during feature extraction, represents objects better and generalises better.

Assume $x_i^l$ denotes the $i$-th input of the $l$-th layer of the CNN, $i = 1, 2, \ldots, n^l$, where $n^l$ is the number of feature maps in the $l$-th layer; $w^{l,k}$ denotes the $k$-th convolution kernel of the $l$-th layer; $b^{l,k}$ denotes the bias of the $k$-th kernel of the $l$-th layer; and $z^{l+1,k}$ denotes the output obtained by convolving the inputs $x_i^l$ with the kernel $w^{l,k}$ and adding the corresponding bias $b^{l,k}$. With $f(\cdot)$ the activation function of this layer, the convolution operation can be expressed as:
$$ z^{l+1,k} = \sum_{i=1}^{n^l} con\!\left(w^{l,k}, x_i^l\right) + b^{l,k} \quad (1) $$
$$ x^{l+1,k} = f\!\left(z^{l+1,k}\right) \quad (2) $$
In formula (1), $con$ denotes the convolution operation in the $l$-th layer; its result serves as the input of the next layer $(l+1)$ and is then processed by $f(\cdot)$, which yields the output $x^{l+1,k}$, the $k$-th feature map of the $(l+1)$-th layer.
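To make Eqs. (1) and (2) concrete, the following NumPy sketch computes one output feature map of a convolutional layer; the helper functions, shapes and the ReLU activation are our own illustrative choices, not part of the original model.

```python
import numpy as np

def conv2d_single(x, w):
    """Valid 2-D cross-correlation of one input map x with one kernel w."""
    H, W = x.shape
    kh, kw = w.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def conv_layer_output(xs, w_k, b_k, f=lambda z: np.maximum(z, 0.0)):
    """Eqs. (1)-(2): sum the convolutions of all n_l input maps with kernel w_k,
    add the bias b_k, then apply the activation f (ReLU chosen as an example)."""
    z = sum(conv2d_single(x_i, w_k) for x_i in xs) + b_k   # Eq. (1)
    return f(z)                                            # Eq. (2)

# Toy usage: two 5x5 input maps, one 3x3 kernel.
xs = [np.random.rand(5, 5) for _ in range(2)]
x_next = conv_layer_output(xs, np.random.rand(3, 3), b_k=0.1)
print(x_next.shape)   # (3, 3)
```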

After the convolution operation comes a pooling layer, which subsamples the features of the convolution layer to reduce the dimensionality and complexity of the network. The pooling operation is:
$$ s_m^{l+1} = pool\!\left(s_{m,1}^l, s_{m,2}^l, \ldots, s_{m,j}^l, \ldots, s_{m,n}^l\right) \quad (3) $$
In formula (3), $s_{m,j}^l$ denotes the value of the $j$-th pooling unit in the $m$-th pooling region of $x^{l+1,k}$, $l$ indexes the pooling layer, and $n$ is the number of pooling units in the region; for example, with 2*2 pooling the number of units is $n = 4$. $pool(\cdot)$ denotes the pooling operation. Maximum pooling is widely used because it retains the essential characteristics of the object, and it is the pooling method adopted in this paper.
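A minimal sketch of Eq. (3) with $pool(\cdot)$ taken as the maximum over non-overlapping 2*2 regions; the helper below is our own, for illustration only.

```python
import numpy as np

def max_pool(x, t=2):
    """Eq. (3) with pool(.) = max: split x into non-overlapping t*t regions
    (n = t*t pooling units each) and keep the largest unit of every region."""
    H, W = x.shape
    out = np.zeros((H // t, W // t))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            out[m, n] = x[m * t:(m + 1) * t, n * t:(n + 1) * t].max()
    return out

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 5.],
              [0., 1., 2., 2.],
              [3., 1., 0., 4.]])
print(max_pool(x))   # [[4. 5.]
                     #  [3. 4.]]
```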

Once pooling is completed, the fully connected operation is performed, and the weight parameters are updated through backpropagation of the loss function. The loss function directly determines the training effect and performance of the CNN, so its design is very important; it is discussed under 'Weight decay calculation' in the next section.

Overfitting avoidance algorithm

The previous section introduced the basic principles of the CNN. Building on them, this section designs a CNN model with maximum pooling dropout and weight decay so as to avoid overfitting during model training as much as possible.

Maximum pooling dropout

Maximum pooling is a common operation in the CNN pooling layer. In this paper, dropout is introduced into maximum pooling to reduce overfitting during model training. Combined with maximum pooling, dropout also overcomes the tendency of average pooling to weaken neurons with high activation values and the tendency of plain maximum pooling to ignore the values of all other pooling units, so the features of the pooling layer can be expressed more effectively.

Before describing the maximum pooling dropout operation, we first explain mask processing. Mask processing sets the activation values of some neurons in each pooling region of the pooling layer to 0 with a certain probability, and maximum pooling dropout is then performed over the retained neurons. The design process is described below.

Assume $r^l$ denotes the mask of the $l$-th layer, drawn from a Bernoulli distribution with parameter $p$, where $p$ is the maintenance (keep) probability, a very important parameter whose design is described in detail later. The mask processing can then be expressed as follows:
$$ s_m^{\prime l} = r^l * s_m^l \quad (4) $$
$$ s_m^{l+1} = Maxpooling\!\left(s_{m,1}^{\prime l}, s_{m,2}^{\prime l}, \ldots, s_{m,j}^{\prime l}, \ldots, s_{m,n}^{\prime l}\right) \quad (5) $$
In Eq. (4), $s_m^{\prime l}$ denotes the value of the $m$-th pooling region $s_m^l$ of the $l$-th layer after masking, and in Eq. (5) $s_m^{l+1}$ is the result of taking the maximum over all pooling units of the $m$-th region after masking. The specific implementation is shown in Figure 1.

Fig. 1

Operation diagram of mask-based maximum pooling.

In Figure 1, a 2*2 pooling kernel with a stride of 2 is adopted. If only maximum pooling is applied, the result is [[45], [57]]; if maximum pooling is carried out after applying the mask, the result is [[42], [55]], because some pooling units have been discarded.
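The following sketch illustrates Eqs. (4) and (5): a Bernoulli($p$) mask zeroes some pooling units before the maximum is taken. The input matrix and helper names are our own choices and are not the values shown in Figure 1.

```python
import numpy as np

def max_pool_dropout(x, p=0.5, t=2, training=True, rng=np.random.default_rng(0)):
    """Eqs. (4)-(5): draw a Bernoulli(p) mask r, zero the suppressed pooling
    units (Eq. 4), then take the maximum of each t*t region (Eq. 5)."""
    if training:
        r = rng.binomial(1, p, size=x.shape)   # mask r^l with keep probability p
        x = r * x                              # s'_m = r^l * s_m  (Eq. 4)
    H, W = x.shape
    out = np.zeros((H // t, W // t))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            out[m, n] = x[m * t:(m + 1) * t, n * t:(n + 1) * t].max()  # Eq. (5)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_dropout(x, p=0.5))            # masked maximum pooling (training)
print(max_pool_dropout(x, training=False))   # plain maximum pooling (testing)
```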

The following describes how the maximum pooling dropout is designed and how the maintenance probability $p$ is obtained.

As noted above, the Bernoulli parameter $p$ is a very important quantity: it is the maintenance probability of the maximum pooling dropout, an important index for avoiding overfitting and the parameter that determines which neuron units are dropped. Suppose the maintenance probability of each pooling region of a feature map being pooled is set to $p$ (it can be chosen manually according to the actual situation, generally $p = 0.5$), so the drop probability is $q = 1 - p$. Assume further that the unit values $(s_{m,1}^l, s_{m,2}^l, \ldots, s_{m,n}^l)$ of pooling region $m$ in the $l$-th layer are rearranged in ascending order, i.e. $0 < d_{m,1}^l < d_{m,2}^l < \ldots < d_{m,n}^l$ (each $d_{m,j}^l$ is one of the pooling units $s_{m,j}^l$). Then $d_{m,j}^l$ is selected as the maximum of the whole pooling region exactly when $d_{m,j}^l$ itself is kept and every larger unit $(d_{m,j+1}^l, \ldots, d_{m,n}^l)$ is suppressed (dropped); since the maximum of the retained values is then $d_{m,j}^l$, the pooled output equals $d_{m,j}^l$ with probability:
$$ p_j = poss\!\left(s_m^{l+1} = d_{m,j}^l\right) = p\,q^{\,n-j}, \quad j = 1, 2, \ldots, n \quad (6) $$

Analysing formula (6), when maximum pooling dropout is performed in a pooling region, the $j$-th activation value $d_{m,j}^l$ of the ascending ordering is selected as the output of the region according to the multinomial distribution defined by these probabilities, namely:
$$ s_m^{l+1} = d_{m,j}^l, \quad p_j \in \left(p_1, p_2, p_3, \ldots, p_n\right) \quad (7) $$
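To make Eqs. (6) and (7) concrete, the sketch below computes the selection probabilities $p_j = p\,q^{n-j}$ for one pooling region and samples the pooled output accordingly. Note that the residual probability $q^n$ (the case where every unit is suppressed) is mapped to an output of 0 here; that is our own assumption and is not specified in the paper.

```python
import numpy as np

def sample_pooled_output(region, p=0.5, rng=np.random.default_rng(0)):
    """Eq. (6): sort the n pooling units in ascending order d_1 < ... < d_n;
    unit d_j is the pooled output with probability p_j = p * q**(n - j), q = 1 - p.
    Eq. (7): draw the output from this multinomial distribution."""
    d = np.sort(np.asarray(region, dtype=float).ravel())   # ascending order
    n = d.size
    q = 1.0 - p
    probs = np.array([p * q ** (n - j) for j in range(1, n + 1)])
    # Residual mass q**n corresponds to all units being suppressed; we map it
    # to an output of 0 (our assumption, not stated in the paper).
    probs = np.append(probs, q ** n)
    values = np.append(d, 0.0)
    return rng.choice(values, p=probs)

region = [45, 42, 31, 18]            # one 2*2 pooling region (n = 4)
print(sample_pooled_output(region))  # stochastic pooled output
```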

Assume the $l$-th layer has $r$ feature maps, each of size $s$, that the pooling region is $t * t$ and that the pooling stride is $t$. Without overlapping pooling, the number of pooling regions is then $rs/t$, and the amount of computation (number of parameters) of the $l$-th layer during training is $(t+1)^2 rs/t$ (including a bias). After maximum pooling dropout is introduced, the pooling units are randomly suppressed under the above probability constraints, i.e. $t$ is effectively reduced, so the computation required for model training decreases exponentially. This effectively lowers the complexity of the model and further reduces the occurrence of overfitting.

Weight decay calculation

When training a large CNN, the maximum pooling dropout described above already reduces the occurrence of overfitting. However, if the loss value of the model changes drastically during training, it indicates that the weights of the related neurons are too large, so the absolute values of the derivatives obtained by backpropagation are large, i.e. the model is too complex. Weight decay reduces model complexity by restricting the neuron weights so that they cannot become too large, thereby reducing overfitting to a certain extent.

Usually the mean squared error, which is also used for linear regression, serves as the loss function for detecting overfitting of a CNN. This function, denoted $L_0$, is shown in Eq. (8), where $N$ is the number of training samples, $y_i$ is the actual label of sample $i$ and $o_i$ is the corresponding network output:
$$ L_0 = \frac{1}{N}\sum_{i=1}^{N} \left(y_i - o_i\right)^2 \quad (8) $$

To reduce the weights of the neurons in the network, a penalty term is added to the loss function of formula (8), giving the loss function:
$$ L = L_0 + \frac{\lambda}{2}\sum_{w} \left\| w \right\|^2 \quad (9) $$
The right-hand term of formula (9) is the penalty term, where $w$ denotes the connection weights of the neurons and $\lambda$ ($\lambda > 0$) is the penalty coefficient, which sets the proportion between the penalty term $\sum_w \|w\|^2$ and $L_0$; the factor $1/2$ is a constant introduced for convenience of differentiation. Taking derivatives of (9) gives:
$$ \begin{cases} \dfrac{\partial L}{\partial w} = \dfrac{\partial L_0}{\partial w} + \lambda w \\[2mm] \dfrac{\partial L}{\partial b} = \dfrac{\partial L_0}{\partial b} \end{cases} \quad (10) $$
In Eq. (10), $b$ is the bias of the neurons (e.g. of the convolution kernels) of the CNN model, which enters the output $o_i$ of formula (8) through $o_i = w \cdot x_{i-1} + b$. Analysing formula (10) shows that adding the penalty term does not affect the update of the bias $b$, i.e. formulas (8) and (9) have the same partial derivative with respect to $b$; for the update of the network weight $w$, however, after the penalty term is introduced in formula (9) we have:
$$ w^{\prime} = w - \eta\left(\frac{\partial L_0}{\partial w} + \lambda w\right) = w - \eta\frac{\partial L_0}{\partial w} - \eta\lambda w = \left(1 - \eta\lambda\right) w - \eta\frac{\partial L_0}{\partial w} \quad (11) $$
In Eq. (11), $\eta$ is the learning rate and $w^{\prime}$ is the new weight obtained after the update with the penalty term of formula (9). Analysis of Eq. (11) shows that the coefficient of $w$ is 1 when no penalty term is present, whereas with the penalty term the coefficient becomes $1 - \eta\lambda$. Since $\eta$ and $\lambda$ are positive numbers smaller than 1, $1 - \eta\lambda < 1$, so the weight updated with the penalty term is smaller than the one updated without it. This is the theoretical meaning of weight decay. Note that the term $\eta\frac{\partial L_0}{\partial w}$ in Eq. (11) is the ordinary gradient update obtained by backpropagation; its presence has nothing to do with the penalty term, so it is not considered part of the weight decay.

Continuing the analysis of Eq. (11): when $w > 0$ the update makes $w^{\prime}$ smaller, and when $w < 0$ it makes $w^{\prime}$ larger. Since $|w| < 1$, the effect of Eq. (11) is to drive $|w^{\prime}| \to 0$, i.e. to reduce the magnitude of $w^{\prime}$ as much as possible, which reduces the network weights, lowers the complexity of the network and helps avoid overfitting. It should be pointed out that the penalty coefficient $\lambda$ in formula (9) is an important parameter whose value directly affects the training of the model. If $\lambda$ is set too large, the weights decrease too fast and underfitting may occur, or training may even fail; if $\lambda$ is set too small, overfitting will still occur. The value of $\lambda$ can be set with rules based on the Bayes method; details can be found in Ref. [18] and are omitted here for reasons of space.
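A small numerical sketch of the update rule in Eq. (11), showing the $(1 - \eta\lambda)$ shrink factor; the weight and gradient values are arbitrary illustrative numbers.

```python
import numpy as np

def sgd_step(w, grad_L0, eta=0.1, lam=0.0):
    """Eq. (11): w' = (1 - eta * lam) * w - eta * dL0/dw.
    lam = 0 recovers the plain update without weight decay."""
    return (1.0 - eta * lam) * w - eta * grad_L0

w = np.array([0.8, -0.5, 0.3])
g = np.array([0.2, -0.1, 0.05])      # dL0/dw obtained by backpropagation
print(sgd_step(w, g))                # no penalty term
print(sgd_step(w, g, lam=0.01))      # with weight decay: weights shrink by factor 1 - 0.001
```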

Experiment and result analysis
Experimental configuration and environment

Running parameters: the adaptive learning rate algorithm AdaGrad, a gradient-based optimisation algorithm, is applied in model training. The initial learning rate is η = 0.1 and is adjusted automatically. The network weights are initialised from a normal distribution with mean 0 and variance 0.01, the batch size during training is 100, the biases are initialised to b = 0, the default maintenance probability of the maximum pooling dropout is p = 0.5, and the penalty coefficient λ of the weight decay is set according to the literature [18]; in this experiment λ = 0.
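The stated hyperparameters would translate to roughly the following TensorFlow/Keras configuration. This is our own sketch of the reported settings, not the authors' code; the L2 factor is a placeholder because the reported λ value is ambiguous.

```python
import tensorflow as tf

# Sketch of the stated training configuration (our reconstruction):
# AdaGrad with initial learning rate 0.1, weights drawn from N(0, 0.01)
# (i.e. standard deviation 0.1), zero biases, batch size 100, keep probability 0.5.
optimizer = tf.keras.optimizers.Adagrad(learning_rate=0.1)

kernel_init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.1)
bias_init = tf.keras.initializers.Zeros()

LAMBDA = 1e-4   # penalty coefficient; placeholder value, the paper's setting is unclear
l2_penalty = tf.keras.regularizers.l2(LAMBDA / 2)   # lambda/2 * sum ||w||^2, cf. Eq. (9)

BATCH_SIZE = 100
KEEP_PROB = 0.5   # maintenance probability p for the pooling-layer dropout
```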

Experimental environment: the deep learning framework is TensorFlow, the programming language is Python 3.8.4, the operating system is Windows 10, and the hardware includes an Intel Core i7 CPU @ 3.00 GHz, 8 GB RAM and an NVIDIA GTX 3080Ti GPU.

Experimental data sets: two data sets are used in this experiment. The first is CVC-ClincDB [19], from a hospital in Barcelona, Spain: a collection of polyp case images from 23 patients comprising 31 sequences with 612 clear images, divided into 4 categories according to pathological type, namely normal images and polyp images 1, 2 and 3. Each image has a resolution of 384*384 with 3 channels (RGB); some samples are shown in Figure 2 (left). Because this data set is small, it can be used to check whether the method overfits when applied to small data sets. Before the experiment it was divided into a training set (500 images) and a test set (112 images), and the data were normalised to the interval [0, 1]. The second data set, CIFAR-10, consists of 60,000 colour images in 10 categories; some samples are shown in Figure 2 (right). Before the experiment it was likewise divided into 50,000 training images and 10,000 test images; each sample is a 32*32 RGB image with 3 channels.
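As a small illustration of the preprocessing described above, the sketch below loads CIFAR-10 with Keras and scales the pixel values to [0, 1]; CVC-ClincDB is not bundled with Keras, so its loading path is omitted here.

```python
import tensorflow as tf

# Load CIFAR-10 (50,000 training / 10,000 test images, 32*32*3) and normalise
# pixel values to [0, 1], as described for both data sets in the experiments.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0
print(x_train.shape, x_test.shape)   # (50000, 32, 32, 3) (10000, 32, 32, 3)
```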

Fig. 2

Some samples of CVC-ClincDB and CIFAR-10 data sets.

Comparative experiment: to compare experimental results, we selected the GPU parallel optimisation-based immune CNN method proposed by Gong et al. [6] (ICNN), the improved neural network feature-detector algorithm proposed by Hinton et al. [11] (INN_FD) and the stochastic CNN pooling method proposed by Zeiler and Fergus [20] (SCP) for comparative experiments. Among them, the ICNN method has a relatively low test accuracy on the data set (only 80.2%), but its performance is stable and it has a solid theoretical basis. The INN_FD method of Hinton is a dropout algorithm applied to the fully connected layer; the technique is mature and simple to implement. The SCP method of Ref. [20] randomly selects the pooled value under a mask while training the model and, when the trained model is used to detect samples, weights each pooling unit of the pooling region by its probability to form a probability-weighted mean, so it also reduces overfitting of the trained model well. It is therefore reasonable to choose these three methods for the comparative experiments.

Experimental evaluation indicators: one indicator is the change curves of the loss value and the accuracy, and the other is the error rate obtained when the 4 methods are used to detect the data sets. The actual effect of the proposed method is verified by comparing the curves and the test error rates of the various methods.

CVC-ClincDB data set experiment

For convenience of description, the maximum pooling dropout and weight decay algorithm proposed in this paper is denoted MD_WD. The CNN structure used for MD_WD is 1*28*28->6C3->1P2->12C3->1P2->1000N->4N. The model consists of 2 groups of convolution-pooling layers and a fully connected layer; its input is a 28*28 single-channel image and its output comprises 4 categories. The first convolution-pooling group is 6C3->1P2 and the second is 12C3->1P2, where C means convolution, P means pooling, 6C3 means 6 convolution kernels of size 3*3 each, 1P2 means a pooling stride of 1 with a 2*2 pooling kernel, and 1000N denotes a fully connected layer with 1000 neurons. For the network structures of the other three comparison methods (ICNN/INN_FD/SCP), see the corresponding references; they are not detailed here for reasons of space.
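For readers who prefer code, the sketch below reconstructs the 1*28*28->6C3->1P2->12C3->1P2->1000N->4N structure with standard Keras layers. Plain MaxPooling2D stands in for the paper's mask-based maximum pooling dropout, the λ value is a placeholder, and the block as a whole is our illustrative reconstruction rather than the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Sketch of the MD_WD network structure with stand-in Keras layers.
# MaxPooling2D replaces the mask-based maximum pooling dropout of the paper,
# and the regularizer approximates the penalty term of Eq. (9).
l2 = tf.keras.regularizers.l2(1e-4 / 2)   # placeholder for lambda / 2

model = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),                                  # 1*28*28
    layers.Conv2D(6, 3, activation="relu", kernel_regularizer=l2),    # 6C3
    layers.MaxPooling2D(pool_size=2, strides=1),                      # 1P2
    layers.Conv2D(12, 3, activation="relu", kernel_regularizer=l2),   # 12C3
    layers.MaxPooling2D(pool_size=2, strides=1),                      # 1P2
    layers.Flatten(),
    layers.Dense(1000, activation="relu", kernel_regularizer=l2),     # 1000N
    layers.Dense(4, activation="softmax"),                            # 4N
])
model.summary()
```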

Figure 3 shows the change curves of the loss value and the accuracy of the proposed MD_WD method during model training, with a maximum number of iterations of epoch = 20,000. The left panel shows that as the number of iterations increases, the loss values of both the training set and the validation set decrease, and after about 10,000 iterations both loss curves level off. Similarly, the right panel shows that the accuracy of the training set and the test set increases with the number of iterations and also stabilises after about 10,000 iterations, indicating that the MD_WD method did not overfit in this experiment.

Fig. 3

Curves of loss value and accuracy against the number of iterations on the CVC-ClincDB data set.

To further verify the performance of the proposed method, the four methods were used to test the data set under different maintenance probabilities; the resulting error rates are shown in Table 1. Because the ICNN method does not involve a maintenance probability, its error rate is not affected by this parameter. Comparing the data in Table 1 horizontally, every method except ICNN attains its lowest error rate when the maintenance probability is 0.5, and the error rate of INN_FD varies most sharply with the maintenance probability. Comparing the lowest error rates of the 4 methods vertically, the proposed MD_WD method has the smallest minimum error, only 1.63%, while the minimum of SCP is 2.08% and that of INN_FD is 6.20%, indicating that the proposed method can be used on small data sets, reduces overfitting well and thus achieves better detection results.

Table 1. The error rate (%) under different maintenance probabilities in CVC-ClincDB.

p 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
ICNN 6.36 6.36 6.36 6.36 6.36 6.36 6.36 6.36 6.36
INN_FD 7.27 6.85 6.50 6.23 6.20 6.31 6.38 6.40 6.51
SCP 2.23 2.17 2.10 2.12 2.08 2.26 2.31 2.37 2.49
MD_WD 2.21 2.15 2.03 1.95 1.63 1.88 2.19 2.26 2.27
CIFAR-10 data set experiment

The previous experiment was based on a small data set, whereas CIFAR-10 is a large natural colour image data set with 3 channels; compared with the CVC-ClincDB data set above, the differences between its categories are larger. The CNN structure used here is therefore 3*32*32->16C5->2P2->64C3->2P2->2000N->10N, where each item has the same meaning as in the previous experiment, and the experiments with the other three methods are conducted in the same way as before. Training is carried out for 5,000 iterations, and the curves obtained with the MD_WD method are shown in Figure 4. The left panel of Figure 4 shows that as the number of iterations increases, the loss curve of the validation set keeps declining and levels off after about 4,000 iterations; although the loss value of the training set fluctuates strongly during the iterations, it also decreases as the iterations proceed. The accuracy in the right panel rises gradually with the number of iterations and levels off at around 4,000 iterations. This shows that the MD_WD method can still effectively reduce the occurrence of overfitting when applied to large data sets.

Fig. 4

Curves of loss value and accuracy against the number of iterations on the CIFAR-10 data set.

As in the previous experiment, the four methods were tested on this data set with their respectively trained models under different maintenance probability values p. The error rates of the 4 methods under the different maintenance probabilities are shown in Table 2; ICNN is again independent of the maintenance probability, as explained in the previous subsection. Horizontal analysis of the error rates in Table 2 shows that each method achieves its lowest misclassification rate for maintenance probabilities between 0.4 and 0.6.

Table 2. The error rate (%) under different maintenance probabilities in CIFAR-10.

p 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
ICNN 7.44 7.44 7.44 7.44 7.44 7.44 7.44 7.44 7.44
INN_FD 8.87 7.95 7.76 7.10 6.15 7.21 7.38 8.12 8.69
SCP 4.83 4.67 4.55 4.31 4.09 3.92 4.16 4.47 5.39
MD_WD 3.21 3.08 2.97 2.93 2.81 2.88 3.08 3.16 3.35

The INN_FD method has its minimum error rate of 6.15% at p = 0.5, SCP has its minimum of 3.92% at p = 0.6, and the method of this paper has its minimum of 2.81% at p = 0.5. Vertical analysis of the lowest error rate of each method shows that the minimum error rate of MD_WD, only 2.81%, is the smallest, indicating that the proposed method can also be applied to large data sets and generalises well.

Conclusion

One of the challenges encountered with deep neural networks is generalisation, i.e. making a model that performs well on the training set also perform well on the test set. This paper introduces maximum pooling dropout and weight decay into an improved CNN structure and fuses them to reduce the occurrence of overfitting during training and thus improve the generalisation performance of the model. The contributions of this paper cover two aspects. The first is a dropout method designed for the pooling layer: after the pooling layer is processed by the mask method, the unit values of each pooling region are sorted, suppressed with a certain probability and the maximum of the retained units is taken as the output. The second is the introduction of a penalty term into the iterative weight update (weight decay) to reduce the complexity of the network, together with its analysis and derivation.

However, the proposed method also has some shortcomings. First, when maximum pooling dropout is performed after the convolution operation and the semi-linear activation function (ReLU) is applied, the values of non-maximum output units may become 0, so the corresponding neurons fail to update (their gradient is 0), i.e. they become dead units. Second, the introduced penalty parameter λ not only lacks a theoretical basis but also seems, in theory, to have a certain connection with the maximum pooling dropout probability p, yet the two are not combined in this study. Solving these problems is the goal of the next step of our research.
