Handwritten numbers are everywhere in daily life, yet in many lines of work, processing them, for example during data collection, is tedious and time-consuming. This is where handwriting recognition technology proves its value, bringing convenience and efficiency to people.
The neural capsule originates from an assumption of Hinton's [1]: instead of using coarse coding or single neurons to represent the relationship between the observer and the object (such as the object's pose information), a set of activated neurons is selected to represent it. This group of neurons is called a neural capsule. One advantage of the capsule network is that it needs less training data than a convolutional neural network, while its performance is not inferior.
In a traditional neural network, a neuron cannot represent multiple attributes at the same time, so the activation of a neuron can only signal the presence of a certain entity. The properties of the entity are thus hidden in a large number of network parameters. When adjusting those parameters, we cannot tune them for a single purpose: the adjustment must account for all kinds of input samples, so it is inevitably cumbersome and time-consuming. With vector neurons, all the properties of an entity are wrapped in one capsule; such constraints during parameter adjustment are greatly reduced, and good parameters are easier to obtain.
The design of artificial neural networks borrows heavily from the structure of biological neural networks. Neuroscience has established that the cerebral cortex of most primates contains a large number of columnar structures, each comprising hundreds of neurons arranged hierarchically; these small units handle different types of visual stimuli well. Researchers speculate that a mechanism in the brain combines low-level visual features with certain weights to construct the colorful world in our eyes. Based on this biological finding, Hinton suggested that it is more appropriate to represent the relationship between object and observer with a series of active neurons rather than a single one. Hence the neural capsule mentioned earlier.
In October 2017, Sabour, Hinton and others published the paper "Dynamic Routing Between Capsules" [10] at NIPS, a top machine learning conference, and proposed the capsule network (CapsNet). This deep learning method shook the whole field of artificial intelligence: it breaks a bottleneck of the convolutional neural network (CNN) and pushes the field to a new level. This paper focuses on recognition of the MNIST data set based on the capsule network. MNIST [7] is a data set composed of digits handwritten by different people.
Although BP neural networks [9] and convolutional neural networks [2][5][6][11] already achieve a good recognition effect on handwritten digits, the emergence of the capsule network brings a new breakthrough to the recognition of such data sets: it achieves a better recognition effect, and its recognition accuracy greatly exceeds that of the convolutional neural network.
The neural capsule proposed by Hinton realizes ontology from a philosophical perspective. The various properties of a particular entity are represented by the activity of the neurons in an activated capsule. These attributes include the entity's size, location, orientation and other information. From the existence of certain special attributes, we can infer the existence of an instance.
In machine learning, the probability that an entity exists is usually represented by the output of an independent logistic regression unit. In the neural capsule, the norm of the squashed high-dimensional output vector represents the probability that the entity exists, and the attributes of the entity are represented by the vector's "pose", i.e. its direction. This reflects the essence of ontology: defining the existence of an entity by its various attributes.
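The mapping from a capsule's raw output to a length that can be read as a probability can be sketched as follows; this is a minimal PyTorch implementation of the standard squash nonlinearity from the CapsNet paper, with a small epsilon term added here for numerical stability:

```python
import torch

def squash(s, dim=-1, eps=1e-8):
    """Squashing nonlinearity from Sabour et al. (2017).

    Maps a capsule's raw output vector s to a vector v whose norm lies
    in (0, 1), so the length can be read as an existence probability
    while the direction encodes the entity's pose attributes.
    """
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    scale = sq_norm / (1.0 + sq_norm)
    return scale * s / torch.sqrt(sq_norm + eps)

# A long raw vector squashes to a norm near 1 (entity likely present);
# a short one squashes to a norm near 0 (entity likely absent).
long_vec = squash(torch.full((8,), 5.0))
short_vec = squash(torch.full((8,), 0.01))
```

The direction of the input vector is preserved exactly; only its length is rescaled into the unit interval.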
In research on the capsule network, its working process is closer to the behavior of the human brain because it requires less training data. Against white-box adversarial attacks the capsule network shows strong resistance: under the fast gradient sign method, its accuracy can still be maintained above 70%. Its training and test accuracy on MNIST is better than that of the convolutional neural network. In practical applications such as specific text classification tasks, convolutional capsule networks can effectively improve the accuracy of feature extraction [12]. Chinese scholars have also applied visual reconstruction methods based on the capsule network structure in the field of functional magnetic resonance imaging. In intelligent traffic sign recognition, a very deep convolution model that introduces a pooling layer into the primary capsule layer improves the feature extraction part of the original network structure, and uses an exponential moving average method to improve the dynamic routing algorithm, raising the network's recognition accuracy on traffic signs.
The capsule network first appeared in the article "Dynamic Routing Between Capsules" published by Hinton et al. in October 2017. Building on the capsule network proposed by Sabour et al. in 2017, an improved version of the capsule system was proposed in the 2018 article "Matrix Capsules with EM Routing" [3].
In this system, each capsule uses a logistic unit to represent the presence or absence of an entity, and a 4×4 pose matrix represents the entity's pose information. The paper describes an iterative routing method between capsule layers based on the EM algorithm: the outputs of lower-layer capsules reach higher-level capsules through the routing algorithm, so that an activated capsule receives a cluster of similar pose votes. The new system is much more resistant to white-box adversarial attacks than a baseline CNN. The paper "Stacked Capsule Autoencoders" [4], published in 2019, introduces an unsupervised capsule autoencoder (SCAE). By observing the neural encoders of all components, the existence and pose of a target can be inferred; that is, an object can be inferred explicitly through the relationships between its components. Its accuracy on the SVHN [8] and MNIST data sets is 55% and 98.5%, respectively.
Network structure of deep capsule
Finally, the capsule network is compared with the improved deep capsule network, as shown in Table 1:
STRUCTURE COMPARISON OF CAPSULE NETWORK AND DEEP CAPSULE NETWORK
|                   | Capsule network          | Deep capsule network           |
| Convolution layer | Conv1: 256*9*9           | Conv1: 512*9*9; Conv2: 256*5*5 |
| Primary Capsule   | 9*9                      | 5*5                            |
| Digit Capsule     | One dynamic routing pass | Two dynamic routing passes     |
| FC                |                          |                                |
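The deep capsule network's front end in Table 1 could be sketched in PyTorch as below. Only the kernel and channel sizes come from the table; the strides, the ReLU activations, and the grouping into 8-dimensional capsules are assumptions carried over from the original CapsNet and may differ from the authors' actual implementation:

```python
import torch
import torch.nn as nn

class DeepCapsEncoder(nn.Module):
    """Sketch of the deep capsule network's convolutional front end.

    Kernel/channel sizes follow Table 1; stride, activation, and the
    8-D capsule dimension are assumed from the original CapsNet.
    """
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 512, kernel_size=9)    # Conv1: 512 kernels, 9x9
        self.conv2 = nn.Conv2d(512, 256, kernel_size=5)  # Conv2: 256 kernels, 5x5
        # Primary capsules: 5x5 convolution, channels regrouped into 8-D vectors
        self.primary = nn.Conv2d(256, 256, kernel_size=5, stride=2)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.conv1(x))  # 28x28 -> 20x20
        x = self.relu(self.conv2(x))  # 20x20 -> 16x16
        x = self.primary(x)           # 16x16 -> 6x6
        return x.view(x.size(0), -1, 8)  # group channels into 8-D capsule vectors

out = DeepCapsEncoder()(torch.zeros(2, 1, 28, 28))  # -> (2, 1152, 8)
```

With these assumed strides, a 28×28 input yields 256 × 6 × 6 = 9216 values per image, i.e. 1152 primary capsules of dimension 8.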
The decoder structure in this paper is the same as in CapsNet, as shown in the figure. The optimization goal of the CapsNet model is to compute a margin loss for each digit, which allows multiple digits to exist at the same time. In addition, CapsNet can reconstruct the input image from the instantiation parameters obtained in the previous processing. During training for image reconstruction, only the activated capsule is allowed to participate in adjusting the three-layer fully connected network each time. The structure mainly responsible for reconstructing the image is the decoder: it receives a 16×10 matrix from the digit capsule layer and reconstructs a 28×28 image after three fully connected layers.
Decoder network
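A minimal PyTorch sketch of such a decoder follows. The 16×10 input and 28×28 output are stated above; the hidden sizes of 512 and 1024 follow the original CapsNet decoder and are assumed here:

```python
import torch
import torch.nn as nn

# Decoder as described: the 16x10 digit-capsule output (flattened to 160)
# passes through three fully connected layers and is reshaped to 28x28.
# Hidden sizes 512 and 1024 are taken from the original CapsNet decoder.
decoder = nn.Sequential(
    nn.Linear(16 * 10, 512),
    nn.ReLU(inplace=True),
    nn.Linear(512, 1024),
    nn.ReLU(inplace=True),
    nn.Linear(1024, 28 * 28),
    nn.Sigmoid(),  # pixel intensities in [0, 1]
)

caps_out = torch.zeros(1, 16 * 10)  # masked digit-capsule matrix, flattened
reconstruction = decoder(caps_out).view(1, 28, 28)
```

During training, the capsules of all non-target classes are masked to zero before flattening, so only the activated capsule drives the reconstruction.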
In this paper there are two routes, the primary route and the secondary route, but both share the same dynamic routing structure. Routing ensures that the output of a capsule is delivered only to the appropriate parent node, similar to the idea of "focused cultivation". A lower-layer capsule i needs to know how to deliver its output vector to a higher-level capsule j, which requires evaluating the coupling degree between the low-level and high-level capsules. This coupling is represented by the scalar weight c_ij, which measures the importance of the connection.
In this high-dimensional vector space, in order to describe the spatial relationships between the different parts of an entity, each capsule is given corresponding weights. These weight vectors together form an affine transformation matrix W_ij. Multiplying the output u_i of the low-level neural capsule i by this matrix yields the prediction vector û_j|i = W_ij u_i for the high-level capsule j.
For an intermediate-layer capsule, the input is a vector and the output is also a vector, but the input is processed in two stages:
The process of the dynamic routing algorithm is as follows:
1) Softmax processing: c_ij = softmax(b_ij), turning the routing logits into coupling coefficients.
2) Predict the output: û_j|i = W_ij u_i.
3) Weighted sum: s_j = Σ_i c_ij û_j|i.
4) Compress the vector: v_j = squash(s_j) = (||s_j||² / (1 + ||s_j||²)) · (s_j / ||s_j||).
5) Update the coupling coefficient: b_ij ← b_ij + û_j|i · v_j.
Figure 3 describes the dynamic routing algorithm.
Dynamic routing algorithm
For all capsules i in a lower layer and all capsules j in the layer above, the routing logits b_ij are initialized to zero before the iterations begin.
The dynamic routing algorithm focuses on clustering similar parts together to form identification modules of larger granularity. If the predicted vector û_j|i has a large scalar product with the output v_j of a possible parent capsule, the coupling coefficient between them is increased, strengthening that connection.
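The routing procedure above can be sketched in PyTorch as follows, assuming the prediction vectors û_j|i have already been computed by the affine transforms; three iterations is the common default from the CapsNet paper, not necessarily the exact setting used here:

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    n2 = (s ** 2).sum(dim=dim, keepdim=True)
    return n2 / (1 + n2) * s / torch.sqrt(n2 + eps)

def dynamic_routing(u_hat, num_iters=3):
    """Routing-by-agreement sketch (Sabour et al., 2017).

    u_hat: prediction vectors of shape (batch, num_lower, num_higher, dim),
           already the product of lower-capsule outputs and matrices W_ij.
    """
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)  # routing logits b_ij
    for _ in range(num_iters):
        c = F.softmax(b, dim=2)                    # 1) coupling coefficients c_ij
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)   # 3) weighted sum s_j
        v = squash(s)                              # 4) squashed output v_j
        b = b + (u_hat * v.unsqueeze(1)).sum(-1)   # 5) agreement: b_ij += u_hat . v_j
    return v

# 1152 primary capsules routed to 10 digit capsules of dimension 16
v = dynamic_routing(torch.randn(2, 1152, 10, 16))
```

The softmax is taken over the higher-layer index, so each lower capsule distributes a total coupling of 1 among its possible parents.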
The traditional cross-entropy function only supports single-label classification, so it is not suitable for the capsule network. In order to distinguish multiple classes in one picture, a margin (edge) loss is used as the optimization objective for each digit capsule k:

L_k = T_k max(0, m⁺ − ||v_k||)² + λ (1 − T_k) max(0, ||v_k|| − m⁻)²

In the above formula, T_k = 1 if and only if a digit of class k is present, m⁺ = 0.9 and m⁻ = 0.1 are the upper and lower margins, and λ = 0.5 down-weights the loss for absent classes. The total loss is the sum of the losses of all digit capsules.
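A minimal implementation of this margin loss, using the standard constants m⁺ = 0.9, m⁻ = 0.1 and λ = 0.5 from the CapsNet paper:

```python
import torch

def margin_loss(v_norm, labels, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin (edge) loss L_k from the CapsNet paper.

    v_norm: (batch, num_classes) capsule lengths ||v_k||
    labels: (batch, num_classes) one-hot targets T_k
    """
    pos = labels * torch.clamp(m_pos - v_norm, min=0) ** 2
    neg = lam * (1 - labels) * torch.clamp(v_norm - m_neg, min=0) ** 2
    return (pos + neg).sum(dim=1).mean()

# Two-class toy check: a confident, correct prediction gives zero loss,
# because 0.95 > m_pos for the target and 0.05 < m_neg for the other class.
v_norm = torch.tensor([[0.95, 0.05]])
labels = torch.tensor([[1.0, 0.0]])
loss = margin_loss(v_norm, labels)
```

Because each class has its own term, several digit capsules can be "present" at once, which is what allows overlapping digits to be recognized simultaneously.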
EXPERIMENTAL ENVIRONMENT
| Operating system | Windows 10 (RAM 16.0 GB)           |
| CPU              | Intel(R) Core(TM) i7-9750H         |
| GPU              | NVIDIA GeForce GTX 1660 Ti         |
| Dataset          | MNIST                              |
| Other            | PyTorch 1.5.0+cu101, Python 3.7.7  |
The MNIST handwritten digit database is widely used in machine vision for image recognition and classification. Each sample image in MNIST is 28×28 pixels, and the database comprises four files: training set images, training set labels, test set images and test set labels. These are binary files in which each pixel is stored as a number between 0 and 255, where 0 is white (background) and 255 is black (foreground). The training set has 60,000 handwritten samples; its function is to fit the model parameters, such as the weights and biases. The test set has 10,000 samples; its function is to test the final effect of the model.
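The pixel convention above translates into a simple normalization step before training. The snippet below is a self-contained illustration using a random stand-in image rather than the real MNIST files; in practice the same rescaling is applied to every sample loaded from the binary files:

```python
import torch

# Raw MNIST pixels are bytes in 0..255 (0 = white background, 255 = black
# ink); networks normally rescale them to floats in [0, 1]. The random
# image here is only a stand-in for one real 28x28 MNIST sample.
raw = torch.randint(0, 256, (28, 28), dtype=torch.uint8)
img = raw.float() / 255.0               # normalize to [0, 1]
flat = img.view(-1)                     # 784-dim vector for a fully connected input
batch = img.unsqueeze(0).unsqueeze(0)   # (1, 1, 28, 28) for a convolutional input
```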
Test precision chart of capsule network under 50 epochs
As shown in Figure 4, the highest accuracy in this training run is 99.55%, reached at the 44th epoch.
As shown in Figure 5, the highest accuracy in this training run is 99.62%, reached at the 43rd epoch.
Test accuracy chart of deep capsule network under 50 epochs
Test precision chart of capsule network under 30 epochs
Test accuracy chart of deep capsule network under 30 epochs
As shown in Figure 8, under the same conditions the test accuracy of the deep capsule network rises faster than that of the capsule network within a short number of epochs, and its recognition accuracy is also higher.
Comparison between the accuracy of capsule network and deep capsule network
As shown in Figure 9, more routing iterations are not always better; the number should be determined experimentally for the specific network structure. For a smaller training period, it is more appropriate to iterate the primary route 2 times and the secondary route 3 times.
Impact of changing the number of routing iterations on the deep capsule network
Figure 10 shows the results when the two routes use different numbers of iterations.
Influence of different iteration times of two routes on deep capsule network
As shown in Figure 11, the training time and classification accuracy of the network are compared under different pairings of primary-route and secondary-route iteration counts. From the data in the table, considering classification accuracy alone, the combination of two primary-route iterations and three secondary-route iterations is the best, but its training time is long. Considering training time and classification accuracy together, it is best to iterate the primary route once and the secondary route twice.
Influence of the same number of two routing iterations on deep capsule network
To examine the reconstructed pictures, we draw and visualize them with matplotlib's imshow function; the input pictures (Figure 12) and the reconstructed pictures (Figure 13) are shown below:
Schematic diagram of some pictures in MNIST database
Schematic diagram of reconstructed image
The comparison shows that the reconstructed digit images are clearer and smoother than the input images, from which it can be inferred that reconstruction has a noise-smoothing effect.
In the same way, we train the deep capsule network on overlapping handwritten digit images and feed the resulting vectors into the decoder to reconstruct the images. Some of them are shown in Figure 14; the separation effect is basically accurate.
Comparison of input and output images of the network
Figure 15 shows three separation results: '0' and '1', '3' and '4', and '0' and '9'. The network is clearly able to separate two completely overlapping handwritten digits. Even when '3' and '4' overlap so heavily that the human eye can hardly separate them, the network still succeeds, with an accuracy of 93.53%. By comparison, the original capsule network reaches only 88.10%.
Partial reconstruction results of the improved network
However, the situation shown in Figure 16 still occurs in some reconstructed pictures. The original overlapping picture is the overlap of the digits '9' and '4', but both reconstructed images look like '9' and the '4' is missing, so the reconstruction is wrong. After the improvement, the error rate is still 6.47%.
Improved partial error reconstruction results
The deep capsule network model in this paper is designed around the characteristics and shortcomings of the capsule network. On the one hand, it retains the capsule network's advantage in understanding the pose of objects; on the other hand, to address its shortcomings, the convolution kernel sizes of the convolution layers are optimized and the dynamic routing process is extended to two routing stages. The final deep capsule network retains the advantages of the traditional capsule network while improving performance.