Object detection has become one of the most important research directions in computer vision. Its essence is to find the targets of interest in an image with a complex background, give the location of each target, and determine its class. Object detection is widely used in face recognition, the medical field, traffic research and other areas [1].
Object detection based on deep learning can be divided into two groups: two-stage methods based on region proposals and one-stage methods based on regression [4]. Two-stage methods are represented by the region-based convolutional neural network (R-CNN) family. The first stage generates a set of candidate target regions, and the second stage performs coordinate regression and classification on these candidates; successive algorithms in this family have gradually moved toward end-to-end object detection. Although detection accuracy has improved, detection is slow because the networks have a large number of parameters [6]. One-stage methods are represented by YOLO and the single-network multi-scale detector SSD. They discard the candidate-region extraction stage of the two-stage methods and directly predict the class probability and location of each target, which makes the network structure simpler. Compared with two-stage methods, detection speed is improved to a certain extent, but localization accuracy decreases, and the models still have too many parameters [6].
SSD [20] (full name: Single Shot MultiBox Detector) is a typical one-stage object detection algorithm. To detect objects of different scales, it uses shallow feature maps to detect small objects and deep feature maps to detect large objects.
The SSD algorithm encapsulates target localization and classification in a single forward pass and directly outputs the class probabilities and position coordinates of each object, so the final detection result is obtained in one step. Although this makes detection faster, localization accuracy decreases. SSD detects small objects on shallow, high-resolution feature layers. Because these layers have weak feature representation ability, small-scale targets may be missed or falsely detected.
To address these problems, this paper improves the standard SSD object detection algorithm: a residual network structure is introduced, adjacent feature maps are fused, and the SE attention mechanism is added. Experiments are carried out on the PASCAL VOC2012 dataset.
Liu et al. proposed the SSD algorithm [20] in 2015, combining the speed of YOLO with the accurate localization of Fast R-CNN. The main innovations of the SSD algorithm are: (1) extracting feature maps at different scales; (2) using prior boxes with different aspect ratios. These two improvements enable SSD to efficiently detect targets of different scales [20].
The network structure of SSD is shown in Figure 1. Its detection framework [20] consists of two parts: a feature extraction network and a multi-scale feature detection network. The feature extraction network adopts the VGG16 structure for the preliminary extraction of image features. Since this part of the SSD network only needs to extract features and does not perform classification, the fully connected layers in VGG16 are replaced by convolution layers. The second part performs detection on the prediction feature maps of different scales generated by the feature extraction network. The shallow feature maps have a higher spatial resolution than the deep feature maps and can more accurately capture detailed information such as the edges, contours and textures of the image [18]. The deep feature maps have a large receptive field and strong information representation ability, but lack detailed information compared with the shallow feature maps. Therefore, the shallow feature maps are used to detect smaller targets and the deep feature maps to detect larger targets.
SSD adopts multi-scale feature detection. The input image size can be 300×300 or 512×512 [18], among others. The original SSD uses VGG16 as the backbone network. A multi-scale pyramid of feature maps is generated by appending several convolution layers of gradually decreasing size to the base network: first, the fully connected layers FC6 and FC7 of VGG16 are replaced by the convolution layers Conv6 and Conv7, and all dropout layers and the FC8 layer are removed; second, the feature layers Conv8, Conv9, Conv10 and Conv11 are added. The SSD network uses the Conv4_3 layer of VGG16 as the first prediction feature map, and the feature maps obtained from the Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 layers as the subsequent prediction feature maps, giving a total of 6 prediction feature maps of different scales. For an input image size of 300×300, the sizes of the six prediction feature maps are 38×38, 19×19, 10×10, 5×5, 3×3 and 1×1.
For the six prediction feature maps, each cell generates k default boxes with different aspect ratios, where k is 4 or 6 depending on the feature map, for a total of 8732 default (prior) boxes. A feature map of size H×W therefore produces H×W×k default boxes. The k values for the six prediction feature maps are 4, 6, 6, 6, 4, 4. The size and number of default boxes generated by each feature map are shown in Table I.
Table I. Size and number of default boxes on each feature map
Feature map | Feature map size | Default box scale {aspect ratios} | Number of default boxes |
Feature Map1 | 38 × 38 | 21 {1/2, 1, 2} | 38 × 38 × 4 |
Feature Map2 | 19 × 19 | 45 {1/3, 1/2, 1, 2, 3} | 19 × 19 × 6 |
Feature Map3 | 10 × 10 | 99 {1/3, 1/2, 1, 2, 3} | 10 × 10 × 6 |
Feature Map4 | 5 × 5 | 153 {1/3, 1/2, 1, 2, 3} | 5 × 5 × 6 |
Feature Map5 | 3 × 3 | 207 {1/2, 1, 2} | 3 × 3 × 4 |
Feature Map6 | 1 × 1 | 261 {1/2, 1, 2} | 1 × 1 × 4 |
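The total of 8732 default boxes can be checked directly from the feature map sizes and k values given in the text. The following is a minimal sketch; the names `FEATURE_MAPS` and `count_default_boxes` are illustrative and do not come from the paper.

```python
# (height, width, k) for the six prediction feature maps of SSD with a
# 300x300 input, taken from the sizes and k values listed in the text.
FEATURE_MAPS = [
    (38, 38, 4),
    (19, 19, 6),
    (10, 10, 6),
    (5, 5, 6),
    (3, 3, 4),
    (1, 1, 4),
]


def count_default_boxes(feature_maps):
    """Return the number of default boxes per feature map and their total."""
    per_map = [h * w * k for h, w, k in feature_maps]
    return per_map, sum(per_map)


per_map, total = count_default_boxes(FEATURE_MAPS)
print(per_map)  # [5776, 2166, 600, 150, 36, 4]
print(total)    # 8732
```

The first feature map alone accounts for 5776 of the 8732 boxes, which is why the quality of the shallow 38×38 layer matters so much for small objects.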
After obtaining all default boxes, the network predicts the scores of c categories and 4 coordinate offsets (bounding box regression parameters) for each default box. For an M×N prediction feature map with P channels, two sets of 3×3 convolution kernels with P channels are used, one to generate the class scores and one to generate the coordinate offsets of the corresponding default boxes. Let k be the number of default boxes per cell of a prediction feature map; then k × c convolution kernels are required for the class scores and k × 4 for the coordinate offsets, so the feature map yields M × N × k × (c + 4) outputs in total.
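As a rough illustration of the size of the prediction heads, the sketch below follows the original SSD design [20], in which a feature map with k default boxes per cell uses k × c kernels of size 3×3 for the class scores and k × 4 for the offsets. The value c = 21 is an assumption (the 20 PASCAL VOC classes plus one background class), and the function names are illustrative.

```python
NUM_CLASSES = 21  # assumption: 20 PASCAL VOC classes + 1 background class

# (height, width, k) of the six prediction feature maps (300x300 input).
FEATURE_MAPS = [(38, 38, 4), (19, 19, 6), (10, 10, 6),
                (5, 5, 6), (3, 3, 4), (1, 1, 4)]


def head_kernels(k, num_classes):
    """Number of 3x3 kernels in the class branch and in the box branch
    of one prediction feature map with k default boxes per cell."""
    return k * num_classes, k * 4


def total_outputs(feature_maps, num_classes):
    """Total number of predicted values (class scores plus offsets)
    over all feature maps: sum of H * W * k * (c + 4)."""
    return sum(h * w * k * (num_classes + 4) for h, w, k in feature_maps)


print(head_kernels(4, NUM_CLASSES))              # (84, 16)
print(head_kernels(6, NUM_CLASSES))              # (126, 24)
print(total_outputs(FEATURE_MAPS, NUM_CLASSES))  # 218300
```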
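The loss function of the multi-scale feature detection network consists of two parts: category (confidence) loss and position offset (localization) loss. Following the original SSD formulation [20], the overall loss function is shown in (1):

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right) \tag{1}$$

where N is the number of matched positive samples, α is 1, x is the matching indicator between default boxes and ground-truth boxes, c is the predicted class confidence, l is the predicted box and g is the ground-truth box.
For the predicted category loss, the calculation formula is shown in (2):

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_{i}^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_{i}^{0}\right), \qquad \hat{c}_{i}^{p} = \frac{\exp\left(c_{i}^{p}\right)}{\sum_{p} \exp\left(c_{i}^{p}\right)} \tag{2}$$

where $x_{ij}^{p} \in \{0, 1\}$ indicates whether the i-th default box is matched to the j-th ground-truth box of category p, $\hat{c}_{i}^{p}$ is the softmax probability that the i-th default box belongs to category p, and category 0 denotes the background.
The position offset loss is shown in (3):

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\left(l_{i}^{m} - \hat{g}_{j}^{m}\right) \tag{3}$$

where $l_{i}^{m}$ is the predicted offset of the i-th default box for the center coordinates (cx, cy), width w and height h, and $\hat{g}_{j}^{m}$ is the j-th ground-truth box encoded as an offset relative to that default box.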
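To make the two-part loss concrete, the following is a minimal pure-Python sketch of the standard SSD loss from [20]: a softmax cross-entropy confidence term plus a Smooth L1 localization term over positive boxes, normalized by the number of positives N. It is not the authors' implementation. Hard negative mining is omitted, so every negative box passed in contributes to the confidence term, and all function names are illustrative.

```python
import math


def smooth_l1(x):
    """Smooth L1 penalty: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def softmax_cross_entropy(logits, target):
    """Negative log softmax probability of `target` for one default box."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]


def ssd_loss(cls_logits, cls_targets, loc_preds, loc_targets, alpha=1.0):
    """
    cls_logits:  per-box list of raw class scores (index 0 = background)
    cls_targets: per-box class index (0 for negative boxes)
    loc_preds, loc_targets: per-box [cx, cy, w, h] offsets
    Only positive boxes (target > 0) contribute to the localization term.
    """
    n_pos = sum(1 for t in cls_targets if t > 0)
    if n_pos == 0:
        return 0.0  # the loss is defined as 0 when there are no positives
    conf = sum(softmax_cross_entropy(l, t)
               for l, t in zip(cls_logits, cls_targets))
    loc = sum(
        smooth_l1(p - g)
        for pred, gt, t in zip(loc_preds, loc_targets, cls_targets) if t > 0
        for p, g in zip(pred, gt)
    )
    return (conf + alpha * loc) / n_pos


# One positive box (class 1) and one negative box, two classes in total.
loss = ssd_loss(
    cls_logits=[[0.0, 0.0], [0.0, 0.0]],
    cls_targets=[1, 0],
    loc_preds=[[0.5, 0.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0]],
    loc_targets=[[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
)
print(round(loss, 4))  # 1.5113
```

In the example, the large offsets of the negative box do not affect the result, because only the positive box enters the localization term.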
The improvements in this paper based on the SSD algorithm include:
In this paper, the VGG16 backbone of the SSD network is replaced with ResNet50. In an object detection algorithm, the choice of classification network has a significant impact on performance; removing the fully connected layers and the loss layer from the classification network yields the base network of the detector. VGG16 consists of 13 convolutional layers, 3 fully connected layers and 5 pooling layers. Its outstanding feature is its simple structure, built by stacking convolutional and pooling layers. However, VGG16 is relatively shallow, so its feature extraction is not sufficient [8]. ResNet50 consists of a series of residual structures and uses Batch Normalization to accelerate training; it also has far fewer parameters than VGG16, and its speed and accuracy are superior to those of VGG16. Therefore, ResNet50 is chosen to replace VGG16 in the SSD network.
The shallow feature maps are large in size and contain sufficient detail information, but too little semantic information. The deep feature maps have large receptive fields and contain sufficient semantic information, but detail information is gradually lost as the receptive field increases [3]. Therefore, fusing features of different scales is an important means to improve the performance of object detection. In this paper, bilinear interpolation is used to achieve upsampling, which increases the image resolution and retains more feature information; the concatenate method is used to stitch the shallow feature map with the deep feature map in the depth direction. The feature fusion method is shown in Figure 3.
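The fusion operation described above (bilinear upsampling of the deeper map followed by concatenation along the channel axis) can be sketched in plain Python as follows. This is an illustrative sketch only: a feature map is represented as a list of channels, each an H×W nested list, the interpolation uses the half-pixel-center convention, and the names `bilinear_resize` and `fuse` are hypothetical. In practice a framework routine would be used instead.

```python
def bilinear_resize(channel, out_h, out_w):
    """Resize one 2D channel with bilinear interpolation
    (half-pixel-center convention, coordinates clamped to the border)."""
    in_h, in_w = len(channel), len(channel[0])
    out = []
    for i in range(out_h):
        # Map the center of output row i back into input coordinates.
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = channel[y0][x0] * (1 - wx) + channel[y0][x1] * wx
            bottom = channel[y1][x0] * (1 - wx) + channel[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def fuse(shallow, deep):
    """Upsample every channel of `deep` to the spatial size of `shallow`,
    then concatenate the two maps along the channel axis."""
    h, w = len(shallow[0]), len(shallow[0][0])
    upsampled = [bilinear_resize(ch, h, w) for ch in deep]
    return shallow + upsampled


# Toy example: a 2-channel 4x4 shallow map and a 1-channel 2x2 deep map.
shallow = [[[0.0] * 4 for _ in range(4)] for _ in range(2)]
deep = [[[1, 3], [1, 3]]]
fused = fuse(shallow, deep)
print(len(fused), len(fused[0]), len(fused[0][0]))  # 3 4 4
print(fused[2][0])  # [1.0, 1.5, 2.5, 3.0]
```

Concatenation keeps the shallow and deep features as separate channels, so the number of channels of the fused map is the sum of the two inputs.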
Taking the fusion of Conv4 feature layer and Conv7 feature layer as an example, as shown in Figure 3(a), the specific fusion is as follows:
In Figure 3(b), the feature maps of the Conv7 layer are fused with the feature maps of the Conv8_2 layer, and the obtained feature maps are used to replace the feature maps of the Conv7 layer.
In Figure 3(c), the feature maps of the Conv8_2 layer are fused with the feature maps of the Conv9_2 layer, and the feature maps of the Conv8_2 layer are replaced with the resulting feature maps.
In Figure 3(d), the feature maps of the Conv9_2 layer are fused with the feature maps of the Conv10_2 layer, and the feature maps of the Conv9_2 layer are replaced with the obtained feature maps.
In Figure 3(e), the feature maps of the Conv10_2 layer are fused with the feature maps of the Conv11_2 layer, and the feature maps of the Conv10_2 layer are replaced with the obtained feature maps.
The fusion in these cases follows the same procedure as in Figure 3(a). The sizes of the resulting new prediction feature layers are:
38×38×1536, 19×19×1024, 10×10×768, 5×5×512, 3×3×512, 1×1×256.
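The channel counts of the fused layers can be reproduced with a short calculation. The per-layer channel counts before fusion are not stated in the text; the values in `CHANNELS` below are an assumption, chosen because they are consistent with the fused sizes listed above when each layer is concatenated with the next deeper one.

```python
SIZES = [38, 19, 10, 5, 3, 1]
# Assumption: channel counts of the six prediction layers before fusion.
CHANNELS = [1024, 512, 512, 256, 256, 256]


def fused_shapes(sizes, channels):
    """Shape (H, W, C) of each prediction layer after concatenating it
    with the upsampled next deeper layer; the last layer is unchanged."""
    shapes = []
    for i, (s, c) in enumerate(zip(sizes, channels)):
        extra = channels[i + 1] if i + 1 < len(channels) else 0
        shapes.append((s, s, c + extra))
    return shapes


for shape in fused_shapes(SIZES, CHANNELS):
    print(shape)
# (38, 38, 1536)
# (19, 19, 1024)
# (10, 10, 768)
# (5, 5, 512)
# (3, 3, 512)
# (1, 1, 256)
```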
SENet was proposed to address the information loss caused by treating all channels of a feature map as equally important during convolution and pooling. In conventional convolution and pooling, each channel of the feature map is treated as equally important by default, whereas in practice the importance of different channels varies. SENet adaptively recalibrates the channel-wise feature responses by explicitly modeling the interdependencies between channels. It learns a weight for each channel of the incoming feature map, so that different channels influence the result to different degrees. The use of SENet therefore allows the network to focus on the channels that matter most for the detection task.
The SE block works in two steps: Squeeze and Excitation.
Squeeze: the global compressed feature vector of the current feature map is obtained by performing Global Average Pooling on the feature map.
Excitation: the weight of each channel in the feature map is obtained through a two-layer fully connected bottleneck structure, and the feature map weighted by these values is used as the input of the next layer of the network.
The four steps of the SE attention mechanism are illustrated in Figure 4:
(1) Perform global average pooling on the input feature map.
(2) Apply two fully connected layers; the first has fewer neurons, and the second has as many neurons as the input feature map has channels.
(3) Apply a Sigmoid to the output of the second fully connected layer to obtain a weight between 0 and 1 for each channel of the input feature map.
(4) Multiply the input feature map by these channel weights.
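The four steps can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' code: the feature map is a list of channels (each an H×W nested list), the weight matrices `w1` and `w2` are hypothetical, biases are omitted, and the ReLU between the two fully connected layers is an assumption taken from the usual SENet design, since the text does not name the intermediate activation.

```python
import math


def se_block(feature_map, w1, w2):
    """
    feature_map: list of C channels, each an H x W list of floats
    w1: (C/r) x C weights of the first, reducing fully connected layer
    w2: C x (C/r) weights of the second, restoring fully connected layer
    Returns the reweighted feature map and the per-channel weights.
    """
    # Step 1 (Squeeze): global average pooling, one value per channel.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0]))
                for ch in feature_map]
    # Step 2: two fully connected layers (ReLU after the first is assumed).
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    # Step 3: Sigmoid maps each channel logit to a weight in (0, 1).
    weights = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # Step 4: multiply every channel by its weight.
    scaled = [[[v * a for v in row] for row in ch]
              for ch, a in zip(feature_map, weights)]
    return scaled, weights


# Toy example: 2 channels of size 2x2, bottleneck of width 1.
fmap = [[[1.0, 1.0], [1.0, 1.0]],
        [[2.0, 2.0], [2.0, 2.0]]]
w1 = [[0.5, 0.5]]       # hypothetical weights, shape 1 x 2
w2 = [[1.0], [-1.0]]    # hypothetical weights, shape 2 x 1
scaled, weights = se_block(fmap, w1, w2)
print([round(a, 4) for a in weights])  # [0.8176, 0.1824]
```

In this toy case the first channel is strengthened relative to the second, which is the channel reweighting behavior the mechanism is meant to provide.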
A feature map contains a large number of channels, and their contribution to the network's decision varies: some channels carry rich information, while others hardly play a role. The purpose of introducing the SE attention mechanism is to make full use of the feature channels that contain important information. The SE attention mechanism is applied after each of the six prediction feature maps obtained from feature fusion, so that it adaptively assigns a weight to each channel, strengthening the useful feature channels while suppressing the useless ones. The SSD model with the added SE attention mechanism is shown in Figure 5.
This experiment uses the PASCAL VOC2012 dataset, which contains 11530 images with 27450 labeled targets. The dataset has 20 target categories, grouped into four main classes: Vehicles, Household, Animals, and People. The Vehicles class includes Aeroplane, Bicycle, Boat, Bus, Car, Motorbike, and Train; the Household class includes Bottle, Chair, Dining table, Potted plant, Sofa, and TV/Monitor; the Animals class includes Bird, Cat, Cow, Dog, Horse, and Sheep; and the People class includes Person. Each image is annotated with a corresponding XML file giving the location and class of each target [4].
The environment used for the experiments is shown in Table II.
Table II. Environment configuration
Category | Item | Configuration |
Software | Operating system | Windows 10 |
Software | Deep learning framework | PyTorch (GPU) |
Software | Programming language | Python |
Software | Development environment | PyCharm |
To comprehensively evaluate the detection accuracy of the SSD algorithm, this paper uses the mean Average Precision (mAP) as the evaluation criterion. mAP is the mean of the Average Precision (AP) over all classes, and each AP is computed using the Intersection over Union (IoU) criterion: detections whose IoU with a ground-truth box is above the threshold are counted as positive samples, and those below the threshold as negative samples. The calculation of AP requires Precision and Recall, as shown in (4):

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}, \qquad AP = \int_{0}^{1} p(r)\,dr \tag{4}$$

where TP is the number of positive samples predicted as positive, FP is the number of negative samples predicted as positive, FN is the number of positive samples predicted as negative, and p(r) is the precision-recall curve.
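The quantities above can be illustrated with a short sketch. The box format (xmin, ymin, xmax, ymax) and the function names are assumptions for illustration, and the AP routine uses one common convention (area under the precision-recall curve after making precision monotonically non-increasing); the exact evaluation protocol used in the experiments is not specified here.

```python
def iou(a, b):
    """Intersection over Union of two boxes (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(is_tp, num_gt):
    """
    is_tp:  booleans for the detections of one class, sorted by
            descending confidence (True = matched a ground-truth box)
    num_gt: number of ground-truth objects of that class (must be > 0)
    """
    tp = fp = 0
    recalls, precisions = [], []
    for hit in is_tp:
        tp += hit
        fp += not hit
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing when read from left to right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the stepwise precision-recall curve.
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap


print(round(iou((0, 0, 2, 2), (1, 1, 3, 3)), 4))               # 0.1429
print(precision_recall(tp=8, fp=2, fn=2))                      # (0.8, 0.8)
print(round(average_precision([True, False, True], 2), 4))     # 0.8333
```

mAP is then the mean of `average_precision` over the 20 classes.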
Table III. Training parameter settings
Parameter | Value |
learning rate | 0.0005 |
momentum | 0.9 |
weight_decay | 0.0005 |
batch size | 16 |
epoch | 50 |
step_size | 5 |
In the experiments, a transfer learning strategy is used to reduce the training difficulty and improve the detection results. The improved SSD algorithm is initialized from a pre-trained ResNet50 model, and its loss function remains the same as that of the original SSD-VGG16 algorithm. The training loss curve of the improved algorithm is shown in Figure 6. The algorithm starts to converge at about the 25th epoch, and the final training loss is 2.72.
The mAP values obtained during training are shown in Figure 7. The mAP reaches a maximum of 72.67%, which is 2.1 percentage points higher than that of the original SSD-VGG16 network.
The SSD-VGG16 algorithm, the SSD-ResNet50 algorithm and the improved algorithm of this paper are each trained and validated on the PASCAL VOC2012 dataset, and the mAP of each algorithm is calculated. Table IV shows that the improved algorithm raises the mAP from 70.6% for the original SSD-VGG16 algorithm to 72.7%, so its detection performance is better than that of the original SSD algorithm.
Table IV. Comparison of detection results of different algorithms on the PASCAL VOC2012 dataset
Algorithm | mAP |
SSD-VGG16 | 70.6% |
SSD-ResNet50 | 72.1% |
The improved algorithm | 72.7% |
To improve the detection accuracy of the SSD algorithm, this paper modifies the network structure of SSD and introduces a multi-layer feature fusion module and a channel attention mechanism. Multi-layer feature fusion at different scales, built on the ResNet50 feature network, increases the semantic information of the original feature maps. In addition, the SE attention mechanism is added to strengthen the focus on informative feature channels in each prediction layer. The comparison experiments on the PASCAL VOC2012 dataset show that the improved SSD algorithm achieves a mean average precision 2.1 percentage points higher than the original SSD-VGG16 algorithm and 0.6 percentage points higher than SSD-ResNet50.