Introduction

By April 2022, more than 500 million cases of COVID-19 had been confirmed worldwide, and more than 6.2 million people had died. Under the normalized epidemic prevention situation, wearing masks in public places is an effective means of epidemic prevention, so mask-wearing detection has become a core task of epidemic prevention and control. At present, the main detection method is manual inspection, which is not only time-consuming and laborious but also increases the risk of virus infection. It is therefore of great practical significance to build a mask detection and monitoring system that makes epidemic prevention and control automated and intelligent.

At present, some scholars have studied mask-wearing detection. Ren Yu et al. [1] proposed a Fast R-CNN mask-wearing detection algorithm based on deep learning, which accelerated model convergence by transferring the weights of a model pre-trained on the ImageNet data set; the accuracy on the final test set reached 89.41%. Dong Yanhua et al. [2] proposed an SSD network based on a residual structure: by adding the residual structure before the localization and classification stages of the SSD network, the feature extraction network is separated from the classification and localization layers, which effectively relieves the dual task of learning local and high-level information at the same time, and the average detection accuracy reaches 92.3%. Cao Chengshuo et al. [3] proposed the YOLO-Mask algorithm, which builds on YOLOv3, introduces an attention mechanism into the feature extraction network, and uses a feature pyramid and path aggregation strategy for feature fusion; its average accuracy is 93.33%. Based on YOLOv4, Cheng Tinghao et al. [4] redesigned the anchor box sizes with the K-means clustering algorithm and improved the network structure, reaching a recall of 88.20% and an accuracy of 92.30% on the test set.

This paper adopts the YOLOv5 target detection network model. Compared with YOLOv3 and YOLOv4, YOLOv5 introduces the Focus structure in the backbone, which reduces FLOPs and improves training speed. Compared with the ordinary convolution operations in the YOLOv4 neck, YOLOv5 uses the CSP_X structure in the neck, which strengthens the network's feature fusion ability and improves detection accuracy. Overall, YOLOv5 performs better than YOLOv4 and YOLOv3.

YOLOv5 network model
YOLOv5 algorithm

The YOLOv5 algorithm was developed on the basis of YOLOv4 and YOLOv3. Compared with YOLOv4, the YOLOv5 architecture is about 90% smaller [9], and in terms of accuracy YOLOv5 outperforms the YOLOv3 and YOLOv4 algorithms currently in use. YOLOv5 is a single-stage target detection algorithm. The YOLOv5 [10] series can be divided into YOLOv5s [11], YOLOv5m, YOLOv5l [12] and YOLOv5x [13]. Among them, YOLOv5s has the smallest network; although its accuracy is slightly lower than the other three, it is the fastest. Because mask detection requires speed, this paper selects the YOLOv5s network model.

The network structure of YOLOv5s is shown in Figure 1. The model is divided into four parts: input, backbone, neck and prediction. At the input end, Mosaic data augmentation is used to improve the detection of small targets [14]. In addition, the adaptive anchor box calculation method [15] gives each data set anchor boxes with suitable initial length and width, which yields a larger intersection-over-union ratio and greatly improves the efficiency of training and prediction [16]. The backbone is a convolutional neural network that aggregates image features at different granularities [17], mainly comprising the Focus structure and the CSP structure. The neck sits between the backbone and the prediction part; it is a series of network layers that mix and combine image features and pass them on to the prediction part [18]. The prediction part performs the final detection: it predicts on the image features, generates bounding boxes and predicts the target class [19].

Figure 1.

YOLOv5s network structure diagram

Input

The input stage includes Mosaic data augmentation [20], image size processing and adaptive anchor box calculation. Mosaic data augmentation splices four pictures by random scaling, random cropping and random arrangement; it greatly enriches the data set and adds many small targets. YOLO algorithms require the input image to be resized to a fixed size; the standard size in this paper is 640 × 640 × 3. Initial anchor boxes must be set before network training. The initial anchor boxes of YOLOv5 are [10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119] and [116, 90, 156, 198, 373, 326].
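For reference, the three anchor groups above can be written out as a small Python structure; the per-scale stride comments reflect how YOLOv5 assigns anchors to its three detection scales (this is an illustrative sketch, not the library's own configuration format):

```python
# Initial anchor boxes as (width, height) pairs, one group per detection scale.
ANCHORS = [
    [(10, 13), (16, 30), (33, 23)],       # small objects,  stride 8
    [(30, 61), (62, 45), (59, 119)],      # medium objects, stride 16
    [(116, 90), (156, 198), (373, 326)],  # large objects,  stride 32
]
```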

Adaptive image scaling uniformly scales images of different lengths and widths to a standard size before they are used for detection. If many black edges are filled in on both sides of the scaled picture, there is information redundancy that slows detection. YOLOv5 therefore modifies the letterbox function so that the original picture adaptively reduces this redundancy, i.e. minimizes the black edges. The algorithm computes the scaling factor, the size after scaling, and the black-edge filling value, as in formula (1):

$$\left\{ \begin{aligned} l_3 &= l_1 \cdot \min\left\{\frac{l_2}{l_1},\ \frac{w_2}{w_1}\right\} \\ w_3 &= w_1 \cdot \min\left\{\frac{l_2}{l_1},\ \frac{w_2}{w_1}\right\} + \mathrm{np.mod}\left((l_1 - w_1) \cdot \min\left\{\frac{l_2}{l_1},\ \frac{w_2}{w_1}\right\},\ 32\right) \end{aligned} \right. \tag{1}$$

Where l1, w1 are the length and width of the original image; l2, w2 are the length and width of the standard (target) size; l3, w3 are the length and width after adaptive scaling; and $l_2/l_1$, $w_2/w_1$ are the candidate scaling factors. The smaller scaling factor is selected and multiplied by the length and width of the original image to obtain the scaled size. $(l_1 - w_1) \cdot \min\{l_2/l_1,\ w_2/w_1\}$ is the amount to be filled on the short side; np.mod in NumPy then takes its remainder modulo 32 (8 pixels in the example here), and dividing 8 by 2 gives the padding added to each end of the image.
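A minimal NumPy sketch of this computation, following formula (1) (the function name and the 640 × 640 default are assumptions for illustration):

```python
import numpy as np

def letterbox_dims(l1, w1, l2=640, w2=640, stride=32):
    """Sketch of formula (1): scale by the smaller ratio, then pad the
    short side only up to the next multiple of `stride`."""
    r = min(l2 / l1, w2 / w1)                    # scaling factor
    new_l, new_w = round(l1 * r), round(w1 * r)  # size after scaling
    pad = np.mod((l1 - w1) * r, stride)          # total black-edge filling value
    return new_l, new_w, pad / 2                 # half the padding per edge
```

For example, letterbox_dims(800, 600, 416, 416) gives r = 0.52, a scaled size of 416 × 312, and np.mod(104, 32) = 8 pixels of black edge, i.e. 4 pixels at each end, matching the 8-pixel example above.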

Adaptive anchor box calculation sets anchor boxes with suitable initial length and width for each data set. During network training, YOLOv5 outputs prediction boxes on the basis of the initial anchor boxes, computes the error between the prediction boxes and the ground-truth boxes, updates the parameters by backpropagation, and optimizes the network over multiple iterations.

Backbone

YOLOv5 uses the Focus and CSP structures in the backbone [21]. The CSP [22] structure is shown in the network structure diagram in Figure 1. The Focus structure distinguishes YOLOv5 from YOLOv3 and YOLOv4; its key step is the slicing operation shown in Figure 2. For example, when a 608 × 608 × 3 image enters the Focus structure, the slicing operation first turns it into a 304 × 304 × 12 feature map, and a convolution with 32 kernels then produces a 304 × 304 × 32 feature map. The Cross Stage Partial (CSP) structure was designed to reduce redundant computation and enhance gradient flow. Whereas YOLOv4 uses the CSP structure only in its backbone, YOLOv5 designs two CSP structures: taking the YOLOv5s network as an example, CSP1_X is applied in the backbone and CSP2_X in the neck.

Figure 2.

Slicing operation
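A minimal PyTorch sketch of this slicing operation (channel counts follow the 608 × 608 × 3 example above; the class is illustrative, not YOLOv5's exact implementation):

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Focus slicing: sample every second pixel into four sub-images,
    stack them on the channel axis (3 -> 12), then convolve (12 -> 32)."""
    def __init__(self, c_in=3, c_out=32):
        super().__init__()
        self.conv = nn.Conv2d(c_in * 4, c_out, kernel_size=3, padding=1)

    def forward(self, x):
        # (B, 3, 608, 608) -> (B, 12, 304, 304) -> (B, 32, 304, 304)
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(x)
```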

Neck

The neck adopts an FPN + PAN structure. FPN [23] uses top-down lateral connections to build high-level semantic feature maps at all scales, forming the classic feature pyramid. PAN [24] adds a bottom-up feature pyramid: after the multi-layer top-down path of FPN, the low-level localization information has become blurred, so PAN is added to compensate for and strengthen it. The specific structure is shown in Figure 3.

Figure 3.

FPN + PAN structure
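The following is a minimal two-scale PyTorch sketch of this top-down plus bottom-up fusion (channel sizes, layer choices and the nearest-neighbour upsampling are assumptions for illustration; the real YOLOv5 neck also contains CSP2_X blocks):

```python
import torch
import torch.nn as nn

class FpnPanNeck(nn.Module):
    """Two-scale FPN + PAN fusion: top-down semantic path, then
    bottom-up localization path."""
    def __init__(self, c_high=512, c_low=256):
        super().__init__()
        self.reduce = nn.Conv2d(c_high, c_low, 1)              # lateral 1x1 conv
        self.up = nn.Upsample(scale_factor=2, mode="nearest")  # FPN top-down
        self.fuse_td = nn.Conv2d(c_low * 2, c_low, 3, padding=1)
        self.down = nn.Conv2d(c_low, c_low, 3, stride=2, padding=1)  # PAN bottom-up
        self.fuse_bu = nn.Conv2d(c_low * 2, c_high, 3, padding=1)

    def forward(self, p_low, p_high):
        # p_low: high-resolution feature; p_high: low-resolution, semantically strong
        t = self.reduce(p_high)
        out_low = self.fuse_td(torch.cat([p_low, self.up(t)], dim=1))       # FPN step
        out_high = self.fuse_bu(torch.cat([t, self.down(out_low)], dim=1))  # PAN step
        return out_low, out_high
```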

Prediction

The prediction part includes the bounding box loss function and non-maximum suppression (NMS). YOLOv5 uses CIOU_Loss [25], which effectively handles overlapping bounding boxes. In post-processing, a weighted NMS operation is adopted to screen the many candidate boxes and obtain the optimal target box.

Target detection algorithms often output multiple overlapping prediction boxes for the same target, resulting in false detections, so a non-maximum suppression algorithm [26] is generally used as a post-processing step to suppress redundant prediction boxes and obtain the final detection results. NMS [27] and Soft-NMS are the commonly used non-maximum suppression algorithms [28]. They reduce redundant prediction boxes for the same target [29] and lower the number of false-positive boxes by suppressing the confidence of every prediction box that does not have the maximum confidence [30] (hereafter, the non-maximum boxes), thereby improving detection accuracy. However, NMS and Soft-NMS use only the IOU as the criterion [31] for suppressing prediction boxes that highly overlap the maximum-confidence box; such algorithms successfully suppress redundant boxes but can also suppress adjacent targets [32]. This paper therefore uses a weighted NMS operation [33]. Compared with traditional non-maximum suppression, weighted NMS does not directly eliminate same-category boxes whose IOU with the current box exceeds the threshold; instead, it weights them by the confidence predicted by the network to obtain a new box, takes this box as the final prediction, and then eliminates the redundant boxes [34].
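A minimal sketch of weighted NMS as described above (pure PyTorch; the box format and threshold are assumptions): instead of discarding boxes that overlap the highest-confidence box, it averages them, weighted by confidence, into the kept box.

```python
import torch

def box_iou_one_to_many(a, b, eps=1e-7):
    """IoU between one box a (4,) and boxes b (N, 4), (x1, y1, x2, y2) format."""
    lt = torch.max(a[:2], b[:, :2])
    rb = torch.min(a[2:], b[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + eps)

def weighted_nms(boxes, scores, iou_thr=0.5):
    """Merge each cluster of overlapping boxes into one confidence-weighted box."""
    order = scores.argsort(descending=True)
    out_boxes, out_scores = [], []
    while order.numel() > 0:
        i = order[0]
        ious = box_iou_one_to_many(boxes[i], boxes[order])
        cluster = order[ious > iou_thr]      # current box plus its overlaps
        w = scores[cluster].unsqueeze(1)     # confidence weights
        out_boxes.append((w * boxes[cluster]).sum(0) / w.sum())
        out_scores.append(scores[i])
        order = order[ious <= iou_thr]       # drop the merged cluster
    return torch.stack(out_boxes), torch.stack(out_scores)
```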

Experiment and result analysis
Experimental preparation

The experimental data set contains 8000 pictures in two categories: mask and no mask. Mask indicates that the person is wearing a mask; no mask indicates that the person is not. 80% of the images are used as the training set and 20% as the test set. Some images from the data set are shown in Figure 4.

Figure 4.

Dataset partial image

The experiment uses PyTorch to build the network framework; the initial learning rate is set to 0.01 and 200 epochs are trained. First, the images are normalized and preprocessed; the processed images are then fed into the YOLOv5 network for training to obtain the best weights, after which the images are tested and analyzed. The experimental process is shown in Figure 5.

Figure 5.

Experimental flow chart

Mosaic data enhancement

Mosaic randomly selects four pictures and splices them by random cropping, arrangement and scaling; the effect is shown in Figure 6. This method enriches the backgrounds and small targets of the detection objects and improves the robustness of the network to a certain extent. In addition, Mosaic data augmentation is equivalent to processing four pictures at a time, so the batch size increases implicitly; the configured batch size therefore does not need to be large to obtain a good model, which reduces the GPU requirements.

Figure 6.

Mosaic data enhancement
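A minimal NumPy sketch of the four-image splice (label remapping is omitted for brevity, and each source image is assumed to be at least as large as its target region):

```python
import random
import numpy as np

def mosaic4(imgs, size=640):
    """Paste random crops of four images into the quadrants defined by a
    random mosaic centre on a single canvas."""
    canvas = np.full((size, size, 3), 114, dtype=np.uint8)  # grey fill
    cx = random.randint(size // 4, 3 * size // 4)           # random centre x
    cy = random.randint(size // 4, 3 * size // 4)           # random centre y
    regions = [(0, 0, cx, cy), (cx, 0, size, cy),
               (0, cy, cx, size), (cx, cy, size, size)]
    for img, (x1, y1, x2, y2) in zip(imgs, regions):
        h, w = y2 - y1, x2 - x1
        ys = random.randint(0, img.shape[0] - h)  # random crop offset
        xs = random.randint(0, img.shape[1] - w)
        canvas[y1:y2, x1:x2] = img[ys:ys + h, xs:xs + w]
    return canvas
```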

Experimental environment

The configuration of the experimental platform is shown in Table 1.

Table 1. Experimental environment configuration

Parameter	Configuration

CPU	Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz
GPU	RTX 3090
Memory	60 GB
Video memory	24 GB
System environment	Ubuntu 18.04
Experimental platform	PyTorch 1.6.0, Python 3.8
Accelerated environment	CUDA 10.1
YOLOv5 network training

In this paper, the training epoch is 200 and the batch size is 64. The official YOLOv5 pre-trained weights are used to accelerate model convergence. In the experiment, the learning rates of the BN layers, the weight layers and the bias layers are lr0, lr1 and lr2 respectively, where lr0 and lr1 follow the same schedule. The learning rate curves are shown in Figure 7.

Figure 7.

Learning rate curve
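A sketch of how such a three-way learning-rate split can be set up in PyTorch (the grouping rule and hyper-parameter values are assumptions; YOLOv5's own training script groups parameters in a similar spirit but is not reproduced here):

```python
import torch
import torch.nn as nn

def build_optimizer(model, lr=0.01, momentum=0.937, weight_decay=5e-4):
    """Put BN weights, other weights and biases in separate parameter
    groups so each can receive its own learning-rate schedule."""
    bn_w, weights, biases = [], [], []
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d) and m.weight is not None:
            bn_w.append(m.weight)               # lr0 group, no weight decay
        elif hasattr(m, "weight") and isinstance(m.weight, nn.Parameter):
            weights.append(m.weight)            # lr1 group, with weight decay
        if getattr(m, "bias", None) is not None and isinstance(m.bias, nn.Parameter):
            biases.append(m.bias)               # lr2 group
    opt = torch.optim.SGD(bn_w, lr=lr, momentum=momentum, nesterov=True)
    opt.add_param_group({"params": weights, "weight_decay": weight_decay})
    opt.add_param_group({"params": biases})
    return opt
```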

In this paper the loss function adopts CIOU_Loss, shown in formula (2), where v measures the consistency of the aspect ratios of the two rectangular boxes and α is a weight coefficient. From the definition of α it can be seen that the loss function is more inclined to optimize in the direction of increasing overlap, especially when the IOU is zero. CIOU_Loss jointly considers the overlapping area, the center distance and the aspect ratio of the rectangular boxes, which further improves the detection effect.

$$\begin{aligned} \mathcal{L}_{CIoU} &= 1 - IoU + \frac{\rho^2(\mathbf{b}, \mathbf{b}^{gt})}{c^2} + \alpha v \\ v &= \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \\ \alpha &= \frac{v}{(1 - IoU) + v} \end{aligned} \tag{2}$$
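A direct PyTorch transcription of formula (2) (the (x1, y1, x2, y2) box format and the eps guards are assumptions for illustration; b2 plays the role of the ground-truth box):

```python
import math
import torch

def ciou_loss(b1, b2, eps=1e-7):
    """CIOU_Loss per formula (2): 1 - IoU + centre-distance term + alpha*v."""
    # intersection and union
    iw = (torch.min(b1[..., 2], b2[..., 2]) - torch.max(b1[..., 0], b2[..., 0])).clamp(0)
    ih = (torch.min(b1[..., 3], b2[..., 3]) - torch.max(b1[..., 1], b2[..., 1])).clamp(0)
    w1, h1 = b1[..., 2] - b1[..., 0], b1[..., 3] - b1[..., 1]
    w2, h2 = b2[..., 2] - b2[..., 0], b2[..., 3] - b2[..., 1]
    union = w1 * h1 + w2 * h2 - iw * ih + eps
    iou = iw * ih / union
    # rho^2: squared distance between the box centres
    rho2 = ((b1[..., 0] + b1[..., 2] - b2[..., 0] - b2[..., 2]) ** 2
            + (b1[..., 1] + b1[..., 3] - b2[..., 1] - b2[..., 3]) ** 2) / 4
    # c^2: squared diagonal of the smallest enclosing box
    cw = torch.max(b1[..., 2], b2[..., 2]) - torch.min(b1[..., 0], b2[..., 0])
    ch = torch.max(b1[..., 3], b2[..., 3]) - torch.min(b1[..., 1], b2[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # aspect-ratio consistency v and its weight alpha
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```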

Experiment and result analysis

The results of this experiment are evaluated in terms of mAP@0.5, mAP@0.5:0.95, Precision and Recall, computed as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{3}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{4}$$
$$AP = \int_0^1 P\,\mathrm{d}R \tag{5}$$
$$mAP = \frac{\sum_{i=1}^{N} AP_i}{N} \tag{6}$$

TP, FP and FN in (3) and (4) are the numbers of correctly detected boxes, falsely detected boxes and missed boxes respectively. The AP value is the area under the P-R curve, and N in formula (6) is the total number of detection categories, which is 2 in this paper. mAP@0.5 is the average AP over all categories when the IOU threshold is set to 0.5; mAP@0.5:0.95 is the mean mAP over IOU thresholds ranging from 0.5 to 0.95 in steps of 0.05.
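A small NumPy sketch of how these quantities yield an AP value (the monotone-envelope step is one common way of computing the area in equation (5); the function and argument names are illustrative):

```python
import numpy as np

def average_precision(tp, conf, n_gt):
    """AP as the area under the P-R curve, per equations (3)-(5).
    tp: 1/0 per prediction (correct or not), conf: its confidence,
    n_gt: number of ground-truth boxes for this category."""
    order = np.argsort(-np.asarray(conf))          # descending confidence
    tp = np.asarray(tp, dtype=float)[order]
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(1 - tp)
    recall = cum_tp / (n_gt + 1e-16)               # equation (4)
    precision = cum_tp / (cum_tp + cum_fp)         # equation (3)
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]       # monotone precision envelope
    return float(np.trapz(p, r))                   # area under the curve, eq. (5)
```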

While outputting a prediction box, the model also outputs the classification confidence score of that box. Generally speaking, adjusting the confidence threshold of a model changes whether a prediction counts as positive or negative, and Precision and Recall change accordingly. Observing how Precision and Recall vary with the confidence threshold allows the quality of the model to be evaluated to a certain extent: if a model maintains high, stable Precision while Recall grows, it performs well; if it must sacrifice a lot of Precision in exchange for improved Recall, it performs poorly. Researchers typically use the Precision-Recall curve to measure this trade-off. The average precision (AP) of each category is the area under its Precision-Recall curve, and higher is generally better. The mean average precision commonly used as an evaluation indicator in target detection is the mean of the per-category APs.

The experimental process and experimental test results are shown in Figure 8 and Table 2 respectively.

Figure 8.

Experimental training process

Table 2. Test results of each model on the test set

Model	Precision/%	Recall/%	mAP@0.5/%
YOLOv5s	94.8	89.0	93.5
YOLOv4	76.2	85.4	51.2
YOLOv3	73.6	82.3	48.9

As Table 2 shows, the Precision, Recall and mAP of YOLOv5s on the test set are 94.8%, 89.0% and 93.5% respectively; all three indicators are significantly higher than those of the YOLOv3 and YOLOv4 algorithms.

After training, the test set data are input into the network model to obtain the test results, shown in Figure 9. For a person wearing a mask, the value on the bounding box is the classification confidence; for a person not wearing one, the box is labeled no mask. Analysis of the various metrics shows that the training effect of the YOLOv5 network model is satisfactory.

Figure 9.

Experimental test effect

Conclusion

In summary, in order to achieve accurate, real-time detection and free personnel from tedious inspection work, this paper proposes a mask-wearing detection method based on YOLOv5 that effectively solves the time-consuming and laborious problem of manual mask-wearing detection: people wearing masks in public places are detected with a confidence score on the bounding box, and people not wearing masks are detected and flagged. Experiments show that the Precision, Recall and mean average precision (mAP) on the test set reach 94.8%, 89.0% and 93.5% respectively; the detection effect is better than that of YOLOv4 and YOLOv3, and the overall effect is satisfactory. Future work could address detecting improperly worn masks, making the model lighter, and verifying and improving the network model in practical applications.
