
Backpack detection model using multi-scale superpixel and body-part segmentation



Introduction

Object detection is essential in video surveillance applications, medical diagnostics, and image processing [1][2]. The major challenge of object detection in video surveillance is the variation in the shapes, colors, and sizes of the objects themselves [3]. The failure of an automatic surveillance system to detect an object can even be fatal [4]. Besides detecting objects with high accuracy, a detection model must also have a low computation time so that it can be used in the real world [5]. The backpack is a type of carried object (CO) widely used for various purposes because of its practicality. Valuable items such as wallets, laptops, cameras, and cell phones are often kept in backpacks, making them vulnerable to theft. In addition, backpacks are easily confused with one another because many look similar. Backpack detection models for video surveillance have been reported in [6]–[9]. However, the performance of these methods still needs to be improved before they can be widely used [10].

A process that significantly affects the accuracy of an object detection model is the localization of the object area in the image [11]. A backpack is generally located above the bend line; therefore, a backpack detection model should focus only on the area above that line [12]. Searching for the backpack over the entire image reduces accuracy and increases computation time. This study proposes a backpack detection model that localizes the backpack area through multi-scale superpixel segmentation and the body-part method. Because backpacks are generally located above the bend line, only superpixels that lie above that line are classified. The experimental results are then compared with the state-of-the-art method. This paper is organized as follows: Section 2 presents the related work, Section 3 describes the proposed method, Section 4 explains the experiments, and Section 5 presents the conclusion and suggestions for further research.

Related work

Many studies have developed CO detection methods for video surveillance. Haritaoglu et al. [7] utilize silhouettes to detect whether a person is carrying an object. Based on the assumption that the human silhouette while standing is close to symmetrical, pixels are classified as symmetric or asymmetric with respect to a vertical symmetry line. The temporal template used consists of two components: a textural component, which represents the object's appearance, and a shape component, which represents shape information. These components are then used to localize the backpack area. However, one study [13] pointed out that the symmetry line is susceptible to noise. In addition, the varying size and shape of the CO also dramatically affect the detection accuracy.

Another study [9] used a temporal template that was compared with multi-directionally constructed exemplars to detect and localize CO areas. The pedestrian's walking direction in the temporal template was detected and then used to select the appropriate exemplar. The protrusion area obtained by comparing the temporal template with the exemplars was then classified as CO. A study by Tavanai et al. [14] utilized generic shape properties, such as convexity, extracted from the protrusion areas of a person's silhouette. This model also performs tracking using spatio-temporal properties. However, the assumption that CO forms protrusions limits the detection ability of the model, especially for CO that overlaps with the pedestrian's clothing or body.

Wahyono et al. [12] utilized sliding windows and body parts to localize CO. Sliding windows were shifted over the image, and the HOG feature was extracted from each window. These features were then classified using a support vector machine (SVM). Body parts were then used to classify bags into several categories, including backpacks, rolling luggage, etc. A common problem with sliding windows is that if the windows are too small, the number of areas to be analyzed increases and large CO may not be detected. Conversely, if the windows are too large, the dimensions of the features to be classified also grow, causing the model's accuracy to decrease.

Ghadiri et al. [15] detected CO by analyzing human contours. Person contours are built based on human contour exemplars with different standing and walking poses. Contours located outside the contour of the person are then used as candidates for CO. However, the detection process is only carried out on a single frame. This study was then continued [16] by integrating temporal information. Superpixel segmentation descriptors and human shape modelling were used to identify candidate CO areas. Furthermore, spatio-temporal information was integrated into the candidate area to detect CO. However, this model cannot be applied to still images.

A study by Ghadiri et al. [17] performed multi-scale superpixel segmentation on the foreground area of the video frame. A codebook (including local features based on contour information of upright persons, standing or walking, without carried objects from different viewpoints) was used to detect human-like regions in the segmentation result area. The CO candidate area was then obtained by performing complement operations from the human-like area. Superpixels with a high probability of being CO were then combined to form the complete CO form. However, it is difficult for this model to detect CO which completely overlaps with the pedestrian's body.

Proposed method

This study developed a backpack detection model that combines superpixel and body-part segmentation to localize the CO areas. The model consists of two phases: training and testing. During the training phase, pedestrian images are segmented at three scales using the superpixel method. The HOG and histogram features are then extracted from the segmented areas. The resulting feature vector is then reduced to a smaller dimension; this reduction also eliminates insignificant components. The feature vector is then used to train the classifier on two classes: backpack and other. The trained model is stored and used in the testing phase.

In the testing phase, the pedestrian image is also segmented at three scales. The segments are then filtered using the body-part method: only segments above the bend line, where the backpack is generally located, are selected as candidate areas. Features of the candidate areas are then extracted, and the dimensions of the resulting feature vectors are reduced. Each feature vector is then classified using the model trained in the training phase. Figure 1 presents the backpack detection method developed in this study.

Figure 1:

Backpack detection using multi-scale superpixel segmentation and body-part method.

Pre-processing

In most computer vision systems, smoothing is a common step for removing noise. Median filtering is a smoothing method that improves image quality in the spatial domain; it is a non-linear filter that works in much the same way as mean filtering [18]. Another step that can improve video quality is contrast enhancement, which aims to provide a better contrast distribution. Histogram equalization is widely used for image contrast enhancement [19]. This method increases the contrast of the image globally, and is especially effective when the usable pixel values are concentrated in a narrow range. Eq. (1) is used to perform histogram equalization on the image:

p_n(i) = \frac{n_i}{n}    (1)

where

p_n(i) = image histogram for pixel value i

n_i = number of occurrences of gray level i

n = total number of pixels in the image
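
As an illustration of this pre-processing step, the sketch below applies median filtering followed by histogram equalization using OpenCV; the library choice and the kernel size are assumptions, since the paper does not specify them.

import cv2

def preprocess(frame, ksize=5):
    # Median filtering (non-linear smoothing) followed by histogram equalization.
    # The kernel size ksize=5 is an illustrative assumption.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    smoothed = cv2.medianBlur(gray, ksize)      # removes noise in the spatial domain
    return cv2.equalizeHist(smoothed)           # spreads p_n(i) = n_i / n over the full range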

Foreground Detection (FD)

The FD method detects changes in the image by differencing two adjacent frames [20]. If an area changes between moments t1 and t2, the corresponding pixels differ between f1(x, y) and f2(x, y). The difference is obtained by subtracting the (k-1)-th frame, f_{k-1}(x, y), from the k-th frame, f_k(x, y), as shown in Eq. (2):

D(x, y) = \left| f_k(x, y) - f_{k-1}(x, y) \right|    (2)

The area resulting from frame differencing can correspond to object movement, changes in lighting, or noise. Figure 2 shows the resulting image of the foreground detection process.

Figure 2:

The resulting image of the foreground detection process on each dataset.
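
A minimal sketch of the frame-differencing operation in Eq. (2), assuming OpenCV; the binarization threshold is an assumption added only to suppress noise and lighting changes.

import cv2

def foreground_mask(prev_frame, curr_frame, thresh=30):
    # D(x, y) = |f_k(x, y) - f_{k-1}(x, y)|, then binarized; thresh=30 is illustrative.
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(curr_gray, prev_gray)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    return mask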

Segmentation

Superpixel segmentation has been widely used in object detection methods [17], [21]. Simple Linear Iterative Clustering (SLIC) generates superpixels by clustering pixels based on their color similarity and spatial proximity [22]. SLIC is usually implemented in the labxy space, where the clustering process is more manageable. The lab channels are the pixel's color coordinates in the CIELAB color space, in which small distances correspond to perceptually similar colors, while the xy channels are the pixel's position in the image plane. The SLIC algorithm uses K-Means clustering, as shown below.

SLIC segmentation

1: Initialize centroids C_k = [l_k, a_k, b_k, x_k, y_k]^T
2: Place each centroid within an n × n window
3: repeat
4:   for each cluster C_k do
5:     Assign each pixel to the nearest centroid (based on the pixel-to-centroid distance)
6:   end for
7:   Update centroids
8: until the centroids no longer change
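
A minimal sketch of this segmentation step, assuming scikit-image's SLIC implementation (the paper does not name a library); the function internally converts the RGB input to CIELAB and clusters pixels in the 5-D labxy space, as in the pseudocode above.

from skimage.io import imread
from skimage.segmentation import slic

image = imread("pedestrian.png")                      # hypothetical input image
labels = slic(image, n_segments=15, compactness=10)   # K-Means clustering in labxy space
# `labels` assigns every pixel to one of roughly 15 superpixels (the l2 scale).
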
Features Extraction

Two features, the HOG and the histogram, are used to classify images in this study. In the first scenario, the model is trained and tested using the HOG feature only. In the second scenario, the two features (HOG and histogram) are concatenated and classified as a single feature vector.

HOG

The HOG feature describes local shape through the distribution of gradient orientations [23]. In the HOG feature extraction process, several parameters must be set: orientations, pixels-per-cell, and cells-per-block. The orientations parameter is the number of orientation bins into which the pixel gradients are accumulated in the histogram. Pixels-per-cell determines the number of pixel rows and columns in each cell for which a gradient histogram is formed. Cells-per-block determines the region over which the cell histograms are normalized. In a 64 × 128 image with 8 × 8 cells, we would have 8 × 16 = 128 cells. To create 16 × 16 blocks, 2 × 2 cells are combined, resulting in 7 × 15 = 105 overlapping blocks. Each block has a 36 × 1 feature vector; hence, the total number of features for the image is 105 × 36 = 3780. Figure 3 shows a cell and a block in a 64 × 128 image.

Figure 3:

Cell and block in 64 × 128 image.
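
The dimensionality worked out above can be reproduced with scikit-image's HOG implementation (an assumed library choice). Nine orientation bins are used here so that each 2 × 2-cell block yields 36 values, matching the 3780-feature example; the detector itself is later trained with 8 orientations (see Features Extraction).

import numpy as np
from skimage.feature import hog

window = np.zeros((128, 64))            # a 64 x 128 detection window (rows, cols)
features = hog(window,
               orientations=9,          # 9 bins -> 4 cells x 9 bins = 36 values per block
               pixels_per_cell=(8, 8),  # 8 x 16 = 128 cells
               cells_per_block=(2, 2))  # 7 x 15 = 105 overlapping blocks
assert features.size == 105 * 36        # 3780 features in total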

Histogram

The histogram is a color feature extraction technique that describes the distribution of colors in the image [24]. The histogram counts the occurrences of each color in an image regardless of spatial location. If L is the number of gray levels in the image, n_i is the number of pixels with gray level i, and n is the total number of pixels in the image, then the image histogram h_i can be calculated using Eq. (3):

h_i = \frac{n_i}{n}, \quad i = 0, 1, \ldots, L - 1    (3)
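
Eq. (3) translates directly into a normalized gray-level histogram; the sketch below uses NumPy.

import numpy as np

def gray_histogram(gray, levels=256):
    # h_i = n_i / n for i = 0 .. L-1 (Eq. 3); `gray` is a single-channel image.
    counts, _ = np.histogram(gray, bins=levels, range=(0, levels))
    return counts / gray.size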

Feature Dimensionality Reduction

Feature dimensions that are too large slow down the training and testing process, whereas a smaller feature dimension can decrease the execution time without unduly affecting the model's accuracy [25]. We adopted Principal Component Analysis (PCA) to reduce the feature dimension by removing the correlation between the original variables and transforming them into new, uncorrelated variables [26]. Let x_i be the feature vector of the i-th training image and h its dimension. The projection matrix U containing the first P principal components is obtained by solving the eigen-equations of the covariance matrix S, defined as

S = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T

where the mean of the feature vectors of the N training samples is

\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i

The final descriptor is computed using the projection matrix as y = U^T (x - \bar{x}).
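
A sketch of the reduction step, assuming scikit-learn's PCA; the number of retained components and the feature matrix below are purely illustrative, since the paper does not report the value of P.

import numpy as np
from sklearn.decomposition import PCA

X_train = np.random.rand(966, 3780)   # hypothetical matrix of training feature vectors
pca = PCA(n_components=100)           # keep the first P = 100 components (assumption)
pca.fit(X_train)                      # estimates the mean and the projection matrix U
Y_train = pca.transform(X_train)      # y = U^T (x - x_bar) for every sample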

SVM Training

Boser, Guyon, and Vapnik developed the Support Vector Machine (SVM) method, which was first presented in 1992 at the Annual Workshop on Computational Learning Theory [27]. The concept of SVM can be described simply as finding the best hyperplane that separates two classes in the input space. The data are denoted as x_i \in R^d and the labels as y_i \in \{-1, +1\}, where i = 1, 2, \ldots, l and l is the number of data points. In our task, y_i = +1 denotes the backpack, and y_i = -1 denotes the body or background. We assume that the two classes can be separated perfectly by a hyperplane defined as

\vec{w} \cdot \vec{x} + b = 0

A training feature x_i belonging to class -1 satisfies the inequality \vec{w} \cdot \vec{x}_i + b \le -1, while a training feature x_i belonging to class +1 satisfies \vec{w} \cdot \vec{x}_i + b \ge +1. The distance from the hyperplane to the closest points in either class is 1 / \|\vec{w}\|, so the combined margin is 1 / \|\vec{w}\| + 1 / \|\vec{w}\| = 2 / \|\vec{w}\|. To optimize this margin on a large dataset, Lagrange multipliers can be used, as shown in Eq. (6):

L(\vec{w}, b, \alpha) = \frac{1}{2} \|\vec{w}\|^2 - \sum_{i=1}^{l} \alpha_i \left( y_i (\vec{x}_i \cdot \vec{w} + b) - 1 \right), \quad i = 1, 2, \ldots, l    (6)

If the input data x_i cannot be linearly separated, they can be mapped into a higher-dimensional feature space by utilizing a kernel function k. There are three common kernels in SVM: linear, polynomial, and radial basis function (RBF). The kernel used in this study is the linear kernel, denoted by k(x_i, x_j) = x_i^T x_j. The linear kernel is quite effective, and its required processing time is also shorter [28].
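
A minimal sketch of the training step, assuming scikit-learn's SVC with the linear kernel k(x_i, x_j) = x_i^T x_j; the synthetic data and the regularization constant C are illustrative assumptions. Labels follow the convention above (+1 = backpack, -1 = body or background).

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))        # 200 reduced feature vectors (e.g. PCA output)
y = np.where(X[:, 0] > 0, 1, -1)       # toy labels in {-1, +1}
clf = SVC(kernel="linear", C=1.0).fit(X, y)
score = clf.decision_function(X[:1])   # signed distance to the hyperplane w.x + b = 0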

Body-Part

The body-part method spatially localizes the CO area at a specific location based on body proportions and camera position [6]. Various kinds of CO can be detected by exploiting the typical position of the carried CO relative to the bend line.

In Figure 4, if h is the height of the head, then it can be derived that the height and width of the human body are H = 8h and W = 2h, respectively. The horizontal diameter of the body is denoted as C, and the vertical diameter of the body is denoted as B. If T is the highest point of the head and L is the leftmost point of the body, then Eq. (7) is used to obtain B and Eq. (8) is used to obtain C:

B = T + 4h    (7)

C = L + H    (8)

Figure 4:

Human Body Proportion Model [29].
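
Eqs. (7) and (8) reduce to simple arithmetic once the head box is known; the helper below is a literal transcription under the stated proportions (H = 8h), with the head box assumed to come from the head detector described below.

def bend_line(T, L, h):
    # T: row of the highest point of the head (rows grow downward),
    # L: column of the leftmost point of the body, h: head height in pixels.
    H = 8 * h          # body height from the proportion model
    B = T + 4 * h      # Eq. (7): the bend line lies 4h below the top of the head
    C = L + H          # Eq. (8), as given in the text
    return B, C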

Based on observations of the dataset used at the training stage, the backpack is generally located around or slightly above the bend line. Therefore, only superpixels located above the bend line are used in the classification process. Based on the human body proportion model, the bend line can be calculated from the head area.

To detect the head area, we first collected head segments and then extracted their features (Figure 5). The obtained feature vectors were then used to train the head detector model. In the testing phase, the trained model was then used to detect head segments, and after the head was obtained, the bend line was then generated using Eq. (7) and Eq. (8).

Figure 5:

Head segment samples.

The bend line is then used as a reference for selecting the superpixels that will be used as ROI to detect backpacks. Figure 6 shows the training process for the body-part model and superpixel selection based on the bend line.

Figure 6:

Bend-line identification and superpixel selection process.

Experiments

This section describes the dataset used for training and testing, the model implementation, the test scenario, and the model results. The testing stage was carried out in this study using two scenarios: BP_SC1 and BP_SC2. Each scenario was evaluated using precision, recall, and F1 score on the DIKE20 dataset.

Datasets

We built the DIKE20 dataset to train the model used in the classification process. Each acquired pedestrian image was selected manually, and only pedestrians carrying backpacks were used to train the model. The DIKE20 dataset was acquired using four cameras, each mounted on a wall at a height of approximately 3 meters. Camera A recorded objects from the front view, cameras B and C from the side views, and camera D from the rear view.

Figure 7 shows the camera configuration in the acquisition room. Images were also selected so that the pedestrian's body appeared intact and did not overlap with other pedestrians or objects. A total of 966 images from various views were selected to train the model.

Figure 7:

Camera configuration in the acquisition room.

The testing process of the trained model was carried out on three datasets: DIKE20, PETS2006, and i-LIDS. In the DIKE20 dataset, 271 images were selected. This study used all seven scenarios of the PETS2006 dataset; the images in each scenario were selected based on the constraints mentioned above. 65 images were selected from the first scenario, 70 from the second, 33 from the third, 12 from the fourth, 55 from the fifth, 66 from the sixth, and 22 from the seventh. The i-LIDS dataset contains six scenarios.

Only the first scenario was suitable for this research. This scenario was acquired using two cameras. From the first camera, 121 images were selected, and from the second camera, 64 images were selected. Table 1 shows the number of images used at the testing stage on each dataset.

Number of test images in each dataset

Dataset Test Images
DIKE20 271
PETS2006 323
i-LIDS 185
Total 779
Multi-Scale Superpixel Generation

Superpixels are generated using the K-Means algorithm on the image in a 5-dimensional color space (labxy). The variable k in the K-Means algorithm sets the scale, i.e., the number of superpixels to be formed. Because the size of CO varies, segmentation was carried out at three scales, l1, l2, and l3, corresponding to 10, 15, and 20 superpixels respectively. Figure 8.a shows the segmentation result at the l1 scale, Figure 8.b at the l2 scale, and Figure 8.c at the l3 scale.

Figure 8:

The segmentation results on the l1, l2, and l3 scales.
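
A sketch of the multi-scale generation, again assuming scikit-image's SLIC; the compactness value is an assumption.

from skimage.segmentation import slic

SCALES = {"l1": 10, "l2": 15, "l3": 20}      # number of superpixels per scale

def multiscale_superpixels(image):
    # Returns one SLIC label map per scale for a single pedestrian image.
    return {name: slic(image, n_segments=k, compactness=10)
            for name, k in SCALES.items()}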

Backpack Localization Using the Body-Part Method

Backpacks are generally located above the bend line. This study trained a model to recognize the head area by classifying segments through their HOG features. After the head area was obtained, the bend line was identified by adding four times the head height to the coordinate of the body's highest point, as in Eq. (7). Segments located above the bend line had their features extracted. Figure 9 shows the process of determining the body's vertical (B) and horizontal (C) diameters. After the center lines B and C were obtained, the segments above line B were selected. Table 2 shows the selected superpixels and their locations relative to the bend line. The features of the selected segments were extracted and then classified to detect whether a backpack was present among the segments.

Figure 9:

The result of the bend-line determination process on an image at scale l2.

The selected superpixels and their locations relative to the bend line

Superpixel   Location
(image)      B + 3h
(image)      B + 2h
(image)      B + 2h
(image)      B + h
(image)      B + h
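
One simple way to implement this selection is to keep every superpixel whose centroid lies above the bend line B; the centroid criterion is an assumption, since the paper does not state exactly how a segment's position is compared with the line.

from skimage.measure import regionprops

def segments_above_bend_line(labels, B):
    # `labels` is a superpixel label map (labels >= 1); B is the bend-line row from Eq. (7).
    # Image rows grow downward, so "above the bend line" means a smaller row coordinate.
    return [r.label for r in regionprops(labels) if r.centroid[0] < B]
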
Features Extraction

We extracted two features to train the backpack detection model: the HOG feature and the histogram. The evaluation of the model's performance was carried out in two scenarios. The first scenario (BP_SC1) only utilized the HOG feature, while the second scenario (BP_SC2) utilized the HOG and histogram features combined.

The initialization values for extracting HOG features were: orientations = 8, pixels-per-cell = 16 × 16, and cells-per-block = 2. The histogram feature was extracted by converting the image to grayscale; each bin was then computed over the gray-level range 0 to 255. Figure 10 shows an example of the feature extraction.

Figure 10:

Example of Features Extraction on Selected Superpixels (B+h).
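
A sketch of the per-superpixel feature extraction for both scenarios, using the settings listed above; the 2 × 2 interpretation of cells-per-block, the fixed 64 × 64 segment size, and the library choices (OpenCV, scikit-image) are assumptions.

import cv2
import numpy as np
from skimage.feature import hog

def extract_features(segment_rgb, use_histogram=True):
    # BP_SC1: HOG only; BP_SC2: HOG concatenated with the 256-bin grayscale histogram.
    gray = cv2.resize(cv2.cvtColor(segment_rgb, cv2.COLOR_RGB2GRAY), (64, 64))
    hog_vec = hog(gray, orientations=8,
                  pixels_per_cell=(16, 16), cells_per_block=(2, 2))
    if not use_histogram:
        return hog_vec
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    return np.concatenate([hog_vec, hist / gray.size])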

Detection Result

The testing stage was carried out using two scenarios: BP_SC1 and BP_SC2. The ROC curve for each scenario on each dataset was generated by varying the decision threshold.

The AUC (Area Under the Curve) of the ROC curve summarizes the trade-off between the true positive rate (y-axis) and the false positive rate (x-axis). The AUC value obtained by the model on the DIKE20 dataset was the same for both scenarios: 0.82. Figure 11 shows the ROC curve for each scenario on the DIKE20 dataset. Model performance was also measured using precision, recall, and F1 scores. On the DIKE20 dataset, the precision, recall, and F1 scores for the BP_SC1 scenario were 46%, 79%, and 60%, respectively. In the BP_SC2 scenario, precision increased by 6% and recall by only 1%. The F1 score for the BP_SC2 scenario was 63%, an increase of 3% over the BP_SC1 scenario. Table 3 presents the precision, recall, and F1 scores on the DIKE20 dataset.

Figure 11:

The ROC curve for each scenario on the DIKE20 dataset.

The precision, recall, and F1 scores on DIKE20 dataset

Methods Precision Recall F1 score
BP_SC1 46% 79% 60%
BP_SC2 52% 80% 63%
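
For reference, the F1 score is the harmonic mean of precision and recall; a one-line check reproduces the BP_SC2 entry of Table 3.

def f1_score(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.52, 0.80), 2))   # 0.63 -> the reported 63% for BP_SC2
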
Benchmarking

Benchmarking of the constructed model was carried out by comparing its precision, recall, and F1 scores with those of the models developed in [9], [16], and [17]. In addition, we also measured the AUC score for each scenario on each dataset. The AUC score on the PETS2006 dataset was 0.86 for the BP_SC1 scenario and 0.89 for the BP_SC2 scenario, an increase of 0.03 for BP_SC2. The ROC curve for each scenario on the PETS2006 dataset is shown in Figure 12.

Figure 12:

The ROC curve for each scenario on the PETS2006 dataset.

The AUC scores on the i-LIDS dataset were 0.86 and 0.83 for the BP_SC1 and BP_SC2 scenarios, respectively, a decrease of 0.03 for BP_SC2. The ROC curve for each scenario on the i-LIDS dataset can be seen in Figure 13.

Figure 13:

The ROC curve for each scenario on the i-LIDS dataset.

On the PETS2006 dataset, the best precision score was obtained by [17], whereas the best recall score, 85%, was obtained by the BP_SC1 scenario. The highest F1 score, 69%, was obtained by BP_SC2. Table 4 shows each method's precision, recall, and F1 scores on the PETS2006 dataset.

Comparison of precision, recall, and F1 scores on the PETS2006 dataset

Methods Precision Recall F1 score
Damen and Hogg (2012) 50% 55% 52%
Ghadiri et al. (2017) 57% 71% 63%
Ghadiri et al. (2019) 60% 79% 68%
Proposed Methods
BP_SC1 56% 85% 68%
BP_SC2 59% 83% 69%

The experimental results on the i-LIDS dataset are presented in Table 5. For the i-LIDS dataset, the best precision score was obtained by Ghadiri et al. [17]. Our methods, BP_SC1 and BP_SC2, achieved significantly higher recall scores (90% and 95%, respectively). However, the F1 scores obtained by [17] and our BP_SC2 method are comparable.

Comparison of precision, recall, and F1 scores on the i-LIDS dataset

Methods Precision Recall F1 score
Damen and Hogg (2012) 52% 47% 49%
Ghadiri et al. (2017) 62% 60% 61%
Ghadiri et al. (2019) 72% 64% 67%
Proposed Methods
BP_SC1 49% 90% 64%
BP_SC2 52% 95% 67%
Conclusion and future work

The localization of the backpack area using multi-scale superpixel segmentation and the body-part method shows promising performance. On the DIKE20 dataset, the highest F1 score obtained by our model was 63% for the BP_SC2 scenario. On the PETS2006 dataset, the highest F1 score was obtained by our model, which scored 69% (BP_SC2). On the i-LIDS dataset, the highest F1 score was obtained by our model, which scored 67% (BP_SC2); the same score was also obtained by [17]. The average F1 score obtained by our model on the PETS2006 and i-LIDS datasets was 68%, 1% higher than the average F1 score obtained by the state-of-the-art method [17]. Further model development can be pursued in several ways. The segmentation stage could be improved by adding a segment-merging process based on similarity in the color, location, and edges of the segments. Adding features and finding the right combination of features in the classification process could also improve the model's performance.
