Computer vision recognition and tracking algorithm based on convolutional neural network

Published online: 23 Dec 2022
Volume & Issue: AHEAD OF PRINT
Pages: -
Received: 09 May 2022
Accepted: 16 Aug 2022
Introduction

Visual tracking is one of the most important research directions in the field of computer vision. Its applications include visual navigation, intelligent video surveillance, smart cities, and military precision guidance [1-3]. In practical scenarios, the state of the tracked target is affected by many factors, such as deformation, illumination intensity, and rotation, which makes single-target tracking still a very challenging task. In recent years, convolutional neural networks have been successfully applied to various image processing problems (such as target detection, object classification, and image segmentation), significantly improving performance on the corresponding tasks [4-7].

The introduced self-attention mechanism is realised by a spatial attention module and a channel attention module. The spatial attention module selectively aggregates a weighted sum of all feature values to the corresponding position in the original input feature map, which enhances the correlation between similar features. After integrating all the feature maps, the channel attention module selectively emphasises the importance of each channel feature by weighting the channel space of the features. In part of the training data of MDnet, targets have the same semantics but different categories, which reduces the discrimination ability of the model. To solve this problem, a composite loss function is constructed from a classification loss function and an instance discrimination loss function: the classification loss function accounts for the loss of target classification, while the instance discrimination loss function increases the weight of the target in the current video sequence and suppresses its weight in other sequences [8-12].

In the target tracking algorithm MDnet, based on the CNN framework, several downsampling stages are used to obtain the deep features of the target, so much detailed information is lost during downsampling. The feature maps that a CNN extracts at different depths, especially the shallow feature maps, contain more spatial position information about target objects, which is very important for tracking tasks that require accurate positioning. The deep features, in turn, carry robust semantic information about the target, so the discrimination ability of the network model is less easily affected in tracking scenes involving changes in light intensity, object deformation, and rotation. Therefore, to address the problem that the tracking network cannot make full use of features at different levels, a multi-domain convolutional neural network based on multi-level feature aggregation is designed, taking MDnet as the benchmark algorithm. By aggregating features at different levels, the tracker can make full use of them and improve the feature representation ability of the tracking network. The designed feature aggregation network is still an end-to-end model and introduces no problem-specific parameter settings, which maintains the universality of the target tracker. A specific data augmentation strategy is used for network pre-training that takes into account the various conditions that may occur in tracking scenes: the training samples are processed by pre-simulating changes in the target state, which makes the trained network model more discriminative [13]. In addition, an anomaly detection module is embedded in the tracking algorithm, and a complete anomaly response strategy is designed to alleviate the template drift caused by drastic changes in target state or by long-term tracking.

Detailed quantitative and qualitative experiments were carried out on the widely used OTB and VOT2015 datasets. The experiments show that the proposed algorithm achieves good tracking performance and is superior to many state-of-the-art algorithms, which fully verifies its effectiveness.

Convolutional neural network
Concept of the convolutional neural network (CNN)

The convolutional neural network was proposed by Yann LeCun in 1989. The version used today is the result of many iterations, and its computational performance has improved greatly. Humans have always obtained information mainly through their eyes and ears, analysing what they see and hear in order to respond accordingly. For a computing device to understand what is in front of it, it must capture and store images with cameras and then use the computer's processing power to analyse those pictures and extract the relevant information. With the development of the technology, current research is no longer limited to shallow convolutional neural networks; the number of layers can be large or small. Applications have also made great breakthroughs in speech recognition, image processing, and natural language processing [14-16].

A complete convolutional neural network includes three layers: the input layer, the hidden layer, and the output layer [13]. Because input features and labels correspond one to one, the convolutional neural network belongs to supervised learning, and the input features are standardised. The vector group of early training samples is composed of data annotations and input feature values. After early normalisation, this learning mode repeatedly corrects the weights so that the whole convolutional neural network converges. In the input layer, standardising the input features has a clearly positive effect on both the performance of the whole network and the learning rate. The hidden layer includes three types: the convolution layer, the pooling layer, and the full connection layer.

Convolution layer

The convolution layer is the best-known component. For image processing, a convolutional neural network has certain advantages over a fully connected deep neural network because of its stronger use of local perception. In the convolution layer, feature values are extracted from the input data; when speech data are preprocessed, the extracted feature values are cut into rectangular blocks one by one. For the back-propagation algorithm, the input data features pass through a convolution layer by calculating each convolution unit. By stacking convolution layers, more feature information can be extracted more completely, and with more convolution layers, complex features can be extracted from the basic data. The calculation of one output element of the convolution can be expressed by formula (1).

$$Y_{00} = X_{00}W_{00} + X_{01}W_{01} + X_{10}W_{10} + X_{11}W_{11}$$

For example, in a simple convolution, the input data X is a 4×4 matrix and the kernel W is a 2×2 matrix; sliding W over X with stride 1 yields a 3×3 result matrix Y.
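
To make formula (1) concrete, here is a minimal NumPy sketch of this sliding-window convolution; the 4×4 input and 2×2 kernel values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def conv2d(X, W):
    """Valid 2D convolution: slide the kernel W over the input X and take
    the sum of elementwise products at each position, as in formula (1)."""
    kh, kw = W.shape
    oh, ow = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    Y = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            Y[i, j] = np.sum(X[i:i + kh, j:j + kw] * W)
    return Y

X = np.arange(16, dtype=float).reshape(4, 4)  # illustrative 4x4 input
W = np.array([[1.0, 0.0], [0.0, 1.0]])        # illustrative 2x2 kernel
Y = conv2d(X, W)
print(Y.shape)   # (3, 3): a 4x4 input and a 2x2 kernel give a 3x3 output
print(Y[0, 0])   # X00*W00 + X01*W01 + X10*W10 + X11*W11 = 0 + 0 + 0 + 5
```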

After these layers of processing, the result is passed through softmax regression, which determines, for the final output, the probability that the input belongs to each of the candidate categories. Note that convolution here differs from convolution in signal processing: in this model, convolution is a linear weighted sum, as in formula (1) above.

The convolution layer contains many convolution kernels. Feature values at different levels are learned by these kernels to obtain many output values. These kernels are not directly connected with the whole feature map; a set of rules governs how they are combined. A convolution kernel works differently from a fully connected layer: each kernel in the convolution layer is connected only to a local region of the input features. Such connections reduce the number of connections in the whole network while still capturing sufficiently structured features. Moreover, because convolution kernels share parameters, the number of parameters to train is greatly reduced.

Pooling layer

The pooling layer is also called the downsampling layer. It does not reduce the depth of the feature maps but can reduce their spatial size to a certain extent [15]. After processing in the convolution layer, much of the information overlaps to some degree, which is why the pooling layer exists. Pooling reduces the number of nodes entering the full connection layer, which reduces the number of parameters in the whole neural network, accelerates computation, and helps prevent overfitting of the output data. In use, it is somewhat similar to the convolution layer. There are two common ways to process the features produced by the convolution layer: the max pooling layer and the average pooling layer. The former takes the maximum value of the data in each region and is the most widely used pooling structure; the latter takes the average value over the specified region of the feature data [17-21].

For example, a single depth slice of size 4×4 is sampled with a 2×2 max-pooling window: the slice is divided into 2×2 blocks and the maximum value of each block is taken, giving a 2×2 result. The output size of the pooling layer can be expressed by formula (2).

$$Size_{Pool} = Size_{Patch} / Size_{Cpool}$$

The division in formula (2) shows that the feature size input to the pooling layer is divided by the size of the pooling window. Size_Pool is the resulting feature size, Size_Patch is the feature size output by the convolution layer, and Size_Cpool is the size of the pooling window.
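
A minimal NumPy sketch of non-overlapping max pooling consistent with formula (2); the 4×4 input is an illustrative assumption.

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: the output side length is
    Size_Patch / Size_Cpool, as in formula (2)."""
    h, w = feature_map.shape
    out = feature_map[:h - h % size, :w - w % size]   # trim ragged edges
    out = out.reshape(h // size, size, w // size, size)
    return out.max(axis=(1, 3))

patch = np.random.rand(4, 4)   # a single 4x4 depth slice
print(max_pool(patch).shape)   # (2, 2): each 2x2 block keeps its maximum
```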

The pooling layer achieves notable results in two respects. First, it compresses the data volume of the whole neural network; second, it reduces the training parameters and strengthens the regularisation of the network. Together these maintain the invariance of the network and improve the robustness of the whole network. In speech recognition, for example, this gives the recognition system a certain fault tolerance to variations caused by different speaking styles and by external environmental noise.

Full connection layer

After the cyclic calculations of the two layers above, the important information in the input feature data has been reorganised and its key content is better reflected. The operations of the first two layers can be regarded as automatic feature extraction. After these two processes, the results are classified by the full connection layer. A rasterisation step reorganises the output into a feature vector, lifting the feature information processed by the convolution and pooling layers into a deeper-dimensional space, where all of it is classified. The tool for this classification is the full connection layer: for example, the two-dimensional feature values produced by the convolutions above are transformed into one-dimensional feature data so that the information can be reorganised globally. A softmax full connection is generally adopted as the last stage of a convolutional neural network; its key role is to normalise the output. The activation values obtained in this way are the speech features calculated by the convolutional neural network.

After the first layers of the convolutional neural network are calculated, if the output is labelled, the final softmax produces an N-dimensional vector. Each value in the vector is a probability, representing the occurrence probability of the corresponding label; the labels correspond to the probabilities one by one, and the probabilities of the N labels sum to 1. The softmax expression (3) for the last layer is as follows.

$$F(Z_j) = \frac{e^{z_j}}{\sum_{i=1}^{N} e^{z_i}}$$

As the formula shows, this operation improves the nonlinear expression ability of the whole convolution model. The formula is in fact a normalisation: the denominator is the sum over all N labels, and the numerator is the result for the j-th label. Clearly no label's result can exceed 1, and when one label's probability approaches 1, it must be greater than all the others, whose probabilities must then be infinitesimally close to 0. This is the normalisation process mentioned above.
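
A short sketch of the softmax normalisation in formula (3); the score values are illustrative, and the max-subtraction is a standard numerical-stability trick not mentioned in the text.

```python
import numpy as np

def softmax(z):
    """Formula (3): normalise N scores into probabilities summing to 1.
    Subtracting max(z) is a standard numerical-stability trick."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
p = softmax(scores)
print(p, p.sum())  # the largest score gets the largest probability; sum = 1.0
```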

Convolutional neural network optimisation algorithms

Neural network optimisation algorithms are used to accelerate the convergence of the network model or to optimise the loss function. To obtain the optimal parameters and minimise the value of the loss function, three optimisation methods are commonly used in neural networks: the stochastic gradient descent method, the momentum gradient descent method, and the Adam algorithm.

Stochastic gradient descent method

The stochastic gradient descent method is simple and effective. It randomly extracts one group from all the samples, updates the parameters according to the gradient after training on it, then extracts another group and updates again. When the sample size is large, it may not be necessary to train on all samples to obtain a network model within an acceptable range. Because only one group of samples is used per update, far fewer computations are required per step and network training is accelerated; however, the optimisation direction becomes uncertain, which can result in a local rather than global optimum and thus reduce the accuracy of the network model.

Momentum gradient descent method

The momentum gradient descent method draws on ideas from physics. Imagine rolling a ball into a frictionless bowl: the accumulated momentum does not stop the ball at the bottom of the bowl but pushes it onward; this is momentum. The descent of the stochastic gradient method above does not head straight for the optimum but swings back and forth, wasting a great deal of training time. In addition, stochastic gradient descent needs a learning rate set before training: too large a learning rate leads to excessive parameter updates and large gradient oscillations, while too small a learning rate increases the number of training iterations, lengthening the training time of the model and wasting resources. The momentum gradient descent method can use a larger learning rate while reducing the number of training iterations, so the network model converges faster and more stably, without the back-and-forth descent problem. The algorithm adds an impulse to the stochastic gradient descent process to accelerate the descent, and can be expressed by the following formulas:

$$V_{dw} = \theta V_{dw} + (1 - \theta)\,dw$$
$$V_{db} = \theta V_{db} + (1 - \theta)\,db$$
$$W = W - \varphi V_{dw}$$
$$b = b - \varphi V_{db}$$

where dw and db are the gradients produced in the current iteration, and V_dw and V_db are the gradient momentum accumulated over previous iterations. The exponentially weighted average of the gradients smooths the descent. θ (the momentum coefficient) and φ (the learning rate) are the hyperparameters introduced; a large number of experiments show that θ = 0.9 works well.
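
A minimal sketch of one momentum update following the formulas above; the function name and the default learning rate 0.01 are assumptions for illustration, while θ = 0.9 follows the text.

```python
import numpy as np

def momentum_step(w, b, dw, db, v_dw, v_db, theta=0.9, lr=0.01):
    """One momentum update: the exponentially weighted average of past
    gradients (v_dw, v_db) smooths the oscillation of plain SGD."""
    v_dw = theta * v_dw + (1 - theta) * dw
    v_db = theta * v_db + (1 - theta) * db
    w = w - lr * v_dw                    # W = W - phi * V_dw
    b = b - lr * v_db                    # b = b - phi * V_db
    return w, b, v_dw, v_db
```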

Adam algorithm

Adaptive moment estimation (Adam) combines RMSProp with stochastic gradient descent. The algorithm uses the squared gradients of RMSProp to scale the learning rate of the network model, and uses a moving average of the gradient rather than the gradient itself in the update. In this way a different learning rate can be set for each parameter: it is an adaptive learning rate method. Adam is named after its use of adaptive moments; it adjusts the learning rate in the network model by estimating the first-order and second-order moments of the gradient. It improves on the fluctuation problems of the previous algorithms, significantly improving the convergence speed and stability of the network model. The Adam algorithm can be expressed by the following formulas:

$$V_{dw} = \phi_1 V_{dw} + (1 - \phi_1)\,dw, \quad V_{db} = \phi_1 V_{db} + (1 - \phi_1)\,db$$
$$S_{dw} = \phi_2 S_{dw} + (1 - \phi_2)\,dw^2, \quad S_{db} = \phi_2 S_{db} + (1 - \phi_2)\,db^2$$

Here S_dw and S_db are the exponentially weighted averages of the squared gradients for the weights and biases, respectively. Because the moving-average values differ considerably from the values after iteration, they must be bias-corrected:

$$V_{dw}^c = V_{dw}/(1 - \phi_1^t), \quad V_{db}^c = V_{db}/(1 - \phi_1^t), \quad S_{dw}^c = S_{dw}/(1 - \phi_2^t), \quad S_{db}^c = S_{db}/(1 - \phi_2^t)$$

Combining the corrected weight and bias estimates into the update equations gives:

$$W = W - \varphi \frac{V_{dw}^c}{\sqrt{S_{dw}^c} + \omega}, \quad b = b - \varphi \frac{V_{db}^c}{\sqrt{S_{db}^c} + \omega}$$

φ1 in the above formulas corresponds to θ in the momentum gradient descent algorithm and likewise takes the value 0.9. The parameter φ2 plays a similar role and is generally 0.999. The parameter ω is a smoothing term, generally taken as 10⁻⁸. The learning rate φ needs to be fine-tuned during network model training.
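
A minimal sketch of one Adam update following the formulas above, shown for a single parameter tensor; the function name and the default learning rate are illustrative assumptions, while φ1 = 0.9, φ2 = 0.999, and ω = 10⁻⁸ follow the values stated in the text.

```python
import numpy as np

def adam_step(w, dw, v, s, t, phi1=0.9, phi2=0.999, lr=0.001, omega=1e-8):
    """One Adam update for a parameter tensor w: first moment v, second
    moment s, bias correction by (1 - phi^t), then a per-parameter step."""
    v = phi1 * v + (1 - phi1) * dw
    s = phi2 * s + (1 - phi2) * dw ** 2
    v_c = v / (1 - phi1 ** t)            # bias-corrected first moment
    s_c = s / (1 - phi2 ** t)            # bias-corrected second moment
    w = w - lr * v_c / (np.sqrt(s_c) + omega)
    return w, v, s
```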

Image quality evaluation standard

Image quality generally refers to people's visual perception of images. Image quality assessment (IQA) includes subjective assessment based on human perception (i.e. the visual realism of images) and objective assessment. Although subjective evaluation matches practical needs, it consumes a great deal of manpower and time and is subject to too many human factors to produce stable results. Objective computational evaluation is therefore the mainstream method. This paper introduces two common objective evaluation methods: peak signal-to-noise ratio (PSNR) and structural similarity (SSIM).

Peak signal-to-noise ratio

PSNR is the most commonly used objective method of evaluating image quality. It is usually used after image compression or image restoration to compare the processed image with the original, evaluating quality through the difference in pixel values between the two. Its formula is:

$$PSNR = 10 \cdot \log_{10}\left(\frac{k^2}{\frac{1}{N}\sum_{i=1}^{N}\left(I(i) - \hat I(i)\right)^2}\right)$$

Generally, pixels use an 8-bit representation, so k is 255. From the formula, PSNR depends only on the mean square error (MSE) between pixels, computed between the real image I with N pixels and the reconstructed image. PSNR focuses only on differences between pixel values, whereas people attend to the visual experience; nevertheless, it allows rational comparison with experimental results in the literature, so it remains the most widespread evaluation standard in the field of image super-resolution. For colour images, the PSNR values of the three RGB channels are generally calculated and then averaged, or the mean square errors of the three RGB channels are calculated and then averaged.
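
A minimal NumPy sketch of the PSNR formula above, assuming 8-bit images (k = 255); the function name and test data are illustrative choices.

```python
import numpy as np

def psnr(ref, rec, k=255.0):
    """PSNR in dB between a reference image and its reconstruction;
    k = 255 for 8-bit pixels. Higher is better; identical images give inf."""
    mse = np.mean((ref.astype(float) - rec.astype(float)) ** 2)
    return float('inf') if mse == 0 else 10 * np.log10(k ** 2 / mse)

ref = np.random.randint(0, 256, (32, 32))
noisy = np.clip(ref + np.random.randn(32, 32) * 5, 0, 255)
print(psnr(ref, noisy))  # roughly 34 dB for additive noise with std 5
```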

Structural similarity

Generally speaking, human vision extracts the structure of an image with ease. The structural similarity index (SSIM) therefore measures the structural similarity between images by comparing their brightness, contrast, and structure. For an image I with N pixels, let the brightness be γ_I and the contrast λ_I, corresponding respectively to the mean and the standard deviation of the image feature intensity:

$$\gamma_I = \frac{1}{N}\sum_{i=1}^{N} I(i)$$
$$\lambda_I = \left(\frac{1}{N-1}\sum_{i=1}^{N}\left(I(i) - \gamma_I\right)^2\right)^{1/2}$$

In these formulas, I(i) is the feature intensity of the i-th pixel of image I. The illumination and contrast comparisons C_l(I, Î) and C_c(I, Î) between image I and image Î can be expressed as:

$$C_l(I,\hat I) = \frac{2\gamma_I\gamma_{\hat I} + c_1}{\gamma_I^2 + \gamma_{\hat I}^2 + c_1}$$
$$C_c(I,\hat I) = \frac{2\lambda_I\lambda_{\hat I} + c_2}{\lambda_I^2 + \lambda_{\hat I}^2 + c_2}$$

Here c_1 = (k_1 m)² and c_2 = (k_2 m)² are constants that prevent the denominators from being zero, where k_1 and k_2 are two constants far less than 1, usually k_1 = 0.01 and k_2 = 0.03. The normalised pixel values of an image can be expressed as (I − γ_I)/λ_I. Structure is measured by the correlation between the normalised images, which is equivalent to the correlation coefficient between image I and image Î. With the covariance

$$\lambda_{I\hat I} = \frac{1}{N-1}\sum_{i=1}^{N}\left(I(i) - \gamma_I\right)\left(\hat I(i) - \gamma_{\hat I}\right)$$

the structure comparison function between them can be expressed as:

$$C_s(I,\hat I) = \frac{\lambda_{I\hat I} + c_3}{\lambda_I\lambda_{\hat I} + c_3}$$

Here λ_{IÎ} is the covariance between image I and image Î, and c_3 is likewise a constant that prevents a zero denominator. Finally, the SSIM expression follows from the formulas above:

$$SSIM(I,\hat I) = \left[C_l(I,\hat I)\right]^\alpha \left[C_c(I,\hat I)\right]^\beta \left[C_s(I,\hat I)\right]^\delta$$

Here α, β, and δ are control parameters weighting the three components. Setting these parameters to 1 (with c_3 = c_2/2) gives:

$$SSIM(I,\hat I) = \frac{\left(2\gamma_I\gamma_{\hat I} + c_1\right)\left(2\lambda_{I\hat I} + c_2\right)}{\left(\gamma_I^2 + \gamma_{\hat I}^2 + c_1\right)\left(\lambda_I^2 + \lambda_{\hat I}^2 + c_2\right)}$$

In practice, an m×m window is taken from the image, the computation is performed as the window slides, and the average of the results is taken as the global SSIM. This evaluation method expresses traditional brightness and contrast through structural information, using the mean as the brightness estimate, the covariance as the measure of structural similarity, and the standard deviation as the contrast estimate. SSIM evaluates the reconstruction quality of an image from the perspective of human observation; compared with PSNR, it better matches intuitive visual perception, and so this image evaluation method has also been widely used.
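
A simplified single-window sketch of the SSIM formula above with α = β = δ = 1, computed over global statistics rather than sliding m×m windows; the names and this simplification are assumptions for illustration.

```python
import numpy as np

def ssim_global(I, J, k1=0.01, k2=0.03, m=255.0):
    """Simplified single-window SSIM with alpha = beta = delta = 1.
    Production implementations slide an m x m window over the image and
    average the local SSIM values; here the statistics are global."""
    c1, c2 = (k1 * m) ** 2, (k2 * m) ** 2
    mu_i, mu_j = I.mean(), J.mean()
    var_i, var_j = I.var(ddof=1), J.var(ddof=1)           # contrast terms
    cov = ((I - mu_i) * (J - mu_j)).sum() / (I.size - 1)  # structure term
    return ((2 * mu_i * mu_j + c1) * (2 * cov + c2)) / \
           ((mu_i ** 2 + mu_j ** 2 + c1) * (var_i + var_j + c2))

img = np.random.rand(32, 32) * 255
print(ssim_global(img, img))  # identical images give SSIM = 1.0
```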

Target tracking algorithm based on convolutional neural network

As a deep learning method, convolutional neural networks have performed well in many fields, such as speech analysis and image recognition. In a convolutional neural network, an over-complete set of filters can be used to extract various potential features: to extract a particular feature reliably, enough filters must be used to cover all the situations in which that feature may appear, so that the features that actually need to be extracted are covered.

Convolutional neural networks provide a hierarchical model that learns features directly from raw image pixels. In the first stage, a preprocessed black-and-white image of size 32×32 is input to a convolutional layer consisting of six filters of size 5×5, and the resulting six feature maps of size 28×28 are input to a nonlinear transformation layer. For each pixel of all feature maps, the nonlinear transformation is:

$$\mathrm{relu}(x) = \max(0, x)$$

The resulting six feature maps are then input to a maximum downsampling layer, which takes the maximum value within each 2×2 spatial neighbourhood. Compared with the sigmoid and tanh functions, the ReLU function effectively mitigates the vanishing gradient problem. The maximum downsampling layer makes the convolutional neural network robust to small translations of the image; applied to target tracking, this makes the network more robust to tracking errors caused by inaccurate target locations.

The implementation of the second stage is similar to the first. The output of the first stage is convolved by a layer consisting of 12 filters of size 5×5; the feature maps are passed through the ReLU function, and each 2×2 region is max-downsampled. The downsampled feature maps are flattened into a vector, with each node regarded as one dimension of the high-level features. After the two stages of convolution, ReLU transformation, and maximum downsampling, the network can extract higher-level features.
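
A sketch of the two-stage feature extractor described above, written in PyTorch as an assumed implementation (the paper does not specify a framework); the layer sizes follow the text.

```python
import torch
import torch.nn as nn

# The two-stage extractor described above: 32x32 grayscale input ->
# 6 filters of 5x5 -> ReLU -> 2x2 max pool -> 12 filters of 5x5 ->
# ReLU -> 2x2 max pool -> flattened high-level feature vector.
features = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),   # 1x32x32 -> 6x28x28
    nn.ReLU(),
    nn.MaxPool2d(2),                  # 6x28x28 -> 6x14x14
    nn.Conv2d(6, 12, kernel_size=5),  # 6x14x14 -> 12x10x10
    nn.ReLU(),
    nn.MaxPool2d(2),                  # 12x10x10 -> 12x5x5
    nn.Flatten(),                     # -> 300-dimensional vector
)

x = torch.randn(1, 1, 32, 32)         # one preprocessed 32x32 patch
print(features(x).shape)              # torch.Size([1, 300])
```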

Target Tracking Algorithm Based on Particle Filter and Convolutional Neural Network

The particle filter implements recursive Bayesian filtering through sequential Monte Carlo importance sampling. Its main idea is to represent the posterior probability by a set of random particles. A particle filter consists of two main parts: (1) a dynamic model, which generates candidate samples from the previous particles; and (2) an observation model, which calculates the similarity between a candidate sample and the target appearance model. Given all the observations $y_{1:t} = [y_1, \ldots, y_t]$ up to time t, the goal of a particle-filter-based tracking system is to estimate the posterior density $p(x_t|y_{1:t})$ of the target state. Based on Bayesian theory, the posterior density can be derived recursively as:

$$p(x_t|y_{1:t}) \propto p(y_t|x_t)\int p(x_t|x_{t-1})\,p(x_{t-1}|y_{1:t-1})\,dx_{t-1}$$

In the formula, $p(x_t|x_{t-1})$ and $p(y_t|x_t)$ are the dynamic model and the observation model, respectively. In particle filtering, the integral is approximated by Monte Carlo sampling: the posterior density $p(x_t|y_{1:t})$ is approximated by a set of particles with corresponding weights $w_t^i$.

Finally, the optimal target state $x_t^*$ at time t is obtained by the maximum posterior probability:

$$x_t^* = \arg\max_{x_t^i} p(x_t^i|y_{1:t}) = \arg\max_{x_t^i} w_t^i$$

To improve computational efficiency, this algorithm tracks only the position and size of the target. Let $x_t = (p_t^x, p_t^y, w_t, h_t)$ denote the target state parameters: the horizontal coordinate, vertical coordinate, width, and height of the target. The dynamic model between two consecutive frames is assumed to obey a Gaussian distribution:

$$p(x_t|x_{t-1}) = \mathcal{N}(x_t; x_{t-1}, \Sigma)$$

where Σ is a diagonal covariance matrix whose diagonal elements are the variances of the individual state parameters. For a state $x_t$ there is a corresponding image patch of size 32×32. With $d_t$ obtained from the output layer of the CNN-based appearance model for that patch, the likelihood function can be calculated as:

$$p(y_t|x_t) = \exp(-d_t)$$
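
A minimal NumPy sketch of one particle-filter tracking step using the dynamic and observation models above; the `likelihood` callable stands in for the CNN appearance model that produces exp(−d_t), and all names are illustrative assumptions.

```python
import numpy as np

def particle_filter_step(particles, weights, sigma, likelihood):
    """One tracking step: resample particles by weight, diffuse them with
    the Gaussian dynamic model N(x_t; x_{t-1}, Sigma), re-weight with the
    appearance likelihood p(y_t | x_t) = exp(-d_t), and return the maximum
    a posteriori state. particles is (n, 4) holding (px, py, w, h)."""
    n = len(particles)
    idx = np.random.choice(n, size=n, p=weights)                # resample
    particles = particles[idx] + np.random.randn(n, 4) * sigma  # diffuse
    weights = np.array([likelihood(p) for p in particles])      # exp(-d_t)
    weights = weights / weights.sum()                           # normalise
    return particles, weights, particles[np.argmax(weights)]
```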

To capture appearance changes, the appearance model of the convolutional neural network is continuously updated, and the likelihood function must likewise adapt over time. The main disadvantage of appearance adaptation, however, is its sensitivity to drift, such as gradual adaptation to non-objects. To alleviate the drift problem, this algorithm uses both the true appearance of the target marked in the first frame and the image observations obtained online.

Tracking performance evaluation

Usually the centre point error is used to evaluate tracking performance. As a quantitative evaluation criterion, the centre point error is the Euclidean distance, in each frame of the video sequence, between the centre of the tracker's result and the centre of the ground-truth label:

$$err = \sqrt{(x_g - x_t)^2 + (y_g - y_t)^2}$$

where $(x_t, y_t)$ are the x- and y-coordinates of the centre of the tracker's result and $(x_g, y_g)$ are the x- and y-coordinates of the centre of the ground-truth label.

The average centre point error is the mean of the centre point error over the whole sequence: the smaller it is, the better the tracker performs on the sequence, and in the ideal case it is 0. Strictly speaking, however, the average centre point error is sometimes a poor measure of tracking performance. If an otherwise good tracker follows the target well most of the time but loses it at a few moments, its results at those moments are essentially random; the centre point errors there become very large and pull up the average, so judged by average centre point error alone, the algorithm would not be considered excellent. A better way to evaluate the tracker is therefore needed.
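
A minimal NumPy sketch of the centre point error and its sequence average, following the err formula above; the names and array layout are illustrative assumptions.

```python
import numpy as np

def centre_errors(pred, gt):
    """Per-frame Euclidean centre point error (the err formula above) and
    its average over the sequence; pred and gt are (num_frames, 2) arrays
    of (x, y) centre coordinates."""
    err = np.sqrt(((pred - gt) ** 2).sum(axis=1))
    return err, err.mean()
```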

Accuracy statistics measure the percentage of frames in a video sequence for which the error between the centre of the tracker's result and the centre of the ground-truth label lies within a certain threshold. The setting of this threshold is important; it is generally set to 20 pixels. A good tracking algorithm achieves a relatively high accuracy even when the threshold is set relatively low. If a tracker reaches the target's position most of the time and loses the target only at certain moments, those moments lower its accuracy only slightly, and the algorithm is still judged a good tracking method; accuracy therefore evaluates the performance of the tracker better than the average centre point error. The centre point error mainly evaluates the stability of the tracker, while the robustness of the tracker is mainly evaluated with the overlap rate method.
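
A matching sketch of the accuracy statistic at the conventional 20-pixel threshold; the function name is an illustrative choice.

```python
import numpy as np

def precision(errors, threshold=20.0):
    """Accuracy statistic: the fraction of frames whose centre point error
    falls within the threshold (conventionally 20 pixels)."""
    return float((np.asarray(errors) <= threshold).mean())
```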

Both the accuracy method and the success rate method are indispensable. For example, if a tracking method has difficulty handling changes in target scale, its average centre point error may be relatively small while its average overlap rate is very low. Similarly, if a tracking method does not track the target very stably but keeps oscillating around the target object, its average overlap rate may be relatively high while its average centre point error is very large. Therefore, only when accuracy and success rate are considered together can the pros and cons of a tracking algorithm be assessed objectively and comprehensively.

Conclusion

The target tracking algorithm based on the convolutional neural network achieves a good tracking effect. The network is pre-trained entirely on natural scenes unrelated to the tracking task, which shows that by learning and transferring the mid-level general features of the convolutional neural network, the tracking algorithm can effectively construct the target appearance model and, to a certain extent, alleviate the drift problem in the tracking process.

References

[1] Guo M, Yu Z, Xu Y, et al. ME-Net: A Deep Convolutional Neural Network for Extracting Mangrove Using Sentinel-2A Data[J]. Remote Sensing, 2021, 13(7): 1-24. https://doi.org/10.3390/rs13071292

[2] Ribeiro T, Mascarenhas S M, Afonso J, et al. P113 Automatic detection of colonic ulcers and erosions in colon capsule endoscopy images using a convolutional neural network[J]. Journal of Crohn's and Colitis, 2021(Supplement_1). https://doi.org/10.1093/ecco-jcc/jjab076.240

[3] Pardoe H R, Martin S P, Zhao Y, et al. Estimation of in-scanner head pose changes during structural MRI using a convolutional neural network trained on eye tracker video[J]. Magnetic Resonance Imaging, 2021. https://doi.org/10.1016/j.mri.2021.06.010

[4] Oyelade O N, Ezugwu A E. A novel wavelet decomposition and transformation convolutional neural network with data augmentation for breast cancer detection using digital mammogram[J]. Scientific Reports, 2022. https://doi.org/10.1038/s41598-022-09905-3

[5] Wang C, Zhou J, Xiao B, et al. Uncertainty Estimation for Stereo Matching Based on Evidential Deep Learning[J]. Pattern Recognition, 2021. https://doi.org/10.1016/j.patcog.2021.108498

[6] Miao J, Wang Z, Ning X, Nan X, Cai W, Liu R. Practical and Secure Multifactor Authentication Protocol for Autonomous Vehicles in 5G[J]. Software: Practice and Experience, 2022. https://doi.org/10.1002/SPE.3087

[7] Ning X, Gong K, Li W, Zhang L. JWSAA: Joint Weak Saliency and Attention Aware for person re-identification[J]. Neurocomputing, 2021, 453: 801-811. https://doi.org/10.1016/j.neucom.2020.05.106

[8] Yan C, Pang G, Bai X, et al. Beyond triplet loss: person re-identification with fine-grained difference-aware pairwise loss[J]. IEEE Transactions on Multimedia, 2021. https://doi.org/10.1109/TMM.2021.3069562

[9] Cai W, Zhai B, Liu Y, et al. Quadratic polynomial guided fuzzy C-means and dual attention mechanism for medical image segmentation[J]. Displays, 2021, 70: 102106. https://doi.org/10.1016/j.displa.2021.102106

[10] Ning X, Duan P, Li W, Zhang S. Real-time 3D face alignment using an encoder-decoder network with an efficient deconvolution layer[J]. IEEE Signal Processing Letters, 2020, 27: 1944-1948. https://doi.org/10.1109/LSP.2020.3032277

[11] Bai X, Zhou J, Ning X, et al. 3D data computation and visualization[J]. Displays, 2022: 102169. https://doi.org/10.1016/j.displa.2022.102169

[12] Qi S, Ning X, Yang G, et al. Review of Multi-view 3D Object Recognition Methods Based on Deep Learning[J]. Displays, 2021, 69: 102053. https://doi.org/10.1016/j.displa.2021.102053

[13] Cai W, Liu D, Ning X, et al. Voxel-based Three-view Hybrid Parallel Network for 3D Object Classification[J]. Displays, 2021, 69: 102076. https://doi.org/10.1016/j.displa.2021.102076

[14] Bai X, Wang X, Liu X, et al. Explainable Deep Learning for Efficient and Robust Pattern Recognition: A Survey of Recent Developments[J]. Pattern Recognition, 2021: 108102. https://doi.org/10.1016/j.patcog.2021.108102

[15] Wang M, Sun T, Song K, et al. An efficient sparse pruning method for human pose estimation[J]. Connection Science, 2021: 1-15. https://doi.org/10.1080/09540091.2021.2012423

[16] Yu Z, Li S, Sun L, Liu L, Wang H. Multi-distribution noise quantisation: an extreme compression scheme for transformer according to parameter distribution[J]. Connection Science, 2022. https://doi.org/10.1080/09540091.2021.2024510

[17] Shan W. Digital streaming media distribution and transmission process optimisation based on adaptive recurrent neural network[J]. Connection Science, 2022, 34(1): 1169-1180. https://doi.org/10.1080/09540091.2022.2052264

[18] Ying L, Qian Nan Z, Fu Ping W, et al. Adaptive weights learning in CNN feature fusion for crime scene investigation image classification[J]. Connection Science, 2021, 33(3): 719-734. https://doi.org/10.1080/09540091.2021.1875987

[19] Wang J, Wang R, Yang M, et al. Understanding zinc-doped hydroxyapatite structures using first-principles calculations and convolutional neural network algorithm[J]. Journal of Materials Chemistry B, 2022, 10. https://doi.org/10.1039/D1TB02687A

[20] Bilal E. Heart sounds classification using convolutional neural network with 1D-local binary pattern and 1D-local ternary pattern features[J]. Applied Acoustics, 2021, 180: 108152. https://doi.org/10.1016/j.apacoust.2021.108152

[21] Song T. Multi-Classification of Complex Microseismic Waveforms Using Convolutional Neural Network: A Case Study in Tunnel Engineering[J]. Sensors, 2021, 21. https://doi.org/10.3390/s21206762
