Visual tracking is a challenging computer vision task with applications in human-computer interaction, video surveillance, and crowd monitoring, among others. Modern visual tracking systems may use complex object detection schemes to estimate the current state of a target in any particular video frame. However, this approach does not fully exploit the temporal structure of the estimation problem. Visual tracking can also be thought of as a dynamic model with observed features and latent states representing the position and velocity of an object (Maggio and Cavallaro, 2011). In this context, the generative model for visual tracking requires not only the correct specification of the model and its parameters but also the ability to capture the variations of the system (Wang et al., 2015).
The Bernoulli filter is a powerful algorithm that allows objects to appear and disappear, using features extracted from the image as observations (Vo et al., 2010). Similar approaches for visual tracking have been proposed (also known in the literature as Track-Before-Detect). Nevertheless, these methods rely on unreliable background subtraction operations or on the likelihood function being available in a particular closed form.
Current state-of-the-art trackers are based on either correlation filters (Bolme et al., 2010), deformable parts models (Hare et al., 2016) or convolutional neural networks (Li et al., 2018). These trackers learn a discriminative model from a single frame and then update the model using new frames. Furthermore, tracking performance can be increased by using more discriminative features such as HOG (Henriques et al., 2015; Solis Montero et al., 2015; Xu et al., 2019). On the other hand, even though Bernoulli filters have proven useful for tracking in complex scenarios, it is still hard to rely on such features to increase their performance.
The Bernoulli filter is a specialized version of the PHD filter (Mahler, 2003), with a focus on single-target tracking. While the original PHD filter is based on a Poisson point process, several extensions have been proposed to cope with non-Poisson distributions. In particular, the Cardinalized PHD filter allows estimating the number of targets using arbitrary distributions and provides improved estimates (Mahler, 2007). The multi-Bernoulli and Poisson multi-Bernoulli mixture filters also approximate the cardinality distribution and are especially well suited when the mean of the multi-target posterior is higher than the variance (García-Fernández et al., 2018). All of these methods rely on first-order or second-order moments but assume that targets behave independently of each other. Therefore, the authors in Privault and Teoh (2019) propose a second-order filter that accounts for interaction between the targets. The method is based on determinantal point processes (DPP), which take into consideration the correlation among the targets through a kernel function. In Jorquera et al. (2017), the authors propose a determinantal point process for pruning the components of the Gaussian mixture PHD filter. More recently, the authors in Jorquera et al. (2019) compared the PHD filter using determinantal point process observations with other methods for visual multi-target tracking.
The contributions of this paper are twofold. First, the third section provides introductory notions of the Bernoulli filter, and the fourth section derives a novel Bernoulli filter using determinantal point process observations (B-DPP filter) for single-target tracking. Second, the fifth section derives a Sequential Monte Carlo implementation of the B-DPP filter using a truncated likelihood, which can outperform other discriminative trackers in several scenarios.
The problem of performing joint detection and estimation of multiple objects has a natural interpretation as a point process (random finite set) estimation problem.
In recent years, deep learning approaches have demonstrated outstanding performance in several visual tracking benchmarks (Kristan et al., 2019). These trackers are mostly based on features extracted from a convolutional neural network and an objective loss that minimizes a localization error (Li et al., 2018). However, the detection process is not perfect, and false positives and false negatives are encountered after ranking the top proposals from the convolutional features.
In order to develop a stochastic approach for the single-object observation model, a discrete DPP can be used to capture probabilistic relationships using a kernel matrix.
In this case, a model for detection and estimation of multiple objects can be achieved by the conditional expectation of the posterior point process (random finite set) under transformations (Ristic et al., 2013).
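As an illustration of how a discrete DPP scores subsets of candidate detections through a kernel matrix, the sketch below evaluates the standard L-ensemble probability P(Y) = det(L_Y) / det(L + I). The quality-times-similarity kernel and the three candidate detections are toy assumptions for illustration, not the paper's model:

```python
import numpy as np

def dpp_log_prob(L, subset):
    """Log-probability of a subset under a discrete L-ensemble DPP:
    log P(Y) = log det(L_Y) - log det(L + I)."""
    n = L.shape[0]
    num = 0.0  # det of the empty submatrix is 1 by convention
    if subset:
        num = np.linalg.slogdet(L[np.ix_(subset, subset)])[1]
    den = np.linalg.slogdet(L + np.eye(n))[1]
    return num - den

# Toy kernel for three candidate detections: detection score times
# feature similarity (both hypothetical values).
q = np.array([1.0, 0.8, 0.6])
S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
L = np.outer(q, q) * S

# A diverse pair (items 0 and 2) is more probable than a redundant one (0 and 1).
assert dpp_log_prob(L, [0, 2]) > dpp_log_prob(L, [0, 1])
```

This repulsive behavior is what lets a DPP-based observation model prefer non-overlapping, high-scoring proposals over redundant detections.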
Let
The filtering density of a Bernoulli point process is completely specified by the pair (
Using Equation (5), the probability of existence
The probability of the predicted Bernoulli point process is:
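In the standard Bernoulli filter, the predicted probability of existence combines the birth and survival events (Ristic, 2013); a minimal sketch, using as defaults the uniform birth and survival probabilities listed in Table 1:

```python
def predict_existence(q_prev, p_birth=0.1, p_survival=0.99):
    """Bernoulli-filter prediction of the probability of existence:
    q_pred = p_birth * (1 - q_prev) + p_survival * q_prev.
    Default values follow Table 1 of this paper."""
    return p_birth * (1.0 - q_prev) + p_survival * q_prev

# A target that surely did not exist appears with the birth probability,
# and one that surely existed persists with the survival probability.
assert predict_existence(0.0) == 0.1
assert predict_existence(1.0) == 0.99
```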
If we let
The likelihood term in Equation (8) considers all possible locations and location-to-track associations
The Bayes update equation takes the form:
The denominator of Equation (10) can be written as:
The updated binomial point process can be derived as follows:
Let
The function
The DPP
The likelihood function for the standard measurement model using determinantal observations becomes:
Now, we want to derive the posterior distribution for Bernoulli point process given DPP observations:
The updated binomial point process can now be derived as follows:
In practice, it is difficult to store and compute the power set with all possible configurations of the observation set.
DPPs have been proposed in the literature as an alternative to other object refinement techniques such as non-maximum suppression (Lee et al., 2016). These methods operate over object proposals and eliminate redundant detections. For DPPs, mode finding can be tackled using the following greedy algorithm (Kulesza and Taskar, 2011):
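A minimal sketch of such a greedy procedure is given below: items are added one at a time while the multiplicative gain in det(L_Y) stays above an acceptance ratio. The stopping rule shown here is an assumption that mirrors the acceptance-ratio parameter of Table 2; the kernel is the same toy quality-times-similarity construction used for illustration:

```python
import numpy as np

def greedy_dpp_mode(L, accept_ratio=0.7):
    """Greedy mode finding for a discrete DPP: repeatedly add the candidate
    that yields the largest det(L_Y), stopping once the best multiplicative
    gain drops below `accept_ratio` (this stopping rule is an illustrative
    assumption mirroring Table 2)."""
    selected, current_det = [], 1.0  # det of the empty submatrix is 1
    remaining = set(range(L.shape[0]))
    while remaining:
        gains = {i: np.linalg.det(L[np.ix_(selected + [i], selected + [i])])
                 for i in remaining}
        best = max(gains, key=gains.get)
        if gains[best] <= 0 or (selected and gains[best] < accept_ratio * current_det):
            break
        selected.append(best)
        current_det = gains[best]
        remaining.remove(best)
    return selected

# Toy quality-times-similarity kernel for three hypothetical proposals.
q = np.array([1.0, 0.8, 0.6])
S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
L = np.outer(q, q) * S
assert greedy_dpp_mode(L) == [0]                       # strict ratio keeps one item
assert greedy_dpp_mode(L, accept_ratio=0.3) == [0, 2]  # looser ratio adds a diverse item
```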
Conversely, by using the truncated likelihood from Equation (25), the Sequential Monte Carlo algorithm for the Bernoulli filter can be used to estimate the single-target posterior (Ristic, 2013).
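One SMC update step of the existence probability can be sketched in the standard Bernoulli-filter form (Ristic, 2013). The function names and the scalar clutter likelihood below are illustrative assumptions, not the paper's notation:

```python
import numpy as np

def smc_bernoulli_update(q_pred, particles, weights, likelihood, clutter_lik=1.0):
    """One SMC update of the Bernoulli existence probability, in the standard
    form q = q_pred * I1 / ((1 - q_pred) * g(Z | no target) + q_pred * I1),
    where I1 is the particle estimate of the expected measurement likelihood
    (Ristic, 2013). `likelihood(x)` would evaluate the (possibly truncated)
    measurement likelihood at particle state x."""
    lik = np.array([likelihood(x) for x in particles])
    I1 = float(np.sum(weights * lik))
    q_post = q_pred * I1 / ((1.0 - q_pred) * clutter_lik + q_pred * I1)
    w_post = weights * lik
    w_post = w_post / w_post.sum()  # reweight and normalize the particles
    return q_post, w_post

# Uninformative measurements (same likelihood as clutter) leave q unchanged.
parts = np.zeros(4)
w = np.full(4, 0.25)
q_same, _ = smc_bernoulli_update(0.5, parts, w, lambda x: 1.0)
assert abs(q_same - 0.5) < 1e-12
# Measurements twice as likely under the target hypothesis raise q.
q_up, _ = smc_bernoulli_update(0.5, parts, w, lambda x: 2.0)
assert q_up > 0.5
```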
In order to demonstrate the advantages of the proposed model updating approach over other discriminative approaches, we evaluate the tracking results on six challenging video sequences from the Visual Object Tracking 2014 (VOT2014) challenge data set.
Particle Bernoulli-DPP filter.

| Parameter | Value |
|---|---|
| Number of particles | 100 |
| Uniform birth probability | 0.1 |
| Uniform survival probability | 0.99 |
| Newborn particles | 0 |
| Standard deviation for observation model | 20.4 |
| Covariance matrix for dynamic model | 3.0 × I |
The parameter setting for the Greedy Mode Finding algorithm is described in Table 2.
Greedy mode finding.

| Parameter | Value |
|---|---|
| Acceptance ratio | 0.7 |
The sequence
Frame 85 of the
The Bernoulli-DPP filter maintains a balance between the observed features and the quality of the observations (see Figure 1). The observation model uses a simple histogram comparison and no template update is performed, so the model is not robust to object deformation or rotation. Even so, as seen in Figure 1, the Bernoulli-DPP tracker achieves good performance in cases such as full occlusion, where the other discriminative tracking methods fail. Performance is measured using the widely used precision and success metrics.
The precision metric describes the percentage of frames whose center location error is below a given threshold. Table 3 shows the overall precision metric averaged over all sequences on five different runs for each one of the algorithms.
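The precision metric itself is straightforward to compute; a minimal sketch with a hypothetical list of per-frame center errors:

```python
import numpy as np

def precision_at(center_errors, threshold=20.0):
    """Fraction of frames whose center-location error (Euclidean distance
    between predicted and ground-truth box centers) is below the threshold;
    20 px is the commonly used default."""
    return float(np.mean(np.asarray(center_errors, dtype=float) < threshold))

# Two of four hypothetical frames fall under the 20 px threshold.
assert precision_at([5.0, 12.0, 30.0, 50.0]) == 0.5
```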
Average precision.

| Sequence | DPP | KCF | sKCF | Struck |
|---|---|---|---|---|
| Ball | 0.309 | 0.289 | 0.246 | |
| Bolt | | 0.017 | 0.017 | 0.026 |
| Diving | 0.073 | 0.082 | 0.087 | |
| Gymnastics | | 0.425 | 0.425 | 0.435 |
| Jogging | | 0.231 | 0.231 | 0.228 |
| Polarbear | | 0.857 | 0.916 | 0.844 |

th = threshold.
The success measure accounts for bounding box overlap. Table 4 shows the fraction of successful frames, i.e., frames whose overlap exceeds a given threshold, averaged over five different runs for each sequence. Quantitative analysis shows improved performance for the proposed approach when compared to the discriminative trackers in six different video sequences.
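A minimal sketch of the overlap-based success computation, with boxes given in the common (x, y, w, h) convention:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    ix = max(0.0, min(xa + wa, xb + wb) - max(xa, xb))
    iy = max(0.0, min(ya + ha, yb + hb) - max(ya, yb))
    inter = ix * iy
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0

def success_rate(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of frames whose overlap with the ground truth exceeds the threshold."""
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(o > threshold for o in overlaps) / len(overlaps)

assert iou((0, 0, 2, 2), (0, 0, 2, 2)) == 1.0           # identical boxes
assert abs(iou((0, 0, 2, 2), (1, 0, 2, 2)) - 1 / 3) < 1e-9  # half-shifted boxes
```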
Average success.

| Sequence | DPP | KCF | sKCF | Struck |
|---|---|---|---|---|
| Ball | 0.206 | | 0.201 | 0.128 |
| Bolt | | | 0.011 | 0.017 |
| Diving | | | 0.114 | 0.151 |
| Gymnastics | | 0.415 | 0.420 | 0.425 |
| Jogging | 0.205 | | | |
| Polarbear | 0.749 | 0.747 | | 0.712 |

th = threshold.
Figure 2 shows the precision metric against the location error threshold for all six tested sequences. The red line indicates the best-performing method among the four algorithms. Since the
Overall precision plots for the visual tracking sequences.
Figure 3 shows the ratio of frames whose tracked box overlaps the ground-truth box by more than a threshold. The success metric can be associated with a tracking algorithm's ability to maintain long-term tracks. Since the Bernoulli-DPP filter accounts for missed detections, the proposed approach improves the area under the curve of the success metric in 67% (four of six) of the tested sequences.
Overall success plots for the visual tracking sequences.
In this paper, a novel algorithm for the joint detection and tracking of a single object in video has been presented. The proposed approach takes into account both the detection score and the similarity of the observed features. A Bayesian filter using a Bernoulli point process then estimates the state of the target from a diverse subset of object proposals. Experimental evaluations show that the results are comparable to other state-of-the-art techniques for visual tracking in only 6 of the 25 sequences of the data set. We only considered a simple observation model (distance to a reference LBP histogram), which might hinder the performance of this approach on the overall data set. This observation model is not robust to scale and rotation changes, and no model-updating strategies are considered in this paper. Nevertheless, the model is expected to improve when using a more complex observation model (such as deep learning features), model updating, and ensemble post-processing techniques for combining the output of different tracking schemes.