Accès libre

Robust single target tracking using determinantal point process observations

À propos de cet article

Citez

Visual tracking is a challenging computer vision task with applications in human-computer interaction, video surveillance and crowd monitoring among others. Modern visual tracking systems may use complex object detection schemes for estimating the current state of a target in any particular video frame. However, this approach does not fully exploit the temporal structure of the estimation problem. Visual tracking can be also thought of as a dynamic model with observed features and latent states representing the position/velocity of an object (Maggio and Cavallaro, 2011). In this context, the generative model for visual tracking requires not only the correct specification of the model and its parameters but also the ability to capture the variations of the system (Wang et al., 2015).

The Bernoulli filter is a powerful algorithm that allows objects to appear and disappear, using extracted features from the image as observations (Vo et al., 2010). Similar approaches for visual tracking have been proposed (also known in the literature as Track-Before-Detect). Nevertheless, these methods rely on unreliable background subtraction operations or the likelihood function being in a separable form (Hoseinnezhad et al., 2012, 2013).

Current state-of-the-art trackers are based on either correlation filters (Bolme et al., 2010), deformable parts models (Hare et al., 2016) or convolutional neural networks (Li et al., 2018). These trackers learn a discriminative model from a single frame and then update the model using new frames. Furthermore, tracking performance can be increased when using more discriminative features such as HOG (Henriques et al., 2015; Solis Montero et al., 2015; Xu et al., 2019). On the other hand, even when Bernoulli filters have demonstrated being useful models for tracking in complex scenarios, it is still hard to rely on such features for increasing their performance.

Related works

The Bernoulli filter is a specialized version of the PHD filter (Mahler, 2003), with the focus on single target tracking. While the original PHD filter is based on a Poisson point process, several extensions have been proposed to cope with non-Poisson distributions. In particular, the Cardinalized PHD filter allows estimating the number of targets using arbitrary distributions and provides improved estimates (Mahler, 2007). The multi-Bernoulli and Poisson multi-Bernoulli mixture filters also allow to approximate the cardinality distribution and become especially well suited when the mean of the multi-target posterior is higher than the variance (García-Fernández et al., 2018). All of these methods rely on first-order or second-order moments but assume that targets behave independently with each other. Therefore, the authors in Privault and Teoh (2019) propose a second-order filter that accounts for interaction between the targets. The method is based on determinantal point processes (DPP) that take into consideration the correlation among the targets through a kernel function. In Jorquera et al. (2017), the authors propose a determinantal point process for pruning the components of the Gaussian mixture PHD filter. More recently, the authors in Jorquera et al. (2019) compared the PHD filter using determinantal point process observations with other methods for visual multi-target tracking.

The contributions of this paper are twofold. First, the third section provides introductory notions of the Bernoulli filter and then we derive a novel Bernoulli filter using determinantal point process observations (B-DPP filter) for single target tracking in the fourth section. Second, in the fifth section we derive a Sequential Monte Carlo implementation of the B-DPP filter using a truncated likelihood, which can outperform other discriminative trackers in several scenarios.

Point processes for visual object tracking

A point process is a random pattern of points in a possibly multi-dimensional space (Kingman, 1993). A simple point process can be defined in one dimension, which is usually times and can be used to describe the random times where the events can occur with no coincident points.

Bernoulli point process

The problem of performing joint detection and estimation of multiple objects has a natural interpretation as a dynamic point process, where the stochastic intensity of the model is a space-time function λ(x), where x ∈ℝd denotes the state space of the target. If we let B = B1∪ B2 ∪ … ∪ Bk represent the union of disjoint video frames Bi, the corresponding number of objects on each image can be written as N(B1), N(B2), …, N(Bk). The Bernoulli point process for a single object that can randomly appear or disappear takes the form: p ( N ( B 1 ) = n 1 , , N ( B k ) = n k ) = n ! n 1 ! n k ! i k ( λ ( x i ) Λ ( B ) ) n i , = n ! n 1 ! n k ! i k p ( x i ) n i , where ni can take either 1 or 0, n = ∑ni and Λ(B) = ∫B λ(x)dx. Every subset Bi can take at most one target x with probability q, therefore we can characterize the distribution of the point process X = {x} using the following relationship: p ( X ) = { 1 q i f X = q p ( x ) i f X = { x } . (1)

Determinantal point process

In recent years, deep learning approaches have demonstrated outstanding performance in several visual tracking benchmarks (Kristan et al., 2019). These trackers are mostly based on extracted features from a convolutional neural network and an objective loss that minimizes a localization error (Li et al., 2018). However, the detection process is not perfect and false positives and negatives are to be encountered after ranking the top proposals from the convolutional features.

In order to develop an stochastic approach for the single-object observation model, a discrete DPP can be used to capture probabilistic relationships using a kernel matrix K : Ƶ × Ƶ ↦ ℝ that measures the similarity among different detections (Lee et al., 2016). Therefore, instead of considering independent detections in a particular frame, the DPP likelihood specifies the joint probability over all 2n subsets of Ƶ with distribution: p ( Z Ƶ ) = det ( K Z ) , Z Ƶ , p(Z  ⊂  Z _  ) = det (Kz),  ∀Z  ⊂  Z _  (2)where Z is a random subset of Ƶ and KZ  ≡  [Ki,j] for all i,j ∈ Ƶ. Furthermore, the product density can also be written in terms of a positive definite matrix L  =  K(I  −  K)−1, such that the probability mass function of Z can be written as: p ( Z ) = det ( L Z ) det ( I + L ) , (3)where I is the identity matrix and LZ is a sub-matrix of L indexed by the elements of Z.

Bernoulli filter

In this case, a model for detection and estimation of multiple objects can be achieved by the conditional expectation of the posterior point process (random finite set) under transformations (Ristic et al., 2013).

Let Xk = {x} be a Bernoulli point process and Zk = {z1, z2, …, zm} a DPP observed from frame K. The result from superposition, translation and thinning transformations is also a Bernoulli point process Xk ∼ p(Xk|Xk−1) (Kingman, 1993). The predicted point process can be written as the linear superposition of a πs thinned point process with Markov translation f(x|x′) and a πb Bernoulli birth process. The predicted expected number of targets Nk|k−1 for a single target with probability of survival πs(x) and spontaneous birth can be written as: N k | k 1 = N k | k 1 s + N k | k 1 b , (4)where: N k | k 1 s = π s f ( x | x ) p k 1| k 1 ( { x } ) d x , N k | k 1 b = π b p b ( x | ) p k 1| k 1 ( ) d x .

The filtering density of a Bernoulli point process is completely specified by the pair (pk|k−1, qk|k−1), which is obtained by: p k | k 1 ( X ) = { 1 q k 1| k 1 i f X = q k 1| k 1 i f | X | =1 . (5)

Using Equation (5), the probability of existence qk|k−1 can be written as: q k | k 1 = π b ( 1 q k 1| k 1 ) + π s q k 1| k 1 . (6)

And the probability of the predicted Bernoulli point process: q k | k 1 p k | k 1 ( { x } ) = π b ( 1 q k 1 | k 1 ) p b ( x ) + π s q k 1 | k 1 f ( x | x ) p k 1 | k 1 ( { x } ) d x a . (7)

If we let Zk be the observations that contain both false detections and target originated measurements, the update equation considers the probability of observing the target with probability of detection πd under clutter (e.g. false positives). From (Mahler, 2003, 2007), the multi-target likelihood function for the standard measurement model (Poisson distributed clutter with density κp(Zk) = eλi λfc(zi) and Bernoulli probability of detection πd) can be written as: p ( Z k | X k ) = κ p ( Z k ) ( 1 π d ) | X k | σ i π d p ( z σ i | x i ) ( 1 π d ) λ f c ( z σ i ) . (8)

The likelihood term in Equation (8) considers all possible locations and location-to-track associations σ, so most of the terms will be canceled. The likelihood term becomes: p ( Z k | { x } ) = κ p ( Z k ) ( 1 π d ) + π d z Z k i p ( z i | x ) λ f c ( z i ) . (9)

The Bayes update equation takes the form: p ( X k | Z k ) = p ( Z k | X k ) p ( X k | Z 1 : k 1 ) p ( Z k | Z 1 : k 1 ) . (10)

The denominator of Equation (10) can be written as: p ( Z k | Z 1: k 1 ) = f c ( Z k ) { 1 q k | k 1 + q k | k 1 ( 1 π d ) M k + Z Z k ψ k i p ( z i | x ) p k | k 1 ( x ) d x j f c ( z j ) } , (11)where: ψ k = M k ! ( M k | Z | ) ! π d | Z | ( 1 π d ) | Z | M k .

The updated binomial point process can be derived as follows: q k | k = 1 Δ k 1 q k | k 1 Δ k q k | k 1 , (12)where: Δ k = 1 ( 1 π d ) M k z Z k ψ k i p ( z i | x ) p k | k 1 ( x ) d x j λ f c ( z j ) , (13)and p k | k ( x ) = ( 1 π d ) M k + z Z k ψ k i p ( z i | x ) p k | k 1 ( x ) d x j f c ( z j ) 1 Δ k . (14)

Determinantal filter

Let Xk = {x} be a Bernoulli point process and Zk = {z1, z2,…, zm} a DPP observed at frame K. The result from superposition, translation and thinning transformations is also a Bernoulli point process Xk ∼ p(Xk|Xk−1) (Ristic et al., 2013). The predicted point process can be written as the linear superposition of a πs thinned point process with Markov translation f(x|x′) and a πb Bernoulli birth process. In order to measure the quality of the observations, we must introduce a random variable L such that p(L|Z)  ∝  det(L(Z)), where L(Z) is a positive definite kernel matrix that depends on the observed features Z. The L(Z) kernel can be written as a Gram matrix: L i j ( Z ) = g x ( z i ) ϕ ( z i ) T ϕ ( z j ) g x ( z j ) , (15) = g x ( z i ) S i j ( Z ) g x ( z j ) . (16)

The function gx(zi) = ∑c p(zi|c)p(c|x) is used to model the quality of the item zi and S(Z) the diversity of the set Z. If we let W be a subset of detections arising from the target (Reuter et al., 2013): η ( W | { x } ) { ( 1 π d ) if W = π d g x ( w 1 ) det ( S w 1 ) if | W | =1 | W | ! π d m i g x 2 ( w i ) det ( S W ) if | W | = m . (17)

The DPP Z can be treated as the union of two independent sets Z = C ∪ W, where C = {c1, …, cm} represents clutter. The clutter density becomes: κ d ( C ) = | C | ! i f c 2 ( c i ) det ( S C ) . (18)

The likelihood function for the standard measurement model using determinantal observations becomes: P ( Z | { x } ) = W Z η ( W | { x } ) κ d ( Z \ W ) , (19) p ( Z k | { x } ) = κ d ( Z k ) [ (1 π d ) M k + Z Z k | Z | ! ( M k | Z | ) ! M k ! π d | Z | ( 1 π d ) M k | Z | i [ g x ( z i ) f c ( z i ) ] 2 det ( S Z ) det ( S Z k \ Z ) det ( S Z k ) ] . (20)

Now, we want to derive the posterior distribution for Bernoulli point process given DPP observations: p ( Z k ) = κ d ( Z k ) [ ( 1 q k | k 1 ) + q k | k 1 ( 1 π d ) M k + Z Z k Ξ k i g x 2 ( z i ) p k | k 1 ( x ) d x j f c 2 ( z j ) ] , (21)where: Ξ k = | Z | ! ( M k | Z | ) ! M k ! π d | Z | ( 1 π d ) | Z | M k det ( S Z ) det ( S Z k \ Z ) det ( S Z k ) .

The updated binomial point process can now be derived as follows: q k | k = 1 Δ ˘ k 1 q k | k 1 Δ ˘ k q k | k 1 , (22)where: Δ ˘ k = 1 ( 1 π d ) M k z Z k Ξ k i p ( z i | x ) p k | k 1 ( x ) d x j f c ( z j ) , (23)and: p k | k ( x ) = ( 1 π d ) M k + z Z k Ξ k i g x ( z i ) p k | k 1 ( x ) d x j f c ( z j ) 1 Δ ˘ k . (24)

Approximated Bernoulli determinantal filter

In practice, it is difficult to store and compute the power set with all possible configurations of Zk in the likelihood term (see Equation (20)). An approximation can be constructed by truncating the likelihood and focusing only on the more likely elements. Let Zk* = arg maxZ⊂ Zk η(Z|{x}) be a subset of Zk whose elements are detections arising from the target. The likelihood becomes: p ( Z k * | { x } ) = | Z k * | ! i g x 2 ( z i ) det ( Z k * ) . (25)

DPPs have been proposed in the literature as an alternative to other object refinement techniques such as non-maximum suppression (Lee et al., 2016). These methods operate over object proposals and eliminate redundant detections. For DPPs, mode finding can be tackled using the following greedy algorithm (Kulesza and Taskar, 2011):

Conversely, by using the truncated likelihood from Equation (25), the Sequential Monte Carlo algorithm for the Bernoulli filter can be used to estimate the single-target posterior (Ristic, 2013).

Experimental results

In order to demonstrate the advantages of the proposed model updating approach over other discriminative approaches, we evaluate the tracking results on six challenging video sequences from the Visual Object Challenge 2014 (VOT) data set

http://www.votchallenge.net/vot2014/

. The proposed SMC implementation uses local binary patterns (LBP) as observed features and a simple observation model p ( z i | c ) e x p ( D k 2 2 σ o 2 ) , with Dk = dist[zc,zk] and zc being a reference LBP histogram (Czyz et al., 2007). The state xk is configured as a 4-dimensional rectangle including the left-most position, width and height of the target. The dynamic model uses a random walk and the parameters of the model are held fixed for all sequences. The B-DPP filter is implemented in the C++ language using the OpenCV library. The parameters for the B-DPP filter are determined empirically and shown in Table 1.

Particle Bernoulli-DPP filter.

Particle Bernoulli-DPP filter
Number of particles N 100
Uniform birth probability (πb) 0.1
Uniform survival probability (πs) 0.99
Newborn particles (Nb) 0
Standard deviation for observation model (σo) 20.4
Covariance matrix for dynamic model (σx × 1) 3.0 × 1

The parameter setting the Greedy Mode Finding algorithm is described in Table 2.

Greedy mode finding.

Greedy mode finding
Acceptance ratio ε 0.7

The sequence jogging is a challenging example containing full occlusions, rotations and background clutter. Figure 1 shows one frame of the sequence and the estimates using the proposed approach and other state-of-the-art methods.

Figure 1:

Frame 85 of the jogging sequence. At each frame, a greedy mode finding step is performed using Algorithm 1. Rectangles represent ground-truth, state estimates and DPP observations.

The Bernoulli DPP filter maintains a balance between the observed features and the quality of the observations (see Figure 1). The observation model uses a simple histogram comparison and no template update is performed, so the model is not robust to object deformation or rotation. Even that, as seen in Figure 1 the Bernoulli-DPP tracker achieves good performance in cases such as full occlusion where the other discriminative tracking methods fail. Performance is measured using widely used precision and success metrics

http://cvlab.hanyang.ac.kr/tracker_benchmark/benchmark_v10.html

.

The precision metric describes the percentage of frames whose center location error is below a given threshold. Table 3 shows the overall precision metric averaged over all sequences on five different runs for each one of the algorithms.

Average precision (th = 20).

Sequence DPP KCF sKCF Struck
Ball 0.309 0.289 0.246 0.372
Bolt 0.083 0.017 0.017 0.026
Diving 0.073 0.082 0.087 0.091
Gymnastics 0.710 0.425 0.425 0.435
Jogging 0.707 0.231 0.231 0.228
Polarbear 0.946 0.857 0.916 0.844

th = threshold.

The success measure accounts for bounding box overlap. Table 4 shows the number of success frames whose overlap is above some threshold, averaged over the sequences on five different runs. Quantitative analysis shows improved performance for the proposed approach when compared to the discriminative trackers in six different video sequences.

Average success (th = 0.5).

Sequence DPP KCF sKCF Struck
Ball 0.206 0.211 0.201 0.128
Bolt 0.031 0.011 0.011 0.017
Diving 0.183 0.110 0.114 0.151
Gymnastics 0.560 0.415 0.420 0.425
Jogging 0.205 0.225 0.225 0.225
Polarbear 0.749 0.747 0.760 0.712

th = threshold.

Figure 2 shows the precision metric against the location error threshold for all of the six tested sequences. The red line indicates the best performing method among the four different algorithms. Since the bolt and jogging sequences have background clutters (the background near the target has similar appearance as the target), the proposed Bernoulli DPP tracker reduces redundant observations and improves precision.

Figure 2:

Overall precision plots for the visual tracking sequences.

Figure 3 shows the ratio of the frames whose tracked box has more overlap with the ground-truth box than a threshold. The success metric can be associated with the tracker algorithm ability to maintain long-term tracks. Since the Bernoulli DPP filter accounts for missed detections, the proposed approach improves the area under the curve of the success metric in 67% of the tested sequences.

Figure 3:

Overall success plots for the visual tracking sequences.

Conclusions

In this paper, a novel algorithm for joint detection and tracking a single object in video has been presented. The proposed approach takes into account the detection score and the similarity of the observed features. Then, a Bayesian filter using a Bernoulli point process estimates the state of the target from a diverse subset of object proposals. Experimental evaluations show that the results are comparable to other state-of-the-art techniques for visual tracking in only 6 of the 25 sequences of the data set. In this paper, we only considered a simple observation model (distance to a reference LBP histogram), which might hinder the performance of this approach in the overall data set. This observation model is not robust to scale and rotation changes and no model updating strategies are considered in this paper. Nevertheless, our model is expected to increase its performance when using a more complex observation model (such as deep learning features), model updating and ensemble post-processing techniques for combining the output from different tracking schemes.

eISSN:
1178-5608
Langue:
Anglais
Périodicité:
Volume Open
Sujets de la revue:
Engineering, Introductions and Overviews, other