PaperReading 2: The Components of an Object Tracking System

2019-02-28  我好菜啊_

The following is excerpted from:
[Understanding and Diagnosing Visual Tracking Systems]
Authors: Naiyan Wang, Jianping Shi, Dit-Yan Yeung, Jiaya Jia


(This paper focuses on the most general type of visual tracking problem: short-term, single-object, model-free tracking.)


generative tracker VS discriminative tracker
Trackers are usually discriminative or hybrid,
mainly because purely generative trackers cannot handle complicated backgrounds well, which makes it easy for them to drift away from the target.


Benchmark
sample videos (VOT, VTB, PTB)
The key difference with the benchmark above lies in the evaluation metric.

accuracy: the overlap rate between the prediction and the ground truth while the tracker has not drifted away
robustness: the frequency of tracking failures, which happen when the overlap rate is zero. (Whenever such a failure occurs, the tracker is reset to the correct bounding box to continue tracking.)

Two metrics for evaluation:
1. AUC (the area under the curve) of the overlap rate (between the ground-truth and predicted bounding boxes)
2. the precision at threshold 20 on the central pixel error curve.
The second metric is useful in cases where the scale of the object changes but the tracker does not support scale variation.
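
The precision metric can be sketched as below. This is a minimal illustration, assuming boxes are stored as (x, y, w, h) and that the threshold of 20 is measured in pixels; the function names `center_error` and `precision_at` are made up for this note and are not from the paper.

```python
import numpy as np

def center_error(pred, gt):
    """Euclidean distance between box centers; each row is (x, y, w, h)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    pred_c = pred[:, :2] + pred[:, 2:] / 2.0
    gt_c = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pred_c - gt_c, axis=1)

def precision_at(pred, gt, threshold=20.0):
    """Fraction of frames whose center error is at most `threshold` pixels."""
    return float(np.mean(center_error(pred, gt) <= threshold))

# Toy data: in the third frame the target has doubled in size, but the
# tracker keeps a fixed 40x40 box centered on the same point.
gt = [[10, 10, 40, 40], [12, 10, 40, 40], [14, 10, 80, 80]]
pred = [[12, 11, 40, 40], [50, 10, 40, 40], [34, 30, 40, 40]]

errors = center_error(pred, gt)
precision = precision_at(pred, gt)
```

In the third frame the center error is zero even though the box size is wrong, which is why this metric stays informative for trackers that cannot change scale.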


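The five parts of a tracker: motion model, feature extractor, observation model, model updater, and ensemble post-processor.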


A tracking system
First, the observation model is initialized with the given bounding box of the target in the first frame.
In each of the following frames, the motion model first generates candidate regions or proposals for testing based on the estimation from the previous frame.
The candidate regions or proposals are fed into the observation model to compute their probability of being the target.
The one with the highest probability is then selected as the estimation result of the current frame.
Based on the output of the observation model, the model updater decides whether the observation model needs any update and, if needed, the update frequency.
Finally, if there are multiple trackers, the bounding boxes returned by the trackers will be combined by the ensemble post-processor to obtain a more accurate estimate.
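
The loop described above can be written as a small skeleton. This is only a sketch for a single tracker (so no ensemble step); the callables `propose`, `score`, `should_update` and `update` are hypothetical placeholders for the motion model, the feature extractor plus observation model, and the model updater.

```python
def track(frames, init_box, propose, score, should_update, update):
    """Run the propose -> score -> select -> (maybe) update loop."""
    update(frames[0], init_box)          # initialize the observation model
    box = init_box
    results = [init_box]
    for frame in frames[1:]:
        candidates = propose(box)                        # motion model
        scores = [score(frame, c) for c in candidates]   # features + observation model
        best = max(range(len(candidates)), key=scores.__getitem__)
        box = candidates[best]                           # highest-scoring candidate
        results.append(box)
        if should_update(scores[best]):                  # model updater
            update(frame, box)
    return results

# Toy usage: each "frame" is just the true x position of the target.
frames = [0, 1, 3, 4]
init_box = (0, 0, 10, 10)
updates = []

def propose(box):
    return [(box[0] + dx, box[1], box[2], box[3]) for dx in (-2, -1, 0, 1, 2)]

def score(frame, candidate):
    return -abs(candidate[0] - frame)    # closer to the true position is better

def should_update(best_score):
    return best_score < -0.5             # update only when confidence is low

def update(frame, box):
    updates.append(box)

results = track(frames, init_box, propose, score, should_update, update)
```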


Validation Setup
determine the parameters of each component using five videos outside the benchmark and then fix the parameters afterwards throughout the evaluation unless specified otherwise.

Performance is measured by the overlap rate between the ground-truth and predicted bounding boxes, where the overlap rate is defined as the area of intersection of the two bounding boxes over the area of their union.
With a given threshold for the overlap rate, we can calculate the success rate of the tracker over all the video frames.
Varying the threshold gradually from 0 to 1 yields a curve that goes from the maximum success rate down to a success rate of 0.
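
A minimal sketch of the overlap rate and the success curve, assuming (x, y, w, h) boxes, 21 evenly spaced thresholds, and a frame counting as a success when its overlap is strictly greater than the threshold. These choices and the function names are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def iou(a, b):
    """Area of intersection over area of union for two (x, y, w, h) boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def success_curve(overlaps, num_thresholds=21):
    """Success rate (fraction of frames with overlap > t) for each threshold t."""
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    overlaps = np.asarray(overlaps, dtype=float)
    rates = np.array([np.mean(overlaps > t) for t in thresholds])
    return thresholds, rates

def success_auc(overlaps, num_thresholds=21):
    """Area under the success curve, using the trapezoidal rule."""
    t, r = success_curve(overlaps, num_thresholds)
    return float(np.sum((r[1:] + r[:-1]) / 2.0 * np.diff(t)))

gt = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
pred = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 5, 5)]
overlaps = [iou(p, g) for p, g in zip(pred, gt)]
thresholds, rates = success_curve(overlaps)
auc = success_auc(overlaps)
```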

basic model for the following analysis
motion model: particle filter framework
features: raw pixels of grayscale images
observation model: logistic regression
model updater: if the highest score among the candidates tested is below a threshold, the model will be updated.
ensemble post-processor: none (single tracker)
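
To make the baseline concrete, here is a rough sketch of a logistic regression observation model on raw grayscale pixels. The training procedure (plain gradient descent), the hyperparameters, and the synthetic bright/dark patches are all assumptions for illustration; the paper's actual training details are not given in these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=1000, l2=1e-3):
    """Batch gradient descent for L2-regularized logistic regression."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b

def confidence(patches, w, b):
    """Probability of each grayscale patch being the target."""
    X = patches.reshape(len(patches), -1)   # raw pixels as the feature vector
    return sigmoid(X @ w + b)

# Synthetic 8x8 patches: bright ones stand in for the target,
# dark ones for the background.
rng = np.random.default_rng(0)
pos = 0.8 + 0.05 * rng.standard_normal((20, 8, 8))
neg = 0.2 + 0.05 * rng.standard_normal((20, 8, 8))
X = np.concatenate([pos, neg]).reshape(40, -1)
y = np.concatenate([np.ones(20), np.zeros(20)])

w, b = fit_logistic(X, y)
conf_pos = confidence(pos, w, b)
conf_neg = confidence(neg, w, b)
```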


See how each component of a tracker affects its final performance.


Feature Extractor
maps the raw image data to some (usually) more informative representation.

CNNs can extract more powerful features but incur a high computational cost.
Another direction is to exploit the color information. Some recent methods demonstrated notable performance with carefully designed color features. Not only are these features lightweight, but they are also suitable for deformable objects.
We believe that finding good features for object tracking is still a research direction that is worth pursuing.

Our Findings: Using proper features can dramatically improve the tracking performance.


Observation Model
returns the confidence of a given candidate being the target
consider the following observation models (discriminative):

Our Findings: Different observation models indeed affect the performance when the features are weak.
However, the performance gaps diminish when the features are strong enough. Consequently, satisfactory results can be obtained even using simple classifiers from textbooks.


Motion Model

scale the parameters by the video resolution



Even such a simple normalization step can improve the performance significantly, especially when there is fast motion.
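
One way to picture this normalization is a particle sampler whose translation noise is defined at a reference resolution and rescaled for each video. This is a sketch under assumptions: the Gaussian perturbation model and the parameters `sigma_xy`, `sigma_scale` and `ref_side` are hypothetical values chosen for illustration, not the paper's settings.

```python
import numpy as np

def sample_particles(prev_box, frame_shape, n=400, sigma_xy=8.0,
                     sigma_scale=0.01, ref_side=240.0, rng=None):
    """Sample candidate boxes around prev_box = (x, y, w, h).

    sigma_xy is given in pixels at the reference resolution and is scaled
    by the shorter side of the actual frame (height, width).
    """
    rng = np.random.default_rng() if rng is None else rng
    k = min(frame_shape[:2]) / ref_side          # resolution normalization
    x, y, w, h = prev_box
    dx = rng.normal(0.0, sigma_xy * k, n)
    dy = rng.normal(0.0, sigma_xy * k, n)
    s = np.exp(rng.normal(0.0, sigma_scale, n))  # multiplicative scale jitter
    cx, cy = x + w / 2.0 + dx, y + h / 2.0 + dy
    nw, nh = w * s, h * s
    return np.stack([cx - nw / 2.0, cy - nh / 2.0, nw, nh], axis=1)

# The same scene at two resolutions: the search radius grows with the frame.
p_small = sample_particles((100, 100, 40, 40), (240, 320), n=500,
                           rng=np.random.default_rng(0))
p_large = sample_particles((200, 200, 80, 80), (480, 640), n=500,
                           rng=np.random.default_rng(0))
```

Without the factor `k`, a target moving the same fraction of the frame per step would fall outside the sampled region more often in the higher-resolution video.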

Our Findings: particle filter approach with resized input is good.


Model Updater (I didn't really understand this part)
determines both the strategy and frequency of model update.
needs to maintain a tradeoff between adapting to new (but possibly noisy) examples collected during tracking and preventing the tracker from drifting to the background.
When the model needs an update, we first collect some positive examples whose centers are within 5 pixels of the target and some negative examples within 100 pixels but with an overlap rate less than 0.3.

two model update methods:

  1. update the model whenever the confidence of the target falls below a threshold.
  2. update the model whenever the difference between the confidence of the target and that of the background examples is below a threshold.
    This strategy simply maintains a sufficiently large margin between the positive and negative examples instead of forcing the target to have high confidence. It is potentially helpful when the target is occluded or disappears.
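
The example collection and the two update rules can be sketched like this. The 5 pixel, 100 pixel and 0.3 values come from the notes above; the confidence threshold, the margin value, and taking the maximum over the background confidences are assumptions, since the notes do not say how the background confidence is aggregated.

```python
import math

def center(box):
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def overlap_rate(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def collect_examples(target, candidates, pos_radius=5.0, neg_radius=100.0,
                     max_neg_overlap=0.3):
    """Split candidates into positive and negative training examples."""
    tx, ty = center(target)
    positives, negatives = [], []
    for c in candidates:
        cx, cy = center(c)
        dist = math.hypot(cx - tx, cy - ty)
        if dist <= pos_radius:
            positives.append(c)
        elif dist <= neg_radius and overlap_rate(c, target) < max_neg_overlap:
            negatives.append(c)
    return positives, negatives

def update_by_confidence(target_conf, threshold=0.5):
    """Rule 1: update when the target confidence falls below a threshold."""
    return target_conf < threshold

def update_by_margin(target_conf, background_confs, margin=0.2):
    """Rule 2: update when target and background are not separated enough."""
    return target_conf - max(background_confs) < margin

target = (50, 50, 20, 20)
candidates = [(52, 51, 20, 20), (60, 50, 20, 20), (80, 50, 20, 20), (300, 50, 20, 20)]
positives, negatives = collect_examples(target, candidates)
```

Note that the second candidate is used for neither class: it is too far away to be a positive and overlaps the target too much to be a negative.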

To the best of our knowledge, the only principled method for the model updater is the one by [41]. They proposed to use entropy minimization to identify reliable model updates and discard the incorrect ones.

Our Findings: Although the implementation of the model updater is often treated as an engineering trick in papers, especially for discriminative trackers, its impact on performance is usually very significant and hence is worth studying. Unfortunately, very little work focuses on this component.


Ensemble Post-processor

  1. One method proposed a loss function for bounding box majority voting and then extended it to incorporate tracker weights, trajectory continuity and removal of bad trackers.
  2. The other formulated the ensemble learning problem as a structured crowd-sourcing problem which treats the reliability of each tracker as a hidden variable to be inferred. Then they proposed a factorial hidden Markov model that considers the temporal smoothness between frames. We adopt the basic model called ensemble based tracking (EBT) without self-correction.

The difference between these methods is small,
but diversity of the trackers in the ensemble helps to achieve good results. Both ensemble methods can significantly improve the results when the trackers have high diversity.
Even when the diversity is low, the ensemble does not impair the performance but still slightly outperforms the best single tracker.
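
As a rough intuition for why an ensemble helps, here is a much simplified voting scheme: pick the box that overlaps most with the boxes of the other trackers. This is a stand-in written for this note, not the loss function of the first method and not the factorial HMM of EBT; the function name `consensus_box` and the optional `weights` are hypothetical.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def consensus_box(boxes, weights=None):
    """Return the box with the largest weighted overlap with the other boxes."""
    n = len(boxes)
    weights = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    support = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i != j:
                support[i] += weights[j] * iou(boxes[i], boxes[j])
    return boxes[int(np.argmax(support))]

# Two trackers roughly agree, one has drifted to the background.
boxes = [(10, 10, 20, 20), (12, 10, 20, 20), (100, 100, 20, 20)]
winner = consensus_box(boxes)
```

A drifted tracker gets no support from the others, so it is outvoted as long as the remaining trackers do not fail in the same way, which is where diversity matters.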

Our Findings: The ensemble post-processor can improve the performance substantially, especially when the trackers have high diversity. This component is universal and effective, yet it is the least explored.


in the latest deep learning trackers, the feature extractor and observation model are combined into a unified deep learning framework for end-to-end learning


speed is a problem.
fast Fourier transform (FFT) and circular matrices are used to accelerate dense (kernelized) ridge regression.
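
The FFT trick can be checked on a small 1-D example. The sketch below covers only the linear (non-kernelized) case: when the training samples are all cyclic shifts of one base signal `x`, the data matrix is circulant and ridge regression has an element-wise solution in the Fourier domain. The function names and the shift convention (rows built with `np.roll`) are choices made for this note.

```python
import numpy as np

def ridge_direct(x, y, lam):
    """Ridge regression on all cyclic shifts of x, solved as a dense system."""
    n = len(x)
    X = np.stack([np.roll(x, i) for i in range(n)])   # circulant data matrix
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def ridge_fft(x, y, lam):
    """Same solution, computed element-wise in the Fourier domain."""
    xf = np.fft.fft(x)
    yf = np.fft.fft(y)
    wf = xf * yf / (np.abs(xf) ** 2 + lam)
    return np.fft.ifft(wf).real

rng = np.random.default_rng(0)
x = rng.standard_normal(16)      # base sample
y = rng.standard_normal(16)      # regression target for each shift
w_direct = ridge_direct(x, y, lam=0.1)
w_fft = ridge_fft(x, y, lam=0.1)
```

The dense solve costs O(n^3) while the FFT version costs O(n log n), which is the source of the speedup.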


Conclusion

  1. the feature extractor is the most important part of a tracker.
  2. the observation model is not that important if the features are good enough.
  3. the model updater can affect the result significantly, but currently there are not many principled ways for realizing this component.
  4. the ensemble post-processor is quite universal and effective.
  5. paying attention to some details of the motion model and model updater can significantly improve the performance.