
1 Hierarchical Convolutional Features for Visual Tracking Chao Ma, SJTU; Jia-Bin Huang, UIUC; Xiaokang Yang, SJTU; Ming-Hsuan Yang, UC Merced

2 Hierarchical Convolutional Features for Visual Tracking What is visual tracking? How is it done? What is the novel point of this paper?

3 Visual Tracking A typical scenario of visual tracking is to track an unknown target object, specified by a bounding box in the first frame.

4 Visual Tracking Methods Tracking by binary classifiers: visual tracking can be posed as a repeated detection problem in a local window. For each frame, a set of positive and negative training samples is collected to incrementally learn a discriminative classifier that separates the target from its background. This suffers from sampling ambiguity. Tracking by correlation filters: correlation-filter trackers regress all circularly shifted versions of the input features to a target Gaussian function, so no hard-thresholded samples of target appearance are needed. Tracking by CNNs: visual representations are of great importance for object tracking.

5 Chao Ma's Work Learns correlation filters over multi-dimensional features in a way similar to existing methods; the main difference lies in the use of learned CNN features rather than hand-crafted features. Prior CNN trackers all rely on positive and negative training samples and exploit only the features from the last layer. In contrast, this approach builds on adaptive correlation filters, which regress dense, circularly shifted samples with soft labels and effectively alleviate sampling ambiguity.
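
To make the regression over circularly shifted samples concrete, here is a minimal single-channel sketch in the MOSSE/KCF style; function names and parameters are illustrative, not the paper's (the paper applies such filters to multi-channel CNN features):

```python
import numpy as np

def gaussian_labels(h, w, sigma=2.0):
    """Soft Gaussian label map peaked at the patch centre, circularly
    shifted so the peak sits at (0, 0), matching the convention of
    shift-based correlation filter training."""
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-0.5 * ((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / sigma ** 2)
    return np.roll(g, (-(h // 2), -(w // 2)), axis=(0, 1))

def learn_filter(x, y, lam=1e-4):
    """Closed-form ridge regression over all circular shifts of the
    feature patch x against the soft label map y (MOSSE-style)."""
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    # lam regularizes and prevents division by zero
    return (Y * np.conj(X)) / (X * np.conj(X) + lam)

def correlate(w_hat, z):
    """Response map of the learned filter on a new patch z; the argmax
    gives the estimated target translation."""
    return np.real(np.fft.ifft2(w_hat * np.fft.fft2(z)))
```

The Fourier-domain closed form is what makes regressing every circular shift tractable: no explicit sample set is ever built.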

6

7 Algorithm Use the convolutional feature maps of a CNN (AlexNet or VGG-Net) to encode target appearance. Along the CNN forward propagation, semantic discrimination between object categories is strengthened, while spatial resolution is gradually reduced, which works against precise localization. Learn a discriminative classifier and estimate the translation of the target by searching for the maximum of the correlation response map. Given the set of per-layer correlation response maps, the target translation is inferred hierarchically, from the deepest layer down to earlier layers.
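
As a hedged illustration of the hierarchical inference step, the sketch below simply fuses the per-layer response maps with fixed weights; the actual method performs a coarse-to-fine, layer-by-layer constrained search, so this is a simplification (all names are illustrative):

```python
import cv2
import numpy as np

def fuse_responses(responses, weights):
    """Hierarchical inference, simplified to a weighted sum: per-layer
    response maps, ordered deep (coarse) to shallow (fine), are resized
    to the finest resolution and accumulated; the argmax of the sum is
    the estimated translation."""
    h, w = responses[-1].shape
    acc = np.zeros((h, w), dtype=np.float32)
    for r, wt in zip(responses, weights):
        acc += wt * cv2.resize(r.astype(np.float32), (w, h),
                               interpolation=cv2.INTER_LINEAR)
    dy, dx = np.unravel_index(np.argmax(acc), acc.shape)
    return dy, dx
```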

8 Implementation Details Experimental Validations

9 Conclusion Combines CNNs and correlation filters. Uses not only the last layer but also earlier layers of the CNN to achieve better performance. Extensive experimental results show that the proposed algorithm performs favorably against state-of-the-art methods in terms of accuracy and robustness.

10 Online Object Tracking with Proposal Selection Reporter: Liu Cun Student ID: 115413910018 2016.05.03

11 Outline Background and Introduction Highlights and Contributions Proposal Selection for Tracking Experiments and Results Conclusion and Summary

12 Background & Introduction Tracking-by-detection approaches are among the most successful object trackers of recent years. Their success is largely determined by the detector model they learn initially and then update over time. However, under challenging conditions where the object undergoes transformations such as severe rotation, these methods are found lacking.

13 Highlights and Contributions The paper addresses this challenging problem by formulating it as a proposal selection task and makes two contributions. The first is introducing novel proposals estimated from the geometric transformations undergone by the object, building a rich candidate set for predicting the object location. The second is devising a novel selection strategy using multiple cues, such as the detection score and an edgeness score computed from state-of-the-art object edges and motion boundaries.

14 Proposal Selection for Tracking The main components of the framework for online object tracking are: (i) learning the initial detector with a training set consisting of one positive sample, available as a bounding box annotation in the first frame, and several negative bounding box samples automatically extracted from the entire image. HOG features are computed for these bounding boxes and the detector is learned with a linear SVM, as in other tracking-by-detection approaches. The detector is then evaluated on subsequent frames to estimate candidate locations of the object.
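
A minimal sketch of this initialization step, assuming grayscale input, (x, y, w, h) boxes, and illustrative patch/HOG settings that are not taken from the paper:

```python
import cv2
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def train_initial_detector(frame_gray, pos_box, neg_boxes):
    """Learn the initial detector: one positive box (the first-frame
    annotation) plus automatically extracted negative boxes, HOG
    features, linear SVM."""
    def feat(box):
        x, y, w, h = box
        patch = cv2.resize(frame_gray[y:y + h, x:x + w], (64, 64))
        return hog(patch, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2))
    X = np.stack([feat(pos_box)] + [feat(b) for b in neg_boxes])
    labels = np.array([1] + [0] * len(neg_boxes))
    clf = LinearSVC(C=1.0).fit(X, labels)
    return clf   # clf.decision_function(feats) scores candidate boxes
```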

15 Proposal Selection for Tracking (ii) building a rich candidate set of object locations in each frame, consisting of proposals from the detector as well as proposals derived from estimated geometric transformations. The geometric transformation is represented by a similarity matrix, defined by four parameters: one each for rotation and scale, and two for translation. These parameters are estimated with a Hough transform voting scheme over frame-to-frame optical flow correspondences.
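
A hedged sketch of this estimation step; it uses OpenCV's RANSAC-based 4-DOF estimator rather than the paper's Hough voting scheme, so it illustrates the transformation being estimated, not the exact method:

```python
import cv2
import numpy as np

def estimate_similarity(prev_gray, curr_gray, prev_pts):
    """Estimate the 4-parameter similarity transform (rotation, scale,
    two translations) from frame-to-frame optical flow correspondences.
    prev_pts are Nx1x2 float32 points inside the previous object box,
    e.g. from cv2.goodFeaturesToTrack."""
    curr_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray,
                                                   prev_pts, None)
    good = status.ravel() == 1
    M, _ = cv2.estimateAffinePartial2D(prev_pts[good], curr_pts[good],
                                       method=cv2.RANSAC,
                                       ransacReprojThreshold=3.0)
    # M = [[s*cos(a), -s*sin(a), tx], [s*sin(a), s*cos(a), ty]]
    return M
```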

16 Proposal Selection for Tracking (iii) evaluating all proposals in each frame with multiple cues to select the best one. Three cues are used: the detection confidence score and objectness measures computed from object edges and motion boundaries. The normalized detection confidence score is first computed for each proposal box with the SVM learned from the object annotation in the first frame and updated during tracking; this provides information directly relevant to the object of interest in a given sequence.

17 Proposal Selection for Tracking (iv) updating the detector model: having computed the best proposal containing the object, it is used as a positive exemplar to learn a new object model.

18 Experiments and Results The paper presents an empirical evaluation on two state-of-the-art benchmark datasets and compares with several recent methods.

19 Experiments and Results The top performer in each measure is shown in red; the second and third best are in blue and green, respectively. Ours-ms-rot: uses the multiscale detector and geometry proposals. Ours-ms: uses only multiscale detector proposals. Ours-ss: uses only single-scale detector proposals.

20

21 Conclusion and Summary This paper presents a new tracking-by-detection framework for online object tracking. The approach begins by building a candidate set of object location proposals extracted with a learned detector model. The set is then augmented with novel proposals computed by estimating the geometric transformations undergone by the object. The object is localized by selecting the best proposal from this candidate set using multiple cues: detection confidence score, edges, and motion boundaries. The tracker is evaluated extensively on the VOT 2014 challenge and the OTB datasets, showing state-of-the-art results on both benchmarks and significantly improving over the top performers of these two evaluations.

22 Thank you!

23 Visual Tracking with Fully Convolutional Networks Shi Yuzhou

24 Abstract A new approach for general object tracking with a fully convolutional network, instead of treating the convolutional neural network (CNN) as a black-box feature extractor. 1. Convolutional layers at different levels characterize the target from different perspectives. 2. A top layer encodes more semantic features and serves as a category detector, while a lower layer carries more discriminative information and can better separate the target from distracters with similar appearance. 3. For a tracking target, only a subset of neurons is relevant.

25 Deep Feature Analysis for Visual Tracking Some important properties of CNN features that facilitate visual tracking. Observation 1: although the receptive field of CNN feature maps is large, the activated feature maps are sparse and localized, and the activated regions are highly correlated with the regions of semantic objects. The feature maps have only small regions with nonzero values and capture the visual representation related to the objects.

26 Deep Feature Analysis for Visual Tracking Observation 2: many CNN feature maps are noisy or irrelevant for the task of discriminating a particular target from its background. (Activation value: the sum of a feature map's responses within the object region.) Discarding feature maps: most feature maps have small or zero values within the object region, so many feature maps are unrelated to the target object.

27 Deep Feature Analysis for Visual Tracking Observation 3: different layers encode different types of features. Higher layers capture semantic concepts of object categories, whereas lower layers encode more discriminative features that capture intra-class variations. Because of the redundancy of feature maps, a sparse representation scheme is employed to facilitate better visualization, with the feature maps used to update the sparse coefficient vector.

28 Proposed Algorithm

29 1. conv4-3 and conv5-3 layers: feature map selection. 2. GNet: captures the category information (built on the conv5-3 layer). 3. SNet: discriminates the target from the background (built on the conv4-3 layer). 4. Both are initialized in the first frame and adopt different online update strategies. 5. For a new frame, the last target ROI is cropped and propagated through the fully convolutional network. 6. Target localization is performed independently based on the two heat maps from GNet and SNet. 7. The final target is determined by a distracter detection scheme that decides which heat map from step 6 is used.
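
A schematic of this per-frame logic; every name below is a placeholder for a component the slides describe, not a real API:

```python
def track_frame(frame, last_box, feat_net, snet, gnet):
    """One tracking step following the pipeline above (schematic only;
    feat_net, snet, gnet, crop_roi, localize and is_distracted stand
    in for the described components)."""
    roi = crop_roi(frame, last_box)              # step 5: crop last target ROI
    conv4, conv5 = feat_net(roi)                 # selected feature maps
    s_heat, g_heat = snet(conv4), gnet(conv5)    # step 6: two heat maps
    g_box = localize(g_heat, last_box)           # GNet localization first
    # Step 7: if a distracter fires in the GNet heat map outside the
    # predicted region, fall back to the more discriminative SNet map.
    if is_distracted(g_heat, g_box):
        return localize(s_heat, last_box)
    return g_box
```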

30 Feature Map Selection The proposed feature map selection method is based on a target heat map regression model, named sel-CNN. The sel-CNN model consists of a dropout layer followed by a convolutional layer without any nonlinear transformation. It takes the feature maps to be selected (conv4-3 or conv5-3) as input to predict the target heat map M. The model is trained by minimizing the square loss between the predicted foreground heat map M̂ and the target heat map M: L_sel = || M̂ − M ||².

31 Feature Map Selection After training, the model parameters are fixed and the feature maps are selected according to their impact on the loss function. Vectorize the input feature maps F and denote f_i as the i-th element of vec(F). The significance of f_i is the change in the loss when f_i is set to zero, approximated by a two-term Taylor expansion: s_i = −(∂L/∂f_i) f_i + ½ (∂²L/∂f_i²) f_i². The significance of the k-th feature map is the sum of the significance of all its elements, and the top K feature maps are selected.
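
A sketch of this selection in PyTorch, assuming sel_cnn maps a (1, C, H, W) feature tensor to a heat map; it keeps only a first-order saliency term, whereas the definition above adds a second-order Taylor term:

```python
import torch

def select_feature_maps(sel_cnn, feats, target_map, k):
    """Rank input feature maps by their impact on the sel-CNN loss and
    keep the top K, using the first-order saliency |dL/df_i * f_i|
    summed per map."""
    feats = feats.detach().clone().requires_grad_(True)        # (C, H, W)
    pred = sel_cnn(feats.unsqueeze(0)).squeeze()
    loss = ((pred - target_map) ** 2).sum()                    # square loss
    loss.backward()
    significance = (feats.grad * feats).abs().sum(dim=(1, 2))  # per map
    return torch.topk(significance, k).indices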

32 Target Localization After feature map selection in the first frame, SNet and GNet are built on top of the selected conv4-3 and conv5-3 feature maps, respectively. Both share the same architecture: a first convolutional layer with 9×9 kernels outputting 36 feature maps, followed by a second layer with 5×5 kernels outputting the foreground heat map. SNet (on conv4-3) is more sensitive to intra-class appearance variation; GNet (on conv5-3) is invariant to pose variation and rotation. Both are initialized in the first frame by minimizing the square loss between their predicted heat maps and the target heat map M, as in sel-CNN training.

33 Target Localization In a new frame, a rectangular ROI centered at the last target location is cropped. By forward-propagating the ROI through the networks, foreground heat maps are predicted by both GNet and SNet. The locations of target candidates in the current frame are assumed to follow a Gaussian distribution around the last target location. The confidence of the i-th candidate is computed as the sum of all heat map values within the candidate region.
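
A minimal numpy sketch of this localization step (it could serve as the localize placeholder in the earlier pipeline schematic); the candidate count and Gaussian spread are illustrative, and scale sampling is omitted for brevity:

```python
import numpy as np

def localize(heat, last_box, n=300, sigma=10.0):
    """Score Gaussian-sampled candidate boxes by summed heat-map values
    and return the best one. Candidates keep the previous box size."""
    x, y, w, h = last_box
    cxs = np.random.normal(x + w / 2.0, sigma, n)
    cys = np.random.normal(y + h / 2.0, sigma, n)
    best, best_conf = last_box, -np.inf
    for cx, cy in zip(cxs, cys):
        x0, y0 = int(cx - w / 2.0), int(cy - h / 2.0)
        if x0 < 0 or y0 < 0 or y0 + h > heat.shape[0] or x0 + w > heat.shape[1]:
            continue
        conf = heat[y0:y0 + h, x0:x0 + w].sum()   # summed heat values
        if conf > best_conf:
            best, best_conf = (x0, y0, w, h), conf
    return best
```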

34 Online Update To avoid background noise introduced by online updates, GNet is fixed and only SNet is updated after initialization in the first frame, following two rules: the adaptation rule and the discrimination rule. Adaptation rule: fine-tune SNet every 20 frames using the most confident tracking result within the intervening frames. Discrimination rule: when distracters are detected using (9), SNet is further updated using the tracking results in the first frame and the current frame by minimizing a square loss defined over both frames.

35 In Defense of Color-based Model-free Tracking Sun Jianhui

36 Problem: trackers often tend to drift towards regions that exhibit an appearance similar to the object of interest. Solution: identify potentially distracting regions in advance and suppress them.

37 Object-background model

38 Simplify the equation with the prior probability and estimate the likelihood directly from the color histograms (H: non-normalized color histogram): P(x ∈ O | b_x) ≈ H_O(b_x) / (H_O(b_x) + H_B(b_x)), where b_x is the color bin of pixel x, O the object region, and B the surrounding background region.

39 Distractor-aware object model O: object hypothesis region; D: potentially distracting regions. The same formulation is applied against the distractors: P(x ∈ O | b_x) ≈ H_O(b_x) / (H_O(b_x) + H_D(b_x)).

40 Combined object model The object-background and distractor-aware models are combined into a single object model, which is updated over time using linear interpolation.
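
A small numpy sketch of the histogram-based likelihood above; the bin count and the 0.5 prior for unseen bins are illustrative choices:

```python
import numpy as np

def object_likelihood(img_bgr, obj_mask, surr_mask, bins=16):
    """Per-pixel object likelihood from non-normalized color histograms:
    P(x in O | b_x) ~ H_O(b_x) / (H_O(b_x) + H_B(b_x)).
    obj_mask / surr_mask are boolean masks for the object box and its
    surrounding region."""
    q = img_bgr.astype(np.int32) // (256 // bins)             # quantize colors
    idx = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]   # joint bin index
    h_o = np.bincount(idx[obj_mask], minlength=bins ** 3).astype(np.float64)
    h_b = np.bincount(idx[surr_mask], minlength=bins ** 3).astype(np.float64)
    denom = h_o + h_b
    lik = np.where(denom > 0, h_o / np.maximum(denom, 1e-9), 0.5)
    return lik[idx]   # likelihood map over the whole image
```

Frame to frame, the model would then be maintained by the linear interpolation mentioned above, e.g. lik = eta * lik_new + (1 - eta) * lik_prev for some learning rate eta.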

41 Localization

42

43 Scale estimation Cumulative histogram Adaptive segmentation threshold
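
A generic sketch of picking a segmentation threshold from a cumulative histogram, as a stand-in for the slide's rule (the quantile is an assumed, illustrative parameter):

```python
import numpy as np

def adaptive_threshold(likelihood_map, quantile=0.75):
    """Pick a segmentation threshold from the cumulative histogram of
    the likelihood map; pixels above the threshold form the segment
    used for scale estimation."""
    hist, edges = np.histogram(likelihood_map, bins=100, range=(0.0, 1.0))
    cdf = np.cumsum(hist) / float(hist.sum())
    tau = edges[np.searchsorted(cdf, quantile)]
    return likelihood_map >= tau
```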

44 Evaluation VOT: the Visual Object Tracking benchmark. DAT: the distractor-aware tracker. DATs: scale-adaptive DAT. noDAT: the tracker using only the discriminative object-background model.

45 Results VOT14 benchmark

46

47 Results VOT13 benchmark

48 Results VOT14 benchmark using randomly perturbed initializations

49 Runtime performance DAT: 17 fps; noDAT: 18 fps; DATs: 15 fps (PC with 3.4 GHz Intel CPU, MATLAB).

50 Thanks

51

