
Hierarchical Convolutional Features for Visual Tracking. Chao Ma, SJTU; Jia-Bin Huang, UIUC; Xiaokang Yang, SJTU; Ming-Hsuan Yang, UC Merced

Hierarchical Convolutional Features for Visual Tracking: What is visual tracking? How is it done? What is the novel contribution of this paper?

Visual Tracking A typical scenario of visual tracking is to track an unknown target object, specified by a bounding box in the first frame.

Visual Tracking Methods
Tracking by binary classifiers: visual tracking can be posed as a repeated detection problem in a local window. For each frame, a set of positive and negative training samples is collected to incrementally learn a discriminative classifier that separates the target from its background. Drawback: sampling ambiguity.
Tracking by correlation filters: these methods regress all circularly shifted versions of the input features to a target Gaussian function, so no hard-thresholded samples of target appearance are needed.
Tracking by CNNs: visual representations are of great importance for object tracking.
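To make the correlation-filter formulation concrete, here is a minimal single-channel NumPy sketch of the standard Fourier-domain ridge-regression solution; the function names, the Gaussian width, and the regularizer value are illustrative choices, not taken from the slides:

```python
import numpy as np

def gaussian_labels(h, w, sigma=2.0):
    """Soft target map: a Gaussian peaked at zero shift, to which all
    circularly shifted versions of the input features are regressed."""
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))
    return np.fft.ifftshift(g)  # move the peak to (0, 0)

def learn_filter(x, y, lam=1e-4):
    """Closed-form ridge regression over all circular shifts of x,
    solved elementwise in the Fourier domain (single-channel case)."""
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return np.conj(X) * Y / (np.conj(X) * X + lam)
```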

Chao Ma's Work: Learn correlation filters over multi-dimensional features, in a way similar to existing methods; the main difference lies in the use of learned CNN features rather than hand-crafted features. Earlier CNN trackers all rely on positive and negative training samples and exploit only the features from the last layer. In contrast, our approach builds on adaptive correlation filters that regress dense, circularly shifted samples with soft labels, effectively alleviating the sampling ambiguity.

Algorithm: Use the convolutional feature maps of a CNN (AlexNet or VGG-Net) to encode target appearance. Along the CNN forward propagation, the semantic discrimination between objects from different categories is strengthened, while the spatial resolution needed for precise localization is gradually reduced. Learn a discriminative classifier (a correlation filter per layer) and estimate the translation of the target object by searching for the maximum of the correlation response map. Given the set of correlation response maps, the target translation is inferred hierarchically across layers (see the sketch below).
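A sketch of the localization step, reusing the single-channel filters from the previous sketch; the paper's coarse-to-fine, layer-by-layer search is simplified here to a weighted sum of the per-layer response maps, and the weights are illustrative:

```python
import numpy as np

def response(W, z):
    """Correlation response of a learned filter W on search features z."""
    return np.real(np.fft.ifft2(W * np.fft.fft2(z)))

def hierarchical_translation(filters, features, weights=(1.0, 0.5, 0.25)):
    """Fuse per-layer response maps (ordered deep to shallow) and take the
    argmax as the estimated translation. The per-layer feature maps are
    assumed to be resized to a common spatial size beforehand."""
    fused = sum(w * response(W, z)
                for w, (W, z) in zip(weights, zip(filters, features)))
    return np.unravel_index(np.argmax(fused), fused.shape)
```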

Implementation Details
Experimental Validations

Conclusion: Combines CNN features and correlation filters. Uses not only the last layer but also earlier layers of the CNN to achieve better performance. Extensive experimental results show that the proposed algorithm performs favorably against state-of-the-art methods in terms of accuracy and robustness.

Online Object Tracking with Proposal Selection. Reporter: Liu Cun. Student ID:

Outline: Background and Introduction; Highlights and Contributions; Proposal Selection for Tracking; Experiments and Results; Conclusion and Summary

Background & Introduction: Tracking-by-detection approaches are among the most successful object trackers of recent years. Their success is largely determined by the detector model they learn initially and then update over time. However, under challenging conditions where an object can undergo transformations such as severe rotation, these methods are found lacking.

Highlights and Contributions: In this paper, the authors address this challenging problem by formulating it as a proposal selection task and making two contributions. The first is introducing novel proposals estimated from the geometric transformations undergone by the object, building a rich candidate set for predicting the object location. The second is devising a novel selection strategy using multiple cues, such as the detection score and an edgeness score computed from state-of-the-art object edges and motion boundaries.

Proposal Selection for Tracking: The main components of the framework for online object tracking are: (i) learning the initial detector with a training set consisting of one positive sample, available as a bounding-box annotation in the first frame, and several negative bounding-box samples automatically extracted from the rest of the image. HOG features are computed for these bounding boxes, and the detector is learned with a linear SVM, as in other tracking-by-detection approaches. The detector is then evaluated on subsequent frames to estimate candidate locations of the object (see the sketch below).
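A minimal sketch of step (i), assuming grayscale frames and boxes given as (x, y, w, h) tuples; the patch size, HOG parameters, and SVM regularization are illustrative, not the paper's values:

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import LinearSVC

def box_feature(gray, box, size=(64, 64)):
    """HOG feature of one (x, y, w, h) box in a grayscale frame; every
    patch is resized to a fixed size so feature dimensions match."""
    x, y, w, h = box
    patch = resize(gray[y:y + h, x:x + w], size)
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

def train_initial_detector(gray, pos_box, neg_boxes):
    """One positive sample (the first-frame annotation) vs. negative boxes
    drawn from the rest of the image, learned with a linear SVM."""
    X = np.array([box_feature(gray, pos_box)]
                 + [box_feature(gray, b) for b in neg_boxes])
    y = np.array([1] + [0] * len(neg_boxes))
    return LinearSVC(C=1.0).fit(X, y)
```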

Proposal Selection for Tracking: (ii) building a rich candidate set of object locations in each frame, consisting of proposals from the detector as well as from the estimated geometric transformations. The geometric transformation is represented by a similarity matrix, defined by four parameters: one each for rotation and scale, and two for translation. These parameters are estimated with a Hough-transform voting scheme using frame-to-frame optical-flow correspondences.
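The four-parameter similarity transform can be sketched as follows; the least-squares fit below is a simple stand-in for the paper's Hough-transform voting over optical-flow correspondences:

```python
import numpy as np

def similarity_matrix(theta, scale, tx, ty):
    """2x3 similarity transform from the four parameters the slide names:
    one each for rotation and scale, two for translation."""
    a, b = scale * np.cos(theta), scale * np.sin(theta)
    return np.array([[a, -b, tx],
                     [b,  a, ty]])

def fit_similarity(prev_pts, curr_pts):
    """Least-squares fit of (a, b, tx, ty) from optical-flow point
    correspondences: u = a*x - b*y + tx, v = b*x + a*y + ty."""
    x, y = prev_pts[:, 0], prev_pts[:, 1]
    A = np.zeros((2 * len(x), 4))
    A[0::2] = np.stack([x, -y, np.ones_like(x), np.zeros_like(x)], axis=1)
    A[1::2] = np.stack([y, x, np.zeros_like(x), np.ones_like(x)], axis=1)
    sol, *_ = np.linalg.lstsq(A, curr_pts.reshape(-1), rcond=None)
    a, b, tx, ty = sol
    return np.array([[a, -b, tx], [b, a, ty]])
```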

Proposal Selection for Tracking: (iii) evaluating all the proposals in each frame with multiple cues to select the best one. Three cues are used: the detection confidence score, and objectness measures computed from object edges and motion boundaries. First, the normalized detection confidence score is computed for each proposal box with the SVM learned from the object annotation in the first frame and updated during tracking. This provides information directly relevant to the object of interest in a given sequence.

Proposal Selection for Tracking: (iv) updating the detector model. Having computed the best proposal containing the object, it is used as a positive exemplar to learn a new object model.

Experiments and Results: The paper presents an empirical evaluation on two state-of-the-art benchmark datasets and compares with several recent methods.

Experiments and Results: The top performer in each measure is shown in red; the second and third best are in blue and green, respectively.
Ours-ms-rot: uses the multiscale detector and geometry proposals.
Ours-ms: uses only multiscale detector proposals.
Ours-ss: uses only single-scale detector proposals.

Conclusion and Summary: This paper presents a new tracking-by-detection framework for online object tracking. The approach begins by building a candidate set of object-location proposals extracted using a learned detector model. This set is then augmented with novel proposals computed by estimating the geometric transformations undergone by the object. The object is localized by selecting the best proposal from this candidate set using multiple cues: detection confidence score, edges, and motion boundaries. The tracker is evaluated extensively on the VOT 2014 challenge and the OTB datasets, showing state-of-the-art results on both benchmarks and significantly improving over the top performers of these two evaluations.

Thank you!

Visual Tracking with Fully Convolutional Networks. Presenter: Shi Yuzhou

Abstract: A new approach for general object tracking with a fully convolutional neural network that, instead of treating the CNN as a black-box feature extractor, analyzes its features:
1. Convolutional layers at different levels characterize the target from different perspectives.
2. A top layer encodes more semantic features and serves as a category detector, while a lower layer carries more discriminative information and can better separate the target from distracters with similar appearance.
3. For a tracking target, only a subset of neurons is relevant.

Deep Feature Analysis for Visual Tracking: Some important properties of CNN features that facilitate visual tracking.
Observation 1: Although the receptive fields of CNN feature maps are large, the activated feature maps are sparse and localized. The activated regions are highly correlated with the regions of semantic objects: the feature maps have only small regions with nonzero values, capturing the visual representation related to the objects.

Deep Feature Analysis for Visual Tracking
Observation 2: Many CNN feature maps are noisy or irrelevant for the task of discriminating a particular target from its background. (Activation value: the sum of a feature map's responses in the object region.) Discarding feature maps: most feature maps have small or zero values within the object region, so many of them are unrelated to the target object (see the sketch below).
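A small sketch of the activation-value criterion from this observation; the pruning threshold is an assumption:

```python
import numpy as np

def activation_value(feature_map, obj_mask):
    """Activation value: the sum of a feature map's responses inside the
    object region, as defined on this slide."""
    return feature_map[obj_mask].sum()

def discard_unrelated(feature_maps, obj_mask, thresh=1e-3):
    """Keep only maps whose in-object activation exceeds a small threshold
    (the threshold value is illustrative)."""
    return [fm for fm in feature_maps
            if activation_value(fm, obj_mask) > thresh]
```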

Deep Feature Analysis for Visual Tracking
Observation 3: Different layers encode different types of features. Higher layers capture semantic concepts of object categories, whereas lower layers encode more discriminative features that capture intra-class variations. Because of the redundancy of feature maps, a sparse representation scheme is employed to facilitate better visualization, using the feature maps to update the sparse coefficient vector.

Proposed Algorithm

1. conv4-3 and conv5-3 layers: feature map selection.
2. GNet: captures category information (built on the conv5-3 layer).
3. SNet: discriminates the target from the background (built on the conv4-3 layer).
4. Both are initialized in the first frame and adopt different online update strategies.
5. For a new frame, a ROI centered at the last target location is cropped and propagated through the fully convolutional network.
6. Target localization is performed independently based on the two heat maps produced by GNet and SNet.
7. The final target is determined by a distracter detection scheme that decides which heat map from step 6 to use.

Feature Map Selection: The proposed feature map selection method is based on a target heat map regression model, named sel-CNN. The sel-CNN model consists of a dropout layer followed by a convolutional layer without any nonlinear transformation. It takes the feature maps to be selected (conv4-3 or conv5-3) as input and predicts the target heat map M. The model is trained by minimizing the square loss between the predicted foreground heat map M̂ and the target heat map M: L_sel = ||M̂ - M||².
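A PyTorch sketch of the sel-CNN structure as described on this slide; the kernel size and dropout rate are assumptions:

```python
import torch.nn as nn

class SelCNN(nn.Module):
    """sel-CNN: dropout followed by a linear convolution (no nonlinearity)
    that maps candidate feature maps to a predicted target heat map."""
    def __init__(self, in_channels, drop=0.3):
        super().__init__()
        self.drop = nn.Dropout2d(drop)
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)

    def forward(self, feats):               # feats: (N, C, H, W)
        return self.conv(self.drop(feats))  # predicted heat map M-hat

# Training criterion: the square loss between predicted and target heat
# maps, e.g. torch.nn.MSELoss().
```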

Feature Map Selection: Fix the model parameters and select the feature maps according to their impact on the loss function. Vectorize the input feature maps F and denote f_i as the i-th element of vec(F). The significance of f_i is the change of the loss when f_i is set to zero, approximated by a two-order Taylor expansion: s_i = -g_i f_i + (1/2) h_ii f_i^2, where g_i and h_ii are the first- and second-order derivatives of the loss with respect to f_i. The significance of the k-th feature map is the sum of the significance of all its elements, and the top-K feature maps are selected.
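A first-order PyTorch sketch of this selection rule; the second-order Taylor term above is omitted for brevity:

```python
import torch

def map_significance(loss, feats):
    """First-order estimate s_i = -g_i * f_i of the loss change when an
    element f_i is zeroed. feats: (C, H, W) with requires_grad=True."""
    g = torch.autograd.grad(loss, feats, retain_graph=True)[0]
    return (-g * feats).sum(dim=(1, 2))  # per-feature-map significance

# Select the top-K maps (K is a hyperparameter):
# keep = map_significance(loss, feats).topk(K).indices
```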

Target Localization: After feature map selection in the first frame, we build SNet and GNet on top of the selected conv4-3 and conv5-3 feature maps, respectively.
conv4-3 (SNet): 9×9 convolutional kernels, 36 output feature maps; more sensitive to intra-class appearance variation.
conv5-3 (GNet): 5×5 convolutional kernels, foreground heat map output; invariant to pose variation and rotation.
Both networks are initialized in the first frame by minimizing a square loss function (see the sketch below).
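A sketch of the two extra convolutional layers, assuming (as the FCNT paper describes) that the 9×9 layer producing 36 maps feeds the 5×5 layer producing the heat map in both SNet and GNet; the padding and intermediate ReLU are assumptions:

```python
import torch.nn as nn

class HeatMapNet(nn.Module):
    """Shared two-layer head for SNet/GNet: a 9x9 conv producing 36 maps,
    then a 5x5 conv producing the foreground heat map."""
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 36, kernel_size=9, padding=4),
            nn.ReLU(inplace=True),
            nn.Conv2d(36, 1, kernel_size=5, padding=2),
        )

    def forward(self, feats):
        return self.net(feats)  # foreground heat map
```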

Target Localization: In a new frame, a rectangular ROI centered at the last target location is cropped. By forward-propagating the ROI through the networks, foreground heat maps are predicted by both GNet and SNet. The locations of target candidates in the current frame are assumed to follow a Gaussian distribution, and the confidence of the i-th candidate is computed as the sum of all heat-map values within the candidate region (see the sketch below).
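A sketch of this localization rule, with boxes assumed as integer (x, y, w, h) tuples and illustrative sampling parameters:

```python
import numpy as np

def sample_candidates(last_center, n=300, sigma=(10.0, 10.0)):
    """Draw candidate centers from a Gaussian around the last target
    location (n and sigma are illustrative)."""
    return np.random.normal(loc=last_center, scale=sigma, size=(n, 2))

def candidate_confidence(heat_map, box):
    """Confidence of one candidate: the sum of heat-map values inside
    its (x, y, w, h) box."""
    x, y, w, h = box
    return heat_map[y:y + h, x:x + w].sum()
```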

Online Update: To avoid background noise introduced by online updates, GNet is fixed and only SNet is updated after the initialization in the first frame, following two rules: the adaptation rule and the discrimination rule.
Adaptation rule: fine-tune SNet every 20 frames using the most confident tracking result within the intervening frames.
Discrimination rule: when distracters are detected using (9), SNet is further updated using the tracking results in the first frame and the current frame by minimizing a square loss over both frames.

In Defense of Color-based Model-free Tracking. Presenter: Sun Jianhui

Problem: trackers often tend to drift towards regions that exhibit an appearance similar to the object of interest.
Solution: identify potentially distracting regions in advance and suppress them.

Object-background model

Simplify the equation with the prior probability and estimate the likelihood directly from color histograms, where H denotes a non-normalized color histogram (see the sketch below).
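A minimal sketch of the histogram-based object likelihood, assuming (consistent with the paper's histogram ratio) that the object likelihood of a color bin is H_O / (H_O + H_S) over the object region O and its surrounding region S; the bin quantization and the 0.5 value for unseen bins are illustrative choices:

```python
import numpy as np

def color_bins(img, bins=16):
    """Quantize an RGB uint8 image into joint color-bin indices."""
    q = (img // (256 // bins)).astype(np.int32)
    return q[..., 0] * bins * bins + q[..., 1] * bins + q[..., 2]

def object_likelihood(bin_map, obj_mask, surr_mask, n_bins=16 ** 3):
    """Per-pixel P(x in O | b_x) from non-normalized histograms of the
    object region O and its surrounding region S."""
    h_obj = np.bincount(bin_map[obj_mask], minlength=n_bins).astype(float)
    h_surr = np.bincount(bin_map[surr_mask], minlength=n_bins).astype(float)
    lik = np.full(n_bins, 0.5)  # uninformative value for unseen bins
    seen = (h_obj + h_surr) > 0
    lik[seen] = h_obj[seen] / (h_obj[seen] + h_surr[seen])
    return lik[bin_map]
```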

Distractor-aware object model
O: object hypothesis region
D: potentially distracting regions

Combined object model: the object model is updated using linear interpolation (see the sketch below).
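A sketch of the combination and the interpolation update; the mixing weight gamma and learning rate eta are assumed names and values, not the paper's:

```python
def combined_likelihood(lik_surr, lik_dist, gamma=0.5):
    """Combine the object-vs-surroundings and distractor-aware models;
    gamma is an assumed mixing weight."""
    return gamma * lik_dist + (1.0 - gamma) * lik_surr

def update_model(prev_lik, new_lik, eta=0.1):
    """Linear-interpolation update of the object model; the learning
    rate eta is illustrative."""
    return (1.0 - eta) * prev_lik + eta * new_lik
```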

Localization

Scale estimation: cumulative histogram, adaptive segmentation threshold (see the sketch below).
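One plausible reading of this step, sketched under stated assumptions: pick a segmentation threshold from the cumulative histogram of the likelihood map. The kept fraction and binning are illustrative, not the paper's values:

```python
import numpy as np

def adaptive_threshold(prob_map, keep_frac=0.4, bins=100):
    """Choose a threshold so that roughly keep_frac of the pixels of the
    likelihood map remain foreground after segmentation."""
    hist, edges = np.histogram(prob_map, bins=bins, range=(0.0, 1.0))
    cum = np.cumsum(hist) / hist.sum()
    idx = int(np.searchsorted(cum, 1.0 - keep_frac))
    return edges[min(idx, bins - 1)]
```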

Evaluation
VOT: the Visual Object Tracking benchmark
DAT: the distractor-aware tracker
DATs: scale-adaptive DAT
noDAT: the tracker using only the discriminative object-background model

Results: VOT14 benchmark

Results: VOT13 benchmark

Results: VOT14 benchmark using randomly perturbed initializations

Runtime performance: DAT 17 fps, noDAT 18 fps, DATs 15 fps (PC with a 3.4 GHz Intel CPU, MATLAB).

Thanks