Spatially Supervised Recurrent Neural Networks for Visual Object Tracking. Authors: Guanghan Ning, Zhi Zhang, Chen Huang, Xiaobo Ren, Haohong Wang, Canhui Cai, Zhihai (Henry) He.


Spatially Supervised Recurrent Neural Networks for Visual Object Tracking. Authors: Guanghan Ning, Zhi Zhang, Chen Huang, Xiaobo Ren, Haohong Wang, Canhui Cai, Zhihai (Henry) He

The Problem: Object Tracking
Visual object tracking is the process of localizing a single target in a video or image sequence, given the target's position in the first frame. Its significance lies in two aspects:
It has a wide range of applications, such as motion analysis, activity recognition, surveillance, and human-computer interaction.
It can be a prerequisite or a necessary component of a larger system.
ICCAS 2017: Spatially Supervised Recurrent Neural Networks for Visual Object Tracking

Dataset: Object Tracking Benchmark (OTB)
OTB is one of the most commonly used tracking datasets. Each video is annotated with one or more attributes:
IV: Illumination Variation
SV: Scale Variation
OCC: Occlusion
DEF: Deformation
MB: Motion Blur
FM: Fast Motion
IPR: In-Plane Rotation
OPR: Out-of-Plane Rotation
OV: Out-of-View
BC: Background Clutter
LR: Low Resolution
Figure 1: OTB dataset

Evaluation
1. How do we measure performance? With OPE (one-pass evaluation): run the tracker on a sequence initialized from the ground-truth position in the first frame, and report the average precision or success rate.
2. How are the average precision and success rate calculated? Average precision is the average overlap score over all frames; a frame is a success when its overlap score exceeds a threshold. The overlap score is

S = |r_t ∩ r_a| / |r_t ∪ r_a|

where S is the overlap score, r_t is the tracked bounding box, r_a is the ground-truth bounding box, ∩ and ∪ denote the intersection and union of the two regions, and |·| counts the pixels in a region.
3. How do we evaluate over a range of thresholds? The success plot shows the ratio of successful frames as the threshold varies from 0 to 1. We use the area under the curve (AUC) of each success plot to rank tracking algorithms.
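The metrics above can be sketched in a few lines (an illustrative implementation, not the benchmark's official code; boxes are assumed to be (x, y, w, h) tuples in pixels):

```python
def overlap_score(rt, ra):
    """IoU of two axis-aligned boxes (x, y, w, h): S = |rt ∩ ra| / |rt ∪ ra|."""
    x1 = max(rt[0], ra[0]); y1 = max(rt[1], ra[1])
    x2 = min(rt[0] + rt[2], ra[0] + ra[2]); y2 = min(rt[1] + rt[3], ra[1] + ra[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = rt[2] * rt[3] + ra[2] * ra[3] - inter
    return inter / union if union > 0 else 0.0

def success_rate(scores, threshold):
    """Fraction of frames whose overlap score exceeds the threshold."""
    return sum(s > threshold for s in scores) / len(scores)

def auc(scores, num_thresholds=101):
    """Area under the success plot, sampling thresholds evenly from 0 to 1."""
    ts = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    return sum(success_rate(scores, t) for t in ts) / num_thresholds
```

For example, two identical boxes score S = 1, and a frame whose box only half-overlaps the ground truth counts as a success at threshold 0.3 but not at 0.5.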

Challenges
1. Appearance Variations:
Target deformation
Illumination variation
Scale changes
Background clutter
Fast and abrupt motion
2. Occlusion:
Partial occlusion
Full occlusion
3. Difficulties Introduced by the Camera:
Uneven illumination
Blur
Low resolution
Perspective distortion

Related Works
What are the major related works?
1. Regression-Based Object Detection
YOLO is a regression-based CNN that detects generic objects; it motivated us to investigate the regression capabilities of LSTM.
2. LSTM
LSTM is an RNN module with memory that is temporally deep. We investigate combining CNN and LSTM to interpret high-level visual features both spatially and temporally, and propose to regress those features into object locations.

Related Works
Our proposed network is the first to combine CNN and LSTM for object tracking on real-world datasets. Two prior papers [1, 2] are closely related to this work:
[1] Quan Gan, Qipeng Guo, Zheng Zhang, and Kyunghyun Cho. First step toward model-free, anonymous object tracking with recurrent neural networks. arXiv preprint arXiv:1511.06425, 2015.
Uses a traditional RNN, not LSTM
Focuses on artificially generated sequences and synthesized data, not real-world videos

Related Works
[2] Samira Ebrahimi Kahou, Vincent Michalski, and Roland Memisevic. RATM: Recurrent attentive tracking model. arXiv preprint arXiv:1510.08660, 2015.
Uses a traditional RNN as an attention scheme
In contrast, we directly regress coordinates or heatmaps instead of using sub-region classifiers, and we use the LSTM for end-to-end spatio-temporal regression with a single evaluation.

Related Works
Many recent works [3, 4, 5] have appeared since ours (July 2016 on arXiv). Some [3, 4] extend our proposed YOLO + LSTM scheme with multi-target tracking and reinforcement learning; some [4] appear to be built upon our open-sourced code; some [5] are similar but independent.
[3] Dan Iter, et al. Target Tracking with Kalman Filtering, KNN and LSTMs, December 2016.
[4] Da Zhang, et al. Deep Reinforcement Learning for Visual Object Tracking in Videos, April 2017.
[5] Anton Milan, et al. Online Multi-Target Tracking Using Recurrent Neural Networks, December 2016.

Overview
Figure 2: Flowchart of the proposed algorithm
Motivation:
Interpret the high-level visual features produced by YOLO and regress them into object locations.
Incorporate an LSTM module that also takes into account the temporal flow of these visual features.

Flowchart of the Proposed Algorithm
What is the network? Figure 3: Recurrent YOLO (ROLO)
How do we incorporate YOLO and the recurrent module?
1. Extract high-level visual features with YOLO (a CNN with convolutional and pooling layers)
2. Use a fully connected layer to regress the features into target coordinates/heatmaps for spatial supervision
3. Concatenate these with the visual features
4. Feed the concatenated features into the LSTM modules
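The input construction in steps 2-4 can be sketched as follows (the 4096-d feature size and 4-number box are illustrative assumptions, not necessarily the paper's exact dimensions):

```python
import numpy as np

# Illustrative sketch of the ROLO input construction: per frame, concatenate
# the CNN's visual feature vector with the regressed box coordinates, then
# stack the frames into a sequence for the LSTM.
FEAT_DIM, BOX_DIM = 4096, 4  # assumed sizes for illustration

def lstm_input(visual_feat, box):
    """Concatenate high-level visual features with the regressed location."""
    assert visual_feat.shape == (FEAT_DIM,) and box.shape == (BOX_DIM,)
    return np.concatenate([visual_feat, box])   # shape: (FEAT_DIM + BOX_DIM,)

def sequence_inputs(feats, boxes):
    """Stack per-frame inputs into a (T, FEAT_DIM + BOX_DIM) LSTM sequence."""
    return np.stack([lstm_input(f, b) for f, b in zip(feats, boxes)])
```

This is the "spatial supervision" idea in data terms: the LSTM never sees raw pixels, only the visual features already interpreted by YOLO together with a location estimate for each frame.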

LSTM
Why is LSTM useful in this network?
A traditional Kalman filter takes into account only location histories
The memory of LSTM can store visual dynamics as well as location histories
High-level visual features and locations over frames (the input of the LSTM) are jointly used to regress tracking predictions (the output of the LSTM)
This makes the tracker more robust under occlusion.
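For contrast, here is a minimal constant-velocity Kalman filter over box centers, the kind of location-only baseline described above (a sketch with assumed noise parameters, not the paper's code):

```python
import numpy as np

# A location-only tracker: it predicts from position history alone and has no
# access to visual features, which is exactly the limitation noted above.
class KalmanTracker:
    def __init__(self, x, y, dt=1.0):
        self.s = np.array([x, y, 0.0, 0.0])                    # state: [x, y, vx, vy]
        self.P = np.eye(4)                                     # state covariance
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = dt   # constant-velocity model
        self.H = np.eye(2, 4)                                  # observe position only
        self.Q = 0.01 * np.eye(4)                              # process noise (assumed)
        self.R = 1.0 * np.eye(2)                               # measurement noise (assumed)

    def predict(self):
        self.s = self.F @ self.s
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.s[:2]

    def update(self, z):
        y = np.asarray(z) - self.H @ self.s                    # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)               # Kalman gain
        self.s = self.s + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.s[:2]
```

When the detector fails during occlusion, this filter can only extrapolate the last velocity; the LSTM, by also carrying visual dynamics in its memory, can do better.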

Spatio-temporal Robustness Against Occlusion
Figure 5: Visualization with regression of locations (unseen frames). Green: ROLO; Blue: YOLO; Red: Ground truth.

Spatio-temporal Robustness Against Occlusion
ROLO is effective for several reasons:
(1) the representational power of the high-level visual features from ConvNets,
(2) the feature-interpretation power of LSTM, and therefore the ability to detect visual objects,
(3) spatial supervision by a location or heatmap vector,
(4) the capability of LSTM to regress effectively with spatio-temporal information.

Spatio-temporal Robustness Against Occlusion
The figure shows that ROLO tracks the object through near-complete occlusions. Two similar targets appear in this video, yet ROLO tracks the correct one because the detection module inherently feeds the LSTM unit with a spatial constraint. Between frames 47-60, YOLO fails to detect, but ROLO does not lose the track. The heatmap shows minor noise when no detection is present because the similar target is still in sight.
Figure 6: Visualization with regression of heatmaps (unseen videos)
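A hypothetical sketch of how a regressed heatmap vector could be converted back to a location by taking its peak (the 32x32 grid size and cell-center rule here are assumptions for illustration, not the paper's specification):

```python
import numpy as np

# Recover a target location from a regressed heatmap vector by taking its
# argmax and mapping the winning cell's center back to pixel coordinates.
GRID = 32  # assumed heatmap resolution

def heatmap_peak(heatmap_vec, frame_w, frame_h):
    """Map the argmax of a flattened GRID x GRID heatmap to frame coordinates."""
    hm = heatmap_vec.reshape(GRID, GRID)
    row, col = np.unravel_index(np.argmax(hm), hm.shape)
    # Scale the cell center back to pixel coordinates.
    return ((col + 0.5) * frame_w / GRID, (row + 0.5) * frame_h / GRID)
```

Under this reading, the "minor noise" mentioned above corresponds to small secondary activations in the heatmap that do not displace the peak.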

Results: Tracking Results on the OTB Dataset
Figure 7: Tracking results. Ground-truth bounding rectangles are shown in red, detection results in blue, and the ROLO output from the LSTM modules in green.

Results: Comparing with Other Methods
Due to fast motion, occlusions, and therefore poor detections, YOLO with a Kalman filter performs poorly, as it lacks knowledge of the visual context.
LSTM can regress over both visual context and location histories, and therefore outperforms [YOLO + Kalman].
Figure 8: Success plot. The area under the curve (AUC) score is shown at the top right.

Results: Steps vs. Speed
The frame rate (fps) drops as the number of LSTM steps increases.
Steps vs. Accuracy
Accuracy does not increase monotonically with the number of LSTM steps.

Conclusion
Contributions of this paper:
Our proposed ROLO method extends deep neural network learning and analysis into the spatio-temporal domain.
We have also studied LSTM's capability to interpret and regress high-level visual features.
Our proposed tracker is both spatially and temporally deep, and can effectively tackle major occlusion and severe motion blur.

THANKS!