
Action Recognition and Detection by Combining Motion and Appearance Features
Limin Wang, Yu Qiao, and Xiaoou Tang
The Chinese University of Hong Kong, Hong Kong
Shenzhen Institutes of Advanced Technology, CAS, China

Outline: Introduction, Method, Experimental Results, Conclusions

I. Introduction

Introduction
Action recognition from short, trimmed clips has been widely studied recently (e.g. HMDB51, UCF101), while action recognition and detection in temporally untrimmed videos have received less attention. The THUMOS 14 challenge focuses on this more difficult problem.
State-of-the-art results:
- UCF101 recognition: 87.9%
- THUMOS 14 recognition: 71.0%
- THUMOS 14 detection: 33.6%

Introduction
There are two key problems in this challenge:
- How to temporally segment a continuous video sequence.
- How to represent the resulting video clips for action recognition.
In our current method, we use only a simple segmentation scheme and mainly focus on extracting effective visual representations.

Introduction
What kinds of information are important for understanding actions in video? Video data contains many cues for understanding human actions:
- Motion is an important cue, since humans move in similar ways when performing the same action (motion trajectories).
- Pose is another important cue: humans usually exhibit similar pose configurations when performing the same action.
- Beyond motion and pose, context such as interacting objects and scene category provides extra information that helps action recognition.
These cues can be grouped into dynamic motion cues and static appearance cues.

Related Work
Dynamic motion cues:
- Low-level features: STIPs [Laptev 05], Improved Trajectories [Wang 13], etc.
- Mid-level representations: Motionlets [Wang 13], discriminative patches [Jain et al. 13], motion atoms and phrases [Wang 13], etc.
- High-level representations: Action Bank [Sadanand 12], etc.
From THUMOS 13, it is known that Fisher vectors of improved dense trajectories (IDT) are very effective. In our experience, mid-level representations are complementary to the FV of IDT.

Related Work
- Pose: Poselets [Bourdev 09], mixture of parts [Yang 13], etc.
- Object: deformable part models [Felzenszwalb 10], etc.
- Scene: GIST [Oliva 01], discriminative patches [Singh 12], etc.
Recently, deep CNNs have obtained much better results on these tasks, but they require a large number of training samples with supervised labels.

II. Method

Overview of Our Method
We propose a simple pipeline for action recognition and detection in temporally untrimmed videos. It is composed of three steps: temporal segmentation, clip representation and recognition, and post-processing.

Temporal Segmentation
We use a simple temporal sliding window method. In the current implementation, we use just a single temporal scale for the sliding window (window duration = 150 frames, sliding step = 100 frames).
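A minimal sketch of this single-scale sliding window (Python; the function name, tail-window handling, and example values are illustrative, not the authors' exact code):

```python
def sliding_windows(num_frames, duration=150, step=100):
    """Generate (start, end) frame indices for a single-scale sliding window."""
    windows = []
    start = 0
    while start + duration <= num_frames:
        windows.append((start, start + duration))
        start += step
    # Cover the tail of the video with one extra window if frames remain.
    if not windows or windows[-1][1] < num_frames:
        windows.append((max(0, num_frames - duration), num_frames))
    return windows

# Example: a 500-frame sequence yields windows starting at frames 0, 100, 200, 300, 350.
print(sliding_windows(500))
```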

Clip Representation

Improved Dense Trajectories
Improved dense trajectories (IDT) first estimate and compensate for camera motion, and then extract four descriptors along each trajectory: HOG, HOF, MBHx, and MBHy.
[1] Heng Wang and Cordelia Schmid, Action Recognition with Improved Trajectories, in ICCV 2013.

Bag of Visual Words
There are many choices at each step of the BoVW pipeline, and implementation details are important. Super-vector encodings outperform the alternatives.
[1] X. Peng, L. Wang, X. Wang, Y. Qiao, Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice. CoRR abs/1405.4506, 2014.

Fisher Vector
Given a set of descriptors $X = \{x_1, \dots, x_N\}$, we learn a generative GMM with diagonal covariances: $p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x; \mu_k, \sigma_k^2)$. Given the descriptors of a video clip, we derive the Fisher vector from the gradients of the log-likelihood with respect to the Gaussian means and variances:
$$\mathcal{G}_{\mu_k} = \frac{1}{N\sqrt{\pi_k}} \sum_{n=1}^{N} \gamma_n(k)\, \frac{x_n - \mu_k}{\sigma_k}, \qquad \mathcal{G}_{\sigma_k} = \frac{1}{N\sqrt{2\pi_k}} \sum_{n=1}^{N} \gamma_n(k) \left[ \frac{(x_n - \mu_k)^2}{\sigma_k^2} - 1 \right],$$
where $\gamma_n(k)$ is the posterior probability of Gaussian $k$ for descriptor $x_n$. In our current implementation, we apply power and ℓ2 normalization ($\alpha = 0.5$) to obtain the final representation $S$:
$$S = \operatorname{sign}(z)\,|z|^{\alpha} \ \ \text{(element-wise)}, \qquad S \leftarrow S / \|S\|_2.$$
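As an illustration of this encoding, the sketch below is a minimal NumPy/scikit-learn version with power and ℓ2 normalization; the GMM size in the comment and the helper names are placeholders, not the authors' exact settings:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Encode a set of local descriptors (N x D) as a Fisher vector
    w.r.t. the mean and variance parameters of a diagonal-covariance GMM."""
    N = descriptors.shape[0]
    gamma = gmm.predict_proba(descriptors)           # N x K posteriors
    pi, mu, var = gmm.weights_, gmm.means_, gmm.covariances_
    sigma = np.sqrt(var)                             # K x D standard deviations
    fv = []
    for k in range(gmm.n_components):
        diff = (descriptors - mu[k]) / sigma[k]      # N x D normalized differences
        g = gamma[:, k][:, None]
        grad_mu = (g * diff).sum(axis=0) / (N * np.sqrt(pi[k]))
        grad_sigma = (g * (diff ** 2 - 1)).sum(axis=0) / (N * np.sqrt(2 * pi[k]))
        fv.extend([grad_mu, grad_sigma])
    fv = np.concatenate(fv)
    fv = np.sign(fv) * np.abs(fv) ** 0.5             # power normalization (alpha = 0.5)
    return fv / (np.linalg.norm(fv) + 1e-12)         # l2 normalization

# The GMM is fit beforehand on a subsample of training descriptors, e.g.:
# gmm = GaussianMixture(n_components=256, covariance_type='diag').fit(train_descriptors)
```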

CNN Activation
Network: Conv1 → Conv2 → Conv3 → Conv4 → Conv5 → Full6 → Full7 → SoftMax.
We first resize each frame to 256×256 and then crop 227×227 regions. We use the Caffe implementation of the CNN described by Krizhevsky et al. We extract the Full7 activations as CNN features (4096 dimensions) and perform average pooling over the different crops.
[1] Jia, Y.: Caffe: An open source convolutional architecture for fast feature embedding. (2013)
[2] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1106–1114 (2012)
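A rough sketch of this resize-crop-pool step is given below; `cv2` and the `extract_fc7` forward-pass helper are assumptions standing in for the actual Caffe calls, and the five-crop layout is illustrative:

```python
import numpy as np
import cv2  # assumed available for resizing; any image library works

def frame_feature(frame, extract_fc7, crop_size=227):
    """Resize a frame to 256x256, take four corner crops and a center crop,
    run each through the CNN, and average-pool the 4096-d fc7 activations.
    `extract_fc7` is a placeholder for a Caffe forward pass returning fc7."""
    img = cv2.resize(frame, (256, 256))
    margin = 256 - crop_size
    offsets = [(0, 0), (0, margin), (margin, 0), (margin, margin),
               (margin // 2, margin // 2)]
    feats = []
    for y, x in offsets:
        crop = img[y:y + crop_size, x:x + crop_size]
        feats.append(extract_fc7(crop))              # 4096-d activation per crop
    return np.mean(feats, axis=0)                    # average pooling over crops
```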

CNN Fine-tuning
We fine-tune the CNN parameters on the UCF101 dataset for 50,000 iterations. We extract 10 frames from each video and use the video label as the frame label to fine-tune the ImageNet-pretrained model. The batch size is set to 256 and the dropout ratio to 0.5.
Results on UCF101 (split 1):
- Without fine-tuning: 65.3%
- With fine-tuning: 70.5%
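A minimal sketch of how such a fine-tuning frame list could be prepared; the "path label" line format is the one consumed by Caffe's ImageData layer, while the `videos` structure and output file name are hypothetical:

```python
import numpy as np

def build_finetune_list(videos, frames_per_video=10, out_path='train_frames.txt'):
    """Sample frames uniformly from each training video, assign the video-level
    label to every sampled frame, and write a 'path label' list file.
    `videos` is assumed to be a list of (frame_paths, label) pairs prepared elsewhere."""
    with open(out_path, 'w') as f:
        for frame_paths, label in videos:
            idx = np.linspace(0, len(frame_paths) - 1, frames_per_video).astype(int)
            for i in idx:
                f.write('%s %d\n' % (frame_paths[i], label))
```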

Feature Fusion
For the appearance feature, we extract CNN activations on 15 frames and perform average pooling to obtain the clip representation. For the motion feature, we compute a Fisher vector for each IDT descriptor type independently. Both appearance and motion features are first normalized and then concatenated into a hybrid representation, with fusion weights of 0.4 (appearance) and 1.0 (motion).
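A minimal sketch of this weighted normalize-then-concatenate fusion, assuming both clip-level features are already computed; ℓ2 normalization is an assumption here, since the slide does not specify the norm:

```python
import numpy as np

def fuse(motion_fv, appearance_cnn, w_motion=1.0, w_app=0.4):
    """Normalize each modality, scale by its fusion weight, and concatenate
    into a single hybrid clip representation."""
    m = motion_fv / (np.linalg.norm(motion_fv) + 1e-12)
    a = appearance_cnn / (np.linalg.norm(appearance_cnn) + 1e-12)
    return np.concatenate([w_motion * m, w_app * a])
```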

Classifier
We use linear SVM classifiers for action recognition and detection, trained on the Training and Background datasets. For multi-class classification, we adopt the one-vs-all training scheme. To avoid false detections, we randomly select 4,000 clips from the Background dataset as negative examples when training the SVM for each action class.
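A hedged sketch of this one-vs-all training with sampled background negatives; scikit-learn's LinearSVC stands in for whatever linear SVM package was actually used, and C and the exact composition of the negative set are illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_all(X_pos_by_class, X_background, n_neg=4000, C=1.0):
    """Train one linear SVM per action class: positives are clips of that class,
    negatives are a random sample of background clips."""
    classifiers = {}
    for cls, X_pos in X_pos_by_class.items():
        neg_idx = np.random.choice(len(X_background),
                                   size=min(n_neg, len(X_background)),
                                   replace=False)
        X_neg = X_background[neg_idx]
        X = np.vstack([X_pos, X_neg])
        y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))])
        classifiers[cls] = LinearSVC(C=C).fit(X, y)
    return classifiers
```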

Post-processing
To obtain the recognition result for a whole video sequence, we apply max pooling over the recognition scores of its clips. To suppress false positives in recognition and detection, we simply use three thresholds:
- Clip-level threshold t1: each clip contains at most t1 action instances.
- Sequence-level threshold t2: each sequence contains at most t2 action instances.
- SVM score threshold t3: detection instances with confidence score below t3 are eliminated.
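A minimal sketch of this post-processing under the three thresholds; the function and variable names are illustrative:

```python
import numpy as np

def video_scores_and_detections(clip_scores, t1=1, t2=20, t3=0.0):
    """clip_scores: array of shape (num_clips, num_classes) of SVM scores.
    Recognition: max pooling over clips per class.
    Detection: keep at most t1 classes per clip and t2 detections per video,
    and drop anything scoring below t3."""
    video_scores = clip_scores.max(axis=0)                  # max pooling per class
    detections = []
    for clip_idx, scores in enumerate(clip_scores):
        top = np.argsort(scores)[::-1][:t1]                 # clip-level threshold t1
        for cls in top:
            if scores[cls] >= t3:                           # score threshold t3
                detections.append((clip_idx, cls, scores[cls]))
    detections.sort(key=lambda d: d[2], reverse=True)
    return video_scores, detections[:t2]                    # sequence-level threshold t2
```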

III. Experimental Results

Experimental Results 1
Action recognition on UCF101 (split 1): motion feature 85.3%, appearance feature 70.5%, fused feature 89.1%.
Action recognition on the THUMOS 14 validation set: motion feature 57.1%, appearance feature 48.2%, fused feature 65.3%.

Experimental Results 2
Action recognition on the THUMOS 14 test set for different thresholds (t1, t2):
  (1,10): 0.617    (1,20): 0.6177    (5,10): 0.6196    (5,20): 0.6174    (101,101): 0.6201
Action detection (mAP) on the THUMOS 14 test set for different thresholds (t1, t2):
  Overlap     (1, 0.5)   (1, 0)    (1, -0.5)   (1, -1)
  0.1         0.1080     0.1373    0.1701      0.1818
  0.2         0.1042     0.1319    0.1591      0.1700
  0.3         0.0891     0.1137    0.1306      0.1405
  0.4         0.0765     0.0975    0.1090      0.1174
  0.5         0.0563     0.0695    0.0775      0.0834

IV. Conclusions

Conclusions
We show that motion features (IDT) and appearance features (CNN) are complementary to each other. For the FV of IDT, implementation details such as descriptor pre-processing and normalization have a great influence on final performance. For the CNN features, fine-tuning on the UCF101 dataset helps improve recognition performance. In the future, we may design a more effective segmentation algorithm or perform both tasks simultaneously.

Another Work on Action Detection
We also designed a method that unifies action detection and pose estimation, presented at ECCV 2014: Video Action Detection with Relational Dynamic-Poselets.

Thank You! lmwang.nju@gmail.com
Welcome to our ECCV poster presentations:
- Video Action Detection with Relational Dynamic-Poselets (Session 3B)
- Action Recognition with Stacked Fisher Vectors (Session 3B)
- Boosting VLAD with Supervised Dictionary Learning and High-Order Statistics (Session 2B)