Bangpeng Yao1, Xiaoye Jiang2, Aditya Khosla1,

Presentation transcript:

Human Action Recognition by Learning Bases of Action Attributes and Parts
Bangpeng Yao (1), Xiaoye Jiang (2), Aditya Khosla (1), Andy Lai Lin (3), Leonidas Guibas (1), and Li Fei-Fei (1)
(1) Computer Science Department, Stanford University
(2) Institute for Computational & Mathematical Engineering, Stanford University
(3) Electrical Engineering Department, Stanford University
{bangpeng,aditya86,guibas,feifeili}@cs.stanford.edu, {xiaoye,ydna}@stanford.edu

Action Classification in Still Images
Example action: riding bike.
Directly using low-level features for classification: Grouplet (Yao & Fei-Fei, 2010); multiple kernel learning (Koniusz et al., 2010); spatial pyramid (Delaitre et al., 2010); random forest (Yao et al., 2011).

Action Classification in Still Images
Example action: riding bike.
Human actions are more than just a class label: riding a bike, sitting on a bike seat, wearing a helmet, pedaling the pedals, …
High-level concepts: attributes.
Parts: objects and human poses.
Interactions of attributes & parts.

Attributes & Parts for Classification
Example action: riding bike (riding a bike, sitting on bike seat, wearing a helmet, pedaling the pedal).
Human actions are more than just a class label.
Attributes, objects, and human poses in visual recognition: Farhadi et al., 2009; Lampert et al., 2009; Berg et al., 2010; Parikh & Grauman, 2011; Liu et al., 2011; Gupta et al., 2009; Yao & Fei-Fei, 2010; Torresani et al., 2010; Li et al., 2010; Yao & Fei-Fei, 2010; Yang et al., 2010; Maji et al., 2011.

Benefits of the Attribute & Part Representation
Incorporate more human knowledge;
Produce more descriptive intermediate outputs;
Allow more discriminative classifiers.
(Farhadi et al., 2009; Lampert et al., 2009; Berg et al., 2010; Parikh & Grauman, 2011; Torresani et al., 2010; Li et al., 2010; Maji et al., 2011; Liu et al., 2011)
Attributes and parts carry complementary information and hence improve classification performance.

Challenges We Need to Address
How to model attributes and parts (objects & poses)?
How to model their interactions?
How to eliminate noise or inconsistency in the data? (Examples: unexpected object; errors in detection; object does not appear.)
How to use attributes and parts for recognition?

Outline Attributes and Parts in Human Actions Learning Bases of Attributes and Parts (modeling the interactions) Dataset & Experiments Conclusion

Action Attributes
Semantic descriptions of actions, usually related to verbs: cycling, pedaling, writing, phoning, jumping, …
A discriminative classifier for each attribute.

Action Parts – Objects and Poses
Objects: e.g. a bike detector.
Human poses: poselets (Bourdev & Malik, 2010).
For each part (object or poselet), we have a pre-trained detector (Li et al., 2010; Bourdev & Malik, 2010).

Putting Attributes and Parts Together
[Figure: attribute classification (cycling, pedaling, writing, phoning, …), object detection, and poselet detection each produce confidence scores (low to high); the scores are fed to an SVM classifier.]
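As a minimal sketch of this pipeline, the snippet below joins the three groups of confidence scores into one feature vector and scores it with a linear classifier. All numbers, the example labels in the comments, and the function names `concatenate_scores` and `linear_decision` are hypothetical; the linear scoring function only stands in for a trained SVM and is not the authors' implementation.

```python
def concatenate_scores(attribute_scores, object_scores, poselet_scores):
    """Join the three groups of confidence scores into one feature vector."""
    return list(attribute_scores) + list(object_scores) + list(poselet_scores)


def linear_decision(features, weights, bias):
    """Score of a linear classifier (a stand-in for a trained linear SVM)."""
    return sum(w * x for w, x in zip(weights, features)) + bias


# Hypothetical confidence scores for one image (values invented for illustration).
attribute_scores = [0.9, 0.7, -0.4]   # e.g. cycling, pedaling, writing
object_scores = [0.8, -0.6]           # e.g. two object detectors
poselet_scores = [0.3, -0.1, 0.5]     # e.g. three poselet detectors

features = concatenate_scores(attribute_scores, object_scores, poselet_scores)

# Hypothetical weights; a real system would learn them from training images.
weights = [1.0, 0.5, -1.0, 1.0, -0.5, 0.2, 0.0, 0.2]
score = linear_decision(features, weights, bias=-0.5)
print(len(features), round(score, 2))
```

The feature length is simply the number of attributes plus objects plus poselets, so adding a detector only appends one more entry to the vector.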

Outline Attributes and Parts in Human Actions Learning Bases of Attributes and Parts (modeling the interactions) Dataset & Experiments Conclusion

Bases of Attributes & Parts: Motivation
[Figure: the ideal vector and the real vector of confidence scores (low to high) over cycling, pedaling, writing, phoning, …; the real vector is reconstructed from action bases (sparse) weighted by reconstruction coefficients (sparse).]

Bases of Attributes & Parts: Training
Input: real vectors (N images).
Output: action bases (sparse, M bases) and reconstruction coefficients (sparse).
Objective terms: accurate reconstruction; L1 regularization for sparsity of W; elastic net for sparsity of the action bases (Zou & Hastie, 2005).
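The slide's formula did not survive in the transcript. One way to write an objective with the three terms named on this slide is sketched below. The symbols are assumptions, not taken from the slide: a_i is the real vector of image i, Φ_j is the j-th action basis, w_i is the coefficient vector of image i (a column of W), and λ and γ are regularization weights.

```latex
\min_{\Phi,\,W}\ \sum_{i=1}^{N}\Big(\tfrac{1}{2}\,\lVert a_i - \Phi w_i\rVert_2^2
  + \lambda\,\lVert w_i\rVert_1\Big)
\quad\text{s.t.}\quad
\lVert \Phi_j\rVert_1 + \tfrac{\gamma}{2}\,\lVert \Phi_j\rVert_2^2 \le 1,
\qquad j = 1,\dots,M
```

The squared error is the accurate-reconstruction term, the L1 norm on w_i encourages sparse coefficients, and the constraint on each Φ_j is an elastic-net penalty that encourages sparse bases.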

Bases of Attributes & Parts: Testing
Input: real vector and the learned action bases (sparse, M bases).
Output: reconstruction coefficients (sparse).
Objective terms: accurate reconstruction; L1 regularization for sparsity of W.
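With the bases fixed, finding the coefficients is an L1-regularized least-squares problem. The slides do not say which solver the authors used; the sketch below uses ISTA (iterative shrinkage-thresholding), one standard choice. The function names, the default `lam`, and the iteration count are illustrative assumptions.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t (the proximal operator of t * |x|)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def sparse_code(a, bases, lam=0.1, n_iter=200):
    """Approximate argmin_w 0.5 * ||a - sum_j w[j] * bases[j]||^2 + lam * ||w||_1.

    a     : list of floats, the observed ("real") score vector
    bases : list of M basis vectors, each as long as a; they stay fixed
    """
    dim = len(a)
    # 1 / ||bases||_F^2 is a conservative step size that keeps ISTA stable.
    step = 1.0 / sum(v * v for b in bases for v in b)
    w = [0.0] * len(bases)
    for _ in range(n_iter):
        # Residual of the current reconstruction.
        recon = [sum(w[j] * bases[j][d] for j in range(len(bases))) for d in range(dim)]
        resid = [r - x for r, x in zip(recon, a)]
        # Gradient step on the squared error, then shrinkage for the L1 term.
        grad = [sum(b[d] * resid[d] for d in range(dim)) for b in bases]
        w = [soft_threshold(wj - step * g, step * lam) for wj, g in zip(w, grad)]
    return w


# Toy check with three orthonormal bases: small entries are zeroed out.
toy_bases = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_w = sparse_code([1.0, 0.05, -0.5], toy_bases, lam=0.1)
print(toy_w)
```

In the toy case the weak middle entry falls below the threshold and becomes exactly zero, which is the noise-suppressing behaviour that the L1 term is meant to provide.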

Bases of Attributes & Parts: Benefits
[Figure: real vector, ideal vector, action bases (sparse), and reconstruction coefficients (sparse) over cycling, pedaling, writing, phoning, …; the coefficients are fed to an SVM classifier.]
Co-occurrence context;
Reduce noise;
Improve performance.

Outline Attributes and Parts in Human Actions Learning Bases of Attributes and Parts (modeling the interactions) Datasets & Experiments Conclusion

PASCAL VOC 2010 Action Dataset
9 classes, 50-100 training / testing images per class.
Slide credit: Ivan Laptev

PASCAL VOC 2010 Action Dataset
Average precision (%):

Method            Phoning  Playing     Reading  Riding  Riding  Running  Taking  Using     Walking
                           instrument           bike    horse            photo   computer
SURREY_MK         52.6     53.5        35.9     81.0    89.3    86.5     32.8    59.2      68.6
UCLEAR_DOSP       47.0     57.8        26.9     78.8    89.7    87.3     32.5    60.0      70.1
POSELETS          49.6     43.2        27.7     83.7    89.4    85.6     31.0    59.1      67.9
Ours Conf_Score   49.5     56.6        31.4     82.3    (*)     (*)      36.1    67.7      73.0
Ours Sparse_Base  42.8     60.8        41.5     80.2    90.6    87.8     41.4    66.1      74.4

(*) The transcript gives only eight values for Ours Conf_Score. The value 87.0 belongs to either Riding horse or Running and the other value is missing; the placement of the remaining values is inferred from the neighbouring rows.

SURREY_MK, UCLEAR_DOSP: best results from the challenge.
POSELETS: results from Maji et al., 2011.
Ours Conf_Score: concatenating attribute classification and part detection scores.
Ours Sparse_Base: using the reconstruction coefficients as the input of SVM classifiers.
14 attributes, trained from the trainval images; 27 objects, taken from Li et al., NIPS 2010; 150 poselets, taken from Bourdev & Malik, ICCV 2009.
[Figure: visualizations of several of the 400 action bases, each shown as its attributes (e.g. riding, using, sitting, phoning), objects, and poselets.]
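To compare the methods with a single number, one can average the nine per-class values of each row. The sketch below does this for the four rows that are complete in the transcript; Ours Conf_Score is left out because one of its values is missing. The variable and function names are illustrative.

```python
# Per-class average precision (%) in the order: phoning, playing instrument,
# reading, riding bike, riding horse, running, taking photo, using computer,
# walking. Only rows with all nine values in the transcript are included.
results = {
    "SURREY_MK": [52.6, 53.5, 35.9, 81.0, 89.3, 86.5, 32.8, 59.2, 68.6],
    "UCLEAR_DOSP": [47.0, 57.8, 26.9, 78.8, 89.7, 87.3, 32.5, 60.0, 70.1],
    "POSELETS": [49.6, 43.2, 27.7, 83.7, 89.4, 85.6, 31.0, 59.1, 67.9],
    "Ours Sparse_Base": [42.8, 60.8, 41.5, 80.2, 90.6, 87.8, 41.4, 66.1, 74.4],
}


def mean_ap(values):
    """Mean of the per-class average precisions."""
    return sum(values) / len(values)


mean_aps = {name: mean_ap(values) for name, values in results.items()}
for name, value in mean_aps.items():
    print(f"{name}: {value:.1f}")
```

On these rows the Sparse_Base mean is the highest, even though it is not the best method on every individual class.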

PASCAL VOC 2011 Action Dataset

Class               Others' best in comp9  Others' best in comp10  Our method
Jumping             71.6                   59.5                    66.7
Phoning             50.7                   31.3                    41.1
Playing instrument  77.5                   45.6                    60.8
Reading             37.8                   27.8                    42.2
Riding bike         88.8                   84.4                    90.5
Riding horse        90.2                   88.3                    92.2
Running             87.9                   77.6                    86.2
Taking photo        25.7                   31.0                    28.8
Using computer      58.9                   47.4                    63.5
Walking             (*)                    (*)                     (*)

(*) The transcript gives only two values for Walking (57.6 and 64.2), so their columns cannot be recovered from the text.

Our method ranks first in nine out of ten classes in comp10.
Our method achieves the best performance in five out of ten classes if we consider both comp9 and comp10.
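The two ranking claims can be partly checked from the numbers that survive. The sketch below counts wins over the nine classes with all three values; Walking is omitted because one of its values is missing from the transcript. The variable names are illustrative.

```python
# (others' best in comp9, others' best in comp10, our method) per class.
# Walking is left out: the transcript gives only two of its three values.
rows = {
    "Jumping": (71.6, 59.5, 66.7),
    "Phoning": (50.7, 31.3, 41.1),
    "Playing instrument": (77.5, 45.6, 60.8),
    "Reading": (37.8, 27.8, 42.2),
    "Riding bike": (88.8, 84.4, 90.5),
    "Riding horse": (90.2, 88.3, 92.2),
    "Running": (87.9, 77.6, 86.2),
    "Taking photo": (25.7, 31.0, 28.8),
    "Using computer": (58.9, 47.4, 63.5),
}

# Classes where our method beats the best other comp10 entry.
wins_comp10 = [c for c, (_, comp10, ours) in rows.items() if ours > comp10]
# Classes where our method beats the best other entry in both competitions.
best_overall = [c for c, (comp9, comp10, ours) in rows.items()
                if ours > max(comp9, comp10)]

print(len(wins_comp10), len(best_overall))
```

Among these nine classes the method wins eight against comp10 and four overall, which agrees with the stated nine and five if Walking is also a win in both comparisons.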

Stanford 40 Actions
http://vision.stanford.edu/Datasets/40actions.html
Applauding, Blowing bubbles, Brushing teeth, Calling, Cleaning floor, Climbing wall, Cooking, Cutting trees, Cutting vegetables, Drinking, Feeding horse, Fishing, Fixing bike, Gardening, Holding umbrella, Jumping, Playing guitar, Playing violin, Pouring liquid, Pushing cart, Reading, Repairing car, Riding bike, Riding horse, Rowing, Running, Shooting arrow, Smoking cigarette, Taking photo, Texting message, Throwing frisbee, Using computer, Using microscope, Using telescope, Walking dog, Washing dishes, Watching television, Waving hands, Writing on board, Writing on paper

Stanford 40 Actions
40 actions, the largest number of action classes.
Opportunity to study the relationships between actions (examples: washing dishes, cutting vegetables; fixing a bike, riding a bike; writing on a board, writing on paper).
9532 images from Google and Flickr, the largest action dataset.
Large pose variation and background clutter.
Bounding-box annotations of humans.
Upper body visible, making it possible to explore human poses.
More annotations are coming ...

Stanford 40 Actions
We use 45 attributes, 81 objects, and 150 poselets.
We compare our method with the Locality-constrained Linear Coding baseline (LLC; Wang et al., CVPR 2010).
[Figure: average precision.]

Stanford 40 Actions
[Figure: average precision.]
Compare with PASCAL VOC 2011 results: Riding horse 92.2; Riding bike 90.5; Running 86.2; Jumping 66.7; Using computer 63.5; Reading 42.2; Phoning 41.1; Taking photo 28.8.

Stanford 40 Actions
[Figure: average precision, annotated "Poses are relatively consistent" and "Very large pose variation".]

Outline Attributes and Parts in Human Actions Learning Bases of Attributes and Parts (modeling the interactions) Dataset & Experiments Conclusion

Conclusion
[Figure: recap of the approach: real vector, ideal vector, sparse action bases, and sparse reconstruction coefficients over cycling, pedaling, writing, phoning, …]

Acknowledgement