Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions. Bangpeng Yao and Li Fei-Fei, Computer Science Department, Stanford University.

Presentation transcript:

Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions. Bangpeng Yao and Li Fei-Fei, Computer Science Department, Stanford University.

Human-Object Interaction: playing saxophone (human, saxophone) vs. not playing saxophone.

Human-Object Interaction: robots interact with objects; automatic sports commentary (“Kobe is dunking the ball.”); medical care.

Background: Human-Object Interaction Schneiderman & Kanade, 2000 Viola & Jones, 2001 Huang et al, 2007 Papageorgiou & Poggio, 2000 Wu & Nevatia, 2005 Dalal & Triggs, 2005 Mikolajczyk et al, 2005 Leibe et al, 2005 Bourdev & Malik, 2009 Felzenszwalb & Huttenlocher, 2005 Ren et al, 2005 Ramanan, 2006 Ferrari et al, 2008 Yang & Mori, 2008 Andriluka et al, 2009 Eichner & Ferrari, 2009 Lowe, 1999 Belongie et al, 2002 Fergus et al, 2003 Fei-Fei et al, 2004 Berg & Malik, 2005 Felzenszwalb et al, 2005 Grauman & Darrell, 2005 Sivic et al, 2005 Lazebnik et al, 2006 Zhang et al, 2006 Savarese et al, 2007 Lampert et al, 2008 Desai et al, 2009 Gehler & Nowozin, 2009 Murphy et al, 2003 Hoiem et al, 2006 Shotton et al, 2006 Rabinovich et al, 2007 Heitz & Koller, 2008 Divvala et al, 2009 Gupta et al, context vs. To be done Yao & Fei-Fei, 2010a Yao & Fei-Fei, 2010b

Outline:
- Intuition of Grouplet Representation
- Grouplet Feature Representation
- Using Grouplet for Recognition
- Dataset & Experiments
- Conclusion

Recognizing Human-Object Interaction is Challenging. Reference image: playing saxophone. Compared with the reference image:
- Different background
- Different lighting
- Different pose (or viewpoint)
- Same object (saxophone), different interactions
- Different instrument, similar pose

Grouplet: our intuition
- Bag-of-words: Thomas & Malik, 2001; Csurka et al, 2004; Fei-Fei & Perona, 2005; Sivic et al, 2005
- Spatial pyramid: Grauman & Darrell, 2005; Lazebnik et al, 2006
- Part-based: Weber et al, 2000; Fergus et al, 2003; Leibe et al, 2004; Felzenszwalb et al, 2005; Bourdev & Malik, 2009
Grouplet Representation:

Grouplet: our intuition. Grouplet Representation:
- Part-based configuration
- Co-occurrence
- Discriminative
- Dense
Capture the subtle difference in human-object interactions.

Grouplet representation (e.g. 2-grouplet). Notations:
I: image.
P: reference point in the image.
Λ: grouplet.
λ_i: feature unit.
- A_i: visual codeword (from the set of visual codewords);
- x_i: image location;
- σ_i: variance of the spatial distribution (Gaussian distribution).
ν(Λ,I): matching score between Λ and I.
ν(λ_i,I): matching score between λ_i and I (terms: codeword assignment score, Gaussian density value).
For an image patch:
- a′: its visual appearance;
- x′: its image location.
Ω(x): image neighborhood of x.
Δ: a small shift of the location.
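The transcript keeps the labels of the matching-score formulas but not the formulas themselves. The following is a minimal sketch of one form that is consistent with those labels, not necessarily the authors' exact definition. It assumes that ν(λ_i,I) sums, over patches in Ω(x_i), the codeword assignment score times the Gaussian density value, and that ν(Λ,I) takes the minimum over feature units and the maximum over a few small shifts Δ. The function names, the `radius` parameter, and the patch data layout are hypothetical.

```python
import math

def gaussian_density(pos, mean, var):
    """Isotropic 2-D Gaussian density at pos, with the given mean and variance."""
    dx, dy = pos[0] - mean[0], pos[1] - mean[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * var)) / (2.0 * math.pi * var)

def unit_score(unit, patches, radius, shift=(0.0, 0.0)):
    """Assumed form of nu(lambda_i, I).

    unit    = (codeword A_i, location x_i relative to P, variance sigma_i)
    patches = list of (assignment, location), where assignment maps a
              codeword to its soft assignment score for that patch.
    Sums assignment score * Gaussian density over patches within
    `radius` of the (shifted) unit location, i.e. over Omega(x_i).
    """
    codeword, loc, var = unit
    centre = (loc[0] + shift[0], loc[1] + shift[1])
    total = 0.0
    for assignment, pos in patches:
        if math.dist(pos, centre) <= radius:
            total += assignment.get(codeword, 0.0) * gaussian_density(pos, centre, var)
    return total

def grouplet_score(grouplet, patches, radius, shifts=((0.0, 0.0),)):
    """Assumed form of nu(Lambda, I): every unit must match (min over units),
    under the best of a few small shifts Delta (max over shifts)."""
    return max(
        min(unit_score(unit, patches, radius, shift) for unit in grouplet)
        for shift in shifts
    )
```

Taking the minimum over units encodes co-occurrence: a grouplet scores high only when all of its feature units are present at roughly the expected locations.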

Grouplet representation: part-based configuration, co-occurrence, discriminative. [Figure: example matching scores on images of playing saxophone and of other interactions; values that survive in the transcript: 0.4, 0.0, 0.1.]

Grouplet representation: part-based configuration, co-occurrence, discriminative, dense.
- All possible codewords
- Densely sampled image locations
- Many possible spatial distributions
- All possible combinations of feature units: 1-grouplet, 2-grouplet, 3-grouplet

A “Space” of Grouplets:
- Playing violin vs. other interactions
- Playing saxophone vs. other interactions
- On background
- Shared by different interactions

We only need discriminative grouplets: large ν(Λ,I) on the target interaction (playing violin, playing saxophone) and small ν(Λ,I) on other interactions, as opposed to grouplets on the background or shared by different interactions. Number of feature units: N, where N is large (192200). Number of grouplets: 2^N, a very large space.

Obtaining discriminative grouplets for a class:
- Obtain grouplets with large ν(Λ,I) on the class.
- Remove grouplets that also have large ν(Λ,I) on other classes.
Apriori mining [Agrawal & Srikant, 1994]: selected 1-grouplets are combined into candidate 2-grouplets. Number of feature units: N, where N is large (192200). Number of grouplets: 2^N, a very large space. To mine 1000~2000 grouplets, we only need to evaluate (2~100)×N grouplets.
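The level-wise search above can be sketched as follows. This is a simplified illustration of Apriori-style mining, not the authors' implementation. It assumes each image has been reduced to the set of feature units whose matching score exceeds some threshold, and it uses hypothetical names (`support`, `mine_grouplets`) and hypothetical thresholds (`min_pos`, `max_neg`).

```python
from itertools import combinations

def support(grouplet, images):
    """Fraction of images in which every unit of the grouplet fires."""
    if not images:
        return 0.0
    return sum(grouplet <= img for img in images) / len(images)

def mine_grouplets(pos_images, neg_images, min_pos=0.5, max_neg=0.2, max_size=3):
    """Level-wise (Apriori-style) search for discriminative grouplets.

    pos_images / neg_images: lists of sets of feature-unit ids that fire
    on images of the class / of other classes (an assumed binarisation).
    """
    units = sorted(set().union(*pos_images))
    # Level 1: 1-grouplets that fire often enough on the class.
    level = [frozenset([u]) for u in units
             if support(frozenset([u]), pos_images) >= min_pos]
    selected, size = [], 1
    while level and size <= max_size:
        # Keep grouplets that rarely fire on other classes.
        selected += [g for g in level if support(g, neg_images) <= max_neg]
        # Candidates of size+1 are unions of two frequent grouplets whose
        # size-subsets are all frequent (the Apriori pruning step).
        frequent = set(level)
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == size + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, size))}
        level = sorted((c for c in candidates
                        if support(c, pos_images) >= min_pos), key=sorted)
        size += 1
    return selected
```

Because a grouplet can only be frequent if all of its sub-grouplets are frequent, each level evaluates only candidates built from the previous level's survivors, which is why the search never has to enumerate the full 2^N space.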

Using Grouplets for Classification: discriminative grouplets → SVM.

People-Playing-Musical-Instruments (PPMI) Dataset ml
PPMI+ and PPMI- images, shown as original images and normalized images (200 normalized images for each interaction).
Number of original images: (172) (164) (191) (148) (177) (133) (179) (149) (200) (188) (198) (169) (185) (167)

Recognition Tasks on the People-Playing-Musical-Instruments (PPMI) Dataset
- Classification: playing different instruments; playing vs. not playing (e.g. playing violin vs. not playing violin).
- Detection (example labels: playing saxophone, playing bassoon, playing French horn).
For each interaction, 100 training and 100 testing images.

Classification: Playing Different Instruments. 7-class classification on PPMI+ images.
- SPM: [Lazebnik et al, 2006]
- DPM: [Felzenszwalb et al, 2008]
- Constellation: [Fergus et al, 2003] [Niebles & Fei-Fei, 2007]

Classifying Playing vs. Not playing: seven 2-class classification problems; PPMI+ vs. PPMI- for each instrument. [Figure: average PPMI+ images and average PPMI- images, with SPM shown, for bassoon, erhu, flute, French horn, saxophone, violin, and guitar.]

Detecting people playing musical instruments. Procedure:
- Face detection with a low threshold;
- Crop and normalize image regions;
- 8-class classification: 7 classes of playing instruments, plus another class of not playing any instrument.
Example labels: playing saxophone, not playing.

Detecting people playing musical instruments. [Figure: example detections labeled playing saxophone, playing bassoon, and playing French horn, plus examples of a false detection and a missed detection.] Area under the precision-recall curve: our method: 45.7%; spatial pyramid: 37.3%.
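For readers unfamiliar with the metric, the sketch below shows one common way to compute the area under a precision-recall curve from ranked detections (the step-wise sum usually called average precision). The slides do not say which exact protocol was used, so this is an illustration only; the function name `pr_area` is hypothetical.

```python
def pr_area(scores, labels):
    """Area under the precision-recall curve, as a step-wise sum.

    scores: detection confidences; labels: 1 for a correct detection,
    0 for a false one. Each correct detection adds
    (precision at its rank) * (recall gained by it).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(1 for _, label in ranked if label)
    if n_pos == 0:
        return 0.0
    true_pos = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_pos += 1
            area += (true_pos / rank) / n_pos
    return area
```

A detector that ranks every correct detection above every false one reaches an area of 1.0 under this definition; false detections ranked early pull the value down.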

Examples of Mined Grouplets: playing bassoon; playing saxophone; playing violin; playing guitar.

Conclusion
- Holistic image-based classification: [B. Yao and L. Fei-Fei. “Grouplet: A structured image representation for recognizing human and object interactions.” CVPR 2010.]
- vs. detailed understanding and reasoning (pose estimation & object detection), the next talk: [B. Yao and L. Fei-Fei. “Modeling mutual context of object and human pose in human-object interaction activities.” CVPR 2010.]
Example labels: playing saxophone, playing bassoon.

Thanks to Juan Carlos Niebles, Jia Deng, Jia Li, Hao Su, Silvio Savarese, and the anonymous reviewers. And you.