Yao, B., and Fei-Fei, L. IEEE Transactions on PAMI (2012)

Presentation transcript:

Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses
Yao, B., and Fei-Fei, L. IEEE Transactions on PAMI (2012)
Date: 2013/05/27. Instructor: Prof. Wang, Sheng-Jyh. Student: Hung, Fei-Fan.

Outline
Introduction: intuition and goal
Model Representation
Model Learning: obtaining atomic poses; training detectors and classifiers; estimating model parameters
Model Inference
Experimental Results
Conclusion

Why use context in computer vision?
Simple images vs. human activities.
[Figure: results without context compared with results with mutual context; annotation: ~3-4%]

Challenges in Human Pose Estimation
Human pose estimation is challenging: difficult part appearance, self-occlusion, and image regions that look like a body part.
Object detection facilitates human pose estimation.

Challenges in Object Detection
Object detection is challenging: objects can be small, low-resolution, or partially occluded, and image regions can resemble the detection target.
Human pose estimation facilitates object detection.

The Goal
To build a mutual context model for human-object interaction (HOI) activities.

Model representation
Modeling the mutual context of objects and human poses with a conditional random field without hidden variables.
A: activity, e.g. croquet shot, volleyball smash, tennis forehand.
O: objects, e.g. tennis ball, croquet mallet, volleyball, tennis racket; $O^1, O^2, \ldots, O^M$, where M is the number of bounding boxes.
H: human pose; an activity A can contain more than one atomic pose H.
P: body parts, $P^1, P^2, \ldots, P^L$.

Model representation
[Figure: graphical model linking activity A, human pose H, body parts P1, P2, ..., PL, and objects O1, O2]
$\phi_1$: co-occurrence compatibility between A, O, and H
$\phi_2$: spatial relationship between O and H
$\phi_3$ to $\phi_5$: modeling the image evidence with detectors or classifiers
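As a sketch of how the five potentials fit together, and assuming they combine additively into the overall score $\Psi(A, O, H, I)$ that the inference slides optimize, the model can be written as:

```latex
\Psi(A, O, H, I) = \phi_1(A, O, H) + \phi_2(O, H) + \phi_3(O, I) + \phi_4(H, I) + \phi_5(A, I)
```

The argument lists follow the slide descriptions: $\phi_1$ and $\phi_2$ couple the labels to each other, while $\phi_3$ to $\phi_5$ tie each label to the image evidence I.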

$\phi_1$: Co-occurrence context
Co-occurrence between all A, O, and H.
$\varsigma_{i,j,k}$: strength of the co-occurrence interaction between $h_i$, $o_j$, and $a_k$
$\mathbf{1}(\cdot)$: indicator function
$N_h$: total number of atomic poses
$N_o$: total number of objects
$N_a$: total number of activity classes
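The slide's own formula did not survive in the transcript. From the definitions above, the co-occurrence potential can plausibly be reconstructed as a sum of indicator products weighted by $\varsigma$. The exact form below, in particular the sum over the M bounding boxes and the superscript in $O^m$, is an assumption:

```latex
\phi_1(A, O, H) = \sum_{i=1}^{N_h} \sum_{m=1}^{M} \sum_{j=1}^{N_o} \sum_{k=1}^{N_a}
  \mathbf{1}(H = h_i) \, \mathbf{1}(O^m = o_j) \, \mathbf{1}(A = a_k) \, \varsigma_{i,j,k}
```

Only the terms whose indicators all equal 1 contribute, so the potential adds up the co-occurrence strengths of the labels that are actually assigned.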

$\phi_2$: Spatial context
Spatial relationship between all O and different H.
$\mathbf{x}_I^l$: location of the l-th body part in image I
$b(\mathbf{x}_I^l, O^m)$: a sparse binary vector that encodes the location of $O^m$ relative to $\mathbf{x}_I^l$
$\lambda_{i,j,l}$: weight of $b(\mathbf{x}_I^l, O^m)$ when $O^m = o_j$

$\phi_3$: Modeling objects
Model the objects O in image I using object detection scores.
$O^m$: the object in the m-th bounding box
For every object O:
$g(O^m)$: vector of detection scores for $O^m$
$\gamma_j$: weight of $g(O^m)$ when $O^m = o_j$
Between $O^m$ and $O^{m'}$:
$b(O^m, O^{m'})$: binary feature vector that encodes the spatial relationship between boxes m and m'
$\gamma_{j,j'}$: weight for the geometric configuration between $o_j$ and $o_{j'}$

$\phi_4$: Modeling human pose
Model the atomic pose that H belongs to and the likelihood $P(I \mid h_i)$.
$P(\mathbf{x}_I^l \mid \mathbf{x}_{h_i}^l)$: Gaussian likelihood of a body part's location given the atomic pose
$f^l(I)$: vector of scores for detecting the body part at $\mathbf{x}_I^l$

$\phi_5$: Modeling activity
Model the HOI activity by training an activity classifier.
$s(I)$: $N_a$-dimensional output of a one-versus-all (OVA) discriminative classifier that takes the image as features
$\eta_k$: feature weight for activity $a_k$

One-versus-all classifier
[Figure: OVA (one-versus-all) and OVO (one-versus-one) classification schemes]
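To make the OVA/OVO contrast concrete, here is a minimal Python sketch. The function names and the toy inputs are hypothetical, and the binary classifiers themselves are assumed to be trained elsewhere. OVA trains one "class k vs. rest" scorer per class and picks the highest score. OVO trains one classifier per pair of classes and takes a majority vote.

```python
from itertools import combinations


def ova_predict(scores):
    """scores[k] is the output of the 'class k vs. rest' classifier."""
    return max(range(len(scores)), key=lambda k: scores[k])


def ovo_predict(pair_winner, n_classes):
    """pair_winner[(a, b)] (with a < b) is the class chosen by the 'a vs. b' classifier."""
    votes = [0] * n_classes
    for a, b in combinations(range(n_classes), 2):
        votes[pair_winner[(a, b)]] += 1
    return max(range(n_classes), key=lambda k: votes[k])


def n_binary_classifiers(n_classes):
    """Number of binary classifiers each scheme needs."""
    return {"ova": n_classes, "ovo": n_classes * (n_classes - 1) // 2}
```

With OVA, the vector of per-class scores has exactly one entry per activity, which matches the $N_a$-dimensional output $s(I)$ described for $\phi_5$.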

Model Properties
Spatial context between O and H.
Object detection and human pose estimation facilitate each other.
Objects and body parts that are unreliable are ignored.
Flexible enough to extend to large-scale datasets and other activities.
The joint model can share all objects and atomic poses.

Model Learning
Assign each human pose to an atomic pose, train the detectors and classifiers, then estimate the parameters by maximum likelihood.

Obtaining Atomic Poses
Use clustering to obtain the atomic poses.
Normalize the annotations $\mathbf{x}^1, \mathbf{x}^2, \ldots, \mathbf{x}^L$.
Fill in missing parts using the nearest visible neighbor.
Obtain a set of atomic poses by hierarchical clustering with the maximum linkage measure, using the distance $\sum_{l=1}^{L} \mathbf{w}^T |\mathbf{x}_i^l - \mathbf{x}_j^l|$.
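The clustering step can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation. The names `pose_distance` and `complete_linkage` and the stopping parameter `n_clusters` are hypothetical, and each pose is assumed to be a list of already normalized part coordinates with no missing parts.

```python
def pose_distance(xi, xj, w):
    """Sum over parts l of w^T |x_i^l - x_j^l|; each part is a tuple of coordinates."""
    return sum(
        sum(wd * abs(a - b) for wd, a, b in zip(w, part_i, part_j))
        for part_i, part_j in zip(xi, xj)
    )


def complete_linkage(poses, w, n_clusters):
    """Agglomerative clustering with maximum (complete) linkage.

    Returns a list of clusters, each a list of indices into `poses`.
    """
    clusters = [[i] for i in range(len(poses))]

    def linkage(c1, c2):
        # Maximum linkage: distance between the two farthest members.
        return max(pose_distance(poses[a], poses[b], w) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]  # merge the closest pair
        del clusters[q]
    return clusters
```

Each resulting cluster would then stand for one atomic pose, and every training pose is assigned to the cluster it falls in.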

Training Detectors and Classifiers
$g(O^m)$: object detector in $\phi_3(O, I)$, trained as a deformable part model
$f^l(I)$: human body part detector in $\phi_4(H, I)$, trained as a deformable part model
$s(I)$: overall activity classifier in $\phi_5(A, I)$, trained with spatial pyramid matching (SPM) on SIFT features with a 3-level image pyramid

Spatial pyramid matching
SIFT features, 3-level pyramid, histogram intersection kernel.
S. Lazebnik, C. Schmid, and J. Ponce, “Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2006.
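A minimal sketch of the histogram intersection kernel and the level weighting used in spatial pyramid matching is shown below. It assumes the per-level histograms of quantized SIFT features have already been computed and concatenated over the cells of each level. The function names and the toy input layout are hypothetical.

```python
def histogram_intersection(h1, h2):
    """Sum of bin-wise minima of two histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def pyramid_match(levels_x, levels_y):
    """levels_x[l] is the concatenated histogram of image x at pyramid level l.

    Level 0 covers the whole image; finer levels get larger weights:
    1 / 2^L for level 0 and 1 / 2^(L - l + 1) for levels l >= 1.
    """
    top = len(levels_x) - 1  # L, the index of the finest level
    total = 0.0
    for l, (hx, hy) in enumerate(zip(levels_x, levels_y)):
        weight = 1.0 / 2 ** top if l == 0 else 1.0 / 2 ** (top - l + 1)
        total += weight * histogram_intersection(hx, hy)
    return total
```

For a 3-level pyramid (levels 0, 1, 2) the weights come out as 1/4, 1/4, and 1/2, so matches found in the finest cells count the most.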

Spatial pyramid matching

Estimating Model Parameters
Estimate $\varsigma$, $\lambda$, $\gamma$, $\alpha$, and $\beta$ using a maximum likelihood (ML) approach with a zero-mean Gaussian prior.
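One common way to write such an objective is sketched below. It is not taken from the slides and rests on several assumptions: the model defines a conditional distribution proportional to $\exp \Psi$, $\theta$ collects all weights being estimated, n indexes the training images, and $\sigma^2$ is a hypothetical prior variance.

```latex
\theta^{*} = \arg\max_{\theta} \; \sum_{n} \log p(A_n, O_n, H_n \mid I_n; \theta)
  \; - \; \frac{\lVert \theta \rVert^2}{2\sigma^2},
\qquad
p(A, O, H \mid I; \theta) \propto \exp \Psi(A, O, H, I)
```

The quadratic term is the log of the zero-mean Gaussian prior up to a constant, and it acts as a regularizer on the weights.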

Learning result

Model Inference
For a new image, initialize with the learned results, then update the human body parts, the object detection results, and the A and H labels in turn.

Initialization
For a new image, initialize with the learned results:
A (activity classification): SPM classification
O (object detection): object detection
H (human pose estimation): pictorial structure model

Update model inference: update human body parts
Marginal distribution of the human pose: $\{p(H = h_i)\}_{i=1}^{N_h}$, which shows which human pose H has the highest probability.
Use a mixture of Gaussians to refine the prior of each body part: the components $\{\mathcal{N}(\mathbf{x}_{h_i}^l)\}_{i=1}^{N_h}$ are combined as $\sum_{i=1}^{N_h} p(H = h_i) \, \mathcal{N}(\mathbf{x}_{h_i}^l)$.
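The refined location prior can be illustrated with a short Python sketch. It assumes 2D part locations and, purely for simplicity, isotropic Gaussians with one shared standard deviation `sigma`. The slides do not state the covariance structure, and the function names are hypothetical.

```python
import math


def gaussian_2d(x, mean, sigma):
    """Isotropic 2D Gaussian density (an assumed, simplified covariance)."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    return math.exp(-(dx * dx + dy * dy) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)


def part_location_prior(x, pose_probs, part_means, sigma):
    """Mixture prior for one body part: sum_i p(H = h_i) * N(x; mean of the part in pose h_i)."""
    return sum(p * gaussian_2d(x, m, sigma) for p, m in zip(pose_probs, part_means))
```

Locations that are typical for a likely atomic pose receive a higher prior, which is how the current pose estimate feeds back into body part detection.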

Update model inference: update object detection results
The potentials that involve O are those over (O, H), (O, A, H), and (O, I).
Greedy forward search method:
Initialize score(m, j), the score of labeling the m-th object bounding box as object $o_j$, with no object assigned to any bounding box.
Select $(m^*, j^*) = \arg\max_{(m,j)} \text{score}(m, j)$.
Label box $m^*$ as $o_{j^*}$ and update score(m, j).
Stop when $\text{score}(m^*, j^*) < 0$.
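The greedy forward search can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' code. `unary[m][j]` stands for the initial score of labeling box m as object j, and `pairwise` is a hypothetical callback that returns how much labeling one box changes the score of a candidate label for another box.

```python
def greedy_forward_search(unary, pairwise):
    """Greedily label bounding boxes until no candidate has a non-negative score.

    unary[m][j]: score of labeling box m as object j when no box is labeled yet.
    pairwise(m, j, m2, j2): change to the score of (m2, j2) once box m is labeled j.
    Returns a dict mapping box index to object index; unlisted boxes hold no object.
    """
    score = {
        (m, j): unary[m][j]
        for m in range(len(unary))
        for j in range(len(unary[m]))
    }
    labels = {}
    while score:
        (m_star, j_star), best = max(score.items(), key=lambda kv: kv[1])
        if best < 0:
            break  # stop when the best remaining score is negative
        labels[m_star] = j_star
        # Drop the labeled box and update the scores of the remaining candidates.
        score = {
            (m, j): s + pairwise(m_star, j_star, m, j)
            for (m, j), s in score.items()
            if m != m_star
        }
    return labels
```

Boxes that never reach a non-negative score stay unlabeled, which corresponds to "no object in the bounding box".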

Update model inference: update A and H labels
Based on the results of human pose estimation and object detection, enumerate the possible A and H labels and optimize $\Psi(A, O, H, I)$.
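Because the numbers of activity classes and atomic poses are small, this step can be an exhaustive search. The sketch below assumes a hypothetical function `psi(a, h)` that evaluates $\Psi$ for a candidate pair with the current object and body part estimates held fixed.

```python
from itertools import product


def best_activity_and_pose(psi, activities, atomic_poses):
    """Return the (activity, atomic pose) pair with the highest score psi(a, h)."""
    return max(product(activities, atomic_poses), key=lambda ah: psi(*ah))
```

The search costs one evaluation of `psi` per (A, H) pair, so it scales with the product of the two label counts.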

Experimental Results (Sports Dataset)

Experimental Results (Sports Dataset): activity classification

Experimental Results (PPMI Dataset)

Conclusion
Mutual context can significantly improve performance on difficult visual recognition problems.
The joint model can share all the information.
The model requires annotating all the human body parts and objects in the training images.

Reference
B. Yao and L. Fei-Fei, “Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses,” IEEE Trans. Pattern Analysis and Machine Intelligence, 2012.
B. Yao and L. Fei-Fei, “Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
B. Sapp, A. Toshev, and B. Taskar, “Cascaded Models for Articulated Pose Estimation,” Proc. European Conf. Computer Vision, 2010.
S. Lazebnik, C. Schmid, and J. Ponce, “Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2006.
http://en.wikipedia.org/wiki/Hierarchical_clustering
