Recognizing Humans: Action Recognition


Recognizing Humans: Action Recognition

So, coming from pose recognition, we can go further and recognize what people are doing from their poses.

Overview
- Data (human localization)
- Action representation
- Classification

I'll try to give a quick, general, high-level overview of action recognition across a couple of papers, and then focus on the paper that was posted. Generally, to recognize actions, people first get some data, find where the people in that data are (if necessary), represent each class of actions somehow, and train a classifier on those representations. So let's talk a bit about what kind of data people use.

Getting Data

Recognizing Human Actions: A Local SVM Approach (2004)

There were some early attempts with very controlled settings; this one is a set of videos of people performing actions in a field or a room. Most of the people are centered in the frame, all the actions are exaggerated, and there's no background noise. These videos are easy to label, since there are only six actions.

Learning Actions From the Web (2009)

In the paper you all read, they used images from the web as the source of data (but evaluated on video).

Learning Realistic Human Actions from Movies (2008)

Movies can also be a good source of data for actions. The labeling is based on the movies' scripts: they look at the times of actions in the scripts (for example, "Rick sits down with Ilsa"), which come with specific times (since it's a script), and they match those time snippets with the action label ("sit down").

Describing Common Human Visual Actions in Images (2015)

And now the size of datasets has been growing; this paper uses the MS COCO captioned set. It's not strictly an action dataset, but actions are a subset of the images. These are mostly used in deep learning settings, I think. (M.R. Ronchi and P. Perona, "Describing Common Human Visual Actions in Images", BMVC 2015, Swansea, Wales, September 2015.)

Human Localization

Localization may be necessary if there's a lot of noise in the data, that is, if the people are not necessarily the foreground objects.

A Discriminatively Trained, Multiscale, Deformable Part Model (2008)

The posted paper used the deformable part model, which builds on a pyramid of the features we'll talk about in a bit, and on the idea that objects are made of a set of parts that can deform relative to each other.
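There are full DPM implementations around, but as a rough stand-in for this person-localization step, here is a minimal sketch using OpenCV's built-in HOG + linear-SVM pedestrian detector (the rigid-template detector that DPM extends with deformable parts); the file name "people.jpg" is a hypothetical input:

```python
import cv2

# HOG + linear SVM pedestrian detector (Dalal-Triggs); DPM extends this
# single rigid template with deformable parts over a feature pyramid.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread("people.jpg")  # hypothetical input image
boxes, weights = hog.detectMultiScale(img, winStride=(8, 8), scale=1.05)

for (x, y, w, h) in boxes:      # one box per detected person
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("people_detected.png", img)
```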

Action Representation (Features)

Recognizing Human Actions: A Local SVM Approach (2004)

Since this paper used video, they incorporated spatio-temporal features.

That is, they take gradients of the video with respect to x, y, and time.
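The paper builds local space-time interest points on top of these gradients; that detector isn't reproduced here, but a minimal numpy sketch of the underlying quantity, gradients of the video volume along t, y, and x, might look like this (the video array is a random stand-in):

```python
import numpy as np

# A video clip as a volume of grayscale frames, shape (T, H, W);
# random stand-in data here.
video = np.random.rand(30, 120, 160).astype(np.float32)

# Gradients along time (t), vertical (y), and horizontal (x).
gt, gy, gx = np.gradient(video)

# A simple spatio-temporal response: large where appearance varies
# in both space and time (a crude cue for motion events).
response = (gx**2 + gy**2) * gt**2
print(response.shape)  # (30, 120, 160)
```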

Learning Actions From the Web (2009)

Histogram of Oriented Gradients (HOG) features. I think another group will present this in more depth, but here's one set of features you can use. Essentially, the idea is to divide the image into a grid of cells, take the image gradients in different directions, and represent each cell by the distribution (i.e., histogram) of those gradients. These descriptors are affected by noisy responses, so the posted paper first applied a "probability of boundary" (Pb) operator, which roughly extracts the outlines of the foreground objects; see the sketch below.
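A minimal sketch of plain HOG extraction with scikit-image (the paper's PbHOG variant would run the same descriptor on a probability-of-boundary map instead of raw pixels; the image here is a random stand-in):

```python
import numpy as np
from skimage.feature import hog

# A grayscale crop around a detected person; random stand-in data.
image = np.random.rand(128, 64)

# 9 orientation bins, 8x8-pixel cells, and 2x2-cell blocks are the
# standard Dalal-Triggs settings.
features = hog(image,
               orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2),
               feature_vector=True)
print(features.shape)  # one long descriptor for the whole crop
```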

Action Representation (Poses)

Jae Sung already talked a lot about poses, so I won't spend too much time here.

Recognizing Human Actions from Still Images with Latent Poses (2010)

But you can take the HOG features further by associating them with specific portions of the image, like looking at arms or legs. Each of these rows shows a "poselet", which is learned from HOG features with an SVM.

Same paper. Taking those further, we can generalize where the joints are as parts of a larger object, so we define a "pose" to be the legs, arms, and upper body together with their locations. Another way to represent actions is with attributes, as we saw last class.

Classification
- Non/linear SVMs
- Regression
- Clustering
- Deep stuff

After choosing a representation, you can use any of these methods to classify your actions; a minimal SVM example is sketched below.
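A minimal sketch of the classification step with a linear SVM in scikit-learn; the feature dimension, class count, and data are all hypothetical stand-ins:

```python
import numpy as np
from sklearn.svm import LinearSVC

# One HOG-style feature vector per image plus an action label
# (e.g. 0 = running, 1 = sitting, 2 = playing golf); random stand-ins.
X_train = np.random.rand(200, 3780)
y_train = np.random.randint(0, 3, size=200)

clf = LinearSVC(C=1.0)      # linear SVM; one-vs-rest for multiclass
clf.fit(X_train, y_train)

X_test = np.random.rand(5, 3780)
print(clf.predict(X_test))  # predicted action labels
```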

Learning Actions from the Web

Now let's look at some parts of the posted paper that we haven't talked about yet.

Steps
1. Query and collect images of actions from search engines
2. Find a representation for the images
3. Build action models for each action
4. Classify actions and annotate video

I think their paper can be broken down into the four main steps above. Querying is straightforward enough, though they used a really small set of actions: running, walking, sitting, playing golf, and dancing. We also talked about the representation used, which was the PbHOG features.

Cleaning the Noise
- Use a simple classifier for common foreground features (background stuff shouldn't influence the action)
- Separate features into background and foreground
- Keep images with a high probability of foreground; toss images with a low one
- Iterate through the query results and retrain the classifier

Since web queries (at the time) returned pretty noisy results, they attempted to clean up the data. Using the HOG features from earlier, they separated features based on whether they fell in the background or the foreground, which is decided by whether or not the pose-estimation box detected a person there. Then they used logistic regression with L2 regularization to reduce overfitting: minimize the negative log-likelihood, sum_i log(1 + exp(-y_i * w.x_i)) + lambda * ||w||^2, where y = 1 means foreground, y = -1 means background, x is the feature vector, and w is the weight vector. A sketch of this step follows.
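A minimal sketch of the foreground/background filter with scikit-learn's L2-regularized logistic regression (it uses 0/1 labels rather than +/-1 but optimizes the same objective; the data and the 0.5 threshold are stand-in assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One descriptor per local feature, labeled foreground (inside a
# detected-person box) or background; random stand-in data.
X = np.random.rand(500, 128)
y = np.random.choice([0, 1], size=500)  # 1 = foreground, 0 = background

# penalty='l2' gives the L2-regularized negative log-likelihood;
# C is the inverse of the regularization strength lambda.
clf = LogisticRegression(penalty="l2", C=1.0)
clf.fit(X, y)

# Score features by foreground probability and keep the confident ones.
scores = clf.predict_proba(X)[:, 1]
keep = scores > 0.5
print(keep.sum(), "of", len(keep), "features kept")
```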

Combining Different Poses of the Same Action
- NMF to decompose the data
- Cluster images that share the same max-response basis vector (choose k = 5, for different viewing directions)

Since the data also varies across the different poses of an action, we want some way to say that certain poses belong to the same action. We can use non-negative matrix factorization to cluster these poses, assigning each image to the basis vector it responds to most strongly. From the pose clusters, we can then train a local classifier on the actions for the annotation task. A sketch follows.
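A minimal sketch of the pose clustering with scikit-learn's NMF; the feature vectors are random stand-ins, and k = 5 follows the paper's choice:

```python
import numpy as np
from sklearn.decomposition import NMF

# One non-negative feature vector (e.g. a histogram) per image of a
# single action class; random stand-in data.
X = np.abs(np.random.rand(100, 256))

# Decompose into k = 5 basis vectors, one per rough viewing direction.
nmf = NMF(n_components=5, init="nndsvd", max_iter=500)
W = nmf.fit_transform(X)   # (100, 5): per-image response to each basis

# Cluster each image by its maximum-response basis vector.
clusters = W.argmax(axis=1)
print(np.bincount(clusters, minlength=5))  # images per pose cluster
```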

Annotating Video

For annotation, they track a bounding box for each person they find in the frames. They also smooth the per-frame labels with a dynamic programming approach, optimizing for the highest total classification probability along each track. A Viterbi-style sketch of that smoothing follows.
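The paper's exact formulation isn't reproduced here, but a minimal Viterbi-style sketch of dynamic-programming label smoothing, with a hypothetical per-switch penalty, might look like this:

```python
import numpy as np

def smooth_labels(frame_scores, switch_penalty=1.0):
    """Pick the label sequence maximizing total per-frame log-probability
    minus a penalty for each label change (Viterbi dynamic programming).
    frame_scores: (T, K) per-frame log-probabilities from the classifier."""
    T, K = frame_scores.shape
    best = frame_scores[0].copy()        # best score ending in each label
    back = np.zeros((T, K), dtype=int)   # backpointers
    for t in range(1, T):
        # trans[i, j]: score of arriving at label j from label i
        trans = best[:, None] - switch_penalty * (1 - np.eye(K))
        back[t] = trans.argmax(axis=0)
        best = trans.max(axis=0) + frame_scores[t]
    path = [int(best.argmax())]          # trace the best path backwards
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy usage: 6 frames, 3 actions; the noisy 4th frame gets smoothed over.
scores = np.log(np.array([[.8, .1, .1], [.7, .2, .1], [.6, .3, .1],
                          [.2, .7, .1], [.8, .1, .1], [.9, .05, .05]]))
print(smooth_labels(scores, switch_penalty=2.0))  # [0, 0, 0, 0, 0, 0]
```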

Evaluation
- Improving query returns
- Annotating video

Evaluation - query

I'm not sure how they determine the relevancy of each image, though. I think it's done by hand?

Evaluation - video

And here's the confusion matrix for their video annotations, per frame. What jumps out is that a lot of the running poses get misclassified as dancing; the authors say that the "dancing poses look similar to running". Do you think that makes sense?

Extra stuff

NeuralTalk and Walk, trained on MS COCO image/caption pairs: https://vimeo.com/146492001

This is some extra material if there's time. The system in this video uses a deep net trained on the MS COCO captioned image set and tries to caption things as the camera walks by. It's not all actions, but since the set does include actions, it will try to describe those too, like in this clip.

Links to papers
- Recognizing Human Actions: A Local SVM Approach
- Recognizing Human Actions from Still Images with Latent Poses
- Recognizing Realistic Actions from Videos "in the Wild"
- Recognizing Human Actions by Attributes
- Learning Realistic Human Actions from Movies
- Learning Actions from the Web
- A Discriminatively Trained, Multiscale, Deformable Part Model