Action Recognition Computer Vision CS 543 / ECE 549 University of Illinois Derek Hoiem 04/21/11.

Slides:

Advertisements

Similar presentations

Weakly supervised learning of MRF models for image region labeling Jakob Verbeek LEAR team, INRIA Rhône-Alpes.

Advertisements

Pose Estimation and Segmentation of People in 3D Movies Karteek Alahari, Guillaume Seguin, Josef Sivic, Ivan Laptev Inria, Ecole Normale Superieure ICCV.

Histograms of Oriented Gradients for Human Detection

Limin Wang, Yu Qiao, and Xiaoou Tang

Human Identity Recognition in Aerial Images Omar Oreifej Ramin Mehran Mubarak Shah CVPR 2010, June Computer Vision Lab of UCF.

Human Action Recognition across Datasets by Foreground-weighted Histogram Decomposition Waqas Sultani, Imran Saleemi CVPR 2014.

Actions in video Monday, April 25 Kristen Grauman UT-Austin.

- Recovering Human Body Configurations: Combining Segmentation and Recognition (CVPR’04) Greg Mori, Xiaofeng Ren, Alexei A. Efros and Jitendra Malik -

Introduction To Tracking

Computer Vision CS 543 / ECE 549 University of Illinois Derek Hoiem

Computer Vision for Human-Computer InteractionResearch Group, Universität Karlsruhe (TH) cv:hci Dr. Edgar Seemann 1 Computer Vision: Histograms of Oriented.

Activity Recognition Aneeq Zia. Agenda What is activity recognition Typical methods used for action recognition “Evaluation of local spatio-temporal features.

Activity Recognition Computer Vision CS 143, Brown James Hays 11/21/11 With slides by Derek Hoiem and Kristen Grauman.

Bangpeng Yao and Li Fei-Fei

Bangpeng Yao Li Fei-Fei Computer Science Department, Stanford University, USA.

Structural Human Action Recognition from Still Images Moin Nabi Computer Vision Lab. ©IPM - Oct

Intelligent Systems Lab. Recognizing Human actions from Still Images with Latent Poses Authors: Weilong Yang, Yang Wang, and Greg Mori Simon Fraser University,

2D Human Pose Estimation in TV Shows Vittorio Ferrari Manuel Marin Andrew Zisserman Dagstuhl Seminar July 2008.

Detecting Pedestrians by Learning Shapelet Features

Robust Moving Object Detection & Categorization using self- improving classifiers Omar Javed, Saad Ali & Mubarak Shah.

Event prediction CS 590v. Applications Video search Surveillance – Detecting suspicious activities – Illegally parked cars – Abandoned bags Intelligent.

Beyond Actions: Discriminative Models for Contextual Group Activities Tian Lan School of Computing Science Simon Fraser University August 12, 2010 M.Sc.

Tracking Objects with Dynamics Computer Vision CS 543 / ECE 549 University of Illinois Derek Hoiem 04/21/15 some slides from Amin Sadeghi, Lana Lazebnik,

Generic Object Detection using Feature Maps Oscar Danielsson Stefan Carlsson

LARGE-SCALE NONPARAMETRIC IMAGE PARSING Joseph Tighe and Svetlana Lazebnik University of North Carolina at Chapel Hill CVPR 2011Workshop on Large-Scale.

Visual Object Recognition Rob Fergus Courant Institute, New York University

Generic object detection with deformable part-based models

ICCV 2003UC Berkeley Computer Vision Group Recognizing Action at a Distance A.A. Efros, A.C. Berg, G. Mori, J. Malik UC Berkeley.

Bag of Video-Words Video Representation

Flow Based Action Recognition Papers to discuss: The Representation and Recognition of Action Using Temporal Templates (Bobbick & Davis 2001) Recognizing.

Prakash Chockalingam Clemson University Non-Rigid Multi-Modal Object Tracking Using Gaussian Mixture Models Committee Members Dr Stan Birchfield (chair)

TP15 - Tracking Computer Vision, FCUP, 2013 Miguel Coimbra Slides by Prof. Kristen Grauman.

Machine learning & category recognition Cordelia Schmid Jakob Verbeek.

Object Detection Sliding Window Based Approach Context Helps

Watch, Listen and Learn Sonal Gupta, Joohyun Kim, Kristen Grauman and Raymond Mooney -Pratiksha Shah.

Computer Vision CS 776 Spring 2014 Recognition Machine Learning Prof. Alex Berg.

Marcin Marszałek, Ivan Laptev, Cordelia Schmid Computer Vision and Pattern Recognition, CVPR Actions in Context.

Recognizing Human Figures and Actions Greg Mori Simon Fraser University.

Why Categorize in Computer Vision ?. Why Use Categories? People love categories!

Yao, B., and Fei-fei, L. IEEE Transactions on PAMI(2012)

Face detection Slides adapted Grauman & Liebe’s tutorial

Visual Object Recognition

Classifying Images with Visual/Textual Cues By Steven Kappes and Yan Cao.

Object Detection with Discriminatively Trained Part Based Models

Lecture 31: Modern recognition CS4670 / 5670: Computer Vision Noah Snavely.

Pedestrian Detection and Localization

Image Categorization Computer Vision CS 543 / ECE 549 University of Illinois Derek Hoiem 03/15/11.

Classifiers Computer Vision CS 543 / ECE 549 University of Illinois Derek Hoiem 04/09/15.

MSRI workshop, January 2005 Object Recognition Collected databases of objects on uniform background (no occlusions, no clutter) Mostly focus on viewpoint.

Computer Vision 776 Jan-Michael Frahm 12/05/2011 Many slides from Derek Hoiem, James Hays.

Efficient Visual Object Tracking with Online Nearest Neighbor Classifier Many slides adapt from Steve Gu.

Project 3 Results.

Histograms of Oriented Gradients for Human Detection(HOG)

Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions Bangpeng Yao and Li Fei-Fei Computer Science Department, Stanford.

Pictorial Structures and Distance Transforms Computer Vision CS 543 / ECE 549 University of Illinois Ian Endres 03/31/11.

Categorical Perception 강우현. Introduction Scalable representations for visual categorization Representation for functional and affordance-based categorization.

Representation in Vision Derek Hoiem CS 598, Spring 2009 Jan 22, 2009.

1 Bilinear Classifiers for Visual Recognition Computational Vision Lab. University of California Irvine To be presented in NIPS 2009 Hamed Pirsiavash Deva.

Object detection with deformable part-based models

Data Driven Attributes for Action Detection

Tracking Objects with Dynamics

Object detection as supervised classification

A Tutorial on HOG Human Detection

Context-Aware Modeling and Recognition of Activities in Video

CS 1674: Intro to Computer Vision Scene Recognition

Brief Review of Recognition + Context

Object DetectionII Ali Taalimi 01/08/2013.

Cascaded Classification Models

KFC: Keypoints, Features and Correspondences

Human-object interaction

Presentation transcript:

Action Recognition Computer Vision CS 543 / ECE 549 University of Illinois Derek Hoiem 04/21/11

This section: advanced topics Action recognition 3D Scenes and Context Massively data-driven approaches

Today: action recognition Action recognition – What is an “action”? – How can we represent movement? – How do we incorporate motion, pose, and nearby objects?

What is an action? Action: a transition from one state to another Who is the actor? How is the state of the actor changing? What (if anything) is being acted on? How is that thing changing? What is the purpose of the action (if any)?

How do we represent actions? Categories Walking, hammering, dancing, skiing, sitting down, standing up, jumping Poses Nouns and Predicates

What is the purpose of action recognition? To describe To predict

How can we identify actions? MotionPose Held Objects Nearby Objects

Representing Motion Bobick Davis 2001 Optical Flow with Motion History

Representing Motion Efros et al Optical Flow with Split Channels

Representing Motion Tracked Points Matikainen et al. 2009

Representing Motion Space-Time Interest Points Corner detectors in space-time Laptev 2005

Representing Motion Space-Time Interest Points Laptev 2005

Representing Motion Space-Time Volumes Blank et al. 2005

Examples of Action Recognition Systems Feature-based classification Recognition using pose and objects

Action recognition as classification Retrieving actions in moviesRetrieving actions in movies, Laptev and Perez, 2007

Remember image categorization… Training Labels Training Images Classifier Training Training Image Features Trained Classifier

Remember image categorization… Training Labels Training Images Classifier Training Training Image Features Testing Test Image Trained Classifier Outdoor Prediction

Remember spatial pyramids…. Compute histogram in each spatial bin

Features for Classifying Actions 1.Spatio-temporal pyramids (14x14x8 bins) – Image Gradients – Optical Flow

Features for Classifying Actions 2.Spatio-temporal interest points Corner detectors in space-time Descriptors based on Gaussian derivative filters over x, y, time

Classification Boosted stubs for pyramids of optical flow, gradient Nearest neighbor for STIP

Searching the video for an action 1.Detect keyframes using a trained HOG detector in each frame 2.Classify detected keyframes as positive (e.g., “drinking”) or negative (“other”)

Accuracy in searching video Without keyframe detection With keyframe detection

Learning realistic human actions from movies, Laptev et al “Talk on phone” “Get out of car”

Approach Space-time interest point detectors Descriptors – HOG, HOF Pyramid histograms (3x3x2) SVMs with Chi-Squared Kernel Interest Points Spatio-Temporal Binning

Results

Action Recognition using Pose and Objects Modeling Mutual Context of Object and Human Pose in Human-Object Interaction ActivitiesModeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities, B. Yao and Li Fei-Fei, 2010 Slide Credit: Yao/Fei-Fei

Human-Object Interaction Torso Right-arm Left-arm Right-leg Left-leg Head Human pose estimation Holistic image based classification Integrated reasoning Slide Credit: Yao/Fei-Fei

Human-Object Interaction Tennis racket Human pose estimation Holistic image based classification Integrated reasoning Object detection Slide Credit: Yao/Fei-Fei

Human-Object Interaction Human pose estimation Holistic image based classification Integrated reasoning Object detection Torso Right-arm Left-arm Right-leg Left-leg Head Tennis racket HOI activity: Tennis Forehand Slide Credit: Yao/Fei-Fei Action categorization

Felzenszwalb & Huttenlocher, 2005 Ren et al, 2005 Ramanan, 2006 Ferrari et al, 2008 Yang & Mori, 2008 Andriluka et al, 2009 Eichner & Ferrari, 2009 Difficult part appearance Self-occlusion Image region looks like a body part Human pose estimation & Object detection Human pose estimation is challenging. Slide Credit: Yao/Fei-Fei

Human pose estimation & Object detection Human pose estimation is challenging. Felzenszwalb & Huttenlocher, 2005 Ren et al, 2005 Ramanan, 2006 Ferrari et al, 2008 Yang & Mori, 2008 Andriluka et al, 2009 Eichner & Ferrari, 2009 Slide Credit: Yao/Fei-Fei

Human pose estimation & Object detection Facilitate Given the object is detected. Slide Credit: Yao/Fei-Fei

Viola & Jones, 2001 Lampert et al, 2008 Divvala et al, 2009 Vedaldi et al, 2009 Small, low-resolution, partially occluded Image region similar to detection target Human pose estimation & Object detection Object detection is challenging Slide Credit: Yao/Fei-Fei

Human pose estimation & Object detection Object detection is challenging Viola & Jones, 2001 Lampert et al, 2008 Divvala et al, 2009 Vedaldi et al, 2009 Slide Credit: Yao/Fei-Fei

Human pose estimation & Object detection Facilitate Given the pose is estimated. Slide Credit: Yao/Fei-Fei

Human pose estimation & Object detection Mutual Context Slide Credit: Yao/Fei-Fei

H A Mutual Context Model Representation More than one H for each A ; Unobserved during training. A:A: Croquet shot Volleyball smash Tennis forehand Intra-class variations Activity Object Human pose Body parts l P : location; θ P : orientation; s P : scale. Croquet mallet Volleyball Tennis racket O:O: H:H: P:P: f:f: Shape context. [Belongie et al, 2002] P1P1 Image evidence fOfO f1f1 f2f2 fNfN O P2P2 PNPN Slide Credit: Yao/Fei-Fei

Mutual Context Model Representation Markov Random Field Clique potential Clique weight O P1P1 PNPN fOfO H A P2P2 f1f1 f2f2 fNfN,, : Frequency of co-occurrence between A, O, and H. Slide Credit: Yao/Fei-Fei

A f1f1 f2f2 fNfN Mutual Context Model Representation fOfO P1P1 PNPN P2P2 O H,, : Spatial relationship among object and body parts. locationorientation size Markov Random Field Clique potential Clique weight,, : Frequency of co-occurrence between A, O, and H. Slide Credit: Yao/Fei-Fei

H A f1f1 f2f2 fNfN Mutual Context Model Representation Obtained by structure learning fOfO PNPN P1P1 P2P2 O Learn structural connectivity among the body parts and the object.,, : Frequency of co-occurrence between A, O, and H.,, : Spatial relationship among object and body parts. locationorientation size Markov Random Field Clique potential Clique weight Slide Credit: Yao/Fei-Fei

H O A fOfO f1f1 f2f2 fNfN P1P1 P2P2 PNPN Mutual Context Model Representation and : Discriminative part detection scores. [Andriluka et al, 2009] Shape context + AdaBoost Learn structural connectivity among the body parts and the object. [Belongie et al, 2002] [Viola & Jones, 2001],, : Frequency of co-occurrence between A, O, and H.,, : Spatial relationship among object and body parts. locationorientation size Markov Random Field Clique potential Clique weight Slide Credit: Yao/Fei-Fei

Model Learning H O A fOfO f1f1 f2f2 fNfN P1P1 P2P2 PNPN cricket shot cricket bowling Input: Goals: Hidden human poses Slide Credit: Yao/Fei-Fei

Model Learning H O A fOfO f1f1 f2f2 fNfN P1P1 P2P2 PNPN Input: Goals: Hidden human poses Structural connectivity cricket shot cricket bowling Slide Credit: Yao/Fei-Fei

Model Learning Goals: Hidden human poses Structural connectivity Potential parameters Potential weights H O A fOfO f1f1 f2f2 fNfN P1P1 P2P2 PNPN Input: cricket shot cricket bowling Slide Credit: Yao/Fei-Fei

Model Learning Goals: Parameter estimation Hidden variables Structure learning H O A fOfO f1f1 f2f2 fNfN P1P1 P2P2 PNPN Input: cricket shot cricket bowling Hidden human poses Structural connectivity Potential parameters Potential weights Slide Credit: Yao/Fei-Fei

Model Learning Goals: H O A fOfO f1f1 f2f2 fNfN P1P1 P2P2 PNPN Approach: croquet shot Hidden human poses Structural connectivity Potential parameters Potential weights Slide Credit: Yao/Fei-Fei

Model Learning Goals: H O A fOfO f1f1 f2f2 fNfN P1P1 P2P2 PNPN Approach: Joint density of the model Gaussian priori of the edge number Add an edge Remove an edge Add an edge Remove an edge Hill-climbing Hidden human poses Structural connectivity Potential parameters Potential weights Slide Credit: Yao/Fei-Fei

Model Learning Goals: H O A fOfO f1f1 f2f2 fNfN P1P1 P2P2 PNPN Approach: Maximum likelihood Standard AdaBoost Hidden human poses Structural connectivity Potential parameters Potential weights Slide Credit: Yao/Fei-Fei

Model Learning Goals: H O A fOfO f1f1 f2f2 fNfN P1P1 P2P2 PNPN Approach: Max-margin learning x i : Potential values of the i -th image. w r : Potential weights of the r -th pose. y(r) : Activity of the r -th pose. ξ i : A slack variable for the i -th image. Notations Hidden human poses Structural connectivity Potential parameters Potential weights Slide Credit: Yao/Fei-Fei

Learning Results Cricket defensive shot Cricket bowling Croquet shot Slide Credit: Yao/Fei-Fei

Learning Results Tennis serve Volleyball smash Tennis forehand Slide Credit: Yao/Fei-Fei

Model Inference The learned models Slide Credit: Yao/Fei-Fei

Model Inference The learned models Head detection Torso detection Tennis racket detection Layout of the object and body parts. Compositional Inference [Chen et al, 2007] Slide Credit: Yao/Fei-Fei

Model Inference The learned models Output Slide Credit: Yao/Fei-Fei

Dataset and Experiment Setup Object detection; Pose estimation; Activity classification. Tasks: [Gupta et al, 2009] Cricket defensive shot Cricket bowling Croquet shot Tennis forehand Tennis serve Volleyball smash Sport data set: 6 classes 180 training (supervised with object and part locations) & 120 testing images Slide Credit: Yao/Fei-Fei

[Gupta et al, 2009] Cricket defensive shot Cricket bowling Croquet shot Tennis forehand Tennis serve Volleyball smash Sport data set: 6 classes Dataset and Experiment Setup Object detection; Pose estimation; Activity classification. Tasks: 180 training (supervised with object and part locations) & 120 testing images Slide Credit: Yao/Fei-Fei

Object Detection Results Cricket bat Valid region Croquet mallet Tennis racketVolleyball Cricket ball Our Method Sliding window Pedestrian context [Andriluka et al, 2009] [Dalal & Triggs, 2006] Slide Credit: Yao/Fei-Fei

Object Detection Results 59 Volleyball Cricket ball Sliding windowPedestrian contextOur method Small object Background clutter Slide Credit: Yao/Fei-Fei

Dataset and Experiment Setup Object detection; Pose estimation; Activity classification. Tasks: [Gupta et al, 2009] Cricket defensive shot Cricket bowling Croquet shot Tennis forehand Tennis serve Volleyball smash Sport data set: 6 classes 180 training & 120 testing images Slide Credit: Yao/Fei-Fei

Human Pose Estimation Results MethodTorsoUpper LegLower LegUpper ArmLower ArmHead Ramanan, Andriluka et al, Our full model Slide Credit: Yao/Fei-Fei

Human Pose Estimation Results MethodTorsoUpper LegLower LegUpper ArmLower ArmHead Ramanan, Andriluka et al, Our full model Andriluka et al, 2009 Our estimation result Tennis serve model Andriluka et al, 2009 Our estimation result Volleyball smash model Slide Credit: Yao/Fei-Fei

Human Pose Estimation Results MethodTorsoUpper LegLower LegUpper ArmLower ArmHead Ramanan, Andriluka et al, Our full model One pose per class Estimation result Slide Credit: Yao/Fei-Fei

Dataset and Experiment Setup Object detection; Pose estimation; Activity classification. Tasks: [Gupta et al, 2009] Cricket defensive shot Cricket bowling Croquet shot Tennis forehand Tennis serve Volleyball smash Sport data set: 6 classes 180 training & 120 testing images Slide Credit: Yao/Fei-Fei

Activity Classification Results Cricket shot Tennis forehand Bag-of-words SIFT+SVM Gupta et al, 2009 Our model Slide Credit: Yao/Fei-Fei

More results Independent Joint

Take-home messages Action recognition is an open problem. – How to define actions? – How to infer them? – What are good visual cues? – How do we incorporate higher level reasoning?

Take-home messages Some work done, but it is just the beginning of exploring the problem. So far… – Actions are mainly categorical – Most approaches are classification using simple features (spatial-temporal histograms of gradients or flow, s-t interest points, SIFT in images) – Just a couple works on how to incorporate pose and objects – Not much idea of how to reason about long-term activities or to describe video sequences

Next class: 3D Scenes and Context