An Introduction to Action Recognition/Detection Sami Benzaid November 17, 2009

What is action recognition? The process of identifying actions that occur in video sequences (in this case, actions performed by humans).

Why perform action recognition?
Surveillance footage
User interfaces
Automatic video organization / tagging
Search-by-video?

Complications
Different scales: people may appear at different scales in different videos, yet perform the same action.
Movement of the camera: the camera may be handheld, and the person holding it can cause it to shake; the camera may also be mounted on something that moves.

Complications, continued
Movement with the camera: the subject performing an action (e.g., skating) may be moving with the camera at a similar speed. Figure from Niebles et al.

Complications, continued
Occlusions: the action may not be fully visible. Figure from Ke et al.

Complications, continued
Background "clutter": other objects or humans present in the video frame.
Human variation: humans come in different sizes and shapes.
Action variation: different people perform the same action in different ways.
Etc.

Why have I chosen this topic? I wanted an opportunity to learn about it (I knew nothing of it beforehand). I will likely incorporate it into my research in the future, but I'm not there yet.

Why have I chosen this topic? Specifically:
Want to know if someone is on the phone and hence not interruptible; status can be set to "busy."
Want to know if someone is present (look for characteristic human actions). Things other than humans can cause motion, so this makes for a more reliable presence detector, and online status can change accordingly.
Want to know immediately when someone leaves, so status can accurately be set to unavailable.

1st Paper Overview: Recognizing Human Actions: A Local SVM Approach (2004). Uses local space-time features to represent video sequences that contain actions. Classification is done via an SVM; results are also computed for KNN for comparison. Christian Schuldt, Ivan Laptev and Barbara Caputo (2004)

The Dataset
Video dataset with a few thousand instances.
25 people each perform 6 different actions (walking, jogging, running, boxing, hand waving, hand clapping) in 4 different scenarios (outdoors, outdoors with scale variation, outdoors with different clothes, indoors), several times each.
Backgrounds are mostly free of clutter. Only one person performs a single action per video.

The Dataset Figure from Schuldt et al.

Local Space-time features Figure from Schuldt et al.

Representation of Features
Spatio-temporal "jets" (4th order) are computed at each feature center: l = (L_x, L_y, L_t, L_xx, ..., L_tttt), i.e., Gaussian derivatives up to order 4.
Using k-means clustering, a vocabulary of words h_i is created from the jet descriptors.
Finally, a given video is represented by a histogram H = (n_1, ..., n_K), where n_i counts the occurrences of features corresponding to word h_i in that video.
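
To make this step concrete, here is a minimal sketch (not the authors' code) of the vocabulary/histogram construction, assuming the 4th-order jet descriptors have already been extracted; the vocabulary size, descriptor dimensionality, and function names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_jets, n_words=128, seed=0):
    """Cluster jet descriptors from all training videos into a vocabulary.

    all_jets: (N, D) array, one jet descriptor per detected feature.
    Returns the fitted KMeans model; cluster centers play the role of words h_i.
    """
    return KMeans(n_clusters=n_words, random_state=seed, n_init=10).fit(all_jets)

def video_histogram(video_jets, vocab):
    """Represent one video as a histogram of word occurrences."""
    words = vocab.predict(video_jets)                  # nearest word for each feature
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)                 # optional normalization

# Toy usage with random "descriptors" standing in for real jets.
rng = np.random.default_rng(0)
train_jets = rng.normal(size=(5000, 34))               # 34-dim descriptors (illustrative)
vocab = build_vocabulary(train_jets, n_words=64)
h = video_histogram(rng.normal(size=(120, 34)), vocab)
print(h.shape)                                         # (64,)
```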

Recognition Methods
Two representations of data:
[1] Raw jet descriptors (LF), with the "local feature" kernel of Wallraven, Caputo, and Graf (2003).
[2] Histograms (HistLF), with a χ² kernel.
Two classification methods: SVM and k-nearest neighbor.
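
A sketch of the HistLF route: a χ² kernel computed over histograms and passed to an SVM as a precomputed kernel. The kernel width gamma and the toy data below are placeholders, not values from the paper.

```python
import numpy as np
from sklearn.svm import SVC

def chi2_kernel(A, B, gamma=1.0):
    """K(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i)) for histogram rows."""
    K = np.zeros((A.shape[0], B.shape[0]))
    for i, x in enumerate(A):
        num = (x - B) ** 2
        den = x + B + 1e-10            # avoid division by zero on empty bins
        K[i] = np.exp(-gamma * (num / den).sum(axis=1))
    return K

# Toy data: 40 training histograms over 64 words, 2 classes.
rng = np.random.default_rng(1)
X_train = rng.dirichlet(np.ones(64), size=40)
y_train = np.repeat([0, 1], 20)
X_test = rng.dirichlet(np.ones(64), size=5)

clf = SVC(kernel="precomputed").fit(chi2_kernel(X_train, X_train), y_train)
pred = clf.predict(chi2_kernel(X_test, X_train))       # test kernel: rows=test, cols=train
print(pred)
```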

Results Figure from Schuldt et al.

Results Some categories can be confused with others (running vs. jogging vs. walking; hand waving vs. boxing) due to the different ways people perform these actions. Local features (raw jet descriptors without histograms) combined with SVMs was the best-performing technique across all tested scenarios.

A Potentially Prohibitive Cost The method we just saw was entirely supervised: all videos in the training set had labels attached to them. Labeling videos can be a very expensive operation. What if we want to train on a set of videos that are not necessarily labeled, and still recognize occurrences of their actions in a test set?

2nd Paper Overview: Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words (2008). An unsupervised approach to classifying actions that occur in video. Uses pLSA (probabilistic Latent Semantic Analysis) to learn a model. Juan Carlos Niebles, Hongcheng Wang and Li Fei-Fei (2008)

Data Training set: Set of videos in which a single person is performing a single action. Videos are unlabeled. Test set (relaxed requirement): Set of videos which can contain multiple people performing multiple actions simultaneously.

Space-time Interest Points Figure from Niebles et al.

Space-time Interest Points Figure from Niebles et al.

Feature Descriptors
Brightness gradients are calculated for each interest-point "cube."
Image gradients are computed for each cube at different scales of smoothing.
The gradients are concatenated to form a feature vector; its length is [# of pixels in the interest "point"] × [# of smoothing scales] × [# of gradient directions].
The vector is projected to lower dimensions using PCA.
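
The descriptor construction might look roughly like the sketch below, assuming small space-time cuboids have already been cut out around each interest point; the cuboid size, smoothing scales, and PCA dimensionality are illustrative choices, not the paper's.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.decomposition import PCA

def cuboid_descriptor(cuboid, sigmas=(1.0, 2.0)):
    """Concatenated brightness gradients of a space-time cuboid at several smoothing
    scales: length = n_pixels * n_scales * 3 gradient directions (assumes (y, x, t) axes)."""
    parts = []
    for s in sigmas:
        smoothed = gaussian_filter(cuboid.astype(float), sigma=s)
        gy, gx, gt = np.gradient(smoothed)             # gradients along y, x, t
        parts.extend([gy.ravel(), gx.ravel(), gt.ravel()])
    return np.concatenate(parts)

# Toy usage: 200 random 9x9x7 cuboids, projected to 100 dimensions with PCA.
rng = np.random.default_rng(2)
cuboids = rng.normal(size=(200, 9, 9, 7))
D = np.stack([cuboid_descriptor(c) for c in cuboids])  # (200, 9*9*7 * 2 * 3)
descriptors = PCA(n_components=100).fit_transform(D)
print(descriptors.shape)                               # (200, 100)
```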

Codebook Formation
A codebook of spatial-temporal words is formed by k-means clustering of all space-time interest points in the training set (clustering metric: Euclidean distance). Videos are then represented as collections of spatial-temporal words.

Representation Space-time interest points: Figure from Niebles et al.

pLSA: Learning a Model
Probabilistic Latent Semantic Analysis: a generative model.
Variables:
w_i = spatial-temporal word
d_j = video
n(w_i, d_j) = co-occurrence table (# of occurrences of word w_i in video d_j)
z = topic, corresponding to an action

Probabilistic Latent Semantic Analysis
An unsupervised technique. Two-level generative model: a video is a mixture of topics, and each topic has its own characteristic "word" distribution. Graphical model: video d → topic z → word w, with distributions P(z|d) and P(w|z).
T. Hofmann, Probabilistic Latent Semantic Analysis, UAI 1999. Slide: Lana Lazebnik

The pLSA model
P(w_i | d_j) = Σ_k P(z_k | d_j) P(w_i | z_k)
P(w_i | d_j): probability of word i in video j (known, observed).
P(w_i | z_k): probability of word i given topic k (unknown).
P(z_k | d_j): probability of topic k given video j (unknown).
Slide: Lana Lazebnik

The pLSA model as matrix factorization: the M×N matrix of observed codeword distributions p(w_i|d_j) (words × videos) factors as the product of the M×K matrix of codeword distributions per topic (class) p(w_i|z_k) and the K×N matrix of class distributions per video p(z_k|d_j).
Slide: Lana Lazebnik

Learning pLSA parameters
Maximize the likelihood of the data: L = Π_{i=1..M} Π_{j=1..N} p(w_i, d_j)^{n(w_i, d_j)}, where n(w_i, d_j) is the observed count of word i in video j, M is the number of codewords, and N is the number of videos. The unknown distributions are estimated with the EM algorithm.
Slide credit: Josef Sivic
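
A minimal numpy sketch of the EM iterations for pLSA, assuming the word-by-video count matrix n(w_i, d_j) has already been built from the codebook; the number of topics, iteration count, and toy data are illustrative.

```python
import numpy as np

def plsa_em(counts, n_topics, n_iter=100, seed=0):
    """Fit pLSA by EM.  counts: (M, N) matrix with counts[i, j] = n(w_i, d_j).
    Returns p_w_z (M, K) = P(w|z) and p_z_d (K, N) = P(z|d)."""
    M, N = counts.shape
    rng = np.random.default_rng(seed)
    p_w_z = rng.random((M, n_topics)); p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    p_z_d = rng.random((n_topics, N)); p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        # E-step: P(z | w, d) ∝ P(w | z) P(z | d), stored as an (M, K, N) tensor
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts n(w, d) * P(z | w, d)
        expected = counts[:, None, :] * joint
        p_w_z = expected.sum(axis=2)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=0)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True) + 1e-12
    return p_w_z, p_z_d

# Toy usage: 64 codewords, 30 training videos, 6 latent action topics.
rng = np.random.default_rng(3)
n = rng.integers(0, 5, size=(64, 30))
p_w_z, p_z_d = plsa_em(n, n_topics=6)
print(p_z_d.argmax(axis=0))                            # most likely topic per video
```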

Inference
Finding the most likely topic (class) for a video: z* = argmax_z P(z | d).
Finding the most likely topic (class) for a visual word in a given video: z* = argmax_z P(z | w, d), where P(z | w, d) ∝ P(w | z) P(z | d).
Slide: Lana Lazebnik
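
For classifying a test video, one common way to apply these formulas is to "fold in" the new video: keep P(w|z) fixed from training and run EM only over P(z|d_new), then take the argmax. A sketch under those assumptions, continuing the toy example above (the helper name is illustrative):

```python
import numpy as np

def classify_video(counts_new, p_w_z, n_iter=50, seed=0):
    """Fold a new video into a trained pLSA model: keep P(w|z) fixed and
    estimate P(z | d_new), then return the most likely topic (action class)."""
    M, K = p_w_z.shape
    rng = np.random.default_rng(seed)
    p_z_d = rng.random(K); p_z_d /= p_z_d.sum()
    for _ in range(n_iter):
        post = p_w_z * p_z_d[None, :]                  # P(z | w, d_new), up to normalization
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = (counts_new[:, None] * post).sum(axis=0)
        p_z_d /= p_z_d.sum() + 1e-12
    return int(p_z_d.argmax()), p_z_d                  # argmax_z P(z | d_new)

# Usage (counts_new is the new video's vector of codeword counts):
# label, topic_dist = classify_video(counts_new, p_w_z)
```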

Datasets
KTH human motion dataset (Schuldt et al. 2004), the dataset introduced in the previous paper.
Weizmann human action dataset (Blank et al. 2005).
Figure skating dataset (Wang et al. 2006).

Example of Testing (KTH) Figure from Niebles et al.

KTH Dataset Results Figure from Niebles et al.

Results Compared to Prev. Paper Figures from Niebles et al., Schuldt et al.

Results Compared to Prev. Paper: confusion matrices over the six KTH actions (walking, running, jogging, handwaving, handclapping, boxing) for the first paper (supervised technique) and this paper (unsupervised technique).

Weizmann Dataset 10 action categories, 9 different people performing each category. 90 videos. Static camera, simple background.

Weizmann Examples Figure from Niebles et al.

Example of Testing (Weizmann) Figure from Niebles et al.

Weizmann Dataset Results Figure from Niebles et al.

Figure Skating Dataset
32 video sequences, 7 people.
3 actions: stand-spin, camel-spin, sit-spin.
Camera motion, background clutter, viewpoint changes.

Example Frames Figure from Niebles et al.

Example of Testing Figure from Niebles et al.

Figure Skating Results Figure from Niebles et al.

A Distinction: Action Recognition vs. Event Detection. Action recognition = classify a video of an actor performing a certain class of action. Event detection = detect instance(s) of event(s) from predefined classes, occurring in video that more closely resembles real life (e.g., cluttered background, multiple humans, occlusions).

3rd Paper Overview: Event Detection in Cluttered Videos (2007). Represents actions as spatial-temporal volumes. Detection is done via a distance threshold between a template action volume and a video sequence. Y. Ke, R. Sukthankar, and M. Hebert (2007)

Example images Cluttered background + occlusion: hand-waving Cluttered background: picking something up Images from Ke et al.

Representation of an event Spatio-temporal volume: Example: hand-waving Image from Ke et al.

Preprocessing
A video in which event detection is to be performed is first segmented into space-time regions using mean-shift. The video is treated as a volume, not frame by frame, so objects in the video are over-segmented in space and time. The authors state this is the space-time equivalent of "superpixels."
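
As a toy illustration only: space-time over-segmentation can be approximated by clustering per-voxel (t, y, x, intensity) feature vectors with mean-shift. The real system uses a purpose-built video segmentation; the bandwidth and feature scaling below are arbitrary placeholders.

```python
import numpy as np
from sklearn.cluster import MeanShift

def oversegment_volume(video, spatial_scale=0.2, intensity_scale=1.0, bandwidth=1.5):
    """Cluster every voxel of a small grayscale video (T, H, W) in a joint
    (t, y, x, intensity) feature space; returns a label volume of 'supervoxels'."""
    T, H, W = video.shape
    t, y, x = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
    feats = np.stack([t.ravel() * spatial_scale,
                      y.ravel() * spatial_scale,
                      x.ravel() * spatial_scale,
                      video.ravel() * intensity_scale], axis=1)
    labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit_predict(feats)
    return labels.reshape(T, H, W)

# Toy usage: a tiny 8x16x16 clip with a bright moving square on a dark background.
video = np.zeros((8, 16, 16))
for t in range(8):
    video[t, 4:8, t:t + 4] = 1.0
labels = oversegment_volume(video)
print(len(np.unique(labels)), "space-time regions")
```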

Preprocessing Figure from Ke et al.

Detecting an Event in Video
We want to detect the event corresponding to a template T (e.g., a hand-waving volume) in a video volume V (which has been oversegmented). Slide the template along all locations in the video volume V, and measure the shape-matching distance between T and the relevant subset of V.
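
A sketch of the sliding-volume search. The distance used here is a simple voxel-overlap (IoU-style) distance between binary volumes, standing in for the authors' segment-based shape and flow distances; the template, clip, and threshold are toy values.

```python
import numpy as np

def overlap_distance(template, window):
    """1 - intersection/union of two binary space-time volumes (0 = perfect match)."""
    inter = np.logical_and(template, window).sum()
    union = np.logical_or(template, window).sum()
    return 1.0 - inter / max(union, 1)

def detect_events(template, video, threshold=0.4):
    """Slide a binary template volume over a binary video volume and return
    the (t, y, x) offsets where the distance falls below the threshold."""
    Tt, Th, Tw = template.shape
    Vt, Vh, Vw = video.shape
    hits = []
    for t in range(Vt - Tt + 1):
        for y in range(Vh - Th + 1):
            for x in range(Vw - Tw + 1):
                window = video[t:t + Tt, y:y + Th, x:x + Tw]
                d = overlap_distance(template, window)
                if d < threshold:
                    hits.append(((t, y, x), d))
    return hits

# Toy usage: the template is a short clip of a small square moving to the right,
# and the video contains the same motion at a different place and time.
template = np.zeros((4, 6, 8), dtype=bool)
for t in range(4):
    template[t, 1:5, t:t + 4] = True
video = np.zeros((10, 20, 20), dtype=bool)
for t in range(4):
    video[t + 3, 6:10, t + 5:t + 9] = True
print(detect_events(template, video)[:3])
```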

Shape-matching distance
An appropriate region-intersection distance between the template and the segmented regions is used. The authors point out that enumerating all subsets of segmented objects in V is very inefficient; they detail an optimization that reduces the run-time computation to table lookups, as well as a different distance metric.

Shape-matching distance Basic Idea: Figure from Ke et al.

Flow Distance Additionally, a flow distance metric is used: for a spatial-temporal patch P1 in T and P2 in V, calculate a flow correlation distance (whether or not the same flow could have generated both patches). The flow correlation distance algorithm is from Shechtman and Irani (2005).

Breaking up the template Why is this useful? Figure from Ke et al.

Matching template parts to regions Template parts may not match oversegmented regions as well: Figure from Ke et al.

Detection with parts
Use a cutting plane (where the template was split) when calculating distance. Encode the relationship between the multiple parts of a template as a graph, with information about their relative distances. An energy function combining the parts' appearance (shape and flow) distances with these pairwise terms is minimized; detection occurs when the resulting distance is below a chosen threshold.
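
The exact objective is defined in the paper; a generic parts-based energy of this kind, written in illustrative notation (not the paper's), combines per-part appearance distances with pairwise terms on the parts' relative placements:

```latex
% Illustrative parts-based matching energy (not the paper's exact form):
% l_i is the placement of template part i, D_i its appearance (shape/flow) distance to the
% video at that placement, and the pairwise term penalizes deviation from the relative
% offsets d_{ij} obtained when the template was split.
E(l_1, \dots, l_P) \;=\; \sum_{i=1}^{P} D_i(l_i)
  \;+\; \sum_{(i,j) \in \mathcal{E}} \lambda_{ij} \,\bigl\| (l_i - l_j) - d_{ij} \bigr\|^2
```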

The Data Roughly 20 minutes of video, containing approximately 110 events for which templates exist. The templates ("training set") are created from a single instance of each action being performed; they are manually segmented and split.

Example Detections Figure from Ke et al.

Results Figure from Ke et al.

Reference Links
Recognizing Human Actions: A Local SVM Approach. Christian Schuldt, Ivan Laptev and Barbara Caputo (2004).
Event Detection in Cluttered Videos. Y. Ke, R. Sukthankar, and M. Hebert (2007).
Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words. Juan Carlos Niebles, Hongcheng Wang and Li Fei-Fei (2008).

Extra Slides Extra Slides after this point.

Detection of features (Schuldt et al.)
A video is an image sequence f(x, y, t). Its scale-space representation L is constructed by convolution with a spatio-temporal Gaussian kernel: L(x, y, t; σ², τ²) = g(x, y, t; σ², τ²) * f(x, y, t), where σ² and τ² are the spatial and temporal scale parameters.
A 3×3 second-moment matrix is computed from the spatio-temporal image gradients of L, averaged with a Gaussian weighting function: μ = g(·; sσ², sτ²) * (∇L (∇L)^T).

Detection of features (Schuldt et al.)
The positions of the features are the local maxima of H = det(μ) - k·trace³(μ) over (x, y, t). The size/neighborhood of each feature is determined by the scale parameters.
More detailed information about space-time interest points: Ivan Laptev and Tony Lindeberg, "Space-Time Interest Points," in Proc. ICCV 2003, Nice, France.
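
A sketch of this detector on a small grayscale clip: smooth, form the 3×3 second-moment matrix from the spatio-temporal gradients, and take local maxima of H = det(μ) - k·trace(μ)³. The scales, the constant k, and the non-maximum-suppression neighborhood are illustrative, not the paper's settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def spacetime_harris(video, sigma=2.0, tau=1.5, s=2.0, k=0.005):
    """Response H = det(mu) - k * trace(mu)^3 of the 3x3 spatio-temporal
    second-moment matrix mu, computed at one fixed (sigma, tau) scale pair."""
    L = gaussian_filter(video.astype(float), sigma=(tau, sigma, sigma))  # (t, y, x) smoothing
    Lt, Ly, Lx = np.gradient(L)
    # Average products of gradients with a larger Gaussian window (factor s).
    w = (s * tau, s * sigma, s * sigma)
    pairs = [((Lx, 'x'), (Lx, 'x')), ((Lx, 'x'), (Ly, 'y')), ((Lx, 'x'), (Lt, 't')),
             ((Ly, 'y'), (Ly, 'y')), ((Ly, 'y'), (Lt, 't')), ((Lt, 't'), (Lt, 't'))]
    m = {}
    for (a, na), (b, nb) in pairs:
        m[na + nb] = gaussian_filter(a * b, sigma=w)
    det = (m['xx'] * (m['yy'] * m['tt'] - m['yt'] ** 2)
           - m['xy'] * (m['xy'] * m['tt'] - m['yt'] * m['xt'])
           + m['xt'] * (m['xy'] * m['yt'] - m['yy'] * m['xt']))
    trace = m['xx'] + m['yy'] + m['tt']
    return det - k * trace ** 3

def interest_points(response, size=5):
    """Feature positions: local maxima of the response over (t, y, x)."""
    local_max = (response == maximum_filter(response, size=size)) & (response > 0)
    return np.argwhere(local_max)

# Toy usage on a clip with a small bright square moving to the right.
video = np.zeros((8, 16, 16))
for t in range(8):
    video[t, 4:8, t:t + 4] = 1.0
pts = interest_points(spacetime_harris(video))
print(pts[:5])
```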

Space-time Interest Points (Niebles et al.)
The detection algorithm is based on Dollár, Rabaud, Cottrell, & Belongie (2005). Response function: R = (I * g * h_ev)² + (I * g * h_od)², where g(x, y; σ) is a 2-D Gaussian smoothing kernel applied only in the spatial dimensions, and h_ev, h_od are a quadrature pair of 1-D Gabor filters applied over time. Point locations are local maxima of the response function; their size is determined by the spatial/temporal scale factors.
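
A sketch of this response function with SciPy: spatial Gaussian smoothing followed by a quadrature pair of 1-D temporal Gabor filters. The σ, τ, and temporal frequency below are illustrative choices, not the values used in the papers.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def dollar_response(video, sigma=2.0, tau=2.0, omega=0.25):
    """R = (I * g * h_ev)^2 + (I * g * h_od)^2, with g a 2-D spatial Gaussian and
    h_ev, h_od a quadrature pair of 1-D temporal Gabor filters.
    omega is the temporal frequency in cycles/frame (illustrative choice)."""
    # Spatial smoothing only (no smoothing along the time axis, which is axis 0).
    smoothed = gaussian_filter(video.astype(float), sigma=(0, sigma, sigma))
    t = np.arange(-3 * int(np.ceil(tau)), 3 * int(np.ceil(tau)) + 1)
    h_ev = -np.cos(2 * np.pi * t * omega) * np.exp(-t ** 2 / tau ** 2)
    h_od = -np.sin(2 * np.pi * t * omega) * np.exp(-t ** 2 / tau ** 2)
    even = convolve1d(smoothed, h_ev, axis=0)          # filter along time
    odd = convolve1d(smoothed, h_od, axis=0)
    return even ** 2 + odd ** 2

# Local maxima of R give the interest point locations (see the detector sketch above).
video = np.zeros((16, 16, 16))
for t in range(16):
    video[t, 6:10, (t % 8):(t % 8) + 4] = 1.0          # periodically moving patch
R = dollar_response(video)
print(np.unravel_index(R.argmax(), R.shape))
```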

Recognition of Multiple Actions (Niebles et al.) Figure from Niebles et al.