First-Person Activity Recognition: What Are They Doing to Me? M. S. Ryoo and Larry Matthies Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA

Outline  Introduction  First-person video dataset  Features for first-person videos  Recognition with activity structure  Conclusion

Introduction  Most previous works focused on activity recognition from a third-person perspective.  Existing first-person works focused on recognition of ego-actions of the person wearing the camera (e.g., ‘riding a bike’ or ‘cooking’) [6,4,10]. [6] K. M. Kitani, T. Okabe, Y. Sato, and A. Sugimoto. Fast unsupervised ego-action learning for first-person sports videos. In CVPR, 2011. [4] A. Fathi, A. Farhadi, and J. M. Rehg. Understanding egocentric activities. In ICCV, 2011. [10] H. Pirsiavash and D. Ramanan. Detecting activities of daily living in first-person camera views. In CVPR, 2012.

Introduction  There are also works on recognition of gesture-level motion directed to the sensor and on analysis of face/eye directions.  We particularly focus on two aspects of first-person activity recognition, aiming to answer the following two questions: 1) What features (and what combination of them) do we need to recognize interaction-level activities from first-person videos? 2) How important is it to consider the temporal structure of activities in first-person recognition?  To our knowledge, this is the first paper to recognize human interactions from first-person videos.

First-person video dataset(1)  We attached a GoPro camera to the head of a humanoid model and asked human participants to interact with the humanoid by performing activities.  The participants were asked to perform 7 different types of activities: 4 positive interactions with the observer, 1 neutral interaction, and 2 negative interactions.

First-person video dataset(2)  For the video collection, our robot was placed in 5 different environments with distinct background and lighting conditions. A total of 8 participants wearing 10 different sets of clothing took part in our experiments.  The videos have 320×240 image resolution at 30 frames per second.  Notice that the robot is not stationary and displays a large amount of ego-motion in its videos, particularly during the human activity.

Features for first-person videos  We construct and evaluate two categories of video features: global motion descriptors and local motion descriptors.  We also present kernel functions to combine global and local features for activity recognition. Our kernels reliably integrate both global and local motion information, and we illustrate that these multi-channel kernels benefit first-person activity recognition.

Global motion descriptors  We take advantage of dense optical flow. Optical flow is measured between every two consecutive frames of a video, where each flow is a vector describing the direction and magnitude of the movement of each pixel.  The system places each computed optical flow vector into one of the predefined s-by-s-by-8 histogram bins, spatially dividing the scene into an s-by-s grid with 8 representative motion directions.
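A minimal sketch of this global descriptor, assuming OpenCV's Farneback dense optical flow; the grid size s and the magnitude-weighted binning are illustrative choices, not necessarily the paper's exact settings.

```python
import cv2
import numpy as np

def global_flow_histogram(prev_gray, curr_gray, s=4):
    """Bin dense optical flow into an s x s x 8 histogram (spatial grid x motion direction)."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_gray.shape
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])        # magnitude, angle (radians)
    hist = np.zeros((s, s, 8), dtype=np.float32)
    dir_bin = np.minimum((ang / (2 * np.pi) * 8).astype(int), 7)  # 8 direction bins
    ys, xs = np.mgrid[0:h, 0:w]
    gy = np.minimum((ys / (h / s)).astype(int), s - 1)            # spatial grid row
    gx = np.minimum((xs / (w / s)).astype(int), s - 1)            # spatial grid column
    # accumulate flow magnitude into the (grid row, grid col, direction) bin
    np.add.at(hist, (gy, gx, dir_bin), mag)
    return hist.ravel()
```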

Local motion descriptors  We interpret a video as a 3-D XYT volume, formed by concatenating the 2-D XY image frames of the video along the time axis T.

Visual words
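The codebook details are not transcribed on this slide, but a typical bag-of-visual-words pipeline over local spatio-temporal descriptors looks roughly like the sketch below; the descriptor source and the vocabulary size k are placeholders.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(descriptor_list, k=800):
    """Cluster local spatio-temporal descriptors (one array per video) into k visual words."""
    all_desc = np.vstack(descriptor_list)               # (num_descriptors, dim)
    return MiniBatchKMeans(n_clusters=k, random_state=0).fit(all_desc)

def bow_histogram(kmeans, video_descriptors):
    """Represent one video as a normalized histogram of visual-word occurrences."""
    words = kmeans.predict(video_descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(np.float32)
    return hist / max(hist.sum(), 1.0)
```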

Multi-channel kernels(1)  We present multi-channel kernels that consider both global features and local features for computing video similarities.  A kernel k(x_i, x_j) is a function defined to model the distance between two vectors x_i and x_j.  We construct two types of kernels: a multi-channel version of the histogram intersection kernel, and a multi-channel χ² kernel, which was also used in [19] for object classification.  Our histogram intersection kernel is defined as follows: [19] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. IJCV, 73:213–238, April 2007.

Multi-channel kernels(2)
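The kernel equations appear only as images on the slides; the sketch below is a hedged reconstruction in the spirit of the multi-channel kernels of Zhang et al. [19], combining a global-motion channel and a local-motion channel. The per-channel normalization by the mean training distance is an assumption, not the paper's stated constant.

```python
import numpy as np

def hist_intersection(h1, h2):
    """Histogram intersection similarity between two histograms."""
    return np.minimum(h1, h2).sum()

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-squared distance between two histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def multichannel_chi2_kernel(x_i, x_j, mean_dists):
    """x_i, x_j are dicts {channel_name: histogram}; mean_dists[c] is the
    average chi-squared distance of channel c over the training set."""
    total = sum(chi2_distance(x_i[c], x_j[c]) / mean_dists[c] for c in x_i)
    return np.exp(-total)

def multichannel_intersection_kernel(x_i, x_j):
    """Sum of per-channel histogram intersections over global and local channels."""
    return sum(hist_intersection(x_i[c], x_j[c]) for c in x_i)
```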

Evaluation  We use repeated random sub-sampling validation to measure the classification accuracy of our recognition approach.  At each round, we selected half of our dataset (i.e., 6 sets with 42 videos) as training videos and used the other 6 sets for testing.  Local vs. global motion:
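A minimal sketch of this repeated random sub-sampling protocol, assuming SVM base classifiers with a precomputed kernel (SVMs are named as the base classifiers in the later classification evaluation); the kernel function, the 50/50 split, and the number of rounds are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def evaluate(features, labels, kernel_fn, rounds=100, seed=0):
    """Repeated random sub-sampling validation with a precomputed-kernel SVM."""
    labels = np.asarray(labels)
    n = len(labels)
    # full n x n kernel matrix computed once
    K = np.array([[kernel_fn(features[i], features[j]) for j in range(n)] for i in range(n)])
    rng = np.random.RandomState(seed)
    accs = []
    for _ in range(rounds):
        perm = rng.permutation(n)
        train, test = perm[: n // 2], perm[n // 2:]
        clf = SVC(kernel='precomputed').fit(K[np.ix_(train, train)], labels[train])
        accs.append(clf.score(K[np.ix_(test, train)], labels[test]))
    return float(np.mean(accs))
```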

Classification with multi-channel kernels:  This confirms that utilizing both global and local motion benefits overall recognition of human activities from first-person videos, and that our kernel functions are able to combine such information reliably.

Recognition with activity structure  We first describe our structure representation and define a new kernel function that computes video distances given a particular structure. Next, we present an algorithm to search for the best activity structure given training videos.  The idea is to enable evaluation of each structure by measuring how similar its kernel function is to the optimal function, and to use such evaluation to find the optimal structure.

Hierarchical structure match kernel(1)  Formally, we represent the activity structure in terms of hierarchical binary divisions with the following production rules:  The idea behind our structure representation is to take advantage of it to better measure the distance between two videos by performing hierarchical segment-to-segment matching.

Hierarchical structure match kernel(2)  Given a particular activity structure S, we define the kernel function k_S(x_i, x_j) measuring the distance between two feature vectors x_i and x_j with the following two equations:  We call this the hierarchical structure match kernel.  Evaluating our kernel for each pair (x_i, x_j) takes time proportional to r, the number of segments generated as a result of the structure.
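The two defining equations are shown only as slide images; the following is a hedged sketch of the idea: a structure S is represented here as a binary tree of relative temporal division points, and the kernel recursively matches corresponding temporal segments of the two videos with a base kernel (histogram intersection in this sketch). The tree encoding, segment aggregation, and base kernel are assumptions, not the paper's exact formulation.

```python
import numpy as np

def segment_features(video_feats, a, b):
    """Aggregate per-frame feature histograms over the relative interval [a, b)."""
    n = len(video_feats)
    lo, hi = int(a * n), max(int(b * n), int(a * n) + 1)
    return np.sum(video_feats[lo:hi], axis=0)

def structure_kernel(S, feats_i, feats_j, a=0.0, b=1.0, base_kernel=None):
    """Hierarchical structure match kernel (sketch): recursively split [a, b) at the
    relative time t stored in S and sum segment-to-segment base-kernel matches."""
    if base_kernel is None:
        base_kernel = lambda h1, h2: np.minimum(h1, h2).sum()   # histogram intersection
    k = base_kernel(segment_features(feats_i, a, b), segment_features(feats_j, a, b))
    if S is None:                      # leaf: no further temporal division
        return k
    t, left, right = S                 # division point t in (0, 1), relative to [a, b)
    mid = a + t * (b - a)
    return (k
            + structure_kernel(left, feats_i, feats_j, a, mid, base_kernel)
            + structure_kernel(right, feats_i, feats_j, mid, b, base_kernel))
```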

Structure learning(1)  We adopt the kernel target alignment [15], which measures the angle between two Gram matrices, and show that it can be used to evaluate structure kernels for our activity recognition.  Kernel target alignment: Given a set of training samples {x_1, …, x_m}, let K_1 and K_2 be the Gram matrices of two kernel functions.  The alignment between the two kernels can be computed as A(K_1, K_2) = ⟨K_1, K_2⟩_F / √(⟨K_1, K_1⟩_F ⟨K_2, K_2⟩_F), where ⟨K_1, K_2⟩_F is the Frobenius inner product between the kernel matrices K_1 and K_2. [15] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target alignment. In NIPS, 2002.
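A small numpy sketch of this alignment computation; the ±1 target Gram matrix L built from the training labels is one common convention for the "ideal" kernel, assumed here rather than taken from the slides.

```python
import numpy as np

def kernel_alignment(K1, K2):
    """A(K1, K2) = <K1, K2>_F / sqrt(<K1, K1>_F <K2, K2>_F)."""
    num = np.sum(K1 * K2)                               # Frobenius inner product
    return num / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

def target_gram(labels):
    """Ideal target kernel: L_ij = +1 if labels match, -1 otherwise."""
    y = np.asarray(labels)
    return np.where(y[:, None] == y[None, :], 1.0, -1.0)
```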

Structure learning(2)  We take advantage of the kernel target alignment for evaluating candidate activity structures.  The idea is to compute the alignment A(K_S, L) and thereby evaluate each candidate kernel K_S.  This provides the system with the ability to score possible activity structure candidates so that it can search for the best structure S*.  We denote A(K_S, L) simply as A(K_S). Computation of A(K_S) requires the m×m Gram matrix K_S over the training samples.

Structure learning(3)  Hierarchical structure learning: The goal is to find the structure S that maximizes the kernel alignment for the training data: S* = argmax_S A(K_S).  We formulate our learning process as:  For computational efficiency, we take advantage of the following greedy assumption:  As a result, the following recursive equation T, when computed for T(0, 1), provides us with the optimal structure S: 
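The recursive equations themselves are not transcribed; the code below is a rough sketch of a greedy, layer-by-layer split search of the kind the slide describes, reusing the structure_kernel, kernel_alignment, and target_gram sketches above. The candidate grid of division points t' and the stopping rule (keep a split only if it improves the alignment) are assumptions.

```python
import numpy as np

def gram_matrix(S, videos):
    """n x n Gram matrix of the structure kernel k_S over all training videos."""
    n = len(videos)
    return np.array([[structure_kernel(S, videos[i], videos[j]) for j in range(n)]
                     for i in range(n)])

def learn_structure(videos, labels, max_depth=2, candidates=(0.25, 0.5, 0.75)):
    """Greedy layer-by-layer search for the structure S maximizing A(K_S)."""
    L = target_gram(labels)
    score = lambda S: kernel_alignment(gram_matrix(S, videos), L)

    def replace_leaf(S, path, new_node):
        # replace the leaf addressed by a path of 'L'/'R' moves with new_node
        if not path:
            return new_node
        t, left, right = S
        if path[0] == 'L':
            return (t, replace_leaf(left, path[1:], new_node), right)
        return (t, left, replace_leaf(right, path[1:], new_node))

    def expand(S, path):
        # try each candidate division point t' at one leaf; keep the best improving split
        best_S, best_val = S, score(S)
        for t in candidates:
            cand = replace_leaf(S, path, (t, None, None))
            val = score(cand)
            if val > best_val:
                best_S, best_val = cand, val
        return best_S

    S = expand(None, [])                     # root-level split
    if S is None:
        return None
    frontier = [['L'], ['R']]
    for _ in range(1, max_depth):
        new_frontier = []
        for path in frontier:
            S_new = expand(S, path)
            if S_new is not S:               # a split was added at this leaf
                S = S_new
                new_frontier += [path + ['L'], path + ['R']]
        frontier = new_frontier
    return S
```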

Structure learning(4)  This structure can either be learned per activity, or the system may learn a common structure suitable for all activity classes.  The final structure S is found by the greedy layer-by-layer search, where p is the number of layers and q is the number of candidate division points t' the system checks at each layer.

Evaluation-classification  SVM classifiers were used as the base classifiers of our approach.  Spatio-temporal pyramid matching [2]  Dynamic bag-of-words (BoW) [11] [2] J. Choi, W. Jeon, and S. Lee. Spatio-temporal pyramid matching for sports videos. In ACM MIR, 2008. [11] M. S. Ryoo. Human activity prediction: Early recognition of ongoing activities from streaming videos. In ICCV, 2011.

Evaluation-detection(1)  Implementation:  Activity detection is the process of finding the correct starting time and ending time of an activity in continuous videos.  We implemented three baseline approaches for comparison: 1. SVM classifiers using only local features; 2. SVM classifiers using only global features; 3. The method with our multi-channel kernel.  All three baselines use χ²-based kernels.  Settings:  We performed validations by randomly splitting the dataset.  This training-testing set selection process was repeated for 100 rounds, and we averaged the performance.
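The detection pipeline itself is not transcribed; the following is a plausible sketch, assuming a binary per-activity SVM with a precomputed kernel, temporal windows slid over the continuous video, and greedy suppression of overlapping windows. The window lengths, stride, and suppression rule are illustrative placeholders, not the paper's settings.

```python
import numpy as np

def detect(frame_feats, clf, train_feats, kernel_fn,
           window_lens=(30, 60, 90), stride=10):
    """Slide temporal windows over a continuous video, score each with the SVM,
    and keep non-overlapping windows by greedy non-maximum suppression."""
    n = len(frame_feats)
    scored = []
    for w in window_lens:
        for start in range(0, max(n - w, 1), stride):
            x = np.sum(frame_feats[start:start + w], axis=0)          # window feature
            k_row = np.array([[kernel_fn(x, t) for t in train_feats]])
            score = clf.decision_function(k_row)[0]                    # precomputed-kernel SVM
            scored.append((score, start, start + w))
    detections = []
    for score, s, e in sorted(scored, reverse=True):                   # greedy suppression
        if all(e <= ds or s >= de for _, ds, de in detections):
            detections.append((score, s, e))
    return detections
```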

Evaluation-detection(2)  Results:  We measured average precision-recall (PR) curves with our dataset.

Evaluation-detection(3)

Conclusion  We extracted global and local features from first-person videos, and confirmed that multi-channel kernels combining their information are needed.  We developed a new kernel-based activity learning/recognition methodology that considers the activities’ hierarchical structures, and verified that learning activity structures from training videos benefits recognition of human interactions targeted at the observer.  Future work includes extending our approach to early recognition of a human’s intention, based on activity detection results and other subtle information from human body movements.