Action Recognition in the Presence of One Egocentric and Multiple Static Cameras

Presentation transcript:

Action Recognition in the Presence of One Egocentric and Multiple Static Cameras
Bilge Soran, Ali Farhadi, Linda Shapiro

Why Egocentric and Static Together:
Some actions are better captured by static cameras, where the whole body of the actor and the context of the action are visible. Other actions are better recognized by egocentric cameras, where subtle hand movements and complex object interactions are visible. Our model benefits from the best of both worlds by learning to predict the importance of each camera for recognizing the action in each frame.

Performance Comparison with State-of-the-Art and Several Baselines:
To analyze our model, we examine the effect of each component by comparing against several baselines designed to challenge different parts of the formulation. Except for the row marked (*), which measures the effect of Viterbi smoothing, all baselines use a final Viterbi smoothing stage (a sketch of such smoothing is given at the end of this section). Our model also outperforms two state-of-the-art latent CRF models, the Latent Dynamic CRF and the Hidden Unit CRF.

[Figure: panels A, B, and C showing example frames from the egocentric, side, back, and top cameras.]

Approach                          Avg. Acc.   Avg. per Class Acc.
Our Method                        54.62       45.95
No Viterbi Smoothing (*)          41.80       44.05
Binary α                          52.00       41.55
No Cam Combination                47.93       42.45
Binary α, No Cam Combination      44.58       35.01
Per Action α                      48.83       45.89
No α                              43.55       39.81
Equal α                           48.75       45.66
Latent Dynamic CRF                33.57       -
Hidden Unit CRF                   24.13       26.22
Feature Level Merge               35.04       36.97
Egocentric Camera                 37.92       32.93
Static Back Camera                22.07       31.46
Static Side Camera                21.26       24.43
Static Top Camera                 20.86       21.84

Model:
Given the importance of all cameras (A) for recognizing the action in a frame, action recognition reduces to a multi-modal classification problem. A is not available, however, so we jointly learn the importance of the cameras along with the action classifiers. The first constraint encourages the model to make correct predictions: it pushes the (W, A) of an action to score higher on instances of that action than any other action model does. The remaining constraints push the latent variable to be similar across all dimensions within a camera, and force the camera contributions to form a convex combination. To optimize over A, W, and ξ, we use block coordinate descent: we alternate between estimating W for a fixed A and optimizing A for a fixed W. W is initialized with independently trained classifiers for each action. The best estimate of the camera-importance variables is the one that maximizes the accuracy of action prediction. (A minimal sketch of this alternating optimization is also given at the end of this section.)

Distribution of Camera Importance Variables:
For "crack egg", more importance is given to the egocentric camera while the subject interacts with the egg; once the subject looks away from the action, more weight is assigned to the static side camera. For "take baking pan", our model gives more weight to the static back camera while the subject walks to the cabinet. After the cabinet door is opened, the egocentric camera becomes more important, and the weight then shifts to the side camera as the subject is about to put the pan on the countertop.

[Figure: per-frame α values for the 1) ego, 2) side, and 3) back cameras over the "crack egg" and "take baking pan" sequences.]

The preference of our model in terms of cameras for each action: actions like opening and closing the fridge, cracking an egg, and pouring oil into a bowl or measuring cup are better encoded by the egocentric camera. For actions such as walking to the fridge, taking a brownie box, and putting a baking pan into the oven, our model prefers the side-view camera, where the body movements are visible.
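The Viterbi smoothing stage referenced in the baselines table can be illustrated as follows: a minimal sketch, assuming a single fixed label-switch penalty rather than learned transition costs; the function name and parameters are hypothetical, not from the authors' code.

import numpy as np

def viterbi_smooth(frame_scores, switch_penalty=1.0):
    """Smooth per-frame action scores with a Viterbi pass.
    frame_scores: (n_frames, n_actions) array of (log-)scores.
    switch_penalty: cost for changing the action label between
    consecutive frames (a stand-in for learned transition costs).
    Returns the max-scoring label sequence."""
    n, k = frame_scores.shape
    dp = frame_scores[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        # trans[i, j]: score of arriving at label j from previous label i.
        trans = dp[:, None] - switch_penalty * (1 - np.eye(k))
        back[t] = trans.argmax(axis=0)
        dp = trans.max(axis=0) + frame_scores[t]
    # Trace the best path backwards from the final frame.
    path = [int(dp.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]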
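And a minimal sketch of the alternating optimization from the Model section, assuming linear per-camera action classifiers trained with a softmax loss and per-frame camera weights constrained to the simplex. This illustrates block coordinate descent only; the paper's max-margin constraints are replaced by simpler gradient steps here, and all names (project_to_simplex, block_coordinate_descent) are hypothetical.

import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex,
    enforcing the convex-combination constraint on camera weights."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def block_coordinate_descent(X, y, n_actions, n_iters=10, lr=0.1):
    """X: list of per-camera feature matrices, each (n_frames, d_c).
    y: frame-level action labels in {0..n_actions-1}.
    Alternates between updating the classifiers W (alpha fixed)
    and the per-frame camera weights alpha (W fixed)."""
    n_cams, n = len(X), X[0].shape[0]
    # Uniform convex combination to start; the paper initializes W
    # from independently trained per-action classifiers instead of zeros.
    alpha = np.full((n, n_cams), 1.0 / n_cams)
    W = [np.zeros((X[c].shape[1], n_actions)) for c in range(n_cams)]
    for _ in range(n_iters):
        # Step 1: update W with alpha fixed (weighted softmax gradient step).
        scores = sum(alpha[:, c:c + 1] * (X[c] @ W[c]) for c in range(n_cams))
        p = np.exp(scores - scores.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        p[np.arange(n), y] -= 1.0  # gradient of cross-entropy wrt scores
        for c in range(n_cams):
            W[c] -= lr * X[c].T @ (alpha[:, c:c + 1] * p) / n
        # Step 2: update alpha with W fixed, then project onto the simplex.
        cam_scores = np.stack(
            [(X[c] @ W[c])[np.arange(n), y] for c in range(n_cams)], axis=1)
        alpha = np.apply_along_axis(project_to_simplex, 1,
                                    alpha + lr * cam_scores)
    return W, alpha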
Dataset & Features:
The CMU Multi-Modal Activity (CMU-MMAC) dataset is used: 28 different actions performed by 5 subjects, recorded from one egocentric and 3 static cameras and sampled at a 1/30 rate. We use GIST features (8 orientations per scale and 4 blocks per direction, resulting in 512 dimensions) for all methods and baselines, reduced with PCA retaining 99% of the variance (yielding 80 to 121 dimensional features). To encode higher-order correlations in the feature space, we also consider different combinations of cameras, where each combination of cameras can be thought of as a new dummy camera (a sketch of this feature pipeline follows below).

Virtual Cinematography (Using α to Direct a Video Scene):
The figure shows frames across 3 cameras (side, egocentric, back) for eight different actions. Red boxes indicate the frames that our method selects for the virtual cinematography; a sketch of such per-frame camera selection also follows below.

[Figure panels A-D: frames for "pour brownie bag", "take egg", "open drawer", "stir big bowl", "take baking pan", and "switch on" across the side, egocentric, and back cameras.]
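A sketch of the feature pipeline described above, assuming scikit-learn's PCA with a 0.99 variance target and simple concatenation to build the dummy-camera combinations; the random matrices stand in for real GIST descriptors, and the camera names and dictionary layout are illustrative.

import numpy as np
from itertools import combinations
from sklearn.decomposition import PCA

# Suppose gist[c] is the (n_frames, 512) GIST matrix for camera c.
gist = {c: np.random.randn(1000, 512) for c in ["ego", "side", "back", "top"]}

# PCA(n_components=0.99) keeps enough components to explain 99% of the
# variance; on this data that yields roughly 80-121 dimensions.
feats = {c: PCA(n_components=0.99).fit_transform(X) for c, X in gist.items()}

# "Dummy cameras": every combination of real cameras becomes an extra
# modality whose feature vector concatenates its members' features.
for r in range(2, len(gist) + 1):
    for combo in combinations(sorted(gist), r):
        feats["+".join(combo)] = np.hstack([feats[c] for c in combo])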
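The transcript only says that α is used to select frames for the virtual cinematography; a natural reading is to cut to the camera with the largest per-frame weight, sketched below with hypothetical names.

import numpy as np

def direct_scene(alpha, camera_names):
    """Pick, for every frame, the camera with the largest inferred
    importance weight; the resulting sequence is the cut list for
    the virtual cinematography.
    alpha: (n_frames, n_cameras) convex camera weights per frame."""
    return [camera_names[c] for c in alpha.argmax(axis=1)]

# Example: three frames dominated by ego, ego, then side camera.
alpha = np.array([[0.7, 0.2, 0.1],
                  [0.5, 0.3, 0.2],
                  [0.1, 0.8, 0.1]])
print(direct_scene(alpha, ["ego", "side", "back"]))  # ['ego', 'ego', 'side']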