Raviteja Vemulapalli University of Maryland, College Park.


Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group
Raviteja Vemulapalli, University of Maryland, College Park
Professor Rama Chellappa, Dr. Felipe Arrate

Action Recognition from 3D Skeletal Data
Motivation: humans can recognize many actions directly from skeletal sequences (e.g., tennis serve, jogging, sit down, boxing). But how do we get the 3D skeletal data?

Cost-Effective Depth Sensors
Pipeline: a human performing an action is captured by a cost-effective depth sensor such as the Kinect; a state-of-the-art depth-based skeleton estimation algorithm [Shotton 2011] then produces a real-time skeletal sequence (examples from the UTKinect-Action dataset [Xia 2012]).
J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman and A. Blake, "Real-time Human Pose Recognition in Parts From a Single Depth Image", In CVPR, 2011.
L. Xia, C. C. Chen and J. K. Aggarwal, "View Invariant Human Action Recognition using Histograms of 3D Joints", In CVPRW, 2012.

Applications: gesture-based control, elderly care, teaching robots.

Skeleton-based Action Recognition
Overview of a typical skeleton-based action recognition approach: sequence of skeletons → skeletal representation (joint positions, joint angles, etc.) → temporal modeling (HMM, DTW, Fourier analysis, etc.) → classification (Bayes classifier, NN, SVM, etc.) → action label.

How to represent a 3D human skeleton for action recognition?

Human Skeleton: Points or Rigid Rods?
A skeleton can be viewed either as a set of points (joints) or as a set of rigid rods (body parts).

Human Skeleton as a Set of Points
Inspired by the moving lights display experiment of [Johansson 1973]; this is a popularly used skeletal representation.
Representation: concatenation of the 3D coordinates of the joints.
G. Johansson, "Visual Perception of Biological Motion and a Model for its Analysis", Perception and Psychophysics, 1973.
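As a minimal sketch (not code from the talk; the 20-joint layout and helper name are our own assumptions), the joint-positions representation is just a flattening of the per-frame joint coordinates:

```python
import numpy as np

# Hypothetical 20-joint skeleton: one (20, 3) array of 3D joint coordinates
# per frame.  The joint-positions (JP) representation concatenates them
# into a single 60-dimensional vector.
def joint_position_feature(joints):
    joints = np.asarray(joints, dtype=float)
    return joints.reshape(-1)          # (20, 3) -> (60,)

skeleton = np.random.rand(20, 3)
feat = joint_position_feature(skeleton)
assert feat.shape == (60,)
```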

Human Skeleton as a Set of Rigid Rods
The human skeleton is a set of 3D rigid rods (body parts) connected by joints. The spatial configuration of these rods can be represented using joint angles (shown as red arcs in the figure).
Representation: concatenation of the Euler-angle, axis-angle, or quaternion representations corresponding to the 3D joint angles.

Proposed Representation: Motivation Human actions are characterized by how different body parts move relative to each other. For action recognition, we need a skeletal representation whose temporal evolution directly describes the relative motion between various body parts.

Proposed Skeletal Representation We represent a skeleton using the relative 3D geometry between different body parts. The relative geometry between two body parts can be described using the 3D rotation and translation required to take one body part to the position and orientation of the other.

Relative 3D Geometry between Body Parts
We describe the relative geometry between two rigid body parts $(e_m, e_n)$ at time instance $t$ using the rotation $R_{m,n}(t)$ and the translation $d_{m,n}(t)$ (measured in the local coordinate system attached to $e_n$) required to take $e_n$ to the position and orientation of $e_m$:

$$\begin{bmatrix} e_{m1}^{n}(t) & e_{m2}^{n}(t) \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} R_{m,n}(t) & d_{m,n}(t) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} e_{n1}(t) & s_{m,n}\, e_{n2}(t) \\ 1 & 1 \end{bmatrix}$$

The rotation and translation vary with time. The scaling factor $s_{m,n}$ is independent of time, since the lengths of the body parts do not change with time.
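The homogeneous relation above can be sanity-checked numerically; the rotation, translation, and endpoint values below are made up for illustration, not data from the talk:

```python
import numpy as np

# Stack a rotation R and translation d into a 4x4 matrix P = [[R, d], [0, 1]]
# and apply it to the homogeneous endpoints of a body part.
def make_se3(R, d):
    P = np.eye(4)
    P[:3, :3] = R
    P[:3, 3] = d
    return P

theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
d = np.array([0.1, -0.2, 0.3])
P = make_se3(R, d)

e_n1 = np.array([0.0, 0.0, 0.0])          # starting point of the body part
e_n2 = np.array([0.0, 1.0, 0.0])          # end point of the body part
E = np.vstack([np.column_stack([e_n1, e_n2]), np.ones((1, 2))])  # homogeneous 4x2
transformed = P @ E

# Each transformed column equals R @ endpoint + d:
assert np.allclose(transformed[:3, 0], R @ e_n1 + d)
```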

R. M. Murray, Z. Li, and S. S. Sastry, "A Mathematical Introduction to Robotic Manipulation", CRC Press, 1994.

Special Euclidean Group SE(3)
The special Euclidean group, denoted by SE(3), is the set of all $4 \times 4$ matrices of the form
$$P(R, d) = \begin{bmatrix} R & d \\ 0 & 1 \end{bmatrix},$$
where $d \in \mathbb{R}^3$ and $R$ is a $3 \times 3$ rotation matrix.
The group SE(3) is a smooth, 6-dimensional curved manifold. The tangent plane to the manifold SE(3) at the identity matrix $I_4$, denoted by se(3), is known as the Lie algebra of SE(3); it is a 6-dimensional vector space.
The exponential map $\exp_{SE(3)}: se(3) \to SE(3)$ and the logarithm map $\log_{SE(3)}: SE(3) \to se(3)$ are given by $\exp_{SE(3)}(B) = e^{B}$ and $\log_{SE(3)}(P) = \log(P)$, where $e$ and $\log$ denote the usual matrix exponential and logarithm.
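A minimal sketch of these maps (assuming NumPy and SciPy are available; not code from the talk): the exponential and logarithm maps between SE(3) and se(3) are the ordinary matrix exponential and logarithm.

```python
import numpy as np
from scipy.linalg import expm, logm

# A point P of SE(3): rotation about the z-axis plus a translation.
# The angle and translation values are made up for illustration.
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
P = np.eye(4)
P[:3, :3] = R
P[:3, 3] = [1.0, 2.0, 3.0]

# log maps P into the Lie algebra se(3); exp maps it back.
B = np.real(logm(P))       # se(3) element of the form [[omega_hat, v], [0, 0]]
P_back = expm(B)
assert np.allclose(P_back, P)   # exp(log(P)) recovers P
```

Note that the matrix logarithm is well defined here because the rotation angle is below pi.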

Proposed Skeletal Representation
The human skeleton is described using the relative 3D geometry between all pairs of body parts:
$$\left\{ P_{m,n}(t) = \begin{bmatrix} R_{m,n}(t) & d_{m,n}(t) \\ 0 & 1 \end{bmatrix} : m \neq n,\ 1 \leq m, n \leq 19 \right\} \in SE(3) \times \dots \times SE(3).$$
Each $P_{m,n}(t)$ is a point in SE(3) describing the relative 3D geometry between body parts $(e_m, e_n)$ at time instance $t$; SE(3) × … × SE(3) is the Lie group obtained by combining multiple copies of SE(3) using the direct product ×.

Proposed Action Representation
Using the proposed skeletal representation, a skeletal sequence can be represented as a curve in the Lie group $SE(3) \times \dots \times SE(3)$:
$$\big\{ P_{m,n}(t) \mid m \neq n,\ 1 \leq m, n \leq 19 \big\}, \quad t \in [0, T].$$
At each time instance $t$, the skeleton is a point in $SE(3) \times \dots \times SE(3)$.

Proposed Action Representation
Classification of the curves in $SE(3) \times \dots \times SE(3)$ into different action categories is a difficult task due to the non-Euclidean nature of the space. Standard classification approaches such as support vector machines (SVM) and temporal modeling approaches such as Fourier analysis are not directly applicable to this space. To overcome these difficulties, we map the curves from the Lie group $SE(3) \times \dots \times SE(3)$ to its Lie algebra $se(3) \times \dots \times se(3)$, which is a vector space.

Proposed Action Representation
Human actions are represented as curves in the Lie algebra $se(3) \times \dots \times se(3)$:
$$\big\{ \log_{SE(3)}(P_{m,n}(t)) \mid m \neq n,\ 1 \leq m, n \leq 19 \big\}, \quad t \in [0, T].$$
Each $P_{m,n}(t) \in SE(3)$, describing the relative 3D geometry between body parts $(e_m, e_n)$ at time instance $t$, maps to a point in $se(3)$; the full skeleton at time $t$ thus maps to a point in $se(3) \times \dots \times se(3)$. Action recognition can then be performed by classifying the curves in this vector space into different action categories.
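A sketch of this flattening step (our own helper names, not the authors' code): map a curve of SE(3) matrices into se(3) 6-vectors via the matrix logarithm and a "vee" read-out.

```python
import numpy as np
from scipy.linalg import logm

def vee(B):
    """Read (omega, v) off a se(3) matrix [[omega_hat, v], [0, 0]] as a 6-vector."""
    return np.array([B[2, 1], B[0, 2], B[1, 0], B[0, 3], B[1, 3], B[2, 3]])

def curve_to_lie_algebra(curve):
    """Map a list of 4x4 SE(3) matrices to a (T, 6) array of se(3) vectors."""
    return np.stack([vee(np.real(logm(P))) for P in curve])

# The identity pose maps to the zero vector of se(3):
flat = curve_to_lie_algebra([np.eye(4), np.eye(4)])
assert flat.shape == (2, 6) and np.allclose(flat, 0.0)
```

In the full representation, one such 6-vector per body-part pair is computed at every time instance and the results are concatenated.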

Temporal Modeling and Classification
Action classification is a difficult task due to issues such as rate variations, temporal misalignments, and noise.
Following [Veeraraghavan 2009], we use Dynamic Time Warping (DTW) to handle rate variations. Following [Wang 2012], we use the Fourier temporal pyramid (FTP) representation to handle noise and temporal misalignments. We use a linear SVM on the Fourier temporal pyramid representation for final classification.
Pipeline: curve in se(3) × … × se(3) → dynamic time warping → Fourier temporal pyramid representation → linear SVM → action label.
A. Veeraraghavan, A. Srivastava, A. K. Roy-Chowdhury and R. Chellappa, "Rate-invariant Recognition of Humans and Their Activities", IEEE Trans. on Image Processing, 18(6):1326–1339, 2009.
J. Wang, Z. Liu, Y. Wu and J. Yuan, "Mining Actionlet Ensemble for Action Recognition with Depth Cameras", In CVPR, 2012.
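The DTW step can be illustrated with the classic dynamic program (a generic sketch over Euclidean frame distances, not the authors' implementation):

```python
import numpy as np

# Minimal DTW: O(T1*T2) dynamic program; each cell holds the cheapest
# cumulative cost of aligning prefixes of the two sequences.
def dtw_distance(X, Y):
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    T1, T2 = len(X), len(Y)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T1, T2]

# The same motion sampled at two different rates stays close under DTW:
a = np.sin(np.linspace(0, np.pi, 20))[:, None]
b = np.sin(np.linspace(0, np.pi, 35))[:, None]
assert dtw_distance(a, a) == 0.0
assert dtw_distance(a, b) < dtw_distance(a, np.zeros_like(b))
```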

Computation of Nominal Curves using DTW
We interpolate all the curves in the Lie group $SE(3) \times \dots \times SE(3)$ to have the same length.
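A plain linear-interpolation sketch of this resampling, applied here to generic feature vectors (the interpolation of SE(3)-valued curves in the actual method may differ, and target_len is an illustrative choice):

```python
import numpy as np

# Resample a (T, D) sequence onto a common number of frames by linearly
# interpolating each dimension over a normalized time axis.
def resample(seq, target_len=50):
    seq = np.asarray(seq, float)
    src = np.linspace(0.0, 1.0, len(seq))
    dst = np.linspace(0.0, 1.0, target_len)
    return np.stack([np.interp(dst, src, seq[:, d])
                     for d in range(seq.shape[1])], axis=1)

x = np.random.rand(30, 6)
r = resample(x)
assert r.shape == (50, 6)
```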

Fourier Temporal Pyramid Representation
The sequence is divided recursively into segments (level 0: the full sequence; level 1: two halves; level 2: four quarters), and a Fourier transform is applied to each segment. The magnitudes of the low-frequency Fourier coefficients from each level are used to represent a time sequence.
J. Wang, Z. Liu, Y. Wu and J. Yuan, "Mining Actionlet Ensemble for Action Recognition with Depth Cameras", In CVPR, 2012.
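A sketch of this feature (after Wang et al. 2012; n_levels and n_coeffs are illustrative choices, not values from the slides):

```python
import numpy as np

# Fourier temporal pyramid: recursively split the sequence, FFT each
# segment, and keep the magnitudes of the first few low-frequency
# coefficients from every segment at every level.
def ftp_features(seq, n_levels=3, n_coeffs=4):
    seq = np.asarray(seq, float)
    feats = []
    for level in range(n_levels):
        for segment in np.array_split(seq, 2 ** level):
            feats.append(np.abs(np.fft.fft(segment))[:n_coeffs])
    return np.concatenate(feats)

f = ftp_features(np.sin(np.linspace(0, 4 * np.pi, 64)))
assert f.shape == ((1 + 2 + 4) * 4,)   # 7 segments x 4 coefficients each
```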

Overview of the Proposed Approach

Experiments: Datasets
MSR-Action3D dataset: 557 action sequences in total, 20 actions, 10 subjects. W. Li, Z. Zhang, and Z. Liu, "Action Recognition Based on a Bag of 3D Points", In CVPR Workshops, 2010.
UTKinect-Action dataset: 199 action sequences in total, 10 actions, 10 subjects. L. Xia, C. C. Chen, and J. K. Aggarwal, "View Invariant Human Action Recognition Using Histograms of 3D Joints", In CVPR Workshops, 2012.
Florence3D-Action dataset: 215 action sequences in total, 9 actions, 10 subjects. L. Seidenari, V. Varano, S. Berretti, A. D. Bimbo, and P. Pala, "Recognizing Actions from Depth Cameras as Weakly Aligned Multi-part Bag-of-Poses", In CVPR Workshops, 2013.

Alternative Representations for Comparison
Joint positions (JP): concatenation of the 3D coordinates of the joints.
Joint angles (JA): concatenation of the quaternions corresponding to the joint angles (shown as red arcs in the figure).
Individual body part locations (BPL): each body part $e_m$ is represented as a point in $SE(3)$ using its relative 3D geometry with respect to the global $x$-axis.
Pairwise relative positions of the joints (RJP): concatenation of the 3D vectors $\vec{v_i v_j}$, $1 \leq i < j \leq 20$.
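For example, the RJP representation can be sketched as follows (helper name and the 20-joint assumption are ours):

```python
import numpy as np

# Pairwise relative joint positions: for 20 joints, concatenate all
# difference vectors v_i - v_j with i < j, giving C(20, 2) = 190 pairs,
# i.e. a 570-dimensional feature per frame.
def relative_joint_positions(joints):
    joints = np.asarray(joints, float)          # (20, 3)
    i, j = np.triu_indices(len(joints), k=1)
    return (joints[i] - joints[j]).reshape(-1)

rjp = relative_joint_positions(np.random.rand(20, 3))
assert rjp.shape == (190 * 3,)
```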

MSR-Action3D Dataset
Total 557 action sequences: 20 actions performed (2 or 3 times) by 10 different subjects. The dataset is further divided into 3 subsets: AS1, AS2 and AS3.
Action Set 1 (AS1): Horizontal arm wave, Hammer, Forward punch, High throw, Hand clap, Bend, Tennis serve, Pickup & throw.
Action Set 2 (AS2): High arm wave, Hand catch, Draw x, Draw tick, Draw circle, Two hand wave, Forward kick, Side boxing.
Action Set 3 (AS3): High throw, Forward kick, Side kick, Jogging, Tennis swing, Tennis serve, Golf swing, Pickup & throw.

Results: MSR-Action3D Dataset
Experiments were performed on each of the subsets (AS1, AS2 and AS3) separately. Half of the subjects were used for training and the other half for testing.

Recognition rates for various skeletal representations on the MSR-Action3D dataset:

Dataset   JP      RJP     JA      BPL     Proposed
AS1       91.65   92.15   85.80   83.87   95.29
AS2       75.36   79.24   65.47   75.23   83.87
AS3       94.64   93.31   94.22   91.54   98.22
Average   87.22   88.23   81.83   83.54   92.46

Comparison with the state-of-the-art results on the MSR-Action3D dataset:

Approach                                        Accuracy
Eigen Joints                                    82.30
Joint angle similarities                        83.53
Spatial and temporal part-sets                  90.22
Covariance descriptors on 3D joint locations    90.53
Random forests                                  90.90
Proposed approach                               92.46

MSR-Action3D Confusion Matrices
Average recognition accuracy: 95.29% on action set 1 (AS1), 83.87% on action set 2 (AS2), and 98.22% on action set 3 (AS3).

Results: UTKinect-Action Dataset
Total 199 action sequences: 10 actions performed (2 times) by 10 different subjects. Half of the subjects were used for training and the other half for testing.

Recognition rates for various skeletal representations on the UTKinect-Action dataset:

JP      RJP     JA      BPL     Proposed
94.68   95.58   94.07   94.57   97.08

Comparison with the state-of-the-art results on the UTKinect-Action dataset:

Approach                  Accuracy
Random forests            87.90
Histograms of 3D joints   90.92
Proposed approach         97.08

Results: Florence3D-Action Dataset
Total 215 action sequences: 9 actions performed (2 or 3 times) by 10 different subjects. Half of the subjects were used for training and the other half for testing.

Recognition rates for various skeletal representations on the Florence3D-Action dataset:

JP      RJP    JA      BPL     Proposed
85.26   85.2   81.36   80.80   90.88

Comparison with the state-of-the-art results on the Florence3D-Action dataset:

Approach                  Accuracy
Multi-Part Bag-of-Poses   82.00
Proposed approach         90.88

Thank You