Skeleton Based Action Recognition with Convolutional Neural Network


Skeleton Based Action Recognition with Convolutional Neural Network Yong Duƚ, Yun Fuǂ, Liang Wangƚ Hello everyone! I'm very honored to present our work here. In this report, we propose a simple end-to-end but high-precision and high-efficiency framework for skeleton based action recognition. ƚNat'l Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences ǂCollege of Engineering, College of Computer and Information Science, Northeastern University, USA Nov. 6, 2015

Outline Background & Motivation Our Proposed Model Experimental Results Conclusions & Future Work This is the outline. (click) I will first give an introduction to the background and motivation of our work.

Action Recognition Two main branches of action recognition: RGB video based action recognition; RGBD video (skeleton) based action recognition. Applications: automatic driving, content-based video search, game control, robot vision, human-computer interaction, intelligent surveillance. Research on action recognition mainly follows two branches: one is RGB video based action recognition and the other is depth video based action recognition. Skeleton estimation algorithms can recover relatively reliable joint coordinates from depth videos, and most approaches for RGB video based action recognition can be directly adapted to the depth setting, so skeleton based action recognition has become a main branch in its own right. Objective of this work – skeleton based action recognition.

Related Work Mining actionlet ensemble for action recognition with depth cameras (CVPR 2012); An approach to pose-based action recognition (CVPR 2013); Bio-inspired Dynamic 3D Discriminative Skeletal Features for Human Action Recognition (CVPRW 2013); Leveraging Hierarchical Parametric Networks for Skeletal Joints Based Action Segmentation and Recognition (CVPR 2014); Hierarchical recurrent neural network for skeleton based action recognition (CVPR 2015).

Related Work Limitations of most existing methods: hand-crafted features; dictionary learning based approaches (BoW); temporal pyramids and their variants → utilize only limited contextual information; time series models (mainly DTWs & HMMs) need pre-segmentation & pre-alignment, and their emission distributions are difficult to obtain. Most existing skeleton based action recognition frameworks are dictionary learning based approaches, and Temporal Pyramids and their variants are employed to capture the local temporal evolution, such as (click) actionlet, (click) the pose based model, and (click) dynamic 3D discriminative skeletal features. Restricted by the width of the time windows, Temporal Pyramid models can only utilize limited contextual information. In some works, time series models, especially HMMs (click), are applied to model the global temporal evolution, yet it is very difficult to obtain temporally aligned sequences and the emission distributions of HMMs. Recently (click), an end-to-end approach based on a Recurrent Neural Network was proposed for this problem.

Motivation & Contributions Information in an action sequence – postures over time and their evolution; temporal dynamics → static structure. Transform sequences into images – preserve dynamic & static information; representation – structural information (I) ↔ dynamic & static information (S). Contributions: propose a novel representation for skeleton sequences; propose a simple end-to-end but high-efficiency and high-precision solution for skeleton based action recognition; this framework is easily transferred to other time series problems. Two essential elements for action recognition are static postures and their temporal dynamics. Generally, temporal dynamics can be easily transformed into static spatial structure, so we can represent the spatial-temporal information of a skeleton sequence as static spatial structure in an image. Then mature feature learning methods can be used to learn a representation of the image, which is an indirect representation of the original skeleton sequence. (click) Our contributions are as follows.

Outline Background & Motivation Our Proposed Model Experimental Results Conclusions & Future Work This is the outline again. (click) Next, I will introduce our proposed model.

Skeleton Based Action Recognition with CNN The proposed model contains three parts, i.e., data transformation, normalized image representation, and hierarchical spatial-temporal adaptive filter banks. Each part contains … (Alignment …) Considering that the lengths of skeleton sequences are variable, we resize the generated images to a uniform size. Finally, a hierarchical spatial-temporal adaptive filter bank model is used for feature learning and classification. Data Transformation → Image Representation → Feature extraction & classification

From Skeleton Sequences to Images Data Transformation Spatial postures – align joints according to the human physical structure; temporal dynamics – arrange frames in chronological order; three components of each joint ↔ three components of each pixel. Image representations obtained on the Berkeley MHAD dataset. The objective of data transformation is to transform the temporal dynamics of a sequence into the static structure of an image while preserving the posture information. The skeleton in each frame is represented as a vector ordered according to the human physical structure (the five body parts). The representations of all frames are arranged in chronological order to represent the whole sequence, and the three components of each joint become the corresponding three components of each pixel. The main problem of this approach is variable frequency, which the following Spatial-Temporal Synchronous Pooling can restrain. The reasons are (pooling only adjusts frequency to a certain extent, so this should not be overemphasized): 1) max-pooling smooths the texture of its inputs; 2) the motion information of actions is mostly a low-frequency signal; 3) the frequency of the low-frequency component of a smooth signal is represented by its extreme points.
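The transformation above can be sketched in a few lines. This is a minimal illustration, not the authors' code; `sequence_to_image` is a hypothetical helper, and it assumes the joint order already follows the five-part body layout and that coordinates have been normalized to pixel range.

```python
def sequence_to_image(frames):
    """Map a skeleton sequence to an image-like array.

    frames: list of T frames, each a list of J joints, each joint an
    (x, y, z) tuple. Returns a J x T "image" whose pixel at (j, t) holds
    the three coordinates of joint j at frame t: rows encode the spatial
    posture, columns encode time, so temporal dynamics become horizontal
    structure in the image.
    """
    num_joints = len(frames[0])
    return [[frames[t][j] for t in range(len(frames))]
            for j in range(num_joints)]


# Two frames of a toy 3-joint skeleton.
seq = [
    [(10, 20, 30), (40, 50, 60), (70, 80, 90)],   # frame 0
    [(11, 21, 31), (41, 51, 61), (71, 81, 91)],   # frame 1
]
img = sequence_to_image(seq)
# img has 3 rows (joints) and 2 columns (frames); img[0][1] == (11, 21, 31)
```

A longer sequence simply produces a wider image, which is why the images are then resized to a uniform size before feature learning.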

Hierarchical Architecture Adaptive filter banks for feature representation learning: convolution – all filter sizes are 3 × 3 and all convolutional strides are 1; spatial-temporal synchronous pooling – max-pooling; the number of weights: about 75,000; tested by voting. This is the hierarchical architecture of the adaptive filter banks. After resizing the image to a uniform size, an adaptive filter bank (a CNN model) is used for feature learning and classification. To treat the CNN as hierarchical adaptive filter banks, all filter sizes are 3 × 3 and all convolutional strides are set to 1. Considering that the original action frequencies are changed at different scales when resizing, and that the same action performed by different subjects may have various frequencies, we adopt the max-pooling strategy after each of the first three filter banks. Owing to the special structure of the input images, the scale-invariance of max-pooling along the horizontal axis becomes frequency-invariance of actions, and max-pooling along the vertical axis can select more discriminative skeleton joints for different actions. After feature extraction, a feed-forward neural network with two fully connected layers is employed for classification.
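The frequency-invariance argument can be made concrete with a toy 1-D example: max-pooling along the time axis keeps the extreme points of a smooth signal, so two renditions of the same motion at different speeds pool to similar low-frequency profiles. This is only an illustrative sketch under those assumptions, not the paper's pooling layers.

```python
def max_pool_1d(signal, window=2):
    """Non-overlapping max-pooling over a 1-D signal."""
    return [max(signal[i:i + window])
            for i in range(0, len(signal) - window + 1, window)]


slow = [0, 1, 2, 3, 4, 3, 2, 1]   # a motion peak, performed slowly
fast = [0, 2, 4, 2]               # the same peak at double speed

# Pooling the slow signal twice and the fast signal once both retain the
# extreme point (4), even though the raw sequences have different lengths:
assert max(max_pool_1d(max_pool_1d(slow))) == 4
assert max(max_pool_1d(fast)) == 4
```

In the 2-D image representation the same pooling acts horizontally (frequency invariance over time) and vertically (selecting discriminative joints).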

From Skeleton Sequences to Images Problem & Solution Variable frequency problem – different subjects, different sequences, resizing; solution – Spatial-Temporal Synchronous Pooling.
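Before pooling can act, variable-length sequences are resized to a uniform width along the time axis. A minimal sketch of that step, assuming nearest-neighbor resampling for brevity (the actual interpolation scheme used in the paper may differ); `resize_time_axis` is a hypothetical helper.

```python
def resize_time_axis(image, target_width):
    """Resample each row (one joint's trajectory) to target_width columns.

    image: list of rows, each a list of per-frame values.
    Nearest-neighbor along time: column t of the output reads the source
    column floor(t * src_width / target_width).
    """
    src_width = len(image[0])
    return [[row[min(src_width - 1, int(t * src_width / target_width))]
             for t in range(target_width)]
            for row in image]


row = [[1, 2, 3, 4, 5, 6]]                 # one joint over six frames
shorter = resize_time_axis(row, 3)          # downsample: [[1, 3, 5]]
longer = resize_time_axis(row, 12)          # upsample: each frame repeated
```

Note that resizing rescales the action's apparent frequency, which is exactly the distortion the spatial-temporal synchronous pooling is meant to restrain.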

Outline Background & Motivation Our Proposed Model Experimental Results Conclusions & Future Work This is the outline again. (click) Next, I will present the experimental results.

Datasets Motion capture datasets: high sample frequency; long sequences; high precision. Berkeley Multimodal Human Action Dataset (Berkeley MHAD): 12 subjects, 11 actions, 659 valid samples, 35 joints; 480 FPS, ≈3602 frames/sequence; training on 384 samples from the first 7 subjects, testing on the remaining 275 samples. There are two kinds of datasets for skeleton based action recognition: motion capture datasets and Kinect datasets.

Datasets Kinect datasets: low sample frequency; short sequences; low precision. ChaLearn Gesture Recognition Dataset: 27 persons, 20 Italian gestures, 6850 training samples, 3454 validation samples, 3579 test samples; 20 FPS, ≈39 frames/sequence; provides RGB, depth, foreground segmentation, and Kinect skeletons; we only use the skeleton data, training on the training set and testing on the validation set. The main difference between these two kinds of datasets is the coordinate precision. ChaLearn Gesture Recognition Dataset

Experimental results on the Berkeley MHAD

Method                   Accuracy (%)
Ofli et al., 2014        95.37
Vantigodi et al., 2013   96.06
Vantigodi et al., 2014   97.58
Kapsouras et al., 2014   98.18
Du et al., 2015          100
Ours                     100

Analysis: temporal dynamics → static structure → final sequence representation: successful; spatial-temporal synchronous pooling overcomes the variable frequency problem; this model handles the problem very well. On the Berkeley MHAD dataset, our model achieves 100% accuracy without any other pre- or post-processing.

Experimental results on the ChaLearn Gesture Recognition Dataset

Method                      Precision  Recall  F1-score
Yao et al., CVPR 2014       –          56.0
Wu et al., ACM-ICMI 2013    59.9       59.3    59.6
Pfister et al., ECCV 2014   61.2       62.3    61.7
Fernando et al., CVPR 2015  75.3       75.1    75.2
Our Hierarchical RNN        91.93      92.01   91.97
Ours                        91.16      91.25   91.21

Analysis: excellent performance and good robustness; processes temporal information better than traditional methods; skeleton data can represent human motions well. It is clear that our method surpasses the state-of-the-art precision by more than 15 percentage points, which demonstrates that transforming the temporal dynamics of sequences into spatial structure information in images is a great success for sequence representation learning. Some may be interested in the comparison between this model and our hierarchical RNN model. For lack of space we did not put this comparison in the paper, but we give the results here. It is very clear that both of our models handle this problem very well without any sophisticated processing. Considering stochastic error, we cannot tell which one is the better solution, because each has its own advantages. For potential questions – commonalities of the two proposed models: 1) recognize actions based on global analysis, without temporal alignment or pre-segmentation; 2) handle variable-length/frequency sequences; 3) end-to-end, with high computational efficiency and excellent performance. Differences: 1) the hierarchical RNN accesses contextual information over long time spans and fuses it hierarchically, and is more robust with better convergence; 2) the CNN based model uses hierarchical filters and spatial-temporal synchronous pooling, with fewer parameters and a much faster (GPU) implementation.

Experimental Results Computational efficiency (NVIDIA Titan GK110): training at 1.95 ms/sequence, testing at 2.27 ms/sequence. Filters and convergence curves on the ChaLearn Gesture Dataset. This model has very high computational efficiency. The figure shows the filters and the convergence curves of both models (there is no need to explain the filters); the convergence of both models is very good. For potential questions: we add input noise and weight noise during training to improve the performance of the hierarchical RNN, so its convergence curve on the test set lies below that of the training set.

Outline Background & Motivation Our Proposed Model Experimental Results Conclusions & Future Work This is the outline again. (click) Finally, I will present our conclusions and future work.

Conclusions Advantages: simple, end-to-end, high-precision and high-efficiency; no need for temporal alignment or pre-segmentation; handles variable-length/frequency sequences. Disadvantages: sensitive to missing fragments in sequences. This model is robust to general noise, but if the data error of local fragments in the input sequences is particularly pronounced, the recognition rate may drop. Future work: consider appearance features as an aid in solving depth video based action recognition.

E-mail: {Yong.du, wangliang}@nlpr.ia.ac.cn, yunfu@ece.neu.edu