Skeleton Based Action Recognition with Convolutional Neural Network
Yong Du†, Yun Fu‡, Liang Wang†
†Nat'l Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
‡College of Engineering, College of Computer and Information Science, Northeastern University, USA
Nov. 6, 2015

Hello everyone! I'm very honored to present our work here. In this report, we propose a simple end-to-end framework for skeleton based action recognition that is both high-precision and high-efficiency.
Outline
Background & Motivation; Our Proposed Model; Experimental Results; Conclusions & Future Work

This is the outline. (click) I will first introduce the background of our work.
Action Recognition
Two main branches of action recognition: RGB video based action recognition; RGBD video (skeleton) based action recognition.
Applications: Automatic Drive; Content-Based Video Search; Game Control; Robot Vision; HC Interaction; Intelligent Surveillance.
Objective of this work – skeleton based action recognition

Research on action recognition has two main branches: RGB video based action recognition and depth video based action recognition. Skeleton estimation algorithms can estimate relatively reliable joint coordinates from depth videos, and most approaches for RGB video based action recognition can be directly adapted to depth videos, so the second main branch is in practice skeleton based action recognition. Applications of action recognition are … The objective of this work is to solve skeleton based action recognition.
Related Work
Mining actionlet ensemble for action recognition with depth cameras (CVPR 2012)
An approach to pose-based action recognition (CVPR 2013)
Bio-inspired Dynamic 3D Discriminative Skeletal Features for Human Action Recognition (CVPRW 2013)
Hierarchical recurrent neural network for skeleton based action recognition (CVPR 2015)
Leveraging Hierarchical Parametric Networks for Skeletal Joints Based Action Segmentation and Recognition (CVPR 2014)

Most existing skeleton based action recognition frameworks are dictionary learning based approaches. Temporal pyramids and their variants are employed to capture the local temporal evolution, as in the actionlet ensemble, the pose based model, and the dynamic 3D discriminative skeletal features. Because the width of the time windows is restricted, temporal pyramid models can only use limited contextual information. In some works, time series models, especially HMMs, are applied to model the global temporal evolution, yet it is very difficult to obtain temporally aligned sequences and the emission distributions of the HMMs. Recently, an end-to-end approach based on recurrent neural networks was proposed for this problem.
Related Work
Limitations of most existing methods:
Hand-crafted features;
Dictionary learning based approaches (BoW);
Temporal pyramids and their variants -> utilize limited contextual information;
Time series models – mainly DTW & HMMs – need pre-segmentation & pre-alignment, and the emission distributions are difficult to obtain.
Motivation & Contributions
Information in action sequence – postures over time and their evolution;
Temporal dynamics -> static structure. Transform sequences into images – preserve dynamic & static information;
Representation – structural information (I) <-> dynamic & static information (S).
Contributions:
Propose a novel representation for skeleton sequences;
Propose a simple end-to-end but high-efficiency and high-precision solution for skeleton based action recognition;
This framework is easily transferred to other time series problems.

Two essential elements of action recognition are static postures and their temporal dynamics. In general, temporal dynamics can be transformed into static spatial structure, so we can represent the spatial-temporal information in a skeleton sequence as the static spatial structure of an image. Mature feature learning methods can then be used to learn a representation of that image, which is an indirect representation of the original skeleton sequence. (click) Our contributions are as follows.
Skeleton Based Action Recognition with CNN
Data Transformation; Image Representation; Feature extraction & classification

The proposed model contains three parts: data transformation, normalized image representation, and the hierarchical spatial-temporal adaptive filter banks. Each part contains … (Alignment …) Because the lengths of skeleton sequences vary, we resize the generated images to a uniform size. Finally, a hierarchical spatial-temporal adaptive filter banks model is used for feature learning and classification.
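The resizing step can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the slides do not say which interpolation is used or what the target size is, so the separable linear interpolation and the names `resize_image` and `_resample_axis` are hypothetical.

```python
import numpy as np

def _resample_axis(a, new_len, axis):
    """Linearly resample array `a` to `new_len` samples along `axis`."""
    old_len = a.shape[axis]
    old_pos = np.linspace(0.0, 1.0, old_len)   # normalized positions of the input samples
    new_pos = np.linspace(0.0, 1.0, new_len)   # normalized positions of the output samples
    moved = np.moveaxis(a, axis, -1)           # bring the target axis to the end
    flat = moved.reshape(-1, old_len)
    out = np.stack([np.interp(new_pos, old_pos, row) for row in flat])
    out = out.reshape(moved.shape[:-1] + (new_len,))
    return np.moveaxis(out, -1, axis)

def resize_image(img, height, width):
    """Resize an (H, W, C) image to (height, width, C).

    Assumption: linear interpolation applied separately along each axis;
    the original work may use a different resizing method.
    """
    img = np.asarray(img, dtype=np.float64)
    out = _resample_axis(img, height, axis=0)
    return _resample_axis(out, width, axis=1)
```

With this sketch, sequences of any length map to images of the same shape, for example `resize_image(img, 8, 8)` for an input of shape `(5, 37, 3)`.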
From Skeleton Sequences to Images
Data Transformation
Spatial postures – align joints according to human physical structure;
Temporal dynamics – arrange in chronological order;
Three components of each joint ↔ three components of each pixel.
Figure: image representations obtained on the Berkeley MHAD dataset

The objective of data transformation is to turn the temporal dynamics of a sequence into the static structure of an image while preserving the posture information. The skeleton in each frame is represented as a vector whose joints are ordered according to the human physical structure (the five parts). The representations of all frames are arranged in chronological order to represent the whole sequence, and the three coordinate components of each joint become the three components of the corresponding pixel.
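A minimal NumPy sketch of this transformation follows. The layout (rows are joints, columns are frames, channels are the x, y, z coordinates) follows the slides; the per-channel min-max scaling to 0–255 and the `joint_order` argument are assumptions made for illustration, since the exact normalization and joint ordering are not given here.

```python
import numpy as np

def skeleton_to_image(seq, joint_order):
    """Turn a skeleton sequence into an image.

    seq:         (T, J, 3) array of joint coordinates, T frames in time order.
    joint_order: joint indices grouped by body part (hypothetical ordering).
    Returns a (len(joint_order), T, 3) uint8 image.
    """
    seq = np.asarray(seq, dtype=np.float64)
    ordered = seq[:, joint_order, :]            # reorder joints by body part
    img = ordered.transpose(1, 0, 2)            # rows = joints, columns = frames
    lo = img.min(axis=(0, 1), keepdims=True)    # per-channel minimum
    hi = img.max(axis=(0, 1), keepdims=True)    # per-channel maximum
    scale = np.where(hi > lo, hi - lo, 1.0)     # avoid division by zero
    # Assumed normalization: min-max per coordinate channel over the sequence.
    return np.round(255.0 * (img - lo) / scale).astype(np.uint8)
```

Each column of the result is one posture and each row traces one joint over time, so horizontal structure carries the temporal dynamics.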
Hierarchical Architecture
Adaptive filter banks for feature representation learning:
Convolution: all filter sizes are 3 x 3 and all convolutional strides are 1;
Spatial-temporal synchronous pooling: max-pooling;
The number of weights: about 75,000;
Tested by voting.

This is the hierarchical architecture of the adaptive filter banks. After the image is resized to a uniform size, an adaptive filter bank model (a CNN) is used for feature learning and classification. To treat the CNN as hierarchical adaptive filter banks, all filter sizes are 3x3 and all convolution strides are set to 1. Resizing changes the original action frequencies by different scales, and the same action performed by different subjects may have different frequencies, so we apply max-pooling after each of the first three filter banks. Because of the special structure of the input images, the scale invariance of max-pooling along the horizontal axis becomes frequency invariance of actions, and max-pooling along the vertical axis can select the more discriminative skeleton joints for different actions. After feature extraction, a feed-forward neural network with two fully connected layers is employed for classification.
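The slide only says "Tested by voting" and does not state what is voted over. One possible reading, shown here purely as an assumption, is that several crops of the same test image are classified and the majority class wins. The function name `vote` and the crop-based setup are hypothetical.

```python
import numpy as np

def vote(crop_scores):
    """Majority vote over per-crop class scores.

    crop_scores: (n_crops, n_classes) array, one row of class scores per crop
                 of a single test image (assumed setup, not from the slides).
    Returns the class index chosen by most crops; ties go to the lowest index.
    """
    scores = np.asarray(crop_scores)
    preds = np.argmax(scores, axis=1)                        # one label per crop
    counts = np.bincount(preds, minlength=scores.shape[1])   # votes per class
    return int(np.argmax(counts))
```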
From Skeleton Sequences to Images
Problem & Solution
Variable frequency problem – different subjects, different sequences, resize;
Solution – Spatial-Temporal Synchronous Pooling.

The main problem of this approach is the variable frequency of actions, which the following spatial-temporal synchronous pooling can restrain. (Pooling adjusts the frequency only to a certain extent; this does not need to be over-emphasized.) The reasons are: 1) max-pooling smooths the texture of its inputs; 2) the motion information of actions generally appears as a low-frequency signal; 3) the frequency of the low-frequency component of a smooth signal is characterized by its extreme points.
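The effect of max-pooling on small timing differences can be illustrated with a short NumPy sketch. It assumes that "synchronous" means one pooling window applied to the joint axis and the time axis at once, and it uses a 2 x 2 non-overlapping window; both are assumptions, and `max_pool2d` is a hypothetical helper rather than the model's actual layer.

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping size x size max-pooling over the first two axes.

    x: (H, W) or (H, W, C) array; rows are joints, columns are frames.
    Trailing rows/columns that do not fill a full window are dropped.
    """
    h, w = x.shape[0] // size, x.shape[1] // size
    x = x[:h * size, :w * size]
    x = x.reshape((h, size, w, size) + x.shape[2:])
    return x.max(axis=(1, 3))   # pool joints and time in the same step

# Two toy "images" whose single peak falls in neighbouring frames.
a = np.zeros((2, 8)); a[0, 2] = 1.0
b = np.zeros((2, 8)); b[0, 3] = 1.0
same = np.array_equal(max_pool2d(a), max_pool2d(b))
```

Here `same` is true because both peaks fall in one pooling window, which is the sense in which pooling tolerates small changes in timing. Shifts that cross a window boundary are not absorbed, consistent with the remark that the adjustment holds only to a certain extent.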
Datasets
Motion capture dataset: High sample frequency; Long sequences; High precision.
Berkeley Multimodal Human Action Dataset (Berkeley MHAD):
12 subjects, 11 actions, 659 valid samples, 35 joints;
480 FPS, ≈3602 frames/sequence;
Training on 384 samples of the first 7 subjects, testing on the remaining 275 samples.

There are two kinds of datasets for skeleton based action recognition. One is the motion capture dataset, and the other is the Kinect dataset.
Datasets
Kinect dataset: Low sample frequency; Short sequences; Low precision.
ChaLearn Gesture Recognition Dataset:
27 persons, 20 Italian gestures, 6850 training samples, 3454 validation samples, 3579 test samples;
20 FPS, ≈39 frames/sequence;
Provides RGB, depth, foreground segmentation and Kinect skeletons;
Only use skeleton data, training on the training set and testing on the validation set.

The main difference between the two kinds of datasets is the coordinate precision.
Experimental results on the Berkeley MHAD

Method                    Accuracy (%)
Ofli et al., 2014         95.37
Vantigodi et al., 2013    96.06
Vantigodi et al., 2014    97.58
Kapsouras et al., 2014    98.18
Du et al., 2015           100
Ours                      100

Analysis:
Temporal dynamics → static structure → final sequence representation: successful;
Spatial-temporal synchronous pooling – overcomes the variable frequency problem;
This model handles this problem very well.

On the Berkeley MHAD dataset, our model achieves 100% accuracy without any other pre- or post-processing.
Experimental results on the ChaLearn Gesture Recognition Dataset

Method                         Precision   Recall   F1-score
Yao et al., CVPR 2014          -           -        56.0
Wu et al., ACM-ICMI 2013       59.9        59.3     59.6
Pfister et al., ECCV 2014      61.2        62.3     61.7
Fernando et al., CVPR 2015     75.3        75.1     75.2
Our Hierarchical RNN           91.93       92.01    91.97
Ours                           91.16       91.25    91.21

Analysis:
Excellent performance and good robustness;
Processes the temporal information better than traditional methods;
Skeleton data can represent human motions well.

It is clear that our method surpasses the previous state-of-the-art precision by more than 15 percentage points, which demonstrates that transforming the temporal dynamics of sequences into spatial structure information in images is a successful approach to sequence representation learning. Some of you may be interested in the comparison between this model and our hierarchical model. Because of limited space, we didn’t put this part in the paper, but we can give the results here. It is very clear that both of our models can deal with this problem very well without any sophisticated processing. Considering the stochastic error, we cannot tell which one is the better solution for this problem, because each of them has its own advantages.

For potential questions:
Commonalities of the two proposed models: 1) recognize actions based on global analysis without temporal alignment and pre-segmentation; 2) handle variable-length/frequency sequences; 3) end-to-end, with high computational efficiency and excellent performance.
Differences between the two proposed models: 1) Hierarchical RNN – accesses contextual information over a long time, hierarchical fusion, more robust and better convergence; 2) CNN based model – hierarchical filters & spatial-temporal synchronous pooling, fewer parameters and much faster (GPU implementation).
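The slide does not define the F1-score column. Assuming it is the usual harmonic mean of precision and recall, it can be checked with the small sketch below; the benchmark may average over classes differently, so small rounding differences against the reported values are possible.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Precision and recall taken from the Fernando et al. row of the table.
fernando_f1 = round(f1_score(75.3, 75.1), 1)
```

Under this assumption `fernando_f1` agrees with the F1-score reported for that row.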
Experimental Results
Computational efficiency (NVIDIA Titan GK110): training with 1.95 ms/sequence, testing with 2.27 ms/sequence
Figure: filters and convergence curves on the ChaLearn Gesture Dataset

This model has very high computational efficiency. The following are the filters and the convergence curves of both models. (There is no need to explain these filters.) Both of our models converge very well.

For potential questions: We add input noise and weight noise during training to improve the performance of the hierarchical RNN, so its convergence curve on the test set lies below that on the training set.
Conclusions
Advantages & disadvantages
Advantages: simple, end-to-end, high-precision and high-efficiency; no need for temporal alignment and pre-segmentation; handles variable-length/frequency sequences.
Disadvantages: sensitive to fragments missing in sequences.

Disadvantage: This model is robust to general noise, but if the data errors in local fragments of the input sequences are particularly pronounced, the recognition rate may drop.
Future work: consider appearance features as an aid for depth video based action recognition.
E-mail: {Yong.du, wangliang}@nlpr.ia.ac.cn, yunfu@ece.neu.edu