Deep Reconfigurable Models with Radius-Margin Bound for 3D Human Activity Recognition

Outline Background Related Works Motivation Framework Results Conclusion

Background

Outline Background Related Works Motivation Framework Results Conclusion

Joints as Features Recognizing nine atomic marker movements from Motion Capture (MoCap) data Curves in 2D phase spaces (joint angle vs. height of hips) Supervised learning for selecting phase spaces [ L. Campbell and A. Bobick. Recognition of human body motion using phase space constraints. ICCV 1995 ]

HMMs Feature: Dynamics of single joints modeled by HMM Classifier: HMMs as weak classifiers for AdaBoost [ F. Lv and R. Nevatia. Recognition and segmentation of 3-d human action using hmm and multi-class adaboost. ECCV 2006 ]

Histogram of 3D Joint Locations Reference: Joint locations relative to hip in spherical coordinates Pose Feature: Quantization using soft binning with Gaussians Classifier: LDA + Codebook of poses (k-means) + HMM [ L. Xia et al. View invariant human action recognition using histograms of 3d joints. HAU3D 2012 ]

EigenJoints Integrated pose feature: fcc (spatial joint differences), fcp (temporal joint differences), fci (pose difference to the initial pose) Classifier: Naïve Bayes Nearest Neighbor [ X. Yang and Y. Tian. Eigenjoints-based action recognition using naive-bayes-nearest-neighbor. HAU3D 2012 ]
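As a hedged sketch (not the paper's exact code), the three difference features can be computed from a (T, J, 3) array of joint positions as follows; the original method additionally normalizes and applies PCA to obtain the "Eigen" representation:

```python
import numpy as np

def eigenjoints_differences(joints):
    """Three EigenJoints difference features from (T, J, 3) joint positions."""
    # fcc: pairwise spatial differences between joints within each frame
    f_cc = joints[:, :, None, :] - joints[:, None, :, :]      # (T, J, J, 3)
    # fcp: pairwise differences between current and preceding frames
    f_cp = joints[1:, :, None, :] - joints[:-1, None, :, :]   # (T-1, J, J, 3)
    # fci: pose difference to the initial frame
    f_ci = joints - joints[0]                                 # (T, J, 3)
    return f_cc, f_cp, f_ci
```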

Relational Pose Feature Feature: Spatio-temporal relation between joints, e.g., Classifier: Classification and regression forest for action recognition [ A. Yao et al. Does human action recognition benefit from pose estimation? BMVC 2011 ] [ A. Yao et al. Coupled action recognition and pose estimation from multiple views. IJCV 2012 ]

Depth Map as Feature Silhouette (contour) posture Project depth maps Select 3D points as pose representation Gaussian Mixture Model to model the spatial locations of points Action Graph: [ W. Li et al. Action recognition based on a bag of 3d points. HAU3D 2010 ]

Depth Motion Maps Project depth maps and compute differences: HOG + SVM [ X. Yang et al. Recognizing actions using depth motion maps-based histograms of oriented gradients. ICM 2012 ]
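A minimal sketch of one projection view, assuming depth maps are stacked as a (T, H, W) array and motion energy is accumulated from thresholded frame differences (the threshold value and function name are illustrative; the method projects onto three orthogonal planes and feeds the maps to HOG + SVM):

```python
import numpy as np

def depth_motion_map(depth, threshold=50):
    """Accumulate thresholded frame-to-frame differences of (T, H, W) depth maps."""
    diffs = np.abs(np.diff(depth.astype(np.int32), axis=0))  # (T-1, H, W)
    return (diffs > threshold).sum(axis=0)  # per-pixel motion energy
```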

Space-Time Occupancy Patterns Silhouettes are sensitive to occlusion and noise Clip (5 frames) as 4D spatio-temporal grid Feature vector: Number of points per cell [ A. Vieira et al. STOP: Space-Time Occupancy Patterns for 3D Action Recognition from Depth Map Sequences. LNCS 2012 ]

Random Occupancy Patterns Compute occupancy patterns from spatio-temporal subvolumes Select subvolumes based on Within-class scatter matrix (SW) and Between-class scatter matrix (SB): Sparse coding + SVM [ J. Wang et al. Robust 3d action recognition with random occupancy patterns. ECCV 2012 ]

Histogram of 4D Surface Normals Quantization according to “projectors” pi: Add additional discriminative “projectors” [ O. Oreifej and Z. Liu. HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences. CVPR 2013 ]

Depth and Joints Local occupancy features around joint locations Features are histograms of a temporal pyramid Discriminatively select actionlets (subsets of joints) [ J. Wang et al. Mining actionlet ensemble for action recognition with depth cameras. CVPR 2012 ]

Pose and Objects Spatio-temporal relations between human poses and objects [ J. Lei et al. Fine-grained kitchen activity recognition using rgb-d. UbiComp 2012 ] [ H. Koppula et al. Learning human activities and object affordances from rgb-d videos. IJRR 2013 ]

Outline Background Related Works Motivation Framework Results Conclusion

Motivation Difficulties of complex human activity recognition Representing complex human appearance and motion information. 1) It is usually hard to retrieve accurate body information under motion due to the diverse poses/views of individuals. 2) Depth maps are often unavoidably contaminated and may become unstable due to sensor noise or self-occlusions of bodies. 3) First row: two activity examples from the CAD-120 dataset, where two subjects perform the same activity (arranging-objects). Second row: two activity examples from the proposed OA dataset, where two subjects perform the same activity (going-to-work). Activity examples (diverse poses/views of individuals)

Motivation Difficulties of complex human activity recognition Capturing large temporal variations of human activities. Variant temporal lengths 1) We consider one activity as a sequence of actions occurring over time: (a) the temporal compositions of actions differ across subjects; (b) the temporal lengths of the decomposed actions vary across subjects. 2) The figure shows two activities of the same category performed by two subjects. Activity = temporal compositions of actions

Motivation Deep architecture: act directly on grayscale-depth data. Reconfigurable structure: handle large temporal variations of activities via latent variables. Radius-margin regularization: conduct the classification with good generalization performance. 1) We build the layered network stacked from convolutional layers, max-pooling operators and fully connected layers, where the raw segmented videos are treated as inputs. 2) We impose latent variables to capture the temporal compositions of activities by manipulating the activation of neurons. 3) 3D activity datasets are small, yet the regularized model shows very impressive generalization power with small-scale training data.

Outline Background Related Works Motivation Framework Results Conclusion

Architecture of spatio-temporal convolutional neural networks 1) The network is stacked from convolutional layers, max-pooling operators and fully connected layers, where the raw segmented videos are treated as the input. 2) A clique is designed as a subpart of the network, stacked over several layers, extracting features for one segmented video. 3) The architecture can be partially enabled to explicitly handle different temporal compositions of the activities. Video segments are of variable (unfixed) length.

3D Convolution across both spatial and temporal domains 3D Convolution across 3 adjacent frames
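A minimal NumPy sketch of such a 3D convolution with a kernel spanning 3 adjacent frames (a plain loop for clarity; the actual implementation, described later, uses Theano):

```python
import numpy as np

def conv3d_valid(video, kernel):
    """Convolve a (T, H, W) video with a (t, h, w) kernel, 'valid' borders.

    With t = 3 the kernel couples 3 adjacent frames, so the response
    captures spatial and temporal structure jointly.
    """
    t, h, w = kernel.shape
    T, H, W = video.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(video[i:i+t, j:j+h, k:k+w] * kernel)
    return out
```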

Deep learning with reconfigurable structure Different model cliques are represented by different colors. As the temporal segmentation of an input video can vary, different cliques may have different numbers of input frames. For cliques whose number of input frames is less than m, part of the neurons in the network are deactivated, represented by the dotted blank circles. Latent variables are incorporated to manipulate the activation of neurons, as sketched below.
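A hedged sketch of this gating: a binary mask driven by the latent temporal segmentation zeroes out the neurons of a clique beyond its actual number of input frames (the per-frame neuron layout and names are illustrative assumptions):

```python
import numpy as np

def clique_mask(num_frames, m, neurons_per_frame):
    """Mask enabling only the neurons fed by the available frames.

    num_frames: frames assigned to this clique by the latent segmentation (<= m).
    m: maximum number of frames a clique accepts.
    """
    mask = np.zeros(m * neurons_per_frame)
    mask[: num_frames * neurons_per_frame] = 1.0
    return mask

# activations = clique_mask(n, m, k) * full_activations  # deactivated neurons output 0
```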

Deep model with linear SVM Pros: better generalization performance on small data. Cons: the margin can be increased by simply expanding the feature space! Since the feature space here is learned rather than fixed, a large margin alone does not guarantee generalization, which motivates the radius-margin bound on the next slide.

Deep model with relaxed radius-margin bound 1) To improve the generalization performance for classification, we propose to integrate the radius-margin bound as a regularizer with the feature learning. Intuitively, together with optimizing the max-margin parameters (w, b), we shrink the radius R of the minimum enclosing ball (MEB) of the training data distributed in the feature space generated by the CNNs. 2) The resulting classifier with the regularizer shows better generalization performance compared to the traditional softmax output.
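As a sketch, one standard way to write such a radius-margin objective is the following, with ω the CNN parameters and φ(x; ω) the learned feature of video x (the exact formulation and its relaxation follow the paper):

```latex
\min_{\omega,\, w,\, b}\;\; \frac{1}{2}\, R^2 \lVert w \rVert^2
  \;+\; C \sum_{i=1}^{N} \max\!\bigl(0,\; 1 - y_i \bigl(w^{\top}\phi(x_i;\omega) + b\bigr)\bigr)
\qquad \text{s.t.}\;\; \lVert \phi(x_i;\omega) - c \rVert^2 \le R^2,\;\; i = 1,\dots,N,
```

where c and R are the center and radius of the minimum enclosing ball of the training features, so the regularizer couples the margin term with R instead of using the margin alone.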

Deep model with relaxed radius-margin bound Relaxed: mini-batch based
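A minimal Python sketch of one plausible mini-batch relaxation, approximating the radius on each batch by the largest distance of the batch features to their mean (this surrogate and the function name are illustrative assumptions, not the paper's exact relaxed bound):

```python
import numpy as np

def relaxed_radius(features):
    """Approximate the MEB radius on a (batch_size, dim) batch of CNN features."""
    center = features.mean(axis=0)                        # batch centroid
    return np.linalg.norm(features - center, axis=1).max()
```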

Joint Component Learning

Deep Structured Models Pre-training Goal: 1) Handle the insufficient RGB-D data for training. 2) Boost the activity recognition results. More details: Discard the reconfigurability and the depth channel. Train with labeled 2D activity videos (supervised). Apply standard back-propagation to learn the parameters. 2D activity videos are easy to obtain, for instance from the MPII Human Pose Dataset (http://human-pose.mpi-inf.mpg.de/). The pre-training steps: a) Obtain a large number of labeled 2D activity videos (a supervised manner). b) Discard the reconfigurability of our deep structured model by segmenting each 2D video evenly into as many segments as there are model cliques, because (i) the initial model parameters are unstable, so it is not reliable to estimate the latent variables H with them, and (ii) training efficiency is improved by not considering the latent variables. c) Apply standard back-propagation to learn the parameters. d) Initialize our model with the learned parameters of the convolutional layers. (The gray and depth channels share the learned parameters of the first 3D CNN layer; i.e., we enable only the gray channel of the first 3D CNN layer while pre-training on 2D videos, and afterwards duplicate the learned parameters of the gray channel to the depth channel, as sketched below.)
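A minimal sketch of the duplication in step d), assuming the first 3D convolutional layer stores its filters as a 5D array with an explicit input-channel axis (shapes and the function name are illustrative):

```python
import numpy as np

def init_first_layer(gray_filters):
    """Copy pre-trained gray-channel filters into the depth channel.

    gray_filters: (out_channels, 1, t, h, w) filters learned on 2D videos.
    Returns (out_channels, 2, t, h, w) filters for the gray-depth model.
    """
    return np.concatenate([gray_filters, gray_filters.copy()], axis=1)
```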

Deep Structured Models Inference Perform a standard brute-force search over the activity label and the latent variables. More details: Given an activity label, we calculate its maximum probability by brute-force search across all temporal segmentations. Search across all labels and choose the label with the maximum probability, as sketched below. The inference is accelerated via parallel computation (0.4 seconds per video).
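A minimal sketch of that search, where `segmentations(video)` enumerates the admissible temporal segmentations (the latent variables) and `score(video, seg, label)` runs the network forward; both are hypothetical stand-ins for the model's actual routines:

```python
def infer(video, labels, segmentations, score):
    """Brute-force inference over activity labels and temporal segmentations."""
    best_label, best_score = None, float("-inf")
    for label in labels:
        for seg in segmentations(video):     # enumerate latent variables
            s = score(video, seg, label)     # forward pass for this configuration
            if s > best_score:
                best_label, best_score = label, s
    return best_label
```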

Outline Background Related Works Motivation Framework Results Conclusion

Results Dataset Challenge CAD-120 activity dataset (120 videos, created by Cornell University) Office Activity (OA) dataset (1180 videos, newly created by us) SBU dataset (300 videos, created by Stony Brook University) Challenges: large variance in object appearance, human pose, and viewpoint. OA and SBU contain two subjects with interactions. OA is nearly ten times larger than CAD-120. 1) The experiments are executed on a desktop with an Intel i7 3.7 GHz CPU, 8 GB RAM and a GTX TITAN GPU. The source code is in Python with the Theano library. 2) For model learning, we set the learning rate to 0.002 for the stochastic gradient descent algorithm. 3) The training time of one fold is 2 hours for CAD-120 (120 videos, 0.2 GB), 5 hours for OA1 (600 videos, 1.0 GB), and 5 hours for OA2 (580 videos, 0.9 GB), respectively. 4) Training data is scarce: CAD-120 has only 120 long videos, so we obtain many training videos by sampling each long video with an equal stride; for instance, we divide one 90-frame video into three 30-frame videos by sampling with a stride of 3, as sketched below.
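A minimal sketch of that strided sampling: with stride 3, a 90-frame video yields three 30-frame clips at offsets 0, 1 and 2:

```python
def strided_clips(frames, stride=3):
    """Split a frame sequence into `stride` interleaved sub-videos."""
    return [frames[offset::stride] for offset in range(stride)]
```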

CAD-120

SBU

OA-1 (single subject)

OA-2 (two subjects)

Experimental Results

Experimental Results Our model obtains superior performance! The depth data has much smaller variance and does help to capture the motion information.

Empirical Analysis With/without using the pre-training With/without incorporating the latent structure This experiment is executed on the OA1 dataset. The test error rate is one minus the classification accuracy. 1) The model using the pre-training converges after 25 iterations, while the one without it requires 140 iterations. More importantly, the pre-training effectively reduces the test error rate, about 8% lower than without the pre-training. 2) The error rates of the structured model decrease faster and reach a lower final value, compared with the non-structured model.

Empirical Analysis With/without the relaxed radius-margin bound With/without the depth channel This experiment is executed on the OA1 dataset. The test error rate is one minus the classification accuracy.

Outline Background Related Works Motivation Framework Results Conclusion

Conclusions Recognize activities from raw RGB-D data directly. Incorporate model reconfigurability into layered convolutional neural networks. Integrate the radius-margin regularization with the feature learning. Handle realistic challenges in 3D activity recognition well.

Thank you!