Oral presentation for ACM International Conference on Multimedia, 2014


3D Human Activity Recognition with Reconfigurable Convolutional Neural Networks
Keze Wang, Xiaolong Wang, Liang Lin, Meng Wang, and Wangmeng Zuo

Outline: Background, Motivation, Framework, Results, Conclusion

Background
Difficulties of complex human activity recognition: representing complex human appearance and motion information.
1) It is usually hard to retrieve accurate body information under motion due to the diverse poses/views of individuals.
2) Depth maps are often unavoidably contaminated, and may become unstable due to sensor noise or self-occlusion of bodies.
3) Figure: activity examples illustrating the diverse poses/views of individuals. First row: two examples from the CAD-120 dataset, where two subjects perform the same activity (arranging-objects). Second row: two examples from the proposed OA dataset, where two subjects perform the same activity (going-to-work).

Background
Difficulties of complex human activity recognition: capturing large temporal variations of human activities.
We consider one activity as a sequence of actions occurring over time (activity = temporal composition of actions), but:
(a) the temporal compositions of actions differ across subjects;
(b) the temporal lengths of the decomposed actions vary across subjects.
Figure: two activities of the same category performed by two subjects, showing the variant temporal lengths.

Outline: Background, Motivation, Framework, Results, Conclusion

Motivation
Deep architecture: act directly on grayscale-depth data rather than relying on hand-crafted features. We build a layered network stacked from convolutional layers, max-pooling operators and fully connected layers, where the raw segmented videos are treated as inputs.
Reconfigurable structure: handle large temporal variations of activities via latent variables, which capture the temporal compositions of activities by manipulating the activation of neurons.

Outline: Background, Motivation, Framework, Results, Conclusion

Deep Structured Models: Architecture
1) The network is stacked from convolutional layers, max-pooling operators and fully connected layers, where the raw segmented videos (video segments of unfixed length) are treated as the input.
2) A clique is a subpart of the network, stacked for several layers, that extracts features from one segmented video.
3) The architecture can be partially enabled to explicitly handle different temporal compositions of the activities.
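As an illustration, here is a minimal shape walkthrough (not the authors' code; the segment and kernel sizes are assumptions) of how one clique's stacked 3D convolutions and poolings transform a video segment:

# Shape walkthrough of one clique: stacked 3D convolutions and max-pooling
# over a gray+depth video segment. All sizes are illustrative assumptions,
# not the paper's exact configuration.

def conv3d_shape(shape, kernel):
    """Output shape of a 'valid' 3D convolution over (frames, height, width)."""
    return tuple(s - k + 1 for s, k in zip(shape, kernel))

def pool3d_shape(shape, pool):
    """Output shape of non-overlapping 3D max-pooling."""
    return tuple(s // p for s, p in zip(shape, pool))

segment = (9, 64, 48)                  # frames x height x width of one segment
x = conv3d_shape(segment, (5, 7, 7))   # first 3D conv (shared gray/depth filters)
x = pool3d_shape(x, (1, 2, 2))
x = conv3d_shape(x, (3, 5, 5))         # second 3D conv
x = pool3d_shape(x, (1, 2, 2))
print(x)  # spatio-temporal feature map flattened into the fully connected layers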

Deep Structured Models: Deep learning with reconfigurable structure
In the figure, different model cliques are represented by different colors. Since the temporal segmentation of an input video can vary, different cliques may receive different numbers of input frames. For cliques with fewer than m input frames, part of the neurons in the network are inactivated (represented by the dotted blank circles). The latent variables are incorporated to manipulate this activation of neurons.

Deep Structured Models: Latent Structural Back Propagation
An iterative algorithm with two steps:
Step 1: Given the current model parameters, estimate the latent variables H by adjusting the video segmentation.
Step 2: Given the estimated latent variables H, which infer a video segmentation proposal, perform back propagation to optimize the model parameters.
Note that the network neurons can be partially inactivated (the dotted circles) according to the input segments. A schematic sketch of the loop follows below.
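A minimal sketch of the two-step loop, assuming hypothetical helpers infer_best_segmentation and activation_mask and a model with forward/backward methods (none of these are from the authors' code):

# Schematic sketch of latent structural back propagation. The helpers and
# the model interface are assumptions for illustration only.

def train(model, videos, labels, n_iters):
    for _ in range(n_iters):
        # Step 1: parameters fixed -- estimate the latent variables H,
        # i.e. the best temporal segmentation of each training video.
        H = [infer_best_segmentation(model, v, y)
             for v, y in zip(videos, labels)]
        # Step 2: H fixed -- inactivate neurons of cliques whose segment
        # is shorter than m frames, then back-propagate as usual.
        for v, y, h in zip(videos, labels, H):
            mask = activation_mask(model, h)   # 0 for the "dotted" neurons
            loss = model.forward(v, h, mask, target=y)
            model.backward(loss)               # stochastic gradient update
    return model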

Deep Structured Models: Model pre-training
Goal: 1) handle the insufficient RGB-D data for training; 2) boost the activity recognition results.
2D activity videos with labels are easy to obtain, e.g., from the MPII Human Pose Dataset (http://human-pose.mpi-inf.mpg.de/).
The pre-training steps:
a) Obtain a large number of labeled 2D activity videos (a supervised setting).
b) Discard the reconfigurability of the deep structured model by segmenting the 2D videos evenly into as many segments as there are model cliques. This is done because (i) the initial model parameters are unstable, so estimating the latent variables H with them is unreliable, and (ii) training is more efficient without the latent variables.
c) Apply standard back propagation to learn the parameters.
d) Initialize our model with the learned parameters of the convolutional layers. The gray and depth channels share the learned parameters of the first 3D CNN layer: only the gray channel of that layer is enabled while pre-training on 2D videos, and afterwards the learned gray-channel parameters are duplicated to the depth channel (see the sketch below).
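A minimal sketch of step d)'s channel duplication; the filter-bank shape is an assumption for illustration:

import numpy as np

# Duplicate the pre-trained gray-channel filters to the depth channel.
# The shape (filters, channels, t, h, w) is an illustrative assumption,
# not the paper's exact configuration.

W_gray = np.random.randn(32, 1, 5, 7, 7)           # first-layer 3D filters learned on 2D videos
W_rgbd = np.concatenate([W_gray, W_gray], axis=1)  # gray + depth now share weights
print(W_rgbd.shape)                                # (32, 2, 5, 7, 7), ready for RGB-D fine-tuning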

Deep Structured Models: Inference
Perform a brute-force search over activity labels and latent variables:
1) Given an activity label, calculate its maximum probability by searching across all latent variables, i.e., all temporal segmentations.
2) Repeat across all labels and choose the label with the maximum probability.
Inference is accelerated via parallel computation (0.4 seconds per video). A sketch follows below.
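A minimal sketch of the brute-force inference, assuming a hypothetical model.score(video, label, segmentation) that returns the (log-)probability:

# Schematic brute-force inference. model.score and the enumeration of
# segmentations are assumptions for illustration; the paper parallelizes
# the search over segmentations.

def predict(model, video, labels, segmentations):
    best_label, best_score = None, float("-inf")
    for y in labels:
        # maximum probability of label y over all latent segmentations H
        score = max(model.score(video, y, H) for H in segmentations)
        if score > best_score:
            best_label, best_score = y, score
    return best_label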

Outline: Background, Motivation, Framework, Results, Conclusion

Results: Datasets and setup
Datasets: the CAD-120 activity dataset (120 videos, created by Cornell University) and the Office Activity (OA) dataset (1,180 videos, newly created by us). Both exhibit large variance in object appearance, human pose, and viewpoint. OA contains two subjects with interactions in two totally different scenes, and is nearly ten times larger than CAD-120.
1) The experiments are executed on a desktop with an Intel i7 3.7 GHz CPU, 8 GB RAM and a GTX TITAN GPU. The source code is in Python with the Theano library.
2) For model learning, we set the learning rate to 0.002 for the stochastic gradient descent algorithm.
3) The training time for one fold is 2 hours for CAD-120 (120 videos, 0.2 GB), 5 hours for OA1 (600 videos, 1.0 GB), and 5 hours for OA2 (580 videos, 0.9 GB).
4) Since CAD-120 has only 120 long videos, the training sample is small. We obtain many more training videos by sampling each long video at an equal stride; for instance, one 90-frame video is divided into three 30-frame videos by sampling with a stride of 3 (see the sketch below).
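A minimal sketch of that stride sampling (illustrative; the talk only states the 90-frame example):

# Stride-sampling augmentation: a long video of n frames becomes `stride`
# shorter videos by taking every stride-th frame at offsets 0..stride-1.
# Matches the 90 -> 3 x 30 frame example above.

def stride_sample(frames, stride):
    return [frames[offset::stride] for offset in range(stride)]

clips = stride_sample(list(range(90)), stride=3)
print(len(clips), len(clips[0]))  # -> 3 30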

CAD-120

OA-1 (single subject)

OA-2 (two subjects)

Experimental Results
Our model obtains superior performance. The depth data has much smaller variance and helps to capture the motion information.

Empirical Analysis
Two ablations, both executed on the OA1 dataset (test error rate = one minus the classification accuracy):
1) With/without pre-training: the model with pre-training converges after 25 iterations, while the one without requires 140 iterations. More importantly, pre-training effectively reduces the test error rate, about 8% lower than without it.
2) With/without the latent structure: the error rates of the structured model decrease faster and reach a lower final value than those of the non-structured model.

Outline: Background, Motivation, Framework, Results, Conclusion

Conclusions
Recognition from raw RGB-D data rather than relying on hand-crafted features.
Incorporating model reconfigurability into layered convolutional neural networks.
Handling realistic challenges in 3D activity recognition well.

Thank you!