Oral presentation for ACM International Conference on Multimedia, 2014


3D Human Activity Recognition with Reconfigurable Convolutional Neural Networks
Keze Wang, Xiaolong Wang, Liang Lin, Meng Wang, and Wangmeng Zuo

Outline: Background, Motivation, Framework, Results, Conclusion

Background
Difficulties of complex human activity recognition: representing complex human appearance and motion information.
1) It is usually hard to retrieve accurate body information under motion due to the diverse poses/views of individuals.
2) Depth maps are often unavoidably contaminated, and may become unstable due to sensor noise or self-occlusion of bodies.
3) Figure: activity examples illustrating the diverse poses/views of individuals. First row: two examples from the CAD-120 dataset, where two subjects perform the same activity (arranging-objects). Second row: two examples from the proposed OA dataset, where two subjects perform the same activity (going-to-work).

Background
Difficulties of complex human activity recognition: capturing large temporal variations of human activities.
We consider one activity as a sequence of actions occurring over time (activity = temporal composition of actions), but:
(a) the temporal compositions of actions differ across subjects;
(b) the temporal lengths of the decomposed actions vary across subjects.
Figure: two activities of the same category performed by two subjects, showing the variant temporal lengths.

Outline: Background, Motivation, Framework, Results, Conclusion

Motivation
Deep architecture: act directly on grayscale-depth data rather than relying on hand-crafted features. We build a layered network stacked from convolutional layers, max-pooling operators and fully connected layers, where the raw segmented videos are treated as inputs.
Reconfigurable structure: handle large temporal variations of activities via latent variables, which capture the temporal compositions of activities by manipulating the activation of neurons.

Outline: Background, Motivation, Framework, Results, Conclusion

Deep Structured Models: Architecture
1) The network is stacked from convolutional layers, max-pooling operators and fully connected layers, where the raw segmented videos (video segments of unfixed length) are treated as the input.
2) A clique is a subpart of the network, stacked for several layers, that extracts features from one segmented video.
3) The architecture can be partially enabled to explicitly handle different temporal compositions of the activities.
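As an illustration, here is a minimal shape walkthrough (not the authors' code; the segment and kernel sizes are assumptions) of how one clique's stacked 3D convolutions and poolings transform a video segment:

# Shape walkthrough of one clique: stacked 3D convolutions and max-pooling
# over a gray+depth video segment. All sizes are illustrative assumptions,
# not the paper's exact configuration.

def conv3d_shape(shape, kernel):
    """Output shape of a 'valid' 3D convolution over (frames, height, width)."""
    return tuple(s - k + 1 for s, k in zip(shape, kernel))

def pool3d_shape(shape, pool):
    """Output shape of non-overlapping 3D max-pooling."""
    return tuple(s // p for s, p in zip(shape, pool))

segment = (9, 64, 48)                  # frames x height x width of one segment
x = conv3d_shape(segment, (5, 7, 7))   # first 3D conv (shared gray/depth filters)
x = pool3d_shape(x, (1, 2, 2))
x = conv3d_shape(x, (3, 5, 5))         # second 3D conv
x = pool3d_shape(x, (1, 2, 2))
print(x)  # spatio-temporal feature map flattened into the fully connected layers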

Deep Structured Models: Deep learning with reconfigurable structure
In the figure, different model cliques are represented by different colors. Since the temporal segmentation of an input video can vary, different cliques may receive different numbers of input frames. For cliques with fewer than m input frames, part of the neurons in the network are inactivated (represented by the dotted blank circles). The latent variables are incorporated to manipulate this activation of neurons.

Deep Structured Models: Latent Structural Back Propagation
An iterative algorithm with two steps:
Step 1: Given the current model parameters, estimate the latent variables H by adjusting the video segmentation.
Step 2: Given the estimated latent variables H, which infer a video segmentation proposal, perform back propagation to optimize the model parameters.
Note that the network neurons can be partially inactivated (the dotted circles) according to the input segments. A schematic sketch of the loop follows below.
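A minimal sketch of the two-step loop, assuming hypothetical helpers infer_best_segmentation and activation_mask and a model with forward/backward methods (none of these are from the authors' code):

# Schematic sketch of latent structural back propagation. The helpers and
# the model interface are assumptions for illustration only.

def train(model, videos, labels, n_iters):
    for _ in range(n_iters):
        # Step 1: parameters fixed -- estimate the latent variables H,
        # i.e. the best temporal segmentation of each training video.
        H = [infer_best_segmentation(model, v, y)
             for v, y in zip(videos, labels)]
        # Step 2: H fixed -- inactivate neurons of cliques whose segment
        # is shorter than m frames, then back-propagate as usual.
        for v, y, h in zip(videos, labels, H):
            mask = activation_mask(model, h)   # 0 for the "dotted" neurons
            loss = model.forward(v, h, mask, target=y)
            model.backward(loss)               # stochastic gradient update
    return model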

Deep Structured Models: Model pre-training
Goal: 1) handle the insufficient RGB-D data for training; 2) boost the activity recognition results.
2D activity videos with labels are easy to obtain, e.g., from the MPII Human Pose Dataset (http://human-pose.mpi-inf.mpg.de/).
The pre-training steps:
a) Obtain a large number of labeled 2D activity videos (a supervised setting).
b) Discard the reconfigurability of the deep structured model by segmenting the 2D videos evenly into as many segments as there are model cliques. This is done because (i) the initial model parameters are unstable, so estimating the latent variables H with them is unreliable, and (ii) training is more efficient without the latent variables.
c) Apply standard back propagation to learn the parameters.
d) Initialize our model with the learned parameters of the convolutional layers. The gray and depth channels share the learned parameters of the first 3D CNN layer: only the gray channel of that layer is enabled while pre-training on 2D videos, and afterwards the learned gray-channel parameters are duplicated to the depth channel (see the sketch below).
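A minimal sketch of step d)'s channel duplication; the filter-bank shape is an assumption for illustration:

import numpy as np

# Duplicate the pre-trained gray-channel filters to the depth channel.
# The shape (filters, channels, t, h, w) is an illustrative assumption,
# not the paper's exact configuration.

W_gray = np.random.randn(32, 1, 5, 7, 7)           # first-layer 3D filters learned on 2D videos
W_rgbd = np.concatenate([W_gray, W_gray], axis=1)  # gray + depth now share weights
print(W_rgbd.shape)                                # (32, 2, 5, 7, 7), ready for RGB-D fine-tuning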

Deep Structured Models: Inference
Perform a brute-force search over activity labels and latent variables:
1) Given an activity label, calculate its maximum probability by searching across all latent variables, i.e., all temporal segmentations.
2) Repeat across all labels and choose the label with the maximum probability.
Inference is accelerated via parallel computation (0.4 seconds per video). A sketch follows below.
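A minimal sketch of the brute-force inference, assuming a hypothetical model.score(video, label, segmentation) that returns the (log-)probability:

# Schematic brute-force inference. model.score and the enumeration of
# segmentations are assumptions for illustration; the paper parallelizes
# the search over segmentations.

def predict(model, video, labels, segmentations):
    best_label, best_score = None, float("-inf")
    for y in labels:
        # maximum probability of label y over all latent segmentations H
        score = max(model.score(video, y, H) for H in segmentations)
        if score > best_score:
            best_label, best_score = y, score
    return best_label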

Outline: Background, Motivation, Framework, Results, Conclusion

Results: Datasets and setup
Datasets: the CAD-120 activity dataset (120 videos, created by Cornell University) and the Office Activity (OA) dataset (1,180 videos, newly created by us). Both exhibit large variance in object appearance, human pose, and viewpoint. OA contains two subjects with interactions in two totally different scenes, and is nearly ten times larger than CAD-120.
1) The experiments are executed on a desktop with an Intel i7 3.7 GHz CPU, 8 GB RAM and a GTX TITAN GPU. The source code is in Python with the Theano library.
2) For model learning, we set the learning rate to 0.002 for the stochastic gradient descent algorithm.
3) The training time for one fold is 2 hours for CAD-120 (120 videos, 0.2 GB), 5 hours for OA1 (600 videos, 1.0 GB), and 5 hours for OA2 (580 videos, 0.9 GB).
4) Since CAD-120 has only 120 long videos, the training sample is small. We obtain many more training videos by sampling each long video at an equal stride; for instance, one 90-frame video is divided into three 30-frame videos by sampling with a stride of 3 (see the sketch below).
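A minimal sketch of that stride sampling (illustrative; the talk only states the 90-frame example):

# Stride-sampling augmentation: a long video of n frames becomes `stride`
# shorter videos by taking every stride-th frame at offsets 0..stride-1.
# Matches the 90 -> 3 x 30 frame example above.

def stride_sample(frames, stride):
    return [frames[offset::stride] for offset in range(stride)]

clips = stride_sample(list(range(90)), stride=3)
print(len(clips), len(clips[0]))  # -> 3 30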

CAD-120

OA-1 (single subject)

OA-2 (two subjects)

Experimental Results
Our model obtains superior performance. The depth data has much smaller variance and helps to capture the motion information.

Empirical Analysis
Two ablations, both executed on the OA1 dataset (test error rate = one minus the classification accuracy):
1) With/without pre-training: the model with pre-training converges after 25 iterations, while the one without requires 140 iterations. More importantly, pre-training effectively reduces the test error rate, about 8% lower than without it.
2) With/without the latent structure: the error rates of the structured model decrease faster and reach a lower final value than those of the non-structured model.

Outline: Background, Motivation, Framework, Results, Conclusion

Conclusions
Recognition from raw RGB-D data rather than relying on hand-crafted features.
Incorporating model reconfigurability into layered convolutional neural networks.
Handling realistic challenges in 3D activity recognition well.

Thank you!