
Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification
Zuxuan Wu, Xi Wang, Yu-Gang Jiang, Hao Ye, Xiangyang Xue
School of Computer Science, Fudan University, Shanghai, China
zxwu@fudan.edu.cn
ACM Multimedia, Brisbane, Australia, October 2015

Video Classification
Videos are everywhere, with wide applications:
Web video search
Video collection management
Intelligent video surveillance

Video Classification: State of the Art
1. Improved Dense Trajectories [Wang et al., ICCV 2013]: tracking trajectories and computing local descriptors along them
2. Feature Encoding [Perronnin et al., CVPR 2010; Xu et al., CVPR 2015]: encoding local features with Fisher Vectors/VLAD, followed by normalization methods such as the Power Norm

Video Classification: Deep Learning
1. Image-based CNN Classification [Zha et al., arXiv 2015]: extracting deep features for each frame, then averaging the frame-level deep features
2. Two-Stream CNN [Simonyan et al., NIPS 2014]

Video Classification: Deep Learning
3. Recurrent NN: LSTM [Ng et al., CVPR 2015]
[Figure: an LSTM unrolled over a diving video, with outputs Ot-1, Ot, Ot+1 and stage labels "Jumping from platform", "Rotating in the air", "Falling into water"]
The performance is not ideal; it is about the same as image-based classification.

Video Classification: Deep Learning
The performance of LSTM and average pooling is close.
We propose a hybrid deep learning framework to capture appearance, short-term motion, and long-term temporal dynamics in videos.

Our Framework
We propose a hybrid deep learning framework to model rich multimodal information:
Appearance and short-term motion with CNNs
Long-term temporal information with LSTM
Regularized fusion to explore feature correlations
[Pipeline: Input Video → Individual Frames → Spatial CNN → LSTM; Stacked Optical Flow → Motion CNN → LSTM; Regularized Fusion Layer → Final Prediction]

Spatial and Motion CNN Features
[Pipeline: Input Video → Individual Frames → Spatial Convolutional Neural Network; Stacked Optical Flow → Motion Convolutional Neural Network; Score Fusion]
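The score fusion of the two streams can be sketched as a weighted average of per-class prediction scores. This is a minimal illustrative sketch; the class scores and equal weighting below are assumptions, not values from the paper.

```python
# Late score fusion of the spatial and motion CNN streams.
# The scores and the 50/50 weighting are illustrative assumptions.

def fuse_scores(spatial, motion, w_spatial=0.5, w_motion=0.5):
    """Weighted average of per-class prediction scores from two streams."""
    assert len(spatial) == len(motion)
    return [w_spatial * s + w_motion * m for s, m in zip(spatial, motion)]

spatial_scores = [0.7, 0.2, 0.1]   # softmax over classes, spatial stream
motion_scores = [0.5, 0.4, 0.1]    # softmax over classes, motion stream
fused = fuse_scores(spatial_scores, motion_scores)  # ≈ [0.6, 0.3, 0.1]
```

In practice the stream weights are usually tuned on a validation set rather than fixed at 0.5 each.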

Temporal Modeling with LSTM
[Figure: an unrolled recurrent neural network]
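As a reminder, the standard LSTM updates for input $x_t$ and hidden state $h_{t-1}$ are given below (this is the common formulation; the exact variant used in the paper may differ):

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The gates $i_t$, $f_t$, $o_t$ control what enters, persists in, and leaves the memory cell $c_t$, which is what lets the LSTM track long-term temporal dynamics across frames.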

Regularized Feature Fusion [Ngiam et al., ICML 2011; Srivastava et al., NIPS 2012]

Regularized Feature Fusion
DNN learning scheme: calculate the prediction error, then update the weights by backpropagation.
Here, the fusion is performed in a free (unconstrained) manner, without explicitly exploring the feature correlations. [Ngiam et al., ICML 2011; Srivastava et al., NIPS 2012]

Regularized Feature Fusion
Objective function: empirical loss; prevent overfitting; model feature relationships; provide robustness.


Regularized Feature Fusion
Objective function: empirical loss; prevent overfitting; model feature relationships; provide robustness.
Minimizing the ℓ2,1 norm makes the weight matrix row-sparse.

Regularized Feature Fusion
Objective function: empirical loss; prevent overfitting; model feature relationships; provide robustness.
Minimizing the ℓ1 norm prevents incorrect feature sharing.

Regularized Feature Fusion
Objective function: empirical loss; prevent overfitting; model feature relationships; provide robustness.
Optimization for the E-th (fusion) layer: proximal gradient descent.
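The objective function itself appears only as an image in the original slides. A plausible form consistent with the four bullets (empirical loss, an overfitting term, and ℓ2,1 plus ℓ1 structured-sparsity terms on the fusion-layer weights $W$) would be the following; this is a reconstruction under those assumptions, not the paper's exact formula:

```latex
\min_{W} \; \mathcal{L}(W)
  + \lambda_1 \|W\|_F^2
  + \lambda_2 \|W\|_{2,1}
  + \lambda_3 \|W\|_1,
\qquad
\|W\|_{2,1} = \sum_i \sqrt{\sum_j W_{ij}^2}
```

Because the ℓ2,1 and ℓ1 terms are non-smooth, the layer is optimized with proximal gradient descent rather than plain gradient steps.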

Regularized Feature Fusion
Algorithm:
1. Initialize weights randomly
2. for epoch = 1 : K
3.   Calculate the prediction error with feed-forward propagation
4.   for l = 1 : L
5.     Back-propagate the prediction error and update the weight matrices
6.     if l == E: evaluate the proximal operator
7.   end for
8. end for
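Step 6 ("evaluate the proximal operator") can be sketched for the ℓ2,1 term as row-wise soft thresholding of the fusion-layer weight matrix. This is a generic sketch of the standard ℓ2,1 proximal operator, not the authors' code; the matrix and threshold below are illustrative.

```python
import math

def prox_l21(W, tau):
    """Proximal operator of tau * ||W||_{2,1}: shrink each row of W
    toward zero by tau in Euclidean norm (row-wise soft thresholding)."""
    result = []
    for row in W:
        norm = math.sqrt(sum(v * v for v in row))
        if norm <= tau:
            # Entire row is zeroed out -> row sparsity
            result.append([0.0] * len(row))
        else:
            scale = 1.0 - tau / norm
            result.append([scale * v for v in row])
    return result

W = [[3.0, 4.0],   # row norm 5 -> shrunk toward zero, norm becomes 4
     [0.1, 0.1]]   # row norm ~0.14 <= tau -> zeroed out
shrunk = prox_l21(W, 1.0)
```

Rows whose norm falls below the threshold are removed entirely, which is exactly the row-sparsity effect the slides attribute to the ℓ2,1 norm.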

Experiments
Datasets:
UCF101: 101 action classes, 13,320 video clips from YouTube
Columbia Consumer Videos (CCV): 20 classes, 9,317 videos from YouTube

Experiments
Temporal Modeling (accuracy, %):

Method                       UCF-101   CCV
Spatial ConvNet              80.4      75.0
Motion ConvNet               78.3      59.1
Spatial LSTM                 83.3      43.3
Motion LSTM                  76.6      54.7
ConvNet (spatial + motion)   86.2      75.8
LSTM (spatial + motion)      86.3      61.9
ConvNet + LSTM (spatial)     84.4      77.9
ConvNet + LSTM (motion)      81.4      70.9
All Streams                  90.3      82.4

LSTMs are worse than CNNs on noisy long videos. CNNs and LSTMs are highly complementary!

Experiments
Regularized Feature Fusion (accuracy, %):

Method               UCF-101   CCV
Spatial SVM          78.6      74.4
Motion SVM           78.2      57.9
SVM-EF               86.6      75.3
SVM-LF               85.3      74.9
SVM-MKL              86.8      75.4
NN-EF                86.5      75.6
NN-LF                85.1      75.2
M-DBM                86.9      —
Two-Stream CNN       86.2      75.8
Regularized Fusion   88.4      76.2

Regularized fusion performs better than fusion performed in a free manner.

Experiments
Hybrid Deep Learning Framework: (results shown as a figure in the original slides)

Experiments
Comparison with the State of the Art (accuracy, %):

UCF101:
Donahue et al.     82.9
Srivastava et al.  84.3
Wang et al.        85.9
Tran et al.        86.7
Simonyan et al.    88.0
Lan et al.         89.1
Zha et al.         89.6
Ours               91.3

CCV:
Xu et al.    60.3
Ye et al.    64.0
Jhuo et al.
Ma et al.    63.4
Liu et al.   68.2
Wu et al.    70.6
Ours         83.5

Conclusion
We propose a hybrid deep learning framework to model rich multimodal information:
Modeling appearance and short-term motion with CNNs
Capturing long-term temporal information with LSTM
Regularized fusion to explore feature correlations
Take-home message: LSTMs and CNNs are highly complementary, and regularized feature fusion performs better.

Thank you! Q & A zxwu@fudan.edu.cn