
1 Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification
Zuxuan Wu, Xi Wang, Yu-Gang Jiang, Hao Ye, Xiangyang Xue
School of Computer Science, Fudan University, Shanghai, China
ACM Multimedia, Brisbane, Australia, Oct. 2015

2 Video Classification
Videos are everywhere.
Wide applications: web video search, video collection management, intelligent video surveillance.

3 Video Classification: State of the Art
1. Improved Dense Trajectories [Wang et al., ICCV 2013]: tracking trajectories and computing local descriptors along them.
2. Feature Encoding [Perronnin et al., CVPR 2010; Xu et al., CVPR 2015]: encoding local features with Fisher Vectors/VLAD, with normalization methods such as the power norm.

4 Video Classification: Deep Learning
1. Image-based CNN classification [Zha et al., arXiv 2015]: extracting deep features for each frame, then averaging the frame-level deep features (sketched after this list).
2. Two-Stream CNN [Simonyan et al., NIPS 2014]: one spatial stream on RGB frames and one temporal stream on stacked optical flow.
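
As a concrete illustration of the image-based pipeline in item 1, here is a minimal numpy sketch of frame-level average pooling. The feature dimensions and the frame_features array are hypothetical, not taken from the paper:

```python
import numpy as np

# Hypothetical frame-level CNN features for one video:
# T frames, each described by a D-dimensional vector (e.g., fc-layer activations).
T, D = 120, 4096
frame_features = np.random.randn(T, D)

# Image-based classification collapses the video into a single descriptor
# by averaging the frame-level deep features over time.
video_feature = frame_features.mean(axis=0)  # shape: (D,)
```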

5 Video Classification: Deep Learning
3. Recurrent NN: LSTM [Ng et al., CVPR 2015]
[Figure: a "Diving" video as a sequence of stages (jumping from the platform, rotating in the air, falling into the water), fed frame by frame to an LSTM that emits outputs O_{t-1}, O_t, O_{t+1}]
However, the performance is not ideal; it is about the same as image-based classification.

6 Video Classification: Deep Learning
The performance of LSTM and average pooling is close. We therefore propose a hybrid deep learning framework to capture appearance, short-term motion, and long-term temporal dynamics in videos.

7 Our Framework
We propose a hybrid deep learning framework to model rich multimodal information:
Appearance and short-term motion with CNNs
Long-term temporal information with LSTMs
Regularized fusion to explore feature correlations
[Figure: input video -> individual frames -> spatial CNN, and stacked optical flow -> motion CNN; both streams feed LSTMs and a regularized fusion layer that produces the final prediction]

8 Spatial and Motion CNN Features
[Figure: the input video yields individual frames for the spatial convolutional neural network and stacked optical flow for the motion convolutional neural network; the two streams are combined by score fusion]
A sketch of the score-fusion step follows.
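
A minimal sketch of late score fusion for the two streams, assuming per-stream class scores have already been computed. The 101-class setup and the equal fusion weight w are assumptions made for illustration:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over raw class scores."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical raw class scores for one video (e.g., 101 UCF101 classes).
spatial_scores = np.random.randn(101)  # from the spatial CNN (RGB frames)
motion_scores = np.random.randn(101)   # from the motion CNN (stacked optical flow)

# Score fusion: a weighted average of the per-stream class probabilities.
w = 0.5  # assumed fusion weight; the slides do not specify it
fused = w * softmax(spatial_scores) + (1.0 - w) * softmax(motion_scores)
prediction = int(np.argmax(fused))
```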

9 Temporal Modeling with LSTM
An unrolled recurrent neural network.
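
To make the unrolling concrete, here is a minimal numpy LSTM run over per-frame CNN features. All shapes, the random weight initialization, and the choice of the last hidden state as the video descriptor are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x_seq, H, rng):
    """Unroll a single-layer LSTM over a sequence of frame features.

    x_seq: (T, D) array, one CNN feature vector per frame.
    Returns the hidden states at every time step, shape (T, H).
    Weights are randomly initialized here purely for illustration.
    """
    D = x_seq.shape[1]
    W = rng.standard_normal((4 * H, D)) * 0.01  # input-to-hidden weights
    U = rng.standard_normal((4 * H, H)) * 0.01  # hidden-to-hidden weights
    b = np.zeros(4 * H)
    h, c = np.zeros(H), np.zeros(H)
    hs = []
    for x in x_seq:
        z = W @ x + U @ h + b
        i = sigmoid(z[0 * H:1 * H])   # input gate
        f = sigmoid(z[1 * H:2 * H])   # forget gate
        o = sigmoid(z[2 * H:3 * H])   # output gate
        g = np.tanh(z[3 * H:4 * H])   # candidate cell update
        c = f * c + i * g             # cell state carries long-term memory
        h = o * np.tanh(c)            # hidden state summarizes the video so far
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(0)
frame_features = rng.standard_normal((120, 4096))  # hypothetical CNN features
hidden = lstm_forward(frame_features, H=256, rng=rng)
video_descriptor = hidden[-1]  # e.g., use the last hidden state for classification
```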

10 Regularized Feature Fusion
Prior work on multimodal fusion with deep networks: [Ngiam et al., ICML 2011; Srivastava et al., NIPS 2012]

11 Regularized Feature Fusion
DNN learning scheme:
Calculate the prediction error
Update the weights in a back-propagation manner
In these multimodal networks [Ngiam et al., ICML 2011; Srivastava et al., NIPS 2012], fusion is performed in a free manner, without explicitly exploring the correlations among features.

12 Regularized Feature Fusion
Objective function (four terms, annotated on the slide):
Empirical loss
Prevent overfitting
Model feature relationships
Provide robustness
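
The equation itself did not survive the transcript. Based on the four annotations above and the l21/l1 hints on the next slides, a plausible reconstruction (the lambda weights and the exact grouping of terms are assumptions) is:

```latex
\min_{W}\;
\underbrace{\mathcal{L}(W)}_{\text{empirical loss}}
+ \lambda_{1}\underbrace{\lVert W \rVert_{F}^{2}}_{\text{prevent overfitting}}
+ \lambda_{2}\underbrace{\lVert W \rVert_{2,1}}_{\text{model feature relationships}}
+ \lambda_{3}\underbrace{\lVert W \rVert_{1}}_{\text{provide robustness}}
```

where W is the fusion-layer weight matrix and ||W||_{2,1} sums the l2 norms of its rows.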

14 Regularized Feature Fusion
Objective function (as above). Minimizing the l21 norm makes the weight matrix row-sparse!

15 Regularized Feature Fusion
Objective function (as above). Minimizing the l1 norm prevents incorrect feature sharing!

16 Regularized Feature Fusion
Objective function (as above). Optimization: for the E-th layer (the fusion layer), use proximal gradient descent; the proximal operators are sketched below.
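
The non-smooth l21 and l1 terms have closed-form proximal operators, which is what makes proximal gradient descent applicable here. A minimal numpy sketch of the two operators (the step-size parameter t is an assumption about how the operators are applied):

```python
import numpy as np

def prox_l1(W, t):
    """Proximal operator of t * ||W||_1 (soft-thresholding).
    Shrinks every entry toward zero and zeroes out the small ones."""
    return np.sign(W) * np.maximum(np.abs(W) - t, 0.0)

def prox_l21(W, t):
    """Proximal operator of t * ||W||_{2,1} (row-wise shrinkage).
    Scales each row toward zero and removes rows with small norms,
    which is exactly what makes the weight matrix row-sparse."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return W * scale
```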

17 Regularized Feature Fusion
Algorithm: Initialize weights randomly 2. for epoch = 1: K Calculate prediction error with feed forward propagation. for l = 1: L Back propagate the prediction error and update weight matrices if L == E: Evaluating the proximal operator end for
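
Rendering the pseudocode above as a schematic epoch loop: only the fusion layer (index E) receives the proximal updates after its gradient step. The placeholder gradient, the layer shapes, and the hyper-parameters are all assumptions made so the sketch runs standalone; a real implementation back-propagates through the actual network:

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder network: a list of weight matrices; layer E is the fusion layer.
layers = [rng.standard_normal((64, 128)) * 0.01 for _ in range(3)]
E = 1                                  # assumed index of the fusion layer
eta, lam21, lam1 = 0.01, 1e-3, 1e-4    # assumed step size and regularizer weights

def smooth_grad(W):
    """Stand-in for the back-propagated gradient of the smooth terms
    (empirical loss + Frobenius regularizer) w.r.t. one layer's weights."""
    return np.zeros_like(W)  # a real model computes this via back-propagation

def prox_l1(W, t):
    return np.sign(W) * np.maximum(np.abs(W) - t, 0.0)

def prox_l21(W, t):
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W * np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)

K = 10
for epoch in range(K):
    # Feed-forward pass to compute the prediction error (omitted here).
    for l in range(len(layers)):
        # Back-propagate and take a gradient step on the smooth part.
        layers[l] = layers[l] - eta * smooth_grad(layers[l])
        if l == E:
            # Only the fusion layer gets the proximal updates
            # for the non-smooth l21 and l1 regularizers.
            layers[l] = prox_l21(layers[l], eta * lam21)
            layers[l] = prox_l1(layers[l], eta * lam1)
```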

18 Experiments: Datasets
UCF101: 101 action classes, 13,320 video clips from YouTube Columbia Consumer Videos (CCV): 20 classes, 9,317 videos from YouTube

19 Experiments: Temporal Modeling

Method                        UCF-101   CCV
Spatial ConvNet                 80.4    75.0
Motion ConvNet                  78.3    59.1
Spatial LSTM                    83.3    43.3
Motion LSTM                     76.6    54.7
ConvNet (spatial + motion)      86.2    75.8
LSTM (spatial + motion)         86.3    61.9
ConvNet + LSTM (spatial)        84.4    77.9
ConvNet + LSTM (motion)         81.4    70.9
All streams                     90.3    82.4

LSTMs are worse than CNNs on noisy long videos (see the CCV column).
CNNs and LSTMs are highly complementary!

20 Experiments: Regularized Feature Fusion

Method               UCF-101   CCV
Spatial SVM            78.6    74.4
Motion SVM             78.2    57.9
SVM-EF                 86.6    75.3
SVM-LF                 85.3    74.9
SVM-MKL                86.8    75.4
NN-EF                  86.5    75.6
NN-LF                  85.1    75.2
M-DBM                  86.9
Two-Stream CNN         86.2    75.8
Regularized Fusion     88.4    76.2

Regularized fusion performs better than fusion performed in a free manner.

21 Experiments: Hybrid Deep Learning Framework

22 Experiments: Comparisons with the State of the Art

UCF101:
Donahue et al.      82.9%
Srivastava et al.   84.3%
Wang et al.         85.9%
Tran et al.         86.7%
Simonyan et al.     88.0%
Lan et al.          89.1%
Zha et al.          89.6%
Ours                91.3%

CCV:
Xu et al.     60.3%
Ye et al.     64.0%
Jhuo et al.
Ma et al.     63.4%
Liu et al.    68.2%
Wu et al.     70.6%
Ours          83.5%

23 Conclusion
We propose a hybrid deep learning framework to model rich multimodal information:
Modeling appearance and short-term motion with CNNs
Capturing long-term temporal information with LSTMs
Regularized fusion to explore feature correlations
Take-home messages:
LSTMs and CNNs are highly complementary.
Regularized feature fusion performs better than free fusion.

24 Thank you! Q & A

