Published by Jessie Watkins. Modified over 9 years ago.
1
Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification
Zuxuan Wu, Xi Wang, Yu-Gang Jiang, Hao Ye, Xiangyang Xue
School of Computer Science, Fudan University, Shanghai, China
ACM Multimedia, Brisbane, Australia, Oct. 2015
2
Video Classification
Videos are everywhere, with wide applications:
- Web video search
- Video collection management
- Intelligent video surveillance
3
Video Classification: State of the Art
1. Improved Dense Trajectories [Wang et al., ICCV 2013]
   - Tracking trajectories
   - Computing local descriptors along the trajectories
2. Feature Encoding [Perronnin et al., CVPR 2010; Xu et al., CVPR 2015]
   - Encoding local features with Fisher Vector/VLAD
   - Normalization methods, such as Power Norm
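The normalization step named in item 2 can be illustrated with a small sketch. It assumes the common signed power normalization followed by L2 normalization; the function name `power_l2_normalize` and the default `alpha=0.5` are illustrative choices, not values taken from the slides.

```python
import numpy as np

def power_l2_normalize(v, alpha=0.5, eps=1e-12):
    """Signed power normalization followed by L2 normalization.

    alpha=0.5 (signed square root) is an assumed, commonly used setting.
    """
    v = np.asarray(v, dtype=np.float64)
    # Dampen large components while keeping each component's sign.
    v = np.sign(v) * np.abs(v) ** alpha
    # Rescale to unit Euclidean length (eps guards against an all-zero vector).
    return v / max(np.linalg.norm(v), eps)

# Hypothetical encoded vector (for example, a Fisher Vector or VLAD output).
encoded = np.array([4.0, -9.0, 0.0, 16.0])
normalized = power_l2_normalize(encoded)
```

The power step reduces the influence of a few very large components before the vector is handed to a linear classifier.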
4
Video Classification: Deep Learning
1. Image-based CNN Classification [Zha et al., arXiv 2015]
   - Extracting deep features for each frame
   - Averaging frame-level deep features
2. Two-Stream CNN [Simonyan et al., NIPS 2014]
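Averaging frame-level features, as in item 1, amounts to a mean over the frame axis. The sketch below assumes the per-frame features are already extracted into a `(num_frames, feature_dim)` array; `average_pool_frames` is a hypothetical helper name.

```python
import numpy as np

def average_pool_frames(frame_features):
    """Average per-frame feature vectors into one video-level vector."""
    frame_features = np.asarray(frame_features, dtype=np.float64)
    if frame_features.ndim != 2 or frame_features.shape[0] == 0:
        raise ValueError("expected a non-empty (num_frames, feature_dim) array")
    # The mean over frames discards temporal order entirely.
    return frame_features.mean(axis=0)

# Three hypothetical frames with 2-dimensional features.
video_feature = average_pool_frames([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Because the mean ignores frame order, this representation cannot tell a video from its reversed copy, which is the gap temporal models aim to fill.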
5
Video Classification: Deep Learning
3. Recurrent NN: LSTM [Ng et al., CVPR 2015]
[Figure: an LSTM unrolled over a "Diving" video, with outputs Ot-1, Ot, Ot+1 and frames labelled "Jumping from platform", "Rotating in the air", "Falling into water".]
The performance is not ideal: it is about the same as image-based classification.
6
Video Classification: Deep Learning
3. Recurrent NN: LSTM [Ng et al., CVPR 2015]
[Figure: the "Diving" example, with frames labelled "Jumping from platform", "Rotating in the air", "Falling into water".]
The performance of LSTM and average pooling is close.
We propose a hybrid deep learning framework to capture appearance, short-term motion and long-term temporal dynamics in videos.
7
Our Framework
We propose a hybrid deep learning framework to model rich multimodal information:
- Appearance and short-term motion with CNN
- Long-term temporal information with LSTM
- Regularized fusion to explore feature correlations
[Diagram labels: Input Video, Individual Frames, Stacked Optical Flow, Spatial CNN, Motion CNN, LSTM, Fusion Layer, Regularization, Final Prediction]
8
Spatial and Motion CNN Features
[Diagram labels: Input Video, Individual Frame, Stacked Optical Flow, Spatial Convolutional Neural Network, Motion Convolutional Neural Network, Score Fusion]
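The "Score Fusion" label suggests combining the class scores of the two networks. One simple way to do that is a weighted average of per-class probabilities, sketched below. The weighting scheme, the `motion_weight` parameter and the function names are assumptions for illustration, not details from the slides.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_scores(spatial_logits, motion_logits, motion_weight=0.5):
    """Weighted average of the class probabilities of two streams."""
    p_spatial = softmax(np.asarray(spatial_logits, dtype=np.float64))
    p_motion = softmax(np.asarray(motion_logits, dtype=np.float64))
    return (1.0 - motion_weight) * p_spatial + motion_weight * p_motion

# Hypothetical logits for a 3-class problem from each stream.
fused = fuse_scores([2.0, 1.0, 0.1], [0.5, 2.5, 0.2])
predicted_class = int(np.argmax(fused))
```

In this toy input the spatial stream prefers class 0 and the motion stream prefers class 1; the fused scores follow the more confident stream.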
9
Temporal Modeling with LSTM
An unrolled recurrent neural network.
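To make "unrolled" concrete, the sketch below applies one LSTM cell to each frame feature in turn, carrying the hidden and cell state forward. It uses the standard LSTM gate equations; the weight layout, the sizes and the names `lstm_step` and `run_lstm` are assumptions, and the exact variant used in the talk is not stated on the slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell.

    W has shape (4 * hidden, input_dim + hidden); its row blocks are, in
    order: input gate, forget gate, output gate, candidate cell state.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:hidden])               # input gate
    f = sigmoid(z[hidden:2 * hidden])      # forget gate
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate
    g = np.tanh(z[3 * hidden:])            # candidate cell state
    c = f * c_prev + i * g                 # new cell state
    h = o * np.tanh(c)                     # new hidden state (the output)
    return h, c

def run_lstm(frames, W, b, hidden):
    """Unroll the cell over a sequence of per-frame feature vectors."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    outputs = []
    for x in frames:
        h, c = lstm_step(x, h, c, W, b)
        outputs.append(h)
    return np.stack(outputs)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
input_dim, hidden, num_frames = 8, 4, 5
W = rng.normal(scale=0.1, size=(4 * hidden, input_dim + hidden))
b = np.zeros(4 * hidden)
frames = rng.normal(size=(num_frames, input_dim))
outputs = run_lstm(frames, W, b, hidden)  # one output vector per frame
```

The same weights `W` and `b` are reused at every time step; only the states `h` and `c` change from frame to frame.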
10
Regularized Feature Fusion
[Ngiam et al., ICML 2011; Srivastava et al., NIPS 2012]
11
Regularized Feature Fusion
DNN learning scheme:
- Calculate prediction error
- Update weights in a BP manner
The fusion is performed in a free manner, without explicitly exploring the feature correlations.
[Ngiam et al., ICML 2011; Srivastava et al., NIPS 2012]
12
Regularized Feature Fusion
Objective function terms: empirical loss; prevent overfitting; model feature relationships; provide robustness.
14
Regularized Feature Fusion
Objective function terms: empirical loss; prevent overfitting; model feature relationships; provide robustness.
Minimizing the ℓ2,1 norm makes the weight matrix row-sparse!
15
Regularized Feature Fusion
Objective function terms: empirical loss; prevent overfitting; model feature relationships; provide robustness.
Minimizing the ℓ1 norm prevents incorrect feature sharing!
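The two norms named on these slides have standard definitions, written out below for a weight matrix $W$ with rows $w_i$. The second formula is only a sketch of how the four labelled terms could combine: the slide's own equation is not recoverable from this transcript, so the symbols $\mathcal{L}$, $\lambda_1$, $\lambda_2$, $\lambda_3$, $W^{l}$, $W^{E}$ and the pairing of each norm with a label are assumptions.

```latex
\|W\|_{2,1} = \sum_{i} \|w_i\|_2 = \sum_{i} \sqrt{\sum_{j} W_{ij}^2},
\qquad
\|W\|_{1} = \sum_{i} \sum_{j} |W_{ij}|

% Assumed form of the objective (not taken from the slides):
\min_{W}\;
\underbrace{\mathcal{L}}_{\text{empirical loss}}
+ \underbrace{\frac{\lambda_1}{2} \sum_{l=1}^{L} \|W^{l}\|_F^2}_{\text{prevent overfitting}}
+ \underbrace{\lambda_2 \, \|W^{E}\|_{2,1}}_{\text{model feature relationships}}
+ \underbrace{\lambda_3 \, \|W^{E}\|_{1}}_{\text{provide robustness}}
```

Because the ℓ2,1 norm sums the Euclidean norms of the rows, shrinking it tends to drive whole rows to zero, which is the row sparsity the slide refers to.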
16
Regularized Feature Fusion
Objective function terms: empirical loss; prevent overfitting; model feature relationships; provide robustness.
Optimization: proximal gradient descent for the E-th layer.
17
Regularized Feature Fusion
Algorithm:
1. Initialize weights randomly
2. for epoch = 1 : K
     Calculate prediction error with feed-forward propagation
     for l = 1 : L
       Back-propagate the prediction error and update weight matrices
       if l == E:
         Evaluate the proximal operator
     end for
   end for
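The "evaluate the proximal operator" step can be sketched for the two norms the slides name. The code below uses the standard proximal operators of the ℓ1 norm (elementwise soft-thresholding) and the ℓ2,1 norm (row-wise shrinkage). Applying them one after the other, and the names `lr`, `lam21` and `lam1`, are assumptions for illustration; the slides do not spell out the exact update.

```python
import numpy as np

def prox_l1(W, t):
    """Elementwise soft-thresholding: proximal operator of t * ||W||_1."""
    return np.sign(W) * np.maximum(np.abs(W) - t, 0.0)

def prox_l21(W, t):
    """Row-wise shrinkage: proximal operator of t * ||W||_{2,1}."""
    row_norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Rows whose norm is at most t are set to zero; others are scaled down.
    scale = np.maximum(1.0 - t / np.maximum(row_norms, 1e-12), 0.0)
    return W * scale

def proximal_step(W, grad, lr, lam21, lam1):
    """Gradient step on the smooth part, then the two proximal maps.

    Composing the l1 map and then the row-wise map is an assumed choice.
    """
    V = W - lr * grad
    return prox_l21(prox_l1(V, lr * lam1), lr * lam21)

# Toy fusion-layer weights; a zero gradient isolates the proximal effect.
W = np.array([[3.0, -4.0],
              [0.1, -0.1],
              [0.0, 2.0]])
W_new = proximal_step(W, grad=np.zeros_like(W), lr=1.0, lam21=1.0, lam1=0.5)
```

In this toy case the small second row is zeroed out completely, while the other rows are only shrunk.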
18
Experiments Datasets:
- UCF101: 101 action classes, 13,320 video clips from YouTube
- Columbia Consumer Videos (CCV): 20 classes, 9,317 videos from YouTube
19
Experiments Temporal Modeling:
Method                        UCF-101 (%)   CCV (%)
Spatial ConvNet               80.4          75.0
Motion ConvNet                78.3          59.1
Spatial LSTM                  83.3          43.3
Motion LSTM                   76.6          54.7
ConvNet (spatial + motion)    86.2          75.8
LSTM (spatial + motion)       86.3          61.9
ConvNet + LSTM (spatial)      84.4          77.9
ConvNet + LSTM (motion)       81.4          70.9
ALL Streams                   90.3          82.4
LSTMs are worse than CNNs on noisy long videos.
CNN and LSTM are highly complementary!
20
Experiments Regularized Feature Fusion:
Method                UCF-101 (%)   CCV (%)
Spatial SVM           78.6          74.4
Motion SVM            78.2          57.9
SVM-EF                86.6          75.3
SVM-LF                85.3          74.9
SVM-MKL               86.8          75.4
NN-EF                 86.5          75.6
NN-LF                 85.1          75.2
M-DBM                 86.9
Two-Stream CNN        86.2          75.8
Regularized Fusion    88.4          76.2
Regularized fusion performs better than fusion in a free manner.
21
Experiments Hybrid Deep Learning Framework:
22
Experiments Comparisons with State-of-the-Art:
UCF101
- Donahue et al.: 82.9%
- Srivastava et al.: 84.3%
- Wang et al.: 85.9%
- Tran et al.: 86.7%
- Simonyan et al.: 88.0%
- Lan et al.: 89.1%
- Zha et al.: 89.6%
- Ours: 91.3%
CCV
- Xu et al.: 60.3%
- Ye et al.: 64.0%
- Jhuo et al.
- Ma et al.: 63.4%
- Liu et al.: 68.2%
- Wu et al.: 70.6%
- Ours: 83.5%
23
Conclusion
We propose a hybrid deep learning framework to model rich multimodal information:
- Modeling appearance and short-term motion with CNN
- Capturing long-term temporal information with LSTM
- Regularized fusion to explore feature correlations
Take-home message:
- LSTMs and CNNs are highly complementary.
- Regularized feature fusion performs better.
24
Thank you! Q & A