Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification
Zuxuan Wu, Xi Wang, Yu-Gang Jiang, Hao Ye, Xiangyang Xue
School of Computer Science, Fudan University, Shanghai, China
zxwu@fudan.edu.cn
ACM Multimedia, Brisbane, Australia, Oct. 2015
Video Classification
Videos are everywhere, with wide applications:
- Web video search
- Video collection management
- Intelligent video surveillance
Video Classification: State of the Art
1. Improved Dense Trajectories [Wang et al., ICCV 2013]
   - Tracking trajectories
   - Computing local descriptors along the trajectories
2. Feature Encoding [Perronnin et al., CVPR 2010; Xu et al., CVPR 2015]
   - Encoding local features with Fisher Vector/VLAD (a sketch follows this list)
   - Normalization methods, such as power normalization
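For the encoding step, here is a minimal NumPy sketch of VLAD with power and l2 normalization; the hard-assignment variant, codebook size, and descriptor shapes are illustrative assumptions, not the exact pipelines of the cited papers.

```python
import numpy as np

def vlad_encode(descriptors, centroids):
    """VLAD: accumulate residuals of local descriptors to their nearest
    codebook centroid, then power- and l2-normalize the result.
    descriptors: (N, d) local features; centroids: (K, d) codebook."""
    K, d = centroids.shape
    # Assign each descriptor to its nearest centroid (hard assignment).
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    assign = np.argmin(dists, axis=1)
    # Sum residuals per centroid.
    vlad = np.zeros((K, d))
    for k in range(K):
        members = descriptors[assign == k]
        if len(members) > 0:
            vlad[k] = (members - centroids[k]).sum(axis=0)
    vlad = vlad.ravel()
    # Power normalization (signed square root), then l2 normalization.
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    return vlad / (np.linalg.norm(vlad) + 1e-12)
```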
Video Classification: Deep Learning
1. Image-based CNN Classification [Zha et al., arXiv 2015]
   - Extracting deep features for each frame
   - Averaging frame-level deep features (see the sketch after this list)
2. Two-Stream CNN [Simonyan et al., NIPS 2014]
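A minimal sketch of the frame-averaging baseline, assuming per-frame CNN features have already been extracted; the shapes and the linear classifier are illustrative, not the cited paper's exact setup.

```python
import numpy as np

def video_score_by_averaging(frame_features, W, b):
    """Average frame-level CNN features into a single video-level vector,
    then apply a linear classifier.
    frame_features: (T, d) per-frame deep features;
    W: (d, num_classes) classifier weights; b: (num_classes,) bias."""
    video_feature = frame_features.mean(axis=0)   # (d,)
    return video_feature @ W + b                  # (num_classes,) scores
```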
Video Classification: Deep Learning
3. Recurrent NN: LSTM [Ng et al., CVPR 2015]
   (Figure: an LSTM unrolled over the frames of a diving video, with per-step outputs O_{t-1}, O_t, O_{t+1}; the frames show "Jumping from platform", "Rotating in the air", and "Falling into water", and the video is predicted as "Diving".)
   - The performance is not ideal: LSTM results are close to those of image-based classification with average pooling.
Video Classification: Deep Learning
We propose a hybrid deep learning framework to capture appearance, short-term motion, and long-term temporal dynamics in videos.
Our Framework
We propose a hybrid deep learning framework to model rich multimodal information:
- Appearance and short-term motion with CNNs
- Long-term temporal information with LSTM
- Regularized fusion to explore feature correlations
(Figure: the input video yields individual frames for a spatial CNN and stacked optical flow for a motion CNN; both streams feed LSTMs and a regularized fusion layer that produces the final prediction.)
Spatial and Motion CNN Features
(Figure: the input video provides individual frames to the spatial convolutional neural network and stacked optical flow to the motion convolutional neural network; their predictions are combined by score fusion.)
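A hedged sketch of the score-fusion step: each stream produces class probabilities, and the final prediction is a weighted average. The fusion weight here is an illustrative assumption, not the tuned value from the paper.

```python
import numpy as np

def fuse_two_stream(spatial_probs, motion_probs, w_spatial=0.5):
    """Late (score-level) fusion of spatial and motion CNN predictions.
    Both inputs are (num_classes,) probability vectors; w_spatial is an
    illustrative fusion weight."""
    return w_spatial * spatial_probs + (1.0 - w_spatial) * motion_probs
```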
Temporal Modeling with LSTM
(Figure: an unrolled recurrent neural network.)
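A minimal PyTorch sketch of this temporal modeling step, assuming per-frame CNN features as input; the layer sizes and the per-step score averaging are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FrameLSTM(nn.Module):
    """LSTM over a sequence of frame-level CNN features; the hidden state
    at each time step is mapped to class scores."""
    def __init__(self, feat_dim=4096, hidden_dim=512, num_classes=101):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frame_feats):           # (batch, T, feat_dim)
        outputs, _ = self.lstm(frame_feats)   # (batch, T, hidden_dim)
        # Average per-step predictions over time for the video-level score.
        return self.classifier(outputs).mean(dim=1)
```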
Regularized Feature Fusion [Ngiam et al., ICML 2011; Srivastava et al., NIPS 2012]
DNN learning scheme:
- Calculate the prediction error
- Update weights via back-propagation
The fusion is performed in a free manner, without explicitly exploring the feature correlations.
Regularized Feature Fusion
Objective function (reconstructed below), term by term:
- Empirical loss
- Prevent overfitting
- Model feature relationships: minimizing the l2,1 norm makes the weight matrix row-sparse
- Provide robustness: minimizing the l1 norm prevents incorrect feature sharing
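The objective itself appears as an image in the original slides; the following LaTeX reconstruction is a plausible sketch consistent with the four annotations above, assuming a Frobenius-norm weight decay on all weights $W$ and the sparse-group penalties on the fusion-layer weight matrix $W^{(E)}$ (the trade-off parameters $\lambda_1, \lambda_2, \lambda_3$ are assumed notation).

```latex
\min_{W}\;
\underbrace{\mathcal{L}(W)}_{\text{empirical loss}}
\;+\; \underbrace{\lambda_1 \lVert W \rVert_F^2}_{\text{prevent overfitting}}
\;+\; \underbrace{\lambda_2 \lVert W^{(E)} \rVert_{2,1}}_{\text{model feature relationships}}
\;+\; \underbrace{\lambda_3 \lVert W^{(E)} \rVert_1}_{\text{provide robustness}}
```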
Regularized Feature Fusion
Optimization: for the E-th (fusion) layer, the non-smooth penalties are handled by proximal gradient descent (a sketch of the proximal step follows).
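A minimal NumPy sketch of the proximal step for the fusion-layer weights, assuming the l1 + l2,1 penalty sketched above; for this sparse-group penalty the proximal operator is known to factor into elementwise soft-thresholding followed by row-wise group shrinkage. All names and step sizes are illustrative.

```python
import numpy as np

def prox_l1(W, t):
    # Elementwise soft-thresholding: the prox of t * ||W||_1.
    return np.sign(W) * np.maximum(np.abs(W) - t, 0.0)

def prox_l21(W, t):
    # Row-wise group shrinkage: the prox of t * ||W||_{2,1}.
    norms = np.maximum(np.linalg.norm(W, axis=1, keepdims=True), 1e-12)
    return W * np.maximum(1.0 - t / norms, 0.0)

def prox_step(W, grad, lr, lam1, lam21):
    """One proximal gradient step on the fusion-layer weights: a gradient
    step on the smooth loss, then the prox of the sparse-group penalty
    (soft-threshold first, then row shrinkage)."""
    V = W - lr * grad
    return prox_l21(prox_l1(V, lr * lam1), lr * lam21)
```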
Regularized Feature Fusion
Algorithm (a runnable sketch follows):
1. Initialize weights randomly.
2. for epoch = 1 : K
     Calculate the prediction error with feed-forward propagation.
     for l = 1 : L
       Back-propagate the prediction error and update the weight matrices.
       if l == E: evaluate the proximal operator.
     end for
   end for
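Putting the algorithm together, a hedged end-to-end sketch on a tiny two-layer network with squared loss, where the second layer plays the role of the fusion layer E; the architecture, loss, and hyper-parameters are illustrative stand-ins for the paper's actual network.

```python
import numpy as np

def prox_sparse_group(W, t, lam1, lam21):
    # Same prox as sketched above: l1 soft-threshold, then row shrinkage.
    W = np.sign(W) * np.maximum(np.abs(W) - t * lam1, 0.0)
    norms = np.maximum(np.linalg.norm(W, axis=1, keepdims=True), 1e-12)
    return W * np.maximum(1.0 - t * lam21 / norms, 0.0)

def train(X, Y, hidden=64, epochs=10, lr=0.01, lam1=1e-4, lam21=1e-4):
    """Tiny two-layer network with squared loss. The second (fusion)
    layer is updated with a proximal step; the other layer uses plain SGD.
    X: (n, d) inputs; Y: (n, c) targets."""
    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(X.shape[1], hidden))
    W2 = rng.normal(scale=0.1, size=(hidden, Y.shape[1]))
    for epoch in range(epochs):
        # Feed-forward propagation and prediction error.
        H = np.tanh(X @ W1)
        err = H @ W2 - Y
        # Back-propagate the prediction error.
        gW2 = H.T @ err / len(X)
        gH = (err @ W2.T) * (1.0 - H ** 2)
        gW1 = X.T @ gH / len(X)
        # Plain gradient step for the non-fusion layer...
        W1 -= lr * gW1
        # ...and a proximal step for the fusion layer (l == E).
        W2 = prox_sparse_group(W2 - lr * gW2, lr, lam1, lam21)
    return W1, W2
```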
Experiments
Datasets:
- UCF101: 101 action classes, 13,320 video clips from YouTube
- Columbia Consumer Videos (CCV): 20 classes, 9,317 videos from YouTube
Experiments
Temporal Modeling (results, %):

  Method                        UCF-101   CCV
  Spatial ConvNet                 80.4    75.0
  Motion ConvNet                  78.3    59.1
  Spatial LSTM                    83.3    43.3
  Motion LSTM                     76.6    54.7
  ConvNet (spatial + motion)      86.2    75.8
  LSTM (spatial + motion)         86.3    61.9
  ConvNet + LSTM (spatial)        84.4    77.9
  ConvNet + LSTM (motion)         81.4    70.9
  All streams                     90.3    82.4

LSTMs perform worse than CNNs on the noisy, longer CCV videos; CNNs and LSTMs are highly complementary!
Experiments
Regularized Feature Fusion (results, %):

  Method                UCF-101   CCV
  Spatial SVM             78.6    74.4
  Motion SVM              78.2    57.9
  SVM-EF                  86.6    75.3
  SVM-LF                  85.3    74.9
  SVM-MKL                 86.8    75.4
  NN-EF                   86.5    75.6
  NN-LF                   85.1    75.2
  M-DBM                   86.9    —
  Two-Stream CNN          86.2    75.8
  Regularized Fusion      88.4    76.2

Regularized fusion outperforms fusion performed in a free manner.
Experiments
Hybrid Deep Learning Framework:
Experiments
Comparison with the State of the Art:

  UCF101:
    Donahue et al.     82.9%
    Srivastava et al.  84.3%
    Wang et al.        85.9%
    Tran et al.        86.7%
    Simonyan et al.    88.0%
    Lan et al.         89.1%
    Zha et al.         89.6%
    Ours               91.3%

  CCV:
    Xu et al.          60.3%
    Ye et al.          64.0%
    Jhuo et al.        —
    Ma et al.          63.4%
    Liu et al.         68.2%
    Wu et al.          70.6%
    Ours               83.5%
Conclusion
We propose a hybrid deep learning framework to model rich multimodal information:
- Modeling appearance and short-term motion with CNNs
- Capturing long-term temporal information with LSTM
- Regularized fusion to explore feature correlations
Take-home messages:
- LSTMs and CNNs are highly complementary.
- Regularized feature fusion performs better than fusion in a free manner.
Thank you! Q & A zxwu@fudan.edu.cn