Audio-Based Multimedia Event Detection Using Deep Recurrent Neural Networks
Yun Wang, Leonardo Neves, Florian Metze
3/23/2016
2 Multimedia Event Detection
▣ Goal: Content-based retrieval
▣ Example events (shown as images on the slide)
3 Multimedia Event Detection
▣ Sources of information: speech, visual content, and non-speech audio
4 Conventional Pipeline
▣ Low-level features (e.g. MFCCs) → bag of audio words, GMM supervector, or i-vector → event
▣ Limitations: local context only; temporal order disregarded (see the sketch below)
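As a concrete illustration of the order-discarding step in this pipeline, here is a minimal bag-of-audio-words sketch, assuming scikit-learn and pre-extracted MFCC frames; the codebook size is an illustrative choice, not from the talk.

```python
# Hedged sketch of the conventional bag-of-audio-words representation.
# Assumes MFCC frames are already extracted; n_words is illustrative.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_mfcc_frames, n_words=256):
    """Cluster pooled MFCC frames into a codebook of 'audio words'."""
    return KMeans(n_clusters=n_words, n_init=10).fit(all_mfcc_frames)

def bag_of_audio_words(codebook, clip_mfccs):
    """Histogram of codeword assignments; frame order is discarded."""
    words = codebook.predict(clip_mfccs)
    hist = np.bincount(words, minlength=codebook.n_clusters)
    return hist / max(hist.sum(), 1)   # normalize to a distribution
```

The histogram keeps only local, per-frame context and throws away temporal order, which is exactly the limitation noted above.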
5 Noisemes
▣ Semantically meaningful sound units
▣ Examples (shown as images on the slide)
▣ Can be long-lasting or transient
▣ Allow for fine-grained audio scene understanding
6 Proposed Pipeline
▣ openSMILE features (983 dimensions) → deep RNN → noiseme confidence vectors → event
7 Step 1: Frame-level Noiseme Classification
8 The “Noiseme” Corpus
▣ 388 clips, 7.9 hours
▣ Hand-annotated with 48 noisemes, merged into 17 (+ background)
▣ 30% overlap
▣ Split: 60% training, 20% validation, 20% test

* S. Burger, Q. Jin, P. F. Schulam, and F. Metze, “Noisemes: Manual Annotation of Environmental Noise in Audio Streams”, Technical Report CMU-LTI-12-07, Carnegie Mellon University, 2012.
9 Baseline
▣ Evaluation criterion: frame accuracy
▣ Linear SVM: 41.5%
▣ Feed-forward DNN (sketched below): 2 hidden layers, 500 ReLU units per layer, softmax output
▣ DNN accuracy: 45.1%
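A minimal PyTorch sketch of this feed-forward baseline: two hidden layers of 500 ReLU units and a softmax output over 18 classes (17 noisemes + background). The 983-dimensional input follows the openSMILE features mentioned earlier; training details are assumed.

```python
# Hedged sketch of the feed-forward baseline, assuming PyTorch.
import torch
import torch.nn as nn

frame_dnn = nn.Sequential(
    nn.Linear(983, 500), nn.ReLU(),   # hidden layer 1
    nn.Linear(500, 500), nn.ReLU(),   # hidden layer 2
    nn.Linear(500, 18),               # 17 noisemes + background
)
# Softmax is applied implicitly by the loss during training.
loss_fn = nn.CrossEntropyLoss()
scores = frame_dnn(torch.rand(32, 983))   # a batch of 32 frames
```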
10 Recurrent Neural Networks
▣ Hidden unit: ReLU or LSTM cell
11 Bidirectional RNNs
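A sketch of what these frame-level recurrent models might look like, assuming PyTorch; the layout below matches the "ReLU BRNN, 300 * 2" row of the evaluation table on the next slide, and swapping in nn.LSTM gives the LSTM variants.

```python
# Hedged sketch of a bidirectional frame-level noiseme classifier.
import torch
import torch.nn as nn

class FrameBRNN(nn.Module):
    def __init__(self, in_dim=983, hidden=300, n_classes=18):
        super().__init__()
        # ReLU BRNN; replace nn.RNN with nn.LSTM for the LSTM variants.
        self.rnn = nn.RNN(in_dim, hidden, num_layers=2,
                          nonlinearity="relu", bidirectional=True,
                          batch_first=True)
        self.out = nn.Linear(2 * hidden, n_classes)  # fwd + bwd states

    def forward(self, x):       # x: (batch, frames, in_dim)
        h, _ = self.rnn(x)
        return self.out(h)      # per-frame class scores

scores = FrameBRNN()(torch.rand(2, 100, 983))   # -> (2, 100, 18)
```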
12 Evaluation

Model         Size     # Params  Frame Accuracy
Feed-forward  500 * 2  0.75M     45.1%
ReLU RNN      500 * 1  0.75M     46.3%
ReLU BRNN     300 * 2  1.32M     47.0%
LSTM RNN      300 * 1  1.55M     46.3%
LSTM BRNN     300 * 1  3.09M     46.7%

▣ Bidirectionality helps
▣ LSTM cells not necessary
13 Step 2: Clip-level Event Detection
14 Noiseme Confidence Vectors
▣ Generated with the trained ReLU BRNN
15 TRECVID 2011 MED Corpus
▣ 3,104 training clips, 6,642 test clips
▣ 15 events
▣ Evaluation criterion: mean average precision (MAP)
▣ Average precision (AP) for one event: for the ranked results ✓✗✓✗✓, AP = (1/1 + 2/3 + 3/5) / 3
▣ MAP = mean of AP across all events (computed in the sketch below)
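A small sketch of this metric, following the slide's definition: precision is taken at each rank where a relevant clip appears, then averaged over those ranks.

```python
# AP/MAP as defined on the slide; relevance lists are binary, ranked
# by classifier score (highest first).
def average_precision(relevance):
    """Mean of precision@k over the ranks k of relevant items."""
    hits, precisions = 0, []
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(per_event_relevance):
    aps = [average_precision(r) for r in per_event_relevance]
    return sum(aps) / len(aps)

print(average_precision([1, 0, 1, 0, 1]))  # (1/1 + 2/3 + 3/5) / 3 ≈ 0.756
```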
16 RNN Models
▣ One RNN for each event
▣ Unidirectional LSTM
▣ Sigmoid output at the last frame only (sketched below)
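A minimal sketch of this per-event model, assuming PyTorch; the hidden size is an illustrative guess, and the 18-dimensional input matches the noiseme confidence vectors from Step 1.

```python
# Hedged sketch of a clip-level event detector: unidirectional LSTM
# over the noiseme confidence sequence, sigmoid score at the last frame.
import torch
import torch.nn as nn

class EventLSTM(nn.Module):
    def __init__(self, in_dim=18, hidden=64):   # hidden size assumed
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):       # x: (batch, frames, in_dim)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.out(h[:, -1]))  # score at last frame

score = EventLSTM()(torch.rand(1, 500, 18))   # one clip -> one score
```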
20 Multi-Resolution Training
▣ MAP:
▣ 4.0% @ length = 1 (feed-forward baseline)
▣ 4.6% @ length = 32
▣ 3.2% @ length = 512
▣ LSTM can use temporal information, but only for short sequences
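One plausible reading of "length" here, sketched below as an assumption rather than the talk's exact scheme: the noiseme confidence sequence is average-pooled down to a fixed number of steps, so length = 1 collapses a clip to a single mean vector (the feed-forward case).

```python
# Hedged sketch: downsample a confidence sequence to a target length
# by averaging roughly equal chunks; assumes frames >= length.
import numpy as np

def pool_to_length(seq, length):
    """seq: (frames, dims) -> (length, dims) by chunk averaging."""
    chunks = np.array_split(seq, length)           # ~equal pieces
    return np.stack([c.mean(axis=0) for c in chunks])

clip = np.random.rand(5000, 18)   # e.g. 5000 frames of confidences
for L in (1, 32, 512):
    print(L, pool_to_length(clip, L).shape)
```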
21 Follow-Up Work
▣ SVM baseline, using the χ²-RBF kernel: 7.1% MAP (sketched below)
▣ Recurrent SVMs: 8.8% MAP

* Y. Wang and F. Metze, “Recurrent Support Vector Machines for Audio-Based Multimedia Event Detection”, submitted to ICMR 2016.
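A sketch of the χ²-RBF kernel SVM baseline, assuming scikit-learn: chi2_kernel computes exp(-gamma * χ²(x, y)), and the feature matrix here is an illustrative stand-in for clip-level noiseme statistics.

```python
# Hedged sketch of a chi-square RBF kernel SVM, assuming scikit-learn.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

X_train = np.random.rand(100, 18)        # chi2 needs non-negative input
y_train = np.random.randint(0, 2, 100)   # clip labels for one event

svm = SVC(kernel=chi2_kernel)            # callable kernel function
svm.fit(X_train, y_train)
scores = svm.decision_function(X_train)  # rank clips by this score for AP
```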
22 Conclusion
▣ Temporal information helps!
▣ Frame-level noiseme classification accuracy: 45.1% → 47.0%
▣ Clip-level event detection: 4.0% → 4.6% MAP
▣ Clip-level event detection still needs improvement; recurrent SVMs are a promising direction
Thanks! Any questions?