Audio-Based Multimedia Event Detection Using Deep Recurrent Neural Networks
Yun Wang, Leonardo Neves, Florian Metze
3/23/2016
2 Multimedia Event Detection
▣ Goal: Content-based retrieval
▣ Example events (shown as images on the slide)
3 Multimedia Event Detection
▣ Sources of information: speech, visual content, and non-speech audio
4 Conventional Pipeline
▣ Low-level features (e.g. MFCCs) → bag of audio words, GMM supervector, or i-vector → event
▣ Limitations: local context only; temporal order disregarded (see the sketch below)
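As a concrete illustration of the order-discarding step in this pipeline, here is a minimal bag-of-audio-words sketch, assuming scikit-learn and pre-extracted MFCC frames; the codebook size is an illustrative choice, not from the talk.

```python
# Hedged sketch of the conventional bag-of-audio-words representation.
# Assumes MFCC frames are already extracted; n_words is illustrative.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_mfcc_frames, n_words=256):
    """Cluster pooled MFCC frames into a codebook of 'audio words'."""
    return KMeans(n_clusters=n_words, n_init=10).fit(all_mfcc_frames)

def bag_of_audio_words(codebook, clip_mfccs):
    """Histogram of codeword assignments; frame order is discarded."""
    words = codebook.predict(clip_mfccs)
    hist = np.bincount(words, minlength=codebook.n_clusters)
    return hist / max(hist.sum(), 1)   # normalize to a distribution
```

The histogram keeps only local, per-frame context and throws away temporal order, which is exactly the limitation noted above.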
5 Noisemes
▣ Semantically meaningful sound units
▣ Examples (shown as images on the slide)
▣ Can be long-lasting or transient
▣ Allow for fine-grained audio scene understanding
6 Proposed Pipeline
▣ openSMILE features (983 dimensions) → deep RNN → noiseme confidence vectors → event
7 Step 1: Frame-level Noiseme Classification
8 The “Noiseme” Corpus
▣ 388 clips, 7.9 hours
▣ Hand-annotated with 48 noisemes, merged into 17 (+ background)
▣ 30% overlap
▣ Split: 60% training, 20% validation, 20% test

* S. Burger, Q. Jin, P. F. Schulam, and F. Metze, “Noisemes: Manual Annotation of Environmental Noise in Audio Streams”, Technical Report CMU-LTI-12-07, Carnegie Mellon University, 2012.
9 Baseline
▣ Evaluation criterion: frame accuracy
▣ Linear SVM: 41.5%
▣ Feed-forward DNN (sketched below): 2 hidden layers, 500 ReLU units per layer, softmax output
▣ DNN accuracy: 45.1%
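A minimal PyTorch sketch of this feed-forward baseline: two hidden layers of 500 ReLU units and a softmax output over 18 classes (17 noisemes + background). The 983-dimensional input follows the openSMILE features mentioned earlier; training details are assumed.

```python
# Hedged sketch of the feed-forward baseline, assuming PyTorch.
import torch
import torch.nn as nn

frame_dnn = nn.Sequential(
    nn.Linear(983, 500), nn.ReLU(),   # hidden layer 1
    nn.Linear(500, 500), nn.ReLU(),   # hidden layer 2
    nn.Linear(500, 18),               # 17 noisemes + background
)
# Softmax is applied implicitly by the loss during training.
loss_fn = nn.CrossEntropyLoss()
scores = frame_dnn(torch.rand(32, 983))   # a batch of 32 frames
```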
10 Recurrent Neural Networks
▣ Hidden unit: ReLU or LSTM cell
11 Bidirectional RNNs
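A sketch of what these frame-level recurrent models might look like, assuming PyTorch; the layout below matches the "ReLU BRNN, 300 * 2" row of the evaluation table on the next slide, and swapping in nn.LSTM gives the LSTM variants.

```python
# Hedged sketch of a bidirectional frame-level noiseme classifier.
import torch
import torch.nn as nn

class FrameBRNN(nn.Module):
    def __init__(self, in_dim=983, hidden=300, n_classes=18):
        super().__init__()
        # ReLU BRNN; replace nn.RNN with nn.LSTM for the LSTM variants.
        self.rnn = nn.RNN(in_dim, hidden, num_layers=2,
                          nonlinearity="relu", bidirectional=True,
                          batch_first=True)
        self.out = nn.Linear(2 * hidden, n_classes)  # fwd + bwd states

    def forward(self, x):       # x: (batch, frames, in_dim)
        h, _ = self.rnn(x)
        return self.out(h)      # per-frame class scores

scores = FrameBRNN()(torch.rand(2, 100, 983))   # -> (2, 100, 18)
```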
12 Evaluation

Model         Size     # Params  Frame Accuracy
Feed-forward  500 * 2  0.75M     45.1%
ReLU RNN      500 * 1  0.75M     46.3%
ReLU BRNN     300 * 2  1.32M     47.0%
LSTM RNN      300 * 1  1.55M     46.3%
LSTM BRNN     300 * 1  3.09M     46.7%

▣ Bidirectionality helps
▣ LSTM cells not necessary
13 Step 2: Clip-level Event Detection
14 Noiseme Confidence Vectors
▣ Generated with the trained ReLU BRNN
15 TRECVID 2011 MED Corpus
▣ 3,104 training clips, 6,642 test clips
▣ 15 events
▣ Evaluation criterion: mean average precision (MAP)
▣ Average precision (AP) for one event: for the ranked results ✓✗✓✗✓, AP = (1/1 + 2/3 + 3/5) / 3
▣ MAP = mean of AP across all events (computed in the sketch below)
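A small sketch of this metric, following the slide's definition: precision is taken at each rank where a relevant clip appears, then averaged over those ranks.

```python
# AP/MAP as defined on the slide; relevance lists are binary, ranked
# by classifier score (highest first).
def average_precision(relevance):
    """Mean of precision@k over the ranks k of relevant items."""
    hits, precisions = 0, []
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(per_event_relevance):
    aps = [average_precision(r) for r in per_event_relevance]
    return sum(aps) / len(aps)

print(average_precision([1, 0, 1, 0, 1]))  # (1/1 + 2/3 + 3/5) / 3 ≈ 0.756
```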
16 RNN Models
▣ One RNN for each event
▣ Unidirectional LSTM
▣ Sigmoid output at the last frame only (sketched below)
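A minimal sketch of this per-event model, assuming PyTorch; the hidden size is an illustrative guess, and the 18-dimensional input matches the noiseme confidence vectors from Step 1.

```python
# Hedged sketch of a clip-level event detector: unidirectional LSTM
# over the noiseme confidence sequence, sigmoid score at the last frame.
import torch
import torch.nn as nn

class EventLSTM(nn.Module):
    def __init__(self, in_dim=18, hidden=64):   # hidden size assumed
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):       # x: (batch, frames, in_dim)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.out(h[:, -1]))  # score at last frame

score = EventLSTM()(torch.rand(1, 500, 18))   # one clip -> one score
```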
20 Multi-Resolution Training
▣ MAP:
▣ 4.0% @ length = 1 (feed-forward baseline)
▣ 4.6% @ length = 32
▣ 3.2% @ length = 512
▣ LSTM can use temporal information, but only for short sequences
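One plausible reading of "length" here, sketched below as an assumption rather than the talk's exact scheme: the noiseme confidence sequence is average-pooled down to a fixed number of steps, so length = 1 collapses a clip to a single mean vector (the feed-forward case).

```python
# Hedged sketch: downsample a confidence sequence to a target length
# by averaging roughly equal chunks; assumes frames >= length.
import numpy as np

def pool_to_length(seq, length):
    """seq: (frames, dims) -> (length, dims) by chunk averaging."""
    chunks = np.array_split(seq, length)           # ~equal pieces
    return np.stack([c.mean(axis=0) for c in chunks])

clip = np.random.rand(5000, 18)   # e.g. 5000 frames of confidences
for L in (1, 32, 512):
    print(L, pool_to_length(clip, L).shape)
```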
21 Follow-Up Work
▣ SVM baseline, using the χ²-RBF kernel: 7.1% MAP (sketched below)
▣ Recurrent SVMs: 8.8% MAP

* Y. Wang and F. Metze, “Recurrent Support Vector Machines for Audio-Based Multimedia Event Detection”, submitted to ICMR 2016.
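A sketch of the χ²-RBF kernel SVM baseline, assuming scikit-learn: chi2_kernel computes exp(-gamma * χ²(x, y)), and the feature matrix here is an illustrative stand-in for clip-level noiseme statistics.

```python
# Hedged sketch of a chi-square RBF kernel SVM, assuming scikit-learn.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

X_train = np.random.rand(100, 18)        # chi2 needs non-negative input
y_train = np.random.randint(0, 2, 100)   # clip labels for one event

svm = SVC(kernel=chi2_kernel)            # callable kernel function
svm.fit(X_train, y_train)
scores = svm.decision_function(X_train)  # rank clips by this score for AP
```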
22 Conclusion
▣ Temporal information helps!
▣ Frame-level noiseme classification accuracy: 45.1% → 47.0%
▣ Clip-level event detection: 4.0% → 4.6% MAP
▣ Clip-level event detection still needs improvement; recurrent SVMs are a promising direction
Thanks! Any questions?