Published by Verity Lloyd. Modified over 9 years ago.
1 Detecting Group Interest-level in Meetings
Daniel Gatica-Perez, Iain McCowan, Dong Zhang, and Samy Bengio
IDIAP Research Institute, Martigny, Switzerland
2 Outline
- The Goal
- Our Approach
- Meeting Corpus
- Audio-Visual Features
- Experiments
- Performance Measures
- Feature Selection
- Results
- Conclusions
3 The Goal
- Extract relevant segments in meetings
- Relevant segments are defined based on group interest-level (degree of engagement in participants’ interactions)
4 Our Approach
[Diagram: microphones and cameras; low-level AV features per person (Person 1, Person 2, ..., Person N); statistical models (early integration HMM, multi-stream HMM)]
5 Our Approach
- Early integration HMM: concatenate audio and visual features to form the observation vector
- Multi-stream HMM: audio and visual streams are trained independently; outputs are merged at the state level during decoding
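
The two integration schemes can be contrasted with a minimal sketch. The slides do not give the merging rule, so the weighted sum of per-state log-likelihoods below is an assumption (a common choice for multi-stream HMMs), and the function names and the `audio_weight` parameter are hypothetical.

```python
def early_integration(audio_frame, visual_frame):
    """Early integration: concatenate audio and visual features
    into a single observation vector for one frame."""
    return list(audio_frame) + list(visual_frame)


def multistream_state_loglik(audio_loglik, visual_loglik, audio_weight=0.5):
    """Multi-stream merging at the state level (assumed rule).

    Each input holds one emission log-likelihood per HMM state, produced
    by an independently trained stream model. The streams are merged
    with a weighted sum in the log domain.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    if len(audio_loglik) != len(visual_loglik):
        raise ValueError("both streams must cover the same states")
    return [
        audio_weight * a + (1.0 - audio_weight) * v
        for a, v in zip(audio_loglik, visual_loglik)
    ]


# Illustrative values only (not taken from the corpus).
observation = early_integration([0.2, 1.5, 0.7], [0.1, 0.4])
merged = multistream_state_loglik([-1.0, -3.0], [-2.0, -1.0], audio_weight=0.8)
```

In the early-integration case a single HMM sees the 5-dimensional `observation`; in the multi-stream case the decoder would use `merged` as the per-state score for the frame.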
6 Meeting Corpus (mmm.idiap.ch)
- 50 meetings: 30 for training, 20 for testing
- Each meeting: 5 minutes, 4 participants
- Recorded based on topic and action scripts
- Behavior and emotion of participants are natural
7 Annotating Group-Interest Level
- Interval coding scheme
  (a) discrete scale: 1-5
  (b) 15-second interval unit
  (c) 2 independent annotators
- Post-processing
  (a) normalization (for annotator bias)
  (b) analysis of inter-annotator agreement
  (c) average of the two annotators
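
The normalization and averaging steps can be sketched as follows. The slides do not say which normalization was used, so the per-annotator z-score here is an assumption chosen for illustration, and the function names are hypothetical. The agreement analysis in step (b) is not covered.

```python
from statistics import mean, pstdev


def z_normalize(scores):
    """Remove one annotator's offset and scale (assumed normalization)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # A constant annotator carries no relative information.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def combine_annotators(scores_a, scores_b):
    """Average two annotators' normalized scores, interval by interval."""
    if len(scores_a) != len(scores_b):
        raise ValueError("annotators must label the same intervals")
    norm_a = z_normalize(scores_a)
    norm_b = z_normalize(scores_b)
    return [(a + b) / 2.0 for a, b in zip(norm_a, norm_b)]


# Illustrative 1-5 ratings for three 15-second intervals (made-up values).
# Annotator A uses the full scale, annotator B a narrower range.
combined = combine_annotators([1, 3, 5], [2, 3, 4])
```

With these made-up ratings the two annotators agree on the ordering but use the scale differently; after normalization their scores coincide, which is the kind of bias the post-processing step is meant to remove.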
8 Annotating Group-Interest Level
[Figure: example annotation along a time axis, with interval labels 1 (NEUTRAL), 4 (HIGH), 3 (NEUTRAL), 5 (HIGH)]
9 Audio-Visual Features

Modality   Description
Visual     head orientation from skin color blobs
           right hand orientation from skin color blobs
           right hand eccentricity from skin color blobs
           head and hand motion from skin color blobs
Audio      SRP-PHAT from microphone array
           speech relative pitch from lapels
           speech energy from lapels
           speech rate from lapels
10 Performance Measures
- Nc: high-level frames correctly detected
- Nf: high-level frames falsely accepted
- Nd: high-level frames falsely rejected
  precision = Nc / (Nc + Nf)
  recall = Nc / (Nc + Nd)
- Expected Performance Curve (EPC):
  ep = alpha * precision + (1 - alpha) * recall
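
The measures above can be computed from frame-level decisions as in the sketch below. The function names and the boolean frame encoding (True = high interest) are assumptions for illustration; only the formulas come from the slide.

```python
def count_frames(predicted, reference):
    """Return (Nc, Nf, Nd) for two equal-length lists of booleans,
    where True marks a high-interest frame."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    nc = sum(1 for p, r in zip(predicted, reference) if p and r)      # correctly detected
    nf = sum(1 for p, r in zip(predicted, reference) if p and not r)  # falsely accepted
    nd = sum(1 for p, r in zip(predicted, reference) if r and not p)  # falsely rejected
    return nc, nf, nd


def precision(nc, nf):
    return nc / (nc + nf) if (nc + nf) else 0.0


def recall(nc, nd):
    return nc / (nc + nd) if (nc + nd) else 0.0


def expected_performance(prec, rec, alpha):
    """ep = alpha * precision + (1 - alpha) * recall."""
    return alpha * prec + (1.0 - alpha) * rec


# Illustrative frame labels (made-up values).
predicted = [True, True, False, True, False, False]
reference = [True, False, False, True, True, True]
nc, nf, nd = count_frames(predicted, reference)  # (2, 1, 2)
ep = expected_performance(precision(nc, nf), recall(nc, nd), alpha=0.5)
```

Setting `alpha = 1` reduces `ep` to precision and `alpha = 0` reduces it to recall, so sweeping `alpha` trades one measure off against the other.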
11 Feature Selection
Selected AV features (3 audio + 2 visual):
- Audio: speech energy, speaking rate, speech pitch
- Visual: person motion, head angle
12 Results (Single-modal vs. Multimodal)
13 Results (Single-stream vs. Multi-stream)
14 Overall Results
(pr = precision, rc = recall)

Method                        alpha = 0     alpha = 0.5   alpha = 1
                              pr    rc      pr    rc      pr    rc
Audio-only                    0.54  0.85    0.58  0.80    0.70  0.34
Audio-only (Feature fusion)   0.54  0.85    0.60  0.77    0.73  0.42
MS-HMM                        0.63  0.85    0.63  0.84    0.67  0.54
MS-HMM (Feature fusion)       0.59  0.84    0.77  0.60    0.75  0.55
15 Conclusions
- Audio modality is dominant
- Modality combination improves performance in some regions
- Multi-stream better than early integration
- Feature fusion at the group level is beneficial