Slide 1: Multimodal Integration for Meeting Group Action Segmentation and Recognition
M. Al-Hames, A. Dielmann, D. Gatica-Perez, S. Reiter, S. Renals, G. Rigoll, and D. Zhang
Institute for Human-Machine Communication, Technische Universität München
Centre for Speech Technology Research, University of Edinburgh
IDIAP Research Institute
Slide 2: Overview
1. Introduction: structuring of meetings
2. Audio-visual features
3. Models and experimental results
   - Segmentation using semantic features (BIC-like)
   - Multi-layer HMM
   - Multi-stream DBN
   - Noise robustness with a mixed-state DBN
4. Summary
Slide 3: Structuring of Meetings
Meetings can be modelled as a sequence of events (group or individual actions) drawn from a set A = {A_1, A_2, A_3, ..., A_N}.
[Timeline diagram: an example meeting as the action sequence A_2, A_1, A_2, A_6, A_3 over time.]
Slide 4: Structuring of Meetings
Meetings can be modelled as a sequence of events (group or individual actions) drawn from a set A = {A_1, A_2, A_3, ..., A_N}. Different sets represent different views of the same meeting.
[Diagram: parallel timelines of one meeting under four views - group interest (neutral / low / high), group task (information sharing / decision / information sharing), discussion phase (presentation / discussion / monologue / discussion), and person 1's actions (idle / writing / speaking / idle).]
Slide 5: Meeting Views
Group views:
- Discussion phase: discussion; monologue (with ID); note-taking; presentation; whiteboard; monologue (ID) + note-taking; presentation + note-taking; whiteboard + note-taking
- Interest level: high, neutral, low
- Task: brainstorming, decision making, information sharing
Individual views:
- Interest level: high, neutral, low
- Actions: speaking, writing, idle
Further sets could represent the current agenda item or the topic under discussion, but these may require higher-level semantic knowledge for automatic analysis.
Slide 6: Action Lexicon
Group actions: discussion; monologue (with ID); note-taking; presentation; whiteboard; monologue (ID) + note-taking; presentation + note-taking; whiteboard + note-taking
- Action lexicon I: single actions only (8 action classes)
- Action lexicon II: single actions plus combinations of parallel actions (14 action classes)
Slide 7: M4 Meeting Corpus
- Recorded at IDIAP
- Three fixed cameras
- Lapel microphones and a microphone array
- Four participants per meeting
- 59 videos, each five minutes long
Slide 8: System Overview
Cameras, microphone array, lapel microphones, ... -> multimodal feature extraction -> action segmentation -> action classification
Slide 9: Visual Features
Global motion, for each position:
- Centre of motion
- Changes in motion (dynamics)
- Mean absolute deviation
- Intensity of motion
Skin-colour blobs (GMM), for each person:
- Head: orientation and motion
- Hands: position, size, orientation, and motion
- Moving blobs from background subtraction
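A minimal sketch of how such global-motion features could be computed from grayscale frame differences; the motion threshold and the exact normalisations are assumptions of this sketch, not the implementation behind the slides:

import numpy as np

def global_motion_features(prev_frame, frame, thresh=15):
    # Binary motion mask from the absolute inter-frame difference.
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    moving = diff > thresh
    ys, xs = np.nonzero(moving)
    if xs.size == 0:
        return 0.0, (0.0, 0.0), (0.0, 0.0)
    intensity = moving.mean()              # intensity: fraction of moving pixels
    centre = (xs.mean(), ys.mean())        # centre of motion
    mad = (np.abs(xs - centre[0]).mean(),  # mean absolute deviation of the
           np.abs(ys - centre[1]).mean())  # moving pixels around the centre
    return intensity, centre, mad

The "changes in motion" (dynamics) would then be frame-to-frame differences of these quantities.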
Slide 10: Audio Features
Microphone array, for each position (the four seats, the whiteboard, and the projector screen):
- SRP-PHAT measure to estimate speech activity
- A speech/silence segmentation
Lapel microphones, for each person and each speech segment:
- Energy
- Pitch (SIFT algorithm)
- Speaking rate (a combination of estimators)
- MFC coefficients (MFCCs)
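For the lapel-channel features, a rough equivalent with the librosa library (the file name is hypothetical; librosa's pyin tracker stands in for the SIFT pitch estimator, and the speaking-rate estimators and SRP-PHAT localisation have no one-line library equivalent, so they are omitted):

import numpy as np
import librosa

y, sr = librosa.load("lapel_person1.wav", sr=16000)   # hypothetical recording

energy = librosa.feature.rms(y=y)[0]                  # frame-level energy
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # MFC coefficients
f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)  # pitch contour
mean_pitch = np.nanmean(f0[voiced])                   # mean pitch of voiced frames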
Slide 11: Lexical Features
For each person:
- Speech transcription (or the output of an ASR system)
- Gesture transcription (or the output of an automatic gesture recognizer)
Gesture inventory: writing, pointing, standing up, sitting down, nodding, shaking head
Slide 12: BIC-like Approach
- Strategy similar to the Bayesian information criterion (BIC)
- Segmentation and classification in one step
- Two windows of variable length are shifted over the time axis; their inner border moves from left to right
- Each window is classified: if the left and right windows receive different labels, the inner border is taken as the boundary of a meeting event
- If no boundary is detected, the whole window is enlarged
- Repeat until the right border reaches the end of the meeting
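The procedure can be sketched as follows; classify is a hypothetical stand-in for any of the classifiers compared on the next slide, and the window sizes and growth step are illustrative assumptions:

def segment_meeting(frames, classify, min_len=10, grow=5):
    # classify(window) -> action label for a window of feature frames.
    boundaries, left = [], 0
    right = min(2 * min_len, len(frames))
    while True:
        found = False
        # Shift the inner border from left to right, comparing the labels
        # of the two windows it separates.
        for mid in range(left + 1, right):
            if classify(frames[left:mid]) != classify(frames[mid:right]):
                boundaries.append(mid)     # differing labels -> event boundary
                left = mid
                right = min(mid + 2 * min_len, len(frames))
                found = True
                break
        if not found:
            if right >= len(frames):       # right border reached the meeting end
                break
            right = min(right + grow, len(frames))  # no boundary -> enlarge window
    return boundaries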
Slide 13: BIC-like Approach - Results
Experimental setup: 57 meetings using lexicon I.

Classifier   Insertion rate [%]   Deletion rate [%]   Accuracy [s]   Rec. error [%]
BN           14.7                 6.22                7.93           39.0
GMM          24.7                 2.33                10.8           41.4
MLP          8.61                 1.67                6.33           32.4
RBF          6.89                 3.00                5.66           31.6
SVM          17.7                 0.83                9.08           35.7
Slide 14: Multi-layer Hidden Markov Model
[Diagram: two-layer HMM - the features of persons 1-4 feed the individual I-HMMs 1-4, whose outputs, together with group features, feed the group HMM.]
- Individual action layer (I-HMM): speaking, writing, idle
- Group action layer (G-HMM): discussion, monologue, ...
- Each layer is trained independently
- Simultaneous segmentation and recognition
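The layering can be illustrated with the hmmlearn library (an assumption: the slides do not prescribe any library, and the feature dimensions and placeholder training data are invented for this sketch). One small HMM per individual action labels each person's segment, and the group layer then observes decoded actions instead of raw features:

import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
I_ACTIONS = ["speaking", "writing", "idle"]

# Individual layer: one person-independent I-HMM per individual action,
# trained on segments of that action (random data stands in for features).
i_hmms = {}
for k, action in enumerate(I_ACTIONS):
    X = rng.normal(loc=k, size=(200, 6))
    i_hmms[action] = GaussianHMM(n_components=2, n_iter=20).fit(X)

def individual_action(segment):
    # Classify one person's feature segment by the best-scoring I-HMM.
    return max(I_ACTIONS, key=lambda a: i_hmms[a].score(segment))

def group_observation(person_segments, group_features):
    # The G-HMM observes the four decoded individual actions (one-hot
    # coded) together with the group-level features.
    codes = [np.eye(len(I_ACTIONS))[I_ACTIONS.index(individual_action(s))]
             for s in person_segments]
    return np.concatenate(codes + [group_features])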
Slide 15: Multi-layer Hidden Markov Model - Advantages
- Smaller observation spaces -> stable even with limited training data
- I-HMMs are person-independent -> much more training data is available across different persons
- The G-HMM is less sensitive to small changes in the low-level features
- The two layers are trained independently -> different HMM combinations can be explored
Slide 16: Multi-layer Hidden Markov Model - Results
Experimental setup: 59 meetings using lexicon II.

Method                  AER (%)
Single-layer HMM
  Visual only           48.2
  Audio only            36.7
  Early integration     23.7
  Multi-stream          23.1
  Asynchronous          22.2
Multi-layer HMM
  Visual only           42.4
  Audio only            32.3
  Early integration     16.5
  Multi-stream          15.8
  Asynchronous          15.1
Slide 17: Multi-stream DBN Model
- Counter structure: improves state-duration modelling; the action counter C is incremented after each "action transition"
- Hierarchical approach with state-space decomposition:
  - Two feature-related sub-action nodes (S^1 and S^2)
  - Each sub-state corresponds to a cluster of feature vectors
  - Unsupervised training
- Independent processing of the modalities
[Diagram: DBN time slices with action nodes A_t, counter nodes C_t, enabling nodes E_t, and per-stream sub-action nodes S^1_t and S^2_t emitting the observations Y^1_t and Y^2_t.]
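Read generatively, one time slice of the model behaves roughly as below; every distribution here is a uniform or Gaussian placeholder, whereas the real model learns its conditional probabilities (the sub-states via unsupervised clustering):

import numpy as np

rng = np.random.default_rng(1)
N_ACTIONS, N_SUB, STAY = 8, 5, 0.95  # lexicon size, sub-states, self-transition prob.

def dbn_step(a_prev, c_prev):
    # Action layer with counter: C is incremented only on an action change.
    if rng.random() < STAY:
        a, c = a_prev, c_prev
    else:
        a = int(rng.integers(N_ACTIONS))  # "action transition" (a real CPD
        c = c_prev + 1                    # would exclude self-transitions here)
    # Each stream picks its own sub-action state given A and emits its
    # observation; the streams are conditionally independent given A.
    s1, s2 = int(rng.integers(N_SUB)), int(rng.integers(N_SUB))
    y1 = rng.normal(loc=float(s1), size=4)   # stream-1 observation Y^1
    y2 = rng.normal(loc=float(s2), size=4)   # stream-2 observation Y^2
    return a, c, (s1, y1), (s2, y2)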
Slide 18: Experimental Results
Experimental setup: cross-validation over 53 meetings using lexicon I.

AER (%) by feature setup and model:

Feature setup            HMM    Multi-stream   Mult. + Count.
UEDIN (Turn. + Pros.)    48.4   17.1           18.9
IDIAP (Audio + Video)    61.6   26.7           24.9
TUM (Beamf. + Video)     92.9   21.4           21.4

[The slide also lists sub-action state-space sizes 6x6 and 7x7 and the values 16.7 and 11.0; their exact cell assignment is not recoverable from the extracted text.]
Slide 19: Mixed-State DBN
- Couples a multi-stream HMM with a linear dynamical system (LDS)
- The HMM serves as the driving input of the LDS
- With the HMM as driving input, disturbances in the audio-visual streams can be compensated
- The coupled system can be described as a DBN
[Diagram: audio and visual HMM chains H^A_t and H^V_t with observations O^A_t and O^V_t, coupled to an LDS with driving input U^V_t, latent state X^V_t, and output O^V_t, unrolled over time.]
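The coupling can be illustrated with a toy system in which the discrete HMM state selects the driving input of the LDS; all matrices and noise levels below are arbitrary placeholders, not the paper's parameters:

import numpy as np

rng = np.random.default_rng(2)
d = 4                              # latent LDS dimension (illustrative)
A = 0.9 * np.eye(d)                # LDS dynamics matrix
C = np.eye(d)                      # observation matrix
U = rng.normal(size=(3, d))        # one driving input per discrete HMM state

def mixed_state_step(x, h, noise=0.1):
    # The HMM state h steers the continuous dynamics through its driving
    # input, which lets the coupled model ride out disturbed observations.
    x_next = A @ x + U[h] + noise * rng.normal(size=d)
    y = C @ x_next + noise * rng.normal(size=d)
    return x_next, y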
Slide 20: Mixed-State DBN - Results
Experimental setup: 59 meetings using lexicon I.

Recognition results (%):

                              Single-modal HMMs        Multi-modal
Condition                     Audio   Array   Video    HMM     DBN
I    Clear test data          83.1    83.6    67.2     85.2    87.8
II   Lapel disturbed          61.1    -       -        80.9    87.0
III  Array disturbed          -       75.4    -        83.5    87.0
IV   Video disturbed          -       -       40.9     82.6    83.5
V    All streams disturbed    -       -       -        80.0    81.7
Slide 21: Summary
- A joint effort of TUM, IDIAP, and UEDIN
- Meetings are modelled as sequences of individual and group actions
- A large number of audio-visual features
- Four different frameworks for modelling group actions:
  - An algorithm based on BIC
  - A multi-layer HMM
  - A multi-stream DBN
  - A mixed-state DBN
- The proposed frameworks perform well
- Future work:
  - Room for further improvement, both in the feature domain and in the model structures
  - Investigating combinations of the proposed models