Segmentation and Recognition of Meeting Events
M4 – Meeting Munich, 23 September 2004
Stephan Reiter
Overview
- Meeting Event Recognition (MER) by user modelling
- MER from the audio signal
- MER from a binary speech profile
- MER from transcriptions
- Late semantic fusion of three recognisers
- Integration of two feature streams via DBNs
- Segmentation based on higher semantic features
Meeting Event Recognition
Well-known meeting events:
- Discussion
- Monologue 1, 2, 3, 4
- Note-taking
- Presentation
- Whiteboard
- (Consensus)
- (Disagreement)
Data: scripted meetings
MER by User Modelling
Pipeline: Annotations → User-State → Meeting Event
MER by User Modelling (cont.)
Five states a participant can be in:
- sitting, silent
- sitting, silent, writing
- sitting, talking
- standing, talking
- standing, talking, writing
MER by User Modelling (cont.)
Two-step approach based on the annotations (see the sketch below):
1. From annotations to user-states (features: talking, writing, sitting, standing), using SVMs: %
2. From user-states to meeting events, using SVMs: %
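A minimal sketch of what this two-step SVM pipeline could look like, assuming scikit-learn; the binary feature layout, the window feature built from the four participants' predicted states, and the class labels are illustrative assumptions, not the original setup:

```python
import numpy as np
from sklearn.svm import SVC

# Step 1: binary annotation features [talking, writing, sitting, standing]
# mapped to one of the five user states (indices 0..4).
X_annot = np.array([[0, 0, 1, 0],   # sitting, silent
                    [0, 1, 1, 0],   # sitting, silent, writing
                    [1, 0, 1, 0],   # sitting, talking
                    [1, 0, 0, 1],   # standing, talking
                    [1, 1, 0, 1]])  # standing, talking, writing
y_state = np.arange(5)
state_svm = SVC(kernel="linear").fit(X_annot, y_state)

# Step 2: the four participants' user states in a time window
# mapped to a meeting event (0 = discussion, 1 = monologue).
X_window = np.array([[2, 2, 0, 0],   # two people talking
                     [3, 0, 0, 0]])  # one person standing and talking
y_event = np.array([0, 1])
event_svm = SVC(kernel="linear").fit(X_window, y_event)

states = state_svm.predict([[1, 0, 1, 0], [1, 0, 1, 0],
                            [0, 0, 1, 0], [0, 0, 1, 0]])
print(event_svm.predict([states]))   # -> [0], i.e. discussion
```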
MER from the Audio Signal
Using single lapel-microphone files: 12 MFCCs; continuous HMMs with 6 states: %
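A hedged sketch of this audio-based recogniser, assuming librosa for the 12 MFCCs and hmmlearn for the continuous-density HMMs; file names, sample rate, and the class list are placeholders:

```python
import numpy as np
import librosa
from hmmlearn import hmm

def mfcc_features(wav_path):
    """12 MFCCs per frame from a single lapel recording, shape (frames, 12)."""
    y, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12).T

# One 6-state continuous (Gaussian) HMM per meeting-event class.
training = {"discussion": ["disc_01.wav"], "monologue 1": ["mono1_01.wav"]}
models = {}
for event, paths in training.items():
    feats = [mfcc_features(p) for p in paths]
    X, lengths = np.concatenate(feats), [len(f) for f in feats]
    models[event] = hmm.GaussianHMM(n_components=6,
                                    covariance_type="diag").fit(X, lengths)

# Classify a test segment by maximum log-likelihood over all class models.
test = mfcc_features("test_segment.wav")
print(max(models, key=lambda e: models[e].score(test)))
```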
MER from the Binary Speech Profile
Using the speaker-turn detection results from IDIAP.
Discrete HMMs, codebook with 64 entries, 32 states: %
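A sketch of the discrete-HMM setup, assuming the 64-entry codebook comes from vector quantisation (here k-means over short windows of the binary profile, an assumption) and that hmmlearn's CategoricalHMM (MultinomialHMM in older releases) models the resulting symbol stream:

```python
import numpy as np
from sklearn.cluster import KMeans
from hmmlearn import hmm

# Toy binary speech profile: one row per 5 Hz frame, one column per speaker.
profile = np.random.randint(0, 2, size=(500, 4))

# 64-entry codebook; windowing the profile is an assumption, needed to get
# more than the 16 patterns a single 4-speaker binary frame can take.
windows = np.lib.stride_tricks.sliding_window_view(profile, 4, axis=0).reshape(-1, 16)
codebook = KMeans(n_clusters=64, n_init=10).fit(windows)
symbols = codebook.predict(windows).reshape(-1, 1)

# 32-state discrete HMM over the codebook symbols (one model per event class).
model = hmm.CategoricalHMM(n_components=32, n_iter=20).fit(symbols, [len(symbols)])
print(model.score(symbols))
```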
MER from Transcriptions
Using transcriptions from the media file server.
Discrete 1-state HMM; all monologues put together into one class: %
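A discrete HMM with a single state has no temporal structure left: it reduces to a unigram model of the emitted words. A minimal sketch under that reading, with placeholder transcripts and add-one smoothing as assumptions; the merged monologue classes share one model, as on the slide:

```python
import math
from collections import Counter

def train_unigram(tokens, vocab, alpha=1.0):
    """Emission distribution of a 1-state discrete HMM = smoothed word frequencies."""
    counts, total = Counter(tokens), len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def log_likelihood(tokens, model):
    return sum(math.log(model[w]) for w in tokens if w in model)

train = {"discussion": "yes but i think we should maybe".split(),
         "monologue":  "the system consists of three parts".split()}
vocab = {w for t in train.values() for w in t}
models = {c: train_unigram(t, vocab) for c, t in train.items()}

test = "i think yes".split()
print(max(models, key=lambda c: log_likelihood(test, models[c])))  # -> discussion
```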
Late Semantic Fusion
Joining the results of three recognisers (all 10 meeting events):
- MER from annotations: %
- MER from audio files: %
- MER from transcriptions: %
Simple rule-based fusion: if two or more results are equal, the fused result is that class; otherwise the result with the highest score is taken.
Recognition rate after fusion: %
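The fusion rule is simple enough to state directly in code; the (label, score) pairs below are placeholders for the three recognisers' outputs:

```python
from collections import Counter

def fuse(results):
    """results: list of (label, score) pairs, one per recogniser."""
    label, votes = Counter(l for l, _ in results).most_common(1)[0]
    if votes >= 2:
        return label                            # two or more recognisers agree
    return max(results, key=lambda r: r[1])[0]  # otherwise: highest score wins

print(fuse([("discussion", 0.6), ("discussion", 0.4), ("monologue 1", 0.9)]))
# -> "discussion" (two recognisers agree)
print(fuse([("discussion", 0.4), ("whiteboard", 0.8), ("monologue 1", 0.6)]))
# -> "whiteboard" (no agreement, highest score wins)
```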
MER using DBNs
Integration of two feature streams:
- binary speech profile (5 Hz)
- global motion features (12.5 Hz)
Recognition rate: %
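The DBN structure itself is not given on the slide; the sketch below only shows one plausible preprocessing step, aligning the 5 Hz profile to the 12.5 Hz motion stream by nearest-frame lookup so both streams can feed a joint observation vector:

```python
import numpy as np

speech = np.random.randint(0, 2, size=(50, 4))   # 10 s of binary profile at 5 Hz
motion = np.random.rand(125, 3)                  # 10 s of motion features at 12.5 Hz

t = np.arange(len(motion)) / 12.5                # time stamp of each motion frame
idx = np.minimum((t * 5).astype(int), len(speech) - 1)
combined = np.hstack([speech[idx], motion])      # (125, 7) joint observations
print(combined.shape)
```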
Segmentation Based on Higher Semantic Features
Benefits from speaker-turn detection and gesture recognition (81.76 %).
Segmentation via sliding windows.
Results:
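The slide leaves the results elided. As a closing illustration, a minimal sketch of the sliding-window segmentation mentioned above, assuming a per-window classifier such as one of the recognisers from the earlier slides; window length and hop are placeholders:

```python
import numpy as np

def segment(features, classify, win=50, hop=10):
    """Classify overlapping windows; a boundary is placed where the label changes."""
    segments, seg_start, prev = [], 0, None
    for start in range(0, len(features) - win + 1, hop):
        label = classify(features[start:start + win])
        if prev is not None and label != prev:
            segments.append((seg_start, start, prev))  # close previous segment
            seg_start = start
        prev = label
    segments.append((seg_start, len(features), prev))
    return segments

# Toy demo: the mean of a 1-D feature decides the window label.
feats = np.concatenate([np.zeros(200), np.ones(200)]).reshape(-1, 1)
print(segment(feats, lambda w: int(w.mean() > 0.5)))
```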