
1 Dynamic Bayesian Networks for Meeting Structuring Alfred Dielmann, Steve Renals (University of Sheffield)

2 Introduction
Goal: automatic analysis of meetings through the recognition of “multimodal events”
–Events that involve one or more communicative modalities, and represent the behaviour of a single participant or of the whole group
–Using objective measures and statistical methods

3 Multimodal Recognition
[Block diagram: Meeting Room signals (Audio, Video, …) → Signal Pre-processing → Feature Extraction → “Multimodal Events” Recognition → Information Retrieval, supported by Models, a Knowledge Database, and Specialised Recognition Systems (Speech, Video, Gestures)]

4 Group Actions
1. The machine observes group behaviours through objective measures (“external observer”)
2. The results of this analysis are “structured” into a sequence of symbols (“coding system”):
–Exhaustive (covering the entire meeting duration)
–Mutually exclusive (non-overlapping symbols)
We used the coding system adopted by the “IDIAP framework”, composed of 5 “meeting actions” derived from different communicative modalities:
Monologue / Dialogue / Note taking / Presentation / Presentation at the whiteboard

5 Corpus
60 meetings (30x2 set) collected in the “IDIAP Smart Meeting Room”:
–30 meetings are used for training
–23 meetings are used for testing
–7 meetings will be used for results validation
4 participants per meeting
5 hours of multi-channel audio-visual recordings:
–3 fixed cameras
–4 lapel microphones + an 8-element circular microphone array
Meeting agendas are generated “a priori” and strictly followed, in order to have an average of 5 “meeting actions” per meeting
Available for public distribution: http://mmm.idiap.ch/

6 Features (1)
Only features derived from audio are currently used:
[Diagram: Mic. Array → Beam-forming → Speaker Turns; Lapel Mic. → Prosody and Acoustic features (Rate of Speech, Pitch baseline, Energy); followed by dimension reduction]

7 Features (2)
Location-based “speech activities” L_i(t), obtained through SRP-PHAT beamforming (kindly provided by IDIAP)
Speaker turns features: L_i(t) * L_j(t-1) * L_k(t-2)
[Figure: per-location speech-activity values for locations L1–L4 over frames t-3 … t]
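As a rough Python sketch of how these product features could be assembled, frame by frame (the 4-location layout and activity values are illustrative, and `speaker_turn_features` is a hypothetical helper name, not code from the paper):

```python
from itertools import product

def speaker_turn_features(L, t):
    """Speaker-turn features at frame t: one product
    L_i(t) * L_j(t-1) * L_k(t-2) per location triple (i, j, k)."""
    locations = range(len(L))
    return [L[i][t] * L[j][t - 1] * L[k][t - 2]
            for i, j, k in product(locations, repeat=3)]

# Illustrative location-based speech activities for 4 locations (rows)
# over 4 frames (columns) -- not real SRP-PHAT outputs.
L = [[0.1, 0.4, 0.6, 0.3],
     [0.3, 0.5, 0.5, 0.3],
     [0.2, 0.4, 0.7, 0.2],
     [0.2, 0.3, 0.7, 0.1]]

feats = speaker_turn_features(L, t=3)
print(len(feats))  # 4 locations -> 4^3 = 64 products per frame
```

With 4 locations this already yields a 64-dimensional vector per frame, which helps explain the dimension-reduction step mentioned on the previous slide.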

8 Features (3)
[Diagram: Lapel Mic. and Mic. Array (beam-forming) feed a pitch extractor, filters (*) and MRATE, producing RMS Energy, Pitch and Rate of Speech; the features are masked using “speech activity”]
(*) Histogram, median and interpolating filter

9 Features (4)
We’d like to integrate other features… everything that could be automatically extracted from a recorded meeting:
–Video (image processing): participants’ motion features, gestures and actions, blob positions, …
–Audio: ASR transcripts
–Other: …

10 Dynamic Bayesian Networks (1)
Bayesian Networks are a convenient graphical way to describe statistical (in)dependencies among random variables:
–A Directed Acyclic Graph
–Conditional Probability Tables (CPTs)
Given a set of known evidence nodes, the probability of the other nodes can be computed through inference
Given a set of examples, EM learning algorithms (e.g. Baum-Welch) can be used to train the CPTs
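A minimal concrete example of the evidence/inference idea, on a toy two-node network A → B with made-up CPTs (not one of the meeting models):

```python
# Two-node Bayesian network A -> B, both variables binary.
P_A = {0: 0.7, 1: 0.3}                # prior CPT for A
P_B_given_A = {0: {0: 0.9, 1: 0.1},   # CPT rows: P(B | A = a)
               1: {0: 0.2, 1: 0.8}}

def posterior_A_given_B(b):
    """Inference with evidence B = b: P(A | B = b) by Bayes' rule."""
    joint = {a: P_A[a] * P_B_given_A[a][b] for a in P_A}
    z = sum(joint.values())           # P(B = b), marginalising out A
    return {a: p / z for a, p in joint.items()}

post = posterior_A_given_B(1)
print(post[1])  # observing B = 1 raises the belief in A = 1 above its 0.3 prior
```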

11 Dynamic Bayesian Networks (2)
DBNs are an extension of BNs with random variables that evolve in time:
–Instantiating a static BN for each temporal slice t
–Making the temporal dependencies between variables explicit
[Figure: the same BN replicated at slices t=0, t=t+1, …, t=T, with arcs linking consecutive slices]

12 Dynamic Bayesian Networks (3)
Hidden Markov Models, Kalman filter models and other state-space models are just a special case of DBNs
[Figure: representation of an HMM as an instance of a DBN – hidden nodes Q_t with observations Y_t, unrolled over t=0 … t+1]
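The slice-by-slice view pays off directly in inference. Here is a sketch of the standard HMM forward recursion, written the DBN way: one transition CPT P(Q_t | Q_{t-1}) and one emission CPT P(Y_t | Q_t), shared across all slices (every number below is illustrative):

```python
import numpy as np

A = np.array([[0.9, 0.1],     # transition CPT P(Q_t | Q_{t-1})
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],     # emission CPT P(Y_t | Q_t)
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])     # initial distribution P(Q_0)

def forward(obs):
    """Likelihood P(Y_0..Y_T = obs), summing over hidden paths slice by slice."""
    alpha = pi * B[:, obs[0]]
    for y in obs[1:]:
        alpha = (alpha @ A) * B[:, y]
    return alpha.sum()

print(forward([0, 1, 1]))   # -> 0.08775 with the CPTs above
```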

13 Dynamic Bayesian Networks (4)
Representing HMMs in terms of DBNs makes it easy to create variations on the basic theme:
–Factorial HMMs
–Coupled HMMs
[Figure: unrolled DBN structures for Factorial HMMs (several hidden chains Q_t, X_t, Z_t sharing one observation Y_t) and Coupled HMMs (parallel chains with cross-links and per-chain observations)]

14 Dynamic Bayesian Networks (5)
The use of BNs and DBNs presents some advantages:
–An intuitive way to represent models graphically, with a standard notation
–A unified theory for a huge number of models
–Connecting different models in a structured view, making it easier to study new models
–A unified set of instruments (e.g. GMTK) to work with them (training, inference, decoding)
–Maximises resource reuse and minimises “setup” time

15 First Model (1)
“Early integration” of features, modelled through a 2-level Hidden Markov Model
[Figure: unrolled DBN with hidden meeting actions A_t, hidden sub-states S_t and observable feature vectors Y_t, for t=0 … T]

16 First Model (2)
The main idea behind this model is to decompose each “meeting action” into a sequence of “sub-actions” or sub-states (note that different actions are free to share the same sub-state)
The structure is composed of two ergodic HMM chains:
–The top chain links sub-states {S_t} with “actions” {A_t}
–The lower one maps the feature vectors {Y_t} directly onto a sub-state {S_t}

17 First Model (3)
The sequence of actions {A_t} is known a priori
The sequence {S_t} is determined during the training process, and the meaning of each sub-state is unknown
The cardinality of {S_t} is one of the model’s parameters
The mapping of observable features {Y_t} onto hidden sub-states {S_t} is obtained through Gaussian Mixture Models
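To make the two-level structure concrete, here is a toy joint log-likelihood for one fully specified (action, sub-state, feature) trajectory. The exact factorisation P(A_t|A_{t-1}) P(S_t|S_{t-1}, A_t) P(Y_t|S_t) and the single-Gaussian emission (the model actually uses GMMs) are simplifying assumptions, and every number is made up:

```python
import math

P_A0 = [0.5, 0.5]
P_A = [[0.9, 0.1], [0.1, 0.9]]            # P(A_t | A_{t-1})
P_S0 = [[0.6, 0.4], [0.3, 0.7]]           # P(S_0 | A_0)
P_S = {0: [[0.8, 0.2], [0.4, 0.6]],       # P(S_t | S_{t-1}, A_t = 0)
       1: [[0.5, 0.5], [0.2, 0.8]]}       # P(S_t | S_{t-1}, A_t = 1)
MU, SIGMA = [0.0, 3.0], 1.0               # Gaussian emission mean per sub-state

def log_gauss(y, mu, sigma):
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (y - mu) ** 2 / (2 * sigma ** 2))

def log_joint(actions, substates, obs):
    """Log of P(A_0) P(S_0|A_0) P(Y_0|S_0) * prod_t P(A_t|A_{t-1}) ..."""
    lp = math.log(P_A0[actions[0]]) + math.log(P_S0[actions[0]][substates[0]])
    lp += log_gauss(obs[0], MU[substates[0]], SIGMA)
    for t in range(1, len(obs)):
        lp += math.log(P_A[actions[t - 1]][actions[t]])
        lp += math.log(P_S[actions[t]][substates[t - 1]][substates[t]])
        lp += log_gauss(obs[t], MU[substates[t]], SIGMA)
    return lp
```

During training only {A_t} and {Y_t} are observed, so EM would sum over the hidden {S_t}; this sketch just scores one path to show how the CPT layers combine.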

18 Second Model (1)
Multistream processing of features through two parallel and independent Hidden Markov Models
[Figure: two unrolled HMM chains – hidden sub-states S_t^1 over prosodic features Y_t^1, and S_t^2 over speaker turns features Y_t^2 – both linked to the meeting actions A_t, with an action counter C_t and an enable-transitions node E_t]

19 Second Model (2)
Each feature group (or modality) Y^m is mapped onto an independent HMM chain; therefore every group is evaluated independently and mapped into a hidden sub-state {S_t^n}
As in the previous model, there is another HMM layer (A), which represents “meeting actions”
The whole sub-state {S_t^1 x S_t^2 x … S_t^n} is mapped into an action {A_t}

20 Second Model (3)
It is a variable-duration HMM with an explicit enable node:
–A_t represents “meeting actions” as usual
–C_t counts “meeting actions”
–E_t is a binary indicator variable that enables state changes inside the node A_t
Example trace:
A_t: … 8 8 5 5 5 …
C_t: … 1 1 2 2 2 …
E_t: … 0 1 0 0 0 …

21 Second Model (4)
Training: when {A_t} changes, {C_t} is incremented and {E_t} is set high for a single frame (A_t, E_t and C_t are part of the training dataset)
A_t: … 8 8 5 5 5 …
C_t: … 1 1 2 2 2 …
E_t: … 0 1 0 0 0 …
Decoding: {A_t} is free to change only if {E_t} is high, and then according to the {C_t} state
The behaviours of {E_t} and {C_t} learned during the training phase are then exploited during decoding
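One reading of the trace above, as a sketch of how {C_t} and {E_t} could be rebuilt from the known action labels when preparing the training dataset (the one-frame lead of E_t before each change is inferred from the table, and the helper name is ours):

```python
def counter_and_enable(actions):
    """Derive counter C_t and enable flag E_t from the action labels A_t:
    C_t is incremented when the action changes, and E_t goes high for the
    single frame preceding each change."""
    C, E, count = [], [], 1
    for t, a in enumerate(actions):
        if t > 0 and a != actions[t - 1]:
            count += 1                 # a new meeting action has started
        C.append(count)
        E.append(1 if t + 1 < len(actions) and actions[t + 1] != a else 0)
    return C, E

C, E = counter_and_enable([8, 8, 5, 5, 5])
print(C, E)  # [1, 1, 2, 2, 2] [0, 1, 0, 0, 0], matching the trace
```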

22 Results
Results obtained with the two models previously described, using only audio-derived features:

              Corr.  Sub.  Del.  Ins.  AER
First Model    93.2   2.3   4.5   4.6  11.4
Second Model   94.7   1.5   3.8   0.8   6.1

AER (Action Error Rate) is equivalent to the Word Error Rate measure used to evaluate speech recogniser performance
The second model effectively reduces both the number of substitutions and the number of insertions
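AER can be computed exactly like WER: align the recognised action sequence against the reference by minimum edit distance, then divide the error count by the reference length. A sketch (the single-letter codes are just stand-ins for the five meeting actions):

```python
def action_error_rate(reference, hypothesis):
    """(Sub + Del + Ins) / len(reference), via Levenshtein alignment."""
    R, H = len(reference), len(hypothesis)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i                          # all deletions
    for j in range(H + 1):
        d[0][j] = j                          # all insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[R][H] / R

aer = action_error_rate(list("MDNPW"), list("MDPW"))
print(aer)  # one deletion against 5 reference actions -> AER 0.2
```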

23 Conclusions
A new approach has been proposed
The results achieved seem promising, and in the future we’d like to:
–Validate them with the remaining part of the test set (or possibly an independent test set)
–Integrate other features: video, ASR transcripts, Xtalk, …
–Try new experiments with the existing models
–Develop new DBN-based models


25 Multimodal Recognition (2)
Knowledge sources: raw audio, raw video, acoustic features, visual features, automatic speech recognition, video understanding, gesture recognition, eye-gaze tracking, emotion detection, …
Approaches:
–Fusion of different recognisers at an early stage, generating hybrid recognisers (like AVSR)
–Integration of recogniser outputs through a “high-level” recogniser
–A standalone high-level recogniser operating on low-level raw data

