Dynamic Bayesian Networks for Meeting Structuring Alfred Dielmann, Steve Renals (University of Sheffield)

Introduction
Automatic analysis of meetings through the recognition of "multimodal events"
GOAL: recognise events which involve one or more communicative modalities and represent the behaviour of a single participant or of the whole group
Using objective measures and statistical methods

Multimodal Recognition
[Block diagram: audio, video and other signals from the meeting room go through signal pre-processing and feature extraction; specialised recognition systems (speech, video, gestures), supported by models and a knowledge database, perform "multimodal event" recognition whose output feeds information retrieval]

Group Actions
1. The machine observes group behaviours through objective measures ("external observer")
2. The results of this analysis are "structured" into a sequence of symbols ("coding system"):
– Exhaustive (covering the entire meeting duration)
– Mutually exclusive (non-overlapping symbols)
We used the coding system adopted by the "IDIAP framework", composed of 5 "meeting actions" derived from different communicative modalities:
Monologue / Dialogue / Note taking / Presentation / Presentation at the whiteboard

Corpus
60 meetings (30x2 set) collected in the "IDIAP Smart Meeting Room":
– 30 meetings are used for training
– 23 meetings are used for testing
– 7 meetings will be used for result validation
4 participants per meeting
5 hours of multi-channel audio-visual recordings:
– 3 fixed cameras
– 4 lapel microphones + an 8-element circular microphone array
Meeting agendas are generated "a priori" and strictly followed, so that each meeting contains an average of 5 "meeting actions"
Available for public distribution

Features (1)
Only features derived from audio are currently used:
– Microphone array: beam-forming, leading to speaker-turn features
– Lapel microphones: prosodic and acoustic features (rate of speech, pitch baseline, energy), followed by dimension reduction

Features (2): Speaker Turns
Location-based "speech activities" L_1 ... L_4 (SRP-PHAT beamforming), kindly provided by IDIAP
Speaker-turn features: L_i(t) * L_j(t-1) * L_k(t-2)
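The speaker-turn feature above is just a product over binary speech-activity streams at different lags. A minimal sketch of that construction, assuming a binary activity matrix with one row per array location (all names and the toy data are illustrative, not taken from the original system):

```python
import numpy as np

def speaker_turn_features(activity, lag=2):
    """Trigram speaker-turn features L_i(t) * L_j(t-1) * L_k(t-2).

    activity : (n_locations, n_frames) binary array, activity[l, t] = 1 if
               speech is detected at location l in frame t (e.g. SRP-PHAT).
    Returns an (n_locations**3, n_frames - lag) array; each row corresponds
    to one (i, j, k) triple of locations.
    """
    n_loc, n_frames = activity.shape
    feats = np.zeros((n_loc ** 3, n_frames - lag))
    for t in range(lag, n_frames):
        outer = np.einsum('i,j,k->ijk',
                          activity[:, t], activity[:, t - 1], activity[:, t - 2])
        feats[:, t - lag] = outer.ravel()
    return feats

# Toy example: 4 locations, 6 frames of speech activity
L = np.array([[1, 1, 0, 0, 0, 0],
              [0, 0, 1, 1, 0, 0],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 0, 0, 0]])
print(speaker_turn_features(L).shape)   # (64, 4)
```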

Features (3): Prosody and Acoustics
From the lapel microphones: RMS energy, pitch (via a pitch extractor) and rate of speech (MRATE)
The feature tracks are post-processed with filters (*) and masked using the "speech activity" obtained from microphone-array beam-forming
(*) Histogram, median and interpolating filters
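A rough sketch of the masking step described above, assuming the prosodic tracks and the beamformer's frame-level speech-activity decision are already aligned; the function name and the use of a plain median filter as the smoothing stage are assumptions for illustration:

```python
import numpy as np
from scipy.signal import medfilt

def mask_prosodic_features(prosody, speech_activity, kernel=5):
    """Zero out prosodic features in frames with no detected speech, then
    smooth each feature track with a median filter.

    prosody         : (n_features, n_frames) array (e.g. energy, pitch, MRATE)
    speech_activity : (n_frames,) binary array from the beamformer
    """
    masked = prosody * speech_activity[np.newaxis, :]
    return np.vstack([medfilt(row, kernel_size=kernel) for row in masked])
```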

Features (4)
We'd like to integrate other features:
– Video: image processing giving participants' motion features, blob positions, gestures and actions, ...
– Audio: ASR transcripts, ...
– Other: everything that can be automatically extracted from a recorded meeting

Dynamic Bayesian Networks (1)
Bayesian Networks are a convenient graphical way to describe statistical (in)dependencies among random variables: a Directed Acyclic Graph over the variables plus Conditional Probability Tables (CPTs)
Given a set of known evidence nodes, the probability of the other nodes can be computed through inference
Given a set of examples, EM learning algorithms (e.g. Baum-Welch) can be used to train the CPTs
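As a toy illustration of CPTs and inference (not the meeting model itself), the sketch below encodes a two-node network A -> S and computes a posterior by enumeration; the node names and numbers are made up:

```python
import numpy as np

p_A = np.array([0.6, 0.4])                 # prior CPT P(A)
p_S_given_A = np.array([[0.7, 0.2, 0.1],   # P(S | A=0)
                        [0.1, 0.3, 0.6]])  # P(S | A=1)

def posterior_A_given_S(s):
    """P(A | S=s) by Bayes' rule: joint over A, then normalise."""
    joint = p_A * p_S_given_A[:, s]
    return joint / joint.sum()

print(posterior_A_given_S(2))   # evidence S=2 favours A=1
```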

Dynamic Bayesian Networks (2)
DBNs extend BNs to random variables that evolve in time, by:
– Instancing a static BN for each temporal slice t = 0, 1, ..., T
– Making the temporal dependences between variables in adjacent slices explicit
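A minimal sketch of the unrolling idea, assuming a slice containing a single discrete variable X: the same slice-level CPT P(X_t | X_{t-1}) is re-used at every time step, here to sample a trajectory. All numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def unroll_and_sample(p0, trans, T):
    """Sample a length-T trajectory from a DBN obtained by instancing the
    same one-node slice at every step and linking slices via P(X_t | X_{t-1})."""
    x = [rng.choice(len(p0), p=p0)]
    for _ in range(1, T):
        x.append(rng.choice(len(p0), p=trans[x[-1]]))
    return x

p0 = np.array([0.5, 0.5])          # P(X_0)
trans = np.array([[0.9, 0.1],      # P(X_t | X_{t-1})
                  [0.3, 0.7]])
print(unroll_and_sample(p0, trans, 10))
```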

Dynamic Bayesian Networks (3)
Hidden Markov Models, Kalman filter models and other state-space models are just special cases of DBNs
[Diagram: an HMM represented as an instance of a DBN, with a hidden state Q_t emitting an observation Y_t in each time slice]
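For a discrete-observation HMM viewed as a DBN with one hidden chain Q and one observed node Y, filtering reduces to the standard forward recursion. A self-contained sketch, not tied to any particular toolkit:

```python
import numpy as np

def forward(pi, A, B, obs):
    """HMM forward pass: returns P(o_1 ... o_T).

    pi  : (n_states,) initial distribution P(Q_0)
    A   : (n_states, n_states) transition matrix P(Q_t | Q_{t-1})
    B   : (n_states, n_symbols) emission matrix P(Y_t | Q_t)
    obs : sequence of observed symbol indices
    """
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # predict, then weight by the emission
    return alpha.sum()
```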

Dynamic Bayesian Networks (4)
Representing HMMs in terms of DBNs makes it easy to create variations on the basic theme:
– Factorial HMMs: several hidden chains (e.g. X_t, Z_t, Q_t) jointly generate a single observation Y_t
– Coupled HMMs: parallel chains (Q_t, Z_t), each with its own observations (Y_t, V_t), linked by cross-chain temporal dependences
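One way to see what a factorial HMM buys (and costs): the composite hidden state is the cross product of the individual chains, so flattening it back into an ordinary HMM gives a transition matrix that is the Kronecker product of the per-chain transition matrices, over a state space that grows multiplicatively. A small numeric sketch with made-up matrices:

```python
import numpy as np

# Two independent hidden chains with their own transition matrices
A_q = np.array([[0.9, 0.1],
                [0.2, 0.8]])
A_z = np.array([[0.7, 0.3],
                [0.4, 0.6]])

# Transition matrix over the composite state (Q, Z)
A_joint = np.kron(A_q, A_z)    # shape (4, 4)
print(A_joint.sum(axis=1))     # each row still sums to 1
```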

Dynamic Bayesian Networks (5)
Using BNs and DBNs has several advantages:
– An intuitive way to represent models graphically, with a standard notation
– A unified theory covering a huge number of models
– Different models are connected in a structured view, making it easier to study new ones
– A unified set of tools (e.g. GMTK) for training, inference and decoding
– Maximises resource reuse and minimises "setup" time

First Model (1)
"Early integration" of the features, modelled through a 2-level Hidden Markov Model
[Diagram: hidden meeting actions A_t, hidden sub-states S_t and observable feature vectors Y_t, for t = 0, ..., T]

First Model (2)
The main idea behind this model is to decompose each "meeting action" into a sequence of "sub-actions" or sub-states (note that different actions are free to share the same sub-state)
The structure is composed of two ergodic HMM chains:
– The top chain links sub-states {S_t} with "actions" {A_t}
– The lower one maps the feature vectors {Y_t} directly onto a sub-state {S_t}

First Model (3)
The sequence of actions {A_t} is known a priori
The sequence {S_t} is determined during the training process, and the meaning of each sub-state is unknown
The cardinality of {S_t} is one of the model's parameters
The mapping of observable features {Y_t} onto hidden sub-states {S_t} is obtained through Gaussian Mixture Models
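A sketch of the emission side of this model: the log-likelihood of a feature vector under one sub-state's Gaussian Mixture Model, assuming diagonal covariances (a common choice, though not stated on the slide); all parameter names are illustrative:

```python
import numpy as np

def gmm_log_likelihood(y, weights, means, variances):
    """Log p(y | sub-state) under a diagonal-covariance Gaussian mixture.

    y         : (d,) feature vector
    weights   : (m,) mixture weights summing to 1
    means     : (m, d) component means
    variances : (m, d) per-dimension variances
    """
    diff = y - means
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2 * np.pi * variances)
                               + diff ** 2 / variances, axis=1))
    m = log_comp.max()                       # log-sum-exp for stability
    return m + np.log(np.exp(log_comp - m).sum())
```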

Second Model (1)
Multistream processing of the features through two parallel and independent Hidden Markov Models
[Diagram: meeting actions A_t on top; two hidden sub-state chains, S_t^1 for the prosodic features Y_t^1 and S_t^2 for the speaker-turn features Y_t^2; plus an action counter C_t and an enable-transitions node E_t, for t = 0, ..., T]

Second Model (2)
Each feature group (or modality) Y^m is mapped onto an independent HMM chain, so every group is evaluated independently and mapped onto its own hidden sub-state {S_t^n}
As in the previous model, there is another HMM layer (A) which represents "meeting actions"
The composite sub-state {S_t^1 x S_t^2 x ... x S_t^n} is mapped onto an action {A_t}

Second Model (3)
It is a variable-duration HMM with an explicit enable node:
– A_t represents "meeting actions" as usual
– C_t counts the "meeting actions"
– E_t is a binary indicator variable that enables state changes inside the node A_t
Example:
A_t: ... 8 8 5 5 5 ...
E_t: ... 0 1 0 0 0 ...
C_t: ... 1 1 2 2 2 ...

Second Model (4)
Training: when {A_t} changes, {C_t} is incremented and {E_t} is set on for a single frame (A_t, E_t and C_t are part of the training dataset)
A_t: ... 8 8 5 5 5 ...
E_t: ... 0 1 0 0 0 ...
C_t: ... 1 1 2 2 2 ...
Decoding: {A_t} is free to change only if {E_t} is high, and then according to the {C_t} state
The behaviours of {E_t} and {C_t} learned during the training phase are thus exploited during decoding
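The deterministic bookkeeping described above can be reproduced by deriving C_t and E_t from a known action sequence A_t, which is how the supervision signals in the table can be generated; the helper name is hypothetical:

```python
import numpy as np

def counter_and_enable(actions):
    """Derive the counter C_t and enable E_t sequences from a known action
    sequence A_t: E_t goes high for the single frame just before A changes,
    and C_t indexes the action segment seen so far."""
    actions = np.asarray(actions)
    changes = actions[1:] != actions[:-1]              # True where A is about to change
    E = np.concatenate([changes.astype(int), [0]])     # high one frame before each change
    C = 1 + np.concatenate([[0], np.cumsum(changes)])  # segment counter, starting at 1
    return C, E

A = [8, 8, 5, 5, 5]
print(counter_and_enable(A))   # (array([1, 1, 2, 2, 2]), array([0, 1, 0, 0, 0]))
```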

Results
Results obtained with the two models previously described, using only audio-derived features:
              Corr.   Sub.   Del.   Ins.   AER
First Model
Second Model
AER (Action Error Rate) is equivalent to the Word Error Rate measure used to evaluate speech recogniser performance
The second model effectively reduces both the number of substitutions and the number of insertions
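AER can be computed exactly like WER: a minimum edit-distance alignment between the reference and the recognised action sequences, with AER = (Substitutions + Deletions + Insertions) / number of reference actions. A minimal sketch:

```python
def action_error_rate(reference, hypothesis):
    """Action Error Rate via minimum edit distance (unit costs)."""
    n, m = len(reference), len(hypothesis)
    # d[i][j] = cheapest alignment of reference[:i] with hypothesis[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m] / n

# One substitution among five reference actions -> AER = 0.2
print(action_error_rate(list("MDNPW"), list("MDPPW")))
```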

Conclusions
A new approach has been proposed
The results achieved so far seem promising, and in the future we'd like to:
– Validate them on the remaining part of the test-set (or possibly an independent test-set)
– Integrate other features: video, ASR transcripts, Xtalk, ...
– Try new experiments with the existing models
– Develop new DBN-based models

Multimodal Recognition (2)
Knowledge sources: raw audio, raw video, acoustic features, visual features, automatic speech recognition, video understanding, gesture recognition, eye-gaze tracking, emotion detection, ...
Approaches:
– Fusion of the different recognisers at an early stage, generating hybrid recognisers (like AVSR)
– Integration of the recognisers' outputs through a "high level" recogniser
– A standalone high-level recogniser operating directly on the low-level raw data