Exploiting lexical information for Meeting Structuring
Alfred Dielmann, Steve Renals (University of Edinburgh)



Meeting Structuring (1)
Goal: recognise events which involve one or more communicative modalities: Monologue / Dialogue / Note taking / Presentation / Presentation at the whiteboard
Working environment: the “IDIAP framework”: five-minute meetings with 4 participants
4 audio-derived features:
–Speaker turns (derived from microphone array localisation)
–Prosodic features: RMS energy, F0, rate of speech
But we’d like to integrate other features…

Meeting Structuring (2)
We are working with Dynamic Bayesian Network based models, and at the previous meeting we proposed two models.
[Diagram: single-stream DBN with action nodes {A_t}, sub-action nodes {S_t} with observations {Y_t}, and a counter structure (nodes C, E); the counter reduces the number of insertions]
[Table: Corr / Sub / Del / Ins / AER, with and without counter; figures not recovered]
The first model is characterised by:
–Decomposition of actions {A_t} into “sub-actions” {S_t}
–Early feature integration

Meeting Structuring (3)
The second model extends the previous one through multi-stream processing (avoiding early integration).
[Diagram: multi-stream DBN with action nodes {A_t}, two parallel sub-action chains {S_t^1}, {S_t^2} with observation streams {Y_t^1}, {Y_t^2}, and a counter structure]
[Table: Corr / Sub / Del / Ins / AER, with and without counter; figures not recovered]
–Different feature groups are processed independently
–Parallel independent HMM chains are each responsible for only one part of the feature set
–The cardinalities of {S_t^n} are part of the model
–The hidden sub-states {S_t^n} are a result of the training process

Further developments
The adopted feature set has to be extended!
–Integration of gestures / body movements (VIDEO)
–Integration of lexical information (ASR): work in progress!
The problem presents some analogies with “Topic Detection and Tracking”:
–Correlate the sequence of transcribed words with the sequence of “meeting actions”
–Discover homogeneous partitions in a transcription, according to the communicative phase of the meeting (Dialogue, Monologue, …)

TDT approaches
Lexical cohesion based, like “TextTiling”:
–Given two adjacent windows, a lexical cohesion function is evaluated at the transition between the two windows, in order to find topically coherent passages and highlight candidate topic boundaries
Feature based, reducing topic segmentation to a statistical classification problem over:
–Lexical features: perplexity, mutual information, other information content measures
–Cue phrases (words frequent at topic changes)
–Short/long range language models: n-gram, binomial distribution
–Prosodic features: pauses, F0, cross-talk, speaker changes, …
Mixed approach (lexical cohesion + feature classification)
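The lexical cohesion idea behind TextTiling can be sketched as follows: compare the term-frequency vectors of two adjacent word windows, and flag transitions where cohesion drops. This is a minimal illustrative sketch (window size and threshold are arbitrary choices, not values from the slides), not the actual TextTiling implementation:

```python
from collections import Counter
import math

def cohesion(left_words, right_words):
    """Cosine similarity between the term-frequency vectors of two
    adjacent windows; low values suggest a topic boundary."""
    l, r = Counter(left_words), Counter(right_words)
    num = sum(l[w] * r[w] for w in set(l) & set(r))
    den = math.sqrt(sum(c * c for c in l.values())) * \
          math.sqrt(sum(c * c for c in r.values()))
    return num / den if den else 0.0

def boundary_candidates(words, window=20, threshold=0.1):
    """Slide a pair of adjacent windows over the transcript and flag
    transitions whose cohesion falls below the threshold."""
    return [i for i in range(window, len(words) - window)
            if cohesion(words[i - window:i], words[i:i + window]) < threshold]
```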

Feature based approach
A lexical feature based approach could be:
–Easily transferred from the TDT problem to the meeting segmentation one
–Quickly integrated with the proposed DBN models
Interesting starting points for further experiments:
–Investigate a lexical function that discriminates between different communicative phases, such as Mutual Information: “the amount of information that one random variable X contains about another variable Y”
–Look for a list of cue phrases that highlight “Meeting Action” boundaries
–……
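The quoted definition of mutual information can be estimated directly from co-occurrence counts. A small sketch (a plug-in estimator from observed (x, y) pairs; not the estimator the authors necessarily used):

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(X;Y) in bits from (x, y) observations:
    I(X;Y) = sum over (x,y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) )."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        # p(x,y) / (p(x) p(y)) written with counts: c*n / (count_x * count_y)
        mi += pxy * math.log2(c * n / (px[x] * py[y]))
    return mi
```

Here X could be a word identity and Y the meeting-action class, so the score measures how much a word tells us about the communicative phase.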

Problems (1)
Challenging conditions: spoken, multiparty dialogues, in an unrestricted domain
–We must cope with speech recognition errors!
Insufficient training/testing data
–Especially for Agreement/Disagreement (4% of the corpus)
–Only 30 meetings are fully transcribed (~25k words)
These first experiments are based on hand-labelled transcriptions, and attempt to discriminate only between Monologue and Dialogue

Lexical classification (1)
Each word of the testing corpus is compared with every “Meeting Action” lexical model, and then classified by maximising the Mutual Information.
[Diagram: ASR transcript → Monologue model / Dialogue model / … → MAX → output filter]
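The per-word classification step amounts to an argmax over the class lexical models. A toy sketch, assuming each class model is just a word-to-score table (e.g. word/class mutual information; the dictionary layout and the scores below are illustrative, not from the slides):

```python
def classify_word(word, class_scores):
    """Assign a word to the class whose lexical model scores it highest.
    class_scores: {class_name: {word: score}} (hypothetical layout)."""
    return max(class_scores,
               key=lambda c: class_scores[c].get(word, float("-inf")))

# Illustrative scores only:
scores = {"Monologue": {"slide": 2.1, "uh": 0.2},
          "Dialogue":  {"slide": 0.3, "uh": 1.7}}
labels = [classify_word(w, scores) for w in ["slide", "uh"]]
# labels == ["Monologue", "Dialogue"]
```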

Lexical classification (2)
Each recognised word is classified as a “Monologue word” or a “Dialogue word”.
This stream of symbols is then filtered (de-noised):
–Considering a moving window
–Estimating the temporal density for each class (Monologue, Dialogue)
–The class with the higher symbol density (frequency) is the winning one
We expect that during a “Dialogue act”, the temporal density of “Monologue words” is lower than that of “Dialogue words”
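The de-noising step described above is essentially a sliding-window majority filter over the label stream. A minimal sketch (the window size is an arbitrary illustration, not a value from the slides):

```python
from collections import Counter

def denoise(labels, window=5):
    """Smooth a per-word label stream: each position takes the class with
    the highest density (frequency) inside a centred moving window."""
    half = window // 2
    out = []
    for i in range(len(labels)):
        segment = labels[max(0, i - half): i + half + 1]
        out.append(Counter(segment).most_common(1)[0][0])
    return out
```

For example, an isolated “Dialogue word” inside a run of “Monologue words” is overruled by its neighbours.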

Initial results
We evaluated(*) the proposed system using two different classification criteria.
Correct classification percentage:
–Tri-gram probability: 63.8 %
–Shannon M.I.: 66.8 %
The achieved results are very close, but Mutual Information seems to be more efficient than the simple 3-gram language model.
(*) Using 13 meetings to construct the Monologue and Dialogue lexical models, and the remaining 17 to evaluate performance
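For comparison, the trigram-probability criterion scores a word sequence under each class's language model and picks the class with the higher score. A toy sketch with add-alpha smoothing (a simplified stand-in; the slides do not specify the smoothing scheme actually used):

```python
import math
from collections import Counter

def ngram_counts(words):
    """Trigram and bigram counts from a training transcript (toy helper)."""
    tri = Counter(zip(words, words[1:], words[2:]))
    bi = Counter(zip(words, words[1:]))
    return tri, bi

def trigram_logprob(words, tri, bi, vocab_size, alpha=1.0):
    """Add-alpha smoothed trigram log-probability of a word sequence.
    One such model per class (Monologue, Dialogue); the class whose
    model gives the higher score wins."""
    lp = 0.0
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        num = tri.get((w1, w2, w3), 0) + alpha
        den = bi.get((w1, w2), 0) + alpha * vocab_size
        lp += math.log(num / den)
    return lp
```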

Integration (1)
The next step is to integrate these results into the previously adopted framework, analysing a further communicative modality. We therefore assume that:
–The lexical classifier output can be seen as a new independent feature, and combined with speaker turns and prosodic features
–The developed models could be easily adapted (and eventually re-engineered) to support the newly introduced features, thanks to the flexibility of Dynamic Bayesian Networks

Problems (2)
The new lexical feature lives on a new, different time-scale:
–Different from both the speaker-turn and prosodic feature time-scales
–And different from the time-scale of the events that we’d like to recognise
Meeting events usually appear across different modalities (i.e. turn-taking, prosody and lexical environment) without precise synchronism:
–Speech and gestures, for example
–Some features are asynchronous because they are calculated for each participant
–Others derive from the interaction of different participants (speaker turns)
The model must provide at least a minimum degree of asynchronicity. A true multi-time-scale model would probably be more compact and more efficient!?

Integration (2)
Soon (hopefully!) the new feature will be integrated into the multi-stream model.
The model will be adapted in order to support multiple time-scales, at least at the feature level.
The lack of synchronism between features will be investigated, verifying to what extent the proposed models are able to manage it.
[Diagram: multi-stream DBN with action nodes {A_t} over feature streams {Y_t^1}, {Y_t^2}, {Y_t^3} on different time-scales]

Summary
Open problems:
–Choose the best way to process lexical data
–Integration into the existing framework
–More data (transcriptions) are needed
–Multi-modal = multi-time-scale
–Synchronism
Suggestions?