1 ICSI-SRI-UW Structural MDE: Modeling, Analysis, & Issues. Yang Liu (1,3), Elizabeth Shriberg (1,2), Andreas Stolcke (1,2), Barbara Peskin (1), Jeremy Ang (1), Mary Harper (3). (1) International Computer Science Institute, (2) SRI International, (3) Purdue University

2 Outline  ICSI/SRI/UW RT-04 System Modeling approaches Tasks & results Ongoing and future work Collaboration with Brown  Error Analysis  Issue: Need for more Fisher data Training size effects Style differences  Conclusions

3 Modeling Approaches: HMM  Used in our previous system and many other MDE systems  The most likely event at each boundary is found via the forward-backward algorithm

4 Modeling Approaches: HMM (2)  Transition probability A hidden-event LM for the joint word and event sequence P(W,E)  Observation probability At each word boundary, extract prosodic features related to duration, pause, pitch, and energy The posterior probability of an event given prosodic features is estimated using a decision tree classifier Bagging and ensemble techniques are used to obtain more robust posterior estimates
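A minimal sketch of how LM transition probabilities and tree-based prosodic posteriors could be combined with forward-backward at each word boundary; the two-state event set, the random stand-in scores, and the prior-division trick for turning classifier posteriors into scaled likelihoods are assumptions for illustration, not the actual ICSI/SRI code:

```python
import numpy as np

# States at each word boundary: 0 = no event, 1 = SU boundary (illustrative)
N_STATES = 2


def forward_backward(trans, obs):
    """Per-boundary event posteriors from transition probabilities trans[t, i, j]
    and (scaled) observation likelihoods obs[t, j]."""
    T = obs.shape[0]
    alpha = np.zeros((T, N_STATES))
    beta = np.ones((T, N_STATES))
    alpha[0] = trans[0, 0] * obs[0]                  # treat state 0 as a dummy start
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans[t]) * obs[t]
    for t in range(T - 2, -1, -1):
        beta[t] = trans[t + 1] @ (obs[t + 1] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)  # P(event_t | all observations)


T = 5
rng = np.random.default_rng(0)
# Stand-ins: transition probs would come from the hidden-event LM P(W, E),
# observation scores from decision-tree posteriors P(E | prosody) scaled by priors
trans = rng.dirichlet(np.ones(N_STATES), size=(T, N_STATES))
tree_posteriors = rng.dirichlet(np.ones(N_STATES), size=T)
priors = np.array([0.85, 0.15])                      # hypothetical event priors
print(forward_backward(trans, tree_posteriors / priors))
```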

5 Modeling Approaches: Maxent  HMM Does not jointly model textual information (e.g., words, POS, classes). Maximizes the joint likelihood of words and metadata events, which does not match the performance measure (a per-boundary classification metric)  Maxent Assigns a posterior probability to an event at each interword boundary using an exponential form (shown below) Supports highly correlated features
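The exponential form referenced above did not survive the transcript; the standard maxent posterior being described is (a reconstruction, with E_i the event at boundary i and X_i the textual and prosodic observations there):

```latex
P(E_i \mid X_i) = \frac{1}{Z_\lambda(X_i)} \exp\!\Big(\sum_k \lambda_k\, g_k(E_i, X_i)\Big),
\qquad
Z_\lambda(X_i) = \sum_{E'} \exp\!\Big(\sum_k \lambda_k\, g_k(E', X_i)\Big)
```

where the g_k are the indicator features of the next slide and the weights λ_k are trained to maximize the conditional log likelihood.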

6 Modeling Approaches: Maxent (2)  Features are indicator functions, e.g. (the slide's example is reconstructed below)  Maxent is a discriminative approach: it maximizes the conditional log likelihood and therefore better matches the performance metric. HMM is a generative approach, but it can model sequence information via the forward-backward algorithm
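The original example did not survive the transcript; a typical indicator feature of the kind described might be (hypothetical, for illustration):

```latex
g_k(E_i, X_i) =
\begin{cases}
1 & \text{if } E_i = \mathrm{SU} \text{ and the word ending boundary } i \text{ is ``uhhuh''}\\
0 & \text{otherwise}
\end{cases}
```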

7 Modeling Approaches: CRF (Conditional Random Field)  Maxent allows overlapping features, but it makes decisions locally and does not capture sequential information  A conditional random field (CRF) is globally conditioned on the observation sequence X  Estimates conditional probability P(E|X)  The most likely event sequence is found using the Viterbi algorithm
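In the same notation, the linear-chain CRF distribution being described has the standard form (a reconstruction, not copied from the slide):

```latex
P(E \mid X) = \frac{1}{Z_\lambda(X)} \exp\!\Big(\sum_{i}\sum_k \lambda_k\, g_k(E_{i-1}, E_i, X, i)\Big)
```

so features may depend on the whole observation sequence X and on adjacent event labels, and the most likely event sequence argmax_E P(E|X) is found with Viterbi decoding.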

8 Modeling Approaches: Summary  Both Maxent and CRF directly estimate the conditional probability of an event given observations, and thus match the performance measure; HMM is a generative approach that maximizes the joint likelihood  Both Maxent and CRF allow overlapping features and provide a principled way to model them; in the HMM, these are not well modeled (an independence assumption is made when using linear interpolation)  HMM and CRF model sequence information, whereas Maxent uses only local information (except for features that encode contextual info)

9 RT-04 System Introduction  Tasks: We participated in all the structural MDE tasks, on both CTS and BN, reference transcriptions and various STT conditions  Data: CTS: RT-04 training data only (enough difference between V5 and V6 annotation that combining RT-04 and RT-03 data does not improve performance) BN: combined RT-04 and RT-03 data for training (difference between V5 and V6 is small, and data sparsity on BN is more serious)

10 Tasks & Results: Data

                           CTS (V6 data only)   BN (merged V5+V6)
Training size (# words)    484k                 353k
SU %
Edit IP %
Filled pause %
Discourse marker %

11 Tasks & Results: STT Input Used

        STT input       WER
BN      SuperEARS       11.7
        SRI             15.0
CTS     IBM+SRI         14.9
        SRI             18.6

12 Tasks & Results: SU Tasks  Both SU boundary and subtype are required. We use a two-step approach: first detect SU boundaries, then use a classifier for SU subtype detection  SU boundary detection HMM  Hidden-event word LM  Used POS, automatically induced classes, and additional word LMs (from Meteer data on CTS, and the BN LM used in STT)  LMs except the POS-LM are combined at the LM level, i.e., their interpolation provides the transition probabilities in the HMM (see the sketch below)  Prosody model uses bagging and ensemble techniques  POS-LM is combined via posterior interpolation
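A minimal sketch of the two combination schemes mentioned above (LM-level interpolation feeding the HMM transition probabilities, and posterior-level interpolation with the POS-LM system); all probability values and interpolation weights are hypothetical:

```python
def interpolate_lm_probs(lm_probs, weights):
    """LM-level combination: linearly interpolate next-event probabilities from
    several word/class LMs before they become HMM transition probabilities."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * p for w, p in zip(weights, lm_probs))


def interpolate_posteriors(post_main, post_poslm, weight_main=0.7):
    """Posterior-level combination: mix per-boundary event posteriors from the
    main HMM with those from the POS-LM-based HMM."""
    return weight_main * post_main + (1.0 - weight_main) * post_poslm


# Hypothetical P(SU at this boundary) from three LMs (word, class, Meteer-data LM)
p_trans = interpolate_lm_probs([0.30, 0.25, 0.40], weights=[0.5, 0.3, 0.2])

# Hypothetical posteriors from the combined HMM and from the POS-LM system
print(p_trans, interpolate_posteriors(post_main=0.62, post_poslm=0.55))
```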

13 Tasks & Results: SU Tasks (2)  SU boundary detection: features in the Maxent and CRF (similar knowledge sources as in the HMM, but different representation):  Word N-grams  POS N-grams  N-grams of automatically induced classes  Chunk N-grams (on BN only)  Cumulative binned posterior probabilities from the prosody model (see the sketch below)  Cumulative binned posteriors from the additional word LMs (which are applied to the MDE training and test sets in an HMM approach w/o the prosody model)
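One plausible reading of the "cumulative binned posterior" features is a set of overlapping threshold indicators on the prosody-model posterior, as in this sketch (the thresholds are illustrative, not the ones actually used):

```python
def cumulative_binned_features(posterior, thresholds=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Turn a continuous posterior P(event | prosody) into overlapping binary
    indicators 'posterior >= t' that a maxent/CRF model can weight individually."""
    return {f"prosody_post>={t}": int(posterior >= t) for t in thresholds}


print(cumulative_binned_features(0.64))
# -> {'prosody_post>=0.1': 1, 'prosody_post>=0.3': 1, 'prosody_post>=0.5': 1,
#     'prosody_post>=0.7': 0, 'prosody_post>=0.9': 0}
```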

14 Tasks & Results: CTS SU Boundary Results
SU boundary detection results: majority vote of HMM, Maxent, and CRF approaches in eval submission

                  REF      STT: ibm+sri
HMM
Maxent
CRF
Majority vote

[ Note: Results are from mdeval-v17 ]

15 Tasks & Results: BN SU Boundary Results
SU boundary detection results: used linear interpolation of posteriors from the HMM and Maxent in eval submission

                  REF      STT: superears
HMM
Maxent
CRF
Majority vote
Eval submission

16 Tasks & Results: Summary for SU Boundary Task  HMM Prosody is improved by bagging Additional textual information helps  Maxent is better than HMM on both Ref and STT (esp. on BN) because of the better STT results and more training data  CRF combines advantages of the Maxent and HMM approaches  Majority vote may not be a good system combination approach  Relative to chance performance, BN is still a hard task

17 Tasks & Results: Effect of Speaker Label on BN
Speaker info (and turn change) is important for SU detection (used in prosodic features, for chunking text)
No speaker info is available on BN. Two methods to derive speaker labels: Diarization output; Automatic clustering as used in STT

                         SU error rate
Diarization              54.49
Automatic clustering     64.10

(On BN ref condition, using HMM approach)

18 Tasks & Results: SU Subtype Detection
Percentage of subtypes:

              CTS     BN
Statement
Backchannel
Question
Incomplete

A Maxent classifier is used in a second step for SU subtype detection. Features are: sentence length, initial and final word, turn change, binned posteriors from the prosody model
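A toy sketch of such a second-stage classifier over features like those listed, with scikit-learn's logistic regression standing in for the maxent classifier; the feature values and labels below are invented:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training examples: one feature dict per detected SU
X = [
    {"length": 7, "first_word": "i", "last_word": "today", "turn_change": 1, "prosody_bin": 3},
    {"length": 1, "first_word": "uhhuh", "last_word": "uhhuh", "turn_change": 1, "prosody_bin": 4},
    {"length": 6, "first_word": "did", "last_word": "there", "turn_change": 0, "prosody_bin": 2},
]
y = ["statement", "backchannel", "question"]

# DictVectorizer one-hot encodes the string-valued features, keeps numeric ones
clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X, y)
print(clf.predict([{"length": 1, "first_word": "yeah", "last_word": "yeah",
                    "turn_change": 1, "prosody_bin": 4}]))
```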

19 Tasks & Results: SU Subtype Results
Similar substitution error rates for Ref and STT conditions

                         Boundary   Substitution   Total
CTS   REF
      STT: ibm+sri
      STT: sri
BN    REF
      STT: superears
      STT: sri

20 Tasks & Results: SU Subtype Summary  Word errors in STT do not affect substitution errors as much as they affect boundary detection  Wrong SU boundaries (inserted or deleted) affect features such as SU-initial words  The prosody model is built from word-based prosodic features and may not represent the sentence-level pitch or energy contour

21 Tasks & Results: Edit Detection  HMM A hidden-event LM and prosody model for edit IP detection A repetition detection module Heuristic rules to locate edit onsets based on IP hypotheses  Maxent A Maxent classifier for SU/editIP/NULL detection, features:  All the features used for SU  Repeated word sequence  Word fragment  Filler words (predefined cue words)  Binned posteriors from the IP/NULL prosody model Heuristic rules to locate edit onsets based on IP hypotheses

22 Tasks & Results: Edit Detection (2)
CRF: jointly detects edit words and edit IPs. Example:

   I        I     work     uh   I'm   an   analyst
   B-E+IP   I-E   I-E+IP   O    O     O    O

Each word has an associated class tag (an edit word or not). 'IP' is added to the target classes to account for internal IPs inside complex edits. Valid state transitions are learned from the training set. No heuristic rules are used.
Features: word and POS N-grams, plus the other features used in the Maxent (a tag-generation sketch follows below)
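Under the assumption that edit-word spans and IP positions are given, the tags in the example above can be generated as in this sketch (the span and IP indices are read off the example itself, not from any released tool):

```python
def bio_ip_tags(n_words, edit_spans, ip_after):
    """edit_spans: (start, end) word-index pairs (end exclusive) marking edit regions.
    ip_after: set of word indices that an interruption point immediately follows."""
    tags = ["O"] * n_words
    for start, end in edit_spans:
        for i in range(start, end):
            tags[i] = "B-E" if i == start else "I-E"
    return [t + "+IP" if i in ip_after and t != "O" else t for i, t in enumerate(tags)]


words = "i i work uh i'm an analyst".split()
# One edit region covering "i i work"; IPs after the first "i" and after "work"
print(list(zip(words, bio_ip_tags(len(words), edit_spans=[(0, 3)], ip_after={0, 2}))))
# -> [('i', 'B-E+IP'), ('i', 'I-E'), ('work', 'I-E+IP'),
#     ('uh', 'O'), ("i'm", 'O'), ('an', 'O'), ('analyst', 'O')]
```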

23 Tasks & Results: Edit Results
CTS: Ref condition

          Edit word   Edit IP
HMM
Maxent
CRF

CRF outperforms the HMM and Maxent for edit word detection, but not for the edit IP task (probably the other two are better trained for finding IPs).
Post-eval: adding a 'turn' feature to the CRF for edit word detection brings the error rate to 50.07%. Still room for further improvement.

24 Tasks & Results: Filler Words  The HMM approach is used to find the end of FPs and DMs; for DMs, the onset is then determined by searching a predefined list  On BN, when using the SuperEARS STT output (which does not have FPs), we generate FPs by aligning SRI's STT with the SuperEARS STT  On CTS, there is some mismatch between training and testing in the percentage of 'like' and 'so' used as DMs, possibly because of differences between the Swbd and Fisher data [see later in this talk for more details]

25 Tasks & Results: Filler Words
Better filler (filled pause) detection results on BN with SRI STT than with SuperEARS STT (despite the higher WER of SRI STT), due to the suboptimal derivation of FPs in the SuperEARS STT

CTS   REF               27.10
      STT: ibm+sri      42.53
      STT: sri          44.64
BN    REF               18.11
      STT: superears    56.63
      STT: sri          52.63

26 Tasks & Results: IP Detection
Combined edit word detection and filler word detection
Better IP results on BN with SRI STT are due to the better filler word detection

CTS   REF               30.31
      STT: ibm+sri      60.59
      STT: sri          61.48
BN    REF               21.42
      STT: superears    70.39
      STT: sri          67.91

27 Tasks & Results: Effect of WER by Task and Corpus
What's the effect of WER on various tasks?

                        SU      Edit
CTS   REF
      STT: ibm+sri
      STT: sri
BN    REF
      STT: superears
      STT: sri

[ Eval results on ref and various STT conditions ]

28 Tasks & Results: Effect of WER by Task and Corpus  Apparently the MDE error rate is not linear in WER Word errors affect MDE differently; sentence-initial and sentence-final words have a large impact Insertion and deletion of short SU cue words greatly impact SU detection  As STT gets better, SU performance improves more than edit word detection: Word errors occur more frequently in disfluencies Word errors have more impact on cues used for edit detection Fragment information is unavailable in the STT condition and is one of the main reasons for the large gap between Ref and STT conditions for edit word detection.

29 Tasks & Results: Across Genre  Across CTS and BN Different speaking styles (word cues for SU, different edit disfluencies, etc.) Edit performance is better on the BN ref condition, but worse on STT than CTS, meaning edits are relatively easier on the BN ref condition (simple repeats and revisions) Performance on SUs is relatively worse on BN than CTS (after normalizing by the event priors), due to the lack of cue words and a more severe sparse-data problem More degradation for the STT condition on CTS than BN for SU detection, partly because of the poorer recognition performance

30 Collaboration with Brown & UW  Through our work with UW, we learned of Brown’s interest in participating in disfluency tasks  We provided Brown with our SU and IP hypotheses  Output was generated using the HMM approach  See Brown’s presentation for details  We are eager to continue collaboration examining parsing cues, both with UW and Brown, as well as with other interested parsing researchers.

31 Ongoing Work  Incorporate SuperARV tags for MDE Capture more syntactic information Similar to class tags and easy to use  Include better features in Maxent and CRF Additional knowledge sources or compound features Directly incorporate prosodic features  Build speaker-dependent prosody model  Build word-dependent prosody model for DM words  Investigate additional prosodic features

32 Outline  Introduction  ICSI/SRI/UW RT-04 System Modeling approaches Tasks & results Ongoing and future work Collaboration with Brown  Error Analysis  Issue: Need for more Fisher data Training size effects Style differences  Conclusions

33 Error Analysis: SUs Compared ICSI Output and the NIST MDE Reference  Interested in errors due to modeling rather than to STT errors, so looked at errors in our Ref output  Focused on the SU task (for disfluencies we looked at Brown vs. ICSI results)  Looked at both CTS and BN  Aligned the word+event stream from ICSI with that from the NIST/LDC reference to find MDE errors (see the sketch below)  Force-aligned words with speech to examine prosody
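The alignment step could be approximated as in this sketch, which uses Python's difflib to align the two word streams and then compares the event labels at matching positions (a simplification for illustration; the words and events below are invented):

```python
import difflib


def find_event_errors(sys_words, sys_events, ref_words, ref_events):
    """Align system and reference word streams, then report positions where the
    hypothesized event after a word differs from the reference event."""
    errors = []
    sm = difflib.SequenceMatcher(a=sys_words, b=ref_words, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            for i, j in zip(range(i1, i2), range(j1, j2)):
                if sys_events[i] != ref_events[j]:
                    errors.append((sys_words[i], sys_events[i], ref_events[j]))
    return errors


sys_w = ["yeah", "i", "know", "so", "anyway"]
ref_w = ["yeah", "i", "know", "so", "anyway"]
sys_e = ["SU", None, None, None, "SU"]   # event hypothesized after each word
ref_e = ["SU", None, "SU", None, "SU"]
print(find_event_errors(sys_w, sys_e, ref_w, ref_e))   # -> [('know', None, 'SU')]
```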

34 Error Analysis: SUs  Examined 400 system errors for both CTS (~30% of errors) and BN (~23% of errors)  Broadly speaking, identified 4 general error types: Keyword errors - errors associated with specific words (more often than expected by chance) Dialog-act related errors – associated with syntactic or other modifications to words, e.g. commands (missing subject), before a question, and near quotes Transcription / labeling / mapping errors – small percentage in CTS; a very small number of these due to mapping Unclear

35 SU Error Analysis: Classes by Corpus
Types can cooccur (sum to 1.1 for BN, 1.2 for CTS)
Some ref errors in CTS; none noticed in BN sample
Similar rate of errors related to dialog acts
High rate of keyword-associated errors, esp. in CTS

[Chart: percentage of errors per class (Keyword, DialogAct, RefError, Unclear) for BN (N=417) and CTS (N=399)]

36 SU Error Analysis: Keywords by Ratio  Examined N-grams using the ratio: % of errors / % of total  Ratios different from 1 suggest an association with that N-gram  Note these are normalized for priors (including correct instances)  Found many cases of high ratios, which should be examined to improve modeling  E.g., for the unigram following an SU boundary: High ratios for CTS: conjunctions, DMs, and 1st-person pronouns  top ratio: "or" (3.95) High ratios for BN: conjunctions, pronouns  top ratio: "but" (4.79)
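The ratio statistic described above can be computed roughly as follows (the counts are invented for illustration, not the actual error data):

```python
from collections import Counter


def error_ratios(error_unigrams, all_unigrams):
    """Ratio of a unigram's share of SU errors to its share of all boundaries;
    ratios well above 1 flag words associated with errors."""
    err, tot = Counter(error_unigrams), Counter(all_unigrams)
    n_err, n_tot = sum(err.values()), sum(tot.values())
    return {w: (err[w] / n_err) / (tot[w] / n_tot) for w in err}


# Hypothetical unigrams following a boundary, in errorful instances vs. all instances
errors = ["or", "or", "and", "but", "i"]
total = ["or", "and", "and", "and", "but", "but", "i", "i", "i", "the"]
print(sorted(error_ratios(errors, total).items(), key=lambda kv: -kv[1]))
```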

37 Error Analysis: Disfluencies Looked at Brown and ICSI Results  Interested in discerning the role of parse information  But on inspection, many ICSI errors were not clearly related to syntax  Many errors involved restart vs. incomplete SU: "i mean i ha- yeah definitely" We forgot the TURN feature in the CRF; when added, it helps  For repeats we modeled up to 3 words; we should allow more: "it's a neat little it's a neat little..."  We used a too-limited list of fillers when looking for repeats  We should add heuristics to allow skips between matched words  We can try using SU info (given to Brown, but not used ourselves)

38 Outline  Introduction  ICSI/SRI/UW RT-04 System Modeling approaches Tasks & results Ongoing and future work Collaboration with Brown  Error Analysis  Issue: Need for more Fisher data Training size effects Style differences  Conclusions

39 Need for More Fisher Data: Training Size Effects
We think it is important to have more MDE-annotated Fisher data
2 empirical motivations. 1st motivation: training size effect
Used baseline HMM: word-based LM + downsampled prosody
Randomly decreasing the amount of training data yields progressively higher error rates. For example:

Training size    SU error
All
 %
 %               35.85

40 Need for Fisher Data: Style Differences: Perplexity  2nd motivation: Fisher and Switchboard differ stylistically  One way to see this: overall perplexity results  We split the training data randomly into 2 halves  Trained a hidden-event LM using bigrams (N-grams include events but not the non-events)  Always train on ½ of the RT-04 Switchboard training data  Compute test-set perplexity for: Switchboard, the other ½ of the training set Fisher test data  Repeated X times, each time with a different random split (see the sketch below)
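A toy version of this split-and-test comparison is sketched below with an add-one-smoothed bigram LM over a word+event token stream; the actual experiments used proper hidden-event LMs and real transcripts, and the token streams here are invented:

```python
import math
from collections import Counter


def bigram_perplexity(train_tokens, test_tokens):
    """Add-one-smoothed bigram perplexity; tokens are words interleaved with
    event tokens such as '<SU>' (non-events are simply not inserted)."""
    vocab = set(train_tokens) | set(test_tokens)
    uni = Counter(train_tokens)
    bi = Counter(zip(train_tokens, train_tokens[1:]))
    logp = 0.0
    for prev, cur in zip(test_tokens, test_tokens[1:]):
        logp += math.log((bi[(prev, cur)] + 1) / (uni[prev] + len(vocab)))
    return math.exp(-logp / (len(test_tokens) - 1))


# Invented word+event streams standing in for Switchboard and Fisher transcripts
swbd = "yeah i know <SU> so anyway <SU> i mean <SU>".split() * 50
fisher = "so like i was saying <SU> you know <SU>".split() * 50

half = len(swbd) // 2
train, held_out = swbd[:half], swbd[half:]
print("Swbd held-out perplexity:", bigram_perplexity(train, held_out))
print("Fisher perplexity:       ", bigram_perplexity(train, fisher))
```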

41 Style Differences: Perplexity (2)
[Chart: approximate perplexity for words only (no events); hidden-event LM trained on Switchboard, tested on Switchboard vs. Fisher]

42 Style Differences: Sample Overall Rates
Fisher has higher rate of many characteristics associated with less formal speech

                                  Switchboard   Fisher
incomplete SU / total SU (%)      6.92
total IP / total words (%)        4.37
total DM / total words (%)        2.96

... to name only a few

43 Style Differences: Events at SU Onsets
SU-initial event / total SUs (Switchboard / Fisher):

"and" / filled pause     5.13 /
  Neutral choices relatively higher in Switchboard than in Fisher
"you know"               2.46 /
"so"                     3.30 /
  More informal choices relatively higher in Fisher
"well"                   3.73 /
  More "softeners" (politeness/formality) higher in Switchboard

SU-initial "because" (coordinating vs. subordinating conj.):
  At SU onset / total "because":   7.61 /
  At SU onset / total SUs:         0.19 /
  More informal usage in Fisher; less governed by syntax.

44 Style Differences: Length Distributions
[Chart: percentage of turns vs. number of SUs in turn]
2 Switchboard splits have the same distribution; Fisher has more SUs per turn
Additional distributions suggest the difference is not related to backchannels
Statement SUs are shorter in Fisher (more informal?)
Incomplete SUs are more prevalent in Fisher (whether true IncSUs or restarts: more informal)

45 Style Differences: Specific Example by ConvSide: “Like”

46 Style Differences in “Like” Sanity Check: Usage of “Like” as NonDM  Same usage profile for like as nonDM in both corpora  Slightly higher rate in test not unexpected given higher DM usage (overall “like”s) in Fisher

47 Outline  Introduction  ICSI/SRI/UW RT-04 System Modeling approaches Tasks & results Ongoing and future work Collaboration with Brown  Error Analysis  Issue: Need for more Fisher data Training size effects Style differences  Conclusions

48 Conclusions  Maximum entropy and CRF approaches improve on our HMMs by better modeling hierarchical and longer-distance features  Combining the HMM, Maxent, and CRF approaches yields the best performance across BN and CTS, on both Ref and STT conditions  Diarization output greatly aids the BN SU task  The effect of STT on MDE is not linear & differs by task and corpus  Collaboration with Brown suggests higher-level syntactic information as well as joint MDE modeling may provide further gains  Error analysis for the SU task shows 4 general classes of errors, two of which we could attempt to fix. Class distribution differs by corpus  We think it's important to obtain more MDE-annotated Fisher data for training, based on (1) training size experiments and (2) evidence of stylistic differences involving the phenomena we are modeling.