
Speech and Language Processing Chapter 10 of SLP Advanced Automatic Speech Recognition (II) Disfluencies and Metadata

Outline  Advances in Search  Multipass search  N-best lists  Lattices  Advances in Context  Context-dependent (triphone) acoustic modeling  Metadata  Disfluencies, etc.

Metadata  yeah actually um i belong to a gym down here a gold’s gym uh-huh and uh exercise i try to exercise five days a week um and i usually do that uh what type of exercising do you do in the gym  A: Yeah I belong to a gym down here. Gold’s Gym. And I try to exercise five days a week. And I usually do that.  B: What type of exercising do you do in the gym?

Metadata tasks  Diarization  Sentence Boundary Detection  Truecasing  Punctuation detection  Disfluency detection

Why?  Diarization?  Sentence Boundary Detection?

Sentence Segmentation (Shriberg et al 2000)  Binary classification task: judge the juncture between each pair of words  Features:  Pause  Duration of previous phone and rime  Pitch change across the boundary; pitch range of the previous word
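Below is a minimal sketch of a boundary classifier of this kind, assuming the per-juncture feature values (pause length, rime duration, pitch change) have already been extracted from the audio; the feature set, toy numbers, and use of scikit-learn's decision tree are illustrative assumptions, not Shriberg et al.'s actual setup.

```python
# Hypothetical sketch: classify each interword juncture as sentence boundary or not.
# Feature values are invented; a real system extracts them from the speech signal.
from sklearn.tree import DecisionTreeClassifier

# One row per juncture:
# [pause_sec, prev_rime_duration_sec, f0_change_hz, prev_word_pitch_range_hz]
X_train = [
    [0.75, 0.22, -40.0, 90.0],   # long pause, big pitch fall -> boundary
    [0.02, 0.08,  -2.0, 30.0],   # short pause, flat pitch    -> no boundary
    [0.60, 0.25, -35.0, 85.0],
    [0.01, 0.10,   3.0, 25.0],
]
y_train = [1, 0, 1, 0]           # 1 = sentence boundary after this word

clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X_train, y_train)
print(clf.predict([[0.50, 0.20, -30.0, 70.0]]))   # likely predicts a boundary
```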

Disfluencies: standard terminology (Levelt)  Reparandum: thing repaired  Interruption point (IP): where speaker breaks off  Editing phase (edit terms): uh, I mean, you know  Repair: fluent continuation
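For example, in the utterance from a later slide, “Does American airlines offer any one-way flights [uh] one-way fares for 160 dollars?”, the reparandum is “one-way flights”, the interruption point falls right after “flights”, “uh” is the edit term, and “one-way fares” is the repair.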

Types of Disfluencies

Why Disfluencies?  Cues to emotion or attitude  Known to cause problems for speech recognition  Goldwater, Jurafsky, Manning (2008):  Word errors increase by up to 15% (absolute) for words near fragments

Why disfluencies?  Need to clean them up to get understanding  Does American airlines offer any one-way flights [uh] one-way fares for 160 dollars?  Delta leaving Boston seventeen twenty one arriving Fort Worth twenty two twenty one forty  Known to cause problems for speech recognition  Goldwater, Jurafsky, Manning (2008):  Word errors increase by up to 15% (absolute) for words near fragments  Might help in language modeling  Disfluencies might occur at particular positions (boundaries of clauses, phrases)  Annotating them helps readability

Counts (from Shriberg, Heeman)  Sentence disfluency rate  ATIS: 6% of sentences disfluent (10% of long sentences)  Levelt human dialogs: 34% of sentences disfluent  SWBD: ~50% of multiword sentences disfluent  TRAINS: 10% of words are in the reparandum or editing phase  Word disfluency rate  SWBD: 6%  ATIS: 0.4%  AMEX (human-human air travel): 13%

Prosodic characteristics of disfluencies  Nakatani and Hirschberg 1994  Fragments are good cues to disfluencies  Prosody:  Pause duration is shorter in disfluent silence than in fluent silence  F0 increases from the end of the reparandum to the beginning of the repair, but only a minor change  Repair interval offsets have a minor prosodic phrase boundary, even in the middle of an NP:  Show me all n- | round-trip flights | from Pittsburgh | to Atlanta

Syntactic Characteristics of Disfluencies  Hindle (1983)  The repair often has the same structure as the reparandum  Both are Noun Phrases (NPs) in this example:  So if we could automatically find the IP, we could find and correct the reparandum!

Disfluencies and LM  Clark and Fox Tree  Looked at “um” and “uh”  “uh” includes “er” (“er” is just the British non-rhotic dialect spelling for “uh”)  Different meanings  Uh: used to announce minor delays  Preceded and followed by shorter pauses  Um: used to announce major delays  Preceded and followed by longer pauses

Um versus uh: delays (Clark and Fox Tree)

Utterance Planning  The more difficulty speakers have in planning, the more delays  Consider 3 locations:  I: before intonation phrase: hardest  II: after first word of intonation phrase: easier  III: later: easiest  And then uh somebody said, [I] but um -- [II] don’t you think there’s evidence of this, in the twelfth - [III] and thirteenth centuries?

Fragments  Incomplete or cut-off words:  Leaving at seven fif- eight thirty  uh, I, I d-, don't feel comfortable  You know the fam-, well, the families  I need to know, uh, how- how do you feel…  Uh yeah, yeah, well, it- it- that’s right. And it-  SWBD: around 0.7% of words are fragments (Liu 2003)  ATIS: 60.2% of repairs contain fragments (6% of corpus sentences had at least 1 repair) Bear et al (1992)  Another ATIS corpus: 74% of all reparanda end in word fragments (Nakatani and Hirschberg 1994)

Fragment glottalization  Uh yeah, yeah, well, it- it- that’s right. And it-

Fragments in other languages  Mandarin (Chu, Sung, Zhao, Jurafsky 2006)  Fragments cause similar errors as in English:  但我﹣我問的是  I- I was asking…  他是﹣卻很﹣活的很好  He very- lived very well

Fragments in Mandarin  Mandarin fragments are unlike English: no glottalization.  Instead: mostly (unglottalized) repetitions  So: the best features are lexical, rather than voice quality

Does deleting disfluencies improve LM perplexity?  Stolcke and Shriberg (1996)  Build two LMs  Raw  Removing disfluencies  See which assigns lower perplexity to the data  Three disfluency types considered:  Filled Pause  Repetition  Deletion
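A toy sketch of this comparison, assuming tokenized conversational text with filled pauses written as "uh"/"um"; the add-one-smoothed bigram model and tiny made-up data below are illustrative, not Stolcke and Shriberg's actual LM or corpus.

```python
import math
from collections import Counter

FILLED_PAUSES = {"uh", "um"}  # tokens stripped in the "cleaned" condition

def bigram_lm(sentences):
    """Add-one-smoothed bigram model built from a list of token lists."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        vocab.update(toks)
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    V = len(vocab)
    return lambda w1, w2: (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

def perplexity(prob, sentences):
    logp, n = 0.0, 0
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        for w1, w2 in zip(toks[:-1], toks[1:]):
            logp += math.log2(prob(w1, w2))
            n += 1
    return 2 ** (-logp / n)

raw = [["i", "uh", "belong", "to", "a", "gym"],
       ["and", "uh", "i", "try", "to", "exercise"]]
cleaned = [[w for w in s if w not in FILLED_PAUSES] for s in raw]

print("raw PP:    ", perplexity(bigram_lm(raw), raw))
print("cleaned PP:", perplexity(bigram_lm(cleaned), cleaned))
```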

Change in Perplexity when Filled Pauses (FP) are removed  LM perplexity goes up at the following word:  Removing filled pauses makes the LM worse!!  I.e., filled pauses seem to help to predict the next word.  Why would that be? (Stolcke and Shriberg 1996)

Filled pauses tend to occur at clause boundaries  Word before FP is end of previous clause; word after is start of new clause  Best not to delete FP  Some of the different things we’re doing [uh] there’s not time to do it all  “there’s” is very likely to start a sentence  So P(there’s|uh) is a better estimate than P(there’s|doing)

Suppose we just delete medial FPs  Experiment 2:  Parts of SWBD hand-annotated for clauses  Build FP-model by deleting only medial FPs  Now prediction of the post-FP word (perplexity) improves greatly!  Siu and Ostendorf found the same with “you know”

What about REP and DEL?  Stolcke and Shriberg built a model with “cleaned-up” REP and DEL  Slightly lower perplexity  But exactly the same word error rate (49.5%)  Why?  Rare: only 2 words per 100  Doesn’t help because adjacent words are misrecognized anyhow!

Stolcke and Shriberg conclusions on LMs and disfluencies  Disfluency modeling purely in the LM probably won’t vastly improve WER  But  Disfluencies should be considered jointly with the sentence segmentation task  Need to do better at recognizing disfluent words themselves  Need acoustic and prosodic features

WER reductions from modeling disfluencies + background events  Rose and Riccardi (1999)

HMIHY Background events  Out of 16,159 utterances:
  Filled Pauses: 7189
  Word Fragments: 1265
  Hesitations: 792
  Laughter: 163
  Lipsmack: 2171
  Breath: 8048
  Non-Speech Noise: 8834
  Background Speech: 3585
  Operator Utt.: 5112
  Echoed Prompt: 5353

Phrase-based LM  “I would like to make a collect call”  “a [wfrag]”  [brth]  “[brth] and I”
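A hypothetical sketch of how such multiword phrases and event labels can be merged into single LM tokens before N-gram training; the phrase list and underscore-joining convention are assumptions for illustration, not the actual system's phrase inventory.

```python
# Hypothetical sketch: treat recurring word/event sequences as single LM tokens.
PHRASES = [
    ["i", "would", "like", "to", "make", "a", "collect", "call"],
    ["a", "[wfrag]"],
    ["[brth]", "and", "i"],
]

def merge_phrases(tokens, phrases=PHRASES):
    """Greedily replace known phrases with single underscore-joined tokens."""
    out, i = [], 0
    while i < len(tokens):
        for ph in sorted(phrases, key=len, reverse=True):   # longest match first
            if tokens[i:i + len(ph)] == ph:
                out.append("_".join(ph))
                i += len(ph)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_phrases("[brth] and i would like a [wfrag] please".split()))
# ['[brth]_and_i', 'would', 'like', 'a_[wfrag]', 'please']
```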

Rose and Riccardi (1999): modeling “LBEs” does help in WER  [Table: ASR word accuracy for baseline vs. LBE-modeling HMM and LM configurations, on the Greeting (HMIHY) and Card Number test corpora]

More on location of FPs  Peters: Medical dictation task  Monologue rather than dialogue  In this data, FPs occurred INSIDE clauses  Trigram PP after FP: 367  Trigram PP after word: 51  Stolcke and Shriberg (1996b)  w_k FP w_k+1: looked at P(w_k+1 | w_k)  Transition probabilities lower for these transitions than normal ones  Conclusion:  People use FPs when they are planning difficult things, so following words are likely to be unexpected/rare/difficult

Detection of disfluencies  Nakatani and Hirschberg  Decision tree at the w_i–w_j boundary  Pause duration  Word fragments  Filled pause  Energy peak within w_i  Amplitude difference between w_i and w_j  F0 of w_i  F0 differences  Whether w_i is accented  Results:  78% recall / 89.2% precision

Detection/Correction  Bear, Dowding, Shriberg (1992)  System 1:  Hand-written pattern matching rules to find repairs  Look for identical sequences of words  Look for syntactic anomalies (“a the”, “to from”)  62% precision, 76% recall  Rate of accurately correcting: 57%
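A simplified sketch in the spirit of those hand-written rules, flagging exact word repeats and a few anomalous word pairs; the specific pairs and the flat token-based matching are illustrative assumptions, not Bear, Dowding, and Shriberg's actual rule set.

```python
# Simplified sketch of cue-based repair pattern matching (illustrative rules only).
ANOMALOUS_PAIRS = {("a", "the"), ("the", "a"), ("to", "from"), ("from", "to")}

def find_repair_candidates(tokens):
    """Flag junctures that look like repairs: exact repeats or anomalous word pairs."""
    candidates = []
    for i in range(len(tokens) - 1):
        if tokens[i] == tokens[i + 1]:                       # "I I", "to to"
            candidates.append((i, "repeat"))
        elif (tokens[i], tokens[i + 1]) in ANOMALOUS_PAIRS:  # "a the", "to from"
            candidates.append((i, "syntactic anomaly"))
    return candidates

print(find_repair_candidates("show me flights to to from boston".split()))
# [(3, 'repeat'), (4, 'syntactic anomaly')]
```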

Using Natural Language Constraints  Gemini natural language system  Based on Core Language Engine  Full syntax and semantics for ATIS  Coverage of whole corpus:  70% syntax  50% semantics

Using Natural Language Constraints  Gemini natural language system  Run the pattern matcher over the corpus  From the sentences it returned, remove fragment sentences  Leaving 179 repairs, 176 false positives  Parse each remaining sentence  If the parse succeeds: mark as false positive  If it fails:  run the pattern matcher, make corrections  Parse again  If the reparse succeeds, mark as repair  If it fails, mark no opinion
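A hypothetical sketch of the decision procedure above, with a stand-in parsing predicate in place of the real Gemini parser:

```python
def judge(sentence, corrected_sentence, parses):
    """Classify a pattern-matcher hit; `parses` stands in for the Gemini parser."""
    if parses(sentence):
        return "false positive"        # original already parses: not a real repair
    if parses(corrected_sentence):
        return "repair"                # only the corrected version parses
    return "no opinion"

# Toy stand-in parser that merely rejects adjacent repeated words:
toy_parser = lambda s: all(a != b for a, b in zip(s.split(), s.split()[1:]))
print(judge("show me the the flights", "show me the flights", toy_parser))  # repair
```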

NL Constraints  Syntax Only  Precision: 96%  Syntax and Semantics  Correction: 70%

Recent work: EARS Metadata Evaluation (MDE)  A recent multiyear DARPA bakeoff  Sentence-like Unit (SU) detection:  find end points of SUs  Detect subtype (question, statement, backchannel)  Edit word detection:  Find all words in the reparandum (words that will be removed)  Filler word detection  Filled pauses (uh, um)  Discourse markers (you know, like, so)  Editing terms (I mean)  Interruption point detection  (Liu et al.)

Kinds of disfluencies  Repetitions  I * I like it  Revisions  We * I like it  Restarts (false starts)  It’s also * I like it

MDE transcription  Conventions: ./ for statement SU boundaries,  <> for fillers,  [] for edit words,  * for IP (interruption point) inside edits  And wash your clothes wherever you are./ and [ you ] * you really get used to the outdoors./

MDE Labeled Corpora
                        CTS     BN
  Training set (words)  484K    182K
  Test set (words)      35K     45K
  (The table also reported STT WER %, SU %, edit word %, and filler word % for each corpus.)

MDE Algorithms  Use both textual and prosodic features  At each interword boundary  Extract prosodic features (pause length, durations, pitch contours, energy contours)  Use an N-gram language model  Combine via HMM, Maxent, CRF, or another classifier
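A hedged sketch of this combination step, using scikit-learn's logistic regression as the maxent classifier; the feature names and values are invented for illustration, and a real system would derive the LM feature from an actual N-gram model over the transcript.

```python
# Illustrative sketch: a maxent-style classifier over interword boundaries,
# combining prosodic features with an N-gram LM score.
from sklearn.linear_model import LogisticRegression

# [pause_sec, f0_reset_hz, energy_drop_db, lm_boundary_logprob]
X = [
    [0.80, 45.0, 9.0, -0.5],   # boundary
    [0.02,  2.0, 0.5, -4.0],   # no boundary
    [0.65, 38.0, 7.5, -0.8],   # boundary
    [0.05,  4.0, 1.0, -3.5],   # no boundary
]
y = [1, 0, 1, 0]

maxent = LogisticRegression()
maxent.fit(X, y)
print(maxent.predict_proba([[0.50, 30.0, 6.0, -1.0]]))  # [P(no boundary), P(boundary)]
```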

State of the art: SU detection  Two stages  Decision tree plus N-gram LM to decide boundary  Second maxent classifier to decide subtype  Current error rates:  Finding boundaries  40-60% using ASR  26-47% using transcripts

State of the art: Edit word detection  Multi-stage model  HMM combining LM and decision tree finds the IP  Heuristic rules find the onset of the reparandum  Separate repetition detector for repeated words  One-stage model  CRF jointly finds the edit region and IP  BIO tagging (each word is tagged as the beginning of an edit, inside an edit, or outside any edit)  Error rates:  43-50% using transcripts  80-90% using ASR
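A small sketch showing how BIO-style edit tags can be read off an MDE-style bracketed transcript, using the [ ] and * conventions from the MDE transcription slide above; the tag names B-E/I-E/O are illustrative, not the exact label set of any particular system.

```python
# Sketch: derive BIO edit tags from an MDE-style bracketed transcript.
def bio_edit_tags(annotated):
    """Return (word, tag) pairs: B-E = begins edit, I-E = inside edit, O = outside."""
    tags, in_edit, first = [], False, False
    for tok in annotated.split():
        if tok == "[":
            in_edit, first = True, True
        elif tok == "]":
            in_edit = False
        elif tok == "*":
            continue                      # interruption point marker, not a word
        elif in_edit:
            tags.append((tok, "B-E" if first else "I-E"))
            first = False
        else:
            tags.append((tok, "O"))
    return tags

print(bio_edit_tags("and [ you ] * you really get used to the outdoors"))
# [('and', 'O'), ('you', 'B-E'), ('you', 'O'), ('really', 'O'), ...]
```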

Using only lexical cues  3-way classification for each word  Edit, filler, fluent  Using TBL (transformation-based learning)  Templates: Change  Word X from L1 to L2  Word sequence X Y to L1  Left side of simple repeat to L1  Word with POS X from L1 to L2 if followed by word with POS Y

Rules learned  Label all fluent filled pauses as fillers  Label the left side of a simple repeat as an edit  Label “you know” as a filler  Label fluent “well”s as fillers  Label fluent fragments as edits  Label “I mean” as a filler
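A toy sketch applying a few of these rules to a word sequence; unlike real TBL, the rules here are hard-coded and applied in a single pass, purely for illustration.

```python
# Toy illustration of a few of the learned rules above (not the actual TBL output).
FILLED_PAUSES = {"uh", "um"}

def label_words(words):
    labels = ["fluent"] * len(words)
    for i, w in enumerate(words):
        if w in FILLED_PAUSES:
            labels[i] = "filler"                            # filled pauses -> fillers
        if i + 1 < len(words) and words[i + 1] == w:
            labels[i] = "edit"                              # left side of a simple repeat
        if words[i:i + 2] in (["you", "know"], ["i", "mean"]):
            labels[i], labels[i + 1] = "filler", "filler"   # discourse-marker phrases
    return list(zip(words, labels))

print(label_words("i i uh you know it it works".split()))
# [('i', 'edit'), ('i', 'fluent'), ('uh', 'filler'), ('you', 'filler'),
#  ('know', 'filler'), ('it', 'edit'), ('it', 'fluent'), ('works', 'fluent')]
```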

Error rates using only lexical cues  CTS, using transcripts  Edits: 68%  Fillers: 18.1%  Broadcast News, using transcripts  Edits: 45%  Fillers: 6.5%  Using speech:  Broadcast News filler detection goes from 6.5% error to 57.2%  Other systems (using prosody) do better on CTS, but not on Broadcast News

Conclusions: Lexical Cues Only  Can do pretty well with only words  (As long as the words are correct)  Much harder to do fillers and fragments from ASR output, since recognition of these is bad

Accents: An experiment  A word by itself  The word in context

Summary  Advances in Search  Multipass search  N-best lists  Lattices  Advances in Context  Context-dependent (triphone) acoustic modeling  Metadata  Disfluencies, etc.