Recognizing Disfluencies


Recognizing Disfluencies Julia Hirschberg CS 4706 9/16/2018

Today
- Why are disfluencies important to speech recognition?
- The case of self-repairs:
  - What are self-repairs?
  - How are they distributed?
  - What are their distinguishing characteristics?
  - Can they be detected automatically?
    - Parsing approaches
    - Pattern-matching approaches
    - Machine learning approaches

Disfluencies and Self-Repairs
Spontaneous speech is 'ungrammatical' every 4.6s in radio call-in shows (Blackmer & Mitton '91):
- Hesitation: Ch- change strategy.
- Filled pause: Um Baltimore.
- Self-repair: Ba- uh Chicago.
A big problem for speech recognition, which misrecognizes these as:
- Ch- change strategy. --> to D C D C today ten fifteen.
- Um Baltimore. --> From Baltimore ten.
- Ba- uh Chicago. --> For Boston Chicago.
Notes: Kasl & Mahl found 41% more filled pauses in audio-only vs. face-to-face conversation; Oviatt found disfluency rates of 8.83% vs. 5.50% in phone vs. non-phone conversations. Example sources: /n/u118/exp98/adapt/MixedImplicit/mccoy/task1 line 198, hesitation (mor- morning --> not really); /n/u118/exp98/adapt/UserNoConfirm/sgoel/task1 line 1315, filled pause (um baltimore --> from baltimore ten); /n/u118/exp98/adapt/UserNoConfirm/selina/task1 line 1132, repair (Ba- uh Chicago --> for Boston Chicago).

Disfluencies as 'Noise'
For people:
- Repairs as replanning events
- Repairs as attention-getting devices (taking the turn)
For parsers
For speech recognizers

What's the Alternative?
Modeling disfluencies:
- Filled pauses
- Self-repairs
- Hesitations
Detecting disfluencies explicitly. Why is this hard?
- Distinguishing them from 'real' words
- Distinguishing them from 'real' noise

What are Self-Repairs?
Hindle '83: when people produce disfluent speech and correct themselves, they leave a trail behind. Hearers can compare the fluent finish with the disfluent start to determine what to 'replace' with what:
- This is a bad - a disastrous move
- 'a/DET bad/ADJ' vs. 'a/DET disastrous/ADJ'
Corpus: interview transcripts with correct parts of speech assigned.
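Hindle's idea can be sketched as follows: align the words after the interruption against the words before it by part-of-speech category, and replace backwards as far as the categories match. This is an illustrative sketch under those assumptions, not Hindle's actual deterministic parser; the `repair` function and its tagged-token representation are hypothetical.

```python
# Sketch of Hindle-style repair by part-of-speech matching (illustrative,
# not Hindle's actual parser; the representation is an assumption).

def repair(tagged, cut):
    """tagged: list of (word, pos) pairs; cut: index of the interruption
    point, i.e. where the repair begins. Replace the longest suffix of
    the disfluent start whose POS tags match the start of the repair."""
    before, after = tagged[:cut], tagged[cut:]
    k = 0
    for n in range(1, min(len(before), len(after)) + 1):
        # Do the last n POS tags before the cut match the first n after it?
        if [p for _, p in before[-n:]] == [p for _, p in after[:n]]:
            k = n
    # Keep the unreplaced part of the start, then the fluent finish.
    return before[:len(before) - k] + after

tagged = [("a", "DET"), ("bad", "ADJ"),          # disfluent start
          ("a", "DET"), ("disastrous", "ADJ"),   # fluent finish
          ("move", "N")]
print([w for w, _ in repair(tagged, cut=2)])
# ['a', 'disastrous', 'move']
```

The POS match ('DET ADJ' on both sides of the cut) tells the hearer that 'a disastrous' replaces 'a bad'.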

The 'Edit Signal'
How do hearers know what to keep and what to discard?
Hypothesis: speakers signal an upcoming repair by some acoustic/prosodic edit signal that tells hearers where the disfluent portion of speech ends and the correction begins:
- Reparandum - edit signal - repair
- What I {uh, I mean, I-, ...} what I said is
If there is an edit signal, what might it be?
- Filled pauses
- Explicit words
- Or some 'non-lexical' acoustic phenomenon

Categories of Self-Repairs (* marks the interruption)
- Same surface string: Well if they'd * if they'd...
- Same part of speech: I was just that * the kind of guy...
- Same syntactic constituent: I think that you get * it's more strict in Catholic schools
- Restarts are completely different: I just think * Do you want something to eat?

Hindle Category Distribution for One Interview
1,512 sentences, 544 repairs

Category                    N    %
Edit Signal Only            128  24%
Exact Match                 161  29%
Same POS                    47   9%
Same Syntactic Constituent  148  27%
Restart                     32   6%
Other                       28   5%

Bear et al. '92 Repairs
10,718 utterances; of 646 repairs:
- Most nontrivial repairs (339/436) involve matched strings of identical words
- Longer matched string: more likely a repair
- More words between matches: less likely a repair

Distribution of reparanda by length in words:

Len  N    %
1    376  59%
2    154  24%
3    52   8%
4    25   4%
5    23
6+   16   3%
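Bear et al.'s two observations (longer matched strings favor a repair; more intervening words disfavor one) suggest a pair of simple features. A minimal sketch, with a hypothetical helper rather than the SRI system's actual code:

```python
# Sketch of Bear et al.-style lexical-match features (hypothetical helper):
# find the longest pair of matching word strings within a small window and
# report its length and the number of words between the matches.

def match_features(words, max_gap=3):
    """Return (longest_match_len, words_between_matches) for the best pair
    of matching strings whose starts are at most max_gap + 1 apart."""
    best = (0, 0)
    n = len(words)
    for i in range(n):
        for j in range(i + 1, min(n, i + max_gap + 2)):
            k = 0
            # Extend the match while the two strings stay equal and disjoint.
            while j + k < n and i + k < j and words[i + k] == words[j + k]:
                k += 1
            if k > best[0]:
                best = (k, j - i - k)
    return best

# Exact repetition with no intervening words:
print(match_features("I'd like a a tall latte".split()))              # (1, 0)
# A match with two intervening words ("well uh"):
print(match_features("that's the well uh the Raritan line".split()))  # (1, 2)
```

On Bear et al.'s account, the first pattern (long match, small gap) is a likely repair; the second needs the correction-phrase cue as well.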

But is there an Edit Signal?
Definition: a reliable indicator that divides the reparandum from the repair.
In search of the edit signal: the RIM model of self-repairs (Nakatani & Hirschberg '94): Reparandum, Disfluency Interval (Interruption Site), Repair.
ATIS corpus: 6,414 turns with 346 (5.4%) repairs, 122 speakers, hand-labeled for repairs and prosodic features.
Findings:
- Reparanda: 73% end in word fragments; 30% end in glottalization (short, with no decrease in energy or f0); co-articulatory gestures.
- DI: significant differences in pause duration, but duration alone cannot predict disfluencies (too many false positives); small increase in f0 and amplitude across the DI into the repair.
- Repairs: offsets occur at phrase boundaries; phrasing differs from fluent speech.
CART prediction of repairs: 86% precision, 91% recall (192 of 223 interruption sites, 19 false positives). Most important features: duration of interval, fragment, pause filler, POS, lexical matching across the DI.


Lexical Class of Word Fragments Ending Reparandum

Class           N    %
Content words   128  43%
Function words  14   5%
?               156  52%

Length of Fragments at End of Reparandum

Syllables  N    %
0          119  40%
1          153  51%
2          25   8%
3               .3%

Length in Words of Reparandum

Length     Fragment Repairs (N=280)  Non-Fragment Repairs (N=102)
1          183  65%                  53  52%
2          64   23%                  33  32%
3          18   6%                   9   9%
4          6    2%
5 or more       3%                   5   5%

Type of Initial Phoneme in Fragment

Class of First Phoneme  % All Words  % All Fragments  % 1-Syl Fragments  % 1-C Fragments
Stop                    23%          23%              29%                12%
Vowel                   25%          13%              20%                0%
Fricative               33%          44%              27%                72%
Nasal/glide/liquid      18%          17%              15%
H                       1%           2%               4%
Total N                 64,896       298              153                119

Presence of Filled Pauses/Cue Phrases

              FP/Cue Phrases  Unfilled Pauses
Fragment      16              264
Non-Fragment  20              82

Duration of Pause

              Mean   SDev   N
Fluent Pause  513ms  676ms  1186
DI            334ms         346
Fragment      289ms  377ms  264
Non-Fragment  481ms  517ms  82

Is There an Edit Signal?
Findings:
- Reparanda: 73% end in fragments, 30% in glottalization; co-articulatory gestures
- DI: pausal duration differs significantly from fluent boundaries; small increase in f0 and amplitude
Speculation: the edit signal is articulatory disruption.
Are there edit signals?

With or Without an Edit Signal, How Might Hearers/Machines Process Disfluent Speech?
Parsing-based approaches (Weischedel & Black '80; Carbonell & Hayes '83; Hindle '83; Fink & Biermann '86):
- If two constituents of identical semantic/syntactic type are found where the grammar allows only one, delete the first
- Use an 'edit signal' or explicit words as cues
- Select the minimal constituent: Pick up the blue- green ball.
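The parsing rule can be sketched over a flat sequence of (category, words) constituents. This is an illustrative toy under that flat representation, not any of the cited parsers:

```python
# Toy sketch of the parsing-based rule: when two adjacent constituents of
# the same category fill a slot the grammar allows only once, delete the
# first (illustrative; not Weischedel & Black's or Hindle's actual system).

def delete_first_duplicate(constituents):
    """constituents: list of (category, words) pairs. Remove the first of
    any adjacent pair sharing a category."""
    out = []
    for cat, words in constituents:
        if out and out[-1][0] == cat:
            out.pop()  # the later constituent replaces the earlier one
        out.append((cat, words))
    return out

# "Pick up the blue- green ball": only the minimal constituent (the
# adjective) is replaced, not the whole noun phrase.
parse = [("V", ["Pick", "up"]), ("DET", ["the"]),
         ("ADJ", ["blue-"]), ("ADJ", ["green"]), ("N", ["ball"])]
print(delete_first_duplicate(parse))
# [('V', ['Pick', 'up']), ('DET', ['the']), ('ADJ', ['green']), ('N', ['ball'])]
```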

Results (Hindle '83): Detection and Correction
- Trivial (edit signal only): 128 (24%)
- Non-trivial: 388 (71%)

Pattern-Matching Approaches (Bear et al. '92)
Find candidate self-repairs using lexical matching rules:
- Exact repetitions within a window: I'd like a a tall latte.
- A pair of specified adjacent items: The a great place to visit.
- 'Correction phrases': That's the well uh the Raritan Line.
Filter using syntactic/semantic information: That's what I mean when I say it's too bad.
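These matching rules can be sketched as a small candidate generator. The sketch below is an illustrative reconstruction, not Bear et al.'s actual rule set, and the correction-phrase list is an assumption; candidates would then be filtered with syntactic/semantic information as the slide describes.

```python
# Illustrative sketch of pattern-matching repair detection (not Bear et
# al.'s actual rules; the correction-phrase inventory is assumed).

CORRECTION_PHRASES = {"uh", "um", "well", "i mean", "you know"}

def candidate_repairs(words, window=4):
    """Yield (reparandum_index, repair_index) pairs for candidate
    self-repairs found by simple lexical matching."""
    lowered = [w.lower() for w in words]
    for i, w in enumerate(lowered):
        # Rule: exact repetition of a word within the window.
        for j in range(i + 1, min(len(words), i + 1 + window)):
            if lowered[j] == w:
                yield (i, j)
        # Rule: a correction phrase flags the word before it as suspect.
        if w in CORRECTION_PHRASES:
            yield (max(i - 1, 0), i + 1)

print(list(candidate_repairs("I'd like a a tall latte".split())))
# [(2, 3)]
```

Over-generation is expected: the point of the later filtering step is to reject false candidates such as the repetition in "That's what I mean when I say it's too bad."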

Detection results:
- 201 'trivial' (fragments or filled pauses)
- Of 406 remaining: found 309 correctly (76% recall); hypothesized 191 incorrectly (61% precision)
- Adding 'trivial': 84% recall, 82% precision
Correcting is harder: corrects all 'trivial' repairs but only 57% of correctly identified non-trivial ones.

Machine Learning Approaches (Nakatani & Hirschberg '94)
CART prediction: 86% precision, 91% recall
Features: duration of interval, presence of fragment, pause filler, POS, lexical matching across the DI
Produces rules to use on unseen data
But... requires hand-labeled data
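The feature set above can be sketched as a per-boundary feature vector that a CART (decision-tree) learner would consume. The field names, the trailing-hyphen fragment convention, and the filler inventory are illustrative assumptions, not Nakatani & Hirschberg's actual representation:

```python
# Sketch of the per-boundary feature vector a CART learner would be
# trained on (field names and conventions are illustrative assumptions).

FILLERS = {"uh", "um"}

def boundary_features(pause_ms, prev_word, prev_pos, next_words):
    return {
        "interval_duration_ms": pause_ms,
        # Word fragments are transcribed with a trailing hyphen, e.g. "Ba-"
        "ends_in_fragment": prev_word.endswith("-"),
        # Is the first word after the boundary a pause filler?
        "pause_filler": bool(FILLERS & {w.lower() for w in next_words[:1]}),
        "prev_pos": prev_pos,
        # Does the pre-boundary word recur after the disfluency interval?
        "lexical_match_across_di": prev_word.rstrip("-").lower()
            in {w.lower() for w in next_words},
    }

feats = boundary_features(290, "Ba-", "NNP", ["uh", "Chicago"])
print(feats["ends_in_fragment"], feats["pause_filler"])  # True True
```

A decision tree trained on vectors like these yields human-readable rules, which is why the slide notes they can be applied to unseen data.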

State of the Art Today (Liu et al. 2002)
Detecting the Interruption Point using acoustic/prosodic and lexical features:
- Normalized duration and pitch features
- Voice quality features:
  - Jitter: perturbation in the pitch period
  - Spectral tilt: overall slope of the spectrum
  - Open quotient: ratio of the time the vocal folds are open to the total length of the glottal cycle
- Language models: words, POS, repetition patterns
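Jitter, for instance, is commonly computed as the mean absolute difference between consecutive pitch periods, normalized by the mean period. A minimal sketch of that standard "local jitter" formula (the function name is ours; Liu et al.'s exact extractor may differ):

```python
# Minimal sketch of local jitter: mean absolute difference between
# consecutive pitch periods, divided by the mean period (the standard
# local-jitter formula; not necessarily Liu et al.'s exact extractor).

def local_jitter(periods_ms):
    if len(periods_ms) < 2:
        return 0.0
    diffs = [abs(a - b) for a, b in zip(periods_ms, periods_ms[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods_ms) / len(periods_ms))

# A perfectly periodic voice has zero jitter...
print(local_jitter([10.0, 10.0, 10.0]))            # 0.0
# ...while perturbed pitch periods, as around an interruption point, do not.
print(round(local_jitter([10.0, 11.0, 9.5]), 3))   # 0.123
```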

Corpus: 1,593 Switchboard conversations, hand-labeled.
[Slide figure: the utterance "I hope to have to have", POS-tagged NP VB PREP VB PREP VB, annotated with Start, IP (interruption point), Repair, and End spans.]
Downsample to 50:50 IP/not-IP, since otherwise the baseline (always predict no IP) is already 96.2% accurate.
Results: prosody alone produces the best results on downsampled data (77% precision, 76% recall).
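The 50:50 downsampling step can be sketched as random undersampling of the majority class. This is an illustrative sketch of the experimental setup, not Liu et al.'s code:

```python
import random

# Sketch of downsampling to a 50:50 class balance by randomly discarding
# majority-class examples (illustrative; not Liu et al.'s actual procedure).

def downsample(examples):
    """examples: list of (features, is_ip) pairs. Keep every minority-class
    example plus an equal-sized random sample of the majority class."""
    pos = [e for e in examples if e[1]]
    neg = [e for e in examples if not e[1]]
    minority, majority = sorted([pos, neg], key=len)
    random.seed(0)  # fixed seed for reproducibility in this sketch
    balanced = minority + random.sample(majority, len(minority))
    random.shuffle(balanced)
    return balanced

# With ~4% IPs, the balanced set keeps all 40 IPs and 40 sampled non-IPs.
data = [({"w": i}, i % 25 == 0) for i in range(1000)]
out = downsample(data)
print(len(out), sum(1 for _, y in out if y))  # 80 40
```

This is why accuracy on the downsampled set is comparable to chance-corrected performance, while the non-downsampled numbers must be read against the 96.2% majority baseline.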

IP Detection: Precision/Recall
- Prosody + word LM + POS LM does best on non-downsampled data (57% precision, 81% recall)
IP Detection: overall accuracy
- Prosody alone on reference transcripts (77%) vs. ASR transcripts (73%), downsampled
- Word LM alone on reference transcripts (98%) vs. ASR transcripts (97%), non-downsampled
Finding the reparandum start:
- Rule-based system: 69% precision, 61% recall
- LM: 76% precision, 46% recall
Have we made progress?

Next Class
Segmentation for recognition