Misrecognitions and Corrections in Spoken Dialogue Systems Diane Litman AT&T Labs -- Research (Joint Work With Julia Hirschberg, AT&T, and Marc Swerts, IPO)

Talking to a Machine… and Getting an Answer
Today’s spoken dialogue systems make it possible to accomplish real tasks, over the phone, without talking to a person:
–Real-time speech technology enables real-time interaction
–Speech recognition and understanding are ‘good enough’ for limited, goal-directed interactions
–Careful dialogue design can be tailored to the capabilities of the component technologies: a limited domain and judicious use of system initiative vs. mixed initiative

Some Representative Spoken Dialogue Systems
[Chart plotting systems from deployed to research along a system-initiative / mixed-initiative / user-initiative axis: Banking (ANSER), ATIS (DARPA Travel), MIT Galaxy/Jupiter, Directory Assistant (BNR), Multimodal Maps (Trains, Quickset), Customer Care (HMIHY – AT&T), Communications (Wildfire, Portico), Train Schedule (ARISE), Communicator (DARPA Travel), Brokerage (Schwab-Nuance), Air Travel (UA Info-SpeechWorks), Access (myTalk)]

But… Systems Have Trouble Knowing When They’ve Made a Mistake
–Hard: correcting system misconceptions (Krahmer et al. ’99)
User: I want to go to Boston.
System: What day do you want to go to Baltimore?
–Easier: answering explicit requests for confirmation or responding to ASR rejections
System: Did you say you want to go to Baltimore?
System: I'm sorry. I didn't understand you. Could you please repeat your utterance?
–But constant confirmation or over-cautious rejection lengthens dialogue and decreases user satisfaction

…And Systems Have Trouble Recognizing User Corrections
–Probability of recognition failure increases after a misrecognition (Levow ’98)
–Corrections of system errors are often hyperarticulated (louder, slower, more internal pauses, exaggerated pronunciation) → more ASR error (Wade et al. ’92, Oviatt et al. ’96, Swerts & Ostendorf ’97, Levow ’98, Bell & Gustafson ’99)

Can Prosodic Information Help Systems Perform Better?
If errors occur where speaker turns are prosodically ‘marked’…
–Can we recognize turns that will be misrecognized by examining their prosody?
–Can we modify our dialogue and recognition strategies to handle corrections more appropriately?

Current Study
–Corpus of recognized speech collected by an interactive voice response system
–Identify speaker ‘turns’:
  –that are incorrectly transcribed or semantically incorrect (misrecognitions)
  –at which speakers first become aware of a misrecognition (aware sites)
  –that correct misrecognitions (corrections)
–Identify prosodic features of turns in each category and compare to turns not so classified
–Use machine learning techniques to train a classifier to make these distinctions automatically

Turn Types
TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you?
User: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening. [misrecognition]
TOOT: Which city do you want to go to?
User: New York. [aware site, correction]

TOOT Dialogues
Collected to study effects of differences in dialogue strategy on user performance and satisfaction (Litman & Pan ’99):
–type of initiative (system, user, mixed)
–type of confirmation (explicit, implicit, none)
–adaptability condition
Subjects:
–39 summer students
–16/23 (F/M)
–20/19 (native speaker/non)
Platform: combined over-the-phone ASR and TTS (Kamm et al. ’97) with web access to train information
Task: find train information for 4 scenarios

Corpus for current study:
–152 dialogues
–2328 speaker turns (1975 with ASR hypothesis)
Misrecognitions:
–Overall word accuracy: 49% (WER 51%)
–Overall concept accuracy (CA): 71%
–Example: “I want to go to Boston from Philadelphia” (2 domain concepts) recognized as “I want to go to Boston” (1 concept) = CA 50%
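
Both metrics can be made concrete in a few lines. The sketch below is illustrative only: WER is computed by standard Levenshtein alignment, and concept accuracy is taken as the fraction of a turn’s domain concepts that survive in the hypothesis, with concepts represented as a hypothetical set of attribute=value strings rather than the system’s actual semantic representation.

```python
# Minimal sketch of the two per-turn accuracy metrics used on this corpus.
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    via standard Levenshtein alignment."""
    d = [[0] * (len(hypothesis) + 1) for _ in range(len(reference) + 1)]
    for i in range(len(reference) + 1):
        d[i][0] = i
    for j in range(len(hypothesis) + 1):
        d[0][j] = j
    for i in range(1, len(reference) + 1):
        for j in range(1, len(hypothesis) + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / max(len(reference), 1)

def concept_accuracy(ref_concepts: set[str], hyp_concepts: set[str]) -> float:
    """Fraction of the reference turn's domain concepts preserved in the
    recognized string (hypothetical set representation)."""
    if not ref_concepts:
        return 1.0
    return len(ref_concepts & hyp_concepts) / len(ref_concepts)

# The slide's example: two concepts uttered, one recognized -> CA = 50%.
ref = "I want to go to Boston from Philadelphia".split()
hyp = "I want to go to Boston".split()
print(word_error_rate(ref, hyp))                                   # 0.25
print(concept_accuracy({"dest=Boston", "orig=Philadelphia"},
                       {"dest=Boston"}))                            # 0.5
```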

A Successful Dialogue
S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
U: I would like a train to New York City from Philadelphia on Sunday at ten thirty P M.
S: I heard you say “go to New York City from Philadelphia on Sunday approximately at ten thirty p.m." Do you want me to find the trains from Philadelphia to New York City on Sunday approximately at ten thirty p.m. now?
U: Yes.
S: I am going to get the train schedule for you....

An Unsuccessful Dialogue
S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
U: I'd like to find a train from Ch- to Chicago from Baltimore.
S: Sorry, I can't understand you. Please repeat your utterance.
U: I'd like to find a train from Baltimore to Chicago.
S: I heard you say “leave today around 5 thirty in the morning." Which city do you want to go to?
U: I would like to leave Saturday at eight in the morning.
S: You can say the arrival city name, such as “New York City."
U: Baltimore to Chicago.
S: You can say the arrival city name, such as “New York City.”
U: Baltimore to Chicago. (I hate it.)

U: Train from Baltimore... S: ...I heard you s... U: to Chicago.... Hello?
S: You can say the name of your departure city, such as "New York City."
U: Baltimore. Bal-ti-more....
S: You can say...
U: Baltimore.
S: Do you want to continue and find a new set of trains?
U: Yes.
S: You can say "yes" or "I am done here" to find a new set of trains.
U: Yes.
S: Thank you for using AT&T Amtrak train time table system. See you next time.
U: I nev-

Descriptive Analyses
–For each speaker, for each prosodic feature, calculate mean values for, e.g., all correctly recognized speaker turns and for all incorrectly recognized turns
–Perform paired t-tests on these speaker pairs of means (e.g., for each speaker, pairing mean values for correctly and incorrectly recognized turns)
–Similarly for corrections vs. non-corrections and aware sites vs. other turns
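
A minimal sketch of this analysis, assuming per-turn feature values have already been extracted. The field names (speaker, misrecognized, f0_max) and the toy data are illustrative; scipy’s ttest_rel supplies the paired t-test.

```python
# Per-speaker means for correct vs. misrecognized turns, then a paired t-test.
from collections import defaultdict
from statistics import mean
from scipy.stats import ttest_rel

def paired_means(turns, feature):
    """Return per-speaker mean feature values for (correct, misrecognized) turns."""
    by_speaker = defaultdict(lambda: ([], []))
    for t in turns:
        # bool index: 0 = correctly recognized, 1 = misrecognized
        by_speaker[t["speaker"]][t["misrecognized"]].append(t[feature])
    correct, wrong = [], []
    for ok_vals, bad_vals in by_speaker.values():
        if ok_vals and bad_vals:      # speaker must contribute to both cells
            correct.append(mean(ok_vals))
            wrong.append(mean(bad_vals))
    return correct, wrong

# Toy data: two speakers, a few turns each (values are made up).
turns = [
    {"speaker": "s1", "misrecognized": False, "f0_max": 210.0},
    {"speaker": "s1", "misrecognized": True,  "f0_max": 260.0},
    {"speaker": "s2", "misrecognized": False, "f0_max": 180.0},
    {"speaker": "s2", "misrecognized": True,  "f0_max": 230.0},
    {"speaker": "s2", "misrecognized": False, "f0_max": 175.0},
]
correct, wrong = paired_means(turns, "f0_max")
t, p = ttest_rel(correct, wrong)
print(f"f0 max: t={t:.2f}, p={p:.3f}")
```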

Prosodic Features Examined per Turn
Raw prosodic/acoustic features:
–f0 maximum and mean (pitch excursion/range)
–rms maximum and mean (amplitude)
–total duration
–duration of preceding silence
–amount of silence within turn
–speaking rate (estimated as syllables of the recognized string per second)
Normalized versions of each feature (compared to the first turn in the task, to the previous turn in the task, and as Z scores)
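
A minimal sketch of the normalization step for one feature within one task. The slide does not say whether “compared to” means a ratio or a difference; ratios are assumed here, and the feature values are made up.

```python
# Three normalized versions of one raw per-turn feature within a task.
import numpy as np

def normalize_feature(values):
    """Given one task's per-turn raw values, return the value relative to the
    first turn, relative to the previous turn, and as Z scores."""
    values = np.asarray(values, dtype=float)
    vs_first = values / values[0]                 # vs. first turn in task
    vs_prev = np.empty_like(values)
    vs_prev[0] = 1.0                              # first turn has no predecessor
    vs_prev[1:] = values[1:] / values[:-1]        # vs. previous turn in task
    z = (values - values.mean()) / values.std()   # Z scores within task
    return vs_first, vs_prev, z

f0_max = [210.0, 260.0, 230.0, 300.0]             # toy per-turn values
print(normalize_feature(f0_max))
```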

Predicting Turn Types Using Machine Learning
Used Ripper (Cohen ’96) to automatically induce rule sets for predicting WER and CA scores:
–greedy search guided by a measure of information gain
–input: vectors of feature values
–output: ordered rules for predicting the dependent variable, plus (cross-validated) scores for each rule set, which identify the best performers and likely performance on unseen data
Features examined:
–all prosodic features, raw and normalized
–original experimental conditions (adaptability of system, initiative type, style of confirmation, subject, task)
–gender, native/non-native status
–ASR recognized string, grammar, and acoustic confidence score
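
Ripper itself is not bundled with today’s common Python toolkits, so the sketch below uses a shallow scikit-learn decision tree purely as a stand-in rule learner, to make the pipeline concrete: feature vectors in, cross-validated error and human-readable rules out. All feature names and values are invented.

```python
# Stand-in for the Ripper setup: induce readable rules from per-turn feature
# vectors and estimate error by cross-validation. A shallow decision tree
# substitutes for Ripper here; it is not the rule learner used in the talk.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical per-turn feature vectors (prosody + conditions + ASR features).
df = pd.DataFrame({
    "f0_max_z": [1.2, -0.3, 0.8, -1.1, 0.9, -0.4],
    "dur_z":    [0.9, -0.5, 1.4, -0.9, 1.1, -0.2],
    "asr_conf": [-3.1, -1.2, -2.8, -0.9, -2.5, -1.0],
    "strat":    ["UserNoConfirm", "SystemExplicit", "UserNoConfirm",
                 "MixedImplicit", "UserNoConfirm", "SystemExplicit"],
    "misrec":   [True, False, True, False, True, False],
})
X = pd.get_dummies(df.drop(columns="misrec"))   # one-hot categorical features
y = df["misrec"]

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
scores = cross_val_score(clf, X, y, cv=3)       # cross-validated accuracy
print(f"estimated error: {1 - scores.mean():.1%}")

clf.fit(X, y)
print(export_text(clf, feature_names=list(X.columns)))  # readable rules
```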

Turn Types
TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you?
User: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening. [misrecognition]
TOOT: Which city do you want to go to?
User: New York.

Distinguishing Correct Recognitions from Misrecognitions (NAACL ‘00)
Misrecognitions differ prosodically from correct recognitions:
–F0 maximum (higher)
–RMS maximum (louder)
–turn duration (longer)
–preceding pause (longer)
The effect holds up across speakers, and even when hyperarticulated turns are excluded
Prosodic features plus other automatically available features (acoustic confidence scores, recognized string, grammar) predict WER-based misrecognitions with ~6.5% error, better than the ~22% error of acoustic confidence measures alone

Turn Types
TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you?
User: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening.
TOOT: Which city do you want to go to?
User: New York. [aware site]

Awareness Sites
–Shorter, somewhat louder, and with less internal silence than other turns
–Poorly recognized (47% misrecognized vs. 32% of other turns)
–Machine learning experiments:
  –30% baseline (not aware site)
  –Average error: 25.81% +/- 1.10%

Awareness Sites

ML Rules for Aware Site Prediction
–30% baseline (not aware site)
–Average error: 25.81% +/- 1.10%
TRUE :- strat=UserNoConfirm, f0max =0.22
TRUE :- strat=UserNoConfirm, gram=yes_no_more, zeros =874.91, dur>=0.52
TRUE :- ppau>=0.47, dur =
TRUE :- rmsav>=838.67, f0max<=
TRUE :- stringsvals ~ cancel
TRUE :- asr =0.25
TRUE :- stringsvals ~ no, rmsmax>= (37/23).
default FALSE

Turn Types
TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you?
User: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening.
TOOT: Which city do you want to go to?
User: New York. [correction]

Speaker Corrections: A Serious Problem for Spoken Dialogue Systems (ICSLP ‘00)
–29% of turns in our corpus are corrections
–52% of corrections are hyperarticulated, but only 12% of other turns are
–Corrections are misrecognized at least twice as often as non-corrections (61% vs. 33%)
–But corrections are no more likely to be rejected than non-corrections (9% vs. 8%)

Prosodic Indicators of Corrections
Corrections differ from other turns prosodically: longer, louder, higher in pitch excursion, longer preceding pause, less internal silence

ML Rules for Correction Prediction
–Baseline: 30% (not correction)
–norm’d prosody + non-prosody: 18.45% +/- 0.78% error
–automatic features: 21.48% +/- 0.68% error
TRUE :- gram=universal, f0max>=0.96, dur>=6.55.
TRUE :- gram=universal, zeros>=0.57, asr<=
TRUE :- gram=universal, f0max =1.21, zeros>=0.71.
TRUE :- dur>=0.76, asr<=-2.97, strat=UserNoConfirm.
TRUE :- dur>=1.00, dur>=2.28, ppau<=0.86.
TRUE :- rmsav>=1.11, strat=MixedImplicit, gram=cityname, f0max>=0.70.
default FALSE.
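
Rule sets like these are ordered: the first rule whose conditions all hold assigns the label, and the default (FALSE) applies when none fires. A minimal sketch of applying such a list, using two of the legible rules above (the redundant dur>=1.00 conjunct is dropped from the second):

```python
# Applying a Ripper-style ordered rule list: the first rule whose conditions
# all hold decides the label; otherwise the default applies. Conditions are
# taken from the two fully legible rules on the slide above.
RULES = [
    (lambda t: t["gram"] == "universal" and t["f0_max"] >= 0.96 and t["dur"] >= 6.55, True),
    (lambda t: t["dur"] >= 2.28 and t["ppau"] <= 0.86, True),
]
DEFAULT = False

def classify(turn: dict) -> bool:
    for condition, label in RULES:
        if condition(turn):
            return label
    return DEFAULT

# Toy turn: fires the first rule.
turn = {"gram": "universal", "f0_max": 1.1, "dur": 7.0, "ppau": 0.3}
print(classify(turn))  # True
```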

When Users Correct, What Do They Do?
–Correction classes: ‘omits’ and ‘repetitions’ lead to fewer misrecognitions than ‘adds’ and ‘paraphrases’
–Turns that correct rejections are more likely to be repetitions, while turns correcting misrecognitions are more likely to be omits

Corrections and System Strategies
[Table of per-task means for the MixedImplicit, SystemExplicit, and UserNoConfirm strategies: mean #turns, mean #corrections, mean #misrecognitions, mean #misrecognized corrections]
Matches user responses:
–Would you use again (1-5): SE (3.5), MI (2.6), UNC (1.7)
–Satisfaction (0-40): SE (31.25), MI (24.00), UNC (22.10)

Type of correction is sensitive to strategy:
–Users are much more likely to exactly repeat their misrecognized utterance in a system-initiative environment
–Users are much more likely to correct by omitting information with no system confirmation than with explicit confirmation
–Omits are used more in the MixedImplicit and UserNoConfirm conditions
“Restarts” are unlikely to be recognized (77% misrecognized) and skewed in distribution:
–31% of corrections are restarts in MI and UNC
–None in SE, where initial turns are well recognized
–It doesn’t pay to start over!

Correction Chains
U: I'd like to find a train from Ch- to Chicago from Baltimore.
S: Sorry, I can't understand you. Please repeat your utterance.
U: I'd like to find a train from Baltimore to Chicago.
….
S: You can say the arrival city name, such as “New York City."
U: Baltimore to Chicago.
S: You can say the arrival city name, such as “New York City.”
U: Baltimore to Chicago. (I hate it.)
U: Train from Baltimore... S: ...I heard you s... U: to Chicago.... Hello?
S: You can say the name of your departure city, such as "New York City."
U: Baltimore. Bal-ti-more....
S: You can say...
U: Baltimore.

Effects of Distance from Error on User Corrections
–The further a correction is from the original error, the higher its f0 (max, mean), the lower its rms (max, mean), the longer it is (seconds, words), the slower its rate, the longer its preceding pause, and the lower its CA
–Corrections further in chain position from the error are similar, and their word accuracy is also lower
–But corrections more distant from the immediately preceding error, while similar in f0, rms, and duration, are actually more likely to be correctly recognized (WER). Why? It is not a difference among strategies.

Related Work
–Predicting and adapting to dialogue-level (Litman and Pan ‘00) and utterance-level misrecognitions (Levow ‘98, Smith ‘98)
–Adaptation over time using reinforcement learning (Levin & Pieraccini ‘97, Singh et al. ‘00)
–Speaking style and prosody in ASR (Weintraub et al. ‘96, Wang & Hirschberg ‘93, Hirose ‘97)
–Hyperarticulation, corrections and misrecognitions (Wade et al. ‘92, Oviatt et al. ‘96, Swerts & Ostendorf ‘97, Bell & Gustafson ‘99)

Current and Future Research
–Generalized the misrecognition results to a new corpus and recognizer (ICSLP ‘00)
–Can we identify misrecognitions more accurately using correction and “aware” prediction?
–Can we predict (only) awares vs. (only) corrections vs. aware/corrections vs. “normal”?
–Could a speech recognizer trained on corrections recognize them better?
–Can we predict “goats” vs. “sheep” using prosody? (ASRU ‘99)

Summary
–Misrecognitions are prosodically different from correct recognitions (higher, louder, longer, longer preceding pause)
–Corrections are prosodically different from non-corrections (higher, louder, longer, longer preceding pause, less internal silence)
–Prosodic and other differences can be used to predict misrecognitions (6.5% error) and their corrections (18.5% error)
–Corrections vary with strategy and with distance from the error