Understanding Spoken Corrections in Human-Computer Dialogue
Gina-Anne Levow, University of Chicago
MAICS, April 1, 2006

Error Correction Spiral
U: Give me the price for AT&T.
S: Hewlett-Packard was 83 ¾, up 2 ½ since yesterday.
U: Give me the price for AT&T.
S: Data General was at 10 ½, up a quarter.
U: Give me the price for AT&T.
S: Hewlett-Packard was 83 ¾, up 2 ½.
U: Give me the price for American Telephone and Telegraph.
S: Excuse me?

Identifying Corrections
Most "reasoning" approaches:
- Correction => violates a task or belief constraint
- Require a tight task and belief model
- Often require accurate recognition
This approach:
- Uses acoustic and lexical information
- Content- and context-independent

Accomplishments
- Corrections vs. original inputs: significant differences in duration, pause, and pitch
- Corrections vs. recognizer models: contrasts in phonology and duration
- Correction recognition: decision tree classifier, 65-77% accuracy (human baseline ~80%)

Why Corrections?
- Recognizer error rates are ~25-40%
- Corrections carry the REAL meaning of the utterance: the user's intent
- Corrections are misrecognized 2.5x as often as other utterances
- A system that is hard to correct is a poor-quality system

Why It's Necessary
- Error repair requires detection, but errors (e.g., misrecognitions) can be very difficult to detect
- Detection focuses repair efforts
- Corrections decrease recognition accuracy
- Adaptation requires identifying which utterances are corrections

Why Is It Hard?
- Recognition failures and errors: repetition ≠ correction
- 500 strings => 6700 instances (80%)
- Speech recognition technology treats variation as undesirable and suppresses it

Roadmap
- Data collection and description: SpeechActs system & field trial
- Characterizing corrections: original-repeat pair data; analysis of acoustic and phonological measures & results
- Recognizing corrections
- Conclusions and future work

SpeechActs System
- Speech-only system over the telephone (Yankelovich, Levow & Marx 1995)
- Access to common desktop applications: email, calendar, weather, stock quotes
- BBN's Hark speech recognition; Centigram TruVoice speech synthesis
- In-house: natural language analysis, back-end applications, dialog manager

System Data Overview
- Approximately 60 hours of interactions, digitized at 8 kHz, 8-bit mu-law encoding
- 18 subjects: 14 novices, 4 experts, single shots
- 7529 user utterances, 1961 errors
- P(error | correct) = 18%; P(error | error) = 44%
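
These two conditional probabilities are what drive the error spiral shown earlier: an utterance following an error is misrecognized far more often (44%) than one following a success (18%). A minimal sketch in Python, illustrative only, of the chance a spiral persists for n turns under a simple two-state assumption using the figures on this slide:

    # Conditional error rates reported for the SpeechActs field trial
    P_ERR_AFTER_CORRECT = 0.18   # P(error | previous turn correct)
    P_ERR_AFTER_ERROR = 0.44     # P(error | previous turn in error)

    def spiral_probability(n):
        """Probability of an error spiral lasting at least n consecutive turns:
        one initial error, then n-1 further errors each conditioned on the last."""
        return P_ERR_AFTER_CORRECT * P_ERR_AFTER_ERROR ** (n - 1)

    for n in range(1, 5):
        print(f"P(spiral of length >= {n}) = {spiral_probability(n):.3f}")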

System: Recognition Error Types
Rejection errors: input falls below the recognition threshold
  U: Switch to Weather
  S (heard): (nothing)
  S (said): Huh?
Misrecognition errors: substitution in the recognized text
  U: Switch to Weather
  S (heard): Switch to Calendar
  S (said): On Friday, May 4, you have talk at Chicago
Rejections: ~2/3; misrecognitions: ~1/3 (706)

Analysis: Data
- 300 original input-repeat correction pairs, lexically matched, same speaker
- Example:
  S (said): Please say mail, calendar, weather.
  U: Switch to Weather. [Original]
  S (said): Huh?
  U: Switch to Weather. [Repeat]

Analysis: Duration
- Automatic forced alignment, hand-edited
- Total duration: speech onset to end of utterance
- Speech duration: total minus internal silence
- Contrasts, original input vs. repeat correction:
  - Total: increases 12.5% on average
  - Speech: increases 9% on average

Analysis: Pause
- Utterance-internal silence > 10 ms, excluding silences preceding unvoiced stops (t) and affricates (ch)
- Contrasts, original input vs. repeat correction:
  - Absolute pause duration: 46% increase
  - Ratio of silence to total duration: 58% increase
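
A minimal sketch of how the duration and pause measures above could be computed from a word-level forced alignment. The alignment format here (a list of (word, start, end) tuples in seconds) is an assumption for illustration, not the format used in the study, and the exclusion of silences before unvoiced stops and affricates is omitted for brevity:

    # Hypothetical alignment: (word, start_time, end_time) in seconds,
    # as produced by a forced aligner and hand-corrected.
    alignment = [("switch", 0.00, 0.35), ("to", 0.42, 0.55), ("weather", 0.61, 1.20)]

    def duration_features(alignment, min_pause=0.010):
        """Total duration, speech duration (total minus internal silence),
        and pause measures, following the definitions on the slides."""
        total = alignment[-1][2] - alignment[0][1]   # speech onset to end of utterance
        pauses = [nxt[1] - cur[2]                    # gaps between adjacent words
                  for cur, nxt in zip(alignment, alignment[1:])
                  if nxt[1] - cur[2] > min_pause]    # internal silence > 10 ms
        pause_total = sum(pauses)
        return {"total": total,
                "speech": total - pause_total,
                "pause": pause_total,
                "pause_ratio": pause_total / total}

    print(duration_features(alignment))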

Pitch Tracks

Analysis: Pitch I
- ESPS/Waves+ pitch tracker, hand-edited
- Normalized per subject: (value - subject mean) / (subject std dev)
- Measures: pitch maximum, minimum, and range, over the whole utterance and the last word
- Contrasts, original input vs. repeat correction: significant decrease in pitch minimum, for both the whole utterance and the last word
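
The per-subject normalization on this slide is a standard z-score. A minimal sketch in Python (the example values are illustrative, not from the corpus):

    import statistics

    def normalize_per_subject(values):
        """Z-score normalize one subject's pitch values:
        (value - subject mean) / (subject std dev), as on the slide."""
        mean = statistics.mean(values)
        std = statistics.stdev(values)
        return [(v - mean) / std for v in values]

    # e.g., one subject's utterance-level pitch minima in Hz
    print(normalize_per_subject([110.0, 95.0, 102.0, 88.0]))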

Analysis: Pitch II

Analysis: Overview
- Significant differences, original vs. correction:
  - Duration & pause: significant increases
  - Pitch: significant decrease in pitch minimum; increase in final falling contours
- A conversational-to-clear speech shift

Analysis: Phonology
- Reduced form => citation form:
  - Schwa to unreduced vowel (~20 instances), e.g., "Switch t' mail" => "Switch to mail"
  - Unreleased or flapped 't' => released 't' (~50 instances), e.g., "Read message tweny" => "Read message twenty"
- Citation form => hyperclear form:
  - Extreme lengthening, calling intonation (~20 instances), e.g., "Goodbye" => "Goodba-aye"

Durational Model Contrasts I
Original inputs, final vs. non-final position (chart: departure from model mean in standard deviations, by number of words, for non-final and final words)

Durational Model Contrasts
(chart: departure from model mean in standard deviations, by number of words, non-final vs. final)
- Compared to the speech recognizer's durational model (Chung 1995): phrase-final lengthening
- Words in final position are significantly longer than non-final words, and longer than the model's prediction
- All words are significantly longer in correction utterances
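
A minimal sketch of the "departure from model mean" measure charted here: each word's observed duration is compared with the recognizer model's mean for that word, expressed in model standard deviations. The per-word model statistics below are invented placeholders, not values from the actual recognizer:

    # Hypothetical per-word duration statistics from the recognizer's model:
    # word -> (mean duration, std dev), in seconds. Values are placeholders.
    model = {"switch": (0.30, 0.05), "to": (0.10, 0.03), "weather": (0.45, 0.08)}

    def departure(word, observed_duration):
        """Observed duration's departure from the model mean, in std devs."""
        mean, std = model[word]
        return (observed_duration - mean) / std

    # Final words in corrections tend to show the largest positive departures.
    for word, dur in [("switch", 0.35), ("to", 0.13), ("weather", 0.59)]:
        print(word, round(departure(word, dur), 2))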

Durational Model Contrasts II
Correction utterances show greater increases (chart: departure from model mean in standard deviations, by number of words, non-final vs. final)

Analysis: Overview II
Original vs. correction, relative to the recognizer model:
- Phonology: reduced form => citation form => hyperclear form; a conversational-to-(hyper)clear shift
- Duration: contrast between final and non-final words; departure from the ASR model increases for corrections, especially on final words

Automatic Recognition of Spoken Corrections
- Machine learning classifier: decision trees, trained on labeled examples
- Features: duration, pause, pitch
- Evaluation:
  - Overall: 65% accuracy (incl. text features); key features: absolute and normalized duration
  - Misrecognitions: 77% accuracy (incl. text features); key features: absolute and normalized duration, pitch
  - 65% accuracy with acoustic features only
  - Approaches the human baseline of 79.4%

Accomplishments
- Contrasts between originals and corrections: significant differences in duration, pause, and pitch
- Conversational-to-clear speech shifts; shifts away from recognizer models
- Corrections recognized at 65-77% accuracy, near human levels

Future Work
- Modify the ASR duration model for corrections, to reflect phonological and durational change
- Identify the locus of correction for misrecognitions:
  U: Switch to Weather
  S (heard): Switch to Calendar
  S (said): On Friday, May 4, you have talk at Chicago.
  U: Switch to WEATHER!
- Preliminary tests: 26/28 corrected words detected, 2 false alarms (see the sketch below)
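
One plausible way to flag the locus of a correction, consistent with the analyses above: the corrected word ("WEATHER!") is typically hyperarticulated, so its duration departs most from the recognizer's durational model. This is an illustrative sketch of that idea, not the method actually tested in the preliminary experiments, and the model statistics are placeholders:

    # Placeholder per-word duration model: word -> (mean, std dev) in seconds.
    model = {"switch": (0.30, 0.05), "to": (0.10, 0.03), "weather": (0.45, 0.08)}

    def correction_locus(alignment, threshold=2.0):
        """Flag words whose duration exceeds the model mean by more than
        `threshold` standard deviations as candidate correction loci."""
        loci = []
        for word, start, end in alignment:
            mean, std = model[word]
            if (end - start - mean) / std > threshold:
                loci.append(word)
        return loci

    # "WEATHER!" lengthened to 0.80 s stands out against the model.
    print(correction_locus([("switch", 0.0, 0.33), ("to", 0.4, 0.51),
                            ("weather", 0.6, 1.40)]))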

Future Work
- Identify and exploit cues to discourse and information structure
- Incorporate prosodic features into models of spoken dialogue
- Exploit text and acoustic features for segmentation of broadcast audio and video:
  - A necessary first phase for information retrieval
  - Assess language independence
  - First phase: segmentation of Mandarin and Cantonese broadcast news (in collaboration with CUHK)

Classification of Spoken Corrections
- Decision trees
  - Pros: intelligible; robust to irrelevant attributes
  - Cons: rectangular decision boundaries; don't combine features
- Features (38 total; 15 in the best trees): duration, pause, pitch, and amplitude, both normalized and absolute
- Training and testing: 50% original inputs, 50% repeat corrections; 7-way cross-validation (see the sketch below)
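
A minimal sketch of this training setup using scikit-learn (not the original toolkit, which predates it; the feature values below are random placeholders standing in for the 38 prosodic features):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    # Placeholder data: 300 utterances x 38 prosodic features
    # (duration, pause, pitch, amplitude; normalized and absolute).
    X = rng.normal(size=(300, 38))
    y = np.array([0, 1] * 150)        # 50% originals (0), 50% corrections (1)

    clf = DecisionTreeClassifier(min_samples_leaf=10)  # cf. "minimum of 10 nodes per branch"
    scores = cross_val_score(clf, X, y, cv=7)          # 7-way cross-validation
    print(f"mean accuracy: {scores.mean():.2f}")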

Recognizer: Results (Overall)
- Tree size: 57 nodes unpruned, 37 pruned; minimum of 10 nodes per branch required
- First split: normalized duration (in all trees)
- Most important features: normalized & absolute duration, speaking rate
- 65% accuracy (null baseline: 50%)

Example Tree

Classifier Results: Misrecognitions
- Most important features: absolute and normalized duration; pitch minimum and pitch slope
- 77% accuracy (with text features); 65% with acoustic features only
- Null baseline: 50%; human baseline: 79.4% (Hauptmann & Rudnicky 1987)

Misrecognition Classifier

Background & Related Work
- Detecting and preventing miscommunication (Smith & Gordon 96; Traum & Dillenbourg 96)
- Identifying discourse structure in speech:
  - Prosody (Grosz & Hirschberg 92; Swerts & Ostendorf 95)
  - Cue words + prosody (Taylor et al. 96; Hirschberg & Litman 93)
- Self-repairs (Heeman & Allen 94; Bear et al. 92); acoustic-only (Nakatani & Hirschberg 94; Shriberg et al. 97)
- Speaking modes (Ostendorf et al. 96; Daly & Zue 96)
- Spoken corrections: human baseline (Rudnicky & Hauptmann 87); (Oviatt et al. 96, 98; Levow 98, 99; Hirschberg et al. 99, 00)
- Other languages (Bell & Gustafson 99; Pirker et al. 99; Fischer 99)

Learning Method Options
- (K)-nearest neighbor
  - Cons: needs commensurable attribute values; sensitive to irrelevant attributes; labeling speed scales with training set size
- Neural nets
  - Cons: hard to interpret; can require more computation & training data
  - Pros: fast and accurate once trained
- Decision trees (chosen)
  - Pros: intelligible; robust to irrelevant attributes; fast and compact once trained
  - Cons: rectangular decision boundaries; don't test feature combinations