Characterizing and Recognizing Spoken Corrections in Human-Computer Dialog Gina-Anne Levow August 25, 1998.

Error Correction Spiral
  U: Give me the price for AT&T.
  S: Hewlett-Packard was 83 ¾, up 2 ½ since yesterday.
  U: Give me the price for AT&T.
  S: Data General was at 10 ½ up a quarter.
  U: Give me the price for AT&T.
  S: Hewlett-Packard was 83 ¾, up 2 ½.
  U: Give me the price for American Telephone and Telegraph.
  S: Excuse me?

Identifying Corrections
  Most "Reasoning" Approaches
    Correction => Violates Task, Belief Constraint
    Requires Tight Task, Belief Model
    Often Requires Accurate Recognition
  This Approach
    Uses Acoustic or Lexical Information
    Content, Context Independent

Accomplishments
  Corrections vs Original Inputs
    Significant Differences: Duration, Pause, Pitch
  Corrections vs Recognizer Models
    Contrasts: Phonology and Duration
  Correction Recognition
    Decision Tree Classifier: 65-77% accuracy
    Human Baseline ~80%

Why Corrections?
  Recognizer Error Rates ~25%
  REAL meaning of utterance user intent
  Corrections misrecognized 2.5X as often
  Hard to Correct => Poor Quality System

Why it's Necessary
  Error Repair Requires Detection
    Errors can be very difficult to detect
    E.g. Misrecognitions
  Focus Repair Efforts
  Corrections Decrease Recognition Accuracy
    Adaptation Requires Identification

Why is it Hard?
  Recognition Failures and Errors
  Repetition <> Correction
    500 Strings => 6700 Instances (80%)
  Speech Recognition Technology
    Variation - Undesirable, Suppressed

Corrections Within

Roadmap
  Data Collection and Description
    SpeechActs System & Field Trial
  Characterizing Corrections
    Original-Repeat Pair Data Analysis
    Acoustic and Phonological Measures & Results
  Recognizing Corrections
  Conclusions and Future Work

SpeechActs System
  Speech-Only System over the Telephone
  Access to Common Desktop Applications: Mail, Calendar, Weather, Stock Quotes
  BBN's Hark Speech Recognition, Centigram TruVoice Speech Synthesis
  In-house: Natural Language Analysis, Back-end Applications, Dialog Manager

Example Interaction
  SpeechActs: You have 14 new messages and 47 old messages. The first new message is from Eric Baatz regarding "report completed".
  User: Let me hear it.
  SpeechActs: "The first draft is ready for your comments."
  User: Reply and include the current message.
  SpeechActs: Preparing message to Eric Baatz. Begin recording after the tone. When finished, pause for several seconds.
  User: Eric, I'll get those to you this afternoon.
  SpeechActs: Say cancel, send or review.

Example Interaction: Cont'd
  User: Send.
  SpeechActs: Message Sent. What now?
  User: Next message.
  SpeechActs: New Message two from Stuart Adams, subject "Can we meet at 3:00 today?"
  User: Switch to Calendar.

System Data Overview
  Approximately 60 hours of interactions
  Digitized at 8kHz, 8-bit mu-law encoding
  18 subjects: 14 novices, 4 experts, single shots
  7529 user utterances, 1961 errors (~25% error)
  P(error | correct) = 18%; P(error | error) = 44%
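The two conditional rates can be estimated from an ordered log of per-utterance outcomes. The sketch below assumes that "P(error | correct)" means the chance of a recognition error given that the previous utterance was recognized correctly, and likewise for "P(error | error)". The function name and the toy flag sequence are hypothetical, not taken from the SpeechActs data.

```python
def conditional_error_rates(errors):
    """Estimate P(error | previous correct) and P(error | previous error).

    errors: list of booleans in utterance order; True marks a recognition error.
    Assumption: conditioning is on the immediately preceding utterance.
    """
    after_ok = [cur for prev, cur in zip(errors, errors[1:]) if not prev]
    after_err = [cur for prev, cur in zip(errors, errors[1:]) if prev]
    p_err_given_ok = sum(after_ok) / len(after_ok) if after_ok else float("nan")
    p_err_given_err = sum(after_err) / len(after_err) if after_err else float("nan")
    return p_err_given_ok, p_err_given_err


# Toy sequence (hypothetical), not the field-trial log.
flags = [False, True, True, False, False, True, False, False]
p_ok, p_err = conditional_error_rates(flags)
print(p_ok, p_err)
```

A higher rate after an error than after a correct recognition is what the slide's 44% vs 18% contrast describes.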

System: Recognition Error Types
  Rejection Errors - Below Recognition Level
    U: Switch to Weather
    S (Heard):
    S (Said): Huh?
  Misrecognition Errors - Substitution in Text
    U: Switch to Weather
    S (Heard): Switch to Calendar
    S (Said): On Tuesday August 25, you have defense
  1250 Rejections ~2/3
  706 Misrecognitions ~1/3

Roadmap
  Data Collection and Description
    SpeechActs System & Field Trial
  Characterizing Corrections
    Original-Repeat Pair Data Analysis
    Acoustic and Phonological Measures & Results
    Divergence from Recognizer Models
  Recognizing Corrections
  Conclusions and Future Work

Analysis: Data
  300 Original Input-Repeat Correction Pairs
  Lexically Matched, Same Speaker
  Example:
    S (Said): Please say mail, calendar, weather.
    U: Switch to Weather. (Original)
    S (Said): Huh?
    U: Switch to Weather. (Repeat)

Analysis: Duration
  Automatic Forced Alignment, Hand-Edited
  Total: Speech Onset to End of Utterance
  Speech: Total - Internal Silence
  Contrasts: Original Input/Repeat Correction
    Total: Increases 12.5% on average
    Speech: Increases 9% on average
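A minimal sketch of the two duration measures, assuming the forced alignment is available as (label, start, end) tuples in seconds with silence labeled "sil". The segment format, function names and toy timings are assumptions for illustration only.

```python
def duration_measures(segments, silence_label="sil"):
    """Return (total, speech) duration in seconds.

    total  = speech onset to end of the last speech segment
    speech = total minus silence that falls inside that span
    """
    speech_segs = [(s, e) for label, s, e in segments if label != silence_label]
    onset = min(s for s, _ in speech_segs)
    offset = max(e for _, e in speech_segs)
    total = offset - onset
    internal_silence = sum(
        e - s
        for label, s, e in segments
        if label == silence_label and s >= onset and e <= offset
    )
    return total, total - internal_silence


def percent_increase(original, repeat):
    """Relative change from the original input to the repeat correction."""
    return 100.0 * (repeat - original) / original


# Hypothetical alignments for one original/repeat pair.
original = [("sil", 0.0, 0.2), ("switch", 0.2, 0.5), ("to", 0.5, 0.6),
            ("weather", 0.6, 1.0), ("sil", 1.0, 1.3)]
repeat = [("sil", 0.0, 0.2), ("switch", 0.2, 0.55), ("sil", 0.55, 0.65),
          ("to", 0.65, 0.8), ("weather", 0.8, 1.3)]

total_o, speech_o = duration_measures(original)
total_r, speech_r = duration_measures(repeat)
print(percent_increase(total_o, total_r), percent_increase(speech_o, speech_r))
```

Leading and trailing silence is excluded from both measures, so only pauses inside the utterance separate "total" from "speech".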

Analysis: Pause
  Utterance Internal Silence > 10ms
    Not Preceding Unvoiced Stops (t), Affricates (ch)
  Contrasts: Original Input/Repeat Correction
    Absolute: 46% Increase
    Ratio of Silence to Total Duration: 58% Increase
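The pause criterion can be sketched over a phone-level alignment. The example assumes (label, start, end) tuples in seconds, silence labeled "sil", and a small hypothetical set of phone labels standing in for unvoiced stops and affricates. Silence directly before such a phone is treated as closure rather than pause.

```python
# Hypothetical phone labels for unvoiced stops and affricates.
UNVOICED_STOPS_AFFRICATES = {"p", "t", "k", "ch"}


def pause_measures(phones, min_pause=0.010, silence_label="sil"):
    """Return (pause_seconds, pause_to_total_ratio) for one utterance.

    Counts utterance-internal silences longer than min_pause that do not
    immediately precede an unvoiced stop or affricate.
    """
    speech_idx = [i for i, (lab, _, _) in enumerate(phones) if lab != silence_label]
    first, last = speech_idx[0], speech_idx[-1]
    total = phones[last][2] - phones[first][1]
    pause = 0.0
    for i in range(first + 1, last):
        label, start, end = phones[i]
        if label != silence_label:
            continue
        if end - start <= min_pause:
            continue  # too short to count as a pause
        if phones[i + 1][0] in UNVOICED_STOPS_AFFRICATES:
            continue  # likely stop closure, not a pause
        pause += end - start
    return pause, pause / total


# Toy alignment (hypothetical timings).
phones = [("sil", 0.0, 0.1), ("s", 0.1, 0.2), ("w", 0.2, 0.3),
          ("sil", 0.3, 0.35), ("ih", 0.35, 0.45), ("sil", 0.45, 0.5),
          ("t", 0.5, 0.6), ("sil", 0.6, 0.605), ("uw", 0.605, 0.7),
          ("sil", 0.7, 0.9)]
print(pause_measures(phones))
```

In the toy alignment only the 50 ms silence before "ih" counts: the one before "t" is excluded as closure and the 5 ms one is below the threshold.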

Pitch Tracks

Analysis: Pitch I
  ESPS/Waves+ Pitch Tracker, Hand-Edited
  Normalized Per-Subject: (Value - Subject Mean) / (Subject Std Dev)
  Pitch Maximum, Minimum, Range
    Whole Utterance & Last Word
  Contrasts: Original Input/Repeat Correction
    Significant Decrease in Pitch Minimum
      Whole Utterance & Last Word
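Per-subject normalization is a z-score computed within each speaker. The sketch below assumes the sample standard deviation, since the slide does not say which estimator was used. The function name and toy values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def normalize_per_subject(rows):
    """rows: list of (subject_id, value). Returns z-scores in the same order.

    Each value is mapped to (value - subject mean) / (subject std dev).
    Needs at least two values per subject with nonzero spread.
    """
    by_subject = defaultdict(list)
    for subject, value in rows:
        by_subject[subject].append(value)
    stats = {s: (mean(v), stdev(v)) for s, v in by_subject.items()}
    return [(value - stats[s][0]) / stats[s][1] for s, value in rows]


# Two hypothetical speakers with different pitch ranges (Hz).
rows = [("a", 100), ("a", 120), ("a", 140), ("b", 200), ("b", 220), ("b", 240)]
print(normalize_per_subject(rows))
```

After normalization the two speakers' values are directly comparable even though their raw pitch ranges differ.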

Analysis: Pitch II

Analysis: Pitch III
  Internal Pitch Contours: Pitch Accent
    Steepest Rise, Steepest Fall, Slope Sum
  Overall => Not Significant
  Misrecognitions Only: Original vs Repeat
    Significant Increases: Steepest Rise, Slope Sum

Pitch Contour Detail
  Exclude Boundary Tone Region
  5-Point median smoothing (Taylor 1996)
  Piecewise linear contour between max and min
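A rough sketch of the smoothing and piecewise-linear steps. It assumes a pitch track sampled at a fixed frame step and takes the line segments between successive turning points (local maxima and minima) as the contour. Edge handling, the 10 ms frame step and all names are assumptions; the slides do not spell out these details.

```python
from statistics import median


def median_smooth(values, width=5):
    """Sliding median; windows shrink at the edges of the track."""
    half = width // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


def turning_points(values):
    """Indices of the first frame, the last frame, and each change of direction."""
    idx = [0]
    direction = 0
    for i in range(1, len(values)):
        step = values[i] - values[i - 1]
        if step == 0:
            continue  # stay on a plateau until the track moves again
        new_direction = 1 if step > 0 else -1
        if direction != 0 and new_direction != direction:
            idx.append(i - 1)
        direction = new_direction
    idx.append(len(values) - 1)
    return idx


def contour_slopes(values, frame_step=0.01):
    """Slope (Hz per second) of each line segment between turning points."""
    if len(values) < 2:
        return []
    pts = turning_points(values)
    return [(values[b] - values[a]) / ((b - a) * frame_step)
            for a, b in zip(pts, pts[1:])]


# Toy track with one tracking spike at index 2 (hypothetical values, Hz).
raw = [100, 102, 180, 104, 106, 108, 106, 104, 102, 100]
smooth = median_smooth(raw)
slopes = contour_slopes(smooth)
print(smooth)
print(max(slopes), min(slopes))  # steepest rise, steepest fall
```

The median filter removes the isolated spike without flattening the overall rise and fall, which is why it is a common first step before measuring slopes.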

Analysis: Overview
  Significant Differences: Original/Correction
  Duration & Pause
    Significant Increases: Original vs Correction
  Pitch
    Significant Decrease in Pitch Minimum
    Increase in Final Falling Contours
    Misrecognitions: Increase in Pitch Variability
  Conversational-to-Clear Speech Shift
  Contrastive Use of Pitch Accent

Analysis: Phonology
  Reduced Form => Citation Form
    Schwa to unreduced vowel (~20)
      E.g. Switch t' mail => Switch to mail.
    Unreleased or Flapped 't' => Released 't' (~50)
      E.g. Read message tweny => Read message twenty
  Citation Form => Hyperclear Form
    Vowel or Syllabic Insertion (~20)
      E.g. Goodbye => Goodba-aye

Analysis: Overview II
  Original vs Correction & Recognizer Model
  Phonology
    Reduced Form => Citation Form => Hyperclear Form
    Conversational to (Hyper) Clear Shift
  Duration
    Contrast between Final and Non-final Words
    Departure from ASR Model
      Increase for Corrections, especially Final Words

Learning Method Options
  (K)-Nearest Neighbor
    Need Commensurable Attribute Values
    Sensitive to Irrelevant Attributes
    Labeling Speed - Training Set Size
  Neural Nets
    Hard to Interpret
    Can Require More Computation & Training Data
    + Fast, Accurate when Trained
  Decision Trees
    Intelligible, Robust to Irrelevant Attributes
    + Fast, Compact when Trained
    ? Rectangular Decision Boundaries, Don't Test Feature Combinations
  Alternative: Mixture of Experts

Learning Method Options
  (K)-Nearest Neighbor
    Need Commensurable Attribute Values
    Sensitive to Irrelevant Attributes
    Labeling Speed - Training Set Size
  Neural Nets
    Hard to Interpret
    Can Require More Computation & Training Data
    + Fast, Accurate when Trained
  Decision Trees <=
    Intelligible, Robust to Irrelevant Attributes
    + Fast, Compact when Trained
    ? Rectangular Decision Boundaries, Don't Test Feature Combinations

Decision Tree Features
  38 Features Total, E.g. 15 for best trees
  Pause
    Total Pause Duration
    Pause / Total Duration
  Duration
    Total Duration (uttdur)
    Speaking Rate (sps)
    Normalized Duration
  Amplitude
    Max, Mean, Last
    Max-Last (ampdiff)
    Mean-Last (ampdelta)
  Pitch
    Max, Min, Range
    Global, Last Word
    Range/Total
  Contour
    Max, min, sum slope

Decision Tree Training & Testing
  Data: 50% Original Inputs, 50% Repeat Corrections
  Classifier Labels: Original, Correction
  7-Way Cross-Validation
    Train on 6/7 of data, Test on remaining 1/7
    Subsets drawn at random according to distribution
    Cycle through all subsets, training & testing
    Report average results on unseen test data
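The cross-validation procedure can be sketched in plain Python. "Drawn at random according to distribution" is read here as stratified sampling, so each fold keeps the 50/50 label balance. That reading, the function names and the majority-class stand-in for the tree learner are all assumptions.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=7, seed=0):
    """Split item indices into k folds that preserve the label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)
    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        idx = by_label[label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def cross_validate(features, labels, train_fn, predict_fn, k=7, seed=0):
    """Train on k-1 folds, test on the held-out fold, average the accuracies."""
    accuracies = []
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, features[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)


# Majority-class stand-in for a real learner: this is the null baseline.
def majority_train(train_features, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def majority_predict(model, feature_vector):
    return model


labels = ["original"] * 35 + ["correction"] * 35
features = [[0.0] for _ in labels]
print(cross_validate(features, labels, majority_train, majority_predict))
```

With balanced classes the majority-class stand-in scores 50%, which matches the null baseline quoted for the classifier results.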

Recognizer: Results (Overall)
  Tree Size: 57 (unpruned), 37 (pruned)
  Minimum of 10 nodes per branch required
  First Split: Normalized Duration (All Trees)
  Most Important Features: Normalized & Absolute Duration, Speaking Rate
  65% Accuracy - Null Baseline: 50%
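To illustrate how a tree learner might arrive at a first split on a numeric feature such as normalized duration, the sketch below picks the threshold with the highest information gain. The slides do not name the tree-induction algorithm or its split criterion, so information gain is an assumption here, and the data are invented.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    result = 0.0
    for label in set(labels):
        p = labels.count(label) / n
        result -= p * log2(p)
    return result


def best_threshold(values, labels):
    """Return (cut, gain) for the binary split value <= cut with maximal gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_cut = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between equal values
        cut = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    return best_cut, best_gain


# Invented normalized durations: corrections tend to be longer.
durations = [-1.2, -0.8, -0.3, 0.1, 0.4, 0.9, 1.3, 1.6]
classes = ["original"] * 4 + ["correction"] * 4
print(best_threshold(durations, classes))
```

A full tree repeats this search over every feature at each node, which is how one feature can end up as the first split in all trees.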

Example Tree

Classifier Results: Misrecognitions
  Most important features:
    Absolute and Normalized Duration
    Pitch Minimum and Pitch Slope
  77% accuracy (with text)
  65% (acoustic features only)
  Errors, most trees: ½ false positive, ½ false negative
  Null baseline - 50%
  Human baseline ~80% (Hauptmann & Rudnicky 1990)

Misrecognition Classifier

Roadmap
  Data Collection and Description
  Characterizing Corrections
  Recognizing Corrections
  Conclusions and Future Work

Accomplishments
  Contrasts between Originals vs Corrections
    Significant Differences in Duration, Pause, Pitch
    Conversational-to-Clear Speech Shifts
    Shifts away from Recognizer Models
  Corrections Recognized at 65-77%
    Near-human Levels

The Recipe
  Original/Correction Training Set (300+ sets)
    Labeled, Transcribed, Digitized, Corpus or Wizard
  Acoustic Analyses
    Pitch Tracking, Silence Detection, Speaking Rate,...
  Classifier Training & Tuning
    Confidence Measure (Weighted Pessimistic Error)
  Phonological Rule Extraction
  Durational Contrast Modeling
  Repair Dialog Management

Future Work
  Modify ASR Duration Model for Correction
    Reflect Phonological and Duration Change
  Identify Locus of Correction for Misrecognitions
    Preliminary tests: 26/28 Corrected Words Detected, 2 False Alarms
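The preliminary locus-detection counts can be restated as precision and recall. This assumes "26/28" means 26 hits and 2 misses, with the 2 false alarms counted separately. The function name is hypothetical.

```python
def detection_scores(hits, misses, false_alarms):
    """Precision and recall for a detector, from raw counts."""
    recall = hits / (hits + misses)
    precision = hits / (hits + false_alarms)
    return precision, recall


precision, recall = detection_scores(hits=26, misses=2, false_alarms=2)
print(f"precision={precision:.3f} recall={recall:.3f}")  # both are 26/28, about 0.929
```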