2 Understanding Spoken Corrections in Human-Computer Dialogue
Gina-Anne Levow, University of Chicago
http://www.cs.uchicago.edu/~levow
MAICS, April 1, 2006

3 Error Correction Spiral
U: Give me the price for AT&T.
S: Hewlett-Packard was 83 ¾, up 2 ½ since yesterday.
U: Give me the price for AT&T.
S: Data General was at 10 ½, up a quarter.
U: Give me the price for AT&T.
S: Hewlett-Packard was 83 ¾, up 2 ½.
U: Give me the price for American Telephone and Telegraph.
S: Excuse me?

4 Identifying Corrections
Most "Reasoning" Approaches:
- Correction => Violates Task or Belief Constraint
- Requires Tight Task and Belief Model
- Often Requires Accurate Recognition
This Approach:
- Uses Acoustic or Lexical Information
- Content- and Context-Independent

5 Accomplishments
Corrections vs Original Inputs:
- Significant Differences: Duration, Pause, Pitch
Corrections vs Recognizer Models:
- Contrasts: Phonology and Duration
Correction Recognition:
- Decision Tree Classifier: 65-77% Accuracy
- Human Baseline: ~80%

6 Why Corrections?
- Recognizer Error Rates: ~25-40%
- Corrections Convey the REAL Meaning of the Utterance: User Intent
- Corrections Misrecognized 2.5x as Often
- Hard to Correct => Poor Quality System

7 Why It's Necessary
- Error Repair Requires Detection
- Errors Can Be Very Difficult to Detect (e.g., Misrecognitions)
- Focus Repair Efforts
- Corrections Decrease Recognition Accuracy
- Adaptation Requires Identification

8 Why Is It Hard?
- Recognition Failures and Errors
- Repetition ≠ Correction
- 500 Strings => 6700 Instances (80%)
- Speech Recognition Technology: Variation Undesirable, Suppressed

10 Roadmap
- Data Collection and Description: SpeechActs System & Field Trial
- Characterizing Corrections: Original-Repeat Pair Data Analysis; Acoustic and Phonological Measures & Results
- Recognizing Corrections
- Conclusions and Future Work

11 SpeechActs System
- Speech-Only System over the Telephone (Yankelovich, Levow & Marx 1995)
- Access to Common Desktop Applications: Email, Calendar, Weather, Stock Quotes
- BBN's Hark Speech Recognition; Centigram TruVoice Speech Synthesis
- In-House: Natural Language Analysis, Back-End Applications, Dialog Manager

12 System Data Overview
- Approximately 60 Hours of Interactions
- Digitized at 8kHz, 8-bit mu-law Encoding
- 18 Subjects: 14 Novices, 4 Experts, Single Shots
- 7529 User Utterances, 1961 Errors
- P(error | correct) = 18%; P(error | error) = 44%
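
As a rough illustration of how those two conditional error rates could be estimated, here is a minimal Python sketch; the list-of-booleans log format is an assumption for illustration, not the actual trial log format:

```python
# Sketch: estimate P(error | previous correct) and P(error | previous error)
# from a sequence of utterance outcomes (True = recognition error).
# The input format is a hypothetical stand-in for the real session logs.

def conditional_error_rates(outcomes):
    after_correct = [curr for prev, curr in zip(outcomes, outcomes[1:]) if not prev]
    after_error = [curr for prev, curr in zip(outcomes, outcomes[1:]) if prev]
    return (sum(after_correct) / len(after_correct),
            sum(after_error) / len(after_error))

# Toy example; on the real 7529-utterance logs the slide reports
# estimates near 0.18 and 0.44 respectively.
print(conditional_error_rates([False, True, True, False, False, True, False]))
```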

13 System: Recognition Error Types
Rejection Errors (Below Recognition Threshold) - 1250 Rejections, ~2/3:
U: Switch to Weather
S (Heard): [nothing]
S (Said): Huh?
Misrecognition Errors (Substitution in Text) - 706 Misrecognitions, ~1/3:
U: Switch to Weather
S (Heard): Switch to Calendar
S (Said): On Friday, May 4, you have talk at Chicago.

14 Analysis: Data
- 300 Original Input-Repeat Correction Pairs
- Lexically Matched, Same Speaker
Example:
S (Said): Please say mail, calendar, weather.
U: Switch to Weather. [Original]
S (Said): Huh?
U: Switch to Weather. [Repeat]

15 Analysis: Duration
- Automatic Forced Alignment, Hand-Edited
- Total: Speech Onset to End of Utterance
- Speech: Total - Internal Silence
Contrasts (Original Input vs Repeat Correction):
- Total: Increases 12.5% on Average
- Speech: Increases 9% on Average
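
A minimal sketch of the two duration measures, assuming forced-alignment output as (label, start, end) tuples with silences labeled "sil" (both format assumptions are for illustration only):

```python
# Sketch of the two measures on the slide: total duration (speech onset
# to utterance end) and speech duration (total minus internal silence).

def duration_measures(alignment):
    words = [seg for seg in alignment if seg[0] != "sil"]
    total = words[-1][2] - words[0][1]      # first word onset to last word end
    internal_sil = sum(end - start
                       for label, start, end in alignment
                       if label == "sil"
                       and start >= words[0][1] and end <= words[-1][2])
    return total, total - internal_sil

align = [("sil", 0.0, 0.3), ("switch", 0.3, 0.75), ("sil", 0.75, 0.82),
         ("to", 0.82, 0.95), ("weather", 0.95, 1.6)]
print(duration_measures(align))   # total 1.3 s, speech 1.23 s
```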

17 Analysis: Pause
- Utterance-Internal Silence > 10ms
- Not Preceding Unvoiced Stops (t) or Affricates (ch)
Contrasts (Original Input vs Repeat Correction):
- Absolute: 46% Increase
- Ratio of Silence to Total Duration: 58% Increase
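
A companion sketch of the pause measure, under the same assumed alignment format; the word-initial-letter test for unvoiced stops/affricates is a crude illustrative stand-in for real phone labels:

```python
# Sketch: sum utterance-internal silences longer than 10 ms, skipping
# silences that immediately precede an unvoiced stop (t) or affricate (ch),
# since closure silence belongs to the consonant, not a pause.

def pause_measures(alignment):
    total = alignment[-1][2] - alignment[0][1]
    pause = 0.0
    for i in range(1, len(alignment) - 1):
        label, start, end = alignment[i]
        nxt = alignment[i + 1][0]
        if (label == "sil" and (end - start) > 0.010
                and not (nxt.startswith("t") or nxt.startswith("ch"))):
            pause += end - start
    return pause, pause / total   # absolute pause, silence-to-duration ratio

align = [("read", 0.0, 0.3), ("sil", 0.3, 0.34), ("message", 0.34, 0.8),
         ("sil", 0.8, 0.86), ("twenty", 0.86, 1.4)]
print(pause_measures(align))   # second silence skipped: precedes "t"
```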

19 Pitch Tracks [Figure]

20 Analysis: Pitch I
- ESPS/Waves+ Pitch Tracker, Hand-Edited
- Normalized Per Subject: (Value - Subject Mean) / (Subject Std Dev)
- Pitch Maximum, Minimum, Range; Whole Utterance & Last Word
Contrasts (Original Input vs Repeat Correction):
- Significant Decrease in Pitch Minimum
- Whole Utterance & Last Word
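
The per-subject normalization above is a standard z-score; here is a minimal NumPy sketch (function name and data values are illustrative assumptions):

```python
import numpy as np

# Sketch: normalize pitch statistics per subject so speakers with
# different baseline F0 ranges are comparable, as on the slide:
# normalized = (value - subject mean) / (subject std dev).

def normalize_per_subject(values, subject_ids):
    values = np.asarray(values, dtype=float)
    subject_ids = np.asarray(subject_ids)
    normalized = np.empty_like(values)
    for subj in np.unique(subject_ids):
        mask = subject_ids == subj
        normalized[mask] = (values[mask] - values[mask].mean()) / values[mask].std()
    return normalized

# e.g. pitch minima (Hz) for two hypothetical subjects
print(normalize_per_subject([110, 95, 100, 210, 190, 200],
                            ["s1", "s1", "s1", "s2", "s2", "s2"]))
```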

21 Analysis: Pitch II [Figure]

22 Analysis: Overview
Significant Differences (Original vs Correction):
- Duration & Pause: Significant Increases
- Pitch: Significant Decrease in Pitch Minimum
- Increase in Final Falling Contours
- Conversational-to-Clear Speech Shift

23 Analysis: Phonology
Reduced Form => Citation Form:
- Schwa to Unreduced Vowel (~20), e.g. "Switch t' mail" => "Switch to mail"
- Unreleased or Flapped 't' => Released 't' (~50), e.g. "Read message tweny" => "Read message twenty"
Citation Form => Hyperclear Form:
- Extreme Lengthening, Calling Intonation (~20), e.g. "Goodbye" => "Goodba-aye"

24 Durational Model Contrasts I
[Figure: original inputs; departure from model mean (std dev) by word position, final vs non-final]

25 Durational Model Contrasts
[Figure: departure from model mean (std dev) by # of words, non-final vs final]
- Compare to SR Model (Chung 1995): Phrase-Final Lengthening
- Words in Final Position Significantly Longer than Non-Final Words and than Model Prediction
- All Significantly Longer in Correction Utterances
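
A sketch of this departure measure, assuming the recognizer's duration model exposes a per-word mean and standard deviation; the model table below is made up for illustration:

```python
# Sketch: express an observed word duration as its departure from the
# recognizer duration model's mean, in model standard deviations
# (the y-axis of the plots on these slides). MODEL is a hypothetical
# stand-in for the actual ASR duration model.

MODEL = {"switch": (0.40, 0.08), "to": (0.12, 0.04), "weather": (0.45, 0.10)}

def departure(word, observed_sec):
    mean, std = MODEL[word]
    return (observed_sec - mean) / std

# final word of a correction, noticeably lengthened:
print(departure("weather", 0.65))   # 2.0 std devs above the model mean
```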

26 Durational Model Contrasts II
[Figure: correction utterances show greater increases; departure from model mean (std dev) by # of words, final vs non-final]

27 Analysis: Overview II
Original vs Correction & Recognizer Model:
Phonology:
- Reduced Form => Citation Form => Hyperclear Form
- Conversational to (Hyper)Clear Shift
Duration:
- Contrast between Final and Non-Final Words
- Departure from ASR Model Increases for Corrections, Especially Final Words

28 Automatic Recognition of Spoken Corrections
- Machine Learning Classifier: Decision Trees
- Trained on Labeled Examples
- Features: Duration, Pause, Pitch
Evaluation:
- Overall: 65% Accuracy (incl. Text Features) - Absolute and Normalized Duration
- Misrecognitions: 77% Accuracy (incl. Text Features) - Absolute and Normalized Duration, Pitch
- 65% Accuracy with Acoustic Features Only
- Approaches Human Baseline: 79.4%

29 Accomplishments
- Contrasts between Originals and Corrections: Significant Differences in Duration, Pause, Pitch
- Conversational-to-Clear Speech Shifts; Shifts Away from Recognizer Models
- Corrections Recognized at 65-77%: Near-Human Levels

30 Future Work
- Modify ASR Duration Model for Corrections: Reflect Phonological and Durational Change
- Identify Locus of Correction for Misrecognitions:
U: Switch to Weather
S (Heard): Switch to Calendar
S (Said): On Friday, May 4, you have talk at Chicago.
U: Switch to WEATHER!
- Preliminary Tests: 26/28 Corrected Words Detected, 2 False Alarms

31 Future Work
- Identify and Exploit Cues to Discourse and Information Structure
- Incorporate Prosodic Features into a Model of Spoken Dialogue
- Exploit Text and Acoustic Features for Segmentation of Broadcast Audio and Video:
  - Necessary First Phase for Information Retrieval
  - Assess Language Independence
  - First Phase: Segmentation of Mandarin and Cantonese Broadcast News (in Collaboration with CUHK)

32 Classification of Spoken Corrections
Decision Trees:
- Pros: Intelligible, Robust to Irrelevant Attributes
- Cons: Rectangular Decision Boundaries; Don't Test Feature Combinations
Features (38 Total, 15 in Best Trees):
- Duration, Pause, Pitch, and Amplitude; Normalized and Absolute
Training and Testing:
- 50% Original Inputs, 50% Repeat Corrections
- 7-Way Cross-Validation
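
A minimal sketch of this training setup using scikit-learn (an anachronistic stand-in: the original work did not use this library, and the feature values below are random placeholders):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Sketch of the setup: a decision tree over duration/pause/pitch/amplitude
# features, evaluated with 7-way cross-validation on a balanced set
# (50% originals, 50% repeat corrections). Real feature vectors would
# come from the acoustic measures on the earlier slides.

rng = np.random.default_rng(0)
n = 300                               # cf. the 300 original-repeat pairs
X = rng.normal(size=(n, 15))          # placeholder for the 15 best features
y = np.repeat([0, 1], n // 2)         # 0 = original input, 1 = correction

clf = DecisionTreeClassifier(min_samples_leaf=10)  # cf. "minimum of 10 per branch"
scores = cross_val_score(clf, X, y, cv=7)          # 7-way cross-validation
print(scores.mean())                  # chance here; ~0.65-0.77 on real features
```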

33 Recognizer: Results (Overall)
- Tree Size: 57 (Unpruned), 37 (Pruned); Minimum of 10 Nodes per Branch Required
- First Split: Normalized Duration (All Trees)
- Most Important Features: Normalized & Absolute Duration, Speaking Rate
- 65% Accuracy vs 50% Null Baseline

34 Example Tree [Figure]

35 Classifier Results: Misrecognitions
Most Important Features:
- Absolute and Normalized Duration
- Pitch Minimum and Pitch Slope
Accuracy:
- 77% (with Text); 65% (Acoustic Features Only)
- Null Baseline: 50%
- Human Baseline: 79.4% (Hauptmann & Rudnicky 1987)

36 Misrecognition Classifier [Figure]

37 Background & Related Work
- Detecting and Preventing Miscommunication: (Smith & Gordon 96; Traum & Dillenbourg 96)
- Identifying Discourse Structure in Speech:
  - Prosody: (Grosz & Hirschberg 92; Swerts & Ostendorf 95)
  - Cue Words + Prosody: (Taylor et al. 96; Hirschberg & Litman 93)
- Self-Repairs: (Heeman & Allen 94; Bear et al. 92)
  - Acoustic-Only: (Nakatani & Hirschberg 94; Shriberg et al. 97)
- Speaking Modes: (Ostendorf et al. 96; Daly & Zue 96)
- Spoken Corrections:
  - Human Baseline: (Rudnicky & Hauptmann 87)
  - (Oviatt et al. 96, 98; Levow 98, 99; Hirschberg et al. 99, 00)
  - Other Languages: (Bell & Gustafson 99; Pirker et al. 99; Fischer 99)

38 Learning Method Options
(K)-Nearest Neighbor:
- Needs Commensurable Attribute Values
- Sensitive to Irrelevant Attributes
- Labeling Speed Depends on Training Set Size
Neural Nets:
- Hard to Interpret
- Can Require More Computation & Training Data
- + Fast, Accurate When Trained
Decision Trees (Selected):
- + Intelligible, Robust to Irrelevant Attributes
- + Fast, Compact When Trained
- ? Rectangular Decision Boundaries; Don't Test Feature Combinations

