“k hypotheses + other” belief updating in spoken dialog systems
Dialogs on Dialogs Talk, March 2006
Dan Bohus, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213

2 problem: spoken language interfaces lack robustness when faced with understanding errors • errors stem mostly from speech recognition • typical word error rates: 20-30% • significant negative impact on interactions

3 guarding against understanding errors • use confidence scores • machine learning approaches for detecting misunderstandings [Walker, Litman, San-Segundo, Wright, and others] • engage in confirmation actions • explicit confirmation: did you say you wanted to fly to Seoul? yes → trust hypothesis; no → delete hypothesis; “other” → non-understanding • implicit confirmation: traveling to Seoul … what day did you need to travel? rely on new values overwriting old values

4 today’s talk … construct accurate beliefs by integrating information over multiple turns in a conversation. S: Where would you like to go? U: Huntsville [SEOUL / 0.65] S: traveling to Seoul. What day did you need to travel? destination = {seoul/0.65} → destination = {?} U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M / 0.60]

5 belief updating: problem statement S: traveling to Seoul. What day did you need to travel? destination = {seoul/0.65} → destination = {?} [THE TRAVELING TO BERLIN P_M / 0.60] • given • an initial belief B_initial(C) over concept C • a system action SA • a user response R • construct an updated belief • B_updated(C) ← f(B_initial(C), SA, R)

6 outline • proposed approach • data • experiments and results • effect on dialog performance • conclusion

7 belief updating: problem statement S: traveling to Seoul. What day did you need to travel? destination = {seoul/0.65} → destination = {?} [THE TRAVELING TO BERLIN P_M / 0.60] • given • an initial belief B_initial(C) over concept C • a system action SA(C) • a user response R • construct an updated belief • B_updated(C) ← f(B_initial(C), SA(C), R)

8 belief representation B_updated(C) ← f(B_initial(C), SA(C), R) • most accurate representation: a probability distribution over the set of possible values • however, the system will “hear” only a small number of conflicting values for a concept within a dialog session • in our data, at most 3 conflicting values were heard, and more than 1 value was heard in only 6.9% of cases

9 belief representation B_updated(C) ← f(B_initial(C), SA(C), R) • compressed belief representation: k hypotheses + other • at each turn, the system retains the top m initial hypotheses and adds n new hypotheses from the input (m + n = k)

10 belief representation B_updated(C) ← f(B_initial(C), SA(C), R) • B(C) modeled as a multinomial variable over {h_1, h_2, …, h_k, other} • B(C) = ⟨c_h1, c_h2, …, c_hk, c_other⟩, where c_h1 + c_h2 + … + c_hk + c_other = 1 • belief updating can be cast as a multinomial regression problem: B_updated(C) ← B_initial(C) + SA(C) + R
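A minimal sketch of this compressed "k hypotheses + other" representation, with illustrative names (the talk does not show the system's actual data structures; probabilities for the retained hypotheses come from the learned update model, not from this container):

```python
from dataclasses import dataclass, field

@dataclass
class CompressedBelief:
    """Belief over one concept: up to k explicit hypotheses plus 'other'."""
    k: int
    probs: dict = field(default_factory=dict)  # hypothesis -> probability mass
    other: float = 1.0                         # mass on all unlisted values

    def advance_turn(self, new_hyps, m, n):
        """Keep the top-m current hypotheses, make room for n new ones
        from the input (m + n = k), and fold dropped mass into 'other'."""
        assert m + n == self.k
        kept = dict(sorted(self.probs.items(), key=lambda kv: -kv[1])[:m])
        self.other += sum(p for h, p in self.probs.items() if h not in kept)
        for h in list(new_hyps)[:n]:
            kept.setdefault(h, 0.0)            # enters with mass to be assigned
        self.probs = kept

# example: k = 2, one retained hypothesis plus one new value from the input
b = CompressedBelief(k=2, probs={"seoul": 0.65}, other=0.35)
b.advance_turn(new_hyps=["berlin"], m=1, n=1)
```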

11 system action B_updated(C) ← f(B_initial(C), SA(C), R) • request: S: For when do you want the room? U: Friday [FRIDAY / 0.65] • explicit confirmation: S: Did you say you wanted a room for Friday? U: Yes [GUEST / 0.30] • implicit confirmation: S: a room for Friday … starting at what time? U: starting at ten a.m. [STARTING AT TEN A_M / 0.86] • unplanned implicit confirmation: S: I found 5 rooms available Friday from 10 until noon. Would you like a small or a large room? U: not Friday, Thursday [FRIDAY THURSDAY / 0.25] • no action / unexpected update: S: okay. I will complete the reservation. Please tell me your name or say ‘guest user’ if you are not a registered user. U: guest user [THIS TUESDAY / 0.55]

12 user response B_updated(C) ← f(B_initial(C), SA(C), R)
• acoustic / prosodic: acoustic and language scores, duration, pitch (min, max, mean, range, std. dev., min and max slope, plus normalized versions), voiced-to-unvoiced ratio, speech rate, initial pause, etc.
• lexical: number of words; lexical terms highly correlated with corrections or acknowledgements (selected via mutual information computation)
• grammatical: number of slots (new and repeated), parse fragmentation, parse gaps, etc.
• dialog: dialog state, turn number, expectation match, new value for concept, timeout, barge-in, concept identity
• priors: priors for concept values (manually constructed by a domain expert for 3 of 29 concepts: date, start_time, end_time; uniform assumed otherwise)
• confusability: empirically derived confusability scores

13 approach B_updated(C) ← f(B_initial(C), SA(C), R) • problem: ⟨c'_h1, c'_h2, …, c'_hk, c'_other⟩ ← f(⟨c_h1, c_h2, …, c_hk, c_other⟩, SA(C), R) • approach: multinomial generalized linear model • a regression model with a multinomial dependent variable • sample efficient • stepwise feature selection, with BIC to control over-fitting • one model for each system action: B_updated(C) ← f_SA(C)(B_initial(C), R)
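A sketch of the "one model per system action" idea using an off-the-shelf multinomial logistic regression. The talk describes a stepwise multinomial GLM with BIC-controlled feature selection; scikit-learn offers no stepwise/BIC search, so this is only an approximation, and all names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

ACTIONS = ["request", "explicit_confirm", "implicit_confirm",
           "unplanned_implicit_confirm", "no_action"]

# one regression per action: B_updated(C) <- f_SA(C)(B_initial(C), R)
# (LogisticRegression fits a multinomial softmax model for multiclass targets)
models = {a: LogisticRegression(max_iter=1000) for a in ACTIONS}

def train(action, X, y):
    """X: one feature row per turn (initial belief + response features R);
    y: index of the correct hypothesis (0..k-1), or k for 'other'."""
    models[action].fit(X, y)

def update_belief(action, x):
    """Updated distribution over {h_1, ..., h_k, other} for a single turn."""
    return models[action].predict_proba(np.asarray(x).reshape(1, -1))[0]
```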

14 outline • proposed approach • data • experiments and results • effect on dialog performance • conclusion

15 data • collected with RoomLine • a phone-based mixed-initiative spoken dialog system • conference room reservation • explicit and implicit confirmations • simple heuristic rules for belief updating • explicit confirm: yes / no • implicit confirm: new values overwrite old ones
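A sketch of the heuristic baseline rules listed above; the belief and response fields are illustrative stand-ins, not RoomLine's actual representation:

```python
def heuristic_update(belief, action, response):
    """belief: {'value': ..., 'confidence': ...}; returns the updated belief."""
    if action == "explicit_confirm":
        if response.get("yes"):
            belief["confidence"] = 1.0      # yes -> trust the hypothesis
        elif response.get("no"):
            belief = {"value": None, "confidence": 0.0}  # no -> delete it
        # anything else is treated as a non-understanding: belief unchanged
    elif response.get("new_value") is not None:
        # implicit confirmations and requests: new values overwrite old ones
        belief = {"value": response["new_value"],
                  "confidence": response.get("confidence", 0.0)}
    return belief
```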

16 corpus • user study • 46 participants (naïve users) • 10 scenario-based interactions each • compensated per task success • corpus • 449 sessions, 8848 user turns • orthographically transcribed • manually annotated: misunderstandings, corrections, correct concept values

17 outline • proposed approach • data • experiments and results • effect on dialog performance • conclusion

18 baselines • initial baseline: accuracy of system beliefs before the update • heuristic baseline: accuracy of the heuristic update rule used by the system • oracle baseline: accuracy if we knew exactly when the user corrects
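A sketch of how the three baselines could be scored against the annotated corpus; the per-turn field names are assumptions, not the corpus's actual schema:

```python
def baseline_accuracies(turns):
    initial = heuristic = oracle = 0
    for t in turns:
        initial += t["belief_before"] == t["correct_value"]
        heuristic += t["belief_after"] == t["correct_value"]
        # oracle: knows exactly when the user is correcting the system,
        # so it adopts the user's value on corrections
        value = t["user_value"] if t["is_correction"] else t["belief_before"]
        oracle += value == t["correct_value"]
    n = len(turns)
    return initial / n, heuristic / n, oracle / n
```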

19 k=2 hypotheses + other • informative features: priors and confusability, initial confidence score, concept identity, barge-in, expectation match, repeated grammar slots

20 outline • proposed approach • data • experiments and results • effect on dialog performance • conclusion

21 a question remains … does this really matter? what is the effect on global dialog performance?

22 let’s run an experiment • guinea pigs from Speech Lab for exp: $0 • getting change from guys in the lab: $2/$3/$5 • real subjects for the experiment: $25 • picture with advisor of the VERY last exp at CMU: priceless!!!! [courtesy of Mohit Kumar]

23 a new user study … • implemented the models in RavenClaw and performed a new user study • 40 participants, first-time users • 10 scenario-driven interactions each • non-native speakers of North-American English (improvements more likely at higher WER, as supported by empirical evidence) • between-subjects design; 2 gender-balanced groups • control: RoomLine using heuristic update rules • treatment: RoomLine using the runtime models

24 effect on task success • task success: 73.6% (control) vs. 81.3% (treatment), even though average user WER was 21.9% (control) vs. 24.2% (treatment)

25 effect on task success … a closer look • logistic regression: Task Success ← WER + Condition • [figure: fitted probability of task success vs. word error rate for control and treatment; marked points at 16% and 30% WER, with values of 64% and 78% shown]

26 improvements at different WER • [figure: absolute improvement in task success as a function of word-error-rate]

27 effect on task duration (for successful tasks) • ANOVA on task duration for successful tasks: Duration ← WER + Condition • significant improvement, equivalent to a 7.9% absolute reduction in WER
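A sketch of the two analyses above (task success, task duration) using statsmodels formulas; the column names ('success', 'duration', 'wer', 'condition') and the CSV file are hypothetical stand-ins for the study's per-session data:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("sessions.csv")  # one row per session (hypothetical)

# Task Success <- WER + Condition (logistic regression)
success_model = smf.logit("success ~ wer + C(condition)", data=df).fit()
print(success_model.summary())

# Duration <- WER + Condition, on successful tasks only (linear model / ANOVA)
duration_model = smf.ols("duration ~ wer + C(condition)",
                         data=df[df["success"] == 1]).fit()
print(duration_model.summary())
```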

28 outline • proposed approach • data • experiments and results • effect on dialog performance • conclusion

29 summary • a data-driven approach for constructing accurate system beliefs • integrates information across multiple turns • bridges the detection of misunderstandings and the detection of corrections • significantly outperforms the current heuristics • significantly improves effectiveness and efficiency

30 other advantages • sample efficient: performs a local, one-turn optimization; good local performance leads to good global performance • scalable: works independently on each concept (29 concepts, varying cardinalities) • portable: decoupled from the dialog task specification; doesn’t make strong assumptions about dialog management technology

31 thank you! questions …

32 user study • 10 scenarios, fixed order • presented graphically (explained during briefing) • participants compensated per task success