A principled approach for rejection threshold optimization. Dan Bohus (www.cs.cmu.edu/~dbohus), Alexander I. Rudnicky (www.cs.cmu.edu/~air), Computer Science Department.


Presentation transcript:

a principled approach for rejection threshold optimization
Dan Bohus (www.cs.cmu.edu/~dbohus), Alexander I. Rudnicky (www.cs.cmu.edu/~air)
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, 15217

2 understanding errors and rejection
• systems often misunderstand, so they use confidence scores
• common design pattern:
  • compare the input's confidence against a threshold
  • reject the utterance if its confidence is too low
• rejection may lead to false rejections (correctly understood utterances that get discarded)

3 rejection tradeoff
• misunderstandings vs. false rejections
[Figure: misunderstandings and false rejections as the rejection threshold varies]
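The tradeoff above can be made concrete with a small Python sketch. It assumes each utterance carries an utterance-level confidence score in [0, 1] and a gold label saying whether it was understood correctly; the class `Utterance`, the function `rejection_tradeoff`, and the toy data are illustrative, not part of the talk.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    confidence: float  # utterance-level confidence score in [0, 1]
    correct: bool      # hypothetical gold label: was it understood correctly?


def rejection_tradeoff(utterances, threshold):
    """Return (misunderstanding rate, false rejection rate) for one threshold.

    Misunderstanding: an incorrectly understood utterance that is accepted.
    False rejection: a correctly understood utterance that is rejected.
    Both rates are fractions of all utterances.
    """
    n = len(utterances)
    misunderstandings = sum(
        1 for u in utterances if u.confidence >= threshold and not u.correct
    )
    false_rejections = sum(
        1 for u in utterances if u.confidence < threshold and u.correct
    )
    return misunderstandings / n, false_rejections / n


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    data = [
        Utterance(0.92, True), Utterance(0.85, True), Utterance(0.60, False),
        Utterance(0.55, True), Utterance(0.30, False), Utterance(0.20, True),
    ]
    for th in (0.0, 0.25, 0.5, 0.75):
        mis, fr = rejection_tradeoff(data, th)
        print(f"threshold={th:.2f}  misunderstandings={mis:.2f}  false rejections={fr:.2f}")
```

On the toy data, raising the threshold from 0 to 0.75 lowers misunderstandings from 2/6 to 0 while false rejections climb from 0 to 2/6, which is the shape of the tradeoff curve.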

4 rejection tradeoff
• misunderstandings vs. false rejections
• correctly vs. incorrectly transferred concepts
[Figure: correctly and incorrectly transferred concepts per turn as the rejection threshold varies]
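Counting transferred concepts instead of utterances gives the CTC and ITC quantities used in the rest of the talk. This is a minimal sketch, assuming each turn is annotated with its confidence score and the number of correctly and incorrectly decoded concepts, and that rejecting a turn discards all of its concepts; `Turn` and `transferred_concepts` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    confidence: float        # utterance-level confidence score in [0, 1]
    correct_concepts: int    # decoded concepts labeled correct
    incorrect_concepts: int  # decoded concepts labeled incorrect


def transferred_concepts(turns: List[Turn], threshold: float) -> Tuple[float, float]:
    """Return (CTC, ITC) for one rejection threshold.

    Assumption: a turn whose confidence falls below the threshold is rejected,
    so none of its concepts are transferred. Both values are averaged over
    all turns, rejected ones included.
    """
    accepted = [t for t in turns if t.confidence >= threshold]
    ctc = sum(t.correct_concepts for t in accepted) / len(turns)
    itc = sum(t.incorrect_concepts for t in accepted) / len(turns)
    return ctc, itc


if __name__ == "__main__":
    # Toy turns, invented for illustration only.
    turns = [Turn(0.9, 2, 0), Turn(0.7, 1, 0), Turn(0.5, 1, 1), Turn(0.2, 0, 2)]
    for th in (0.0, 0.25, 0.5, 0.75):
        ctc, itc = transferred_concepts(turns, th)
        print(f"threshold={th:.2f}  CTC={ctc:.2f}  ITC={itc:.2f}")
```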

5 question
given this tradeoff, how can we optimize the rejection threshold in a principled fashion?

6 outline
• current solutions
• proposed approach
• data
• results
• conclusion

7 current solutions
• follow the ASR manual [Nuance documentation]
• acknowledge the tradeoff and postulate costs, e.g. "misunderstandings are X times more costly than false rejections" [Raymond et al., 2004; Kawahara et al., 2000; Cuayahuitl et al., 2002]
• but costs are likely to differ:
  • across domains / systems
  • across dialog states within a system
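For contrast, the postulated-cost approach reduces to picking the threshold that minimizes a weighted sum of error rates, with the weight X fixed by hand. The sketch below assumes misunderstanding and false-rejection rates have already been measured at a few candidate thresholds; the function name and the numbers are hypothetical.

```python
from typing import Dict

Curve = Dict[float, float]  # candidate threshold -> fraction of utterances


def threshold_with_postulated_cost(x: float, misunderstandings: Curve,
                                   false_rejections: Curve) -> float:
    """Return the threshold minimizing x * misunderstandings + false_rejections.

    x is the hand-picked cost ratio ("misunderstandings are x times more costly
    than false rejections"); both curves must share the same threshold keys.
    """
    return min(misunderstandings,
               key=lambda th: x * misunderstandings[th] + false_rejections[th])


if __name__ == "__main__":
    # Invented rates at four candidate thresholds.
    mis = {0.0: 0.30, 0.25: 0.20, 0.5: 0.10, 0.75: 0.04}
    fr = {0.0: 0.00, 0.25: 0.03, 0.5: 0.10, 0.75: 0.25}
    for x in (0.5, 1.0, 3.0):
        print(f"X={x}: threshold={threshold_with_postulated_cost(x, mis, fr)}")
```

With these toy curves, X = 0.5, 1 and 3 select thresholds of 0.25, 0.5 and 0.75 respectively, so the result hinges on a cost ratio the designer has to guess.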

8 proposed approach
derive costs in a principled fashion:
1. identify a set of variables involved in the tradeoff: correctly and incorrectly transferred concepts per turn (CTC, ITC)
2. choose a dialog performance metric: task completion, TC (binary or kappa)
3. build a regression model: $\mathrm{logit}(TC) \leftarrow C_0 + C_{CTC}\,CTC + C_{ITC}\,ITC$
4. optimize the threshold to maximize performance: $th^* = \arg\max_{th}\left(C_{CTC}\,CTC(th) + C_{ITC}\,ITC(th)\right)$
   ($C_0$ does not depend on the threshold and logit is monotonic, so maximizing this sum maximizes the predicted task completion)
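Steps 3 and 4 can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes one data point per dialog session (that session's CTC and ITC per turn plus a binary task-completion label), fits the logistic regression by batch gradient ascent rather than with a statistics package, and assumes the curves CTC(th) and ITC(th) have already been measured on a grid of candidate thresholds. `fit_logistic`, `best_threshold`, and all data are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def fit_logistic(xs: Sequence[Tuple[float, float]], ys: Sequence[int],
                 lr: float = 1.0, epochs: int = 5000) -> Tuple[float, float, float]:
    """Fit logit(P(TC = 1)) = c0 + c_ctc * CTC + c_itc * ITC.

    xs holds one (CTC, ITC) pair per session and ys the 0/1 task-completion
    labels. Uses batch gradient ascent on the mean log-likelihood.
    """
    c0 = c_ctc = c_itc = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g_ctc = g_itc = 0.0
        for (ctc, itc), y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(c0 + c_ctc * ctc + c_itc * itc)))
            residual = y - p  # derivative of the log-likelihood w.r.t. the logit
            g0 += residual
            g_ctc += residual * ctc
            g_itc += residual * itc
        c0 += lr * g0 / n
        c_ctc += lr * g_ctc / n
        c_itc += lr * g_itc / n
    return c0, c_ctc, c_itc


def best_threshold(c_ctc: float, c_itc: float,
                   ctc_curve: Dict[float, float],
                   itc_curve: Dict[float, float]) -> float:
    """th* = argmax over candidate thresholds of c_ctc * CTC(th) + c_itc * ITC(th).

    c0 is omitted because it does not depend on the threshold.
    """
    return max(ctc_curve,
               key=lambda th: c_ctc * ctc_curve[th] + c_itc * itc_curve[th])


if __name__ == "__main__":
    # Toy sessions, invented for illustration: task-completion outcomes for
    # four (CTC, ITC) settings.
    cells = {
        (1.0, 0.1): [1, 1, 1, 0],
        (1.0, 0.5): [1, 1, 0, 0],
        (0.5, 0.1): [1, 1, 0, 0],
        (0.5, 0.5): [1, 0, 0, 0],
    }
    xs = [x for x, outcomes in cells.items() for _ in outcomes]
    ys = [y for outcomes in cells.values() for y in outcomes]
    c0, c_ctc, c_itc = fit_logistic(xs, ys)
    # Hypothetical CTC(th) and ITC(th) curves measured on the same corpus.
    ctc_curve = {0.0: 1.0, 0.3: 0.9, 0.6: 0.6}
    itc_curve = {0.0: 0.5, 0.3: 0.2, 0.6: 0.05}
    print(f"C_CTC={c_ctc:.2f}  C_ITC={c_itc:.2f}")
    print("th* =", best_threshold(c_ctc, c_itc, ctc_curve, itc_curve))
```

On this toy data the fitted coefficients come out positive for CTC and negative for ITC (about 2.20 and -2.75), and the utility prefers the middle threshold, 0.3: it gives up a few correct concepts in exchange for a larger drop in incorrect ones.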

9 state-specific costs
• costs are different in different dialog states
• compute CTC and ITC on a per-state basis (subscripts 1, 2, 3, ... index the dialog states):
  $\mathrm{logit}(TC) \leftarrow C_0 + C_{CTC,1}\,CTC_1 + C_{ITC,1}\,ITC_1 + C_{CTC,2}\,CTC_2 + C_{ITC,2}\,ITC_2 + C_{CTC,3}\,CTC_3 + C_{ITC,3}\,ITC_3 + \dots$
• optimize a separate threshold for each state x:
  $th_x^* = \arg\max_{th}\left(C_{CTC,x}\,CTC_x(th) + C_{ITC,x}\,ITC_x(th)\right)$
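The per-state optimization can be sketched the same way. Assuming each state's CTC and ITC depend only on that state's own threshold, and since they enter the model as separate terms, the thresholds can be chosen one state at a time. The coefficients below are the per-state utilities reported in the results (open-request, request(bool), request(non-bool)); the CTC/ITC curves are invented and, for simplicity, shared by all states, which real data would not be. `per_state_thresholds` is a hypothetical helper.

```python
from typing import Dict, Tuple

Curve = Dict[float, float]  # candidate threshold -> concepts per turn


def per_state_thresholds(coeffs: Dict[str, Tuple[float, float]],
                         ctc: Dict[str, Curve],
                         itc: Dict[str, Curve]) -> Dict[str, float]:
    """For each state x: th*_x = argmax_th C_CTC,x * CTC_x(th) + C_ITC,x * ITC_x(th).

    coeffs[x] is (C_CTC,x, C_ITC,x); C_ITC,x is negative because incorrectly
    transferred concepts hurt task completion.
    """
    best = {}
    for state, (c_ctc, c_itc) in coeffs.items():
        best[state] = max(
            ctc[state],
            key=lambda th: c_ctc * ctc[state][th] + c_itc * itc[state][th],
        )
    return best


if __name__ == "__main__":
    # Per-state utility coefficients as reported in the talk's results.
    coeffs = {
        "open-request": (0.55, -0.40),
        "request(bool)": (3.31, -0.60),
        "request(non-bool)": (2.55, -3.44),
    }
    # Invented curves, shared by all states for simplicity.
    ctc_curve = {0.0: 1.0, 0.2: 0.95, 0.4: 0.8, 0.6: 0.6}
    itc_curve = {0.0: 0.5, 0.2: 0.3, 0.4: 0.15, 0.6: 0.05}
    ctc = {s: ctc_curve for s in coeffs}
    itc = {s: itc_curve for s in coeffs}
    print(per_state_thresholds(coeffs, ctc, itc))
```

Even with identical curves, the different costs select different thresholds on this toy data: 0.2 for open-request, 0.0 for request(bool), where correct concepts are worth much and errors cost little, and 0.4 for request(non-bool), where incorrectly transferred concepts are expensive.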

10 outline
• current solutions
• proposed approach
• data
• results
• conclusion

11 data
• collected using RoomLine
  • phone-based, mixed-initiative spoken dialog system
  • conference room reservations
  • Sphinx-2 speech recognizer
  • utterance-level confidence annotator, scores in [0, 1]
• 46 participants (first-time users)
• 10 scenario-driven interactions
• corpus
  • 449 dialog sessions
  • 8278 user turns
  • decoded concepts manually labeled for correctness

12 RoomLine states
• 71 "dialog states" in total, clustered into 3 classes:
  • open-request: "How may I help you?"
  • request(bool): "Would you like a reservation for this room?", "Would you like a room with a projector?"
  • request(non-bool): "For what time would you like to reserve the room?"

13 results: task success model
• model predicting binary task success
              baseline   train    cross-validation   p
  AVG-LL      (values not recoverable)
  HARD        17.62%     11.66%   11.75%
[Table: cost coefficients (columns: Variable, Coeff, se, p; rows: Const, CTC / open-request, ITC / open-request, CTC / request(bool), ITC / request(bool), CTC / request(non-bool), ITC / request(non-bool)); values not recoverable]

14 results: threshold optimization
• open-request: utility = 0.55 × CTC - 0.40 × ITC
[Figure: correctly and incorrectly transferred concepts per turn for the open-request state]
[Table: cost coefficients, repeated from slide 13]

15 results: threshold optimization
• open-request: utility = 0.55 × CTC - 0.40 × ITC
• request(bool): utility = 3.31 × CTC - 0.60 × ITC
• request(non-bool): utility = 2.55 × CTC - 3.44 × ITC
• the utility profiles differ across the three states
• task duration models lead to similar results
[Figure: correctly and incorrectly transferred concepts per turn for each state]
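The per-state utilities can also be read back in the terms used by the postulated-cost approach. Dividing the magnitude of each ITC coefficient by the matching CTC coefficient gives, roughly, how many correctly transferred concepts one incorrectly transferred concept costs in that state. This reading is an interpretation added here, not a claim from the talk; `UTILITY` and `implied_cost_ratio` are hypothetical names, while the numbers are the reported ones.

```python
from typing import Dict, Tuple

# Per-state utility coefficients (C_CTC, C_ITC), as reported on the results slides.
UTILITY: Dict[str, Tuple[float, float]] = {
    "open-request": (0.55, -0.40),
    "request(bool)": (3.31, -0.60),
    "request(non-bool)": (2.55, -3.44),
}


def implied_cost_ratio(c_ctc: float, c_itc: float) -> float:
    """|C_ITC| / C_CTC: cost of one incorrectly transferred concept, measured
    in units of the value of one correctly transferred concept."""
    return -c_itc / c_ctc


if __name__ == "__main__":
    for state, (c_ctc, c_itc) in UTILITY.items():
        print(f"{state:<18} {implied_cost_ratio(c_ctc, c_itc):.2f}")
```

The ratios come out at about 0.73 (open-request), 0.18 (request(bool)) and 1.35 (request(non-bool)), a spread of more than sevenfold, which illustrates why a single hand-picked "X times more costly" constant would not suit all three states.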

16 conclusion
• principled method for optimizing the rejection threshold
  • determines costs for various types of understanding errors
  • data-driven approach
  • can derive state-specific costs
  • bridges mismatches between off-the-shelf confidence annotators and the domain

17 thank you

18 fit for task success model

19 expected changes in task success
[Table: current values, new estimates and deltas of CTC and ITC for open-request, request(bool) and request(non-bool); values not recoverable]
                current   new estimate   delta
  task success  82.75%    87.16%         +4.41%
Remains to be seen …
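The talk does not spell out how the new estimate was computed. One plausible way, sketched here purely as an assumption, is to plug each session's CTC and ITC under the new thresholds into the fitted logistic model and average the predicted probabilities; comparing that with the same average under the current thresholds gives a predicted change. `expected_task_success`, the coefficients, and the session values are all hypothetical. (The reported figures themselves are consistent: 87.16% - 82.75% = +4.41 percentage points.)

```python
import math
from typing import Sequence, Tuple


def expected_task_success(c0: float, c_ctc: float, c_itc: float,
                          sessions: Sequence[Tuple[float, float]]) -> float:
    """Mean predicted P(TC = 1) over sessions.

    Each session is its (CTC, ITC) per turn under the thresholds being
    evaluated; c0, c_ctc, c_itc come from a fitted logistic model.
    """
    probs = [1.0 / (1.0 + math.exp(-(c0 + c_ctc * ctc + c_itc * itc)))
             for ctc, itc in sessions]
    return sum(probs) / len(probs)


if __name__ == "__main__":
    # Hypothetical coefficients and per-session values, for illustration only.
    c0, c_ctc, c_itc = 0.5, 2.0, -3.0
    current = [(1.0, 0.4), (0.8, 0.5), (1.2, 0.2)]
    new = [(0.95, 0.2), (0.75, 0.3), (1.15, 0.1)]
    before = expected_task_success(c0, c_ctc, c_itc, current)
    after = expected_task_success(c0, c_ctc, c_itc, new)
    print(f"{before:.2%} -> {after:.2%} ({after - before:+.2%})")
```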

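20 task duration model
[Table: task duration model coefficients (columns: Variable, Coeff, p, se; rows: Const, CTC / open-request, ITC / open-request, CTC / request(bool), ITC / request(bool), CTC / request(non-bool), ITC / request(non-bool)); values not recoverable]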

21 Model 2: resulting fit and coefficients (R^2 = 0.56)