Designing and Evaluating Two Adaptive Spoken Dialogue Systems Diane J. Litman* University of Pittsburgh Dept. of Computer Science & LRDC

Designing and Evaluating Two Adaptive Spoken Dialogue Systems Diane J. Litman* University of Pittsburgh Dept. of Computer Science & LRDC Collaborators: S. Singh, M. Kearns, M. Walker, S. Pan *this work was done at AT&T Labs – Research, Florham Park, NJ

Overview of Research Adaptive and Adaptable Systems TOOT: automatic optimization within a dialogue via supervised learning [ACL-99, NAACL-00, AAAI-00, NAACL-01, ACL-01, UMUAI-02] (this talk) NJFun: automatic optimization of default dialogue behavior via reinforcement learning [COLING-00, AAAI-00, JAIR-02] (this talk) TOOT, CobotDS: personalization within a dialogue via voice control [UM-99, AAAI-02] Novel Applications CobotDS: new types of voice-enabled systems pose many technical challenges (e.g. chat interfaces) [AAAI-02] ITSpeaks: spoken dialogue for intelligent tutoring [ITS wkshp-02]

Today’s Talk: Overview of Issues Builders of spoken dialogue systems face fundamental design choices that strongly influence system performance Can performance be improved by adapting the system? Many forms of adaptation –machine learning vs. user-led Many levels of granularity –dialogue vs. subdialogue vs. utterance Two case studies

Outline I. Reinforcement Learning in Spoken Dialogue Systems –Background –Applying Reinforcement Learning –NJFun: A Reinforcement Learning Spoken Dialogue System –A Controlled User Study II. Rule Induction in Spoken Dialogue Systems –Background –Detecting Problematic Situations –TOOT: An Adaptive Spoken Dialogue System –A Controlled User Study

I. Optimizing Dialogue Management with Reinforcement Learning: Experiments with the NJFun Spoken Dialogue System

Spoken Dialogue Systems Front end: ASR (automatic speech recognition) and TTS (text to speech) Back end: DB, web, etc. Middle: dialogue strategy or policy (what action to take at each point) Dominated by ASR “precision-recall” trade-off [Architecture diagram: user ↔ ASR/TTS front end ↔ dialogue manager ↔ DB back end]

NJFun Dialogues
Task Description: "You have lived in Stanhope for many years but have never managed to visit its historic sites. Today you are feeling virtuous. Use RLDS to find out what you can see this morning.”

Unoptimized System
TTS: Welcome to RLDS. Please say an activity name or say 'list activities' for a list of activities I know about.
USER: I would like to visit a historic site in Stanhope this morning.
TTS: Did you say you are interested in going to a zoo?
USER: No.
TTS: I know about amusement parks, aquariums, cruises, historic sites, museums, parks, theaters, wineries, and zoos. Please say an activity name from this list.
USER: I would like to visit a historic site.
TTS: Did you say you are interested in going to a historic site?
USER: Yes.
TTS: Please give me more information.
USER: ...
TTS: Please tell me the location that you are interested in. You can also tell me the time. …. (8 turns omitted)
TTS: Thank you for using the system. Please give me feedback by saying 'good', 'so-so', or 'bad'.
USER: Bad.
TTS: Goodbye and have a nice day!

Optimized System
TTS: Welcome to RLDS. How may I help you?
USER: I would like to um find about Stanhope historic sites.
TTS: Did you say you are interested in Stanhope?
USER: Yes.
TTS: What time of the day do you want to go?
USER: Uh...morning.
TTS: Did you say you want to go in the morning?
USER: Yes.
TTS: I found 2 historic sites in Stanhope that are open in the morning. They are the Lenape Indian Village, and the Waterloo Village. Thank you for using the system. Please give me feedback by saying 'good', 'so-so', or 'bad'.
USER: Good.
TTS: Goodbye and have a nice day!

Initiative Strategy System initiative vs. user initiative: –Please state your departure city. –How can I help you? Influences user expectations ASR grammar must be chosen accordingly Best choice may differ from state to state! May depend on user population & task Early mistakes may be costly (e.g. confirmation) Delayed reward

Typical System Design: Sequential Search Choose and implement several “reasonable” dialogue policies Field systems, gather dialogue data Do statistical analyses Refield system with “best” dialogue policy Can only examine a handful of policies

Why Reinforcement Learning? (Levin, Pieraccini, Eckert; Walker; Singh, Kearns, Litman, Walker) Agents can learn to improve performance by interacting with their environment Thousands of possible dialogue policies, and want to automate the choice of the “optimal” Can handle many features of spoken dialogue –noisy sensors (ASR output) –stochastic behavior (user population) –delayed rewards, and many possible rewards –multiple plausible actions However, many practical challenges remain

Our Approach Build initial system that is deliberately exploratory wrt state and action space Use dialogue data from initial system to build a Markov decision process (MDP) Use methods of reinforcement learning to compute optimal policy of the MDP, with respect to some reward Re-field (improved?) system given by the optimal policy

State-Based Design System state: contains information relevant for deciding the next action Dialogue policy: mapping from current state to system action Typically hundreds of states, several “reasonable” actions from each state In practice, need a compressed state

Markov Decision Processes System state s (in S) System action a (in A); not all states need have choice Transition probabilities P(s’|s,a) Reward function R(s,a) (stochastic) Fast algorithms for optimal policy Our application: P(s’|s,a) models the population of users Allow choice of actions Learn best choices Parallel search in policy space!
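To make the data-driven construction of this MDP concrete, here is a minimal Python sketch, under my own assumptions about the log format (the states, actions, and terminal-reward convention are illustrative, not the actual NJFun code): transition probabilities come from counts over the exploratory dialogues, and R(s,a) is the average reward observed when (s,a) ends a dialogue.

from collections import defaultdict

def estimate_mdp(trajectories):
    """Each trajectory is ([(state, action), ...], final_reward) from one logged dialogue.
    Rewards are credited only on the final transition (terminal reward), as in the talk."""
    trans_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> s' -> count
    reward_sums = defaultdict(float)                      # (s, a) -> summed terminal reward
    visits = defaultdict(int)                             # (s, a) -> visit count

    for steps, final_reward in trajectories:
        for i, (s, a) in enumerate(steps):
            last = (i + 1 == len(steps))
            s_next = 'TERMINAL' if last else steps[i + 1][0]
            trans_counts[(s, a)][s_next] += 1
            visits[(s, a)] += 1
            if last:
                reward_sums[(s, a)] += final_reward

    P = {sa: {s2: c / visits[sa] for s2, c in nxt.items()}
         for sa, nxt in trans_counts.items()}                # estimated P(s'|s,a)
    R = {sa: reward_sums[sa] / visits[sa] for sa in visits}  # estimated E[R(s,a)]
    return P, R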

The Application: NJFun Dialogue system providing telephone access to a DB of activities in NJ Want to obtain 3 attributes: –activity type (e.g., wine tasting) –location (e.g., Lambertville) –time (e.g., morning) Failure to bind: query DB with don’t-care

The State Space N.B. Non-state variables record attribute values; state does not condition on previous attributes!

Sample Action Choices Initiative (when T = 0) –user (open prompt and grammar) –mixed (constrained prompt, open grammar) –system (constrained prompt and grammar) Example –GreetU: “How may I help you?” –GreetS: “Please say an activity name.”

Dialogue Policy Class Specify “reasonable” actions for each state –42 choice states (binary initiative or confirmation action choices) –no choice for all other states Small state space (62), large policy space (2^42) Example choice state –initial state: [1,0,0,0,0,0] –action choices: GreetS, GreetU Learn optimal action for each choice state

The Experiment Designed 6 specific tasks, each with web survey 54 training subjects generated 311 dialogues Exploratory training dialogues used to build MDP Optimal policy for objective (binary) task completion computed and implemented 21 test subjects performed tasks and web surveys for modified system generated 124 dialogues Did statistical analyses of performance changes

Main Results Objective task completion (-1 to 3, partial credit): –train mean ~ 1.722, test mean ~ –two-sample t-test p-value ~ Binary task completion: –train mean ~ 51.5%, test mean ~ 63.5% –two-sample t-test p-value ~ 0.05

Other Results Subjective measures “move to the middle” rather than improve First graph: It was easy to find the place that I wanted (strongly agree = 5, …, strongly disagree = 1) train mean = 3.38, test mean = 3.39, p-value = .98

Other Results (continued) Using exploratory dialogues as a Monte Carlo proxy shows that our learned policy outperforms several standard fixed policies Comparison to Human Design A Sanity Check of the MDP Correlation between Monte Carlo and MDP

Conclusions I MDPs and RL a natural and promising framework for automated dialogue policy design First practical empirical test of formalism Resulted in significant system improvements Favorable comparison to human-designed strategies Interesting dialogue results Care in application: –choice of states and actions –gathering exploratory data –choice of reward to optimize

Future Work I Automate choice of states and actions Scale to more complex systems POMDPs due to hidden state Learn terminal (and non-terminal) reward function Online rather than batch learning

II. Rule Induction in Spoken Dialogue Systems: Predicting and Adapting to Poor Speech Recognition with TOOT

Predicting and Adapting to Poor Speech Recognition in a Spoken Dialogue System Diane J. Litman AT&T Labs -- Research Shimei Pan Columbia University

A Successful Dialogue after Adaptation
S: Hi, this is AT&T Amtrak schedule system. This is Toot. How may I help you?
U: I need to find a train going to New York City.
S: Which city do you want to leave from?
U: Baltimore.
S: Do you want me to find the trains from Baltimore to DC on Thursday at anytime now?
U: No.
S: How may I help you?
U: I need to leave on Sunday.
**TOOT adapts to more conservative dialogue behaviors**
S: I heard you say go to DC between 1 night and 8. Do you want me to find the trains from Baltimore to DC on Thursday between 1 night and 8 now?
U: No.

A Successful Dialogue without Adaptation
S: Hi, this is AT&T Amtrak schedule system. This is Toot. How may I help you?
U: I want a train from Baltimore to New York City on Sunday at 8:30 in the evening.
S: Do you want me to find the trains from Baltimore to New York City on Sunday approximately at 8:30 in the evening now?
U: Yes.
S: I am going to get the train schedules for you. It will take some time depending on the internet traffic. Please wait patiently....

S: You can say yes or no.
U: No.
S: How may I help you?
U: I need to go to New York City from Baltimore on Sunday at 8:30 pm.
S: I heard you say go to Chicago on Tuesday around 8:45 am. Do you want me to find the trains from Baltimore to Chicago on Tuesday around 8:45 am now?
U: No.
**TOOT adapts to even more conservative dialogue behaviors**
S: Which city do you want to go to?
U: New York City.
S: Do you want to go to New York City?
U: Yes.
S: Which city do you want to leave from?
U: Baltimore...

Toot: An Adaptive Spoken Dialogue System A phone-based system for accessing train schedules Implemented using internal platform (Kamm et al ‘97) –phone interface –automatic speech recognition (ASR) –text to speech (TTS) –dialogue manager –application manager Different versions depending on dialogue strategy and adaptability parameters –initiative strategies (system, mixed, user) –confirmation strategies (explicit, implicit, none) –adaptability condition (adaptive, non-adaptive)

Learning to Detect Problematic Dialogues (Litman et al. ‘99) Use machine learning to automatically derive rules for detecting poor speech recognition at the dialogue level –speech recognition is most predictive of performance –can improve recognition by changing dialogue strategies 2 classifications –if (% of misrecognized utterances > threshold) then BAD –else GOOD 23 (automatically computable) features –acoustic confidence, dialogue efficiency, dialogue quality, lexical, experimental parameters
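As a small illustration of this labeling scheme, here is a hedged Python sketch (helper names and the default threshold are hypothetical; the slide only specifies '% of misrecognized utterances > threshold'):

def label_dialogue(utterances, threshold=0.11):
    """utterances: non-empty list of dicts with a 'misrecognized' flag, obtained offline
    by comparing ASR output against a human transcription. Returns the class label
    used to train the rule learner."""
    misrecognized = sum(1 for u in utterances if u['misrecognized'])
    return 'BAD' if misrecognized / len(utterances) > threshold else 'GOOD'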

Instantiation for Toot Corpus –120 Toot dialogues from previous experiments (Litman and Pan ‘99) –45 GOOD dialogues (e.g., first Toot example) –75 BAD dialogues (e.g., second Toot example) Machine learning program –Ripper (Cohen ‘96) Best learned ruleset uses a single acoustic feature –if (predicted_misrecog%_using_confScore_-4 > 3%) then BAD –else GOOD –80% cross-validated accuracy rate

Example
S1: This is Toot. How may I help you?
U1: I need to find a train going to New York City.
ASR1: string=DC I don’t care on Thursday   confScore= -5.3
S2: Which city do you want to leave from?
U2: Baltimore.
ASR2: string=Baltimore   confScore= -1.7
Since predicted_misrecog%_using_confScore_-4 = 50%, which is > 3%, dialogue is classified as BAD
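A runnable sketch of the learned predictor applied to this example (function names are mine, not TOOT's): estimate the misrecognition rate as the fraction of utterances whose ASR confidence score falls below -4, then apply the 3% rule from the previous slide.

def predicted_misrecog_pct(conf_scores, cutoff=-4.0):
    # Fraction of utterances whose log-likelihood confidence falls below the cutoff.
    return sum(1 for c in conf_scores if c < cutoff) / len(conf_scores)

def classify_dialogue_so_far(conf_scores):
    # Learned rule: more than 3% predicted misrecognitions => BAD.
    return 'BAD' if predicted_misrecog_pct(conf_scores) > 0.03 else 'GOOD'

# Worked example above: scores -5.3 and -1.7 give 50% > 3%, hence BAD.
print(classify_dialogue_so_far([-5.3, -1.7]))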

Predicting and Adapting to Problems Online
Algorithm and tuneable parameters:

Main
  …
  for each user utterance
    if ((turns since CurStrat assigned) >= AdaptFreq)
      PredictUsing(Ruleset);
  …

PredictUsing(Ruleset)
  for each rule R in Ruleset
    if (CheckPre(R) == “TRUE”)
      if (RightHandSide(R) == “BAD”)
        AdaptConservative(CurStrat);
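A minimal runnable Python version of this loop, under my own assumptions about the surrounding system (the rule representation is simplified and the strategy names follow the experimental-evaluation slide below; this is an illustration, not TOOT's implementation):

ESCALATION = ['UserNoConfirm', 'MixedImplicit', 'SystemExplicit']  # most to least permissive

class AdaptiveManager:
    def __init__(self, ruleset, adapt_freq=4):
        self.ruleset = ruleset          # list of (precondition_fn, label) pairs
        self.adapt_freq = adapt_freq    # re-check after this many user turns
        self.strategy_idx = 0           # start with the most permissive strategy
        self.turns_since_assigned = 0

    @property
    def current_strategy(self):
        return ESCALATION[self.strategy_idx]

    def on_user_utterance(self, dialogue_features):
        self.turns_since_assigned += 1
        if self.turns_since_assigned >= self.adapt_freq:
            self._predict_using(dialogue_features)

    def _predict_using(self, dialogue_features):
        # Fire the first rule whose precondition holds; adapt if it predicts BAD.
        for precondition, label in self.ruleset:
            if precondition(dialogue_features):
                if label == 'BAD':
                    self._adapt_conservative()
                return

    def _adapt_conservative(self):
        if self.strategy_idx < len(ESCALATION) - 1:
            self.strategy_idx += 1
            self.turns_since_assigned = 0   # CurStrat was just reassigned

A ruleset in this form could contain a single entry wrapping the confidence-score rule sketched earlier.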

Experimental Evaluation Adaptive vs. Non-Adaptive Toot –initial dialogue strategy = UserNo (user initiative, no confirmation) –adaptation frequency = 4 user turns –ruleset = rules learned using Ripper –AdaptConservative = switch to MixedImplicit, then switch to SystemExplicit 6 subjects (naïve users) per Toot 4 tasks per subject 48 total dialogues (recordings, logs, and user surveys) Evaluation measures: dialogue efficiency, dialogue quality, task success, and usability

Adaptability Results Adaptive Toot outperforms Non-Adaptive Toot –higher task success –higher user expertise and overall satisfaction –more accurate ASR –shorter dialogues Adaptive Toot only adapts when appropriate –task success is .66/.60 when adaptation is/isn’t triggered

Related Work Dialogue level –Litman and Pan ‘99, Litman et al. ‘99, Walker et al. ‘00 Utterance level –Smith ‘98, Levow ‘98, Litman et al. ‘00, Chu-Carroll and Nickerson ‘00 Multiple dialogue level –Levin & Pieraccini ‘97, Singh et al. ‘00 (see talk tomorrow at 2!)

Summary A fixed dialogue strategy will not be ideal for different users, or even for the same user in different circumstances An adaptive system can improve performance –adaptive Toot adapts to more conservative dialogue strategies, based on its predictions of ASR problems –task success increases from 23% to 65%

Sample Task Scenario Try to find a train going to New York City from Baltimore on Sunday at 8:30 pm. If you cannot find an exact match, find the one with the closest departure time. Please write down the exact departure time of the train you found as well as the total travel time.

Overview Problem –spoken dialogue system performance can vary widely for different users, and even for the same user during different dialogues Hypothesis –system performance can be improved by dynamically detecting and responding to problematic dialogue situations Results –task success increases in a system which predicts and adapts to poor speech recognition during the course of a dialogue

Conclusions Adaptation via Learning and Personalization can improve dialogue system performance NJFun: Given a small number of exploratory dialogues, reinforcement learning can be used in a working system to choose a more optimal policy within a large search space CobotDS: Personalization can be useful and entertaining in a social environment; such applications pose interesting and different research challenges

Backup Slides

Sample Dialogue
S1: Welcome to NJFun. How may I help you?
U1: I’d like to find um winetasting in Lambertville in the morning. (ASR output: I’d like to find out wineries the in the Lambertville in the morning.)
S2: Did you say you are interested in Lambertville?
U2: Yes.
S3: Did you say you want to go in the morning?
U3: Yes.
S4: I found a winery near Lambertville that is open in the morning. It is […] Please give me feedback by saying “good”, “so-so” or “bad”.
U4: Good

Confirmation Policy High ASR confidence: accept ASR match and move on Moderate ASR confidence: confirm Low ASR confidence: re-ask How to set confidence thresholds? Early mistakes can be costly later, but excessive confirmation is annoying
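An illustrative sketch of this threshold logic (the numeric cutoffs are placeholders; setting them well is exactly the design question the slide raises):

def confirmation_action(asr_confidence, high=-2.0, low=-4.0):
    """Map an ASR confidence (log-likelihood) score to a dialogue action.
    The -2.0/-4.0 cutoffs are hypothetical, not NJFun's actual values."""
    if asr_confidence >= high:
        return 'ACCEPT'    # trust the recognizer and move on
    elif asr_confidence >= low:
        return 'CONFIRM'   # "Did you say ...?"
    else:
        return 'REASK'     # re-prompt for the attribute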

Computing the Optimal Given parameters P(s’|s,a), R(s,a), can efficiently compute policy maximizing expected return Typically compute the expected cumulative reward (or Q-value) Q(s,a), using value iteration Optimal policy selects the action with the maximum Q-value at each dialogue state
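A compact, generic value-iteration sketch in Python (a textbook version that consumes the P and R dictionaries from the estimation sketch earlier; it is not the NJFun code):

def value_iteration(P, R, states, actions, gamma=1.0, iters=100):
    """P[(s, a)]: dict s' -> probability; R[(s, a)]: expected immediate reward.
    actions(s) returns the action choices in state s (a single action for no-choice states).
    Returns Q-values and the greedy policy (best action per state)."""
    Q = {(s, a): 0.0 for s in states for a in actions(s)}
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        for s in states:
            for a in actions(s):
                Q[(s, a)] = R.get((s, a), 0.0) + gamma * sum(
                    p * V.get(s2, 0.0) for s2, p in P.get((s, a), {}).items())
            V[s] = max(Q[(s, a)] for a in actions(s))
    policy = {s: max(actions(s), key=lambda a: Q[(s, a)]) for s in states}
    return Q, policy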

Potential Benefits A principled and general framework for automated dialogue policy synthesis –learn the optimal action to take in each state Compares all policies simultaneously –data efficient because actions are evaluated as a function of state –traditional methods evaluate entire policies Potential for “lifelong learning” systems, adapting to changing user populations

Actions and State Actions: –initiative: open or closed prompt? open or closed grammar? –confirmation: confirm, re-ask, move on? –binary choices –only “reasonable” states –conservative actions State features: –ASR scores –barge-in counts –number of tries –time-outs –ASR-centric 42 states with binary action choice; no function approximation

Sample Confirmation Choices Confirmation (when V = 1) –confirm –no confirm Example –Conf3: “Did you say you want to go in the <time>?” –NoConf3: “” (no confirmation prompt)

Some System Details Uses AT&T’s WATSON ASR and TTS platform, DMD dialogue manager Natural language web version used to build multiple ASR language models Initial statistics used to tune bins for confidence values, history bit (informative state encoding)

Main Results
On all dialogues:
Objective task completion (-1 to 3, partial credit): –train mean ~ 1.722, test mean ~ –two-sample t-test p-value ~
Binary task completion: –train mean ~ 51.5%, test mean ~ 63.5% –two-sample t-test p-value ~ 0.05
On “expert” dialogues 3-6:
Binary task completion: –train mean ~ 45.6%, test mean ~ 68.2% –two-sample t-test p-value ~ 0.001

Comparison to Human Design Fielded comparison infeasible, but exploratory dialogues provide a Monte Carlo proxy of “consistent trajectories” Test policy: Average binary completion reward = 0.67 (based on 12 trajectories) Outperforms several standard fixed policies –SysNoConfirm: (11) –SysConfirm: -0.6 (5) –UserNoConfirm: -0.2 (15) –Mixed: (13) –User Confirm: (11), no difference
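A sketch of the "consistent trajectories" proxy in Python (my own formulation of the idea on this slide): a fixed policy is scored by averaging the final reward of the exploratory dialogues whose action choices all agree with that policy.

def monte_carlo_value(policy, trajectories):
    """policy: dict state -> action (choice states only). trajectories: list of
    ([(state, action), ...], reward) pairs from the exploratory system."""
    consistent = [reward for steps, reward in trajectories
                  if all(policy.get(s, a) == a for s, a in steps)]
    if not consistent:
        return None, 0
    return sum(consistent) / len(consistent), len(consistent)   # value, #trajectories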

A Sanity Check of the MDP Generate many random policies Compare value according to MDP and value based on consistent exploratory trajectories MDP evaluation of policy: ideally perfectly accurate (infinite Monte Carlo sampling), linear fit with slope 1, intercept 0 Correlation between Monte Carlo and MDP: –1000 policies, > 0 trajs: cor. 0.31, slope 0.953, int , p < –868 policies, > 5 trajs: cor. 0.39, slope 1.058, int , p < 0.001
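The regression behind these numbers can be reproduced with a few lines of Python, assuming the two per-policy value lists have already been computed (an MDP evaluation and a Monte Carlo estimate for each random policy):

from scipy.stats import linregress

def mdp_sanity_check(mdp_values, monte_carlo_values):
    """mdp_values[i] and monte_carlo_values[i] evaluate the i-th random policy.
    A trustworthy MDP would give slope near 1, intercept near 0, and strong correlation."""
    fit = linregress(mdp_values, monte_carlo_values)
    return fit.slope, fit.intercept, fit.rvalue, fit.pvalue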

Related Work Biermann and Long (1996) Levin, Pieraccini, and Eckert (1997) Walker, Fromer, and Narayanan (1998) Singh, Kearns, Litman, and Walker (1999) Scheffler and Young (2000) Beck, Woolf, and Beal (2000) Roy, Pineau, and Thrun (2000)