User Simulation for Spoken Dialogue Systems Diane Litman Computer Science Department & Learning Research and Development Center University of Pittsburgh.


User Simulation for Spoken Dialogue Systems Diane Litman Computer Science Department & Learning Research and Development Center University of Pittsburgh (Currently Leverhulme Visiting Professor, University of Edinburgh) Joint work with Hua Ai Intelligent Systems Program, University of Pittsburgh

User Simulation
Motivation: Empirical research requires dialogue corpora
–Less expensive
–More efficient
–More (and better?) data compared to humans

User Simulation: Overview
Motivation: Empirical research requires dialogue corpora (less expensive, more efficient, more (and better?) data compared to humans)
How realistic? Power of evaluation measures
–Discriminative ability [AAAI WS, 2006]
–Impact of the source corpus: subjects vs. real users [SIGDial, 2007]
–Validation of evaluation: human assessment [ACL, 2008]
How useful? Task dependent
–Dialogue strategy learning: utility of realistic vs. exploratory models for reinforcement learning [NAACL, 2007] *
–Dialogue system evaluation: more realistic models via knowledge consistency [Interspeech, 2007] *
(*: topics covered in this talk)

Outline
User Simulation Models
–Previous work
–Our initial models
Are more realistic models always “better”?
Developing more realistic models via knowledge consistency
Summary and Current Work

User Simulation Models
Simulate user dialogue behaviors in simple (or, not too complicated) ways
How to simulate
–Various strategies: random, statistical, analytical
What to simulate
–Model dialogue behaviors on different levels: acoustic, lexical, semantic / intentional

Previous Work
Most models simulate on the intentional level and are statistically trained from human user corpora
Bigram Models
–P(next user action | previous system action) [Eckert et al., 1997]
–Only accept the expected dialogue acts [Levin et al., 2000]
Goal-Based Models
–Hard-coded fixed goal structures [Scheffler, 2002]
–P(next user action | previous system action, user goal) [Pietquin, 2004]
–Goal- and agenda-based models [Schatzmann et al., 2007]
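To make the bigram approach above concrete, here is a minimal sketch (Python, with invented dialogue-act labels) of a simulator that samples the next user act conditioned only on the previous system act, in the spirit of Eckert et al. (1997); it is an illustration, not any of the cited implementations.

```python
import random
from collections import Counter, defaultdict

class BigramUserSimulator:
    """P(next user action | previous system action), estimated from a human corpus."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, dialogues):
        # dialogues: list of [(system_act, user_act), ...] pairs from a human corpus
        for dialogue in dialogues:
            for system_act, user_act in dialogue:
                self.counts[system_act][user_act] += 1

    def respond(self, system_act):
        # Sample a user act in proportion to its corpus frequency after system_act
        options = self.counts[system_act]
        if not options:
            return None  # unseen system act: back off however the designer prefers
        acts, freqs = zip(*options.items())
        return random.choices(acts, weights=freqs, k=1)[0]

# Hypothetical usage with invented act labels:
sim = BigramUserSimulator()
sim.train([[("ask_law", "answer_correct"), ("give_feedback", "ack")]])
print(sim.respond("ask_law"))
```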

Previous Work (continued)
Models that exploit user state commonalities
–Linear combinations of shared features [Georgila et al., 2005]
–Clustering [Rieser et al., 2006]
Models that improve the speech recognizer and understanding components
–Word-level simulation [Chung, 2004]

Our Domain: Tutoring
ITSpoke: Intelligent Tutoring Spoken Dialogue System
–Back-end is the Why2-Atlas system [VanLehn et al., 2002]
–Sphinx2 speech recognition and Cepstral text-to-speech
The system initiates a tutoring conversation with the student to correct misconceptions and to elicit explanations
Student answers: correct, incorrect

ITSpoke Corpora
Two different student groups in f03 and s05
Systems have minor variations (e.g., voice, slightly different language models)

Corpus   | Student Population | System Voice | Number of Dialogues
f03      | 2003               | Synthesized  | 100
s05 syn  | 2005               | Synthesized  | 136
s05 pre  | 2005               | Pre-recorded | 135

Our Simulation Approach
Simulate on the word level
–We use the answers from the real student answer sets as candidate answers for simulated students
First step – basic simulation models
–A random model: gives random answers
–A probabilistic model: answers a question with the same correctness rate as our real students

The Random Model
A unigram model
–Randomly pick a student answer from all utterances, neglecting the tutor question
Example dialogue:
ITSpoke: The best law of motion to use is Newton’s third law. Do you recall what it says?
Student: Down.
ITSpoke: Newton’s third law says…
…
ITSpoke: Do you recall what Newton’s third law says?
Student: More.

The ProbCorrect Model
A bigram model
–P(Student Answer | Tutor Question)
–Gives correct/incorrect answers with the same probability as the real students
Example dialogue:
ITSpoke: The best law of motion to use is Newton’s third law. Do you recall what it says?
Student: Yes, for every action, there is an equal and opposite reaction.
ITSpoke: This is correct!
…
ITSpoke: Do you recall what Newton’s third law says?
Student: No.
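A minimal sketch of the ProbCorrect idea, assuming each tutor question comes with pools of real correct and incorrect student answers plus the correctness rate observed for real students; names and data layout are illustrative rather than taken from ITSpoke.

```python
import random

class ProbCorrectSimulator:
    """Answer each tutor question correctly with the same probability as real students."""

    def __init__(self, answer_pools, correct_rates):
        # answer_pools[q] = {"correct": [...real answers...], "incorrect": [...]}
        # correct_rates[q] = fraction of real students who answered q correctly
        self.answer_pools = answer_pools
        self.correct_rates = correct_rates

    def answer(self, question):
        label = "correct" if random.random() < self.correct_rates[question] else "incorrect"
        return random.choice(self.answer_pools[question][label]), label

# Hypothetical usage:
pools = {"newtons_third_law": {
    "correct": ["For every action there is an equal and opposite reaction."],
    "incorrect": ["No.", "Down."]}}
sim = ProbCorrectSimulator(pools, {"newtons_third_law": 0.6})
print(sim.answer("newtons_third_law"))
```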

Outline
User Simulation Models
–Previous work
–Our initial models
Are more realistic models always “better”?
–Task: Dialogue Strategy Learning
Developing more realistic models via knowledge consistency
Summary and Current Work

Learning Task
ITSpoke can only respond to student (in)correctness, but student (un)certainty is also believed to be relevant
Goal: learn how to manipulate the strength of tutor feedback, in order to maximize student certainty

Corpus
Part of the s05 data (with annotation)
–26 human subjects, 130 dialogues
Automatically logged
–Correctness (c, ic); percent incorrectness (ic%)
–Kappa (automatic/manual) = 0.79
Human annotated
–Certainty (cert, ncert)
–Kappa (two annotators) = 0.68

Sample Coded Dialogue
ITSpoke: Which law of motion would you use?
Student: Newton’s second law. [ic, ic%=100, ncert]
ITSpoke: Well… The best law to use is Newton’s third law. Do you recall what it says?
Student: For every action there is an equal and opposite reaction. [c, ic%=50, ncert]

Markov Decision Processes (MDPs) and Reinforcement Learning
What is the best action for an agent to take in any state to maximize reward?
MDP representation
–States, Actions, Transition Probabilities
–Reward
Learned policy
–Optimal action to take in each state
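For readers unfamiliar with how an optimal policy is extracted from an MDP, the sketch below shows textbook value iteration over a small tabular MDP; it illustrates the general machinery, not the specific solver used in this work.

```python
def value_iteration(states, actions, transition, reward, gamma=0.9, tol=1e-6):
    """transition[s][a] = list of (next_state, prob); reward[s][a] = immediate reward."""
    values = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                reward[s][a] + gamma * sum(p * values[s2] for s2, p in transition[s][a])
                for a in actions
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # The learned policy: the action maximizing expected long-term reward in each state
    policy = {
        s: max(actions, key=lambda a: reward[s][a] + gamma *
               sum(p * values[s2] for s2, p in transition[s][a]))
        for s in states
    }
    return policy, values
```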

MDPs in Spoken Dialogue
[Diagram: training data from a user simulator or human users is used to estimate an MDP, from which the dialogue system’s policy is learned]
The MDP can be created offline; interactions with users work online

Our MDP Action Choices
Tutor feedback
–Strong Feedback (SF): “This is great!”
–Weak Feedback (WF): “Well…”, doesn’t comment on the correctness
Strength of the tutor’s feedback is strongly related to the percentage of student certainty (chi-square, p<0.01)

Our MDP States and Rewards
State features are derived from the Certainty and Correctness annotations
The reward is based on the percentage of certain student utterances during the dialogue

Our MDP Configuration
States
–Representation 1: c + ic%
–Representation 2: c + ic% + cert
Actions
–Strong Feedback, Weak Feedback
Reward
–+100 (high certainty), -100 (low certainty)
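A sketch of how this configuration might be instantiated when estimating an MDP from a (simulated) dialogue corpus; the state encoding, the 50% binning of ic%, and crediting every turn with the end-of-dialogue reward are simplifying assumptions made for illustration, not details from the paper.

```python
from collections import defaultdict

def encode_state(correct, ic_percent, certain=None):
    """Representation 1: correctness + binned ic%; Representation 2 adds certainty.
    The 50% threshold is an illustrative binning."""
    state = ("c" if correct else "ic", "ic%>50" if ic_percent > 50 else "ic%<=50")
    return state if certain is None else state + ("cert" if certain else "ncert",)

def estimate_mdp(dialogues):
    """dialogues: list of (turns, final_reward), where turns is a list of
    (state, action, next_state) triples and final_reward is +100 or -100.
    Returns frequency-based transition probabilities and average rewards."""
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    rewards = defaultdict(lambda: defaultdict(list))
    for turns, final_reward in dialogues:
        for state, action, next_state in turns:
            counts[state][action][next_state] += 1
            # Simplification: credit every turn with the dialogue-final reward
            rewards[state][action].append(final_reward)
    transition = {
        s: {a: [(s2, n / sum(nexts.values())) for s2, n in nexts.items()]
            for a, nexts in acts.items()}
        for s, acts in counts.items()
    }
    reward = {s: {a: sum(v) / len(v) for a, v in acts.items()}
              for s, acts in rewards.items()}
    return transition, reward
```

The output feeds directly into a solver like the value iteration sketch above.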

Our Reinforcement Learning Goal
Learn an optimal policy using simulated dialogue corpora
Example learned policy
–Give Strong Feedback when the current student answer is Incorrect and the percentage of Incorrect answers is greater than 50%
–Otherwise give Weak Feedback
Research question: what is the impact of using different simulation models?

Probabilistic Simulation Model
Captures realistic student behavior in a probabilistic way
For each question, answer counts conditioned on the tutor feedback:

                 c+cert | c+ncert | ic+cert | ic+ncert
Strong Feedback  5      | 1       | 2       | 3
Weak Feedback    4      | 4       | 1       | 3
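Read as conditional frequencies, a table like the one above can drive a simple sampler of the (correctness, certainty) category of the next student answer given the tutor’s last feedback. A minimal sketch, using a single illustrative table rather than one table per question as in the actual model:

```python
import random

# Observed answer-category counts conditioned on the tutor's last feedback
COUNTS = {
    "SF": {"c+cert": 5, "c+ncert": 1, "ic+cert": 2, "ic+ncert": 3},
    "WF": {"c+cert": 4, "c+ncert": 4, "ic+cert": 1, "ic+ncert": 3},
}

def sample_answer_category(last_feedback):
    categories, weights = zip(*COUNTS[last_feedback].items())
    return random.choices(categories, weights=weights, k=1)[0]

# The Restricted Random model replaces the counts with all 1s (uniform over categories);
# the Total Random model ignores the question and the feedback entirely.
print(sample_answer_category("SF"))
```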

Total Random Simulation Model
Explores all possible dialogue states
Ignores what the current question is or what feedback is given
Randomly picks one utterance from the candidate answer set

Restricted Random Model
A compromise between exploration of the dialogue state space and the realism of the generated user behaviors
For each question, answer counts conditioned on the tutor feedback:

                 c+cert | c+ncert | ic+cert | ic+ncert
Strong Feedback  1      | 1       | 1       | 1
Weak Feedback    1      | 1       | 1       | 1

Methodology
[Diagram: the old system is run against the Probabilistic, Total Random, and Restricted Random simulation models to generate training corpora 1–3; an MDP is estimated from each corpus, policies 1–3 are learned, and the resulting systems 1–3 are each evaluated against the Probabilistic model]

Methodology (continued)
For each configuration, we run the simulation models until the learned policies no longer change
Evaluation measure
–Number of dialogues that would be assigned reward +100 using the old median split
–Baseline = 250
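A sketch of the median-split evaluation: each test dialogue’s percentage of certain student turns is compared against the median computed from the original corpus, and dialogues above that median are the ones that would receive reward +100 (function and variable names are illustrative).

```python
import statistics

def count_high_certainty_dialogues(test_certainty_rates, original_certainty_rates):
    """Number of test dialogues that would receive reward +100 under the old median split."""
    median = statistics.median(original_certainty_rates)
    return sum(1 for rate in test_certainty_rates if rate > median)

# With 500 test dialogues, a chance-level policy would score around 250 (the baseline).
```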

Evaluation Results
[Table: number of +100 dialogues for the Probabilistic, Total Random, and Restricted Random models under State Representations 1 and 2; cell values not preserved in this transcript]
Blue: Restricted Random significantly outperforms the other two models
Underline: the learned policy significantly outperforms the baseline
NB: Results are similar with other reward functions and evaluation metrics

Discussion
We suspect that the performance of the Probabilistic Model is harmed by data sparsity in the real corpus
–In State Representation 1, 25.8% of the possible states do not exist in the real corpus
–Of the most frequent states in State Representation [number not preserved]: [value not preserved]% are seen frequently in the Probabilistic training corpus, 76.3% in the Restricted Random corpus, and 65.2% in the Total Random corpus

In Sum
When using simulation models for MDP policy training
–Hypothesis confirmed: when trained from a sparse data set, it may be better to use a Restricted Random Model than a more realistic Probabilistic Model or a more exploratory Total Random Model
Next steps
–Test the learned policies with human subjects to validate the learning process
–How about the cases when we do need a realistic simulation model?

Outline
User Simulation Models
–Previous work
–Our initial models
Are more realistic models always “better”?
Developing more realistic models via knowledge consistency
Summary and Current Work

A New Model & A New Measure
From goal consistency to knowledge consistency
A new simulation model
–A student’s knowledge during a tutoring session is consistent: if the student answers a question correctly, the student is more likely to answer a similar question correctly later
A new evaluation measure
–Knowledge consistency can be measured using learning curves: if a simulated student behaves similarly to a real student, we should see a similar learning curve in the simulated data

The Cluster Model
Models student learning
–P(Student Answer | Cluster of Tutor Question, last Student Correctness)
Example dialogue:
ITSpoke: The best law of motion to use is Newton’s third law. Do you recall what it says?
Student: Yes, for every action, there is an equal reaction.
ITSpoke: This is almost right… there is an equal and opposite reaction.
…
ITSpoke: Do you recall what Newton’s third law says?
Student: Yes, for every action, there is an equal and opposite reaction.
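A minimal sketch of the Cluster model’s conditioning, P(student answer | knowledge component of the question, correctness on the last question requiring that component); the data layout and class interface are hypothetical.

```python
import random
from collections import defaultdict

class ClusterSimulator:
    """Condition the student answer on the question's knowledge component (KC) and on
    correctness on the last question that required the same KC."""

    def __init__(self, answer_pools):
        # answer_pools[(kc, last_correct)] = list of (real answer text, is_correct) pairs,
        # where last_correct is True, False, or None (no earlier question on this KC)
        self.answer_pools = answer_pools
        self.last_correct = defaultdict(lambda: None)  # per-KC memory within one dialogue

    def answer(self, kc):
        reply, correct = random.choice(self.answer_pools[(kc, self.last_correct[kc])])
        self.last_correct[kc] = correct  # remembered for the next question on this KC
        return reply
```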

Knowledge Component Representation
Knowledge component: a “concept” discussed by the tutor
The choice of grain size is determined by the instructional objectives of the designers
A domain expert manually clustered the 210 tutor questions into 20 knowledge components (f03 data)
–E.g., 3rdLaw, acceleration, etc.

Sample Coded Dialogue
ITSpoke: Do you recall what Newton’s third law says? [3rdLaw]
Student: No. [incorrect]
ITSpoke: Newton’s third law says… If you hit the wall harder, is the force of your fist acting on the wall greater or less? [3rdLaw]
Student: Greater. [correct]

Evaluation: Learning Curves (1)
Learning effect: the student performs better after practicing more
We can visualize the learning effect by plotting an exponentially decreasing learning curve [PSLC]

Learning Curves (2)
Among all the students, 36.5% made at least 1 error at their 2nd opportunity to practice

Learning Curves (3)
Standard way to plot the learning curve
–First compute separate learning curves for each knowledge component, then average them to get an overall learning curve
We only see smooth learning curves among high learners
–High/Low learners: median split based on normalized learning gain
–Learning curve: mathematical representation
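A sketch of the standard procedure just described, assuming per-student logs of (knowledge component, practice opportunity, error) observations: compute an error rate per knowledge component at each opportunity, then average the per-KC curves into one overall curve (fitting an exponential or power-law curve, as in the PSLC methodology, would be a further step not shown here).

```python
from collections import defaultdict

def learning_curves(observations, max_opportunity):
    """observations: iterable of (kc, opportunity_index, made_error) tuples, opportunity >= 1.
    Returns {opportunity: mean error rate, averaged over knowledge components}."""
    by_kc = defaultdict(lambda: defaultdict(list))   # kc -> opportunity -> list of 0/1 errors
    for kc, opp, made_error in observations:
        by_kc[kc][opp].append(1 if made_error else 0)
    overall = {}
    for opp in range(1, max_opportunity + 1):
        kc_rates = [sum(errs[opp]) / len(errs[opp])
                    for errs in by_kc.values() if errs.get(opp)]
        if kc_rates:
            overall[opp] = sum(kc_rates) / len(kc_rates)  # average the per-KC curves
    return overall
```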

Experiments (1)
Simulation models
–ProbCorrect Model: P(A | Q)
  A: student answer; Q: tutor question
–Cluster Model: P(A | KC, C)
  A: student answer; KC: knowledge component; C: correctness of the student’s answer to the last previous question that requires the same KC

Experiments (2)
Evaluation measures: compare simulated user dialogues to human user dialogues using automatic measures
New measure, based on knowledge consistency
–R-squared: how well the simulated learning curve correlates with the observed learning curve in the real student data
Prior measures: high-level dialogue features [Schatzmann et al., 2005]
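A sketch of the knowledge-consistency measure as described: the coefficient of determination between the error rates on the real students’ learning curve and those on the simulated curve at the same practice opportunities, with the usual adjustment for the number of fitted parameters; the exact regression setup from the paper is not reproduced here.

```python
def r_squared(observed, predicted):
    """Coefficient of determination between observed (real-student) and simulated
    error rates, taken at the same practice opportunities."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_points, n_params):
    # Penalizes for the number of parameters used to fit the curve
    return 1 - (1 - r2) * (n_points - 1) / (n_points - n_params - 1)
```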

Prior Evaluation Measures
High-level dialogue features from Schatzmann et al. and our corresponding measures:

Schatzmann et al.                                                | Our measures                             | Abbreviation
Dialogue Length (number of turns)                                | Number of student/tutor turns            | Sturn, Tturn
Turn Length (number of actions per turn)                         | Total words per student/tutor turn       | Swordrate, Twordrate
Participant Activity (ratio of system/user actions per dialogue) | Ratio of system/user words per dialogue  | wordRatio
(added) Learning feature                                         | % of correct answers                     | CRate

Experiments (3)
Simulation models
–The ProbCorrect Model: P(A | Q)
–The Cluster Model: P(A | KC, C)
Evaluation measures
–Previously proposed evaluation measures
–Knowledge consistency measures
Both simulation models interact with the system, generating 500 dialogues per model

Results: Prior Measures
Neither model differs significantly from the real students on any of the original evaluation measures
Thus, both models can simulate realistic high-level dialogue behaviors

Results: New Measures
[Table: R-squared and adjusted R-squared of the simulated learning curves against the real students’ curve, for the probCorrect and Cluster models; values not preserved in this transcript]
The Cluster model outperforms the ProbCorrect simulation model with respect to learning curves
Model ranking also validated by human judges [Ai and Litman, 2008]

In Sum
Recall goal: simulate consistent user behaviors based on user knowledge consistency rather than fixed user goals
Knowledge-consistent models outperform the probabilistic model when measured by knowledge consistency measures
–Do not differ on high-level dialogue measures
–Similar approach should be applicable to other temporal user processes (e.g., forgetting)

Conclusions: The Big Picture
Motivation: Empirical research requires dialogue corpora (less expensive, more efficient, more (and better?) data compared to humans)
How realistic? Power of evaluation measures
–Discriminative ability [AAAI WS, 2006]
–Impact of the source corpus: subjects vs. real users [SIGDial, 2007]
–Validation of evaluation: human assessment [ACL, 2008]
How useful? Task dependent
–Dialogue strategy learning: utility of realistic vs. exploratory models for reinforcement learning [NAACL, 2007] *
–Dialogue system evaluation: more realistic models via knowledge consistency [Interspeech, 2007] *
(*: topics covered in this talk)

Other ITSpoke Research
Affect detection and adaptation in dialogue systems
–Annotated ITSpoke Corpus now available!
–Reinforcement Learning
Using NLP and psycholinguistics to predict learning
–Cohesion, alignment/convergence, semantics
More details:

Questions? Thank You!