Using Reinforcement Learning to Build a Better Model of Dialogue State Joel Tetreault & Diane Litman University of Pittsburgh LRDC April 7, 2006

Problem
Designing spoken dialogue systems raises several problems:
- What features should be used to represent the dialogue state?
- How should noisy data and miscommunications be handled?
- How can policies be hand-tailored for complex dialogues?
Previous work used machine learning to improve the dialogue manager of spoken dialogue systems [Singh et al., '02; Walker, '00; Henderson et al., '05]. However, there has been very little empirical work testing the utility of adding specialized features to construct a better dialogue state.

Goal
Many features can be used to describe the user state; which ones do you use?
Goal: show that adding more complex features to a state is a worthwhile pursuit, since it alters what actions a system should take.
Five features: certainty, student dialogue move, concept repetition, frustration, student performance.
All are important to tutoring systems, but they are also important to dialogue systems in general.

Outline
- Markov Decision Processes (MDPs)
- MDP Instantiation
- Experimental Method
- Results

Markov Decision Processes
What is the best action for an agent to take in any state to maximize the reward at the end?
MDP input:
- States
- Actions
- Reward function

MDP Output
Use policy iteration to propagate the final reward back through the states to determine:
- V-value: the worth of each state
- Policy: the optimal action to take in each state
Values and policies depend on the reward function, but also on the probabilities of moving from one state to the next given a certain action.
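To make the computation above concrete, here is a minimal tabular policy-iteration sketch in Python (not part of the original talk); P, R, and gamma are hypothetical inputs standing in for the estimated per-action transition matrices, the state reward vector, and a discount factor:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.95):
    """Tabular policy iteration.
    P: list of (S x S) numpy transition matrices, one per action.
    R: length-S reward vector. Returns (V-values, optimal action index per state)."""
    n_actions, n_states = len(P), len(R)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve V = R + gamma * P_pi V for the current policy
        P_pi = np.array([P[policy[s]][s] for s in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # Policy improvement: in each state pick the action with the highest expected value
        Q = np.array([R + gamma * P[a] @ V for a in range(n_actions)])
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy
```

Each pass alternates policy evaluation (solving a linear system for V) with greedy policy improvement, stopping once the policy no longer changes.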

What’s the best path to the fly?

MDP Frog Example Final State: +1

MDP Frog Example Final State:

MDPs in Spoken Dialogue
(Diagram showing the relationship between the MDP, the training data, the learned policy, the dialogue system, a user simulator, and a human user.)
The MDP works offline; the interactions work online.

ITSPOKE Corpus
100 dialogues with the ITSPOKE spoken dialogue tutoring system [Litman et al., '04]:
- All possible dialogue paths were authored by physics experts
- Dialogues informally follow a question-answer format
- 50 turns per dialogue on average
Each student session has 5 dialogues, bookended by a pretest and a posttest to calculate how much the student learned.

Corpus Annotations
Manual annotations:
- Tutor and Student Moves (similar to Dialogue Acts) [Forbes-Riley et al., '05]
- Frustration and certainty [Litman et al., '04] [Liscombe et al., '05]
Automated annotations:
- Correctness (based on the student's response to the last question)
- Concept Repetition (whether a concept is repeated)
- %Correctness (past performance)

MDP State Features
Feature            | Values
Correctness        | Correct (C), Incorrect/Partially Correct (I)
Certainty          | Certain (cer), Neutral (neu), Uncertain (unc)
Student Move       | Shallow (S), Deep/Novel Answer/Assertion (O)
Concept Repetition | New Concept (0), Repeated (R)
Frustration        | Frustrated (F), Neutral (N)
% Correctness      | 50-100% High (H), 0-49% Low (L)
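As an illustration of how the table above defines a discrete state space, here is a hedged sketch (the variable names are mine, not from the system's code) that enumerates all states for any chosen subset of features:

```python
from itertools import product

# Discrete values for each annotated feature, taken from the table above
FEATURES = {
    "correctness": ["C", "I"],
    "certainty": ["cer", "neu", "unc"],
    "student_move": ["S", "O"],
    "concept_repetition": ["0", "R"],
    "frustration": ["F", "N"],
    "pct_correct": ["H", "L"],
}

def enumerate_states(feature_names):
    """Enumerate every discrete MDP state for a chosen subset of features."""
    return list(product(*(FEATURES[f] for f in feature_names)))

# Baseline 2 uses correctness + certainty: 2 x 3 = 6 states
print(enumerate_states(["correctness", "certainty"]))
```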

MDP Action Choices
Case    | TMove           | Example Turn
Feed    | Pos             | "Super."
NonFeed | Hint, Ques.     | "To analyze the pumpkin's acceleration we will use Newton's Second Law. What is the definition of the law?"
Mix     | Pos, Rst, Ques. | "Good. So when the truck and car collide they exert a force on each other. What is the relationship between their magnitudes?"

MDP Reward Function
Reward function: use normalized learning gain to do a median split on the corpus, so that 10 students are "high learners" and the other 10 are "low learners".
High-learner dialogues had a final state with a reward of +100; low-learner dialogues had one of -100.
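A small sketch of this reward assignment, assuming the standard normalized learning gain formula (posttest - pretest) / (1 - pretest), which the slide does not spell out; the function name and input layout are illustrative:

```python
import statistics

def assign_final_rewards(test_scores):
    """test_scores: list of (pretest, posttest) pairs, with scores scaled to [0, 1].
    Median split on normalized learning gain: +100 for high learners, -100 for low learners."""
    gains = [(post - pre) / (1.0 - pre) for pre, post in test_scores]  # assumed NLG formula
    median_gain = statistics.median(gains)
    return [100 if g >= median_gain else -100 for g in gains]
```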

Infrastructure
1. State Transformer:
   - Based on RLDS [Singh et al., '99]
   - Outputs a state-action probability matrix and a reward matrix
2. MDP Matlab Toolkit (from INRA) to generate policies
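The slide does not show the matrices themselves; the following is a rough sketch of the kind of output such a state transformer produces, assuming each dialogue is a sequence of (state, action) pairs followed by a final +/-100 reward (function and variable names are illustrative, and attaching the reward to the dialogue's last state is a simplification):

```python
import numpy as np

def build_matrices(dialogues, states, actions):
    """dialogues: list of (turn_sequence, final_reward), where turn_sequence is a list of
    (state, action) pairs. Returns {action: S x S transition-probability matrix} and a
    length-S reward vector."""
    s_idx = {s: i for i, s in enumerate(states)}
    counts = {a: np.zeros((len(states), len(states))) for a in actions}
    reward = np.zeros(len(states))
    for turns, final_reward in dialogues:
        # Count transitions state -> next state under the action taken in the first state
        for (s, a), (s_next, _) in zip(turns, turns[1:]):
            counts[a][s_idx[s], s_idx[s_next]] += 1
        # Simplification: the +/-100 final reward is attached to the dialogue's last state
        reward[s_idx[turns[-1][0]]] = final_reward
    P = {}
    for a, c in counts.items():
        row_sums = c.sum(axis=1, keepdims=True)
        P[a] = np.divide(c, row_sums, out=np.zeros_like(c), where=row_sums > 0)
    return P, reward
```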

Methodology
Construct MDPs to test the inclusion of new state features against a baseline (a shift-counting sketch follows after this list):
- Develop a baseline state and policy
- Add a feature to the baseline and compare policies
- A feature is deemed important if adding it results in a change in policy from the baseline policy ("shifts")
For each MDP: verify that policies are reliable (V-value convergence).
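A hedged sketch of the shift-counting comparison described above; the helper names are mine, and the example policy values are hypothetical (loosely mirroring the hypothetical example on the next slide, with the last entry my own filler):

```python
def count_shifts(baseline_policy, expanded_policy, project):
    """baseline_policy: {baseline_state: action}; expanded_policy: {expanded_state: action};
    project(expanded_state) -> the baseline state it refines.
    A shift is an expanded state whose optimal action differs from its baseline state's action."""
    return sum(
        1
        for exp_state, action in expanded_policy.items()
        if action != baseline_policy[project(exp_state)]
    )

# Hypothetical example: B1 states [C] and [I], both mapped to Feed
b1 = {('C',): 'Feed', ('I',): 'Feed'}
cert2 = {('C', 'Cer'): 'Mix', ('C', 'Neu'): 'Feed', ('C', 'Unc'): 'Mix',
         ('I', 'Cer'): 'NonFeed', ('I', 'Neu'): 'Mix', ('I', 'Unc'): 'Mix'}
print(count_shifts(b1, cert2, lambda s: s[:1]))  # prints 5
```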

Hypothetical Policy Change Example
B1 State | B1 Policy | B1+Certainty States
1 [C]    | Feed      | [C,Cer] [C,Neu] [C,Unc]
2 [I]    | Feed      | [I,Cer] [I,Neu] [I,Unc]
Hypothetical +Cert 1 policy keeps Feed for every expanded state: 0 shifts from B1.
Hypothetical +Cert 2 policy (Mix, Feed, Mix, NonFeed, Mix, ...) changes the action in all but one expanded state: 5 shifts from B1.

Tests
(Diagram: Baseline 1 = Correctness; Baseline 2 = B1 + Certainty; then B2 + each of %Correct, Goal, Frustration, SMove.)

Baseline
Baseline actions: {Feed, NonFeed, Mix}
Baseline state: {Correctness}
(Baseline network diagram: states [C] and [I], connected by the actions F | NF | Mix, leading to a FINAL state.)

Baseline 1 Policies
# | State | State Size | Policy
1 | [C]   | 1308       | Feed
2 | [I]   | 872        | Feed
Trend: if you only have student correctness as a model of student state, then regardless of the student's response, the best tactic is to always give simple feedback.

But are our policies reliable?
The best way to test is to run real experiments with human users and the new dialogue manager, but that is months of work.
Our tactic: check whether our corpus is large enough to develop reliable policies by seeing if the V-values converge as we add more data to the corpus.
Method: run the MDP on subsets of our corpus (incrementally add one student (5 dialogues) to the data, and rerun the MDP on each subset), as sketched below.
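A sketch of this convergence test, reusing the hypothetical build_matrices and policy_iteration sketches from earlier; the corpus layout (a list of per-student dialogue lists) is an assumption:

```python
def convergence_check(dialogues_by_student, states, actions, gamma=0.95):
    """Rerun the MDP on growing subsets of the corpus (adding one student, i.e. 5 dialogues,
    at a time) and measure how much the V-values still change between successive cuts."""
    v_history = []
    for k in range(1, len(dialogues_by_student) + 1):
        subset = [d for student in dialogues_by_student[:k] for d in student]
        P, R = build_matrices(subset, states, actions)
        V, _ = policy_iteration([P[a] for a in actions], R, gamma)
        v_history.append(V)
    # Small, shrinking differences between cuts suggest the corpus is large enough
    diffs = [float(abs(v2 - v1).max()) for v1, v2 in zip(v_history, v_history[1:])]
    return v_history, diffs
```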

Baseline Convergence Plot

Methodology: Adding More Features
- Create a more complicated baseline by adding the certainty feature (new baseline = B2)
- Add the other 4 features (student moves, concept repetition, frustration, performance) individually to the new baseline
- Check that V-values converge
- Analyze policy changes

Tests
(Diagram: Baseline 1 = Correctness; Baseline 2 = B1 + Certainty; then B2 + each of %Correct, Goal, Frustration, SMove.)

Certainty
Previous work (Bhatt et al., '04) has shown the importance of certainty in ITS.
A student who is certain and correct may not need feedback, but a student who is correct yet showing some doubt may be becoming confused, so give more feedback.

B2: Baseline + Certainty Policies
# | B1 State | B1 Policy | B1+Certainty State | +Certainty Policy
1 | [C]      | Feed      | [C,Cer]            | NonFeed
  |          |           | [C,Neu]            | Feed
  |          |           | [C,Unc]            | NonFeed
2 | [I]      | Feed      | [I,Cer]            | NonFeed
  |          |           | [I,Neu]            | Mix
  |          |           | [I,Unc]            | NonFeed
Trend: if neutral, give Feed or Mix; otherwise, give NonFeed.

Baseline 1 and 2 Convergence Plots

Tests
(Diagram: Baseline 1 = Correctness; Baseline 2 = B1 + Certainty; then B2 + each of %Correct, Goal, Frustration, SMove.)

% Correct Convergence Plots

Student Move Policies
# | B2 State | B2 Policy | B2+SMove State | +SMove Policy
1 | [Cer,C]  | NonFeed   | [Cer,C,S]      | NonFeed
  |          |           | [Cer,C,O]      | Feed
2 | [Cer,I]  | NonFeed   | [Cer,I,S]      | Mix
  |          |           | [Cer,I,O]      | Mix
3 | [Neu,C]  | Feed      | [Neu,C,S]      | Feed
  |          |           | [Neu,C,O]      | NonFeed
4 | [Neu,I]  | Mix       | [Neu,I,S]      | Mix
  |          |           | [Neu,I,O]      | NonFeed
5 | [Unc,C]  | NonFeed   | [Unc,C,S]      | Mix
  |          |           | [Unc,C,O]      | NonFeed
6 | [Unc,I]  | NonFeed   | [Unc,I,S]      | Mix
  |          |           | [Unc,I,O]      | NonFeed
7 changes
Trend: give Mix if the student move is Shallow (S); give NonFeed if it is Other (O).

Concept Repetition Policies
# | B2 State | B2 Policy | B2+Concept State | +Concept Policy
1 | [Cer,C]  | NonFeed   | [Cer,C,O]        | NonFeed
  |          |           | [Cer,C,R]        | Feed
2 | [Cer,I]  | NonFeed   | [Cer,I,O]        | Mix
  |          |           | [Cer,I,R]        | Mix
3 | [Neu,C]  | Feed      | [Neu,C,O]        | Mix
  |          |           | [Neu,C,R]        | Feed
4 | [Neu,I]  | Mix       | [Neu,I,O]        | Mix
  |          |           | [Neu,I,R]        | Mix
5 | [Unc,C]  | NonFeed   | [Unc,C,O]        | NonFeed
  |          |           | [Unc,C,R]        | NonFeed
6 | [Unc,I]  | NonFeed   | [Unc,I,O]        | NonFeed
  |          |           | [Unc,I,R]        | NonFeed
4 shifts
Trend: if the concept is repeated (R), give complex or mix feedback.

Frustration Policies
# | B2 State | B2 Policy | B2+Frustration State | +Frustration Policy
1 | [Cer,C]  | NonFeed   | [Cer,C,N]            | NonFeed
  |          |           | [Cer,C,F]            | Feed
2 | [Cer,I]  | NonFeed   | [Cer,I,N]            | NonFeed
  |          |           | [Cer,I,F]            | NonFeed
3 | [Neu,C]  | Feed      | [Neu,C,N]            | Feed
  |          |           | [Neu,C,F]            | NonFeed
4 | [Neu,I]  | Mix       | [Neu,I,N]            | Mix
  |          |           | [Neu,I,F]            | NonFeed
5 | [Unc,C]  | NonFeed   | [Unc,C,N]            | NonFeed
  |          |           | [Unc,C,F]            | NonFeed
6 | [Unc,I]  | NonFeed   | [Unc,I,N]            | NonFeed
  |          |           | [Unc,I,F]            | NonFeed
4 shifts
Trend: if the student is frustrated (F), give NonFeed.

Percent Correct Policies
# | B2 State | B2 Policy | B2+%Correct State | +%Correct Policy
1 | [Cer,C]  | NonFeed   | [Cer,C,H]         | NonFeed
  |          |           | [Cer,C,L]         | NonFeed
2 | [Cer,I]  | NonFeed   | [Cer,I,H]         | Mix
  |          |           | [Cer,I,L]         | NonFeed
3 | [Neu,C]  | Feed      | [Neu,C,H]         | Feed
  |          |           | [Neu,C,L]         | Feed
4 | [Neu,I]  | Mix       | [Neu,I,H]         | NonFeed
  |          |           | [Neu,I,L]         | Mix
5 | [Unc,C]  | NonFeed   | [Unc,C,H]         | Mix
  |          |           | [Unc,C,L]         | NonFeed
6 | [Unc,I]  | NonFeed   | [Unc,I,H]         | NonFeed
  |          |           | [Unc,I,L]         | NonFeed
3 shifts
Trend: if the student is a low performer (L), give NonFeed.

Discussion
- Incorporating more information into the representation of the student state has an impact on tutor policies.
- Despite not having human or simulated users, we can still claim that our findings are reliable, due to the convergence of V-values and policies.
- Including Certainty, Student Moves, and Concept Repetition effected the most change.

Future Work
- Developing user simulations and annotating more human-computer experiments to further verify that our policies are correct
- More data allows us to develop more complicated policies, such as:
  - More complex tutor actions (hints, questions)
  - Combinations of state features
  - More refined reward functions (PARADISE)
- Developing more complex convergence tests

Related Work
- [Paek and Chickering, '05]
- [Singh et al., '99] - optimal dialogue length
- [Frampton et al., '05] - last dialogue act
- [Williams et al., '03] - automatically generate good state/action sets

Diff Plots Diff Plot: compare final policy (20 students) with policies generated at smaller cuts