How much data is enough? – Generating reliable policies w/MDP’s Joel Tetreault University of Pittsburgh LRDC July 14, 2006.

Presentation transcript:


Problem
Problems with designing spoken dialogue systems:
- How to handle noisy data or miscommunications?
- Hand-tailoring policies for complex dialogues?
- What features to use?
Previous work used machine learning to improve the dialogue manager of spoken dialogue systems [Singh et al., ’02; Walker, ’00; Henderson et al., ’05]
However, very little empirical work [Paek et al., ’05; Frampton ’05] on comparing the utility of adding specialized features to construct a better dialogue state

Goal
How does one choose which features best contribute to a better model of dialogue state?
Goal: show the comparative utility of adding four different features to a dialogue state
4 features: concept repetition, frustration, student performance, student moves
All are important to tutoring systems, but are also important to dialogue systems in general

Previous Work
In complex domains, annotation and testing are time-consuming, so it is important to choose the best features beforehand
Developed a methodology for using Reinforcement Learning to determine whether adding complex features to a dialogue state will beneficially alter policies [Tetreault & Litman, EACL ’06]
Extensions:
- Methodology to determine which features are the best
- Also show our results generalize over different action choices (feedback vs. questions)

Outline
- Markov Decision Processes (MDP)
- MDP Instantiation
- Experimental Method
- Results
  - Policies
  - Feature Comparison

Markov Decision Processes
What is the best action an agent should take at any state to maximize reward at the end?
MDP Input:
- States
- Actions
- Reward Function

MDP Output
Policy: optimal action for the system to take in each state
Calculated using policy iteration, which depends on:
- Propagating final reward to each state
- The probabilities of getting from one state to the next given a certain action
Additional output: V-value: the worth of each state
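To make the policy-iteration step concrete, here is a minimal sketch on a toy two-state dialogue MDP. The transition probabilities, the discount factor `gamma`, and the terminal state names `HIGH` and `LOW` are illustrative assumptions, not values from the ITSPOKE corpus; only the +100/-100 final rewards follow the slides.

```python
def policy_iteration(P, final_reward, gamma=0.9, tol=1e-9):
    """P[s][a] = {next_state: probability}; states with no actions are terminal.
    final_reward[s2] is the reward for entering terminal state s2 (0 otherwise).
    gamma is an assumed discount factor; the slides do not state one."""
    V = {s: 0.0 for s in P}
    policy = {s: next(iter(P[s])) for s in P if P[s]}

    def q(s, a):
        # Expected reward of taking action a in state s, then following V.
        return sum(p * (final_reward.get(s2, 0.0) + gamma * V[s2])
                   for s2, p in P[s][a].items())

    while True:
        # Policy evaluation: propagate rewards until V-values stop changing.
        while True:
            delta = 0.0
            for s, a in policy.items():
                new_v = q(s, a)
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < tol:
                break
        # Policy improvement: switch to a strictly better action where one exists.
        stable = True
        for s in policy:
            best = max(P[s], key=lambda a: q(s, a))
            if q(s, best) > q(s, policy[s]) + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return policy, V


# Hypothetical transition model over the baseline state {Correctness}.
P = {
    "C": {"SAQ": {"C": 0.5, "I": 0.3, "HIGH": 0.1, "LOW": 0.1},
          "NoQ": {"C": 0.6, "I": 0.1, "HIGH": 0.25, "LOW": 0.05}},
    "I": {"SAQ": {"C": 0.3, "I": 0.5, "HIGH": 0.05, "LOW": 0.15},
          "Mix": {"C": 0.5, "I": 0.3, "HIGH": 0.1, "LOW": 0.1}},
    "HIGH": {},
    "LOW": {},
}
final_reward = {"HIGH": 100.0, "LOW": -100.0}

policy, V = policy_iteration(P, final_reward)
print(policy, V)
```

The returned `V` dictionary holds the V-value of each state, which is the quantity the later convergence checks track.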

MDP’s in Spoken Dialogue
[Diagram: Training data -> MDP -> Policy -> Dialogue System, which interacts with a User Simulator or Human User]
MDP works offline
Interactions work online

ITSPOKE Corpus
100 dialogues with the ITSPOKE spoken dialogue tutoring system [Litman et al. ’04]
- All possible dialogue paths were authored by physics experts
- Dialogues informally follow question-answer format
- 60 turns per dialogue on average
Each student session has 5 dialogues bookended by a pretest and posttest to calculate how much the student learned

Corpus Annotations
Manual annotations:
- Tutor Moves (similar to Dialog Acts) [Forbes-Riley et al., ’05]
- Student Frustration and Certainty [Litman et al. ’04] [Liscombe et al. ’05]
Automated annotations:
- Correctness (based on student’s response to last question)
- Concept Repetition (whether a concept is repeated)
- %Correctness (past performance)

MDP State Features
Feature              Values
Correctness          Correct (C), Incorrect (I)
Certainty            Certain (cer), Neutral (neu), Uncertain (unc)
Concept Repetition   New Concept (0), Repeated (R)
Frustration          Frustrated (F), Neutral (N)
% Correctness        50-100% (H)igh, 0-49% (L)ow

MDP Action Choices
Action                          Example Turn
SAQ (Short Answer Question)     “What is the direction of that force relative to your fist?”
CAQ (Complex Answer Question)   “What is the definition of Newton’s Second Law?”
Mix                             “If it doesn’t hit the center of the pool what do you know about the magnitude of its displacement from the center of the pool when it lands? Can it be zero? Can it be nonzero?”
NoQ                             “So you can compare it to my response…”

MDP Reward Function
Reward function: use normalized learning gain to do a median split on the corpus
10 students are “high learners” and the other 10 are “low learners”
High learner dialogues have a final state with a reward of +100; low learner dialogues have one of -100
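A minimal sketch of the median split follows. The slides do not give the formula for normalized learning gain, so the common definition (posttest - pretest) / (1 - pretest), with scores scaled to [0, 1], is assumed here; the student IDs and scores are made up for illustration.

```python
from statistics import median


def normalized_learning_gain(pre, post):
    # Assumed definition: fraction of the possible improvement actually achieved.
    # Undefined when pre == 1.0 (a perfect pretest leaves no room to improve).
    return (post - pre) / (1.0 - pre)


def assign_final_rewards(scores, high=100, low=-100):
    """scores maps student id -> (pretest, posttest), both in [0, 1].
    Students above the median gain get the high reward, the rest the low one."""
    gains = {sid: normalized_learning_gain(pre, post)
             for sid, (pre, post) in scores.items()}
    cut = median(gains.values())
    return {sid: (high if gain > cut else low) for sid, gain in gains.items()}


# Hypothetical pretest/posttest scores for four students.
scores = {"s1": (0.4, 0.9), "s2": (0.5, 0.6), "s3": (0.2, 0.8), "s4": (0.6, 0.7)}
rewards = assign_final_rewards(scores)
print(rewards)
```

Every dialogue from a given student would then end in a final state carrying that student's reward.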

Methodology
Construct MDP’s to test the inclusion of new state features to a baseline:
- Develop baseline state and policy
- Add a feature to baseline and compare policies
- A feature is deemed important if adding it results in a change in policy from a baseline policy, given 3 metrics:
  - # of Policy Differences (Diff’s)
  - %Policy Change (%PC)
  - Expected Cumulative Reward (ECR)
For each MDP: verify policies are reliable (V-value convergence)

Hypothetical Policy Change Example
#  B1 State  B1 Policy  B1+Certainty State
1  [C]       CAQ        [C,Cer] [C,Neu] [C,Unc]
2  [I]       SAQ        [I,Cer] [I,Neu] [I,Unc]
+Cert 1 Policy: CAQ for the [C] states, SAQ for the [I] states (0 Diffs)
+Cert 2 Policy: Mix CAQ Mix CAQ Mix (5 Diffs)

Tests
Baseline 1 (B1): Correctness
Baseline 2 (B2): B1 + Certainty
B2 + %Correct, B2 + Concept, B2 + Frustration

Baseline
Actions: {SAQ, CAQ, Mix, NoQ}
Baseline State: {Correctness}
Baseline network: states [C] and [I], each reachable from the other via SAQ|CAQ|Mix|NoQ, leading to FINAL

Baseline 1 Policies
#  State  State Size  Policy
1  [C]    1308        NoQ
2  [I]    872         Mix
Trend: with only student correctness as the model of student state, give a hint or other state act (NoQ) when the student is correct; otherwise give a Mix of complex and short answer questions

But are our policies reliable?
The best way to test is to run real experiments with human users and the new dialogue manager, but that is months of work
Our tack: check if our corpus is large enough to develop reliable policies by seeing if V-values converge as we add more data to the corpus
Method: run the MDP on subsets of our corpus (incrementally add a student (5 dialogues) to the data, and rerun the MDP on each subset)
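The incremental procedure can be sketched as re-estimating the transition model each time one more student's dialogues are added. The representation of a dialogue as a list of (state, action, next_state) triples is an assumption made for illustration; each resulting model would then be solved to see whether the V-values settle.

```python
from collections import defaultdict


def estimate_transitions(dialogues):
    """Turn observed (state, action, next_state) triples into probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for dialogue in dialogues:
        for state, action, next_state in dialogue:
            counts[(state, action)][next_state] += 1
    return {
        sa: {s2: n / sum(nxt.values()) for s2, n in nxt.items()}
        for sa, nxt in counts.items()
    }


def incremental_models(students):
    """students is a list of per-student dialogue lists, in the order added.
    Returns one transition model per cumulative subset of the corpus."""
    seen, models = [], []
    for dialogues in students:
        seen.extend(dialogues)
        models.append(estimate_transitions(seen))
    return models


# Two hypothetical students with one very short dialogue each.
students = [
    [[("C", "NoQ", "C"), ("C", "NoQ", "I")]],
    [[("C", "NoQ", "C"), ("C", "NoQ", "C")]],
]
models = incremental_models(students)
print(models[0][("C", "NoQ")], models[1][("C", "NoQ")])
```

If the estimated probabilities, and the V-values computed from them, stop moving as students are added, that is evidence the corpus is large enough for this state space.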

Baseline Convergence Plot

Methodology: Adding more Features
- Create a more complicated baseline by adding the certainty feature (new baseline = B2)
- Add the other 4 features (concept repetition, frustration, performance, student move) individually to the new baseline
- Check V-value and policy convergence
- Analyze policy changes
- Use Feature Comparison Metrics to determine the relative utility of the four features


Certainty
Previous work (Bhatt et al., ’04) has shown the importance of certainty in ITS
A student who is certain and correct may require a harder question, since he or she is doing well
A student who is correct but shows some doubt may be becoming confused, so give an easier question

B2: Baseline + Certainty Policies
#  B1 State  B1 Policy  B1+Certainty State  +Certainty Policy
1  [C]       NoQ        [C,Cer]             Mix
                        [C,Neu]             SAQ
                        [C,Unc]             Mix
2  [I]       Mix        [I,Cer]             Mix
                        [I,Neu]             NoQ
                        [I,Unc]             Mix
Trend: if neutral, give SAQ or NoQ, else give Mix

Baseline 2 Convergence Plots

Baseline 2 Diff Plots Diff: For each subset corpus, compare policy with policy generated with full corpus


Feature Comparison (3 metrics)
# Diff’s
- Number of new states whose policies differ from the original
- Insensitive to how frequently a state occurs
% Policy Change (%P.C.)
- Takes into account the frequency of each state-action sequence

Feature Comparison
Expected Cumulative Reward (E.C.R.)
- One issue with %P.C. is that frequently occurring states have low V-values and thus may bias the score
- Use the expected value of being at the start of the dialogue to compare features
- ECR = average V-value of all start states
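The three metrics can be sketched as small functions. The exact weighting used for %P.C. is not spelled out in the slides, so weighting each changed state by how often it occurs is an assumption, as is averaging V-values over the start state of every dialogue for ECR. The two policies below are the B1 and B1+Certainty policies from the slides; the state counts and V-values are invented.

```python
def policy_diffs(baseline_policy, new_policy, parent):
    """Return the new states whose action differs from the baseline state
    they refine. parent maps e.g. "C,Cer" -> "C"."""
    return [s for s, a in new_policy.items() if a != baseline_policy[parent[s]]]


def percent_policy_change(diff_states, state_counts):
    # Assumed weighting: share of all state occurrences that fall in changed states.
    total = sum(state_counts.values())
    return 100.0 * sum(state_counts[s] for s in diff_states) / total


def expected_cumulative_reward(V, start_states):
    # start_states holds the start state of each dialogue, so frequent
    # start states count proportionally more in the average.
    return sum(V[s] for s in start_states) / len(start_states)


baseline_policy = {"C": "NoQ", "I": "Mix"}
new_policy = {"C,Cer": "Mix", "C,Neu": "SAQ", "C,Unc": "Mix",
              "I,Cer": "Mix", "I,Neu": "NoQ", "I,Unc": "Mix"}
parent = {s: s.split(",")[0] for s in new_policy}

# Hypothetical occurrence counts and V-values, for illustration only.
state_counts = {"C,Cer": 10, "C,Neu": 20, "C,Unc": 10,
                "I,Cer": 20, "I,Neu": 10, "I,Unc": 30}
V = {"C,Neu": 40.0, "I,Neu": 20.0}
start_states = ["C,Neu", "C,Neu", "I,Neu", "I,Neu"]

diffs = policy_diffs(baseline_policy, new_policy, parent)
pc = percent_policy_change(diffs, state_counts)
ecr = expected_cumulative_reward(V, start_states)
print(len(diffs), pc, ecr)
```

# Diff's ignores the counts entirely, which is why a rarely visited state can inflate it; %P.C. and ECR are the frequency-aware alternatives.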

Feature Comparison Results
State Feature         # Diff’s   %P.C.    E.C.R.
Student Move          10         82.2%    43.21
Concept Repetition    10         80.2%    39.52
Frustration           8          66.4%    31.30
Percent Correctness   4          44.3%    28.47
Trend of SMove > Concept Repetition > Frustration > Percent Correctness stays the same over all three metrics
Baseline: also tested the effects of a binary random feature
- If enough data, a random feature should not alter policies
- Average diff of 5.1

How reliable are policies?
[Plots: Frustration, Concept]
Possible data size is small and with increased data we may see more fluctuations

Confidence Bounds
Hypothesis: instead of looking at the V-values and policy differences directly, look at the confidence bounds of each V-value
As data increases, the confidence bounds of each V-value should shrink to reflect a better model of the world
Additionally, the policies should converge as well

Confidence Bounds
CB’s can also be used to distinguish how much better an additional state feature is over a baseline state space
That is, the new state space is reliably better if its lower bound is greater than the upper bound of the baseline state space

Crossover Example
[Plot: ECR against amount of data, with curves for the Baseline and the More Complicated Model]

Confidence Bounds: App #2
Automatic model switching
- If you know that a model at its worst (i.e., its lower bound) is better than another model’s upper bound, then you can automatically switch to the more complicated model
Good for online RL applications
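The crossover test behind automatic model switching reduces to one comparison per data slice. This sketch assumes bounds are stored as (lower, upper) pairs per slice; the numbers are invented and only illustrate bounds that tighten as students are added.

```python
def crosses_over(new_bounds, baseline_bounds):
    """True when the richer model's lower bound exceeds the baseline's upper bound."""
    new_lower, _ = new_bounds
    _, baseline_upper = baseline_bounds
    return new_lower > baseline_upper


def first_crossover(slices):
    """slices is a list of (n_students, new_bounds, baseline_bounds) in order
    of increasing data. Returns the first n_students at which switching to
    the richer model is justified, or None if the bounds never separate."""
    for n_students, new_bounds, baseline_bounds in slices:
        if crosses_over(new_bounds, baseline_bounds):
            return n_students
    return None


# Hypothetical ECR bounds for three data slices.
slices = [
    (5, (-20.0, 70.0), (-10.0, 40.0)),
    (10, (15.0, 60.0), (5.0, 35.0)),
    (15, (38.0, 55.0), (20.0, 32.0)),
]
switch_at = first_crossover(slices)
print(switch_at)
```

An online system could run this check after each new batch of dialogues and keep the simpler model until the check first succeeds.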

Confidence Bound Methodology
For each data slice, calculate upper and lower bounds on the V-value
- Take the transition matrix for the slice and sample from each row using a Dirichlet distribution 1000 times
  - Do this because the observed data does not exactly reflect what data is like in the real world, but may be close
  - So get 1000 new transition matrices that are all very similar
- Run the MDP on all 1000 transition matrices to get a range of ECR’s
  - Rows with not a lot of data are very volatile, so expect a large range of ECR’s; as data increases, transition matrices should stabilize such that most of the new matrices produce similar policies and values as the original
- Take upper and lower bounds at the 2.5% percentiles (the 2.5th and 97.5th percentiles of the ECR range)
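The sampling procedure can be sketched with the standard library alone, drawing each Dirichlet row as normalized Gamma variates. Several details are assumptions rather than the authors' setup: the add-one prior on the counts, the `solve_ecr` callback standing in for a full MDP solve, and the toy ECR function used in the example.

```python
import random


def sample_dirichlet_row(counts, rng, prior=1.0):
    # A Dirichlet draw is a vector of Gamma draws divided by their sum.
    # The prior (assumed) keeps every parameter positive for unseen transitions.
    draws = [rng.gammavariate(c + prior, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]


def confidence_bounds(count_matrix, solve_ecr, n_samples=1000, seed=0):
    """count_matrix holds one row of transition counts per (state, action).
    solve_ecr (hypothetical) maps a sampled probability matrix to an ECR.
    Returns the 2.5th and 97.5th percentile of the sampled ECR's."""
    rng = random.Random(seed)
    ecrs = []
    for _ in range(n_samples):
        matrix = [sample_dirichlet_row(row, rng) for row in count_matrix]
        ecrs.append(solve_ecr(matrix))
    ecrs.sort()
    lower = ecrs[int(0.025 * (n_samples - 1))]
    upper = ecrs[int(0.975 * (n_samples - 1))]
    return lower, upper


def toy_ecr(matrix):
    # Stand-in for solving the MDP: one row whose two outcomes are the
    # +100 and -100 final states.
    return 100.0 * (matrix[0][0] - matrix[0][1])


small = confidence_bounds([[3, 1]], toy_ecr, seed=1)
large = confidence_bounds([[300, 100]], toy_ecr, seed=1)
print(small, large)
```

Both count rows have the same 3:1 ratio, so the point estimate is identical; only the amount of data differs, and the bounds for the larger row come out much tighter.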

Experiment
Original action/state setup did not show anything promising
- State/action space too large for data?
- Not best MDP instantiation
Looked at a variety of MDP configurations
- Refined reward metric
- Adding discourse segmentation

+essay Instantiation with ’03+’05 data

+essay Baseline1

+essay Baseline2

+essay B2+SMove

Feature Comparison Results
State Feature         # Diff’s   %P.C.     E.C.R.
Student Move          5          43.4%     49.17
Concept Repetition    3          25.5%     42.56
Frustration           1          0.03%     32.99
Percent Correctness   3          11.19%    28.50
Reduced state size: Certainty = {Cert+Neutral, Uncert}
Trend that SMove and Concept Repetition are the best features
B2 ECR = 31.92

Baseline 1 Upper = Lower = 0.24

Baseline 2 Upper = Lower = 39.62

B2+ Concept Repetition Upper = Lower =49.16

B2+Percent Correctness Upper = 48.42 Lower = 32.86

B2+Student Move Upper = Lower = 39.94

Discussion
Baseline 2: has crossover effect and policy stability
More complex features (B2 + X): have crossover effect, but not sure if policies are stable (some stabilize at 17 students)
Indicates that 100 dialogues isn’t enough for even this simple MDP? (But is enough to feel confident about Baseline 2?)