Dialogue Policy Optimisation
Milica Gašić, Dialogue Systems Group
Reinforcement learning
Dialogue as a partially observable Markov decision process (POMDP)
- Let's remind ourselves of the POMDP model for dialogue
- The state s_t depends on the previous state and the action the system took
- The state is unobservable and depends on a noisy observation o_t
- We keep track of the probability distribution over all states at every time step
- Action selection (the policy) is based on this distribution over all states at every time step t – the belief state b(s_t)
- At each turn the system receives a reward r_t
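A minimal sketch of how the belief state can be tracked, assuming a small discrete state space with hand-specified transition and observation probabilities (all states and numbers below are illustrative, not taken from the slides):

```python
import numpy as np

# Illustrative two-state example: the user wants to "save" or "delete" a message.
states = ["save", "delete"]

# P(s' | s): with a neutral system action the user goal rarely changes.
transition = np.array([[0.95, 0.05],
                       [0.05, 0.95]])

# P(o | s'): the speech recogniser is noisy, so the observation
# only matches the true state 80% of the time.
observation = np.array([[0.8, 0.2],
                        [0.2, 0.8]])

def belief_update(belief, obs_index):
    """One step of the Bayes filter: predict with the transition model,
    then correct with the likelihood of the new observation."""
    predicted = transition.T @ belief
    corrected = observation[:, obs_index] * predicted
    return corrected / corrected.sum()

b = np.array([0.5, 0.5])           # uniform initial belief
b = belief_update(b, obs_index=0)  # observed "save"
print(dict(zip(states, b.round(3))))
```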
Dialogue policy optimisation
- The dialogue is in a state – the belief state b
- The dialogue manager takes actions a, as defined by a policy π
- It receives rewards r
Optimal policy
The optimal policy is the one that generates the highest expected reward over time.
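Written out in standard notation (not taken from the slide), with γ the discount factor and r_t the reward at turn t, the optimal policy maximises the expected discounted return:

\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T} \gamma^{\,t-1} r_{t}\right]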
Reinforcement learning – the idea
- Take actions randomly
- Compute the average reward
- Change the policy to take actions that generated high reward
Challenges in dialogue policy optimisation
- How to define the reward?
- The belief state is large and continuous
- Reinforcement learning takes many iterations
Problem 1: The reward function
Solution: The reward is a measure of how good the dialogue is.
- It should incorporate a measure of success: whether the system gave all the information that the user wanted
- It should favour shorter dialogues: penalise the system for every dialogue turn
- It can incorporate further elements
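A minimal sketch of a reward of this kind, assuming an illustrative per-turn penalty of -1 and a success bonus of +20 (these values are assumptions, not the ones used in the lectures):

```python
def dialogue_reward(num_turns, success, turn_penalty=-1, success_reward=20):
    """Reward summed over the dialogue: every turn is penalised, and a
    bonus is given at the end if the system provided all the information
    the user wanted."""
    return num_turns * turn_penalty + (success_reward if success else 0)

# A successful 8-turn dialogue scores higher than a failed 5-turn one.
print(dialogue_reward(8, True))   # 12
print(dialogue_reward(5, False))  # -5
```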
Problem 2: The belief state is large and continuous
Solution: Compress the belief state into a smaller-scale summary space [1].
[Figure: the original belief space, actions and policy are mapped to a summary space, summary actions and summary policy via a summary function, and back via a master function.]
[1] J. Williams and S. Young (2005). "Scaling up POMDPs for Dialogue Management: The Summary POMDP Method."
Summary space
- The summary space contains the features of the belief space that are important for learning
- This is hand-coded!
- It can contain probabilities of concepts, their values and so on
- Continuous variables can be discretised into a grid
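A sketch of a hand-coded summary function, assuming the belief state stores a probability distribution per concept (slot); the chosen features and grid step are illustrative:

```python
def summary_state(belief, grid_step=0.2):
    """Map a full belief state (a distribution over values for each concept)
    to a small tuple of discretised features: the probability of the top
    value for each concept, rounded onto a fixed grid."""
    features = []
    for concept, distribution in sorted(belief.items()):
        top_prob = max(distribution.values())
        features.append(round(round(top_prob / grid_step) * grid_step, 2))
    return tuple(features)

belief = {
    "food":  {"chinese": 0.7, "indian": 0.2, "italian": 0.1},
    "price": {"cheap": 0.55, "expensive": 0.45},
}
print(summary_state(belief))  # (0.6, 0.6) on the 0.2 grid
```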
Q-function
The Q-function measures the expected discounted reward that can be obtained when an action is taken at a grid point:

Q^{\pi}(\hat{b}, a) = \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{\infty} \gamma^{\,t-1} r_t \,\middle|\, \hat{b}_1 = \hat{b},\ a_1 = a\right]

where \hat{b} is the starting grid point, a is the starting action, the expectation is taken with respect to the policy π, γ ∈ (0,1] is the discount factor and r_t is the reward at turn t.
- It takes into account the reward of future actions
- Optimising the Q-function is equivalent to optimising the policy
Online learning
- Reinforcement learning in direct interaction with the environment
- Actions are taken ε-greedily
- Exploitation: choose the action according to the best estimate of the Q-function
- Exploration: choose an action at random (with probability ε)
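A minimal ε-greedy selection sketch, assuming the Q-function is stored as a dictionary keyed by (grid point, action); the names and defaults are illustrative:

```python
import random

def epsilon_greedy(Q, grid_point, actions, epsilon=0.1):
    """With probability epsilon explore (random action),
    otherwise exploit the current estimate of the Q-function."""
    if random.random() < epsilon:
        return random.choice(actions)                                    # exploration
    return max(actions, key=lambda a: Q.get((grid_point, a), 0.0))       # exploitation
```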
Monte Carlo control algorithm
Initialise Q arbitrarily
Repeat
  Repeat for every turn in a dialogue
    Update the belief state and map it to the summary space
    Record the grid point and the reward
  Until the end of the dialogue
  For each grid point, sum up all rewards that followed it
  Update the Q-function and the policy
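A sketch of every-visit Monte Carlo control in this spirit; the `run_dialogue` callable is a hypothetical stand-in for one dialogue with the simulated user and is an assumption for illustration:

```python
import random
from collections import defaultdict

def monte_carlo_control(run_dialogue, actions, num_dialogues=1000,
                        epsilon=0.1, gamma=1.0):
    """Tabular Monte Carlo control over summary grid points.
    `run_dialogue(policy)` must return a list of
    (grid_point, action, reward) tuples, one per turn."""
    Q = defaultdict(float)       # (grid_point, action) -> value estimate
    counts = defaultdict(int)

    def policy(grid_point):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(grid_point, a)])

    for _ in range(num_dialogues):
        episode = run_dialogue(policy)
        ret = 0.0
        # Work backwards, accumulating the discounted return that
        # followed each visited grid point / action pair.
        for grid_point, action, reward in reversed(episode):
            ret = reward + gamma * ret
            counts[(grid_point, action)] += 1
            n = counts[(grid_point, action)]
            Q[(grid_point, action)] += (ret - Q[(grid_point, action)]) / n
        # The policy improves implicitly: it acts greedily
        # (up to epsilon) with respect to the latest Q estimates.
    return Q
```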
How many iterations?
- Each grid point needs to be visited sufficiently often to obtain a good estimate
- If the grid cells are large, the estimate is not precise enough
- If there are many grid points, policy optimisation is slow
- In practice, 10,000s of dialogues are needed!
Learning in interaction with a simulated user
[Figure: the dialogue manager (speech understanding, dialogue state, dialogue policy, speech generation) interacts with a simulated user; the expected reward is used to optimise the policy.]
Simulated user
- Various models
- Should exhibit a variety of behaviour
- Should imitate real users
Agenda-based user simulator
Consists of an agenda and a goal.
- Goal: the concepts that describe the entity that the user wants (example: restaurant, cheap, Chinese)
- Agenda: the dialogue acts needed to elicit the user goal; dynamically changed during the dialogue; generated either deterministically or stochastically
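A sketch of the goal/agenda structure; the concepts, dialogue acts and the simple stochastic pop rule are illustrative assumptions, not the simulator's actual update rules:

```python
import random

# Goal: the entity the user wants, expressed as constraints and requests.
goal = {
    "constraints": {"type": "restaurant", "price": "cheap", "food": "chinese"},
    "requests": ["phone", "address"],
}

# Agenda: a stack of dialogue acts needed to elicit the goal.
agenda = [("inform", ("price", "cheap")),
          ("inform", ("food", "chinese")),
          ("request", "phone"),
          ("request", "address")]

def next_user_act(agenda, stochastic=True):
    """Pop the next act from the agenda; with some probability the user
    volunteers an extra piece of information in the same turn."""
    acts = [agenda.pop(0)]
    if stochastic and agenda and random.random() < 0.3:
        acts.append(agenda.pop(0))
    return acts

print(next_user_act(agenda))
```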
Learning with noisy input
- One point of POMDPs is to provide robustness to speech recognition errors
- Expose the manager to noise during learning
- The user simulator output can be corrupted to produce an N-best list of scored noisy inputs
True act: inform(price=cheap, area=centre)
N-best list:
  inform(price=cheap, area=south)  0.63
  inform(price=expensive)          0.22
  request(area)                    0.15
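A sketch of how the simulated user's act can be corrupted into a scored N-best list; the confusion strategy and probabilities below are illustrative assumptions:

```python
import random

def corrupt_to_nbest(true_act, confusable_acts, n=3, error_rate=0.3):
    """Build an N-best list: the true act usually gets the highest score,
    but with probability error_rate a confusable act is ranked first."""
    hyps = [true_act] + random.sample(confusable_acts, n - 1)
    if random.random() < error_rate:
        hyps[0], hyps[1] = hyps[1], hyps[0]   # simulate a recognition error
    scores = sorted((random.random() for _ in hyps), reverse=True)
    total = sum(scores)
    return [(hyp, round(s / total, 2)) for hyp, s in zip(hyps, scores)]

true_act = "inform(price=cheap, area=centre)"
confusable = ["inform(price=cheap, area=south)",
              "inform(price=expensive)",
              "request(area)"]
for hyp, score in corrupt_to_nbest(true_act, confusable):
    print(score, hyp)
```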
Evaluating a dialogue system
- A dialogue system consists of many components, and joint evaluation is difficult
- What matters is the user experience
- The dialogue manager uses the reward to optimise the dialogue policy
- This reward can also be used for evaluation
Results
Problem 3: Policy optimisation requires a lot of dialogues
The POMDP approach requires 10,000s of dialogues to train the policy, which is too many for learning in interaction with real users.
Solution: Take into account similarities between different belief states.
- Essential ingredients: Gaussian process, kernel function
- Outcome: fast policy optimisation
The Q-function as a Gaussian process
The Q-function in a POMDP is the expected long-term reward from taking action a in belief state b(s). It can be modelled as a stochastic process – a Gaussian process – to take into account the uncertainty of the approximation.
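A minimal sketch of modelling the Q-function for one action with Gaussian process regression; it uses scikit-learn's GaussianProcessRegressor as a stand-in for the GP machinery described here, and the belief features and returns are made up:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Made-up training data: a 1-D belief feature (e.g. probability that the
# user wants "save") and observed returns for the "confirm" action.
beliefs = np.array([[0.1], [0.3], [0.5], [0.7], [0.9]])
returns = np.array([-2.0, -0.5, 1.0, 2.5, 3.0])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=0.1)
gp.fit(beliefs, returns)

# The GP gives both an estimate of Q and its uncertainty.
mean, std = gp.predict(np.array([[0.6]]), return_std=True)
print(f"Q estimate: {mean[0]:.2f} +/- {std[0]:.2f}")
```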
VoiceMail example
- The user asks the system to save or delete the message
- System actions: save, delete, confirm
- The user input is corrupted with noise, so the true dialogue state is unknown
Q-function as a Gaussian process
[Figure: the Q-function mean and uncertainty plotted over the belief state b(s).]
The role of the kernel function in a Gaussian process
- The kernel function models the correlation between different Q-function values
[Figure: Q-function values for the confirm action at different points in the belief state, showing their correlation.]
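One common choice (an assumption here, not stated on the slide) is a product of a kernel over belief states and a simple delta kernel over actions; a sketch:

```python
import numpy as np

def belief_kernel(b1, b2, length_scale=0.2):
    """Squared-exponential kernel: nearby belief states have
    strongly correlated Q-function values."""
    b1, b2 = np.asarray(b1), np.asarray(b2)
    return np.exp(-np.sum((b1 - b2) ** 2) / (2 * length_scale ** 2))

def action_kernel(a1, a2):
    """Delta kernel: Q-values are only correlated for the same action."""
    return 1.0 if a1 == a2 else 0.0

def q_kernel(point1, point2):
    (b1, a1), (b2, a2) = point1, point2
    return belief_kernel(b1, b2) * action_kernel(a1, a2)

print(q_kernel(([0.8, 0.2], "confirm"), ([0.7, 0.3], "confirm")))  # close to 1
print(q_kernel(([0.8, 0.2], "confirm"), ([0.8, 0.2], "save")))      # 0.0
```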
Exploration in online learning
- The state space needs to be sufficiently explored to find the optimal path
- How can the space be explored efficiently?
Active learning in GP reinforcement learning
- The GP model of the Q-function provides an uncertainty estimate
- Exploration: choose the action that the model is uncertain about
- Exploitation: choose the action with the highest expected reward
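A sketch of uncertainty-driven action selection with a GP Q-model, assuming a hypothetical `predict(belief, action)` interface that returns a mean and a standard deviation (this interface is an assumption for illustration):

```python
import random

def select_action(gp_q, belief, actions, explore_prob=0.2):
    """Active-learning style selection: sometimes pick the action whose
    Q-value the GP is most uncertain about (exploration), otherwise pick
    the action with the highest expected reward (exploitation)."""
    stats = {a: gp_q.predict(belief, a) for a in actions}  # a -> (mean, std)
    if random.random() < explore_prob:
        return max(actions, key=lambda a: stats[a][1])  # largest uncertainty
    return max(actions, key=lambda a: stats[a][0])      # largest expected reward
```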
Results – Cambridge tourist information domain
Learning in interaction with real people
Conclusions
Statistical dialogue modelling:
- Automates dialogue manager optimisation
- Is robust to speech recognition errors
- Enables fast learning
Future work: a refined reward function