On-line dialogue policy optimisation
Milica Gašić, Dialogue Systems Group

Presentation transcript:

On-line dialogue policy optimisation
Milica Gašić, Dialogue Systems Group

Spoken Dialogue System Optimisation
Problem: What is the optimal behaviour?
Solution: Find it automatically through interaction.

Reinforcement learning

Training in interaction with humans
Problem 1: Optimisation requires too many dialogues.
Problem 2: Training makes random moves.
Problem 3: Humans give inconsistent ratings.

Outline
- Background
- Dialogue model
- Dialogue optimisation
- Sample-efficient optimisation
- Models for learning
- Robust reward function
- Human experiments
- Conclusion

Model: Partially Observable Markov Decision Process (POMDP)
[Figure: graphical model with states s_t, s_{t+1}, action a_t, reward r_t and observations o_t, o_{t+1}]
- The state is Markov: it depends only on the previous state and action, P(s_{t+1} | s_t, a_t) – the transition probability.
- The state is unobservable and generates a noisy observation, P(o_t | s_t) – the observation probability.
- In every state an action is taken and a reward is obtained.
- A dialogue is a sequence of states.
- Action selection (the policy) is based on the distribution over all states at every time step t – the belief state b(s_t).
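Since the true state is hidden, the dialogue manager maintains the belief state by Bayesian filtering. The standard POMDP belief update, consistent with the transition and observation probabilities above, is:

```latex
b_{t+1}(s_{t+1}) \;=\; \eta \, P(o_{t+1} \mid s_{t+1}) \sum_{s_t} P(s_{t+1} \mid s_t, a_t) \, b_t(s_t)
```

where η is a normalising constant so that the belief sums to one.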

Dialogue state factorisation
Decompose the state s_t into conditionally independent elements: the user goal g_t, the user action u_t and the dialogue history d_t.
[Figure: dynamic Bayesian network linking g_t, u_t, d_t and their successors g_{t+1}, u_{t+1}, d_{t+1}, together with the action a_t, reward r_t and observations o_t, o_{t+1}]
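Written as a formula (a sketch: the components are treated as approximately independent in the belief, which is what the factorisation buys computationally; the exact dependencies follow the graphical model above):

```latex
s_t = (g_t, u_t, d_t), \qquad b(s_t) \;\approx\; b(g_t)\, b(u_t)\, b(d_t)
```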

Further dialogue state factorisation
[Figure: each component is factorised further per slot, e.g. g_t^food, u_t^food, d_t^food and g_t^area, u_t^area, d_t^area, each evolving to its counterpart at time t+1]

Policy optimisation in summary space
Compress the belief state into a summary space [1].
[Figure: a summary function maps the original belief space into a summary space; the summary policy selects summary actions there, and a master function maps them back to actions in the original space]
[1] J. Williams and S. Young (2005). "Scaling up POMDPs for Dialogue Management: The Summary POMDP Method."
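As a rough illustration (not the exact mapping used in the cited paper), a summary function might keep only the top probabilities of each slot's belief:

```python
import numpy as np

def summary_function(belief_per_slot):
    """Map a full factored belief into a low-dimensional summary vector.

    belief_per_slot: dict mapping a slot name (e.g. 'food', 'area') to a
    probability distribution over that slot's values. The summary keeps
    only the two largest probabilities per slot, a common style of compression.
    """
    summary = []
    for slot, dist in sorted(belief_per_slot.items()):
        probs = np.sort(np.asarray(dist, dtype=float))[::-1]   # descending
        top_two = np.pad(probs[:2], (0, max(0, 2 - len(probs))))
        summary.extend(top_two)
    return np.array(summary)

# Example: beliefs over two slots in a restaurant domain (values made up)
b = {"food": [0.7, 0.2, 0.1], "area": [0.55, 0.45]}
print(summary_function(b))   # [0.55 0.45 0.7  0.2 ]
```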

Q-function
- The Q-function measures the expected discounted reward that can be obtained when an action is taken at a summary point.
- It takes into account the rewards of future actions.
- Optimising the Q-function is equivalent to optimising the policy.
- Q^π(b, a) = E_π[ r_t + γ r_{t+1} + γ^2 r_{t+2} + ... | b_t = b, a_t = a ], where the expectation is with respect to the policy π, γ in (0, 1] is the discount factor, r is the reward, b is the starting summary point and a is the starting action.

Online learning
- Reinforcement learning in direct interaction with the environment.
- Actions are taken ε-greedily (a minimal sketch follows):
  - Exploitation: choose the action according to the best estimate of the Q-function.
  - Exploration: choose an action at random (with probability ε).
- In practice, tens of thousands of dialogues are needed!
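A minimal sketch of ε-greedy action selection over a Q-function estimate; the action set and Q values below are hypothetical:

```python
import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    """Pick an action: exploit the current Q estimate with probability
    1 - epsilon, otherwise explore by choosing an action at random."""
    if random.random() < epsilon:
        return random.choice(actions)               # exploration
    return max(actions, key=lambda a: q_values[a])  # exploitation

# Hypothetical summary actions and current Q estimates
actions = ["confirm", "request", "inform"]
q = {"confirm": 4.2, "request": 3.7, "inform": 5.1}
print(epsilon_greedy(q, actions, epsilon=0.1))
```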

Problem 1: Standard models require too many dialogues

Solution: Take into account similarities between different belief states
Essential ingredients:
- Gaussian process
- Kernel function
Outcome: sample-efficient policy optimisation

Gaussian Process Policy Optimisation
- The Q-function is the expected long-term reward.
- It can be modelled as a Gaussian process: a prior is placed over the Q-function, and a posterior is computed given the visited summary states, the actions taken and the rewards obtained.
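A minimal sketch of the idea, using plain Gaussian-process regression of observed returns at visited (summary point, action) features; the kernel, features and data here are illustrative, and the on-line algorithm updates the posterior incrementally rather than by a batch matrix inversion as done here:

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of X1 and X2."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(X_train, y_train, X_test, noise=0.1):
    """GP posterior mean and variance of Q at X_test.

    X_train: features of visited (summary point, action) pairs.
    y_train: observed returns (discounted accumulated rewards).
    """
    K = rbf_kernel(X_train, X_train) + noise ** 2 * np.eye(len(X_train))
    K_s = rbf_kernel(X_test, X_train)
    K_ss = rbf_kernel(X_test, X_test)
    K_inv = np.linalg.inv(K)
    mean = K_s @ K_inv @ y_train
    cov = K_ss - K_s @ K_inv @ K_s.T
    return mean, np.diag(cov)

# Illustrative data: 2-D summary points with an appended action index
X = np.array([[0.9, 0.1, 0.0], [0.6, 0.3, 1.0], [0.5, 0.5, 0.0]])
y = np.array([10.0, 6.0, 4.0])           # observed returns
Xq = np.array([[0.8, 0.2, 0.0]])         # query point
mu, var = gp_posterior(X, y, Xq)
print(mu, var)                           # posterior mean and variance of Q
```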

Voice mail example
The user asks the system to save or delete the message. The user input is corrupted with noise, so the true dialogue state is unknown.
[Figure: belief state b(s) over the two hypotheses]
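A minimal numerical sketch of belief tracking for this two-state example, assuming the user's intent does not change during the dialogue (so only the observation update is shown); the observation probabilities are made up for illustration:

```python
import numpy as np

# Hypothetical observation model: P(recognised word | true intent).
# Rows: true intent (save, delete); columns: recognised word (save, delete).
obs_model = np.array([[0.8, 0.2],    # true intent "save"
                      [0.3, 0.7]])   # true intent "delete"

def update_belief(belief, observed):
    """Bayes update of the belief over {save, delete} after a noisy observation."""
    likelihood = obs_model[:, observed]
    new_belief = likelihood * belief
    return new_belief / new_belief.sum()

b = np.array([0.5, 0.5])              # uniform prior over the user's intent
for word in [0, 0, 1]:                # recogniser hears "save", "save", then "delete"
    b = update_belief(b, word)
    print(b)
```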

The role of the kernel function in a Gaussian process
The kernel function models the correlation between Q-function values at different belief states and actions.
[Figure: Q-function values as a function of the belief state for the Confirm action]
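One common way to build such a kernel (a sketch, not necessarily the kernel used in the talk) is to multiply a kernel over belief or summary states by a kernel over actions, for example a squared-exponential kernel on states and a delta kernel on actions:

```python
import numpy as np

def state_kernel(b1, b2, lengthscale=1.0):
    """Squared-exponential kernel between two belief/summary vectors."""
    d2 = np.sum((np.asarray(b1) - np.asarray(b2)) ** 2)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def action_kernel(a1, a2):
    """Delta kernel: actions are correlated only with themselves."""
    return 1.0 if a1 == a2 else 0.0

def q_kernel(point1, point2):
    """Kernel over (belief, action) pairs, factorised into state and action parts."""
    (b1, a1), (b2, a2) = point1, point2
    return state_kernel(b1, b2) * action_kernel(a1, a2)

print(q_kernel(([0.9, 0.1], "confirm"), ([0.8, 0.2], "confirm")))  # close beliefs, same action
print(q_kernel(([0.9, 0.1], "confirm"), ([0.9, 0.1], "inform")))   # different actions -> 0
```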

Problem 2: Standard models make random moves
Exploitation or exploration?

Solution: Define a stochastic policy
- The Gaussian process defines a Gaussian distribution over the Q-value of each action.
- Sample from these distributions to select actions (a sketch follows).
- This automatically deals with the exploration/exploitation trade-off.
Outcome: less unexpected behaviour
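A sketch of this sampling-based action selection, assuming the GP gives a posterior mean and variance for the Q-value of each action and that the action with the highest sampled value is taken; the per-action statistics below are made up:

```python
import numpy as np

def sample_action(q_posterior, rng=None):
    """Select an action by sampling each action's Q-value from its Gaussian
    posterior and taking the action with the largest sample.

    q_posterior: dict mapping action -> (posterior mean, posterior variance).
    """
    if rng is None:
        rng = np.random.default_rng()
    samples = {a: rng.normal(mu, np.sqrt(var)) for a, (mu, var) in q_posterior.items()}
    return max(samples, key=samples.get)

# Hypothetical posteriors: a well-explored action with low variance and a
# poorly explored action with high variance (so it still gets tried sometimes).
posterior = {"confirm": (5.0, 0.1), "request": (4.5, 2.0)}
print(sample_action(posterior))
```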

Results during testing (with simulated user)

Results during training (with simulated user)

Problem 3: Humans give inconsistent ratings
The reward is a measure of how good the dialogue is.
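For context, a common shape for such a dialogue reward (the form below is a sketch; the success bonus S is a tunable constant and is not taken from the talk) combines a bonus for task success with a per-turn penalty:

```latex
r_{\text{dialogue}} \;=\; S \cdot \mathbb{1}[\text{task success}] \;-\; N_{\text{turns}}
```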

On-line learning from user rating

User rating inconsistency
[Table: random, online-learned and simulator-trained policies compared on user rating (%), objective score (%), P(user rating = 1 | objective score = 1) and P(user rating = 1 | objective score = 0)]

Solution: Incorporate both objective and subjective evaluation

Evaluation results
                         Simulator trained    On-line trained
Evaluation dialogues     …                    …
Reward                   11.6 +/- …           … +/- 0.3
Success (%)              93.5 +/- …           … +/- 0.9

Conclusions
Gaussian processes in policy optimisation:
- automate dialogue manager optimisation,
- enable sample-efficient optimisation,
- outperform simulator-trained policies.