1
On-line dialogue policy optimisation
Milica Gašić
Dialogue Systems Group
2
Spoken Dialogue System Optimisation
Problem: What is the optimal behaviour?
Solution: Find it automatically through interaction
3
Reinforcement learning
4
Training in interaction with humans
Problem 1: Optimisation requires too many dialogues
Problem 2: Training makes random moves
Problem 3: Humans give inconsistent ratings
5
Outline
Background
Dialogue model
Dialogue optimisation
Sample-efficient optimisation
Models for learning
Robust reward function
Human experiments
Conclusion
6
Model: Partially Observable Markov Decision Process
[Figure: POMDP influence diagram with nodes a_t, s_t, s_{t+1}, r_t, o_t, o_{t+1}]
The state is Markov: it depends only on the previous state and action, P(s_{t+1} | s_t, a_t) – the transition probability
The state is unobservable and generates a noisy observation, P(o_t | s_t) – the observation probability
In every state an action is taken and a reward is obtained
A dialogue is a sequence of states
Action selection (the policy) is based on the distribution over all states at every time step t – the belief state b(s_t)
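To make the belief state concrete, here is a minimal sketch of a Bayesian belief update using transition and observation probabilities as defined above. The two-state setup, the probability tables and the variable names are illustrative assumptions, not the models used in these slides.

```python
import numpy as np

# Illustrative two-state example (in the spirit of the voice-mail example
# later in the deck): the true state is whether the user wants to save or delete.
states = ["save", "delete"]

# P(s'|s, a): transition probability. Assumed here that the user goal does
# not change between turns, so the transition matrix is the identity.
transition = np.eye(2)

# P(o|s): observation probability. The speech input is noisy, so each true
# state can produce either observation (numbers are made up).
observation = np.array([
    [0.8, 0.2],   # true state "save":   P(o="save"), P(o="delete")
    [0.3, 0.7],   # true state "delete": P(o="save"), P(o="delete")
])

def belief_update(belief, obs_index):
    """Bayesian belief update: b'(s') ∝ P(o|s') Σ_s P(s'|s,a) b(s)."""
    predicted = transition.T @ belief           # Σ_s P(s'|s,a) b(s)
    updated = observation[:, obs_index] * predicted
    return updated / updated.sum()              # normalise over states

b = np.array([0.5, 0.5])                        # uniform prior belief
b = belief_update(b, obs_index=0)               # a noisy "save" is observed
print(dict(zip(states, b.round(3))))            # belief shifts towards "save"
```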
7
Dialogue state factorisation
Decompose the state into conditionally independent elements: the user goal g_t, the user action u_t and the dialogue history d_t
[Figure: dynamic Bayesian network with nodes g_t, u_t, d_t, a_t, r_t, o_t and their successors g_{t+1}, u_{t+1}, d_{t+1}, o_{t+1}]
8
Further dialogue state factorisation
[Figure: the goal, user-action and history nodes are further factorised per slot, e.g. g_t^food, d_t^food, u_t^food and g_t^area, d_t^area, u_t^area, with successors g_{t+1}^food, d_{t+1}^food, u_{t+1}^food and g_{t+1}^area, d_{t+1}^area, u_{t+1}^area]
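A small sketch of what the slot-wise factorisation buys in practice: the belief can be stored per slot, and joint goal probabilities recovered as products. Slot names and values below are assumptions for illustration.

```python
# Factored belief state: one distribution per slot instead of one joint
# distribution over all slot combinations. Slot names and values are
# illustrative assumptions.
belief = {
    "food": {"chinese": 0.6, "italian": 0.3, "indian": 0.1},
    "area": {"centre": 0.7, "north": 0.2, "south": 0.1},
}

def joint_goal_prob(goal):
    """Under the conditional-independence assumption the joint goal
    probability is just the product of the per-slot beliefs."""
    p = 1.0
    for slot, value in goal.items():
        p *= belief[slot].get(value, 0.0)
    return p

print(joint_goal_prob({"food": "chinese", "area": "centre"}))  # ≈ 0.42
```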
9
Policy optimisation in summary space
Compress the belief state into a summary space [1]
[Figure: the original belief space and actions are mapped by a summary function into a summary space with summary actions; the summary policy operates there, and a master function maps summary actions back to full actions]
[1] J. Williams and S. Young (2005). "Scaling up POMDPs for Dialogue Management: The Summary POMDP Method."
10
Q-function
The Q-function measures the expected discounted reward that can be obtained when an action is taken at a summary point, taking the rewards of future actions into account:
Q^π(b, a) = E_π[ Σ_{τ=t+1}^{T} γ^{τ-t-1} r_τ | b_t = b, a_t = a ]
where γ is the discount factor in (0,1], r_τ is the reward, b is the starting summary point, a is the starting action, and the expectation is taken with respect to the policy π.
Optimising the Q-function is equivalent to optimising the policy.
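As a small illustration of the quantity Q estimates, the sketch below computes a discounted return for a made-up reward sequence (a per-turn penalty followed by a success reward; all numbers are assumptions).

```python
def discounted_return(rewards, gamma=0.95):
    """Σ_k γ^k r_{t+k}: the discounted sum of rewards whose expectation
    Q(b, a) estimates when starting from summary point b with action a."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# e.g. -1 per turn as a turn penalty, +20 when the dialogue ends successfully
print(discounted_return([-1, -1, -1, 20]))
```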
11
Online learning
Reinforcement learning in direct interaction with the environment
Actions are taken ε-greedily:
Exploitation: with probability 1-ε, choose the action according to the best current estimate of the Q-function
Exploration: with probability ε, choose an action at random
In practice, tens of thousands of dialogues are needed!
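A minimal sketch of the ε-greedy rule described above (action names, Q estimates and the value of ε are assumptions):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Exploration with probability ε, otherwise exploitation of the
    current Q-function estimates."""
    if random.random() < epsilon:
        return random.choice(list(q_values))        # exploration: random action
    return max(q_values, key=q_values.get)          # exploitation: best estimate

q_estimates = {"confirm": 2.3, "request": 1.7, "inform": 3.1}
print(epsilon_greedy(q_estimates))
```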
12
Problem 1: Standard models require too many dialogues
13
Solution: Take into account similarities between different belief states
Essential ingredients: Gaussian process, kernel function
Outcome: Sample-efficient policy optimisation
14
Gaussian Process Policy Optimisation
The Q-function is the expected long-term reward
It can be modelled as a Gaussian process:
Prior: a GP prior is placed over the Q-function
Posterior: computed given the visited summary states, the actions taken and the rewards obtained
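The sketch below is generic Gaussian-process regression over visited summary points for one fixed action, to show how a prior plus observed returns yields a posterior mean and variance for Q. It is not the GP-based algorithm used in this work, and the kernel choice and all numbers are assumptions.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    """Squared-exponential kernel: Q values at nearby summary points correlate."""
    d = (np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :]
         - 2.0 * X1 @ X2.T)
    return np.exp(-0.5 * d / lengthscale**2)

# Visited summary points (e.g. top two hypothesis probabilities) and the
# returns observed for one fixed action -- all values are made up.
X = np.array([[0.9, 0.05], [0.6, 0.30], [0.4, 0.40]])
y = np.array([12.0, 7.0, 2.0])

noise = 1.0                                    # observation noise on the returns
K = rbf_kernel(X, X) + noise * np.eye(len(X))  # prior covariance + noise

# Posterior mean and variance of Q at a new, unvisited summary point.
x_new = np.array([[0.8, 0.10]])
k_star = rbf_kernel(x_new, X)
mean = k_star @ np.linalg.solve(K, y)
var = rbf_kernel(x_new, x_new) - k_star @ np.linalg.solve(K, k_star.T)
print(mean.item(), var.item())
```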
15
Voice mail example
The user asks the system to save or delete the message. The user input is corrupted with noise, so the true dialogue state is unknown.
[Figure: belief state b(s) over the two states]
16
The role of the kernel function in a Gaussian process
The kernel function models the correlation between Q-function values at different points
[Figure: Q-function value plotted against the belief state for the Confirm action]
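One common construction (an assumption here, not necessarily the exact kernel used in this work) is a product kernel over (belief state, action) pairs: a smooth kernel over belief states times a delta kernel over actions, so that Q values are correlated across similar beliefs but only within the same action.

```python
import numpy as np

def belief_kernel(b1, b2, lengthscale=0.5):
    """Squared-exponential kernel over summary belief vectors."""
    diff = np.asarray(b1) - np.asarray(b2)
    return float(np.exp(-0.5 * np.dot(diff, diff) / lengthscale**2))

def action_kernel(a1, a2):
    """Delta kernel: Q values of different actions are not correlated."""
    return 1.0 if a1 == a2 else 0.0

def q_kernel(point1, point2):
    """Kernel over (belief, action) pairs: product of the two kernels above."""
    (b1, a1), (b2, a2) = point1, point2
    return belief_kernel(b1, b2) * action_kernel(a1, a2)

print(q_kernel(([0.9, 0.1], "confirm"), ([0.8, 0.2], "confirm")))  # ≈ 0.96
print(q_kernel(([0.9, 0.1], "confirm"), ([0.8, 0.2], "request")))  # 0.0
```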
17
Problem 2: Standard models make random moves
Exploitation? Exploration?
18
Solution: Define a stochastic policy
The Gaussian process defines a Gaussian distribution over the Q value of each action
Sample from these distributions when choosing an action
This automatically deals with the exploration/exploitation trade-off
Outcome: less unexpected behaviour
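A sketch of one way such a stochastic policy can be realised (a common choice, stated here as an assumption): the GP gives each action a Gaussian posterior over its Q value, a value is sampled from each, and the action with the highest sample is taken. Uncertain actions are then explored more often, while confidently good actions dominate.

```python
import random

def sample_action(q_posteriors):
    """q_posteriors maps each action to the (mean, variance) of the GP
    posterior over its Q value; sample each and take the best sample."""
    samples = {a: random.gauss(mean, variance ** 0.5)
               for a, (mean, variance) in q_posteriors.items()}
    return max(samples, key=samples.get)

# Made-up posteriors: "request" is uncertain, so it still gets explored.
posteriors = {"confirm": (10.0, 0.2), "request": (9.0, 4.0), "inform": (4.0, 0.1)}
print(sample_action(posteriors))
```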
19
Results during testing (with simulated user)
20
Results during training (with simulated user)
21
Problem 3: Humans give inconsistent ratings
Reward is a measure of how good the dialogue is
22
On-line learning from user rating
23
User rating inconsistency

                                        Random policy   Online learned policy   Simulator trained policy
User rating (%)                              36.3               76.9                     85.7
Objective score (%)                          17.7               53.8                     63.7
P(user rating=1 | objective score=1)          –                 0.80                     0.94
P(user rating=1 | objective score=0)         0.26               0.57                     0.68
24
Solution: Incorporate both objective and subjective evaluation
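One way this combination can work, sketched under stated assumptions (the slides do not spell out the exact rule): a dialogue contributes a learning signal only when the subjective user rating and the objective success measure agree, which filters out the inconsistent ratings quantified in the table above.

```python
def training_reward(user_rating, objective_success, num_turns):
    """Assumed rule: discard dialogues where the subjective rating and the
    objective success measure disagree; otherwise compute a reward made of
    a success bonus minus a per-turn penalty (both values are assumptions)."""
    if user_rating != objective_success:
        return None                                   # inconsistent -> not used
    return (20 if objective_success else 0) - num_turns

print(training_reward(user_rating=1, objective_success=1, num_turns=6))   # 14
print(training_reward(user_rating=1, objective_success=0, num_turns=6))   # None
```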
25
Evaluation results

                         Simulator trained   On-line trained
Evaluation dialogues           400                 410
Reward                     11.6 +/- 0.4        13.4 +/- 0.3
Success (%)                93.5 +/- 1.2        96.8 +/- 0.9
26
Conclusions
GP-based policy optimisation:
Automates dialogue manager optimisation
Enables sample-efficient optimisation
Outperforms simulator-trained policies