1
Dialogue Policy Optimisation
Milica Gašić Dialogue Systems Group
2
Reinforcement learning
3
Dialogue as a partially observable Markov decision process (POMDP)
Let us remind ourselves of the POMDP model for dialogue. The state s_t depends on the previous state and the action a_t that the system took. The state is unobservable and is only reflected in a noisy observation o_t. We therefore keep track of a probability distribution over all states at every time step, the belief state b(s_t). Action selection (the policy) is based on this belief state, and the system receives a reward r_t at each turn.
4
Dialogue policy optimisation
The dialogue is in a state, represented by the belief state b. The dialogue manager takes actions a, as defined by a policy π, and receives rewards r.
5
Optimal Policy The optimal policy is the one that generates the highest expected reward over time.
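Formally, using the discount factor γ and per-turn reward r_t that appear later in the Q-function slide, a standard way to write this is:

```latex
% The optimal policy maximises the expected discounted return
\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{T} \gamma^{t} r_{t} \right],
\qquad \gamma \in (0, 1]
```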
6
Reinforcement learning – the idea
Take actions randomly. Compute the average reward. Change the policy to take actions that generated high reward.
7
Challenges in dialogue policy optimisation
How to define the reward? The belief state is large and continuous. Reinforcement learning takes many iterations.
8
Problem 1: The reward function
Solution: The reward is a measure of how good the dialogue is. It should incorporate a measure of success: whether the system gave all the information that the user wanted. It should favour shorter dialogues, penalising the system for every dialogue turn. It can incorporate further elements.
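A minimal sketch of such a reward function; the specific values (+20 for success, −1 per turn) are illustrative assumptions, not numbers given on the slide:

```python
def dialogue_reward(success: bool, num_turns: int,
                    success_reward: float = 20.0,
                    turn_penalty: float = 1.0) -> float:
    """Reward = success bonus minus a per-turn penalty.

    The values 20 and 1 are illustrative assumptions; the slide only
    says the reward should reflect success and favour short dialogues.
    """
    return (success_reward if success else 0.0) - turn_penalty * num_turns
```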
9
Problem 2: Belief state is large and continuous
Solution: Compress the belief state into a smaller-scale summary space. [Diagram: a summary function maps the original belief space and actions into a summary space with summary actions, where the summary policy operates; a master function maps the chosen summary action back to a full action.] Reference: J. Williams and S. Young (2005), "Scaling up POMDPs for Dialogue Management: The Summary POMDP Method."
10
Summary space The summary space contains features of the belief space that are important for learning. This mapping is hand-coded. It can contain the probabilities of concepts, their values, and so on. Continuous variables can be discretised into a grid.
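A sketch of what such a hand-coded summary function and grid discretisation might look like for a slot-filling domain; the choice of features and the grid step are illustrative assumptions:

```python
import numpy as np

def summary_features(belief: dict[str, dict[str, float]]) -> np.ndarray:
    """Map a full belief state (per-slot distributions over values) to a
    small feature vector: the top and second probabilities of each slot.
    The feature choice is hand-coded and purely illustrative."""
    feats = []
    for slot, dist in sorted(belief.items()):
        probs = sorted(dist.values(), reverse=True)
        top = probs[0] if probs else 0.0
        second = probs[1] if len(probs) > 1 else 0.0
        feats.extend([top, second])
    return np.array(feats)

def to_grid_point(features: np.ndarray, step: float = 0.2) -> tuple:
    """Discretise continuous summary features onto a regular grid."""
    return tuple(np.round(features / step).astype(int))
```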
11
Q-function The Q-function measures the expected discounted reward that can be obtained at a grid point when an action is taken:

$Q^{\pi}(b, a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, b_t = b, a_t = a \right]$

where b is the starting grid point, a is the starting action, the expectation is taken with respect to the policy π, γ ∈ (0, 1] is the discount factor, and r is the reward. The Q-function takes into account the reward of future actions, and optimising it is equivalent to optimising the policy.
12
Online learning Reinforcement learning in direct interaction with the environment. Actions are taken ε-greedily. Exploitation: choose the action according to the best estimate of the Q-function. Exploration: choose an action at random (with probability ε).
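A sketch of ε-greedy action selection over the grid-based Q-function; the data layout (a dict keyed by grid point and action) is an assumption:

```python
import random

def epsilon_greedy(Q: dict, grid_point: tuple, actions: list, epsilon: float = 0.1):
    """With probability epsilon explore (random action); otherwise
    exploit the current Q estimate for this grid point."""
    if random.random() < epsilon:
        return random.choice(actions)                                    # exploration
    return max(actions, key=lambda a: Q.get((grid_point, a), 0.0))       # exploitation
```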
13
Monte Carlo control algorithm
Initialise Q arbitrarily.
Repeat:
  Repeat for every turn in a dialogue:
    Update the belief state and map it to the summary space.
    Record the grid point and the reward.
  Until the end of the dialogue.
  For each grid point, sum up all rewards that followed.
  Update the Q-function and the policy.
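A runnable sketch of this loop, assuming a hypothetical dialogue environment `env` with `reset()`/`step()` methods and reusing the `summary_features`, `to_grid_point` and `epsilon_greedy` helpers sketched earlier; returns are averaged per (grid point, action) pair:

```python
from collections import defaultdict

def monte_carlo_control(env, actions, episodes=10000, gamma=1.0, epsilon=0.1):
    """Monte Carlo control over the summary-space grid.
    `env` is a hypothetical dialogue environment:
    reset() -> belief, step(action) -> (belief, reward, done)."""
    Q = defaultdict(float)                 # (grid_point, action) -> value estimate
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    for _ in range(episodes):
        episode = []                       # [(grid_point, action, reward), ...]
        belief = env.reset()
        done = False
        while not done:                    # one dialogue, turn by turn
            g = to_grid_point(summary_features(belief))
            a = epsilon_greedy(Q, g, actions, epsilon)
            belief, reward, done = env.step(a)
            episode.append((g, a, reward))

        # For each visited grid point, sum up the (discounted) rewards that followed
        G = 0.0
        for g, a, r in reversed(episode):
            G = r + gamma * G
            returns_sum[(g, a)] += G
            returns_count[(g, a)] += 1
            Q[(g, a)] = returns_sum[(g, a)] / returns_count[(g, a)]
    return Q
```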
14
How many iterations? Each grid point needs to be visited sufficiently often to obtain a good estimate. If the grid is coarse, the estimate is not precise enough; if there are many grid points, policy optimisation is slow. In practice, tens of thousands of dialogues are needed!
15
Learning in interaction with a Simulated User
[Architecture diagram: Speech Understanding → Dialogue Manager (Dialogue State, Dialogue Policy) → Speech Generation; the expected reward is used to optimise the policy.]
16
Simulated user Various models exist. They should exhibit a variety of behaviour and imitate real users.
17
Agenda-based user simulator
Consists of an agenda and a goal. Goal: concepts that describe the entity that the user wants, e.g. restaurant, cheap, Chinese. Agenda: the dialogue acts needed to elicit the user goal; it is dynamically changed during the dialogue and generated either deterministically or stochastically.
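A sketch of the data structures involved; the particular dialogue-act format and the stack-based agenda update are simplified assumptions, not the actual simulator:

```python
import random

class AgendaBasedUser:
    """Simplified agenda-based user simulator: a goal (constraints the
    user wants satisfied) and an agenda (a stack of dialogue acts that
    would elicit that goal). The update rule here is a toy assumption."""

    def __init__(self, goal: dict[str, str]):
        # e.g. goal = {"type": "restaurant", "price": "cheap", "food": "Chinese"}
        self.goal = goal
        self.agenda = [("inform", slot, value) for slot, value in goal.items()]
        random.shuffle(self.agenda)        # stochastic ordering of the informs

    def next_act(self, system_act: tuple):
        # If the system requested a slot, push the answer onto the agenda
        if system_act and system_act[0] == "request":
            slot = system_act[1]
            if slot in self.goal:
                self.agenda.append(("inform", slot, self.goal[slot]))
        return self.agenda.pop() if self.agenda else ("bye",)
```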
18
Learning with noisy input
One purpose of POMDPs is to provide robustness to speech recognition errors, so the manager should be exposed to noise during learning. The user simulator output, e.g. inform(price=cheap, area=centre), can be corrupted to produce an N-best list of scored noisy inputs:
inform(price=cheap, area=south)  0.63
inform(price=expensive)          0.22
request(area)                    0.15
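A sketch of how a simulator act might be corrupted into such an N-best list; the act format, the confusion mechanism and the confidence scoring are illustrative assumptions:

```python
import random

def corrupt_to_nbest(true_act: tuple, confusables: dict, n: int = 3,
                     error_rate: float = 0.3):
    """Produce an N-best list of (act, confidence) pairs from the true
    simulator act, assumed here to be (act_type, slot, value).
    With probability error_rate a slot value is swapped for a confusable
    one (e.g. confusables = {"centre": ["south", "north"]}); confidence
    scores are random and normalised to sum to 1."""
    hypotheses = [true_act]
    while len(hypotheses) < n:
        act_type, slot, value = true_act
        if random.random() < error_rate and value in confusables:
            value = random.choice(confusables[value])
        hypotheses.append((act_type, slot, value))
    scores = [random.random() for _ in hypotheses]
    total = sum(scores)
    return sorted(((h, s / total) for h, s in zip(hypotheses, scores)),
                  key=lambda x: -x[1])
```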
19
Evaluating a dialogue system
A dialogue system consists of many components, and joint evaluation is difficult. What matters is the user experience. The dialogue manager uses the reward to optimise the dialogue policy; the same reward can also be used for evaluation.
20
Results
21
Problem 3: Policy optimisation requires a lot of dialogues
Policy optimisation requires tens of thousands of dialogues, which is too much for learning in interaction with real users. Solution: take into account similarities between different belief states. Essential ingredients: a Gaussian process and a kernel function. Outcome: fast policy optimisation.
22
The Q-function as a Gaussian Process
The Q-function in a POMDP is the expected long-term reward from taking action a in belief state b(s). It can be modelled as a stochastic process, in particular a Gaussian process, to take into account the uncertainty of the approximation.
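Written out, this is a Gaussian-process prior over the Q-function; the zero mean and the notation follow standard GP formulations and are not given explicitly on the slide:

```latex
% GP prior over the Q-function: zero mean, kernel k over (belief, action) pairs
Q(b, a) \sim \mathcal{GP}\!\big( 0,\; k\big((b, a), (b', a')\big) \big)
```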
23
VoiceMail example The user asks the system to save or delete the message. System actions: save, delete, confirm. The user input is corrupted with noise, so the true dialogue state is unknown.
24
Q-function as a Gaussian process
[Figure: Q-function values plotted over the belief state b(s).]
25
The role of kernel function in a Gaussian process
The kernel function models the correlation between Q-function values at different (belief state, action) points, for example between two points that both take the Confirm action.
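A typical decomposition, written here as an assumption rather than the kernel used on the slide: a kernel over belief states multiplied by a kernel over actions, e.g. a delta kernel that is 1 only when the two actions match:

```latex
% Kernel over (belief, action) pairs: belief-state kernel times action kernel
k\big((b, a), (b', a')\big) = k_B(b, b') \, k_A(a, a'),
\qquad k_A(a, a') = \delta_{a, a'}
```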
26
Exploration in online learning
The state space needs to be sufficiently explored to find the optimal path. How can the space be explored efficiently?
27
Active learning in GP Reinforcement learning
The GP model for the Q-function gives the uncertainty of its estimates. Exploration: choose the action that the model is uncertain about. Exploitation: choose the action with the highest expected reward.
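A sketch of the resulting action-selection rule, assuming a hypothetical `gp_posterior(belief, action)` that returns the posterior mean and standard deviation of the Q estimate; the fixed exploration probability is an illustrative assumption:

```python
import random

def gp_select_action(gp_posterior, belief, actions, explore_prob: float = 0.1):
    """Active-learning action selection with a GP Q-function.
    Exploit: pick the action with the highest posterior mean.
    Explore: pick the action whose Q estimate is most uncertain."""
    stats = {a: gp_posterior(belief, a) for a in actions}   # a -> (mean, std)
    if random.random() < explore_prob:
        return max(actions, key=lambda a: stats[a][1])      # most uncertain
    return max(actions, key=lambda a: stats[a][0])          # highest expected reward
```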
28
Results – Cambridge tourist information domain
29
Learning in interaction with real people
30
Conclusions Statistical dialogue modelling automates dialogue manager optimisation, is robust to speech recognition errors, and enables fast learning. Future work: a refined reward function.