Download presentation
Presentation is loading. Please wait.
1
Gaussian Processes for Fast Policy Optimisation of POMDP-based Dialogue Managers M. Gašić, F. Jurčíček, S. Keizer, F. Mairesse, B. Thomson, K. Yu, S. Young Cambridge University Engineering Department | {mg436, fj228, sk561, farm2, brmt2, ky219, sjy}@eng.cam.ac.uk Acknowledgements This research was partly funded by the UK EPSRC under grant agreement EP/F013930/1 and by the EU FP7 Programme under grant agreement 216594 (CLASSiC project: www.classic-project.org).www.classic-project.org Real-world Problem: CamInfo Dialogue POMDP-based Dialogue ManagementKernel Function Models the uncertainty about the dialogue state by maintaining a distribution over all possible dialogue states in every turn – belief state Maximises the overall dialogue success by optimising Q-function Q(a,b(s)) – the highest expected long-term reward for action a taken in belief state b(s) Problem: To ensure tractability of optimisation, standard methods discretise the belief space into a small number of points, causing the loss of information. Solution: Model the continuous nature of the belief state and include the prior knowledge about the domain to speed up the learning process. Gaussian Processes in Reinforcement Learning Toy problem: VoiceMail Dialogue Conclusion GP-Sarsa can obtain the optimal policy faster than standard methods provided an adequate choice of the kernel function The measure of uncertainty that GP-Sarsa is estimating can be utilised in an Active Learning framework to additionally speed up the learning process Results on CamInfo Gaussian Process (GP) – non-parametric Bayesian model for function approximation For given prior function correlations and some noisy function observations, it estimates the posterior of any function value GP-Sarsa is an online Reinforcement learning algorithm that models Q-function as a Gaussian Process If the Q-function value was known in one belief state-action pair what is the Q-function value in another belief state for the same action? Prior knowledge about Q-function correlations is incorporated in the kernel function. Kernel hyper-parameters can be learned from the data labelled with belief states, actions and rewards. In that way the kernel captures correlations found in the data. Results on VoiceMail Comparison: GP-Sarsa with various kernel functions Grid-based Monte Carlo Control algorithm Exact POMDP solution In order to estimate the speed of convergence, the policy was evaluated after every training batch: Hidden Information State Dialogue Manager POMDP-based Dialogue Manager that can tractably maintain belief state for real-world problems Optimises policy in a reduced summary space CamInfo Domain Tourist Information domain for Cambridge, UK Comparison: GP-Sarsa with polynomial kernel Active learning GP-Sarsa – during exploration selects the action that has the highest uncertainty estimated by the Gaussian Process Grid-based Monte Carlo Control Algorithm Would you like to save or delete the message? Your message is deleted. b(s)b’(s) a r belief state (immediate) reward action Would you like to save or delete the message? Q-function value Action Belief state The user asks the system to save or delete the message. The user input is corrupted with noise, so the true dialogue state is unknown. A Gaussian Process models every Q-function value Q(a,b(s)) as a Gaussian distributed random variable. The variance of the Gaussian distribution provides a measure of uncertainty about the approximation. GP-Sarsa: Distribution of Q(a,b(s)) value Action a Which action leads to success? Standard approach: Q(a,b(s)) value
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.