Deep reinforcement learning for dialogue policy optimisation

1 Deep reinforcement learning for dialogue policy optimisation
Milica Gašić, Dialogue Systems Group. You've all heard about Siri, Amazon Alexa, or Google Now. Establishing intelligent conversation between a machine and a human is one of the fundamental problems of AI. Spoken conversation is natural for humans, but how do we make it natural for machines? It's an incredibly exciting problem, and today I'm going to tell you about my research in this area.

2 Structure of spoken dialogue systems
Speech recognition → text → Language understanding → semantics → Dialogue management → actions → Language generation → text → Speech synthesis

3 Statistical dialogue management
Speech recognition → distribution of text hypotheses → Language understanding → distribution of semantic hypotheses → Dialogue management (belief tracking + policy optimisation) → actions → Language generation → text → Speech synthesis. A statistical dialogue system allows human-computer interaction using speech as the primary medium, giving instant and human-like acquisition of information.

4 Example 1 – No belief tracking

5 Example 2 – Belief tracking
The new probability is much higher, and the new action is much better than the one taken before.

6 Belief tracking vs policy optimisation
Belief tracking (PAST): what has happened so far in the dialogue? Policy optimisation (FUTURE): what action to take to achieve the best outcome?

7 Machine Learning Theory: Reinforcement learning
At each turn the dialogue system receives observations o_t, maintains belief states b_t, takes an action a_t and receives a reward r_t (for example, a positive reward for a successful dialogue and a small penalty per turn).

8 Machine Learning Theory: Reinforcement learning
The same loop: observations o_t, belief states b_t, actions a_t and rewards r_t. The mapping from belief states to actions is the policy.

9 Machine Learning Theory: Reinforcement learning
Return, value function, Q-function (the definitions, shown as images on the slide, are reconstructed below).
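The formulas themselves are not reproduced in the transcript; the standard definitions, written here over belief states b and actions a with discount factor γ (the exact notation on the slide is an assumption), are:

    R_t = \sum_{k \ge 0} \gamma^k r_{t+k}
    V^\pi(b) = E_\pi[ R_t \mid b_t = b ]
    Q^\pi(b, a) = E_\pi[ R_t \mid b_t = b, a_t = a ]

The return is the discounted sum of future rewards, the value function is its expectation under the policy from a given belief state, and the Q-function additionally conditions on the first action taken.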

10 Deep Reinforcement Learning
The value function, Q-function or policy is approximated by a neural network. These are non-linear function approximators, which is what we want in reinforcement learning; however, the optimisation algorithms only find local optima.

11 Deep Q-network algorithm
The Q-function is approximated as a deep neural network and its parameters are updated by gradient descent on the temporal-difference error. This estimate is biased: states within a dialogue are correlated, and the targets are non-stationary. A sketch of the standard remedies is given below.
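As a hedged illustration (not the exact network or hyperparameters from the talk), a minimal PyTorch-style sketch of a DQN update that uses an experience replay pool and a periodically refreshed target network, the two standard remedies for correlated states and non-stationary targets. All names, layer sizes and the learning rate are assumptions; the input and action dimensions are borrowed from the evaluation set-up later in the talk.

    import random
    import torch
    import torch.nn as nn

    BELIEF_DIM, N_ACTIONS, GAMMA = 268, 14, 0.99    # sizes borrowed from the evaluation set-up

    q_net = nn.Sequential(nn.Linear(BELIEF_DIM, 128), nn.ReLU(), nn.Linear(128, N_ACTIONS))
    target_net = nn.Sequential(nn.Linear(BELIEF_DIM, 128), nn.ReLU(), nn.Linear(128, N_ACTIONS))
    target_net.load_state_dict(q_net.state_dict())
    optimiser = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    replay = []                                     # (belief, action, reward, next_belief, done) tuples

    def dqn_update(batch_size=32):
        # sampling from the replay pool breaks the correlation between consecutive states
        beliefs, actions, rewards, next_beliefs, dones = zip(*random.sample(replay, batch_size))
        beliefs, next_beliefs = torch.stack(beliefs), torch.stack(next_beliefs)
        actions = torch.tensor(actions)
        rewards = torch.tensor(rewards, dtype=torch.float32)
        dones = torch.tensor(dones, dtype=torch.float32)

        q = q_net(beliefs).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            # the frozen target network keeps the regression targets (almost) stationary
            target = rewards + GAMMA * (1 - dones) * target_net(next_beliefs).max(1).values
        loss = nn.functional.mse_loss(q, target)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()

    # every C updates, refresh the target network:
    # target_net.load_state_dict(q_net.state_dict())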

12 Policy approximation as a neural network
The policy is parameterised and the objective function is the value of the initial state. The policy gradient theorem (restated below) gives the gradient of this objective; it is used directly in REINFORCE, but also in actor-critic methods.
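The equation on this slide is likewise an image; the standard statement of the theorem, written with belief states b in place of states (an assumption consistent with the rest of the talk), is:

    J(\theta) = V^{\pi_\theta}(b_0), \qquad
    \nabla_\theta J(\theta) = E_{\pi_\theta}[ \nabla_\theta \log \pi_\theta(a \mid b) \, Q^{\pi_\theta}(b, a) ]

REINFORCE replaces Q^{\pi_\theta}(b, a) with the sampled return; actor-critic methods replace it with a learned critic.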

13 Problems of Deep RL in dialogue policy optimisation
1. Cold start problem; 2. Too many data points needed for convergence; 3. No uncertainty estimates; 4. No common benchmark.

14 Actor-critic framework

15 Advantage actor-critic (A2C)
Approximate both the policy and the value function as neural networks. The policy is updated with a policy-gradient step weighted by the advantage, and the value function is trained with a squared-error loss; a sketch follows.
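A minimal sketch of the two A2C losses, assuming a network with a shared body, a policy head and a value head over the belief state; the advantage is the sampled return minus the value baseline. Names, sizes and the 0.5 weighting are illustrative assumptions, not the talk's exact architecture.

    import torch
    import torch.nn as nn

    class ActorCritic(nn.Module):
        def __init__(self, belief_dim=268, n_actions=14):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(belief_dim, 128), nn.ReLU())
            self.policy_head = nn.Linear(128, n_actions)      # pi(a | b)
            self.value_head = nn.Linear(128, 1)               # V(b)

        def forward(self, beliefs):
            h = self.body(beliefs)
            return torch.log_softmax(self.policy_head(h), dim=-1), self.value_head(h).squeeze(-1)

    def a2c_loss(model, beliefs, actions, returns):
        """beliefs: batch of belief-state vectors; actions: taken actions; returns: sampled returns."""
        log_pi, values = model(beliefs)
        advantage = returns - values                          # A(b, a) ~ R - V(b)
        log_pi_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)
        policy_loss = -(log_pi_a * advantage.detach()).mean() # policy-gradient term
        value_loss = advantage.pow(2).mean()                  # squared-error loss for the critic
        return policy_loss + 0.5 * value_loss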

16 1. Cold start problem
Problem: Learning typically starts from tabula rasa. Solution: Make use of a corpus of dialogues; initialise with supervised learning and improve with reinforcement learning. Outcome: Better initial performance.

17 Using demonstration data (supervised pre-training)
Make efficient use of off-line demonstration data by initialising the policy in a supervised fashion. Pre-training minimises the cross entropy between the actions taken in the data and the actions taken by the policy; a sketch is given below.
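A sketch of the pre-training step, under the assumption that the demonstration corpus provides (belief state, system action) pairs; the policy network is trained to imitate the corpus actions by minimising cross entropy, and reinforcement learning then starts from the resulting weights. All names are illustrative.

    import torch
    import torch.nn as nn

    def pretrain(policy_net, corpus_loader, epochs=5, lr=1e-3):
        """policy_net maps a belief-state tensor to action logits; corpus_loader yields
        (belief, action) batches taken from the demonstration corpus."""
        optimiser = torch.optim.Adam(policy_net.parameters(), lr=lr)
        cross_entropy = nn.CrossEntropyLoss()    # between corpus actions and the policy's actions
        for _ in range(epochs):
            for beliefs, actions in corpus_loader:
                loss = cross_entropy(policy_net(beliefs), actions)
                optimiser.zero_grad()
                loss.backward()
                optimiser.step()
        return policy_net                        # reinforcement learning then continues from here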

18 Evaluation set-up Cambridge restaurant domain: 100 venues, 6 slots, 3 requestable slots. Belief state input of size 268 (last system act, distribution over user intent, …). System summary action space of size 14 (inform, request, confirm, …). User simulator operating on the semantic level.

19 Results Su et al, Sample-efficient Actor-Critic Reinforcement Learning with Supervised Data for Dialogue Management, SigDial, 2017

20 2. Problem: Too many interactions are needed
Problem: Learning is very slow and requires too many interactions; that is why action spaces are typically very small, on the order of 10 actions. Solution: ACER, which combines experience replay, truncated importance sampling with bias correction, the off-policy Retrace algorithm and trust-region policy optimisation. Outcome: The ability to learn on action spaces two orders of magnitude larger.

21 Properties of ACER A modified version of A2C which uses experience replay. It estimates the Q-function off-policy, using truncated importance sampling with bias correction. The Retrace algorithm computes the targets from the observed rewards in a safe, efficient way, with low bias and variance. It also uses trust-region policy optimisation.

22 Experience replay Off-policy learning follows a behaviour policy 𝛍 while optimising the target policy 𝛑. This allows previous experience stored in a replay pool to be sampled and 'replayed'. The sampling bias is corrected with the importance sampling ratio 𝛑/𝛍; a sketch of the mechanism follows.
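A sketch of the replay mechanism, assuming each stored transition also records the behaviour policy's probability μ(a|b) at the time of acting, so that the importance sampling ratio π/μ can be recomputed when the transition is replayed later. The class and its interface are illustrative, not PyDial's implementation.

    import random
    from collections import deque

    class ReplayPool:
        """Stores transitions together with the behaviour policy's probability mu(a|b)
        so that the importance sampling ratio pi/mu can be recomputed at replay time."""

        def __init__(self, capacity=10000):
            self.pool = deque(maxlen=capacity)

        def add(self, belief, action, reward, next_belief, mu_prob):
            self.pool.append((belief, action, reward, next_belief, mu_prob))

        def sample(self, target_policy_probs, batch_size=32):
            # target_policy_probs(belief) returns the current policy's action probabilities
            batch = random.sample(self.pool, batch_size)
            weighted = []
            for belief, action, reward, next_belief, mu_prob in batch:
                rho = target_policy_probs(belief)[action] / mu_prob   # importance sampling ratio
                weighted.append((belief, action, reward, next_belief, rho))
            return weighted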

23 Truncated importance sampling
When importance weights are applied to each sample in a trajectory, the product of weights can easily vanish or explode. This is why the importance sampling weights are truncated, and a bias correction term is added to account for the error that the truncation introduces.
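The exact expression on the slide is not shown in the transcript; a standard way to write the truncation with bias correction, with importance ratio \rho(a) = \pi(a \mid b) / \mu(a \mid b) and truncation constant c, is:

    E_{a \sim \mu}[ \rho(a) f(a) ]
      = E_{a \sim \mu}[ \min\{c, \rho(a)\} \, f(a) ]
      + E_{a \sim \pi}[ [ (\rho(a) - c) / \rho(a) ]_+ \, f(a) ]

The first term bounds the variance by clipping the weight at c; the second term, taken under the target policy, corrects the bias that the clipping introduces.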

24 Modified neural network architecture
Approximate the policy and the Q-function as neural networks, with a correspondingly modified loss. The targets are estimated via the Retrace algorithm.

25 Targets from the Retrace algorithm
The Retrace algorithm estimates the current Q-function from off-policy interactions in a safe and efficient way, with small variance. It resolves the safety-efficiency trade-off by setting the traces 'dynamically', based on the importance sampling weights. In the near on-policy case it is efficient, as the importance sampling weights are close to 1, which prevents the traces from vanishing. A sketch of the backward recursion is given below.
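A sketch of how the Retrace targets can be computed by a backward recursion over one sampled dialogue, assuming per-step rewards, Q-values for the taken actions, expected values V(b) and importance ratios are available; the names and the bootstrap convention are assumptions.

    import numpy as np

    def retrace_targets(rewards, q_taken, values, rho, gamma=0.99, lam=1.0):
        """Backward recursion for Retrace targets over one sampled dialogue.
        rewards[t] : reward r_t
        q_taken[t] : Q(b_t, a_t) for the action actually taken
        values[t]  : V(b_t) = E_pi[Q(b_t, .)], with values[T] used to bootstrap the tail
        rho[t]     : importance ratio pi(a_t|b_t) / mu(a_t|b_t)
        """
        T = len(rewards)
        targets = np.zeros(T)
        next_q_ret = values[T]                   # bootstrap from the final belief state
        for t in reversed(range(T)):
            targets[t] = rewards[t] + gamma * next_q_ret
            c = lam * min(1.0, rho[t])           # truncated, 'dynamically' set trace
            next_q_ret = c * (targets[t] - q_taken[t]) + values[t]
        return targets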

26 Trust-region policy optimisation (TRPO)
Small changes in the parameter space can lead to erratic changes in the output policy. The solution is the natural gradient, but it is expensive to compute. The distance metric underlying the natural gradient can be approximated by the KL divergence; TRPO approximates the KL divergence with a first-order Taylor expansion and limits the KL divergence between the policies of subsequent parameter updates.
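As a sketch, the generic trust-region formulation constrains each update so that the new policy stays close to the previous one in KL divergence (the variant used with ACER additionally measures the divergence against a slowly moving average policy, which the transcript does not detail):

    \max_\theta \; E[ \frac{\pi_\theta(a \mid b)}{\pi_{\theta_{old}}(a \mid b)} \, A^{\pi_{\theta_{old}}}(b, a) ]
    \quad \text{subject to} \quad
    E[ D_{KL}( \pi_{\theta_{old}}(\cdot \mid b) \,\|\, \pi_\theta(\cdot \mid b) ) ] \le \delta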

27 Evaluation set-up Cambridge restaurant domain: 100 venues, 6 slots, 3 requestable slots. Belief state input of size 268 (last system act, distribution over user intent, …). System summary action space of size 14 (inform, request, confirm, …). User simulator operating on the semantic level.

28 Results

29 Extending to master action space

30 NN architecture for master space

31 Learning on master action space

32 Learning on master action space
ACER results:
                   Summary space (14 actions)   Master space (1035 actions)
Success            89.7%                        89.1%
Reward (max 20)    11.39                        11.83
Turns              6.42                         5.98
Weisz et al, Sample efficient deep reinforcement learning for dialogue systems with large action spaces, arXiv, 2018

33 3. Problem: No uncertainty estimates
Problem: Neural networks do not provide uncertainties about their estimates. These are very useful for guiding exploration towards more sample-efficient learning and for stabilising the learning. Solution: Bayesian neural networks estimate means and variances of each parameter in the neural network using a variational approximation. Outcome: More sample-efficient and stable learning.

34 Bayes by backprop architecture

35 Bayes by backprop Q-network
All weights are represented by probability distributions over possible values given the observed dialogues. Taking an expectation under the posterior distribution is equivalent to using an ensemble of an uncountably infinite number of neural networks, which is intractable. We therefore use sampling-based variational inference: the intractable posterior is approximated with a variational posterior, and the resulting cost is the variational free energy (reconstructed below).
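The cost itself is an image on the slide; the standard Bayes-by-backprop objective, with variational posterior q(w | θ), prior P(w) and observed dialogues D, is the variational free energy

    F(D, \theta) = KL[ q(w \mid \theta) \,\|\, P(w) ] - E_{q(w \mid \theta)}[ \log P(D \mid w) ]
                 \approx \sum_{i=1}^{n} \log q(w^{(i)} \mid \theta) - \log P(w^{(i)}) - \log P(D \mid w^{(i)})

where the w^{(i)} are Monte Carlo samples drawn from q(w | θ).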

36 Results Tegho et al, Benchmarking Uncertainty Estimates with Deep Reinforcement Learning for Dialogue Policy Optimisation, ICASSP, 2018

37 4. Problem: Common benchmarks
Problem: Building dialogue systems and comparing work in the field is very difficult. Solution: An open-source toolkit for building statistical spoken dialogue systems, a standardised set of tasks, and common training and test sets. Outcome: Reproducible research and a larger research community.

38 PyDial: CUED open source Python toolkit
It provides implementations of statistical approaches for all dialogue system modules, offers easy configuration, easy extensibility, and domain-independent implementations of the respective modules, and has been extended to provide multi-domain conversational functionality. Get the code: Ultes et al, PyDial: A Multi-domain Statistical Dialogue System Toolkit, ACL, 2017

39 Environments for benchmarking policy optimisation
A benchmarking environment in which a fair comparison between different interacting algorithms can be established. The environments consist of different domains, user simulator settings and noise levels in the input; in total 18 environments are available. The implementation includes 4 state-of-the-art dialogue policy optimisation algorithms. Casanueva et al, A Benchmarking Environment for Reinforcement Learning Based Task Oriented Dialogue Management, NIPS Symposium on Deep RL, 2017

40 Dataset for multi-domain dialogues
How to effectively evaluate different dialogue models for multi-domain dialogue? Create a large common dataset together with tools for baseline training. Make use of the Wizard of Oz framework to provide more linguistically varied data. 10K dialogues collected to date. Google Faculty Award

41 Challenges of natural conversation
Large (potentially infinite!) action spaces. Less task-focused, more open-ended. Beyond speech recognition: incorporating sentiment, gesture and emotion. Long-term conversation. Incremental operation.

42 Conclusions Deep reinforcement learning in dialogue management:
can be improved with supervised initialisation, can be sample-efficient, and can provide uncertainty estimates. The PyDial framework offers a common benchmark. From here we can model more natural conversation and address new challenging tasks.

43 Dialogue Systems Group
Stefan Ultes Inigo Casanueva Pawel Budzianowski Bo-Hsiang Tseng Yen-Chen Wu Florian Kreyssig Osman Ramadan Steve Young (now at Apple) Lina Rojas Barahona (now at Orange) Pei-Hao Su (now at Poly.ai) Nikola Mrksic (now at Poly.ai) Tsung-Hsien Wen (now at Poly.ai) Gellert Weisz (now at DeepMind) Chris Tegho (now at Calipsa)

