Reinforcement Learning CSLT ML Summer Seminar (12) Reinforcement Learning Dong Wang Most slides from David Silver’s UCL Course on RL Some slides from John Schulman’s ‘Deep Reinforcement Learning’, Lecture 1
Content What is reinforcement learning? Shallow discrete learning Deep Q learning
What is reinforcement learning? Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. Given a state by the environment, the agent learns how to take an action; the environment then returns a (random) reward and moves to another (random) state. The probabilities of the reward and the next state are stationary.
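A minimal sketch of this agent-environment interaction loop. The two-state, two-action environment below is invented purely for illustration; it is not from the slides, and the policy is just random action selection rather than anything learned.

```python
import random

# Toy two-state, two-action environment invented for illustration.
# The transition and reward probabilities are stationary, as stated above.
def step(state, action):
    next_state = random.choice([0, 1])                 # (random) next state
    reward = 1.0 if (state == 1 and action == 1) else 0.0
    return next_state, reward

state = 0
total_reward = 0.0
for t in range(100):
    action = random.choice([0, 1])                     # a not-yet-learned policy
    state, reward = step(state, action)                # environment responds
    total_reward += reward                             # feedback the agent learns from
print("return over 100 steps:", total_reward)
```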
An example
Main components of RL
It is different from other tasks Unlike supervised learning, it has no 'label', e.g., which action should be taken. Feedback is often delayed, e.g., in game playing. Time matters: it is a sequential decision process. Decisions impact the environment. Compared to unsupervised learning, it has some 'supervision' (the reward).
Applications Fly stunt manoeuvres in a helicopter Defeat the world champion at Backgammon Manage an investment portfolio Control a power station Make a humanoid robot walk Play many different Atari games better than humans
Robot
Robot
Business
Finance
Media
Medicine
Sequence prediction
Game playing http://www.nature.com/nature/journal/v518/n7540/fig_tab/nature14236_SV1.html http://www.nature.com/nature/journal/v518/n7540/fig_tab/nature14236_SV2.html
Some important things to mention Is the environment known (e.g., the transition and reward probabilities)? Can we observe the state (hidden or explicit)? Do we need to model the environment? Do we learn from complete episodes or online? Do we use function approximation or an explicit table?
Content What is reinforcement learning? Shallow discrete learning Deep Q learning
Markov decision process Markov decision processes formally describe an environment for reinforcement learning Where the environment is fully observable, i.e. the current state completely characterises the process Almost all RL problems can be formalised as MDPs, e.g. Optimal control primarily deals with continuous MDPs Partially observable problems can be converted into MDPs Bandits are MDPs with one state
Markov process
Markov reward process
An example of Markov reward process
Return in Markov reward process
Value function
Bellman Equation
Markov decision process (MDP)
Policy
Value function in MDP
Bellman Expectation Equation in MDP
Relation between two valuation functions
Relation between two valuation functions
Optimal value function
Optimal policy
Find optimal policy
Bellman optimization The Bellman optimality equations are non-linear (because of the max() operator) and have no closed-form solution. Iterative procedures can solve them.
Policy evaluation Given a policy, compute the value function at each state.
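A minimal sketch of iterative policy evaluation on a made-up 3-state Markov reward process (the given policy is already folded into the transition matrix P and reward vector R; all numbers are assumptions for illustration). The key step is the Bellman expectation backup applied repeatedly until the values stop changing.

```python
import numpy as np

# Hypothetical 3-state MRP induced by a fixed policy:
# P[s, s'] is the transition probability, R[s] the expected reward, gamma the discount.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([0.0, 1.0, 0.0])
gamma = 0.9

V = np.zeros(3)
for _ in range(1000):
    # Bellman expectation backup: V(s) = R(s) + gamma * sum_s' P(s,s') V(s')
    V_new = R + gamma * P @ V
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V)
```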
Improve policy
General policy process
Value iteration It does not involve an explicit policy update; however, the optimal policy is implicitly learned through the 'max' operation.
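A minimal value iteration sketch on a made-up MDP (the transition tensor P, rewards R and discount factor are assumptions for illustration). Note how the greedy policy falls out of the 'max' at the end, without any explicit policy update during the iteration.

```python
import numpy as np

# Hypothetical MDP with |S|=3 states and |A|=2 actions.
# P[a, s, s'] are transition probabilities, R[a, s] expected rewards.
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0]])
gamma = 0.9

V = np.zeros(3)
for _ in range(1000):
    # Bellman optimality backup: V(s) = max_a [ R(a,s) + gamma * sum_s' P(a,s,s') V(s') ]
    Q = R + gamma * np.einsum('ast,t->as', P, V)
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
policy = Q.argmax(axis=0)   # greedy policy recovered via the 'max' operation
print(V, policy)
```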
Can be performed asynchronously Three simple ideas for asynchronous dynamic programming: In-place dynamic programming Prioritised sweeping Real-time dynamic programming
Full-width approach and sampling approach The above approach uses all possible successor states in each update, which requires knowing the dynamics of the system. Alternatively, the update can use actual experience: a model-free approach based on sampling interacts with the environment and learns from that experience. The sampling approach is easier to implement and more efficient.
Monte-Carlo Reinforcement Learning MC methods learn directly from episodes of experience MC is model-free: no knowledge of MDP transitions / rewards MC learns from complete episodes: no bootstrapping MC uses the simplest possible idea: value = mean return Caveat: can only apply MC to episodic MDPs All episodes must terminate
MC evaluation Do not touch the policy; just estimate the value function.
First-visit MC
Every-visit MC
Incremental MC
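A minimal sketch of the incremental (every-visit) MC update in a toy tabular setting; the episode format, discount factor, and state names are assumptions for illustration. The core is the running-mean update V(s) ← V(s) + (1/N(s)) (G − V(s)), applied once per visit after the episode terminates.

```python
from collections import defaultdict

gamma = 1.0
V = defaultdict(float)   # value estimates
N = defaultdict(int)     # visit counts

def mc_update(episode):
    """episode: list of (state, reward) pairs from one complete episode."""
    G = 0.0
    # Walk backwards so G accumulates the discounted return from each step onwards.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        N[state] += 1
        # Incremental mean: V(s) <- V(s) + (1/N(s)) * (G - V(s))
        V[state] += (G - V[state]) / N[state]

# Example usage with a made-up episode.
mc_update([("s0", 0.0), ("s1", 1.0), ("s2", 0.0)])
print(dict(V))
```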
Temporal-Difference Learning TD methods learn directly from episodes of experience TD is model-free: no knowledge of MDP transitions / rewards TD learns from incomplete episodes, by bootstrapping TD updates a guess towards a guess It can work without knowing the final outcome, which makes it well suited to online learning!
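A minimal TD(0) prediction sketch; the dictionary-based tabular value store, step size, and the example transition are assumptions for illustration. The bootstrapped update V(s) ← V(s) + α (r + γ V(s') − V(s)) is what lets TD learn online from incomplete episodes.

```python
def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.9):
    """One TD(0) update: move V(s) towards the bootstrapped target r + gamma * V(s')."""
    target = r + (0.0 if done else gamma * V.get(s_next, 0.0))
    V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
    return V

# Example usage with a made-up transition observed online.
V = {}
V = td0_update(V, s="s0", r=1.0, s_next="s1", done=False)
print(V)
```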
Some comparison
Now we update the policy On-policy learning: learn on the job, i.e., learn about a policy from experience sampled from that same policy. Off-policy learning: look over someone's shoulder, i.e., learn about a policy from experience sampled from a different (behaviour) policy.
MC policy learning
Sarsa (TD) policy learning
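A minimal tabular Sarsa sketch; the epsilon-greedy behaviour policy, action set, step sizes, and example transition are assumptions for illustration, not prescribed by the slides.

```python
import random
from collections import defaultdict

Q = defaultdict(float)            # Q[(state, action)], tabular
actions = [0, 1]
alpha, gamma, eps = 0.1, 0.9, 0.1

def epsilon_greedy(state):
    """Behaviour policy: mostly greedy w.r.t. Q, sometimes random."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(s, a, r, s_next, a_next, done):
    """On-policy update: the target uses the action a_next actually selected
    by the same epsilon-greedy policy in the next state."""
    target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Example usage with a made-up transition.
s, a = "s0", epsilon_greedy("s0")
s_next, r = "s1", 1.0
a_next = epsilon_greedy(s_next)
sarsa_update(s, a, r, s_next, a_next, done=False)
print(dict(Q))
```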
Off-policy learning Update the policy and the Q value together
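A minimal Q-learning sketch as the standard off-policy TD control update; the tabular dictionary layout and example transition are assumptions for illustration. Unlike Sarsa, the target uses the greedy max over next actions regardless of what the behaviour policy actually does next.

```python
def q_learning_update(Q, s, a, r, s_next, done, actions, alpha=0.1, gamma=0.9):
    """Off-policy update: the target policy is greedy (max over a'),
    regardless of which action the behaviour policy takes next."""
    best_next = 0.0 if done else max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
    return Q

# Example usage with a made-up transition.
Q = q_learning_update({}, s="s0", a=0, r=1.0, s_next="s1", done=False, actions=[0, 1])
print(Q)
```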
But all the above is mostly useless in practice How do we represent the state? How do we store the value function if the state space is large? What if the state space is continuous? What if we meet states that are similar to, but different from, those in the training data? What if we observe only a small number of training examples? Use a parametric function to approximate the value function!
Value function approximation
We consider differentiable function approximators, e.g. Linear combinations of features Neural network Decision tree Nearest neighbour Fourier / wavelet bases ... Furthermore, we require a training method that is suitable for non-stationary, non-iid data
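A minimal sketch of the simplest differentiable case: linear value function approximation trained with semi-gradient TD(0). The feature map phi (a fixed random projection), step sizes, and example states are all assumptions for illustration.

```python
import numpy as np

# Linear value function approximation: V(s; w) = w . phi(s).
# phi is a hypothetical feature map; here a fixed random projection for illustration.
rng = np.random.default_rng(0)
W_feat = rng.normal(size=(4, 8))      # raw state dim 4 -> 8 features

def phi(state):
    return np.tanh(W_feat.T @ state)

w = np.zeros(8)
alpha, gamma = 0.01, 0.99

def td0_approx_update(s, r, s_next, done):
    """Semi-gradient TD(0): w <- w + alpha * (target - V(s;w)) * grad_w V(s;w)."""
    v_s = w @ phi(s)
    target = r + (0.0 if done else gamma * (w @ phi(s_next)))
    w[:] = w + alpha * (target - v_s) * phi(s)   # gradient of a linear V(s;w) is phi(s)

# Example usage with made-up continuous states.
td0_approx_update(np.array([0.1, -0.2, 0.3, 0.0]), 1.0,
                  np.array([0.2, -0.1, 0.1, 0.0]), False)
print(w)
```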
Content What is reinforcement learning? Shallow discrete learning Deep Q learning
Deep Q network Use a DNN to approximate the value function. Use MC or TD to generate samples, and use the error signals from these training samples to train the DNN.
Incremental update for Q function
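A minimal sketch of the incremental Q update with a neural network: regress Q(s, a) towards the TD target r + γ max_a' Q(s', a'). PyTorch is used here only as a stand-in DNN toolkit; the network shape, optimiser, and the random batch of transitions are assumptions for illustration (and, for brevity, the same network computes the target, without the stabilising tricks of the next slides).

```python
import torch
import torch.nn as nn

# Hypothetical Q-network: maps a state vector to one Q value per action.
state_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_step(s, a, r, s_next, done):
    """One incremental update: regress Q(s,a) towards r + gamma * max_a' Q(s',a')."""
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with a made-up batch of 8 transitions.
s = torch.randn(8, state_dim); s_next = torch.randn(8, state_dim)
a = torch.randint(0, n_actions, (8,)); r = torch.rand(8); done = torch.zeros(8)
print(dqn_step(s, a, r, s_next, done))
```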
DQN for game learning Human-level control through deep reinforcement learning
Two mechanisms Experience replay and a separate, periodically updated target network
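A minimal sketch of these two stabilising mechanisms from the DQN paper; the buffer size, sync period C, and helper names are assumptions for illustration.

```python
import random
import copy
from collections import deque

# Mechanism 1: experience replay -- store transitions and train on random
# minibatches to break the correlation between consecutive samples.
replay = deque(maxlen=100_000)

def store(transition):                 # transition = (s, a, r, s_next, done)
    replay.append(transition)

def sample_batch(batch_size=32):
    return random.sample(replay, batch_size)

# Mechanism 2: target network -- keep a frozen copy of the Q-network for
# computing TD targets, and re-sync it only every C training steps.
def sync_target(q_net):
    return copy.deepcopy(q_net)        # call every C steps; use the copy for targets
```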
DQN for AlphaGo Mastering the game of Go with deep neural networks and tree search
Wrap up Reinforcement learning learns a policy. It is basically formulated as learning in a Markov decision process. It can be learned in a 'batch' (full-width) way or by sampling, and in an episodic or incremental fashion. Learning the value function is highly important. Deep learning provides a brilliant solution. It opens the door to fantastic machine intelligence.