1
Reinforcement Learning
CSLT ML Summer Seminar (12): Reinforcement Learning
Dong Wang
Most slides from David Silver’s UCL Course on RL. Some slides from John Schulman’s ‘Deep Reinforcement Learning’, Lecture 1.
2
Content
What is reinforcement learning?
Shallow discrete learning
Deep Q learning
3
What is reinforcement learning?
Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. Given a state by the environment, the agent learns how to take an action; in return it receives a (random) reward and the system moves to another (random) state. The probabilities of the reward and the next state are stationary.
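The interaction loop described above can be sketched as follows. This is a minimal illustration, not code from the seminar: the `TwoStateEnv` class, its 0.8 success probability and its reward noise are all hypothetical choices made only to show random rewards and random next states drawn from stationary distributions.

```python
import random

class TwoStateEnv:
    """Hypothetical toy environment: states 0 and 1, actions 0 (stay) and 1 (switch)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        # Random next state: the chosen action is flipped with probability 0.2.
        if self.rng.random() >= 0.8:
            action = 1 - action
        if action == 1:
            self.state = 1 - self.state
        # Random reward: about 1 in state 1, about 0 in state 0.
        reward = self.state + self.rng.gauss(0.0, 0.1)
        return self.state, reward

def run(env, policy, steps):
    """Run the agent-environment loop and collect the rewards."""
    rewards = []
    state = env.state
    for _ in range(steps):
        action = policy(state)            # the agent picks an action for the state
        state, reward = env.step(action)  # the environment answers with state and reward
        rewards.append(reward)
    return rewards

# A fixed policy: switch when in state 0, stay when in state 1.
rewards = run(TwoStateEnv(seed=0), lambda s: 1 if s == 0 else 0, steps=1000)
average_reward = sum(rewards) / len(rewards)
```

Here the policy is fixed by hand; the learning methods later in the deck are about finding such a policy from the rewards alone.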
4
An example
6
Main components of RL
7
It is different from other tasks
Unlike supervised learning, there is no ‘label’, e.g., no indication of which action should be taken.
Feedback is often delayed, e.g., in game playing.
Time is important: it is a sequential decision process.
Decisions impact the environment.
Compared with unsupervised learning, it has some ‘supervision’ (the reward).
9
Applications
Fly stunt manoeuvres in a helicopter
Defeat the world champion at Backgammon
Manage an investment portfolio
Control a power station
Make a humanoid robot walk
Play many different Atari games better than humans
10
Robot
11
Robot
12
Business
13
Finance
14
Media
15
Medicine
16
Sequence prediction
17
Game playing
18
Some important things to mention
Is the environment known (e.g., the transition and reward probabilities)?
Can we observe the state directly, or is it hidden?
Do we need to model the environment?
Do we learn from complete episodes or online?
Do we use function approximation or an explicit table?
19
Content
What is reinforcement learning?
Shallow discrete learning
Deep Q learning
20
Markov decision process
Markov decision processes formally describe an environment for reinforcement learning, where the environment is fully observable, i.e. the current state completely characterises the process.
Almost all RL problems can be formalised as MDPs, e.g.
Optimal control primarily deals with continuous MDPs
Partially observable problems can be converted into MDPs
Bandits are MDPs with one state
21
Markov process
22
Markov reward process
23
An example of Markov reward process
24
Return in Markov reward process
25
Value function
26
Bellman Equation
27
Markov decision process (MDP)
29
Policy
30
Value function in MDP
31
Bellman Expectation Equation in MDP
32
Relation between the two value functions
33
Relation between the two value functions
34
Optimal value function
35
Optimal policy
36
Find optimal policy
37
Bellman optimality equation
The Bellman optimality equations are non-linear (because of the max) and in general have no closed-form solution.
Iterative procedures can solve them.
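For reference, the equations in question can be written as below. The notation is an assumption borrowed from David Silver's course, which the deck credits: $\mathcal{R}_s^a$ is the expected reward, $\mathcal{P}_{ss'}^a$ the transition probability and $\gamma$ the discount factor. The `max` over actions is what makes the system non-linear.

```latex
v_*(s)    = \max_a \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \, v_*(s') \Big)

q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \, \max_{a'} q_*(s', a')
```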
38
Policy evaluation
Given a policy, compute the value function at each state.
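A minimal sketch of iterative policy evaluation on a known model follows. The two-state MDP, the array layout (`P[a, s, s']`, `R[s, a]`, `policy[s, a]`) and the discount factor are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Hypothetical MDP: action 0 stays, action 1 switches state;
# any action taken in state 1 earns reward 1, in state 0 reward 0.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],    # P[0]: stay
              [[0.0, 1.0], [1.0, 0.0]]])   # P[1]: switch
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])                 # R[s, a]

def policy_evaluation(P, R, policy, gamma=0.9, tol=1e-10):
    """Repeatedly apply the Bellman expectation backup until v stops changing."""
    # Average the model over the policy's action probabilities.
    P_pi = np.einsum("sa,ast->st", policy, P)
    R_pi = (policy * R).sum(axis=1)
    v = np.zeros(R.shape[0])
    while True:
        v_new = R_pi + gamma * P_pi @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

uniform_policy = np.full((2, 2), 0.5)      # pick each action with probability 0.5
v_uniform = policy_evaluation(P, R, uniform_policy)
```

For this toy model the uniform random policy evaluates to roughly 4.5 in state 0 and 5.5 in state 1.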
39
Improve policy
40
Generalised policy iteration
41
Value iteration
No explicit policy update is involved; the optimal policy is learned implicitly through the ‘max’ operation.
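Value iteration can be sketched as follows, assuming a known model stored as arrays `P[a, s, s']` and `R[s, a]`. The two-state MDP is a hypothetical example. Note that no policy is stored during the iteration; the greedy policy is read off from the final `max`.

```python
import numpy as np

# Hypothetical MDP: action 0 stays, action 1 switches state;
# any action taken in state 1 earns reward 1, in state 0 reward 0.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])

def value_iteration(P, R, gamma=0.9, tol=1e-10):
    """Apply the Bellman optimality backup until the values converge."""
    v = np.zeros(R.shape[0])
    while True:
        # q[s, a] = R[s, a] + gamma * sum_s' P[a, s, s'] * v[s']
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)
        v = v_new

v_star, greedy_policy = value_iteration(P, R)
```

On this toy model the result is to switch in state 0 and stay in state 1, with values close to 9 and 10.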
43
Can be performed asynchronously
Three simple ideas for asynchronous dynamic programming:
In-place dynamic programming
Prioritised sweeping
Real-time dynamic programming
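As an illustration of the first idea only, here is a sketch of in-place value iteration: a single value array is overwritten state by state, so later backups in the same sweep already use the updated values. The MDP and the fixed number of sweeps are assumptions chosen for brevity.

```python
import numpy as np

# Hypothetical MDP: action 0 stays, action 1 switches state;
# any action taken in state 1 earns reward 1, in state 0 reward 0.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])

def in_place_value_iteration(P, R, gamma=0.9, sweeps=500):
    n_states, n_actions = R.shape
    v = np.zeros(n_states)
    for _ in range(sweeps):
        for s in range(n_states):
            # Overwrite v[s] immediately; no second copy of the value array is kept.
            v[s] = max(R[s, a] + gamma * P[a, s] @ v for a in range(n_actions))
    return v

v_in_place = in_place_value_iteration(P, R)
```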
44
Full-width approach and sampling approach
The approaches above use all possible successor states in each update (full-width backups), which requires knowing the dynamics of the system. Updates can instead be based on actual sampled experience, which gives a model-free approach: the agent interacts with the environment and learns from the experience. The sampling approach is easier to implement and usually more efficient, because the cost of an update does not grow with the number of states.
45
Monte-Carlo Reinforcement Learning
MC methods learn directly from episodes of experience
MC is model-free: no knowledge of MDP transitions / rewards
MC learns from complete episodes: no bootstrapping
MC uses the simplest possible idea: value = mean return
Caveat: can only apply MC to episodic MDPs
All episodes must terminate
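The "value = mean return" idea can be sketched with first-visit Monte-Carlo evaluation. The one-state episodic process below is a hypothetical example: in state `A` the agent either stays (reward 1) or terminates (reward 0), each with probability 0.5, so the true undiscounted value of `A` is 1.

```python
import random
from collections import defaultdict

def sample_episode(rng):
    """Return one complete episode as a list of (state, reward) pairs."""
    episode = []
    while True:
        if rng.random() < 0.5:
            episode.append(("A", 1.0))   # stayed in A, reward 1
        else:
            episode.append(("A", 0.0))   # terminated, reward 0
            return episode

def first_visit_mc(episodes, gamma=1.0):
    returns = defaultdict(list)
    for episode in episodes:
        # Compute the return following each time step, working backwards.
        g = 0.0
        returns_at = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            g = episode[t][1] + gamma * g
            returns_at[t] = g
        # Record the return only for the first visit to each state.
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:
                seen.add(state)
                returns[state].append(returns_at[t])
    # Value estimate = mean of the recorded returns.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

rng = random.Random(0)
values = first_visit_mc([sample_episode(rng) for _ in range(20000)])
```

No transition probabilities appear anywhere in `first_visit_mc`; it only needs complete, terminated episodes.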
46
MC evaluation
The policy is left untouched; only the value function is estimated.
47
First-visit MC
48
Every-visit MC
49
Incremental MC
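A minimal sketch of the incremental form: instead of storing all returns and averaging at the end, the estimate is nudged towards each new return. The function name and the sample returns are illustrative.

```python
def incremental_mean(returns):
    """Running mean: V <- V + (G - V) / N after each observed return G."""
    value, count = 0.0, 0
    for g in returns:
        count += 1
        value += (g - value) / count
    return value

estimate = incremental_mean([1.0, 2.0, 3.0, 4.0])
```

Replacing `1 / count` with a constant step size gives a running average that gradually forgets old episodes, which is a common choice for non-stationary problems.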
50
Temporal-Difference Learning
TD methods learn directly from episodes of experience
TD is model-free: no knowledge of MDP transitions / rewards
TD learns from incomplete episodes, by bootstrapping
TD updates a guess towards a guess
TD can learn without knowing the final outcome, which makes it well suited to online learning
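A sketch of TD(0) evaluation follows. The process is a hypothetical one-state example (stay in `A` with reward 1, or terminate with reward 0, each with probability 0.5; true undiscounted value 1). The step size and the number of transitions are arbitrary choices.

```python
import random

def td0_evaluate(n_transitions=20000, alpha=0.002, gamma=1.0, seed=0):
    rng = random.Random(seed)
    v = {"A": 0.0, "terminal": 0.0}   # the terminal state's value stays 0
    state = "A"
    for _ in range(n_transitions):
        if rng.random() < 0.5:
            reward, next_state = 1.0, "A"
        else:
            reward, next_state = 0.0, "terminal"
        # "A guess towards a guess": the target uses the current estimate of V(s').
        td_target = reward + gamma * v[next_state]
        v[state] += alpha * (td_target - v[state])
        # Start a new episode whenever the current one ends.
        state = "A" if next_state == "terminal" else next_state
    return v["A"]

v_a = td0_evaluate()
```

The update is applied after every single transition, so it does not need to wait for the episode to finish.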
52
Some comparison
54
Now we update the policy
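On-policy learning
‘Learn on the job’
Learn about a policy from experience sampled from that same policy
Off-policy learning
‘Look over someone's shoulder’
Learn about a policy from experience sampled from a different (behaviour) policy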
55
MC policy learning
56
Sarsa (TD) policy learning
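A minimal Sarsa sketch, under assumptions: the two-state MDP (action 0 stays, action 1 switches, reward equal to the current state), the epsilon-greedy exploration and all hyperparameters are illustrative, not taken from the slides.

```python
import random

def step(state, action):
    """Hypothetical MDP: reward equals the current state; action 1 switches state."""
    next_state = state if action == 0 else 1 - state
    return float(state), next_state

def epsilon_greedy(q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.randrange(2)                      # explore
    return max(range(2), key=lambda a: q[state][a])  # exploit

def sarsa(n_steps=50000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]                     # q[state][action]
    state = 0
    action = epsilon_greedy(q, state, epsilon, rng)
    for _ in range(n_steps):
        reward, next_state = step(state, action)
        next_action = epsilon_greedy(q, next_state, epsilon, rng)
        # On-policy target: uses the action the agent will actually take next.
        target = reward + gamma * q[next_state][next_action]
        q[state][action] += alpha * (target - q[state][action])
        state, action = next_state, next_action
    return q

q_sarsa = sarsa()
```

The name comes from the tuple used in each update: state, action, reward, next state, next action.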
57
Off-policy learning
Update the policy and the Q values together.
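Q-learning is the standard example of off-policy learning and can be sketched as below. The setup is assumed for illustration: the agent behaves uniformly at random in a hypothetical two-state MDP, while the update learns about the greedy policy.

```python
import random

def step(state, action):
    """Hypothetical MDP: reward equals the current state; action 1 switches state."""
    next_state = state if action == 0 else 1 - state
    return float(state), next_state

def q_learning(n_steps=20000, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]          # q[state][action]
    state = 0
    for _ in range(n_steps):
        action = rng.randrange(2)         # behaviour policy: uniformly random
        reward, next_state = step(state, action)
        # Off-policy target: the max over next actions evaluates the greedy
        # policy, whatever action the behaviour policy takes next.
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
    return q

q = q_learning()
greedy_policy = [max(range(2), key=lambda a: q[s][a]) for s in range(2)]
```

Although the agent never acts greedily, the learned Q values describe the greedy policy: switch in state 0, stay in state 1.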
58
But all of the above is mostly useless
How do we know the state? How do we store the value function if the state space is large? What if the state is continuous? What if we meet states that are similar to, but different from, those in the training data? What if we observe only a small number of training examples? Use a parametric function to approximate the value function!
59
Value function approximation
60
We consider differentiable function approximators, e.g.
Linear combinations of features
Neural network
Decision tree
Nearest neighbour
Fourier / wavelet bases
...
Furthermore, we require a training method that is suitable for non-stationary, non-iid data
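The simplest case, a linear combination of features trained by stochastic gradient descent, can be sketched as follows. Everything here is an illustrative assumption: the feature map `[1, s]`, the noiseless targets (standing in for sampled returns) and the step size.

```python
import numpy as np

def features(s):
    """Hypothetical feature vector for a scalar state s in [0, 1]."""
    return np.array([1.0, s])

def fit_linear_value(n_updates=5000, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(2)
    for _ in range(n_updates):
        s = rng.integers(0, 5) / 4.0       # one of 0, 0.25, 0.5, 0.75, 1
        target = 2.0 + 3.0 * s             # stands in for a sampled return G
        x = features(s)
        # SGD on the squared error (target - w.x)^2: w <- w + alpha * error * x
        w += alpha * (target - w @ x) * x
    return w

w = fit_linear_value()
```

With two weights the approximator covers every state in the range, including states never seen in training, which is the point of replacing the table.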
61
Content
What is reinforcement learning?
Shallow discrete learning
Deep Q learning
62
Deep Q network
Use a DNN to approximate the value function.
Use MC or TD to generate samples, and use the error signals from these training samples to train the DNN.
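The sample-generation side can be sketched without any deep learning library. The code below assumes the two ingredients commonly associated with DQN, a replay buffer of past transitions and TD targets computed from a separate target network; the network itself is left out, and `next_q_values` simply stands for its output. The class and function names are hypothetical.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)   # the oldest transitions drop out first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps.
        return self.rng.sample(list(self.buffer), batch_size)

def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """y = r + gamma * max_a' Q_target(s', a'), with no bootstrap at episode end.

    next_q_values[i, a] is assumed to be the target network's output for
    the i-th next state and action a.
    """
    return rewards + gamma * (1.0 - dones) * next_q_values.max(axis=1)

buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample(2)

targets = td_targets(
    rewards=np.array([1.0, 0.0]),
    next_q_values=np.array([[1.0, 2.0], [3.0, 4.0]]),
    dones=np.array([0.0, 1.0]),
    gamma=0.5,
)
```

The error signal used to train the DNN is then the difference between these targets and the network's own predictions for the sampled state-action pairs.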
64
Incremental update for Q function
65
DQN for game learning
Human-level control through deep reinforcement learning
66
Two mechanisms
67
DQN for AlphaGo
Mastering the game of Go with deep neural networks and tree search
68
Wrap up
Reinforcement learning learns a policy.
It is basically formulated as learning in a Markov decision process.
It can be learned in a ‘batch’ way or a sampling way, and in an episodic or incremental fashion.
Learning the value function is highly important.
Deep learning provides a brilliant solution. It opens the door to fantastic machine intelligence.