A Crash Course in Reinforcement Learning. Oliver Schulte, Simon Fraser University. Emphasis on connections with neural net learning.
Outline What is Reinforcement Learning? Key Definitions Key Learning Tasks Reinforcement Learning Techniques Reinforcement Learning with Neural Nets
Overview
Learning To Act. So far: learning to predict. Now: learning to act. In engineering: control theory. In economics and operations research: decision and game theory. Examples: fly a helicopter, drive a car, play Go, play soccer.
RL at a glance http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/mlbook/ch13.pdf
Acting in Action. Autonomous helicopter. Learning to play video games, an example of imitation learning: start by observing human actions; "Deep Q works best when it lives in the moment" (https://www.wired.com/2015/02/google-ai-plays-atari-like-pros/). Learning to flip pancakes: https://www.youtube.com/watch?v=VCdxqn0fcnE
MARKOV DECISION PROCESSES
Markov Decision Processes. Recall Markov process (MP): the state is a vector x ≅ s of input variable values; it can contain hidden variables = partially observable (POMDP); transition probability P(s'|s). Markov reward process (MRP) = MP + rewards r. Markov decision process (MDP) = MRP + actions a. Markov game = MDP with actions and rewards for > 1 agent.
Model Parameters: transition probabilities. Markov process: P(s(t+1)|s(t)). MDP: P(s(t+1)|s(t),a(t)) plus the expected reward E(r(t+1)|s(t),a(t)). Recall the basketball example, the hockey example, the grid example, and David Poole's demo.
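As a concrete illustration (not part of the slides), the model parameters of a small discrete MDP can be stored as nested tables; the states, actions, probabilities, and rewards below are invented for this sketch.

```python
import random

# Toy tabular MDP parameters (states/actions/values invented for illustration).
# P[s][a] maps a next state s' to its transition probability P(s'|s,a);
# R[s][a] stores the expected reward E(r(t+1) | s(t)=s, a(t)=a).
P = {
    "start": {"left":  {"start": 0.2, "goal": 0.8},
              "right": {"start": 0.9, "goal": 0.1}},
    "goal":  {"left":  {"goal": 1.0},
              "right": {"goal": 1.0}},
}
R = {
    "start": {"left": 0.0, "right": 1.0},
    "goal":  {"left": 0.0, "right": 0.0},
}

def sample_next_state(state, action, rng):
    """Draw s(t+1) ~ P(.|s(t), a(t)) from the transition table."""
    next_states = list(P[state][action].keys())
    probs = list(P[state][action].values())
    return rng.choices(next_states, weights=probs, k=1)[0]

rng = random.Random(0)
print(sample_next_state("start", "left", rng))
```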
DERIVED CONCEPTS
Returns and Discounting. A trajectory is a (possibly infinite) sequence s(0),a(0),r(0),s(1),a(1),r(1),...,s(n),a(n),r(n),... The return is the total sum of rewards. But if the trajectory is infinite, we have an infinite sum! Solution: weight by a discount factor γ between 0 and 1. Return = r(0) + γr(1) + γ²r(2) + ... We can also interpret 1−γ as the probability that the process ends at each step.
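As a quick sketch, the discounted return of a finite reward sequence can be computed as follows (the reward values are invented for illustration).

```python
# Discounted return: G = r(0) + gamma*r(1) + gamma^2*r(2) + ...
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0 + 0.81*2 = 2.62
```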
RL Concepts. These three functions (the policy π, the state-value function V, and the action-value function Q) can be computed by neural networks. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/mlbook/ch13.pdf
Policies and Values. A deterministic policy π is a function that maps states to actions, i.e. it tells us how to act. It can also be probabilistic, and it can be implemented using neural nets. Given a policy and an MDP, we have the expected return from using the policy at a state. Notation: Vπ(s).
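One way to picture this (a sketch, not the lecture's code): a deterministic policy is just a state-to-action lookup table, while a probabilistic policy assigns each state a distribution over actions; a neural net could play either role by taking the state vector as input. The toy states and actions below are invented.

```python
import random

# Deterministic policy: a lookup table from states to actions (toy example).
pi_det = {"start": "right", "goal": "left"}

# Probabilistic policy: a distribution over actions for each state.
pi_prob = {"start": {"left": 0.3, "right": 0.7},
           "goal":  {"left": 0.5, "right": 0.5}}

def act(state, rng):
    """Sample an action a ~ pi(.|state) from the probabilistic policy."""
    actions, probs = zip(*pi_prob[state].items())
    return rng.choices(actions, weights=probs, k=1)[0]

print(pi_det["start"], act("start", random.Random(0)))
```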
Optimal Policies. A policy π* is optimal if, for any other policy π and for all states s, Vπ*(s) ≥ Vπ(s). The value of the optimal policy is written as V*(s).
The Action-Value Function. Given a policy, the expected return from taking action a in state s (and following the policy thereafter) is denoted Qπ(s,a). Similarly, Q*(s,a) is the value of an action under the optimal policy. Grid example. Show Mitchell example.
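A sketch of Q as a lookup table, with the greedy readout π*(s) = argmax over a of Q*(s,a); the numerical values are invented for illustration.

```python
# Toy action-value table Q[s][a] (values invented for illustration).
Q = {"start": {"left": 0.4, "right": 1.3},
     "goal":  {"left": 0.0, "right": 0.0}}

def greedy_action(state):
    """If Q were Q*, the greedy choice argmax_a Q*(s,a) is an optimal action."""
    return max(Q[state], key=Q[state].get)

print(greedy_action("start"))  # "right"
```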
LEARNING
Two Learning Problems. Prediction: for a fixed policy, learn Vπ(s). Control: for a given MDP, learn V*(s) (the optimal policy). Variants exist for the Q-function.
Model-Based Learning. Pipeline: data → transition probabilities → value function, via dynamic programming. Bellman equation: Vπ(s) = Σa π(a|s) [ E(r|s,a) + γ Σs' P(s'|s,a) Vπ(s') ]. Developed for transition probabilities that are "nice": discrete, Gaussian, Poisson, ... Grid example.
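As a sketch of what the dynamic programming step can look like for a small tabular MDP (the states, rewards, and policy below are invented for illustration), iterative policy evaluation repeatedly applies the Bellman equation until the values stop changing.

```python
# Toy tabular MDP and policy (values invented for illustration).
P = {"s1": {"a": {"s1": 0.5, "s2": 0.5}}, "s2": {"a": {"s2": 1.0}}}
R = {"s1": {"a": 1.0}, "s2": {"a": 0.0}}
pi = {"s1": {"a": 1.0}, "s2": {"a": 1.0}}

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-6):
    """Sweep the Bellman equation V(s) <- sum_a pi(a|s)*(E(r|s,a) + gamma*sum_s' P(s'|s,a)*V(s'))
    until the largest change in any state value falls below tol."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(
                p_a * (R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
                for a, p_a in pi[s].items()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

print(policy_evaluation(P, R, pi))
```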
Model-Free Learning. Bypass estimating the transition probabilities. Why? Continuous state variables with no "nice" functional form. (How about using an LSTM/RNN dynamics model? Deep dynamic programming?)
Model-Free Learning. Directly learn the optimal policy π* (policy iteration). Directly learn the optimal value function V*. Directly learn the optimal action-value function Q*. All of these functions can be implemented in a neural network: NN learning = reinforcement learning.
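One standard way to learn Q* directly from experience (a sketch, not necessarily the method the slides present) is tabular Q-learning; the tiny chain environment below is invented for illustration.

```python
import random

def q_learning(env_step, env_reset, states, actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning: Q(s,a) <- Q(s,a) + alpha*[r + gamma*max_a' Q(s',a') - Q(s,a)]."""
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in actions} for s in states}
    for _ in range(episodes):
        s, done = env_reset(), False
        while not done:
            # epsilon-greedy exploration
            a = rng.choice(actions) if rng.random() < epsilon else max(Q[s], key=Q[s].get)
            s2, r, done = env_step(s, a)
            target = r if done else r + gamma * max(Q[s2].values())
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

# Toy chain environment (invented): reach "goal" from "start" by going right.
STATES, ACTIONS = ["start", "goal"], ["left", "right"]

def env_reset():
    return "start"

def env_step(s, a):
    """Returns (next_state, reward, done)."""
    if s == "start" and a == "right":
        return "goal", 1.0, True
    return "start", 0.0, False

Q = q_learning(env_step, env_reset, STATES, ACTIONS)
print(max(Q["start"], key=Q["start"].get))  # expected: "right"
```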
Model-Free Learning: What Are the Data? The data are simply a sequence of events s(0),a(0),r(0),s(1),a(1),r(1),..., which does not tell us expected values or optimal actions. Monte Carlo learning: to learn V, observe the return at the end of each episode. E.g., ChessBase gives the percentage of wins by White for any position.
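A sketch of Monte Carlo prediction under these assumptions: average the discounted returns observed after each first visit to a state, over many complete episodes. The episodes below are invented for illustration.

```python
from collections import defaultdict

def mc_value_estimate(episodes, gamma=0.9):
    """First-visit Monte Carlo: V(s) ~ average return observed after the first visit to s."""
    returns = defaultdict(list)
    for episode in episodes:          # episode = [(s, a, r), ...]
        g = 0.0
        visited = {}
        # sweep backwards so g accumulates the discounted return from each step;
        # overwriting keeps the return from the earliest (first) visit of each state
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            g = r + gamma * g
            visited[s] = g
        for s, g_first in visited.items():
            returns[s].append(g_first)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

episodes = [[("s1", "a", 0.0), ("s2", "a", 1.0)],
            [("s1", "a", 1.0)]]
print(mc_value_estimate(episodes))
```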
Temporal Difference Learning. Consistency idea: using the current model and the given data s(0),a(0),r(0),s(1),a(1),r(1),..., estimate both the value V(s(t)) at the current state and the next-step value V1(s(t)) = r(t) + γV(s(t+1)). Minimize the "error" [V1(s(t)) − V(s(t))]².
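A minimal sketch of the resulting tabular TD(0) update, assuming a fixed learning rate α; the transitions below are invented for illustration.

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: move V(s) toward the one-step target r + gamma*V(s'),
    which reduces the squared error [r + gamma*V(s') - V(s)]^2."""
    target = r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V

V = defaultdict(float)  # unseen states start at V(s) = 0
for (s, r, s_next) in [("s1", 0.0, "s2"), ("s2", 1.0, "s1"), ("s1", 0.0, "s2")]:
    td0_update(V, s, r, s_next)
print(dict(V))
```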
Model-Free Learning Example http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html