An Overview of Reinforcement Learning
Angela Yu, Cogs 118A, February 26, 2009
Outline
A formal framework for learning from reinforcement
– Markov decision problems
– Interactions between an agent and its environment
Dynamic programming as a formal solution
– Policy iteration
– Value iteration
Temporal difference methods as a practical solution
– Actor-critic learning
– Q-learning
Extensions
– Exploration vs. exploitation
– Representation and neural networks
(Speaker notes: summarize actor-critic learning & Q-learning, perhaps using the format from the other RL paper, introducing the methods first and then the two principal ideas; exploration vs. exploitation is analogous to the plasticity vs. stability problem.)
RL as a Markov Decision Process
Markov blanket for r_t and x_{t+1} [diagram: agent-environment loop with action, state, and reward]
RL as a Markov Decision Process
Goal: find the optimal policy π*: x → a by maximizing the expected return ⟨Σ_{τ≥0} γ^τ r_{t+τ}⟩ [diagram: action, state, reward]
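To make the return concrete, here is a minimal sketch (not from the slides) that evaluates the discounted return Σ_{τ≥0} γ^τ r_{t+τ} for a finite, made-up reward sequence:

```python
# Discounted return R_t = sum_{tau >= 0} gamma^tau * r_{t+tau}
# (illustrative only; the reward sequence below is made up)

def discounted_return(rewards, gamma=0.9):
    """Sum the discounted rewards of a finite reward sequence."""
    return sum((gamma ** tau) * r for tau, r in enumerate(rewards))

print(discounted_return([0.0, 0.0, 1.0, 0.0, 5.0]))  # 0.9**2 * 1 + 0.9**4 * 5
```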
RL as a Markov Decision Process
Simple case: assume the transition and reward probabilities are known [diagram: action, state, reward]
Dynamic Programming I: Policy Iteration
Policy evaluation (a system of linear equations):
V^π(x) = ⟨r(x, π(x))⟩ + γ Σ_{x'} P(x' | x, π(x)) V^π(x')
Policy improvement: based on the values of the state-action pairs,
Q^π(x, a) = ⟨r(x, a)⟩ + γ Σ_{x'} P(x' | x, a) V^π(x'),
incrementally improve the policy: π'(x) = argmax_a Q^π(x, a)
Guaranteed to converge on (one set of) optimal values V*, π*
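The following is a minimal tabular policy-iteration sketch, assuming the known dynamics are given as arrays P[x, a, x'] (transition probabilities) and R[x, a] (expected immediate rewards); the array names and the toy MDP at the bottom are illustrative, not from the slides:

```python
import numpy as np

# Tabular policy iteration for a small MDP with known dynamics.
# P[x, a, x'] = transition probability, R[x, a] = expected immediate reward.
# (Illustrative sketch; the toy MDP below is made up.)

def policy_iteration(P, R, gamma=0.9):
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve the linear system V = R_pi + gamma * P_pi V
        P_pi = P[np.arange(n_states), policy]          # (n_states, n_states)
        R_pi = R[np.arange(n_states), policy]          # (n_states,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to Q(x, a)
        Q = R + gamma * np.einsum('xay,y->xa', P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

# Toy 2-state, 2-action MDP (made-up numbers)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(policy_iteration(P, R))
```

Each sweep evaluates the current policy exactly (the linear solve) and then improves it greedily with respect to the resulting Q-values.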
Dynamic Programming II: Value Iteration
Q-value update: Q(x, a) ← ⟨r(x, a)⟩ + γ Σ_{x'} P(x' | x, a) max_{a'} Q(x', a')
Guaranteed to converge on (one set of) optimal values Q*
Policy: π*(x) = argmax_a Q*(x, a)
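A corresponding value-iteration sketch, under the same assumed P and R arrays as in the policy-iteration example (illustrative, not from the slides):

```python
import numpy as np

# Tabular value iteration:
#   Q(x, a) <- R(x, a) + gamma * sum_x' P(x'|x, a) * max_a' Q(x', a')
# (Illustrative sketch; assumes P[x, a, x'] and R[x, a] arrays as before.)

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        V = Q.max(axis=1)                                  # current greedy values
        Q_new = R + gamma * np.einsum('xay,y->xa', P, V)   # one backup sweep
        if np.abs(Q_new - Q).max() < tol:
            return Q_new.argmax(axis=1), Q_new             # greedy policy, Q-values
        Q = Q_new
```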
Temporal Difference Learning
Difficult (realistic) case: the transition and reward probabilities are unknown [diagram: action, state, reward]
Solution: temporal difference (TD) learning
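As a minimal sketch of learning from samples rather than from known probabilities, here is tabular TD(0) value prediction for a fixed policy; the env object with reset()/step() and the policy function are hypothetical placeholders:

```python
# TD(0) value prediction from sampled experience:
#   delta_t = r_t + gamma * V(x_{t+1}) - V(x_t)
#   V(x_t) <- V(x_t) + eps * delta_t
# (Illustrative sketch; `env` and `policy` are hypothetical placeholders.)

def td0(env, policy, n_episodes=100, gamma=0.9, eps=0.1):
    V = {}                                   # tabular value estimates
    for _ in range(n_episodes):
        x, done = env.reset(), False
        while not done:
            a = policy(x)
            x_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V.get(x_next, 0.0))
            delta = target - V.get(x, 0.0)   # temporal-difference error
            V[x] = V.get(x, 0.0) + eps * delta
            x = x_next
    return V
```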
Actor-Critic Learning (related to policy iteration)
Critic: improves value estimation incrementally (stochastic gradient ascent)
– temporal difference error: δ_t = r_t + γ V(x_{t+1}) − V(x_t)
– value update: V(x_t) ← V(x_t) + ε δ_t
– Monte Carlo samples stand in for the expectations ⟨·⟩; bootstrapping via V(x_{t+1}); ε is the learning rate
Actor: improves policy execution incrementally
– stochastic policy
– delta-rule update of the policy using the same δ_t (Monte Carlo samples, its own learning rate)
Mutual dependence between actor and critic – convergence?
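A minimal tabular actor-critic sketch, assuming a softmax actor over action preferences and the same hypothetical env interface as above; the particular parameterization (preferences m, inverse temperature beta) is an illustrative choice, not necessarily the slides' exact formulation:

```python
import numpy as np

# Tabular actor-critic (illustrative sketch).
# Critic: delta_t = r_t + gamma * V(x_{t+1}) - V(x_t);  V(x_t) += eps_c * delta_t
# Actor:  softmax policy over preferences m(x, a);      m(x_t, a_t) += eps_a * delta_t

def actor_critic(env, n_states, n_actions, n_episodes=500,
                 gamma=0.9, eps_c=0.1, eps_a=0.05, beta=1.0):
    V = np.zeros(n_states)                       # critic's value estimates
    m = np.zeros((n_states, n_actions))          # actor's action preferences
    rng = np.random.default_rng(0)
    for _ in range(n_episodes):
        x, done = env.reset(), False
        while not done:
            p = np.exp(beta * m[x]); p /= p.sum()   # stochastic (softmax) policy
            a = rng.choice(n_actions, p=p)
            x_next, r, done = env.step(a)           # hypothetical env interface
            delta = r + (0.0 if done else gamma * V[x_next]) - V[x]
            V[x] += eps_c * delta                   # critic: value update
            m[x, a] += eps_a * delta                # actor: delta-rule update
            x = x_next
    return V, m
```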
Actor-Critic Learning
Exploration vs. exploitation: the stochasticity of the policy controls the trade-off. What is the best annealing schedule?
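One common way to expose this trade-off is softmax action selection with an inverse temperature β that is annealed over trials; the linear schedule below is only one possible choice (the slide's question of the best schedule is left open):

```python
import numpy as np

# Softmax action selection with an annealed inverse temperature beta:
# low beta -> near-uniform exploration, high beta -> greedy exploitation.
# (Illustrative; the linear annealing schedule below is just one possibility.)

def softmax_action(values, beta, rng):
    p = np.exp(beta * (values - values.max()))   # subtract max for numerical stability
    p /= p.sum()
    return rng.choice(len(values), p=p)

rng = np.random.default_rng(0)
values = np.array([0.1, 0.5, 0.2])               # made-up action values
for trial in range(1000):
    beta = 0.1 + 0.01 * trial                    # anneal toward exploitation
    a = softmax_action(values, beta, rng)
```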
Q-Learning (related to value iteration)
State-action value estimation: Q(x_t, a_t) ← Q(x_t, a_t) + ε [r_t + γ max_{a'} Q(x_{t+1}, a') − Q(x_t, a_t)]
– Monte Carlo samples stand in for the expectations ⟨·⟩; bootstrapping via Q
– proven convergence
– no explicit parameter to control explore/exploit
Policy: π(x) = argmax_a Q(x, a)
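A minimal tabular Q-learning sketch, again with a hypothetical env interface; the ε-greedy behavior policy is an added, standard choice, since Q-learning itself has no explicit exploration parameter:

```python
import numpy as np

# Tabular Q-learning (illustrative sketch; `env` is a hypothetical environment):
#   Q(x_t, a_t) <- Q(x_t, a_t) + eps * [r_t + gamma * max_a' Q(x_{t+1}, a') - Q(x_t, a_t)]

def q_learning(env, n_states, n_actions, n_episodes=500,
               gamma=0.9, eps=0.1, explore=0.1):
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)
    for _ in range(n_episodes):
        x, done = env.reset(), False
        while not done:
            # Behavior policy: epsilon-greedy (an added, standard choice).
            a = rng.integers(n_actions) if rng.random() < explore else Q[x].argmax()
            x_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * Q[x_next].max())
            Q[x, a] += eps * (target - Q[x, a])
            x = x_next
    return Q.argmax(axis=1), Q      # greedy policy and learned Q-values
```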
Pros and Cons of TD Learning
TD learning is practically appealing:
– no representation of sequences of states & actions
– relatively simple computations
– TD in the brain: dopamine signals resemble the temporal difference error δ_t
TD suffers from several disadvantages:
– local optima
– can be (exponentially) slow to converge
– actor-critic is not guaranteed to converge
– no principled way to trade off exploration and exploitation
– cannot easily deal with non-stationary environments
TD in the Brain
TD in the Brain (continued)
Extensions to basic TD Learning
A continuum of improvements possible
– more complete partial models of the effects of actions
– estimate expected reward ⟨r(x_t)⟩
– representing & processing longer sequences of actions & states
– faster learning & more efficient use of the agent's experiences
– parameterize the value function (versus a look-up table); see the sketch after this list
Timing and partial observability in reward prediction
– state not (always) directly observable
– delayed payoffs
– reward-prediction only (no instrumental contingencies)
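As a sketch of the "parameterize the value function" extension, here is TD(0) with a linear function approximator V(x) = w·φ(x) instead of a look-up table; the feature map phi and the env/policy objects are hypothetical placeholders:

```python
import numpy as np

# TD(0) with a parameterized (linear) value function instead of a look-up table:
#   V(x) = w . phi(x),   w <- w + eps * delta_t * phi(x_t)
# (Illustrative sketch; `env`, `policy`, and `phi` are hypothetical placeholders.)

def td0_linear(env, policy, phi, n_features, n_episodes=200, gamma=0.9, eps=0.01):
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        x, done = env.reset(), False
        while not done:
            a = policy(x)
            x_next, r, done = env.step(a)
            v = w @ phi(x)
            v_next = 0.0 if done else w @ phi(x_next)
            delta = r + gamma * v_next - v       # temporal-difference error
            w += eps * delta * phi(x)            # gradient-style parameter update
            x = x_next
    return w
```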
References
Sutton, RS & Barto, AG (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
Bellman, RE (1957). Dynamic Programming. Princeton, NJ: Princeton University Press.
Daw, ND, Courville, AC, & Touretzky, DS (2003). Timing and partial observability in the dopamine system. In Neural Information Processing Systems 15. Cambridge, MA: MIT Press.
Dayan, P & Watkins, CJCH (2001). Reinforcement learning. Encyclopedia of Cognitive Science. London, England: MacMillan Press.
Dayan, P & Abbott, LF (2001). Theoretical Neuroscience. Cambridge, MA: MIT Press.
Gittins, JC (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society B, 41.
Schultz, W, Dayan, P, & Montague, PR (1997). A neural substrate of prediction and reward. Science, 275.