An Overview of Reinforcement Learning
Angela Yu
Cogs 118A
February 26, 2009
Outline
A formal framework for learning from reinforcement
  – Markov decision problems
  – Interactions between an agent and its environment
Dynamic programming as a formal solution
  – Policy iteration
  – Value iteration
Temporal difference methods as a practical solution
  – Actor-critic learning
  – Q-learning
Extensions
  – Exploration vs. exploitation
  – Representation and neural networks
RL as a Markov Decision Process
The current state x_t and action a_t form a Markov blanket for the reward r_t and the next state x_{t+1}.
(Figure: the agent-environment loop, labeled with action, state, and reward.)
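The Markov property referred to here can be written out explicitly (standard notation, not reproduced on the slide):

```latex
% Markov property: given the current state and action, the reward and next state
% are independent of the earlier history
\[ P(x_{t+1}, r_t \mid x_t, a_t, x_{t-1}, a_{t-1}, \ldots) \;=\; P(x_{t+1}, r_t \mid x_t, a_t) \]
```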
RL as a Markov Decision Process
Goal: find the optimal policy π: x → a that maximizes the expected return.
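The return itself is not reproduced in this text; a standard discounted form (following Sutton & Barto, 1998) is:

```latex
% Expected discounted return from state x under policy \pi (gamma is the discount factor)
\[ V^{\pi}(x) \;=\; \Big\langle \textstyle\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \;\Big|\; x_t = x \Big\rangle_{\pi}, \qquad 0 \le \gamma < 1 \]
% The goal: a policy that maximizes this return in every state
\[ \pi^{*} \;=\; \arg\max_{\pi} V^{\pi}(x) \quad \text{for all } x \]
```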
RL as a Markov Decision Process
Simple case: assume the transition and reward probabilities are known.
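With known transition and reward probabilities, the value of a fixed policy satisfies the Bellman self-consistency equation, which is linear in the state values (this standard form is a reconstruction, not the slide's own typesetting):

```latex
% Bellman equation for V^\pi, with known transitions P(x'|x,a) and mean rewards r(x,a);
% for a fixed policy this is a system of linear equations in the state values
\[ V^{\pi}(x) \;=\; \sum_{a} \pi(a \mid x)\Big[ r(x,a) \;+\; \gamma \sum_{x'} P(x' \mid x, a)\, V^{\pi}(x') \Big] \]
```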
Dynamic Programming I: Policy Iteration
Policy evaluation: solve for V^π, a system of linear equations in the state values.
Policy improvement: based on the state-action values implied by V^π, incrementally improve the policy by acting greedily.
Alternating evaluation and improvement is guaranteed to converge on (one set of) optimal values and an optimal policy.
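A minimal Python sketch of policy iteration for a small, tabular MDP; the arrays P and r, their shapes, and the function name are my own illustration, not from the slides:

```python
import numpy as np

def policy_iteration(P, r, gamma=0.9):
    """Policy iteration for a finite MDP (a sketch; array layout is an assumption):
      P[a, x, x'] -- transition probabilities, shape (A, S, S)
      r[x, a]     -- expected immediate reward, shape (S, A)
    """
    A, S, _ = P.shape
    policy = np.zeros(S, dtype=int)               # start from an arbitrary policy
    while True:
        # Policy evaluation: solve the linear system (I - gamma * P_pi) V = r_pi
        P_pi = P[policy, np.arange(S), :]         # (S, S) transitions under the current policy
        r_pi = r[np.arange(S), policy]            # (S,) rewards under the current policy
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

        # Policy improvement: act greedily with respect to the implied state-action values
        Q = r + gamma * np.einsum('axy,y->xa', P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V                      # policy is stable, hence optimal
        policy = new_policy
```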
Dynamic Programming II: Value Iteration
Repeatedly apply the Q-value (Bellman optimality) update to all state-action pairs.
Guaranteed to converge on (one set of) optimal values; the optimal policy is then read off greedily from the converged Q-values.
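A matching value-iteration sketch under the same assumed array layout:

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, tol=1e-8):
    """Value iteration over Q-values; same hypothetical P (A, S, S) and r (S, A)
    layout as in the policy-iteration sketch above."""
    A, S, _ = P.shape
    Q = np.zeros((S, A))
    while True:
        V = Q.max(axis=1)                                  # greedy value of each state
        Q_new = r + gamma * np.einsum('axy,y->xa', P, V)   # one Bellman-optimality backup
        if np.max(np.abs(Q_new - Q)) < tol:
            break
        Q = Q_new
    return Q.argmax(axis=1), Q                             # greedy policy read off the Q-values
```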
Temporal Difference Learning
Difficult (realistic) case: the transition and reward probabilities are unknown.
Solution: temporal difference (TD) learning, which estimates values directly from sampled experience.
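The TD error that drives the methods below has the standard one-step form (the learning-rate symbol ε is my choice of notation):

```latex
% Temporal difference (prediction) error: discrepancy between the sampled one-step
% outcome and the current value estimate, used to update that estimate incrementally
\[ \delta_t \;=\; r_t + \gamma V(x_{t+1}) - V(x_t), \qquad V(x_t) \leftarrow V(x_t) + \epsilon\, \delta_t \]
```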
Actor-Critic Learning (related to policy iteration)
Critic: improves the value estimate V(x) incrementally by stochastic gradient ascent; Monte Carlo samples stand in for the needed expectations, the update bootstraps on the current estimate V(x_{t+1}), and it is scaled by a learning rate and driven by the temporal difference error δ_t.
Actor: improves policy execution incrementally; a stochastic policy is adjusted by a delta rule, again from Monte Carlo samples and with a learning rate, using δ_t as the teaching signal.
Caveat: actor and critic depend on each other (the critic evaluates the actor's current policy), so convergence is not guaranteed in general.
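A minimal one-step actor-critic sketch in Python, assuming integer-indexed states, a tabular critic V, actor preferences M with a softmax policy, and a hypothetical gym-style env (reset/step); the exact update scalings on the slide are not reproduced here:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over action preferences."""
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def actor_critic_episode(env, V, M, gamma=0.99, alpha_v=0.1, alpha_m=0.1):
    """One episode of tabular actor-critic (a sketch, not the slides' exact update).

    Assumptions (mine, not from the slides):
      - states are integer indices; V[x] is the critic's value table,
        M[x, a] the actor's action preferences (softmax policy weights)
      - `env` is a hypothetical gym-style object: reset() -> x, step(a) -> (x', r, done)
    """
    x = env.reset()
    done = False
    while not done:
        probs = softmax(M[x])                      # stochastic (softmax) policy
        a = np.random.choice(len(probs), p=probs)  # sample an action
        x_next, r, done = env.step(a)

        # Critic: temporal difference error, bootstrapping on V(x_next)
        delta = r + (0.0 if done else gamma * V[x_next]) - V[x]
        V[x] += alpha_v * delta                    # incremental (stochastic gradient) value update

        # Actor: delta-rule update of the action preferences, with delta as the teaching signal
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0                      # d log pi(a|x) / d M[x, :]
        M[x] += alpha_m * delta * grad_log_pi

        x = x_next
    return V, M
```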
Actor-Critic Learning: Exploration vs. Exploitation
The randomness of the stochastic policy (e.g., its softmax temperature) trades off exploring new actions against exploiting the best-known ones. What is the best annealing schedule for it? There is no principled general answer.
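One common, illustrative way to realize an annealing schedule is to grow the inverse temperature of the softmax policy over time (the specific linear schedule below is an assumption, not a recommendation from the slides):

```python
import numpy as np

def annealed_softmax_action(preferences, t, beta0=0.1, growth=0.01):
    """Softmax ('Boltzmann') action selection whose inverse temperature grows with time t:
    exploratory early on, increasingly greedy later. The linear schedule
    beta_t = beta0 + growth * t is purely illustrative; the slides' point is that the
    best annealing schedule is an open, problem-dependent question."""
    beta_t = beta0 + growth * t
    p = np.exp(beta_t * (preferences - preferences.max()))
    p /= p.sum()
    return np.random.choice(len(preferences), p=p)
```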
Q-Learning (related to value iteration)
Estimates state-action values Q(x, a) directly from experience: Monte Carlo samples stand in for the unknown expectations, and the update to Q(x_t, a_t) bootstraps on the current estimate of the next state's value.
Convergence to the optimal values is proven (under standard conditions on the learning rate and exploration).
No explicit parameter in the update controls the explore/exploit trade-off; exploration must come from the behaviour policy, and the (greedy) policy is read off from the learned Q-values.
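A tabular Q-learning sketch under the same assumptions as the actor-critic example; note that exploration (here ε-greedy) has to be supplied by the behaviour policy, since the update itself has no explore/exploit parameter:

```python
import numpy as np

def q_learning_episode(env, Q, gamma=0.99, alpha=0.1, epsilon=0.1):
    """One episode of tabular Q-learning (a sketch; `env` is the same hypothetical
    gym-style interface as in the actor-critic sketch; Q[x, a] is a table of estimates)."""
    x = env.reset()
    done = False
    while not done:
        # Behaviour policy: epsilon-greedy (exploration supplied outside the update rule)
        if np.random.rand() < epsilon:
            a = np.random.randint(Q.shape[1])
        else:
            a = int(Q[x].argmax())
        x_next, r, done = env.step(a)

        # Update Q(x, a) toward the sampled one-step target, bootstrapping on the
        # greedy value of the next state (this is what makes the method off-policy)
        target = r + (0.0 if done else gamma * Q[x_next].max())
        Q[x, a] += alpha * (target - Q[x, a])

        x = x_next
    return Q
```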
Pros and Cons of TD Learning
TD learning is practically appealing
  – no representation of sequences of states & actions is needed
  – relatively simple computations
  – TD in the brain: dopamine signals resemble the temporal difference error δ_t
TD suffers from several disadvantages
  – local optima
  – can be (exponentially) slow to converge
  – actor-critic is not guaranteed to converge
  – no principled way to trade off exploration and exploitation
  – cannot easily deal with non-stationary environments
TD in the Brain
(Figures: dopamine neuron recordings illustrating the temporal difference error δ_t; see Schultz, Dayan, & Montague, 1997.)
Extensions to Basic TD Learning
A continuum of improvements is possible
  – more complete partial models of the effects of actions
  – estimating the expected reward ⟨r(x_t)⟩
  – representing & processing longer sequences of actions & states
  – faster learning & more efficient use of the agent's experience
  – parameterizing the value function (versus a look-up table), as sketched below
Timing and partial observability in reward prediction
  – the state is not (always) directly observable
  – delayed payoffs
  – reward-prediction only (no instrumental contingencies)
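As one concrete example of parameterizing the value function, a TD(0) update for a linear function approximator might look like this (a sketch under my own naming assumptions):

```python
import numpy as np

def linear_td0_update(w, phi_x, phi_x_next, r, gamma=0.99, alpha=0.01, done=False):
    """TD(0) update for a linearly parameterized value function V(x) = w . phi(x),
    replacing the look-up tables used above (the feature vectors phi_* and their
    design are the modeller's choice; names here are my own)."""
    v = w @ phi_x
    v_next = 0.0 if done else w @ phi_x_next
    delta = r + gamma * v_next - v          # the same TD error as in the tabular case
    return w + alpha * delta * phi_x        # semi-gradient step on the value weights
```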
References
Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
Bellman, R. E. (1957). Dynamic Programming. Princeton, NJ: Princeton University Press.
Daw, N. D., Courville, A. C., & Touretzky, D. S. (2003). Timing and partial observability in the dopamine system. In Advances in Neural Information Processing Systems 15. Cambridge, MA: MIT Press.
Dayan, P., & Watkins, C. J. C. H. (2001). Reinforcement learning. In Encyclopedia of Cognitive Science. London: Macmillan Press.
Dayan, P., & Abbott, L. F. (2001). Theoretical Neuroscience. Cambridge, MA: MIT Press.
Gittins, J. C. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B, 41, 148-177.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593-1599.