An Overview of Reinforcement Learning
Angela Yu
Cogs 118A
February 26, 2009
Outline
A formal framework for learning from reinforcement
  – Markov decision problems
  – Interactions between an agent and its environment
Dynamic programming as a formal solution
  – Policy iteration
  – Value iteration
Temporal difference methods as a practical solution
  – Actor-critic learning
  – Q-learning
Extensions
  – Exploration vs. exploitation
  – Representation and neural networks
RL as a Markov Decision Process
The current state x_t and action a_t form a Markov blanket for the reward r_t and the next state x_{t+1}.
(Figure: the agent-environment loop, labeled with action, state, and reward.)
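The Markov property referred to here can be written out explicitly (standard notation, not reproduced on the slide):

```latex
% Markov property: given the current state and action, the reward and next state
% are independent of the earlier history
\[ P(x_{t+1}, r_t \mid x_t, a_t, x_{t-1}, a_{t-1}, \ldots) \;=\; P(x_{t+1}, r_t \mid x_t, a_t) \]
```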
RL as a Markov Decision Process
Goal: find the optimal policy π: x → a that maximizes the expected return.
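The return itself is not reproduced in this text; a standard discounted form (following Sutton & Barto, 1998) is:

```latex
% Expected discounted return from state x under policy \pi (gamma is the discount factor)
\[ V^{\pi}(x) \;=\; \Big\langle \textstyle\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \;\Big|\; x_t = x \Big\rangle_{\pi}, \qquad 0 \le \gamma < 1 \]
% The goal: a policy that maximizes this return in every state
\[ \pi^{*} \;=\; \arg\max_{\pi} V^{\pi}(x) \quad \text{for all } x \]
```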
RL as a Markov Decision Process
Simple case: assume the transition and reward probabilities are known.
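With known transition and reward probabilities, the value of a fixed policy satisfies the Bellman self-consistency equation, which is linear in the state values (this standard form is a reconstruction, not the slide's own typesetting):

```latex
% Bellman equation for V^\pi, with known transitions P(x'|x,a) and mean rewards r(x,a);
% for a fixed policy this is a system of linear equations in the state values
\[ V^{\pi}(x) \;=\; \sum_{a} \pi(a \mid x)\Big[ r(x,a) \;+\; \gamma \sum_{x'} P(x' \mid x, a)\, V^{\pi}(x') \Big] \]
```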
Dynamic Programming I: Policy Iteration
Policy evaluation: solve for V^π, a system of linear equations in the state values.
Policy improvement: based on the state-action values implied by V^π, incrementally improve the policy by acting greedily.
Alternating evaluation and improvement is guaranteed to converge on (one set of) optimal values and an optimal policy.
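A minimal Python sketch of policy iteration for a small, tabular MDP; the arrays P and r, their shapes, and the function name are my own illustration, not from the slides:

```python
import numpy as np

def policy_iteration(P, r, gamma=0.9):
    """Policy iteration for a finite MDP (a sketch; array layout is an assumption):
      P[a, x, x'] -- transition probabilities, shape (A, S, S)
      r[x, a]     -- expected immediate reward, shape (S, A)
    """
    A, S, _ = P.shape
    policy = np.zeros(S, dtype=int)               # start from an arbitrary policy
    while True:
        # Policy evaluation: solve the linear system (I - gamma * P_pi) V = r_pi
        P_pi = P[policy, np.arange(S), :]         # (S, S) transitions under the current policy
        r_pi = r[np.arange(S), policy]            # (S,) rewards under the current policy
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

        # Policy improvement: act greedily with respect to the implied state-action values
        Q = r + gamma * np.einsum('axy,y->xa', P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V                      # policy is stable, hence optimal
        policy = new_policy
```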
Dynamic Programming II: Value Iteration
Repeatedly apply the Q-value (Bellman optimality) update to all state-action pairs.
Guaranteed to converge on (one set of) optimal values; the optimal policy is then read off greedily from the converged Q-values.
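A matching value-iteration sketch under the same assumed array layout:

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, tol=1e-8):
    """Value iteration over Q-values; same hypothetical P (A, S, S) and r (S, A)
    layout as in the policy-iteration sketch above."""
    A, S, _ = P.shape
    Q = np.zeros((S, A))
    while True:
        V = Q.max(axis=1)                                  # greedy value of each state
        Q_new = r + gamma * np.einsum('axy,y->xa', P, V)   # one Bellman-optimality backup
        if np.max(np.abs(Q_new - Q)) < tol:
            break
        Q = Q_new
    return Q.argmax(axis=1), Q                             # greedy policy read off the Q-values
```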
Temporal Difference Learning
Difficult (realistic) case: the transition and reward probabilities are unknown.
Solution: temporal difference (TD) learning, which estimates values directly from sampled experience.
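The TD error that drives the methods below has the standard one-step form (the learning-rate symbol ε is my choice of notation):

```latex
% Temporal difference (prediction) error: discrepancy between the sampled one-step
% outcome and the current value estimate, used to update that estimate incrementally
\[ \delta_t \;=\; r_t + \gamma V(x_{t+1}) - V(x_t), \qquad V(x_t) \leftarrow V(x_t) + \epsilon\, \delta_t \]
```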
Actor-Critic Learning (related to policy iteration)
Critic: improves the value estimate V(x) incrementally by stochastic gradient ascent; Monte Carlo samples stand in for the needed expectations, the update bootstraps on the current estimate V(x_{t+1}), and it is scaled by a learning rate and driven by the temporal difference error δ_t.
Actor: improves policy execution incrementally; a stochastic policy is adjusted by a delta rule, again from Monte Carlo samples and with a learning rate, using δ_t as the teaching signal.
Caveat: actor and critic depend on each other (the critic evaluates the actor's current policy), so convergence is not guaranteed in general.
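A minimal one-step actor-critic sketch in Python, assuming integer-indexed states, a tabular critic V, actor preferences M with a softmax policy, and a hypothetical gym-style env (reset/step); the exact update scalings on the slide are not reproduced here:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over action preferences."""
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def actor_critic_episode(env, V, M, gamma=0.99, alpha_v=0.1, alpha_m=0.1):
    """One episode of tabular actor-critic (a sketch, not the slides' exact update).

    Assumptions (mine, not from the slides):
      - states are integer indices; V[x] is the critic's value table,
        M[x, a] the actor's action preferences (softmax policy weights)
      - `env` is a hypothetical gym-style object: reset() -> x, step(a) -> (x', r, done)
    """
    x = env.reset()
    done = False
    while not done:
        probs = softmax(M[x])                      # stochastic (softmax) policy
        a = np.random.choice(len(probs), p=probs)  # sample an action
        x_next, r, done = env.step(a)

        # Critic: temporal difference error, bootstrapping on V(x_next)
        delta = r + (0.0 if done else gamma * V[x_next]) - V[x]
        V[x] += alpha_v * delta                    # incremental (stochastic gradient) value update

        # Actor: delta-rule update of the action preferences, with delta as the teaching signal
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0                      # d log pi(a|x) / d M[x, :]
        M[x] += alpha_m * delta * grad_log_pi

        x = x_next
    return V, M
```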
Actor-Critic Learning: Exploration vs. Exploitation
The randomness of the stochastic policy (e.g., its softmax temperature) trades off exploring new actions against exploiting the best-known ones. What is the best annealing schedule for it? There is no principled general answer.
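One common, illustrative way to realize an annealing schedule is to grow the inverse temperature of the softmax policy over time (the specific linear schedule below is an assumption, not a recommendation from the slides):

```python
import numpy as np

def annealed_softmax_action(preferences, t, beta0=0.1, growth=0.01):
    """Softmax ('Boltzmann') action selection whose inverse temperature grows with time t:
    exploratory early on, increasingly greedy later. The linear schedule
    beta_t = beta0 + growth * t is purely illustrative; the slides' point is that the
    best annealing schedule is an open, problem-dependent question."""
    beta_t = beta0 + growth * t
    p = np.exp(beta_t * (preferences - preferences.max()))
    p /= p.sum()
    return np.random.choice(len(preferences), p=p)
```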
Q-Learning (related to value iteration)
Estimates state-action values Q(x, a) directly from experience: Monte Carlo samples stand in for the unknown expectations, and the update to Q(x_t, a_t) bootstraps on the current estimate of the next state's value.
Convergence to the optimal values is proven (under standard conditions on the learning rate and exploration).
No explicit parameter in the update controls the explore/exploit trade-off; exploration must come from the behaviour policy, and the (greedy) policy is read off from the learned Q-values.
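A tabular Q-learning sketch under the same assumptions as the actor-critic example; note that exploration (here ε-greedy) has to be supplied by the behaviour policy, since the update itself has no explore/exploit parameter:

```python
import numpy as np

def q_learning_episode(env, Q, gamma=0.99, alpha=0.1, epsilon=0.1):
    """One episode of tabular Q-learning (a sketch; `env` is the same hypothetical
    gym-style interface as in the actor-critic sketch; Q[x, a] is a table of estimates)."""
    x = env.reset()
    done = False
    while not done:
        # Behaviour policy: epsilon-greedy (exploration supplied outside the update rule)
        if np.random.rand() < epsilon:
            a = np.random.randint(Q.shape[1])
        else:
            a = int(Q[x].argmax())
        x_next, r, done = env.step(a)

        # Update Q(x, a) toward the sampled one-step target, bootstrapping on the
        # greedy value of the next state (this is what makes the method off-policy)
        target = r + (0.0 if done else gamma * Q[x_next].max())
        Q[x, a] += alpha * (target - Q[x, a])

        x = x_next
    return Q
```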
Pros and Cons of TD Learning
TD learning is practically appealing
  – no representation of sequences of states & actions is needed
  – relatively simple computations
  – TD in the brain: dopamine signals resemble the temporal difference error δ_t
TD suffers from several disadvantages
  – local optima
  – can be (exponentially) slow to converge
  – actor-critic is not guaranteed to converge
  – no principled way to trade off exploration and exploitation
  – cannot easily deal with non-stationary environments
TD in the Brain
(Figures: dopamine neuron recordings illustrating the temporal difference error δ_t; see Schultz, Dayan, & Montague, 1997.)
Extensions to Basic TD Learning
A continuum of improvements is possible
  – more complete partial models of the effects of actions
  – estimating the expected reward ⟨r(x_t)⟩
  – representing & processing longer sequences of actions & states
  – faster learning & more efficient use of the agent's experience
  – parameterizing the value function (versus a look-up table), as sketched below
Timing and partial observability in reward prediction
  – the state is not (always) directly observable
  – delayed payoffs
  – reward-prediction only (no instrumental contingencies)
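As one concrete example of parameterizing the value function, a TD(0) update for a linear function approximator might look like this (a sketch under my own naming assumptions):

```python
import numpy as np

def linear_td0_update(w, phi_x, phi_x_next, r, gamma=0.99, alpha=0.01, done=False):
    """TD(0) update for a linearly parameterized value function V(x) = w . phi(x),
    replacing the look-up tables used above (the feature vectors phi_* and their
    design are the modeller's choice; names here are my own)."""
    v = w @ phi_x
    v_next = 0.0 if done else w @ phi_x_next
    delta = r + gamma * v_next - v          # the same TD error as in the tabular case
    return w + alpha * delta * phi_x        # semi-gradient step on the value weights
```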
References
Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
Bellman, R. E. (1957). Dynamic Programming. Princeton, NJ: Princeton University Press.
Daw, N. D., Courville, A. C., & Touretzky, D. S. (2003). Timing and partial observability in the dopamine system. In Advances in Neural Information Processing Systems 15. Cambridge, MA: MIT Press.
Dayan, P., & Watkins, C. J. C. H. (2001). Reinforcement learning. In Encyclopedia of Cognitive Science. London: Macmillan Press.
Dayan, P., & Abbott, L. F. (2001). Theoretical Neuroscience. Cambridge, MA: MIT Press.
Gittins, J. C. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B, 41, 148-177.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593-1599.