Reinforcement Learning: Learning from Interaction

Reinforcement Learning: Learning from Interaction
Winter School on Machine Learning and Vision, 2010
B. Ravindran
Many slides adapted from Sutton and Barto

Learning to Control
So far we have looked at two models of learning:
Supervised: classification, regression, etc.
Unsupervised: clustering, etc.
How did you learn to cycle? Neither of the above. Trial and error! Falling down hurts!

Can You hear me now? Can You hear me now? Can You hear me now?

Reinforcement Learning
A trial-and-error learning paradigm: rewards and punishments
Not just an algorithm but a new paradigm in itself
Learn about a system (behaviour control) from minimal feedback
Inspired by behavioural psychology

RL Framework
[Figure: the agent acts on a stochastic environment and receives the resulting state and a scalar evaluation]
Learn from close interaction with a stochastic environment
Noisy, delayed, scalar evaluation
Maximize a measure of long-term performance
Situated in a broad literature: AI, stochastic systems, connections to OR

Not Supervised Learning!
[Figure: supervised setup, with an agent mapping inputs to outputs and receiving a target and an error signal]
Very sparse "supervision": no target output provided, no error gradient information available
Action chooses the next state
Explore to estimate the gradient: trial-and-error learning

Not Unsupervised Learning
[Figure: agent mapping an input to an activation, with an evaluation signal]
Sparse "supervision" available
Pattern detection not the primary goal

TD-Gammon (Tesauro 1992, 1994, 1995, ...)
White has just rolled a 5 and a 2, so he can move one of his pieces 5 steps and one (possibly the same piece) 2 steps
Objective is to advance all pieces to points 19-24
Hitting
30 pieces, 24 locations implies an enormous number of configurations
Effective branching factor of 400

The Agent-Environment Interface
[Figure: at each step the agent receives state s_t, selects action a_t, and receives reward r_{t+1} and next state s_{t+1}, producing the sequence s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, ...]
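The interaction on this slide can be written as a short loop. A minimal sketch in Python, assuming a hypothetical env object with reset() and step(action) methods and an agent with select_action(state) and update(...); these names are illustrative, not from the slides:

    # One episode of agent-environment interaction:
    # at step t the agent sees state s_t, emits action a_t,
    # then receives reward r_{t+1} and next state s_{t+1}.
    def run_episode(env, agent, max_steps=1000):
        state = env.reset()                      # s_0
        total_reward = 0.0
        for t in range(max_steps):
            action = agent.select_action(state)          # a_t
            next_state, reward, done = env.step(action)  # s_{t+1}, r_{t+1}
            agent.update(state, action, reward, next_state)
            total_reward += reward
            state = next_state
            if done:                             # terminal state reached
                break
        return total_reward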

The Agent Learns a Policy
A policy π is a mapping from states to action probabilities: π(s, a) = probability that a_t = a when s_t = s
Reinforcement learning methods specify how the agent changes its policy as a result of experience.
Roughly, the agent's goal is to get as much reward as it can over the long run.

Goals and Rewards
Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible.
A goal should specify what we want to achieve, not how we want to achieve it.
A goal must be outside the agent's direct control, and thus outside the agent.
The agent must be able to measure success: explicitly, and frequently during its lifespan.

Returns
Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.
The return is the total reward collected from step t onwards: R_t = r_{t+1} + r_{t+2} + ... + r_T, where T is a final time step at which a terminal state is reached, ending an episode.
For continuing tasks one uses the discounted return R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ..., with discount factor 0 ≤ γ < 1.
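As a small worked example (not from the slides), the return can be computed directly from an observed reward sequence; gamma = 1 gives the plain episodic sum and gamma < 1 the discounted return:

    def compute_return(rewards, gamma=1.0):
        """Return R_t = r_{t+1} + gamma*r_{t+2} + ... from the rewards
        observed after time t; gamma = 1 gives the episodic sum."""
        g = 0.0
        for r in reversed(rewards):   # accumulate backwards: g <- r + gamma*g
            g = r + gamma * g
        return g

    print(compute_return([0, 0, 1]))         # 1.0 (undiscounted episodic return)
    print(compute_return([1, 1, 1], 0.9))    # 1 + 0.9 + 0.81 = 2.71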

The Markov Property
"The state" at step t means whatever information is available to the agent at step t about its environment.
The state can include immediate "sensations", highly processed sensations, and structures built up over time from sequences of sensations.
Ideally, a state should summarize past sensations so as to retain all "essential" information, i.e., it should have the Markov property: Pr(s_{t+1} = s', r_{t+1} = r | s_t, a_t, r_t, ..., s_0, a_0) = Pr(s_{t+1} = s', r_{t+1} = r | s_t, a_t)

Markov Decision Processes
If a reinforcement learning task has the Markov property, it is basically a Markov Decision Process (MDP).
If the state and action sets are finite, it is a finite MDP.
To define a finite MDP, you need to give:
state and action sets
one-step "dynamics" defined by transition probabilities: P^a_{ss'} = Pr(s_{t+1} = s' | s_t = s, a_t = a)
reward expectations: R^a_{ss'} = E[r_{t+1} | s_t = s, a_t = a, s_{t+1} = s']

An Example Finite MDP: Recycling Robot
At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
Decisions are made on the basis of the current energy level: high, low.
Reward = number of cans collected

Recycling Robot MDP
[Figure: transition graph over the states high and low, with actions search, wait, and recharge, annotated with transition probabilities and expected rewards]
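To make the one-step dynamics concrete, the recycling robot can be tabulated as transition and reward dictionaries. A sketch with made-up parameter values: alpha, beta, r_search, r_wait and the rescue penalty of -3 are illustrative assumptions, not numbers taken from the slides:

    # P[(s, a)] maps next states to probabilities; R[(s, a, s2)] is the
    # expected reward for that transition.
    alpha, beta = 0.9, 0.6          # chance the battery level survives a search
    r_search, r_wait = 2.0, 1.0     # expected cans collected per step
    STATES = ["high", "low"]

    P = {
        ("high", "search"):   {"high": alpha, "low": 1 - alpha},
        ("high", "wait"):     {"high": 1.0},
        ("low",  "search"):   {"low": beta, "high": 1 - beta},  # 1-beta: rescued
        ("low",  "wait"):     {"low": 1.0},
        ("low",  "recharge"): {"high": 1.0},
    }
    R = {
        ("high", "search", "high"):   r_search,
        ("high", "search", "low"):    r_search,
        ("high", "wait", "high"):     r_wait,
        ("low",  "search", "low"):    r_search,
        ("low",  "search", "high"):   -3.0,   # ran out of power, had to be rescued
        ("low",  "wait", "low"):      r_wait,
        ("low",  "recharge", "high"): 0.0,
    }
    ACTIONS = {s: sorted({a for (st, a) in P if st == s}) for s in STATES}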

Value Functions
The value of a state is the expected return starting from that state; it depends on the agent's policy: V^π(s) = E_π[R_t | s_t = s]
The value of a state-action pair is the expected return starting from that state, taking that action, and thereafter following π: Q^π(s, a) = E_π[R_t | s_t = s, a_t = a]

Bellman Equation for a Policy π
The basic idea: R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = r_{t+1} + γ R_{t+1}
So: V^π(s) = E_π[r_{t+1} + γ V^π(s_{t+1}) | s_t = s]
Or, without the expectation operator: V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V^π(s')]

Optimal Value Functions
For finite MDPs, policies can be partially ordered: π ≥ π' if and only if V^π(s) ≥ V^π'(s) for all s
There is always at least one policy (and possibly many) that is better than or equal to all the others. This is an optimal policy. We denote them all π*.
Optimal policies share the same optimal state-value function: V*(s) = max_π V^π(s) for all s

Bellman Optimality Equation for V*
The value of a state under an optimal policy must equal the expected return for the best action from that state: V*(s) = max_a E[r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a] = max_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V*(s')]
V* is the unique solution of this system of nonlinear equations.

Bellman Optimality Equation for Q*
Similarly, the optimal value of a state-action pair is the expected return for taking that action and thereafter following the optimal policy: Q*(s, a) = E[r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a] = Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ max_{a'} Q*(s', a')]
Q* is the unique solution of this system of nonlinear equations.

Dynamic Programming
DP is the solution method of choice for MDPs
Requires complete knowledge of the system dynamics (P and R)
Expensive and often not practical: curse of dimensionality
Guaranteed to converge!
RL methods: online, approximate dynamic programming
No knowledge of P and R
Sample trajectories through the state space
Some theoretical convergence analysis available

Policy Evaluation
Policy evaluation: for a given policy π, compute the state-value function V^π
Recall the Bellman equation: V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V^π(s')]
Turn it into an iterative update: V_{k+1}(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V_k(s')], a full backup applied to every state and repeated until the values stop changing.
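A sketch of the resulting algorithm, assuming the P/R dictionary format of the recycling-robot sketch above and a stochastic policy given as policy[s] = {action: probability}:

    def policy_evaluation(policy, P, R, states, gamma=0.9, theta=1e-6):
        """Sweep V(s) <- sum_a pi(s,a) sum_s2 P[s,a,s2]*(R[s,a,s2] + gamma*V(s2))
        over all states until the largest change falls below theta."""
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v_new = 0.0
                for a, pi_sa in policy[s].items():
                    for s2, p in P[(s, a)].items():
                        v_new += pi_sa * p * (R[(s, a, s2)] + gamma * V[s2])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                return V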

Policy Improvement
Suppose we have computed V^π for a deterministic policy π.
For a given state s, would it be better to do an action a ≠ π(s)?
The value of doing a in state s and thereafter following π is Q^π(s, a) = Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V^π(s')]
It is better to switch to action a for state s if and only if Q^π(s, a) > V^π(s)

Policy Improvement Cont.
Do this for all states to get a new policy π' that is greedy with respect to V^π: π'(s) = argmax_a Q^π(s, a) = argmax_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V^π(s')]
Then V^π'(s) ≥ V^π(s) for all s

Policy Improvement Cont.
What if the new policy is no better, i.e., V^π' = V^π? Then for all s, V^π'(s) = max_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V^π'(s')]
But this is the Bellman optimality equation, so V^π' = V* and both π and π' are optimal policies.

Policy Iteration
π_0 → V^π_0 → π_1 → V^π_1 → ... → π* → V*
Alternate policy evaluation (computing V^π) and policy improvement ("greedification").
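A sketch of the full loop, reusing the policy_evaluation routine and the P/R format assumed above; greedy_policy is the "greedification" step:

    def greedy_policy(V, P, R, states, actions, gamma=0.9):
        """Pick argmax_a sum_s2 P[s,a,s2]*(R[s,a,s2] + gamma*V(s2)) in each state."""
        policy = {}
        for s in states:
            q = {a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                        for s2, p in P[(s, a)].items())
                 for a in actions[s]}
            best = max(q, key=q.get)
            policy[s] = {best: 1.0}      # deterministic greedy policy
        return policy

    def policy_iteration(P, R, states, actions, gamma=0.9):
        policy = {s: {actions[s][0]: 1.0} for s in states}   # arbitrary start
        while True:
            V = policy_evaluation(policy, P, R, states, gamma)            # evaluate
            new_policy = greedy_policy(V, P, R, states, actions, gamma)   # improve
            if new_policy == policy:     # policy stable: it is optimal
                return policy, V
            policy = new_policy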

Value Iteration
Recall the Bellman optimality equation: V*(s) = max_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V*(s')]
We can convert it to a full value-iteration backup: V_{k+1}(s) = max_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V_k(s')]
Iterate until "convergence".
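The corresponding code sketch, with the same assumed P/R format:

    def value_iteration(P, R, states, actions, gamma=0.9, theta=1e-6):
        """Iterate V(s) <- max_a sum_s2 P[s,a,s2]*(R[s,a,s2] + gamma*V(s2))."""
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v_new = max(
                    sum(p * (R[(s, a, s2)] + gamma * V[s2])
                        for s2, p in P[(s, a)].items())
                    for a in actions[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                return V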

Generalized Policy Iteration
Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity.
[Figure: a geometric metaphor for the convergence of GPI, with the value function and the greedy policy pulling toward V* and π*]

Dynamic Programming
[Figure: DP backup diagram: V(s_t) is updated from the full tree of possible next states, using a model of the environment; leaves marked T are terminal states]

Simplest TD Method
[Figure: TD(0) backup diagram: V(s_t) is updated from a single sampled successor state; leaves marked T are terminal states]
TD(0) update: V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) - V(s_t)]
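A sketch of tabular TD(0) prediction, assuming the same hypothetical env interface as earlier (reset(), step(action) returning next_state, reward, done) and a behaviour policy given as a function policy(state):

    from collections import defaultdict

    def td0_prediction(env, policy, episodes=1000, alpha=0.1, gamma=0.9):
        """V(s_t) <- V(s_t) + alpha*(r_{t+1} + gamma*V(s_{t+1}) - V(s_t))."""
        V = defaultdict(float)
        for _ in range(episodes):
            state = env.reset()
            done = False
            while not done:
                action = policy(state)
                next_state, reward, done = env.step(action)
                target = reward + (0.0 if done else gamma * V[next_state])
                V[state] += alpha * (target - V[state])   # the TD(0) backup
                state = next_state
        return V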

RL Algorithms – Prediction
Policy evaluation (the prediction problem): for a given policy, compute the state-value function.
No knowledge of P and R, but access to the real system, or a "sample" model, is assumed.
Uses "bootstrapping" and sampling.

Advantages of TD
TD methods do not require a model of the environment, only experience
TD methods can be fully incremental: you can learn before knowing the final outcome
Less memory, less peak computation
You can learn without the final outcome, from incomplete sequences

RL Algorithms – Control
SARSA: Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)] (on-policy: a_{t+1} is the action actually taken next)
Q-learning: Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t)] (off-policy: backs up the greedy action)
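A sketch of tabular Q-learning with an epsilon-greedy behaviour policy, again assuming the hypothetical env interface and a single action set shared by all states; replacing the max over next actions with Q[(next_state, next_action)] for the action actually selected next turns this into SARSA:

    import random
    from collections import defaultdict

    def q_learning(env, actions, episodes=5000, alpha=0.1, gamma=0.9, eps=0.1):
        """Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a2 Q(s2,a2) - Q(s,a))."""
        Q = defaultdict(float)

        def eps_greedy(state):
            if random.random() < eps:
                return random.choice(actions)                    # explore
            return max(actions, key=lambda a: Q[(state, a)])     # exploit

        for _ in range(episodes):
            state = env.reset()
            done = False
            while not done:
                action = eps_greedy(state)
                next_state, reward, done = env.step(action)
                best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
                Q[(state, action)] += alpha * (reward + gamma * best_next
                                               - Q[(state, action)])
                state = next_state
        return Q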

Cliffwalking
[Figure: cliff-walking gridworld comparing SARSA and Q-learning with ε-greedy action selection, ε = 0.1]

Actor-Critic Methods
Explicit representation of the policy as well as the value function
Minimal computation to select actions
Can learn an explicitly stochastic policy
Can put constraints on policies
Appealing as psychological and neural models

Actor-Critic Details
The critic's TD error is used to evaluate the action just taken: δ_t = r_{t+1} + γ V(s_{t+1}) - V(s_t)
If actions are generated from preferences p(s, a), e.g. by a softmax π(s, a) = e^{p(s,a)} / Σ_b e^{p(s,b)}, then the actor updates the preferences in the direction of the TD error: p(s_t, a_t) ← p(s_t, a_t) + β δ_t
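A sketch of one tabular actor-critic update, assuming softmax action preferences and the TD error as the critic's evaluation; the names and step sizes are illustrative:

    import math, random
    from collections import defaultdict

    prefs = defaultdict(float)   # actor: action preferences p(s, a)
    V = defaultdict(float)       # critic: state values

    def softmax_policy(state, actions):
        """Sample an action with probability proportional to exp(p(s, a))."""
        weights = [math.exp(prefs[(state, a)]) for a in actions]
        total = sum(weights)
        return random.choices(actions, weights=[w / total for w in weights])[0]

    def actor_critic_step(s, a, r, s2, done, alpha=0.1, beta=0.1, gamma=0.9):
        """Apply one observed transition (s, a, r, s2) to critic and actor."""
        delta = r + (0.0 if done else gamma * V[s2]) - V[s]   # TD error
        V[s] += alpha * delta              # critic update
        prefs[(s, a)] += beta * delta      # actor: reinforce a in s by delta
        return delta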

Applications of RL
Robot navigation, adaptive control: helicopter pilot!
Combinatorial optimization: VLSI placement and routing, elevator dispatching
Game playing: Backgammon (world's best player!)
Computational neuroscience: modeling of reward processes

Other Topics
Different measures of return: average rewards, discounted returns
Policy gradient approaches: directly perturb policies
Generalization [Saturday]: function approximation
Temporal abstraction
Least-squares methods: better use of data, more suited for "off-line" RL

References
Mitchell, T. Machine Learning. McGraw-Hill, 1997.
Russell, S. J. and Norvig, P. Artificial Intelligence: A Modern Approach. Pearson Education, 2000.
Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 1998.
Dayan, P. and Abbott, L. F. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press, 2001.
Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, 1996.