Reinforcement Learning: How far can it Go?


Rich Sutton, University of Massachusetts / AT&T Research
With thanks to Doina Precup, Satinder Singh, Amy McGovern, B. Ravindran, Ron Parr

Reinforcement Learning
- An active, popular, successful approach to AI, 15-50 years old
- Emphasizes learning from interaction
- Does not assume complete knowledge of the world
- World-class applications
- Strong theoretical foundations
- Parallels in other fields: operations research, control theory, psychology, neuroscience
- Seeks simple general principles
How far can it go?

World-Class Applications of RL
- TD-Gammon and Jellyfish (Tesauro; Dahl): world's best backgammon player
- Elevator control (Crites & Barto): (probably) world's best down-peak elevator controller
- Job-shop scheduling (Zhang & Dietterich): world's best scheduler of space-shuttle payload processing
- Dynamic channel assignment (Singh & Bertsekas; Nie & Haykin): world's best assigner of radio channels to mobile telephone calls

Outline (timeline: 1950, 1985, 2000)
- RL Past: Trial and Error Learning
- RL Present: Learning and Planning Values
- RL Future: Constructivism

RL began with dissatisfaction with previous learning problems
- Such as: learning from examples, unsupervised learning, function optimization
- None seemed to be purposive
  - Where is the learning of how to get something?
  - Where is the learning by trial and error?
- Earlier learning dealt with prediction and pattern recognition
- Where is learning that is both:
  - selective: tries a variety of actions and prefers the best
  - associative: associates the best action with the situation, for fast recall
- Need rewards and penalties, and interaction with the world!

Rooms Example (figure): early learning methods could not learn how to get reward.

The Reward Hypothesis
That purposes can be adequately represented as maximization of the cumulative sum of a scalar reward signal received from the environment.
- Is this reasonable? Is it demeaning? Is there no other choice?
- It seems to be adequate.
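For reference, the "cumulative sum of a scalar reward signal" is usually formalized as the (possibly discounted) return; the notation below is standard and not taken from the slide:

```latex
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 \le \gamma \le 1 .
```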

RL Past: Trial and Error Learning
- Learned only a policy (a mapping from states to actions)
- Maximized only short-term reward (e.g., learning automata), or delayed reward via simple action traces
- Assumed good/bad rewards were immediately distinguishable
  - E.g., positive is good, negative is bad: an implicitly known reinforcement baseline
- Next steps were to learn baselines and internal rewards
- Taking these next steps quickly led to modern value functions and temporal-difference learning

A Policy (figure): movement is in the wrong direction 1/3 of the time.

Problems with Value-less RL Methods

Outline (timeline: 1950, 1985, 2000)
- RL Past: Trial and Error Learning
- RL Present: Learning and Planning Values
- RL Future: Constructivism

The Value-Function Hypothesis
- Value functions = measures of expected reward following states, V: States → expected future reward, or following state-action pairs, Q: States × Actions → expected future reward
- All efficient methods for optimal sequential decision making estimate value functions
- The hypothesis: that the dominant purpose of intelligence is to approximate these value functions
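Written out in the standard discounted form (my notation; the slide gives only the signatures):

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right],
\qquad
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a\right].
```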

State-Value Function

RL Present: Learning and Planning Values
- Accepts the reward and value hypotheses
- Many real-world applications, some impressive
- Theory strong and active, yet still with more questions than answers
- Strong links to operations research
- A part of modern AI's interest in uncertainty: MDPs, POMDPs, Bayes nets, connectionism
- Includes deliberative planning

New Applications of RL (real-world applications using on-line learning)
- CMUnited RoboCup soccer team (Stone & Veloso): world's best player of RoboCup simulated soccer, 1998
- KnightCap and TDLeaf (Baxter, Tridgell & Weaver): improved chess play from intermediate to master in 300 games
- Inventory management (Van Roy, Bertsekas, Lee & Tsitsiklis): 10-15% improvement over industry-standard methods
- Walking robot (Benbrahim & Franklin): learned critical parameters for bipedal walking

RL Present, Part II: The Space of Methods
[Diagram: the space of RL methods, spanned by full backups (exhaustive search, dynamic programming) vs. sample backups (Monte Carlo), and deep vs. shallow backups (temporal-difference learning, bootstrapping, λ). Also: function approximation, explore/exploit, planning/learning, action/state values, actor-critic.]

The TD Hypothesis
That all value learning is driven by TD errors.
- Even "Monte Carlo" methods can benefit: TD methods enable them to be done incrementally
- Even planning can benefit:
  - Trajectory following improves function approximation and state sampling
  - Sample backups reduce the effect of the branching factor
- Psychological support: TD models of reinforcement and classical conditioning
- Physiological support: reward neurons show TD behavior (Schultz et al.)
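As a concrete anchor for "value learning driven by TD errors", here is a minimal tabular TD(0) sketch. It is illustrative code, not from the talk; the environment interface (`reset`/`step`) is an assumption.

```python
from collections import defaultdict

def td0_episode(env, policy, V, alpha=0.1, gamma=0.95):
    """Run one episode, updating state values V in place from TD errors."""
    s = env.reset()                       # assumed environment API
    done = False
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)     # assumed to return (next state, reward, done)
        target = r + (0.0 if done else gamma * V[s_next])
        V[s] += alpha * (target - V[s])   # the TD error (target - V[s]) drives the update
        s = s_next
    return V

# Usage sketch: V = defaultdict(float); td0_episode(env, policy, V)
```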

Planning
- Modern RL includes planning
  - As in planning for MDPs; a form of state-space planning
  - Still controversial for some
- Planning and learning are nearly identical in RL
  - The same algorithms on real or imagined experience
  - Same value functions, backups, function approximation
[Diagram: acting in the world produces real experience, which drives direct RL and model learning; the model produces imagined experience, which drives planning; both update the value/policy.]
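A Dyna-style sketch of "the same algorithm on real or imagined experience" may make this concrete. The names and the deterministic table model are illustrative assumptions, not the talk's code; the Q-learning backup is applied unchanged to both kinds of experience.

```python
import random

def q_update(Q, s, a, r, s_next, actions, done=False, alpha=0.1, gamma=0.95):
    """One Q-learning backup, applied identically to real and imagined experience."""
    best_next = 0.0 if done else max(Q.get((s_next, a2), 0.0) for a2 in actions)
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)

def dyna_q_step(env, Q, model, actions, s, n_planning=10, epsilon=0.1):
    """Act once in the world (direct RL), then plan from imagined experience."""
    if random.random() < epsilon:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda a2: Q.get((s, a2), 0.0))
    s_next, r, done = env.step(a)                  # real experience (assumed env API)
    q_update(Q, s, a, r, s_next, actions, done)    # direct RL
    model[(s, a)] = (r, s_next, done)              # model learning (deterministic model)
    for _ in range(n_planning):                    # planning from imagined experience
        (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
        q_update(Q, ps, pa, pr, ps2, actions, pdone)
    return s_next, done
```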

Planning with Imagined Experience (figure: real experience vs. imagined experience).

Outline (timeline: 1950, 1985, 2000)
- RL Past: Trial and Error Learning
- RL Present: Learning and Planning Values
- RL Future: Constructivism

Constructivism (cf. Piaget, Drescher)
The active construction of representations and models of the world to facilitate the learning and planning of values.
[Diagram: representations and models, value functions, policy; great flexibility here.]

Constructivist Prophecy
- Whereas RL present is about solving an MDP, RL future will be about representing the states, actions, transitions, rewards, and features to construct an MDP
- Constructing the world to be the way we want it: Markov, linear, small, reliable, independent, shallow, deterministic, additive, low branching
- The RL agent as active world modeler

Representing State, Part I: Features and Function Approximation
- Linear-in-the-features methods are state of the art (also memory-based methods)
- Two-stage architecture:
  - Compute feature values: a nonlinear, expansive, fixed or slowly changing mapping
  - Map the feature values linearly to the result: a linear, convergent, fast-changing mapping
- Works great if the features are appropriate: fast, reliable, local learning; good generalization
- Feature construction is best done by hand... or by methods yet to be found
[Diagram: state → features → values; constructive induction.]
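A minimal sketch of the two-stage architecture described above: a given feature mapping (the slow first stage) followed by a linear value estimate trained by semi-gradient TD(0) (the fast second stage). The function names and the choice of TD(0) for the linear stage are illustrative assumptions.

```python
import numpy as np

def v_hat(w, phi):
    """Linear-in-the-features value estimate: v(s) is approximated by w . phi(s)."""
    return float(np.dot(w, phi))

def semi_gradient_td0(w, phi_s, r, phi_next, done, alpha=0.01, gamma=0.95):
    """Semi-gradient TD(0) update of the linear weights (the fast second stage)."""
    target = r + (0.0 if done else gamma * v_hat(w, phi_next))
    td_error = target - v_hat(w, phi_s)
    return w + alpha * td_error * phi_s   # gradient of w . phi w.r.t. w is phi

# First stage (assumed given): features = lambda s: np.array([...])
# w = np.zeros(n_features); w = semi_gradient_td0(w, features(s), r, features(s2), done)
```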

Good Features vs. Bad Features (figure): good features correspond to regions of similar value; bad features are unrelated to values.

Representing State, Part II: Partial Observability
- When immediate observations do not uniquely identify the current state: non-Markov problems
- Not as big a deal as widely thought; a greater problem for theory than for practice
- Need not use POMDP ideas; can treat it as a function approximation issue
  - Making do with imperfect observations/features
  - Finding the right memories to add as new features
- The key is to construct state representations that make the world more Markov (McCallum's thesis)

Representations of Action
- Nominally, actions in RL are low-level: the lowest level at which behavior can vary
- But people work mostly with courses of action: we decide among these, make predictions at this level, and plan at this level
- Remarkably, all this can be incorporated in RL
  - Course of action = policy + termination condition (see the sketch below)
  - Almost all RL ideas, algorithms, and theory extend: wherever actions are used, courses of action can be substituted
  - Parr, Bradtke & Duff, Precup, Singh, Dietterich, Kaelbling, Huber & Grupen, Szepesvari, Dayan, Ryan & Pendrith, Hauskrecht, Lin...
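The "policy + termination condition" definition can be written down directly. This is a hypothetical sketch (the related options framework also adds an initiation set, which is omitted here); the environment API is assumed.

```python
from dataclasses import dataclass
from typing import Any, Callable
import random

@dataclass
class CourseOfAction:
    """A temporally extended action: an internal policy plus a termination condition."""
    policy: Callable[[Any], Any]        # state -> primitive action
    terminate: Callable[[Any], float]   # state -> probability of terminating here

def run_course(env, course, s, gamma=0.95):
    """Execute a course of action until it terminates.

    Returns the final state, the discounted reward accumulated along the way,
    and the remaining discount -- the quantities a course-of-action model predicts.
    """
    total, discount = 0.0, 1.0
    done = False
    while not done:
        a = course.policy(s)
        s, r, done = env.step(a)        # assumed env API
        total += discount * r
        discount *= gamma
        if random.random() < course.terminate(s):
            break
    return s, total, discount
```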

Room-to-Room Courses of Action (figure): a course of action for each hallway from each room (2 of 8 shown).

Representing Transitions
- Models can also be learned for courses of action
  - What state will we be in at termination?
  - How much reward will we receive along the way?
- The mathematical form of these models follows from the theory of semi-Markov decision processes (see below)
- Permits planning at a higher level
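In that semi-MDP formulation, the model of a course of action o initiated in state s is usually written as a reward part and a discounted terminal-state part (standard notation, not shown on the slide):

```latex
r(s, o) = \mathbb{E}\!\left[r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k}
          \,\middle|\, o \text{ initiated in } s \text{ at time } t,\ \text{terminating after } k \text{ steps}\right],
```
```latex
p(s' \mid s, o) = \sum_{k=1}^{\infty} \gamma^{k}\, \Pr\!\left[\,o \text{ terminates in } s' \text{ after } k \text{ steps} \,\middle|\, s\,\right].
```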

Planning (Value Iteration) with Courses of Action
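Once those models are available, "value iteration with courses of action" is the familiar Bellman backup taken over course models. A hypothetical sketch, with discounting folded into the models as in the equations above:

```python
def smdp_value_iteration(states, courses, reward_model, transition_model, n_sweeps=100):
    """Value iteration where each backup uses a course-of-action model.

    reward_model[(s, o)]     : expected discounted reward while o runs from s
    transition_model[(s, o)] : dict mapping terminal state s' -> discounted probability
    """
    V = {s: 0.0 for s in states}
    for _ in range(n_sweeps):
        for s in states:
            V[s] = max(
                reward_model[(s, o)]
                + sum(p * V[s2] for s2, p in transition_model[(s, o)].items())
                for o in courses
            )
    return V
```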

Reconnaissance Example (B. Ravindran, UMass)
- Mission: fly over (observe) the most valuable sites and return to base
- Stochastic weather affects observability (cloudy or clear) of sites; limited fuel
- Intractable with classical optimal control methods
- Actions: primitives (which direction to fly) and courses (which site to head for)
- Courses compress space and time
  - Reduce steps from ~600 to ~6
  - Reduce states from ~10^11 to ~10^6
  - Enable finding of the best solutions
[Figure: mission map with site rewards, mean time between weather changes, base, and decision steps.]

Courses of action permit enormous flexibility

Subgoals
- Courses of action are often goal-oriented, e.g., drive-to-work, open-the-door
- A course can be learned to achieve its goal; many can be learned at once, independently
  - Solves the classic problem of subgoal credit assignment
  - Solves the psychological puzzle of goal-oriented action
- Goal-oriented courses of action create a better MDP
  - Fewer states, smaller branching factor
  - Compartmentalizes dependencies
- Their models are also goal-oriented recognizers...

Perception
- Real perception, like real action, is temporally extended
- Features are ability-oriented rather than sensor-oriented
  - What is a chair? Something that can be sat upon
- Consider a goal-oriented course of action, like dock-with-charger
  - Its model gives the probability of successfully docking as a function of state
  - I.e., a feature (detector) for states that afford docking
- Such features can be learned without supervision
[Figure: a charger and its dockable region.]

This is RL with a totally different feel
- Still one primary policy and set of values
- But many other policies, values, and models are learned, not directly in service of reward
- The dominant purpose is discovery, not reward
  - What possibilities does this world afford?
  - How can I control and predict it in a variety of ways?
- In other words, constructing representations to make the world: Markov, linear, small, reliable, independent, shallow, deterministic, additive, low branching

Imagine
- An agent driven primarily by biased curiosity, to discover how it can predict and control its interaction with the world
  - What courses of action have predictable effects?
  - What salient observables can be controlled?
  - What models are most useful in planning?
- A human coach presenting a series of problems/tasks and courses of action, highlighting key states, providing subpolicies, termination conditions...

What is New?
- Constructivism itself is not new, but actually doing it would be!
- Does RL really change it, make it easier? That is, do values and policies help?
- Yes! Because so much constructed knowledge is well represented as values and policies in service of approximating values and policies
- RL's goal-orientation is also critical to modeling goal-oriented action and perception

Take Home Messages
- RL Past: let's revisit, but not repeat, past work
- RL Present: do you accept that value functions are critical? And that TD methods are the way to find them?
- RL Future: it's time to address representation construction
  - Explore/understand the world rather than control it
  - RL/values provide new structure for this
  - May explain goal-oriented action and perception

How far can RL go?
- RL is a simple and general formulation of AI, yet there is enough structure to make progress
- While this remains true, we should complicate it no further, but seek general principles of AI
- They may take us all the way to human-level intelligence