Reinforcement Learning’s Computational Theory of Mind
Rich Sutton
with thanks to: Andy Barto, Satinder Singh, Doina Precup

Outline
Computational Theory of Mind
Reinforcement Learning
 – Some vivid examples
Intuition of RL’s Computational Theory of Mind
 – Reward, policy, value (prediction of reward)
Some of the Math
 – Policy iteration (values & policies build on each other)
 – Discounted reward
 – TD error
Speculative extensions
 – Reason
 – Knowledge

Honeybee Brain & VUM Neuron (Hammer, Menzel)

Dopamine Neurons Signal “Error/Change” in Prediction of Reward

Marr’s Three Levels at which any information-processing system can be understood
Computational Theory Level (What and Why?)
 – What are the goals of the computation?
 – What is being computed?
 – Why are these the right things to compute?
 – What overall strategy is followed?
Representation and Algorithm Level (How?)
 – How are these things computed?
 – What representations and algorithms are used?
Hardware Implementation Level (Really how?)
 – How is this implemented physically?

Cash Register
Computational Theory (What and Why?)
 – Adding numbers
 – Making change
 – Computing tax
 – Controlling access to the cash drawer
Representations and Algorithms
 – Are numbers stored in decimal, binary, or BCD?
 – Is multiplication done by repeated adding?
Hardware Implementation
 – Silicon or gears?
 – Motors or springs?

Word Processor
Computational Theory (What and Why?)
 – The whole “application” level
 – To display the document as it will appear on paper
 – To make it easy to change, enhance
Representations and Algorithms
 – How is the document stored?
 – What algorithms are used to maintain its display?
 – The underlying C code
Hardware Implementation
 – How does the display work?
 – How does the silicon implement the computer?

Flight
Computational Theory (What and Why?)
 – Aerodynamics
 – Lift, propulsion, airfoil shape
Representations and Algorithms
 – Fixed wings or flapping?
Hardware Implementation
 – Steel, wood, or feathers?

Importance of Computational Theory
The levels are loosely coupled
Each has an internal logic, a coherence of its own
But computational theory drives all lower levels
We have so often gone wrong by mixing CT with the other levels
 – Imagined neuro-physiological constraints
 – The meaning of connectionism, neural networks
Constraints within the CT level have more force
We have little computational theory for AI
 – No clear problem definition
 – Many methods for knowledge representation, but no theory of knowledge

Outline
Computational Theory of Mind
Reinforcement Learning
 – Some vivid examples
Intuition of RL’s Computational Theory of Mind
 – Reward, policy, value (prediction of reward)
Some of the Math
 – Policy iteration (values & policies build on each other)
 – Discounted reward
 – TD error
Speculative extensions
 – Reason
 – Knowledge

Reinforcement learning: Learning from interaction to achieve a goal
[Diagram: the Agent sends actions to the Environment; the Environment returns states and rewards to the Agent]
 – complete agent
 – temporally situated
 – continual learning & planning
 – object is to affect the environment
 – environment is stochastic & uncertain
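The loop in this diagram is easy to state in code. Below is a minimal sketch, not from the talk: a hypothetical two-state environment with a reset/step interface and a random stimulus-response policy, just to show the agent acting, the environment returning state and reward, and the interaction continuing step by step.

```python
# Minimal sketch of the agent-environment loop (hypothetical toy environment, random policy).
import random

class ToyEnv:
    """Stochastic two-state environment: taking action 1 and landing in state 1 pays off."""
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):
        self.state = random.choice([0, 1])           # stochastic, uncertain transition
        reward = 1.0 if (action == 1 and self.state == 1) else 0.0
        return self.state, reward

def policy(state):
    return random.choice([0, 1])                      # placeholder stimulus-response rule

env = ToyEnv()
state = env.reset()
total = 0.0
for t in range(100):                                  # continual, temporally situated interaction
    action = policy(state)                            # agent acts to affect the environment
    state, reward = env.step(action)                  # environment returns next state and reward
    total += reward
print("cumulative reward:", total)
```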

Strands of History of RL
[Timeline figure: three strands converging]
 – Trial-and-error learning: Thorndike (1911), Minsky, Klopf, Barto et al.
 – Temporal-difference learning: secondary reinforcement, Samuel, Witten, Sutton
 – Optimal control, value functions: Hamilton (physics, 1800s), Bellman/Howard (OR), Werbos, Watkins
 – Also on the timeline: Shannon, Holland

Hiroshi Kimura’s RL Robots
[Videos: Before / After; Backward; New Robot, Same Algorithm]

The RoboCup Soccer Competition

The Acrobot Problem (e.g., Dejong & Spong, 1994; Sutton, 1995)
Minimum-Time-to-Goal problem
 – 4 state variables: 2 joint angles, 2 angular velocities
 – Tile coding with 48 layers
 – Goal: raise the tip above the line
 – Torque applied at the joint between the two links; the base is fixed
 – Reward = -1 per time step

Examples of Reinforcement Learning
 – RoboCup soccer teams (Stone & Veloso; Riedmiller et al.): world’s best player of simulated soccer, 1999; runner-up 2000
 – Inventory management (Van Roy, Bertsekas, Lee & Tsitsiklis): 10-15% improvement over industry-standard methods
 – Dynamic channel assignment (Singh & Bertsekas; Nie & Haykin): world’s best assigner of radio channels to mobile telephone calls
 – Elevator control (Crites & Barto): (probably) world’s best down-peak elevator controller
 – Many robots: navigation, bipedal walking, grasping, switching between skills...
 – TD-Gammon and Jellyfish (Tesauro, Dahl): world’s best backgammon player

New Applications of RL
 – CMUnited RoboCup soccer team (Stone & Veloso): world’s best player of RoboCup simulated soccer, 1998
 – KnightCap and TDleaf (Baxter, Tridgell & Weaver): improved chess play from intermediate to master in 300 games
 – Inventory management (Van Roy, Bertsekas, Lee & Tsitsiklis): 10-15% improvement over industry-standard methods
 – Walking robot (Benbrahim & Franklin): learned critical parameters for bipedal walking
Real-world applications using on-line learning

Backgammon
SITUATIONS: configurations of the playing board (about 10^20)
ACTIONS: moves
REWARDS: win: +1, lose: -1, else: 0
Pure delayed reward

TD-Gammon (Tesauro)
[Figure: board position fed to a neural network that outputs a value; TD error = V_{t+1} - V_t]
Action selection by 2-3 ply search
Start with a random network
Play millions of games against itself
Learn a value function from this simulated experience
This produces arguably the best player in the world
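To make the self-play recipe concrete, here is a heavily simplified sketch (mine, not Tesauro’s): a linear value function with TD(0) updates and greedy one-ply move selection, on a stand-in `TinyGame` rather than backgammon. TD-Gammon itself used a multi-layer network and TD(λ).

```python
# Sketch of TD-Gammon-style self-play learning on a stand-in game (not backgammon).
import random

class TinyGame:
    """Toy game on positions 0..10: reach 10 to win (+1), 0 to lose (-1)."""
    def __init__(self, pos=5):
        self.pos = pos
    def afterstates(self):
        return [TinyGame(self.pos - 1), TinyGame(self.pos + 1)]
    def outcome(self):
        if self.pos <= 0:  return -1.0
        if self.pos >= 10: return +1.0
        return None
    def features(self):
        x = [0.0] * 11
        x[min(max(self.pos, 0), 10)] = 1.0            # one-hot position (stand-in for board features)
        return x

def value(w, state):                                   # linear value function (stand-in for the network)
    return sum(wi * xi for wi, xi in zip(w, state.features()))

w, alpha, epsilon = [0.0] * 11, 0.1, 0.1               # start with a blank value function
for episode in range(2000):                            # play many games against itself
    state = TinyGame()
    for step in range(200):
        if state.outcome() is not None:
            break
        choices = state.afterstates()
        nxt = (random.choice(choices) if random.random() < epsilon
               else max(choices, key=lambda s: value(w, s)))   # greedy 1-ply lookahead
        out = nxt.outcome()
        target = out if out is not None else value(w, nxt)
        delta = target - value(w, state)               # TD error: V_{t+1} - V_t (or final outcome)
        for i, xi in enumerate(state.features()):
            w[i] += alpha * delta * xi                 # TD(0) weight update
        state = nxt
```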

Outline
Computational Theory of Mind
Reinforcement Learning
 – Some vivid examples
Intuition of RL’s Computational Theory of Mind
 – Reward, policy, value (prediction of reward)
Some of the Math
 – Policy iteration (values & policies build on each other)
 – Discounted reward
 – TD error
More speculative extensions
 – Reason
 – Knowledge

The Reward Hypothesis
The mind’s goal is to maximize the cumulative sum of a received scalar signal (reward)

The Reward Hypothesis
The mind’s goal is to maximize the cumulative sum of a received scalar signal (reward)
 – “is”: can be conceived of, understood as

The Reward Hypothesis
The mind’s goal is to maximize the cumulative sum of a received scalar signal (reward)
 – “received”: must come from outside, not under the mind’s direct control

The Reward Hypothesis
The mind’s goal is to maximize the cumulative sum of a received scalar signal (reward)
 – “scalar signal”: a simple, single number (not a vector or symbol structure)

The Reward Hypothesis
The mind’s goal is to maximize the cumulative sum of a received scalar signal (reward)
Obvious? Brilliant? Demeaning? Inevitable? Trivial?
 – Simple, but not trivial
 – May be adequate, may be completely satisfactory
 – A good null hypothesis

Policies
A policy maps each state to an action to take
 – Like a stimulus-response rule
We seek a policy that maximizes cumulative reward
The policy is a subgoal to achieving reward
[Diagram: Reward → Policy]

Value Functions
Value functions = predictions of expected reward following states
 – Value: States → Expected future reward
Moment-by-moment estimates of how well it’s going
All efficient methods for finding optimal policies first estimate value functions
 – RL methods, state-space planning methods, dynamic programming
Recognizing and reacting to the ups and downs of life is an important part of intelligence
[Diagram: Reward → Value Function → Policy]
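As a concrete (and entirely illustrative) reading of “Value: States → Expected future reward”, the sketch below estimates a value function by averaging the discounted returns observed after first visits to each state. The episodes are hypothetical.

```python
# Sketch: first-visit Monte Carlo estimate of V(s) = expected discounted future reward.
from collections import defaultdict

def estimate_values(episodes, gamma=0.9):
    """episodes: trajectories of (state, reward received on leaving that state) pairs."""
    returns = defaultdict(list)
    for episode in episodes:
        # compute the discounted return G_t at every step by sweeping backwards
        G, Gs = 0.0, [0.0] * len(episode)
        for i in reversed(range(len(episode))):
            G = episode[i][1] + gamma * G
            Gs[i] = G
        firsts = {}
        for i, (s, _) in enumerate(episode):
            firsts.setdefault(s, i)                    # index of the first visit to each state
        for s, i in firsts.items():
            returns[s].append(Gs[i])
    return {s: sum(g) / len(g) for s, g in returns.items()}

# hypothetical episodes generated by some fixed policy
episodes = [[("A", 0.0), ("B", 0.0), ("goal", 1.0)],
            [("A", 0.0), ("goal", 1.0)]]
print(estimate_values(episodes))   # e.g. V("A") ≈ 0.86, V("B") = 0.9, V("goal") = 1.0
```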

The Mountain Car Problem (Moore, 1990)
Minimum-Time-to-Goal problem
[Figure: car in a valley, goal at the top of the hill; “Gravity wins”]
SITUATIONS: car’s position and velocity
ACTIONS: three thrusts: forward, reverse, none
REWARDS: always -1 until the car reaches the goal
No discounting
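For reference, here is a minimal sketch of the Mountain Car dynamics, following the common textbook formulation (Sutton & Barto); the constants are from that version and may differ from the exact setup used in the talk.

```python
# Minimal Mountain Car dynamics sketch (textbook constants, not necessarily the talk's setup).
import math

class MountainCar:
    def __init__(self):
        self.position, self.velocity = -0.5, 0.0      # start near the bottom of the valley

    def step(self, thrust):                           # thrust in {-1 (reverse), 0 (none), +1 (forward)}
        self.velocity += 0.001 * thrust - 0.0025 * math.cos(3 * self.position)   # gravity term
        self.velocity = max(-0.07, min(0.07, self.velocity))
        self.position += self.velocity
        self.position = max(-1.2, min(0.5, self.position))
        if self.position <= -1.2:
            self.velocity = 0.0                       # inelastic collision with the left wall
        done = self.position >= 0.5                   # reached the goal at the hilltop
        return (self.position, self.velocity), -1.0, done   # reward is -1 on every step
```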

Value functions learned while solving the Mountain Car problem (lower is better)

Honeybee Brain & VUM Neuron (Hammer, Menzel)

Dopamine Neurons Signal “Error/Change” in Prediction of Reward

Outline
Computational Theory of Mind
Reinforcement Learning
 – Some vivid examples
Intuition of RL’s Computational Theory of Mind
 – Reward, policy, value (prediction of reward)
Some of the Math
 – Policy iteration (values & policies build on each other)
 – Discounted reward
 – TD error
Speculative extensions
 – Reason
 – Knowledge

Notation
Reward: r : States → Pr(ℜ)
Policies: π : States → Pr(Actions); π* is optimal
Value functions: V^π(s) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ⋯ | s_t = s }
Discount factor: γ ≈ 1 but < 1

Policy Iteration
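The policy-iteration equations on this slide were shown as images. As a rough stand-in (my sketch, not the original slide content), here is tabular policy iteration on a tiny, hypothetical MDP: values and policies build on each other until the policy stops changing.

```python
# Rough sketch of tabular policy iteration on a hypothetical two-state MDP.
# P[s][a] = list of (probability, next_state); R[s][a] = expected immediate reward.
P = {0: {0: [(1.0, 0)], 1: [(1.0, 1)]},
     1: {0: [(1.0, 0)], 1: [(1.0, 1)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}
states, actions, gamma = [0, 1], [0, 1], 0.9

def evaluate(policy, V, sweeps=100):
    for _ in range(sweeps):                            # iterative policy evaluation
        for s in states:
            a = policy[s]
            V[s] = R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
    return V

def greedify(V):
    # policy improvement: act greedily with respect to the current value function
    return {s: max(actions,
                   key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
            for s in states}

policy = {s: 0 for s in states}
V = {s: 0.0 for s in states}
while True:
    V = evaluate(policy, V)
    new_policy = greedify(V)
    if new_policy == policy:                            # stable policy => optimal
        break
    policy = new_policy
print(policy, V)
```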

Generalized Policy Iteration
[Diagram: policy evaluation (π → V^π) and greedification (V → greedy policy) alternate, converging to π* and V*]

Reward is a time series: r_{t+1}, r_{t+2}, r_{t+3}, …
Total future reward: r_{t+1} + r_{t+2} + r_{t+3} + ⋯
Discounted (imminent) reward: r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ⋯, with 0 ≤ γ < 1
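A small illustration (mine, not from the slide) of the difference between the total and the discounted future reward, for a hypothetical reward series:

```python
# Total vs. discounted future reward for a hypothetical reward time series.
rewards = [0.0, 0.0, 1.0, 0.0, 2.0]          # r_{t+1}, r_{t+2}, ...
gamma = 0.9

total = sum(rewards)                                              # total future reward
discounted = sum(gamma**k * r for k, r in enumerate(rewards))     # discounted return
print(total, discounted)                      # 3.0 vs 0.9**2 * 1 + 0.9**4 * 2 ≈ 2.122
```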

Discounting Example
[Figure: a reward time series illustrating discounting]

What should the prediction error at time t be?

Computation-Theoretical TD Errors
[Figure: three panels showing reward, value, and TD-error traces when the reward is unexpected, when the reward is expected (signaled by a cue), and when an expected reward is absent]
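The quantity plotted in those panels is the TD error, δ_t = r_{t+1} + γV(s_{t+1}) − V(s_t). Below is a small sketch (mine) of a TD(0) learner on a toy cue → wait → reward sequence; it reproduces the qualitative pattern: a positive error when the reward is still unpredicted, near-zero error once it is predicted, and a negative error when an expected reward is omitted.

```python
# Sketch of the TD error delta_t = r_{t+1} + gamma*V(s_{t+1}) - V(s_t) on a toy trial.
gamma, alpha = 1.0, 0.1
V = {"start": 0.0, "cue": 0.0, "wait": 0.0, "end": 0.0}

def run_trial(reward_delivered=True):
    traj = ["start", "cue", "wait", "end"]            # reward (if any) arrives on entering "end"
    deltas = []
    for s, s_next in zip(traj, traj[1:]):
        r = 1.0 if (s_next == "end" and reward_delivered) else 0.0
        delta = r + gamma * V[s_next] - V[s]          # TD error
        V[s] += alpha * delta                         # TD(0) update
        deltas.append(round(delta, 2))
    return deltas

print(run_trial())                          # before learning: positive error when the reward arrives
for _ in range(200):                        # train until the reward is predicted
    run_trial()
print(run_trial())                          # reward fully predicted: TD errors near zero
print(run_trial(reward_delivered=False))    # expected reward omitted: negative error at that time
```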

Honeybee Brain & VUM Neuron (Hammer, Menzel)

Dopamine Neurons Signal TD Error in Prediction of Reward

Representation & Algorithm Level: the many methods for approximating value
[Diagram: a space of methods spanning shallow vs. deep backups and full (bootstrapping) vs. sample backups: dynamic programming (full, shallow), exhaustive search (full, deep), temporal-difference learning (sample, shallow), Monte Carlo (sample, deep)]

Outline
Computational Theory of Mind
Reinforcement Learning
 – Some vivid examples
Intuition of RL’s Computational Theory of Mind
 – Reward, policy, value (prediction of reward)
Some of the Math
 – Policy iteration (values & policies build on each other)
 – Discounted reward
 – TD error
Speculative extensions
 – Reason
 – Knowledge

Tolman & Honzik, 1930: “Reasoning in Rats”
[Maze diagram: a start box connected to a food box by Path 1, Path 2, and Path 3, with blocks A and B that can cut off routes]

Reason as RL over Imagined Experience
1. Learn a model of the world’s transition dynamics
 – transition probabilities, expected immediate rewards
 – a “1-step model” of the world
2. Use the model to generate imaginary experience
 – internal thought trials, mental simulation (Craik, 1943)
3. Apply RL as if the experience had really happened
[Diagram: Reward → Value Function → Policy, with a 1-Step Model feeding in]
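A sketch of those three steps, in the spirit of the Dyna architecture (my illustration, not code from the talk). For brevity the model here is deterministic, storing a single (reward, next state) pair per state-action; a stochastic world would call for storing a distribution instead.

```python
# Dyna-style sketch: learn a 1-step model, then apply the same RL update to imagined transitions.
import random

alpha, gamma = 0.1, 0.9
Q = {}                                    # action values
model = {}                                # 1-step model: (s, a) -> (reward, next_state)

def q(s, a):
    return Q.get((s, a), 0.0)

def td_update(s, a, r, s2, actions):
    best_next = max(q(s2, a2) for a2 in actions)
    Q[(s, a)] = q(s, a) + alpha * (r + gamma * best_next - q(s, a))

def dyna_step(s, a, r, s2, actions, planning_steps=10):
    td_update(s, a, r, s2, actions)               # learn from the real transition
    model[(s, a)] = (r, s2)                       # 1. update the 1-step model
    for _ in range(planning_steps):               # 2. generate imaginary experience
        ps, pa = random.choice(list(model))
        pr, ps2 = model[(ps, pa)]
        td_update(ps, pa, pr, ps2, actions)       # 3. apply RL as if it had really happened
```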

Mind is About Predictions
Hypothesis: knowledge is predictive
 – About what-leads-to-what, under what ways of behaving
 – What will I see if I go around the corner?
 – Objects: what will I see if I turn this over?
 – Active vision: what will I see if I look at my hand?
 – Value functions: what is the most reward I know how to get?
 – Such knowledge is learnable, chainable
Hypothesis: mental activity is working with predictions
 – Learning them
 – Combining them to produce new predictions (reasoning)
 – Converting them to action (planning, reinforcement learning)
 – Figuring out which are most useful

An old, simple, appealing idea
Mind as prediction engine!
Predictions are learnable, combinable
They represent cause and effect, and can be pieced together to yield plans
Perhaps this old idea is essentially correct. Just needs:
 – Development, revitalization in modern forms
 – Greater precision, formalization, mathematics
 – The computational perspective to make it respectable
 – Imagination, determination, patience; not rushing to performance