Rutgers CS440, Fall 2003
Reinforcement Learning
Reading: Ch. 21, AIMA 2nd Ed.

Outline
- What is RL?
- Methods for RL
Note: only a brief overview, no in-depth coverage.

What is Reinforcement Learning (RL)?
Learning so far: learning probabilistic models (BNs) or functions (NNs).
RL: learning what/how to do from feedback (reward/reinforcement), e.g.:
- Chess playing: learn how to play from won/lost feedback.
- Learning to speak, crawl, …
- Learning user preferences for web searching.
MDP: find the optimal policy using a known model.
- Optimal policy = the policy that maximizes expected total reward.
RL: learn the optimal policy from rewards.
- Do not know the environment model.
- Do not know the reward function.
- Do know how well something was done (e.g., won/lost).

Types of RL
MDP: actions + states + rewards.
- Passive learning: policy fixed; learn the utility of states (+ the rest of the model).
- Active learning: policy not fixed; learn the utility as well as the optimal policy.
[Diagram: chain S_0, A_0, R_0 → S_1, A_1, R_1 → S_2, A_2, R_2, … with transition model P(S_t | S_{t-1}, A_{t-1}) and reward R(S_t).]
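To make the later slides concrete, here is a minimal Python sketch of a two-state MDP simulator in the spirit of the NL/L example used below. The specific transition probabilities, rewards, action names, and discount factor are illustrative assumptions, not values given in the slides.

import random

# Hypothetical two-state MDP (states "NL" and "L"); all numbers are illustrative only.
STATES = ["NL", "L"]
ACTIONS = ["A", "N"]
P = {  # P[(s, a)] = list of (next_state, probability)
    ("NL", "A"): [("NL", 0.7), ("L", 0.3)],
    ("NL", "N"): [("NL", 1.0)],
    ("L",  "A"): [("L", 0.8), ("NL", 0.2)],
    ("L",  "N"): [("L", 1.0)],
}
R = {"NL": 0.0, "L": 10.0}   # reward R(s) received in state s
GAMMA = 0.9                  # discount factor (assumed)

def step(state, action):
    """Sample a successor state from P(s' | s, a) and return (next_state, reward)."""
    next_states, probs = zip(*P[(state, action)])
    s_next = random.choices(next_states, weights=probs, k=1)[0]
    return s_next, R[s_next]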

Passive RL
The policy is known and fixed; the task is to learn how good it is, plus the environment model.
Learn: U(s_t), even though P(s_t | s_{t-1}, a_{t-1}) and R(s_t) are not known.
Method: conduct trials; receive a sequence of actions, states, and rewards {(a_t, s_t, R_t)}; compute the model parameters and utilities from it.
[Diagram: chain S_0, A_0, R_0 → S_1, A_1, R_1 → S_2, A_2, R_2, … with P(S_t | S_{t-1}, A_{t-1}) and R(S_t); table of one sample trial listing a_t, s_t, r_t.]
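A minimal sketch of conducting such trials under a fixed policy, reusing the hypothetical simulator above (step, R); the horizon, the number of trials, and the always-"A" policy are assumptions for illustration.

def run_trial(policy, start="NL", horizon=10):
    """Follow the fixed policy and record one trial as a list of (a_t, s_t, r_t)."""
    trajectory = []
    state, reward = start, R[start]
    for _ in range(horizon):
        action = policy(state)
        next_state, next_reward = step(state, action)
        trajectory.append((action, state, reward))
        state, reward = next_state, next_reward
    return trajectory

fixed_policy = lambda s: "A"   # example of a known, fixed policy
trials = [run_trial(fixed_policy) for _ in range(100)]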

Direct utility estimation
Observe (a_t, s_t, R_t) and estimate U(s_t) from "counts" (inductive learning): every visit to a state yields one estimate of its utility, the sum of rewards observed from that visit to the end of the trial, and the estimates are averaged.
[Tables: two sample trials (Sample #1, Sample #2) listing a_t, s_t, r_t.]
Example (γ = 1):
- Sample #1: U(NL) = … = 20, U(NL) = … = 40
- Sample #2: U(NL) = … = 25, U(NL) = … = 45
- On average, U(NL) = (20 + 40 + 25 + 45) / 4 = 32.5
Drawback: does not use the fact that the utilities of states are dependent (Bellman equations)!
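A minimal sketch of direct utility estimation on recorded trials: every visit to a state contributes one (discounted) reward-to-go sample, and the samples are averaged per state. It reuses the hypothetical trials and GAMMA names from the sketches above.

from collections import defaultdict

def direct_utility_estimate(trials, gamma=1.0):
    """Average the observed (discounted) reward-to-go over every visit to each state."""
    returns = defaultdict(list)                      # state -> list of reward-to-go samples
    for trial in trials:
        reward_to_go = 0.0
        for (_, state, reward) in reversed(trial):   # walk backwards to accumulate in one pass
            reward_to_go = reward + gamma * reward_to_go
            returns[state].append(reward_to_go)
    return {s: sum(v) / len(v) for s, v in returns.items()}

U_direct = direct_utility_estimate(trials, gamma=1.0)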

Adaptive dynamic programming
Take into account the constraints described by the Bellman equations.
Algorithm (for each sample, each time step):
1. Estimate P(s_t | s_{t-1}, a_{t-1}) from counts, e.g., P(L | NL, A) = #(L, NL, A) / #(NL, A).
2. Compute U(s_t) from R(s_t) and P(s_t | s_{t-1}, a_{t-1}), using the Bellman equations (solved exactly, or by iterative update).
Drawback: usually (too) many states.
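A minimal ADP sketch for the passive (fixed-policy) case, under the same hypothetical setup: estimate P(s' | s) under the policy from transition counts, then iterate the Bellman update U(s) = R(s) + γ Σ_{s'} P(s' | s) U(s'). The number of sweeps is an illustrative choice.

from collections import defaultdict

def adp_estimate(trials, gamma=GAMMA, sweeps=100):
    """Count observed transitions, build the ML transition model, then iterate the Bellman update."""
    pair_counts = defaultdict(int)    # (s, s') -> #(s -> s')
    state_counts = defaultdict(int)   # s -> #(s observed as a predecessor)
    for trial in trials:
        for (_, s, _), (_, s_next, _) in zip(trial, trial[1:]):
            pair_counts[(s, s_next)] += 1
            state_counts[s] += 1
    P_hat = {(s, s2): pair_counts[(s, s2)] / state_counts[s] for (s, s2) in pair_counts}

    # Policy evaluation: repeatedly apply U(s) = R(s) + gamma * sum_s' P(s'|s) U(s').
    U = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        U = {s: R[s] + gamma * sum(P_hat.get((s, s2), 0.0) * U[s2] for s2 in STATES)
             for s in STATES}
    return U

U_adp = adp_estimate(trials)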

TD-Learning
Only update U-values for observed transitions.
Algorithm:
1. Receive a new sample pair (s_t, s_{t+1}).
2. Assume that only the transition s_t → s_{t+1} can occur.
3. Update U:
   U(s_t) ← U(s_t) + α [ R(s_t) + γ U(s_{t+1}) − U(s_t) ]
   (new value ← old value, nudged toward the value computed from the Bellman equation).
Does not need to compute model parameters! (Yet it converges to the "right" solution.)
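A minimal passive TD(0) sketch under the same hypothetical setup: utilities are updated online from each observed transition, and no transition model is ever estimated. The learning rate alpha and the episode/horizon counts are illustrative assumptions.

def td_learn(policy, episodes=500, alpha=0.1, gamma=GAMMA, horizon=10, start="NL"):
    """Passive TD(0): after each observed transition s -> s', nudge U(s) toward R(s) + gamma*U(s')."""
    U = {s: 0.0 for s in STATES}
    for _ in range(episodes):
        state = start
        for _ in range(horizon):
            next_state, _ = step(state, policy(state))
            # TD update: move U(s) toward the one-step Bellman estimate.
            U[state] += alpha * (R[state] + gamma * U[next_state] - U[state])
            state = next_state
    return U

U_td = td_learn(fixed_policy)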