Rutgers CS440, Fall 2003
Reinforcement Learning
Reading: Ch. 21, AIMA 2nd Ed.

Outline
- What is RL?
- Methods for RL
Note: only a brief overview, no in-depth coverage.

What is Reinforcement Learning (RL)?
Learning so far: learning probabilistic models (BNs) or functions (NNs).
RL: learning what/how to do from feedback (reward/reinforcement), e.g.:
- Chess playing: learn how to play from won/lost feedback.
- Learning to speak, crawl, …
- Learning user preferences for web searching.
MDP: find the optimal policy using a known model.
- Optimal policy = the policy that maximizes expected total reward.
RL: learn the optimal policy from rewards.
- Do not know the environment model.
- Do not know the reward function.
- Do know how well something was done (e.g., won/lost).

Types of RL
MDP: actions + states + rewards.
- Passive learning: policy fixed; learn the utility of states (+ the rest of the model).
- Active learning: policy not fixed; learn the utility as well as the optimal policy.
[Diagram: chain S_0, A_0, R_0 → S_1, A_1, R_1 → S_2, A_2, R_2, … with transition model P(S_t | S_{t-1}, A_{t-1}) and reward R(S_t).]
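To make the later slides concrete, here is a minimal Python sketch of a two-state MDP simulator in the spirit of the NL/L example used below. The specific transition probabilities, rewards, action names, and discount factor are illustrative assumptions, not values given in the slides.

import random

# Hypothetical two-state MDP (states "NL" and "L"); all numbers are illustrative only.
STATES = ["NL", "L"]
ACTIONS = ["A", "N"]
P = {  # P[(s, a)] = list of (next_state, probability)
    ("NL", "A"): [("NL", 0.7), ("L", 0.3)],
    ("NL", "N"): [("NL", 1.0)],
    ("L",  "A"): [("L", 0.8), ("NL", 0.2)],
    ("L",  "N"): [("L", 1.0)],
}
R = {"NL": 0.0, "L": 10.0}   # reward R(s) received in state s
GAMMA = 0.9                  # discount factor (assumed)

def step(state, action):
    """Sample a successor state from P(s' | s, a) and return (next_state, reward)."""
    next_states, probs = zip(*P[(state, action)])
    s_next = random.choices(next_states, weights=probs, k=1)[0]
    return s_next, R[s_next]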

Passive RL
The policy is known and fixed; the task is to learn how good it is, plus the environment model.
Learn: U(s_t), even though P(s_t | s_{t-1}, a_{t-1}) and R(s_t) are not known.
Method: conduct trials; receive a sequence of actions, states, and rewards {(a_t, s_t, R_t)}; compute the model parameters and utilities from it.
[Diagram: chain S_0, A_0, R_0 → S_1, A_1, R_1 → S_2, A_2, R_2, … with P(S_t | S_{t-1}, A_{t-1}) and R(S_t); table of one sample trial listing a_t, s_t, r_t.]
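A minimal sketch of conducting such trials under a fixed policy, reusing the hypothetical simulator above (step, R); the horizon, the number of trials, and the always-"A" policy are assumptions for illustration.

def run_trial(policy, start="NL", horizon=10):
    """Follow the fixed policy and record one trial as a list of (a_t, s_t, r_t)."""
    trajectory = []
    state, reward = start, R[start]
    for _ in range(horizon):
        action = policy(state)
        next_state, next_reward = step(state, action)
        trajectory.append((action, state, reward))
        state, reward = next_state, next_reward
    return trajectory

fixed_policy = lambda s: "A"   # example of a known, fixed policy
trials = [run_trial(fixed_policy) for _ in range(100)]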

Direct utility estimation
Observe (a_t, s_t, R_t) and estimate U(s_t) from "counts" (inductive learning): every visit to a state yields one estimate of its utility, the sum of rewards observed from that visit to the end of the trial, and the estimates are averaged.
[Tables: two sample trials (Sample #1, Sample #2) listing a_t, s_t, r_t.]
Example (γ = 1):
- Sample #1: U(NL) = … = 20, U(NL) = … = 40
- Sample #2: U(NL) = … = 25, U(NL) = … = 45
- On average, U(NL) = (20 + 40 + 25 + 45) / 4 = 32.5
Drawback: does not use the fact that the utilities of states are dependent (Bellman equations)!
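A minimal sketch of direct utility estimation on recorded trials: every visit to a state contributes one (discounted) reward-to-go sample, and the samples are averaged per state. It reuses the hypothetical trials and GAMMA names from the sketches above.

from collections import defaultdict

def direct_utility_estimate(trials, gamma=1.0):
    """Average the observed (discounted) reward-to-go over every visit to each state."""
    returns = defaultdict(list)                      # state -> list of reward-to-go samples
    for trial in trials:
        reward_to_go = 0.0
        for (_, state, reward) in reversed(trial):   # walk backwards to accumulate in one pass
            reward_to_go = reward + gamma * reward_to_go
            returns[state].append(reward_to_go)
    return {s: sum(v) / len(v) for s, v in returns.items()}

U_direct = direct_utility_estimate(trials, gamma=1.0)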

Adaptive dynamic programming
Take into account the constraints described by the Bellman equations.
Algorithm (for each sample, each time step):
1. Estimate P(s_t | s_{t-1}, a_{t-1}) from counts, e.g., P(L | NL, A) = #(L, NL, A) / #(NL, A).
2. Compute U(s_t) from R(s_t) and P(s_t | s_{t-1}, a_{t-1}), using the Bellman equations (solved exactly, or by iterative update).
Drawback: usually (too) many states.
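A minimal ADP sketch for the passive (fixed-policy) case, under the same hypothetical setup: estimate P(s' | s) under the policy from transition counts, then iterate the Bellman update U(s) = R(s) + γ Σ_{s'} P(s' | s) U(s'). The number of sweeps is an illustrative choice.

from collections import defaultdict

def adp_estimate(trials, gamma=GAMMA, sweeps=100):
    """Count observed transitions, build the ML transition model, then iterate the Bellman update."""
    pair_counts = defaultdict(int)    # (s, s') -> #(s -> s')
    state_counts = defaultdict(int)   # s -> #(s observed as a predecessor)
    for trial in trials:
        for (_, s, _), (_, s_next, _) in zip(trial, trial[1:]):
            pair_counts[(s, s_next)] += 1
            state_counts[s] += 1
    P_hat = {(s, s2): pair_counts[(s, s2)] / state_counts[s] for (s, s2) in pair_counts}

    # Policy evaluation: repeatedly apply U(s) = R(s) + gamma * sum_s' P(s'|s) U(s').
    U = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        U = {s: R[s] + gamma * sum(P_hat.get((s, s2), 0.0) * U[s2] for s2 in STATES)
             for s in STATES}
    return U

U_adp = adp_estimate(trials)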

TD-Learning
Only update U-values for observed transitions.
Algorithm:
1. Receive a new sample pair (s_t, s_{t+1}).
2. Assume that only the transition s_t → s_{t+1} can occur.
3. Update U:
   U(s_t) ← U(s_t) + α [ R(s_t) + γ U(s_{t+1}) − U(s_t) ]
   (new value ← old value, nudged toward the value computed from the Bellman equation).
Does not need to compute model parameters! (Yet it converges to the "right" solution.)
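A minimal passive TD(0) sketch under the same hypothetical setup: utilities are updated online from each observed transition, and no transition model is ever estimated. The learning rate alpha and the episode/horizon counts are illustrative assumptions.

def td_learn(policy, episodes=500, alpha=0.1, gamma=GAMMA, horizon=10, start="NL"):
    """Passive TD(0): after each observed transition s -> s', nudge U(s) toward R(s) + gamma*U(s')."""
    U = {s: 0.0 for s in STATES}
    for _ in range(episodes):
        state = start
        for _ in range(horizon):
            next_state, _ = step(state, policy(state))
            # TD update: move U(s) toward the one-step Bellman estimate.
            U[state] += alpha * (R[state] + gamma * U[next_state] - U[state])
            state = next_state
    return U

U_td = td_learn(fixed_policy)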