Passive Reinforcement Learning. Ruti Glick, Bar-Ilan University.


1 Passive Reinforcement Learning
Ruti Glick, Bar-Ilan University

2 Passive Reinforcement Learning
We will assume a fully observable environment.
The agent has a fixed policy π and always executes π(s).
Goal: to learn how good the policy is (similar to policy evaluation).
But the agent doesn't have all the knowledge:
it doesn't know the transition model T(s, a, s')
it doesn't know the reward function R(s)

3 Example
Our familiar 4x3 world; the policy is known.
The agent executes trials using the policy.
Each trial starts at (1,1) and experiences a sequence of states until it reaches a terminal state.
(Grid figure: the start cell is (1,1) and the +1 terminal state is at the top right.)

4 Example (cont.)
Typical trials may be:
(1,1) -.04→ (1,2) -.04→ (1,3) -.04→ (1,2) -.04→ (1,3) -.04→ (2,3) -.04→ (3,3) -.04→ (4,3) +1
(1,1) -.04→ (1,2) -.04→ (1,3) -.04→ (2,3) -.04→ (3,3) -.04→ (3,2) -.04→ (3,3) -.04→ (4,3) +1
(1,1) -.04→ (2,1) -.04→ (3,1) -.04→ (3,2) -.04→ (4,2) -1

5 The goal
Utility Uπ(s): the expected sum of discounted rewards obtained when following policy π from state s.
Passive learning may also include learning a model of the environment.
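Written out, this is the standard definition (γ is the discount factor, and the expectation is over the state sequences generated by executing π from s):

\[ U^{\pi}(s) = E\left[\, \sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\middle|\; \pi,\ s_0 = s \right] \]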

6 Algorithms
Direct utility estimation (DUE)
Adaptive dynamic programming (ADP)
Temporal difference learning (TD)

7 Direct utility estimation
Idea: the utility of a state is the expected total reward from that state onward.
Each trial supplies one or more samples of this value for every visited state.
"Reward to go" (of a state): the sum of the rewards from that state until a terminal state is reached.
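In symbols (the notation G_t is an illustrative choice, not from the slides), the reward to go observed from step t of a trial that ends at step T is

\[ G_t = \sum_{k=t}^{T} \gamma^{\,k-t} R(s_k), \]

and each G_t is a sample of Uπ(s_t), which is why averaging these samples makes sense.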

8 Example
(1,1) -.04→ (1,2) -.04→ (1,3) -.04→ (1,2) -.04→ (1,3) -.04→ (2,3) -.04→ (3,3) -.04→ (4,3) +1
Reward-to-go samples from this trial:
U(1,1) = 0.72
U(1,2) = 0.76, 0.84
U(1,3) = 0.80, 0.88
U(2,3) = 0.92
U(3,3) = 0.96
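As a quick check (with γ = 1), each sample just counts the remaining -0.04 rewards before the final +1; for the two visits to (1,2):

\[ U(1,2) = 1 - 6(0.04) = 0.76 \quad\text{(first visit)}, \qquad U(1,2) = 1 - 4(0.04) = 0.84 \quad\text{(second visit)}. \]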

9 Algorithm
Run over the sequence of states generated by the policy.
Calculate the observed "reward to go" for each visited state.
Keep a running average of the utility of each state in a table, as in the sketch below.
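A minimal Python sketch of direct utility estimation (the trial format and grid-world coordinates are assumptions for illustration; γ = 1 as in the example above):

from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    """Average the observed reward-to-go of every visited state over all trials."""
    totals = defaultdict(float)   # sum of reward-to-go samples per state
    counts = defaultdict(int)     # number of samples per state
    for trial in trials:          # a trial is a list of (state, reward) pairs
        reward_to_go = 0.0
        # walk backwards so the reward-to-go can be accumulated in one pass
        for state, reward in reversed(trial):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# The first trial from slide 4: seven -0.04 steps ending in the +1 terminal state.
trial1 = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((1, 2), -0.04),
          ((1, 3), -0.04), ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), +1.0)]
print(direct_utility_estimation([trial1]))   # e.g. U(1,1) = 0.72, U(3,3) = 0.96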

10 Properties
After an infinite number of trials, the sample average for each state converges to the true expectation.
Advantages: easy to compute; no special actions are needed.
Disadvantage: it is really just an instance of supervised learning (expanded on the next slide).

11 Disadvantage (expanded)
Similarity to supervised learning: each example has an input (the state) and an output (the observed reward to go).
This reduces reinforcement learning to inductive learning.
What is lacking: the dependencies between neighboring states are ignored.
The utility of s = the reward of s + the expected utility of its neighbors (see the equation below).
Because direct utility estimation doesn't use this connection between states, it searches a hypothesis space larger than it needs to, and the algorithm converges very slowly.
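Written out, this neighbor relation is the fixed-policy Bellman equation (standard form; T and R as defined on slide 2):

\[ U^{\pi}(s) = R(s) + \gamma \sum_{s'} T\big(s, \pi(s), s'\big)\, U^{\pi}(s') \]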

12 Example
Second trial:
(1,1) -.04→ (1,2) -.04→ (1,3) -.04→ (2,3) -.04→ (3,3) -.04→ (3,2) -.04→ (3,3) -.04→ (4,3) +1
(3,2) hasn't been seen before, while (3,3) has been visited before and already has a high utility.
Yet direct utility estimation learns about (3,2) only at the end of the sequence; it keeps searching among too many options…

13 Adaptive dynamic programming
Takes advantage of the connections between states: learn the transition model, then solve the Markov decision process.
While running the known policy, the agent:
learns T(s,π(s),s') from the observed sequences,
gets R(s) from the observed states,
and calculates the utilities of the states by plugging T(s,π(s),s') and R(s) into the Bellman equations and solving the resulting linear equations
(or, instead, by using a simplified value iteration).

14 Example
In our three trials, the action Right is performed 3 times in (1,3), and in 2 of these cases the result is (2,3).
So T((1,3), Right, (2,3)) is estimated as 2/3, as in the general estimate below.
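In general this is the maximum-likelihood estimate that the pseudocode on the next slide maintains from the visit counts:

\[ \hat{T}(s, a, s') = \frac{N_{sas'}[s,a,s']}{N_{sa}[s,a]} \]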

15 The algorithm
function PASSIVE_ADP_AGENT(percept) returns an action
  input: percept, a percept indicating the current state s' and reward signal r'
  static: π, a fixed policy
          mdp, an MDP with model T, rewards R, discount γ
          U, a table of utilities, initially empty
          Nsa, a table of frequencies for state-action pairs, initially zero
          Nsas', a table of frequencies for state-action-state triples, initially zero
          s, a, the previous state and action, initially null

  if s' is new then do U[s'] ← r'; R[s'] ← r'
  if s is not null then do
      increment Nsa[s,a] and Nsas'[s,a,s']
      for each t such that Nsas'[s,a,t] is nonzero do
          T[s,a,t] ← Nsas'[s,a,t] / Nsa[s,a]
  U ← VALUE_DETERMINATION(π, U, mdp)
  if TERMINAL?[s'] then s, a ← null else s, a ← s', π[s']
  return a
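A hedged Python sketch of the same agent (the policy/state representations and the simplified value-iteration step used for policy evaluation are illustrative assumptions, not part of the slide):

from collections import defaultdict

class PassiveADPAgent:
    def __init__(self, policy, gamma=1.0, eval_iters=50):
        self.pi, self.gamma, self.eval_iters = policy, gamma, eval_iters
        self.U, self.R = {}, {}                      # utilities and observed rewards
        self.Nsa = defaultdict(int)                  # counts of (s, a)
        self.Nsas = defaultdict(int)                 # counts of (s, a, s')
        self.s, self.a = None, None                  # previous state and action

    def __call__(self, s_new, r_new, terminal=False):
        if s_new not in self.U:                      # first visit: initialize tables
            self.U[s_new] = self.R[s_new] = r_new
        if self.s is not None:                       # update transition counts
            self.Nsa[(self.s, self.a)] += 1
            self.Nsas[(self.s, self.a, s_new)] += 1
        self._policy_evaluation()                    # re-estimate U under pi
        if terminal:
            self.s = self.a = None
        else:
            self.s, self.a = s_new, self.pi[s_new]
        return self.a

    def _T(self, s, a, s2):
        n = self.Nsa[(s, a)]
        return self.Nsas[(s, a, s2)] / n if n else 0.0

    def _policy_evaluation(self):
        # Simplified value iteration for the fixed policy, used here in place of
        # solving the linear equations exactly.
        for _ in range(self.eval_iters):
            for s in self.U:
                a = self.pi.get(s)
                if a is None:                        # terminal states keep U = R
                    continue
                self.U[s] = self.R[s] + self.gamma * sum(
                    self._T(s, a, s2) * self.U[s2] for s2 in self.U)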

16 Properties
Learning the model is itself like supervised learning: the input is a state-action pair and the output is the resulting state.
Learning the model is easy because the environment is fully observable.
The ADP agent does as well as possible, subject to its ability to learn the model, and provides a standard against which other reinforcement learning algorithms can be measured.
Is it good for large state spaces? In backgammon it would mean solving roughly 10^50 equations in 10^50 unknowns.
Disadvantage: a lot of work at each iteration.

17 Performance in 4x3 world

18 Temporal difference learning
The best of both worlds: the constraint equations are satisfied only approximately, but there is no need to solve them for all possible states.
Method: run according to policy π and use the observed transitions to adjust the utilities so that they agree with the constraint equations.

19 Example
As a result of the first trial, Uπ(1,3) = 0.84 and Uπ(2,3) = 0.92.
If the transition (1,3) → (2,3) occurred all the time, we would expect
U(1,3) = -0.04 + U(2,3) = 0.88,
so the current estimate of 0.84 is a bit low and should be increased.

20 In practice
When a transition occurs from s to s', apply the update
Uπ(s) ← Uπ(s) + α (R(s) + γ Uπ(s') − Uπ(s))
where α is the learning-rate parameter.
This is called temporal-difference learning because the update rule uses the difference in utilities between successive states.
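For instance, applying this update to the situation on slide 19 with γ = 1 and an illustrative learning rate of α = 0.1:

\[ U^{\pi}(1,3) \leftarrow 0.84 + 0.1\,\big(-0.04 + 0.92 - 0.84\big) = 0.844 \]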

21 The algorithm
function PASSIVE_TD_AGENT(percept) returns an action
  input: percept, a percept indicating the current state s' and reward signal r'
  static: π, a fixed policy
          U, a table of utilities, initially empty
          Ns, a table of frequencies for states, initially zero
          s, a, r, the previous state, action, and reward, initially null

  if s' is new then do U[s'] ← r'
  if s is not null then do
      increment Ns[s]
      U[s] ← U[s] + α(Ns[s]) (r + γ U[s'] − U[s])
  if TERMINAL?[s'] then s, a, r ← null else s, a, r ← s', π[s'], r'
  return a
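A minimal Python sketch of the same agent (the decaying learning rate α(n) = 1/n anticipates slide 23; the state representation is an illustrative assumption):

from collections import defaultdict

class PassiveTDAgent:
    def __init__(self, policy, gamma=1.0, alpha=lambda n: 1.0 / n):
        self.pi, self.gamma, self.alpha = policy, gamma, alpha
        self.U = {}                       # utility estimates
        self.Ns = defaultdict(int)        # visit counts per state
        self.s = self.a = self.r = None   # previous state, action, reward

    def __call__(self, s_new, r_new, terminal=False):
        if s_new not in self.U:
            self.U[s_new] = r_new
        if self.s is not None:
            self.Ns[self.s] += 1
            # temporal-difference update toward r + gamma * U[s']
            self.U[self.s] += self.alpha(self.Ns[self.s]) * (
                self.r + self.gamma * self.U[s_new] - self.U[self.s])
        if terminal:
            self.s = self.a = self.r = None
        else:
            self.s, self.a, self.r = s_new, self.pi[s_new], r_new
        return self.a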

22 Properties
The update involves only the observed successor s', not all possible successors, so it is efficient over a large number of transitions.
It does not learn the model: the environment supplies the connections between neighboring states in the form of observed transitions.
The average value of Uπ(s) will converge to the correct value.

23 Quality
The average value of Uπ(s) will converge to the correct value.
Moreover, if α is defined as a function α(n) that decreases as the number of times a state has been visited increases, then U(s) itself will converge to the correct value.
We require:
Σn α(n) = ∞  and  Σn α(n)² < ∞
The function α(n) = 1/n satisfies these conditions.

24 Performance in 4x3 world

25 TD vs. ADP
TD:
doesn't learn as fast as ADP and shows higher variability,
but is simpler and requires much less computation per observation,
does not need a model to perform its updates,
and makes each state's estimate agree only with the observed successor (instead of with all successors, as ADP does).
TD can be viewed as a crude yet efficient first approximation to ADP.

26 TD vs. ADP
PASSIVE_TD_AGENT (update step):
  if s' is new then do U[s'] ← r'
  if s is not null then do
      increment Ns[s]
      U[s] ← U[s] + α(Ns[s]) (r + γ U[s'] − U[s])
  if TERMINAL?[s'] then s, a, r ← null else s, a, r ← s', π[s'], r'
  return a

PASSIVE_ADP_AGENT (update step):
  if s' is new then do U[s'] ← r'; R[s'] ← r'
  if s is not null then do
      increment Nsa[s,a] and Nsas'[s,a,s']
      for each t such that Nsas'[s,a,t] is nonzero do
          T[s,a,t] ← Nsas'[s,a,t] / Nsa[s,a]
  U ← VALUE_DETERMINATION(π, U, mdp)
  if TERMINAL?[s'] then s, a ← null else s, a ← s', π[s']
  return a