CS 621 Reinforcement Learning Group 8 Neeraj Bisht Ranjeet Vimal Nishant Suren Naineet C. Patel Jimmie Tete.

Presentation transcript:

CS 621 Reinforcement Learning Group 8 Neeraj Bisht Ranjeet Vimal Nishant Suren Naineet C. Patel Jimmie Tete

Outline • Introduction • Motivation • Passive Learning in a Known Environment • Passive Learning in an Unknown Environment • Active Learning in an Unknown Environment • Exploration • Learning an Action Value Function • Generalization in Reinforcement Learning • Conclusion • References

Introduction • Reinforcement Learning is a sub-area of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward. • RL algorithms attempt to find a policy that maps states of the world to the actions the agent ought to take in those states. • In economics and game theory, RL is considered a boundedly rational interpretation of how equilibrium may arise.

Motivation • Traditional learning methods rely on a teacher. • In RL no correct/incorrect input/output pairs are given. • Rather, feedback is given after each stage. • The feedback for the learning process is called 'reward' or 'reinforcement'. • In RL we examine how an agent can learn from success and failure, from reward and punishment.

The RL framework • The environment is modeled as a finite-state Markov Decision Process (MDP). • The utility U(i) of a state i gives the usefulness of that state. • The agent can begin with knowledge of the environment and the effects of its actions, or it may have to learn this model as well as the utility information.

The RL problem • Rewards can be received in intermediate states or in a terminal state. • Rewards can be a component of the actual utility (e.g. points in a TT match) or they can be hints to the actual utility (e.g. verbal reinforcement). • The agent can be a passive or an active learner.

Passive Learning in a Known Environment Passive Learner: A passive learner simply watches the world going by, and tries to learn the utility of being in various states. Another way to think of a passive learner is as an agent with a fixed policy trying to determine its benefits.

Passive Learning in a Known Environment In passive learning, the environment generates state transitions and the agent perceives them. Consider an agent trying to learn the utilities of the states in the 4 x 3 grid world shown below:

Passive Learning in a Known Environment • The agent can move {North, East, South, West} • Trials terminate on reaching [4,2] or [4,3]

Passive Learning in a Known Environment The agent is provided with a model M_ij giving the probability of a transition from state i to state j.
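
A minimal sketch of how these pieces might be held in tables, assuming the standard 4 x 3 grid world with a wall at (2,2) and the terminal states named above (all names here are illustrative, not from the slides):

```python
# Sketch only: one possible tabular representation of the passive-learning setup.
# States are grid cells (col, row); (2,2) is assumed to be a wall.
states = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != (2, 2)]
terminals = {(4, 3): +1.0, (4, 2): -1.0}

# R[i]: reward of being in state i (non-zero only at the terminal states here).
R = {s: terminals.get(s, 0.0) for s in states}

# M[i][j]: probability of a transition from state i to state j under the fixed
# policy -- given to the agent here, estimated later when the environment is unknown.
M = {s: {} for s in states}
```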

Passive Learning in a Known Environment • The objective is to use this information about rewards to learn the expected utility U(i) associated with each nonterminal state i • Utilities can be learned using three approaches: 1) LMS (least mean squares), 2) ADP (adaptive dynamic programming), 3) TD (temporal difference learning)

Passive Learning in a Known Environment LMS (Least Mean Squares) The agent makes random runs (sequences of random moves) through the environment: [1,1]->[1,2]->[1,3]->[2,3]->[3,3]->[4,3] = +1 [1,1]->[2,1]->[3,1]->[3,2]->[4,2] = -1

Passive Learning in a Known Environment LMS • Collect statistics on the final payoff for each state (e.g. when in [2,3], how often was +1 reached vs -1?) • The learner computes the average for each state, which provably converges to the true expected values (utilities)
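
A minimal sketch of this averaging scheme (function and variable names are illustrative, not from the slides):

```python
from collections import defaultdict

def lms_update(utilities, counts, run, payoff):
    """LMS sketch: credit the run's final payoff to every state visited,
    keeping a running average per state."""
    for state in run:
        counts[state] += 1
        # Incremental mean: U <- U + (payoff - U) / n
        utilities[state] += (payoff - utilities[state]) / counts[state]

U = defaultdict(float)
N = defaultdict(int)

# The two example runs from the slide above: one reaches +1, the other -1.
lms_update(U, N, [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)], +1.0)
lms_update(U, N, [(1, 1), (2, 1), (3, 1), (3, 2), (4, 2)], -1.0)
print(U[(1, 1)])  # 0.0: the average of +1 and -1 after these two runs
```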

Passive Learning in a Known Environment LMS Main drawback: - slow convergence - it takes the agent well over 1,000 training sequences to get close to the correct values

Passive Learning in a Known Environment ADP (Adaptive Dynamic Programming) Uses the value or policy iteration algorithm to calculate exact utilities of states given an estimated model.

Passive Learning in a Known Environment ADP In general: U_{n+1}(i) = R(i) + Σ_j M_ij · U_n(j) - U_n(i) is the utility estimate of state i after the nth iteration, initially set to R(i) - R(i) is the reward of being in state i (often non-zero for only a few end states) - M_ij is the probability of a transition from state i to j
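
A sketch of this value-determination loop in Python (an illustration of the iteration above, not the original implementation; the convergence tolerance is an added assumption):

```python
def adp_value_determination(states, R, M, tol=1e-6):
    """ADP sketch: iterate U(i) <- R(i) + sum_j M[i][j] * U(j) until the
    estimates settle. M[i] maps successor states to transition probabilities
    under the fixed policy."""
    U = {s: R[s] for s in states}          # initially set to R(i), as on the slide
    while True:
        delta = 0.0
        for s in states:
            new_u = R[s] + sum(p * U[j] for j, p in M.get(s, {}).items())
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < tol:
            return U
```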

Passive Learning in a Known Environment ADP Example: consider U(3,3): U(3,3) = 0.33 × U(4,3) + 0.33 × U(2,3) + 0.33 × U(3,2)

Passive Learning in a Known Environment ADP • makes optimal use of the local constraints on utilities of states imposed by the neighborhood structure of the environment • can be intractable for large state spaces

Passive Learning in a Known Environment TD (Temporal Difference Learning) The key is to use the observed transitions to adjust the values of the observed states so that they agree with the constraint equations.

Passive Learning in a Known Environment TD Learning • Suppose we observe a transition from state i to state j, with U(i) = -0.5 and U(j) = +0.5 • This suggests that we should increase U(i) to make it agree better with its successor • This can be achieved using the following update rule (α is the learning rate): U_{n+1}(i) = U_n(i) + α(R(i) + U_n(j) - U_n(i))
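
A sketch of this update rule (the function name and the learning-rate value are illustrative):

```python
def td_update(U, i, j, R, alpha=0.5):
    """TD sketch: after observing a transition i -> j, move U(i) part of the
    way toward R(i) + U(j)."""
    U[i] = U[i] + alpha * (R[i] + U[j] - U[i])
    return U

# The example from the slide: U(i) = -0.5, U(j) = +0.5, zero reward in i.
U = {"i": -0.5, "j": +0.5}
R = {"i": 0.0, "j": 0.0}
td_update(U, "i", "j", R, alpha=0.5)
print(U["i"])  # 0.0: U(i) has been nudged halfway toward R(i) + U(j)
```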

Passive Learning in a Known Environment TD Learning Performance: • runs "noisier" than LMS but has smaller error • deals only with states observed during sample runs (not all states, unlike ADP)

Passive Learning in an Unknown Environment The LMS approach and the TD approach operate unchanged in an initially unknown environment. The ADP approach adds a step that updates an estimated model of the environment.

Passive Learning in an Unknown Environment ADP Approach • The environment model is learned by direct observation of transitions • The environment model M can be updated by keeping track of the percentage of times each state transitions to each of its neighbours
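
A minimal sketch of tracking those percentages (class and method names are assumptions, not from the slides):

```python
from collections import defaultdict

class TransitionModel:
    """Estimate M[i][j] as the fraction of observed transitions out of
    state i that went to state j."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def observe(self, i, j):
        self.counts[i][j] += 1
        self.totals[i] += 1

    def prob(self, i, j):
        return self.counts[i][j] / self.totals[i] if self.totals[i] else 0.0

model = TransitionModel()
model.observe((1, 1), (1, 2))
model.observe((1, 1), (2, 1))
model.observe((1, 1), (1, 2))
print(model.prob((1, 1), (1, 2)))  # 2/3 after these three observed transitions
```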

Passive Learning in an Unknown Environment ADP & TD Approaches • The ADP approach and the TD approach are closely related • Both try to make local adjustments to the utility estimates in order to make each state "agree" with its successors

Passive Learning in an Unknown Environment Minor differences: • TD adjusts a state to agree with its observed successor • ADP adjusts the state to agree with all of the successors Important differences: • TD makes a single adjustment per observed transition • ADP makes as many adjustments as it needs to restore consistency between the utility estimates U and the environment model M

Passive Learning in an Unknown Environment To make ADP more efficient: • directly approximate the algorithm for value iteration or policy iteration • the prioritized-sweeping heuristic makes adjustments to states whose likely successors have just undergone a large adjustment in their own utility estimates Advantages of the approximate ADP: • efficient in terms of computation • eliminates the long value iterations that occur in the early stages

Active Learning in an Unknown Environment An active agent must consider: • what actions to take • what their outcomes may be • how they will affect the rewards received

Active Learning in an Unknown Environment Minor changes to the passive learning agent: • the environment model now incorporates the probabilities of transitions to other states given a particular action • the agent must choose actions to maximize its expected utility • the agent needs a performance element to choose an action at each step

Active Learning in an Unknown Environment Active ADP Approach • need to learn the transition probability M^a_ij instead of M_ij • the input to the function will include the action taken
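
A hedged sketch of the resulting value-iteration step with an action-dependent model M^a_ij (the discount parameter gamma and the tolerance are added assumptions; the slides describe an undiscounted, terminating task):

```python
def active_adp_values(states, actions, R, M, gamma=1.0, tol=1e-6):
    """Value-iteration sketch with an action-dependent model:
    U(i) = R(i) + gamma * max_a sum_j M[a][i][j] * U(j).
    M[a][i] maps successor states j to the estimated probability M^a_ij."""
    U = {s: R[s] for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * U[j] for j, p in M[a].get(s, {}).items())
                for a in actions
            )
            new_u = R[s] + gamma * best
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < tol:
            return U
```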

Active Learning in an Unknown Environment Active TD Approach • the model acquisition problem for the TD agent is identical to that for the ADP agent • the update rule remains unchanged • the TD algorithm will converge to the same values as ADP as the number of training sequences tends to infinity

Exploration Learning also involves the exploration of unknown areas. It is an attempt to learn from self-play.

Exploration An agent can benefit from actions in two ways: • immediate rewards • received percepts

Exploration Wacky Approach vs. Greedy Approach The "wacky" approach acts randomly, in the hope that it will eventually explore the entire environment. The "greedy" approach acts to maximize its utility using current estimates.

Exploration The Exploration Function: a simple example u = expected utility (greed) n = number of times the action has been tried (wacky) R+ = best possible reward
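
The slide names the ingredients but not the function itself; one common form (an assumption here, following the usual textbook treatment rather than anything shown on the slide) is optimistic until an action has been tried enough times:

```python
R_PLUS = 2.0   # optimistic "best possible reward" -- an assumed value
N_E = 5        # how many tries before trusting the estimate -- an assumed value

def exploration_fn(u, n, r_plus=R_PLUS, n_e=N_E):
    """Trade greed (u) against curiosity (n): pretend rarely-tried actions
    lead to the best possible reward until they have been tried n_e times."""
    return r_plus if n < n_e else u

print(exploration_fn(u=0.3, n=2))   # 2.0 -> keep exploring this action
print(exploration_fn(u=0.3, n=10))  # 0.3 -> enough experience, act greedily
```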

Learning an Action-Value Function Q-Values? An action-value function assigns an expected utility to taking a given action in a given state.

Learning an Action-Value Function The Q-Values Formula: U(i) = max_a Q(a, i)

Learning an Action-Value Function The Q-Values Formula Application - just an adaptation of the active learning equation

Learning an Action-Value Function The TD Q-Learning Update Equation - requires no model - calculated after each transition from state i to state j Thus, Q-values can be learned directly from reward feedback.
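
A sketch of the standard TD Q-learning update, Q(a,i) <- Q(a,i) + α(R(i) + max_a' Q(a',j) - Q(a,i)); the helper names and the example values are illustrative:

```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(action, state)] -> estimated utility of that pair

def q_update(Q, a, i, j, reward, actions, alpha=0.1):
    """Model-free TD Q-learning update after taking action a in state i and
    landing in state j; 'reward' plays the role of R(i) on the slide."""
    best_next = max(Q[(a2, j)] for a2 in actions)
    Q[(a, i)] += alpha * (reward + best_next - Q[(a, i)])

def utility(Q, i, actions):
    """U(i) = max_a Q(a, i), as on the earlier slide."""
    return max(Q[(a, i)] for a in actions)

actions = ["North", "East", "South", "West"]
q_update(Q, "East", (3, 3), (4, 3), reward=0.0, actions=actions)
print(utility(Q, (3, 3), actions))
```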

Generalization In Reinforcement Learning Explicit Representation • so far we have assumed that all the functions learned by the agent (U, M, R, Q) are represented in tabular form • an explicit representation involves one output value for each input tuple

Generalization In Reinforcement Learning Explicit Representation • good for small state spaces, but the time to convergence and the time per iteration increase rapidly as the space gets larger • it may be possible to handle 10,000 states or more • this suffices for 2-dimensional, maze-like environments

Generalization In Reinforcement Learning Explicit Representation • Problem: more realistic worlds are out of the question • e.g. chess and backgammon are tiny subsets of the real world, yet their state spaces contain an astronomically large number of states. So it would be absurd to suppose that one must visit all of these states in order to learn how to play the game.

Generalization In Reinforcement Learning Implicit Representation • overcomes the problems of the explicit representation • a form that allows one to calculate the output for any input, but that is much more compact than the tabular form

Generalization In Reinforcement Learning Implicit Representation • For example, an estimated utility function for game playing can be represented as a weighted linear function of a set of board features f_1, …, f_n: U(i) = w_1 f_1(i) + w_2 f_2(i) + … + w_n f_n(i)
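
A sketch of such a weighted linear utility function (the board encoding, features, and weights below are purely hypothetical):

```python
def linear_utility(weights, features, state):
    """U(i) = w1*f1(i) + w2*f2(i) + ... + wn*fn(i)"""
    return sum(w * f(state) for w, f in zip(weights, features))

# Purely hypothetical board features for a string-encoded board.
features = [
    lambda board: board.count("x"),                 # e.g. a material count
    lambda board: 1.0 if "xxx" in board else 0.0,   # e.g. a pattern feature
]
weights = [0.5, 2.0]
print(linear_utility(weights, features, "xox.xx"))  # 0.5*4 + 2.0*0 = 2.0
```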

Generalization In Reinforcement Learning Implicit Representation • Enormous compression: the compactness achieved by an implicit representation allows the learning agent to generalize from states it has visited to states it has not visited • The most important aspect: it allows for inductive generalization over input states • Therefore, such methods are said to perform input generalization

Generalization In Reinforcement Learning Input Generalisation • The cart-pole problem: balancing a long pole upright on top of a moving cart.

Generalization In Reinforcement Learning Input Generalisation • The cart can be jerked left or right by a controller that observes x, x′, θ, and θ′ • the earliest work on learning for this problem was carried out by Michie and Chambers (1968) • their BOXES algorithm was able to balance the pole for over an hour after only about 30 trials.

Generalization In Reinforcement Learning Input Generalisation • The algorithm first discretized the 4-dimensional state space into boxes, hence the name • it then ran trials until the pole fell over or the cart hit the end of the track • negative reinforcement was associated with the final action in the final box and then propagated back through the sequence
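
A sketch of the discretization step only (the bin boundaries below are illustrative assumptions, not Michie and Chambers' actual box boundaries):

```python
import bisect

# Illustrative bin edges for the four state variables -- not the real BOXES bounds.
BIN_EDGES = {
    "x":         [-0.8, 0.8],
    "x_dot":     [-0.5, 0.5],
    "theta":     [-0.1, 0.0, 0.1],
    "theta_dot": [-0.5, 0.5],
}

def box_index(x, x_dot, theta, theta_dot):
    """Map the continuous cart-pole state (x, x', theta, theta') to a discrete
    'box', i.e. a tuple of bin indices."""
    values = {"x": x, "x_dot": x_dot, "theta": theta, "theta_dot": theta_dot}
    return tuple(bisect.bisect(BIN_EDGES[k], values[k])
                 for k in ("x", "x_dot", "theta", "theta_dot"))

print(box_index(0.0, 0.2, -0.05, 0.1))  # e.g. (1, 1, 1, 1)
```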

Generalization In Reinforcement Learning Input Generalisation • The discretization caused some problems when the apparatus was initialized in a different position • improvement: use an algorithm that adaptively partitions the state space according to the observed variation in the reward

Conclusion • Passive Learning in a Known Environment • Passive Learning in an Unknown Environment • Active Learning in an Unknown Environment • Exploration • Learning an Action Value Function • Generalization in Reinforcement Learning

References • Kaelbling, L. P., Littman, M. L. and Moore, A. W. (1996). Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4. (…olume4/kaelbling96a-html/rl-survey.html) • Russell, S. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Prentice Hall.