Reinforcement Learning


Reinforcement Learning
By: Shivika Sodhi

INTRODUCTION
- We study how agents can learn what to do in the absence of labeled examples.
- Example: in chess, a supervised learning agent would need to be told the correct move in every position, but that feedback is seldom available.
- Without some feedback about what is good and what is bad, the agent has no grounds for deciding which move to make. It needs to know that (accidentally) checkmating the opponent is good and getting checkmated is bad.
- This kind of feedback is called a reward, or reinforcement.
- The agent must be hardwired to recognize rewards as distinct from ordinary sensory input.

INTRODUCTION
- Reinforcement learning uses observed rewards to learn an optimal (or near-optimal) policy.
- Example: imagine playing a game whose rules you don't know, and after a few moves your opponent announces "you lose."
- This is a feasible way to train a program to perform at a high level: the program can be told when it has won or lost, and it can use this information to learn an evaluation function that gives accurate estimates.
- Agent designs:
  - Utility-based agent: learns a utility function on states and uses it to select actions that maximize the expected outcome utility.
  - Q-learning agent: learns an action-utility function (Q-function) giving the expected utility of taking a given action in a given state.
  - Reflex agent: learns a policy that maps directly from states to actions.

INTRODUCTION
- A utility-based agent must have a model of the environment in order to make decisions, because it must know the states to which its actions will lead.
- Example: to make use of a backgammon evaluation function, a backgammon program must know what its legal moves are and how they affect the board position, so that it can apply the utility function to the resulting states.
- A Q-learning agent does not need a model of the environment, because it can compare the expected utilities of its available choices without needing to know their outcomes.
- On the other hand, because Q-learning agents do not know where their actions lead, they cannot look ahead.

PASSIVE REINFORCEMENT LEARNING
- The agent's policy π is fixed; the goal is to learn how good the policy is, i.e., to learn the utility function U^π(s).
- The passive learning agent does not know the reward function or the transition model P(s' | s, a) in advance.
- Direct utility estimation: the utility of a state is the expected total reward from that state onward (the expected reward-to-go). Each trial provides a sample of this quantity for each state visited, and at the end of each sequence the algorithm computes the observed reward-to-go for each state.

Direct utility estimation
- This is an instance of supervised learning: the input is a state and the output is the observed reward-to-go.
- It reduces the reinforcement learning problem to an inductive learning problem, about which much is already known.
- However, the utility of each state equals its own reward plus the expected utility of its successor states; the utilities are not independent, and direct utility estimation ignores these constraints.
- As a result, the algorithm converges very slowly. (A short sketch of the procedure follows below.)
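A minimal Python sketch of direct utility estimation, assuming a hypothetical run_trial(policy) helper that executes the fixed policy once and returns the sequence of (state, reward) pairs observed; it simply averages the observed rewards-to-go per state.

```python
from collections import defaultdict

def direct_utility_estimation(run_trial, policy, num_trials=1000, gamma=1.0):
    """Estimate U^pi(s) as the average observed reward-to-go for each visited state."""
    returns = defaultdict(list)          # state -> list of sampled rewards-to-go
    for _ in range(num_trials):
        trajectory = run_trial(policy)   # [(s0, r0), (s1, r1), ..., (sT, rT)]
        reward_to_go = 0.0
        # Walk the trial backwards so each state's sample is built
        # incrementally from its successor's sample.
        for state, reward in reversed(trajectory):
            reward_to_go = reward + gamma * reward_to_go
            returns[state].append(reward_to_go)
    # The utility estimate is the mean of the collected samples per state.
    return {s: sum(v) / len(v) for s, v in returns.items()}
```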

Adaptive Dynamic Programming
- An ADP agent takes advantage of the constraints among the utilities of states by learning the transition model that connects them and solving the corresponding Markov decision process with a dynamic programming method.
- It plugs the learned transition model and the observed rewards into the Bellman equations for the fixed policy, U^π(s) = R(s) + γ Σ_{s'} P(s' | s, π(s)) U^π(s'), to calculate the utilities of the states.
- Because the policy is fixed, these equations are linear.
- An iterative evaluation process can use the previous utility estimates as initial values and should converge quickly. (A sketch of this policy-evaluation step follows.)
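A minimal sketch of the policy-evaluation step an ADP agent might run after each model update. P_hat (estimated transition probabilities), R_hat (observed rewards), and policy are assumed dictionaries built from the agent's experience; the names are hypothetical, and previous utility estimates can be passed in as the starting point.

```python
def evaluate_policy(P_hat, R_hat, policy, gamma=0.9, U=None, tol=1e-6):
    """Solve the linear Bellman equations for a fixed policy by iteration."""
    if U is None:
        U = {s: 0.0 for s in R_hat}      # or reuse the previous estimates
    while True:
        delta = 0.0
        for s in R_hat:
            a = policy[s]
            # Bellman backup for the fixed policy's action in state s.
            new_u = R_hat[s] + gamma * sum(
                p * U[s2] for s2, p in P_hat[s][a].items())
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < tol:
            return U
```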

Temporal Difference learning
- Solving the underlying MDP is not the only way to bring the Bellman equations to bear on the learning problem.
- Instead, the agent can use each observed transition to adjust the utilities of the observed states so that they agree with the constraint equations: U^π(s) ← U^π(s) + α (R(s) + γ U^π(s') - U^π(s)).
- This rule is called the temporal-difference (TD) rule because it uses the difference in utilities between successive states.
- TD does not need a transition model to perform its updates. (A sketch of the update follows.)
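A minimal sketch of one TD(0) update applied after each observed transition (s, reward, s'); the learning rate alpha and discount gamma are assumed hyperparameters, and U is a dictionary of current utility estimates.

```python
def td_update(U, s, reward, s_next, alpha=0.1, gamma=0.9):
    """Nudge U[s] toward the one-step sample: reward + gamma * U[s_next]."""
    U.setdefault(s, 0.0)
    U.setdefault(s_next, 0.0)
    U[s] += alpha * (reward + gamma * U[s_next] - U[s])
    return U
```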

ACTIVE REINFORCEMENT LEARNING
- An active agent must decide for itself what actions to take.
- It therefore needs to learn a complete model with outcome probabilities for all actions, rather than just a model of a fixed policy. (A sketch of how such a model is used follows.)
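A minimal sketch of what an active ADP agent can do once it has learned a complete model: run value iteration over all actions, not just the fixed policy's. P_hat, R_hat, and actions are assumed to come from the agent's experience; the names are hypothetical.

```python
def value_iteration(P_hat, R_hat, actions, gamma=0.9, tol=1e-6):
    """Full Bellman update: U(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')."""
    U = {s: 0.0 for s in R_hat}
    while True:
        delta = 0.0
        for s in R_hat:
            # Maximize over all available actions using the learned model.
            best = max(sum(p * U[s2] for s2, p in P_hat[s][a].items())
                       for a in actions)
            new_u = R_hat[s] + gamma * best
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < tol:
            return U
```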

GENERALIZATION IN REINFORCEMENT LEARNING

POLICY SEARCH

APPLICATIONS OF REINFORCEMENT LEARNING