Artificial Intelligence Ch 21: Reinforcement Learning


Artificial Intelligence Ch 21: Reinforcement Learning Rodney Nielsen

Reinforcement Learning
Introduction
Passive Reinforcement Learning
Active Reinforcement Learning
Generalization in Reinforcement Learning
Policy Search
Applications of Reinforcement Learning

Introduction
Supervised learning requires a label for every example (percept).
What if we only learn whether we were successful after a series of (state, action, percept) triples?
No a priori model of the environment or of the reward function.
For example, we receive feedback/a reward (positive: "you win", or more likely negative: "you lose") only at the end of a new game we are learning, or perhaps a reward whenever a point is scored.

Example: Chess
Supervised Learning: Create labeled examples for numerous representative board positions. Each labeled example is a feature vector representing the state of the board, with a label indicating what move to make.
Reinforcement Learning: Play a game, receive a "reward" at the end for winning or losing, and adjust all executed policy actions accordingly.

Example: Robot Grasping
Supervised Learning: Create labeled examples for numerous representative states. State: location, orientation, temperature, ability, operating characteristics, etc. of the body, arm, hand, legs, head, etc., and of the object.
Reinforcement Learning: Try to grasp the object, receive a positive (negative) reward at the terminal state for success (failure), and adjust all executed policy actions accordingly. Or receive partial rewards for getting closer.
https://www.youtube.com/watch?v=SbL7ICP-Fx0&index=20&list=PL5nBAYUyJTrM48dViibyi68urttMlUv7e

Example: Helicopter Maneuvers
Extremely difficult to program, but…
Reinforcement Learning Feedback:
Crashing (very negative)
Shaking (moderate negative)
Unstable (moderate negative)
Inconsistent with goal (modest negative)
https://www.youtube.com/watch?v=VCdxqn0fcnE

Example: Humanoid Robot Soccer Goal Kicking
Two state features: the x-coordinate of the ball in the camera image, and the number of mm the foot is shifted out from the hip.
Three actions: shift leg out, shift leg in, kick.
Rewards: -1 per shift action; -2 for missing; -20 for falling; +20 for scoring.
https://www.youtube.com/watch?v=mRpX9DFCdwI&list=PL5nBAYUyJTrM48dViibyi68urttMlUv7e&index=12
https://www.youtube.com/watch?v=lwc-TYT0tbg
https://www.youtube.com/watch?v=eHFg3RVHWjM
https://www.youtube.com/watch?v=QdQL11uWWcI

Boston Dynamics https://www.youtube.com/watch?v=W1czBcnX1Ww https://www.youtube.com/watch?v=-h5qpXO3isM https://www.youtube.com/watch?v=mXI4WWhPn-U

Passive Reinforcement Learning
π, the agent's policy, does not change: π(s) is a fixed action for each state s.
The agent does not know the transition model P(s'|s,a) or the reward function R(s).
Percepts: the current state s and its reward R(s).
E.g., (1,1)-.04 ~ (1,2)-.04 ~ … ~ (4,3)+1

Passive Reinforcement Learning
π(s) is static; no P(s'|s,a) or R(s); percepts: s, R(s).
Goal: learn the expected utility Uπ(s).
Bellman equation for a fixed policy: Uπ(s) = R(s) + γ Σs' P(s'|s, π(s)) Uπ(s')
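
A minimal sketch of evaluating Uπ with the fixed-policy Bellman equation by repeated sweeps. For illustration it assumes the model is known; the tiny MDP (states, rewards, transitions, γ) is hypothetical.

```python
# Sketch: fixed-policy Bellman evaluation on a hypothetical 3-state MDP.
# U[s] converges to Upi(s) = R(s) + gamma * sum_s' P(s'|s, pi(s)) * U[s'].

gamma = 0.9                                      # discount factor (assumed)
R = {"A": -0.04, "B": -0.04, "G": 1.0}           # hypothetical rewards
pi = {"A": "right", "B": "right", "G": None}     # fixed policy; G is terminal
# P[(s, a)] maps successor state -> probability (hypothetical transition model)
P = {("A", "right"): {"B": 0.9, "A": 0.1},
     ("B", "right"): {"G": 0.9, "B": 0.1}}

U = {s: 0.0 for s in R}
for _ in range(100):                             # iterate until (nearly) converged
    for s in R:
        if pi[s] is None:                        # terminal: utility is just the reward
            U[s] = R[s]
            continue
        U[s] = R[s] + gamma * sum(p * U[s2] for s2, p in P[(s, pi[s])].items())

print(U)                                         # expected utilities under the fixed policy
```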

Passive Reinforcement Learning: Temporal-Difference Learning
TD equation: Uπ(s) ← Uπ(s) + α ( R(s) + γ Uπ(s') − Uπ(s) )
α is the learning rate.
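
A minimal sketch of applying the TD update along a single observed transition; the states, reward, and parameter values here are hypothetical.

```python
# Sketch: one temporal-difference update after observing s -> s' with reward R(s).
# Upi(s) <- Upi(s) + alpha * (R_s + gamma * Upi(s') - Upi(s))

def td_update(U, s, reward_s, s_next, alpha=0.1, gamma=0.9):
    """Update the utility estimate of s from one observed transition."""
    U[s] = U.get(s, 0.0) + alpha * (reward_s + gamma * U.get(s_next, 0.0) - U.get(s, 0.0))

U = {}                                            # utility estimates, initially empty
td_update(U, s=(1, 1), reward_s=-0.04, s_next=(1, 2))
print(U[(1, 1)])
```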

Active Reinforcement Learning
π, the agent's policy, must be learned.
Must learn a complete model, as in the Passive-ADP-Agent: the transition model P(s'|s,a).
Must learn the optimal action a.
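
A minimal sketch of how an ADP-style agent might estimate the transition model P(s'|s,a) from experience, simply by counting observed outcomes; the recorded transitions are hypothetical.

```python
# Sketch: maximum-likelihood estimate of the transition model from counts.
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))    # counts[(s, a)][s'] = times observed

def record(s, a, s_next):
    counts[(s, a)][s_next] += 1

def P(s, a, s_next):
    """Estimated P(s'|s,a) = N(s,a,s') / N(s,a)."""
    total = sum(counts[(s, a)].values())
    return counts[(s, a)][s_next] / total if total else 0.0

# Hypothetical experience: action "right" in (1,1) usually leads to (1,2).
record((1, 1), "right", (1, 2))
record((1, 1), "right", (1, 2))
record((1, 1), "right", (2, 1))
print(P((1, 1), "right", (1, 2)))                 # ~0.67
```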

Active Reinforcement Learning
Learn the optimal action a: exploration vs. exploitation.
Exploitation (greedy agent): maximize reward under the current policy; likely to stick roughly to the first actions that eventually led to success, e.g., (1,1)-.04 ~ (2,1)-.04 ~ (3,1)-.04 ~ (3,2)-.04 ~ (3,3)-.04 ~ (4,3)+1
Exploration: test policies assumed to be suboptimal.
Stay in the comfort zone vs. seek a better life.

Active Reinforcement Learning
Learn the optimal action a.
f(u,n): the exploration function; greed (preference for high utility u) is traded off against curiosity (preference for rarely tried actions, low n).
A simple choice: f(u,n) = R+ if n < Ne, otherwise u.
R+: optimistic estimate of the best possible reward.
Ne: constant parameter; the agent will try each state–action pair at least Ne times.
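
A minimal sketch of the optimistic exploration function described above; the values chosen for R+ and Ne are hypothetical.

```python
# Sketch: optimistic exploration function f(u, n).
# Returns the optimistic reward estimate R+ until the state-action pair has been
# tried at least Ne times; afterwards it returns the learned utility estimate u.

R_PLUS = 1.0    # optimistic estimate of the best possible reward (assumed)
N_E = 5         # minimum number of tries per state-action pair (assumed)

def f(u, n, r_plus=R_PLUS, n_e=N_E):
    return r_plus if n < n_e else u

print(f(u=0.3, n=2))   # -> 1.0: still exploring this pair
print(f(u=0.3, n=7))   # -> 0.3: enough tries, use the learned utility
```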

Active Reinforcement Learning: Learning an Action-Utility Function (Q-Learning)
Q(s,a): the value of taking action a in state s.
TD agents that learn a Q-function do not need a model of P(s'|s,a), either for learning or for action selection.
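
A minimal sketch of the model-free Q-learning update, Q(s,a) ← Q(s,a) + α ( R(s) + γ maxa' Q(s',a') − Q(s,a) ), applied to one observed transition; the states, actions, reward, and parameters are hypothetical.

```python
# Sketch: one Q-learning update after observing (s, a, reward, s').
# Q(s,a) <- Q(s,a) + alpha * (R_s + gamma * max_a' Q(s',a') - Q(s,a))
from collections import defaultdict

Q = defaultdict(float)                            # Q-values, default 0.0
ACTIONS = ["up", "down", "left", "right"]

def q_update(s, a, reward_s, s_next, alpha=0.1, gamma=0.9):
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)   # model-free backup target
    Q[(s, a)] += alpha * (reward_s + gamma * best_next - Q[(s, a)])

q_update(s=(1, 1), a="right", reward_s=-0.04, s_next=(2, 1))
print(Q[((1, 1), "right")])
```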