Reinforcement Learning

Reinforcement Learning
Introduction
Passive Reinforcement Learning
Temporal Difference Learning
Active Reinforcement Learning
Applications
Summary

Introduction
Supervised Learning: Example → Class
Reinforcement Learning: … Situation → Reward → Situation → Reward

Examples
Playing chess: Reward comes at the end of the game.
Ping-pong: Reward on each point scored.
Animals: Hunger and pain are negative rewards; food intake is a positive reward.

Passive Learning
We assume the policy Π is fixed.
In state s we always execute action Π(s).
Rewards are given.

Typical Trials
(1,1) -0.04 → (1,2) -0.04 → (1,3) -0.04 → (1,2) -0.04 → (1,3) -0.04 → … → (4,3) +1
Goal: Use rewards to learn the expected utility $U^\Pi(s)$.

Expected Utility
$U^\Pi(s) = E\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \Pi, s_0 = s \right]$
Expected discounted sum of rewards when the policy is followed.

Example
(1,1) -0.04 → (1,2) -0.04 → (1,3) -0.04 → (2,3) -0.04 → (3,3) -0.04 → (4,3) +1
Total reward (with γ = 1): (-0.04 × 5) + 1 = 0.80
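The arithmetic in this example can be checked with a few lines of Python. This is only a minimal sketch: the function name `discounted_return` and the list-of-rewards representation are illustrative assumptions, not part of the slides.

```python
def discounted_return(rewards, gamma=1.0):
    """Sum of gamma**t * r_t over one trial (t starts at 0)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Rewards observed along the example trial: five steps at -0.04, then +1.
trial_rewards = [-0.04] * 5 + [1.0]

# With gamma = 1 this is the plain sum used on the slide.
total = discounted_return(trial_rewards, gamma=1.0)
print(round(total, 2))  # 0.8
```

With a discount factor below 1, the same function weights later rewards less, which is what the expected utility definition describes.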

Direct Utility Estimation
Convert the problem to a supervised learning problem:
(1,1) → U = 0.72
(2,1) → U = 0.68
…
Learn to map states to utilities.
But utilities are not independent of each other!
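One simple way to realize direct utility estimation is to average the observed reward-to-go for every visit to a state. The sketch below assumes each trial is a list of `(state, reward)` pairs; the function name and that data layout are hypothetical choices made for illustration.

```python
from collections import defaultdict


def direct_utility_estimate(trials, gamma=1.0):
    """Average the observed reward-to-go of each state over all trials."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for trial in trials:
        reward_to_go = 0.0
        # Walk backwards so each state's reward-to-go builds on its successor's.
        for state, reward in reversed(trial):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {state: totals[state] / counts[state] for state in totals}


trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04),
         ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), 1.0)]
estimates = direct_utility_estimate([trial])
print(round(estimates[(1, 1)], 2))  # 0.8
```

Each state is treated as an independent training example here, which is exactly the weakness the slide points out: the estimate for a state never uses the estimates of its successors.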

Bellman Equations
Utility values obey the following equations:
$U^\Pi(s) = R(s) + \gamma \sum_{s'} T(s, s') \, U^\Pi(s')$
Can be solved using dynamic programming.
Assumes knowledge of the model.
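When the model is known, the Bellman equations for a fixed policy can be solved by applying the right-hand side repeatedly until the values stop changing. The following is a minimal sketch under stated assumptions: the three-state chain, its probabilities, and the names `evaluate_policy`, `R`, and `T` are invented for illustration and do not come from the slides.

```python
def evaluate_policy(R, T, gamma=1.0, tol=1e-10, max_iters=10_000):
    """Iterate U(s) = R(s) + gamma * sum_s' T(s, s') U(s') to a fixed point."""
    U = {s: 0.0 for s in R}
    for _ in range(max_iters):
        delta = 0.0
        for s in R:
            new_u = R[s] + gamma * sum(p * U[s2] for s2, p in T[s].items())
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < tol:
            break
    return U


# Hypothetical model: rewards per state and transition probabilities under
# the fixed policy. The terminal state "goal" has no successors.
R = {"a": -0.04, "b": -0.04, "goal": 1.0}
T = {"a": {"b": 0.8, "a": 0.2},
     "b": {"goal": 0.8, "a": 0.2},
     "goal": {}}

U = evaluate_policy(R, T)
print({s: round(u, 4) for s, u in U.items()})
```

Unlike direct utility estimation, every value here is tied to the values of its successors, so a change in one state propagates to the others.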

Temporal Difference Learning
Use the following update rule (the temporal difference equation):
$U^\Pi(s) \leftarrow U^\Pi(s) + \alpha \left[ R(s) + \gamma U^\Pi(s') - U^\Pi(s) \right]$
α is the learning rate.
No model assumption.

Example
U(1,3) = 0.84
U(2,3) = 0.92
After the transition (1,3) → (2,3), applying the update with α = 1 and γ = 1 gives:
U(1,3) ← 0.84 + [-0.04 + U(2,3) - U(1,3)]
U(1,3) ← 0.84 + (-0.04 + 0.92 - 0.84) = 0.88
The current value of 0.84 is a bit low, so we must increase it.
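The update in this example can be written as a short function. This is a sketch, not a full learning agent: the name `td_update` and the dictionary used to hold utility estimates are assumptions made for illustration.

```python
def td_update(U, s, reward, s_next, alpha=0.1, gamma=1.0):
    """Move U[s] toward reward + gamma * U[s_next] by a fraction alpha."""
    td_error = reward + gamma * U[s_next] - U[s]
    U[s] += alpha * td_error
    return U[s]


# Values from the example: one observed transition (1,3) -> (2,3), reward -0.04.
U = {(1, 3): 0.84, (2, 3): 0.92}
new_value = td_update(U, (1, 3), -0.04, (2, 3), alpha=1.0, gamma=1.0)
print(round(new_value, 2))  # 0.88
```

With a smaller learning rate the estimate moves only part of the way; for instance, α = 0.5 would move 0.84 halfway toward 0.88.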

Considerations
Update values toward the equilibrium given by the Bellman equations.
Each update involves only the observed successor state.
Over many trials the updates converge toward the true utilities of the policy being followed.

Other Heuristics
Prioritized Sweeping: Prefer to adjust states whose most probable successors have just undergone a large adjustment in their utility estimates.

Richard Sutton
Author of the classic textbook “Reinforcement Learning: An Introduction” by Sutton and Barto, MIT Press, 1998.
Dept. of Computer Science, University of Alberta

Active Reinforcement Learning
Now we must decide what actions to take.
Optimal policy: Choose the action with the highest utility value.
Is that the right thing to do?

Active Reinforcement Learning
No! Sometimes we may get stuck in suboptimal solutions.
Exploration vs. Exploitation Tradeoff
Why is this important? The learned model is not the same as the true environment.

Explore vs. Exploit
Exploitation: Maximize reward according to the current utility estimates.
Exploration: Maximize long-term well-being by trying actions whose outcomes are still uncertain.

Bandit Problem
An n-armed bandit has n levers.
Which lever should we play to maximize reward?
In genetic algorithms, the selection strategy can be viewed as allocating coins optimally, given an appropriate set of assumptions.
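A small simulation can show why a purely greedy choice gets stuck on a bandit. Note that the slides do not name a specific strategy; the sketch below swaps in ε-greedy (explore a random lever with probability ε) as one common option. The payouts, the function name `run_bandit`, and the deterministic rewards are all assumptions chosen to keep the example short.

```python
import random


def run_bandit(payouts, steps, epsilon, seed=0):
    """Play a bandit with epsilon-greedy selection and sample-average estimates."""
    rng = random.Random(seed)
    estimates = [0.0] * len(payouts)
    pulls = [0] * len(payouts)
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(payouts))  # explore: pick any lever
        else:
            # exploit: pick the lever with the highest current estimate
            arm = max(range(len(payouts)), key=lambda a: estimates[a])
        reward = payouts[arm]  # deterministic payout, an illustrative simplification
        pulls[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return estimates, pulls


payouts = [0.3, 0.8, 0.5]  # hypothetical levers; the second one is best
greedy_est, greedy_pulls = run_bandit(payouts, steps=1000, epsilon=0.0)
eps_est, eps_pulls = run_bandit(payouts, steps=1000, epsilon=0.1)
print(greedy_pulls)  # [1000, 0, 0]: the greedy player never leaves the first lever
print(eps_pulls)
```

The greedy player tries the first lever, sees a positive payout, and never learns that a better lever exists. The exploring player gives up a little reward on random pulls but discovers the better lever.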

Solution
$U^+(s) \leftarrow R(s) + \gamma \max_a f(u, N(a, s))$
$U^+(s)$: optimistic estimate of utility.
$N(a, s)$: number of times action a has been tried in state s.
$f(u, n)$: exploration function, increasing in u (exploitation) and decreasing in n (exploration).
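One simple exploration function with these properties returns an optimistic reward for any action tried fewer than a fixed number of times, and the actual estimate otherwise. The sketch below is an assumption-laden illustration: the constants `optimistic_reward` and `min_tries`, the function names, and the per-action dictionaries are hypothetical, and u is taken to be the current utility estimate for taking action a in state s.

```python
def exploration_fn(u, n, optimistic_reward=1.0, min_tries=5):
    """Return an optimistic value until the action has been tried min_tries times."""
    return optimistic_reward if n < min_tries else u


def optimistic_utility(reward, gamma, action_values, counts):
    """U+(s) = R(s) + gamma * max_a f(u_a, N(a, s)) for a single state s."""
    best = max(exploration_fn(action_values[a], counts[a]) for a in action_values)
    return reward + gamma * best


# Hypothetical state: "up" is well explored, "left" has been tried only once.
action_values = {"up": 0.6, "left": 0.2}
counts = {"up": 10, "left": 1}
u_plus = optimistic_utility(-0.04, 1.0, action_values, counts)
print(round(u_plus, 2))  # 0.96
```

Because "left" has been tried only once, it is valued at the optimistic reward rather than its low estimate, which makes the agent want to try it again before settling on "up".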

Applications: Game Playing
Checkers-playing program by Arthur Samuel (IBM).
Update rule: Change the weights by the difference between the current state's evaluation and the backed-up value obtained by generating a full look-ahead tree.

Applications: Robot Control
Cart-pole balancing problem.
Control the cart position x so that the pole stays roughly upright.

Summary
The goal is to learn utility values and an optimal mapping from states to actions.
Direct Utility Estimation ignores dependencies among states.
We must follow the Bellman Equations.
Temporal difference learning updates values to match those of successor states.
Active reinforcement learning learns …

Video: What is machine learning?
http://www.youtube.com/watch?v=YQIMGV5vtd4