Kunstmatige Intelligentie / RuG KI2 - 11 Reinforcement Learning Sander van Dijk.

Presentation transcript:

Kunstmatige Intelligentie / RuG KI Reinforcement Learning Sander van Dijk

What is Learning? Percepts received by an agent should be used not only for acting, but also for improving the agent's ability to behave optimally in the future to achieve its goal. Learning arises from the interaction between an agent and the world.

Learning Types. Supervised learning: (input, output) pairs of the function to be learned can be perceived or are given, e.g. back-propagation. Unsupervised learning: no information at all about the desired output, e.g. self-organizing maps (SOM). Reinforcement learning: the agent receives no examples and starts with no model of the environment and no utility function; it gets feedback through rewards, or reinforcement.

Reinforcement Learning Task. Learn how to behave successfully to achieve a goal while interacting with an external environment; learn through experience, from trial and error. Examples: in game playing, the agent knows whether it has won or lost, but not the appropriate action in each state; in control, a traffic system can measure the delay of cars, but does not know how to decrease it.

Elements of RL. Transition model δ: how actions influence states. Reward function r: the immediate value of a state-action transition. Policy π: maps states to actions. (Diagram: the agent and the environment interact in a loop of state, reward and action, governed by the policy.)

Elements of RL. (Grid-world figure: r(state, action) immediate reward values; G marks the goal state.)

Elements of RL. Value function: maps states to state values, using a discount factor γ ∈ [0, 1) (here 0.9): V^π(s) = r_t + γ r_{t+1} + γ² r_{t+2} + … (Grid-world figures: r(state, action) immediate reward values and the resulting V*(state) values; G marks the goal state.)

RL task (restated). Execute actions in the environment, observe the results. Learn an action policy π : state → action that maximizes the expected discounted reward E[r(t) + γ r(t+1) + γ² r(t+2) + …] from any starting state in S.
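To make the discounted-reward objective concrete, here is a minimal Python sketch (not part of the original slides) that computes the discounted return of one sampled reward sequence; the reward values are made up purely for illustration.

```python
# Minimal sketch: computing a return r(t) + gamma*r(t+1) + gamma^2*r(t+2) + ...
# The reward sequence in the usage example is hypothetical.

def discounted_return(rewards, gamma=0.9):
    """Sum of rewards discounted by gamma, earliest reward first."""
    g = 0.0
    for r in reversed(rewards):   # work backwards: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([-0.04, -0.04, -0.04, 1.0]))  # small step costs, then a goal reward
```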

Reinforcement Learning. The target function is π : state → action. However… we have no training examples of the form <state, action>; training examples are of the form <<state, action>, reward>.

Utility-based agents. Try to learn V^{π*} (abbreviated V*) and perform look-ahead search to choose the best action from any state s. This works well if the agent knows the transition function δ : state × action → state and the reward function r : state × action → ℝ. When the agent does not know δ and r, it cannot choose actions this way.

Q-values. Define a new function very similar to V*: Q(s, a) = r(s, a) + γ V*(δ(s, a)). If the agent learns Q, it can choose the optimal action even without knowing δ or r: using Q, the optimal policy is π*(s) = argmax_a Q(s, a).
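As a small illustration (not from the slides), the policy that "uses Q" is just an argmax over a learned Q-table; the table layout (a dict keyed by (state, action)) and the action set are assumptions of this sketch.

```python
# Sketch: greedy policy pi*(s) = argmax_a Q(s, a), read off a tabular Q.
# The dict-of-(state, action) layout and the action names are illustrative assumptions.

ACTIONS = ["up", "down", "left", "right"]

def greedy_action(q_table, state):
    """Return the action with the highest learned Q-value in this state."""
    return max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))
```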

Learning the Q-value. Note: Q and V* are closely related: V*(s) = max_{a'} Q(s, a'). This allows us to write Q recursively as Q(s, a) = r(s, a) + γ max_{a'} Q(δ(s, a), a'), which is the basis of Temporal Difference learning.

Learning the Q-value.
FOR each <s, a> DO: initialize the table entry Q̂(s, a) ← 0.
Observe the current state s.
WHILE (true) DO:
  Select an action a and execute it.
  Receive the immediate reward r.
  Observe the new state s'.
  Update the table entry: Q̂(s, a) ← r + γ max_{a'} Q̂(s', a').
  Move: record the transition from s to s' (s ← s').
A minimal Python sketch of this loop follows.
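Below is a minimal tabular Q-learning sketch of the loop above. The environment interface (env.reset(), env.step(action) returning next state, reward and a done flag) and the learning rate alpha (added for the general, stochastic case; the slide uses the deterministic update) are assumptions of this example, not specified by the slides.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning; assumes a Gym-style env with reset()/step()."""
    q = defaultdict(float)                      # Q-hat(s, a), initialized to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Select action a and execute it (epsilon-greedy; see the exploration slide)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: q[(s, x)])
            s_next, r, done = env.step(a)       # receive reward r, observe new state s'
            # Update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            best_next = max(q[(s_next, x)] for x in actions)
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s_next                          # move: the transition from s to s'
    return q
```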

Q-learning. (Grid-world figures: r(state, action) immediate reward values, Q(state, action) values and V*(state) values; G marks the goal state.) Q-learning learns the expected utility of taking a particular action a in a particular state s, i.e. the Q-value of the pair (s, a).

Representation.
Explicit: a lookup table of Q-values, for example:
State | Action    | Q(s, a)
2     | MoveLeft  | 81
2     | MoveRight | 100
...
Implicit: a weighted linear function or neural network, learned by classical weight updating.
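As a sketch of the implicit representation mentioned on the slide (a weighted linear function), the Q-value can be approximated as a dot product of weights and features, with the weights nudged toward a TD target; the feature function and the numeric encoding of state and action are placeholder assumptions for illustration.

```python
import numpy as np

def features(state, action):
    """Placeholder feature vector phi(s, a); assumes numeric encodings, problem-specific in practice."""
    return np.array([1.0, float(state), float(action), float(state) * float(action)])

class LinearQ:
    """Q(s, a) ~ w . phi(s, a), trained by classical weight updating (a gradient step on the TD error)."""
    def __init__(self, n_features=4, alpha=0.01):
        self.w = np.zeros(n_features)
        self.alpha = alpha

    def value(self, state, action):
        return float(self.w @ features(state, action))

    def update(self, state, action, target):
        # Move the weights toward the TD target, e.g. target = r + gamma * max_a' Q(s', a')
        error = target - self.value(state, action)
        self.w += self.alpha * error * features(state, action)
```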

Exploration. The agent follows the policy deduced from the learned Q-values, so it always performs the same action in a given state; but perhaps there is an even better action? Exploration trades off being safe against learning more, greed against curiosity. It is extremely hard, if not impossible, to obtain an optimal exploration policy. A common heuristic, sketched below: randomly try actions that have not been tried often before, but avoid actions that are believed to be of low utility.
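One simple way to implement this heuristic (a sketch, not prescribed by the slides) is to rank actions by their learned Q-value plus an exploration bonus that shrinks with the number of times the state-action pair has been tried; the dict layouts and the bonus schedule are assumptions of this example.

```python
def exploration_value(q, n, bonus=1.0):
    """Optimistic estimate: learned utility q plus a bonus that decays with the visit count n."""
    return q + bonus / (1 + n)

def select_action(q_table, counts, state, actions):
    """Prefer high-Q actions, but give rarely tried actions a chance."""
    return max(actions,
               key=lambda a: exploration_value(q_table.get((state, a), 0.0),
                                               counts.get((state, a), 0)))
```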

Enhancement: Q(λ). Q-learning estimates a one-time-step difference; why not look ahead n steps?

Enhancement: Q(λ). The Q(λ) formula combines estimates from various look-ahead distances using a constant 0 ≤ λ ≤ 1: Q^λ(s_t, a_t) = (1 − λ) [Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + …], where Q^(n) is the n-step estimate (note the normalization factor (1 − λ)).

Enhancement: Eligibility Traces. Look backward instead of forward: weigh updates by an eligibility trace e(s, a). On each step, decay all traces by γλ and increment the trace for the current state-action pair by 1; then update all state-action pairs in proportion to their eligibility.
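A minimal sketch of one step of this trace-based update (a Sarsa(λ)-style variant, assuming tabular Q and traces stored in dicts; the decay by γλ and the increment-by-1 rule follow the slide, the rest are assumptions):

```python
def trace_update(q, e, s, a, td_error, alpha=0.1, gamma=0.9, lam=0.8):
    """One step of an eligibility-trace update.

    q and e are dicts keyed by (state, action); td_error is
    r + gamma * Q(s', a') - Q(s, a), computed by the caller.
    """
    e[(s, a)] = e.get((s, a), 0.0) + 1.0   # increment the trace for the current pair
    for key in list(e.keys()):
        q[key] = q.get(key, 0.0) + alpha * td_error * e[key]  # update in proportion to eligibility
        e[key] *= gamma * lam              # decay all traces by gamma * lambda
```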

Genetic algorithms. Imagine the individuals as agent functions and the fitness function as the performance measure or reward function. No attempt is made to learn the relationship between the rewards and the actions taken by an agent; the algorithm simply searches directly in the space of individuals to find one that maximizes the fitness function.

Genetic algorithms. Represent an individual as a binary string. Selection works like this: if individual X scores twice as high as Y on the fitness function, then X is twice as likely to be selected for reproduction as Y. Reproduction is accomplished by crossover and mutation; a toy sketch follows.
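A small Python sketch of the scheme on this slide: binary-string individuals, fitness-proportional selection, one-point crossover and bit-flip mutation. The population size, rates and the example fitness function are illustrative assumptions.

```python
import random

def evolve(fitness, length=16, pop_size=20, generations=50, p_mut=0.01):
    """Toy genetic algorithm over binary strings; fitness maps a list of 0/1 bits to a positive number."""
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]
        # Fitness-proportional selection: twice the fitness means twice the chance of parenthood.
        parents = random.choices(pop, weights=scores, k=pop_size)
        next_pop = []
        for i in range(0, pop_size, 2):
            a, b = parents[i], parents[i + 1]
            cut = random.randint(1, length - 1)                 # one-point crossover
            child1, child2 = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for child in (child1, child2):
                for j in range(length):
                    if random.random() < p_mut:                 # mutation: flip a bit
                        child[j] = 1 - child[j]
                next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)

# Hypothetical usage: maximize the number of ones in the string (+1 keeps selection weights positive).
best = evolve(fitness=lambda ind: sum(ind) + 1)
```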

Cart-pole balancing: demonstration.

Summary. RL addresses the problem of learning control strategies for autonomous agents. TD algorithms learn by iteratively reducing the differences between the estimates produced by the agent at different times. In Q-learning, an evaluation function over states and actions is learned. In the genetic approach, the relation between rewards and actions is not learned; one simply searches the space of individuals for one that maximizes the fitness function.