1 Reinforcement Learning: Learning algorithms Function Approximation Yishay Mansour Tel-Aviv University.

2 Outline
Week I: Basics
–Mathematical Model (MDP)
–Planning: value iteration, policy iteration
Week II: Learning Algorithms
–Model based
–Model free
Week III: Large state space

3 Learning Algorithms
Given access only to actions, perform:
1. Policy evaluation.
2. Control: find the optimal policy.
Two approaches:
1. Model based (Dynamic Programming).
2. Model free (Q-Learning, SARSA).

4 Learning: Policy improvement
Assume that, given a policy π, we can compute the V and Q functions of π.
We can then perform policy improvement:
–π' = Greedy(Q)
The process converges if the estimations are accurate.

5 Learning - Model Free Optimal Control: off-policy
Learn online the Q function:
$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha A_t$
OFF-POLICY: Q-Learning
$A_t = r_t + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t)$
Maximization operator!
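
A minimal tabular sketch of the Q-Learning update on this slide may help make the maximization step concrete. The dictionary-based table, the function name `q_learning_update`, and the default values for the step size `alpha` and discount `gamma` are illustrative assumptions, not part of the lecture.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one off-policy update in place and return the error term A_t."""
    # Maximization over the actions available in the next state.
    best_next = max(Q[(s_next, b)] for b in actions)
    A = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * A
    return A

# Hypothetical two-action problem; all Q values start at zero.
Q = defaultdict(float)
A = q_learning_update(Q, s=0, a="right", r=1.0, s_next=1,
                      actions=["left", "right"])
```

Note that the update never looks at which action is actually taken in `s_next`; that is what makes it off-policy.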

6 Learning - Model Free Policy evaluation: TD(0)
An online view: at state $s_t$ we performed action $a_t$, received reward $r_t$ and moved to state $s_{t+1}$.
Our “estimation error” is $A_t = r_t + \gamma V_t(s_{t+1}) - V_t(s_t)$.
The update: $V_{t+1}(s_t) = V_t(s_t) + \alpha A_t$
No maximization over actions!
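
The TD(0) update can be sketched in a few lines. This is only an illustration under stated assumptions: the value table is a plain dictionary, unseen states default to 0, and the name `td0_update` and the default `alpha` and `gamma` are hypothetical.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update in place and return the estimation error A_t."""
    A = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * A
    return A

# Hypothetical transition s0 -> s1 with reward 1.
V = {"s1": 0.5}
A = td0_update(V, "s0", 1.0, "s1")
```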

7 Learning - Model Free Optimal Control: on-policy
Learn online the optimal $Q^*$ function:
$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha [r_t + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)]$
ON-POLICY: SARSA
$a_{t+1}$ is chosen by the $\epsilon$-greedy policy for $Q_t$.
The policy selects the action!
Need to balance exploration and exploitation.
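
The following sketch shows one way the on-policy update and the ε-greedy action choice could fit together. The helper names `epsilon_greedy` and `sarsa_update`, the dictionary table, and the parameter defaults are assumptions made for illustration.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, epsilon, rng):
    """With probability epsilon explore; otherwise take the greedy action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Update toward the value of the action the policy actually selected."""
    A = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * A
    return A

rng = random.Random(0)
actions = ["left", "right"]
Q = defaultdict(float)
Q[(1, "right")] = 2.0
# epsilon=0.0 disables exploration here, so the choice is purely greedy.
a_next = epsilon_greedy(Q, 1, actions, epsilon=0.0, rng=rng)
A = sarsa_update(Q, s=0, a="left", r=1.0, s_next=1, a_next=a_next)
```

In practice `epsilon` would be positive, which is where the exploration/exploitation balance enters.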

8 Modified Notation
Rather than $Q(s,a)$ have $Q_a(s)$.
Greedy(Q) = $\arg\max_a Q_a(s)$
Each action has a function $Q_a(s)$.
Learn each $Q_a(s)$ independently!

9 Large state space
Reduce number of states:
–Symmetries (X-O)
–Cluster states
Define attributes:
–Limited number of attributes
–Some states will be identical

10 Example X-O
For each action (square):
–Consider the row/diagonal/column through it
–The state will encode the status of the “rows”: two X’s, two O’s, mixed (both X and O), one X, one O, empty
–Only three types of squares/actions
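
As a rough illustration of the encoding described for X-O, the sketch below maps an empty square to the statuses of the lines through it. The board layout (a list of 9 cells using "X", "O" and "."), the status labels, and the function names are assumptions; the code also assumes the square is empty, so no line through it holds more than two marks.

```python
# All rows, columns and diagonals of the 3x3 board, as cell indices.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def line_status(board, line):
    """Classify one line by which marks it contains."""
    xs = sum(board[i] == "X" for i in line)
    os_ = sum(board[i] == "O" for i in line)
    if xs and os_:
        return "mixed"
    if xs:
        return "two X" if xs == 2 else "one X"
    if os_:
        return "two O" if os_ == 2 else "one O"
    return "empty"

def square_attributes(board, square):
    """Sorted statuses of every line passing through an empty square."""
    return sorted(line_status(board, line) for line in LINES if square in line)

board = list("XX..O....")
attrs = square_attributes(board, 2)
# The left-right mirror image of the same position, queried at the mirrored square.
mirrored_attrs = square_attributes(list(".XX.O...."), 0)
```

Because the attributes depend only on line statuses, the mirrored position gets the same representation, which is the kind of state merging the slide describes.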

11 Clustering states
Need to create attributes.
Attributes should be “game dependent”.
Different “real” states can share the same representation.
How do we run?
–We estimate the action values.
–Consider only legal actions.
–Play the “best” action.

12 Function Approximation
Use a limited model for $Q_a(s)$.
Have an attribute vector:
–Each state $s$ has a vector $\mathrm{vec}(s) = (x_1, \dots, x_k)$
–Normally $k \ll |S|$
Examples:
–Neural network
–Decision tree
–Linear function: weights $\theta = (\theta_1, \dots, \theta_k)$, value $\sum_i \theta_i x_i$

13 Gradient Descent
Minimize squared error:
–Squared error $= \frac{1}{2} \sum_s P(s) [V^\pi(s) - V_\theta(s)]^2$
–$P(s)$ is a weighting on the states
Algorithm:
–$\theta(t+1) = \theta(t) + \alpha [V^\pi(s_t) - V_{\theta(t)}(s_t)] \nabla_{\theta(t)} V_{\theta(t)}(s_t)$
–$\nabla_{\theta(t)}$ = vector of partial derivatives
–Replace $V^\pi(s_t)$ by a sample:
  Monte Carlo: use $R_t$ for $V^\pi(s_t)$
  TD(0): use $A_t$ for $[V^\pi(s_t) - V_{\theta(t)}(s_t)]$
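
The step from the squared error to the update rule is not spelled out on the slide. One way to fill it in, assuming the visited states $s_t$ are drawn according to the weighting $P$ (an assumption not stated in the lecture), is the following short derivation.

```latex
\begin{align*}
E(\theta) &= \tfrac{1}{2} \sum_s P(s)\,[V^\pi(s) - V_\theta(s)]^2 \\
\nabla_\theta E(\theta) &= -\sum_s P(s)\,[V^\pi(s) - V_\theta(s)]\,\nabla_\theta V_\theta(s) \\
% Single-sample estimate of the gradient at s_t, then a step against it:
\theta(t+1) &= \theta(t) + \alpha\,[V^\pi(s_t) - V_{\theta(t)}(s_t)]\,\nabla_{\theta(t)} V_{\theta(t)}(s_t)
\end{align*}
```

The minus sign from the gradient and the minus sign from descending cancel, which is why the update adds the error-weighted gradient.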

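14 Linear Functions
Linear function: $V_\theta(s) = \sum_i \theta_i x_i = \langle \theta, \mathrm{vec}(s) \rangle$
Derivative: $\nabla_{\theta(t)} V_t(s_t) = \mathrm{vec}(s_t)$
Update rule:
–$\theta_{t+1} = \theta_t + \alpha [V^\pi(s_t) - V_t(s_t)] \mathrm{vec}(s_t)$
–MC: $\theta_{t+1} = \theta_t + \alpha [R_t - \langle \theta_t, \mathrm{vec}(s_t) \rangle] \mathrm{vec}(s_t)$
–TD: $\theta_{t+1} = \theta_t + \alpha A_t \mathrm{vec}(s_t)$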
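
The Monte Carlo and TD update rules for a linear value function can be sketched as below. The function names, the use of plain Python lists for the weight and attribute vectors, and the default step size and discount are illustrative assumptions.

```python
def dot(theta, x):
    """Linear value estimate: inner product of weights and attributes."""
    return sum(t * xi for t, xi in zip(theta, x))

def mc_update(theta, x, R, alpha=0.1):
    """Monte Carlo: move the estimate for x toward the observed return R."""
    err = R - dot(theta, x)
    return [t + alpha * err * xi for t, xi in zip(theta, x)]

def td_update(theta, x, r, x_next, alpha=0.1, gamma=0.9):
    """TD(0): use the error A_t = r + gamma * V(next) - V(current)."""
    A = r + gamma * dot(theta, x_next) - dot(theta, x)
    return [t + alpha * A * xi for t, xi in zip(theta, x)]

# Hypothetical two-attribute examples.
theta_mc = mc_update([0.0, 0.0], x=[1.0, 2.0], R=1.0)
theta_td = td_update([1.0, 0.0], x=[1.0, 0.0], r=0.5, x_next=[0.0, 1.0])
```

Only the weights of attributes that are non-zero in the current state change, so states sharing attributes share what is learned.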

15 Example: 4 in a row
Select attributes for action (column):
–3 in a row (type X or type O)
–2 in a row (type X or O) and [blocked/not]
–Next location 3 in a row: next move might lose
–Other “features”
RL will learn the weights.
Look-ahead helps significantly:
–use a max-min tree

16 Bootstrapping
Playing against a “good” player
–Using....
Self play
–Start with a random player
–Play against oneself
Choose a starting point:
–Max-Min tree with simple scoring function
Add some simple guidance:
–add “compulsory” moves

17 Scoring Function
Checkers:
–Number of pieces
–Number of queens
Chess:
–Weighted sum of pieces
Othello/Reversi:
–Difference in number of pieces
Can be used with Max-Min Tree
–$(\alpha, \beta)$ pruning
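
To show how a scoring function plugs into a Max-Min tree with (α, β) pruning, here is a generic sketch. The `children` and `score` callbacks, the nested-tuple toy tree, and the `visited` list are hypothetical stand-ins for a real game's move generator and evaluation.

```python
def alphabeta(state, depth, alpha, beta, maximizing, children, score):
    """Depth-limited max-min search; leaves and cut-off states are scored."""
    kids = children(state)
    if depth == 0 or not kids:
        return score(state)
    if maximizing:
        value = float("-inf")
        for child in kids:
            value = max(value, alphabeta(child, depth - 1, alpha, beta,
                                         False, children, score))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # the minimizing player will avoid this branch
        return value
    value = float("inf")
    for child in kids:
        value = min(value, alphabeta(child, depth - 1, alpha, beta,
                                     True, children, score))
        beta = min(beta, value)
        if alpha >= beta:
            break  # the maximizing player already has a better option
    return value

# Toy tree: tuples are internal nodes, integers are leaf scores.
visited = []

def children(state):
    return list(state) if isinstance(state, tuple) else []

def score(state):
    visited.append(state)
    return state

tree = ((3, 5), (2, 9))
value = alphabeta(tree, 2, float("-inf"), float("inf"), True, children, score)
```

In this toy tree the leaf 9 is never scored: once the second subtree yields 2, it cannot beat the 3 already guaranteed by the first. The toy `score` only handles leaves, so the depth must reach them; a real scoring function would evaluate any position.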

18 Example: Reversi (Othello)
Use simple score functions:
–difference in pieces
–edge pieces
–corner pieces
Use Max-Min Tree.
RL: optimize weights.
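
A weighted score over the three features named here might look like the sketch below. The board representation (a dictionary from `(row, col)` to "B" or "W" on an 8x8 board), the choice to count edge and corner pieces as differences between the two players, and the example weights are all assumptions; in the setting of the slide the weights would be learned by RL rather than fixed by hand.

```python
CORNERS = {(0, 0), (0, 7), (7, 0), (7, 7)}

def features(board, player):
    """Return [piece difference, edge difference, corner difference]."""
    diff = edge = corner = 0
    for (r, c), owner in board.items():
        sign = 1 if owner == player else -1
        diff += sign
        if (r, c) in CORNERS:
            corner += sign
        elif r in (0, 7) or c in (0, 7):
            edge += sign
    return [diff, edge, corner]

def score(board, player, weights):
    """Linear scoring function: weighted sum of the features."""
    return sum(w * f for w, f in zip(weights, features(board, player)))

# Hypothetical position and hand-picked weights, for illustration only.
board = {(0, 0): "B", (0, 3): "W", (3, 3): "B", (3, 4): "W", (4, 4): "B"}
feats = features(board, "B")
value = score(board, "B", [1.0, 2.0, 5.0])
```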

19 Advanced issues
Time constraints
–fast and slow modes
Opening
–can help
End game
–many cases: few pieces
–can be solved efficiently
Train on a specific state
–might be helpful; not sure that it's worth the effort

20 What is Next?
Create teams:
–Choose a game!
GUI for game
–Deadline April 12, 2010
System specification
–Project outline
–High level components planning
–May 10, 2010

21 Schedule (more)
Build system
Project completion
–Aug. 30, 2010
All supporting documents in html!
From next week:
–Each group works by itself.
–Feel free to contact us.