1 Reinforcement Learning: Learning algorithms Function Approximation Yishay Mansour Tel-Aviv University.

2 Outline
Week I: Basics
– Mathematical Model (MDP)
– Planning: value iteration, policy iteration
Week II: Learning Algorithms
– Model based
– Model free
Week III: Large state space

3 Learning Algorithms
Given access only to actions, perform:
1. Policy evaluation.
2. Control: find the optimal policy.
Two approaches:
1. Model based (Dynamic Programming).
2. Model free (Q-Learning, SARSA).

4 Learning: Policy improvement
Assume that, given a policy $\pi$, we can compute:
– the $V$ and $Q$ functions of $\pi$
Then we can perform policy improvement:
– $\pi' = \mathrm{Greedy}(Q)$
The process converges if the estimates are accurate.
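The greedy improvement step can be sketched as follows. This is a minimal illustration, assuming a tabular Q function stored as a dictionary; the function name `greedy_policy` and the two-state example are hypothetical and not part of the slides.

```python
def greedy_policy(q):
    """Return the greedy policy for a tabular Q function.

    q maps each state to a dict {action: estimated value}.
    The result maps each state to the action with the highest estimate.
    """
    return {state: max(values, key=values.get) for state, values in q.items()}


# Hypothetical two-state example (values chosen only for illustration).
q_example = {
    "s0": {"left": 0.2, "right": 0.7},
    "s1": {"left": 1.0, "right": 0.1},
}
improved = greedy_policy(q_example)
print(improved)
```

If the Q estimates are inaccurate, the greedy policy may be worse than the original one, which is why the slide ties convergence to the accuracy of the estimates.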

5 Learning - Model Free
Optimal Control: off-policy
Learn the Q function online.
$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha A_t$
OFF-POLICY: Q-Learning
$A_t = r_t + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t)$
Maximization operator!
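A single Q-Learning step might look like the sketch below. It assumes a tabular Q function keyed by (state, action) pairs with unseen pairs valued at 0; the function name, the default step size `alpha` and discount `gamma`, and the example transition are illustrative assumptions.

```python
from collections import defaultdict


def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one off-policy Q-Learning update in place and return the error A_t."""
    # The maximization operator: value of the best action at the next state.
    best_next = max(q[(s_next, b)] for b in actions)
    error = r + gamma * best_next - q[(s, a)]
    q[(s, a)] += alpha * error
    return error


# Hypothetical transition: from s0, action "left", reward 0.5, landing in s1.
q = defaultdict(float)
q[("s1", "right")] = 1.0
err = q_learning_update(q, "s0", "left", r=0.5, s_next="s1",
                        actions=["left", "right"])
print(err, q[("s0", "left")])
```

The update never looks at which action is actually taken at the next state, which is what makes the method off-policy.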

6 Learning - Model Free
Policy evaluation: TD(0)
An online view: at state $s_t$ we performed action $a_t$, received reward $r_t$, and moved to state $s_{t+1}$.
Our “estimation error” is $A_t = r_t + \gamma V_t(s_{t+1}) - V_t(s_t)$.
The update: $V_{t+1}(s_t) = V_t(s_t) + \alpha A_t$
No maximization over actions!
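The TD(0) update can be written in a few lines. The sketch below assumes a value table stored as a dictionary in which unseen states count as 0; the names and the default values of `alpha` and `gamma` are illustrative.

```python
def td0_update(v, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to the value table v and return the error A_t."""
    error = r + gamma * v.get(s_next, 0.0) - v.get(s, 0.0)
    v[s] = v.get(s, 0.0) + alpha * error
    return error


# Hypothetical transition: s0 -> s1 with reward 1.0 under the evaluated policy.
v = {"s1": 2.0}
err = td0_update(v, "s0", r=1.0, s_next="s1")
print(err, v["s0"])
```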

7 Learning - Model Free
Optimal Control: on-policy
Learn the optimal $Q^*$ function online.
$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha [r_t + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)]$
ON-POLICY: SARSA
$a_{t+1}$ is chosen by the $\epsilon$-greedy policy for $Q_t$.
The policy selects the action!
Need to balance exploration and exploitation.
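The sketch below shows one SARSA step together with an epsilon-greedy action choice. It assumes the same tabular representation as a (state, action) dictionary; the helper names, default parameters, and the example transition are assumptions made for illustration.

```python
import random
from collections import defaultdict


def epsilon_greedy(q, s, actions, epsilon, rng):
    """With probability epsilon explore uniformly, otherwise act greedily on q."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(s, a)])


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one on-policy SARSA update in place and return the error."""
    # Uses the action the policy actually selected at s_next, not the maximum.
    error = r + gamma * q[(s_next, a_next)] - q[(s, a)]
    q[(s, a)] += alpha * error
    return error


rng = random.Random(0)
actions = ["left", "right"]
q = defaultdict(float)
q[("s1", "right")] = 1.0
# epsilon = 0 disables exploration, so the greedy action is selected.
a_next = epsilon_greedy(q, "s1", actions, epsilon=0.0, rng=rng)
err = sarsa_update(q, "s0", "left", r=0.0, s_next="s1", a_next=a_next)
print(a_next, err, q[("s0", "left")])
```

With a positive `epsilon` the next action is sometimes exploratory, and the update then reflects the value of that exploratory action rather than the best one.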

8 Modified Notation
Rather than $Q(s, a)$, have $Q_a(s)$.
$\mathrm{Greedy}(Q)(s) = \arg\max_a Q_a(s)$
Each action has its own function $Q_a(s)$.
Learn each $Q_a(s)$ independently!

9 Large state space
Reduce the number of states
– Symmetries (X-O)
– Cluster states
Define attributes
– Limited number of attributes
– Some states will be identical

10 Example: X-O
For each action (square):
– Consider the row/diagonal/column through it
– The state will encode the status of the “rows”: two X’s, two O’s, mixed (both X and O), one X, one O, empty
– Only three types of squares/actions
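One possible encoding of these attributes is sketched below. It assumes the board is a list of nine cells holding "X", "O", or " ", indexed row by row; the function names and the choice to return a sorted list of statuses are illustrative assumptions, not the encoding used in the course.

```python
# The eight winning lines of a 3x3 board, with cells indexed 0..8 row by row.
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
    (0, 4, 8), (2, 4, 6),              # diagonals
]


def line_status(board, line, square):
    """Classify a line by the two cells other than the candidate square."""
    others = [board[i] for i in line if i != square]
    xs, os_ = others.count("X"), others.count("O")
    if xs and os_:
        return "mixed"
    if xs == 2:
        return "two X"
    if os_ == 2:
        return "two O"
    if xs == 1:
        return "one X"
    if os_ == 1:
        return "one O"
    return "empty"


def square_attributes(board, square):
    """Statuses of every line through the square, in sorted order."""
    return sorted(line_status(board, line, square)
                  for line in LINES if square in line)


board = ["X", "X", " ",
         " ", "O", " ",
         " ", " ", "O"]
attrs = square_attributes(board, 2)
print(attrs)
```

Different real boards that produce the same attribute list for a square are treated as the same state for that action, which is the clustering effect the slide describes.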

11 Clustering states
Need to create attributes.
Attributes should be “game dependent”.
Different “real” states can have the same representation.
How do we run?
– We estimate the action value.
– Consider only legal actions.
– Play the “best” action.

12 Function Approximation
Use a limited model for $Q_a(s)$.
Have an attribute vector:
– Each state $s$ has a vector $\mathrm{vec}(s) = (x_1, \ldots, x_k)$
– Normally $k \ll |S|$
Examples:
– Neural network
– Decision tree
– Linear function: weights $\theta = (\theta_1, \ldots, \theta_k)$, value $\sum_i \theta_i x_i$

13 Gradient Descent
Minimize the squared error
– Squared error $= \frac{1}{2} \sum_s P(s) [V^\pi(s) - V_\theta(s)]^2$
– $P(s)$ is a weighting on the states
Algorithm:
– $\theta(t+1) = \theta(t) + \alpha [V^\pi(s_t) - V_{\theta(t)}(s_t)] \nabla_{\theta(t)} V_{\theta(t)}(s_t)$
– $\nabla_{\theta(t)}$ = partial derivatives
– Replace $V^\pi(s_t)$ by a sample
Monte Carlo: use $R_t$ for $V^\pi(s_t)$
TD(0): use $A_t$ for $[V^\pi(s_t) - V_{\theta(t)}(s_t)]$
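The update rule follows from differentiating the squared error at the visited state. The derivation below is a sketch under the assumption that the weighting $P(s)$ is realised by how often each state is visited, so each step uses the per-sample error $E_t$ (a symbol introduced here for illustration).

```latex
\begin{align*}
E_t(\theta) &= \tfrac{1}{2}\,\bigl[V^{\pi}(s_t) - V_{\theta}(s_t)\bigr]^2 \\
\nabla_{\theta} E_t(\theta) &= -\bigl[V^{\pi}(s_t) - V_{\theta}(s_t)\bigr]\,\nabla_{\theta} V_{\theta}(s_t) \\
\theta(t+1) &= \theta(t) - \alpha\,\nabla_{\theta} E_t\bigl(\theta(t)\bigr) \\
            &= \theta(t) + \alpha\,\bigl[V^{\pi}(s_t) - V_{\theta(t)}(s_t)\bigr]\,\nabla_{\theta(t)} V_{\theta(t)}(s_t)
\end{align*}
```

The minus sign from the descent step cancels the minus sign in the gradient, which is why the update adds the error term.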

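14 Linear Functions
Linear function: $\sum_i \theta_i x_i = \langle \theta, x \rangle$
Derivative: $\nabla_{\theta(t)} V_t(s_t) = \mathrm{vec}(s_t)$
Update rule:
– $\theta_{t+1} = \theta_t + \alpha [V^\pi(s_t) - V_t(s_t)] \, \mathrm{vec}(s_t)$
– MC: $\theta_{t+1} = \theta_t + \alpha [R_t - \langle \theta_t, \mathrm{vec}(s_t) \rangle] \, \mathrm{vec}(s_t)$
– TD: $\theta_{t+1} = \theta_t + \alpha A_t \, \mathrm{vec}(s_t)$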
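Both linear update rules fit in a few lines of plain Python. The sketch below assumes attribute vectors are lists of floats and that the TD error uses the linear value of the next state; the function names, default step size, and example vectors are illustrative assumptions.

```python
def dot(theta, x):
    """Inner product <theta, x>, the linear value estimate."""
    return sum(t * xi for t, xi in zip(theta, x))


def mc_update(theta, x, ret, alpha=0.1):
    """Monte Carlo update: move the estimate toward the observed return R_t."""
    error = ret - dot(theta, x)
    return [t + alpha * error * xi for t, xi in zip(theta, x)]


def td_update(theta, x, r, x_next, alpha=0.1, gamma=0.9):
    """TD(0) update: the error A_t is computed from the next state's estimate."""
    error = r + gamma * dot(theta, x_next) - dot(theta, x)
    return [t + alpha * error * xi for t, xi in zip(theta, x)]


theta_mc = mc_update([0.0, 0.0], x=[1.0, 2.0], ret=1.0)
theta_td = td_update([0.1, 0.2], x=[1.0, 0.0], r=0.0, x_next=[0.0, 1.0])
print(theta_mc, theta_td)
```

Because the gradient of a linear function is just the attribute vector, only the weights of attributes that are non-zero in the visited state change.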

15 Example: 4 in a row
Select attributes for each action (column):
– 3 in a row (type X or type O)
– 2 in a row (type X or O) and [blocked / not]
– Next location 3 in a row (next move might lose)
– Other “features”
RL will learn the weights.
Look-ahead helps significantly
– Use a max-min tree

16 Bootstrapping
Playing against a “good” player
– Using....
Self play
– Start with a random player
– Play against oneself.
Choose a starting point.
– Max-Min tree with a simple scoring function.
Add some simple guidance
– Add “compulsory” moves.

17 Scoring Function
Checkers:
– Number of pieces
– Number of Queens
Chess:
– Weighted sum of pieces
Othello/Reversi:
– Difference in number of pieces
Can be used with a Max-Min tree
– $(\alpha, \beta)$ pruning

18 Example: Reversi (Othello)
Use simple score functions:
– Difference in pieces
– Edge pieces
– Corner pieces
Use a Max-Min tree.
RL: optimize the weights.
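The three score functions can be combined into a weighted sum whose weights are what RL would tune. The sketch below assumes each feature is a difference between the player's and the opponent's counts, that edge counts exclude corners, and that the board is a list of equal-length strings; the slides do not specify these details, and the 4x4 board and weights are purely illustrative.

```python
def reversi_features(board, player, opponent):
    """Return [piece difference, edge difference, corner difference]."""
    last = len(board) - 1
    diff = edge = corner = 0
    for r, row in enumerate(board):
        for c, cell in enumerate(row):
            if cell == player:
                sign = 1
            elif cell == opponent:
                sign = -1
            else:
                continue  # empty square
            diff += sign
            on_row_edge = r in (0, last)
            on_col_edge = c in (0, last)
            if on_row_edge and on_col_edge:
                corner += sign
            elif on_row_edge or on_col_edge:
                edge += sign
    return [diff, edge, corner]


def score(board, player, opponent, weights):
    """Weighted sum of the features; the weights are the learned parameters."""
    features = reversi_features(board, player, opponent)
    return sum(w * f for w, f in zip(weights, features))


# Hypothetical small board for illustration ("." marks an empty square).
board = ["B..W",
         ".BW.",
         ".WB.",
         "...B"]
features = reversi_features(board, "B", "W")
value = score(board, "B", "W", weights=[1.0, 2.0, 5.0])
print(features, value)
```

A function of this shape can serve both as the leaf evaluation in the Max-Min tree and as the linear value function whose weights are updated by the learning rule.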

19 Advanced issues
Time constraints
– Fast and slow modes
Opening
– Can help
End game
– Many cases have few pieces
– Can be solved efficiently
Train on a specific state
– Might be helpful; not sure that it's worth the effort.

20 What is Next?
Create teams:
– Choose a game!
GUI for game
– Deadline April 12, 2010
System specification
– Project outline
– High-level components planning
– May 10, 2010

21 Schedule (more)
Build system
Project completion
– Aug. 30, 2010
All supporting documents in HTML!
From next week:
– Each group works by itself.
– Feel free to contact us.

