1
For immediate action!
Split into groups: 2 or 3 per group
Submit the groups today, at the end of class!
Book: Reinforcement Learning
–The book is available online (access from the workshop website)
2
Reinforcement Learning: Learning Algorithms, Function Approximation
Yishay Mansour
Tel-Aviv University
3
Outline
Week I: Basics
–Mathematical Model (MDP)
–Planning: value iteration, policy iteration
Week II: Learning Algorithms
–Model based
–Model free
Week III: Large state space
4
Learning Algorithms
Given access only to actions, perform:
1. Policy evaluation.
2. Control: find the optimal policy.
Two approaches:
1. Model based (Dynamic Programming).
2. Model free (Q-Learning, SARSA).
5
Learning: Policy improvement
Assume that we can compute:
–Given a policy $\pi$,
–The V and Q functions of $\pi$
Can perform policy improvement:
–New policy: $\pi' = \mathrm{Greedy}(Q)$
The process converges if the estimates are accurate.
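The greedy improvement step can be sketched in a few lines. This is a minimal illustration, assuming a tabular Q stored as a nested dictionary; the state and action names are made up for the example.

```python
def greedy_policy(Q):
    """Return the greedy policy: for each state, the action with the largest Q value."""
    return {state: max(values, key=values.get) for state, values in Q.items()}


# Hypothetical Q estimates for a two-state, two-action problem.
Q = {
    "s0": {"left": 0.2, "right": 0.7},
    "s1": {"left": 0.5, "right": 0.1},
}
policy = greedy_policy(Q)
```

If the Q estimates are inaccurate, the greedy policy can be worse than the policy it was derived from, which is why the slide ties convergence to the accuracy of the estimates.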
6
Learning - Model Free
Optimal Control: off-policy
Learn the Q function online ($\alpha$ is the step size, $\gamma$ the discount factor):
$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha A_t$
OFF-POLICY: Q-Learning
$A_t = r_t + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t)$
Note the maximization operator!
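A single Q-learning step might look as follows. This is a sketch under assumptions: a tabular Q keyed by (state, action), and illustrative values for the step size `alpha` and discount `gamma`, which the slide does not specify.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One off-policy step: move Q[s, a] toward r + gamma * max_b Q[s_next, b]."""
    best_next = max(Q[(s_next, b)] for b in actions)  # the maximization operator
    error = r + gamma * best_next - Q[(s, a)]         # A_t on the slide
    Q[(s, a)] += alpha * error
    return error


# Hypothetical transition: in s0 take "right", get reward 0.5, land in s1.
Q = defaultdict(float)
Q[("s1", "right")] = 1.0
error = q_learning_update(Q, "s0", "right", 0.5, "s1", ["left", "right"])
```

The update uses the best next action regardless of which action the agent actually takes next, which is what makes it off-policy.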
7
Learning - Model Free
Policy evaluation: TD(0)
An online view: at state $s_t$ we performed action $a_t$, received reward $r_t$, and moved to state $s_{t+1}$.
Our “estimation error” is $A_t = r_t + \gamma V_t(s_{t+1}) - V_t(s_t)$.
The update: $V_{t+1}(s_t) = V_t(s_t) + \alpha A_t$
No maximization over actions!
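The TD(0) update can be tried on a toy chain. The sketch below assumes a dictionary-based value table, a hypothetical two-step episode A -> B -> end with reward 1 on the last step, and illustrative values for `alpha` and `gamma`.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step for policy evaluation; unseen states default to value 0."""
    error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)  # A_t on the slide
    V[s] = V.get(s, 0.0) + alpha * error
    return error


V = {}
for _ in range(200):
    td0_update(V, "B", 1.0, "end")  # B -> end, reward 1
    td0_update(V, "A", 0.0, "B")    # A -> B, reward 0
```

With these assumed numbers the estimates approach V(B) = 1 and V(A) = 0.9, the discounted value of reaching B.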
8
Learning - Model Free
Optimal Control: on-policy
Learn the optimal $Q^*$ function online:
$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha [r_t + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)]$
ON-POLICY: SARSA
$a_{t+1}$ is chosen by the $\epsilon$-greedy policy for $Q_t$.
The policy selects the action!
Need to balance exploration and exploitation.
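For comparison with Q-learning, here is a sketch of one SARSA step together with an epsilon-greedy action selector. The table layout, the parameter values, and the example transition are assumptions made for illustration.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, s, actions, epsilon, rng):
    """With probability epsilon explore uniformly; otherwise exploit the current Q."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One on-policy step: the target uses the action the policy actually selected."""
    error = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * error
    return error


rng = random.Random(0)
actions = ["left", "right"]
Q = defaultdict(float)
Q[("s1", "left")] = 2.0
Q[("s1", "right")] = 1.0
# epsilon = 0 makes the choice purely greedy, so this example is deterministic.
a_next = epsilon_greedy(Q, "s1", actions, 0.0, rng)
error = sarsa_update(Q, "s0", "right", 0.5, "s1", a_next)
```

Raising `epsilon` gives more exploration at the cost of taking more actions that currently look suboptimal.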
9
Modified Notation
Rather than $Q(s,a)$, write $Q_a(s)$
$\mathrm{Greedy}(Q)(s) = \arg\max_a Q_a(s)$
Each action has its own function $Q_a(s)$
Learn each $Q_a(s)$ independently!
10
Large state space
Reduce the number of states
–Symmetries (X-O)
–Cluster states: define attributes; limited number of attributes; some states will be identical
–Action view of a state
11
Example: X-O
For each action (square)
–Consider the row/diagonal/column through it
–The state will encode the status of the “rows”: two X’s, two O’s, mixed (both X and O), one X, one O, empty
–Only three types of squares/actions
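One possible encoding of this action view is sketched below. It assumes the board is a 9-character string indexed 0 to 8 row by row, with "." for an empty square; the status labels follow the slide, while the function names are hypothetical.

```python
# All rows, columns, and diagonals of the 3x3 board.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def line_status(board, line, square):
    """Classify a line by the two squares other than the candidate move."""
    others = [board[i] for i in line if i != square]
    xs, os_ = others.count("X"), others.count("O")
    if xs == 2:
        return "two X"
    if os_ == 2:
        return "two O"
    if xs == 1 and os_ == 1:
        return "mixed"
    if xs == 1:
        return "one X"
    if os_ == 1:
        return "one O"
    return "empty"


def action_attributes(board, square):
    """Sorted statuses of every line through the square, so the order of lines is ignored."""
    return sorted(line_status(board, line, square) for line in LINES if square in line)


board = "XX.O.O..."
corner = action_attributes(board, 2)  # a corner lies on 3 lines
center = action_attributes(board, 4)  # the center lies on 4 lines
```

Sorting the statuses makes different real states with the same line pattern map to the same representation. The three square types (corner, edge, center) show up as attribute lists of length 3, 2, and 4.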
12
Clustering states
Need to create attributes
Attributes should be “game dependent”
Different “real” states can have the same representation
How do we differentiate states?
–We estimate the action value.
–Consider only legal actions.
–Play the “best” action.
13
Function Approximation
Use a limited model for $Q_a(s)$
Have an attribute vector:
–Each state $s$ has a vector $\mathrm{vec}(s) = (x_1, \ldots, x_k)$
–Normally $k \ll |S|$
Examples:
–Neural network
–Decision tree
–Linear function: weights $\theta = (\theta_1, \ldots, \theta_k)$, value $\sum_i \theta_i x_i$
14
Gradient Descent
Minimize squared error
–Squared error $= \frac{1}{2} \sum_s P(s) [V^{\pi}(s) - V_{\theta}(s)]^2$
–$P(s)$ is some weighting on the states
Algorithm:
–$\theta(t+1) = \theta(t) + \alpha [V^{\pi}(s_t) - V_{\theta(t)}(s_t)] \nabla_{\theta(t)} V_{\theta(t)}(s_t)$
–$\nabla_{\theta(t)}$ = partial derivatives
–Replace $V^{\pi}(s_t)$ by a sample
Monte Carlo: use $R_t$ for $V^{\pi}(s_t)$
TD(0): use $A_t$ for $[V^{\pi}(s_t) - V_{\theta(t)}(s_t)]$
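As a worked step, differentiating the error at a single sampled state shows where the update rule comes from. This is a sketch: the name $J(\theta)$ for the per-state squared error and the symbol $\alpha$ for the step size are notation assumed here.

```latex
\begin{align*}
J(\theta) &= \tfrac{1}{2}\,[V^{\pi}(s_t) - V_{\theta}(s_t)]^2 \\
\nabla_{\theta} J(\theta) &= -[V^{\pi}(s_t) - V_{\theta}(s_t)]\,\nabla_{\theta} V_{\theta}(s_t) \\
\theta(t+1) &= \theta(t) - \alpha\,\nabla_{\theta} J(\theta(t))
  = \theta(t) + \alpha\,[V^{\pi}(s_t) - V_{\theta(t)}(s_t)]\,\nabla_{\theta(t)} V_{\theta(t)}(s_t)
\end{align*}
```

If the states $s_t$ are visited with frequencies proportional to $P(s)$, each step is a stochastic gradient step on the weighted squared error.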
15
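Linear Functions
Linear function: $V_{\theta}(s) = \sum_i \theta_i x_i = \langle \theta, \mathrm{vec}(s) \rangle$
Derivative: $\nabla_{\theta(t)} V_t(s_t) = \mathrm{vec}(s_t)$
Update rule:
–$\theta_{t+1} = \theta_t + \alpha [V^{\pi}(s_t) - V_t(s_t)] \mathrm{vec}(s_t)$
–MC: $\theta_{t+1} = \theta_t + \alpha [R_t - \langle \theta_t, \mathrm{vec}(s_t) \rangle] \mathrm{vec}(s_t)$
–TD: $\theta_{t+1} = \theta_t + \alpha A_t \mathrm{vec}(s_t)$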
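The two linear update rules can be written out directly. The sketch below assumes plain Python lists for the weight and attribute vectors and illustrative values for the step size `alpha` and discount `gamma`; the function names are hypothetical.

```python
def dot(theta, x):
    """Linear value estimate: the inner product of weights and attributes."""
    return sum(t * xi for t, xi in zip(theta, x))


def mc_update(theta, x, R, alpha=0.1):
    """Monte Carlo rule: the target is the observed return R."""
    error = R - dot(theta, x)
    return [t + alpha * error * xi for t, xi in zip(theta, x)]


def td_update(theta, x, r, x_next, alpha=0.1, gamma=0.9):
    """TD(0) rule: the target bootstraps from the estimate at the next state."""
    error = r + gamma * dot(theta, x_next) - dot(theta, x)
    return [t + alpha * error * xi for t, xi in zip(theta, x)]


theta = mc_update([0.0, 0.0], [1.0, 2.0], R=1.0)
theta = td_update(theta, [1.0, 0.0], 0.0, [0.0, 1.0])
```

Because the gradient of a linear function is the attribute vector itself, only the weights of attributes that are nonzero in the current state change.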
16
Example: 4 in a row
Select attributes for each action (column):
–3 in a row (type X or type O)
–2 in a row (type X or O) and [blocked / not]
–Next location 3 in a row (the next move might lose)
–Other “features”
RL will learn the weights.
Look-ahead helps significantly
–use a max-min tree
17
Bootstrapping
Playing against a “good” player
–Using....
Self play
–Start with a random player
–Play against oneself.
Choose a starting point.
–Max-min tree with a simple scoring function.
Add some simple guidance
–Add “compulsory” moves.
18
Scoring Function
Checkers:
–Number of pieces
–Number of queens
Chess
–Weighted sum of pieces
Othello/Reversi
–Difference in number of pieces
Can be used with a max-min tree
–($\alpha$, $\beta$) pruning
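A max-min search with alpha-beta pruning over a generic game tree might be sketched as follows. The `children` and `score` callbacks and the nested-list toy tree are assumptions for illustration; in a real game they would come from the move generator and the scoring function.

```python
def alphabeta(node, depth, alpha, beta, maximizing, children, score):
    """Max-min search that skips branches which cannot change the result."""
    kids = children(node)
    if depth == 0 or not kids:
        return score(node)
    if maximizing:
        value = float("-inf")
        for child in kids:
            value = max(value, alphabeta(child, depth - 1, alpha, beta, False, children, score))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # the minimizing player will never allow this branch
        return value
    value = float("inf")
    for child in kids:
        value = min(value, alphabeta(child, depth - 1, alpha, beta, True, children, score))
        beta = min(beta, value)
        if alpha >= beta:
            break  # the maximizing player already has a better option
    return value


# Toy tree: inner nodes are lists, leaves are scores from the maximizer's view.
tree = [[3, 5], [2, 9], [0, 1]]
visited = []


def children(node):
    return node if isinstance(node, list) else []


def score(leaf):
    visited.append(leaf)  # record which leaves were actually evaluated
    return leaf


result = alphabeta(tree, 2, float("-inf"), float("inf"), True, children, score)
```

In this toy tree the leaves 9 and 1 are never scored, because the first branch already guarantees the maximizer a value of 3.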
19
Example: Reversi (Othello)
Use a simple score function:
–difference in pieces
–edge pieces
–corner pieces
Use a max-min tree
RL: optimize the weights.
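One way to turn these three features into a score is a weighted sum. The sketch below makes several assumptions: an 8x8 board stored as a list of lists, each feature counted as "my pieces minus the opponent's", corners counted separately from other edge squares, and arbitrary example weights standing in for the ones RL would learn.

```python
CORNERS = {(0, 0), (0, 7), (7, 0), (7, 7)}


def reversi_features(board, me, opp):
    """Return [piece difference, edge difference, corner difference] for player `me`."""
    diff = edge = corner = 0
    for r in range(8):
        for c in range(8):
            cell = board[r][c]
            if cell not in (me, opp):
                continue  # empty square
            sign = 1 if cell == me else -1
            diff += sign
            if (r, c) in CORNERS:
                corner += sign
            elif r in (0, 7) or c in (0, 7):
                edge += sign
    return [diff, edge, corner]


def score(board, me, opp, weights):
    """Linear scoring function: weighted sum of the features."""
    return sum(w * f for w, f in zip(weights, reversi_features(board, me, opp)))


# Hypothetical position: B on a corner and two inner squares, W on an edge.
board = [["." for _ in range(8)] for _ in range(8)]
board[0][0] = "B"
board[0][3] = "W"
board[3][3] = "B"
board[4][4] = "B"
features = reversi_features(board, "B", "W")
value = score(board, "B", "W", [1, 2, 5])
```

Such a function can serve as the leaf evaluation in the max-min tree, with the weight vector being what the learning algorithm adjusts.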
20
Advanced issues
Time constraints
–fast and slow modes
Opening
–can help
End game
–many cases: few pieces,
–can be solved efficiently
Train on a specific state
–might be helpful / not sure that it's worth the effort.
21
What is Next?
Create teams:
–at least 2 students, at most 3 students
Group size will influence our expectations!
–Choose a game!
–Give the names and the game
GUI for game
–Deadline: Dec. 17, 2006
22
Schedule (more)
System specification
–Project outline
–High-level components planning
–Jan. 21, 2007
Build system
Project completion
–April 29, 2007
All supporting documents in HTML!
23
Next week
GUI interface (using C++)
Afterwards:
–Each group works by itself