COSC 878 Seminar on Large Scale Statistical Machine Learning
Today's Plan
Course Website: http://people.cs.georgetown.edu/~huiyang/cosc-878/
Join the Google group: https://groups.google.com/forum/#!forum/cosc878
Students Introduction
Team-up and Presentation Scheduling
First Talk
Reinforcement Learning: A Survey
Grace, 1/13/15
What is Reinforcement Learning?
The problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment.
Solve RL Problems: Two Strategies
Search in the behavior space
– Find one behavior that performs well in the environment
– Genetic algorithms, genetic programming
Statistical methods and dynamic programming
– Estimate the utility of taking actions in states of the world
– We focus on this strategy
Standard RL model
What we learn in RL
The agent's job is to find a policy \pi that maximizes some long-run measure of reinforcement.
– A policy \pi maps states to actions
– Reinforcement = reward
Difference between RL and Supervised Learning
In RL, there is no presentation of input/output pairs
– No training data
– We only know the immediate reward
– We do not know the best actions in the long run
In RL, we need to evaluate the system online while learning
– Online evaluation (knowing the online performance) is important
Difference between RL and AI/Planning
AI planning algorithms are less general
– They require a predefined model of state transitions
– And they assume determinism
RL assumes that the state space can be enumerated and stored in memory
Models
The difficult part:
– How to incorporate the future into the model
Three models of optimality:
– Finite horizon
– Infinite horizon
– Average-reward
Finite Horizon
At a given moment in time, the agent optimizes its expected reward for the next h steps
Ignores what will happen after h steps
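One standard way to write this objective (with r_t denoting the reward received t steps into the future) is:
E\left( \sum_{t=0}^{h} r_t \right)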
Infinite Horizon
Maximize the long-run reward
Does not put a limit on the number of future steps
Future rewards are discounted geometrically
Mathematically more tractable than the finite-horizon model
Discount factor \gamma (between 0 and 1)
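The corresponding discounted objective, with discount factor \gamma, is typically written as:
E\left( \sum_{t=0}^{\infty} \gamma^{t} r_t \right)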
Average-reward
Maximize the long-run average reward
It is the limiting case of the infinite-horizon model as \gamma approaches 1
Weakness:
– Cannot distinguish when the large rewards are received
– If we prefer a large initial reward, this model gives us no way to express that
Cures:
– Maximize both the long-run average and the initial rewards
– The bias-optimal model
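The average-reward criterion is commonly written as the limiting average:
\lim_{h \to \infty} E\left( \frac{1}{h} \sum_{t=0}^{h} r_t \right)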
Compare model optimality
Example (figure omitted): a start state with three possible action sequences (an upper, a middle, and a lower line); each arrow is a single action, and all unlabeled arrows produce a reward of 0.
Compare model optimality: finite horizon, h = 4
Upper line: 0 + 0 + 2 + 2 + 2 = 6
Middle line: 0 + 0 + 0 + 0 + 0 = 0
Lower line: 0 + 0 + 0 + 0 + 0 = 0
Compare model optimality: infinite horizon, \gamma = 0.9
Upper line: 0*0.9^0 + 0*0.9^1 + 2*0.9^2 + 2*0.9^3 + 2*0.9^4 + ... = 2*0.9^2*(1 + 0.9 + 0.9^2 + ...) = 1.62 * 1/(1 - 0.9) = 16.2
Middle line: ... + 10*0.9^5 + ... ≈ 59
Lower line: ... + 11*0.9^6 + ... ≈ 58.5
Compare model optimality: average reward
Upper line: ≈ 2
Middle line: ≈ 10
Lower line: ≈ 11
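As a sanity check on the arithmetic above, here is a minimal Python sketch. The exact reward sequences are an assumption reconstructed from the numbers on the previous slides: the upper line pays +2 from step 2 onward, the middle line +10 from step 5 onward, and the lower line +11 from step 6 onward.

```python
# Sanity check for the three optimality criteria on the example above.
# Assumed per-step rewards (reconstructed from the slides):
#   upper: 0,0,2,2,2,...   middle: 0,...,0,10,10,... (from t=5)   lower: 0,...,0,11,11,... (from t=6)
def reward(line, t):
    start, r = {"upper": (2, 2), "middle": (5, 10), "lower": (6, 11)}[line]
    return r if t >= start else 0

def finite_horizon(line, h=4):
    return sum(reward(line, t) for t in range(h + 1))

def infinite_horizon(line, gamma=0.9, steps=10_000):
    return sum(gamma**t * reward(line, t) for t in range(steps))

def average_reward(line, steps=10_000):
    return sum(reward(line, t) for t in range(steps)) / steps

for line in ("upper", "middle", "lower"):
    print(line, finite_horizon(line), round(infinite_horizon(line), 1),
          round(average_reward(line), 1))
# Expected: upper 6 16.2 2.0 | middle 0 59.0 10.0 | lower 0 58.5 11.0
```

Each criterion picks a different line as best, which is the point of the comparison.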
Parameters
The finite-horizon and infinite-horizon models both have parameters
– h
– \gamma
These parameters matter to the choice of optimality model
– Choose them carefully in your application
The average-reward model's advantage: it is not influenced by such parameters
MARKOV MODELS
Markov Process
Markov property (the "memoryless" property): the system's next state depends only on its current state:
Pr(S_{i+1} | S_i, ..., S_0) = Pr(S_{i+1} | S_i)
Markov process: a stochastic process with the Markov property, e.g. a chain of states s_0, s_1, ..., s_i, s_{i+1}, ...
[A. A. Markov, '06]
Family of Markov Models
– Markov Chain
– Hidden Markov Model
– Markov Decision Process
– Partially Observable Markov Decision Process
– Multi-armed Bandit
Markov Chain
A discrete-time Markov process, specified by (S, M)
– State S: a web page (in the example)
– Transition probability M
Example: Google PageRank [L. Page et al., '99]
(Figure omitted: a web graph of pages A–E with their PageRank values; transitions depend on the number of outlinks of each page, the pages linked to S, and a random jump factor.)
The stable state distribution of such an MC is PageRank: how likely a random web surfer will land on a page.
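To illustrate the claim that the stable state distribution of the chain is PageRank, here is a minimal power-iteration sketch in Python. The five-page link structure and the damping value 0.85 are made-up assumptions for illustration, not taken from the slide.

```python
import numpy as np

# Hypothetical link graph over pages A-E (who links to whom); for illustration only.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C", "E"], "E": ["A", "D"]}
pages = sorted(links)
n = len(pages)
idx = {p: i for i, p in enumerate(pages)}

# Transition matrix M: follow a random outlink with prob. d, jump anywhere with prob. 1-d.
d = 0.85  # random jump (damping) factor, assumed
M = np.full((n, n), (1 - d) / n)
for p, outs in links.items():
    for q in outs:
        M[idx[p], idx[q]] += d / len(outs)

# Power iteration: the stationary distribution of the chain is the PageRank vector.
rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = rank @ M
print(dict(zip(pages, rank.round(3))))
```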
Hidden Markov Model
A Markov chain whose states are hidden; observable symbols are emitted with some probability depending on the state [Leonard E. Baum et al., '66]
Specified by (S, M, O, e)
– S_i: hidden state
– p_i: transition probability
– o_i: observation
– e_i: observation probability (emission probability)
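A small sketch of how the pieces (S, M, O, e) fit together. The two-state weather model, its probabilities, and the forward recursion used to score an observation sequence are illustrative assumptions, not from the slide.

```python
# Hypothetical 2-state HMM, for illustration only.
states = ["Rainy", "Sunny"]            # hidden states S
start = {"Rainy": 0.6, "Sunny": 0.4}   # initial state distribution
M = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},   # transition probabilities
     "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
e = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},   # emission probabilities
     "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def forward(observations):
    """Forward algorithm: probability of an observation sequence under the HMM."""
    alpha = {s: start[s] * e[s][observations[0]] for s in states}
    for o in observations[1:]:
        alpha = {s: sum(alpha[r] * M[r][s] for r in states) * e[s][o] for s in states}
    return sum(alpha.values())

print(forward(["walk", "shop", "clean"]))  # P(observation sequence)
```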
Markov Decision Process
An MDP extends an MC with actions and rewards [R. Bellman, '57]
Specified by (S, M, A, R, γ)
– s_i: state
– a_i: action
– r_i: reward
– p_i: transition probability
Definition of MDP
A tuple (S, M, A, R, γ)
– S: state space
– M: transition matrix, M_a(s, s') = P(s' | s, a)
– A: action space
– R: reward function, R(s, a) = immediate reward for taking action a at state s
– γ: discount factor, 0 < γ ≤ 1
Policy π: π(s) = the action taken at state s
Goal: find an optimal policy π* that maximizes the expected total reward.
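One way such a tuple might be held in code; a minimal Python sketch, with a made-up two-state, two-action MDP for illustration only:

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: list   # S
    actions: list  # A
    M: dict        # M[a][(s, s')] = P(s' | s, a)
    R: dict        # R[(s, a)] = immediate reward
    gamma: float   # discount factor

# Tiny hypothetical 2-state, 2-action MDP.
mdp = MDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    M={"stay": {("s0", "s0"): 1.0, ("s1", "s1"): 1.0},
       "go":   {("s0", "s1"): 1.0, ("s1", "s0"): 1.0}},
    R={("s0", "stay"): 0.0, ("s0", "go"): 1.0,
       ("s1", "stay"): 2.0, ("s1", "go"): 0.0},
    gamma=0.9,
)
```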
Policy
Policy: π(s) = a, i.e. the action a selected at state s
Example (figure): π(s_0) = move right and up; π(s_1) = move right and up; π(s_2) = move right
[Slide altered from Carlos Guestrin's ML lecture]
Value of Policy
Value V^π(s): expected long-term reward starting from s
Starting from s_0:
V^π(s_0) = E[R(s_0) + γ R(s_1) + γ^2 R(s_2) + γ^3 R(s_3) + γ^4 R(s_4) + ...]
Future rewards are discounted by γ ∈ [0, 1)
[Slide altered from Carlos Guestrin's ML lecture]
Computing the value of a policy
The value function satisfies a recursion relating the current state s to its possible next states s':
V^π(s) = R(s, π(s)) + γ Σ_{s'} M_{π(s)}(s, s') V^π(s')
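A minimal iterative policy-evaluation sketch based on this recursion; the tiny two-state MDP and the fixed policy are hypothetical, for illustration only:

```python
# Iterative policy evaluation: repeatedly apply the recursion above until it converges.
# The 2-state MDP below (states s0, s1; actions stay/go) is made up for illustration.
states = ["s0", "s1"]
M = {"stay": {("s0", "s0"): 1.0, ("s1", "s1"): 1.0},
     "go":   {("s0", "s1"): 1.0, ("s1", "s0"): 1.0}}
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "go"): 0.0}
gamma = 0.9
policy = {"s0": "go", "s1": "stay"}   # a fixed policy pi(s)

V = {s: 0.0 for s in states}
for _ in range(1000):
    V = {s: R[(s, policy[s])]
            + gamma * sum(M[policy[s]].get((s, s2), 0.0) * V[s2] for s2 in states)
         for s in states}
print(V)  # V^pi(s1) = 2 / (1 - 0.9) = 20; V^pi(s0) = 1 + 0.9 * 20 = 19
```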
Optimality: Bellman Equation
The Bellman equation [R. Bellman, '57] for an MDP is a recursive definition of the optimal state-value function V*(.):
V*(s) = max_a [ R(s, a) + γ Σ_{s'} M_a(s, s') V*(s') ]
Optimal policy:
π*(s) = argmax_a [ R(s, a) + γ Σ_{s'} M_a(s, s') V*(s') ]
Optimality: Bellman Equation
The Bellman equation can be rewritten in terms of the action-value function Q*:
Q*(s, a) = R(s, a) + γ Σ_{s'} M_a(s, s') max_{a'} Q*(s', a')
Relationship between V and Q: V*(s) = max_a Q*(s, a)
Optimal policy: π*(s) = argmax_a Q*(s, a)
MDP algorithms
Model-based approaches (solve the Bellman equation):
– Value Iteration
– Policy Iteration
– Modified Policy Iteration
– Prioritized Sweeping
Model-free approaches:
– Temporal Difference (TD) Learning
– Q-Learning
Output: optimal value V*(s) and optimal policy π*(s)
[Bellman, '57; Howard, '60; Puterman and Shin, '78; Singh & Sutton, '96; Sutton & Barto, '98; Richard Sutton, '88; Watkins, '92]
[Slide altered from Carlos Guestrin's ML lecture]
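As one concrete example from the list above, here is a minimal value-iteration sketch; the small two-state MDP is the same made-up example used earlier, not from the slides.

```python
# Value iteration: apply the Bellman optimality backup until the values converge,
# then read off the greedy (optimal) policy. The MDP below is made up for illustration.
states = ["s0", "s1"]
actions = ["stay", "go"]
M = {"stay": {("s0", "s0"): 1.0, ("s1", "s1"): 1.0},
     "go":   {("s0", "s1"): 1.0, ("s1", "s0"): 1.0}}
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "go"): 0.0}
gamma = 0.9

def backup(V, s, a):
    """One-step lookahead: R(s,a) + gamma * sum_s' M_a(s,s') V(s')."""
    return R[(s, a)] + gamma * sum(M[a].get((s, s2), 0.0) * V[s2] for s2 in states)

V = {s: 0.0 for s in states}
for _ in range(1000):
    V = {s: max(backup(V, s, a) for a in actions) for s in states}

policy = {s: max(actions, key=lambda a: backup(V, s, a)) for s in states}
print(V, policy)  # expect the greedy policy: go at s0, stay at s1
```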