Reinforcement Learning
Yishay Mansour, Tel-Aviv University
Outline
– Goal of Reinforcement Learning
– Mathematical Model (MDP)
– Planning
Goal of Reinforcement Learning
– Goal-oriented learning through interaction.
– Control of large-scale stochastic environments with partial knowledge.
– Contrast: supervised / unsupervised learning learns from labeled / unlabeled examples.
Reinforcement Learning - Origins
– Artificial Intelligence
– Control Theory
– Operations Research
– Cognitive Science & Psychology
Solid foundations; well-established research.
Typical Applications
– Robotics: elevator control [CB], robo-soccer [SV].
– Board games: backgammon [T], checkers [S], chess [B].
– Scheduling: dynamic channel allocation [SB], inventory problems.
Contrast with Supervised Learning
– The system has a "state".
– The algorithm influences the state distribution.
– Inherent tradeoff: exploration versus exploitation.
Mathematical Model - Motivation
– Model of uncertainty: environment, actions, our knowledge.
– Focus on decision making.
– Maximize long-term reward.
– Markov Decision Process (MDP).
Mathematical Model - MDP
Markov decision process:
– S: set of states.
– A: set of actions.
– δ: transition probability.
– R: reward function.
Similar to a DFA!
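To make the tuple (S, A, δ, R) concrete, here is a minimal sketch of one possible in-memory representation; the class name, dictionary layout, and sampling helper are illustrative assumptions, not part of the slides.

```python
import random

class MDP:
    """A finite MDP: states S, actions A, transition probabilities delta, rewards R."""
    def __init__(self, states, actions, transitions, rewards, gamma=0.5):
        self.states = states            # list of states S
        self.actions = actions          # list of actions A
        self.transitions = transitions  # transitions[(s, a)] = {s_next: probability}
        self.rewards = rewards          # rewards[(s, a)] = expected immediate reward
        self.gamma = gamma              # discount factor

    def step(self, s, a):
        """Sample s' ~ delta(. | s, a) and return it together with the reward."""
        dist = self.transitions[(s, a)]
        next_states, probs = zip(*dist.items())
        s_next = random.choices(next_states, weights=probs)[0]
        return s_next, self.rewards[(s, a)]
```

The later sketches (policy evaluation, value iteration) assume this same `transitions` / `rewards` dictionary layout.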
MDP model - states and actions
Environment = states; actions = probabilistic transitions between states.
[Diagram: from a state, action a leads to two possible next states with probabilities 0.7 and 0.3.]
MDP model - rewards
R(s,a) = reward at state s for doing action a (a random variable).
Example: R(s,a) = -1 with probability 0.5, +10 with probability 0.35, +20 with probability 0.15.
MDP model - trajectories
Trajectory: s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, ...
MDP - Return function
Combining all the immediate rewards into a single value.
Modeling issues:
– Are early rewards more valuable than later rewards?
– Is the system "terminating" or continuous?
Usually the return is linear in the immediate rewards.
MDP model - return functions
– Finite horizon: parameter H.
– Infinite horizon, discounted: parameter γ < 1.
– Infinite horizon, undiscounted.
– Terminating MDP.
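The formulas on this slide did not survive the export; a standard reconstruction of the return functions named above, with r_t the reward at step t, is:

```latex
R_H      = \sum_{t=0}^{H-1} r_t                                  % finite horizon H
R_\gamma = \sum_{t=0}^{\infty} \gamma^{t} r_t                    % infinite horizon, discounted, \gamma < 1
R_{avg}  = \lim_{T \to \infty} \tfrac{1}{T} \sum_{t=0}^{T-1} r_t % infinite horizon, undiscounted (average reward)
```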
MDP model - action selection
– Policy π: a mapping from states to actions.
– Fully observable: the agent can "see" the entire state.
– Aim: maximize the expected return.
– Optimal policy: optimal from any start state.
THEOREM: There exists a deterministic optimal policy.
Contrast with Supervised Learning
– Supervised learning: fixed distribution on examples.
– Reinforcement learning: the state distribution is policy dependent!
A small local change in the policy can make a huge global change in the return.
MDP model - summary
– S: set of states, |S| = n.
– A: set of k actions, |A| = k.
– δ: transition function.
– R(s,a): immediate reward function.
– π: policy, mapping states to actions.
– V^π: discounted cumulative return.
Simple example: N-armed bandit
– A single state s; actions a_1, a_2, ..., a_N.
– Goal: maximize the sum of immediate rewards.
– Difficulty: unknown model.
– Given the model: play the greedy action.
N-Armed Bandit: Highlights
Algorithms (near greedy):
– Exponential weights: G_i = sum of rewards of action a_i, weight w_i = e^{G_i}.
– Follow the (perturbed) leader.
Results: for any sequence of T rewards,
E[online] > max_i {G_i} - sqrt(T log N).
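As a rough sketch of the exponential-weights rule above, in its full-information form (every action's reward is revealed each round); the function name and `reward_fn` interface are assumptions, and no learning rate is used, matching the slide's w_i = e^{G_i}:

```python
import math
import random

def exponential_weights(reward_fn, n_arms, T):
    """Maintain w_i = exp(G_i), where G_i is the cumulative reward of action a_i,
    and sample an action proportionally to the weights each round."""
    G = [0.0] * n_arms
    online_reward = 0.0
    for t in range(T):
        m = max(G)                                   # subtract max for numerical stability
        weights = [math.exp(g - m) for g in G]       # proportional to w_i = e^{G_i}
        arm = random.choices(range(n_arms), weights=weights)[0]
        rewards = [reward_fn(i, t) for i in range(n_arms)]  # all rewards revealed this round
        online_reward += rewards[arm]
        G = [g + r for g, r in zip(G, rewards)]      # update cumulative sums
    return online_reward, G
```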
Planning - Basic Problems
Given a complete MDP model:
– Policy evaluation: given a policy π, estimate its return.
– Optimal control: find an optimal policy π* (maximizes the return from any start state).
Planning - Value Functions
– V^π(s): the expected return starting at state s and following π.
– Q^π(s,a): the expected return starting at state s, taking action a, and then following π.
– V*(s) and Q*(s,a) are defined using an optimal policy π*.
– V*(s) = max_π V^π(s).
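Written out explicitly (a standard reconstruction, using the discounted return with factor γ from the earlier slide):

```latex
V^{\pi}(s)   = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t \;\Big|\; s_0 = s,\; a_t = \pi(s_t)\Big]
Q^{\pi}(s,a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t \;\Big|\; s_0 = s,\; a_0 = a,\; a_t = \pi(s_t) \text{ for } t \ge 1\Big]
V^{*}(s)     = \max_{\pi} V^{\pi}(s)
```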
Planning - Policy Evaluation
Discounted infinite horizon (Bellman equation):
V^π(s) = E_{s' ~ δ(s, π(s))} [ R(s, π(s)) + γ V^π(s') ]
Rewriting the expectation gives a linear system of equations in the values V^π(s) (see the solver sketch below).
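A minimal sketch of exact policy evaluation by solving that linear system, assuming the dictionary-based MDP layout from the earlier sketch and a deterministic policy `policy[s] = a`:

```python
import numpy as np

def evaluate_policy(states, policy, transitions, rewards, gamma):
    """Solve V = R_pi + gamma * P_pi V, i.e. (I - gamma * P_pi) V = R_pi."""
    n = len(states)
    index = {s: i for i, s in enumerate(states)}
    P = np.zeros((n, n))   # P_pi[s, s'] = delta(s' | s, pi(s))
    R = np.zeros(n)        # R_pi[s] = R(s, pi(s))
    for s in states:
        a = policy[s]
        R[index[s]] = rewards[(s, a)]
        for s_next, p in transitions[(s, a)].items():
            P[index[s], index[s_next]] = p
    V = np.linalg.solve(np.eye(n) - gamma * P, R)
    return {s: V[index[s]] for s in states}
```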
Algorithms - Policy Evaluation Example
Four states s_0, s_1, s_2, s_3 on a cycle; A = {+1, -1}; γ = 1/2; δ(s_i, a) = s_{i+a mod 4}; π is the random policy; for every action a, R(s_i, a) = i.
V^π(s_0) = 0 + γ [ π(s_0, +1) V^π(s_1) + π(s_0, -1) V^π(s_3) ]
Algorithms - Policy Evaluation Example (continued)
Same setup: A = {+1, -1}; γ = 1/2; δ(s_i, a) = s_{i+a mod 4}; π random; R(s_i, a) = i.
Since π picks each action with probability 1/2 and γ = 1/2:
V^π(s_0) = 0 + (V^π(s_1) + V^π(s_3)) / 4
Solving the linear system: V^π(s_0) = 5/3, V^π(s_1) = 7/3, V^π(s_2) = 11/3, V^π(s_3) = 13/3.
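Plugging this example into the policy-evaluation idea above reproduces the slide's numbers; here the uniform-random policy is handled by averaging the two actions directly (the encoding is mine):

```python
import numpy as np

# 4 states on a cycle, gamma = 1/2, delta(s_i, a) = s_{(i+a) mod 4},
# uniform random policy, R(s_i, a) = i for every action a.
n, gamma = 4, 0.5
P = np.zeros((n, n))
R = np.zeros(n)
for i in range(n):
    R[i] = i                      # expected reward, same under both actions
    P[i, (i + 1) % n] += 0.5      # action +1 chosen with probability 1/2
    P[i, (i - 1) % n] += 0.5      # action -1 chosen with probability 1/2

V = np.linalg.solve(np.eye(n) - gamma * P, R)
print(V)  # approximately [1.667, 2.333, 3.667, 4.333] = [5/3, 7/3, 11/3, 13/3]
```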
Algorithms - Optimal Control
State-action value function (for a deterministic policy π):
Q^π(s,a) = E[ R(s,a) ] + γ E_{s' ~ δ(s,a)} [ V^π(s') ]
Algorithms - Optimal Control Example
Same setup: A = {+1, -1}; γ = 1/2; δ(s_i, a) = s_{i+a mod 4}; π random; R(s_i, a) = i.
Q^π(s_0, +1) = 0 + γ V^π(s_1) = 7/6
Q^π(s_0, -1) = 0 + γ V^π(s_3) = 13/6
Algorithms - Optimal Control
CLAIM: A policy π is optimal if and only if at each state s:
V^π(s) = max_a { Q^π(s,a) }   (Bellman equation)
PROOF: Assume there is a state s and an action a such that V^π(s) < Q^π(s,a). Then the strategy of performing a at state s (the first time) and following π afterwards is better than π. This is true each time we visit s, so the policy that always performs action a at state s is better than π.
Algorithms - Optimal Control Example
Same setup: A = {+1, -1}; γ = 1/2; δ(s_i, a) = s_{i+a mod 4}; π random; R(s_i, a) = i.
Changing the policy using the state-action value function: at each state, switch to the action with the larger Q^π value.
Algorithms - Optimal Control
The greedy policy with respect to Q^π(s,a) is
π(s) = argmax_a { Q^π(s,a) }.
The ε-greedy policy with respect to Q^π(s,a) is
π(s) = argmax_a { Q^π(s,a) } with probability 1-ε, and
π(s) = a random action with probability ε.
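A minimal sketch of ε-greedy action selection over a Q table; the `Q[(s, a)]` dictionary layout is an assumption carried over from the earlier sketches:

```python
import random

def epsilon_greedy(Q, actions, s, eps):
    """With probability 1 - eps pick argmax_a Q[(s, a)]; otherwise a random action."""
    if random.random() < eps:
        return random.choice(actions)                  # explore
    return max(actions, key=lambda a: Q[(s, a)])       # exploit: greedy action
```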
MDP - computing the optimal policy
1. Linear programming.
2. Value iteration.
3. Policy iteration.
Convergence
Value Iteration:
– The distance from optimal, max_s { V*(s) - V_t(s) }, drops each iteration (see the sketch below).
Policy Iteration:
– The policy can only improve: for every s, V_{t+1}(s) ≥ V_t(s).
– Fewer iterations than Value Iteration, but more expensive iterations.
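A minimal value-iteration sketch under the same dictionary-based MDP layout assumed earlier; the stopping tolerance is illustrative:

```python
def value_iteration(states, actions, transitions, rewards, gamma, tol=1e-8):
    """Iterate V_{t+1}(s) = max_a [ R(s,a) + gamma * sum_{s'} delta(s'|s,a) * V_t(s') ]."""
    V = {s: 0.0 for s in states}
    while True:
        V_next = {
            s: max(
                rewards[(s, a)]
                + gamma * sum(p * V[s2] for s2, p in transitions[(s, a)].items())
                for a in actions
            )
            for s in states
        }
        if max(abs(V_next[s] - V[s]) for s in states) < tol:
            return V_next
        V = V_next
```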
Relations to Board Games
– state = current board
– action = what we can play
– opponent's action = part of the environment
– value function = probability of winning
– Q-function = modified policy
Hidden assumption: the game is Markovian.
Planning versus Learning
Tightly coupled in reinforcement learning.
Goal: maximize the return while learning.
Example - Elevator Control
– Planning (alone): given an arrival model, build a schedule.
– Learning (alone): model the arrival process well.
– Real objective: construct a schedule while updating the model.
Partially Observable MDP (POMDP)
– Rather than observing the state, we observe some function of the state.
– Ob: observation function, a random variable for each state.
– Examples: (1) Ob(s) = s + noise; (2) Ob(s) = first bit of s.
– Problem: different states may "look" similar.
– The optimal strategy is history dependent!
POMDP - Belief State Algorithm
Given the history of actions and observations, we compute a posterior distribution over the state we are in (the belief state).
The belief-state MDP:
– States: distributions over S (the states of the POMDP).
– Actions: as in the POMDP.
– Transitions: the posterior distribution (given the observation).
We can perform the planning and learning on the belief-state MDP.
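A minimal sketch of the Bayesian belief update described above, assuming `transitions[(s, a)]` for δ (as in the earlier sketches) and a hypothetical `obs_model[s][o]` giving Pr[Ob(s) = o]:

```python
def update_belief(belief, a, o, transitions, obs_model):
    """Bayes update after taking action a and observing o:
    b'(s') is proportional to Pr[o | s'] * sum_s b(s) * delta(s' | s, a)."""
    new_belief = {}
    for s_next in belief:
        predicted = sum(belief[s] * transitions[(s, a)].get(s_next, 0.0) for s in belief)
        new_belief[s_next] = obs_model[s_next].get(o, 0.0) * predicted
    total = sum(new_belief.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return {s: p / total for s, p in new_belief.items()}
```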
POMDP - Hard Computational Problems
– Computing an infinite (respectively, polynomial) horizon undiscounted optimal strategy for a deterministic POMDP is PSPACE-hard (respectively, NP-complete) [PT, L].
– Computing an infinite (respectively, polynomial) horizon undiscounted optimal strategy for a stochastic POMDP is EXPTIME-hard (respectively, PSPACE-complete) [PT, L].
– Computing an infinite (or polynomial) horizon undiscounted optimal policy for an MDP is P-complete [PT].
Resources
– Reinforcement Learning: An Introduction [Sutton & Barto]
– Markov Decision Processes [Puterman]
– Dynamic Programming and Optimal Control [Bertsekas]
– Neuro-Dynamic Programming [Bertsekas & Tsitsiklis]
– Ph.D. thesis of Michael Littman