1
Reinforcement Learning Yishay Mansour Tel-Aviv University
2
Outline
Goal of Reinforcement Learning
Mathematical Model (MDP)
Planning
3
Goal of Reinforcement Learning
Goal-oriented learning through interaction.
Control of large-scale stochastic environments with partial knowledge.
(Compare: Supervised / Unsupervised Learning - learn from labeled / unlabeled examples.)
4
Reinforcement Learning - origins
Artificial Intelligence
Control Theory
Operations Research
Cognitive Science & Psychology
Solid foundations; well-established research.
5
Typical Applications
Robotics
–Elevator control [CB].
–Robo-soccer [SV].
Board games
–Backgammon [T].
–Checkers [S].
–Chess [B].
Scheduling
–Dynamic channel allocation [SB].
–Inventory problems.
6
Contrast with Supervised Learning
The system has a "state".
The algorithm influences the state distribution.
Inherent tradeoff: Exploration versus Exploitation.
7
Mathematical Model - Motivation
Model of uncertainty: environment, actions, our knowledge.
Focus on decision making.
Maximize long-term reward.
Markov Decision Process (MDP)
8
Mathematical Model - MDP
Markov decision process:
S - set of states
A - set of actions
δ(s,a,s') - transition probability
R(s,a) - reward function
Similar to a DFA!
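To make the tuple concrete, here is a minimal sketch of a two-state, two-action MDP as plain Python data; the states, actions, and numbers are illustrative assumptions, not taken from the slides.

    import random

    states = [0, 1]                   # S
    actions = ["a", "b"]              # A

    # delta[s][a] = list of (next_state, probability)
    delta = {
        0: {"a": [(0, 0.3), (1, 0.7)], "b": [(0, 1.0)]},
        1: {"a": [(1, 1.0)], "b": [(0, 0.5), (1, 0.5)]},
    }

    # R[s][a] = expected immediate reward
    R = {
        0: {"a": 1.0, "b": 0.0},
        1: {"a": 0.0, "b": 2.0},
    }

    def step(s, a):
        """Sample a next state according to the transition probabilities delta[s][a]."""
        next_states, probs = zip(*delta[s][a])
        return random.choices(next_states, weights=probs, k=1)[0]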
9
MDP model - states and actions
[Diagram: the environment is a set of states; actions are probabilistic transitions between states, e.g. action a leads to one state with probability 0.7 and to another with probability 0.3.]
10
MDP model - rewards
R(s,a) = reward at state s for doing action a (a random variable).
Example: R(s,a) =
–(-1) with probability 0.5
–(+10) with probability 0.35
–(+20) with probability 0.15
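A quick numerical check of the slide's example; the sampling helper below is only an illustration of a stochastic R(s,a), not part of the slide.

    import random

    def sample_reward():
        # R(s,a): -1 w.p. 0.5, +10 w.p. 0.35, +20 w.p. 0.15
        return random.choices([-1, 10, 20], weights=[0.5, 0.35, 0.15], k=1)[0]

    expected_reward = -1 * 0.5 + 10 * 0.35 + 20 * 0.15   # = 6.0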
11
MDP model - trajectories
A trajectory is a sequence s0, a0, r0, s1, a1, r1, s2, a2, r2, ...
12
Simple example: N-armed bandit
A single state s with actions a1, a2, a3, ...
Goal: maximize the sum of immediate rewards.
Difficulty: the model is unknown.
Given the model: play the greedy action.
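A minimal sketch of the "unknown model" difficulty: estimate each arm's mean reward from samples and act greedily on the estimates. The Bernoulli arms and the sample-average estimator are assumptions for illustration, not part of the slide.

    import random

    true_means = [0.2, 0.5, 0.8]     # unknown to the learner
    counts = [0, 0, 0]
    estimates = [0.0, 0.0, 0.0]

    def pull(arm):
        # Bernoulli reward with the arm's (hidden) true mean.
        return 1.0 if random.random() < true_means[arm] else 0.0

    for t in range(1000):
        if 0 in counts:
            arm = counts.index(0)                               # try each arm once
        else:
            arm = max(range(3), key=lambda a: estimates[a])     # greedy on estimates
        r = pull(arm)
        counts[arm] += 1
        estimates[arm] += (r - estimates[arm]) / counts[arm]    # running average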
13
MDP - Return function
Combining all the immediate rewards into a single value.
Modeling issues:
–Are early rewards more valuable than later rewards?
–Is the system "terminating" or continuous?
Usually the return is linear in the immediate rewards.
14
MDP model - return functions
Finite horizon - parameter H: sum of the first H rewards.
Infinite horizon:
–discounted - parameter γ < 1: sum over t of γ^t r_t.
–undiscounted: long-run average reward.
Terminating MDP: sum of rewards until termination.
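As a small concrete check of the discounted return (a sketch of the definition, truncating the infinite sum at the trajectory length):

    def discounted_return(rewards, gamma):
        """Return sum over t of gamma**t * rewards[t] for a finite trajectory."""
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    print(discounted_return([1.0] * 50, 0.5))   # approaches 1 / (1 - 0.5) = 2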
15
MDP model - action selection
Policy π - a mapping from states to actions.
Fully observable - the agent can "see" the exact state.
AIM: maximize the expected return. This talk: discounted return.
Optimal policy π*: optimal from any start state.
THEOREM: There exists a deterministic optimal policy.
16
MDP model - summary
S - set of states, |S| = n.
A - set of k actions, |A| = k.
δ(s,a,s') - transition function.
R(s,a) - immediate reward function.
π - policy (mapping states to actions).
V^π - discounted cumulative return.
17
Relations to Board Games
state = current board.
action = what we can play.
opponent action = part of the environment.
Hidden assumption: the game is Markovian.
18
Contrast with Supervised Learning
Supervised Learning: fixed distribution on examples.
Reinforcement Learning: the state distribution is policy dependent!
A small local change in the policy can make a huge global change in the return.
19
Planning - Basic Problems
Given a complete MDP model:
Policy evaluation - given a policy π, estimate its return.
Optimal control - find an optimal policy π* (maximizing the return from any start state).
20
Planning - Value Functions
V^π(s): the expected return starting at state s and following π.
Q^π(s,a): the expected return starting at state s, taking action a, and then following π.
V*(s) and Q*(s,a) are defined using an optimal policy π*.
V*(s) = max_π V^π(s)
21
Planning - Policy Evaluation
Discounted infinite horizon (Bellman equation):
V^π(s) = E_{s'~δ(s,π(s))} [ R(s,π(s)) + γ V^π(s') ]
Rewriting the expectation gives a linear system of equations in the values V^π(s).
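In matrix form (a standard rewriting, stated here as an aside rather than taken from the slide): writing P^π(s,s') = δ(s,π(s),s') and R^π(s) = E[R(s,π(s))], the Bellman equation reads
V^π = R^π + γ P^π V^π,  hence  V^π = (I - γ P^π)^{-1} R^π.
The matrix I - γ P^π is invertible for γ < 1, so policy evaluation amounts to solving n linear equations.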
22
Algorithms - Policy Evaluation Example
States s0, s1, s2, s3 on a cycle; A = {+1,-1}; γ = 1/2; δ(s_i,a) = s_{i+a (mod 4)}; π = random; for every a: R(s_i,a) = i.
V^π(s0) = 0 + γ [ π(s0,+1) V^π(s1) + π(s0,-1) V^π(s3) ]
23
Algorithms - Policy Evaluation Example (continued)
Same MDP: A = {+1,-1}; γ = 1/2; δ(s_i,a) = s_{i+a (mod 4)}; π = random; R(s_i,a) = i.
V^π(s0) = 0 + ( V^π(s1) + V^π(s3) ) / 4
Solving the linear system: V^π(s0) = 5/3, V^π(s1) = 7/3, V^π(s2) = 11/3, V^π(s3) = 13/3.
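These numbers can be checked by solving (I - γP^π)V^π = R^π directly; the NumPy snippet below just encodes the slide's 4-state cycle under the random policy.

    import numpy as np

    gamma = 0.5
    n = 4
    P = np.zeros((n, n))
    for i in range(n):
        P[i, (i + 1) % n] = 0.5    # action +1 under the random policy
        P[i, (i - 1) % n] = 0.5    # action -1 under the random policy
    R = np.arange(n, dtype=float)  # R(s_i, a) = i for every action a

    V = np.linalg.solve(np.eye(n) - gamma * P, R)
    print(V)   # [1.667 2.333 3.667 4.333] = [5/3, 7/3, 11/3, 13/3]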
24
Algorithms - Optimal Control
State-action value function:
Q^π(s,a) = E[ R(s,a) ] + γ E_{s'~δ(s,a)}[ V^π(s') ]
(for a deterministic policy π).
25
Algorithms - Optimal Control Example
Same MDP: A = {+1,-1}; γ = 1/2; δ(s_i,a) = s_{i+a (mod 4)}; π = random; R(s_i,a) = i.
Q^π(s0,+1) = 0 + γ V^π(s1) = 7/6
Q^π(s0,-1) = 0 + γ V^π(s3) = 13/6
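Plugging the previously computed values into the definition of Q (a direct check, nothing beyond the slide's formula):

    gamma = 0.5
    V = [5/3, 7/3, 11/3, 13/3]     # values of the random policy from the example above

    Q_s0_plus = 0 + gamma * V[1]   # Q(s0,+1) = R(s0,+1) + gamma * V(s1) = 7/6
    Q_s0_minus = 0 + gamma * V[3]  # Q(s0,-1) = R(s0,-1) + gamma * V(s3) = 13/6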
26
Algorithms - Optimal Control
CLAIM: A policy π is optimal if and only if at each state s:
V^π(s) = max_a { Q^π(s,a) }  (Bellman equation)
PROOF (sketch): Assume there is a state s and an action a such that V^π(s) < Q^π(s,a). Then the strategy that performs a at state s (the first time) and follows π afterwards is better than π. This holds each time we visit s, so the policy that always performs action a at state s is better than π.
27
Algorithms - Optimal Control Example
Same MDP: A = {+1,-1}; γ = 1/2; δ(s_i,a) = s_{i+a (mod 4)}; π = random; R(s_i,a) = i.
Changing the policy using the state-action value function: at each state, switch to the action with the larger Q-value (e.g. at s0, prefer -1 since Q^π(s0,-1) = 13/6 > Q^π(s0,+1) = 7/6).
28
Algorithms - Optimal Control
The greedy policy with respect to Q^π(s,a) is
π(s) = argmax_a { Q^π(s,a) }
The ε-greedy policy with respect to Q^π(s,a) is
π(s) = argmax_a { Q^π(s,a) } with probability 1-ε, and
π(s) = a random action with probability ε.
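A direct sketch of the ε-greedy rule, assuming a dict-of-dicts Q table Q[s][a] (the data layout is an assumption, not the slide's notation):

    import random

    def epsilon_greedy(Q, s, actions, epsilon):
        """Pick argmax_a Q[s][a] with prob. 1-epsilon, a random action with prob. epsilon."""
        if random.random() < epsilon:
            return random.choice(actions)             # explore
        return max(actions, key=lambda a: Q[s][a])    # exploit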
29
MDP - computing the optimal policy
1. Linear Programming
2. Value Iteration method
3. Policy Iteration method
30
Convergence
Value Iteration
–The distance from the optimal value drops by a factor of 1-γ in each iteration.
Policy Iteration
–The policy only improves from one iteration to the next.
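As an illustration of the value-iteration method, a minimal sketch over a finite MDP with a known model; the P/R data layout matches the earlier sketches and is an assumption, not the slides' notation.

    def value_iteration(states, actions, P, R, gamma, iters=1000, tol=1e-8):
        """P[s][a] = list of (next_state, prob); R[s][a] = expected immediate reward."""
        V = {s: 0.0 for s in states}
        for _ in range(iters):
            new_V = {
                s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                       for a in actions)
                for s in states
            }
            done = max(abs(new_V[s] - V[s]) for s in states) < tol
            V = new_V
            if done:
                break
        # Greedy policy with respect to the final value function.
        policy = {s: max(actions,
                         key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a]))
                  for s in states}
        return V, policy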
31
Relations to Board Games
state = current board.
action = what we can play.
opponent action = part of the environment.
value function = likelihood of winning.
Q-function = modified policy.
Hidden assumption: the game is Markovian.
32
Planning versus Learning
Tightly coupled in Reinforcement Learning.
Goal: maximize return while learning.
33
Example - Elevator Control
Planning (alone): given an arrival model, build a schedule.
Learning (alone): model the arrival process well.
Real objective: construct a schedule while updating the model.
34
Partially Observable MDP (POMDP)
Rather than observing the state, we observe some function of the state.
Ob - observation function: a random variable for each state.
Problem: different states may "look" similar.
The optimal strategy is history dependent!
Examples: (1) Ob(s) = s + noise. (2) Ob(s) = first bit of s.
35
POMDP - Belief State Algorithm
Given the history of actions and observations, we compute a posterior distribution over the state we are in (the belief state).
The belief-state MDP:
–States: distributions over S (the states of the POMDP).
–Actions: as in the POMDP.
–Transitions: the posterior distribution (given the observation).
We can then perform planning and learning on the belief-state MDP.
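A sketch of the posterior computation behind the belief-state MDP, assuming a transition table delta[s][a] and an observation model obs_prob[s] stored as dictionaries (both layouts are illustrative assumptions):

    def update_belief(b, a, o, states, delta, obs_prob):
        """Bayes update of belief b after taking action a and seeing observation o.
        delta[s][a] = {next_state: prob}; obs_prob[s] = {observation: prob}."""
        new_b = {}
        for s2 in states:
            prior = sum(b[s] * delta[s][a].get(s2, 0.0) for s in states)  # P(s2 | b, a)
            new_b[s2] = obs_prob[s2].get(o, 0.0) * prior                  # * P(o | s2)
        total = sum(new_b.values())
        if total == 0.0:
            raise ValueError("observation has zero probability under this belief")
        return {s2: p / total for s2, p in new_b.items()}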
36
POMDP - hard computational problems
Computing an infinite (polynomial) horizon undiscounted optimal strategy for a deterministic POMDP is PSPACE-hard (NP-complete) [PT,L].
Computing an infinite (polynomial) horizon undiscounted optimal strategy for a stochastic POMDP is EXPTIME-hard (PSPACE-complete) [PT,L].
Computing an infinite (polynomial) horizon undiscounted optimal policy for an MDP is P-complete [PT].
37
Resources
Reinforcement Learning: An Introduction [Sutton & Barto]
Markov Decision Processes [Puterman]
Dynamic Programming and Optimal Control [Bertsekas]
Neuro-Dynamic Programming [Bertsekas & Tsitsiklis]
Ph.D. thesis - Michael Littman