COMP 2208 Dr. Long Tran-Thanh University of Southampton Reinforcement Learning
Decision making. [Diagram: the agent loop between Environment, Perception, Decision making, and Behaviour, with the learning tasks: categorize inputs, update belief model, update decision making policy.]
Sequential decision making: repeatedly making decisions, one decision per round. Uncertainty: the outcome is noisy and not known in advance. [Diagram: Environment, Perception, Decision making, Behaviour/Action]
Example: getting out of the maze. At each time step: choose a direction, make a step, check whether it’s the exit. Goal: find a way out of the maze. It’s a standard search problem.
Example: getting out of the maze. At each time step: choose a direction, make a step, check whether it’s the exit. New goal: find the shortest path from A (entrance) to B (exit). The robot can try several times. [Figure: candidate paths]
Example: getting out of the maze. At each time step: choose a direction, make a step, check whether it’s the exit. What we want is a policy (of behaviour): in each situation, it tells us what to do. Here the policy is a system of shortest paths, from each point in the maze to the exit.
Example: getting out of the maze. At each time step: choose a direction, make a step, check whether it’s the exit. Supervised vs. unsupervised learning? Unsupervised. Offline vs. online learning? Online.
Reinforcement learning: a specific unsupervised online learning problem. Some see it as distinct from unsupervised learning (e.g., Bishop, 2006). The setting: the agent repeatedly interacts with the environment and gets some feedback (e.g., positive or negative) = reinforcement. Learning problem: to get a good policy based on the feedback.
Motivations
Difficulty of reinforcement learning
How do we know which actions lead us to victory? And what about those that made us lose the game? How can we measure which action is the best to take at each time step? We need to be able to evaluate the actions we take (and the states we are in)
States, actions, and rewards. Think about the world as a set of (discrete) states. With an action, we move from one state to another. Reward = feedback from the environment; it measures the “goodness” of the action taken. Good/bad states: e.g., the exit of the maze = good, other locations = bad (or not so good). Actions: e.g., physically moving between locations. If the new state is “good”, the reward is high, and vice versa. Goal: maximise the sum of collected rewards over time.
Simple example. A simple maze with six states, laid out as a grid: A B C (top row), D E F (bottom row). F is the terminal state (the game starts again at A). F gives a +100 reward (i.e., any action that takes us to F receives +100). All other states have zero reward. Maximising rewards over time = finding the shortest path.
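The six-state maze can be written down as a small data structure. This is a minimal sketch, assuming a 2x3 grid (A B C over D E F) with no inner walls and with F as the exit, which is how the later TD-learning slides treat it. The names `MAZE`, `REWARD_ON_ENTRY`, `step`, and `path_return` are hypothetical and not from the slides.

```python
# Assumed adjacency of the 2x3 grid: A B C on top, D E F below, no inner walls.
MAZE = {
    "A": ["B", "D"],
    "B": ["A", "C", "E"],
    "C": ["B", "F"],
    "D": ["A", "E"],
    "E": ["B", "D", "F"],
    "F": [],  # exit: the game restarts at A, so no moves are listed here
}
REWARD_ON_ENTRY = {"F": 100}  # every other state gives zero reward

def step(state, next_state):
    """Move between adjacent states; return (reward, episode_finished)."""
    if next_state not in MAZE[state]:
        raise ValueError(f"{state} -> {next_state} is not a legal move")
    return REWARD_ON_ENTRY.get(next_state, 0), next_state == "F"

def path_return(path):
    """Sum of the rewards collected along a sequence of states."""
    return sum(step(s, n)[0] for s, n in zip(path, path[1:]))

print(path_return(["A", "B", "C", "F"]))  # a shortest path from A to the exit
```

Every path that ends at F collects the same +100, so the reward alone does not yet separate short paths from long ones.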
An intuition of learning good policies. [Maze: A B C / D E F (+100)] At the beginning we have no prior knowledge. We start with a simple policy: just move randomly at each state. But of course, we pay attention to the rewards we’ve collected so far. At some point we will arrive at F, and for that move we receive +100. Reasoning: what was the last state before I got to F? It must be a good state too (since it leads to F). We update the value of that state to be “good”.
Temporal difference learning. We maintain the value V of each state. These values represent how valuable the states are (in the long run). We update them as we go along.
Temporal difference learning How are we going to update our estimate of each state's value? Immediate reward is important. If moving into a certain state gives us a reward or punishment, that obviously needs recording. But future reward is important too. A state may give us nothing now, but we still like it if it is linked to future reinforcement.
Temporal difference learning. It is also important not to let any single learning experience change our opinion too much. We want to change our V estimates gradually to allow the long-run picture to emerge, so we need a learning rate. The formula for TD learning combines all of these factors: current reward, future reward, and progressive learning.
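Temporal difference learning. $V_i(t+1) = V_i(t) + a\,[\,r + V_j(t) - V_i(t)\,]$, where $V_i(t+1)$ is the estimate of $V_i$ at time step $(t+1)$, $V_i(t)$ is the old estimate, $a$ is the learning rate, $r$ is the current reward, and $V_j(t) - V_i(t)$ is the temporal difference: the state value difference between the new state $j$ and the old state $i$.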
Temporal difference learning. Suppose a = 0.1 (learning rate). We move from A to B. We will receive 0 for a while …
Temporal difference learning. Suppose we have a C -> F move at a certain point (F = +100). The new value of V_C is 10! Since F is the exit, we restart the game in the next step (we don’t update V_F), but we still keep the V_i values …
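The arithmetic behind these two moves can be checked with a few lines of code. This is a minimal sketch, assuming the update has the form V_i <- V_i + a * (r + V_j - V_i), which is consistent with the labelled terms and with the value 10 quoted for V_C. The function name `td_update` is hypothetical.

```python
def td_update(v_old, v_new, reward, a=0.1):
    """One undiscounted TD step: V_i <- V_i + a * (r + V_j - V_i)."""
    return v_old + a * (reward + v_new - v_old)

# All state values start at zero.
V = {s: 0.0 for s in "ABCDEF"}

# Move A -> B: no reward and both values are zero, so V_A stays at 0.
V["A"] = td_update(V["A"], V["B"], reward=0)

# Move C -> F: entering F pays +100, so V_C becomes 0 + 0.1 * (100 + 0 - 0).
V["C"] = td_update(V["C"], V["F"], reward=100)

print(V["A"], V["C"])
```

V_F itself is never updated, because the game restarts as soon as F is reached.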
Temporal difference learning. After a fair number of time steps, we’re starting to get a sense of where the high-value states are. But then, after 500K steps: [Figure: state values, labelled 100 and 0]
Temporal difference learning. So what’s the reason here? From any state we can eventually get to F, sooner or later, so in the long term all the states are equally valuable. We need a way to distinguish paths that require fewer moves: states that lie on shorter paths should have higher value.
Temporal difference learning. Solution: discounting future rewards. Rewards in the far future are less valuable; current rewards and rewards in the near future are more important. This is controlled by a discount factor.
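In the standard form of discounted TD learning, the discount factor multiplies the value of the new state. The symbol \gamma and the range given for it are assumptions; they do not appear in the extracted slide text.

```latex
V_i(t+1) = V_i(t) + a \,\bigl[\, r + \gamma\, V_j(t) - V_i(t) \,\bigr],
\qquad 0 \le \gamma < 1
```

Setting \gamma = 1 gives back the undiscounted update, while smaller values of \gamma shrink the contribution of rewards that lie many steps ahead.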
Temporal difference learning. Now, even after 500K steps: [Figure: state values]. What is the best policy then? Always move to the neighbour with the highest value. We need to know which actions are needed within this policy.
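The effect of discounting can be reproduced with a short simulation. This is a sketch under several assumptions: a 2x3 grid (A B C over D E F) with no inner walls, a uniformly random walk, a discount factor `gamma` of 0.9, and 50,000 steps instead of 500K to keep it fast. Because V_F is never updated, the policy here ranks neighbours by immediate reward plus discounted value rather than by value alone. All function names are hypothetical.

```python
import random

GRID = ["ABC", "DEF"]          # assumed layout: A B C above D E F, no inner walls
EXIT, START = "F", "A"

def neighbours(state):
    """States reachable in one step (up, down, left, right)."""
    row = next(r for r, line in enumerate(GRID) if state in line)
    col = GRID[row].index(state)
    moves = ((-1, 0), (1, 0), (0, -1), (0, 1))
    return [GRID[row + dr][col + dc] for dr, dc in moves
            if 0 <= row + dr < len(GRID) and 0 <= col + dc < len(GRID[0])]

def reward(next_state):
    """+100 for any move that enters the exit, zero otherwise."""
    return 100 if next_state == EXIT else 0

def run_td(gamma, a=0.1, steps=50_000, seed=0):
    """Random walk through the maze, applying the TD update at every move."""
    rng = random.Random(seed)
    V = {s: 0.0 for s in "ABCDEF"}
    state = START
    for _ in range(steps):
        nxt = rng.choice(neighbours(state))
        V[state] += a * (reward(nxt) + gamma * V[nxt] - V[state])
        state = START if nxt == EXIT else nxt   # restart after reaching the exit
    return V

def greedy_policy(V, gamma):
    """For each state, pick the neighbour with the best reward plus value."""
    return {s: max(neighbours(s), key=lambda n: reward(n) + gamma * V[n])
            for s in V if s != EXIT}

V_plain = run_td(gamma=1.0)    # no discounting: values drift towards 100
V_disc = run_td(gamma=0.9)     # discounting: states nearer the exit score higher
policy = greedy_policy(V_disc, gamma=0.9)

print({s: round(v, 1) for s, v in V_plain.items()})
print({s: round(v, 1) for s, v in V_disc.items()})
print(policy)
```

With a learning rate of 0.1 the discounted estimates keep fluctuating around their long-run averages, so the exact numbers depend on the seed.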
Q-learning. In some cases, we also need to learn the outcomes of the actions: we don’t know which actions will take us to which state. Put differently, we want to learn the value of taking each action as well (Q value). The expected value of getting to state j is the maximum Q value we could get for any action x done at j. [Diagram: action k leads from state i to state j; actions x are available at j]
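The idea that the value of reaching state j is the best Q value available at j can be sketched as an update function. This follows the standard Q-learning form; the discount factor `gamma`, the action names, and the function name `q_update` are assumptions introduced for illustration.

```python
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, next_actions,
             a=0.1, gamma=0.9):
    """Q(i,k) <- Q(i,k) + a * (r + gamma * max_x Q(j,x) - Q(i,k))."""
    # Value of the new state j: the best Q value over the actions x available
    # there, or zero if j is terminal and offers no actions.
    best_next = max((Q[(next_state, x)] for x in next_actions), default=0.0)
    Q[(state, action)] += a * (reward + gamma * best_next - Q[(state, action)])
    return Q[(state, action)]

Q = defaultdict(float)  # every (state, action) pair starts at zero

# Taking the hypothetical action "down" at C reaches the exit F (+100).
first = q_update(Q, "C", "down", 100, "F", next_actions=[])

# Taking "right" at B reaches C, where "down" now looks promising.
second = q_update(Q, "B", "right", 0, "C", next_actions=["down", "left"])

print(first, second)
```

The second update shows how value propagates backwards: B gains value only because an action at C has already been found to be good.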
Markov Decision Processes. So far, our actions deterministically lead us from one state to another. But in many real-world situations, the state transitions are stochastic.
Markov Decision Processes. How to capture this uncertainty? Markov decision process: states, actions, state transitions, and rewards as usual, but the state transition is stochastic. [Diagram: action k at state i can lead to any of the states J1, J2, ..., Jm] Markov property: the probability of arriving at state j as the next state depends only on the current state and the current action. The past has no influence on the near future.
Markov Decision Processes. TD-learning and Q-learning [update equations, annotated with the state transition probability]
How to update the values in MDPs? Since the state transition is stochastic, we don’t know in advance what the next state will be if we take action k. If the system is real, and we can only control the actions, then we just take the action and observe the next state. However, in many cases the system is a simulation, and thus we need to control the state transition as well. How can we simulate the state transition process?
Monte Carlo simulation (Fermi, von Neumann, Ulam, Metropolis). Action k at state i leads to J1 with P1 = 0.5, to J2 with P2 = 0.2, or to J3 with P3 = 0.3. Random generator: draw a number between 0 and 1. If it is below 0.5 -> choose J1. If it is between 0.5 and 0.7 -> choose J2. If it is above 0.7 -> choose J3.
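The rule above amounts to walking through the cumulative probabilities until the random draw is covered. This is a minimal sketch using the probabilities from the slide; the function name `sample_next_state` and the `rng` parameter are hypothetical.

```python
import random

# Transition probabilities for action k at state i, as given on the slide.
TRANSITIONS = [("J1", 0.5), ("J2", 0.2), ("J3", 0.3)]

def sample_next_state(transitions, rng=random):
    """Draw u in [0, 1) and return the state whose cumulative range holds u."""
    u = rng.random()
    cumulative = 0.0
    for state, probability in transitions:
        cumulative += probability
        if u < cumulative:
            return state
    # Guard against rounding error when the probabilities sum to just under 1.
    return transitions[-1][0]

# Rough check: the empirical frequencies should be near 0.5, 0.2 and 0.3.
rng = random.Random(0)
draws = [sample_next_state(TRANSITIONS, rng) for _ in range(10_000)]
print({s: draws.count(s) / len(draws) for s, _ in TRANSITIONS})
```

Passing a seeded `random.Random` instance makes a simulation run repeatable, which helps when comparing learning rates or discount factors.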
Which actions/states should we update? So far we have only dealt with how to update the V and Q values. Question: which one should we choose to update next (i.e., which action should we choose next)? If it’s all about learning the values as accurately as possible: choose the actions uniformly at random. In some cases, the actual rewards count as well (no training phase).
Exploration vs. exploitation. The dilemma: Exploration: we want to learn as accurately as possible in order to make better decisions (the longer we learn, the better it is). Exploitation: we want to find the best actions as soon as possible (the less exploration the better). How to solve this paradox? Remember the bandit algorithms? We combine epsilon-greedy with TD-learning or Q-learning: we choose the highest estimate with probability (1 - epsilon), and choose another one uniformly at random with probability epsilon.
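The epsilon-greedy choice described above fits in a few lines. This is a minimal sketch, assuming the estimates are held in a dictionary that maps each option (an action or a neighbouring state) to its current V or Q estimate. The function name `epsilon_greedy` and the example values are hypothetical.

```python
import random

def epsilon_greedy(estimates, epsilon=0.1, rng=random):
    """Pick the best-looking option with probability 1 - epsilon,
    otherwise pick one of the other options uniformly at random."""
    best = max(estimates, key=estimates.get)
    others = [option for option in estimates if option != best]
    if others and rng.random() < epsilon:
        return rng.choice(others)   # explore
    return best                     # exploit

# Illustrative estimates for the neighbours of some state.
estimates = {"B": 57.0, "D": 51.0}
rng = random.Random(0)
picks = [epsilon_greedy(estimates, epsilon=0.1, rng=rng) for _ in range(1_000)]
print(picks.count("B"), picks.count("D"))
```

A larger epsilon spends more steps on exploration; epsilon = 0 is pure exploitation of the current estimates.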
Extensions of reinforcement learning. Partially observable MDPs (PoMDPs): we don’t fully observe the state we are in, so we maintain some belief about the possible states we can be in (Bayesian). Decentralised MDPs (DecMDPs): multiple agents working together in a decentralised manner. DecPoMDP: the combination of the above two. Inverse RL: classical RL asks, given the system, what is the optimal policy? Inverse RL asks, given the optimal policy, what is the underlying system?
Applications
Teaching a helicopter to perform inverted hovering (Stanford). Smart home heating (Southampton).
Summary. Reinforcement learning: an important learning problem. It is unsupervised, online, and has feedback from the environment. State value update: TD-learning. Action value update: Q-learning. Which action/state to choose next: epsilon-greedy. MDP, PoMDP, DecMDP, etc.