Possible actions: up, down, right, left
Rewards: –0.04 in each non-terminal state
Environment is observable (i.e., the agent knows where it is)
MDP = “Markov Decision Process”
Actions are stochastic: e.g., in the book’s example, the probability of taking the “desired” action is 0.8, and the probability of moving at a right angle to the desired direction is 0.1 to each side (0.2 total).
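Written as a transition model (a reconstruction of the book’s convention, not text from the slide; moves that would run into a wall leave the agent where it is):

\[
T(s, a, s') =
\begin{cases}
0.8 & \text{if } s' \text{ is the square reached by moving in the intended direction } a \\
0.1 & \text{if } s' \text{ is a square reached by moving at a right angle to } a \\
0   & \text{otherwise}
\end{cases}
\]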
Utility of a state sequence = discounted sum of rewards
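Written out with discount factor γ (0 < γ ≤ 1), this is the standard definition:

\[
U([s_0, s_1, s_2, \ldots]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots = \sum_{t=0}^{\infty} \gamma^{t} R(s_t)
\]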
Policy π: a function that maps states to actions.
Optimal policy π*: the policy with the highest expected utility.
Utility of state s given policy π:
U^π(s) = E[ Σ_{t=0}^{∞} γ^t R(s_t) ],
where s_t is the state reached after starting in s and executing π for t steps.
Define: U(s) = U^{π*}(s), the utility of state s when following the optimal policy π*.
Suppose that the agent, in s at time t and following π*, moves to s´ at time t+1. We can write U(s) in terms of the utilities of its successor states (Bellman’s equation):
U(s) = R(s) + γ max_a Σ_{s´} T(s, a, s´) U(s´)
Bellman’s equation yields a set of simultaneous equations, one per state, that can be solved (given certain assumptions) to find the utilities.
State utilities (figure showing the utility of each state in the grid world)
How to learn an optimal policy?
Value iteration:
–Calculate the utility of each state, then use the state utilities to select the optimal action in each state (a sketch follows below).
But: this requires knowing R(s) and T(s, a, s´). In most problems, the agent doesn’t have this knowledge.
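As an illustration only (not code from the book or the slides), here is a minimal value-iteration sketch in Python for a hypothetical three-state MDP; the states, rewards, transition probabilities, and thresholds below are invented for the example.

# Minimal value-iteration sketch for a hypothetical 3-state MDP.
# All states, actions, rewards, and transition probabilities are invented.
GAMMA = 0.9    # discount factor
THETA = 1e-6   # convergence threshold

states = ["s0", "s1", "goal"]
R = {"s0": -0.04, "s1": -0.04, "goal": 1.0}          # R(s)
T = {                                                # T[s][a] = [(s', P(s'|s,a)), ...]
    "s0": {"a": [("s1", 0.8), ("s0", 0.2)], "b": [("s0", 1.0)]},
    "s1": {"a": [("goal", 0.8), ("s0", 0.2)], "b": [("s0", 1.0)]},
    "goal": {},                                      # terminal: no actions
}

def value_iteration():
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if not T[s]:                             # terminal state
                new_u = R[s]
            else:
                # Bellman update: U(s) = R(s) + gamma * max_a sum_s' T(s,a,s') U(s')
                new_u = R[s] + GAMMA * max(
                    sum(p * U[s2] for s2, p in outcomes)
                    for outcomes in T[s].values())
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < THETA:
            return U

def best_policy(U):
    # Select, in each non-terminal state, the action with the highest expected utility.
    return {s: max(T[s], key=lambda a: sum(p * U[s2] for s2, p in T[s][a]))
            for s in states if T[s]}

U = value_iteration()
print(U)               # learned state utilities
print(best_policy(U))  # optimal action in each non-terminal state

The Bellman update inside the loop is exactly the equation from the earlier slide, applied repeatedly until the utilities stop changing.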
Reinforcement Learning
The agent has no teacher (contrast with NN learning) and no prior knowledge of the reward function or of the state-transition function.
“Imagine playing a new game whose rules you don’t know; after a hundred or so moves, your opponent announces ‘you lose’. This is reinforcement learning in a nutshell.” (Textbook)
Question: How best to explore “on-line” (while receiving rewards and punishments)?
Analogous to the multi-armed bandit problem mentioned earlier (a simple exploration strategy is sketched below).
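One common exploration strategy (not specified on the slide; shown only as an illustrative sketch) is ε-greedy action selection: with probability ε take a random action, otherwise take the action that currently looks best. The function name and the default ε below are my own choices.

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # With probability epsilon, explore: pick a random action.
    if random.random() < epsilon:
        return random.choice(actions)
    # Otherwise exploit: pick the action with the highest current Q(s, a).
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

A smaller ε means more exploitation; a common refinement is to decay ε over time so the agent explores early and exploits later.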
Q-learning
Don’t learn utilities! Instead, learn an action-value function Q: S × A → ℝ.
Q(s, a) = estimated value of taking action a in state s, so U(s) = max_a Q(s, a).
“Model-free” method.
If we knew Q(s, a) for each state/action pair, we could simply choose, in each state, the action that maximizes Q.
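For reference (standard definitions, not text from the slide), the stochastic Bellman-style equation for Q and the greedy policy read off from it are:

\[
Q(s, a) = R(s) + \gamma \sum_{s'} T(s, a, s') \, \max_{a'} Q(s', a'),
\qquad
\pi^*(s) = \arg\max_{a} Q(s, a)
\]

The update on the next slide is the special case where T is deterministic and γ = 1.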
How to learn Q (simplified from Figure 21.8)
Assume T: S × A → S is a deterministic state-transition function.
Assume α = γ = 1.

Q-Learn() {
  Q(s, a) = 0 for all s, a;    // initialize the Q-matrix to all zeros
  s = s0;
  while s is not a terminal state {
    choose action a;           // many different ways to do this (e.g., ε-greedy)
    s' = T(s, a);
    Q(s, a) = R(s) + max_a' Q(s', a');
    s = s';
  }
}
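Below is a runnable Python sketch of this simplified algorithm (deterministic transitions, α = γ = 1) on a hypothetical toy environment. Everything here (the corridor of states, the rewards, the number of episodes, and the purely random action choice) is invented for illustration; one added convention is that Q for a terminal state is seeded with that state's reward, matching U(terminal) = R(terminal).

import random
from collections import defaultdict

# Hypothetical deterministic environment: a corridor s0 - s1 - s2 - goal.
STATES = ["s0", "s1", "s2", "goal"]
ACTIONS = ["left", "right"]
TERMINAL = {"goal"}

def R(s):
    # Reward function: -0.04 in each non-terminal state, +1 at the goal.
    return 1.0 if s == "goal" else -0.04

def T(s, a):
    # Deterministic transition function T: S x A -> S (walls clamp the move).
    i = STATES.index(s)
    j = i + 1 if a == "right" else i - 1
    return STATES[min(max(j, 0), len(STATES) - 1)]

def q_learn(episodes=200):
    Q = defaultdict(float)                       # Q(s, a), initialized to zero
    for s in TERMINAL:                           # convention: terminal value = its reward
        for a in ACTIONS:
            Q[(s, a)] = R(s)
    for _ in range(episodes):                    # repeat the loop over many episodes
        s = "s0"
        while s not in TERMINAL:
            a = random.choice(ACTIONS)           # choose action a (here: at random)
            s2 = T(s, a)
            # Simplified update with alpha = gamma = 1:
            Q[(s, a)] = R(s) + max(Q[(s2, a2)] for a2 in ACTIONS)
            s = s2
    return Q

Q = q_learn()
# Greedy policy read off from the learned Q values:
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES if s not in TERMINAL})

With enough random episodes, the +1 at the goal propagates backward through the Q values and the greedy policy moves right in every state.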
Pathfinder demo
How to do HW problem 4