1
Reinforcement Learning
Tamara Berg
CS 590-133 Artificial Intelligence
Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer
2
Announcements
HW1 is graded
– Grades are in your submission folders in a file called grade.txt (mean = 14.15, median = 18)
– Thursday we'll have demos of the extra credits
Office hours are canceled today
3
Reminder
Mid-term exam next Tuesday, Feb 18
– Held during regular class time in SN014 and SN011
– Closed book
– Short-answer written questions
– Shubham will hold a mid-term review + Q&A session on Feb 14 at 5pm in SN014
4
Exam topics
1) Intro to AI, agents and environments
– Turing test
– Rationality
– Expected utility maximization
– PEAS
– Environment characteristics: fully vs. partially observable, deterministic vs. stochastic, episodic vs. sequential, static vs. dynamic, discrete vs. continuous, single-agent vs. multi-agent, known vs. unknown
2) Search
– Search problem formulation: initial state, actions, transition model, goal state, path cost
– State space
– Search tree
– Frontier
– Evaluation of search strategies: completeness, optimality, time complexity, space complexity
– Uninformed search strategies: breadth-first search, uniform cost search, depth-first search, iterative deepening search
– Informed search strategies: greedy best-first, A*, weighted A*
– Heuristics: admissibility, dominance
5
Exam topics
3) Constraint satisfaction problems
– Backtracking search
– Heuristics: most constrained/most constraining variable, least constraining value
– Forward checking, constraint propagation, arc consistency
– Tree-structured CSPs
– Local search
4) Games
– Zero-sum games
– Game tree
– Minimax/Expectimax/Expectiminimax search
– Alpha-beta pruning
– Evaluation function
– Quiescence search
– Horizon effect
– Stochastic elements in games
6
Exam topics
5) Markov decision processes
– Markov assumption, transition model, policy
– Bellman equation
– Value iteration
– Policy iteration
6) Reinforcement learning
– Model-based vs. model-free approaches
– Passive vs. active
– Exploration vs. exploitation
– Direct estimation
– TD learning
– TD Q-learning
7
Reminder from last class
8
Stochastic, sequential environments: Markov Decision Processes
(Image credit: P. Abbeel and D. Klein)
9
Components:
– States s, beginning with initial state s0
– Actions a: each state s has actions A(s) available from it
– Transition model P(s' | s, a). Markov assumption: the probability of going to s' from s depends only on s and a, and not on any other past actions or states
– Reward function R(s)
Policy π(s): the action that an agent takes in any given state
– The "solution" to an MDP
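To make these components concrete, here is a minimal Python sketch of one way such an MDP could be represented. The two-state chain, its rewards, and its transition probabilities are invented purely for illustration; later sketches in this section reuse these names.

    # A toy MDP as plain dictionaries (states, actions, and numbers invented for illustration).
    states = ["s0", "s1", "end"]
    actions = {"s0": ["stay", "go"], "s1": ["stay", "go"], "end": []}   # A(s); terminal has none
    P = {   # transition model P(s' | s, a)
        "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "go": {"s1": 0.8, "s0": 0.2}},
        "s1": {"stay": {"s1": 0.9, "s0": 0.1}, "go": {"end": 0.8, "s1": 0.2}},
    }
    R = {"s0": -0.04, "s1": -0.04, "end": 1.0}   # reward function R(s)
    gamma = 0.9                                  # discount factor
    policy = {"s0": "go", "s1": "go"}            # a policy pi(s): one action per non-terminal state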
10
Overview
First, we will look at how to "solve" MDPs, i.e., find the optimal policy when the transition model and the reward function are known.
Second, we will consider reinforcement learning, where we don't know the rules of the environment or the consequences of our actions.
11
Grid world
R(s) = -0.04 for every non-terminal state
Transition model: the agent moves in the intended direction with probability 0.8 and slips to each perpendicular direction with probability 0.1
Source: P. Abbeel and D. Klein
12
Goal: a policy
Source: P. Abbeel and D. Klein
13
Grid world
R(s) = -0.04 for every non-terminal state
Transition model: 0.8 intended direction, 0.1 to each perpendicular direction
14
Grid world Optimal policy when R(s) = -0.04 for every non-terminal state
15
Grid world Optimal policies for other values of R(s):
16
Solving MDPs
MDP components:
– States s
– Actions a
– Transition model P(s' | s, a)
– Reward function R(s)
The solution:
– Policy π(s): mapping from states to actions
– How to find the optimal policy?
17
Finding the utilities of states
(Figure: an expectimax-style tree with a max node for the agent's choice of action and chance nodes weighted by P(s' | s, a), leading to successor utilities U(s').)
– What is the expected utility of taking action a in state s?
– How do we choose the optimal action?
– What is the recursive expression for U(s) in terms of the utilities of its successor states?
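The next slide gives the recursive expression; for the first two questions, the standard answers in this notation are (a sketch added here for reference, not part of the original slide):

    \mathrm{EU}(a \mid s) = \sum_{s'} P(s' \mid s, a)\, U(s')
    \pi^*(s) = \operatorname{arg\,max}_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')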
18
The Bellman equation
Recursive relationship between the utilities of successive states:

    U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')

Reading the pieces: receive reward R(s); choose the optimal action a; end up in s' with probability P(s' | s, a); get utility U(s') (discounted by γ).
19
The Bellman equation
Recursive relationship between the utilities of successive states (the equation above).
For N states, we get N equations in N unknowns
– Solving them solves the MDP
– We can't simply solve them algebraically: the max over actions makes the equations nonlinear
– Instead, there are two iterative methods: value iteration and policy iteration
20
Method 1: Value iteration
Start out with every U(s) = 0
Iterate until convergence
– During the i-th iteration, update the utility of each state according to this rule:

    U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')

In the limit of infinitely many iterations, this is guaranteed to find the correct utility values
– In practice, we don't need an infinite number of iterations…
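As a rough Python sketch of this procedure, reusing the hypothetical dictionary MDP from earlier (the convergence threshold eps is an arbitrary choice, not from the slides):

    def value_iteration(states, actions, P, R, gamma, eps=1e-6):
        # Start with U(s) = 0 for every state, then apply Bellman updates until they stop changing.
        U = {s: 0.0 for s in states}
        while True:
            new_U = {}
            delta = 0.0
            for s in states:
                if not actions[s]:
                    new_U[s] = R[s]   # terminal state: its utility is just its reward
                else:
                    # Bellman update: reward now plus discounted best expected future utility.
                    new_U[s] = R[s] + gamma * max(
                        sum(p * U[s2] for s2, p in P[s][a].items())
                        for a in actions[s]
                    )
                delta = max(delta, abs(new_U[s] - U[s]))
            U = new_U
            if delta < eps:          # stop when no update changes any utility by more than eps
                return U

For the toy MDP sketched earlier, calling value_iteration(states, actions, P, R, gamma) returns a table of utility estimates that the later sketches reuse.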
21
Value iteration What effect does the update have? Value iteration demo
22
Values vs. policy
– Basic idea: the approximations get refined toward the optimal values
– The policy may converge long before the values do (a policy can be read off the current values at any point, as sketched below)
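One way to read off the policy implied by the current utility estimates is a greedy one-step lookahead, sketched here with the same hypothetical dictionaries as before:

    def greedy_policy(states, actions, P, U):
        # In each non-terminal state, pick the action with the highest expected successor utility.
        return {
            s: max(actions[s], key=lambda a: sum(p * U[s2] for s2, p in P[s][a].items()))
            for s in states
            if actions[s]
        }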
23
Method 2: Policy iteration
Start with some initial policy π0 and alternate between the following steps:
– Policy evaluation: calculate U^πi(s) for every state s
– Policy improvement: calculate a new policy π(i+1) based on the updated utilities
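A compact sketch of this alternation, using two hypothetical helpers: greedy_policy from the earlier sketch as the improvement step, and policy_evaluation, which is sketched after the next slide. Neither is code from the course.

    def policy_iteration(states, actions, P, R, gamma, policy):
        # Alternate evaluation and greedy improvement until the policy stops changing.
        while True:
            U = policy_evaluation(states, actions, P, R, gamma, policy)   # policy evaluation
            new_policy = greedy_policy(states, actions, P, U)             # policy improvement
            if new_policy == policy:
                return policy, U
            policy = new_policy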
24
Policy evaluation
Given a fixed policy π, calculate U^π(s) for every state s
The Bellman equation for the optimal policy:

    U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')

– How does it need to change if our policy is fixed? The action in state s is always π(s), so the max over actions disappears:

    U^{\pi}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U^{\pi}(s')

– The equations are now linear, so we can solve a linear system to get all the utilities!
– Alternatively, we can apply the following update:

    U^{\pi}_{i+1}(s) \leftarrow R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U^{\pi}_i(s')
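Since the fixed-policy equations are linear, one hedged sketch is to solve them directly with numpy, indexing the same hypothetical dictionaries as before:

    import numpy as np

    def policy_evaluation(states, actions, P, R, gamma, policy):
        # Solve (I - gamma * T_pi) U = R, where T_pi[s, s'] = P(s' | s, pi(s)).
        idx = {s: i for i, s in enumerate(states)}
        n = len(states)
        T = np.zeros((n, n))
        for s in states:
            if actions[s]:                        # terminal states get an all-zero row,
                for s2, p in P[s][policy[s]].items():   # so U(terminal) = R(terminal)
                    T[idx[s], idx[s2]] = p
        r = np.array([R[s] for s in states])
        u = np.linalg.solve(np.eye(n) - gamma * T, r)
        return {s: u[idx[s]] for s in states}

The iterative update on the slide is the alternative when the state space is too large to solve the linear system directly.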
27
Reinforcement learning (Chapter 21)
28
Short intro to learning … much more to come later
29
What is machine learning?
Computer programs that can learn from data
Two key components:
– Representation: how should we represent the data?
– Generalization: the system should generalize from its past experience (observed data items) to perform well on unseen data items
30
Types of ML algorithms
– Unsupervised: algorithms operate on unlabeled examples
– Supervised: algorithms operate on labeled examples
– Semi/partially-supervised: algorithms combine both labeled and unlabeled examples
33
(Slide from Dan Klein)
34
Example: Image classification
Input: images (apple, pear, tomato, cow, dog, horse); desired output: the corresponding labels
Slide credit: Svetlana Lazebnik
35
(Slide from Dan Klein: MNIST handwritten digits, http://yann.lecun.com/exdb/mnist/index.html)
36
Reinforcement learning for flight Stanford autonomous helicopter
37
Types of ML algorithms
– Unsupervised: algorithms operate on unlabeled examples
– Supervised: algorithms operate on labeled examples
– Semi/partially-supervised: algorithms combine both labeled and unlabeled examples
38
Supervised learning has many successes:
– recognize speech
– steer a car
– classify documents
– classify proteins
– recognize faces and objects in images
– …
Slide credit: Avrim Blum
39
However, for many problems, labeled data can be rare or expensive (you need to pay someone to label it, it may require special testing, …), while unlabeled data is much cheaper.
Slide credit: Avrim Blum
40
However, for many problems, labeled data can be rare or expensive (you need to pay someone to label it, it may require special testing, …), while unlabeled data is much cheaper.
Examples: speech, images, medical outcomes, customer modeling, protein sequences, web pages
Slide credit: Avrim Blum
41
However, for many problems, labeled data can be rare or expensive (you need to pay someone to label it, it may require special testing, …), while unlabeled data is much cheaper.
[Figure from Jerry Zhu]
Slide credit: Avrim Blum
42
However, for many problems, labeled data can be rare or expensive (you need to pay someone to label it, it may require special testing, …), while unlabeled data is much cheaper.
Can we make use of cheap unlabeled data?
Slide credit: Avrim Blum
43
Semi-Supervised Learning
Can we use unlabeled data to augment a small labeled sample to improve learning?
– But unlabeled data is missing the most important info!
– But maybe it still has useful regularities that we can use.
– But…
Slide credit: Avrim Blum
44
Reinforcement Learning
45
Components (same as MDP):
– States s, beginning with initial state s0
– Actions a: each state s has actions A(s) available from it
– Transition model P(s' | s, a)
– Reward function R(s)
Policy π(s): the action that an agent takes in any given state
– The "solution"
New twist: we don't know the transition model or the reward function ahead of time!
– We have to actually try out actions and states to learn
48
Reinforcement learning: basic scheme
In each time step:
– Take some action
– Observe the outcome of the action: successor state and reward
– Update some internal representation of the environment and policy
– If you reach a terminal state, just start over (each pass through the environment is called a trial)
Why is this called reinforcement learning?
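As a generic sketch of this interaction loop in Python, assuming hypothetical env (with reset/step) and agent (with choose_action/update) objects that are not part of the course material:

    def run_trials(agent, env, num_trials=100):
        for _ in range(num_trials):
            s = env.reset()                     # start a new trial
            done = False
            while not done:
                a = agent.choose_action(s)      # take some action
                s_next, r, done = env.step(a)   # observe successor state and reward
                agent.update(s, a, s_next, r)   # update internal representation / policy
                s = s_next                      # on reaching a terminal state, start over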
50
Passive reinforcement learning strategies
Model-based
– Learn the model of the MDP (transition probabilities and rewards) and evaluate the state utilities under the given policy
Model-free
– Learn state utilities without explicitly modeling the transition probabilities P(s' | s, a)
– TD learning: use the observed transitions and rewards to adjust the utilities of states so that they agree with the Bellman equations
51
Model-based reinforcement learning
Basic idea: try to learn the model of the MDP (transition probabilities and rewards) and evaluate the given policy
– Keep track of how many times state s' follows state s when you take action a, and update the transition probability P(s' | s, a) according to the relative frequencies
– Keep track of the rewards R(s)
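A hedged sketch of the counting idea, assuming (as elsewhere in these slides) that the reward r observed in a transition is the reward R(s) for being in state s:

    from collections import defaultdict

    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = times s' followed (s, a)
    rewards = {}                                     # observed reward for each state

    def record_transition(s, a, s_next, r):
        counts[(s, a)][s_next] += 1
        rewards[s] = r                               # R(s) is deterministic here, so just store it

    def estimated_P(s, a):
        # Relative-frequency estimate of P(s' | s, a); assumes (s, a) has been observed at least once.
        total = sum(counts[(s, a)].values())
        return {s_next: n / total for s_next, n in counts[(s, a)].items()}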
53
Model-free reinforcement learning
Idea: learn state utilities without explicitly modeling the transition probabilities P(s' | s, a)
Direct utility estimation
– The utility of a state is the expected total reward from that state onward
– Each trial provides a sample of this quantity for each state visited
– Just keep a running average for each state in a table
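A minimal sketch of direct utility estimation for one completed trial, where trial is assumed to be a list of (state, reward) pairs in visiting order (the representation is an assumption, not from the slides):

    from collections import defaultdict

    totals = defaultdict(float)   # sum of observed returns per state
    visits = defaultdict(int)     # number of samples per state
    U = {}                        # running-average utility estimates

    def update_from_trial(trial, gamma=1.0):
        G = 0.0
        # Walk the trial backwards so G accumulates the (discounted) reward-to-go from each state.
        for s, r in reversed(trial):
            G = r + gamma * G
            totals[s] += G
            visits[s] += 1
            U[s] = totals[s] / visits[s]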
55
Model-free reinforcement learning
Idea: learn state utilities without explicitly modeling the transition probabilities P(s' | s, a)
Temporal difference (TD) learning idea: update U(s) each time we experience a transition (s, a, s', r)
– The policy is still fixed!
– Use the observed transitions to adjust the utilities of states so that they agree with the Bellman equations
– Likely successors s' will contribute updates more often
When a transition occurs from s to s', apply the update

    U^{\pi}(s) \leftarrow U^{\pi}(s) + \alpha \left( R(s) + \gamma\, U^{\pi}(s') - U^{\pi}(s) \right)

where α is the learning rate; it should start at 1 and decay as O(1/t).
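A short sketch of this update in Python; the per-state 1/t learning-rate schedule is one choice consistent with the slide, and the variable names are illustrative:

    from collections import defaultdict

    U = defaultdict(float)          # utility estimates, starting at 0
    visit_count = defaultdict(int)  # how many times each state has been updated

    def td_update(s, s_next, r, gamma):
        visit_count[s] += 1
        alpha = 1.0 / visit_count[s]                       # decays as O(1/t)
        U[s] += alpha * (r + gamma * U[s_next] - U[s])     # move U(s) toward the TD target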