1
Intro to Reinforcement Learning
Learning how to get what you want...
2
Administrivia
HW 2 due today (you've all turned it in, right?)
Remember: Midterm Thu, Oct 21
Office hours cancelled Oct 20
Reading 2 due Oct 26
Class survey today: anonymous; help me make the class better!
3
Reminder: Bayesian parameter estimation
Produces a distribution over params, not a single value.
Use Bayes' rule to get there: p(θ | X) = p(X | θ) p(θ) / p(X)
Where:
Generative distribution: p(X | θ)
Parameter prior: p(θ)
Model-free data distribution (prior): p(X)
Posterior parameter distribution: p(θ | X)
4
Exercise
Suppose you want to estimate the average air speed of an unladen (African) swallow.
Let's say that airspeeds of individual swallows, x, are Gaussian with unknown mean μ and variance 1: x ~ N(μ, 1).
Let's say, also, that we think the mean is "around" 50 kph, but we're not sure exactly what it is; our uncertainty (variance) is 10: μ ~ N(50, 10).
Derive the posterior estimate of the mean airspeed.
5
Pop Quiz
What happens to the posterior estimate of μ as N → ∞?
Just kidding! Not a real pop quiz for credit... this time.
But a question like this could be on the exam.
Be sure you understand how ML (maximum likelihood) and Bayesian parameter estimation work!
6
Posterior Gaussian Airspeed
8
Let's look just at the numerator, p(x_1, ..., x_N | μ) p(μ), for a second...
9
Posterior Gaussian Airspeed
After a little* rearrangement (completing the square)... (*OK, a lot...)
I.e., our posterior can be written as a Gaussian itself, with its own special mean and variance...
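The equations on this slide were images and did not come through. For reference, the standard conjugate-Gaussian result the derivation arrives at, assuming the unit-variance likelihood and the N(μ₀ = 50, σ₀² = 10) prior from the exercise, is:

```latex
\[
p(\mu \mid x_1,\dots,x_N)
  \;\propto\;
  \exp\!\Big(-\tfrac{1}{2}\sum_{i=1}^{N}(x_i-\mu)^2\Big)\,
  \exp\!\Big(-\tfrac{(\mu-\mu_0)^2}{2\sigma_0^2}\Big)
  \;\propto\;
  \exp\!\Big(-\tfrac{(\mu-\mu_N)^2}{2\sigma_N^2}\Big),
\]
\[
\sigma_N^2 = \frac{1}{N + 1/\sigma_0^2},
\qquad
\mu_N = \sigma_N^2\Big(N\bar{x} + \frac{\mu_0}{\sigma_0^2}\Big),
\qquad \text{with } \mu_0 = 50,\ \sigma_0^2 = 10.
\]
```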
10
Posterior Gaussian Airspeed
The posterior mean is a linear combination (weighted average) of the average of the data (data mean) and the prior mean.
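A minimal numerical sketch of the same update (not from the slides; the data here are simulated purely for illustration), showing the posterior mean landing between the data mean and the prior mean:

```python
import numpy as np

# Conjugate-Gaussian posterior for the swallow-airspeed exercise.
# Assumed values: likelihood variance 1, prior mean 50, prior variance 10.
rng = np.random.default_rng(0)

prior_mean, prior_var = 50.0, 10.0   # prior belief about mean airspeed (kph)
lik_var = 1.0                        # known variance of individual airspeeds

x = rng.normal(52.0, np.sqrt(lik_var), size=20)   # simulated data for illustration
n, xbar = len(x), x.mean()

# Standard conjugate update: posterior precision is the sum of precisions,
# posterior mean is a precision-weighted average of data mean and prior mean.
post_var = 1.0 / (n / lik_var + 1.0 / prior_var)
post_mean = post_var * (n * xbar / lik_var + prior_mean / prior_var)

print(f"data mean = {xbar:.2f}, posterior mean = {post_mean:.2f}, "
      f"posterior var = {post_var:.4f}")
```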
11
Intro to Reinforcement Learning
12
Meet Mack the Mouse*
Mack lives a hard life as a psychology test subject: he has to run around mazes all day, finding food and avoiding electric shocks.
He needs to know how to find cheese quickly, while getting shocked as little as possible.
Q: How can Mack learn to find his way around?
* Mickey is still under copyright
13
Start with an easy case
A very simple maze:
Whenever Mack goes left, he gets cheese
Whenever he goes right, he gets shocked
After the reward/punishment, he's reset back to the start of the maze
Q: How can Mack learn to act well in this world?
14
Learning in the easy case
Say there are two labels: "cheese" and "shock".
Mack tries a bunch of trials in the world -- that generates a bunch of (action, outcome) experiences.
Now what? (One way to learn from these experiences is sketched below.)
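A minimal sketch, not from the slides, of how Mack could learn an action-to-outcome mapping by simple counting; the experience tuples below are hypothetical, though consistent with the deterministic maze described above:

```python
from collections import Counter, defaultdict

# Hypothetical (action, outcome) experiences: in this maze every 'left'
# yields cheese and every 'right' yields a shock.
experiences = [("left", "cheese"), ("right", "shock"),
               ("left", "cheese"), ("right", "shock")]

# Count outcomes per action, then normalize to estimate P(outcome | action).
counts = defaultdict(Counter)
for action, outcome in experiences:
    counts[action][outcome] += 1

outcome_model = {
    action: {o: c / sum(outcomes.values()) for o, c in outcomes.items()}
    for action, outcomes in counts.items()
}
print(outcome_model)   # {'left': {'cheese': 1.0}, 'right': {'shock': 1.0}}
```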
15
But what to do?
So we know that Mack can learn a mapping from actions to outcomes.
But what should Mack do in any given situation? What action should he take at any given time?
Suppose Mack is the subject of a psychotropic drug study and has actually come to like shocks and hate cheese -- how does he act now?
16
Reward functions
In general, we think of a reward function: R(outcome) tells us whether Mack thinks a particular outcome is good or bad.
Mack before drugs: R(cheese) = +1, R(shock) = -1
Mack after drugs: R(cheese) = -1, R(shock) = +1
Behavior always depends on rewards (utilities).
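A minimal sketch (an assumption, not the slides' notation) of these two reward functions as plain outcome-to-reward maps:

```python
# Reward functions as simple outcome -> reward maps (values from the slide).
R_before_drugs = {"cheese": +1, "shock": -1}
R_after_drugs  = {"cheese": -1, "shock": +1}

# The learning machinery never changes; only R() does, and with it the behavior.
def reward(outcome, R):
    return R[outcome]
```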
17
Maximizing reward
So Mack wants to get the maximum possible reward (whatever that means to him).
For a one-shot case like this, this is fairly easy (see the sketch below).
Now what about a harder case?
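For the one-shot case, "maximum possible reward" amounts to picking the action with the highest expected reward under the learned outcome model. A hedged sketch combining the hypothetical outcome_model and reward maps from the earlier sketches:

```python
def best_action(outcome_model, R):
    """Pick the action maximizing expected reward: argmax_a sum_o P(o|a) R(o)."""
    def expected_reward(action):
        return sum(p * R[outcome] for outcome, p in outcome_model[action].items())
    return max(outcome_model, key=expected_reward)

outcome_model = {"left": {"cheese": 1.0}, "right": {"shock": 1.0}}
print(best_action(outcome_model, {"cheese": +1, "shock": -1}))   # 'left'  (before drugs)
print(best_action(outcome_model, {"cheese": -1, "shock": +1}))   # 'right' (after drugs)
```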