E^3 Finish-up; Intro to Clustering & Unsupervised Learning. Kearns & Singh, "Near-Optimal Reinforcement Learning in Polynomial Time." Machine Learning 49, 2002. Class text: Sec 3.9; 5.9; 5.10; 6.6

Administrivia Office hours next week (Nov 24): truncated on the early end, from "whenever I get in" 'til noon, or by appointment.

Oral presentations: Tips #1 Not too much text on any slide: no paragraphs!!! Not even full sentences (usually). Be sure text is readable: fonts big enough; beware of serifed fonts.

Oral presentations: Tips #1 This is a deliberately bad example of presentation style. Note that the text is very dense, there’s a lot of it, the font is way too small, and the font is somewhat difficult to read (the serifs are very narrow and the kerning is too tight, so the letters tend to smear together when viewed from a distance). It’s essentially impossible for your audience to follow this text while you’re speaking. (Except for a few speedreaders who happen to be sitting close to the screen.) In general, don’t expect your audience to read the text on your presentation -- it’s mostly there as a reminder to keep them on track while you’re talking and remind them what you’re talking about when they fall asleep. Note that these rules of thumb also apply well to posters. Unless you want your poster to completely standalone (no human there to describe it), it’s best to avoid large blocks of dense text.

Oral presentations: Tips #1 Also, don’t switch slides too quickly...

Exercise Given: MDP M = 〈S, A, T, R〉; discount factor γ; maximum absolute reward R_max = max_s |R(s)|. Find: a planning horizon H such that if the agent plans only about events that take place within the next H steps, then the agent is guaranteed to miss no more than ε of the value. I.e., for any trajectory, the difference between the discounted value of its first H steps and the discounted value of the full infinite trajectory is less than ε.
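One standard way to work this exercise (a sketch of the usual argument, not necessarily the derivation given in class) is to bound the discounted reward the agent could possibly collect after step H and choose H so that this tail is at most ε:

    \sum_{t=H}^{\infty} \gamma^t R_{\max} = \frac{\gamma^H R_{\max}}{1-\gamma} \le \varepsilon
    \quad\Longrightarrow\quad
    H \ge \log_\gamma\!\frac{\varepsilon(1-\gamma)}{R_{\max}} = \frac{\ln\big(\varepsilon(1-\gamma)/R_{\max}\big)}{\ln\gamma}

For example, with γ = 0.9, R_max = 1, and ε = 0.01, this gives H ≥ ln(0.001)/ln(0.9) ≈ 65.6, so a horizon of 66 steps suffices.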

E^3: the Efficient Explore & Exploit algorithm (Kearns & Singh, Machine Learning 49, 2002). Explicitly keeps a T matrix and an R table. Plan (policy iteration) with the current T & R -> current π. Every state/action entry in T and R: can be marked known or unknown; has a #visits counter, nv(s,a). After every 〈s,a,r,s'〉 tuple, update T & R (running average). When nv(s,a) > NVthresh, mark the cell as known & re-plan. When all states are known, done learning & have π*.

The E^3 algorithm

Algorithm: E3_learn_sketch                 // only an overview
Inputs: S, A, γ (0 <= γ < 1), NVthresh, R_max, Var_max
Outputs: T, R, π*
Initialization:
    R(s) = R_max                           // for all s
    T(s, a, s') = 1/|S|                    // for all s, a, s'
    known(s, a) = 0; nv(s, a) = 0          // for all s, a
    π = policy_iter(S, A, T, R)

The E 3 algorithm Algorithm: E3_learn_sketch // con’t Repeat { s =get_current_world_state() a = π ( s ) ( r, s’ )=act_in_world( a ) T ( s, a, s’ )=(1+ T ( s, a, s’ )* nv ( s, a ))/( nv ( s, a )+1) nv ( s, a )++; if ( nv ( s, a )> NVthresh ) { known(s,a)=1; π =policy_iter( S, A, T, R ) } } Until (all ( s, a ) known)

Choosing NVthresh Critical parameter in E^3: NVthresh. It affects how much experience the agent needs before it can confidently say a T(s,a,s') value is known. How to pick this parameter? We want to ensure that the current estimate of T(s,a,s') is close to the true T(s,a,s') with high probability. How to do that?

5 minutes of math... General problem: given a binomially distributed random variable, X, what is the probability that it deviates very far from its true mean? The r.v. could be, e.g., the sum of many coin flips, or the average of many samples from a transition function (the fraction of transitions out of (s,a) that landed in s').

5 minutes of math... Theorem (Chernoff bound; this two-sided additive form is also known as Hoeffding's inequality): given a binomially distributed random variable X generated from a sequence of n events with success probability p, the probability that the empirical mean X/n is very far from the true mean p is bounded by Pr[ |X/n − p| ≥ δ ] ≤ 2·exp(−2nδ²).

5 minutes of math... Consequence of the Chernoff bound (informal): with a bit of fiddling, you can show that the probability that the estimated mean of a binomially distributed random variable falls very far from the true mean drops off exponentially quickly with the size of the sample set.
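As a quick sanity check (my own illustration, not part of the lecture; the parameter values below are arbitrary), the bound can be compared against a simple simulation:

    import math
    import random

    def deviation_prob(n, p, delta, trials=20000):
        """Estimate Pr[|X/n - p| >= delta] for X ~ Binomial(n, p) by simulation."""
        hits = 0
        for _ in range(trials):
            x = sum(1 for _ in range(n) if random.random() < p)
            if abs(x / n - p) >= delta:
                hits += 1
        return hits / trials

    n, p, delta = 100, 0.3, 0.1
    print("empirical:", deviation_prob(n, p, delta))
    print("bound:    ", 2 * math.exp(-2 * n * delta ** 2))   # Chernoff/Hoeffding bound

With these numbers the bound evaluates to 2e^(−2) ≈ 0.27, while the simulated deviation probability comes out to a few percent: the bound is loose, but it decays exponentially in n.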

Chernoff bound & NVthresh Using the Chernoff bound, you can derive a visit count NVthresh above which a transition estimate can be considered "known," where: N = number of states in M, |S|; δ = the amount you're willing to be wrong by; ε = the probability that you got it wrong by more than δ.
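As a rough illustration only (this is not the exact NVthresh expression from Kearns & Singh, whose analysis must hold simultaneously over all N states and actions), solving the per-entry bound above for the number of samples n gives:

    2 e^{-2 n \delta^2} \le \varepsilon \;\Longrightarrow\; n \ge \frac{\ln(2/\varepsilon)}{2\delta^2}

For example, δ = 0.05 and ε = 0.01 would require n ≥ ln(200)/0.005 ≈ 1060 visits to a single (s,a) pair.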

Poly-time RL A further consequence (once you layer on a bunch of math & assumptions): E^3 can learn a complete model in a bounded number of steps. Notes: the bound is polynomial in N, 1/ε, and 1/δ, but it is a BIG polynomial with nasty constants.

Take-home messages Model-based RL is a different way to think about the goals of RL: get a better understanding of the world; (sometimes) it provides stronger theoretical leverage. There exists a provably poly-time algorithm for RL. Nasty polynomial, though, and it doesn't work well in practice. Still, it's a nice explanation of why some forms of RL work.

Unsupervised Learning: Clustering & Model Fitting

The unsupervised problem Given: Set of data points Find: Good description of the data

Typical tasks Given: many measurements of flowers. What different breeds are there? Given: many microarray measurements. Which genes act the same? Given: a bunch of documents. What topics are there? How are they related? Which are "good" essays and which are "bad"? Given: long sequences of GUI events. What tasks was the user working on? Are they "flat" or hierarchical?