E3 Finish-up; Intro to Clustering & Unsup. Kearns & Singh, “Near-Optimal Reinforcement Learning in Polynomial Time.” Machine Learning 49, 2002. Class Text: Sec 3.9; 5.9; 5.10; 6.6
Administrivia Office hours next week (Nov 24) Truncated on early end From “whenever I get in” ‘til noon Or by appt.
Oral presentations: Tips #1 Not too much text on any slide No paragraphs!!! Not even full sentences (usually) Be sure text is readable Fonts big enough Beware of Serifed Fonts
Oral presentations: Tips #1 This is a deliberately bad example of presentation style. Note that the text is very dense, there’s a lot of it, the font is way too small, and the font is somewhat difficult to read (the serifs are very narrow and the kerning is too tight, so the letters tend to smear together when viewed from a distance). It’s essentially impossible for your audience to follow this text while you’re speaking. (Except for a few speedreaders who happen to be sitting close to the screen.) In general, don’t expect your audience to read the text on your presentation -- it’s mostly there as a reminder to keep them on track while you’re talking and remind them what you’re talking about when they fall asleep. Note that these rules of thumb also apply well to posters. Unless you want your poster to completely standalone (no human there to describe it), it’s best to avoid large blocks of dense text.
Oral presentations: Tips #1 Also, don’t switch slides too quickly...
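Exercise Given: MDP M = 〈S, A, T, R〉; discount factor γ; max absolute reward $R_{\max} = \max_{s \in S} |R(s)|$. Find: a planning horizon $H^{\gamma}_{\max}$ such that if the agent plans only about events that take place within $H^{\gamma}_{\max}$ steps, then the agent is guaranteed to miss no more than ε. I.e., for any trajectory of length $H^{\gamma}_{\max}$, $h^{\gamma}_{H}$, the value difference between $h^{\gamma}_{H}$ and $h^{\gamma}_{\infty}$ is less than ε: $\left| \sum_{t=0}^{\infty} \gamma^{t} R(s_t) - \sum_{t=0}^{H^{\gamma}_{\max}-1} \gamma^{t} R(s_t) \right| < \epsilon$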
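One way to work the exercise, sketched under the assumption that the truncated trajectory keeps steps 0 through H-1: every dropped reward is at most R max in absolute value, so the dropped tail is at most γ^H · R max / (1 − γ). Requiring that to be below ε and solving for H gives H > log(ε(1 − γ)/R max) / log γ. The function name `planning_horizon` and its argument names are illustrative, not from the slides.

```python
import math

def planning_horizon(gamma, r_max, eps):
    """Smallest integer H with gamma**H * r_max / (1 - gamma) < eps.

    Assumes rewards satisfy |R(s)| <= r_max and that the truncated
    trajectory keeps steps 0 .. H-1 (an indexing assumption).
    """
    if not (0.0 <= gamma < 1.0):
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    if eps <= 0.0 or r_max < 0.0:
        raise ValueError("eps must be positive and r_max non-negative")
    # Tail bound with nothing kept (H = 0): the whole discounted sum.
    if r_max / (1.0 - gamma) < eps:
        return 0
    # With gamma = 0 only the first reward matters.
    if gamma == 0.0:
        return 1
    # log(gamma) < 0, so dividing flips the inequality: H must exceed h.
    h = math.log(eps * (1.0 - gamma) / r_max) / math.log(gamma)
    return math.floor(h) + 1

if __name__ == "__main__":
    for g in (0.5, 0.9, 0.99):
        print(g, planning_horizon(g, r_max=1.0, eps=0.01))
```

The horizon grows roughly like 1/(1 − γ) times a log factor, which is why discount factors close to 1 make planning so much more expensive.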
E3: Explicit Explore or Exploit algorithm (Kearns & Singh, Machine Learning 49, 2002). Explicitly keeps a T matrix and an R table. Plan (policy iter) w/ curr. T & R -> curr. π. Every state/action entry in T and R: can be marked known or unknown; has a #visits counter, nv(s,a). After every 〈s,a,r,s’〉 tuple, update T & R (running average). When nv(s,a) > NVthresh, mark cell as known & re-plan. When all states known, done learning & have π*.
The E3 algorithm
Algorithm: E3_learn_sketch // only an overview
Inputs: S, A, γ (0<=γ<1), NVthresh, R_max, Var_max
Outputs: T, R, π*
Initialization:
  R(s) = R_max // for all s
  T(s, a, s’) = 1/|S| // for all s, a, s’
  known(s, a) = 0; nv(s, a) = 0; // for all s, a
  π = policy_iter(S, A, T, R)
The E3 algorithm
Algorithm: E3_learn_sketch // cont’d
Repeat {
  s = get_current_world_state()
  a = π(s)
  (r, s’) = act_in_world(a)
  T(s, a, s’) = (1 + T(s, a, s’) * nv(s, a)) / (nv(s, a) + 1)
  T(s, a, s’’) = T(s, a, s’’) * nv(s, a) / (nv(s, a) + 1) // for all s’’ ≠ s’, so the row still sums to 1
  // R is updated from r by the same kind of running average
  nv(s, a)++;
  if (nv(s, a) > NVthresh) {
    known(s, a) = 1;
    π = policy_iter(S, A, T, R)
  }
} Until (all (s, a) known)
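A minimal runnable version of the sketch, under several assumptions that are not in the slides: the world is simulated by sampling a hidden `true_T`/`true_R`; value iteration stands in for policy_iter; the agent starts in state 0; re-planning happens only when a cell first becomes known; and the loop is capped at `max_steps`, because the sketch by itself does not force the agent to visit every (s, a). All function and variable names are illustrative.

```python
import random

def plan(T, R, gamma, sweeps=200):
    """Greedy policy from value iteration (a stand-in for policy_iter)."""
    n_s, n_a = len(T), len(T[0])
    V = [0.0] * n_s
    for _ in range(sweeps):
        Q = [[R[s] + gamma * sum(T[s][a][t] * V[t] for t in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
        V = [max(q) for q in Q]
    return [q.index(max(q)) for q in Q]

def e3_learn_sketch(true_T, true_R, gamma, nv_thresh, max_steps, seed=0):
    rng = random.Random(seed)
    n_s, n_a = len(true_T), len(true_T[0])
    r_max = max(abs(r) for r in true_R)
    R = [r_max] * n_s                       # optimistic start: R(s) = R_max
    T = [[[1.0 / n_s] * n_s for _ in range(n_a)] for _ in range(n_s)]
    nv = [[0] * n_a for _ in range(n_s)]    # visit counter per (s, a)
    r_seen = [0] * n_s                      # reward samples seen per state
    known = [[False] * n_a for _ in range(n_s)]
    pi = plan(T, R, gamma)
    s = 0                                   # assumed start state
    for _ in range(max_steps):
        a = pi[s]
        # act_in_world: simulated by sampling the hidden true model.
        s_next = rng.choices(range(n_s), weights=true_T[s][a])[0]
        r = true_R[s]
        n = nv[s][a]
        # Running average over every successor, so the row still sums to 1.
        T[s][a] = [((1.0 if t == s_next else 0.0) + T[s][a][t] * n) / (n + 1)
                   for t in range(n_s)]
        R[s] = (r + R[s] * r_seen[s]) / (r_seen[s] + 1)
        r_seen[s] += 1
        nv[s][a] += 1
        if nv[s][a] > nv_thresh and not known[s][a]:
            known[s][a] = True
            pi = plan(T, R, gamma)          # re-plan with the updated model
        if all(all(row) for row in known):
            break
        s = s_next
    return T, R, pi, nv, known

if __name__ == "__main__":
    # Two states, one action, uniform transitions: every cell gets visited.
    T, R, pi, nv, known = e3_learn_sketch(
        [[[0.5, 0.5]], [[0.5, 0.5]]], [0.0, 1.0],
        gamma=0.9, nv_thresh=200, max_steps=5000)
    print(T, R, pi, nv, known)
```

Because the first sample at a cell has weight nv = 0 on the prior, the uniform initial T and the optimistic R are overwritten on the first visit; this follows the update rule in the sketch, not the full algorithm in the paper.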
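Choosing NVthresh Critical parameter in E3: NVthresh. Affects how much experience the agent needs to be confident in saying a T(s,a,s’) value is known. How to pick this param? Want to ensure that the current estimate, $\hat{T}(s,a,s')$, is close to the true T(s,a,s’) with high probability, i.e., for a small error δ and a small failure probability ε: $\Pr\left[\, |\hat{T}(s,a,s') - T(s,a,s')| > \delta \,\right] < \epsilon$. How to do that?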
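5 minutes of math... General problem: Given a binomially distributed random variable, X, what is the probability that it deviates very far from its true mean? R.v. could be: Sum of many coin flips: $X = \sum_{i=1}^{n} x_i$, with each $x_i \in \{0, 1\}$. Average of many samples from a transition function: $\hat{T}(s,a,s') = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[s'_i = s']$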
5 minutes of math... Theorem (Chernoff bound): Given a binomially distributed random variable, X, generated from a sequence of n events, the probability that X is very far from its true mean, μ, is bounded above by:
5 minutes of math... Consequence of the Chernoff bound (informal): With a bit of fiddling, you can show that: The probability that the estimated mean of a binomially distributed random variable lies far from the true mean decreases exponentially with the size of the sample set
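To make the exponential falloff concrete, here is a small sketch using the standard additive (Hoeffding) form of the bound, which may differ from the exact form stated on the slide: for the mean of n independent samples in [0, 1], P(|estimate − true mean| ≥ δ) ≤ 2·exp(−2nδ²). Here δ is the allowed error and ε the allowed failure probability; setting the bound to at most ε gives n ≥ ln(2/ε) / (2δ²). The function names are illustrative.

```python
import math
import random

def hoeffding_bound(n, delta):
    """Upper bound on P(|sample mean - true mean| >= delta), n samples in [0, 1]."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * delta ** 2))

def samples_needed(delta, eps):
    """Smallest n for which the Hoeffding bound is at most eps."""
    return math.ceil(math.log(2.0 / eps) / (2.0 * delta ** 2))

def empirical_miss_rate(p, n, delta, trials=2000, seed=0):
    """Fraction of simulated experiments whose estimate misses p by >= delta."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        # One experiment: n coin flips with success probability p.
        mean = sum(rng.random() < p for _ in range(n)) / n
        if abs(mean - p) >= delta:
            misses += 1
    return misses / trials

if __name__ == "__main__":
    n = samples_needed(delta=0.1, eps=0.05)
    print("samples needed:", n)
    print("bound at n:", hoeffding_bound(n, 0.1))
    print("simulated miss rate:", empirical_miss_rate(0.5, n, 0.1))
```

With δ = 0.1 and ε = 0.05 this asks for 185 samples. The simulated miss rate is typically well below the bound, which is a reminder that bounds of this kind are conservative.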
Chernoff bound & NVthresh Using Chernoff bound, can show that a transition can be considered “known” when: Where: N = number of states in M, |S|; δ = amount you’re willing to be wrong by; ε = prob that you got it wrong by more than δ
Poly time RL A further consequence (once you layer on a bunch of math & assumptions): Can learn complete model in at most steps Notes: Polynomial in N, 1/ ε, and 1/ δ BIG polynomial, nasty constants
Take-home messages Model based RL is a different way to think of the goals of RL Get better understanding of world (Sometimes) provides stronger theoretical leverage There exists a provably poly time alg. for RL Nasty polynomial, tho. Doesn’t work well in practice Still, nice explanation of why some forms of RL work
Unsupervised Learning: Clustering & Model Fitting
The unsupervised problem Given: Set of data points Find: Good description of the data
Typical tasks Given: many measurements of flowers What different breeds are there? Given: many microarray measurements, What genes act the same? Given: bunch of documents What topics are there? How are they related? Which are “good” essays and which are “bad”? Given: Long sequences of GUI events What tasks was user working on? Are they “flat” or hierarchical?