Accumulation vs. replacement; model-free vs. model-based RL.



Administrivia Pseudo-HW3 today Not graded Worth doing anyway Good for your soul better for your final exam. We can discuss in class next Tues

Today in history Last time: Action selection Use of experience Eligibility traces SARSA( λ ) Today Replacing vs accumulating traces Thinking about eligibility Model-free vs. model-based learning (?) R3 discussion

Presentation hints Formal presentation to an audience Trying to convince audience of something E.g., you have invented a great idea and proven that it works Subtext: you’re smart and they should invest in you Think of it as a sales pitch (sort-of) Get the core idea across Don’t dwell on tedious detail Don’t be fluffy

Presentation hints Practice! Time will be tight -- time yourself Get friends/colleagues to help you practice Practice! Think about order of material presentation Practice!

Presentation hints Avoid using every clever powerpoint trick And be careful with cute, but pointless images

Presentation hints Oh, and avoid using bizarre fonts and really tiny font sizes just so that you can cram as much junk on the screen as possible. Remember: it’s more important that the audience actually understand your material than that you convey more ‘volume’ of material in the same time. It’s essentially pointless to ream through bunches of text or incredible amounts of math if nobody in the audience gets it. At best, they will be bored and zone out for most of your talk. At worst, they will be actively put off or annoyed by your presentation. And, presumably, you want them all to like you and be impressed with your material and ideas, so it’s counterproductive to antagonize your audience. Remember: at some point, your project, future funding, and/or job may depend on a presentation like this, so it behooves you to keep your audience happy. I have actually seen people give abysmally bad presentations and be completely rejected from the job opening because of their poor presentations. Now that that has been said, I still need to fill out this page with a large blob of text so that it’s as intimidating as possible. Honestly, I don’t expect anybody to actually read this far even in the online copy, let alone in class. If you do actually get this far while I’m flashing this page up in class, do please shout out. I’ll be most impressed and you’ll get brownie points for speed reading. Even if you happen to read this far in the online copy, please send me a note, just to satisfy my curiosity about who’s determined enough to get that far. Hm. Still half a page to fill. This is a pretty drastically condensed slide. Let’s see. Need more text. Maybe a little web mining... Ok, here we go: APRIL is the cruellest month, breeding / Lilacs out of the dead land, mixing / Memory and desire, stirring / Dull roots with spring rain. / Winter kept us warm, covering / Earth in forgetful snow, feeding / A little life with dried tubers.
/ Summer surprised us, coming over the Starnbergersee / With a shower of rain; we stopped in the colonnade, / And went on in sunlight, into the Hofgarten, / And drank coffee, and talked for an hour. / Bin gar keine Russin, stamm' aus Litauen, echt deutsch. / And when we were children, staying at the archduke's, / My cousin's, he took me out on a sled, / And I was frightened. He said, Marie, / Marie, hold on tight. And down we went. / In the mountains, there you feel free. / I read, much of the night, and go south in the winter. / / What are the roots that clutch, what branches grow / Out of this stony rubbish? Son of man, / You cannot say, or guess, for you know only / A heap of broken images, where the sun beats, / And the dead tree gives no shelter, the cricket no relief, / And the dry stone no sound of water. Only / There is shadow under this red rock, / (Come in under the shadow of this red rock), / And I will show you something different from either / Your shadow at morning striding behind you / Or your shadow at evening rising to meet you; / I will show you fear in a handful of dust. / Frisch weht der Wind / Der Heimat zu. / Mein Irisch Kind, / Wo weilest du? / 'You gave me hyacinths first a year ago; / 'They called me the hyacinth girl.' / —Yet when we came back, late, from the Hyacinth garden, / Your arms full, and your hair wet, I could not / Speak, and my eyes failed, I was neither / Living nor dead, and I knew nothing, / Looking into the heart of light, the silence. / Od' und leer das Meer.

Presentation hints Oh yeah. Don’t switch slides too quickly.

Presentation hints Be sure to look at audience Don’t just read from your slides Don’t stare at screen whole time Be careful w/ laser pointers Practice!

The Q-learning algorithm
Algorithm: Q_learn
Inputs: State space S; Act. space A; Discount γ (0<=γ<1); Learning rate α (0<=α<1)
Outputs: Q
Repeat {
  s = get_current_world_state()
  a = pick_next_action(Q, s)
  (r, s’) = act_in_world(a)
  Q(s, a) = Q(s, a) + α*(r + γ*max_a’(Q(s’, a’)) - Q(s, a))
} Until (bored)
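A minimal Python sketch of the Q-learning loop, under some assumptions that are not on the slide: pick_next_action is taken to be epsilon-greedy with random tie-breaking, the world is a hypothetical 4-state corridor (`corridor_step`), and the agent is reset to the start state when an episode ends.

```python
import random
from collections import defaultdict

def q_learn(step, start_state, actions, gamma=0.9, alpha=0.5,
            epsilon=0.1, n_steps=5000, seed=0):
    """Tabular Q-learning. `step(s, a)` must return (reward, next_state, done)."""
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q(s, a) starts at 0 for every pair
    s = start_state
    for _ in range(n_steps):
        # Assumed action selection: epsilon-greedy, ties broken at random.
        if rng.random() < epsilon:
            a = rng.choice(actions)
        else:
            best = max(Q[(s, b)] for b in actions)
            a = rng.choice([b for b in actions if Q[(s, b)] == best])
        r, s_next, done = step(s, a)
        # Target uses the max over a', whatever action is executed next.
        best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = start_state if done else s_next
    return Q

def corridor_step(s, a):
    """Hypothetical world: states 0..3 in a row, reward 1 for reaching state 3."""
    s_next = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    done = s_next == 3
    return (1.0 if done else 0.0), s_next, done

Q = q_learn(corridor_step, 0, ["left", "right"])
```

In this toy world the learned values should favor "right" in every state, since that is the only way to reach the reward.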

SARSA-learning algorithm
Algorithm: SARSA_learn
Inputs: State space S; Act. space A; Discount γ (0<=γ<1); Learning rate α (0<=α<1)
Outputs: Q
s = get_current_world_state()
a = pick_next_action(Q, s)
Repeat {
  (r, s’) = act_in_world(a)
  a’ = pick_next_action(Q, s’)
  Q(s, a) = Q(s, a) + α*(r + γ*Q(s’, a’) - Q(s, a))
  a = a’; s = s’;
} Until (bored)
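A comparable Python sketch of SARSA. As assumptions for illustration only, action selection is epsilon-greedy, the world is a hypothetical 4-state corridor, and episodes restart at the start state. The point to notice is that the next action a’ is chosen before the update and then actually executed.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, epsilon, rng):
    """Assumed stand-in for pick_next_action(Q, s)."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(Q[(s, b)] for b in actions)
    return rng.choice([b for b in actions if Q[(s, b)] == best])

def sarsa_learn(step, start_state, actions, gamma=0.9, alpha=0.5,
                epsilon=0.1, n_steps=5000, seed=0):
    """Tabular SARSA. `step(s, a)` must return (reward, next_state, done)."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    s = start_state
    a = epsilon_greedy(Q, s, actions, epsilon, rng)
    for _ in range(n_steps):
        r, s_next, done = step(s, a)
        if done:
            # No next action at a terminal state: target is just r.
            Q[(s, a)] += alpha * (r - Q[(s, a)])
            s = start_state
            a = epsilon_greedy(Q, s, actions, epsilon, rng)
            continue
        a_next = epsilon_greedy(Q, s_next, actions, epsilon, rng)
        # Target uses Q(s', a') for the action that will really be taken.
        Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
        s, a = s_next, a_next
    return Q

def corridor_step(s, a):
    """Hypothetical world: states 0..3 in a row, reward 1 for reaching state 3."""
    s_next = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    done = s_next == 3
    return (1.0 if done else 0.0), s_next, done

Q = sarsa_learn(corridor_step, 0, ["left", "right"])
```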

SARSA vs. Q
SARSA and Q-learning are very similar
SARSA updates Q(s,a) for the policy it’s actually executing
  Lets the pick_next_action() function pick the action to update
Q-learning updates Q(s,a) for the greedy policy w.r.t. the current Q
  Uses max_a’ to pick the action to update
  Might be different from the action it executes at s’
In practice: Q-learning will learn the “true” π*, but SARSA will learn about what it’s actually doing
Exploration can get Q-learning in trouble...
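The difference comes down to the update target. A small sketch with a made-up Q table (the state name, action names, and values are hypothetical) shows how the two targets diverge when exploration picks a bad action at s’:

```python
def q_learning_target(Q, r, s_next, actions, gamma):
    """Off-policy target: assumes the greedy action will be taken at s_next."""
    return r + gamma * max(Q[(s_next, b)] for b in actions)

def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target: uses the action a_next the agent actually picked."""
    return r + gamma * Q[(s_next, a_next)]

# Hypothetical Q table: at s1 one action is safe, the other is costly.
Q = {("s1", "safe"): 1.0, ("s1", "risky"): -5.0}
actions = ["safe", "risky"]
gamma = 0.9

q_target = q_learning_target(Q, 0.0, "s1", actions, gamma)
s_target = sarsa_target(Q, 0.0, "s1", "risky", gamma)  # exploration chose "risky"
```

Here Q-learning backs up the value of the safe action even though the agent is about to take the risky one, while SARSA backs up the value of what it is really doing.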

Radioactive breadcrumbs
Can now define eligibility traces for SARSA
In addition to the Q(s,a) table, keep an e(s,a) table
  Records “eligibility” (real number) for each state/action pair
At every step ((s,a,r,s’,a’) tuple):
  Increment e(s,a) for current (s,a) pair by 1
  Update all Q(s’’,a’’) vals in proportion to their e(s’’,a’’)
  Decay all e(s’’,a’’) by a factor of γλ
Leslie Kaelbling calls this the “radioactive breadcrumbs” form of RL

SARSA(  )-learning alg. Algorithm: SARSA(  )_learn Inputs: S, A,  (0<=  <1),  (0<=  <1),  (0<=  <1) Outputs: Q e ( s, a )=0 // for all s, a s =get_curr_world_st(); a =pick_nxt_act( Q, s ); Repeat { ( r, s’ )=act_in_world( a ) a’ =pick_next_action( Q, s’ )  = r +  * Q ( s’, a’ )- Q ( s, a ) e ( s, a )+=1 foreach ( s’’, a’’ ) pair in ( S X A ) { Q ( s’’, a’’ )= Q ( s’’, a’’ )+  * e ( s’’, a’’ )*  e ( s’’, a’’ )*=  } a = a’ ; s = s’ ; } Until (bored)

The trail of crumbs [Figures from Sutton & Barto, Sec 7.5; one panel is labeled λ=0]

Eligibility for a single state [Figure from Sutton & Barto, Sec 7.5: the trace e(s_i, a_j) over time, with the 1st visit, 2nd visit, ... marked]

Eligibility trace followup Eligibility trace allows: Tracking where the agent has been Backup of rewards over longer periods Credit assignment: state/action pairs rewarded for having contributed to getting to the reward Why does it work?

The “forward view” of elig. Original SARSA did “one step” backup: [Diagram: Q(s,a), then r_t, then Q(s_{t+1}, a_{t+1}), then rest of trajectory; info backs up from Q(s_{t+1}, a_{t+1}) to Q(s,a)]

The “forward view” of elig. Original SARSA did “one step” backup: Could also do a “two step backup”: [Diagram: Q(s,a), then r_t, then r_{t+1}, then Q(s_{t+2}, a_{t+2}), then rest of trajectory; info backs up from Q(s_{t+2}, a_{t+2}) to Q(s,a)]

The “forward view” of elig. Original SARSA did “one step” backup: Could also do a “two step backup”: Or even an “n step backup”:
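The one-step, two-step, and n-step backups differ only in how many real rewards are summed before bootstrapping from Q. A small sketch of the n-step target, with made-up reward and Q-value sequences (all numbers are hypothetical):

```python
def n_step_return(rewards, q_values, t, n, gamma):
    """n-step backup target for Q(s_t, a_t).

    rewards[k] is r_k and q_values[k] is Q(s_k, a_k) along one trajectory.
    Sums n discounted rewards, then bootstraps from Q(s_{t+n}, a_{t+n}).
    """
    G = sum(gamma ** k * rewards[t + k] for k in range(n))
    return G + gamma ** n * q_values[t + n]

# Hypothetical 3-step trajectory that ends with reward 1.
rewards = [0.0, 0.0, 1.0]
q_values = [0.0, 0.5, 0.8, 0.0]  # last entry: terminal state, value 0
gamma = 0.9

one_step = n_step_return(rewards, q_values, 0, 1, gamma)    # 0.9 * 0.5
two_step = n_step_return(rewards, q_values, 0, 2, gamma)    # 0.81 * 0.8
three_step = n_step_return(rewards, q_values, 0, 3, gamma)  # 0.81 * 1.0
```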

The “forward view” of elig. Small-step backups (n=1, n=2, etc.) are slow and nearsighted Large-step backups (n=100, n=1000, n=∞) are expensive and may miss near-term effects Want a way to combine them Can take a weighted average of different backups E.g.:

The “forward view” of elig. [Diagram: weighted average of two backups, with weights 1/3 and 2/3]

How do you know which number of steps to avg over? And what the weights should be? Accumulating eligibility traces are just a clever way to easily avg. over all n :

The “forward view” of elig. [Diagram: the n-step backups for n = 1, 2, 3, ..., weighted by λ^0, λ^1, λ^2, ..., λ^(n-1)]
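One standard way to write this average is the λ-return from Sutton & Barto: the n-step backup gets weight (1-λ)λ^(n-1), and in an episodic task the leftover weight goes to the full return. The sketch below assumes that normalization, which is not spelled out on the slide, and uses made-up n-step returns.

```python
def lambda_return(n_step_returns, lam):
    """Weighted average of n-step returns.

    n_step_returns[i] is the (i+1)-step return; the last entry is the full
    return to the end of the episode and absorbs all remaining weight.
    """
    N = len(n_step_returns)
    total = 0.0
    for n in range(1, N):
        total += (1 - lam) * lam ** (n - 1) * n_step_returns[n - 1]
    total += lam ** (N - 1) * n_step_returns[-1]
    return total

# Hypothetical 1-, 2-, and 3-step returns from one short episode.
returns = [0.45, 0.648, 0.81]

mixed = lambda_return(returns, 0.5)      # weights 0.5, 0.25, 0.25
one_step = lambda_return(returns, 0.0)   # all weight on the 1-step backup
full = lambda_return(returns, 1.0)       # all weight on the full return
```

Setting λ=0 recovers the original one-step SARSA backup, and λ=1 uses the whole trajectory.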

Replacing traces Kind just described are accumulating e-traces Every time you go back to state, add extra e. There are also replacing eligibility traces Every time you go back to a state/action, reset e(s,a) to 1 Works better sometimes Sutton & Barto, Sec 7.8
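A small sketch of the difference for a single state/action pair. The visit pattern and the γ and λ values are made up for illustration; the trace is bumped on a visit, recorded, then decayed by γλ, in the same order as the SARSA(λ) loop.

```python
def trace_history(visits, gamma, lam, replacing=False):
    """Trace value of one (s, a) pair at each step, recorded before decay."""
    e = 0.0
    history = []
    for visited in visits:
        if visited:
            # Replacing: reset to 1. Accumulating: add 1 to what is left.
            e = 1.0 if replacing else e + 1.0
        history.append(e)
        e *= gamma * lam
    return history

# Hypothetical pattern: visit, visit again, skip a step, visit once more.
visits = [True, True, False, True]

accumulating = trace_history(visits, gamma=1.0, lam=0.5)
replacing = trace_history(visits, gamma=1.0, lam=0.5, replacing=True)
# accumulating -> [1.0, 1.5, 0.75, 1.375]
# replacing    -> [1.0, 1.0, 0.5, 1.0]
```

With accumulating traces, a pair that is revisited often can build up a trace above 1, while a replacing trace never exceeds 1.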