The People Have Spoken...

Administrivia Final project proposal due today. Undergrad credit: please see me in office hours. Dissertation defense announcements. R2 assigned today. Midterm back. Results of the survey.

ML (ish) thesis defenses April 5, 8:30 AM: Rob Abbot (advisor: Forrest), learning to play robo-soccer from human observation + genetic adaptation. April 20, 1:00 PM: John Burge (advisor: Lane), learning network models of fMRI brain activity data.

Reading 2 Due: Thurs, Apr 5. Knill, D., and Pouget, A. "The Bayesian brain: the role of uncertainty in neural coding and computation". Trends in Neurosciences, 27(12):712-719, 2004.

Midterm Not too shabby overall, though there were some weak spots. Summary: μ = 71.3, σ = 19.1.

Surveeeeeeeey says! Vote tally: unsupervised learning: 2, reinforcement learning: 5. MLE: θ = Pr[RL] ≈ 0.71. Bayesian posterior: [figure not reproduced in the transcript; see the sketch below].
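The posterior plot itself is missing from the transcript, so here is a rough reconstruction assuming a uniform Beta(1,1) prior over θ = Pr[RL] (the prior actually used in class may differ); the conjugate update then gives a Beta(6, 3) posterior:

    # Hedged reconstruction: MLE and Beta posterior for theta = Pr[RL],
    # assuming a uniform Beta(1,1) prior (the prior used in class may differ).
    rl_votes, unsup_votes = 5, 2
    n = rl_votes + unsup_votes

    theta_mle = rl_votes / n                      # 5/7 ≈ 0.714
    alpha, beta = 1 + rl_votes, 1 + unsup_votes   # posterior is Beta(6, 3)
    posterior_mean = alpha / (alpha + beta)       # 6/9 ≈ 0.667
    print(theta_mle, posterior_mean)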

Reinforcement Learning: Learning to get what you want... Sutton & Barto, Reinforcement Learning: An Introduction, MIT Press, 1998. Kaelbling, Littman, & Moore, "Reinforcement Learning: A Survey," Journal of Artificial Intelligence Research, Volume 4, 1996, pp. 237-285.

Meet Mack the Mouse* Mack lives a hard life as a psychology test subject. He has to run around mazes all day, finding food and avoiding electric shocks. He needs to know how to find cheese quickly while getting shocked as little as possible. Q: How can Mack learn to find his way around? * Mickey is still copyrighted.

Start with an easy case A very simple maze: whenever Mack goes left, he gets cheese; whenever he goes right, he gets shocked. After the reward/punishment, he's reset back to the start of the maze. Q: How can Mack learn to act well in this world?

Learning in the easy case Say there are two labels: "cheese" and "shock". Mack tries a bunch of trials in the world -- that generates a bunch of experiences (shown as a table on the slide; see the sketch below). Now what?
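The trial table was a figure on the original slide. Below is a minimal sketch, with made-up trial data, of how a learner could turn such experiences into an action-to-outcome mapping:

    # Minimal sketch, not from the slides: learn an action -> outcome mapping
    # from (action, outcome) trials by majority vote over observed outcomes.
    from collections import Counter, defaultdict

    trials = [("left", "cheese"), ("right", "shock"),
              ("left", "cheese"), ("right", "shock")]   # hypothetical experiences

    counts = defaultdict(Counter)
    for action, outcome in trials:
        counts[action][outcome] += 1

    model = {a: c.most_common(1)[0][0] for a, c in counts.items()}
    print(model)   # {'left': 'cheese', 'right': 'shock'}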

But what to do? So we know that Mack can learn a mapping from actions to outcomes. But what should Mack do in any given situation? What action should he take at any given time? Suppose Mack is the subject of a psychotropic drug study and has actually come to like shocks and hate cheese -- how does he act now?

Reward functions In general, we think of a reward function R() that tells us whether Mack thinks a particular outcome is good or bad. Mack before drugs: R(cheese) = +1, R(shock) = -1. Mack after drugs: R(cheese) = -1, R(shock) = +1. Behavior always depends on rewards (utilities).

Maximizing reward So Mack wants to get the maximum possible reward (whatever that means to him). For a one-shot case like this, that's fairly easy. Now what about a harder case?
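A tiny sketch of the one-shot case (the reward values and the learned model below are illustrative assumptions, not from the slides): given a reward function over outcomes, Mack simply picks the action whose predicted outcome has the highest reward, so flipping R flips his behavior.

    # Sketch: one-shot action selection under a reward function R over outcomes.
    model = {"left": "cheese", "right": "shock"}   # learned action -> outcome map

    R_before = {"cheese": +1, "shock": -1}         # Mack before the drug study
    R_after  = {"cheese": -1, "shock": +1}         # Mack after the drug study

    def best_action(R, model):
        # Choose the action whose predicted outcome has the highest reward.
        return max(model, key=lambda a: R[model[a]])

    print(best_action(R_before, model))   # 'left'
    print(best_action(R_after, model))    # 'right'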

Reward over time In general, the agent can be in a state s_i at any time t and can choose an action a_j to take in that state. Reward can be associated with a state, R(s_i), or with a state/action transition, R(s_i, a_j). A series of actions leads to a series of rewards: (s_1, a_1) → s_3: R(s_3); (s_3, a_7) → s_14: R(s_14); ...

Reward over time [figure: tree of possible state transitions branching out from s_1 through states s_2 ... s_11]

Reward over time [figure: one trajectory highlighted in the transition tree] V(s_1) = R(s_1) + R(s_4) + R(s_11) + R(s_10) + ...

Reward over time [figure: a different trajectory highlighted in the transition tree] V(s_1) = R(s_1) + R(s_2) + R(s_6) + ...
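As a small illustration (state names and reward values here are invented, not taken from the figure), the undiscounted value of a particular trajectory is just the sum of the rewards of the states it visits, so different paths from s_1 accumulate different values:

    # Sketch: value of a trajectory = sum of rewards of the visited states.
    R = {"s1": 0, "s2": -1, "s4": 0, "s6": 5, "s10": 2, "s11": -1}   # made-up rewards

    trajectory_a = ["s1", "s4", "s11", "s10"]
    trajectory_b = ["s1", "s2", "s6"]

    value_a = sum(R[s] for s in trajectory_a)   # 0 + 0 + (-1) + 2 = 1
    value_b = sum(R[s] for s in trajectory_b)   # 0 + (-1) + 5 = 4
    print(value_a, value_b)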

Where can you go? Definition: the complete set of all states the agent could be in is called the state space, S. It could be discrete or continuous; we'll usually work with discrete. Size of the state space: |S|. S = {s_1, s_2, ..., s_|S|}.

What can you do? Definition: the complete set of actions an agent could take is called the action space, A. Again, discrete or continuous; again, we work with discrete; again, its size is |A|. A = {a_1, ..., a_|A|}.
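In code, a discrete state space and action space are just finite sets (the elements below are placeholders, not from the slides):

    # Sketch: discrete state and action spaces as finite sets.
    S = {"s1", "s2", "s3", "s4"}    # state space (hypothetical states)
    A = {"left", "right"}           # action space
    print(len(S), len(A))           # |S| and |A|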

Experience & histories In supervised learning, the "fundamental unit of experience" is a feature vector + label. The fundamental unit of experience in RL: at time t, in some state s_i, take action a_j, get reward r_t, and end up in state s_k. This is called an experience tuple or SARSA tuple: ⟨s_i, a_j, r_t, s_k⟩.

The value of history... The set of all experience during a single episode up to time t is a history, h_t. A.k.a. trace or trajectory.
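A minimal sketch of both ideas, an experience tuple ⟨s, a, r, s'⟩ and a history as a list of such tuples (the field names and values are my own, not the course's):

    # Sketch: an experience tuple and a history (trace/trajectory) of tuples.
    from collections import namedtuple

    Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])

    history = [
        Experience("s1", "a1", 0.0, "s3"),
        Experience("s3", "a7", -1.0, "s14"),
    ]
    print(history[-1].reward)   # reward received on the most recent step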

Policies Total accumulated reward (value, V) depends on where the agent starts (initial s) and what the agent does at each step (duh), a. A plan of action is called a policy, π. A policy defines what action to take in every state of the system. A.k.a. controller, control law, decision rule, etc.
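For a discrete state space, a deterministic policy can be represented as a simple lookup table from states to actions (a hypothetical example, not from the slides):

    # Sketch: a deterministic policy as a state -> action lookup table.
    policy = {"s1": "left", "s2": "right", "s3": "left"}

    def act(policy, state):
        # The policy dictates the action in every state.
        return policy[state]

    print(act(policy, "s2"))   # 'right'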

Policies Value is a function of the start state and the policy: V^π(s_0). It is useful to think about both finite-horizon and infinite-horizon values.

Finite horizon reward Assuming that an episode is finite: the agent acts in the world for a finite number of time steps, T, and experiences history h_T. What should the total aggregate value be? Total accumulated reward: V(h_T) = R(s_0) + R(s_1) + ... + R(s_T). Occasionally it is useful to use the average reward instead: V(h_T) / T.
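A quick sketch of the two finite-horizon aggregates (the reward sequence below is made up for illustration):

    # Sketch: total and average reward over a finite-horizon history of length T.
    rewards = [0.0, -1.0, 5.0, 2.0]       # hypothetical R(s_t) values, t = 0..T-1
    T = len(rewards)

    total_reward = sum(rewards)           # total accumulated reward: 6.0
    average_reward = total_reward / T     # average reward per step: 1.5
    print(total_reward, average_reward)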

Gonna live forever... Often, we want to model a process that is indefinite: infinitely long, of unknown length (we don't know in advance when it will end), or one that runs 'til it's stopped (randomly). We have to consider infinitely long histories. Q: What does value mean over an infinite history?

Reaaally long-term reward Let h = ⟨s_0, s_1, s_2, ...⟩ be an infinite history. We define the infinite-horizon discounted value to be V(h) = R(s_0) + γ R(s_1) + γ² R(s_2) + ... = Σ_{t≥0} γ^t R(s_t), where 0 ≤ γ < 1 is the discount factor. Q1: Why does this work? Q2: If R_max is the max possible reward attainable in the environment, what is V_max?
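A sketch of the discounted value computed on a (truncated) history, together with the standard geometric-series answer to Q2, V_max = R_max / (1 - γ); the numbers are my own worked example rather than the slide's:

    # Sketch: infinite-horizon discounted value, computed on a finite prefix,
    # plus the bound V_max = R_max / (1 - gamma) from the geometric series.
    gamma = 0.9                               # discount factor, 0 <= gamma < 1
    rewards = [1.0, 0.0, -1.0, 1.0, 1.0]      # hypothetical prefix of an infinite history

    V = sum((gamma ** t) * r for t, r in enumerate(rewards))

    R_max = 1.0
    V_max = R_max / (1 - gamma)               # = 10.0 when gamma = 0.9
    print(V, V_max)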