Presentation transcript:

2/theres-an-algorithm-for-that-algorithmia-helps-you-find-it/

Autonomous Learning of Stable Quadruped Locomotion (2007) – policy gradient algorithm

PILCO (Deisenroth & Rasmussen, 2011) – model-based policy search method

Learning: uses real experience from the environment. Planning: uses simulated experience from a model.

Models: how would we learn to estimate them?
– Discrete state space
– Continuous state space
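For the discrete case, one standard answer is a maximum-likelihood model built from transition counts; for continuous state spaces, a function approximator such as the Gaussian-process dynamics model in PILCO is typically used instead. Below is a minimal, illustrative Python sketch of the tabular case (the class and method names are ours, not from the slides):

```python
import random
from collections import defaultdict

class TabularModel:
    """Maximum-likelihood model of a discrete-state MDP (illustrative sketch)."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.reward_sum = defaultdict(float)                 # (s, a) -> summed reward
        self.visits = defaultdict(int)                       # (s, a) -> total visits

    def update(self, s, a, r, s_next):
        """Record one real transition (s, a) -> r, s_next."""
        self.counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def sample(self, s, a):
        """Sample s' from the empirical transition distribution and return
        (average reward, s'). Assumes (s, a) has been visited at least once."""
        successors = self.counts[(s, a)]
        s_next = random.choices(list(successors), weights=list(successors.values()))[0]
        r_hat = self.reward_sum[(s, a)] / self.visits[(s, a)]
        return r_hat, s_next
```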

Planning
State-space planning: search of the state space
Plan-space planning: search of the space of plans
– Genetic algorithms
– Partial-order planning
Dynamic programming: we already have a model
Model-based RL (aka model-learning RL): we learn a model

Random-sample one-step tabular Q-planning
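As a rough sketch of what one iteration of this planner does (assuming a Q table stored as a defaultdict(float) and a model with the sample interface sketched above):

```python
import random

def q_planning_step(Q, model, states, actions, alpha=0.1, gamma=0.95):
    """One iteration of random-sample one-step tabular Q-planning (sketch)."""
    s = random.choice(states)                 # 1. pick a state-action pair at random
    a = random.choice(actions)
    r, s_next = model.sample(s, a)            # 2. query the model for simulated experience
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # 3. one-step Q-learning backup
```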

Dyna
Search control – directs simulated experience (chooses which states and actions to simulate from)
Planning, acting, model learning, and direct RL all proceed in parallel!

Dyna-Q
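The Dyna-Q algorithm itself is usually shown as pseudocode (as in Sutton & Barto's Planning and Learning chapter): act, learn directly from the real transition, update the model, then run n planning backups on simulated transitions. A hedged Python sketch, assuming a hypothetical deterministic environment with reset()/step() and a discrete action list:

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=50, n_planning=10, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Dyna-Q (illustrative sketch).

    Assumes a hypothetical deterministic environment with env.reset() -> s,
    env.step(a) -> (s', r, done), and a discrete list env.actions.
    """
    Q = defaultdict(float)      # Q[(s, a)]
    model = {}                  # learned deterministic model: (s, a) -> (r, s')
    seen = []                   # state-action pairs observed so far

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = eps_greedy(s)
            s_next, r, done = env.step(a)                          # (a) acting
            best = max(Q[(s_next, a2)] for a2 in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])    # (b) direct RL
            if (s, a) not in model:
                seen.append((s, a))
            model[(s, a)] = (r, s_next)                            # (c) model learning
            for _ in range(n_planning):                            # (d) planning
                ps, pa = random.choice(seen)
                pr, ps_next = model[(ps, pa)]
                pbest = max(Q[(ps_next, a2)] for a2 in env.actions)
                Q[(ps, pa)] += alpha * (pr + gamma * pbest - Q[(ps, pa)])
            s = s_next
    return Q
```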

Collect data → generate model → plan over the model → … profit!

Without planning: each episode adds only one additional step to the policy, so only one step (the last) has been learned so far. With planning: again only one step is learned during the first episode, but by the end of the second episode an extensive policy has been developed that reaches almost all the way back to the start state.

Dyna-Q+ is Dyna-Q with an exploration bonus (a shaping reward): planning backups use r + κ√τ in place of the modeled reward r, where τ is the number of time steps since the transition was last tried for real. Question: why the square root of the elapsed time, rather than, say, a bonus that grows linearly?
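A small sketch of how the bonus enters the planning backup (the bookkeeping names are hypothetical: last_tried records when each (s, a) was last executed in the real environment, t is the current time step):

```python
import math

def dyna_q_plus_backup(Q, model, last_tried, t, s, a, actions,
                       kappa=1e-3, alpha=0.1, gamma=0.95):
    """Dyna-Q+ planning backup (sketch): the modeled reward is augmented with
    an exploration bonus kappa * sqrt(tau), where tau is the number of time
    steps since (s, a) was last tried in the real environment."""
    r, s_next = model[(s, a)]
    tau = t - last_tried.get((s, a), 0)
    r_plus = r + kappa * math.sqrt(tau)
    best = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r_plus + gamma * best - Q[(s, a)])
```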

Dyna-Q+ is Dyna-Q with an exploration bonus. Why is Dyna-Q+ much better on the left, but only slightly better on the right?

Dyna-Q+ is Dyna-Q with an exploration bonus. Why is Dyna-Q+ much better on the left, but only slightly better on the right? The model is “optimistic” – think pessimistic vs. optimistic initialization.

Prioritized Sweeping
Maintain a priority queue over (s, a) pairs whose estimated value would change non-trivially
When a pair is processed, the effect on its predecessor pairs is computed and those pairs are added to the queue
Other ideas:
– Prioritize by change in policy?
– Continuous state space
“All animals are equal, but some are more equal than others.” – George Orwell, Animal Farm
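A sketch of the core loop, reusing the deterministic model from the Dyna-Q sketch above; pqueue holds (-priority, s, a) entries so the largest predicted change is processed first (states and actions are assumed orderable, e.g. tuples of ints, for heap tie-breaking):

```python
import heapq

def prioritized_sweeping_step(Q, model, predecessors, pqueue, actions,
                              alpha=0.1, gamma=0.95, theta=1e-4):
    """One planning step of prioritized sweeping (illustrative sketch).

    predecessors[s] is the set of (s_prev, a_prev) pairs known to lead to s;
    model[(s, a)] = (r, s_next) is the deterministic learned model.
    """
    if not pqueue:
        return
    _, s, a = heapq.heappop(pqueue)                         # most urgent pair first
    r, s_next = model[(s, a)]
    best = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
    # The value of s may have changed, so its predecessors may now also need
    # a non-trivial update: compute their priorities and enqueue them.
    for s_prev, a_prev in predecessors.get(s, ()):
        r_prev, _ = model[(s_prev, a_prev)]
        best_s = max(Q[(s, a2)] for a2 in actions)
        priority = abs(r_prev + gamma * best_s - Q[(s_prev, a_prev)])
        if priority > theta:
            heapq.heappush(pqueue, (-priority, s_prev, a_prev))
```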

Prioritized Sweeping vs. Dyna-Q

Trajectory sampling: perform backups along simulated trajectories
Samples from the on-policy distribution
May be able to (usefully) ignore large parts of the state space
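A sketch of trajectory sampling with the same tabular machinery: instead of sweeping or uniformly sampling state-action pairs, simulate trajectories from the start state under the current ε-greedy policy and back up only along the states actually visited:

```python
import random

def trajectory_sampling(Q, model, start_state, actions, n_trajectories=10,
                        max_steps=100, alpha=0.1, gamma=0.95, eps=0.1):
    """Back up along simulated on-policy trajectories (illustrative sketch)."""
    for _ in range(n_trajectories):
        s = start_state
        for _ in range(max_steps):
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a2: Q[(s, a2)])
            if (s, a) not in model:        # no simulated experience available here
                break
            r, s_next = model[(s, a)]
            best = max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
            s = s_next
```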

How to back up
– Full vs. sample backups
– Deep vs. shallow backups
– Heuristic search
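The full-vs-sample distinction in one place (illustrative helper functions; P and R are a hypothetical transition-probability table and expected-reward table):

```python
def full_backup(Q, P, R, s, a, actions, gamma=0.95):
    """Full (expected) backup: average over every successor state, weighted by
    the model's transition probabilities P[(s, a)] = {s': prob}; R[(s, a)] is
    the expected immediate reward."""
    return R[(s, a)] + gamma * sum(
        p * max(Q[(s2, a2)] for a2 in actions) for s2, p in P[(s, a)].items())

def sample_backup(Q, r, s_next, actions, gamma=0.95):
    """Sample backup: a single sampled successor (r, s_next) stands in for the
    expectation -- cheaper per backup, but noisier."""
    return r + gamma * max(Q[(s_next, a2)] for a2 in actions)
```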

R-Max

R-Max (1/2)
Stochastic games
– Matrix (strategic) form
– A sequence of games drawn from some set of games
– More general than MDPs (how would an MDP map onto a stochastic game?)
Implicit vs. explicit exploration?
– Epsilon-greedy, UCB1, optimistic value-function initialization
– What would a pessimistic initialization do?
Assumptions: the agent recognizes the state it is in, and knows the actions taken and the payoffs received
Maximin

R-Max (2/2)
Either attains near-optimal payoff or learns efficiently
Mixing time
– The smallest T after which π guarantees an expected payoff of at least U(π) − ε
Approaches the optimal policy in time polynomial in T, in 1/ε, and in ln(1/δ)
– A PAC-style guarantee
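A minimal sketch of the R-Max planning step for the MDP special case (the original algorithm is stated for stochastic games; the data structures below are hypothetical): state-action pairs not yet in the known set (i.e., visited fewer than m times for real) are optimistically assumed to earn r_max forever, so the planned policy is driven toward them:

```python
from collections import defaultdict

def rmax_value_iteration(known, counts, reward_sum, states, actions, r_max,
                         gamma=0.95, n_iters=200):
    """Value iteration over the optimistic R-Max model (illustrative sketch).

    known is the set of (s, a) pairs visited at least m times for real;
    counts[(s, a)] = {s': count} and reward_sum[(s, a)] hold their empirical
    statistics, as in the tabular model sketched earlier.
    """
    V = defaultdict(float)
    for _ in range(n_iters):
        for s in states:
            q_values = []
            for a in actions:
                if (s, a) in known:
                    n = sum(counts[(s, a)].values())
                    r_hat = reward_sum[(s, a)] / n
                    q = r_hat + gamma * sum(c / n * V[s2]
                                            for s2, c in counts[(s, a)].items())
                else:
                    q = r_max / (1.0 - gamma)   # value of receiving r_max forever
                q_values.append(q)
            V[s] = max(q_values)
    return V
```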

Efficient Reinforcement Learning with Relocatable Action Models (RAM)
– Grid-world experiment
– Algorithm