Using MDP Characteristics to Guide Exploration in Reinforcement Learning
Paper: Bohdana Ratitch & Doina Precup
Presenter: Michael Simon
Some pictures/formulas gratefully borrowed from slides by Ratitch

MDP Terminology
– Transition probabilities: P^a_{ss'}
– Expected reward: R^a_{ss'}
– Return
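The return formula itself is an image in the source; the standard discounted-return definition from Sutton & Barto, which the talk presumably uses, is:

```latex
% Discounted return from time t (standard definition, shown here because the
% slide's own formula image is not part of the transcript):
R_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1}, \qquad 0 \le \gamma < 1
```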

Reinforcement Learning
Learning only from environmental rewards
– Achieve the best payoff possible
Must balance exploitation with exploration
– Exploration can take large amounts of time
The structure of the problem/model can, in theory, assist exploration
– But with what, in our MDP case?

Goals/Approach
Find MDP characteristics...
– ... that affect performance...
– ... and test on them.
Use MDP characteristics...
– ... to tune parameters.
– ... to select algorithms.
– ... to create a strategy.

Back to RL
Undirected
– Sufficient exploration
– Simple, but can require exponentially many steps
Directed
– Extra computation/storage, but possibly polynomial
– Often uses aspects of the model to its advantage

RL Methods - Undirected
ε-greedy exploration
– With probability 1-ε, exploit the current best greedy estimate
– With probability ε, explore: select an action uniformly at random
Boltzmann distribution (softmax over the action-value estimates)
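A minimal sketch of the two undirected rules above (illustrative only; function and variable names are mine, not the paper's):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """With prob. 1-epsilon exploit the current greedy guess, else pick uniformly at random."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

def boltzmann(q_values, temperature, rng=None):
    """Sample an action from the Boltzmann (softmax) distribution over Q-values."""
    rng = rng or np.random.default_rng()
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                          # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))
```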

RL Methods - Directed
Maximize the value estimate plus an exploration bonus
– Different options for the bonus:
  Counter-based (favor the least frequently tried)
  Recency-based (favor the least recently tried)
  Error-based (favor the most variable value estimates)
  Interval Estimation (favor the highest sample variance)
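As one concrete instance of the directed family, a counter-based rule that adds a bonus favoring the least frequently tried actions (a sketch under my own naming, not the paper's exact rule; `bonus_weight` is a hypothetical constant):

```python
import numpy as np

def counter_based_action(q_values, visit_counts, bonus_weight=1.0):
    """Directed selection: maximize Q plus a bonus that shrinks as the visit count grows."""
    bonus = bonus_weight / (1.0 + np.asarray(visit_counts, dtype=float))
    return int(np.argmax(np.asarray(q_values, dtype=float) + bonus))
```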

Properties of MDPs
– State Transition Entropy
– Controllability
– Variance of Immediate Rewards
– Risk Factor
– Transition Distance
– Transition Variability

State Transition Entropy
Stochasticity of state transitions
– High STE = good exploration
Potential variance of samples needed
– High STE = more samples needed
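The defining formula is an image in the source; state transition entropy is presumably the Shannon entropy of the next-state distribution of a state-action pair:

```latex
% Assumed standard form: entropy of the transition distribution for (s, a);
% higher values mean more stochastic transitions.
STE(s, a) = -\sum_{s'} P^{a}_{ss'} \log P^{a}_{ss'}
```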

Controllability - Calculation
How much the environment’s response differs for an action
– Can also be thought of as normalized information gain

Controllability - Usage
High controllability = control over actions
– Different actions lead to different parts of the state space
– More variance = more sampling needed
Take actions leading to controllable states
– Actions with high Forward Controllability (FC)
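The transcript does not include a formula for FC; one natural reading, consistent with "take actions leading to controllable states", is the expected controllability of the successor state (an assumption, not a quote from the paper):

```latex
% Assumed form: forward controllability of action a in state s is the
% expected controllability C(s') of the state it leads to.
FC(s, a) = \sum_{s'} P^{a}_{ss'} \, C(s')
```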

Proposed Method
Undirected
– Explore with probability ε, modulated by the MDP characteristics (the full formula is an image not carried into the transcript)
– For the experiments: K1, K2 ∈ {0, 1}, ε ∈ {0.1, 0.4, 0.9}, and one further parameter fixed at 1

Proposed Method
Directed
– Pick the action maximizing a combination of the value estimate and an exploration bonus (the formula is an image; see the reconstruction below)
– For the experiments: K0 ∈ {1, 10, 50}, K1, K2 ∈ {0, 1}, K3 = 1; the remaining bonus term is recency-based
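The maximized expression is lost with the slide image; a reconstruction consistent with the constants listed above (my assumption, with ρ(s,a) denoting the recency term) weights the value estimate against a combined exploration bonus:

```latex
% Assumed reconstruction of the directed selection rule; the exact form in
% the paper may differ.
a_t = \arg\max_{a} \left[ Q(s, a) + K_0 \left( K_1 \, STE(s, a) + K_2 \, FC(s, a) + K_3 \, \rho(s, a) \right) \right]
```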

Experiments
Random MDPs
– 225 states, 3 actions, branching factor 1-20, transition probabilities/rewards uniform on [0, 1], 0.01 chance of termination
– Divided into 4 groups: low STE vs. high STE, high variation (test) vs. low variation (control)
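A sketch of how random MDPs of this kind could be generated (illustrative only, not the authors' code; the split into the four STE/variation groups is not shown):

```python
import numpy as np

def random_mdp(n_states=225, n_actions=3, max_branch=20, p_term=0.01, seed=0):
    """Build transition and reward tensors for a random MDP as described on the slide."""
    rng = np.random.default_rng(seed)
    P = np.zeros((n_states, n_actions, n_states))                    # P[s, a, s']
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions, n_states))  # rewards ~ U[0, 1]
    for s in range(n_states):
        for a in range(n_actions):
            branch = rng.integers(1, max_branch + 1)                 # 1-20 successor states
            succ = rng.choice(n_states, size=branch, replace=False)
            probs = rng.uniform(0.0, 1.0, size=branch)
            P[s, a, succ] = probs / probs.sum()                      # normalize to a distribution
    return P, R, p_term   # p_term: the 0.01 chance of termination
```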

Experiments Continued
Performance Measures
– Return estimates: run the greedy policy from 50 different states, 30 trials per state, average the returns, normalize
– Penalty measure: R_max = upper limit on the return of the optimal policy; R_t = normalized greedy return after trial t; T = number of trials
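Given R_max, R_t, and T as defined above, a plausible form for the penalty measure (the slide's formula image is missing, so this is an assumption) is the average shortfall from the optimal-return bound over the learning run:

```latex
% Assumed form: average gap between the optimal-return bound and the
% normalized greedy return after each trial.
\text{Penalty} = \frac{1}{T} \sum_{t=1}^{T} \left( R_{\max} - R_t \right)
```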

Graphs, Glorious Graphs

More Graphs, Glorious Graphs

Discussion
Significant results obtained when using STE and FC
– Results correspond with the presence of STE
Values can be calculated prior to learning
– Requires model knowledge
Rug Sweeping and more judgements
– SARSA

It’s over!