Transfer in Variable-Reward Hierarchical Reinforcement Learning
Hui Li, March 31, 2006

Overview
- Multi-criteria reinforcement learning
- Transfer in variable-reward hierarchical reinforcement learning
- Results
- Conclusions

Multi-criteria reinforcement learning
Definition: Reinforcement learning is the process by which an agent learns an approximately optimal policy through trial-and-error interactions with the environment.
[Agent-environment loop: at time t the agent observes state s_t, takes action a_t according to its policy, and receives reward r_t; the environment transitions to s_{t+1} with probability P_ss'(a), producing the trajectory s_0, a_0:r_0, s_1, a_1:r_1, s_2, a_2:r_2, ...]
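To make this interaction loop concrete, here is a minimal sketch of the agent-environment cycle; `env` and `agent` are hypothetical objects (not from the original slides) with `reset`, `step`, `act`, and `update` methods.

```python
# Minimal sketch of the trial-and-error interaction loop described above.
# `env` and `agent` are hypothetical objects, not part of the original talk.

def run_episode(env, agent, horizon=1000):
    state = env.reset()                          # observe s_0
    for t in range(horizon):
        action = agent.act(state)                # choose a_t from the current policy
        next_state, reward = env.step(action)    # environment samples s_{t+1} ~ P_ss'(a), emits r_t
        agent.update(state, action, reward, next_state)  # learn from (s_t, a_t, r_t, s_{t+1})
        state = next_state
```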

Goal
The agent's goal is to maximize the cumulative reward it receives over the long run.
Average reward (gain) per time step under a given policy π:
    ρ^π = lim_{N→∞} (1/N) E[ r_0 + r_1 + ... + r_{N−1} | π ]
A new value function, the average-adjusted sum of rewards (bias):
    h^π(s) = E[ Σ_t (r_t − ρ^π) | s_0 = s, π ]
Bellman equation:
    h^π(s) = r(s, π(s)) − ρ^π + Σ_{s'} P(s'|s, π(s)) h^π(s')

H-learning: model-based version of average-reward reinforcement learning. On each new observation r_s(a), the gain estimate is updated as
    ρ_new = (1 − α) ρ_old + α (r_s(a) + h(s') − h(s)),
where α is the learning rate, 0 < α < 1.
R-learning: model-free version of average-reward reinforcement learning.
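A minimal tabular sketch of the model-free (R-learning style) updates implied here, with assumed problem sizes and step sizes; this is an illustration, not code from the talk.

```python
import numpy as np

# Tabular R-learning style updates for average-reward RL (illustrative sketch).
n_states, n_actions = 10, 4          # assumed problem size
alpha, beta = 0.1, 0.01              # step sizes for the bias values and the gain
h = np.zeros((n_states, n_actions))  # average-adjusted action values
rho = 0.0                            # gain (average reward per step) estimate

def update(s, a, r, s_next, greedy):
    """One update after observing (s, a, r, s')."""
    global rho
    target = r - rho + h[s_next].max()
    h[s, a] += alpha * (target - h[s, a])
    if greedy:  # the gain estimate is adjusted only on greedy actions
        rho += beta * (r + h[s_next].max() - h[s].max() - rho)
```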

Multi-criteria reinforcement learning
In many situations, it is natural to express the objective as a tradeoff between different kinds of rewards.
Example: Buridan's donkey problem. Goals:
- Eating food
- Guarding food
- Minimizing the number of steps it walks

Weighted optimization criterion:
    ρ_w^π = w · ρ^π = Σ_i w_i ρ_i^π,
where w is the weight vector and each w_i represents the importance of the i-th reward component.
If the weight vector w is static and never changes over time, the problem reduces to reinforcement learning with a scalar reward.
If the weight vector varies from time to time, learning a policy for each weight vector from scratch is very inefficient.
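For illustration (my own example values, not from the slides), scalarizing a vector-valued reward with a weight vector is a dot product:

```python
import numpy as np

# Illustrative scalarization of a vector-valued reward.
reward = np.array([1.0, 0.0, -0.1])   # e.g. (eat food, guard food, step cost)
w      = np.array([0.5, 0.4, 0.1])    # importance of each reward component

scalar_reward = float(w @ reward)     # weighted reward seen by a scalar-reward learner
print(scalar_reward)                  # 0.49
```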

Since the MDP model applies a linear transformation to the rewards, the average reward ρ^π and the average-adjusted reward h^π(s) are linear in the reward weights for a given policy π:
    ρ_w^π = w · ρ^π,   h_w^π(s) = w · h^π(s).

11 22 33 44 55 66 77 Each line represents the weighted average reward given a policy  k, Solid lines represent those active weighted average rewards Dot lines represent those inactive weighted average rewards Dark line segments represent the best average rewards for any weight vectors

The key idea: maintain Π, the set of all stored policies; only those policies whose weighted average rewards are active (optimal for some weight vector) are stored. Update equations: the value functions and gains are learned as vectors, and for a new weight vector w the agent is initialized with the stored policy π in Π that maximizes w · ρ^π before applying the vector-valued analogues of the H-learning updates.
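A sketch of the policy-caching idea, under my own assumptions about the data layout (each stored policy keeps a vector-valued gain `rho` and bias table `h`); the activeness test here is a sampling approximation, not the paper's exact criterion.

```python
import numpy as np

# Sketch of variable-reward policy caching (data layout assumed, not from the talk).
stored = []   # list of dicts: {"rho": np.ndarray, "h": np.ndarray}

def init_for_new_weight(w):
    """Start from the stored policy whose weighted gain w . rho is highest."""
    if not stored:
        return None                     # no transfer: learn from scratch
    return max(stored, key=lambda p: float(w @ p["rho"]))

def maybe_store(policy):
    """Keep a newly learned policy only if its gain vector is active, i.e. best for some
    weight vector (approximated here by checking a random sample of weight vectors)."""
    sample = np.random.dirichlet(np.ones(len(policy["rho"])), size=100)
    active = any(
        all(float(w @ policy["rho"]) >= float(w @ q["rho"]) for q in stored)
        for w in sample
    )
    if active:
        stored.append(policy)
```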

Variable-reward hierarchical reinforcement learning
- The original MDP M is split into sub-SMDPs {M_0, ..., M_n}, each sub-SMDP representing a subtask.
- Solving the root task M_0 solves the entire MDP M.
- The task hierarchy is represented as a directed acyclic graph known as the task graph.
- A local policy π_i for subtask M_i is a mapping from states to the child tasks of M_i.
- A hierarchical policy π for the whole task is an assignment of a local policy π_i to each subtask M_i.
- The objective is to learn an optimal policy that optimizes the policy for each subtask, assuming that its children's policies are optimized.

[Figure: real-time strategy domain map showing a goldmine, peasants, a forest, the enemy base, and the home base]

Two kinds of subtasks:
Composite subtasks:
- Root: the whole task
- Harvest: the goal is to harvest wood or gold
- Deposit: the goal is to deposit a resource at the home base
- Attack: the goal is to attack the enemy base
Primitive subtasks (primitive actions):
- north, south, east, west
- pick a resource, put a resource
- attack the enemy base
- idle
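As a minimal sketch, the task hierarchy for this domain can be written down as a directed acyclic graph; the exact structure below is assumed from the subtask list above and may differ in detail from the paper.

```python
# Minimal task-graph sketch (directed acyclic graph of subtasks); illustrative only.
task_graph = {
    "Root":    ["Harvest", "Deposit", "Attack", "idle"],
    "Harvest": ["north", "south", "east", "west", "pick"],
    "Deposit": ["north", "south", "east", "west", "put"],
    "Attack":  ["north", "south", "east", "west", "attack"],
}
primitives = {"north", "south", "east", "west", "pick", "put", "attack", "idle"}

def children(task):
    """Child tasks callable from a composite task; primitives have no children."""
    return task_graph.get(task, [])
```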

SMDP (semi-MDP)
An SMDP is a tuple ⟨S, A, P, r, t⟩, where S, A, P, r are defined as in an MDP and t(s, a) is the execution time for taking action a in state s.
Bellman equation of an SMDP for average-reward learning:
    h(s) = max_a [ r(s, a) − ρ · t(s, a) + Σ_{s'} P(s'|s, a) h(s') ]
A subtask M_i is a tuple ⟨B_i, A_i, G_i⟩:
- B_i: state abstraction function, which maps a state s of the original MDP to an abstract state of M_i
- A_i: the set of subtasks that can be called by M_i
- G_i: termination predicate
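A tabular sketch of the average-reward SMDP Bellman backup above; the array shapes and the synchronous sweep are my assumptions.

```python
import numpy as np

def smdp_bellman_backup(h, rho, r, t, P):
    """One synchronous backup of h(s) = max_a [ r(s,a) - rho*t(s,a) + sum_s' P(s'|s,a) h(s') ].
    r[s, a] : expected reward, t[s, a] : expected execution time,
    P[s, a, s'] : transition probabilities, h[s] : current bias values."""
    q = r - rho * t + np.einsum("sap,p->sa", P, h)   # action values under the current bias
    return q.max(axis=1)                             # new bias estimate for every state
```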

The value function decomposition satisfies the following set of Bellman equations: for a composite subtask M_i whose local policy chooses child task a = π_i(s) in state s,
    h_i(s) = h_a(s) + Σ_{s'} P(s'|s, a) h_i(s'),
where h_a is the value function of the chosen child task and, for a primitive action a, h_a(s) = r(s, a) − ρ · t(s, a). At the root, we only store the average adjusted reward.
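A recursive evaluation sketch of this kind of decomposition; the function signatures, the transition model `P[i](s, a)` (yielding next-state/probability pairs), and the depth cutoff are all my own simplifications, not the paper's algorithm.

```python
# Truncated recursive sketch of a MAXQ-style hierarchical value computation (illustrative only).
def bias(i, s, local_policy, P, reward, time, rho, primitives, depth=0, max_depth=50):
    """Approximate average-adjusted value of completing subtask i from state s."""
    if i in primitives:
        return reward(s, i) - rho * time(s, i)        # primitive: gain-adjusted one-step reward
    if depth >= max_depth:
        return 0.0                                    # crude cutoff instead of solving the recursion exactly
    a = local_policy[i][s]                            # child task chosen by the local policy of subtask i
    child = bias(a, s, local_policy, P, reward, time, rho, primitives, depth + 1, max_depth)
    completion = sum(
        p * bias(i, s2, local_policy, P, reward, time, rho, primitives, depth + 1, max_depth)
        for s2, p in P[i](s, a)                       # states where the child task terminates
    )
    return child + completion
```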

Results
Learning curves for a test reward weight after having seen 0, 1, 2, ..., 10 previous training weight vectors.
Negative transfer: learning based on only one previous weight vector is worse than learning from scratch.

Transfer ratio: F_Y / F_{Y|X}
- F_Y is the area between the learning curve and its optimal value on problem Y with no prior learning experience on X.
- F_{Y|X} is the area between the learning curve and its optimal value on problem Y given prior training on X.
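As a concrete, illustrative computation of this metric (the sampling interval and example numbers are assumed):

```python
import numpy as np

def transfer_ratio(curve_scratch, curve_transfer, optimal):
    """Transfer ratio F_Y / F_{Y|X}: areas between each learning curve and the optimal value.
    curve_scratch  : performance on Y with no prior experience on X
    curve_transfer : performance on Y after prior training on X
    optimal        : asymptotic optimal value on Y"""
    f_y  = np.trapz(optimal - np.asarray(curve_scratch))   # area when learning from scratch
    f_yx = np.trapz(optimal - np.asarray(curve_transfer))  # area given transfer from X
    return f_y / f_yx

# Made-up example: a ratio greater than 1 indicates positive transfer.
print(transfer_ratio([0.0, 0.3, 0.6, 0.8], [0.4, 0.7, 0.85, 0.9], 1.0))  # ~2.1
```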

Conclusions
- This paper showed that hierarchical task structure accelerates transfer across variable-reward MDPs more than a flat MDP representation does.
- The hierarchical task structure also facilitates multi-agent learning.

References
[1] T. Dietterich. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
[2] N. Mehta and P. Tadepalli. Multi-Agent Shared Hierarchy Reinforcement Learning. ICML Workshop on Richer Representations in Reinforcement Learning, 2005.
[3] S. Natarajan and P. Tadepalli. Dynamic Preferences in Multi-Criteria Reinforcement Learning. In Proceedings of ICML-05, 2005.
[4] N. Mehta, S. Natarajan, P. Tadepalli, and A. Fern. Transfer in Variable-Reward Hierarchical Reinforcement Learning. NIPS Workshop on Transfer Learning, 2005.
[5] A. Barto and S. Mahadevan. Recent Advances in Hierarchical Reinforcement Learning. Discrete Event Dynamic Systems, 2003.
[6] S. Mahadevan. Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results. Machine Learning, 22, 1996.
[7] P. Tadepalli and D. Ok. Model-based Average Reward Reinforcement Learning. Artificial Intelligence, 1998.