ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 21: Dynamic Multi-Criteria RL Problems
Dr. Itamar Arel
College of Engineering, Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2009 (November 16, 2009)

Outline
Alternative return (value function) formulation
Multi-criteria RL

Average Reward Reinforcement Learning
Consider a (model-based) MDP as discussed before.
The optimal policy here maximizes the expected long-term average reward per time step from every state.
The Bellman equation for the average-reward setting is
    h(s) = \max_a \big[ r(s,a) - \rho + \sum_{s'} P(s'|s,a)\, h(s') \big]
where \rho is the average reward per time step of the policy under evaluation.
Under reasonable conditions, \rho is constant over the entire state space.
The difference between the reward and \rho, i.e. r(s,a) - \rho, is called the average-adjusted reward of action a in state s.
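As a concrete illustration of this equation (not part of the original slides), the sketch below runs relative value iteration on a made-up two-state MDP; the transition probabilities and rewards are invented, and anchoring the values at state 0 is one common convention rather than something the lecture prescribes.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers invented for illustration).
# P[a, s, s'] = transition probability, r[a, s] = expected immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # action 0
              [[0.5, 0.5], [0.6, 0.4]]])    # action 1
r = np.array([[1.0, 0.0],                   # action 0
              [2.0, -1.0]])                 # action 1

# Relative value iteration: solve h(s) = max_a [ r(s,a) - rho + sum_s' P h(s') ]
# by repeatedly applying the max-backup and renormalizing so that h(s0) = 0;
# the subtracted offset then converges to the average reward rho.
h = np.zeros(2)
for _ in range(1000):
    q = r + P @ h              # q[a, s] = r(s,a) + E[h(s') | s, a]
    rho = q.max(axis=0)[0]     # offset taken at the reference state s0 = 0
    h = q.max(axis=0) - rho    # average-adjusted values with h(s0) = 0

print("estimated average reward rho:", rho)
print("average-adjusted values h(s):", h)
print("greedy policy per state:     ", (r + P @ h).argmax(axis=0))
```

For a unichain, aperiodic MDP such as this toy example, the subtracted offset converges to the optimal average reward per step.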

Average Reward Reinforcement Learning (cont.)
H-Learning, a learning method for average-reward RL, updates the current value function as follows:
    h(s) \leftarrow \max_a \big[ \hat r(s,a) - \rho + \sum_{s'} \hat P(s'|s,a)\, h(s') \big]
The average reward is updated over the greedy steps using
    \rho \leftarrow \rho + \alpha \big( r(s,a) - h(s) + h(s') - \rho \big)
The state-transition model \hat P and the immediate rewards \hat r are learned by updating their running averages.
R-Learning is a model-free version of H-Learning that uses the action-value representation (analogous to Q-learning).
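A rough sketch of what one H-learning step could look like under this description: running averages for the transition model and immediate rewards, a model-based backup of h(s), and a rho update on greedy steps only. The class layout, learning rate, and update order are illustrative assumptions, not the course's reference implementation.

```python
import numpy as np

class HLearner:
    """Sketch of tabular H-learning for average-reward RL (model-based)."""
    def __init__(self, n_states, n_actions, alpha=0.05):
        self.alpha = alpha
        self.h = np.zeros(n_states)                             # adjusted values
        self.rho = 0.0                                          # average reward
        self.counts = np.zeros((n_states, n_actions))           # visit counts N(s,a)
        self.trans = np.zeros((n_states, n_actions, n_states))  # counts N(s,a,s')
        self.r_hat = np.zeros((n_states, n_actions))            # mean reward r_hat(s,a)

    def q(self, s):
        """q(s,a) = r_hat(s,a) - rho + sum_s' P_hat(s'|s,a) h(s')."""
        p_hat = self.trans[s] / np.maximum(self.counts[s][:, None], 1.0)
        return self.r_hat[s] - self.rho + p_hat @ self.h

    def update(self, s, a, r, s_next, took_greedy):
        # 1) Running averages of the transition model and immediate rewards.
        self.counts[s, a] += 1
        self.trans[s, a, s_next] += 1
        self.r_hat[s, a] += (r - self.r_hat[s, a]) / self.counts[s, a]
        # 2) Average reward updated over the greedy steps only:
        #    rho <- rho + alpha * (r - h(s) + h(s') - rho)
        if took_greedy:
            self.rho += self.alpha * (r - self.h[s] + self.h[s_next] - self.rho)
        # 3) Model-based backup of the value function:
        #    h(s) <- max_a [ r_hat(s,a) - rho + sum_s' P_hat(s'|s,a) h(s') ]
        self.h[s] = self.q(s).max()
```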

R-Learning
The update equations for R-Learning are
    R(s,a) \leftarrow R(s,a) + \alpha \big[ r - \rho + \max_{a'} R(s',a') - R(s,a) \big]
and, on greedy steps,
    \rho \leftarrow \rho + \beta \big[ r - \rho + \max_{a'} R(s',a') - \max_{a} R(s,a) \big]
Although there are no formal proofs of convergence for R-Learning and H-Learning, they are very effective in practice and converge robustly to optimal policies.
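For comparison, a minimal tabular R-learning step in Python, following the standard Schwartz-style update with rho adjusted only when the executed action is greedy; the learning-rate values are arbitrary illustrative defaults.

```python
import numpy as np

def r_learning_step(R, rho, s, a, r, s_next, alpha=0.1, beta=0.01):
    """One tabular R-learning update. R is the (n_states, n_actions)
    action-value table; the updated rho is returned."""
    # R(s,a) <- R(s,a) + alpha * [ r - rho + max_a' R(s',a') - R(s,a) ]
    R[s, a] += alpha * (r - rho + R[s_next].max() - R[s, a])
    # rho is adjusted only when the executed action is greedy in s:
    # rho <- rho + beta * [ r - rho + max_a' R(s',a') - max_a R(s,a) ]
    if R[s, a] >= R[s].max():
        rho += beta * (r - rho + R[s_next].max() - R[s].max())
    return rho
```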

Dynamic Preferences in Multi-Criteria RL
RL traditionally considers a scalar reward value. In many real-world problem domains this is not sufficient:
  Network routing can have multiple goals, e.g. end-to-end delay and power conservation
  Manufacturing: increase production vs. improve quality
Illustration: Buridan's Donkey problem
  A donkey stands at equal distance from two piles of food
  If it moves towards one pile, the other pile might be stolen
  If it stays on guard at the center, it might starve
General approaches to solving multi-criteria RL:
  Fixed prioritization (ordering) between the goals
  "Weighted criterion" schemes

Dynamic Preferences in Multi-Criteria RL (cont.)
What if the preferences are dynamic, i.e. change with time?
  The government (administration) of a country changes
  Network conditions change
We'll see that a weighted-average-reward framework can work well.
When priorities change, the agent starts from the current "best" policy rather than from scratch.
Moreover, by learning a set of policies the agent can effectively deal with a wide range of preferences; the learned set will cover a large part of the relevant weight space.

Dynamic Multi-Criteria Average Reward RL
We would like to express the objective as making appropriate tradeoffs between different kinds of rewards.
Let's consider a simplified Buridan's Donkey problem, with an additional reward criterion: minimization of the number of steps taken.

Dynamic Multi-Criteria Average Reward RL (cont.)
We approach the solution by defining a weight vector w; each reward type is associated with a weight.
The weighted gain (at a given state) is defined as
    g^{w}(s) = \sum_{i=1}^{k} w_i\, \rho_i(s)
We are looking for a policy that maximizes g.
If the weights don't change, the problem reduces to a regular MDP (why?).
We're interested in the dynamic case, where the relevance of each reward component changes over time.

Dynamic Multi-Criteria Average Reward RL (cont.)
Let's consider alternative solutions to this problem:
  Learning an optimal policy for each weight vector is possible, but too inefficient.
  Alternatively, we could store all policies learned thus far, with their weighted gains, and initialize the policy search with the policy that has the highest gain.
  However, the current weight vector can be very different from that of the highest-gain policy, so this won't work efficiently.
Studying the problem at hand, we note that:
  With a fixed policy, an MDP gives rise to a fixed distribution of reward sequences for each reward type.
  Hence the average reward of the weighted MDP equals the weighted sum of the average rewards of the component MDPs, each associated with a single reward type.

Dynamic Multi-Criteria Average Reward RL (cont.)
If we represent the average rewards of the component MDPs as an average reward vector \boldsymbol{\rho}, the weighted gain of a fixed policy \pi becomes
    g^{w}(\pi) = \mathbf{w} \cdot \boldsymbol{\rho}^{\pi} = \sum_{i=1}^{k} w_i\, \rho_i^{\pi}
Similarly, the value function can be expressed as
    h^{w}(s) = \mathbf{w} \cdot \mathbf{h}(s) = \sum_{i=1}^{k} w_i\, h_i(s)
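In code, the vectorized representation amounts to keeping one average-reward and one value component per reward type and recovering the scalar weighted quantities with an inner product; the numbers below are invented purely to show the bookkeeping.

```python
import numpy as np

w = np.array([0.5, 0.3, 0.2])         # weights for k = 3 reward types
rho_pi = np.array([1.2, -0.4, 0.8])   # average-reward vector of a policy pi
h_pi_s = np.array([0.7, 0.1, -0.3])   # value vector of pi at some state s

weighted_gain = float(w @ rho_pi)     # g_w(pi) = w . rho^pi
weighted_value = float(w @ h_pi_s)    # h_w(s)  = w . h^pi(s)
print(weighted_gain, weighted_value)  # -> approximately 0.64 and 0.32
```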

Example: a two-component weight vector
Consider a weight vector with two components whose sum is 1.
The weighted gain of each policy varies linearly with either weight.
For each weight vector, the optimal policy is the one that maximizes the weighted gain; the globally optimal weighted gain (for any weight) is shown in dark in the slide's figure.
A policy that is optimal for some weight range is called an undominated policy; the others are dominated.
The undominated policies trace out a weighted-gain function that is convex and piecewise linear (in multiple dimensions, a convex piecewise hyper-plane).
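The piecewise-linear structure can be checked numerically: the sketch below sweeps the two-component weight vector (w1, 1 - w1) over a grid and records which of a few hypothetical policies (defined only by made-up average-reward vectors) attains the maximum weighted gain. The policies that are ever selected are the undominated ones, and the per-weight maximum traces the convex envelope.

```python
import numpy as np

# Hypothetical average-reward vectors, one row per candidate policy (k = 2).
rho = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.6, 0.6],
                [0.2, 0.3]])          # this one turns out to be dominated

w1 = np.linspace(0.0, 1.0, 101)
W = np.stack([w1, 1.0 - w1], axis=1)  # weight vectors (w1, 1 - w1)
gains = W @ rho.T                     # gains[i, p] = W[i] . rho[p]

best = gains.argmax(axis=1)           # optimal policy index per weight vector
envelope = gains.max(axis=1)          # convex, piecewise-linear upper envelope
print("undominated policies:", sorted(set(best.tolist())))   # -> [0, 1, 2]
```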

Dynamic Multi-Criteria RL
A key idea behind the approach is that only undominated policies are learned and stored.
A policy is indirectly represented as a value function vector and an average reward vector; both have dimension k, the number of reward types.
The value function vector is a function of states, h(s), in the case of model-based learning, and a function of state-action pairs, R(s,a), for model-free learning. The average reward vector \boldsymbol{\rho}, however, is a constant.
For each policy learned, we store the policy as well as the associated weight vector.
Let \Pi denote the set of all stored policies.

Dynamic Multi-Criteria RL (cont.)
The question is: which stored policy is most appropriate to choose for a new weight vector w_new?
We exploit the fact that each stored policy \pi also carries its average reward vector \boldsymbol{\rho}^{\pi}.
The inner product of these two vectors, \mathbf{w}_{new} \cdot \boldsymbol{\rho}^{\pi}, gives the weighted gain of \pi under the new weight vector.
To maximize the weighted gain, we pick the stored policy that maximizes this inner product.
The value function and average reward vectors are initialized to those of the selected policy and are further improved with respect to the new weights.
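A small sketch of this selection step, assuming each stored policy is represented by its average-reward vector; the function name and data layout are illustrative.

```python
import numpy as np

def select_policy(stored_rho, w_new):
    """Return the index of the stored policy whose weighted gain
    w_new . rho is largest under the new weight vector."""
    gains = [float(np.dot(w_new, rho)) for rho in stored_rho]
    return int(np.argmax(gains))

# Example: three stored policies; the new weights favor the second reward type.
stored = [np.array([1.0, 0.1]), np.array([0.4, 0.9]), np.array([0.7, 0.5])]
print(select_policy(stored, w_new=np.array([0.2, 0.8])))      # -> 1
```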

Dynamic Multi-Criteria RL (cont.)
For storage efficiency, new policies are stored only if they yield an improvement greater than a given threshold.
Although there are an infinite number of weight vectors and an exponential number of policies, the number of stored policies may remain small.
Learning at each phase is carried out by following any vector-based average-reward RL algorithm. For R-Learning, the (componentwise) update equation is
    \mathbf{R}(s,a) \leftarrow \mathbf{R}(s,a) + \alpha \big[ \mathbf{r}(s,a) - \boldsymbol{\rho} + \mathbf{R}(s',a^{*}) - \mathbf{R}(s,a) \big]
where
    a^{*} = \arg\max_{a'} \; \mathbf{w} \cdot \mathbf{R}(s',a')
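A possible componentwise implementation of this vector-valued R-learning step is sketched below; the array shapes, learning rates, and the exact placement of the rho update are assumptions kept consistent with the scalar R-learning update earlier in the lecture.

```python
import numpy as np

def vector_r_learning_step(R, rho, w, s, a, r_vec, s_next,
                           alpha=0.1, beta=0.01):
    """R: (n_states, n_actions, k) table of action-value vectors,
    rho: (k,) average-reward vector, r_vec: (k,) observed reward vector
    (all numpy arrays). Returns the updated rho."""
    # Greedy action under the current weights: a* = argmax_a' w . R(s', a')
    a_star = int((R[s_next] @ w).argmax())
    # Componentwise update: R(s,a) <- R(s,a) + alpha*[r - rho + R(s',a*) - R(s,a)]
    R[s, a] += alpha * (r_vec - rho + R[s_next, a_star] - R[s, a])
    # The average-reward vector is adjusted on greedy steps only.
    if a == int((R[s] @ w).argmax()):
        rho = rho + beta * (r_vec - rho + R[s_next, a_star] - R[s, a])
    return rho
```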

ECE Reinforcement Learning in AI 16 Dynamic Multi-Criteria RL Algorithm
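The algorithm itself appeared as a figure on the original slide and is not reproduced in the transcript. The outline below is only a rough reconstruction from the preceding slides: initialize from the stored policy with the highest inner-product gain, improve it with vector-based R-learning under the new weights (reusing the vector_r_learning_step sketch above), and store the result only if the gain improves by more than a threshold. The environment interface, exploration scheme, and hyperparameters are assumptions.

```python
import numpy as np

def adapt_to_new_weights(env, w_new, library, n_states, n_actions, k,
                         n_steps=100_000, eps=0.1, threshold=0.05):
    """library: list of {'w': ..., 'rho': ..., 'R': ...} for stored policies."""
    # 1) Initialization: reuse the stored policy with the highest weighted
    #    gain w_new . rho under the new weight vector.
    if library:
        best = max(library, key=lambda p: float(w_new @ p['rho']))
        R, rho = best['R'].copy(), best['rho'].copy()
    else:
        R, rho = np.zeros((n_states, n_actions, k)), np.zeros(k)
    initial_gain = float(w_new @ rho)

    # 2) Improvement: vector-based average-reward RL under the new weights
    #    (here, the vector_r_learning_step sketch from the previous slide).
    s = env.reset()                          # assumed environment interface
    for _ in range(n_steps):
        if np.random.rand() < eps:           # epsilon-greedy exploration
            a = np.random.randint(n_actions)
        else:
            a = int((R[s] @ w_new).argmax())
        s_next, r_vec = env.step(a)          # r_vec: one entry per reward type
        rho = vector_r_learning_step(R, rho, w_new, s, a, r_vec, s_next)
        s = s_next

    # 3) Storage: keep the improved policy only if its weighted gain grew
    #    by more than the threshold.
    if float(w_new @ rho) - initial_gain > threshold:
        library.append({'w': w_new.copy(), 'rho': rho.copy(), 'R': R.copy()})
    return R, rho
```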

Simulation results for the Donkey problem
(Figure: plots of the three reward criteria: hunger, food stolen, and walking.)