
1 ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 21: Dynamic Multi-Criteria RL problems
Dr. Itamar Arel
College of Engineering, Department of Electrical Engineering and Computer Science
The University of Tennessee, Fall 2009
November 16, 2009

2 ECE 517 - Reinforcement Learning in AI Outline
Alternative return (value function) formulation
Multi-criteria RL

3 ECE 517 - Reinforcement Learning in AI Average Reward Reinforcement Learning
Consider a (model-based) MDP as discussed before
The optimal policy here maximizes the expected long-term average reward per time step from every state
The Bellman equation for the average-reward setting is
$h(s) = \max_a \left[ r(s,a) - \rho + \sum_{s'} P(s'|s,a)\, h(s') \right]$
where $\rho$ is the average reward per time step under the policy $\pi$
Under reasonable conditions, $\rho$ is constant over the entire state space
The difference $r(s,a) - \rho$ between the reward and $\rho$ is called the average-adjusted reward of action a in state s

4 ECE 517 - Reinforcement Learning in AI Average Reward Reinforcement Learning (cont.)
H-Learning, a model-based learning method for average reward RL, updates the current value function as follows:
$h(s) \leftarrow \max_a \left[ r(s,a) - \rho + \sum_{s'} P(s'|s,a)\, h(s') \right]$
The average reward is updated over the greedy steps using
$\rho \leftarrow (1-\alpha)\rho + \alpha \left[ r(s,a) - h(s) + h(s') \right]$
The state-transition model and immediate rewards are learned by updating their running averages
R-Learning is a model-free counterpart of H-Learning that uses an action-value representation (analogous to Q-learning)
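As a concrete illustration of the H-Learning updates above, here is a minimal tabular sketch in Python. It assumes discrete states and actions, count-based running averages for the learned model, and one common form of the average-reward update on greedy steps; class and variable names and step-size handling are illustrative rather than taken from the lecture.

```python
import numpy as np

# Minimal tabular H-Learning sketch (model-based average-reward RL).
class HLearner:
    def __init__(self, n_states, n_actions, alpha=0.1):
        self.n_states = n_states
        self.h = np.zeros(n_states)                      # value function h(s)
        self.rho = 0.0                                   # average reward per step
        self.alpha = alpha                               # step size for rho
        self.N = np.zeros((n_states, n_actions))         # visit counts
        self.P = np.full((n_states, n_actions, n_states), 1.0 / n_states)  # learned model
        self.R = np.zeros((n_states, n_actions))         # learned immediate rewards

    def greedy_action(self, s):
        # Greedy w.r.t. the one-step lookahead under the learned model
        return int(np.argmax(self.R[s] + self.P[s] @ self.h))

    def update(self, s, a, r, s_next, was_greedy):
        # Running averages of the transition model and immediate reward
        self.N[s, a] += 1
        target = np.zeros(self.n_states)
        target[s_next] = 1.0
        self.P[s, a] += (target - self.P[s, a]) / self.N[s, a]
        self.R[s, a] += (r - self.R[s, a]) / self.N[s, a]

        # Average-adjusted Bellman backup for the visited state:
        # h(s) <- max_a [ r(s,a) - rho + sum_s' P(s'|s,a) h(s') ]
        self.h[s] = np.max(self.R[s] - self.rho + self.P[s] @ self.h)

        # Update rho along greedy steps only (assumed form of the update)
        if was_greedy:
            self.rho += self.alpha * (r - self.h[s] + self.h[s_next] - self.rho)
```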

5 ECE 517 - Reinforcement Learning in AI R-Learning
The update equation for R-Learning is
$R(s,a) \leftarrow R(s,a) + \alpha \left[ r - \rho + \max_{a'} R(s',a') - R(s,a) \right]$
Although there are no formal convergence proofs for R-Learning and H-Learning, they are very effective in practice
They converge robustly to optimal policies
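A matching model-free sketch, with the R-Learning update written out in Python; the step sizes, the epsilon-greedy exploration, and the greedy-step test used for updating rho are illustrative choices, not the lecture's exact settings.

```python
import numpy as np

# Minimal tabular R-Learning sketch (model-free average-reward RL,
# analogous to Q-learning).
class RLearner:
    def __init__(self, n_states, n_actions, alpha=0.1, beta=0.01, epsilon=0.1):
        self.Q = np.zeros((n_states, n_actions))  # action values R(s, a)
        self.rho = 0.0                            # average reward estimate
        self.alpha, self.beta, self.epsilon = alpha, beta, epsilon

    def act(self, s):
        # Epsilon-greedy exploration (illustrative choice)
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.Q.shape[1])
        return int(np.argmax(self.Q[s]))

    def update(self, s, a, r, s_next):
        # Average-adjusted temporal-difference update:
        # R(s,a) <- R(s,a) + alpha [ r - rho + max_a' R(s',a') - R(s,a) ]
        self.Q[s, a] += self.alpha * (r - self.rho + self.Q[s_next].max() - self.Q[s, a])

        # Update rho only when the executed action is greedy in s
        if self.Q[s, a] == self.Q[s].max():
            self.rho += self.beta * (r + self.Q[s_next].max() - self.Q[s].max() - self.rho)
```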

6 ECE 517 - Reinforcement Learning in AI Dynamic Preferences in Multi-Criteria RL
RL traditionally considers a scalar reward value
In many real-world problem domains, that's not sufficient
e.g. network routing can have multiple goals, such as end-to-end delay and power conservation
Manufacturing: increase production vs. improve quality
Illustration by Buridan's Donkey Problem
A donkey stands at equal distance from two piles of food
If it moves towards one pile, the other pile might be stolen
If it stays on guard at the center, it might starve
General approaches to solving Multi-Criteria RL:
Fixed prioritization (ordering) between goals
"Weighted criterion" schemes

7 ECE 517 - Reinforcement Learning in AI Dynamic Preferences in Multi-Criteria RL (cont.)
What if the preferences are dynamic (i.e. change with time)?
Government (administration) changes in a country
Network conditions change
We'll see that a weighted-average reward framework can work well
When priorities change, the agent starts from the current "best" policy rather than from scratch
Moreover, by learning a set of policies the agent can effectively deal with a wide range of preferences
The learned set will cover a large part of the relevant weight space

8 ECE 517 - Reinforcement Learning in AI Dynamic Multi-Criteria Average Reward RL
We would like to express the objective as making appropriate tradeoffs between different kinds of rewards
Let's consider a simplified Buridan's Donkey problem
We add another reward criterion: minimization of the number of steps taken

9 ECE 517 - Reinforcement Learning in AI Dynamic Multi-Criteria Average Reward RL (cont.)
We approach the solution by defining a weight vector $w = (w_1, \ldots, w_k)$
Each reward type is associated with a weight
The weighted gain (at a given state) is defined as
$g^\pi(s) = \sum_{i=1}^{k} w_i\, \rho_i^\pi(s)$
We are looking for a policy that maximizes g
If the weights don't change, then the problem reduces to a regular MDP (why?)
We're interested in the dynamic case, where the relevance of each reward component changes over time
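A tiny numeric illustration of the weighted gain: a policy's vector of per-criterion average rewards is combined with the preference weights by an inner product. The weights and average rewards below are made-up numbers for a three-criterion donkey-style problem.

```python
import numpy as np

# Hypothetical three-criterion example: weights for (hunger, food stolen,
# steps taken) and a policy's per-criterion average rewards.
w = np.array([0.6, 0.3, 0.1])          # preference weight vector, sums to 1
rho_pi = np.array([0.8, -0.2, -0.5])   # average reward vector of policy pi

# Weighted gain g_w(pi) = sum_i w_i * rho_i(pi)
g = float(w @ rho_pi)
print(g)  # 0.37
```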

10 ECE 517 - Reinforcement Learning in AI Dynamic Multi-Criteria Average Reward RL (cont.)
Let's consider the alternative solutions to this problem:
Learning an optimal policy for each weight vector is possible, but too inefficient
Alternatively, we could store all policies learned thus far, with their weighted gains, and initialize the policy search with the policy that has the highest gain
However, the current weight vector can be very different from that of the highest-gain policy, so this won't work efficiently
Studying the problem at hand, we note that:
With a fixed policy, an MDP gives rise to a fixed distribution of sequences of rewards for each reward type
So, the average reward of the weighted MDP equals the weighted sum of the average rewards of the component MDPs, each associated with a single reward type

11 ECE 517 - Reinforcement Learning in AI Dynamic Multi-Criteria Average Reward RL (cont.)
If we represent the average rewards of the component MDPs as an average reward vector $\rho^\pi = (\rho_1^\pi, \ldots, \rho_k^\pi)$, the weighted gain of a fixed policy becomes
$g_w^\pi = w \cdot \rho^\pi$
Similarly, the value function can be expressed as the weighted combination of the component value functions,
$h_w^\pi(s) = w \cdot h^\pi(s)$

12 ECE 517 - Reinforcement Learning in AI Example of a two-component weight vector
Consider a weight vector with two components whose sum is 1
The weighted gain of each policy varies linearly with either weight
For each weight vector, the optimal policy is the one that maximizes the weighted gain
The globally optimal weighted gain (for any weight) is shown in dark
A policy that is optimal for some weight range is called an "un-dominated" policy; the others are dominated
The un-dominated policies trace a weighted gain function that is convex and piecewise linear (a convex piecewise hyper-planar surface in multiple dimensions)
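To make the picture concrete, a short sketch that sweeps a two-component weight vector (w, 1 - w) over three hypothetical policies and reports which of them are optimal for some weight, i.e. un-dominated; the per-criterion average rewards are invented for illustration.

```python
import numpy as np

# Each row is a hypothetical policy's average reward vector (two criteria).
rho = np.array([[1.0, 0.00],
                [0.0, 1.00],
                [0.6, 0.55]])

w1 = np.linspace(0.0, 1.0, 101)
W = np.stack([w1, 1.0 - w1], axis=1)   # weight vectors (w, 1 - w)
gains = W @ rho.T                      # weighted gain of each policy at each w

# The optimal gain is the upper envelope of three lines: convex, piecewise linear.
best = gains.argmax(axis=1)
print("un-dominated policies:", sorted(set(best.tolist())))  # all three, here
```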

13 ECE 517 - Reinforcement Learning in AI Dynamic Multi-Criteria RL
A key idea behind the approach is that only un-dominated policies are learned and stored
A policy is indirectly represented by a value function vector and an average reward vector
Both have dimension k, the number of reward types
The value function vector is a function of states, $h(s)$, in the case of model-based learning and a function of state-action pairs, $R(s,a)$, for model-free learning
However, the average reward vector $\rho$ is a constant (it does not depend on the state)
For each policy learned, we store the policy as well as the associated weight vector (see the sketch below)
Let $\Pi$ denote the set of all stored policies
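A possible in-memory representation of the stored set, assuming the model-free (state-action) case with k reward types; the class and field names are hypothetical, not taken from the lecture.

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical container for one stored (un-dominated) policy.
@dataclass
class StoredPolicy:
    Q: np.ndarray    # value function vector, shape (n_states, n_actions, k)
    rho: np.ndarray  # average reward vector, shape (k,), constant over states
    w: np.ndarray    # weight vector the policy was learned under, shape (k,)

stored_policies: list[StoredPolicy] = []   # the set Pi of stored policies
```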

14 ECE 517 - Reinforcement Learning in AI Dynamic Multi-Criteria RL (cont.)
The question is: what is the most appropriate policy to choose for a new weight vector $w_{new}$?
We exploit the fact that each stored policy $\pi$ also stores its average reward vector $\rho^\pi$
The inner product $w_{new} \cdot \rho^\pi$ gives the weighted gain of $\pi$ under the new weight vector
To maximize the weighted gain, we pick the stored policy that maximizes this inner product
The value function and average reward vectors are initialized to those of the chosen policy and are further improved with respect to the new weights
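Given the StoredPolicy container sketched above, selecting the starting policy for a new weight vector reduces to an argmax over inner products; this is a sketch of that step with illustrative names.

```python
import numpy as np

def select_starting_policy(stored_policies, w_new):
    # Pick the stored policy whose average reward vector maximizes
    # the weighted gain w_new . rho under the new preferences.
    gains = [float(w_new @ p.rho) for p in stored_policies]
    return stored_policies[int(np.argmax(gains))]
```

The learner would then copy that policy's value function and average reward vectors and continue improving them under the new weights, rather than starting from scratch.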

15 ECE 517 - Reinforcement Learning in AI Dynamic Multi-Criteria RL (cont.)
For storage efficiency reasons, new policies are only stored if they yield an improvement greater than a given threshold
Although there are
an infinite number of weight vectors
an exponential number of policies
… the number of stored policies may remain small
Learning at each phase is carried out by following any vector-based average reward RL algorithm
For R-Learning, the vector-valued update equation is
$R(s,a) \leftarrow R(s,a) + \alpha \left[ r(s,a) - \rho + R(s',a^*) - R(s,a) \right]$
where $a^* = \arg\max_{a'} w \cdot R(s',a')$ is the greedy action under the current weight vector
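A sketch of this vector-valued R-Learning step as I read the slide: the value and average-reward estimates are k-dimensional (one component per reward type), and greediness is judged through the scalar weighted value w · R(s', a'). The function signature, step sizes, and the greedy-step test are assumptions of this sketch.

```python
import numpy as np

def vector_r_learning_update(Q, rho, w, s, a, r_vec, s_next, alpha=0.1, beta=0.01):
    """One update step. Q: (n_states, n_actions, k); rho, w, r_vec: (k,)."""
    # Greedy action in the next state under the current weight vector
    a_star = int(np.argmax(Q[s_next] @ w))

    # Component-wise average-adjusted update of the value vector:
    # R(s,a) <- R(s,a) + alpha [ r(s,a) - rho + R(s', a*) - R(s,a) ]
    Q[s, a] += alpha * (r_vec - rho + Q[s_next, a_star] - Q[s, a])

    # Update the average reward vector along greedy steps only (assumed form)
    greedy_a = int(np.argmax(Q[s] @ w))
    if a == greedy_a:
        rho = rho + beta * (r_vec + Q[s_next, a_star] - Q[s, greedy_a] - rho)
    return Q, rho
```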

16 ECE 517 - Reinforcement Learning in AI Dynamic Multi-Criteria RL Algorithm

17 ECE 517 - Reinforcement Learning in AI Simulation results for the Donkey Problem (plots of the three reward criteria: hunger, food stolen, walking)

