Apprenticeship Learning Using Linear Programming

Presentation transcript:

Apprenticeship Learning Using Linear Programming. Umar Syed (Princeton University), Michael Bowling (University of Alberta), Robert E. Schapire. ICML 2008. 1

Apprenticeship Learning: An apprentice learns to behave by observing an expert. Learning algorithm: Input: Demonstrations by an expert policy. Output: An apprentice policy that is at least as good as the expert policy (and possibly better). 2

Main Contribution A new apprenticeship learning algorithm that: Produces simpler apprentice policies, and Is empirically faster than previous algorithms. 3

Outline Introduction Apprenticeship Learning Prior Work Summary of Advantages Over Prior Work Background: Occupancy Measure Linear Program for Apprenticeship Learning (LPAL) Experiments and Demos Other Topics 4

Apprenticeship Learning Given: Same as a Markov Decision Process, except no reward function R. Also given: Basis reward functions R_1, …, R_k. Demonstrations by an expert policy π_E. Assume: The true reward function R is a weighted combination of the k basis reward functions: R(s, a) = Σ_i w_i* · R_i(s, a), where the weight vector w* is unknown. Goal: Find an apprentice policy π_A such that V(π_A) ≥ V(π_E), where the value V(π) of a policy π is with respect to the unknown reward function. 5
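
To make the reward model concrete, here is a minimal sketch in Python with NumPy; the array shapes, sizes, and names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Assumed layout: k basis reward functions over |S| states and |A| actions,
# stored as a (k, S, A) array, plus an unknown weight vector w* of length k.
k, S, A = 3, 16, 4
basis_rewards = np.random.rand(k, S, A)      # R_1, ..., R_k
w_star = np.array([0.5, 0.3, 0.2])           # w* (unknown to the learner)

# True reward: R(s, a) = sum_i w_i* * R_i(s, a)
true_reward = np.tensordot(w_star, basis_rewards, axes=1)   # shape (S, A)
```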

Outline Introduction Apprenticeship Learning Prior Work Summary of Advantages Over Prior Work Background: Occupancy Measure Linear Program for Apprenticeship Learning (LPAL) Experiments and Demos Other Topics 6

Prior Work – Key Idea Define V_i(π) = the ith "basis value" of π = the value of π with respect to the ith basis reward function. Then the true value of a policy is a weighted combination of its basis values, i.e. V(π) = Σ_i w_i* · V_i(π). Proof: Linearity of expectation. 7

Prior Work – Abbeel & Ng (2004) Introduced the apprenticeship learning framework. Algorithm idea: Estimate V_i(π_E) for all i from the expert's demonstrations. Find π_A such that V_i(π_A) = V_i(π_E) for all i. Theorem: V(π_A) = V(π_E). Algorithm "type": Geometric. 8
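
Both prior algorithms, and LPAL below, start by estimating the expert's basis values from demonstrations. Here is a minimal sketch of that step, assuming demonstrations arrive as lists of (state, action) pairs (the data format and function name are assumptions for illustration):

```python
import numpy as np

def estimate_basis_values(demos, basis_rewards, gamma):
    """Monte Carlo estimate of V_i(pi_E) for each basis reward R_i.

    demos         : list of trajectories, each a list of (state, action) pairs
    basis_rewards : (k, S, A) array of basis reward functions
    gamma         : discount factor
    """
    V_hat = np.zeros(basis_rewards.shape[0])
    for traj in demos:
        for t, (s, a) in enumerate(traj):
            V_hat += (gamma ** t) * basis_rewards[:, s, a]
    return V_hat / len(demos)    # average discounted basis return per trajectory
```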

Prior Work – Syed & Schapire (2007) Assumed that all w_i* are non-negative and sum to 1. Algorithm idea: Estimate V_i(π_E) for all i from the expert's demonstrations. Find π_A and M such that: V_i(π_A) ≥ V_i(π_E) + M for all i, and M is as large as possible. Theorem: V(π_A) ≥ V(π_E), and possibly V(π_A) ≫ V(π_E). Algorithm "type": Boosting. 9

Outline Introduction Apprenticeship Learning Prior Work Summary of Our Approach Background: Occupancy Measure Linear Program for Apprenticeship Learning (LPAL) Experiments and Demos Other Topics 10

Summary of Our Approach Same algorithm idea as Syed & Schapire (2007), but formulated as a single linear program, which we give to an off-the-shelf solver. 11

Advantages of Our Approach Previous algorithms: Didn't actually output a single stationary policy π_A, but instead output a distribution D over a set of stationary policies, such that E_{π∼D}[V(π)] ≥ V(π_E). Our algorithm: Outputs a single stationary policy π_A such that V(π_A) ≥ V(π_E), and possibly V(π_A) ≫ V(π_E). Advantage: The apprentice policy is simpler and more intuitive. 12

Advantages of Our Approach Previous algorithms: Ran for several rounds, and each round required solving a standard MDP (expensive). Our algorithm: A single linear program. Advantage: Empirically faster than previous algorithms. We informally conjecture that this is because it solves the problem “all at once”. 13

Outline Introduction Apprenticeship Learning Prior Work Summary of Our Approach Background: Occupancy Measure Linear Program for Apprenticeship Learning (LPAL) Experiments and Demos Other Topics 14

Occupancy Measure The occupancy measure x^π ∈ R^{|S||A|} of a policy π is an alternate way of describing how π moves through the state-action space. x^π(s, a) = Expected (discounted) number of visits by policy π to state-action pair (s, a). Example: Suppose policy π visits state-action pair (s, a): With probability 1 at time 1. With probability 1/2 at time 2. With probability 1/3 at time 3. With probability 0 at times ≥ 4. Then x^π(s, a) = 1 + γ(1/2) + γ²(1/3). 15
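
The worked example above can be checked directly; the discount factor is left unspecified on the slide, so γ = 0.9 below is an assumed value:

```python
gamma = 0.9                        # assumed discount factor
visit_probs = [1.0, 1/2, 1/3]      # visit probabilities at times 1, 2, 3
x_sa = sum(gamma**t * p for t, p in enumerate(visit_probs))
# x_sa = 1 + 0.9*(1/2) + 0.81*(1/3) ≈ 1.72
```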

Occupancy Measure – Equivalent Representation The relationship between a stationary policy π and its occupancy measure x^π is given by: π(a | s) = x^π(s, a) / Σ_{a'} x^π(s, a'). Proof: The left-hand side is the probability of taking action a in state s. The right-hand side is the number of visits to state-action pair (s, a) divided by the number of visits to state s. Significance: It is easy to recover a stationary policy from its occupancy measure. 16
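
A minimal sketch of recovering a stationary policy from an occupancy measure, assuming x is stored as an (S, A) NumPy array (the function name and the uniform fallback for unvisited states are my own choices):

```python
import numpy as np

def policy_from_occupancy(x, eps=1e-12):
    """pi(a | s) = x(s, a) / sum_a' x(s, a')."""
    totals = x.sum(axis=1, keepdims=True)          # visits to each state s
    uniform = np.full_like(x, 1.0 / x.shape[1])    # fallback for states never visited
    return np.where(totals > eps, x / np.maximum(totals, eps), uniform)
```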

Occupancy Measure – Calculating Value Define ν(x) := Σ_{s,a} R(s, a) · x(s, a). Then V(π) = ν(x^π). In other words, ν(x) is the value of a policy whose occupancy measure is x. Proof: A policy earns reward R(s, a) (suitably discounted) every time it visits state-action pair (s, a). Significance: The value of a policy is a linear function of its occupancy measure. 17
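
In code, ν(x) is just an element-wise product and a sum over the reward table and the occupancy measure (same assumed (S, A) array layout as above):

```python
import numpy as np

def nu(reward, x):
    """nu(x) = sum over (s, a) of R(s, a) * x(s, a)."""
    return float(np.sum(reward * x))
```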

Occupancy Measure – Bellman Flow Constraints The Bellman flow constraints are a set of constraints that any vector x ∈ R^{|S||A|} must satisfy to be a valid occupancy measure. The Bellman flow constraints say: "Under any policy, the number of visits into a state s must equal the number of visits leaving state s." [Figure: a state s with incoming and outgoing flow, labeled "must be equal".] 18

Occupancy Measure – Bellman Flow Constraints In fact, the Bellman flow constraints completely characterize the set of occupancy measures: x satisfies the Bellman flow constraints ⇔ x is the occupancy measure of some policy π. Significance: The Bellman flow constraints are linear in x. [Figure: the set of all policies maps onto the set of all occupancy measures (π ↔ x), which is exactly the set cut out by the Bellman flow constraints.] 19
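
The slides state the constraints in words and do not write them out. In the standard dual-LP form they read, for every state s: Σ_a x(s, a) = α(s) + γ · Σ_{s', a'} P(s | s', a') · x(s', a'), where α is the initial state distribution. The sketch below assembles these equalities as matrices, assuming an (S, A, S) transition array P (this is my own rendering of the standard constraints, not code from the paper):

```python
import numpy as np

def bellman_flow_matrices(P, alpha, gamma):
    """Build A_eq, b_eq so that A_eq @ x.flatten() == b_eq encodes
    sum_a x(s,a) - gamma * sum_{s',a'} P(s | s',a') x(s',a') = alpha(s).

    P     : (S, A, S) transition probabilities P[s, a, s']
    alpha : (S,) initial state distribution
    gamma : discount factor
    """
    S, A, _ = P.shape
    A_eq = np.zeros((S, S * A))
    for s in range(S):
        A_eq[s, s * A:(s + 1) * A] += 1.0             # flow out of state s
        A_eq[s, :] -= gamma * P[:, :, s].reshape(-1)  # discounted flow into s
    return A_eq, alpha
```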

Outline Introduction Apprenticeship Learning Prior Work Summary of Our Approach Background: Occupancy Measure Linear Program for Apprenticeship Learning (LPAL) Experiments and Demos Other Topics 20

Derivation of LPAL Algorithm 1. Estimate V_i(π_E) for all i from the expert's demonstrations. 2. Find π_A and M that solve: max M subject to: V_i(π_A) - V_i(π_E) ≥ M for all i; π_A is a policy. (Start with the algorithm idea from Syed & Schapire (2007).) 21

Derivation of LPAL Algorithm 1. Estimate V_i(π_E) for all i from the expert's demonstrations. 2. Find π_A and M that solve: max M subject to: V_i(π_A) - V_i(π_E) ≥ M for all i; π_A is a policy. (Annotation: the constraint "π_A is a policy" is replaced by "x_A satisfies the Bellman flow constraints", where x_A is the occupancy measure of π_A.) 22

Derivation of LPAL Algorithm 1. Estimate V_i(π_E) for all i from the expert's demonstrations. 2. Find π_A and M that solve: max M subject to: V_i(π_A) - V_i(π_E) ≥ M for all i; π_A is a policy. (Annotation: each basis value V_i(π_A) is replaced by ν_i(x_A).) 23

Derivation of LPAL Algorithm 1. Estimate V_i(π_E) for all i from the expert's demonstrations. 2. Find π_A and M that solve: max M subject to: V_i(π_A) - V_i(π_E) ≥ M for all i; π_A is a policy. (Annotation: π_A corresponds to x_A, V_i(π_A) to ν_i(x_A), and "π_A is a policy" to the Bellman flow constraints.) 24

Derivation of LPAL Algorithm 1. Estimate V_i(π_E) for all i from the expert's demonstrations. 2. Find x_A and M that solve: max M subject to: ν_i(x_A) - V_i(π_E) ≥ M for all i; x_A satisfies the Bellman flow constraints. 25

Derivation of LPAL Algorithm 1. Estimate V_i(π_E) for all i from the expert's demonstrations. 2. Find x_A and M that solve: max M subject to: ν_i(x_A) - V_i(π_E) ≥ M for all i; x_A satisfies the Bellman flow constraints. This is a linear program! 26

Derivation of LPAL Algorithm 1. Estimate V_i(π_E) for all i from the expert's demonstrations. 2. Find x_A and M that solve: max M subject to: ν_i(x_A) - V_i(π_E) ≥ M for all i; x_A satisfies the Bellman flow constraints. "Of all occupancy measures … 27

Derivation of LPAL Algorithm 1. Estimate V_i(π_E) for all i from the expert's demonstrations. 2. Find x_A and M that solve: max M subject to: ν_i(x_A) - V_i(π_E) ≥ M for all i; x_A satisfies the Bellman flow constraints. "Of all occupancy measures … find one corresponding to a policy that is better than the expert's policy … 28

Derivation of LPAL Algorithm 1. Estimate V_i(π_E) for all i from the expert's demonstrations. 2. Find x_A and M that solve: max M subject to: ν_i(x_A) - V_i(π_E) ≥ M for all i; x_A satisfies the Bellman flow constraints. "Of all occupancy measures … find one corresponding to a policy that is better than the expert's policy … by as much as possible." 29

Derivation of LPAL Algorithm 1. Estimate V_i(π_E) for all i from the expert's demonstrations. 2. Find x_A and M that solve: max M subject to: ν_i(x_A) - V_i(π_E) ≥ M for all i; x_A satisfies the Bellman flow constraints. "Of all occupancy measures … find one corresponding to a policy that is better than the expert's policy … by as much as possible." 3. Convert the occupancy measure to a stationary policy: π_A(a | s) = x_A(s, a) / Σ_{a'} x_A(s, a'). 30
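
Putting the pieces together, here is a sketch of the whole LPAL program fed to an off-the-shelf solver (scipy.optimize.linprog). The function signature, variable names, and the handling of unvisited states are illustrative assumptions; the paper specifies only the mathematical program:

```python
import numpy as np
from scipy.optimize import linprog

def lpal(P, basis_rewards, V_E_hat, alpha, gamma):
    """Sketch of the LPAL linear program (illustrative, not the authors' code).

    P             : (S, A, S) transition probabilities P[s, a, s']
    basis_rewards : (k, S, A) basis reward functions R_i(s, a)
    V_E_hat       : (k,) estimated basis values of the expert policy
    alpha         : (S,) initial state distribution
    gamma         : discount factor
    """
    k, S, A = basis_rewards.shape
    n = S * A                          # one variable x(s, a) per state-action pair
    c = np.zeros(n + 1)                # decision vector z = [x, M]
    c[-1] = -1.0                       # linprog minimizes, so minimize -M

    # nu_i(x) - V_i(pi_E) >= M  <=>  M - sum_{s,a} R_i(s,a) x(s,a) <= -V_E_hat[i]
    A_ub = np.hstack([-basis_rewards.reshape(k, n), np.ones((k, 1))])
    b_ub = -V_E_hat

    # Bellman flow constraints:
    #   sum_a x(s,a) - gamma * sum_{s',a'} P(s | s',a') x(s',a') = alpha(s)
    A_eq = np.zeros((S, n + 1))
    for s in range(S):
        A_eq[s, s * A:(s + 1) * A] += 1.0
        A_eq[s, :n] -= gamma * P[:, :, s].reshape(-1)
    b_eq = alpha

    bounds = [(0, None)] * n + [(None, None)]      # x >= 0, M unconstrained
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)

    # Step 3: convert the occupancy measure into a stationary policy.
    x = res.x[:n].reshape(S, A)
    policy = x / np.maximum(x.sum(axis=1, keepdims=True), 1e-12)
    return policy, res.x[-1]           # apprentice policy and achieved margin M
```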

LPAL Algorithm Theorem: V(π_A) ≥ V(π_E), and possibly V(π_A) ≫ V(π_E) (same as Syed & Schapire (2007)). Proof: Almost immediate. Remark: We could have applied the same occupancy measure "trick" to the algorithm idea from Abbeel & Ng (2004), and likewise derived a linear program. 31

Outline Introduction Apprenticeship Learning Prior Work Summary of Our Approach Background: Occupancy Measure Linear Program for Apprenticeship Learning (LPAL) Experiments and Demos Other Topics 32

Experiment – Setup Actions & transitions: North, South, East, and West, with a 30% chance of moving to a random state. Basis rewards: One indicator basis reward per region. Expert: Optimal policy for a randomly chosen weight vector w*. [Figure: a "gridworld" environment divided into regions, with one region and one state labeled.] 33
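
For concreteness, here is one way the environment described above could be constructed; the grid size, region size, and the exact interpretation of the 30% noise (intended move with probability 0.7, uniformly random state otherwise) are assumptions filled in for illustration:

```python
import numpy as np

def gridworld(n=8, region=4, noise=0.3):
    """n x n gridworld split into (n // region)**2 square regions.

    Transitions: the intended move (N, S, E, W) succeeds with probability
    1 - noise; otherwise the agent lands in a uniformly random state.
    Basis rewards: one indicator reward per region.
    """
    S, A = n * n, 4
    moves = [(-1, 0), (1, 0), (0, 1), (0, -1)]       # N, S, E, W
    P = np.zeros((S, A, S))
    basis_rewards = np.zeros(((n // region) ** 2, S, A))
    for s in range(S):
        r, c = divmod(s, n)
        for a, (dr, dc) in enumerate(moves):
            nr = min(max(r + dr, 0), n - 1)          # stay in bounds
            nc = min(max(c + dc, 0), n - 1)
            P[s, a, nr * n + nc] += 1.0 - noise      # intended move
            P[s, a, :] += noise / S                  # random teleport
        i = (r // region) * (n // region) + (c // region)
        basis_rewards[i, s, :] = 1.0                 # indicator for s's region
    return P, basis_rewards
```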

Experiment – Setup Compare: Projection algorithm (Abbeel & Ng 2004) MWAL algorithm (Syed & Schapire 2007) LPAL algorithm (this work) Evaluation Metric: Time required to learn apprentice policy whose value is 95% of the optimal value. 34

Experiment – Results Note: Y-axis is log scale. 35

Experiment – Results Note: Y-axis is log scale. 36

Demo – Mimicking the Expert [Videos: the expert, and the output of the LPAL algorithm.] 37

Demo – Improving Upon the Expert [Videos: the expert, and the output of the LPAL algorithm.] 38

Outline Introduction Apprenticeship Learning Prior Work Summary of Our Approach Background: Occupancy Measure Linear Program for Apprenticeship Learning (LPAL) Experiments and Demos Other Topics 39

Also Discussed in the Paper… We observe that the MWAL algorithm often performs better than its theory predicts it should. We have new results explaining this behavior (in preparation). 40

Related to Our Approach Constrained MDPs and RL with multiple rewards: Feinberg and Schwartz (1996) Gabor, Kalmar and Szepesvari (1998) Altman (1999) Shelton (2000) Dolgov and Durfee (2005) … Max margin planning: Ratliff, Bagnell and Zinkevich (2006) 41

Recap A new apprenticeship learning algorithm that: Produces simpler apprentice policies, and Is empirically faster than previous algorithms. Thanks! Questions? 42

Prior Work – Details Algorithms for finding π_A: Max-Margin: Based on quadratic programming. Projection: Based on a geometric approach. MWAL: Based on a multiplicative weights approach, similar to boosting. Actually, existing algorithms don't find a single stationary π_A, but instead find a distribution D over a set of stationary policies, such that E_{π∼D}[V(π)] ≥ V(π_E). Drawbacks: The apprentice policy is nonintuitive and complicated to describe. The algorithms have slow empirical running time. 43

This Work Same algorithm idea: Find π_A such that V_i(π_A) ≥ V_i(π_E) for all i. The algorithm for finding π_A is based on linear programming. Allows us to leverage the efficiency of modern LP solvers. Outputs a single stationary policy π_A such that V(π_A) ≥ V(π_E). Benefits: The apprentice policy is simpler. The algorithm is empirically much faster. 44

Occupancy Measure – Equivalent Representation The occupancy measure x^π ∈ R^{|S||A|} of a policy π is an alternate way of describing how π moves through the state-action space. x^π(s, a) = Expected (discounted) number of visits by policy π to state-action pair (s, a). 46

Occupancy Measure – Calculating Value Given the occupancy measure x^π of a policy π, computing the basis values of π is easy. Define ν_i(x) := Σ_{s,a} R_i(s, a) · x(s, a). Then V_i(π) = ν_i(x^π). In other words, ν_i(x) is the ith basis value of a policy whose occupancy measure is x. Proof: Policy π earns (suitably discounted) reward R_i(s, a) for each visit to state-action pair (s, a). And x^π(s, a) is the expected (and suitably discounted) number of visits by policy π to state-action pair (s, a). 47

Occupancy Measure – Calculating Value Given the occupancy measure x^π of a policy π, computing the basis values of π is easy: V_i(π) = Σ_{s,a} R_i(s, a) · x^π(s, a). Proof: Linearity of expectation. For convenience, define: ν_i(x) := Σ_{s,a} R_i(s, a) · x(s, a). 48

Occupancy Measure – Calculating Value Define ν(x) := Σ_{s,a} R(s, a) · x(s, a). Then V(π) = ν(x^π). In other words, ν(x) is the value of a policy whose occupancy measure is x. Proof: Policy π earns (discounted) reward R(s, a) for each visit to state-action pair (s, a). And x^π(s, a) is the expected (discounted) number of visits by policy π to state-action pair (s, a). So V(π) = Σ_{s,a} R(s, a) · x^π(s, a) = ν(x^π). 49

Occupancy Measure – Bellman Flow Constraints The Bellman flow constraints are a set of linear constraints that define the set of all occupancy measures: x satisfies the Bellman flow constraints ⇔ x is the occupancy measure of some policy π. [Figure: the set of all policies maps onto the set of all occupancy measures (π ↔ x), which is exactly the set cut out by the Bellman flow constraints.] 50

Algorithm 1. Estimate V_i(π_E) for all i from the expert's demonstrations. 2. Find π_A that solves: V_i(π_A) ≥ V_i(π_E) for all i, subject to: π_A is a policy. 51

Related Work Quite a lot of work on constrained MDPs and MDPs with multiple rewards: Feinberg and Schwartz (1996) Gabor, Kalmar and Szepesvari (1998) Altman (1999) Shelton (2000) Dolgov and Durfee (2005) … Closely related to apprenticeship learning. May be a source of ideas… 52

Apprenticeship Learning – Example States: Speeds and positions of all cars. Actions: Left, right, speed up, and slow down. Basis reward functions: Speed, collision indicator, off-road indicator. True reward function: Unknown weighted combination of basis rewards. [Video: the expert demonstrating policy π_E.] 53

Also Discussed in the Paper… LPAL can't be used if the environment dynamics are a "black box". In this case, the occupancy measure "trick" can still be used to modify existing algorithms (e.g., Projection and MWAL) so that: They output a single stationary policy, and Their theoretical guarantees are preserved. Note: The resulting algorithms are not any faster. 54