Apprenticeship Learning Using Linear Programming

Umar Syed, Princeton University
Michael Bowling, University of Alberta
Robert E. Schapire
ICML 2008

Apprenticeship Learning: An apprentice learns to behave by observing an expert.
Learning algorithm:
  Input: Demonstrations by expert policy.
  Output: Apprentice policy that is at least as good as the expert policy (and possibly better).

Main Contribution
A new apprenticeship learning algorithm that:
  Produces simpler apprentice policies, and
  Is empirically faster than previous algorithms.

Outline
Introduction
  Apprenticeship Learning
  Prior Work
Summary of Advantages Over Prior Work
Background: Occupancy Measure
Linear Program for Apprenticeship Learning (LPAL)
Experiments and Demos
Other Topics

Apprenticeship Learning
Given:
  Same as Markov Decision Process, except no reward function R.
Also given:
  Basis reward functions R_1, …, R_k.
  Demonstrations by an expert policy π_E.
Assume: True reward function R is a weighted combination of the k basis reward functions:
  R(s, a) = Σ_i w_i* · R_i(s, a)
where the weight vector w* is unknown.
Goal: Find apprentice policy π_A such that
  V(π_A) ≥ V(π_E)
where the value V(π) of policy π is with respect to the unknown reward function.

Prior Work – Key Idea
Define V_i(π) = ith “basis value” of π
              = value of π with respect to the ith basis reward function.
Then the true value of a policy is a weighted combination of its basis values, i.e.
  V(π) = Σ_i w_i* · V_i(π)
Proof: Linearity of expectation.
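The linearity claim can be checked numerically on a single trajectory. The sketch below is only an illustration: the discount factor, the weights, and the per-step basis rewards are made-up values, not taken from the talk.

```python
# Illustrative check that V = sum_i w_i * V_i along one trajectory.
# All numbers below are hypothetical.
gamma = 0.9
weights = [0.3, 0.7]                                  # stand-in for the unknown w*
trajectory = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]     # (R_1, R_2) earned at each step


def discounted_sum(rewards):
    """Sum of gamma^t * r_t over a reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# Basis values: discounted return under each basis reward separately.
basis_values = [discounted_sum([step[i] for step in trajectory])
                for i in range(len(weights))]

# True value: discounted return under the weighted reward.
true_rewards = [sum(w * r for w, r in zip(weights, step)) for step in trajectory]
value_direct = discounted_sum(true_rewards)

# Weighted combination of the basis values.
value_combined = sum(w * v for w, v in zip(weights, basis_values))
```

The two quantities agree up to floating-point rounding; taking expectations over trajectories preserves the identity.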

Prior Work – Abbeel & Ng (2004)
Introduced the apprenticeship learning framework.
Algorithm Idea:
  Estimate V_i(π_E) for all i from expert’s demonstrations.
  Find π_A such that V_i(π_A) = V_i(π_E) for all i.
Theorem: V(π_A) = V(π_E)
Algorithm “type”: Geometric
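The estimation step can be sketched as a plain Monte Carlo average of discounted basis rewards over the demonstrated trajectories. The function name, the trajectory format (lists of (state, action) pairs), and the toy rewards are assumptions made for illustration; the slides do not specify how the estimate is computed.

```python
def estimate_basis_values(trajectories, basis_rewards, gamma):
    """Average discounted basis reward over demonstrated trajectories.

    trajectories: list of trajectories, each a list of (state, action) pairs.
    basis_rewards: list of functions R_i(state, action) -> float.
    """
    totals = [0.0] * len(basis_rewards)
    for trajectory in trajectories:
        for t, (s, a) in enumerate(trajectory):
            for i, R_i in enumerate(basis_rewards):
                totals[i] += gamma ** t * R_i(s, a)
    return [total / len(trajectories) for total in totals]


# Hypothetical demonstrations in a 2-state, 2-action problem.
demos = [[(0, 0), (1, 0)], [(1, 1), (1, 1)]]
rewards = [lambda s, a: 1.0 if s == 1 else 0.0,   # indicator of state 1
           lambda s, a: float(a)]                 # indicator of action 1
estimates = estimate_basis_values(demos, rewards, gamma=0.5)
```

With finitely many finite-length demonstrations the estimate is only approximate.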

Prior Work – Syed & Schapire (2007)
Assumed that all w_i* are non-negative and sum to 1.
Algorithm Idea:
  Estimate V_i(π_E) for all i from expert’s demonstrations.
  Find π_A and M such that: V_i(π_A) ≥ V_i(π_E) + M for all i, and M is as large as possible.
Theorem: V(π_A) ≥ V(π_E), and possibly V(π_A) ≫ V(π_E).
Algorithm “type”: Boosting

Summary of Our Approach
Same algorithm idea as Syed & Schapire (2007), but formulated as a single linear program, which we give to an off-the-shelf solver.

Advantages of Our Approach
Previous algorithms: Didn’t actually output a single stationary policy π_A, but instead output a distribution D over a set of stationary policies, such that
  E_{π ~ D}[V(π)] ≥ V(π_E)
Our algorithm: Outputs a single stationary policy π_A such that V(π_A) ≥ V(π_E), and possibly V(π_A) ≫ V(π_E).
Advantage: Apprentice policy is simpler and more intuitive.

Advantages of Our Approach
Previous algorithms: Ran for several rounds, and each round required solving a standard MDP (expensive).
Our algorithm: A single linear program.
Advantage: Empirically faster than previous algorithms. We informally conjecture that this is because it solves the problem “all at once”.

Occupancy Measure
The occupancy measure x^π ∈ R^{|S||A|} of policy π is an alternate way of describing how π moves through the state-action space.
  x^π_sa = Expected (discounted) number of visits by policy π to state-action pair (s, a).
Example: Suppose policy π visits state-action pair (s, a):
  With probability 1 at time 1.
  With probability 1/2 at time 2.
  With probability 1/3 at time 3.
  With probability 0 at times ≥ 4.
Then x^π_sa = 1 + γ(1/2) + γ²(1/3)
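The arithmetic in the example can be reproduced directly. The discount factor 0.9 is an assumed value chosen only to make the sum concrete.

```python
# Visit probabilities from the example, keyed by time step (time starts at 1).
gamma = 0.9  # assumed discount factor, not given on the slide
visit_prob = {1: 1.0, 2: 1 / 2, 3: 1 / 3}

# Expected discounted number of visits: the visit at time t is weighted by gamma^(t-1).
x_sa = sum(gamma ** (t - 1) * p for t, p in visit_prob.items())
```

For this choice of discount the sum is 1 + 0.45 + 0.27 = 1.72.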

Occupancy Measure – Equivalent Representation
The relationship between a stationary policy π and its occupancy measure x^π is given by:
  π(s, a) = x^π_sa / Σ_a' x^π_sa'
Proof:
  Left-hand side = Probability of taking action a in state s.
  Right-hand side = (No. of visits to state-action (s, a)) / (No. of visits to state s).
Significance: It is easy to recover a stationary policy from its occupancy measure.
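Recovering a policy from an occupancy measure amounts to normalizing each state's row. In the sketch below the nested-list layout `x[s][a]` and the uniform fallback for states that are never visited are assumptions; the slides do not say what to do when a state has zero occupancy.

```python
def policy_from_occupancy(x):
    """Return pi[s][a] = x[s][a] / sum over a' of x[s][a'] for each state s."""
    policy = []
    for row in x:
        total = sum(row)
        if total > 0:
            policy.append([value / total for value in row])
        else:
            # Assumed convention: any action choice works in an unvisited state.
            policy.append([1.0 / len(row)] * len(row))
    return policy


# Hypothetical occupancy measure for 2 states and 2 actions.
pi = policy_from_occupancy([[3.0, 1.0], [0.0, 0.0]])
```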

Occupancy Measure – Calculating Value
Define ν(x) ≜ Σ_{s,a} R(s, a) · x_sa
Then V(π) = ν(x^π). In other words, ν(x) is the value of a policy whose occupancy measure is x.
Proof: A policy earns reward R(s, a) (suitably discounted) every time it visits state-action pair (s, a).
Significance: The value of a policy is a linear function of its occupancy measure.
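The identity between a policy's value and the linear function of its occupancy measure can be verified on a small MDP. Everything in this sketch is a made-up example: the transition tensor `P[s, a, s']`, the reward table, the initial-state distribution `alpha`, and the policy are hypothetical, and the value is taken to be the expected discounted return from the initial distribution.

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])    # P[s, a, s'], hypothetical dynamics
R = np.array([[1.0, 0.0], [0.0, 2.0]])      # R[s, a], hypothetical reward
alpha = np.array([1.0, 0.0])                # initial-state distribution
pi = np.array([[0.6, 0.4], [0.3, 0.7]])     # pi[s, a], a stochastic stationary policy

# Standard policy evaluation: v = (I - gamma * P_pi)^-1 r_pi.
P_pi = np.einsum("sa,sat->st", pi, P)       # state-to-state transitions under pi
r_pi = (pi * R).sum(axis=1)                 # expected one-step reward per state
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
value_direct = alpha @ v

# Occupancy measure: discounted state visits d, split across actions by pi.
d = np.linalg.solve(np.eye(2) - gamma * P_pi.T, alpha)
x = d[:, None] * pi
value_from_x = (R * x).sum()                # nu(x) = sum over s,a of R(s,a) * x_sa
```

Both routes give the same number, and the entries of `x` sum to 1 / (1 - gamma).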

Occupancy Measure – Bellman Flow Constraints
The Bellman flow constraints are a set of constraints that any vector x ∈ R^{|S||A|} must satisfy to be a valid occupancy measure.
The Bellman flow constraints say: “Under any policy, the number of visits into a state s must equal the number of visits leaving state s.”
[Figure: flow into state s and flow out of state s, which must be equal.]

In fact, the Bellman flow constraints completely characterize the set of occupancy measures.
  x satisfies the Bellman flow constraints
  ⇔
  x is the occupancy measure of some policy π
Significance: The Bellman flow constraints are linear in x.
[Figure: each policy π in the set of all policies corresponds to a point x in the set of all occupancy measures, which is defined by the Bellman flow constraints.]
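For reference, one standard way to write the Bellman flow constraints is sketched below. The symbols α_s (initial-state distribution) and θ_{s'a's} (probability of moving from state s' to state s under action a') are not defined on the slides and are assumed here. The left-hand side counts discounted visits leaving s; the right-hand side counts visits entering s, either at the start or by transition.

```latex
\sum_{a} x_{sa} \;=\; \alpha_s \;+\; \gamma \sum_{s',\,a'} x_{s'a'}\,\theta_{s'a's}
\quad \text{for all } s,
\qquad
x_{sa} \ge 0 \quad \text{for all } s, a.
```

Every constraint is a linear equality or inequality in x, which is what makes the linear-programming formulation possible.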

Derivation of LPAL Algorithm
Start with the algorithm idea from Syed & Schapire (2007):
1. Estimate V_i(π_E) for all i from expert’s demonstrations.
2. Find π_A and M that solve:
   max M
   subject to: V_i(π_A) - V_i(π_E) ≥ M for all i.
               π_A is a policy.

Rewrite step 2 in terms of the occupancy measure x_A of π_A:
  Replace the variable π_A with x_A.
  Replace V_i(π_A) with ν_i(x_A).
  Replace the constraint “π_A is a policy” with “x_A satisfies the Bellman flow constraints”.

1. Estimate V_i(π_E) for all i from expert’s demonstrations.
2. Find x_A and M that solve:
   max M
   subject to: ν_i(x_A) - V_i(π_E) ≥ M for all i.
               x_A satisfies the Bellman flow constraints.
This is a linear program!

1. Estimate V_i(π_E) for all i from expert’s demonstrations.
2. Find x_A and M that solve:
   max M
   subject to: ν_i(x_A) - V_i(π_E) ≥ M for all i.
               x_A satisfies the Bellman flow constraints.
   “Of all occupancy measures … find one corresponding to a policy that is better than the expert’s policy … by as much as possible.”
3. Convert occupancy measure to a stationary policy:
   π_A(s, a) = x_A(s, a) / Σ_a' x_A(s, a')
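The three steps can be sketched with an off-the-shelf solver. The code below is a minimal illustration using `scipy.optimize.linprog`, not the authors' implementation: the toy MDP, the basis rewards, the initial distribution `alpha`, and the uniform-random stand-in expert are all made up, and the expert's basis values are computed exactly here rather than estimated from demonstrations.

```python
import numpy as np
from scipy.optimize import linprog


def occupancy(P, pi, alpha, gamma):
    """Exact occupancy measure x[s, a] of the stationary policy pi[s, a]."""
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)              # transitions under pi
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, alpha)
    return d[:, None] * pi


def lpal(P, basis_R, alpha, gamma, V_E):
    """Maximize M subject to nu_i(x) - V_E[i] >= M and the Bellman flow constraints."""
    S, A, _ = P.shape
    k, n = len(basis_R), S * A
    c = np.zeros(n + 1)
    c[-1] = -1.0                                       # linprog minimizes, so use -M
    # Margin constraints rewritten as: -nu_i(x) + M <= -V_E[i].
    A_ub = np.hstack([-basis_R.reshape(k, n), np.ones((k, 1))])
    b_ub = -np.asarray(V_E, dtype=float)
    # Flow: sum_a x[s,a] - gamma * sum_{s',a'} x[s',a'] P[s',a',s] = alpha[s].
    A_eq = np.zeros((S, n + 1))
    for s in range(S):
        for a in range(A):
            j = s * A + a
            A_eq[:, j] -= gamma * P[s, a, :]
            A_eq[s, j] += 1.0
    bounds = [(0, None)] * n + [(None, None)]          # x >= 0, M unrestricted
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=alpha,
                  bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    x, M = res.x[:n].reshape(S, A), res.x[-1]
    totals = x.sum(axis=1, keepdims=True)
    # Step 3: normalize rows; unvisited states get a uniform row (an assumption).
    policy = np.where(totals > 1e-12, x / np.maximum(totals, 1e-12), 1.0 / A)
    return policy, x, M


# Toy problem; every number here is hypothetical.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])               # P[s, a, s']
basis_R = np.array([[[1.0, 0.0], [0.0, 0.0]],
                    [[0.0, 0.0], [0.0, 1.0]]])         # basis_R[i, s, a]
alpha = np.array([1.0, 0.0])
pi_E = np.full((2, 2), 0.5)                            # stand-in expert policy
V_E = (basis_R * occupancy(P, pi_E, alpha, gamma)).sum(axis=(1, 2))

pi_A, x_A, M = lpal(P, basis_R, alpha, gamma, V_E)
V_A = (basis_R * occupancy(P, pi_A, alpha, gamma)).sum(axis=(1, 2))
```

Because the expert's own occupancy measure is feasible with M = 0, the optimal margin is never negative when the expert's basis values are exact, so every basis value of the returned policy is at least the expert's.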

LPAL Algorithm
Theorem: V(π_A) ≥ V(π_E), and possibly V(π_A) ≫ V(π_E) (same as Syed & Schapire (2007)).
Proof: Almost immediate.
Remark: We could have applied the same occupancy measure “trick” to the algorithm idea from Abbeel & Ng (2004), and likewise derived a linear program.

Experiment – Setup
“Gridworld” environment, divided into regions.
Actions & transitions: North, South, East and West, with 30% chance of moving to a random state.
Basis rewards: One indicator basis reward per region.
Expert: Optimal policy for randomly chosen weight vector w*.
[Figure: gridworld with one region and one state labeled.]
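An environment of this kind could be built along the following lines. The slide does not give the grid size, the region size, or what happens when a move would leave the grid, so the defaults below (a 4×4 grid, 2×2 regions, staying in place at walls, and a uniformly random jump with probability 0.3) are assumptions.

```python
import numpy as np


def build_gridworld(n=4, region=2, noise=0.3):
    """Return transitions P[s, a, s'] and indicator basis rewards basis_R[i, s, a].

    Assumes n is divisible by region.
    """
    S = n * n
    moves = [(-1, 0), (1, 0), (0, 1), (0, -1)]       # North, South, East, West
    P = np.zeros((S, 4, S))
    for r in range(n):
        for c in range(n):
            s = r * n + c
            for a, (dr, dc) in enumerate(moves):
                # Intended move, clipped at the grid boundary (assumed behavior).
                nr = min(max(r + dr, 0), n - 1)
                nc = min(max(c + dc, 0), n - 1)
                P[s, a, nr * n + nc] += 1.0 - noise
                # With probability `noise`, jump to a uniformly random state.
                P[s, a, :] += noise / S
    per_side = n // region
    basis_R = np.zeros((per_side ** 2, S, 4))
    for r in range(n):
        for c in range(n):
            i = (r // region) * per_side + (c // region)
            basis_R[i, r * n + c, :] = 1.0            # indicator of region i
    return P, basis_R


P_grid, R_grid = build_gridworld()
```

Each state belongs to exactly one region, so the basis rewards sum to 1 at every state-action pair.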

Experiment – Setup
Compare:
  Projection algorithm (Abbeel & Ng 2004)
  MWAL algorithm (Syed & Schapire 2007)
  LPAL algorithm (this work)
Evaluation Metric: Time required to learn apprentice policy whose value is 95% of the optimal value.

Experiment – Results
[Plots not reproduced in this transcript.] Note: Y-axis is log scale.

Demo – Mimicking the Expert
[Side-by-side panels: Expert | Output of LPAL algorithm]

Demo – Improving Upon the Expert
[Side-by-side panels: Expert | Output of LPAL algorithm]

Also Discussed in the Paper…
We observe that the MWAL algorithm often performs better than its theory predicts it should. We have new results explaining this behavior (in preparation).

Related to Our Approach
Constrained MDPs and RL with multiple rewards:
  Feinberg and Schwartz (1996)
  Gabor, Kalmar and Szepesvari (1998)
  Altman (1999)
  Shelton (2000)
  Dolgov and Durfee (2005)
  …
Max margin planning:
  Ratliff, Bagnell and Zinkevich (2006)

Recap
A new apprenticeship learning algorithm that:
  Produces simpler apprentice policies, and
  Is empirically faster than previous algorithms.
Thanks! Questions?

Prior Work – Details
Algorithms for finding π_A:
  Max-Margin: Based on quadratic programming.
  Projection: Based on a geometric approach.
  MWAL: Based on a multiplicative weights approach, similar to boosting.
Actually, existing algorithms don’t find a single stationary π_A, but instead find a distribution D over a set of stationary policies, such that E_{π ~ D}[V(π)] ≥ V(π_E).
Drawbacks:
  Apprentice policy is nonintuitive and complicated to describe.
  Algorithms have slow empirical running time.

This Work
Same algorithm idea: Find π_A such that V_i(π_A) ≥ V_i(π_E) for all i.
Algorithm to find π_A based on linear programming.
  Allows us to leverage the efficiency of modern LP solvers.
  Outputs a single stationary policy π_A such that V(π_A) ≥ V(π_E).
Benefits:
  Apprentice policy is simpler.
  Algorithm is empirically much faster.

Occupancy Measure – Equivalent Representation
The occupancy measure x^π ∈ R^{|S||A|} of policy π is an alternate way of describing how π moves through the state-action space.
  x^π_sa = Expected (discounted) number of visits by policy π to state-action pair (s, a).

Given the occupancy measure x^π of a policy π, computing the basis values of π is easy.
Define ν_i(x) ≜ Σ_{s,a} R_i(s, a) · x_sa
Then V_i(π) = ν_i(x^π). In other words, ν_i(x) is the ith basis value of a policy whose occupancy measure is x.
Proof: Policy π earns (suitably discounted) reward R_i(s, a) for each visit to state-action pair (s, a). And x^π_sa is the expected (and suitably discounted) number of visits by policy π to state-action pair (s, a).

Given the occupancy measure x^π of a policy π, computing the basis values of π is easy:
  V_i(π) = Σ_{s,a} R_i(s, a) · x^π(s, a)
Proof: Linearity of expectation.
For convenience, define:
  ν_i(x) ≜ Σ_{s,a} R_i(s, a) · x(s, a)

Define ν(x) ≜ Σ_{s,a} R(s, a) · x_sa
Then V(π) = ν(x^π). In other words, ν(x) is the value of a policy whose occupancy measure is x.
Proof: Policy π earns (discounted) reward R(s, a) for each visit to state-action pair (s, a). And x^π_sa is the expected (discounted) number of visits by policy π to state-action pair (s, a). So V(π) is …

The Bellman flow constraints are a set of linear constraints that define the set of all occupancy measures.
  x satisfies the Bellman flow constraints
  ⇔
  x is the occupancy measure of some policy π
[Figure: each policy π in the set of all policies corresponds to a point x in the set of all occupancy measures, which is defined by the Bellman flow constraints.]

Algorithm
1. Estimate V_i(π_E) for all i from expert’s demonstrations.
2. Find π_A that solves:
   V_i(π_A) ≥ V_i(π_E) for all i.
   subject to: π_A is a policy.

Related Work
Quite a lot of work on constrained MDPs and MDPs with multiple rewards:
  Feinberg and Schwartz (1996)
  Gabor, Kalmar and Szepesvari (1998)
  Altman (1999)
  Shelton (2000)
  Dolgov and Durfee (2005)
  …
Closely related to Apprenticeship Learning. May be a source of ideas…

Apprenticeship Learning – Example
States: Speeds and positions of all cars
Actions: Left, right, speed up, and slow down
Basis reward functions:
  Speed
  Collision indicator
  Off-road indicator
True reward function: Unknown weighted combination of basis rewards.
[Figure: Expert demonstrating policy π_E.]

Also Discussed in the Paper…
LPAL can’t be used if the environment dynamics are a “black box”.
In this case, the occupancy measure “trick” can still be used to modify existing algorithms (e.g. Projection and MWAL) so that:
  They output a single stationary policy, and
  Their theoretical guarantees are preserved.
Note: Resulting algorithms are not any faster.
