Learning to Improve the Quality of Plans Produced by Partial-order Planners
M. Afzal Upal, Intelligent Agents & Multiagent Systems Lab


1 Learning to Improve the Quality of Plans Produced by Partial-order Planners M. Afzal Upal Intelligent Agents & Multiagent Systems Lab

2 Outline: Artificial Intelligence Planning: Problems and Solutions; Why Learn to Improve Plan Quality?; The Performance Improving Partial-order Planner (PIP); The Intra-solution Learning (ISL) Algorithm; Search-control vs. Rewrite Rules; Empirical Evaluation; Conclusion

3 The Performance Task: Classical AI Planning. Given: an initial state, goals, and a set of actions, e.g., {up, down, left, right}. Find: a sequence of actions that achieves the goals when executed in the initial state, e.g., down(4), right(3), up(2). [Figure: 8-puzzle initial and goal configurations]

4 Automated Planning Systems. Domain-independent planning systems: modular, sound, and complete. Domain-dependent planning systems: practical, efficient, and produce high-quality plans.

5 Domain-Independent Systems. State-space search (each search node is a valid world state), e.g., PRODIGY, FF. Plan-space search (each search node is a partially ordered plan): partial-order planners, e.g., SNLP, UCPOP. Graphplan-based search (a search node is a union of world states), e.g., STAN. Compilation to general search: satisfiability engines, e.g., SATPLAN; constraint satisfaction engines, e.g., CPLAN.

6 State-space vs. Plan-space Planning. [Figure: 8-puzzle example contrasting a state-space search tree (moves such as right(8), down(2), left(4), up(6)) with a partially ordered plan leading to END]

7 Partial-order Plan-space Planning. Partial-order planning is the process of removing flaws: unresolved goals, and pairs of unordered actions that cannot take place at the same time.

8 Partial-order Plan-space Planning. Decouple the order in which actions are added during planning from the order in which they appear in the final plan. [Figure: plan steps numbered by the order in which they were added during planning]

9 Learning to Improve Plan Quality for Partial-order Planners. How to represent plan quality information? Extended STRIPS operators plus a value function. How to identify learning opportunities, given that there are no planning failures or successes to learn from? Assume a better-quality model plan for the given problem is available, either from a domain expert or through a more extensive automated search of the problem's search space. What search features should the quality-improving search-control knowledge be based on?

10 The Logistics Transportation Domain. Initial state: at-object(parcel, postoffice), at-truck(truck1, postoffice), at-plane(plane1, airport). Goals: at-object(parcel, airport).

11 STRIPS encoding of the Logistics Transportation Domain. LOAD-TRUCK(Object, Truck, Location): Preconditions: {at-object(Object,Location), at-truck(Truck,Location)}; Effects: {in(Object,Truck), not(at-object(Object,Location))}. DRIVE-TRUCK(Truck, From, To): Preconditions: {at-truck(Truck,From), same-city(From,To)}; Effects: {at-truck(Truck,To), not(at-truck(Truck,From))}. UNLOAD-TRUCK(Object, Truck, Location): Preconditions: {in(Object,Truck), at-truck(Truck,Location)}; Effects: {at-object(Object,Location), not(in(Object,Truck))}.

12 PR-STRIPS (similar to PDDL 2.1 level 2). A state is described using propositional as well as metric attributes (which specify the levels of the resources in that state). An action can have propositional as well as metric effects (functions that specify the amount of each resource the action consumes). A value function specifies the relative importance of each resource and defines plan quality as a function of the amounts of resources consumed by all actions in the plan.

13 PR-STRIPS encoding of the Logistics Transportation Domain. LOAD-TRUCK(Object, Truck, Location): Preconditions: {at-object(Object,Location), at-truck(Truck,Location)}; Effects: {in(Object,Truck), not(at-object(Object,Location)), time(-0.5), money(-5)}. DRIVE-TRUCK(Truck, From, To): Preconditions: {at-truck(Truck,From)}; Effects: {at-truck(Truck,To), not(at-truck(Truck,From)), time(-0.02*distance(From,To)), money(-distance(From,To))}. UNLOAD-TRUCK(Object, Truck, Location): Preconditions: {in(Object,Truck), at-truck(Truck,Location)}; Effects: {at-object(Object,Location), not(in(Object,Truck)), time(-0.5), money(-5)}.

14 PR-STRIPS encoding of the Logistics Transportation Domain. LOAD-PLANE(Object, Plane, Location): Preconditions: {at-object(Object,Location), at-plane(Plane,Location)}; Effects: {in(Object,Plane), not(at-object(Object,Location)), time(-0.5), money(-5)}. FLY-PLANE(Plane, From, To): Preconditions: {at-plane(Plane,From), airport(To)}; Effects: {at-plane(Plane,To), not(at-plane(Plane,From)), time(-0.02*distance(From,To)), money(-distance(From,To))}. UNLOAD-PLANE(Object, Plane, Location): Preconditions: {in(Object,Plane), at-plane(Plane,Location)}; Effects: {at-object(Object,Location), not(in(Object,Plane)), time(-0.5), money(-5)}.

15 PR-STRIPS encoding of the Logistics Transportation Domain Quality(Plan) = 1/ (2*time-used(Plan) + 5*money-used(Plan))
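As a rough illustration of how the metric effects above feed the slide-15 value function, the following minimal Python sketch scores a one-parcel truck plan. The distance table, helper names, and per-action effect dictionaries are illustrative assumptions, not PIP's actual representation.

# Illustrative sketch: scoring a plan with the slide-15 value function,
# assuming each action's metric effects are recorded as resource deltas.
DISTANCE = {("lax", "sjc"): 250}  # hypothetical distance table

def drive_truck_effects(frm, to):
    d = DISTANCE[(frm, to)]
    return {"time": -0.02 * d, "money": -d}

def load_or_unload_effects():
    return {"time": -0.5, "money": -5}

def plan_quality(metric_effects):
    # metric_effects: one {resource: delta} dict per action in the plan
    time_used = -sum(e.get("time", 0) for e in metric_effects)
    money_used = -sum(e.get("money", 0) for e in metric_effects)
    return 1.0 / (2 * time_used + 5 * money_used)

# Truck plan for one parcel: load at lax, drive lax -> sjc, unload at sjc.
truck_plan = [load_or_unload_effects(),
              drive_truck_effects("lax", "sjc"),
              load_or_unload_effects()]
print(plan_quality(truck_plan))  # 1 / (2*6 + 5*260)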

16 The Learning Problem. Given: a planning problem (goals, initial state, and initial resource levels); domain knowledge (actions and plan-quality knowledge); a partial-order planner; and a model plan for the given problem. Find: domain-specific rules that the given planner can use to produce better-quality plans than the plans it would have produced had it not learned those rules.

17 Solution: The Intra-solution Learning Algorithm 1. Find a learning opportunity 2. Choose the relevant information and ignore the rest 3. Generalize the relevant information using a generalization theory

18 Phase 1: Find a Learning Opportunity. 1. Generate the system's default plan and a default planning trace using the given partial-order planner for the given problem. 2. Compare the default plan with the model plan; if the model plan is not of higher quality, go to Step 1. 3. Infer the planning decisions that produced the model plan. 4. Compare the inferred model planning trace with the default planning trace to identify the decision points where the two traces differ; these are the conflicting choice points.
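A minimal sketch of the trace comparison in Phase 1, assuming each trace can be aligned decision by decision (real partial-order traces are only partially ordered, so this is a simplification); the decision strings are illustrative:

# Walk two aligned planning traces and return the shared prefix plus the
# first pair of decisions where they diverge (a conflicting choice point).
def conflicting_choice_point(default_trace, model_trace):
    common = []
    for d_default, d_model in zip(default_trace, model_trace):
        if d_default == d_model:
            common.append(d_default)
        else:
            return common, (d_default, d_model)
    return common, None  # no divergence found

default_trace = ["add-actions:START-END", "add-action:unload-truck(o1,tr1,sjc)"]
model_trace = ["add-actions:START-END", "add-action:unload-plane(o1,p1,sjc)"]
print(conflicting_choice_point(default_trace, model_trace))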

19 [Figure: the model trace and the system's planning trace shown side by side, with their common nodes highlighted]

20 Phase 2: Choose the Relevant Information. 1. Examine the downstream planning traces, identifying relevant planning decisions using the following heuristics: (1) a planning decision to add an action Q is relevant if Q supplies a relevant condition to a relevant action; (2) a planning decision to establish an open condition is relevant if it binds an uninstantiated variable of a relevant open condition; (3) a planning decision to resolve a threat is relevant if all three actions involved are relevant.
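A small sketch of how the first heuristic could propagate relevance backwards from the decision at the conflicting choice point; the pair-based data model and the literal strings are assumptions for illustration only:

# Mark add-action decisions as relevant when the added action supplies a
# condition to an action that is already known to be relevant (heuristic 1).
def relevant_decisions(decisions, seed_action):
    # decisions: (added_action, consumer_action) pairs, in downstream order
    relevant = {seed_action}
    chosen = []
    for added, consumer in decisions:
        if consumer in relevant:
            relevant.add(added)
            chosen.append(added)
    return chosen

decisions = [
    ("fly-plane(p1,lax,sjc)", "unload-plane(o1,p1,sjc)"),
    ("load-plane(o1,p1,lax)", "unload-plane(o1,p1,sjc)"),
    ("load-plane(o2,p1,lax)", "unload-plane(o2,p1,sjc)"),  # not relevant here
]
print(relevant_decisions(decisions, "unload-plane(o1,p1,sjc)"))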

21 Phase 3: Generalize the Relevant Information. 1. Generalize the relevant information using a generalization theory: replace all constants with variables.
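A toy sketch of the constant-to-variable generalization step, assuming the relevant decisions are recorded as ground literals; the variable-naming scheme is illustrative:

# Replace every constant with a variable, reusing the same variable for
# repeated constants so that shared objects stay linked across literals.
def generalize(literals):
    mapping = {}
    generalized = []
    for lit in literals:
        name, args = lit.rstrip(")").split("(")
        new_args = []
        for a in (x.strip() for x in args.split(",")):
            if a not in mapping:
                mapping[a] = "V" + str(len(mapping) + 1)
            new_args.append(mapping[a])
        generalized.append(name + "(" + ", ".join(new_args) + ")")
    return generalized

print(generalize(["load-plane(o1, p1, lax)",
                  "fly-plane(p1, lax, sjc)",
                  "unload-plane(o1, p1, sjc)"]))
# ['load-plane(V1, V2, V3)', 'fly-plane(V2, V3, V4)', 'unload-plane(V1, V2, V4)']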

22 An Example Logistics Problem Initial-state: {at-object(o1, lax), at-object(o2, lax), at-truck(tr1, lax), at-plane(p1, lax), airport(sjc), distance(lax, sjc)=250, time=0, money=500} Goals:{at-object(o1, sjc), at-object(o2, sjc)}

23 Generate the System's Default Plan and Default Planning Trace. Use the given planner to generate the system's default planning trace (an ordered constraint set). Each add-step/establishment decision adds a causal link and an ordering constraint; each threat-resolution decision adds an ordering constraint. 1. START < END; 2. unload-truck() < END, unload-truck(o1,Tr,sjc) --at-object(o1,sjc)--> END; 3. load-truck() < unload-truck(), load-truck(o1,Tr,sjc) --in-truck(o1,Tr)--> unload-truck(o1,Tr,sjc); 4. drive-truck() < unload-truck(), drive-truck(Tr,X,sjc) --at-truck(Tr,sjc)--> unload-truck(o1,Tr,sjc); 5. …

24 Compare the System's Default Plan with the Model Plan. System's default plan: load-truck(o1,tr1,lax), load-truck(o2,tr1,lax), drive-truck(tr1,lax,sjc), unload-truck(o1,tr1,sjc), unload-truck(o2,tr1,sjc). Model plan: load-plane(o1,p1,lax), load-plane(o2,p1,lax), fly-plane(p1,lax,sjc), unload-plane(o1,p1,sjc), unload-plane(o2,p1,sjc).

25 Infer the Unordered Model Constraint Set. unload-plane(o1,p1,sjc) --at-object(o1,sjc)--> END; load-plane(o1,p1,lax) --in(o1,p1)--> unload-plane(o1,p1,sjc); fly-plane(p1,lax,sjc) --at-plane(p1,sjc)--> unload-plane(o1,p1,sjc); START --at-plane(p1,lax)--> load-plane(o1,p1,lax); START --at-plane(p1,lax)--> fly-plane(p1,lax,sjc); START --at-object(o1,lax)--> load-plane(o1,p1,lax); unload-plane(o2,p1,sjc) --at-object(o2,sjc)--> END; load-plane(o2,p1,lax) --in(o2,p1)--> unload-plane(o2,p1,sjc); fly-plane(p1,lax,sjc) --at-plane(p1,sjc)--> unload-plane(o2,p1,sjc); START --at-plane(p1,lax)--> load-plane(o2,p1,lax); START --at-object(o2,lax)--> load-plane(o2,p1,lax).

26 Compare the Two Planning Traces to Identify Learning Opportunities. [Figure: both traces begin with START < END and the open condition at-object(o1,sjc) on END; the default trace resolves it with unload-truck(o1,tr1,sjc) --at-object(o1,sjc)--> END, while the model trace resolves it with unload-plane(o1,p1,sjc) --at-object(o1,sjc)--> END. This divergence is a learning opportunity.]

27 Choose the Relevant Planning Decisions. [Figure: decision tree rooted at add-actions:START-END. The model branch contains add-action:unload-plane(o1), add-action:fly-plane(), add-action:load-plane(o1), add-action:unload-plane(o2), and add-action:load-plane(o2); the default branch contains add-actions:unload-truck(o1), add-action:drive-truck(), add-actions:load-truck(o1), and add-actions:load-truck(o2). The diverging choice is marked as the learning opportunity, and the downstream decisions are labelled relevant or irrelevant.]

28 Generalize the Relevant Planning Decision Chains. [Figure: the generalized chains rooted at add-actions:START-END; on the model side, add-action:unload-plane(O, T), add-action:fly-plane(T,X,Y), add-action:load-plane(O, T); on the default side, add-actions:unload-truck(O, P), add-action:drive-truck(P,X,Y), add-actions:load-truck(O, P).]

29 In What Form Should the Learned Knowledge be Stored? Rewrite rule: to-be-replaced actions {load-truck(O,T,X), drive-truck(T,X,Y), unload-truck(O,T,Y)}; replacing actions {load-plane(O,P,X), fly-plane(P,X,Y), unload-plane(O,P,Y)}. Search-control rule: given the goals {at-object(O,Y)} to resolve, the effects {at-truck(T,X), at-plane(P,X), airport(Y)}, and distance(X,Y) > 100, prefer the planning decisions {add-step(unload-plane(O,P,Y)), add-step(load-plane(O,P,X)), add-step(fly-plane(P,X,Y))} over the planning decisions {add-step(unload-truck(O,T,Y)), add-step(load-truck(O,T,X)), add-step(drive-truck(T,X,Y))}.
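One way to picture the two candidate rule forms as data structures; this is an illustrative sketch of an assumed representation, not PIP's internal format:

from dataclasses import dataclass

@dataclass
class RewriteRule:
    to_be_replaced: list  # subplan to remove
    replacing: list       # subplan to splice in

@dataclass
class SearchControlRule:
    goals: list                # open conditions that trigger the rule
    conditions: list           # required state conditions, e.g. distance(X,Y) > 100
    preferred_decisions: list  # decisions to prefer at the choice point
    rejected_decisions: list   # decisions they are preferred over

truck_to_plane = RewriteRule(
    to_be_replaced=["load-truck(O,T,X)", "drive-truck(T,X,Y)", "unload-truck(O,T,Y)"],
    replacing=["load-plane(O,P,X)", "fly-plane(P,X,Y)", "unload-plane(O,P,Y)"],
)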

30 Search Control Knowledge. A heuristic function that provides an estimate of the quality of the plan a node is expected to lead to. [Figure: search tree with a node n below the root and child nodes annotated quality=8, quality=4, quality=2]

31 Rewrite Rules. A rewrite rule is a 2-tuple (to-be-replaced-subplan, replacing-subplan). Used after search has produced a complete plan, to rewrite it into a higher-quality plan. Only useful in those domains where it is possible to efficiently produce a low-quality plan but hard to produce a higher-quality plan. E.g., to-be-replaced subplan: A4, A5; replacing subplan: B1.
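A minimal sketch of applying such a rule after search; it assumes the completed plan is totally ordered and the to-be-replaced subplan appears contiguously, which is a simplification of matching a partially ordered subplan:

# Scan a completed plan for the rule's to-be-replaced subplan and splice in
# the replacing subplan; return the plan unchanged if the rule does not apply.
def apply_rewrite(plan, to_be_replaced, replacing):
    n = len(to_be_replaced)
    for i in range(len(plan) - n + 1):
        if plan[i:i + n] == to_be_replaced:
            return plan[:i] + replacing + plan[i + n:]
    return plan

plan = ["A1", "A2", "A3", "A4", "A5", "A6"]
print(apply_rewrite(plan, ["A4", "A5"], ["B1"]))  # ['A1', 'A2', 'A3', 'B1', 'A6']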

32 Planning by Rewriting. [Figure: a complete plan A1 A2 A3 A4 A5 A6 in which the subplan A4, A5 is rewritten into B1]

33 Empirical Evaluation I: In What Form Should the Learned Knowledge be Stored? Perform empirical experiments to compare the performance of a version of PIP that learns search-control rules (Sys-search-control) with versions that learn rewrite rules (Sys-rewrite). Both rewrite variants, Sys-rewrite-first and Sys-rewrite-best, perform up to two rewritings. At each rewriting, Sys-rewrite-first randomly chooses one of the applicable rewrite rules, while Sys-rewrite-best applies all applicable rewrite rules to try all ways of rewriting a plan.

34 Experimental Set-up. Three benchmark planning domains: logistics, softbot, and process planning. Randomly generate 120 unique problem instances. Train Sys-search-control and Sys-rewrite on optimal-quality solutions for 20, 30, 40, and 60 examples and test them on the remaining examples (cross-validation). Plan quality is one minus the average distance of the plans generated by a system from the optimal-quality plans. Planning efficiency is measured by counting the average number of new nodes generated by each system.
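For concreteness, a sketch of the plan-quality metric as it might be computed, under the assumption that a plan's distance from optimal is its relative quality gap (the slide does not spell out the exact distance measure):

# Average plan quality = 1 - mean relative distance from the optimal plans.
def average_plan_quality(generated_qualities, optimal_qualities):
    distances = [(opt - gen) / opt
                 for gen, opt in zip(generated_qualities, optimal_qualities)]
    return 1.0 - sum(distances) / len(distances)

print(average_plan_quality([0.8, 1.0, 0.5], [1.0, 1.0, 1.0]))  # ~0.767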

35 Results. [Figure: learning curves for the Softbot, Logistics, and Process Planning domains]

36 Conclusion I. Both search-control and rewrite rules lead to improvements in plan quality. Rewrite rules have a larger cost in terms of the loss of planning efficiency than search-control rules. Need a mechanism to distinguish good rules from bad rules and to forget the bad rules. Comparing planning traces seems to be a better technique for learning search-control rules than rewrite rules. Need to explore alternative strategies for learning rewrite rules: by comparing two completed plans of different quality, or through static domain analysis.

37 Empirical Evaluation II: A Study of the Factors Affecting PIP's Learning Performance. Generated 25 abstract domains varying along a number of seemingly relevant dimensions: instance similarity; quality branching factor (the average number of different-quality solutions per problem); and the association between the default planning bias and the quality bias. Are there any statistically significant differences in PIP's performance as each factor is varied (Student's t-test)?

38 Results. PIP's learning leads to greater improvements in domains where the quality branching factor is large and the planner's default biases are negatively correlated with the quality-improving heuristic function. There is no simple relationship between instance similarity and PIP's learning performance.

39 Conclusion II. Need to address scale-up issues and to keep up with advances in AI planning technologies: it is arguably more difficult to accelerate a new-generation planner by outfitting it with learning, as the overhead cost of the learning system can overwhelm the gains in search efficiency (Kambhampati 2001). The problem is not the lack of a well-defined task! Organize a symposium/special issue on how to efficiently organize, retrieve, and forget learned knowledge. Open-source planning and learning software?

