1
Automatic Induction of MAXQ Hierarchies
Neville Mehta, Michael Wynkoop, Soumya Ray, Prasad Tadepalli, Tom Dietterich
School of EECS, Oregon State University
Funded by the DARPA Transfer Learning Program
2
Hierarchical Reinforcement Learning
- Exploits domain structure to facilitate learning:
  - Policy constraints
  - State abstraction
- Paradigms: Options, HAMs, MAXQ
- MAXQ task hierarchy (a minimal data-structure sketch follows):
  - Directed acyclic graph of subtasks
  - Leaves are the primitive MDP actions
- Traditionally, task structure is provided as prior knowledge to the learning agent
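For concreteness, a MAXQ task hierarchy of this shape could be represented as below. This is a minimal sketch under my own assumptions; the `Task` class and the Wargus-flavored example nodes are illustrative, not the MAXQ reference implementation.

```python
# Illustrative sketch of a MAXQ task hierarchy as a DAG: composite tasks list
# child tasks, and the leaves are the primitive MDP actions.
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Task:
    name: str
    children: List["Task"] = field(default_factory=list)   # empty list => primitive action
    terminated: Optional[Callable[[dict], bool]] = None    # termination predicate over the state
    abstraction: List[str] = field(default_factory=list)   # state variables this task observes

    @property
    def is_primitive(self) -> bool:
        return not self.children

# A child may appear under several parents, which is what makes this a DAG
# rather than a tree (e.g. a shared navigation subtask).  Hypothetical example:
goto = Task("Goto")
get_gold = Task("GetGold", children=[goto, Task("MineGold")])
put_gold = Task("PutGold", children=[goto, Task("Deposit")])
root = Task("Root", children=[get_gold, put_gold])
```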
3
Model Representation
- Dynamic Bayesian Networks (DBNs) for the transition and reward models
- Symbolic representation of the conditional probabilities/reward values as decision trees (a toy encoding is sketched below)
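The following is a minimal sketch of such a representation, assuming simple `Leaf`, `DecisionNode`, and `ActionModel` containers; the class names and the toy goldmine query are illustrative, not the authors' implementation.

```python
# Illustrative sketch: a DBN action model whose conditional probability tables
# are stored as decision trees over state variables.
from dataclasses import dataclass
from typing import Dict, Union

@dataclass
class Leaf:
    distribution: Dict[object, float]     # P(next value) at this leaf

@dataclass
class DecisionNode:
    variable: str                         # state variable tested at this node
    children: Dict[object, "Tree"]        # one subtree per tested value

Tree = Union[Leaf, DecisionNode]

@dataclass
class ActionModel:
    transition: Dict[str, Tree]           # one tree per next-state variable
    reward: Tree                          # one tree for the reward model

def query(tree: Tree, state: Dict[str, object]) -> Dict[object, float]:
    """Walk a decision tree using the current state; return the leaf distribution."""
    while isinstance(tree, DecisionNode):
        tree = tree.children[state[tree.variable]]
    return tree.distribution

# Toy example (made-up numbers): next value of agent.resource for a MineGold-like action.
resource_tree = DecisionNode("region.goldmine", {
    1: Leaf({"gold": 0.9, "none": 0.1}),
    0: Leaf({"none": 1.0}),
})
print(query(resource_tree, {"region.goldmine": 1}))   # {'gold': 0.9, 'none': 0.1}
```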
4
Goal: Learn Task Hierarchies
- Avoid the significant manual engineering of task decomposition, which requires a deep understanding of the purpose and function of subroutines, as in computer science
- Frameworks for learning exit-option hierarchies:
  - HEXQ: determines exit states through random exploration
  - VISA: determines exit states by analyzing DBN action models
5
Focused Creation of Subtasks
- HEXQ & VISA create a separate subtask for each possible exit state, which can generate a large number of subtasks
- Claim: defining good subtasks requires maximizing state abstraction while identifying "useful" subgoals
- Our approach: selectively define subtasks with single abstract exit states
6
Transfer Learning Scenario
Working hypotheses:
- MAXQ value-function learning is much quicker than non-hierarchical (flat) Q-learning
- Hierarchical structure is more amenable to transfer from source tasks to the target than value functions
Transfer scenario:
- Solve a "source problem" (no CPU time limit)
- Learn DBN models
- Learn the MAXQ hierarchy
- Solve a "target problem" under the assumption that the same hierarchical structure applies (we will relax this constraint in future work)
7
MaxNode State Abstraction
- Y is irrelevant within this action: it affects the dynamics but not the reward function (a rough relevance check is sketched below)
- In HEXQ, VISA, and our work, we assume there is only one terminal abstract state, hence no pseudo-reward is needed
- As a side effect, this enables "funnel" abstractions in parent tasks
[Figure: two-slice DBN with nodes X_t, Y_t, A_t, X_{t+1}, Y_{t+1}, and R_{t+1}; Y feeds the dynamics but not the reward]
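One way to make this irrelevance test concrete is to walk a single action's DBN backwards from the reward node, so that a variable like Y that only drives dynamics is reported as irrelevant. This is my own rough reading, not the paper's algorithm, and the dictionary encoding of parent sets is an assumption.

```python
# Hedged sketch: which state variables are relevant to one action's reward?
# parents[v] = variables tested by the decision tree for v's next value;
# reward_parents = variables tested by the reward tree.
from typing import Dict, Set

def reward_relevant(parents: Dict[str, Set[str]], reward_parents: Set[str]) -> Set[str]:
    relevant, frontier = set(), set(reward_parents)
    while frontier:
        v = frontier.pop()
        if v in relevant:
            continue
        relevant.add(v)
        frontier |= parents.get(v, set())   # follow the causal parents backwards
    return relevant

# Toy model from the slide: X drives the reward, Y only drives its own dynamics.
parents = {"X": {"X"}, "Y": {"X", "Y"}}
print(reward_relevant(parents, {"X"}))      # {'X'}  -> Y is irrelevant to this action
```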
8
Our Approach: AI-MAXQ
1. Learn DBN action models via random exploration (other work)
2. Apply Q-learning to solve the source problem
3. Generate a good trajectory from the learned Q function (sketched below)
4. Analyze the trajectory to produce the CAT
5. Analyze the CAT to define the MAXQ hierarchy (this talk)
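As a concrete illustration of step 3, here is a minimal sketch assuming a tabular Q function and a toy environment interface; `env.reset()` and `env.step()` are hypothetical stand-ins, not part of the paper.

```python
# Sketch: roll out the greedy policy of a learned tabular Q function to obtain
# the single demonstration trajectory that the CAT analysis consumes.
from typing import Dict, Hashable, List, Tuple

def greedy_trajectory(env, Q: Dict[Tuple[Hashable, Hashable], float],
                      actions: List[Hashable], max_steps: int = 1000):
    """Return ([(state, action), ...], final_state) for the greedy rollout."""
    s, trajectory = env.reset(), []
    for _ in range(max_steps):
        best = max(actions, key=lambda a: Q.get((s, a), float("-inf")))
        trajectory.append((s, best))
        s, done = env.step(best)            # assumed interface: (next_state, done)
        if done:
            break
    return trajectory, s
```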
9
Wargus Resource-Gathering Domain
10
Causally Annotated Trajectory (CAT)
[Figure: CAT over the trajectory Start → Goto → MG → Goto → Dep → Goto → CW → Goto → Dep → End, with arcs labeled by variables such as a.l, a.r, a.*, reg.*, req.gold, and req.wood]
- A variable v is relevant to an action if the DBN for that action tests or changes that variable (this includes both the variable nodes and the reward nodes)
- Create an arc from A to B labeled with variable v iff v is relevant to A and B but not to any intermediate action (see the sketch below)
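A small sketch of this arc-creation rule under assumed data structures: actions are indexed by trajectory position, and each has a set of relevant variables. The variable names in the example are made up for illustration.

```python
# Connect action A to a later action B with label v exactly when v is relevant
# to both A and B and to no action in between (i.e. B is the next action for
# which v is relevant).
from typing import Dict, List, Set, Tuple

def build_cat(actions: List[str], relevant: Dict[int, Set[str]]) -> List[Tuple[int, int, str]]:
    """actions[i] is the i-th action (including Start/End pseudo-actions);
    relevant[i] is the set of variables its DBN tests or changes."""
    arcs = []
    for i in range(len(actions)):
        for v in relevant[i]:
            for j in range(i + 1, len(actions)):
                if v in relevant[j]:        # first later action relevant to v
                    arcs.append((i, j, v))
                    break
    return arcs

# Tiny illustration with made-up relevance sets:
trajectory = ["Start", "Goto", "MineGold", "Goto", "Deposit", "End"]
relevant = {0: {"a.loc", "a.res"}, 1: {"a.loc"}, 2: {"a.loc", "a.res"},
            3: {"a.loc"}, 4: {"a.loc", "a.res", "req.gold"}, 5: {"req.gold"}}
for src, dst, var in build_cat(trajectory, relevant):
    print(f"{trajectory[src]} --{var}--> {trajectory[dst]}")
```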
11
CAT Scan
[Figure: the CAT from the previous slide]
An action is absorbed regressively into a trajectory segment as long as:
- It does not have an effect beyond the trajectory segment (preventing exogenous effects)
- It does not increase the state abstraction
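Continuing the previous sketch, a hedged reading of this absorption test: an action folds into the current segment only if none of its CAT arcs jump past the segment and its relevant variables do not enlarge the segment's abstraction. The function signature is an assumption, not the paper's code.

```python
# Sketch of the regressive absorption check over CAT arcs (i, j, variable).
from typing import List, Set, Tuple

def can_absorb(k: int, end: int, arcs: List[Tuple[int, int, str]],
               segment_vars: Set[str], relevant_k: Set[str]) -> bool:
    """Can action k be absorbed into the segment ending at index `end`?"""
    no_exogenous_effect = all(dst <= end for src, dst, _ in arcs if src == k)
    keeps_abstraction = relevant_k <= segment_vars   # no new state variables needed
    return no_exogenous_effect and keeps_abstraction
```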
12
CAT Scan
[Figure: the CAT (Start → Goto → MG → Goto → Dep → Goto → CW → Goto → Dep → End) with the scan in progress]
13
CAT Scan
[Figure: the same CAT with the Root task spanning the whole trajectory]
14
CAT Scan
[Figure: the same CAT segmented into Harvest Wood and Harvest Gold subtasks under Root]
15
Induced Wargus Hierarchy
[Figure: induced task hierarchy]
  Root
    Harvest Gold
      Get Gold: GGoto(goldmine), Mine Gold
      Put Gold: GGoto(townhall), GDeposit
    Harvest Wood
      Get Wood: WGoto(forest), Chop Wood
      Put Wood: WGoto(townhall), WDeposit
  (the GGoto/WGoto subtasks all invoke the shared Goto(loc) primitive)
16
Induced Abstraction & Termination

Task Name        | State Abstraction                          | Termination Condition
Root             | req.gold, req.wood                         | req.gold = 1 && req.wood = 1
Harvest Gold     | req.gold, agent.resource, region.townhall  | req.gold = 1
Get Gold         | agent.resource, region.goldmine            | agent.resource = gold
Put Gold         | req.gold, agent.resource, region.townhall  | agent.resource = 0
GGoto(goldmine)  | agent.x, agent.y                           | agent.resource = 0 && region.goldmine = 1
GGoto(townhall)  | agent.x, agent.y                           | req.gold = 0 && agent.resource = gold && region.townhall = 1
Harvest Wood     | req.wood, agent.resource, region.townhall  | req.wood = 1
Get Wood         | agent.resource, region.forest              | agent.resource = wood
Put Wood         | req.wood, agent.resource, region.townhall  | agent.resource = 0
WGoto(forest)    | agent.x, agent.y                           | agent.resource = 0 && region.forest = 1
WGoto(townhall)  | agent.x, agent.y                           | req.wood = 0 && agent.resource = wood && region.townhall = 1
Mine Gold        | agent.resource, region.goldmine            | NA
Chop Wood        | agent.resource, region.forest              | NA
GDeposit         | req.gold, agent.resource, region.townhall  | NA
WDeposit         | req.wood, agent.resource, region.townhall  | NA
Goto(loc)        | agent.x, agent.y                           | NA

Note that because each subtask has a unique terminal state, Result Distribution Irrelevance applies.
17
Claims
- The resulting hierarchy is unique: it does not depend on the order in which goals and trajectory sequences are analyzed
- All state abstractions are safe: there exists a hierarchical policy within the induced hierarchy that will reproduce the observed trajectory (this extends MAXQ Node Irrelevance to the induced structure)
- The learned hierarchical structure is "locally optimal": no local change in the trajectory segmentation can improve the state abstractions (a very weak guarantee)
18
Experimental Setup
- Randomly generate pairs of source/target resource-gathering maps in Wargus
- Learn the optimal policy in the source
- Induce a task hierarchy from a single (near-)optimal trajectory
- Transfer this hierarchical structure to the MAXQ value-function learner for the target
- Compare to direct Q-learning and to MAXQ learning on a manually engineered hierarchy within the target
19
Hand-Built Wargus Hierarchy
[Figure: hand-built task hierarchy over Root, Get Gold, Get Wood, GWDeposit, Mine Gold, Chop Wood, Deposit, and the shared Goto(loc) primitive]
20
Hand-Built Abstractions & Terminations

Task Name     | State Abstraction                                    | Termination Condition
Root          | req.gold, req.wood, agent.resource                   | req.gold = 1 && req.wood = 1
Harvest Gold  | agent.resource, region.goldmine                      | agent.resource ≠ 0
Harvest Wood  | agent.resource, region.forest                        | agent.resource ≠ 0
GWDeposit     | req.gold, req.wood, agent.resource, region.townhall  | agent.resource = 0
Mine Gold     | region.goldmine                                      | NA
Chop Wood     | region.forest                                        | NA
Deposit       | req.gold, req.wood, agent.resource, region.townhall  | NA
Goto(loc)     | agent.x, agent.y                                     | NA
21
Results: Wargus
22
Need for Demonstrations
- VISA uses only DBNs for causal information, which is globally applicable across the state space without focusing on the pertinent subspace
- Problems:
  - Global variable coupling might prevent concise abstraction
  - Exit states can grow exponentially: one for each path in the decision-tree encoding
- A modified bitflip domain exposes these shortcomings
23
Modified Bitflip Domain
- State space: bits b_0, …, b_{n-1}
- Action space (see the executable sketch below):
  - Flip(i), 0 ≤ i < n-1:
    - If b_0 … b_{i-1} = 1 then b_i ← ~b_i
    - Else b_0 ← 0, …, b_i ← 0
  - Flip(n-1):
    - If parity(b_0, …, b_{n-2}) ∧ b_{n-2} = 1 then b_{n-1} ← ~b_{n-1}
    - Else b_0 ← 0, …, b_{n-1} ← 0
    - parity(…) is even parity if n-1 is even, odd parity otherwise
- Reward: -1 for all actions
- Terminal/goal state: b_0 … b_{n-1} = 1
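The dynamics above are small enough to execute directly; the following sketch is my reading of the slide, with hypothetical helper names.

```python
# Executable sketch of the modified bitflip dynamics described above.
from typing import List

def flip(state: List[int], i: int) -> List[int]:
    """Apply Flip(i) to a copy of `state` (bits b_0 .. b_{n-1}) and return it."""
    n = len(state)
    s = state[:]
    if i < n - 1:
        if all(s[:i]):                      # b_0 ... b_{i-1} all set (vacuously true for i = 0)
            s[i] ^= 1
        else:
            s[: i + 1] = [0] * (i + 1)      # reset b_0 .. b_i
    else:
        even_target = (n - 1) % 2 == 0      # even parity required iff n-1 is even
        parity_ok = (sum(s[: n - 1]) % 2 == 0) == even_target
        if parity_ok and s[n - 2] == 1:
            s[n - 1] ^= 1
        else:
            s = [0] * n                     # reset every bit
    return s

def is_goal(state: List[int]) -> bool:
    return all(state)                       # b_0 ... b_{n-1} = 1; every action costs -1

print(flip([1, 1, 1, 0, 0, 0, 0], 3))       # -> [1, 1, 1, 1, 0, 0, 0]
```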
24
Modified Bitflip Domain
[Figure: example transitions in a 7-bit instance: 1110000 --Flip(3)--> 1111000 --Flip(1)--> 1011000 --Flip(4)--> 0000000]
25
VISA’s Causal Graph Variables grouped into two strongly connected components (dashed ellipses) Variables grouped into two strongly connected components (dashed ellipses) Both components affect the reward node Both components affect the reward node b0b0 b1b1 Flip(1) Flip(2) b n-2 b n-1 Flip(n-1) b2b2 Flip(2) R Flip(n-1) Flip(3) Flip(n-1) Flip(3) Flip(2) Flip(n-2)
26
VISA Task Hierarchy
[Figure: VISA's hierarchy with Root over Flip(0), Flip(1), and Flip(n-1); achieving the exit condition parity(b_0, …, b_{n-2}) ∧ b_{n-2} = 1 requires 2^(n-3) exit options]
27
Bitflip CAT
[Figure: CAT for the trajectory Start → Flip(0) → Flip(1) → … → Flip(n-2) → Flip(n-1) → End, with arcs labeled by the bits b_0, b_1, …, b_{n-2}, b_{n-1} and by the variable sets b_0,…,b_{n-2} and b_0,…,b_{n-1}]
28
Induced MAXQ Task Hierarchy
[Figure: nested hierarchy in which Root invokes Flip(n-1) and a subtask with goal b_0 … b_{n-2} = 1; that subtask invokes Flip(n-2) and a subtask with goal b_0 … b_{n-3} = 1, and so on down to a subtask with goal b_0 b_1 = 1 over Flip(1) and Flip(0)]
29
Results: Bitflip
30
Conclusion
- Causality analysis is the key to our approach: it enables us to find concise subtask definitions from a demonstration
- The CAT scan is easy to perform
- Need to extend to learning from multiple demonstrations and disjunctive goals