Knowledge Representation Meets Stochastic Planning
Bob Givan (joint work with Alan Fern and SungWook Yoon)
Electrical and Computer Engineering, Purdue University
Dagstuhl, May 12-16, 2003

Overview
- We present a form of approximate policy iteration designed specifically for large relational MDPs.
- We describe a novel application that views entire planning domains as MDPs
  - we automatically induce domain-specific planners
- Induced planners are state-of-the-art on:
  - deterministic planning benchmarks
  - stochastic variants of planning benchmarks

Ideas from Two Communities
- Traditional planning: induction of control knowledge, planning heuristics
- Decision-theoretic planning: policy rollout, approximate policy iteration (API)
Two views of the new technique:
- iterative improvement of control knowledge
- API with a policy-space bias

Planning Problems
- States: first-order interpretations of a particular language
- A planning problem gives:
  - a current state
  - a goal state/region
  - a list of actions and their semantics (may be stochastic)
- Blocks-world example: a current state, a goal state, and available actions Pickup(x), PutDown(y)

Planning Domains
- A planning domain is a distribution over problems sharing one set of actions (but with different object domains and sizes).
- Example: the Blocks World domain, with available actions Pickup(x), PutDown(y).

Control Knowledge
- Traditional planners solve problems, not domains.
  - little or no generalization between problems in a domain
- Planning domains are "solved" by control knowledge
  - pruning some actions, typically eliminating search
  - e.g., "don't pick up a solved block"

Recent Control Knowledge Research
- Human-written control knowledge often eliminates search
  - [Bacchus & Kabanza, 1996] TLPlan
- Helpful control knowledge can be learned from "small problems"
  - [Khardon, 1996 & 1999] learning Horn-clause action strategies
  - [Huang, Selman & Kautz, 2000] learning action-selection and action-rejection rules
  - [Martin & Geffner, 2000] learning generalized policies in concept languages
  - [Yoon, Fern & Givan, 2002] inductive policy selection for stochastic planning domains

Unsolved Problems
- Finding control knowledge without immediate access to small problems
  - Can we learn directly in a large domain?
- Improving buggy control knowledge
  - All previous techniques produce unreliable control knowledge, with occasional fatal flaws.
- Our approach: view control knowledge as an MDP policy and apply policy improvement.
  - A policy is a choice of action for each MDP state.

Planning Domains as MDPs
- View the domain as one big state space, with each state a planning problem.
- This view facilitates generalization between problems.
- Blocks World example: available actions Pickup(x), PutDown(y); an action such as Pickup(Purple) moves from one problem state to another.


Policy Iteration
- Given a policy π and a state s, can we improve π(s)?
- Let o = π(s), let b be an alternative action, let s_1,…,s_k be the possible successor states under o, and t_1,…,t_n those under b. Then
  V^π(s) = Q^π(s,o) = R_o + γ E_{s'∈{s_1,…,s_k}}[V^π(s')]
  Q^π(s,b) = R_b + γ E_{s'∈{t_1,…,t_n}}[V^π(s')]
- If V^π(s) < Q^π(s,b), then π(s) can be improved to b.
- Policy improvement makes such improvements at all states at once, turning the base policy into an improved policy.
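To make the improvement test concrete, here is a minimal tabular policy-iteration sketch in Python (NumPy); the array layout for P and R, the discount value, and the iteration counts are illustrative assumptions, since the talk's relational MDPs are far too large for such tables.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.95, iters=500):
    """Iteratively compute V^pi for a tabular MDP.
    P[a, s, t] = transition probability from s to t under a; R[s, a] = reward;
    policy is an array of action indices, one per state."""
    n_states = P.shape[1]
    V = np.zeros(n_states)
    for _ in range(iters):
        # V(s) = R(s, pi(s)) + gamma * sum_t P(t | s, pi(s)) * V(t)
        V = np.array([R[s, policy[s]] + gamma * P[policy[s], s] @ V
                      for s in range(n_states)])
    return V

def improve_policy(P, R, policy, gamma=0.95):
    """One policy-improvement step: switch pi(s) to b wherever Q^pi(s, b) > V^pi(s)."""
    V = policy_evaluation(P, R, policy, gamma)
    n_actions, n_states, _ = P.shape
    Q = np.array([[R[s, b] + gamma * P[b, s] @ V for b in range(n_actions)]
                  for s in range(n_states)])
    return Q.argmax(axis=1)   # the improved policy pi'
```

Policy iteration alternates these two steps until the policy stops changing; the rest of the talk replaces this exact, all-states computation with sampling.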

Flowchart View of Policy Iteration
Current policy π → compute V^π at all states → compute Q^π for each action at all states → choose the best action at each state → improved policy π'.
Problem: too many states.

Flowchart View of Policy Rollout
Policy rollout performs the same computation by simulation, at a single state s:
- compute V^π(s') at sampled successor states s' by drawing trajectories under π from s';
- compute Q^π(s,a) for each action a at s by sampling successors s' from s_1,…,s_k and using the V^π(s') estimates;
- choose the best action at s, giving the improved action π'(s) at s.
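A minimal Monte Carlo sketch of this rollout computation; the simulator interface (simulator.step returning (next_state, reward, done) and simulator.actions) and the sampling parameters are assumptions for illustration, not the talk's implementation.

```python
def estimate_value(simulator, policy, state, gamma=0.95, horizon=50, n_traj=10):
    """Estimate V^pi(state) by averaging discounted returns of trajectories under pi."""
    total = 0.0
    for _ in range(n_traj):
        s, discount, ret = state, 1.0, 0.0
        for _ in range(horizon):
            a = policy(s)
            s, r, done = simulator.step(s, a)     # sample one stochastic transition
            ret += discount * r
            discount *= gamma
            if done:
                break
        total += ret
    return total / n_traj

def rollout_action(simulator, policy, state, gamma=0.95, n_samples=10):
    """Policy rollout at a single state: return the action with the best sampled Q^pi."""
    def q_estimate(a):
        q = 0.0
        for _ in range(n_samples):
            s2, r, done = simulator.step(state, a)   # sample a successor of `state` under a
            q += r if done else r + gamma * estimate_value(simulator, policy, s2, gamma)
        return q / n_samples
    return max(simulator.actions(state), key=q_estimate)
```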

Approximate Policy Iteration
Idea: use machine learning to control the number of samples needed.
- Run the rollout computation (estimate V^π at successor states, estimate Q^π for each action, choose the best action) at sampled states s.
- Draw a training set of pairs (s, π'(s)).
- Learn a policy from the training set.
- Repeat.
Refinement: use pairs (s, Q^π(s,·)) to define misclassification costs.
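A sketch of this outer loop, reusing the hypothetical rollout_action and simulator interface above; the scikit-learn decision-tree classifier and the featurize function are stand-ins for the talk's policy-language learner, which later slides describe.

```python
from sklearn.tree import DecisionTreeClassifier

def approximate_policy_iteration(simulator, featurize, initial_policy,
                                 n_train_states=200, n_iterations=5):
    """API: repeatedly relabel sampled states by rollout and retrain a classifier-policy."""
    policy = initial_policy
    for _ in range(n_iterations):
        X, y = [], []
        for _ in range(n_train_states):
            s = simulator.sample_state()                    # draw a training state
            X.append(featurize(s))
            y.append(rollout_action(simulator, policy, s))  # improved action pi'(s)
        clf = DecisionTreeClassifier().fit(X, y)            # learn the improved policy
        policy = lambda s, clf=clf: clf.predict([featurize(s)])[0]
    return policy
```

In the talk, the classifier step is replaced by learning in a relational policy language, as the following slides explain.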

Challenge Problem
Consider the following stochastic blocks-world problem:
- Goal: Clear(A)
- Assume: block color affects pickup() success.
- The optimal policy is compact, but the value function is not: a state's value depends on the set of colors above A.

Policy for the Example Problem
A compact policy for this problem:
1. If holding a block, put it down on the table; otherwise
2. pick up a clear block above A.
How can we formalize this policy?

Action Selection Rules [Martin & Geffner, KR 2000]
"Pick up a clear block above block A…"
- Action-selection rules are based on classes of objects: apply action a to an object in class C (if possible), abbreviated C : a.
- How can we describe the object classes?

Formal Policy for the Example Problem
A decision list, in English and in taxonomic syntax:
1. "blocks being held" : putdown, i.e., holding : putdown
2. "clear blocks above block A" : pickup, i.e., clear ∩ (on* A) : pickup
We find this policy with a heuristic search guided by the training data. (A sketch of evaluating such a decision list follows below.)
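A minimal sketch of how such a taxonomic decision list could be evaluated on a relational blocks-world state; the state representation (sets for holding, clear, on) and the helper names are hypothetical illustrations, not the authors' system.

```python
def on_star(state, target):
    """Objects in the class (on* target): reflexive-transitive closure of the on relation."""
    above = {target}
    changed = True
    while changed:
        changed = False
        for x, y in state["on"]:              # x is directly on y
            if y in above and x not in above:
                above.add(x)
                changed = True
    return above

def example_policy(state):
    """Decision list: 1) holding : putdown   2) clear ∩ (on* A) : pickup"""
    if state["holding"]:                                    # rule 1 fires
        return ("putdown", next(iter(state["holding"])))
    candidates = state["clear"] & (on_star(state, "A") - {"A"})
    if candidates:                                          # rule 2 fires
        return ("pickup", sorted(candidates)[0])
    return None                                             # no rule applies

# Example: B on A, C on B, hand empty; the policy picks up C, the clear block above A.
state = {"holding": set(), "clear": {"C"}, "on": {("B", "A"), ("C", "B")}}
print(example_policy(state))   # ('pickup', 'C')
```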


API with a Policy-Language Bias
As in approximate policy iteration, rollout at sampled states s (compute V^π at successor states, compute Q^π for each action, choose the best action) yields improved actions π'(s); these examples are then used to train a new policy π' expressed in the policy language of taxonomic decision lists (see the sketch below).
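To make the policy-language bias concrete, here is a toy greedy learner that builds a decision list of (class-expression, action) rules from rollout-labeled examples; the finite pool of candidate rules and the coverage-counting heuristic are simplifications assumed for illustration, standing in for the talk's heuristic search over taxonomic class expressions.

```python
def learn_decision_list(examples, candidate_rules):
    """Greedy decision-list induction.

    examples: list of (state, (target_action, target_object)) pairs labeled by rollout.
    candidate_rules: list of (class_fn, action), where class_fn(state) returns the set
    of objects denoted by the class expression in that state.
    A rule covers an example if its action applied to some denoted object matches the label.
    """
    remaining = list(examples)
    decision_list = []
    while remaining:
        def coverage(rule):
            class_fn, action = rule
            return sum(1 for s, (tgt_act, tgt_obj) in remaining
                       if tgt_act == action and tgt_obj in class_fn(s))
        best = max(candidate_rules, key=coverage)
        if coverage(best) == 0:
            break                                   # nothing left to cover
        decision_list.append(best)
        class_fn, action = best
        remaining = [(s, (a, o)) for s, (a, o) in remaining
                     if not (a == action and o in class_fn(s))]
    return decision_list
```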

Incorporating Value Estimates
- What happens if the policy can't find reward? Trajectories under π may end before reaching any.
- Use a value estimate at the states where the trajectories are cut off.
- For learning control knowledge, we use the plangraph heuristic of FF-plan as that estimate.
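A sketch of how such a value estimate could plug into the earlier trajectory evaluation, adapting the hypothetical estimate_value above; heuristic_value is a stand-in for a domain heuristic such as FF's plangraph estimate.

```python
def estimate_value_with_heuristic(simulator, policy, heuristic_value, state,
                                  gamma=0.95, horizon=20, n_traj=10):
    """Like estimate_value, but when a trajectory is cut off before reaching reward,
    fall back on heuristic_value(s) at the truncation state."""
    total = 0.0
    for _ in range(n_traj):
        s, discount, ret, done = state, 1.0, 0.0, False
        for _ in range(horizon):
            a = policy(s)
            s, r, done = simulator.step(s, a)
            ret += discount * r
            discount *= gamma
            if done:
                break
        if not done:
            ret += discount * heuristic_value(s)   # value estimate at the cutoff state
        total += ret
    return total / n_traj
```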

Initial Policy Choice
- Policy iteration requires an initial base policy.
- Options include:
  - a random policy
  - a policy greedy with respect to a planning heuristic
  - a policy learned from small problems

Experimental Domains
- (Stochastic) Blocks World: SBW(n)
- (Stochastic) Painted Blocks World: SPW(n)
- (Stochastic) Logistics World: SLW(t,p,c)

API Results
Starting with flawed policies learned from small problems. [Plot: success rate]

API Results
Starting with a policy greedy with respect to a domain-independent heuristic: the heuristic of FF-plan (Hoffmann and Nebel, JAIR 2001). [Plot: success rate]

How Good is the Induced Planner?
[Table comparing FF and API on success rate, average plan length, and running time (s) for BW(10), BW(15), BW(20), BW(30), LW(4,6,4), and LW(5,14,20); the numeric entries are not preserved in this transcript.]

Conclusions
- Using a policy-space bias, we can learn good policies for extremely large structured MDPs.
- We can automatically learn domain-specific planners that compete favorably with state-of-the-art domain-independent planners.

Approximate Policy Iteration
- Sample states s, and compute Q-values at each.
- Computing Q^π(s,b): estimate R_b + γ E_{s'∈{t_1,…,t_n}}[V^π(s')] by
  - sampling states t_i from t_1,…,t_n, the successors of s under b
  - drawing trajectories under π from each t_i to estimate V^π
- Form a training set of tuples (s, b, Q^π(s,b)).
- Learn a new policy from this training set.

Markov Decision Process (MDP)
Ingredients:
- system state x in state space X
- control action a in A(x)
- reward R(x,a)
- state-transition probability P(x,y,a)
Find a control policy to maximize the objective function.
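A minimal container for these ingredients, with one Bellman backup of a discounted objective under a fixed policy; the concrete field types and the discounted-return objective are illustrative assumptions (the slide leaves the objective unspecified).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MDP:
    states: List[str]                               # state space X
    actions: Callable[[str], List[str]]             # A(x): actions available in state x
    reward: Callable[[str, str], float]             # R(x, a)
    transition: Callable[[str, str, str], float]    # P(x, y, a): prob. of reaching y from x under a
    gamma: float = 0.95                             # discount used in the assumed objective

def bellman_backup(mdp: MDP, policy: Callable[[str], str], V: Dict[str, float]) -> Dict[str, float]:
    """One backup of the discounted objective under `policy`:
    V(x) <- R(x, pi(x)) + gamma * sum_y P(x, y, pi(x)) * V(y)."""
    return {x: mdp.reward(x, policy(x)) +
               mdp.gamma * sum(mdp.transition(x, y, policy(x)) * V[y] for y in mdp.states)
            for x in mdp.states}
```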

Control Knowledge vs. Policy
- Perhaps the biggest difference between the communities:
  - deterministic planning works with action sequences
  - decision-theoretic planning works with policies
- Policies are needed because uncertainty may carry you to any state.
  - Compare: control knowledge also handles every state.
- Good control knowledge eliminates search
  - it defines a policy over the possible state/goal pairs.