
A Decision-Theoretic Model of Assistance - Evaluation, Extension and Open Problems
Sriraam Natarajan, Kshitij Judah, Prasad Tadepalli and Alan Fern
School of EECS, Oregon State University

Outline: Introduction, Decision-Theoretic Model, Experiment with folder predictor, Incorporating Relational Hierarchies, Open Problems, Conclusion

Motivation
- Several assistant systems have been proposed to
  - Assist users in daily tasks
  - Reduce their cognitive load
- Examples: CALO (CALO 2003), COACH (Boger et al. 2005), etc.
- Problems with previous work
  - Fine-tuned to particular application domains
  - Utilize specialized technologies
  - Lack an overarching framework

Interaction Model
[Figure sequence: starting from the initial state W1, the user (action set U) takes an action toward a goal, moving the world to W2; the assistant (action set A) then executes assistant actions through W3, W4 and W5; the user acts again (W6), the assistant continues through W7 and W8, and at W9 the goal is achieved and the user thanks the assistant. The assistant's objective is to minimize the number of user actions.]

Introduction, Decision-Theoretic Model, Experiment with folder predictor, Incorporating Relational Hierarchies, Open Problems, Conclusion

Markov Decision Process
- MDP: (S, A, T, R, I)
- Policy $\pi$: a mapping from S to A
- $V(\pi) = E[\sum_{t=1}^{T} r_t]$, where T is the length of the episode
- Optimal policy: $\pi^* = \arg\max_{\pi} V(\pi)$
- A Partially Observable Markov Decision Process (POMDP) adds:
  - O: the set of observations
  - $\mu(o \mid s)$: a distribution over observations $o \in O$ given the current state s
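As a concrete illustration of these definitions, here is a minimal Python sketch (not from the slides) of a tabular MDP and a Monte Carlo estimate of the episodic value $V(\pi)$; the `MDP` container and the `estimate_value` helper are hypothetical names introduced for this example.

```python
import random

class MDP:
    """Tabular MDP (S, A, T, R, I); terminal states mark the end of an episode."""
    def __init__(self, transitions, rewards, initial_states, terminals):
        self.T = transitions      # T[s][a] -> list of (next_state, probability)
        self.R = rewards          # R[s][a] -> immediate reward r_t
        self.I = initial_states   # list of possible initial states
        self.terminals = set(terminals)

def sample_next(mdp, s, a):
    """Draw a successor state according to the transition distribution."""
    states, probs = zip(*mdp.T[s][a])
    return random.choices(states, weights=probs)[0]

def estimate_value(mdp, policy, episodes=1000, max_len=100):
    """Monte Carlo estimate of V(pi) = E[sum_{t=1..T} r_t] for a policy S -> A."""
    total = 0.0
    for _ in range(episodes):
        s, episode_return = random.choice(mdp.I), 0.0
        for _ in range(max_len):
            if s in mdp.terminals:
                break
            a = policy[s]                      # deterministic policy: dict S -> A
            episode_return += mdp.R[s][a]
            s = sample_next(mdp, s, a)
        total += episode_return
    return total / episodes
```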

Decision-Theoretic Model (Fern et al. 07)
- Assistant: a history-dependent stochastic policy $\pi'(a \mid w, O)$
- Observables: world states, agent's actions
- Hidden: agent's goals
- An episode begins at state w with goal g
- $C(w, g, \pi, \pi')$: cost of the episode
- Objective: compute the $\pi'$ that minimizes $E[C(I, G_0, \pi, \pi')]$

Assistant POMDP
Given the MDP, $G_0$ and the user policy $\pi$, the assistant POMDP is defined as:
- State space: W x G
- Action set: A
- Transition function T':
  - $T'((w,g), a, (w',g')) = 0$ if $g \neq g'$
  - $T'((w,g), a, (w',g)) = T(w, a, w')$ if $a \neq$ noop
  - $T'((w,g), a, (w',g)) = \sum_u \pi(u \mid w, g)\, T(w, u, w')$ if $a =$ noop (the world moves under the user's action)
- Cost model C':
  - $C'((w,g), a) = C(w, a)$ if $a \neq$ noop
  - $C'((w,g), a) = E[C(w, U)]$, where $U$ is distributed according to $\pi(\cdot \mid w, g)$, if $a =$ noop
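The transition and cost definitions above translate almost directly into code. Below is a small sketch, under the assumption that the base MDP's transition function T(w, a, w') and cost C(w, a) are available as Python callables and that `user_policy(w, g)` returns a dict of user-action probabilities; all names are illustrative, not part of any released implementation.

```python
NOOP = "noop"

def assistant_transition(T, user_policy, state, a, next_state):
    """Transition function of the assistant POMDP over states (w, g).

    T(w, a, w')       : base-MDP transition probability
    user_policy(w, g) : dict mapping user actions u -> pi(u | w, g)
    """
    (w, g), (w2, g2) = state, next_state
    if g != g2:                       # the user's goal does not change mid-episode
        return 0.0
    if a != NOOP:                     # assistant acts: world moves under the base MDP
        return T(w, a, w2)
    # assistant waits: the world moves under the user's (stochastic) action
    return sum(p_u * T(w, u, w2) for u, p_u in user_policy(w, g).items())

def assistant_cost(C, user_policy, state, a):
    """Cost model of the assistant POMDP."""
    w, g = state
    if a != NOOP:
        return C(w, a)
    # expected cost of the user action drawn from the user's policy
    return sum(p_u * C(w, u) for u, p_u in user_policy(w, g).items())
```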

Assistant POMDP
[Figure: dynamic Bayesian network for the assistant POMDP, with the hidden goal G influencing world states W_t, W_{t+1}, actions A_t, A_{t+1}, and observations S_t, S_{t+1}]

Approximate Solution Approach
[Figure: the assistant couples a goal recognizer with an action-selection module; the user acts (U_t) in the environment, the assistant observes O_t and the world state W_t, the goal recognizer outputs P(G), and the assistant executes action A_t]
Online action selection cycle:
1) Estimate the posterior goal distribution given the observations
2) Select an action via myopic heuristics

Goal Estimation
[Figure: from the current state W_t and the goal posterior P(G | O_t), observing the user action U_t and the next state W_{t+1} yields the updated posterior P(G | O_{t+1})]
Given:
- P(G | O_t): goal posterior given observations up to time t
- P(U_t | G, W_t): user policy
- O_{t+1}: new observation of the user action and world state
The user policy must be learned.
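The update in this slide is a standard Bayes filter over goals: the new posterior is proportional to the old posterior times the likelihood $P(U_t \mid G, W_t)$ of the observed user action under each goal. A minimal sketch follows; `user_model` stands in for the (learned) user policy and is a hypothetical name.

```python
def update_goal_posterior(posterior, user_action, world_state, user_model):
    """Bayesian update of the goal posterior after observing a user action.

    posterior  : dict goal -> P(goal | O_t)
    user_model : callable (user_action, goal, world_state) -> P(U_t | G, W_t),
                 e.g. a learned estimate of the user's policy
    """
    unnormalized = {
        g: p_g * user_model(user_action, g, world_state)
        for g, p_g in posterior.items()
    }
    z = sum(unnormalized.values())
    if z == 0.0:                      # observation impossible under every goal
        return dict(posterior)        # keep the previous posterior unchanged
    return {g: p / z for g, p in unnormalized.items()}
```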

Action Selection: Assistant POMDP
[Figure: with the goal G and the user policy fixed, the assistant POMDP over W_t, W_{t+1}, W_{t+2} reduces to an assistant MDP over assistant actions]
- Assume we know the user's goal G and policy
- We can then create a corresponding assistant MDP over assistant actions
- We can compute Q(A, W, G), the value of taking assistive action A when the user's goal is G
- Select the action that maximizes the expected (myopic) value:
  $Q(A, W) = \sum_G P(G \mid O_t)\, Q(A, W, G)$
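Putting the two steps of the online cycle together, a sketch of the myopic action selection might look as follows; `q_value(a, w, g)` is assumed to come from solving (or approximating) the goal-specific assistant MDP, and all names are illustrative.

```python
def select_assistive_action(actions, world_state, posterior, q_value):
    """Myopic selection: pick the assistant action with the highest value
    averaged over the goal posterior, Q(A, W) = sum_G P(G | O_t) * Q(A, W, G)."""
    def expected_q(a):
        return sum(p_g * q_value(a, world_state, g)
                   for g, p_g in posterior.items())
    return max(actions, key=expected_q)

# One step of the online cycle: observe the user, update the goal posterior
# (see the goal-estimation sketch above), then act myopically.
def assist_step(actions, world_state, posterior, user_action, user_model, q_value):
    posterior = update_goal_posterior(posterior, user_action, world_state, user_model)
    return select_assistive_action(actions, world_state, posterior, q_value), posterior
```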

Introduction, Decision-Theoretic Model, Experiment with folder predictor, Incorporating Relational Hierarchies, Open Problems, Conclusion

Folder Predictor
- Previous work (Bao et al. 2006):
  - No repredictions
  - Does not consider new folders
- Decision-theoretic model:
  - Naturally handles repredictions
  - Uses a mixture density to obtain the distribution over folders: $P(f) = \mu_0 P_0(f) + (1 - \mu_0) P_l(f)$
- Data set: a set of Open and SaveAs requests
- Folder hierarchy: 226 folders
- Prior distribution initialized according to the model of Bao et al.
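A hedged sketch of the mixture prediction is shown below: $P_0$ is the prior folder distribution (initialized from the Bao et al. model) and $P_l$ is the component estimated from the observed Open/SaveAs requests; the mixing weight `mu0` and the top-k presentation are illustrative assumptions, not values from the slides.

```python
def folder_distribution(folders, prior, learned, mu0=0.5):
    """Mixture density over folders: P(f) = mu0 * P0(f) + (1 - mu0) * Pl(f).

    prior, learned : dicts folder -> probability (P0 from the Bao et al. model,
                     Pl estimated from the observed Open/SaveAs requests)
    mu0            : mixing weight (0.5 is an illustrative default)
    """
    return {f: mu0 * prior.get(f, 0.0) + (1.0 - mu0) * learned.get(f, 0.0)
            for f in folders}

def predict_folders(folders, prior, learned, k=3, mu0=0.5):
    """Return the k folders with the highest mixture probability."""
    dist = folder_distribution(folders, prior, learned, mu0)
    return sorted(dist, key=dist.get, reverse=True)[:k]
```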

[Figure: average number of clicks per Open/SaveAs for the current TaskTracer folder predictor vs. the full assistant framework, with and without repredictions, on both the restricted folder set and all folders]

Introduction, Decision-Theoretic Model, Experiment with folder predictor, Incorporating Relational Hierarchies, Open Problems, Conclusion

Incorporating Relational Hierarchies
- Tasks are hierarchical
  - Writing a paper
- Tasks have a natural class-subclass hierarchy
  - Papers to ICML or IJCAI involve similar subtasks
- Tasks are chosen based on some attribute of the world
  - Grad students work on a paper closer to the deadline
- Goal: combine these ideas to
  - Specify prior knowledge easily
  - Accelerate learning of the parameters

Doorman Domain

[Figure: relational task hierarchy. Top-level tasks Gather(R) and Attack(E) decompose into Collect(R), Deposit(R,S), DestroyCamp(E) and KillDragon(D), which bottom out in primitive tasks Goto(L), Pickup(R), Move(X), Open(D), DropOff(R,S), Kill(D) and Destroy(E), with variable-binding constraints such as L = R.Loc, R.Type = S.Type, L = S.Loc, L = D.Loc, L = E.Loc and E.Type = D.Type]
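One plausible way to represent such a relational task hierarchy in code is sketched below; the class layout and the choice of which constraints attach to which tasks are my reading of the figure, not an implementation from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """Node in a relational task hierarchy (illustrative structure only).

    name        : task schema, e.g. "Gather" or "Goto"
    params      : formal parameters, e.g. ("R",) for a resource
    subtasks    : child task nodes (empty for primitive tasks)
    constraints : variable-binding constraints tying a child's arguments to the
                  parent's, e.g. "L = R.Loc" for Goto under Collect(R)
    """
    name: str
    params: tuple = ()
    subtasks: list = field(default_factory=list)
    constraints: list = field(default_factory=list)

# A fragment of the hierarchy in the figure above: Gather(R) decomposes into
# Collect(R) and Deposit(R, S), which bottom out in primitive tasks.
goto = Task("Goto", ("L",))
pickup = Task("Pickup", ("R",))
dropoff = Task("DropOff", ("R", "S"))
collect = Task("Collect", ("R",), [goto, pickup], ["L = R.Loc"])
deposit = Task("Deposit", ("R", "S"), [goto, dropoff], ["L = S.Loc", "R.Type = S.Type"])
gather = Task("Gather", ("R",), [collect, deposit])
```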

Performance of different models

Introduction, Decision-Theoretic Model, Experiment with folder predictor, Incorporating Relational Hierarchies, Open Problems, Conclusion

Open Problems
- Partial observability of the user
  - Currently the user completely observes the environment
  - Not the case in the real world: the user need not know what is in the refrigerator
  - The assistant can completely observe the world
  - The current system does not consider the user's exploratory actions
  - The setting is similar to interactive POMDPs (Doshi et al.)
  - Environment: POMDP
  - Belief states of the POMDP are the belief states of the user
  - The state space needs to be extended to capture the user's beliefs

Open Problems
- Large state space
  - Solving the POMDP is impractical
  - The Kitchen domain (Fern et al.) has an enormous number of states
  - Prune certain regions of the search space (Electric Elves)
  - Can use user trajectories as training examples
- Parallel subgoals/actions
  - Assistant and user execute actions in parallel
  - Useful to execute parallel subgoals: the user writes the paper while the assistant runs experiments
  - Identification of the possible parallel actions
  - The assistant can change the goal stack of the user
  - Goal estimation has to include the user's response

Open Problems
- Changing goals
  - The user can change goals midway, e.g. start working on a different project
  - Currently, the system would converge to the new goal slowly
  - Explicitly model this possibility
  - Borrow ideas from user modeling to predict changing goals
- Expanding set of goals
  - A large number of dishes can be cooked
- Forgetting subgoals
  - Forgetting to attach a document to an email
  - Explicitly model this possibility; borrow ideas from the cognitive-science literature

Introduction, Decision-Theoretic Model, Experiment with folder predictor, Incorporating Relational Hierarchies, Open Problems, Conclusion

Conclusion
- Proposed a general framework based on decision theory
- Experiments in a real-world domain
  - Repredictions are useful
- Currently working on a relational hierarchical model
- Outlined several open problems
- Motivated the necessity of using sophisticated user models

Thank you!!!