
UNCERTAINTY IN SENSING (AND ACTION)

PLANNING WITH PROBABILISTIC UNCERTAINTY IN SENSING
(figure: motion uncertainty example, no motion vs. perpendicular motion)

THE “TIGER” EXAMPLE
Two states: s0 (tiger-left) and s1 (tiger-right)
Observations: GL (growl-left) and GR (growl-right), received only if the listen action is chosen
P(GL|s0) = 0.85, P(GR|s0) = 0.15
P(GL|s1) = 0.15, P(GR|s1) = 0.85
Rewards: -100 if the wrong door is opened, +10 if the correct door is opened, -1 for listening

BELIEF STATE
Probability of s0 vs. s1 being the true underlying state
Initial belief state: P(s0) = P(s1) = 0.5
Upon listening, the belief state changes according to the Bayesian update (filtering), as in the sketch below
But how confident should you be about the tiger's position before choosing a door?
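To make the filtering step concrete, here is a minimal Python sketch of the Bayesian belief update for the tiger example. It uses only the sensor probabilities from the slide above; the function name update_belief is just for illustration.

```python
# Bayesian belief update for the tiger example (illustrative sketch).
# p = current belief that the tiger is on the right, i.e. P(s1).
# Sensor model from the slides: P(GR|s1) = 0.85, P(GR|s0) = 0.15.

def update_belief(p, growl):
    """Return the posterior P(s1 | growl) after one listen action."""
    if growl == "GR":                      # growl heard on the right
        likelihood_s1, likelihood_s0 = 0.85, 0.15
    else:                                  # "GL": growl heard on the left
        likelihood_s1, likelihood_s0 = 0.15, 0.85
    evidence = likelihood_s1 * p + likelihood_s0 * (1.0 - p)
    return likelihood_s1 * p / evidence

p = 0.5                                    # initial belief P(s0) = P(s1) = 0.5
p = update_belief(p, "GR")                 # one right growl  -> 0.85
p = update_belief(p, "GR")                 # second right growl -> ~0.9698
print(p)
```

Two consecutive right growls already push the belief from 0.5 to roughly 0.97, which is the kind of number the "how confident should you be" question is about.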

PARTIALLY OBSERVABLE MDPS
Consider the MDP model with states s ∈ S, actions a ∈ A
Reward R(s)
Transition model P(s'|s,a)
Discount factor γ
With sensing uncertainty, the initial belief state is a probability distribution over states: b(s)
b(si) ≥ 0 for all si ∈ S, Σi b(si) = 1
Observations are generated according to a sensor model
Observation space o ∈ O
Sensor model P(o|s)
The resulting problem is a Partially Observable Markov Decision Process (POMDP)

BELIEF SPACE
Belief can be defined by a single number pt = P(s1 | O1, …, Ot)
The optimal action does not depend on the time step, just on the value of pt
So a policy π(p) is a map from [0,1] → {0, 1, 2} (listen, open-left, open-right)
(figure: the interval [0,1] of beliefs p partitioned into regions labeled listen, open-left, open-right)

UTILITIES FOR NON-TERMINAL ACTIONS
Now consider π(p) ≡ listen for p ∈ [a,b]
Reward of -1
If GR is observed at time t, p becomes P(GRt|s1) P(s1|p) / P(GRt|p) = 0.85p / (0.85p + 0.15(1-p)) = 0.85p / (0.7p + 0.15)
Otherwise, p becomes P(GLt|s1) P(s1|p) / P(GLt|p) = 0.15p / (0.15p + 0.85(1-p)) = 0.15p / (0.85 - 0.7p)
So the utility at p is Uπ(p) = -1 + P(GR|p) Uπ(0.85p / (0.7p + 0.15)) + P(GL|p) Uπ(0.15p / (0.85 - 0.7p))

POMDP UTILITY FUNCTION
A policy π(b) is defined as a map from belief states to actions
Expected discounted reward with policy π: Uπ(b) = E[Σt γ^t R(St)], where St is the random variable indicating the state at time t
P(S0=s) = b0(s)
P(S1=s) = ?

POMDP UTILITY FUNCTION
A policy π(b) is defined as a map from belief states to actions
Expected discounted reward with policy π: Uπ(b) = E[Σt γ^t R(St)], where St is the random variable indicating the state at time t
P(S0=s) = b0(s)
P(S1=s) = P(s | π(b0), b0) = Σs' P(s|s',π(b0)) P(S0=s') = Σs' P(s|s',π(b0)) b0(s')

POMDP UTILITY FUNCTION
A policy π(b) is defined as a map from belief states to actions
Expected discounted reward with policy π: Uπ(b) = E[Σt γ^t R(St)], where St is the random variable indicating the state at time t
P(S0=s) = b0(s)
P(S1=s) = Σs' P(s|s',π(b0)) b0(s')
P(S2=s) = ?

POMDP UTILITY FUNCTION
A policy π(b) is defined as a map from belief states to actions
Expected discounted reward with policy π: Uπ(b) = E[Σt γ^t R(St)], where St is the random variable indicating the state at time t
P(S0=s) = b0(s)
P(S1=s) = Σs' P(s|s',π(b0)) b0(s')
What belief states could the robot take on after 1 step?

(diagram: one step of belief-space lookahead, built up over several slides)
Start at belief b0
Choose action π(b0)
Predict b1(s) = Σs' P(s|s',π(b0)) b0(s')
Receive an observation o ∈ {oA, oB, oC, oD}, each with probability P(o|b1)
Update the belief accordingly: b1,A(s) = P(s|b1,oA), b1,B(s) = P(s|b1,oB), b1,C(s) = P(s|b1,oC), b1,D(s) = P(s|b1,oD)
where P(o|b) = Σs P(o|s) b(s) and P(s|b,o) = P(o|s) P(s|b) / P(o|b) = (1/Z) P(o|s) b(s)

BELIEF-SPACE SEARCH TREE
Each belief node has |A| action node successors
Each action node has |O| belief successors
Each (action, observation) pair (a,o) requires a predict/update step similar to HMMs
Matrix/vector formulation (see the sketch below):
b(s): a vector b of length |S|
P(s'|s,a): a set of |S|x|S| matrices Ta
P(ok|s): a vector ok of length |S|
ba = Ta b (predict)
P(ok|ba) = ok^T ba (probability of observation)
ba,k = diag(ok) ba / (ok^T ba) (update)
Denote this operation as ba,o
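The matrix/vector formulation translates almost line for line into numpy. The sketch below is illustrative rather than code from the lecture; the tiger usage at the end assumes state order [s0, s1] and that listening leaves the tiger where it is.

```python
import numpy as np

# Matrix/vector belief operations from the slide (sketch).
# b:   belief vector of length |S|
# T_a: |S|x|S| transition matrix for action a, T_a[s_next, s] = P(s_next | s, a)
# o_k: length-|S| vector of observation likelihoods, o_k[s] = P(o_k | s)

def predict(b, T_a):
    """b_a = T_a b  (belief after taking action a, before observing)."""
    return T_a @ b

def obs_probability(b_a, o_k):
    """P(o_k | b_a) = o_k^T b_a."""
    return float(o_k @ b_a)

def update(b_a, o_k):
    """b_{a,k} = diag(o_k) b_a / (o_k^T b_a)  (belief after observing o_k)."""
    unnormalized = o_k * b_a
    return unnormalized / unnormalized.sum()

# Tiger example: listening does not move the tiger, so T_listen is the identity.
b0 = np.array([0.5, 0.5])                    # [P(s0), P(s1)]
T_listen = np.eye(2)
o_GR = np.array([0.15, 0.85])                # [P(GR|s0), P(GR|s1)]
b1 = update(predict(b0, T_listen), o_GR)     # -> [0.15, 0.85]
```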

RECEDING HORIZON SEARCH
Expand the belief-space search tree to some depth h
Use an evaluation function on leaf beliefs to estimate utilities
For internal nodes, back up estimated utilities:
U(b) = E[R(s)|b] + γ maxa∈A Σo∈O P(o|ba) U(ba,o)
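A hedged Python sketch of this backup: it expands the tree recursively to depth h and applies the equation above, with a caller-supplied evaluation function at the leaves. The argument names (R, T, O) and their layouts are assumptions made for illustration, not part of the lecture.

```python
import numpy as np

def expected_reward(b, R):
    """E[R(s) | b] = sum_s R(s) b(s)."""
    return float(R @ b)

def receding_horizon_value(b, h, R, T, O, gamma, evaluate):
    """Back up utilities over a depth-h belief-space search tree (sketch).

    R:        length-|S| reward vector
    T:        dict action -> |S|x|S| matrix, T[a][s_next, s] = P(s_next | s, a)
    O:        dict observation -> length-|S| vector, O[o][s] = P(o | s)
    evaluate: heuristic utility estimate used at the leaf beliefs (e.g. QMDP)
    """
    if h == 0:
        return evaluate(b)
    best = -np.inf
    for a, T_a in T.items():
        b_a = T_a @ b                              # predict
        value = 0.0
        for o, o_vec in O.items():
            p_o = float(o_vec @ b_a)               # P(o | b_a)
            if p_o > 0.0:
                b_ao = (o_vec * b_a) / p_o         # update
                value += p_o * receding_horizon_value(b_ao, h - 1,
                                                      R, T, O, gamma, evaluate)
        best = max(best, value)
    return expected_reward(b, R) + gamma * best
```

The same recursion, with the argmax over actions recorded at the root, yields the action actually executed before the horizon is shifted forward one step.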

QMDP EVALUATION FUNCTION
One possible evaluation function is to compute the expectation of the underlying MDP value function over the leaf belief states: f(b) = Σs UMDP(s) b(s)
"Averaging over clairvoyance"
Assumes the problem becomes instantly fully observable after 1 action
Is optimistic: U(b) ≤ f(b)
Approaches the POMDP value function as state and sensing uncertainty decrease
In the extreme h=1 case, this is called the QMDP policy
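As a sketch (assuming U_mdp and Q_mdp have already been obtained by ordinary value iteration on the fully observable MDP, which is not shown here), the QMDP leaf evaluation and the h=1 QMDP action choice are only a couple of lines:

```python
import numpy as np

# QMDP sketch: U_mdp is a length-|S| value vector and Q_mdp an |S|x|A| matrix,
# both computed beforehand by standard MDP value iteration.

def qmdp_evaluate(b, U_mdp):
    """f(b) = sum_s U_MDP(s) b(s): the 'averaging over clairvoyance' estimate."""
    return float(U_mdp @ b)

def qmdp_action(b, Q_mdp):
    """QMDP policy: pick the action maximizing sum_s b(s) Q_MDP(s, a)."""
    return int(np.argmax(b @ Q_mdp))
```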

QMDP POLICY (Littman, Cassandra, Kaelbling 1995)

UTILITIES FOR TERMINAL ACTIONS
Consider a belief-space interval mapped to a terminating action: π(p) ≡ open-right for p ∈ [a,b]
If the true state is s1, the reward is +10, otherwise -100
P(s1) = p, so Uπ(p) = 10p - 100(1-p)
(figure: Uπ plotted against p ∈ [0,1] for open-right)

UTILITIES FOR TERMINAL ACTIONS
Now consider π(p) ≡ open-left for p ∈ [a,b]
If the true state is s1, the reward is -100, otherwise +10
P(s1) = p, so Uπ(p) = -100p + 10(1-p)
(figure: Uπ lines for open-right and open-left plotted against p)

PIECEWISE LINEAR VALUE FUNCTION
Uπ(p) = -1 + P(GR|p) Uπ(0.85p / P(GR|p)) + P(GL|p) Uπ(0.15p / P(GL|p))
If we assume Uπ at 0.85p / P(GR|p) and 0.15p / P(GL|p) are linear functions Uπ(x) = m1 x + b1 and Uπ(x) = m2 x + b2, then
Uπ(p) = -1 + P(GR|p) (m1 0.85p / P(GR|p) + b1) + P(GL|p) (m2 0.15p / P(GL|p) + b2)
= -1 + 0.85 m1 p + b1 P(GR|p) + 0.15 m2 p + b2 P(GL|p)
= (-1 + 0.15 b1 + 0.85 b2) + (0.85 m1 + 0.15 m2 + 0.7 b1 - 0.7 b2) p
Linear!
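The cancellation can be checked symbolically. The snippet below is a quick sanity check of the "Linear!" claim using sympy with the tiger sensor model; m1, b1, m2, b2 are the assumed linear pieces from the slide.

```python
from sympy import symbols, expand, Poly

p, m1, b1, m2, b2 = symbols("p m1 b1 m2 b2")

# Tiger sensor model: P(GR|p) and P(GL|p) as functions of p = P(s1).
P_GR = 0.85 * p + 0.15 * (1 - p)          # = 0.7 p + 0.15
P_GL = 0.15 * p + 0.85 * (1 - p)          # = 0.85 - 0.7 p

# Utility of listening when the two successor utilities are linear pieces.
U = -1 + P_GR * (m1 * 0.85 * p / P_GR + b1) + P_GL * (m2 * 0.15 * p / P_GL + b2)

print(expand(U))                          # the P(GR|p), P(GL|p) factors cancel
print(Poly(expand(U), p).degree())        # -> 1, i.e. U is linear in p
```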

VALUE ITERATION FOR POMDPS
Start with optimal zero-step rewards
Compute optimal one-step rewards given piecewise linear Uπ
Repeat…
(figure, built up over several slides: the piecewise linear utility over p is refined at each iteration, with pieces for open-left, listen, open-right; a grid-based sketch follows below)
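For intuition, here is a grid-based approximation of this value iteration for the tiger example. The exact algorithm manipulates piecewise-linear pieces (α-vectors); this sketch just discretizes p ∈ [0,1] and interpolates, treats the open actions as terminal, uses the reward assignment from the earlier slides, and assumes a discount factor of 0.95.

```python
import numpy as np

# Approximate value iteration for the tiger POMDP over a discretized belief p = P(s1).
gamma = 0.95                                   # assumed discount factor
grid = np.linspace(0.0, 1.0, 201)
U = np.zeros_like(grid)

def listen_backup(U, p):
    """Backup for the listen action: -1 plus the discounted expected next value."""
    p_GR = 0.85 * p + 0.15 * (1 - p)           # P(GR | p)
    p_GL = 1.0 - p_GR                          # P(GL | p)
    p_after_GR = 0.85 * p / p_GR               # posterior after hearing GR
    p_after_GL = 0.15 * p / p_GL               # posterior after hearing GL
    u_GR = np.interp(p_after_GR, grid, U)      # interpolate U on the grid
    u_GL = np.interp(p_after_GL, grid, U)
    return -1.0 + gamma * (p_GR * u_GR + p_GL * u_GL)

for _ in range(100):
    open_right = 10.0 * grid - 100.0 * (1 - grid)    # terminal rewards, as on the slides
    open_left = -100.0 * grid + 10.0 * (1 - grid)
    listen = listen_backup(U, grid)
    U = np.maximum(np.maximum(open_right, open_left), listen)
```

Plotting U against the grid after convergence shows the familiar shape: listening is preferred in a middle band of beliefs, and opening a door only near the ends of the interval.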

WORST-CASE COMPLEXITY
Infinite-horizon undiscounted POMDPs are undecidable (by reduction from the halting problem)
Exact solutions to infinite-horizon discounted POMDPs are intractable even for low |S|
Finite horizon: O(|S|^2 |A|^h |O|^h)
Receding horizon approximation: one-step regret is O(γ^h)
Approximate solutions: becoming tractable for |S| in the millions
α-vector point-based techniques
Monte Carlo tree search
…Beyond the scope of this course…

(SOMETIMES) EFFECTIVE HEURISTICS
Assume the most likely state: works well if uncertainty is low, sensing is passive, and there are no "cliffs"
QMDP (average utilities of actions over the current belief state): works well if the agent doesn't need to go out of its way to perform sensing actions
Most-likely-observation assumption
Information-gathering rewards / uncertainty penalties
Map building

SCHEDULE
11/27: Robotics
11/29: Guest lecture: David Crandall, computer vision
12/4: Review
12/6: Final project presentations, review

FINAL DISCUSSION