Presentation transcript: "Uncertainty in Sensing (and Action): Planning with Probabilistic Uncertainty in Sensing"

1 Uncertainty in Sensing (and Action)

2 Planning with Probabilistic Uncertainty in Sensing (figure panels: "No motion", "Perpendicular motion")

3 The "Tiger" Example: Two states, s0 (tiger-left) and s1 (tiger-right). Two observations, GL (growl-left) and GR (growl-right), received only if the listen action is chosen: P(GL|s0)=0.85, P(GR|s0)=0.15, P(GL|s1)=0.15, P(GR|s1)=0.85. Rewards: -100 if the wrong door is opened, +10 if the correct door is opened, -1 for listening.
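A minimal sketch of this model as plain Python data; the names P_obs and REWARD are illustrative, not from the slides, and the door rewards follow the convention used later in this deck (slides 21-22), where open-right pays +10 when the tiger is on the right.

```python
# Hypothetical encoding of the tiger POMDP on this slide (names are illustrative).
# States: index 0 = s0 (tiger-left), index 1 = s1 (tiger-right).

P_obs = {                    # sensor model P(o | s), only informative after "listen"
    "GL": (0.85, 0.15),      # P(GL|s0), P(GL|s1)
    "GR": (0.15, 0.85),      # P(GR|s0), P(GR|s1)
}

# Immediate rewards indexed as (value in s0, value in s1), matching slides 21-22.
REWARD = {
    "listen":     (-1,   -1),
    "open-left":  (10, -100),
    "open-right": (-100, 10),
}
```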

4 Belief State: The probability of s0 vs. s1 being the true underlying state. Initial belief state: P(s0)=P(s1)=0.5. Upon listening, the belief state changes according to a Bayesian update (filtering). But how confident should you be about the tiger's position before choosing a door?
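A sketch of that Bayesian update for the tiger: since the tiger does not move, filtering reduces to reweighting by the growl likelihoods. The function name is illustrative.

```python
def update_belief(p_tiger_right, obs):
    """Return the updated P(tiger-right) after hearing obs in {"GL", "GR"}."""
    # Growl likelihoods from the previous slide: P(obs | s1), P(obs | s0)
    like_s1, like_s0 = (0.85, 0.15) if obs == "GR" else (0.15, 0.85)
    numer = like_s1 * p_tiger_right
    return numer / (numer + like_s0 * (1.0 - p_tiger_right))

p = 0.5                          # initial belief P(s1) = 0.5
p = update_belief(p, "GR")       # 0.85 after one growl-right
p = update_belief(p, "GR")       # ~0.97 after a second consistent growl
```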

5 Partially Observable MDPs: Consider the MDP model with states s ∈ S, actions a ∈ A, reward R(s), transition model P(s'|s,a), and discount factor γ. With sensing uncertainty, the initial belief state is a probability distribution over states b(s), with b(si) ≥ 0 for all si ∈ S and Σi b(si) = 1. Observations are generated according to a sensor model: the observation space is o ∈ O, and the sensor model is P(o|s). The resulting problem is a Partially Observable Markov Decision Process (POMDP).

6 Belief Space: The belief can be described by a single number pt = P(s1 | O1, …, Ot). The optimal action does not depend on the time step, just on the value of pt, so a policy π(p) is a map from [0,1] to the three actions {listen, open-left, open-right}.

7 Utilities for Non-Terminal Actions: Now consider π(p) = listen for p ∈ [a,b], which earns reward -1. If GR is observed at time t, p becomes P(GRt|s1) P(s1|p) / P(GRt|p) = 0.85p / (0.85p + 0.15(1-p)) = 0.85p / (0.15 + 0.7p). Otherwise p becomes P(GLt|s1) P(s1|p) / P(GLt|p) = 0.15p / (0.15p + 0.85(1-p)) = 0.15p / (0.85 - 0.7p). So the utility at p is Uπ(p) = -1 + P(GR|p) Uπ(0.85p / (0.15 + 0.7p)) + P(GL|p) Uπ(0.15p / (0.85 - 0.7p)).
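A quick numerical check (not from the slides) that the simplified denominators above agree with Bayes' rule:

```python
def p_after_GR(p):
    bayes = 0.85 * p / (0.85 * p + 0.15 * (1 - p))
    simplified = 0.85 * p / (0.15 + 0.7 * p)        # form used on the slide
    assert abs(bayes - simplified) < 1e-12
    return simplified

def p_after_GL(p):
    bayes = 0.15 * p / (0.15 * p + 0.85 * (1 - p))
    simplified = 0.15 * p / (0.85 - 0.7 * p)
    assert abs(bayes - simplified) < 1e-12
    return simplified

for p in (0.1, 0.5, 0.9):
    p_after_GR(p), p_after_GL(p)                    # no assertion fires
```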

8 POMDP Utility Function: A policy π(b) is defined as a map from belief states to actions. The expected discounted reward with policy π is Uπ(b) = E[Σt γ^t R(St)], where St is the random variable indicating the state at time t. P(S0=s) = b0(s). P(S1=s) = ?

9 POMDP Utility Function (continued): P(S0=s) = b0(s). P(S1=s) = P(s | π(b0), b0) = Σs' P(s|s',π(b0)) P(S0=s') = Σs' P(s|s',π(b0)) b0(s').
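In matrix form this propagation is a single matrix-vector product. The transition numbers below are illustrative only; the slides do not give them.

```python
import numpy as np

# T_a[s, s'] = P(s | s', a) for the chosen action a = pi(b0); columns sum to 1.
T_a = np.array([[0.9, 0.2],
                [0.1, 0.8]])
b0 = np.array([0.5, 0.5])

P_S1 = T_a @ b0      # P(S1 = s) = sum_{s'} P(s | s', pi(b0)) b0(s')  ->  [0.55, 0.45]
# P(S2 = s) additionally depends on which observation arrives, because the next
# action pi(b1) is chosen from the updated belief -- which is where the next slides go.
```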

10 POMDP Utility Function (continued): P(S1=s) = Σs' P(s|s',π(b0)) b0(s'). P(S2=s) = ?

11 POMDP Utility Function (continued): P(S1=s) = Σs' P(s|s',π(b0)) b0(s'). What belief states could the robot take on after one step?

12 (Belief-space diagram) From b0, choose action π(b0) and predict b1(s) = Σs' P(s|s',π(b0)) b0(s').

13 (Belief-space diagram) From b0, choose action π(b0), predict b1(s) = Σs' P(s|s',π(b0)) b0(s'), then receive one of the observations oA, oB, oC, oD.

14 (Belief-space diagram) Each observation ok arrives with probability P(ok|b1) and leads to its own successor belief b1,A, b1,B, b1,C, or b1,D.

15 (Belief-space diagram) Update the belief for each observation: b1,A(s) = P(s|b1,oA), b1,B(s) = P(s|b1,oB), b1,C(s) = P(s|b1,oC), b1,D(s) = P(s|b1,oD).

16 (Belief-space diagram) The update uses P(o|b) = Σs P(o|s) b(s) and P(s|b,o) = P(o|s) P(s|b) / P(o|b) = (1/Z) P(o|s) b(s).

17 Belief-Space Search Tree: Each belief node has |A| action-node successors, and each action node has |O| belief successors. Each (action, observation) pair (a,o) requires a predict/update step similar to HMM filtering. Matrix/vector formulation: b(s) is a vector b of length |S|; P(s'|s,a) is a set of |S|×|S| matrices Ta; P(ok|s) is a vector ok of length |S|. Then ba = Ta b (predict), P(ok|ba) = ok^T ba (probability of observation), and ba,k = diag(ok) ba / (ok^T ba) (update). Denote this operation as ba,o.
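A NumPy sketch of the predict/update operators in this matrix/vector form, reusing the tiger's listen action for illustration:

```python
import numpy as np

def predict(T_a, b):
    """b_a = T_a b, with T_a[s_next, s] = P(s_next | s, a)."""
    return T_a @ b

def update(o_k, b_a):
    """Return (b_{a,k}, P(o_k | b_a)), where o_k[s] = P(o_k | s)."""
    p_o = float(o_k @ b_a)                 # o_k^T b_a
    return (o_k * b_a) / p_o, p_o          # diag(o_k) b_a / (o_k^T b_a)

T_listen = np.eye(2)                       # listening leaves the tiger where it is
o_GR = np.array([0.15, 0.85])              # P(GR | s0), P(GR | s1)
b = np.array([0.5, 0.5])
b_ao, p_o = update(o_GR, predict(T_listen, b))   # b_ao = [0.15, 0.85], p_o = 0.5
```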

18 Receding-Horizon Search: Expand the belief-space search tree to some depth h, use an evaluation function on leaf beliefs to estimate their utilities, and back up estimated utilities at internal nodes: U(b) = E[R(s)|b] + γ max_{a∈A} Σ_{o∈O} P(o|ba) U(ba,o).
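A recursive sketch of that backup over the belief-space tree, assuming the predict/update form from the previous slide; argument names and the evaluate leaf heuristic are illustrative.

```python
import numpy as np

def horizon_value(b, depth, gamma, R, T, obs, evaluate):
    """Depth-limited backup of U(b) = E[R(s)|b] + gamma * max_a sum_o P(o|b_a) U(b_{a,o}).
    b, R: NumPy vectors over states; T: dict action -> |S|x|S| matrix with
    T[s_next, s] = P(s_next | s, a); obs: dict o -> likelihood vector P(o|s);
    evaluate: heuristic applied at leaf beliefs (e.g. the QMDP value on the next slide)."""
    if depth == 0:
        return evaluate(b)
    best = -np.inf
    for T_a in T.values():
        b_a = T_a @ b                                  # predict
        total = 0.0
        for o_vec in obs.values():
            p_o = float(o_vec @ b_a)                   # P(o | b_a)
            if p_o > 0.0:
                b_ao = (o_vec * b_a) / p_o             # update
                total += p_o * horizon_value(b_ao, depth - 1, gamma, R, T, obs, evaluate)
        best = max(best, total)
    return float(np.dot(R, b)) + gamma * best
```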

19 QMDP Evaluation Function: One possible evaluation function is the expectation of the underlying MDP value function over the leaf belief state, f(b) = Σs UMDP(s) b(s), sometimes called "averaging over clairvoyance." It assumes the problem becomes fully observable after one action, so it is optimistic: U(b) ≤ f(b). It approaches the POMDP value function as state and sensing uncertainty decrease. In the extreme h=1 case this gives the QMDP policy.
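A brief sketch of the evaluation function and the h=1 policy built on it; UMDP and QMDP values are assumed to be precomputed from the underlying MDP.

```python
import numpy as np

def qmdp_value(b, U_mdp):
    """f(b) = sum_s U_MDP(s) b(s): expected MDP value under the belief b."""
    return float(np.dot(U_mdp, b))

def qmdp_action(b, Q_mdp):
    """h=1 QMDP-policy sketch: argmax_a sum_s b(s) Q_MDP(s, a),
    where Q_mdp is an |S| x |A| array of MDP action values."""
    return int(np.argmax(b @ Q_mdp))
```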

20 QMDP Policy (Littman, Cassandra, and Kaelbling, 1995)

21 Utilities for Terminal Actions: Consider a belief-space interval mapped to a terminating action, π(p) = open-right for p ∈ [a,b]. If the true state is s1 the reward is +10, otherwise -100. Since P(s1) = p, Uπ(p) = 10p - 100(1-p). (Figure: the open-right utility line over p.)

22 Utilities for Terminal Actions: Now consider π(p) = open-left for p ∈ [a,b]. If the true state is s1 the reward is -100, otherwise +10. Since P(s1) = p, Uπ(p) = -100p + 10(1-p). (Figure: the open-left line added alongside open-right.)
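A tiny sketch of the two utility lines from these slides (using the deck's reward convention); both are linear in p.

```python
def U_open_right(p):
    return 10 * p - 100 * (1 - p)      # +10 when s1, -100 when s0 (slide 21)

def U_open_left(p):
    return -100 * p + 10 * (1 - p)     # mirror image (this slide)

# The lines intersect where 10p - 100(1-p) = -100p + 10(1-p), i.e. at p = 0.5,
# where both give -45; opening a door only pays off once listening has pushed
# the belief well away from 0.5.
```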

23 Piecewise Linear Value Function: Uπ(p) = -1 + P(GR|p) Uπ(0.85p / P(GR|p)) + P(GL|p) Uπ(0.15p / P(GL|p)). If we assume Uπ at 0.85p/P(GR|p) and 0.15p/P(GL|p) is given by linear functions Uπ(x) = m1 x + b1 and Uπ(x) = m2 x + b2, then Uπ(p) = -1 + P(GR|p)(0.85 m1 p / P(GR|p) + b1) + P(GL|p)(0.15 m2 p / P(GL|p) + b2) = -1 + 0.85 m1 p + b1 P(GR|p) + 0.15 m2 p + b2 P(GL|p) = -1 + 0.15 b1 + 0.85 b2 + (0.85 m1 + 0.15 m2 + 0.7 b1 - 0.7 b2) p. Linear!
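A short numeric check (not in the slides) that backing up two arbitrary linear pieces through a listen action yields exactly the linear expression derived above:

```python
def listen_backup(m1, b1, m2, b2, p):
    """-1 + P(GR|p) * U(0.85p / P(GR|p)) + P(GL|p) * U(0.15p / P(GL|p))."""
    p_gr = 0.15 + 0.7 * p
    p_gl = 0.85 - 0.7 * p
    return -1 + p_gr * (m1 * 0.85 * p / p_gr + b1) + p_gl * (m2 * 0.15 * p / p_gl + b2)

m1, b1, m2, b2 = 2.0, -3.0, -1.5, 4.0      # arbitrary linear pieces for the check
for p in (0.1, 0.4, 0.9):
    closed = -1 + 0.15 * b1 + 0.85 * b2 + (0.85 * m1 + 0.15 * m2 + 0.7 * b1 - 0.7 * b2) * p
    assert abs(listen_backup(m1, b1, m2, b2, p) - closed) < 1e-9
```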

24 Value Iteration for POMDPs: Start with the optimal zero-step rewards, then compute the optimal one-step rewards given the piecewise-linear U. (Figure: utility lines for open-left, open-right, and listen over p.)

25 Value Iteration for POMDPs (continued): compute the optimal one-step rewards given the piecewise-linear U (figure continued).

26 Value Iteration for POMDPs (continued): Start with the optimal zero-step rewards, compute the optimal one-step rewards given the piecewise-linear U, and repeat…
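A compact sketch of this backup for the tiger, with the value function represented as a set of alpha-vectors (lines over p). It assumes the door actions terminate the episode and uses an illustrative discount of 0.95, which the slides do not fix; dominated vectors are not pruned, so the set grows quickly.

```python
import itertools
import numpy as np

GAMMA = 0.95
R = {"open-right": np.array([-100.0, 10.0]),   # reward in (s0, s1), deck's convention
     "open-left":  np.array([10.0, -100.0]),
     "listen":     np.array([-1.0, -1.0])}
OBS = {"GL": np.array([0.85, 0.15]),           # P(o | s) when listening
       "GR": np.array([0.15, 0.85])}

def backup(alphas):
    """One value-iteration step: build the next set of alpha-vectors."""
    new = [R["open-right"], R["open-left"]]            # terminal actions: immediate reward
    for choice in itertools.product(alphas, repeat=len(OBS)):
        vec = R["listen"].copy()                       # listen: the state is unchanged
        for o_vec, a_next in zip(OBS.values(), choice):
            vec = vec + GAMMA * o_vec * a_next         # gamma * P(o|s) * alpha_o(s)
        new.append(vec)
    return new

alphas = list(R.values())                              # optimal zero-step rewards
for _ in range(2):                                     # repeat the backup
    alphas = backup(alphas)

def value(b):
    return max(float(a @ b) for a in alphas)           # piecewise-linear, convex in b

print(value(np.array([0.5, 0.5])))
```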

27 Worst-Case Complexity: Infinite-horizon undiscounted POMDPs are undecidable (by reduction from the halting problem). Exact solution of infinite-horizon discounted POMDPs is intractable even for small |S|. Finite horizon: O(|S|^2 |A|^h |O|^h). Receding-horizon approximation: one-step regret is O(γ^h). Approximate solutions are becoming tractable for |S| in the millions, e.g. α-vector point-based techniques and Monte Carlo tree search (beyond the scope of this course).

28 (Sometimes) Effective Heuristics: Assume the most likely state (works well if uncertainty is low, sensing is passive, and there are no "cliffs"). QMDP: average utilities of actions over the current belief state (works well if the agent doesn't need to go out of its way to perform sensing actions). Most-likely-observation assumption. Information-gathering rewards / uncertainty penalties. Map building.

29 Schedule: 11/27: Robotics. 11/29: Guest lecture by David Crandall, computer vision. 12/4: Review. 12/6: Final project presentations, review.

30 Final Discussion

