ECE-517: Reinforcement Learning in Artificial Intelligence Lecture 15: Partially Observable Markov Decision Processes (POMDPs) November 5, 2015 Dr. Itamar Arel


ECE-517: Reinforcement Learning in Artificial Intelligence Lecture 15: Partially Observable Markov Decision Processes (POMDPs) November 5, 2015 Dr. Itamar Arel College of Engineering Electrical Engineering and Computer Science Department The University of Tennessee Fall 2015

Outline Why use POMDPs? Formal definition Belief state Value function

Partially Observable Markov Decision Problems (POMDPs)
To introduce POMDPs, consider an example in which an agent learns to drive a car in New York City:
- The agent can look forward, backward, left or right
- It cannot change speed, but it can steer into the lane it is looking at
- The observations available to the agent are: the direction in which its gaze is directed, the closest object in its gaze, whether that object is looming or receding, the color of the object, and whether a horn is sounding
- To drive safely, the agent must steer out of its lane to avoid slow cars ahead and fast cars behind

POMDP Example
- The agent is in control of the middle car
- The car behind is fast and will not slow down; the car ahead is slower
- To avoid a crash, the agent must steer right
- However, while the agent is gazing to the right, there is no immediate observation that tells it about the impending crash
- The agent therefore needs to learn how its observations might aid its performance

POMDP Example (cont.)
- This is not easy when the agent has no explicit goals beyond “performing well”
- There are no explicit training patterns such as “if there is a car ahead and to the left, steer right”
- However, a scalar reward is provided to the agent as a performance indicator (just as in MDPs): the agent is penalized for colliding with other cars or with the road shoulder
- The only goal hard-wired into the agent is to maximize a long-term measure of the reward

POMDP Example (cont.)
Two significant problems make it difficult to learn under these conditions:
- Temporal credit assignment – if our agent hits another car and is consequently penalized, how does it reason about which sequence of actions should not be repeated, and in what circumstances? This is generally the same as in MDPs
- Partial observability – if the agent is about to hit the car ahead of it, and there is a car to the left, then circumstances dictate that it should steer right. However, when it looks to the right it has no sensory information about what is going on elsewhere
To solve the latter, the agent needs memory, which creates knowledge of the state of the world around it

Forms of Partial Observability
Partial observability coarsely pertains to either:
- Lack of important state information in observations – must be compensated for using memory
- Extraneous information in observations – the agent needs to learn to ignore it
In our example:
- The color of the car in the agent's gaze is extraneous (unless red cars really do drive faster)
- The agent needs to build a memory-based model of the world in order to accurately predict what will happen; this creates “belief state” information (we'll see this later)
If the agent has access to the complete state, such as a chess-playing machine that can view the entire board:
- It can choose optimal actions without memory
- The Markov property holds – i.e. the future state of the world is simply a function of the current state and action

Modeling the world as a POMDP
Our setting is that of an agent taking actions in a world according to its policy. The agent still receives feedback about its performance through a scalar reward received at each time step. Formally stated, a POMDP consists of:
- |S| states S = {1, 2, …, |S|} of the world
- |U| actions (or controls) U = {1, 2, …, |U|} available to the policy
- |Y| observations Y = {1, 2, …, |Y|}
- a (possibly stochastic) reward r(i) for each state i in S
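As a concrete illustration, the tuple above can be collected into a small container. This is a sketch; the class and field names are our own, not notation from the lecture:

```python
from dataclasses import dataclass

@dataclass
class POMDP:
    """Finite POMDP: states S, actions U, observations Y, plus model tables.

    T[s][a][s'] : transition probability Pr(s' | s, a)
    O[s2][a][o] : observation probability Pr(o | s', a)
    R[s]        : (possibly expected) reward r(i) for state s, as in the slide
    """
    states: list
    actions: list
    observations: list
    T: dict
    O: dict
    R: dict

    def validate(self):
        # Every transition and observation distribution must sum to 1.
        for s in self.states:
            for a in self.actions:
                assert abs(sum(self.T[s][a].values()) - 1.0) < 1e-9
        for s2 in self.states:
            for a in self.actions:
                assert abs(sum(self.O[s2][a].values()) - 1.0) < 1e-9
        return True
```

A `validate()` call like this is a cheap way to catch mistyped probability tables before any planning is attempted.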

Modeling the world as a POMDP (cont.)

MDPs vs. POMDPs
In an MDP:
- There is one observation for each state – observation and state are effectively interchangeable
- A memoryless policy that makes no use of internal state suffices
In a POMDP, different states may have similar probability distributions over observations:
- Different states may look the same to the agent; for this reason, POMDPs are said to have hidden state
- Example: two hallways may look the same to a robot's sensors, yet the optimal action in the first is to take a left and in the second to take a right
- A memoryless policy cannot distinguish between the two

MDPs vs. POMDPs (cont.)
- Noise can create ambiguity in state inference – an agent's sensors are always limited in the amount of information they can pick up
- One way of overcoming this is to add sensors, e.g. specific sensors that help the robot “disambiguate” hallways – but only when possible, affordable or desirable
- In general, we are now considering agents that need to be proactive (also called “anticipatory”): they do not merely react to environmental stimuli, but self-create context using memory
- POMDP problems are harder to solve, but represent realistic scenarios

POMDP solution techniques – model-based methods
- If an exact model of the environment is available, POMDPs can (in theory) be solved, i.e. an optimal policy can be found
- As with model-based MDPs, this is not so much a learning problem: no real “learning” or trial and error takes place, and there is no exploration/exploitation dilemma
- Rather, it is a probabilistic planning problem: find the optimal policy
- In POMDPs, this is broken into two elements: belief state computation, and value function computation based on belief states

Belief state
- Instead of maintaining the complete action/observation history, we maintain a belief state b: a probability distribution over the states
- Dim(b) = |S| – 1, since the entries must sum to 1; the belief space is the entire probability simplex
- We'll use a two-state POMDP as a running example: if the probability of being in state one is p, the probability of being in state two is 1 – p
- Therefore, the entire space of belief states can be represented as a line segment

The belief space
Here is a representation of the belief space when we have two states (s0, s1): a line segment running from b = [1, 0] (certainly s0) to b = [0, 1] (certainly s1)

The belief space (cont.)
- The belief space is continuous, but we only visit a countable number of belief points
- Assumptions: finite action set, finite observation set
- Next belief state: b' = f(b, a, o), where b is the current belief state, a the action, and o the observation

The Tiger Problem
- You are standing in front of two closed doors; the world is in one of two states: the tiger is behind the left door or behind the right door
- Three actions: open left door, open right door, listen
- Listening is not free, and not accurate (it may give wrong information)
- Reward: open the wrong door and get eaten by the tiger (large negative reward); open the other door and get a prize (small positive reward)

Tiger Problem: POMDP Formulation
Two states: SL and SR (the tiger is really behind the left or the right door)
Three actions: LEFT, RIGHT, LISTEN
Transition probabilities:
- LISTEN does not change the tiger's position: Pr(SL → SL) = Pr(SR → SR) = 1.0
- Opening either door (LEFT or RIGHT) ends the episode with a “reset”: the next state is SL or SR with probability 0.5 each

Tiger Problem: POMDP Formulation (cont.)
Observations: TL (tiger heard on the left) or TR (tiger heard on the right)
Observation probabilities:
- After LISTEN: Pr(TL | SL) = 0.85, Pr(TR | SL) = 0.15, Pr(TL | SR) = 0.15, Pr(TR | SR) = 0.85
- After LEFT or RIGHT (a reset): Pr(TL) = Pr(TR) = 0.5 in either state
Rewards:
- R(SL, LISTEN) = R(SR, LISTEN) = –1
- R(SL, LEFT) = R(SR, RIGHT) = –100 (opening the tiger's door)
- R(SL, RIGHT) = R(SR, LEFT) = +10 (opening the prize door)
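The tables above can be transcribed directly into code. Here is a sketch using plain Python dictionaries (the variable names are ours, not from the lecture):

```python
# Tiger POMDP model, transcribed from the tables above.
S = ["SL", "SR"]                  # tiger behind left / right door
A = ["LEFT", "RIGHT", "LISTEN"]
Obs = ["TL", "TR"]                # tiger heard on the left / right

# T[s][a][s'] = Pr(s' | s, a): LISTEN keeps the state, opening a door resets.
T = {s: {"LISTEN": {s2: 1.0 if s2 == s else 0.0 for s2 in S},
         "LEFT":   {s2: 0.5 for s2 in S},
         "RIGHT":  {s2: 0.5 for s2 in S}} for s in S}

# O[s2][a][o] = Pr(o | s', a): listening is 85% accurate, else uninformative.
O = {"SL": {"LISTEN": {"TL": 0.85, "TR": 0.15},
            "LEFT":   {"TL": 0.5,  "TR": 0.5},
            "RIGHT":  {"TL": 0.5,  "TR": 0.5}},
     "SR": {"LISTEN": {"TL": 0.15, "TR": 0.85},
            "LEFT":   {"TL": 0.5,  "TR": 0.5},
            "RIGHT":  {"TL": 0.5,  "TR": 0.5}}}

# R[s][a]: listening costs 1; wrong door is -100; prize door is +10.
R = {"SL": {"LISTEN": -1, "LEFT": -100, "RIGHT": 10},
     "SR": {"LISTEN": -1, "LEFT": 10,   "RIGHT": -100}}
```

Writing the model out this way makes the structure obvious: only LISTEN is informative, and only LISTEN preserves the state.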

POMDP Policy Tree (Fake Policy)
[Figure: an example (fake) policy tree. The starting belief state has tiger-left probability 0.3; the root action is LISTEN, and each observation (“tiger roar left” / “tiger roar right”) leads to a new belief state and a next action (LISTEN again or open a door), and so on down the tree]

POMDP Policy Tree (cont.)
[Figure: a generic policy tree with action nodes A1–A8 and branches labeled by observations o1–o5]

How many POMDP policies are possible?
For |A| actions, |O| observations, and horizon T:
- Number of nodes in one policy tree (1 at the root, |O| at the next level, |O|^2 after that, …): N = Σ_{i=0}^{T-1} |O|^i = (|O|^T – 1) / (|O| – 1)
- Number of distinct policy trees: |A|^N
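The counting formula is easy to sanity-check with a throwaway sketch:

```python
def num_policy_trees(num_actions: int, num_obs: int, horizon: int) -> int:
    """Count distinct policy trees: |A|**N, with N = sum of |O|**i for i < T."""
    n_nodes = sum(num_obs ** i for i in range(horizon))  # (|O|^T - 1)/(|O| - 1)
    return num_actions ** n_nodes

# Tiger problem (|A| = 3, |O| = 2) with horizon T = 2:
# a tree has N = 1 + 2 = 3 nodes, so there are 3^3 = 27 distinct trees.
print(num_policy_trees(3, 2, 2))
```

The count grows doubly exponentially in the horizon, which is exactly why exhaustive enumeration of policy trees is hopeless for all but toy problems.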

Computing Belief States
b'(s') = Pr(s' | o, a, b)
       = Pr(s' ∧ o ∧ a ∧ b) / Pr(o ∧ a ∧ b)
       = [Pr(o | s', a, b) Pr(s' | a, b) Pr(a ∧ b)] / [Pr(o | a, b) Pr(a ∧ b)]
       = Pr(o | s', a) Pr(s' | a, b) / Pr(o | a, b)
We will not repeat Pr(o | a, b) on the next slide, but assume it is there: it is treated as a normalizing factor, so that b' sums to 1

Computing Belief States: Numerator
Pr(o | s', a) Pr(s' | a, b) = O(s', a, o) Pr(s' | a, b)
                            = O(s', a, o) Σ_s Pr(s' | a, b, s) Pr(s | a, b)
                            = O(s', a, o) Σ_s Pr(s' | a, s) b(s)    [since Pr(s | a, b) = Pr(s | b) = b(s)]
                            = O(s', a, o) Σ_s T(s, a, s') b(s)
(Please work out some of the details at home!)

Belief update: overall formula
b'(s') ∝ O(s', a, o) Σ_s T(s, a, s') b(s)
The belief state is updated proportionally to the probability of seeing the current observation given state s', and to the probability of arriving at state s' given the action and our previous belief state b. All of these quantities are given by the model.
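The update fits in a few lines of code. Here is a sketch exercised on the tiger problem's LISTEN action (state and observation names are ours):

```python
def update_belief(b, a, o, T, O, states):
    """b'(s') ∝ O(s', a, o) * sum_s T(s, a, s') * b(s), then normalize."""
    new_b = {s2: O[s2][a][o] * sum(T[s][a][s2] * b[s] for s in states)
             for s2 in states}
    z = sum(new_b.values())           # Pr(o | a, b), the normalizing factor
    return {s2: p / z for s2, p in new_b.items()}

# Tiger problem: listen once from the uniform belief and hear the tiger left.
states = ["SL", "SR"]
T = {s: {"LISTEN": {s2: 1.0 if s2 == s else 0.0 for s2 in states}}
     for s in states}                 # LISTEN leaves the state unchanged
O = {"SL": {"LISTEN": {"TL": 0.85, "TR": 0.15}},
     "SR": {"LISTEN": {"TL": 0.15, "TR": 0.85}}}
b = {"SL": 0.5, "SR": 0.5}
b = update_belief(b, "LISTEN", "TL", T, O, states)
print(b["SL"])   # → 0.85: one accurate listen shifts the belief to 85/15
```

Listening again and hearing TL a second time pushes b(SL) further toward 1, while a contradictory TR would pull it straight back to 0.5, since LISTEN does not move the state.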

Belief State (cont.)
Let's look at an example:
- Consider a robot that is initially completely uncertain about its location
- Seeing a door may, as specified by the model, occur in three different locations
- Suppose the robot takes an action and observes a T-junction
- It may be that, given the action, only one of the three states could have led to an observation of a T-junction
- The agent now knows with certainty which state it is in
- The uncertainty does not always disappear so neatly

Finding an optimal policy
- The policy component of a POMDP agent must map the current belief state into an action
- It turns out that the belief state is a sufficient statistic (i.e. it is Markovian): we cannot do better even if we remember the entire history of observations and actions
- We have thus transformed the POMDP into an MDP over belief states
- Good news: we have ways of solving MDPs (GPI algorithms)
- Bad news: the belief state space is continuous!

Value function
- The belief state is the input to the second component of the method: the value function computation
- The belief state is a point in a continuous space of N – 1 dimensions (N = |S|), so the value function must be defined over this infinite space
- Naive application of dynamic programming techniques is therefore infeasible

Value function (cont.)
- Let's assume only two states, S1 and S2
- Belief state [0.25, 0.75] indicates b(s1) = 0.25, b(s2) = 0.75
- With two states, b(s1) is sufficient to indicate the belief state: b(s2) = 1 – b(s1)
[Figure: V(b) plotted over the belief segment from [1, 0] (certainly S1) to [0, 1] (certainly S2), with [0.5, 0.5] at the midpoint]

Piecewise Linear and Convex (PWLC)
- It turns out that the value function is, or can be accurately approximated by, a piecewise-linear and convex function of the belief state
- Intuition for convexity: being certain of a state yields high value, whereas uncertainty lowers the value
[Figure: a convex, piecewise-linear V(b) over the belief segment from [1, 0] to [0, 1]]

Why does PWLC help?
- We can work directly with regions (intervals) of belief space: each linear piece of V(b) dominates over one region
- Each vector corresponds to a policy, and indicates the right action to take in its region of the space
[Figure: three vectors Vp1, Vp2, Vp3 over the belief segment, each maximal in its own region (region1, region2, region3)]
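A PWLC value function is just a maximum over linear functions ("alpha vectors"). This sketch uses invented vectors for illustration, not ones actually computed for the tiger problem:

```python
def pwlc_value(b, alpha_vectors):
    """V(b) = max over alpha vectors of the dot product alpha . b."""
    return max(sum(a_i * b_i for a_i, b_i in zip(alpha, b))
               for alpha in alpha_vectors)

# Three illustrative alpha vectors over two states (all values invented):
alphas = [(10.0, -100.0),   # e.g. "open door 1": good in S1, terrible in S2
          (-100.0, 10.0),   # "open door 2": the mirror image
          (-1.0, -1.0)]     # "gather information": a small cost everywhere

print(pwlc_value((1.0, 0.0), alphas))   # → 10.0 (certain of S1)
print(pwlc_value((0.5, 0.5), alphas))   # → -1.0 (maximal uncertainty)
```

Certainty picks out one of the "open door" vectors, while near-uniform beliefs fall in the region where the information-gathering vector dominates, matching the region structure sketched in the figure: the action to take at belief b is simply the one attached to the maximizing vector.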

POMDPs  better modeling of realistic scenarios Summary POMDPs  better modeling of realistic scenarios Rely on belief states that are derived from observations and actions Can be transformed into an MDP with PWLC for value function approximation