1 ECE-517: Reinforcement Learning in Artificial Intelligence
Lecture 15: Partially Observable Markov Decision Processes (POMDPs)
Dr. Itamar Arel
College of Engineering, Electrical Engineering and Computer Science Department
The University of Tennessee
Fall 2011 – October 27, 2011

ECE 517 – Reinforcement Learning in AI 2 Outline
Why use POMDPs?
Formal definition
Belief state
Value function

ECE 517 – Reinforcement Learning in AI 3 Partially Observable Markov Decision Problems (POMDPs)
To introduce POMDPs, consider an example in which an agent learns to drive a car in New York City
The agent can look forward, backward, left or right
It cannot change speed, but it can steer into the lane it is looking at
The observations available to the agent are:
  the direction in which the agent's gaze is directed
  the closest object in the agent's gaze
  whether the object is looming or receding
  the color of the object
  whether a horn is sounding
To drive safely, the agent must steer out of its lane to avoid slow cars ahead and fast cars behind

ECE 517 – Reinforcement Learning in AI 4 POMDP Example
The agent is in control of the middle car
  The car behind is fast and will not slow down
  The car ahead is slower
To avoid a crash, the agent must steer right
However, when the agent is gazing to the right, there is no immediate observation that tells it about the impending crash

ECE 517 – Reinforcement Learning in AI 5 POMDP Example (cont.)
This is not easy when the agent has no explicit goals beyond "performing well"
There are no explicit training patterns such as "if there is a car ahead and to the left, steer right"
However, a scalar reward is provided to the agent as a performance indicator (just like in MDPs)
  The agent is penalized for colliding with other cars or with the road shoulder
  The only goal hard-wired into the agent is to maximize a long-term measure of the reward

ECE 517 – Reinforcement Learning in AI 6 POMDP Example (cont.)
Two significant problems make it difficult to learn under these conditions:
  Temporal credit assignment – if the agent hits another car and is consequently penalized, how does it reason about which sequence of actions should not be repeated, and under what circumstances? This is generally the same as in MDPs
  Partial observability – if the agent is about to hit the car ahead of it, and there is a car to the left, then circumstances dictate that it should steer right; however, when it looks to the right it has no sensory information about what is going on elsewhere
To solve the latter, the agent needs memory – it must build up knowledge of the state of the world around it

ECE 517 – Reinforcement Learning in AI 7 Forms of Partial Observability
Partial observability coarsely pertains to either:
  Lack of important state information in the observations – must be compensated for using memory
  Extraneous information in the observations – the agent needs to learn to ignore it
In our example:
  The color of the car in its gaze is extraneous (unless red cars really do drive faster)
  The agent needs to build a memory-based model of the world in order to accurately predict what will happen
  This creates "belief state" information (as we will see later)
If the agent has access to the complete state, such as a chess-playing machine that can view the entire board:
  It can choose optimal actions without memory
  The Markov property holds – i.e. the future state of the world is simply a function of the current state and action

ECE 517 – Reinforcement Learning in AI 8 Modeling the world as a POMDP
Our setting is that of an agent taking actions in a world according to its policy
The agent still receives feedback about its performance through a scalar reward received at each time step
Formally stated, a POMDP consists of:
  |S| states S = {1, 2, …, |S|} of the world
  |U| actions (or controls) U = {1, 2, …, |U|} available to the policy
  |Y| observations Y = {1, 2, …, |Y|}
  a (possibly stochastic) reward r(i) for each state i in S
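
As a concrete anchor for this tuple, here is a minimal sketch of how such a model could be held in code. It is purely illustrative: the class and field names are assumptions, not part of the lecture, and the reward is kept per-state as on this slide (the Tiger example later uses R(s, a)).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class POMDP:
    """A finite POMDP model (illustrative sketch; names are not from the lecture)."""
    states: List[str]                                     # S = {1, ..., |S|}
    actions: List[str]                                    # U = {1, ..., |U|}
    observations: List[str]                               # Y = {1, ..., |Y|}
    transition: Dict[Tuple[str, str], Dict[str, float]]   # T[(s, a)][s'] = P(s' | s, a)
    observation: Dict[Tuple[str, str], Dict[str, float]]  # O[(a, s')][o] = P(o | a, s')
    reward: Dict[str, float]                              # r(i) for each state i, as on this slide
```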

ECE 517 – Reinforcement Learning in AI 9 Modeling the world as a POMDP (cont.)

ECE 517 – Reinforcement Learning in AI 10 MDPs vs. POMDPs
In an MDP there is one observation for each state:
  The concepts of observation and state are interchangeable
  A memoryless policy that does not make use of internal state suffices
In a POMDP, different states may have similar probability distributions over observations:
  Different states may look the same to the agent
  For this reason, POMDPs are said to have hidden state
Two hallways may look identical to a robot's sensors:
  Optimal action for the first → take left
  Optimal action for the second → take right
  A memoryless policy cannot distinguish between the two
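
As a toy illustration of hidden state (the names and numbers below are invented for this example, not taken from the lecture), two perceptually aliased hallway states can share an identical observation distribution:

```python
# Two distinct hallway states whose observation distributions P(o | s) are identical.
# From a single observation, a memoryless policy cannot tell which hallway it is in,
# so it is forced to pick the same action in both -- even though the optimal actions differ.
obs_prob = {
    "hallway_1": {"wall_left": 0.45, "wall_right": 0.45, "door": 0.10},
    "hallway_2": {"wall_left": 0.45, "wall_right": 0.45, "door": 0.10},
}
assert obs_prob["hallway_1"] == obs_prob["hallway_2"]  # the states are aliased
```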

ECE 517 – Reinforcement Learning in AI 11 MDPs vs. POMDPs (cont.)
Noise can create ambiguity in state inference:
  An agent's sensors are always limited in the amount of information they can pick up
One way of overcoming this is to add sensors:
  Specific sensors that help the agent "disambiguate" the hallways
  Only when possible, affordable or desirable
In general, we are now considering agents that need to be proactive (also called "anticipatory"):
  They not only react to environmental stimuli
  They create their own context using memory
POMDP problems are harder to solve, but they represent realistic scenarios

ECE 517 – Reinforcement Learning in AI 12 POMDP solution techniques – model-based methods
If an exact model of the environment is available, POMDPs can (in theory) be solved:
  i.e. an optimal policy can be found
As with model-based MDPs, this is not so much a learning problem:
  No real "learning" or trial and error takes place
  There is no exploration/exploitation dilemma
  Rather, it is a probabilistic planning problem → find the optimal policy
In POMDPs, this is broken into two elements:
  Belief state computation, and
  Value function computation based on belief states

ECE 517 – Reinforcement Learning in AI 13 The belief state
Instead of maintaining the complete action/observation history, we maintain a belief state b
The belief state is a probability distribution over the states:
  Updated with every observation
  Dim(b) = |S| – 1
The belief space is the entire probability space
We will use a two-state POMDP as a running example:
  Probability of being in state one = p, hence probability of being in state two = 1 – p
  Therefore, the entire space of belief states can be represented as a line segment

ECE 517 – Reinforcement Learning in AI 14 The belief space
Here is a representation of the belief space when we have two states (s0, s1)

ECE 517 – Reinforcement Learning in AI 15 The belief space (cont.)
The belief space is continuous, but we only visit a countable number of belief points
Assumptions:
  Finite action set
  Finite observation set
Next belief state b' = f(b, a, o), where:
  b: current belief state, a: action, o: observation

ECE 517 – Reinforcement Learning in AI 16 The Tiger Problem
You are standing in front of two closed doors
The world is in one of two states: the tiger is behind the left door or behind the right door
Three actions: open the left door, open the right door, listen
Listening is not free, and it is not accurate (you may get wrong information)
Reward:
  Open the wrong door and get eaten by the tiger (large -r)
  Open the correct door and get a prize (small +r)

ECE 517 – Reinforcement Learning in AI 17 Tiger Problem: POMDP Formulation
Two states: SL and SR (the tiger is really behind the left or the right door)
Three actions: LEFT, RIGHT, LISTEN
Transition probabilities (current state → next state):
  LISTEN does not change the tiger's position: P(SL | SL, LISTEN) = P(SR | SR, LISTEN) = 1
  LEFT or RIGHT ends the episode with a "reset": the tiger is placed behind either door with probability 0.5

ECE 517 – Reinforcement Learning in AI 18 Tiger Problem: POMDP Formulation (cont.)
Observations: TL (hear the tiger on the left) or TR (hear the tiger on the right)
Observation probabilities (given the action taken and the next state):
  LISTEN: the correct side is heard with high probability and the wrong side otherwise (0.85 / 0.15 in the standard Tiger problem)
  LEFT or RIGHT: the observation is uninformative – TL and TR each occur with probability 0.5
Rewards:
  R(SL, LISTEN) = R(SR, LISTEN) = -1
  R(SL, LEFT) = R(SR, RIGHT) = -100
  R(SL, RIGHT) = R(SR, LEFT) = +10
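
The two tables above can be combined into a small model. The sketch below follows the slide's rewards and reset dynamics; the 0.85/0.15 listening accuracy is an assumption taken from the standard Tiger problem, since the exact table entries do not survive in this transcript.

```python
# Tiger POMDP: a minimal sketch. The 0.85/0.15 listening accuracy is assumed
# (classic Tiger-problem value); the rewards and reset dynamics follow the slides.
STATES = ["SL", "SR"]             # tiger behind the left / right door
ACTIONS = ["LISTEN", "LEFT", "RIGHT"]
OBSERVATIONS = ["TL", "TR"]       # heard tiger left / right

# T[a][s][s'] = P(s' | s, a): listening keeps the state; opening a door resets it.
T = {
    "LISTEN": {"SL": {"SL": 1.0, "SR": 0.0}, "SR": {"SL": 0.0, "SR": 1.0}},
    "LEFT":   {"SL": {"SL": 0.5, "SR": 0.5}, "SR": {"SL": 0.5, "SR": 0.5}},
    "RIGHT":  {"SL": {"SL": 0.5, "SR": 0.5}, "SR": {"SL": 0.5, "SR": 0.5}},
}

# O[a][s'][o] = P(o | a, s'): listening is informative, opening a door is not.
O = {
    "LISTEN": {"SL": {"TL": 0.85, "TR": 0.15}, "SR": {"TL": 0.15, "TR": 0.85}},
    "LEFT":   {"SL": {"TL": 0.5,  "TR": 0.5},  "SR": {"TL": 0.5,  "TR": 0.5}},
    "RIGHT":  {"SL": {"TL": 0.5,  "TR": 0.5},  "SR": {"TL": 0.5,  "TR": 0.5}},
}

# R[s][a]: -1 for listening, -100 for opening the tiger's door, +10 otherwise.
R = {
    "SL": {"LISTEN": -1.0, "LEFT": -100.0, "RIGHT": +10.0},
    "SR": {"LISTEN": -1.0, "LEFT": +10.0,  "RIGHT": -100.0},
}
```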

ECE 517 – Reinforcement Learning in AI 19 POMDP Policy Tree (Fake Policy)
(Figure: an example policy tree for the Tiger problem. Starting from a belief state with tiger-left probability 0.3, the root action is Listen; the observation branches "tiger roar left" and "tiger roar right" lead to updated belief states (0.6, 0.15, 0.9, …), each followed by a further action such as Listen or Open Left door, and so on.)

ECE 517 – Reinforcement Learning in AI 20 POMDP Policy Tree (cont.)
(Figure: a generic policy tree. Each node is an action (A1, A2, …, A8) and each branch is an observation (o1, o2, …, o6); after every action the tree branches on every possible observation, down to the horizon.)

ECE 517 – Reinforcement Learning in AI 21 How many POMDP policies are possible?
How many policy trees are there, given |A| actions, |O| observations and horizon T?
How many nodes are in a tree?
  The levels contain 1, |O|, |O|^2, … nodes, so
  N = sum_{i=0}^{T-1} |O|^i = (|O|^T – 1) / (|O| – 1)
How many trees?
  |A|^N
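
A quick way to get a feel for these counts is to compute them. The snippet below is an illustrative sketch; the Tiger problem's |A| = 3, |O| = 2 are used only as an example.

```python
def num_policy_trees(num_actions: int, num_obs: int, horizon: int) -> int:
    """Count the distinct policy trees of depth `horizon`."""
    # Nodes per tree: 1 + |O| + |O|^2 + ... + |O|^(T-1) = (|O|^T - 1) / (|O| - 1)
    nodes = (num_obs ** horizon - 1) // (num_obs - 1)
    # Each node independently picks one of |A| actions.
    return num_actions ** nodes

# Tiger problem (|A| = 3, |O| = 2): growth is doubly exponential in the horizon.
for T_horizon in (1, 2, 3, 4):
    print(T_horizon, num_policy_trees(3, 2, T_horizon))
# prints: 1 -> 3, 2 -> 27, 3 -> 2187, 4 -> 14348907
```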

ECE 517 – Reinforcement Learning in AI 22 Belief State
Overall formula: b'(s') = P(o | s', a) · sum_s P(s' | s, a) · b(s) / P(o | a, b)
The belief state is updated proportionally to:
  the probability of seeing the current observation given state s', and
  the probability of arriving at state s' given the action and our previous belief state b
Both of these quantities are given by the model
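
Translated into code, the update is a prediction step followed by a correction and a normalization. The sketch below assumes the dictionary layout used in the Tiger model sketch above (T[a][s][s'] and O[a][s'][o]); it is illustrative, not the lecture's implementation.

```python
def belief_update(b, a, o, states, T, O):
    """Compute b'(s') proportional to O(o | a, s') * sum_s T(s' | s, a) * b(s), then normalize.

    b: dict mapping state -> probability, a: action, o: observation.
    T[a][s][s'] and O[a][s'][o] follow the Tiger-model sketch above (an assumed layout).
    """
    new_b = {}
    for s_next in states:
        pred = sum(T[a][s][s_next] * b[s] for s in states)  # prediction: where could we be now?
        new_b[s_next] = O[a][s_next][o] * pred               # correction: weight by the observation
    norm = sum(new_b.values())                               # equals P(o | a, b)
    return {s: p / norm for s, p in new_b.items()}

# Example with the Tiger model: start uniform, listen, and hear the tiger on the left.
# b0 = {"SL": 0.5, "SR": 0.5}
# b1 = belief_update(b0, "LISTEN", "TL", STATES, T, O)   # -> {"SL": 0.85, "SR": 0.15}
```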

ECE 517 – Reinforcement Learning in AI 23 Belief State (cont.)
Let's look at an example:
  Consider a robot that is initially completely uncertain about its location
  Seeing a door may, as specified by the model, occur in three different locations
  Suppose that the robot takes an action and then observes a T-junction
  It may be that, given the action, only one of the three states could have led to an observation of a T-junction
The agent now knows with certainty which state it is in
The uncertainty does not always disappear like this

ECE 517 – Reinforcement Learning in AI 24 Finding an optimal policy
The policy component of a POMDP agent must map the current belief state into an action
It turns out that the belief state is a sufficient statistic (i.e. it is Markovian):
  We cannot do better even if we remember the entire history of observations and actions
We have now transformed the POMDP into an MDP:
  Good news: we have ways of solving those (GPI algorithms)
  Bad news: the belief state space is continuous!

ECE 517 – Reinforcement Learning in AI 25 Value function
The belief state is the input to the second component of the method: the value function computation
The belief state is a point in a continuous space of |S| – 1 dimensions!
The value function must be defined over this infinite space
A direct application of dynamic programming techniques is therefore infeasible

ECE 517 – Reinforcement Learning in AI 26 Value function (cont.)
Let's assume only two states: s1 and s2
The belief state [0.25, 0.75] indicates b(s1) = 0.25, b(s2) = 0.75
With two states, b(s1) is sufficient to indicate the belief state: b(s2) = 1 – b(s1)
(Figure: the value function V(b) plotted over the belief segment from s1 = [1, 0] to s2 = [0, 1], with midpoint [0.5, 0.5].)

ECE 517 – Reinforcement Learning in AI 27 Piecewise Linear and Convex (PWLC)
It turns out that the value function is, or can be accurately approximated by, a piecewise linear and convex function
Intuition for convexity: being certain of a state yields high value, whereas uncertainty lowers the value
(Figure: a piecewise linear and convex V(b) over the belief segment from s1 = [1, 0] to s2 = [0, 1].)

ECE 517 – Reinforcement Learning in AI 28 Why does PWLC help?
We can work directly with regions (intervals) of the belief space!
The vectors are policies, and they indicate the right action to take in each region of the space
(Figure: V(b) over the belief segment as the upper surface of three vectors Vp1, Vp2, Vp3, which partition the segment into region 1, region 2 and region 3.)
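
In code, "working with regions" amounts to taking a maximum over a small set of alpha vectors, one per candidate policy. The vectors below are invented purely for illustration and are not the lecture's values.

```python
# Each alpha vector is a linear function over the belief space; its inner product
# with a belief b gives the value of following that vector's policy.
# V(b) = max_alpha (alpha . b) -- the upper surface is piecewise linear and convex.
alpha_vectors = {            # illustrative values only
    "p1": [10.0, -2.0],      # components are the values at b = [1, 0] and b = [0, 1]
    "p2": [5.0, 5.0],
    "p3": [-2.0, 10.0],
}

def value(b):
    """Evaluate V(b) and return the maximizing policy label together with its value."""
    scores = {name: sum(a_i * b_i for a_i, b_i in zip(alpha, b))
              for name, alpha in alpha_vectors.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

print(value([0.9, 0.1]))   # deep in region 1 -> ('p1', ~8.8)
print(value([0.5, 0.5]))   # middle of the belief segment -> ('p2', 5.0)
```

The belief points where the maximizing vector changes are exactly the region boundaries in the figure.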

ECE 517 – Reinforcement Learning in AI 29 Summary
POMDPs model realistic scenarios more accurately
They rely on belief states that are derived from observations and actions
A POMDP can be transformed into an MDP over belief states, with a PWLC value function approximation
What if we don't have a model?
  Next class: (recurrent) neural networks come to the rescue …