
CPS 270: Artificial Intelligence (http://www.cs.duke.edu/courses/fall08/cps270/). Markov decision processes, POMDPs. Instructor: Vincent Conitzer

Warmup: a Markov process with rewards. We derive some reward from the weather each day, but cannot influence it. [Figure: a three-state weather Markov chain with states s, c, r, per-state rewards 10, 8, 1, and transition probabilities between the states.] How much utility can we expect in the long run? Depends on the discount factor δ and on the initial state.
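For concreteness, here is a minimal sketch of this weather chain as numpy arrays, which the later sketches reuse. The rewards 10, 8, 1 appear on the slide's figure; the transition probabilities are only partially recoverable from the transcript, so the specific numbers below are assumptions.

```python
import numpy as np

# States: index 0 = s, 1 = c, 2 = r (the slide's three weather states).
# P[i, j] = probability of moving from state i to state j; rows sum to 1.
# Only some entries are recoverable from the slide, so treat these as illustrative.
P = np.array([[0.6, 0.3, 0.1],
              [0.4, 0.3, 0.3],   # row for c, matching v(c) = 8 + δ(.4 v(s) + .3 v(c) + .3 v(r))
              [0.2, 0.3, 0.5]])

# Per-state rewards from the figure: 10 in s, 8 in c, 1 in r.
R = np.array([10.0, 8.0, 1.0])

delta = 0.9  # discount factor (an assumed value; the slide leaves δ unspecified)
```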

Figuring out long-term rewards. Let v(s) be the (long-term) expected utility from being in state s now, and let P(s, s') be the transition probability from s to s'. We must have, for all s: v(s) = R(s) + δ Σ_{s'} P(s, s') v(s'). [Figure: the same weather Markov chain as on the previous slide.] E.g., v(c) = 8 + δ(.4 v(s) + .3 v(c) + .3 v(r)). Solve this system of linear equations to obtain the values of all states.
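A minimal sketch of solving that linear system with numpy (reusing the P, R, and delta arrays assumed above):

```python
import numpy as np

def evaluate_markov_chain(P, R, delta):
    """Solve v = R + delta * P v, i.e. (I - delta * P) v = R."""
    n = len(R)
    return np.linalg.solve(np.eye(n) - delta * P, R)

v = evaluate_markov_chain(P, R, delta)   # long-run expected discounted utility per state
```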

Iteratively updating values. If we do not want to solve the system of equations (e.g., because there are too many states), we can instead iteratively update values until convergence. Let v_i(s) be the value estimate after i iterations: v_i(s) = R(s) + δ Σ_{s'} P(s, s') v_{i-1}(s'). This will converge to the right values. If we initialize v_0 = 0 everywhere, then v_i(s) is the expected utility with only i steps left (finite horizon): a dynamic program from the future to the present. This also shows why we get convergence: because of discounting, the far future does not contribute much.
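A small sketch of the iterative update, again using the assumed arrays from above:

```python
import numpy as np

def iterate_values(P, R, delta, tol=1e-8, max_iters=10000):
    """Apply v_i = R + delta * P v_{i-1}, starting from v_0 = 0, until the change is tiny."""
    v = np.zeros(len(R))
    for _ in range(max_iters):
        v_new = R + delta * P @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v
```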

Markov decision process (MDP). Like a Markov process, except that every round we make a decision. Transition probabilities depend on the action taken: P(S_{t+1} = s' | S_t = s, A_t = a) = P(s, a, s'). There is a reward for every (state, action) pair: R(S_t = s, A_t = a) = R(s, a). Sometimes people just use R(s); R(s, a) is sometimes a little more convenient. Discount factor δ.

Example MDP. The machine can be in one of three states: good, deteriorating, broken. It can take two actions: maintain, ignore.
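The slide's diagram for this machine MDP (with its probabilities and rewards) is not in the transcript, so the numbers in the sketch below are made up for illustration; only the state and action names come from the slide. The later MDP sketches reuse these arrays.

```python
import numpy as np

states = ["good", "deteriorating", "broken"]
actions = ["maintain", "ignore"]

# T[a, s, s'] = P(s' | s, a); each row T[a, s, :] sums to 1.  (Hypothetical numbers.)
T = np.array([
    [[0.9, 0.1, 0.0],    # maintain, from good
     [0.6, 0.3, 0.1],    # maintain, from deteriorating
     [0.8, 0.1, 0.1]],   # maintain, from broken
    [[0.7, 0.2, 0.1],    # ignore, from good
     [0.0, 0.6, 0.4],    # ignore, from deteriorating
     [0.0, 0.0, 1.0]],   # ignore, from broken
])

# R[s, a] = reward for taking action a in state s.
# (Hypothetical: maintaining costs a bit, being broken is bad.)
R = np.array([[ 8.0, 10.0],
              [ 5.0,  6.0],
              [-5.0,  0.0]])

delta = 0.9  # assumed discount factor
```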

Policies. No time period is different from the others, so the optimal thing to do in state s should not depend on the time period, because of the infinite horizon. (With a finite horizon, we would not want to maintain the machine in the last period.) A policy is a function π from states to actions. Example policy: π(good) = ignore, π(deteriorating) = ignore, π(broken) = maintain.

Evaluating a policy. Key observation: MDP + policy = Markov process with rewards. We already know how to evaluate a Markov process with rewards: solve a system of linear equations. This gives an algorithm for finding the optimal policy: try every possible policy and evaluate each one. Terribly inefficient, though.
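A minimal sketch of exact policy evaluation for the machine MDP sketched above (array shapes as assumed there):

```python
import numpy as np

def evaluate_policy(T, R, delta, policy):
    """Under a fixed policy the MDP is a Markov process with rewards,
    so solve (I - delta * P_pi) v = R_pi."""
    n = R.shape[0]
    P_pi = T[policy, np.arange(n), :]   # P_pi[s, s'] = P(s' | s, policy[s])
    R_pi = R[np.arange(n), policy]      # R_pi[s] = R(s, policy[s])
    return np.linalg.solve(np.eye(n) - delta * P_pi, R_pi)

# The slide's example policy (ignore, ignore, maintain), with action indices 0 = maintain, 1 = ignore:
v_pi = evaluate_policy(T, R, delta, policy=np.array([1, 1, 0]))
```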

Bellman equation. Suppose you are in state s and you play optimally from there on; this leads to expected value v*(s). Bellman equation: v*(s) = max_a [ R(s, a) + δ Σ_{s'} P(s, a, s') v*(s') ]. Given v*, finding the optimal policy is easy.

Value iteration algorithm for finding the optimal policy. Iteratively update values for states using the Bellman equation. v_i(s) is our estimate of the value of state s after i updates: v_{i+1}(s) = max_a [ R(s, a) + δ Σ_{s'} P(s, a, s') v_i(s') ]. This will converge. If we initialize v_0 = 0 everywhere, then v_i(s) is the optimal expected utility with only i steps left (finite horizon). Again, a dynamic program from the future to the present.
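A minimal sketch of value iteration for an MDP stored as the T and R arrays assumed earlier:

```python
import numpy as np

def value_iteration(T, R, delta, tol=1e-8, max_iters=10000):
    """Repeatedly apply v_{i+1}(s) = max_a [ R(s, a) + delta * sum_{s'} P(s, a, s') v_i(s') ]."""
    v = np.zeros(R.shape[0])
    for _ in range(max_iters):
        q = R + delta * (T @ v).T        # q[s, a]; (T @ v) has shape (actions, states)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    return v_new, q.argmax(axis=1)       # values and a greedy (optimal) policy
```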

Policy iteration algorithm for finding the optimal policy. It is easy to compute values given a fixed policy (no max operator). Alternate between evaluating the policy and updating the policy: solve for the value function v_i based on π_i, then set π_{i+1}(s) = arg max_a [ R(s, a) + δ Σ_{s'} P(s, a, s') v_i(s') ]. This will converge.
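A minimal sketch of policy iteration, reusing evaluate_policy from the earlier sketch:

```python
import numpy as np

def policy_iteration(T, R, delta):
    """Alternate exact policy evaluation with greedy policy improvement until stable."""
    n_states = R.shape[0]
    policy = np.zeros(n_states, dtype=int)         # arbitrary initial policy
    while True:
        v = evaluate_policy(T, R, delta, policy)   # defined in the earlier sketch
        q = R + delta * (T @ v).T                  # q[s, a]
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy
```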

Mixing things up. We do not need to update every state every time; it makes sense to focus on the states where we will spend most of our time. In policy iteration, it may not make sense to compute state values exactly, since we will soon change the policy anyway: just use some value iteration updates (with a fixed policy, as we did earlier). Being flexible here leads to faster solution methods.
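One way to realize this, as a sketch: in the evaluation step of policy iteration, run a few fixed-policy backups instead of solving the linear system exactly (the number of sweeps below is an arbitrary choice):

```python
import numpy as np

def approx_evaluate_policy(T, R, delta, policy, v, sweeps=5):
    """A few value-iteration-style backups under a fixed policy, starting from the
    current estimate v, instead of an exact linear-system solve."""
    s_idx = np.arange(len(policy))
    P_pi = T[policy, s_idx, :]      # P_pi[s, s'] = P(s' | s, policy[s])
    R_pi = R[s_idx, policy]         # R_pi[s] = R(s, policy[s])
    for _ in range(sweeps):
        v = R_pi + delta * P_pi @ v
    return v
```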

Linear programming approach. If only v*(s) = max_a [ R(s, a) + δ Σ_{s'} P(s, a, s') v*(s') ] were linear in the v*(s)… But we can do it as follows: minimize Σ_s v(s) subject to, for all s and a, v(s) ≥ R(s, a) + δ Σ_{s'} P(s, a, s') v(s'). The solver will try to push the v(s) down as far as possible, so that the constraints are tight for the optimal actions.
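A minimal sketch of this linear program using scipy (assumed available), with each constraint rewritten in the ≤ form that linprog expects:

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(T, R, delta):
    """Minimize sum_s v(s) subject to v(s) >= R(s, a) + delta * sum_{s'} P(s, a, s') v(s')."""
    n_actions, n_states, _ = T.shape
    A_ub, b_ub = [], []
    for s in range(n_states):
        for a in range(n_actions):
            # Constraint rewritten as: delta * P(s, a, .) . v - v(s) <= -R(s, a)
            row = delta * T[a, s, :].copy()
            row[s] -= 1.0
            A_ub.append(row)
            b_ub.append(-R[s, a])
    res = linprog(c=np.ones(n_states), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n_states)   # v is unrestricted in sign
    return res.x                                      # the optimal values v*(s)
```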

Partially observable Markov decision processes (POMDPs). Markov process + partial observability = HMM. Markov process + actions = MDP. Markov process + partial observability + actions = HMM + actions = MDP + partial observability = POMDP.

                 full observability     partial observability
no actions       Markov process         HMM
actions          MDP                    POMDP

Example POMDP. We need to specify observations, e.g., does the machine fail on a single job? P(fail | good) = .1, P(fail | deteriorating) = .2, P(fail | broken) = .9. We can also let these probabilities depend on the action taken.

Optimal policies in POMDPs. We cannot simply use π(s) because we do not know s. We can, however, maintain a probability distribution over s using filtering: P(S_t | A_1 = a_1, O_1 = o_1, …, A_{t-1} = a_{t-1}, O_{t-1} = o_{t-1}). This gives a belief state b, where b(s) is our current probability for s. Key observation: the policy only needs to depend on b: π(b).
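A minimal sketch of the filtering update that maintains the belief state. The observation model O[a, s', o] = P(o | s', a) is an assumed convention; the slide's failure-probability example fits it with observations 0 = fail, 1 = no fail.

```python
import numpy as np

def update_belief(b, a, o, T, O):
    """Bayes filter step: given belief b, the action a that was taken, and the
    observation o that was then received, return the updated belief over states."""
    predicted = b @ T[a]                  # predicted[s'] = sum_s b(s) P(s' | s, a)
    unnormalized = predicted * O[a, :, o]
    return unnormalized / unnormalized.sum()

# Observation model built from the slide's example, assumed independent of the action:
p_fail = np.array([0.1, 0.2, 0.9])                           # P(fail | good, deteriorating, broken)
O = np.stack([np.stack([p_fail, 1 - p_fail], axis=1)] * 2)   # shape (actions, states, observations)
```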

Solving a POMDP as an MDP on belief states. If we think of the belief state as the state, then the state is observable and we have an MDP. [Figure: a fragment of this belief-state MDP: from the belief (.3, .4, .3), the actions maintain and ignore combined with the observations "success" and "failure" lead to new belief states such as (.5, .3, .2), (.6, .3, .1), (.2, .2, .6), and (.4, .2, .2); the slide adds the disclaimer that these numbers were not actually calculated.] The reward for an action from a belief state = the expected reward given the belief state. But we now have a large, continuous belief space… much more difficult.
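As a small sketch of the quantities the belief-state MDP needs (using the array conventions assumed above): the expected immediate reward of an action under a belief, and the probability of each observation, which determines where the belief transitions.

```python
import numpy as np

def belief_reward(b, a, R):
    """Expected immediate reward of action a under belief b: sum_s b(s) R(s, a)."""
    return b @ R[:, a]

def observation_prob(b, a, o, T, O):
    """P(o | b, a): how likely observation o is after taking action a from belief b."""
    return (b @ T[a]) @ O[a, :, o]
```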