Presentation transcript:

Department of Computer Science Undergraduate Events (more at https://my.cs.ubc.ca/students/development/events)
- Co-op Info Session: Thurs., Sept 17, 12:30 – 1:30 pm, MCLD 202
- Simba Technologies Tech Talk/Info Session: Mon., Sept 21, 6 – 7 pm, DMP 310
- EA Info Session: Tues., Sept 22, 6 – 7 pm, DMP 310

Intelligent Systems (AI-2), Computer Science cpsc422, Lecture 4, Sep 16, 2015. (CPSC 422, Lecture 4, Slide 2)

Announcements
What to do with readings? In a few lectures we will discuss the first research paper. Instructions on what to do are available on the course webpage.
Assignment 0 / Survey results: discussion on Piazza (sign up at piazza.com/ubc.ca/winterterm12015/cpsc422). More than 50% of you took 322 more than a year ago, so make sure you revise the 322 material!
(CPSC 422, Lecture 4, Slide 3)

Lecture Overview
Markov Decision Processes:
- Some ideas and notation
- Finding the Optimal Policy (Value Iteration)
- From Values to the Policy
- Rewards and Optimal Policy
(CPSC 422, Lecture 4, Slide 4)

Sketch of ideas to find the optimal policy for an MDP (Value Iteration)
We first need a couple of definitions:
V^π(s): the expected value of following policy π in state s.
Q^π(s, a), where a is an action: the expected value of performing a in s, and then following policy π.
Can we express Q^π(s, a) in terms of V^π(s)? (X: the set of states reachable from s by doing a.)
Q^π(s, a) = ?  (clicker question; candidate expressions A, B, C were shown on the slide; D. None of the above)
(CPSC 422, Lecture 3, Slide 5)

Discounted Reward Function
Suppose the agent goes through states s_1, s_2, ..., s_k and receives rewards r_1, r_2, ..., r_k. We will use the discounted reward to define the reward for this sequence, i.e., its utility for the agent.
(CPSC 422, Lecture 3, Slide 6)
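The formula on this slide was shown as an image; assuming the standard definition of discounted reward with discount factor γ (0 ≤ γ < 1), it is:

    U(s_1, s_2, \ldots, s_k) = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots + \gamma^{k-1} r_k = \sum_{i=1}^{k} \gamma^{i-1} r_i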

Sketch of ideas to find the optimal policy for an MDP (Value Iteration)
We first need a couple of definitions:
V^π(s): the expected value of following policy π in state s.
Q^π(s, a), where a is an action: the expected value of performing a in s, and then following policy π.
We have, by definition:
Q^π(s, a) = R(s) + γ * Σ_{s' ∈ X} P(s' | s, a) * V^π(s')
where X is the set of states reachable from s by doing a, R(s) is the reward obtained in s, γ is the discount factor, P(s' | s, a) is the probability of getting to s' from s via a, and V^π(s') is the expected value of following policy π in s'.
(CPSC 422, Lecture 3, Slide 7)
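As a minimal illustration of this definition (not from the slides), assuming the transition model is given as a dictionary P[(s, a)] = {s': probability} and the rewards as a dictionary R[s]:

    def q_value(s, a, V, P, R, gamma):
        """Q^pi(s, a) = R(s) + gamma * sum_{s'} P(s'|s,a) * V(s')."""
        return R[s] + gamma * sum(prob * V[s2] for s2, prob in P[(s, a)].items())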

Value of a policy and Optimal policy
We can also compute V^π(s) in terms of Q^π(s, a):
V^π(s) = Q^π(s, π(s))
i.e., the expected value of following π in s equals the expected value of performing the action indicated by π in s, namely π(s), and following π after that.
For the optimal policy π* we also have:
V^{π*}(s) = Q^{π*}(s, π*(s))
(CPSC 422, Lecture 3, Slide 8)
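Continuing the same illustrative Python sketch, where pi is assumed to be a dictionary mapping each state to the action the policy prescribes, and q_value is the function sketched above:

    def v_of_policy(s, pi, V, P, R, gamma):
        """V^pi(s) = Q^pi(s, pi(s)): perform the action pi prescribes in s, then follow pi."""
        return q_value(s, pi[s], V, P, R, gamma)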

Value of Optimal policy
But the optimal policy π* is the one that gives the action that maximizes the future reward for each state.
Remember, for any policy π:
V^π(s) = Q^π(s, π(s)) = R(s) + γ * Σ_{s'} P(s' | s, π(s)) * V^π(s')
So, for the optimal policy:
V^{π*}(s) = R(s) + γ * max_a Σ_{s'} P(s' | s, a) * V^{π*}(s')
(CPSC 422, Lecture 4, Slide 10)

Value Iteration Rationale
Given N states, we can write an equation like the one below for each of them:
V*(s) = R(s) + γ * max_a Σ_{s'} P(s' | s, a) * V*(s')
Each equation contains N unknowns: the V* values for the N states.
N equations in N variables (the Bellman equations): it can be shown that they have a unique solution, the values for the optimal policy.
Unfortunately, the N equations are non-linear because of the max operator, so they cannot be easily solved using techniques from linear algebra.
Value Iteration Algorithm: an iterative approach to find the V* values and the corresponding optimal policy π*.
(CPSC 422, Lecture 4, Slide 11)

Value Iteration in Practice
Let V^{(i)}(s) be the utility of state s at the i-th iteration of the algorithm.
Start with arbitrary utilities on each state s: V^{(0)}(s).
Repeat, simultaneously for every s, until there is "no change":
V^{(i+1)}(s) ← R(s) + γ * max_a Σ_{s'} P(s' | s, a) * V^{(i)}(s')
A true "no change" in the values of V(s) from one iteration to the next is guaranteed only if the algorithm runs for infinitely long. In the limit, this process converges to a unique set of solutions for the Bellman equations: the total expected rewards (utilities) for the optimal policy.
(CPSC 422, Lecture 4, Slide 12)

VI algorithm
[Figure: the value iteration pseudocode shown on the slide.]
(CPSC 422, Lecture 4, Slide 13)
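The slide's pseudocode was an image; the following is a minimal Python sketch consistent with the update rule above. The dictionary-based representation of P and R, the epsilon threshold, and all names are illustrative assumptions, not taken from the slides:

    def value_iteration(states, actions, P, R, gamma, epsilon=1e-6):
        """Value iteration sketch.
        P[(s, a)] is a dict {s': probability}, R[s] is the reward in s,
        gamma is the discount factor, epsilon the convergence threshold.
        Returns an approximation of the optimal value function V."""
        V = {s: 0.0 for s in states}  # arbitrary initial values, here all 0
        while True:
            new_V = {}
            for s in states:
                # Bellman backup: V(s) <- R(s) + gamma * max_a sum_s' P(s'|s,a) * V(s')
                new_V[s] = R[s] + gamma * max(
                    sum(prob * V[s2] for s2, prob in P[(s, a)].items())
                    for a in actions
                )
            # stop when no value changes by more than epsilon
            if max(abs(new_V[s] - V[s]) for s in states) < epsilon:
                return new_V
            V = new_V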

Example
Suppose, for instance, that we start with values V^{(0)}(s) that are all 0.
[Figure: the example grid, showing the state values at Iteration 0 and Iteration 1.]
(CPSC 422, Lecture 4, Slide 14)

Example (cont'd)
Let's compute V^{(1)}(3,3).
[Figure: the grid values at Iteration 0 and Iteration 1, highlighting state (3,3).]
(CPSC 422, Lecture 4, Slide 15)
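The numbers on this slide were in an image. As an illustration only, assuming the standard grid-world setup (non-terminal reward −0.04, γ = 1, terminal values fixed at +1 for (4,3) and −1 for (4,2), the usual 0.8/0.1/0.1 transition noise, and V^{(0)} = 0 for all non-terminal states), a single backup for (3,3) would look like:

    V^{(1)}(3,3) = R(3,3) + \gamma \max_a \sum_{s'} P(s' \mid (3,3), a)\, V^{(0)}(s')
                 = -0.04 + 1 \cdot \big( 0.8 \cdot V^{(0)}(4,3) + 0.1 \cdot V^{(0)}(3,2) + 0.1 \cdot V^{(0)}(3,3) \big)   % action Right attains the max
                 = -0.04 + 0.8 \cdot 1 = 0.76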

Example (cont'd)
Let's compute V^{(1)}(4,1).
[Figure: the grid values at Iteration 0 and Iteration 1, highlighting state (4,1).]
(CPSC 422, Lecture 4, Slide 16)

After a Full Iteration
[Figure: the grid values after Iteration 1.]
Only the state one step away from a positive reward, (3,3), has gained value; all the others are losing value.
(CPSC 422, Lecture 4, Slide 17)

Some steps in the second iteration
[Figure: the grid values at successive iterations.]
(CPSC 422, Lecture 4, Slide 18)

Example (cont'd)
Let's compute V^{(2)}(2,3).
States two moves away from positive rewards start increasing their value.
[Figure: the grid values during the second iteration.]
(CPSC 422, Lecture 4, Slide 19)

State Utilities as a Function of Iteration #
[Plot: utility of states (3,3), (4,3), (4,2), (1,1), (3,1), and (4,1) versus iteration number.]
Note that values of states at different distances from (4,3) accumulate negative rewards until a path to (4,3) is found.
(CPSC 422, Lecture 4, Slide 20)

Value Iteration: Computational Complexity
Value iteration works by producing successive approximations of the optimal value function. What is the complexity of each iteration?
A. O(|A|^2 |S|)    B. O(|A| |S|^2)    C. O(|A|^2 |S|^2)
...or faster if there is sparsity in the transition function.
(CPSC 422, Lecture 4, Slide 21)

U(s) and R(s)
Note that U(s) and R(s) are quite different quantities:
R(s) is the short-term reward related to s.
U(s) is the expected long-term total reward from s onward, if following π*.
Example: U(s) for the optimal policy we found for the grid problem with R(s) = −0.04 for non-terminal states and γ = 1.
[Figure: the grid with the resulting utilities U(s).]
(CPSC 422, Lecture 4, Slide 22)

Relevance to state-of-the-art MDPs
"Value Iteration (VI) forms the basis of most of the advanced MDP algorithms that we discuss in the rest of the book. ..."
FROM: Planning with Markov Decision Processes: An AI Perspective, Mausam (UW) and Andrey Kolobov (Microsoft Research), Synthesis Lectures on Artificial Intelligence and Machine Learning, June 2012. Free online through UBC.
(CPSC 422, Lecture 4, Slide 23)

Lecture Overview
Markov Decision Processes:
- ...
- Finding the Optimal Policy (Value Iteration)
- From Values to the Policy
- Rewards and Optimal Policy
(CPSC 422, Lecture 4, Slide 24)

Value Iteration: from state values V to π*
Now the agent can choose the action that implements the MEU principle: maximize the expected utility of the subsequent state.
(CPSC 422, Lecture 4, Slide 25)

Value Iteration: from state values V to π*
Now the agent can choose the action that implements the MEU principle: maximize the expected utility of the subsequent state:
π*(s) = argmax_a Σ_{s' ∈ X} P(s' | s, a) * V^{π*}(s')
where X is the set of states reachable from s by doing a, P(s' | s, a) is the probability of getting to s' from s via a, and V^{π*}(s') is the expected value of following policy π* in s'.
(CPSC 422, Lecture 4, Slide 26)
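A minimal Python sketch of this policy-extraction step, illustrative only; it reuses the assumed P dictionary and the V returned by the value_iteration sketch above:

    def extract_policy(states, actions, P, V):
        """pi*(s) = argmax_a sum_{s'} P(s'|s,a) * V(s')."""
        policy = {}
        for s in states:
            # pick the action with the highest expected value of the next state
            policy[s] = max(
                actions,
                key=lambda a: sum(prob * V[s2] for s2, prob in P[(s, a)].items())
            )
        return policy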

Example: from state values V to π*
To find the best action in (1,1):
[Figure: the grid with the state utilities used to determine the best action in (1,1).]
(CPSC 422, Lecture 4, Slide 27)

Optimal policy
This is the policy that we obtain.
[Figure: the resulting optimal policy on the grid.]
(CPSC 422, Lecture 4, Slide 28)

Learning Goals for today's class
You can:
- Define/read/write/trace/debug the Value Iteration (VI) algorithm, and compute its complexity.
- Compute the optimal policy given the output of VI.
- Explain the influence of rewards on the optimal policy.
(CPSC 422, Lecture 4, Slide 29)

TODO for Fri
Read the textbook section on Partially Observable MDPs.
Also do Practice Ex. 9.C.
(CPSC 422, Lecture 4, Slide 30)