ECE 517: Reinforcement Learning in Artificial Intelligence


ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 5: The Reinforcement Learning Problem, Markov Decision Processes (MDPs)
September 1, 2011
Dr. Itamar Arel
College of Engineering, Department of Electrical Engineering and Computer Science
The University of Tennessee, Fall 2011

Outline
- The reinforcement learning problem
- Sequential decision making processes
- Markov decision processes (MDPs)
- Value functions and the Bellman equation

Agent-Environment Interface
- We will consider an agent interacting with an environment
- The environment is defined as anything the agent receives sensory information about
- The agent selects actions, and the environment responds

Agent-Environment Interface (cont.)
We will assume that at any time t:
- The agent receives a representation of the environment's state, st ∈ S
- It selects an action at ∈ A(st), where A(st) denotes the set of actions available in state st
- One time step later, as a consequence of the action taken, a reward rt+1 is received
- The policy, πt, can be interpreted as a mapping from states to probabilities of selecting each possible action
- The RL agent continuously adjusts its policy so as to maximize the total amount of reward received over the long run
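
As an illustration of this interaction loop, here is a minimal Python sketch. It assumes a hypothetical environment object with reset() and step() methods and a policy function that returns action probabilities; these names are illustrative and do not come from the lecture.

```python
import random

def run_episode(env, policy, max_steps=1000):
    """Run one agent-environment episode.

    `env` is assumed to expose reset() -> state and step(action) -> (state, reward, done);
    `policy` maps a state to a dict {action: probability}.
    """
    state = env.reset()
    total_reward = 0.0
    for t in range(max_steps):
        action_probs = policy(state)                         # pi_t: state -> action probabilities
        actions, probs = zip(*action_probs.items())
        action = random.choices(actions, weights=probs)[0]   # sample a_t ~ pi_t(s_t, .)
        state, reward, done = env.step(action)               # environment returns s_{t+1}, r_{t+1}
        total_reward += reward
        if done:
            break
    return total_reward
```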

Agent-Environment Interface (cont.)
- This formalism is abstract, generic and flexible
- Time does not have to be in fixed intervals
- Actions can vary as well: high-level (e.g., go to lunch) or low-level (move a motor)
- States can come in a wide variety of forms, from raw signal readings to abstract states of affairs (e.g., symbolic descriptions), and can even be completely intrinsic ("what do I think of next?")
- However, is the probabilistic approach the right one? Can a different mechanism be employed more effectively?

Boundaries between agent and environment
- Not always the same as the physical boundary of a robot's body; the boundary is usually drawn closer to the agent than that
- The robot's motors can be considered part of the environment, as can its sensory subsystems; in an analogy to humans: muscles, bones, etc.
- The general rule is that anything that cannot be arbitrarily changed by the agent belongs to the environment
- Reward is by definition part of the environment
- Knowledge of the environment does not imply an easily solvable problem space (e.g., Rubik's cube)

Representation of metrics
- The manner in which states, actions and rewards are represented is key to the effectiveness of the application; currently this is more art than science
- Example: Pick-and-Place Robot
  - Consider an RL agent learning to control the motion of a robot arm
  - Goal: generate fast and smooth motion
  - Low-latency readings of positions and velocities need to be presented directly to the agent
  - Actions may be, for example, the voltages applied to each joint
  - States might be the latest readings of joint angles and velocities
  - The reward may be +1 for every successful placement (0 elsewhere)
  - To improve smoothness of motion, small negative rewards may be issued for "jerky" moves
- Q: How can we speed up the task?
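
The reward design described above can be written as a short function. This is only a sketch of one possible design; the penalty form and its coefficient are assumptions, not values given in the lecture.

```python
def pick_and_place_reward(placement_succeeded, joint_accelerations, jerk_penalty=0.01):
    """Illustrative reward for the pick-and-place example: +1 for a successful
    placement, 0 otherwise, minus a small penalty for 'jerky' motion
    (here approximated by the magnitude of the joint accelerations)."""
    reward = 1.0 if placement_succeeded else 0.0
    reward -= jerk_penalty * sum(abs(a) for a in joint_accelerations)
    return reward
```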

Goals and Rewards
- The agent strives to maximize long-run cumulative reward
- Robot learning to walk: the agent is given a reward proportional to its forward motion
- In maze-type problems:
  - Option #1: reward of 0 until the agent reaches the goal (then +1)
  - Option #2: reward of -1 for every step until the agent reaches the goal (accelerates learning)
  - Other ideas?
- When designing the setup, make sure that rewards are issued in such a way that maximizing them over time achieves our goal

Returns
- In the simplest form, the return is the sum of all rewards received after time t
- The above holds for finite-duration (episodic/terminal) problems
- In non-episodic cases, however, the return can be infinite
- We therefore exploit the concept of discounting: we maximize the sum of discounted rewards (shown below), where 0 < γ < 1 is the discount rate
- Given that rewards are bounded, the discounted return is guaranteed to be finite as well (why?)
- Q: Is the discounted model perfect? How can it be improved?
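
The formulas on this slide did not survive the transcript. In the Sutton and Barto notation that the lecture follows, the undiscounted (episodic) return and the discounted return are presumably:

$$R_t = r_{t+1} + r_{t+2} + r_{t+3} + \cdots + r_T$$

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 < \gamma < 1$$

The second sum is finite whenever the rewards are bounded, since the geometric factor γ^k shrinks the tail.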

The Pole-Balancing Problem
- A pseudo-standard benchmark problem from the field of control theory; served as an early illustration of RL
- The objective is to determine the force applied to a cart so as to keep the pole from falling past a given angle
- Failure: the angle is exceeded or the cart reaches the end of the track
- The reward can be +1 for each time step that the pole remains up, and -1 for failure
- Finite (episodic) formulation: the return is the number of steps before failure occurs
- Infinite (continuing) formulation: the return is the sum of discounted rewards

Sequential Decision Making
- Each day we make decisions that have both immediate and long-term consequences
- For example, in a long race, deciding to sprint at the beginning may deplete energy reserves quickly and result in a poor finish
- Markov Decision Processes (MDPs) provide a formalism for sequential decision making under uncertainty
- Recall that:
  - The agent receives a representation of the environment's state, st ∈ S
  - It selects an action at ∈ A(st), where A(st) denotes the set of actions available in state st
  - One time step later, as a consequence of the action taken, a reward rt+1 is received
- The agent aims to find a good policy (i.e., a mapping from states to actions)

Motivations for using the Markov property
- The state information ideally provides the agent with context
  - e.g., if the answer to a question is "yes", the state should reflect which question came before the answer
- The state carries all "relevant" information the agent may have about its environment (primarily sensory-driven)
- In practical scenarios, the state often provides only partial information
  - e.g., an agent playing card games, or vision-based maze maneuvering
- As a motivational statement: we would ideally like a state signal that compactly summarizes past sensations in a way that all relevant information is retained
- A state signal that achieves the above is said to be Markovian, or to have the Markov property

Markov Property and Model Formulation
- We will assume, for simplicity, a finite number of states and reward values
- Consider the probability of the environment's reaction at time t+1 to the action taken at time t; in general, it may depend on the entire history of states, actions and rewards
- If the state signal has the Markov property, however, the reaction depends only on the current state and action (see the one-step dynamics below)
- This one-step dynamics property allows us to efficiently predict the next state and the expected reward
- A model is called stationary if rewards and transition probabilities are independent of t
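
The two dynamics expressions referenced above were lost in the transcript; in the standard Sutton and Barto notation they are presumably the full-history dynamics and the Markov one-step dynamics:

$$\Pr\{s_{t+1}=s',\; r_{t+1}=r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0\}$$

$$\Pr\{s_{t+1}=s',\; r_{t+1}=r \mid s_t, a_t\}$$

The Markov property holds when these two expressions are equal for all histories and for all s' and r.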

The Pole-Balancing Problem revisited
- In the pole-balancing task introduced before, the state signal would be Markovian if it specified exactly:
  - The position and velocity of the cart
  - The angle between the cart and the pole
  - The angular velocity of the pole
- In an idealized system, this would be enough to fully control the system
- In reality, "noise" enters through the sensors, etc., which can cause violation of the Markov property
- Sometimes even partial state information helps, e.g., coarsely dividing the cart position into 3 segments

[Figure: system evolution over decision epochs. At epoch t the agent observes (st, rt) and selects action at; at epoch t+1 it observes (st+1, rt+1) and selects at+1.]

An alternative perspective on system evolution

Markov Decision Processes
- An RL task that satisfies the Markov property is called a Markov decision process, or MDP
- A particular finite MDP is defined by its transition probabilities and by the expected value of the next reward (see below)
- The above information tells us most of what we need to know to solve any finite MDP
- Notice that MDPs are Markov chains in which each action yields a different transition probability matrix
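
The defining quantities referred to above were dropped from the transcript; in the notation of Sutton and Barto they are presumably the transition probabilities and the expected next rewards:

$$P^a_{ss'} = \Pr\{s_{t+1}=s' \mid s_t=s,\; a_t=a\}$$

$$R^a_{ss'} = E\{r_{t+1} \mid s_t=s,\; a_t=a,\; s_{t+1}=s'\}$$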

Example: The Recycling Robot MDP
- A robot makes decisions at times determined by external events
- At each such time the robot decides whether it should:
  - Actively search for a can,
  - Remain stationary and wait for someone to bring it a can, or
  - Go back to home base to recharge its battery
- The best way to find cans is to actively search for them, but this runs down the robot's battery, whereas waiting does not
- Whenever the robot is searching, the possibility exists that its battery will become depleted
- In this case the robot must shut down and wait to be rescued (resulting in a negative reward)

The Recycling Robot MDP (cont.)
More assumptions:
- The agent makes its decisions solely as a function of the energy level of the battery
- It can distinguish two levels, High and Low, so the state set is S = {High, Low}
- The action set is A = {wait, search, recharge}
- Accordingly, the state-dependent action sets are A(High) = {wait, search} and A(Low) = {wait, search, recharge} (recharging is pointless when the battery is already high)
- A period of searching that begins with a high energy level leaves the energy level high with probability α and reduces it to low with probability 1-α

The Recycling Robot MDP (cont.)
- On the other hand, a period of searching undertaken when the energy level is low leaves it low with probability β and depletes the battery with probability 1-β
- In the latter case, the robot must be rescued, and the battery is then recharged back to high
- Each can collected by the robot counts as a unit reward, whereas a reward of -3 results whenever the robot has to be rescued
- Moreover, let:
  - Rsearch = expected number of cans the robot will collect while in search mode
  - Rwait = expected number of cans the robot will collect while in wait mode

Transition Probabilities and Expected Rewards

s      s'     a          P(s'|s,a)   Expected reward
High   High   search     α           Rsearch
High   Low    search     1-α         Rsearch
Low    High   search     1-β         -3
Low    Low    search     β           Rsearch
High   High   wait       1           Rwait
Low    Low    wait       1           Rwait
Low    High   recharge   1           0
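
For concreteness, the same dynamics can be encoded as a small data structure. This is a sketch only; ALPHA, BETA, R_SEARCH and R_WAIT are placeholder values, since the lecture leaves them as symbols.

```python
# (state, action) -> list of (next_state, probability, expected_reward) triples,
# mirroring the transition table above. The numeric constants are illustrative only.
ALPHA, BETA = 0.8, 0.6          # assumed battery-dynamics probabilities
R_SEARCH, R_WAIT = 2.0, 1.0     # assumed expected cans per search / wait period

recycling_mdp = {
    ("High", "search"):   [("High", ALPHA, R_SEARCH), ("Low", 1 - ALPHA, R_SEARCH)],
    ("Low",  "search"):   [("Low",  BETA,  R_SEARCH), ("High", 1 - BETA, -3.0)],
    ("High", "wait"):     [("High", 1.0,   R_WAIT)],
    ("Low",  "wait"):     [("Low",  1.0,   R_WAIT)],
    ("Low",  "recharge"): [("High", 1.0,   0.0)],
}

def available_actions(state):
    """State-dependent action sets A(s)."""
    return ["wait", "search"] if state == "High" else ["wait", "search", "recharge"]
```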

MDP Transition Graph
- A transition graph is a useful way to summarize the dynamics of a finite MDP
- There is a state node for each possible state and an action node for each possible state-action pair

Value Functions
- Almost all reinforcement learning algorithms are based on estimating value functions
  - Option #1: how good is it to be in a given state?
  - Option #2: how good is it to take a specific action in a given state?
- The notion of "how good" is measured by the expected return, R
- Recall that a policy, π, is a mapping from each state s and action a to the probability π(s,a) of selecting a in s
- Informally, the value of a state is the expected return when starting in that state and following π thereafter
- We define the state-value function for policy π as shown below
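
The definition itself was lost in the transcript; in Sutton and Barto's notation it presumably reads:

$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\Big|\; s_t = s\Big\}$$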

Value Functions (cont.)
- Similarly, we define the action-value function for policy π as the value of taking action a in state s under policy π, denoted by Qπ(s,a) (see below)
- The value functions Vπ(s) and Qπ(s,a) can be estimated from experience
  - Vπ(s): an agent that follows policy π from a given state and maintains an average of the actual returns will see this average converge to the state's value
  - Qπ(s,a): if separate averages are kept for each action taken in a state, these averages will similarly converge to the state-action values
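
Again the formula did not survive the transcript; the standard (assumed) form is:

$$Q^\pi(s,a) = E_\pi\{R_t \mid s_t = s,\, a_t = a\} = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\Big|\; s_t = s,\, a_t = a\Big\}$$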

Fundamental recursive relationship (Bellman Equation)
- For any policy π and any state s, the following consistency condition holds between the value of s and the values of its possible successor states
- This is the Bellman equation for Vπ, shown below
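
The equation itself was dropped from the transcript; in the notation used above it presumably reads:

$$V^\pi(s) = \sum_{a} \pi(s,a) \sum_{s'} P^a_{ss'} \Big[ R^a_{ss'} + \gamma\, V^\pi(s') \Big]$$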

Illustration of the Bellman Equation
- Think of looking ahead from one state to its possible successor states
- The Bellman equation averages over all the possibilities, weighting each by its probability of occurring
- It states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way

The Backup Diagram
- The value function Vπ(s) is the unique solution to its Bellman equation
- We will later see how the Bellman equation forms the basis for methods to compute, approximate and learn Vπ(s); a small sketch of one such computation follows below
- Diagrams like the one shown on the previous slide are called backup diagrams
- These diagrams form the basis of the update, or backup, operations that are at the heart of RL
- These operations transfer value information back to a state (or a state-action pair) from its successor states (or state-action pairs)
- Explicit arrowheads are usually omitted, since time always flows downward in a backup diagram
- Read and make sure you understand examples 3.8 and 3.9 in the Sutton & Barto book
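
As a concrete illustration of computing Vπ from the Bellman equation, here is a minimal iterative policy evaluation sketch in Python. It reuses the (state, action) -> transitions dictionary format sketched for the recycling robot; the function name, the value of gamma and the convergence threshold are assumptions, not part of the lecture.

```python
def evaluate_policy(mdp, policy, states, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation: apply the Bellman equation for V^pi as an
    update rule until the value function stops changing.

    mdp:    maps (state, action) -> [(next_state, probability, expected_reward), ...]
    policy: maps state -> {action: probability}
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a, pi_sa in policy[s].items():
                for s_next, prob, reward in mdp[(s, a)]:
                    v_new += pi_sa * prob * (reward + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Example (using the recycling_mdp dictionary sketched earlier):
# always_search = {"High": {"search": 1.0}, "Low": {"search": 1.0}}
# print(evaluate_policy(recycling_mdp, always_search, states=["High", "Low"]))
```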