ECE 517: Reinforcement Learning in Artificial Intelligence


1 ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 5: The Reinforcement Learning Problem, Markov Decision Processes (MDPs)
September 1, 2011
Dr. Itamar Arel
College of Engineering, Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2011

2 Outline
The reinforcement learning problem
Sequential decision-making processes
Markov Decision Processes (MDPs)
Value functions and the Bellman equation

3 Agent-Environment Interface
We will consider an agent interacting with an environment
The environment will be defined as anything the agent receives sensory information about
The agent selects actions → the environment responds

4 Agent-Environment Interface (cont.)
We will assume that at any time t …
The agent receives a representation of the environment's state, s_t ∈ S
It selects an action a_t ∈ A(s_t), where A(s_t) denotes the set of possible actions available at state s_t
One time step later, as a consequence of the action taken, a reward, r_{t+1}, is received
The policy, π_t, can be interpreted as a mapping from states to probabilities of selecting each possible action
The RL agent continuously adjusts its policy so as to maximize the total amount of reward received over the long run
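A minimal sketch of this interaction loop may help make the formalism concrete. The `env` and `agent` objects and their methods (`reset`, `step`, `select_action`, `update`) are hypothetical placeholders introduced here for illustration; they are not part of the lecture or of any specific library.

```python
# Minimal sketch of the agent-environment loop described above.
# `env` and `agent` are hypothetical objects; their methods are illustrative
# placeholders, not a specific API.

def run_episode(env, agent, max_steps=1000):
    s = env.reset()                     # agent receives initial state s_t
    total_reward = 0.0
    for t in range(max_steps):
        a = agent.select_action(s)      # sample a_t from the policy pi_t(s, .)
        s_next, r, done = env.step(a)   # environment responds with r_{t+1}, s_{t+1}
        agent.update(s, a, r, s_next)   # agent adjusts its policy from experience
        total_reward += r
        s = s_next
        if done:
            break
    return total_reward
```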

5 Agent-Environment Interface (cont.)
This formalism is abstract, generic and flexible
Time does not have to advance in fixed intervals
Actions can vary as well: high-level (e.g. go to lunch) or low-level (e.g. move a motor)
States can come in a wide variety of forms, from raw signal readings to abstract states of affairs (e.g. symbolic descriptions)
States can even be completely intrinsic ("what do I think of next?")
However, is the probabilistic approach the right one? Can a different mechanism be employed more effectively?

6 Boundaries between agent and environment
The boundary is not always the same as the physical boundary of a robot's body
Boundaries are usually closer to the agent than that
A robot's motors can be considered part of the environment, and so can its sensory subsystems
In an analogy to humans: muscles, bones, etc.
The general rule: anything that cannot be arbitrarily changed by the agent → part of the environment
The reward is, by definition, part of the environment
Knowledge of the environment does not imply an easily solvable problem space → e.g. the Rubik's cube

7 Representation of metrics
The manner in which states, actions and rewards are represented is key to the effectiveness of the application
Currently more art than science
Example: Pick-and-Place Robot
Consider an RL agent learning to control the motion of a robot arm
Goal: generate fast and smooth motion
Low-latency readings of positions and velocities need to be presented directly
Actions may be, for example, voltages applied to each joint
States might be the latest readings of joint angles and velocities
Reward may be +1 for every successful placement (0 elsewhere)
To improve smoothness of motion, small negative rewards may be issued for "jerky" moves
Q: How can we speed up the task?
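As a toy illustration of the reward design described above, here is one possible sketch of such a reward signal. The `placed_successfully` and `joint_accelerations` inputs, the jerk threshold, and the penalty scale are all illustrative assumptions, not values from the lecture.

```python
import numpy as np

# Illustrative reward for the pick-and-place example above: +1 for a successful
# placement, 0 otherwise, with a small penalty for "jerky" motion.
# The threshold and penalty scale are assumptions for the sketch.

def reward(placed_successfully: bool, joint_accelerations: np.ndarray,
           jerk_threshold: float = 5.0, jerk_penalty: float = 0.01) -> float:
    r = 1.0 if placed_successfully else 0.0
    jerk = np.abs(joint_accelerations).max()   # crude smoothness proxy
    if jerk > jerk_threshold:
        r -= jerk_penalty * (jerk - jerk_threshold)
    return r
```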

8 Goals and Rewards
The agent strives to maximize long-run cumulative reward
A robot learning to walk: the agent is given a reward proportional to its forward motion
In maze-type problems:
Option #1: reward of 0 until the agent reaches the goal (then +1)
Option #2: reward of –1 for every step until the agent reaches the goal (accelerates learning)
Other ideas?
When designing the setup, make sure that rewards are issued in such a way that maximizing them over time achieves our goal

9 Returns
In the simplest form, the return is the sum of all rewards
This holds for finite-duration (episodic/terminal) problems
In non-episodic cases, however, the return can be infinite
We therefore exploit the concept of discounting
Accordingly, we maximize the sum of discounted rewards, where 0 ≤ γ < 1 is the discount rate
Given that rewards are bounded, the discounted return is guaranteed to be finite as well (why?)
Q: Is the discounted model perfect? How can it be improved?
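The return formulas this slide refers to did not survive the transcript; a reconstruction of the standard definitions, in Sutton & Barto's notation, is:

```latex
% Undiscounted (episodic) return, with T the final time step:
R_t = r_{t+1} + r_{t+2} + \cdots + r_T

% Discounted return, with discount rate 0 \le \gamma < 1:
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}
```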

10 The Pole-Balancing Problem
A pseudo-standard benchmark problem from the field of control theory
Served as an early illustration of RL
The objective is to determine the force applied to a cart so as to keep the pole from falling past a given angle
Failure: the angle is exceeded or the cart reaches the end of the track
Reward can be +1 for each time step on which the pole has not fallen, and –1 upon failure
Finite (episodic) formulation: return → number of steps before failure occurs
Infinite (continuing) formulation: return → sum of discounted rewards

11 Sequential Decision Making
Each day we make decisions that have both immediate and long-term consequences
For example, in a long race, deciding to sprint at the beginning may deplete energy reserves quickly → may result in a poor finish
Markov Decision Processes (MDPs) provide us with a formalism for sequential decision making under uncertainty
Recall that …
The agent receives a representation of the environment's state, s_t ∈ S
It selects an action a_t ∈ A(s_t), where A(s_t) denotes the set of possible actions available at state s_t
One time step later, as a consequence of the action taken, a reward, r_{t+1}, is received
The agent aims to find a good policy (i.e. a mapping from states → actions)

12 Motivations for using the Markov property
The state information ideally provides the agent with context
e.g.: if the answer to a question is "yes", the state should reflect which question preceded the answer
It captures all "relevant" information the agent may have about its environment (primarily sensory-driven)
In practical scenarios, the state often provides only partial information
e.g. an agent playing card games, or vision-based maze maneuvering
As a motivational statement: we would ideally like a state signal that compactly summarizes past sensations in such a way that all relevant information is retained
A state signal that achieves the above is said to be Markovian, or to have the Markov property

13 Markov Property and Model Formulation
We will assume, for simplicity, a finite number of states and reward values
Consider the reaction of the environment at time t+1 to the actions taken at time t: in general, it may depend on the entire history
However, if the state signal has the Markov property, the environment's response depends only on the current state and action (the defining equations are reconstructed below)
This one-step dynamics property allows us to efficiently predict the next state and expected reward
A model is called stationary if rewards and transition probabilities are independent of t
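The equations this slide refers to were lost in the transcript; the standard formulation they describe (following Sutton & Barto) is reconstructed here:

```latex
% In general, the environment's response at time t+1 may depend on the entire history:
\Pr\{ s_{t+1} = s',\, r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \dots, r_1, s_0, a_0 \}

% The state signal has the Markov property iff, for all s', r and all histories,
% this equals the one-step dynamics:
\Pr\{ s_{t+1} = s',\, r_{t+1} = r \mid s_t, a_t \}
```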

14 The Pole-Balancing Problem revisited
In the pole-balancing task introduced earlier, the state signal would be Markovian if it specified exactly:
The position and velocity of the cart
The angle between the cart and the pole
The angular velocity of the pole
In an idealized system, this would be enough to fully control the system
In reality, "noise" is introduced by the sensors, etc., which can cause a violation of the Markov property
Sometimes even partial state information helps, e.g. dividing the cart position into 3 segments

15 System Evolution
[Diagram: decision epochs at times t and t+1; at epoch t the agent observes (s_t, r_t) and takes action a_t, and at epoch t+1 it observes (s_{t+1}, r_{t+1}) and takes a_{t+1}]

16 An alternative perspective on system evolution

17 Markov Decision Processes
An RL task that satisfies the Markov property is called a Markov decision process, or MDP
A particular finite MDP is defined by its transition probabilities and by the expected value of the next reward (both reconstructed below)
The above information tells us most of what we need to know to solve any finite MDP
Notice that an MDP is a collection of Markov chains, one per action: each action yields a different transition probability matrix
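The two defining quantities referred to above, reconstructed in the book's notation:

```latex
% Transition probabilities of a finite MDP:
P^{a}_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s,\, a_t = a \}

% Expected value of the next reward:
R^{a}_{ss'} = E\{ r_{t+1} \mid s_t = s,\, a_t = a,\, s_{t+1} = s' \}
```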

18 Example: The Recycling Robot MDP
A robot makes decisions at times determined by external events
At each such time the robot decides whether it should:
Actively search for a can,
Remain stationary and wait for someone to bring it a can, or
Go back to home base to recharge its battery
The best way to find cans is to actively search for them, but this runs down the robot's battery, whereas waiting does not
Whenever the robot is searching, the possibility exists that its battery will become depleted
In this case the robot must shut down and wait to be rescued (resulting in a negative reward)

19 The Recycling Robot MDP (cont.)
More assumptions …
The agent makes its decisions solely as a function of the energy level of the battery
It can distinguish two levels, High and Low, such that the state set is S = {High, Low}
The action set is A = {wait, search, recharge}
Accordingly, the state-dependent action sets are A(High) = {search, wait} and A(Low) = {search, wait, recharge}
A period of searching that begins with a high energy level leaves the energy level high with probability α and reduces it to low with probability 1-α

20 The Recycling Robot MDP (cont.)
On the other hand, a period of searching undertaken when the energy level is low leaves it low with probability β and depletes the battery with probability 1-β
In the latter case, the robot must be rescued, and the battery is then recharged back to high
Each can collected by the robot counts as a unit reward, whereas a reward of –3 results whenever the robot has to be rescued
Moreover, let …
R_search → the expected number of cans the robot will collect while in search mode
R_wait → the expected number of cans the robot will collect while in wait mode

21 Transition Probabilities and Expected Rewards
s      s'     a         P(s'|s,a)   Expected reward
high   high   search    α           R_search
high   low    search    1-α         R_search
low    low    search    β           R_search
low    high   search    1-β         –3
high   high   wait      1           R_wait
low    low    wait      1           R_wait
low    high   recharge  1           0
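One compact way to hold these dynamics in code is a dictionary keyed by (state, action), as sketched below. The numeric values chosen for α, β, R_search and R_wait are illustrative placeholders, not values given in the lecture.

```python
# Recycling-robot dynamics: (state, action) -> list of (prob, next_state, reward).
# ALPHA, BETA, R_SEARCH, R_WAIT are free parameters of the example; the numbers
# here are illustrative placeholders only.
ALPHA, BETA = 0.9, 0.6
R_SEARCH, R_WAIT = 2.0, 1.0

P = {
    ('high', 'search'):   [(ALPHA,     'high', R_SEARCH),
                           (1 - ALPHA, 'low',  R_SEARCH)],
    ('low',  'search'):   [(BETA,      'low',  R_SEARCH),
                           (1 - BETA,  'high', -3.0)],    # rescued, then recharged
    ('high', 'wait'):     [(1.0,       'high', R_WAIT)],
    ('low',  'wait'):     [(1.0,       'low',  R_WAIT)],
    ('low',  'recharge'): [(1.0,       'high', 0.0)],
}
```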

22 MDP Transition Graph
A transition graph is a useful way to summarize the dynamics of a finite MDP
There is a state node for each possible state
There is an action node for each possible state-action pair

23 Value Functions
Almost all reinforcement learning algorithms are based on estimating value functions
Option #1: how good is it to be in a given state?
Option #2: how good is it to take a specific action in a given state?
The notion of "how good" → expected return (R)
Recall that a policy, π, is a mapping from each state, s, and action, a, to the probability π(s,a) of selecting action a in state s
Informally, the value of a state is the expected return when starting in that state and following π thereafter
We define the state-value function for policy π as follows
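The state-value function this slide defines, reconstructed in standard notation:

```latex
% State-value function for policy \pi:
V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \}
           = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\Big|\; s_t = s \Big\}
```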

24 Value Functions (cont.)
Similarly, we define the action-value function for policy π as the value of taking action a in state s under policy π, denoted by Q^π(s,a)
The value functions V^π(s) and Q^π(s,a) can be estimated from experience
V^π(s): an agent that follows policy π from a given state and maintains an average of the actual returns observed will find that this average converges to the state's value
Q^π(s,a): if separate averages are kept for each action taken in a state, then these averages will similarly converge to the state-action values
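The corresponding action-value function, reconstructed in the same notation:

```latex
% Action-value function for policy \pi:
Q^{\pi}(s,a) = E_{\pi}\{ R_t \mid s_t = s,\, a_t = a \}
             = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\Big|\; s_t = s,\, a_t = a \Big\}
```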

25 Fundamental recursive relationship (Bellman Equation)
For any policy π and any state s, the following consistency condition holds between the value of s and the values of its possible successor states: the Bellman equation for V^π (reconstructed below)
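A reconstruction of the Bellman equation for V^π referred to above, using the P^a_{ss'} and R^a_{ss'} notation from slide 17:

```latex
% Bellman equation for V^\pi:
V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'}
             \left[ R^{a}_{ss'} + \gamma \, V^{\pi}(s') \right]
```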

26 Illustration of the Bellman Equation
Think of looking ahead from one state to its possible successor states
The Bellman equation averages over all the possibilities, weighting each by its probability of occurring
It states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way

27 The Backup Diagram
The value function V^π(s) is the unique solution to its Bellman equation
We'll later see how the Bellman equation forms the basis for methods to compute, approximate and learn V^π(s)
Diagrams like the one shown on the previous slide are called backup diagrams
These diagrams form the basis of the update, or backup, operations that are at the heart of RL
These operations transfer value information back to a state (or a state-action pair) from its successor states (or state-action pairs)
Explicit arrowheads are usually omitted, since time always flows downward in a backup diagram
Read and make sure you understand examples 3.8 and 3.9 in the Sutton & Barto book
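As a concrete illustration of these backup operations, here is a minimal iterative policy evaluation sketch: it repeatedly applies the Bellman equation for V^π as an update rule until the value function stops changing. The MDP dictionary format matches the recycling-robot sketch given earlier; the example policy in the comments is an assumption, not one specified in the lecture.

```python
# Iterative policy evaluation: repeatedly apply the Bellman equation for V^pi
# as a backup until the value function stops changing.
# `mdp` maps (state, action) -> [(prob, next_state, reward), ...] as in the
# recycling-robot sketch above; `policy` maps state -> {action: probability}.

def evaluate_policy(mdp, policy, gamma=0.9, theta=1e-8):
    states = {s for (s, _) in mdp}
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a, pi_sa in policy[s].items():
                for prob, s_next, reward in mdp[(s, a)]:
                    v_new += pi_sa * prob * (reward + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Example usage with an (assumed) equiprobable policy over the recycling robot:
# policy = {'high': {'search': 0.5, 'wait': 0.5},
#           'low':  {'search': 1/3, 'wait': 1/3, 'recharge': 1/3}}
# V = evaluate_policy(P, policy)
```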

