From Markov Decision Processes to Artificial Intelligence

1 From Markov Decision Processes to Artificial Intelligence
Rich Sutton, with thanks to: Andy Barto, Satinder Singh, Doina Precup

2 The steady march of computing science is changing artificial intelligence
More computation-based approximate methods: machine learning, neural networks, genetic algorithms
Machines are taking on more of the work: more data, more computation; fewer handcrafted solutions, less reliance on human understandability; more search
Exponential methods are still exponential… but compute-intensive methods are increasingly winning
More general problems: stochastic, non-linear, optimal, real-time, large

The problem is to predict and control a doubly branching interaction unfolding over time, with a long-term goal.
[Diagram: Agent and World exchanging actions and states over time.]

4 Sequential, state-action-reward problems are ubiquitous
Walking Flying a helicopter Playing tennis Logistics Inventory control Intruder detection Repair or replace? Visual search for objects Playing chess, Go, Poker Medical tests, treatment Conversation User interfaces Marketing Queue/server control Portfolio management Industrial process control Pipeline failure prediction Real-time load balancing

5 Markov Decision Processes (MDPs)
Discrete time
States
Actions
Policy
Transition probabilities
Rewards
[Diagram: alternating states and actions over time.]
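The slide's notation did not survive extraction; a standard way to write these ingredients (a reconstruction, not verbatim from the slide) is:

```latex
% Standard MDP notation (reconstruction, not extracted from the slide)
\begin{align*}
 &\text{discrete time: } t = 0, 1, 2, \ldots\\
 &\text{states } s_t \in \mathcal{S}, \qquad \text{actions } a_t \in \mathcal{A}\\
 &\text{policy } \pi(a \mid s) = \Pr\{a_t = a \mid s_t = s\}\\
 &\text{transition probabilities } p(s' \mid s, a) = \Pr\{s_{t+1} = s' \mid s_t = s,\, a_t = a\}\\
 &\text{rewards } r_{t+1} \in \mathbb{R}, \text{ with expected value } r(s, a, s')
\end{align*}
```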

6 MDPs Part II: The Objective
“Maximize cumulative reward.”
Define the value of being in a state under a policy, where delayed rewards are discounted by γ ∈ [0,1].
This defines a partial ordering over policies, with at least one optimal policy (needs proving).
There are other possibilities...
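The value-function equations on this slide did not come through in the transcript; the standard definitions they refer to (a reconstruction) are:

```latex
% Discounted value of a state under a policy, the induced ordering, and pi*
\begin{align*}
 V^{\pi}(s) &= E_{\pi}\Bigl\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\Bigm|\; s_t = s \Bigr\}, \qquad \gamma \in [0, 1]\\
 \pi \ge \pi' \;&\Longleftrightarrow\; V^{\pi}(s) \ge V^{\pi'}(s) \ \text{ for all } s\\
 \pi^{*}:\quad V^{\pi^{*}}(s) &\ge V^{\pi}(s) \ \text{ for all policies } \pi \text{ and all states } s
\end{align*}
```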

7 Markov Decision Processes
Extensively studied since the 1950s.
In Optimal Control:
Specializes to Riccati equations for linear systems, and to HJB equations for continuous-time systems
The only general, nonlinear, optimal-control framework
In Operations Research:
Planning, scheduling, logistics
Sequential design of experiments
Finance, marketing, inventory control, queuing, telecomm
In Artificial Intelligence (last 15 years):
Reinforcement learning, probabilistic planning
Dynamic Programming is the dominant solution method.

8 Outline
Markov decision processes (MDPs)
Dynamic Programming (DP)
The curse of dimensionality
Reinforcement Learning (RL)
TD(λ) algorithm
TD-Gammon example
Acrobot example
RL significantly extends DP methods for solving MDPs
RoboCup example
Conclusion, from the AI point of view
Spy plane example

9 The Principle of Optimality
Dynamic Programming (DP) requires a decomposition into subproblems.
In MDPs this comes from the Independence of Path assumption.
Values can be written in terms of successor values, e.g., the “Bellman Equations”.
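The Bellman equations themselves did not survive extraction; in standard form (a reconstruction) they are:

```latex
% Bellman equations for a fixed policy and for the optimal value function
\begin{align*}
 V^{\pi}(s) &= \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\,\bigl[ r(s, a, s') + \gamma V^{\pi}(s') \bigr]\\
 V^{*}(s)  &= \max_{a} \sum_{s'} p(s' \mid s, a)\,\bigl[ r(s, a, s') + \gamma V^{*}(s') \bigr]
\end{align*}
```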

10 For example, Value Iteration:
Dynamic Programming: sweeping through the states, updating an approximation to the optimal value function. For example, Value Iteration:
Initialize: V(s) ← 0 (or arbitrarily) for all s
Do forever: for every state s, V(s) ← max_a Σ_{s'} p(s'|s,a) [ r(s,a,s') + γ V(s') ]
Pick any of the maximizing actions to get π*
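As a concrete sketch of such a sweep (illustrative only, not code from the talk; the transition model `P`, discount `gamma`, and stopping threshold `theta` are assumed placeholders):

```python
# Minimal tabular value iteration (illustrative sketch; P, gamma, theta are
# hypothetical placeholders, not artifacts of the talk).
# P[s][a] is a list of (probability, next_state, reward) triples.

def value_iteration(P, gamma=0.9, theta=1e-6):
    """Sweep all states, backing up an approximation to V*, then extract pi*."""
    n_states = len(P)
    V = [0.0] * n_states                       # Initialize: V(s) = 0 for all s
    while True:                                # "Do forever" (until convergence)
        delta = 0.0
        for s in range(n_states):
            # Backup: V(s) <- max_a sum_{s'} p(s'|s,a) [r + gamma V(s')]
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(len(P[s]))]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                      # sweeps no longer change V much
            break
    # Pick any of the maximizing actions in each state to get pi*
    pi = [max(range(len(P[s])),
              key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
          for s in range(n_states)]
    return V, pi
```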

11 DP is repeated backups, shallow lookahead searches
[Backup diagrams: V(s) computed from the successor values V(s’) and V(s’’).]

12 Dynamic Programming is the dominant solution method for MDPs
Routinely applied to problems with millions of states.
Worst case scales polynomially in |S| and |A|.
Linear Programming has better worst-case bounds but in practice scales 100s of times worse.
On large stochastic problems, only DP is feasible.

13 Perennial Difficulties for DP
1. Large state spaces (“the curse of dimensionality”)
2. Difficulty calculating the dynamics, e.g., from a simulation
3. Unknown dynamics

14 The Curse of Dimensionality
(Bellman, 1961)
The number of states grows exponentially with dimensionality, i.e., the number of state variables.
Thus, on large problems:
Can’t complete even one sweep of DP: can’t enumerate states; need sampling!
Can’t store separate values for each state: can’t store values in tables; need function approximation!

15 Reinforcement Learning: Using experience in place of dynamics
Let s_0, a_0, r_1, s_1, a_1, r_2, s_2, ... be an observed sequence with actions selected by π.
For every time step t, V^π(s_t) = E{ r_{t+1} + γ V^π(s_{t+1}) } (the “Bellman Equation”), which suggests the DP-like update V(s_t) ← E{ r_{t+1} + γ V(s_{t+1}) }.
We don’t know this expected value, but we know the actual r_{t+1} + γ V(s_{t+1}), an unbiased sample of it.
In RL, we take a step toward this sample, e.g., half way:
V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]   (“Tabular TD(0)”)
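A minimal sketch of this tabular TD(0) update in code (illustrative; the environment interface `env.reset()`/`env.step()` and the fixed policy `pi` are assumptions, not from the talk):

```python
# Tabular TD(0) prediction for a fixed policy (illustrative sketch; the
# `env` interface and `pi` are hypothetical stand-ins).

def td0_prediction(env, pi, n_states, episodes=1000, alpha=0.5, gamma=1.0):
    """Estimate V^pi by stepping toward the sampled target r + gamma * V(s')."""
    V = [0.0] * n_states
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(pi(s))
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])    # step part way toward the sample
            s = s_next
    return V
```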

16 Temporal-Difference Learning (Sutton, 1988)
Updating a prediction based on its change (temporal difference) from one moment to the next.
Tabular TD(0): V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]
Or, V is, e.g., a neural network with parameter θ. Then use gradient-descent TD(0):
θ ← θ + α [ r_{t+1} + γ V_θ(s_{t+1}) − V_θ(s_t) ] ∇_θ V_θ(s_t)
TD(λ), λ > 0, uses differences from later predictions as well.
[Diagram: the temporal difference between a first prediction and a better, later prediction.]
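A minimal sketch of such an update with eligibility traces and a linear value function V(s) = θ·φ(s) (illustrative; the feature map `phi` and the environment interface are assumptions, and with λ = 0 it reduces to the gradient-descent TD(0) update above):

```python
# Linear gradient-descent TD(lambda) with accumulating eligibility traces
# (illustrative sketch; `env`, `pi`, and `phi` are hypothetical stand-ins).
import numpy as np

def linear_td_lambda(env, pi, phi, n_features,
                     episodes=100, alpha=0.01, gamma=1.0, lam=0.9):
    theta = np.zeros(n_features)              # parameters of V_theta
    for _ in range(episodes):
        s = env.reset()
        z = np.zeros(n_features)              # eligibility trace
        done = False
        while not done:
            s_next, r, done = env.step(pi(s))
            v = theta @ phi(s)
            v_next = 0.0 if done else theta @ phi(s_next)
            delta = r + gamma * v_next - v     # temporal-difference error
            z = gamma * lam * z + phi(s)       # decay traces, add grad of V(s)
            theta += alpha * delta * z         # credit current and earlier states
            s = s_next
    return theta
```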

17 TD-Gammon (Tesauro, 1992-1995)
[Network diagram: the backgammon position as input to a neural network whose output ≈ probability of winning; learning is driven by the TD error.]
Action selection by 2-3 ply search.
Start with a random network.
Play millions of games against itself.
Learn a value function from this simulated experience.
This produces arguably the best player in the world.

18 TD-Gammon vs an Expert-Trained Net
(Tesauro, 1992)
[Chart: fraction of games won against Gammontool (roughly .45-.70) vs. number of hidden units (10, 20, 40, 80) for four networks: EP (a BP net trained from expert moves), EP+features (“Neurogammon”), TD-Gammon, and TD-Gammon+features. The TD-trained nets win the larger fraction, with TD-Gammon+features near .70.]

19 Examples of Reinforcement Learning
Elevator Control (Crites & Barto): (probably) world's best down-peak elevator controller
Walking Robot (Benbrahim & Franklin): learned critical parameters for bipedal walking
RoboCup Soccer Teams (e.g., Stone & Veloso, Riedmiller et al.): RL is used in many of the top teams
Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis): 10-15% improvement over industry-standard methods
Dynamic Channel Assignment (Singh & Bertsekas; Nie & Haykin): world's best assigner of radio channels to mobile telephone calls
KnightCap and TDLeaf (Baxter, Tridgell & Weaver): improved chess play from intermediate to master in 300 games

20 Does function approximation beat the curse of dimensionality?
Yes… probably.
FA makes dimensionality per se largely irrelevant.
With FA, computation seems to scale with the complexity of the solution (crinkliness of the value function) and how hard it is to find it.
If you can get FA to work!

21 FA in DP and RL (1st bit)
Conventional DP works poorly with FA:
Empirically [Boyan and Moore, 1995]
Diverges with linear FA [Baird, 1995]
Even for prediction (evaluating a fixed policy) [Baird, 1995]
RL works much better:
Empirically [many applications and Sutton, 1996]
TD(λ) prediction converges with linear FA [Tsitsiklis & Van Roy, 1997]
TD(λ) control converges with linear FA [Perkins & Precup, 2002]
Why? Following actual trajectories in RL ensures that every state is updated at least as often as it is the basis for updating.

22 DP+FA fails RL+FA works
Real trajectories always leave a state after entering it.
More transitions can go into a state than go out.

23 Outline
Markov decision processes (MDPs)
Dynamic Programming (DP)
The curse of dimensionality
Reinforcement Learning (RL)
TD(λ) algorithm
TD-Gammon example
Acrobot example
RL significantly extends DP methods for solving MDPs
RoboCup example
Conclusion, from the AI point of view
Spy plane example

24 The Mountain Car Problem
(Moore, 1990)
SITUATIONS: car's position and velocity
ACTIONS: three thrusts: forward, reverse, none
REWARDS: always –1 until car reaches the goal
No discounting
Gravity wins
A minimum-time-to-goal problem

25 Value Functions Learned while solving the Mountain Car problem
(Sutton, 1996)
[Figure: value functions learned at several stages while solving the Mountain Car problem, plotted over position and velocity, with the goal region marked. Value = estimated time to goal; lower is better.]

26 Sparse, Coarse, Tile-Coding (CMAC)
(Albus, 1980)
[Figure: overlapping tilings over the two state variables, car position and car velocity.]

27 Tile Coding (CMAC): an Example of Sparse Coarse-Coded Networks
(Albus, 1980)
[Diagram: a fixed, expansive re-representation of the input into many binary features, followed by a linear last layer.]
Coarse: large receptive fields.
Sparse: few features present at one time.
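A minimal sketch of tile coding over two continuous variables such as car position and velocity (illustrative; the number of tilings, tiles per dimension, and offset scheme are assumptions, not the exact coding used in these experiments):

```python
# Tile coding for a 2-D continuous state (illustrative sketch; grid sizes
# and offsets are assumed, not taken from the talk).

def tile_features(x, y, x_range, y_range, n_tilings=8, tiles_per_dim=9):
    """Return indices of the active binary features: one tile per tiling."""
    active = []
    x_scale = tiles_per_dim / (x_range[1] - x_range[0])
    y_scale = tiles_per_dim / (y_range[1] - y_range[0])
    tiles = tiles_per_dim + 1                 # allow for the offset overhang
    for t in range(n_tilings):
        offset = t / n_tilings                # each tiling shifted differently
        xi = int((x - x_range[0]) * x_scale + offset)
        yi = int((y - y_range[0]) * y_scale + offset)
        active.append((t * tiles + xi) * tiles + yi)
    return active

# With such sparse binary features, the approximate value is just a sum of
# the weights at the active indices: V(s) = sum(theta[i] for i in active).
```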

28 The Acrobot Problem
(Sutton, 1996; e.g., DeJong & Spong, 1994)
[Diagram: a two-link pendulum with a fixed base, joint angles θ1 and θ2, and torque applied at the second joint; the goal is to raise the tip above a target line.]
A minimum-time-to-goal problem: reward = -1 per time step.
4 state variables: 2 joint angles, 2 angular velocities.
Tile coding with 48 tilings.

29 The RoboCup Soccer Competition

30 13 Continuous State Variables (for 3 vs 2)
11 distances among the players, the ball, and the center of the field
2 angles to takers along passing lanes
(Stone & Sutton, 2001)

31 RoboCup Feature Vectors
(Stone & Sutton, 2001)
[Diagram: the full soccer state is reduced to 13 continuous state variables; sparse, coarse tile coding expands these into a huge binary feature vector φ_s (about 400 1’s and 40,000 0’s); a linear map θ then produces the action values.]
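A minimal sketch of the linear map the diagram describes: action values are dot products of a weight vector with the huge, mostly-zero binary feature vector, so only the roughly 400 active indices ever need to be touched (the sizes and update rule here are illustrative assumptions, not the Stone & Sutton code):

```python
# Sparse linear action values over a huge binary feature vector
# (illustrative sketch; sizes and the update rule are assumptions).
import numpy as np

N_FEATURES = 40_000           # length of the binary feature vector
N_ACTIONS = 3                 # e.g., hold the ball or pass to a teammate
theta = np.zeros((N_ACTIONS, N_FEATURES))

def action_values(active):
    """Q(s, a) for all a: sum the weights at the ~400 active feature indices."""
    return theta[:, active].sum(axis=1)

def td_update(active, action, target, alpha=0.1):
    """Linear TD-style update: nudge only the active weights toward the target."""
    q = theta[action, active].sum()
    theta[action, active] += (alpha / len(active)) * (target - q)
```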

32 Learning Keepaway Results 3v2 handcrafted takers
(Stone & Sutton, 2001)
[Graph: learning curves for multiple, independent runs of TD(λ) against the 3v2 handcrafted takers.]

33 Hajime Kimura’s RL Robots (dynamics knowledge)
[Video stills: Before and After learning; walking Backward; a New Robot, Same algorithm.]

34 Assessment re: DP RL has added some new capabilities to DP methods
Much larger MDPs can be addressed (approximately).
Simulations can be used without explicit probabilities.
Dynamics need not be known or modeled.
Many new applications are now possible: process control, logistics, manufacturing, telecomm, finance, scheduling, medicine, marketing…
Theoretical and practical questions remain open.

35 Outline
Markov decision processes (MDPs)
Dynamic Programming (DP)
The curse of dimensionality
Reinforcement Learning (RL)
TD(λ) algorithm
TD-Gammon example
Acrobot example
RL significantly extends DP methods for solving MDPs
RoboCup example
Conclusion, from the AI point of view
Spy plane example

36 A lesson for AI: The Power of a “Visible” Goal
In MDPs, the goal (reward) is part of the data, part of the agent's normal operation.
The agent can tell for itself how well it is doing.
This is very powerful… we should do more of it in AI.
Can we give AI tasks visible goals?
Visual object recognition? Better would be active vision.
Story understanding? Better would be dialog, e.g., call routing.
User interfaces, personal assistants.
Robotics… say mapping and navigation, or search.
The usual trick is to make them into long-term prediction problems.
There must be a way. If you can't feel it, why care about it?

37 Assessment re: AI DP and RL are potentially powerful probabilistic planning methods But typically don’t use logic or structured representations How is they as an overall model of thought? Good mix of deliberation and immediate judgments (values) Good for causality, prediction, not for logic, language The link to data is appealing…but incomplete MDP-style knowledge may be learnable, tuneable, verifiable But only if the “level” of the data is right Sometimes seems too low-level, too flat

38 Ongoing and Future Directions
Temporal abstraction [Sutton, Precup, Singh, Parr, others]: generalize transitions to include macros, “options”; multiple overlying MDP-like models at different levels.
State representation [Littman, Sutton, Singh, Jaeger, ...]: eliminate the nasty assumption of observable state; get really real with data; work up to higher-level, yet grounded, representations.
Neuroscience of reward systems [Dayan, Schultz, Doya]: the dopamine reward system behaves remarkably like TD.
Theory and practice of value function approximation [everybody].

39 Spy Plane Example (Reconnaissance Mission Planning)
(Sutton & Ravindran, 2001)
Mission: fly over (observe) the most valuable sites and return to base.
Stochastic weather affects observability (cloudy or clear) of sites.
Limited fuel.
Intractable with classical optimal control methods.
Temporal scales: actions choose which direction to fly now; options choose which site to head for.
Options compress space and time: reduce steps from ~600 to ~6, and states from ~10^11 to ~10^6.
[Diagram: option-level decisions are made at the sites only (6), rather than at any state.]

40 Spy Plane Results
(Sutton & Ravindran, 2001)
Three planners compared (expected reward per mission):
SMDP planner: assumes options are followed to completion; plans the optimal SMDP solution.
SMDP planner with re-evaluation: plans as if options must be followed to completion, but actually takes them for only one step and re-picks a new option on every step.
Static planner: assumes the weather will not change; plans the optimal tour among clear sites; re-plans whenever the weather changes.

Speaker notes: Please note that this example is meant just as an illustration of our formal results and to provide intuition for possible applications of relevance to the Air Force. The need for dynamic re-planning in response to changing conditions comes up in many applications, but we do not claim that the specific problem described accurately represents UAV mission planning.

Figure in upper left and first three bullets: Each star represents a site that could be visited by the UAV. The green numbers next to each site represent the reward obtained for observing that site. The clouds veiling two of the sites represent stochastically changeable weather conditions making them currently unobservable. The task is for the UAV to direct itself on a tour among the sites, returning to base before running out of fuel. It tries to maximize the sum of rewards during the tour, which means it tries to observe as many of the sites as it can, preferring the higher-payoff sites. The UAV moves at a constant speed and consumes fuel at a constant rate. The system is simulated in continuous space and time, but decisions are made at short regular intervals. A shortest tour that visits all sites (ignoring weather) is about 300 decision stages long, compared to a fuel supply lasting 500 stages (low-fuel condition) or 1000 stages (high-fuel condition). The weather at the sites is simulated as a collection of independent Poisson processes with turn-on and turn-off probabilities. We assume that changes in weather conditions at each site are observed globally (e.g., by satellite) and immediately transmitted to the UAV. Even for 5 sites, this is a stochastic optimal control problem that requires some simplifications to solve exactly using classical methods.

Fourth bullet: In the low-level view of this problem, the controls select angular directions in which to fly. The problem can be simplified into tractable form by defining higher-level controls, which we call strategies, or options. The example we consider is to define the options of flying to one of the 5 sites, or to the base. We assume the low-level details about how to execute these options are easy to determine (as they are here) and are supplied a priori. At this higher level of options, the planning process computes the optimal policy given the constraint that if an option is picked, then it must be followed until the target site of that option is reached. Planning at this level is a semi-Markov decision problem (SMDP) which can be solved in tens of minutes on a workstation using standard dynamic programming or RL methods. This would be done before starting the mission, and the UAV would use the resulting policy during flight. Results for this method are shown in the middle group of the results graph (RL planning w/strategies).

Fifth bullet: This describes some details of the reduction to an SMDP described above. Assuming a fairly coarse level of quantization for the continuous state variables (100 intervals for distance in each direction, and 360 discrete heading angles), we obtain the reductions shown. These occur because with options we always travel all the way to a site (or to the base), thus we can assume we are (almost) always at one of only 6 locations. Similarly, we only consider the 6 options instead of 360 headings.

Sixth bullet: Although we have not performed a detailed analysis, it is likely that the low-level version of this problem is intractable because of the length of the solutions, the size of the state space, and the stochasticity of the weather. The methods we are proposing do not rely on any of the fine details of the problem.

Final bullet and graph: We compared the performance of three controllers: 1) Static re-planner: a receding-horizon controller which at each decision stage plans the optimal mission from the current state and takes the first action of that plan. This method plans as if the weather conditions will not change at any site, but then re-plans immediately when they do. Also, like any receding-horizon controller, it does not take into account the fact that it will be re-planning in the future. 2) RL planning w/strategies: this is the SMDP using options described above. Options must be executed to completion. 3) RL planning w/strategies and real-time control: this method uses options but can change options at every low-level decision stage to the one whose long-term performance appears best given the current state, thereby responding immediately to changes in the observability conditions of the sites. These decisions are made by the value function computed in solving the SMDP, which is a feasible computation. We have proved a theorem which says that this method will always improve on method (2) above, as indeed these results show in fairly dramatic form, especially in the low-fuel scenario. These results were obtained by running the three controllers on a large number of simulated missions and averaging the results. The height of each bar is the average total reward obtained by each controller. For example, if all 5 sites are observed it is possible to obtain as much as 63 units of reward, but no method can do this on every mission.

[Bar chart: expected reward per mission under high-fuel and low-fuel conditions for the static re-planner, the SMDP planner, and the SMDP planner with re-evaluation of options on each step.] Temporal abstraction finds a better approximation than the static planner, with little more computation than the SMDP planner.

41 Didn't have time for
Action Selection
Exploration/Exploitation
Action values vs. search
How learning values leads to policy improvements
Different returns, e.g., the undiscounted case
Exactly how FA works, backprop
Exactly how options work
How planning at a high level can affect primitive actions
How states can be abstracted to affordances
And how this directly builds on the option work

