Overcoming the Curse of Dimensionality with Reinforcement Learning
Rich Sutton, AT&T Labs
With thanks to Doina Precup, Peter Stone, Satinder Singh, David McAllester, Sanjoy Dasgupta
Computers have gotten faster and bigger
Analytic solutions are less important
Computer-based approximate solutions
– Neural networks
– Genetic algorithms
Machines take on more of the work
More general solutions to more general problems
– Non-linear systems
– Stochastic systems
– Larger systems
Exponential methods are still exponential… but compute-intensive methods increasingly winning
New Computers have led to a New Artificial Intelligence
More general problems and algorithms, automation
– Data-intensive methods, learning methods
Less handcrafted solutions, expert systems
More probability, numbers
Less logic, symbols, human understandability
More real-time decision-making
States, Actions, Goals, Probability => Markov Decision Processes
Markov Decision Processes
State space S (finite); action space A (finite); discrete time t = 0, 1, 2, …
Episode: $s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_T$
Transition probabilities: $p_{ss'}^a = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$
Expected rewards: $r_{ss'}^a = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$
Policy: $\pi(s, a) = \Pr\{a_t = a \mid s_t = s\}$
Return: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$   ($\gamma$ = discount rate)
Value: $V^\pi(s) = E_\pi\{R_t \mid s_t = s\}$
Optimal policy: $\pi^* = \arg\max_\pi V^\pi$
PREDICTION problem: estimate $V^\pi$ for a given $\pi$
CONTROL problem: find $\pi^*$
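To make these objects concrete, here is a minimal Python sketch of a finite MDP stored as transition and reward arrays, with exact policy evaluation for the PREDICTION problem. The names (FiniteMDP, P, R, value_of_policy) are illustrative assumptions, not anything from the talk; a CONTROL method would additionally search over policies.

```python
import numpy as np

# Minimal finite-MDP container (hypothetical names, for illustration only).
# P[s, a, s'] = transition probability, R[s, a, s'] = expected reward,
# gamma = discount rate.
class FiniteMDP:
    def __init__(self, P, R, gamma):
        self.P, self.R, self.gamma = P, R, gamma
        self.n_states, self.n_actions, _ = P.shape

    def value_of_policy(self, pi):
        """PREDICTION: solve V^pi = r_pi + gamma * P_pi V^pi exactly."""
        P_pi = np.einsum('sa,sat->st', pi, self.P)             # state-to-state matrix under pi
        r_pi = np.einsum('sa,sat,sat->s', pi, self.P, self.R)  # expected one-step reward under pi
        return np.linalg.solve(np.eye(self.n_states) - self.gamma * P_pi, r_pi)
```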
Key Distinctions (first item of each pair: harder, more challenging and interesting; second item: easier, conceptually simpler)
– Control vs Prediction
– Bootstrapping/Truncation vs Full Returns
– Sampling vs Enumeration
– Function approximation vs Table lookup
– Off-policy vs On-policy
Full-Depth Search
Full returns: $r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$
Computing $\hat{V}(s)$ by searching the full tree of futures ($s, a, r, s', a', r', r'', \ldots$) is of exponential complexity $O(B^D)$: branching factor $B$, depth $D$
Truncated Search
Truncated returns: $r_{t+1} + \gamma \hat{V}(s_{t+1})$
Search truncated after one ply
Approximate values $\hat{V}(s')$ used at the stubs
Values computed from their own estimates! -- “Bootstrapping”
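As a small illustration of the contrast between the last two slides, the following sketch computes a full return from a sampled reward sequence and a return truncated after one ply that bootstraps from a stored value estimate. V_hat is assumed to be an indexable array of state-value estimates.

```python
# Sketch: the two kinds of returns, for a sampled trajectory of rewards
# rewards[0..T-1] and an approximate value function V_hat (assumed given).
def full_return(rewards, gamma):
    """Full return: r_1 + gamma*r_2 + gamma^2*r_3 + ..."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

def truncated_return(r, next_state, V_hat, gamma):
    """Return truncated after one ply, bootstrapping from V_hat(s')."""
    return r + gamma * V_hat[next_state]
```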
Dynamic Programming is Bootstrapping
Truncated returns at every branch. E.g., DP policy evaluation:
$\hat{V}(s) \leftarrow E_\pi\{r_{t+1} + \gamma \hat{V}(s_{t+1}) \mid s_t = s\}$
Bootstrapping seems to Speed Learning
Bootstrapping/Truncation
Replacing possible futures with estimates of value
Can reduce computation and variance
A powerful idea, but…
Requires stored estimates of value for each state
The Curse of Dimensionality (Bellman, 1961)
The number of states grows exponentially with dimensionality -- the number of state variables
Thus, on large problems:
– Can’t complete even one sweep of DP policy evaluation
  Can’t enumerate states; need sampling!
– Can’t store separate values for each state
  Can’t store values in tables; need function approximation!
DP Policy Evaluation
DP policy evaluation enumerates and explicitly considers all possibilities:
$\hat{V}_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} p_{ss'}^a \left[ r_{ss'}^a + \gamma \hat{V}_k(s') \right] \qquad \forall s \in S$
or, for states drawn from some distribution $d(s)$ over states, possibly uniform:
$\hat{V}_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} p_{ss'}^a \left[ r_{ss'}^a + \gamma \hat{V}_k(s') \right] \qquad s \sim d(s)$
TD(λ) samples the possibilities rather than enumerating and explicitly considering all of them
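The enumeration that makes DP expensive is visible directly in code. Below is a sketch of one synchronous sweep of DP policy evaluation under the array conventions assumed earlier (P[s,a,s'], R[s,a,s'], pi[s,a]); every action and every successor state is explicitly considered for every state.

```python
import numpy as np

# Sketch of one synchronous sweep of DP policy evaluation, enumerating all
# actions a and successor states s' for every state s.
def dp_policy_evaluation_sweep(V, pi, P, R, gamma):
    S, A, _ = P.shape
    V_new = np.zeros(S)
    for s in range(S):                       # for all s in S
        for a in range(A):
            for s2 in range(S):
                V_new[s] += pi[s, a] * P[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
    return V_new
```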
These Terms Can Be Replaced by Sampling
In the DP policy evaluation update
$\hat{V}_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} p_{ss'}^a \left[ r_{ss'}^a + \gamma \hat{V}_k(s') \right], \qquad \forall s \in S \ \text{or} \ s \sim d(s),$
the enumerations over actions $a$ and successor states $s'$ can be replaced by sampling
Sampling vs Enumeration: Tabular TD(0) (Sutton, 1988; Witten, 1974)
For each sample transition $s, a \rightarrow s', r$:
$\hat{V}(s) \leftarrow \hat{V}(s) + \alpha \left[ r + \gamma \hat{V}(s') - \hat{V}(s) \right]$
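For contrast with the triple loop above, a sketch of the tabular one-step TD update applied to a single sampled transition; V is assumed to be an array of state-value estimates.

```python
# Sketch of the tabular TD(0) update for one sample transition (s, a, s', r),
# replacing the enumerated sums with a single sampled outcome.
def td0_update(V, s, r, s_next, alpha, gamma):
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V
```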
Sample returns can also be either full or truncated:
full: $r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$
truncated: $r_{t+1} + \gamma \hat{V}(s_{t+1})$
as in the general TD(λ) algorithm
Function Approximation
Store values as a parameterized form: $\hat{V}(s) \approx \hat{V}(s; \theta)$
Update $\theta$, e.g., by gradient descent:
$\theta \leftarrow \theta + \alpha \left[ r + \gamma \hat{V}(s'; \theta) - \hat{V}(s; \theta) \right] \nabla_\theta \hat{V}(s; \theta)$
cf. DP policy evaluation (rewritten to include a step-size $\alpha$):
$\hat{V}_{k+1}(s) \leftarrow \hat{V}_k(s) + \alpha \left[ \sum_a \pi(s,a) \sum_{s'} p_{ss'}^a \left( r_{ss'}^a + \gamma \hat{V}_k(s') \right) - \hat{V}_k(s) \right]$
Linear Function Approximation
Each state $s$ represented by a feature vector $\phi_s$, with $\hat{V}(s) = \theta^\top \phi_s$
Or represent a state-action pair with $\phi_{sa}$ and approximate action values:
$\hat{Q}(s, a) = \theta^\top \phi_{sa}$
Linear TD(λ) (e.g., Sutton, 1988)
After each episode:
$\theta \leftarrow \theta + \alpha \sum_{t=0}^{T-1} \left[ R_t^\lambda - \hat{Q}(s_t, a_t) \right] \phi_{s_t a_t}$
where the “λ-return” is $R_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$
and the “n-step return” is $R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n \hat{Q}(s_{t+n}, a_{t+n})$
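One possible forward-view reading of this episode-wise update, in Python: λ-returns are formed recursively from the sampled rewards and the current linear action-value estimates, and θ is then adjusted toward them. The feature vectors, the recursive form of the λ-return, and the function names are illustrative assumptions.

```python
import numpy as np

# Forward-view sketch of an episode-wise linear TD(lambda) update.
# features[t] is the feature vector phi(s_t, a_t); rewards[t] is r_{t+1}.
def lambda_returns(rewards, q_values, gamma, lam):
    """q_values[t] = theta . phi(s_t, a_t); returns the lambda-return for each t."""
    T = len(rewards)
    G = np.zeros(T)
    G_next = 0.0                      # lambda-return from beyond the episode end
    for t in reversed(range(T)):
        bootstrap = q_values[t + 1] if t + 1 < T else 0.0
        # Recursive form: G_t = r_{t+1} + gamma * ((1-lam)*Q(s_{t+1},a_{t+1}) + lam*G_{t+1})
        G[t] = rewards[t] + gamma * ((1 - lam) * bootstrap + lam * G_next)
        G_next = G[t]
    return G

def episode_update(theta, features, rewards, alpha, gamma, lam):
    """Apply the summed forward-view update once, after the episode ends."""
    q = np.array([theta @ f for f in features])       # Q(s_t, a_t) under the old theta
    G = lambda_returns(rewards, q, gamma, lam)
    delta = sum(alpha * (G[t] - q[t]) * f for t, f in enumerate(features))
    return theta + delta                              # gradient of a linear form is phi
```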
RoboCup
An international AI and Robotics research initiative
Use soccer as a rich and realistic testbed
Robotic and simulation leagues
– Open-source simulator (Noda)
Research challenges:
– Multiple teammates with a common goal
– Multiple adversaries, not known in advance
– Real-time decision making necessary
– Noisy sensors and actuators
– Enormous state space, > 10^9 states
RoboCup Feature Vectors
Full soccer state → 13 continuous state variables → sparse, coarse tile coding → huge binary feature vector $\phi_s$ (about 400 1’s and 40,000 0’s) → linear map $\theta$ → action values
13 Continuous State Variables (for 3 vs 2)
– 11 distances among the players, the ball, and the center of the field
– 2 angles to takers along passing lanes
Sparse, Coarse, Tile-Coding (CMACs)
32 tilings per group of state variables
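A rough sketch of this kind of sparse, coarse tile coding: each of several offset tilings contributes one active binary feature for a group of continuous variables. The uniform offsets, resolution, and index scheme here are simplifying assumptions, not the exact RoboCup encoding; with a linear map θ, the approximate value is just the sum of the θ components at the active indices.

```python
import numpy as np

# Sketch of tile coding over one group of continuous state variables:
# each of n_tilings tilings is offset slightly and contributes one active
# (binary) feature; all other features in the group are 0.
def tile_indices(x, n_tilings=32, tiles_per_dim=8, lo=0.0, hi=1.0):
    x = np.asarray(x, dtype=float)
    scaled = (x - lo) / (hi - lo) * tiles_per_dim        # map to tile units
    indices = []
    for k in range(n_tilings):
        offset = k / n_tilings                           # each tiling shifted a bit
        coords = np.floor(scaled + offset).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim)       # keep in range
        # Flatten (tiling, coords) into one index into a big binary feature vector
        idx = k
        for c in coords:
            idx = idx * (tiles_per_dim + 1) + c
        indices.append(idx)
    return indices    # positions of the ~n_tilings 1's
```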
Learning Keepaway Results (Stone & Sutton, 2001)
3v2, with handcrafted takers
Multiple, independent runs of TD(λ)
Key Distinctions
– Control vs Prediction
– Bootstrapping/Truncation vs Full Returns
– Function approximation vs Table lookup
– Sampling vs Enumeration
– Off-policy vs On-policy: the distribution d(s)
Off-Policy Instability (Baird, 1995; Gordon, 1995; Bertsekas & Tsitsiklis, 1996)
Examples of diverging $\theta_k$ are known for
– Linear FA
– Bootstrapping
Even for
– Prediction
– Enumeration
– Uniform d(s)
In particular, linear Q-learning can diverge
Baird’s Counterexample
Markov chain (no actions)
All states updated equally often, synchronously
Exact solution exists: $\theta = 0$
Initial $\theta_0 = (1, 1, 1, 1, 1, 10, 1)^\top$
On-Policy Stability (Tsitsiklis & Van Roy, 1997; Tadic, 2000)
If d(s) is the stationary distribution of the MDP under policy $\pi$ (the on-policy distribution),
then convergence is guaranteed for
– Linear FA
– Bootstrapping
– Sampling
– Prediction
Furthermore, the asymptotic mean squared error is a bounded expansion of the minimal MSE:
$\left\| \hat{V}_{\theta_\infty} - V^\pi \right\|_d \;\le\; \frac{1 - \gamma\lambda}{1 - \gamma} \, \min_\theta \left\| \hat{V}_\theta - V^\pi \right\|_d$
Value Function Space (schematic)
Inadmissible value functions vs. value functions consistent with the parameterization; the true $V^*$ lies among the inadmissible ones, while within the admissible set there is a region of $\pi^*$, the best admissible policy, and $V^*$, the best admissible value function.
– Original naive hope: guaranteed convergence to a good policy
– Residual gradient et al.: guaranteed convergence, but to a less desirable policy
– Sarsa, TD(λ) & other on-policy methods: chattering, without divergence or guaranteed convergence
– Q-learning, DP & other off-policy methods: divergence possible
There are Two Different Problems:
Chattering
– Is due to Control + FA
– Bootstrapping not involved
– Not necessarily a problem
– Being addressed with policy-based methods
– Argmax-ing is to blame
Instability
– Is due to Bootstrapping + FA + Off-policy
– Control not involved
– Off-policy is to blame
Yet We Need Off-Policy Learning
Off-policy learning is needed in all the frameworks that have been proposed to raise reinforcement learning to a higher level:
– Macro-actions, options, HAMs, MAXQ
– Temporal abstraction, hierarchy, modularity
– Subgoals, goal-and-action-oriented perception
The key idea: we can only follow one policy, but we would like to learn about many policies, in parallel
– To do this requires off-policy learning
On-Policy Policy Evaluation Problem: use data (episodes) generated by $\pi$ to learn $V^\pi$
Off-Policy Policy Evaluation Problem: use data (episodes) generated by $\pi'$ (the behavior policy) to learn $V^\pi$ (for the target policy $\pi$)
Naive Importance-Sampled TD(λ)
Weight the whole-episode TD(λ) update by $\rho_1 \rho_2 \rho_3 \cdots \rho_{T-1}$,
where $\rho_t = \frac{\pi(s_t, a_t)}{\pi'(s_t, a_t)}$ is the importance-sampling correction ratio for time $t$,
and the product is the relative probability of the episode under $\pi$ and $\pi'$
We expect this to have relatively high variance
Per-Decision Importance-Sampled TD(λ)
Like the naive algorithm, except that each reward $r_{t+1}$ is weighted only by the corrections for the actions that preceded it, $\rho_1 \rho_2 \rho_3 \cdots \rho_t$, rather than by the full-episode product
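To make the difference concrete, here is a sketch of both corrections applied to a recorded episode: the naive estimator multiplies the whole return by the full product of ratios, while the per-decision form applies each ratio only to the rewards that follow it. pi_prob and b_prob are assumed lookup functions for the target and behavior policy probabilities.

```python
# Sketch contrasting the naive and per-decision importance-sampling corrections
# for a recorded episode generated by the behavior policy.
# Alignment: rhos[t] corrects action a_t; rewards[t] is r_{t+1}.
def is_ratios(states, actions, pi_prob, b_prob):
    return [pi_prob(s, a) / b_prob(s, a) for s, a in zip(states, actions)]

def naive_is_return(rewards, rhos, gamma):
    """Whole-episode correction: (product of all rhos) * ordinary return."""
    w = 1.0
    for rho in rhos:
        w *= rho
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return w * G

def per_decision_is_return(rewards, rhos, gamma):
    """Per-decision correction: R_bar_t = rho_t * (r_{t+1} + gamma * R_bar_{t+1})."""
    G = 0.0
    for r, rho in zip(reversed(rewards), reversed(rhos)):
        G = rho * (r + gamma * G)
    return G
```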
Per-Decision Theorem (Precup, Sutton & Singh, 2000)
The per-decision corrected return has the same expectation under the behavior policy as the ordinary return has under the target policy
New Result for the Linear PD Algorithm (Precup, Sutton & Dasgupta, 2001)
The expected total change in $\theta$ over an episode for the new algorithm equals the expected total change for conventional on-policy TD(λ)
Convergence Theorem
Under natural assumptions:
– S and A are finite
– All s, a are visited under $\pi'$
– $\pi$ and $\pi'$ are proper (terminate w.p. 1)
– Bounded rewards
– The usual stochastic-approximation conditions on the step sizes $\alpha_k$
and one annoying assumption (a variance condition, satisfied, e.g., by bounded episode length),
the off-policy linear PD algorithm converges to the same $\theta_\infty$ as on-policy TD(λ)
The Variance Assumption is Restrictive
Consider a modified MDP with bounded episode length:
– We have data for this MDP
– Our result assures good convergence for this
– This solution can be made close to the solution to the original problem
– By choosing the episode bound long relative to the discount horizon or the mixing time
Consider application to macro-actions:
– Here it is the macro-action that terminates
– Termination is artificial; the real process is unaffected
– Yet all results directly apply to learning about macro-actions
– We can choose macro-action termination to satisfy the variance condition
So the assumption can often be satisfied with “artificial” terminations
Empirical Illustration
– Agent always starts at S
– Terminal states marked G
– Deterministic actions
– Behavior policy chooses up vs. down with fixed probabilities
– Target policy chooses up vs. down with different probabilities
If the algorithm is successful, it should give positive weight to the rightmost feature and negative weight to the leftmost one
Trajectories of Two Components of $\theta$ ($\lambda = 0.9$, $\alpha$ decreased): $\theta$ appears to converge as advertised. (Plot: $\theta_{\text{leftmost,down}}$ and $\theta_{\text{rightmost,down}}$ vs. episodes × 100,000, with their asymptotic values $\theta^*_{\text{leftmost,down}}$, $\theta^*_{\text{rightmost,down}}$.)
Comparison of Naive and Per-Decision IS Algorithms (Precup, Sutton & Dasgupta, 2001): root mean squared error after 100,000 episodes, averaged over 50 runs, vs. $\log_2 \alpha$; $\lambda = 0.9$, $\alpha$ constant.
Can Weighted IS Help the Variance?
Return to the tabular case and consider two estimators of the action value at $s, a$, where $R_i$ is the $i$th return following $s,a$ and $w_i = \rho_{t+1}\rho_{t+2}\rho_{t+3}\cdots\rho_{T-1}$ is the IS correction product ($s,a$ occurs at $t$):
– Ordinary IS estimate $\frac{1}{n}\sum_{i=1}^{n} w_i R_i$ converges with finite variance iff the $w_i$ have finite variance
– Weighted IS estimate $\frac{\sum_{i=1}^{n} w_i R_i}{\sum_{i=1}^{n} w_i}$ converges with finite variance even if the $w_i$ have infinite variance
Can this be extended to the FA case?
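A tabular sketch of the two estimators, given the returns observed after (s,a) and their associated correction products; the function names are illustrative.

```python
import numpy as np

# Sketch of the two tabular estimators, given returns R_i observed after (s,a)
# and their importance-sampling correction products w_i.
def ordinary_is_estimate(returns, weights):
    """Simple average of w_i * R_i."""
    returns, weights = np.asarray(returns), np.asarray(weights)
    return np.mean(weights * returns)

def weighted_is_estimate(returns, weights):
    """Weighted average: sum(w_i * R_i) / sum(w_i)."""
    returns, weights = np.asarray(returns), np.asarray(weights)
    return np.sum(weights * returns) / np.sum(weights)
```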
Restarting within an Episode
We can consider episodes to start at any time
This alters the weighting of states,
– But we still converge,
– And to near the best answer (for the new weighting)
Incremental Implementation
At the start of each episode: …
On each step: …
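The specific update rules behind this slide are not given above, so the following is only a hedged sketch of one plausible incremental, eligibility-trace implementation of off-policy linear TD(λ) with per-decision corrections, not necessarily the algorithm used in the talk.

```python
import numpy as np

# Hedged sketch of an incremental (backward-view) off-policy linear TD(lambda)
# with per-decision importance-sampling corrections. The placement of rho in
# the trace update is an assumption of this sketch.
class IncrementalOffPolicyTD:
    def __init__(self, n_features, alpha, gamma, lam):
        self.theta = np.zeros(n_features)
        self.alpha, self.gamma, self.lam = alpha, gamma, lam

    def start_episode(self):
        self.e = np.zeros_like(self.theta)       # reset eligibility traces

    def step(self, phi, r, phi_next, rho):
        """phi, phi_next: feature vectors; rho = pi(s,a)/pi'(s,a) for the taken action."""
        delta = r + self.gamma * (self.theta @ phi_next) - self.theta @ phi
        self.e = rho * (self.gamma * self.lam * self.e + phi)
        self.theta += self.alpha * delta * self.e
```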
Key Distinctions (first item of each pair: harder, more challenging and interesting; second item: easier, conceptually simpler)
– Control vs Prediction
– Bootstrapping/Truncation vs Full Returns
– Sampling vs Enumeration
– Function approximation vs Table lookup
– Off-policy vs On-policy
Conclusions
RL is beating the Curse of Dimensionality
– with FA and Sampling
There is a broad frontier, with many open questions
MDPs -- States, Decisions, Goals, and Probability -- are a rich area for mathematics and experimentation