Reinforcement Learning I: The setting and classical stochastic dynamic programming algorithms Tuomas Sandholm Carnegie Mellon University Computer Science Department

Reinforcement Learning (Ch. , Ch. 20)
Learner: passive vs. active.
Sequential decision problems.
Approaches:
1. Learn values of states (or state histories) & try to maximize the utility of their outcomes. Needs a model of the environment: what operators are available & what states they lead to.
2. Learn values of state-action pairs. Does not require a model of the environment (except legal moves). Cannot look ahead.

Reinforcement Learning …
Deterministic vs. stochastic transitions: M^a_ij is the probability of reaching state j when taking action a in state i.
A simple environment that presents the agent with a sequential decision problem (start state marked). Move cost = 0.04.
(Temporal) credit assignment problem; sparse reinforcement problem.
Offline algorithm: action sequence determined ex ante.
Online algorithm: action sequence is conditional on observations along the way; important in stochastic environments (e.g., flying a jet).

Reinforcement Learning …
Transition model M: 0.8 probability of going in the direction you want, 0.2 perpendicular (0.1 left, 0.1 right).
Policy: mapping from states to actions.
An optimal policy for the stochastic environment, and the utilities of the states (figure).
Environment: observable (accessible), i.e., the percept identifies the state, vs. partially observable.
Markov property: transition probabilities depend on the state only, not on the path to the state.
Markov decision problem (MDP). Partially observable MDP (POMDP): percepts do not have enough info to identify the transition probabilities.

Partial observability in the previous figure: (2,1) vs. (2,3). U(A) is not simply 0.8 * U(A given in (2,1)) + 0.2 * U(A given in (2,3)): have to factor in the value of new information obtained by moving in the world.

Observable MDPs
Assume additivity (almost always true in practice): U_h([s_0, s_1, …, s_n]) = R_0 + U_h([s_1, …, s_n]), where U_h is the utility function on histories.
policy*(i) = argmax_a Σ_j M^a_ij U(j)
U(i) = R(i) + max_a Σ_j M^a_ij U(j)

Classic Dynamic Programming (DP)
Start from the last step & move backward.
Complexity: naive search O(|A|^n) vs. DP O(n |A| |S|), where |A| = actions per step and |S| = # possible states.
Problem: n = ∞ if there are loops or an otherwise infinite horizon.
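As a concrete illustration of the backward pass, a minimal finite-horizon DP sketch in Python; the two-state MDP at the bottom and all names are hypothetical, not from the slides:

    # Backward induction for a finite-horizon MDP (toy example).
    # M[a][i][j] = probability of reaching state j when taking action a in state i.
    # R[i] = immediate reward in state i; n = number of remaining steps.
    def finite_horizon_dp(M, R, n):
        num_actions, num_states = len(M), len(R)
        U = list(R)                      # utilities with 0 steps to go
        policy = [None] * num_states     # ends up as the policy for the first of the n steps
        for _ in range(n):               # move backward from the last step
            new_U = []
            for i in range(num_states):
                best_a, best_val = None, float("-inf")
                for a in range(num_actions):
                    val = sum(M[a][i][j] * U[j] for j in range(num_states))
                    if val > best_val:
                        best_a, best_val = a, val
                new_U.append(R[i] + best_val)
                policy[i] = best_a
            U = new_U
        return U, policy

    # Two states, two actions: action 0 tends to stay, action 1 tends to switch.
    M = [[[0.9, 0.1], [0.1, 0.9]],
         [[0.2, 0.8], [0.8, 0.2]]]
    R = [0.0, 1.0]
    print(finite_horizon_dp(M, R, n=3))

The cost is roughly n backups per state and action, in line with the complexity claim above.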

Value iteration does not require there to exist a "last step", unlike dynamic programming.

The utility values for selected states at each iteration step in the application of VALUE-ITERATION to the 4x3 world in our example.
Thrm: As t → ∞, value iteration converges to the exact U even if updates are done asynchronously & i is picked randomly at every step.
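A minimal value-iteration sketch in Python for the 4x3 world: the 0.8/0.1/0.1 transition model, the −0.04 move cost, and the +1/−1 terminals are taken from the slides; the wall at (2,2), the fixed iteration count, and all function and variable names are illustrative assumptions:

    # Value iteration for the 4x3 world (coordinates are (column, row), (1,1) at bottom left).
    GAMMA = 1.0          # the basic version on the slides is undiscounted
    STEP_REWARD = -0.04  # "move cost = 0.04"
    TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
    WALL = (2, 2)
    STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != WALL]
    ACTIONS = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}
    PERP = {'up': ['left', 'right'], 'down': ['left', 'right'],
            'left': ['up', 'down'], 'right': ['up', 'down']}

    def move(s, a):
        """Deterministic result of heading in direction a from s (bounce off walls/edges)."""
        nxt = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
        return nxt if nxt in STATES else s

    def transitions(s, a):
        """0.8 intended direction, 0.1 each perpendicular direction."""
        p1, p2 = PERP[a]
        return [(0.8, move(s, a)), (0.1, move(s, p1)), (0.1, move(s, p2))]

    def value_iteration(iters=50):
        U = {s: 0.0 for s in STATES}
        for _ in range(iters):
            new_U = {}
            for s in STATES:
                if s in TERMINALS:
                    new_U[s] = TERMINALS[s]
                    continue
                best = max(sum(p * U[s2] for p, s2 in transitions(s, a)) for a in ACTIONS)
                new_U[s] = STEP_REWARD + GAMMA * best
            U = new_U
        return U

    print(value_iteration())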

When to stop value iteration?
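One standard stopping rule for the discounted case (a textbook bound; the slide itself leaves the question open): with discount factor γ < 1,

    if ||U_{t+1} − U_t||_∞ < ε (1 − γ) / γ, then ||U_{t+1} − U*||_∞ < ε,

so iterating until successive utility estimates differ by less than ε(1 − γ)/γ guarantees the answer is within ε of the exact utilities.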

Idea: Value determination (given a policy) is simpler than value iteration

Value Determination Algorithm
The VALUE-DETERMINATION algorithm can be implemented in one of two ways. The first is a simplification of the VALUE-ITERATION algorithm, replacing equation (17.4) with
U(i) ← R(i) + Σ_j M_ij^Policy(i) U(j)
and using the current utility estimates from policy iteration as the initial values. (Here Policy(i) is the action suggested by the policy in state i.) While this can work well in some environments, it will often take a very long time to converge in the early stages of policy iteration. This is because the policy will be more or less random, so many steps can be required to reach terminal states.

Value Determination Algorithm
The second approach is to solve for the utilities directly. Given a fixed policy P, the utilities of states obey a set of equations of the form
U(i) = R(i) + Σ_j M_ij^P(i) U(j).
For example, suppose P is the policy shown in Figure 17.2(a). Then, using the transition model M, we can construct the following set of equations:
U(1,1) = 0.8 U(1,2) + 0.1 U(1,1) + 0.1 U(2,1)
U(1,2) = 0.8 U(1,3) + 0.2 U(1,2)
and so on. This gives a set of 11 linear equations in 11 unknowns, which can be solved by linear algebra methods such as Gaussian elimination. For small state spaces, value determination using exact solution methods is often the most efficient approach.
Policy iteration converges to the optimal policy, and the policy improves monotonically for all states. The asynchronous version converges to the optimal policy if all states are visited infinitely often.
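A small NumPy sketch of the "solve directly" idea: under a fixed policy, U = R + γ M_π U, so U = (I − γ M_π)^(-1) R. The 3-state transition matrix, reward vector, discount factor, and names below are hypothetical (the slide's own example is the undiscounted 11-state system above):

    import numpy as np

    # Value determination for a fixed policy: (I - gamma * M_pi) U = R.
    # M_pi[i][j] = transition probability under the policy's action in state i.
    gamma = 0.95
    M_pi = np.array([[0.8, 0.2, 0.0],
                     [0.1, 0.8, 0.1],
                     [0.0, 0.2, 0.8]])
    R = np.array([0.0, -0.04, 1.0])

    U = np.linalg.solve(np.eye(3) - gamma * M_pi, R)
    print(U)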

Discounting
Infinite horizon → infinite U → policy & value iteration fail to converge. Also, what is rational: ∞ vs. ∞?
Solution: discounting. The discounted utility is finite if the discount factor γ < 1 (and rewards are bounded).
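In symbols (the standard discounted-utility definition, consistent with the slide but not spelled out on it):

    U([s_0, s_1, s_2, …]) = Σ_{t=0}^{∞} γ^t R(s_t) ≤ R_max / (1 − γ)   for 0 ≤ γ < 1 and |R(s_t)| ≤ R_max,

so the sum stays finite even over an infinite horizon, and infinite reward streams become comparable.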

Reinforcement Learning II: Reinforcement learning (RL) algorithms (we will focus solely on observable environments in this lecture) Tuomas Sandholm Carnegie Mellon University Computer Science Department

Passive learning
Epochs = training sequences:
(1,1) → (1,2) → (1,3) → (1,2) → (1,3) → (1,2) → (1,1) → (1,2) → (2,2) → (3,2) −1
(1,1) → (1,2) → (1,3) → (2,3) → (2,2) → (2,3) → (3,3) +1
(1,1) → (1,2) → (1,1) → (1,2) → (1,1) → (2,1) → (2,2) → (2,3) → (3,3) +1
(1,1) → (1,2) → (2,2) → (1,2) → (1,3) → (2,3) → (1,3) → (2,3) → (3,3) +1
(1,1) → (2,1) → (2,2) → (2,1) → (1,1) → (1,2) → (1,3) → (2,3) → (2,2) → (3,2) −1
(1,1) → (2,1) → (1,1) → (1,2) → (2,2) → (3,2) −1

Passive learning …
Figure: (a) a simple stochastic environment (start state marked). (b) Each state transitions to a neighboring state with equal probability among all neighboring states. State (4,2) is terminal with reward −1, and state (4,3) is terminal with reward +1. (c) The exact utility values.

LMS – updating [Widrow & Hoff 1960] function LMS-UPDATE(U,e,percepts,M,N) returns an update U if TERMINAL?[e] then reward-to-go  0 for each e i in percepts (starting at end) do reward-to-go  reward-to-go + REWARD[e i ] U[STATE[e i ]]  RUNNING-AVERAGE (U[STATE[e i ]], reward-to-go, N[STATE[e i ]]) end Average reward-to-go that state has gotten simple average batch mode

Converges slowly to the LMS estimate on the training set.

But utilities of states are not independent!
(Figure: a new state with U = ? next to a state with old U = −0.8; transition probability P = 0.9.)
An example where LMS does poorly: a new state is reached for the first time, and the agent then follows the path marked by the dashed lines, reaching a terminal state with reward +1.

Adaptive DP (ADP)
Idea: use the constraints (state transition probabilities) between states to speed learning.
Solve U(i) = R(i) + Σ_j M_ij U(j), i.e., value determination, using DP. There is no maximization over actions because the agent is passive, unlike in value iteration.
Problem: large state spaces, e.g. Backgammon: ~10^50 equations in ~10^50 variables.
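A sketch of the passive ADP idea: maintain counts of observed transitions under the fixed policy, estimate M from them, and re-solve the value-determination equations after each observation (here by simple iterative sweeps). The class and method names, the discount factor, and the sweep count are illustrative assumptions:

    from collections import defaultdict

    class PassiveADP:
        def __init__(self, gamma=0.95):
            self.gamma = gamma
            self.counts = defaultdict(lambda: defaultdict(int))  # counts[i][j]
            self.R = {}                                          # observed rewards
            self.U = defaultdict(float)

        def observe(self, i, r, j):
            """Record transition i -> j with reward r at i, then re-solve for U."""
            self.R[i] = r
            self.counts[i][j] += 1
            self._value_determination()

        def _value_determination(self, sweeps=50):
            """Iteratively solve U(i) = R(i) + gamma * sum_j M_est(i,j) U(j)."""
            for _ in range(sweeps):
                for i, succ in self.counts.items():
                    total = sum(succ.values())
                    expected = sum(n / total * self.U[j] for j, n in succ.items())
                    self.U[i] = self.R[i] + self.gamma * expected

    agent = PassiveADP()
    agent.observe((1, 1), -0.04, (1, 2))
    agent.observe((1, 2), -0.04, (1, 3))
    print(dict(agent.U))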

Temporal Difference (TD) Learning
Idea: do ADP backups on a per-move basis, not for the whole state space.
Thrm: The average value of U(i) converges to the correct value.
Thrm: If α is appropriately decreased as a function of the number of times a state is visited (α = α[N[i]]), then U(i) itself converges to the correct value.
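The per-move backup referred to here is the standard TD(0) update; a minimal sketch in Python, where the 1/N[i] learning-rate schedule, the discount factor, and the names are illustrative assumptions:

    def td_update(U, N, i, r, j, gamma=0.95):
        """One TD(0) backup after observing transition i -> j with reward r at i."""
        N[i] = N.get(i, 0) + 1
        alpha = 1.0 / N[i]        # decreasing learning rate alpha(N[i])
        U[i] = U.get(i, 0.0) + alpha * (r + gamma * U.get(j, 0.0) - U.get(i, 0.0))
        return U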

Algorithm TD(λ) (not in the Russell & Norvig book)
Idea: update from the whole epoch, not just on one state transition.
Special cases: λ = 1: LMS; λ = 0: TD.
An intermediate choice of λ (between 0 and 1) is best. Interplay with α …
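One common way to implement TD(λ) is with eligibility traces; this specific (accumulating-trace) mechanism is not spelled out on the slide, and the parameter values and names below are illustrative:

    def td_lambda_epoch(U, epoch, alpha=0.1, gamma=0.95, lam=0.7):
        """epoch: list of (state, reward, next_state) transitions from one training run.
        lam = 0 gives one-step TD; lam = 1 approaches the LMS / whole-epoch estimate."""
        e = {}                                      # eligibility traces
        for i, r, j in epoch:
            delta = r + gamma * U.get(j, 0.0) - U.get(i, 0.0)
            e[i] = e.get(i, 0.0) + 1.0              # accumulate trace for the current state
            for s in list(e):
                U[s] = U.get(s, 0.0) + alpha * delta * e[s]
                e[s] *= gamma * lam                 # decay all traces
        return U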

Convergence of TD(λ)
Thrm: converges with probability 1 under certain boundedness conditions: decrease α_i(t) such that Σ_t α_i(t) = ∞ and Σ_t α_i(t)^2 < ∞.
In practice, a fixed α is often used for all i and t.

Passive learning in an unknown environment
If the transition model M is unknown, ADP does not work directly; LMS & TD(λ) operate unchanged.
Changes to ADP: construct an environment model (an estimate of M) based on observations (state transitions) & run DP. Quick in the number of epochs, but a slow update per example. As the environment model approaches the correct model, the utility estimates converge to the correct utilities.

Passive learning in an unknown environment
ADP: full backup. TD: one-experience backup.
Whereas TD makes a single adjustment (to U) per observed transition, ADP makes as many adjustments (to U) as it needs to restore consistency between U and M. The change to M is local, but its effects may need to be propagated throughout U.

Passive learning in an unknown environment
TD can be viewed as a crude approximation of ADP. Adjustments in ADP can be viewed as pseudo-experience in TD. A model for generating pseudo-experience can be used in TD directly: DYNA [Sutton]. Cost of thinking vs. cost of acting.
Another idea: approximate the ADP iterations directly by restricting how many backups are done after each observed transition. The prioritized sweeping heuristic prefers to make adjustments to states whose likely successors have just undergone large adjustments in U(j).
- Learns roughly as fast as full ADP (in # epochs)
- Several orders of magnitude less computation → allows solving problems that are not solvable via ADP
- M is incorrect early on → use a (decreasing) minimum adjustment size as the threshold before recomputing U(i)

Active learning in an unknown environment
Now the agent also considers what actions to take. Algorithms for learning in this setting (action choice is discussed later):
ADP: learn the full transition model M^a_ij (for all actions) instead of the model for a fixed policy as before, and back up with the max over actions as in value iteration.
TD(λ): unchanged!
Model-based (learn M) vs. model-free (e.g., Q-learning). Which is better? An open question; there is a tradeoff.

Q-learning
Learn action values Q(a,i) directly. The direct approach (ADP) would require learning a model; Q-learning does not. Do this update after each state transition from i to j via action a, with reward R(i):
Q(a,i) ← Q(a,i) + α (R(i) + γ max_a' Q(a',j) − Q(a,i))   (with γ = 1 in the undiscounted case)
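A minimal Q-learning sketch in Python with the update above; the ε-greedy wrapper, the parameter values, and the names are illustrative assumptions (exploration is discussed on the next slides):

    import random
    from collections import defaultdict

    Q = defaultdict(float)          # Q[(action, state)]

    def q_update(a, i, r, j, actions, alpha=0.1, gamma=0.95):
        """One Q-learning backup after taking a in i, getting reward r, reaching j."""
        best_next = max(Q[(a2, j)] for a2 in actions)
        Q[(a, i)] += alpha * (r + gamma * best_next - Q[(a, i)])

    def choose_action(i, actions, epsilon=0.1):
        """Epsilon-greedy action selection."""
        if random.random() < epsilon:
            return random.choice(list(actions))
        return max(actions, key=lambda a: Q[(a, i)])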

Exploration
Tradeoff between exploitation (control) and exploration (identification). Extremes: greedy vs. random acting (n-armed bandit models).
Q-learning converges to the optimal Q-values if
* every state is visited infinitely often (due to exploration),
* the action selection becomes greedy as time approaches infinity, and
* the learning rate α is decreased fast enough but not too fast (as we discussed in TD learning).

Common exploration methods
1. E.g. in value iteration in an ADP agent: use an optimistic estimate of utility, U+(i), with an exploration function, e.g. f(u, n) = R+ if n < N, and u otherwise.
2. E.g. in TD(λ) or Q-learning: choose the best action with probability p and a random action otherwise.
3. E.g. in TD(λ) or Q-learning: Boltzmann exploration. (See the sketch below.)
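Sketches of methods 2 and 3 in Python; the probability p is written as 1 − ε here, and the temperature value and names are illustrative assumptions:

    import math
    import random

    def epsilon_greedy(q_values, epsilon=0.1):
        """q_values: dict action -> Q estimate. Best action w.p. 1 - epsilon, else random."""
        if random.random() < epsilon:
            return random.choice(list(q_values))
        return max(q_values, key=q_values.get)

    def boltzmann(q_values, temperature=1.0):
        """Sample an action with probability proportional to exp(Q / T).
        High T -> nearly random; T -> 0 -> nearly greedy."""
        actions = list(q_values)
        weights = [math.exp(q_values[a] / temperature) for a in actions]
        return random.choices(actions, weights=weights, k=1)[0]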

Reinforcement Learning III: Advanced topics Tuomas Sandholm Carnegie Mellon University Computer Science Department

Generalization
With a table lookup representation (of U, M, R, Q): up to 10,000 states or more. Chess ~ …, Backgammon ~ 10^50, industrial problems: hard to represent & visit all states!
Implicit representation, e.g. U(i) = w_1 f_1(i) + w_2 f_2(i) + … + w_n f_n(i). Chess states → n weights; this compression does the generalization.
E.g. Backgammon: observe 1/10^44 of the state space and beat any human.
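A sketch of the implicit (linear) representation with a TD-style gradient update on the weights; the features, parameter values, and names are illustrative assumptions, and as the next slide notes, the tabular convergence guarantees no longer apply:

    def predict(w, features):
        """U(i) = w_1*f_1(i) + ... + w_n*f_n(i) for a state's feature vector."""
        return sum(wk * fk for wk, fk in zip(w, features))

    def td_update_linear(w, f_i, r, f_j, alpha=0.01, gamma=0.95):
        """Gradient-style TD(0) update of the weight vector w.
        f_i, f_j: feature vectors of the current and next state."""
        delta = r + gamma * predict(w, f_j) - predict(w, f_i)
        return [wk + alpha * delta * fk for wk, fk in zip(w, f_i)]

    w = [0.0, 0.0, 0.0]
    w = td_update_linear(w, [1.0, 0.2, 0.0], -0.04, [1.0, 0.4, 0.1])
    print(w)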

Generalization …
Could use any supervised learning algorithm for the generalization part: input = sensation → generalization → estimate (of Q or U, …), with the training signal coming from the RL update.
The convergence results do not apply with generalization.
Pseudo-experience requires predicting many steps ahead (not supported by standard generalization methods).

Convergence results of Q-learning:
- Tabular representation: converges to Q*.
- Function approximation:
  - state aggregation: converges to Q*
  - averagers: converges to Q*
  - general (nonlinear): diverges
  - linear: on-policy prediction converges to Q^π; control chatters, bound unknown; off-policy diverges

Applications of RL Checker’s [Samuel 59] TD-Gammon [Tesauro 92] World’s best downpeak elevator dispatcher [Crites at al ~95] Inventory management [Bertsekas et al ~95] –10-15% better than industry standard Dynamic channel assignment [Singh & Bertsekas, Nie&Haykin ~95] –Outperforms best heuristics in the literature Cart-pole [Michie&Chambers 68-] with bang-bang control Robotic manipulation [Grupen et al. 93-] Path planning Robot docking [Lin 93] Parking Football Tetris Multiagent RL [Tan 93, Sandholm&Crites 95, Sen 94-, Carmel&Markovitch 95-, lots of work since] Combinatorial optimization: maintenance & repair –Control of reasoning [Zhang & Dietterich IJCAI-95]

TD-Gammon
TD(λ) learning with a backpropagation neural net. Start with a random net; learned by playing 1.5 million games against itself. As good as the best human in the world.
Expert-labeled examples are scarce, expensive & possibly wrong; self-play is cheap & teaches the real solution. Hand-crafted features help.
(Figure: performance against Gammontool as a function of the number of hidden units, for TD-Gammon (self-play) and Neurogammon (trained on 15,000 supervised learning examples).)

Multiagent RL
Each agent as a Q-table entry, e.g. in a communication network.
Each agent as an intentional entity. The opponent's behavior can vary for a given sensation of the agent because:
–The opponent uses a different sensation than the agent, e.g. a longer window or different features (stochasticity in steady state)
–The opponent learned: sensation → Q-values (nonstationarity)
–The opponent's exploration policy (Q-values → action probabilities) changed
–The opponent's action selector chose a different action (stochasticity)
(Diagram: the sensation at step n and the reward from step n−1 feed a Q-storage (Q_coop, Q_def); a deterministic explorer maps these to p(coop), p(def); a random process then selects the action a_n.)

Future research in RL
Function approximation (& convergence results)
On-line experience vs. simulated experience
Amount of search in action selection
Exploration method (safe?)
Kind of backups
–Full (DP) vs. sample backups (TD)
–Shallow (Monte Carlo) vs. deep (exhaustive); λ controls this in TD(λ)
Macros
–Advantages: reduce the complexity of learning by learning subgoals (macros) first; can be learned by TD(λ)
–Problems: selection of macro actions; learning models of macro actions (predicting their outcomes); how do you come up with subgoals?