
Reinforcement Learning: Learning from Interaction


1 Reinforcement Learning: Learning from Interaction
Winter School on Machine Learning and Vision, 2010. B. Ravindran. Many slides adapted from Sutton and Barto.

2 Learning to Control So far we have looked at two models of learning:
Supervised: classification, regression, etc. Unsupervised: clustering, etc. How did you learn to cycle? Neither of the above. Trial and error! Falling down hurts!

3 Can You Hear Me Now?

4 Reinforcement Learning
A trial-and-error learning paradigm: rewards and punishments. Not just an algorithm but a paradigm in itself. Learn about a system and how to control its behaviour from minimal feedback. Inspired by behavioural psychology.

5 RL Framework
[Diagram: agent and environment exchanging state, action, and evaluation signals]
Learn from close interaction with a stochastic environment. Noisy, delayed, scalar evaluation. Maximize a measure of long-term performance.

6 Not Supervised Learning!
[Diagram: agent mapping input to output, compared against a target to produce an error signal]
Very sparse “supervision”. No target output provided. No error gradient information available. Action chooses the next state. Explore to estimate the gradient: trial-and-error learning.

7 Not Unsupervised Learning
[Diagram: agent mapping input to an activation, receiving only an evaluation]
Sparse “supervision” available. Pattern detection is not the primary goal.

8 TD-Gammon (Tesauro 1992, 1994, 1995, ...) White has just rolled a 5 and a 2, so can move one of his pieces 5 steps and one (possibly the same) 2 steps. Objective is to advance all pieces to points 19-24, hitting opposing pieces along the way. 30 pieces and 24 locations imply an enormous number of configurations. Effective branching factor of 400.

9 The Agent-Environment Interface
Agent and environment interact at discrete time steps, generating a trajectory $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, \dots$

10 The Agent Learns a Policy
Reinforcement learning methods specify how the agent changes its policy as a result of experience. Roughly, the agent’s goal is to get as much reward as it can over the long run.
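The slide's definition of a policy did not survive the transcript; in the Sutton and Barto notation presumably intended here, a policy at step t is a mapping from states to action-selection probabilities:
$$\pi_t(s, a) = \Pr\{a_t = a \mid s_t = s\}.$$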

11 Goals and Rewards Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible. A goal should specify what we want to achieve, not how we want to achieve it. A goal must be outside the agent’s direct control, and thus outside the agent. The agent must be able to measure success: explicitly, and frequently during its lifespan.

12 Returns Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze. The return sums the rewards received up to T, where T is a final time step at which a terminal state is reached, ending an episode.
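The return on this slide, in the first-edition Sutton and Barto notation used throughout, is presumably the undiscounted episodic return:
$$R_t = r_{t+1} + r_{t+2} + \cdots + r_T.$$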

13 The Markov Property By “the state” at step t, we mean whatever information is available to the agent at step t about its environment. The state can include immediate “sensations”, highly processed sensations, and structures built up over time from sequences of sensations. Ideally, a state should summarize past sensations so as to retain all “essential” information, i.e., it should have the Markov Property:
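The Markov Property referred to here, written out in standard notation:
$$\Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, \dots, s_0, a_0\} = \Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t\}$$
for all $s'$, $r$, and all possible histories.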

14 Markov Decision Processes
If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP). If the state and action sets are finite, it is a finite MDP. To define a finite MDP, you need to give: the state and action sets, and the one-step “dynamics” defined by transition probabilities and reward expectations:
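The two missing formulas, in the notation used elsewhere in these slides, are presumably:
$$P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}, \qquad R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}.$$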

15 An Example Finite MDP Recycling Robot
At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge. Searching is better but runs down the battery; if it runs out of power while searching, it has to be rescued (which is bad). Decisions are made on the basis of the current energy level: high or low. Reward = number of cans collected.

16 Recycling Robot MDP
[Transition graph omitted: states high and low; actions search, wait, and recharge]
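A minimal sketch of this MDP as a transition table, following the standard Sutton and Barto parameterization; the numeric values of alpha, beta, r_search, and r_wait below are assumptions for illustration, not values given on the slide:

```python
# Recycling robot MDP: transition probabilities and rewards, keyed by (state, action).
# alpha/beta are search-success probabilities; the -3 rescue penalty follows the textbook example.
alpha, beta = 0.9, 0.6          # assumed values for illustration
r_search, r_wait = 2.0, 1.0     # expected cans found while searching / waiting

# Each entry maps (state, action) -> list of (probability, next_state, reward).
recycling_mdp = {
    ("high", "search"):   [(alpha, "high", r_search), (1 - alpha, "low", r_search)],
    ("high", "wait"):     [(1.0, "high", r_wait)],
    ("low",  "search"):   [(beta, "low", r_search), (1 - beta, "high", -3.0)],  # rescued
    ("low",  "wait"):     [(1.0, "low", r_wait)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}

def actions(state):
    """Actions available in a given state."""
    return [a for (s, a) in recycling_mdp if s == state]
```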

17 Value Functions The value of a state is the expected return starting from that state; it depends on the agent’s policy: The value of a state-action pair is the expected return starting from that state, taking that action, and thereafter following π:
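The two definitions on this slide, reconstructed in the notation of the rest of the lecture:
$$V^\pi(s) = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\Big\}, \qquad Q^\pi(s,a) = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a\Big\}.$$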

18 Bellman Equation for a Policy π
The basic idea: the return decomposes one step at a time, so the value of a state can be written in terms of the values of its successor states, with or without the expectation operator.
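The equations on this slide did not survive the transcript; in standard Sutton and Barto notation they are presumably:
$$R_t = r_{t+1} + \gamma R_{t+1},$$
$$V^\pi(s) = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\},$$
$$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^\pi(s')\big].$$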

19 Optimal Value Functions
For finite MDPs, policies can be partially ordered: There is always at least one (and possibly more than one) policy that is better than or equal to all the others; this is an optimal policy. We denote them all π*. Optimal policies share the same optimal state-value function:
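The ordering and the optimal value functions referenced here, in the usual notation:
$$\pi \ge \pi' \iff V^\pi(s) \ge V^{\pi'}(s) \text{ for all } s,$$
$$V^*(s) = \max_\pi V^\pi(s), \qquad Q^*(s,a) = \max_\pi Q^\pi(s,a).$$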

20 Bellman Optimality Equation
The value of a state under an optimal policy must equal the expected return for the best action from that state. V* is the unique solution of this system of nonlinear equations.
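The Bellman optimality equation for $V^*$, as presumably shown on the slide:
$$V^*(s) = \max_a \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^*(s')\big].$$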

21 Bellman Optimality Equation
Similarly, the optimal value of a state-action pair is the expected return for taking that action and thereafter following the optimal policy. Q* is the unique solution of this system of nonlinear equations.
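The corresponding Bellman optimality equation for $Q^*$:
$$Q^*(s,a) = \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma \max_{a'} Q^*(s',a')\big].$$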

22 Dynamic Programming DP is the solution method of choice for MDPs
Requires complete knowledge of the system dynamics (P and R). Expensive and often not practical: curse of dimensionality. Guaranteed to converge! RL methods: online approximate dynamic programming. No knowledge of P and R; sample trajectories through state space. Some theoretical convergence analysis available.

23 Policy Evaluation Policy evaluation: for a given policy π, compute the state-value function V^π. Recall the Bellman equation for V^π; using it as an update rule gives an iterative algorithm.
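A minimal sketch of iterative policy evaluation on the recycling-robot MDP sketched earlier; the function and variable names here are illustrative, not from the slides:

```python
def policy_evaluation(mdp, policy, states, gamma=0.9, theta=1e-8):
    """Repeatedly apply the Bellman expectation backup until values stop changing.

    mdp:    dict mapping (state, action) -> list of (prob, next_state, reward)
    policy: dict mapping state -> dict of action -> probability
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                pi_sa * prob * (reward + gamma * V[s2])
                for a, pi_sa in policy[s].items()
                for prob, s2, reward in mdp[(s, a)]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Example: evaluate the policy that always searches.
states = ["high", "low"]
always_search = {s: {"search": 1.0} for s in states}
# V = policy_evaluation(recycling_mdp, always_search, states)
```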

24 Policy Improvement Suppose we have computed V^π for a deterministic policy π. For a given state s, would it be better to do an action a ≠ π(s)?

25 Policy Improvement Cont.

26 Policy Improvement Cont.
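Slides 25 and 26 are equation-only in the original; the standard content at this point in Sutton and Barto (assumed here) is the policy improvement theorem and the greedy policy:
$$Q^\pi(s, \pi'(s)) \ge V^\pi(s) \text{ for all } s \;\Rightarrow\; V^{\pi'}(s) \ge V^\pi(s) \text{ for all } s,$$
$$\pi'(s) = \arg\max_a Q^\pi(s,a) = \arg\max_a \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^\pi(s')\big],$$
and if $V^{\pi'} = V^\pi$, then both policies satisfy the Bellman optimality equation and are optimal.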

27 Policy Iteration Alternate policy evaluation and policy improvement (“greedification”).
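A sketch of the policy iteration loop, reusing the hypothetical policy_evaluation helper sketched above:

```python
def greedy_policy(mdp, V, states, gamma=0.9):
    """Policy improvement: act greedily with respect to the current value function."""
    policy = {}
    for s in states:
        acts = [a for (s0, a) in mdp if s0 == s]
        best = max(acts, key=lambda a: sum(p * (r + gamma * V[s2])
                                           for p, s2, r in mdp[(s, a)]))
        policy[s] = {best: 1.0}
    return policy

def policy_iteration(mdp, states, gamma=0.9):
    """Alternate evaluation and greedification until the policy stops changing."""
    policy = {s: {[a for (s0, a) in mdp if s0 == s][0]: 1.0} for s in states}
    while True:
        V = policy_evaluation(mdp, policy, states, gamma)
        improved = greedy_policy(mdp, V, states, gamma)
        if improved == policy:
            return policy, V
        policy = improved
```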

28 Value Iteration Recall the Bellman optimality equation:
We can convert it to a full value iteration backup and iterate until “convergence”.
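The backup on this slide, written out:
$$V_{k+1}(s) = \max_a \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V_k(s')\big].$$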

29 Generalized Policy Iteration
Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity. A geometric metaphor for convergence of GPI: [diagram omitted]

30 Dynamic Programming
[Backup diagram omitted: the DP backup branches over all actions and all possible successor states, down to terminal states T]

31 Simplest TD Method
[Backup diagram omitted: the TD(0) backup follows a single sampled transition to the next state]
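The TD(0) update behind this slide, in standard notation:
$$V(s_t) \leftarrow V(s_t) + \alpha\big[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big].$$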

32 RL Algorithms – Prediction
Policy evaluation (the prediction problem): for a given policy, compute the state-value function. No knowledge of P and R, but access to the real system or a “sample” model is assumed. Uses “bootstrapping” and sampling.

33 Advantages of TD TD methods do not require a model of the environment, only experience. TD methods can be fully incremental: you can learn before knowing the final outcome (less memory, less peak computation), and you can learn without the final outcome (from incomplete sequences).

34 RL Algorithms – Control
SARSA. Q-learning.
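A minimal sketch of the two tabular control updates named on this slide; the environment interface (env.reset, env.step) and the ε-greedy helper are assumptions for illustration, not an API from the slides:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, eps=0.1):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.99, eps=0.1):
    """One episode of Q-learning: off-policy backup uses the max over next actions."""
    s, done = env.reset(), False
    while not done:
        a = epsilon_greedy(Q, s, actions, eps)
        s2, r, done = env.step(a)
        target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in actions))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.99, eps=0.1):
    """One episode of SARSA: on-policy backup uses the action actually taken next."""
    s, done = env.reset(), False
    a = epsilon_greedy(Q, s, actions, eps)
    while not done:
        s2, r, done = env.step(a)
        a2 = epsilon_greedy(Q, s2, actions, eps)
        target = r + (0.0 if done else gamma * Q[(s2, a2)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s2, a2

# Q = defaultdict(float)  # shared action-value table for either method
```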

35 Cliffwalking ε-greedy, ε = 0.1 [figure omitted]

36 Actor-Critic Methods Explicit representation of the policy as well as the value function. Minimal computation to select actions. Can learn an explicit stochastic policy. Can put constraints on policies. Appealing as psychological and neural models.

37 Actor-Critic Details
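This slide is equation-only in the original; the standard actor-critic updates at this point in Sutton and Barto (assumed here) are driven by the TD error
$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t),$$
with the critic updated as $V(s_t) \leftarrow V(s_t) + \alpha\,\delta_t$ and the actor increasing the preference for the action just taken, $p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta\,\delta_t$, the policy being a softmax over preferences.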

38 Applications of RL Robot navigation. Adaptive control: helicopter pilot! Combinatorial optimization: VLSI placement and routing, elevator dispatching. Game playing: backgammon, world's best player! Computational neuroscience: modeling of reward processes.

39 Other Topics Different measures of return: average rewards, discounted returns. Policy gradient approaches: directly perturb policies. Generalization [Saturday]: function approximation, temporal abstraction. Least squares methods: better use of data, more suited for “off-line” RL.

40 References
Mitchell, T. Machine Learning. McGraw-Hill, 1997.
Russell, S. J. and Norvig, P. Artificial Intelligence: A Modern Approach. Pearson Education.
Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press.
Dayan, P. and Abbott, L. F. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press.
Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific.

