COSC 878 Seminar on Large Scale Statistical Machine Learning

Today’s Plan
– Course Website: 878/
– Join the Google group:
– Students Introduction
– Team-up and Presentation Scheduling
– First Talk

Reinforcement Learning: A Survey. Grace, 1/13/15

What is Reinforcement Learning? The problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment.

Solve RL Problems – Two Strategies
– Search in the behavior space: find one behavior that performs well in the environment (genetic algorithms, genetic programming)
– Statistical methods and dynamic programming: estimate the utility of taking actions in states of the world (we focus on this strategy)

Standard RL model
[Figure: the agent-environment loop — the agent observes the state and reward from the environment and responds with an action.]

What we learn in RL The agent’s job is to find a policy \pi that maximizes some long-run measure of reinforcement. – A policy \pi maps states to actions – Reinforcement = reward

Difference between RL and Supervised Learning
– In RL there is no presentation of input/output pairs: no training data; we only know the immediate reward, not the best actions in the long run
– In RL the system must be evaluated online while it learns: online evaluation (knowing the online performance) is important

Difference between RL and AI/Planning
– AI planning algorithms are less general: they require a predefined model of state transitions and assume determinism
– RL assumes that the state space can be enumerated and stored in memory

Models
The difficult part: how to incorporate the future into the model.
Three models:
– Finite horizon
– Infinite horizon
– Average-reward

Finite Horizon
At a given moment in time, the agent optimizes its expected reward for the next h steps and ignores what will happen after those h steps.
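In symbols, this is the standard finite-horizon objective (r_t here denotes the reward received t steps into the future; the notation is standard rather than taken from the slide):

```latex
E\!\left[\sum_{t=0}^{h} r_t\right]
```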

Infinite Horizon
Maximize the long-run reward.
Does not put a limit on the number of future steps; future rewards are discounted geometrically by a discount factor \gamma (between 0 and 1).
Mathematically more tractable than the finite horizon.
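The corresponding discounted objective, in its standard form (conventions differ on whether the endpoints of the \gamma range are included):

```latex
E\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right], \qquad 0 \le \gamma < 1
```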

Average-reward
Maximize the long-run average reward. This is the limiting case of the infinite-horizon model as \gamma approaches 1.
Weakness: it cannot tell when the large rewards arrive; if we prefer a large initial reward, this model has no way to express that preference.
Cure: maximize both the long-run average and the initial rewards (the bias-optimal model).
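In the same notation, the average-reward criterion is commonly written as:

```latex
\lim_{h \to \infty} E\!\left[\frac{1}{h}\sum_{t=0}^{h} r_t\right]
```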

Compare model optimality
An example MDP with a single action, in which all unlabeled arrows produce a reward of 0.
[Figure: three reward sequences (upper, middle and lower lines) starting from the same state.]

Compare model optimality
Finite horizon, h = 4: Upper line: 2 + 2 + 2 = 6; Middle line: 0; Lower line: 0.

Compare model optimality
Infinite horizon, \gamma = 0.9:
Upper line: 0*0.9^0 + 0*0.9^1 + 2*0.9^2 + 2*0.9^3 + 2*0.9^4 + … = 2*0.9^2*(1 + 0.9 + 0.9^2 + …) = 1.62/(1 − 0.9) = 16.2
Middle line: 10*0.9^5 + 10*0.9^6 + … = 10*0.9^5/(1 − 0.9) ≈ 59
Lower line: 11*0.9^6 + 11*0.9^7 + … = 11*0.9^6/(1 − 0.9) ≈ 58.5

Compare model optimality
Average reward: Upper line: ≈ 2; Middle line: ≈ 10; Lower line: ≈ 11.

Parameters
The finite-horizon and infinite-horizon models each have a parameter: h and \gamma, respectively.
These parameters affect which behavior the optimality model prefers, so choose them carefully in your application.
The average-reward model’s advantage: it is not influenced by such parameters.

MARKOV MODELS

Markov Process
Markov property (the “memoryless” property): the next state of a system depends only on its current state: Pr(S_{i+1} | S_i, …, S_0) = Pr(S_{i+1} | S_i).
Markov process: a stochastic process with the Markov property [A. A. Markov, ’06].
[Figure: chain s_0 → s_1 → … → s_i → s_{i+1}]

Family of Markov Models
Markov Chain, Hidden Markov Model, Markov Decision Process, Partially Observable Markov Decision Process, Multi-armed Bandit.

Markov Chain
A discrete-time Markov process, defined by (S, M).
– State S: a web page
– Transition probability M: built from the # of pages, the # of outlinks, the pages linked to S, and a random jump factor
Example: Google PageRank [L. Page et al., ’99]. PageRank measures how likely a random web surfer will land on a page; the stable-state distribution of such a Markov chain is PageRank.
[Figure: web graph over pages A–E, each annotated with its PageRank.]
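A sketch of how those ingredients typically combine (this specific form is an assumption; the slide only names the components): with random jump factor d and N pages, the PageRank of page S is

```latex
PR(S) = \frac{1-d}{N} + d \sum_{P \to S} \frac{PR(P)}{\mathrm{outlinks}(P)}
```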

Hidden Markov Model
A Markov chain whose states are hidden; observable symbols are emitted with some probability according to the states [Leonard E. Baum et al., ’66]. Defined by (S, M, O, e).
– S_i: hidden state
– p_i: transition probability
– o_i: observation
– e_i: observation probability (emission probability)
[Figure: hidden chain s_0 → s_1 → s_2 → …, each state emitting an observation o_0, o_1, o_2, …]
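A minimal sketch of how these pieces combine, using the standard HMM factorization in the notation above: the joint probability of a hidden state sequence and an observation sequence is

```latex
\Pr(s_{0:T}, o_{0:T}) = \Pr(s_0)\, e(o_0 \mid s_0) \prod_{t=1}^{T} \Pr(s_t \mid s_{t-1})\, e(o_t \mid s_t)
```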

Markov Decision Process
An MDP extends a Markov chain with actions and rewards [R. Bellman, ’57]. Defined by (S, M, A, R, γ).
– s_i: state
– a_i: action
– r_i: reward
– p_i: transition probability
[Figure: chain s_0 → s_1 → s_2 → s_3, where each transition is chosen by an action a_i and yields a reward r_i.]

Definition of MDP
A tuple (S, M, A, R, γ):
– S: state space
– M: transition matrix, M_a(s, s') = P(s' | s, a)
– A: action space
– R: reward function, R(s, a) = immediate reward for taking action a at state s
– γ: discount factor, 0 < γ ≤ 1
Policy π: π(s) = the action taken at state s.
Goal: find an optimal policy π* maximizing the expected total rewards, as written below.
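Written out in the standard discounted form (assuming rewards are collected along the trajectory induced by π):

```latex
\pi^{*} = \arg\max_{\pi}\; E\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R\big(s_t, \pi(s_t)\big)\right]
```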

Policy
Policy: π(s) = a, the action a to select at state s.
Example: π(s_0) = move right and up; π(s_1) = move right and up; π(s_2) = move right.
[Slide altered from Carlos Guestrin’s ML lecture]

Value of Policy
Value V^π(s): expected long-term reward starting from s.
Start from s_0 and take action π(s_0), receiving R(s_0):
V^π(s_0) = E[R(s_0) + γ R(s_1) + γ^2 R(s_2) + γ^3 R(s_3) + γ^4 R(s_4) + …]
Future rewards are discounted by γ ∈ [0, 1).
[Slide altered from Carlos Guestrin’s ML lecture]

Value of Policy
Value V^π(s): expected long-term reward starting from s.
Start from s_0 and take action π(s_0); the next state may be s_1, s_1', or s_1'', with rewards R(s_1), R(s_1'), R(s_1'').
V^π(s_0) = E[R(s_0) + γ R(s_1) + γ^2 R(s_2) + γ^3 R(s_3) + γ^4 R(s_4) + …]
Future rewards are discounted by γ ∈ [0, 1).
[Slide altered from Carlos Guestrin’s ML lecture]

Value of Policy
Value V^π(s): expected long-term reward starting from s.
Start from s_0 and follow π; each step branches over the possible next states (s_1, s_1', s_1'', then s_2, s_2', s_2'', …) with their rewards.
V^π(s_0) = E[R(s_0) + γ R(s_1) + γ^2 R(s_2) + γ^3 R(s_3) + γ^4 R(s_4) + …]
Future rewards are discounted by γ ∈ [0, 1).
[Slide altered from Carlos Guestrin’s ML lecture]

Computing the value of a policy
[Equation figure: the value function of the current state expressed in terms of the values of the possible next states.]
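The recursion that figure shows is, in standard form (a sketch using the notation from the MDP definition above):

```latex
V^{\pi}(s) = R\big(s, \pi(s)\big) + \gamma \sum_{s'} M_{\pi(s)}(s, s')\, V^{\pi}(s')
```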

Optimality: Bellman Equation
The Bellman equation [R. Bellman, ’57] for an MDP is a recursive definition of the optimal state-value function V*(.), from which the optimal policy follows.
[Equation figure: the Bellman optimality equation for the state-value function and the resulting optimal policy.]
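A standard statement of what that figure contains (a sketch in the notation above, not the slide’s exact typesetting):

```latex
V^{*}(s) = \max_{a}\Big[ R(s, a) + \gamma \sum_{s'} M_a(s, s')\, V^{*}(s') \Big],
\qquad
\pi^{*}(s) = \arg\max_{a}\Big[ R(s, a) + \gamma \sum_{s'} M_a(s, s')\, V^{*}(s') \Big]
```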

Optimality: Bellman Equation
The Bellman equation can be rewritten in terms of the action-value function Q, together with the relationship between V and Q and the resulting optimal policy.
[Equation figure: the Bellman optimality equation for the action-value function, the relationship between V and Q, and the optimal policy.]
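A standard form of those identities (again a sketch in the notation above):

```latex
Q^{*}(s, a) = R(s, a) + \gamma \sum_{s'} M_a(s, s') \max_{a'} Q^{*}(s', a'),
\qquad
V^{*}(s) = \max_{a} Q^{*}(s, a),
\qquad
\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)
```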

MDP algorithms
All of these solve the Bellman equation to obtain the optimal value V*(s) and the optimal policy π*(s); a value-iteration sketch follows below.
– Model-based approaches: Value Iteration, Policy Iteration, Modified Policy Iteration, Prioritized Sweeping
– Model-free approaches: Temporal Difference (TD) Learning, Q-Learning
[Bellman, ’57; Howard, ’60; Puterman and Shin, ’78; Singh & Sutton, ’96; Sutton & Barto, ’98; Richard Sutton, ’88; Watkins, ’92]
[Slide altered from Carlos Guestrin’s ML lecture]
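As a concrete illustration of the model-based side, here is a minimal value-iteration sketch on a small made-up MDP. The states, actions, transition probabilities, and rewards are hypothetical, invented purely for illustration; only the Bellman backup itself reflects the slides.

```python
# Minimal value iteration on a tiny, made-up MDP (illustrative only).
# M[a][s] plays the role of the transition matrix M_a(s, s') and
# R[s][a] the reward function R(s, a) from the MDP definition above.

GAMMA = 0.9
STATES = ["s0", "s1", "s2"]
ACTIONS = ["stay", "go"]

# Transition probabilities M_a(s, s') = P(s' | s, a) -- hypothetical values
M = {
    "stay": {"s0": {"s0": 1.0}, "s1": {"s1": 1.0}, "s2": {"s2": 1.0}},
    "go":   {"s0": {"s1": 0.8, "s0": 0.2},
             "s1": {"s2": 0.8, "s1": 0.2},
             "s2": {"s2": 1.0}},
}

# Immediate rewards R(s, a) -- hypothetical values
R = {
    "s0": {"stay": 0.0, "go": 0.0},
    "s1": {"stay": 0.0, "go": 1.0},
    "s2": {"stay": 2.0, "go": 2.0},
}


def value_iteration(tol=1e-6):
    """Repeat the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            # Q(s, a) = R(s, a) + gamma * sum_{s'} P(s' | s, a) * V(s')
            q = {a: R[s][a] + GAMMA * sum(p * V[nxt] for nxt, p in M[a][s].items())
                 for a in ACTIONS}
            best = max(q.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy extraction: pi*(s) = argmax_a Q(s, a)
    pi = {s: max(ACTIONS,
                 key=lambda a: R[s][a] + GAMMA * sum(p * V[nxt] for nxt, p in M[a][s].items()))
          for s in STATES}
    return V, pi


if __name__ == "__main__":
    V, pi = value_iteration()
    print("Optimal values:", V)
    print("Optimal policy:", pi)
```

Swapping the full Bellman backup for sampled updates from observed transitions, instead of using M and R directly, is what moves an algorithm into the model-free column (TD learning, Q-learning).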