Reinforcement Learning

Reinforcement Learning CSLT ML Summer Seminar (12) Dong Wang Most slides from David Silver's UCL Course on RL Some slides from John Schulman's 'Deep Reinforcement Learning', Lecture 1

Content What is reinforcement learning? Shallow discrete learning Deep Q learning

What is reinforcement learning? Reinforcement learning is the problem faced by an agent that learns behaviour through trial-and-error interactions with a dynamic environment. Given a state by the environment, the agent learns how to take an action, receives some (random) reward in return, and the system moves to another (random) state. The probabilities of the reward and of the next state are stationary.
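As a concrete illustration of this interaction loop (not part of the original slides), here is a minimal sketch in Python; the two-state "chain" environment and the random policy are made-up placeholders.

```python
import random

# A minimal, self-contained sketch of the agent-environment loop described above.
# The two-state chain environment and the random policy are illustrative assumptions.
class ChainEnv:
    """States {0, 1}; taking action 1 from state 1 reaches the terminal goal."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if self.state == 0:
            self.state = 1 if action == 1 else 0
            return self.state, 0.0, False            # no reward yet, episode continues
        done = (action == 1)                          # from state 1, action 1 terminates
        return self.state, (1.0 if done else 0.0), done

def run_episode(env, policy):
    state, done, total = env.reset(), False, 0.0
    while not done:
        action = policy(state)                        # agent chooses an action for this state
        state, reward, done = env.step(action)        # environment returns reward and next state
        total += reward
    return total

print(run_episode(ChainEnv(), lambda s: random.choice([0, 1])))
```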

An example

Main components of RL

It is different from other tasks Unlike supervised learning, it has no 'label', e.g., nothing tells the agent which action should be taken. Feedback is often delayed, e.g., in game playing. Time is important: it is a sequential decision process, and decisions impact the environment. Compared to unsupervised learning, it does have some 'supervision' (the reward).

Applications Fly stunt manoeuvres in a helicopter Defeat the world champion at Backgammon Manage an investment portfolio Control a power station Make a humanoid robot walk Play many different Atari games better than humans

Robot

Robot

Business

Finance

Media

Medicine

Sequence prediction

Game playing http://www.nature.com/nature/journal/v518/n7540/fig_tab/nature14236_SV1.html http://www.nature.com/nature/journal/v518/n7540/fig_tab/nature14236_SV2.html

Some important things to mention Whether the environment is known (e.g., the transition and reward probabilities) Whether we can observe the state (hidden or explicit) Whether we need to model the environment Whether we learn from episodes or online Whether we use function approximation or an explicit table

Content What is reinforcement learning? Shallow discrete learning Deep Q learning

Markov decision process Markov decision processes formally describe an environment for reinforcement learning Where the environment is fully observable, i.e. the current state completely characterises the process Almost all RL problems can be formalised as MDPs, e.g. Optimal control primarily deals with continuous MDPs Partially observable problems can be converted into MDPs Bandits are MDPs with one state

Markov process

Markov reward process

An example of Markov reward process

Return in Markov reward process

Value function

Bellman Equation

Markov decision process (MDP)

Policy

Value function in MDP

Bellman Expectation Equation in MDP

Relation between the two value functions

Relation between the two value functions

Optimal value function

Optimal policy

Find optimal policy

Bellman optimization They are non-linear (because of the max()), so there is no closed-form solution Iterative procedures can solve them
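For reference, the state-value form of the Bellman optimality equation, in the notation of David Silver's course from which these slides are drawn, is

v_*(s) = \max_a \big( \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \, v_*(s') \big)

The max over actions is exactly what makes the system of equations non-linear, so it is solved iteratively rather than in closed form.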

Policy evaluation Given a policy, compute the value function at each state.
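A minimal sketch of iterative policy evaluation, assuming a toy two-state MDP written as explicit transition and reward tables (the tables, the fixed policy and the discount factor are illustrative assumptions, not an example from the slides):

```python
# Iterative policy evaluation on a tiny, assumed 2-state MDP.
# P[s][a]: list of (probability, next_state); R[s][a]: expected immediate reward.
P = {0: {"stay": [(1.0, 0)], "go": [(0.9, 1), (0.1, 0)]},
     1: {"stay": [(1.0, 1)], "go": [(1.0, 0)]}}
R = {0: {"stay": 0.0, "go": 1.0},
     1: {"stay": 0.5, "go": 0.0}}
gamma = 0.9
policy = {0: "go", 1: "stay"}            # a fixed policy to be evaluated

V = {s: 0.0 for s in P}                   # initialise V(s) = 0
for _ in range(1000):                     # sweep until (approximately) converged
    delta = 0.0
    for s in P:
        a = policy[s]
        # Bellman expectation backup: V(s) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        new_v = R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < 1e-8:
        break
print(V)
```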

Improve policy

General policy process

Value iteration Does not involve an explicit policy update; the optimal policy is nevertheless learned implicitly through the 'max' operation.
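A matching sketch of value iteration on the same kind of toy MDP (again an assumed example): the max over actions replaces the fixed policy used in policy evaluation, and the optimal policy is read off greedily at the end.

```python
# Value iteration on an assumed toy 2-state MDP.
P = {0: {"stay": [(1.0, 0)], "go": [(0.9, 1), (0.1, 0)]},
     1: {"stay": [(1.0, 1)], "go": [(1.0, 0)]}}
R = {0: {"stay": 0.0, "go": 1.0},
     1: {"stay": 0.5, "go": 0.0}}
gamma = 0.9

V = {s: 0.0 for s in P}
for _ in range(1000):
    delta = 0.0
    for s in P:
        # Bellman optimality backup: V(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
        new_v = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                    for a in P[s])
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < 1e-8:
        break

# The optimal policy is recovered greedily from the converged values.
pi = {s: max(P[s], key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
      for s in P}
print(V, pi)
```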

Can be performed asynchronously Three simple ideas for asynchronous dynamic programming: In-place dynamic programming Prioritised sweeping Real-time dynamic programming

Full-width approach and sampling approach The dynamic-programming approach above backs up over all possible successor states, which requires knowing the dynamics of the system. Alternatively, we can design a model-free approach based on sampling: the agent interacts with the environment and learns from actual experience. The sampling approach is easier to implement and more efficient.

Monte-Carlo Reinforcement Learning MC methods learn directly from episodes of experience MC is model-free: no knowledge of MDP transitions / rewards MC learns from complete episodes: no bootstrapping MC uses the simplest possible idea: value = mean return Caveat: can only apply MC to episodic MDPs All episodes must terminate
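A minimal first-visit Monte-Carlo evaluation sketch of the "value = mean return" idea; the hand-written episodes below are placeholders, not data from the slides.

```python
from collections import defaultdict

# First-visit Monte-Carlo policy evaluation on made-up episodes.
# Each episode is a list of (state, reward) pairs, where the reward is the one
# received on leaving that state; every episode terminates.
episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("A", 0.0), ("B", 1.0)],
]
gamma = 0.9

returns = defaultdict(list)
for episode in episodes:
    # Record the index of the first visit to each state.
    first_visit = {}
    for t, (s, _) in enumerate(episode):
        first_visit.setdefault(s, t)
    # Work backwards accumulating the discounted return G_t.
    Gs = [0.0] * len(episode)
    G = 0.0
    for t in reversed(range(len(episode))):
        G = episode[t][1] + gamma * G
        Gs[t] = G
    for s, t in first_visit.items():
        returns[s].append(Gs[t])

# Value estimate = mean return observed after the first visit to each state.
V = {s: sum(g) / len(g) for s, g in returns.items()}
print(V)
```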

MC evaluation Do not touch the policy, just the value function

First-visit MC

Every-visit MC

Incremental MC

Temporal-Difference Learning TD methods learn directly from episodes of experience TD is model-free: no knowledge of MDP transitions / rewards TD learns from incomplete episodes, by bootstrapping TD updates a guess towards a guess Can work without knowing the final outcome, so it is good for online learning!
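A minimal TD(0) evaluation sketch of "updating a guess towards a guess"; the stream of transitions is a made-up placeholder.

```python
# TD(0) evaluation on a stream of transitions (state, reward, next_state, done).
transitions = [
    ("A", 0.0, "B", False),
    ("B", 1.0, "B", True),   # terminal step
]
alpha, gamma = 0.1, 0.9
V = {"A": 0.0, "B": 0.0}

for s, r, s_next, done in transitions:
    # Bootstrapped TD target: r + gamma * V(s'), with 0 beyond a terminal state.
    target = r + (0.0 if done else gamma * V[s_next])
    # Move the current guess a small step towards the (guessed) target.
    V[s] += alpha * (target - V[s])
print(V)
```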

Some comparison

Now we update the policy On-policy learning: learn on the job, i.e. learn about a policy from experience sampled from that same policy Off-policy learning: look over someone's shoulder, i.e. learn about a policy from experience sampled from a different (behaviour) policy

MC policy learning

Sarsa (TD) policy learning

Off-policy learning Update the policy and the Q value function together
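A sketch of Q-learning, the standard off-policy TD control update: behaviour is epsilon-greedy, but the bootstrapped target follows the greedy policy. The environment interface (reset / step / actions) is an assumed placeholder, not something specified in the slides.

```python
import random

# Q-learning with an epsilon-greedy behaviour policy and a greedy target policy.
def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = {}                                     # Q[(state, action)] -> value, default 0
    actions = env.actions                      # assumed: finite list of actions

    def q(s, a):
        return Q.get((s, a), 0.0)

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # Behaviour policy: epsilon-greedy w.r.t. the current Q.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: q(s, a_))
            s_next, r, done = env.step(a)
            # Target policy is greedy: bootstrap from max_a' Q(s', a').
            target = r + (0.0 if done else gamma * max(q(s_next, a_) for a_ in actions))
            Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))
            s = s_next
    return Q
```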

But all the above is of limited practical use How do we represent the state? How do we store the value function if the state space is large? What if the state is continuous? What if we meet states that are similar to, but different from, those in the training data? What if we only observe a small number of training examples? Use a parametric function to approximate the value function!

Value function approximation

We consider differentiable function approximators, e.g. Linear combinations of features Neural network Decision tree Nearest neighbour Fourier / wavelet bases ... Furthermore, we require a training method that is suitable for non-stationary, non-iid data
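A sketch of semi-gradient TD(0) with a linear value function V(s) = w . x(s); the feature map and the transition stream below are illustrative assumptions.

```python
import numpy as np

# Semi-gradient TD(0) with a linear value-function approximator.
def features(state):
    # Hypothetical 2-dimensional feature vector for a state.
    return np.array([1.0, float(state)])

w = np.zeros(2)                          # weights of the linear approximator
alpha, gamma = 0.01, 0.9

# Made-up stream of (state, reward, next_state, done) transitions.
stream = [(0, 0.0, 1, False), (1, 1.0, 1, True)] * 100

for s, r, s_next, done in stream:
    x, x_next = features(s), features(s_next)
    v, v_next = w @ x, w @ x_next
    target = r + (0.0 if done else gamma * v_next)
    # Gradient step on the squared TD error, treating the target as fixed
    # (the "semi-gradient"): w <- w + alpha * (target - v) * x.
    w += alpha * (target - v) * x
print(w)
```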

Content What is reinforcement learning? Shallow discrete learning Deep Q learning

Deep Q network Use a DNN to approximate the value function Use MC or TD to generate samples, and use the error signals from those samples to train the DNN
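A sketch of the DQN-style TD update, written here with PyTorch purely for illustration (the slides do not prescribe a framework); the network sizes and the random minibatch are assumed placeholders standing in for samples drawn from a replay buffer.

```python
import torch
import torch.nn as nn

# Train Q(s, .) to regress towards the bootstrapped target r + gamma * max_a' Q_target(s', a').
state_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())    # periodically synced copy of the Q network
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# A made-up minibatch of transitions (would normally come from experience replay).
s      = torch.randn(32, state_dim)
a      = torch.randint(0, n_actions, (32, 1))
r      = torch.randn(32, 1)
s_next = torch.randn(32, state_dim)
done   = torch.zeros(32, 1)

q_sa = q_net(s).gather(1, a)                       # Q(s, a) for the actions actually taken
with torch.no_grad():                              # the target is treated as a constant
    max_q_next = target_net(s_next).max(dim=1, keepdim=True).values
    target = r + gamma * (1 - done) * max_q_next
loss = nn.functional.mse_loss(q_sa, target)        # TD error as a regression loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```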

Incremental update for Q function

DQN for game learning Human-level control through deep reinforcement learning

Two mechanisms (experience replay and a separate target network)

DQN for AlphaGo Mastering the game of Go with deep neural networks and tree search

Wrap up Reinforcement learning learns a policy. It is basically formulated as learning in a Markov decision process. It can be learned in a full-width ('batch') way or by sampling, in either an episodic or an incremental fashion. Learning the value function is highly important, and deep learning provides a brilliant solution for approximating it. It opens the door to fascinating machine intelligence.