Value Function Approximation on Non-linear Manifolds for Robot Motor Control
Masashi Sugiyama, Hirotaka Hachiya, Christopher Towell, Sethu Vijayakumar
Presentation transcript:

Abstract
- LSPI (Least-Squares Policy Iteration) works well for value function approximation.
- The Gaussian kernel is a popular choice of basis function, but it cannot treat discontinuities well.
- We propose a new type of basis function, the Geodesic Gaussian Kernel.
- We apply our method to robot control tasks.

Maze Problem
- Task: guide a robot to the goal from any position.
- Actions: up, down, left, right.
- Reward: +1 (reach the goal), 0 (otherwise).
- Condition: no supervision, only the reward at the goal.
- Goal: select the optimal action in each position.
(Figure: the maze, with the robot's position and the goal marked.)

Markov Decision Process (MDP)
A model consisting of {S, A, T, R}:
- S: finite set of states, e.g. s_i = (x_i, y_i), i = 1, 2, 3, ...
- A: finite set of actions, e.g. up, down, left, right
- T: transition function, specifying the next state s'
- R: immediate reward function
The MDP is assumed to be given or estimable from data.
A policy function specifies the action to take in each state.
Goal: learn a good policy function from the MDP.
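A minimal sketch of such an MDP for the maze task above, assuming a deterministic 4-connected grid; the class name and structure are illustrative, not taken from the presentation.

```python
# Illustrative maze MDP: states are free grid cells, actions are compass moves,
# T is deterministic (blocked moves leave the state unchanged), R gives +1 at the goal.
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

class MazeMDP:
    def __init__(self, width, height, walls, goal):
        self.walls = set(walls)
        self.goal = goal
        self.states = {(x, y) for x in range(width) for y in range(height)
                       if (x, y) not in self.walls}

    def transition(self, state, action):
        """T(s, a): move one cell unless a wall or the border blocks the move."""
        dx, dy = ACTIONS[action]
        nxt = (state[0] + dx, state[1] + dy)
        return nxt if nxt in self.states else state

    def reward(self, state, action, next_state):
        """R: +1 when the goal is reached, 0 otherwise."""
        return 1.0 if next_state == self.goal else 0.0
```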

Reinforcement Learning (RL)
Policy iteration: iterating steps 1 and 2 yields the optimal policy π.
1. Evaluate the action-value function Q(s, a): the discounted sum of future rewards obtained when taking action a in state s and following π thereafter (Sutton & Barto, 1998).
2. Update the policy (greedily with respect to Q).
Here r(s, a) is the immediate reward for taking action a in state s, and γ is the discount factor (0 < γ < 1).
Problem: Q(s, a) cannot be evaluated directly.
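The slide's equations are not reproduced in the transcript; in standard notation (a reconstruction, not copied from the slide), the two steps are

$$Q^{\pi}(s,a) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a,\ \pi\right], \qquad \pi_{\mathrm{new}}(s) = \arg\max_{a \in A} Q^{\pi}(s, a).$$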

Bellman Equation
Q(s, a) can be evaluated through its recursive (Bellman) form.
Problem: the number of parameters becomes very large in large state and action spaces, leading to slow learning and overfitting.
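For reference, the recursive (Bellman) form mentioned above is, in standard notation (a reconstruction; the slide's own equation is not in the transcript),

$$Q^{\pi}(s,a) = r(s,a) + \gamma \sum_{s' \in S} T(s' \mid s, a)\, Q^{\pi}\!\big(s', \pi(s')\big).$$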

Least-Squares Policy Iteration (LSPI) (Lagoudakis & Parr, 2003)
Linear architecture: Q(s, a) is approximated by a weighted sum of fixed basis functions.
- φ_i(s, a): fixed basis functions
- w_i: learned weights
- K: number of basis functions
The weights are learned so as to optimally approximate the Bellman equation in the least-squares sense.
The number of learning parameters can be reduced dramatically.
Problem: how do we choose φ_i(s, a)?
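In standard LSPI notation, the linear architecture above is (a reconstruction; the slide's equation is not in the transcript)

$$\hat{Q}(s, a; \mathbf{w}) = \sum_{i=1}^{K} w_i\, \phi_i(s, a) = \boldsymbol{\phi}(s, a)^{\top} \mathbf{w}.$$

Below is a minimal Python sketch of LSTDQ, the least-squares policy-evaluation step inside LSPI, assuming sampled transitions (s, a, r, s') and a user-supplied feature map phi(s, a); names and structure are illustrative, not the authors' implementation.

```python
import numpy as np

def lstdq(samples, phi, policy, gamma, k):
    """LSTDQ: least-squares fixed point of the Bellman equation for `policy`.

    samples: iterable of (s, a, r, s_next); phi(s, a) returns a length-k vector.
    """
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = np.asarray(phi(s, a))
        f_next = np.asarray(phi(s_next, policy(s_next)))
        A += np.outer(f, f - gamma * f_next)   # A = sum phi (phi - gamma phi')^T
        b += r * f                             # b = sum r * phi
    w, *_ = np.linalg.lstsq(A, b, rcond=None)  # robust to (near-)singular A
    return w

def greedy_policy(w, phi, actions):
    """Policy improvement: act greedily with respect to the linear Q-approximation."""
    return lambda s: max(actions, key=lambda a: float(np.dot(phi(s, a), w)))
```

LSPI alternates lstdq (policy evaluation) and greedy_policy (policy improvement) until the weights converge.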

Popular Choice: Gaussian Kernel
- ED(s, s_c): Euclidean distance between state s and the kernel centre s_c.
- Bell shape centred on s_c; smooth surface.
- Problem: when a kernel is placed near the partition, the Gaussian tail goes over the partition.
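The ordinary Gaussian kernel the slide refers to has the standard form (σ denotes the kernel width; this is a reconstruction of the missing equation):

$$\phi(s) = \exp\!\left(-\frac{\mathrm{ED}(s, s_c)^2}{2\sigma^2}\right),$$

where ED(s, s_c) is the Euclidean distance between state s and the kernel centre s_c.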

Value Function with Discontinuities
Value function approximated by 20 randomly located ordinary Gaussian kernels:
- Less accurate around the partition.
- Undesired policies are obtained around the partition.
(Figures on the slide: the optimal value function and the obtained policy, on a log scale.)

Aim of This Research
- Ordinary Gaussian kernels are not suited to approximating discontinuous value functions.
- The value function is smooth along the maze but discontinuous across the partition.
- Goal: propose a new kernel, the Geodesic Gaussian Kernel, based on the structure of the state space.

Gaussian Kernels on Graph
- Ordinary Gaussian kernel: measures the distance between a state s and the centre s_c by the Euclidean distance.
- Geodesic Gaussian kernel: measures it by the shortest-path distance on the state-space graph, computed with the Dijkstra algorithm.
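A minimal sketch of the idea, assuming the state space is given as a weighted graph (a dict of adjacency lists); the Euclidean distance of the ordinary Gaussian is simply replaced by the Dijkstra shortest-path distance. Function names are illustrative, not the authors' code.

```python
import heapq
import math

def dijkstra(graph, source):
    """graph: dict mapping state -> list of (neighbour, edge_cost) pairs.
    Returns shortest-path distances from `source` to every reachable state."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue                              # stale heap entry
        for v, cost in graph.get(u, ()):
            nd = d + cost
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def geodesic_gaussian_kernel(graph, centre, sigma):
    """phi(s) = exp(-SP(s, centre)^2 / (2 sigma^2)), with SP the geodesic distance.
    States with no path to `centre` (e.g. across a solid wall) receive no kernel value,
    so the kernel cannot leak across partitions."""
    sp = dijkstra(graph, centre)
    return {s: math.exp(-d * d / (2.0 * sigma * sigma)) for s, d in sp.items()}
```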

Example of Kernels
(Figures: an ordinary Gaussian kernel and a geodesic Gaussian kernel, each centred at s_c.)
The tail of the geodesic Gaussian kernel does not go across the partition.

Value Function by Geodesic Gaussian
Value function approximated by 20 randomly located geodesic Gaussian kernels:
- Accurate around the partition.
- Desired policies are obtained around the walls.
(Figures on the slide: the optimal value function and the obtained policies, on a log scale.)

Experimental Results
(Figures: fraction of optimal states, averaged over 100 runs, for Sutton's maze and the three-room maze.)

Discussions
Ordinary Gaussian:
- A large width suffers from the tail problem.
- A small width does not have the tail problem, but is less smooth along the state space.
Geodesic Gaussian (with a rather large width):
- Smooth along the state space, while the discontinuity across the partitions is preserved.

Arm Robot Control
Task: lead the hand of a 2-DOF robot arm to the apple.
Reward: +1 (reach the apple), 0 (otherwise).
(Figures: the robot arm and its state space.)

Learned Value Functions by Ordinary Gaussian
The learned value function is smooth over the obstacle.

Learned Value Functions by Geodesic Gaussian
The learned value function is smooth along the state space.

Summary of Results
(Figure: performance averaged over 30 runs.)

Khepera Robot Navigation
- The Khepera robot has 8 IR sensors measuring the distance to obstacles (readings from 0 to 1030).
- Task: explore an unknown maze without collision.
- 6 actions.
- Reward: +10 (a1), +5 (a2/a3), 0 (a4/a5), -4 (a6), -20 (collision).

Difficulty of the Task
- The state space is high-dimensional (8-D) and large (1030^8 possible states).
- The entire state space cannot be explored, so inter-/extrapolation is needed.
- State transitions are highly stochastic.

State Space and Graph
The 8-D graph, constructed with a Self-Organizing Map, is projected onto a 2-D subspace for visualization. (Figure: the projected graph, with the partitions marked.)
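The slide does not spell out how the graph is built from the Self-Organizing Map. One plausible construction, shown purely as an assumption (it uses the third-party `minisom` package, and the node-connection rule is a guess rather than the authors' procedure): train a SOM on the recorded 8-D sensor vectors, use each SOM unit as a graph node, and connect two nodes whenever consecutive readings in the exploration trajectory map to them.

```python
# Hedged sketch: build a state-space graph from a Self-Organizing Map.
# The edge rule (connect units hit by consecutive samples) is an assumption.
from minisom import MiniSom

def build_som_graph(sensor_data, grid=(20, 20)):
    """sensor_data: (T, 8) array of IR sensor readings from an exploration run."""
    som = MiniSom(grid[0], grid[1], sensor_data.shape[1], sigma=1.0, learning_rate=0.5)
    som.train_random(sensor_data, 5000)          # unsupervised fitting of the map

    graph = {}                                   # node (i, j) -> set of neighbouring nodes
    prev = None
    for x in sensor_data:
        node = som.winner(x)                     # best-matching SOM unit for this reading
        graph.setdefault(node, set())
        if prev is not None and prev != node:
            graph[prev].add(node)                # observed transition -> undirected edge
            graph[node].add(prev)
        prev = node
    return som, graph
```

Edge costs for the shortest-path (Dijkstra) step could then be taken, for example, as the distance between the connected units' weight vectors.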

Learned Value Functions by Ordinary Gaussian
The learned value function contains a local maximum: when Khepera faces an obstacle, it goes backward (and then forward again).

Learned Value Functions by Geodesic Gaussian
When Khepera faces an obstacle, it makes a turn (and then goes forward).

Summary of Results
(Figure: performance averaged over 30 runs.)

Conclusion
- Value function approximation requires good basis functions.
- The ordinary Gaussian kernel is smooth over discontinuities.
- The geodesic Gaussian kernel is smooth along the state space.
- Geodesic (graph-based) Gaussian kernels are promising in high-dimensional continuous problems!