Value Function Approximation on Non-linear Manifolds for Robot Motor Control
Masashi Sugiyama 1)2), Hirotaka Hachiya 1)2), Christopher Towell 2), Sethu Vijayakumar 2)
1) Computer Science, Tokyo Institute of Technology
2) School of Informatics, University of Edinburgh

Maze Problem: Guide a Robot to the Goal
[Figure: maze with the robot at position (x, y); possible actions are Up, Down, Left, Right; a reward is given at the goal.]
In this talk we discuss a maze problem like this. In the maze, we want to guide a robot to the goal. The robot knows its position (x, y) but does not know which direction it should go. We do not teach the robot the best action to take at each position; instead, we only give a reward at the goal. The task is to make the robot select the optimal action at every position.

Markov Decision Process (MDP)
An MDP consists of $S$: a set of states, $A$: a set of actions, $P(s' \mid s, a)$: the transition probability, and $R(s, a)$: the reward. The action $a$ the robot takes at state $s$ is specified by a policy $\pi(s)$. Goal: make the robot learn the optimal policy $\pi^*$.
The maze problem can be formulated as a Markov Decision Process. An MDP consists of $S$, $A$, $P$, and $R$, and the action the robot takes at each state is specified by the policy $\pi$. The goal of the maze problem is therefore to make the robot learn a good policy.
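To make the later sketches concrete, here is a minimal, hypothetical encoding of such an MDP in Python (not from the original slides): states and actions are indices, the transition probability is a tensor P[s, a, s'], and the reward is an array R[s, a]. All names and numbers are illustrative.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP used only to illustrate the (S, A, P, R) structure.
n_states, n_actions = 2, 2

# P[s, a, s'] = probability of moving to s' when taking action a in state s.
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.9, 0.1]   # action 0 in state 0: mostly stay
P[0, 1] = [0.1, 0.9]   # action 1 in state 0: mostly move to state 1
P[1, 0] = [1.0, 0.0]
P[1, 1] = [0.0, 1.0]

# R[s, a] = immediate reward for taking action a in state s (reward only near the "goal" state 1).
R = np.array([[0.0, 0.0],
              [0.0, 1.0]])

# A deterministic policy maps each state to an action.
policy = np.array([1, 1])
```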

Definition of the Optimal Policy
Action-value function: $Q^\pi(s, a) = \mathbb{E}\bigl[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s,\, a_0 = a,\, \pi\bigr]$, the discounted sum of future rewards when taking action $a$ in state $s$ and following $\pi$ thereafter.
Optimal value: $Q^*(s, a) = \max_\pi Q^\pi(s, a)$.
Optimal policy: $\pi^*(s) = \arg\max_a Q^*(s, a)$.
$\pi^*$ can be computed if $Q^*$ is given. Question: how do we compute $Q^*$?
Here we define the optimal policy more formally. We first define the action-value function $Q^\pi$ as the discounted sum of future rewards obtained when taking action $a$ in state $s$ and following $\pi$ thereafter. Using $Q^\pi$, we define the optimal value function $Q^*$, and from $Q^*$ we define the optimal policy $\pi^*$. This means the optimal policy can be computed once $Q^*$ is given. So the question is how to compute the optimal value function $Q^*$.
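As a small sketch of the last step (obtaining $\pi^*$ from $Q^*$), the greedy policy simply takes an argmax over actions for each state; the Q table below is a hypothetical placeholder.

```python
import numpy as np

# Hypothetical Q*(s, a) table for 3 states and 2 actions (placeholder values).
Q_star = np.array([[0.2, 0.5],
                   [0.7, 0.1],
                   [0.4, 0.9]])

# Optimal policy: pi*(s) = argmax_a Q*(s, a).
pi_star = Q_star.argmax(axis=1)
print(pi_star)  # -> [1 0 1]
```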

Policy Iteration (Sutton &amp; Barto, 1998)
Starting from some initial policy $\pi$, iterate Steps 1 and 2 until convergence:
Step 1. Compute $Q^\pi$ for the current $\pi$.
Step 2. Update $\pi$ by $\pi(s) \leftarrow \arg\max_a Q^\pi(s, a)$.
Policy iteration always converges to $\pi^*$ if $Q^\pi$ in Step 1 can be computed. Question: how do we compute $Q^\pi$?
$Q^*$ can be computed by policy iteration, which consists of these two steps. Starting from some initial policy $\pi$, Step 1 computes $Q^\pi$ for the current $\pi$, and Step 2 updates $\pi$ greedily with respect to it. Policy iteration always converges to the optimal policy if we can compute $Q^\pi$ in Step 1. But $Q^\pi$ is expressed as an infinite sum, so the question is how to compute it.
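A minimal tabular policy-iteration sketch, assuming the hypothetical P[s, a, s'] and R[s, a] arrays introduced earlier; the tiny MDP at the bottom is only there to make the loop runnable and is not from the slides.

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma=0.95):
    """Step 1: compute Q^pi exactly by solving the linear Bellman system for V^pi."""
    n_states = P.shape[0]
    P_pi = P[np.arange(n_states), policy]      # transition matrix under pi: P_pi[s, s']
    r_pi = R[np.arange(n_states), policy]      # immediate reward under pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return R + gamma * P @ V                   # Q^pi(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V^pi(s')

def policy_iteration(P, R, gamma=0.95):
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)
    while True:
        Q = evaluate_policy(P, R, policy, gamma)   # Step 1
        new_policy = Q.argmax(axis=1)              # Step 2: greedy improvement
        if np.array_equal(new_policy, policy):
            return policy, Q
        policy = new_policy

# Tiny hypothetical MDP (2 states, 2 actions) to exercise the loop.
P = np.zeros((2, 2, 2))
P[0, 0], P[0, 1] = [0.9, 0.1], [0.1, 0.9]
P[1, 0], P[1, 1] = [1.0, 0.0], [0.0, 1.0]
R = np.array([[0.0, 0.0], [0.0, 1.0]])
print(policy_iteration(P, R))
```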

Bellman Equation
$Q^\pi$ can be recursively expressed by $Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, Q^\pi(s', \pi(s'))$. This is called the Bellman equation, and $Q^\pi$ can be computed by solving it. Drawback: the dimensionality of the Bellman equation becomes huge in large state and action spaces, so the computational cost is high.
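Concretely, for a fixed policy the Bellman equation is a linear system in the $|S| \times |A|$ unknowns $Q^\pi(s, a)$, so it can be solved directly. The sketch below (again with hypothetical small P and R arrays) makes the dimensionality drawback visible: the system matrix has $(|S||A|)^2$ entries, which explodes in large state and action spaces.

```python
import numpy as np

def solve_bellman_q(P, R, policy, gamma=0.95):
    """Solve (I - gamma * M_pi) q = r for the stacked vector q of all Q^pi(s, a) values."""
    n_states, n_actions, _ = P.shape
    n = n_states * n_actions                  # dimensionality of the Bellman system
    A = np.eye(n)
    r = R.reshape(n)                          # row index of (s, a) is s * n_actions + a
    for s in range(n_states):
        for a in range(n_actions):
            row = s * n_actions + a
            for s_next in range(n_states):
                col = s_next * n_actions + policy[s_next]   # next action chosen by pi
                A[row, col] -= gamma * P[s, a, s_next]
    return np.linalg.solve(A, r).reshape(n_states, n_actions)

# Hypothetical 2-state, 2-action MDP; even here there are |S||A| = 4 unknowns,
# and the system grows as (|S||A|)^2, which is the computational drawback noted above.
P = np.zeros((2, 2, 2))
P[0, 0], P[0, 1] = [0.9, 0.1], [0.1, 0.9]
P[1, 0], P[1, 1] = [1.0, 0.0], [0.0, 1.0]
R = np.array([[0.0, 0.0], [0.0, 1.0]])
print(solve_bellman_q(P, R, policy=np.array([1, 1])))
```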

Least-Squares Policy Iteration (LSPI) (Lagoudakis and Parr, 2003)
Linear architecture: $\hat{Q}^\pi(s, a) = \sum_{k=1}^{K} w_k \phi_k(s, a)$, where $\phi_k$ are fixed basis functions, $w_k$ are parameters, and $K$ is the number of basis functions. The parameters $w$ are learned so as to optimally approximate the Bellman equation in the least-squares sense.
To overcome the drawback above, LSPI has been proposed. In LSPI, $Q$ is approximated by a linear combination of fixed basis functions, and the number of parameters is only $K$, which is much smaller than $|S| \times |A|$. LSPI works well if we choose appropriate basis functions. The question is how to choose them.
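A hedged sketch of the least-squares fit at the heart of LSPI: given sampled transitions (s, a, r, s') and a feature map phi(s, a) with K outputs, the weights solve the usual LSTD-Q normal equations A w = b. The feature map, the sample list, the policy callable, and the small ridge term are all illustrative assumptions, not the authors' code.

```python
import numpy as np

def lstdq_weights(samples, phi, policy, n_features, gamma=0.95):
    """Least-squares fit of w so that phi(s, a)^T w approximately satisfies the Bellman equation.

    samples: iterable of (s, a, r, s_next) transitions.
    phi:     feature map phi(s, a) -> array of length n_features.
    policy:  current policy, used to pick the next action a' = pi(s').
    """
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)       # LSTD-Q matrix accumulation
        b += r * f
    return np.linalg.solve(A + 1e-6 * np.eye(n_features), b)   # tiny ridge term for stability

# Illustrative usage with a 1-D state, 2 actions, and simple polynomial features.
def phi(s, a):
    return np.array([1.0, s, s * a, a], dtype=float)

samples = [(0.0, 1, 0.0, 0.5), (0.5, 1, 0.0, 1.0), (1.0, 0, 1.0, 1.0)]
w = lstdq_weights(samples, phi, policy=lambda s: 1, n_features=4)
print(w)
```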

Popular Choice: Gaussian Kernel (GK)
$K(s, c) = \exp\!\left(-\frac{ED(s, c)^2}{2\sigma^2}\right)$, where $ED$ is the Euclidean distance and $c$ is a centre state.
[Figure: maze with partitions (walls) and a Gaussian kernel whose tail crosses a partition.]
A popular choice is the Gaussian kernel. Its advantage is smoothness, but the Gaussian tail goes over the partitions of the maze, and this causes a serious problem.
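A minimal sketch of such Gaussian kernel basis functions over 2-D maze positions; the centre states and the width sigma below are illustrative choices, not values from the slides.

```python
import numpy as np

def gaussian_kernel_features(state, centres, sigma=2.0):
    """phi_k(s) = exp(-ED(s, c_k)^2 / (2 sigma^2)) with ED the Euclidean distance."""
    state = np.asarray(state, dtype=float)
    centres = np.asarray(centres, dtype=float)
    sq_dist = ((centres - state) ** 2).sum(axis=1)   # squared Euclidean distances to all centres
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Hypothetical centre states scattered over a 10x10 maze.
centres = [(2, 2), (2, 7), (7, 2), (7, 7)]
print(gaussian_kernel_features((3, 3), centres))
```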

Approximated Value Function by GK
[Figure: optimal value function vs. approximation by 20 randomly located Gaussians, log scale.]
When we approximate the optimal value function with Gaussian kernels (20 randomly located Gaussians), the result looks like this: values around the partitions are clearly not approximated well.

Policy Obtained by GK
[Figure: optimal policy vs. GK-based policy.]
The policy obtained with Gaussian kernels looks like this. Compared with the optimal policy, the Gaussian kernels produce undesired actions around the partition.

Aim of This Research
As shown in the example, Gaussian tails go over the partition, so ordinary Gaussians are not suited for approximating discontinuous value functions. In this research, we propose a new Gaussian kernel to overcome this problem.

State Space as a Graph (Mahadevan, ICML 2005)
The ordinary Gaussian kernel uses the Euclidean distance, which does not incorporate the structure of the state space; this is why the tail problems occur. Instead, we represent the state space structure by a graph and use it to define Gaussian kernels.

Geodesic Gaussian Kernels
$K(s, c) = \exp\!\left(-\frac{SP(s, c)^2}{2\sigma^2}\right)$, where $SP$ is the shortest path on the state-space graph.
[Figure: two points on the maze graph; the Euclidean distance is a straight blue line through the wall, while the shortest path is a red line along the graph.]
Gaussian kernels depend on the distance, and a natural distance on the graph is the shortest path, so we use the shortest path inside the Gaussian function and call this kernel the geodesic Gaussian. The Euclidean distance between the two points in the figure is the blue line; since the points are not directly connected on the graph, the shortest path is the red line. Shortest paths can be computed efficiently by Dijkstra's algorithm.
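A hedged sketch of the geodesic variant: build a graph over discrete maze states (4-neighbour adjacency here, as an assumption), compute shortest-path distances with Dijkstra via SciPy, and plug them into the same Gaussian form in place of the Euclidean distance. The toy maze and parameters are illustrative only.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import dijkstra

def build_maze_graph(free):                       # free[i, j] = True where the robot can stand
    h, w = free.shape
    idx = -np.ones((h, w), dtype=int)
    idx[free] = np.arange(free.sum())             # node index of each free cell
    adj = lil_matrix((free.sum(), free.sum()))
    for i in range(h):
        for j in range(w):
            if not free[i, j]:
                continue
            for di, dj in ((1, 0), (0, 1)):       # 4-neighbour edges with unit weight
                ni, nj = i + di, j + dj
                if ni < h and nj < w and free[ni, nj]:
                    adj[idx[i, j], idx[ni, nj]] = 1
                    adj[idx[ni, nj], idx[i, j]] = 1
    return idx, adj.tocsr()

def geodesic_gaussian_features(adj, centre_nodes, sigma=2.0):
    """phi_k(s) = exp(-SP(s, c_k)^2 / (2 sigma^2)) with SP the shortest path on the graph."""
    sp = dijkstra(adj, indices=centre_nodes)      # shape: (n_centres, n_states)
    return np.exp(-sp ** 2 / (2.0 * sigma ** 2)).T

# Toy 5x5 maze with a wall splitting it into two rooms connected by one gap.
free = np.ones((5, 5), dtype=bool)
free[:4, 2] = False                               # wall in column 2, gap at the bottom row
idx, adj = build_maze_graph(free)
phi = geodesic_gaussian_features(adj, centre_nodes=[idx[0, 0]])
print(phi[idx[0, 4]], phi[idx[4, 4]])             # the state across the wall gets a much smaller value
```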

Example of Kernels
[Figure: ordinary Gaussian vs. geodesic Gaussian centred at the same state.]
For example, when we put the centre here, the geodesic Gaussian kernel looks like this: the tail does not go across the partition, and the values decrease smoothly along the maze.

Comparison of Value Functions
[Figure: optimal value function, ordinary Gaussian approximation, geodesic Gaussian approximation.]
When we approximate the optimal value function with geodesic Gaussian kernels, the result looks like this. Unlike the ordinary Gaussian, values near the partition are approximated well, and the discontinuity across the partition is preserved.

Comparison of Policies
[Figure: ordinary Gaussian policy vs. geodesic Gaussian policy.]
As a result, geodesic Gaussian kernels provide good policies near the partition.

Experimental Result
[Figure: fraction of optimal states vs. number of kernels, averaged over 100 runs, for ordinary and geodesic Gaussians.]
We ran the same experiment 100 times and plot the average fraction of optimal states. For the ordinary Gaussian, a small width avoids the tail problem but is less smooth, while a large width suffers from the tail problem. For the geodesic Gaussian, a small width is likewise less smooth, but a large width is both smooth and free of the tail problem.

Robot Arm Reaching
Task: move the end effector of a 2-DOF robot arm to reach the object. Reward: +1 when the end effector reaches the object, 0 otherwise.
[Figure: robot arm with end effector, object, and obstacle; state space plotted over Joint 1 and Joint 2 angles (degrees), with obstacle regions in black and the goal region in red.]
Next, we applied our kernel to a robot-arm reaching task. The robot has two joints, and the task is to lead the end effector to a target object while avoiding an obstacle. The reward is 1 when the end effector touches the object and 0 otherwise. Each arm pose corresponds to a point in the joint-angle state space; as the robot moves its joints, the point moves through the white areas, hits the black areas when the arm collides with the obstacle, and enters the red area when the end effector reaches the object.

Robot Arm Reaching: Results
[Figure/video: arm trajectories under the ordinary Gaussian policy and the geodesic Gaussian policy.]
With the policy obtained by the ordinary Gaussian, the arm moves directly towards the object and hits the obstacle. With the policy obtained by the geodesic Gaussian, the arm successfully avoids the obstacle and reaches the object.

Khepera Robot Navigation
Khepera has 8 IR sensors measuring the distance to obstacles (sensor values: 0 to 1030). Task: explore an unknown maze without collision. Reward: +1 for moving forward, -2 for a collision, 0 otherwise.
Next, we apply our method to Khepera robot navigation. Khepera has 8 IR sensors measuring the distance to obstacles, and the task is to explore an unknown maze without colliding with the various obstacles. Khepera takes four actions: move forward, rotate left, rotate right, and move backward. To make Khepera explore the world, the reward is set to +1 for moving forward, -2 for a collision, and 0 otherwise.

State Space and Graph
[Figure: 2-D visualization of the discretized state space, with partitions marked.]
We discretize the 8-dimensional state space with a self-organizing map. For visualization, the state space is projected onto a 2-D subspace; we can see that there are partitions in this state space as well.
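As a rough sketch of how an 8-D sensor space might be discretized with a self-organizing map: each SOM node holds an 8-D prototype, and each sensor reading is mapped to its nearest node, which then serves as a discrete state for the graph. This is a generic, minimal SOM update loop under illustrative settings, not the authors' implementation.

```python
import numpy as np

def train_som(data, grid_shape=(10, 10), n_iter=2000, lr0=0.5, radius0=5.0, seed=0):
    """Fit a small SOM: each grid node holds a prototype in input space."""
    rng = np.random.default_rng(seed)
    n_nodes = grid_shape[0] * grid_shape[1]
    weights = rng.random((n_nodes, data.shape[1]))
    coords = np.indices(grid_shape).reshape(2, -1).T        # 2-D grid coordinates of the nodes
    for t in range(n_iter):
        x = data[rng.integers(len(data))]
        winner = np.argmin(((weights - x) ** 2).sum(axis=1))  # best-matching unit
        frac = t / n_iter
        lr, radius = lr0 * (1 - frac), max(radius0 * (1 - frac), 1.0)
        grid_dist = ((coords - coords[winner]) ** 2).sum(axis=1)
        h = np.exp(-grid_dist / (2 * radius ** 2))             # neighbourhood function
        weights += lr * h[:, None] * (x - weights)
    return weights

# Hypothetical normalized 8-D IR sensor readings; each sample is mapped to its winning node,
# and those discrete nodes become the states of the graph used for geodesic Gaussians.
sensor_data = np.random.default_rng(1).random((500, 8))
prototypes = train_som(sensor_data)
states = ((sensor_data[:, None, :] - prototypes[None]) ** 2).sum(-1).argmin(axis=1)
print(states[:10])
```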

Khepera Robot Navigation: Results
[Figure/video: robot behaviour under the ordinary Gaussian policy and the geodesic Gaussian policy.]
With the policy obtained by the ordinary Gaussian, when the robot faces an obstacle it goes backward and then forward, again and again. With the policy obtained by the geodesic Gaussian, when the robot faces an obstacle it makes a turn and then goes forward.

Experimental Results
[Figure: performance averaged over 30 runs for ordinary and geodesic Gaussians.]
The geodesic Gaussian outperforms the ordinary Gaussian.

Conclusion
Value function approximation needs good basis functions. The tail of the ordinary Gaussian kernel goes over discontinuities, whereas the geodesic Gaussian kernel is smooth along the state space. Through the experiments, we showed that the geodesic Gaussian is promising for high-dimensional continuous problems. Thank you for listening to my presentation today.