
Space-Indexed Dynamic Programming: Learning to Follow Trajectories J. Zico Kolter, Adam Coates, Andrew Y. Ng, Yi Gu, Charles DuHadway Computer Science Department, Stanford University July 2008, ICML

Outline: Reinforcement Learning and Following Trajectories; Space-Indexed Dynamical Systems and Space-Indexed Dynamic Programming; Experimental Results

Reinforcement Learning and Following Trajectories

Trajectory Following Consider the task of following a trajectory in a vehicle such as a car or helicopter. The state space is too large to discretize, so tabular RL/dynamic programming cannot be applied.

Trajectory Following Dynamic programming algorithms with non-stationary policies seem well suited to this task: Policy Search by Dynamic Programming (Bagnell et al.), Differential Dynamic Programming (Jacobson and Mayne).

Dynamic Programming Divide the control task into discrete time steps t = 1, 2, 3, 4, 5

Dynamic Programming t=1 t=2 t=3 t=4 t=5 Proceeding backwards in time, learn policies for t = T, T-1, …, 2, 1

Dynamic Programming t=1 t=2 t=3 t=4 t=5 Key Advantage: policies are local (each only needs to perform well over a small portion of the state space)
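
A minimal sketch of this backward pass, in the spirit of PSDP rather than the paper's exact implementation. All names are hypothetical placeholders: `T` is the horizon, `sample_states(t)` samples states likely to be visited at time t, `actions` is a discrete action set, `simulate` is the one-step simulator, `cost` a one-step cost, and `fit_policy` a supervised learner over (state, action) examples.

```python
def time_indexed_dp(T, sample_states, actions, simulate, cost, fit_policy):
    """Learn non-stationary policies pi[T-1], ..., pi[0] by backward induction.
    Each pi[t] only needs to act well on states likely to be reached at time t."""
    pi = [None] * T

    def rollout_cost(state, t):
        # Total cost of following the already-learned policies from time t onward.
        total = 0.0
        for k in range(t, T):
            action = pi[k](state)
            total += cost(state, action)
            state = simulate(state, action)
        return total

    for t in reversed(range(T)):
        examples = []
        for s in sample_states(t):          # states we expect to see at time t
            # Label s with the action whose one-step cost plus cost-to-go is smallest.
            best = min(actions, key=lambda a: cost(s, a) +
                       rollout_cost(simulate(s, a), t + 1))
            examples.append((s, best))
        pi[t] = fit_policy(examples)        # e.g. a classifier over the action set
    return pi
```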

Problems with Dynamic Programming Problem #1: Policies from traditional dynamic programming algorithms are time-indexed

Problems with Dynamic Programming Suppose we learned the policy assuming this distribution over states

Problems with Dynamic Programming But, due to the natural stochasticity of the environment, the car is actually here at t = 5

Problems with Dynamic Programming Resulting policy will perform very poorly

Problems with Dynamic Programming Partial solution: re-indexing. Execute the policy learned for the point closest to the current location, regardless of time.
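
A minimal sketch of this re-indexing heuristic, assuming the time-indexed policies `pi[0..T-1]` and the nominal trajectory waypoints `waypoints[0..T-1]` are given (both names are hypothetical):

```python
import numpy as np

def reindexed_action(state, position, pi, waypoints):
    """Re-indexing: ignore the clock and execute the policy whose nominal
    trajectory point is closest to the vehicle's current position."""
    waypoints = np.asarray(waypoints, dtype=float)   # shape (T, 2): nominal (x, y) per time step
    dists = np.linalg.norm(waypoints - np.asarray(position, dtype=float), axis=1)
    return pi[int(np.argmin(dists))](state)
```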

Problems with Dynamic Programming Problem #2: Uncertainty over future states makes it hard to learn any good policy

Problems with Dynamic Programming Due to stochasticity, there is large uncertainty over states in the distant future (figure: distribution over states at time t = 5)

Problems with Dynamic Programming DP algorithms require learning a policy that performs well over the entire distribution (figure: distribution over states at time t = 5)

Space-Indexed Dynamic Programming Basic idea of Space-Indexed Dynamic Programming (SIDP): Perform DP with respect to space indices (planes tangent to trajectory)

Space-Indexed Dynamical Systems and Dynamic Programming

Difficulty with SIDP There is no guarantee that taking a single action will move the vehicle to the next plane along the trajectory; we therefore introduce the notion of a space-indexed dynamical system.

Time-Indexed Dynamical System Creating time-indexed dynamical systems: given the current state s_t, the control action u_t, and the time derivative of the state s_dot = f(s, u), Euler integration gives the discrete-time update s_{t+1} = s_t + f(s_t, u_t) Δt
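
In code, the time-indexed transition is one Euler step of the continuous dynamics; a minimal sketch, where the dynamics function `f` is a hypothetical placeholder (e.g. a car or helicopter model):

```python
import numpy as np

def euler_step(f, s, u, dt):
    """Time-indexed transition: s_{t+1} = s_t + f(s_t, u_t) * dt,
    where f(s, u) returns the time derivative of the state."""
    return np.asarray(s, dtype=float) + dt * np.asarray(f(s, u), dtype=float)
```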

Space-Indexed Dynamical Systems Creating space-indexed dynamical systems: simulate forward until the vehicle hits the next tangent plane, i.e. s_{d+1} = s_d + f(s_d, u_d) Δt*, where Δt* is chosen so that the resulting state lies on the plane at space index d+1. (A positive solution for Δt* exists as long as the controller makes some forward progress.)

Space-Indexed Dynamical Systems The result is a dynamical system indexed by the spatial-index variable d rather than by time. Space-indexed dynamic programming runs DP directly on this system.
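
A minimal sketch of this space-indexed transition under the same Euler model: hold the control fixed and integrate in small time sub-steps until the state crosses the tangent plane at index d+1. The plane is described here by a point and a unit normal, and the first two state components are assumed to be the (x, y) position; these are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def space_indexed_step(f, s, u, plane_point, plane_normal, dt=1e-3, max_steps=100000):
    """Integrate s_dot = f(s, u) forward until the vehicle position reaches the
    tangent plane at the next space index (signed distance becomes >= 0).
    Assumes the first two state components are the (x, y) position."""
    s = np.asarray(s, dtype=float)
    p = np.asarray(plane_point, dtype=float)
    n = np.asarray(plane_normal, dtype=float)
    for _ in range(max_steps):
        if np.dot(s[:2] - p, n) >= 0.0:          # crossed (or reached) the next plane
            return s
        s = s + dt * np.asarray(f(s, u))         # small Euler sub-step with control held fixed
    raise RuntimeError("no forward progress: plane at index d+1 was never reached")
```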

Space-Indexed Dynamic Programming Divide the trajectory into discrete space planes d = 1, 2, 3, 4, 5

Space-Indexed Dynamic Programming d=1 d=2 d=3 d=4 d=5 Proceeding backwards, learn policies for d = D, D-1, …, 2, 1

Problems with Dynamic Programming Problem #1: Policies from traditional dynamic programming algorithms are time-indexed

Space-Indexed Dynamic Programming Time-indexed DP may execute a policy that was learned for a different location; space-indexed DP always executes the policy for the current spatial index.

Problems with Dynamic Programming Problem #2: Uncertainty over future states makes it hard to learn any good policy

Space-Indexed Dynamic Programming Time-indexed DP: wide distribution over future states (distribution over states at time t = 5). Space-indexed DP: much tighter distribution over future states (distribution over states at index d = 5).
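
This effect can be checked with a toy Monte-Carlo experiment. The sketch below is purely illustrative and not from the paper: it uses a hypothetical noisy point-vehicle drifting along the x-axis and compares the spread of rollouts sampled at a fixed time step against the spread sampled where each rollout first crosses a fixed plane x = 5.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(T=50, dt=0.2, speed_noise=0.3, lateral_noise=0.05):
    """Toy vehicle: moves along +x with noisy speed and small lateral drift."""
    xs, ys = [0.0], [0.0]
    for _ in range(T):
        xs.append(xs[-1] + dt * (1.0 + speed_noise * rng.normal()))
        ys.append(ys[-1] + dt * lateral_noise * rng.normal())
    return np.array(xs), np.array(ys)

states_at_time, states_at_plane = [], []
for _ in range(1000):
    xs, ys = rollout()
    states_at_time.append((xs[25], ys[25]))       # state at a fixed time index
    k = int(np.argmax(xs >= 5.0))                 # first step crossing the plane x = 5
    states_at_plane.append((xs[k], ys[k]))        # state at a fixed space index

print("std at fixed time :", np.std(np.array(states_at_time), axis=0))
print("std at fixed plane:", np.std(np.array(states_at_plane), axis=0))
```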

Experiments

Experimental Domain Task: following a race-track trajectory with an RC car, with randomly placed obstacles

Experimental Setup Implemented a space-indexed version of the PSDP algorithm: the policy chooses a steering angle using an SVM classifier (constant velocity), and a simple textbook model simulator of the car dynamics was used to learn the policy. Evaluated time-indexed PSDP, time-indexed PSDP with re-indexing, and space-indexed PSDP.
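
A hedged sketch of what such a space-indexed PSDP setup could look like. The names `sample_states`, `steering_angles`, `step_to_next_plane`, and `cost_to_go` are hypothetical stand-ins (state sampler at plane d, discretized steering set, the space-indexed transition above, and the rollout cost under the already-learned downstream policies), and scikit-learn's SVC is used in place of whatever SVM implementation was actually used.

```python
import numpy as np
from sklearn.svm import SVC

def space_indexed_psdp(D, sample_states, steering_angles, step_to_next_plane, cost_to_go):
    """Backward pass over space indices d = D-1, ..., 0.  Each policy chooses a
    steering angle (constant velocity) via an SVM classifier over the state."""
    policies = [None] * D
    for d in reversed(range(D)):
        X, y = [], []
        for s in sample_states(d):                     # states expected at plane d
            # Label the state with the steering angle whose rollout to the goal
            # (under the already-learned downstream policies) is cheapest.
            costs = [cost_to_go(step_to_next_plane(s, a), d + 1, policies)
                     for a in steering_angles]
            X.append(np.asarray(s, dtype=float))
            y.append(int(np.argmin(costs)))
        # (assumes the labels contain at least two distinct steering classes)
        clf = SVC(kernel="rbf").fit(np.vstack(X), np.array(y))
        policies[d] = lambda s, clf=clf: steering_angles[
            int(clf.predict(np.asarray(s, dtype=float).reshape(1, -1))[0])]
    return policies
```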

Time-Indexed PSDP

Time-Indexed PSDP w/ Re-indexing

Space-Indexed PSDP

Empirical Evaluation: Time-indexed PSDP / Time-indexed PSDP with Re-indexing / Space-indexed PSDP. Costs: [missing] / Infinite (no trajectory succeeds) / 59.74

Additional Experiments In the paper: additional experiments on the Stanford Grand Challenge Car using space-indexed DDP, and on a simulated helicopter domain using space-indexed PSDP

Related Work Reinforcement learning / dynamic programming: Bagnell et al., 2004; Jacobson and Mayne, 1970; Lagoudakis and Parr, 2003; Langford and Zadrozny, 2005. Differential Dynamic Programming: Atkeson, 1994; Tassa et al., 2008. Gain Scheduling, Model Predictive Control: Leith and Leithead, 2000; Garcia et al., 1989.

Summary Trajectory following uses non-stationary policies, but traditional DP/RL algorithms suffer because they are time-indexed. In this paper, we introduce the notions of a space-indexed dynamical system and space-indexed dynamic programming, and demonstrate the usefulness of these methods on real-world control tasks.

Thank you! Videos available online at