Learning Preferences on Trajectories via Iterative Improvement

Presentation transcript:

Learning Preferences on Trajectories via Iterative Improvement Ashesh Jain, Thorsten Joachims, Ashutosh Saxena Cornell University Hello, I am Ashesh Jain. I'll be talking about Learning Preferences on Trajectories via Iterative Improvement; this is joint work with Thorsten Joachims and Ashutosh Saxena.

Motivation High DoF robots Generate multiple trajectories for a task Only a few are desirable by the user We learn preferences over trajectories 1. We have many high DoF household and assembly robots such as the PR2 and Baxter. 2. As shown in the figure, these robots can generate multiple different trajectories for a task. 3. However, depending on the task and environment, only a few of the generated trajectories are desirable to the end user, i.e., the user has preferences over trajectories. In this work, we closely study examples of such preferences and propose an algorithm to learn them for high DoF robots.

Example of Preferences Glass of water upright Now, let's look at some examples of such preferences for a household robot. If a robot is moving a glass of water, the glass should not only be held upright.

Example of Preferences Environment-dependent preferences: Don't move water over laptop But the robot should also be aware of its surroundings and keep water away from electronic devices such as a laptop or mobile phone.

Example of Preferences Glass of water upright. Environment-dependent preferences: Don't move water over laptop. User likes to predict robot's move Contorted arm configurations are not preferred Furthermore, the robot should produce motion that is predictable by the user and should therefore avoid contorted arm configurations, as shown in the figure. This makes the problem challenging because the user's preferences are governed by the complete arm configuration, not just the end-effector location.

Example of Preferences Glass of water upright Environment-dependent preferences: Don't move water over laptop User likes to predict robot's move Contorted arm configurations are not preferred Preferences vary with users, tasks, and environments As we can see from this example, the problem is challenging because the space of preferences can potentially be huge, and preferences can further vary with users, tasks, and environments. Manually coding them into planners is therefore not a viable option, and we instead rely on machine learning algorithms.

Learning preferences Method 1: Learning from expert's demonstration (Kober and Peters 2011, Abbeel et al. 2010, Ziebart et al. 2008, Ratliff et al. 2006) One method to learn user preferences is from expert demonstrations, and it has been shown to be useful in many applications. However, for the problem we consider, this approach faces two challenges.

Learning preferences Method 1: Learning from expert's demonstration (Kober and Peters 2011, Abbeel et al. 2010, Ziebart et al. 2008, Ratliff et al. 2006) Do not scale well to high DoF manipulators On the algorithmic side, these methods do not scale well to high DoF robots, especially when the user is concerned with the robot's complete arm configuration.

Learning preferences Method 1: Learning from expert's demonstration (Kober and Peters 2011, Abbeel et al. 2010, Ziebart et al. 2008, Ratliff et al. 2006) Do not scale well to high DoF manipulators Near-optimal demonstrations Providing kinesthetic demonstrations on high DoF robots is difficult On the practical side, the assumption that the user knows the optimal trajectory may no longer be true. And even if the user knows the optimal trajectory, providing such kinesthetic demonstrations accurately is hard. To overcome this, we propose a new method for human-robot interaction.

Learning preferences Method 1: Learning from expert's demonstration (Kober and Peters 2011, Abbeel et al. 2010, Ziebart et al. 2008, Ratliff et al. 2006) Do not scale to high DoF manipulators Near-optimal demonstrations Providing kinesthetic demonstrations on high DoF robots is difficult Our Method: Learns from suboptimal online user feedback Here the robot learns from suboptimal user feedback, i.e., expert demonstrations are not required.

Learning preferences Method 1: Learning from expert's demonstration (Kober and Peters 2011, Abbeel et al. 2010, Ziebart et al. 2008, Ratliff et al. 2006) Do not scale to high DoF manipulators Near-optimal demonstrations Providing kinesthetic demonstrations on high DoF robots is difficult Our Method: Learns from suboptimal online user feedback User iteratively improves the trajectory proposed by the robot As feedback, the user iteratively improves the trajectory proposed by the robot.

Learning preferences Method 1: Learning from expert's demonstration (Kober and Peters 2011, Abbeel et al. 2010, Ziebart et al. 2008, Ratliff et al. 2006) Do not scale to high DoF manipulators Near-optimal demonstrations Providing kinesthetic demonstrations on high DoF robots is difficult Our Method: Learns from suboptimal online user feedback User iteratively improves the trajectory proposed by the robot Theoretical guarantee on regret bounds We further show that even with suboptimal feedback, our method converges at the same rate as if optimal demonstrations were available.
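As a rough illustration of what such a regret guarantee states (the notation below is assumed for exposition and not quoted from the slides), coactive learning measures the average regret of the trajectories the robot presents against the user's optimal ones, and bounds it when the user's improvements are α-informative:

```latex
\mathrm{REG}_T \;=\; \frac{1}{T}\sum_{t=1}^{T}\Big( U(x_t, y_t^{*}) - U(x_t, y_t) \Big)
\;\le\; O\!\left(\frac{1}{\alpha\sqrt{T}}\right)
```

Here $x_t$ is the task context, $y_t$ the trajectory presented at round $t$, $y_t^{*}$ the user-optimal trajectory, and $U$ the user's (never revealed) utility; the $O(1/\sqrt{T})$ decay is the same rate one would obtain if optimal demonstrations were available.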

Learning preferences Method 1: Learning from expert's demonstration (Kober and Peters 2011, Abbeel et al. 2010, Ziebart et al. 2008, Ratliff et al. 2006) Do not scale to high DoF manipulators Near-optimal demonstrations Providing kinesthetic demonstrations on high DoF robots is difficult Our Method: Learns from suboptimal online user feedback User iteratively improves the trajectory proposed by the robot Theoretical guarantee on regret bounds User NEVER discloses the optimal trajectory Finally, throughout this procedure the user never discloses the optimal trajectory to the robot.

Iterative Improvement Co-active learning setting: Interactively improves a trajectory Clicks on one trajectory from a ranked list of trajectories Similar to a search engine One way to improve a trajectory is by correcting one of its waypoints, as shown in the figure. We implement this using the ROS interactive manipulation tool.
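A minimal sketch of how such coactive feedback can drive learning, assuming a linear utility over trajectory features (the toy feature map, function names, and update rule below are illustrative placeholders, not the authors' implementation):

```python
import numpy as np

def trajectory_features(waypoints):
    """Toy feature map phi(y) for a trajectory given as an (n, d) array of
    waypoints: path length and mean height, standing in for task and
    environment features such as cup tilt or distance to a laptop."""
    diffs = np.diff(waypoints, axis=0)
    path_length = np.sum(np.linalg.norm(diffs, axis=1))
    mean_height = np.mean(waypoints[:, -1])
    return np.array([path_length, mean_height])

def present_best(w, candidate_trajs):
    """Score sampled candidate trajectories with the current weights and
    present the highest-scoring one to the user."""
    scores = [w @ trajectory_features(y) for y in candidate_trajs]
    return candidate_trajs[int(np.argmax(scores))]

def coactive_update(w, proposed, improved):
    """Perceptron-style coactive update: shift the weights toward the
    user-improved trajectory and away from the one the robot proposed."""
    return w + trajectory_features(improved) - trajectory_features(proposed)

# One round of interaction: propose a trajectory, receive a (possibly small)
# waypoint correction from the user, and update the preference weights.
rng = np.random.default_rng(0)
w = np.zeros(2)
candidates = [rng.random((10, 3)) for _ in range(5)]
proposed = present_best(w, candidates)
improved = proposed.copy()
improved[5, -1] += 0.2  # user nudges one waypoint upward
w = coactive_update(w, proposed, improved)
```

The key point this illustrates is that the user only needs to supply a slightly better trajectory than the one shown, never the optimal one; repeating the round above drives the weights toward the user's preferences.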

Result Tested on 35 different tasks. (Plot: nDCG@3 vs. number of feedbacks.) In fewer than 5 feedbacks our method outperforms fully supervised batch methods. We evaluated our method on 35 different household tasks and show that we outperform fully supervised batch algorithms with fewer than 5 feedbacks on average. We also produce better trajectories than sampling-based planners with manually coded constraints.
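For reference, nDCG@3 measures how well the top 3 trajectories in the robot's ranking agree with the user's relevance labels. A minimal sketch of the computation, using the linear-gain DCG convention (the relevance values in the example are made up for illustration):

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the top-k items of a ranking."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))
    return np.sum(rel / discounts)

def ndcg_at_k(relevances, k=3):
    """nDCG@k: DCG of the presented ranking divided by the DCG of the
    ideal (sorted-by-relevance) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance of trajectories in the order the robot ranked them.
print(ndcg_at_k([2, 3, 0, 1], k=3))
```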

Thank You. Visit our poster for more details.