DARPA ITO/MARS Program


DARPA ITO/MARS Program
Control and Coordination of Multiple Autonomous Robots
Vijay Kumar, GRASP Laboratory, University of Pennsylvania
http://www.cis.upenn.edu/mars

Motivation
We are interested in coordinated control of robots: manipulation and vision-based control.
- Large number of modes
- Scalability
- Individual modes (behaviors) are well understood, but the interaction between them is not.
Software design: modes are designed bottom-up, protocols top-down.

MARS
Architecture (diagram): CHARON code (high-level language); Analysis; Learning Algorithms; CHARON to Java Translator; Java Libraries; Drivers; Java Code; Control Code Generator; Simulator Code Generator; Human Interface.

Outline of the Talk
- Language and software architecture: CHARON; agents and modes; examples
- Reactive control algorithms: mode switching; hierarchical composition of reactive control algorithms; results
- From reactive to deliberative schemes
- Simulation
- Reinforcement learning: learn mode switching and composition rules
- Future work

Participants
Rajeev Alur, Aveek Das, Joel Esposito, Rafael Fierro, Radu Grosu, Greg Grudic, Yerang Hur, Vijay Kumar, Insup Lee, Ben Southall, John Spletzer, Camillo Taylor, Lyle Ungar

Architectural Hierarchy in CHARON
(Diagram: an Agent composed of sub-agents Agent1 and Agent2, each with sensor, processor, and actuator components connected through input and output ports.)
Each agent can be represented as a parallel composition of sub-agents.

Behavioral Hierarchy in CHARON
(Diagram: a main mode containing submodes awayTarget, atTarget, sensing, and control, connected through entry and exit ports.)
Each agent consists of modes (behaviors); modes can in turn consist of submodes.

CHARON
- Individual components described as agents: composition, instantiation, and hiding
- Individual behaviors described as modes: encapsulation, instantiation, and scoping
- Support for concurrency: shared variables as well as message passing
- Support for discrete and continuous behavior
- Well-defined formal semantics
Composition of submodes is called encapsulation. Sequential composition: wall following and obstacle avoidance can be implemented sequentially.

Reactive Behaviors Based on Vision
(Diagram: behaviors Avoid Obstacle, Collision Recovery, and Pursue drive the Motion Controller; perception components include Frame Grabber, Edge Detector, Color Blob Finder, Range Mapper, Robot Position Estimator, Target Detector, and Collision detection; outputs go to the Actuators.)

Robot Agent

robotController = rmapper || cdetector || explore
rmapper   = rangeMapper()       [rangeMap = rM];
cdetector = collisionDetector() [collisionDetected = cD];
explore   = obstacleAvoider()   [collisionDetected, rangeMap = cD, rM];

agent explore() {
  mode top = exploreTopMode()
}
agent rangeMapper() {
  mode top = rangeMapperTopMode()
}
agent collisionDetector() {
  mode top = collisionDetectorTopMode()
}

Collision Recovery

mode collisionRecoveryMode(real recoveryDuration, real c) {
  entry enPt;
  exit exPt;
  readWrite diff analog real x;
  readWrite diff analog real phi;
  readWrite diff analog real recoveryTimer;
  diffEqn dRecovery { d(x) = -c; d(phi) = 0; d(recoveryTimer) = 1.0 }
  inv invRecovery { 0.0 <= recoveryTimer && recoveryTimer <= recoveryDuration }
} // end of mode collisionRecoveryMode

Obstacle Avoidance

mode obAvoidanceMode() {
  entry enPt;
  exit exPt;
  read discrete bool collisionDetected;
  read RangeMap rangeMap;
  readWrite diff analog real x;
  readWrite diff analog real phi;
  diffEqn dObAvoidance { d(x) = computeSpeed(rangeMap); d(phi) = computeAngle(rangeMap) }
  inv invObAvoidance { collisionDetected = false }
  initTrans from obAvoidanceMode to obAvoidanceMode when true do { x = 0.0; phi = 0.0 }
}

Explore

mode exploreTopMode() {
  entry enPt;
  read discrete bool collisionDetected;
  read RangeMap rangeMap;
  private diff analog real recoveryTimer;
  mode obAvoidance = obAvoidanceMode()
  mode collisionRecovery = collisionRecoveryMode(recoveryDuration, c)
  initTrans from obstacleAvoiderTopMod to obAvoidance
    when true do { recoveryDuration = 10.0; c = 1.0 }                     // initialization
  trans OaToCr from obAvoidance.exPt to collisionRecovery.enPt
    when (collisionDetected == true) do {}
  trans CrToOa from collisionRecovery.exPt to obAvoidance.enPt
    when (recoveryTimer == recoveryDuration) do { recoveryTimer = 0.0 }   // reset the timer
}

Explore
(Diagram: the Explore mode contains two submodes with inputs rangeMap and collisionDetected: obAvoidance, with differential equation dObAvoidance (x' = k1*r, phi' = -k2), and collisionRecovery, with dRecovery (x' = -c, phi' = 0, recoveryTimer' = 1). Transitions: collision takes obAvoidance to collisionRecovery; timeOut returns to obAvoidance.)

Vision-Based Control with Mobile Robots
- Parallel composition of software agents: obstacle avoidance, wall following
- Mode switching
- Multiple levels of abstraction of data from the sensors

Explore: Wall Following with Obstacles

Explore, Search, and Pursue

Multiagent Control
(Diagram: each robot agent, e.g. Robot1, has a local diff analog timer and a moving mode with submodes awayTarget (dPlan, iAway) and atTarget (dStop, iAt), with an arrive transition when pos = target, plus a sensing mode (dStop, iConst). Dynamics: pos.x' = v * cos(phi), pos.y' = v * sin(phi), timer' = 1, and steering omega = k * (theta - phi); position estimates r1Est1, r1Est2, r2Est1, r2Est2 are exchanged between robots.)

Multiagent Control

Modular Simulation
Goal:
- Modes are simulated using local information
- Simulation is efficient and accurate
- Integration of modes at different time scales
- Integration of agents at different time scales
Modes are simulated using local information: submodes are regarded as black boxes and are simulated independently of one another.
Agents are simulated using local information: agents are regarded as black boxes and are simulated independently of one another.

Time Round of a Mode (Agent)
1. Get the integration time d and the invariants from the supermode (or the scheduler).
2. While (time t = 0; t <= d) do:
   - Simplify all invariants.
   - Predict the integration step dt based on d and the invariants.
   - Execute the time round of the active submode and get its state s and the time elapsed e.
   - Integrate for time e and get the new state s.
   - Return s and t+e if invariants were violated.
   - Increment t = t+e.
3. Return s and d.
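A minimal Python sketch of this time-round loop, assuming a hypothetical Mode interface (the class, its methods, and the fixed predicted step are illustrative assumptions, not the CHARON simulator API):

# Hedged sketch of the modular time round described above.
class Mode:
    def __init__(self, submode=None):
        self.submode = submode        # active submode (None for a leaf mode)
        self.state = 0.0              # continuous state (scalar for simplicity)

    def invariant_ok(self, state):
        return True                   # placeholder invariant check

    def flow(self, state, dt):
        return state                  # placeholder integration of the local diffEqn

    def time_round(self, d):
        """Run for at most time d; return (state, elapsed time)."""
        t = 0.0
        while t <= d:
            dt = min(0.01, d - t)     # predicted step from d and the invariants
            if self.submode is not None:
                # execute the time round of the active submode (treated as a black box)
                self.submode.state, e = self.submode.time_round(dt)
            else:
                e = dt
            self.state = self.flow(self.state, e)   # integrate local variables for time e
            if not self.invariant_ok(self.state):
                return self.state, t + e            # invariant violated: hand control up
            t += e
            if e == 0.0:
                break
        return self.state, d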

Modular Simulation - Global Execution
1. Pick the agents with the minimum and second-minimum reached time.
2. Compute the time-round interval d for the minimum agent, e.g. A2, such that its absolute time may exceed the time reached by the second-minimum agent, e.g. A1, by at most dt.
3. The time round may end before the interval d is consumed if the invariants of A2 are violated; the actual time increment is then e.
4. Agent A2 executes an update round to synchronize its discrete variables with the analog ones.
5. The state of A2 becomes visible to the other agents.
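A hedged sketch of this scheduling loop in Python; `agents` is assumed to be a list of objects carrying a reached time and the time_round/update_round interface from the previous sketch (names are assumptions):

def global_step(agents, dt):
    agents_sorted = sorted(agents, key=lambda a: a.time)
    a_min, a_second = agents_sorted[0], agents_sorted[1]
    # allow the minimum-time agent to run past the second agent's time by at most dt
    d = (a_second.time + dt) - a_min.time
    state, e = a_min.time_round(d)        # may stop early on an invariant violation
    a_min.time += e
    a_min.update_round()                  # sync discrete variables with analog ones
    return a_min                          # its new state is now visible to the others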

Modular Multi-rate Simulation
Use a different time step for each component to exploit multiple time scales, increasing efficiency.
- "Slowest-first" order of integration
- Coupling is accommodated by using interpolants for the slow variables
- Tight error bound: O(h^{m+1}), where the bound involves the step size, the coupling, and the ratio of the largest to the smallest step size.
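A toy Python sketch of slowest-first multi-rate integration for two coupled scalar ODEs: the slow variable is advanced first over a large macro-step, and a linear interpolant of it supplies the coupling while the fast variable takes smaller steps. The dynamics, step sizes, and the linear interpolant are assumptions for illustration only:

# Coupled system: slow variable x, fast variable y (made-up dynamics).
def f_slow(x, y): return -0.1 * x + 0.05 * y
def f_fast(x, y): return -5.0 * y + x

def multirate_step(x, y, H, h):
    """One slow macro-step of size H; the fast variable takes steps of size h <= H."""
    # 1. "Slowest-first": advance the slow component over the whole macro-step.
    x_new = x + H * f_slow(x, y)
    # 2. Advance the fast component with the smaller step, using a linear
    #    interpolant of the slow variable to provide the coupling.
    n = int(round(H / h))
    y_new = y
    for k in range(n):
        t = k * h
        x_interp = x + (t / H) * (x_new - x)
        y_new = y_new + h * f_fast(x_interp, y_new)
    return x_new, y_new

x, y = 1.0, 1.0
for _ in range(100):
    x, y = multirate_step(x, y, H=0.1, h=0.01)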

Simulation and Analysis of Hierarchical Systems with Mode Switching
- Modular simulation
- Automatic detection of events: mode-switching transitions, guards
- Synthesis of controllers: include models of uncertainty
- Sensor fusion: include models of noise

Traditional Model of Hierarchy
- NASREM architecture [Albus, 80]
- Implementations: Demo III, NASA robotic systems

Event Detection
Given: the dynamics x(t) and an event function g(x).
We re-parameterize time by controlling the integration step size: using feedback linearization, we select our "speed" (step size) along the integral curves so that the trajectory converges to the event surface.
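A hedged sketch of the idea in Python: treat g as the quantity being driven to zero and choose each integration step so that g contracts at a prescribed rate along the flow (forward-Euler illustration; the dynamics, event function, and contraction gain are invented for the example and are not the algorithm from the talk):

def f(x):                      # example dynamics x' = f(x): falling point with constant deceleration
    return [x[1], -1.0]

def g(x):                      # example event function: height above the ground
    return x[0]

def grad_g_dot_f(x):           # dg/dt along the flow = grad(g) . f(x)
    return f(x)[0]

def integrate_to_event(x, lam=0.5, tol=1e-6, max_steps=1000):
    for _ in range(max_steps):
        if abs(g(x)) < tol:
            return x                                   # event surface reached
        # choose the step so that g contracts by a factor lam per step
        # (feedback linearization of g along the integral curve)
        h = -lam * g(x) / grad_g_dot_f(x)
        x = [xi + h * fi for xi, fi in zip(x, f(x))]
    return x

state = integrate_to_event([10.0, -1.0])               # event: height reaches zero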

Hysteresis
(Diagram: an environment agent Env (input u, x1 = u, with modes inc and dec and dynamics dX1) composed with a hysteresis agent Hyst (output y = 2u, with modes up, strPlus, and strMinus), switching at thresholds -a and a+2 with outputs 1 and -1.)

Global versus Modular Simulation
(Plot: comparison on the hysteresis example.)
- 2 levels of hierarchy; the global state is two-dimensional
- Significant potential for more complex systems

Modular Simulation Error

Current Implementation Status
Work to date: CHARON semantics, a parser for CHARON, the internal representation, a type checker.
Current work: modular simulation scheme, internal representation generator.
Toolchain (diagram): CHARON Specification, CHARON Parser, Syntax Tree, Type Checker, Internal Representation Generator, Internal Representation, and the Control Code Generator, Simulator Generator, and Model Checker.

Reactive to Deliberative Schemes
- A reactive scheme is a composition of go-to-target and collision-avoidance behaviors.
- A deliberative scheme uses a path preplanned around the nominal model.
- Reactive schemes are robust and easy to implement, but may be limited in their ability to accomplish complex tasks and may not compare favorably to recursive implementations of deliberative controllers.
(Figure: nominal model with obstacle.)

Toward a composition of reactive and deliberative decision making
u1: vector field specified by a reactive planner
u2: vector field specified by a deliberative planner
If u1 ∈ U and u2 ∈ U, then αu1 + (1 - α)u2 ∈ U.
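A small Python illustration of this convexity property: blending a reactive field (move away from the obstacle) with a deliberative field (head for the next waypoint of the preplanned path). The specific fields and the fixed blending weight are assumptions made for the example:

import math

def u_reactive(pos, obstacle):
    # point away from the obstacle; magnitude decays with distance
    dx, dy = pos[0] - obstacle[0], pos[1] - obstacle[1]
    d = math.hypot(dx, dy) + 1e-9
    return (dx / d**2, dy / d**2)

def u_deliberative(pos, waypoint):
    # unit vector toward the next waypoint of the preplanned path
    dx, dy = waypoint[0] - pos[0], waypoint[1] - pos[1]
    d = math.hypot(dx, dy) + 1e-9
    return (dx / d, dy / d)

def blended(pos, obstacle, waypoint, alpha=0.3):
    u1 = u_reactive(pos, obstacle)
    u2 = u_deliberative(pos, waypoint)
    # the convex combination stays in the set of admissible controls U
    return (alpha * u1[0] + (1 - alpha) * u2[0],
            alpha * u1[1] + (1 - alpha) * u2[1])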

Composition of reactive and deliberative planners
Framework for decision making:
- U is the set of available control policies
- Y is the uncertainty set: uncertainty in the environment model, in the dynamics, and in localization
- Best decision under the worst uncertainty (see the formulation below)
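In symbols, the min-max criterion can be written as follows; the cost J is not defined on the slide, so this is the standard form of such a criterion, stated here as an assumption:

\[
u^{*} = \arg\min_{u \in U} \; \max_{y \in Y} \; J(u, y)
\]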

Results
(Figures: worst-case outcome; better-than-worst-case outcomes.)
- Minimization weights prior information against current information, resolving the discrepancy between prior and current plans.
- The closure property of "basis behaviors" ensures robustness.
- Requires a priori calculation of the roadmap.

Detailed Analysis
(Plots: cost function cross-sections; min-max versus max-min.)
A global saddle point does not exist due to the non-smooth solution.

More Results
(Figures: open-loop versus recursive implementations of the best-under-worst-case-uncertainty policy.)

Deliberative and Reactive Behaviors in a Dynamic Setting
Obstacle dynamics are known; the exact inputs are not.
(Figure: obstacle, robot, target.)

Paradigm for Learning
Sensory information → information state → situation partitions → modes → action space.
The hierarchical structure allows learning at several levels:
- Lowest level: parameter estimation within each mode; algorithms for extracting the information state (features, position and velocity, high-level descriptors)
- Intermediate level: select the best mode for a situation; determine the best partitioning of states for a given information state
- Advanced level: transfer knowledge (programs, behaviors) between robots and humans
Learning at any level forces changes at the others.

Reinforcement Learning and Robotics
Successful RL (Kaelbling et al., 1996):
- Low-dimensional, discrete state space
- Hundreds of thousands of training runs necessary
- Stochastic search required
Robotic systems:
- Large, continuous state space
- A large number of training runs (e.g., 100,000) may not be practical
- Stochastic search is not desirable
The fundamental question is whether RL can be applied to robotics, given that previous successful implementations of RL seem directly incompatible with it.

Boundary Localized Reinforcement Learning
Our approach to robot control: noisy state space, deterministic modes.
Our approach to reinforcement learning:
- Search only at mode boundaries
- Ignore most of the state space
- Minimize stochastic search; RL with no stochastic search during learning
Our goal is to apply reinforcement learning to high-dimensional robotics problems that have deterministic mode-switching controllers (i.e., the estimate of the current state deterministically defines the mode/action taken). To apply RL to such deterministic, high-dimensional problems, which have not previously been successfully addressed in the RL framework, we reformulate RL as a search localized to mode boundaries, with minimal or no stochastic search. This greatly reduces the search space, making high-dimensional reinforcement learning feasible.

Mode Switching Controllers
(Figure: the state space partitioned into regions; in each region a mode of operation is active, i.e. action a_i is executed; the mode boundaries are parameterized by θ.)
Our definition of a mode-switching control system. Modes include: move away from obstacle, follow wall, follow leader, follow hallway, go toward goal. θ parameterizes the boundaries between modes in the state (information) space. A sketch of such a controller is given below.
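A minimal Python sketch of a θ-parameterized mode-switching controller over a 2-D state; the state features, thresholds, and mode set are hypothetical and chosen only for illustration:

MODES = ["avoid_obstacle", "follow_wall", "go_to_goal"]

def select_mode(state, theta):
    """Deterministic mode selection: the region of the state space decides the action."""
    d_obst, d_wall = state            # distances to the nearest obstacle and wall
    if d_obst < theta["obstacle_radius"]:
        return "avoid_obstacle"
    if d_wall < theta["wall_radius"]:
        return "follow_wall"
    return "go_to_goal"

theta = {"obstacle_radius": 0.5, "wall_radius": 1.0}   # boundary parameters to be learned
print(select_mode((0.3, 2.0), theta))                  # -> avoid_obstacle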

Reinforcement Learning for Mode Switching Controllers
(Figure: initial guess (prior knowledge) → reinforcement feedback R → "optimal" parameterization.)
Our framework for applying RL to mode-switching controllers:
1. Use prior knowledge to define a parameterization of the mode boundaries.
2. Use reinforcement feedback from the robot's environment to modify these boundaries to obtain locally optimal performance.

Reinforcement Learning
Markov decision process; policy; reinforcement feedback from the environment: r_t; goal: modify the policy to maximize performance. Policy gradient formulation.
Summary of our reinforcement learning formulation as a Markov decision process (the action taken is a function only of the current state). A discounted-reward performance function is shown here.
1. A policy is a probability distribution over actions taken in the current state (given the parameterization θ).
2. The environment supplies reinforcement feedback.
3. The robot's goal is to maximize reward (i.e., maximize positive reinforcement feedback) under some reward specification, for example discounted reward, but average reward and others are possible.
4. We use the policy gradient formulation of RL, updating the θ parameters by differentiating the reward function and performing gradient ascent.
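For concreteness, the standard discounted-reward policy-gradient quantities referred to above can be written as follows (these are the textbook definitions, not transcribed from the slide):

\[
\pi_\theta(a \mid s) = \Pr(a_t = a \mid s_t = s;\ \theta), \qquad
J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right], \qquad
\theta \leftarrow \theta + \eta\, \nabla_\theta J(\theta).
\]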

Why Policy Gradient RL?
- Computation is linear in the number of parameters θ; this avoids the blow-up from discretization seen with other RL methods.
- Generalization in the state space is implicitly defined by the parametric representation; generalization is important for high-dimensional problems.
We use the policy gradient formulation because its computational cost is linear in the number of parameters used to specify the policy, whereas other RL methods grow exponentially with the number of dimensions in the state/information space.

Key Result #1
Any θ-parameterized probabilistic policy can be transformed into an approximately deterministic policy parameterized by θ, deterministic everywhere except near the mode boundaries.
There are three theoretical results. The first states that we can transform any probabilistic policy into an approximately deterministic one (i.e., deterministic everywhere except arbitrarily near the mode boundaries).
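One common way to realize such a transformation is to sharpen a sigmoidal boundary policy: as the sharpness grows, the action choice becomes effectively deterministic except in a thin band around the boundary. The Python sketch below illustrates the idea only; it is not the construction used in the talk:

import math, random

def boundary_policy(x, theta, sharpness):
    """P(action 1 | x): a sigmoid centered on the boundary x = theta."""
    return 1.0 / (1.0 + math.exp(-sharpness * (x - theta)))

def act(x, theta, sharpness=50.0):
    # far from the boundary the probability saturates near 0 or 1,
    # so the policy is effectively deterministic there
    return 1 if random.random() < boundary_policy(x, theta, sharpness) else 0

print(boundary_policy(0.4, 0.5, 50.0))   # ~0.007: practically always action 0
print(boundary_policy(0.6, 0.5, 50.0))   # ~0.993: practically always action 1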

Key Result #2
Convergence to locally optimal mode-switching policies is obtained by searching near the mode boundaries; all other regions of the state space can be ignored. This significantly reduces the search space.
The second theoretical result says that locally optimal mode-switching policies are obtained by using stochastic search near the mode boundaries, so stochastic search is not needed in other regions of the state/information space. As a result, the search space is significantly smaller than in typical probabilistic RL.

Stochastic Search Localized to Mode Boundaries
(Figure: boundary regions in the state space; a diagrammatic representation of the second theoretical result.)

Key Result #3
Reinforcement learning can locally optimize deterministic mode-switching policies without using stochastic search if:
- the robot takes small steps, and
- the value of executing actions (Q) is smooth with respect to the state.
These conditions are met almost everywhere in typical robot applications.
The final theoretical result is that stochastic search is not necessary at all if the reward function is smooth in θ. This is significant because a basic premise of RL so far has been that it cannot be done without stochastic search; eliminating it is valuable for robotics, where randomly chosen actions can result in unsafe or expensive outcomes.

Deterministic Search at Mode Boundaries
(Figure: boundary regions in the state space; a diagrammatic representation of this result.)

Simulation
- State space: robot position
- Boundary definitions: Gaussian centers and widths; 2 parameters per Gaussian for each dimension; 20 parameters in total
- Two types of modes: toward a goal, away from an obstacle
- Reward: +1 for reaching a goal, -1 for hitting an obstacle

Convergence Results
Learning algorithm used: Action Transition Policy Gradient
- Policy gradient updates occur only at action transitions
- Orders of magnitude faster convergence than other PG algorithms
- The algorithm can also be used on entirely stochastic policies
- Convergence is guaranteed under appropriate smoothness conditions
A comparison of the number of passes through the environment (iterations) required for the three algorithms implemented:
1) Traditional probabilistic RL: stochastic search in the entire state space.
2) Stochastic Boundary Localized RL (BLRL): stochastic search limited to areas near mode boundaries.
3) Deterministic BLRL: no stochastic search.
Notes: (a) traditional probabilistic RL is the slowest to converge; (b) stochastic BLRL is one order of magnitude faster; (c) deterministic BLRL is the fastest to converge; (d) all three algorithms converge to similar solutions, so there is no performance loss entailed in using deterministic BLRL (as predicted by theory). A sketch of the update scheme follows.
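A rough Python sketch of the "update only at action transitions" idea inside a REINFORCE-style episode loop. The gradient estimator, learning rate, and the env/policy interfaces are assumptions; this is not the published Action Transition Policy Gradient algorithm:

# Illustrative only: accumulate log-policy gradients and rewards, but apply a
# parameter update only when the selected action (mode) changes.
def run_episode(env, policy, theta, alpha=0.01):
    state = env.reset()
    prev_action, grad_acc, reward_acc = None, 0.0, 0.0
    done = False
    while not done:
        action, grad_logp = policy.sample(state, theta)   # action and d/dtheta log pi
        state, reward, done = env.step(action)
        grad_acc += grad_logp
        reward_acc += reward
        if prev_action is not None and action != prev_action:
            theta += alpha * reward_acc * grad_acc         # update at the transition
            grad_acc, reward_acc = 0.0, 0.0
        prev_action = action
    return theta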

Convergence in Higher Dimensions
(Plots: number of trials to convergence versus state-space dimension for the initial stochastic policy, the boundary-localized policy, and the learned deterministic policy; a 2-D projection of the learned behavior.)
The robots learn not to run into walls or obstacles while minimizing the distance traveled to the goal (it takes about 5 to 10 runs to learn this). They start with an initial guess of how close they can approach an obstacle without hitting it, and this distance is incrementally modified to trade off path length against collisions; learning stops when the robot never collides with obstacles while taking the shortest path. The current demo learns only one parameter, but it is now simple to add more parameters (starting with the entire range map) and more modes (the sensor and actuator models are now working properly).

Current Work: More Sophisticated Simulation
- Realistic models of sensors and actuators, including models of noise
- Control code used for the actual robots is also used in the simulator:
  - allows modeling of realistic mode switching
  - modes tested in simulation are directly transferred to the real robots
  - learning can be done in simulation and transferred to the real robots for further refinement

Status
- Framework for learning: parameter adaptation using reinforcement reward (low level); learning rules for selecting and switching between behaviors (intermediate level)
- Preliminary numerical experiments are reassuring and supported by theory
- Current work: more realistic simulation
- Future work: combine experiments with simulation

Summary of Progress
- Software tools, implementation on robot hardware: on schedule
- Implementation on robot hardware, mixed simulation: new results on modular simulation
- Control algorithms, reactive-to-deliberative algorithms, new learning algorithms: ahead of schedule