1
DARPA ITO/MARS Program
Control and Coordination of Multiple Autonomous Robots
Vijay Kumar, GRASP Laboratory, University of Pennsylvania
2
Motivation
We are interested in the coordinated control of robots:
- manipulation
- vision-based control
Large number of modes; scalability. Individual modes (behaviors) are well understood, but the interactions between them are not. Software design: modes are built bottom-up, protocols top-down.
3
MARS
(Diagram: CHARON code, a high-level language, feeds Analysis, Learning Algorithms, and the CHARON-to-Java Translator. The translator emits Java code, which together with the Java libraries and drivers supports the Control Code Generator, the Simulator Code Generator, and the Human Interface.)
4
Outline of the Talk
- Language and software architecture: CHARON agents and modes, examples
- Reactive control algorithms: mode switching; hierarchical composition of reactive control algorithms; results
- From reactive to deliberative schemes
- Simulation
- Reinforcement learning: learn mode switching and composition rules
- Future work
5
Participants
Rajeev Alur, Aveek Das, Joel Esposito, Rafael Fierro, Radu Grosu, Greg Grudic, Yerang Hur, Vijay Kumar, Insup Lee, Ben Southall, John Spletzer, Camillo Taylor, Lyle Ungar
6
Architectural Hierarchy in CHARON
(Diagram: an Agent built from sub-agents Agent1 and Agent2, each containing sensor, processor, and actuator components wired through input and output ports.)
Each agent can be represented as a parallel composition of sub-agents.
7
Behavioral Hierarchy in CHARON
(Diagram: a main mode with submodes awayTarget and atTarget, each containing sensing and control submodes, connected through entry and exit ports.)
Each agent consists of modes, or behaviors. Modes can in turn consist of submodes.
8
CHARON
- Individual components described as agents: composition, instantiation, and hiding
- Individual behaviors described as modes: encapsulation, instantiation, and scoping
- Support for concurrency: shared variables as well as message passing
- Support for discrete and continuous behavior
- Well-defined formal semantics
Composition of submodes is called encapsulation. Sequential composition: wall following and obstacle avoidance can be implemented sequentially.
9
Reactive Behaviors based on Vision
(Diagram: the vision-based reactive control pipeline. A Frame Grabber feeds an Edge Detector and a Color Blob Finder; these support a Range Mapper, Collision Detector, Target Detector, and Robot Position Estimator, which drive the Avoid Obstacle, Collision Recovery, and Pursue behaviors; the Motion Controller sends commands to the Actuators.)
10
Robot Agent
  robotController = rmapper || cdetector || explore
  rmapper = rangeMapper() [rangeMap = rM];
  cdetector = collisionDetector() [collisionDetected = cD];
  explore = obstacleAvoider() [collisionDetected, rangeMap = cD, rM];

  agent explore() { mode top = exploreTopMode() }
  agent rangeMapper() { mode top = rangeMapperTopMode() }
  agent collisionDetector() { mode top = collisionDetectorTopMode() }
11
Collision Recovery
  mode collisionRecoveryMode(real recoveryDuration, real c) {
    entry enPt; exit exPt;
    readWrite diff analog real x;
    readWrite diff analog real phi;
    readWrite diff analog real recoveryTimer;
    diffEqn dRecovery { d(x) = -c; d(phi) = 0; d(recoveryTimer) = 1.0 }
    inv invRecovery { 0.0 <= recoveryTimer && recoveryTimer <= recoveryDuration }
  } // end of mode collisionRecoveryMode
12
Obstacle Avoidance
  mode obAvoidanceMode() {
    entry enPt; exit exPt;
    read discrete bool collisionDetected;
    read RangeMap rangeMap;
    readWrite diff analog real x;
    readWrite diff analog real phi;
    diffEqn dObAvoidance { d(x) = computeSpeed(rangeMap); d(phi) = computeAngle(rangeMap) }
    inv invObAvoidance { collisionDetected == false }
    initTrans from obAvoidanceMode to obAvoidanceMode when true do { x = 0.0; phi = 0.0 }
  }
13
Explore
  mode exploreTopMode() {
    entry enPt;
    read discrete bool collisionDetected;
    read RangeMap rangeMap;
    private diff analog real recoveryTimer;
    mode obAvoidance = obAvoidanceMode()
    mode collisionRecovery = collisionRecoveryMode(recoveryDuration, c)
    initTrans from exploreTopMode to obAvoidance
      when true do { recoveryDuration = 10.0; c = 1.0 } // initialization
    trans OaToCr from obAvoidance.exPt to collisionRecovery.enPt
      when (collisionDetected == true) do {}
    trans CrToOa from collisionRecovery.exPt to obAvoidance.enPt
      when (recoveryTimer == recoveryDuration) do { recoveryTimer = 0.0 } // reset the timer
  }
14
(Diagram: the Explore mode as a hybrid system with inputs rangeMap and collisionDetected. Submode obAvoidance runs dynamics dObAvoidance (d(x) = k1 r, d(phi) = -k2); the collision event switches to submode collisionRecovery, which runs dRecovery (d(x) = -c, d(phi) = 0, d(recoveryTimer) = 1); the timeOut event switches back to obAvoidance.)
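The switching logic above can be sketched in ordinary discrete-time code. This is an illustrative Python sketch, not generated CHARON: a constant forward speed stands in for computeSpeed(rangeMap), collisions arrive at scripted times, and recoveryDuration = 10.0 and c = 1.0 follow the initialization in exploreTopMode.

```python
# Minimal sketch of the Explore automaton: obstacle avoidance with
# collision recovery, integrated with forward Euler.
def simulate_explore(collision_times, t_end=30.0, dt=0.1,
                     recovery_duration=10.0, c=1.0, speed=0.5):
    """collision_times: times at which a collision event is sensed."""
    mode, x, timer, t = "obAvoidance", 0.0, 0.0, 0.0
    trace = []
    while t < t_end:
        collided = any(abs(t - ct) < dt / 2 for ct in collision_times)
        if mode == "obAvoidance":
            x += speed * dt          # stand-in for d(x) = computeSpeed(rangeMap)
            if collided:             # guard: collisionDetected == true
                mode, timer = "collisionRecovery", 0.0
        else:
            x -= c * dt              # dRecovery: d(x) = -c (back away)
            timer += dt              # dRecovery: d(recoveryTimer) = 1
            if timer >= recovery_duration:   # timeOut guard
                mode, timer = "obAvoidance", 0.0
        t += dt
        trace.append((t, mode, x))
    return trace

trace = simulate_explore(collision_times=[5.0])
modes = [m for _, m, _ in trace]
```

Running it, the robot starts in obAvoidance, spends the ten-second recovery window in collisionRecovery after the scripted collision at t = 5, and returns to obAvoidance for the remainder of the run.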
15
Vision Based Control with Mobile Robots
- Parallel composition of software agents: obstacle avoidance, wall following
- Mode switching
- Multiple levels of abstraction of data from the sensors
16
Explore: Wall Following with Obstacles
17
Explore, Search, and Pursue
18
Multiagent Control
(Diagram: the Robot1 agent with a local diff analog timer. The motion behavior has submodes awTarget (dynamics dPlan) and atTarget (dynamics dStop), with the arrive transition taken when pos = target. A moving submode with dynamics dSteer applies omega = k * (theta - phi), and the position evolves as d(pos.x) = v * cos(phi), d(pos.y) = v * sin(phi). A sensing behavior maintains the estimates r1Est1, r1Est2, r2Est1, r2Est2, driven by a timer with d(timer) = 1 that is reset to 0 every 1/updateFreq.)
19
Multiagent Control
20
Modular Simulation
Goals:
- Simulation is efficient and accurate
- Integration of modes at different time scales
- Integration of agents at different time scales
Modes are simulated using local information:
- Submodes are regarded as black boxes
- Submodes are simulated independently of one another
Agents are simulated using local information:
- Agents are regarded as black boxes
- Agents are simulated independently of one another
21
Time Round of a Mode (Agent)
1. Get the integration time d and invariants from the supermode (or the scheduler).
2. While (t = 0; t <= d) do:
   - Simplify all invariants.
   - Predict the integration step dt based on d and the invariants.
   - Execute the time round of the active submode and get its state s and time elapsed e.
   - Integrate for time e and get the new state s.
   - Return s and t+e if any invariant was violated.
   - Increment t = t+e.
3. Return s and d.
22
Modular Simulation - Global Execution
1. Pick the agents with the minimum and second-minimum reached time.
2. Compute the time-round interval d for the minimum agent (A2 in the figure) so that its absolute time may exceed the time reached by the second-minimum agent (A1) by at most dt.
3. The time round may end before the interval d is consumed if the invariants of A2 are violated; the actual time increment is then some e < d.
4. Agent A2 executes an update round to synchronize its discrete variables with the analog ones.
5. The state of A2 becomes visible to the other agents.
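The steps above can be sketched as a small scheduler. This is a Python sketch under simplifying assumptions: each agent is a black box whose time round consumes at most the granted interval, the numerical integration itself is a stand-in, and the class and method names are illustrative rather than CHARON's API.

```python
# Sketch of the global execution loop: each agent advances its own
# clock by at most the granted interval d, possibly stopping early
# (e.g. on an invariant violation). The fixed lookahead dt bounds how
# far the minimum agent may overshoot the second-minimum agent.
class Agent:
    def __init__(self, name, rate):
        self.name, self.rate, self.time, self.state = name, rate, 0.0, 0.0

    def time_round(self, d):
        """Integrate locally for up to d; return the time actually consumed."""
        e = min(d, self.rate)      # this agent's preferred local step
        self.state += e            # stand-in for numerical integration
        self.time += e
        return e

def run(agents, t_end, dt=0.05):
    while min(a.time for a in agents) < t_end:
        agents.sort(key=lambda a: a.time)
        a_min, a_second = agents[0], agents[1]
        # Grant an interval that overshoots the second-minimum agent
        # by at most dt (step 2 above).
        d = (a_second.time + dt) - a_min.time
        a_min.time_round(d)
    return agents

agents = run([Agent("A1", 0.1), Agent("A2", 0.02), Agent("A3", 0.5)], t_end=1.0)
```

Because only the laggard agent advances in each round, the agents' clocks stay within a small window of each other throughout the run.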
23
Modular Multi-rate Simulation
Use a different time step for each component to exploit multiple time scales and increase efficiency.
- "Slowest-first" order of integration
- Coupling is accommodated by using interpolants for the slow variables
- Tight error bound: O(h^(m+1)), in terms of the step size h, the ratio of the largest to the smallest step size, and a coupling constant
(Plot: components x1, x2, x3 integrated over time with different step sizes.)
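The slowest-first scheme can be illustrated on a toy two-time-scale system. The dynamics below (a slowly decaying variable and a fast variable tracking it) are assumptions for illustration; the interpolation of the slow variable during the fast substeps follows the scheme above.

```python
# Slowest-first multirate integration sketch (forward Euler).
# Assumed toy system: slow variable s' = -0.1*s, fast variable
# f' = -10*(f - s). The slow variable is stepped first with step H;
# the fast variable then takes m substeps of size H/m, reading s
# through a linear interpolant.
def multirate_step(s, f, H, m):
    s_new = s + H * (-0.1 * s)            # 1. integrate the slow component first
    h = H / m
    for k in range(m):
        # 2. interpolate the slow variable at the substep's start time
        tau = (k * h) / H
        s_interp = (1 - tau) * s + tau * s_new
        f = f + h * (-10.0 * (f - s_interp))
    return s_new, f

s, f = 1.0, 0.0
for _ in range(100):                      # integrate to t = 5 with H = 0.05
    s, f = multirate_step(s, f, H=0.05, m=10)
```

The slow variable follows its exponential decay while the fast variable, stepped ten times more often, locks onto it through the interpolant rather than through repeated evaluations of the slow component.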
24
Simulation and Analysis of Hierarchical Systems with Mode Switching
- Modular simulation
- Automatic detection of events: mode-switching transitions, guards
- Synthesis of controllers: include models of uncertainty
- Sensor fusion: include models of noise
25
Traditional Model of Hierarchy
NASREM Architecture [Albus, 80]
Implementations: Demo III, NASA robotic systems
26
Event Detection
Given: the dynamics (with input u and state x(t)) and an output event function g(x).
We re-parameterize time by controlling the integration step size: using feedback linearization, we select our "speed" (step size) along the integral curves so that the output g(x(t)) converges to the event surface g(x) = 0.
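The idea can be sketched in a few lines: choose the integration step so that, to first order, g decreases by a fixed fraction per step, so the trajectory converges to the event surface without stepping over it. The dynamics, event function, and gain lam below are illustrative assumptions, not the talk's exact formulation.

```python
# Event-detection sketch: pick the step size h so that the event
# function g contracts geometrically along the flow, converging to
# g(x) = 0 instead of overshooting it.
def integrate_to_event(x, f, g, grad_g, lam=0.5, tol=1e-8, max_iter=200):
    t = 0.0
    for _ in range(max_iter):
        gx = g(x)
        if abs(gx) < tol:
            return x, t
        # To first order dg/dt = grad_g(x) * f(x); choosing
        # h = -lam * g / (dg/dt) makes g shrink by the factor (1 - lam).
        dgdt = grad_g(x) * f(x)
        h = -lam * gx / dgdt
        x = x + h * f(x)                  # forward Euler step of size h
        t += h
    return x, t

# Toy example: x' = 1, event surface g(x) = x - 2 = 0, starting at x = 0.
x_ev, t_ev = integrate_to_event(0.0, lambda x: 1.0,
                                lambda x: x - 2.0, lambda x: 1.0)
```

In this toy case the step sizes halve the remaining gap each iteration, so the integrator lands on the event at x = 2, t = 2 to within the tolerance.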
27
Hysteresis
(Diagram: a two-agent hysteresis example. The environment agent Env switches between modes inc (u = 1, d(x1) = u) and dec (u = -1) at the thresholds -a and a+2. The hysteresis agent Hyst produces y = 2u through modes up (dY, iUp, aUp), strPlus (dY, iStrP, aStrP), and strMinus (dY, iStrM, aStrM), switching at the thresholds a and -(a+2).)
28
Global versus Modular Simulation
Hysteresis example: 2 levels of hierarchy; the global state is two-dimensional. Significant potential for more complex systems.
29
Modular Simulation Error
30
Current Implementation Status
Work to date:
- CHARON semantics
- Parser for CHARON
- Internal representation
- Type checker
Current work:
- Modular simulation scheme
- Internal representation generator
(Diagram: the CHARON specification passes through the Parser and Type Checker to a Syntax Tree, then through the Internal Representation Generator to the Internal Representation, which feeds the Control Code Generator, Simulator Generator, and Model Checker.)
31
Reactive to Deliberative Schemes
A reactive scheme is a composition of:
- go to target
- collision avoidance
A deliberative scheme: a path preplanned around the nominal model.
Reactive schemes:
- robust and easy to implement
- may be limited in their ability to accomplish complex tasks
- may not compare favorably to recursive implementations of deliberative controllers
(Figure: nominal model of the environment with an obstacle.)
32
Toward a composition of reactive and deliberative decision making
u1: vector field specified by a reactive planner
u2: vector field specified by a deliberative planner
If u1 ∈ U and u2 ∈ U, then a·u1 + (1-a)·u2 ∈ U.
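A sketch of this closure property, assuming the admissible set U is a speed bound and taking simple goal-attraction and obstacle-repulsion fields as stand-ins for the two planners:

```python
import math

# Convex combination of reactive and deliberative vector fields:
# if |u1| <= U_MAX and |u2| <= U_MAX, then |a*u1 + (1-a)*u2| <= U_MAX
# by the triangle inequality, so the blend stays admissible.
U_MAX = 1.0

def clamp(u):
    n = math.hypot(*u)
    return u if n <= U_MAX else (u[0] / n * U_MAX, u[1] / n * U_MAX)

def u_deliberative(p, goal=(5.0, 0.0)):
    return clamp((goal[0] - p[0], goal[1] - p[1]))       # head toward the goal

def u_reactive(p, obstacle=(2.5, 0.0)):
    dx, dy = p[0] - obstacle[0], p[1] - obstacle[1]
    d2 = dx * dx + dy * dy + 1e-9
    return clamp((dx / d2, dy / d2))                     # repulsion from the obstacle

def blend(p, a):
    u1, u2 = u_reactive(p), u_deliberative(p)
    return (a * u1[0] + (1 - a) * u2[0], a * u1[1] + (1 - a) * u2[1])

u = blend((1.0, 0.5), a=0.3)
speed = math.hypot(*u)
```

Varying a between 0 and 1 slides the commanded velocity continuously from purely deliberative to purely reactive without ever leaving the admissible set.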
33
Composition of reactive and deliberative planners
Framework for decision making:
- U is the set of available control policies
- Y is the uncertainty set: uncertainty in the environment model, in the dynamics, and in localization
- Best decision under the worst uncertainty
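On finite candidate sets, the best-decision-under-worst-uncertainty rule is a plain min-max. The sketch below uses an illustrative quadratic cost, not the talk's cost function.

```python
# Min-max decision sketch: choose the policy u in U that minimizes the
# maximum cost over the uncertainty set Y.
def minimax(U, Y, J):
    best_u, best_worst = None, float("inf")
    for u in U:
        worst = max(J(u, y) for y in Y)   # worst-case cost of this policy
        if worst < best_worst:
            best_u, best_worst = u, worst
    return best_u, best_worst

# Toy cost J(u, y) = (u - y)^2 over u in {0, 1, 2, 3}, y in {0, 2}:
# the robust choice hedges between the two possible uncertainties.
u_star, worst_cost = minimax([0, 1, 2, 3], [0, 2],
                             lambda u, y: (u - y) ** 2)
```

Here the robust policy is u = 1, midway between the uncertainty realizations, with worst-case cost 1; either extreme choice would do worse in its bad case.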
34
Results
- Worst-case outcome, and better-than-worst-case outcomes
- Minimization weighting prior information against current information, resolving the discrepancy between the prior and current plans
- The closure property of the "basis behaviors" ensures robustness
- Requires a priori calculation of a roadmap
35
Detailed Analysis
(Plots: cost-function cross-sections for the min-max and max-min problems.)
A global saddle point does not exist because the solution is non-smooth.
36
More Results
(Figures: open-loop versus recursive implementations of the best-under-worst-case-uncertainty controller.)
37
Deliberative and Reactive Behaviors in a Dynamic Setting
The obstacle dynamics are known; the exact inputs are not.
(Figure: robot, moving obstacle, and target.)
38
Paradigm for Learning
From sensory information to the information state, to situation partitions, to modes, to the action space.
The hierarchical structure allows learning at several levels:
- Lowest level: parameter estimation within each mode; algorithms for extracting the information state (features, position and velocity, high-level descriptors)
- Intermediate level: select the best mode for a situation; determine the best partitioning of states for a given information state
- Advanced level: transfer knowledge (programs, behaviors) between robots and humans
Learning at any level forces changes at the others.
39
Reinforcement Learning and Robotics
Successful RL (Kaelbling et al., 96):
- low-dimensional, discrete state space
- hundreds of thousands of training runs necessary
- stochastic search required
Robotic systems:
- large, continuous state space
- a large number of training runs (e.g., 100,000) may not be practical
- stochastic search not desirable
The fundamental question we must ask is whether RL can be applied to robotics, given that previous successful implementations of RL seem directly incompatible with it.
40
Boundary Localized Reinforcement Learning
Our approach to robot control: noisy state space, deterministic modes.
Our approach to reinforcement learning:
- search only near the mode boundaries
- ignore most of the state space
- minimize stochastic search, or use no stochastic search at all during learning
Our goal is to apply reinforcement learning to high-dimensional robotics problems that have deterministic mode-switching controllers (i.e., the estimate of the current state deterministically defines the mode/action taken). To apply RL to such deterministic, high-dimensional problems (which have not previously been addressed successfully in the RL framework), we reformulate RL as a search localized to the mode boundaries, with minimal stochastic search or none at all. This greatly reduces the solution search space, making high-dimensional reinforcement learning feasible.
41
Mode Switching Controllers
(Diagram: the state space partitioned into regions; in each region a mode of operation executes action ai, and theta parameterizes the mode boundaries.)
Our definition of a mode-switching control system. Modes include: move away from obstacle, follow wall, follow leader, follow hallway, go toward goal. Theta defines the parameterization of the boundaries between modes in the state (information) space.
42
Reinforcement Learning for Mode Switching Controllers
Our framework for applying RL to mode-switching controllers:
1. Use prior knowledge to define an initial parameterization of the mode boundaries.
2. Use reinforcement feedback R from the robot's environment to modify these boundaries and obtain an "optimal" (locally optimal) parameterization.
43
Reinforcement Learning
Markov decision process; policy; reinforcement feedback from the environment: rt. Goal: modify the policy to maximize performance. Policy-gradient formulation.
Summary of our RL formulation as a Markov decision process (the action taken is a function only of the current state); the discounted-reward performance function is shown here.
1. A policy is a probability distribution over the actions taken in the current state (given the parameterization theta).
2. The environment supplies reinforcement feedback.
3. The robot's goal is to maximize reward (i.e., maximize positive reinforcement feedback) under some reward specification - discounted reward is one example, but others such as average reward are possible.
4. We use the policy-gradient formulation of RL, updating the theta parameters by differentiating the reward function and performing gradient ascent.
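The policy-gradient update can be illustrated on the smallest possible example: a two-action bandit with p(a=1) = sigmoid(theta). This toy is a sketch of the formulation only, not the controller from the talk.

```python
import math, random

# REINFORCE-style policy-gradient sketch. The policy is Bernoulli with
# p(a=1) = sigmoid(theta); the score function is
# d/dtheta log p(a|theta) = (a - p), and theta is updated by
# stochastic gradient ascent on the reward.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(steps=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        p = sigmoid(theta)
        a = 1 if rng.random() < p else 0
        r = 1.0 if a == 1 else 0.0          # assumed reward: action 1 pays off
        theta += lr * r * (a - p)           # gradient-ascent update
    return theta

theta = train()
p_final = sigmoid(theta)
```

Because the rewarded action's probability rises with every positive update, the policy becomes nearly deterministic, which is exactly the regime the boundary-localized results on the following slides exploit.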
44
Why Policy Gradient RL?
- Computation is linear in the number of parameters theta, which avoids the blow-up from discretization suffered by other RL methods.
- Generalization in the state space is implicitly defined by the parametric representation; generalization is important for high-dimensional problems.
We use the policy-gradient formulation because its computational cost is linear in the number of parameters used to specify the policy, whereas other RL methods grow exponentially with the number of dimensions in the state/information space.
45
Key Result #1
Any theta-parameterized probabilistic policy can be transformed into an approximately deterministic policy parameterized by theta: deterministic everywhere except near the mode boundaries.
There are three theoretical results. The first states that any probabilistic policy can be transformed into an approximately deterministic policy (i.e., deterministic everywhere except arbitrarily near the mode boundaries).
46
Key Result #2
Convergence to a locally optimal mode-switching policy is obtained by searching near the mode boundaries. All other regions of the state space can be ignored, which significantly reduces the search space.
The second theoretical result says that locally optimal mode-switching policies are obtained by stochastic search near the mode boundaries, so no stochastic search is needed in other regions of the state/information space. As a result, the search space is significantly smaller than in typical probabilistic RL.
47
Stochastic Search Localized to Mode Boundaries
(Diagram of the second theoretical result: state-space regions with stochastic search confined to bands around the mode boundaries.)
48
Key Result #3
Reinforcement learning can locally optimize deterministic mode-switching policies without any stochastic search if:
- the robot takes small steps, and
- the value of executing actions (Q) is smooth with respect to the state.
These conditions are met almost everywhere in typical robot applications.
The final theoretical result is that stochastic search is not necessary at all when the reward function is smooth in theta. This is significant because a basic premise of RL until now has been that it cannot be done without stochastic search. Eliminating stochastic search is valuable in robotics, where randomly chosen actions can lead to unsafe or expensive outcomes.
49
Deterministic Search at Mode Boundaries
(Diagram of the third theoretical result: state-space regions with deterministic search at the mode boundaries.)
50
Simulation
State space: robot position.
Boundary definitions: Gaussian centers and widths - 2 parameters per Gaussian for each dimension, 20 parameters in total.
2 types of modes: toward a goal, away from an obstacle.
Reward: +1 for reaching a goal, -1 for hitting an obstacle.
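A minimal sketch of such a Gaussian boundary parameterization, in one dimension with assumed centers and widths: each mode has a Gaussian activation over the position, the mode with the largest activation is executed, and the mode boundary is where the activations cross.

```python
import math

# Gaussian-parameterized mode boundaries: two parameters (center,
# width) per Gaussian per dimension, as on the slide. The centers and
# widths below are illustrative assumptions.
def gaussian(x, center, width):
    return math.exp(-((x - center) / width) ** 2)

def select_mode(x, params):
    """params: {mode_name: (center, width)}; pick the strongest activation."""
    return max(params, key=lambda m: gaussian(x, *params[m]))

params = {
    "toward_goal": (5.0, 3.0),          # broad pull toward the goal at x = 5
    "away_from_obstacle": (2.0, 0.5),   # narrow, strong near the obstacle at x = 2
}
# Sweep positions from 0 to 6 and record the selected mode.
modes = [select_mode(x / 10.0 * 6.0, params) for x in range(11)]
```

Shifting a Gaussian's center or width moves the crossover point, which is precisely the boundary the learner adjusts via reinforcement feedback.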
51
Convergence Results
Learning algorithm used: Action Transition Policy Gradient.
- Policy-gradient updates occur only at action transitions.
- Orders-of-magnitude faster convergence than other PG algorithms.
- The algorithm can also be used on entirely stochastic policies.
- Convergence is guaranteed under appropriate smoothness conditions.
A comparison of the number of passes through the environment (iterations) required for the three algorithms implemented:
1. Traditional probabilistic RL: stochastic search over the entire state space.
2. Stochastic Boundary Localized RL (BLRL): stochastic search limited to areas near the mode boundaries.
3. Deterministic BLRL: no stochastic search.
Notes: (a) traditional probabilistic RL is the slowest to converge; (b) stochastic BLRL is one order of magnitude faster; (c) deterministic BLRL is the fastest to converge; (d) all three algorithms converge to similar solutions, so there is no performance loss in using deterministic BLRL (as predicted by the theory).
52
Convergence in Higher Dimensions
(Plot: number of trials to convergence versus problem dimension, 2D projection. The initial stochastic policy requires the most trials, up to about 4500; the boundary-localized policy requires fewer; the learned deterministic policy requires the fewest.)
The robots learn not to run into walls or obstacles while minimizing the distance traveled to the goal (it takes about 5 to 10 runs to learn this). They start with an initial guess of how closely they can approach an obstacle without hitting it, and this distance is incrementally modified to trade off path length against collisions; learning stops when the robot never collides with obstacles while taking the shortest path. The current demo learns only one parameter, but it is now a simple matter to add more parameters (starting with the entire range map) and more modes (the sensor and actuator models are now working properly).
53
Current work: More Sophisticated Simulation
- Realistic models of sensors and actuators, including models of noise
- The control code used for the actual robots is also used in the simulator: this allows realistic mode switching to be modeled, and modes tested in simulation transfer directly to the real robots
- Learning can be done in simulation and then transferred to the real robots for further refinement
54
Status
Framework for learning:
- parameter adaptation using reinforcement reward (low-level learning)
- rules for selecting and switching between behaviors (intermediate level)
Preliminary numerical experiments are reassuring and are supported by theory.
Current work: more realistic simulation.
Future work: combine experiments with simulation.
55
Summary of Progress
- Software tools: on schedule
- Implementation on robot hardware: mixed
- Simulation: new results on modular simulation
- Control algorithms: reactive-to-deliberative algorithms and new learning algorithms; ahead of schedule