Using Run-Time Checking to Provide Safety and Progress for Distributed Cyber-Physical Systems
Stanley Bak, Fardin Abdi Taghi Abad, Zhenqi Huang, Marco Caccamo
Presenter: Renato Mancuso
Distributed Coordination
Interconnected systems that physically affect each other.
The state of each node is a function of the control inputs of other nodes, as determined by the system's connection graph.
Examples: water distribution systems, electrical grids, traffic control systems.
Communication: An Essential Component
Distributed systems rely on communication for:
1. Reaching the desired state (functionality)
2. Maintaining invariants (stability/safety)
…so what happens when communication is unreliable? Communication faults can lead to a violation of safety.
Limits of Distributed Coordination
Approach 1: Massive distributed sensor fusion and unsafe-state avoidance. Limitation: not exhaustive.
Approach 2: Use middleware that provides guarantees on communication and latency. Limitation: scalability.
[Figure labels: Middleware; Water Distribution System, Electrical Grids, Traffic Control System]
Image: “A Swarm of Nano Quadrotors”, UPENN, http://www.youtube.com/watch?v=YQIMGV5vtd4
Paper Goals
Goal 1: Examine the fundamental requirements for safety in distributed systems with unreliable communication.
Safety: a global invariant (for example, collisions are avoided).
Goal 2: Provide a mechanism for safe progress, if the communication works adequately well.
Progress: all distributed agents follow the same goal.
Image: “A Swarm of Nano Quadrotors”, UPENN, http://www.youtube.com/watch?v=YQIMGV5vtd4
Goal 1: Safety
Safety Theorem
A coordinating distributed system is safe under unreliable communication if and only if both:
Condition 1: The system is safe if no communication takes place.
Condition 2: For each message m that is received by any node, the system remains safe if no other messages are ever received after m.
[Figure: intuition for the theorem]
Goal 1: Safety
Runtime Checking
Note that Condition 2 is difficult to check ahead of time, since it is quantified over every message.
Proposed solution: To build a usable system with this result, we check this condition at runtime and drop messages that violate it.
…but what about progress?
Goal 1: Safety
Proposed Architecture
A command filter (safeguard) performs a safety test on each command (it checks Condition 2).
Safe commands pass.
Unsafe commands are filtered.
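The filtering step can be sketched in a few lines of Python. This is a toy one-dimensional illustration, not the implementation from the talk: the names `Interval` and `CommandFilter`, the safety margin, and the idea of describing each neighbor by a conservative interval it may occupy if it never receives another message are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi] on a 1-D lane (hypothetical model)."""
    lo: float
    hi: float

    def overlaps(self, other: "Interval") -> bool:
        return self.lo <= other.hi and other.lo <= self.hi


class CommandFilter:
    """Accepts a new set-point only if it is safe assuming no later message arrives."""

    def __init__(self, position: float, margin: float) -> None:
        self.position = position    # last known position of this vehicle
        self.margin = margin        # assumed safety margin around the vehicle
        self.set_point = position   # with no communication, hold position

    def swept_region(self, new_set_point: float) -> Interval:
        # The vehicle may be anywhere between its last known position and its
        # current set-point, and would then travel to the new set-point.
        points = (self.position, self.set_point, new_set_point)
        return Interval(min(points) - self.margin, max(points) + self.margin)

    def is_safe(self, new_set_point: float, neighbors: Iterable[Interval]) -> bool:
        swept = self.swept_region(new_set_point)
        return not any(swept.overlaps(region) for region in neighbors)

    def handle(self, new_set_point: float, neighbors: Iterable[Interval]) -> bool:
        """Return True if the command passed the filter, False if it was dropped."""
        if self.is_safe(new_set_point, list(neighbors)):
            self.set_point = new_set_point
            return True
        return False
```

In this sketch a dropped command leaves the previous set-point in force, so the vehicle keeps following a plan that has already passed the check.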
Goal 2: Progress
Safe Progress
Compatible actions: actions which all agents can take that are globally safe.
So: build a chain of compatible actions for global progress.
The rate of progress depends on the quality of the communication channel.
[Figure labels: set-point 𝑖−1, set-point 𝑖, set-point 𝑖+1, ball, trajectory]
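One way to picture a chain of compatible actions is the following toy Python sketch. It assumes a one-dimensional column of vehicles with fixed spacing, a hypothetical coordinator that only issues set-point 𝑖+1 once every vehicle has adopted set-point 𝑖, and independent message loss with a fixed probability. None of these names or parameters come from the talk. Consecutive set-points are kept close enough that any mix of vehicles at step 𝑖 and step 𝑖+1 still respects the minimum gap.

```python
import math
import random
from typing import List, Tuple


def build_chain(start: float, goal: float, spacing: float, min_gap: float) -> List[float]:
    """Split the move from start to goal into steps small enough to be compatible."""
    max_step = spacing - min_gap  # a lagging neighbor can close the gap by one step
    if max_step <= 0:
        raise ValueError("spacing must exceed min_gap")
    n = max(1, math.ceil(abs(goal - start) / max_step))
    return [start + (goal - start) * k / n for k in range(n + 1)]


def run_chain(chain: List[float], n_agents: int, spacing: float,
              loss: float, rng: random.Random) -> Tuple[int, float]:
    """Simulate the coordinator; return (rounds needed, smallest gap observed)."""
    if not 0.0 <= loss < 1.0:
        raise ValueError("loss must be in [0, 1)")
    index = [0] * n_agents  # set-point index adopted by each agent
    rounds = 0
    worst_gap = float("inf")
    while min(index) < len(chain) - 1:
        rounds += 1
        target = min(index) + 1  # never run more than one step ahead
        for a in range(n_agents):
            if index[a] < target and rng.random() >= loss:
                index[a] = target  # message delivered
        # Agent a sits at its adopted set-point plus its offset in the column.
        pos = [chain[index[a]] + a * spacing for a in range(n_agents)]
        for a in range(n_agents - 1):
            worst_gap = min(worst_gap, pos[a + 1] - pos[a])
    return rounds, worst_gap
```

With no loss the chain completes in one round per step. With loss, more rounds are needed, but the smallest observed gap in this model never drops below `min_gap`, which mirrors the claim that the rate of progress, not safety, depends on the channel.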
Example System
A flock of vehicles moves along a path.
The user can input “detour points” to redirect the flock.
Collisions must be avoided.
Detour points should be reached, communication permitting.
Non-Compatible Actions
A new waypoint for the flock is entered.
A collision may occur due to a communication fault.
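To make this failure mode concrete, here is a toy one-dimensional sketch in Python. It is not the flocking model from the talk: it assumes two vehicles in a column, each holding its last received waypoint, and the function name, spacing, and minimum gap are hypothetical.

```python
def gap_after_update(old_wp: float, new_wp: float, spacing: float,
                     rear_received: bool, front_received: bool) -> float:
    """Gap between front and rear vehicle once each settles at its last received waypoint."""
    rear = new_wp if rear_received else old_wp
    front = (new_wp if front_received else old_wp) + spacing
    return front - rear


SPACING = 3.0   # assumed nominal distance between the two vehicles
MIN_GAP = 1.0   # assumed minimum safe distance

# Both vehicles receive the new waypoint: the spacing is preserved.
gap_ok = gap_after_update(0.0, 10.0, SPACING, True, True)

# Only the rear vehicle receives it: it would drive through the front vehicle.
gap_lost = gap_after_update(0.0, 10.0, SPACING, True, False)

violates_safety = gap_lost < MIN_GAP
```

A single large waypoint change is therefore not a compatible action in this model, because its safety depends on every vehicle receiving the message.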
Compatible Actions
Iteratively approach the goal.
Compatible Actions: Robustness to Communication Failures
Tractor 1 did not receive the new path, but safety is maintained.
[Figure labels: new detour point entered by operator; desired final path for the flock; paths generated for all the followers; paths sent to followers; Tractor 1 did not receive the path; tractors 1, 2, 3]
Vehicle Flocking Application
Flocking system with StarL.
StarL code can be run on a Roomba flock or in a built-in simulator.
Communication effects can be simulated and evaluated.
Video: https://youtu.be/dIGU8OTfCh8
Evaluation
We measured the effect of packet loss and vehicle count on convergence time and on the number of messages sent.
With increasing loss ratio: convergence time grows quadratically.
With increasing vehicles: convergence time is constant; bandwidth grows linearly.
Future Extensions
Replace runtime reachability checks with ahead-of-time computation.
A progress framework where commands do not originate from a centralized coordinator.
Implementation on a large swarm of robots.
Thanks.