R-BATCH: Task Partitioning for Fault-tolerant Multiprocessor Real-Time Systems
Junsung Kim, Karthik Lakshmanan and Raj Rajkumar
Electrical and Computer Engineering, Carnegie Mellon University
Outline
- Motivation
- Goals and System Models
- R-BATCH: Task Allocations with Replicas
- Performance Evaluation
- Conclusion
Autonomous Vehicles: Background
- GM Chevy Tahoe named “Boss”
- Won the 2007 DARPA Urban Challenge
Autonomous Vehicles: Background
- Boss
  - Senses the environment
  - Fuses sensor data to form a model of the real world
  - Plans navigation paths
  - Actuates the steering wheel, brake, and accelerator
- Boss requires
  - Safety-critical operations
  - Timing guarantees
  - Robustness to harsh environments
Autonomous Vehicles: Architecture
- 0.5 million lines of code for autonomous driving support
- 10 dual-core processors + 50 embedded processors
Autonomous Vehicles: Capabilities
- 0.5 million lines of code for autonomous driving support
- 10 dual-core processors + 50 embedded processors
- Requires high computational capabilities with timeliness guarantees
  - Adding more processors
  - Using high-performance processors
Processor Reliability Trend
[Figure: failure rate versus log time (years in service), showing an infant mortality region (random, extrinsic) and a wear-out region (intrinsic); curves labeled "1989, 800nm", "2000, 130nm", and "32nm".]
Goals for Fault-Tolerance
- Handle permanent processor failures
  - Tolerate a given number of processor failures
- Avoid losing functionality by adding more resources in an affordable way
  - Hardware replication
  - Software replication
  - Re-execution of failed jobs
  - Lower quality of service of tasks
- Deal with the unpredictable nature of failures
  - Consider all possible scenarios?
System Model (1 of 2)
System Model (2 of 2)
- Task classifications
  - Hard recovery task: cannot miss its deadline even if a failure occurs
    e.g., automotive engine control
  - Soft recovery task: can be recovered in the next period
    e.g., navigation, chassis unit control
  - Best-effort recovery task: can be recovered if there is enough room after a failure
Hard Recovery Task
[Figure: timeline starting at time 0 for Processor 1 and Processor 2, marking "Failure occurred" and "Task recovered".]
Soft Recovery Task
[Figure: timeline starting at time 0 for Processor 1 and Processor 2, marking "Failure occurred" and "Task recovered".]
Task Replication
Hot Standby | Cold Standby
Can recover Hard Recovery task | Can recover Soft Recovery task
Running multiple copies of a feature | Dormant until activated
No timing penalty | Delayed recovery time
Utilization loss | No utilization loss without failures
Observations
- Hot Standby: the primary and the backups run at the same time
- Cold Standby: one Cold Standby can recover several tasks on different processors
- Shared system state is available in all processors
  - By using a network bus architecture
Hard Recovery Task with Hot Standby
[Figure: timeline starting at time 0 for Processor 1 and Processor 2, marking "Failure occurred" and "Task recovered via Hot Standby".]
Soft Recovery Task with Cold Standby
[Figure: timeline starting at time 0 for Processor 1 and Processor 2, marking "Failure occurred" and "Task recovered via Cold Standby".]
Example Scenarios
- With 5 tasks and 4 processors
- nP: Primary of task n
- nH: Hot Standby of task n
- nC: Cold Standby of task n
[Figure: three allocations of the primaries, Hot Standbys, and Cold Standbys of tasks 1 to 5 on processors P1 to P4: the initial allocation, the allocation after P3 failed, and the allocation after P1 failed.]
R-BATCH
- Reliable Bin-packing Algorithm for Tasks with Cold Standby and Hot Standby
- Reliable task allocation
  - Allocates Hot Standbys
  - Allocates Cold Standbys
Uniprocessor Schedulability
[Table comparing uniprocessor scheduling approaches; surviving entries: "More complex; misbehaves at higher U", "Lower utilization", "Practical".]
Bin-packing Problem
- Definition: the problem of packing a set of items into the fewest number of bins such that the total size in each bin does not exceed the bin capacity
- Items: utilizations of each task
- Bins: processors
- Then, given a set of tasks, how many bins (processors) do we need?
[Figure: tasks Ti, Tj, Tk, and Tm packed onto Processor P.]
The Classical Approach: Bin-packing
- Bin packing is used to allocate tasks to multiprocessor platforms
- Best-fit Decreasing (BFD) algorithm
  Step 1: Sort the objects in descending order of size
  Step 2: Sort the bins in descending order of consumed space
  Step 3: Fit the next object into the first sorted bin that fits
          If no bin fits, add a new bin to fit into
  Step 4: If objects remain, go to Step 2
  Step 5: Done
- Example: given a set of tasks {0.6, 0.3, 0.2}
[Figure: tasks (1, 0.6), (2, 0.3), and (3, 0.2) allocated to processors P1 to P4.]
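The BFD steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `best_fit_decreasing`, the bin capacity of 1.0, and the small tolerance `eps` for floating-point sums are all assumptions made for the example.

```python
def best_fit_decreasing(sizes, capacity=1.0, eps=1e-9):
    """Pack item sizes into bins with Best-fit Decreasing.

    capacity=1.0 and eps are assumptions of this sketch; a real system
    would use the schedulable utilization bound of its scheduler.
    """
    bins = []  # each bin is a list of the item sizes placed in it
    # Step 1: consider objects in descending order of size.
    for size in sorted(sizes, reverse=True):
        # Step 2: look at bins in descending order of consumed space.
        for b in sorted(bins, key=sum, reverse=True):
            # Step 3: use the first (fullest) bin that still fits.
            if sum(b) + size <= capacity + eps:
                b.append(size)
                break
        else:
            # No existing bin fits, so open a new one.
            bins.append([size])
    return bins


if __name__ == "__main__":
    print(best_fit_decreasing([0.6, 0.3, 0.2]))  # [[0.6, 0.3], [0.2]]
```

With a capacity of 1.0, the task set {0.6, 0.3, 0.2} needs two bins in this sketch, because 0.6 + 0.3 + 0.2 exceeds the capacity of a single bin.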
BFD with Placement Constraints
- We also have to deal with replicated tasks
- Under the placement constraint (BFD-P)
  - No two replicas of the same task can be on the same processor
  - Otherwise, a processor failure will take down both replicas
[Figure: primaries (1P, 0.6), (2P, 0.3), (3P, 0.2) and Hot Standbys (1H, 0.6), (2H, 0.3), (3H, 0.2) allocated to processors P1 to P4.]
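A minimal sketch of best-fit decreasing with the placement constraint follows. The function name `bfd_p`, the dictionary layout of a bin, the labels such as `1H1`, the capacity of 1.0, and the tie-breaking rule (first bin wins) are assumptions of this example rather than details taken from the slides.

```python
def bfd_p(utilizations, copies, capacity=1.0, eps=1e-9):
    """Best-fit decreasing where copies of one task never share a bin.

    `copies` counts the primary plus its replicas. capacity=1.0 is an
    assumption of this sketch.
    """
    # One item per copy: (utilization, task id, copy index).
    items = [(u, t, c)
             for t, u in enumerate(utilizations, start=1)
             for c in range(copies)]
    # All copies are sorted together by decreasing utilization.
    items.sort(key=lambda it: (-it[0], it[1], it[2]))

    bins = []  # each bin: load, set of task ids present, item labels
    for u, t, c in items:
        # A bin qualifies if it has room and holds no copy of task t.
        fits = [b for b in bins
                if t not in b["tasks"] and b["load"] + u <= capacity + eps]
        if fits:
            target = max(fits, key=lambda b: b["load"])  # fullest bin
        else:
            target = {"load": 0.0, "tasks": set(), "items": []}
            bins.append(target)
        target["load"] += u
        target["tasks"].add(t)
        target["items"].append(f"{t}P" if c == 0 else f"{t}H{c}")
    return bins


if __name__ == "__main__":
    for i, b in enumerate(bfd_p([0.6, 0.3, 0.2], copies=2), start=1):
        print(f"P{i}: {b['items']}")
```

Under these assumptions, the task set {0.6, 0.3, 0.2} with one Hot Standby per task occupies four bins: the two copies of task 3 cannot share a processor, and neither fits next to a 0.6 and a 0.3.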
Can BFD-P Be Improved?
- Given a set of tasks {0.6, 0.3, 0.2} with 2 replicas each
- By using BFD with the placement constraint:
  [Figure: 1P, 2P, 3P, 1H, 2H, and 3H allocated to processors P1 to P4.]
- We can however reduce the number of bins as follows:
  [Figure: the same primaries and Hot Standbys allocated to fewer processors.]
Reliable BFD (R-BFD)
- R-BFD algorithm
  Step 1: Sort tasks in decreasing order of utilization
  Step 2: Allocate each primary task to the bin that will have the smallest remaining space
  Step 3: Set i = 1
  Step 4: Allocate the i-th replica of each task to the bin that will have the smallest remaining space
  Step 5: Increment i and repeat Step 4 until all replicas are allocated
[Figure: primaries (1P, 0.6), (2P, 0.3), (3P, 0.2) and Hot Standbys (1H, 0.6), (2H, 0.3), (3H, 0.2) allocated to processors P1 to P4.]
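The Reliable BFD steps can be sketched as below. This is an illustrative reading of the slide, not the authors' code: the function name `rbfd`, the bin layout, the labels such as `1H1`, the capacity of 1.0, and the assumption that the placement constraint from BFD-P still applies to every copy are all choices made for this example.

```python
def rbfd(utilizations, copies, capacity=1.0, eps=1e-9):
    """Reliable BFD sketch: all primaries first, then replicas round by round.

    `copies` counts the primary plus its replicas. capacity=1.0 and the
    reuse of the BFD-P placement constraint are assumptions of this sketch.
    """
    # Step 1: task indices in decreasing order of utilization.
    order = sorted(range(len(utilizations)), key=lambda t: -utilizations[t])
    bins = []  # each bin: load, set of task indices present, item labels

    def place(task, label):
        u = utilizations[task]
        # Candidate bins have room and hold no other copy of this task.
        fits = [b for b in bins
                if task not in b["tasks"] and b["load"] + u <= capacity + eps]
        if fits:
            # Fullest fitting bin leaves the smallest remaining space.
            target = max(fits, key=lambda b: b["load"])
        else:
            target = {"load": 0.0, "tasks": set(), "items": []}
            bins.append(target)
        target["load"] += u
        target["tasks"].add(task)
        target["items"].append(label)

    # Step 2: allocate every primary.
    for t in order:
        place(t, f"{t + 1}P")
    # Steps 3 to 5: allocate the i-th replica of every task, for each i.
    for i in range(1, copies):
        for t in order:
            place(t, f"{t + 1}H{i}")
    return bins


if __name__ == "__main__":
    for i, b in enumerate(rbfd([0.6, 0.3, 0.2], copies=2), start=1):
        print(f"P{i}: {b['items']}")
```

Under these assumptions, the task set {0.6, 0.3, 0.2} with one Hot Standby per task fits in three bins. Placing all primaries before any replica lets the small primary 3P share a bin with the large replica of task 1, which is the kind of saving the slide describes.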
Save More Processors with Cold Standby
- Given a set of tasks {0.6, 0.3, 0.2} with 3 replicas each to tolerate 2 processor failures
- Instead of using two more processors, add an “empty” processor to hold a “virtual task”
[Figure: two allocations on processors P1 to P5; one adds a second set of Hot Standbys (1H, 2H, 3H), the other replaces that set with Cold Standbys (1C, 2C, 3C).]
Cold Standby with Virtual Task
- Virtual task: a guaranteed utilization reserving slack for recovering from failures via Cold Standby
- Generate virtual tasks
  Step 1: Create a new virtual task by selecting the task with the highest utilization across all processors that is not yet assigned to a virtual task
  Step 2: Compare the size of the virtual task with tasks on different processors, and check whether those tasks can be recovered by using the virtual task
  Step 3: Go to Step 1 if there are remaining tasks
- Example: the generated virtual task 1C (0.6) covers tasks 1, 2, and 3
[Figure: primaries and Hot Standbys of tasks 1 to 3 on processors P1 to P4, with Cold Standbys (1C, 0.6), (2C, 0.3), and (3C, 0.2) grouped under virtual task 1C.]
R-BATCH
- Reliable Bin-packing Algorithm for Tasks with Cold Standby and Hot Standby
  Step 1: Perform R-BFD with the primaries and Hot Standbys
  Step 2: Generate virtual tasks
  Step 3: Perform R-BFD with the virtual tasks
[Figure: primaries, Hot Standbys, and the virtual task 1C (covering 1C, 2C, and 3C) allocated to processors P1 to P4.]
Evaluation Environment
Performance Evaluation (R-BFD)
[Figure: ratios of saved processors (normalized to BFD-P) versus number of tasks; annotated value: 18%.]
Performance Evaluation (R-BATCH)
[Figure: ratios of saved processors (normalized to BFD-P) versus number of tasks; annotated value: 49%.]
Performance Evaluation
[Figure: two plots of ratios of saved processors (normalized to BFD-P).]
- For smaller task set sizes, R-BFD is more beneficial
- For larger task set sizes, R-BATCH is more beneficial
Back to Boss
- 20 periodic tasks for autonomous driving support
- By using R-BATCH
  - Can tolerate 5 failures with 10 dual-core processors
  - 35% saving compared to BFD-P
- Replication used per task
  - The primary
  - 1 Hot Standby
  - 4 Cold Standbys
Conclusion
- Many safety-critical real-time systems must also support redundancy for tolerating faults
- We defined recovery task models
  - Hard Recovery Task
  - Soft Recovery Task
  - Best-effort Recovery Task
- We used two types of recovery schemes
  - Hot Standby (for Hard Recovery Task)
  - Cold Standby (for Soft Recovery Task)
- We can tolerate a fixed number of (fail-stop) failures
  - R-BFD: 18% fewer processors with Hot Standby
  - R-BATCH: 49% fewer processors with Hot Standby and Cold Standby
    - Utilizes slack for additional tasks