1
Fault-Tolerant Programming Models and Computing Frameworks
PhD Final Examination, 07/15/2015
Mehmet Can Kurt
Department of Computer Science & Engineering
Advisor: Gagan Agrawal
2
Motivation
Significant transformation in hardware (multi-cores, GPUs, many-cores)
Decreasing MTBF: a failure every 3-26 minutes
Resilience is more important now than ever!
3
Types of Failures We Consider
Fail-stop failures
  checkpoint/restart: the size of checkpoints matters
  ex: checkpoint + restart + recomputation can amount to 65% of execution time for a large job
Soft errors
  ECC (cannot detect double bit flips)
  replication (low resource utilization)
Errors induced by programmers
  performance tuning efforts (Intel CnC)
  programmer specifications leading to errors
4
Goals
Address these failures in several contexts:
  big data processing platforms: data availability and ensuring load balance
  popular programming paradigms (SPMD): programming abstractions that expose core execution state and critical computations
  alternative programming paradigms: task graph execution
5
Contributions
Big data processing platforms
  A Fault-Tolerant Environment for Large-Scale Query Processing [HiPC12]
Parallel programming models
  DISC: A Domain-Interaction based Programming Model with Support for Heterogeneous Execution [SC14]
  Low-overhead fault-tolerance support using the DISC programming model [submitted to LCPC15]
  Fault-Tolerant Dynamic Task Graph Scheduling [SC14]
Runtime support for programmer-based performance tuning
  Memory-efficient Scheduling of Dynamic Task Graphs [to be submitted to PPoPP16]
6
Outline
Before candidacy
  DISC: a domain-interaction based programming model
After candidacy
  Low-overhead fault-tolerance support using DISC
  Fault-tolerant dynamic task graph scheduling
  Memory-efficient scheduling of dynamic task graphs
7
Application Development for HPC
Existing programming models designed for homogeneous settings (MPI, PGAS)
  explicit partitioning and communication
  no support for resilience
Similarities across applications
  iterative structure
  cells/particles
  interactions among cells/particles
8
Our Work
DISC: a high-level programming model and runtime
  notion of a domain and interactions between domain elements
  suitable for most classes of popular scientific applications
Abstractions to hide data distribution and communication
  captured through a domain-interaction API
Key features:
  automatic partitioning and communication
  heterogeneous execution support with work redistribution
  automated resilient execution
9
DISC Basics
Domain and subdomain
Interaction among domain elements
User provides information about the domain (a sketch follows below):
  type of domain
  number of dimensions and boundaries
  type of interaction: grid based, radius based, or an explicit list
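To make this concrete, here is a minimal C++ sketch of the kind of information a user declares; the type and field names (DomainSpec, InteractionType, and so on) are entirely hypothetical and do not reproduce the actual DISC interface, which is not shown on the slides.

    // Minimal sketch (not the actual DISC API): hypothetical types illustrating
    // the information a user supplies -- domain type, dimensions/boundaries,
    // and interaction type.
    enum class DomainType { Grid, PointSet };
    enum class InteractionType { GridBased, RadiusBased, ExplicitList };

    struct DomainSpec {
      DomainType type;
      int num_dims;
      double lower[3], upper[3];   // boundaries per dimension
      InteractionType interaction;
      double radius;               // used only for radius-based interaction
    };

    int main() {
      // A 3D particle domain with radius-based interactions (e.g., molecular dynamics).
      DomainSpec md{DomainType::PointSet, 3,
                    {0.0, 0.0, 0.0}, {64.0, 64.0, 64.0},
                    InteractionType::RadiusBased, 2.5};
      // The runtime would partition this domain into subdomains and derive
      // communication (ghost regions) from the declared interaction type.
      (void)md;
      return 0;
    }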
10
compute-function and computation-space
[Figure: Molecular Dynamics, single iteration: (x, y, z, vx, vy, vz) -> (x', y', z', vx', vy', vz'), update coordinates and velocities]
compute-function (a sketch follows below)
  calculates new values for domain elements
  invoked by the runtime at each iteration
computation-space
  maintains the updates on domain elements
  leverages automatic repartitioning
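As a rough illustration (hypothetical signatures, not the actual DISC interface), a compute-function for the molecular dynamics example could look like the following, writing updated element values into the computation-space:

    #include <cstddef>
    #include <vector>

    // Hypothetical element layout for the molecular dynamics example.
    struct Particle { double x, y, z, vx, vy, vz; };

    // Sketch of a user-written compute-function: given the current values of
    // the local domain elements (and, in the real model, their interacting
    // neighbors), it writes updated values into the computation-space.
    void compute_function(const std::vector<Particle>& current,
                          std::vector<Particle>& computation_space,
                          double dt) {
      for (std::size_t i = 0; i < current.size(); ++i) {
        Particle p = current[i];
        // Forces from interacting neighbors would be accumulated here; omitted.
        p.x += p.vx * dt;
        p.y += p.vy * dt;
        p.z += p.vz * dt;
        computation_space[i] = p;   // progress on each element lives here
      }
    }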
11
Work Redistribution for Heterogeneity
Main idea: shrinking or expanding a subdomain changes a processor's workload.
How can we use unit-processing times for resizing?
1D case: make subdomain sizes inversely proportional to the current unit-processing time, i.e., proportional to processing power (a sketch follows below)
2D/3D case: express the problem as a non-linear optimization problem (AMPL to model, MINOS to solve)
  Example: P1 is faster than the remaining processors, so we want to expand its subdomain; however, the actions we take affect other subdomains (increasing xr1 also expands P4's subdomain).
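A minimal sketch of the 1D rule as stated above, assuming per-processor unit processing times are available; resize_1d is a made-up helper name and the rounding policy is my own choice, not taken from the thesis.

    #include <cstddef>
    #include <vector>

    // Give each processor a share of the n_elems domain elements proportional
    // to its measured processing power, i.e., inversely proportional to its
    // unit processing time. Assumes a non-empty unit_time vector.
    std::vector<long> resize_1d(const std::vector<double>& unit_time, long n_elems) {
      std::vector<double> power(unit_time.size());
      double total = 0.0;
      for (std::size_t i = 0; i < unit_time.size(); ++i) {
        power[i] = 1.0 / unit_time[i];
        total += power[i];
      }
      std::vector<long> share(unit_time.size());
      long assigned = 0;
      for (std::size_t i = 0; i < unit_time.size(); ++i) {
        share[i] = static_cast<long>(n_elems * power[i] / total);
        assigned += share[i];
      }
      share.back() += n_elems - assigned;  // give rounding leftovers to the last rank
      return share;
    }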
12
Implementation: Putting it Together
[Figure: how the @application and @runtime components fit together]
13
Outline
Before candidacy
  DISC: a domain-interaction based programming model
After candidacy
  Low-overhead fault-tolerance support using DISC
  Fault-tolerant dynamic task graph scheduling
  Memory-efficient scheduling of dynamic task graphs
14
Fault-Tolerance Support
[Figure: Molecular Dynamics, single iteration: (x, y, z, vx, vy, vz) -> (x', y', z', vx', vy', vz'), update coordinates and velocities]
Observations:
  the computation-space captures the progress made on each domain element (the core execution state)
  soft errors in a compute-function eventually corrupt the computation-space (the most critical computations)
15
Automated Application-Level Checkpointing
What? computation-space objects
When? at the end of an iteration
Recovery? load the checkpoint files and restart the execution; ability to restart with a smaller number of nodes
Our fault-tolerance approach is based on checkpointing. As in any checkpointing solution, there are two important questions:
  When do we initiate a checkpoint? (avoid the synchronization and message-logging overheads of coordinated and uncoordinated checkpointing schemes)
  Which data structures should be checkpointed? (to reduce checkpoint costs, the checkpointed state should be as small as possible)
Additional meta-data (a sketch follows below):
  iteration_no: upon restart, tells the runtime where the computation should continue
  subdomain boundaries: facilitate a different partitioning upon restart (fewer or more nodes)
[Figure: sample checkpoint file for a molecular dynamics application]
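A hypothetical sketch of what such an application-level checkpoint could contain; the file layout and the write_checkpoint name are assumptions for illustration, not the thesis implementation.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct Particle { double x, y, z, vx, vy, vz; };

    // Sketch of an application-level checkpoint (hypothetical layout): the
    // iteration number, the local subdomain boundaries, and the contents of
    // the computation-space.
    bool write_checkpoint(const char* path, int iteration_no,
                          const double lower[3], const double upper[3],
                          const std::vector<Particle>& computation_space) {
      FILE* f = std::fopen(path, "wb");
      if (!f) return false;
      std::fwrite(&iteration_no, sizeof(iteration_no), 1, f);
      std::fwrite(lower, sizeof(double), 3, f);
      std::fwrite(upper, sizeof(double), 3, f);
      std::size_t n = computation_space.size();
      std::fwrite(&n, sizeof(n), 1, f);
      std::fwrite(computation_space.data(), sizeof(Particle), n, f);
      return std::fclose(f) == 0;
    }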
16
Replication of compute-functions
Assumptions and limitations:
  compute-functions are side-effect free
  targeted soft errors: errors in combinational logic (register values, ALUs, pipeline latches, ...)
  control variables are protected by other means
17
Replication of compute-functions: Cache Utilization Improvements
Baseline: a separate checksum pass over the computation-space (large number of cache misses!):

    @compute_function
    for (int i = 0; i < NATOMS; i++) {
      double value = ...;
      computation_space[i] = value;
    }
    @compute_checksum
    for (int i = 0; i < NATOMS; i++)
      update_chksum(chksum, computation_space[i]);

On-the-fly checksum calculation: no explicit checksum calculation step:

    @compute_function
    for (int i = 0; i < NATOMS; i++) {
      double value = ...;
      computation_space[i] = value;
      update_chksum(chksum, computation_space[i]);
    }
18
Replication of compute-functions: Cache Utilization Improvements
No computation-space for the replica thread.

Replica with its own computation-space (extra memory traffic):

    @replica thread compute_function
    for (int i = 0; i < NATOMS; i++) {
      double value = ...;
      computation_space[i] = value;
      update_chksum(chksum, computation_space[i]);
    }

Replica without a computation-space (checksum computed directly on the value):

    @replica thread compute_function
    for (int i = 0; i < NATOMS; i++) {
      double value = ...;
      update_chksum(chksum, value);
    }
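For illustration, a sequential sketch of how the original and replica checksums could be compared to detect a soft error; the checksum function, the stand-in workload, and the run_with_replication name are all made up. The slides describe an OpenMP-based, thread-pinned implementation whose details are not shown.

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Toy checksum used for illustration only (XOR of the value's bit pattern).
    static inline void update_chksum(std::uint64_t& chksum, double v) {
      std::uint64_t bits;
      std::memcpy(&bits, &v, sizeof(bits));
      chksum ^= bits;
    }

    // Run the compute work twice (original pass, then a replica pass) and
    // compare checksums. A mismatch signals a soft error in the
    // compute-function; the iteration can then be re-executed or restored
    // from a checkpoint.
    bool run_with_replication(std::vector<double>& computation_space) {
      std::uint64_t chksum_orig = 0, chksum_repl = 0;
      for (std::size_t i = 0; i < computation_space.size(); ++i) {
        double value = static_cast<double>(i) * 0.5;   // stand-in for real work
        computation_space[i] = value;                  // only the original writes
        update_chksum(chksum_orig, computation_space[i]);
      }
      for (std::size_t i = 0; i < computation_space.size(); ++i) {
        double value = static_cast<double>(i) * 0.5;   // replica repeats the work
        update_chksum(chksum_repl, value);             // no replica computation-space
      }
      return chksum_orig == chksum_repl;               // false => soft error detected
    }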
19
Experiments - Checkpointing
Platform: two quad-core 2.53 GHz Intel Xeon processors with 12 GB RAM
Comparison with MPI (BLCR as the system-level checkpointing library)
Applications: Stencil (Jacobi, Sobel), Unstructured Grid (Euler), Molecular Dynamics (MiniMD)
  4 applications corresponding to three different computation patterns
20
Experiments - Checkpointing
Compare the performance of the DISC model with pure MPI implementations under checkpointing (BLCR as the system-level checkpointing library).
[Figure: white portions show normal-execution time; the gray area above each bar is time spent on checkpointing; annotated improvements of 2.3x (Jacobi) and 12.7x (MiniMD)]
Jacobi: 400 million elements for 1000 iterations; checkpoint frequency of 250 iterations; checkpoint size 6 GB (MPI) vs. 3 GB (DISC)
  about 25% overhead during normal execution (the MPI implementation swaps two matrices at each iteration, whereas in DISC the programmer copies the local computation-space into a runtime-managed buffer)
MiniMD: 4 million atoms for 1000 iterations; checkpoint frequency of 100 iterations; checkpoint size 2 GB (MPI) vs. 192 MB (DISC); checkpoint files are smaller since the memory footprint is small
21
Experiments - Replication
Platform: Intel Xeon Phi 7110P many-core processor; 61 cores running 244 hardware threads
Replication: compute-function replication through OpenMP; original and replica threads pinned to the same core; core 0 left for the OS
4 applications corresponding to three different computation patterns
22
Experiments - Replication
Configurations: no rep (no replication), rep (plain replication), rep+ofc (replication + on-the-fly checksum), rep+ofc+ncs (replication + on-the-fly checksum + no replica computation-space)
[Figure: replication overheads for Jacobi (118%, 44%, 33%) and MiniMD (13%, 15%, 9%) under rep, rep+ofc and rep+ofc+ncs, respectively]
23
Outline
Before candidacy
  DISC: a domain-interaction based programming model
After candidacy
  Low-overhead fault-tolerance support using DISC
  Fault-tolerant dynamic task graph scheduling
  Memory-efficient scheduling of dynamic task graphs
24
Background: Task Graph Execution
Representation as a DAG: vertices (tasks), edges (dependences)
Main scheduling rule: a task can execute only after all of its predecessors have completed
Improved scalability: asynchronous execution, load balance via work stealing
[Figure: example task graph with tasks A-E]
25
Failure Model
Task graph scheduling in the presence of detectable soft errors
Recover corrupted data blocks and task descriptors
Assumptions:
  existence of an error detector (ECC, symptom-based detectors, application-level assertions)
  recovery upon observation
  the logic for task graph creation is resilient (through user-provided functions)
26
Recovery Challenges
[Figure: example task graph; node states: Waiting, Completed, Executing, Failed]
D fails right after its computation
  re-compute D (only once), restart B and C
Further complications if data blocks are reused
  C overwrites E: re-compute E (only once)
Minimum effect on normal scheduling
27
Fault-Tolerant Scheduling
Developed on NABBIT* (a task graph scheduler using work stealing)
  augmented with additional routines
  optimality properties maintained
Recovery from an arbitrary number of soft failures
  no redundant execution or checkpoint/restart
  selective task re-execution
  negligible overheads for a small constant number of faults
* IPDPS'10
28
Scheduling Without Failures
B C’s Task Descriptor join: notifyArray: status: db: Traverse predecessors A.status is “Computed”, decrement C.join B.status is “Visited”, enqueue C to B.notifyArray Successors enqueued in notifyArray Compute task when join is 0 Notify successors in notifyArray 1 2 number of outstanding predecessors successors to notify {D} { } C D execution status at the moment Computed Visited pointer to output null data A
29
Fault-Tolerant Scheduling: Properties
Non-collective and selective recovery
  without interfering with other threads
  re-execute only the impacted portion of the task graph
[Figure: example task graph distributed across thread1, thread2 and thread3]
30
Fault-Tolerant Scheduling: Recovery
Failures can be handled at any stage of execution
  enclosure with try-catch blocks
  no recovery for non-observed failures
[Figure: three cases. Predecessor Failure (failed B observed during traversal or during computation: recover B); Self Failure (C fails during its computation: recover C); Successor Failure (failure observed during notification: recover E)]
31
Fault-Tolerant Scheduling: Recovery
Meta-data of a failed task is correctly recovered
  treat the failed task as new (no backup & restore)
  replace the failed task descriptor
  the recovering task traverses its predecessors, computes, and notifies
[Figure: B's failed descriptor (join: 0, notifyArray: {C, D}, status: Visited, db: null) is replaced by a fresh descriptor B' (join: 1, notifyArray: {}, status: Visited, db: null)]
32
Fault-Tolerant Scheduling: Key Guarantees
Guarantee 1: the join counter of a task descriptor is decremented exactly once per predecessor.
  Problem: B recovers and notifies D again, so D executes prematurely (D's join is 1: already notified by B, still waiting for C).
  Solution: keep track of notifications (a sketch follows below).
[Figure: example task graph with tasks A-E]
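One plausible way to implement the notification tracking, shown as a sketch; the notified_by set and notify helper are my own illustration, not necessarily how the thesis records notifications.

    #include <unordered_set>
    #include <vector>

    struct Task {
      int join = 0;
      std::unordered_set<const Task*> notified_by;  // predecessors already counted
      std::vector<Task*> notifyArray;
    };

    // Returns true if the notification was applied (first time from this
    // predecessor), false if it was a duplicate caused by recovery.
    bool notify(Task& successor, const Task& predecessor) {
      if (!successor.notified_by.insert(&predecessor).second)
        return false;          // already notified by this predecessor: ignore
      --successor.join;        // decremented exactly once per predecessor
      return true;
    }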
33
Fault-Tolerant Scheduling: Key Guarantees
Guarantee 2: every task waiting on a predecessor is notified.
  Problem: the execution hangs if tasks already enqueued in the failed descriptor are never notified (B's notifyArray was {C, D}; the fresh B' starts with an empty notifyArray, while C has join 1 and D has join 2).
  Solution: re-construct the notifyArray.
[Figure: example task graph with tasks A-E and descriptors for B and B']
34
Fault-Tolerant Scheduling: Key Guarantees
Guarantee 3: each failure is recovered at most once.
  Problem: both C and D observe the failure, leading to a separate recovery by each observer.
  Solution: keep track of initiated recoveries.
[Figure: example task graph with tasks A-E]
35
Fault-Tolerant Scheduling: Key Guarantees
Guarantee 4: overwritten data blocks are distinguished and handled correctly.
  Did D start overwriting C's data block? If not, only re-compute D; otherwise, re-compute C, B and A as well.
  Treat overwritten data blocks as failed.
[Figure: tasks A-D producing versions v=0, v=1 and v=2 of a reused data block]
36
Experiments
Platform: four 12-core AMD Opteron 2.3 GHz processors with 256 GB memory
  only 44 of the 48 cores used
  results are the arithmetic mean (with standard deviation) of 10 runs
Applications: LCS, Smith-Waterman, Floyd-Warshall, LU and Cholesky
37
Overheads without Failures
[Figure: results for LCS and SW]
38
Overheads without Failures
[Figure: results for FW (10-15% overhead at 44 cores)]
39
Overheads with Failures
Amount of work lost: a constant number of tasks (512), or a percentage of total work (2%, 5%)
Failure time: before compute or after compute
Task type: tasks which produce a data block's 0th (v=0), last (v=last), or random (v=rand) version
40
Overheads with Failures (512 re-executions)
Overheads are negligible in the "before compute" scenarios and small in the "after compute" scenarios
No overhead with 1, 8 and 64 task re-executions
41
Overheads with Failures (2% and 5%)
Overheads are proportional to the amount of work lost.
[Figure: annotated overheads of 3.6% and 8.2%]
42
Scalability Analysis (5% task re-executions, varying number of cores)
Re-execution chains can lead to a lack of concurrency.
Overheads do not exceed 6.5% in most cases.
43
Outline
Before candidacy
  DISC: a domain-interaction based programming model
After candidacy
  Low-overhead fault-tolerance support using DISC
  Fault-tolerant dynamic task graph scheduling
  Memory-efficient scheduling of dynamic task graphs
44
Motivation
Efficient implementations via performance tuning
  ex: Intel CnC tuning API (task prioritization, affinity control, ...)
Memory management
  single-assignment: runs out of memory
  use-count based garbage collection: not always available
Our approach: recycling
  recycle the memory assigned to data blocks across tasks
  dictated by user-provided store-recycling functions, e.g. Recycle(B) = A (a sketch follows below)
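A minimal sketch of what a user-provided recycling function might look like for a tiled computation; BlockId and the version-based rule are hypothetical, chosen only to illustrate the Recycle(B) = A idea, and do not reproduce the actual tuner API.

    #include <optional>

    // Hypothetical identifier for a data block produced by a task: the block's
    // coordinates in a tiled iteration space plus its version.
    struct BlockId {
      int i, j;       // tile coordinates
      int version;    // which version of the tile (for reused blocks)
    };

    // Sketch of a user-provided recycling function: "the block produced at
    // version v reuses the memory of version v-1 of the same tile", i.e.,
    // Recycle(B) = A when A is B's previous version.
    std::optional<BlockId> recycle(const BlockId& b) {
      if (b.version == 0)
        return std::nullopt;                 // first version: allocate fresh memory
      return BlockId{b.i, b.j, b.version - 1};
    }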
45
Recycling Challenges
Invalid recycling specifications from programmers
  too early recycling (e.g., Recycle(B) = D)
  concurrent recycling (e.g., Recycle(B) = A and Recycle(C) = A)
  not valid for every problem instance (e.g., Recycle(B) = A: what if there is a new task G depending on A?)
Determination of the most efficient recycling functions
  efficient representation
  correctness for every schedule and problem instance
46
Overview of Our Approach
[Figure: two phases. Verification on a representative smaller problem instance; production on the actual problem instance]
47
Verification – Recycling Constraints
Task A can recycle task B if both B and all of B's uses causally precede A
  ensures no premature recycling
Two tasks A and B can recycle the same task C only if A can recycle B or B can recycle A
  ensures no concurrent recycling
Track causality relations between tasks via vector clocks (a sketch follows below)
A correct recycling function provides a valid recycling for each task
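A sketch of the vector-clock causality test behind these constraints; happens_before and can_recycle are illustrative names, and the clock layout (one counter per thread, equal lengths) is an assumption, not taken from the thesis.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    using VectorClock = std::vector<unsigned>;

    // B "happens before" A if B's clock is component-wise <= A's clock and
    // strictly smaller in at least one component. Assumes equal-length clocks.
    bool happens_before(const VectorClock& b, const VectorClock& a) {
      bool strictly_smaller = false;
      for (std::size_t i = 0; i < b.size(); ++i) {
        if (b[i] > a[i]) return false;
        if (b[i] < a[i]) strictly_smaller = true;
      }
      return strictly_smaller;
    }

    // Constraint 1 (no premature recycling): A may recycle B only if B and
    // every use of B causally precede A.
    bool can_recycle(const VectorClock& a,
                     const VectorClock& b,
                     const std::vector<VectorClock>& uses_of_b) {
      if (!happens_before(b, a)) return false;
      return std::all_of(uses_of_b.begin(), uses_of_b.end(),
                         [&](const VectorClock& u) { return happens_before(u, a); });
    }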
48
Verification – Auto Exploration
Recycling candidates: immediate and transitive predecessors
User provides a dependence structure for each task class (a sketch follows below):
  number of predecessors
  a function to reach each predecessor
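A hypothetical encoding of such a dependence structure for one task class, using a 2D wavefront (Smith-Waterman-like) pattern as the example; all names are made up for illustration.

    #include <functional>

    struct TaskCoord { int i, j; };

    // A dependence structure for one task class: how many predecessors a task
    // has and a function that maps a task's coordinates to the coordinates of
    // its k-th predecessor.
    struct DependenceStructure {
      int num_predecessors;
      std::function<TaskCoord(const TaskCoord&, int)> predecessor;
    };

    // Example: each wavefront task (i, j) depends on (i-1, j) and (i, j-1).
    // The auto-exploration phase can walk these functions to enumerate
    // immediate and transitive predecessors as recycling candidates.
    DependenceStructure wavefront_deps() {
      return {2, [](const TaskCoord& t, int k) {
                return k == 0 ? TaskCoord{t.i - 1, t.j} : TaskCoord{t.i, t.j - 1};
              }};
    }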
49
Production
No vector clocks, no verification: minimal overhead
Guarantees:
  no concurrent recycling
  a data block recycled too early is recomputed through re-execution
50
Experiments
Platform: Intel Xeon Phi 7110P many-core processor; 61 cores running 244 hardware threads; 8 GB total memory
Applications: LU, Cholesky, Floyd-Warshall (FW), Smith-Waterman (SW), Rician Denoising, Heat2D
51
Comparison between Single-Assignment, Recycling and Use-Count Garbage-Collection
  single-assignment degrades with larger thread counts
  no significant overheads for auto-recycle (<1%)
  single-assignment runs out of memory on the large problem instance
[Figure: execution time (sec) for Cholesky on a moderate and a large problem instance]
52
Recycling Function Verification Cost
[Figure: verification cost for LU, FW and Rician; 3-hops: more than ... functions]
53
Re-execution Overheads
<1% for most benchmarks up to 61 threads
54
Re-execution Overheads
Number of evaluated incorrect functions: LCS: 4, LU: 1424, Cholesky: 228, FW: 1424, SW: 30, Heat2D: 25, Rician: 1500
[Figure: annotated percentages of 100%, 84%, 100%, 100%, 73%]
55
Future Work
Extending the DISC programming model
  asynchronous execution through data-flow features
  more scientific communication patterns
Fault-tolerant dynamic task graph scheduling
  targeting distributed-memory architectures
  soft error detection for a more holistic approach
  interplay between task re-execution and checkpointing
56
EXTRA SLIDES