MPICH-V: Fault Tolerant MPI
Rachit Chawla
Outline
- Introduction
- Objectives
- Architecture
- Performance
- Conclusion
Fault Tolerant Techniques
- Transparency
  - User level: re-launch the application from the previous coherent snapshot
  - API level: error codes are returned, to be handled by the programmer (see the sketch after this list)
  - Communication library level: transparent fault tolerant communication layer
- Checkpoint co-ordination
  - Coordinated
  - Uncoordinated
- Message logging
  - Optimistic: all events are logged in volatile memory
  - Pessimistic: all events are logged on stable storage
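To make the API-level technique concrete, here is a minimal sketch in plain MPI C (not MPICH-V-specific): the default error handler, which aborts the job on failure, is replaced by MPI_ERRORS_RETURN so communication calls return error codes the programmer can inspect and handle.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* By default MPI aborts the job on a communication error; ask it to
     * return error codes instead so the application can react itself. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int payload = 42, rc;
    if (rank == 0) {
        rc = MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        rc = MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                      MPI_STATUS_IGNORE);
    } else {
        rc = MPI_SUCCESS;
    }

    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "rank %d: communication failed: %s\n", rank, msg);
        /* Application-level recovery (retry, roll back, clean shutdown)
         * would go here. */
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    MPI_Finalize();
    return 0;
}
```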
Checkpointing
- Coordinated (see the sketch after this list)
  - A coordinator initiates the checkpoint
  - No domino effect
  - Simplified rollback recovery
- Uncoordinated
  - Independent checkpoints
  - Possibility of a domino effect
  - Complex rollback recovery
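A minimal sketch of the coordinated variant, assuming a hypothetical write_checkpoint() helper: the coordinator's signal is approximated here by a barrier, after which every process saves its state, so the saved images form one coherent recovery line and rollback cannot cascade. A real protocol must also capture messages in flight across the synchronization point.

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical helper: serialize this process's state to stable storage.
 * Here only the iteration counter stands in for the application state. */
static void write_checkpoint(int rank, long step)
{
    char path[64];
    snprintf(path, sizeof path, "ckpt_rank%d.dat", rank);
    FILE *f = fopen(path, "wb");
    if (f) {
        fwrite(&step, sizeof step, 1, f);
        fclose(f);
    }
}

/* Coordinated checkpoint, sketched: every `interval` iterations all
 * processes synchronize and then checkpoint together, so all saved
 * images belong to the same consistent "wave". */
void maybe_checkpoint(long step, int interval)
{
    if (step % interval != 0)
        return;

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);   /* stand-in for the coordinator's signal */
    write_checkpoint(rank, step);
}
```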
Logging
- Piece-Wise Deterministic (PWD) assumption: for every non-deterministic event, store enough information in a determinant to replay it
- Non-deterministic events
  - Send/receive of a message
  - Software interrupts
  - System calls
- Replay the execution of events after the last checkpoint (see the sketch after this list)
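Since MPICH-V logs pessimistically (later slides), the PWD idea can be sketched as follows, with a hypothetical determinant_t type and log function: the determinant of every receive (source, tag, sequence number, payload) is forced to the log before the message is delivered, so a restarted process can replay its receptions in exactly the same order.

```c
#include <stdio.h>
#include <unistd.h>   /* fsync */

/* Determinant of a non-deterministic receive event (sketch, not MPICH-V code). */
typedef struct {
    int    source;   /* sending rank                 */
    int    tag;      /* message tag                  */
    long   seq;      /* per-receiver sequence number */
    size_t len;      /* payload length in bytes      */
} determinant_t;

/* Pessimistic logging: the determinant and payload are written and flushed
 * BEFORE the message is handed to the application, so a replay after a
 * crash observes exactly the same sequence of events. */
int log_determinant(FILE *stable_log, const determinant_t *d, const void *payload)
{
    if (fwrite(d, sizeof *d, 1, stable_log) != 1)
        return -1;
    if (fwrite(payload, 1, d->len, stable_log) != d->len)
        return -1;
    if (fflush(stable_log) != 0)
        return -1;
    return fsync(fileno(stable_log));   /* force the event to stable storage */
}
```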
Objectives
- Automatic fault tolerance
- Transparency (for programmer and user)
- Tolerate n faults (n = number of MPI processes)
- Scalable infrastructure/protocol
- No global synchronization
MPICH-V
- Extension of MPICH at the communication library level: implements all communication subroutines of MPICH
- Tolerant to the volatility of nodes
  - Node failure
  - Network failure
- Uncoordinated checkpointing: Checkpoint Servers
- Distributed pessimistic message logging: Channel Memories
Architecture
- Communication library: relink the application with “libmpichv”
- Run-time environment
  - Dispatcher
  - Channel Memories (CM)
  - Checkpoint Servers (CS)
  - Computing/communicating nodes
Big Picture
[Architecture diagram: Dispatcher, Channel Memory, Checkpoint Server and compute nodes connected over the network, with firewalls between domains.]
Overview
- Channel Memory
  - Dedicated nodes
  - Message tunneling
  - Message repository
- Home CM: to send a message, a node sends it to the receiver's home CM (see the sketch after this list)
- Distributed checkpointing/logging
  - Execution context: CS
  - Communication context: CM
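A minimal sketch of the home-CM idea, with hypothetical names (NUM_CM, cm_put() and tunneled_send() are not the real MPICH-V API): each rank is mapped to one Channel Memory, and a sender tunnels its message to the receiver's home CM instead of talking to the receiver directly.

```c
#include <stddef.h>

#define NUM_CM 4   /* number of Channel Memory nodes (assumption) */

/* Each rank has a fixed home CM; a simple modulo mapping is assumed here. */
static int home_cm(int rank)
{
    return rank % NUM_CM;
}

/* Hypothetical transport call: push one message into a CM's repository. */
int cm_put(int cm_id, int src_rank, int dst_rank, int tag,
           const void *buf, size_t len);

/* Sending never contacts the receiver: the message is PUT into the
 * receiver's home CM, where the receiver (or its restarted incarnation)
 * will later GET it. */
int tunneled_send(int src_rank, int dst_rank, int tag,
                  const void *buf, size_t len)
{
    return cm_put(home_cm(dst_rank), src_rank, dst_rank, tag, buf, len);
}
```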
Dispatcher (Stable)
- Initializes the execution
  - A stable, centralized Service Registry is started, providing services (CM, CS) to nodes
  - CM and CS are assigned in a round-robin fashion
  - Launches the instances of MPI processes on nodes
- Monitors the node state: alive signal, or time-out (see the sketch after this list)
- Reschedules the tasks of dead MPI process instances on available nodes
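A sketch of this failure detector, under stated assumptions (fixed timeout, array indexed by rank, hypothetical reschedule_rank() helper): each incoming "alive" message refreshes a timestamp, and a periodic sweep declares silent ranks dead and relaunches their tasks.

```c
#include <time.h>

#define MAX_RANKS         1024
#define ALIVE_TIMEOUT_SEC   60   /* assumption: silence threshold before declaring a node dead */

static time_t last_alive[MAX_RANKS];

/* Hypothetical helper: relaunch the MPI process instance with this rank
 * on an available node. */
void reschedule_rank(int rank);

/* Called whenever an "alive" message carrying this rank arrives
 * (including once at registration time). */
void on_alive(int rank)
{
    last_alive[rank] = time(NULL);
}

/* Periodic sweep: any rank silent for longer than the timeout is treated
 * as crashed and its task is rescheduled elsewhere. */
void check_timeouts(int num_ranks)
{
    time_t now = time(NULL);
    for (int rank = 0; rank < num_ranks; rank++) {
        if (last_alive[rank] != 0 && now - last_alive[rank] > ALIVE_TIMEOUT_SEC) {
            reschedule_rank(rank);
            last_alive[rank] = now;   /* avoid rescheduling the same rank on every sweep */
        }
    }
}
```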
Steps
- When a node executes
  - It contacts the Service Registry
  - It gets assigned a CM and a CS based on its rank
  - It periodically sends an “alive” signal, containing its rank, to the dispatcher
- On a failure
  - The dispatcher restarts its execution; the other processes are unaware of the failure
  - The CM allows a single connection per rank: if the faulty process reconnects, an error code is returned and it exits (see the sketch after this list)
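A sketch of the "single connection per rank" rule on the CM side, with hypothetical names and error code: the first incarnation of a rank to connect is accepted; a stale process that reconnects after its task has already been relaunched gets an error code back and is expected to exit.

```c
#include <stdbool.h>

#define MAX_RANKS 1024

/* Error code returned to a duplicate incarnation of a rank (assumption). */
#define CM_ERR_DUPLICATE_RANK (-1)

static bool connected[MAX_RANKS];   /* at most one live connection per rank */

/* Called by the Channel Memory when a node connects and announces its rank. */
int cm_register_connection(int rank)
{
    if (connected[rank])
        return CM_ERR_DUPLICATE_RANK;   /* stale process: it exits on this code */
    connected[rank] = true;
    return 0;
}

/* Called when the connection of a rank is torn down (crash, exit, timeout). */
void cm_unregister_connection(int rank)
{
    connected[rank] = false;
}
```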
Channel Memory (Stable)
- Logs every message
- Messages are sent/received via GET and PUT; GET and PUT are transactions
- FIFO order is maintained for each receiver (see the sketch after this list)
- On a restart, the node replays its communications using the CMs
[Diagram: nodes exchanging messages through a Channel Memory over the network via Get and Put operations.]
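A sketch of the per-receiver repository, using hypothetical types and function names: PUT appends to the receiver's FIFO log, GET hands out the next message in arrival order, and because delivered messages are kept rather than discarded, a restarted receiver can rewind and replay the same sequence.

```c
#include <stdlib.h>
#include <string.h>

/* One logged message held by the Channel Memory (sketch, not MPICH-V code). */
typedef struct cm_msg {
    int            src, tag;
    size_t         len;
    void          *payload;
    struct cm_msg *next;
} cm_msg_t;

/* Per-receiver FIFO: messages are handed out (GET) in the order they were
 * logged (PUT), and kept afterwards so a restarted receiver can replay them. */
typedef struct {
    cm_msg_t *head, *tail;   /* full log, in arrival order        */
    cm_msg_t *cursor;        /* next message to hand out on a GET */
} cm_queue_t;

/* PUT transaction: append a copy of the message to the receiver's log. */
int cm_queue_put(cm_queue_t *q, int src, int tag, const void *buf, size_t len)
{
    cm_msg_t *m = malloc(sizeof *m);
    if (!m)
        return -1;
    m->payload = malloc(len);
    if (!m->payload) {
        free(m);
        return -1;
    }
    memcpy(m->payload, buf, len);
    m->src = src; m->tag = tag; m->len = len; m->next = NULL;

    if (q->tail) q->tail->next = m; else q->head = m;
    q->tail = m;
    if (!q->cursor) q->cursor = m;   /* nothing pending: this is the next GET */
    return 0;
}

/* GET transaction: return the next logged message in FIFO order (or NULL). */
cm_msg_t *cm_queue_get(cm_queue_t *q)
{
    cm_msg_t *m = q->cursor;
    if (m)
        q->cursor = m->next;
    return m;
}

/* Replay after a restart: the receiver re-reads its log from the beginning. */
void cm_queue_rewind(cm_queue_t *q)
{
    q->cursor = q->head;
}
```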
Checkpoint Server (Stable)
- Checkpoints are stored on stable storage
- During execution, a node performs a checkpoint and sends the image to its CS
- On a restart (see the sketch after this list)
  - The dispatcher tells the node which CS to contact for the task's last checkpoint
  - The node contacts that CS with its rank
  - It gets the last checkpoint image back from the CS
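A sketch of the restart path, with hypothetical helpers (cs_fetch_latest_image() and restore_process_image() are placeholders, not the MPICH-V interface): the relaunched process contacts the CS indicated by the dispatcher, sends its rank, receives the latest checkpoint image and resumes from it; receives re-executed after that point are served from the Channel Memory log.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helpers -- placeholders, not the MPICH-V interface. */
void *cs_fetch_latest_image(int cs_conn, int rank, size_t *len); /* ask the CS for the last image */
void  restore_process_image(const void *image, size_t len);      /* resume from that saved state  */

/* Restart protocol for a relaunched MPI process instance:
 *   1. the dispatcher has told the node which CS holds the task's last
 *      checkpoint (cs_conn stands for that connection);
 *   2. the node sends its rank and gets the latest checkpoint image back;
 *   3. execution resumes from the image; receives re-executed after this
 *      point are replayed in order from the Channel Memory log. */
void restart_from_checkpoint(int cs_conn, int rank)
{
    size_t len = 0;
    void *image = cs_fetch_latest_image(cs_conn, rank, &len);
    if (!image) {
        fprintf(stderr, "rank %d: no checkpoint image available\n", rank);
        exit(EXIT_FAILURE);
    }
    restore_process_image(image, len);   /* does not return on success */
}
```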
Putting it all together
[Timeline figure: processes 0, 1 and 2 interacting with a CM and a CS on a pseudo time scale; checkpoint images are sent to the CS; after a crash, the process rolls back to its latest checkpoint. Worst condition: an in-transit message combined with a checkpoint.]
Performance Evaluation
- XtremWeb P2P platform
  - Dispatcher
  - Client: executes the parallel application
  - Workers: MPICH-V nodes, CMs and CSs
- 216 Pentium III 733 MHz PCs connected by Ethernet
- Node volatility is simulated by enforcing process crashes
- NAS BT benchmark: simulated Computational Fluid Dynamics application, parallel benchmark, significant communication + computation
Effects of CM on RTT
[Plot: RTT time (sec, mean over 100 measurements) vs. message size (0 to 384 kB) for raw P4 and ch_cm with 1 CM (in-core, out-of-core, out-of-core best). Crossing a CM costs roughly a factor of 2 in bandwidth: about 10.5 MB/s for P4 vs. 5.6 MB/s through a CM.]
Impact of Remote Checkpointing
[Bar chart: checkpoint RTT time (sec) for bt.W.4 (2 MB), bt.B.4 (21 MB), bt.A.4 (43 MB) and bt.A.1 (201 MB), comparing remote checkpointing over 100BaseT Ethernet with local disk; the remote overhead ranges from +2% to +28%.]
- RTT here is the time between reception of a checkpoint signal and the actual restart: fork, checkpoint, compress, transfer to the CS, way back, decompress, restart.
- The cost of a remote checkpoint is close to that of a local checkpoint (can be as low as +2%) because compression and transfer are overlapped.
Performance of Re-Execution
- The system can survive the crash of all MPI processes.
Execution Time vs. Faults
[Plot: total execution time (sec, roughly 610 to 1100) vs. number of faults (0 to 10), compared to the base execution without checkpointing and faults.]
- The checkpointing overhead is about 23%.
- With 10 faults (about 1 fault every 110 sec), performance is 68% of the fault-free execution.
Conclusion
- MPICH-V is a full-fledged fault tolerant MPI environment (library + runtime).
- It combines uncoordinated checkpointing with distributed pessimistic message logging.
- Its architecture comprises Channel Memories, Checkpoint Servers, a Dispatcher and compute nodes.