On the Feasibility of Incremental Checkpointing for Scientific Computing

Jose Carlos Sancho
with Fabrizio Petrini, Greg Johnson, Juan Fernandez and Eitan Frachtenberg

Performance and Architectures Lab (PAL)
Computer and Computational Sciences Division, Los Alamos National Laboratory
Talk Overview
- Goal
- Fault-tolerance for Scientific Computing
- Methodology
- Characterization of Scientific Applications
- Performance Evaluation of Incremental Checkpointing
- Concluding Remarks
Goal
Prove the feasibility of incremental checkpointing that is:
- Frequent
- Automatic
- User-transparent
- Requires no changes to the application
- Requires no special hardware support
Large Scale Computers
- Large component count (e.g., 133,120 processors, 608,256 DRAMs)
- Strongly coupled hardware
(Figure: failure rate grows with component count)
Scientific Computing
- Runs for months
- Demands high capability
Failures are expected during the application's execution.
Providing Fault-tolerance
- Hardware replication (spare nodes): a high-cost solution!
- Checkpointing and rollback recovery
Checkpointing and Recovery
- Simplicity: easy to implement
- Cost-effective: no additional hardware support
Critical aspect: the bandwidth required for saving the process state.
Reducing Bandwidth
- Incremental checkpointing: only the memory modified since the previous checkpoint is saved to stable storage, rather than the full process state (see the sketch below).
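As a concrete illustration, here is a minimal sketch in C of the incremental save step, assuming a dirty-page map maintained by a write-tracking layer (see the mprotect() sketch under Methodology). The names (save_incremental, dirty, region) are hypothetical, not from the authors' library:

```c
/* Hypothetical sketch: write only the pages dirtied since the last
 * checkpoint, each tagged with its offset so recovery can patch it
 * into the previous image. */
#include <stdint.h>
#include <stdio.h>

void save_incremental(FILE *ckpt, const uint8_t *region,
                      const uint8_t *dirty, size_t npages, size_t page_size)
{
    for (size_t p = 0; p < npages; p++) {
        if (!dirty[p])
            continue;                             /* unchanged page: skip */
        uint64_t off = (uint64_t)p * page_size;
        fwrite(&off, sizeof off, 1, ckpt);        /* where the page belongs */
        fwrite(region + off, page_size, 1, ckpt); /* the modified page */
    }
}
```

Because clean pages are skipped, the data written per checkpoint drops from the full memory footprint to only the working set modified during the interval, and the bandwidth demand drops with it.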
New Challenges
- Frequent checkpoints: minimizing the rollback interval increases system availability, but puts more pressure on bandwidth
- Automatic and user-transparent: in line with autonomic computing, a new vision for managing the high complexity of large systems through self-healing and self-repair
Survey of Implementation Levels

  Level             Examples
  Application       CLIP, Dome, CCITF
  Run-time library  Ickp, CoCheck, Diskless
  Operating system  just a few!
  Hardware          ReVive, SafetyNet
Enabling Automatic Checkpointing
(Figure: moving down the stack from Application to Hardware, the required user intervention goes from high to low, i.e., checkpointing becomes more automatic, while the amount of checkpoint data grows)
The Bandwidth Challenge
Does current technology provide enough bandwidth for frequent, automatic checkpointing?
Methodology
- Analyzing the memory footprint of scientific codes with a run-time library that monitors the application's memory footprint (text, static data, heap, stack, mmap regions) via mprotect(); see the sketch below
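A minimal sketch, assuming Linux/POSIX, of the mprotect()-based write tracking such a run-time library can use: the footprint is write-protected at the start of each interval, and the first store to a page raises a SIGSEGV whose handler records the page as dirty and restores write access. All identifiers here are illustrative, not from the authors' implementation:

```c
/* Illustrative sketch of mprotect()-based dirty-page tracking.
 * Note: calling mprotect() from a signal handler is not strictly
 * async-signal-safe per POSIX, but this is the common technique on
 * Linux for page-level write tracking. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static uint8_t *region;       /* monitored memory region (page-aligned) */
static size_t   region_len;   /* region length in bytes */
static uint8_t *dirty;        /* one dirty flag per page */
static size_t   page_size;

/* SIGSEGV handler: mark the faulting page dirty, re-enable writes,
 * and let the faulting store retry. */
static void on_write(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    uintptr_t addr = (uintptr_t)si->si_addr;
    uintptr_t base = (uintptr_t)region;
    if (addr < base || addr >= base + region_len)
        abort();                       /* a genuine segfault: give up */
    size_t page = (addr - base) / page_size;
    dirty[page] = 1;
    mprotect(region + page * page_size, page_size, PROT_READ | PROT_WRITE);
}

/* Start a checkpoint interval: clear the map and write-protect the
 * region so the next store to each page faults exactly once. */
static void start_interval(void)
{
    memset(dirty, 0, region_len / page_size);
    mprotect(region, region_len, PROT_READ);
}

int main(void)
{
    page_size  = (size_t)sysconf(_SC_PAGESIZE);
    region_len = 16 * page_size;
    region = mmap(NULL, region_len, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    dirty  = calloc(region_len / page_size, 1);

    struct sigaction sa = {0};
    sa.sa_sigaction = on_write;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    start_interval();
    region[3 * page_size] = 42;   /* dirties exactly one page */
    /* dirty[3] is now set; an incremental checkpoint saves page 3 only */
    return 0;
}
```

Each page pays the fault cost only once per interval, so the tracking overhead is proportional to the number of distinct pages written, not the number of stores.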
Methodology
- Quantifying the bandwidth requirements
  - Checkpoint intervals (timeslices): 1 s to 20 s
  - Comparison against currently available bandwidth: 900 MB/s sustained network bandwidth (Quadrics QsNet II) and 75 MB/s sustained single-disk bandwidth (Ultra SCSI controller)
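The comparison rests on a simple worked equation: the bandwidth required for a timeslice τ is the amount of memory modified during the interval divided by its length,

  B_req(τ) = D(τ) / τ

(The symbols B_req and D(τ) are notation introduced here for clarity, not from the original slides.) For example, using the Sage-1000MB measurements shown later in the talk, a 1 s timeslice demands about 78.8 MB/s, above a single Ultra SCSI disk (75 MB/s) but well within QsNet II's 900 MB/s; at the longest timeslices the requirement falls to about 12.1 MB/s.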
Experimental Environment
- 32-node Linux cluster
  - 64 Itanium II processors
  - PCI-X I/O bus
  - Quadrics QsNet interconnection network
- Parallel scientific codes
  - Sage and Sweep3D, representative of the ASCI production codes at LANL
  - NAS parallel benchmarks: SP, LU, BT and FT
Memory Footprint

  Application   Footprint
  Sage-1000MB   954.6 MB
  Sage-500MB    497.3 MB
  Sage-100MB    103.7 MB
  Sage-50MB     55.0 MB
  Sweep3D       105.5 MB
  SP Class C    40.1 MB
  LU Class C    16.6 MB
  BT Class C    76.5 MB
  FT Class C    118.0 MB

(The Sage configurations cover an increasing range of memory footprints.)
Talk Overview
- Goal
- Fault-tolerance for scientific computing
- Methodology
- Characterization of scientific applications
- Performance evaluation of incremental checkpointing
  - Bandwidth
  - Scalability
Characterization
(Figure: Sage-1000MB memory writes over time; after a data-initialization phase, the code settles into regular processing bursts)
Communication
(Figure: Sage-1000MB over time; regular communication bursts interleaved with the processing bursts)
Fraction of the Memory Footprint Overwritten during the Main Iteration
(Figure: per application; some codes overwrite their full memory footprint each iteration, while others stay below the full footprint)
Bandwidth Requirements
(Figure: Sage-1000MB required bandwidth (MB/s) vs. timeslice (s); the requirement decreases with the timeslice, from 78.8 MB/s down to 12.1 MB/s)
Bandwidth Requirements for a 1-second Timeslice
(Figure: required bandwidth per application; it increases with the memory footprint, and only the most demanding code is near single SCSI disk performance)
Increasing Memory Footprint Size
(Figure: average bandwidth (MB/s) vs. timeslice (s) across footprint sizes; the required bandwidth increases sublinearly with the memory footprint)
Increasing Processor Count
(Figure: average bandwidth (MB/s) vs. timeslice (s) under weak scaling; the per-process bandwidth decreases slightly with processor count)
Technological Trends
(Figure: performance improvement per year; application performance is bounded by memory improvements, while network and storage bandwidth increase at a faster pace)
Conclusions
- Commodity cluster components face no technological limitations in implementing automatic, frequent, and user-transparent incremental checkpointing
- Current hardware technology can sustain the bandwidth requirements
- These results can be generalized to future large-scale computers
Conclusions (cont.)
- The per-process checkpoint bandwidth decreases slightly with processor count
- It increases sublinearly with the memory footprint size
- Improvements in networking and storage will make incremental checkpointing even more effective in the future