
On the Feasibility of Incremental Checkpointing for Scientific Computing
Jose Carlos Sancho, with Fabrizio Petrini, Greg Johnson, Juan Fernandez, and Eitan Frachtenberg
Performance and Architectures Lab (PAL), Computer and Computational Sciences Division, Los Alamos National Laboratory
International Parallel & Distributed Processing Symposium, Santa Fe, NM

Talk Overview
- Goal
- Fault-tolerance for scientific computing
- Methodology
- Characterization of scientific applications
- Performance evaluation of incremental checkpointing
- Concluding remarks

Goal
Prove the feasibility of incremental checkpointing that is:
- Frequent
- Automatic
- User-transparent
- No changes to the application
- No special hardware support

Large-Scale Computers
- Large component count (for example, 133,120 processors and 608,256 DRAM chips)
- Strongly coupled hardware
- Failure rate grows with the number of components
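
A back-of-the-envelope illustration of why component count matters (the per-component reliability below is an assumed round number, not a figure from the talk): with N components failing independently, the system-level mean time between failures shrinks roughly as

    \mathrm{MTBF}_{\mathrm{system}} \;\approx\; \frac{\mathrm{MTBF}_{\mathrm{component}}}{N}
    \;\approx\; \frac{10\ \mathrm{years} \times 8760\ \mathrm{h/year}}{10^{5}} \;\approx\; 0.9\ \mathrm{h}

so even with a generous 10-year MTBF per component, a machine with on the order of 10^5 components fails about once an hour.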

Scientific Computing
- Applications run for months
- They demand high capability
- Failures are therefore expected during the application's execution

Providing Fault-Tolerance
- Hardware replication: a high-cost solution
- Checkpointing and rollback recovery: the process state is checkpointed periodically and recovered on a spare node after a failure

Checkpointing and Recovery
- Simplicity: easy to implement
- Cost-effective: no additional hardware support
- Critical aspect: the bandwidth required to save the process state

Reducing Bandwidth
- Incremental checkpointing: only the memory modified since the previous checkpoint is saved to stable storage, instead of the full process state (see the sketch below)
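
A minimal sketch of the write step of an incremental checkpoint, assuming page-granularity dirty tracking (my illustration of the general technique, not the implementation evaluated in the talk): only pages flagged as modified since the previous checkpoint are appended to the checkpoint file, together with their addresses so they can be restored during recovery.

    #include <stddef.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096

    /* One entry per tracked page: its start address and a dirty flag that
     * the write-detection mechanism sets (see the mprotect() sketch later). */
    struct page_entry {
        void *addr;
        int   dirty;
    };

    /* Append only the dirty pages to the checkpoint file. A full checkpoint
     * would write every tracked page unconditionally; the incremental one
     * writes just the pages modified since the previous checkpoint. */
    static void write_incremental_checkpoint(FILE *ckpt,
                                             struct page_entry *pages,
                                             size_t npages)
    {
        for (size_t i = 0; i < npages; i++) {
            if (!pages[i].dirty)
                continue;
            fwrite(&pages[i].addr, sizeof pages[i].addr, 1, ckpt); /* page id  */
            fwrite(pages[i].addr, PAGE_SIZE, 1, ckpt);             /* contents */
            pages[i].dirty = 0;  /* page is clean until it is written again */
        }
        fflush(ckpt);
    }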

New Challenges
- Frequent checkpoints: minimizing the rollback interval increases system availability, but puts more pressure on bandwidth
- Automatic and user-transparent: in line with autonomic computing, a new vision for managing the high complexity of large systems through self-healing and self-repair

Survey of Implementation Levels
- Application: CLIP, Dome, CCITF
- Run-time library: Ickp, CoCheck, Diskless
- Operating system: just a few!
- Hardware: ReVive, SafetyNet

Enabling Automatic Checkpointing
(figure: moving from the application level down through the run-time library and operating system to the hardware level, the required user intervention goes from high to low — checkpointing becomes automatic — while the amount of checkpoint data grows from low to high)

The Bandwidth Challenge
Does current technology provide enough bandwidth for frequent, automatic checkpointing?

Methodology
- Analyzing the memory footprint of scientific codes: a run-time library tracks writes to the application's memory footprint (text, static data, heap, stack) using mmap and mprotect() (see the sketch below)
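
A minimal sketch, assuming a POSIX system, of how mprotect() can detect the pages written between two checkpoints (my illustration of the general technique, not the paper's library): the tracked region is write-protected at every checkpoint, and the SIGSEGV handler marks the faulting page dirty and re-enables writes, so each page traps at most once per interval.

    #define _POSIX_C_SOURCE 200809L
    #include <signal.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Page-aligned region to track; its setup (e.g., via mmap) is omitted. */
    static uint8_t *region;
    static size_t   region_len;
    static uint8_t  dirty[1 << 20];   /* one flag per page, sized generously */
    static long     page_size;

    /* The first write to a protected page faults here: record the page as
     * dirty and make it writable again so the application can continue. */
    static void on_write_fault(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        uint8_t *fault = (uint8_t *)si->si_addr;
        if (fault < region || fault >= region + region_len)
            _exit(1);                          /* a real crash, not our trap */
        size_t page = (size_t)(fault - region) / (size_t)page_size;
        dirty[page] = 1;
        mprotect(region + page * (size_t)page_size, (size_t)page_size,
                 PROT_READ | PROT_WRITE);
    }

    /* Called at every checkpoint: clear the dirty flags and write-protect
     * the region again so the next interval's writes are detected anew. */
    static void arm_write_tracking(void)
    {
        memset(dirty, 0, sizeof dirty);
        mprotect(region, region_len, PROT_READ);
    }

    static void install_handler(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_flags = SA_SIGINFO;
        sa.sa_sigaction = on_write_fault;
        sigaction(SIGSEGV, &sa, NULL);
        page_size = sysconf(_SC_PAGESIZE);
    }

On recovery, the saved pages are simply copied back into place; calling arm_write_tracking() right after each checkpoint keeps the overhead to one trap per modified page per interval.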

Methodology
- Quantifying the bandwidth requirements
  - Checkpoint intervals: 1 s to 20 s
  - Compared with the bandwidth available today: 900 MB/s sustained network bandwidth (Quadrics QsNet II) and 75 MB/s sustained bandwidth of a single disk (Ultra SCSI controller)
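
The comparison itself is straightforward: the bandwidth needed for a given timeslice is the amount of data modified in that timeslice divided by its length. A small illustrative check (the dirty-data figure below is the Sage-1000MB number quoted later in the talk; the program itself is mine):

    #include <stdio.h>

    /* Sustained bandwidth needed if dirty_mb megabytes were modified
     * during a checkpoint timeslice of interval_s seconds. */
    static double required_bandwidth(double dirty_mb, double interval_s)
    {
        return dirty_mb / interval_s;
    }

    int main(void)
    {
        const double net_bw  = 900.0; /* MB/s, Quadrics QsNet II (sustained)       */
        const double disk_bw = 75.0;  /* MB/s, single Ultra SCSI disk (sustained)  */

        /* Sage-1000MB modifies roughly 78.8 MB per 1 s timeslice. */
        double need = required_bandwidth(78.8, 1.0);
        printf("need %.1f MB/s -> network: %s, single disk: %s\n", need,
               need <= net_bw  ? "sufficient" : "insufficient",
               need <= disk_bw ? "sufficient" : "insufficient");
        return 0;
    }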

Experimental Environment
- 32-node Linux cluster
  - 64 Itanium II processors
  - PCI-X I/O bus
  - Quadrics QsNet interconnection network
- Parallel scientific codes
  - Sage and Sweep3D, representative of the ASCI production codes at LANL
  - NAS parallel benchmarks: SP, LU, BT, and FT

Memory Footprint

    Application    Memory footprint
    Sage-1000MB    954.6 MB
    Sage-500MB     497.3 MB
    Sage-100MB     103.7 MB
    Sage-50MB       55   MB
    Sweep3D        105.5 MB
    SP Class C      40.1 MB
    LU Class C      16.6 MB
    BT Class C      76.5 MB
    FT Class C     118   MB

The selected codes span a wide range of memory footprints.

Talk Overview
- Goal
- Fault-tolerance for scientific computing
- Methodology
- Characterization of scientific applications
- Performance evaluation of incremental checkpointing
  - Bandwidth
  - Scalability

Characterization
(figure: Sage-1000MB execution trace — a data-initialization phase followed by regular processing bursts)

Communication
(figure: Sage-1000MB — regular communication bursts interleaved with the processing bursts)

Fraction of the Memory Footprint Overwritten During the Main Iteration
(figure: some codes overwrite their full memory footprint each iteration, while others stay below it)

Bandwidth Requirements
(figure: required bandwidth (MB/s) vs. checkpoint timeslice (s) for Sage-1000MB — from 78.8 MB/s at a 1 s timeslice down to 12.1 MB/s at 20 s; the requirement decreases with the timeslice)
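
The drop with longer timeslices follows from pages that are re-written within a timeslice being saved only once. In my notation (not the paper's), if D(τ) is the set of distinct pages modified during a timeslice of length τ, the required bandwidth is

    B(\tau) \;=\; \frac{|D(\tau)|}{\tau}

and |D(τ)| grows more slowly than τ. Assuming the 12.1 MB/s figure corresponds to the 20 s timeslice, the Sage-1000MB numbers imply |D(1 s)| ≈ 78.8 MB but |D(20 s)| ≈ 12.1 MB/s × 20 s ≈ 242 MB, far less than the 20 × 78.8 ≈ 1576 MB that 20 independent one-second checkpoints would write.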

Bandwidth Requirements for a 1-Second Timeslice
(figure: per-application bandwidth required for a 1 s timeslice — it increases with the memory footprint; only the most demanding code is close to the performance of a single SCSI disk)

Increasing Memory Footprint Size
(figure: average required bandwidth (MB/s) vs. timeslice (s) — the bandwidth increases sublinearly with the memory footprint size)

Increasing Processor Count
(figure: average required bandwidth (MB/s) vs. timeslice (s) under weak scaling — the per-process bandwidth decreases slightly with processor count)

Technological Trends
(figure: performance improvement per year — application performance is bounded by memory improvements, while networking and storage bandwidth increase at a faster pace)

Conclusions
- There are no technological limitations that prevent commodity cluster components from implementing automatic, frequent, and user-transparent incremental checkpointing
- Current hardware technology can sustain the bandwidth requirements
- These results can be generalized to future large-scale computers

Conclusions (continued)
- The per-process bandwidth requirement decreases slightly with processor count and increases sublinearly with the memory footprint size
- Improvements in networking and storage will make incremental checkpointing even more effective in the future