
1/20 Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications Sheng Di, Mohamed Slim Bouguerra, Leonardo Bautista-gomez, Franck Cappello INRIA and ANL 2013

2/20 Outline Background of Multi-level Checkpoint Model Problem Formulation Optimization of Multi-level Checkpoint Model Optimizing Checkpoint Intervals for each level Optimizing the Selection of Levels Performance Evaluation Conclusion and Future Work

3/20 Background of Multi-level Ckpt Model The traditional checkpoint/restart model always stores checkpoint files on a Parallel File System (PFS). The PFS is centrally controlled, so it becomes a bottleneck for large-scale applications. For example, our experiments show that the checkpoint overhead on the PFS increases quickly with problem size and execution scale: the measured checkpoint cost grows from 7.4 sec to 10.8 sec, 16.8 sec, and 43.1 sec as the number of cores increases.

4/20 Background of Multi-level Ckpt Model Existing multi-level checkpoint toolkits: the Scalable Checkpoint/Restart library (SCR), SC'10, with levels RAM disk / local disk, partner-copy / XOR encoding, and Parallel File System (PFS), e.g., NFS; and the Fault Tolerance Interface (FTI), SC'11, with levels local disk (storing ckpt files on the local disk), partner-copy (storing ckpt files on the local disk and a partner's disk), Reed-Solomon encoding (RS-encoding), and Parallel File System (PFS), such as NFS.

5/20 Problem Formulation Different types of failures: CPL1: software errors but no hardware failures. CPL2: non-adjacent hardware failures. CPL3: a few adjacent hardware failures. CPL4: many hardware failures.

6/20 Problem Formulation The process of running an HPC application with failures over multi-level checkpoint model

7/20 Problem Formulation Our objective: minimize the expected wall-clock length for each HPC application, with an optimized selection of levels and optimized checkpoint intervals on each level. The mathematical expectation of the wall-clock length involves: the productive time, the number of levels, the number of ckpt intervals at each level i, the ckpt overhead, the rollback loss, the restart cost, and the number of failures at each level i together with their probabilities.
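The exact expression appears in the slide figure; its general shape, using the labels above (a reconstruction from those labels, not the authors' verbatim formula), is:

```latex
E(T_w) \;=\; T_p \;+\; \sum_{i=1}^{L} x_i\, C_i
       \;+\; \sum_{i=1}^{L} E(N_i)\,\bigl(E(\Delta_i) + R_i\bigr)
```

where $T_p$ is the productive time, $L$ the number of levels, $x_i$ the number of ckpt intervals at level $i$, $C_i$ the ckpt overhead, $E(\Delta_i)$ the expected rollback loss, $R_i$ the restart cost, and $E(N_i)$ the expected number of level-$i$ failures (governed by their probabilities).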

8/20 Optimization of Multi-level Checkpoint Model E(T_w) is convex in the x_i, where x_i denotes the number of ckpt intervals at level i. We obtain the optimal solution, x_i*, by solving the simultaneous first-order equations, for i = 1, 2, 3, ..., L.
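The simultaneous equations themselves appear in the slide figure. For intuition, a simplified single-term sketch (our assumption, not the authors' exact system): if level-i failures arrive at rate $\lambda_i$ and each rollback loses on average half an interval $T_p/(2x_i)$, then the level-i terms of $E(T_w)$ are $x_i C_i + \lambda_i T_w\, T_p/(2x_i)$, and setting the derivative to zero gives a Young-style closed form:

```latex
\frac{\partial E(T_w)}{\partial x_i}
  \;=\; C_i - \frac{\lambda_i\, T_w\, T_p}{2\, x_i^{2}} \;=\; 0
\quad\Longrightarrow\quad
x_i^{*} \;=\; \sqrt{\frac{\lambda_i\, T_w\, T_p}{2\, C_i}}
```

Note that $T_w$ itself depends on all the $x_i$, so the equations are coupled, which is why an iterative solution is used; approximating $T_w \approx T_p$ recovers Young's classical interval $T_p/x_i^{*} = \sqrt{2C_i/\lambda_i}$.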

9/20 Optimization of Multi-level Checkpoint Model Optimizing checkpoint intervals: the simplified equations are solved with an iterative algorithm that updates x_i^(k) to x_i^(k+1), initializing x_i^(0) with Young's formula; the error drops rapidly across iterations (e.g., k=0: err=0.2; k=1: err=0.08; k=2: err=0.005; k=3: err=...).
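The multi-level iterative scheme itself is in the slide figure. As an illustrative stand-in, the sketch below solves the classic single-level stationarity condition lam*tau = 1 - exp(-lam*(tau + C)) of a simple exponential-failure waste model (an assumption for illustration, not the paper's coupled system) by Newton iteration, initialized with Young's interval, and applies it independently per level with made-up overheads and failure rates:

```python
import math

def young_interval(ckpt_cost, failure_rate):
    """Young's formula: first-order estimate of the optimal checkpoint interval."""
    return math.sqrt(2.0 * ckpt_cost / failure_rate)

def optimal_interval(ckpt_cost, failure_rate, tol=1e-6, max_iter=50):
    """Newton iteration on h(tau) = lam*tau - 1 + exp(-lam*(tau + C)) = 0,
    the first-order condition of a simple single-level waste model,
    starting from Young's closed-form estimate (x^(0) on the slide)."""
    lam, C = failure_rate, ckpt_cost
    tau = young_interval(C, lam)
    for _ in range(max_iter):
        e = math.exp(-lam * (tau + C))
        h = lam * tau - 1.0 + e      # residual of the stationarity condition
        dh = lam * (1.0 - e)         # its derivative w.r.t. tau
        step = h / dh
        tau -= step
        if abs(step) < tol:          # error shrinks fast, as on the slide
            break
    return tau

# One interval per level, solved independently (the paper couples the levels).
overheads = [10, 30, 45, 50]                  # hypothetical ckpt cost per level (s)
rates = [1/3600, 1/7200, 1/14400, 1/28800]    # hypothetical failure rate per level (1/s)
intervals = [optimal_interval(c, lam) for c, lam in zip(overheads, rates)]
```

Starting from Young's estimate, Newton converges in a handful of iterations; the refined interval is slightly shorter than Young's, because the exact model charges more for late failures.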

10/20 Optimization of Multi-level Checkpoint Model Optimizing checkpoint intervals: How fast is our iterative optimal algorithm? With an error threshold of 10^-6, the algorithm converges within only about 30 iterations! What is the performance gain of our method compared to the traditional Young's formula? Suppose there are 8 levels and the application execution length is 1000 ~ 9000 seconds, with checkpoint overheads on the 8 levels of 10, 30, 45, 50, 55, 60, 65, and 240 seconds per checkpoint. Numerical simulation shows that our method outperforms Young's formula by at least 4.2%.

11/20 Optimization of Multi-level Checkpoint Model Optimizing the selection of checkpoint levels: for a particular combination of levels, the computation takes only about 30 iterations, so it is feasible to traverse all combinations of levels to find the optimal selection. With 8 levels there are 2^8 - 1 = 255 different non-empty combinations, for a total computation cost of about 255 * 30 = 7650 iterations, which is very small!
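The exhaustive traversal can be sketched as follows. The real cost of a combination comes from solving the interval equations for that combination; here a toy Young-style surrogate with hypothetical per-level overheads, failure rates, and an uncovered-failure penalty stands in:

```python
from itertools import combinations
from math import sqrt

# Hypothetical per-level checkpoint overheads (s) and failure rates (1/s).
overheads = [10, 30, 45, 50, 55, 60, 65, 240]
rates     = [1e-3, 5e-4, 3e-4, 2e-4, 1e-4, 8e-5, 5e-5, 2e-5]

def surrogate_cost(levels):
    """Toy stand-in for the true expected wall-clock E(T_w): Young-style waste
    sqrt(2*C_i*lam_i) summed over the chosen levels, plus a penalty for
    failure modes left uncovered by the selection."""
    covered = sum(rates[i] for i in levels)
    uncovered = sum(rates) - covered
    return sum(sqrt(2 * overheads[i] * rates[i]) for i in levels) + 1e3 * uncovered

# Traverse all 2^8 - 1 = 255 non-empty combinations of the 8 levels.
all_subsets = [s for r in range(1, 9) for s in combinations(range(8), r)]
best = min(all_subsets, key=surrogate_cost)
```

The search space is tiny, so brute force is perfectly adequate; only the per-combination cost evaluation (the iterative interval solver in the paper) matters.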

12/20 Optimization of Multi-level Checkpoint Model Analysis of a practical case (FTI): there are 4 levels: local disk, partner-copy, RS-encoding, and PFS. We use C_lf, C_pc, C_rs, and C_pf to denote the ckpt overheads, and R_lf, R_pc, R_rs, and R_pf to denote the restart overheads.

13/20 Optimization of Multi-level Checkpoint Model Analysis of a practical case (FTI): the target simultaneous equations, derived from convex optimization (first-order derivatives), are shown on the slide. The solution to these equations must be optimal, and the iterative method obtains it very quickly.

14/20 Performance Evaluation Experimental setting: Evaluation type A, numerical simulation, to evaluate a large number of cases with different parameters, including different ckpt overheads, restart costs, application lengths, etc. Evaluation type B, real experiments, to validate the feasibility of our optimal checkpoint model in a real use case (the FTI scenario). The MPI program used in our experiments simulates heat distribution.

15/20 Performance Evaluation Checkpoint overhead of FTI on the FUSION cluster. Key indicator: Workload Processing Ratio (WPR) = productive time / wall-clock length. (Measured with checkpoint sizes of 26 MB and 57 MB per process.)
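The WPR indicator can be illustrated with a toy run (all figures below are hypothetical, chosen only to show the arithmetic):

```python
def wpr(productive_time, num_ckpts, ckpt_cost, rollback_loss=0.0, restart_cost=0.0):
    """Workload Processing Ratio = productive time / wall-clock length,
    where the wall-clock length adds checkpoint, rollback, and restart overheads."""
    wall_clock = productive_time + num_ckpts * ckpt_cost + rollback_loss + restart_cost
    return productive_time / wall_clock

# E.g., 3600 s of useful work, 12 checkpoints at 7.4 s each, and one failure
# costing 150 s of rollback plus 20 s of restart (hypothetical numbers):
ratio = wpr(3600, 12, 7.4, rollback_loss=150, restart_cost=20)
```

A WPR of 1.0 would mean zero fault-tolerance overhead; better checkpoint-interval and level choices push the ratio toward 1.0.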

16/20 Performance Evaluation Different Selections of Checkpoint Levels Simulation Settings

17/20 Performance Evaluation Different selections of checkpoint levels, simulation results. Improvement: 10-20%.

18/20 Performance Evaluation Experimental Results on FUSION cluster

19/20 Conclusion Optimal multi-level checkpoint/restart model. Key theoretical conclusions: the ckpt intervals on each level can be optimized by a fast iterative method (converging within only 30 iterations), and the resulting intervals are optimal by convex-optimization theory. Key simulation/experimental results: for FTI, the iterative optimal method with the best selection of levels outperforms other solutions by up to 20%; for other cases, such as 8 levels, the optimized selection of levels improves performance by 50% in some cases.

20/20 Future Work In the future, we plan to: evaluate our optimal ckpt/restart model using more complex MPI programs, such as CESM, on real clusters at larger scales; optimize robustness and stability by taking into account possible prediction errors in checkpoint overheads and execution length; and optimize the execution scale (# of processes) based on checkpoint overheads for applications with a specific amount of productive time.

21/20 Thanks!! Contact me at: