FT-MPICH: Providing Fault Tolerance for MPI Parallel Applications
Prof. Heon Y. Yeom
Distributed Computing Systems Lab., Seoul National University
Motivation
- Condor supports a Checkpoint/Restart (C/R) mechanism only in the Standard Universe, i.e., for single-process jobs.
- C/R for parallel jobs is not provided in any of the current Condor universes.
- We would like to make C/R available for MPI programs.
Introduction
- Why the Message Passing Interface (MPI)?
  - Designing a generic FT framework is extremely hard due to the diversity of hardware and software systems.
  - MPI is the most popular programming model in cluster computing.
  - Providing fault tolerance to MPI is more cost-effective than providing it to the OS or hardware.
- We have chosen the MPICH series.
Architecture: Concept
- FT-MPICH combines monitoring, failure detection, and a C/R protocol.
Architecture: Overall System
[Diagram: the Management System and the MPI processes are connected over Ethernet; on each node, the MPI process and its communication module exchange data through IPC via a message queue.]
Management System
- The Management System makes MPI more reliable by providing:
  - Failure detection
  - Checkpoint coordination
  - Recovery
  - Initialization coordination
  - Output management
  - Checkpoint transfer
Manager System
[Diagram: a Leader Manager with stable storage coordinates one Local Manager per MPI process. The Leader handles initialization, CKPT commands, CKPT transfer, and failure notification & recovery; the MPI processes communicate directly with each other to exchange data.]
Fault-Tolerant MPICH-P4
[Diagram: the FT-MPICH stack over Ethernet. Collective and P2P operations run on the ADI (Abstract Device Interface) and the ch_p4 (Ethernet) device; the FT module adds a recovery module, connection re-establishment, a checkpoint toolkit, and atomic message transfer.]
Startup in Condor
- Preconditions:
  - The Leader Manager already knows, from user input, the machines on which the MPI processes run and the number of MPI processes.
  - The Local Manager and MPI process binaries are located at the same path on each machine.
Startup in Condor
- Job submission description file:
  - Uses the Vanilla universe.
  - A shell script is used in the submission description file: executable points to a shell script, and that shell script only launches the Leader Manager.
- Example:

  Example.cmd:
    universe   = Vanilla
    executable = exe.sh
    output     = exe.out
    error      = exe.err
    log        = exe.log
    queue

  exe.sh (shell script):
    #!/bin/sh
    Leader_manager …
Startup in Condor
- The user submits the job using condor_submit; it starts up as a normal Condor job.
[Diagram: Condor pool with the Central Manager (Collector, Negotiator), the Submit Machine (Schedd, Shadow), and an Execute Machine (Startd, Starter) running the job, i.e., the Leader Manager.]
Startup in Condor
- The Leader Manager launches a Local Manager on each execute machine, and each Local Manager launches its MPI process via Fork() & Exec() (see the sketch below).
[Diagram: the Leader Manager job spawns Local Managers on Execute Machines 1-3; each Local Manager forks and execs an MPI process.]
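The slides only say that the Local Manager starts its MPI process with Fork() & Exec(); here is a minimal C sketch of that step, assuming a hypothetical ./mpi_app binary and a Local Manager that simply waits on the child. It is an illustration, not FT-MPICH code.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: the Local Manager forks a child and execs the MPI
 * process binary, keeping the pid so it can monitor the child. */
static pid_t spawn_mpi_process(const char *binary, char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0) {                 /* fork failed */
        perror("fork");
        return -1;
    }
    if (pid == 0) {                /* child: become the MPI process */
        execv(binary, argv);
        perror("execv");           /* only reached if exec fails */
        _exit(127);
    }
    return pid;                    /* parent (Local Manager) keeps the pid */
}

int main(void)
{
    char *args[] = { "./mpi_app", NULL };       /* assumed binary name */
    pid_t child  = spawn_mpi_process("./mpi_app", args);

    if (child > 0) {
        int status;
        waitpid(child, &status, 0);             /* watch the child exit */
        printf("MPI process exited with status %d\n", status);
    }
    return 0;
}
```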
Startup in Condor
- The MPI processes send their communication info to the Leader Manager, which aggregates it.
- The Leader Manager then broadcasts the aggregated info back to all MPI processes (a sketch of a possible format follows).
[Diagram: communication info flows from the MPI processes on Execute Machines 1-3 through their Local Managers to the Leader Manager and back.]
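The slides do not say what the "Communication Info" contains; a small C sketch, assuming it is just (rank, IP, port) tuples that the Leader Manager gathers into a table before broadcasting it back. The struct names and fields are illustrative, not FT-MPICH's actual format.

```c
#include <stdint.h>
#include <stdio.h>

#define MAX_PROCS 64

/* Assumed layout of the per-process communication info sent at startup. */
struct comm_info {
    int      rank;       /* MPI rank of the process      */
    char     ip[16];     /* dotted-quad listen address   */
    uint16_t port;       /* listen port for MPI messages */
};

/* Leader side: one entry per rank; once every rank has reported, the whole
 * table is broadcast so each process can connect to every other process. */
struct comm_table {
    int n;
    struct comm_info entry[MAX_PROCS];
};

static void leader_add_info(struct comm_table *t, const struct comm_info *ci)
{
    if (t->n < MAX_PROCS)
        t->entry[t->n++] = *ci;
}

int main(void)
{
    struct comm_table table = { 0 };
    struct comm_info  ci = { .rank = 0, .ip = "10.0.0.1", .port = 5000 };

    leader_add_info(&table, &ci);               /* rank 0 reports in */
    printf("rank %d listens on %s:%u\n",
           table.entry[0].rank, table.entry[0].ip,
           (unsigned)table.entry[0].port);
    return 0;
}
```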
Fault Tolerant MPI
- To provide MPI fault tolerance, we have adopted:
  - A coordinated checkpointing scheme (vs. an independent scheme): the Leader Manager is the coordinator!!
  - Application-level checkpointing (vs. kernel-level CKPT): this method does not require any effort on the part of cluster administrators.
  - A user-transparent checkpointing scheme (vs. user-aware): this method requires no modification of MPI source code.
Atomic Message Passing
- Coordination between MPI processes.
- Assumption: the communication channels are FIFO.
- Lock() and Unlock() delimit an atomic region of message transfer.
- If the CKPT SIG arrives outside an atomic region, the checkpoint is performed immediately; if it arrives inside, the checkpoint is delayed until Unlock() (see the sketch below).
[Diagram: Proc 0 and Proc 1; a CKPT SIG arriving inside Proc 1's Lock()/Unlock() region delays its checkpoint.]
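A minimal C sketch of the delayed-checkpoint rule, assuming the checkpoint request arrives as a signal (SIGUSR1 is an assumption; the slides only say "CKPT SIG") and that Lock()/Unlock() simply toggle an in-atomic flag. Real FT-MPICH implements this inside the MPICH device layer; this is only an illustration of the idea.

```c
#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t ckpt_pending = 0;  /* CKPT SIG arrived?       */
static volatile sig_atomic_t in_atomic    = 0;  /* inside Lock()/Unlock()? */

/* Placeholder for the real checkpoint; printf is used only for the sketch
 * (a production handler must stay async-signal-safe). */
static void do_checkpoint(void)
{
    printf("checkpoint is performed\n");
}

/* Handler for the checkpoint signal sent by the Local Manager. */
static void ckpt_sig_handler(int sig)
{
    (void)sig;
    if (in_atomic)
        ckpt_pending = 1;   /* inside an atomic region: delay the checkpoint */
    else
        do_checkpoint();    /* outside an atomic region: checkpoint now      */
}

static void lock(void)   { in_atomic = 1; }     /* enter the atomic region */

static void unlock(void)
{
    in_atomic = 0;                              /* leave the atomic region */
    if (ckpt_pending) {                         /* a CKPT SIG was delayed  */
        ckpt_pending = 0;
        do_checkpoint();
    }
}

int main(void)
{
    signal(SIGUSR1, ckpt_sig_handler);
    lock();
    /* ... atomic message transfer (send & matching receive) goes here ... */
    unlock();
    return 0;
}
```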
Atomic Message Passing (Case 1)
- When an MPI process receives the CKPT SIG, it sends and receives a barrier message on each channel, then checkpoints.
[Diagram: Proc 0 and Proc 1 both receive the CKPT SIG outside an atomic region, exchange barrier messages, and take their checkpoints.]
Atomic Message Passing (Case 2)
- By sending and receiving the barrier messages, any in-transit message is pushed to its destination.
[Diagram: Proc 0's data message reaches Proc 1 ahead of the barrier; Proc 1's checkpoint is delayed until its atomic region ends.]
Atomic Message Passing (Case 3)
- The communication channels between MPI processes are flushed, so no dependency between MPI processes crosses the checkpoint line (see the sketch below).
[Diagram: with the channels flushed, both processes take consistent checkpoints.]
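A rough C sketch of the barrier-based channel flush under the FIFO assumption from the previous slides. The helpers send_barrier(), recv_message(), and deliver_to_application() are hypothetical stubs, not FT-MPICH functions; in the real system this happens inside the ch_p4 device.

```c
#include <stdbool.h>
#include <stdio.h>

#define NPEERS 3   /* number of peer processes (illustrative) */

/* Hypothetical message descriptor: either application data or a barrier. */
struct msg { bool is_barrier; /* payload omitted in the sketch */ };

/* Stub channel helpers, for illustration only. */
static void send_barrier(int peer) { printf("barrier -> peer %d\n", peer); }

static bool recv_message(int peer, struct msg *m)
{
    (void)peer;
    m->is_barrier = true;           /* stub: pretend the barrier arrived */
    return true;
}

static void deliver_to_application(int peer, const struct msg *m)
{
    (void)m;
    printf("in-transit data from peer %d delivered\n", peer);
}

/* Flush every FIFO channel before checkpointing: send a barrier to each
 * peer, then drain each incoming channel until that peer's barrier shows
 * up. Because channels are FIFO, everything received before the barrier
 * is an in-transit message that must reach the application before the
 * checkpoint, so no dependency crosses the checkpoint line. */
static void flush_channels(void)
{
    for (int p = 0; p < NPEERS; p++)
        send_barrier(p);

    for (int p = 0; p < NPEERS; p++) {
        struct msg m;
        while (recv_message(p, &m) && !m.is_barrier)
            deliver_to_application(p, &m);
    }
}

int main(void)
{
    flush_channels();
    return 0;
}
```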
Checkpointing
- Coordinated checkpointing: the Leader Manager issues a checkpoint command to every rank (rank 0 through rank 3 in the figure).
- Each process saves its state (stack, data, text, heap) to stable storage, producing successive checkpoint versions (ver 1, ver 2); a leader-side sketch follows.
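The wire protocol between the Leader Manager and the Local Managers is not given in the slides; this is a leader-side sketch of the coordinated exchange, assuming simple CKPT_CMD/CKPT_DONE messages and stubbed transport helpers (none of these names come from FT-MPICH).

```c
#include <stdbool.h>
#include <stdio.h>

#define NPROCS 4   /* number of ranks in the slide's example */

/* Hypothetical message types between the Leader and the Local Managers. */
enum msg_type { MSG_CKPT_CMD, MSG_CKPT_DONE };

/* Stub transport, for illustration only: a real implementation would use
 * the Ethernet connections between the Leader and the Local Managers. */
static bool send_to_manager(int rank, enum msg_type m, int version)
{
    printf("-> rank %d: %s (ver %d)\n", rank,
           m == MSG_CKPT_CMD ? "CKPT_CMD" : "CKPT_DONE", version);
    return true;
}

static bool recv_from_manager(int rank, enum msg_type expected, int version)
{
    (void)expected; (void)version;
    printf("<- rank %d: CKPT_DONE\n", rank);
    return true;
}

/* Coordinated checkpoint, leader side: tell every rank to checkpoint at
 * `version`, then wait until all of them report completion before the new
 * version is considered stable on storage. */
static bool coordinated_checkpoint(int version)
{
    for (int r = 0; r < NPROCS; r++)
        if (!send_to_manager(r, MSG_CKPT_CMD, version))
            return false;               /* could not even start: give up   */

    for (int r = 0; r < NPROCS; r++)
        if (!recv_from_manager(r, MSG_CKPT_DONE, version))
            return false;               /* keep using the previous version */

    printf("checkpoint version %d committed\n", version);
    return true;
}

int main(void)
{
    coordinated_checkpoint(2);          /* e.g., take "ver 2" after "ver 1" */
    return 0;
}
```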
Failure Recovery
- MPI process recovery: a new process is created and its address space (stack, data, text, heap) is restored from the CKPT image, yielding the restarted process.
Failure Recovery
- Connection re-establishment: each MPI process re-opens its socket and sends its IP and port info to the Local Manager (see the sketch below).
- This is the same procedure used at initialization time.
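A small C sketch of the per-process step: re-open a listening socket on an ephemeral port and report the chosen address. The actual message to the Local Manager is replaced by a printf, and everything beyond "re-opens socket and sends IP, Port" is an assumption.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Re-open a listening socket on an ephemeral port and report the chosen
 * address, which the Local Manager would forward to the Leader Manager so
 * the aggregated table can be rebroadcast (reporting is stubbed here). */
int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = 0;                 /* let the kernel pick a port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 8) < 0) {
        perror("bind/listen");
        close(fd);
        return 1;
    }

    socklen_t len = sizeof(addr);
    getsockname(fd, (struct sockaddr *)&addr, &len);   /* learn the port */

    /* Stub for "send IP, Port info to Local Manager". */
    printf("report to Local Manager: port %u\n", ntohs(addr.sin_port));

    close(fd);
    return 0;
}
```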
Fault Tolerant MPI
- Recovery from failure: after failure detection, the Leader Manager directs ranks 0 through 3 to roll back to the last checkpoint version (ver 1) held in stable storage.
Fault Tolerant MPI in Condor
- The Leader Manager controls the MPI processes by issuing checkpoint commands and monitoring them.
- Condor is not aware of the failure incident.
[Diagram: the Leader Manager job oversees the Local Managers and MPI processes on Execute Machines 1-3 within the Condor pool.]
Fault-Tolerant MPICH Variants (Seoul National University)
[Diagram: the same FT module (recovery module, connection re-establishment, checkpoint toolkit, atomic message transfer) layered over the ADI (Abstract Device Interface) for three devices: Globus2 (Ethernet) in MPICH-GF, GM (Myrinet) in M3, and MVAPICH (InfiniBand) in SHIELD, each providing collective and P2P operations.]
Summary
- We can provide fault tolerance for parallel applications using MPICH on Ethernet, Myrinet, and InfiniBand.
- Currently, only the P4 (Ethernet) version works with Condor.
- We look forward to working with the Condor team.