ERLANGEN REGIONAL COMPUTING CENTER
1st International Workshop on Fault Tolerant Systems, IEEE Cluster '15
Building a fault tolerant application using the GASPI communication layer
2 Motivation
Nowadays, increasing computational capacity comes mainly from extreme levels of hardware parallelism.
On future machines, the mean time to failure is expected to be in the range of minutes to hours.
Without a fault tolerant environment, precious data is put at risk.
The lack of a well-defined fault tolerant environment is the first big challenge in developing a fault tolerant application.
3 Automatic Fault Tolerance Application: Approaching the problem
1. Failure detection:
   i. Who detects the failure?
   ii. Failure information propagation
   iii. Consensus about failed processes
2. Process and communicator recovery:
   i. Shrink
   ii. Spawn
   iii. Spare
3. Lost data recovery
   ...
4 Failure detection approaches
1. Ping-based, all-to-all: within each iteration, the health of all processes is checked.
2. Ping-based, neighbor-level: after a neighbor failure is detected, check all-to-all health.
3. Unsuccessful communication: after a failure is detected, check all-to-all health.
4. Dedicated failure detection process(es):
   › Pings all other processes
   › Keeps a global view of process health
   › Propagates the failure information to the remaining processes
5 Automatic Fault Tolerance Application: Approaching the problem
1. Failure detection: fault-detector process
2. Process and communicator recovery: spare nodes
3. Lost data recovery: "neighbor" node-level checkpoint/restart
[Figure: worker communicator and spare nodes, plus a dedicated fault-detector process (rank 0)]
6 Fault tolerance in GASPI: Introduction (I)
GASPI is developed by Fraunhofer ITWM, Kaiserslautern, Germany, and is based on the PGAS programming model.
Two memory parts:
Local: visible only to the GASPI process (and its threads).
Global: available to other processes for reading and writing.
GASPI enables fault tolerance:
In case of a single node failure, the rest of the nodes stay up and running.
A TIMEOUT can be given for every communication call.
Return values: GASPI_SUCCESS, GASPI_TIMEOUT, GASPI_ERROR
7 Fault tolerance in GASPI: Introduction (II)
What GASPI provides:
gaspi_proc_ping(): a process can check the state of any specific process by pinging it. The return value of the ping can be either 0 or 1 (healthy or dead).
User side: deletion of the old communicator, creation of a new communicator, a new communication structure, and (checkpoint/restart) remain the user's responsibility.
8 Failure detector (I)
[Figure: the fault-detector process issues gaspi_proc_ping() followed by return_val = gaspi_wait() against each process in the worker communicator and against the idle (spare) processes.]
return_val can be:
1) GASPI_SUCCESS
2) GASPI_TIMEOUT
3) GASPI_ERROR
9 Failure detector (II)
On GASPI_ERROR, the detector process informs every remaining process about the failure details via gaspi_write().
[Figure: the detector writes a table pairing failed process IDs (e.g. 6, 7) with rescue process IDs (e.g. 1, 2) to the worker communicator and the idle processes.]
10 Automatic Fault Tolerance Application: Program flow
[Figure: program flow diagram]
11 Asynchronous in-memory checkpointing
12 Benchmarks (I): Test bed
Lanczos algorithm. Checkpoint data structure:
After startup, every process stores the matrix communication data structure once.
The two most recent Lanczos vectors (v_j, v_j+1) plus metadata are stored at each checkpoint iteration.
Recently calculated eigenvalues.
Test cluster: LiMa (RRZE, Erlangen): 500 nodes, Xeon 5650 "Westmere" chips (12 cores + SMT), 2.66 GHz, 24 GB RAM, QDR InfiniBand.
13 Benchmarks (II): Failure-detector process
Weak scaling of ping scan, failure detection, and acknowledgement time.
Average ping time per process: ~5-6 µs.
[Figure: weak-scaling plot]
14 Benchmarks (III)
Number of nodes = 256, threads per process = 12, # iterations = 3500, checkpoint frequency = 500.
Failure detection + acknowledgement + re-initialization = 11 s.
Failure detection + re-initialization + redo-work = 64 s.
[Figure: runtime breakdown of computation vs. recovery]
15 Remarks
Worker processes remain undisturbed in a failure-free application run.
Overhead occurs only in case of worker failure(s).
The redo-work after failure recovery depends on the checkpoint frequency.
16 Outlook
Related work:
FT communication:
› MPICH-V
› User-Level Failure Mitigation for MPI (ULFM)
› Fault Tolerant Messaging Interface (FMI)
Node-level checkpoint/restart:
› Fault Tolerance Interface (FTI)
› Scalable Checkpoint/Restart (SCR)
Future work:
› Multiple failure detector processes
› Redundancy for the failure detector processes
› Comparative study: ULFM, SCR
17 Thank you! Questions?