
Fault-Tolerant Communication Runtime Support for Data-Centric Programming Models
Abhinav Vishnu (1), Huub Van Dam (1), Bert De Jong (1), Pavan Balaji (2), Shuaiwen Song (3)
(1) Pacific Northwest National Laboratory, (2) Argonne National Laboratory, (3) Virginia Tech

Faults Are Becoming Increasingly Prevalent
- Many scientific domains have insatiable computational requirements: chemistry, astrophysics, etc.
- Many hundreds of thousands of processing elements are being combined, and hardware faults are becoming increasingly commonplace.
- Designing fault-resilient systems and a fault-resilient system stack is imperative.
- There is a large body of work on checkpoint/restart, both application-driven and transparent (BLCR), and on message logging, which matches MPI semantics and its fixed process model very well.
- Reported MTBF (System / #Cores / MTBF): ASCI Q, ... hrs; ASCI White, ... hrs; Jaguar 4-core, 150K+ cores, 37.5 hrs. (Source: Mueller et al., ICPADS 2010)
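As a rough, standard approximation (not taken from the slides): if node failures are independent and exponentially distributed, the system MTBF shrinks linearly with the node count,

$$ \mathrm{MTBF}_{\mathrm{system}} \approx \frac{\mathrm{MTBF}_{\mathrm{node}}}{N}, $$

so nodes with, say, a 5-year (about 43,800-hour) MTBF already give a system MTBF of only about 4.4 hours at N = 10,000 nodes.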

Quick Overview of Data-Centric / Partitioned Global Address Space (PGAS) Models
- Abstractions for arbitrary distributed data structures: arrays, trees, etc.
- A good fit for irregular global address space accesses; a natural mechanism for load balancing.
- Decoupling of data and computation.
- We focus on Global Arrays: a fundamental building block for other structures; it uses a one-sided communication runtime system (ARMCI) for data transfer.
- What are the requirements for fault tolerance from the runtime?
(Figure: applications sit on the Global Arrays layer, which runs over a fault-tolerant runtime, presenting a global address space view of physically distributed data.)
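As a concrete illustration of the global-address-space view (not taken from the talk), here is a minimal Global Arrays sketch, assuming the standard GA/MPI C API; the array name, sizes, and patch bounds are made up for illustration.

```c
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    GA_Initialize();
    MA_init(C_DBL, 1000000, 1000000);   /* memory allocator used internally by GA */

    int dims[2] = {1024, 1024};
    int chunk[2] = {-1, -1};            /* let GA choose the data distribution */
    int g_a = NGA_Create(C_DBL, 2, dims, "A", chunk);

    /* Any process can read or update any patch, regardless of which node owns it. */
    double block[64][64];
    int lo[2] = {0, 0}, hi[2] = {63, 63}, ld[1] = {64};
    NGA_Get(g_a, lo, hi, block, ld);    /* one-sided read of a 64x64 patch */
    /* ... compute on block locally ... */
    NGA_Put(g_a, lo, hi, block, ld);    /* one-sided write back */

    GA_Sync();
    GA_Destroy(g_a);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}
```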

Task-Based Execution using PGAS: Characteristic of Various Applications
- Each task can be executed by any process and reads/updates arbitrary parts of the global address space.
- Requirements for fault tolerance: continue with the available nodes, recover data on a fault, and keep recovery cost proportional to the degree of failure.
(Figure: processes P0 .. PN-1 pull tasks from a task collection, Get inputs from a read-only global address space, Compute, and Put/Accumulate results into a read-write global address space.)
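A sketch of the Get-Compute-Put/Accumulate task loop described above, assuming GA's C API and its NGA_Read_inc shared counter for dynamic task distribution; task_bounds() and compute() are hypothetical helpers, and the 64x64 patch size is illustrative.

```c
#include "ga.h"

/* Hypothetical helpers (not part of GA): map a task id to an array patch
 * no larger than 64x64, and the local compute kernel. */
void task_bounds(long task, int lo[2], int hi[2]);
void compute(const double *in, double *out, int n);

/* Each process runs this loop. g_counter is a 1-element C_LONG global array
 * used as a shared task counter; g_in is read-only input, g_out is output. */
void task_loop(int g_counter, int g_in, int g_out, long ntasks) {
    double in[64 * 64], out[64 * 64], one = 1.0;
    int zero[1] = {0}, ld[1] = {64};
    long task;

    /* NGA_Read_inc atomically fetches-and-increments the shared counter, so
     * any idle process grabs the next task (natural load balancing). */
    while ((task = NGA_Read_inc(g_counter, zero, 1)) < ntasks) {
        int lo[2], hi[2];
        task_bounds(task, lo, hi);
        NGA_Get(g_in, lo, hi, in, ld);          /* Get: one-sided read of the input patch */
        compute(in, out, 64 * 64);              /* Compute: purely local work */
        NGA_Acc(g_out, lo, hi, out, ld, &one);  /* Put/Accumulate: one-sided update */
    }
    GA_Sync();                                  /* all tasks done, updates visible */
}
```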

Other Questions for Fault Tolerance
(Figure: Nodes 1-4 on an interconnection network; Node 2 loses its power supply.)
- Is Node 2 dead?
- What should the process manager do?
- What about the collective operations and their semantics?
- What about the one-sided operations targeted at the failed node?
- What about the lost data and computation?
We answer these questions in this work.

Provided Solutions (Problem -> Solution)
1. Is Node 2 dead? -> Design a Fault Tolerance Management Infrastructure (FTMI).
2. What should the process manager do? -> Continue execution despite the lost node.
3. What about the lost data and computation? -> Application-based redundancy and recovery.
4. One-sided operations to the failed node? -> Fault-resilient Global Arrays and ARMCI.
5. Collective communication operations and their semantics? -> Design a fault-resilient barrier.

Design of the Fault-Tolerant Communication Runtime System
(Figure: layered stack -- Application; Data Redundancy / Fault Recovery layer; Global Arrays layer; Fault-Resilient ARMCI, including a Fault-Tolerant Barrier and data-redundancy handling; Fault-Resilient Process Manager; Fault Tolerance Management Infrastructure; Network.)

ARMCI: the Underlying Communication Runtime for Global Arrays
- Aggregate Remote Memory Copy Interface (ARMCI).
- Provides one-sided communication primitives: Put, Get, Accumulate, and atomic memory operations.
- Abstract network interfaces; established and available on all leading platforms: Cray XT and XE, IBM Blue Gene/L and /P, and commodity interconnects (InfiniBand, Ethernet, etc.).
- Further upcoming platforms: IBM Blue Waters, Blue Gene/Q, Cray Cascade.
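A minimal sketch of the one-sided primitives listed above, assuming the standard ARMCI C API (ARMCI_Put/Get/Acc over memory allocated with ARMCI_Malloc); the buffer size and target rank are illustrative.

```c
#include <stdlib.h>
#include <mpi.h>
#include "armci.h"

#define NBYTES 4096

int main(int argc, char **argv) {
    int me, nprocs;
    MPI_Init(&argc, &argv);
    ARMCI_Init();                                /* ARMCI runs alongside MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Collectively allocate remotely accessible memory on every process. */
    void **base = malloc(nprocs * sizeof(void *));
    ARMCI_Malloc(base, NBYTES);

    double local[NBYTES / sizeof(double)] = {0};
    double scale = 1.0;
    int target = (me + 1) % nprocs;

    ARMCI_Put(local, base[target], NBYTES, target);   /* one-sided write   */
    ARMCI_Get(base[target], local, NBYTES, target);   /* one-sided read    */
    ARMCI_Acc(ARMCI_ACC_DBL, &scale, local,
              base[target], NBYTES, target);          /* remote accumulate */
    ARMCI_Fence(target);                              /* remote completion */

    MPI_Barrier(MPI_COMM_WORLD);
    ARMCI_Free(base[me]);
    free(base);
    ARMCI_Finalize();
    MPI_Finalize();
    return 0;
}
```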

FTMI Protocols (Example with InfiniBand)
- Ping message: requires a response from the remote process, so it is less reliable (a busy process may simply be slow to reply).
- RDMA read: completes with a reliable notification from the remote network adapter, without involving the remote process, making it the most reliable check.
- No response within the timeout -> Node 2 is declared dead.
(Figure: a node probing Node 2's network adapter over InfiniBand.)
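FTMI's interface is not shown in the talk, so the following is only a hypothetical sketch of the timeout-based decision logic: probe the target with an RDMA-read-style operation and declare the node dead if no completion arrives within the timeout. rdma_read_probe() and poll_probe_completion() are invented placeholders, not real FTMI or verbs calls.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical probe primitives: post a small RDMA read to the target's
 * registered memory, and poll for its completion on the local adapter. */
bool rdma_read_probe(int node);
bool poll_probe_completion(int node);

/* Sketch of the decision logic described on the slide: an RDMA read completes
 * via the remote network adapter (no cooperation from the remote process), so
 * a missing completion within the timeout is treated as a dead node. */
bool ftmi_node_alive(int node, double timeout_sec) {
    if (!rdma_read_probe(node))
        return false;                       /* could not even post the probe */

    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (;;) {
        if (poll_probe_completion(node))
            return true;                    /* reliable notification: node is alive */
        clock_gettime(CLOCK_MONOTONIC, &now);
        double elapsed = (now.tv_sec - start.tv_sec) +
                         (now.tv_nsec - start.tv_nsec) * 1e-9;
        if (elapsed > timeout_sec)
            return false;                   /* no response: declare the node dead */
    }
}
```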

Fault-Resilient Process Manager
- Adapted from the OSU MVAPICH process manager.
- Provides MPI-style (not fault-tolerant) collectives; based on TCP/IP for bootstrapping.
- Generic enough for any machine that has at least an Ethernet control network.
- Ignores TCP/IP errors; the layers above rely on FTMI for more accurate fault information.
(Figure: nodes connected by an interconnection network.)

Expected Data Redundancy Model: Impact on ARMCI
- Staggered data model: each data block has a primary copy and a shadow copy placed on different nodes.
- Simultaneous updates may leave both copies in an inconsistent state, so the copies must be updated one at a time.
- Every write-based primitive (Put/Acc) must therefore be fenced:
  WaitProc - wait for all non-blocking operations to complete;
  Fence - ensure all writes to a process have finished.
(Figure: primary copies on nodes N1-N4; shadow copies staggered onto N4, N1, N2, N3.)
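A sketch (not the paper's actual code) of the fenced, one-copy-at-a-time update described above, assuming the standard ARMCI Put/Fence calls; the placement helpers (primary_rank, shadow_rank, primary_ptr, shadow_ptr) are hypothetical.

```c
#include "armci.h"

/* Hypothetical placement helpers: where the primary and shadow copies of a
 * block live and their remote addresses (the staggered model maps the shadow
 * of block i to a different node than its primary). */
int   primary_rank(int block);
int   shadow_rank(int block);
void *primary_ptr(int block);
void *shadow_ptr(int block);

/* Update the two copies one by one, fencing each write so that a failure
 * between the two updates leaves at most one copy stale, never both copies
 * half-written. */
void redundant_put(int block, void *src, int bytes) {
    int p = primary_rank(block);
    int s = shadow_rank(block);

    ARMCI_Put(src, primary_ptr(block), bytes, p);
    ARMCI_Fence(p);               /* primary copy fully written before proceeding */

    ARMCI_Put(src, shadow_ptr(block), bytes, s);
    ARMCI_Fence(s);               /* shadow copy is now consistent as well */
}
```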

Fault-Resilient ARMCI: Communication Protocols
- Communication in ARMCI proceeds in multiple phases; Put/Get/Acc are implemented as combinations of these phases.
- On failures, either the process or its communication thread may be left waiting for data from a process that is dead.
- A timeout-based FTMI check is used to detect such failures; if FTMI detects a failure, the operation returns an error where necessary.
(Figure: protocol phases between a process and the asynchronous agent process on the target node.)
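Continuing the hypothetical sketch from the FTMI slide, this shows where a timeout-based FTMI check could slot into a blocking protocol phase so that a Get/Put/Acc targeting a dead node returns an error instead of hanging; data_arrived() and the error code are invented for illustration.

```c
#include <stdbool.h>

/* From the earlier sketch; both are illustrative, not ARMCI's real internals. */
bool ftmi_node_alive(int node, double timeout_sec);
bool data_arrived(int target);        /* hypothetical: has the expected response arrived? */

#define FT_TIMEOUT_SEC 5.0
#define ARMCI_ERR_NODE_DEAD (-1)

/* Sketch of a fault-aware wait inside a Get-style protocol phase: instead of
 * blocking forever on a dead target, periodically consult FTMI and return an
 * error so the upper layers (Global Arrays / application) can trigger recovery. */
int wait_for_response(int target) {
    while (!data_arrived(target)) {
        if (!ftmi_node_alive(target, FT_TIMEOUT_SEC))
            return ARMCI_ERR_NODE_DEAD;   /* propagate the failure instead of hanging */
    }
    return 0;
}
```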

Fault-Resilient Collective Communication Primitives
- Barrier is the fundamental non-data-moving collective communication operation.
- GA_Sync = Fence to all processes + Barrier; it is used at various execution points in ARMCI.
- We have implemented multiple fault-tolerant barrier algorithms: a fault-tolerant version based on a high-concurrency all-to-all personalized exchange, and a fault-tolerant version of a hypercube-based implementation (see the sketch below).
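The slide does not give the algorithms' details, so the following is only a simplified, hypothetical sketch of a hypercube-style barrier with a per-round failure check (assuming nprocs is a power of two); send_token(), recv_token(), and mark_failed() are invented placeholders, and a real fault-tolerant barrier needs additional care that is omitted here.

```c
#include <stdbool.h>

/* Hypothetical primitives -- illustrative only, not the paper's implementation. */
void send_token(int peer);                      /* send a zero-byte barrier token  */
bool recv_token(int peer, double timeout_sec);  /* false if the peer never answers */
bool ftmi_node_alive(int peer, double timeout_sec);
void mark_failed(int peer);                     /* record the failure for upper layers */

#define FT_TIMEOUT_SEC 5.0

/* Hypercube barrier: in round d each process exchanges a token with the
 * partner whose rank differs in bit d. The fault-tolerant twist sketched here:
 * if a partner does not answer and FTMI confirms it is dead, the round is
 * skipped rather than hanging forever. */
void ft_barrier(int me, int nprocs) {
    for (int d = 1; d < nprocs; d <<= 1) {
        int partner = me ^ d;
        send_token(partner);
        if (!recv_token(partner, FT_TIMEOUT_SEC)) {
            if (!ftmi_node_alive(partner, FT_TIMEOUT_SEC))
                mark_failed(partner);   /* continue the barrier without the dead peer */
        }
    }
}
```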

Usage of FTMI
- General functionality for fault detection; a potential component of a fault-tolerance backplane.
- ARMCI layer: used in the different phases of the one-sided communication protocols (Put, Get, Accumulate).
- Fault-resilient barrier: used in the different steps of the fault-tolerant algorithm.
- Potential use at the application layer for designing recovery algorithms.

Performance Evaluation Methodology
- Comparison of the following approaches: FT-ARMCI with no fault; FT-ARMCI with one actual node fault; and the original (non-fault-tolerant) ARMCI.
- Testbed: InfiniBand DDR with AMD Barcelona CPUs; 128 nodes used for the performance evaluation.

Overhead of FT-ARMCI (No Faults)
- Latency comparison of the original and FT-ARMCI implementations for the Put and Accumulate primitives; the objective is to understand the overhead on pure communication benchmarks.
- Overhead is observed due to the synchronous writes.
- Possible future performance optimization: piggyback the acknowledgement on the data.

Performance Evaluation with Faults
- Pure communication benchmark; a real scientific application would combine computation and communication.
- The objective is to understand the overhead when a fault occurs.
- The primary overhead is the hardware timeout incurred for fault detection.
- The test continues to execute when actual faults occur.

Conclusions
- We presented the execution models of data-centric programming models with respect to fault tolerance.
- We presented the design of a fault-tolerant communication runtime system: a fault tolerance management infrastructure, a process manager and barrier collective operation, and fault-tolerant communication protocols.
- Our evaluation shows: continued execution in the presence of node faults; an overhead in the absence of faults that can still be improved; and acceptable degradation in the presence of node faults.

Future Work
- This is part of active R&D: leverage the infrastructure for Global Arrays, with significant application impact.
- Handle simultaneous and consecutive failures.
- Actively pursue designs for upcoming high-end systems: Blue Waters, Gemini, etc.
Acknowledgements: eXtreme Scale Computing Initiative (PNNL), US Department of Energy, Molecular Science Computing (PNNL).