MPI Must Evolve or Die!
Al Geist, Oak Ridge National Laboratory
EuroPVM-MPI Conference, Dublin, Ireland, September 9, 2008
Research sponsored by ASCR. Managed by UT-Battelle for the Department of Energy.
(Title graphic: heterogeneous multi-core)

Slide 2: Exascale software solutions can't rely on "then a miracle occurs"
The hardware developer describing an exascale design says, "Then a miracle occurs." The system software engineer replies, "I think you should be more explicit."

Slide 3: Acknowledgements
Harness Research Project (Geist, Dongarra, Sunderam). The same team that created PVM has continued the exploration of heterogeneous and adaptive computing. This talk presents ideas and research from the Harness project team members: Bob Manchek, Graham Fagg, June Donato, Jelena Pješivac-Grbović, George Bosilca, Thara Angskun, Magdalena Slawinska, Jaroslaw Slawinski, and Edgar Gabriel. Apologies to anyone I missed. Research sponsored by ASCR.
An interesting observation: PVM use is starting to grow again. Support questions have doubled in the past year, including queries from HPC users who are desperate for fault tolerance.

Slide 4: Example of a Petaflops System - ORNL (late 2008)
Multi-core, homogeneous, multiple programming models. DOE Cray "Baker" 1-petaflops system:
- 13,944 dual-socket, 8-core SMP nodes with 16 GB each
- 27,888 quad-core Barcelona processors at 2.3 GHz (37 Gflops each)
- 223 TB memory (2 GB/core)
- 200+ GB/s disk bandwidth, 10 PB storage
- 6.5 MW system power
- 150 liquid-cooled cabinets, 3,400 ft²
- Compute Node Linux operating system

Slide 5: MPI Dominates Petascale Communication
A survey of the top HPC open science applications, marking for each whether it "must have" or "can use" MPI.

Slide 6: The answer is MPI. What is the question?
Applications may continue to use MPI due to:
- Inertia: these codes take decades to create and validate.
- Nothing better: developers need a BIG incentive to rewrite (not 50%).
Communication libraries are being changed to exploit new petascale systems, giving applications more life. Hardware support for MPI is pushing this out even further.
Business as usual has been to improve latency and/or bandwidth, but large-scale, many-core, heterogeneous architectures require us to think further outside the box. It is not business as usual inside petascale communication libraries:
- Hierarchical algorithms
- Hybrid algorithms
- Dynamic algorithm selection
- Fault tolerance

Slide 7: Hierarchical Algorithms
Hierarchical algorithm designs consolidate information at different levels of the architecture to reduce the number of messages and the contention on the interconnect. Architecture levels: socket, node, board, cabinet, switch, system.
The PVM project studied hierarchical collective algorithms using clusters of clusters (a simple two-level model) in which communication within a cluster was 10X faster than between clusters. It found improvements in the range of 2X-5X, but the idea was not pursued because HPC machines at the time had only one level. It needs rethinking for petascale systems; a two-level sketch follows.
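
To make the two-level idea concrete, here is a minimal hierarchical allreduce sketch in C. It is an illustration, not the Harness/PVM code: it assumes an MPI-3 library for MPI_Comm_split_type (which post-dates this talk) and handles only a sum of doubles.

    /* Two-level hierarchical allreduce sketch: reduce within each node,
     * allreduce among one leader per node, then broadcast back locally.
     * sendbuf and recvbuf must be distinct buffers of length count. */
    #include <mpi.h>

    int hier_allreduce_sum(const double *sendbuf, double *recvbuf, int count,
                           MPI_Comm comm)
    {
        MPI_Comm node;     /* ranks sharing a node (shared memory) */
        MPI_Comm leaders;  /* one representative rank per node */
        int node_rank;

        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node);
        MPI_Comm_rank(node, &node_rank);

        /* Node leaders (local rank 0) form the inter-node communicator. */
        MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leaders);

        /* Step 1: intra-node reduction to the local leader. */
        MPI_Reduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM, 0, node);

        /* Step 2: inter-node allreduce among the leaders only. */
        if (leaders != MPI_COMM_NULL) {
            MPI_Allreduce(MPI_IN_PLACE, recvbuf, count, MPI_DOUBLE, MPI_SUM,
                          leaders);
            MPI_Comm_free(&leaders);
        }

        /* Step 3: broadcast the result back within each node. */
        MPI_Bcast(recvbuf, count, MPI_DOUBLE, 0, node);
        MPI_Comm_free(&node);
        return MPI_SUCCESS;
    }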

Slide 8: Hybrid Algorithms
Hybrid algorithm designs use different algorithms at different levels of the architecture: for example, a shared-memory algorithm within a node or on an accelerator board such as Cell, and a message-passing algorithm between nodes.
The PVM project studied hybrid message-passing algorithms using heterogeneous parallel virtual machines, with communication optimized to the custom hardware within each computer. Today all MPI implementations do this to some extent, but there is more to be done for new heterogeneous systems such as Roadrunner.
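
A minimal sketch of the hybrid pattern, assuming MPI plus OpenMP threads rather than the Cell/Roadrunner accelerator path mentioned above: the threads reduce in shared memory inside the node, and a single message-passing call combines the per-node results.

    /* Hybrid sketch: OpenMP reduction in shared memory within the node,
     * one MPI_Allreduce between nodes (assumes one MPI rank per node and
     * an MPI library initialized with at least MPI_THREAD_FUNNELED). */
    #include <mpi.h>
    #include <omp.h>

    double hybrid_sum(const double *x, long n, MPI_Comm comm)
    {
        double local = 0.0, global = 0.0;
        long i;

        /* Shared-memory algorithm inside the node: threaded reduction. */
        #pragma omp parallel for reduction(+:local)
        for (i = 0; i < n; i++)
            local += x[i];

        /* Message-passing algorithm between nodes. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global;
    }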

Slide 9: Adaptive Communication Libraries
The Harness project explored adaptive MPI collectives: the algorithm is dynamically selected from a set of collective communication algorithms based on multiple metrics, such as:
- the number of tasks being sent to
- where those tasks are located in the system
- the size of the message being sent
- the physical topology and particular quirks of the system
At run time, a decision function is invoked to select the best algorithm for the particular collective call. Steps in the optimization process:
1. Implement the different MPI collective algorithms.
2. Collect MPI collective algorithm performance information to determine the optimal implementation for each case.
3. Run the decision/algorithm selection process.
4. Automatically generate the decision function code based on step 3.
A sketch of such a decision function follows.
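
What such a decision function might look like is sketched below as a hand-written illustration; the thresholds and algorithm names are invented for this example, whereas the real approach generates the function automatically from measured performance data (step 4 above).

    /* Hypothetical decision function: pick a broadcast algorithm from the
     * communicator size and message size. The cut-off values here are
     * placeholders, not measured switching points. */
    #include <stddef.h>

    typedef enum { BCAST_LINEAR, BCAST_BINOMIAL, BCAST_PIPELINE } bcast_alg_t;

    static bcast_alg_t choose_bcast(int comm_size, size_t msg_bytes)
    {
        if (comm_size <= 8)
            return BCAST_LINEAR;    /* few tasks: root sends to each directly */
        if (msg_bytes < 64 * 1024)
            return BCAST_BINOMIAL;  /* small messages: log(P) latency wins */
        return BCAST_PIPELINE;      /* large messages: overlap segments */
    }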

Slide 10: Harness Adaptive Collective Communication
Tuning pipeline, performed just once on a given machine:
- MPI collective algorithm implementations
- Exhaustive testing and/or performance modeling
- Optimal MPI collective implementation data
- Decision process
- Decision function

Slide 11: Decision/Algorithm Selection Process
Three different approaches were explored:
- Parametric data modeling: use algorithm performance models (Hockney, LogGP, PLogP, ...) to select the algorithm with the shortest predicted completion time (see the sketch below).
- Image encoding techniques: use graphics encoding algorithms to capture the switching points between algorithms.
- Statistical learning methods: use statistical learning to find patterns in algorithm performance data and to construct decision systems.
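
As an example of the parametric approach, a Hockney-style model predicts the time of one point-to-point message as alpha + beta*m (latency plus message size over bandwidth). The sketch below compares two broadcast algorithms under that model; alpha and beta are placeholders to be fitted from measurements on the target machine.

    /* Hockney-model sketch: one message of m bytes costs alpha + beta*m.
     * A linear broadcast does P-1 sequential sends from the root; a
     * binomial-tree broadcast needs ceil(log2(P)) rounds. The selection
     * rule is: pick the algorithm with the smaller predicted time. */
    #include <math.h>

    double t_linear_bcast(int P, double m, double alpha, double beta)
    {
        return (P - 1) * (alpha + beta * m);
    }

    double t_binomial_bcast(int P, double m, double alpha, double beta)
    {
        return ceil(log2((double)P)) * (alpha + beta * m);
    }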

Slide 12: Fault Tolerant Communication
The Harness project is where FT-MPI was created, to explore ways that MPI could be modified to allow applications to run through faults.
Accomplishments of the FT-MPI research:
- Define the behavior of MPI in case an error occurs.
- Give the application the possibility to recover from a node failure.
- A regular, non-fault-tolerant MPI program will run under FT-MPI.
- Stick to the MPI-1 and MPI-2 specifications as closely as possible (e.g., no additional function calls).
- Provide notification to the application (illustrated in the sketch below).
- Provide recovery options for the application to exploit if desired.
What FT-MPI does not do:
- Recover user data (e.g., automatic checkpointing).
- Provide transparent fault tolerance.
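
The slide does not show FT-MPI's own interfaces, but the notification idea can be illustrated with standard MPI calls: switch the communicator's error handler from the default MPI_ERRORS_ARE_FATAL so that failures are reported back to the application instead of aborting it.

    /* Notification sketch using standard MPI (not the FT-MPI API): ask the
     * library to return errors rather than abort, then check return codes
     * so the application can decide how to react. */
    #include <mpi.h>
    #include <stdio.h>

    void enable_error_notification(MPI_Comm comm)
    {
        /* Default handler is MPI_ERRORS_ARE_FATAL; return errors instead. */
        MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
    }

    int checked_barrier(MPI_Comm comm)
    {
        int rc = MPI_Barrier(comm);
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "Barrier failed: %s\n", msg); /* notify, not abort */
        }
        return rc;
    }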

Slide 13: FT-MPI Recovery Options
The key to allowing MPI applications to run through faults is a COMM_CREATE that can build a new MPI_COMM_WORLD. Four options were explored (summarized in the sketch below):
- ABORT: just do as other implementations do.
- BLANK: leave a hole.
- SHRINK: re-order processes to make a contiguous communicator; some ranks change.
- REBUILD: re-spawn lost processes and add them to MPI_COMM_WORLD.
As a convenience, a fifth option that shrinks or rebuilds ALL communicators inside an application at once was also investigated.
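
To summarize the four modes in code form, here is a hypothetical sketch; the enum and comments are for illustration only and do not reproduce the actual FT-MPI interface.

    /* Hypothetical summary of the FT-MPI recovery modes (illustration only;
     * these names are not the real FT-MPI API). */
    typedef enum {
        RECOVERY_ABORT,   /* behave like other MPI implementations: abort */
        RECOVERY_BLANK,   /* keep the communicator size, leave a hole at the
                             failed rank */
        RECOVERY_SHRINK,  /* re-order survivors into a contiguous communicator;
                             some ranks change */
        RECOVERY_REBUILD  /* re-spawn lost processes and add them back into
                             MPI_COMM_WORLD */
    } recovery_mode_t;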

Slide 14: Future of Fault Tolerant Communication
The fault tolerant capabilities and datatypes in FT-MPI are now becoming part of the Open MPI effort. Fault tolerance is under consideration by the MPI Forum as a possible addition to the MPI-3 standard.

Slide 15: Getting Applications to Use This Stuff
All these new multi-core, heterogeneous algorithms and techniques are for naught if we don't get the science teams to use them. ORNL's Leadership Computing Facility uses a couple of key methods to get the latest algorithms and system-specific features used by the science teams: science liaisons and centers of excellence.
Science liaisons from the Scientific Computing Group are assigned to every science team on the leadership system. Their duties include:
- Scaling algorithms to the required size
- Application and library code optimization and scaling
- Exploiting parallel I/O and other technologies in applications
- More...

Slide 16: Centers of Excellence
ORNL has a Cray Center of Excellence and a Lustre Center of Excellence. One of their missions is to have vendor engineers engage directly with users to help them apply the latest techniques for scalable performance. But having science liaisons and help from vendor engineers is not a scalable solution for the larger community of users, so we are creating...

Slide 17: Harness Workbench for Science Teams
Help the user by building a tool that can apply the basic knowledge of the developer, the administrator, and the vendor. The workbench is built on Eclipse (Parallel Tools Platform), is integrated with the runtime, and will be available to LCF science liaisons this summer.

Slide 18: Next Generation Runtime - Scalable Tool Communication Infrastructure (STCI)
The Harness runtime environment (underlying the Harness workbench, adaptive communication, and fault recovery) and the open runtime environment OpenRTE (underlying Open MPI) were generalized into the adopted emerging runtime, the Scalable Tool Communication Infrastructure. STCI provides execution contexts, sessions, communications, persistence, and security: high-performance, scalable, resilient, and portable communication and process-control services for user and system tools, including the parallel run-time environment (MPI), application correctness tools, performance analysis tools, and system monitoring and management.

Slide 19: Petascale to Exascale Requires a New Approach
- Try to break the cycle of hardware vendors throwing the latest giant system over the fence and leaving it to the system software developers and applications to figure out how to use the latest hardware (billion-way parallelism at exascale).
- Try to get applications to rethink their algorithms, and even their physics, in order to better match what the hardware can give them (the memory wall isn't going away).
- Meet in the middle: change what a "balanced system" means.
Synergistically developing architectures and algorithms together: creating a revolution in evolution. The Institute for Advanced Architectures and Algorithms has been established as a Sandia/ORNL joint effort to facilitate the co-design of architectures and algorithms in order to create synergy in their respective evolutions.

Slide 20: Summary
It is not business as usual for petascale communication; it is no longer just about improved latency and bandwidth. But MPI is not going away. Communication libraries are adapting:
- Hierarchical algorithms
- Hybrid algorithms
- Dynamically selected algorithms
- Run-through fault tolerance
But we have to get applications to use these new ideas. Going to exascale, communication needs a fundamental shift: break the deadly cycle of hardware being thrown over the fence for the software developers to figure out how to use. Is this crazy talk? Evolve or die.

Slide 21: Questions?