How to Replace MPI as the Programming Model of the Future
William Gropp
Argonne National Laboratory, a U.S. Department of Energy Office of Science Laboratory operated by The University of Chicago

Quotes from "System Software and Tools for High Performance Computing Environments" (1993)

- "The strongest desire expressed by these users was simply to satisfy the urgent need to get applications codes running on parallel machines as quickly as possible"
- In a list of enabling technologies for mathematical software: "Parallel prefix for arbitrary user-defined associative operations should be supported. Conflicts between system and library (e.g., in message types) should be automatically avoided."
  - Note that MPI-1 provided both (a sketch follows this slide)
- Immediate Goals for Computing Environments:
  - Parallel computer support environment
  - Standards for same
  - Standard for parallel I/O
  - Standard for message passing on distributed memory machines
- "The single greatest hindrance to significant penetration of MPP technology in scientific computing is the absence of common programming interfaces across various parallel computing systems"
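As a concrete illustration of the two quoted requirements that MPI-1 met, here is a minimal sketch (not from the original slides): MPI_Op_create registers a user-defined associative operation, MPI_Scan computes a parallel prefix with it, and communicators keep library traffic from colliding with application messages. The choice of a component-wise maximum is arbitrary.

/* Sketch: parallel prefix over a user-defined associative operation.
 * The operation (component-wise maximum) is only an example. */
#include <mpi.h>
#include <stdio.h>

/* User-defined associative operation with the MPI_User_function signature. */
static void vec_max(void *in, void *inout, int *len, MPI_Datatype *dtype)
{
    double *a = (double *)in, *b = (double *)inout;
    for (int i = 0; i < *len; i++)
        if (a[i] > b[i]) b[i] = a[i];
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Op op;
    MPI_Op_create(vec_max, 1 /* commutative */, &op);

    double local = (double)rank, prefix;
    MPI_Scan(&local, &prefix, 1, MPI_DOUBLE, op, MPI_COMM_WORLD);
    printf("rank %d: running max = %g\n", rank, prefix);

    MPI_Op_free(&op);
    MPI_Finalize();
    return 0;
}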

Quotes from "Enabling Technologies for Petaflops Computing" (1995)

- "The software for the current generation of 100 GF machines is not adequate to be scaled to a TF…"
- "The Petaflops computer is achievable at reasonable cost with technology available in about 20 years [2014]."
  - (estimated clock speed in 2004: 700 MHz)
- "Software technology for MPP's must evolve new ways to design software that is portable across a wide variety of computer architectures. Only then can the small but important MPP sector of the computer hardware market leverage the massive investment that is being applied to commercial software for the business and commodity computer market."
- "To address the inadequate state of software productivity, there is a need to develop language systems able to integrate software components that use different paradigms and language dialects."
  - (9 overlapping programming models, including shared memory, message passing, data parallel, distributed shared memory, functional programming, O-O programming, and evolution of existing languages)

Some History

The Cray 1 was a very successful machine. Why?
- It was not the first or fastest vector machine (the CDC Star 100 and later Cyber 203/205 had much higher peaks)
- Early users were thrilled that the Cray 1 was 1.76x faster than the CDC machine (22/12.5, the ratio of the scalar clock speeds)
  - Applications were not vectorizing!
  - These were legacy applications
- Balanced performance (by comparison with the Star 100 etc.)
- The hardware supported the programming model; performance was predictable
- The software (compiler) trained users
  - It worked with the user to performance-tune code
- Of course, this was before the memory wall

The Curse of Latency

Current vector systems (often) do not perform well on sparse matrix-vector code
- Despite the fact that an excellent predictor of performance for these algorithms on cache-based systems is STREAM bandwidth (a sketch follows this slide)
- And despite the fact that there is a great deal of concurrency

Lesson: balance and Amdahl's law
- Many thought Amdahl's law would limit the success of massively parallel systems
- Reality: it is often easier to find concurrent scalar threads in modern algorithms than scalar-free vector code
- Don't confuse necessary with sufficient
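To make the STREAM-bandwidth claim concrete, here is an illustrative sparse matrix-vector product in CSR format (not from the slides): every nonzero value and column index is read exactly once, so the loop streams through memory with little reuse and its speed is set by memory bandwidth rather than peak flop rate, while the indirect access through col[] frustrates classic vectorization.

/* Sketch: CSR sparse matrix-vector product y = A*x.
 * Bandwidth-bound: val[] and col[] are each touched once. */
void spmv_csr(int n, const int *rowptr, const int *col,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
            sum += val[j] * x[col[j]];   /* indirect (gather) access */
        y[i] = sum;
    }
}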

How Expensive Is MPI?

Measured numbers from a still-untuned MPICH2 shared-memory implementation:
- MPI_Recv: instructions: 323; cycles: (value missing); L1 data-cache misses: 29; L1 instruction-cache misses: 2
- MPI_Send: instructions: 278; cycles: 995; L1 data-cache misses: 7; L1 instruction-cache misses: 3

This may seem like a lot, but much of the cycle cost is latency to remote memory, not overhead.
- On "conventional" systems, eliminating MPI overhead gives less than a factor of 2 improvement
- Need latency hiding and enough concurrent, independent threads to make the effective cost of an operation the overhead cost instead of the latency cost (a sketch follows this slide)
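A hedged sketch of the latency-hiding point, with assumed names (halo, interior, neighbor) that are not from the slides: post the receive early, do work that does not depend on the incoming data while the message is in flight, and only then wait, so the effective cost approaches the (small) overhead rather than the full latency.

#include <mpi.h>

/* Sketch: overlap communication with independent computation. */
void exchange_and_compute(double *halo, int count, int neighbor,
                          MPI_Comm comm, double *interior, int n)
{
    MPI_Request req;
    /* Post the receive before the data is needed ... */
    MPI_Irecv(halo, count, MPI_DOUBLE, neighbor, 0, comm, &req);

    /* ... and keep busy with work that does not touch halo[]
     * while the message is in flight. */
    for (int i = 0; i < n; i++)
        interior[i] *= 2.0;

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* any remaining latency is paid here */
}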

What Performance Can Compilers Deliver?

Here are a few questions for a computer vendor:
1. Do you have an optimized DGEMM?
   - Did you do it by hand?
   - Did you use ATLAS?
   - Should users choose it over the reference implementation from netlib?
2. Do you have an optimizing Fortran compiler?
   - Is it effective?

Aren't the answers to 1 and 2 incompatible?

Conclusion: writing optimizing compilers, even for the easiest, most studied case, is hard (a comparison sketch follows this slide).
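An illustrative pairing, not from the slides, of what the DGEMM question contrasts: the "natural" triple loop a compiler must optimize versus a single call into a tuned BLAS (vendor library or ATLAS). Linking against a CBLAS implementation is assumed; the gap between the two routines is exactly what the compiler would have to close on its own.

#include <cblas.h>

/* The code a user would naturally write (row-major, square n x n). */
void matmul_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

/* The same operation via a hand-tuned or ATLAS-generated DGEMM. */
void matmul_blas(int n, const double *A, const double *B, double *C)
{
    /* C = 1.0 * A * B + 0.0 * C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
}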

Parallelizing Compilers Are Not the Answer

[Figure from ATLAS: compiler-generated versus hand-tuned performance]
- Enormous effort is required to get good performance
- There is a large gap between natural code and specialized code

Manual Decomposition of Data Structures

Trick!
- This is from a paper on dense matrix tiling for uniprocessors! (a tiling sketch follows this slide)

This suggests that managing data decompositions is a common problem for real machines, whether they are parallel or not
- It is not just an artifact of MPI-style programming
- Aiding programmers in data structure decomposition is an important part of solving the productivity puzzle
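A sketch of the kind of uniprocessor "decomposition" the slide refers to (the block size and loop structure are illustrative assumptions, not taken from the cited paper): a dense matrix multiply blocked into tiles so each tile fits in cache. The data-layout bookkeeping is the same sort of burden a parallel decomposition imposes.

/* Sketch: cache-blocked (tiled) matrix multiply, C += A*B.
 * Assumes C is zero-initialized; BS is a tuning parameter. */
#define BS 64

void matmul_tiled(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                /* work on one cache-sized tile at a time */
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int k = kk; k < kk + BS && k < n; k++) {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}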

Why Was MPI Successful?

It addresses all of the following issues:
- Portability
- Performance
- Simplicity and symmetry
- Modularity
- Composability
- Completeness

For a more complete discussion, see "Learning from the Success of MPI", /mpi-lessons.pdf

Portability and Performance

Portability does not require a "lowest common denominator" approach
- Good design allows the use of special, performance-enhancing features without requiring hardware support
- For example, MPI's nonblocking message-passing semantics allow, but do not require, "zero-copy" data transfers (a sketch follows this slide)

MPI is really a "Greatest Common Denominator" approach
- It is a "common denominator" approach; this is the portability part
  - To fix this, you need to change the hardware (change what is "common")
- It is a (nearly) "greatest" approach in that, within the design space (which includes a library-based approach), changes don't improve the approach
  - "Least" suggests that it would be easy to improve; by definition, any change would improve it
  - Have a suggestion that meets the requirements? Let's talk!
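A small sketch of the semantic rule behind the zero-copy remark, with assumed names (buf, dest): between a nonblocking send and its completing wait, the application may not modify the buffer. That rule is what allows an implementation to move the data with no intermediate copies (for example, directly by RDMA) when the hardware permits, without the standard requiring such hardware.

#include <mpi.h>

/* Sketch: the buffer is off-limits until MPI_Wait completes. */
void send_when_ready(const double *buf, int count, int dest, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Isend(buf, count, MPI_DOUBLE, dest, 0, comm, &req);
    /* ... no writes to buf here: the library may still be reading it,
     * possibly transferring it with zero copies ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    /* buf may now be reused */
}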

Simplicity and Symmetry

MPI is organized around a small number of concepts
- The number of routines is not a good measure of complexity
- E.g., Fortran
  - Large number of intrinsic functions
- C and Java runtimes are large
- Development frameworks
  - Hundreds to thousands of methods
- This doesn't bother millions of programmers

Symmetry

Exceptions are hard on users
- But easy on implementers: less to implement and test

Example: MPI_Issend
- MPI provides several send modes:
  - Regular
  - Synchronous
  - Receiver ready
  - Buffered
- Each send can be blocking or non-blocking
- MPI provides all combinations (symmetry), including the "nonblocking synchronous send"
  - Removing it would slightly simplify implementations
  - But users would then need to remember which routines are provided, rather than only the concepts
- It turns out that MPI_Issend is useful in building performance and correctness debugging tools for MPI programs (a sketch follows this slide)
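A hedged sketch of one such debugging use (the wrapper idea is illustrative, not quoted from the slides): by intercepting MPI_Send through the standard PMPI profiling interface and issuing a nonblocking synchronous send instead, a tool exposes programs that unportably rely on system buffering, since the call now returns only after the matching receive has started.

#include <mpi.h>

/* Sketch: PMPI wrapper that replaces standard sends with synchronous sends.
 * (The const qualifier on buf follows MPI-3; drop it for older MPI headers.) */
int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    MPI_Request req;
    PMPI_Issend(buf, count, type, dest, tag, comm, &req);
    /* Completes only when the matching receive has been posted,
     * so reliance on buffering shows up as a stall or deadlock. */
    return PMPI_Wait(&req, MPI_STATUS_IGNORE);
}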

Modularity

Modern algorithms are hierarchical
- Do not assume that all operations involve all or only one process
- Provide tools that don't limit the user

Modern software is built from components
- MPI was designed to support libraries
- Example: communication contexts (see the sketch after this slide)
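To make the "communication contexts" example concrete, here is a minimal sketch (lib_init and lib_comm are illustrative names): a library duplicates the user's communicator once at initialization, so its internal messages can never match sends or receives posted by the application or by another library.

#include <mpi.h>

/* Sketch: a library claims its own communication context. */
static MPI_Comm lib_comm = MPI_COMM_NULL;

void lib_init(MPI_Comm user_comm)
{
    /* Same process group as the caller, distinct communication context. */
    MPI_Comm_dup(user_comm, &lib_comm);
}

void lib_finalize(void)
{
    if (lib_comm != MPI_COMM_NULL)
        MPI_Comm_free(&lib_comm);
}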

Composability

Environments are built from components
- Compilers, libraries, runtime systems
- MPI was designed to "play well with others"

MPI exploits the newest advances in compilers
- … without ever talking to compiler writers
- OpenMP is an example
  - MPI (the standard) required no changes to work with OpenMP (see the hybrid sketch after this slide)
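An illustrative hybrid program (not from the slides) showing the two models composing: OpenMP parallelizes the node-local loop while MPI combines results across processes. MPI_Init_thread (added in MPI-2) merely declares how the two will be mixed.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* FUNNELED: only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < 1000000; i++)
        sum += 1.0 / (i + 1.0);           /* node-local, threaded work */

    double total;
    MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}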

Completeness

MPI provides a complete parallel programming model and avoids simplifications that limit the model
- Contrast: models that require that synchronization only occur collectively for all processes or tasks

Make sure the functionality is there when the user needs it
- Don't force the user to start over with a new programming model when a new feature is needed

Conclusions: Lessons From MPI

A successful parallel programming model must enable more than the simple problems
- It is nice that those are easy, but they weren't that hard to begin with

Scalability is essential
- Why bother with limited parallelism?
- Just wait a few months for the next generation of hardware

Performance is equally important
- But not at the cost of the other items

More Lessons

A general programming model for high-performance technical computing must address many issues to succeed, including:

Completeness
- Support the evolution of applications

Simplicity
- Focus on users, not implementors
- Symmetry reduces the burden on users

Portability rides the hardware wave
- Sacrifice it only if the advantage is huge and persistent
- Competitive performance and elegant design are not enough

Composability rides the software wave
- Leverage improvements in compilers, runtimes, algorithms
- Match the hierarchical nature of systems

Improving Parallel Programming

How can we make the programming of real applications easier?

Problems with the message-passing model
- The user is responsible for data decomposition
- "Action at a distance"
  - Matching sends and receives
  - Remote memory access
- Performance costs of a library (no compile-time optimizations)
- The need to choose a particular set of calls to match the hardware

In summary: the lack of abstractions that match the applications

Challenges

Must avoid the traps:
- The challenge is not to make easy programs easier. The challenge is to make hard programs possible.
- We need a "well-posedness" concept for programming tasks
  - Small changes in the requirements should only require small changes in the code
  - Rarely a property of "high productivity" languages
  - Abstractions that make easy programs easier don't solve the problem
- Latency hiding is not the same as low latency
  - Need "support for aggregate operations on large collections"

An even harder challenge: make it hard to write incorrect programs
- OpenMP is not a step in the (entirely) right direction
- In general, current shared-memory programming models are very dangerous
  - They also perform action at a distance
  - They require a kind of user-managed data decomposition to preserve performance without the cost of locks/memory atomic operations
- Deterministic algorithms should have provably deterministic implementations

What Is Needed to Achieve Real High-Productivity Programming

Simplify the construction of correct, high-performance applications

Managing data decompositions
- Necessary for both parallel and uniprocessor applications
- Many levels must be managed
- Strong dependence on the problem domain (e.g., halos, load-balanced decompositions, dynamic vs. static); a hand-written halo exchange sketch follows this slide

Possible approaches
- Language-based
  - Limited by predefined decompositions
  - Some are more powerful than others; Divacon provided built-in divide and conquer
- Library-based
  - Overhead of a library (including the lack of compile-time optimizations); tradeoffs between the number of routines, performance, and generality
- Domain-specific languages …
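A hedged sketch of the decomposition bookkeeping discussed above, under assumed conventions (a 1-D block decomposition, array u[0..n+1] with u[0] and u[n+1] as ghost cells): each rank exchanges one ghost cell with each neighbor. Today the programmer writes this ownership and ghost-cell logic by hand; the argument is that tools should help manage it.

#include <mpi.h>

/* Sketch: 1-D halo (ghost-cell) exchange with nearest neighbors. */
void halo_exchange(double *u, int n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send my first owned cell left, receive the right neighbor's into u[n+1] */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[n + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    /* send my last owned cell right, receive the left neighbor's into u[0] */
    MPI_Sendrecv(&u[n], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 comm, MPI_STATUS_IGNORE);
}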

Domain-Specific Languages

A possible solution, particularly when mixed with adaptable runtimes

Exploit composition of software (e.g., work with existing compilers; don't try to duplicate or replace them)

Example: mesh handling
- Standard rules can define the mesh
  - Including "new" meshes, such as C-grids
- Alternate mappings are easily applied (e.g., Morton orderings; a sketch follows this slide)
- Careful source-to-source methods can preserve human-readable code
- In the longer term, debuggers could learn to handle programs built with language composition (they already handle two languages: assembly and C/Fortran/…)

Provides a single "user abstraction" whose implementation may use the composition of hierarchical models
- Also provides a good way to integrate performance engineering into the application
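As an illustration of the "alternate mappings" bullet (the helper and its 16-bit coordinate limit are assumptions for the example): a Morton (Z-order) index interleaves the bits of the (i, j) coordinates, so cells that are close in 2-D tend to be close in memory. This is the kind of mapping a source-to-source mesh tool could apply without the programmer rewriting the application.

#include <stdint.h>

/* Spread the low 16 bits of x so that abcd becomes 0a0b0c0d. */
static uint32_t spread_bits(uint32_t x)
{
    x &= 0x0000FFFF;
    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;
    return x;
}

/* Morton (Z-order) index of cell (i, j). */
static uint32_t morton_index(uint32_t i, uint32_t j)
{
    return (spread_bits(i) << 1) | spread_bits(j);
}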

PGAS Languages

A good first step

They do not address many important issues:
- Still based on action at a distance
- No high-level support for data decomposition
- No middle ground between local and global
- Lower-overhead access to remote data, but little direct support for latency hiding
- I/O is not well integrated (e.g., the easiest way to write a CoArray in canonical form, as part of a distributed array, is to use MPI)
- Not performance transparent; no direct support for performance debugging

How to Replace MPI

Retain MPI's strengths
- Performance from matching the programming model to the realities of the underlying hardware
- Ability to compose with other software (libraries, compilers, debuggers)
- Determinism (without MPI_ANY_{TAG,SOURCE})
- Run-everywhere portability

Add what MPI is missing, such as
- Distributed data structures (not just a few popular ones)
- Low-overhead remote operations; better latency hiding/management; overlap with computation (not just latency; MPI can be implemented in a few hundred instructions, so its overhead is roughly the same as a remote memory reference, i.e., the memory wall)
- Dynamic load balancing for dynamic, distributed data structures
- A unified method for treating multicores, remote processors, and other resources

Enable the transition from MPI programs
- Build component-friendly solutions
  - There is no one true language

Further Reading

For a historical perspective (and a reality check):
- "Enabling Technologies for Petaflops Computing", Thomas Sterling, Paul Messina, and Paul H. Smith, MIT Press, 1995
- "System Software and Tools for High Performance Computing Environments", edited by Paul Messina and Thomas Sterling, SIAM, 1993

For current thinking on possible directions:
- "Report of the Workshop on High-Productivity Programming Languages and Models", edited by Hans Zima, May 2004