MPI for MultiCore and ManyCore
Galen Shipman, Oak Ridge National Laboratory
June 4, 2008
Managed by UT-Battelle for the Department of Energy

Slide 2: Ripe Areas of Improvement for MultiCore
- MPI Implementations
- The MPI Standard
- Resource Managers
- Improving MPI as a Low-Level Substrate

Slide 3: Open MPI Component Architecture
Intra-node optimizations are primarily isolated to the collective and point-to-point interfaces within Open MPI.

Slide 4: MPI Implementation Improvements
- Extending intra-node optimizations beyond "shared memory as a transport mechanism"
  - Process synchronization primitives
  - Hierarchical collectives (sketched after this slide)
    - Reduces network contention
    - Exploits on-node memory hierarchies where they exist
  - "Offload" some MPI library tasks to dedicated cores
    - At extremely large scale, the additional overhead of this offload may be insignificant compared with the ability to schedule operations effectively
    - Requires applications to be optimized for overlap
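The hierarchical-collectives bullet can be made concrete with a short sketch. This is not Open MPI's implementation; it is a minimal illustration that uses MPI-3's MPI_Comm_split_type (standardized after this 2008 talk) to discover the on-node group, and it assumes the broadcast root is global rank 0.

#include <mpi.h>

/* Hierarchical broadcast: one leader per node receives the data over the
 * network, then fans it out to its node-mates through shared memory. */
static void hier_bcast_from_rank0(void *buf, int count, MPI_Datatype type,
                                  MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm;
    int rank, node_rank;

    MPI_Comm_rank(comm, &rank);

    /* Group the ranks that can share memory (i.e., live on the same node). */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
                        &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* One leader (local rank 0) per node; global rank 0 is always a leader. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, rank,
                   &leader_comm);

    /* Step 1: inter-node broadcast among leaders (one network message per
     * node instead of one per rank, which reduces network contention). */
    if (leader_comm != MPI_COMM_NULL) {
        MPI_Bcast(buf, count, type, 0, leader_comm);
        MPI_Comm_free(&leader_comm);
    }

    /* Step 2: intra-node broadcast from each leader through shared memory. */
    MPI_Bcast(buf, count, type, 0, node_comm);
    MPI_Comm_free(&node_comm);
}

A production version would handle an arbitrary root and could add further levels (socket, cache) where the hardware hierarchy warrants it.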

Slide 5: Shared Memory Optimizations - MPI_Bcast
Results on 16 cores (quad-socket, dual-core Opteron); chart not reproduced here. Reference: "MPI Support for Multi-Core Architectures: Optimized Shared Memory Collectives", R. L. Graham and G. M. Shipman, to appear in the proceedings of EuroPVM/MPI 2008.

Slide 6: Shared Memory Optimizations - MPI_Reduce
Results on 16 cores (quad-socket, dual-core Opteron); chart not reproduced here. Reference: "MPI Support for Multi-Core Architectures: Optimized Shared Memory Collectives", R. L. Graham and G. M. Shipman, to appear in the proceedings of EuroPVM/MPI 2008.

Slide 7: Shared Memory Optimizations (chart only)

Slide 8: Shared Memory Optimizations (chart only)

Slide 9: MPI Implementation Improvements
- Reduced memory footprint
  - MPI has gotten used to 1 GB/core (no, we don't use all of it)
  - Careful memory usage is needed even at "modest scale"
    - ~100K cores on Baker may require data-structure changes
    - What happens at 1M cores when I only have 256 MB/core? (see the back-of-the-envelope sketch after this slide)
- Can we improve performance through reduced generality of MPI?
  - What if I don't want datatypes (other than MPI_BYTE)?
  - What if I don't use MPI_ANY_SOURCE?
  - Can relaxed ordering semantics help for some use cases?
  - Additional crazy ideas
- And don't forget about I/O
  - Hierarchies in the I/O infrastructure may need to be explicitly managed to achieve reasonable performance
  - Applications will have to change how they do I/O
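The 256 MB/core question is easy to quantify. The per-peer figures below (48 bytes of bookkeeping, a 4 KB eager buffer) are illustrative assumptions, not measurements of any particular MPI implementation; the point is that any O(nranks) structure kept on every rank stops fitting.

#include <stdio.h>

int main(void)
{
    const double peer_state_bytes = 48.0;    /* assumed per-peer bookkeeping */
    const double eager_buf_bytes  = 4096.0;  /* assumed per-peer eager buffer */
    const long   nranks[]         = { 100000L, 1000000L };
    const double MB               = 1024.0 * 1024.0;

    for (int i = 0; i < 2; i++) {
        double state = nranks[i] * peer_state_bytes / MB;
        double eager = nranks[i] * eager_buf_bytes  / MB;
        printf("%7ld ranks: %6.1f MB peer state, %8.1f MB eager buffers per rank\n",
               nranks[i], state, eager);
    }
    /* ~4.6 MB / ~390 MB at 100K ranks and ~45.8 MB / ~3.9 GB at 1M ranks:
     * the eager buffers alone blow a 256 MB/core budget long before 1M ranks. */
    return 0;
}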

Slide 10: MPI Standards Improvements
- MPI RMA (Remote Memory Access)
  - Not MPI-2 one-sided
  - Need to decouple:
    - RMA initialization
    - RMA ordering
    - Remote completion
    - Process synchronization
  - Intertwining these semantics reduces performance (see MPI_WIN_FENCE)
  - Need RMW (read-modify-write) operations (see the sketch after this slide)
    - Not MPI_ACCUMULATE
  - Relax window access restrictions
- Explicit support for process hierarchies
  - Are all processes created equal?
  - Should some process groups have "Divine Rights"?
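For reference, this is roughly the direction the standard later took: MPI-3 (2012, four years after this talk) added true read-modify-write operations and separated remote completion (MPI_Win_flush) from synchronization (MPI_Win_lock_all/unlock_all). A minimal sketch of a remote fetch-and-add, assuming the unified memory model so that a plain store plus barrier suffices to initialize the counter:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, one = 1, old = -1;
    int *counter;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Rank 0 hosts a single shared counter. */
    MPI_Win_allocate(rank == 0 ? sizeof(int) : 0, sizeof(int),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &counter, &win);
    if (rank == 0) *counter = 0;
    MPI_Barrier(MPI_COMM_WORLD);      /* counter initialized before any access */

    MPI_Win_lock_all(0, win);         /* passive-target epoch on every rank */
    MPI_Fetch_and_op(&one, &old, MPI_INT, 0, 0, MPI_SUM, win);
    MPI_Win_flush(0, win);            /* remote completion; the epoch stays open */
    printf("rank %d got ticket %d\n", rank, old);
    MPI_Win_unlock_all(win);          /* synchronization, handled separately */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}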

Slide 11: MPI Standards Improvements
- Can threads be first-class (or even second-class) citizens in an MPI world? (See the MPI_Init_thread example after this slide for what the standard offers today.)
- Work has been done in this area long ago (see TMPI and TOMPI).
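What MPI-2 already provides is a process-wide thread level, not per-thread identity: every thread shares the process's rank, which is why threads are not first-class citizens today. A minimal illustration:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided, rank;

    /* Request the most permissive level; the library may provide less. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0 && provided < MPI_THREAD_MULTIPLE)
        printf("MPI_THREAD_MULTIPLE not available, got level %d\n", provided);

    /* Even with MPI_THREAD_MULTIPLE, every thread in this process sends and
     * receives as the same rank: a message cannot be addressed to a thread. */

    MPI_Finalize();
    return 0;
}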

Slide 12: RMA Example - SHMEM on Cray X1

#include <stdlib.h>     /* atoi */
#include <mpp/shmem.h>  /* assumed: Cray SHMEM header */

#define SIZE 16

int main(int argc, char* argv[])
{
    int buffer[SIZE];
    int num_pe = atoi(argv[1]);

    start_pes(num_pe);
    buffer[SIZE-1] = 0;

    if (_my_pe() == 0) {
        buffer[SIZE-1] = 1;
        shmem_int_put(buffer, buffer, SIZE, 1);   /* send buffer to PE 1 */
        shmem_int_wait(&buffer[SIZE-1], 1);       /* wait for PE 1's reply */
    } else if (_my_pe() == 1) {
        shmem_int_wait(&buffer[SIZE-1], 0);       /* wait for PE 0's data */
        buffer[SIZE-1] = 0;
        shmem_int_put(buffer, buffer, SIZE, 0);   /* send it back to PE 0 */
    }
    shmem_barrier_all(); /* sync before exiting */
    return 0;
}

Slide 13: RMA Example - MPI on My Laptop

#include <mpi.h>

#define SIZE 16

/* Assumes exactly two ranks; MPI_Win_fence is collective over the window. */
int main(int argc, char* argv[])
{
    int buffer[SIZE];
    int proc, nproc;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &proc);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    MPI_Win_create(buffer, SIZE*sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (proc == 0) {
        MPI_Win_fence(0, win);   /* open the first access/exposure epoch */
        MPI_Put(buffer, SIZE, MPI_INT, 1, 0, SIZE, MPI_INT, win);
        MPI_Win_fence(0, win);   /* put complete: data has landed on rank 1 */
        MPI_Win_fence(0, win);   /* rank 1's reply has landed here */
    } else if (proc == 1) {
        MPI_Win_fence(0, win);   /* exposure epoch */
        MPI_Win_fence(0, win);   /* rank 0's data has landed */
        MPI_Put(buffer, SIZE, MPI_INT, 0, 0, SIZE, MPI_INT, win);
        MPI_Win_fence(0, win);   /* reply complete */
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
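Assuming a standard MPI development environment, the MPI version can be built and launched with something like "mpicc rma_fence.c -o rma_fence" followed by "mpirun -np 2 ./rma_fence" (the file name is illustrative, and the program assumes exactly two ranks). The SHMEM version requires a Cray X1 SHMEM environment and takes the number of PEs as its single command-line argument.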

Slide 14: Improved Resource Managers
- Express resource requirements for ranks and groups (stencils) of ranks:
  - Network and memory bandwidth requirements
  - Memory per process
  - Latency requirements
  - I/O requirements
- We can do this in MPI, but it doesn't belong there
  - An application may not even want to be scheduled if certain resource requirements cannot be met
  - We will need improved integration between MPI and resource managers
  - MPI can use information provided by the resource manager to allocate internal resources depending on the resource requirements specified for the given communicating peers

Slide 15: MPI as a Low-Level Parallel Computing Substrate
- MPI is not enough
  - Nor should it be; we need domain-specific packages to layer on top of MPI
  - As such, MPI had better provide the low-level communication primitives that these packages reasonably need
- MPI should be improved
  - MPI RMA can allow PGAS, SHMEM, Global Arrays, and others to effectively use MPI as a low-level communication substrate
  - Composing MPI features for a particular MPI consumer and operating environment (Opteron or Cell SPE) can remove the barriers to MPI adoption in many multicore and hybrid environments
    - CML (Cell Messaging Layer) by Scott Pakin is a good example of a special-purpose "lightweight" MPI

Slide 16: Questions?