1 Tuning for MPI Protocols
- Aggressive Eager
- Rendezvous with sender push
- Rendezvous with receiver pull
- Rendezvous blocking (push or pull)

2 Aggressive Eager
- Performance problem: extra copies
- Possible deadlock if eager buffering is inadequate

3 Tuning for Aggressive Eager
- Ensure that receives are posted before sends
- MPI_Issend can be used to express "wait until the receive is posted"

4 Rendezvous with Sender Push
- Extra latency
- Possible delays while waiting for the sender to begin

5 Rendezvous Blocking
- What happens once sender and receiver rendezvous?
  - Sender (push) or receiver (pull) may complete the operation
  - May block other operations while completing
- Performance tradeoff
  - If the operation does not block (by checking for other requests), it adds latency or reduces bandwidth
- Can reduce performance if a receiver, having acknowledged a send, must wait for the sender to complete a separate operation it has started

6 Tuning for Rendezvous with Sender Push
- Ensure receives are posted before sends
  - Better: ensure receives match sends before computation starts; it may be better to issue sends before receives
- Ensure that sends have time to start transfers
- Can use short control messages
- Beware of the cost of extra messages

7 Rendezvous with Receiver Pull
- Possible delays while waiting for the receiver to begin

8 Tuning for Rendezvous with Receiver Pull
- Place MPI_Isend calls before receives
- Use short control messages to ensure matches
- Beware of the cost of extra messages

9 Experiments with MPI Implementations
- Multiparty data exchange
- Jacobi iteration in 2 dimensions
  - Model for PDEs, matrix-vector products
  - Algorithms with surface/volume behavior
  - Issues similar to unstructured grid problems (but harder to illustrate)
- Others at

10 Multiparty Data Exchange
- Real programs have many processes exchanging data, often nearly at the same time
- Pingpong tests do not measure this communication pattern
- Simultaneous pingpong between processes i and i+p/2 on the IBM SP2:

11 Scheduling for Contention
- Many programs alternate between communication and computation phases
- Contention can reduce effective bandwidth
- Consider restructuring the program so that some nodes communicate while others compute

12 Jacobi Iteration
- Simple parallel data structure
- Processes exchange rows with neighbors

13 Background to Tests
- Goals
  - Identify better-performing idioms for the same communication operation
  - Understand these by understanding the underlying MPI process
  - Provide a starting point for evaluating additional options (there are many ways to write even simple codes)

14 Different Send/Receive Modes
- MPI provides many different ways to perform a send/recv
- Choose different ways to manage buffering (avoid copying) and synchronization
- Interaction with polling and interrupt modes

15 Some Send/Receive Approaches
- Based on hypotheses about the implementation; most of these are for polling mode. Each of the following is a hypothesis that the experiments test:
  - Better to start receives first
  - Ensure recvs posted before sends
  - Ordered (no overlap)
  - Nonblocking operations, overlap effective
  - Use of Ssend, Rsend versions (EPCC/T3D can prefer Ssend over Send; uses Send for buffered send)
  - Manually advance automaton
- Persistent operations

16 Scheduling Communications
- Is it better to use MPI_Waitall or to schedule/order the requests?
  - Does the implementation complete a Waitall in any order, or does it prefer requests as ordered in the array of requests?
- In principle, it should always be best to let MPI schedule the operations. In practice, it may be better to order either the short or the long messages first, depending on how data is transferred.

17 Some Example Results
- Summarize some different approaches
- More details at mpiexmpl/src3/runs.html

18 Send and Recv
- Simplest use of send and recv
- Very poor performance on the SP2
  - Rendezvous sequentializes sends/receives
- OK performance on the T3D (implementation tends to buffer operations)

19 Better to start receives first
- Irecv, Isend, Waitall
- OK performance

20 Ensure recvs posted before sends
- Irecv, Sendrecv/Barrier, Rsend, Waitall

21 Receives posted before sends
- Best performer on the SP2
- Fails to run on the SGI (needs cancel) and the T3D (core dumps)

22 Ordered (no overlap)
- Send, Recv or Recv, Send
- MPI_Sendrecv (shift)
- MPI_Sendrecv (exchange)

23 Shift with MPI_Sendrecv
- Performs reasonably well; simpler than many other approaches
- T3D performance is OK, but other approaches are better

24 Use of Ssend versions
- Ssend makes the send wait until the receive is ready
  - At least one implementation (T3D) gives better performance for Ssend than for Send

25 Nonblocking Operations, Overlap Effective
- Isend, Irecv, Waitall
- A variant uses Waitsome with computation

26 Persistent Operations
- Potential savings
  - Allocation of MPI_Request
  - Validating and storing arguments
- Variations of the example
  - Send_init, Recv_init, Startall, Waitall
  - Startall(recvs), Sendrecv/Barrier, Startall(rsends), Waitall
- Some vendor implementations are buggy
- Persistent operations may be slightly slower

27 Manually Advance Automaton
- Irecv, Isend, Iprobe in computation, Waitall
- To test for messages:
  MPI_Iprobe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &flag, &status)

28 Summary of Results
- Better to start sends before receives
  - Most implementations use rendezvous protocols for long messages (Cray, IBM, SGI)
- Synchronous sends better on the T3D
  - otherwise the system buffers
- MPI_Rsend can offer some performance gain on the SP2
  - as long as receives can be guaranteed without extra messages