1 Timing MPI Programs
The elapsed (wall-clock) time between two points in an MPI program can be computed using MPI_Wtime:

    double t1, t2;
    t1 = MPI_Wtime();
    ...
    t2 = MPI_Wtime();
    printf( "time is %f\n", t2 - t1 );

The value returned by a single call to MPI_Wtime has little value by itself; only differences between calls are meaningful. Times are in general local, but an implementation might offer synchronized times; see the attribute MPI_WTIME_IS_GLOBAL.
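
For reference, the fragment expands to a complete program along these lines (a minimal sketch; the loop merely stands in for whatever work is being timed):

    #include <stdio.h>
    #include <mpi.h>

    int main( int argc, char *argv[] )
    {
        double t1, t2;
        volatile double x = 0.0;   /* volatile keeps the loop from being optimized away */
        int    i;

        MPI_Init( &argc, &argv );

        t1 = MPI_Wtime();
        for (i = 0; i < 1000000; i++)   /* work being timed */
            x += 1.0;
        t2 = MPI_Wtime();

        printf( "time is %f seconds\n", t2 - t1 );
        MPI_Finalize();
        return 0;
    }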

2 Measuring Performance
- Using MPI_Wtime
  » timers are not continuous; the clock resolution is given by MPI_Wtick
- MPI_Wtime is local unless the MPI_WTIME_IS_GLOBAL attribute is true
- MPI profiling interface
- Performance measurement tools
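
A short sketch of how to query both properties, using MPI_Wtick for the resolution and MPI_Comm_get_attr for the MPI_WTIME_IS_GLOBAL attribute:

    #include <stdio.h>
    #include <mpi.h>

    int main( int argc, char *argv[] )
    {
        double tick;
        int    *global, flag;

        MPI_Init( &argc, &argv );

        /* Resolution of MPI_Wtime, in seconds */
        tick = MPI_Wtick();
        printf( "clock resolution: %e seconds\n", tick );

        /* Is MPI_Wtime synchronized across processes? */
        MPI_Comm_get_attr( MPI_COMM_WORLD, MPI_WTIME_IS_GLOBAL,
                           &global, &flag );
        if (flag)
            printf( "MPI_WTIME_IS_GLOBAL = %d\n", *global );
        else
            printf( "MPI_WTIME_IS_GLOBAL not defined\n" );

        MPI_Finalize();
        return 0;
    }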

3 Sample Timing Harness
Average times over many iterations, make several trials, and keep the best:

    tfinal = 1.0e9;
    for (k = 0; k < ntrials; k++) {
        t1 = MPI_Wtime();
        for (i = 0; i < niters; i++) {
            /* operation being timed */
        }
        time = (MPI_Wtime() - t1) / niters;
        if (time < tfinal) tfinal = time;
    }

- Use MPI_Wtick to discover the clock resolution
- Use getrusage to get other effects (e.g., context switches, paging)
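
A hedged sketch of using getrusage (POSIX) to expose those other effects around a timed region; the fields shown are standard, though how they are maintained varies by system:

    #include <stdio.h>
    #include <sys/resource.h>

    int main( void )
    {
        struct rusage before, after;

        getrusage( RUSAGE_SELF, &before );
        /* ... code being timed ... */
        getrusage( RUSAGE_SELF, &after );

        printf( "voluntary ctx switches:   %ld\n",
                after.ru_nvcsw  - before.ru_nvcsw );
        printf( "involuntary ctx switches: %ld\n",
                after.ru_nivcsw - before.ru_nivcsw );
        printf( "major page faults:        %ld\n",
                after.ru_majflt - before.ru_majflt );
        printf( "minor page faults:        %ld\n",
                after.ru_minflt - before.ru_minflt );
        return 0;
    }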

4 Pitfalls in Timing
Time too short:

    t = MPI_Wtime();
    MPI_Send( ... );
    time = MPI_Wtime() - t;

- May underestimate by up to MPI_Wtick, and overestimate by the cost of calling MPI_Wtime
- "Correcting" a measurement by subtracting the average cost of an MPI_Wtime call overestimates that cost
- Code not paged in (always run the test at least twice)
- Minimums are not what users see
- Tests with 2 processors may not be representative
  » the T3D has processors in pairs; ping-pong gives 130 MB/sec for 2 processors but 75 MB/sec for 4 (for MPI_Ssend)
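
A minimal ping-pong sketch that avoids the too-short-interval pitfall by timing many round trips; the message size and repetition count here are arbitrary choices for illustration:

    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>

    #define NBYTES 1024     /* message size (assumed) */
    #define NREPS  1000     /* round trips per measurement */

    int main( int argc, char *argv[] )
    {
        char   buf[NBYTES];
        double t1, time;
        int    rank, i;
        MPI_Status status;

        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        memset( buf, 0, NBYTES );

        MPI_Barrier( MPI_COMM_WORLD );   /* both sides ready before timing */
        t1 = MPI_Wtime();
        for (i = 0; i < NREPS; i++) {
            if (rank == 0) {
                MPI_Send( buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD );
                MPI_Recv( buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status );
            }
            else if (rank == 1) {
                MPI_Recv( buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status );
                MPI_Send( buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD );
            }
        }
        time = (MPI_Wtime() - t1) / (2.0 * NREPS);   /* one-way time */
        if (rank == 0)
            printf( "one-way time for %d bytes: %e seconds\n", NBYTES, time );

        MPI_Finalize();
        return 0;
    }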

5 Example of Paging Problem
(Figure not reproduced: timing comparison of runs; the black area is identical setup computation.)

6 Latency and Bandwidth
- Simplest model: time = s + r n for an n-byte message
- s (latency) includes both hardware (gate delays) and software (context switch, setup) costs
- r (time per byte) includes both hardware (raw bandwidth of the interconnect and memory system) and software (packetization, copies between user and system buffers) costs
- Head-to-head and ping-pong values may differ
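
A small sketch that evaluates the model for a few message sizes; the values of s and r below are assumptions for illustration only, and the effective bandwidth n / (s + r n) shows how latency dominates for small messages:

    #include <stdio.h>

    int main( void )
    {
        double s = 50.0e-6;          /* latency: 50 microseconds (assumed) */
        double r = 1.0 / 100.0e6;    /* time per byte at 100 MB/sec (assumed) */
        int    sizes[] = { 8, 1024, 65536, 1048576 };
        int    i;

        for (i = 0; i < 4; i++) {
            double n    = (double) sizes[i];
            double time = s + r * n;
            printf( "n = %8d bytes: time = %e s, effective bandwidth = %6.1f MB/s\n",
                    sizes[i], time, n / time / 1.0e6 );
        }
        return 0;
    }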

7 Interpreting Latency and Bandwidth
- Bandwidth is the inverse of the slope of the line
      time = latency + (1/bandwidth) * size_of_message
- For performance estimation purposes, latency is the limit as n → 0 of the time to send n bytes
- Latency is sometimes described as the "time to send a message of zero bytes". This is true only for the simple model, and the number quoted is sometimes misleading.
(Figure: time to send a message vs. message size; the intercept of the fitted line is the latency and 1/slope is the bandwidth; the measured time for a zero-byte message is not the latency.)
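
One way to extract s and r is to fit the line through two measured points; a minimal sketch, assuming you already have times for two message sizes from a harness like the one on slide 3 (the measured values below are placeholders, not real data):

    #include <stdio.h>

    int main( void )
    {
        double n1 = 1024.0,    t1 = 6.0e-5;   /* assumed measurement */
        double n2 = 1048576.0, t2 = 1.1e-2;   /* assumed measurement */

        double r = (t2 - t1) / (n2 - n1);     /* time per byte = 1/bandwidth */
        double s = t1 - r * n1;               /* extrapolated latency */

        printf( "latency   s = %e seconds\n", s );
        printf( "bandwidth   = %6.1f MB/sec\n", 1.0 / r / 1.0e6 );
        return 0;
    }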

8 Exercise: Timing MPI Operations
- Estimate the latency and bandwidth for some MPI operation (e.g., Send/Recv, Bcast, Ssend/Irecv-Wait)
  » Make sure all processes are ready before starting the test
  » How repeatable are your measurements?
  » How does the performance compare to that of other operations (e.g., memcpy, floating-point multiply)?
A possible scaffold for the exercise is sketched below.
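
This sketch times MPI_Bcast, using MPI_Barrier so all processes are ready before the measurement starts; the size and repetition count are arbitrary and one trial is shown, so repeatability is left to explore:

    #include <stdio.h>
    #include <mpi.h>

    #define NDOUBLES 1024    /* message size (assumed) */
    #define NREPS    100     /* repetitions per measurement */

    int main( int argc, char *argv[] )
    {
        double buf[NDOUBLES], t1, time;
        int    rank, i;

        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );

        for (i = 0; i < NDOUBLES; i++) buf[i] = 0.0;

        MPI_Barrier( MPI_COMM_WORLD );   /* all processes ready before timing */
        t1 = MPI_Wtime();
        for (i = 0; i < NREPS; i++)
            MPI_Bcast( buf, NDOUBLES, MPI_DOUBLE, 0, MPI_COMM_WORLD );
        time = (MPI_Wtime() - t1) / NREPS;

        if (rank == 0)
            printf( "average MPI_Bcast time for %d doubles: %e seconds\n",
                    NDOUBLES, time );

        MPI_Finalize();
        return 0;
    }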