High Performance Computing: Concepts, Methods & Means
Performance 3: Measurement
Prof. Thomas Sterling, Department of Computer Science, Louisiana State University
February 27th, 2007

Term Projects
Graduate students only
Due date: April 19th, 2007
Total time: approx. 40 hours
20% of final grade
4 categories of projects:
–A) Technology evolution report (20 page report)
–B) Fixed application code execution scaling (7 page report)
–C) Synthetic code parametric studies (7 page report)
–D) Parallel application development (7 page report)
1-paragraph abstract due March 9th – to Chirag by COB Friday

Term Projects – Technology Evolution
In-depth survey of an enabling technology
Report on capability with respect to time and factors
Two general classes of technology:
–Device technology: main memory, secondary storage, system network, logic
–Architecture: SIMD, vector, systolic, dataflow

Term Project – Fixed Application Scaling
Select an application code
–Need not be one of those used in class
–Must be a parallel code
–You need not write it yourself
Select two or more system parameters to scale:
–# processors
–# nodes
–Network bandwidth and/or latency
–Data block partition size
Use performance measurement and profiling tools
–Describe measured trends
–Diagnose reasons for observed results

Term Project – Synthetic Parametric Study
Write a code expressly to exercise one or more system functions:
–Parallelism
–Network bandwidth
–Memory bandwidth
Allow at least one dimension to be independent and adjustable:
–Message insert rate
–Message packet size
–Overhead time
Show system operation with respect to the parameter

Term Project – Roll Your Own
Write a small parallel application program
–Preferably not one we've done in class
–Can be something you've done in another class or research project, modified for MPI or OpenMP
–Please! Do this yourself!!
–Libraries permitted
Use profiling tools to determine where most of the work is being done
Demonstrate scaling wrt # processors

7 Topics Introduction Performance Characteristics & Models Performance Models : LogP Performance Models : LogGP Benchmarks : b_eff MPI Tracing : PMPI TAU & MPI Summary – Materials for Test

8 Topics Introduction Performance Characteristics & Models Performance Models : LogP Performance Models : LogGP Benchmarks : b_eff MPI Tracing : PMPI TAU & MPI Summary – Materials for Test

9 Please understand when to use the following and what they mean:
API Elements:
–MPI_Init(), MPI_Finalize()
–MPI_Comm_size(), MPI_Comm_rank()
–MPI_COMM_WORLD
–Error checking using MPI_SUCCESS
–MPI basic data types (slide 27)
–Blocking: MPI_Send(), MPI_Recv()
–Non-blocking: MPI_Isend(), MPI_Irecv(), MPI_Wait()
–Collective calls: MPI_Barrier(), MPI_Bcast(), MPI_Gather(), MPI_Scatter(), MPI_Reduce()
Commands:
–Running MPI programs: mpirun
–Compile: mpicc
–Compile: mpif77
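As a quick refresher (a minimal sketch added here, not part of the original slides), the following program exercises most of the listed calls: initialization, rank/size queries, a blocking send/receive pair with MPI_SUCCESS checking, a reduction, and a barrier.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, size, data = 0, sum = 0;

    MPI_Init(&argc, &argv);                   /* start MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* my rank in the world communicator */
    MPI_Comm_size(MPI_COMM_WORLD, &size);     /* total number of processes */

    if (rank == 0) {
        data = 42;
        if (size > 1)                          /* blocking send of one int to rank 1 */
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status st;
        if (MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &st) != MPI_SUCCESS)
            printf("receive failed on rank 1\n");
    }

    /* collective: every rank contributes its rank number, rank 0 receives the sum */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks = %d, received data = %d\n", sum, data);

    MPI_Finalize();                           /* shut MPI down */
    return 0;
}

Compile with mpicc and launch with, e.g., mpirun -np 2 ./a.out.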

10 Where are we?
Three classes of parallel computing:
–Capacity
–Cooperative
–Capability
Three execution models:
–Throughput
–Shared memory multithreaded
–Communicating sequential processes (message passing)
Three programming formalisms:
–Condor
–OpenMP
–MPI
More performance modeling and measurement – for cooperative / message passing / MPI

11 Topics Introduction Performance Characteristics & Models Performance Models : LogP Performance Models : LogGP Benchmarks : b_eff MPI Tracing : PMPI TAU & MPI Summary – Materials for Test

12 What has changed? SMP to MPP
SMP – symmetric multiprocessor:
–Shared memory, UMA – uniform memory access with cache coherence
–Multithreaded parallelism
–Communication through main memory
–Not scalable
–Programming in OpenMP
–DSM and PGAS provide alternative shared memory structures:
  DSM – distributed shared memory (with cache coherence)
  PGAS – partitioned global address space (without cache coherence)
  Both are NUMA
MPP – massively parallel processor:
–Distributed memory, NUMA – non-uniform memory access
–Communicating sequential processes parallelism
–Communication through messages between nodes
–Scalable
–Programming in MPI
–Same for commodity clusters, but usually with weaker networks

13 MPI Performance Characteristics
Latency:
–Time to send the first bits of data across a link to a remote node
–Does not include overhead
Bandwidth:
–Rate of data transfer across a link to a remote node
Buffers:
–System or user buffers take time to manage, consume capacity, etc.
Blocking versus asynchronous:
–Forced ordering of computation and communication
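Latency and bandwidth are commonly estimated with a simple ping-pong test. The sketch below is an illustration added here (not from the original slides, and assuming at least two ranks): a 1-byte round trip approximates latency, a 1 MB round trip approximates bandwidth.

#include <stdio.h>
#include "mpi.h"

/* Average round-trip time for 'bytes' bytes between ranks 0 and 1, over 'reps' repetitions. */
static double pingpong(int rank, char *buf, int bytes, int reps)
{
    MPI_Status st;
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    return (MPI_Wtime() - t0) / reps;       /* seconds per round trip */
}

int main(int argc, char **argv)
{
    int rank;
    static char buf[1 << 20];               /* 1 MB buffer for the bandwidth test */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t_small = pingpong(rank, buf, 1, 1000);        /* 1-byte messages */
    double t_large = pingpong(rank, buf, 1 << 20, 100);   /* 1 MB messages   */

    if (rank == 0) {
        printf("approx. latency  : %g s (half of 1-byte round trip)\n", t_small / 2.0);
        printf("approx. bandwidth: %g MB/s\n", 2.0 * (1 << 20) / t_large / 1e6);  /* 2 MB moved per round trip */
    }
    MPI_Finalize();
    return 0;
}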

14 Granularities of Time Measurements
Timer              | Usage                       | Wallclock / CPU | Resolution     | Languages      | Portable?
time               | shell / script              | both            | 1/100th second | any            | yes
timex              | shell / script              | both            | 1/100th second | any            | yes
gettimeofday       | subroutine                  | wallclock       | microsecond    | C/C++          | yes
read_real_time     | subroutine                  | wallclock       | nanosecond     | C/C++          | no
rtc                | subroutine                  | wallclock       | microsecond    | Fortran        | no
irtc               | subroutine                  | wallclock       | nanosecond     | Fortran        | no
dtime_             | subroutine                  | CPU             | 1/100th second | Fortran        | no
etime_             | subroutine                  | CPU             | 1/100th second | Fortran        | no
mclock             | subroutine                  | CPU             | 1/100th second | Fortran        | no
timef              | subroutine                  | wallclock       | millisecond    | Fortran        | no
MPI_Wtime          | subroutine                  | wallclock       | microsecond    | C/C++, Fortran | yes
AIX Trace Facility | shell / script / subroutine | wallclock       | microsecond    | any            | no
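Of these, MPI_Wtime is the portable choice inside MPI codes. A small usage sketch (added here for illustration, not from the slides), together with MPI_Wtick to query the timer resolution:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double tick = MPI_Wtick();      /* resolution of MPI_Wtime in seconds */
    double t0 = MPI_Wtime();

    /* region being timed: here just a dummy computation */
    double s = 0.0;
    for (long i = 1; i <= 10000000L; i++)
        s += 1.0 / (double)i;

    double t1 = MPI_Wtime();

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("elapsed = %g s (timer resolution %g s), s = %g\n", t1 - t0, tick, s);

    MPI_Finalize();
    return 0;
}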

15 Performance Factors
Platform / architecture related:
–CPU – clock speed, number of CPUs
–Memory subsystem – memory and cache configuration, memory-cache-CPU bandwidth, memory copy bandwidth
–Network adapters – type, latency and bandwidth characteristics
–Operating system characteristics – many
Network related:
–Hardware – Ethernet, FDDI, switch, intermediate hardware (routers)
–Protocols – TCP/IP, UDP/IP, other
–Configuration, routing, etc.
–Network tuning options ("no" command)
–Network contention / saturation
source:

16 Performance Factors (2)
Application related:
–Algorithm efficiency and scalability
–Communication-to-computation ratio
–Load balance
–Memory usage patterns
–I/O
–Message size used
–Types of MPI routines used – blocking, non-blocking, point-to-point, collective communications
MPI implementation related:
–Message buffering
–Message passing protocols – eager, rendezvous, other
–Sender-receiver synchronization – polling, interrupt
–Routine internals – efficiency of the algorithm used to implement a given routine
source:

17 Performance Impact of Message Sizes
Message size can be a very significant contributor to MPI application performance; in most cases, increasing the message size yields better performance. For communication-intensive applications, algorithm modifications that take advantage of message-size "economies of scale" may be worth the effort. Performance can often improve significantly within a relatively small range of message sizes. The following three graphs demonstrate how increasing message size can improve bandwidth over different message-size ranges.

18 MPI Performance Models
Hockney: point-to-point
–Time to send: t = t0 + m / r_inf
–t0: fixed cost per message (startup cost)
–m: message length
–r_inf: bandwidth for very large messages
Xu/Hwang: collective
–Time to send: t = t0(n) + m / r_inf(n)
–Same parameters, but now they are functions of n, the number of nodes in the communication
Source:
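The Hockney model can be evaluated directly. The short sketch below (added here; the parameter values t0 = 10 microseconds and r_inf = 1 GB/s are assumed examples, not measurements) prints the predicted transfer time and effective bandwidth for a few message sizes, showing startup cost dominating small messages and r_inf dominating large ones.

#include <stdio.h>

int main(void)
{
    const double t0    = 10e-6;    /* assumed startup cost: 10 microseconds */
    const double r_inf = 1e9;      /* assumed asymptotic bandwidth: 1 GB/s  */

    double sizes[] = { 1e2, 1e4, 1e6, 1e8 };   /* message sizes in bytes */

    for (int i = 0; i < 4; i++) {
        double m = sizes[i];
        double t = t0 + m / r_inf;             /* Hockney: t = t0 + m / r_inf */
        printf("m = %9.0f B   t = %10.3e s   effective BW = %9.3f MB/s\n",
               m, t, (m / t) / 1e6);
    }
    return 0;
}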

19 Topics Introduction Performance Characteristics & Models Performance Models : LogP Performance Models : LogGP Benchmarks : b_eff MPI Tracing : PMPI TAU & MPI Summary – Materials for Test

20 MPI Performance Models
LogP (fixed message size):
–Time to send = L + 2*o
–L: latency – minimum send/receive time
–o: overhead – time waiting on the processor
–g: gap – minimum time between successive sends or receives (does include message length)
–P: number of processors
–L/g: maximum number of simultaneous messages

21 Measuring LogP Parameters
Finding g (implementation dependent):
–Proc 0: MPI_Isend() x N
–Proc 1: MPI_Recv() x N
–g = total time / N
Finding L + 2*o:
–Proc 0: (MPI_Send() then MPI_Recv()) x N
–Proc 1: (MPI_Recv() then MPI_Send()) x N
–L + 2*o = total time / N
Finding o:
–Proc 0: (MPI_Send() then MPI_Recv() then some_work) x N
–Proc 1: (MPI_Recv() then some_work then MPI_Send()) x N
–o = (1/2) * total time / N – time(some_work)
–Requires time(some_work) > 2*L + 2*o
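A minimal sketch of the "finding g" measurement (an illustration added here, distinct from the course demo shown later; it assumes exactly two ranks): rank 0 issues a stream of nonblocking sends, rank 1 drains them, and the average time per send approximates g.

#include <stdio.h>
#include "mpi.h"

#define N 10000

int main(int argc, char **argv)
{
    int rank, data = 0;
    static MPI_Request req[N];
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    if (rank == 0) {
        /* issue N nonblocking sends back to back */
        for (int i = 0; i < N; i++)
            MPI_Isend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req[i]);
        MPI_Waitall(N, req, MPI_STATUSES_IGNORE);
    } else if (rank == 1) {
        for (int i = 0; i < N; i++)
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &st);
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("g (approx.) = %g s per message\n", (t1 - t0) / N);

    MPI_Finalize();
    return 0;
}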

22 Measuring LogP Parameters
Finding L + 2*o:
–Proc 0: (MPI_Send() then MPI_Recv()) x N
–Proc 1: (MPI_Recv() then MPI_Send()) x N
–L + 2*o = total time / N
Figure 1: Time diagram for benchmark 1 – (a) time diagram of processor 0, (b) time diagram of processor 1

23 Measuring LogP Parameters
Finding o:
–Proc 0: (MPI_Send() then some_work then MPI_Recv()) x N
–Proc 1: (MPI_Recv() then MPI_Send() then some_work) x N
–o = (1/2) * total time / N – time(some_work)
–Requires time(some_work) > 2*L + 2*o
Figure 2: Time diagram for benchmark 2 with X > 2*L + Or + Os – (a) time diagram of processor 1, (b) time diagram of processor 2

24 Demo Measure LogP parameters

25 Topics Introduction Performance Characteristics & Models Performance Models : LogP Performance Models : LogGP Benchmarks : b_eff MPI Tracing : PMPI TAU & MPI Summary – Materials for Test

26 MPI Performance Models
LogGP (variable message size):
–Time to send = L + 2*o + (m-1)*G
–L: latency – minimum send/receive time
–o: overhead – time waiting on the processor
–g: gap – minimum time between send/recvs
–G: gap per byte = 1 / bandwidth
–P: number of processors
–L/g: maximum number of simultaneous messages
–http://citeseer.ist.psu.edu/cache/papers/cs/756/cs.berkeley.eduzSz~cullerzSzpaperszSzsort.pdf/dusseau96fast.pdf

27 Effective Bandwidth (LogGP)
Toy calculation:
–BW = m / (L + 2*o + G*(m-1))
–Let L + 2*o – G = 5
–Let G = 3
–Asymptotically approaches the bandwidth of 1/G for very large messages.
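With those toy numbers the expression simplifies to BW = m / (5 + 3*m). A short sketch (added here, not from the slides) tabulates it and shows the approach to the 1/G limit of about 0.333:

#include <stdio.h>

int main(void)
{
    const double G = 3.0;    /* gap per byte (toy value from the slide)  */
    const double c = 5.0;    /* L + 2*o - G (toy value from the slide)   */

    /* BW(m) = m / (L + 2*o + G*(m-1)) = m / (c + G*m) */
    double sizes[] = { 1, 10, 100, 1000, 100000 };
    for (int i = 0; i < 5; i++) {
        double m  = sizes[i];
        double bw = m / (c + G * m);
        printf("m = %8.0f   BW = %.4f   (limit 1/G = %.4f)\n", m, bw, 1.0 / G);
    }
    return 0;
}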

28 Topics Introduction Performance Characteristics & Models Performance Models : LogP Performance Models : LogGP Benchmarks : b_eff MPI Tracing : PMPI TAU & MPI Summary – Materials for Test

29 HPC Challenge Benchmarks
HPC Challenge:
–See results tab
–b_eff benchmark is a part of this larger database
–More info than just HPL!

30 b_eff
Standard benchmark – part of HPC Challenge
–Provides effective bandwidth and latency
Averages over a variety of message sizes and communication patterns
Determines an effective latency and bandwidth
b_eff depends on:
–hardware: interconnect, memory
–software: MPI implementation
–tunable parameters of the OS: buffers
–etc.
See:

31 Effective Bandwidth Benchmark

32 Example: Send/Recv, ring & random

33 Demo running of b_eff

34 Topics Introduction Performance Characteristics & Models Performance Models : LogP Performance Models : LogGP Benchmarks : b_eff MPI Tracing : PMPI TAU & MPI Summary – Materials for Test

35 Portable MPI Tracing: PMPI
An API within MPI for tracing, debugging, and performance measurement of MPI applications
Each MPI_Xxx() call can be intercepted by a user-supplied wrapper that does its bookkeeping and then calls the name-shifted PMPI_Xxx() routine, which performs the actual operation
MPI_Pcontrol(int level):
–0: profiling disabled
–1: profiling enabled (default level)
–2: flush trace buffers
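As a small illustration of the name-shifted interface (a sketch added here, separate from the counting demo on the next slide), the wrapper file below, when linked with an unmodified MPI program, accumulates the time spent in MPI_Send and reports it at MPI_Finalize:

#include <stdio.h>
#include "mpi.h"

static double send_time  = 0.0;   /* accumulated wall time inside MPI_Send */
static int    send_calls = 0;

/* Intercept MPI_Send: record timing, then forward to the real implementation. */
int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
    send_time += MPI_Wtime() - t0;
    send_calls++;
    return rc;
}

/* Intercept MPI_Finalize to print the totals before shutting MPI down. */
int MPI_Finalize(void)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %d MPI_Send calls, %g s total\n", rank, send_calls, send_time);
    return PMPI_Finalize();
}

This is the whole point of the PMPI layer: the application is relinked, not recompiled or edited, to collect the measurements.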

36 Demo: MPI_Pcontrol

#include <stdio.h>
#include <time.h>
#include "mpi.h"

int sends = 0;
int pcontrol = 1;

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int imax = 100000;   /* iteration count; value missing on the slide, 100000 assumed here */
    int nmax = 8;
    int rank;
    int data = 27;
    MPI_Status st;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    time_t start, end;
    double fac = (1.0 / imax);
    double g, lp2o, o;

    // Find g
    time(&start);
    for (int i = 0; i < imax; i++) {
        if (rank == 0) {
            MPI_Send(&data, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
        } else {
            MPI_Recv(&data, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &st);
        }
    }
    time(&end);
    if (rank == 0) {
        g = fac * (end - start);
        printf("gap=%g sec\n", g);
    }

    // Find L+2*o
    time(&start);
    const int step = 5;
    for (int i = 0; i < imax; i += step) {
        if (rank == 0) {
            MPI_Send(&data, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
            MPI_Recv(&data, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &st);
        } else {
            MPI_Recv(&data, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &st);
            MPI_Send(&data, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
        }
    }
    time(&end);
    if (rank == 0) {
        lp2o = 0.5 * step * fac * (end - start);
        printf("L+2*o=%g sec\n", lp2o);
        if (sends > 0) printf("sends = %d\n", sends);
    }

    MPI_Finalize();
    return 0;
}

// Profiling wrappers: count sends while tracing is enabled via MPI_Pcontrol.
int MPI_Pcontrol(const int n, ...)
{
    pcontrol = n;
    return PMPI_Pcontrol(n);
}

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    if (pcontrol >= 1) sends++;
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

37 Demo MPI tracing, custom implementation

38 Topics Introduction Performance Characteristics & Models Performance Models : LogP Performance Models : LogGP Benchmarks : b_eff MPI Tracing : PMPI TAU & MPI Summary – Materials for Test

39 TAU and MPI
TAU uses the PMPI interface to track MPI calls
Jumpshot is used as the viewer
–Shows subroutine calls and MPI calls

40 TAU Performance System Architecture
(Architecture diagram; components referenced include EPILOG and Paraver.)

41 TAU Measurement Options
Parallel profiling:
–Function-level, block-level, statement-level
–Supports user-defined events
–TAU parallel profile data stored during execution
–Hardware counter values
–Support for multiple counters
–Support for callpath profiling
Tracing:
–All profile-level events
–Inter-process communication events
–Timestamp synchronization
–Trace merging and format conversion

42 How To Use TAU?
Instrumentation:
–Application code and libraries
–Selective instrumentation
Install, compile, and link with the TAU measurement library:
–% configure; make clean install
–Multiple configurations for different measurement options
–Does not require changes to the instrumentation
–Selective measurement control
Execute "experiments" to produce performance data:
–Performance data generated at the end of or during execution
Use analysis tools to look at the performance results

43 Using TAU
Setup environment:
–source /home/packages/Tau/gcc-papi-mpi-slog2/env.sh
–export COUNTER1=GET_TIME_OF_DAY
Use tau_cc.sh, tau_f90.sh, etc. to compile
Run with mpirun
Post-process:
–tau_treemerge.pl
–tau2slog2 tau.trc tau.edf -o tau.slog2
Run:

44 Demo Tau and Jumpshot

45 Topics Introduction Performance Characteristics & Models Performance Models : LogP Performance Models : LogGP Benchmarks : b_eff MPI Tracing : PMPI TAU & MPI Summary – Materials for Test

46 Summary – Material for the Test
Essential MPI – slide 9
Performance models – slides 12, 15, 16, 18 (Hockney)
LogP – slides 20–23
Effective bandwidth – slide 30
Tau/MPI – slides 41, 43

47 Sources
–(tau)
–(mpi profiling interface)
–(Gropp course)
–(LogP paper with figures)
–cluster-2005.pdf (more LogP stuff)
–(b_eff bench)
–(hpc challenge)
