FACT: Fast Communication Trace Collection for Parallel Applications through Program Slicing
Jidong Zhai, Tianwei Sheng, Jiangzhou He, Wenguang Chen, Weimin Zheng
Tsinghua University

Motivation: The Importance of Communication Patterns
- Optimize application performance: tune process placement on non-uniform communication platforms (MPIPP [ICS-08], OPP [EuroPar-09])
- Design better communication subsystems: circuit-switched networks in parallel computing [IPDPS-05]
- Optimize MPI debuggers: exploit communication locality (MPIWiz [PPoPP-09])

Communication Pattern
- Comm. patterns: spatial, volume, temporal
- They can be acquired from comm. trace files: msg type, size, source, dest, etc. (a sketch of such a record follows below)
- An example: the spatial and volume attributes of the NPB CG program (CLASS=D, NPROCS=64)
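As a concrete illustration, a single point-to-point trace record could carry exactly these fields. This is a hypothetical layout, not FACT's actual trace format:

#include <stdio.h>

/* Hypothetical layout for one point-to-point trace record
 * (illustrative only; not FACT's on-disk format). */
typedef struct {
    int  op;        /* msg type: send, recv, collective, ... */
    int  source;    /* spatial attribute: source rank        */
    int  dest;      /* spatial attribute: destination rank   */
    int  tag;       /* message tag                           */
    int  comm_id;   /* spatial attribute: communicator id    */
    long bytes;     /* volume attribute: message size        */
} CommTraceRecord;

int main(void) {
    CommTraceRecord r = { 0, 0, 3, 55, 0, 25600 };   /* a sample send */
    printf("send %d -> %d, %ld bytes\n", r.source, r.dest, r.bytes);
    return 0;
}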

Previous Work
- Mainly relies on traditional trace collection techniques: instrument the original program, execute it on a full-scale parallel system, collect communication traces at runtime, and derive the communication pattern
- Such tools: ITC/ITA, KOJAK, Paraver, TAU, VAMPIR, etc.

Limitations of Previous Work
- Huge resource requirements
  - Computing resources: ASCI SAGE requires 2000-4000 processors
  - Memory requirements: NPB FT (CLASS=E) needs more than 600GB of memory
  - Comm. traces cannot be collected without a full-scale system
- Long trace collection time
  - The entire parallel application must execute from beginning to end
  - For ASCI SAGE, this takes several months

Our Observations
Two important observations:
- Many important applications do not require communication temporal attributes (e.g., process placement optimization, communication locality analysis)
- Most computation and message contents of parallel applications are not relevant to their spatial and volume communication attributes
If we can tolerate missing temporal attributes, can we find an efficient method to acquire communication traces?

Our Approach
FACT: FAst Communication Trace collection. FACT can acquire comm. traces of large-scale parallel applications on small-scale systems.
Our idea:
- Reduce the original program to a program slice at compile time, using the proposed Live-Propagation Slicing Algorithm (LPSA)
- Execute the program slice with a custom communication library to collect comm. traces at runtime
FACT combines static analysis with traditional trace collection methods.

Design Overview
Two main components:
- Compilation framework (LPSA): input is an MPI program; output is the program slice and directive information
- Runtime environment: a custom MPI communication library that produces the comm. traces
(Figure: Overview of FACT)

An Example (Matrix-Matrix Multiplication)
Fortran program computing C = A * B:

 1    program MM
 2    include 'mpif.h'
 3    parameter (N = 80)
 4 C  memory allocation
 5    real A(N,N), B(N,N), C(N,N)
 6    call MPI_Init(ierr)
 7    call MPI_COMM_Rank(MPI_COMM_WORLD,myid,ierr)
 8    call MPI_COMM_Size(MPI_COMM_WORLD,nprocs,ierr)
 9    cols = N/(nprocs-1)
10    size = cols*N
11    tag = 1
12    master = 0
13    if (myid .eq. master) then
14 C  Initialize matrix A and B
15      do i=1, N
16        do j=1, N
17          A(i,j) = (i-1)+(j-1)
18          B(i,j) = (i-1)*(j-1)
19        end do
20      end do
21 C  Send matrix data to the worker tasks
22      do dest=1, nprocs-1
23        offset = 1 + (dest-1)*cols
24        call MPI_Send(A, N*N, MPI_REAL, dest,
25   &         tag, MPI_COMM_WORLD, ierr)
26        call MPI_Send(B(1,offset), size, MPI_REAL,
27   &         dest, tag, MPI_COMM_WORLD, ierr)
28      end do
29 C  Receive results from worker tasks
30      do source=1, nprocs-1
31        offset = 1 + (source-1)*cols
32        call MPI_Recv(C(1,offset), size, MPI_REAL,
33   &         source, tag, MPI_COMM_WORLD, status, ierr)
34      end do
35    else
36 C  Worker receives data from master task
37      call MPI_Recv(A, N*N, MPI_REAL, master, tag,
38   &       MPI_COMM_WORLD, status, ierr)
39      call MPI_Recv(B, size, MPI_REAL, master, tag,
40   &       MPI_COMM_WORLD, status, ierr)
41 C  Do matrix multiply
42      do k=1, cols
43        do i=1, N
44          C(i,k) = 0.0
45          do j=1, N
46            C(i,k) = C(i,k) + A(i,j) * B(j,k)
47          end do
48        end do
49      end do
50      call MPI_Send(C, size, MPI_REAL, master,
51   &       tag, MPI_COMM_WORLD, ierr)
52    endif
53    call MPI_Finalize(ierr)
54    end

After Slicing in FACT
The statements in the red boxes of the slide are deleted from the MM program above. In the resulting slice, the declaration at line 5 becomes real A(1,1), B(1,1), C(1,1), and lines 7-8 (MPI_COMM_Rank and MPI_COMM_Size) are marked [M].

Resource Consumption
Resource consumption of the original program (matrix size N, P processes):
- Each worker process: memory for 3N² matrix elements, 2N³/(P-1) floating-point operations, 3 communication operations
- Master process: 3(P-1) communication operations
A worked instance follows below.
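For instance, taking the listing's N = 80 and an assumed P = 5 (the process count is not fixed by the program), each worker holds 3 x 80^2 = 19,200 matrix elements and performs 2 x 80^3 / (5-1) = 256,000 floating-point operations, while the master issues 3 x (5-1) = 12 communication operations.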

Live-Propagation Slicing Algorithm (LPSA)
- Slicing criterion <p, V>: p is a program point, V is a subset of the program variables
- Program slice: a subset of program statements that preserves the behavior of the original program with respect to <p, V> (a toy illustration follows below)
Two key points for the compilation framework: determine the slicing criterion and design the slicing algorithm.
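As a toy illustration (hypothetical code, not from the paper), consider slicing with respect to the criterion <p, V> where p is the final printf and V = {n}; a backward slice keeps only the statements marked "kept":

#include <stdio.h>

int main(void) {
    int a = 3;             /* kept:   n is data dependent on a   */
    int b = 7;             /* sliced: b never affects n          */
    int n = a * 4;         /* kept:   defines n                  */
    int unused = b + 100;  /* sliced: does not affect n          */
    printf("%d\n", n);     /* p: the criterion point             */
    return 0;
}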

Slicing Criterion in LPSA
Our goal: preserve the communication spatial and volume attributes.
- Point-to-point communications: msg type, size, source, dest, tag, and comm. id
- Collective communications: msg type, sending size, receiving size, root id (if any), and comm. id
- source, dest, and comm. id determine the spatial attributes; msg type and msg size determine the volume attributes

Slicing Criterion in LPSA
- Comm Variable: a parameter of a communication routine in a parallel program whose value directly determines the communication attributes of the program
- Comm Set: for a program M, the Comm Set C(M) records all the Comm Variables; C(M) is the slicing criterion in LPSA

MPI_Send(buf, count, type, dest, tag, comm)
  buf   : initial address of send buffer
  count : number of elements in send buffer   [Comm]
  type  : datatype of each buffer element     [Comm]
  dest  : rank of destination                 [Comm]
  tag   : uniquely identifies a message       [Comm]
  comm  : communication context               [Comm]

Comm Variables and Comm Set
For the MM program above, the Comm Set (slicing criterion) is:
C(M) = {(7,myid), (8,nprocs), (24,N), (24,dest), (25,tag), (26,size), (27,dest), (27,tag), (32,size), (33,source), (33,tag), (37,N), (37,master), (37,tag), (39,size), (39,master), (39,tag), (50,size), (50,master), (51,tag)}

How do we find all the statements and variables that can affect the values of the Comm Variables?

Dependences in MPI Programs
- Data dependence (dd): can be represented with UD chains
- Control dependence (cd): can be converted into data dependence
- Communication dependence (md): an inherent characteristic of MPI programs

Data Dependence
Data dependence edges are shown on the MM program above: for example, size used at line 26 is data dependent on its definition at line 10, which in turn depends on cols at line 9 and nprocs at line 8.

Control Dependence
Control dependence edges are shown on the MM program above: for example, the MPI_Send calls at lines 24-27 are control dependent on the branch at line 13 and on the loop at line 22.

Communication Dependence
An inherent characteristic of MPI programs due to their message-passing behavior. Statement x in process i is communication dependent on statement y in process j if and only if:
- process j sends a message to process i through explicit communication routines, and
- statement x is a receiving operation and statement y is a sending operation (x ≠ y)

Communication Dependence
Communication dependence edges are shown on the MM program above: for example, the MPI_Recv at lines 37-38 in a worker process is communication dependent on the MPI_Send at lines 24-25 in the master process.

Slice Set of an MPI Program
- The slice set of an MPI program is computed with respect to the slicing criterion C(M)
- Live Variable: a variable that can affect the value of any Comm Variable through the program dependences

Slicing Algorithm LPSA
- Program slicing here is a backward data-flow problem, solved with a worklist algorithm
- The initial worklist WL[P] is the Comm Set; all Live Variables are computed iteratively through dd, cd, and md
After slicing:
- Preserve all statements that define Live Variables, plus all MPI statements
- Mark MPI statements that define Live Variables or are communication dependent (md) on marked MPI statements (these marks are used at runtime)
A minimal sketch of such a backward worklist pass follows below.
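The following is a minimal sketch in the spirit of LPSA, under the assumption of a toy dependence graph with integer variable IDs; the data structures and the graph are illustrative, not FACT's implementation:

#include <stdio.h>
#include <stdbool.h>

#define NVARS 6

/* deps[v] lists the variables v depends on through dd/cd/md edges,
 * terminated by -1 (a toy graph for illustration). */
static const int deps[NVARS][NVARS + 1] = {
    {-1}, {-1}, {1, -1}, {0, -1}, {2, 3, -1}, {-1}
};

int main(void) {
    bool live[NVARS] = { false };
    int worklist[NVARS * NVARS];
    int top = 0;

    /* Initialize the worklist with the slicing criterion C(M):
     * suppose variable 4 is a Comm Variable. */
    worklist[top++] = 4;

    /* Backward propagation until a fixed point is reached. */
    while (top > 0) {
        int v = worklist[--top];
        if (live[v]) continue;        /* already a Live Variable        */
        live[v] = true;               /* its defining statement is kept */
        for (int i = 0; deps[v][i] != -1; i++)
            worklist[top++] = deps[v][i];
    }

    for (int v = 0; v < NVARS; v++)
        if (live[v]) printf("var %d is LIVE\n", v);
    return 0;
}

Starting from the Comm Variable, the loop terminates once every variable that can influence it has been marked LIVE; everything left unmarked can be sliced away.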

Implementation: Compilation Framework
- LPSA is implemented in the Open64 compiler (http://www.open64.net/)
  - DU and UD chains: PreOPT
  - MOD/REF analysis: IPA
  - CFG and PCG: summary-based IPA framework
- FACT in Open64: http://www.hpctest.org.cn/resources/fact-1.0.tgz

Implementation: Runtime Environment
- Provide a custom communication library based on the MPI profiling layer (PMPI)
- Collect comm. traces from the program slice
- The library judges the state of each MPI statement (e.g., the MPI_Send routine) based on the slicing results
A minimal PMPI wrapper sketch is shown below.
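The sketch below is illustrative only, not FACT's actual library: it records the spatial and volume attributes of each MPI_Send before forwarding the call, whereas FACT's library would additionally consult the marked statements from the slicing results to decide how each call is handled.

#include <mpi.h>
#include <stdio.h>

/* PMPI wrapper: intercept MPI_Send, log its attributes, then forward
 * the call to the real implementation via PMPI_Send. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
    int me, typesize;
    PMPI_Comm_rank(comm, &me);
    PMPI_Type_size(datatype, &typesize);

    /* Spatial attributes: source, dest; volume attributes: type, size. */
    fprintf(stderr, "SEND src=%d dest=%d tag=%d bytes=%d\n",
            me, dest, tag, count * typesize);

    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}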

Evaluation: Benchmarks
- 7 NPB programs: BT, CG, EP, FT, LU, MG, and SP (NPB-3.3, data set CLASS=D)
- ASCI Sweep3D: solves a three-dimensional particle transport problem; weak-scaling mode, problem size 150x150x150

Evaluation: Platforms
- Test platform (32 cores): 4-node small-scale system; each node has two quad-core Intel E5345 CPUs and 8GB memory; Gigabit Ethernet; 32GB memory in total
- Validation platform (512 cores): 32-node large-scale system; each node has four quad-core AMD 8347 CPUs and 32GB memory; InfiniBand network; 1024GB memory in total

Validation
- Compare the communication traces collected by FACT on the test platform with those collected by traditional trace collection methods on the validation platform
- Proof of the Live-Propagation Slicing Algorithm: http://www.hpctest.org.cn/paper/Thu-HPC-TR20090717.pdf

Communication Patterns by FACT
(Figure: communication spatial and volume attributes acquired by FACT)

Memory Consumption
- FACT collects communication traces on the test platform; traditional trace collection methods cannot do this on the test platform due to memory limitations
- For example, with FACT Sweep3D consumes 1.25GB of memory for 512 processes, while the original program consumes 213.83GB

Execution Time
For example, FACT takes 0.28 seconds to collect the comm. traces of BT for 64 processes, while the original program takes 1175.65 seconds on the validation platform.

Application of FACT
Sensitivity analysis of communication patterns to key input parameters. Key input parameters in Sweep3D:
- i, j: number of processes = i*j
- mk: the computation granularity
- mmi: the angle blocking factor
Seven sets of communication traces were collected on the test platform in less than 1 second.

Application of FACT: Communication Locality
- i=8, j=8: process 8 communicates frequently with processes 0, 9, 18
- i=4, j=16: process 8 communicates frequently with processes 4, 9, 12

Application of FACT: Message Size
The message size is determined by mk and mmi.

Limitations of FACT
Limitation: absence of communication temporal attributes.
- CAN: process mapping, optimizing MPI debuggers, designing better communication subsystems
- CANNOT: analyze the overhead of message transmission or the message generation rate
Potential solutions: analytical methods (the PMaC method)

Related Work
- Traditional trace collection methods: ITC, KOJAK, TAU, VAMPIR, etc.
- Trace reduction techniques: FACT performs no compression itself, but can be integrated with existing compression methods to reduce communication trace size
- Symbolic expressions: cannot deal with complex branches, loops, etc.
- Program slicing techniques: program debugging, software testing, etc.

Conclusions and Future Work
- FACT observation: most computation and communication contents are not relevant to communication patterns
- FACT efficiently acquires communication traces of large-scale parallel applications on small-scale systems, with about 1-2 orders of magnitude of improvement
- Future work: acquire temporal attributes for performance prediction

Thank you!

Backup

Live-Propagation Slicing Algorithm (LPSA)

Some Considerations for Inter-Procedural Analysis
- Live Variables can propagate through global variables and function arguments
- Special considerations for inter-procedural analysis:
  - MOD/REF analysis to build precise UD chains
  - Two-phase analysis over the PCG (top-down and bottom-up)
  - Solve an iterative data-flow equation in LPSA

Slicing Results
The results for the example program:
- All Live Variables: LIVE[P] = {(7,myid), (8,nprocs), (27,dest), (33,source), (37,N), (50,size), (50,master), (51,tag), (22,nprocs), (30,nprocs), (13,myid), (13,master), (10,cols), (10,N), (9,N), (9,nprocs)}
- Slice set: S(P) = {3, 7, 8, 9, 10, 11, 12, 13, 22, 30}
- Marked MPI statements: lines 7-8

`buf` Is a LIVE Variable!
LPSA covers this case: size at line 8 is a Comm Variable; through data dependence it reaches num at line 7, which becomes a LIVE Variable; num at line 5 is therefore also LIVE, so line 5 is marked, and through communication dependence (md) line 2 is marked as well. In the worst case, FACT is the same as a traditional trace collection tool: nothing can be sliced.

1     if (myid == 0) {
2 [M]   MPI_Send(&num, 1, MPI_INT, 1, 55, ...)
3       MPI_Recv(buf, num, MPI_INT, 1, 66, ...)
4     } else {
5 [M]   MPI_Irecv(&num, 1, MPI_INT, 0, 55, ..., req)
6       MPI_Wait(req, ...)
7       size = num
8       MPI_Send(buf, size, MPI_INT, 0, 66, ...)
9     }

num is a LIVE Variable.

Communication Dependence
- Communication operations need to be matched, and communication matching is a hard problem
- Current method: a simple algorithm to match MPI operations; in fact, there are no point-to-point communications
- More precise methods: users add annotations; execute the program with a small-scale problem size to identify communication dependences; a more precise algorithm (G. Bronevetsky, CGO 2009)

Memory Consumption (Null Micro-benchmark)
- The null micro-benchmark contains only MPI_Init and MPI_Finalize; the MPI library itself consumes a certain amount of memory for process management
- With 512 processes: NULL 1.04GB, EP 1.11GB, CG 1.22GB of memory
- Results are compared with the null micro-benchmark; AVG is the arithmetic mean over all programs
A minimal sketch of such a null micro-benchmark is shown below.
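A minimal sketch of the null micro-benchmark described above: only MPI_Init and MPI_Finalize, so any memory measured is what the MPI library itself consumes for process management.

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);   /* MPI library sets up process management */
    MPI_Finalize();           /* no computation, no communication       */
    return 0;
}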

Execution Time
Reasons for the overhead when more nodes are used:
- Buffer limitation of the file system
- Communication contention (BT: MPI_Bcast)
- With more nodes (12 nodes): MG 2.43 sec, BT 3.46 sec
