FACT: Fast Communication Trace Collection for Parallel Applications through Program Slicing
Jidong Zhai, Tianwei Sheng, Jiangzhou He, Wenguang Chen, Weimin Zheng
Tsinghua University
Motivation
The Importance of Communication Patterns
- Optimize application performance
  - Tune process placement on non-uniform comm. platforms: MPIPP [ICS-08], OPP [EuroPar-09]
- Design better communication subsystems
  - Circuit-switched networks in parallel computing [IPDPS-05]
- Optimize MPI program debuggers
  - Communication locality: MPIWiz [PPoPP-09]
Communication Pattern
Comm. patterns have three attributes:
- Spatial
- Volume
- Temporal
They can be acquired from comm. trace files (msg type, size, source, dest, etc.).
An example: the spatial and volume attributes of the NPB CG program (CLASS=D, NPROCS=64).
Previous Work
Mainly relies on traditional trace collection techniques:
- Instrument the original program
- Execute it on a full-scale parallel system
- Collect communication traces at runtime to obtain the communication pattern
Such tools: ITC/ITA, KOJAK, Paraver, TAU, VAMPIR, etc.
Limitations of Previous Work
Huge resource requirements:
- Computing resources: ASCI SAGE requires 2000-4000 processors
- Memory requirements: NPB FT (CLASS=E) needs more than 600GB of memory
- Cannot collect comm. traces without full-scale systems
Long trace collection time:
- The entire parallel application must execute from beginning to end
- For ASCI SAGE, this takes several months
Our Observations
Two important observations:
- Many important applications do not require communication temporal attributes
  - Process placement optimization
  - Communication locality analysis
- Most computation and message contents of parallel applications are not relevant to their spatial and volume communication attributes
If we can tolerate missing temporal attributes, can we find an efficient method to acquire communication traces?
Our Approach
FACT: FAst Communication Trace collection
FACT can acquire comm. traces of large-scale parallel applications on small-scale systems.
Our idea:
- Reduce the original program to a program slice at compile time
  - Propose the Live-Propagation Slicing Algorithm (LPSA)
- Execute the program slice to collect comm. traces at runtime
  - Custom communication library
FACT combines static analysis with traditional trace collection methods.
Design Overview
Two main components:
- Compilation framework (LPSA)
  - Input: an MPI program
  - Output: program slice and directive information
- Runtime environment
  - Custom MPI comm. library
  - Output: comm. traces
(Figure: Overview of FACT)
An Example (Matrix-Matrix Multiplication)
Fortran program: C = A * B

 1    program MM
 2    include 'mpif.h'
 3    parameter (N = 80)
 C    memory allocation
      real A(N,N), B(N,N), C(N,N)
 6    call MPI_Init(ierr)
 7    call MPI_COMM_Rank(MPI_COMM_WORLD, myid, ierr)
 8    call MPI_COMM_Size(MPI_COMM_WORLD, nprocs, ierr)
 9    cols = N/(nprocs-1)
10    size = cols*N
11    tag = 1
12    master = 0
13    if (myid .eq. master) then
14  C   Initialize matrix A and B
15      do i=1, N
16        do j=1, N
17          A(i,j) = (i-1)+(j-1)
18          B(i,j) = (i-1)*(j-1)
19        end do
20      end do
21  C   Send matrix data to the worker tasks
22      do dest=1, nprocs-1
23        offset = 1 + (dest-1)*cols
24        call MPI_Send(A, N*N, MPI_REAL, dest,
25   &      tag, MPI_COMM_WORLD, ierr)
26        call MPI_Send(B(1,offset), size, MPI_REAL,
27   &      dest, tag, MPI_COMM_WORLD, ierr)
28      end do
29  C   Receive results from worker tasks
30      do source=1, nprocs-1
31        offset = 1 + (source-1)*cols
32        call MPI_Recv(C(1,offset), size, MPI_REAL,
33   &      source, tag, MPI_COMM_WORLD, status, ierr)
34      end do
35    else
36  C   Worker receives data from master task
37      call MPI_Recv(A, N*N, MPI_REAL, master, tag,
38   &      MPI_COMM_WORLD, status, ierr)
39      call MPI_Recv(B, size, MPI_REAL, master, tag,
40   &      MPI_COMM_WORLD, status, ierr)
41  C   Do matrix multiply
42      do k=1, cols
43        do i=1, N
44          C(i,k) = 0.0
45          do j=1, N
46            C(i,k) = C(i,k) + A(i,j) * B(j,k)
47          end do
48        end do
49      end do
50      call MPI_Send(C, size, MPI_REAL, master,
51   &      tag, MPI_COMM_WORLD, ierr)
52    endif
53    call MPI_Finalize(ierr)
54    end
After Slicing in FACT
The sliced program is the MM example with the statements in the red boxes deleted: the matrix initialization loops, the offset computations, and the matrix-multiply loops, none of which affect communication.
Key changes:
- The declaration real A(N,N), B(N,N), C(N,N) is replaced by real A(1,1), B(1,1), C(1,1)
- Lines 7 and 8 (MPI_COMM_Rank / MPI_COMM_Size) are marked [M]
- All MPI statements are preserved
Resource Consumption
Resource consumption of the original program (matrix size N, P processes):
Each worker process:
- 3N² memory (arrays A, B, C)
- 2N³/(P-1) floating-point operations
- 3 communication operations
Master process:
- 3(P-1) communication operations
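As a quick sanity check, both figures follow directly from the example code (memory counted in array elements): every process declares all three N x N arrays, and each worker computes N/(P-1) columns of C, where each column costs N rows times one multiply and one add over N terms:

$$\text{Mem}_{\text{worker}} = |A| + |B| + |C| = 3N^2, \qquad \text{Flops}_{\text{worker}} = \frac{N}{P-1} \cdot N \cdot 2N = \frac{2N^3}{P-1}$$

Each worker posts two receives and one send (3 operations), and the master issues two sends and one receive per worker, i.e. 3(P-1) operations.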
Live-Propagation Slicing Algorithm (LPSA)
Program slice:
- Slicing criterion <p, V>: p is a program point, V is a subset of the program variables
- A program slice is a subset of program statements that preserves the behavior of the original program with respect to <p, V>
Two key points for the compilation framework:
- Determine the slicing criterion
- Design the slicing algorithm
Slicing Criterion in LPSA
Our goal: preserve comm. spatial and volume attributes
- Point-to-point communications: msg type, size, source, dest, tag, and comm. id
- Collective communications: msg type, sending size, receiving size, root id (if it exists), and comm. id
Spatial attributes: source, dest, comm. id
Volume attributes: msg type, msg size
Slicing Criterion in LPSA
Comm Variable: a parameter of a communication routine in a parallel program whose value directly determines the communication attributes of the program.
Comm Set: for a program M, the set C(M) recording all Comm Variables.
C(M) is the slicing criterion in LPSA.

MPI_Send(buf, count, type, dest, tag, comm)
- buf:   initial address of send buffer
- count: number of elements in send buffer [Comm]
- type:  datatype of each buffer element [Comm]
- dest:  rank of destination [Comm]
- tag:   uniquely identifies a message [Comm]
- comm:  communication context [Comm]
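As an illustration only (not FACT's actual data structures), a compiler pass could keep a table mapping each MPI routine to the argument positions that hold Comm Variables and use it to seed the Comm Set; kCommArgPositions, CallSite, and BuildCommSet are hypothetical names:

#include <map>
#include <set>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical table: for each MPI routine, the 0-based argument positions
// whose values determine spatial/volume attributes. For
// MPI_Send(buf, count, datatype, dest, tag, comm) every argument except buf
// is a Comm Variable.
static const std::map<std::string, std::vector<int>> kCommArgPositions = {
    {"MPI_Send",  {1, 2, 3, 4, 5}},
    {"MPI_Recv",  {1, 2, 3, 4, 5}},
    {"MPI_Bcast", {1, 2, 3, 4}},   // count, datatype, root, comm
};

// One call site: (source line, routine name, actual argument expressions).
using CallSite = std::tuple<int, std::string, std::vector<std::string>>;

// Seed the slicing criterion C(M): every actual argument sitting in a Comm
// position becomes a (line, variable) entry of the Comm Set.
std::set<std::pair<int, std::string>> BuildCommSet(const std::vector<CallSite>& sites) {
  std::set<std::pair<int, std::string>> comm_set;
  for (const auto& [line, routine, args] : sites) {
    auto it = kCommArgPositions.find(routine);
    if (it == kCommArgPositions.end()) continue;
    for (int pos : it->second)
      if (pos < static_cast<int>(args.size()))
        comm_set.insert({line, args[pos]});
  }
  return comm_set;
}

Applied to the MM example, this would yield entries such as (27, dest) and (26, size), matching the C(M) shown on the next slide.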
Comm Variables and Comm Set
(The MM example code from the earlier slide, with the Comm Variables at each MPI call site highlighted.)
C(M) = {(7, myid), (8, nprocs), (24, N), (24, dest), (25, tag), (26, size), (27, dest), (27, tag), (32, size), (33, source), (33, tag), (37, N), (37, master), (37, tag), (39, size), (39, master), (39, tag), (50, size), (50, master), (51, tag)}
How do we find all the statements and variables that can affect the values of Comm Variables?
Dependences of MPI Programs
- Data dependence (dd): can be represented with UD chains
- Control dependence (cd): can be converted into data dependence
- Communication dependence (md): an inherent characteristic of MPI programs
Data Dependence
(The MM example code from the earlier slide, with data-dependence edges drawn between statements, e.g. from the definition of size at line 10 to the MPI_Send at line 26.)
Control Dependence
(The MM example code again, with control-dependence edges drawn, e.g. the statements in both branches are control dependent on the if (myid .eq. master) test at line 13.)
Communication Dependence
An inherent characteristic of MPI programs due to their message-passing behavior.
Statement x in process i is communication dependent on statement y in process j if and only if:
- process j sends a message to process i through explicit communication routines, and
- statement x is a receiving operation and statement y is a sending operation (x ≠ y)
Communication Dependence
(The MM example code again, with communication-dependence edges drawn from each MPI_Send in one branch to the matching MPI_Recv in the other, e.g. from the MPI_Send at lines 24-25 to the MPI_Recv at lines 37-38.)
Slice Set of an MPI Program
Given the slicing criterion C(M):
- Live Variable: a variable that can affect the value of some Comm Variable through program dependences (dd, cd, or md)
- Slice set: the statements that must be kept to compute all Live Variables, and hence all Comm Variables
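Written out in one possible notation (ours, not necessarily the paper's), where the arrow denotes a chain of dd, cd, or md dependences:

$$\mathrm{LIVE}(M) = \{\, v \mid \exists\, c \in C(M):\ v \xrightarrow{\,(dd\,\cup\,cd\,\cup\,md)^{*}\,} c \,\}$$
$$S(M) = \{\, s \mid s \text{ defines some } v \in \mathrm{LIVE}(M) \,\} \;\cup\; \{\, s \mid s \text{ is an MPI statement} \,\}$$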
Slicing Algorithm
LPSA:
- Program slicing is a backward data-flow problem, solved with a worklist algorithm
  - The initial worklist WL[P] is the Comm Set
  - All Live Variables are computed iteratively through dd, cd, and md (see the sketch below)
After slicing:
- Preserve all statements that define Live Variables, plus all MPI statements
- Mark MPI statements that define Live Variables or are comm. dependent (md) on marked MPI statements (used at runtime)
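A minimal sketch of such a backward worklist computation, assuming the surrounding compiler supplies three dependence-query callbacks; these callbacks are stand-ins for what Open64's UD chains, control-dependence information, and communication matching would provide, not FACT's actual API:

#include <deque>
#include <functional>
#include <set>
#include <string>
#include <utility>
#include <vector>

// A "variable at a program point", e.g. (26, "size") in the MM example.
using VarAt = std::pair<int, std::string>;                  // (line, variable)
using DepQuery = std::function<std::vector<VarAt>(const VarAt&)>;

// Backward worklist slicing: start from the Comm Set and propagate liveness
// across data (dd), control (cd), and communication (md) dependences until a
// fixed point is reached.
std::set<VarAt> ComputeLiveVariables(const std::set<VarAt>& comm_set,
                                     const DepQuery& data_deps,
                                     const DepQuery& control_deps,
                                     const DepQuery& comm_deps) {
  std::set<VarAt> live(comm_set.begin(), comm_set.end());
  std::deque<VarAt> worklist(comm_set.begin(), comm_set.end());
  while (!worklist.empty()) {
    VarAt v = worklist.front();
    worklist.pop_front();
    for (const auto& query : {data_deps, control_deps, comm_deps}) {
      for (const VarAt& d : query(v)) {
        if (live.insert(d).second)      // newly discovered Live Variable
          worklist.push_back(d);
      }
    }
  }
  return live;
}

Statements that define any variable in the returned set, together with all MPI statements, form the slice.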
Implementation: Compilation Framework
LPSA is implemented in the Open64 compiler (http://www.open64.net/):
- DU and UD chains: PreOPT
- MOD/REF analysis: IPA
- CFG and PCG
- Summary-based IPA framework
FACT in Open64: http://www.hpctest.org.cn/resources/fact-1.0.tgz
Implementation: Runtime Environment
- Provide a custom communication library built on the MPI profiling layer (PMPI)
- Collect comm. traces from the program slice
- The library judges the state of each MPI statement based on the slicing results (e.g. the MPI_Send routine sketched below)
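A minimal sketch of what such a PMPI wrapper could look like; the flag fact_current_call_is_marked is a hypothetical hand-off from the compilation framework's directives, and the record/skip policy shown here is illustrative rather than FACT's actual implementation:

#include <mpi.h>
#include <cstdio>

// Hypothetical flag set by compiler-inserted directives before each MPI call,
// telling the library whether the statement was marked [M] by LPSA (i.e. the
// communicated value is still needed by the slice). Not FACT's actual API.
extern bool fact_current_call_is_marked;

// Illustrative PMPI wrapper for MPI_Send: always record the spatial and
// volume attributes; for unmarked sends the payload is irrelevant (buffers
// may have been shrunk by slicing), so only the message envelope is sent.
int MPI_Send(const void* buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
  int rank = 0, type_size = 0;
  PMPI_Comm_rank(comm, &rank);
  PMPI_Type_size(datatype, &type_size);

  // Trace record: source, destination, tag, and message size in bytes.
  std::fprintf(stderr, "SEND src=%d dest=%d tag=%d bytes=%d\n",
               rank, dest, tag, count * type_size);

  if (fact_current_call_is_marked)
    return PMPI_Send(buf, count, datatype, dest, tag, comm);  // real data needed
  return PMPI_Send(buf, 0, datatype, dest, tag, comm);        // envelope only
}

A matching MPI_Recv wrapper would apply the same rule so that unmarked send/receive pairs still match.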
Evaluation: Benchmarks
- 7 NPB programs (NPB-3.3): BT, CG, EP, FT, LU, MG, and SP; data set CLASS=D
- ASCI Sweep3D: solves a three-dimensional particle transport problem; weak-scaling mode, problem size 150x150x150
Evaluation: Platforms
Test platform (32 cores):
- 4-node small-scale system
- Each node: 2-way quad-core Intel E5345, 8GB memory
- Gigabit Ethernet; 32GB memory in total
Validation platform (512 cores):
- 32-node large-scale system
- Each node: 4-way quad-core AMD 8347, 32GB memory
- InfiniBand network; 1024GB memory in total
Validation
- Compare the communication traces collected by FACT on the test platform with those collected by traditional trace collection methods on the validation platform
- Proof of the Live-Propagation Slicing Algorithm: http://www.hpctest.org.cn/paper/Thu-HPC-TR20090717.pdf
Communication Patterns by FACT
(Figure: communication spatial and volume attributes acquired by FACT.)
Memory Consumption
- FACT collects communication traces on the test platform; traditional trace collection methods cannot do this on the test platform due to memory limitations
- For example, with 512 processes Sweep3D consumes 1.25GB of memory under FACT, while the original program consumes 213.83GB
Execution Time
- For example, FACT takes 0.28 seconds to collect the comm. traces of BT for 64 processes, while the original program takes 1175.65 seconds on the validation platform
Application of FACT
Sensitivity analysis of communication patterns to key input parameters of Sweep3D:
- i, j: number of processes = i*j
- mk: the computation granularity
- mmi: the angle blocking factor
7 sets of communication traces collected on the test platform, in less than 1 second
Application of FACT
Communication locality:
- i=8, j=8: process 8 communicates frequently with processes 0, 9, 18
- i=4, j=16: process 8 communicates frequently with processes 4, 9, 12
Application of FACT
(Figure: message sizes for different mk and mmi settings.)
Limitations of FACT
Limitation: absence of communication temporal attributes
- Can: process mapping, MPI debugger optimization, better communication subsystem design
- Cannot: analyze the overhead of message transmission or the message generation rate
Potential solutions: analytical methods (e.g. the PMaC method)
Related Work
- Traditional trace collection methods: ITC, KOJAK, TAU, VAMPIR, etc.
- Trace reduction techniques: FACT performs no compression itself; it can be integrated with existing compression methods to reduce communication trace size
- Symbolic expression: cannot deal with complex branches, loops, etc.
- Program slicing techniques: program debugging, software testing, etc.
Conclusions and Future Work
FACT:
- Observation: most computation and message contents are not relevant to communication patterns
- Efficiently acquires communication traces of large-scale parallel applications on small-scale systems, with about 1-2 orders of magnitude of improvement
Future work:
- Acquire temporal attributes for performance prediction
Thank you!
Backup
Live-Propagation Slicing Algorithm (LPSA)
Some Considerations for Inter-Procedural Analysis
Live Variables can propagate through:
- Global variables
- Function arguments
Special considerations for inter-procedural analysis:
- MOD/REF analysis to build precise UD chains
- Two-phase analysis over the PCG (top-down and bottom-up)
- Solve an iterative data-flow equation in LPSA
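For intuition, the intra-procedural part of such an iterative computation has the shape of classic backward live-variable equations (our formulation, not necessarily the exact equation used in LPSA):

$$\mathrm{OUT}(s) = \bigcup_{s' \in \mathrm{succ}(s)} \mathrm{IN}(s'), \qquad \mathrm{IN}(s) = \big(\mathrm{OUT}(s) \setminus \mathrm{DEF}(s)\big) \cup \mathrm{USE}_{\mathrm{live}}(s)$$

where USE_live(s) contributes the variables used by s only when s defines a Live Variable or is an MPI call whose Comm arguments are in C(M); iteration continues to a fixed point, and the two-phase PCG traversal propagates the sets across call boundaries via global variables and arguments.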
Slicing Results
Results for the example program:
All the Live Variables:
LIVE[P] = {(7, myid), (8, nprocs), (27, dest), (33, source), (37, N), (50, size), (50, master), (51, tag), (22, nprocs), (30, nprocs), (13, myid), (13, master), (10, cols), (10, N), (9, N), (9, nprocs)}
Slice set:
S(P) = {3, 7, 8, 9, 10, 11, 12, 13, 22, 30}
Marked MPI statements: lines 7-8
`buf` is a LIVE Variable!
LPSA can cover this case:
- (8, size) is a Comm Variable
- dd: (7, num) becomes a LIVE Variable
- dd: (5, num) becomes a LIVE Variable, so line 5 is marked
- md: line 2 is marked
- ... and so on: the value of num, sent as message content at line 2, determines the message size at line 8

1     if (myid == 0) {
2 [M]   MPI_Send(&num, 1, MPI_INT, 1, 55, ...)
3       MPI_Recv(buf, num, MPI_INT, 1, 66, ...)
4     } else {
5 [M]   MPI_Irecv(&num, 1, MPI_INT, 0, 55, ..., req)
6       MPI_Wait(req, ...)
7       size = num
8       MPI_Send(buf, size, MPI_INT, 0, 66, ...)
9     }

num is a LIVE Variable.
In the worst case, FACT is the same as traditional trace collection tools: nothing can be sliced!
Communication Dependence
- Communication dependence requires matching communication operations, and communication matching is a hard problem
- Current method: a simple algorithm to match MPI operations (in practice, there were no point-to-point communications that needed matching)
- More precise methods:
  - Users add annotations
  - Execute the program with a small-scale problem size to identify communication dependences
  - More precise matching algorithms (G. Bronevetsky, CGO 2009)
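For illustration only, a naive static matcher in the spirit of the "simple algorithm" above (not the paper's algorithm) could pair each receive with the first compatible unmatched send, treating unknown values and MPI_ANY_SOURCE / MPI_ANY_TAG as wildcards; P2PCall and MatchPointToPoint are hypothetical names:

#include <utility>
#include <vector>

// Static summary of a point-to-point call site; -1 means "unknown or wildcard".
struct P2PCall {
  int line;          // source line of the call site
  bool is_send;      // true for MPI_Send-like, false for MPI_Recv-like
  int owner_rank;    // rank executing the call, -1 if not statically known
  int peer_rank;     // dest (for sends) or source (for recvs), -1 if unknown
  int tag;           // -1 if unknown or MPI_ANY_TAG
  bool matched = false;
};

static bool Compatible(const P2PCall& send, const P2PCall& recv) {
  auto ok = [](int a, int b) { return a == -1 || b == -1 || a == b; };
  return ok(send.peer_rank, recv.owner_rank) &&
         ok(recv.peer_rank, send.owner_rank) &&
         ok(send.tag, recv.tag);
}

// For every receive, return (recv line, send line) for the first compatible
// unmatched send; unmatched receives are simply skipped.
std::vector<std::pair<int, int>> MatchPointToPoint(std::vector<P2PCall>& calls) {
  std::vector<std::pair<int, int>> pairs;
  for (auto& recv : calls) {
    if (recv.is_send) continue;
    for (auto& send : calls) {
      if (!send.is_send || send.matched || !Compatible(send, recv)) continue;
      send.matched = true;
      pairs.push_back({recv.line, send.line});
      break;
    }
  }
  return pairs;
}

Real matching must also handle collectives, wildcards resolved only at runtime, and non-deterministic message orderings, which is why the more precise approaches listed above exist.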
Memory Consumption
- Null micro-benchmark: only MPI_Init and MPI_Finalize; the MPI library itself consumes a certain amount of memory for process management
- With 512 processes: NULL 1.04GB, EP 1.11GB, CG 1.22GB
- Results are compared against the Null micro-benchmark; AVG is the arithmetic mean over all programs
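The Null micro-benchmark described above is essentially the empty MPI program; a C++ rendering of that description:

#include <mpi.h>

// "Null" micro-benchmark: measures the baseline memory the MPI library itself
// needs for process management, independent of any application behavior.
int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  MPI_Finalize();
  return 0;
}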
Execution Time
Reasons for the remaining execution time when more nodes are used (12 nodes: MG 2.43 sec, BT 3.46 sec):
- Buffer limitation of the file system
- Communication contention (BT: MPI_Bcast)
An Example (Matrix-Matrix Multiplication)
(Backup copy of the MM Fortran example shown earlier.)