Using a compiler-directed approach to create MPI code automatically: Paraguin Compiler
ITCS4145/5145, Parallel Programming. Clayton Ferner/B. Wilkinson, March 10, 2014. ParagionSlides1.ppt
The Paraguin compiler is being developed by Dr. C. Ferner, UNC-Wilmington. The following is based upon his slides. (Assumes OpenMP has already been seen.)
Paraguin compiler
A source-to-source compiler built using the Stanford SUIF compiler (suif.stanford.edu).
Transforms a sequential program into an MPI program suitable for compilation and execution on a distributed-memory system.
The user can inspect and modify the resulting MPI code.
Creates an abstraction similar to OpenMP's for creating MPI code, using pragma statements.
Directives are also provided for higher-level patterns (scatter-gather, master-worker, workpool*, stencil, etc.).
* Not yet implemented.
Compiler Directives
An advantage of using pragmas is that other compilers will ignore them.
You can provide information to Paraguin that is ignored by other compilers, say gcc.
You can create a hybrid program using pragmas for different compilers.
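A minimal sketch of this point in plain C: the function below carries both Paraguin and OpenMP pragmas, yet it builds with an ordinary C compiler, which skips pragmas it does not recognize (gcc mentions them only as warnings under -Wall). The function name sum_below is hypothetical, and the mix of directives is for illustration only; it is not claimed to be a valid Paraguin program.

```c
#include <assert.h>

/* Sum the integers 0..n-1.
   The pragmas target two different tools. A compiler that does not
   recognize a pragma skips it, so this also builds as sequential C. */
long sum_below(int n)
{
    long sum = 0;
    int i;

    assert(n >= 0);
    ;
#pragma paraguin begin_parallel
#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++) {
        sum += i;
    }
    ;
#pragma paraguin end_parallel
    return sum;
}
```

Compiled with plain gcc, every pragma is skipped and the loop runs sequentially; with gcc -fopenmp, only the OpenMP line takes effect. Either way sum_below(5) returns 10.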
Paraguin Directives
Paraguin syntax:
    #pragma paraguin <type> [<parameters>]
A parallel region is specified by:
    #pragma paraguin begin_parallel
    …
    #pragma paraguin end_parallel
Other directives are placed inside this region, as in OpenMP, but now for MPI processes rather than threads.
Example 1 (Monte Carlo Estimation of PI)

    int main(int argc, char *argv[])
    {
        char *usage = "Usage: %s N\n";
        int i, error = 0, count, count_tmp, total = atoi(argv[1]);
        double x, y, result;

    #pragma paraguin begin_parallel                 // Parallel region
    #pragma paraguin bcast total                    // Broadcast input
        count = 0;                                  // Computation (all processes do this)
        srandom(…);
        for (i = 0; i < total; i++) {
            x = ((double) random()) / RAND_MAX;
            y = ((double) random()) / RAND_MAX;
            if (x*x + y*y <= 1.0) {
                count++;
            }
        }
        ;
    #pragma paraguin reduce sum count count_tmp     // Reduce partial results
    #pragma paraguin end_parallel                   // End parallel region
        // __guin_NP is the Paraguin variable for the number of processes
        result = 4.0 * (((double) count_tmp) / (__guin_NP * total));
        …
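The arithmetic of the final line can be checked without MPI. The sketch below models the processes as iterations of an ordinary loop: each "process" draws total samples, the running sum stands in for the reduce step, and the result is divided by np * total as in the example. It is a sequential model under stated assumptions, not the code Paraguin generates; the names next_unit, count_hits and estimate_pi, the small random-number generator and the per-rank seeds are all hypothetical choices made so the sketch is reproducible.

```c
#include <assert.h>
#include <stdint.h>

/* Small 64-bit linear congruential generator (an assumption made for
   reproducibility); returns a value in [0, 1). */
static double next_unit(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(*state >> 32) / 4294967296.0;
}

/* Work done by one process: count samples inside the unit circle. */
static long count_hits(long total, uint64_t seed)
{
    uint64_t state = seed;
    long i, count = 0;

    for (i = 0; i < total; i++) {
        double x = next_unit(&state);
        double y = next_unit(&state);
        if (x * x + y * y <= 1.0)
            count++;
    }
    return count;
}

/* Model np processes as loop iterations. The running sum plays the
   role of "reduce sum count count_tmp". */
double estimate_pi(int np, long total)
{
    long count_tmp = 0;
    int rank;

    assert(np > 0 && total > 0);
    for (rank = 0; rank < np; rank++)
        count_tmp += count_hits(total, 12345u + (uint64_t)rank);

    /* np * total samples were drawn in all, hence the divisor. */
    return 4.0 * ((double)count_tmp / ((double)np * (double)total));
}
```

Calling estimate_pi(4, 100000) uses 400000 samples in total, which is why the divisor in the example is the number of processes times total rather than total alone.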
Example 2 (Matrix Addition)

    int main(int argc, char *argv[])
    {
        int i, j, error = 0;
        double A[N][N], B[N][N], C[N][N];
        char *usage = "Usage: %s file\n";
        … // Read input matrices A and B

    #pragma paraguin begin_parallel     // Parallel region
    #pragma paraguin scatter A B        // Scatter input to all processors.
    #pragma paraguin forall             // Parallelize for loop
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                C[i][j] = A[i][j] + B[i][j];
            }
        }
        ;
    #pragma paraguin gather C           // Gather partial results
    #pragma paraguin end_parallel       // End parallel region
        … // Process Results
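To make the division of work concrete, the sketch below splits the rows of the matrices into contiguous blocks and lets each simulated process add only its own rows. This is a sequential model: the block rule (remainder rows go to the lowest ranks) is one common choice and is an assumption here, not a statement of how Paraguin partitions the loop. The names block_start and add_by_blocks and the small value of N are hypothetical.

```c
#include <assert.h>

#define N 6   /* small size chosen for illustration */

/* First row owned by `rank` when n rows are split into np contiguous
   blocks; leftover rows go one each to the lowest ranks.
   block_start(np, np, n) equals n, so it also marks the end. */
int block_start(int rank, int np, int n)
{
    int base = n / np, extra = n % np;

    assert(np > 0 && rank >= 0 && rank <= np);
    return rank * base + (rank < extra ? rank : extra);
}

/* Add A and B, with each simulated process computing only its rows.
   The loop over rank stands in for the processes running at once. */
void add_by_blocks(int np, double A[N][N], double B[N][N], double C[N][N])
{
    int rank, i, j;

    for (rank = 0; rank < np; rank++) {
        int lo = block_start(rank, np, N);
        int hi = block_start(rank + 1, np, N);   /* one past the last row */

        for (i = lo; i < hi; i++)
            for (j = 0; j < N; j++)
                C[i][j] = A[i][j] + B[i][j];
    }
}
```

With N = 6 and four processes, the blocks start at rows 0, 2, 4 and 5, so the first two processes handle two rows each and the last two handle one row each.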
Scatter/Gather Pattern
Monte Carlo and Matrix Addition are examples of scatter/gather.
The scatter/gather pattern can also use broadcast, reduction, or both.
It is done as a template:
    1. Master prepares input
    2. Scatter/broadcast input
    3. Compute partial results
    4. Gather/reduce partial results into the final result
Stencil Pattern
Paraguin Stencil Pattern Example (heat distribution)

    #define TOTAL_TIME 3000
    #define N 200
    #define M 200

    double computeValue(double A[][M], int i, int j)
    {
        return (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) * 0.25;
    }

    int main(int argc, char *argv[])
    {
        int i, j, n, m, max_iterations, done;
        double A[2][N][M];
        … // Initialize input A

    #pragma paraguin begin_parallel
        n = N;                          // Computation (all processes do this)
        m = M;
        max_iterations = TOTAL_TIME;
        ;
    #pragma paraguin stencil A n m max_iterations computeValue   // Stencil pattern
    #pragma paraguin end_parallel
        …
Stencil Pragma
The stencil pragma is replaced with code that does the following:
    1. The array given as an argument to the stencil pragma is broadcast to all available processors.
    2. A loop is created to iterate max_iterations times. Within that loop, code is inserted to perform the following steps:
        a. Each processor (except the last one) sends its last row to the processor with rank one more than its own.
        b. Each processor (except the first one) receives the last row from the processor with rank one less than its own.
        c. Each processor (except the first one) sends its first row to the processor with rank one less than its own.
        d. Each processor (except the last one) receives the first row from the processor with rank one more than its own.
        e. Each processor iterates through the values of the rows for which it is responsible and uses the function provided to compute the next value.
    3. The data is gathered back to the root processor (rank 0).
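The steps above can be modelled in plain C without MPI. In the sketch below each simulated process keeps its own full copy of the grid (the broadcast), copies its edge rows into its neighbours' copies (the sends and receives), updates only its own rows, and finally copies those rows back (the gather). A sequential version is included so the two results can be compared. This is a model under stated assumptions, not the code Paraguin emits: the grid size, NP, the block rule in first_row, and every function name are hypothetical, and both buffers of A are assumed to start as copies of the initial grid so the fixed boundary values are present in each.

```c
#include <assert.h>
#include <string.h>

#define ROWS 8
#define COLS 6
#define NP 3   /* number of simulated processes (hypothetical) */

static double compute_value(double A[][COLS], int i, int j)
{
    return (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) * 0.25;
}

/* First row of the contiguous block owned by `rank`. */
static int first_row(int rank)
{
    int base = ROWS / NP, extra = ROWS % NP;
    return rank * base + (rank < extra ? rank : extra);
}

/* Update interior cells of rows [lo, hi), reading cur, writing next. */
static void update_rows(double cur[][COLS], double next[][COLS], int lo, int hi)
{
    int i, j;
    for (i = lo; i < hi; i++) {
        if (i == 0 || i == ROWS - 1)
            continue;                        /* fixed boundary rows */
        for (j = 1; j < COLS - 1; j++)
            next[i][j] = compute_value(cur, i, j);
    }
}

/* Sequential reference; returns the index of the final buffer. */
int run_sequential(double A[2][ROWS][COLS], int iterations)
{
    int t;
    for (t = 0; t < iterations; t++)
        update_rows(A[t % 2], A[1 - t % 2], 0, ROWS);
    return iterations % 2;
}

/* Simulated distributed run; returns the index of the final buffer. */
int run_partitioned(double A[2][ROWS][COLS], int iterations)
{
    static double local[NP][2][ROWS][COLS];
    size_t row_bytes = sizeof local[0][0][0];
    int t, r, cur;

    for (r = 0; r < NP; r++)                 /* step 1: "broadcast" */
        memcpy(local[r], A, sizeof local[r]);

    for (t = 0; t < iterations; t++) {       /* step 2: iteration loop */
        cur = t % 2;
        for (r = 0; r < NP; r++) {           /* steps a-d: edge rows */
            int lo = first_row(r), hi = first_row(r + 1);
            if (r < NP - 1)                  /* last row to rank r+1 */
                memcpy(local[r + 1][cur][hi - 1], local[r][cur][hi - 1], row_bytes);
            if (r > 0)                       /* first row to rank r-1 */
                memcpy(local[r - 1][cur][lo], local[r][cur][lo], row_bytes);
        }
        for (r = 0; r < NP; r++)             /* step e: own rows only */
            update_rows(local[r][cur], local[r][1 - cur],
                        first_row(r), first_row(r + 1));
    }

    cur = iterations % 2;
    for (r = 0; r < NP; r++) {               /* step 3: "gather" */
        int lo = first_row(r), hi = first_row(r + 1);
        memcpy(A[cur][lo], local[r][cur][lo], (size_t)(hi - lo) * row_bytes);
    }
    return cur;
}
```

Because each simulated process reads only its own rows plus the two rows received from its neighbours, run_partitioned should produce the same grid as run_sequential for the same input and iteration count, which is the property the row exchange exists to preserve.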
Compilation and Running
Workflow: source code w/ pragmas -> Paraguin -> mpicc -> executable -> run with mpiexec.
scc is the SUIF compiler driver (a source-to-source compiler):
    scc -DPARAGUIN -D__x86_64__ hello.c -.out.c
    mpicc -o hello.out hello.out.c
    mpiexec -n 8 ./hello.out
In this case, the program runs with 8 processes.
The scc and mpicc steps can be replaced with a single command:
    scc -DPARAGUIN -D__x86_64__ -cc mpicc hello.c -o hello.out
Compile Web Page (avoids needing the scc compiler)
http://babbage.cis.uncw.edu/~cferner/compoptions.html
Upload your source code; it is compiled remotely.
Download the resulting MPI source code and compiler log messages (or the actual executable).
You can then use your MPI environment to compile and execute.
Questions so far?