1
Paraguin Compiler Communication
2
I/O
Now that we have parallelized the main loop nest of the program, we need to get the input data from a file and out to all of the processors. So how do we do this?
3
I/O
First, we will have only the master thread read the input from a file, because we aren't using parallel file I/O. This is easily done by putting the file I/O outside a parallel region.
4
I/O

    if (argc < 2) {
        fprintf (stderr, usage, argv[0]);
        return -1;
    }

    if ((fd = fopen (argv[1], "r")) == NULL) {
        fprintf (stderr, "%s: Cannot open file %s for reading.\n", argv[0], argv[1]);
        return -1;
    }

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            fscanf (fd, "%f", &a[i][j]);

    ;
#pragma paraguin begin_parallel
5
How do we get the data to the other processors?
MPI provides a number of ways to do this. Since MPI commands can be put directly in the source code, you can still do everything in Paraguin that you can in MPI.

Ex.

#ifdef PARAGUIN
    MPI_Scatter(a, blksz, MPI_FLOAT,
                b, blksz, MPI_FLOAT,
                0, MPI_COMM_WORLD);
#endif
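To make the effect of that call concrete, here is a minimal, self-contained plain-MPI sketch (independent of Paraguin) of scattering an N x N matrix by rows. The names N, NP, blksz, a, and b, and the choice of blksz as a row count, are assumptions made only for this sketch, not details taken from the slides.

    #include <stdio.h>
    #include <mpi.h>

    #define N 8                       /* matrix dimension (assumed)             */

    float a[N][N];                    /* full matrix, valid on the master only  */
    float b[N][N];                    /* local block of rows on every process   */

    int main(int argc, char *argv[])
    {
        int rank, NP, blksz, i, j;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &NP);

        blksz = N / NP;               /* rows per process; assumes NP divides N */

        if (rank == 0)                /* master fills a (stands in for file I/O) */
            for (i = 0; i < N; i++)
                for (j = 0; j < N; j++)
                    a[i][j] = i + j;

        /* Each process receives blksz rows (blksz*N floats) of a into b. */
        MPI_Scatter(a, blksz * N, MPI_FLOAT,
                    b, blksz * N, MPI_FLOAT,
                    0, MPI_COMM_WORLD);

        /* ... each process now works on its blksz rows in b ... */

        MPI_Finalize();
        return 0;
    }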
6
Paraguin only provides a Broadcast
This sends the entire data set to the other processors. Its complexity is O(log2 NP), as opposed to O(NP), where NP is the number of processors.

Ex.

    ;
#pragma paraguin bcast a

This should be inside a parallel region.
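As a hedged sketch of where the broadcast pragma might sit: the bare ';' statements that the pragmas attach to follow the begin_parallel example shown earlier, while the end_parallel pragma and the placeholder initialization loop are assumptions for this sketch, not confirmed usage.

    #define N 4

    float a[N][N];

    int main(void)
    {
        int i, j;

        for (i = 0; i < N; i++)          /* sequential part: master-only input   */
            for (j = 0; j < N; j++)      /* (stands in for the fscanf loop)      */
                a[i][j] = 0.0f;

        ;                                /* pragmas attach to a statement        */
    #pragma paraguin begin_parallel

        ;
    #pragma paraguin bcast a             /* every process gets a copy of a       */

        /* ... parallelized loop nest that reads a on every process ... */

        ;
    #pragma paraguin end_parallel        /* assumed closing pragma               */

        return 0;
    }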
7
Gather
Paraguin does provide a mechanism for gather; however, it does not use MPI_Gather and isn't as efficient: Paraguin sends an individual message from each processor back to the master thread. To use the Paraguin gather, you need to specify which array reference and which iterations produce the final values.
8
Syntax:
#pragma paraguin gather <array reference> <matrix>
where <matrix> is a system of inequalities that identifies which iterations produce the final values.
9
Example: array reference 0, gathering the iterations where i = j-1:

#pragma paraguin gather 0x0 C i j k \
    <matrix rows selecting the iterations where i = j-1>

So what is the array reference? Array references are enumerated starting at 0.
The iterations where i = j-1 are those in which the value of the array element is written for the last time and not changed again (the final value).
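For reference, the condition can be spelled out as inequalities. The encoding convention used here is an assumption, not taken from the Paraguin documentation: if each matrix row (c, ci, cj, ck) stands for the inequality c + ci*i + cj*j + ck*k >= 0 over the columns C i j k, then the equality i = j-1, i.e. i - j + 1 = 0, becomes the pair of rows

    ( 1,  1, -1,  0 )      meaning   1 + i - j >= 0
    (-1, -1,  1,  0 )      meaning  -1 - i + j >= 0

which together force i - j + 1 = 0.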
10
Consider the elimination step of Gaussian Elimination:

    for (i = 1; i <= N; i++)
        for (j = i+1; j <= N; j++)
            for (k = N+1; k >= i; k--)
                a[j][k] = a[j][k] - a[i][k] * a[j][i] / a[i][i];

The five array references are enumerated in order of appearance, starting at 0:
Reference 0: a[j][k] (the write on the left-hand side)
Reference 1: a[j][k] (read)
Reference 2: a[i][k] (read)
Reference 3: a[j][i] (read)
Reference 4: a[i][i] (read)
11
You Can Still Use MPI Commands
Remember, you can still put in MPI commands directly:

#ifdef PARAGUIN
    MPI_Gather(a, blksz, MPI_FLOAT,
               b, blksz, MPI_FLOAT,
               0, MPI_COMM_WORLD);
#endif
12
Dependency and Communication
If there is a data dependence across iterations (a loop-carried dependency), then this data must be transmitted from the processor that computes the value to the processor that needs it. Paraguin can generate the MPI code to perform the communication, but you need to tell it about the dependency.
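As a generic illustration (not Paraguin-specific), the loop below carries a dependence from one iteration to the next: the value written to x[i] is read in the following iteration as x[i-1], so if those iterations ran on different processors, that value would have to be communicated between them.

    /* Generic loop-carried dependence: iteration i reads what iteration i-1 wrote. */
    #define N 10

    int main(void)
    {
        float x[N];
        int i;

        x[0] = 1.0f;
        for (i = 1; i < N; i++)
            x[i] = x[i-1] * 2.0f;    /* depends on the previous iteration's write */

        return 0;
    }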
13
Data Dependency
Consider the elimination step of Gaussian Elimination:

    for (i = 1; i <= N; i++)
        for (j = i+1; j <= N; j++)
            for (k = N+1; k >= i; k--)
                a[j][k] = a[j][k] - a[i][k] * a[j][i] / a[i][i];

The dependence of interest is between Reference 0, the write to a[j][k] on the left-hand side, and Reference 2, the read of a[i][k]: a value written in one iteration of the i loop is read in later iterations.
14
Data Dependency
This is called a loop-carried dependence.

<i, j, k>
<1, 2, 5>: a[2][5] = a[2][5] - a[1][5] * a[2][1] / a[1][1]
<1, 2, 4>: a[2][4] = a[2][4] - a[1][4] * a[2][1] / a[1][1]
<1, 2, 3>: a[2][3] = a[2][3] - a[1][3] * a[2][1] / a[1][1]
<1, 2, 2>: a[2][2] = a[2][2] - a[1][2] * a[2][1] / a[1][1]
<1, 2, 1>: a[2][1] = a[2][1] - a[1][1] * a[2][1] / a[1][1]
...
<2, 3, 5>: a[3][5] = a[3][5] - a[2][5] * a[3][2] / a[2][2]
<2, 3, 4>: a[3][4] = a[3][4] - a[2][4] * a[3][2] / a[2][2]
<2, 3, 3>: a[3][3] = a[3][3] - a[2][3] * a[3][2] / a[2][2]
<2, 3, 2>: a[3][2] = a[3][2] - a[2][2] * a[3][2] / a[2][2]

For example, a[2][5] is written in iteration <1, 2, 5> and read again in iteration <2, 3, 5>.
15
Data Dependency
So we need to tell Paraguin which array references depend on which other references, and the mapping of the read iteration instance to the write iteration instance.

The format of a dependence pragma is:
#pragma paraguin dep <write array reference> <read array reference> <matrix>
where <matrix> is a system of inequalities that maps the read iteration instance to the write iteration instance.
16
Data Dependency
We are going to skip the details of specifying loop-carried data dependences. Why? Because it is complicated, AND …
17
Results from Gaussian Elimination
18
I have been trying to figure out how to get better performance on problems like LU Decomposition and Gaussian Elimination. Even with the progress made, we still can't get speedup when there is inter-processor communication from loop-carried dependencies.
19
“Until inter-processor communication latency is at least as fast as memory access latency, we won't achieve speedup on a distributed-memory system for problems that require inter-process communication beyond scattering the input data and gathering the partial results.” [Ferner:2012]
20
So what good are distributed-memory systems?
Question: What can we run on distributed-memory systems and achieve speedup?
Answer: Parallel programs that do not need inter-processor communication beyond the scatter and gather process.
In other words: "embarrassingly parallel" applications.
21
Communication Patterns Like This
After the scattering of input data and before the gathering of partial results, the processors work independently.

[Diagram: the master scatters the input to processors 1-7, each processor computes independently, and the partial results are gathered back to the master.]
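A minimal plain-MPI sketch of this scatter / independent work / gather pattern follows; the array names, the problem size, and the squaring "work" step are assumptions chosen only to keep the example self-contained.

    #include <stdio.h>
    #include <mpi.h>

    #define N 64                       /* total number of elements (assumed)     */

    int main(int argc, char *argv[])
    {
        float in[N], out[N];           /* meaningful on the master (rank 0) only */
        float local_in[N], local_out[N];
        int rank, NP, blksz, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &NP);

        blksz = N / NP;                /* elements per process; assumes NP divides N */

        if (rank == 0)                 /* master prepares the input (stands in for file I/O) */
            for (i = 0; i < N; i++)
                in[i] = (float)i;

        /* Scatter: each process gets its own block of the input. */
        MPI_Scatter(in, blksz, MPI_FLOAT, local_in, blksz, MPI_FLOAT, 0, MPI_COMM_WORLD);

        /* Independent work: no communication between processes here. */
        for (i = 0; i < blksz; i++)
            local_out[i] = local_in[i] * local_in[i];

        /* Gather: the master collects the partial results. */
        MPI_Gather(local_out, blksz, MPI_FLOAT, out, blksz, MPI_FLOAT, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("out[N-1] = %f\n", out[N-1]);

        MPI_Finalize();
        return 0;
    }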
22
Examples
Matrix Multiplication (the obvious algorithm)
Sobel Edge Detection
Monte Carlo algorithms
Traveling Salesman Problem
Several SPEC benchmark algorithms