Slide 1: Five Common Defect Types in Parallel Computing
Prepared for Applied Parallel Computing (Prof. Alan Edelman)
Taiga Nakamura, University of Maryland

Slide 2: Introduction
Debugging and testing parallel programs is hard.
– What kinds of mistakes do programmers make?
– How can we prevent, or effectively find and fix, defects?
Hypothesis: knowing about common defects will reduce time spent debugging.
Here: five common defect types in parallel programming (from last year's classes).
– These examples are in C/MPI.
– We suspect similar defect types in UPC, CAF, and F/MPI.
Your feedback is solicited (by both us and UMD)!

Slide 3: Defect 1: Language Usage Example

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        FILE *fp;
        int status;
        status = MPI_Init(NULL, NULL);
        if (status != MPI_SUCCESS) { return -1; }
        fp = fopen(...);
        if (fp == NULL) { return -1; }
        ...
        fclose(fp);
        MPI_Finalize();
        return 0;
    }

Notes on the defects:
– MPI_Init(NULL, NULL) is valid in MPI 2.0 only. In 1.1, it had to be MPI_Init(&argc, &argv).
– MPI_Finalize must be called by all processors in every execution path; here the early returns skip it.
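For contrast, here is a minimal corrected sketch (not from the original slides): it uses the portable MPI_Init(&argc, &argv) form and reaches MPI_Finalize on every execution path. The file name "input.dat" is a hypothetical placeholder.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rc = 1;
        FILE *fp;

        if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {   /* portable in MPI 1.1 and 2.0 */
            return 1;
        }
        fp = fopen("input.dat", "r");                  /* hypothetical input file */
        if (fp != NULL) {
            /* ... do the real work here ... */
            fclose(fp);
            rc = 0;
        }
        MPI_Finalize();                                /* reached on every path */
        return rc;
    }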

Slide 4: Use of Language Features
Advanced language features are not necessarily used.
– Try to understand a few basic language features thoroughly.
[Chart: MPI keywords used in Conjugate Gradient codes in C/C++ (15 students): 24 functions, 8 constants.]

Slide 5: Defect 1: Language Usage
Erroneous use of parallel language features.
– E.g., inconsistent data types between send and recv (see the sketch below), or misuse of memory copy functions in UPC.
Simple mistakes in understanding.
– Very common in novices.
Compile-time defects can be found easily.
– Wrong number or type of parameters, etc.
Some defects surface only under specific conditions.
– E.g., number of processors, input values, hardware/software environment.
Advice:
– Check unfamiliar language features carefully.
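As an illustration of the datatype bullet above, here is a hedged sketch of a send/recv type mismatch (my own example, not from the slides): the receiver describes a smaller buffer than the message that actually arrives, so MPI typically reports a truncation error.

    /* Sketch of a datatype-mismatch defect between sender and receiver. */
    double values[100];
    if (rank == 0) {
        MPI_Send(values, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Defect: describes 100 floats, but 100 doubles (twice the bytes) arrive. */
        MPI_Recv(values, 100, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
    }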

Slide 6: Defect 2: Space Decomposition
Example: Game of Life
– Loop boundaries must be changed when the grid is distributed (there are other approaches too).

Serial version:

    /* Main loop */
    /* ... */
    for (y = 0; y < ysize; y++) {
        for (x = 0; x < xsize; x++) {
            c = count(buffer, xsize, ysize, x, y);
            /* update buffer ... */
        }
    }

Parallel version:

    MPI_Comm_size(MPI_COMM_WORLD, &np);
    ysize /= np;        /* caution: ysize may not be divisible by np */
    /* Main loop */
    /* ... */
    for (y = 0; y < ysize; y++) {
        for (x = 0; x < xsize; x++) {
            c = count(buffer, xsize, ysize, x, y);
            /* update buffer ... */
        }
    }
    /* MPI_Send, MPI_Recv for the boundary rows */

[Figure: the xsize-by-ysize grid is split into row strips of ysize/np rows per processor, with send/recv of the boundary rows between neighboring processors.]
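To address the "ysize may not be divisible by np" caution, one common sketch (variable names my own, not from the slide) gives the first ysize % np ranks one extra row:

    /* Sketch: distribute ysize rows over np ranks when np does not divide ysize. */
    int base  = ysize / np;                                       /* rows every rank gets    */
    int extra = ysize % np;                                       /* leftover rows           */
    int my_ysize = base + (rank < extra ? 1 : 0);                 /* rows owned by this rank */
    int my_y0    = rank * base + (rank < extra ? rank : extra);   /* first global row owned  */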

Slide 7: Defect 2: Space Decomposition
Incorrect mapping between the problem space and the program memory space.
The mapping in the parallel version can be different from that in the serial version.
– The array origin is different on every processor.
– Additional memory space for communication (ghost rows) can complicate the mapping logic (a sketch follows below).
Symptoms:
– Segmentation fault (if an array index is out of range).
– Incorrect, or slightly incorrect, output.
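A minimal sketch of that mapping logic, assuming one ghost row above and below each rank's strip; my_y0 and my_ysize are the hypothetical names from the decomposition sketch earlier, and none of this comes from the slides.

    /* Sketch: the local buffer holds my_ysize owned rows plus two ghost rows.
     * Local row 0 and local row my_ysize+1 hold copies of neighbor rows;
     * the owned rows occupy local rows 1 .. my_ysize. */
    char *local = malloc((size_t)(my_ysize + 2) * xsize);

    /* Global row g (with my_y0 <= g < my_y0 + my_ysize) maps to local row g - my_y0 + 1. */
    #define LOCAL_ROW(g) ((g) - my_y0 + 1)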

Slide 8: Defect 3: Side-Effects of Parallelization
Example: approximation of pi (Monte Carlo).

Serial version:

    srand(time(NULL));
    for (i = 0; i < n; i++) {
        double x = rand() / (double)RAND_MAX;
        double y = rand() / (double)RAND_MAX;
        if (x*x + y*y < 1) ++k;
    }
    return k / (double)n;

Parallel version:

    int np;
    status = MPI_Comm_size(MPI_COMM_WORLD, &np);
    ...
    srand(time(NULL));
    for (i = 0; i < n; i += np) {
        double x = rand() / (double)RAND_MAX;
        double y = rand() / (double)RAND_MAX;
        if (x*x + y*y < 1) ++k;
    }
    status = MPI_Reduce(... MPI_SUM ...);
    ...
    return k / (double)n;

Problems:
1. All processors might use the same pseudo-random sequence, spoiling independence.
2. Hidden serialization inside rand() causes a performance bottleneck.
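One hedged fix for problem 1 is to give each rank a different seed. This is a sketch of a common workaround, not the slide's prescription; a dedicated parallel random number generator is the more rigorous answer, and rand() itself can remain a serial bottleneck.

    /* Sketch: decorrelate the per-rank streams by folding the rank into the seed. */
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    srand((unsigned)time(NULL) + 1234u * (unsigned)rank);   /* 1234 is an arbitrary offset */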

Slide 9: Defect 3: Side-Effects of Parallelization
Example: file I/O

    FILE *fp = fopen(...);
    if (fp != NULL) {
        while (...) {
            fscanf(...);
        }
        fclose(fp);
    }

The filesystem may become a performance bottleneck if all processors access the same file simultaneously. (Schedule I/O carefully, or let a "master" processor do all the I/O; a sketch follows below.)
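A hedged sketch of the "let the master do all I/O" alternative: rank 0 reads the file and the data is then shared with MPI_Bcast. The file name, buffer size, and element format are assumptions for illustration.

    /* Sketch: only rank 0 touches the filesystem, then broadcasts the data. */
    double data[1024];
    int count = 0;
    if (rank == 0) {
        FILE *fp = fopen("input.dat", "r");              /* hypothetical file name */
        if (fp != NULL) {
            while (count < 1024 && fscanf(fp, "%lf", &data[count]) == 1) {
                count++;
            }
            fclose(fp);
        }
    }
    MPI_Bcast(&count, 1, MPI_INT, 0, MPI_COMM_WORLD);    /* how many values were read */
    MPI_Bcast(data, count, MPI_DOUBLE, 0, MPI_COMM_WORLD);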

Slide 10: Defect 3: Side-Effects of Parallelization
Typical parallel programs contain only a few parallel primitives.
– The rest of the code is an ordinary sequential program running in parallel.
Ordinary serial constructs can cause correctness or performance defects when they are executed in a parallel context.
Advice:
– Don't focus only on the parallel code.
– Check that the serial code works on one processor, but remember that a defect may surface only in a parallel context.

Slide 11: Defect 4: Performance
Example: load balancing
The rank 0 "master" processor just waits while the other "worker" processors execute the loop.

    myN = N / (numProc - 1);
    if (myRank != 0) {
        for (i = 0; i < myN; i++) {
            if (...) { ++myHits; }
        }
    }
    MPI_Reduce(&myHits, &totalHits, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
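A hedged sketch of one way to rebalance this: give every rank, including rank 0, a share of the N iterations. The remainder handling is my addition, not the slide's.

    /* Sketch: all numProc ranks do a share of the work, including rank 0. */
    myN = N / numProc + (myRank < N % numProc ? 1 : 0);   /* spread the remainder too */
    for (i = 0; i < myN; i++) {
        if (...) { ++myHits; }
    }
    MPI_Reduce(&myHits, &totalHits, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);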

Slide 12: Defect 4: Performance
Example: scheduling

    if (rank != 0) {
        MPI_Send(board[y1],   Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD);
        MPI_Recv(board[y1-1], Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &status);
    }
    if (rank != (size-1)) {
        MPI_Recv(board[y2+1], Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &status);
        MPI_Send(board[y2],   Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD);
    }

[Figure: ranks 0 .. size-1 each own rows y1..y2 of the board and exchange boundary rows with their neighbors.]

The communication requires O(size) time; a "correct" solution takes O(1). The exchanges serialize into a chain:
    #1 Send → #0 Recv → #0 Send → #1 Recv
    #2 Send → #1 Recv → #1 Send → #2 Recv
    #3 Send → #2 Recv → #2 Send → #3 Recv
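One hedged sketch of an O(1) schedule, reusing the slide's variable names: order each neighbor exchange by rank parity, so that all pairs communicate at the same time instead of forming a chain.

    /* Sketch: even ranks send first, odd ranks receive first, so neighbor
     * exchanges happen pairwise in two rounds instead of an O(size) chain. */
    if (rank % 2 == 0) {
        if (rank != 0) {                     /* exchange with the rank below */
            MPI_Send(board[y1],   Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD);
            MPI_Recv(board[y1-1], Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &status);
        }
        if (rank != size-1) {                /* exchange with the rank above */
            MPI_Send(board[y2],   Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD);
            MPI_Recv(board[y2+1], Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &status);
        }
    } else {
        if (rank != size-1) {                /* exchange with the rank above */
            MPI_Recv(board[y2+1], Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &status);
            MPI_Send(board[y2],   Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD);
        }
        if (rank != 0) {                     /* exchange with the rank below */
            MPI_Recv(board[y1-1], Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &status);
            MPI_Send(board[y1],   Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD);
        }
    }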

Slide 13: Defect 4: Performance
Scalability problems arise because the processors are not working in parallel.
– The program output itself is correct.
– Perfect parallelization is often difficult: you need to judge whether the execution speed is acceptable.
Symptoms: sub-linear scalability, performance much lower than expected (e.g., most time spent waiting), unbalanced amounts of computation.
– Load balance may depend on the input data.
Advice:
– Make sure all processors are "working" in parallel.
– A profiling tool might help.

Slide 14: Defect 5: Synchronization
Example: deadlock

    MPI_Recv(board[y2+1], Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &status);
    MPI_Send(board[y1],   Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD);
    MPI_Recv(board[y1-1], Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &status);
    MPI_Send(board[y2],   Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD);

This is an obvious example of deadlock (you can't avoid noticing it): every rank blocks in its first MPI_Recv, so no rank ever reaches a matching MPI_Send.
    #0 Recv → deadlock
    #1 Recv → deadlock
    #2 Recv → deadlock

Slide 15: Defect 5: Synchronization
Example: deadlock

    MPI_Send(board[y1],   Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD);
    MPI_Recv(board[y2+1], Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &status);
    MPI_Send(board[y2],   Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD);
    MPI_Recv(board[y1-1], Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &status);

This may work, but it can deadlock with some implementations and parameters: if MPI_Send blocks (e.g., for messages too large to buffer), every rank sits in its first Send waiting for a Recv that is never posted.
    #0 Send → deadlock if MPI_Send is blocking
    #1 Send → deadlock if MPI_Send is blocking
    #2 Send → deadlock if MPI_Send is blocking

A "correct" solution could (1) alternate the order of send and recv, (2) use MPI_Bsend with sufficient buffer size, (3) use MPI_Sendrecv, or (4) use MPI_Isend/Irecv (see http://www.mpi-forum.org/docs/mpi-11-html/node41.html).
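A hedged sketch of option (3), MPI_Sendrecv, using the slide's variable names: the library pairs up each exchange so neither side can block the other, and MPI_PROC_NULL turns the missing neighbor at each boundary rank into a no-op, so no if-statements are needed.

    /* Sketch: combined send/receive for the halo exchange. */
    int up   = (rank == size-1) ? MPI_PROC_NULL : rank + 1;
    int down = (rank == 0)      ? MPI_PROC_NULL : rank - 1;

    MPI_Sendrecv(board[y1],   Xsize, MPI_CHAR, down, tag,
                 board[y2+1], Xsize, MPI_CHAR, up,   tag,
                 MPI_COMM_WORLD, &status);
    MPI_Sendrecv(board[y2],   Xsize, MPI_CHAR, up,   tag,
                 board[y1-1], Xsize, MPI_CHAR, down, tag,
                 MPI_COMM_WORLD, &status);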

Slide 16: Defect 5: Synchronization
Example: barriers

    for (...) {
        MPI_Isend(board[y1],   Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &request);
        MPI_Recv (board[y2+1], Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &status);
        MPI_Isend(board[y2],   Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &request);
        MPI_Recv (board[y1-1], Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &status);
    }

Synchronization (e.g., MPI_Barrier) is needed at each iteration. (But too many barriers can cause a performance problem.)
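The slide's point is that each iteration needs synchronization before the buffers are reused. One hedged sketch (mine, not the slide's): keep the MPI_Isend requests and complete them with MPI_Waitall every iteration; an MPI_Barrier alone would not complete the pending sends.

    /* Sketch: complete both non-blocking sends before the next iteration
     * reuses board[y1] and board[y2]. */
    MPI_Request requests[2];
    MPI_Status  statuses[2];

    for (...) {
        MPI_Isend(board[y1],   Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &requests[0]);
        MPI_Recv (board[y2+1], Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &status);
        MPI_Isend(board[y2],   Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &requests[1]);
        MPI_Recv (board[y1-1], Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &status);

        MPI_Waitall(2, requests, statuses);   /* sends are complete; safe to update board */
        /* ... update board, then continue to the next generation ... */
    }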

Slide 17: Defect 5: Synchronization
Races and deadlocks are well-known defect types in parallel programming.
– Some of these defects can be very subtle.
– Use of asynchronous (non-blocking) communication can lead to more synchronization defects.
Symptoms: the program hangs, or produces incorrect or non-deterministic output.
– These particular examples derive from an insufficient understanding of the language specification.
Advice:
– Make sure that all communications are correctly coordinated.

Slide 18: Summary
This is a first cut at understanding common defects in parallel programming.

