
1 CSE 260 – Parallel Processing UCSD Fall 2006 A Performance Characterization of UPC Presented by – Anup Tapadia, Fallon Chen

2 CSE 260 – Parallel Processing UCSD Fall 2006 Introduction
Unified Parallel C (UPC) is:
● An explicit parallel extension of ANSI C
● A partitioned global address space language
● Similar to the C language philosophy
● Concise and efficient syntax
● Common and familiar syntax and semantics for parallel C, with simple extensions to ANSI C
● Based on ideas in Split-C, AC, and PCP

3 CSE 260 – Parallel Processing UCSD Fall 2006 UPC Execution Model
● A number of threads working independently in a SPMD fashion
● Number of threads specified at compile-time or run-time; available as program variable THREADS
● MYTHREAD specifies thread index (0..THREADS-1)
● upc_barrier is a global synchronization: all wait
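A minimal sketch of this SPMD model (our own illustration, not from the slides): every thread runs the same main(), reads its own MYTHREAD, and meets the others at a upc_barrier.

    #include <upc_relaxed.h>
    #include <stdio.h>

    int main(void) {
        /* Every thread executes this same program (SPMD). */
        printf("Hello from thread %d of %d\n", MYTHREAD, THREADS);

        upc_barrier;        /* global synchronization: all threads wait here */

        if (MYTHREAD == 0)
            printf("All %d threads passed the barrier\n", THREADS);
        return 0;
    }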

4 CSE 260 – Parallel Processing UCSD Fall 2006 Simple Shared Memory Example
shared [1] int data[4][THREADS]   (block size 1, array size 4 x THREADS)
With block size 1 the elements are laid out cyclically, so thread n owns column n: elements (0,n), (1,n), (2,n), (3,n).
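A short illustration of how such an array might be declared and filled (hypothetical code, not from the slides); the upc_forall affinity expression &data[0][j] makes each thread touch only the column it owns:

    #include <upc_relaxed.h>

    shared [1] int data[4][THREADS];    /* block size 1: column j has affinity to thread j */

    int main(void) {
        int i, j;
        /* Each thread runs only the iterations whose affinity expression
           (&data[0][j]) names memory it owns, i.e. its own column. */
        upc_forall (j = 0; j < THREADS; j++; &data[0][j])
            for (i = 0; i < 4; i++)
                data[i][j] = MYTHREAD;  /* purely local writes */

        upc_barrier;
        return 0;
    }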

5 CSE 260 – Parallel Processing UCSD Fall 2006 Example: Monte Carlo Pi Calculation
● Estimate Pi by throwing darts at a unit square
● Calculate percentage that fall in the unit circle
● Area of square = r² = 1
● Area of circle quadrant = ¼ π r² = π/4
● Randomly throw darts at x,y positions
● If x² + y² < 1, then point is inside circle
● Compute ratio: # points inside / # points total
● π = 4 * ratio
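A hedged sketch of how the dart-throwing might look in UPC (the hits array, trial count, and reduction on thread 0 are our own illustration, not the presenters' code):

    #include <upc_relaxed.h>
    #include <stdio.h>
    #include <stdlib.h>

    shared int hits[THREADS];             /* hits[t] has affinity to thread t */

    int main(void) {
        const int trials = 1000000;       /* darts per thread (illustrative value) */
        int i, my_hits = 0;

        srand(MYTHREAD + 1);              /* per-thread seed */
        for (i = 0; i < trials; i++) {
            double x = rand() / (double)RAND_MAX;
            double y = rand() / (double)RAND_MAX;
            if (x * x + y * y < 1.0)      /* dart landed inside the quarter circle */
                my_hits++;
        }
        hits[MYTHREAD] = my_hits;         /* local write to the shared array */
        upc_barrier;

        if (MYTHREAD == 0) {              /* thread 0 sums the counts and reports pi */
            long total = 0;
            for (i = 0; i < THREADS; i++)
                total += hits[i];         /* remote reads of the other threads' slots */
            printf("pi ~= %f\n", 4.0 * total / ((double)trials * THREADS));
        }
        return 0;
    }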

6 CSE 260 – Parallel Processing UCSD Fall 2006 Monte Carlo Pi Scaling

7 CSE 260 – Parallel Processing UCSD Fall 2006 Ring Performance - DataStar

8 CSE 260 – Parallel Processing UCSD Fall 2006 Ring Performance - Spindel

9 CSE 260 – Parallel Processing UCSD Fall 2006 Ring Performance - DataStar

10 CSE 260 – Parallel Processing UCSD Fall 2006 Ring Performance - Spindel

11 CSE 260 – Parallel Processing UCSD Fall 2006 Parallel Binary sort
(diagram: merge tree for the 8-element example 1 7 5 2 8 4 6 3; pairs are sorted to 17 25 48 36, merged to 1257 and 3468, then merged to the final 12345678)

12 CSE 260 – Parallel Processing UCSD Fall 2006 Parallel Binary sort (cont..)
(diagram: further steps of the 8-element example)

13 CSE 260 – Parallel Processing UCSD Fall 2006 MPI Binary sort scaling (Spindel Test Cluster)

14 CSE 260 – Parallel Processing UCSD Fall 2006 A Performance Characterization of UPC Fallon Chen

15 Matrix Multiply
● Basic square matrix multiply: A x B = C
● A, B and C are NxN matrices
● In UPC, we can take advantage of the data layout for matrix multiply when N is a multiple of the number of THREADS
● Store A row-wise
● Store B column-wise

16 CSE 260 – Parallel Processing UCSD Fall 2006 Data Layout
A (N x P) is distributed by rows: thread t holds elements t*N*P/THREADS .. ((t+1)*N*P/THREADS)-1, i.e. rows t*N/THREADS .. ((t+1)*N/THREADS)-1.
B (P x M) is distributed by column blocks: thread t holds columns t*(M/THREADS) .. ((t+1)*(M/THREADS))-1.
Note: N and M are assumed to be multiples of THREADS (images by Kathy Yelick, from the UPC Tutorial)
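One way to express this layout in UPC declarations (a sketch under the slide's assumptions; the sizes are illustrative and THREADS is assumed fixed at compile time so it can appear in the block sizes):

    #include <upc_relaxed.h>

    #define N 512                 /* illustrative sizes; N and M multiples of THREADS */
    #define P 512
    #define M 512

    /* A blocked by rows: thread t owns rows t*N/THREADS .. (t+1)*N/THREADS - 1 */
    shared [N*P/THREADS] double a[N][P];

    /* B blocked by column groups: with block size M/THREADS, thread t owns
       columns t*M/THREADS .. (t+1)*M/THREADS - 1 of every row */
    shared [M/THREADS] double b[P][M];

    /* C distributed like A, so each thread computes the rows it owns */
    shared [N*M/THREADS] double c[N][M];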

17 CSE 260 – Parallel Processing UCSD Fall 2006 Algorithm
● At each thread, get a local copy of the row(s) of A that have affinity to that particular thread
● At each thread, broadcast the columns of B using a UPC collective function, so that at the end each thread has a copy of all of B
● Multiply the row(s) of A by B to produce a row (or rows) of C
● Very short – about 100 lines of code
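A rough reconstruction of that loop structure, continuing the a/b/c declarations sketched under the Data Layout slide (our own illustration; the presenters mention a UPC collective broadcast for B, which is replaced here by per-block upc_memget calls):

    /* Continues the a/b/c declarations above; THREADS fixed at compile time. */
    void matmul(void) {
        static double a_loc[N/THREADS][P];   /* private copy of this thread's rows of A */
        static double b_loc[P][M];           /* private copy of all of B */
        int i, j, k, t;
        int row0 = MYTHREAD * (N / THREADS); /* first row of A/C owned by this thread */

        /* One bulk copy pulls our rows of A into private memory. */
        upc_memget(a_loc, &a[row0][0], sizeof(double) * (N / THREADS) * P);

        /* Gather B locally, one owned column block per row per thread
           (the slides use a UPC collective broadcast instead). */
        for (t = 0; t < THREADS; t++)
            for (i = 0; i < P; i++)
                upc_memget(&b_loc[i][t * (M / THREADS)],
                           &b[i][t * (M / THREADS)],
                           sizeof(double) * (M / THREADS));

        /* Multiply our rows of A by B, writing the matching (local) rows of C. */
        for (i = 0; i < N / THREADS; i++)
            for (j = 0; j < M; j++) {
                double sum = 0.0;
                for (k = 0; k < P; k++)
                    sum += a_loc[i][k] * b_loc[k][j];
                c[row0 + i][j] = sum;
            }
    }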

18

19

20 Connected Components Labeling
● Used a union-find algorithm for global relabeling
● Stored global labels as a shared array, and used a shared array to exchange ghost cells
● Directly accessing a shared array in a loop is slow for large amounts of data
● Need to use the bulk copies upc_memput and upc_memget, but then you have to attend carefully to how the data is laid out (see next two slides for what happens if you don't)
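A small sketch of the bulk-copy pattern this last bullet describes (the labels array shape and the ghost_north buffer are hypothetical, not the presenters' code): instead of reading a remote row element by element inside a loop, the whole boundary row moves in one upc_memget.

    #include <upc_relaxed.h>

    #define NROWS 256                    /* illustrative per-thread tile size */
    #define NCOLS 256

    /* Global labels: one row block of the image per thread. */
    shared [NROWS*NCOLS] int labels[THREADS][NROWS][NCOLS];

    int ghost_north[NCOLS];              /* private buffer for the neighbor's boundary row */

    void exchange_ghosts(void) {
        if (MYTHREAD > 0) {
            /* Slow: one fine-grained remote access per element.
                 for (int j = 0; j < NCOLS; j++)
                     ghost_north[j] = labels[MYTHREAD-1][NROWS-1][j];         */

            /* Fast: one bulk transfer of the whole boundary row. */
            upc_memget(ghost_north,
                       &labels[MYTHREAD - 1][NROWS - 1][0],
                       NCOLS * sizeof(int));
        }
        upc_barrier;                     /* everyone holds its ghost row before relabeling */
    }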

21 CSE 260 – Parallel Processing UCSD Fall 2006 UPC CCL Scaling

22

23 CSE 260 – Parallel Processing UCSD Fall 2006 Did UPC help, hurt?
● The global view of memory is a useful aid in debugging and development
● Redistribution routines are pretty easy to write
● Efficient code is no easier to write than in MPI, because you have to consider the shared memory data layout when fine-tuning the code

24 CSE 260 – Parallel Processing UCSD Fall 2006 Conclusions
● UPC is easy to program in for C writers, significantly easier than alternative paradigms at times
● UPC exhibits very little overhead when compared with MPI for problems that are embarrassingly parallel; no tuning is necessary
● For other problems, compiler optimizations are improving but are not fully there yet
● With hand-tuning, UPC performance compared favorably with MPI
● Hand-tuned code, with block moves, is still substantially simpler than message-passing code

