Parallel Computing Overview
CS 524 - High-Performance Computing
Slide 2: Parallel Computing
- Multiple processors work cooperatively to solve a computational problem.
- Examples range from specially designed parallel computers and algorithms to geographically distributed networks of workstations cooperating on a task.
- Some problems cannot be solved by present-day serial computers, or take an impractically long time to solve.
- Parallel computing exploits the concurrency and parallelism inherent in the problem domain:
  - Task parallelism
  - Data parallelism
  (both are sketched in the code below)
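To make the distinction concrete, here is a minimal sketch of my own (not from the slides), in the fixed-form Fortran/OpenMP style the deck uses later; the array values and the two "tasks" are arbitrary illustrations:

      program parstyles
      integer i, n
      parameter (n = 8)
      real x(n), y(n)
      do i = 1, n
         x(i) = i
      end do
c Data parallelism: the same operation applied to different pieces
c of an array; the iterations are independent, so they can be
c divided among processors (here, OpenMP threads).
c$omp parallel do
      do i = 1, n
         y(i) = 2.0*x(i)
      end do
c Task parallelism: different, independent computations run
c concurrently on different processors.
c$omp parallel sections
c$omp section
      print *, 'task A: sum  =', sum(y)
c$omp section
      print *, 'task B: peak =', maxval(y)
c$omp end parallel sections
      end

In the data-parallel loop the same statement runs on different array sections; in the task-parallel sections two unrelated computations run at the same time.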
Slide 3: Development Trends
- Advances in IC technology and processor design:
  - CPU performance has doubled roughly every 18 months for the past 20+ years (Moore's Law; a quick arithmetic check follows this list)
  - Clock rates have increased from 4.77 MHz for the 8088 (1979) to 3.2 GHz for the Pentium 4 (2003)
  - FLOPS have grown from a handful (1945) to 35.86 TFLOPS (NEC's Earth Simulator, 2002 to date)
  - Cost and size have decreased
- Advances in computer networking:
  - Bandwidth has increased from a few bits per second to more than 10 Gb/s
  - Size and cost have decreased, and reliability has increased
- The need: solution of larger and more complex problems
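As a sanity check on those numbers (my arithmetic, not on the slide): doubling every 18 months over the 24 years from 1979 to 2003 predicts a performance factor of

$$2^{24/1.5} = 2^{16} \approx 65{,}500,$$

while clock rate alone grew by only $3.2\,\text{GHz} / 4.77\,\text{MHz} \approx 671 \approx 2^{9.4}$; the rest of the gain is commonly attributed to architectural advances such as pipelining, superscalar execution, and caches.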
Slide 4: Issues in Parallel Computing
- Parallel architectures: design of bottleneck-free hardware components
- Parallel programming models: a parallel view of the problem domain for effective partitioning and distribution of work among processors
- Parallel algorithms: efficient algorithms that take advantage of parallel architectures
- Parallel programming environments: programming languages, compilers, portable libraries, development tools, etc.
Slide 5: Two Key Algorithm Design Issues
- Load balancing
  - The execution time of a parallel program is the time elapsed from the start of processing by the first processor to the end of processing by the last processor
  - Hence the computational load must be partitioned evenly among processors (formalized below)
- Communication overhead
  - Processors are much faster than communication links
  - Hence data must be partitioned among processors so as to minimize communication
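The load-balancing point can be stated precisely (standard definitions, my formulation rather than the slide's): if processor p finishes its share of the work at time $t_p$, then

$$T_{\text{par}} = \max_{0 \le p < P} t_p, \qquad S = \frac{T_{\text{serial}}}{T_{\text{par}}}, \qquad E = \frac{S}{P},$$

so a single overloaded processor determines the run time of the whole program, and the speedup $S$ and efficiency $E$ are maximized when the $t_p$ are equal.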
Slide 6: Parallel MVM: Row-Block Partition

      do i = 1, N
         do j = 1, N
            y(i) = y(i) + A(i,j)*x(j)
         end do
      end do

[Figure: A is divided into four horizontal row blocks owned by P0-P3; the vectors x and y are divided into four corresponding blocks, one per processor.]
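A per-processor view of this partition (a sketch under stated assumptions: P processors, N divisible by P, and hypothetical local arrays ALoc and yLoc, named after the aLoc/bLoc convention the deck uses later):

c Processor me owns Nloc = N/P rows of A, stored in ALoc(1:Nloc,1:N),
c and the matching block of y in yLoc(1:Nloc). Every processor needs
c ALL of x, so x is replicated (or gathered) before the loop.
      Nloc = N/P
      do i = 1, Nloc
         yLoc(i) = 0.0
         do j = 1, N
            yLoc(i) = yLoc(i) + ALoc(i,j)*x(j)
         end do
      end do
c No communication is needed afterwards: the yLoc blocks, one per
c processor, together form y.

The cost of this layout is that the entire x vector must reach every processor before the computation starts.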
Slide 7: Parallel MVM: Column-Block Partition

      do j = 1, N
         do i = 1, N
            y(i) = y(i) + A(i,j)*x(j)
         end do
      end do

[Figure: A is divided into four vertical column blocks owned by P0-P3; the vectors x and y are divided into four corresponding blocks, one per processor.]
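The per-processor view here (same assumptions as before; xLoc and ypart are hypothetical names of mine) trades the gather of x for a reduction of y:

c Processor me owns Nloc = N/P columns of A, stored in ALoc(1:N,1:Nloc),
c and the matching block of x in xLoc(1:Nloc).
      Nloc = N/P
      do i = 1, N
         ypart(i) = 0.0
      end do
      do j = 1, Nloc
         do i = 1, N
            ypart(i) = ypart(i) + ALoc(i,j)*xLoc(j)
         end do
      end do
c Each processor now holds a full-length PARTIAL result; the P
c partial vectors must be summed (a reduction) to obtain y.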
Slide 8: Parallel MVM: Block Partition
- Can we do any better?
- Assume the same distribution of x and y
- Can A be partitioned to reduce communication?

[Figure: A is divided into a 2x2 grid of square blocks, owned by P0 and P1 on top and P2 and P3 below; x and y remain distributed in four blocks across P0-P3.]
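A rough communication count shows why a block partition can do better (standard analysis, not worked out on the slide): under a 1D row- or column-block partition every processor must gather or reduce a full-length vector, whereas under a 2D partition on a $\sqrt{P} \times \sqrt{P}$ grid each processor only exchanges the vector pieces for its block row and block column:

$$\text{1D: } \Theta(N) \ \text{words per processor}, \qquad \text{2D: } \Theta\!\left(\tfrac{N}{\sqrt{P}}\right) \ \text{words per processor},$$

ignoring logarithmic factors from the collective operations, i.e. roughly a $\sqrt{P}$-fold reduction in communication.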
Slide 9: Parallel Architecture Models
- Bus-based shared memory, i.e. symmetric multiprocessor (SMP) (e.g. suraj, dual/quad-processor Xeon machines)
- Network-based distributed memory (e.g. Cray T3E, our Linux cluster)
- Network-based distributed-shared memory, with hardware support for a global address space (e.g. SGI Origin 2000)
- Network-based distributed shared memory, i.e. networks of SMPs (e.g. SMP clusters)
Slide 10: Bus-Based Shared Memory (SMP)
- Any processor can access any memory location at equal cost (hence "symmetric" multiprocessor)
- Tasks "communicate" by writing to and reading from commonly accessible locations
- Easier to program
- Cannot scale beyond roughly 30 processors: the shared bus becomes a bottleneck
- Examples: most workstation vendors make SMPs (Sun, IBM, Intel-based SMPs); Cray T90 and SV1 (which use a crossbar instead of a bus)

[Figure: four processors P connected by a bus to a single shared memory.]
Slide 11: Network-Connected Distributed Memory
- Each processor can access only its own memory
- Explicit communication by sending and receiving messages
- More tedious to program
- Can scale to thousands of processors
- Examples: Cray T3E, clusters

[Figure: four processors P, each with its own memory M, connected by an interconnection network.]
Slide 12: Network-Connected Distributed-Shared Memory
- Each processor can directly access any memory location
- Memory is physically distributed
- Non-uniform memory access costs
- Example: SGI Origin 2000

[Figure: four processors P, each with a local memory M, connected by an interconnection network that presents a single address space.]
Slide 13: Network-Connected Distributed Shared Memory (SMP Clusters)
- A network of SMPs
- Each SMP can access only its own memory
- Explicit communication between SMPs
- Can take advantage of both the shared-memory and the distributed-memory programming models (a hybrid sketch follows this list)
- Can scale to hundreds of processors
- Examples: SMP clusters

[Figure: two SMP nodes, each with processors on a bus sharing a local memory, connected by an interconnection network.]
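One common way to program such a machine (a minimal sketch of my own, not from the slides) is hybrid: MPI for the explicit communication between SMP nodes, OpenMP threads within each node. MPI_THREAD_FUNNELED requests a threading level where only the main thread makes MPI calls:

      program hybrid
      include 'mpif.h'
      integer ierr, provided, rank, nthreads
      integer omp_get_num_threads
c MPI carries messages between SMP nodes; OpenMP threads share
c memory within a node.
      call mpi_init_thread(MPI_THREAD_FUNNELED, provided, ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
c$omp parallel shared(nthreads)
c$omp master
      nthreads = omp_get_num_threads()
c$omp end master
c$omp end parallel
      print *, 'MPI process', rank, 'uses', nthreads, 'threads'
      call mpi_finalize(ierr)
      end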
Slide 14: Parallel Programming Models
- Global-address-space (shared-memory) model
  - POSIX threads (Pthreads)
  - OpenMP
- Message-passing (distributed-memory) model
  - MPI (Message Passing Interface; a minimal program is sketched after this list)
  - PVM (Parallel Virtual Machine)
- Higher-level programming environments
  - High Performance Fortran (HPF)
  - PETSc (Portable, Extensible Toolkit for Scientific Computation)
  - POOMA (Parallel Object-Oriented Methods and Applications)
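For orientation, the smallest useful MPI program (a sketch of my own, not from the slides) looks like this:

      program hello
      include 'mpif.h'
      integer ierr, rank, nprocs
c Every MPI program brackets its work with init/finalize, and
c usually asks for its rank (process id) and the process count.
      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
      call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)
      print *, 'process', rank, 'of', nprocs
      call mpi_finalize(ierr)
      end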
Slide 15: Other Parallel Programming Models
- Task and channel
  - Similar to message passing
  - Instead of communicating between named tasks (as in the message-passing model), tasks communicate through named channels
- SPMD (single program, multiple data)
  - Each processor executes the same program code, operating on different data
  - Most message-passing programs are SPMD
- Data parallel
  - Operations on chunks of data (e.g. arrays) are parallelized
- Grid
  - The problem domain is viewed as parcels, with the processing for one or more parcels allocated to different processors
Slide 16: Example

A serial kernel used in the next two slides: each interior point of a is replaced by the average of its four neighbors in b (a Jacobi-style sweep), and b is then updated for the next iteration.

      real a(n,n), b(n,n)

      do k = 1, NumIter
         do i = 2, n-1
            do j = 2, n-1
               a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
            end do
         end do
         do i = 2, n-1
            do j = 2, n-1
               b(i,j) = a(i,j)
            end do
         end do
      end do
Slide 17: Global-Address-Space Model: OpenMP

The same kernel with OpenMP directives. Note that k must be private, not shared: every thread executes the sequential k loop redundantly, while the c$omp do directives share out the i iterations; the implied barrier at the end of each c$omp do keeps the update and copy phases in step.

      real a(n,n), b(n,n)

c$omp parallel shared(a,b) private(i,j,k)
      do k = 1, NumIter
c$omp do
         do i = 2, n-1
            do j = 2, n-1
               a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
            end do
         end do
c$omp do
         do i = 2, n-1
            do j = 2, n-1
               b(i,j) = a(i,j)
            end do
         end do
      end do
c$omp end parallel
Slide 18: Message Passing Pseudo-code

The same kernel in SPMD message-passing style: each of P processors owns NdivP = n/P rows (stored in aLoc and bLoc, with ghost rows 0 and NdivP+1 in bLoc), exchanges boundary rows with its neighbors each iteration, and then sweeps only its own rows.

      real aLoc(NdivP,n), bLoc(0:NdivP+1,n)

      me = get_my_procnum()
      do k = 1, NumIter
         ! halo exchange: swap boundary rows with neighbors
         if (me .ne. P-1) send(me+1, bLoc(NdivP, 1:n))
         if (me .ne. 0)   recv(me-1, bLoc(0, 1:n))
         if (me .ne. 0)   send(me-1, bLoc(1, 1:n))
         if (me .ne. P-1) recv(me+1, bLoc(NdivP+1, 1:n))
         ! skip the global boundary rows on the first and last processor
         if (me .eq. 0) then
            ibeg = 2
         else
            ibeg = 1
         endif
         if (me .eq. P-1) then
            iend = NdivP-1
         else
            iend = NdivP
         endif
         do i = ibeg, iend
            do j = 2, n-1
               aLoc(i,j) = (bLoc(i-1,j) + bLoc(i,j-1) + bLoc(i+1,j) + bLoc(i,j+1))/4
            end do
         end do
         do i = ibeg, iend
            do j = 2, n-1
               bLoc(i,j) = aLoc(i,j)
            end do
         end do
      end do
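For reference, one way the send/recv pseudocode maps onto actual MPI calls (my mapping, not from the slides; it assumes the declarations above, that MPI is already initialized, that me and P hold the rank and process count, and it introduces the names rowtype, up, and dn):

      include 'mpif.h'
      integer ierr, rowtype, up, dn, stat(MPI_STATUS_SIZE)
c A row bLoc(i,1:n) is strided in memory (Fortran arrays are
c column-major; the leading dimension of bLoc is NdivP+2), so
c describe one row with a vector datatype instead of copying it.
      call mpi_type_vector(n, 1, NdivP+2, MPI_REAL, rowtype, ierr)
      call mpi_type_commit(rowtype, ierr)
c MPI_PROC_NULL turns the edge sends/recvs into no-ops, replacing
c the four "if (me .ne. ...)" guards of the pseudocode.
      up = me - 1
      dn = me + 1
      if (me .eq. 0)   up = MPI_PROC_NULL
      if (me .eq. P-1) dn = MPI_PROC_NULL
c Each sendrecv pairs a send with the matching receive, so the
c halo exchange cannot deadlock the way carelessly ordered
c blocking send/recv pairs can.
      call mpi_sendrecv(bLoc(NdivP,1),   1, rowtype, dn, 0,
     &                  bLoc(0,1),       1, rowtype, up, 0,
     &                  MPI_COMM_WORLD, stat, ierr)
      call mpi_sendrecv(bLoc(1,1),       1, rowtype, up, 1,
     &                  bLoc(NdivP+1,1), 1, rowtype, dn, 1,
     &                  MPI_COMM_WORLD, stat, ierr)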