Case Study in Computational Science & Engineering - Lecture 2

Parallel Architecture Models

- Shared Memory
  - Dual/Quad Pentium, Cray T90, IBM Power3 Node
- Distributed Memory
  - Cray T3E, IBM SP2, Network of Workstations
- Distributed-Shared Memory
  - SGI Origin 2000, Convex Exemplar

Shared Memory Systems (SMP)

[Diagram: four processors (P), each with its own cache (c), connected by a bus to one shared memory]

- Any processor can access any memory location at equal cost (Symmetric Multi-Processor)
- Tasks "communicate" by writing and reading common locations
- Easier to program
- Cannot scale beyond around 30 PEs (bus bottleneck)
- Most workstation vendors make SMPs today (SGI, Sun, HP, Digital; Pentium)
- Cray Y-MP, C90, T90 (crossbar between PEs and memory)

Cache Coherence in SMPs

[Diagram: four processors (P), each with its own cache (c), connected by a bus to one shared memory]

- Each processor's cache holds its most recently accessed values
- If a word cached by several processors is modified, all copies must be made consistent
- Bus-based SMPs use an efficient mechanism: the snoopy bus
- The snoopy bus monitors all writes and marks the other cached copies invalid
- When a processor finds an invalid word in its cache, it fetches a fresh copy from shared memory

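The invalidate-on-write behaviour described on this slide can be sketched as a small simulation. This is only a toy model, not a description of any real bus protocol: the class names `SnoopyBus` and `Cache` are hypothetical, and the model assumes write-through to shared memory, which the slide does not specify.

```python
class SnoopyBus:
    """Toy shared bus: every write is visible to all attached caches."""

    def __init__(self, memory):
        self.memory = memory      # shared memory: address -> value
        self.caches = []          # caches snooping on this bus

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast_write(self, writer, addr, value):
        # Assumption: write-through, so shared memory is updated immediately.
        self.memory[addr] = value
        for cache in self.caches:
            if cache is not writer:
                cache.invalidate(addr)   # other copies are marked invalid


class Cache:
    """Toy per-processor cache storing [value, valid] per address."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}           # address -> [value, valid flag]
        self.misses = 0
        bus.attach(self)

    def invalidate(self, addr):
        if addr in self.lines:
            self.lines[addr][1] = False

    def read(self, addr):
        line = self.lines.get(addr)
        if line is None or not line[1]:
            # Missing or invalid copy: fetch from shared memory.
            self.misses += 1
            line = [self.bus.memory[addr], True]
            self.lines[addr] = line
        return line[0]

    def write(self, addr, value):
        self.lines[addr] = [value, True]
        self.bus.broadcast_write(self, addr, value)


memory = {0: 10}
bus = SnoopyBus(memory)
p0, p1 = Cache(bus), Cache(bus)

first = p1.read(0)     # miss: p1 now caches the value 10
p0.write(0, 42)        # the bus marks p1's copy invalid
second = p1.read(0)    # invalid copy, so p1 refetches from shared memory
print(first, second, p1.misses)
```

Without the invalidation step, the second read by `p1` would return the stale value 10 from its own cache.
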
Distributed Memory Systems

[Diagram: four nodes attached to an interconnection network; each node has a processor (P), a cache (c), a memory (M), and a network interface card (NIC)]

- Each processor can only access its own memory
- Explicit communication by sending and receiving messages
- More tedious to program
- Can scale to hundreds/thousands of processors
- Cache coherence is not needed
- Examples: IBM SP-2, Cray T3E, Workstation Clusters

Distributed Shared Memory

[Diagram: four processors (P), each with its own cache (c), connected through an interconnection network to four memory modules (M)]

- Each processor can directly access any memory location
- Physically distributed memory; many simultaneous accesses
- Non-uniform memory access costs
- Examples: Convex Exemplar, SGI Origin
- Complex hardware and high cost for cache coherence
- Software DSM systems (e.g. Treadmarks) implement the shared memory abstraction on top of Distributed Memory Systems

Parallel Programming Models

- Shared-Address Space Models
  - BSP (Bulk Synchronous Parallel model)
  - HPF (High Performance Fortran)
  - OpenMP
- Message Passing
  - Partitioned address space
  - PVM, MPI [Ch. 8 of I. Foster's book, Designing and Building Parallel Programs (available online)]
- Higher Level Programming Environments
  - PETSc: Portable Extensible Toolkit for Scientific computation
  - POOMA: Parallel Object-Oriented Methods and Applications

OpenMP

- Standard sequential Fortran/C model
- Single global view of data
- Automatic parallelization by compiler
- User can provide loop-level directives
- Easy to program
- Only available on Shared-Memory Machines

High Performance Fortran

- Global shared address space, similar to sequential programming model
- User provides data mapping directives
- User can provide information on loop-level parallelism
- Portable: available on all three types of architectures
- Compiler automatically synthesizes message-passing code if needed
- Restricted to dense arrays and regular distributions
- Performance is not consistently good

Message Passing

- Program is a collection of tasks
- Each task can only read/write its own data
- Tasks communicate data by explicitly sending/receiving messages
- Need to translate from global shared view to local partitioned view in porting a sequential program
- Tedious to program/debug
- Very good performance

Illustrative Example

      Real a(n,n), b(n,n)
      Do k = 1,NumIter
c       Average the four neighbours of every interior point
        Do i = 2,n-1
          Do j = 2,n-1
            a(i,j) = (b(i-1,j)+b(i,j-1)+b(i+1,j)+b(i,j+1))/4
          End Do
        End Do
c       Copy the new values back for the next iteration
        Do i = 2,n-1
          Do j = 2,n-1
            b(i,j) = a(i,j)
          End Do
        End Do
      End Do

[Figure: the arrays a(20,20) and b(20,20)]

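For readers who want to run the stencil, the loop nest on this slide could be transcribed to Python roughly as follows. The function name `jacobi` and the 5 x 5 test grid (boundary held at 1.0, interior starting at 0.0) are assumptions made for illustration; the slides do not give initial data.

```python
def jacobi(b, num_iter):
    """Run num_iter sweeps of the four-point averaging stencil on square grid b."""
    n = len(b)
    a = [row[:] for row in b]          # work array with the same shape as b
    for _ in range(num_iter):
        # Interior points only: Fortran's 2..n-1 becomes 1..n-2 with 0-based indexing.
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                a[i][j] = (b[i - 1][j] + b[i][j - 1]
                           + b[i + 1][j] + b[i][j + 1]) / 4
        # Copy the new values back so the next sweep reads them from b.
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                b[i][j] = a[i][j]
    return b


# Hypothetical test grid: boundary fixed at 1.0, interior starts at 0.0.
n = 5
grid = [[1.0 if i in (0, n - 1) or j in (0, n - 1) else 0.0
         for j in range(n)] for i in range(n)]
result = jacobi(grid, 1)
print(result[1][1], result[1][2], result[2][2])
```

Because every update of `a` reads only the old values in `b`, all iterations of the `i` and `j` loops within one sweep are independent of each other.
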
Example: OpenMP

      Real a(n,n), b(n,n)
c     Every thread runs the k loop, so k must be private
c$omp parallel shared(a,b) private(i,j,k)
      Do k = 1,NumIter
c$omp do
        Do i = 2,n-1
          Do j = 2,n-1
            a(i,j) = (b(i-1,j)+b(i,j-1)+b(i+1,j)+b(i,j+1))/4
          End Do
        End Do
c$omp do
        Do i = 2,n-1
          Do j = 2,n-1
            b(i,j) = a(i,j)
          End Do
        End Do
      End Do
c$omp end parallel

[Figure: the arrays a(20,20) and b(20,20); global shared view of data]

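The loop-level work sharing on this slide can be imitated, very loosely, with a Python thread pool. This is an analogy rather than OpenMP itself: the names `update_row`, `copy_row`, and `jacobi_threads` are hypothetical, the main thread drives the `k` loop instead of every thread executing it, and Python threads will not give the speedup real OpenMP threads would. What the sketch does show is the shared view of `a` and `b` and the need for a barrier between the two loops.

```python
from concurrent.futures import ThreadPoolExecutor


def update_row(a, b, i):
    """Stencil update for row i: reads b, writes only row i of a."""
    for j in range(1, len(b) - 1):
        a[i][j] = (b[i - 1][j] + b[i][j - 1] + b[i + 1][j] + b[i][j + 1]) / 4


def copy_row(a, b, i):
    """Copy the interior of row i from a back to b."""
    for j in range(1, len(b) - 1):
        b[i][j] = a[i][j]


def jacobi_threads(b, num_iter, workers=4):
    n = len(b)
    a = [row[:] for row in b]          # a and b are shared by all threads
    rows = range(1, n - 1)             # interior rows (Fortran 2..n-1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(num_iter):
            # list(...) waits for every row task to finish, which plays the
            # role of the implicit barrier at the end of a work-shared loop.
            list(pool.map(lambda i: update_row(a, b, i), rows))
            list(pool.map(lambda i: copy_row(a, b, i), rows))
    return b


# Hypothetical test grid: boundary fixed at 1.0, interior starts at 0.0.
n = 5
grid = [[1.0 if i in (0, n - 1) or j in (0, n - 1) else 0.0
         for j in range(n)] for i in range(n)]
result = jacobi_threads(grid, 2)
print(result[1][1], result[1][2], result[2][2])
```

If the copy loop were allowed to start before all rows of `a` had been updated, some threads would read values of `b` that already belong to the next sweep; the barrier rules that out.
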
Example: HPF (1D partition)

      Real a(n,n), b(n,n)
chpf$ Distribute a(block,*)
chpf$ Distribute b(block,*)
      Do k = 1,NumIter
c       The inner index j is private to each iteration of the i loop
chpf$ independent, new(j)
        Do i = 2,n-1
          Do j = 2,n-1
            a(i,j) = (b(i-1,j)+b(i,j-1)+b(i+1,j)+b(i,j+1))/4
          End Do
        End Do
chpf$ independent, new(j)
        Do i = 2,n-1
          Do j = 2,n-1
            b(i,j) = a(i,j)
          End Do
        End Do
      End Do

[Figure: the arrays a(20,20) and b(20,20), each split into four blocks of rows owned by P0, P1, P2, P3; global shared view of data]

Example: HPF (2D partition)

      Real a(n,n), b(n,n)
chpf$ Distribute a(block,block)
chpf$ Distribute b(block,block)
      Do k = 1,NumIter
c       The inner index j is private to each iteration of the i loop
chpf$ independent, new(j)
        Do i = 2,n-1
          Do j = 2,n-1
            a(i,j) = (b(i-1,j)+b(i,j-1)+b(i+1,j)+b(i,j+1))/4
          End Do
        End Do
chpf$ independent, new(j)
        Do i = 2,n-1
          Do j = 2,n-1
            b(i,j) = a(i,j)
          End Do
        End Do
      End Do

[Figure: the arrays a(20,20) and b(20,20), each split into two-dimensional blocks; global shared view of data]

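To make the two data mappings concrete, the following sketch computes which processor owns which indices under a BLOCK distribution, where each processor receives ceil(n / P) consecutive indices. The helper names are hypothetical, and the 2 x 2 processor grid used for the 2D case is an assumption: the slides do not say how the processors are arranged.

```python
def block_size(n, nprocs):
    """Indices per processor under BLOCK: ceil(n / nprocs)."""
    return -(-n // nprocs)


def block_owner(i, n, nprocs):
    """Processor that owns 1-based index i of an n-element dimension."""
    return (i - 1) // block_size(n, nprocs)


def block_range(p, n, nprocs):
    """Inclusive 1-based index range owned by processor p."""
    size = block_size(n, nprocs)
    return p * size + 1, min((p + 1) * size, n)


def owner_2d(i, j, n, grid):
    """Owner of element (i, j) under (block,block) on a grid[0] x grid[1] processor grid."""
    return block_owner(i, n, grid[0]), block_owner(j, n, grid[1])


n = 20
# 1D partition, a(block,*): rows are split over 4 processors, columns are not split.
rows_1d = [block_range(p, n, 4) for p in range(4)]
# 2D partition, a(block,block): assumed 2 x 2 processor grid.
corner = owner_2d(11, 3, n, (2, 2))
print(rows_1d)
print(corner)
```

With the 1D mapping, computing `a(i,j)` needs remote data only when row `i-1` or `i+1` belongs to a neighbouring processor; with the 2D mapping the same holds for columns as well.
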
Message Passing: Local View

[Figure: three views of the data, each shown across processors P0, P1, P2, P3]

- Global shared view: a(20,20), b(20,20)
- Local partitioned view: al(5,20), bl(5,20); communication required
- Local partitioned view with ghost cells: al(5,20), bl(0:6,20)

Example: Message Passing

      Real al(NdivP,n), bl(0:NdivP+1,n)
      me = get_my_procnum()
      Do k = 1,NumIter
c       Exchange ghost rows with the neighbouring processors
        if (me .ne. P-1) send(me+1, bl(NdivP,1:n))
        if (me .ne. 0)   recv(me-1, bl(0,1:n))
        if (me .ne. 0)   send(me-1, bl(1,1:n))
        if (me .ne. P-1) recv(me+1, bl(NdivP+1,1:n))
c       Skip the global boundary rows on the first and last processor
        i1 = merge(2, 1, me .eq. 0)
        i2 = merge(NdivP-1, NdivP, me .eq. P-1)
        Do i = i1,i2
          Do j = 2,n-1
            al(i,j) = (bl(i-1,j)+bl(i,j-1)+bl(i+1,j)+bl(i,j+1))/4
          End Do
        ……...

[Figure: local partitioned view with ghost cells, al(5,20) and bl(0:6,20); ghost cells are communicated by message-passing]

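The ghost-cell scheme can be checked with a single-process simulation in which each "processor" is just a list of local rows and each send/recv pair is replaced by a row copy. This is a sketch under stated assumptions, not real message passing: the function names and the 8 x 8 test grid are hypothetical, and the number of processors is assumed to divide n evenly.

```python
def sweep_rows(a, b, i1, i2, n):
    """Stencil update of rows i1..i2 (inclusive), then copy back into b."""
    for i in range(i1, i2 + 1):
        for j in range(1, n - 1):
            a[i][j] = (b[i - 1][j] + b[i][j - 1] + b[i + 1][j] + b[i][j + 1]) / 4
    for i in range(i1, i2 + 1):
        for j in range(1, n - 1):
            b[i][j] = a[i][j]


def sequential(b, num_iter):
    """Reference version on the global array."""
    n = len(b)
    a = [row[:] for row in b]
    for _ in range(num_iter):
        sweep_rows(a, b, 1, n - 2, n)
    return b


def partitioned(b, num_iter, nprocs):
    """Same computation on local blocks with one ghost row above and below."""
    n = len(b)
    rows = n // nprocs                 # NdivP; assumes nprocs divides n
    # bl[p] has rows + 2 local rows: index 0 and rows + 1 are ghost rows.
    bl = [[[0.0] * n] + [b[p * rows + r][:] for r in range(rows)] + [[0.0] * n]
          for p in range(nprocs)]
    al = [[row[:] for row in block] for block in bl]
    for _ in range(num_iter):
        # Ghost exchange: each copy stands in for one send/recv pair.
        for p in range(nprocs):
            if p != nprocs - 1:
                bl[p + 1][0] = bl[p][rows][:]       # last owned row to next processor
            if p != 0:
                bl[p - 1][rows + 1] = bl[p][1][:]   # first owned row to previous processor
        for p in range(nprocs):
            i1 = 2 if p == 0 else 1                 # skip global top boundary row
            i2 = rows - 1 if p == nprocs - 1 else rows  # skip global bottom boundary row
            sweep_rows(al[p], bl[p], i1, i2, n)
    # Gather the owned rows (without ghosts) back into one global array.
    return [bl[p][r][:] for p in range(nprocs) for r in range(1, rows + 1)]


def make_grid(n=8):
    """Hypothetical test data: a different value at every point."""
    return [[float(i * n + j) for j in range(n)] for i in range(n)]


same = partitioned(make_grid(), 3, 4) == sequential(make_grid(), 3)
print(same)
```

The comparison is exact because both versions evaluate the same expression on the same values in the same order; only the storage layout differs.
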
Comparison of Models

- Program Porting/Development Effort
  - OpenMP = HPF << MPI
- Portability across systems
  - HPF = MPI >> OpenMP (only shared-memory)
- Applicability
  - MPI = OpenMP >> HPF (only dense arrays)
- Performance
  - MPI > OpenMP >> HPF

PETSc

- Higher level parallel programming model
- Aims to provide both ease of use and high performance for numerical PDE solution
- Uses an efficient message-passing implementation underneath, but:
  - Provides a global view of data arrays
  - System takes care of the needed message-passing
- Portable across shared and distributed memory systems