Published by Isaac Floyd. Modified over 8 years ago.
1
From Clustered SMPs to Clustered NUMA
John M. Levesque
The Advanced Computing Technology Center
2
The Advanced Computing Technology Center has been established at IBM Research to focus expertise in High Performance Computing and supply the user community with solutions to porting and optimizing their applications
3
ACTC Staff
Blend of experienced IBM staff members and recent hires with 100 years of experience in Cray Vector and MPP systems
Combination of application specialists, system analysts, tool and library developers
Perfect mixture for moving users off Vector architectures to IBM Hardware
In 2000, ACTC was extended into a worldwide organization with members in EMEA and AP
4
What is ACTC trying to do:
Assist in taking IBM to leadership in HPC and keep it there
–Hardware roadmap is excellent
Takes more than that
–HPC Software at IBM isn't of leadership quality
ACTC filling holes in IBM HPC Product offering
–HPC Customer Support wasn't of leadership quality - "ACTC has made a real difference in IBM's support"
SCIENTIFIC USER SUPPORT IS MORE DIFFICULT AND MORE IMPORTANT
5
Outline
Overview of Current Architecture
–Clustered SMP
Best Programming Paradigm for Clustered SMP
–MPI only between nodes
Overview of the Power 4
–Large SMP, Large NUMA, Clustered NUMA
Best Programming Paradigm for Power 4
–Where should we use MPI
6
The Clustered SMP Four 4-way SMPs Processors on the node are closer than those on different nodes
7
Power 3 with Copper
Winterhawk II - 375 MHz 4-way SMP
–2 MULT/ADD - 1500 MFLOPS
–64 KB Level 1 - 5 nsec/3.2 GB/sec, 128-way associative
–8 MB Level 2 - 45 nsec/6.4 GB/sec
–1.6 GB/sec Memory Bandwidth
–6 GFLOPS/Node
Nighthawk II - 375 MHz 16-way SMP
–2 MULT/ADD - 1500 MFLOPS
–64 KB Level 1 - 5 nsec/3.2 GB/sec, 128-way associative
–8 MB Level 2 - 45 nsec/6.4 GB/sec
–14.2 GB/sec Memory Bandwidth
–24 GFLOPS/Node
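The per-processor and per-node peak rates on this slide follow from the clock rate and the two MULT/ADD units. A minimal Python sketch of that arithmetic, assuming each MULT/ADD unit completes one fused multiply-add (counted as 2 floating-point operations) per cycle; the function names are illustrative and not from the deck:

```python
# Peak floating-point rate for the Power 3 nodes listed on the slide.
# Assumption: each MULT/ADD unit retires one fused multiply-add per cycle,
# which counts as 2 floating-point operations.
CLOCK_MHZ = 375
MULT_ADD_UNITS = 2
FLOPS_PER_MULT_ADD = 2


def peak_mflops_per_cpu(clock_mhz=CLOCK_MHZ):
    """Peak MFLOPS of one processor."""
    return clock_mhz * MULT_ADD_UNITS * FLOPS_PER_MULT_ADD


def peak_gflops_per_node(cpus, clock_mhz=CLOCK_MHZ):
    """Peak GFLOPS of an SMP node with the given number of processors."""
    return cpus * peak_mflops_per_cpu(clock_mhz) / 1000.0


for name, cpus in (("Winterhawk II", 4), ("Nighthawk II", 16)):
    print(name, peak_mflops_per_cpu(), "MFLOPS/CPU,",
          peak_gflops_per_node(cpus), "GFLOPS/node")
```

Under that assumption the sketch reproduces the 1500 MFLOPS, 6 GFLOPS and 24 GFLOPS figures quoted above.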
8
IBM Hardware Roadmap
9
Power 4 Chip (Shared L2 Cache)
2 Processors per chip
–Advanced superscalar
–Out-of-order execution
–Enhanced branch prediction
–7 execution units
–Multiple outstanding miss support and prefetch logic
Private on-chip L1 caches
Large on-chip L2 shared between 2 processors
Large L3 shared between all processors in node
–Up to 32 MB per GP chip
Large shared memory
–Up to 32 GB per GP chip
Multiple, dedicated, high-bandwidth buses
–GX External Bus
–Inter-MCM
–Intra-MCM
10
Inter-processor communication on Power 4
[Diagram labels, shown twice: L2 Cache, L3 Cache, Memory]
Get a cache line without going to memory
11
Multi-Chip Module (MCM)
4 GP chips (8 processors) on an MCM
Logically shared L3 cache - Logically UMA
4 GX links for external connections - SP Thin / Wide Node
12
Power 4 - 32 Way Logical UMA SP High Node
13
Cache Architecture – Power 4
Processors   L2 Cache     L3 Cache     Memory
2            1.5 Mbytes   32 Mbytes    32 Gbytes
8            12 Mbytes    128 Mbytes   128 Gbytes
32           48 Mbytes    512 Mbytes   512 Gbytes
14
Programming for the Power 4
Data Locality is important
–Size of L3 Cache is dependent upon where the data is stored
–L3 is a reflection of what is in its memory
If processor n wants to read a variable that is in the cache of processor n+k, that access can be made without going to memory
–Much faster than getting it from memory
15
Assuring Data Locality on Node
Memory is stored in 512-byte chunks around MCM memory
Power 4 will have a large page of 16 MBytes and a small page of 4096 bytes
Within a page, the memory will be physically contiguous
–User will have mechanisms to assure data locality
16
Going to NUMA
17
Advantages of ALL-MPI
Just run it across all the processors
–Easy conversion from uni-processor MPP systems
MPI forces Data Locality
–Good for NUMA Architectures
–Not a benefit for flat SMP systems such as Power 3 nodes
18
Disadvantages of ALL-MPI
MPI uses memory bandwidth
–Bad for limited Bus Architectures
Increases bandwidth requirement off SMP node
19
Message Passing on SMP
[Diagram labels: Call MPI_SEND, Call MPI_RECEIVE, Buffer]
20
Message Passing on SMP
[Diagram labels: Call MPI_SEND, Call MPI_RECEIVE, Buffer, Memory Crossbar or Switch, Data to Send, Received Data]
21
Message Passing off Node
MPI across all the processors
Many more messages going through the fabric
22
Advantages of the Hybrid Model Best Utilization of Available Memory Bandwidth
23
OpenMP - intra processor communication
24
Dual-SPMD Hybrid Model
SPMD MPI
–Single program executes on each node
–Communicates via message passing
SPMD OpenMP
–!$OMP PARALLEL/!$OMP END PARALLEL at high level
–MASTER thread does MPI within PARALLEL Region
25
Hybrid Programming Tips
Consider SPMD threading to go with SPMD MPI
–Notion of a Master task (MPI – taskid = myid( )) and a Master thread (OpenMP – threadid = OMP_GET_THREAD_NUM( ))
Only do MPI commands on threadid=0
Only do I/O on threadid=0 and taskid=0
26
Comparison of OpenMP techniques on the Nighthawk 1
Super-linear speedup
27
Clustered SMP: Please don't run MPI across all processors
MPI between nodes and threads on the node
–Minimize MPI tasks - MPI overhead increases with the number of processors
–Threads on the node minimize the use of the memory bandwidth
Two dimensions of parallelism can be fully controlled by environment variables
–m MPI tasks and 1 thread: m processors
–m MPI tasks and n threads: m x n processors
–1 MPI task and n threads: n processors
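The m x n relationship can be made concrete by listing every task and thread combination that fills a given processor count. A small Python sketch; `hybrid_layouts` is a hypothetical helper, and the choice of 16 processors (four 4-way nodes) is only an example. In practice the thread count is usually set with the standard `OMP_NUM_THREADS` variable, while the task count is set through the MPI launcher, whose variable names depend on the system:

```python
def hybrid_layouts(processors):
    """All (mpi_tasks, threads_per_task) pairs whose product is `processors`."""
    return [(m, processors // m)
            for m in range(1, processors + 1)
            if processors % m == 0]


# Example: a cluster of four 4-way SMP nodes has 16 processors in total.
for tasks, threads in hybrid_layouts(16):
    print(tasks, "MPI tasks x", threads, "threads")
```

The (16, 1) entry is the all-MPI layout the slide argues against, (1, 16) is pure threading, and (4, 4) corresponds to one MPI task per 4-way node with one thread per processor.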
28
Getting OpenMP loop limits
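The slide content itself did not survive in this transcript. One common way to compute per-thread loop limits for SPMD-style OpenMP is an even block split, sketched here in Python. This is an assumption about the technique, not the deck's `set_omp_loop_limits` routine, whose argument list is not documented here; `omp_loop_limits` and its parameters are hypothetical:

```python
def omp_loop_limits(first, last, nthreads, tid):
    """Inclusive (start, end) of the block of first..last owned by thread tid.

    Threads are numbered from 0. The first `extra` threads get one more
    iteration when the range does not divide evenly. A thread with no work
    gets start > end, which a Fortran DO loop treats as zero trips.
    """
    total = last - first + 1
    base, extra = divmod(total, nthreads)
    start = first + tid * base + min(tid, extra)
    end = start + base - 1 + (1 if tid < extra else 0)
    return start, end


# Ten iterations over four threads.
for tid in range(4):
    print(tid, omp_loop_limits(1, 10, 4, tid))
```

Each thread then runs its loops over its own `start..end` range inside a single high-level parallel region, which matches the `omp_is`/`omp_ie` usage in the code slides.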
29
Things to watch out for
False Sharing
Increased Stack usage
Must use !$OMP BARRIER to synchronize threads
Reading/Writing and arithmetic
30
False Sharing
   90 NCYCLE = NCYCLE + 1
C     threadid is added to the PRIVATE list so each thread keeps its own id
!$OMP PARALLEL PRIVATE(omp_is,omp_ie,threadid)
      threadid = OMP_GET_THREAD_NUM()
      CALL set_omp_loop_limits(n1,1,n1-1,1,omp_is,omp_ie)
      CALL CALC1(omp_is,omp_ie)
      CALL CALC2(omp_is,omp_ie)
      if(threadid.eq.0) TIME = TIME + DT
      IF(MOD(NCYCLE,MPRINT).NE.0) GO TO 370
C     Per-thread partial sums are adjacent array elements,
C     which can share a cache line
      PCHECK(threadid) = 0.0
      UCHECK(threadid) = 0.0
      VCHECK(threadid) = 0.0
      DO 3500 JCHECK = js,je
      DO 3500 ICHECK = omp_is,omp_ie
      PCHECK(threadid) = PCHECK(threadid) + ABS(PNEW(ICHECK,JCHECK))
      UCHECK(threadid) = UCHECK(threadid) + ABS(UNEW(ICHECK,JCHECK))
      VCHECK(threadid) = VCHECK(threadid) + ABS(VNEW(ICHECK,JCHECK))
 3500 CONTINUE
C     Combine the per-thread sums into the task totals
!$OMP CRITICAL
      ppcheck = ppcheck + pcheck(threadid)
      pucheck = pucheck + ucheck(threadid)
      pvcheck = pvcheck + vcheck(threadid)
!$OMP END CRITICAL
!$OMP BARRIER
C     Only the master thread calls MPI
      if(threadid.eq.0)then
      CALL mpi_reduce(ppcheck,tpcheck,1,MPI_DOUBLE_PRECISION,
     1     MPI_SUM,0,MPI_COMM_WORLD,ierr)
      CALL mpi_reduce(pucheck,tucheck,1,MPI_DOUBLE_PRECISION,
     1     MPI_SUM,0,MPI_COMM_WORLD,ierr)
      CALL mpi_reduce(pvcheck,tvcheck,1,MPI_DOUBLE_PRECISION,
     1     MPI_SUM,0,MPI_COMM_WORLD,ierr)
      endif
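The per-thread accumulators indexed by `threadid` are the false-sharing hazard in this code. A short Python sketch of why, assuming 8-byte REAL*8 elements, the 128-byte cache line quoted elsewhere in the deck, and an array that starts on a line boundary; `line_of` and the padding stride are illustrative and not part of the original code:

```python
LINE_BYTES = 128   # cache line size quoted in the deck
WORD_BYTES = 8     # one REAL*8 element (assumed element type)


def line_of(index, stride_words=1):
    """Cache line holding element `index`, assuming a line-aligned array."""
    return (index * stride_words * WORD_BYTES) // LINE_BYTES


threads = range(16)

# Packed layout: accumulator(threadid) with unit stride.
packed_lines = {line_of(t) for t in threads}

# Padded layout: one accumulator per cache line (stride of 16 words).
padded_lines = {line_of(t, stride_words=LINE_BYTES // WORD_BYTES)
                for t in threads}

print("packed:", len(packed_lines), "line(s) for 16 threads")
print("padded:", len(padded_lines), "line(s) for 16 threads")
```

With the packed layout all 16 accumulators land in one line, so every update by one thread invalidates that line for the others. Padding each accumulator to its own line, or summing into a private scalar and storing once at the end, are two ways to avoid this.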
31
Hybrid Paradigm in the Flesh – Cont.
C     Master thread exchanges boundary rows with neighbouring tasks.
C     req(1:6) should be preset to MPI_REQUEST_NULL so that
C     MPI_WAITALL skips requests a boundary task never posts.
      if(threadid.eq.0)then
      if(taskid.lt.numtasks-1)then
      CALL mpi_irecv(V(1,je+1),n1,MPI_DOUBLE_PRECISION,
     1     taskid+1,1,MPI_COMM_WORLD,req(1),ierr)
      endif
      if(taskid.gt.0)then
      CALL mpi_isend(V(1,js),n1,MPI_DOUBLE_PRECISION,
     1     taskid-1,1,MPI_COMM_WORLD,req(4),ierr)
      endif
      if(taskid.gt.0)then
      CALL mpi_irecv(U(1,js-1),n1,MPI_DOUBLE_PRECISION,
     1     taskid-1,2,MPI_COMM_WORLD,req(2),ierr)
      CALL mpi_irecv(P(1,js-1),n1,MPI_DOUBLE_PRECISION,
     1     taskid-1,3,MPI_COMM_WORLD,req(3),ierr)
      endif
      if(taskid.lt.numtasks-1)then
      CALL mpi_isend(U(1,je),n1,MPI_DOUBLE_PRECISION,
     1     taskid+1,2,MPI_COMM_WORLD,req(5),ierr)
      CALL mpi_isend(P(1,je),n1,MPI_DOUBLE_PRECISION,
     1     taskid+1,3,MPI_COMM_WORLD,req(6),ierr)
      endif
      CALL MPI_WAITALL(6,req,istat,ierr)
      endif
C     All threads wait until the exchange has completed
!$OMP BARRIER
32
Hybrid Paradigm in the Flesh – Cont.
C     Each thread works on its own omp_is..omp_ie slice of the I index
      DO 100 J=js,je
      DO 100 I=omp_is,omp_ie
      CU(I+1,J) = .5*(P(I+1,J)+P(I,J))*U(I+1,J)
      if(j.gt.1)then
        CV(I,J) = .5*(P(I,J)+P(I,J-1))*V(I,J)
        Z(I+1,J) = (FSDX*(V(I+1,J)-V(I,J))-FSDY*(U(I+1,J)
     1     -U(I+1,J-1)))/(P(I,J-1)+P(I+1,J-1)+P(I+1,J)+P(I,J))
      endif
      if(j.eq.n)then
        CV(I,J+1) = .5*(P(I,J+1)+P(I,J))*V(I,J+1)
        Z(I+1,J+1) = (FSDX*(V(I+1,J+1)-V(I,J+1))-FSDY*(U(I+1,J+1)
     1     -U(I+1,J)))/(P(I,J)+P(I+1,J)+P(I+1,J+1)+P(I,J+1))
      endif
      H(I,J) = P(I,J)+.25*(U(I+1,J)*U(I+1,J)+U(I,J)*U(I,J)
     1     +V(I,J+1)*V(I,J+1)+V(I,J)*V(I,J))
  100 CONTINUE
!$OMP BARRIER
33
Distribution of Data and Work
Consider a 3-D Finite Difference Code
–Option 1: MPI on one dimension, OpenMP on another dimension
–Inner index could give false sharing
–Option 2: MPI and OpenMP on same dimension
–Option 3: Domain decomposition (3-D chunks) distributed over both MPI and OpenMP
34
Option 1 – MPI & OpenMP on different dimensions
Size could limit load balancing
–Consider 144x78x30
–The dimension of 30 would limit MPI or OpenMP flexibility
–False sharing with OpenMP on inner subscript: 95x144x30
–On 16 processors, each processor would get about 6 iterations of the inner loop
–Cache line holds 128 bytes = 16 Real*8 words
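The false-sharing claim for the 95x144x30 case can be checked by counting how many threads touch each cache line along the inner dimension. A minimal Python sketch, assuming an even block split of the 95 inner elements over 16 threads and a column that starts on a cache-line boundary; the `owner` helper is hypothetical:

```python
N_INNER = 95            # inner (first) dimension from the slide
THREADS = 16
LINE_WORDS = 128 // 8   # 128-byte line holds 16 REAL*8 words


def owner(i):
    """Thread owning 0-based inner index i under an even block split."""
    base, extra = divmod(N_INNER, THREADS)
    cutoff = extra * (base + 1)      # the first `extra` threads get base+1
    if i < cutoff:
        return i // (base + 1)
    return extra + (i - cutoff) // base


# Map each cache line of one column to the set of threads that write to it.
writers = {}
for i in range(N_INNER):
    writers.setdefault(i // LINE_WORDS, set()).add(owner(i))

for line, tids in sorted(writers.items()):
    print("line", line, "written by threads", sorted(tids))
```

Under these assumptions the column spans 6 cache lines and every one of them is written by at least three threads, which is why threading over the inner subscript is flagged here.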
35
Option 2 – MPI & OpenMP on same dimensions Similar to 1-D decomposition –Would limit number of MPI + OpenMP processors
36
Option 3 – Domain Decomposition Poor data locality for OpenMP –Not on Power 3 – Power 4 NUMA
37
Disadvantages of Hybrid Two programming paradigms Data Locality – Is it important on the SMP node?
38
Assuring Data Locality off Node
NUMA
–Able to get cache lines over Federation Switch
–Will OpenMP have data alignment directives?
–Definitely will need to assure that data is local to the processor that needs it
39
Assuring Data Locality off Node
Beyond NUMA
–Typical MPI data locality issue
–Given that cache lines can be accessed across the switch, one-sided message passing, in particular GETs, will work extremely well
What about?
–Coarray Fortran
–UPC