1 From Clustered SMPs to Clustered NUMA
John M. Levesque
The Advanced Computing Technology Center

2 The Advanced Computing Technology Center (ACTC) has been established at IBM Research to focus expertise in High Performance Computing and to supply the user community with solutions for porting and optimizing their applications.

3 ACTC Staff
A blend of experienced IBM staff members and recent hires with 100 years of combined experience in Cray Vector and MPP systems
A combination of application specialists, system analysts, and tool and library developers
A perfect mixture for moving users off Vector architectures to IBM hardware
In 2000, ACTC was extended into a worldwide organization with members in EMEA and AP

4 What is ACTC trying to do?
Assist in taking IBM to leadership in HPC and keep it there
– The hardware roadmap is excellent, but it takes more than that
– HPC software at IBM isn't of leadership quality – ACTC is filling holes in the IBM HPC product offering
– HPC customer support wasn't of leadership quality – "ACTC has made a real difference in IBM's support"
SCIENTIFIC USER SUPPORT IS MORE DIFFICULT AND MORE IMPORTANT

5 Outline
Overview of the current architecture
– Clustered SMP
Best programming paradigm for the clustered SMP
– MPI only between nodes
Overview of the Power 4
– Large SMP, large NUMA, clustered NUMA
Best programming paradigm for the Power 4
– Where should we use MPI?

6 The Clustered SMP
Four 4-way SMPs
Processors on the node are closer than those on different nodes

7 Power 3 with Copper

Winterhawk II – 375 MHz 4-way SMP
– 2 MULT/ADD units – 1500 MFLOPS per processor
– 64 KB Level 1 cache – 5 nsec / 3.2 GB/sec, 128-way associative
– 8 MB Level 2 cache – 45 nsec / 6.4 GB/sec
– 1.6 GB/sec memory bandwidth
– 6 GFLOPS/node

Nighthawk II – 375 MHz 16-way SMP
– 2 MULT/ADD units – 1500 MFLOPS per processor
– 64 KB Level 1 cache – 5 nsec / 3.2 GB/sec, 128-way associative
– 8 MB Level 2 cache – 45 nsec / 6.4 GB/sec
– 14.2 GB/sec memory bandwidth
– 24 GFLOPS/node

8 IBM Hardware Roadmap

9 Power 4 Chip (Shared L2 Cache)
2 processors per chip
Advanced superscalar, out-of-order execution
Enhanced branch prediction
7 execution units
Multiple outstanding miss support and prefetch logic
Private on-chip L1 caches
Large on-chip L2 shared between the 2 processors
Large L3 shared between all processors in the node – up to 32 MB per GP chip
Large shared memory – up to 32 GB per GP chip
Multiple, dedicated, high-bandwidth buses – GX external bus, inter-MCM, intra-MCM

10 Inter-processor communication on Power 4
(Diagram of two L2 cache / L3 cache / memory hierarchies: a processor can get a cache line from another processor's cache without going to memory.)

11 Multi-Chip Module (MCM)
4 GP chips (8 processors) on an MCM
Logically shared L3 cache – logically UMA
4 GX links for external connections – SP Thin / Wide Node

12 Power 4 – 32-Way Logical UMA – SP High Node

13 Cache Architecture – Power 4

Processors   L2 Cache   L3 Cache   Memory
 2           1.5 MB      32 MB      32 GB
 8            12 MB     128 MB     128 GB
32            48 MB     512 MB     512 GB

14 Programming for the Power 4
Data locality is important
– The effective size of the L3 cache depends on where the data is stored – each L3 is a reflection of what is in its memory
If processor n wants to read a variable that is in processor n+k's cache, that access can be made without going to memory
– Much faster than getting it from memory

15 Assuring Data Locality on the Node
Memory is stored in 512-byte chunks around the MCM memory
Power 4 will have a large page of 16 MBytes and a small page of 4096 bytes
Within a page, the memory will be physically contiguous
– The user will have mechanisms to assure data locality (a generic first-touch sketch follows below)
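Although the slide does not say what those mechanisms will be, one technique commonly used on NUMA systems is first-touch placement: initialize each array with the same parallel loop structure that will later compute on it, so each thread's pages end up near the processor that uses them. A minimal, hypothetical Fortran/OpenMP sketch (the array name A and its size are illustrative, not from the slides):

      program first_touch
      implicit none
      integer n, i
      parameter (n = 1000000)
      real*8 a(n)

!     First touch: each thread initializes the block of A it will
!     later work on, so those pages are placed in its local memory.
!$OMP PARALLEL DO SCHEDULE(STATIC)
      do i = 1, n
         a(i) = 0.0d0
      end do
!$OMP END PARALLEL DO

!     Later compute loops reuse the same static schedule so every
!     thread works on the pages it touched (and placed) above.
!$OMP PARALLEL DO SCHEDULE(STATIC)
      do i = 1, n
         a(i) = a(i) + 1.0d0
      end do
!$OMP END PARALLEL DO

      end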

16 Going to NUMA

17 Advantages of ALL-MPI
Just run it across all the processors
– Easy conversion from uni-processor MPP systems
MPI forces data locality
– Good for NUMA architectures
– Not a benefit on a flat SMP system such as the Power 3 nodes

18 Disadvantages of ALL-MPI
MPI uses memory bandwidth
– Bad for limited bus architectures
Increases the bandwidth requirement off the SMP node

19 Message Passing on SMP
(Diagram: MPI_SEND on one processor and MPI_RECEIVE on another, connected through a buffer.)

20 Message Passing on SMP
(Diagram: the data to send is copied into a buffer, passes through memory over the crossbar or switch, and arrives as received data on the other processor.)
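To make the diagram concrete, the fragment below is a minimal, purely illustrative on-node exchange: even when both MPI tasks sit on the same SMP, the message is staged through an intermediate buffer in shared memory and copied across the memory crossbar, which is exactly the extra memory traffic the slide is pointing at. The buffer name and message size are assumptions.

      program smp_exchange
      implicit none
      include 'mpif.h'
      integer n, i
      parameter (n = 100000)
      real*8 buf(n)
      integer rank, ierr, status(MPI_STATUS_SIZE)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

      if (rank .eq. 0) then
         do i = 1, n
            buf(i) = 1.0d0
         end do
!        Even on-node, the send is staged through MPI's internal
!        buffer and copied across the memory crossbar or switch.
         call MPI_SEND(buf, n, MPI_DOUBLE_PRECISION, 1, 0,
     1                 MPI_COMM_WORLD, ierr)
      else if (rank .eq. 1) then
         call MPI_RECV(buf, n, MPI_DOUBLE_PRECISION, 0, 0,
     1                 MPI_COMM_WORLD, status, ierr)
      end if

      call MPI_FINALIZE(ierr)
      end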

21 Message Passing off Node
MPI across all the processors
Many more messages going through the fabric

22 Advantages of the Hybrid Model
Best utilization of the available memory bandwidth

23 OpenMP – intra-processor communication

24 Dual-SPMD Hybrid Model
SPMD MPI
– A single program executes on each node and communicates via message passing
SPMD OpenMP
– !$OMP PARALLEL / !$OMP END PARALLEL at a high level
– The MASTER thread does MPI within the PARALLEL region

25 Hybrid Programming Tips
Consider SPMD threading to go with SPMD MPI
– Notion of a master task (MPI – taskid) and a master thread (OpenMP – threadid = OMP_GET_THREAD_NUM())
Only do MPI calls on threadid = 0
Only do I/O on threadid = 0 and taskid = 0
(A minimal skeleton of this structure is sketched below.)
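The structure those tips describe can be summarized in a short skeleton. This is a hedged sketch, not the code on the following slides: the compute and communication bodies are left as placeholder comments, and it assumes the MPI library permits calls from the master thread inside a parallel region (MPI_THREAD_FUNNELED-level support).

      program hybrid_spmd
      implicit none
      include 'mpif.h'
      integer taskid, numtasks, threadid, ierr
      integer OMP_GET_THREAD_NUM
      external OMP_GET_THREAD_NUM

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, taskid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)

!$OMP PARALLEL PRIVATE(threadid)
      threadid = OMP_GET_THREAD_NUM()

!     ... all threads compute on their own slice of the data ...

!     MPI calls only on the master thread of each task.
      if (threadid .eq. 0) then
!        ... halo exchange, reductions, etc. ...
      endif
!$OMP BARRIER

!     I/O only on the master thread of the master task.
      if (threadid .eq. 0 .and. taskid .eq. 0) then
         write(*,*) 'step complete on', numtasks, 'tasks'
      endif
!$OMP END PARALLEL

      call MPI_FINALIZE(ierr)
      end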

26 Comparison of OpenMP techniques on the Nighthawk 1
(Chart – super-linear speedup.)

27 Clustered SMP: Please Don't Run MPI Across All Processors
MPI between nodes and threads on the node
– Minimize MPI tasks – MPI overhead increases with the number of processors
– Threads on the node minimize the use of the memory bandwidth
Two dimensions of parallelism can be fully controlled by environment variables
– m MPI tasks and 1 thread: m processors
– m MPI tasks and n threads: m x n processors
– 1 MPI task and n threads: n processors

28 Getting OpenMP loop limits
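The original slide presumably carried the set_omp_loop_limits routine that slide 30 calls. The version below is a hypothetical reconstruction: the argument meanings (nominal size, global start, global end, stride, and the per-thread start/end returned) are assumptions made only to match the call set_omp_loop_limits(n1, 1, n1-1, 1, omp_is, omp_ie) on slide 30.

      subroutine set_omp_loop_limits(n, gstart, gend, inc, is, ie)
!     Hedged sketch: split the iterations gstart..gend (stride inc)
!     into contiguous blocks, one per OpenMP thread.  Must be called
!     inside a parallel region.  The first argument n is accepted only
!     to match the call site on slide 30; it is not used here.
      implicit none
      integer n, gstart, gend, inc, is, ie
      integer nthreads, threadid, niters, chunk, extra
      integer OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
      external OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM

      nthreads = OMP_GET_NUM_THREADS()
      threadid = OMP_GET_THREAD_NUM()

      niters = (gend - gstart) / inc + 1
      chunk  = niters / nthreads
      extra  = mod(niters, nthreads)

!     Threads 0..extra-1 get one extra iteration each.
      if (threadid .lt. extra) then
         is = gstart + threadid * (chunk + 1) * inc
         ie = is + chunk * inc
      else
         is = gstart + (threadid * chunk + extra) * inc
         ie = is + (chunk - 1) * inc
      end if

      return
      end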

29 Things to watch out for
False sharing
Increased stack usage
Must use !$OMP BARRIER to synchronize threads
Reading, writing, and arithmetic

30 False Sharing

   90 NCYCLE = NCYCLE + 1
!$OMP PARALLEL PRIVATE(omp_is, omp_ie, threadid)
      threadid = OMP_GET_THREAD_NUM()
      call set_omp_loop_limits(n1, 1, n1-1, 1, omp_is, omp_ie)
      CALL CALC1(omp_is, omp_ie)
      CALL CALC2(omp_is, omp_ie)
      if (threadid .eq. 0) TIME = TIME + DT
      IF (MOD(NCYCLE, MPRINT) .NE. 0) GO TO 370
!     The per-thread partial sums below are indexed by threadid, so
!     neighboring threads update adjacent words on the same cache
!     line -- this is the false sharing the slide title refers to.
      PCHECK(threadid) = 0.0
      UCHECK(threadid) = 0.0
      VCHECK(threadid) = 0.0
      DO 3500 JCHECK = js, je
      DO 3500 ICHECK = omp_is, omp_ie
         PCHECK(threadid) = PCHECK(threadid) + ABS(PNEW(ICHECK,JCHECK))
         UCHECK(threadid) = UCHECK(threadid) + ABS(UNEW(ICHECK,JCHECK))
         VCHECK(threadid) = VCHECK(threadid) + ABS(VNEW(ICHECK,JCHECK))
 3500 CONTINUE
!$OMP CRITICAL
      ppcheck = ppcheck + pcheck(threadid)
      pucheck = pucheck + ucheck(threadid)
      pvcheck = pvcheck + vcheck(threadid)
!$OMP END CRITICAL
!$OMP BARRIER
!     Only the master thread of each MPI task performs the reductions.
      if (threadid .eq. 0) then
         call mpi_reduce(ppcheck, tpcheck, 1, MPI_DOUBLE_PRECISION,
     1                   MPI_SUM, 0, MPI_COMM_WORLD, ierr)
         call mpi_reduce(pucheck, tucheck, 1, MPI_DOUBLE_PRECISION,
     1                   MPI_SUM, 0, MPI_COMM_WORLD, ierr)
         call mpi_reduce(pvcheck, tvcheck, 1, MPI_DOUBLE_PRECISION,
     1                   MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      endif
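The false sharing in this excerpt comes from PCHECK, UCHECK, and VCHECK being indexed directly by threadid: neighboring threads' accumulators sit in adjacent REAL*8 words and therefore share a 128-byte cache line that ping-pongs between caches on every update. One standard remedy, shown as a hedged sketch below (MAXTHREADS and the padding layout are illustrative, not from the slides), is to pad each thread's accumulator out to its own cache line.

!     Pad each per-thread partial sum to a full 128-byte cache line
!     (16 REAL*8 words) so no two threads ever write the same line.
      integer PAD, MAXTHREADS
      parameter (PAD = 16, MAXTHREADS = 32)
      real*8 pcheck(PAD, 0:MAXTHREADS-1)
      real*8 ucheck(PAD, 0:MAXTHREADS-1)
      real*8 vcheck(PAD, 0:MAXTHREADS-1)

!     Inside the 3500 loop each thread accumulates only into element
!     (1, threadid); the remaining PAD-1 words are untouched padding.
      pcheck(1, threadid) = pcheck(1, threadid)
     1                    + ABS(PNEW(ICHECK, JCHECK))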

31 Hybrid Paradigm in the Flesh – Cont.

!     Only the master thread of each MPI task exchanges halo rows with
!     the neighboring tasks (this assumes req(:) has been initialized
!     to MPI_REQUEST_NULL so MPI_WAITALL is safe on the boundary tasks).
      if (threadid .eq. 0) then
         if (taskid .lt. numtasks-1) then
            call mpi_irecv(V(1,je+1), n1, MPI_DOUBLE_PRECISION,
     1                     taskid+1, 1, MPI_COMM_WORLD, req(1), ierr)
         endif
         if (taskid .gt. 0) then
            call mpi_isend(V(1,js), n1, MPI_DOUBLE_PRECISION,
     1                     taskid-1, 1, MPI_COMM_WORLD, req(4), ierr)
         endif
         if (taskid .gt. 0) then
            call mpi_irecv(U(1,js-1), n1, MPI_DOUBLE_PRECISION,
     1                     taskid-1, 2, MPI_COMM_WORLD, req(2), ierr)
            call mpi_irecv(P(1,js-1), n1, MPI_DOUBLE_PRECISION,
     1                     taskid-1, 3, MPI_COMM_WORLD, req(3), ierr)
         endif
         if (taskid .lt. numtasks-1) then
            call mpi_isend(U(1,je), n1, MPI_DOUBLE_PRECISION,
     1                     taskid+1, 2, MPI_COMM_WORLD, req(5), ierr)
            call mpi_isend(P(1,je), n1, MPI_DOUBLE_PRECISION,
     1                     taskid+1, 3, MPI_COMM_WORLD, req(6), ierr)
         endif
         call MPI_WAITALL(6, req, istat, ierr)
      endif
!$OMP BARRIER

32 Hybrid Paradigm in the Flesh – Cont.

!     Threaded compute loop: each thread handles its own omp_is:omp_ie
!     range of I, while J covers this MPI task's js:je slab.
      DO 100 J = js, je
      DO 100 I = omp_is, omp_ie
         CU(I+1,J) = .5*(P(I+1,J)+P(I,J))*U(I+1,J)
         if (j .gt. 1) then
            CV(I,J) = .5*(P(I,J)+P(I,J-1))*V(I,J)
            Z(I+1,J) = (FSDX*(V(I+1,J)-V(I,J))-FSDY*(U(I+1,J)
     1                 -U(I+1,J-1)))/(P(I,J-1)+P(I+1,J-1)+P(I+1,J)+P(I,J))
         endif
         if (j .eq. n) then
            CV(I,J+1) = .5*(P(I,J+1)+P(I,J))*V(I,J+1)
            Z(I+1,J+1) = (FSDX*(V(I+1,J+1)-V(I,J+1))-FSDY*(U(I+1,J+1)
     1                   -U(I+1,J)))/(P(I,J)+P(I+1,J)+P(I+1,J+1)+P(I,J+1))
         endif
         H(I,J) = P(I,J)+.25*(U(I+1,J)*U(I+1,J)+U(I,J)*U(I,J)
     1            +V(I,J+1)*V(I,J+1)+V(I,J)*V(I,J))
  100 CONTINUE
!$OMP BARRIER

33 Distribution of Data and Work
Consider a 3-D finite difference code
– Option 1: MPI on one dimension, OpenMP on another dimension – the inner index could give false sharing
– Option 2: MPI and OpenMP on the same dimension
– Option 3: Domain decomposition (3-D chunks) distributed over both MPI and OpenMP

34 Option 1 – MPI & OpenMP on Different Dimensions
Size could limit load balancing
– Consider 144 x 78 x 30 – the 30 dimension would limit MPI or OpenMP flexibility
False sharing with OpenMP on the inner subscript of a 95 x 144 x 30 grid
– On 16 processors, each processor would get 6 iterations of the inner loop, while a cache line holds 128 bytes – 16 REAL*8 words
(A loop-structure sketch for this option follows below.)
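As a concrete, purely illustrative picture of option 1, the routine below assumes the caller's MPI decomposition has assigned this task the slabs k = ks..ke of a grid such as 144 x 78 x 30, and lets OpenMP split the middle j loop; the stride-1 i loop stays whole inside each thread, which avoids the false-sharing hazard described above (OpenMP on the inner i index would have neighboring threads writing adjacent elements of the same cache line). Array and variable names are assumptions.

      subroutine smooth(a, b, ni, nj, nk, ks, ke)
!     Hedged sketch of option 1: MPI owns the k slabs ks..ke of this
!     task; OpenMP threads share the j loop; each thread keeps the
!     whole stride-1 i loop, so threads do not share cache lines.
      implicit none
      integer ni, nj, nk, ks, ke, i, j, k
      real*8 a(ni, nj, nk), b(ni, nj, nk)

      do k = ks, ke
!$OMP PARALLEL DO PRIVATE(i)
         do j = 2, nj - 1
            do i = 2, ni - 1
               a(i,j,k) = 0.25d0 * (b(i-1,j,k) + b(i+1,j,k)
     1                           +  b(i,j-1,k) + b(i,j+1,k))
            end do
         end do
!$OMP END PARALLEL DO
      end do

      return
      end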

35 Option 2 – MPI & OpenMP on the Same Dimension
Similar to a 1-D decomposition
– Would limit the number of MPI + OpenMP processors

36 Option 3 – Domain Decomposition
Poor data locality for OpenMP
– Not an issue on the Power 3, but it is on the Power 4 NUMA

37 Disadvantages of Hybrid
Two programming paradigms
Data locality – is it important on the SMP node?

38 Assuring Data Locality off Node
NUMA
– Able to get cache lines over the Federation switch
– Will OpenMP have data alignment directives?
– Definitely will need to assure that data is local to the processor that needs it

39 Assuring Data Locality off Node
Beyond NUMA
– Typical MPI data locality issue
– Given that cache lines can be accessed across the switch, one-sided message passing – in particular GETs – will work extremely well
What about?
– Co-array Fortran
– UPC
(An illustrative coarray fragment follows below.)
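To illustrate the kind of code Co-array Fortran enables, here is a small hedged sketch (the variable names, slab layout, and halo width are all assumptions, and the syntax follows the later standardized Fortran coarray form rather than any IBM product of the time): each image reads its neighbors' boundary rows directly, which is a one-sided GET with no matching send on the other side.

      program caf_halo
      implicit none
      integer, parameter :: n1 = 144, nloc = 10
      double precision :: u(n1, 0:nloc+1)[*]
      integer :: me, nimg

      me   = this_image()
      nimg = num_images()

!     Fill this image's interior rows.
      u(:, 1:nloc) = dble(me)
      sync all

!     One-sided GETs: read the neighbors' boundary rows directly
!     into this image's halo rows.
      if (me .gt. 1)    u(:, 0)      = u(:, nloc)[me - 1]
      if (me .lt. nimg) u(:, nloc+1) = u(:, 1)[me + 1]
      sync all

      end program caf_halo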


