From Clustered SMPs to Clustered NUMA John M. Levesque The Advanced Computing Technology Center

The Advanced Computing Technology Center has been established at IBM Research to focus expertise in High Performance Computing and to supply the user community with solutions for porting and optimizing their applications.

ACTC Staff A blend of experienced IBM staff members and recent hires with 100 years of experience in Cray Vector and MPP systems A combination of application specialists, system analysts, and tool and library developers A perfect mixture for moving users off Vector architectures to IBM hardware In 2000, ACTC was extended into a worldwide organization with members in EMEA and AP

What is ACTC trying to do: Assist in taking IBM to leadership in HPC and keep it there –The hardware roadmap is excellent, but it takes more than that –HPC software at IBM isn't of leadership quality – ACTC is filling holes in the IBM HPC product offering –HPC customer support wasn't of leadership quality – "ACTC has made a real difference in IBM's support" SCIENTIFIC USER SUPPORT IS MORE DIFFICULT AND MORE IMPORTANT

Outline Overview of the Current Architecture –Clustered SMP Best Programming Paradigm for the Clustered SMP –MPI only between nodes Overview of the Power 4 –Large SMP, Large NUMA, Clustered NUMA Best Programming Paradigm for the Power 4 –Where should we use MPI?

The Clustered SMP Four 4-way SMPs – processors on the same node are closer than processors on different nodes

Power 3 with Copper
Winterhawk II: 375 MHz 4-way SMP; 2 MULT/ADD units per processor (1500 MFLOPS); 64 KB Level 1 cache – 3.2 GB/sec, 128-way associative; 8 MB Level 2 cache – 6.4 GB/sec; 1.6 GB/s memory bandwidth; 6 GFLOPS/node
Nighthawk II: 375 MHz 16-way SMP; 2 MULT/ADD units per processor (1500 MFLOPS); 64 KB Level 1 cache – 3.2 GB/sec, 128-way associative; 8 MB Level 2 cache – 6.4 GB/sec; 14.2 GB/s memory bandwidth; 24 GFLOPS/node

IBM Hardware Roadmap

Power 4 Chip (Shared L2 Cache) 2 processors per chip: advanced superscalar, out-of-order execution, enhanced branch prediction, 7 execution units, multiple-outstanding-miss support and prefetch logic Private on-chip L1 caches Large on-chip L2 shared between the 2 processors Large L3 shared between all processors in the node – up to 32 MB per GP chip Large shared memory – up to 32 GB per GP chip Multiple, dedicated, high-bandwidth buses – GX external bus, inter-MCM, intra-MCM

Inter-processor communication on Power 4 (diagram: two processors, each with its own L2 cache, L3 cache, and memory) – a processor can get a cache line from the other processor's cache without going to memory

Multi-Chip Module (MCM) 4 GP chips (8 processors) on an MCM; logically shared L3 cache – logically UMA; 4 GX links for external connections – SP Thin / Wide Node

Power 4 32-Way Logical UMA – SP High Node

Cache Architecture – Power 4
Per chip (2 processors): 1.5 MB L2 cache, 32 MB L3 cache, 32 GB memory
Per MCM (8 processors): 12 MB L2 cache, 128 MB L3 cache, 128 GB memory
Per SP High Node (32 processors): 48 MB L2 cache, 512 MB L3 cache, 512 GB memory

Programming for the Power 4 Data locality is important –The usable L3 cache size depends upon where the data is stored – the L3 is a reflection of what is in its own memory If processor n wants to read a variable that is in processor n+k's cache, that access can be made without going to memory –Much faster than getting it from memory

Assuring Data Locality on the Node Memory is allocated in 512-byte chunks spread around the MCM memories The Power 4 will have a large page of 16 MB and a small page of 4096 bytes Within a page, the memory will be physically contiguous –The user will have mechanisms to assure data locality

Going to NUMA

Advantages of ALL-MPI Just run it across all the processors –Easy conversion from single-processor-node MPP systems MPI forces data locality –Good for NUMA architectures –Not a benefit for a flat SMP system such as the Power 3 nodes

Disadvantages of ALL-MPI MPI itself uses memory bandwidth –Bad for bus-limited architectures Increases the bandwidth requirement off the SMP node

Message Passing on SMP (diagram): Call MPI_SEND → buffer → Call MPI_RECEIVE

Message Passing on SMP (diagram): the data to send goes from MPI_SEND into a buffer in memory, across the memory crossbar or switch, and arrives as received data at MPI_RECEIVE
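A minimal Fortran sketch of the pattern in the diagrams above; the array name a, the message size, and the tag are placeholders, not part of the original slide. On an SMP node these calls are satisfied through a shared-memory buffer; off node the same code goes over the switch.

      include 'mpif.h'
      integer ierr, rank, status(MPI_STATUS_SIZE)
      real*8  a(1000)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
c     task 0 sends the array, task 1 receives it; on an SMP node the
c     data is staged through a buffer in shared memory
      if (rank .eq. 0) then
         call MPI_SEND(a, 1000, MPI_DOUBLE_PRECISION, 1, 99,
     1                 MPI_COMM_WORLD, ierr)
      else if (rank .eq. 1) then
         call MPI_RECV(a, 1000, MPI_DOUBLE_PRECISION, 0, 99,
     1                 MPI_COMM_WORLD, status, ierr)
      endif
      call MPI_FINALIZE(ierr)
      end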

Message Passing off Node With MPI across all the processors, many more messages go through the switch fabric

Advantages of the Hybrid Model Best Utilization of Available Memory Bandwidth

OpenMP - intra processor communication

Dual-SPMD Hybrid Model SPMD MPI – a single program executes on each node and communicates via message passing SPMD OpenMP –!$OMP PARALLEL/!$OMP END PARALLEL at a high level –The MASTER thread does MPI within the PARALLEL region

Hybrid Programming Tips Consider SPMD threading to go with SPMD MPI –Notion of a master task (MPI – taskid, e.g. from MPI_COMM_RANK) and a master thread (OpenMP – threadid = OMP_GET_THREAD_NUM()) Only issue MPI calls on threadid=0 Only do I/O on threadid=0 and taskid=0
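A bare-bones sketch of the SPMD-inside-SPMD structure these tips describe (names other than the MPI and OpenMP calls are made up): every thread computes, but only the master thread of each task talks to MPI, and only the master thread of the master task does I/O.

      include 'mpif.h'
      integer OMP_GET_THREAD_NUM
      integer ierr, taskid, numtasks, threadid
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, taskid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)
!$OMP PARALLEL PRIVATE(threadid)
      threadid = OMP_GET_THREAD_NUM()
c     ... every thread works on its own slice of this task's data ...
      if (threadid .eq. 0) then
c        only the master thread issues MPI calls for this task
         call MPI_BARRIER(MPI_COMM_WORLD, ierr)
      endif
!$OMP BARRIER
      if (threadid .eq. 0 .and. taskid .eq. 0) then
c        only the master thread of the master task does I/O
         print *, 'step complete'
      endif
!$OMP END PARALLEL
      call MPI_FINALIZE(ierr)
      end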

Comparison of OpenMP techniques on the Nighthawk 1 (chart): super-linear speedup

Clustered SMP: Please don't run MPI across all processors Use MPI between nodes and threads on the node –Minimize MPI tasks – MPI overhead increases with the number of processors –Threads on the node minimize the use of the memory bandwidth The two dimensions of parallelism can be fully controlled by environment variables –m MPI tasks and 1 thread → m processors –m MPI tasks and n threads → m x n processors –1 MPI task and n threads → n processors

Getting OpenMP loop limits
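The later slides call a helper named set_omp_loop_limits but never show it, so the following is only a plausible sketch, with the argument meaning inferred from the call set_omp_loop_limits(n1,1,n1-1,1,omp_is,omp_ie): block-partition the iterations lo..hi (stride inc) of a dimension of size n among the threads of the current team and return this thread's chunk in is..ie. It must be called from inside the parallel region.

      subroutine set_omp_loop_limits(n, lo, hi, inc, is, ie)
c     sketch only: assumed interpretation of the call on the next slide
      integer n, lo, hi, inc, is, ie
      integer OMP_GET_THREAD_NUM, OMP_GET_NUM_THREADS
      integer nthreads, myid, total, chunk, extra
      nthreads = OMP_GET_NUM_THREADS()
      myid     = OMP_GET_THREAD_NUM()
      total    = (hi - lo)/inc + 1
      chunk    = total/nthreads
      extra    = mod(total, nthreads)
c     threads 0..extra-1 get one extra iteration each
      if (myid .lt. extra) then
         is = lo + myid*(chunk+1)*inc
         ie = is + chunk*inc
      else
         is = lo + (extra*(chunk+1) + (myid-extra)*chunk)*inc
         ie = is + (chunk-1)*inc
      endif
      return
      end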

Things to watch out for False sharing Increased stack usage Must use !$OMP BARRIER to synchronize threads Reading, writing, and 'rithmetic

False Sharing

   90 NCYCLE = NCYCLE + 1
!$OMP PARALLEL PRIVATE(omp_is,omp_ie,threadid,i,j,icheck,jcheck)
      threadid = OMP_GET_THREAD_NUM()
      call set_omp_loop_limits(n1,1,n1-1,1,omp_is,omp_ie)
      CALL CALC1(omp_is,omp_ie)
      CALL CALC2(omp_is,omp_ie)
      if(threadid.eq.0) TIME = TIME + DT
      IF(MOD(NCYCLE,MPRINT).NE.0) GO TO 370
      PCHECK(threadid) = 0.0
      UCHECK(threadid) = 0.0
      VCHECK(threadid) = 0.0
      DO 3500 JCHECK = js,je
      DO 3500 ICHECK = omp_is,omp_ie
      PCHECK(threadid) = PCHECK(threadid) + ABS(PNEW(ICHECK,JCHECK))
      UCHECK(threadid) = UCHECK(threadid) + ABS(UNEW(ICHECK,JCHECK))
      VCHECK(threadid) = VCHECK(threadid) + ABS(VNEW(ICHECK,JCHECK))
 3500 CONTINUE
!$OMP CRITICAL
      ppcheck = ppcheck + pcheck(threadid)
      pucheck = pucheck + ucheck(threadid)
      pvcheck = pvcheck + vcheck(threadid)
!$OMP END CRITICAL
!$OMP BARRIER
      if(threadid.eq.0)then
         call mpi_reduce(ppcheck,tpcheck,1,MPI_DOUBLE_PRECISION,
     1                   MPI_SUM,0,MPI_COMM_WORLD,ierr)
         call mpi_reduce(pucheck,tucheck,1,MPI_DOUBLE_PRECISION,
     1                   MPI_SUM,0,MPI_COMM_WORLD,ierr)
         call mpi_reduce(pvcheck,tvcheck,1,MPI_DOUBLE_PRECISION,
     1                   MPI_SUM,0,MPI_COMM_WORLD,ierr)
      endif
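The per-thread accumulators above are the false-sharing hazard the slide title points to: PCHECK(threadid), UCHECK(threadid), and VCHECK(threadid) for neighbouring threads sit in the same 128-byte cache line, so every update ping-pongs the line between processors. One common remedy – not shown on the slide, so the declarations below are only a sketch with an assumed MAXTHREADS – is to pad each thread's slot out to a full cache line.

      integer    PAD, MAXTHREADS
      parameter (PAD = 16, MAXTHREADS = 16)
c     16 REAL*8 words = 128 bytes, one full cache line
      real*8 pcheck(PAD,0:MAXTHREADS-1)
      real*8 ucheck(PAD,0:MAXTHREADS-1)
      real*8 vcheck(PAD,0:MAXTHREADS-1)
c     each thread then accumulates into pcheck(1,threadid) etc., so no
c     two threads ever store into the same cache line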

Hybrid Paradigm in the Flesh – Cont.

      if(threadid.eq.0)then
         if(taskid.lt.numtasks-1)then
            call mpi_irecv(V(1,je+1),n1,MPI_DOUBLE_PRECISION,
     1                     taskid+1,1,MPI_COMM_WORLD,req(1),ierr)
         endif
         if(taskid.gt.0)then
            call mpi_isend(V(1,js),n1,MPI_DOUBLE_PRECISION,
     1                     taskid-1,1,MPI_COMM_WORLD,req(4),ierr)
         endif
         if(taskid.gt.0)then
            call mpi_irecv(U(1,js-1),n1,MPI_DOUBLE_PRECISION,
     1                     taskid-1,2,MPI_COMM_WORLD,req(2),ierr)
            call mpi_irecv(P(1,js-1),n1,MPI_DOUBLE_PRECISION,
     1                     taskid-1,3,MPI_COMM_WORLD,req(3),ierr)
         endif
         if(taskid.lt.numtasks-1)then
            call mpi_isend(U(1,je),n1,MPI_DOUBLE_PRECISION,
     1                     taskid+1,2,MPI_COMM_WORLD,req(5),ierr)
            call mpi_isend(P(1,je),n1,MPI_DOUBLE_PRECISION,
     1                     taskid+1,3,MPI_COMM_WORLD,req(6),ierr)
         endif
         call MPI_WAITALL(6,req,istat,ierr)
      endif
!$OMP BARRIER

Hybrid Paradigm in the Flesh – Cont.

      DO 100 J=js,je
      DO 100 I=omp_is,omp_ie
      CU(I+1,J) = .5*(P(I+1,J)+P(I,J))*U(I+1,J)
      if(j.gt.1)then
         CV(I,J) = .5*(P(I,J)+P(I,J-1))*V(I,J)
         Z(I+1,J) = (FSDX*(V(I+1,J)-V(I,J))-FSDY*(U(I+1,J)
     1      -U(I+1,J-1)))/(P(I,J-1)+P(I+1,J-1)+P(I+1,J)+P(I,J))
      endif
      if(j.eq.n)then
         CV(I,J+1) = .5*(P(I,J+1)+P(I,J))*V(I,J+1)
         Z(I+1,J+1) = (FSDX*(V(I+1,J+1)-V(I,J+1))-FSDY*(U(I+1,J+1)
     1      -U(I+1,J)))/(P(I,J)+P(I+1,J)+P(I+1,J+1)+P(I,J+1))
      endif
      H(I,J) = P(I,J)+.25*(U(I+1,J)*U(I+1,J)+U(I,J)*U(I,J)
     1   +V(I,J+1)*V(I,J+1)+V(I,J)*V(I,J))
  100 CONTINUE
!$OMP BARRIER

Distribution of Data and Work Consider a 3-D finite difference code –Option 1: MPI on one dimension, OpenMP on another dimension – the inner index could give false sharing –Option 2: MPI and OpenMP on the same dimension –Option 3: domain decomposition (3-D chunks) distributed over both MPI and OpenMP

Option 1 – MPI & OpenMP on different dimensions Size could limit load balancing –Consider 144x78x30 –30 dimension would limit number of MPI or OpenMP flexibility –False sharing with OpenMP on inner subscript 95x144x30 –On 16 processors, each processor would get 6 iterations of inner loop – cache line holds 128 Bytes – 16 Real*8 words

Option 2 – MPI & OpenMP on same dimensions Similar to 1-D decomposition –Would limit number of MPI + OpenMP processors

Option 3 – Domain Decomposition Poor data locality for OpenMP –Not a problem on the flat Power 3 node – it matters on Power 4 NUMA

Disadvantages of Hybrid Two programming paradigms Data Locality – Is it important on the SMP node?

Assuring Data Locality off Node NUMA –Able to get cache lines over Federation Switch –Will OpenMP have data alignment directives? –Definitely will need to assure that data is local to the processor that needs it

Assuring Data Locality off Node Beyond NUMA –The typical MPI data locality issue –Given that cache lines can be accessed across the switch, one-sided message passing – in particular GETs – should work extremely well What about: –Co-Array Fortran –UPC
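For illustration, a one-sided GET in Co-Array Fortran syntax (a sketch; the array names and sizes are made up): the square-bracket co-subscript names the remote image, and on hardware that can fetch cache lines across the switch the read needs no matching action on the remote side.

      real*8  :: u(1000)[*]   ! every image (task) holds its own copy of u
      real*8  :: halo(1000)
      integer :: left
      left = this_image() - 1
      ! one-sided GET: pull the left neighbour's u; that image does nothing
      if (left .ge. 1) halo(:) = u(:)[left]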