NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER

Evolution of the NERSC SP System
NERSC User Services, June 2000

– Original Plans
– Phase 1
– Phase 2
– Programming Models and Code Porting
– Using the System

Original Plans: The NERSC-3 Procurement
– Complete, reliable, high-end scientific system
– High availability and long MTBF
– Fully configured: processing, storage, software, networking, support
– Commercially available components
– The greatest amount of computational power for the money
– Can be integrated with the existing computing environment
– Can be evolved with the product line
– Much careful benchmarking and acceptance testing was done

Original Plans: The NERSC-3 Procurement
What we wanted:
– >1 teraflop of peak performance
– 10 terabytes of storage
– 1 terabyte of memory
What we got in Phase 1:
– 410 gigaflops of peak performance
– 10 terabytes of storage
– 512 gigabytes of memory
What we will get in Phase 2:
– 3 teraflops of peak performance
– 15 terabytes of storage
– 1 terabyte of memory

Hardware, Phase 1: Power 3+ nodes (Nighthawk 1)
– Node usage:
    256 compute/batch nodes = 512 CPUs
    8 login nodes = 16 CPUs
    16 GPFS nodes = 32 CPUs
    8 network nodes = 16 CPUs
    16 service nodes = 32 CPUs
– 2 processors/node
– 200 MHz clock
– 4 flops/clock (2 multiply-add ops) = 800 Mflops/CPU, 1.6 Gflops/node
– 64 KB L1 data cache (5 ns, 3.2 GB/s)
– 4 MB L2 cache (45 ns, 6.4 GB/s)
– 1 GB RAM (175 ns, 1.6 GB/s)
– 150 MB/s switch bandwidth
– 9 GB local disk (two-way RAID)
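As a quick cross-check (not on the original slide), the compute partition above is consistent with the Phase 1 totals quoted on the previous slide:

    256 compute nodes × 2 CPUs/node = 512 CPUs
    512 CPUs × 800 Mflops/CPU ≈ 410 Gflops of peak performance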

Hardware, Phase 2: Power 3+ nodes (Winterhawk 2)
– Node usage:
    128 compute/batch nodes = 2048 CPUs
    2 login nodes = 32 CPUs
    16 GPFS nodes = 256 CPUs
    2 network nodes = 32 CPUs
    4 service nodes = 64 CPUs
– 16 processors/node
– 375 MHz clock
– 4 flops/clock (2 multiply-add ops) = 1.5 Gflops/CPU, 24 Gflops/node
– 64 KB L1 data cache (5 ns, 3.2 GB/s)
– 8 MB L2 cache (45 ns, 6.4 GB/s)
– 8 GB RAM (175 ns, 14.0 GB/s)
– ~2000 (?) MB/s switch bandwidth
– 9 GB local disk (two-way RAID)
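The same cross-check (again, not on the original slide) works for the Phase 2 totals quoted earlier:

    128 compute nodes × 16 CPUs/node = 2048 CPUs
    2048 CPUs × 1.5 Gflops/CPU ≈ 3.1 Tflops of peak performance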

Programming Models, Phase 1
– Phase 1 will rely on MPI, with threading available via:
    OpenMP directives
    Pthreads
    IBM SMP directives
– MPI now does intra-node communication efficiently
– Mixed-model programming is not currently very advantageous
– PVM and LAPI messaging systems are also available
– SHMEM is "planned"…
– The SP has cache and virtual memory, which means:
    there are more ways to reduce code performance
    there are more ways to lose portability

Programming Models, Phase 2
Phase 2 will offer more payback for mixed-model programming:
– Single-node parallelism is a good target for PVP users
– Vector and shared-memory codes can be "expanded" into MPI
– MPI codes can be ported from the T3E
– Threading can be added within MPI
In both cases, re-engineering will be required to exploit new and different levels of granularity.
This can be done along with increasing problem sizes.

Porting Considerations, part 1
Things to watch out for in porting codes to the SP:
– Cache
    Not enough on the T3E to make worrying about it worth the trouble
    Enough on the SP to boost performance, if it's used well
    Tuning for cache is different from tuning for vectorization
    False sharing caused by cache can reduce performance
– Virtual memory
    Gives you access to 1.75 GB of (virtual) RAM address space
    To use all of virtual (or even real) memory, you must explicitly request "segments"
    Causes performance degradation due to paging
– Data types
    Default sizes are different on PVP, T3E, and SP systems
    "integer", "int", "real", and "float" must be used carefully
    Best to say what you mean: "real*8", "integer*4"
    Do the same in MPI calls: "MPI_REAL8", "MPI_INTEGER4" (see the sketch below)
    Be careful with intrinsic function use as well
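To make the data-type point concrete, here is a minimal sketch (not from the original slides; the program and variable names are made up) in which every declaration states its size and the MPI call names the matching MPI type:

    program port_types
      implicit none
      include 'mpif.h'
      integer*4 ierr, my_id, nprocs
      real*8    mysum, total

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

      mysum = dble(my_id)      ! real*8 everywhere; no default-size "real"
      ! The MPI datatype must match the declared Fortran type exactly.
      call MPI_ALLREDUCE(mysum, total, 1, MPI_REAL8, MPI_SUM, &
                         MPI_COMM_WORLD, ierr)

      if (my_id == 0) print *, 'sum of ranks =', total
      call MPI_FINALIZE(ierr)
    end program port_types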

Porting Considerations, part 2
More things to watch out for in porting codes to the SP:
– Arithmetic
    Architecture tuning can help exploit special processor instructions
    Both T3E and SP can optimize beyond IEEE arithmetic
    T3E and PVP can also do fast reduced-precision arithmetic
    Compiler options on T3E and SP can force IEEE compliance
    Compiler options can also throttle other optimizations for safety
    Special libraries offer faster intrinsics
– MPI
    SP compilers and runtime will catch loose usage that was accepted on the T3E
    Communication bandwidth on the SP Phase 1 is lower than on the T3E
    Message latency on the SP Phase 1 is higher than on the T3E
    We expect approximate parity with the T3E in these areas on the Phase 2 system
    Limited number of communication ports per node, approximately one per CPU
    "Default" versus "eager" buffer management in MPI_SEND (see the sketch below)
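The buffer-management point is worth a small illustration. The following sketch is not from the slides (names and sizes are made up); it shows a pairwise exchange that may only complete when MPI_SEND buffers messages "eagerly", rewritten with non-blocking calls so that it does not depend on buffering at all. Run it with an even number of MPI tasks.

    program safe_exchange
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 100000
      real*8  sendbuf(n), recvbuf(n)
      integer ierr, my_id, partner, req(2), stats(MPI_STATUS_SIZE,2)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
      partner = ieor(my_id, 1)        ! pair ranks 0-1, 2-3, ...
      sendbuf = dble(my_id)

      ! Risky pattern: if both partners called MPI_SEND first, completion
      ! would depend on the library buffering the message "eagerly".
      ! Posting non-blocking operations and then waiting removes that
      ! dependence and is portable between the T3E and the SP.
      call MPI_IRECV(recvbuf, n, MPI_REAL8, partner, 0, MPI_COMM_WORLD, &
                     req(1), ierr)
      call MPI_ISEND(sendbuf, n, MPI_REAL8, partner, 0, MPI_COMM_WORLD, &
                     req(2), ierr)
      call MPI_WAITALL(2, req, stats, ierr)

      call MPI_FINALIZE(ierr)
    end program safe_exchange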

Porting Considerations, part 3
Compiling and linking:
– The compiler "version" to use depends on language and parallelization scheme
– Language version:
    Fortran 77: f77, xlf
    Fortran 90: xlf90
    Fortran 95: xlf95
    C: cc, xlc, c89
    C++: xlC
– MPI-included: mpxlf, mpxlf90, mpcc, mpCC
– Thread-safe: xlf_r, xlf90_r, xlf95_r, mpxlf_r, mpxlf90_r
– Preprocessing can be ordered by compiler flag or source-file suffix
    Use it consistently for all related compilations; the following may NOT produce a parallel executable:
        mpxlf90 -c *.F
        xlf90 -o foo *.o
– Use the -bmaxdata:bytes option to get more than a single 256 MB data segment (up to 7 segments, or ~1.75 GB, can be specified; only 3, or 0.75 GB, are real)

Porting: MPI
MPI codes should port relatively well.
– Use one MPI task per node or per processor:
    one per node during porting
    one per processor during production
– Let MPI worry about where it's communicating to
– Environment variables, execution parameters, and/or batch options can specify:
    number of tasks per node
    total number of tasks
    total number of processors
    total number of nodes
– Communications subsystem in use:
    User Space is best in batch jobs
    IP may be best for interactive development runs
– There is a debug queue/class in batch
A quick way to check task placement is sketched below.
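For that placement check, a trivial MPI program can report which node each task landed on (a sketch, not from the slides; the program name is made up):

    program where_am_i
      implicit none
      include 'mpif.h'
      integer ierr, my_id, nprocs, namelen
      character(len=MPI_MAX_PROCESSOR_NAME) nodename

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      ! Reports the node name, so you can confirm the tasks-per-node
      ! placement you asked for.
      call MPI_GET_PROCESSOR_NAME(nodename, namelen, ierr)

      print *, 'task ', my_id, ' of ', nprocs, ' on ', nodename(1:namelen)
      call MPI_FINALIZE(ierr)
    end program where_am_i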

Porting: Shared Memory
Don't throw away old shared-memory directives:
– OpenMP will work as is
– Cray tasking directives will be useful for documentation
– We recommend porting Cray directives to OpenMP (see the sketch below)
– Even small-scale parallelism can be useful
– Larger-scale parallelism will be available next year
If your problems and/or algorithms will scale to larger granularities and greater parallelism, prepare for message passing:
– We recommend MPI
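As a rough illustration of that conversion (a sketch, not from the slides; the loop is made up and the Cray directive spelling is approximate, so check it against your own source before deleting anything):

    subroutine scale_columns(a, n, m, s)
      implicit none
      integer n, m, i, j
      real*8  a(n,m), s(m)

    ! Old Cray autotasking form, kept as a comment for documentation:
    !CMIC$ DO ALL SHARED(a, s, n, m) PRIVATE(i, j)

    !$OMP PARALLEL DO SHARED(a, s, n, m) PRIVATE(i, j)
      do j = 1, m
         do i = 1, n
            a(i,j) = s(j) * a(i,j)
         enddo
      enddo
    !$OMP END PARALLEL DO
    end subroutine scale_columns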

From Loop-slicing to MPI, before...

    allocate(A(1:imax,1:jmax))
    !$OMP PARALLEL DO PRIVATE(I, J) SHARED(A, imax, jmax)
    do I = 1, imax
       do J = 1, jmax
          A(I,J) = deep_thought(A, I, J,…)
       enddo
    enddo
    !$OMP END PARALLEL DO

Sanity checking:
– Run the program on one CPU to get baseline answers
– Run on several CPUs to see parallel speedups and check answers
Optimization:
– Consider changing memory access patterns to improve cache usage
– How big can your problem get before you run out of real memory?

From Loop-slicing to MPI, after...

    call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
    call my_indices(my_id, nprocs, my_imin, my_imax, my_jmin, my_jmax)
    allocate(A(my_imin : my_imax, my_jmin : my_jmax))
    !$OMP PARALLEL DO PRIVATE(I, J) SHARED(A, my_imin, my_imax, my_jmin, my_jmax)
    do I = my_imin, my_imax
       do J = my_jmin, my_jmax
          A(I,J) = deep_thought(A, I, J,…)
       enddo
    enddo
    !$OMP END PARALLEL DO

    ! Communicate the shared values with neighbors…
    ! (left_id, right_id, top_id, bottom_id are the ranks of the neighboring
    !  tasks; the halo sizes and the status array are assumed to be declared
    !  and set up elsewhere.)
    if (odd(my_id)) then
       call MPI_SEND(my_left(...),   leftsize,   MPI_REAL, left_id,   tag, MPI_COMM_WORLD, ierr)
       call MPI_RECV(my_right(...),  rightsize,  MPI_REAL, right_id,  tag, MPI_COMM_WORLD, status, ierr)
       call MPI_SEND(my_top(...),    topsize,    MPI_REAL, top_id,    tag, MPI_COMM_WORLD, ierr)
       call MPI_RECV(my_bottom(...), bottomsize, MPI_REAL, bottom_id, tag, MPI_COMM_WORLD, status, ierr)
    else
       call MPI_RECV(my_right(...),  rightsize,  MPI_REAL, right_id,  tag, MPI_COMM_WORLD, status, ierr)
       call MPI_SEND(my_left(...),   leftsize,   MPI_REAL, left_id,   tag, MPI_COMM_WORLD, ierr)
       call MPI_RECV(my_bottom(...), bottomsize, MPI_REAL, bottom_id, tag, MPI_COMM_WORLD, status, ierr)
       call MPI_SEND(my_top(...),    topsize,    MPI_REAL, top_id,    tag, MPI_COMM_WORLD, ierr)
    endif

From Loop-slicing to MPI, after... (continued)
You now have one MPI task and many OpenMP threads per node:
– The MPI task does all the communicating between nodes
– The OpenMP threads do the parallelizable work
– Do NOT use MPI within an OpenMP parallel region
Sanity checking:
– Run on one node and one CPU to check baseline answers
– Run on one node and several CPUs to see parallel speedup and check answers
– Run on several nodes, one CPU per node, and check answers
– Run on several nodes, several CPUs per node, and check answers
Scaling checking:
– Run a larger version of a similar problem on the same set of ensemble sizes
– Run the same-sized problem on a larger ensemble
(Re-)Consider your I/O strategy…

From MPI to Loop-slicing
– Add OpenMP directives to existing code (a sketch follows below)
– Perform sanity and scaling checks, as before
– This results in the same overall code structure as on the previous slides: one MPI task and several OpenMP threads per node
– For irregular codes, Pthreads may serve better, at the cost of increased complexity
– Nobody really expects it to be this easy...
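A minimal sketch of that path (not from the slides; names and sizes are made up): the only change to the existing MPI code is the directive pair around its compute loop, with MPI kept outside the parallel region. Build it with one of the thread-safe compilers listed earlier, e.g. mpxlf90_r.

    program mpi_plus_threads
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 1000000
      real*8  x(n), partial, total
      integer ierr, my_id, i

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
      x = dble(my_id + 1)

      partial = 0.0d0
    !$OMP PARALLEL DO PRIVATE(i) SHARED(x) REDUCTION(+:partial)
      do i = 1, n
         partial = partial + x(i)*x(i)
      enddo
    !$OMP END PARALLEL DO

      ! MPI is called only outside the OpenMP parallel region.
      call MPI_REDUCE(partial, total, 1, MPI_REAL8, MPI_SUM, 0, &
                      MPI_COMM_WORLD, ierr)
      if (my_id == 0) print *, 'global sum of squares =', total
      call MPI_FINALIZE(ierr)
    end program mpi_plus_threads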

Using the Machine, part 1
Somewhat similar to the Crays:
– Interactive and batch jobs are possible

Using the Machine, part 2
Interactive runs:
– Sequential executions run immediately on your login node
– Every login will likely put you on a different node, so be careful when looking for your executions: "ps" returns information only about the node you're logged into
– Small-scale parallel jobs may be rejected if LoadLeveler can't find the resources
– There are two pools of nodes that can be used for interactive jobs:
    login nodes
    a small subset of the compute nodes
– Parallel execution can often be achieved by:
    trying again, after an initial rejection
    changing the communication mechanism from User Space to IP
    using the other pool

Using the Machine, part 3
Batch jobs:
– Currently very similar in capability to the T3E:
    similar run times and processor counts
    more memory available on the SP
– Limits and capabilities may change as we learn the machine
– LoadLeveler is similar to, but simpler than, NQE/NQS on the T3E:
    jobs are submitted, monitored, and cancelled by special commands
    each batch job requires a script that is essentially a shell script
    the first few lines contain batch options that look like comments to the shell
    the rest of the script can contain any shell constructs
    scripts can be debugged by executing them interactively
    (a sketch of such a script follows below)
– Users are limited to 3 running jobs, 10 queued jobs, and 30 submitted jobs at any given time
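As a rough illustration of that script structure (a sketch, not from the slides; the LoadLeveler keyword spellings, class name, and resource numbers are assumptions to be checked against current NERSC documentation):

    #!/usr/bin/ksh
    #@ job_name         = myrun
    #@ output           = myrun.out
    #@ error            = myrun.err
    #@ job_type         = parallel
    #@ class            = debug
    #@ node             = 2
    #@ tasks_per_node   = 2
    #@ wall_clock_limit = 00:30:00
    #@ queue
    # Everything after the "queue" keyword is an ordinary shell script.
    cd $SCRATCH
    poe ./a.out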

Using the Machine, part 4
File systems:
– Use the environment variables to let the system manage your file usage
– Sequential work can be done in $HOME (not backed up) or $TMPDIR (transient):
    medium performance, node-local
– Parallel work can be done in $SCRATCH (transient) or /scratch/username (purgeable):
    high performance, located in GPFS
– HPSS is available from batch jobs via HSI, and interactively via FTP, PFTP, and HSI
– There are quotas on space and inode usage

Using the Machine, part 5
The future?
– The allowed scale of parallelism (CPU counts) may change:
    max now = 512 CPUs, the same as on the T3E
– The allowed duration of runs may change:
    max now = 4 hours; max on the T3E = 12 hours
– The size of possible problems will definitely change:
    more CPUs in Phase 1 than on the T3E
    more memory per CPU, in both phases, than on the T3E
– The amount of work possible per unit time will definitely change:
    CPUs in both phases are faster than those on the T3E
    the Phase 2 interconnect will be faster than the Phase 1 interconnect
– Better machine management:
    checkpointing will be available
    we will learn what can be adjusted in the batch system
– There will be more and better tools for monitoring and tuning:
    HPM, KAP, Tau, PAPI...
– Some current problems will go away (e.g. memory-mapped files)