Performance Comparison of Winterhawk I and Winterhawk II Systems Patrick H. Worley Computer Science and Mathematics Division Oak Ridge National Laboratory.

Presentation transcript:

Performance Comparison of Winterhawk I and Winterhawk II Systems
Patrick H. Worley
Computer Science and Mathematics Division, Oak Ridge National Laboratory
SCICOMP, San Diego Supercomputer Center, La Jolla, CA
August 14, 2000

Acknowledgements
- Research sponsored by the Atmospheric and Climate Research Division and the Office of Mathematical, Information, and Computational Sciences, Office of Science, U.S. Department of Energy under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC.
- These slides have been authored by a contractor of the U.S. Government under Contract No. DE-AC05-00OR. Accordingly, the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.
- Oak Ridge National Laboratory is managed by UT-Battelle, LLC for the United States Department of Energy under Contract No. DE-AC05-00OR.

Overview
- Goal
  - Identify performance (and performance quirks) that users might expect when running applications on both Winterhawk I and Winterhawk II systems.
- Outline
  - Serial performance
    - PSTSWM spectral dynamics kernel
    - CRM column physics kernel
  - Interprocessor communication performance
  - Parallel performance
    - CCM/MP-2D atmospheric global circulation model

IBM SP Systems
- IBM SP at NERSC
  - 2-way Winterhawk I SMP “wide” nodes with 1 GB memory
  - 200 MHz Power3 processors with 4 MB L2 cache
  - 1.6 GB/sec node memory bandwidth (single bus)
  - Omega multistage interconnect
- IBM SP at ORNL
  - 4-way Winterhawk II SMP “thin” nodes with 2 GB memory
  - 375 MHz Power3-II processors with 8 MB L2 cache
  - 1.6 GB/sec node memory bandwidth (single bus)
  - Omega multistage interconnect
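A rough way to see why memory contention could matter more on the Winterhawk II nodes is to divide the node memory bandwidth listed above by the processor count and the clock rate. The sketch below does that arithmetic in Python. It assumes the bus is shared evenly among all processors in a node and that 1 GB = 10^9 bytes; neither is stated on the slide, and the names `NODES` and `bytes_per_cycle_per_proc` are illustrative only.

```python
# Node parameters as listed on the "IBM SP Systems" slide.
NODES = {
    "Winterhawk I (NERSC)": {"procs": 2, "clock_mhz": 200, "node_bw_gb_s": 1.6},
    "Winterhawk II (ORNL)": {"procs": 4, "clock_mhz": 375, "node_bw_gb_s": 1.6},
}


def bytes_per_cycle_per_proc(procs, clock_mhz, node_bw_gb_s):
    """Memory bytes available per clock cycle per processor.

    Assumptions (not from the slides): all processors in the node stream
    from memory at the same time, the bus is split evenly among them,
    and 1 GB = 1e9 bytes.
    """
    per_proc_bandwidth = node_bw_gb_s * 1e9 / procs  # bytes per second
    cycles_per_second = clock_mhz * 1e6
    return per_proc_bandwidth / cycles_per_second


for name, params in NODES.items():
    value = bytes_per_cycle_per_proc(**params)
    print(f"{name}: {value:.2f} bytes per cycle per processor")
```

Under these assumptions, a Winterhawk II processor has roughly a quarter of the memory bytes per clock cycle that a Winterhawk I processor has when every processor in the node is active.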

IBM SP Systems
- IBM SP that used to be at ORNL
  - 8-way Nighthawk I SMP nodes
  - 222 MHz Power3 processors with 4 MB L2 cache
  - switch-based memory subsystem
- IBM SP (at someplace within IBM)
  - 16-way Nighthawk II SMP node
  - 375 MHz Power3-II processors with 8 MB L2 cache
  - switch-based memory subsystem
  - Results obtained using prerelease hardware and software in March 2000

Other Platforms
- SGI/Cray Research Origin 2000 at LANL
  - 128-way SMP node with 32 GB memory
  - 250 MHz MIPS R10000 processors with 4 MB L2 cache
  - NUMA memory subsystem
- SGI/Cray Research T3E-900
  - Single processor nodes with 256 MB memory
  - 450 MHz Alpha (EV5) with 96 KB L2 cache
  - 1.2 GB/sec node memory bandwidth
  - 3D torus interconnect

Serial Performance
- Issues
  - Compiler optimization
  - Domain decomposition
  - Memory contention in SMP nodes
- Kernel codes
  - PSTSWM: spectral dynamics
  - CRM: column physics

Spectral Dynamics
- PSTSWM solves the nonlinear shallow water equations on a sphere using the spectral transform method
  - 99% of floating point operations are fmul, fadd, or fmadd
  - accesses memory linearly, but with not much reuse
  - (longitude, vertical, latitude) array index ordering
    - computation independent between horizontal layers (fixed vertical index)
    - as vertical dimension size increases, demands on memory increase

Spectral Dynamics
PSTSWM on the IBM SP at NERSC
Horizontal resolutions: T5: 8x16, T10: 16x32, T21: 32x64, T42: 64x128, T85: 128x256, T170: 256x512

Spectral Dynamics
PSTSWM on the IBM SP at ORNL
Horizontal resolutions: T5: 8x16, T10: 16x32, T21: 32x64, T42: 64x128, T85: 128x256, T170: 256x512

Spectral Dynamics
PSTSWM platform comparisons: 1 processor per SMP node
Horizontal resolutions: T5: 8x16, T10: 16x32, T21: 32x64, T42: 64x128, T85: 128x256, T170: 256x512

Spectral Dynamics
PSTSWM platform comparisons: all processors active in SMP node (except Origin-250)
Horizontal resolutions: T5: 8x16, T10: 16x32, T21: 32x64, T42: 64x128, T85: 128x256, T170: 256x512

Spectral Dynamics
PSTSWM platform comparisons: 1 processor per SMP node

Spectral Dynamics
PSTSWM platform comparisons: all processors active in SMP node (except Origin-250)

Spectral Dynamics
- Summary for PSTSWM
  - Math libraries and relaxed mathematical semantics improve performance significantly on the IBM SP.
  - The single processor performance for the Winterhawk II can be more than twice that of the Winterhawk I. However, this advantage disappears for large problem sizes that require frequent access to main memory, especially when multiple processors are competing for memory access.
  - The single processor performance for the Winterhawk node is better than that for the analogous Nighthawk node. This advantage can disappear for the Winterhawk II node when multiple processors are competing for memory access.
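The sensitivity to problem size noted in this summary can be made concrete with a little arithmetic on the horizontal resolutions and L2 cache sizes quoted on the earlier slides. The sketch below estimates how many horizontal layers of a single field would fit in each L2 cache. It assumes 64-bit reals, 1 MB = 2^20 bytes, and one field in isolation; the slides do not say how many fields PSTSWM touches per layer, so this gives an optimistic upper bound rather than a model of the code. All names are illustrative.

```python
# Horizontal resolutions (latitudes x longitudes) from the PSTSWM slides.
RESOLUTIONS = {
    "T5": (8, 16), "T10": (16, 32), "T21": (32, 64),
    "T42": (64, 128), "T85": (128, 256), "T170": (256, 512),
}

# L2 cache sizes from the "IBM SP Systems" slide (assuming 1 MB = 2**20 bytes).
L2_CACHE_BYTES = {"Winterhawk I": 4 * 2**20, "Winterhawk II": 8 * 2**20}

BYTES_PER_VALUE = 8  # assumption: 64-bit floating point values


def layer_bytes(nlat, nlon, bytes_per_value=BYTES_PER_VALUE):
    """Bytes needed to hold one horizontal layer of one field."""
    return nlat * nlon * bytes_per_value


def layers_fitting_in_cache(nlat, nlon, cache_bytes):
    """Whole layers of a single field that fit in a cache of the given size."""
    return cache_bytes // layer_bytes(nlat, nlon)


for name, (nlat, nlon) in RESOLUTIONS.items():
    size_kib = layer_bytes(nlat, nlon) / 1024
    fits = ", ".join(
        f"{node}: {layers_fitting_in_cache(nlat, nlon, cache)}"
        for node, cache in L2_CACHE_BYTES.items()
    )
    print(f"{name}: {size_kib:.0f} KiB per layer; layers per L2 cache -> {fits}")
```

With these assumptions a single T170 layer of one field occupies 1 MiB, so only a handful of layers fit in either L2 cache, while many T42 layers do.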

Column Physics
- CRM
  - Column Radiation Model extracted from the NCAR Community Climate Model
  - 6% of floating point operations are sqrt, 3% are fdiv
  - exp, log, and pow are among the top six most frequently called functions
  - (longitude, vertical, latitude) array index ordering
    - computations independent between vertical columns (fixed longitude, latitude)
    - as longitude dimension size increases, demands on memory increase

Column Physics
CRM on the NERSC SP: longitude-vertical slice, with varying number of longitudes

Column Physics
CRM on the ORNL SP: longitude-vertical slice, with varying number of longitudes

Column Physics
CRM: longitude-vertical slice, with varying number of longitudes; 1 processor per SMP node except where indicated

Column Physics
- Summary for CRM
  - Performance on the IBM SP is very sensitive to compiler optimization and domain decomposition.
  - Performance is less sensitive to node memory bandwidth for this kernel code, and Winterhawk II single processor performance is approximately twice that of Winterhawk I for all problem configurations and numbers of processors.

Communication Tests
- Interprocessor communication performance
  - within an SMP node
  - between SMP nodes
  - with and without contention
- Brief description of some results. For more details, see

Communication Tests
MPI_SENDRECV bidirectional and MPI_SEND/MPI_RECV unidirectional bandwidth between nodes on the IBM SP at NERSC and ORNL

Communication Tests
Bidirectional bandwidth: exchange between processors 0-4
Latency estimates (usecs)
- SP-200 (W1):
- SP-375 (W2): 21-60

Communication Tests
Bidirectional bandwidth: exchange between processors 0-1
Latency estimates (usecs)
- SP-200 (W1):
- SP-375 (W2): 8-29

Communication Tests
Bidirectional bandwidth per processor: simultaneous exchange between processors 0-4, 1-5, 2-6, 3-7

Communication Tests
Bidirectional bandwidth per processor: 8 processor send/recv ring

Communication Tests
- Summary
  - Bidirectional bandwidth is worth exploiting on both systems.
  - In isolation, bandwidth between processors in separate nodes is higher for Winterhawk II nodes than for Winterhawk I nodes.
  - When all processors in a node are communicating, the per processor bandwidth is twice as large for Winterhawk I nodes.
  - Winterhawk II intranode bandwidth is sensitive to cache “assumptions”.
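The slides report latency estimates and bandwidths but do not say how they were derived from the timed exchanges. One common approach, shown here purely as an assumption, is to fit the linear model t = latency + size / bandwidth to (message size, time) pairs by least squares. The function name `fit_latency_bandwidth` and the sample numbers are hypothetical; the sample data are synthetic and are not measurements from either system.

```python
def fit_latency_bandwidth(sizes_bytes, times_sec):
    """Least-squares fit of t = latency + size / bandwidth.

    Returns (latency in seconds, bandwidth in bytes per second).
    This is a generic linear cost model, not necessarily the method
    used to produce the estimates on the slides.
    """
    n = len(sizes_bytes)
    if n < 2 or n != len(times_sec):
        raise ValueError("need at least two (size, time) pairs")
    mean_size = sum(sizes_bytes) / n
    mean_time = sum(times_sec) / n
    sxx = sum((s - mean_size) ** 2 for s in sizes_bytes)
    if sxx == 0:
        raise ValueError("need at least two distinct message sizes")
    sxy = sum((s - mean_size) * (t - mean_time)
              for s, t in zip(sizes_bytes, times_sec))
    seconds_per_byte = sxy / sxx          # slope of the fitted line
    latency = mean_time - seconds_per_byte * mean_size  # intercept
    return latency, 1.0 / seconds_per_byte


# Synthetic example only: times generated from a made-up 30 usec latency
# and 100 MB/sec bandwidth, to show that the fit recovers its inputs.
sizes = [1024, 4096, 16384, 65536]
times = [30e-6 + s / 100e6 for s in sizes]
latency, bandwidth = fit_latency_bandwidth(sizes, times)
print(f"latency ~ {latency * 1e6:.1f} usec, bandwidth ~ {bandwidth / 1e6:.1f} MB/sec")
```

A fit of this kind gives a single latency value, whereas the slides quote ranges, so the original estimates may have been obtained differently.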

Parallel Performance
- Issues
  - Scalability
  - Overhead growth and analysis
- Codes
  - CCM/MP-2D

CCM/MP-2D
- Message-passing parallel implementation of the National Center for Atmospheric Research (NCAR) Community Climate Model
- Computational Domains
  - Physical Domain: Longitude x Latitude x Vertical levels
  - Fourier Domain: Wavenumber x Latitude x Vertical levels
  - Spectral Domain: (Wavenumber x Polynomial degree) x Vertical levels

CCM/MP-2D
- Problem Sizes
  - T42L18
    - 128 x 64 x 18 physical domain grid
    - 42 x 64 x 18 Fourier domain grid
    - 946 x 18 spectral domain grid
    - ~59.5 GFlops per simulated day
  - T170L18
    - 512 x 256 x 18 physical domain grid
    - 170 x 256 x 18 Fourier domain grid
    - x 18 spectral domain grid
    - ~3231 GFlops per simulated day
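The domain sizes on this slide can be cross-checked with a short calculation. The sketch below assumes triangular truncation, for which the number of (wavenumber, polynomial degree) pairs at truncation M is (M + 1)(M + 2) / 2. The slides do not state this formula, but it reproduces the 946 quoted for T42. The longitude and latitude counts are taken from the horizontal resolutions listed on the PSTSWM slides, and the function names are illustrative.

```python
def spectral_coefficients(truncation):
    """Count of (wavenumber, degree) pairs, assuming triangular truncation."""
    return (truncation + 1) * (truncation + 2) // 2


def domain_sizes(truncation, nlon, nlat, nlev):
    """Grid shapes for the three computational domains of one problem size."""
    return {
        "physical": (nlon, nlat, nlev),
        "fourier": (truncation, nlat, nlev),  # wavenumber count as written on the slide
        "spectral": (spectral_coefficients(truncation), nlev),
    }


PROBLEMS = {
    "T42L18": {"truncation": 42, "nlon": 128, "nlat": 64, "nlev": 18},
    "T170L18": {"truncation": 170, "nlon": 512, "nlat": 256, "nlev": 18},
}

for name, params in PROBLEMS.items():
    print(name, domain_sizes(**params))

# Ratio of the per-day floating point work quoted on the slide.
print(f"GFlops ratio T170L18 / T42L18: {3231 / 59.5:.1f}")
```

The physical grid grows by a factor of 16 between the two problem sizes, while the quoted work per simulated day grows by a larger factor.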

CCM/MP-2D
- Computations
  - Column Physics
    - independent between vertical columns
  - Spectral Dynamics
    - Fourier transform in longitude direction
    - Legendre transform in latitude direction
    - tendencies for timestepping calculated in spectral domain, independent between spectral coordinates
  - Semi-Lagrangian Advection
    - Use local approximations to interpolate wind fields and particle distributions away from grid points.

CCM/MP-2D
- Decomposition across latitude
  - parallelizes the Legendre transform: currently uses a distributed global sum algorithm
  - requires north/south halo updates for semi-Lagrangian advection
- Decomposition across longitude
  - parallelizes the Fourier transform: either use a distributed FFT algorithm or transpose fields and use a serial FFT
  - requires east/west halo updates for semi-Lagrangian advection
  - requires night/day vertical column swaps to load balance physics
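To illustrate how the choice of a two-dimensional decomposition affects halo traffic, the sketch below splits a longitude x latitude grid into contiguous blocks and counts the grid values an interior process would send per halo update. It is a deliberately simplified model, not the CCM/MP-2D implementation: it assumes a halo one grid point wide, counts values rather than bytes, treats longitude as periodic, and ignores the transposes, global sums, and load-balancing swaps listed above. The function names and the 16-process example are hypothetical.

```python
def block_sizes(n, parts):
    """Split n points into `parts` nearly equal contiguous blocks."""
    base, extra = divmod(n, parts)
    return [base + (1 if i < extra else 0) for i in range(parts)]


def halo_values_per_process(nlon, nlat, nlev, p_lon, p_lat, halo_width=1):
    """Grid values one interior process sends per halo update (upper bound).

    North/south halos are needed only when latitude is decomposed and
    east/west halos only when longitude is decomposed. halo_width=1 is
    an assumption made for illustration.
    """
    local_lon = max(block_sizes(nlon, p_lon))
    local_lat = max(block_sizes(nlat, p_lat))
    north_south = 2 * halo_width * local_lon * nlev if p_lat > 1 else 0
    east_west = 2 * halo_width * local_lat * nlev if p_lon > 1 else 0
    return north_south + east_west


# T42L18 physical grid (128 x 64 x 18) on 16 processes, arranged as
# (processes in longitude, processes in latitude).
for p_lon, p_lat in [(1, 16), (2, 8), (4, 4), (8, 2), (16, 1)]:
    volume = halo_values_per_process(128, 64, 18, p_lon, p_lat)
    print(f"{p_lon:2d} x {p_lat:2d}: {volume} values per halo update")
```

In this simplified model the more nearly square process grids exchange fewer halo values than the one-dimensional decompositions, which is one reason message volume depends on the decomposition chosen.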

CCM/MP-2D
Sensitivity of message volume to domain decomposition

Scalability
CCM/MP-2D T42L18 Benchmark

Computation Cost
CCM/MP-2D T42L18 Benchmark: Computation Time

Overhead Cost
CCM/MP-2D T42L18 Benchmark: Overhead Time

Overhead
CCM/MP-2D T42L18 Benchmark: Overhead Time Diagnosis

Scalability
CCM/MP-2D T170L18 Benchmark

Computation Cost
CCM/MP-2D T170L18 Benchmark: Serial Time

Overhead Time
CCM/MP-2D T170L18 Benchmark: Overhead Time

Overhead
CCM/MP-2D T170L18 Benchmark: Overhead Time Diagnosis

CCM/MP-2D
- Summary for CCM/MP-2D
  - CCM application is communication intensive for large processor counts, even for large problem sizes.
  - Winterhawk II system is % faster than Winterhawk I system for this application, even when communication bound.
  - Computation rate comparison between Winterhawk I and Winterhawk II runs agrees with kernel experiments.
  - Point-to-point communication benchmarks do not reflect the advantage of Winterhawk II over Winterhawk I, possibly due to
    - Contribution of load imbalance to communication costs.
    - Increase in variability in communication costs with increasing numbers of nodes.