1 CCSM Component Performance Benchmarking and Status of the CRAY X1 at ORNL Patrick H. Worley Oak Ridge National Laboratory Computing in Atmospheric Sciences Workshop 2003 September 10, 2003 L'Imperial Palace Hotel Annecy, France

2 Acknowledgements Research sponsored by the Atmospheric and Climate Research Division and the Office of Mathematical, Information, and Computational Sciences, Office of Science, U.S. Department of Energy under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC. These slides have been authored by a contractor of the U.S. Government under Contract No. DE-AC05-00OR22725. Accordingly, the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes. Oak Ridge National Laboratory is managed by UT-Battelle, LLC for the United States Department of Energy under Contract No. DE-AC05-00OR22725.

3 At the Request of the Organizers … Two talks in one: Benchmarking and Performance Analysis of CCSM Ocean and Atmosphere Models, and Early Experiences with the Cray X1 at Oak Ridge National Laboratory. Only time for a quick overview, concentrating on process as much as results. For more information, see

4 Platforms
Cray X1 at Oak Ridge National Laboratory (ORNL): 16 4-way vector SMP nodes and a torus interconnect. Each processor has 8 64-bit floating point vector units running at 800 MHz.
Earth Simulator: 640 8-way vector SMP nodes and a 640x640 single-stage crossbar interconnect. Each processor has 8 64-bit floating point vector units running at 500 MHz.
HP/Compaq AlphaServer SC at Pittsburgh Supercomputing Center: 750 ES45 4-way SMP nodes (1 GHz Alpha EV68) and a Quadrics QsNet interconnect with two network adapters per node.
Compaq AlphaServer SC at ORNL: 64 ES40 4-way SMP nodes (667 MHz Alpha EV67) and a Quadrics QsNet interconnect with one network adapter per node.

5 Platforms (cont.)
IBM p690 cluster at ORNL: 27 32-way p690 SMP nodes (1.3 GHz POWER4) and an SP Switch2 with two to eight network adapters per node.
IBM SP at the National Energy Research Scientific Computing Center (NERSC): 184 Nighthawk II 16-way SMP nodes (375 MHz POWER3-II) and an SP Switch2 with two network adapters per node.
IBM SP at ORNL: 176 Winterhawk II 4-way SMP nodes (375 MHz POWER3-II) and an SP Switch with one network adapter per node.
NEC SX-6 at the Arctic Region Supercomputing Center: a single 8-way SX-6 SMP node. Each processor has 8 64-bit floating point vector units running at 500 MHz.
SGI Origin 3000 at Los Alamos National Laboratory (LANL): 512-way SMP node. Each processor is a 500 MHz MIPS R14000.

6 Community Climate System Model (CCSM)
Community Atmosphere Model (CAM2)
– Eulerian spectral dynamics*
– Finite Volume/Semi-Lagrangian gridpoint dynamics
Parallel Ocean Program (POP 1.4.3)*
Community Land Model (CLM2)
CCSM Ice Model (CSIM)
CCSM Coupler (CPL6)
* models examined in this talk

7 CCSM Component Benchmarking Performance evolution and diagnosis – for development planning Performance scaling – for configuration optimization and for resource planning Platform evaluation

8 Component Benchmarking Funded by … the SciDAC Performance Evaluation Research Center, the SciDAC Collaborative Design and Development of the Community Climate System Model for Terascale Computers project, and the Evaluation of Early Systems project, in collaboration with the CCSM Software Engineering Group (CSEG), the Software Engineering Working Group (SEWG), and the NCAR Scientific Computing Division.

9 CAM Performance Evolution
Measuring perf. impact of tuning parameters:
- Physics data structure size (pcols; sketched below)
- Load balancing
- MPI processes vs. OpenMP threads
and code optimizations:
- spectral dycore load balance and comm. optimizations
- land/atmosphere interface comm. optimizations
- distributed ozone interpolation (added after dev10)
CAM computational rate on the IBM p690 cluster.
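The pcols parameter sets how many physics columns are grouped into each chunk, which controls inner loop length, cache footprint, and the granularity available for load balancing across MPI processes and OpenMP threads. The sketch below only illustrates that blocking idea; the type, field, and routine names (chunk_t, physics_update, the 26-level assumption) are hypothetical, not CAM's actual internals.

```fortran
! Illustrative sketch only: physics columns grouped into chunks of at most
! pcols columns, so the innermost loop runs over a small, tunable block.
! Names and layout are assumptions, not CAM's actual data structures.
module chunk_sketch
  implicit none
  integer, parameter :: pcols = 16      ! tuning parameter studied in the talk
  integer, parameter :: pver  = 26      ! number of vertical levels (assumed)
  type chunk_t
     integer :: ncols                   ! active columns in this chunk (<= pcols)
     real    :: t(pcols, pver)          ! e.g., temperature for the chunk's columns
  end type chunk_t
contains
  subroutine physics_update(c, dt)
    type(chunk_t), intent(inout) :: c
    real, intent(in) :: dt
    integer :: i, k
    ! Inner loop length is bounded by pcols: small enough to stay in cache on a
    ! superscalar node, large enough to give useful vector lengths elsewhere.
    do k = 1, pver
       do i = 1, c%ncols
          c%t(i, k) = c%t(i, k) + dt * 1.0e-3   ! placeholder physics tendency
       end do
    end do
  end subroutine physics_update
end module chunk_sketch

program chunk_demo
  use chunk_sketch
  implicit none
  type(chunk_t) :: c
  c%ncols = pcols
  c%t = 300.0
  call physics_update(c, 1200.0)
  print *, 'column 1, level 1 after update:', c%t(1, 1)
end program chunk_demo
```

Broadly, small chunks favor cache-based processors, while longer inner loops favor vector systems, which is why pcols is swept in these experiments.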

10 CAM Performance Evolution
Measuring perf. impact of tuning parameters:
- Physics data structure size (pcols)
- MPI processes vs. OpenMP threads
- Load balancing
and code optimizations:
- spectral dycore load balance and comm. optimizations
- land/atmosphere interface comm. optimizations
CAM computational rate on the AlphaServer SC.

11 CAM Performance Diagnosis Measuring performance and scaling of individual phases for CAM2.0.1dev10+ (with distributed ozone interpolation) when run with standard I/O on the IBM p690 cluster.

12 CAM Platform Comparison Comparing performance and scaling across platforms for CAM2.0.1dev10 (without distributed ozone interpolation) when run in production mode.

13 What’s next for CAM benchmarking? CAM 2_0_2 – modified physical parameterizations – increased number of advected constituents Increased resolution – Spectral Eulerian: T85, T170 – Finite Volume: 1x1.25, 0.5x0.625 degree Tracking additional optimizations – see also R. Loft’s presentation

14 POP Platform Comparison Comparing performance and scaling across platforms. - Earth Simulator results courtesy of Dr. Y. Yoshida of the Central Research Institute of Electric Power Industry (CRIEPI). - SGI results courtesy of Dr. P. Jones of LANL. - IBM SP (NH II) results courtesy of Dr. T. Mohan of Lawrence Berkeley National Laboratory (LBNL)

15 POP Diagnostic Experiments
Two primary phases that determine model performance:
– Baroclinic: 3D with limited nearest-neighbor communication; scales well.
– Barotropic: dominated by solution of a 2D implicit system using conjugate gradient solves; scales poorly.
One benchmark problem size:
– One degree horizontal grid (“by one” or “x1”) of size 320x384x40.
Domain decomposition is determined by grid size and a 2D virtual processor grid. Results for a given processor count are the best observed over all applicable processor grids (see the sketch below).
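Because results are reported as the best over all applicable processor grids, the benchmarking sweep effectively enumerates the 2D factorizations of each processor count. Below is a standalone sketch of that enumeration for the 320x384 horizontal grid; program and variable names are illustrative assumptions, and POP's actual block assignment is more involved.

```fortran
! Sketch: enumerate 2D virtual processor grids px*py = nprocs for the
! 320x384 "x1" horizontal grid and print the local block size each implies.
! Illustration only; not POP's decomposition code.
program pop_decomp_sketch
  implicit none
  integer, parameter :: nx = 320, ny = 384
  integer :: nprocs, px, py, bx, by
  nprocs = 64                           ! example processor count
  do px = 1, nprocs
     if (mod(nprocs, px) /= 0) cycle
     py = nprocs / px
     bx = (nx + px - 1) / px            ! ceiling division: local block width
     by = (ny + py - 1) / py            ! local block height
     print '(a,i4,a,i4,a,i4,a,i4)', 'grid ', px, ' x ', py, &
           '  -> block ', bx, ' x ', by
  end do
end program pop_decomp_sketch
```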

16 POP Performance Diagnosis: Baroclinic

17 POP Performance Diagnosis: Barotropic

18 What’s next for POP benchmarking? Tracking additional optimizations – primarily Cray X1 Increased resolution – 0.1 degree resolution More recent versions of the model – CCSM version of POP – POP 2.0 – HYPOP

19 Phoenix
Cray X1 with 32 SMP nodes:
– 4 Multi-Streaming Processors (MSPs) per node
– 4 Single-Streaming Processors (SSPs) per MSP
– two 32-stage 64-bit wide vector units running at 800 MHz and one 2-way superscalar unit running at 400 MHz per SSP
– 2 MB Ecache per MSP
– 16 GB of memory per node
for a total of 128 processors (MSPs), 512 GB of memory, and 1600 GF/s peak performance. System will be upgraded to 64 nodes (256 MSPs) by the end of September 2003.
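As a rough consistency check on the quoted peak, assuming each vector pipe retires one fused multiply-add (two flops) per cycle:

$$2\ \text{pipes} \times 2\ \tfrac{\text{flops}}{\text{cycle}} \times 0.8\ \text{GHz} = 3.2\ \text{GF/s per SSP}, \qquad 4 \times 3.2 = 12.8\ \text{GF/s per MSP}, \qquad 128 \times 12.8 \approx 1.6\ \text{TF/s},$$

which is consistent with the 1600 GF/s figure above.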

20 Evaluation Goals
Verifying advertised functionality and performance
Quantifying performance impact of:
– Scalar vs. Vector vs. Streams
– Contention for memory within SMP node
– SSP vs. MSP mode of running codes
– Page size
– MPI communication protocols
– Alternatives to MPI: SHMEM and Co-Array Fortran
– …
Guidance to Users:
– What performance to expect
– Performance quirks and bottlenecks
– Performance optimization tips

21 Caveats These are EARLY results, from approximately four months of sporadic benchmarking on evolving system software and hardware configurations. Performance characteristics are still changing, due to the continued evolution of the OS, compilers, and libraries. – This is a good thing: performance continues to improve. – This is a problem: performance data and analysis have a very short lifespan.

22 Outline of Full Talk
Standard or External Benchmarks (unmodified)
– Single MSP performance*
– Memory subsystem performance
– Interprocessor communication performance*
Custom Kernels
– Performance comparison of compiler and runtime options
– Single MSP and SSP performance*
– SMP node memory performance
– Interprocessor communication performance*
Application Codes
– Performance comparison of compiler and runtime options
– Scaling performance for fixed size problem*
* topics examined in this talk

23 POP: More of the story …
Ported to the Earth Simulator by Dr. Yoshikatsu Yoshida of CRIEPI.
Initial port to the Cray X1 by John Levesque of Cray, using Co-Array Fortran for the conjugate gradient solver.
X1 and Earth Simulator ports merged and modified by Pat Worley of ORNL.
Optimization on the X1 is ongoing:
– Cray-specific vectorization
– Improved scalability of the barotropic solver algorithm
– OS performance tuning
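At scale, the conjugate gradient solver is dominated by small global reductions (the dot products), which is where one-sided Co-Array Fortran transfers can help. The fragment below only sketches the flavor of such a reduction in Co-Array Fortran; it is a deliberately naive gather-and-broadcast illustration, not the implementation used in POP on the X1.

```fortran
! Sketch of a Co-Array Fortran global sum of the kind a conjugate gradient
! solver needs for its dot products. Naive illustration (image 1 gathers and
! broadcasts); the actual POP/X1 code is different and more scalable.
program caf_global_sum
  implicit none
  real :: local_dot[*]                  ! one scalar co-array element per image
  real :: global_dot[*]
  integer :: img
  local_dot = real(this_image())        ! stand-in for a local partial dot product
  sync all
  if (this_image() == 1) then
     global_dot = 0.0
     do img = 1, num_images()
        global_dot = global_dot + local_dot[img]   ! one-sided remote reads
     end do
     do img = 2, num_images()
        global_dot[img] = global_dot               ! one-sided remote writes
     end do
  end if
  sync all                              ! remote writes complete; all images have the sum
  print *, 'image', this_image(), 'global dot =', global_dot
end program caf_global_sum
```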

24 POP Version Performance Comparison Comparing performance of different versions of POP. The Earth Simulator version vectorizes reasonably well on the Cray. Most of the performance improvement in the current version is due to the Co-Array Fortran implementation of the conjugate gradient solver. This should be achievable in the MPI version as well.

25 Kernel Benchmarks
Single Processor Performance
– DGEMM matrix multiply benchmark*
– Euroben MOD2D dense eigenvalue benchmark*
– Euroben MOD2E sparse eigenvalue benchmark*
Interprocessor Communication Performance
– HALO benchmark*
– COMMTEST benchmark suite
* data collected by Tom Dunigan of ORNL. See for more details.

26 DGEMM Benchmark Comparing performance of vendor-supplied routines for matrix multiply. Cray X1 experiments used routines from the Cray scientific library libsci. Good performance achieved, reaching 80% of peak relatively quickly.
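The measurement itself is straightforward: time the vendor dgemm call and convert the result to a flop rate. A minimal sketch of such a harness follows; the problem size, timer choice, and output format are assumptions, not the benchmark's own driver.

```fortran
! Minimal DGEMM timing sketch: C = A*B for n x n matrices via the vendor BLAS
! (libsci on the X1; presumably ESSL on the p690), reporting 2*n**3 / time.
! Harness details are illustrative assumptions.
program dgemm_sketch
  implicit none
  integer, parameter :: n = 1000
  real(kind=8), allocatable :: a(:,:), b(:,:), c(:,:)
  integer(kind=8) :: t0, t1, rate
  real(kind=8) :: seconds, gflops
  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a); call random_number(b); c = 0.0d0
  call system_clock(t0, rate)
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  call system_clock(t1)
  seconds = real(t1 - t0, kind=8) / real(rate, kind=8)
  gflops  = 2.0d0 * real(n, kind=8)**3 / seconds / 1.0d9
  print '(a,f8.2,a)', 'dgemm rate: ', gflops, ' GF/s'
end program dgemm_sketch
```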

27 DGEMM Benchmark - What’s a Processor? Comparing performance of X1 MSP, X1 SSP, p690 processor, and four p690 processors (using the PESSL parallel library).
Max. percentage of peak:
- X1 SSP: 86%
- X1 MSP: 87%
- p690 (1): 70%
- p690 (4): 67%

28 MOD2D Benchmark Comparing performance of vendor-supplied routines for dense eigenvalue analysis. Cray X1 experiments used routines from the Cray scientific library libsci. Performance still growing with problem size. (Had to increase standard benchmark problem specifications.) Performance of nonvector systems has peaked.

29 MOD2E Benchmark Comparing performance of Fortran code for sparse eigenvalue analysis. Aggressive compiler options were used on the X1, but code was not restructured and compiler directives were not inserted. Performance is improving for larger problem sizes, so some streaming or vectorization is being exploited. Performance is poor compared to other systems.

30 COMMTEST SWAP Benchmark Comparing performance of SWAP for different platforms. Experiment measures bidirectional bandwidth between two processors in the same SMP node.

31 COMMTEST SWAP Benchmark Comparing performance of SWAP for different platforms. Experiment measures bidirectional bandwidth between two processors in different SMP nodes.

32 HALO Paradigm Comparison Comparing performance of MPI, SHMEM, and Co-Array Fortran implementations of Alan Wallcraft’s HALO benchmark on 16 MSPs. SHMEM and Co-Array Fortran are substantial performance enhancers for this benchmark.
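To make the paradigm difference concrete, here is a minimal one-dimensional Co-Array Fortran halo exchange: each image writes its boundary data directly into its neighbors' ghost cells, with no matching receives and no message-passing handshake. This is only an illustration; the HALO benchmark itself exercises 2D halos and many protocol variants.

```fortran
! Sketch of a 1D Co-Array Fortran halo exchange: each image pushes its
! boundary columns directly into its neighbors' ghost columns with one-sided
! writes. Sizes and names are illustrative assumptions.
program caf_halo_sketch
  implicit none
  integer, parameter :: nloc = 64, nrows = 128
  real :: u(nrows, 0:nloc+1)[*]          ! interior columns 1..nloc plus ghost columns
  integer :: me, left, right
  me    = this_image()
  left  = me - 1
  right = me + 1
  u = real(me)
  sync all
  if (left  >= 1)            u(:, nloc+1)[left] = u(:, 1)     ! fill left neighbor's right ghost
  if (right <= num_images()) u(:, 0)[right]     = u(:, nloc)  ! fill right neighbor's left ghost
  sync all                               ! ghost columns are now valid on every image
  print *, 'image', me, 'left ghost =', u(1, 0), 'right ghost =', u(1, nloc+1)
end program caf_halo_sketch
```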

33 HALO MPI Protocol Comparison Comparing performance of different MPI implementations of HALO on 16 MSPs. Persistent isend/irecv is always best. For codes that cannot use persistent commands, MPI_SENDRECV is also a reasonable choice (see the skeleton below).
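For reference, here is a compilable skeleton of the two MPI approaches being compared for the halo update: persistent requests built once with MPI_Send_init/MPI_Recv_init and restarted every iteration, versus a plain MPI_SENDRECV exchange. The 1D neighbor layout, buffer names, and tags are illustrative assumptions, and the driver and MPI initialization are omitted.

```fortran
! Skeleton of the two MPI protocols compared for HALO: persistent requests
! reused every timestep vs. a plain MPI_SENDRECV pair. Illustration only.
subroutine halo_persistent_setup(sendl, recvl, sendr, recvr, count, left, right, req)
  use mpi
  implicit none
  integer, intent(in) :: count, left, right
  real(kind=8), intent(in)    :: sendl(count), sendr(count)
  real(kind=8), intent(inout) :: recvl(count), recvr(count)
  integer, intent(out) :: req(4)
  integer :: ierr
  call MPI_Recv_init(recvl, count, MPI_DOUBLE_PRECISION, left,  1, MPI_COMM_WORLD, req(1), ierr)
  call MPI_Recv_init(recvr, count, MPI_DOUBLE_PRECISION, right, 2, MPI_COMM_WORLD, req(2), ierr)
  call MPI_Send_init(sendl, count, MPI_DOUBLE_PRECISION, left,  2, MPI_COMM_WORLD, req(3), ierr)
  call MPI_Send_init(sendr, count, MPI_DOUBLE_PRECISION, right, 1, MPI_COMM_WORLD, req(4), ierr)
end subroutine halo_persistent_setup

subroutine halo_persistent_exchange(req)
  use mpi
  implicit none
  integer, intent(inout) :: req(4)
  integer :: statuses(MPI_STATUS_SIZE, 4), ierr
  call MPI_Startall(4, req, ierr)        ! restart the prebuilt requests each timestep
  call MPI_Waitall(4, req, statuses, ierr)
end subroutine halo_persistent_exchange

subroutine halo_sendrecv(sendl, recvr, count, left, right)
  use mpi
  implicit none
  integer, intent(in) :: count, left, right
  real(kind=8), intent(in)  :: sendl(count)
  real(kind=8), intent(out) :: recvr(count)
  integer :: status(MPI_STATUS_SIZE), ierr
  ! One half of the exchange: send left, receive from right (mirror call omitted).
  call MPI_Sendrecv(sendl, count, MPI_DOUBLE_PRECISION, left,  3, &
                    recvr, count, MPI_DOUBLE_PRECISION, right, 3, &
                    MPI_COMM_WORLD, status, ierr)
end subroutine halo_sendrecv
```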

34 HALO Benchmark Comparing HALO performance using MPI on 16 MSPs of the Cray X1 and 16 processors of the IBM p690 (within a 32 processor p690 SMP node). Achievable bandwidth is much higher on the X1. For small halos, the p690 MPI HALO performance is between the X1 SHMEM and Co-Array Fortran HALO performance.

35 COMMTEST Benchmark COMMTEST is a suite of codes that measure the performance of MPI interprocessor communication. In particular, COMMTEST evaluates the impact of communication protocol, packet size, and total message length in a number of “common usage” scenarios. (However, it does not include persistent MPI point-to-point commands among the protocols examined.) It also includes simplified implementations of the SWAP and SENDRECV operators using SHMEM.

36 COMMTEST Experiment Particulars
0-1: MSP 0 swaps data with MSP 1 (within the same SMP node)
0-4: MSP 0 swaps data with MSP 4 (between two neighboring nodes)
0-8: MSP 0 swaps data with MSP 8 (between two more distant nodes)
i-(i+1): MSP 0 swaps with MSP 1 and MSP 2 swaps with MSP 3 simultaneously (within the same SMP node)
i-(i+8): MSP i swaps with MSP (i+8) for i=0,…,7 simultaneously, i.e., 8 pairs of MSPs across 4 SMP nodes swap data simultaneously
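A minimal MPI rendition of the SWAP measurement is sketched below, showing the partner selection for the 0-1 and i-(i+8) patterns and the bidirectional-bandwidth calculation. The message size, repetition count, and the rest of the harness are assumptions; COMMTEST itself sweeps protocols, packet sizes, and message lengths.

```fortran
! Sketch of a SWAP (bidirectional exchange) bandwidth measurement in the
! spirit of COMMTEST: paired ranks exchange a fixed-size message and the
! bidirectional bandwidth is 2*nbytes/time. Assumes 16 ranks for the i-(i+8)
! pattern, or 2 ranks for the 0-1 pattern. Illustration only.
program swap_sketch
  use mpi
  implicit none
  integer, parameter :: nwords = 131072, nreps = 100
  real(kind=8) :: sendbuf(nwords), recvbuf(nwords)
  integer :: rank, nprocs, partner, rep, ierr
  integer :: status(MPI_STATUS_SIZE)
  real(kind=8) :: t0, t1, mbytes_per_s
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  sendbuf = real(rank, kind=8)
  ! i-(i+8) pattern: rank i pairs with rank i+8 (xor with 8).
  partner = ieor(rank, 8)
  ! 0-1 pattern: with only two ranks the pairing reduces to 0 <-> 1.
  if (nprocs == 2) partner = 1 - rank
  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  t0 = MPI_Wtime()
  do rep = 1, nreps
     call MPI_Sendrecv(sendbuf, nwords, MPI_DOUBLE_PRECISION, partner, 0, &
                       recvbuf, nwords, MPI_DOUBLE_PRECISION, partner, 0, &
                       MPI_COMM_WORLD, status, ierr)
  end do
  t1 = MPI_Wtime()
  mbytes_per_s = 2.0d0 * 8.0d0 * nwords * nreps / (t1 - t0) / 1.0d6
  if (rank == 0) print '(a,f10.1,a)', 'SWAP bandwidth: ', mbytes_per_s, ' MB/s'
  call MPI_Finalize(ierr)
end program swap_sketch
```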

37 COMMTEST SWAP Benchmark Comparing performance of SWAP for different communication patterns. All performance is identical except for the experiment in which 8 pairs of processors swap simultaneously. In this case, contention for internode bandwidth limits the single pair bandwidth.

38 COMMTEST SWAP Benchmark II Comparing performance of SWAP for different communication patterns, plotted on a log-linear scale. The single-pair bandwidth has not reached its peak yet, but the two-pair experiment bandwidth is beginning to reach its maximum.

39 MPI vs. SHMEM 0-1 Comparison Comparing MPI and SHMEM performance for 0-1 experiment, looking at both SWAP (bidirectional bandwidth) and ECHO (unidirectional bandwidth). SHMEM performance is better for all but the largest messages.

40 MPI vs. SHMEM 0-1 Comparison II Comparing MPI and SHMEM performance for 0-1 experiment, using a log-linear scale. MPI performance is very near to that of SHMEM for large messages (when using SHMEM to implement two-sided messaging).

41 MPI vs. SHMEM 0-1 Comparison III Comparing MPI and SHMEM performance for 0-1 experiment, using a log-linear scale and looking at small message sizes. The ECHO bandwidth is half of the SWAP bandwidth, so full bidirectional bandwidth is being achieved.

42 MPI vs. SHMEM i-(i+8) Comparison Comparing MPI and SHMEM performance for i-(i+8) experiment, looking at both SWAP (bidirectional bandwidth) and ECHO (unidirectional bandwidth). Again, SHMEM performance is better for all but the largest messages.

43 MPI vs. SHMEM i-(i+8) Comparison II Comparing MPI and SHMEM performance for i-(i+8) experiment, using a log-linear scale. MPI performance is very near to that of SHMEM for large messages (when using SHMEM to implement two-sided messaging). For the largest message sizes, SWAP bandwidth saturates the network and MPI ECHO bandwidth exceeds MPI SWAP bandwidth.

44 MPI vs. SHMEM i-(i+8) Comparison III Comparing MPI and SHMEM performance for i-(i+8) experiment, using a log-linear scale and looking at small message sizes. The ECHO bandwidth is more than half of the SWAP bandwidth, and something less than full bidirectional bandwidth is achieved.

45 Conclusions? The system works. We need more experience with application codes, and experience on a larger system. There are currently OS limitations to efficient scaling for some codes; this should be solved in the near future. SHMEM and Co-Array Fortran performance can be superior to MPI performance; however, we hope that MPI small-message performance can be improved. Both SSP and MSP modes of execution work fine. MSP mode should be preferable for fixed-size problem scaling, but which is better is application and problem size specific.

46 Questions? Comments? For further information on these and other evaluation studies, visit