ESMF Performance Evaluation and Optimization

Peggy Li (1), Samson Cheung (2), Gerhard Theurich (2), Cecelia DeLuca (3)
(1) Jet Propulsion Laboratory, California Institute of Technology, USA; (2) Silicon Graphics Inc., USA; (3) National Center for Atmospheric Research (NCAR), USA

Objective: We report the results of two performance studies conducted on ESMF applications. The first is a grid redistribution overhead benchmark based on two different-resolution grids used in CCSM (the Community Climate System Model); the second is a scalability evaluation of the ESMF superstructure functions on large processor counts.

Acknowledgment: This task is sponsored by the Modeling, Analysis and Prediction (MAP) Program, National Aeronautics and Space Administration (NASA).

1. CCSM Grid Redistribution Benchmark

Background: CCSM is a fully coupled, global climate model that provides state-of-the-art computer simulations of the Earth's past, present, and future climate states. CCSM 3.0 consists of four dynamical geophysical models, namely the Community Atmosphere Model (CAM), the Community Land Model (CLM), the Parallel Ocean Program (POP), and the Community Sea-Ice Model (CSIM), linked by a central coupler. The CCSM coupler controls the execution and time evolution of the coupled CCSM system by synchronizing and controlling the flow of data between the various components. The current CCSM coupler is built on top of MCT (the Model Coupling Toolkit). In this study, we benchmark the performance of one major CCSM coupler function: the grid redistribution from the atmosphere model to the land model. The CCSM3 atmosphere model (CAM) and land model (CLM) share a common horizontal grid. The two resolutions benchmarked are T85, a Gaussian grid with 256 longitude points and 128 latitude points, and T42, a Gaussian grid with 128 longitude points and 64 latitude points.

Figure 1.a: CAM T42 grid (128x64) decomposition on 8 processors.
Figure 1.b: CLM T42 grid (128x64) decomposition on 8 processors.

Benchmark Program: Our benchmark program contains four components: an Atmosphere Grid Component (ATM), a Land Grid Component (LND), an Atmosphere-to-Land Coupler Component (A2L), and a Land-to-Atmosphere Coupler Component (L2A). The ATM component creates a 2D arbitrarily distributed global rectangular grid and a bundle of 19 floating-point fields associated with the grid. The decomposition of a T42-resolution ATM grid on 8 processors is depicted in Figure 1.a. The LND component contains a bundle of 13 floating-point fields on the land portion of the same 2D global rectangular grid. The LND grid is arbitrarily distributed on 8 processors as shown in Figure 1.b, where dark blue represents no data. The A2L and L2A components perform grid redistribution from the ATM grid to the LND grid and vice versa. ESMF handles data redistribution in two stages: an initialization stage that precomputes the communication pattern required for the redistribution, and the actual data redistribution stage. Our benchmark program measures the performance of the bundle-level Redist functions ESMF_BundleRedistStore() and ESMF_BundleRedistRun() between an arbitrarily distributed ATM grid and an arbitrarily distributed LND grid.
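The store/run split matters for performance because the route computation is paid once, while the redistribution itself runs at every coupling step. Below is a minimal sketch, in C with MPI, of how such a store-once, run-many benchmark is commonly timed; store_phase() and run_phase() are hypothetical placeholders standing in for ESMF_BundleRedistStore() and ESMF_BundleRedistRun(), and the sketch is not the authors' benchmark program.

```c
/*
 * Minimal sketch of timing a store-once / run-many redistribution benchmark.
 * store_phase() and run_phase() are hypothetical placeholders, not ESMF API.
 */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

static void store_phase(void) { usleep(1000); }  /* one-time route precomputation */
static void run_phase(void)   { usleep(100);  }  /* per-step data redistribution  */

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);                 /* start all ranks together */
    double t0 = MPI_Wtime();
    store_phase();                               /* paid once per coupling pair */
    double t_store = MPI_Wtime() - t0;

    const int reps = 100;
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < reps; ++i)
        run_phase();                             /* paid every coupling step */
    double t_run = (MPI_Wtime() - t0) / reps;

    /* Report the slowest rank: the coupler has to wait for it. */
    double w_store, w_run;
    MPI_Reduce(&t_store, &w_store, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t_run,   &w_run,   1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("store: %.3f ms   run: %.3f ms per call\n", w_store * 1e3, w_run * 1e3);

    MPI_Finalize();
    return 0;
}
```

Reporting the maximum across ranks reflects the time a coupler actually waits for the redistribution to complete.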
Results: We ran the benchmark program on the IBM SP cluster at NCAR and the Cray X1E at Cray Inc. using 8 to 128 processors. We measured ESMF_BundleRedistStore() and ESMF_BundleRedistRun() in both the A2L and L2A components and compared the timing results on the two platforms. In summary, the Cray X1E performs worse than the IBM SP in both functions. The performance of the data redistribution using ESMF is comparable to CCSM's current MCT-based approach on both the IBM SP and the Cray X1E.

Table A: CCSM T42 grid, 128x64 (times in milliseconds). Columns: # nodes, Init (X1E), Init (IBM), Run (X1E), Run (IBM), for ESMF_BundleRedistStore() and ESMF_BundleRedistRun().
Table B: CCSM T85 grid, 256x128 (times in milliseconds). Columns: # nodes, Init (X1E), Init (IBM), Run (X1E), Run (IBM), for ESMF_BundleRedistStore() and ESMF_BundleRedistRun().
Figure: ESMF grid redistribution run time (128x64 grid), time in milliseconds versus number of processors, for the X1E and the IBM SP.

Optimization:
1. We optimized ESMF_BundleRedistStore() by redesigning the ESMF Route function ESMF_RoutePrecomputeRedistV(), which calculates the send and receive route tables on each PET. The new algorithm sorts the local and global grid points by grid index to reduce the time needed to compute the intersection of the source and destination grids (sketched below).
2. We identified two functions that perform poorly on the X1E, namely MPI_Bcast() and memcpy(). We replaced a loop of MPI_Bcast() calls with a single MPI_Allgatherv() in ESMF_BundleRedistStore(), and we replaced the memcpy() calls that copy user data into the message buffer in ESMF_BundleRedistRun() with assignment statements (sketched below). These two modifications improve X1E performance significantly.
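Optimization 1 works because sorting turns the grid-intersection step from a nested search into a single merge pass. The sketch below is a generic C illustration of that idea, not the ESMF_RoutePrecomputeRedistV() implementation; the index lists are made-up examples.

```c
/* Generic sketch: intersect two index lists by sorting both and doing one merge
 * pass, instead of searching the destination list for every source index. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Writes the indices common to src[] and dst[] into out[]; returns their count. */
static int intersect_sorted(int *src, int ns, int *dst, int nd, int *out) {
    qsort(src, ns, sizeof(int), cmp_int);   /* sort by global grid index */
    qsort(dst, nd, sizeof(int), cmp_int);
    int i = 0, j = 0, n = 0;
    while (i < ns && j < nd) {              /* single merge pass: O(ns + nd) */
        if      (src[i] < dst[j]) i++;
        else if (src[i] > dst[j]) j++;
        else { out[n++] = src[i]; i++; j++; }
    }
    return n;
}

int main(void) {
    int src[] = {12, 3, 45, 7, 28, 19};     /* grid indices owned by a source PET      */
    int dst[] = {7, 45, 100, 12};           /* grid indices needed by a destination PET */
    int out[4];
    int n = intersect_sorted(src, 6, dst, 4, out);
    for (int k = 0; k < n; ++k)
        printf("shared index: %d\n", out[k]);
    return 0;
}
```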
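Optimization 2 replaces many rooted collectives with one gather-to-all collective. The sketch below, in C with MPI, shows the shape of that change in a generic setting (it is not the ESMF source); for simplicity every rank contributes an equal-sized block and MPI_Allgather() is used, whereas the ESMF change used MPI_Allgatherv() because block sizes differ per PET.

```c
/* Generic sketch: a loop of per-rank broadcasts versus a single all-gather. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int blk = 4;
    int *mine = malloc(blk * sizeof(int));
    int *all  = malloc(blk * size * sizeof(int));
    for (int i = 0; i < blk; ++i) mine[i] = rank * 100 + i;

    /* Original pattern: size broadcasts, each rooted at a different rank. */
    for (int r = 0; r < size; ++r) {
        if (rank == r)
            for (int i = 0; i < blk; ++i) all[r * blk + i] = mine[i];
        MPI_Bcast(&all[r * blk], blk, MPI_INT, r, MPI_COMM_WORLD);
    }

    /* Replacement: one collective that gathers every rank's block on every rank. */
    MPI_Allgather(mine, blk, MPI_INT, all, blk, MPI_INT, MPI_COMM_WORLD);

    if (rank == 0)
        printf("gathered %d ints from %d ranks\n", blk * size, size);

    free(mine); free(all);
    MPI_Finalize();
    return 0;
}
```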
2. ESMF Superstructure Scalability Benchmark

This benchmark program evaluates the performance of the ESMF superstructure functions on large processor counts, i.e., more than 1000 processors. The ESMF superstructure functions include ESMF initialization and termination (ESMF_Initialize(), ESMF_Finalize()) and component creation, initialization, execution, and finalization (ESMF_GridCompCreate(), ESMF_GridCompInit(), ESMF_GridCompRun(), and ESMF_GridCompFinalize()). We conducted the performance evaluation on the Cray XT3 (jaguar) at Oak Ridge National Laboratory and the SGI Altix supercluster (columbia) at NASA Ames. We ran the benchmark on 4 to 2048 processors.

Timing Results on XT3: The performance of ESMF_Initialize() and ESMF_Finalize() is dominated by the parallel I/O performance of the target machine because, by default, each processor opens an error log file at ESMF initialization (defaultLogType = ESMF_LOG_MULTI). By setting defaultLogType to ESMF_LOG_NONE, ESMF_Initialize() and ESMF_Finalize() run 200 times faster for 128 processors and above. The timings for these two functions with and without an error log file are shown below. The ESMF component function overheads are very small: ESMF_GridCompRun() time is below 20 microseconds across the processor counts tested. However, except for ESMF_GridCompFinalize(), the other three functions have O(n) complexity, where n is the number of processors. The following table and figures depict the timings of these four component functions on the XT3.

Table: ESMF component function overheads on the XT3 (times in microseconds). Columns: # processors, GridCompCreate, GridCompInit, GridCompRun, GridCompFinalize.

Comparison of the Four Benchmark Machines:
- IBM SP2 (bluevista): IBM POWER5, 7.6 GFLOPS per processor; 2 GB per processor, 16 GB shared per node, 1 TB total; 624 processors (78 8-processor nodes); 4.74 TFLOPS aggregate; IBM High Performance Switch (HPS), 5 microsecond latency.
- Cray X1E (earth): MSP (multi-streaming processor), 18 GFLOPS per MSP; 16 GB per compute module, 512 GB total; 128 MSPs (32 4-MSP compute modules); 2.3 TFLOPS aggregate; DSM architecture, 2D torus, 34 GB/s memory bandwidth.
- Cray XT3/XT4 (jaguar): dual-core AMD Opteron, 2.6 GHz; 4 GB per processor, 46 TB total; 11,508 processors; 119 TFLOPS aggregate; Cray SeaStar router, 3D torus.
- SGI Altix (columbia): Intel Itanium, 6 GFLOPS per processor; 2 GB per processor, 1 TB shared per node; 10,240 processors (including one 2048-processor system); 51.9 TFLOPS aggregate; SGI NUMAlink fabric, 1 microsecond latency.

XT3 and Altix Comparison: We compared the timing results for the six ESMF superstructure functions on the Cray XT3 and the SGI Altix. The timing charts are shown below.

Figures (A)-(F): Timing charts for the six ESMF superstructure functions on the XT3 and the Altix.

The ESMF_Initialize() and ESMF_Finalize() times shown in (A) and (B) were measured with defaultLogType set to ESMF_LOG_NONE. The Altix performs worse than the XT3 in both functions because of synchronization behavior and its MPI implementation. For ESMF_Initialize(), the time difference between the two machines is due to a global synchronization in the first MPI global operation called inside the function, MPI_Comm_create(). On the Altix, MPI_Finalize() takes about 1 second regardless of the number of processors used, which dominates the time for ESMF_Finalize(). The component functions have similar performance on both machines ((C) to (F)). The timings for ESMF_GridCompRun() (E) are very close on the two machines, with the XT3 slightly faster in all configurations, including on 1024 processors.
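To make the attribution above concrete, the following is a generic MPI micro-benchmark sketch in C (our illustration, not ESMF or the authors' benchmark code) that times the first global MPI operation, MPI_Comm_create(), and MPI_Finalize() itself; a wall-clock timer outside MPI is used so the MPI_Finalize() cost can still be reported after MPI has shut down.

```c
/* Generic probe for the two MPI costs discussed above: the synchronizing
 * MPI_Comm_create() call and the duration of MPI_Finalize(). Not ESMF code. */
#include <mpi.h>
#include <stdio.h>
#include <time.h>

static double now(void) {                       /* wall clock usable outside MPI */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(int argc, char **argv) {
    int rank;
    MPI_Group world_group;
    MPI_Comm dup;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    double t0 = now();
    MPI_Comm_create(MPI_COMM_WORLD, world_group, &dup);  /* first global operation */
    if (rank == 0)
        printf("MPI_Comm_create: %.3f ms\n", (now() - t0) * 1e3);

    MPI_Comm_free(&dup);
    MPI_Group_free(&world_group);

    t0 = now();
    MPI_Finalize();
    if (rank == 0)
        printf("MPI_Finalize: %.3f s\n", now() - t0);
    return 0;
}
```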