ESMF Performance Evaluation and Optimization

Peggy Li(1), Samson Cheung(2), Gerhard Theurich(2), Cecelia Deluca(3)
(1) Jet Propulsion Laboratory, California Institute of Technology, USA (2) Silicon Graphics Inc., USA (3) National Center for Atmospheric Research (NCAR), USA

Contact: Peggy.Li@jpl.nasa.gov
Full Reports: www.esmf.ucar.edu/main_site/performance.htm
Acknowledgment: This task is sponsored by the Modeling, Analysis and Prediction (MAP) Program, National Aeronautics and Space Administration (NASA).

Objective: We report the results of two performance studies conducted on ESMF applications. The first is a grid redistribution overhead benchmark based on two different-resolution grids used in the CCSM (Community Climate System Model); the second is a scalability evaluation of the ESMF superstructure functions on large numbers of processors.

1. CCSM Grid Redistribution Benchmark

Background: CCSM is a fully coupled, global climate model that provides state-of-the-art computer simulations of the Earth's past, present, and future climate states. CCSM 3.0 consists of four dynamical geophysical models, namely the Community Atmosphere Model (CAM), the Community Land Model (CLM), the Parallel Ocean Program (POP), and the Community Sea-Ice Model (CSIM), linked by a central coupler. The CCSM coupler controls the execution and time evolution of the coupled CCSM system by synchronizing and controlling the flow of data between the various components. The current CCSM coupler is built on top of MCT (the Model Coupling Toolkit). In this study, we benchmark the performance of one major CCSM coupler function: the grid redistribution from the atmosphere model to the land model. The CCSM3 atmosphere model (CAM) and land model (CLM) share a common horizontal grid. The two resolutions benchmarked are T85, a Gaussian grid with 256 longitude points and 128 latitude points, and T42, a Gaussian grid with 128 longitude points and 64 latitude points.

Figure 1.a: CAM T42 grid (128x64) decomposition on 8 processors.
Figure 1.b: CLM T42 grid (128x64) decomposition on 8 processors.

Benchmark Program: Our benchmark program contains four components: an Atmosphere Grid Component (ATM), a Land Grid Component (LND), an Atmosphere-to-Land Coupler Component (A2L), and a Land-to-Atmosphere Coupler Component (L2A). The ATM component creates a 2D arbitrarily distributed global rectangular grid and a bundle of 19 floating-point fields associated with the grid. The decomposition of a T42-resolution ATM grid on 8 processors is depicted in Figure 1.a. The LND component contains a bundle of 13 floating-point fields on the land portion of the same 2D global rectangular grid. The LND grid is arbitrarily distributed on 8 processors as shown in Figure 1.b, where dark blue represents no data. The A2L and L2A components perform grid redistribution from the ATM grid to the LND grid and vice versa. ESMF handles data redistribution in two stages: an initialization stage that precomputes the communication pattern required to perform the redistribution, and the actual data redistribution stage. Our benchmark program measures the performance of the bundle-level redistribution functions ESMF_BundleRedistStore() and ESMF_BundleRedistRun() between an arbitrarily distributed ATM grid and an arbitrarily distributed LND grid.
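To make the two-stage approach concrete, the sketch below illustrates the same store/run split in plain MPI rather than through the ESMF API: the communication pattern is computed once and then reused for every subsequent redistribution. The trivial block-exchange pattern, array sizes, and program name are illustrative assumptions, not details of the actual benchmark.

! Conceptual sketch (not the ESMF API) of the store/run split behind
! ESMF_BundleRedistStore()/ESMF_BundleRedistRun(): precompute the
! communication pattern once, then reuse it for every redistribution.
program redist_store_run_sketch
  use mpi
  implicit none
  integer, parameter :: nlocal = 1024      ! assumed number of local grid points
  integer :: ierr, rank, nprocs, i, step, blk
  integer, allocatable :: sendcounts(:), recvcounts(:), sdispls(:), rdispls(:)
  real(8), allocatable :: src(:), dst(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  allocate(sendcounts(nprocs), recvcounts(nprocs), sdispls(nprocs), rdispls(nprocs))
  allocate(src(nlocal), dst(nlocal))
  src = real(rank, 8)

  ! "Store" phase: precompute the route tables (here a trivial block exchange).
  blk = nlocal / nprocs
  sendcounts = blk
  recvcounts = blk
  do i = 1, nprocs
     sdispls(i) = (i - 1) * blk
     rdispls(i) = (i - 1) * blk
  end do

  ! "Run" phase: reuse the precomputed pattern at every model time step.
  do step = 1, 10
     call MPI_Alltoallv(src, sendcounts, sdispls, MPI_REAL8, &
                        dst, recvcounts, rdispls, MPI_REAL8, MPI_COMM_WORLD, ierr)
  end do

  call MPI_Finalize(ierr)
end program redist_store_run_sketch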
Results

We ran the benchmark program on the IBM SP cluster at NCAR and the Cray X1E at Cray Inc. using 8 to 128 processors. We measured ESMF_BundleRedistStore() and ESMF_BundleRedistRun() in both the A2L and L2A components and compared the timing results on the two platforms. In summary, the Cray X1E performs worse than the IBM SP in both functions. The performance of data redistribution using ESMF is comparable to CCSM's current MCT-based approach on both the IBM SP and the Cray X1E.

A. CCSM T42 Grid: 128x64 (times in milliseconds; Init = ESMF_BundleRedistStore, Run = ESMF_BundleRedistRun)

# Nodes | Init (X1E) | Init (IBM) | Run (X1E) | Run (IBM)
8       | 357.0178   | 40.5002    | 16.5927   | 1.2776
16      | 218.2901   | 34.1019    | 14.8972   | 1.5684
32      | 389.4656   | 31.3586    | 34.928    | 1.9814
64      | 425.2421   | 29.9956    | 59.4735   | 2.9228

B. CCSM T85 Grid: 256x128 (times in milliseconds; Init = ESMF_BundleRedistStore, Run = ESMF_BundleRedistRun)

# Nodes | Init (X1E) | Init (IBM) | Run (X1E) | Run (IBM)
16      | 924.6599   | 150.6831   | 30.0566   | 4.0421
32      | 1087.6294  | 140.4841   | 40.592    | 3.3827
64      | 1149.7676  | 124.6631   | 64.6535   | 4.6124
128     | 1728.3291  | 128.1008   | 129.1839  | 7.5746

[Figure: ESMF Grid Redistribution Run Time (128x64 grid) — run time in milliseconds versus number of processors for Run (X1E) and Run (IBM).]

Optimization:

1. We optimized ESMF_BundleRedistStore() by redesigning an ESMF Route function, ESMF_RoutePrecomputeRedistV(), which calculates the send and receive route tables in each PET. The new algorithm sorts the local and global grid points by grid index to reduce the time needed to compute the intersection of the source and destination grids.

2. We identified two functions that perform poorly on the X1E, namely MPI_Bcast() and memcpy(). We replaced a loop of MPI_Bcast() calls with a single MPI_Allgatherv() in ESMF_BundleRedistStore(), as sketched below. We also replaced the memcpy() calls used to copy user data into the message buffer in ESMF_BundleRedistRun() with assignment statements. These two modifications improve X1E performance significantly.
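The collective replacement in optimization 2 can be shown with a small, self-contained sketch: instead of looping over PETs and broadcasting each PET's local grid-index list with MPI_Bcast(), every PET contributes its list once to a single MPI_Allgatherv(). The data distribution, sizes, and variable names below are illustrative assumptions, not ESMF internals.

! Sketch of the X1E optimization: one MPI_Allgatherv replaces a loop of
! per-PET broadcasts when assembling the global list of grid indices.
program allgatherv_sketch
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, i, nlocal, ntotal
  integer, allocatable :: localidx(:), counts(:), displs(:), globalidx(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Each PET owns a different number of grid points (arbitrary distribution).
  nlocal = 100 + 10 * rank
  allocate(localidx(nlocal))
  localidx = [(rank * 1000 + i, i = 1, nlocal)]

  ! Exchange the per-PET counts, then build receive displacements.
  allocate(counts(nprocs), displs(nprocs))
  call MPI_Allgather(nlocal, 1, MPI_INTEGER, counts, 1, MPI_INTEGER, &
                     MPI_COMM_WORLD, ierr)
  displs(1) = 0
  do i = 2, nprocs
     displs(i) = displs(i - 1) + counts(i - 1)
  end do
  ntotal = sum(counts)
  allocate(globalidx(ntotal))

  ! One collective replaces a loop of nprocs broadcasts.
  call MPI_Allgatherv(localidx, nlocal, MPI_INTEGER, &
                      globalidx, counts, displs, MPI_INTEGER, &
                      MPI_COMM_WORLD, ierr)

  call MPI_Finalize(ierr)
end program allgatherv_sketch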
2. ESMF Superstructure Scalability Benchmark

This benchmark program evaluates the performance of the ESMF superstructure functions on large numbers of processors, i.e., more than 1000. The ESMF superstructure functions include ESMF initialization and termination (ESMF_Initialize(), ESMF_Finalize()) and component creation, initialization, execution, and termination (ESMF_GridCompCreate(), ESMF_GridCompInit(), ESMF_GridCompRun(), and ESMF_GridCompFinalize()). We conducted the performance evaluation on the Cray XT3 (jaguar) at Oak Ridge National Laboratory and the SGI Altix supercluster (columbia) at NASA Ames. We ran the benchmark from 4 processors up to 2048 processors.

Timing Results on XT3

The performance of ESMF_Initialize() and ESMF_Finalize() is dominated by the parallel I/O performance of the target machine because, by default, each processor opens an error log file at ESMF initialization (defaultLogType = ESMF_LOG_MULTI). By setting defaultLogType to ESMF_LOG_NONE, ESMF_Initialize() and ESMF_Finalize() run 200 times faster on 128 processors and above. The timings for these two functions with and without an error log file are shown below.

The overheads of the ESMF component functions are very small: ESMF_GridCompRun() takes roughly 20 microseconds or less even on 2048 processors. However, except for ESMF_GridCompFinalize(), the other three functions have O(n) complexity, where n is the number of processors. The following table and figures depict the timings of these four component functions on the XT3.

ESMF component function overhead on XT3 (times in microseconds)

# Processors | GridCompCreate | GridCompInit | GridCompRun | GridCompFinalize
4            | 59.10          | 37.20        | 1.98682     | 7.90
8            | 76.10          | 47.90        | 1.98682     | 10.00
16           | 95.80          | 64.10        | 1.98682     | 12.10
32           | 122.10         | 74.10        | 2.30471     | 14.80
64           | 161.80         | 91.00        | 2.62260     | 16.90
128          | 266.10         | 105.10       | 3.01996     | 19.80
256          | 604.90         | 108.90       | 4.60942     | 23.80
512          | 1871.80        | 115.10       | 6.99361     | 27.90
1024         | 6957.10        | 430.01       | 11.28514    | 35.00
2048         | 28701.00       | 998.97       | 20.98083    | 47.20

Comparison of the Four Benchmark Machines

                           | IBM SP2 (bluevista)                             | Cray X1E (earth)                                       | Cray XT3/XT4 (jaguar)           | SGI Altix (columbia)
CPU type and speed         | IBM POWER5, 7.6 GFLOPS/sec                      | MSP (multi-streaming processor), 18 GFLOPS/sec         | Dual-core AMD Opteron, 2.6 GHz  | Intel Itanium, 6 GFLOPS/sec
Memory                     | 2 GB/processor, 16 GB shared/node, 1 TB total   | 16 GB/compute module, 512 GB total                     | 4 GB/processor, 46 TB total     | 2 GB/processor, 1 TB shared/node
Total number of processors | 624 (78 8-processor nodes)                      | 128 MSPs (32 4-MSP compute modules)                    | 11,508 processors               | 10,240 (16 512-processor nodes and one 2048-processor system)
Aggregate performance      | 4.74 TFLOPS                                     | 2.3 TFLOPS                                             | 119 TFLOPS                      | 51.9 TFLOPS
Network                    | IBM High Performance Switch (HPS), 5 microsecond latency | DSM architecture, 2D torus, 34 GB/s memory bandwidth | Cray SeaStar router, 3D torus | SGI NUMAlink fabric, 1 microsecond latency

XT3 and Altix Comparison

We compared the timing results for the six ESMF superstructure functions on the Cray XT3 and the SGI Altix. The timing charts (A) through (F) are shown below.

[Figures (A)-(F): timing charts for ESMF_Initialize(), ESMF_Finalize(), and the four component functions on XT3 and Altix.]

The ESMF_Initialize() and ESMF_Finalize() times shown in (A) and (B) were measured with defaultLogType set to ESMF_LOG_NONE. The Altix performs worse than the XT3 in both functions because of a synchronization issue and its MPI implementation. For ESMF_Initialize(), the time difference between the two machines is due to a global synchronization in the first MPI collective operation called inside the function, MPI_Comm_create(). On the Altix, MPI_Finalize() takes about 1 second regardless of the number of processors used, which dominates the time for ESMF_Finalize(). The component functions have similar performance on both machines ((C) to (F)). The timings for ESMF_GridCompRun() (E) are very close on the two machines, with the XT3 slightly faster in all configurations: on 1024 processors it takes 11.28 microseconds on the XT3 and 13.84 microseconds on the Altix.
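For reference, the defaultLogType setting used for the timings in (A) and (B) is selected when ESMF is initialized. The minimal sketch below assumes the ESMF 3.x-era Fortran interface used in this study (module name ESMF_Mod and the defaultLogType argument; newer ESMF releases use different spellings), and the system_clock timing is illustrative rather than the benchmark's actual instrumentation.

! Minimal sketch: initialize ESMF without per-PET log files and time it.
! Assumes an ESMF 3.x-era build ("use ESMF_Mod", defaultLogType argument).
program esmf_log_sketch
  use ESMF_Mod
  implicit none
  integer :: rc, c0, c1, crate

  call system_clock(c0, crate)
  ! The default (ESMF_LOG_MULTI) opens one error log per PET;
  ! ESMF_LOG_NONE suppresses the log files entirely.
  call ESMF_Initialize(defaultLogType=ESMF_LOG_NONE, rc=rc)
  call system_clock(c1)
  print '(a,f12.6,a)', 'ESMF_Initialize: ', real(c1 - c0, 8) / real(crate, 8), ' s'

  call ESMF_Finalize(rc=rc)
end program esmf_log_sketch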

