1 Performance Characteristics of a Cosmology Package on Leading HPC Architectures
Leonid Oliker (http://crd.lbl.gov/~oliker), Julian Borrill, Jonathan Carter
Lawrence Berkeley National Laboratory

2 Overview
- Superscalar cache-based architectures dominate the HPC market
- Leading architectures are commodity-based SMPs, due to their generality and a perception of cost effectiveness
- The growing gap between peak and sustained performance is well known in scientific computing
- Modern parallel vector systems may bridge this gap for many important applications
- In April 2002, the Earth Simulator (ES) became operational:
  - Peak ES performance > all DOE and DOD systems combined
  - Demonstrated high sustained performance on demanding scientific apps
- We are conducting an evaluation study of scientific applications on modern vector systems:
  - 09/2003: MOU between the ES center and NERSC completed
  - First visit to the ES center: Dec 2003; second visit: Oct 2004 (no remote access)
  - First international team to conduct a performance evaluation study at the ES
- Examining the best mapping between demanding applications and leading HPC systems: one size does not fit all

3 Vector Paradigm
- High memory bandwidth: allows the system to effectively feed the ALUs (high byte-to-flop ratio)
- Flexible memory addressing modes: support fine-grained strided and irregular data access
- Vector registers: hide memory latency via deep pipelining of memory loads/stores
- Vector ISA: a single instruction specifies a large number of identical operations
- Vector architectures allow for:
  - Reduced control complexity
  - Efficient utilization of a large number of computational resources
  - Potential for automatic discovery of parallelism
- However: vectors are most effective only if sufficient regularity is discoverable in the program structure; performance suffers even if a small percentage of the code is non-vectorizable (Amdahl's Law) - see the loop sketch below
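
To make the regularity requirement concrete, here is a minimal sketch (ours, not from the slides) of the loop shapes a vectorizing compiler rewards: long trip counts with unit-stride, strided, or indexed access and no loop-carried dependence.

    #include <stddef.h>

    /* Unit-stride access: the ideal vector loop. */
    void axpy_unit(size_t n, double a, const double *x, double *y) {
        for (size_t i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    /* Strided access: supported by flexible vector addressing modes. */
    void axpy_strided(size_t n, size_t s, double a, const double *x, double *y) {
        for (size_t i = 0; i < n; i++)
            y[i * s] += a * x[i * s];
    }

    /* Indexed gather/scatter: vectorizable if idx contains no duplicates. */
    void axpy_gather(size_t n, const size_t *idx, double a,
                     const double *x, double *y) {
        for (size_t i = 0; i < n; i++)
            y[idx[i]] += a * x[idx[i]];
    }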

4 Architectural Comparison

  Node    Where  CPU/  Clock  Peak   Mem BW  Peak       Netwk BW  Bisect BW  MPI Lat  Network
  Type           Node  (MHz)  GFlop  (GB/s)  byte/flop  (GB/s/P)  byte/flop  (usec)   Topology
  Power3  NERSC  16    375    1.5    1.0     0.47       0.13      0.087      16.3     Fat-tree
  Power4  ORNL   32    1300   5.2    2.3     0.44       0.13      0.025      7.0      Fat-tree
  Altix   ORNL   2     1500   6.0    6.4     1.1        0.40      0.067      2.8      Fat-tree
  ES      ESC    8     500    8.0    32.0    4.0        1.5       0.19       5.6      Crossbar
  X1      ORNL   4     800    12.8   34.1    2.7        6.3       0.088      7.3      2D-torus

- Custom vector architectures have:
  - High memory bandwidth relative to peak
  - Superior interconnects: latency, point-to-point, and bisection bandwidth
- Another key balance point is I/O performance:
  - Seaborg I/O: 16 GPFS servers, each with 32 GB main memory (for caching & metadata); I/O uses the switch fabric, sharing bandwidth with message-passing traffic
  - ES I/O: each group of 16 nodes has a pool of RAID disks attached via a Fibre Channel switch (each node has a separate filesystem)

5 Previous ES visit
- Tremendous potential of vector architectures: 4 codes ran faster than ever before
- Vector systems allow resolutions not possible with scalar architectures (regardless of # procs)
- Opportunity to perform scientific runs at unprecedented scale
- The evaluation codes contain sufficient regularity in computation for high vector performance
- However, none of the tested codes contained significant I/O requirements

  Code       % peak (P=64)                    Speedup ES vs. (P=Max avail)
             Pwr3  Pwr4  Altix  ES    X1      Pwr3  Pwr4  Altix  X1
  LBMHD      7%    5%    11%    58%   37%     30.6  15.3  7.2    1.5
  CACTUS     6%    11%   7%     34%   6%      45.0  5.1   6.4    4.0
  GTC        9%    6%    5%     20%   11%     9.4   4.3   4.1    1.1
  PARATEC    57%   33%   54%    58%   20%     8.2   3.9   1.4    3.9
  Average                                     23.3  7.2   4.8    2.6

6 The Cosmic Microwave Background
- The CMB is a snapshot of the Universe when it first became neutral, 400,000 years after the Big Bang.
- After the Big Bang, the expansion of space cooled the Universe sufficiently for charged electrons and protons to combine.
- Cosmic: primordial photons filling all of space.
- Microwave: redshifted by the expansion of the Universe from 3000 K to 3 K.
- Background: coming from "behind" all astrophysical sources.

7 CMB Science
The CMB is a unique probe of the very early Universe. Tiny fluctuations in its temperature & polarization encode:
- the fundamental parameters of cosmology: Universe geometry, expansion rate, number of neutrino species, ionization history, dark matter, cosmological constant
- ultra-high-energy physics beyond the Standard Model

8 CMB Data Analysis
CMB analysis moves
- from the time domain (observations), O(10^12)
- to the pixel domain (maps), O(10^8)
- to the multipole domain (power spectra), O(10^4)
calculating the compressed data and their reduced error bars at each step.

9 MADCAP: Performance
- Porting: ScaLAPACK, plus a rewrite of the Legendre polynomial recursion so that large batches are computed in the inner loop (see the sketch below)
- Original ES visit: only partially ported, due to the code's requirement of a global file system
  - Could not meet the minimum parallelization and vectorization thresholds for the ES
- All systems sustain a relatively low % of peak considering MADCAP's BLAS3 operations
- Detailed analysis presented at HiPC 2004
- Further work performed on MADbench to reduce I/O, remove system calls, and remove the global file system requirement
- New results collected from the recent ES visit in October 2004

  P    Power3           Power4           ES               X1
       Gflops/P  %peak  Gflops/P  %peak  Gflops/P  %peak  Gflops/P  %peak
  16   0.62      41%    1.5       29%    4.1       32%    2.2       27%
  64   0.54      36%    0.81      16%    1.9       23%    2.0       16%
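
The recursion over multipole order is inherently sequential, so the restructuring places a long loop over pixel-pair angles innermost, where it vectorizes. A minimal sketch (ours, not the MADCAP source) of such a batched Legendre recursion:

    /* Compute P_lmax(x[i]) for a batch of n angles x[i] = cos(theta_i),
       using (l+1) P_{l+1} = (2l+1) x P_l - l P_{l-1}. The l loop is
       sequential; the inner i loop is long and vectorizable. */
    void legendre_batch(int lmax, int n, const double *x,
                        double *restrict pl,    /* out: P_l(x),     length n */
                        double *restrict plm1)  /* out: P_{l-1}(x), length n */
    {
        for (int i = 0; i < n; i++) {       /* P_0 = 1, P_1 = x */
            plm1[i] = 1.0;
            pl[i]   = x[i];
        }
        for (int l = 1; l < lmax; l++) {
            for (int i = 0; i < n; i++) {   /* vector loop over the batch */
                double next = ((2*l + 1) * x[i] * pl[i] - l * plm1[i]) / (l + 1);
                plm1[i] = pl[i];
                pl[i]   = next;
            }
        }
    }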

10 IPM Overview
Integrated Performance Monitoring:
- portable, lightweight, scalable profiling
- fast hash method
- profiles MPI topology
- profiles code regions
- open source

Code regions are bracketed with MPI_Pcontrol calls:

    MPI_Pcontrol(1, "W");
    ...code...
    MPI_Pcontrol(-1, "W");

Sample output:

    ###########################################
    # IPMv0.7 :: csnode041 256 tasks ES/ESOS
    # madbench.x (completed) 10/27/04/14:45:56
    #
    #                  (sec)
    # 171.67  352.16  393.80
    # ...
    ###############################################
    # W
    #                  (sec)
    # 36.40  198.00  198.36
    #
    # call         [time]     %mpi  %wall
    # MPI_Reduce   2.395e+01  65.8  6.1
    # MPI_Recv     9.625e+00  26.4  2.4
    # MPI_Send     2.708e+00  7.4   0.7
    # MPI_Testall  7.310e-02  0.2   0.0
    # MPI_Isend    2.597e-02  0.1   0.0
    ###############################################
    ...
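
Because MPI_Pcontrol is part of the MPI standard, the application compiles against plain MPI and linking IPM activates the region profiling. A minimal self-contained sketch (the region name "W" and the reduction are just placeholders for real work):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Pcontrol(1, "W");              /* enter region "W" */
        double local = 1.0, sum = 0.0;     /* stand-in for real work */
        MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Pcontrol(-1, "W");             /* exit region "W" */

        MPI_Finalize();                    /* IPM report is emitted here */
        return 0;
    }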

11 MADbench
- Is a lightweight version of the MADCAP maximum-likelihood CMB power spectrum estimation code.
- Retains the operational complexity & integrated system requirements of the full science code.
- Has three basic steps: dSdC, invD & W.
- Out-of-core calculation: holds approximately 3 of the 50 matrices in memory.
- Is used for:
  - computer & file-system procurements
  - realistic scientific code benchmarking and optimization
  - architectural comparisons

12 dSdC
This step generates a set of N_b dense, symmetric N_p x N_p signal correlation derivative matrices dSdC_b by Legendre polynomial recursion. Each matrix is block-cyclic distributed over the 2D processor array with blocksize B. As each matrix is calculated, each processor writes its subset of the matrix elements to a unique file (see the sketch below). No inter-processor communication is required.
Flops: O(N_p^2)   Disk: 8 N_b N_p^2 bytes (primarily writing)
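
A minimal sketch (hypothetical file naming) of the dSdC output pattern: each MPI task dumps its local panel of each dSdC_b matrix to its own file, so neither communication nor a shared global file system is required.

    #include <mpi.h>
    #include <stdio.h>

    /* Write this task's local block-cyclic panel for bin `bin`. */
    void write_local_panel(int bin, const double *local, size_t nelem)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char fname[64];
        snprintf(fname, sizeof fname, "dSdC_b%03d_p%05d.dat", bin, rank);

        FILE *f = fopen(fname, "wb");      /* one unique file per task per bin */
        if (!f) { perror(fname); MPI_Abort(MPI_COMM_WORLD, 1); }
        fwrite(local, sizeof(double), nelem, f);   /* 8 bytes per element */
        fclose(f);
    }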

13 invD
This step generates the data correlation matrix D and inverts it. The dSdC_b matrices are read from disk one at a time and progressively accumulated to build the signal correlation matrix S. A diagonal white-noise correlation matrix N is added to S to give the data correlation matrix D, which is inverted using ScaLAPACK to give D^-1. Each processor writes its subset of the D^-1 matrix elements to a unique file.
Flops: O(N_p^3)   Disk: 8 N_b N_p^2 bytes (primarily reading)
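
Since D is a symmetric positive-definite correlation matrix, an in-place ScaLAPACK Cholesky factorization plus inversion is the natural kernel here; that MADCAP uses exactly these routines is our assumption, so treat this as a sketch.

    /* Fortran-style ScaLAPACK externs (standard C bindings). */
    extern void pdpotrf_(char *uplo, int *n, double *a, int *ia, int *ja,
                         int *desca, int *info);
    extern void pdpotri_(char *uplo, int *n, double *a, int *ia, int *ja,
                         int *desca, int *info);

    /* Invert the distributed N_p x N_p matrix D in place.
       Assumes D = S + N has already been accumulated, with the diagonal
       noise added, and is block-cyclic distributed with descriptor descD. */
    int invert_D(double *D, int *descD, int np)
    {
        int ia = 1, ja = 1, info = 0;
        char uplo = 'L';

        pdpotrf_(&uplo, &np, D, &ia, &ja, descD, &info);     /* D = L L^T */
        if (info == 0)
            pdpotri_(&uplo, &np, D, &ia, &ja, descD, &info); /* D := D^-1 */
        return info;   /* 0 on success (LAPACK convention) */
    }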

14 W
This step multiplies each dSdC_b matrix by D^-1 to form W_b and derives a Newton-Raphson iterative step from this. Since they are independent, these matrix multiplications can be carried out gang-parallel across N_g gangs of processors. Each dSdC_b matrix is read in by all processors and then redistributed to the target gang. When all gangs have been given a matrix, they all perform their multiplications simultaneously.
Flops: O(N_p^3)   Disk: 8 N_b N_p^2 bytes (primarily reading)
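
A minimal sketch (names and the bin-to-gang assignment are hypothetical) of the gang organization: the global communicator is split into N_g sub-communicators, and each gang multiplies its assigned dSdC_b by D^-1 independently of the others.

    #include <mpi.h>

    /* Split MPI_COMM_WORLD into n_gangs equal gangs of contiguous ranks. */
    MPI_Comm make_gang_comm(int n_gangs, int *gang_id)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int gang_size = size / n_gangs;  /* assume size divisible by n_gangs */
        *gang_id = rank / gang_size;

        MPI_Comm gang;
        MPI_Comm_split(MPI_COMM_WORLD, *gang_id, rank, &gang);
        return gang;  /* e.g. gang g then handles bins with b % n_gangs == g */
    }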

15 Parameters
- N_p: number of pixels (matrix size)
- N_b: number of bins (matrix count)
- N_g: number of gangs of processors
- B: ScaLAPACK blocksize
- MOD_IO: I/O concurrency control (only 1 in MOD_IO processors does I/O simultaneously; see the sketch below)

Running on P processors requires:
- 3 x 8 x N_p^2 bytes of memory per gang
- N_b x 8 x N_p^2 bytes & N_b x P inodes of disk
- N_b a multiple of N_g, to load-balance the gangs

B & MOD_IO are architecture-specific optimizations.
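
A minimal sketch (hypothetical helper) of the MOD_IO throttle: ranks take turns in MOD_IO waves, so at any moment only every MOD_IO-th processor touches the disk, limiting file-system contention.

    #include <mpi.h>

    /* Run do_io() on all ranks, but on at most 1 in mod_io at a time. */
    void throttled_io(int mod_io, void (*do_io)(void))
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int phase = 0; phase < mod_io; phase++) {
            if (rank % mod_io == phase)
                do_io();                     /* this rank's turn */
            MPI_Barrier(MPI_COMM_WORLD);     /* wait before the next wave */
        }
    }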

16 dSdC performance
- ES shows constant I/O performance (independent disks)
  - Significantly faster computation (30X) due to high memory bandwidth
  - Overall only 2.6X faster than Power3, due to I/O overhead
- Power3 has faster write I/O, until GPFS contention at P=1024

17 invD performance
- I/O remains relatively constant, while MPI overhead and computation grow
- Seaborg I/O reads faster than the ES
- Overall, the ES is only 2.3X faster

18 W performance
- Multi-gang runs significantly reduce MPI overhead (4.8X on ES, 3.3X on Seaborg)
- MPI and CALC grow with the number of processors
- I/O is a trivial part of the W calculation
- Overall, the ES is 7X faster

19 Performance overview
- Overall, the ES is 5.6X faster, with a slightly higher % of peak compared with Seaborg at P=1024
- At P=256, Seaborg shows a higher % of peak, due to its ratio of I/O performance to peak flop rate
- Although the I/O cost remains relatively high, both systems achieve over 50% of peak

20 Overview
- The new version of MADbench successfully reduced I/O overhead and removed the global file system requirement
- This allowed ES runs up to 1024 processors, achieving over 50% of peak
  - Compared with only 23% of peak on 64 processors from the first visit
- Results show that I/O has more effect on the ES than on Seaborg, due to the ratio between I/O performance and peak ALU speed
- Demonstrated IPM's ability to measure MPI overhead on a variety of architectures without the need to recompile, at trivial runtime overhead (1-2%)
- Continuing to study the complex interplay between architecture, interconnect, and I/O
- Currently performing experiments on Columbia and Phoenix
- MADbench and IPM are being prepared for public distribution
- Future CMB analysis will require sparse methods due to the size of the data sets; these are potentially at odds with vector architectures

