VTF Applications Performance and Scalability
Sharon Brunett, CACR/Caltech
ASCI Site Review, October 28-29, 2003
ASCI Platform Specifics
LLNL's IBM SP3 (frost)
– 65-node SMP, 375 MHz Power3 Nighthawk-2, 16 CPUs/node
– 16 GB memory per node
– ~20 TB global parallel file system
– SP Switch2 (Colony) interconnect, 2 GB/sec bi-directional node-to-node bandwidth
LANL's HP/Compaq AlphaServer ES45 (QSC)
– 256-node SMP, 1.25 GHz Alpha EV6, 4 CPUs/node
– 16 GB memory per node
– ~12 TB global file system
– Quadrics (QsNet) interconnect, 2 μs latency, 300 MB/sec bandwidth
(A back-of-the-envelope peak floating-point rate sketch for both machines follows below.)
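The percent-of-peak figures quoted later in the talk are measured against theoretical peak rates that can be derived from the clock speeds and CPU counts above. A minimal sketch of that derivation, assuming 4 flops/cycle for the Power3 (two FMA units) and 2 flops/cycle for the Alpha EV6; the per-cycle figures are assumptions, not stated on the slide.

```c
/* Back-of-the-envelope peak FLOP rates for the two platforms above.
 * The flops-per-cycle figures are assumptions (Power3: 2 FMA units ->
 * 4 flops/cycle; Alpha EV6x: 2 flops/cycle), not taken from the slides. */
#include <stdio.h>

static void peak(const char *name, double ghz, double flops_per_cycle,
                 int cpus_per_node, int nodes)
{
    double per_cpu  = ghz * flops_per_cycle;    /* GFLOP/s per CPU  */
    double per_node = per_cpu * cpus_per_node;  /* GFLOP/s per node */
    double total    = per_node * nodes;         /* GFLOP/s machine  */
    printf("%-6s %5.2f GF/s per CPU, %6.1f GF/s per node, %7.1f GF/s total\n",
           name, per_cpu, per_node, total);
}

int main(void)
{
    peak("frost", 0.375, 4.0, 16,  65);   /* IBM SP3, Power3 Nighthawk-2 */
    peak("QSC",   1.25,  2.0,  4, 256);   /* HP/Compaq AlphaServer ES45  */
    return 0;
}
```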
Multiscale Polycrystal Studies
Quantitative assessment of microstructural effects on macroscopic material response through the computation of full-field solutions of polycrystals
– Inhomogeneous plastic deformation fields
– Grain-boundary effects: stress concentration, dislocation pile-up, constraint-induced multislip
– Size dependence: (inverse) Hall-Petch effect
Resolve (as opposed to model) mesoscale behavior by exploiting the power of high-performance computing
Enable full-scale simulation of engineering systems incorporating micromechanical effects
Mesh Generation
In-grain subdivision behavior can be simulated in both single crystals and polycrystals
– Texture simulation results agree well with experimental results
The mesh generation method preserves the topology of individual grain shapes
– Enables effective interactions between grains
Increasing the grain count in polycrystals gives a more stable mechanical response
(Figure: a single grain corresponds to a single cell in the crystal mesh.)
1.5-Million-Element, 1241-Grain Multiscale Polycrystal Simulation
Simulation carried out on 1024 processors of LLNL's IBM SP3, frost
Multiscale Polycrystal Performance
Aggregate parallel performance
– LANL's QSC
Floating-point operations: 10.67% of peak
Integer operations: 15.39% of peak
Memory operations: 22.08% of peak
DCPI hardware counters used to collect data; the Qopcounter tool used to analyze the DCPI database
– LLNL's frost
L1 cache hit rate: 98% (load/store instructions executed without main-memory access)
Load/Store Unit idle: 36%
Floating-point operations: 4.47% of peak
hpmcount tool used to count hardware events during program execution
(A percent-of-peak sketch follows below.)
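A percent-of-peak number like the 4.47% quoted for frost can be reproduced from a flop count and a wall-clock time such as hpmcount reports. The counter value, elapsed time, and per-CPU peak below are placeholders for illustration, not measurements from these runs.

```c
/* Sketch of deriving "% of peak" from a hardware-counter flop total plus
 * elapsed time.  All numbers here are placeholders, not measured values. */
#include <stdio.h>

int main(void)
{
    double flops_counted = 1.2e12;   /* e.g. counted FPU operations, one task */
    double wall_seconds  = 1800.0;   /* elapsed time for the same task        */
    double peak_gflops   = 1.5;      /* assumed Power3 peak per CPU           */

    double achieved = flops_counted / wall_seconds / 1e9;   /* GFLOP/s */
    printf("achieved %.3f GFLOP/s = %.2f%% of peak\n",
           achieved, 100.0 * achieved / peak_gflops);
    return 0;
}
```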
Multiscale Polycrystal Performance II
MPI routines can consume ~30% of runtime for large runs on frost
– Workload imbalance as grains are distributed across nodes
– MPI_Waitall at every step dominates communication time: nearest-neighbor sends take longer from nodes with computationally heavy grains (see the exchange sketch below)
Routines taking the most CPU time on QSC
– resolved_fcc_cuitino 18.85%
– upslip_fcc_cuitino_explicit 11.74%
– setafcc 9.16%
– matvec 8.5%
– ~50% of execution time spent in these 4 routines
Room for performance improvement with better load balancing and routine-level optimization
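The MPI_Waitall behavior described above is characteristic of a nonblocking nearest-neighbor exchange: a rank that finishes its local grain updates early still blocks in MPI_Waitall until slower neighbors post their sends, so load imbalance surfaces as communication time. A minimal sketch of that pattern, using an illustrative 1-D ring and message size rather than the actual VTF communication structure.

```c
/* Minimal nonblocking nearest-neighbor exchange finished by MPI_Waitall.
 * The 1-D ring topology and message size are illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 100000

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    double *sendbuf = malloc(N * sizeof *sendbuf);
    double *recvbuf = malloc(2 * N * sizeof *recvbuf);
    for (int i = 0; i < N; i++) sendbuf[i] = rank;

    MPI_Request req[4];
    MPI_Irecv(recvbuf,     N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(recvbuf + N, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(sendbuf,     N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(sendbuf,     N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

    /* ...local work on interior elements would overlap with communication... */

    double t0 = MPI_Wtime();
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);  /* stalls if a neighbor is late */
    double wait = MPI_Wtime() - t0;

    printf("rank %d spent %.3f s in MPI_Waitall\n", rank, wait);
    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```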
Multiscale Polycrystal Scaling on LLNL's IBM SP3, Frost
(Scaling plot: curves labeled by number of elements.)
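Scaling curves like the one above are commonly summarized as speedup and parallel efficiency relative to the smallest run. A minimal sketch of that calculation; the processor counts and timings below are placeholders, not the measured frost data.

```c
/* Sketch of summarizing a scaling curve as speedup and parallel efficiency
 * relative to the smallest run.  Counts and timings are placeholders. */
#include <stdio.h>

int main(void)
{
    int    procs[]  = {   64,   128,   256,  512, 1024 };
    double time_s[] = { 4000., 2100., 1150., 650., 400. };  /* illustrative */
    int n = sizeof procs / sizeof procs[0];

    for (int i = 0; i < n; i++) {
        double speedup    = time_s[0] / time_s[i];
        double efficiency = speedup * procs[0] / procs[i];
        printf("%5d procs: speedup %5.2f, efficiency %5.1f%%\n",
               procs[i], speedup, 100.0 * efficiency);
    }
    return 0;
}
```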
Multiscale Polycrystal Scaling on LANL's HP/Compaq, QSC
(Scaling plot: curves labeled by number of elements.)
Scaling for Polycrystalline Copper in a Shear-Compression Specimen Configuration
(Scaling plot on LANL's HP/Compaq QSC system: curves labeled by number of elements.)
3D Converging Shock Simulations in a Wedge
1024-processor run of a converging shock on ASCI frost
– The interface is nominally a 2D ellipse perturbed with a prescribed spectrum and randomized phases
– The 2D elliptical interface is computed using local shock polar analysis to yield a perfectly circular transmitted shock
Resolution: 2000x400x400, with over 1 TB of data generated (see the output-size sketch below)
(Figures: density and pressure fields.)
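The quoted output volume is plausible from simple arithmetic on the grid size. A rough sketch, assuming six double-precision fields per cell and on the order of 80 snapshots written; both of those numbers are assumptions for illustration, not taken from the slide.

```c
/* Rough estimate of the output volume for a 2000x400x400 run.  The number
 * of fields per cell and the snapshot count are assumptions. */
#include <stdio.h>

int main(void)
{
    double cells     = 2000.0 * 400.0 * 400.0;  /* 3.2e8 cells               */
    int    fields    = 6;                       /* assumed fields per cell   */
    int    bytes     = 8;                       /* double precision          */
    int    snapshots = 80;                      /* assumed output frequency  */

    double per_snapshot = cells * fields * bytes;   /* bytes per dump */
    double total        = per_snapshot * snapshots;

    printf("%.1f GB per snapshot, %.2f TB for %d snapshots\n",
           per_snapshot / 1e9, total / 1e12, snapshots);
    return 0;
}
```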
Density Field in a 3D Wedge
The transmitted shock front appears to be stable, while the gas interface is Richtmyer-Meshkov unstable.
Simulation run on 1024 processors of LLNL's IBM SP3, frost, with a 2000x400x400 initial grid.
Wedge3D Performance on LLNL's IBM SP3, Frost
Aggregate parallel performance for a 1400x280x280 grid
– Floating-point operations: 5.8 to 10% of peak, depending on the node
– hpmcount tool used to count hardware events during program execution
Most time-consuming communication calls: MPI_Wait() and MPI_Allreduce()
– Account for 3 to 30% of runtime on a 128-way run (175x70x70 grid per processor; see the decomposition sketch below)
– Occasional high MPI time on a few nodes appears to be caused by system daemons competing for resources
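The 175x70x70 per-processor grid follows from splitting the 1400x280x280 domain across 128 processors as an 8 x 4 x 4 block decomposition. A minimal sketch of that arithmetic; the decomposition shape is an assumption consistent with the quoted numbers, not necessarily the one the code uses.

```c
/* Per-rank block size for a 1400x280x280 grid split over 128 ranks.
 * The 8 x 4 x 4 decomposition is an assumption consistent with the
 * 175x70x70 per-processor grid quoted above. */
#include <stdio.h>

int main(void)
{
    int grid[3]  = { 1400, 280, 280 };
    int ranks[3] = {    8,   4,   4 };   /* 8 * 4 * 4 = 128 processors */

    printf("per-rank block: %d x %d x %d cells\n",
           grid[0] / ranks[0], grid[1] / ranks[1], grid[2] / ranks[2]);
    return 0;
}
```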
Wedge3D Scaling on LLNL's IBM SP3, Frost
(Scaling plot: runs labeled by grid size X x Y x Z.)
Fragmentation 2D Scaling on LANL's HP/Compaq, QSC
(Scaling plot: curves for different levels of subdivision, 450K to 1.1M elements; cases shown grow from 85K to 1.1M elements, from 61K to 915K elements, and from 450K elements. A subdivision-growth sketch follows below.)
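The element-count ranges in the legend are roughly consistent with uniform 2D refinement, where each level of subdivision multiplies the element count by about four (61K grows to ~976K after two levels, versus the 915K shown). A rough sketch of that growth; the assumption of a uniform 4x factor per level is illustrative, and the actual refinement need not be uniform.

```c
/* Approximate element growth under uniform 2-D subdivision, where each
 * element splits into four per level.  The 4x factor is an assumption;
 * actual counts on the slide differ slightly. */
#include <stdio.h>

int main(void)
{
    long base_elements[] = { 61000, 85000, 450000 };
    for (int b = 0; b < 3; b++) {
        long n = base_elements[b];
        for (int level = 0; level <= 2; level++) {
            printf("base %7ld, level %d: ~%8ld elements\n",
                   base_elements[b], level, n);
            n *= 4;
        }
    }
    return 0;
}
```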
Crack Patterns in the Configuration Used for the Scalability Studies on QSC
Fragmentation 2D Performance on LANL's HP/Compaq, QSC
Procedures with the highest CPU-cycle consumption (16-processor run with 2 levels of subdivision, 60K elements; dcpiprof tool used to profile the run)
– element_driver 14.9%
– assemble 13.9%
– NewNeohookean 8.12%
Problems processing the DCPI database FLOP rates for large runs
– Reported to LANL support
– Small runs yield 3% of FLOP peak, and only ~10% of time is spent in fragmentation routines!
Much room for improvement in our I/O performance when dumping to the parallel file system (/scratch[1,2])