Performance Modeling of Unstructured Mesh Particle Transport Computations
Mark M. Mathis and Darren J. Kerbyson
Performance and Architectures Laboratory (PAL)
Los Alamos National Laboratory

Performance Modeling at Los Alamos
Model (used across the system life cycle):
Design: system unavailable for measurement
  – What will the performance of IBM PERCS be in 2010?
Implementation: small prototype available
  – What will be the performance of a 20 Tflop system?
Procurement: which system should I buy?
  – Systems unavailable to measure (e.g. ASCI Purple)
Installation: is the machine working?
  – Performance should be as expected (e.g. ASCI Q)
Optimization: improvements in S/W or H/W
  – Quantify impacts prior to implementation
Maintenance: system updates
  – Quantify impact on performance

Modeling Approach
Application-centric modeling:
  – capture the key processing activities and effects of an application
Effectiveness of a model depends upon the ability to capture application performance behavior.
Our modeling approach is based upon a detailed understanding of the application performance when:
  – system/configuration, and/or
  – data-set
changes are applied.
Developed models are parameterized, hence:
  – not restricted to a “performance-point”
  – permit scalability analysis
  – allow investigation of calculation behavior

Modeled Workload (so far)
Sweep3D: S_N transport kernel on structured grids
SAGE: hydro on Cartesian AMR grids
MCNP: Monte Carlo N-Particle
Tycho: S_N transport on unstructured grids
UMT2K: S_N transport on unstructured grids
Partisn: S_N transport (validated, March '04)
POP: Parallel Ocean Program (almost finished)
Other: Random Access (initial model)
We will focus on Tycho and UMT2K.

Structured Meshes (Grids)
Programming point of view:
  – grid cell indexing is typically done by implicit array indexing
Performance point of view:
  – constant time per grid cell
  – typically decomposed into columns
    » this partitioning is specific to this problem and is an effective strategy due to the directional dependences
    » no. of neighbors in partitioning is uniform across sub-grids
  – a “sweep” for each direction of particle travel traverses the grid
    » consecutive sweeps can be “pipelined”
    » octants (in 3-D) are processed sequentially

Unstructured Meshes
Programming point of view:
  – unstructured meshes need complex data structures, as indexing is typically not done by implicit array indexing
Performance point of view:
  – the time per grid point may be longer, but decomposition is similar
  – mesh typically decomposed using a mesh partitioner, e.g. Metis
    » typically the mesh is partitioned to minimize boundary surfaces, hence minimize communication traffic
    » no. of neighbors in partitioning will not be uniform across sub-grids

3-D Data Decomposition
3-D spatial grid partitioned using Metis:
  – P partitions (1 per PE)
  – cells per partition = V/P (V = grid volume)
Partitioning results in:
  – ~equal work per sub-grid
  – neighbor sub-grids not necessarily near in terms of processor arrangement
For modeling purposes, decomposition is approximated to that of a dense grid:
  – ideally each processor would have 6 neighbors
  – but #neighbors can be different and hence increase communications
2-D example (figure): 241 cells, 16 partitions
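The dense-grid approximation can be illustrated with a short sketch. It computes the cells per partition (V/P) and, assuming each sub-grid is a cube of equal-sized cells, the number of cells on each of its 6 boundary faces. The cubic sub-grid and the function name `dense_grid_estimate` are assumptions made here for illustration; the slides state only V/P and the ideal neighbor count of 6. The example values are the 43,012-cell mesh on 16 PEs used later in the deck.

```python
def dense_grid_estimate(volume_cells: int, num_pes: int) -> dict:
    """Approximate one Metis partition by a cubic dense sub-grid (assumption)."""
    cells_per_pe = volume_cells / num_pes      # V / P, from the slide
    side = cells_per_pe ** (1.0 / 3.0)         # assumed cube edge length, in cells
    return {
        "cells_per_pe": cells_per_pe,
        "neighbors": 6,                        # ideal dense-grid neighbor count
        "cells_per_face": side ** 2,           # assumed boundary size per neighbor
    }


estimate = dense_grid_estimate(43_012, 16)
print(estimate)
```

A real partition can have more or fewer than 6 neighbors, so the face size here is only a first-order estimate of the boundary data exchanged per neighbor.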

S_N Transport on Unstructured Meshes
S_N transport (similar to Sweep3D) but on unstructured meshes
  – Tycho (research code from Los Alamos) and UMT2K (benchmark production code from Livermore) are two examples
Sweeps originate at the edge of the mesh and propagate to the opposite side
  – all sweep directions start at the same time, i.e. there is no dependence between octants (unlike Sweep3D, where octants are processed sequentially)
The mesh is usually a fixed geometry (typically tetrahedra), decomposed using a mesh partitioner
  – strong scaling (same problem, but parallelism gives a faster time to solution)
Metis is used to partition the mesh in both cases
  – ~equalizes #cells across partitions (load-balances computation)
  – minimizes sub-grid boundary surfaces (minimizes communication traffic)

Processing Flow
Dependency between cells in the direction of the wavefront
  – a pipeline (in 3-D)
  – unstructured meshes result in ‘interesting shaped’ wavefronts
All wavefronts commence at the start of an iteration
Unit of work is the cell-angle pair
PEs keep an active list of available work
3 stages per step:
  – process up to a max no. of cell-angle pairs per step
  – send boundary results to neighbors (0 or more neighbors)
  – receive new boundary data from neighbors (0 or more neighbors)

for iterations
  for steps
    for work available (max: max_cells_per_step)
      process element
    for neighbor PEs
      send boundary to PE, if any
    for neighbor PEs
      recv boundary from PE, if any
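The three-stage step loop can be mimicked with a toy simulation. This is only a sketch under strong simplifying assumptions that are not in the slides: the mesh is a 1-D chain of cells split evenly across PEs, there are just two angles (left-to-right and right-to-left), and boundary data sent in one step becomes available at the start of the next. The function name `simulate_sweep` is hypothetical.

```python
from collections import deque


def simulate_sweep(num_pes, cells_per_pe, max_cells_per_step):
    """Toy 1-D sweep; returns cell-angle pairs processed per step, per PE."""
    total_cells = num_pes * cells_per_pe
    active = [deque() for _ in range(num_pes)]   # active work list per PE
    active[0].append((0, +1))                    # both wavefronts start at the
    active[-1].append((total_cells - 1, -1))     # mesh edges, at the same time
    remaining = 2 * total_cells                  # cell-angle pairs still to do
    processed_per_step = []
    while remaining > 0:
        outbox, counts = [], []
        for pe in range(num_pes):
            done = 0
            # Stage 1: process up to max_cells_per_step available pairs.
            while active[pe] and done < max_cells_per_step:
                cell, angle = active[pe].popleft()
                done += 1
                nxt = cell + angle               # downstream cell for this angle
                if 0 <= nxt < total_cells:
                    owner = nxt // cells_per_pe
                    if owner == pe:
                        active[pe].append((nxt, angle))
                    else:
                        outbox.append((owner, (nxt, angle)))  # Stage 2: send
            counts.append(done)
        for owner, item in outbox:               # Stage 3: receive
            active[owner].append(item)
        remaining -= sum(counts)
        processed_per_step.append(counts)
    return processed_per_step


for step, counts in enumerate(simulate_sweep(4, 8, 4), start=1):
    print(step, counts)
```

Even in this toy case the middle PEs sit idle until a wavefront reaches them, which is the effect the per-step work counts are meant to expose.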

Computation: Cells processed per step
Processing per step per PE depends on available work
  – e.g. a middle sub-grid will wait while wavefronts reach it
  – work varies from step to step
A key quantity is the PCE (Parallel Computational Efficiency)
  – represents the fraction of the maximum cells processed across all steps
Example: 43,012-cell mesh on 16 PEs
  – max cells processed per step = 512
  – total number of steps = 76
The PCE value is specific to a mesh and partitioning. It can either be measured or assumed (typically 70-90%).
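One way to read the PCE definition is as the average number of cells processed per step divided by the per-step maximum. The sketch below follows that reading; whether the counts are taken for one PE or aggregated across PEs is not stated in the slides, and the trace values are made up for illustration (they are not measurements from the 43,012-cell example).

```python
def parallel_computational_efficiency(processed_per_step, max_cells_per_step):
    """PCE = mean cells processed per step / maximum cells allowed per step."""
    if not processed_per_step:
        raise ValueError("need at least one step")
    mean_processed = sum(processed_per_step) / len(processed_per_step)
    return mean_processed / max_cells_per_step


# Hypothetical per-step counts: ramp-up, steady state, then drain.
trace = [128, 256, 512, 512, 512, 384, 256, 0]
pce = parallel_computational_efficiency(trace, 512)
print(f"PCE = {pce:.3f}")
```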

Model
The overall form of the model is:
  Total time = Computation + Communication
where:
  P             the no. of PEs
  V             the grid volume
  G             the no. of energy groups
  |N_c(S,P)|    the no. of communications in step S on PE P
  N_c(S,P,C)    the destination PE of communication C in step S on PE P
  N_s(S,P,C)    the size of communication C in step S on PE P
  T_e(x)        the time to process 1 element given x elements mapped to a PE
  T_c(x,y)      the time taken for a communication to PE x of size y

Model (cont.)
Addition of computation and communication ignores overlap (assumed small).
Both computation and communication are modeled as the max. time taken over all processors, i.e.
  – max work over all processors for each step
  – max communications over all processors for each step
This will tend to give an over-estimate of the time taken
  – mitigated by assuming a constant number of neighbors …
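The structure described on this slide (per-step maxima over all PEs, with computation and communication simply added) can be sketched as follows. This is not the authors' exact equation, which is not reproduced in the transcript; it is a minimal illustration assuming a constant per-element time `t_e` and a caller-supplied communication cost function `t_c(dest, size)`. All names and the example numbers are hypothetical.

```python
def modeled_time(work, comms, t_e, t_c):
    """Sum over steps of (max compute over PEs) + (max communication over PEs).

    work[s][p]  : elements processed in step s on PE p
    comms[s][p] : list of (dest_pe, size) messages sent in step s by PE p
    t_e         : time to process one element (held constant here, an assumption)
    t_c         : function (dest_pe, size) -> time for one communication
    """
    total = 0.0
    for step_work, step_comms in zip(work, comms):
        compute = max(n * t_e for n in step_work)
        communicate = max(
            sum(t_c(dest, size) for dest, size in msgs) for msgs in step_comms
        )
        total += compute + communicate   # no overlap assumed
    return total


# Two PEs, two steps; times in microseconds (illustrative values only).
work = [[4, 0], [2, 4]]
comms = [[[(1, 100)], []], [[], [(0, 100)]]]
print(modeled_time(work, comms, t_e=2.0, t_c=lambda dest, size: 10.0 + 0.01 * size))
```

Taking the maximum separately for computation and communication in each step is what makes the estimate pessimistic: the slowest computing PE and the slowest communicating PE need not be the same one.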

Single Processor Performance
Fixed mesh geometry is typical (strong scaling)
  – as #PEs increases, work per PE decreases
  – greater cache utilization possible
  – time per cell will depend on the #cells per PE
Figure panels: Tycho (1 group); UMT2K (1-3 groups)

System Parameters
AlphaServer ES40 833MHz
  – Tycho single processor performance, T_e(E_p, G) (µs)
      E_p < 800:          T_e = 3.7
      800 ≤ E_p ≤ 16K:    T_e = 1.8 ln(E_p) - 8.4
      E_p > 16K:          T_e = 9.2
  – Latency and bandwidth, L_c(S) (µs) and 1/B_c(S) (ns)
      S < 64B:            L_c = 9.28    1/B_c = 0.0
      64 ≤ S ≤ 512:       L_c = 9.00    1/B_c = 22.7
      S > 512:            L_c = 21.4    1/B_c = 11.2
Itanium-2 1GHz cluster
  – UMT2K single processor performance, T_e(E_p, G) (µs)
      E_p < 10K:          T_e = 0.105 * (3 + G)
      10K ≤ E_p ≤ 50K:    T_e = (0.022 ln(E_p) - 0.094) * (3 + G)
      E_p > 50K:          T_e = 0.139 * (3 + G)
  – Latency and bandwidth, L_c(S) (µs) and 1/B_c(S) (ns)
      S < 64B:            L_c = 6.4     1/B_c = 0.0
      64 ≤ S ≤ 512:       L_c = 8.21    1/B_c = 25.5
      S > 512:            L_c = 17.1    1/B_c = 13.7
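The piecewise parameters above translate directly into small lookup functions. The sketch below makes three assumptions that the slide does not spell out: "K" is read as 1,000, 1/B_c is read as nanoseconds per byte, and the communication time is combined as latency plus size times inverse bandwidth. Function and constant names are hypothetical.

```python
import math


def te_tycho_es40(ep):
    """Tycho time per element (µs) on the ES40, for ep elements per PE."""
    if ep < 800:
        return 3.7
    if ep <= 16_000:                       # "16K" read as 16,000 (assumption)
        return 1.8 * math.log(ep) - 8.4
    return 9.2


def te_umt2k_itanium2(ep, groups):
    """UMT2K time per element (µs) on the Itanium-2 cluster."""
    if ep < 10_000:
        base = 0.105
    elif ep <= 50_000:
        base = 0.022 * math.log(ep) - 0.094
    else:
        base = 0.139
    return base * (3 + groups)


# (latency in µs, inverse bandwidth in ns per byte) for S < 64, 64..512, > 512
ES40_NET = ((9.28, 0.0), (9.00, 22.7), (21.4, 11.2))
ITANIUM2_NET = ((6.4, 0.0), (8.21, 25.5), (17.1, 13.7))


def comm_time_us(size_bytes, net):
    """Assumed form: T_c = L_c(S) + S * (1/B_c(S)), returned in µs."""
    if size_bytes < 64:
        latency, inv_bw = net[0]
    elif size_bytes <= 512:
        latency, inv_bw = net[1]
    else:
        latency, inv_bw = net[2]
    return latency + size_bytes * inv_bw / 1000.0   # ns -> µs


print(te_tycho_es40(2_688), te_umt2k_itanium2(20_000, 1))
print(comm_time_us(1024, ES40_NET), comm_time_us(1024, ITANIUM2_NET))
```

The logarithmic middle band joins the two constant bands approximately rather than exactly, so small jumps at the band edges are expected.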

Tycho Validation
4 meshes used to validate the model for Tycho on an AlphaServer ES40 system, 256 processors

Mesh        # Elements   Description                                                 Error (%)
Nneut       43,012       Neutron well-logging tool and surrounding media             13.48
Silc        51,963       Computer chip and packaging for radiation shielding         12.07
Reac        165,530      Reactor pressure vessel and surrounding cavity structures   7.44
Con_test5   168,356      Cube divided into approximately equal-sized elements        8.07

Tycho Validation (cont.)

UMT2K Validation
6 cases used to validate the model on a 64-processor Itanium-2 cluster

Case   Mesh    # Cells     Description                                                    Error (%)
0      MMesh   680,400     Medium mesh, 4950 cells/layer, 3 layers, 1 energy group        12.53
1      SMesh   265,680     Small mesh, 398 cells/layer, 15 layers, 1 energy group         8.33
2      SMesh   265,680     Small mesh, 398 cells/layer, 15 layers, 2 energy groups        8.98
3      SMesh   53,136      Small mesh, 398 cells/layer, 3 layers, 1 energy group          11.41
4      Smesh   265,680     Small mesh, 398 cells/layer, 15 layers, 3 energy groups        8.87
5      Mmesh   3,402,000   Large mesh, 4950 cells/layer, 15 layers, 1 energy group        9.06

UMT2K Validation (cont.)

Summary
One model works for two different implementations
  – other unstructured mesh codes are under development
Strong scaling:
  – single processor time needs to be encapsulated into a separate model
Unstructured mesh can be approximated to a dense mesh
  – general characteristics of dense meshes may be used, e.g.:
    » number of neighbors and size of neighbor boundaries
  – acceptable error margin
Relies on PCE input (average of cells processed per stage)
  – this is specific to a mesh and needs to be measured or otherwise stated
Part of on-going work to develop accurate performance models of the ASCI workload
http://www.c3.lanl.gov/par_arch
