1 Parallel Code Choices

2 Where We Stand?
–ShakeOut-D 1-Hz Vs=250 m/s benchmark runs on Kraken-XT5 and Ranger at full machine scale; Hercules successful test run on 16k Kraken-XT4 cores with Vs=200 m/s
–Multiple AWP-Olsen ShakeOut-D 1-Hz runs on NICS Kraken-XT5 using 64k processor cores, wall clock time less than 5 hours; SORD runs using 16k Ranger cores
–Milestone: passing the 100k-core mark! Recent successful benchmark runs on the DOE ANL BG/P using up to 131,072 cores

3 SCEC capability runs update

4

5
Timeline:
–2009: Ranger, Kraken, BG/P
–2011: GPU/Cell, Blue Waters, Hybrid, NUMA, CAF
–2013…: future architectures, FPGA, Chapel, cloud computing …
Programming models:
–Current parallel programming model: message passing (C, C++, Fortran plus MPI communication), current compilation technology
–Transition model: PGAS (UPC, CAF, Titanium)
–High-productivity models: HPCS (X-10, Chapel), future compilation technology
Tiers:
–Tier 0: PFlops class
–Tier 1: TG/DOE supercomputer centers, grid computing
–Tier 2: regional medium supercomputers (Ranger)
–Tier 3: high-performance workstations
HPC initiative (short-term, medium-term, long-term): pick up new codes; EGM 3-10 Hz; contribute to architecture design; adaption; EGM 2-Hz, Vs=200 m/s; SO-1Hz, Vs=200 m/s; data integration. HPC Initiative Ph.D. programs?

6 Parallel FD and FE Codes

Code        | Split-node dynamic rupture | Wave propagation | Surface topography | Complex geometry | Material nonlinearity | Absorbing boundaries
FD-Olsen    | ✔ | ✔ |   |   |   | PML
FD-Rob      |   | ✔ |   |   |   | PML
FD-SORD     | ✔ | ✔ | ✔ | ✔ |   | PML
FE-Hercules |   | ✔ |   | ✔ |   | Stacey
FE-MaFE*    | ✔ | ✔ | ✔ | ✔ (arbitrary) | ✔ | PML
FE-DG*      | ✔ | ✔ | ✔ | ✔ (arbitrary) |   | -

7 Proposed Plan of Work: Automatic End-to-End Approach
–Automated rule-based workflow
–Highly configurable and customizable
–Reliable and robust
–Easy implementation

8 Proposed Plan of Work: Single-core Optimization
Target much higher TeraFlop/s! A basic but most important optimization step, due to the accumulated performance gains even in multi-core environments.
Application-specific optimization techniques:
–Program behavior analysis (source-level or run-time profiling); various traditional optimization techniques such as loop unrolling, code reassignment, register reallocation, and so on
–Optimize the behavior of the code hotspots
–Architecture-aware optimization: optimization based on the underlying architecture (computational units, interconnect, cache, and memory)
Compiler-driven optimization techniques, some already done:
–Optimal compiler and optimization flags
–Optimal libraries
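The run-time profiling step above can be sketched with Python's standard profiler; the toy `stencil_update` kernel is a hypothetical stand-in for an FD hotspot, not code from any of the SCEC applications.

```python
import cProfile
import pstats
import io

def stencil_update(u, c):
    """Toy 1-D stencil sweep: a stand-in for an FD kernel hotspot."""
    out = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        out[i] = u[i] + c * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return out

def run():
    u = [float(i % 7) for i in range(10000)]
    for _ in range(50):
        u = stencil_update(u, 0.25)
    return u

# Run-time profiling: collect per-function statistics while the code runs.
profiler = cProfile.Profile()
profiler.enable()
run()
profiler.disable()

# Sort by cumulative time to surface the hotspot worth hand-optimizing.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
assert "stencil_update" in report  # the kernel dominates the profile
```

In a C/Fortran HPC code the same analysis would come from a sampling profiler, but the workflow is identical: measure first, then apply unrolling or flag tuning only where the profile says it pays.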

9 Proposed Plan of Work: Multi-core Optimization
Computational pipelining:
–Asynchronous process communication: isend and irecv
–Well-defined pipelined computational jobs to reduce the overhead imposed by MPI synchronization
–Guaranteed correctness of the computation
Reduction of conflicts on shared resources:
–A computational node shares resources: caches (shared L2 or L3) and memory
–Resolve highly biased conflicts on shared resources with program-behavioral solutions, through temporal or spatial conflict avoidance
[Figure: sender/receiver timelines contrasting blocking send/recv (sync-point stall) with isend/irecv overlapping computation; two cores sharing an L1/L2 cache and memory, with frequent-and-biased vs. infrequent-and-even access patterns]
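The isend/irecv overlap idea can be illustrated with a minimal thread-and-queue analogy; this is not MPI, just a sketch of how a non-blocking post plus a later wait lets interior computation proceed while the exchange is in flight.

```python
import threading
import queue

# Toy analogue of MPI isend/irecv: the "communication" runs in a background
# thread while the main thread keeps computing; join() plays the role of
# the explicit synchronization point (MPI_Wait).

channel = queue.Queue()

def isend(data):
    """Post a non-blocking send; returns a handle to wait on."""
    t = threading.Thread(target=channel.put, args=(data,))
    t.start()
    return t

def irecv(result):
    """Post a non-blocking receive into result[0]; returns a handle."""
    def _recv():
        result[0] = channel.get()
    t = threading.Thread(target=_recv)
    t.start()
    return t

inbox = [None]
send_req = isend(sum(range(100)))   # ghost-zone exchange starts...
recv_req = irecv(inbox)

interior = sum(i * i for i in range(1000))  # ...while interior work proceeds

send_req.join()   # sync point: both transfers must finish before the
recv_req.join()   # boundary cells are updated, guaranteeing correctness

assert inbox[0] == 4950
```

The correctness guarantee in the slide corresponds to placing the waits before any computation that reads the received boundary data.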

10 Proposed Plan of Work: Fault Tolerance
Full systems are being designed with 500,000 processors…
–Assuming each processor continues functioning for one year with probability 99.99%, the chance of a one-million-core machine remaining up for one week is 14%
Checkpointing and restarting could take longer than the time to the next failure
–System checkpoint/restart under way. Last year, our 80+ hour, 6k-core run on BG/L succeeded using the IBM system checkpoint (an application-assisted infrastructure: the application level is responsible for identifying points at which there are no outstanding messages)
–A new model is needed; checkpoints to disk will be impractical at exascale
Collaboration with Dr. Zizheng Chen of CSM
–Scalable algorithm-based, checkpoint-free techniques to tolerate a small number of process failures: an algorithm-level fault tolerance solution
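The 14% figure can be checked with a back-of-envelope exponential-failure model (assuming independent processor failures, which the slide implies but does not state):

```python
import math

# Each processor survives one year with probability 0.9999, so its failure
# rate is lam = -ln(0.9999) per year.  With 10^6 independent processors,
# the whole machine survives one week with probability exp(-10^6 * lam / 52).

p_year = 0.9999
lam = -math.log(p_year)          # per-processor failures per year (~1e-4)
n = 1_000_000
weeks_per_year = 52

p_machine_week = math.exp(-n * lam / weeks_per_year)
assert 0.13 < p_machine_week < 0.15   # ~14.6%, matching the slide's 14%
```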

11 Proposed Plan of Work: Data Management
Centralized data collection becomes more and more difficult as data size increases exponentially. Automating administrative tasks such as replication, distribution, access controls, and metadata extraction is a huge challenge; data virtualization and grid technology are to be integrated. With iRODS, for example, we can write rules to track administrative functions such as integrity monitoring:
–provide a logical name space so the data can be moved without the access name changing
–provide metadata to support discovery of files and track provenance
–provide rules to automate administrative tasks (authenticity, integrity, distribution, replication checks)
–provide micro-services for parsing data sets (HDF5 routines)
Potential to use new iRODS interfaces to serve the large SCEC community:
–WebDAV (accessible from devices such as the iPhone)
–Windows browser; an efficient and fast browser interface
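The logical-name-space idea can be illustrated with a minimal catalog sketch. This is not the iRODS API, and the paths are hypothetical; it only shows why a stable logical name decouples clients from physical data placement.

```python
# Minimal illustration (NOT the iRODS API) of a logical name space:
# clients resolve a stable logical name, so a replica can be moved or
# duplicated without changing how the data is addressed.

catalog = {
    "/scec/shakeout/run42/velocity.h5": [
        "ranger:/scratch/so/velocity.h5",   # hypothetical physical replica
    ],
}

def resolve(logical_name):
    """Return one physical replica for a logical name."""
    return catalog[logical_name][0]

def migrate(logical_name, new_physical):
    """Move the data; the logical name clients use is unchanged."""
    catalog[logical_name] = [new_physical]

name = "/scec/shakeout/run42/velocity.h5"
migrate(name, "kraken:/lustre/so/velocity.h5")
assert resolve(name) == "kraken:/lustre/so/velocity.h5"
```

In iRODS the catalog role is played by the iCAT, and rules or micro-services hook into operations like this to enforce replication and integrity checks automatically.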

12 Proposed Plan of Work: Data Visualization
Visualization integration is of critical interest; Amit has been working with a graduate student to develop new GPU-based techniques for earthquake visualization.

13 Candidate Non-SCEC Applications
–ADER-DG: an FE arbitrary high-order discontinuous Galerkin method
–Shuo Ma's FE code (MaFE) using a simplified structured grid

14 AWP-Olsen-Day vs ADER-DG

Problem domain and settings (both codes: 600x300x80 km, 1-Hz, 250 s):
–FD AWP-Olsen-Day: 100x60 km region with S-wave velocity 300-500 m/s (down to 1 km); 60x30 km region with S-wave velocity 100-300 m/s (down to 400 m)
–FE ADER-DG: Vol0, bottom to Moho (30 km), 600x300x50 km, Vs=5000 m/s, Vp=8500 m/s; Vol1, 30 km to sediment base, 600x300x30 km, Vs=3500 m/s, Vp=6000 m/s; Vol2, 100x60x1 km, Vs=500 m/s, Vp=1800 m/s; Vol3, 60x30x0.4 km, Vs=200 m/s, Vp=1500 m/s. 3 elements per dominant wavelength, 5th-order accuracy in space and time, i.e., polynomials of degree 4 within each element, giving 35 degrees of freedom
Computational cost:
–FD: Vs=200 m/s is (500/200)^4 = 39x more than Vs=500 m/s
–FE: Vs=200 m/s is 2.25x more than Vs=500 m/s
Elements (Vs=200 m/s):
–FD: 2.25 x 10^11
–FE: 7.69 x 10^7
Time steps (Vs=200 m/s):
–FD: 125,000
–FE: 485,000
Total wall clock time, 2k / 64k cores (Vs=200 m/s):
–FD: 2557 hrs / 80 hrs
–FE: 1191 hrs / 37 hrs
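The FD cost-scaling claim above is straightforward arithmetic and can be checked directly:

```python
# Refining the minimum S-wave velocity from 500 m/s to 200 m/s shrinks a
# uniform FD grid spacing by 500/200 = 2.5.  The code pays that factor in
# each of three spatial dimensions plus the (CFL-limited) time step:
fd_factor = (500 / 200) ** 4
assert abs(fd_factor - 39.0625) < 1e-9   # the slide rounds this to 39x
```

The unstructured ADER-DG mesh escapes this penalty because only the small, slow sediment volumes are refined, which is why its factor is 2.25x rather than 39x.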

15 ADER-DG Scaling on Ranger

16 ADER-DG Validation LOH.3 (Source: Martin Kaeser 2009)

17 ADER-DG Local Time Stepping (Source: Martin Kaeser 2009)
Each tetrahedral element (m) has its own time step,

    Δt^(m) < 1/(2N+1) · 2 l_min / a_max,

where N is the polynomial degree, l_min is the insphere radius of the tetrahedron, and a_max is the fastest wave speed. Therefore, the Taylor series in time depends on the local time level t^(m).
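The per-element rule described above can be sketched numerically; the 1/(2N+1) CFL factor is the standard ADER-DG choice (an assumption here, as the slide's equation image did not survive extraction), and the element sizes below are illustrative.

```python
# Sketch of the ADER-DG local time-step rule: each element m gets a dt
# proportional to its insphere radius l_min over its fastest wave speed
# a_max, scaled by the order-dependent CFL factor 1/(2N+1).

def local_dt(l_min, a_max, order_n):
    """Per-element time step for polynomial degree order_n."""
    return (1.0 / (2 * order_n + 1)) * (2.0 * l_min / a_max)

# Degree-4 basis (the 5th-order scheme from slide 14).  Small elements in
# slow sediments take many small steps; large, fast mantle elements take
# few large ones -- the payoff of local time stepping.
dt_sediment = local_dt(l_min=50.0, a_max=1500.0, order_n=4)
dt_mantle = local_dt(l_min=2000.0, a_max=8500.0, order_n=4)
assert dt_sediment < dt_mantle
```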

18 ADER-DG Dynamic Rupture Results (Source: Martin Kaeser 2009)

19 ADER-DG Effect of mesh coarsening (Source: Martin Kaeser 2009)

20 DG Application to Landers branching fault system (Source: Martin Kaeser 2009)

21 DG Modeling of Wave Fields in Merapi Volcano (Source: Martin Kaeser 2009)
(J. Wassermann)
–problem-adapted mesh generation
–p-adaptive calculations to resolve topography very accurately
–load balancing by grouping subdomains

22 DG Modeling of Scattered Waves in Merapi Volcano (Source: Martin Kaeser 2009)
(J. Wassermann)
–analysing the strong scattering effect of surface topography
–analysing the limits of standard moment tensor inversion procedures

23 MaFE Scaling (Source: Shuo Ma 2009)

