
1 G4 At Extreme Scales
Tom LeCompte, High Energy Physics Division, Argonne National Laboratory

2 Background
For the last few years, Argonne has been adapting HEP code to run on supercomputers.
HEP needs are growing exponentially; grid resources are growing linearly.
Today ~15% of ATLAS grid computing is run on supercomputers.
The majority is event generation, run at ANL's Mira (#6 in the Top 500).
The remainder are grid jobs, sent to x86-based supercomputers. This works best for small partitions: < a few hundred nodes.
The bulk of ATLAS grid computing is Geant. I'd like to be running at least as much Geant as event generation on large supercomputers; that requires running in large partitions.

3 BlueGene/Q and Mira
Each Mira worker node contains:
A 1.6 GHz 16-core BlueGene/Q processor (with 4 hardware threads per core, a la Intel's hyperthreading)
16 GB of memory
No local storage – it communicates with the file system (GPFS) via dedicated I/O nodes
Mira has 49,152 worker nodes, 786,432 cores and 3,145,728 hardware threads.
Jobs run in partitions of 512, 1K, 2K, 4K, 8K, 12K, 16K, 24K, 32K and 48K nodes; the minimum partition size is 32,768 hardware threads.
For throughput reasons, it's helpful to be able to run in both large and small partitions – our flexibility in this respect let us run three times as much as we were allocated by taking advantage of empty spots.
There are two smaller systems for testing:
Cetus: 4,096 worker nodes, identical to Mira
Vesta: 2,048 worker nodes, with more I/O nodes per worker node – this will be relevant in a few slides
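As a quick consistency check on these counts (my arithmetic, not from the slides): 49,152 nodes × 16 cores/node = 786,432 cores, and 786,432 cores × 4 hardware threads/core = 3,145,728 hardware threads; the minimum 512-node partition × 64 threads/node gives the 32,768-thread floor quoted above.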

4 MPI vs. Threads
Geant now does both MPI and threads.
MPI = Message Passing Interface
Allows multiple copies of one program (or multiple programs) to communicate: coordinate random number seeds, collect I/O, etc.
Can work across multiple nodes
Threads
Allows one program to run multiple instances of a worker thread
Limited to a single node
Much, much smaller memory footprint → usually a lot faster
Hybrid applications
Use MPI across nodes and threads inside a node – this is how we run on Mira (a minimal sketch follows below)
Example: 2048 nodes = 2048 ranks × 64 threads/rank = 131,072 threads
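As an illustration of this hybrid layout, here is a minimal, generic sketch (my own, not code from these slides): one MPI rank per node with many worker threads inside it. The worker function is only a stand-in for a Geant4 worker thread, and the 64-threads-per-rank figure is the BlueGene/Q case above.

```cpp
// Hybrid MPI + threads sketch: one MPI rank per node, many worker threads
// inside each rank -- the layout used for Geant on Mira
// (e.g. 2048 ranks x 64 threads/rank = 131,072 workers).
#include <mpi.h>
#include <cstdio>
#include <thread>
#include <vector>

// Stand-in for a Geant4 worker thread: in a real job this would simulate a
// slice of the event sample, seeded from its globally unique worker ID.
static void workerTask(int rank, int thread, int threadsPerRank)
{
    const long globalWorker = static_cast<long>(rank) * threadsPerRank + thread;
    (void)globalWorker;  // e.g. use it to derive an independent random seed
}

int main(int argc, char** argv)
{
    // MPI calls happen only on the main thread, so FUNNELED support suffices.
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0, nRanks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nRanks);

    const int threadsPerRank = 64;  // one BG/Q node: 16 cores x 4 hw threads

    std::vector<std::thread> workers;
    for (int t = 0; t < threadsPerRank; ++t)
        workers.emplace_back(workerTask, rank, t, threadsPerRank);
    for (auto& w : workers) w.join();

    if (rank == 0)
        std::printf("%d ranks x %d threads = %d workers\n",
                    nRanks, threadsPerRank, nRanks * threadsPerRank);

    MPI_Finalize();
    return 0;
}
```

Launched with one rank per node (e.g. 2048 ranks on a 2K-node partition), this reproduces the 131,072-worker layout in the example above.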

5 Performance
We see linear scaling up to 64 threads per node, and a 2 ± 2% boost going to 80 (an overcommit of one software thread per core).
Scaling to a 512-node partition (32K threads) costs ~20% in performance.
Going to 1K nodes (64K threads) costs another ~30%.
Going to 2K nodes (128K threads) costs more than half – you're better off running in 1K.
[Plot, with the minimum Mira partition marked: from the Geant4 HepExpMTBenchmark, roughly corrected for initialization time (not all jobs have the same number of events); each node has 64 threads.]

6 Performance Discussion
We have seen this behavior before; we believe it is due to multiple I/O limitations.
One limitation we experienced in the past was stdin/stdout/stderr.
Going to faster disk makes only a small difference.
Running on Vesta (green stars) shows both better absolute performance and better scaling.
Remember, Vesta has twice as many I/O nodes per worker node as Mira or Cetus.
ALCF kindly ran a special full-machine job on Vesta for me. Full disclosure: 2 ranks stalled during this test; I don't know why.
[Same plot as the previous slide, with the minimum Mira partition marked: from the Geant4 HepExpMTBenchmark, roughly corrected for initialization time (not all jobs have the same number of events); each node has 64 threads.]

7 Improving I/O Performance
The limitation on I/O is seldom the total output data rate.
This is why going to faster disk makes only a marginal improvement.
The number of writing threads is a major factor: lots of small writes is deadly.
This is why stdin/stdout/stderr limits performance.
Solution: aggregate many small writes into fewer, larger ones – see the next slide.
Initialization
A few minutes of (usually single-threaded) initialization is normally not worth worrying about.
One virtue of scaling is that one can take a job that takes x time on y cores and turn it into a job that takes x/N time on Ny cores; long initialization times limit this.
It is possible to reduce the number of simultaneous reads by replacing them with fewer reads that are then broadcast to the other nodes with MPI (sketched below).
We have done this, but I don't think it has been tried with Geant; I would wait to see the magnitude of the problem before trying to fix it.
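One way to realize the read-then-broadcast idea is the generic MPI pattern sketched below (my illustration, not code from these slides). Rank 0 reads a hypothetical input file ("geometry.gdml" is just a placeholder name) once and broadcasts the bytes, so thousands of ranks do not hit the file system at the same time.

```cpp
// Read-once-and-broadcast sketch: only rank 0 touches the file system; the
// file contents are then distributed to every other rank with MPI_Bcast.
#include <mpi.h>
#include <fstream>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<char> buffer;
    long size = 0;

    if (rank == 0) {
        // Hypothetical input file; in practice this would be geometry,
        // configuration, cross-section tables, etc.
        std::ifstream in("geometry.gdml", std::ios::binary | std::ios::ate);
        size = static_cast<long>(in.tellg());
        buffer.resize(size);
        in.seekg(0);
        in.read(buffer.data(), size);
    }

    // Broadcast the size first so every rank can allocate, then the payload.
    MPI_Bcast(&size, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    if (rank != 0) buffer.resize(size);
    MPI_Bcast(buffer.data(), static_cast<int>(size), MPI_CHAR, 0, MPI_COMM_WORLD);

    // ... each rank now parses 'buffer' in memory instead of reading the file ...

    MPI_Finalize();
    return 0;
}
```

For very large files the payload would be broadcast in chunks, since MPI_Bcast takes an int count; the structure of the pattern stays the same.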

8 art
art is the framework for FNAL g-2, Mu2e and other FNAL Intensity Frontier experiments.
ANL and FNAL have a small project to parallelize art:
Make the parallelization transparent to the end user: nothing changes when going from a desktop to a supercomputer.
Aggregate the I/O. Historically (see the last slide) this has been the key to scaling; a sketch of this single-writer pattern follows below.
Status: an art release for g-2 runs, but doesn't stop cleanly yet.
[Diagram: G4 worker threads, a dispatcher, and a writer.]
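To illustrate the dispatcher/writer idea, here is a minimal single-writer sketch (the pattern only, not the actual art/ANL code): many worker threads hand small output records to one writer thread, which turns many small writes into a few large ones.

```cpp
// Single-writer aggregation sketch: worker threads enqueue small records and
// one writer thread flushes them to disk in large batches.
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class AggregatingWriter {
public:
    explicit AggregatingWriter(const std::string& path)
        : out_(path, std::ios::binary), done_(false),
          writer_([this] { this->run(); }) {}

    ~AggregatingWriter() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_one();
        writer_.join();
    }

    // Called by many worker (e.g. G4) threads: cheap, no disk I/O here.
    void submit(std::string record) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(record));
        }
        cv_.notify_one();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        while (!done_ || !queue_.empty()) {
            cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
            // Drain everything that has accumulated into one buffer, then do
            // a single large write instead of many small ones.
            std::string batch;
            while (!queue_.empty()) {
                batch += queue_.front();
                batch += '\n';
                queue_.pop();
            }
            lock.unlock();
            out_ << batch;
            lock.lock();
        }
    }

    std::ofstream out_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::string> queue_;
    bool done_;
    std::thread writer_;
};
```

In this picture the worker threads play the role of the G4 threads in the diagram, submit() plays the dispatcher, and only the writer thread ever touches the file.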

9 Knights Landing
I am limited in what I can say.
The Intel Xeon Phi 72xx series (Knights Landing) will be used in the supercomputers Theta and Cori Phase 2.
One chip is expected to be roughly the equivalent of two Ivy Bridge E5-2695 v2 Xeons:
64 physical cores at 1.3 GHz vs. 32 (2 × 16) at 2.4 GHz
215 W vs. 230 W TDP
Roughly half the price
Potential to do better: Phi has 4-way hyperthreading vs. Xeon's 2-way
I can't discuss performance, but if this chip significantly missed its targets, you'd probably have read about it.
Memory is split: 16 GB of fast on-package memory (MCDRAM) and hundreds of GB of DDR4.
DDR4 sounds fast, but when you divide it by 256 threads, you get hard-disk speeds.
Fortunately, G4 with hundreds of threads fits in 16 GB; G4 with many MPI ranks does not (HepExpMT takes ~1.4 GB/rank on an x86).
To use this chip effectively, you want to run one G4 task with many threads instead of multiple G4 tasks with one thread.
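A rough back-of-envelope comparison using the numbers on this slide (my arithmetic, not from the slide itself): 256 single-threaded MPI ranks × ~1.4 GB/rank ≈ 360 GB, which can only live in the slow DDR4, whereas one multithreaded G4 process with hundreds of threads fits within the 16 GB of MCDRAM because the threads share most of their memory.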

10 Summary
We have run Geant at very large scales: > 100,000 threads on Mira, both with HepExpMT and my own million-volume test geometry.
At ~50,000 threads we start seeing significant scaling limitations; we have experience and evidence that points at I/O.
The adaptations to art should help us get around this. Will they get us to full-machine jobs (3.1 million threads)?
The next generation of US supercomputers will have a more familiar architecture (Xeon Phi / x86 based). These represent a huge amount of computing: Aurora (2019) is ~100 Grids.
Threading is the way to go on both old and new architectures.
Thanks to Argonne's ALCF, JLSE (Joint Laboratory for Systems Evaluation) and DOE's Center for Computational Excellence.

