Workflow and HPC
Doug Benjamin, Argonne National Lab
HPC cross-experiment discussion
- https://indico.cern.ch/event/811997
- Document prepared by the ATLAS, CMS and LHCb computing coordinators: Tommaso Boccali, Concezio Bozzi, James Catmore, Davide Costanzo, Markus Klute, with contributions from Andrea Valassi.
- The Big/Mega/Hyper PanDA projects, collaborations with ASCR/OLCF and the Kurchatov Institute (Russia), have led to ATLAS' advanced integration of the HPC centers with our WFMS (PanDA).
Overview
- All experiments report usage of HPC resources.
- HPCs with grid-like connectivity are not an issue.
- Two classes of issues:
  - Distributed computing and data management issues: how do we work with the HPC centers? The challenge is that HPC systems often lack external connectivity and do not allow the easy installation of new services that require sysadmin privileges.
  - Software issues: in ATLAS, our code has traditionally been written for serial jobs on x86 CPUs. New machines have a variety of CPU instruction sets, with most of the computing power in very powerful accelerators.
Distributed Computing Issues
- Distributed computing issues may be more or less difficult to overcome, depending on the specific technical choices and policies of each HPC center.
- What decides whether some of the "complicated" HPCs are exploited successfully is often the expertise provided by the HPC centers themselves, helping and supporting the experiments in integrating their workflows.
- Funding agencies should understand this important aspect: to use these resources effectively, experiments will need concrete manpower from the HPC centers themselves to help interface with the experiments.
- To minimize duplication of work, it would also make sense to set up a joint team of experts from all experiments to share technical expertise on the various HPCs.
Software Issues
- Software issues are much more difficult to solve.
- They are a blocker for the short-term exploitation of some HPC centers that require efficient use of their GPUs.
- This requires a long-term program of software modernization and reengineering (possibly several years).
- A large investment in software is needed to bring about a veritable paradigm shift, by retraining existing personnel as well as hiring and training new experts to reengineer and port or adapt the software applications.
- ATLAS has been focusing on Full Sim, while CMS wants to run everything on HPCs.
Software Issues (2)
HPC centers are heterogeneous:
- Traditional x86 processors at HPC centers can be, and already have been, readily exploited by all experiments without the need for specific software work.
  - The Piz Daint HPC in Lugano is a WLCG Tier 2. Note that we do not make use of its vast GPU resources.
- Many-core KNL processors at HPC centers can execute HEP x86 applications, but provide much lower memory per core, so they are used efficiently by multi-process (MP) or multi-threaded (MT) applications.
  - Cori (at NERSC) and Theta (at ALCF) have been used by ATLAS to generate hundreds of millions of Geant4 Full Sim events.
- ARM or PowerPC processors at HPC centers can only execute HEP software applications that have been ported to these architectures.
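The memory-per-core constraint on many-core nodes noted above can be illustrated with a small back-of-the-envelope sketch. All names and numbers here are hypothetical and chosen only for illustration; they are not measured ATLAS or KNL figures. The sketch compares how many independent serial jobs fit on a node with how many workers fit when read-only memory (geometry, conditions and similar) is paid once per node, as in an MP or MT application.

```python
def max_serial_jobs(node_mem_gb, cores, shared_gb, private_gb):
    """Independent serial jobs: every job holds its own copy of everything."""
    by_memory = int(node_mem_gb // (shared_gb + private_gb))
    return max(0, min(cores, by_memory))


def max_shared_workers(node_mem_gb, cores, shared_gb, private_gb):
    """MP/MT workers: the shared (read-only) part is held once per node."""
    by_memory = int((node_mem_gb - shared_gb) // private_gb)
    return max(0, min(cores, by_memory))


if __name__ == "__main__":
    # Hypothetical node: 96 GB of memory and 64 cores (1.5 GB per core).
    # Hypothetical job: 1.5 GB shareable data plus 0.5 GB private state.
    node = dict(node_mem_gb=96, cores=64, shared_gb=1.5, private_gb=0.5)
    print("serial jobs per node:   ", max_serial_jobs(**node))
    print("shared workers per node:", max_shared_workers(**node))
```

With these assumed numbers, serial jobs are limited by memory and leave cores idle, while workers that share the read-only data can fill every core.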
GPUs and other accelerators
- GPUs at HPC centers cannot presently be used efficiently by HEP experiments, with the exception of some workflows (such as ML training).
  - ML training represents a minimal part of their worldwide-integrated computing.
- The majority of the HEP experiments' software applications have not yet been ported to GPUs.
  - They are currently not designed to take advantage of any sort of accelerator.
- GPUs are presently one of the main challenges for the exploitation of HPC centers by the HEP experiments.
- The issue is expected to become more important in the future, as the current trend is for new HPC systems to provide an increasingly large fraction of their computing power through GPUs.
Newest machines and next generation ones
- In Europe: BullSequana X supercomputers
DOE - OLCF
- The new machine, Summit, is online now.
- Available through the INCITE/ALCC process.
Evolution in the Bay Area (i.e. NERSC) - Perlmutter
- To be delivered in 2020.
- Heterogeneous system.
- About 3 times more powerful than Cori (Cori = 14 PF), i.e. roughly 40 PF.
- x86 CPU cores for CPU-only nodes.
- CPU-GPU nodes for ML and the like.
Conclusions
- Very useful discussion amongst the experiments about HPCs.
- Software is the key, for example the programs (workflows) running on the HPCs: simulation, event generation, reconstruction, etc.
- GPUs and other accelerators (e.g. TPUs) are a challenge.
- The newest machines will have GPUs providing the majority of their computing power.
- Now on to the items for discussion.
Provocations for Discussion
- What is the cost-benefit of using these machines and centers?
- Some funding agencies have said that these machines will be part of the solution for Run 4. Is this universal?
- Should some of the ATLAS HPC labor be diverted from distributed computing for HPCs to code for HPCs?
- What software exists today to make use of exotic CPUs (e.g. Power9 and ARM)?
- What about software for GPUs?
- What workflows should we really be running on HPCs?
  - Event generation?
  - Simulation? (Full Geant4 simulation, Fast Sim?)
  - Reconstruction?
Even more Provocations
- How do we use Machine Learning in ATLAS on these machines?
  - What is needed?
  - How do we attract users?
- How much effort is needed to vectorize large parts of our software?
  - Where will the effort come from?
- How do we "keep the tail from wagging the dog"?