Workflow and HPC. Doug Benjamin, Argonne National Laboratory.

HPC cross-experiment discussion: https://indico.cern.ch/event/811997 (link to document). Prepared by the ATLAS, CMS and LHCb computing coordinators: Tommaso Boccali, Concezio Bozzi, James Catmore, Davide Costanzo, Markus Klute, with contributions from Andrea Valassi. The Big/Mega/Hyper PanDA projects (collaborations with ASCR/OLCF and the Kurchatov Institute, Russia) have led to ATLAS' advanced integration of HPC centers with our WFMS (PanDA).

Overview. All experiments report usage of HPC resources; HPCs with grid-like connectivity are not an issue. There are two classes of issues. Distributed computing and data management issues: how do we work with the HPC centers? The challenge is that HPC systems often lack external connectivity and do not allow the easy installation of new services that require sysadmin privileges. Software issues: in ATLAS our code has traditionally been written for serial jobs on x86 CPUs, whereas new machines have a variety of CPU instruction sets and provide most of their computing power through very powerful accelerators.

Distributed Computing Issues. Distributed computing issues may be more or less difficult to overcome depending on the specific technical choices and policies of each HPC center. The difference in exploiting successfully (or not) some of the "complicated" HPCs is often the expertise provided by the HPC centers themselves, helping and supporting the experiments in integrating their workflows. Funding agencies should understand this important aspect: to use these resources effectively, experiments will need concrete manpower help from the HPC centers themselves with interfacing to the experiments. To minimize duplication of work, it would also make sense to set up a joint team of experts from all experiments, which would share technical expertise on the various HPCs.
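A common way to work around compute nodes that have no external connectivity is to run an edge service on a login node and pass work to the batch payload through the shared filesystem. The sketch below is only a minimal illustration of that idea; the directory layout, file format, and function names are assumptions for this example, not the actual Harvester or PanDA interfaces.

```python
import json
import time
from pathlib import Path

# Hypothetical directories on the HPC shared filesystem (e.g. Lustre/GPFS).
# The edge/login node has outbound connectivity; the compute nodes do not.
INBOX = Path("/scratch/atlas/hpc_demo/inbox")    # job descriptions dropped by the edge service
OUTBOX = Path("/scratch/atlas/hpc_demo/outbox")  # results picked up by the edge service


def edge_node_submit(job_id: str, spec: dict) -> None:
    """Runs on the edge node: fetch work from the external workload manager
    (not shown) and stage it to the shared filesystem for the compute nodes."""
    INBOX.mkdir(parents=True, exist_ok=True)
    (INBOX / f"{job_id}.json").write_text(json.dumps(spec))


def compute_node_worker(poll_seconds: float = 30.0) -> None:
    """Runs inside the batch allocation: no network access, only filesystem polling."""
    OUTBOX.mkdir(parents=True, exist_ok=True)
    while True:
        for job_file in sorted(INBOX.glob("*.json")):
            spec = json.loads(job_file.read_text())
            # Placeholder for the real payload (simulation, reconstruction, ...).
            result = {"job": job_file.stem, "status": "done", "events": spec.get("nevents", 0)}
            (OUTBOX / f"{job_file.stem}.result.json").write_text(json.dumps(result))
            job_file.unlink()  # mark the job description as consumed
        time.sleep(poll_seconds)
```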

Software Issues. Software issues are much more difficult to solve. They are a blocker for the short-term exploitation of some HPC centers that require efficient use of their GPUs. Addressing them requires a long-term program of software modernization and reengineering (possibly several years) and a large investment in software to bring about a veritable paradigm shift, retraining existing personnel as well as hiring and training new experts to reengineer and port/adapt the software applications. ATLAS has been focusing on full simulation, while CMS wants to run everything on HPCs.

Software Issues (2). HPC centers are heterogeneous. Traditional x86 processors at HPC centers can be, and already have been, readily exploited by all experiments without the need for specific software work; the Piz Daint HPC in Lugano is a WLCG Tier 2 (note that we do not make use of its vast GPU resources). Many-core KNL processors at HPC centers can execute HEP x86 applications but provide much lower memory per core, so they are used efficiently only by multi-process (MP) or multi-threaded (MT) applications; Cori (at NERSC) and Theta (at ALCF) have been used by ATLAS to generate hundreds of millions of Geant4 full-simulation events. ARM or PowerPC processors at HPC centers can only execute HEP software applications that have been ported to those architectures.
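Why MP/MT matters on low-memory-per-core nodes: every independent process that loads the full detector description and conditions pays the memory cost again, while forked workers (or threads) can share one read-only copy. The following is a minimal sketch of the multi-process pattern, with a toy in-memory payload standing in for real geometry/conditions data; it is an illustration, not experiment code.

```python
import multiprocessing as mp

# Stand-in for a large read-only payload (geometry, conditions, field map).
# With a forked worker pool it is shared copy-on-write rather than duplicated per core.
CONDITIONS = {"field_map": [0.1] * 5_000_000}


def process_event(event_id: int) -> float:
    """Toy per-event work that only reads the shared payload."""
    table = CONDITIONS["field_map"]
    return event_id * table[event_id % len(table)]


if __name__ == "__main__":
    # The 'fork' start method (Linux, the usual HPC case) lets children inherit
    # CONDITIONS without an up-front copy, keeping the per-core footprint low.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=8) as pool:
        results = pool.map(process_event, range(10_000))
    print(f"processed {len(results)} events")
```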

GPUs and other accelerators. GPUs at HPC centers cannot presently be used efficiently by HEP experiments, except for some workflows such as ML training, and ML training represents a minimal part of their worldwide-integrated computing. The majority of HEP experiments' software applications have not yet been ported to GPUs and are currently not designed to take advantage of any sort of accelerator. GPUs are presently one of the main challenges for the exploitation of HPC centers by the HEP experiments, and the issue is expected to become more important in the future, as the current trend is for new HPC systems to provide an increasingly larger fraction of their computing power through GPUs.
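Much of the porting effort is about restructuring branchy per-event scalar code into data-parallel kernels, the same rewrite that also helps vectorization on x86/KNL. The toy example below (plain NumPy, purely illustrative; the quantities and the assumed pion mass are not taken from any experiment's code) contrasts the two styles; the array-at-a-time form is the one that maps naturally onto GPU array back ends.

```python
import numpy as np


def invariant_mass_loop(px, py, pz, e):
    """Scalar, per-particle loop: easy to write, hard to vectorize or offload."""
    out = []
    for i in range(len(e)):
        m2 = e[i] ** 2 - (px[i] ** 2 + py[i] ** 2 + pz[i] ** 2)
        out.append(m2 ** 0.5 if m2 > 0 else 0.0)
    return out


def invariant_mass_vectorized(px, py, pz, e):
    """Array-at-a-time version: whole-array expressions that vectorize on the CPU
    and translate directly to GPU array libraries."""
    m2 = e**2 - (px**2 + py**2 + pz**2)
    return np.sqrt(np.clip(m2, 0.0, None))


if __name__ == "__main__":
    rng = np.random.default_rng(seed=1)
    px, py, pz = (rng.normal(size=1_000_000) for _ in range(3))
    e = np.sqrt(px**2 + py**2 + pz**2 + 0.139**2)  # assumed pion mass in GeV, for illustration
    print(invariant_mass_vectorized(px, py, pz, e).mean())
```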

Newest machines and next-generation ones. In Europe: BullSequana X supercomputers.

DOE - OLCF. The new machine, Summit, is online now and available through the INCITE/ALCC process.

Evolution in the Bay Area (i.e. NERSC): Perlmutter. To be delivered in 2020. A heterogeneous system roughly 3 times more powerful than Cori (Cori = 14 PF), i.e. ~40 PF. x86 CPU cores for CPU-only nodes; CPU-GPU nodes for ML and the like.

Conclusions. Very useful discussion amongst the experiments about HPCs. Software is the key, for example which programs (workflows) run on the HPCs: simulation, event generation, reconstruction, etc. GPUs and other accelerators (e.g. TPUs) are a challenge, and the newest machines will have GPUs providing the majority of their computing power. Now on to the items for discussion.

Provocations for Discussion. What is the cost-benefit of using these machines and centers? Some funding agencies have said that these machines will be part of the solution for Run 4; is this universal? Should some of ATLAS' HPC labor be diverted from distributed HPCs to writing code for HPCs? What software exists today to make use of exotic CPUs (e.g. Power9 and ARM)? What about software for GPUs? What workflows should we really be running on HPCs: event generation, simulation (full Geant4 or fast sim), reconstruction?

Even more Provocations. How do we use machine learning in ATLAS on these machines? What is needed? How do we attract users? How much effort is needed to vectorize large parts of our software, and where will that effort come from? How do we "keep the tail from wagging the dog"?