Workflow Management Software For Tomorrow


Workflow Management Software For Tomorrow
Fernando Barreiro, Kaushik De, Alexei Klimentov, Tadashi Maeno, Danila Oleynik, Pavlo Svirin, Matteo Turilli
HPC cross-experiment discussion, CERN, May 10, 2019
[Architecture diagram: physics group production requests flow through ProdSys2/DEFT and JEDI into the PanDA server; a pilot scheduler (condor-g, ARC interface) dispatches pilots to EGEE/EGI, OSG, NDGF and HPC worker nodes; distributed data management via Rucio, meta-data handling via AMI/pyAMI; the DEFT DB and PanDA DB hold requests, production tasks and jobs.]

Outline
- WFM SW evolution (HEP)
- WFM SW on HPC from non-LHC
23.09.2019

Workflow Management. PanDA
Production and Distributed Analysis System: https://twiki.cern.ch/twiki/bin/view/PanDA/PanDA
PanDA brief story:
- 2005: Initiated for US ATLAS (BNL and UTA)
- 2006: Support for analysis
- 2008: Adopted ATLAS-wide
- 2009: First use beyond ATLAS
- 2011: Dynamic data caching based on usage and demand
- 2012: ASCR/HEP BigPanDA project
- 2014: Network-aware brokerage
- 2014: Job Execution and Definition Interface (JEDI) adds complex task management and fine-grained dynamic job management
- 2014: JEDI-based Event Service
- 2015: New ATLAS Production System, based on PanDA/JEDI
- 2015: Management of heterogeneous computing resources: HPCs and clouds
- 2016: DOE ASCR BigPanDA@Titan project
- 2016: PanDA for bioinformatics
- 2017: COMPASS adopted PanDA; NICA (JINR)
- PanDA beyond HEP: BlueBrain, IceCube, LQCD
Global ATLAS operations:
- Up to ~800k concurrent job slots
- 25-30M jobs/month at >250 sites
- ~1400 ATLAS users
- First exascale workload manager in HENP: 1.3+ exabytes processed every year in 2014-2018
- Exascale scientific data processing today
- BigPanDA Monitor: http://bigpanda.cern.ch/
[Plot: concurrent cores run by PanDA, split between Grid, big HPCs and clouds]

Future Challenges for WorkFlow(Load) Management Software
- New physics workflows and technologies: machine learning training, parallelization, vectorization... also new ways of organizing Monte-Carlo campaigns
- Address computing model evolution and new strategies ("provisioning for peak")
- Incorporating new architectures (TPU, GPU, RISC, FPGA, ARM, ...)
- Leveraging new technologies (containerization, no-SQL analysis models, high data reduction frameworks, tracking, ...)
- Integration with networks (via DDM, via IS, and directly)
- Data popularity -> event popularity
- Address future complexities in workflow handling
- Machine learning and task time-to-complete prediction (a toy sketch follows below)
- Monitoring, analytics, accounting and visualization
- Granularity and data streaming
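
As a toy illustration of the task time-to-complete item above, a minimal sketch of how a TTC regressor could be trained on historical task features; the features, data and model choice are assumptions for illustration, not the PanDA implementation:

    # Toy sketch of ML-based task time-to-complete (TTC) prediction.
    # Feature names, data and model choice are illustrative assumptions only.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    # Assumed historical task records: [n_events, n_input_files, queue_depth, priority]
    rng = np.random.default_rng(0)
    X = rng.random((1000, 4))                                                  # stand-in for real accounting data
    y = X @ np.array([3600.0, 40.0, 900.0, -60.0]) + rng.normal(0, 120, 1000)  # fake TTC in seconds

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = GradientBoostingRegressor().fit(X_train, y_train)
    print("R^2 on held-out tasks:", model.score(X_test, y_test))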

Future development. Harvester Highlights (T. Maeno)
Primary objectives:
- To have a common machinery for diverse computing resources
- To provide a common layer bringing coherence to different HPC implementations
- To optimize workflow execution for diverse site capabilities
- To address the wide spectrum of computing resources/facilities available to ATLAS and to experiments in general
New model: PanDA server - Harvester - pilot (sketched below)
The project was launched in Dec 2016 (PI: T. Maeno)
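
A schematic sketch of the new PanDA server - Harvester - pilot model referenced above; the object and method names are hypothetical and only illustrate the control flow, not the real Harvester code:

    # Hypothetical control-flow sketch of the PanDA server -> Harvester -> pilot model.
    # None of these objects or methods are the real Harvester API; they only show the cycle.
    import time

    def harvester_cycle(panda_server, resource):
        """One Harvester-like agent serving a single computing resource."""
        while True:
            # 1. Ask the PanDA server for jobs that fit the resource's free capacity
            jobs = panda_server.get_jobs(site=resource.name, n=resource.free_slots())
            for job in jobs:
                # 2. Stage in input data with a resource-specific plug-in
                resource.stage_in(job.input_files)
                # 3. Submit a pilot (batch job, SSH command, VM, ...) to run the payload
                worker = resource.submit_pilot(job)
                # 4. Report worker/job state back to the PanDA server
                panda_server.update_status(job.job_id, worker.state)
            time.sleep(60)  # poll on a fixed cadence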

Harvester Status
What is Harvester:
- A bridge service between workload/data management systems and resources, allowing (quasi-) real-time communication between them
- Flexible deployment model to work with the various operational restrictions, constraints and policies of those resources
  - e.g. local deployment on an edge node for HPCs behind multi-factor authentication; central deployment + SSH + RPC for HPCs without outbound network connections; stage-in/out plugins for various data transfer services/tools; messaging via a shared file system; ...
- Experiments can use Harvester by implementing their own plug-ins (see the skeleton sketch below); Harvester is not tightly coupled with PanDA
Current status:
- Architecture design, coding and implementation completed
- Commissioning ~done
- Deployed on a wide range of resources:
  - Theta/ALCF, Cori/NERSC, Titan/OLCF in production
  - Summit/OLCF, MareNostrum4/BSC under testing
  - Also at Google Cloud, Grid (~all ATLAS sites), HLT@CERN
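
Since experiments integrate by writing their own plug-ins, a hypothetical skeleton of a stage-out plug-in is sketched below; class and method names are invented for illustration and do not match the actual Harvester plugin interface:

    # Hypothetical stage-out plug-in skeleton; not the actual Harvester plugin interface.
    import os
    import shutil

    class MySiteStageOutPlugin:
        """Copies job outputs from a worker's scratch area to site storage."""

        def __init__(self, destination_dir):
            self.destination_dir = destination_dir   # e.g. a mount point or transfer endpoint

        def trigger_stage_out(self, output_files):
            results = {}
            for path in output_files:
                dest = os.path.join(self.destination_dir, os.path.basename(path))
                try:
                    # A real plug-in would call the site's transfer tool (Globus, xrdcp, ...)
                    shutil.copy2(path, dest)
                    results[path] = True
                except OSError:
                    results[path] = False
            return results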

PanDA/Harvester deployment for ATLAS @OLCF
[Diagram: deployment spanning OLCF, CERN and BNL]

Harvester for tomorrow (HPC only)
- Full-chain technical validation with Yoda (Yoda: the ATLAS Event Service with MPI functionality running on HPC; see the MPI sketch below)
- Yoda + Jumbo jobs in production (Jumbo jobs relax input file boundaries and can pick up any event from the dataset)
- Two-hop data stage-out with Globus Online + Rucio
- Containers integration
- Implementation of a capability to use CPU and GPU simultaneously within one node for MC and ML payloads
- Implementation of a capability to dynamically shape payloads based on real-time resource information
- Centralization of Harvester instances using CE, SSH, MOM, ...
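
To illustrate the Yoda item above, a minimal master/worker event-range dispatch in the MPI event-service spirit, written with mpi4py; this is a sketch of the idea under stated assumptions, not the actual Yoda implementation:

    # Minimal master/worker event-range dispatch in the spirit of an MPI event service.
    # Assumes mpi4py is available; run e.g. with: mpirun -n 4 python yoda_sketch.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    if rank == 0:
        # Rank 0 (the "Yoda" role) hands out event ranges on request until none remain
        event_ranges = [(i * 100, (i + 1) * 100) for i in range(20)]  # fake (first, last) ranges
        stopped = 0
        while stopped < size - 1:
            req_rank = comm.recv(source=MPI.ANY_SOURCE)   # a worker asks for work
            if event_ranges:
                comm.send(event_ranges.pop(), dest=req_rank)
            else:
                comm.send(None, dest=req_rank)            # no more work: tell the worker to stop
                stopped += 1
    else:
        # Worker ranks (the "Droid" role) pull event ranges until told to stop
        while True:
            comm.send(rank, dest=0)                       # request an event range
            work = comm.recv(source=0)
            if work is None:
                break
            first, last = work
            # ... run the payload over events [first, last) here ...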

Simulation Science
- Computing has seen an unparalleled exponential development: over the last decades supercomputer performance grew 1000x every ~10 years, from "ENIAC" at ~10^3 FLOPS to "Summit" at ~10^17 FLOPS
- Almost all scientific disciplines have long embraced this capability
[Figure: growth of supercomputer performance, with HEP simulation and neuroscience workloads highlighted]
Original slide from F. Schurmann (EPFL)

Pegasus WFMS / PanDA
- Collaboration started in October 2018
- December 2018: first working prototype; the standard Pegasus example (split document / word count) tested on Titan
- Future plans / possible applications / open questions:
  - Test the same workflow in a heterogeneous environment: Grid + HPC (Summit/NERSC/...) with data transfer via Rucio or other tools
  - Possible application: data transfer for LQCD jobs from/to OLCF storage
  - The Pegasus/PanDA integration currently works at the job level (a toy adapter sketch follows below); how to integrate Pegasus with other PanDA components such as JEDI is still TBD
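
A toy sketch of what the job-level bridge mentioned above could look like: translating a generic workflow job into a PanDA-style job description. All field and queue names here are hypothetical, not the real Pegasus or PanDA schemas:

    # Hypothetical job-level adapter; field names are illustrative, not real PanDA/Pegasus schemas.
    def workflow_job_to_panda_spec(job_name, executable, arguments, inputs, outputs, site):
        """Translate one workflow job into a PanDA-style job description."""
        return {
            "jobName": job_name,
            "transformation": executable,          # what the pilot should run
            "jobParameters": " ".join(arguments),
            "inFiles": ",".join(inputs),           # inputs to stage in (e.g. via Rucio)
            "outFiles": ",".join(outputs),         # outputs to stage out
            "computingSite": site,                 # hypothetical PanDA queue name for the HPC
        }

    # Example: one piece of the split/word-count demo workflow
    spec = workflow_job_to_panda_spec(
        "wc_part1", "/usr/bin/wc", ["-w", "part.1"],
        inputs=["part.1"], outputs=["wc_part1.out"], site="SOME_TITAN_QUEUE")
    print(spec)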

Next Generation Executor Project (Rutgers U.)
- Schedules and runs multiple tasks concurrently and consecutively in one or more batch jobs:
  - Tasks are individual programs
  - Tasks are executed within the walltime of each batch job
- Late binding: tasks are scheduled and then placed within a batch job at runtime (see the toy sketch below)
- Task and resource heterogeneity:
  - Scheduling, placing and running CPU/GPU/OpenMM/MPI tasks in the same batch job
  - Use of single or multiple CPUs/GPUs for the same task and across multiple tasks
- Supports multiple HPC machines; limited development is required to support a new HPC machine
- Use cases: molecular dynamics and HEP payloads on Summit
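
A toy sketch of the late-binding idea referenced above: tasks are bound to free slots inside a running batch job only at runtime. The code and its policy are invented for illustration and are not the NGE implementation:

    # Toy late-binding executor: place queued tasks onto free "slots" inside one batch job.
    # Purely illustrative; names and policy are invented, not the NGE implementation.
    import queue
    import subprocess
    import threading

    def run_tasks_in_batch_job(task_cmds, n_slots):
        tasks = queue.Queue()
        for cmd in task_cmds:
            tasks.put(cmd)

        def slot_worker():
            while True:
                try:
                    cmd = tasks.get_nowait()      # late binding: task picked only when a slot frees up
                except queue.Empty:
                    return
                subprocess.run(cmd, shell=True)   # run the task on this slot
                tasks.task_done()

        threads = [threading.Thread(target=slot_worker) for _ in range(n_slots)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    # Example: four short tasks sharing two slots of the allocation
    run_tasks_in_batch_job(["echo md_task_1", "echo md_task_2", "echo hep_task_1", "echo hep_task_2"], 2)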

Status and near-term plans:
- Test runs with bulk task submission through Harvester and NGE
- Address bulk and performant submission (currently ~320 units in 4000 s, i.e. roughly one unit every ~12 s)
- Run at scale on Summit once issues with jsrun are addressed by IBM
- Conduct submission with relevant workloads from MD and HEP