A Possible OLCF Operational Model for HEP (2019+)

HPC cross-experiment discussion, CERN, May 10th, 2019
Jack C. Wells, Valentine Anantharaj – National Center for Computational Sciences, Oak Ridge National Laboratory
Shantenu Jha, Alexei Klimentov – Brookhaven National Laboratory
Kaushik De – University of Texas at Arlington

Simulation Science
"Summit": ~10^17 FLOPS; "ENIAC": ~10^3 FLOPS
(Figure panels: HEP simulation, neuroscience)
Computing has seen an unparalleled exponential development: over the last decades, supercomputer performance has grown by roughly 1000x every ~10 years. Almost all scientific disciplines have long embraced this capability.
Original slide from F. Schurmann (EPFL)

Oak Ridge Leadership Computing Facility (OLCF): Path to Exascale
Competitive procurement (for Frontier) asking for:
- 50–100x the application performance of Titan
- Support for traditional modeling and simulation, high-performance data analysis, and artificial intelligence applications
- Peak performance of at least 1300 PF
- Smooth transition for existing and future applications
System timeline:
- Jaguar (2008): 2.3 PF, world's fastest
- Titan (2012): 27 PF, accelerated computing, world's fastest
- Summit (2017): 200 PF, accelerated computing, 5–10x Titan performance
- Frontier (2021): >1000 PF, competitive procurement, 5–10x Summit performance

The Problem of Supercomputer-HTC Integration
How do we efficiently integrate supercomputing resources with distributed high-throughput computing (Grid) resources? The problem is more general than applying supercomputers to LHC data processing, or to experimental/observational needs in general.
From the perspective of large supercomputing centers, how do we best integrate large capability workloads (the traditional workloads of leadership computing facilities) with the large capacity workloads emerging from, e.g., experimental and observational data? Workflow management systems (WFMS) are needed to effectively integrate experimental and observational data into our data centers.
The Worldwide LHC Computing Grid and a leadership computing facility (LCF) are of comparable compute capacity: WLCG has 220,000 x86 compute cores; Titan has 300,000 x86 compute cores and 18,000 GPUs.
There is a well-defined opportunity to increase LCF utilization through backfill. Batch scheduling that prioritizes leadership-scale jobs results in ~90% utilization of available resources. But this is only a special case of the more general problem of capability-capacity integration (in other words, integration of experimental/observational data). Not all HPC centers are willing to let HEP payloads run in backfill mode.
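
To illustrate the backfill idea above, here is a minimal sketch, assuming a Moab-style scheduler whose showbf command reports idle nodes and how long they will stay idle (BigPanDA on Titan exploited backfill information in a similar spirit). The output parsing, column layout, and the job-shaping heuristic are simplifying assumptions for illustration, not OLCF's actual interface.

```python
import subprocess

def query_backfill():
    """Ask the scheduler how many nodes are free and for how long.

    Assumed (site-dependent) showbf layout per line:
    partition  tasks  nodes  start_offset  duration  start_date
    A real pilot would adapt this parsing to the local configuration.
    """
    try:
        out = subprocess.run(["showbf"], capture_output=True, text=True).stdout
    except FileNotFoundError:
        return []  # scheduler tools not available on this host
    slots = []
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[2].isdigit():
            nodes = int(fields[2])
            duration = fields[4]          # e.g. "2:00:00" or "INFINITY"
            slots.append((nodes, duration))
    return slots

def shape_job(slots, min_nodes=16, safety_margin=0.9):
    """Pick a job size that fits inside the largest available backfill slot."""
    if not slots:
        return None
    nodes, duration = max(slots, key=lambda s: s[0])
    if nodes < min_nodes:
        return None
    return {"nodes": int(nodes * safety_margin), "walltime": duration}

if __name__ == "__main__":
    job = shape_job(query_backfill())
    if job:
        print(f"Submit backfill job: {job['nodes']} nodes, walltime {job['walltime']}")
    else:
        print("No usable backfill slot right now; retry later.")
```

The point of the sketch is the workflow, not the parsing: a capacity workload manager repeatedly asks "what is idle, and for how long?" and shapes short, resizable payloads to fill those gaps without delaying leadership-scale jobs.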

Primary Allocation Programs for Access to the LCFs
Current distribution of allocable hours:
- INCITE – 60%: leadership-class computing
- ASCR Leadership Computing Challenge (ALCC) – 20%: DOE/SC capability computing
- Director's Discretionary – 20%: includes LCF strategic programs and ECP

OLCF Allocation Programs: Selecting Applications of National Importance

INCITE – 60% of resources
- Mission: high-risk, high-payoff science that requires LCF-scale resources
- Call frequency: open annually, April to June (allocation: January – December)
- Duration: 1–3 years, with yearly renewal
- Anticipated size: ~30 projects per year per center; 300K to 900K Summit node-hours/yr.
- Review process: scientific peer review and computational-readiness peer review, managed by the INCITE management committee (ALCF & OLCF)

ALCC – 20% of resources
- Mission: capability resources for science of interest to DOE
- Call frequency: open annually, Dec. to Feb. (allocation: July – June)
- Duration: 1 year
- Anticipated size: ~25 projects per year per center; 100K to 600K Summit node-hours/yr.
- Review process: peer review and alignment with goals, managed by the DOE Office of Science

Director's Discretionary – 20% of resources
- Mission: strategic LCF goals (including the Exascale Computing Project)
- Call frequency: open year round (ECP awards made quarterly)
- Duration: 3 months, 6 months, or 1 year
- Anticipated size: ~180 projects per center; 5K to 50K Summit node-hours
- Review process: managed locally at the LCF center

Availability (all programs): open to all scientific researchers and organizations, including industry
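
For scale, a back-of-the-envelope calculation (not from the slide, and assuming the commonly quoted figure of ~4,608 Summit compute nodes) shows what fraction of a full machine-year these award sizes represent:

```python
# Rough context for the award sizes above; the node count is an assumption,
# not taken from this slide.
SUMMIT_NODES = 4608
HOURS_PER_YEAR = 365 * 24

total_node_hours = SUMMIT_NODES * HOURS_PER_YEAR   # ~40.4M node-hours per year

for label, award in [("large INCITE award", 900_000),
                     ("large ALCC award", 600_000),
                     ("typical DD award", 50_000)]:
    share = 100 * award / total_node_hours
    print(f"{label}: {award:,} node-hours ~ {share:.1f}% of Summit for a year")
```

Under these assumptions, even the largest INCITE awards correspond to only a few percent of the machine-year, which is part of why backfill and capacity workloads are attractive for raising utilization.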

Highly Competitive Process: Summit User Program Update – May 2019
Early Science Program (ESP) on Summit:
- 25 proposals have been awarded time; work began January 2019
- The Early Science Program terminates at the end of June 2019
INCITE Program on Summit:
- 64 INCITE proposals requested Summit resources; 30 proposals were accepted
- 9 INCITE proposals were reviewed by the new "learning panel"; 2 of these were awarded projects
- 31 INCITE projects have been awarded time; work began January 2019
Nine 2019 ACM Gordon Bell nominee submissions came from work on Summit, with diverse topics spanning modeling & simulation, data analytics, and AI.
The 2019–2020 ALCC program on Summit will begin by 1 July 2019.
The 2020 INCITE proposal call was issued 15 April and closes 21 June 2019: http://www.doeleadershipcomputing.org/proposal/call-for-proposals/

Possible Operational Scenarios on Summit at OLCF?

Option A: Support user projects from the LHC using a PanDA instance
- Supported by OLCF, in collaboration with the ATLAS/PanDA team; PanDA instance @ ORNL
- We need a conversation about the details of support
- What are the advantages of having access to a pool of tasks from multiple user projects, external to Summit's queue, from which one could proactively backfill Summit? Is there an execution strategy that would benefit from access to a pool of "backfill tasks"?
- OLCF is not ready to move straight to Option A

Option B: Support user projects from ATLAS and other science projects using their WLMS of choice
- Kubernetes/OpenShift container orchestration (the "Slate" service) is available, but still in "pilot" development
- Each project would be responsible for deploying its WLMS/WFMS middleware on Slate (a sketch of what this could look like follows this slide)
- Enables access to wide-area, distributed task management and proactive backfill of Summit's queues (as demonstrated by the BigPanDA project @ Titan)
- Normal queue policies apply; special queue-policy requests can be considered
- OLCF can begin to move forward with Option B straight away

Option C: Support user projects from the LHC using PanDA and/or other science projects using their WLMS of choice
- This is a blending of Options A & B
- OLCF, in collaboration with the ATLAS/PanDA team, would support a PanDA instance (including Harvester and NGE)
- Projects would have a choice: use PanDA, or take responsibility for deploying their own WLMS/WFMS middleware on Slate
- Enables access to wide-area, distributed task management and proactive backfill of Summit's queues (as demonstrated by the BigPanDA project @ Titan)
- Normal queue policies apply; special queue-policy requests can be considered

LHC community input is needed. OLCF can move more quickly on the Option B capabilities than on those of Option A.
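
To make Option B more concrete, here is a minimal sketch, using the official Kubernetes Python client, of how a project might deploy a workload-management edge service into an OpenShift/Slate namespace. The namespace, image, Deployment name, and environment variable are hypothetical placeholders for illustration, not OLCF's or ATLAS's actual deployment recipe.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (e.g. after logging in to the
# Slate/OpenShift cluster). Inside the cluster one would use
# config.load_incluster_config() instead.
config.load_kube_config()

NAMESPACE = "my-experiment"                            # assumed project namespace on Slate
IMAGE = "registry.example.org/wfms/harvester:latest"   # placeholder middleware image

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="wfms-edge-service"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "wfms-edge-service"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "wfms-edge-service"}),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="wfms",
                        image=IMAGE,
                        # Hypothetical configuration hook; real middleware settings
                        # (queue names, endpoints, credentials) differ per project.
                        env=[client.V1EnvVar(name="TASK_QUEUE", value="summit-backfill")],
                    )
                ]
            ),
        ),
    ),
)

apps = client.AppsV1Api()
apps.create_namespaced_deployment(namespace=NAMESPACE, body=deployment)
print("Submitted Deployment 'wfms-edge-service' to namespace", NAMESPACE)
```

A real deployment would also need persistent storage, secrets for grid credentials, and outbound network policy; documenting exactly that recipe is what Option B asks of each project.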

Considerations for Implementing Option B
- The implementation and deployment will be facilitated by OLCF, in collaboration with the ATLAS/PanDA team
- Identify individuals who will develop an implementation strategy
- Contribute to the knowledge base by documenting the experience
- Develop and document a recipe for deploying the essential services
- Harden the process by enlisting friendly users to test the recipe for a set of use cases
- How do we make it as easy as possible for diverse user projects? The Kubernetes/OpenShift platform at OLCF is still maturing
- How do we develop an automated test suite? (A minimal sketch follows below)
- Identify risks and mitigation strategies
OLCF is ready to work on implementation 'today'
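
One possible starting point for the automated test suite question above: a couple of pytest-style smoke tests against the hypothetical middleware Deployment from the Option B sketch, verifying that the rollout is healthy before real payloads are run. All names are placeholders; a production suite would add end-to-end checks such as submitting and tracking a trivial job.

```python
# test_slate_deployment.py -- minimal smoke-test sketch (pytest style).
# Assumes the hypothetical 'wfms-edge-service' Deployment and namespace
# from the earlier example; adapt the names to the real recipe.
import pytest
from kubernetes import client, config

NAMESPACE = "my-experiment"          # assumed project namespace on Slate
DEPLOYMENT = "wfms-edge-service"     # assumed middleware Deployment name


@pytest.fixture(scope="module")
def apps_api():
    config.load_kube_config()        # credentials from the tester's kubeconfig
    return client.AppsV1Api()


def test_middleware_deployment_is_ready(apps_api):
    """All WFMS middleware replicas should report ready."""
    dep = apps_api.read_namespaced_deployment(DEPLOYMENT, NAMESPACE)
    assert dep.status.ready_replicas == dep.spec.replicas


def test_middleware_rollout_is_current(apps_api):
    """A stale observed generation usually means the rollout is stuck."""
    dep = apps_api.read_namespaced_deployment(DEPLOYMENT, NAMESPACE)
    assert dep.status.observed_generation >= dep.metadata.generation
```

Running such checks on a schedule (or after every recipe change) is one way to "harden the process" and catch regressions in the still-maturing Kubernetes/OpenShift platform before friendly users do.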