A Possible OLCF Operational Model for HEP (2019+)


1 A Possible OLCF Operational Model for HEP (2019+)
HPC cross-experiment discussion, CERN, May 10th, 2019
Jack C. Wells, Valentine Anantharaj – National Center for Computational Sciences, Oak Ridge National Laboratory
Shantenu Jha, Alexei Klimentov – Brookhaven National Laboratory
Kaushik De – University of Texas at Arlington

2 Simulation Science
[Figure: growth of computing from "ENIAC" (~10^3 FLOPS) to "Summit" (~10^17 FLOPS), spanning application areas such as HEP, simulation, and neuroscience]
Computing has seen an unparalleled exponential development.
In the last decades, supercomputer performance grew ~1000x every ~10 years.
Almost all scientific disciplines have long embraced this capability.
Original slide from F. Schurmann (EPFL)

3 Oak Ridge Leadership Computing Facility (OLCF): Path to Exascale
Frontier competitive procurement asking for:
- 50–100x the application performance of Titan
- Support for traditional modeling and simulation, high-performance data analysis, and artificial intelligence applications
- Peak performance of at least 1300 PF
- A smooth transition for existing and future applications
System roadmap (a quick arithmetic check of the generational factors follows this list):
- Jaguar (2008): 2.3 PF, world's fastest
- Titan (2012): 27 PF, accelerated computing, world's fastest
- Summit (2017): 200 PF, accelerated computing, 5–10x Titan performance
- Frontier (2021): >1000 PF, competitive procurement, 5–10x Summit performance
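A quick back-of-the-envelope check (an illustrative sketch using only the peak numbers quoted above, with 1300 PF taken as Frontier's floor) shows the peak ratios are consistent with the stated 5–10x generational factors:

```python
# Peak performance of successive OLCF systems, in petaflops (numbers from the roadmap above).
systems = {"Jaguar": 2.3, "Titan": 27, "Summit": 200, "Frontier": 1300}  # Frontier: "at least" 1300 PF

names = list(systems)
for prev, new in zip(names, names[1:]):
    factor = systems[new] / systems[prev]
    print(f"{new} vs {prev}: ~{factor:.1f}x peak")
# Prints roughly: Titan ~11.7x Jaguar, Summit ~7.4x Titan, Frontier at least ~6.5x Summit.
```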

4 The Problem of Supercomputer-HTC Integration
How do we efficiently integrate supercomputing resources with distributed high-throughput computing (Grid) resources?
- The problem is more general than applying supercomputers to LHC data processing (or to experimental-observational needs in general).
- From the perspective of large supercomputing centers: how best to integrate large capability workloads (e.g., the traditional workloads of leadership computing facilities) with the large capacity workloads emerging from, e.g., experimental and observational data?
- Workflow management systems (WFMS) are needed to effectively integrate experimental and observational data into our data centers.
- The Worldwide LHC Computing Grid (WLCG) and a leadership computing facility (LCF) are of comparable compute capacity: WLCG has ~220,000 x86 compute cores; Titan has ~300,000 x86 compute cores and 18,000 GPUs.
- There is a well-defined opportunity to increase LCF utilization through backfill: batch scheduling that prioritizes leadership-scale jobs results in ~90% utilization of available resources, leaving gaps that smaller, shorter jobs can fill (a sketch of such a backfill loop follows this list).
- But this is only a special case of the more general problem of capability-capacity integration (in other words, of integrating experimental-observational data).
- Not all HPC centers are willing to allow HEP payloads to run in backfill mode.
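As an illustration of the backfill idea above (a minimal sketch, not OLCF's or BigPanDA's actual implementation), the loop below matches a pool of pre-staged, small tasks against the scheduler's currently idle node-time windows. The `BackfillWindow` values, the task pool, and the selection policy are placeholder assumptions; a real version would query the site batch system (e.g. LSF's `bslots` on Summit or Moab's `showbf` on Titan) instead of returning hard-coded windows.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BackfillWindow:
    nodes: int      # nodes currently idle
    minutes: int    # how long they stay idle before the next large job needs them

def query_backfill_windows() -> List[BackfillWindow]:
    """Return the scheduler's current backfill windows.

    Placeholder: a real implementation would query the site batch system
    (e.g. LSF's `bslots` on Summit or Moab's `showbf` on Titan) and parse
    its output; the values below are illustrative only.
    """
    return [BackfillWindow(nodes=120, minutes=45), BackfillWindow(nodes=16, minutes=180)]

def pick_payload(task_pool, window):
    """Pick the largest pre-staged task that fits inside a backfill window."""
    fitting = [t for t in task_pool
               if t["nodes"] <= window.nodes and t["walltime_min"] <= window.minutes]
    return max(fitting, key=lambda t: t["nodes"] * t["walltime_min"], default=None)

if __name__ == "__main__":
    # Hypothetical pool of pre-staged HEP tasks (e.g. event-simulation chunks)
    # fed by a workflow manager such as PanDA/Harvester.
    pool = [{"name": "simul_small", "nodes": 16, "walltime_min": 120},
            {"name": "simul_large", "nodes": 100, "walltime_min": 30}]
    for window in query_backfill_windows():
        task = pick_payload(pool, window)
        if task is not None:
            print(f"would submit {task['name']} into a window of "
                  f"{window.nodes} nodes free for {window.minutes} min")
```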

5 Primary Allocation Programs for Access to the LCFs
Current distribution of allocable hours:
- 60% INCITE: leadership-class computing
- 20% ASCR Leadership Computing Challenge (ALCC): DOE/SC capability computing
- 20% Director's Discretionary: includes LCF strategic programs and ECP

6 OLCF Allocation Programs: Selecting Applications of National Importance
- Programs: INCITE (60% of resources); ALCC (20% of resources); Director's Discretionary (20% of resources)
- Mission: INCITE – high-risk, high-payoff science that requires LCF-scale resources; ALCC – capability resources for science of interest to DOE; Director's Discretionary – strategic LCF goals (including the Exascale Computing Project)
- Call frequency (allocation year): INCITE – open annually, April to June (allocation January–December); ALCC – open annually, December to February (allocation July–June); Director's Discretionary – open year round (ECP awards made quarterly)
- Duration: INCITE – 1–3 years with yearly renewal; ALCC – 1 year; Director's Discretionary – 3 months, 6 months, or 1 year
- Anticipated size: INCITE – ~30 projects per year per center, 300K–900K Summit node-hours/yr.; ALCC – ~25 projects per year per center, 100K–600K Summit node-hours/yr.; Director's Discretionary – ~180 projects per center, 5K–50K Summit node-hours (a rough scale calculation follows this list)
- Review process: scientific peer review, computational readiness peer review, and alignment with program goals; managed by the INCITE management committee (ALCF & OLCF), the DOE Office of Science, and the LCF center, respectively
- Availability: open to all scientific researchers and organizations, including industry
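For a rough sense of scale (an illustrative calculation, not from the slides: Summit's ~4,608 compute nodes and an assumed 90% scheduled availability are my assumptions), these award sizes correspond to the following fractions of a Summit machine-year:

```python
# Rough scale of Summit allocations.
# Assumptions (not from the slides): ~4,608 compute nodes, 90% scheduled availability.
NODES = 4608
AVAILABILITY = 0.90
node_hours_per_year = NODES * 24 * 365 * AVAILABILITY   # ~36 million node-hours

for program, award in [("INCITE (upper end)", 900_000),
                       ("ALCC (upper end)", 600_000),
                       ("Director's Discretionary (upper end)", 50_000)]:
    print(f"{program}: {award:,} node-hours ~= {award / node_hours_per_year:.1%} of a Summit year")
```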

7 A Highly Competitive Process: Summit User Program Update – May 2019
Early Science Program (ESP) on Summit:
- 25 proposals have been awarded time; work began January 2019.
- The Early Science Program terminates at the end of June.
INCITE program on Summit:
- 64 INCITE proposals requested Summit resources; 30 proposals were accepted.
- 9 INCITE proposals were reviewed by the new "learning panel"; 2 of these were awarded projects.
- 31 INCITE projects have been awarded time; work began January 2019.
Nine 2019 ACM Gordon Bell nominee submissions came from work on Summit, with diverse topics spanning modeling & simulation, data analytics, and AI.
The ALCC program on Summit will begin by 1 July 2019.
The 2020 INCITE proposal call was issued 15 April and closes 21 June 2019.

8 Possible Operational Scenarios on Summit at OLCF
Option A: Support user projects from the LHC using a PanDA instance
- Supported by OLCF, in collaboration with the ATLAS/PanDA team; we need a conversation about the details of support.
- What are the advantages of having access to a pool of tasks from multiple user projects, external to Summit's queue, from which one could proactively backfill Summit?
- Is there an execution strategy that would benefit from access to a pool of "backfill tasks"?
- OLCF is not ready to move straight to Option A.
Option B: Support user projects from ATLAS and other science projects in using their WLMS of choice
- Kubernetes/OpenShift container orchestration (the "Slate" service) is available, but still in "pilot" development.
- Each project would be responsible for deploying its WLMS/WFMS middleware on Slate (see the deployment sketch after this slide's text).
- Enables access to wide-area, distributed task management and proactive backfill of Summit's queues (as demonstrated by the BigPanDA project @ Titan).
- Normal queue policies apply; special queue-policy requests can be considered.
- OLCF can begin to move forward with Option B straight away.
Option C: Support user projects from the LHC using PanDA and/or other science projects in using their WLMS of choice
- This is a "blending" of Options A & B.
- OLCF, in collaboration with the ATLAS/PanDA team, would support a PanDA instance (including Harvester and NGE).
- Projects would have the choice to use PanDA or to take responsibility for deploying their own WLMS/WFMS middleware on Slate.
- Enables access to wide-area, distributed task management and proactive backfill of Summit's queues (as demonstrated by the BigPanDA project @ Titan).
- Normal queue policies apply; special queue-policy requests can be considered.
LHC community input is needed. OLCF can move more quickly on the Option B capabilities than on those in Option A.
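To make Option B concrete, here is a minimal sketch (not an OLCF-provided recipe) of how a project might deploy its workload-management middleware, e.g. a PanDA Harvester instance, onto a Kubernetes/OpenShift namespace such as Slate using the official Kubernetes Python client. The image name, the `hep-wlms` namespace, and the replica count are placeholder assumptions; authentication, persistent storage, and site-specific Harvester configuration are omitted.

```python
from kubernetes import client, config

def deploy_wlms(namespace="hep-wlms", image="example.registry/harvester:latest"):
    """Create a single-replica Deployment for a WLMS/WFMS service.

    Namespace and image are illustrative placeholders; a real Slate
    deployment would also need secrets, volumes, and site configuration.
    """
    config.load_kube_config()                      # use local kubeconfig credentials
    container = client.V1Container(
        name="wlms",
        image=image,
        resources=client.V1ResourceRequirements(
            requests={"cpu": "1", "memory": "2Gi"}),
    )
    spec = client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "wlms"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "wlms"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    )
    deployment = client.V1Deployment(
        api_version="apps/v1",
        kind="Deployment",
        metadata=client.V1ObjectMeta(name="wlms", namespace=namespace),
        spec=spec,
    )
    client.AppsV1Api().create_namespaced_deployment(namespace=namespace, body=deployment)

if __name__ == "__main__":
    deploy_wlms()
```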

9 Considerations for implementing Option B:
- Implementation and deployment will be facilitated by OLCF, in collaboration with the ATLAS/PanDA team.
- Identify individuals who will develop an implementation strategy.
- Contribute to the knowledge base by documenting the experience.
- Develop and document a recipe for deploying the essential services.
- Harden the process by enlisting friendly users to test the recipe for a set of use cases.
- How do we make it as easy as possible for diverse user projects? The Kubernetes/OpenShift platform at OLCF is still maturing.
- How do we develop an automated test suite? (A minimal smoke-test sketch follows this list.)
- Identify risks and mitigation strategies.
- OLCF is ready to work on implementation 'today'.
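As one possible starting point for the automated test suite mentioned above, a minimal pytest-style smoke test (an illustrative sketch, not an OLCF deliverable) could verify that the middleware pods deployed on the Kubernetes/OpenShift platform are actually running. The `hep-wlms` namespace and the `app=wlms` label match the placeholder deployment sketch under Option B and are assumptions.

```python
import pytest
from kubernetes import client, config

NAMESPACE = "hep-wlms"          # placeholder namespace from the deployment sketch
LABEL_SELECTOR = "app=wlms"     # placeholder label from the deployment sketch

@pytest.fixture(scope="module")
def core_api():
    config.load_kube_config()   # or load_incluster_config() when run inside the cluster
    return client.CoreV1Api()

def test_wlms_pods_are_running(core_api):
    """All middleware pods matching the label should be in the Running phase."""
    pods = core_api.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR).items
    assert pods, f"no pods found in {NAMESPACE} with selector {LABEL_SELECTOR}"
    for pod in pods:
        assert pod.status.phase == "Running", f"{pod.metadata.name} is {pod.status.phase}"
```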

