BigPanDA US ATLAS SW&C Technical Meeting August 1, 2016 Alexei Klimentov Brookhaven National Laboratory.


Outline
Introduction
BigPanDA projects
– DOE HEP and ASCR project “Next generation workload management system for Big Data” – BigPanDA; 2012 – Sep 2015, extended for 1 year until Sep 2016
  Project scope
  Work packages
  Highlights and lessons learned
– DOE ASCR project “BigPanDA Workflow Management on Titan for High Energy and Nuclear Physics and for Future Extreme Scale Scientific Applications” – BigPanDA++
  Project scope
  Work packages
  Status and plans
Summary
DOE Exascale Computing Project (ECP)

PanDA’s Success
PanDA was able to cope with increasing LHC luminosity, the ATLAS data-taking rate, and processing and analysis challenges
Adapted to the evolution of the ATLAS computing model
– New workflows have been added: open-ended production, trains, Tier-0 spill-over, HLT reprocessing
HEP and astro-particle experiments (COMPASS and AMS) have chosen PanDA as their workload management system for data processing and analysis. ALICE is interested in evaluating PanDA for OLCF. JINR (Dubna) is considering PanDA as the main WMS for the NICA collider
Several PanDA instances beyond ATLAS: Dubna, Moscow, Taiwan, Amazon EC2
PanDA was cited in the document titled “Fact sheet: Big Data across the Federal Government”, prepared by the Executive Office of the President of the United States, as an example of successful technology already in place at the time of the “Big Data Research and Development Initiative” announcement
The PanDA team has crucial responsibilities in ATLAS distributed software

BigPanDA Project
Proposal titled “Next Generation Workload Management and Analysis System for BigData” – BigPanDA – was submitted to DOE ASCR in April 2012. DOE ASCR and HEP funded a 3-year project that started in Sep 2012 and was extended for one year (until Sep 2016)
– Generalization of PanDA as a meta application, providing location transparency of processing and data management for HEP and other data-intensive sciences, and a wider exascale community
Three dimensions to the evolution of PanDA
– Making PanDA available beyond ATLAS and High Energy Physics
– Extending beyond the Grid (Leadership Computing Facilities, clouds, university clusters)
– Integration of the network as a resource in workload management
There were many commonalities with what was needed for ATLAS

April 27, 2012 (slide image)

BigPanDA Project. Cont’d
Work packages:
– WP1 (Factorizing the core): Factorizing the core components of PanDA to enable adoption by a wide range of exascale scientific communities (K.De)
– WP2 (Extending the scope): Evolving PanDA to support extreme-scale computing clouds and Leadership Computing Facilities (S.Panitkin)
– WP3 (Leveraging intelligent networks): Integrating network services and real-time data access into the PanDA workflow (D.Yu)
– WP4 (Usability and monitoring): Real-time monitoring and visualization package for PanDA (T.Wenaus)
BigPanDA team (2016)
– BNL, UTA, Rutgers U: Shantenu Jha, UTK: K.Read, ORNL: J.Wells, Dubna (COMPASS, NICA), “Kurchatov Institute”, ASGC TW (AMS). BigPanDA meeting at CERN in Apr 2016
– People supported by the project: F.Barreiro, T.Maeno, D.Oleynik, S.Padolski, S.Panitkin, A.Petrosyan, S.Schovancova
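As an illustration of the WP1 “factorizing the core” idea, the sketch below shows how a VO-independent core could delegate experiment-specific policy to pluggable components. This is a minimal sketch with hypothetical class and method names, not the actual PanDA code base.

```python
# Minimal sketch of the "factorized core" idea: hypothetical names, not real PanDA code.
from abc import ABC, abstractmethod


class ExperimentPlugin(ABC):
    """Experiment-specific policy injected into a VO-independent core."""

    @abstractmethod
    def validate_job(self, job: dict) -> bool: ...

    @abstractmethod
    def choose_queue(self, job: dict, queues: list) -> str: ...


class AtlasLikePlugin(ExperimentPlugin):
    """Example plugin standing in for an ATLAS-style VO; purely illustrative."""

    def validate_job(self, job):
        return "transformation" in job and "input_dataset" in job

    def choose_queue(self, job, queues):
        # Prefer HPC queues for large simulation tasks, otherwise take any queue.
        hpc = [q for q in queues if q.startswith("hpc_")]
        return (hpc or queues)[0]


class CoreServer:
    """VO-independent core: contains no experiment-specific logic."""

    def __init__(self, plugin, queues):
        self.plugin, self.queues = plugin, queues

    def submit(self, job):
        if not self.plugin.validate_job(job):
            raise ValueError("job rejected by experiment plugin")
        return self.plugin.choose_queue(job, self.queues)


if __name__ == "__main__":
    server = CoreServer(AtlasLikePlugin(), ["grid_analysis", "hpc_titan"])
    print(server.submit({"transformation": "simulate", "input_dataset": "mc16_evgen"}))
```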

The Growing PanDA Ecosystem
ATLAS PanDA – US ATLAS, CERN, UK, DE, ND, CA, Dubna, Protvino, MEPhI, OSG …
DOE ASCR/HEP BigPanDA
DOE ASCR BigPanDA++
NSF ANSE PanDA – NSF-funded network project – Caltech, Michigan, Vanderbilt, UTA
HPC and Cloud PanDA
Taiwan PanDA – AMS and other communities
megaPanDA, PanDA – new communities including bioinformatics – RF Ministry of Education and Science funded project at NRC KI
AliEn PanDA, LSST PanDA, NICA, COMPASS, other experiments
…and a growing PanDA SW development community, with contributions from Rutgers U, OLCF, “Kurchatov Institute”, Dubna

Contribution to ATLAS
US ATLAS contribution to Monte-Carlo production, Jan – Jul 2016
OLCF in %: all jobs 15.3%, CPU consumption 32%, all produced files
Up to 2.5M CPU hours per week
[Plots: completed good jobs, files produced, CPU consumption on Titan, monthly CPU consumption on Titan]
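For scale (a back-of-the-envelope estimate, not a number from the slide), 2.5M CPU-hours delivered in one week corresponds to roughly

\[
\frac{2.5\times 10^{6}\ \text{CPU-hours}}{7\times 24\ \text{hours}} \approx 1.5\times 10^{4}\ \text{CPU cores in continuous use.}
\]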

PanDA quad chart slide for Barbara Helland of DOE (slide image)

BigPanDA Workflow Management on Titan for High Energy and Nuclear Physics and for Future Extreme Scale Scientific Applications
White paper: May 2015; Proposal: Jan 2016; Funding: Jul 2016
Builds on the BigPanDA project’s success
New collaborators: OLCF, Rutgers University
Support from the DOE Office of Science: HEP, NP, Fusion, ASCR
Participating institutions: BNL, ORNL, Rutgers U, UTA

BigPanDA++ Project Research Goal
It is an overarching goal of this proposal to translate the R&D artifacts and accomplishments of the BigPanDA and AIMES projects into LCF operational advances and enhancements. The proposed solution will provide an important model for future exascale computing, increasing the coherence between the technology base used for high-performance, scalable modeling and simulation and that used for data-analytic computing, and thereby directly advancing the first of the five objectives of the National Strategic Computing Initiative announced by the White House in July 2015. This proposal also directly advances the first of five strategic themes articulated by the White House Office of Science and Technology Policy (OSTP) to “create systems that can apply exaflops of computing power to exabytes of data.”
We propose a novel and unique approach to workload management for current and future leadership computing facilities, which will have enormous impact on scientific communities in HEP, NP and beyond. Through previous DOE ASCR and HEP research projects, we have demonstrated the feasibility of our approach. In this proposal, we present a work plan to build a production service that will quickly and effectively enable cutting-edge science.

BigPanDA++ Project Research Plan
Track 1: Workload Management System. PI: K.De (UTA). Track one will yield workload management systems of the sophistication and scale required to meet the science objectives of this proposal. This will entail invasive and significant architectural modifications to optimize multi-job capabilities per PanDA pilot, software engineering work such as improved software packaging and compilation, and optimization of application execution characteristics on Titan. Track one will also support more heterogeneous workloads through generalized job definitions and enhanced, flexible support for MPI-based workloads. This track provides important improvements to the interface layer between workload management systems (including application characteristics) and OLCF as a facility.
Track 2: Data Handling and Data Management. PI: S.Panitkin (BNL). The second track will yield the critical ability to integrate data-intensive workloads with large-scale parallelism. The next-generation workload management systems will entail large data volumes and thereby stress the data pipes at multiple levels, ranging from distributed transfer in and out of Titan to runtime movement, I/O bandwidth and memory-bus transfer. It is necessary to characterize the performance of data movement at each of these levels, so as to develop models that permit an analysis of the various trade-offs and optimizations needed. Track two will investigate these issues in conjunction with the scalable workload management systems outlined in track one and the architectural considerations of Titan and subsequent LCF systems.
Track 3: Abstraction and Future. PI: S.Jha (Rutgers U). The third track will translate the “in vitro” conceptual and fundamental advances from the DOE ASCR AIMES project and deeply integrate them “in vivo” with the middleware capabilities of this project. Specifically, the third track will provide integrated scheduling, workload decomposition, and pilot size and placement algorithms, so as to take the advances from tracks one and two beyond simple backfilling approaches and make them suitable for advanced objectives and multiple concurrent and heterogeneous workloads. We expect this track to play an important role in developing and testing models for future exascale systems based on experience with current HPC systems.
Track 4: Operations and Operational Model. PI: J.Wells (ORNL). The fourth track will ensure the complete translation of research and development from tracks 1-3 into a scalable and sustainable operational activity at OLCF, while providing a model for other DOE facilities and future workloads and platforms. The primary goal is to ensure the success of the scientific goals of ATLAS and ALICE, providing sustained operations and utilization of OLCF. Next, building on the developments from tracks 1-3, we propose to deploy an independent PanDA server instance at OLCF, operated locally and capable of serving the broader user community at OLCF, with a pluggable interface to other workloads and workflows. While it will take more than the two-year scope of this project to integrate other applications and user groups, we plan to achieve the fundamental goal of setting up the infrastructure that will make this possible after the two-year duration of this project.
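To make the backfilling discussion in tracks 1 and 3 concrete, here is a minimal, hypothetical sketch of how a pilot might size an MPI payload to fit the backfill window reported by the batch system. The `query_backfill` helper, the canned fallback values, and the job parameters are illustrative assumptions, not the project’s actual implementation.

```python
# Hypothetical sketch: size a pilot's MPI payload to fit an available backfill slot.
# The backfill query and job parameters are illustrative, not the real PanDA pilot code.
import subprocess
from dataclasses import dataclass


@dataclass
class BackfillSlot:
    nodes: int      # nodes currently idle between scheduled jobs
    minutes: int    # how long they stay idle


def query_backfill() -> BackfillSlot:
    """Ask the batch system for the current backfill window.

    On a Moab-managed system this could parse `showbf` output; here we
    return a fixed value so the sketch runs anywhere.
    """
    try:
        subprocess.run(["showbf"], capture_output=True, timeout=5)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        pass  # not on an HPC login node; fall back to a canned answer
    return BackfillSlot(nodes=300, minutes=90)


def size_payload(slot: BackfillSlot, max_nodes: int, walltime_min: int) -> int:
    """Choose how many nodes the MPI payload should request."""
    if slot.minutes < walltime_min:
        return 0  # window too short; skip this cycle
    return min(slot.nodes, max_nodes)


if __name__ == "__main__":
    slot = query_backfill()
    nodes = size_payload(slot, max_nodes=1000, walltime_min=60)
    print(f"backfill window: {slot.nodes} nodes for {slot.minutes} min -> request {nodes} nodes")
```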

BigPanDA++ Status
Funding started on Jul 25th, 2016
– S.Panitkin (BNL): 1 FTE starting Oct 1st, 2016
– D.Oleynik (UTA): 0.5 FTE starting Oct 1st, 2016
– New hires at BNL and UTA (TBD)
Kick-off meeting last week (Jul 29th)
Technical plan and revised milestones in a month; they will be presented to the DOE ASCR Project Manager in September

Summary
The (Big)PanDA team plays a vital role in ATLAS Distributed Computing
– The production system based on PanDA/JEDI was designed and commissioned in one year
– Titan is integrated with the ATLAS production system and delivers millions of CPU hours to ATLAS
– Extra FTEs were committed to ATLAS from the BigPanDA project(s) to work on monitoring, pilot, ML, DKB, data management
DOE ASCR gave us a great opportunity to evolve PanDA beyond ATLAS and HEP, to start the “BigPanDA” project and to find new collaborators (OLCF and Rutgers groups)
– Progress in many areas: VO-independent PanDA instance, cloud computing, HPC and LCF, monitoring, non-Oracle database backend
– Very beneficial for ATLAS
Strong interest in the project from several experiments (disciplines) and scientific centers in a joint project
Strong interest in, and support for, the project from DOE (HEP and ASCR)

Exascale Computing Project (ECP)
On July 29, 2015, the President established the National Strategic Computing Initiative (NSCI) to maximize the benefits of HPC for US economic competitiveness and scientific discovery
DOE is a lead agency within NSCI, responsible for ensuring that the DOE Office of Science and the DOE National Nuclear Security Administration execute a joint program focused on advanced simulation through a capable exascale computing program emphasizing sustained performance on relevant applications
The ECP is a lab-led project transitioning from DOE exascale research activities
– Project Director: Paul Messina, ANL
– Deputy: Stephen Lee, LANL
ECP has a detailed integrated timeline

ECP Goals
Develop a broad set of modeling and simulation applications that meet the requirements of the scientific, engineering, and nuclear security programs of the Department of Energy and the NNSA
Develop a productive exascale capability in the US by 2023, including the required software and hardware technologies
Prepare two or more DOE Office of Science and NNSA facilities to house this capability
Maximize the benefits of HPC for US economic competitiveness and scientific discovery

ECP Technical Approach
ECP will pursue a ten-year plan structured into four focus areas:
Application Development – deliver scalable science and mission performance on a suite of ECP applications that are ready for efficient execution on the ECP exascale systems
Software Technology – enhance the software stack that DOE SC and NNSA applications rely on to meet the needs of exascale applications, and evolve it to use exascale systems efficiently; conduct R&D on tools and methods that enhance productivity and facilitate portability
– Software Technology Director: Rajeev Thakur, ANL
– Deputy: Pat McCormick, LANL
– WBS element: Data Management and Workflow
Hardware Technology – fund supercomputer vendors to carry out the research and development of hardware-architecture designs needed to build and support the exascale systems
Exascale Systems – fund testbeds, advanced system engineering development (NRE) by the vendors, incremental site preparation, and the cost of system expansion needed to acquire capable exascale systems

ECP Activity Solicitation and Selection Process
Request For Information (RFI) – Mar 2016
– 2-3 page white paper (Mar 2016): “An Exascale Workflow Management System Based on PanDA and Its Integration with ESnet”
– Alexei Klimentov (lead PI), Sergey Panitkin, Torre Wenaus, Dantong Yu; Brookhaven National Laboratory
– Kaushik De, Danila Oleynik; University of Texas at Arlington
– Eli Dart, Inder Monga, Chin Guok, Brian Tierney; Energy Sciences Network
– Shantenu Jha; Rutgers University
– Vakhtang Tsulaia; Lawrence Berkeley National Laboratory
Selection criteria include quality and makeup of the team, relevance to exascale, match to mission needs, and technical feasibility (Apr – Jun)
Request for Proposal (RFP) – Jul 2016
– Deadline Aug 10th