Overview of ASCR “Big PanDA” Project
Alexei Klimentov, Brookhaven National Laboratory
PanDA Workshop, UTA – Arlington, TX, September 4, 2013

Main topics
– Introduction
  – PanDA in ATLAS
  – PanDA philosophy
  – PanDA’s success
– ASCR project “Next generation workload management system for Big Data” – Big PanDA
  – Project scope
  – Work packages
  – Status and plans
– Summary and conclusions

PanDA in ATLAS
The ATLAS experiment at the LHC is a Big Data experiment
– The ATLAS detector generates about 1 PB of raw data per second – most of it is filtered out
– As of 2013 the ATLAS Distributed Data Management system manages ~140 PB of data, distributed world-wide to 130 WLCG computing centers
– Expected rate of data influx into the ATLAS Grid is ~40 PB of data per year
– Thousands of physicists from ~40 countries analyze the data
The PanDA (Production and Data Analysis) project was started in Fall 2005
– Goal: an automated yet flexible workload management system (WMS) which can optimally make distributed resources accessible to all users
– Originally developed in the US for US physicists
– Adopted as the ATLAS-wide WMS in 2008 (first LHC data in 2009) for all computing applications
– Now successfully manages O(10^2) sites, O(10^5) cores, O(10^8) jobs per year, O(10^3) users

PanDA Philosophy
PanDA Workload Management System design goals
– Deliver transparency of data processing in a distributed computing environment
– Achieve a high level of automation to reduce operational effort
– Flexibility in adapting to evolving hardware, computing technologies and network configurations
– Scalable to the experiment’s requirements
– Support diverse and changing middleware
– Insulate users from hardware, middleware, and all other complexities of the underlying system (see the sketch below)
– Unified system for central Monte Carlo production and user data analysis
  – Support custom workflows of individual physicists
– Incremental and adaptive software development
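The “insulation” goal is delivered by PanDA’s pilot-based, pull-style dispatch (the pilot reappears under WP1 below). The minimal Python sketch that follows illustrates that idea only; the server URL, endpoint names and payload fields are assumptions, not the real PanDA server API.

```python
# Minimal sketch of a pull-style pilot: it runs on a worker node, asks the
# server for a matching job, runs the payload, and reports the result back.
# Server URL, endpoints and field names are illustrative, not the PanDA API.
import subprocess

import requests

PANDA_SERVER = "https://panda.example.org"   # hypothetical server URL


def run_pilot(site_name: str) -> None:
    # 1. Ask for a job only once a slot is actually available ("late binding").
    resp = requests.post(f"{PANDA_SERVER}/getJob", data={"site": site_name})
    job = resp.json()
    if not job:
        return  # nothing to run; the pilot simply exits

    # 2. Run whatever payload was dispatched; the user never deals with the
    #    site, middleware or batch system behind it.
    result = subprocess.run(job["command"], shell=True)

    # 3. Report the outcome so the server can mark the job done or retry it.
    requests.post(f"{PANDA_SERVER}/updateJob",
                  data={"jobID": job["jobID"], "exitCode": result.returncode})


if __name__ == "__main__":
    run_pilot("EXAMPLE_SITE")
```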

PanDA’s Success
– PanDA was able to cope with the increasing LHC luminosity and ATLAS data-taking rate
– Adapted to the evolution of the ATLAS computing model
– Two leading HEP and astro-particle experiments (CMS and AMS) have chosen PanDA as the workload management system for data processing and analysis. ALICE is interested in evaluating PanDA for Grid MC production and LCFs.
– PanDA was chosen as a core component of the Common Analysis Framework by the CERN-IT/ATLAS/CMS project
– PanDA was cited in the document titled “Fact sheet: Big Data across the Federal Government”, prepared by the Executive Office of the President of the United States, as an example of successful technology already in place at the time of the “Big Data Research and Development Initiative” announcement

Evolving PanDA for Advanced Scientific Computing
A proposal titled “Next Generation Workload Management and Analysis System for BigData” – Big PanDA – was submitted to DoE ASCR in April 2012. The DoE ASCR and HEP funded project started in September 2012.
– Generalization of PanDA as a meta application, providing location transparency of processing and data management, for HEP and other data-intensive sciences, and a wider exascale community
Other efforts
– PanDA: US ATLAS funded project (Sep 3 talks)
– Networking: Advanced Network Services (ANSE funded project, S. McKee’s talk)
There are three dimensions to the evolution of PanDA
– Making PanDA available beyond ATLAS and High Energy Physics
– Extending beyond the Grid (Leadership Computing Facilities, clouds, university clusters)
– Integration of the network as a resource in workload management

[Image slide, dated April 27, 2012]

“Big PanDA” Work Packages
– WP1 (Factorizing the core): Factorizing the core components of PanDA to enable adoption by a wide range of exascale scientific communities (K. De)
– WP2 (Extending the scope): Evolving PanDA to support extreme-scale computing clouds and Leadership Computing Facilities (S. Panitkin)
– WP3 (Leveraging intelligent networks): Integrating network services and real-time data access into the PanDA workflow (D. Yu)
– WP4 (Usability and monitoring): Real-time monitoring and visualization package for PanDA (T. Wenaus)
There are many commonalities with what we need for ATLAS

BigPanDA work plan
Three-year plan
– Year 1: setting up the collaboration, defining algorithms and metrics
– Year 2: prototyping and implementation; 2014 is an especially important year
– Year 3: production and operations
The “Big PanDA” project will help to generalize PanDA and to use it beyond ATLAS and HEP
– While being careful not to allow distractions from ATLAS PanDA needs, i.e. the effort has to be incremental over that required by ATLAS PanDA (which itself has a substantial to-do list)
– We work very closely with the PanDA core software developers (Tadashi and Paul)

“Big PanDA” first steps: forming the team
Funding started Sep 1, 2012
– Hiring process was completed in May–June 2013
Development team is formed:
– Sergey Panitkin (BNL), 0.5 FTE starting from Oct 1, 2012 – HPC and cloud computing
– Jaroslava Schovancova (BNL), 1 FTE starting from June 3, 2013 – PanDA core software
– Danila Oleynik (UTA), 1 FTE starting from May 7, 2013 – HPC and PanDA core software
– Artem Petrosyan (UTA), 0.5 FTE starting from May 7, 2013 – intelligent networking and monitoring
There is a commitment from Dubna to support Artem and Danila for 1–2 more years

“Big PanDA”. Status. WP1
WP1 (Factorizing the core)
– Evolving the PanDA pilot
  – Until recently the pilot has been ATLAS-specific, with lots of code only relevant for ATLAS
  – To meet the needs of the Common Analysis Framework project, the pilot is being refactored
– Experiments as plug-ins (see the sketch below)
  – Introducing new experiment-specific classes, enabling better organization of the code
  – E.g. containing methods for how a job should be set up, metadata and site information handling, etc., that are unique to each experiment
  – CMS experiment classes have been implemented
– Changes are being introduced gradually, to avoid affecting current production
(P. Nilsson’s talk)
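A rough illustration of the “experiments as plug-ins” structure, assuming hypothetical class and method names (the refactored pilot’s actual code will differ): the generic pilot only touches a common base class, while ATLAS- and CMS-specific behaviour lives in subclasses selected at run time.

```python
# Hypothetical plug-in layout: a common Experiment interface plus
# experiment-specific subclasses; names and shell commands are illustrative.
class Experiment:
    """Behaviour every experiment must provide to the generic pilot code."""
    def setup_job(self, job: dict) -> str:
        raise NotImplementedError       # e.g. environment / release setup
    def handle_metadata(self, job: dict) -> dict:
        raise NotImplementedError       # experiment-specific metadata handling
    def get_site_information(self) -> dict:
        raise NotImplementedError       # how site information is discovered


class ATLASExperiment(Experiment):
    def setup_job(self, job: dict) -> str:
        return f"source $ATLAS_SETUP; asetup {job['release']}"
    def handle_metadata(self, job: dict) -> dict:
        return {"guid": job.get("guid"), "events": job.get("nEvents")}
    def get_site_information(self) -> dict:
        return {"source": "AGIS"}       # ATLAS grid information system


class CMSExperiment(Experiment):
    def setup_job(self, job: dict) -> str:
        return "source $CMS_SETUP; eval $(scram runtime -sh)"
    def handle_metadata(self, job: dict) -> dict:
        return {"dataset": job.get("dataset")}
    def get_site_information(self) -> dict:
        return {"source": "SITECONF"}


_REGISTRY = {"ATLAS": ATLASExperiment, "CMS": CMSExperiment}


def get_experiment(name: str) -> Experiment:
    """The rest of the pilot only ever sees the base-class interface."""
    return _REGISTRY[name]()
```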

“Big PanDA”. Status. WP1. Cont’d
WP1 (Factorizing the core)
– PanDA back to MySQL, VO-independent
  – It will be used as a test-bed for non-LHC experiments
  – A PanDA instance with all functionalities is installed and running on Amazon EC2. Database migration from Oracle to MySQL is finished. The instance is VO-independent (see the sketch below).
  – LSST MC production is the first use case for the new instance
– Next step will be refactoring the PanDA monitoring package
(J. Schovancova’s talk)
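The migration is tractable because backend-specific code can be isolated behind one thin connection layer while the rest of the server issues plain SQL. The sketch below is an assumption-laden illustration of that design choice (environment-variable and table names are made up), not the PanDA server’s actual database layer.

```python
# Illustrative backend-selection layer: the same server code runs on Oracle
# or MySQL; only get_connection() knows which driver is behind it.
import os


def get_connection():
    """Return a DB-API connection for the configured backend."""
    backend = os.environ.get("PANDA_DB_BACKEND", "mysql")
    if backend == "mysql":
        import MySQLdb                      # assumes mysqlclient is installed
        return MySQLdb.connect(host=os.environ["PANDA_DB_HOST"],
                               user=os.environ["PANDA_DB_USER"],
                               passwd=os.environ["PANDA_DB_PASS"],
                               db=os.environ.get("PANDA_DB_NAME", "pandadb"))
    if backend == "oracle":
        import cx_Oracle                    # assumes cx_Oracle is installed
        return cx_Oracle.connect(os.environ["PANDA_DB_USER"],
                                 os.environ["PANDA_DB_PASS"],
                                 os.environ["PANDA_DB_DSN"])
    raise ValueError(f"unknown backend: {backend}")


def count_active_jobs() -> int:
    # The VO is just another column in the job tables, which is what makes a
    # single instance usable by non-LHC experiments such as LSST.
    conn = get_connection()
    try:
        cur = conn.cursor()
        cur.execute("SELECT COUNT(*) FROM jobsactive4")   # table name illustrative
        (n,) = cur.fetchone()
        return n
    finally:
        conn.close()
```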

“Big PanDA”. Status. WP2
WP2 (Extending the scope) – Google Compute Engine (GCE) preview project
– Google allocated additional resources for ATLAS for free
  – ~5M CPU hours, 4000 cores for about 2 months (original preview allocation was 1k cores)
  – Resources are organized as an HTCondor-based PanDA queue (see the sketch below)
    – CentOS 6 based custom-built images, with SL5 compatibility libraries to run ATLAS software
    – Condor head node and proxies are at BNL
    – Output exported to the BNL SE
  – Work on capturing the GCE setup in Puppet
– Transparent inclusion of cloud resources into the ATLAS Grid
– The idea was to test long-term stability while running a cloud cluster similar in size to a Tier-2 site in ATLAS
– Intended for CPU-intensive Monte Carlo simulation workloads
– Planned as a production-type run: delivered to ATLAS as a resource and not as an R&D platform
– We also tested a high-performance PROOF-based analysis cluster
(S. Panitkin’s talk)
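A hedged sketch of how such cloud resources are fed work: pilot jobs are submitted to the HTCondor pool whose head node sits at BNL, and a requirements expression keeps them on the GCE worker VMs. The file names, wrapper script, queue name and ClassAd attribute below are assumptions, not the production configuration.

```python
# Illustrative pilot submission to an HTCondor pool that includes cloud VMs.
import subprocess
import textwrap


def submit_pilots_to_gce_pool(n_pilots: int) -> None:
    # HTCondor submit description; condor_submit sends it to the head node
    # (located at BNL in this setup), and the requirements expression steers
    # these pilots onto the Google Compute Engine worker VMs.
    submit_description = textwrap.dedent("""\
        universe     = vanilla
        executable   = runpilot.sh
        arguments    = --queue GOOGLE_COMPUTE_ENGINE
        requirements = (CloudSite == "GCE")
        output       = pilot.$(Cluster).$(Process).out
        error        = pilot.$(Cluster).$(Process).err
        log          = pilot.log
        queue {n}
        """).format(n=n_pilots)
    with open("gce_pilots.sub", "w") as f:
        f.write(submit_description)
    subprocess.run(["condor_submit", "gce_pilots.sub"], check=True)


if __name__ == "__main__":
    submit_pilots_to_gce_pool(100)
```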

“Big PanDA”. Status. WP2. Cont’d
PanDA project on ORNL Leadership Computing Facilities
– Get experience with all relevant aspects of the platform and workload:
  – job submission mechanism
  – job output handling
  – local storage system details
  – outside transfer details
  – security environment
  – adjust monitoring model
– Develop an appropriate pilot/agent model for Titan (a sketch follows below)
– MC generators will be the initial use case on Titan
Collaboration between ANL, BNL, ORNL, SLAC, UTA, UTK
Cross-disciplinary project – HEP, Nuclear Physics, High-Performance Computing
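As a starting point for the pilot/agent discussion, here is a hedged sketch of one possible flow on Titan: an agent on a front-end node wraps the payload (an MC generator in the initial use case) in a PBS batch script, submits it with qsub, and lets aprun place it on the compute nodes. The project account, cores-per-node value and file names are placeholders, not the eventual implementation.

```python
# Illustrative agent-side submission of a payload to Titan's batch system.
import subprocess
import textwrap

CORES_PER_NODE = 16   # placeholder for Titan's per-node core count


def submit_to_titan(payload_cmd: str, nodes: int = 1,
                    walltime: str = "02:00:00") -> str:
    """Wrap payload_cmd in a PBS script, qsub it, and return the batch job id."""
    script = textwrap.dedent(f"""\
        #!/bin/bash
        #PBS -A PROJECT_ID
        #PBS -l nodes={nodes}
        #PBS -l walltime={walltime}
        cd $PBS_O_WORKDIR
        aprun -n {nodes * CORES_PER_NODE} {payload_cmd}
        """)
    with open("panda_payload.pbs", "w") as f:
        f.write(script)
    # qsub prints the batch job id, which the agent can report back to PanDA.
    out = subprocess.run(["qsub", "panda_payload.pbs"],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()
```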

“Big PanDA”. Status. WP2. Cont’d
PanDA and multi-core jobs scenario
PanDA/OLCF meeting in Knoxville, Aug 9
– PanDA deployment at OLCF was discussed and agreed, including the AIMS project component
– Cyber-security issues were discussed, both for the near and longer term
– Discussion with OLCF Operations
– Payloads for the Titan CE (followed by discussion in ATLAS)
(D. Oleynik)

“Big PanDA”. Status. WP2. Cont’d
ATLAS PanDA Coming to Oak Ridge Leadership Computing Facilities (slide from Ken Read)

“Big PanDA”. Status. WP3
WP3 (Leveraging intelligent networks)
– Network as a resource
  – Optimal site selection should take network capability into account
  – The network as a resource should be managed (i.e. provisioning)
– WP3 is synchronized with two other efforts
  – US ATLAS funded, primarily to integrate FAX with PanDA
  – ANSE funded (Shawn’s talk)
– The networking database table schema is finalized (for the source–destination matrix) and implemented
– Networking throughput performance and point-to-point (P2P) statistics collected by different sources, such as perfSONAR, the Grid site status board and information systems, are continuously exported to the PanDA database
– The task brokering algorithm is for discussion during this workshop (an illustrative sketch follows below)
(A. Petrosyan’s talk)
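To make the brokering discussion concrete, the sketch below ranks candidate sites by free job slots weighted with the measured throughput from the input-data site, read from the source–destination matrix that perfSONAR and the site status board feed into the database. Both the weighting and the numbers are purely illustrative, not the algorithm to be agreed at the workshop.

```python
# Illustrative network-aware ranking of candidate sites; the weighting
# (free slots x measured throughput) is an example, not the agreed algorithm.
from typing import Dict, List, Tuple

# throughput_mbps[(source_site, destination_site)] -> recent average rate
ThroughputMatrix = Dict[Tuple[str, str], float]


def rank_sites(input_site: str,
               free_slots: Dict[str, int],
               throughput_mbps: ThroughputMatrix) -> List[str]:
    def score(site: str) -> float:
        # Fall back to a pessimistic default when no measurement exists.
        rate = throughput_mbps.get((input_site, site), 1.0)
        return free_slots[site] * rate
    return sorted(free_slots, key=score, reverse=True)


# Example: data sit at BNL; MWT2 has fewer free slots but a much better path,
# so it is ranked first. Site names and numbers are made up.
matrix = {("BNL", "MWT2"): 8000.0, ("BNL", "CERN-P1"): 900.0}
print(rank_sites("BNL", {"MWT2": 500, "CERN-P1": 2000}, matrix))
```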

Summary and Conclusions
– ATLAS Distributed Computing and Software performance was a great success in LHC Run 1
  – The challenge of how to process and analyze the data and produce timely physics results was substantial, but in the end it was met with great success
– The PanDA WMS and team played a vital role during LHC Run 1 data processing, simulation and analysis
– ASCR gave us a great opportunity to evolve PanDA beyond ATLAS and HEP and to start the “Big PanDA” project
– The project team has been set up. Progress in many areas: networking, VO-independent PanDA instance, cloud computing, HPC
  – The work on extending PanDA to Leadership Class Facilities is off to a good start
  – Large-scale PanDA deployments on commercial clouds are already producing valuable results
– Strong interest from several experiments (disciplines) and scientific centers in a joint project
– Plan of work and coherent PanDA software development are for discussion in this workshop

Acknowledgements
Many thanks to K. De, B. Kersevan, P. Nilsson, D. Oleynik, S. Panitkin and K. Read for the slides and materials used in this talk.