BigPanDA Status Kaushik De Univ. of Texas at Arlington Alexei Klimentov Brookhaven National Laboratory OSG AHM, Clemson University March 14, 2016.

Slides:

Advertisements

Similar presentations

1 Software & Grid Middleware for Tier 2 Centers Rob Gardner Indiana University DOE/NSF Review of U.S. ATLAS and CMS Computing Projects Brookhaven National.

Advertisements

12. March 2003Bernd Panzer-Steindel, CERN/IT1 LCG Fabric status

Integrating Network Awareness in ATLAS Distributed Computing Using the ANSE Project J.Batista, K.De, A.Klimentov, S.McKee, A.Petroysan for the ATLAS Collaboration.

1 Bridging Clouds with CernVM: ATLAS/PanDA example Wenjing Wu

MultiJob PanDA Pilot Oleynik Danila 28/05/2015. Overview Initial PanDA pilot concept & HPC Motivation PanDA Pilot workflow at nutshell MultiJob Pilot.

Plans for Exploitation of the ORNL Titan Machine Richard P. Mount ATLAS Distributed Computing Technical Interchange Meeting May 17, 2013.

LHC Experiment Dashboard Main areas covered by the Experiment Dashboard: Data processing monitoring (job monitoring) Data transfer monitoring Site/service.

Assessment of Core Services provided to USLHC by OSG.

GridPP Steve Lloyd, Chair of the GridPP Collaboration Board.

CERN - IT Department CH-1211 Genève 23 Switzerland t Monitoring the ATLAS Distributed Data Management System Ricardo Rocha (CERN) on behalf.

Computing for ILC experiment Computing Research Center, KEK Hiroyuki Matsunaga.

UTA Site Report Jae Yu UTA Site Report 4 th DOSAR Workshop Iowa State University Apr. 5 – 6, 2007 Jae Yu Univ. of Texas, Arlington.

OSG Area Coordinator’s Report: Workload Management February 9 th, 2011 Maxim Potekhin BNL

PanDA Multi-User Pilot Jobs Maxim Potekhin Brookhaven National Laboratory Open Science Grid WLCG GDB Meeting CERN March 11, 2009.

PanDA A New Paradigm for Computing in HEP Kaushik De Univ. of Texas at Arlington NRC KI, Moscow January 29, 2015.

OSG Area Coordinator’s Report: Workload Management April 20 th, 2011 Maxim Potekhin BNL

14 Aug 08DOE Review John Huth ATLAS Computing at Harvard John Huth.

PanDA: Exascale Federation of Resources for the ATLAS Experiment

And Tier 3 monitoring Tier 3 Ivan Kadochnikov LIT JINR

US ATLAS Tier 1 Facility Rich Baker Brookhaven National Laboratory Review of U.S. LHC Software and Computing Projects Fermi National Laboratory November.

Ruth Pordes November 2004TeraGrid GIG Site Review1 TeraGrid and Open Science Grid Ruth Pordes, Fermilab representing the Open Science.

CASTOR evolution Presentation to HEPiX 2003, Vancouver 20/10/2003 Jean-Damien Durand, CERN-IT.

Event Service Intro, plans, issues and objectives for today Torre Wenaus BNL US ATLAS S&C/PS Workshop Aug 21, 2014.

Overview of ASCR “Big PanDA” Project Alexei Klimentov Brookhaven National Laboratory September 4, 2013, Arlington, TX PanDA UTA.

A PanDA Backend for the Ganga Analysis Interface J. Elmsheuser 1, D. Liko 2, T. Maeno 3, P. Nilsson 4, D.C. Vanderster 5, T. Wenaus 3, R. Walker 1 1: Ludwig-Maximilians-Universität.

Network awareness and network as a resource (and its integration with WMS) Artem Petrosyan (University of Texas at Arlington) BigPanDA Workshop, CERN,

OSG Area Coordinator’s Report: Workload Management May14 th, 2009 Maxim Potekhin BNL

PanDA Status Report Kaushik De Univ. of Texas at Arlington ANSE Meeting, Nashville May 13, 2014.

INFSO-RI Enabling Grids for E-sciencE ARDA Experiment Dashboard Ricardo Rocha (ARDA – CERN) on behalf of the Dashboard Team.

Data Management: US Focus Kaushik De, Armen Vartapetian Univ. of Texas at Arlington US ATLAS Facility, SLAC Apr 7, 2014.

PANDA: Networking Update Kaushik De Univ. of Texas at Arlington SC15 Demo November 18, 2015.

PanDA & BigPanDA Kaushik De Univ. of Texas at Arlington BigPanDA Workshop, CERN October 21, 2013.

CD FY09 Tactical Plan Status FY09 Tactical Plan Status Report for Neutrino Program (MINOS, MINERvA, General) Margaret Votava April 21, 2009 Tactical plan.

Julia Andreeva on behalf of the MND section MND review.

SDN Provisioning, next steps after ANSE Kaushik De Univ. of Texas at Arlington US ATLAS Planning, CERN June 29, 2015.

EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Monitoring of the LHC Computing Activities Key Results from the Services.

OSG Area Coordinator’s Report: Workload Management Maxim Potekhin BNL May 8 th, 2008.

MND review. Main directions of work  Development and support of the Experiment Dashboard Applications - Data management monitoring - Job processing monitoring.

Global ADC Job Monitoring Laura Sargsyan (YerPhI).

Parag Mhashilkar Computing Division, Fermi National Accelerator Laboratory.

OSG Area Coordinator’s Report: Workload Management October 6 th, 2010 Maxim Potekhin BNL

Shifters Jamboree Kaushik De ADC Jamboree, CERN December 4, 2014.

Distributed Physics Analysis Past, Present, and Future Kaushik De University of Texas at Arlington (ATLAS & D0 Collaborations) ICHEP’06, Moscow July 29,

Future of Distributed Production in US Facilities Kaushik De Univ. of Texas at Arlington US ATLAS Distributed Facility Workshop, Santa Cruz November 13,

Big PanDA on HPC/LCF Update Sergey Panitkin, Danila Oleynik BigPanDA F2F Meeting. March

PanDA & Networking Kaushik De Univ. of Texas at Arlington ANSE Workshop, CalTech May 6, 2013.

Meeting with University of Malta| CERN, May 18, 2015 | Predrag Buncic ALICE Computing in Run 2+ P. Buncic 1.

INFSO-RI Enabling Grids for E-sciencE File Transfer Software and Service SC3 Gavin McCance – JRA1 Data Management Cluster Service.

Breaking the frontiers of the Grid R. Graciani EGI TF 2012.

CMS Experience with the Common Analysis Framework I. Fisk & M. Girone Experience in CMS with the Common Analysis Framework Ian Fisk & Maria Girone 1.

Architecture of a platform for innovation and research Erik Deumens – University of Florida SC15 – Austin – Nov 17, 2015.

PanDA HPC integration. Current status. Danila Oleynik BigPanda F2F meeting 13 August 2013 from.

Building on virtualization capabilities for ExTENCI Carol Song and Preston Smith Rosen Center for Advanced Computing Purdue University ExTENCI Kickoff.

A Scalable and Resilient PanDA Service for Open Science Grid Dantong Yu Grid Group RHIC and US ATLAS Computing Facility.

Production System 2 manpower and funding issues Alexei Klimentov Brookhaven National Laboratory Aug 19, 2013 Production System Technical Meeting CERN.

EGI-InSPIRE RI EGI Compute and Data Services for Open Access in H2020 Tiziana Ferrari Technical Director, EGI.eu

BigPanDA US ATLAS SW&C Technical Meeting August 1, 2016 Alexei Klimentov Brookhaven National Laboratory.

Scientific Data Processing Portal and Heterogeneous Computing Resources at NRC “Kurchatov Institute” V. Aulov, D. Drizhuk, A. Klimentov, R. Mashinistov,

BigPanDA Workflow Management on Titan

PanDA setup at ORNL Sergey Panitkin, Alexei Klimentov BNL

PanDA engagement with theoretical community via SciDAC-4

HEP Computing Tools for Brain Studies

PanDA in a Federated Environment

R.Mashinistov (UTA) July

ATLAS Sites Jamboree, CERN January, 2017

BigPanDA WMS for Brain Studies

Univ. of Texas at Arlington BigPanDA Workshop, ORNL

Leigh Grundhoefer Indiana University

A Possible OLCF Operational Model for HEP (2019+)

Workflow Management Software For Tomorrow

Presentation transcript:

BigPanDA Status Kaushik De Univ. of Texas at Arlington Alexei Klimentov Brookhaven National Laboratory OSG AHM, Clemson University March 14, 2016

Outline  BigPanDA Overview  Goals and Plans  Accomplishments over the past 3 years  What is happening now  And for the next 6 months  Future of BigPanDA March 14, 2016Kaushik De2

BigPanDA Project “Next Generation Workload Management and Analysis System for BigData”  DOE ASCR and HEP funded project started in Sep 2012  Generalization of PanDA Workload Management System as meta application, providing location transparency of processing and data management, for HEP and other data-intensive sciences, and a wider exascale community  Program Managers : Rich Carlson (ASCR), Lali Chaterjee (HEP)  Institutions :  Lead Institution : Brookhaven National Laboratory  Pis : Alexei Klimentov*, Sergey Panitkin, Torre Wenaus and Dantong Yu  Argonne National Laboratory :  PI : Alexandre Vaniachine  University Texas at Arlington :  Pis : Kaushik De, Gergely Zaruba  July 2015 : Project extended until Dec March 14, 2016Kaushik De3

BigPanDA Work Packages  WP1. Factorizing the code  Factorizing the core components of PanDA to enable adoption by a wide range of exascale scientific communities  WP2. Extending the scope  Evolving PanDA to support extreme scale computing clouds and Leadership Computing Facilities  WP3. Leveraging intelligent networks  Integrating network services and real-time data access to the PanDA workflow  WP4. Usability and Monitoring  Real time monitoring and visualization package  Schedule  Year 1. Setting the collaboration, define algorithms and metrics  Year 2. Prototyping and implementation  Year 3. Production and operations March 14, 2016Kaushik De4

WP1 Factorizing the core  New Code repository.  Migration from CERN SVN to GitHub  New build system; distribution through RPMs (thanks OSG for collaborative effort)  PanDA server improvements.  PanDA server consolidation: now one master branch that serves multiple experiments  Consolidated divergence into multiple branches over many years  Standardized installation thanks to collaboration with OSG  PanDA pilot improvements.  Core pilot has been refactored to a generic (VO independent) version;  VO specifics are handled as plug-ins; execution backends are handled as plug-ins  Multiple database backends.  Oracle database backend (ATLAS, AMS, COMPASS)  mySQL (running on EC2 PanDA server) – LSST, ALICE  New PanDA instance with MySQL backend deployed in Amazon EC2  Instance tuned for multi-VO support  Documented installation process for all components 5March 14, 2016Kaushik De

WP2 Extending the scope  Extend PanDA job management to Clouds and HPC  Commercial and academic clouds  Google, Amazon EC2, NECTAR, etc…  Primary goal for HPC activity - Leadership Computing Facilities  – main target  Ancelm, Ostrava (CZ), Archer (UK), Kurchatov (RF)  See talk tomorrow by Frank/me on HPC’s 6March 14, 2016Kaushik De

# PFlops (Peak theoretical performance). Cray XK-7 18,688 compute nodes with GPUs 299,008 CPU cores AMD Opteron GHz (16 cores per node) 32 GB RAM per node NVidia TESLA K20x GPU per node 32 PB disk storage (center-wide Luster file system) >1TB/s aggregate FS throughput 29 PB HPSS tape archive Titan Example March 14, 2016Kaushik De

Titan Usage Last Month March 14, 2016Kaushik De8

WP3 PanDA and Networking  Goal for PanDA  Direct integration of networking with PanDA workflow – never attempted before for large scale automated WMS systems  Pick a few use cases  Cases which are important to PanDA users  Enhance workload management through use of network  Case 1: Improve User Analysis workflow  Case 2: Improve Tier 1 to Tier 2 workflow  Step by step approach  Collect network information  Storage and access  Using network information March 14, 2016Kaushik De9

Network Summary  All work completed in first 2 years  Successful demonstrations – overflow and automatic MCP  Results shown at SC15 and other meetings  New implementation underway for “World Cloud”  Details in Fernando’s talk earlier  Moving from demonstration to operation March 14, 2016Kaushik De10

WP4 Usability and monitoring  New PanDA monitoring web application was developed based on Django framework  All of you know this daily as BigPanDA monitoring  Allows rapid development and easy extension March 14, 2016Kaushik De

“BigPanDA” Collaborations  Through ASCR project, PanDA has moved well beyond ATLAS  Collaborating with RADICAL group of Rutgers University and Oak Ridge LCF team (Shantenu Jha and Jack Wells)  Many joint papers and invited talks  Collaboration between ATLAS, ALICE, nEDM experiments for efficient usage of opportunistic resources, especially HPC and LCF  LSST evaluating PanDA for distributed data processing  COMPASS and AMS02 installed their own PanDA instances  Other communities getting involved 12March 14, 2016Kaushik De

BigPanDA - Generalizing PanDA Beyond Grid and ATLAS March 14, 2016Kaushik De13 LHC

What Are We Doing Now?  WP1 – continuing to improve packaging  Bi-weekly meetings, focusing on non-ATLAS experiments  Work carried out by core PanDA and people from other experiments  WP3 – networking demonstration completed  Ongoing work done by core PanDA and ADC teams  World Cloud, NWS, Configurator…  WP4 – BigPanDA monitoring big focus  New features being added continuously  WP3 – increasing LCF usage  Heavy focus of work  Collaborating with Rutgers CSE team and OLCF operations March 14, 2016Kaushik De14

Titan Backfill March 14, 2016Kaushik De15

Future of BigPanDA  Submitted proposal to ASCR for continuation  Hope to hear back before summer  Focus on BNL+UTA+Rutgers+OLCF collaboration on Titan  Setup PanDA as a service on Titan for ‘all’ users  ATLAS (and ALICE) will benefit from big increase in CPU cycles March 14, 2016Kaushik De16