PanDA setup at ORNL Sergey Panitkin, Alexei Klimentov BNL

Presentation transcript:

PanDA setup at ORNL
Sergey Panitkin, Alexei Klimentov (BNL)
Kaushik De, Paul Nilsson, Danila Oleynik, Artem Petrosyan (UTA)
for the PanDA Team

Outline
  Introduction
  PanDA
  PanDA pilot
  ORNL setup
  Summary

PanDA in ATLAS
  The ATLAS experiment at the LHC is a Big Data experiment
    The ATLAS detector generates about 1 PB of raw data per second, most of which is filtered out
    As of 2013, ATLAS DDM manages ~140 PB of data, distributed worldwide to 130 WLCG computing centers
    Expected rate of data influx into the ATLAS Grid: ~40 PB per year
    Thousands of physicists from ~40 countries analyze the data
  The PanDA (Production and Data Analysis) project was started in Fall 2005
    Goal: an automated yet flexible workload management system (WMS) that can optimally make distributed resources accessible to all users
    Originally developed in the US for US physicists
    Adopted as the ATLAS-wide WMS in 2008 (first LHC data in 2009) for all computing applications
    Now successfully manages O(10^2) sites, O(10^5) cores, O(10^8) jobs per year, O(10^3) users

Key Features of PanDA
  Pilot-based job execution system
    Condor-based pilot factory
    Payload is sent only after execution begins on the CE
    Minimizes latency, reduces error rates
  Central job queue
    Unified treatment of distributed resources
    SQL DB keeps state (critical component)
  Automatic error handling and recovery
  Extensive monitoring
  Modular design
  HTTP/S RESTful communications
  GSI authentication
  Workflow is maximally asynchronous
  Use of Open Source components
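The late-binding idea on this slide (a payload is handed out only once a pilot is already running) can be illustrated with a toy model. This is a minimal sketch, not PanDA code: `CentralQueue`, `pilot`, and the `healthy` flag are hypothetical names invented for the illustration.

```python
from collections import deque


class CentralQueue:
    """Toy stand-in for a central job queue (not the PanDA server API)."""

    def __init__(self, jobs):
        self._jobs = deque(jobs)

    def get_job(self):
        # Return the next waiting job, or None when the queue is empty.
        return self._jobs.popleft() if self._jobs else None


def pilot(name, queue, healthy=True):
    """A pilot asks for a payload only after it is running on the worker."""
    if not healthy:
        # A broken node fails before any job is bound to it,
        # so no payload is lost.
        return None
    job = queue.get_job()
    return (name, job) if job is not None else None


queue = CentralQueue(["job-1", "job-2"])
results = [
    pilot("pilot-a", queue, healthy=False),  # bad node: takes no job
    pilot("pilot-b", queue),
    pilot("pilot-c", queue),
    pilot("pilot-d", queue),                 # queue is already empty
]
print(results)
```

In this model the unhealthy pilot never consumes a job, which is one way to read the "reduce error rates" point: jobs stay in the central queue until a working slot asks for them.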

PanDA: ATLAS Workload Management System
[Architecture diagram. Labeled components: production managers and end users; PanDA server; Data Management System (DQ2); Local Replica Catalog (LFC); submitter (bamboo); task/job repository (Production DB); Logging System; pilot scheduler (autopyfactory) with condor-g; ARC Interface (aCT); pilots on OSG, EGEE/EGI and NDGF worker nodes; production and analysis jobs. Links between components are labeled https.]

Next Generation “Big PanDA”
  ASCR and HEP funded project “Next Generation Workload Management and Analysis System for Big Data”
    Started in September 2012
    Generalization of PanDA as a meta application, providing location transparency of processing and data management, for HEP and other data-intensive sciences, and a wider exascale community
  Project participants from ANL, BNL, UT Arlington
    Alexei Klimentov (Lead PI), Kaushik De (Co-PI)
  WP1 (Factorizing the core): Factorizing the core components of PanDA to enable adoption by a wide range of exascale scientific communities (UTA, K. De)
  WP2 (Extending the scope): Evolving PanDA to support extreme scale computing clouds and Leadership Computing Facilities (BNL, S. Panitkin)
  WP3 (Leveraging intelligent networks): Integrating network services and real-time data access to the PanDA workflow (BNL, D. Yu)
  WP4 (Usability and monitoring): Real-time monitoring and visualization package for PanDA (BNL, T. Wenaus)

PanDA pilot
  PanDA is a pilot-based system; the pilot is what is submitted to batch queues
  The PanDA pilot is an execution environment used to prepare the computing element
    Requests the actual payload from PanDA
    Transfers input data from the SE
    Executes the payload and monitors it during execution
    Cleans up after the payload is finished: transfers output, cleans up, transmits logs and monitoring information
  Pilots allow for low-latency job scheduling, which is especially important in data analysis
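The pilot life cycle listed on this slide can be sketched as a single function. This is only an illustration under stated assumptions: `run_pilot` and all six callables are hypothetical placeholders supplied by the caller, not functions of the real PanDA pilot.

```python
def run_pilot(get_job, stage_in, execute, stage_out, report, cleanup):
    """Run one pilot cycle; every argument is a caller-supplied callable."""
    job = get_job()                    # request the actual payload
    if job is None:
        return "no_job"
    status = "failed"
    try:
        stage_in(job)                  # transfer input data from the SE
        exit_code = execute(job)       # run and monitor the payload
        if exit_code == 0:
            stage_out(job)             # transfer output
            status = "finished"
    finally:
        # These two steps run even if a step above raised an exception.
        report(job, status)            # transmit logs and monitoring info
        cleanup(job)                   # clean up the work area
    return status


# Usage with stand-in callables that only record the order of the steps.
calls = []
status = run_pilot(
    get_job=lambda: {"id": 1},
    stage_in=lambda job: calls.append("stage_in"),
    execute=lambda job: calls.append("execute") or 0,
    stage_out=lambda job: calls.append("stage_out"),
    report=lambda job, status: calls.append("report:" + status),
    cleanup=lambda job: calls.append("cleanup"),
)
print(status, calls)
```

Passing the steps in as callables keeps the sketch independent of any particular storage element or batch system.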

PanDA setup on ORNL machines
  Main idea: reuse existing PanDA components and workflow logic as much as possible (PanDA pilot, APF, etc.)
  The PanDA connection layer runs on the front-end machines in user space
  All connections to the PanDA server at CERN are initiated from the front-end machines
    “Pull” architecture over HTTPS to predefined ports on the PanDA server
  For the local HPC batch interface, use the SAGA (Simple API for Grid Applications) framework
    http://saga-project.github.io/saga-python/
    http://www.ogf.org/documents/GFD.90.pdf
  Note that the Titan FE runs SLES 11, not RH5
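The “pull” direction of the connection (the front end always opens an outbound HTTPS request, the server never connects in) can be illustrated with the Python standard library. This is a sketch under assumptions: the server URL, the `/getJob` path, the `siteName` field, and the queue name are placeholders chosen for the example, and the request is built but never sent.

```python
import urllib.parse
import urllib.request

# Hypothetical placeholder; the real server address and port are
# site configuration and are not given in the slides.
SERVER_URL = "https://panda.example.org/server/panda"


def build_get_job_request(site_name, server_url=SERVER_URL):
    """Build (but do not send) an outbound HTTPS request asking for work."""
    if not server_url.startswith("https://"):
        raise ValueError("the connection layer only talks HTTPS")
    # Form-encode the request body; the field name is illustrative.
    body = urllib.parse.urlencode({"siteName": site_name}).encode("ascii")
    return urllib.request.Request(
        server_url + "/getJob", data=body, method="POST"
    )


request = build_get_job_request("HYPOTHETICAL_TITAN_QUEUE")
print(request.get_method(), request.full_url)
```

Because every exchange starts from the front-end machine, only outbound access to the server's predefined ports is needed.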

Schematic PanDA setup at ORNL

Workflow on HPC machines
  Software is installed on the HPC machine via CernVM-FS or by direct pull from repositories (for example, for a non-ATLAS workload)
  Pilot is instantiated by APF or another entity
  Pilot asks PanDA for a workload
  Pilot gets the workload description
  Pilot gets input data, if any
  Pilot sets up output directories for the current workload
  Pilot generates and submits a JDL description to the local batch system
  Pilot monitors workload execution (qstat, SAGA calls)
  When the workload is finished, the pilot moves data to the destination SE
  Pilot cleans up output directories
  Pilot exits
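The "generate a job description" and "monitor execution" steps above can be sketched as follows. This is an illustration only: the slides plan to do this through SAGA, whereas this sketch renders a plain PBS script and polls a caller-supplied status function. The workload dictionary keys, the example values, and the state names are assumptions made for the example.

```python
import time


def make_pbs_script(workload):
    """Render a PBS batch script from a workload description (a dict)."""
    lines = [
        "#!/bin/bash",
        "#PBS -N {}".format(workload["name"]),
        "#PBS -A {}".format(workload["project"]),
        "#PBS -l nodes={}".format(workload["nodes"]),
        "#PBS -l walltime={}".format(workload["walltime"]),
        "cd {}".format(workload["workdir"]),
        workload["command"],
    ]
    return "\n".join(lines) + "\n"


def wait_for_job(get_state, poll_seconds=60, sleep=time.sleep):
    """Poll until the batch job reaches a final state; return that state."""
    while True:
        state = get_state()  # e.g. parsed from qstat output or a SAGA call
        if state in ("Done", "Failed", "Canceled"):
            return state
        sleep(poll_seconds)


# Usage with made-up values and a fake status source (no batch system).
workload = {
    "name": "panda_payload",
    "project": "PROJ123",
    "nodes": 1,
    "walltime": "00:30:00",
    "workdir": "/tmp/work/someuser/job1",
    "command": "./run_payload.sh",
}
script = make_pbs_script(workload)
states = iter(["Pending", "Running", "Done"])
final_state = wait_for_job(lambda: next(states), sleep=lambda seconds: None)
print(script)
print(final_state)
```

Injecting `get_state` and `sleep` keeps the polling loop testable without access to a real queue.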

Data management on ORNL machines
  Input and output data are kept on /tmp/work/$USER or /tmp/proj/$PROJID
    Accessible from both front-end and worker nodes
  Output data are moved by the pilot to an ATLAS storage element after job completion
    Currently to the BNL SE; the end point is configurable
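Choosing between the two scratch areas named on this slide could look like the sketch below. The function name `scratch_dir` and the rule that a project area takes priority over the per-user area are assumptions made for the example; the slides do not state a preference.

```python
import os


def scratch_dir(env=None):
    """Pick the shared scratch area for a job from environment variables."""
    env = os.environ if env is None else env
    project = env.get("PROJID")
    if project:
        return "/tmp/proj/" + project   # project area (assumed priority)
    user = env.get("USER")
    if user:
        return "/tmp/work/" + user      # per-user area
    raise RuntimeError("neither PROJID nor USER is set")


print(scratch_dir({"USER": "someuser"}))
print(scratch_dir({"USER": "someuser", "PROJID": "proj123"}))
```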

Current status
  Sergey has access to Titan, still waiting for a fob for Kraken
  Danila has access to Kraken, waiting for (approval?) on Titan
  ATLAS pilot is running on the Titan FE
    Connections to the PanDA server verified
  AutoPyFactory (APF) is installed and tested on the Titan FE
    Local HTCondor queue for APF installed
    APF's pilot wrapper is tested with the latest version of the ATLAS pilot on the Titan FE
  SAGA-Python is installed on the Titan FE and Kraken FE
    In contact with SAGA authors from Rutgers (S. Jha, O. Weidner)
  A queue for Titan is defined in PanDA
  Connection from the Titan FE to Federated ATLAS Xrootd is tested

Next Steps
  Pilot job submission module (runJob) development
    SAGA-based interface to PBS
  Better understanding of job submission to worker nodes
    Multicore, GPU, etc.
  DDM details at ORNL
    Use of data transfer nodes
  Realistic workloads
    ATLAS codes
    ROOT
    GEANT4 on GPU
    etc.

Summary
  Work on integration of ORNL machines and PanDA has started
  Key PanDA system components ported to Titan
  Component integration is the next step
    Pilot modifications for HPC
    Pilot-SAGA runJob module
    Pilot DDM details on Titan
  Realistic workloads are desirable

Acknowledgements
  K. Read (UTK)
  J. Hover and J. Caballero Bejar (BNL)
  S. Jha and O. Weidner (Rutgers)
  M. Livny and HTCondor team (UW)
  A. DiGirolamo (CERN)