PanDA: A New Paradigm for Computing in HEP
Kaushik De, Univ. of Texas at Arlington
NRC KI, Moscow, January 29, 2015

Computing Challenges at the LHC
- The scale and complexity of LHC computing
  - Hundreds of petabytes of data per year, thousands of users worldwide, many dozens of complex applications…
- Required a new approach to distributed computing
  - A huge hierarchy of computing centers had to work together
  - Main challenge – how to provide efficient automated performance
  - Auxiliary challenge – make resources easily accessible to all users
- Goals of this talk
  - Present a new model of computing developed for the ATLAS experiment, to make optimum use of widely distributed resources
  - How this system is being used beyond the LHC
  - Future evolution

Early History
- Original ideas were based on batch processing
  - Using grid technology to enable distributed batch processing
- Lessons learned from grid computing before 2005
  - Errors were difficult to debug in a distributed environment
  - Waits in a small percentage of queues = huge task completion delays
  - Difficult for users to manually schedule among a long list of sites
- New project started in 2005
  - Based on the previous 5 years of experience
  - Primary goal – improve the user experience with distributed computing, making it as easy as local computing
  - Users should get quick results by leveraging distributed resources
  - Isolate users from heterogeneity in infrastructure and middleware
  - Fair sharing of resources among thousands of users

Enter PanDA
- PanDA – Production and Distributed Analysis System
- Designed for the ATLAS experiment during LHC Run 1
- Deployed on WLCG infrastructure
- Standards-based implementation (see the sketch below)
  - REST framework – HTTP/S
  - Oracle or MySQL backends
  - About a dozen Python packages available from SVN and GitHub
  - Command-line and GUI/Web interfaces
- References
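
As a rough illustration of the standards-based client/server interaction listed above (REST over HTTP/S, with a client certificate assumed for authentication), here is a minimal Python sketch. The server URL, the /jobs endpoint, and the job-spec fields are hypothetical placeholders for illustration only, not the real PanDA client API.

```python
# Illustrative only: a minimal HTTP/S client call in the style of a REST
# job-submission service. The host name, endpoint path, and payload fields
# are hypothetical placeholders, not the actual PanDA server API.
import requests

def submit_job(server_url, job_spec, cert=None):
    """POST a job description to a REST endpoint and return the server reply."""
    response = requests.post(
        f"{server_url}/jobs",   # hypothetical endpoint
        json=job_spec,          # job description serialized as JSON
        cert=cert,              # client certificate for HTTPS authentication
        timeout=60,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    job = {
        "transformation": "run_analysis.sh",      # user payload to execute
        "input_dataset": "data15_13TeV.example",  # input data to process
        "site": "ANY",                            # let the broker choose the site
        "priority": 1000,
    }
    print(submit_job("https://pandaserver.example.org:25443", job))
```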

Core Ideas in PanDA
- Make hundreds of distributed sites appear as local
  - Provide a central queue for users – similar to local batch systems
  - Reduce site-related errors and reduce latency
- Build a pilot job system – late transfer of user payloads (see the sketch below)
  - Crucial for distributed infrastructure maintained by local experts
- Hide middleware while supporting diversity and evolution
  - PanDA interacts with middleware – users see a high-level workflow
- Hide variations in infrastructure
  - PanDA presents uniform ‘job’ slots to users (with minimal sub-types)
  - Easy to integrate grid sites, clouds, HPC sites …
- Production and analysis users see the same PanDA system
  - Same set of distributed resources available to all users
  - Highly flexible system, giving full control of priorities to the experiment
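
The pilot idea is the heart of the system: a lightweight placeholder job reaches the worker node first, and the real user payload is pulled from the central queue only once the pilot is running in a healthy environment. The Python sketch below shows that pattern in miniature; the server URL, the getJob/updateJob endpoints, and the field names are invented placeholders, not the actual PanDA pilot code.

```python
# Minimal sketch of the pilot-job pattern (late binding of user payloads).
# Everything here is illustrative; endpoints and fields are invented.
import subprocess
import requests

SERVER = "https://pandaserver.example.org:25443"   # hypothetical server URL

def get_job(site):
    """Ask the central queue for a payload matched to this worker node."""
    reply = requests.get(f"{SERVER}/getJob", params={"site": site}, timeout=60)
    reply.raise_for_status()
    return reply.json() or None                    # nothing queued for this site

def report(job_id, state):
    """Heartbeat/state update so the server can track progress and reschedule."""
    requests.post(f"{SERVER}/updateJob", json={"id": job_id, "state": state}, timeout=60)

def run_pilot(site):
    # The pilot lands on the worker node, verifies the environment, and only
    # then pulls the actual user payload -- the "late transfer" of the slide.
    job = get_job(site)
    if job is None:
        return                                     # empty queue: exit quietly
    report(job["id"], "running")
    rc = subprocess.call(job["command"], shell=True)
    report(job["id"], "finished" if rc == 0 else "failed")

if __name__ == "__main__":
    run_pilot(site="EXAMPLE_SITE")   # a pilot factory keeps a supply of these running
```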

Additional PanDA Ideas
- Excellent fault tolerance across distributed resources
  - Independent components with asynchronous workflow
  - Internal re-scheduling at every step of execution
- Trust but verify all sources of information
  - PanDA uses its own information system and internal metrics
- Multi-level scheduling for optimal use of resources (see the sketch below)
  - Task brokerage (T1), job brokerage (sites), job dispatch (to pilots)
- Multi-step data placement for maximum flexibility
  - Algorithmic pre-placement, asynchronous transfers with callback, pilot data movers, special optimizations for federated storage
- Integration with independent data management systems
  - DQ2 and Rucio for ATLAS, but other options can be plugged in
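
As a toy illustration of the job-brokerage level of that scheduling stack, the sketch below ranks candidate sites using internally gathered metrics (free slots, recent failure rate, data locality). The metric fields, weights, and scoring formula are invented for illustration and are not PanDA's actual brokerage algorithm.

```python
# Illustrative job-brokerage sketch: rank candidate sites for a job using
# internally collected metrics rather than trusting external services blindly.
from dataclasses import dataclass

@dataclass
class SiteMetrics:
    name: str
    free_slots: int              # currently idle job slots
    recent_failure_rate: float   # fraction of recent jobs that failed (0..1)
    data_locality: float         # fraction of the job's input already at the site (0..1)

def broker(job_input_size_gb, sites):
    """Return candidate sites ordered from best to worst for this job."""
    def score(s):
        # Prefer sites that hold the data, have free slots, and have been reliable.
        transfer_penalty = job_input_size_gb * (1.0 - s.data_locality)
        return s.free_slots * (1.0 - s.recent_failure_rate) - transfer_penalty
    return sorted(sites, key=score, reverse=True)

candidates = [
    SiteMetrics("SITE_A", free_slots=500, recent_failure_rate=0.02, data_locality=1.0),
    SiteMetrics("SITE_B", free_slots=2000, recent_failure_rate=0.30, data_locality=0.1),
]
print([s.name for s in broker(job_input_size_gb=50, sites=candidates)])
```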

PanDA Workload Management
[Architecture diagram: end-user analysis jobs and production jobs (defined by production managers via the submitter, bamboo/JEDI) reach the PanDA server over HTTPS and are stored in the task/job repository (Production DB); pilot schedulers (autopyfactory) submit pilots through condor-g to OSG and EGI worker nodes, and through the ARC interface (aCT) to NDGF; pilots pull jobs from the server; data handling uses Local Replica Catalogs (LFC) and the Data Management System, with activity recorded in the Logging System.]

Job Workflow
- PanDA jobs go through a succession of states tracked in the central DB (see the sketch below):
  Defined → Assigned → Activated → Starting → Running → Holding → Transferring → Finished/failed
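
A minimal sketch of that state progression, assuming a simple linear flow with failure possible at any step; the state names come from the slide, but the transition table and helper function are illustrative, not the real server-side bookkeeping.

```python
# Toy model of the PanDA job state succession (illustrative transitions only).
from enum import Enum

class JobState(Enum):
    DEFINED = "defined"
    ASSIGNED = "assigned"
    ACTIVATED = "activated"
    STARTING = "starting"
    RUNNING = "running"
    HOLDING = "holding"
    TRANSFERRING = "transferring"
    FINISHED = "finished"
    FAILED = "failed"

# Assumed linear progression; FAILED can be reached from any non-terminal state.
NEXT = {
    JobState.DEFINED: JobState.ASSIGNED,
    JobState.ASSIGNED: JobState.ACTIVATED,
    JobState.ACTIVATED: JobState.STARTING,
    JobState.STARTING: JobState.RUNNING,
    JobState.RUNNING: JobState.HOLDING,
    JobState.HOLDING: JobState.TRANSFERRING,
    JobState.TRANSFERRING: JobState.FINISHED,
}

def advance(state: JobState, ok: bool = True) -> JobState:
    """Move a job to its next state, or to FAILED if the current step failed."""
    if not ok:
        return JobState.FAILED
    return NEXT.get(state, state)      # terminal states stay where they are

state = JobState.DEFINED
while state not in (JobState.FINISHED, JobState.FAILED):
    state = advance(state)
print(state)                           # JobState.FINISHED
```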

Recent Evolutions
- Dynamic caching of input files – PD2P
  - For user analysis, jobs go to data, which are initially placed by policy
  - PanDA dynamically re-distributes data based on usage and demand
- Evolution to a mesh model
  - Rigid hierarchy of sites relaxed based on network performance
  - Work and data can flow dynamically between sites
- Dynamic subdivision of tasks into jobs – JEDI (see the sketch below)
  - New component to dynamically slice work to match resources
  - No longer job based – a big change from the batch model of running jobs
  - Support for multi-core, MPI…
- Event service
  - Processing small chunks of events on demand
- Network as a managed resource
  - On par with CPUs and storage
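
To make the JEDI idea concrete, here is a toy sketch of slicing a task's remaining events to fit whatever resource is currently asking for work, instead of pre-building fixed-size jobs. The sizing rule and the events-per-core-hour rate are invented numbers, not JEDI's actual logic.

```python
# Illustrative sketch of dynamic task slicing: chunk size follows the resource,
# not a fixed job definition. All rates and numbers below are made up.
def slice_task(events_remaining, cores, walltime_hours, events_per_core_hour=1000):
    """Return how many events to hand to a resource offering `cores` for `walltime_hours`."""
    capacity = cores * walltime_hours * events_per_core_hour
    return min(events_remaining, capacity)

task_events = 10_000_000
# A grid worker node and an HPC allocation get very differently sized chunks
# carved from the same task, rather than identical pre-built jobs.
grid_chunk = slice_task(task_events, cores=8, walltime_hours=12)
task_events -= grid_chunk
hpc_chunk = slice_task(task_events, cores=4096, walltime_hours=2)
print(grid_chunk, hpc_chunk)   # 96000 8192000 with the assumed processing rate
```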

PanDA Performance
- Current scale – 25M jobs completed every month at more than a hundred sites
- First exascale system in HEP – 1.2 exabytes processed in 2013

Resources Accessible via PanDA
- About 150,000 job slots used continuously, 24x7x365

Paradigm Shift in HEP Computing
- New ideas from PanDA
  - Distributed resources are seamlessly integrated
  - All users have access to resources worldwide through a single submission system
  - Uniform fair share, priorities and policies allow efficient management of resources
  - Automation, error handling, and other features in PanDA improve the user experience
  - All users have access to the same resources
- Old HEP paradigm
  - Distributed resources are independent entities
  - Groups of users utilize specific resources (whether locally or remotely)
  - Fair shares, priorities and policies are managed locally, for each resource
  - Uneven user experience at different sites, based on local support and experience
  - Privileged users have access to special resources

The Growing PanDA Ecosystem
- ATLAS PanDA
  - US ATLAS, CERN, UK, DE, ND, CA, Dubna, Protvino, OSG …
- ASCR/HEP BigPanDA
  - DOE-funded project at BNL and UTA – PanDA beyond HEP, at LCF
- ANSE PanDA
  - NSF-funded network project – CalTech, Michigan, Vanderbilt, UTA
- HPC and Cloud PanDA
  - Taiwan PanDA – AMS and other communities
  - NRC KI PanDA – new communities
  - AliEn PanDA, LSST PanDA, other experiments
  - MegaPanDA …

Conclusion
- PanDA – a scalable and universal computing system
  - Processing a million jobs a day at hundreds of sites
  - Thousands of active users in ATLAS
- Computing as a valuable resource for all users, irrespective of location or affiliation, enabling fast physics results
- Being evaluated/tested by other experiments and communities
- Continuing to evolve