Danila Oleynik (on behalf of the ATLAS Collaboration), 02.07.2014

- PanDA: the Production and Distributed Analysis system developed for the ATLAS experiment at the LHC
- Deployed on WLCG resources worldwide
- Now also used by the AMS and LSST experiments
- Being evaluated for ALICE, NICA, COMPASS, NA62 and others
- Many international partners: CERN IT, OSG, NorduGrid, European grid projects, Russian grid projects…

PanDA can manage:
- Large data volumes: hundreds of petabytes
- Distributed resources: hundreds of computing centers worldwide
- Collaboration: thousands of scientific users
- Complex workflows, multiple applications, automated processing chunks, fast turn-around...

- The ATLAS experiment at the LHC
  - Well known for the recent Higgs discovery
  - The search for new physics continues
- The PanDA project was started in Fall 2005
  - Goal: an automated yet flexible workload management system which can optimally make distributed resources accessible to all users
- Originally developed for US physicists
  - Joint project by UTA and BNL
- Adopted as the ATLAS-wide WMS in 2008
- In use for all ATLAS computing applications

PanDA WMS design goals:
- Achieve a high level of automation to reduce operational effort for a large collaboration
- Flexibility in adapting to evolving hardware and network configurations over many decades
- Support diverse and changing middleware
- Insulate users from hardware, middleware, and all other complexities of the underlying system
- Unified system for production and user analysis
- Incremental and adaptive software development

Key features of PanDA:
- Central job queue
- Unified treatment of distributed resources
- SQL DB keeps the state of all workloads
- Pilot-based job execution system (a minimal sketch of the pull model follows below)
  - ATLAS work is sent only after pilot execution begins on the CE
  - Minimizes latency, reduces error rates
- Fairshare or policy-driven priorities for thousands of users at hundreds of resources
- Automatic error handling and recovery
- Extensive monitoring
- Modular design
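To make the pull model concrete, here is a minimal, schematic sketch of a pilot asking the central job queue for work only once it is already running on a worker node. The server URL, site name, request parameters and response handling are placeholders for illustration, not the production pilot's dispatcher API.

```python
import time
import requests  # HTTP client; the real pilot uses x509-authenticated HTTPS

# Placeholder values -- not the real PanDA server endpoint or queue name
PANDA_DISPATCHER = "https://pandaserver.example.org:25443/server/panda"
SITE_NAME = "EXAMPLE_SITE"

def pull_one_job():
    """Ask the central job queue for work; return a job spec or None.

    Work is sent only after the pilot is already running on the CE,
    which minimizes latency and keeps broken nodes from pulling jobs.
    Parameter names and the response format are simplified here.
    """
    resp = requests.post(PANDA_DISPATCHER + "/getJob",
                         data={"siteName": SITE_NAME,
                               "prodSourceLabel": "managed"},
                         timeout=60)
    resp.raise_for_status()
    job = resp.json()
    return job if job.get("StatusCode") == 0 else None

if __name__ == "__main__":
    while True:
        job = pull_one_job()
        if job is not None:
            # stage in input data, run the payload, stage out, report status
            break
        time.sleep(60)  # an idle pilot backs off before asking again
```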

Main PanDA components: PanDA server, database back-end, brokerage, dispatcher, PanDA pilot system, job wrapper, pilot factory, dynamic data placement, information system, monitoring system

PanDA server [diagram slide]

- The ADC (ATLAS Distributed Computing) software components, and particularly PanDA, managed the challenge of data taking and data processing, with contingency to spare
- But: the increase in trigger rate and luminosity will most likely not be accompanied by an equivalent increase of available resources
- One way to increase the efficiency of the system is to involve opportunistic resources: clouds and HPC

- The ATLAS cloud computing project was set up a few years ago to exploit virtualization and clouds
  - To utilize public and private clouds as extra computing resources
  - An additional mechanism to cope with peak loads on the grid
- Excellent progress so far
  - Commercial clouds: Google Compute Engine, Amazon EC2
  - Academic clouds: CERN, Canada (Victoria, Edmonton, Quebec and Ottawa), the United States (FutureGrid clouds in San Diego and Chicago), Australia (Melbourne and Queensland) and the United Kingdom
- Personal PanDA analysis queues are being set up
  - A mechanism for a user to exploit private cloud resources, or purchase additional resources from commercial cloud providers, when conventional grid resources are busy and the user has exhausted his/her fair share

From Sergey Panitkin

ATLAS has members with access to these machines and to ARCHER, NERSC, … HPC resources are already successfully used in ATLAS as part of NorduGrid and in Germany (DE).

Already two solutions:
- ARC-CE/aCT2: a NorduGrid-based solution, with ARC-CE as a unified interface
- OLCF Titan (Cray)
The modular structure of the PanDA Pilot now allows easy integration with different computing facilities (a hypothetical backend-interface sketch follows below).
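Purely as an illustration of that modularity, a computing-backend plugin could expose a small common interface so the rest of the pilot stays unchanged. The class and method names below are invented for this sketch and are not the pilot's real API.

```python
from abc import ABC, abstractmethod

class ComputingBackend(ABC):
    """Hypothetical per-facility plugin interface (illustrative only)."""

    @abstractmethod
    def submit(self, job_spec):
        """Hand the payload to the local execution layer; return a handle."""

    @abstractmethod
    def poll(self, handle):
        """Return 'running', 'finished' or 'failed' for a submitted payload."""

class GridBackend(ComputingBackend):
    def submit(self, job_spec):
        ...  # run the payload directly on the grid worker node

    def poll(self, handle):
        ...  # check the local payload process

class TitanBackend(ComputingBackend):
    def submit(self, job_spec):
        ...  # build a PBS script and submit it, e.g. through the SAGA API

    def poll(self, handle):
        ...  # query PBS/Torque (or SAGA) for the batch job state
```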

More information: "ARC-CE: updates and plans", Oxana Smirnova, 1 July 2014.

Titan/HPC environment specifics:
- Parallel file system shared between nodes
- Access only to interactive nodes (compute nodes have extremely limited connectivity)
- One-time-password authentication
- Internal job management tool: PBS/TORQUE (a minimal batch-script sketch follows below)
- One job occupies a minimum of one node
- Limit on the number of jobs one user may have in the scheduler
- Special data transfer nodes (high-speed stage-in/out)
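For concreteness, here is a minimal sketch of how a pilot might render a PBS/Torque batch script for a whole-node allocation on a Cray system such as Titan. The project ID, queue name, paths and node counts are placeholders; Titan worker nodes have 16 CPU cores, and payloads reach the compute nodes through aprun.

```python
# Minimal sketch: rendering a PBS/Torque script for a whole-node allocation.
CORES_PER_NODE = 16  # Titan worker nodes have 16 CPU cores each

def render_pbs_script(nodes, walltime, project="PROJ123", queue="batch"):
    """Return the text of a PBS batch script requesting whole nodes.

    project and queue are placeholder values, not Titan's real settings.
    """
    ranks = nodes * CORES_PER_NODE
    return f"""#!/bin/bash
#PBS -A {project}
#PBS -q {queue}
#PBS -l nodes={nodes}
#PBS -l walltime={walltime}
#PBS -o pilot_payload.out
#PBS -e pilot_payload.err

cd "$PBS_O_WORKDIR"
# On Cray systems the payload runs on the compute nodes via aprun:
# -n = total ranks, -N = ranks per node.
aprun -n {ranks} -N {CORES_PER_NODE} ./run_payload.sh
"""

if __name__ == "__main__":
    print(render_pbs_script(nodes=2, walltime="02:00:00"))
```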

- Pilot(s) execute on an HPC interactive node
- The pilot interacts with the local job scheduler to manage jobs
- Number of executing pilots = number of available slots in the local scheduler (a slot-counting sketch follows below)
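A schematic illustration of keeping the number of batch submissions within the per-user limit mentioned above. The limit, the account name and the parsing of qstat output are assumptions for this sketch, not the pilot's production logic.

```python
import subprocess

MAX_USER_JOBS = 4       # placeholder for the per-user limit in the scheduler
PBS_USER = "atlasprod"  # placeholder account name

def jobs_in_scheduler():
    """Count this user's jobs currently known to PBS/Torque.

    qstat prints a few header lines before the job table; job lines
    start with the numeric job ID, which is what we count here.
    """
    out = subprocess.check_output(["qstat", "-u", PBS_USER],
                                  universal_newlines=True)
    return sum(1 for line in out.splitlines() if line[:1].isdigit())

def free_slots():
    """How many more pilots/batch jobs this user may still submit."""
    return max(0, MAX_USER_JOBS - jobs_in_scheduler())
```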

- The SAGA API was chosen to encapsulate interactions with Titan's internal batch system (PBS); see the submission sketch below
  - High-level job description API
  - Set of adapters for different job submission systems (SSH and GSISSH; Condor and Condor-G; PBS and Torque; Sun Grid Engine; SLURM; IBM LoadLeveler)
  - Local and remote communication with job submission systems
  - API for different data transfer protocols (SFTP/GSIFTP; HTTP/HTTPS; iRODS)
- To avoid deployment restrictions, the SAGA API modules were included directly in the PanDA pilot code
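A minimal sketch of submitting a payload to a local PBS/Torque scheduler through the SAGA Python bindings. The queue, project, paths and core counts are placeholders, and error handling is omitted.

```python
import saga  # SAGA Python bindings, shipped inside the pilot code

# Connect to the local PBS/Torque scheduler through the SAGA PBS adaptor
job_service = saga.job.Service("pbs://localhost")

jd = saga.job.Description()
jd.executable        = "/lustre/atlas/proj-shared/run_payload.sh"  # placeholder path
jd.queue             = "batch"        # placeholder queue name
jd.project           = "PROJ123"      # placeholder allocation
jd.wall_time_limit   = 120            # minutes
jd.total_cpu_count   = 32             # e.g. 2 nodes x 16 cores (placeholder)
jd.output            = "payload.out"
jd.error             = "payload.err"
jd.working_directory = "/lustre/atlas/proj-shared/work"            # placeholder

job = job_service.create_job(jd)
job.run()                 # qsub happens behind the adaptor
print("submitted as", job.id)

job.wait()                # block until PBS reports completion
print("state:", job.state, "exit code:", job.exit_code)
```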

- A typical HPC facility runs on average at ~90% occupancy
  - On a machine of the scale of Titan that translates into ~300M unused core-hours per year
  - Anything that helps to improve this number is very useful
- We added to the PanDA Pilot the capability to collect, in near real time, information about currently free resources on Titan
  - Both the number of free worker nodes and the time they will remain available
- Based on that information the pilot can set the job submission parameters when forming the PBS script for Titan, tailoring the submission to the available resources (a backfill-query sketch follows below)
  - Takes Titan's scheduling policies into account
  - Can also take other limitations into account, such as workload output size, etc.
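A schematic sketch of querying the current backfill window with Moab's showbf utility and clipping the submission parameters to it. The output parsing and the column layout assumed here are simplifications, not the pilot's production code; the result could then feed the node count and walltime used when rendering the PBS script sketched above.

```python
import re
import subprocess

def backfill_window():
    """Return (free_nodes, minutes_available) from Moab's showbf, or None.

    The assumed line shape (partition, tasks, nodes, HH:MM:SS duration)
    is a simplification; real showbf output varies by site configuration.
    """
    out = subprocess.check_output(["showbf"], universal_newlines=True)
    for line in out.splitlines():
        m = re.match(r"\s*\S+\s+(\d+)\s+(\d+)\s+(\d+):(\d+):(\d+)", line)
        if m:
            nodes = int(m.group(2))
            minutes = int(m.group(3)) * 60 + int(m.group(4))
            return nodes, minutes
    return None

def tailor_submission(max_nodes, max_walltime_min):
    """Clip the requested allocation to the current backfill window."""
    window = backfill_window()
    if window is None:
        return None                       # no backfill slot right now
    nodes, minutes = window
    return min(nodes, max_nodes), min(minutes, max_walltime_min)
```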

Backfill test on Titan:
- Conducted together with Titan engineers
- Payload: ATLAS ALPGEN and ALICE full simulation
- Goals for the tests:
  - Demonstrate that the current approach provides a real improvement in Titan's utilization
  - Demonstrate that PanDA backfill has no negative impact (wait time, I/O effects, failure rate…)
  - Test the job submission chain and the whole system
- 24h test results:
  - Stable operation
  - 22k core-hours collected in 24h
  - Encouragingly short job wait times observed on Titan (a few minutes)

- Test the ‘Titan’ solution on other HPCs, as a first step on other Cray machines: Eos (ORNL), Hopper (NERSC)
- Improve the pilot to dynamically rearrange submission parameters (available resources and walltime limit) if a job gets stuck in the local queue
- Implement a new ‘computing backend’ plugin for ALCF

D. Benjamin, T. Childers, K. De, A. Filipcic, T. LeCompte, A. Klimentov, T. Maeno, S. Panitkin

- Goal: run ATLAS simulation jobs on volunteer computers
  - Huge potential resource
- Conditions:
  - Volunteer computers need to be virtualized (using the CernVM image and CernVM-FS to distribute software), so standard software can be used
  - The volunteer cloud should be integrated into PanDA, i.e. all volunteer computers should appear as a site; jobs are created by PanDA
  - No credentials should be put on volunteer computers
- Solution: use the BOINC platform together with PanDA components and a modified version of the ARC Control Tower (aCT) located between the user and the PanDA server
- Project status: successfully demonstrated in February 2014
  - A client running on a "volunteer" computer downloaded simulation jobs from the aCT; the payload was executed by the PanDA Pilot, and the results were uploaded back to the aCT, which put the output files in the proper storage element and updated the final job status in the PanDA server