ATLAS Distributed Computing in LHC Run 2
Simone Campana (CERN), on behalf of the ATLAS collaboration
CHEP 2015, 14/04/2015
The Run-2 Challenge
Run 1: trigger rate ~400 Hz, pile-up ~20. Run 2: trigger rate ~1 kHz, pile-up ~40.
A new detector (e.g. tracking, calorimeters).
Resources constrained by a "flat budget": no increase in funding for computing.
How to face the Run-2 challenge
New ATLAS distributed computing systems: Rucio for Data Management, Prodsys-2 for Workload Management, FAX and the Event Service to optimize resource usage.
More efficient utilization of resources: improvements in simulation/reconstruction, limiting resource consumption (e.g. memory sharing in multi-core jobs), optimized workflows (Derivation Framework / Analysis Model).
Leveraging opportunistic resources in addition to pledged ones: Grid, Cloud, HPC, Volunteer Computing.
A new data lifecycle management model.
New ATLAS distributed computing systems
Distributed Data Management: Rucio
The new ATLAS Data Management system, Rucio [1], has been in production since 1 December 2014.
Rucio is a sophisticated system, offering more features than its predecessor (DQ2).
[Plots: transferred files and transferred volume vs. time, ~1M files/day and ~2 PB/week.]
Already at this early stage, Rucio matches the core-functionality performance of the previous DDM system.
[Plots: Rucio and DQ2 deletion rates vs. time, ~5M files/day.]
Most of Rucio's potential, still unexplored, will be leveraged in production during Run 2.
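As a flavour of how Rucio is driven, here is a minimal usage sketch (not from the talk): declaring a replication rule from Python. It assumes a configured Rucio client environment; the dataset name and RSE expression are hypothetical, and the exact RuleClient.add_replication_rule signature should be checked against the Rucio documentation.

```python
# Hedged sketch: creating a Rucio replication rule for a dataset.
from rucio.client.ruleclient import RuleClient

rule_client = RuleClient()

# Hypothetical dataset identifier (scope, name).
dids = [{"scope": "mc15_13TeV", "name": "mc15_13TeV.123456.example.DAOD"}]

rule_ids = rule_client.add_replication_rule(
    dids=dids,
    copies=2,                               # keep two replicas
    rse_expression="tier=2&type=DATADISK",  # hypothetical: any two Tier-2 DATADISK endpoints
    lifetime=30 * 24 * 3600,                # rule expires after 30 days (seconds)
)
print(rule_ids)
```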
Remote data access: the Xrootd ATLAS Federation (FAX)
We deployed a federated storage infrastructure: all data accessible from any location. Goal reached: ~100% of data covered.
Increased resiliency against storage failures: FAILOVER.
Jobs can run at sites without local data but with free CPUs: OVERFLOW (up to 10% of jobs).
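The FAILOVER idea can be illustrated with a short sketch (not ATLAS production code): a job first tries its local storage element and, if that fails, retries the same file through the federation redirector. The paths and redirector host are hypothetical; PyROOT's ROOT.TFile.Open is used for the xrootd access.

```python
# Hedged sketch of FAX failover: local replica first, federation second.
import ROOT

LOCAL_PREFIX = "root://localse.example.org//atlas/rucio"          # hypothetical
FAX_REDIRECTOR = "root://fax-redirector.example.org//atlas/rucio"  # hypothetical

def open_with_failover(lfn):
    """Try the local replica first, then fall back to the FAX federation."""
    for prefix in (LOCAL_PREFIX, FAX_REDIRECTOR):
        f = ROOT.TFile.Open(prefix + lfn)
        if f and not f.IsZombie():
            return f
    raise IOError("could not open %s locally or via FAX" % lfn)

# Usage (hypothetical file name):
# events = open_with_failover("/mc15_13TeV/DAOD/file.root").Get("CollectionTree")
```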
Remote data access: FAX
[Plots: FAX site reliability; FAX failover rate, ~1000 jobs/day recovered (about 1% of failures).]
[Plots: FAX overflow CPU/wall-clock-time efficiency (local 83% vs. FAX 76%) and FAX overflow job efficiency (local 84% vs. FAX 43%).]
Distributed Production and Analysis
We developed a new service for simulated and detector data processing: Prodsys-2 [2].
Prodsys-2 core components (see the sketch below):
Request I/F: allows production managers to define a request.
DEFT: translates the user request into task definitions.
JEDI: generates the job definitions.
PanDA: executes the jobs on the distributed infrastructure.
JEDI + PanDA also provide the new framework for Distributed Analysis.
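A conceptual sketch of the request → task → job chain described above (hypothetical code, not the real Prodsys-2): DEFT turns a production request into task definitions, JEDI splits each task into job definitions, and PanDA would then broker those jobs to sites.

```python
# Hedged sketch of the DEFT/JEDI roles in Prodsys-2.
def deft_define_tasks(request):
    """One task per input dataset in the request (simplified)."""
    return [{"task_id": i,
             "input_dataset": ds,
             "step": request["step"],
             "files": request["files"][ds]}
            for i, ds in enumerate(request["datasets"])]

def jedi_define_jobs(task, files_per_job=10):
    """Split a task's input files into job-sized chunks."""
    files = task["files"]
    return [{"task_id": task["task_id"],
             "job_id": n,
             "inputs": files[i:i + files_per_job]}
            for n, i in enumerate(range(0, len(files), files_per_job))]

# Hypothetical request: two input datasets for a simulation step.
request = {"step": "simul",
           "datasets": ["mc15_13TeV.evgen.A", "mc15_13TeV.evgen.B"],
           "files": {"mc15_13TeV.evgen.A": [f"A.{i}" for i in range(25)],
                     "mc15_13TeV.evgen.B": [f"B.{i}" for i in range(7)]}}
tasks = deft_define_tasks(request)
jobs = [j for t in tasks for j in jedi_define_jobs(t)]  # 3 + 1 = 4 job definitions
# PanDA would dispatch each job to a site with free CPUs and accessible inputs.
```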
Prodsys-2 has been in production since 1 December 2014; JEDI has been in use for analysis since 8 August 2014.
Prodsys-2 and JEDI bring a large set of improvements:
Built-in file merging capability.
Dynamic job definition, optimizing resource scheduling.
Automated recovery of lost data.
Advanced task management interface.
New monitoring.
[Plots: cores of running jobs vs. time, peaking around 150k, across the Prodsys-1 + PanDA, Prodsys-1 + JEDI/PanDA and Prodsys-2 + JEDI/PanDA phases (05/2014 to 05/2015); completed analysis jobs vs. time around the migration to JEDI (07/2014 to 08/2014).]
More efficient utilization of resources
Simulation
Simulation is CPU intensive.
Integrated Simulation Framework: mixing of full Geant4 and fast simulation within an event. Work in progress, target is 2016.
More events per 12-hour job mean larger output files, fewer transfers and merges, and less I/O; alternatively, shorter and more granular jobs suit opportunistic resources.
Reconstruction
Reconstruction is memory hungry and requires non-negligible CPU (about 40% of simulation's, i.e. ~20% of ATLAS CPU usage).
AthenaMP [3]: multi-processing reduces the memory footprint (sketched below).
[Plots: Athena memory profile, serial vs. multi-process (MP), against the 2 GB/core reference; running jobs, single-core vs. multi-core; reconstruction time (s/event) vs. time.]
Code and algorithm optimization largely reduced the CPU needs of reconstruction [4].
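The memory saving comes from the fork-after-initialization pattern. This is not AthenaMP itself, just a minimal Python sketch of the idea: the parent loads large, read-only state (geometry, conditions) once, then forks workers that share those pages copy-on-write, so the cost is roughly one copy instead of one per core.

```python
# Hedged sketch of fork-after-initialization with copy-on-write sharing.
import os

def load_conditions():
    # Stand-in for loading geometry/conditions data (large in real Athena jobs).
    return bytearray(100 * 1024 * 1024)  # 100 MB of shared, read-only payload

def process_events(worker_id, conditions, events):
    for evt in events:
        pass  # stand-in: reconstruct the event using the shared conditions

if __name__ == "__main__":
    conditions = load_conditions()        # paid once, before forking
    all_events = list(range(1000))
    n_workers = 4
    chunks = [all_events[i::n_workers] for i in range(n_workers)]

    pids = []
    for wid, chunk in enumerate(chunks):
        pid = os.fork()                   # child shares parent's pages (COW)
        if pid == 0:
            process_events(wid, conditions, chunk)
            os._exit(0)
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)
```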
Analysis Model
Common analysis data format, xAOD: replaces the AOD and group ntuples of any kind; readable by both Athena and ROOT.
Data reduction framework [5]: Athena produces group-derived data samples (DxAOD), centrally via Prodsys.
Based on a train model: one input, N outputs, reducing the data from PB to TB scale.
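The train model can be pictured with a short sketch (illustrative only, not the Derivation Framework): a single pass over the input feeds N independent skims ("carriages"), each with its own selection and its own output. The selections and stream names are hypothetical.

```python
# Hedged sketch of the derivation "train": one input pass, N skimmed outputs.
def run_train(events, carriages):
    outputs = {name: [] for name in carriages}
    for evt in events:                      # single pass over the input
        for name, select in carriages.items():
            if select(evt):                 # carriage-specific skim
                outputs[name].append(evt)   # goes to that carriage's output
    return outputs

carriages = {
    "DAOD_HIGGS":   lambda e: e["n_leptons"] >= 2,
    "DAOD_EXOTICS": lambda e: e["met"] > 200.0,
    "DAOD_TOP":     lambda e: e["n_jets"] >= 4,
}

events = [{"n_leptons": 2, "met": 50.0, "n_jets": 5},
          {"n_leptons": 0, "met": 250.0, "n_jets": 2}]
print({name: len(out) for name, out in run_train(events, carriages).items()})
```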
Leveraging opportunistic resources
At peak rate, almost 50% of ATLAS production relies on opportunistic resources.
[Plot: cores used by ATLAS running jobs vs. time (05/2014 to 03/2015), peaking near 200k against a pledge of ~100k.]
Efficient utilization of the widest possible variety of opportunistic resources is vital for ATLAS.
Enabling the use of non-Grid resources is a long-term investment, beyond purely opportunistic use.
(Opportunistic) Cloud Resources
We invested a lot of effort in enabling the use of Cloud resources [6].
For example, the ATLAS HLT farm at the CERN ATLAS pit (P1) was instrumented with a Cloud interface to run simulation: Sim@P1 [7].
[Plot: simulated events vs. time (07/09/14 to 04/10/14) for T1s, T2s and CERN P1; ~20M events/day, with P1 contributing approximately 5%.]
The HLT farm was also dynamically reconfigured to run reconstruction on multi-core resources (Reco@P1). We expect to be able to do the same with other clouds.
HPCs
High Performance Computers were designed for massively parallel applications, unlike the typical HEP use case, but we can parasitically benefit from empty cycles that others cannot use (e.g. single-core job slots).
The ATLAS production system has been extended to leverage HPC resources [8].
24-hour test at the Oak Ridge Titan system (#2 HPC machine in the world, 299,008 cores): ATLAS event generation used 200,000 CPU hours on 90k parallel cores, equivalent to 70% of our Grid resources.
[Plot: running EVNT, SIMUL and RECO jobs vs. time (08/09/14 to 05/10/14) at MPPMU, LRZ and CSCS, averaging 1,700 running jobs.]
Mira at Argonne: Sherpa generation using 12,244 nodes with 8 threads per node, i.e. 97,952 parallel Sherpa processes.
The goal is to validate as many workflows as possible. Today approximately 5% of ATLAS production runs on HPCs.
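How embarrassingly parallel HEP work fills a large HPC allocation can be sketched as follows (hypothetical, not the ATLAS HPC integration): each MPI rank independently generates its own disjoint slice of events, with no inter-rank communication during the event loop.

```python
# Hedged sketch: filling an HPC allocation with independent event generation.
# Launch with e.g. `mpirun -n <many> python this_script.py` (mpi4py required).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

TOTAL_EVENTS = 10_000_000            # hypothetical production request
events_per_rank = TOTAL_EVENTS // size
first_event = rank * events_per_rank

def generate_event(event_number):
    pass                              # stand-in for the actual generator call

for evt in range(first_event, first_event + events_per_rank):
    generate_event(evt)               # each rank works on a disjoint event range

# Each rank writes its own output; results are merged downstream.
```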
Volunteer Computing
Enabling users' laptops and desktops to run ATLAS simulation [9]: http://atlasathome.cern.ch/
[Plots: number of running jobs vs. time and number of users/hosts vs. time (scales of ~4k and ~16k), spanning April 2014 to March 2015.]
Event Service
Efficient utilization of opportunistic resources implies short payloads (get off the resource quickly if the owner needs it back).
We developed a system to deliver payloads as short as a single event: the Event Service [10]. It will be commissioned during 2015.
Event Service Schematic
[Diagram: event requester, event dispatcher with event-level bookkeeping, event data service with asynchronous data cache, parallel payload event loop, output stager and merge into an object store.]
A fine-grained dispatcher intelligently manages requests every few minutes per node; assigned events are efficiently fetched, locally or over the WAN, and buffered asynchronously; they are processed free of fetch latency; outputs are uploaded in near real time and merged on job completion.
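The flow in the schematic can be mimicked with a small sketch (hypothetical, not the real Event Service): a dispatcher hands out small event ranges, workers process them, and each fine-grained output is shipped to an object store as soon as it is ready, so little work is lost if the node is reclaimed.

```python
# Hedged sketch of event-level dispatching with near-real-time output upload.
import queue

def dispatcher(total_events, range_size=50):
    q = queue.Queue()
    for first in range(0, total_events, range_size):
        q.put((first, min(first + range_size, total_events)))  # event range
    return q

def upload_to_object_store(key, payload):
    pass  # stand-in for an object-store PUT, done as each range completes

def worker(worker_id, q):
    while True:
        try:
            first, last = q.get_nowait()      # request the next event range
        except queue.Empty:
            return
        output = [f"processed event {e}" for e in range(first, last)]
        upload_to_object_store(f"worker{worker_id}_{first}_{last}", output)

q = dispatcher(total_events=1000)
for wid in range(4):
    worker(wid, q)   # in reality these run in parallel on (opportunistic) nodes
```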
New data lifecycle management model, a.k.a. "you can get unpledged CPU but not so much unpledged disk"
Dynamic Data Replication and Reduction
[Diagram: data popularity driving dynamic replication and dynamic reduction, with cached vs. pinned space.]
18 months ago…
[Plot: disk occupancy at T1s vs. time, split into primary (pinned), default (pinned) and dynamically managed space.]
23 PB on disk had been created in the previous 3 months and never accessed; 8 PB of data on disk had never been touched.
T1 dynamically managed space (green in the plot) was unacceptably small, compromising our strategy of dynamic replication and cleaning of popular/unpopular data.
A large fraction of primary space was occupied by old and unused data.
The new data lifecycle model
Every dataset has a lifetime, set at creation.
The lifetime can be infinite (e.g. RAW data).
The lifetime can be extended if the dataset is accessed.
Datasets with expired lifetime can disappear from disk and tape at any time.
ATLAS Distributed Computing flexibly manages data replication and reduction within the boundaries of lifetime and retention (see the sketch after this list):
Increase or reduce the number of copies based on data popularity.
Redistribute data between T1s and T2s.
Move data to tape and free up disk space.
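A minimal sketch of the lifetime rules above (an assumption-laden illustration, not the production policy engine): a lifetime set at creation, possibly infinite, extended on access, and a check for whether a dataset is eligible for deletion.

```python
# Hedged sketch of dataset lifetimes in the lifecycle model.
from datetime import datetime, timedelta

class Dataset:
    def __init__(self, name, lifetime_days=None):
        self.name = name
        self.created = datetime.utcnow()
        # None means an infinite lifetime (e.g. RAW data never expires).
        self.expires = (self.created + timedelta(days=lifetime_days)
                        if lifetime_days is not None else None)

    def touch(self, extension_days=90):
        """Accessing a dataset extends its lifetime (hypothetical policy)."""
        if self.expires is not None:
            self.expires = max(self.expires,
                               datetime.utcnow() + timedelta(days=extension_days))

    def is_expired(self):
        """Expired datasets may disappear from disk and tape at any time."""
        return self.expires is not None and datetime.utcnow() > self.expires

raw = Dataset("data15_13TeV.RAW", lifetime_days=None)      # infinite lifetime
daod = Dataset("mc15_13TeV.DAOD_TOP", lifetime_days=180)   # finite lifetime
daod.touch()                                                # popular data survive
print(raw.is_expired(), daod.is_expired())
```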
Implications of the model
We will use more tape; access to tape remains "centralized".
For the first time we will "delete" from tape: in the steady state we will delete approximately as much as we write.
Access through storage back doors is not accounted for today. We will improve this, but most people use the official tools (PanDA/Rucio).
After the first (partial) run of the model…
[Plots: T1 tape occupancy vs. time; number of dataset accesses; T1 disk occupancy vs. time, split into pinned and cached space.]
Only 1.2 PB of disk-resident data older than one year has never been accessed; it was 8 PB before.
Conclusions
A lot of hard work went into preparing ATLAS Software and Computing for Run 2: a balanced mixture of evolution and revolution.
Commissioning of the new systems was carried out in a non-disruptive manner.
Our systems are ready for the new challenges, and we have not yet explored many of their new capabilities.
References to relevant ATLAS contributions
[1] CHEP ID 205 - The ATLAS Data Management system, Rucio: commissioning, migration and operational experiences (Vincent Garonne)
[2] CHEP ID 100 - Scaling up ATLAS production system for the LHC Run 2 and beyond: project ProdSys2 (Alexei Klimentov)
[3] CHEP ID 165 - Running ATLAS workloads within massively parallel distributed applications using the Athena Multi-Process framework (AthenaMP) (Vakhtang Tsulaia)
[4] CHEP ID 147 - Preparing ATLAS Reconstruction for LHC Run 2 (Jovan Mitrevski)
[5] CHEP ID 164 - New Petabyte-scale Data Derivation Framework for ATLAS (James Catmore)
[6] CHEP ID 146 - Evolution of Cloud Computing in ATLAS (Ryan Taylor)
References to relevant ATLAS contributions (continued)
[7] CHEP ID 169 - Design, Results, Evolution and Status of the ATLAS simulation in Point1 project (Franco Brasolin)
[8] CHEP ID 92 - ATLAS computing on the HPC Piz Daint machine (Michael Arthur Hostettler)
[8] CHEP ID 153 - Bringing ATLAS production to HPC resources - A use case with the Hydra supercomputer of the Max Planck Society (Luca Mazzaferro)
[8] CHEP ID 152 - Integration of the PanDA workload management system with the Titan supercomputer at OLCF (Sergey Panitkin)
[8] CHEP ID 140 - Fine grained event processing on HPCs with the ATLAS Yoda system (Vakhtang Tsulaia)
[9] CHEP ID 170 - ATLAS@Home: Harnessing Volunteer Computing for HEP (David Cameron)
[10] CHEP ID 183 - The ATLAS Event Service: A new approach to event processing (Torre Wenaus)