ATLAS Computing. Alessandro De Salvo, CCR Workshop, 18-5-2016

Presentation transcript:

ATLAS Computing, Alessandro De Salvo, CCR Workshop, 18-5-2016

2015 Data Taking: >92% data-taking efficiency in 2015. [Figures: collected data 2015, pile-up 2015, and collected data 2016 (>93% efficiency).]

LHC Upgrade Timeline: the challenge to computing repeats periodically. [Figure: LHC upgrade timeline showing the integrated luminosity per phase (values not recovered in the transcript) and the HLT readout rate rising from 0.4 kHz to 1 kHz to 5-10 kHz.]

The data rate and volume challenge: in ~10 years the number of events per second increases by a factor of 10, meaning more events to process and more events to store. [Figure: same LHC timeline with HLT readout rates 0.4 kHz, 1 kHz, 5-10 kHz.]

The data complexity challenge: in ~10 years the luminosity increases by a factor of 10, meaning more complex events. [Figure: same LHC timeline with HLT readout rates 0.4 kHz, 1 kHz, 5-10 kHz.]

Pile-up challenge
The average pile-up will be ⟨μ⟩ = 30 in 2016, ⟨μ⟩ = 35 in 2017, ..., ⟨μ⟩ = 200 at the HL-LHC (in ~10 years).
Higher pile-up means:
- a linear increase of the digitization time
- a factorial (combinatorial) increase of the reconstruction time
- slightly larger events
- much more memory
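A rough parametrization of this scaling, in which a, b and n are purely hypothetical placeholders and not measured ATLAS numbers, would read:

```latex
% Illustrative parametrization only: a, b and n are hypothetical placeholders.
\[
  t_{\text{digi}}(\langle\mu\rangle) \;\approx\; a \,\langle\mu\rangle ,
  \qquad
  t_{\text{reco}}(\langle\mu\rangle) \;\approx\; b \,\langle\mu\rangle^{\,n},
  \quad n > 1 .
\]
```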

Simulation – full and fast. [Figure: time/event distributions (log scale, note!) for full and fast simulation, illustrating the trade-off between speed and precision.]

Simulation
- Simulation is CPU intensive
- Integrated Simulation Framework: mixing of full GEANT and fast simulation within an event (see the sketch below)
- Work in progress, target is …
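A minimal sketch of the idea behind mixing full and fast simulation within one event: route each particle to one of the two simulators according to a policy. All names (FullSim, FastSim, route_particle) and the routing policy are hypothetical illustrations, not the actual ATLAS ISF interfaces.

```python
# Sketch: per-particle routing between full and fast simulation in one event.
# Class/function names and the routing policy are invented for illustration.

class FullSim:
    def simulate(self, particle):
        return f"full-sim hits for {particle['type']}"

class FastSim:
    def simulate(self, particle):
        return f"parametrized hits for {particle['type']}"

def route_particle(particle):
    # Example policy: photons in the calorimeter region go to the fast
    # parametrization, everything else to full simulation.
    if particle["region"] == "calorimeter" and particle["type"] == "photon":
        return "fast"
    return "full"

def simulate_event(event, full=FullSim(), fast=FastSim()):
    hits = []
    for particle in event:
        sim = fast if route_particle(particle) == "fast" else full
        hits.append(sim.simulate(particle))
    return hits

if __name__ == "__main__":
    event = [
        {"type": "photon", "region": "calorimeter"},
        {"type": "muon", "region": "muon-spectrometer"},
    ]
    print(simulate_event(event))
```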

Reconstruction
- Reconstruction is memory hungry and requires non-negligible CPU (40% w.r.t. simulation, 20% of total ATLAS CPU usage)
- AthenaMP: multi-processing reduces the memory footprint (sketch below)
- Code and algorithm optimization largely reduced the CPU needs in reconstruction [4]
[Figures: Athena memory profile (single-core vs multi-core, 2 GB/core) for running jobs vs time, and reconstruction time (s/event) for serial vs MP running.]
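The memory saving comes from forking worker processes after initialization, so that large read-only data is shared copy-on-write instead of being duplicated. Below is a minimal Python sketch of that mechanism only, not the AthenaMP implementation; the "conditions" payload and event loop are placeholders.

```python
# Sketch of fork-based copy-on-write sharing: data loaded once in the parent
# (e.g. geometry/conditions) is shared by the children instead of duplicated.
import multiprocessing as mp
import os

# Large read-only "conditions" data, loaded once in the parent process.
CONDITIONS = bytes(200 * 1024 * 1024)  # ~200 MB of zeros as a stand-in

def process_event(event_id):
    # Workers only read CONDITIONS, so its pages stay shared with the parent.
    sample = CONDITIONS[event_id % len(CONDITIONS)]
    return (event_id, os.getpid(), sample)

if __name__ == "__main__":
    # The 'fork' start method (the Linux default) is what enables the sharing.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=4) as pool:
        for event_id, pid, _ in pool.map(process_event, range(8)):
            print(f"event {event_id} processed by worker pid {pid}")
```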

Derivations
- New analysis model for Run 2: group data formats (DAODs) made using a train model (see the sketch below)
- Production of 84+ DAOD species by 19 trains on the grid, within 24h after data reconstruction at Tier-0
- Vital for quick turn-around and robustness of analyses
- ATLAS results from 2015 onwards are based on DAODs
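The essence of the train model is that one pass over the reconstructed events feeds several derived outputs at once, each with its own skimming and slimming. The sketch below illustrates that structure only; the format names and selections are invented, not real DAOD definitions.

```python
# Toy "derivation train": one read of the input, many derived outputs.
# Carriage names and selections are made up for illustration.

def make_train(carriages):
    def run(events):
        outputs = {name: [] for name in carriages}
        for event in events:                 # single pass over the input ...
            for name, (skim, slim) in carriages.items():
                if skim(event):              # ... filling every carriage
                    outputs[name].append(slim(event))
        return outputs
    return run

if __name__ == "__main__":
    carriages = {
        "DAOD_EXAMPLE1": (lambda e: e["n_muons"] >= 2,
                          lambda e: {"muons": e["muons"]}),
        "DAOD_EXAMPLE2": (lambda e: e["met"] > 100.0,
                          lambda e: {"met": e["met"], "jets": e["jets"]}),
    }
    events = [
        {"n_muons": 2, "muons": ["mu1", "mu2"], "met": 35.0, "jets": ["j1"]},
        {"n_muons": 0, "muons": [], "met": 180.0, "jets": ["j1", "j2"]},
    ]
    for name, selected in make_train(carriages)(events).items():
        print(name, len(selected), "events")
```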

Hardware trends and implications
- Clock speed has stalled (bad)
- Transistor density keeps increasing (good)
- Memory per core diminishes: WLCG assumes 2 GB/core; Xeon Phi: 60 cores, 16 GB; Tesla K40: 2880 cores, 16 GB (worked numbers below)
- Multi-processing (AthenaMP) will not be sufficient anymore
- Future framework: multi-threading and parallelism
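As a quick worked example, using only the figures quoted on the slide:

```latex
\[
  \frac{16\ \text{GB}}{60\ \text{cores}} \approx 0.27\ \text{GB/core (Xeon Phi)},
  \qquad
  \frac{16\ \text{GB}}{2880\ \text{cores}} \approx 5.6\ \text{MB/core (Tesla K40)},
\]
\[
  \text{to be compared with the } 2\ \text{GB/core assumed on WLCG resources.}
\]
```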

Future Framework Requirements Group
- Established jointly between Trigger/DAQ and Computing
- Examines the needs of a future framework satisfying both offline and HLT use cases
- Reported in December
- Run-3 multi-threaded reconstruction cartoon: colours represent different events, shapes different algorithms; all in one process running multiple threads (structural sketch below)
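The sketch below only mirrors the structure of that cartoon: one process, a shared thread pool, several events in flight with their algorithms scheduled as tasks. The algorithm names are invented; the real multi-threaded framework is C++ based, and in Python the GIL would prevent true CPU parallelism, so this is a structural illustration rather than a performance demo.

```python
# Structural sketch of the Run-3 cartoon: one process, a shared thread pool,
# (event, algorithm) pairs as units of work so several events are in flight.
# A real scheduler would also respect data dependencies between algorithms.
from concurrent.futures import ThreadPoolExecutor
from itertools import product

ALGORITHMS = ["Tracking", "Calorimetry", "MuonReco"]   # shapes in the cartoon
EVENTS = [0, 1, 2]                                     # colours in the cartoon

def run_algorithm(event_id, algorithm):
    return f"event {event_id}: {algorithm} done"

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:    # one process, N threads
        futures = [pool.submit(run_algorithm, ev, alg)
                   for ev, alg in product(EVENTS, ALGORITHMS)]
        for fut in futures:
            print(fut.result())
```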

Timeline
- We want a multi-threaded framework in place for Run 3, allowing experience with multi-threaded running before the HL-LHC
- Thus most development should be done by the start of LS2, which is now only 2 years away
- By the end of Run 2 we should have a functional multi-threaded prototype ready for testing

Leveraging opportunistic resources
- Almost 50% of ATLAS production at peak rate relies on opportunistic resources
- Today most opportunistic resources are accessible via Grid interfaces/services
- Enabling the utilization of non-Grid resources is a long-term investment (beyond opportunistic use)
[Figure: number of cores used by running ATLAS jobs from 01/05/14 to 01/03/15, peaking near 450k cores against a ~100k-core pledge, including an AWS burst.]

Grid and off-Grid resources
- The global community did not fully buy into Grid technologies, which were nevertheless very successful for us: we have a dedicated network of sites, using custom software and serving (mostly) the WLCG community
- Finding opportunistic/common resources: High Performance Computing centers, opportunistic and commercial cloud resources
- You ask for resources through a defined interface and get access to, and control of, a (virtual) machine rather than a job slot on the Grid (see the sketch below)
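As an illustration of "asking for a machine through a defined interface", here is a minimal boto3 sketch against the AWS EC2 API. The AMI id, instance type and bootstrap script are placeholders; real ATLAS cloud provisioning is handled by dedicated services, not by this snippet.

```python
# Sketch: provisioning a worker VM via the cloud provider's API (AWS EC2 /
# boto3). ImageId, InstanceType and the user-data script are placeholders.
import boto3

BOOTSTRAP = """#!/bin/bash
# Placeholder user-data: configure the node and start a pilot job here.
echo "worker node bootstrapped"
"""

def request_workers(n=1):
    ec2 = boto3.client("ec2", region_name="us-east-1")
    response = ec2.run_instances(
        ImageId="ami-00000000",   # placeholder image with the experiment software
        InstanceType="m4.large",  # placeholder flavour
        MinCount=n,
        MaxCount=n,
        UserData=BOOTSTRAP,
    )
    return [inst["InstanceId"] for inst in response["Instances"]]

if __name__ == "__main__":
    print(request_workers(1))
```

Running this requires valid cloud credentials; the point is only that the provisioning step is an API call returning a machine you control, rather than a Grid job submission returning a slot.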

(Opportunistic) Cloud resources
- We invested a lot of effort in enabling the usage of cloud resources
- The ATLAS HLT farm at the CERN ATLAS pit (P1), for example, was instrumented with a cloud interface in order to run simulation
- The HLT farm was dynamically reconfigured to run reconstruction on multi-core resources; we expect to be able to do the same with other clouds
[Figure: number of events vs time from 07/09/14 to 04/10/14 for Tier-1s, Tier-2s and CERN P1 (approx. 5%, around 20M events/day); P1 4-day sum shown.]

HPCs
- High Performance Computers were designed for massively parallel applications (different from the HEP use case), but we can parasitically benefit from empty cycles that others cannot use, e.g. single-core job slots (see the sketch below)
- 24h test at the Oak Ridge Titan system (#2 HPC machine in the world, 299,008 cores): ATLAS event generation used 200,000 CPU hours on 90k parallel cores (equivalent to 70% of our Grid resources); Sherpa generation using nodes with 8 threads per node, i.e. 97,952 parallel Sherpa processes
- The ATLAS production system has been extended to leverage HPC resources; the goal is to validate as many workflows as possible
- Today approximately 5% of ATLAS production runs on HPCs
[Figure: running cores vs time, around 10,000 running cores.]
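Event generation is embarrassingly parallel, which is what makes it suitable for filling idle HPC cycles. A minimal mpi4py sketch of that pattern is shown below; generate_events() is a stand-in for the real generator (e.g. Sherpa), and the seeds/event counts are illustrative.

```python
# Sketch: independent event-generation tasks filling HPC cores. Each MPI rank
# runs its own job with its own seed; no inter-rank communication is needed.
from mpi4py import MPI
import random

def generate_events(seed, n_events):
    rng = random.Random(seed)
    # Placeholder "generation": draw n_events random event weights.
    return [rng.random() for _ in range(n_events)]

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    events = generate_events(seed=rank, n_events=1000)
    print(f"rank {rank}/{size}: generated {len(events)} events")
    # In a real backfill workload each rank would write its events to its own
    # output file for later merging.
```

Such a script would typically be launched with something like `mpirun -n <N> python generate.py` under the site's batch system.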

Challenges in HPC utilisation
- Blue Gene: PowerPC architecture
- Restrictive site policies: inbound/outbound connectivity, limits on the number of jobs and threads

Networking
- Networking is the one item that will most probably continue its progress and evolution, both in terms of bandwidth increase and in terms of new technologies

Content Delivery Networking (slide from T. Wenaus)

Storage endpoints
- 75% of the available Tier-2 storage is in ~30 sites, with a large disparity in the size of Tier-2s; some Tier-2s are already larger than some Tier-1s
- It is more efficient to have larger and fewer storage endpoints
- Two possible categories: 'cache-based' and 'large' Tier-2s
- Storage endpoints below 300 TB: do not plan an increase of storage (pledges) in the next years, or aggregate with other endpoints to form an entity larger than 300 TB (toy encoding of this rule below)
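A toy encoding of that consolidation guideline, with invented site names and sizes, just to make the rule explicit:

```python
# Toy check of the slide's guideline: endpoints below 300 TB should either not
# grow their pledges or aggregate above 300 TB. Sites and sizes are invented.
THRESHOLD_TB = 300

def consolidation_advice(endpoints_tb):
    advice = {}
    for site, size in endpoints_tb.items():
        if size < THRESHOLD_TB:
            advice[site] = ("do not plan storage (pledge) increases, or "
                            "aggregate with other endpoints above 300 TB")
        else:
            advice[site] = "keep as a large storage endpoint"
    return advice

if __name__ == "__main__":
    endpoints = {"SITE_A": 120, "SITE_B": 450, "SITE_C": 250}   # TB, invented
    for site, note in consolidation_advice(endpoints).items():
        print(f"{site}: {note}")
```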

What might ATLAS Computing look like?
- Even today, our CPU capacities fit into one supercomputing center. The future: allocations at HPC centers, commercial and academic clouds, computing on demand, Grid technologies
- With the network evolution, 'local' storage becomes redefined: consolidate/federate storage into a few endpoints, data caching and remote data access
- The main item that does not have many solutions, and which gives a severe constraint, is our data storage: we need reliable and permanent storage under ATLAS control

Conclusions
- The computing model has been adapted to Run 2
- 2015 data processing and distribution was a success
- 2016 data taking has started smoothly
- No big changes are envisaged for Run 3
- In the future: more efficient usage of opportunistic resources and reorganization of the global storage facilities