ATLAS Computing
Alessandro De Salvo, CCR Workshop, 18-5-2016
2015 Data Taking
Data-taking efficiency above 92% for the data collected in 2015 and above 93% for the data collected so far in 2016. (Plots: collected data in 2015 and 2016, and the 2015 pile-up distribution.)
3
HLT: Readout rate 5-10 kHz HLT: Readout rate 1 kHz HLT: Readout rate 0.4 kHz LHC Upgrade Timeline The Challenge to Computing Repeats periodically! 3 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 … 2037 30 fb -1 150 fb -1 300 fb -1
4
2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 … 2037 30 fb -1 150 fb -1 300 fb -1 HLT: Readout rate 5-10 kHz HLT: Readout rate 1 kHz HLT: Readout rate 0.4 kHz In ~10 years, increase by factor 10 the number of events per second More events to process More events to store In ~10 years, increase by factor 10 the number of events per second More events to process More events to store The data rate and volume challenge 4
5
2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 … 2037 30 fb -1 150 fb -1 300 fb -1 HLT: Readout rate 5-10 kHz HLT: Readout rate 1 kHz HLT: Readout rate 0.4 kHz In ~10 years, increase by factor 10 the luminosity More complex events In ~10 years, increase by factor 10 the luminosity More complex events The data complexity challenge 5
Pile-up challenge
The average pile-up will be ⟨μ⟩ ≈ 30 in 2016, ⟨μ⟩ ≈ 35 in 2017, ... and ⟨μ⟩ ≈ 200 at the HL-LHC (in about 10 years). Higher pile-up means a linear increase of the digitisation time, a much steeper (combinatorial) growth of the reconstruction time, slightly larger events and a much larger memory footprint.
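As a rough illustration of why reconstruction rather than digitisation is the worry (my own simplified model, not an official ATLAS parametrisation; the exponent p is an assumption):

```latex
% Simplified per-event CPU model (assumption, for illustration only):
% digitisation scales linearly with pile-up, reconstruction much faster
% than linearly because of the combinatorics of tracking.
\begin{align}
  t_{\text{digi}}(\mu) &\approx a\,\mu,\\
  t_{\text{reco}}(\mu) &\approx b\,\mu^{p}, \qquad p > 1 .
\end{align}
% Going from <mu> = 30 to <mu> = 200 multiplies t_digi by ~6.7,
% but t_reco by (200/30)^p, i.e. ~45 for p = 2.
```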
Simulation – full and fast
(Plot: time-per-event distribution, on a log scale, for full and fast simulation, illustrating the trade-off between speed and precision.)
Simulation
Simulation is CPU intensive. The Integrated Simulation Framework allows mixing of full Geant simulation and fast simulation within the same event. Work in progress, target is 2016.
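A minimal sketch of the idea of routing particles within one event either to full or to fast simulation (my own illustration, not ISF code; the routing rule and thresholds are invented):

```python
# Illustrative sketch of per-particle simulator dispatch within one event.
from dataclasses import dataclass

@dataclass
class Particle:
    pdg_id: int      # particle type (PDG code)
    energy: float    # GeV
    in_calo: bool    # heading into the calorimeter?

def simulate_full(p: Particle) -> str:
    return f"full-sim hit record for {p.pdg_id}"          # stand-in for Geant4

def simulate_fast(p: Particle) -> str:
    return f"fast-sim parametrised shower for {p.pdg_id}"

def route(p: Particle) -> str:
    # Example rule: parametrise low-energy electromagnetic showers in the
    # calorimeter, send everything else through full simulation.
    if p.in_calo and abs(p.pdg_id) in (11, 22) and p.energy < 10.0:
        return simulate_fast(p)
    return simulate_full(p)

event = [Particle(11, 2.5, True), Particle(211, 40.0, False), Particle(22, 1.2, True)]
hits = [route(p) for p in event]   # one event, mixed full and fast simulation
```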
Reconstruction
Reconstruction is memory hungry and requires non-negligible CPU (about 40% of the simulation cost, roughly 20% of total ATLAS CPU usage). AthenaMP multi-processing reduces the memory footprint compared to independent single-core jobs. Code and algorithm optimisation has largely reduced the CPU needs of reconstruction [4]. (Plots: Athena memory profile for serial vs multi-process running jobs against the 2 GB/core limit, and reconstruction time in s/event.)
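A minimal sketch of the fork-after-initialisation idea behind AthenaMP (my own illustration, not the actual framework): the parent process loads the large read-only data once, then forks workers that share those pages copy-on-write, so the per-core memory cost stays well below that of N independent jobs.

```python
# Sketch of fork-after-initialisation (the AthenaMP idea), not ATLAS code.
import multiprocessing as mp

def load_conditions():
    # Stand-in for the large read-only state (geometry, conditions, field maps)
    # that dominates the memory footprint of a reconstruction job.
    return {"geometry": bytearray(200 * 1024 * 1024)}   # ~200 MB, shared after fork

CONDITIONS = None

def worker(event_range):
    # After fork, CONDITIONS pages are shared copy-on-write with the parent:
    # reading them costs (almost) no extra physical memory per worker.
    return [f"reconstructed event {i} using {len(CONDITIONS['geometry'])} bytes of geometry"
            for i in event_range]

if __name__ == "__main__":
    CONDITIONS = load_conditions()             # initialise once, before forking
    ctx = mp.get_context("fork")               # fork start method (Linux) keeps shared pages
    with ctx.Pool(processes=4) as pool:
        chunks = [range(i, i + 25) for i in range(0, 100, 25)]
        results = pool.map(worker, chunks)     # 4 workers, one shared geometry
```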
Derivations
New analysis model for Run 2: group data formats (DAODs) produced with a train model. More than 84 DAOD species are produced by 19 trains on the Grid within 24 hours of data reconstruction at Tier-0. This is vital for quick turnaround and robustness of analyses; ATLAS results from 2015 onwards are based on DAODs.
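A minimal sketch of the train model (my own illustration; the skimming rules and event model are invented): each input event is read once and offered to every "carriage" (derivation) on the train, and each derivation writes its own output stream.

```python
# Sketch of a derivation "train": read each event once, run several
# derivations (carriages) over it, write one output stream per derivation.
from typing import Callable, Dict, Iterable, List

Derivation = Callable[[dict], bool]        # returns True if the event is kept

def run_train(events: Iterable[dict], carriages: Dict[str, Derivation]) -> Dict[str, List[dict]]:
    outputs: Dict[str, List[dict]] = {name: [] for name in carriages}
    for event in events:                   # single pass over the input AOD
        for name, keep in carriages.items():
            if keep(event):
                outputs[name].append(event)    # slimming/thinning would happen here
    return outputs

carriages = {
    "DAOD_EXOT1": lambda e: e["met"] > 150.0,          # example skim: high missing ET
    "DAOD_HIGG2": lambda e: len(e["leptons"]) >= 2,    # example skim: dilepton events
}
events = [{"met": 200.0, "leptons": ["e"]}, {"met": 20.0, "leptons": ["e", "mu"]}]
streams = run_train(events, carriages)     # one pass, two output streams
```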
Hardware trends and implications
Clock speed has stalled (bad), transistor density keeps increasing (good), and memory per core keeps shrinking: WLCG assumes 2 GB/core, while a Xeon Phi offers 60 cores with 16 GB and a Tesla K40 has 2880 cores with 16 GB. Multi-processing (AthenaMP) will no longer be sufficient; the future framework needs multi-threading and finer-grained parallelism.
Future Framework Requirements Group
Established between Trigger/DAQ and Computing to examine the needs of a future framework satisfying both offline and HLT use cases. Reported in December: https://cds.cern.ch/record/1974156/
(Cartoon of Run 3 multi-threaded reconstruction: colours represent different events, shapes different algorithms, all within one process running multiple threads.)
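A minimal sketch of the inter-event concurrency pictured in the cartoon (my own illustration in Python; the real framework is C++ and its scheduler also resolves data dependencies between algorithms of the same event): several events are in flight at once inside one process, sharing a thread pool.

```python
# Sketch of intra-process, multi-event scheduling on a shared thread pool.
# Algorithm names are invented placeholders.
from concurrent.futures import ThreadPoolExecutor

ALGORITHMS = ["tracking", "calo_clustering", "jet_finding"]   # per-event chain

def run_algorithm(event_id: int, algorithm: str) -> str:
    # In real life this would read/write the event store; here it just labels work.
    return f"event {event_id}: {algorithm} done"

def process_event(event_id: int) -> list:
    # Algorithms of one event run sequentially here for simplicity; a real
    # scheduler also runs independent algorithms of the same event in parallel.
    return [run_algorithm(event_id, alg) for alg in ALGORITHMS]

with ThreadPoolExecutor(max_workers=4) as pool:          # one process, many threads
    results = list(pool.map(process_event, range(8)))    # 8 events in flight
```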
Timeline
We want a multi-threaded framework in place for Run 3, which lets us gain experience with multi-threaded running before the HL-LHC. Most of the development should therefore be done by the start of LS2, now only 2 years away; by the end of Run 2 we should have a functional multi-threaded prototype ready for testing.
Leveraging opportunistic resources
Almost 50% of ATLAS production at peak rate relies on opportunistic resources, and today most of them are accessible via Grid interfaces and services. Enabling the use of non-Grid resources is a long-term investment that goes beyond opportunistic use. (Plot: number of cores running ATLAS jobs, 01/05/14 to 01/03/15, with the pledge level marked, an AWS burst visible, and axis markers at 100k and 450k cores.)
Grid and off-Grid resources
Grid technologies have been very successful for us, but the wider community did not fully buy into them: we have a dedicated network of sites, using custom software and serving (mostly) the WLCG community. Opportunistic and common resources can be found at High Performance Computing centres (https://en.wikipedia.org/wiki/Supercomputer) and on opportunistic and commercial clouds (https://en.wikipedia.org/wiki/Cloud_computing). There you ask for resources through a defined interface and get access to, and control of, a (virtual) machine, rather than a job slot on the Grid.
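As one concrete example of such an interface (my own illustration; the slides do not name a specific API), a virtual machine can be requested from a commercial cloud such as AWS with a few calls, after which the experiment controls the whole node and can contextualise it as a worker. The AMI id, instance type and user-data script below are placeholders.

```python
# Sketch of requesting a VM from a commercial cloud (AWS EC2 via boto3).
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",          # placeholder: image with experiment software / CVMFS
    InstanceType="m4.large",
    MinCount=1,
    MaxCount=1,
    UserData="#!/bin/bash\n# contextualisation: start the pilot / worker agent here\n",
)
instance_id = response["Instances"][0]["InstanceId"]
print(f"Got full control of virtual machine {instance_id}, not just a Grid job slot")
```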
(Opportunistic) Cloud Resources
We invested a lot of effort in enabling the use of cloud resources. The ATLAS HLT farm at the CERN ATLAS pit (P1), for example, was instrumented with a cloud interface in order to run simulation (Sim@P1). The HLT farm was also dynamically reconfigured to run reconstruction on multi-core resources (Reco@P1), and we expect to be able to do the same with other clouds. (Plot: number of events vs time, 07/09/14 to 04/10/14, around 20M events/day in total, broken down into T1s, T2s and CERN P1, the latter contributing roughly 5%; inset: 4-day sum for P1.)
HPCs
High Performance Computers were designed for massively parallel applications, a different use case from HEP, but we can parasitically benefit from empty cycles that others cannot use (e.g. single-core job slots). The ATLAS production system has been extended to leverage HPC resources, and the goal is to validate as many workflows as possible; today approximately 5% of ATLAS production runs on HPCs. Examples: a 24-hour test of ATLAS event generation on the Oak Ridge Titan system (the world's #2 HPC machine, 299,008 cores) used 200,000 CPU hours on 90k parallel cores, equivalent to about 70% of our Grid resources; on Mira at Argonne, Sherpa generation ran on 12,244 nodes with 8 threads per node, i.e. 97,952 parallel Sherpa processes. (Plot: running cores vs time, reaching about 10,000 cores.)
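A minimal sketch of how a large HPC allocation can be filled with independent HEP tasks (my own illustration, not the ATLAS production system): each MPI rank runs its own generation task with its own seed, so no inter-node communication is needed. The generator call is a placeholder.

```python
# Sketch: many independent event-generation tasks inside one MPI allocation,
# one task per rank. Requires mpi4py and an MPI launcher, e.g.:
#   mpirun -n 1024 python generate.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()          # which task am I?
size = comm.Get_size()          # how many parallel tasks in the allocation

def generate_events(seed: int, n_events: int = 1000) -> str:
    # Placeholder for calling the actual generator (Sherpa, Pythia, ...)
    # with a rank-dependent random seed and its own output file.
    return f"rank {seed}: generated {n_events} events -> events_{seed:06d}.dat"

result = generate_events(seed=rank)

# Gather a short status report on rank 0 (the only collective operation used).
reports = comm.gather(result, root=0)
if rank == 0:
    print(f"{size} parallel generation tasks completed")
```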
Challenges in HPC utilisation
Non-x86 architectures (e.g. Blue Gene is PowerPC based), restrictive site policies, limited inbound/outbound connectivity, and constraints on the number of jobs and threads.
Networking
Networking is the one item that will most probably continue to progress and evolve, both in terms of increasing bandwidth and in terms of new technologies.
Content Delivery Networking (credit: T. Wenaus)
Storage Endpoints
75% of the available Tier-2 storage is concentrated in about 30 sites, with a large disparity in Tier-2 sizes; it is more efficient to have fewer, larger storage endpoints. Two possible categories emerge: 'cache-based' and 'large' Tier-2s (some Tier-2s are already larger than some Tier-1s). Storage endpoints below 300 TB should either not plan an increase of storage pledges in the next years or aggregate with other endpoints to form an entity larger than 300 TB.
What might ATLAS Computing look like?
Even today our CPU capacity would fit into a single supercomputing centre; the future is a mix of allocations at HPC centres, commercial and academic clouds, computing on demand and Grid technologies. With the evolution of the network, 'local' storage gets redefined: storage is consolidated and federated into a few endpoints, with data caching and remote data access. The one item with few solutions, and a severe constraint, is our data storage: we need reliable and permanent storage under ATLAS control.
Conclusions
The computing model has been adapted to Run 2: 2015 data processing and distribution was a success, and 2016 data taking has started smoothly. No big changes are envisaged for Run 3; in the future we expect more efficient usage of opportunistic resources and a reorganisation of the global storage facilities.