ATLAS Computing
Alessandro De Salvo, CCR Workshop, 18-5-2016
2015 Data Taking
Data-taking efficiency above 92% for the data collected in 2015 and above 93% for the data collected so far in 2016. (Plots: collected data in 2015 and 2016, and the 2015 pile-up distribution.)
3
HLT: Readout rate 5-10 kHz HLT: Readout rate 1 kHz HLT: Readout rate 0.4 kHz LHC Upgrade Timeline The Challenge to Computing Repeats periodically! 3 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 … 2037 30 fb -1 150 fb -1 300 fb -1
4
2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 … 2037 30 fb -1 150 fb -1 300 fb -1 HLT: Readout rate 5-10 kHz HLT: Readout rate 1 kHz HLT: Readout rate 0.4 kHz In ~10 years, increase by factor 10 the number of events per second More events to process More events to store In ~10 years, increase by factor 10 the number of events per second More events to process More events to store The data rate and volume challenge 4
5
2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 … 2037 30 fb -1 150 fb -1 300 fb -1 HLT: Readout rate 5-10 kHz HLT: Readout rate 1 kHz HLT: Readout rate 0.4 kHz In ~10 years, increase by factor 10 the luminosity More complex events In ~10 years, increase by factor 10 the luminosity More complex events The data complexity challenge 5
Pile-up challenge
The average pile-up will be ⟨μ⟩ ≈ 30 in 2016, ⟨μ⟩ ≈ 35 in 2017, ... and ⟨μ⟩ ≈ 200 at the HL-LHC (in about 10 years). Higher pile-up means a linear increase of the digitisation time, a much steeper (combinatorial) growth of the reconstruction time, slightly larger events and a much larger memory footprint.
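As a rough illustration of why reconstruction rather than digitisation is the worry (my own simplified model, not an official ATLAS parametrisation; the exponent p is an assumption):

```latex
% Simplified per-event CPU model (assumption, for illustration only):
% digitisation scales linearly with pile-up, reconstruction much faster
% than linearly because of the combinatorics of tracking.
\begin{align}
  t_{\text{digi}}(\mu) &\approx a\,\mu,\\
  t_{\text{reco}}(\mu) &\approx b\,\mu^{p}, \qquad p > 1 .
\end{align}
% Going from <mu> = 30 to <mu> = 200 multiplies t_digi by ~6.7,
% but t_reco by (200/30)^p, i.e. ~45 for p = 2.
```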
Simulation – full and fast
(Plot: time-per-event distribution, on a log scale, for full and fast simulation, illustrating the trade-off between speed and precision.)
Simulation
Simulation is CPU intensive. The Integrated Simulation Framework allows mixing of full Geant simulation and fast simulation within the same event. Work in progress, target is 2016.
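A minimal sketch of the idea of routing particles within one event either to full or to fast simulation (my own illustration, not ISF code; the routing rule and thresholds are invented):

```python
# Illustrative sketch of per-particle simulator dispatch within one event.
from dataclasses import dataclass

@dataclass
class Particle:
    pdg_id: int      # particle type (PDG code)
    energy: float    # GeV
    in_calo: bool    # heading into the calorimeter?

def simulate_full(p: Particle) -> str:
    return f"full-sim hit record for {p.pdg_id}"          # stand-in for Geant4

def simulate_fast(p: Particle) -> str:
    return f"fast-sim parametrised shower for {p.pdg_id}"

def route(p: Particle) -> str:
    # Example rule: parametrise low-energy electromagnetic showers in the
    # calorimeter, send everything else through full simulation.
    if p.in_calo and abs(p.pdg_id) in (11, 22) and p.energy < 10.0:
        return simulate_fast(p)
    return simulate_full(p)

event = [Particle(11, 2.5, True), Particle(211, 40.0, False), Particle(22, 1.2, True)]
hits = [route(p) for p in event]   # one event, mixed full and fast simulation
```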
Reconstruction
Reconstruction is memory hungry and requires non-negligible CPU (about 40% of the simulation cost, roughly 20% of total ATLAS CPU usage). AthenaMP multi-processing reduces the memory footprint compared to independent single-core jobs. Code and algorithm optimisation has largely reduced the CPU needs of reconstruction [4]. (Plots: Athena memory profile for serial vs multi-process running jobs against the 2 GB/core limit, and reconstruction time in s/event.)
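A minimal sketch of the fork-after-initialisation idea behind AthenaMP (my own illustration, not the actual framework): the parent process loads the large read-only data once, then forks workers that share those pages copy-on-write, so the per-core memory cost stays well below that of N independent jobs.

```python
# Sketch of fork-after-initialisation (the AthenaMP idea), not ATLAS code.
import multiprocessing as mp

def load_conditions():
    # Stand-in for the large read-only state (geometry, conditions, field maps)
    # that dominates the memory footprint of a reconstruction job.
    return {"geometry": bytearray(200 * 1024 * 1024)}   # ~200 MB, shared after fork

CONDITIONS = None

def worker(event_range):
    # After fork, CONDITIONS pages are shared copy-on-write with the parent:
    # reading them costs (almost) no extra physical memory per worker.
    return [f"reconstructed event {i} using {len(CONDITIONS['geometry'])} bytes of geometry"
            for i in event_range]

if __name__ == "__main__":
    CONDITIONS = load_conditions()             # initialise once, before forking
    ctx = mp.get_context("fork")               # fork start method (Linux) keeps shared pages
    with ctx.Pool(processes=4) as pool:
        chunks = [range(i, i + 25) for i in range(0, 100, 25)]
        results = pool.map(worker, chunks)     # 4 workers, one shared geometry
```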
Derivations
New analysis model for Run 2: group data formats (DAODs) produced with a train model. More than 84 DAOD species are produced by 19 trains on the Grid within 24 hours of data reconstruction at Tier-0. This is vital for quick turnaround and robustness of analyses; ATLAS results from 2015 onwards are based on DAODs.
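A minimal sketch of the train model (my own illustration; the skimming rules and event model are invented): each input event is read once and offered to every "carriage" (derivation) on the train, and each derivation writes its own output stream.

```python
# Sketch of a derivation "train": read each event once, run several
# derivations (carriages) over it, write one output stream per derivation.
from typing import Callable, Dict, Iterable, List

Derivation = Callable[[dict], bool]        # returns True if the event is kept

def run_train(events: Iterable[dict], carriages: Dict[str, Derivation]) -> Dict[str, List[dict]]:
    outputs: Dict[str, List[dict]] = {name: [] for name in carriages}
    for event in events:                   # single pass over the input AOD
        for name, keep in carriages.items():
            if keep(event):
                outputs[name].append(event)    # slimming/thinning would happen here
    return outputs

carriages = {
    "DAOD_EXOT1": lambda e: e["met"] > 150.0,          # example skim: high missing ET
    "DAOD_HIGG2": lambda e: len(e["leptons"]) >= 2,    # example skim: dilepton events
}
events = [{"met": 200.0, "leptons": ["e"]}, {"met": 20.0, "leptons": ["e", "mu"]}]
streams = run_train(events, carriages)     # one pass, two output streams
```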
Hardware trends and implications
Clock speed has stalled (bad), transistor density keeps increasing (good), and memory per core keeps shrinking: WLCG assumes 2 GB/core, while a Xeon Phi offers 60 cores with 16 GB and a Tesla K40 has 2880 cores with 16 GB. Multi-processing (AthenaMP) will no longer be sufficient; the future framework needs multi-threading and finer-grained parallelism.
Future Framework Requirements Group
Established between Trigger/DAQ and Computing to examine the needs of a future framework satisfying both offline and HLT use cases. Reported in December: https://cds.cern.ch/record/1974156/
(Cartoon of Run 3 multi-threaded reconstruction: colours represent different events, shapes different algorithms, all within one process running multiple threads.)
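A minimal sketch of the inter-event concurrency pictured in the cartoon (my own illustration in Python; the real framework is C++ and its scheduler also resolves data dependencies between algorithms of the same event): several events are in flight at once inside one process, sharing a thread pool.

```python
# Sketch of intra-process, multi-event scheduling on a shared thread pool.
# Algorithm names are invented placeholders.
from concurrent.futures import ThreadPoolExecutor

ALGORITHMS = ["tracking", "calo_clustering", "jet_finding"]   # per-event chain

def run_algorithm(event_id: int, algorithm: str) -> str:
    # In real life this would read/write the event store; here it just labels work.
    return f"event {event_id}: {algorithm} done"

def process_event(event_id: int) -> list:
    # Algorithms of one event run sequentially here for simplicity; a real
    # scheduler also runs independent algorithms of the same event in parallel.
    return [run_algorithm(event_id, alg) for alg in ALGORITHMS]

with ThreadPoolExecutor(max_workers=4) as pool:          # one process, many threads
    results = list(pool.map(process_event, range(8)))    # 8 events in flight
```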
Timeline
We want a multi-threaded framework in place for Run 3, which lets us gain experience with multi-threaded running before the HL-LHC. Most of the development should therefore be done by the start of LS2, now only 2 years away; by the end of Run 2 we should have a functional multi-threaded prototype ready for testing.
Leveraging opportunistic resources
Almost 50% of ATLAS production at peak rate relies on opportunistic resources, and today most of them are accessible via Grid interfaces and services. Enabling the use of non-Grid resources is a long-term investment that goes beyond opportunistic use. (Plot: number of cores running ATLAS jobs, 01/05/14 to 01/03/15, with the pledge level marked, an AWS burst visible, and axis markers at 100k and 450k cores.)
Grid and off-Grid resources
Grid technologies have been very successful for us, but the wider community did not fully buy into them: we have a dedicated network of sites, using custom software and serving (mostly) the WLCG community. Opportunistic and common resources can be found at High Performance Computing centres (https://en.wikipedia.org/wiki/Supercomputer) and on opportunistic and commercial clouds (https://en.wikipedia.org/wiki/Cloud_computing). There you ask for resources through a defined interface and get access to, and control of, a (virtual) machine, rather than a job slot on the Grid.
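As one concrete example of such an interface (my own illustration; the slides do not name a specific API), a virtual machine can be requested from a commercial cloud such as AWS with a few calls, after which the experiment controls the whole node and can contextualise it as a worker. The AMI id, instance type and user-data script below are placeholders.

```python
# Sketch of requesting a VM from a commercial cloud (AWS EC2 via boto3).
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",          # placeholder: image with experiment software / CVMFS
    InstanceType="m4.large",
    MinCount=1,
    MaxCount=1,
    UserData="#!/bin/bash\n# contextualisation: start the pilot / worker agent here\n",
)
instance_id = response["Instances"][0]["InstanceId"]
print(f"Got full control of virtual machine {instance_id}, not just a Grid job slot")
```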
(Opportunistic) Cloud Resources
We invested a lot of effort in enabling the use of cloud resources. The ATLAS HLT farm at the CERN ATLAS pit (P1), for example, was instrumented with a cloud interface in order to run simulation (Sim@P1). The HLT farm was also dynamically reconfigured to run reconstruction on multi-core resources (Reco@P1), and we expect to be able to do the same with other clouds. (Plot: number of events vs time, 07/09/14 to 04/10/14, around 20M events/day in total, broken down into T1s, T2s and CERN P1, the latter contributing roughly 5%; inset: 4-day sum for P1.)
HPCs
High Performance Computers were designed for massively parallel applications, a different use case from HEP, but we can parasitically benefit from empty cycles that others cannot use (e.g. single-core job slots). The ATLAS production system has been extended to leverage HPC resources, and the goal is to validate as many workflows as possible; today approximately 5% of ATLAS production runs on HPCs. Examples: a 24-hour test of ATLAS event generation on the Oak Ridge Titan system (the world's #2 HPC machine, 299,008 cores) used 200,000 CPU hours on 90k parallel cores, equivalent to about 70% of our Grid resources; on Mira at Argonne, Sherpa generation ran on 12,244 nodes with 8 threads per node, i.e. 97,952 parallel Sherpa processes. (Plot: running cores vs time, reaching about 10,000 cores.)
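A minimal sketch of how a large HPC allocation can be filled with independent HEP tasks (my own illustration, not the ATLAS production system): each MPI rank runs its own generation task with its own seed, so no inter-node communication is needed. The generator call is a placeholder.

```python
# Sketch: many independent event-generation tasks inside one MPI allocation,
# one task per rank. Requires mpi4py and an MPI launcher, e.g.:
#   mpirun -n 1024 python generate.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()          # which task am I?
size = comm.Get_size()          # how many parallel tasks in the allocation

def generate_events(seed: int, n_events: int = 1000) -> str:
    # Placeholder for calling the actual generator (Sherpa, Pythia, ...)
    # with a rank-dependent random seed and its own output file.
    return f"rank {seed}: generated {n_events} events -> events_{seed:06d}.dat"

result = generate_events(seed=rank)

# Gather a short status report on rank 0 (the only collective operation used).
reports = comm.gather(result, root=0)
if rank == 0:
    print(f"{size} parallel generation tasks completed")
```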
Challenges in HPC utilisation
Non-x86 architectures (e.g. Blue Gene is PowerPC based), restrictive site policies, limited inbound/outbound connectivity, and constraints on the number of jobs and threads.
Networking
Networking is the one item that will most probably continue to progress and evolve, both in terms of increasing bandwidth and in terms of new technologies.
Content Delivery Networking (credit: T. Wenaus)
Storage Endpoints
75% of the available Tier-2 storage is concentrated in about 30 sites, with a large disparity in Tier-2 sizes; it is more efficient to have fewer, larger storage endpoints. Two possible categories emerge: 'cache-based' and 'large' Tier-2s (some Tier-2s are already larger than some Tier-1s). Storage endpoints below 300 TB should either not plan an increase of storage pledges in the next years or aggregate with other endpoints to form an entity larger than 300 TB.
What might ATLAS Computing look like?
Even today our CPU capacity would fit into a single supercomputing centre; the future is a mix of allocations at HPC centres, commercial and academic clouds, computing on demand and Grid technologies. With the evolution of the network, 'local' storage gets redefined: storage is consolidated and federated into a few endpoints, with data caching and remote data access. The one item with few solutions, and a severe constraint, is our data storage: we need reliable and permanent storage under ATLAS control.
Conclusions
The computing model has been adapted to Run 2: 2015 data processing and distribution was a success, and 2016 data taking has started smoothly. No big changes are envisaged for Run 3; in the future we expect more efficient usage of opportunistic resources and a reorganisation of the global storage facilities.