
1 Run-2 Computing Model
Simone Campana, CERN-IT
An asterisk (*) in this presentation indicates that the item will be covered in detail in a jamboree session.

2 Intro: the challenges of Run-2
LHC operation in Run-2 (Run-1 values in parentheses):
- Trigger rate of 1 kHz (~400 Hz)
- Pile-up above 30 (~20)
- 25 ns bunch spacing (~50 ns)
- Centre-of-mass energy up by a factor ~2
- A different detector
Constraints of a 'flat budget': only a limited increase of resources. And we still have the data from Run-1.

3 How to face the challenge
New ATLAS distributed computing systems:
- Rucio for Data Management
- Prodsys-2 for Workload Management
- FAX and the Event Service to optimize resource usage
More efficient utilization of resources:
- More flexibility in the computing model (clouds/tiers)
- Limiting avoidable resource consumption (multicore)
- Optimized workflows (Derivation Framework / Analysis Model)
Leveraging opportunistic resources: Grid, Cloud, HPC
A new data lifecycle management model

4 New ATLAS distributed computing systems

5 Distributed Data Management: Rucio
Rucio builds on the data management lessons learned during Run-1:
- Space optimization and fragmentation of space tokens: Rucio implements multiple ownership of files and logical quotas, so we will be able to gradually eliminate many space tokens (*)
- Integration of "new" technologies and protocols: with Rucio we can use protocols other than SRM for data transfers and storage management (*)
- Better support for metadata, to be introduced gradually in the next months
Rucio has been in production since Monday, Dec. 1st. We will leverage new Rucio features once we are comfortable with the core functionalities.
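As an illustration of the first point (this is a toy model, not the real Rucio API), the sketch below shows how a single physical replica can be "owned" by several accounts, each charged against a logical quota instead of a dedicated space token; the class, method names and numbers are assumptions made for the example.

```python
# Toy model (NOT the real Rucio API) of multiple ownership and logical quotas:
# one physical replica per file, but every account that declared interest is
# charged against its own logical quota, replacing per-group space tokens.

class StorageElement:
    def __init__(self, name, capacity_tb):
        self.name = name
        self.capacity_tb = capacity_tb
        self.replicas = {}          # file name -> size in TB (one physical copy)
        self.logical_usage = {}     # account -> TB charged against its quota

    def add_replica(self, filename, size_tb, account, quota_tb):
        already_used = self.logical_usage.get(account, 0.0)
        if already_used + size_tb > quota_tb:
            raise RuntimeError(f"{account} would exceed its logical quota")
        # Physical space is consumed only once, no matter how many owners.
        if filename not in self.replicas:
            if sum(self.replicas.values()) + size_tb > self.capacity_tb:
                raise RuntimeError(f"{self.name} is full")
            self.replicas[filename] = size_tb
        self.logical_usage[account] = already_used + size_tb


se = StorageElement("EXAMPLE-SE_DATADISK", capacity_tb=1000)
se.add_replica("data.AOD.root", 2.0, account="prod", quota_tb=800)
se.add_replica("data.AOD.root", 2.0, account="higgs_group", quota_tb=50)
print(sum(se.replicas.values()), "TB physical,", se.logical_usage)
# -> 2.0 TB physical, {'prod': 2.0, 'higgs_group': 2.0}
```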

6 Remote data access: FAX
Goal reached: more than 96% of the data is covered. We deployed a federated storage infrastructure (*): all data is accessible from any location.
- Analysis (and production) jobs will be able to access remote (off-site) files
- Jobs can run at sites without the data but with free CPUs; we call this "overflow"
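A minimal sketch of the "overflow" idea: prefer a site that holds the input locally, otherwise broker the job to any site with free CPUs and read the file remotely through the federation. The redirector URL, site names and brokering rule are illustrative assumptions, not the actual PanDA/FAX logic.

```python
# Illustrative "overflow" brokering: prefer a site with the data on local
# storage, otherwise fall back to any site with free CPUs and read via the
# federation. Redirector hostname and site list are made-up examples.

FAX_REDIRECTOR = "root://fax-redirector.example.org/"

sites = [
    {"name": "SITE_A", "free_cpus": 0,   "local_files": {"/atlas/data/file1.root"}},
    {"name": "SITE_B", "free_cpus": 120, "local_files": set()},
]

def broker(input_file):
    # First choice: a site with the file on local storage and spare CPUs.
    for site in sites:
        if input_file in site["local_files"] and site["free_cpus"] > 0:
            return site["name"], input_file  # local read
    # Overflow: any site with free CPUs, reading the file remotely via FAX.
    for site in sites:
        if site["free_cpus"] > 0:
            return site["name"], FAX_REDIRECTOR + input_file.lstrip("/")
    raise RuntimeError("no free CPUs anywhere")

print(broker("/atlas/data/file1.root"))
# -> ('SITE_B', 'root://fax-redirector.example.org/atlas/data/file1.root')
```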

7 Workload Management: Prodsys-2
Prodsys-2 relies on the JEDI/PanDA core:
- The same engine serves analysis and production
- It enables optimized resource scheduling (*): MCORE vs SCORE, analysis vs production, HIGH vs LOW memory
- It minimizes data traffic (merging at T2s)
- New monitoring system, integrated with Rucio
Prodsys-2 has been in production since Monday, Dec. 1st; JEDI has already been in production since the summer.
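To illustrate the kind of matching the scheduling point refers to (MCORE vs SCORE, high vs low memory), here is a toy job-to-queue matcher; the queue names, attributes and thresholds are invented for the example and do not reflect the real JEDI brokerage code.

```python
# Toy job-to-queue matching by core count and memory, in the spirit of the
# MCORE/SCORE and high/low-memory split; queue names and limits are examples.

queues = [
    {"name": "SITE_X_MCORE", "cores": 8, "max_rss_mb": 16000},
    {"name": "SITE_X_SCORE", "cores": 1, "max_rss_mb": 2000},
    {"name": "SITE_Y_HIMEM", "cores": 1, "max_rss_mb": 6000},
]

def match_queue(job_cores, job_rss_mb):
    candidates = [q for q in queues
                  if q["cores"] >= job_cores and q["max_rss_mb"] >= job_rss_mb]
    if not candidates:
        raise RuntimeError("no queue satisfies the job requirements")
    # Avoid wasting big slots: pick the smallest queue that still fits.
    return min(candidates, key=lambda q: (q["cores"], q["max_rss_mb"]))["name"]

print(match_queue(job_cores=8, job_rss_mb=12000))  # -> SITE_X_MCORE
print(match_queue(job_cores=1, job_rss_mb=4000))   # -> SITE_Y_HIMEM
```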

8 Event Service (*)
Efficient utilization of opportunistic resources implies short payloads (so that we can vacate the resource quickly if its owner needs it). We developed a system to deliver payloads as short as a single event: the Event Service. It is based on core components such as:
- A new JEDI extension to PanDA that manages fine-grained workloads
- The new parallel framework AthenaMP, which brings multi/many-core concurrency to ATLAS processing and can manage independent streams of events in parallel
- Newly available object stores, which provide highly scalable cloud storage for small, event-scale outputs
Usage will certainly be extended beyond opportunistic resources.
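A highly simplified sketch of the event-service idea: pull small event ranges, process them, and push each tiny output to an object store so that an evicted opportunistic worker loses at most the range currently in flight. The function names, the in-memory "object store" and the toy event ranges are placeholders, not the real PanDA/AthenaMP implementation.

```python
# Conceptual event-service loop (placeholders only, not the real code):
# process events in small ranges and ship each small output immediately.

import time

_pending_ranges = [(0, 24), (25, 49), (50, 74)]     # toy event ranges to process

def fetch_event_range(job_id):
    return _pending_ranges.pop(0) if _pending_ranges else None

def process_events(first, last):
    return f"output for events {first}-{last}".encode()

object_store = {}                                   # stands in for a cloud object store

def event_service_worker(job_id, deadline):
    while time.time() < deadline:                   # leave early if the slot may be reclaimed
        event_range = fetch_event_range(job_id)
        if event_range is None:
            break                                   # nothing left to do
        first, last = event_range
        payload = process_events(first, last)
        object_store[f"{job_id}/{first}-{last}"] = payload   # tiny output shipped at once

event_service_worker("job42", deadline=time.time() + 60)
print(sorted(object_store))   # -> ['job42/0-24', 'job42/25-49', 'job42/50-74']
```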

9 More efficient utilization of resources

10 Flexible utilization of resources (*)
The Run-1 model rigidly defined the T0/T1/T2 roles; we need more flexibility. Examples:
- Different kinds of jobs (e.g. reconstruction) can run at various sites regardless of the tier level
- Custodial copies of data can be hosted at various sites regardless of the tier level
- The T0 will be able to spill over into the Grid in case of a resource shortage at CERN
- We will use AthenaMP for production and Athena/ROOT for analysis: we need the flexibility to use both multi-core and single-core resources
- Some T2s are equivalent to T1s in terms of disk storage and CPU power
In general, sites today are connected by fast and reliable networks. Use them.

11 Running AthenaMP on the grid
Scheduling multicore resources (*) is not an easy task: statically allocating resources to multicore jobs is not what sites want. To limit inefficiencies, dynamic allocation needs both a steady flow of long multicore jobs and a steady flow of short single-core jobs. The target would be to run most of production on multicore and analysis on single core; we need to work in this direction and get all sites on board. We (naively?) expect no loss of resources just because we allocate them in 8-core slots, yet today we get 30%+ more resources in SCORE than in MCORE.
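To see why dynamic allocation needs a steady mix of both job types, here is a toy scheduler for a single 8-core node that backfills idle cores with single-core jobs; the allocation policy is invented for the example.

```python
# Toy allocation of one 8-core node: give whole-node (8-core) slots to multicore
# jobs when available, and backfill leftover cores with single-core jobs.
# Without a steady stream of both job types, cores sit idle -- the inefficiency
# the slide warns about. The policy here is illustrative only.

NODE_CORES = 8

def allocate(free_cores, mcore_jobs, score_jobs):
    """Return (assignments, idle cores) for one scheduling pass on one node."""
    assignments = []
    # Multicore jobs take full 8-core slots.
    while free_cores >= 8 and mcore_jobs:
        assignments.append((mcore_jobs.pop(0), 8))
        free_cores -= 8
    # Backfill whatever is left with single-core jobs.
    while free_cores >= 1 and score_jobs:
        assignments.append((score_jobs.pop(0), 1))
        free_cores -= 1
    return assignments, free_cores

jobs, idle = allocate(NODE_CORES, mcore_jobs=[], score_jobs=["ana1", "ana2", "ana3"])
print(jobs, "idle cores:", idle)   # only 3 of 8 cores used: no MCORE work to fill the node
```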

12 Analysis Model
Common analysis data format: the xAOD, a replacement for the AOD and for group ntuples of any kind, readable both by Athena and by ROOT.
Data reduction framework:
- Athena produces group-derived data samples (DxAOD)
- Run centrally via Prodsys
- Based on a train model: one input, N outputs
- Takes us from PB to TB
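The "train" model can be pictured as a single pass over the input that feeds many skims at once, as in the sketch below; the selection functions, stream names and event fields are hypothetical.

```python
# Train-model derivation sketch: one read of the input xAOD, N skimmed DxAOD
# outputs written in the same pass. Selections and stream names are made up.

selections = {
    "DxAOD_HIGG": lambda ev: ev["n_leptons"] >= 2,
    "DxAOD_SUSY": lambda ev: ev["met"] > 100.0,
    "DxAOD_TOPQ": lambda ev: ev["n_jets"] >= 4,
}

def run_train(input_events):
    outputs = {name: [] for name in selections}     # one output stream per carriage
    for event in input_events:                      # the input is read only once
        for name, keep in selections.items():
            if keep(event):
                outputs[name].append(event)
        # (a real derivation would also thin/slim the event content here)
    return outputs

events = [{"n_leptons": 2, "met": 30.0, "n_jets": 2},
          {"n_leptons": 0, "met": 150.0, "n_jets": 5}]
print({k: len(v) for k, v in run_train(events).items()})
# -> {'DxAOD_HIGG': 1, 'DxAOD_SUSY': 1, 'DxAOD_TOPQ': 1}
```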

13 [Diagram: data flow of the analysis model. The Derivation Framework (Prodsys-2) reads the xAOD from DATADISK/DATATAPE and produces ~100 DxAOD derivations, each about 1% of the xAOD size; the DxAODs feed group and user analysis through PanDA/JEDI. GROUPDISK shrinks to 5% of today's size, while LOCALGROUPDISK keeps the same size as today.]

14 Leveraging opportunistic resources

15 (Opportunistic) Cloud Resources
We invested a lot of effort in enabling the use of cloud resources. The HLT farm, for example, has been instrumented with a cloud interface in order to run simulation: about 20M events/day over a 4-day period, with CERN-P1 accounting for approximately 5%. The HLT farm was also dynamically reconfigured to run reconstruction on multicore resources. We expect to be able to do the same with other clouds.

16 HPCs
High-performance computers were designed for massively parallel applications (different from the HEP use case), but we can parasitically benefit from empty cycles that others cannot use (e.g. single-core job slots). The ATLAS production system has been extended to leverage HPC resources:
- A 24h test at the Oak Ridge Titan system (the #2 HPC machine in the world, with 299,008 cores): ATLAS event generation used 200,000 CPU hours on 90K parallel cores (the equivalent of 70% of our Grid resources), with Sherpa generation using nodes with 8 threads per node, i.e. 97,952 parallel Sherpa processes
- EVNT, SIMUL and RECO jobs at MPPMU, LRZ and CSCS, with on average 1,700 running jobs
The goal is to validate as many workflows as possible.
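A schematic of what parasitic backfill looks like from the submission side: query how many cores are idle and launch that many independent single-core payloads. The probe and launch functions are placeholders for whatever the local batch system provides; nothing here is the real HPC integration, which goes through the ATLAS production system.

```python
# Schematic backfill of idle HPC cores with independent single-core payloads
# (e.g. event generation). Probe and launch functions are placeholders.

def query_idle_cores():
    """Placeholder: ask the HPC scheduler how many cores are currently idle."""
    return 4096

def launch_single_core_payload(core_index):
    """Placeholder: start one independent generator process on one core."""
    return f"payload-{core_index}"

def backfill(max_payloads=100000):
    idle = query_idle_cores()
    n = min(idle, max_payloads)
    # HEP event generation parallelises trivially: one process per idle core,
    # no communication between them, so backfill holes of any size can be used.
    return [launch_single_core_payload(i) for i in range(n)]

print(len(backfill()), "payloads launched")   # -> 4096
```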

17 New data lifecycle management model

18 Space management crisis
Disk occupancy at T1s: 23 PB of the data on disk was created in the last 3 months and never accessed, and 8 PB of the data on disk has never been touched at all. The dynamically managed space at T1s (shown in green in the occupancy plot, as opposed to the "primary" and "default" pinned categories) is unacceptably small: this compromises our strategy of dynamic replication and cleaning of popular/unpopular data. A lot of the primary space is occupied by old and unused data.

19 The new data lifecycle model
Every dataset will have a lifetime, set at creation:
- The lifetime can be infinite (e.g. RAW data)
- The lifetime can be extended, e.g. if the dataset has been accessed recently or if there is a known exception
Every dataset will have a retention policy, e.g. RAW needs at least 2 copies on tape, and we need at least one copy of the AODs on tape.
Lifetimes are being agreed with the ATLAS Computing Resources management.
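A minimal sketch of what "every dataset has a lifetime and a retention policy" could look like as per-dataset metadata; the field names, the 90-day extension on access and the example policies are assumptions for illustration only.

```python
# Minimal lifetime/retention bookkeeping for a dataset (illustrative only:
# field names, the 90-day extension and the policies are assumptions).

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class DatasetLifecycle:
    name: str
    expires_at: Optional[datetime]      # None means an infinite lifetime (e.g. RAW)
    min_tape_copies: int                # retention policy, e.g. 2 for RAW, 1 for AOD

    def is_expired(self, now=None):
        now = now or datetime.utcnow()
        return self.expires_at is not None and now > self.expires_at

    def extend_on_access(self, extension=timedelta(days=90)):
        # Recently accessed data earns more lifetime (never shortens it).
        if self.expires_at is not None:
            self.expires_at = max(self.expires_at, datetime.utcnow() + extension)

raw = DatasetLifecycle("example_RAW_dataset", expires_at=None, min_tape_copies=2)
aod = DatasetLifecycle("example_AOD_dataset",
                       expires_at=datetime.utcnow() + timedelta(days=365),
                       min_tape_copies=1)
aod.extend_on_access()
print(raw.is_expired(), aod.is_expired())   # -> False False
```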

20 Effect of the data lifecycle model
Datasets whose lifetime has expired can disappear at any time from datadisk and datatape (groupdisk and localgroupdisk are exempt). "Organized" expiration lists will be distributed to the groups. ATLAS Distributed Computing will flexibly manage data replication and reduction, within the boundaries of lifetime and retention. For example:
- Increase or reduce the number of copies based on data popularity
- Redistribute data to T2s rather than T1s, and vice versa
- Move data to tape and free up disk space
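As an illustration of "increase/reduce the number of copies based on data popularity", here is a toy policy applied only within the lifetime boundaries; the thresholds and copy counts are invented and do not correspond to any agreed ATLAS rule.

```python
# Toy popularity-based replica policy, applied only to datasets that are still
# within their lifetime; thresholds and copy counts are invented examples.

def target_disk_copies(accesses_last_90d, expired):
    if expired:
        return 0            # expired data may disappear from datadisk/datatape
    if accesses_last_90d == 0:
        return 1            # unpopular: keep a single disk copy
    if accesses_last_90d < 50:
        return 2
    return 3                # hot data: more copies, closer to the users

for accesses in (0, 10, 500):
    print(accesses, "->", target_disk_copies(accesses, expired=False))
# 0 -> 1, 10 -> 2, 500 -> 3
```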

21 Cautious Implementation – First Dry Run
[Slides 21-26 step through plots of data volume (TB) versus time (months) from the first dry run of the lifetime model: T1 disk, delete 3 of 18 PB; T1 tape, delete 10 of 26 PB; T1 disk, delete 0.3 of 17 PB.]

27 Further Implications
We will use more tape (*), both in terms of volume and number of accesses; access to tape remains "centralized", through PanDA + Rucio. For the first time we will "delete" tapes, and we should discuss how to do this efficiently. In the steady state we will delete approximately as much as we write, from both disk and tape: how do we do this efficiently? Access through storage back doors is not accounted for today; we will improve this, but watch out for the deletion lists, and preferably use the official tools (PanDA/Rucio).

28 Impact: staging from tape
[Plots: data staged per week (TB) — the simulated re-staging of "unused" data, and the 2014 tape access from reconstruction and reprocessing.] What happens if we remove all "unused" data from disk and keep it only on tape ("unused" here meaning not accessed in 9 months)? A simulation based on last year's data access shows we would have to re-stage about 20 TB/week from tape, to be compared with roughly 1 PB/week staged for reco/repro: a 2% increase in volume. In terms of the number of files, it is a 10% increase.
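The 2% figure follows directly from the quoted weekly volumes; a one-line check is below (the file counts behind the 10% figure are not on the slide, so only the volume increase is computed).

```python
# Back-of-the-envelope check of the quoted volume increase: ~20 TB/week of
# extra re-staging on top of ~1 PB/week already staged for reco/repro.
extra_staging_tb_per_week = 20.0
current_staging_tb_per_week = 1000.0          # 1 PB/week
increase = extra_staging_tb_per_week / current_staging_tb_per_week
print(f"{increase:.0%}")                      # -> 2%
```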

29 Conclusions
For many topics I have only given an introduction; more discussion will follow in the next sessions and days.


Download ppt "Simone Campana CERN-IT"

Similar presentations


Ads by Google