1 This work is partially supported by projects InterExcellence (LTT17018), Research infrastructure CERN (CERN-CZ, LM2015058) and OP RDE CERN Computing (CZ.02.1.01/0.0/0.0/16_013/…) from EU funds and MEYS.

2 Exploitation of heterogeneous resources for ATLAS Computing
Jiří Chudoba, Institute of Physics (FZU) of the Czech Academy of Sciences, on behalf of the ATLAS collaboration

3 Outline
ATLAS general introduction (number of collisions, data rates and volumes, CPU requirements)
Usage of the Grid as the main resource: geographic distribution of resources, heterogeneity of resources
ATLAS solutions for the Grid: PanDA, JEDI, Rucio
Newer types of resources: HPC (Harvester, ARC cache, huge variations in number of cores), Clouds

4 ATLAS experiment
Collaboration of more than 3000 authors from 182 institutes – all must have access to data
Large number of channels (200 M) + huge data volumes = high computational requirements
High energies -> huge detector -> large number of channels and high event rate
High trigger rates: HLT rate up to 1 kHz in 2018; raw data volume in 2017: x PB

5 Computing requirements
Still increasing: higher LHC luminosity -> higher expected CPU usage
ATLAS 2018 requirements: 186 PB on disk, 289 PB on tape, 2520 kHS06 CPU years

6 ATLAS experiment: Z -> µ+µ- with 28 additional vertices

7 Grid - WLCG
WLCG connects sites distributed across 5 continents: "EGI" sites with various implementations of CEs and SEs, plus OSG sites
WLCG capacities in 2018 (source: REBUS, US capacities not included): 694,000 cores, 8,768 kHS06 of CPU capacity, 382 PB of disk, 361 PB of tape

8 ATLAS Distributed Computing Solutions
Big effort to maximize resource usage and to automate workflows
PanDA workload management system: used by ATLAS since 2008, now adopted also by other projects; pilot based; factorized code

9 ATLAS Distributed Computing Solutions
Rucio Distributed Data Management system: developed for ATLAS, but now used and evaluated by other projects; handles 1 billion ATLAS files, 365 PB
Rucio key features: single namespace, metadata support, no local services, policy-based lifecycle management
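Below is a minimal sketch of how such policy-based data management looks from the client side, using the Rucio Python client. It assumes an already configured Rucio environment; the scope, dataset name and RSE expression are placeholders, and method signatures may differ slightly between Rucio versions.

    # Declarative replication with the Rucio Python client (placeholder names).
    from rucio.client import Client

    client = Client()

    # Single namespace: data identifiers (DIDs) are scope:name pairs.
    dids = [{'scope': 'user.jdoe', 'name': 'example.dataset'}]

    # Ask for two replicas on any matching disk endpoint; Rucio chooses the
    # sites, triggers the transfers and expires the rule after 30 days.
    client.add_replication_rule(dids, copies=2,
                                rse_expression='type=DATADISK',
                                lifetime=86400 * 30)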

10 ATLAS Distributed Computing Solutions
JEDI (Job Execution and Definition Interface): dynamic job definition from tasks (ref: ATL-SOFT-SLIDE)
JEDI is an intelligent component in the PanDA server providing task-level workload management. Its key part is "dynamic" job definition, which optimizes resource usage much better than the "static" model used in ProdSys1. Dynamic job definition is also crucial for multi-core resources, HPCs and other new requirements: jobs must be of reasonable length and with reasonable input and output size.
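As an illustration of what "dynamic" job definition means, the sketch below (not actual JEDI code; all names are hypothetical) splits a task's input files into jobs bounded by a target walltime and input size, using a measured per-event processing time.

    # Illustrative splitting of a task into jobs of reasonable length and size.
    def split_task(input_files, sec_per_event, target_walltime=12 * 3600,
                   max_input_gb=10.0):
        """input_files: list of (name, n_events, size_gb); returns lists of names."""
        jobs, current, events, size = [], [], 0, 0.0
        for name, n_events, size_gb in input_files:
            too_long = (events + n_events) * sec_per_event > target_walltime
            too_big = size + size_gb > max_input_gb
            if current and (too_long or too_big):
                jobs.append(current)                 # close the current job
                current, events, size = [], 0, 0.0
            current.append(name)
            events += n_events
            size += size_gb
        if current:
            jobs.append(current)
        return jobs

    # Example: 1000-event files at 30 s/event end up as one file per 12 h job.
    task = [("file_%03d" % i, 1000, 2.0) for i in range(20)]
    print(split_task(task, sec_per_event=30.0))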

11 ATLAS Distributed Computing Solutions
Pilot: controls and benchmarks the execution node, gets jobs, monitors them, performs stage-in and stage-out, and cleans up
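A pilot's workflow can be summarized in a few lines. The sketch below is purely illustrative, not the actual ATLAS pilot; get_job, report_status, stage_in and stage_out are hypothetical callables standing in for the real server and storage interfaces.

    import os, shutil, subprocess, tempfile, time

    def probe_node():
        """Very rough stand-in for node benchmarking: just report the core count."""
        return {"cores": os.cpu_count()}

    def run_pilot(get_job, report_status, stage_in, stage_out):
        workdir = tempfile.mkdtemp(prefix="pilot_")
        try:
            job = get_job(resources=probe_node())       # request a job matching this node
            if job is None:
                return                                  # nothing queued for us
            stage_in(job["input_files"], workdir)       # bring input data to the node
            proc = subprocess.Popen(job["command"], cwd=workdir, shell=True)
            while proc.poll() is None:                  # monitor the running payload
                report_status(job["id"], "running")
                time.sleep(60)
            stage_out(job["output_files"], workdir)     # ship outputs and logs away
            report_status(job["id"],
                          "finished" if proc.returncode == 0 else "failed")
        finally:
            shutil.rmtree(workdir, ignore_errors=True)  # cleanup of the work directory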

12 HPC
Differences between HPC and HTC: architecture, authentication, network access, access model

13 HPCs: differences in speed of CPU cores on some HPCs

14 Harvester
New component between the Pilot and PanDA, running on site; a stateless service plus a DB (ref: ATL-SOFT-SLIDE)
Harvester is a resource-facing service between the PanDA server and a collection of pilots, for resource provisioning and workload shaping. It is a lightweight stateless service running on a VObox or an edge node of HPC centers to provide a uniform view of various resources.
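The sketch below illustrates the idea of a stateless provisioning cycle backed only by a small local database, in the spirit of Harvester but not its actual code; fetch_queued_jobs and the batch submission command are hypothetical.

    import sqlite3, subprocess

    def provisioning_cycle(db_path, fetch_queued_jobs, batch_submit_cmd):
        db = sqlite3.connect(db_path)                    # the only persistent state
        db.execute("CREATE TABLE IF NOT EXISTS workers (job_id TEXT, batch_id TEXT)")
        for job in fetch_queued_jobs():                  # ask the server what is queued
            # Workload shaping: here one worker per job; a real service may pack
            # many jobs per node or submit pilots ahead of demand.
            out = subprocess.run(batch_submit_cmd + [job["script"]],
                                 capture_output=True, text=True, check=True)
            db.execute("INSERT INTO workers VALUES (?, ?)",
                       (job["id"], out.stdout.strip()))
        db.commit()
        db.close()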

15 Event Service
Reduces lost CPU time in case of premature job termination; important for backfilling HPC resources
A fine-grained approach to event processing, designed for exploiting diverse, distributed and potentially short-lived resources: quasi-continuous event streaming through worker nodes
Exploits event processors fully and efficiently through their lifetime: real-time delivery of fine-grained workloads to the running application; robust against disappearance of a compute node on short notice
Decouples processing from the chunkiness of files, from data locality considerations and from WAN latency
Streams outputs away quickly: negligible losses if the worker node vanishes; minimal demands on local storage
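The core idea fits in a few lines: work is delivered and returned per event range rather than per file, so a reclaimed node loses at most one small range. This is an illustrative sketch, not the ATLAS Event Service itself; the helper functions are hypothetical.

    def run_event_service(get_event_range, process_event, upload_output):
        while True:
            event_range = get_event_range()          # e.g. (file_name, first_evt, last_evt)
            if event_range is None:
                break                                # no more work, or node being reclaimed
            fname, first, last = event_range
            outputs = [process_event(fname, i) for i in range(first, last + 1)]
            upload_output(event_range, outputs)      # stream the result away immediately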

16 Clouds
Successful business model; huge resources provided by commercial clouds
All LHC computing needs would be covered by less than 1% of their capacity, but clouds are significantly more expensive (apart from the spot market): an option for peak demands
Private clouds enable better sharing with other groups
Object stores: the S3 protocol is supported by Rucio
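As a small illustration of object-store access over the S3 protocol, which Rucio supports as a storage backend, the sketch below talks to an S3-compatible endpoint directly via boto3; the endpoint, credentials, bucket and key are placeholders.

    import boto3

    s3 = boto3.client("s3",
                      endpoint_url="https://s3.example-cloud.org",  # placeholder endpoint
                      aws_access_key_id="ACCESS_KEY",
                      aws_secret_access_key="SECRET_KEY")

    # Upload a pilot log file and read it back.
    with open("pilot.log", "rb") as f:
        s3.put_object(Bucket="atlas-logs", Key="job123/pilot.log", Body=f)
    obj = s3.get_object(Bucket="atlas-logs", Key="job123/pilot.log")
    print(obj["Body"].read()[:200])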

17 Clouds for ATLAS
Fully integrated in the ATLAS infrastructure: a "standard" CE is exposed by the clouds, details are hidden by the local setup (HTCondor, ARC)
Cloud data storage used mostly for log files; not (yet) suitable for big-volume data storage
Clouds have been used for ATLAS for several years; a variety of methods exists for running HEP workloads on clouds (source: Randy Sobie):
– VM-DIRAC (LHCb and Belle II)
– VAC/Vcycle (UK)
– HTCondor/CloudScheduler (Canada)
– HTC/GlideinWMS (FNAL), HTC/VM (PNNL), HTC/APR (BNL)
– Dynamic-Torque (Australia)
– Cloud Area Padovana (INFN)
– ARC (NorduGrid)
Each method has its own merits and was often designed to integrate clouds into an existing infrastructure (e.g. local, WLCG and experiment)
ATLAS CPU consumption per resource, January - May 2018: Grid 73%, HPC 5%, HPC_special 9%, Clouds 13%

18 BOINC
BOINC = Berkeley Open Infrastructure for Network Computing: open-source software for volunteer computing
3.2% of CPU time in May 2018
Uses the idle time of (personal) computers; VirtualBox used to support various platforms (Windows, Mac, Linux)
Easy to use for laymen; also great for outreach

19 ATLAS@home

20 ATLAS@Home
New use cases on clusters: both clusters not supporting the ATLAS VO and clusters supporting the ATLAS VO; use cases on ATLAS Tier2s include downtimes
Example of the Beijing Tier2 site: Grid jobs walltime utilization 88%, CPU time 66%; ATLAS@Home adds 23% of CPU utilization

21 Conclusions
ATLAS Offline Computing successfully uses various resources, with still increasing automation of workflows
But it remains a huge distributed effort in terms of manpower for development and operations
We also greatly appreciate non-pledged resources
Thank you for your attention!

