PanDA: Exascale Federation of Resources for the ATLAS Experiment
Fernando Barreiro Megino (University of Texas at Arlington), for the PanDA team
MMCP15, Stará Lesná, Slovakia
The LHC
The ATLAS detector
...~1/10th of its members
Distributed Computing: the WLCG
Tier-0 (CERN): 15%
Tier-1 (11 centres): 40%
Tier-2 (~140 centres): 45%
Big Data? ~14x growth expected 2012-2020 (source: Wired, 4/2013)
Current ATLAS data set, all data products: 140 PB. For comparison, Big Data in 2012: LHC data 15 PB/year, Google search 100 PB, YouTube 15 PB/year, Facebook uploads 180 PB/year, Kaiser Permanente 30 PB, plus the Library of Congress, the Climate DB, Nasdaq and the US Census. Business emails sent, 3000 PB/year, don’t count: not managed as a coherent data set.
What is PanDA?
The Production and Distributed Analysis system, developed for ATLAS. Now also used by AMS, ALICE, LSST and others.
Many international partners: DoE HEP, DoE ASCR, NSF, CERN IT, OSG, ASGC, NorduGrid, European grid projects, Russian grid projects…
PanDA at a glance: users submit jobs to PanDA, which dispatches them through a pilot factory to clouds (Cloud A, Cloud B), each consisting of a Tier-1 and several Tier-2 sites, with Rucio providing Distributed Data Management.
Orders of magnitude
Paradigm Shift in HEP Computing
New ideas from PanDA:
Distributed resources are seamlessly integrated worldwide through a single submission system
All users have access to the same resources
Global fair share, priorities and policies allow efficient management of resources
Automation, error handling, and other features improve user experience

Old HEP paradigm:
Distributed resources are independent entities
Groups of users utilize specific resources (whether locally or remotely)
Fair shares, priorities and policies are managed locally, for each resource
Uneven user experience at different sites, based on local support and experience
Privileged users have access to special resources
Core Ideas in PanDA
Single entry point to the WLCG: provide a central queue for users, similar to local batch systems
Make hundreds of distributed sites appear as local: reduce site-related errors and reduce latency
Build a pilot job system with late transfer of user payloads: crucial for a distributed infrastructure maintained by local experts
Hide middleware while supporting diversity and evolution: PanDA interacts with the middleware, while users see a high-level workflow
Hide variations in infrastructure: PanDA presents uniform ‘job’ slots to the user (with minimal sub-types), making it easy to integrate grid sites, clouds, HPC sites…
Production and analysis users see the same PanDA system: the same set of distributed resources is available to all users, and the highly flexible system gives full control of priorities to the experiment
Key Features of PanDA
Workflow is maximally asynchronous
Pilot-based job execution system, with a Condor-based pilot factory
Payload is sent only after execution begins on the CE: minimizes latency, reduces error rates
Central job queue: unified treatment of distributed resources
SQL database keeps state: a critical component
Automatic error handling and recovery
Extensive monitoring
Modular design
RESTful communications
GSI authentication
Use of Open Source components
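The pilot idea with late payload binding can be sketched in a few lines. This is a toy illustration only: all names (`FakeServer`, `run_pilot`) are invented, and the real PanDA pilot communicates with the server over authenticated REST calls and does far more (staging, validation, state reporting).

```python
# Minimal sketch of pilot late binding: the payload is fetched only
# once the pilot is already running on a worker node.
class FakeServer:
    """Stands in for the central PanDA job queue."""
    def __init__(self, jobs):
        self.jobs = list(jobs)

    def get_job(self, site):
        # Late binding: a payload is handed out only when a pilot asks.
        return self.jobs.pop(0) if self.jobs else None

def run_pilot(server, site):
    """A pilot starts on a worker node, then pulls payloads until none remain."""
    done = []
    while (job := server.get_job(site)) is not None:
        # In reality the pilot stages input, runs the payload and reports
        # state back; here we just record the job id.
        done.append(job["id"])
    return done

server = FakeServer([{"id": 1}, {"id": 2}, {"id": 3}])
print(run_pilot(server, "SITE_A"))  # [1, 2, 3]
```

Because the payload is chosen only at pull time, a job submitted hours earlier can still be matched to whichever site becomes free first, which is what keeps latency and site-level error rates low.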
Task management PanDA is not just a job execution engine: it manages complex tasks. Tasks are groupings of jobs where a certain order might have to be respected.
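Ordering among the jobs of a task can be expressed as a dependency graph released in topological order. The stage names below are illustrative (a typical simulation chain), not PanDA's actual task schema:

```python
# Sketch of task-level ordering: a task is a set of job groups plus
# dependencies, released in topological order.
from graphlib import TopologicalSorter

task = {
    "evgen": [],         # event generation has no prerequisites
    "simul": ["evgen"],  # simulation needs generated events
    "reco":  ["simul"],  # reconstruction needs simulated hits
    "merge": ["reco"],   # merging runs last
}

order = list(TopologicalSorter(task).static_order())
print(order)  # ['evgen', 'simul', 'reco', 'merge']
```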
Monitoring
Evolution of the PanDA system
Integration of upcoming computing paradigms: clouds, Leadership Computing Facilities
Integration of the network as a resource in workload management
PanDA beyond ATLAS: BigPanDA, MegaPanDA…
PanDA and upcoming computing paradigms
It is not about replacing the WLCG, but about integrating additional computing resources:
Overspilling into the cloud
Backfilling HPC
Monte Carlo jobs are ideal candidates for external compute
PanDA and the Cloud
ATLAS cloud activity started in 2012.
Commercial clouds frequently offer free allocations to entice research institutes.
Research clouds: institutes serving multiple experiments wanted to increase flexibility by offering resources through a cloud interface.
Some questions we needed to solve:
What is the best integration model for PanDA? If we get any offering… we want to be ready!
The possibility of overspilling into the cloud in periods of high demand
The cost models of commercial providers: is running your own computing centre really cheaper?
PanDA and the Cloud
A wide range of providers has been integrated and evaluated. Most cloud providers have similar offerings; however, watch out for the lack of standardization.
Running jobs in the cloud is “easy”: run Condor workers in the cloud that join a centrally managed Condor pool. With the current experience, new cloud providers can be plugged in with reduced effort, and sustained operation has been demonstrated.
The more difficult part is using permanent cloud storage. Monte Carlo jobs to the rescue: high CPU usage, low I/O.
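The point about Monte Carlo jobs can be made concrete: jobs with a high CPU-to-I/O ratio are the ones worth shipping to external resources, because they barely touch permanent cloud storage. The field names, numbers and threshold below are invented for illustration:

```python
# Sketch: select jobs suitable for external (cloud/HPC) compute by
# their CPU-to-I/O ratio. Monte Carlo jobs score highest.
jobs = [
    {"id": 1, "type": "mc_simul", "cpu_hours": 12.0, "io_gb": 0.5},
    {"id": 2, "type": "analysis", "cpu_hours": 0.5,  "io_gb": 40.0},
    {"id": 3, "type": "mc_evgen", "cpu_hours": 6.0,  "io_gb": 0.2},
]

def cloud_candidates(jobs, min_cpu_per_gb=5.0):
    """Return ids of jobs worth shipping to external compute."""
    return [j["id"] for j in jobs
            if j["cpu_hours"] / j["io_gb"] >= min_cpu_per_gb]

print(cloud_candidates(jobs))  # [1, 3]
```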
Example: PanDA on GCE
We ran for about 8 weeks (2 weeks were planned for scaling up). Very stable running on the cloud side; most problems were on the ATLAS side. Completed 458,000 jobs; generated and processed about 214 million events.
PanDA and HPC: please see Ruslan’s presentation at this conference.
Extending beyond the Grid
Example for June 2015: cloud and HPC resources are steadily gaining ground.
Network as a resource in PanDA
Network bandwidth has grown by a factor of O(1000) in the last 15 years, and networking has transcended national boundaries. With LHCOPN and LHCONE… do we still need the MONARC restrictions? A direct mesh of Tier-2 data flows, with cloud boundaries loosened based on network metrics.
Network as a resource in PanDA
Let’s relax the limitations defined back in those days, using network measurements to do it gradually. Benefits:
Better and more dynamic use of storage
Reduced load on the Tier-1s for data serving
Increased speed to populate analysis facilities
Sources of network information
DDM Sonar: transfer statistics covering the whole mesh, as reported by DDM/FTS
perfSONAR: low-level network statistics
FAX data: transfer statistics covering federated XRootD sites
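One way to use these three sources is to fold them into a single per-link cost figure for the brokerage. The function below is a hedged sketch: the field names, weights and inverse-bandwidth formula are all invented for illustration, not the actual ATLAS network-cost algorithm.

```python
# Sketch: combine DDM Sonar, perfSONAR and FAX measurements into one
# per-link cost (lower = better link).
def link_cost(ddm_sonar_mbps, perfsonar_mbps, fax_read_mbps,
              weights=(0.5, 0.3, 0.2)):
    """Cost is the inverse of a weighted effective bandwidth."""
    w1, w2, w3 = weights
    bandwidth = (w1 * ddm_sonar_mbps
                 + w2 * perfsonar_mbps
                 + w3 * fax_read_mbps)
    return 1.0 / bandwidth if bandwidth > 0 else float("inf")

# A fast link gets a lower cost than a slow one:
print(link_cost(800, 900, 700) < link_cost(50, 60, 40))  # True
```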
Faster User analysis through FAX
The first use case of network integration with PanDA. Brokerage will use the concept of ‘nearby’ sites:
Calculate a weight based on the brokerage criteria (availability of CPU, release, pilot rate…) and add the network transfer cost to the brokerage weight
Jobs will be sent to the site with the best weight, not necessarily the site holding the data locally: if a nearby site has less wait time, access the data through FAX
FAX transfer monitoring: historical job dashboard, FAX Kibana
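The brokerage weighting described above can be sketched as follows. All names, numbers and the exact formula are invented to illustrate the idea, not PanDA's actual brokerage code:

```python
# Sketch: rank sites by a weight built from CPU availability and pilot
# rate, penalized by the network cost of reaching the data via FAX.
def site_weight(free_cpus, pilot_rate, network_cost):
    base = free_cpus * pilot_rate       # more capacity, more pilots: better
    return base / (1.0 + network_cost)  # remote data access reduces weight

sites = {
    "SITE_LOCAL":  dict(free_cpus=50,  pilot_rate=0.2, network_cost=0.0),
    "SITE_NEARBY": dict(free_cpus=400, pilot_rate=0.8, network_cost=0.4),
}

best = max(sites, key=lambda s: site_weight(**sites[s]))
print(best)  # SITE_NEARBY: the shorter queue outweighs the transfer cost
```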
Dynamic cloud selection
A cloud is an aggregation of sites, usually delimited nationally. Tasks are kept within a cloud and the output is aggregated at the Tier-1.
Goal: optimize and automate the choice of T1-T2 pairings, currently a manual operation based on suggestions.
Dynamic cloud monitoring
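Automating the T1-T2 pairing amounts to attaching each Tier-2 to the Tier-1 with the best measured network metric, rather than to its national default. Site names and throughput values below are invented examples:

```python
# Sketch: pick the Tier-1 (and hence the cloud) for a given Tier-2
# based on measured throughput instead of national boundaries.
t1_metrics_mbps = {   # measured throughput from one Tier-2 to each Tier-1
    "T1_DE": 120.0,
    "T1_FR": 480.0,
    "T1_US": 310.0,
}

def choose_cloud(metrics):
    """Attach the Tier-2 to the Tier-1 with the best throughput."""
    return max(metrics, key=metrics.get)

print(choose_cloud(t1_metrics_mbps))  # T1_FR
```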
PanDA beyond ATLAS
If PanDA works so well, why not use it for other experiments too? Collaborative work with other institutes: NRC KI, JINR.
Making PanDA accessible to everyone:
Migrated the code to GitHub; PanDA is now Oracle- and MySQL-compatible
Refactored the core: updated the architecture to a plugin approach, where different communities can customize the components
Host a multi-VO instance on Amazon EC2
Redesigned, modular monitoring
Experiments collaborating with PanDA: AMS, ALICE, COMPASS, LSST
Acknowledgements Kaushik De, Alexei Klimentov, Tadashi Maeno, Paul Nilsson, Danila Oleynik, Sergey Panitkin, Artem Petrosyan, Ilija Vukotic, Torre Wenaus