
Slide 1: A Scalable Distributed Data Management System for ATLAS
David Cameron, CERN
CHEP 2006, Mumbai, India

Slide 2: Outline
- The data flow from the ATLAS experiment
- The Distributed Data Management system - Don Quijote
  - Architecture: datasets, central catalogs, subscriptions, site services
  - Implementation details
- Results from the first large-scale test
- Conclusion and future plans

Slide 3: The ATLAS Experiment Data Flow
[Diagram of the ATLAS data flow: the Detector, the CERN Computer Centre + Tier 0, Tier 1 centres and Tier 2 centres, connected via the Grid; the labelled flows are RAW data, reconstructed + RAW data, small data products, simulated data and reprocessing.]

Slide 4: The Need for ATLAS Data Management
- Grids provide a set of tools to manage distributed data
- These are low-level file cataloging, storage and transfer services
- ATLAS uses three Grids (LCG, OSG, NorduGrid), each with its own versions of these services
- There therefore needs to be an ATLAS-specific layer on top of the Grid middleware
- The goal is to manage data flow as described in the computing model and to provide a single entry point to all distributed ATLAS data
- Our software is called Don Quijote (DQ)

Slide 5: Don Quijote
- The first version of DQ simply provided an interface to the three Grid catalogs (one per Grid) to query data locations, plus a simple reliable file transfer system
- It was used for the ATLAS Data Challenge 2 and the Rome productions
[Diagram: DQ queries the per-Grid catalogs: EDG RLS on LCG, Globus RLS on OSG and Globus RLS on NorduGrid.]

Slide 6: Don Quijote 2
- Because the first version was not a scalable solution, and because Grid middleware had advanced, DQ was redesigned as DQ2
- DQ2 is based on the concept of versioned datasets, defined as collections of files or other datasets
- ATLAS central catalogs define datasets and their locations
- A dataset is also the unit of data movement
- To enable data movement, distributed 'site services' use a subscription mechanism to pull data to a site (a minimal sketch of the dataset concept follows below)
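To make the dataset concept concrete, here is a minimal Python sketch of a versioned dataset as a collection of files or other datasets. The class and field names are illustrative assumptions, not the actual DQ2 data model.

```python
# Illustrative sketch only: names and fields are assumptions, not the DQ2 schema.
from dataclasses import dataclass, field
from typing import List
import uuid


@dataclass
class DatasetVersion:
    version: int
    files: List[str] = field(default_factory=list)     # logical file names
    datasets: List[str] = field(default_factory=list)  # names of nested datasets


@dataclass
class Dataset:
    name: str
    duid: str = field(default_factory=lambda: str(uuid.uuid4()))  # unique dataset ID
    versions: List[DatasetVersion] = field(default_factory=list)

    def latest(self) -> DatasetVersion:
        return self.versions[-1]


# A dataset with two files, later extended by a new version adding a third file.
ds = Dataset(name="physics.aod.0001")
ds.versions.append(DatasetVersion(version=1, files=["file1", "file2"]))
ds.versions.append(DatasetVersion(version=2, files=["file1", "file2", "file3"]))
print(ds.name, "latest version:", ds.latest().version)
```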

Slide 7: Central Catalogs
- Dataset Repository: holds all dataset names and unique IDs (+ system metadata)
- Dataset Content Catalog: maps each dataset to its constituent files
- Dataset Location Catalog: stores the locations of each dataset
- Dataset Subscription Catalog: stores subscriptions of datasets to sites
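One way to picture what each of these catalogs maps is the following sketch using plain Python dicts; the real catalogs are database-backed services, so the structures and example values here are assumptions for illustration only.

```python
# Dataset Repository: dataset name -> unique ID and system metadata (example values)
repository = {
    "physics.aod.0001": {"duid": "duid-0001", "owner": "prodsys", "state": "open"},
}

# Dataset Content Catalog: dataset -> constituent files (logical names)
content = {
    "physics.aod.0001": ["file1", "file2", "file3", "file4"],
}

# Dataset Location Catalog: dataset -> sites holding a copy
location = {
    "physics.aod.0001": ["CERN", "CNAF", "BNL"],
}

# Dataset Subscription Catalog: dataset -> sites subscribed to receive new versions
subscription = {
    "physics.aod.0001": ["CNAF", "BNL"],
}

# Example query path: which files should site CNAF hold for its subscriptions?
for dataset, sites in subscription.items():
    if "CNAF" in sites:
        print(dataset, "->", content[dataset])
```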

Slide 8: Catalog Interactions
[Diagram of catalog interactions.]

Slide 9: Central Catalogs
- There is no global physical file replica catalog
- Physical file resolution is done by (Grid-specific) catalogs at each site, holding only data on that site
- The central catalogs are split into different databases because we expect different access patterns on each one; for example, the content catalog will be very heavily used
- The catalogs are logically centralised but may be physically separated or partitioned for performance reasons
- A unified client interface ensures consistency between catalogs when multiple catalog operations are performed
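The two-step lookup this implies (central location catalog for sites, then a site-local catalog for physical replicas) could look roughly like the Python sketch below; the catalog contents, site names and SURLs are invented for the example and do not reflect real DQ2 or Grid catalog APIs.

```python
from typing import Dict, List

# Central location catalog: dataset -> sites holding it (no physical file info here)
central_location_catalog: Dict[str, List[str]] = {
    "physics.aod.0001": ["CNAF", "BNL"],
}

# One Grid-specific replica catalog per site, holding only that site's replicas
site_local_catalogs: Dict[str, Dict[str, str]] = {
    "CNAF": {"file1": "srm://se.cnaf.example/atlas/file1"},
    "BNL":  {"file1": "srm://se.bnl.example/atlas/file1",
             "file2": "srm://se.bnl.example/atlas/file2"},
}


def resolve(dataset: str, logical_file: str) -> List[str]:
    """Return all physical replicas (SURLs) of a file in a given dataset."""
    surls = []
    for site in central_location_catalog.get(dataset, []):
        surl = site_local_catalogs.get(site, {}).get(logical_file)
        if surl is not None:
            surls.append(surl)
    return surls


print(resolve("physics.aod.0001", "file1"))
```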

Slide 10: Implementation
Central catalogs:
- The clients and servers are written in Python and communicate using REST-style HTTP calls
- Servers are hosted in Apache using mod_python
- mod_gridsite is used for security, and currently MySQL databases serve as the backend
[Diagram: the client side (DQ2Client.py, RepositoryClient.py, ContentClient.py) talks via HTTP GET/POST to the Apache/mod_python server (server.py, catalog.py), which is backed by the DB.]
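A hedged sketch of what such a REST-style call might look like from the Python client side follows; the server URL, endpoint paths, parameters and response handling are assumptions for illustration and are not the actual DQ2 client API.

```python
# Illustrative only: endpoint names and parameters are assumptions, not the DQ2 protocol.
import urllib.parse
import urllib.request

BASE_URL = "https://dq2.example.org/dq2"  # hypothetical catalog server URL


def query_dataset(name: str) -> str:
    """Look a dataset up in the repository catalog with an HTTP GET."""
    url = BASE_URL + "/repository/dataset?" + urllib.parse.urlencode({"name": name})
    with urllib.request.urlopen(url) as response:
        return response.read().decode()


def register_dataset(name: str) -> str:
    """Register a new dataset with an HTTP POST."""
    data = urllib.parse.urlencode({"name": name}).encode()
    with urllib.request.urlopen(BASE_URL + "/repository/dataset", data=data) as response:
        return response.read().decode()
```

On the server side, the Apache/mod_python layer would dispatch such requests to handlers that read from or write to the MySQL backend.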

Slide 11: Site Services
- DQ2 site services are hosted at a site and pull data to that site
- The subscription catalog is queried for any dataset subscriptions to the site
- New versions of a subscribed dataset are automatically pulled to the site
- The site services then copy any new data and register it at their site using the underlying Grid middleware tools
[Diagram: subscription example. Datasets physics.aod.0001 and physics.esd.0001 with files File1, File2, File3, File4 at the CERN Tier 0; the subscription entries physics.aod.0001 | CNAF, physics.aod.0001 | BNL and physics.esd.0001 | BNL pull the data to the CNAF and BNL Tier 1 sites.]
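The pull model above can be pictured as a small polling loop like the following sketch; the catalog contents, site name and transfer stand-in are invented for illustration, and the real site services are considerably more elaborate (see the agent breakdown on the next slide).

```python
# Simplified sketch of the subscription-driven pull model; all data are invented.
import time

SITE = "CNAF"  # the site this service instance runs at (example value)

# Stand-ins for the central subscription and content catalogs
subscriptions = {"physics.aod.0001": ["CNAF", "BNL"],
                 "physics.esd.0001": ["BNL"]}
content = {"physics.aod.0001": ["file1", "file2"],
           "physics.esd.0001": ["file3", "file4"]}

local_files = {"file1"}  # files already held (and registered) at this site


def copy_and_register(filename: str) -> None:
    # Stand-in for a Grid middleware transfer plus local catalog registration
    print("transferring", filename, "to", SITE)
    local_files.add(filename)


def poll_once() -> None:
    """One polling cycle: fetch missing files of datasets subscribed to this site."""
    for dataset, sites in subscriptions.items():
        if SITE not in sites:
            continue
        for filename in content[dataset]:
            if filename not in local_files:
                copy_and_register(filename)


if __name__ == "__main__":
    for _ in range(3):   # the real services poll continuously
        poll_once()
        time.sleep(1)
```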

Slide 12: Site Services
File states (in the site-local DB): unknownSURL -> knownSURL -> assigned -> toValidate -> validated -> done

Python agents and their functions:
  Fetcher          - finds incomplete datasets
  ReplicaResolver  - finds the remote SURL
  MoverPartitioner - assigns Mover agents
  Mover            - moves the file
  ReplicaVerifier  - verifies the local replica
  BlockVerifier    - verifies that the whole dataset is complete
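One way to read this table is as a per-file state machine advanced by a chain of agents; the sketch below encodes that reading in Python, with the transition mapping inferred from the order of the states and agents rather than taken from the actual DQ2 logic.

```python
# Illustrative state machine for one site's files; transitions are inferred, not exact.
AGENT_TRANSITIONS = {
    # agent name:      (state consumed, state produced)
    "ReplicaResolver":  ("unknownSURL", "knownSURL"),   # finds the remote SURL
    "MoverPartitioner": ("knownSURL", "assigned"),      # assigns Mover agents
    "Mover":            ("assigned", "toValidate"),     # moves the file
    "ReplicaVerifier":  ("toValidate", "validated"),    # verifies the local replica
    "BlockVerifier":    ("validated", "done"),          # verifies the dataset is complete
}
# (The Fetcher, which finds incomplete datasets, feeds new files in at 'unknownSURL'.)

files = {"file1": "unknownSURL", "file2": "assigned"}  # file -> current state


def run_agent(agent: str) -> None:
    consumed, produced = AGENT_TRANSITIONS[agent]
    for name, state in files.items():
        if state == consumed:
            files[name] = produced


# Running the agents in this order advances every file toward the 'done' state.
for agent in AGENT_TRANSITIONS:
    run_agent(agent)
print(files)
```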

Slide 13: Experience running DQ2 - Tier 0 Exercise
- A large-scale test of the system was performed as part of LCG Service Challenge 3 at the end of last year
- We ran a 'Tier 0 exercise', a scaled-down version of the data movement out of CERN when the experiment starts
- Fake events were generated at CERN, reconstructed at CERN, and the data was shipped out to the Tier 1 centres
- We started with 5% of the operational rate, slowly ramping up and adding Tier 1 sites each week
- This was a test of our integration with the LCG middleware and of the scalability of our software
- See also "ATLAS Tier-0 Scaling Test", talk #341

Slide 14: Tier 0 data flow (full operational rates)
[Diagram of the Tier 0 data flow at full operational rates.]

Slide 15: Results from the Tier 0 exercise
- The exercise started at the end of October and finished just before Christmas
- We were able to achieve the target data throughput rates for short periods
- Throughput peaked at 200 MB/s for 2 hours at the end of the exercise; our largest average daily rate was just over 90 MB/s (data production ran ~8 hours per day)

Slide 16: Transfer rates per day
[Plot of transfer rates per day.]

Slide 17: Total data transferred
[Plot of total data transferred.]

Slide 18: Experience of DQ2 in PANDA
- DQ2 has been integrated and tested with PANDA (the production system used on OSG) for several months
- There are 4 sites in the US producing data using PANDA integrated with DQ2, with an instance of the site services at each site and a PANDA-specific global catalog instance
- So far ~270,000 files in ~8,700 datasets have been produced by ~40,000 jobs and registered in DQ2
- The failure rate for PANDA-DQ2 interactions was around 10% (but is falling rapidly!)

Slide 19: Conclusions and Future Plans
- DQ2 worked well during the Tier 0 exercise and for PANDA distributed production
- The design of the system (datasets, central catalogs, site services) makes the data flow manageable
- LCG Service Challenge 4 starts in a few months; this will be the first large-scale test using DQ2 for distributed production and distributed analysis across the whole of ATLAS
- In the same period we will experiment with improved technologies, e.g. Oracle database backends, an improved deployment model and better monitoring
- By the end of SC4 we will have exercised all the usages of DQ2 at large scale and will be ready for data taking
- For more information: https://uimon.cern.ch/twiki/bin/view/Atlas/DDM

