A Scalable Distributed Data Management System for ATLAS. David Cameron, CERN. CHEP 2006, Mumbai, India.

Presentation transcript:

A Scalable Distributed Data Management System for ATLAS
David Cameron, CERN
CHEP 2006, Mumbai, India

Outline
The data flow from the ATLAS experiment
The Distributed Data Management system - Don Quijote:
  Architecture
  Datasets
  Central catalogs
  Subscriptions
  Site services
Implementation details
Results from the first large-scale test
Conclusion and future plans

The ATLAS Experiment Data Flow
[Diagram: data flows from the Detector to the CERN Computer Centre + Tier 0, and over the Grid to Tier 1 and Tier 2 centres; labels include RAW data, Reconstructed + RAW data, small data products, simulated data and reprocessing.]

The Need for ATLAS Data Management
Grids provide a set of tools to manage distributed data.
These are low-level file cataloging, storage and transfer services.
ATLAS uses three Grids (LCG, OSG, NorduGrid), each with its own versions of these services.
There therefore needs to be an ATLAS-specific layer on top of the Grid middleware.
The goal is to manage the data flow described in the computing model and to provide a single entry point to all distributed ATLAS data.
Our software is called Don Quijote (DQ).

Don Quijote
The first version of DQ simply provided an interface to the three Grid catalogs (one per Grid) to query data locations, plus a simple reliable file transfer system.
It was used for ATLAS Data Challenge 2 and the Rome productions.
[Diagram: DQ queries the EDG RLS on LCG, the Globus RLS on OSG and the Globus RLS on NorduGrid.]

Don Quijote 2
The fact that this was not a scalable solution, together with advances in Grid middleware, led to a redesign of DQ: DQ2.
DQ2 is based on the concept of versioned datasets, defined as collections of files or of other datasets.
ATLAS central catalogs define datasets and their locations.
A dataset is also the unit of data movement.
To enable data movement, distributed 'site services' use a subscription mechanism to pull data to a site.
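As an illustration only, here is a minimal sketch (in Python, the DQ2 implementation language) of the versioned-dataset concept; the class and field names are hypothetical and not the real DQ2 data model.

```python
# Illustrative sketch only: a versioned dataset as a named collection of
# files (or of other datasets). Names are hypothetical, not the DQ2 schema.
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class FileEntry:
    lfn: str    # logical file name
    guid: str   # globally unique identifier


@dataclass
class Dataset:
    name: str
    duid: str   # dataset unique ID, as held in the repository catalog
    # version number -> contents (files or nested datasets)
    versions: Dict[int, List[Union["Dataset", FileEntry]]] = field(default_factory=dict)

    def add_version(self, contents: List[Union["Dataset", FileEntry]]) -> int:
        """Register the contents as a new version and return its number."""
        version = max(self.versions, default=0) + 1
        self.versions[version] = list(contents)
        return version

    def latest(self) -> List[Union["Dataset", FileEntry]]:
        """Return the contents of the most recent version."""
        return self.versions[max(self.versions)]
```

In this sketch, pulling a new version of a subscribed dataset would simply mean fetching the files in latest() that are not yet present at the site.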

Central Catalogs
Dataset Repository: holds all dataset names and unique IDs (+ system metadata)
Dataset Content Catalog: maps each dataset to its constituent files
Dataset Location Catalog: stores the locations of each dataset
Dataset Subscription Catalog: stores subscriptions of datasets to sites
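Purely as an illustration of the division of responsibility between the four catalogs (the field names and in-memory layout are assumptions, not the DQ2 schema):

```python
# Illustrative-only sketch of the four central catalogs and a subscription
# operation. All names are hypothetical; DQ2 keeps these in separate
# database-backed services, not one in-memory object.
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CentralCatalogs:
    # Dataset Repository: dataset name -> unique ID (+ system metadata)
    repository: Dict[str, str] = field(default_factory=dict)
    # Dataset Content Catalog: dataset name -> constituent file identifiers
    content: Dict[str, List[str]] = field(default_factory=dict)
    # Dataset Location Catalog: dataset name -> sites holding it
    location: Dict[str, Set[str]] = field(default_factory=dict)
    # Dataset Subscription Catalog: site -> dataset names subscribed to it
    subscriptions: Dict[str, Set[str]] = field(default_factory=dict)

    def subscribe(self, site: str, dataset: str) -> None:
        """Record that 'site' wants new versions of 'dataset' pulled to it."""
        if dataset not in self.repository:
            raise KeyError(f"unknown dataset: {dataset}")
        self.subscriptions.setdefault(site, set()).add(dataset)
```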

Catalog Interactions
[Diagram of the interactions between clients and the central catalogs.]

Central Catalogs
There is no global physical file replica catalog.
Physical file resolution is done by (Grid-specific) catalogs at each site, holding only data on that site.
The central catalogs are split (different databases) because we expect different access patterns on each one; for example, the content catalog will be very heavily used.
The catalogs are logically centralised but may be physically separated or partitioned for performance reasons.
A unified client interface ensures consistency between catalogs when multiple catalog operations are performed.
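A minimal sketch of the unified-client idea, assuming hypothetical per-catalog clients with invented method names (the real DQ2 clients and their interfaces may differ):

```python
# Hypothetical sketch: a unified client keeps the split catalogs consistent
# when one logical operation touches several of them. The per-catalog
# clients and their method names are assumptions for illustration.
class DQ2Client:
    def __init__(self, repository_client, content_client, location_client):
        self.repository = repository_client
        self.content = content_client
        self.location = location_client

    def register_dataset(self, name, files, site):
        """Register a dataset, its files and its initial location together."""
        duid = self.repository.add_dataset(name)
        try:
            self.content.add_files(duid, files)
            self.location.add_location(duid, site)
        except Exception:
            # Undo the first step so the catalogs do not drift apart.
            self.repository.remove_dataset(duid)
            raise
        return duid
```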

Implementation
Central catalogs:
The clients and servers are written in Python and communicate using REST-style HTTP calls.
The servers are hosted in Apache using mod_python, with mod_gridsite for security and currently MySQL databases as the backend.
[Diagram: client modules (DQ2Client.py, RepositoryClient.py, ContentClient.py) make HTTP GET/POST calls to an Apache/mod_python server (server.py, catalog.py) backed by the database.]
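To make the REST-style interaction concrete, here is a hedged sketch of a client call; the server URL, resource path and parameter names are invented and are not the actual DQ2 interface:

```python
# Hypothetical REST-style catalog calls. The URL, path and parameter names
# are placeholders, not the real DQ2 server interface; in production the
# connection would go through Apache/mod_gridsite with Grid credentials.
import json
import urllib.parse
import urllib.request

CATALOG_URL = "https://dq2-catalog.example.org/repository"  # placeholder


def get_dataset(name: str) -> dict:
    """HTTP GET: look up a dataset entry by name."""
    query = urllib.parse.urlencode({"dsn": name})
    with urllib.request.urlopen(f"{CATALOG_URL}/dataset?{query}") as resp:
        return json.loads(resp.read())


def add_dataset(name: str) -> None:
    """HTTP POST: create a new dataset entry."""
    body = urllib.parse.urlencode({"dsn": name}).encode()
    with urllib.request.urlopen(f"{CATALOG_URL}/dataset", data=body) as resp:
        resp.read()
```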

Site Services
DQ2 site services are hosted at a site and pull data to that site.
The subscription catalog is queried for any dataset subscriptions to the site, and new versions of a subscribed dataset are automatically pulled in.
The site services then copy any new data and register it at their site using the underlying Grid middleware tools.
[Diagram: example subscriptions of physics.aod.0001 to the CNAF and BNL Tier 1s and of physics.esd.0001 to BNL; the site services pull the constituent files (File1, File2, File3, File4) from the CERN Tier 0.]
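A minimal sketch of the subscription-driven pull cycle; the in-memory "catalogs" and the copy step below are stand-ins for the real central catalogs and Grid middleware tools:

```python
# Hypothetical sketch of a site-services pull cycle. The dictionaries stand
# in for the subscription/content catalogs and the site-local file catalog;
# copy_file stands in for a transfer via the underlying Grid middleware.
SITE = "EXAMPLE_TIER1"                                    # placeholder site

SUBSCRIPTIONS = {SITE: {"physics.aod.0001"}}              # subscription catalog
CONTENT = {"physics.aod.0001": ["file1", "file2"]}        # content catalog
LOCAL_FILES = set()                                       # site-local catalog


def copy_file(lfn: str, site: str) -> None:
    """Stand-in for copying a file to local storage (e.g. via FTS/gridftp)."""
    print(f"copying {lfn} to {site}")


def pull_cycle(site: str = SITE) -> None:
    """Pull and register any files of subscribed datasets not yet on site."""
    for dataset in SUBSCRIPTIONS.get(site, ()):
        for lfn in CONTENT.get(dataset, ()):
            if lfn not in LOCAL_FILES:
                copy_file(lfn, site)
                LOCAL_FILES.add(lfn)   # register in the site-local catalog


if __name__ == "__main__":
    pull_cycle()   # in reality this runs repeatedly, polling the catalog
```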

Site Services
File states (in the site-local DB): unknownSURL, knownSURL, assigned, toValidate, validated, done

Python agents and their functions:
Fetcher: finds incomplete datasets
ReplicaResolver: finds the remote SURL
MoverPartitioner: assigns Mover agents
Mover: moves the file
ReplicaVerifier: verifies the local replica
BlockVerifier: verifies the whole dataset is complete
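To show how the file states and agents fit together, here is a hedged sketch of the progression for a single file; the class and method names are invented, and the real agents run as independent processes against the site-local database:

```python
# Hypothetical sketch of the agent chain advancing one file through the
# site-local states. Names are illustrative, not the real DQ2 agents.
class FileRecord:
    def __init__(self, lfn: str):
        self.lfn = lfn
        self.state = "unknownSURL"   # created by the Fetcher for an incomplete dataset
        self.surl = None


class ReplicaResolver:
    """unknownSURL -> knownSURL: find the remote SURL of the file."""
    def process(self, rec: FileRecord) -> None:
        rec.surl = f"srm://remote.example.org/{rec.lfn}"   # placeholder lookup
        rec.state = "knownSURL"


class MoverPartitioner:
    """knownSURL -> assigned: assign the file to a Mover agent."""
    def process(self, rec: FileRecord) -> None:
        rec.state = "assigned"


class Mover:
    """assigned -> toValidate: move the file to local storage."""
    def process(self, rec: FileRecord) -> None:
        rec.state = "toValidate"


class ReplicaVerifier:
    """toValidate -> validated: verify the local replica."""
    def process(self, rec: FileRecord) -> None:
        rec.state = "validated"


class BlockVerifier:
    """validated -> done: mark done once the whole dataset is complete."""
    def process(self, rec: FileRecord) -> None:
        rec.state = "done"


def run_chain(rec: FileRecord) -> str:
    for agent in (ReplicaResolver(), MoverPartitioner(), Mover(),
                  ReplicaVerifier(), BlockVerifier()):
        agent.process(rec)
    return rec.state   # "done"
```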

Experience running DQ2 - the Tier 0 Exercise
A large-scale test of the system was performed as part of LCG Service Challenge 3 at the end of last year.
We ran a 'Tier 0 exercise', a scaled-down version of the data movement out from CERN when the experiment starts.
Fake events were generated at CERN, reconstructed at CERN, and the data was shipped out to Tier 1 centres, starting with 5% of the operational rate and slowly ramping up, adding Tier 1 sites each week.
This was a test of our integration with LCG middleware and of the scalability of our software.
See also the "ATLAS Tier-0 Scaling Test" talk (#341).

Tier 0 data flow (full operational rates)
[Diagram of the Tier 0 data flow at the full operational rates.]

Results from the Tier 0 exercise
The exercise started at the end of October and finished just before Christmas.
We were able to achieve the target data throughput rates for short periods.
The throughput peaked at 200 MB/s for 2 hours at the end of the exercise; our largest average daily rate was just over 90 MB/s (data production ran ~8 hours per day).

Transfer rates per day
[Plot of the transfer rates per day during the exercise.]

Total data transferred
[Plot of the total data transferred during the exercise.]

Experience of DQ2 in PANDA
DQ2 has been integrated and tested with PANDA (the production system used on OSG) for several months.
There are 4 sites in the US producing data using PANDA integrated with DQ2, with an instance of the site services at each site and a PANDA-specific global catalog instance.
So far ~270,000 files in ~8,700 datasets have been produced by ~40,000 jobs and registered in DQ2.
The failure rate was around 10% for PANDA-DQ2 interactions (but is falling rapidly!).

Conclusions and Future Plans
DQ2 worked well during the Tier 0 exercise and for PANDA distributed production.
The design of the system (datasets, central catalogs, site services) makes the data flow manageable.
LCG Service Challenge 4 starts in a few months; this will be the first large-scale test of DQ2 in distributed production and distributed analysis for the whole of ATLAS.
In this period we will also experiment with improved technologies, e.g. Oracle database backends, an improved deployment model and monitoring.
By the end of SC4 we will have exercised all the uses of DQ2 at large scale and be ready for data taking.
For more information: