ATLAS DDM: Developing a Data Management System for the ATLAS Experiment
Miguel Branco
September 20, 2005

ATLAS DDM Outline
‘Data Challenges 2’ and ‘Rome Production’
Lessons Learned
DQ2
– Design
– Implementation
– Data model
– Services
Conclusion

DC2 and Rome Production
Production started Spring 2004 and finished recently
ProdSys components:
– Supervisors: collect jobs from the production database and dispatch them to executors
– Executors (one per ‘Grid’): translate the physics definition into a Grid job and launch it
– Data Management (DQ): high-level service that interacted with all ATLAS Grid catalogs and storages
  – File-based: relied on backend RLS (Globus RLS, EDG RLS)
  – Also implemented a simple reliable file transfer (FIFO queue)
All components interacted with data management (DQ)

ATLAS DDM Lessons learned
Catalogs were provided by Grid providers and used “as-is”
Granularity: file-level. No datasets, no “file collections”
No scoping of queries (difficult to find data, slow)
No bulk operations
No managed and transparent data access; unreliable GridFTP
– SRM also unreliable; problems with mass storage
– Difficult to handle different mass storage stagers from Grid
Metadata support not usable; too slow
– Logical Collection Name as metadata string field: /datafiles/rome/…
Catalogs not always geographically distributed
– Single point of failure (middleware, people/timezones)
No “ATLAS resources information system” (with known/negotiated QoS)
… and unreliable information systems from Grid providers
Operational problems
– Timezones, lack of people, experience, communication

ATLAS DDM DQ2 Design rationale
Evolve from past experience
Scalability
– Administrative, geographical, load
Interoperability with Grid m/w components
– Replica Catalog, Storage Management, Reliable File Transfer
Global != Site != Local != Clients
Production and User Analysis
Security
Datasets, not files…
Bulk operations
– Datasets and Datablocks (an immutable collection of files)

DQ2
Moves from a file-based system to one based on datasets
– Hides file-level granularity from users
– A hierarchical structure makes cataloging more manageable
– However, file-level access is still possible
Scalable global data discovery and access via a catalog hierarchy
No global physical file replica catalog
– but a global dataset replica catalog and a global dataset location catalog
[diagram: a dataset mapped to its constituent files and to the sites holding it]

ATLAS DDM Catalog architecture and interactions

ATLAS DDM ‘Global’ catalogs
Dataset Repository
– Holds all dataset names and unique IDs (+ system metadata)
Dataset Hierarchy
– Maintains versioning information and information on ‘container datasets’: datasets consisting of other datasets
Dataset Content Catalog
– Maps each dataset to its constituent files
– This one holds info on every logical file so must be highly scalable; however, it can be highly partitioned using metadata etc.
Dataset Location Catalog
– Stores the locations of each dataset
All are logically global but may be distributed physically
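As a rough illustration of the split between the four global catalogs, here is a minimal in-memory sketch. All class and method names are hypothetical, chosen for this example; they are not the actual DQ2 interfaces.

```python
# Hypothetical in-memory sketch of the four 'global' DQ2 catalogs; names are
# illustrative only, not the real DQ2 API.
import uuid

class DatasetRepository:
    """Holds all dataset names and unique IDs (+ system metadata)."""
    def __init__(self):
        self._by_name = {}                     # dataset name -> {'duid', 'metadata'}

    def register(self, name, metadata=None):
        duid = str(uuid.uuid4())
        self._by_name[name] = {'duid': duid, 'metadata': metadata or {}}
        return duid

    def duid(self, name):
        return self._by_name[name]['duid']

class DatasetContentCatalog:
    """Maps each dataset version to its constituent logical files.
    Holds an entry per logical file, so it must scale; in practice it could be
    partitioned, e.g. by metadata."""
    def __init__(self):
        self._files = {}                       # (duid, version) -> [lfn, ...]

    def add_files(self, duid, version, lfns):
        self._files.setdefault((duid, version), []).extend(lfns)

    def list_files(self, duid, version):
        return list(self._files.get((duid, version), []))

class DatasetHierarchy:
    """Tracks versions and 'container datasets' made of other datasets."""
    def __init__(self):
        self._constituents = {}                # container duid -> [child duid, ...]

    def add_constituent(self, container_duid, child_duid):
        self._constituents.setdefault(container_duid, []).append(child_duid)

class DatasetLocationCatalog:
    """Stores which sites hold a copy of each dataset."""
    def __init__(self):
        self._sites = {}                       # duid -> {site, ...}

    def add_location(self, duid, site):
        self._sites.setdefault(duid, set()).add(site)

    def locations(self, duid):
        return sorted(self._sites.get(duid, set()))
```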

ATLAS DDM ‘Local’ Catalogs
Local Replica Catalog
– Per grid/site/tier, providing the logical-to-physical file name mapping
– Implementations of this catalog are Grid specific but must use a standard interface
Claims Catalog
– Per site storage, keeping user claims on datasets
– Claims are used to manage stage lifetime and resources, and to provide accounting
Currently all ‘Local’ catalogs are deployed per ATLAS site
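The flow below is a hedged sketch of how a dataset name could be resolved to physical replicas at one site: the global content and location catalogs (represented here by plain dicts) identify the files and the sites, and the per-site Local Replica Catalog maps logical names to SURLs. All names, the example SURL and the site name are invented for illustration.

```python
# Hypothetical end-to-end lookup at one site: dataset name -> physical replicas.
# Plain dicts stand in for the global catalogs; all names are illustrative only.

class LocalReplicaCatalog:
    """Per grid/site/tier: logical file name -> physical file names (SURLs)."""
    def __init__(self, site):
        self.site = site
        self._pfns = {}                        # lfn -> [SURL, ...]

    def register_replica(self, lfn, pfn):
        self._pfns.setdefault(lfn, []).append(pfn)

    def lookup(self, lfn):
        return list(self._pfns.get(lfn, []))

def resolve_dataset_at_site(dsn, content, locations, lrc):
    """content: dsn -> list of logical file names (global content catalog).
    locations: dsn -> set of sites holding the dataset (global location catalog).
    The per-site LRC then maps each logical file to its physical replicas."""
    if lrc.site not in locations.get(dsn, set()):
        raise LookupError('%s has no replica at %s' % (dsn, lrc.site))
    return {lfn: lrc.lookup(lfn) for lfn in content[dsn]}

lrc = LocalReplicaCatalog('SITE_X')
lrc.register_replica('evt._0001.root',
                     'srm://se.site-x.example/atlas/evt._0001.root')
print(resolve_dataset_at_site('rome.004100.recon',
                              {'rome.004100.recon': ['evt._0001.root']},
                              {'rome.004100.recon': {'SITE_X'}},
                              lrc))
```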

Implementation
Architectural style
– REST-style (not entirely RESTful)
– Communication: intend to migrate non-performance-critical payload (monitoring, real-time status reporting) to XML soon; vocabularies will emerge from experience of running the system
Development
– First usable prototype deployed 47 days after the project started
Technology choices
– Python; servers hosted on Apache (mod_python, mod_gridsite); clients using PyCurl
– POOL File Catalog interface gives us a choice of back-end for the catalogs
– File movement: SRM, GridFTP, gLite FTS, HTTP, dccp, cp
Security
– HTTPS (with Globus proxy certs) for POST/PUT/DELETE and HTTP for GETs, i.e. world-readable data, best performance (can be made secure to the ATLAS VO if required)
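To make the client side concrete: a minimal sketch of what a PyCurl-based client could look like, with plain HTTP for read-only queries and HTTPS plus a Grid proxy certificate for writes. The server host, endpoint paths and parameter names below are invented for illustration; only the PyCurl calls themselves are real library API.

```python
# Illustrative PyCurl client: HTTP GET for world-readable queries, HTTPS with a
# Grid proxy certificate for state-changing calls. Endpoint and field names are
# hypothetical; certificate verification options are omitted for brevity.
import io
import os
import urllib.parse
import pycurl

CATALOG = 'dq2server.example.org/dq2'            # hypothetical server host/path

def get_dataset_info(name):
    buf = io.BytesIO()
    c = pycurl.Curl()
    c.setopt(c.URL, 'http://%s/repository?%s' %
             (CATALOG, urllib.parse.urlencode({'dsn': name})))
    c.setopt(c.WRITEDATA, buf)
    c.perform()
    c.close()
    return buf.getvalue().decode()

def register_dataset(name):
    proxy = os.environ.get('X509_USER_PROXY', '/tmp/x509up_u%d' % os.getuid())
    buf = io.BytesIO()
    c = pycurl.Curl()
    c.setopt(c.URL, 'https://%s/repository' % CATALOG)
    c.setopt(c.POSTFIELDS, urllib.parse.urlencode({'dsn': name}))
    c.setopt(c.SSLCERT, proxy)                   # Grid proxy used as client cert
    c.setopt(c.SSLKEY, proxy)
    c.setopt(c.WRITEDATA, buf)
    c.perform()
    c.close()
    return buf.getvalue().decode()
```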

ATLAS DDM Datablocks
Datablocks are defined as immutable and unbreakable collections of files
– They are a special case of datasets
– A site cannot hold partial datablocks
– There are no versions for datablocks
Used to aggregate files for convenient distribution
– Files grouped together by physics properties, run number etc.
– Much more scalable than file-level distribution
– Useful for provenance: immutable sets of data
The principal means of data distribution and data discovery
– Immutability avoids consistency problems when distributing data
– Moving data in blocks improves data distribution (bulk SRM requests)
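A datablock is essentially a dataset that is closed once and never changed. A minimal sketch of how that constraint might be enforced; the class and the example block name are hypothetical.

```python
# Hypothetical datablock: a dataset frozen at creation time; adding files or
# creating new versions is rejected, matching the immutability rule above.
class Datablock:
    def __init__(self, name, files):
        self.name = name
        self.files = tuple(sorted(files))   # fixed, ordered content
        self.version = 1                    # datablocks have exactly one version

    def add_files(self, new_files):
        raise TypeError('datablock %s is immutable' % self.name)

    def __contains__(self, lfn):
        return lfn in self.files

block = Datablock('rome.004100.recon._0001', ['f1.root', 'f2.root'])
print('f1.root' in block)         # True
# block.add_files(['f3.root'])    # would raise TypeError
```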

Subscriptions
A site can subscribe to data
– When a new version is available, the latest version of the dataset is automatically made available through site-local specific services carrying out the required replication: automated movement
Subscriptions can be made to datasets (for file distribution) or to container datasets (for datablock distribution)
Use cases:
– Automatic distribution of datasets holding a variable collection of datablocks (container datasets)
– Automatic replication of files by subscribing to a mutable dataset (e.g. file-based calibration data distribution)
[diagram: Dataset ‘A’ (a container of datablocks) subscribed to Site ‘X’, Dataset ‘B’ (files) subscribed to Site ‘Y’]
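A hedged sketch of the subscription idea: a site registers interest in a dataset, and when a newer version appears the site is asked to replicate whatever it is still missing. The class, function and dataset names below are illustrative only.

```python
# Illustrative subscription check: compare the latest catalogued version with
# what the subscribed site already holds and return the files still to transfer.
class Subscription:
    def __init__(self, dataset, site):
        self.dataset = dataset        # dataset (or container dataset) name
        self.site = site
        self.known_version = 0        # last version fully replicated to the site

def files_to_replicate(sub, latest_version, content_catalog, local_files):
    """content_catalog: maps (dataset, version) -> list of logical file names.
    local_files: set of logical file names already present at the site."""
    if latest_version <= sub.known_version:
        return []                     # nothing new; the subscription is satisfied
    wanted = content_catalog[(sub.dataset, latest_version)]
    return [lfn for lfn in wanted if lfn not in local_files]

sub = Subscription('csc11.calib.conditions', 'SITE_X')
catalog = {('csc11.calib.conditions', 2): ['c1.db', 'c2.db', 'c3.db']}
print(files_to_replicate(sub, 2, catalog, {'c1.db'}))   # ['c2.db', 'c3.db']
```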

Subscriptions
Various data movement use cases:
– Datasets: latest version of a dataset (triggers automatic updates whenever a new version appears)
– Container datasets: which in turn contain datablocks or datasets; supports subscriptions to the latest version of a container dataset (automatically triggers updates whenever e.g. the set of datablocks making up the container dataset changes)
– Datablocks: single copy of an immutable set of files
– Databuckets (diagram on next slide): replication of a set of files using a notification model (whenever new content appears in the databucket, replication is triggered)
[diagram: a site subscribes to dataset DS1 (File1–File3) and to container CDS1 (datablocks DB1–DB3), temporarily “subscribing” to DB1; the Dataset Location Catalog is updated]
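For a container-dataset subscription, what actually moves is the set of datablocks inside the container. A small sketch of how a change in the container could be expanded into one-off datablock requests at a site; all names (CDS1, DB1, SITE_Y) are illustrative placeholders echoing the diagram.

```python
# Illustrative expansion of a container-dataset subscription: any datablock newly
# added to the container gets a one-off (temporary) request at the subscribed site.
def expand_container_subscription(site, container, hierarchy, already_at_site):
    """hierarchy: container name -> list of constituent datablock names.
    already_at_site: datablocks the site holds; returns new datablock requests."""
    wanted = hierarchy.get(container, [])
    return [(site, block) for block in wanted if block not in already_at_site]

hierarchy = {'CDS1': ['DB1', 'DB2', 'DB3']}
print(expand_container_subscription('SITE_Y', 'CDS1', hierarchy, {'DB1'}))
# [('SITE_Y', 'DB2'), ('SITE_Y', 'DB3')]
```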

ATLAS DDM Data buckets
Data must be replicated (quickly) not on the appearance of a new version but on new content
– The alternative would be constantly defining new versions of datasets!
Will use a notification model:
– Whenever new content appears in a data bucket, sites subscribing to it are notified and data is moved accordingly
Data buckets can contain files
Data buckets can contain datablocks
[diagram: a file-based data “bucket” holding File 1 and File 2, replicated to a remote site]
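A sketch of the notification model: adding content to a bucket notifies subscriber sites immediately, instead of minting a new dataset version for every change. The class, callback and bucket name are hypothetical.

```python
# Illustrative data bucket: new content triggers callbacks to subscribed sites,
# rather than creating a new dataset version for every change.
class DataBucket:
    def __init__(self, name):
        self.name = name
        self.contents = []            # files or datablock names
        self._subscribers = []        # callables invoked on new content

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def add(self, item):
        self.contents.append(item)
        for notify in self._subscribers:
            notify(self.name, item)   # e.g. queue a transfer to the remote site

def queue_transfer(bucket, item):
    print('replicating %s from bucket %s to remote site' % (item, bucket))

bucket = DataBucket('online.calib.bucket')
bucket.subscribe(queue_transfer)
bucket.add('File1')                   # prints the replication message above
```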

Summary of Services
Global services
– Dataset catalogs
– Requirements: grid environment, database, Apache services
Site services
– Subscriptions, Databuckets, Claims and a minimal information system (monitoring, real-time reporting)
– Requirements: grid environment, database, Apache services, DQ2 agents for moving data, grid-specific data movement clients, Python, PyCURL, grid certificate
Local worker node client
– Contact the local LRC, get and put data to the local Storage
– Requirements: grid environment
Clients
– Define datasets and datablocks, subscribe them to sites
– Associate files with new dataset versions
– Query dataset definition, contents, location
– …
– Requirements: Python, PyCURL, grid certificate for writing
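From a user's point of view, the client operations listed above might look roughly like the stand-in below. It is a self-contained, hypothetical mock; the method names only echo the slide's wording and do not reproduce the real DQ2 client tools.

```python
# Hypothetical, self-contained stand-in for the client operations listed above.
class DQ2Client:
    def __init__(self):
        self.datasets = {}        # dsn -> list of (lfn, guid)
        self.locations = {}       # dsn -> set of sites
        self.subscriptions = []   # (dsn, site)

    def register_new_dataset(self, dsn):
        self.datasets[dsn] = []                       # define a new dataset

    def add_files_to_dataset(self, dsn, files):
        self.datasets[dsn].extend(files)              # associate files with it

    def list_files_in_dataset(self, dsn):
        return list(self.datasets[dsn])               # query contents

    def list_dataset_replicas(self, dsn):
        return sorted(self.locations.get(dsn, set())) # query locations

    def register_dataset_subscription(self, dsn, site):
        self.subscriptions.append((dsn, site))        # subscribe it to a site

dq2 = DQ2Client()
dq2.register_new_dataset('user.mbranco.test.recon.001')
dq2.add_files_to_dataset('user.mbranco.test.recon.001',
                         [('evt._0001.root', 'guid-1')])
dq2.register_dataset_subscription('user.mbranco.test.recon.001', 'SITE_X')
print(dq2.list_files_in_dataset('user.mbranco.test.recon.001'))
```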

Detail on Subscriptions
State machine: unknownSURL → knownSURL → assigned → toValidate → validated → done
Agents and their functions:
– Fetcher: finds incomplete datasets
– ReplicaResolver: finds the remote SURL
– MoverPartitioner: assigns Mover agents
– Mover: moves the file
– ReplicaVerifier: verifies the local replica
– BlockVerifier: verifies the whole dataset is complete
This is the list of software required to handle subscriptions. It requires minimal deployment effort (laptop support!)
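A sketch of that state machine as code: the state and agent names come from the slide, but the code structure itself is invented for illustration.

```python
# Illustrative version of the per-file state machine driven by the agents above.
# State and agent names follow the slide; everything else is hypothetical.
STATES = ['unknownSURL', 'knownSURL', 'assigned', 'toValidate', 'validated', 'done']

# Fetcher is not a per-file transition: it finds incomplete datasets and feeds
# their missing files into the machine in the 'unknownSURL' state.
AGENT_STEP = {
    'ReplicaResolver':  ('unknownSURL', 'knownSURL'),   # finds the remote SURL
    'MoverPartitioner': ('knownSURL',   'assigned'),    # assigns Mover agents
    'Mover':            ('assigned',    'toValidate'),  # moves the file
    'ReplicaVerifier':  ('toValidate',  'validated'),   # verifies the local replica
    'BlockVerifier':    ('validated',   'done'),        # verifies the dataset is complete
}

def run_agent(agent, file_state):
    """Advance one file's state if this agent is responsible for that state."""
    expects, produces = AGENT_STEP[agent]
    return produces if file_state == expects else file_state

state = STATES[0]                                        # 'unknownSURL'
for agent in ['ReplicaResolver', 'MoverPartitioner', 'Mover',
              'ReplicaVerifier', 'BlockVerifier']:
    state = run_agent(agent, state)
print(state)                                             # 'done'
```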

ATLAS DDM Claims
The Claims catalog manages the usage of datasets
– User requests have a lifetime
– A claim is assigned
– Users may add claims on existing datasets
– The claim owner may (should) release the claim when done
– The claim owner may extend the lifetime of the claim
– Automatically handled by the user client tools
Behavior
– Each claim has an expiration time (now plus lifetime)
– A claim is active until released or expired
– Datasets may have multiple active claims for different users
– Cache turnover relies on expired claims
Claims provide the mechanism for accounting, policy enforcement and dealing with Mass Storage (a claim triggers an SRM stage request)
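A hedged sketch of the claim lifetime behaviour described above (expiration, extension, release, cache turnover); the class and helper are illustrative, not the claims-catalog implementation.

```python
# Illustrative claim: active until released or past its expiration time;
# expired claims free the cached dataset copy for cache turnover.
import time

class Claim:
    def __init__(self, owner, dataset, lifetime_s):
        self.owner = owner
        self.dataset = dataset
        self.expires = time.time() + lifetime_s    # now plus lifetime
        self.released = False

    def is_active(self, now=None):
        now = time.time() if now is None else now
        return not self.released and now < self.expires

    def extend(self, extra_s):
        self.expires += extra_s                    # owner may extend the lifetime

    def release(self):
        self.released = True                       # owner should release when done

def eligible_for_cleanup(dataset, claims, now=None):
    """A cached dataset copy can be turned over once no claim on it is active."""
    return not any(c.is_active(now) for c in claims if c.dataset == dataset)

c = Claim('mbranco', 'rome.004100.recon', lifetime_s=7 * 24 * 3600)
print(eligible_for_cleanup('rome.004100.recon', [c]))   # False while the claim is active
```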

ATLAS DDM Conclusion
Evolve the model based on past experience
– based on proven technologies
Appears to scale so far
– load, geographic and (very important) administrative scalability
It is running now across some US ATLAS and LCG sites
– Ramping up (starting now!) to the full set of LCG and US ATLAS resources