1 DQ2 discussion on future features
Distributed Data Management, Miguel Branco
BNL workshop, October 4, 2007

2 DQ2 0.4.x
Continue to optimize the DB schema to cope with higher load
– channel allocation to follow the 'Dataset Subscription policy'
Hiro/Patrick are also asking for a locally configurable, ordered list of preferred sources within the cloud
– implications for channel allocation: how much to 'prefer' a T1 before going to a T2 for a replica? Right now, the shortest queue wins… (a possible scheme is sketched below)
– distinguishing files unlikely to ever have replicas (bad subscriptions), particularly in the local monitoring
– removing 'holes' in the system (growing backlogs)
Reduce load (better GSI session reuse)
Goal: O(100K) file transfers/day/site
– or SRM/storage limitations
– need a better understanding outside DQ2
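
The "how much to prefer a T1" question above can be made concrete with a small scoring function. This is a hedged sketch in Python, not DQ2 code: the names (pick_source, t1_bonus) and the example queue lengths and site names are assumptions for illustration only.

    # Minimal sketch (not DQ2 code) of source selection within a cloud.
    # Assumptions: `queues` maps site name to current transfer queue length,
    # `preferred_order` is the locally configured ordered list of preferred
    # sources, and `t1_bonus` expresses how much to prefer the T1 over a T2.

    def pick_source(queues, preferred_order, t1_site, t1_bonus=0.5):
        """Return the site to copy the replica from."""
        def rank(site):
            queue = queues[site]
            if site == t1_site:
                queue *= t1_bonus          # discount the T1 queue length
            # Sites earlier in the preferred list win when queues are comparable.
            position = (preferred_order.index(site)
                        if site in preferred_order else len(preferred_order))
            return (queue, position)
        return min(queues, key=rank)

    # Example (hypothetical site names): the T1 queue is longer, but it still
    # wins because of the discount; with t1_bonus=1.0 this degenerates to
    # today's plain "shortest queue wins" behaviour.
    queues = {"BNL_T1": 120, "MWT2": 80, "AGLT2": 200}
    print(pick_source(queues, ["BNL_T1", "MWT2"], t1_site="BNL_T1"))   # BNL_T1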

3 Local monitoring of site services

4 Staging…
Did not recognize this was a problem for OSG.
It is very hard to do with remote storages without SRM
– FTS 2 + SRMv2 move in the right direction, but are not there yet
Could do a local mechanism for T1->T2 transfers in the same cloud
– provided the site services for the T2 run "close" to the T1 storage
… but not for cross-T1 transfers

5 Hierarchies (current thoughts, for discussion)
Hierarchical datasets would be a special kind of dataset:
– they would have only 2 states: open AND frozen
– they would not have versions
– their constituents could only be closed dataset versions or frozen datasets
Not sure if the following commands should be provided explicitly (one possible model is sketched below):
– list files in a hierarchical dataset directly? Or only list the datasets in the hierarchical dataset, forcing the user to loop over the results?
– subscribe to an open hierarchical dataset? Or only allow listing the datasets in an open hierarchical dataset, forcing the user to manually subscribe the sub-units? The point is: having to loop over OPEN hierarchies (likely manageable)
– locations of a hierarchical dataset? Or only allow listing the locations of the individual datasets in the hierarchical dataset?
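
To make the "force the user to loop" option concrete, here is a minimal Python sketch of how a hierarchical dataset could behave under the rules above (two states, no versions, only closed versions or frozen datasets as constituents). The class, method and attribute names, and the catalog object, are illustrative assumptions rather than the DQ2 API.

    # Minimal sketch (not the DQ2 API) of a hierarchical dataset.

    class HierarchicalDataset:
        OPEN, FROZEN = "open", "frozen"       # only two states, no versions

        def __init__(self, name):
            self.name = name
            self.state = self.OPEN
            self.constituents = []            # closed dataset versions or frozen datasets

        def add(self, dataset):
            if self.state == self.FROZEN:
                raise ValueError("cannot modify a frozen hierarchical dataset")
            if dataset.state not in ("closed", "frozen"):
                raise ValueError("constituents must be closed versions or frozen datasets")
            self.constituents.append(dataset)

        def list_files(self, catalog):
            # The "loop over results" option: the hierarchy is resolved on the
            # client side, one constituent dataset at a time.
            files = []
            for ds in self.constituents:
                files.extend(catalog.list_files(ds.name))
            return files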

6 Merging
Not much to do from the DQ2 side here other than providing an attribute for each dataset
– "merged" Y/N (or a protocol: zip, tar?)
DQ2 does 3rd-party transfers only
– it does not actually 'see' the data

7 Checksums
Not much from DQ2 here other than enforcing checksums in the central catalogues, together with their protocol prefix (a small sketch follows)
– 'md5:' for MD5
adler32 is frequently discussed as a better checksum candidate
– but that choice is not relevant to DQ2, rather to the sites and production people
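
As a concrete illustration of the prefix convention, here is a small Python sketch that computes a checksum and stores it with its protocol label. Only 'md5:' is stated in the slide; the 'ad:' prefix for adler32 is an assumption here.

    # Minimal sketch of the checksum-with-prefix convention.

    import hashlib
    import zlib

    def labelled_checksum(path, algorithm="md5"):
        with open(path, "rb") as f:
            data = f.read()                   # fine for a sketch; stream large files
        if algorithm == "md5":
            return "md5:" + hashlib.md5(data).hexdigest()
        if algorithm == "adler32":
            # Mask to an unsigned 32-bit value and format as 8 hex digits.
            return "ad:%08x" % (zlib.adler32(data) & 0xFFFFFFFF)
        raise ValueError("unknown checksum algorithm: %s" % algorithm)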

8 Subscription lifetime
Increasingly important…
– would clean up what no one is cleaning up now (some sites have O(100K) files in impossible situations)
Discussion from yesterday (a possible policy check is sketched below):
– allow waitForSources to be set only by users with the production role? Avoids creating looping subscriptions in the system
– forbid subscriptions for datasets with more than X files, if not requested by a production user?
– forbid more than Y subscriptions per user, if not a production user?
– ignore a subscription, regardless of its state, after more than 3 months? The subscription is then marked as broken
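
The proposals above translate naturally into a policy check at subscription time plus a periodic expiry pass. The sketch below is illustrative only: the X and Y thresholds and the attribute names (has_production_role, created, state) are placeholders, not DQ2 fields.

    # Minimal sketch of the subscription policy discussed above.

    from datetime import datetime, timedelta

    MAX_FILES_NON_PRODUCTION = 10000           # "X" in the slide
    MAX_SUBSCRIPTIONS_PER_USER = 100           # "Y" in the slide
    MAX_SUBSCRIPTION_AGE = timedelta(days=90)  # "more than 3 months"

    def validate_subscription(user, dataset_nfiles, current_subscriptions,
                              wait_for_sources):
        if user.has_production_role:
            return                             # production is unrestricted
        if wait_for_sources:
            raise PermissionError("waitForSources restricted to the production role")
        if dataset_nfiles > MAX_FILES_NON_PRODUCTION:
            raise PermissionError("dataset too large for a non-production subscription")
        if current_subscriptions >= MAX_SUBSCRIPTIONS_PER_USER:
            raise PermissionError("too many subscriptions for this user")

    def expire(subscription, now=None):
        # Regardless of its state, a subscription older than ~3 months is
        # ignored and marked as broken.
        now = now or datetime.utcnow()
        if now - subscription.created > MAX_SUBSCRIPTION_AGE:
            subscription.state = "broken"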

9 Central catalogues [as mentioned yesterday]
Main changes are:
– for scalability only…
– dropping VUIDs (a VUID becomes DUID + version number)
– the DUID becomes a timestamp-oriented UUID, so that the backend is partitioned in time (the idea is sketched below) and UUID storage on ORACLE is highly optimized, meaning a shorter index, ORACLE partitioning, a redirect service…
– … but fully backward compatible with 0.3 clients
Many queries become much faster
– listing the files in a dataset is a query by DUID, as opposed to a query by N VUIDs
– ORACLE IOTs guarantee that listing the files of a dataset [version] reads close to sequential blocks on disk
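
The point of a "timestamp-oriented UUID" is that identifiers generated close together in time also sit close together in the index, so the backend can partition and scan them by time. The sketch below only illustrates the principle; the 6-byte millisecond prefix and the function name are assumptions, not the actual DUID encoding.

    # Minimal sketch of a timestamp-oriented identifier.

    import os
    import time
    import uuid

    def timestamp_uuid():
        millis = int(time.time() * 1000)
        # 6 bytes of timestamp + 10 random bytes = 16 bytes, i.e. a UUID.
        return uuid.UUID(bytes=millis.to_bytes(6, "big") + os.urandom(10))

    # Identifiers generated later sort after earlier ones, so a time-range
    # scan (or a time-based partition) touches contiguous index blocks.
    a = timestamp_uuid()
    time.sleep(0.002)
    b = timestamp_uuid()
    assert a.bytes < b.bytes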

10 Location catalogue [as mentioned yesterday]
The location catalogue will be populated asynchronously with:
– information on missing files
– (re)marking of complete/incomplete locations for existing datasets (consistency)
– missing files are extra information made available to users on a 'best-effort' basis, derived from a request by Ganga
This is populated by the 'tracker' service (sketched below)
– which was being reworked for the site services
– the tracker service is a 'stronger' Fetcher (as exists in the site services), used to find the content present on a site vs. the content missing on a site, one of the site services' performance bottlenecks
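
Conceptually the tracker reduces to a set difference between what the catalogue expects at a site and what the site actually holds. A minimal sketch, with placeholder lookup callables rather than real DQ2/LFC calls:

    # Minimal conceptual sketch of the tracker.

    def track(dataset, site, catalogue_files, site_files):
        """catalogue_files / site_files: callables returning sets of file GUIDs."""
        expected = catalogue_files(dataset)
        present = site_files(site, dataset)
        missing = expected - present
        # This is the information the location catalogue is populated with,
        # asynchronously and on a best-effort basis.
        return {
            "dataset": dataset,
            "site": site,
            "complete": not missing,
            "missing_files": sorted(missing),
        }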

11 Dashboard
Relatively big update coming soon
– distinguish source/destination errors
– display messages on the dashboard for all sites
– alarms supported
– more overview of the site services state from a central place, e.g. states of files (based also on the new site services monitoring)

12 ToA
More and more info there…
– blacklist/whitelist
– preferred site connections
This is a cache file, in the same style as ToA
– but an independent file from the ToA cache, since it is more dynamic
ToA renewal is much stronger
– I'd claim it is the most reliable info system so far on the Grid :-)

13 Communication…
… still not working:
– e.g. did not recognize staging as a problem
– e.g. apparently not deployed on OSG T2s; quite bad, as there was a simple bug where agents could simply die whenever a glitch happened in the central catalogue connection
– glitches are "common" at the central catalogue request rate, but harmless and ok to retry
… what to do here? Jabber chatroom :-)
– ask me - or - to be