David Adams Brookhaven National Laboratory September 28, 2006


1 Data validation
DDM Workshop BNL
David Adams
Brookhaven National Laboratory
September 28, 2006
Updated September 27, 2006

2 Contents
Goals
Publication
Issues
Status
Conclusions
D. Adams BNL data validation BNL DDM Workshop September 28, 2006

3 Goals
Goal of the BNL data validation effort
  Determine which data is available at BNL
    Which datasets
    Which files in each of these datasets
  Validate each dataset
    Validity of GUIDs and LFNs
    LFN corresponds to dataset name
    Duplicate file numbers within datasets
    Consistency of BNL replica catalog
  Publish results
  Create “BNL” datasets
    Include only files at BNL
    Remove duplicate and invalid files
    Registered as DSNAME_bnl in DQ2
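
The per-dataset checks and the construction of a DSNAME_bnl dataset can be sketched roughly as follows. This is only an illustration, not the code used for the BNL validation: the LFN layout (dataset name, then `._` and a file number, then a suffix), the helper names, and the choice to drop every file whose number is duplicated are all assumptions.

```python
import re
from collections import defaultdict

def file_number(dataset, lfn):
    """Return the file number if the LFN matches the dataset name, else None.

    Assumed (hypothetical) LFN layout: <dataset>._<number>.<suffix>
    """
    m = re.fullmatch(re.escape(dataset) + r"\._(\d+)\..+", lfn)
    return int(m.group(1)) if m else None

def validate(dataset, lfns):
    """Split LFNs into unique, invalid and duplicate-numbered groups."""
    by_number = defaultdict(list)
    invalid = []
    for lfn in lfns:
        num = file_number(dataset, lfn)
        if num is None:
            invalid.append(lfn)          # LFN does not correspond to dataset name
        else:
            by_number[num].append(lfn)
    duplicates = {n: f for n, f in by_number.items() if len(f) > 1}
    unique = [f[0] for n, f in sorted(by_number.items()) if len(f) == 1]
    return unique, invalid, duplicates

def bnl_dataset(dataset, lfns, at_bnl):
    """Build (name, files) for the BNL dataset: valid, unduplicated files at BNL.

    Assumption: all files sharing a duplicated number are left out.
    """
    unique, _invalid, _duplicates = validate(dataset, lfns)
    return dataset + "_bnl", [lfn for lfn in unique if lfn in at_bnl]
```

Deciding which copy of a duplicated file to keep is left out on purpose, since that needs a production expert rather than a local rule.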

4 Publication
Validation is published on a series of web pages
  Starting point:
  BNL summary:
Tables are updated twice a day
  Update time at the top of each page
  Automatic and fairly robust procedure
Tables provide a field that can be used to restrict the listing
  Simple pattern matching with * for wildcard
  E.g. *Zmumu*AOD*
Tables for tasks, task names, datasets and BNL resident datasets
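
As an illustration of the * wildcard matching, here is a minimal sketch using Python's standard `fnmatch` module. The web pages may implement matching differently, and the dataset names below are made up for the example. Note that `fnmatchcase` also gives `?` and `[...]` a special meaning, which the pages may not.

```python
from fnmatch import fnmatchcase

def filter_names(names, pattern):
    """Return the names matching a *-wildcard pattern (case-sensitive)."""
    return [n for n in names if fnmatchcase(n, pattern)]

# Hypothetical dataset names, for illustration only.
datasets = [
    "csc11.005145.PythiaZmumu.recon.AOD.v11004205",
    "csc11.005145.PythiaZmumu.recon.ESD.v11004205",
    "csc11.005144.PythiaZee.recon.AOD.v11004205",
]

# Only the Zmumu AOD dataset matches this pattern.
print(filter_names(datasets, "*Zmumu*AOD*"))
```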

5 Issues: Which datasets
Which datasets should be validated?
  I start from the task table (BNL replica)
  Select tasks that begin with “csc” (task table)
  Combine tasks with the same name (task name table)
  Follow conventions to guess the datasets produced by each task
  Check if dataset name is registered in DQ2 (DQ2 tasks)
  Check if BNL is a DQ2 location for the dataset (BNL datasets)
This has potential problems
  Conventions change and my code has to keep up
  Datasets become obsolete and should be dropped from validation
  Restricted to production datasets
Preferable to have an external source listing datasets of interest
  Perhaps the metadata catalog
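
The selection chain above can be sketched as a small pipeline over in-memory data. Everything here is hypothetical: the task tuples, the `dq2_locations` mapping (a dataset missing from it stands for "not registered in DQ2"), and especially `guess_datasets`, which stands in for the real naming conventions with the placeholder rule "one dataset, named after the task".

```python
from collections import defaultdict

def guess_datasets(task_name):
    """Placeholder convention (assumption): one dataset named after the task."""
    return [task_name]

def select_datasets(tasks, dq2_locations, site="BNL", prefix="csc"):
    """Return (task name, task ids, dataset) for datasets DQ2 locates at site."""
    # Select tasks by prefix and combine tasks that share a name.
    by_name = defaultdict(list)
    for task_id, name in tasks:
        if name.startswith(prefix):
            by_name[name].append(task_id)
    selected = []
    for name in sorted(by_name):
        for dsname in guess_datasets(name):
            sites = dq2_locations.get(dsname)  # None: not registered in DQ2
            if sites is not None and site in sites:
                selected.append((name, sorted(by_name[name]), dsname))
    return selected

# Made-up inputs for illustration.
tasks = [
    (101, "csc11.005145.PythiaZmumu.recon.AOD.v11004205"),
    (102, "csc11.005145.PythiaZmumu.recon.AOD.v11004205"),
    (103, "csc11.005144.PythiaZee.recon.AOD.v11004205"),
    (104, "rome.004100.recon.T1_McAtNLO_top"),
]
dq2_locations = {
    "csc11.005145.PythiaZmumu.recon.AOD.v11004205": {"BNL", "CERN"},
    "csc11.005144.PythiaZee.recon.AOD.v11004205": {"CERN"},
}

print(select_datasets(tasks, dq2_locations))
```

Keeping the naming convention in one replaceable function reflects the concern that conventions change; an external list of datasets of interest would replace `guess_datasets` and the prefix filter entirely.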

6 Issues: Additional validation
What additional validation is desired?
Check existence of physical files at BNL
  BNL dcache sometimes loses files and the replica catalog is not updated to reflect this
  Not too difficult if check is done with ls command
Data inside file
  Right type (AOD, ESD, RDO, …)
  Event numbers consistent with file name
  Difficult because these checks require sophisticated code and reading each file
Accessibility: can files be read?
  Again difficult because it is time consuming to open and read files
Staging: report how many files in each dataset are staged
  Expensive to check each file with dc_check
  Status often changes faster than my twice daily validation checks
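
The existence check is the cheapest of these, and an ls-style version might look like the sketch below. It assumes the replica catalog yields physical paths that are visible in a mounted file system namespace; that assumption, the function name, and the demo files are all hypothetical. Each directory is listed once rather than testing every file separately.

```python
import os
import tempfile
from collections import defaultdict

def find_missing(catalog_paths):
    """Return catalog paths with no corresponding entry in a directory listing."""
    by_dir = defaultdict(set)
    for path in catalog_paths:
        directory, name = os.path.split(path)
        by_dir[directory].add(name)
    missing = []
    for directory, names in by_dir.items():
        try:
            present = set(os.listdir(directory))  # one listing per directory
        except OSError:
            present = set()                       # unreadable directory: all missing
        missing.extend(os.path.join(directory, n) for n in names - present)
    return sorted(missing)

# Demo with a temporary directory standing in for the storage namespace.
with tempfile.TemporaryDirectory() as pool:
    present = os.path.join(pool, "file_a.root")
    lost = os.path.join(pool, "file_b.root")
    open(present, "w").close()                    # only file_a physically exists
    demo_missing = find_missing([present, lost])

print([os.path.basename(p) for p in demo_missing])
```

A listing only shows that a name exists in the namespace. It says nothing about whether the file can be read or is staged, which are the more expensive checks.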

7 Issues: Remediation
Remediation
  When there are problems (and there are some), who should resolve them?
    E.g. duplicate files, files missing in dcache
  Problems such as duplicate files are a feature of the dataset definition and are not BNL-specific
    Need production expert to sort out which file should be kept
    Need authorization to change the dataset definition
  Other problems such as files disappearing from dcache and not from the replica catalog need to be resolved locally

8 Issues: Data movement
Replicating data at BNL
  Which data?
    Long term model is all AOD and ESD
    And 2/N of raw?
    Should support reasonable user requests
  Can we do this now? Are we trying?
    245/402 AOD datasets are at BNL
    Those at BNL are mostly complete
    Big improvement since the spring
Validation
  At present the BNL validation table lists datasets registered at BNL
    Replace this selection with a policy or an external list?
    Or just register BNL as an (incomplete) location for desired datasets?
  Can/should table let users know what data is coming?
    Or why data is missing?

9 Issues: Historical information
Current validation pages only provide a snapshot view
  Difficult to know if the situation is improving or deteriorating
Historical data is available
  Results from each scan are saved
  Last week on disk, web accessible
  All data since June stored in dcache
Interesting to track the number of datasets and files of each type as a function of time
  Both in DQ2 and at BNL
Volunteers?
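
One way the saved scans could be turned into such a time series is sketched below. The format of the saved scan results is not described here, so the input layout (scan time mapped to a list of per-dataset `(data type, file count)` pairs) and the numbers in the usage example are invented for illustration.

```python
from collections import Counter, defaultdict

def history(scans):
    """Map data type -> list of (scan time, dataset count, file count).

    scans: dict of scan time -> list of (data type, number of files),
    one pair per dataset seen in that scan (hypothetical layout).
    """
    series = defaultdict(list)
    for when in sorted(scans):
        ndatasets = Counter()
        nfiles = Counter()
        for dtype, files in scans[when]:
            ndatasets[dtype] += 1
            nfiles[dtype] += files
        for dtype in sorted(ndatasets):
            series[dtype].append((when, ndatasets[dtype], nfiles[dtype]))
    return dict(series)

# Invented example: two scans, one new AOD dataset appearing in the second.
scans = {
    "2006-09-26": [("AOD", 10), ("ESD", 5)],
    "2006-09-27": [("AOD", 10), ("AOD", 20), ("ESD", 5)],
}
print(history(scans)["AOD"])
```

Running the same aggregation on the DQ2 view and on the BNL view would give the two curves to compare.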

10 Status
Validation has been running at BNL since Spring
  Automated: I only need to update my proxy every week or two
  Fairly robust
    Down for a couple of weeks when I was on vacation because database passwords changed
    Major DB failures will occasionally leave it in a state that requires me to clean up
BNL table provides a nice answer to the question “What data is available at BNL?”
  Easy to select by physics process, data type and release
  Up to date without requiring DB query for each request

11 Conclusions
Validation pages provide useful summaries
  For users and production experts
  Easy to use and understand
Can be improved
  External list or list of datasets
  Additional validation
  Active reporting of problems
  Information about data movement (or lack thereof)
  Historical information
  What else?
Volunteers welcome
  To address any of the above or whatever other features you would like to see

