Datasets on the GRID David Adams PPDG All Hands Meeting Catalogs and Datasets session June 11, 2003 BNL.

Presentation transcript:

Datasets on the GRID David Adams PPDG All Hands Meeting Catalogs and Datasets session June 11, 2003 BNL

June 11, 2001 David Adams – All Hands Meeting

Introduction
  Contents
    – What is a dataset?
    – Dataset properties
    – Logical vs. physical
    – Dataset operations
    – Implementation
  More information
    – Dataset page (little out of date) at
    – Document with content similar to this talk

What is a dataset?
  A dataset is a collection of data that belong together
  Important example for HEP is the event dataset
    – Data organized as a collection of events
    – Other realms may have records equivalent to events
  Each event/record to be processed in the same way
  Use cases:
    – Select a dataset
        Of all the event datasets in my experiment, which meet my criteria?
    – Process a dataset
        For each event in my event dataset, run my application and apply some task
        Natural to do distributed production by splitting the dataset along event boundaries

Dataset properties
  Datasets have the following properties
    – 0. Identity
    – 1. Content
    – 2. Location (of the data)
    – 3. Mapping (content to location)
    – 4. Provenance (of the dataset)
    – 5. History (of the production)
    – 6. Metadata (describing dataset)
  Details follow
  All the above are “metadata”
  Dataset also has “data”

Dataset properties (cont)
  Identity
    – Index or name to distinguish datasets
  Content
    – Tells what kind of data is carried by the dataset
    – E.g. for HEP event data
        List of event identifiers
        Content for each event (raw, clusters, tracks, jets, …)
  Location
    – Most interesting example is a collection of logical files
    – Could also be physical files DB data store such as POOL

Dataset properties (cont)
  Mapping
    – For each piece of content, where (in the location) is the associated data
    – For collection of logical files, which file and where in the file
  Provenance
    – Parent datasets
    – Transformation
        Applied to parent dataset to obtain this dataset
    – Sufficient to construct dataset (virtual data)
  History
    – Production history beyond provenance
    – Division into jobs, which compute nodes, time stamps, …

Dataset properties (cont)
  Metadata
    – Additional data characterizing the dataset
    – For example
        Motivation for dataset
          – E.g. starting point for Higgs searches
        Results of dataset evaluation
          – E.g. approved for basis of publications
        Other global properties
          – E.g. integrated luminosity for an event dataset
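The seven properties listed above can be pictured as fields of a single record. The following is only a minimal sketch in Python: the class name `Dataset`, the field names, the `lfn:` file names and the sample values are all hypothetical and are not taken from the talk or from any existing library.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Dataset:
    """Illustrative record for the seven dataset properties (hypothetical layout)."""
    identity: str                        # 0. name or index distinguishing datasets
    event_ids: List[int]                 # 1. content: which events are present
    event_content: List[str]             # 1. content: kinds of data held for each event
    location: List[str]                  # 2. logical files holding the data
    mapping: Dict[int, Tuple[str, int]]  # 3. event id -> (logical file, entry in that file)
    parents: List[str] = field(default_factory=list)        # 4. provenance: parent datasets
    transformation: str = ""                                # 4. provenance: transformation applied
    history: List[str] = field(default_factory=list)        # 5. production history
    metadata: Dict[str, Any] = field(default_factory=dict)  # 6. descriptive metadata

    def locate(self, event_id: int) -> Tuple[str, int]:
        """Use the mapping to find where the data for one event lives."""
        return self.mapping[event_id]


# Made-up example values, for illustration only.
reco = Dataset(
    identity="reco-001",
    event_ids=[1, 2, 3],
    event_content=["tracks", "jets"],
    location=["lfn:reco.a", "lfn:reco.b"],
    mapping={1: ("lfn:reco.a", 0), 2: ("lfn:reco.a", 1), 3: ("lfn:reco.b", 0)},
    parents=["raw-001"],
    transformation="reconstruction",
    metadata={"motivation": "starting point for Higgs searches"},
)
print(reco.locate(3))
```

Everything except the event data itself is held in this record, which matches the statement that all of the listed properties are "metadata" while the dataset also has "data".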

Logical vs. physical
  Distributed file management expressed in terms of logical files and physical replicas
  Useful to introduce the same decomposition for datasets
    – Data in a dataset may be replicated to produce another physical representation (replica) of a dataset
    – Common example is after event selection
        Selected event IDs plus original dataset or
        Replicate selected objects to define the new (physical) dataset
        These are two replicas of the same logical dataset
    – For a dataset made up of logical files
        File replication does not imply a new physical dataset
        Moving data to a new collection of logical files is a new physical dataset
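The event-selection example can be sketched as two physical representations that expose the same logical content. This is an assumption-laden illustration: the class names `CopiedReplica` and `ReferenceReplica`, the `read` interface and the `pt` values are invented here and do not come from the talk.

```python
from typing import Dict, Iterable, List

Event = Dict[str, float]


class CopiedReplica:
    """Physical dataset that stores its own copy of every event (hypothetical)."""

    def __init__(self, events: Dict[int, Event]):
        self._events = dict(events)

    def event_ids(self) -> List[int]:
        return sorted(self._events)

    def read(self, event_id: int) -> Event:
        return self._events[event_id]


class ReferenceReplica:
    """Physical dataset stored as selected event IDs plus the original dataset."""

    def __init__(self, original: CopiedReplica, selected_ids: Iterable[int]):
        self._original = original
        self._ids = sorted(selected_ids)

    def event_ids(self) -> List[int]:
        return list(self._ids)

    def read(self, event_id: int) -> Event:
        if event_id not in self._ids:
            raise KeyError(event_id)
        return self._original.read(event_id)


def same_logical_dataset(a, b) -> bool:
    """Two replicas are logically equal if they hold the same events with the same data."""
    return a.event_ids() == b.event_ids() and all(
        a.read(i) == b.read(i) for i in a.event_ids()
    )


original = CopiedReplica({1: {"pt": 10.0}, 2: {"pt": 45.0}, 3: {"pt": 60.0}})
selected = [i for i in original.event_ids() if original.read(i)["pt"] > 40.0]

by_reference = ReferenceReplica(original, selected)                  # IDs plus original
by_copy = CopiedReplica({i: original.read(i) for i in selected})     # replicated objects

print(same_logical_dataset(by_reference, by_copy))
```

The two objects differ in location and mapping but agree on content, which is the sense in which they are replicas of one logical dataset.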

Logical vs. physical (cont)
  Dataset properties are associated with either the physical or logical view of a dataset

  Property     Logical   Physical
  Identity     X         X
  Content      X
  Location               X
  Mapping                X
  Provenance   X
  History                X
  Metadata     X

Logical vs. physical (cont)
  Our use cases break down along logical/physical properties
  Dataset selection uses content, provenance and metadata, the logical properties
  Job processing requires content, location and mapping, i.e. must make use of the physical properties

Dataset operations
  Next consider some of the operations that datasets should support
  These include
    – Content selection
    – Splitting
    – Merging
  Content selection
    – User can create a new dataset by selecting content from an existing dataset
    – For an event dataset, there are two categories
        Event selection
          – Apply an algorithm which flags which events to accept
        Event content selection
          – e.g. keep tracks, drop jets
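The two categories of content selection can be illustrated with a toy event dataset. In this sketch an event dataset is assumed to be a plain dictionary from event ID to per-event content; the function names `select_events` and `select_content` and the sample numbers are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Event = Dict[str, List[float]]   # content kind -> data for that kind
EventDataset = Dict[int, Event]  # event id -> event


def select_events(ds: EventDataset, accept: Callable[[Event], bool]) -> EventDataset:
    """Event selection: keep only the events flagged by the accept algorithm."""
    return {eid: ev for eid, ev in ds.items() if accept(ev)}


def select_content(ds: EventDataset, keep: Iterable[str]) -> EventDataset:
    """Event content selection: keep the named content kinds, drop the rest."""
    keep_set = set(keep)
    return {
        eid: {kind: data for kind, data in ev.items() if kind in keep_set}
        for eid, ev in ds.items()
    }


source: EventDataset = {
    1: {"tracks": [12.0, 3.5], "jets": [40.0]},
    2: {"tracks": [1.2], "jets": []},
    3: {"tracks": [25.0, 30.0, 8.0], "jets": [55.0, 21.0]},
}

with_jets = select_events(source, lambda ev: len(ev["jets"]) > 0)  # event selection
tracks_only = select_content(with_jets, ["tracks"])                # keep tracks, drop jets
print(sorted(tracks_only))
```

Both functions return a new dataset and leave the source untouched, matching the idea that selection creates a new dataset from an existing one.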

Dataset operations (cont)
  Splitting
    – Distributed processing requires that the input dataset be split into sub-datasets for processing
    – Natural to split along file boundaries
    – For event datasets, the split is along event boundaries
  Merging
    – Combine multiple datasets to form a new dataset
    – For event datasets, there are again two dimensions
        Datasets with the same event content and data for different events may be combined
          – E.g. data from two different run periods
        Datasets with the same events and different content may be combined
          – E.g. raw data and the corresponding reconstructed data
          – Or reconstructed data and the conditions data used in its construction
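Splitting along event boundaries and the two dimensions of merging can be sketched as follows. As before, an event dataset is assumed to be a dictionary from event ID to per-event content; the function names and the consistency checks are illustrative choices, not something specified in the talk.

```python
from typing import Dict, List, Set

Event = Dict[str, str]           # content kind -> data (strings stand in for real data)
EventDataset = Dict[int, Event]  # event id -> event


def content_kinds(ds: EventDataset) -> Set[str]:
    """All content kinds that appear in any event of the dataset."""
    return {kind for ev in ds.values() for kind in ev}


def split_by_events(ds: EventDataset, n_parts: int) -> List[EventDataset]:
    """Split into at most n_parts sub-datasets along event boundaries."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    ids = sorted(ds)
    if not ids:
        return []
    size = -(-len(ids) // n_parts)  # ceiling division
    return [{i: ds[i] for i in ids[k:k + size]} for k in range(0, len(ids), size)]


def merge_events(a: EventDataset, b: EventDataset) -> EventDataset:
    """Merge datasets with the same event content but different events."""
    if set(a) & set(b):
        raise ValueError("event ids overlap")
    if a and b and content_kinds(a) != content_kinds(b):
        raise ValueError("event content differs")
    return {**a, **b}


def merge_content(a: EventDataset, b: EventDataset) -> EventDataset:
    """Merge datasets with the same events but different content."""
    if set(a) != set(b):
        raise ValueError("event ids differ")
    if content_kinds(a) & content_kinds(b):
        raise ValueError("content kinds overlap")
    return {eid: {**a[eid], **b[eid]} for eid in a}


run1 = {1: {"raw": "r1"}, 2: {"raw": "r2"}}
run2 = {3: {"raw": "r3"}}
raw = merge_events(run1, run2)                      # two run periods
reco = {1: {"tracks": "t1"}, 2: {"tracks": "t2"}, 3: {"tracks": "t3"}}
full = merge_content(raw, reco)                     # raw plus reconstructed data
parts = split_by_events(full, 2)                    # sub-datasets for distributed processing
print([sorted(p) for p in parts])
```

Merging the sub-datasets back with `merge_events` recovers the full dataset, which is what makes event-boundary splitting convenient for distributed production.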

Dataset operations (cont)
  Event dataset operations

Implementation
  The dataset properties span many realms
  Different properties will be stored in different ways
    – Properties used for selection naturally reside in relational DB tables
        Content, provenance, metadata
    – End applications on worker nodes might expect properties to reside in files
        Content, location, mapping
    – Some property data will likely be replicated in both RDB and files
  Identification of the different properties is a first step in implementing datasets
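Keeping the selection properties in relational tables might look roughly like the sketch below, which uses Python's standard `sqlite3` module with an in-memory database. The table names, column names and rows are invented for illustration; the talk does not specify a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one table for content and provenance, one for metadata.
conn.executescript("""
CREATE TABLE dataset (
    id TEXT PRIMARY KEY,
    content TEXT,
    parent TEXT,
    transformation TEXT
);
CREATE TABLE metadata (
    dataset_id TEXT REFERENCES dataset(id),
    name TEXT,
    val TEXT
);
""")

conn.executemany(
    "INSERT INTO dataset VALUES (?, ?, ?, ?)",
    [
        ("raw-001", "raw", None, None),
        ("reco-001", "tracks,jets", "raw-001", "reconstruction"),
        ("reco-002", "tracks", "raw-001", "reconstruction"),
    ],
)
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?)",
    [
        ("reco-001", "status", "approved"),
        ("reco-002", "status", "testing"),
    ],
)


def select_datasets(db, parent, name, val):
    """Dataset selection: which datasets derived from `parent` meet a metadata criterion?"""
    rows = db.execute(
        "SELECT d.id FROM dataset d "
        "JOIN metadata m ON m.dataset_id = d.id "
        "WHERE d.parent = ? AND m.name = ? AND m.val = ? "
        "ORDER BY d.id",
        (parent, name, val),
    )
    return [row[0] for row in rows]


approved = select_datasets(conn, "raw-001", "status", "approved")
print(approved)
```

Location and mapping are deliberately absent from these tables, since the slide places them with the file-resident properties used by applications on worker nodes.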

Implementation (cont)
  Interface for schedulers
    – Central problem is how to split dataset
    – Account for
        Data location
        Compute cycle locations
        Matching these
    – Interactive analysis (fast response) is a special challenge
    – How can different dataset providers share a scheduler?
    – Options:
        Common dataset interface which provides the required information
          – Logical file mapping or
          – Composite structure (sub-datasets)
        Each provider also provides a service to do the splitting
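One way to picture the scheduler's matching problem is a greedy assignment of file-level sub-datasets to sites that hold a replica and have free compute slots. This is a simple heuristic chosen for illustration, not the method proposed in the talk; the function name `split_for_sites`, the site names and the slot counts are all hypothetical.

```python
from typing import Dict, List, Tuple

SubDataset = Tuple[str, List[int]]  # (logical file name, event ids in that file)


def split_for_sites(
    file_events: Dict[str, List[int]],  # logical file mapping: file -> event ids
    replicas: Dict[str, List[str]],     # data location: file -> sites holding a replica
    free_slots: Dict[str, int],         # compute cycle locations: site -> free job slots
) -> Dict[str, List[SubDataset]]:
    """Greedily assign one sub-dataset per logical file to a site with data and free slots."""
    remaining = dict(free_slots)
    plan: Dict[str, List[SubDataset]] = {}
    # Place the files with the fewest replica choices first.
    for lfn in sorted(file_events, key=lambda f: (len(replicas.get(f, [])), f)):
        candidates = [s for s in replicas.get(lfn, []) if remaining.get(s, 0) > 0]
        if not candidates:
            raise RuntimeError(f"no site with free slots holds a replica of {lfn}")
        # Prefer the site with the most free slots; break ties by site name.
        site = min(candidates, key=lambda s: (-remaining[s], s))
        remaining[site] -= 1
        plan.setdefault(site, []).append((lfn, file_events[lfn]))
    return plan


file_events = {"lfn:a": [1, 2], "lfn:b": [3, 4], "lfn:c": [5]}
replicas = {"lfn:a": ["site-A"], "lfn:b": ["site-A", "site-B"], "lfn:c": ["site-B"]}
free_slots = {"site-A": 2, "site-B": 2}

plan = split_for_sites(file_events, replicas, free_slots)
for site in sorted(plan):
    print(site, plan[site])
```

The inputs correspond to the information a common dataset interface would need to expose (the logical file mapping), combined with replica and slot information that the scheduler would obtain elsewhere.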