Published by Reynold Hunter. Modified over 9 years ago.
1 Enabling provenance on large-scale e-Science applications
Miguel Branco, CERN/University of Southampton
2 Overview
• The ATLAS Experiment
  – Motivation for Provenance
• Model
  – Documenting
  – Storing
  – Reasoning
  – Querying
• Scalability
• Conclusion
3 The ATLAS Experiment
• ATLAS is a High Energy Physics experiment
  – Detector being built for the Large Hadron Collider at CERN
    • One of the largest scientific collaborations ever
  – Used as a “scenario” for our provenance work
• “Typical” large-scale e-Science experiment
  – Very large “application base”
  – Heterogeneous development “environment”
    • SOA, pure-OO, scripts(!)
    • Badly defined and multiple coexisting workflows and data schemas
  – Very large user community
    • Producing, analyzing large amounts of data
4 Motivation for Provenance
• ATLAS detector delivering hundreds of MB of raw data per second
  – 720 MB/s just out of CERN
  – Petabytes of data per year
    • Will run ~20 years
• Data is distributed world-wide using the Grid
  – Analyzed by physicists at their institutes and on shared computing clusters world-wide
• High Energy Physics is about understanding the data under analysis and finding the abnormalities
  – Which may break or prove theories
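As a rough sanity check on the figures above, the sketch below converts a sustained transfer rate into a yearly volume. The `duty_cycle` parameter is a hypothetical knob, not something stated on the slide: with `duty_cycle=1.0` the result is only an upper bound that assumes 720 MB/s flows continuously for a full year, which a real detector would not do.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # non-leap year


def petabytes_per_year(rate_mb_per_s, duty_cycle=1.0):
    """Convert a rate in MB/s (decimal megabytes) into PB/year.

    duty_cycle is an illustrative assumption: the fraction of the
    year during which data actually flows at the given rate.
    """
    bytes_per_year = rate_mb_per_s * 1e6 * SECONDS_PER_YEAR * duty_cycle
    return bytes_per_year / 1e15


# Upper bound for the 720 MB/s quoted on the slide: about 22.7 PB/year.
upper_bound = petabytes_per_year(720)
print(f"{upper_bound:.1f} PB/year at 100% duty cycle")
```

Even at a modest duty cycle the result stays in the petabyte range, which is consistent with the "petabytes per year" claim.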
5 Model
• Documenting, Storing, Reasoning, Querying
  – 4-phase model for applying provenance to an e-Science application
6 Documenting
• Identify the (coarse) components that are part of the workflow
  – Wrap these as services, SOA-style
• Apply PReP (Provenance Recording Protocol):
  – Defines a representation of process documentation suitable for service-oriented architectures, introducing a generic protocol for recording provenance
    • Data model, schema, principles
7 [Diagram: a Workflow / Application made up of Service A, Service B and Service C]
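The documenting phase can be sketched as follows: each coarse component is wrapped as a "service", and every invocation emits a documentation record. This is only an illustration under stated assumptions. The record fields, the `documented_service` decorator and the in-memory `RECORDS` list are all hypothetical and do not reproduce the actual PReP data model or schema.

```python
import functools
import uuid
from dataclasses import dataclass, field


@dataclass
class DocumentationRecord:
    """Hypothetical record of one service invocation (not the PReP schema)."""
    interaction_id: str
    service: str
    inputs: dict
    output: object
    actor_state: dict = field(default_factory=dict)


RECORDS = []  # in-memory stand-in for wherever records are later shipped


def documented_service(name, **actor_state):
    """Wrap a plain function so that each call is documented."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(**inputs):
            output = func(**inputs)
            RECORDS.append(DocumentationRecord(
                interaction_id=str(uuid.uuid4()),
                service=name,
                inputs=dict(inputs),
                output=output,
                actor_state=dict(actor_state),
            ))
            return output
        return wrapper
    return decorator


# Two toy components standing in for Service A and Service B.
@documented_service("ServiceA", software_version="1.0.0")
def reconstruct(raw_file):
    return raw_file.replace(".raw", ".reco")


@documented_service("ServiceB", software_version="2.3.1")
def analyse(reco_file):
    return reco_file.replace(".reco", ".hist")


result = analyse(reco_file=reconstruct(raw_file="run001.raw"))
```

The wrapped components themselves stay unchanged, which is the point of wrapping: applications that were not originally service-oriented can still produce documentation.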
8 Storing
• Storing documentation of execution
  – Using Grid principles
  – Documentation records may become a significant portion of the total data volume for ATLAS
• Serializing documentation records onto data files on “disk”
  – Shipped to suitable storage
    • Multiple querying sources, but a well-defined provenance schema from PReP
• “On the Grid”:
  – Globus RLS (cataloguing), GridFTP (transfer), …
9 [Diagram: Service A, Service B and Service C; two Grid “Storage Elements”; a “Replica Location Service”]
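A minimal sketch of the storing phase, assuming JSON files in local directories as stand-ins for Grid storage elements and a small dictionary-backed class in place of a replica location service. It does not use the Globus RLS or GridFTP APIs; `ReplicaCatalog`, `store_records` and `load_records` are hypothetical names introduced for illustration.

```python
import json
import tempfile
from pathlib import Path


class ReplicaCatalog:
    """Toy replica catalogue: logical file name -> physical locations."""

    def __init__(self):
        self._replicas = {}

    def register(self, logical_name, physical_path):
        self._replicas.setdefault(logical_name, []).append(str(physical_path))

    def lookup(self, logical_name):
        return list(self._replicas.get(logical_name, []))


def store_records(records, logical_name, storage_elements, catalog):
    """Serialize records to one file per storage element and catalogue them."""
    payload = json.dumps(records, sort_keys=True)
    for element in storage_elements:
        target = Path(element) / f"{logical_name}.json"
        target.write_text(payload, encoding="utf-8")
        catalog.register(logical_name, target)


def load_records(logical_name, catalog):
    """Read the records back from the first replica that is reachable."""
    for replica in catalog.lookup(logical_name):
        path = Path(replica)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
    raise FileNotFoundError(f"no reachable replica for {logical_name}")


records = [
    {"service": "ServiceA", "output": "run001.reco", "software_version": "1.0.0"},
    {"service": "ServiceB", "output": "run001.hist", "software_version": "2.3.1"},
]
catalog = ReplicaCatalog()

# Two temporary directories play the role of two storage elements.
with tempfile.TemporaryDirectory() as site_a, tempfile.TemporaryDirectory() as site_b:
    store_records(records, "run001.provenance", [site_a, site_b], catalog)
    replica_count = len(catalog.lookup("run001.provenance"))
    loaded = load_records("run001.provenance", catalog)
```

Because every replica follows the same schema, a consumer can read from whichever location the catalogue returns first.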
10 Reasoning
• Documentation records are now stored on the Grid
• Two problems with documentation records:
  – These can be seen purely as “raw data” without added value per se
    • “Where is the provenance I want?”
  – Not very efficient to query if large amounts of data exist
    • “Millions” of documentation records?
• One important property:
  – Queries are often known in advance
    • e.g. software version for a particular algorithm?
11 Reasoning
• Define static reasoners:
  – Optimize access to provenance “raw data”, deriving data provenance properties
    • software version is a data provenance property
  – Many reasoners
    • different access patterns to raw data (“crawling” techniques)
    • access permissions (visibility of output)
    • Virtual Organization reasoners, private user reasoners
12 [Diagram: two Grid “Storage Elements”, a “Replica Location Service” and a Reasoner]
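A static reasoner, in the sense used on these slides, could look like the sketch below: it crawls documentation records once and derives a data provenance property that is known to be queried often, here the software version that produced each output. The record layout (plain dictionaries with `service`, `output` and `actor_state` keys) is a hypothetical simplification, not the PReP schema.

```python
def software_version_reasoner(records):
    """Derive, for each output dataset, which service and version produced it.

    Records without a recorded software version are skipped rather than
    guessed at.
    """
    derived = {}
    for record in records:
        version = record.get("actor_state", {}).get("software_version")
        if version is None:
            continue
        derived[record["output"]] = {
            "produced_by": record["service"],
            "software_version": version,
        }
    return derived


# Illustrative records, as they might come back from storage.
records = [
    {"service": "ServiceA", "output": "run001.reco",
     "actor_state": {"software_version": "1.0.0"}},
    {"service": "ServiceB", "output": "run001.hist",
     "actor_state": {"software_version": "2.3.1"}},
    {"service": "ServiceC", "output": "run001.log", "actor_state": {}},
]

derived = software_version_reasoner(records)
```

Other reasoners would follow the same shape with a different crawl pattern or a different visibility for their output, for example one per Virtual Organization or per user.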
13 Querying
• Our goal in building a provenance infrastructure is to provide, as the final outcome, metadata describing how the data was generated
  – Reasoners operate on documentation records to produce metadata kept in a high-performance metadata catalog
  – User queries are directed to the metadata catalog
• Pre-defined queries are a common use case
  – Helping to solve the problem of building metadata for data
    • Asynchronously
14 [Diagram: Reasoner and Metadata Catalog]
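The querying phase can be illustrated with an in-memory SQLite database standing in for the metadata catalog. Reasoner output is loaded into the catalog ahead of time, and user queries only touch the catalog, never the original documentation records. The table layout and the `build_catalog` and `query_metadata` helpers are assumptions made for this sketch.

```python
import sqlite3


def build_catalog(derived):
    """Load reasoner output into a small key/value metadata table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadata ("
        " dataset TEXT, key TEXT, value TEXT,"
        " PRIMARY KEY (dataset, key))"
    )
    rows = [
        (dataset, key, value)
        for dataset, properties in derived.items()
        for key, value in properties.items()
    ]
    conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def query_metadata(conn, dataset, key):
    """Answer a pre-defined query; returns None if nothing was derived."""
    row = conn.execute(
        "SELECT value FROM metadata WHERE dataset = ? AND key = ?",
        (dataset, key),
    ).fetchone()
    return row[0] if row else None


# Illustrative reasoner output: dataset -> derived provenance properties.
derived = {
    "run001.reco": {"produced_by": "ServiceA", "software_version": "1.0.0"},
    "run001.hist": {"produced_by": "ServiceB", "software_version": "2.3.1"},
}

catalog = build_catalog(derived)
version = query_metadata(catalog, "run001.hist", "software_version")
```

Since the catalog is filled asynchronously by reasoners, a lookup returning `None` may simply mean that no reasoner has processed the relevant records yet.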
15 Scalability
• Scalability is a fundamental motivation for the design
  – Modeling the application as services
    • Even if not originally SOA-based
    • Flexible applicability: granularity of information present in the original documentation records
  – Split documentation from storage
  – Split storage from querying
    • Reasoners generating metadata
  – Avoid the need for many queries to go against the original documentation stores
16 Conclusion
• 4-phase model for applying provenance to e-Science applications
  – Emphasis on integrating with the “Grid”
    • Scalability
  – Ease integration with existing legacy applications
• Prototype being put in place within the ATLAS Experiment
  – Starting from a data management perspective
  – Provenance as a first-class concept for e-Science