Enabling provenance on large-scale e-Science applications
Miguel Branco, CERN / University of Southampton

2 Overview
The ATLAS Experiment
– Motivation for Provenance
Model
– Documenting
– Storing
– Reasoning
– Querying
Scalability
Conclusion

3 The ATLAS Experiment
ATLAS is a High Energy Physics experiment
– Detector being built for the Large Hadron Collider at CERN
One of the largest scientific collaborations ever
– Used as a “scenario” for our provenance work
A “typical” large-scale e-Science experiment
– Very large “application base”
– Heterogeneous development “environment”: SOA, pure OO, scripts(!)
– Badly defined, multiple coexisting workflows and data schemas
– Very large user community, producing and analyzing large amounts of data

4 Motivation for Provenance
The ATLAS detector delivers hundreds of megabytes of raw data per second
– 720 MB/s out of CERN alone (a rough consistency check follows below)
– Petabytes of data per year
– Will run for ~20 years
Data is distributed world-wide using the Grid
– Analyzed by physicists at their institutes and on shared computing clusters world-wide
High Energy Physics is about understanding the data under analysis and finding the anomalies
– which may break or prove theories
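As a back-of-the-envelope check (our illustration, not from the slides), the two rates quoted above are consistent: sustaining the CERN export rate over a full year gives

```latex
720\ \mathrm{MB/s} \times 3.15 \times 10^{7}\ \mathrm{s/yr}
  \approx 2.3 \times 10^{16}\ \mathrm{B/yr} \approx 23\ \mathrm{PB/yr}
```

assuming continuous operation; with the accelerator's real duty cycle the volume is lower, but still of the order of petabytes per year.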

5 Model
Documenting
Storing
Reasoning
Querying
– A 4-phase model for applying provenance to an e-Science application

6 Documenting
Identify the (coarse-grained) components that form the workflow
– Wrap these as services, SOA-style (a toy record sketch follows below)
Apply PReP (Provenance Recording Protocol):
– Defines a representation of process documentation suitable for service-oriented architectures, introducing a generic protocol for recording provenance
– Data model, schema, principles
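A minimal sketch of what a wrapped service might submit when documenting one interaction. The record layout is our own illustrative assumption, loosely inspired by PReP-style process documentation; the field names (record_id, asserter, outputs, ...) are not the actual PReP data model.

```python
import json
import time
import uuid

def make_interaction_record(service, operation, software_version, inputs, outputs):
    """Build one illustrative documentation record for a service interaction.

    The field names are assumptions loosely inspired by PReP-style process
    documentation; the real PReP data model and schema are richer.
    """
    return {
        "record_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "asserter": service,              # the wrapped service making the assertion
        "operation": operation,           # e.g. "reconstruct"
        "software_version": software_version,
        "inputs": inputs,                 # logical file names consumed
        "outputs": outputs,               # logical file names produced
    }

# Example: a wrapped reconstruction service documents one job it ran.
record = make_interaction_record("atlas-reco-service", "reconstruct",
                                 "11.0.42", ["lfn:raw.0001.dat"], ["lfn:esd.0001.dat"])
print(json.dumps(record, indent=2))
```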

7 [Diagram: a workflow/application decomposed into Service A, Service B and Service C]

8 Storing
Storing the documentation of execution
– Using Grid principles
– Documentation records may become a significant portion of the total data volume for ATLAS
Serializing documentation records into data files on “disk”
– Shipped to suitable storage (see the sketch below)
Multiple query sources, but a well-defined provenance schema from PReP
“On the Grid”:
– Globus RLS (cataloguing), GridFTP (transfer), …
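A sketch of the storing phase under stated assumptions: records are batched into a newline-delimited JSON file, then registered in a toy in-memory catalog. The register_replica helper is hypothetical and merely stands in for a Globus RLS registration after a GridFTP transfer; the real client APIs are not shown on the slides.

```python
import json

def serialize_records(records, path):
    # Batch documentation records into one newline-delimited JSON file,
    # so they can be shipped to a Grid Storage Element like any other dataset.
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def register_replica(catalog, logical_name, physical_url):
    # Stand-in for a Globus RLS registration: map a logical file name to the
    # physical location the serialized file was shipped to (e.g. via GridFTP).
    catalog.setdefault(logical_name, []).append(physical_url)

records = [{"record_id": "r-0001", "operation": "reconstruct",
            "software_version": "11.0.42",
            "inputs": ["lfn:raw.0001.dat"], "outputs": ["lfn:esd.0001.dat"]}]
catalog = {}  # toy in-memory replica catalog
serialize_records(records, "prov.0001.json")
register_replica(catalog, "lfn:prov.0001.json",
                 "gsiftp://se.example.org/atlas/prov/prov.0001.json")
```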

9 [Diagram: Services A, B and C ship documentation records to Grid “Storage Elements”, indexed by a “Replica Location Service”]

10 Reasoning
Documentation records are now stored on the Grid
Two problems with documentation records:
– They can be seen purely as “raw data” without added value per se (“Where is the provenance I want?”)
– Not very efficient to query once large amounts of data exist (“millions” of documentation records)
One important property:
– Queries are often known in advance (e.g. the software version used by a particular algorithm)

11 Reasoning
Define static reasoners (a toy example follows below):
– Optimize access to provenance “raw data”, deriving data provenance properties (software version is a data provenance property)
– Many reasoners: different access patterns to the raw data (“crawling” techniques), different access permissions (visibility of output) – Virtual Organization reasoners, private user reasoners
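A toy static reasoner, assuming the illustrative record layout from the sketches above (not the real ATLAS schema): it crawls serialized documentation records and derives one data provenance property, the software version per output file.

```python
import json

def software_version_reasoner(record_paths):
    """Toy static reasoner: crawl serialized documentation records and derive
    one data provenance property (the software version per output file).

    Assumes the illustrative record layout sketched earlier; the real record
    schema, crawling strategy and permission handling differ.
    """
    derived = {}
    for path in record_paths:
        with open(path) as f:
            for line in f:
                rec = json.loads(line)
                for lfn in rec.get("outputs", []):
                    derived[lfn] = {"software_version": rec.get("software_version")}
    return derived

# Example input: one serialized record file, as produced in the storing phase.
with open("prov.0001.json", "w") as f:
    f.write(json.dumps({"outputs": ["lfn:esd.0001.dat"],
                        "software_version": "11.0.42"}) + "\n")
print(software_version_reasoner(["prov.0001.json"]))
```

Run asynchronously, many such reasoners, each with its own crawling strategy and visibility rules, would populate the metadata catalog described on the querying slide.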

12 [Diagram: a Reasoner crawls documentation records on the Grid “Storage Elements”, located via the “Replica Location Service”]

13 Querying
Our goal in building a provenance infrastructure is to provide, as the final outcome, metadata about how the data was generated
– Reasoners operate on documentation records to produce metadata, kept in a performant metadata catalog
– User queries are directed to the metadata catalog
Pre-defined queries are a common use case (sketched below)
– Helping to solve the problem of building metadata for data, asynchronously
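A minimal sketch of the querying phase, with a toy in-memory metadata catalog whose contents are invented for illustration: user queries match derived provenance properties directly and never touch the raw documentation records.

```python
# Toy metadata catalog, populated asynchronously by reasoners; user queries
# go here instead of against the raw documentation records.
metadata_catalog = {
    "lfn:esd.0001.dat": {"software_version": "11.0.42"},
    "lfn:esd.0002.dat": {"software_version": "11.0.41"},
}

def query(catalog, **conditions):
    # Pre-defined query pattern: select the files whose derived provenance
    # properties match every given condition.
    return [lfn for lfn, props in catalog.items()
            if all(props.get(k) == v for k, v in conditions.items())]

print(query(metadata_catalog, software_version="11.0.42"))
# -> ['lfn:esd.0001.dat']
```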

14 [Diagram: the Reasoner feeds derived metadata into the Metadata Catalog, against which user queries run]

15 Scalability
Scalability is a fundamental motivation for the design
– Modeling the application as services, even if it was not originally SOA-based
– Flexible applicability: the granularity of information present in the original documentation records
– Splitting documenting from storing, and storing from querying
– Reasoners generating metadata, which avoids the need for many queries to go against the original documentation stores

16 Conclusion
A 4-phase model for applying provenance to e-Science applications
– Emphasis on integrating with the “Grid”
– Scalability
– Ease of integration with existing legacy applications
A prototype is being put in place within the ATLAS Experiment
– Starting from a data management perspective
– Provenance as a first-class concept for e-Science