GriPhyN and Data Provenance The Grid Physics Network Virtual Data System DOE Data Management Workshop SLAC, 17 March 2004 Mike Wilde Argonne National Laboratory.

Presentation transcript:

GriPhyN and Data Provenance The Grid Physics Network Virtual Data System DOE Data Management Workshop SLAC, 17 March 2004 Mike Wilde Argonne National Laboratory Mathematics and Computer Science Division

DOE Data Management 17 Mar
GriPhyN: Grid Physics Network Mission
- Enhance scientific productivity through discovery and processing of datasets, using the grid as a scientific workstation.
- Virtual Data enables this approach by creating datasets from workflow “recipes” and recording their provenance.
- GriPhyN works to “cross the chasm”: application and computer scientists create and field-test paradigms and toolkits together.

Virtual Data Scenario
[Figure: workflow graph linking simulate, reformat, conv, summarize, and psearch through file1 to file8]
- On-demand data generation
- Update workflow following changes
- Manage workflow
- Explain provenance, e.g. for file8

Workflow commands shown on the slide:
  psearch -t 10 -i file3 file4 file5 -o file8
  summarize -t 10 -i file6 -o file7
  reformat -f fz -i file2 -o file3 file4 file5
  conv -l esd -o aod -i file2 -o file6
  simulate -t 10 -o file1 file2
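The "explain provenance" request on this slide can be sketched as a backwards walk over the recorded commands. The following is a minimal Python illustration, not the Chimera implementation: the tuple layout and the function name `explain_provenance` are assumptions made for this example, while the program and file names come from the commands on the slide.

```python
# Each record is (program, input files, output files), taken from the slide's commands.
DERIVATIONS = [
    ("simulate", [], ["file1", "file2"]),
    ("reformat", ["file2"], ["file3", "file4", "file5"]),
    ("conv", ["file2"], ["file6"]),
    ("summarize", ["file6"], ["file7"]),
    ("psearch", ["file3", "file4", "file5"], ["file8"]),
]


def explain_provenance(target, derivations=DERIVATIONS):
    """Return the programs needed to produce `target`, in execution order."""
    # Map every output file to the record that produces it.
    producer = {out: rec for rec in derivations for out in rec[2]}
    steps, seen = [], set()

    def visit(filename):
        rec = producer.get(filename)
        if rec is None or rec[0] in seen:
            return  # external input, or this step is already scheduled
        seen.add(rec[0])
        for parent in rec[1]:
            visit(parent)  # ancestors first
        steps.append(rec[0])

    visit(target)
    return steps


print(explain_provenance("file8"))
```

Under these assumptions, the walk for `file8` visits only `psearch`, `reformat`, and `simulate`; `conv` and `summarize` belong to the separate branch that produces `file7`.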

Grid3 – The Laboratory
Supported by the National Science Foundation and the Department of Energy.

VDL: Virtual Data Language Describes Data Transformations
- Transformation
  - Abstract template of program invocation
  - Similar to "function definition"
- Derivation
  - “Function call” to a Transformation
  - Stores past and future:
    - A record of how data products were generated
    - A recipe of how data products can be generated
- Invocation
  - Record of a Derivation execution
- These XML documents reside in a “virtual data catalog” (VDC), a relational database
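The three record types can be pictured as plain data classes. This is a sketch only: the field names, the validation rule, and the sample values (`tr1`, `x1`, `file1`, `file2`, the exit code and timing) are illustrative assumptions, not the VDC schema.

```python
from dataclasses import dataclass


@dataclass
class Transformation:
    """Abstract template of a program invocation (like a function definition)."""
    name: str
    inputs: list   # formal input argument names
    outputs: list  # formal output argument names


@dataclass
class Derivation:
    """A 'function call': binds a transformation's formal arguments to values."""
    name: str
    transformation: Transformation
    bindings: dict

    def __post_init__(self):
        formals = set(self.transformation.inputs) | set(self.transformation.outputs)
        if set(self.bindings) != formals:
            raise ValueError(f"{self.name}: bindings must cover exactly {sorted(formals)}")

    def input_files(self):
        return [self.bindings[arg] for arg in self.transformation.inputs]

    def output_files(self):
        return [self.bindings[arg] for arg in self.transformation.outputs]


@dataclass
class Invocation:
    """Record of one execution of a derivation (hypothetical fields)."""
    derivation: Derivation
    exit_code: int
    wall_seconds: float


tr1 = Transformation("tr1", inputs=["a1"], outputs=["a2"])
x1 = Derivation("x1", tr1, {"a1": "file1", "a2": "file2"})
run = Invocation(x1, exit_code=0, wall_seconds=1.5)
```

A derivation that has not run yet is a recipe; once an `Invocation` refers to it, the same record also documents how its outputs were produced.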

VDL Describes Workflow via Data Dependencies
TR tr1(in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
TR tr2(in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
DV
DV
[Figure: derivations x1 and x2 linking file1, file2, and file3]

Workflow example
- Graph structure
  - Fan-in
  - Fan-out
  - "left" and "right" can run in parallel
- Needs external input file
  - Located via replica catalog
- Data file dependencies
  - Form graph structure
[Figure: workflow graph of preprocess, findrange, and analyze]
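How file dependencies alone determine the graph can be sketched in a few lines of Python. The job table below is an assumption built for illustration: the job names follow the slide, the file names (`fa`, `fb1`, ...) are borrowed from the compound transformation shown later in the deck, and the helper names are hypothetical. It uses the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical job table: each job lists the files it reads and writes.
JOBS = {
    "preprocess": {"in": ["fa"], "out": ["fb1", "fb2"]},
    "left": {"in": ["fb1", "fb2"], "out": ["fc1"]},
    "right": {"in": ["fb1", "fb2"], "out": ["fc2"]},
    "analyze": {"in": ["fc1", "fc2"], "out": ["fd"]},
}


def build_graph(jobs):
    """Map each job to the set of jobs that produce its input files."""
    producer = {f: name for name, job in jobs.items() for f in job["out"]}
    return {
        name: {producer[f] for f in job["in"] if f in producer}
        for name, job in jobs.items()
    }


def parallel_stages(jobs):
    """Group jobs into stages; jobs within one stage have no mutual dependency."""
    sorter = TopologicalSorter(build_graph(jobs))
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


def external_inputs(jobs):
    """Files that no job produces; these must be located elsewhere."""
    produced = {f for job in jobs.values() for f in job["out"]}
    consumed = {f for job in jobs.values() for f in job["in"]}
    return sorted(consumed - produced)


print(parallel_stages(JOBS))
print(external_inputs(JOBS))
```

In this sketch "left" and "right" land in the same stage because neither reads a file the other writes, and `fa` is the one file that would have to come from a replica catalog.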

Complete VDL workflow
- Generate appropriate derivations
DV out:"f.b2"} ], );
DV left->findrange( name="left", p="0.5" );
DV right->findrange( name="right" );
DV );

Compound Transformations Enable Functional Abstractions
- Compound TR encapsulates an entire sub-graph:
TR rangeAnalysis (in fa, p1, p2, out fd, io fc1, io fc2, io fb1, io fb2) {
  call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] );
  call findrange( a1=${in:fb1}, a2=${in:fb2}, name="LEFT", p=${p1}, b=${out:fc1} );
  call findrange( a1=${in:fb1}, a2=${in:fb2}, name="RIGHT", p=${p2}, b=${out:fc2} );
  call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} );
}
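One way to read the compound TR is as a function that expands into its four primitive calls. The Python sketch below mirrors the `rangeAnalysis` body argument for argument; the tuple-and-dict representation and the sample file names are assumptions for illustration, not how the VDL compiler represents calls.

```python
def range_analysis(fa, p1, p2, fd, fc1, fc2, fb1, fb2):
    """Expand the compound transformation into its four primitive calls."""
    return [
        ("preprocess", {"a": fa, "b": [fb1, fb2]}),
        ("findrange", {"a1": fb1, "a2": fb2, "name": "LEFT", "p": p1, "b": fc1}),
        ("findrange", {"a1": fb1, "a2": fb2, "name": "RIGHT", "p": p2, "b": fc2}),
        ("analyze", {"a": [fc1, fc2], "b": fd}),
    ]


# Hypothetical file names and parameter values, for illustration only.
calls = range_analysis(
    fa="f.a", p1="0", p2="100", fd="f.d",
    fc1="f.c1", fc2="f.c2", fb1="f.b1", fb2="f.b2",
)
for program, args in calls:
    print(program, args)
```

A caller supplies only the eight arguments; the intermediate files `fb1`, `fb2`, `fc1`, and `fc2` tie the four calls together, which is what lets the sub-graph be reused as a single unit.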

Derivation scripts
- Representation of virtual data provenance:
DV d1->diamond( p2="100", p1="0" );
DV d2->diamond( p2=" ", p1="0" );
...
DV d70->diamond( p2="800", p1="18" );
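A derivation script of this shape lends itself to being generated by a small program. The sketch below is a hypothetical helper, not part of the Virtual Data System: `format_derivation` only reproduces the textual form of the DV lines on this slide, and the sweep values in the usage example are made up.

```python
def format_derivation(dv_name, tr_name, **params):
    """Render one DV statement in the textual form used on the slide."""
    args = ", ".join(f'{key}="{value}"' for key, value in params.items())
    return f"DV {dv_name}->{tr_name}( {args} );"


# Reproduces the first line shown on the slide.
print(format_derivation("d1", "diamond", p2="100", p1="0"))

# A hypothetical sweep over p2; these values are not taken from the slide.
script = [
    format_derivation(f"d{i}", "diamond", p2=str(p2), p1="0")
    for i, p2 in enumerate([100, 200, 300], start=1)
]
print("\n".join(script))
```

Because every generated line is itself a provenance record, the script documents the whole parameter sweep as well as driving it.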

Invocation Provenance
- Completion status and resource usage
- Attributes of executable transformation
- Attributes of input and output files

Executing VDL Workflows
[Figure labels: abstract workflow; local planner; global planner “Pegasus”; “jit” planner (research); Grid info; concrete DAG; DAGMan / Condor-G]
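The last step of this pipeline, handing a concrete DAG to DAGMan, can be sketched as follows. This is a deliberately minimal illustration and not the output of any of the planners named on the slide: the function `to_dagman`, the job names, and the `.sub` submit-file paths are assumptions. Only the `JOB` and `PARENT ... CHILD` keywords are DAGMan input-file syntax; a real planner would also have to choose execution sites and arrange data movement, which this sketch ignores.

```python
def to_dagman(dependencies, submit_dir="."):
    """Render a {job: [parent jobs]} mapping as DAGMan input-file text."""
    # One JOB line per node, pointing at a (hypothetical) Condor submit file.
    lines = [f"JOB {job} {submit_dir}/{job}.sub" for job in dependencies]
    # One PARENT ... CHILD line per node that has parents.
    for job, parents in dependencies.items():
        if parents:
            lines.append(f"PARENT {' '.join(sorted(parents))} CHILD {job}")
    return "\n".join(lines)


# Hypothetical abstract workflow, expressed as job -> parents.
workflow = {
    "preprocess": [],
    "left": ["preprocess"],
    "right": ["preprocess"],
    "analyze": ["left", "right"],
}
print(to_dagman(workflow))
```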

GriPhyN-iVDGL Applications to date
- ATLAS, BTeV, CMS: HEP event simulation
- Argonne Computational Biology: sequence comparison and result capture
- LIGO: pulsar search
- Sloan Digital Sky Survey: cluster finding; near-earth object search planned
- QuarkNet: science education (cosmic rays, HEP analysis)

Genome Analysis Database Update
Application work by Alex Rodriguez, Dina Sulakhe, Natalia Maltsev, Argonne MCS
Described in GGF10 workshop paper.

Virtual Data Example: Galaxy Cluster Search
[Figure labels: Sloan Data; DAG; Galaxy cluster size distribution]
Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago
Described in SC2002 paper

Cluster Search Workflow Graph and Execution Trace
[Figure: workflow jobs vs. time]

Virtual Data Application: High Energy Physics Data Analysis
[Figure: graph of derived datasets, each node labeled with its metadata attributes]
- mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt =
- mass = 200, decay = WW, stability = 1, event = 8
- mass = 200, decay = WW, stability = 1, plot = 1
- mass = 200, decay = WW, plot = 1
- mass = 200, decay = WW, event = 8
- mass = 200, decay = WW, stability = 1
- mass = 200, decay = WW, stability = 3
- mass = 200, decay = WW
- mass = 200, decay = ZZ
- mass = 200, decay = bb
- mass = 200, plot = 1
- mass = 200, event = 8
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida
Ref: CHEP 2002 paper

Using Virtual Data for Science Education
- The QuarkNet-Trillium collaboration is using Grid virtual data tools and methods to enrich science education
- It's an experiment to give students the means to:
  - discover and apply datasets, algorithms, and data analysis methods
  - collaborate by developing new ones and sharing results and observations
  - learn data analysis methods that will ready and excite them for a scientific career
- And in later steps, we may actually use the Grid!

QuarkNet Virtual Data Project
Enabling collaboration-intensive science discovery with virtual data tools and methods
[Figure: student/teacher teams at three sites, each with a cosmic ray detector and locally collected data, reach the QuarkNet Virtual Data Portal through standard Web access]
- Sites:
  - Central High School, Reston, Virginia
  - Yale / Middletown High Collaboration, Hartford, Connecticut
  - Foothills High School, Great Falls, Montana
- Portal contents: student data, algorithms, results, notes, and communications
- Portal components: Virtual Data Toolkit, Virtual Data Catalog
- Student/teacher teams share data, methods, programs, and knowledge

Detector Performance Study

Example: BTeV Event Simulation

Search by Metadata

Deriving a new dataset
…to find mass of “z” particle:

Workflow for missing energy calculations

Virtual Provenance: list of derivations and files
<job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="5" dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum">
<job id="ID000002" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="7" dv-namespace="Quarknet.HEPSRCH" …
…
<job id="ID000014" namespace="Quarknet.HEPSRCH" name="ReconTotalEnergy" level="3"…
…
(excerpted for display)
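A listing like this is straightforward to query once it is well-formed XML. The sketch below uses Python's standard `xml.etree.ElementTree`. The sample document is an assumption: the `<workflow>` root element is invented to make the excerpt parseable, the `job` elements are self-closed, and attributes that are cut off on the slide are simply left out rather than guessed.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the excerpt; the <workflow> wrapper is hypothetical.
SAMPLE = """<workflow>
  <job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="5"
       dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum"/>
  <job id="ID000002" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="7"
       dv-namespace="Quarknet.HEPSRCH"/>
  <job id="ID000014" namespace="Quarknet.HEPSRCH" name="ReconTotalEnergy" level="3"/>
</workflow>"""


def jobs_by_transformation(xml_text):
    """Group job ids by the transformation name each job runs."""
    root = ET.fromstring(xml_text)
    groups = {}
    for job in root.iter("job"):
        groups.setdefault(job.get("name"), []).append(job.get("id"))
    return groups


def derivation_names(xml_text):
    """Map job id to its dv-name, or None where the attribute is absent."""
    root = ET.fromstring(xml_text)
    return {job.get("id"): job.get("dv-name") for job in root.iter("job")}


print(jobs_by_transformation(SAMPLE))
print(derivation_names(SAMPLE))
```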

Virtual Provenance in XML: control flow graph
…
(excerpted for display…)

And writing the results up in a “poster”

Poster describing analysis

Observations
- A provenance approach based on interface definition and data flow declaration fits well with Grid requirements for code and data transportability and heterogeneity
- Working in a provenance-managed system has many fringe benefits: uniformity, precision, structure, communication, documentation
- The real world is messy: finding the right abstractions is hard, and handling “legacy” applications is even harder

Vision for Provenance in the Large
- Universal knowledge management and production systems
- Vendors integrate the provenance tracking protocol into data processing products
- Ability to run anywhere “in the Grid”

Virtual Data Grid Vision

Planned Dataset Model
[Figure labels: <FORM /FORM>]
- File
- Set of files
- Relational query or spreadsheet range
- XML Element
- Set of files with relational index
- Object closure
- New user-defined dataset type:
Speculative model described in CIDR 2003 paper by Foster, Voeckler, Wilde and Zhao

Planned Dataset Type Model
[Figure: dataset type hierarchy in two groups, Representational and Logical; nonleaf types are superclasses]
Type labels: FileDataset, File, FileSet, MultiFileSet, TarFileSet, EventCollection, RawEventSet, SimulatedEventSet, MonteCarlo Simulation, DiscreteEvent Simulation

Provenance Server Plans
- OGSA-based Grid services
  - Discovery, security, resource management
- Supports code and data discovery and workflow management
- Object names (TR, DS, TY, DV, IV) can be used as global cross-server links
- Derivations can reference remote transformations and datasets
- Structured object namespaces & object-level access control enable large VO collaboration
- Generalize transforms to describe service calls, database queries and language interpreters

Provenance Hyperlinks

Indexing Servers to Support Discovery

For Information and Software
- Virtual Data System
  - Chimera Virtual Data System: overview, papers, software
- Grids and Grid Software
  - Using Grid3
  - Virtual Data Toolkit
  - The Globus Toolkit
  - The Condor Project
  - Particle Physics Data Grid

Acknowledgements: Virtual Data is a Large Team Effort
- The Chimera Virtual Data System is the work of Ian Foster, Jens Voeckler, Mike Wilde and Yong Zhao.
- The Pegasus Planner is the work of Ewa Deelman, Gaurang Mehta, and Karan Vahi.
- Applications described are the work of many people, including James Annis, Rick Cavanaugh, Dan Engh, Rob Gardner, Albert Lazzarini, Natalia Maltsev, and their wonderful teams.

Acknowledgements
- GriPhyN, iVDGL, and QuarkNet (in part) are supported by the National Science Foundation.
- The Globus Alliance, PPDG, and QuarkNet are supported in part by the US Department of Energy, Office of Science; by the NASA Information Power Grid program; and by IBM.