Published by Donald Doyle. Modified over 9 years ago.
1
GriPhyN and Data Provenance
The Grid Physics Network Virtual Data System
DOE Data Management Workshop
SLAC, 17 March 2004
Mike Wilde
Argonne National Laboratory
Mathematics and Computer Science Division
2
DOE Data Management www.griphyn.org/chimera 17 Mar 2004
GriPhyN: Grid Physics Network Mission
Enhance scientific productivity through discovery and processing of datasets, using the grid as a scientific workstation.
Virtual Data enables this approach by creating datasets from workflow “recipes” and recording their provenance.
GriPhyN works to “cross the chasm”: application and computer scientists create and field-test paradigms and toolkits together.
3
Virtual Data Scenario
[Diagram: simulate, reformat, conv, psearch and summarize steps linked by file1 through file8]
Commands in the workflow:
  simulate -t 10 -o file1 file2
  reformat -f fz -i file2 -o file3 file4 file5
  conv -l esd -o aod -i file2 -o file6
  summarize -t 10 -i file6 -o file7
  psearch -t 10 -i file3 file4 file5 -o file8
• Manage workflow
• Update workflow following changes
• On-demand data generation
• Explain provenance, e.g. for file8:
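The "explain provenance" idea can be sketched in a few lines of Python. This is only an illustration, not part of the Chimera software: the `STEPS` table simply transcribes the five commands on the slide, and the `explain` helper is a hypothetical name. It walks backwards from a target file to the commands that must have run to produce it.

```python
# Each step: (command line, input files, output files), transcribed from the slide.
STEPS = [
    ("simulate -t 10 -o file1 file2", [], ["file1", "file2"]),
    ("reformat -f fz -i file2 -o file3 file4 file5",
     ["file2"], ["file3", "file4", "file5"]),
    ("conv -l esd -o aod -i file2 -o file6", ["file2"], ["file6"]),
    ("summarize -t 10 -i file6 -o file7", ["file6"], ["file7"]),
    ("psearch -t 10 -i file3 file4 file5 -o file8",
     ["file3", "file4", "file5"], ["file8"]),
]


def explain(target, steps=STEPS):
    """Return the commands needed to produce `target`, in execution order."""
    # Map every output file to the step that produces it.
    producer = {out: step for step in steps for out in step[2]}
    ordered, seen = [], set()

    def visit(filename):
        step = producer.get(filename)
        if step is None or step[0] in seen:
            return  # external input, or step already handled
        seen.add(step[0])
        for dep in step[1]:
            visit(dep)  # producers of the inputs come first
        ordered.append(step[0])

    visit(target)
    return ordered


for command in explain("file8"):
    print(command)
```

Under these assumptions, the provenance of file8 is the simulate, reformat and psearch commands; conv and summarize do not contribute to it.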
4
Grid3 – The Laboratory
Supported by the National Science Foundation and the Department of Energy.
5
VDL: Virtual Data Language Describes Data Transformations
• Transformation
  – Abstract template of program invocation
  – Similar to "function definition"
• Derivation
  – “Function call” to a Transformation
  – Store past and future:
    > A record of how data products were generated
    > A recipe of how data products can be generated
• Invocation
  – Record of a Derivation execution
• These XML documents reside in a “virtual data catalog” (VDC), a relational database
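One way to picture the three object kinds is as plain records. The Python sketch below is an assumption-laden illustration, not the VDC schema: the class fields and the `role` method are hypothetical, chosen only to show how one derivation can serve as a recipe before it runs and as a record afterwards.

```python
from dataclasses import dataclass, field


@dataclass
class Transformation:
    """Abstract template of a program invocation ("function definition")."""
    name: str
    inputs: list
    outputs: list


@dataclass
class Invocation:
    """Record of one execution of a derivation (hypothetical fields)."""
    exit_status: int


@dataclass
class Derivation:
    """A "function call": a transformation plus actual argument bindings."""
    name: str
    transformation: Transformation
    bindings: dict
    invocations: list = field(default_factory=list)

    def role(self):
        # With no recorded execution the derivation is a recipe for the future;
        # once an invocation exists it also documents the past.
        return "record" if self.invocations else "recipe"


tr1 = Transformation("tr1", inputs=["a1"], outputs=["a2"])
x1 = Derivation("x1", tr1, {"a1": "file1", "a2": "file2"})
print(x1.role())                       # before any execution
x1.invocations.append(Invocation(exit_status=0))
print(x1.role())                       # after one recorded execution
```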
6
VDL Describes Workflow via Data Dependencies
TR tr1(in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
TR tr2(in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
DV x1->tr1(a1=@{in:file1}, a2=@{out:file2});
DV x2->tr2(a1=@{in:file2}, a2=@{out:file3});
[Diagram: file1 → x1 → file2 → x2 → file3]
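The ordering between x1 and x2 is never stated explicitly; it follows from file2 being an output of one derivation and an input of the other. A minimal Python sketch of that inference, assuming a hand-written dictionary in place of a real VDL parser (the `derivations` structure and `dependency_edges` name are illustrative only):

```python
# Inputs and outputs of each derivation, copied by hand from the DV statements.
derivations = {
    "x1": {"in": ["file1"], "out": ["file2"]},
    "x2": {"in": ["file2"], "out": ["file3"]},
}


def dependency_edges(dvs):
    """Return (producer, consumer) pairs implied by shared files."""
    producer = {f: name for name, dv in dvs.items() for f in dv["out"]}
    edges = set()
    for name, dv in dvs.items():
        for f in dv["in"]:
            # A file with no producer is an external input and adds no edge.
            if f in producer and producer[f] != name:
                edges.add((producer[f], name))
    return sorted(edges)


print(dependency_edges(derivations))
```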
7
Workflow example
• Graph structure
  – Fan-in
  – Fan-out
  – "left" and "right" can run in parallel
• Needs external input file
  – Located via replica catalog
• Data file dependencies
  – Form graph structure
[Diagram: preprocess → findrange (left, right) → analyze]
8
Complete VDL workflow
• Generate appropriate derivations
DV top->preprocess( b=[ @{out:"f.b1"}, @{out:"f.b2"} ], a=@{in:"f.a"} );
DV left->findrange( b=@{out:"f.c1"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="left", p="0.5" );
DV right->findrange( b=@{out:"f.c2"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="right" );
DV bottom->analyze( b=@{out:"f.d"}, a=[ @{in:"f.c1"}, @{in:"f.c2"} ] );
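The fan-out and fan-in shape of these four derivations can be recovered mechanically from their file arguments. The following Python sketch is illustrative only: the dictionary is transcribed by hand from the DV statements, and `parallel_stages` and `external_inputs` are hypothetical helper names. It uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+) to group derivations into stages that could run concurrently.

```python
from graphlib import TopologicalSorter

# Input and output files of each derivation, copied from the DV statements.
derivations = {
    "top": {"in": ["f.a"], "out": ["f.b1", "f.b2"]},
    "left": {"in": ["f.b1", "f.b2"], "out": ["f.c1"]},
    "right": {"in": ["f.b1", "f.b2"], "out": ["f.c2"]},
    "bottom": {"in": ["f.c1", "f.c2"], "out": ["f.d"]},
}


def parallel_stages(dvs):
    """Group derivations into stages; members of a stage are independent."""
    producer = {f: name for name, dv in dvs.items() for f in dv["out"]}
    # Map each derivation to the set of derivations it depends on.
    graph = {
        name: {producer[f] for f in dv["in"] if f in producer}
        for name, dv in dvs.items()
    }
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


def external_inputs(dvs):
    """Files that are consumed but never produced inside the workflow."""
    produced = {f for dv in dvs.values() for f in dv["out"]}
    consumed = {f for dv in dvs.values() for f in dv["in"]}
    return sorted(consumed - produced)


print(parallel_stages(derivations))
print(external_inputs(derivations))
```

Here "left" and "right" land in the same stage, and f.a is the only file that has to come from outside the workflow, for example via a replica catalog lookup.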
9
Compound Transformations Enable Functional Abstractions
• Compound TR encapsulates an entire sub-graph:
TR rangeAnalysis (in fa, p1, p2, out fd, io fc1, io fc2, io fb1, io fb2) {
  call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] );
  call findrange( a1=${in:fb1}, a2=${in:fb2}, name="LEFT", p=${p1}, b=${out:fc1} );
  call findrange( a1=${in:fb1}, a2=${in:fb2}, name="RIGHT", p=${p2}, b=${out:fc2} );
  call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} );
}
10
Derivation scripts
• Representation of virtual data provenance:
DV d1->diamond( fd=@{out:"f.00005"}, fc1=@{io:"f.00004"}, fc2=@{io:"f.00003"}, fb1=@{io:"f.00002"}, fb2=@{io:"f.00001"}, fa=@{io:"f.00000"}, p2="100", p1="0" );
DV d2->diamond( fd=@{out:"f.0000B"}, fc1=@{io:"f.0000A"}, fc2=@{io:"f.00009"}, fb1=@{io:"f.00008"}, fb2=@{io:"f.00007"}, fa=@{io:"f.00006"}, p2="141.42135623731", p1="0" );
...
DV d70->diamond( fd=@{out:"f.001A3"}, fc1=@{io:"f.001A2"}, fc2=@{io:"f.001A1"}, fb1=@{io:"f.001A0"}, fb2=@{io:"f.0019F"}, fa=@{io:"f.0019E"}, p2="800", p1="18" );
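Scripts like this are typically generated rather than typed. The file names on the slide follow a visible pattern: each derivation uses six consecutive five-digit hexadecimal names. The Python sketch below reproduces only that naming pattern. It is a guess at how such a script could be produced, not the tool actually used; `ref` and `diamond_derivation` are hypothetical names, and the rule behind the p1 and p2 values is not shown on the slide, so they are passed in as parameters.

```python
# Argument names ordered by increasing file number, as observed in d1.
ROLES = ["fa", "fb2", "fb1", "fc2", "fc1", "fd"]


def ref(direction, filename):
    """Format a VDL file reference such as @{io:"f.00004"}."""
    return '@{' + direction + ':"' + filename + '"}'


def diamond_derivation(n, p1, p2):
    """Build the text of derivation d<n>, using six hex-numbered files."""
    base = (n - 1) * len(ROLES)
    names = {role: "f.%05X" % (base + i) for i, role in enumerate(ROLES)}
    args = ["fd=" + ref("out", names["fd"])]
    for role in ("fc1", "fc2", "fb1", "fb2", "fa"):
        args.append(role + "=" + ref("io", names[role]))
    args.append('p2="' + str(p2) + '"')
    args.append('p1="' + str(p1) + '"')
    return "DV d" + str(n) + "->diamond( " + ", ".join(args) + " );"


print(diamond_derivation(1, "0", "100"))
print(diamond_derivation(70, "18", "800"))
```

With the parameter values copied from the slide, the two calls reproduce the d1 and d70 lines shown above.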
11
Invocation Provenance
• Completion status and resource usage
• Attributes of executable transformation
• Attributes of input and output files
12
Executing VDL Workflows
[Diagram labels: Abstract workflow; local planner; global planner “Pegasus”; “jit” planner (research); Grid Info; Concrete DAG; DAGMan / Condor-G]
13
GriPhyN-iVDGL Applications to date
• ATLAS, BTeV, CMS – HEP event simulation
• Argonne Computational Biology – sequence comparison and result capture
• LIGO – pulsar search
• Sloan Digital Sky Survey – cluster finding; near-earth object search planned
• QuarkNet – science education – cosmic rays, HEP analysis
14
Genome Analysis Database Update
Application work by Alex Rodriguez, Dina Sulakhe, Natalia Maltsev, Argonne MCS.
Described in GGF10 workshop paper.
15
Virtual Data Example: Galaxy Cluster Search
[Figure panels: Sloan Data; DAG; Galaxy cluster size distribution]
Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago.
Described in SC2002 paper.
16
Cluster Search Workflow Graph and Execution Trace
[Figure: workflow jobs vs time]
17
Virtual Data Application: High Energy Physics Data Analysis
[Diagram of derived datasets, each node labeled with its parameters:]
  mass = 200
  mass = 200, decay = WW
  mass = 200, decay = ZZ
  mass = 200, decay = bb
  mass = 200, plot = 1
  mass = 200, event = 8
  mass = 200, decay = WW, stability = 1
  mass = 200, decay = WW, stability = 3
  mass = 200, decay = WW, plot = 1
  mass = 200, decay = WW, event = 8
  mass = 200, decay = WW, stability = 1, plot = 1
  mass = 200, decay = WW, stability = 1, event = 8
  mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida.
Ref: CHEP 2002 paper
18
Using Virtual Data for Science Education
• The QuarkNet-Trillium collaboration is using Grid virtual data tools and methods to enrich science education
• It's an experiment to give students the means to:
  – discover and apply datasets, algorithms, and data analysis methods
  – collaborate by developing new ones and sharing results and observations
  – learn data analysis methods that will ready and excite them for a scientific career
• And in later steps, we may actually use the Grid!
19
Quarknet Virtual Data Project
[Diagram]
Sites, each with locally collected data, a cosmic ray detector, and student/teacher teams:
• Central High School, Reston, Virginia
• Yale / Middletown High Collaboration, Hartford, Connecticut
• Foothills High School, Great Falls, Montana
Standard Web access to the Quarknet Virtual Data Portal: student data, algorithms, results, notes, and communications
Virtual Data Toolkit; Virtual Data Catalog
Student teacher teams sharing data, methods, programs, and knowledge
Enabling collaboration-intensive science discovery with virtual data tools and methods
20
Detector Performance Study
21
Example: BTeV Event Simulation
22
Search by Metadata
23
Deriving a new dataset
…to find mass of “z” particle:
24
Workflow for missing energy calculations
25
Virtual Provenance: list of derivations and files
<job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="5" dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum">
<job id="ID000002" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="7" dv-namespace="Quarknet.HEPSRCH" …
…
<job id="ID000014" namespace="Quarknet.HEPSRCH" name="ReconTotalEnergy" level="3"…
…
(excerpted for display)
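A job list of this shape is easy to read back with a standard XML parser. The Python sketch below is an illustration under stated assumptions: the `<jobs>` wrapper element is hypothetical (the slide does not show the enclosing document), the sample contains only the attributes visible in the excerpt, and elided attributes are simply left out rather than guessed.

```python
import xml.etree.ElementTree as ET

# Only attributes visible on the slide are included; <jobs> is a made-up wrapper.
SAMPLE = """
<jobs>
  <job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum"
       level="5" dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum"/>
  <job id="ID000002" namespace="Quarknet.HEPSRCH" name="ECalEnergySum"
       level="7" dv-namespace="Quarknet.HEPSRCH"/>
  <job id="ID000014" namespace="Quarknet.HEPSRCH" name="ReconTotalEnergy"
       level="3"/>
</jobs>
"""


def list_jobs(xml_text):
    """Return one dict per <job> element, in document order."""
    root = ET.fromstring(xml_text.strip())
    jobs = []
    for job in root.iter("job"):
        jobs.append({
            "id": job.get("id"),
            "transformation": job.get("namespace") + "::" + job.get("name"),
            "derivation": job.get("dv-name"),  # None when not present
            "level": int(job.get("level")),
        })
    return jobs


for entry in list_jobs(SAMPLE):
    print(entry["id"], entry["transformation"], entry["derivation"])
```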
26
Virtual Provenance in XML: control flow graph
…
(excerpted for display…)
27
And writing the results up in a “poster”
28
Poster describing analysis
29
Observations
• A provenance approach based on interface definition and data flow declaration fits well with Grid requirements for code and data transportability and heterogeneity
• Working in a provenance-managed system has many fringe benefits: uniformity, precision, structure, communication, documentation
• The real world is messy – finding the right abstractions is hard, and handling “legacy” applications is even harder
30
Vision for Provenance in the Large
• Universal knowledge management and production systems
• Vendors integrate the provenance tracking protocol into data processing products
• Ability to run anywhere “in the Grid”
31
Virtual Data Grid Vision
32
Planned Dataset Model
[Diagram of dataset kinds:]
• File
• Set of files
• Relational query or spreadsheet range
• XML Element (<FORM> … </FORM>)
• Set of files with relational index
• Object closure
• New user-defined dataset type
Speculative model described in CIDR 2003 paper by Foster, Voeckler, Wilde and Zhao
33
Planned Dataset Type Model
[Diagram: type hierarchy. Representational types: FileDataset, File, FileSet, MultiFileSet, TarFileSet. Logical types: EventCollection, RawEventSet, SimulatedEventSet, MonteCarlo Simulation, DiscreteEvent Simulation.]
(Nonleaf Types are Superclasses)
34
Provenance Server Plans
• OGSA-based Grid services
  – Discovery, security, resource management
• Supports code and data discovery and workflow management
• Object names (TR, DS, TY, DV, IV) can be used as global cross-server links
• Derivations can reference remote transformations and datasets
• Structured object namespaces & object-level access control enable large VO collaboration
• Generalize transforms to describe service calls, database queries and language interpreters
35
Provenance Hyperlinks
36
Indexing Servers to Support Discovery
37
For Information and Software
• Virtual Data System
  – www.griphyn.org/chimera – Chimera Virtual Data System: overview, papers, software
• Grids and Grid Software
  – www.ivdgl.org/grid2003 – Using Grid3
  – www.griphyn.org/vdt – Virtual Data Toolkit
  – www.globus.org – The Globus Toolkit
  – www.cs.wisc.edu/condor – The Condor Project
  – www.ppdg.net – Particle Physics Data Grid
38
Acknowledgements: Virtual Data is a Large Team Effort
The Chimera Virtual Data System is the work of Ian Foster, Jens Voeckler, Mike Wilde and Yong Zhao.
The Pegasus Planner is the work of Ewa Deelman, Gaurang Mehta, and Karan Vahi.
Applications described are the work of many people, including: James Annis, Rick Cavanaugh, Dan Engh, Rob Gardner, Albert Lazzarini, Natalia Maltsev, and their wonderful teams.
39
Acknowledgements
GriPhyN, iVDGL, and QuarkNet (in part) are supported by the National Science Foundation.
The Globus Alliance, PPDG, and QuarkNet are supported in part by the US Department of Energy, Office of Science; by the NASA Information Power Grid program; and by IBM.