1
The Virtual Data Grid: A New Model and Architecture for Data-Intensive Collaboration
Summer Grid 2004, UT Brownsville South Padre Island Center, 24 June 2004
Mike Wilde, Argonne National Laboratory, Mathematics and Computer Science Division
2
Summer Grid 2004 www.griphyn.org/chimera 24 June, UTB/SPI
GriPhyN: Grid Physics Network Mission
Enhance scientific productivity through discovery and processing of datasets, using the grid as a scientific workstation.
Virtual Data enables this approach by creating datasets from workflow “recipes” and recording their provenance.
GriPhyN works to “cross the chasm”: application and computer scientists create and field-test paradigms and toolkits together.
3
Acknowledgements: Virtual Data is a Large Team Effort
The Chimera Virtual Data System is the work of Ian Foster, Jens Voeckler, Mike Wilde, and Yong Zhao.
The Pegasus Planner is the work of Ewa Deelman, Gaurang Mehta, and Karan Vahi.
Applications described are the work of many people, including James Annis, Rick Cavanaugh, Dan Engh, Rob Gardner, Albert Lazzarini, Natalia Maltsev, Marge Bardeen, and their wonderful teams.
4
Virtual Data Scenario
[Diagram: simulate –t 10 … produces file1 and file2; reformat –f fz … produces file3, file4, file5; conv –I esd –o aod produces file6; summarize –t 10 … produces file7; psearch –t 10 … produces file8]
• On-demand data generation
• Update workflow following changes
• Manage workflow
• Explain provenance, e.g. for file8:
  psearch –t 10 –i file3 file4 file5 –o file8
  summarize –t 10 –i file6 –o file7
  reformat –f fz –i file2 –o file3 file4 file5
  conv –l esd –o aod –i file2 –o file6
  simulate –t 10 –o file1 file2
5
Virtual Data Describes analysis workflow
• The recorded virtual data “recipe” here is:
  – Files: 8 < (1,3,4,5,7), 7 < 6, (3,4,5,6) < 2
  – Programs: 8 < psearch, 7 < summarize, (3,4,5) < reformat, 6 < conv, (1,2) < simulate
[Diagram: the scenario workflow graph, with file8 marked as the requested dataset]
6
Virtual Data Describes analysis workflow
• To re-create file8: Step 1
  – simulate > file1, file2
[Diagram: the scenario workflow graph, with file8 marked as the requested file]
7
Virtual Data Describes analysis workflow
• To re-create file8: Step 2
  – Files 3, 4, 5, 6 are derived from file2
  – reformat > file3, file4, file5
  – conv > file6
[Diagram: the scenario workflow graph, with file8 marked as the requested file]
8
Virtual Data Describes analysis workflow
• To re-create file8: Step 3
  – File 7 depends on file 6
  – summarize > file7
[Diagram: the scenario workflow graph, with file8 marked as the requested file]
9
Virtual Data Describes analysis workflow
• To re-create file8: final step
  – File 8 depends on files 1, 3, 4, 5, 7
  – psearch > file8
[Diagram: the scenario workflow graph, with file8 marked as the requested file]
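The re-creation steps on slides 5 to 9 amount to a backward walk over the recorded recipe. The following is a minimal Python sketch of that walk, assuming the recipe is held as a plain dictionary. The names `RECIPE`, `producer_of`, and `plan` are illustrative only and are not part of Chimera or VDL.

```python
# Recipe from slide 5: each program maps to (input files, output files).
RECIPE = {
    "simulate":  ((), ("file1", "file2")),
    "reformat":  (("file2",), ("file3", "file4", "file5")),
    "conv":      (("file2",), ("file6",)),
    "summarize": (("file6",), ("file7",)),
    "psearch":   (("file1", "file3", "file4", "file5", "file7"), ("file8",)),
}


def producer_of(recipe):
    """Invert the recipe: map each file to the program that produces it."""
    return {out: prog for prog, (_, outs) in recipe.items() for out in outs}


def plan(target, recipe, existing=frozenset()):
    """Return the programs to run, in order, to (re)create `target`.

    Files listed in `existing` are treated as already materialized,
    so the programs that produce them are not scheduled again.
    """
    producers = producer_of(recipe)
    order, scheduled = [], set()

    def visit(filename):
        if filename in existing:
            return
        prog = producers[filename]
        if prog in scheduled:
            return
        scheduled.add(prog)
        for dep in recipe[prog][0]:   # resolve inputs before the program itself
            visit(dep)
        order.append(prog)

    visit(target)
    return order


print(plan("file8", RECIPE))
```

If file6 is already present, calling `plan("file8", RECIPE, existing={"file6"})` leaves `conv` out of the plan, which is the on-demand behaviour the scenario slide describes.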
10
Grid3 – The Laboratory
Supported by the National Science Foundation and the Department of Energy.
11
VDL: Virtual Data Language Describes Data Transformations
• Transformation
  – Abstract template of program invocation
  – Similar to “function definition”
• Derivation
  – “Function call” to a Transformation
  – Store past and future:
    > A record of how data products were generated
    > A recipe of how data products can be generated
• Invocation
  – Record of a Derivation execution
• These XML documents reside in a “virtual data catalog” (VDC), a relational database
12
VDL Describes Workflow via Data Dependencies
TR tr1(in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
TR tr2(in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
DV x1->tr1(a1=@{in:file1}, a2=@{out:file2});
DV x2->tr2(a1=@{in:file2}, a2=@{out:file3});
[Diagram: file1 → x1 → file2 → x2 → file3]
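The ordering between x1 and x2 is never stated in the VDL above; it follows from the shared logical file name file2. A small Python sketch of that inference, assuming each derivation is reduced to its input and output file names. The `Derivation` class and `dependencies` function are hypothetical helpers, not Chimera APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Derivation:
    """A derivation reduced to the transformation it calls and its files."""
    name: str
    transformation: str
    inputs: tuple
    outputs: tuple


def dependencies(derivations):
    """Map each derivation to the derivations whose outputs it reads."""
    producer = {f: dv.name for dv in derivations for f in dv.outputs}
    return {
        dv.name: {producer[f] for f in dv.inputs if f in producer}
        for dv in derivations
    }


x1 = Derivation("x1", "tr1", inputs=("file1",), outputs=("file2",))
x2 = Derivation("x2", "tr2", inputs=("file2",), outputs=("file3",))

deps = dependencies([x1, x2])
print(deps)
```

Here file1 has no producer among the derivations, so it is treated as an external input and contributes no edge.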
13
Workflow example
• Graph structure
  – Fan-in
  – Fan-out
  – “left” and “right” can run in parallel
• Needs external input file
  – Located via replica catalog
• Data file dependencies
  – Form graph structure
[Diagram nodes: preprocess, findrange, analyze]
14
Complete VDL workflow
• Generate appropriate derivations
DV top->preprocess( b=[ @{out:"f.b1"}, @{out:"f.b2"} ], a=@{in:"f.a"} );
DV left->findrange( b=@{out:"f.c1"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="left", p="0.5" );
DV right->findrange( b=@{out:"f.c2"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="right" );
DV bottom->analyze( b=@{out:"f.d"}, a=[ @{in:"f.c1"}, @{in:"f.c2"} ] );
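The four derivations above form a diamond in which “left” and “right” can run in parallel. A hedged Python sketch of how such batches could be computed from the file names alone, using the standard-library `graphlib` module (Python 3.9 or later). The `DIAMOND` dictionary and `parallel_batches` function are assumptions made for illustration; this is not how the Chimera or Pegasus planners are implemented.

```python
from graphlib import TopologicalSorter

# Inputs and outputs copied from the four DV statements.
DIAMOND = {
    "top":    {"in": ["f.a"],          "out": ["f.b1", "f.b2"]},
    "left":   {"in": ["f.b1", "f.b2"], "out": ["f.c1"]},
    "right":  {"in": ["f.b1", "f.b2"], "out": ["f.c2"]},
    "bottom": {"in": ["f.c1", "f.c2"], "out": ["f.d"]},
}


def parallel_batches(derivations):
    """Group derivations into batches; members of a batch are independent."""
    producer = {f: name for name, io in derivations.items() for f in io["out"]}
    # Each node maps to the set of derivations it must wait for.
    graph = {
        name: {producer[f] for f in io["in"] if f in producer}
        for name, io in derivations.items()
    }
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


print(parallel_batches(DIAMOND))
```

The input f.a has no producing derivation, so it is the external file that would be located through the replica catalog.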
15
Compound Transformations Enable Functional Abstractions
• Compound TR encapsulates an entire sub-graph:
TR rangeAnalysis (in fa, p1, p2, out fd, io fc1, io fc2, io fb1, io fb2) {
  call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] );
  call findrange( a1=${in:fb1}, a2=${in:fb2}, name="LEFT", p=${p1}, b=${out:fc1} );
  call findrange( a1=${in:fb1}, a2=${in:fb2}, name="RIGHT", p=${p2}, b=${out:fc2} );
  call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} );
}
16
Derivation scripts
• Representation of virtual data provenance:
DV d1->diamond( fd=@{out:"f.00005"}, fc1=@{io:"f.00004"}, fc2=@{io:"f.00003"}, fb1=@{io:"f.00002"}, fb2=@{io:"f.00001"}, fa=@{io:"f.00000"}, p2="100", p1="0" );
DV d2->diamond( fd=@{out:"f.0000B"}, fc1=@{io:"f.0000A"}, fc2=@{io:"f.00009"}, fb1=@{io:"f.00008"}, fb2=@{io:"f.00007"}, fa=@{io:"f.00006"}, p2="141.42135623731", p1="0" );
...
DV d70->diamond( fd=@{out:"f.001A3"}, fc1=@{io:"f.001A2"}, fc2=@{io:"f.001A1"}, fb1=@{io:"f.001A0"}, fb2=@{io:"f.0019F"}, fa=@{io:"f.0019E"}, p2="800", p1="18" );
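The file names in these derivations follow a visible pattern: each derivation uses six consecutive five-digit hexadecimal names, starting at six times its zero-based position. A Python sketch that reproduces the three lines shown, assuming that pattern also holds for the elided derivations. The helper names are hypothetical, and the p1 and p2 values must be supplied by the caller because the slide does not say how they were chosen.

```python
def diamond_files(index):
    """Logical file names for the index-th derivation (1-based)."""
    base = (index - 1) * 6
    roles = ["fa", "fb2", "fb1", "fc2", "fc1", "fd"]   # offsets 0..5
    return {role: f"f.{base + offset:05X}" for offset, role in enumerate(roles)}


def diamond_dv(index, p1, p2):
    """Render one DV line in the same layout as the listing."""
    names = diamond_files(index)
    args = [f'fd=@{{out:"{names["fd"]}"}}']
    args += [f'{role}=@{{io:"{names[role]}"}}'
             for role in ("fc1", "fc2", "fb1", "fb2", "fa")]
    args += [f'p2="{p2}"', f'p1="{p1}"']
    return f"DV d{index}->diamond( " + ", ".join(args) + " );"


print(diamond_dv(1, p1="0", p2="100"))
print(diamond_dv(70, p1="18", p2="800"))
```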
17
Invocation Provenance
• Completion status and resource usage
• Attributes of executable transformation
• Attributes of input and output files
18
Executing VDL Workflows
[Diagram labels: Abstract workflow; local planner; Concrete DAG; Global planner “Pegasus”; DAGman / Condor-G; Grid Info; “jit” planner (research)]
19
GriPhyN-iVDGL Applications to date
• ATLAS, BTeV, CMS – HEP event simulation
• Argonne Computational Biology – sequence comparison and result capture
• LIGO – Pulsar search
• Sloan Digital Sky Survey – cluster finding; near-earth object search planned
• QuarkNet – science education – cosmic rays, HEP analysis
20
Genome Analysis Database Update
Application work by Alex Rodriguez, Dina Sulakhe, Natalia Maltsev, Argonne MCS.
Described in GGF10 workshop paper.
21
Virtual Data Example: Galaxy Cluster Search
[Figure labels: Sloan Data; DAG; Galaxy cluster size distribution]
Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago.
Described in SC2002 paper.
22
Cluster Search Workflow Graph and Execution Trace
[Figure: workflow jobs vs time]
23
Virtual Data Application: High Energy Physics Data Analysis
[Diagram: derived datasets, each labeled with its parameters]
• mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000
• mass = 200, decay = WW, stability = 1, event = 8
• mass = 200, decay = WW, stability = 1, plot = 1
• mass = 200, decay = WW, plot = 1
• mass = 200, decay = WW, event = 8
• mass = 200, decay = WW, stability = 1
• mass = 200, decay = WW, stability = 3
• mass = 200, decay = WW
• mass = 200, decay = ZZ
• mass = 200, decay = bb
• mass = 200, plot = 1
• mass = 200, event = 8
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida
Ref: CHEP 2002 paper
24
Using Virtual Data for Science Education
• The QuarkNet-Trillium collaboration is using Grid virtual data tools and methods to enrich science education
• It's an experiment to give students the means to:
  – discover and apply datasets, algorithms, and data analysis methods
  – collaborate by developing new ones and sharing results and observations
  – learn data analysis methods that will ready and excite them for a scientific career
• And in later steps, we may actually use the Grid!
25
QuarkNet Virtual Data Project
[Diagram: three school sites reach a central portal through standard Web access]
• Central High School, Reston, Virginia: locally collected data, cosmic ray detector, student/teacher teams
• Yale / Middletown High Collaboration, Hartford, Connecticut: locally collected data, cosmic ray detector, student/teacher teams
• Foothills High School, Great Falls, Montana: locally collected data, cosmic ray detector, student/teacher teams
• QuarkNet Virtual Data Portal: student data, algorithms, results, notes, and communications; Virtual Data Toolkit; Virtual Data Catalog
Student teacher teams sharing data, methods, programs, and knowledge
Enabling collaboration-intensive science discovery with virtual data tools and methods
26
Detector Performance Study
27
Example: BTeV Event Simulation
28
Support for Search and Discovery
• Goal: make it as easy to use as Google
• More advanced capabilities lie below the surface (as with Google)
• Understand the structure and meaning of the datasets and their fields
• Advanced search, using SQL-like queries
• Find both DATA and TRANSFORMATIONS
• Create datasets from queries
• Perform calculations on datasets, filtering results to look for patterns
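As an illustration of what an SQL-like metadata search might look like, here is a small Python sketch using the standard-library `sqlite3` module. The table layout, the dataset names `ds1` to `ds3`, and the functions `make_catalog` and `find_datasets` are all invented for this example; the slides do not describe the actual virtual data catalog schema. The attribute names are borrowed from the high energy physics slide.

```python
import sqlite3


def make_catalog(rows):
    """Build an in-memory table of (dataset, key, value) metadata rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadata ("
        "dataset TEXT, key TEXT, value TEXT, PRIMARY KEY (dataset, key))"
    )
    conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows)
    return conn


def find_datasets(conn, **attrs):
    """Return names of datasets that match every key=value pair in attrs."""
    if not attrs:
        sql = "SELECT DISTINCT dataset FROM metadata ORDER BY dataset"
        params = []
    else:
        match = " OR ".join("(key = ? AND value = ?)" for _ in attrs)
        # A dataset qualifies only if all requested pairs were matched.
        sql = (
            "SELECT dataset FROM metadata WHERE " + match +
            " GROUP BY dataset HAVING COUNT(*) = ? ORDER BY dataset"
        )
        params = [p for k, v in attrs.items() for p in (k, str(v))]
        params.append(len(attrs))
    return [row[0] for row in conn.execute(sql, params)]


ROWS = [
    ("ds1", "mass", "200"), ("ds1", "decay", "WW"), ("ds1", "stability", "1"),
    ("ds2", "mass", "200"), ("ds2", "decay", "WW"), ("ds2", "stability", "3"),
    ("ds3", "mass", "200"), ("ds3", "decay", "ZZ"),
]
catalog = make_catalog(ROWS)
print(find_datasets(catalog, mass=200, decay="WW"))
```

A key/value table was chosen here so that new attributes can be added without changing the schema; a real catalog may well be organized differently.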
29
Search by Metadata
30
Deriving a new dataset
…to find mass of “z” particle:
31
Workflow for missing energy calculations
32
Virtual Provenance: list of derivations and files
<job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="5" dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum">
<job id="ID000002" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="7" dv-namespace="Quarknet.HEPSRCH" …
…
<job id="ID000014" namespace="Quarknet.HEPSRCH" name="ReconTotalEnergy" level="3"…
…
(excerpted for display)
33
Virtual Provenance in XML: control flow graph
… … … … … (excerpted for display…)
34
And writing the results up in a “poster”
35
Poster describing analysis
36
Using active data from Web Services
40
Levels of Interaction
• “Skins” – use it like a calculator, experiment with scenarios and settings, use virtual data like a log book to document, assess, and share parameter values
• “Blocks” – re-assemble workflow pipelines using existing ones as patterns and pre-developed transforms as building blocks
• “Code” – write new transforms in a variety of languages and data models
41
Observations
• A provenance approach based on interface definition and data flow declaration fits well with Grid requirements for code and data transportability and heterogeneity
• Working in a provenance-managed system has many fringe benefits: uniformity, precision, structure, communication, documentation
• The real world is messy – finding the right abstractions is hard, and handling “legacy” applications is even harder
42
Vision for Provenance in the Large
• Universal knowledge management and production systems
• Vendors integrate the provenance tracking protocol into data processing products
• Ability to run anywhere “in the Grid”
43
Virtual Data Grid Vision
44
Planned Dataset Model
[Diagram labels: <FORM /FORM>; File; Set of files; Relational query or spreadsheet range; XML Element; Set of files with relational index; Object closure; New user-defined dataset type]
Speculative model described in CIDR 2003 paper by Foster, Voeckler, Wilde and Zhao
45
Planned Dataset Type Model
[Diagram labels: FileDataset; File; FileSet; MultiFileSet; TarFileSet; EventCollection; RawEventSet; SimulatedEventSet; MonteCarlo Simulation; DiscreteEvent Simulation; Representational; Logical]
(Nonleaf Types are Superclasses)
46
Provenance Server Plans
• OGSA-based Grid services
  – Discovery, security, resource management
• Supports code and data discovery and workflow management
• Object names (TR, DS, TY, DV, IV) can be used as global cross-server links
• Derivations can reference remote transformations and datasets
• Structured object namespaces & object-level access control enable large VO collaboration
• Generalize transforms to describe service calls, database queries and language interpreters
47
Provenance Hyperlinks
48
Indexing Servers to Support Discovery
49
For Information and Software
• Virtual Data System
  – www.griphyn.org/chimera – Chimera Virtual Data System: Overview, papers, software
• Grids and Grid Software
  – www.ivdgl.org/grid2003 – Using Grid3
  – www.griphyn.org/vdt – Virtual Data Toolkit
  – www.globus.org – The Globus Toolkit
  – www.cs.wisc.edu/condor – The Condor Project
  – www.ppdg.net – Particle Physics Data Grid
50
Acknowledgements
GriPhyN, iVDGL, and QuarkNet (in part) are supported by the National Science Foundation.
The Globus Alliance, PPDG, and QuarkNet are supported in part by the US Department of Energy, Office of Science; by the NASA Information Power Grid program; and by IBM.