The Virtual Data Grid: A New Model and Architecture for Data-Intensive Collaboration Summer Grid 2004 UT Brownsville South Padre Island Center 24 June.


The Virtual Data Grid: A New Model and Architecture for Data-Intensive Collaboration Summer Grid 2004 UT Brownsville South Padre Island Center 24 June 2004 Mike Wilde Argonne National Laboratory Mathematics and Computer Science Division

Summer Grid June, UTB/SPI 2 GriPhyN: Grid Physics Network Mission
Enhance scientific productivity through discovery and processing of datasets, using the grid as a scientific workstation.
Virtual Data enables this approach by creating datasets from workflow “recipes” and recording their provenance.
GriPhyN works to “cross the chasm”: application and computer scientists create and field-test paradigms and toolkits together.

Summer Grid June, UTB/SPI 3 Acknowledgements: Virtual Data is a Large Team Effort
The Chimera Virtual Data System is the work of Ian Foster, Jens Voeckler, Mike Wilde and Yong Zhao.
The Pegasus Planner is the work of Ewa Deelman, Gaurang Mehta, and Karan Vahi.
Applications described are the work of many people, including: James Annis, Rick Cavanaugh, Dan Engh, Rob Gardner, Albert Lazzarini, Natalia Maltsev, Marge Bardeen, and their wonderful teams.

Summer Grid June, UTB/SPI 4 Virtual Data Scenario
[Figure: workflow diagram connecting simulate, reformat, conv, summarize, and psearch through file1 to file8]
On-demand data generation. Update workflow following changes. Manage workflow.
Explain provenance, e.g. for file8:
psearch -t 10 -i file3 file4 file5 -o file8
summarize -t 10 -i file6 -o file7
reformat -f fz -i file2 -o file3 file4 file5
conv -l esd -o aod -i file2 -o file6
simulate -t 10 -o file1 file2

Summer Grid June, UTB/SPI 5 Virtual Data Describes analysis workflow
• The recorded virtual data “recipe” here is:
  – Files: 8 < (1,3,4,5,7), 7 < 6, (3,4,5,6) < 2
  – Programs: 8 < psearch, 7 < summarize, (3,4,5) < reformat, 6 < conv, (1,2) < simulate
[Figure: the workflow diagram from the Virtual Data Scenario slide, with file8 marked as the requested dataset]

Summer Grid June, UTB/SPI 6 Virtual Data Describes analysis workflow
• To re-create file8: step 1
  – simulate > file1, file2
[Figure: the same workflow diagram, file8 marked as the requested file]

Summer Grid June, UTB/SPI 7 Virtual Data Describes analysis workflow
• To re-create file8: step 2
  – Files 3, 4, 5, 6 are derived from file2
  – reformat > file3, file4, file5
  – conv > file6
[Figure: the same workflow diagram, file8 marked as the requested file]

Summer Grid June, UTB/SPI 8 Virtual Data Describes analysis workflow
• To re-create file8: step 3
  – File7 depends on file6
  – summarize > file7
[Figure: the same workflow diagram, file8 marked as the requested file]

Summer Grid June, UTB/SPI 9 Virtual Data Describes analysis workflow
• To re-create file8: final step
  – File8 depends on files 1, 3, 4, 5, 7
  – psearch > file8
[Figure: the same workflow diagram, file8 marked as the requested file]
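
The step-by-step re-creation of file8 amounts to walking the recorded dependencies back from the requested file and then running the programs in dependency order. The following is a minimal Python sketch of that idea, not Chimera code: it assumes the recipe from the "recorded virtual data recipe" slide is held in two plain dictionaries, and the names `DEPENDS_ON`, `PRODUCED_BY`, and `plan` are illustrative only.

```python
# File dependencies from the slide's recipe:
#   8 < (1,3,4,5,7), 7 < 6, (3,4,5,6) < 2
DEPENDS_ON = {
    "file8": ["file1", "file3", "file4", "file5", "file7"],
    "file7": ["file6"],
    "file3": ["file2"],
    "file4": ["file2"],
    "file5": ["file2"],
    "file6": ["file2"],
    "file1": [],
    "file2": [],
}

# Which program writes each file, also from the slide's recipe.
PRODUCED_BY = {
    "file1": "simulate", "file2": "simulate",
    "file3": "reformat", "file4": "reformat", "file5": "reformat",
    "file6": "conv", "file7": "summarize", "file8": "psearch",
}


def plan(target):
    """Return the programs needed to re-create `target`, in a runnable order.

    Assumption: all outputs of one program share the same inputs, which
    holds for this recipe, so each program is scheduled once.
    """
    order = []
    seen = set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dependency in DEPENDS_ON[name]:
            visit(dependency)          # inputs first
        program = PRODUCED_BY[name]
        if program not in order:
            order.append(program)      # then the producing program

    visit(target)
    return order


print(plan("file8"))
```

For this recipe the resulting order is simulate, reformat, conv, summarize, psearch, which matches the four steps on the slides.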

Summer Grid June, UTB/SPI 10 Grid3 – The Laboratory Supported by the National Science Foundation and the Department of Energy.

Summer Grid June, UTB/SPI 11 VDL: Virtual Data Language Describes Data Transformations
• Transformation
  – Abstract template of program invocation
  – Similar to "function definition"
• Derivation
  – “Function call” to a Transformation
  – Store past and future:
    > A record of how data products were generated
    > A recipe of how data products can be generated
• Invocation
  – Record of a Derivation execution
• These XML documents reside in a “virtual data catalog” (VDC), a relational database
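
One way to picture the three VDL object kinds is as three small record types. The sketch below is a loose Python analogy under stated assumptions, not the VDL schema: the field names, the `unbound_arguments` helper, and the sample bindings of `file1` and `file2` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Transformation:
    """Abstract template of a program invocation (like a function definition)."""
    name: str
    inputs: tuple
    outputs: tuple


@dataclass
class Derivation:
    """A 'function call' to a Transformation: binds formal arguments to files."""
    name: str
    transformation: Transformation
    bindings: dict


@dataclass
class Invocation:
    """Record of one execution of a Derivation (fields are illustrative)."""
    derivation: Derivation
    exit_code: int
    wall_seconds: float


def unbound_arguments(derivation):
    """List formal arguments of the transformation that the derivation leaves unset."""
    tr = derivation.transformation
    formals = set(tr.inputs) | set(tr.outputs)
    return sorted(formals - set(derivation.bindings))


tr1 = Transformation("tr1", inputs=("a1",), outputs=("a2",))
x1 = Derivation("x1", tr1, {"a1": "file1", "a2": "file2"})
run = Invocation(x1, exit_code=0, wall_seconds=1.5)

print(unbound_arguments(x1))
```

A derivation can exist before it has ever run (a recipe) or after (a record); only the Invocation ties it to a specific execution.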

Summer Grid June, UTB/SPI 12 VDL Describes Workflow via Data Dependencies
TR tr1(in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
TR tr2(in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
[Figure: two derivations (DV), x1 and x2, linking file1, file2, and file3]

Summer Grid June, UTB/SPI 13 Workflow example
• Graph structure
  – Fan-in
  – Fan-out
  – "left" and "right" can run in parallel
• Needs external input file
  – Located via replica catalog
• Data file dependencies
  – Form graph structure
[Figure: workflow graph with preprocess, findrange, and analyze nodes]
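
The point that data file dependencies form the graph structure can be sketched in a few lines of Python: no edges are declared, they are inferred by matching each step's inputs to whichever step writes that file. This is an illustration only. The step names `preprocess` and `analyze` and the file names (`f.a`, `f.b1`, and so on) are assumptions chosen to resemble the slides, and `build_graph` and `levels` are hypothetical helpers, not part of any VDL tool.

```python
# Each step lists only the files it reads and writes; no edges are declared.
STEPS = {
    "preprocess": {"in": ["f.a"], "out": ["f.b1", "f.b2"]},
    "left": {"in": ["f.b1", "f.b2"], "out": ["f.c1"]},
    "right": {"in": ["f.b1", "f.b2"], "out": ["f.c2"]},
    "analyze": {"in": ["f.c1", "f.c2"], "out": ["f.d"]},
}


def build_graph(steps):
    """Infer step-to-step edges by matching inputs to the step that writes them."""
    producer = {f: name for name, step in steps.items() for f in step["out"]}
    parents = {name: set() for name in steps}
    external = set()
    for name, step in steps.items():
        for f in step["in"]:
            if f in producer:
                parents[name].add(producer[f])
            else:
                external.add(f)  # must be located elsewhere, e.g. a replica catalog
    return parents, external


def levels(parents):
    """Group steps into waves; steps in the same wave can run in parallel."""
    remaining = dict(parents)
    done = set()
    waves = []
    while remaining:
        ready = sorted(n for n, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cycle in data dependencies")
        waves.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return waves


parents, external = build_graph(STEPS)
print(levels(parents))
print(external)
```

The fan-out after `preprocess` and the fan-in before `analyze` both fall out of the file lists, and "left" and "right" land in the same wave.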

Summer Grid June, UTB/SPI 14 Complete VDL workflow
• Generate appropriate derivations
DV out:"f.b2"} ], );
DV left->findrange( name="left", p="0.5" );
DV right->findrange( name="right" );
DV );

Summer Grid June, UTB/SPI 15 Compound Transformations Enable Functional Abstractions
• Compound TR encapsulates an entire sub-graph:
TR rangeAnalysis (in fa, p1, p2, out fd, io fc1, io fc2, io fb1, io fb2, ) {
  call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] );
  call findrange( a1=${in:fb1}, a2=${in:fb2}, name="LEFT", p=${p1}, b=${out:fc1} );
  call findrange( a1=${in:fb1}, a2=${in:fb2}, name="RIGHT", p=${p2}, b=${out:fc2} );
  call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} );
}

Summer Grid June, UTB/SPI 16 Derivation scripts
• Representation of virtual data provenance:
DV d1->diamond( p2="100", p1="0" );
DV d2->diamond( p2=" ", p1="0" );
...
DV d70->diamond( p2="800", p1="18" );
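
A derivation script like this is usually generated rather than typed. The Python sketch below shows one way that could be done. It is an assumption-laden illustration: the slide does not say how its 70 derivations were produced or which parameter values lie between the first and the last, so the small grid used here is hypothetical and only the statement format is taken from the slide.

```python
from itertools import product


def derivation(index, p1, p2):
    """Format one DV statement in the style shown on the slide."""
    return f'DV d{index}->diamond( p2="{p2}", p1="{p1}" );'


def sweep(p1_values, p2_values):
    """Emit one derivation per (p1, p2) combination, numbered from d1."""
    pairs = product(p1_values, p2_values)
    return [derivation(i, p1, p2) for i, (p1, p2) in enumerate(pairs, start=1)]


# Hypothetical 2 x 2 grid; the real parameter values are not given in the source.
for line in sweep([0, 18], [100, 800]):
    print(line)
```

Because each statement records its own parameters, the script doubles as a provenance record for every dataset the sweep produces.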

Summer Grid June, UTB/SPI 17 Invocation Provenance
• Completion status and resource usage
• Attributes of executable transformation
• Attributes of input and output files

Summer Grid June, UTB/SPI 18 Executing VDL Workflows
[Figure: execution pipeline with labels: abstract workflow; local planner; global planner “Pegasus”; “jit” planner (research); concrete DAG; DAGMan / Condor-G; Grid info]

Summer Grid June, UTB/SPI 19 GriPhyN-iVDGL Applications to date
• ATLAS, BTeV, CMS – HEP event simulation
• Argonne Computational Biology – sequence comparison and result capture
• LIGO – pulsar search
• Sloan Digital Sky Survey – cluster finding; near-earth object search planned
• QuarkNet – science education – cosmic rays, HEP analysis

Summer Grid June, UTB/SPI 20 Genome Analysis Database Update
Application work by Alex Rodriguez, Dina Sulakhe, Natalia Maltsev, Argonne MCS. Described in GGF10 workshop paper.

Summer Grid June, UTB/SPI 21 Virtual Data Example: Galaxy Cluster Search
[Figure: Sloan data; workflow DAG; galaxy cluster size distribution]
Jim Annis, Steve Kent, Vijay Sekhri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago. Described in SC2002 paper.

Summer Grid June, UTB/SPI 22 Cluster Search Workflow Graph and Execution Trace Workflow jobs vs time

Summer Grid June, UTB/SPI 23 Virtual Data Application: High Energy Physics Data Analysis
[Figure: derived datasets labeled by their parameter settings, from mass = 200 alone to combinations with decay (WW, ZZ, bb), stability, event, plot, LowPt, and HighPt]
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida. Ref: CHEP 2002 paper.

Summer Grid June, UTB/SPI 24 Using Virtual Data for Science Education
• The QuarkNet-Trillium collaboration is using Grid virtual data tools and methods to enrich science education
• It's an experiment to give students the means to:
  – discover and apply datasets, algorithms, and data analysis methods
  – collaborate by developing new ones and sharing results and observations
  – learn data analysis methods that will ready and excite them for a scientific career
• And in later steps, we may actually use the Grid!

Summer Grid June, UTB/SPI 25 Quarknet Virtual Data Project
Enabling collaboration-intensive science discovery with virtual data tools and methods
[Figure: three sites, each with student/teacher teams, a cosmic ray detector, and locally collected data (Central High School, Reston, Virginia; Yale / Middletown High Collaboration, Hartford, Connecticut; Foothills High School, Great Falls, Montana), connected by standard Web access to the Quarknet Virtual Data Portal (student data, algorithms, results, notes, and communications), together with the Virtual Data Toolkit and the Virtual Data Catalog]
Student/teacher teams sharing data, methods, programs, and knowledge

Summer Grid June, UTB/SPI 26 Detector Performance Study

Summer Grid June, UTB/SPI 27 Example: BTeV Event Simulation

Summer Grid June, UTB/SPI 28 Support for Search and Discovery
• Goal: make it as easy to use as Google
• More advanced capabilities lie below the surface (as with Google)
• Understand the structure and meaning of the datasets and their fields
• Advanced search, using SQL-like queries
• Find both DATA and TRANSFORMATIONS
• Create datasets from queries
• Perform calculations on datasets, filtering results to look for patterns
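
Since the virtual data catalog is described earlier as a relational database, a metadata search over both data and transformations can be pictured as an ordinary SQL query. The sketch below uses Python's standard `sqlite3` module with an in-memory database. The `annotation` table, its columns, and the sample rows are hypothetical and are not the actual catalog schema; only the two transformation names are taken from the provenance listing later in the deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per (object, attribute, value) annotation.
conn.execute(
    "CREATE TABLE annotation (name TEXT, kind TEXT, attr TEXT, val TEXT)"
)
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?, ?)",
    [
        ("run1.raw", "data", "source", "cosmic-ray-detector"),
        ("run2.raw", "data", "source", "cosmic-ray-detector"),
        ("ECalEnergySum", "transformation", "topic", "energy"),
        ("ReconTotalEnergy", "transformation", "topic", "energy"),
    ],
)


def search(kind, attr, val):
    """Find catalog objects of one kind whose annotation matches attr = val."""
    rows = conn.execute(
        "SELECT name FROM annotation "
        "WHERE kind = ? AND attr = ? AND val = ? ORDER BY name",
        (kind, attr, val),
    )
    return [row[0] for row in rows]


print(search("data", "source", "cosmic-ray-detector"))
print(search("transformation", "topic", "energy"))
```

The same query shape serves both halves of the "find both DATA and TRANSFORMATIONS" goal, and the result list could itself be stored as a new dataset.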

Summer Grid June, UTB/SPI 29 Search by Metadata

Summer Grid June, UTB/SPI 30 Deriving a new dataset …to find mass of “z” particle:

Summer Grid June, UTB/SPI 31 Workflow for missing energy calculations

Summer Grid June, UTB/SPI 32 Virtual Provenance: list of derivations and files
<job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="5" dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum">
<job id="ID000002" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="7" dv-namespace="Quarknet.HEPSRCH" …
…
<job id="ID000014" namespace="Quarknet.HEPSRCH" name="ReconTotalEnergy" level="3"…
…
(excerpted for display)
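
A listing like this can be read back with any XML parser. The Python sketch below uses the standard `xml.etree.ElementTree` module on a small, well-formed sample built from the attributes visible on the slide. The `<workflow>` root element and the self-closing `<job/>` form are assumptions made so the sample parses; attributes elided on the slide are simply left out, and `jobs_by_level` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Well-formed sample; the <workflow> wrapper is an assumption, not from the slide.
SAMPLE = """
<workflow>
  <job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="5"
       dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum"/>
  <job id="ID000002" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="7"
       dv-namespace="Quarknet.HEPSRCH"/>
  <job id="ID000014" namespace="Quarknet.HEPSRCH" name="ReconTotalEnergy" level="3"/>
</workflow>
"""


def jobs_by_level(xml_text):
    """Return (level, id, name) for every job element, sorted by level."""
    root = ET.fromstring(xml_text)
    jobs = [
        (int(job.get("level")), job.get("id"), job.get("name"))
        for job in root.iter("job")
    ]
    return sorted(jobs)


for level, job_id, name in jobs_by_level(SAMPLE):
    print(level, job_id, name)
```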

Summer Grid June, UTB/SPI 33 Virtual Provenance in XML: control flow graph … … … … … (excerpted for display…)

And writing the results up in a “poster”

Summer Grid June, UTB/SPI 35 Poster describing analysis

Summer Grid June, UTB/SPI 36 Using active data from Web Services

Summer Grid June, UTB/SPI 37

Summer Grid June, UTB/SPI 38

Summer Grid June, UTB/SPI 39

Summer Grid June, UTB/SPI 40 Levels of Interaction
• “Skins” – use it like a calculator, experiment with scenarios and settings, use virtual data like a log book to document, assess, and share parameter values
• “Blocks” – re-assemble workflow pipelines using existing ones as patterns and pre-developed transforms as building blocks
• “Code” – write new transforms in a variety of languages and data models

Summer Grid June, UTB/SPI 41 Observations
• A provenance approach based on interface definition and data flow declaration fits well with Grid requirements for code and data transportability and heterogeneity
• Working in a provenance-managed system has many fringe benefits: uniformity, precision, structure, communication, documentation
• The real world is messy – finding the right abstractions is hard, and handling “legacy” applications is even harder

Summer Grid June, UTB/SPI 42 Vision for Provenance in the Large
• Universal knowledge management and production systems
• Vendors integrate the provenance tracking protocol into data processing products
• Ability to run anywhere “in the Grid”

Summer Grid June, UTB/SPI 43 Virtual Data Grid Vision

Summer Grid June, UTB/SPI 44 Planned Dataset Model
[Figure: example dataset types: file; set of files; relational query or spreadsheet range; XML element; set of files with relational index; object closure; new user-defined dataset type]
Speculative model described in CIDR 2003 paper by Foster, Voeckler, Wilde and Zhao

Summer Grid June, UTB/SPI 45 Planned Dataset Type Model
[Figure: dataset type hierarchy along representational and logical axes. Type names shown: Dataset, File, FileSet, MultiFileSet, TarFileSet, EventCollection, RawEventSet, SimulatedEventSet, MonteCarlo Simulation, DiscreteEvent Simulation. Nonleaf types are superclasses.]

Summer Grid June, UTB/SPI 46 Provenance Server Plans
• OGSA-based Grid services
  – Discovery, security, resource management
• Supports code and data discovery and workflow management
• Object names (TR, DS, TY, DV, IV) can be used as global cross-server links
• Derivations can reference remote transformations and datasets
• Structured object namespaces & object-level access control enable large VO collaboration
• Generalize transforms to describe service calls, database queries and language interpreters

Summer Grid June, UTB/SPI 47 Provenance Hyperlinks

Summer Grid June, UTB/SPI 48 Indexing Servers to Support Discovery

Summer Grid June, UTB/SPI 49 For Information and Software
• Virtual Data System
  – Chimera Virtual Data System: Overview, papers, software
• Grids and Grid Software
  – Using Grid3
  – Virtual Data Toolkit
  – The Globus Toolkit
  – The Condor Project
  – Particle Physics Data Grid

Summer Grid June, UTB/SPI 50 Acknowledgements
GriPhyN, iVDGL, and QuarkNet (in part) are supported by the National Science Foundation.
The Globus Alliance, PPDG, and QuarkNet are supported in part by the US Department of Energy, Office of Science; by the NASA Information Power Grid program; and by IBM.