The GriPhyN Virtual Data System
GRIDS Center Community Workshop
Michael Wilde, Argonne National Laboratory
24 June 2005

Acknowledgements

…many thanks to the entire Trillium / OSG Collaboration, the iVDGL and OSG teams, the Virtual Data Toolkit team, and all of our application science partners in ATLAS, CMS, LIGO, SDSS, Dartmouth DBIC and fMRIDC, SCEC, and Argonne's Computational Biology and Climate Science groups of the Mathematics and Computer Science Division.

The Virtual Data System group is:
- ISI/USC: Ewa Deelman, Carl Kesselman, Gaurang Mehta, Gurmeet Singh, Mei-Hui Su, Karan Vahi
- U of Chicago: Catalin Dumitrescu, Ian Foster, Luiz Meyer (UFRJ, Brazil), Doug Scheftner, Jens Voeckler, Mike Wilde, Yong Zhao

GriPhyN and iVDGL are supported by the National Science Foundation. Many of the research efforts involved in this work are supported by the US Department of Energy, Office of Science.

The GriPhyN Project

Enhance scientific productivity through:
- Discovery, application and management of data and processes at petabyte scale
- Using a worldwide data grid as a scientific workstation

The key to this approach is Virtual Data: creating and managing datasets through workflow "recipes" and provenance recording.

Virtual Data Example: Galaxy Cluster Search

(Figure: galaxy cluster search DAG and galaxy cluster size distribution, computed from Sloan data.)

Work of Jim Annis, Steve Kent, Vijay Sehkri (Fermilab) and Michael Milligan, Yong Zhao (University of Chicago).

A virtual data glossary
- virtual data: defining data by the logical workflow needed to create it virtualizes it with respect to location, existence, failure, and representation
- VDS (Virtual Data System): the tools to define, store, manipulate and execute virtual data workflows
- VDT (Virtual Data Toolkit): a larger set of tools, based on NMI, that provides the Grid environment in which VDL workflows run
- VDL (Virtual Data Language): a language (text and XML) that defines the functions and function calls of a virtual data workflow
- VDC (Virtual Data Catalog): the database and schema that store VDL definitions

What must we "virtualize" to compute on the Grid?
- Location-independent computing: represent all workflow in abstract terms
- Declarations not tied to specific entities: sites, file systems, schedulers
- Failures: automated retry when data servers and execution sites are unavailable (sketched below)
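
To make this concrete, here is a minimal sketch in Python of the idea, not the VDS implementation: every name below is hypothetical. The job is declared only in terms of logical files and a logical transformation; a planner binds it to whichever site is available and retries elsewhere on failure.

    import random
    from dataclasses import dataclass

    @dataclass
    class Job:
        transformation: str   # logical transformation name, e.g. "align_warp"
        inputs: list          # logical file names: no hosts, no paths
        outputs: list

    @dataclass
    class Site:
        name: str
        def run(self, job):
            # Simulate transient unavailability to motivate automated retry.
            if random.random() < 0.3:
                raise RuntimeError(f"{self.name} unavailable")
            print(f"[{self.name}] ran {job.transformation}")

    def execute(job, sites, retries=3):
        """Bind the abstract job to any working site; retry elsewhere on failure."""
        for attempt in range(retries):
            site = random.choice(sites)
            try:
                site.run(job)
                return site.name
            except RuntimeError as err:
                print(f"retry {attempt + 1}: {err}")
        raise RuntimeError("all retries exhausted")

    execute(Job("align_warp", ["subj.img"], ["subj.warped.img"]),
            [Site("site-a"), Site("site-b")])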

Expressing Workflow in VDL

A TR ("transformation") defines a function wrapper around an application, with formal arguments; a DV ("derivation") defines a call, providing actual argument values for one invocation. Applications are connected via output-to-input dependencies (here, file1 -> grep -> file2 -> sort -> file3):

    TR grep( in a1, out a2 ) {
      argument stdin  = ${a1};
      argument stdout = ${a2};
    }
    TR sort( in a1, out a2 ) {
      argument stdin  = ${a1};
      argument stdout = ${a2};
    }

    DV grep( a1=@{in: "file1"}, a2=@{out: "file2"} );
    DV sort( a1=@{in: "file2"}, a2=@{out: "file3"} );
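
For intuition, these derivations compute the equivalent of the shell pipeline "grep < file1 > file2; sort < file2 > file3". The dependency between the two calls is never stated explicitly; it follows from the shared logical file name, roughly as in this illustrative (non-VDS) Python sketch:

    # Derive the grep -> sort ordering by matching each call's inputs
    # against the outputs of other calls.
    calls = [
        {"tr": "grep", "in": ["file1"], "out": ["file2"]},
        {"tr": "sort", "in": ["file2"], "out": ["file3"]},
    ]
    producers = {lfn: c["tr"] for c in calls for lfn in c["out"]}
    edges = [(producers[lfn], c["tr"])
             for c in calls for lfn in c["in"] if lfn in producers]
    print(edges)   # [('grep', 'sort')]: sort must wait for grep's output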

Essence of VDL
- Elevates specification of computation to a logical, location-independent level
- Acts as an "interface definition language" at the shell/application level
- Can express composition of functions
- Codable in textual and XML form
- Often machine-generated to provide ease of use and higher-level features
- Preprocessor provides iteration and variables

Compound Workflow
- Complex structure: fan-in, fan-out; "left" and "right" can run in parallel
- Uses an input file, registered with the RC (replica catalog)
- Supports complex file dependencies that glue the workflow together

(Figure: a diamond-shaped DAG of preprocess, findrange, and analyze steps.)

Compound Transformations for nesting Workflows

A compound TR encapsulates an entire sub-graph:

    TR rangeAnalysis( in fa, p1, p2, out fd,
                      io fc1, io fc2, io fb1, io fb2 ) {
      call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] );
      call findrange( a1=${in:fb1}, a2=${in:fb2},
                      name="LEFT",  p=${p1}, b=${out:fc1} );
      call findrange( a1=${in:fb1}, a2=${in:fb2},
                      name="RIGHT", p=${p2}, b=${out:fc2} );
      call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} );
    }

Compound Transformations (cont.)

Multiple DVs allow easy generator scripts (a sketch of such a script follows):

    DV d1->rangeAnalysis( p2="100", p1="0" );
    DV d2->rangeAnalysis( p2=" ",   p1="0" );
    ...
    DV d70->rangeAnalysis( p2="800", p1="18" );
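
Lists like this are rarely written by hand. A minimal Python sketch of such a generator, with entirely hypothetical sweep values (the real parameter ranges are not given in the source):

    # Emit one DV per (p1, p2) pair into a VDL file.
    with open("sweep.vdl", "w") as f:
        n = 0
        for p1 in range(0, 20, 2):           # illustrative range only
            for p2 in range(100, 900, 100):  # illustrative range only
                n += 1
                f.write(f'DV d{n}->rangeAnalysis( p2="{p2}", p1="{p1}" );\n')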

Using VDL
- Generated directly for low-volume usage
- Generated by scripts for production use
- Generated by application tool builders as wrappers around scripts provided for community use
- Generated transparently in an application-specific portal (e.g. quarknet.fnal.gov/grid)
- Generated by drag-and-drop workflow design tools such as Triana

Basic VDL Toolkit
- Convert between text and XML representation
- Insert, update, remove definitions from a virtual data catalog
- Attach metadata annotations to definitions
- Search for definitions
- Generate an abstract workflow for a data derivation request
- Multiple interface levels provided: Java API, command line, web service (see the sketch below)
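
For instance, the command-line level can be scripted. A hedged sketch that wraps the xsearchvdc query shown later in this deck (the arguments are copied verbatim from that slide; the Python wrapper itself is illustrative, not part of the toolkit):

    import subprocess

    # Ask the VDC which transformations can process a "subject image".
    result = subprocess.run(
        ["xsearchvdc", "-q", "tr_meta", "dataType", "subject_image", "input"],
        capture_output=True, text=True)
    print(result.stdout)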

Representing Workflow
- Specifies a set of activities and control flow
- Sequences information transfer between activities
- VDS uses an XML-based notation called DAX ("DAG in XML")
- The VDC represents a wide range of workflow possibilities
- A DAX document represents the steps to create a specific data product
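
As an illustration of the idea only (this sketch simplifies the element and attribute names; it is not the exact DAX schema), a DAX-like document for the diamond workflow can be emitted with a few lines of Python:

    import xml.etree.ElementTree as ET

    # Jobs plus parent/child edges, in the spirit of a DAX document.
    dax = ET.Element("adag", name="diamond")
    for jid, tr in [("j1", "preprocess"), ("j2", "findrange"),
                    ("j3", "findrange"), ("j4", "analyze")]:
        ET.SubElement(dax, "job", id=jid, transformation=tr)
    for child, parents in [("j2", ["j1"]), ("j3", ["j1"]), ("j4", ["j2", "j3"])]:
        c = ET.SubElement(dax, "child", ref=child)
        for p in parents:
            ET.SubElement(c, "parent", ref=p)
    print(ET.tostring(dax, encoding="unicode"))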

Executing VDL Workflows

(Diagram: a VDL program is stored in the Virtual Data Catalog; the virtual data workflow generator produces a workflow spec as an abstract workflow; the local planner creates an execution plan, as a statically partitioned or dynamically planned DAGman DAG; DAGman & Condor-G execute it on the Grid, with job planner and job cleanup steps.)
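
The execution model at the end of that pipeline is a topological walk of the DAG: a job is released once all of its parents have finished. A toy Python sketch of that loop (illustrative only; in VDS the real work is done by DAGman and Condor-G):

    from collections import deque

    # Parent lists for the diamond workflow.
    parents = {"preprocess": [], "left": ["preprocess"],
               "right": ["preprocess"], "analyze": ["left", "right"]}
    done, ready = set(), deque(j for j, p in parents.items() if not p)
    while ready:
        job = ready.popleft()
        print("submit", job)      # in VDS, a Condor-G submission
        done.add(job)             # assume the job succeeds
        for j, p in parents.items():
            if j not in done and j not in ready and all(x in done for x in p):
                ready.append(j)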

OSG: The "target chip" for VDS Workflows

Supported by the National Science Foundation and the Department of Energy.

VDS Supported Via the Virtual Data Toolkit

(Diagram: sources from CVS, contributed by many, are patched into GPT source bundles; the NMI Build & Test Condor pool builds, tests, and packages them on 22+ operating systems; the resulting VDT build is distributed as a Pacman cache, RPMs, and binaries.)

A unique laboratory for testing, supporting, deploying, packaging, upgrading, and troubleshooting complex sets of software! (Slide courtesy of Paul Avery, UFL.)

Collaborative Relationships

(Diagram: computer science research feeds techniques & software into the Virtual Data Toolkit, which flows through production deployment and tech transfer to partner science, networking, and outreach projects and the larger science community: Globus, Condor, NMI, iVDGL, PPDG, EU DataGrid, LHC experiments, QuarkNet, CHEPREO, Digital Divide. Requirements and prototyping & experiments feed back; other linkages include work force, CS researchers, industry, U.S. grids, and international outreach.)

(Slide courtesy of Paul Avery, UFL.)

VDS Applications

    Application                          Jobs/workflow   Levels   Status
    ATLAS: HEP event simulation          500K            1        In use
    LIGO: inspiral/pulsar                ~700            2-5      Inspiral in use
    NVO/NASA: Montage/morphology         1000s           7        Both in use
    GADU: genomics (BLAST, …)            40K             1        In use
    fMRI DBIC: AIRSN image processing    100s            12       In devel
    QuarkNet: cosmic-ray science         <10             3-6      In use
    SDSS: coadd; cluster search          40K; 500K       2; 8     In devel / CS research
    FOAM: ocean/atmosphere model         2000 (core app runs CPU jobs)   3   In use
    GTOMO: image processing              1000s           1        In devel
    SCEC: earthquake simulation          1000s                    In use

A Case Study – Functional MRI
- Problem: "spatial normalization" of images to prepare data from fMRI studies for analysis
- Target community is approximately 60 users at the Dartmouth Brain Imaging Center
- Wish to share data and methods across the country with researchers at Berkeley
- Process data from arbitrary user and archival directories in the center's AFS space; bring data back to the same directories
- The Grid needs to be transparent to the users: literally, "Grid as a workstation"

A Case Study – Functional MRI (2)
- Based the workflow on a shell script that performs a 12-stage process on a local workstation
- Adopted a replica naming convention for moving users' data to Grid sites
- Created a VDL preprocessor to iterate transformations over datasets
- Utilized resources across two distinct grids: Grid3 and the Dartmouth Green Grid

Functional MRI Analysis

(Workflow graphic courtesy of James Dobson, Dartmouth Brain Imaging Center.)

fMRI Dataset Processing

The preprocessor's FOREACH expands one derivation per BOLD sequence; the softmean derivation then averages the results (values lost in transcription are left elided):

    FOREACH BOLDSEQ
      DV reorient (   # Process Blood O2 Level Dependent Sequence
        input  = [ @{in:  "$BOLDSEQ.hdr"} ],
        output = [ @{out: "$CWD/FUNCTIONAL/r$BOLDSEQ.hdr"} ],
        direction = "y",
      );
    END

    DV softmean (
      input = [ FOREACH BOLDSEQ ... END ],
      mean  = [ ... ]
    );
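
A sketch of the expansion itself: this is not the actual VDS preprocessor, and the sequence names are hypothetical, but it shows how FOREACH turns one template into a DV per dataset:

    # Expand a FOREACH-style template into one DV per BOLD sequence.
    bold_seqs = ["bold1", "bold2", "bold3"]   # hypothetical sequence names
    for s in bold_seqs:
        print(f'DV reorient( input=[ @{{in: "{s}.hdr"}} ], '
              f'output=[ @{{out: "$CWD/FUNCTIONAL/r{s}.hdr"}} ], '
              f'direction="y" );')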

fMRI Virtual Data Queries

Which transformations can process a "subject image"?
  Q: xsearchvdc -q tr_meta dataType subject_image input
  A: fMRIDC.AIR::align_warp

List anonymized subject images for young subjects:
  Q: xsearchvdc -q lfn_meta dataType subject_image privacy anonymized subjectType young
  A: _anonymized.img

Show files that were derived from a patient image:
  Q: xsearchvdc -q lfn_tree _anonymized.img
  A: _anonymized.img, _anonymized.sliced.hdr, atlas.hdr, atlas.img, … atlas_z.jpg, _anonymized.sliced.img
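
Under the covers, the lineage query amounts to a transitive walk over derivation records. A hedged Python sketch with a hypothetical SQLite schema (the real VDC stores VDL definitions and is queried with xsearchvdc, as above):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE derived (input TEXT, output TEXT)")
    db.executemany("INSERT INTO derived VALUES (?, ?)", [
        ("anonymized.img", "anonymized.sliced.img"),     # toy records
        ("anonymized.sliced.img", "atlas.img"),
        ("atlas.img", "atlas_z.jpg"),
    ])

    def descendants(lfn):
        """All files transitively derived from lfn."""
        out = [r[0] for r in db.execute(
            "SELECT output FROM derived WHERE input = ?", (lfn,))]
        for o in list(out):
            out += descendants(o)
        return out

    print(descendants("anonymized.img"))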

US-ATLAS Data Challenge 2

(Chart: CPU-days of event generation using Virtual Data, from mid-July to September 10.)

Provenance for DC2

How much compute time was delivered?

    CPU-years   month   year
    0.45        6       2004
    20          7       2004
    34          8       2004
    40          9       2004
    15          10      2004
    15          11      2004
    8.9         12      2004

Selected statistics for one of these jobs (several values were lost in transcription):

    start: :33:56    duration:    pid: 6123    exitcode: 0
    args: JobTransforms /share/dc2.g4sim.filter.trf CPE_6785_ dc2_B4_filter_frag.txt
    utime:    stime:    minflt:    majflt:

Other provenance queries: Which Linux kernel releases were used? How many jobs were run on a given Linux kernel?
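
The monthly totals are a straightforward aggregation over per-job invocation records. A sketch of that arithmetic, with job records invented purely for illustration:

    from collections import defaultdict

    # (year, month, duration in seconds) per job, from invocation records.
    jobs = [(2004, 7, 3600 * 48), (2004, 7, 3600 * 36), (2004, 8, 3600 * 60)]
    cpu_years = defaultdict(float)
    for year, month, secs in jobs:
        cpu_years[(year, month)] += secs / (365.25 * 86400)
    for (y, m), v in sorted(cpu_years.items()):
        print(f"{y}-{m:02d}: {v:.4f} CPU-years")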

LIGO Inspiral Search Application

(Figure: the inspiral search workflow.) The inspiral workflow application is the work of Duncan Brown (Caltech), Scott Koranda (UW-Milwaukee), and the LSC inspiral group.

Small Montage Workflow

~1200-node workflow, 7 levels. Mosaic of M42 created on the TeraGrid using Pegasus.

FOAM: Fast Ocean/Atmosphere Model

A 250-member ensemble run on the TeraGrid under VDS. (Diagram: for each ensemble member 1…N, remote directory creation, then the FOAM run, then atmosphere, ocean, and coupled postprocessing; results are transferred to archival storage.)

Work of Rob Jacob (FOAM) and Veronica Nefedova (workflow design and execution).

FOAM: TeraGrid/VDS Benefits

(Chart comparing runs on a climate supercomputer with runs on the TeraGrid using NMI and VDS.) Visualization courtesy of Pat Behling and Yun Liu, UW-Madison.

NMI Tools Experience
- GRAM & Grid Information System: tools needed to facilitate application deployment, debugging, and maintenance across sets of sites; "gstar" prototype at osg.ivdgl.org/twiki/bin/view/GriphynMainTWiki/GstarToolkit (work of Jed Dobson and Jens Voeckler)
- Condor-G/DAGman: efforts under way to provide a means to dynamically extend a running DAG; also research exploring the influence of scheduling parameters on DAG throughput and responsiveness
- Site selection: automated, opportunistic approaches being designed and evaluated; policy-based approaches are a research topic (Dumitrescu, others)
- RLS namespace: needed to extend data archives to the Grid and provide application transparency; some efforts under way to prototype this
- Job execution records (accounting): several different efforts; desire to unify in OSG

Conclusion
- Using VDL to express location-independent computing is proving effective: science users save time by using it over ad-hoc methods, and VDL automates many complex and tedious aspects of distributed computing
- Proving capable of expressing workflows across numerous sciences and diverse data models: HEP, genomics, astronomy, biomedical
- Makes possible new capabilities and methods for data-intensive science based on its uniform provenance model
- Provides an abstract front end for Condor workflow, automating DAG creation

Next Steps
- Unified representation of datasets, metadata, provenance, and mappings to physical storage
- Improved queries to discover existing products and to perform incremental work (versioning)
- Improved error handling and diagnosis: VDS is like a compiler whose target chip architecture is the Grid; this is a tall order, and much work remains
- Leverage XQuery to formulate new workflows from those in a VO's catalogs