LQCD Workflow Discussion Fermilab, 18 Dec 2006 Mike Wilde


Virtual Data Origins: The Grid Physics Network
Enhance scientific productivity through:
- Discovery, application and management of data and processes at all scales
- Using a worldwide data grid as a scientific workstation
The key to this approach is Virtual Data – creating and managing datasets through workflow "recipes" and provenance recording.

Virtual Data Workflow Abstracts Grid Details

Expressing Workflow in VDL

TR grep (in a1, out a2) {
  argument stdin  = ${a1};
  argument stdout = ${a2};
}
TR sort (in a1, out a2) {
  argument stdin  = ${a1};
  argument stdout = ${a2};
}
DV grep ( ... );
DV sort ( ... );

(Diagram: file1 -> grep -> file2 -> sort -> file3)

- Define a "function" wrapper (TR) for an application
- Define "formal arguments" for the application
- Define a "call" (DV) to invoke the application
- Provide "actual" argument values for the invocation
- Connect applications via output-to-input dependencies
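The DV statements above are truncated in this transcript, so here is a rough, non-authoritative sketch of the same grep-then-sort chain in the VDL2/SwiftScript notation used later in this deck. The procedure names (grepStep, sortStep) and the app-body argument syntax are illustrative assumptions, not the original slide content:

  type file {}

  // Hypothetical wrapper: run grep, reading stdin from i and writing stdout to o
  (file o) grepStep (file i) {
    app { grep stdin=@filename(i) stdout=@filename(o); }
  }

  // Hypothetical wrapper: run sort the same way
  (file o) sortStep (file i) {
    app { sort stdin=@filename(i) stdout=@filename(o); }
  }

  // file1 -> grep -> file2 -> sort -> file3
  file f1 <"file1">;
  file f2 <"file2">;
  file f3 <"file3">;

  f2 = grepStep(f1);
  f3 = sortStep(f2);   // the sort-after-grep ordering is inferred from the data flow

The point of both notations is the same: applications are wrapped as typed procedures, and the engine derives execution order from which outputs feed which inputs.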

Complex Workflows Run on Grid
Galaxy Image Montage Workflow: ~1200-node workflow, 7 levels
Mosaic of M42 created on the TeraGrid using Pegasus
Courtesy NASA and NVO
(Figure legend: Data Transfer vs. Compute Job nodes)

...and collecting Provenance
(Diagram: the Virtual Data Workflow Generator specifies the workflow in VDL, stored in the Virtual Data Catalog as an abstract workflow; the Pegasus Planner creates a DAGman script, which DAGman & Condor-G run; jobs such as grep and sort over file1/file2/file3 execute on Grid worker nodes under a launcher, and provenance data flows back to the provenance collector.)

Neuroscience Example
- Create a library of typed procedures for popular fMRI tools that are pre-installed across Grid sites
- Process structured data collections with simple procedures – automated iteration
- Establish a data cataloging scheme with metadata annotation and search
- Provide easy-to-use workflow execution portals – command-line environments that make clusters and Grids appear like workstations
- Work in an environment that is a fully tracked "clean room" with accurate provenance and powerful search

Example Science Driver: Functional MRI Analysis Workflow courtesy James Dobson, Dartmouth Brain Imaging Center

fMRI Dataset processing

FOREACH BOLDSEQ
  DV reorient (    # Process Blood O2 Level Dependent Sequence
    input     = [ ... "$BOLDSEQ.hdr" ... ],
    output    = [ ... "$CWD/FUNCTIONAL/r$BOLDSEQ.hdr" ... ],
    direction = "y",
    ...
  );
END

DV softmean (
  input = [ FOREACH ... END ],
  mean  = [ ... ]
);

Neuroscience workflow: spatially normalize a time series of images from a functional MRI experiment

Spatial normalization of functional run: dataset-level workflow and expanded (10-volume) workflow

VDL2 Workflow – Data types and "atomic" procedures

type file {}
type fileNames { file f[]; }

// Procedure to run "R" statistical package
(file t) bricRInvoke (file script, int iteration, file dataAll, file dataPerm) {
  app { ... iteration ... }
}

// Procedure to run AFNI Clustering tool
(file t, file v) bricCluster (file clusterScript, int iteration, file randBrain,
                              file brainFile, file specFile) {
  app { ... }
}

// Procedure to merge results based on statistical likelihoods
(file t) bricCentralize (file fn[]) {
  app { ... }
}
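The app bodies above are truncated in this transcript. Purely as a hedged illustration of what a completed atomic procedure looks like in this style, the first procedure might be filled in as follows; the executable name RInvoke and its argument order are assumptions, not the original slide content:

  // Hypothetical completion of bricRInvoke: invoke an R driver script with the
  // iteration number and two input data files, capturing the result in file t.
  (file t) bricRInvoke (file script, int iteration, file dataAll, file dataPerm) {
    app {
      RInvoke @filename(script) iteration
              @filename(dataAll) @filename(dataPerm)
              stdout=@filename(t);
    }
  }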

VDL2 Workflow – Dataset iteration procedures

// Procedure to iterate over the data collection
(fileNames randCluster, fileNames dsetReturn) brain_cluster () {
  int j[] = [1:2000];
  file dataAll;
  file dataPerm;
  file brainFile;
  file specFile;
  file randScript;
  file clusterScript;
  fileNames randBrain;
  foreach int i in j {
    randBrain.f[i] = bricRInvoke(randScript, i, dataAll, dataPerm);
    file rBrain = randBrain.f[i];
    (randCluster.f[i], dsetReturn.f[i]) =
        bricCluster(clusterScript, i, rBrain, brainFile, specFile);
  }
}

VDL2 Workflow – Main Workflow Program

// Declare datasets
fileNames randBrain;
fileNames randCluster <simple_mapper; prefix="Tmean.4mm.perm",
                       suffix="_ClstTable_r4.1_a2.0.1D">;
fileNames dsetReturn  <simple_mapper; prefix="Tmean.4mm.perm",
                       suffix="_Clustered_r4.1_a2.0.niml.dset">;
file bricResult;

// Main program – launches the whole workflow
(randCluster, dsetReturn) = brain_cluster();
bricResult = bricCentralize(randCluster.f);
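A hedged usage note: assuming simple_mapper composes the declared prefix, the array index (possibly zero-padded), and the suffix into a physical filename, as later Swift releases document it, element i of randCluster would bind to names along the lines of:

  randCluster.f[1]  ->  Tmean.4mm.perm1_ClstTable_r4.1_a2.0.1D
  randCluster.f[2]  ->  Tmean.4mm.perm2_ClstTable_r4.1_a2.0.1D
  ...

so the 2000 per-iteration results land in predictably named files without any per-file bookkeeping in the script.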

Karajan Powers VDL2 Research
- Fast, efficient, highly scalable threading model
- Flexible model of task completion and dependencies – like tuple space
- Highly dynamic – permits very flexible flow of control and workflow update/monitoring
- Flexible drivers for various (any) run-time environments
- Entire workflow client runs from one Java container
- Flexible constructs for flow-control patterns
- Flexible procedures for mapping, data management, etc.
- Flow control to avoid resource overload

fMRI Virtual Data Queries

Which transformations can process a "subject image"?
  Q: xsearchvdc -q tr_meta dataType subject_image input
  A: fMRIDC.AIR::align_warp

List anonymized subject-images for young subjects:
  Q: xsearchvdc -q lfn_meta dataType subject_image privacy anonymized subjectType young
  A: _anonymized.img

Show files that were derived from patient image:
  Q: xsearchvdc -q lfn_tree _anonymized.img
  A: _anonymized.img
     _anonymized.sliced.hdr
     atlas.hdr
     atlas.img
     …
     atlas_z.jpg
     _anonymized.sliced.img

US-ATLAS Data Challenge 2
(Chart: CPU-days of event generation using Virtual Data, mid-July through Sep 10)

LIGO Inspiral Search Application
The Inspiral workflow application is the work of Duncan Brown (Caltech), Scott Koranda (UW Milwaukee), and the LSC Inspiral group.

Virtual Data Applications

Application                      Jobs / workflow   Levels   Status
ATLAS HEP Event Simulation       500K              1        In Use
LIGO Inspiral/Pulsar             ~700              2-5      Inspiral In Use
NVO/NASA Montage/Morphology      1000s             7        Both In Use
GADU/BLAST Genomics              40K               1        In Use
fMRI DBIC AIRSN Image Proc       100s              12       In Devel
QuarkNet CosmicRay science       <10               3-6      In Use
SDSS Coadd image proc            40K               2        In Devel
SDSS Galaxy cluster search       500K              8        CS Research
GTOMO Image proc                 1000s             1        In Devel
SCEC Earthquake sim              -                 -        In Devel

Dimensions of Provenance Data Virtual Data Catalog

Virtual Data Catalog Schema

Provenance for Large-scale ATLAS Simulation

How much compute time was delivered?

  | years | mon | year |
  | 0.45  |  6  | 2004 |
  | 20    |  7  | 2004 |
  | 34    |  8  | 2004 |
  | 40    |  9  | 2004 |
  | 15    | 10  | 2004 |
  | 15    | 11  | 2004 |
  | 8.9   | 12  | 2004 |

Selected statistics for one of these jobs:
  start: …:33:56   duration: …   pid: 6123   exitcode: 0
  args: JobTransforms…/share/dc2.g4sim.filter.trf CPE_6785_… dc2_B4_filter_frag.txt
  utime: …   stime: …   minflt: …   majflt: …

Which Linux kernel releases were used? How many jobs were run on each kernel?

(Data from work of Robert Gardner and Marco Mambelli, University of Chicago)

Provenance Query Goals
- Find procedures that take in ImageAtlas and Date, have processed atlas.std.2005.img, and have annotation QAScore > 5.6
- Display metadata tags study-type on result datasets that were linearly aligned with parameter model=affine and with an input dataset annotated with center set to UofChicago
- Show me the output dataset names (and all their metadata tags) that were linearly aligned with model=affine and with input dataset attribute center=UChicago

Conclusion
- Workflow tools abstract Grid details and are making it easier to leverage massive resources
- Workflow lets users program in location-independent terms
- Failure recovery and retry is critical – just as in a networking stack
- Elegance and usability are key factors in the query language

Acknowledgements
…many thanks to the entire OSG Collaboration and our application science partners in ATLAS, CMS, LIGO, SDSS, the UChicago Human Neuroscience Laboratory, SCEC, and Argonne's Computational Biology and Climate Science groups of the Mathematics and Computer Science Division.

Workflow research and development:
  ISI/USC: Ewa Deelman, Carl Kesselman, Gaurang Mehta, Gurmeet Singh, Mei-Hui Su, Karan Vahi, Jens Voeckler
  U of Chicago: Ben Clifford, Ian Foster, Mihael Hategan, Tibi Stef-Praun, Gregor von Laszewski, Mike Wilde, Yong Zhao

GriPhyN and iVDGL are supported by the National Science Foundation. Many of the research efforts involved in this work are supported by the US Department of Energy, Office of Science.