Download presentation
Presentation is loading. Please wait.
Published byMaria Shelton Modified over 9 years ago
1
LQCD Workflow Discussion Fermilab, 18 Dec 2006 Mike Wilde wilde@mcs.anl.gov
2
Virtual Data Origins: The Grid Physics Network Enhance scientific productivity through… Discovery, application and management of data and processes at all scales Using a worldwide data grid as a scientific workstation The key to this approach is Virtual Data – creating and managing datasets through workflow “recipes” and provenance recording.
3
Virtual Data Workflow Abstracts Grid Details
4
Expressing Workflow in VDL TR grep (in a1, out a2) { argument stdin = ${a1}; argument stdout = ${a2}; } TR sort (in a1, out a2) { argument stdin = ${a1}; argument stdout = ${a2}; } DV grep (a1=@{in:file1}, a2=@{out:file2}); DV sort (a1=@{in:file2}, a2=@{out:file3}); file1 file2 file3 grep sort Define a “function” wrapper for an application Provide “actual” argument values for the invocation Define “formal arguments” for the application Define a “call” to invoke application Connect applications via output-to-input dependencies
5
Complex Workflows Run on Grid Galaxy Image Montage Workflow ~1200 node workflow, 7 levels Mosaic of M42 created on the Teragrid using Pegasus Courtesy NASA and NVO = Data Transfer = Compute Job
6
...and collecting Provenance VDL DAGman script Pegasus Planner DAGman & Condor-G Abstract workflow Virtual Data catalog Virtual Data Workflow Generator Specify WorkflowCreate and run DAG Grid Workflow Execution (on worker nodes) launcher file1 file2 file3 grep sort Provenance data Provenance data Provenance collector
7
Neuroscience Example Create a library of typed procedures for popular fMRI tools, that are pre-installed across Grid sites Process structured data collections with simple procedures – automated iteration Establish a data cataloging scheme with metadata annotation and search Provide easy-to-use workflow execution portals – command line environments that make clusters and Grids appear like workstations Work in an environment that is a fully-tracked “clean room” with accurate provenance and powerful search
8
Example Science Driver: Functional MRI Analysis Workflow courtesy James Dobson, Dartmouth Brain Imaging Center
9
fMRI Dataset processing FOREACH BOLDSEQ DV reorient (# Process Blood O2 Level Dependent Sequence input = [ @{in: "$BOLDSEQ.img"}, @{in: "$BOLDSEQ.hdr"} ], output = [@{out: "$CWD/FUNCTIONAL/r$BOLDSEQ.img"} @{out: "$CWD/FUNCTIONAL/r$BOLDSEQ.hdr"}], direction = "y", ); END DV softmean ( input = [ FOREACH BOLDSEQ @{in:"$CWD/FUNCTIONAL/har$BOLDSEQ.img"} END ], mean = [ @{out:"$CWD/FUNCTIONAL/mean"} ] );
10
Neuroscience workflow: spatially normalize a time series of images from a functional MRI experiment
11
Spatial normalization of functional run Dataset-level workflowExpanded (10 volume) workflow
12
VDL2 Workflow – Data types and “atomic” procedures type file {} type fileNames{ file f[]; } // Procedure to run "R" statistical package (file t) bricRInvoke (file script, int iteration, file dataAll, file dataPerm){ app { bricRInvoke @filename(script) iteration @filename(dataAll) @filename(dataPerm); } // Procedure to run AFNI Clustering tool (file t, file v) bricCluster (file clusterScript, int iteration, file randBrain, file brainFile, file specFile) { app { bricPerlCluster @filename(clusterScript) iteration @filename(randBrain) @filename(brainFile) @filename(specFile); } // Procedure to merge results based on statistical likelihoods (file t) bricCentralize ( file fn[]) { app { bricCentralize @filenames(fn); } }
13
VDL2 Workflow – Dataset iteration procedures // Procedure to iterate over the data collection (fileNames randCluster, fileNames dsetReturn) brain_cluster () { int j[]=[1:2000]; file dataAll ; file dataPerm ; file brainFile ; file specFile ; file randScript ; file clusterScript ; fileNames randBrain ; foreach int i in j { randBrain.f[i] = bricRInvoke(randScript,i,dataAll,dataPerm); file rBrain=randBrain.f[i]; (randCluster.f[i], dsetReturn.f[i]) = bricCluster(clusterScript,i,rBrain, brainFile,specFile); }
14
VDL2 Workflow – Main Workflow Program // Declare datasets fileNames randBrain ; fileNames randCluster <simple_mapper; prefix="Tmean.4mm.perm", suffix="_ClstTable_r4.1_a2.0.1D">; fileNames dsetReturn <simple_mapper; prefix="Tmean.4mm.perm", suffix="_Clustered_r4.1_a2.0.niml.dset">; file bricResult ; // Main program – launches the whole workflow (randCluster, dsetReturn) = brain_cluster(); bricResult= bricCentralize (randCluster.f);
15
Karajan Powers VDL2 Research fast efficient highly scalable threading model flexible model of task completion and dependencies - like tuple space highly dynamic - permits very flexible flow of control and workflow update/monitoring flexible drivers for various (any) run time environment entire workflow client runs from one Java container flexible constructs for flow control patterns flexible procedures for mapping, data management, etc Flow control to avoid resource overload
16
fMRI Virtual Data Queries Which transformations can process a “subject image”? Q: xsearchvdc -q tr_meta dataType subject_image input A: fMRIDC.AIR::align_warp List anonymized subject-images for young subjects: Q: xsearchvdc -q lfn_meta dataType subject_image privacy anonymized subjectType young A: 3472-4_anonymized.img Show files that were derived from patient image 3472-3: Q: xsearchvdc -q lfn_tree 3472-3_anonymized.img A: 3472-3_anonymized.img 3472-3_anonymized.sliced.hdr atlas.hdr atlas.img … atlas_z.jpg 3472-3_anonymized.sliced.img
17
US-ATLAS Data Challenge 2 Sep 10 Mid July CPU-day Event generation using Virtual Data
18
LIGO Inspiral Search Application Describe… Inspiral workflow application is the work of Duncan Brown, Caltech, Scott Koranda, UW Milwaukee, and the LSC Inspiral group
19
Virtual Data Applications ApplicationJobs / workflowLevelsStatus ATLAS HEP Event Simulation 500K1In Use LIGO Inspiral/Pulsar ~7002-5Inspiral In Use NVO/NASA Montage/Morphology 1000s7Both In Use GADU/ BLAST Genomics 40K1In Use fMRI DBIC AIRSN Image Proc 100s12In Devel QuarkNet CosmicRay science <103-6 In Use SDSS Coadd image proc 40K2In Devel SDSS Galaxy cluster search 500K8CS Research GTOMO Image proc 1000s1In Devel SCEC Earthquake sim --In Devel
20
Dimensions of Provenance Data Virtual Data Catalog
21
Virtual Data Catalog Schema
22
Provenance for Large-scale ATLAS Simulation How much compute time was delivered? | years| mon | year | +------+------+------+ |.45 | 6 | 2004 | | 20 | 7 | 2004 | | 34 | 8 | 2004 | | 40 | 9 | 2004 | | 15 | 10 | 2004 | | 15 | 11 | 2004 | | 8.9 | 12 | 2004 | +------+------+------+ Selected statistics for one of these jobs: start: 2004-09-30 18:33:56 duration: 76103.33 pid: 6123 exitcode: 0 args: 8.0.5 JobTransforms-08-00-05-09/share/dc2.g4sim.filter.trf CPE_6785_556... -6 6 2000 4000 8923 dc2_B4_filter_frag.txt utime: 75335.86 stime: 28.88 minflt: 862341 majflt: 96386 Which Linux kernel releases were used ? How many jobs were run on a Linux 2.4.28 Kernel? (Data from work of Robert Gardner and Marco Mambelli, University of Chicago)
23
Provenance Query Goals Find procedures that take in ImageAtlas and Date, have processed atlas.std.2005.img, and have annotation QAScore > 5.6 Display metadata tags study-type on result datasets that were linearly aligned with parameter model=affine and with an input dataset annotated with center set to UofChicago Show me the output dataset names (and all their metadata tags) that were linearly aligned with model=affine and with input dataset attribute center=UChicago
24
Conclusion Workflow tools abstract Grid details and are making it easier to leverage massive resources Workflow lets users program in location- independent terms Failure recover and retry is critical – just as in a networking stack Elegance and usability are key factors in the query language
25
Acknowledgements …many thanks to the entire OSG Collaboration and our application science partners in ATLAS, CMS, LIGO, SDSS, UChicago Human Neuroscience Laboratory, SCEC, and Argonne’s Computational Biology and Climate Science Groups of the Mathematics and Computer Science Division. Workflow research and development: ISI/USC: Ewa Deelman, Carl Kesselman, Gaurang Mehta, Gurmeet Singh, Mei-Hui Su, Karan Vahi, Jens Voeckler U of Chicago: Ben Clifford, Ian Foster, Mihael Hategan, Tibi Stef- Praun, Gregor von Laszewsk, Mike Wilde, Yong Zhao www.griphyn.org/vds GriPhyN and iVDGL are supported by the National Science Foundation Many of the research efforts involved in this work are supported by the US Department of Energy, office of Science.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.