Grid Workflow Midwest Grid Workshop Module 6

Goals
Enhance scientific productivity through:
- Discovery and application of datasets and programs at petabyte scale
- Enabling use of a worldwide data grid as a scientific workstation

Goals of using grids through scripting
- Provide an easy on-ramp to the grid
- Utilize massive resources with simple scripts
- Leverage multiple grids like a workstation
- Empower script-writers to empower end users
- Track and leverage provenance in the science process

Classes of Workflow Systems
- Earlier-generation business workflow systems: document management, forms processing, etc.
- Scientific laboratory management systems: LIMS, "wet lab" workflow
- Application-oriented workflow: Kepler, DAGman, P-Star, VisTrails, Karajan
- VDS, the first-generation Virtual Data System: Pegasus, Virtual Data Language
- Service-oriented workflow systems: BPEL, BPDL, Taverna/SCUFL, Triana
- Pegasus/Wings: Pegasus with OWL/RDF workflow specification
- Swift workflow system: Karajan with typed and mapped VDL - SwiftScript

VDS – The Virtual Data System
- Introduced the Virtual Data Language (VDL), a location-independent parallel language
- Several planners:
  - Pegasus: main production planner
  - Euryale: experimental "just in time" planner
  - GADU/GNARE: user application planner (D. Sulakhe, Argonne)
- Provenance:
  - Kickstart: application launcher and tracker
  - VDC: virtual data catalog

Virtual Data and Workflows
- Challenge is managing and organizing the vast computing and storage capabilities provided by Grids
- Workflow expresses computations in a form that can be readily mapped to Grids
- Virtual data keeps accurate track of data derivation methods and provenance
- Grid tools virtualize location and caching of data, and recovery from failures

Virtual Data Origins: The Grid Physics Network
Enhance scientific productivity through:
- Discovery, application and management of data and processes at all scales
- Using a worldwide data grid as a scientific workstation
The key to this approach is Virtual Data – creating and managing datasets through workflow "recipes" and provenance recording.

Virtual Data workflow abstracts Grid details

Example Application: High Energy Physics Data Analysis
[Figure: a tree of derived datasets, each node labeled by its defining parameters (mass = 200; decay = WW, ZZ, or bb; stability; LowPt/HighPt; event; plot), illustrating data derivation by successive parameter refinement.]
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida

The core essence: basic data analysis programs
[Figure: a data analysis program that takes an input file of raw data (e.g., CMS.ECal event identifiers) plus parameters (bins = 60, xmin = 40.5, ymin = .003); its formal parameters are bins, xmin, ymin, and infile.]

Expressing Workflow in VDL

TR grep (in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
TR sort (in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
DV grep …
DV sort …

The example connects file1 → grep → file2 → sort → file3.
- TR defines a "function" wrapper for an application and its "formal arguments"
- DV defines a "call" that invokes the application, providing "actual" argument values for the invocation
- Applications are connected via output-to-input dependencies
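
For comparison with the SwiftScript notation introduced later in this module, here is a rough sketch of the same two-step pipeline as typed, mapped SwiftScript. It is illustrative only and is not from the original slides: the file names mirror the figure, a pattern argument is added so grep has something to match, and the app bodies assume standard Unix grep and sort are available on the execution site.

type file {}

(file o) grepText (file i, string pattern) {
  app { grep pattern stdin=@filename(i) stdout=@filename(o); }
}

(file o) sortText (file i) {
  app { sort stdin=@filename(i) stdout=@filename(o); }
}

file f1 <"file1">;
file f2 <"file2">;
file f3 <"file3">;

f2 = grepText(f1, "WW");
f3 = sortText(f2);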

Executing VDL Workflows
[Figure: the VDL execution pipeline. A VDL program is stored in the Virtual Data catalog; the Virtual Data Workflow Generator produces the workflow spec as an abstract workflow; the Pegasus Planner creates the execution plan (job planning and job cleanup) as a DAGman DAG; DAGman & Condor-G then drive Grid workflow execution.]
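
To make the "DAGman DAG" concrete: DAGMan's input is a plain text file that names a Condor submit file for each job and lists parent/child dependencies. The following is a minimal, hypothetical example for the grep/sort workflow above; the file names are illustrative and not the output of any particular planner run.

# grep_sort.dag  (illustrative)
JOB    grep  grep.submit
JOB    sort  sort.submit
PARENT grep  CHILD sort
RETRY  sort  2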

...and collecting Provenance
[Figure: the same pipeline – VDL in the Virtual Data catalog, Virtual Data Workflow Generator (specify workflow), Pegasus Planner (DAGman script), DAGman & Condor-G (create and run DAG), Grid workflow execution – with a launcher wrapping each job (e.g., grep and sort over file1/file2/file3) on the worker nodes. The launcher emits provenance data for each invocation, which a provenance collector gathers.]

What must we “virtualize” to compute on the Grid?
- Location-independent computing: represent all workflow in abstract terms
- Declarations not tied to specific entities: sites, file systems, schedulers
- Failures: automated retry for data server and execution site unavailability

Mapping the Science Process to workflows
- Start with a single workflow
- Automate the generation of workflow for sets of files (datasets)
- Replicate workflow to explore many datasets (see the sketch below)
- Change parameters
- Change code – add new transformations
- Build new workflows
- Use provenance info
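
A rough SwiftScript sketch (not from the slides) of the "replicate over many datasets" step: a single analysis procedure is applied across a mapped array of inputs with foreach, and each iteration becomes an independent grid job. The analyze application and the mapper prefixes are hypothetical.

type dataFile {}
type resultFile {}

(resultFile r) analyze (dataFile d, int runId) {
  app { analyze @filename(d) runId stdout=@filename(r); }
}

dataFile inputs[]    <simple_mapper; prefix="input_",  suffix=".dat">;
resultFile results[] <simple_mapper; prefix="result_", suffix=".out">;

foreach dataFile d, i in inputs {
  results[i] = analyze(d, i);
}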

How does Workflow Relate to Provenance?
- Workflow specifies what to do ("what I want to do")
- Provenance tracks what was done ("what I did")
[Figure: the lifecycle of a workflow entry – waiting, executable, executing, executed – with edit, query, and schedule operations against the execution environment; "what I want to do", "what I am doing", and "what I did" map onto these states.]

Having interface definitions also facilitates provenance tracking.
[Figure: the same data analysis program as before, with its raw-data input (CMS.ECal event identifiers) and parameter values (bins = 60, xmin = 40.5, ymin = .003) recorded against the formal interface bins, xmin, ymin, infile.]

Functional MRI Analysis Workflow courtesy James Dobson, Dartmouth Brain Imaging Center

LIGO Inspiral Search Application
The Inspiral workflow application is the work of Duncan Brown, Caltech; Scott Koranda, UW Milwaukee; the ISI Pegasus team; and the LSC Inspiral group

Example Montage Workflow
~1200-node workflow, 7 levels
Mosaic of M42 created on the TeraGrid using Pegasus

Blasting for Protein Knowledge
BLAST comparison against the complete nr database for sequence similarity and function characterization.
Knowledge Base: PUMA is an interface that lets researchers find information about a specific protein after it has been analyzed against the complete set of sequenced genomes (the nr file contains approximately 3 million sequences).
Analysis on the Grid: the analysis of the protein sequences occurs in the background in the grid environment. Millions of processes are started, since several tools are run to analyze each sequence: protein similarity (BLAST), protein family domain searches (BLOCKS), and structural characteristics of the protein. A sketch of this fan-out pattern follows.
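
A hypothetical SwiftScript sketch of that fan-out (this is not the actual GADU/PUMA implementation): split the nr input into mapped chunks, run legacy NCBI blastall over each chunk in parallel, and collect one report per chunk. The executable arguments and mapper prefixes are illustrative.

type fastaFile {}
type blastReport {}

(blastReport out) runBlast (fastaFile query, string db) {
  app { blastall "-p" "blastp" "-d" db "-i" @filename(query) "-o" @filename(out); }
}

fastaFile chunks[]    <simple_mapper; prefix="chunk_", suffix=".fasta">;
blastReport reports[] <simple_mapper; prefix="chunk_", suffix=".blast">;

foreach fastaFile c, i in chunks {
  reports[i] = runBlast(c, "nr");
}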

FOAM: Fast Ocean/Atmosphere Model
250-member ensemble run on TeraGrid under VDS.
[Figure: for each ensemble member 1..N, a remote directory is created, a FOAM run is executed, and atmosphere, ocean, and coupler postprocessing steps follow; results are transferred to archival storage.]
Work of: Rob Jacob (FOAM), Veronica Nefedova (workflow design and execution)
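
The per-member structure can be written very compactly in SwiftScript. The sketch below is illustrative only and is not the VDS workflow that was actually run; the executable names foam, atmos_post, ocean_post, and coupl_post, the config file, and the mapper prefixes are placeholders.

type cfgFile {}
type modelOutput {}
type postOutput {}

(modelOutput o) foamRun (cfgFile cfg, int member) {
  app { foam @filename(cfg) member @filename(o); }
}
(postOutput p) atmosPost (modelOutput o) { app { atmos_post @filename(o) @filename(p); } }
(postOutput p) oceanPost (modelOutput o) { app { ocean_post @filename(o) @filename(p); } }
(postOutput p) couplPost (modelOutput o) { app { coupl_post @filename(o) @filename(p); } }

cfgFile cfg <"foam.cfg">;
modelOutput runs[] <simple_mapper; prefix="member_", suffix=".run">;
postOutput  atm[]  <simple_mapper; prefix="member_", suffix=".atm">;
postOutput  ocn[]  <simple_mapper; prefix="member_", suffix=".ocn">;
postOutput  cpl[]  <simple_mapper; prefix="member_", suffix=".cpl">;

int members[] = [1:250];
foreach int m, i in members {
  runs[i] = foamRun(cfg, m);
  atm[i]  = atmosPost(runs[i]);
  ocn[i]  = oceanPost(runs[i]);
  cpl[i]  = couplPost(runs[i]);
}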

TeraGrid and VDS speed up modelling
[Figure: visualization comparing a FOAM run on a dedicated climate supercomputer with the same run on TeraGrid with NMI and VDS.]
FOAM application by Rob Jacob, Argonne; VDS workflow by Veronika Nefedova, Argonne. Visualization courtesy Pat Behling and Yun Liu, UW Madison.

VDS: Virtual Data System
- Virtual Data Language (VDL): a language to express workflows
- Pegasus planner: decides how the workflow will run
- Virtual Data Catalog (VDC): stores information about workflows and the provenance of data

Virtual Data Process
- Describe data derivation or analysis steps in a high-level workflow language (VDL)
- VDL is cataloged in a database for sharing by the community
- Grid workflows are generated from VDL
- Provenance of derived results stored in database for assessment or verification

Planning with Pegasus
[Figure: the Pegasus Planner takes a DAX abstract workflow (from VDL), high-level application knowledge, resource information and configuration, and data location information, and produces a plan to be submitted to the grid (e.g., Condor submit files).]

Abstract to Concrete, Step 1: Workflow Reduction

Step 2: Site Selection & Addition of Data Stage-in Nodes

Step 3: Addition of Data Stage-out Nodes

Step 4: Addition of Replica Registration Jobs

Step 5: Addition of Job-Directory Creation

Final Result of Abstract-to-Concrete Process

Swift System improves on VDS/VDL
- Clean separation of logical/physical concerns: XDTM specification of logical data structures
- Concise specification of parallel programs: SwiftScript, with iteration, etc.
- Efficient execution on distributed resources: lightweight threading, dynamic provisioning, Grid interfaces, pipelining, load balancing
- Rigorous provenance tracking and query (in design): virtual data schema & automated recording
- Improved usability and productivity: demonstrated in numerous applications

AIRSN: an example program

(Run snr) functional ( Run r, NormAnat a, Air shrink ) {
  Run yroRun = reorientRun( r, "y" );
  Run roRun = reorientRun( yroRun, "x" );
  Volume std = roRun[0];
  Run rndr = random_select( roRun, 0.1 );
  AirVector rndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, "81 3 3" );
  Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" );
  Volume meanRand = softmean( reslicedRndr, "y", "null" );
  Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, "81 3 3" );
  Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );
  Run nr = reslice_warp_run( boldNormWarp, roRun );
  Volume meanAll = strictmean( nr, "y", "null" );
  Volume boldMask = binarize( meanAll, "y" );
  snr = gsmoothRun( nr, boldMask, "6 6 6" );
}

(Run or) reorientRun (Run ir, string direction) {
  foreach Volume iv, i in ir.v {
    or.v[i] = reorient(iv, direction);
  }
}

VDL/VDS Limitations
- Missing VDL language features: data typing & data mapping; iterators and control-flow constructs
- Run-time complexity in VDS: state explosion for data-parallel applications; computation status hard to provide; debugging information complex & distributed
- Performance: still many runtime bottlenecks

The Messy Data Problem
- Scientific data is typically logically structured (e.g., hierarchical structure)
- Common to map functions over dataset members
- Nested map operations can scale to millions of objects

The Messy Data Problem
But physically "messy":
- Heterogeneous storage format and access protocol
  - A logically identical dataset can be stored in a textual file (e.g. CSV), spreadsheet, database, …
  - Data available from filesystem, DBMS, HTTP, WebDAV, …
- Metadata encoded in directory and file names
- Hinders program development, composition, execution

Example directory listing (excerpt; sizes and dates abridged):
./Group23:                         AA  CH  EC
./Group23/AA:                      04nov06aa  11nov06aa
./Group23/AA/04nov06aa:            ANATOMY  FUNCTIONAL
./Group23/AA/04nov06aa/ANATOMY:    coplanar.hdr  coplanar.img
./Group23/AA/04nov06aa/FUNCTIONAL: bold1_0001.hdr  bold1_0001.img  bold1_0002.hdr  bold1_0002.img  bold1_0002.mat  bold1_0003.hdr  bold1_0003.img

SwiftScript
- Typed parallel programming notation: XDTM as data model and type system; typed dataset and procedure definitions
- Scripting language: implicit data parallelism; program composition from procedures; control constructs (foreach, if, while, …) [SIGMOD05, Springer06]
- Clean application logic, type checking, dataset selection and iteration
"A Notation & System for Expressing and Executing Cleanly Typed Workflows on Messy Scientific Data" [SIGMOD Record Sep05]
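
A minimal complete SwiftScript program, to show the pieces at a glance before the fMRI examples that follow: a type, an app procedure, a mapped output file, and an invocation. This is a sketch in the style of the early Swift tutorials rather than something from the slides; the greeting procedure and hello.txt mapping are illustrative.

type messagefile {}

(messagefile t) greeting (string s) {
  app { echo s stdout=@filename(t); }
}

messagefile outfile <"hello.txt">;
outfile = greeting("Hello, Grid!");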

fMRI Type Definitions in SwiftScript

type Study   { Group g[ ]; }
type Group   { Subject s[ ]; }
type Subject { Volume anat; Run run[ ]; }
type Run     { Volume v[ ]; }
type Volume  { Image img; Header hdr; }
type Image {};
type Header {};
type Warp {};
type Air {};
type AirVec { Air a[ ]; }
type NormAnat { Volume anat; Warp aWarp; Volume nHires; }

Simplified declarations of fMRI AIRSN (Spatial Normalization)
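
These types make nested mapping straightforward. The sketch below is not from the slides: it shows an illustrative reorientStudy procedure that applies the reorientRun procedure from the AIRSN example above to every Run of every Subject in every Group of a Study, and the engine can then run all resulting reorient jobs in parallel.

(Study os) reorientStudy (Study is, string direction) {
  foreach Group g, gi in is.g {
    foreach Subject s, si in g.s {
      foreach Run r, ri in s.run {
        // reorient every Run; the anat member of each output Subject is left unset in this sketch
        os.g[gi].s[si].run[ri] = reorientRun(r, direction);
      }
    }
  }
}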

AIRSN Program Definition

(Run snr) functional ( Run r, NormAnat a, Air shrink ) {
  Run yroRun = reorientRun( r, "y" );
  Run roRun = reorientRun( yroRun, "x" );
  Volume std = roRun[0];
  Run rndr = random_select( roRun, 0.1 );
  AirVector rndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, "81 3 3" );
  Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" );
  Volume meanRand = softmean( reslicedRndr, "y", "null" );
  Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, "81 3 3" );
  Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );
  Run nr = reslice_warp_run( boldNormWarp, roRun );
  Volume meanAll = strictmean( nr, "y", "null" );
  Volume boldMask = binarize( meanAll, "y" );
  snr = gsmoothRun( nr, boldMask, "6 6 6" );
}

(Run or) reorientRun (Run ir, string direction) {
  foreach Volume iv, i in ir.v {
    or.v[i] = reorient(iv, direction);
  }
}

SwiftScript Expressiveness
Lines of code with different workflow encodings (collaboration with James Dobson, Dartmouth) [SIGMOD Record Sep05]

fMRI Workflow | Shell Script | VDL  | Swift
ATLAS         |              |      |
ATLAS         |              |      |
FILM          |              |      |
FEAT          |              |      |
AIRSN         | 215          | ~400 | 34

[Figures: the AIRSN workflow, and the AIRSN workflow expanded.]

Swift Architecture
[Figure: specification and execution. A SwiftScript program is compiled by the SwiftScript compiler into an abstract computation, recorded in the Virtual Data Catalog. The execution engine (Karajan with the Swift runtime) schedules application calls (e.g., App F1 and App F2 over file1/file2/file3) onto virtual nodes via Swift runtime callouts, with status reporting back to the engine. A dynamic resource provisioner acquires resources (e.g., Amazon EC2), and a launcher wraps each job so a provenance collector can gather provenance data.]

Using Swift
[Figure: the user provides a SwiftScript program, data (f1, f2, f3), a site list, and an app list to the swift command, which reports workflow status and logs. On the worker nodes, a launcher runs each application (App a1, App a2) over the data files and returns provenance data.]

Swift uses the Karajan Workflow Engine
- Fast, scalable threading model
- Suitable constructs for control flow
- Flexible task dependency model: "futures" enable pipelining (see the sketch below)
- Flexible provider model allows use of different run-time environments: job execution and data transfer; flow controlled to avoid resource overload
- Workflow client runs from a Java container
Java CoG Workflow, Gregor von Laszewski et al., Workflows for Science, 2007
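
An illustrative SwiftScript fragment (not from the slides) of what "futures enable pipelining" means in practice: each array element is a future, so the second reorient on element i can start as soon as that element's intermediate result exists, without waiting for the entire first stage to finish. It reuses the reorient procedure from the AIRSN example; the twoStage procedure name is hypothetical.

(Run out) twoStage (Run ir) {
  Run mid;   // intermediate data; each mid.v[i] is a future
  foreach Volume v, i in ir.v {
    mid.v[i] = reorient(v, "y");           // stage 1 for element i
    out.v[i] = reorient(mid.v[i], "x");    // stage 2 starts as soon as mid.v[i] is ready
  }
}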

Application example: ACTIVAL – Neural activation validation
Identifies clusters of neural activity not likely to be active by random chance: switch the condition labels for one or more participants; calculate the delta values in each voxel; re-calculate the reliability of delta in each voxel; and evaluate the clusters found. If the clusters in the data are larger than the majority of the clusters found in the permutations, the null hypothesis is refuted, indicating that the clusters of activity found in the experiment are not likely to be found by chance.
Work by S. Small and U. Hasson, UChicago.

SwiftScript Workflow ACTIVAL – Data types and utilities

type script {}
type fullBrainData {}
type brainMeasurements {}
type fullBrainSpecs {}
type precomputedPermutations {}
type brainDataset {}
type brainClusterTable {}
type brainDatasets { brainDataset b[]; }
type brainClusters { brainClusterTable c[]; }

// Procedure to run the "R" statistical package
(brainDataset t) bricRInvoke (script permutationScript, int iterationNo,
                              brainMeasurements dataAll, precomputedPermutations dataPerm) {
  app { iterationNo }
}

// Procedure to run the AFNI clustering tool
(brainClusterTable v, brainDataset t) bricCluster (script clusterScript, int iterationNo,
                                                   brainDataset randBrain, fullBrainData brainFile,
                                                   fullBrainSpecs specFile) {
  app { }
}

// Procedure to merge results based on statistical likelihoods
(brainClusterTable t) bricCentralize (brainClusterTable bc[]) {
  app { }
}

ACTIVAL Workflow – Dataset iteration procedures

// Procedure to iterate over the data collection
(brainClusters randCluster, brainDatasets dsetReturn) brain_cluster (fullBrainData brainFile,
                                                                     fullBrainSpecs specFile) {
  int sequence[] = [1:2000];
  brainMeasurements dataAll;
  precomputedPermutations dataPerm;
  script randScript;
  script clusterScript;
  brainDatasets randBrains;

  foreach int i in sequence {
    randBrains.b[i] = bricRInvoke(randScript, i, dataAll, dataPerm);
    brainDataset rBrain = randBrains.b[i];
    (randCluster.c[i], dsetReturn.b[i]) = bricCluster(clusterScript, i, rBrain, brainFile, specFile);
  }
}

ACTIVAL Workflow – Main Workflow Program

// Declare datasets
fullBrainData brainFile;
fullBrainSpecs specFile;
brainDatasets randBrain;
brainClusters randCluster <simple_mapper; prefix="Tmean.4mm.perm", suffix="_ClstTable_r4.1_a2.0.1D">;
brainDatasets dsetReturn <simple_mapper; prefix="Tmean.4mm.perm", suffix="_Clustered_r4.1_a2.0.niml.dset">;
brainClusterTable clusterThresholdsTable;
brainDataset brainResult;
brainDataset origBrain;

// Main program – executes the entire workflow
(randCluster, dsetReturn) = brain_cluster(brainFile, specFile);
clusterThresholdsTable = bricCentralize(randCluster.c);
brainResult = makebrain(origBrain, clusterThresholdsTable, brainFile, specFile);

Performance example: fMRI workflow
4-stage workflow (subset of AIRSN): 476 jobs, <10 secs CPU each, 119 jobs per stage.
No pipelining: 24 minutes (idle uc-teragrid cluster, via GRAM to Torque)

Example Performance Optimizations: Pipelining
Jobs pipelined between stages: 19 minutes

Example Performance Optimizations: Pipelining + clustering
With pipelining and clustering (up to 6 jobs clustered into one GRAM job): 8 minutes

Example Performance Optimizations: Pipelining + provisioning
With pipelining and CPU provisioning: 2.2 minutes

Load Balancing
Load balancing between UC-TeraPort (OSG) and UC-TeraGrid (IA32); jobs per site: uc-teragrid 216, UC-TeraPort 260

Development Status
- Initial release is available for evaluation
- Performance measurement and tuning efforts are active
- Adapting to OSG Grid info and site conventions
- Many applications in progress and under evaluation: astrophysics, molecular dynamics, neuroscience, psychology, radiology
- Provisioning mechanism progressing
- Virtual data catalog re-integration starting ~April
- Collating language feedback – focus is on mapping
- Web site for docs, downloads and more info:

Conclusion
- Swift is in its early stages of development and its transition from the VDS virtual data language
- Application testing is underway in neuroscience, molecular dynamics, astrophysics, radiology, and other applications, providing valuable feedback for language refinement and finalization
- SwiftScript is proving to be a productive language, while feedback from usage is still shaping it; positive comments from VDL users – radiology in particular
- Ongoing performance evaluation and improvement is yielding exciting results
- Major initial focus is usability – good progress on improving time-to-get-started and on ease of debugging

Acknowledgements
The Swift effort is supported by DOE (Argonne LDRD), NSF (I2U2, GriPhyN, iVDGL), NIH, and the UChicago Computation Institute.
- Team: Ben Clifford, Ian Foster, Mihael Hategan, Veronika Nefedova, Tiberiu Stef-Praun, Mike Wilde, Yong Zhao
- Java CoG Kit: Mihael Hategan, Gregor von Laszewski, and many collaborators
- User-contributed workflows and Swift applications: ASCI Flash, I2U2, UC Human Neuroscience Lab, UCH Molecular Dynamics, UCH Radiology, caBIG (Ravi Madduri, Patrick McConnell, and the caGrid team)

Based on: The Virtual Data System – a workflow toolkit for science applications OSG Summer Grid Workshop Lecture 8 June 29, 2006

Based on: Fast, Reliable, Loosely Coupled Parallel Computation Tiberiu Stef-Praun Computation Institute University of Chicago & Argonne National Laboratory

Acknowledgements
The technologies and applications described here were made possible by the following projects and support:
- GriPhyN, iVDGL, the Globus Alliance, and QuarkNet, supported by the National Science Foundation
- The Globus Alliance, PPDG, and QuarkNet, supported by the US Department of Energy, Office of Science
- Support was also provided by NVO, NIH, and SCEC