Presentation on theme: "Grid Workflow Midwest Grid Workshop Module 6. Goals Enhance scientific productivity through: Discovery and application of datasets and programs at petabyte."— Presentation transcript:

1 Grid Workflow Midwest Grid Workshop Module 6

2 Goals
Enhance scientific productivity through:
- Discovery and application of datasets and programs at petabyte scale
- Enabling use of a worldwide data grid as a scientific workstation

3 Goals of using grids through scripting
- Provide an easy on-ramp to the grid
- Utilize massive resources with simple scripts
- Leverage multiple grids like a workstation
- Empower script-writers to empower end users
- Track and leverage provenance in the science process

4 Classes of Workflow Systems
- Earlier-generation business workflow systems: document management, forms processing, etc.
- Scientific laboratory management systems: LIMS, "wet lab" workflow
- Application-oriented workflow: Kepler, DAGMan, P-Star, VisTrails, Karajan
- VDS, the first-generation Virtual Data System: Pegasus, Virtual Data Language
- Service-oriented workflow systems: BPEL, BPDL, Taverna/SCUFL, Triana
- Pegasus/Wings: Pegasus with OWL/RDF workflow specification
- Swift workflow system: Karajan with typed and mapped VDL (SwiftScript)

5 VDS – The Virtual Data System
Introduced the Virtual Data Language (VDL), a location-independent parallel language.
Several planners:
- Pegasus: main production planner
- Euryale: experimental "just in time" planner
- GADU/GNARE: user application planner (D. Sulakhe, Argonne)
Provenance:
- Kickstart: application launcher and tracker
- VDC: virtual data catalog

6 Virtual Data and Workflows
- The challenge is managing and organizing the vast computing and storage capabilities provided by Grids
- Workflow expresses computations in a form that can be readily mapped to Grids
- Virtual data keeps accurate track of data derivation methods and provenance
- Grid tools virtualize the location and caching of data, and recovery from failures

7 Virtual Data Origins: The Grid Physics Network
Enhance scientific productivity through:
- Discovery, application and management of data and processes at all scales
- Using a worldwide data grid as a scientific workstation
The key to this approach is Virtual Data – creating and managing datasets through workflow "recipes" and provenance recording.

8 Virtual Data workflow abstracts Grid details

9 Example Application: High Energy Physics Data Analysis
(Figure: a derivation tree of related datasets, produced by varying parameters such as mass = 200; decay = WW, ZZ, or bb; stability = 1 or 3; event = 8; and plot = 1.)
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida

10 The core essence: Basic data analysis programs
(Figure: raw data – e.g. CMS.ECal.2006.0405 event records – fed into a data analysis program whose interface is its parameters: bins = 60, xmin = 40.5, ymin = .003, and an input file.)

11 Expressing Workflow in VDL

Define a "function" wrapper (TR) with "formal arguments" for each application:

TR grep (in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
TR sort (in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}

Define a "call" (DV) that provides "actual" argument values for each invocation:

DV grep (a1=@{in:file1}, a2=@{out:file2});
DV sort (a1=@{in:file2}, a2=@{out:file3});

Applications are connected via output-to-input dependencies: file1 → grep → file2 → sort → file3.
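The output-to-input dependency above can be sketched in ordinary Python (an illustration only, not the VDS implementation): each step reads its input file on stdin and writes its output file on stdout, and the shared filename is the dependency edge.

```python
# Hypothetical sketch of the VDL grep/sort chain: two steps connected
# by a shared file, mirroring "argument stdin = ${a1}; argument stdout = ${a2};".
import os
import subprocess
import tempfile

def run_step(cmd, infile, outfile):
    # Run one application with stdin redirected from its input file
    # and stdout redirected to its output file.
    with open(infile) as fin, open(outfile, "w") as fout:
        subprocess.run(cmd, stdin=fin, stdout=fout, check=True)

workdir = tempfile.mkdtemp()
file1 = os.path.join(workdir, "file1")
file2 = os.path.join(workdir, "file2")
file3 = os.path.join(workdir, "file3")

with open(file1, "w") as f:
    f.write("banana\napple\napricot\n")

# file2 is grep's output and sort's input: that is the workflow edge.
run_step(["grep", "a"], file1, file2)
run_step(["sort"], file2, file3)

with open(file3) as f:
    result = f.read().splitlines()
print(result)
```

All three input lines contain "a", so the final file holds them in sorted order.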

12 Executing VDL Workflows
(Diagram: a VDL program is stored in the Virtual Data catalog; the Workflow Generator produces an abstract workflow from the workflow spec; the Pegasus Planner creates an execution plan as a DAGMan DAG; DAGMan and Condor-G execute the workflow on the Grid, followed by job cleanup.)

13 ...and collecting Provenance
(Diagram: the same pipeline – VDL specification, abstract workflow from the Virtual Data Workflow Generator, Pegasus planning, DAGMan & Condor-G execution – instrumented for provenance: a launcher on the worker nodes records provenance data for each application invocation (e.g. grep over file1 → file2, sort over file2 → file3), and a provenance collector gathers it into the Virtual Data catalog.)

14 What must we "virtualize" to compute on the Grid?
- Location-independent computing: represent all workflow in abstract terms
- Declarations not tied to specific entities: sites, file systems, schedulers
- Failures: automated retry when data servers or execution sites are unavailable

15 Mapping the Science Process to workflows
- Start with a single workflow
- Automate the generation of workflow for sets of files (datasets)
- Replicate workflow to explore many datasets
- Change parameters
- Change code – add new transformations
- Build new workflows
- Use provenance info

16 How does Workflow Relate to Provenance?
- Workflow specifies what to do ("What I Want to Do")
- Provenance tracks what was done ("What I Did")
(Figure: a job's lifecycle – waiting, executable, executing, executed ("What I Am Doing") – with query, edit, and schedule operations against the execution environment.)
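The workflow/provenance split above can be made concrete with a tiny sketch (an assumption for illustration, not the VDS Kickstart launcher): wrap every invocation so that "what I want to do" is the call, and "what I did" is a record appended to a log.

```python
# Minimal provenance-recording wrapper: each launched step logs its name,
# inputs, timing, and output, which is the raw material of provenance.
import time

provenance_log = []

def launch(name, func, inputs):
    # Record the step before and after running it.
    record = {"step": name, "inputs": list(inputs), "start": time.time()}
    output = func(*inputs)
    record["end"] = time.time()
    record["output"] = output
    provenance_log.append(record)
    return output

doubled = launch("double", lambda x: 2 * x, [21])
print(doubled)                                # 42
print([r["step"] for r in provenance_log])    # ['double']
```

Querying the log afterwards answers "what was done", independently of the workflow that requested it.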

17 Having interface definitions also facilitates provenance tracking.
(Figure: the same raw-data / analysis-program example as slide 10 – CMS.ECal.2006.0405 event records processed with bins = 60, xmin = 40.5, ymin = .003 – with the parameters bins, xmin, ymin, and infile forming the recorded interface.)

18 Functional MRI Analysis
Workflow courtesy James Dobson, Dartmouth Brain Imaging Center

19 LIGO Inspiral Search Application
The Inspiral workflow application is the work of Duncan Brown, Caltech; Scott Koranda, UW Milwaukee; the ISI Pegasus Team; and the LSC Inspiral group.

20 Example Montage Workflow
~1200-node workflow, 7 levels. Mosaic of M42 created on the TeraGrid using Pegasus. http://montage.ipac.caltech.edu/

21 Blasting for Protein Knowledge
BLAST comparison against the complete nr database for sequence similarity and function characterization.
- Knowledge base: PUMA is an interface that lets researchers find information about a specific protein after it has been analyzed against the complete set of sequenced genomes (the nr file, approximately 3 million sequences).
- Analysis on the Grid: the analysis of the protein sequences occurs in the background in the grid environment. Millions of processes are started, since several tools are run to analyze each sequence: finding protein similarities (BLAST), protein family domain searches (BLOCKS), and structural characteristics of the protein.

22 FOAM: Fast Ocean/Atmosphere Model
250-member ensemble run on TeraGrid under VDS. For each ensemble member (1..N): remote directory creation, a FOAM run, then atmosphere, ocean, and coupler postprocessing; results are transferred to archival storage.
Work of: Rob Jacob (FOAM), Veronika Nefedova (workflow design and execution)

23 TeraGrid and VDS speed up modelling
Climate supercomputer vs. TeraGrid with NMI and VDS. FOAM application by Rob Jacob, Argonne; VDS workflow by Veronika Nefedova, Argonne. Visualization courtesy Pat Behling and Yun Liu, UW Madison.

24 VDS Virtual Data System
- Virtual Data Language (VDL): a language to express workflows
- Pegasus planner: decides how the workflow will run
- Virtual Data Catalog (VDC): stores information about workflows and the provenance of data

25 Virtual Data Process
- Describe data derivation or analysis steps in a high-level workflow language (VDL)
- VDL is cataloged in a database for sharing by the community
- Grid workflows are generated from VDL
- Provenance of derived results is stored in a database for assessment or verification

26 Planning with Pegasus
The Pegasus Planner combines high-level application knowledge (a DAX abstract workflow, from VDL), resource information and configuration, and data location information to produce a plan to be submitted to the grid (e.g., Condor submit files).

27 Abstract to Concrete, Step 1: Workflow Reduction

28 Step 2: Site Selection & Addition of Data Stage-in Nodes

29 Step 3: Addition of Data Stage-out Nodes

30 Step 4: Addition of Replica Registration Jobs

31 Step 5: Addition of Job-Directory Creation

32 Final Result of Abstract-to-Concrete Process
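Step 1 above, workflow reduction, can be sketched with a simplified Python function (an illustration under stated assumptions: a flat job list and a set of already-registered outputs, ignoring the transitive dependency handling a real planner performs): jobs whose outputs already exist in the replica catalog are pruned, so only outstanding work is planned.

```python
# Simplified workflow reduction: drop jobs whose output files are already
# registered, keeping only the jobs that must still run.
def reduce_workflow(jobs, existing_outputs):
    """jobs: list of (job_name, output_file) pairs.
    existing_outputs: set of filenames already in the catalog.
    Returns the jobs whose outputs are not yet available."""
    return [(name, out) for name, out in jobs if out not in existing_outputs]

jobs = [("preprocess", "a.dat"), ("findrange", "b.dat"), ("analyze", "c.dat")]
catalog = {"a.dat"}  # a.dat was materialized by an earlier run
remaining = reduce_workflow(jobs, catalog)
print([name for name, _ in remaining])  # ['findrange', 'analyze']
```

The reduced workflow then proceeds through site selection, stage-in/stage-out, registration, and directory-creation steps as in slides 28–31.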


39 Swift System improves on VDS/VDL
- Clean separation of logical/physical concerns: XDTM specification of logical data structures
- Concise specification of parallel programs: SwiftScript, with iteration, etc.
- Efficient execution on distributed resources: lightweight threading, dynamic provisioning, Grid interfaces, pipelining, load balancing
- Rigorous provenance tracking and query (in design): virtual data schema & automated recording
- Improved usability and productivity, demonstrated in numerous applications

40 AIRSN: an example program

(Run snr) functional ( Run r, NormAnat a, Air shrink ) {
  Run yroRun = reorientRun( r, "y" );
  Run roRun = reorientRun( yroRun, "x" );
  Volume std = roRun[0];
  Run rndr = random_select( roRun, 0.1 );
  AirVector rndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, "81 3 3" );
  Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" );
  Volume meanRand = softmean( reslicedRndr, "y", "null" );
  Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, "81 3 3" );
  Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );
  Run nr = reslice_warp_run( boldNormWarp, roRun );
  Volume meanAll = strictmean( nr, "y", "null" );
  Volume boldMask = binarize( meanAll, "y" );
  snr = gsmoothRun( nr, boldMask, "6 6 6" );
}

(Run or) reorientRun (Run ir, string direction) {
  foreach Volume iv, i in ir.v {
    or.v[i] = reorient(iv, direction);
  }
}

41 VDL/VDS Limitations
- Missing VDL language features: data typing & data mapping; iterators and control-flow constructs
- Run-time complexity in VDS: state explosion for data-parallel applications; computation status hard to provide; debugging information complex & distributed
- Performance: still many runtime bottlenecks

42 The Messy Data Problem
Scientific data is typically logically structured:
- E.g., hierarchical structure
- It is common to map functions over dataset members
- Nested map operations can scale to millions of objects

43 The Messy Data Problem
But physically "messy":
- Heterogeneous storage formats and access protocols: a logically identical dataset can be stored as text files (e.g. CSV), spreadsheets, databases, …; data may be available from a filesystem, DBMS, HTTP, WebDAV, …
- Metadata encoded in directory and file names
This hinders program development, composition, and execution. For example:

./Group23:
drwxr-xr-x 4 yongzh users  2048 Nov 12 14:15 AA
drwxr-xr-x 4 yongzh users  2048 Nov 11 21:13 CH
drwxr-xr-x 4 yongzh users  2048 Nov 11 16:32 EC

./Group23/AA:
drwxr-xr-x 5 yongzh users  2048 Nov  5 12:41 04nov06aa
drwxr-xr-x 4 yongzh users  2048 Dec  6 12:24 11nov06aa

./Group23/AA/04nov06aa:
drwxr-xr-x 2 yongzh users  2048 Nov  5 12:52 ANATOMY
drwxr-xr-x 2 yongzh users 49152 Dec  5 11:40 FUNCTIONAL

./Group23/AA/04nov06aa/ANATOMY:
-rw-r--r-- 1 yongzh users      348 Nov  5 12:29 coplanar.hdr
-rw-r--r-- 1 yongzh users 16777216 Nov  5 12:29 coplanar.img

./Group23/AA/04nov06aa/FUNCTIONAL:
-rw-r--r-- 1 yongzh users    348 Nov  5 12:32 bold1_0001.hdr
-rw-r--r-- 1 yongzh users 409600 Nov  5 12:32 bold1_0001.img
-rw-r--r-- 1 yongzh users    348 Nov  5 12:32 bold1_0002.hdr
-rw-r--r-- 1 yongzh users 409600 Nov  5 12:32 bold1_0002.img
-rw-r--r-- 1 yongzh users    496 Nov 15 20:44 bold1_0002.mat
-rw-r--r-- 1 yongzh users    348 Nov  5 12:32 bold1_0003.hdr
-rw-r--r-- 1 yongzh users 409600 Nov  5 12:32 bold1_0003.img
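A mapper's job can be illustrated with a small sketch (an assumption for illustration, not the XDTM implementation): turn a physical directory like FUNCTIONAL above, where each Volume is a .hdr/.img file pair, into a logical list of Volume objects that programs can iterate over without touching paths.

```python
# Sketch of mapping a "messy" directory onto a logical Run of Volumes:
# pair each .hdr with its .img by shared filename stem.
import os
import tempfile

def map_run(functional_dir):
    # One Volume per .hdr/.img pair; unrelated files are ignored.
    stems = sorted({os.path.splitext(f)[0]
                    for f in os.listdir(functional_dir) if f.endswith(".hdr")})
    return [{"img": s + ".img", "hdr": s + ".hdr"} for s in stems]

# Build a toy FUNCTIONAL directory like the listing above.
d = tempfile.mkdtemp()
for name in ["bold1_0001.hdr", "bold1_0001.img",
             "bold1_0002.hdr", "bold1_0002.img", "bold1_0002.mat"]:
    open(os.path.join(d, name), "w").close()

run = map_run(d)
print(len(run))        # 2 volumes; the stray .mat file is ignored
print(run[0]["img"])   # bold1_0001.img
```

Programs then map functions over `run` as a typed collection, which is the logical-structure view the next slide's SwiftScript notation provides.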

44 SwiftScript
Typed parallel programming notation:
- XDTM as data model and type system
- Typed dataset and procedure definitions
Scripting language:
- Implicit data parallelism
- Program composition from procedures
- Control constructs (foreach, if, while, …) [SIGMOD05, Springer06]
Benefits: clean application logic; type checking; dataset selection and iteration.
"A Notation & System for Expressing and Executing Cleanly Typed Workflows on Messy Scientific Data" [SIGMOD Record, Sep 2005]

45 fMRI Type Definitions in SwiftScript
Simplified declarations for fMRI AIRSN (Spatial Normalization):

type Study { Group g[ ]; }
type Group { Subject s[ ]; }
type Subject { Volume anat; Run run[ ]; }
type Run { Volume v[ ]; }
type Volume { Image img; Header hdr; }
type Image {};
type Header {};
type Warp {};
type Air {};
type AirVec { Air a[ ]; }
type NormAnat { Volume anat; Warp aWarp; Volume nHires; }
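For readers more familiar with mainstream languages, a rough Python analogue of the hierarchy above (an illustration only, not SwiftScript semantics) shows the same nesting: a Subject holds an anatomical Volume plus Runs, and each Run is a list of Volumes pairing an Image with a Header.

```python
# Rough Python analogue of the SwiftScript fMRI types.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Image:
    pass

@dataclass
class Header:
    pass

@dataclass
class Volume:
    img: Image
    hdr: Header

@dataclass
class Run:
    v: List[Volume] = field(default_factory=list)

@dataclass
class Subject:
    anat: Volume
    run: List[Run] = field(default_factory=list)

vol = Volume(img=Image(), hdr=Header())
subj = Subject(anat=vol, run=[Run(v=[vol, vol])])
print(len(subj.run[0].v))  # 2
```

The SwiftScript versions additionally carry mapping information connecting each field to physical files, which plain dataclasses do not.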

46 AIRSN Program Definition

(Run snr) functional ( Run r, NormAnat a, Air shrink ) {
  Run yroRun = reorientRun( r, "y" );
  Run roRun = reorientRun( yroRun, "x" );
  Volume std = roRun[0];
  Run rndr = random_select( roRun, 0.1 );
  AirVector rndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, "81 3 3" );
  Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" );
  Volume meanRand = softmean( reslicedRndr, "y", "null" );
  Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, "81 3 3" );
  Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );
  Run nr = reslice_warp_run( boldNormWarp, roRun );
  Volume meanAll = strictmean( nr, "y", "null" );
  Volume boldMask = binarize( meanAll, "y" );
  snr = gsmoothRun( nr, boldMask, "6 6 6" );
}

(Run or) reorientRun (Run ir, string direction) {
  foreach Volume iv, i in ir.v {
    or.v[i] = reorient(iv, direction);
  }
}

47 SwiftScript Expressiveness
Lines of code with different workflow encodings (collaboration with James Dobson, Dartmouth) [SIGMOD Record, Sep 2005]:

fMRI Workflow | Shell Script | VDL  | Swift
ATLAS1        | 49           | 72   | 6
ATLAS2        | 97           | 135  | 10
FILM1         | 63           | 134  | 17
FEAT          | 84           | 191  | 13
AIRSN         | 215          | ~400 | 34

(Figures: the AIRSN workflow, and the AIRSN workflow expanded.)

48 Swift Architecture
(Diagram: a SwiftScript specification is compiled by the SwiftScript Compiler into abstract computation; the execution engine (Karajan with the Swift runtime) schedules work onto virtual nodes via Swift runtime callouts, with status reporting; a launcher runs applications (App F1, App F2) over files file1, file2, file3, and a provenance collector records provenance data into the Virtual Data Catalog; a Dynamic Resource Provisioner handles provisioning, e.g. on Amazon EC2.)

49 Using Swift
(Diagram: the swift command takes a SwiftScript program, a site list, and an app list; a launcher runs the applications (a1, a2) on worker nodes over data files f1, f2, f3, producing workflow status, logs, and provenance data.)

50 Swift uses the Karajan Workflow Engine
- Fast, scalable threading model
- Suitable constructs for control flow
- Flexible task dependency model: "futures" enable pipelining
- Flexible provider model allows use of different run-time environments: job execution and data transfer; flow controlled to avoid resource overload
- Workflow client runs from a Java container
Java CoG Workflow, Gregor von Laszewski et al., Workflows for Science, 2007
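The "futures enable pipelining" point can be illustrated with Python's concurrent.futures (an analogy only, not the Karajan engine): each item's stage-2 job is submitted as soon as that item's stage-1 result is available, instead of barrier-synchronizing between whole stages.

```python
# Futures-based pipelining sketch: per-item handoff between two stages.
from concurrent.futures import ThreadPoolExecutor

def stage1(x):
    return x + 1

def stage2(x):
    return x * 10

with ThreadPoolExecutor(max_workers=4) as pool:
    # Stage-1 jobs all start immediately.
    stage1_futures = [pool.submit(stage1, i) for i in range(4)]
    # Each stage-2 job is submitted once its own producer's future
    # resolves; no item waits for the entire previous stage.
    stage2_futures = [pool.submit(stage2, f.result()) for f in stage1_futures]
    results = [f.result() for f in stage2_futures]

print(results)  # [10, 20, 30, 40]
```

The performance slides later in the deck quantify how much this per-item overlap buys on a real fMRI workflow.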

51 Application example: ACTIVAL – Neural activation validation
Identifies clusters of neural activity not likely to be active by random chance: switch the labels of the conditions for one or more participants; calculate the delta values in each voxel; re-calculate the reliability of delta in each voxel; and evaluate the clusters found. If the clusters in the data are greater than the majority of the clusters found in the permutations, the null hypothesis is refuted, indicating that the clusters of activity found in the experiment are not likely to be found by chance.
Work by S. Small and U. Hasson, UChicago.

52 SwiftScript Workflow ACTIVAL – Data types and utilities

type script {}
type fullBrainData {}
type brainMeasurements {}
type fullBrainSpecs {}
type precomputedPermutations {}
type brainDataset {}
type brainClusterTable {}
type brainDatasets { brainDataset b[]; }
type brainClusters { brainClusterTable c[]; }

// Procedure to run the "R" statistical package
(brainDataset t) bricRInvoke (script permutationScript, int iterationNo, brainMeasurements dataAll, precomputedPermutations dataPerm) {
  app { bricRInvoke @filename(permutationScript) iterationNo @filename(dataAll) @filename(dataPerm); }
}

// Procedure to run the AFNI clustering tool
(brainClusterTable v, brainDataset t) bricCluster (script clusterScript, int iterationNo, brainDataset randBrain, fullBrainData brainFile, fullBrainSpecs specFile) {
  app { bricPerlCluster @filename(clusterScript) iterationNo @filename(randBrain) @filename(brainFile) @filename(specFile); }
}

// Procedure to merge results based on statistical likelihoods
(brainClusterTable t) bricCentralize (brainClusterTable bc[]) {
  app { bricCentralize @filenames(bc); }
}

53 ACTIVAL Workflow – Dataset iteration procedures

// Procedure to iterate over the data collection
(brainClusters randCluster, brainDatasets dsetReturn) brain_cluster (fullBrainData brainFile, fullBrainSpecs specFile) {
  int sequence[] = [1:2000];
  brainMeasurements dataAll;
  precomputedPermutations dataPerm;
  script randScript;
  script clusterScript;
  brainDatasets randBrains;
  foreach int i in sequence {
    randBrains.b[i] = bricRInvoke(randScript, i, dataAll, dataPerm);
    brainDataset rBrain = randBrains.b[i];
    (randCluster.c[i], dsetReturn.b[i]) = bricCluster(clusterScript, i, rBrain, brainFile, specFile);
  }
}

54 ACTIVAL Workflow – Main Workflow Program

// Declare datasets
fullBrainData brainFile;
fullBrainSpecs specFile;
brainDatasets randBrain;
brainClusters randCluster<simple_mapper; prefix="Tmean.4mm.perm", suffix="_ClstTable_r4.1_a2.0.1D">;
brainDatasets dsetReturn<simple_mapper; prefix="Tmean.4mm.perm", suffix="_Clustered_r4.1_a2.0.niml.dset">;
brainClusterTable clusterThresholdsTable;
brainDataset brainResult;
brainDataset origBrain;

// Main program – executes the entire workflow
(randCluster, dsetReturn) = brain_cluster(brainFile, specFile);
clusterThresholdsTable = bricCentralize(randCluster.c);
brainResult = makebrain(origBrain, clusterThresholdsTable, brainFile, specFile);

55 Performance example: fMRI workflow
4-stage workflow (subset of AIRSN); 476 jobs, <10 secs CPU each, 119 jobs per stage. No pipelining: 24 minutes (idle uc-teragrid cluster, via GRAM to Torque).

56 Example Performance Optimizations: Pipelining
Jobs pipelined between stages: 19 minutes.

57 Example Performance Optimizations: Pipelining + clustering
With pipelining and clustering (up to 6 jobs clustered into one GRAM job): 8 minutes.
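The clustering idea above amortizes per-job submission overhead by batching small jobs. A minimal sketch (an assumption for illustration: fixed-size batching, not the actual Pegasus/Swift clustering logic):

```python
# Fixed-size job clustering: group up to max_per_cluster small jobs into
# one submission unit, as in "up to 6 jobs clustered into one GRAM job".
def cluster_jobs(jobs, max_per_cluster=6):
    return [jobs[i:i + max_per_cluster]
            for i in range(0, len(jobs), max_per_cluster)]

jobs = list(range(14))
clusters = cluster_jobs(jobs)
print(len(clusters))   # 3 batches: 6 + 6 + 2
print(clusters[-1])    # [12, 13]
```

With 476 short jobs, cutting submissions roughly six-fold is what drops the runtime from 19 to 8 minutes on this workload.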

58 Example Performance Optimizations: Pipelining + provisioning
With pipelining and CPU provisioning: 2.2 minutes.

59 Load Balancing
Load balancing between UC TeraPort (OSG) and UC-TeraGrid (IA32): uc-teragrid ran 216 jobs, UC-TeraPort ran 260 jobs.

60 Development Status
- Initial release is available for evaluation
- Performance measurement and tuning efforts are active
- Adapting to OSG Grid info and site conventions
- Many applications in progress and evaluation: astrophysics, molecular dynamics, neuroscience, psychology, radiology
- Provisioning mechanism progressing
- Virtual data catalog re-integration starting ~April
- Collating language feedback – focus is on mapping
- Web site for docs, downloads and more info: www.ci.uchicago.edu/swift

61 Conclusion
- Swift is in the early stages of its development and of its transition from the VDS virtual data language.
- Application testing is underway in neuroscience, molecular dynamics, astrophysics, radiology, and other domains, providing valuable feedback for language refinement and finalization.
- SwiftScript is proving to be a productive language, even as feedback from usage continues to shape it; comments from VDL users have been positive, in radiology in particular.
- Ongoing performance evaluation and improvement is yielding exciting results.
- The major initial focus is usability, with good progress on improving time-to-get-started and on ease of debugging.

62 Acknowledgements
The Swift effort is supported by DOE (Argonne LDRD), NSF (i2u2, GriPhyN, iVDGL), NIH, and the UChicago Computation Institute.
- Team: Ben Clifford, Ian Foster, Mihael Hategan, Veronika Nefedova, Tiberiu Stef-Praun, Mike Wilde, Yong Zhao
- Java CoG Kit: Mihael Hategan, Gregor von Laszewski, and many collaborators
- User-contributed workflows and Swift applications: ASCI Flash, I2U2, UC Human Neuroscience Lab, UCH Molecular Dynamics, UCH Radiology, caBIG
- Ravi Madduri, Patrick McConnell, and the caGrid team of caBIG

63 Based on: "The Virtual Data System – a workflow toolkit for science applications", OSG Summer Grid Workshop, Lecture 8, June 29, 2006

64 Based on: "Fast, Reliable, Loosely Coupled Parallel Computation", Tiberiu Stef-Praun, Computation Institute, University of Chicago & Argonne National Laboratory. tiberius@ci.uchicago.edu, www.ci.uchicago.edu/swift

65 Acknowledgements
The technologies and applications described here were made possible by the following projects and support:
- GriPhyN, iVDGL, the Globus Alliance and QuarkNet, supported by the National Science Foundation
- The Globus Alliance, PPDG, and QuarkNet, supported by the US Department of Energy, Office of Science
- Support was also provided by NVO, NIH, and SCEC

