
1 Parallel scripting for distributed systems
Introduction to Swift: Parallel scripting for distributed systems. Mike Wilde, Ben Clifford. Computation Institute, University of Chicago and Argonne National Laboratory. This work is being done by Mike Wilde, Ben Clifford and others. It is a follow-up to a concept that started with the GriPhyN (Grid Physics Network) project.

2 Workflow Motivation: example from Neuroscience
Large fMRI datasets: 90,000 volumes per study, 100s of studies. Wide range of analyses: testing, production runs, data mining, ensemble and parameter studies.

The Grid Physics Network (GriPhyN) project was started to help scientists use massive-scale resources as if they were local workstations. We concluded that we needed to find out when it was better to reproduce data locally rather than send it over the grid, so that we could obtain very interesting optimizations. From this we concluded that we needed a methodical description of how the data was produced, and this made us realize that we are really talking about a methodical way of describing workflows. This lets us know where the data came from; the same idea applies to legal documents and many other applications. For this purpose we created in GriPhyN a language called VDL (Virtual Data Language). This was the first attempt to lift up how computations on the Grid are described: if you can describe things at a high enough level, you can provide a rigorous specification of how data is derived, that specification can be compiled, and the compiler can handle all the meticulous details that the Grid requires. A very smart grid-oriented compiler makes life easier if it can do the work of all the manual tools we must use today to execute the steps the Grid requires. An MRI study of 100 patients over three months will produce upwards of 100 thousand files, so we are trying to create a logical, abstract, high-level description of this work and treat it as if it were an in-memory data structure, because we think compilers should be the ones to figure out how grid operations work. You could then write your code as if the grid were your workstation, and that is where this project is heading. Our software, Swift, is on the internet and can be downloaded; we are interested in finding people who will use it in the scientific or computer-science projects they are doing, and we'll talk about that in detail later. So again: we have these large collections of data and these large numbers of grid resources we want to run on, and we take the prolonged set of steps needed to run the jobs and describe those steps methodically.

3 Target environment: Cluster and Grids (distributed sets of clusters)
[Architecture diagram: a grid client (application user interface plus grid middleware) speaks grid protocols to grid resources at UW, UCSD, and ANL — each site running a computing cluster, grid middleware, and grid storage — coordinated through resource, workflow, and data catalogs.]

This is a graphical representation of what we described on the previous slide. Our target environment runs a uniform middleware stack:
Security to control access and protect communication (GSI)
Directory to locate grid sites and services (VORS, MDS)
Uniform interface to computing sites (GRAM)
Facility to maintain and schedule queues of work (Condor-G)
Fast and secure data set mover (GridFTP, RFT)
Directory to track where datasets live (RLS)

4 Why script in Swift? Orchestration of many resources over long time periods. Very complex to do manually — workflow automates this effort. Enables restart of long-running scripts. Write scripts in a manner that's location-independent: run anywhere. A higher level of abstraction gives increased portability of the workflow script (over ad-hoc scripting).

Orchestration: the Grid is a workstation, and we want things to run over a long period of time and over a large amount of resources, where you are much more vulnerable to failure. If the mean time to failure of one workstation is n, then using m resources on the grid makes failures roughly m times as frequent — a mean time to failure of about n/m. On the internet the same thing happens, but the protocol stack takes care of correcting those problems; we face the same problems on the grid. We assist in solving them with scripted processes instead of manual single-line input. Lastly, we want to write our code in a way that is location-independent: it is very important to be able to write your code without depending on which grid sites your job is going to run on. This lets you write code in a way that makes these processes run as easily as possible.

5 Swift is… A language for writing scripts that: process and produce large collections of persistent data, with large and/or complex sequences of application programs, on diverse distributed systems, with a high degree of parallelism, persisting over long periods of time, surviving infrastructure failures, and tracking the provenance of execution. Note: this slide is NOT in the presentation, but it can be read to clarify the previous slide.

6 Swift programs A Swift script is a set of functions. Atomic functions wrap & invoke application programs. Composite functions invoke other functions. Data is typed as composable arrays and structures of files and simple scalar types (int, float, string). Collections of persistent file structures (datasets) are mapped into this data model. Members of datasets can be processed in parallel. Statements in a procedure are executed in data-flow dependency order, with concurrency. Variables are single assignment. Provenance is gathered as scripts execute. Note: this slide is NOT in the presentation, but it can be read to clarify the previous slide. It describes the concepts of what Swift is; the presenter covers this on later slides — some slides were apparently taken out because the material was repetitive.

7 A simple Swift script

type imagefile;

(imagefile output) flip(imagefile input) {
  app {
    convert "-rotate" "180" @input @output;   // command line reconstructed: rotate the input image, write the output
  }
}

imagefile stars <"orion.jpg">;      // input filename reconstructed; maps a real file to a logical variable
imagefile flipped <"output.jpg">;

flipped = flip(stars);

NOTE: this slide is different from the one in the presentation. The third generation of this work is called Swift. We started with tools like DAGMan — very useful but very rudimentary. On top of it we built the virtual data system, and at its heart we built Pegasus, which does the translation from the virtual data system into DAGMan. With that we achieved a little data independence, but we could not do data typing: it had streams and data files only, so we could not do things like iterating over data — you had to go outside and run another shell to do the iterations. The type declaration just declares, without any internal structure, that imagefile is a data type. app is a reserved Swift word; it introduces the application to be run.

8 Parallelism via foreach { }
type imagefile;

(imagefile output) flip(imagefile input) {
  app {
    convert "-rotate" "180" @input @output;   // command line reconstructed, as on the previous slide
  }
}

imagefile observations[] <simple_mapper; prefix="orion">;
imagefile flipped[]      <simple_mapper; prefix="orion-flipped">;   // name outputs based on inputs

foreach obs, i in observations {
  flipped[i] = flip(obs);    // process all dataset members in parallel
}

This is another short sample of Swift code; on the next slide we describe a larger Swift example in detail.

9 A Swift data mining example
type pcapfile;      // packet data capture - input file type
type angleout;      // "angle" data mining output
type anglecenter;   // geospatial centroid output

(angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) {
  app { angle4.sh @ifile @ofile @cfile; }   // interface to shell script; argument list reconstructed
}

pcapfile infile <"anl.dump.pcap">;   // maps a real file (filename reconstructed)
angleout outdata <"data.out">;
anglecenter outcenter <"data.center">;

(outdata, outcenter) = angle4(infile);

Again, here we declare three types: pcapfile, angleout, and anglecenter. angle4 is a function taking an input file called ifile of type pcapfile, with two returns (output files) called ofile of type angleout and cfile of type anglecenter. With this shell command we can start treating the step as a function — as an abstraction. app is a Swift reserved word that says: this is how I want the command line to look when I call the application, which in this case is angle4.sh. The lines with <> are called mappings; this is where we map physical data to logical names. In this case infile is the logical representation of the pcap dump file, outdata is the logical representation of "data.out", and outcenter is the logical representation of "data.center"; then I am ready to run the program. The last line runs angle4 with infile as the input file and gets outdata and outcenter as returns. So calling the function angle4 causes the grid process to execute.

10 Parallelism and name mapping
type pcapfile;
type angleout;
type anglecenter;

(angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) {
  app { angle4.sh @ifile @ofile @cfile; }   // argument list reconstructed
}

pcapfile pcapfiles[] <filesys_mapper; prefix="pc", suffix=".pcap">;

angleout of[] <structured_regexp_mapper; source=pcapfiles, match="pc(.*)\.pcap",
               transform="_output/of/of\1.angle">;     // name outputs based on inputs

anglecenter cf[] <structured_regexp_mapper; source=pcapfiles, match="pc(.*)\.pcap",
               transform="_output/cf/cf\1.center">;

foreach pf, i in pcapfiles {     // iterate over dataset members in parallel
  (of[i], cf[i]) = angle4(pf);
}

This is an example of how to create an array in Swift. pcapfiles is an array loaded with the names of all files with the prefix pc and the suffix .pcap; of[] is another array, and cf[] yet another. foreach is the most powerful statement in Swift: the foreach section assigns all the jobs to the grid and has those jobs run in parallel, and all the data management is also done automatically. Detailed documentation on arrays is found in the Swift documentation. This is the beginning of treating the Grid as a chip and Swift as a compiler.

11 Swift Architecture

[Architecture diagram: a SwiftScript program is compiled (SwiftScript compiler, specification) into an abstract computation, which the execution engine (Karajan with the Swift runtime) schedules and executes; Swift runtime callouts handle status reporting, execution, and provisioning (Falkon resource provisioner, Amazon EC2, virtual nodes); launchers on worker nodes run the application programs (App F1, App F2) over files (file1, file2, file3), while a provenance data collector and the Virtual Data Catalog record what ran.]

At the bottom left we start with the SwiftScript described on the previous slides. The Virtual Data Catalog, which is being developed, will contain all information about the data files being used. We send this information to the Swift compiler, written in Java, which produces abstract code. Swift builds on a language called Karajan, which is part of the Globus Java CoG Kit. Karajan is a very lightweight programming model that has very simple threads and very flexible constructs for synchronizing tasks, which can be a large number of tasks. In the third column we see the execution representation of the jobs; all of the processing and execution of Swift runs in a Java virtual machine. The last column shows the provisioning step, being written by a graduate student of Ian Foster, which automatically deploys jobs to available machines and is capable of bypassing the present queue system.

12 The Swift Scripting Model
Program in a high-level, functional model. Swift hides issues of location, mechanism and data representation. The basic active elements are functions that encapsulate application tools and run jobs. Typed data model: structures and arrays of files and scalar types. Variables are single assignment. Control structures perform conditional, iterative and parallel operations.

So to recap: Swift is a high-level, functional model that hides issues of location, mechanism and data representation. The basic active elements are functions that encapsulate application tools and run jobs. The typed data model — structures and arrays of files and scalar types — is very much like C. Variables are single assignment. Control structures perform conditional, iterative and parallel operations. Swift has very minimal computation capabilities: it has the minimum computational constructs needed to keep it simple. It is a lot like a shell language in that its main job is to execute other programs, and it tries to be like a functional language so that it knows what is coming in and what is going out.

13 The Messy Data Problem (1)
Scientific data is often logically structured, e.g., with a hierarchical structure. It is common to map functions over dataset members. Nested map operations can scale to millions of objects.

This is a graphical representation of how large, complicated jobs like this can be represented in a high-level manner. This is a project that runs an application called AIR, which in turn comprises over 30 programs, where Swift can run multiple subjects (patients) in parallel. This is just one example of the huge number of applications that can be run using Swift.

14 The Messy Data Problem (2)

./group23:
drwxr-xr-x 4 yongzh users 2048 Nov 12 14:15 AA
drwxr-xr-x 4 yongzh users 2048 Nov 11 21:13 CH
drwxr-xr-x 4 yongzh users 2048 Nov 11 16:32 EC

./group23/AA:
drwxr-xr-x 5 yongzh users 2048 Nov  5 12:41 04nov06aa
drwxr-xr-x 4 yongzh users 2048 Dec  6 12:24 11nov06aa

./group23/AA/04nov06aa:
drwxr-xr-x 2 yongzh users Nov  5 12:52 ANATOMY
drwxr-xr-x 2 yongzh users Dec  5 11:40 FUNCTIONAL

./group23/AA/04nov06aa/ANATOMY:
-rw-r--r-- 1 yongzh users Nov  5 12:29 coplanar.hdr
-rw-r--r-- 1 yongzh users Nov  5 12:29 coplanar.img

./group23/AA/04nov06aa/FUNCTIONAL:
-rw-r--r-- 1 yongzh users Nov  5 12:32 bold1_0001.hdr
-rw-r--r-- 1 yongzh users Nov  5 12:32 bold1_0001.img
-rw-r--r-- 1 yongzh users Nov  5 12:32 bold1_0002.hdr
-rw-r--r-- 1 yongzh users Nov  5 12:32 bold1_0002.img
-rw-r--r-- 1 yongzh users Nov 15 20:44 bold1_0002.mat
-rw-r--r-- 1 yongzh users Nov  5 12:32 bold1_0003.hdr
-rw-r--r-- 1 yongzh users Nov  5 12:32 bold1_0003.img

Heterogeneous storage formats and access protocols: the same dataset can be stored in a text file, spreadsheet, database, …, and accessed via filesystem, DBMS, HTTP, WebDAV, …. Metadata is encoded in directory and file names. All of this hinders program development, composition, and execution.

NOTE: on this slide the presenter said: "so you get this kind of explosion of data on the disk".

15 Automated image registration for spatial normalization
[Diagrams: the AIRSN workflow, and the AIRSN workflow expanded.] These are sets of steps using an off-the-shelf package; we define them in Swift, turning them into Swift functions using the app declaration, and then you can structure more complex workflows from those atomic functions. This is a simple logical view of how the data is going to be processed: a workflow, expressed at a high level, over an off-the-shelf set of command-line tools — AIRSN is that set of steps. This is a sample of what one particular discipline does, and of how its use of programs and data drives its need for high-performance grid computing resources; we see this model over and over again in every data-intensive science. Collaboration with James Dobson, Dartmouth [SIGMOD Record Sep05].

16 Example: fMRI Type Definitions
type Study    { Group g[]; }
type Group    { Subject s[]; }
type Subject  { Volume anat; Run run[]; }
type Run      { Volume v[]; }
type Volume   { Image img; Header hdr; }
type Image    {};
type Header   {};
type Warp     {};
type Air      {};
type AirVec   { Air a[]; }
type NormAnat { Volume anat; Warp aWarp; Volume nHires; }

Simplified version of the fMRI AIRSN program (spatial normalization). These are examples of the different data types we can declare to implement the AIRSN project described on the previous slide. The presenter went through and read each type in this example.
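To make the nested-map idea from slide 13 concrete, here is a minimal sketch (not from the deck) of iterating over this type hierarchy; the function processVolume, its command line, and the variable study are assumptions for illustration:

(Volume ov) processVolume (Volume iv) {
  app { process_volume @filename(iv.img) @filename(ov.img); }   // hypothetical application wrapper
}

Study study;   // assume this is mapped onto an fMRI directory tree like the one on slide 14

foreach g in study.g {          // all groups…
  foreach subj in g.s {         // …all subjects…
    foreach r in subj.run {     // …all runs…
      foreach v, i in r.v {     // …all volumes, processed in parallel
        Volume out = processVolume(v);
      }
    }
  }
}

Because every iteration is independent, the nested foreach statements expose all the volumes to the scheduler at once — this is the "nested map operations can scale to millions of objects" point.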

17 fMRI Example Workflow
Collaboration with James Dobson, Dartmouth.

(Run resliced) reslice_wf (Run r) {
  Run yR = reorientRun(r, "y", "n");
  Run roR = reorientRun(yR, "x", "n");
  Volume std = roR.v[1];
  AirVector roAirVec = alignlinearRun(std, roR, 12, 1000, 1000, "81 3 3");
  resliced = resliceRun(roR, roAirVec, "-o", "-k");
}

(Run or) reorientRun (Run ir, string direction, string overwrite) {
  foreach Volume iv, i in ir.v {
    or.v[i] = reorient(iv, direction, overwrite);
  }
}

Swift's variables are single assignment, so we know when a variable gets set. This is very important because Swift's structure is based on output-to-input dependencies: these tell Swift in what order to run things and determine what level of parallelism is possible. One very nice thing about Swift is that when you execute a compound function (a function that calls other functions), each function is executed based on the data dependencies. In this case the input to reslice_wf is Run r; the first reorientRun produces yR and the second reorientRun consumes yR; then alignlinearRun consumes roR, producing roAirVec, which is then consumed by resliceRun, which produces resliced. At run time the Karajan program has all the logic of the data flow and knows what to run and in what order. The second function is the implementation of reorientRun, which reslice_wf calls twice; its foreach is the part that provides the parallelism. Again, in Swift variables are single assignment: once a data file is created it does not change, and it does not get created again.

18 AIRSN Program Definition
(Run or) reorientRun (Run ir, string direction) {
  foreach Volume iv, i in ir.v {
    or.v[i] = reorient(iv, direction);
  }
}

(Run snr) functional (Run r, NormAnat a, Air shrink) {
  Run yroRun = reorientRun( r, "y" );
  Run roRun = reorientRun( yroRun, "x" );
  Volume std = roRun.v[0];
  Run rndr = random_select( roRun, 0.1 );
  AirVector rndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, "81 3 3" );
  Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" );
  Volume meanRand = softmean( reslicedRndr, "y", "null" );
  Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, "81 3 3" );
  Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );
  Run nr = reslice_warp_run( boldNormWarp, roRun );
  Volume meanAll = strictmean( nr, "y", "null" );
  Volume boldMask = binarize( meanAll, "y" );
  snr = gsmoothRun( nr, boldMask, "6 6 6" );
}

Walk through this example on your own; it produces the measurements on the next slide.

19 SwiftScript Expressiveness
Lines of code with different workflow encodings:

fMRI Workflow   Shell Script    VDL   Swift
ATLAS1                49         72      6
ATLAS2                97        135     10
FILM1                 63        134     17
FEAT                  84        191     13
AIRSN                215       ~400     34

These are the results of experiments we did encoding the same workflows, first as shell scripts, then in VDL, and finally in Swift; the numbers are the lines of code needed by each. The AIRSN row corresponds to the code on the previous slide. Collaboration with James Dobson, Dartmouth [SIGMOD Record Sep05].

20 Application example: ACTIVAL: Neural activation validation
The ACTIVAL Swift script identifies clusters of neural activity not likely to be active by random chance: switch the labels of the conditions for one or more participants; calculate the delta values in each voxel; re-calculate the reliability of delta in each voxel; and evaluate the clusters found. If the clusters in the data are greater than the majority of the clusters found in the permutations, then the null hypothesis is refuted, indicating that the clusters of activity found in our experiment are not likely to be found by chance.

This is another example from a neuroscientist; their work is a very good application that lends itself to being run in parallel. DON'T read this — just say it in a few words: "To illustrate Swift I'll present as an example a workflow, which we call ACTIVAL, that uses statistical analysis (with R) to validate observed activations in a functional MRI study." ACTIFIND is our name for this algorithm (it is not a formal neuroscience name). Work by S. Small and U. Hasson, UChicago.

21 SwiftScript Workflow ACTIVAL – Data types and utilities
type script {}
type fullBrainData {}
type brainMeasurements {}
type fullBrainSpecs {}
type precomputedPermutations {}
type brainDataset {}
type brainClusterTable {}
type brainDatasets { brainDataset b[]; }
type brainClusters { brainClusterTable c[]; }

// Procedure to run the "R" statistical package
(brainDataset t) bricRInvoke (script permutationScript, int iterationNo,
                              brainMeasurements dataAll, precomputedPermutations dataPerm) {
  app { iterationNo }                        // command line partly lost in transcription
}

// Procedure to run the AFNI clustering tool
(brainClusterTable v, brainDataset t) bricCluster (script clusterScript, int iterationNo,
                                                   brainDataset randBrain, fullBrainData brainFile,
                                                   fullBrainSpecs specFile) {
  app { iterationNo @filename(specFile); }   // command line partly lost in transcription
}

// Procedure to merge results based on statistical likelihoods
(brainClusterTable t) bricCentralize (brainClusterTable bc[]) {
  app { }                                    // command line partly lost in transcription
}

This is just another example of defining types and then creating functions. (Author note: replace this sequence with one that uses better type names.)

22 ACTIVAL Workflow – Dataset iteration procedures
// Procedure to iterate over the data collection
(brainClusters randCluster, brainDatasets dsetReturn) brain_cluster (fullBrainData brainFile,
                                                                     fullBrainSpecs specFile) {
  int sequence[] = [1:2000];

  brainMeasurements dataAll <fixed_mapper; file="obs.imit.all">;
  precomputedPermutations dataPerm <fixed_mapper; file="perm.matrix.11">;
  script randScript <fixed_mapper; file="script.obs.imit.tibi">;
  script clusterScript <fixed_mapper; file="surfclust.tibi">;
  brainDatasets randBrains <simple_mapper; prefix="rand.brain.set">;

  foreach int i in sequence {
    randBrains.b[i] = bricRInvoke(randScript, i, dataAll, dataPerm);
    brainDataset rBrain = randBrains.b[i];
    (randCluster.c[i], dsetReturn.b[i]) = bricCluster(clusterScript, i, rBrain,
                                                      brainFile, specFile);
  }
}

NOTE: this slide is not in the presentation! Point out that this is the heart of the algorithm — the parallelism comes from the foreach. Show how the datasets flow in the body of the foreach. QUESTION: why do we create the variable rBrain? What data-structuring concept is being used here?

23 ACTIVAL Workflow – Main Workflow Program
// Declare datasets
fullBrainData brainFile <fixed_mapper; file="colin_lh_mesh140_std.pial.asc">;
fullBrainSpecs specFile <fixed_mapper; file="colin_lh_mesh140_std.spec">;

brainDatasets randBrain <simple_mapper; prefix="rand.brain.set">;
brainClusters randCluster <simple_mapper; prefix="Tmean.4mm.perm",
                           suffix="_ClstTable_r4.1_a2.0.1D">;
brainDatasets dsetReturn <simple_mapper; prefix="Tmean.4mm.perm",
                          suffix="_Clustered_r4.1_a2.0.niml.dset">;
brainClusterTable clusterThresholdsTable <fixed_mapper; file="thresholds.table">;
brainDataset brainResult <fixed_mapper; file="brain.final.dset">;
brainDataset origBrain <fixed_mapper; file="brain.permutation.1">;

// Main program — executes the entire workflow
(randCluster, dsetReturn) = brain_cluster(brainFile, specFile);
clusterThresholdsTable = bricCentralize(randCluster.c);
brainResult = makebrain(origBrain, clusterThresholdsTable, brainFile, specFile);

NOTE: this slide is not in the presentation! Explain the dataset declarations and the two mappers used here: simple_mapper and fixed_mapper. Point out that entire datasets are being passed as arguments.

24 Swift Application: Economics “moral hazard” problem
This is another example of our users using Swift: the results of an economic analysis of different agrarian economies. It is a ~200-job workflow using Octave/Matlab and the CLP LP-SOLVE application. Work by Tibi Stef-Praun, CI, with Robert Townsend & Gabriel Madeira, UChicago Economics.

25 Running swift Fully contained Java grid client. Can test on a local machine. Can run on a PBS cluster. Runs on multiple clusters over Grid interfaces. We already talked about running Swift, and you will see more of this in the labs; a minimal invocation is sketched below, and then we will move quickly to the next slide.
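As a rough sketch (not from the deck; these flag names follow the Swift 0.x command line and should be checked against your release, and flip.swift is assumed to be the script from slide 7 saved to a file), a local test and a grid run differ only in the catalogs you point at:

# run entirely on the local machine (the default local site)
swift flip.swift

# run on grid sites: tell Swift which sites exist (sites.xml)
# and where each application lives on them (tc.data)
swift -sites.file sites.xml -tc.file tc.data flip.swift

The script itself does not change between the two runs; that is the location-independence point from slide 4.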

26 Using Swift

[Diagram: the swift command takes a SwiftScript program, a site list, and an app list; launchers on worker nodes run the applications (App a1, App a2) over data files (f1, f2, f3); workflow status, logs, and provenance data flow back to the submit host.]

Details on running Swift can be found on the Swift website; there are many variations of these tutorials, and user guides that will walk you through every aspect of the Swift language and all the different mapper types that are available.

27 The Variable model Single assignment: you can only assign a value to a var once. This makes data-flow semantics much cleaner to specify, understand and implement. Variables are scalars or references to composite objects. Variables are typed. File-typed variables are "mapped" to files. We already talked a lot about variables; a small sketch follows.
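A minimal sketch (not from the deck; the names are illustrative) of what single assignment means in practice:

int x;
x = 10;        // fine: the first (and only) assignment to x
// x = 20;     // error if uncommented: x is single assignment

int y = x + 1; // legal: y is set once, and only after x has a value

Because a variable is written exactly once, any statement that reads it has an unambiguous producer — which is what makes the dataflow ordering on the next slide well defined.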

28 Data Flow Model This is what makes it possible to be location-independent. Computations proceed when their data is ready (often not in source-code order). The user specifies DATA dependencies and doesn't worry about sequencing of operations. This exposes maximal parallelism. We have also already talked a lot about data flow; a short sketch follows.
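A sketch (not from the deck; angle4 is the function from slide 9, and combine is a hypothetical merging step) of how execution order follows data, not source order:

(outA, centerA) = angle4(fileA);   // these two calls share no data,
(outB, centerB) = angle4(fileB);   // so they run concurrently

merged = combine(outA, outB);      // runs only once both angle4 calls finish

Even though combine is written last, Swift does not wait at each line; it waits only where an input is still being produced.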

29 Swift statements Var declarations — can be mapped. Type declarations. Assignment statements — assignments are type-checked. Control-flow statements: if, foreach, iterate. Function declarations. A small sketch of the control-flow statements follows.
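A rough sketch (not from the deck) showing the three control-flow statements side by side; threshold, inputs, outputs, process, and step are illustrative names, and the iterate form should be checked against the user guide:

int threshold = 100;

if (threshold > 50) {           // conditional
  trace("large run");
}

foreach v, i in inputs {        // parallel iteration over an array
  outputs[i] = process(v);
}

iterate step {                  // sequential iteration with a termination test
  trace("step", step);
} until (step >= 3);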

30 Passing scripts as data
When running scripting languages, the target language interpreter can be the executable (e.g. shell, perl, python, R).

We should note that the atomic programs you run can themselves be scripts: you can treat them as programs, or you can treat them as data and have Swift pass the scripts to the interpreter executables — shipping them to the remote site as if they were input data. How you want to do that is up to you. Note that there is no magic in Swift: you still need to install the applications and record in the catalog where those applications are installed. A sketch follows.
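A minimal sketch (not from the deck) of the "script as data" pattern; RInvoke, analysis.R, and the type names are assumptions for illustration:

type RScript;    // an R program, treated as an ordinary input file
type RData;

(RData result) runR (RScript code, RData input) {
  // the R interpreter wrapper is the executable; the script travels to the site as data
  app { RInvoke @filename(code) @filename(input) @filename(result); }
}

RScript myAnalysis <"analysis.R">;
RData obs <"observations.Rdata">;
RData out <"result.Rdata">;

out = runR(myAnalysis, obs);

Because myAnalysis is just a mapped file, Swift stages it to the execution site exactly as it stages any other input.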

31 Assessing your analysis tool performance
Job usage records tell where, when, and how things ran:

[Table: per-job usage records — kickstart record (angle4-szlfhtji-kickstart.xml, angle4-hvlfhtji-kickstart.xml, angle4-oimfhtji-kickstart.xml, angle4-u9mfhtji-kickstart.xml), start time, runtime, cputime, architecture (ia64), and execution host (tg-c007, tg-c034, tg-c015, tg-c035 at uc.teragrid.org); the numeric fields were lost in transcription.]

Analysis tools display this visually. This is another application that uses XML.

32 Performance recording
[Diagram: the same execution architecture as slide 26 — the swift command with its site list and app list, launchers and applications (App a1, App a2) on worker nodes, data files f1, f2, f3 — with workflow status, logs, and provenance data flowing back for recording.]

We talked a little bit about data modeling. One advantage of this "separation of layers" is simplicity for the user. When we integrate provenance recording back in, we will likely make it optional so as to retain this simplicity — especially for users less interested in provenance storage, and for new users who may not have ready access to a DBMS to hold the VDC.

33 Data Management
Directories and management model: local dir, storage dir, work dir. Caching within a workflow; reuse of files on restart. Makes unique names for jobs, files, and workflows. Can leave data on a site — for now, in Swift you need to track it yourself; in Pegasus (and VDS) this is done automatically.

I must point out that the data management model is still rather simplistic. Typically the input files start on the submit host, and Swift moves them out to the sites it chooses for execution and brings the results back. The data moved to a site is typically cached there until it is cleaned up later; that is not automated yet. You can also write mappings in Swift not just to local files but to gridftp URLs, so your inputs and outputs can be located anywhere.

34 Mappers and Vars Vars can be "file valued". Many useful mappers are built in, written in Java to the Mapper interface. An "ext"ernal mapper can easily be written as an external script in any language; a sketch of declaring one follows. This is all I want to say today; you can go through the rest of the presentation on your own. Do you have any questions?
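A brief sketch (not from the deck) of how a dataset might be declared with the external mapper; the mapper name follows the user-guide convention of that era and should be verified against your release, and map1 is the awk script shown on slide 37:

pcapfile inputs[] <ext; exec="map1">;   // run ./map1; each stdout line "[i] path" defines one array element

Each line the script prints pairs an array index with a physical file path — exactly the output format of the map1 example on slide 37.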

35 Mapping outputs based on input names
type pcapfile;
type angleout;
type anglecenter;

(angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) {
  app { angle4.sh @ifile @ofile @cfile; }   // argument list reconstructed, as on slide 10
}

pcapfile pcapfiles[] <filesys_mapper; prefix="pc", suffix=".pcap">;

angleout of[] <structured_regexp_mapper; source=pcapfiles, match="pc(.*)\.pcap",
               transform="_output/of/of\1.angle">;     // name outputs based on inputs

anglecenter cf[] <structured_regexp_mapper; source=pcapfiles, match="pc(.*)\.pcap",
               transform="_output/cf/cf\1.center">;

foreach pf, i in pcapfiles {
  (of[i], cf[i]) = angle4(pf);
}

Please read and go over this code; it is the same kind of code as was previously presented.

36 Parallelism for processing datasets
type pcapfile;
type angleout;
type anglecenter;

(angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) {
  app { angle4.sh @ifile @ofile @cfile; }   // argument list reconstructed, as on slide 10
}

pcapfile pcapfiles[] <filesys_mapper; prefix="pc", suffix=".pcap">;

angleout of[] <structured_regexp_mapper; source=pcapfiles, match="pc(.*)\.pcap",
               transform="_output/of/of\1.angle">;     // name outputs based on inputs

anglecenter cf[] <structured_regexp_mapper; source=pcapfiles, match="pc(.*)\.pcap",
               transform="_output/cf/cf\1.center">;

foreach pf, i in pcapfiles {     // iterate over dataset members in parallel
  (of[i], cf[i]) = angle4(pf);
}

Please read and go over this code; it is the same kind of code as was previously presented.

37 Coding your own “external” mapper
The mapper (map1) is a small shell script that emits one array entry per line:

awk <angle-spool-1-2 '
BEGIN { server="gsiftp://tp-osg.ci.uchicago.edu//disks/ci-gpfs/angle"; }
      { printf "[%d] %s/%s\n", i++, server, $0 }'

$ cat angle-spool-1-2
spool_1/anl dump pcap.gz
spool_1/anl dump pcap.gz
spool_1/anl dump pcap.gz
spool_1/anl dump pcap.gz

$ ./map1 | head
[0] gsiftp://tp-osg.ci.uchicago.edu//disks/ci-gpfs/angle/spool_1/anl dump pcap.gz
[1] gsiftp://tp-osg.ci.uchicago.edu//disks/ci-gpfs/angle/spool_1/anl dump pcap.gz
[2] gsiftp://tp-osg.ci.uchicago.edu//disks/ci-gpfs/angle/spool_1/anl dump pcap.gz
…

Please read and go over this code; it works like the examples previously presented.

38 Site selection and throttling
Avoid overloading the target infrastructure. Base resource choice on current conditions and the real response you are getting. Balance this with space availability. Things are getting more automated; a sketch of the relevant configuration follows.
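As a sketch of where the throttling knobs live (a hypothetical swift.properties fragment; the property names follow the Swift 0.x user guide and should be verified against your release):

# swift.properties — caps on concurrent activity
throttle.submit=4            # job submissions in progress at once
throttle.host.submit=2       # concurrent submissions per host
throttle.transfers=4         # concurrent file transfers
throttle.file.operations=8   # concurrent file operations
throttle.score.job.factor=4  # how strongly a site's observed performance scales its job load

The score-based throttle is what implements "base resource choice on current conditions": sites that respond well are progressively given more work, while slow or failing sites are backed off.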

39 Clustering and Provisioning
Can cluster small jobs together to reduce grid overhead. Can use a provisioner (such as the Falkon provisioner from slide 11). Can use a provider to go straight to a cluster.

40 Testing and debugging techniques
Trace and print statements. Put logging into your wrapper. Capture stdout/stderr in returned files. Capture glimpses of the runtime environment. Kickstart data helps you understand what happened at runtime. Read and filter the swift client log files. Check what the sites are doing with local tools — condor_q, qstat. Log reduction tools tell you how your workflow behaved.

These debugging steps must be followed with care.

41 Other Workflow Style Issues
Expose or hide parameters. One atomic function, many variants. Expose or hide program structure. Driving a parameter sweep with readdata() — reads a CSV file into a struct[] (see the sketch after this list). Swift is not a data manipulation language — use scripting tools for that.

These are workflow style issues to look out for.
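A rough sketch (not from the deck) of a readdata()-driven parameter sweep; simulate, results, the params fields, and sweep.csv are illustrative, and the exact spelling and input format of the read function should be checked against the user guide:

type params {
  int n;
  float alpha;
}

params p[] = readdata("sweep.csv");   // one struct per row of the file

foreach pv, i in p {
  results[i] = simulate(pv.n, pv.alpha);   // one run per parameter set, all in parallel
}

The CSV is prepared outside Swift with ordinary scripting tools; Swift's job is only to fan the rows out to parallel runs.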

42 Swift: Getting Started
Documentation -> tutorials. Get CI accounts; request: workstation, gridlab, teraport. Get a DOEGrids Grid Certificate; virtual organization: OSG / OSGEDU; sponsor: Mike Wilde. Develop your Swift code and test locally, then run on PBS / TeraPort and on OSG (OSGEDU). Use simple scripts (Perl, Python) as your test apps.

To get started, please visit these locations on the web and spend as much time as possible learning the information posted there.

43 Planned Enhancements
Additional data management models. Integration of provenance tracking. Improved logging for troubleshooting. Improved compilation and tool integration (especially with scripting tools and SDEs). Improved abstraction and descriptive capability in mappers. Continual performance measurement and speed improvement.

These are some of the enhancements that are planned for implementation in the near future.

44 Swift: Summary
http://www.ci.uchicago.edu/swift

Clean separation of logical/physical concerns — XDTM specification of logical data structures
+ Concise specification of parallel programs — SwiftScript, with iteration, etc.
+ Efficient execution on distributed resources — Karajan+Falkon: grid interface, lightweight dispatch, pipelining, clustering, provisioning
+ Rigorous provenance tracking and query — records provenance data of each job executed
→ Improved usability and productivity, demonstrated in numerous applications

Here we have a summary of what Swift is.

45 Acknowledgments The Swift effort is supported in part by NSF grants OCI and PHY, NIH DC08638, and the UChicago/Argonne Computation Institute. The Swift team: Ben Clifford, Ian Foster, Mihael Hategan, Veronika Nefedova, Ioan Raicu, Tibi Stef-Praun, Mike Wilde, Zhao Zhang, Yong Zhao. The Java CoG Kit used by Swift is developed by Mihael Hategan, Gregor von Laszewski, and many collaborators. User-contributed workflows and application use: I2U2; U.Chicago Molecular Dynamics; U.Chicago Radiology and Human Neuroscience Lab; Dartmouth Brain Imaging Center.

46 References - Workflow

Taylor, I.J., Deelman, E., Gannon, D.B. and Shields, M. (eds.), Workflows for e-Science, Springer, 2007.
SIGMOD Record, Sep. 2005, Special Section on Scientific Workflows.
Zhao, Y., Hategan, M., Clifford, B., Foster, I., von Laszewski, G., Raicu, I., Stef-Praun, T. and Wilde, M., Swift: Fast, Reliable, Loosely Coupled Parallel Computation, IEEE International Workshop on Scientific Workflows, 2007.
Stef-Praun, T., Clifford, B., Foster, I., Hasson, U., Hategan, M., Small, S., Wilde, M. and Zhao, Y., Accelerating Medical Research using the Swift Workflow System, HealthGrid 2007.
Stef-Praun, T., Madeira, G., Foster, I. and Townsend, R., Accelerating solution of a moral hazard problem with Swift, e-Social Science 2007.
Zhao, Y., Wilde, M. and Foster, I., Virtual Data Language: A Typed Workflow Notation for Diversely Structured Scientific Data, in Taylor, I.J., Deelman, E., Gannon, D.B. and Shields, M. (eds.), Workflows for e-Science, Springer, 2007.
Zhao, Y., Dobson, J., Foster, I., Moreau, L. and Wilde, M., A Notation and System for Expressing and Executing Cleanly Typed Workflows on Messy Scientific Data, SIGMOD Record 34 (3), 37-43.
Vöckler, J.-S., Mehta, G., Zhao, Y., Deelman, E. and Wilde, M., Kickstarting Remote Applications, 2nd International Workshop on Grid Computing Environments, 2006.
Raicu, I., Zhao, Y., Dumitrescu, C., Foster, I. and Wilde, M., Falkon: a Fast and Light-weight tasK executiON framework, Supercomputing Conference 2007.

47 Additional Information
Quick Start Guide:
User Guide:
Introductory Swift Tutorials:

