
Workflow Management and Virtual Data Ewa Deelman USC Information Sciences Institute

GGF Summer SchoolWorkflow Management2 Tutorial Objectives l Provide a detailed introduction to existing services for workflow and virtual data management l Provide descriptions and interactive demonstrations of: –the Chimera system for managing virtual data products –the Pegasus system for planning and execution in grids

GGF Summer School Workflow Management 3  Acknowledgements
- Chimera: ANL and UofC, Ian Foster, Jens Voeckler, Mike Wilde
- Pegasus: USC/ISI, Carl Kesselman, Gaurang Mehta, Gurmeet Singh, Mei-Hui Su, Karan Vahi (pegasus.isi.edu)

GGF Summer SchoolWorkflow Management4 Outline l Workflows on the Grid l The GriPhyN project l Chimera l Pegasus l Research issues l Exercises

GGF Summer School Workflow Management 5  Abstract System Representation
- A workflow is a graph
  - The vertices of the graph represent "activities"
  - The edges of the graph represent precedence between "activities"
    - The edges are directed
  - The graph may be cyclic
  - An annotation is a set of zero or more attributes associated with a vertex, edge, or subgraph of the graph
[Figure: a graph]

GGF Summer School Workflow Management 6  Operations on the graph
- A subgraph can be operated on by an editor
- An editor performs a transaction that maps a subgraph (s1) onto a subgraph (s2)
- An editor
  - May add vertices or edges to the subgraph
  - May delete vertices or edges within the subgraph
  - May add or modify the annotations on the subgraph or on the vertices or edges in the subgraph
- After the mapping
  - the edges that were directed to s1 are directed to s2
  - the edges that were directed from s1 are directed from s2
- Two editors cannot edit two subgraphs at the same time if these subgraphs have common vertices or edges

GGF Summer School Workflow Management 7  Subgraph editing
[Figure: a vertex with its annotation being modified by an editor, alongside other subgraphs]

GGF Summer SchoolWorkflow Management8 Abstract Workflow Generation Concrete Workflow Generation

GGF Summer SchoolWorkflow Management9 Generating an Abstract Workflow l Available Information –Specification of component capabilities –Ability to generate the desired data products Select and configure application components to form an abstract workflow –assign input files that exist or that can be generated by other application components. –specify the order in which the components must be executed –components and files are referred to by their logical names >Logical transformation name >Logical file name >Both transformations and data can be replicated

GGF Summer SchoolWorkflow Management10 Generating a Concrete Workflow l Information –location of files and component Instances –State of the Grid resources l Select specific –Resources –Files –Add jobs required to form a concrete workflow that can be executed in the Grid environment >Data movement –Data registration –Each component in the abstract workflow is turned into an executable job

GGF Summer SchoolWorkflow Management11 Why Automate Workflow Generation? l Usability: Limit User’s necessary Grid knowledge >Monitoring and Directory Service >Replica Location Service l Complexity: –User needs to make choices >Alternative application components >Alternative files >Alternative locations –The user may reach a dead end –Many different interdependencies may occur among components l Solution cost: –Evaluate the alternative solution costs >Performance >Reliability >Resource Usage l Global cost: –minimizing cost within a community or a virtual organization –requires reasoning about individual user’s choices in light of other user’s choices

GGF Summer SchoolWorkflow Management12 Workflow Evolution l Workflow description –Metadata –Partial, abstract description –Full, abstract description –A concrete, executable workflow l Workflow refinement –Take a description and produce an executable workflow l Workflow execution

GGF Summer SchoolWorkflow Management13 Workflow Refinement l The workflow can undergo an arbitrarily complex set of refinements l A refiner can modify part of the workflow or the entire workflow l A refiner uses a set of Grid information services and catalogs to perform the refinement (metadata catalog, virtual data catalog, replica location services, monitoring and discovery services, etc. )

GGF Summer School Workflow Management 14  Workflow Refinement and Execution
[Figure: levels of abstraction over time — from the user's request and application-level knowledge, through relevant components and a full abstract workflow of logical tasks, to tasks bound to resources and sent for execution; workflow refinement, a task matchmaker, workflow repair, and policy information connect the partially executed and not-yet-executed parts of the workflow]

GGF Summer SchoolWorkflow Management15 Outline l Workflows on the Grid l The GriPhyN project l Chimera l Pegasus l Exercises

GGF Summer SchoolWorkflow Management16 Ongoing Workflow Management Work l Part of the NSF-funded GriPhyN project l Supports the concept of Virtual Data, where data is materialized on demand l Data can exist on some data resource and be directly accessible l Data can exist only in a form of a recipe l The GriPhyN Virtual Data System can seamlessly deliver the data to the user or application regardless of the form in which the data exists l GriPhyN targets applications in high-energy physics, gravitational-wave physics and astronomy

GGF Summer School Workflow Management 17  Relationship between virtual data and provenance
- Virtual data can be described by a subgraph that needs to undergo an editing process to obtain a subgraph in the "done" state
- The recording of the editing process is provenance
[Figure: a vertex, its annotation, and an editor — virtual data, virtual data materialization, and provenance]

GGF Summer School Workflow Management 18  Workflow management in GriPhyN
- Workflow Generation: how do you describe the workflow (at various levels of abstraction)? (Chimera)
- Workflow Mapping/Refinement: how do you map an abstract workflow representation to an executable form? (Pegasus)
- Workflow Execution: how do you reliably execute the workflow? (Condor's DAGMan)

GGF Summer School Workflow Management 19  Terms
- Abstract Workflow (DAX)
  - Expressed in terms of logical entities
  - Specifies all logical files required to generate the desired data product from scratch
  - Dependencies between the jobs
  - Analogous to a build-style DAG
- Concrete Workflow
  - Expressed in terms of physical entities
  - Specifies the location of the data and executables
  - Analogous to a make-style DAG
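For illustration, a minimal abstract workflow (DAX) for two jobs of the black-diamond example used later in the tutorial might look roughly like the fragment below. The element and attribute names follow the VDS DAX schema as best recalled here, so treat the exact syntax as an approximation rather than the normative schema; note that only logical transformation and file names appear.

<adag name="black-diamond">
  <!-- jobs refer only to logical transformation and logical file names -->
  <job id="ID000001" name="preprocess">
    <uses file="f.a" link="input"/>
    <uses file="f.b1" link="output"/>
    <uses file="f.b2" link="output"/>
  </job>
  <job id="ID000002" name="findrange">
    <uses file="f.b1" link="input"/>
    <uses file="f.b2" link="input"/>
    <uses file="f.c1" link="output"/>
  </job>
  <!-- dependency: ID000002 runs after ID000001 -->
  <child ref="ID000002">
    <parent ref="ID000001"/>
  </child>
</adag>

The concrete workflow produced by Pegasus replaces these logical names with physical file locations and executable paths on specific sites, and adds the transfer and registration jobs described later.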

GGF Summer School Workflow Management 20  Executable Workflow Construction
- Chimera builds an abstract workflow based on VDL descriptions
- Pegasus takes the abstract workflow and produces an executable workflow for the Grid
- Condor's DAGMan executes the workflow

GGF Summer SchoolWorkflow Management21 Example Workflow Reduction l Original abstract workflow l If “b” already exists (as determined by query to the RLS), the workflow can be reduced

GGF Summer SchoolWorkflow Management22 Mapping from abstract to concrete l Query RLS, MDS, and TC, schedule computation and data movement

GGF Summer School Workflow Management 23  Application Workflow Characteristics

Experiment | #workflows per analysis | # of jobs in workflow | Data size per job | Compute time per job
LHC        | O(100K)                 | 7                     | ~300 MB           | ~12 CPU hours
LIGO       | O(1K)                   |                       | ~1 MB             | ~2 min
SDSS       | O(20K)                  | 10                    | ~1 MB             | ~1-5 min

Number of resources: currently several Condor pools and clusters with 100s of nodes

GGF Summer School Workflow Management 24  Astronomy
- Galaxy Morphology (National Virtual Observatory)
  - Investigates the dynamical state of galaxy clusters
  - Explores galaxy evolution inside the context of large-scale structure
  - Uses galaxy morphologies as a probe of the star formation and stellar distribution history of the galaxies inside the clusters
  - Data-intensive computations involving hundreds of galaxies in a cluster
[Image: the x-ray emission is shown in blue, and the optical emission is in red. The colored dots are located at the positions of the galaxies within the cluster; the dot color represents the value of the asymmetry index. Blue dots, representing the most asymmetric galaxies, are scattered throughout the image, while orange dots, representing the most symmetric galaxies (indicative of ellipticals), are concentrated more toward the center.]
People involved: Gurmeet Singh, Mei-Hui Su, many others

GGF Summer School Workflow Management 25  Astronomy
- Montage (NASA and NVO) (Bruce Berriman, John Good, Joe Jacob, Gurmeet Singh, Mei-Hui Su)
  - Deliver science-grade custom mosaics on demand
  - Produce mosaics from a wide range of data sources (possibly in different spectra)
  - User-specified parameters of projection, coordinates, size, rotation and spatial sampling
- Sloan Digital Sky Survey (GriPhyN project): finding clusters of galaxies from the Sloan Digital Sky Survey database of galaxies. Led by Jim Annis (Fermi), Mike Wilde (ANL)

GGF Summer School Workflow Management 26  Montage Workflow
- Transfer the template header
- Transfer the image file
- Re-projection of images
- Calculating the difference
- Fit to a common plane
- Background modeling
- Background correction
- Adding the images to get the final mosaic
- Register the mosaic in RLS

GGF Summer School Workflow Management 27  BLAST: a set of sequence comparison algorithms used to search sequence databases for optimal local alignments to a query
Led by Veronika Nefedova (ANL) as part of the PACI Data Quest Expedition program.
Two major runs were performed using Chimera and Pegasus:
1) 60 genomes (4,000 sequences each) processed in 24 hours
  - Genomes selected from DOE-sponsored sequencing projects
  - 67 CPU-days of processing time delivered
  - ~10,000 Grid jobs
  - >200,000 BLAST executions
  - 50 GB of data generated
2) 450 genomes processed
Speedups of 5-20x were achieved because the compute nodes were used efficiently, by keeping the submission of jobs to the compute cluster constant.

GGF Summer School Workflow Management 28  Biology Applications (cont'd)
Tomography (NIH-funded project)
- Derivation of 3D structure from a series of 2D electron microscopic projection images
- Reconstruction and detailed structural analysis
  - complex structures like synapses
  - large structures like dendritic spines
- Acquisition and generation of huge amounts of data
- Large amount of state-of-the-art image processing required to segment structures from extraneous background
[Image: dendrite structure to be rendered by tomography]
Work performed by Mei-Hui Su with Mark Ellisman, Steve Peltier, Abel Lin, Thomas Molina (SDSC)

GGF Summer School Workflow Management 29  Physics (GriPhyN Project)
- High-energy physics
  - CMS — collaboration with Rick Cavanaugh, UFL
    - Processed simulated events
    - Cluster of 25 dual-processor Pentium machines
    - Computation: 7 days, 678 jobs with 250 events each
    - Produced ~200 GB of simulated data
  - Atlas
    - Uses GriPhyN technologies for production (Rob Gardner)
- Gravitational-wave science (collaboration with Bruce Allen, A. Lazzarini and S. Koranda)

GGF Summer School Workflow Management 30  LIGO's pulsar search at SC 2002
- The pulsar search conducted at SC 2002
  - Used LIGO's data collected during the first scientific run of the instrument
  - Targeted a set of 1000 locations of known pulsars as well as random locations in the sky
  - Results of the analysis were published via LDAS (LIGO Data Analysis System) to the LIGO Scientific Collaboration
  - Performed using LDAS and compute and storage resources at Caltech, University of Southern California, and University of Wisconsin Milwaukee
ISI people involved: Gaurang Mehta, Sonal Patil, Srividya Rao, Gurmeet Singh, Karan Vahi
Visualization by Marcus Thiebaux

GGF Summer SchoolWorkflow Management31 Outline l Workflows on the Grid l The GriPhyN project l Chimera l Pegasus l Research issues l Exercises

GGF Summer SchoolWorkflow Management32 Chimera Virtual Data System Outline l Virtual data concept and vision l VDL – the Virtual Data Language l Simple virtual data examples l Virtual data applications in High Energy Physics and Astronomy

GGF Summer SchoolWorkflow Management33 The Virtual Data Concept Enhance scientific productivity through: l Discovery and application of datasets and programs at petabyte scale l Enabling use of a worldwide data grid as a scientific workstation Virtual Data enables this approach by creating datasets from workflow “recipes” and recording their provenance.

GGF Summer SchoolWorkflow Management34 Virtual Data Vision

GGF Summer SchoolWorkflow Management35 Virtual Data System Capabilities Producing data from transformations with uniform, precise data interface descriptions enables… l Discovery: finding and understanding datasets and transformations l Workflow: structured paradigm for organizing, locating, specifying, & producing scientific datasets –Forming new workflow –Building new workflow from existing patterns –Managing change l Planning: automated to make the Grid transparent l Audit: explanation and validation via provenance

GGF Summer School Workflow Management 36  Virtual Data Scenario
[Figure: a chain of derivations (simulate, reformat, conv, psearch, summarize) connecting file1 through file8]
- On-demand data generation
- Update workflow following changes
- Manage workflow
- Explain provenance, e.g. for file8:
  psearch -t 10 -i file3 file4 file5 -o file8
  summarize -t 10 -i file6 -o file7
  reformat -f fz -i file2 -o file3 file4 file5
  conv -l esd -o aod -i file2 -o file6
  simulate -t 10 -o file1 file2

GGF Summer SchoolWorkflow Management37 VDL: Virtual Data Language Describes Data Transformations l Transformation –Abstract template of program invocation –Similar to "function definition" l Derivation –“Function call” to a Transformation –Store past and future: >A record of how data products were generated >A recipe of how data products can be generated l Invocation –Record of a Derivation execution

GGF Summer School Workflow Management 38  Example Transformation

TR t1( out a2, in a1, none pa = "500", none env = "100000" ) {
  argument = "-p "${pa};
  argument = "-f "${a1};
  argument = "-x -y";
  argument stdout = ${a2};
  profile env.MAXMEM = ${env};
}

[Figure: t1 reads $a1 and writes $a2]

GGF Summer School Workflow Management 39  Example Derivations

DV d1->t1 ( env="20000", pa="600", );
DV d2->t1 ( );
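Written out with explicit logical-file bindings (the @{in:...}/@{out:...} annotations of textual VDL), these two derivations would look roughly like the sketch below; the LFNs run1.* and run2.* are made-up placeholders, and d1 overrides the default pa and env values while d2 accepts them.

DV d1->t1 ( a2=@{out:"run1.a2"}, a1=@{in:"run1.a1"}, env="20000", pa="600" );
DV d2->t1 ( a2=@{out:"run2.a2"}, a1=@{in:"run2.a1"} );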

GGF Summer School Workflow Management 40  Workflow from File Dependencies

TR tr1(in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
TR tr2(in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
DV x1->tr1( a1=@{in:"file1"}, a2=@{out:"file2"} );
DV x2->tr2( a1=@{in:"file2"}, a2=@{out:"file3"} );

[Figure: file1 -> x1 -> file2 -> x2 -> file3]

GGF Summer SchoolWorkflow Management41 Example Workflow l Complex structure –Fan-in –Fan-out –"left" and "right" can run in parallel l Uses input file –Register with RC l Complex file dependencies –Glues workflow findrange analyze preprocess

GGF Summer School Workflow Management 42  Workflow step "preprocess"
- TR preprocess turns f.a into f.b1 and f.b2

TR preprocess( output b[], input a ) {
  argument = "-a top";
  argument = " -i "${input:a};
  argument = " -o " ${output:b};
}

- Makes use of the "list" feature of VDL
  - Generates 0..N output files
  - The number of files depends on the caller

GGF Summer School Workflow Management 43  Workflow step "findrange"
- Turns two inputs into one output

TR findrange( output b, input a1, input a2,
              none name="findrange", none p="0.0" ) {
  argument = "-a "${name};
  argument = " -i " ${a1} " " ${a2};
  argument = " -o " ${b};
  argument = " -p " ${p};
}

- Uses the default argument feature

GGF Summer School Workflow Management 44  Can also use list[] parameters

TR findrange( output b, input a[],
              none name="findrange", none p="0.0" ) {
  argument = "-a "${name};
  argument = " -i " ${" "|a};
  argument = " -o " ${b};
  argument = " -p " ${p};
}

GGF Summer School Workflow Management 45  Workflow step "analyze"
- Combines intermediary results

TR analyze( output b, input a[] ) {
  argument = "-a bottom";
  argument = " -i " ${a};
  argument = " -o " ${b};
}

GGF Summer School Workflow Management 46  Complete VDL workflow
- Generate appropriate derivations

DV out:"f.b2"} ], );
DV left->findrange( name="left", p="0.5" );
DV right->findrange( name="right" );
DV );
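Spelled out in full, the four derivations of this black-diamond workflow plausibly read as below; the derivation names top and bottom and the LFNs f.a, f.c1, f.c2 and f.d are assumptions inferred from the preprocess, findrange, analyze and diamond definitions on the surrounding slides, so treat them as illustrative rather than exact.

DV top->preprocess( b=[ @{out:"f.b1"}, @{out:"f.b2"} ], a=@{in:"f.a"} );
DV left->findrange( b=@{out:"f.c1"}, a1=@{in:"f.b1"}, a2=@{in:"f.b2"}, name="left", p="0.5" );
DV right->findrange( b=@{out:"f.c2"}, a1=@{in:"f.b1"}, a2=@{in:"f.b2"}, name="right" );
DV bottom->analyze( b=@{out:"f.d"}, a=[ @{in:"f.c1"}, @{in:"f.c2"} ] );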

GGF Summer SchoolWorkflow Management47 Compound Transformations l Using compound TR –Permits composition of complex TRs from basic ones –Calls are independent >unless linked through LFN –A Call is effectively an anonymous derivation >Late instantiation at workflow generation time –Permits bundling of repetitive workflows –Model: Function calls nested within a function definition

GGF Summer School Workflow Management 48  Compound Transformations (cont)
- TR diamond bundles black-diamonds

TR diamond( out fd, io fc1, io fc2, io fb1, io fb2, in fa, p1, p2 ) {
  call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] );
  call findrange( a1=${in:fb1}, a2=${in:fb2}, name="LEFT",  p=${p1}, b=${out:fc1} );
  call findrange( a1=${in:fb1}, a2=${in:fb2}, name="RIGHT", p=${p2}, b=${out:fc2} );
  call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} );
}

GGF Summer School Workflow Management 49  Compound Transformations (cont)
- Multiple DVs allow easy generator scripts:

DV d1->diamond( p2="100", p1="0" );
DV d2->diamond( p2=" ", p1="0" );
...
DV d70->diamond( p2="800", p1="18" );

GGF Summer School Workflow Management 50  Virtual Data Application: High Energy Physics Data Analysis
[Figure: a tree of related virtual data products, each described by metadata attributes such as mass = 200, decay = WW/ZZ/bb, stability, event, plot, LowPt and HighPt]
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida

GGF Summer School Workflow Management 51  Virtual Data Example: Galaxy Cluster Search
Jim Annis, Steve Kent, Vijay Sehkri (Fermilab); Michael Milligan, Yong Zhao (University of Chicago)
[Figure: Sloan data, the cluster-search DAG, and the galaxy cluster size distribution]

GGF Summer SchoolWorkflow Management52 Cluster Search Workflow Graph and Execution Trace Workflow jobs vs time

GGF Summer SchoolWorkflow Management53 Outline l Workflows on the Grid l The GriPhyN project l Chimera l Pegasus l Research issues l Exercises

GGF Summer SchoolWorkflow Management54 Outline l Pegasus Introduction l Pegasus and Other Globus Components l Pegasus’ Concrete Planner l Deferred planning mode l Pegasus portal l Future Improvements

GGF Summer SchoolWorkflow Management55 Grid Applications l Increasing in the level of complexity l Use of individual application components l Reuse of individual intermediate data products l Description of Data Products using Metadata Attributes l Execution environment is complex and very dynamic –Resources come and go –Data is replicated –Components can be found at various locations or staged in on demand l Separation between –the application description –the actual execution description

GGF Summer SchoolWorkflow Management56 Abstract Workflow Generation Concrete Workflow Generation

GGF Summer School Workflow Management 57  Pegasus
- Flexible framework, maps abstract workflows onto the Grid
- Possesses well-defined APIs and clients for:
  - Information gathering
    - Resource information
    - Replica query mechanism
    - Transformation catalog query mechanism
  - Resource selection
    - Compute site selection
    - Replica selection
  - Data transfer mechanism
- Can support a variety of workflow executors

GGF Summer SchoolWorkflow Management58 Pegasus Components

GGF Summer SchoolWorkflow Management59 Pegasus: A particular configuration l Automatically locates physical locations for both components (transformations) and data –Use Globus RLS and the Transformation Catalog l Finds appropriate resources to execute the jobs –Via Globus MDS l Reuses existing data products where applicable –Possibly reduces the workflow l Publishes newly derived data products –RLS, Chimera virtual data catalog

GGF Summer SchoolWorkflow Management60

GGF Summer School Workflow Management 61  Replica Location Service
- Pegasus uses the RLS to find input data
- Pegasus uses the RLS to register new data products
[Figure: Local Replica Catalogs (LRC) and a Replica Location Index (RLI) serving the computation]

GGF Summer SchoolWorkflow Management62 Use of MDS in Pegasus l MDS provides up-to-date Grid state information –Total and idle job queues length on a pool of resources (condor) –Total and available memory on the pool –Disk space on the pools –Number of jobs running on a job manager l Can be used for resource discovery and selection –Developing various task to resource mapping heuristics l Can be used to publish information necessary for replica selection –Developing replica selection components

GGF Summer School Workflow Management 63  Abstract Workflow Reduction
- The output jobs of the DAG are all the leaf nodes, i.e. f, h, i
- Each job requires 2 input files and generates 2 output files
- The user specifies the output location
[Figure: example DAG with jobs a-i; key: original node, input transfer node, registration node, output transfer node, node deleted by the reduction algorithm]

GGF Summer School Workflow Management 64  Optimizing from the point of view of Virtual Data
- Jobs d, e, f have output files that have been found in the Replica Location Service
- Additional jobs are deleted
- All jobs (a, b, c, d, e, f) are removed from the DAG
[Figure: the same DAG with nodes a-f marked as deleted by the reduction algorithm]

GGF Summer School Workflow Management 65  Plans for staging data in
- The planner picks execution and replica locations
- Transfer nodes are added for the input files of the root nodes
[Figure: the reduced DAG with input transfer nodes added; key: original node, input transfer node, registration node, output transfer node, node deleted by the reduction algorithm]

GGF Summer School Workflow Management 66  Staging data out and registering new derived products in the RLS
- Staging and registration nodes are added for each job that materializes data (g, h, i)
- The output files of the leaf job (f) are transferred to the output location
[Figure: the DAG with output transfer and registration nodes added]

GGF Summer School Workflow Management 67  The final, executable DAG
[Figure: the input DAG (jobs a-i) side by side with the final executable DAG (jobs g, h, i plus input transfer, output transfer, and registration nodes)]

GGF Summer School Workflow Management 68  Pegasus Components
- Concrete Planner and Submit file generator (gencdag)
  - The Concrete Planner of the VDS makes the logical-to-physical mapping of the DAX, taking into account the pool where the jobs are to be executed (execution pool) and the final output location (output pool).
- Java Replica Location Service Client (rls-client & rls-query-client)
  - Used to populate and query the Globus Replica Location Service.

GGF Summer School Workflow Management 69  Pegasus Components (cont'd)
- XML Pool Config generator (genpoolconfig)
  - The Pool Config generator queries the MDS as well as local pool config files to generate an XML pool config, which is used by Pegasus.
  - MDS is preferred for generating the pool configuration as it provides much richer information about the pool, including queue statistics, available memory, etc.
- The following catalogs are looked up to make the translation
  - Transformation Catalog (tc.data)
  - Pool Config File
  - Replica Location Services
  - Monitoring and Discovery Services

GGF Summer School Workflow Management 70  Transformation Catalog (Demo)
- Consists of a simple text file
  - Contains mappings of logical transformations to physical transformations
- Format of the tc.data file:
  #poolname  logical tr  physical tr              env
  isi        preprocess  /usr/vds/bin/preprocess  VDS_HOME=/usr/vds/;
- All the physical transformations are absolute path names
- The environment string contains all the environment variables required for the transformation to run on the execution pool
- A DB-based TC is in the testing phase
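As an illustration of how several sites appear in the same catalog, two more entries following the format above might look like this; the second pool name, the paths, and the environment values are invented for the example:

  isi        findrange   /usr/vds/bin/findrange   VDS_HOME=/usr/vds/;
  ufl        preprocess  /opt/vds/bin/preprocess  VDS_HOME=/opt/vds/;

Listing the same logical transformation on more than one pool is what lets gencdag (described below) choose among candidate execution sites.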

GGF Summer SchoolWorkflow Management71 Pool Config (Demo) l Pool Config is an XML file which contains information about various pools on which DAGs may execute. l Some of the information contained in the Pool Config file is –Specifies the various job-managers that are available on the pool for the different types of condor universes. –Specifies the GridFtp storage servers associated with each pool. –Specifies the Local Replica Catalogs where data residing in the pool has to be cataloged. –Contains profiles like environment hints which are common site-wide. –Contains the working and storage directories to be used on the pool.

GGF Summer SchoolWorkflow Management72 Pool config l Two Ways to construct the Pool Config File. –Monitoring and Discovery Service –Local Pool Config File (Text Based) l Client tool to generate Pool Config File –The tool genpoolconfig is used to query the MDS and/or the local pool config file/s to generate the XML Pool Config file.

GGF Summer School Workflow Management 73  Gvds.Pool.Config
- This file is read by the information provider and published into MDS.
- Format (one key : value pair per line):
  gvds.pool.id :
  gvds.pool.lrc :
  gvds.pool.gridftp :
  gvds.pool.gridftp :
  gvds.pool.universe :
  gvds.pool.gridlaunch :
  gvds.pool.workdir :
  gvds.pool.profile :
  gvds.pool.profile :
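A filled-in entry for one pool might look roughly like the sketch below; every value (site name, hosts, paths) is a hypothetical placeholder, and the exact value syntax expected for the universe and profile keys may differ from what is shown here.

  gvds.pool.id : isi_condor
  gvds.pool.lrc : rls://replica.example.edu
  gvds.pool.gridftp : gsiftp://data.example.edu/storage
  gvds.pool.universe : vanilla
  gvds.pool.gridlaunch : /usr/vds/bin/kickstart
  gvds.pool.workdir : /scratch/vds
  gvds.pool.profile : VDS_HOME=/usr/vds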

GGF Summer School Workflow Management 74  Properties
- Properties files define and modify the behavior of Pegasus.
- Properties set in $VDS_HOME/properties can be overridden by defining them either in $HOME/.chimerarc or by giving them on the command line of any executable.
  - e.g. gendax -Dvds.home=<path to VDS home> ...
- Some examples follow, but for more details please read the sample.properties file in the $VDS_HOME/etc directory.
- Basic required properties
  - vds.home: auto-set by the clients from the environment variable $VDS_HOME
  - vds.properties: path to the default properties file
    - Default: ${vds.home}/etc/properties
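For instance, the two override mechanisms described above could be used as follows; the file path and property values are hypothetical and only meant to show the shape of each mechanism.

  # in $HOME/.chimerarc (standard key=value properties syntax)
  vds.properties=/home/user/conf/properties

  # or on the command line of any client
  gendax -Dvds.home=/opt/vds ...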

GGF Summer School Workflow Management 75  Concrete Planner Gencdag
- The concrete planner takes the DAX produced by Chimera and converts it into a set of Condor DAG and submit files.
- Usage: gencdag --dax <dax file> --p <execution pool(s)> [--dir <directory>] [--o <output pool>] [--force]
- You can specify more than one execution pool. Execution will take place on the pools on which the executable exists. If the executable exists on more than one pool, then the pool on which the executable will run is selected randomly.
- The output pool is the pool to which you want all the output products to be transferred. If not specified, the materialized data stays on the execution pool.
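A typical invocation, following the usage line above, might therefore look like the following; the DAX file name, pool name, and output directory are hypothetical:

  gencdag --dax blackdiamond.dax --p isi_condor --o isi_condor --dir ./dags

The resulting .dag and submit files in ./dags are then handed to DAGMan for execution, as described in the deferred-planning example later in the tutorial.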

GGF Summer SchoolWorkflow Management76 Original Pegasus configuration Simple scheduling: random or round robin using well-defined scheduling interfaces.

GGF Summer SchoolWorkflow Management77 Deferred Planning through Partitioning A variety of planning algorithms can be implemented

GGF Summer SchoolWorkflow Management78 Mega DAG is created by Pegasus and then submitted to DAGMan

GGF Summer SchoolWorkflow Management79 Mega DAG Pegasus

GGF Summer SchoolWorkflow Management80 Re-planning capabilities

GGF Summer SchoolWorkflow Management81 Complex Replanning for Free (almost)

GGF Summer SchoolWorkflow Management82 Optimizations l If the workflow being refined by Pegasus consists of only 1 node –Create a condor submit node rather than a dagman node –This optimization can leverage Euryale’s super-node writing component

GGF Summer SchoolWorkflow Management83 Planning & Scheduling Granularity l Partitioning –Allows to set the granularity of planning ahead l Node aggregation –Allows to combine nodes in the workflow and schedule them as one unit (minimizes the scheduling overheads) –May reduce the overheads of making scheduling and planning decisions l Related but separate concepts –Small jobs >High-level of node aggregation >Large partitions –Very dynamic system >Small partitions

GGF Summer School Workflow Management 84
- Create workflow partitions
  partitiondax --dax ./blackdiamond.dax --dir dax
- Create the MegaDAG (creates the DAGMan submit files)
  gencdag -Dvds.properties=~/conf/properties --pdax ./dax/blackdiamond.pdax --pools isi_condor --o isi_condor --dir ./dags/
  Note the --pdax option instead of the normal --dax option.
- Submit the .dag file for the mega DAG
  condor_submit_dag black-diamond_0.dag

GGF Summer SchoolWorkflow Management85

GGF Summer SchoolWorkflow Management86 LIGO Scientific Collaboration l Continuous gravitational waves are expected to be produced by a variety of celestial objects l Only a small fraction of potential sources are known l Need to perform blind searches, scanning the regions of the sky where we have no a priori information of the presence of a source –Wide area, wide frequency searches l Search is performed for potential sources of continuous periodic waves near the Galactic Center and the galactic core l The search is very compute and data intensive l LSC used the occasion of SC2003 to initiate a month-long production run with science data collected during 8 weeks in the Spring of 2003

GGF Summer SchoolWorkflow Management87 Additional resources used: Grid3 iVDGL resources

GGF Summer School Workflow Management 88  LIGO Acknowledgements
- Bruce Allen, Scott Koranda, Brian Moe, Xavier Siemens, University of Wisconsin Milwaukee, USA
- Stuart Anderson, Kent Blackburn, Albert Lazzarini, Dan Kozak, Hari Pulapaka, Peter Shawhan, Caltech, USA
- Steffen Grunewald, Yousuke Itoh, Maria Alessandra Papa, Albert Einstein Institute, Germany
- Many others involved in the Testbed
- group.phys.uwm.edu/lscdatagrid/
- LIGO Laboratory operates under NSF cooperative agreement PHY

GGF Summer School Workflow Management 89  Montage
- Montage (NASA and NVO)
  - Deliver science-grade custom mosaics on demand
  - Produce mosaics from a wide range of data sources (possibly in different spectra)
  - User-specified parameters of projection, coordinates, size, rotation and spatial sampling
[Image: mosaic created by Pegasus-based Montage from a run of the M101 galaxy images on the TeraGrid]

GGF Summer SchoolWorkflow Management90 Small Montage Workflow ~1200 nodes

GGF Summer SchoolWorkflow Management91 Montage Acknowledgments l Bruce Berriman, John Good, Anastasia Laity, Caltech/IPAC l Joseph C. Jacob, Daniel S. Katz, JPL l caltech.edu/ l Testbed for Montage: Condor pools at USC/ISI, UW Madison, and Teragrid resources at NCSA, PSC, and SDSC. Montage is funded by the National Aeronautics and Space Administration's Earth Science Technology Office, Computational Technologies Project, under Cooperative Agreement Number NCC5-626 between NASA and the California Institute of Technology.

GGF Summer SchoolWorkflow Management92 Portal Demonstration

GGF Summer SchoolWorkflow Management93 Outline l Workflows on the Grid l The GriPhyN project l Chimera l Pegasus l Research issues l Exercises

GGF Summer SchoolWorkflow Management94 Grid3 – The Laboratory Supported by the National Science Foundation and the Department of Energy.

GGF Summer SchoolWorkflow Management95 Grid3 – Cumulative CPU Days to ~ 25 Nov 2003

GGF Summer SchoolWorkflow Management96 Grid2003: ~100TB data processed to ~ 25 Nov 2003

GGF Summer School Workflow Management 97  Research issues
- Focus on data-intensive science
- Planning is necessary
- Reaction to the environment is a must (things go wrong, resources come up)
- Iterative workflow execution:
  - Workflow Planner
  - Workload Manager
- Planning decision points
  - Workflow delegation time (eager)
  - Activity scheduling time (deferred)
  - Resource availability time (just in time)
- Decision specification level
- Reacting to the changing environment and recovering from failures
- How does the communication take place? Callbacks, workflow annotations, etc.
[Figure: the Planner produces a concrete workflow for the Manager, which dispatches tasks to the Grid; resource and abstract workflow information flow back from the Resource Manager]

GGF Summer SchoolWorkflow Management98 Future work l Staging in executables on demand l Expanding the scheduling plug-ins l Investigating various partitioning approaches l Investigating reliability across partitions

GGF Summer School Workflow Management 99  For further information
- Chimera and Pegasus: pegasus.isi.edu
- Workflow Management research group in GGF:

GGF Summer SchoolWorkflow Management100 Outline l Workflows on the Grid l The GriPhyN project l Chimera l Pegasus l Research issues l Exercises