1 Yolanda Gil AAAI-08 Tutorial July 13, 2008 USC Information Sciences Institute Part III Computational Workflows in Wings/Pegasus AAAI-08.

Presentation transcript:

1 Yolanda Gil AAAI-08 Tutorial July 13, 2008 USC Information Sciences Institute Part III Computational Workflows in Wings/Pegasus AAAI-08 Tutorial on Computational Workflows for Large-Scale Artificial Intelligence Research

2 Our Approach
Express analysis as distributed workflows
- Data analysis as distributed application
User-centric workflow refinement process
- Start with high-level problem description, add layers of detail, map to distributed execution environment
Knowledge-rich descriptions of workflows -- OWL/RDF
- Descriptions of input data and data products (aka “metadata”)
- Models of components in terms of I/O data and their function
Automation of resource allocation and optimization
- Efficient scheduling algorithms for workflow graphs
- Optimization techniques of broad applicability
Build on distributed computing research -- GRID
- Designed, by definition, to be robust, secure, flexible

3 The Wings/Pegasus Workflow System
[Gil et al 07; Deelman et al 03; Deelman et al 05; Kim et al 08; Gil et al forthcoming]
WINGS: Knowledge-based workflow environment
- Ontology-based reasoning on workflows and data (W3C’s OWL)
- Workflow library of useful analyses
- Proactive assistance + automation
- Execution-independent workflows
Pegasus: Automated workflow refinement and execution (pegasus.isi.edu)
- Optimize for performance, cost, reliability
- Assign execution resources
- Manage execution through DAGMan
- Daily operational use in many domains
Grid services (condor.uwisc.edu)
- Secure and controlled sharing of distributed services, computing, data
- Scalable service-oriented architecture
- Commercial quality, open source

4 Wings: Workflow Instance Generation and Selection
Stages shown: Workflow Creation, Component Specification, Workflow Selection, Data Selection
Products shown: Workflow Template, Workflow Instance (WINGS, OWL), Executable Workflow (Pegasus), run on DAGMan/Grid
- Workflow templates specify complex analysis sequences
- Workflow instances specify data
Resources:
- Workflow Libraries
- Data Repositories: preexisting data collections, workflow execution results
- Application Components: each specifies data requirements and execution requirements
- Ontologies: domain terms, component types, workflow products
User roles shown: student, seasoned NL researcher, algorithm developer
Example requests:
- “Show me workflows that classify datasets”
- “Run this workflow with the weather1980 data set”
- “Validate this workflow based on the component specs”
- “Here is a new classification algorithm, has a parameter for smoothing, is compiled for MPI”

5 [Figure: software stack. All software is open source]
Wings: Workflow validation, Data/Comp selection, Metadata generation, Workflow generation
Pegasus: Site selection, Replica selection, Workflow optimization
Workflow submission
Globus: RLS (replica mgmt), GRAM (remote submission), GridFTP (data transfer)
Condor: DAGMan (execution engine), Condor-G (job manager)
Nagios: monitoring probes
LEGEND: Workflow System; National Middleware Infrastructure (NMI) software

6 Workflow Structure
We take to heart the separation of “programming” from “analysis” activities
- Components are designed by programmers and can be complex (and need testing, debugging, loops should terminate, etc.)
- Workflows are composed by non-programmers and should have simple structure; the focus is on selecting application components and data
Therefore, our workflow structure is very streamlined
- Only iterations handled are parallel data processing pipelines
- Only conditionals handled are data-driven component selections
- Standard workflow languages offer much more complex constructs
Workflow structure designed to:
- Be accessible to users
- Facilitate automation and failure recovery

7 Core Workflow Concepts
Workflow consists of
- Components: software to be executed
- Links: data flow among components
Directed Acyclic Graphs (DAGs)
- Facilitate automation, esp. execution monitoring and repair
- Data always handled through files
Special handling of some control constructs
- Loops (more on this later)
- Choices of components
- Iterations over data sets
Layered workflow refinement process
- Select application components -> select data -> select execution resources
- Each layer adds more information to the same basic workflow structure
[Figure: example workflow with components C1, C2, C3 and files F1 to F6]
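
A minimal Python sketch of the DAG idea on this slide. The wiring among C1, C2, and C3 is an assumption made for illustration, not read off the slide figure, and the helper names are hypothetical. It uses the standard-library `graphlib` module to order components along their data-flow links and to reject cyclic structures.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical wiring: each component maps to the components whose
# output files it consumes (its predecessors in the data flow).
links = {
    "C1": set(),
    "C2": {"C1"},
    "C3": {"C1", "C2"},
}


def execution_order(graph):
    """Return one component order that respects every data-flow link."""
    return list(TopologicalSorter(graph).static_order())


def is_dag(graph):
    """Report whether the workflow graph is free of cycles."""
    try:
        execution_order(graph)
        return True
    except CycleError:
        return False


order = execution_order(links)
print(order)
```

Because the structure is acyclic, an execution monitor can always tell which components are ready to run and which must be repeated after a failure.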

8 Workflow Abstraction Layers
We use several layers of description of workflows

9 WINGS: Workflow Representation

10 [Figure: Domain Ontology and Ontology of Components]
Domain Ontology (terms shown):
- Hazard-Level; Hazard-Level-with-SA, Hazard-Level-with-PGA, Hazard-Level-with-PGV
- Hazard-Level-with-SA-Median, Hazard-Level-with-SA-Std-Dev, Hazard-Level-with-SA-Prob-Exc
- Hazard-Level-with-Median, Hazard-Level-with-Std-Dev, ...
- F2-Hazard-Level, F2-SA-Median-wrt-VS30, ...
- IMR-Input-Parameter, Field-2000-Input-Parameter
- Parameter: Fault-Type, Basin-Depth, Distance, ...
- IMT probability-function, IMR probability-function
Ontology of Components (terms shown):
- Compute-Hazard-Level-given-IMR-input-parameters, ...
- Compute-Hazard-Level-with-SA-given-IMR-input-parameters, Compute-Hazard-Level-with-PGA-given-IMR-input-parameters, Compute-Hazard-Level-with-PGV-given-IMR-input-parameters
- Compute-Hazard-Level-with-SA-Median-given-IMR-input-parameters, Compute-Hazard-Level-with-SA-Std-Dev-given-IMR-input-parameters, Compute-Hazard-Level-with-SA-Prob-Exc-given-IMR-input-parameters
- Compute-F2-Hazard-Level-given-Field-2000-input-parameters, Compute-F2-SA-Median-given-Field-2000-input-parameters
- Compute-F2-SA-Median-wrt-Distance-JB-given-Fault-Type-&-Basin-Depth-&-…, Compute-F2-SA-MEDIAN-wrt-VS30-given-Fault-Type-&-Basin-Depth-&-…
- F2-operation-SA-Median-Distance-JB, F2-operation-SA-Median-VS30

11 WINGS: Representing Components
Inputs and outputs through files
- Files are typed
- Each input is uniquely identified by a file descriptor (~ parameterID)
- Ordered lists of file descriptors for both I and O
Any input or output can be defined as a file collection
- Same file type
- Unspecified cardinality
- Ordered
[Figure: component C-one with file descriptors D1, D2, D3; component C-many with file descriptors DC11, D12, D13]

12 Data Descriptions
Metadata of different kinds can be organized in ontology
Files represented as instances and classified in ontology according to their metadata
File collections also represented as instances and defined as ordered sets of file instances
A file Skolem is created for each class as a representative instance (more on this later)
Similarly, a file collection Skolem is created for each class
[Figure: Application-Specific Metadata Ontologies, Content Metadata, Format Metadata; File Collection; example instances Kim-Homepage, Gil-Homepage, EHS-T, EHCS-T, IKCAP-pages]

13 A Component in a Workflow Template
Nodes correspond to individual application components
Links include file descriptors for origin and destination and a file Skolem
Notation: “S” marks a Skolem
[Figure: nodes N1, N2, N3 (components C-one, C67) joined by links L1 to L4 that carry file Skolems FS-A, FS-B, FS-C, FS-D; file descriptors D1, D2, D3, D6, D7]

14 File Collections in a Workflow Template
Links that include file descriptors that are collections refer to file collection Skolems
Using the same file Skolem ID or file collection Skolem ID in different links indicates identity
[Figure: node N1 (component C-many) with links L1, L2, L3 that carry FCS-A, FS-B, FS-C; file descriptors DC11, D12, D13]

15 Iteration Over File Collections in a Workflow Template
Iteration over sets compactly represented with single nodes that contain component collections
- Will be expanded to as many jobs as files are specified for the executable workflow
Links capture formation of file collections as input
[Figure: node NC1 (a collection of C-one) expands into C-one jobs labeled G1/K1/Z1, G2/K2/Z2, ..., G88/K88/Z88; node N2 (C-many) with Y1; links L1 to L4; Skolems FCS-G, FCS-K, FCS-Z, FS-Y]

16 Iteration With a Constant in a Workflow Template
Nodes that represent component collections can take the same file from the same link when the link contains a file Skolem instead of a file collection Skolem
[Figure: C-one jobs labeled G1/K1/Z1, G2/K1/Z2, ..., G88/K1/Z88, all sharing the file K1; node N2 (C-many) with Y1; links L1 to L4; Skolems FCS-G, FS-K1, FCS-Z, FS-Y]
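
The expansion of a component-collection node can be sketched in Python as follows. This is an illustration under stated assumptions, not the Wings implementation: the function name, the job dictionary layout, and the reading of G1, G2, G88 as per-job inputs and K1 as the shared constant are hypothetical.

```python
def expand_collection_node(component, varying_inputs, constant_inputs=()):
    """Expand one component-collection node into one job per varying input.

    varying_inputs: files of a collection, one consumed by each job.
    constant_inputs: files handed unchanged to every job.
    """
    jobs = []
    for index, input_file in enumerate(varying_inputs, start=1):
        jobs.append({
            "job": f"{component}-{index}",
            "inputs": [input_file, *constant_inputs],
        })
    return jobs


# Hypothetical bindings loosely following the slide labels.
jobs = expand_collection_node("C-one", ["G1", "G2", "G88"], constant_inputs=["K1"])
for job in jobs:
    print(job)
```

The template keeps a single node however many files are bound later; the number of jobs is decided only when the collection is known.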

17 Constraints on Workflow Templates
Constraints on number of elements in different collections
Constraints on files/collections of different workflow components
[Figure: CybershakeTemplate with hasLink to InputLink_SiteNameFile_to_BoxNameCheck, InputLink_RuptureVars_to_SeisgmogramGen (F-RV, C-RuptVars, CC-RuptureVariations), InputLink_SGTCollforRup_to_SeismogramGen (F-SGT, C-SGT-forRups, CC-SGTs), …; properties hasFile, hasSiteName, hasN_Items; values SiteNameFile, SGTsSiteName, SiteName, N_Rups related by isSameAs]
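
A small Python sketch of how the two kinds of constraints named on this slide might be checked. It assumes that the `isSameAs` relations mean "equal item counts" for collections and "equal site name" for files; the function and its argument layout are hypothetical, not the Wings reasoner.

```python
def check_template_constraints(collections, site_names):
    """Return a list of violated constraints (empty when all hold).

    collections: collection name -> list of file names.
    site_names: file or collection name -> site-name metadata value.
    """
    violations = []
    sizes = {name: len(files) for name, files in collections.items()}
    if len(set(sizes.values())) > 1:
        violations.append(f"collections differ in size: {sizes}")
    if len(set(site_names.values())) > 1:
        violations.append(f"site names differ: {site_names}")
    return violations


# Hypothetical data named after the collections in the figure.
ok = check_template_constraints(
    {"CC-RuptureVariations": ["rv1", "rv2"], "CC-SGTs": ["sgt1", "sgt2"]},
    {"SiteNameFile": "PAS", "CC-SGTs": "PAS"},
)
bad = check_template_constraints(
    {"CC-RuptureVariations": ["rv1", "rv2"], "CC-SGTs": ["sgt1"]},
    {"SiteNameFile": "PAS", "CC-SGTs": "PAS"},
)
print(ok, bad)
```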

18 Workflow Instances
Compact Workflow Instance = WT + bindings
- Easy to understand, and easily transformed into an expanded WI and a DAX for Pegasus
Existing data
- Input data selected from the file library by querying for files of the type of file Skolems
New data products
- Logical names created for new data products with metadata based on file Skolems
[Figure: nodes N1 to N4 (C-one, C67, C-plenty, C-one) with links L1 to L7; Skolems FS-A, FS-B, F-C, FCS-D, FS-E, FS-F, FS-G; bindings File85, File28, FileColl54]
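
The statement "Compact Workflow Instance = WT + bindings" can be sketched as a pair of Python data classes. The class names, the template name, and the pairing of each Skolem with a particular file are assumptions for illustration; the slide figure does not make the pairing recoverable.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WorkflowTemplate:
    name: str
    input_skolems: tuple  # Skolem IDs that must be bound to existing data


@dataclass
class WorkflowInstance:
    template: WorkflowTemplate
    bindings: dict = field(default_factory=dict)  # Skolem ID -> file or collection

    def unbound(self):
        """List the input Skolems that still lack a binding."""
        return [s for s in self.template.input_skolems if s not in self.bindings]

    def is_complete(self):
        """True once every input Skolem is bound."""
        return not self.unbound()


# Hypothetical template and bindings using labels from the slide.
template = WorkflowTemplate("example-template", ("FS-A", "FS-B", "FCS-D"))
instance = WorkflowInstance(template, {"FS-A": "File85", "FS-B": "File28"})
print(instance.unbound())
instance.bindings["FCS-D"] = "FileColl54"
print(instance.is_complete())
```

Keeping the template immutable and the bindings separate is what lets one template be reused for many analyses.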

19 AUTOMATED WORKFLOW INSTANCE GENERATION IN WINGS

20 Workflow Instance Expressions
Compact expression for efficient search and matching
Expanded expression when further details are needed
[Figure: workflow with Corpus, Kernel_Rules, Split, Filter_Rules, Prune_Rules, Binarize, Generate_Rule_Map, Compile, XRS_Rules, BRF_Rules, Lexicon_Dictionary; labels 1…n, WSJ-2001, KR]

21 Expanded Workflow Instance
Count the number of unique words in a file

22 Workflow Instance: “dax” for Pegasus
<adag xmlns="…" xmlns:xsi="…" xsi:schemaLocation="…" version="1.7" count="1" index="0" name="WorkFlow0b">
Job arguments shown:
-a top -T60 -i … -o …
-a left -T60 -i … -o … -p 0.5

23 AUTOMATED METADATA GENERATION IN WINGS

24 Metadata Reasoning for file name generation and workflow validation
Filename Generation
- Explicit representation of metadata in ontology (e.g. source id, rupture id)
- Propagate metadata attributes for all data products when creating workflow instance
- Names for intermediate files are created automatically from the metadata
Workflow Validation
- Explicit representation of metadata constraints (examples are shown below)
  - Constraints on individual files and collections
  - Constraints on component inputs and outputs
  - Constraints among components in a workflow
- Check constraints while generating workflow instantiations

25 Propagation of metadata for filename generation: an example
[Figure: component SeismogramGen_Li with inputs RVM, Rupture_variation, and SGT, and output Seismogram]
RVM: 127_6.rvm
- source_id: …
- rupture_id: 6
Rupture_variation (several files shown): 127_6.txt.variation-s0000-h…
- source_id: …
- rupture_id: 6
- slip_realization_#: 0
- hypo_center_#: 1
SGT: FD_SGT/PAS_1/A/SGT161
- site_name: PAS
- tensor_direction: 1
- time_period: A
- xyz_volumn_id: …
Seismogram: Seismogram_PAS_127_6.grm
- site_name: PAS
- source_id: …
- rupture_id: 6
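
A Python sketch of metadata propagation for this example. The name pattern `Seismogram_<site_name>_<source_id>_<rupture_id>.grm` is inferred from the output name on the slide, and the source ID value 127 is read from the file name prefix `127_6`; both are assumptions, as are the function names and the simplified metadata keys.

```python
def propagate_metadata(inputs, keys):
    """Copy the requested metadata keys from input files to a new product."""
    merged = {}
    for metadata in inputs:
        merged.update(metadata)
    return {key: merged[key] for key in keys}


def seismogram_filename(metadata):
    """Build a file name from metadata (pattern inferred from the slide)."""
    return "Seismogram_{site_name}_{source_id}_{rupture_id}.grm".format(**metadata)


# Simplified metadata for two inputs of the seismogram component.
rupture_variation = {"source_id": 127, "rupture_id": 6, "hypo_center": 1}
sgt = {"site_name": "PAS", "tensor_direction": 1, "time_period": "A"}

seismogram = propagate_metadata(
    [rupture_variation, sgt], keys=["site_name", "source_id", "rupture_id"]
)
print(seismogram_filename(seismogram))
```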

26 AUTOMATIC WORKFLOW GENERATION IN WINGS

27 Automatic Template-Based Workflow Generation Algorithm
Workflow request = Workflow Template + Seed Constraints
WR0: Workflow Template
WR0: Seed Constraints
- dataVariable5 data:contains data:Muti-party-communication
- dataVariable0 data:creator 5048
- dataVariable1 data:creator 5048
Algorithm steps and the workflows each step produces:
- Seed workflow from request (input: unified well-formed request) -> seeded workflows
- Find input data requirements -> binding-ready workflows
- Data source selection -> bound workflows
- Parameter selection -> configured workflows
- Workflow instantiation -> workflow instances
- Workflow grounding -> ground workflows
- Workflow ranking -> top-k workflows
- Workflow mapping -> executable workflows

28 Step 1: Workflow Template is Seeded

29 Step 2: Backward Sweep

30 Step 3: Select Data Sources
Figure labels: E-07, S-NY

31 Step 3: Select Data Sources
Figure labels: E-07, S-NY

32 Step 4: Forward Sweep
Figure labels: E-07, S-NY

33 Step 5: Workflow Instantiation
Figure labels: E-07, S-NY, Result-PartA, Result-PartB

34 Step 5: Workflow Instantiation
Figure labels: E-07, S-NY, Result-PartA, Result-PartB

35 Step 6: Workflow Grounding
Figure labels: E-07, S-NY, Result-PartA, Result-PartB
Ground Workflow (fragment shown): -i E o ES-07…. parent

36 Step 7: Workflow Ranking
W1: estimated exec time 3hrs
W2: estimated exec time 20hrs
W3: estimated exec time 3d
W4: estimated exec time 5hrs

37 Step 7: Workflow Ranking
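
A minimal Python sketch of top-k ranking over the four estimates shown for W1 to W4. The slide does not state the ranking criterion; ordering by shortest estimated execution time is an assumption, and the function name is hypothetical.

```python
import heapq

# Estimated execution times from the slide, converted to hours (3d = 72 h).
estimated_hours = {"W1": 3, "W2": 20, "W3": 72, "W4": 5}


def top_k_workflows(estimates, k):
    """Return the k workflows with the smallest estimated execution time."""
    return heapq.nsmallest(k, estimates, key=estimates.get)


print(top_k_workflows(estimated_hours, 2))
```

Other criteria mentioned earlier in the deck, such as cost or reliability, would slot in by changing the `key` function.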

38 Step 8: Workflow Mapping
Ground workflow:
- 15 compute nodes devoid of resource assignment
Executable workflow (mapped to 3 sites):
- data stage-in nodes
- 11 compute nodes (1-2 & 5-6 reduced based on available intermediate data)
- 8 inter-site data transfers
- 14 data stage-out nodes to long-term storage
- 14 data registration nodes (data cataloging)
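
The reduction from 15 to 11 compute nodes relies on reusing intermediate data that already exists. The Python sketch below shows one simplified way to do this, walking backward from the requested outputs and skipping any job whose products are already available. It is an illustration only, not Pegasus's actual reduction algorithm, and the job and file names are hypothetical.

```python
def reduce_workflow(jobs, final_outputs, available):
    """Return the names of jobs that still have to run.

    jobs: job name -> (input files, output files).
    final_outputs: files the user asked for.
    available: files that already exist (for example, in a replica catalog).
    """
    producer = {out: name for name, (_, outs) in jobs.items() for out in outs}
    needed = set()
    pending = [f for f in final_outputs if f not in available]
    while pending:
        wanted = pending.pop()
        job = producer.get(wanted)
        if job is None or job in needed:
            continue  # raw input with no producer, or job already scheduled
        needed.add(job)
        # Only chase inputs that are not already available.
        pending.extend(f for f in jobs[job][0] if f not in available)
    return needed


# Hypothetical three-job chain: raw -> a -> b -> c.
chain = {
    "j1": (["raw"], ["a"]),
    "j2": (["a"], ["b"]),
    "j3": (["b"], ["c"]),
}
print(reduce_workflow(chain, ["c"], available={"raw", "b"}))
```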

39 Why Do We Automate All This? So You Don’t Have To
- Workflow candidates generated + considered (many are eliminated)
- Queries about data
- Queries about tools
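
The generate-and-eliminate pattern mentioned here can be sketched in Python. The catalogs, the format-matching rule, and the function name are hypothetical; real candidate elimination in Wings is driven by ontology-based constraints rather than a simple format lookup.

```python
from itertools import product


def generate_candidates(component_choices, data_choices, is_valid):
    """Enumerate every component/data pairing and keep the valid ones."""
    candidates = [
        {"component": c, "data": d}
        for c, d in product(component_choices, data_choices)
    ]
    kept = [c for c in candidates if is_valid(c)]
    return candidates, kept


# Hypothetical catalogs: each component accepts exactly one data format.
accepts = {"classifier-A": "arff", "classifier-B": "csv"}
formats = {"weather1980": "arff", "weather1990": "csv", "notes": "txt"}

candidates, kept = generate_candidates(
    accepts, formats, lambda c: accepts[c["component"]] == formats[c["data"]]
)
print(len(candidates), "generated,", len(kept), "kept")
```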

40 WINGS DEMO

41 Editing a Seed & Template, Generating a DAX
Workflow seed = Workflow Template + Seed Constraints
WR0: Workflow Template
WR0: Seed Constraints
- dataVariable5 data:contains data:Muti-party-communication
- dataVariable0 data:creator 5048
- dataVariable1 data:creator 5048
Steps shown: Seed workflow from request, Find input data requirements, Data source selection, Parameter selection, Workflow instantiation, Workflow grounding, Workflow ranking, Workflow mapping
Products shown: unified well-formed request, seeded workflows, candidate workflows, bound workflows, configured workflows, workflow instances, ground workflows (DAXes), top-k workflows, executable workflows

42 SCEC WORKFLOWS IN WINGS

43 Seismic Hazard Analysis in Southern California Earthquake Center (SCEC)
[Slide from T. Jordan]
[Figure: Seismic Hazard Model, with Seismicity, Paleoseismology, Local site effects, Geologic structure, Faults, Stress transfer, Crustal motion, Crustal deformation, Seismic velocity structure, Rupture dynamics]

44 Reusable High-Level Workflow Templates
- Intensional descriptions of data sets
- Intensional descriptions of parallel computations
- Querying results of other data creation subworkflows
- Rich metadata descriptions for all data products

45 Workflows for Seismic Hazard Analysis
[Gil et al 06; Kim et al 06; Gil et al 07]
Input data: a site and an earthquake forecast model
- Thousands of possible fault ruptures and rupture variations, each a file, unevenly distributed
- ~110,000 rupture variations to be simulated for a given site
High-level template combines 11 application codes
8048 application nodes in the workflow instance generated by Wings
24,135 nodes in the executable workflow generated by Pegasus, including:
- data stage-in jobs, data stage-out jobs, data registration jobs
Executed in USC HPCC cluster (1820 nodes w/ dual processors), but only < 144 available
- Including MPI jobs, each of which runs on hundreds of processors for hours
- Runtime was 1.9 CPU years
Provenance records kept throughout the generation and execution process for 100,000 workflow data products

46 DAX automatically generated from WINGS
14,639 jobs for 4,626 ruptures with 106,124 rupture variations for USC site

47 Summary: Creating Workflows with WINGS
Separates analysis spec from data
- Workflow template as reusable, well-defined, acceptable analysis process
- Workflow instance binds template to data for particular analyses
Ensures that the data complies with the component specifications and their constraints within the workflow
Represents data collections (nominal or otherwise) within the workflow specification
Automatically generates descriptions and metadata for new data products to be created by the workflow execution
Compact workflow instance is user-friendly and reusable
Separates data provenance (workflow instance) and pedigree (workflow template)
Expands workflow instance into DAX for Pegasus, which creates the executable workflow

48 Key Benefits
Efficient and correct creation of new workflows
- By retrieving a template and filling in the data
Framework ensures adherence to methodology
- Represents as templates widely-accepted analysis methodologies
Supports repeatability of experiments/analyses
- Enables controlled variations
Ensures better quality of data analysis results
- Attaches provenance and pedigree information

49 Ongoing and Future Work
Interactive assistance in creating valid workflow templates
  - Based on CAT (Composition Analysis Tool) [Kim et al 05]
More sophisticated models of components
Automatic completion of a workflow's data conversion and formatting steps through AI planning techniques
Tracking new versions of components and invalidating data and workflows from old versions
Workflow template libraries
  - Indexing, retrieval
Managing collections of workflows as part of an overall analysis activity
  - E.g. parameter sweeping, variants of analysis

50 BACKUP SLIDES

51 Extension 1: Handle Collections of Collections
A set of ruptures, each with a set of variations
Each variation in a separate file
For rupture 127_6 (source ID 127, rupture ID 6), there are 8 variations
For rupture 20_0 (source ID 20, rupture ID 0), there are 1352 variations
[Figure: rupture variation files (e.g. 127_6.txt.variation-s0001-h0001, 20_0.txt.variation-s0000-h0000, 150_11.txt.variation-s0000-h0000) grouped by rupture, shown with SGT inputs]
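A set of ruptures, each with its own set of variation files, is naturally a two-level structure. The sketch below builds such a collection of collections from a flat file list. It assumes only that the text before `.txt.variation-` identifies the rupture; the function name `group_by_rupture` is hypothetical, and this is not how Wings represents collections internally.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_rupture(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Build a collection of collections: rupture -> its variation files."""
    nested = defaultdict(list)
    for name in filenames:
        # Everything before ".txt.variation-" is "<source ID>_<rupture ID>".
        rupture, sep, _ = name.partition(".txt.variation-")
        if not sep:
            raise ValueError(f"not a rupture variation file: {name}")
        nested[rupture].append(name)
    return dict(nested)


files = [
    "127_6.txt.variation-s0000-h0000",
    "127_6.txt.variation-s0001-h0001",
    "20_0.txt.variation-s0000-h0000",
    "150_11.txt.variation-s0000-h0000",
]
nested = group_by_rupture(files)
for rupture, variations in nested.items():
    print(rupture, len(variations))
```

Because the inner collections differ in size (8 variations for one rupture, 1352 for another), a fixed-width table would not fit; a mapping of lists does.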

52 Extending Wings to Handle Collections of Collections
[Figure: type diagram relating File, File Collection, Variation File, Variation File Collection and Collection of Collections through has-type links, with example instances Ruptures-PAS, Vars_127_6 and Vars_127_7 and rupture variation files such as 127_6.txt.variation-s000-h000]

53 Wings Coll/Coll
[Figure: workflow template in which SeismogramGen_Li and PeakValCalc_Okaya are applied over nested collections of rupture variations (RupVar) and SGT files, next to the expanded per-rupture steps SeisGen_Li and PeakValCalc, which produce files such as Seismograms_PAS_127_6.grm and PeakVals_allPAS_127_6.bsa]

54 Constraints (in OWL ontology)
Constraints on files
  - Metadata attributes: data types and default values
    e.g. simulation_out_timesamples of SeisParamValsFile should be an integer, and its default value is 1801
  - File name format with respect to metadata attributes
    e.g. rupture variation file: 127_6.txt.variation-s0002-h0000
    Format: <source_id>_<rupture_id>.txt.variation-s[4 digit slip_realization#]-h[4 digit hypo center #]
Constraints on collections and collections of collections
  - Type of each element
  - Relations between metadata of a collection and metadata of individual items
    e.g. each rupture variation has the same source/rupture ids as the rupture variation collection
Component-level constraints on metadata attributes of input/output files or collections
  - Deriving metadata of output files from metadata of input files
    e.g. the output of PeakValCalc_Okaya (SA output file) should have the same site name as the seismogram file
Template-level constraints on metadata attributes of files or collections
  - Input/output files of different components can have the same metadata
    e.g. the RVM collection input for SeismogramGen_Li should have the same site name as the CollOfCollection rupture variations input
  - Checking the number of items in collections
    e.g. the number of RVM files and the number of rupture variation collections should be equal
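The collection-level and template-level constraints can be read as simple checks over metadata records. The sketch below expresses two of them in plain Python rather than OWL. The dictionary keys (`source_id`, `rupture_id`, `site_name`) and both function names are assumptions made for illustration, and the real system states these constraints in the ontology and checks them with a metadata reasoner.

```python
from typing import Dict, List

Metadata = Dict[str, object]


def check_collection(collection_md: Metadata, items: List[Metadata]) -> List[str]:
    """Collection-level: every item shares the collection's source/rupture ids."""
    errors = []
    for i, item in enumerate(items):
        for key in ("source_id", "rupture_id"):
            if item.get(key) != collection_md.get(key):
                errors.append(f"item {i}: {key} differs from collection")
    return errors


def check_template(rvm_files: List[Metadata],
                   rupture_var_collections: List[Metadata],
                   site_name: str) -> List[str]:
    """Template-level: equal item counts and a shared site name."""
    errors = []
    if len(rvm_files) != len(rupture_var_collections):
        errors.append("number of RVM files != number of rupture variation collections")
    for i, rvm in enumerate(rvm_files):
        if rvm.get("site_name") != site_name:
            errors.append(f"RVM file {i}: site name differs")
    return errors


coll = {"source_id": 127, "rupture_id": 6}
good = [{"source_id": 127, "rupture_id": 6}, {"source_id": 127, "rupture_id": 6}]
bad = [{"source_id": 127, "rupture_id": 7}]
print(check_collection(coll, good))
print(check_collection(coll, bad))
print(check_template([{"site_name": "USC"}], [coll, coll], "USC"))
```

Returning a list of violations instead of raising on the first one mirrors a validator that reports every problem in a template at once.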

55 Constraints on Files
Constraints on default values
Constraints on file names
[Figure: OWL graph. Labels include the classes File, RuptureVarFile, Metadata (Int, 4DigitInt, xsd:int) and FileNameFormat (a list of Metadata or StringConstant); the roles hasSourceID, hasRuptureID, hasSlipRealization, hasHypoCenter, hasNameFormat, hasMetadata, hasDefaultVal, hasDefaultValue and usedAs; and the Skolem instances RupVar-SK, SourceID1, RuptureID1, SlipRealz1, HypoCent1 and RupVar_FileNameFormat1. Legend: classes, instances, roles; domain-independent definitions vs. SCEC-dependent definitions]

56 Constraints on Collections
Constraints on collection element types
Metadata constraints on collections and their elements
[Figure: OWL graph. Labels include CollOfCollection, Collection, File and RuptureVarFile; the roles hasCollectionType, hasFileType, hasSiteName (Metadata:String), hasSourceID and hasRuptureID (Metadata:Int); and the Skolem instances RupVar-SK, C-RuptVars-SK, CC-RuptureVariations-SK, SiteName1, SourceID1 and RuptureID1]

57 Constraints on Components
Constraints on the types of input and output files and collections
Metadata constraints on input and output files
[Figure: OWL graph. Labels include Component Type, SeismogramGen, SeismogramGen_Li and FileOrCollection; the roles hasInputs, hasOutputs, hasSourceID, hasRuptureID and hasSiteName; and the Skolem instances SeismogramGenLi_Inputs, SeismogramGenLi_Outputs, RVM1, Seismogram1, S-RV1, S-RuptVarsForRup1, SGT1, C-SGT1, RVM_SourceID1, RVM_RuptureID1 and SGTsSiteName1]

58 Workflow Templates: a set of nodes and links
Template: hasNode Node, hasLink Link (Input, Output, InOut, LinkMaping)
Node: hasComponent (ComponentType or ComponentCollection), hasFile (File or Collection)
Link: hasDestinationNode, hasOriginNode, hasDestinationFileDesc, hasOriginFileDesc, …
[Figure: OWL graph with Skolem instances. Labels include CybershakeTemplate1, Node_SeismogramGen_Collection, ComponentCollection_SeismogramGen, SeismogramGen_Li, InputLink_RuptureVars_to_SeisgmogramGen, InputOutLink_Seismogram_from_SeismGen_to_PeakValCalc, F-RV1, C-RuptVars1, CC-RuptureVariations1, S-RV1 and S-RuptVarsForRup1]

59 Constraints on Templates
Constraints on the number of elements in different collections
Metadata constraints on files/collections of different components
[Figure: OWL graph with Skolem instances. Labels include CybershakeTemplate1, InputLink_SiteNameFile_to_BoxNameCheck, InputLink_RuptureVars_to_SeisgmogramGen, InputLink_SGTCollforRup_to_SeismogramGen, F-RV1, C-RuptVars1, CC-RuptureVariations1, F-SGT1, C-SGT-forRups1, CC-SGTs1, SGTsSiteName1, SiteNameFile1, SiteName1 and N_Rups; and the roles hasLink, hasFile, hasSiteName, hasN_Items and isSameAs]

60 Example OWL definitions
Filename format for rupture variation files
Definitions for metadata propagation (SynthSGT)
Constraints on files/collections of different components

61 Extension 3: Creating many workflow instantiations
Independent instances for each rupture, >100,000 variations for a site
Memory bottleneck: handling many files in the file library, e.g. rupture variations
[Figure: independent per-rupture workflow instances, each running SeisGen_Li and then PeakValCalc on rupture variation files and an SGT input, producing files such as Seismograms_PAS_127_6.grm and PeakVals_allPAS_127_6.bsa; other labels: BNC, GenMD]

62 Creating many workflow instantiations (on-going work)
Independent instances are generated separately
  - Instantiations for different ruptures are generated separately
On-demand creation of files and collections in the file library
  - If files or collections are not used in metadata reasoning, we don't need to create file library objects for them (e.g. rupture variations), and only an ID is generated for them
Currently Wings needs 5-6 hrs to generate DAXes for 4,626 ruptures with 106,124 variations
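On-demand creation can be sketched as a library that records only an ID for every file and builds a full object the first time metadata reasoning asks for one. The `FileLibrary` class, its methods and the `loader` callback are hypothetical names chosen for this illustration; they are not the Wings file library API.

```python
from typing import Callable, Dict, Set


class FileLibrary:
    """Keeps an ID for every file; builds full objects only when requested."""

    def __init__(self) -> None:
        self._ids: Set[str] = set()            # every registered file ID
        self._objects: Dict[str, dict] = {}    # objects materialized so far

    def register(self, file_id: str) -> None:
        """Record a file by ID only; no metadata object is created."""
        self._ids.add(file_id)

    def metadata(self, file_id: str, loader: Callable[[str], dict]) -> dict:
        """Return the file's metadata object, creating it on first use."""
        if file_id not in self._ids:
            raise KeyError(file_id)
        if file_id not in self._objects:
            self._objects[file_id] = loader(file_id)
        return self._objects[file_id]

    @property
    def registered(self) -> int:
        return len(self._ids)

    @property
    def materialized(self) -> int:
        return len(self._objects)


library = FileLibrary()
for i in range(1000):
    library.register(f"variation-{i}")   # IDs only, cheap

# Only the one file that reasoning actually touches becomes a full object.
md = library.metadata("variation-7", lambda fid: {"id": fid})
print(library.registered, library.materialized)
```

With over 100,000 rupture variations per site, keeping most of them as bare IDs is what relieves the memory bottleneck described on the previous slide.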

63 Extension 4: Interleaving execution with workflow generation
Extensions in the WF template representations
  - System links: a link from a component that generates results needed in template instantiation
    E.g. BoxNameCheck generates a file that contains SGT file names
  - Template navigation algorithm: while navigating links, identify partial workflows that can be executed based on system links and steps that are already executed
Wings and Pegasus interaction
  - On-going work: client/server style interaction, e.g. using secure shell
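The slides do not spell out the navigation algorithm, so the following is only one plausible reading of it, under stated assumptions: a template is a mapping from each step to its predecessor steps, a system link is a (producer, consumer) pair, and a consumer behind a system link cannot be planned until its producer has actually executed. The function name and data layout are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def next_partial_workflow(predecessors: Dict[str, List[str]],
                          system_links: Set[Tuple[str, str]],
                          executed: Set[str]) -> List[str]:
    """Return the steps that can go into the next partial workflow.

    A step qualifies when each predecessor has either already executed, or
    is part of this partial workflow and is connected by an ordinary link
    (not a system link, whose result is needed before instantiation).
    """
    partial: List[str] = []
    included: Set[str] = set()
    changed = True
    while changed:                       # repeat until no new step qualifies
        changed = False
        for step, preds in predecessors.items():
            if step in executed or step in included:
                continue
            ready = all(
                p in executed
                or (p in included and (p, step) not in system_links)
                for p in preds
            )
            if ready:
                partial.append(step)
                included.add(step)
                changed = True
    return partial


template = {
    "BoxNameCheck": [],
    "SeismogramGen": ["BoxNameCheck"],
    "PeakValCalc": ["SeismogramGen"],
}
links = {("BoxNameCheck", "SeismogramGen")}   # BoxNameCheck's output names the SGT files

first = next_partial_workflow(template, links, executed=set())
second = next_partial_workflow(template, links, executed={"BoxNameCheck"})
print(first)
print(second)
```

In this example the first partial workflow stops at the system link, and the remaining steps become plannable once BoxNameCheck has run, which is the interleaving of generation and execution the slide describes.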

64 Partial DAX generation: Workflow Navigation Algorithm
[Figure: workflow template with a system link; template navigation is used for partial DAX generation]

65 Summary: Current System
[Figure: architecture diagram. Components shown: User; WINGS, with Template Selection, Template Validator, Template Instantiator, Metadata reasoner and DAX generator; Ontology API and file & metadata API, with Jena and MCS; OWL ontologies (Wings File Ont, Wings Component Ont, Domain File Ont, Domain component Ont); Template Library; File Library (e.g. CC-Rup-Vars, C-Rup-Vars-for-Rup, F-RV1); metadata constraints; CAT; Pegasus. State tracked during instantiation: current wf instance, logical files used, bindings, new file objects and metadata created]

66 Ongoing Work
Approaches for handling many thousands of files
  - Use of MCS for storing logical file names and metadata
  - Use of more efficient OWL reasoners (e.g. Sesame can handle 100 million triples)
Client/server style interactions with Pegasus

67 Mappings in a Workflow Template
Link mappings specify the order of inputs to a node that accepts a collection
[Figure: template diagram illustrating two link mappings (#1 and #2) between nodes and collections, with labels such as C-plenty, C-one and C-spl]

72 Component Ontology in OWL

73 A Component Description from the Library
<clns:hasNamespace rdf:datatype=" >vds
<clns:hasVersion rdf:datatype=" >1
<clns:hasExecutablePath rdf:datatype=" >/nfs/isd/varunr/wings/removeCommonWords
<clns:hasPrefix rdf:datatype=" >-o
<clns:hasPrefix rdf:datatype=" >-i

74 Formats for Filenames (examples)
SGT file: e.g. FD_SGT/USC_1/A/SGT161
  Format: FD_SGT/<site_name>_[1-2]/[A-L]/SGT[3-digit-alphanumeric]
  - site_name: e.g. USC
  - tensor direction [1-2]: 1 (EW), 2 (NS)
  - time_period [A-L]: A (0-15 seconds), B (15-30 seconds), etc.
  - 3-digit-alphanumeric: xyz volume id
Rupture variation file: e.g. 127_6.txt.variation-s0002-h0000
  Format: <source_id>_<rupture_id>.txt.variation-s[4 digit slip_realization#]-h[4 digit hypo center #]
  - In the example: source_id 127, rupture_id 6, slip_realization# 2, hypo center # 0
SA output file: e.g. PeakVals_allLADT_127_6.bsa
  Format: PeakVals_all<site_name>_<source_id>_<rupture_id>.bsa
Seismogram file: e.g. Seismogram_LADT_127_6.grm
  Format: Seismogram_<site_name>_<source_id>_<rupture_id>.grm
SRL file: e.g. USC-sorted_by_rupture_variations.srl
  Format: <site_name>-sorted_by_rupture_variations.srl
additional metadata:
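Because the file names encode metadata, the formats above can be turned into patterns that recover the metadata from a name. The sketch below does this for the SGT and rupture variation formats with Python's `re` module. The group names and the assumption that a site name is purely alphanumeric are illustrative choices, not part of the slides.

```python
import re

# <source_id>_<rupture_id>.txt.variation-s<4 digits>-h<4 digits>
RUPTURE_VARIATION = re.compile(
    r"^(?P<source_id>\d+)_(?P<rupture_id>\d+)\.txt\.variation"
    r"-s(?P<slip_realization>\d{4})-h(?P<hypo_center>\d{4})$"
)

# FD_SGT/<site_name>_[1-2]/[A-L]/SGT<3 alphanumeric characters>
SGT_FILE = re.compile(
    r"^FD_SGT/(?P<site_name>[A-Za-z0-9]+)_(?P<tensor_direction>[12])"
    r"/(?P<time_period>[A-L])/SGT(?P<volume_id>[A-Za-z0-9]{3})$"
)


def parse_rupture_variation(name: str) -> dict:
    """Extract the integer metadata fields from a rupture variation file name."""
    match = RUPTURE_VARIATION.match(name)
    if match is None:
        raise ValueError(f"not a rupture variation file name: {name}")
    return {key: int(value) for key, value in match.groupdict().items()}


def parse_sgt(path: str) -> dict:
    """Extract the metadata fields (as strings) from an SGT file path."""
    match = SGT_FILE.match(path)
    if match is None:
        raise ValueError(f"not an SGT file path: {path}")
    return match.groupdict()


rv = parse_rupture_variation("127_6.txt.variation-s0002-h0000")
sgt = parse_sgt("FD_SGT/USC_1/A/SGT161")
print(rv)
print(sgt)
```

The same patterns can be used in the other direction, as a validity check that a candidate file name conforms to the declared format before it is bound into a workflow.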

75 All Data Products Have Rich Metadata