1 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Part II Designing Workflows AAAI-08 Tutorial on Computational.

Slides:



Advertisements
Similar presentations
A Workflow Engine with Multi-Level Parallelism Supports Qifeng Huang and Yan Huang School of Computer Science Cardiff University
Advertisements

Provenance-Aware Storage Systems Margo Seltzer April 29, 2005.
Database System Concepts and Architecture
Database Architectures and the Web
29-April Harris Team – Atmospheric and Environmental Research, GOES-R Ground System and Algorithm Implementation Design Alex Werbos Lizzie Lundgren.
1 USC INFORMATION SCIENCES INSTITUTE Modeling and Using Simulation Code for SCEC/IT Yolanda Gil Varun Ratnakar Norm Tubman USC/Information Sciences Institute.
WebRatio BPM: a Tool for Design and Deployment of Business Processes on the Web Stefano Butti, Marco Brambilla, Piero Fraternali Web Models Srl, Italy.
® IBM Software Group © 2006 IBM Corporation Rational Software France Object-Oriented Analysis and Design with UML2 and Rational Software Modeler 04. Other.
Ewa Deelman, Integrating Existing Scientific Workflow Systems: The Kepler/Pegasus Example Nandita Mangal,
Distributed, parallel web service orchestration using XSLT Peter Kelly Paul Coddington Andrew Wendelborn.
Reference: Message Passing Fundamentals.
Application architectures
Astrophysics, Biology, Climate, Combustion, Fusion, Nanoscience Working Group on Simulation-Driven Applications 10 CS, 10 Sim, 1 VR.
Homework 2 In the docs folder of your Berkeley DB, have a careful look at documentation on how to configure BDB in main memory. In the docs folder of your.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 2 Overview of Database Languages and Architectures.
Course Instructor: Aisha Azeem
Copyright Arshi Khan1 System Programming Instructor Arshi Khan.
Client-Server Processing and Distributed Databases
Application architectures
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
SEC(R) 2008 Intel® Concurrent Collections for C++ - a model for parallel programming Nikolay Kurtov Software and Services.
Database System Concepts and Architecture Lecture # 3 22 June 2012 National University of Computer and Emerging Sciences.
Chapter 7: Architecture Design Omar Meqdadi SE 273 Lecture 7 Department of Computer Science and Software Engineering University of Wisconsin-Platteville.
1 Yolanda Gil Information Sciences InstituteJanuary 10, 2010 Requirements for caBIG Infrastructure to Support Semantic Workflows Yolanda.
Designing Workflows: An Example from Image Analysis Yolanda Gil Information Sciences Institute University of Southern California October 17,
1 USC INFORMATION SCIENCES INSTITUTE Modeling and Using Simulation Code for SCEC/IT Yolanda Gil Jihie Kim Varun Ratnakar Marc Spraragen USC/Information.
Function-oriented Design
ICOM 5995: Performance Instrumentation and Visualization for High Performance Computer Systems Lecture 7 October 16, 2002 Nayda G. Santiago.
Ohio State University Department of Computer Science and Engineering Automatic Data Virtualization - Supporting XML based abstractions on HDF5 Datasets.
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
The Pipeline Processing Framework LSST Applications Meeting IPAC Feb. 19, 2008 Raymond Plante National Center for Supercomputing Applications.
Introduction to MDA (Model Driven Architecture) CYT.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
4.2.1 Programming Models Technology drivers – Node count, scale of parallelism within the node – Heterogeneity – Complex memory hierarchies – Failure rates.
Unit-1 Introduction Prepared by: Prof. Harish I Rathod
Alastair Duncan STFC Pre Coffee talk STFC July 2014 The Trials and Tribulations and ultimate success of parallelisation using Hadoop within the SCAPE project.
7 Systems Analysis and Design in a Changing World, Fifth Edition.
1 USC INFORMATION SCIENCES INSTITUTE CAT: Composition Analysis Tool Interactive Composition of Computational Pathways Yolanda Gil Jihie Kim Varun Ratnakar.
Quality views: capturing and exploiting the user perspective on data quality Paolo Missier, Suzanne Embury, Mark Greenwood School of Computer Science University.
Distribution and components. 2 What is the problem? Enterprise computing is Large scale & complex: It supports large scale and complex organisations Spanning.
1 USC Information Sciences InstituteYolanda Gil AAAI-08 Tutorial July 13, 2008 Part VI AI Workflows AAAI-08 Tutorial on Computational Workflows.
Improving I/O with Compiler-Supported Parallelism Why Should We Care About I/O? Disk access speeds are much slower than processor and memory access speeds.
1. 2 Preface In the time since the 1986 edition of this book, the world of compiler design has changed significantly 3.
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce Leonidas Akritidis Panayiotis Bozanis Department of Computer & Communication.
CSC480 Software Engineering Lecture 10 September 25, 2002.
1 Object Oriented Logic Programming as an Agent Building Infrastructure Oct 12, 2002 Copyright © 2002, Paul Tarau Paul Tarau University of North Texas.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
Programmability Hiroshi Nakashima Thomas Sterling.
Data Communications and Networks Chapter 9 – Distributed Systems ICT-BVF8.1- Data Communications and Network Trainer: Dr. Abbes Sebihi.
Object storage and object interoperability
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
CS223: Software Engineering
1 Pegasus and wings WINGS/Pegasus Provenance Challenge Ewa Deelman Yolanda Gil Jihie Kim Gaurang Mehta Varun Ratnakar USC Information Sciences Institute.
Lecture #1: Introduction to Algorithms and Problem Solving Dr. Hmood Al-Dossari King Saud University Department of Computer Science 6 February 2012.
PARALLEL AND DISTRIBUTED PROGRAMMING MODELS U. Jhashuva 1 Asst. Prof Dept. of CSE om.
1 USC Information Sciences InstituteYolanda Gil AAAI-08 Tutorial July 13, 2008 Part IV Workflow Mapping and Execution in Pegasus (Thanks.
Application architectures Advisor : Dr. Moneer Al_Mekhlafi By : Ahmed AbdAllah Al_Homaidi.
Chapter 1 Overview of Databases and Transaction Processing.
Service Composition Orchestration BPEL Cédric Tedeschi ISI – M2R.
Data Grids, Digital Libraries and Persistent Archives: An Integrated Approach to Publishing, Sharing and Archiving Data. Written By: R. Moore, A. Rajasekar,
Databases and DBMSs Todd S. Bacastow January 2005.
Pegasus WMS Extends DAGMan to the grid world
Database Architectures and the Web
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 2 Database System Concepts and Architecture.
USC Information Sciences Institute {jihie, gil,
Charles Tappert Seidenberg School of CSIS, Pace University
Mats Rynge USC Information Sciences Institute
Frieda meets Pegasus-WMS
MapReduce: Simplified Data Processing on Large Clusters
Presentation transcript:

1 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Part II Designing Workflows AAAI-08 Tutorial on Computational Workflows for Large-Scale Artificial Intelligence Research

2 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 What Is In a Workflow

3 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Similar but not Same: What We Do NOT Mean by “Workflow” Workflows in human organizations, eg Process handbook [Malone 95] Task dependencies and flow of information in human organizations Workflows that represent process models, eg PSL [Grunninger et al 05] Graphs of activities and subactivities expressing complex temporal and resource constraints Workflows for e-business and e-commerce transactions Interactions among processes that support negotiations, multi- party coordinated activities, etc.

4 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Simple Examples of Workflows for Machine Learning Experiments

5 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Computational vs Service-Based Workflow C1 C2 F1 F4 F6 F2 Workflow consists of Nodes: computations Links: dataflow among nodes C3 F5 F3 F5 COMPUTATIONAL WORKFLOW Data stored in files Metadata catalogs of dataset properties Data catalogs of physical replicas Components catalogs of code locations and code execution requirements SERVICE-BASED WORKFLOW Message-passing paradigm for dataflow Service registries show availability Standing (3 rd party) services Computations managed by service provider Services are invoked remotely through potentially complex workflow orchestration

6 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Designing Workflow Structure Complexity of application logic is ideally hidden within components Components are designed and implemented by programmers –Full-fledged programming constructs –May be implemented in parallel languages Workflow logic is streamlined Workflows are designed by non-programmers and should have simple structure –Focus is on selecting application components and data

7 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Designing and Implementing Workflow Components Workflow components as encapsulated software Declare I/O data Declare parameters Declare execution dependencies (e.g., runtime libraries) Declare execution architecture requirements (e.g., OS, processor) Basic Component Encapsulation (BCE) Schema Workflow components as services Request and result messages Web Services Description Language (WSDL)

8 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Where to Start: A Workflow Sketch Computations as nodes Dataflow as links Represent intermediate data Code1 Code2 DataB Code3 DataC Input DataA Output DataH Code4 DataE DataF Output DataG

9 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Terrible Workflow Sketches (I): The “I-Too-Can-Sketch” Workflow Sketch User Specifies Parameters Rough Image Analysis Data Storage Image Filtering Detailed Image Analysis Results

10 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Terrible Workflow Sketches (II): The “I-Am-Organized” Workflow Sketch Initialize Pre-process Code1Code2Code3 Process Post-Process Code4Code5

11 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Terrible Workflow Sketches (II): The “I-Am-Organized” Workflow Sketch Initialize Pre-process Code1Code2Code3 Process Post-Process Code4Code5

12 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Terrible Workflow Sketches (III): The “I-Am-In-Control” Workflow Sketch Code1 Code2 “go” Code3 “go”

13 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Terrible Workflow Sketches (III): The “I-Am-In-Control” Workflow Sketch Code1 Code2 “go” Code3 “go”

14 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Terrible Workflow Sketches (IV): (Alternative Version of) The “I-Am-In-Control” Workflow Sketch Code1 Code2 “go” Code3 “go”

15 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Terrible Workflow Sketches (V): (YAA Version of) The “I-Am-In-Control” Workflow Sketch Code1 Code2 “go” Code3 “go”

16 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Terrible Workflow Sketches (VI): The “I-Sure-Can-Program” Workflow

17 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Terrible Workflow Sketches (VII): The “I-Am-User-Friendly” Workflow Code1 Code2 DataA Code3 DataB

18 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Terrible Workflow Sketches (VIII): The “I-Support-Collaboration” Workflow Code1 Code2 DataA Code3 DataB

19 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Terrible Workflow Sketches (IX): A Fine but Baffling Workflow Sketch Code1 Code2 Data1 Code3 Data1&2 Data1&2&3 Code4 Code6 Code5 Data1&2&3&4 Data1&2&3&4&5

20 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Terrible Workflow Sketches (X): Any Suggestions?

21 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Parallel Processing in Workflows Exploit parallel computations when possible Iteration over datasets Code2 DataB Code3 Input DataC Output DataH Union DataD Output DataG Split Input DataA DataB Code2 DataB DataE

22 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Designing Workflows: An Example from Image Analysis

23 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Microscopy Image Correction [Saltz et al 05] Microscope takes image tiles, one at a time, for all the slices tens of GBytes uncompressed Image is corrected, then warped to map to standard brain, then pre-processed for efficient querying Correction workflow: original image has repeated areas across tiles, this needs to be corrected. Steps: 1. Z-projection: picks the best pixel from all the z-slices. 2. Normalization 3. Auto-alignment: piece together the image mosaic from the tiles –Uses a feature matching algorithm, output is "offsets”, the alignment is given a score, goal is to maximize the score 4. Stitching: remove overlapping regions, so the tiles can "transition" more smoothly by finding best match across tiles. –Optimization algorithm picks the right offsets and stitches them together MPI

Chunkisize Z-projection Normalize Auto-align Stitch Starting point: high-level dataflow Warp [Sharma et al 07]: 3D image layers Chunked into tiles Projected into the Z plane Normalized and aligned Stitched back into a single 2D image Tiles and parallelism are not explicit!

25 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Some restructuring of the workflow that helped…

Format conversion Chunkisize Z-projection Normalize Auto-align Stitch Separated “conversion” from “chunkisizing”

Format conversion Chunkisize Z-projection Normalize Auto-align Stitch Controllable parameter: Chunk size?? Now we could add a control parameter… But which?

Format conversion Chunkisize Z-projection Generate magic tiles Normalize Auto-align Stitch Controllable parameter: Chunk size?? Separated “normalize” from generation of 2 “magic” tiles

Format conversion Chunkisize Z-projection Generate magic tiles Normalize Auto-align MST Stitch Controllable parameter: Chunk size?? Separated “auto- align” from MST

Format conversion Chunkisize Z-projection Generate magic tiles Normalize Auto-align MST Normalize Stitch Controllable parameter: Chunk size?? Splitting “normalize” steps (z-proj separate from z layers)

31 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Now we could see the controllable parameter…

Format conversion Chunkisize Z-projection Generate magic tiles Normalize Auto-align MST Normalize Stitch Controllable parameter: p p is number of chunks in an image layer Identified which should be parallel codes (MPI) because of communication needs

Format conversion Chunkisize Z-projection Generate magic tiles Normalize Auto-align MST Normalize Stitch Controllable parameter: p p is number of chunks in an image layer Identified steps that could be parallelized as independent nodes

Format conversion Chunkisize Z-projection Generate magic tiles Normalize Auto-align MST Normalize Stitch Controllable parameter: p p is number of chunks in an image layer Now we could see that the controllable parameter should be the number of chunks

35 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Now we could see the workflow!!!

Workflow Template Sketch Format conversion Chunkisize Z-projection Generate magic tiles Normalize Auto-align MST Normalize Stitch Set of stacks of chunks Collections of z layers, each has p chunks, each chunk has t tiles) p (one per chunk stack) p * z (one per chunk per layer) p (one per chunk of z-proj) Sets of chunks for z-proj Internally iterates over each chunk of the z-proj Internally iterates over each chunk of the normalized z-proj 2 magic tiles Sets of normalized chunks for z-proj Scores per pairs of chunks in z-proj Graph Sets of normalized chunks for all layers Internally iterates over each chunk of each layer Sets of stitched norm. chunks for all layers Will be expanded to several nodes Parallel codes (MPI) with one computation node Regular codes LEGEND Set of z layers each with t tiles (.IMG) Set of z layers each with t tiles (.DIM) Controllable parameter: p p is number of chunks in an image layer

Sketch of Workflow Template Format conversion Chunkisize Z-projection Generate magic tiles Normalize Auto-align MST Normalize Stitch Will be expanded to several computation nodes Parallel codes (MPI) with one computation node Regular codes LEGEND Controllable parameter: p p is number of chunks in an image layer

Sketch of computations in the workflow Format conversion Chunkisize Generate magic tiles Auto-align MST Norm Stitch p p * zp Internally iterates over each chunk of the z-proj Internally iterates over each chunk of the normalized z-proj Internally iterates over each chunk of each layer Controllable parameter: p p is number of chunks in an image layer Norm … … Z-proj … Computation nodes Parallel codes (MPI) with one computation node Regular codes LEGEND

39 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Summary

40 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Blocks and Arrows A Workflow Do Not Make: A Well-Formed Workflow Sketch C1 C2 F1 F4 F6 F2 Components are models of software to be executed Components have all I/O data exposed All important parameter settings in components are exposed So workflow system can reflect settings in provenance records Links reflect ALL data flow Each dataset is produced by one component, can be consumed by several No side effects No global context C3 F5 F3 F5 Pa Pb

41 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Abstraction Layers in Computational Workflows

42 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Workflow System Numerous Interdependent Decisions for Workflow Creation Many types/variants/implementations of application components, each with different storage and computational requirements Many alternative data collections with different degrees of pre-processing Many possible resources to execute application components Many possible data replicas in the distributed environment

43 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Workflow Complexity Workflow complexity may be reflected in two major dimensions: Scale : measured by –amount of processes –compute times –data sizes –heterogeneity of code execution requirements Scope : measured by –Constraints on code I/O requirements –Constraints on data properties

44 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Creation of Workflows in Layers of Increasing Detail 1. Workflow Template (generic known-to-work recipes) Specifies application components and dataflow among them No data specified, just their type 2. Workflow Instance (data-specific) Specifies data files for a given template Logical file names, not physical file replicas 3. Executable Workflow (actual run) Specifies physical locations of data files (may be in data repositories) Assigned hosts/pools for execution of each component Includes data movements among execution sites and data repositories as well as data deposition steps

45 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Creation of Workflows in Layers of Increasing Detail 1. Workflow Template (generic known-to-work recipes) Specifies application components and dataflow among them No data specified, just their type 2. Workflow Instance (data-specific) Specifies data files for a given template Logical file names, not physical file replicas 3. Executable Workflow (actual run) Specifies physical locations of data files (may be in data repositories) Assigned hosts/pools for execution of each component Includes data movements among execution sites and data repositories as well as data deposition steps User guided Automated

46 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Corpus Split Filter_Rules Prune_Rules BinarizeGenerate_Rule_Map Compile XRS_RulesBRF_RulesLexicon_Dictionary 1…n Kernel_Rules Workflow Templates: No Data or Execution Information

47 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Corpus Kernel_Rules Split Filter_Rules Prune_Rules BinarizeGenerate_Rule_Map Compile XRS_RulesBRF_RulesLexicon_Dictionary 1…n WSJ-2001 KR Workflow Instance: Compact Expression

48 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 … … Workflow Instance: Expanded Expression

49 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Corpus Kernel_Rules Split Filter_Rules Prune_Rules BinarizeGenerate_Rule_Map Compile XRS_RulesBRF_RulesLexicon_Dictionary 1…n WSJ-2001 KR Workflow Instance Expressions Compact expression for efficient search and matching Expanded expression when further details are needed

50 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 … … … … … WSJ-2001 KR Executable Workflow: Maps Expanded Instance to Execution Resources

51 USC Information Sciences Institute Yolanda Gil AAAI-08 Tutorial July 13, 2008 Workflow Creation then Workflow Mapping Workflow Creation Functions Workflow Mapping and Execution Functions