

Scientific Workflows

2 Overview
More background on workflows
Kepler Details
Example Scientific Workflows
Other Workflow Systems

3 Recap from last time
Background: What is a scientific workflow?
  – Goals: automate a scientist's repetitive data management and analysis tasks
  – Typical phases: data access, scheduling, generation, transformation, aggregation, analysis, visualization
  – Design, test, share, deploy, execute, reuse SWFs
Overview and demo of Kepler
Adapted from B. Ludaescher

4 Scientific Workflows: Some Findings
Very different granularities: from high-level design to lowest-level plumbing
More dataflow than (business control) workflow
Need for "programming extensions"
  – Iterations over lists (foreach), filtering, functional composition, generic & higher-order operations (zip, map(f))
Need for abstraction and nested workflows
Adapted from B. Ludaescher
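The "programming extensions" named above can be illustrated with a minimal Python sketch. The helper names (compose, normalize, gc_fraction) and the sample sequences are assumptions made for illustration; they are not part of Kepler or any other workflow system.

```python
from functools import reduce


def compose(*funcs):
    """Functional composition: apply funcs left to right."""
    return lambda x: reduce(lambda acc, f: f(acc), funcs, x)


# Hypothetical per-item steps; a real workflow would wrap analysis tools here.
def normalize(seq):
    return seq.strip().upper()


def gc_fraction(seq):
    return sum(base in "GC" for base in seq) / len(seq)


sequences = [" acgt ", "ggcc", "atat"]   # made-up sample data
labels = ["s1", "s2", "s3"]

# map(f): apply a composed step to every list element (the "foreach" case).
gc_values = list(map(compose(normalize, gc_fraction), sequences))

# zip: pair two parallel streams element by element.
labelled = list(zip(labels, gc_values))

# filter: keep only the items that satisfy a predicate.
gc_rich = [label for label, gc in labelled if gc > 0.5]
```

Each operation takes and returns whole lists, which is why such extensions fit a dataflow model better than explicit loops wired up with control-flow constructs.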

5 Scientific Workflows: Findings (continued)
Need for data transformations
Need for rich user interaction and workflow steering
  – Pause/revise/resume
  – Select & branch, e.g., web browser capability at specific steps as part of a coordinated SWF
Need for high-throughput data transfers and CPU cycles: "Grid-enabling", "streaming"
Need for persistence of intermediate products and provenance
Adapted from B. Ludaescher

6 Data-flow vs. Control-flow
The distinction is useful for
  – Specification (language, model)
  – Synthesis (scheduling, optimization)
  – Validation (simulation, formal verification)
Rough classification:
  – Control
      Don't know when data arrive (quick reaction)
      Time of arrival often matters more than value
  – Data
      Data arrive in regular streams (samples)
      Value matters most
Adapted from B. Ludaescher

7 Data-flow vs. Control-flow
Specification, synthesis, and validation methods tend to emphasize…
For control:
  – Event/reaction relation
  – Response time
  – (Real-time scheduling for deadline satisfaction)
  – Priority among events and processes
Adapted from B. Ludaescher

8 Data-flow vs. Control-flow
For data:
  – Functional dependency between input and output
  – Memory/time efficiency
  – (Dataflow scheduling for efficient pipelining)
  – All events and processes are equal
Adapted from B. Ludaescher

9 Business Workflows vs. Scientific Workflows
Business Workflows
  – Task oriented: travel reservations, credit approval, etc.
  – Tasks, documents, etc. undergo modifications (e.g., flight reservation from reserved to ticketed), but modified WF objects are still identifiable throughout
  – Complex control flow, complex process composition
  – Dataflow and control-flow are often divorced
Adapted from B. Ludaescher

10 Business Workflows vs. Scientific Workflows
Scientific Workflows
  – Dataflow and data transformations
  – Data problems: volume, complexity, heterogeneity
  – Grid aspects: distributed computation, distributed data
  – User interactions/WF steering
  – Data, tool, and analysis integration
  – Dataflow and control-flow are often married
Adapted from B. Ludaescher

11 SWF User Requirements
Design tools, especially for non-expert users
  – Need to look into how scientists define processes
Ease of use: fairly simple user interface with more complex features hidden in the background
Reusable generic features
  – Generic enough to serve different communities but specific enough to serve one domain
Extensibility for the expert user: almost a visual programming interface
Registration and publication of data products and "process products" (workflows); provenance
Adapted from B. Ludaescher

12 SWF Technical Requirements
Error detection and recovery from failure
Logging information for each workflow
Allow data-intensive and compute-intensive tasks (maybe at the same time)
Data management/integration
Allow status checks and on-the-fly updates
Visualization
Semantics- and metadata-based dataset access
Certification, trust, security
Adapted from B. Ludaescher

13 Challenges/Requirements
Seamless access to resources and services
  – Web services are a simple solution but don't address harder problems, e.g., web service orchestration, third-party transfers
Service composition & reuse and workflow design
  – How to compose simple services to perform complex tasks
  – Design components that are easily reusable, not application-specific
Adapted from B. Ludaescher

14 Challenges/Requirements
Scalability
  – Some workflows require large amounts of data and/or high-end computational resources
  – Require interfaces to Grid middleware components
Detached execution
  – Allow long-running workflows to run in the background on a remote server
Reliability and fault tolerance
  – e.g., a workflow could fail if a web service fails
Adapted from B. Ludaescher

15 Challenges/Requirements
User interaction
  – e.g., users may inspect intermediate results
"Smart" re-runs
  – Changing a parameter after intermediate results without executing the workflow from scratch
"Smart" semantic links
  – Assist in workflow design by suggesting which components might fit together
Data provenance
  – Which data products and tools created a derived data product
  – Log sequence of steps, parameter settings, etc.
Adapted from B. Ludaescher
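One way to picture "smart" re-runs together with provenance logging is a cache of intermediate products keyed by step name and inputs. The sketch below is an illustration under stated assumptions: run_step, load, scale, and the file name samples.csv are hypothetical, and no workflow system is claimed to work exactly this way.

```python
import hashlib
import json

cache = {}        # intermediate products keyed by step name + inputs
provenance = []   # ordered log of steps, inputs, and whether a result was reused


def run_step(name, func, **inputs):
    """Run func(**inputs) unless an identical call was already computed."""
    key = hashlib.sha256(
        json.dumps([name, inputs], sort_keys=True).encode()
    ).hexdigest()
    reused = key in cache
    if not reused:
        cache[key] = func(**inputs)
    provenance.append({"step": name, "inputs": inputs, "reused": reused})
    return cache[key]


def load(path):
    # Stand-in for an expensive data access step; the values are made up.
    return [1, 2, 3, 4]


def scale(data, factor):
    return [x * factor for x in data]


def workflow(factor):
    data = run_step("load", load, path="samples.csv")
    return run_step("scale", scale, data=data, factor=factor)


first = workflow(factor=2)
second = workflow(factor=10)   # "load" is reused; only "scale" is recomputed
```

After the second call, the provenance list records which data and parameter settings produced each result and which steps were skipped.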

16 Why is a GUI useful?
No need to learn a programming language
Visual representation of what the workflow does
Allows you to monitor workflow execution
Enables user interaction
Facilitates sharing workflows

17 Kepler Details
Director/Actor metaphor
  – Actors are executable components of a workflow
  – Director controls execution of workflow
Workflows are saved as XML files
  – Workflows can easily be shared/published

18 Directors
Many different models of computation are possible
  – Synchronous: processing occurs one component at a time
  – Parallel: one or more components run simultaneously
Every Kepler workflow needs a director

19 Actors
Reusable components that execute a variety of functions
Communicate with other actors in workflow through ports
Composite actor: aggregation of actors
Composite actor may have a local director

20 Parameters
Values that can be attached to a workflow or to individual directors/actors
Accessible by all actors in a workspace
Facilitate workflow configuration
Analogous to global variables

21 Ports
Ports are used to produce and consume data and communicate with other actors in workflow
  – Input port: data consumed by actor
  – Output port: data produced by actor
  – Input/output port: data both produced and consumed
Ports can be singular or multiple

22 Relations
Direct the same input or output to more than one other port
Example: direct an output to a display actor to show intermediate results, and to an operational actor for further processing
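The fan-out behaviour of a relation can be sketched in a few lines of Python. The OutputPort class below is a hypothetical stand-in written for illustration; it is not the Kepler or Ptolemy II port API.

```python
class OutputPort:
    """Hypothetical output port whose relation fans out to several receivers."""

    def __init__(self):
        self.relation = []            # receivers linked to this port

    def connect(self, receiver):
        self.relation.append(receiver)

    def send(self, token):
        # The relation delivers the same token to every connected input.
        for receiver in self.relation:
            receiver(token)


displayed, processed = [], []

out = OutputPort()
out.connect(displayed.append)                      # "display" branch
out.connect(lambda t: processed.append(t * t))     # "operational" branch

for token in [1, 2, 3]:
    out.send(token)
```

The display branch sees every intermediate token unchanged while the operational branch continues processing, matching the example on the slide.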

23 Other Kepler features
Can call external functions
Can implement your own actors
Incremental development for rapid prototyping
  – If inputs and outputs are defined, can incorporate actors into workflow
  – Example: "dummy" composite actor
  – Components can be designed and tested separately

24 Focus on Actor-Oriented Design
Object orientation: what flows through an object is sequential control
  [Diagram: class name, data, methods; arrows labelled call and return]
Actor orientation: what flows through an object is streams of data
  [Diagram: actor name, data (state), parameters, ports; arrows labelled input data and output data]
Adapted from B. Ludaescher

25 Object-Oriented vs. Actor-Oriented Interface Definitions
Object oriented: the OO interface definition gives procedures that have to be invoked in an order not specified as part of the interface definition
  [Diagram: TextToSpeech with initialize(): void, notify(): void, isReady(): boolean, getSpeech(): double[]]
Actor oriented: the AO interface definition says "Give me text and I'll give you speech"
  [Diagram: Text to Speech actor with ports "text in" and "speech out"]
Adapted from B. Ludaescher

26 Models of Computation
Semantic interpretations of the abstract syntax
Different models → different semantics → different execution
One class: producer/consumer
  – Are actors active? Passive? Reactive?
  – Are communications timed? Synchronized? Buffered?
Adapted from B. Ludaescher

27 Directors: Semantics for Component Interaction
Some directors:
  – CT: continuous time modeling
  – DE: discrete event systems
  – FSM: finite state machines
  – PN: process networks
  – SDF: synchronous dataflow
Adapted from B. Ludaescher

28 Polymorphic Actors: Working Across Data Types and Domains
Recall the add/subtract actor from last time
Actor data polymorphism:
  – Add numbers (int, float, double, complex)
  – Add strings (concatenation)
  – Add complex types (arrays, records, matrices)
  – Add user-defined types
Adapted from B. Ludaescher
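Data polymorphism can be sketched in Python, where a single "add" function works for any type that defines +. The add_tokens function and the Vec type are assumptions made for illustration; note that Python lists concatenate under +, which may differ from how a given workflow system adds arrays.

```python
from dataclasses import dataclass
from functools import reduce
import operator


def add_tokens(*tokens):
    """Add all tokens using whatever "+" means for their type."""
    return reduce(operator.add, tokens)


@dataclass
class Vec:
    """Hypothetical user-defined type that supplies its own addition."""
    x: float
    y: float

    def __add__(self, other):
        return Vec(self.x + other.x, self.y + other.y)


results = {
    "int": add_tokens(1, 2, 3),
    "float": add_tokens(1.5, 2),
    "complex": add_tokens(1 + 2j, 3 - 1j),
    "string": add_tokens("ab", "cd"),            # concatenation
    "list": add_tokens([1, 2], [3]),             # Python concatenates lists
    "user": add_tokens(Vec(1, 2), Vec(3, 4)),    # user-defined type
}
```

The actor body never names a concrete type, so the same component can be reused wherever the incoming tokens support addition.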

29 Polymorphic Actors (continued)
Actor behavioral polymorphism
  – In the synchronous dataflow model (SDF), add when all inputs have data
  – In process networks, execute an infinite loop in a thread that blocks when reading empty inputs
  – In a time-triggered model, add when the clock ticks
Adapted from B. Ludaescher
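The first two cases can be contrasted with a rough Python sketch in which the same add function runs under two execution styles. The functions run_sdf and run_pn and the STOP sentinel are assumptions for illustration; real SDF and PN directors are considerably more involved.

```python
import queue
import threading
from collections import deque


def add(a, b):
    return a + b


def run_sdf(values_a, values_b):
    """SDF-style: a scheduler fires the actor only when every input has a token."""
    in_a, in_b, out = deque(values_a), deque(values_b), []
    while in_a and in_b:                  # firing rule checked outside the actor
        out.append(add(in_a.popleft(), in_b.popleft()))
    return out


STOP = object()   # sentinel so the sketch terminates; a PN actor would loop forever


def run_pn(values_a, values_b):
    """PN-style: the actor is a thread whose reads block on empty inputs."""
    qa, qb, out = queue.Queue(), queue.Queue(), []

    def actor():
        while True:
            a = qa.get()                  # blocks until a token arrives
            b = qb.get()
            if a is STOP or b is STOP:
                return
            out.append(add(a, b))

    worker = threading.Thread(target=actor)
    worker.start()
    for a in values_a:
        qa.put(a)
    for b in values_b:
        qb.put(b)
    qa.put(STOP)
    qb.put(STOP)
    worker.join()
    return out


sdf_result = run_sdf([1, 2, 3], [10, 20, 30])
pn_result = run_pn([1, 2, 3], [10, 20, 30])
```

The add logic is identical in both cases; only the rule for when it executes changes, which is the point of behavioral polymorphism.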

30 Benefits of Polymorphism
Some observations:
  – Can define actors without defining input types
  – Can define actors without defining model of computation
Why is this useful?
  – Increases reusability
  – But need to ensure that actor works in every circumstance

31 Actor Implementation
Details beyond the scope of this class
Idea: each actor implements several methods:
  – initialize(): initializes state variables
  – prefire(): indicates if actor wants to fire
  – fire(): main point of execution; read inputs, produce outputs, read parameter values
  – postfire(): update persistent state, see if execution is complete
  – wrapup()
Each director calls these methods according to its model
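The method sequence above can be sketched in Python. Only the five method names come from the slide; the RunningSum actor and simple_director loop are hypothetical, and the real Kepler/Ptolemy II classes are written in Java and are richer than this.

```python
class RunningSum:
    """Hypothetical actor that emits the running sum of its inputs."""

    def __init__(self, inputs):
        self.inputs = list(inputs)
        self.outputs = []
        self.log = []

    def initialize(self):
        self.total = 0                    # state variable
        self.log.append("initialize")

    def prefire(self):
        return bool(self.inputs)          # willing to fire only if data is waiting

    def fire(self):
        # Read an input and produce an output without committing state yet.
        self.pending = self.total + self.inputs.pop(0)
        self.outputs.append(self.pending)

    def postfire(self):
        self.total = self.pending         # commit persistent state
        return bool(self.inputs)          # False signals that execution is complete

    def wrapup(self):
        self.log.append("wrapup")


def simple_director(actor):
    """Assumed single-actor director: one possible calling order."""
    actor.initialize()
    keep_going = True
    while keep_going and actor.prefire():
        actor.fire()
        keep_going = actor.postfire()
    actor.wrapup()


actor = RunningSum([1, 2, 3])
simple_director(actor)
```

A different director could call the same methods on a different schedule, for example on clock ticks, without any change to the actor.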

32 Third-party transfers
Problem: many workflows access data from one web service S1 and pass the output on to service S2
Current web services do not provide a mechanism to transfer data directly from S1 to S2
Data is moved around more than necessary

33 Third-party transfers
[Diagram: client and services S1, S2, S3; arrows labelled ship request, execute service, ship reply]

34 Handle-oriented approach
Idea: instead of shipping actual data, web services send a handle (pointer to data)
Web services need support for handles
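The saving from handles can be illustrated with a toy Python model that counts how many characters pass through the client. The Service class, its shared store, and the two example services are hypothetical; this is not a real web service protocol.

```python
import uuid


class Service:
    """Hypothetical web service that can accept data by value or by handle."""

    store = {}   # shared storage assumed to be reachable by all services

    def __init__(self, func):
        self.func = func

    def call_by_value(self, data):
        return self.func(data)

    def call_by_handle(self, handle):
        result = self.func(Service.store[handle])
        new_handle = str(uuid.uuid4())
        Service.store[new_handle] = result
        return new_handle


def run_by_value(data, services):
    traffic = 0
    for service in services:
        traffic += len(data)              # request carries the full data
        data = service.call_by_value(data)
        traffic += len(data)              # reply carries the full result
    return data, traffic


def run_by_handle(data, services):
    handle = str(uuid.uuid4())
    Service.store[handle] = data          # data registered once, at its source
    traffic = 0
    for service in services:
        traffic += len(handle)            # request carries only the handle
        handle = service.call_by_handle(handle)
        traffic += len(handle)            # reply carries only a new handle
    return Service.store[handle], traffic


s1 = Service(str.upper)                   # stand-in for service S1
s2 = Service(lambda s: s[::-1])           # stand-in for service S2
data = "acgt" * 1000

value_result, value_traffic = run_by_value(data, [s1, s2])
handle_result, handle_traffic = run_by_handle(data, [s1, s2])
```

Both runs produce the same result, but in the handle version the client only relays short identifiers while the data stays on the service side.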

35 Scientific Workflow Examples
Promoter Identification
Mineral Classification
Environmental Modeling
Blast-ClustalW Workflow

36 Promoter Identification Workflow
Designed to help a biologist compare a set of genes that exhibit similar expression levels
Goal: find the set of promoter modules responsible for this behavior
A promoter is a subsequence of a chromosome that sits close to a gene and regulates its activity

37 Promoter Identification Workflow
1. Input: list of gene IDs
2. For each gene, construct likely upstream region by finding sequences that significantly overlap input gene
   1. Use GenBank to get sequence for each gene ID
   2. Use BLAST to find similar sequences
3. Find transcription factor binding sites in each of the sequences
   1. Run a Transfac search on sequence to identify binding sites
4. Align them and display
   1. Use ClustalW to align
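The four steps above can be sketched as a pipeline in which each external tool is passed in as a callable. Everything below the function definition is a toy stand-in: the gene ID, sequence, and lambdas are invented for illustration and do not call GenBank, BLAST, Transfac, or ClustalW.

```python
def promoter_workflow(gene_ids, fetch_sequence, find_similar, find_sites, align):
    """Chain the four slide steps; each tool is supplied by the caller."""
    per_gene = {}
    for gene_id in gene_ids:                        # step 1: list of gene IDs
        sequence = fetch_sequence(gene_id)          # step 2.1: sequence lookup
        candidates = find_similar(sequence)         # step 2.2: similarity search
        per_gene[gene_id] = [                       # step 3: binding-site search
            (candidate, find_sites(candidate)) for candidate in candidates
        ]
    all_candidates = [c for hits in per_gene.values() for c, _ in hits]
    return per_gene, align(all_candidates)          # step 4: alignment


# Toy stand-ins so the sketch runs without any external service.
toy_db = {"g1": "ACGTTATA"}
per_gene, aligned = promoter_workflow(
    ["g1"],
    fetch_sequence=toy_db.get,
    find_similar=lambda s: [s, s[::-1]],
    find_sites=lambda s: [i for i in range(len(s) - 3) if s[i:i + 4] == "TATA"],
    align=sorted,
)
```

Passing the tools in as parameters keeps the workflow structure separate from the services behind each step, so a stub can be swapped for a real service call later.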

38 Promoter Identification Workflow

39 Mineral Classification Workflow
Samples selected from a database holding mineral compositions of igneous rocks
This data, along with a set of classification diagrams, is fed to a classifier
Process of classifying samples involves determining position of sample values in a series of diagrams
When location of point in diagram of order n is determined, consult corresponding diagram of order n+1
Repeat until terminal level of diagrams is reached
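The iterative classification described above can be sketched with a standard ray-casting point-in-polygon test. The diagram names, polygons, and sample points below are invented for illustration and are not real petrological classification diagrams.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point lies inside polygon, a list of (x, y)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):          # edge crosses the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical diagrams: region name -> (polygon, next diagram or None).
DIAGRAMS = {
    "order1": {
        "left": ([(0, 0), (5, 0), (5, 10), (0, 10)], "order2"),
        "right": ([(5, 0), (10, 0), (10, 10), (5, 10)], None),
    },
    "order2": {
        "lower": ([(0, 0), (5, 0), (5, 5), (0, 5)], None),
        "upper": ([(0, 5), (5, 5), (5, 10), (0, 10)], None),
    },
}


def classify(point, diagram="order1"):
    """Follow diagrams of increasing order until a terminal region is reached."""
    path = []
    while diagram is not None:
        for region, (polygon, next_diagram) in DIAGRAMS[diagram].items():
            if point_in_polygon(point, polygon):
                path.append(region)
                diagram = next_diagram
                break
        else:
            raise ValueError(f"{point} falls outside diagram {diagram}")
    return path


deep = classify((2, 7))      # passes through two diagram levels
shallow = classify((8, 3))   # terminal region in the first diagram
```

Each pass through the while loop corresponds to consulting the diagram of the next order, and the loop stops when a region has no follow-up diagram.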

40 Mineral Classification Workflow
[Diagram labels: diagrams, Rock dataset, Classifier, Result, GetPoint, NextDiagram, Diagrams, Diagram ToPolygons, PointInPolygon]

41 CORIE Environmental Observation and Forecasting System
Daily forecasts of bodies of water throughout coastal United States
Simulation program models physical properties of water (e.g., salinity, temperature, velocity)
Scripts generate images, plots, and animations from raw simulation outputs

42 Example CORIE Workflows
[Diagram labels: Simulation Run; Simulation Outputs (>300 MB): Model, stations, salt, temp, vert; Data Product Tasks; Data Products: Isolines salt, Isolines temp, Transects temp]

43 Blast-ClustalW Workflow
Goal: run BLASTN against DDBJ with a given DNA sequence, then compare alignment regions of similar sequences using ClustalW
1. Run BLAST service with input sequence
2. Run GetEntry to get sequences of each hit
3. Cut off corresponding area
4. Run ClustalW

44 Some Scientific Workflow Tools
Kepler
SCIRun
Triana
Taverna
Some commercial tools:
  – Windows Workflow Foundation
  – Mac OS X Automator

45 SCIRun
Computational workbench to interactively design and modify simulations
Emphasis on visualization
Scientists can interactively change models and parameters
Fine-grained dataflow to improve computational efficiency

46 Some SCIRun Images
[Image: C-SAFE integrated fire/container simulation]
[Image: Granular compaction simulation]

47 Triana
Problem-solving environment combining visual interface and data analysis tools
Emphasis on P2P and Grid computing environments
Distributed functionality

48 Triana

49 Taverna
Emphasis on bioinformatics workflows
Enables coordination of local and remote resources
Provides a GUI and access to bioinformatics web services
Records provenance information

50 Taverna

51 A brief aside: BioMOBY
Model Organism Bring Your own Database
Messaging standard to automatically discover and interact with biological data and service providers
Automatic manipulation of data formats

52 BioMOBY (continued)
Ontology of bioinformatics data types
Define data syntax
Create an open API over this ontology
Define web service inputs/outputs
Register services
Many clients being deployed
Clients for some workflow tools, e.g., Taverna, in development

53 Executing Kepler on the Grid
Many challenges to Grid workflows, including:
  – Authentication
  – Data movement
  – Remote service execution
  – Grid job submission
  – Scheduling and resource management
  – Fault tolerance
  – Logging and provenance
  – User interaction
May be difficult for domain scientists

54 Example Grid Workflow
Stage-execute-fetch:
[Diagram: local server and remote server]
1. Stage local files to remote server
2. Execute computational experiment on remote resource
3. Fetch results back to local environment
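The stage-execute-fetch pattern can be sketched in Python using two temporary directories as stand-ins for the local and remote servers. The functions stage, execute, and fetch are hypothetical and use plain file copies; a real Grid workflow would go through authenticated transfer and job submission services instead.

```python
import shutil
import tempfile
from pathlib import Path


def stage(local_file, remote_dir):
    """Step 1: copy a local file to the (simulated) remote server."""
    return Path(shutil.copy(local_file, remote_dir))


def execute(remote_input, remote_dir):
    """Step 2: stand-in for the remote job; counts lines in the staged file."""
    result = remote_dir / "result.txt"
    result.write_text(str(len(remote_input.read_text().splitlines())))
    return result


def fetch(remote_file, local_dir):
    """Step 3: copy the result back to the local environment."""
    return Path(shutil.copy(remote_file, local_dir))


with tempfile.TemporaryDirectory() as tmp:
    local, remote = Path(tmp, "local"), Path(tmp, "remote")
    local.mkdir()
    remote.mkdir()
    source = local / "input.txt"
    source.write_text("a\nb\nc\n")        # made-up input data

    staged = stage(source, remote)
    remote_result = execute(staged, remote)
    fetched = fetch(remote_result, local)
    fetched_text = fetched.read_text()
    fetched_name = fetched.name
```

Keeping the three steps as separate functions mirrors how they appear as separate actors in a workflow, so each one can be replaced or monitored independently.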

55 Why not use a script?
Script does not specify low-level task scheduling and communication
May be platform-dependent
Can't be easily reused

56 Some Kepler Grid Actors
Copy: copy files from one resource to another during execution
  – Stage actor: local to remote host
  – Fetch actor: remote to local host
Job execution actor: submit and run a remote job
Monitoring actor: notify user of failures
Service discovery actor: import web services from a service repository or web site