USC Viterbi School of Engineering Scientific Workflows and Systems Ewa Deelman.

1 USC Viterbi School of Engineering Scientific Workflows and Systems Ewa Deelman

2 Outline
Scientific workflows
Business workflows
Different workflow systems
–Taverna
–Kepler
–Triana
–Askalon

3 Applications today
Ewa Deelman, deelman@isi.edu
Complex
–Involve many computational steps
–Require many (possibly diverse) resources
Composed of individual application components
–Components written by different individuals
–Components require and generate large amounts of data
–Components written in different languages
Reuse of individual intermediate data products
Need to keep track of how the data was produced

4 Workflow Instance
Ewa Deelman, deelman@isi.edu, www.isi.edu/~deelman, pegasus.isi.edu
[Figure: Collect image → Adjust Color (one task per image: Image 1, Image 2, …, Image n) → Co-Add image → Visualize]
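The workflow instance on this slide can be sketched as a dependency graph. The snippet below is only an illustration: the task names and the `build_image_workflow` helper are made up for this example, and it uses Python's standard `graphlib` module rather than any workflow system discussed in the talk.

```python
from graphlib import TopologicalSorter


def build_image_workflow(n_images):
    """Return {task: set of predecessor tasks} for the image workflow.

    Task names are hypothetical labels for the boxes on the slide.
    """
    deps = {"collect_image": set()}
    adjust_tasks = [f"adjust_color_{i}" for i in range(1, n_images + 1)]
    for task in adjust_tasks:
        # Every colour adjustment depends only on the collected image,
        # so the adjustments are independent of each other.
        deps[task] = {"collect_image"}
    # Co-adding needs all adjusted images; visualization needs the co-add.
    deps["co_add_image"] = set(adjust_tasks)
    deps["visualize"] = {"co_add_image"}
    return deps


deps = build_image_workflow(3)
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Any order printed here respects the dependencies; the three colour adjustments have no edges between them, so a workflow system is free to run them in parallel.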

5 Business Workflows

6 Business Workflows
Designed to compose applications based on web services
BPEL
–Standard language for service interactions
–Has many constructs for invoking web services, including fault handling and support for conditional logic

7 BPEL constructs
<receive>: Blocks until a matching message is received. Typically used to receive a message from the client or a callback from a partner web service.
<reply>: Sends a message in response to a message received via a <receive>.
<invoke>: Performs an invocation on a web service (one-way or request-response).
<assign>: Assigns a value to a variable.
<sequence>: Executes a list of activities sequentially in lexical order.
<flow>: Executes the activities in parallel.
<while>: Repeats an activity for as long as a condition holds.
<switch>: Selects one branch for execution from a set of branches based on a value.

8 Many BPEL engines
ActiveBPEL
IBM BPEL4J
Oracle BPEL Process Manager
Microsoft Windows Workflow Foundation
….

9 Scientific vs Business Workflows
Scientific workflows
–Large amounts of data
–Varied granularity of computations
–Large number of computations
–Often standalone components
–Non-programmers need to be able to compose them
–Need to provide provenance info
–Performance is important
Business workflows
–Deal with services across domains
–Do not deal with standalone application components
–Usually not very data intensive: data can be easily sent between services
–Important to agree on standard interfaces so that MS & IBM can work together
–Focus on functionality/interoperability rather than performance

10 Example of a business workflow

11 Example of Scientific Workflow
Workflow Specification
Components
–Standalone computations
–Designed by different individuals

12 Different workflow systems
Taverna, a workbench for bioinformatics workflows
Slides courtesy of Katy Wolstencroft

13 The Community Problems
Everything is Distributed
–Data, Resources and Scientists
Heterogeneous data
Very few standards
–I/O formats, data representation, annotation
–Everything is a string!
Integration of data and interoperability of resources is difficult

14 Lots of Resources
NAR 2007: 968 databases

15 Traditional Bioinformatics
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa

16 Cutting and Pasting
Advantages:
–Low technology on both server and client side
–Very robust: hard to break
–Data integration happens along the way
Disadvantages:
–Time consuming (and painful!): can be repeated rarely, limited to small data sets
–Error prone: poor repeatability

17 Pipeline Programming
Advantages
–Repeatable
–Allows automation
–Quick, reliable, efficient
Disadvantages
–Requires programming skills
–Difficult to modify
–Requires local tool and database installation
–Requires tool and database maintenance!!!

18 What we want as a solution
A system that:
Allows automation
Allows easy repetition, verification and sharing of experiments
Works on distributed resources
Requires few programming skills
Runs on a local desktop / laptop

19 myGrid as a solution
myGrid allows the automated orchestration of in silico experiments over distributed resources from the scientist’s desktop
Built on the computer science technologies of:
Web services
Workflows
Semantic web technologies

20 Workflows
–General technique for describing and enacting a process
–Describes what you want to do, not how you want to do it
–High level description of the experiment
[Figure: Repeat Masker Web Service → GenScan Web Service → Blast Web Service]

21 Workflows
Workflow language specifies how bioinformatics processes fit together.
High level workflow diagram separated from any lower level coding – you don’t have to be a coder to build workflows.
Workflow is a kind of script or protocol that you configure when you run it.
Easier to explain, share, relocate, reuse and repurpose.
Workflow is the integrator of knowledge
–The METHODS section of a scientific publication
[Figure: Workflow Model]

22 Workflow Advantages
Automation
–Capturing processes in an explicit manner
–Tedium! Computers don’t get bored/distracted/hungry/impatient!
–Saves repeated time and effort
Modification, maintenance, substitution and personalisation
Easy to share, explain, relocate, reuse and build
Releases Scientists/Bioinformaticians to do other work
Record
–Provenance: what the data is like, where it came from, its quality

23 Taverna Workflow Components
Scufl: Simple Conceptual Unified Flow Language
Taverna: Writing, running workflows & examining results
SOAPLAB: Makes applications available
[Figure: SOAPLAB web service wrapping any application; web service, e.g. DDBJ BLAST]

24 An Open World
Open domain services and resources
Taverna accesses 3000+ services
Third party: we don’t own them, we didn’t build them
All the major providers
–NCBI, DDBJ, EBI …
Enforce NO common data model
Quality Web Services considered desirable

25 Adding your own web services
SoapLab: http://www.ebi.ac.uk/soaplab/
Java API Consumer: import Java API of libSBML as workflow components

26 Shield the Scientist – Bury the Complexity
[Figure: the Taverna Workbench (application) edits the Scufl (Simple Conceptual Unified Flow Language) model; the workflow enactor (workflow execution) runs it through processors: Plain Web Service, Soaplab, Local Java App, Enactor, BioMOBY, WSRF, BioMART, Styx client, R package, ...]

27 Kepler
Slides courtesy of Bertram Ludaescher

28 Scientific Workflow
Capture how a scientist works with data and analytical tools
–data access, transformation, analysis, visualization
–possible worldview: dataflow-oriented (cf. signal-processing)
Scientific workflow (wf) benefits (compare w/ script-based approaches):
–wf automation
–wf & component reuse
–wf design, documentation
–wf archival, sharing
–built-in concurrency (task-, pipeline-parallelism)
–built-in provenance support
–distributed execution (Grid) support
–…

29 Ex: SEEK Ecological Niche Modeling Pipeline
Scientific Workflow paradigm:
–Reusable components (“actors”): a scientist’s verbs/actions
–Top-level workflows ≈ conceptual representation of the science process, sentences in the scientist’s language
–Sub-workflows ≈ increasing levels of detail
Separation of concerns:
–actors: what to do
–parameters: configurable behavior
–channels: dataflow, pipeline composition
–directors: fix execution model, scheduling
–semantic types: smart discovery, linking
D Pennington, D Higgins, AT Peterson, M Jones, B Ludaescher, S Bowers. Ecological Niche Modeling using the Kepler Workflow System. Workflows for e-Science, Springer.

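30 Simple Kepler workflow using R (a statistics package)
Data source from EcoGrid (metadata-driven ingestion)
R processing script:
    res <- lm(BARO ~ T_AIR)   # fit a linear regression of BARO on T_AIR
    res                       # print the fitted coefficients
    plot(T_AIR, BARO)         # scatter plot of the two variables
    abline(res)               # overlay the fitted regression line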

31 Plumbing with Style …
(Norbert Podhorszki UC Davis, Scott Klasky ORNL)
[Figure: workflow stages Monitor, Transfer, Convert, Archive]
Plasma physics simulation on 2048 processors on Seaborg@NERSC (LBL)
–Gyrokinetic Toroidal Code (GTC) to study energy transport in fusion devices (plasma microturbulence)
–Generating 800GB of data (3000 files, 6000 timesteps, 267MB/timestep), 30+ hour simulation run
Under workflow control:
–Monitor (watch) simulation progress (via remote scripts)
–Transfer from NERSC to ORNL concurrently with the simulation run
–Convert each file to an HDF5 file
–Archive files in 4GB chunks into HPSS

32 Our Starting Point: Actor-Oriented Modeling
Ports
–each actor has a set of input and output ports
–denote the actor’s signature
–produce/consume data (a.k.a. tokens)
–parameters are special “static” ports

33 Actor-Oriented Modeling
Dataflow Connections
–unidirectional actor “communication” channels
–connect output ports with input ports
–for composing analysis pipelines

34 Actor-Oriented Modeling
Sub-workflows / Composite Actors
–composite actors “wrap” sub-workflows
–like actors, have signatures (i/o ports of sub-workflow)
–hierarchical workflows (arbitrary nesting levels)

35 Actor-Oriented Modeling
Directors
–define the execution semantics of workflow graphs
–execute the workflow graph (according to some schedule)
–sub-workflows may have different directors
–promote reusability
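The actor, port, channel and director vocabulary of the last four slides can be made concrete with a small sketch. It is not Kepler code: `Actor`, `Workflow` and `SequentialDirector` are hypothetical names invented for this illustration, and the firing rule (fire any actor whose input ports all hold a token) is just one simple choice of execution semantics.

```python
from collections import deque


class Actor:
    """An actor with named input/output ports; parameters act as static ports."""

    def __init__(self, name, func, inputs, outputs, **params):
        self.name, self.func = name, func
        self.inputs, self.outputs = inputs, outputs
        self.params = params


class Workflow:
    """Actors plus unidirectional channels from output ports to input ports."""

    def __init__(self):
        self.actors = []
        self.queues = {}  # (actor name, input port) -> deque of tokens
        self.links = {}   # (actor name, output port) -> [(actor name, input port)]

    def add(self, actor):
        self.actors.append(actor)
        for port in actor.inputs:
            self.queues[(actor.name, port)] = deque()
        return actor

    def connect(self, src, out_port, dst, in_port):
        self.links.setdefault((src.name, out_port), []).append((dst.name, in_port))

    def put(self, actor, port, token):
        self.queues[(actor.name, port)].append(token)


class SequentialDirector:
    """Fires any actor whose input ports all hold a token, until none can fire."""

    def run(self, wf):
        results = {}
        fired = True
        while fired:
            fired = False
            for actor in wf.actors:
                queues = [wf.queues[(actor.name, p)] for p in actor.inputs]
                if not queues or not all(queues):
                    continue
                tokens = [q.popleft() for q in queues]
                outputs = actor.func(*tokens, **actor.params)
                for port, token in zip(actor.outputs, outputs):
                    targets = wf.links.get((actor.name, port), [])
                    for target in targets:
                        wf.queues[target].append(token)
                    if not targets:
                        # Unconnected output ports collect the workflow results.
                        results.setdefault((actor.name, port), []).append(token)
                fired = True
        return results


wf = Workflow()
scale = wf.add(Actor("scale", lambda x, factor: (x * factor,), ["in"], ["out"], factor=10))
offset = wf.add(Actor("offset", lambda x, delta: (x + delta,), ["in"], ["out"], delta=1))
wf.connect(scale, "out", offset, "in")
for token in [1, 2, 3]:
    wf.put(scale, "in", token)
results = SequentialDirector().run(wf)
print(results)
```

Swapping `SequentialDirector` for a different scheduling class would change how the same graph executes without touching the actors, which is the separation of concerns the Directors slide describes.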

36 Models of Computation (A Wf Engineer’s Issue)
Directors separate the concerns of orchestration and scheduling from conceptual design
–Synchronous Dataflow (SDF): Statically analyzable: schedule, no deadlocks, fixed buffer requirements; executable as a single thread by the director.
–Process Networks (PN): Generalizes SDF. Actors execute as separate threads/processes, with queues of unbounded size (Kahn/MacQueen networks).
–Directed Acyclic Graph (DAG): Special case of SDF. No loops, no pipelining.
–Continuous Time (CT): Connections represent the value of a continuous time signal at some point in time... Often used to model physical processes.
–Discrete Event (DE): Actors communicate through a queue of events in time. Used for instantaneous reactions in physical systems.
–…
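To illustrate why SDF graphs are statically analyzable, the sketch below solves the standard balance equations: for every channel, firings of the producer times tokens produced per firing must equal firings of the consumer times tokens consumed per firing. The function name and the edge format are assumptions made for this example, and the code assumes a connected graph; it is not taken from Kepler or Ptolemy.

```python
from fractions import Fraction
from math import lcm


def repetition_vector(actors, edges):
    """Smallest positive firing counts that leave every channel balanced.

    edges: list of (src, produced_per_firing, dst, consumed_per_firing).
    Raises ValueError if the rates are inconsistent or the graph is not connected.
    """
    rate = {actors[0]: Fraction(1)}
    pending = list(edges)
    while pending:
        progressed = False
        for edge in list(pending):
            src, produced, dst, consumed = edge
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * produced / consumed
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * consumed / produced
            elif src in rate and dst in rate:
                if rate[src] * produced != rate[dst] * consumed:
                    raise ValueError("inconsistent rates: no fixed-buffer schedule")
            else:
                continue  # neither end is known yet; retry on a later pass
            pending.remove(edge)
            progressed = True
        if not progressed:
            raise ValueError("graph is not connected")
    # Scale the rational solution to the smallest all-integer solution.
    scale = lcm(*(r.denominator for r in rate.values()))
    return {actor: int(r * scale) for actor, r in rate.items()}


# A produces 2 tokens per firing, B consumes 3; B produces 1, C consumes 2.
vector = repetition_vector(["A", "B", "C"], [("A", 2, "B", 3), ("B", 1, "C", 2)])
print(vector)
```

Firing each actor the number of times given by the vector returns every channel to its initial token count, so a director can precompute one periodic schedule with fixed buffer sizes.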

37 Everything is a service / actor…

38 Smart Discovery
Find a component (here: an actor) in different locations (“categories”) …
… based on the semantic annotation of the component (or its ports)
–Browse for Components
–Search for Component Name
–Search for Category / Keyword

39 Behold the Beauty of Scientific Workflow Design
Author: Kristian Stevens, UC Davis

40 … Shimology Part 2: the ugly truth inside
Author: Kristian Stevens, UC Davis

41 Triana
Slides courtesy of Ian Taylor

42 Triana Focus
Two core underlying focuses:
–Interactive graphical programming of distributed tasks (complex editing)
  Intuitive drag/drop flexible editing: copy/paste services, wizards for creating tools/toolboxes, user interfaces, adding nodes and multi-level grouping.
  Has been used as a “graphical editor” for other languages, e.g. DAG, VDLx (DAX in progress).
–Heterogeneous workflows: bridge the gap between different distributed environments
  Use of cross-environment interfaces led to integration with GAT (pre SAGA) and GAP

43 Types of Uses
–For fine-grained operations, specifying dataflow for local operations
–Or coarse-grained composition of a distributed workflow
–Or both: can connect heterogeneous tools (e.g. Web services, Java units, Jxta services) in one workflow
Has been used as a dataflow system, a distributed-workflow environment, a workflow-management system, an automated scripting tool, and a workflow editor.

44 Current Capabilities
Local Java Units
–600 units in signal, image, audio, text processing, complete math/stats toolboxes etc.
–Common units: flexible importers/exporters, graphing, duplicators
–Data types: strong data types for a number of domains, including run-time checking
Distributed Integration
–GAT: Java GAT implementation, graphical representation of GAT primitives, supports GRAM, GridFTP, etc.
–GAP: SOA publish, find, bind triad of operations. Bindings: Jxta, P2PS, Web Services, WS-RF
–Group unit deployment
Legacy Applications
–Can incorporate legacy applications easily (using local GAT adaptor) through a standard file in/out interface

45 Distributed Work-flow
[Figure: workflow commands or a workflow description (e.g. BPEL4WS) drive the Triana Engine, which reaches Triana Service & Engine instances, remote legacy applications and distributed services. Labels: Upperware, Middleware.]
–Distributing Triana Units or Groups (Java): GAP
–Integrating legacy applications into workflow: GAT & GAP
–Integrating Web Services or P2P Services: GAP

46 Triana, the GAT and the GAP
A Graphical Grid Computing Environment or Portal
GAP Interface: service-based computing (deployment, discovery and communication with distributed services, e.g. P2P and (GSI) Web services)
–P2PS: P2PS Discovery, P2PS Pipes
–JXTA: JXTA Discovery, JXTA Pipes
–Web Services: UDDI, SOAP
GAT Interface: Grid computing (job submission, file services)
–Condor, Globus, RLS, Unicore, PBS, GridLab GRMS, SGE, SSH, WSRF, LDR.NET, GridFTP, Other..

47 Audio Processing (Groups)

48 Group Units

49 GAT Interface
Main deliverable of GridLab
Application-level interface
With a set of adapters
–That adapt the interface to an underlying capability
Versions in C++ and Java
Precursor to SAGA (Simple API for Grid Applications)

50 GAT Adapters: Example
[Figure: the application calls Copy File(Machine A, Machine B) on the GAT API (resource management, streaming/comms, file management, job management, monitoring, collection management). The GAT Engine dispatches the call to a GridFTP adapter (GridFTP connection) in a Grid environment or to a Jxta file adapter (Jxta pipe) in a P2P environment.]
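The adapter idea on this slide can be sketched as one application-level `copy_file` call that an engine routes to whichever adapter can handle it. This is not the GAT API: the class names, the URL schemes, and the choice to select an adapter by URL scheme are all assumptions made for illustration, and the adapters only return strings instead of moving data.

```python
from urllib.parse import urlparse


class GridFtpAdapter:
    """Stand-in for an adapter that would open a GridFTP connection."""

    schemes = {"gsiftp"}

    def copy_file(self, source, target):
        return f"gridftp: {source} -> {target}"


class JxtaFileAdapter:
    """Stand-in for an adapter that would stream the file over a Jxta pipe."""

    schemes = {"jxta"}

    def copy_file(self, source, target):
        return f"jxta-pipe: {source} -> {target}"


class Engine:
    """Keeps the application-level call fixed while the adapters vary."""

    def __init__(self, adapters):
        self._adapters = list(adapters)

    def copy_file(self, source, target):
        scheme = urlparse(source).scheme
        for adapter in self._adapters:
            if scheme in adapter.schemes:
                return adapter.copy_file(source, target)
        raise LookupError(f"no adapter for scheme {scheme!r}")


engine = Engine([GridFtpAdapter(), JxtaFileAdapter()])
grid_result = engine.copy_file("gsiftp://machine-a/data.h5", "gsiftp://machine-b/data.h5")
p2p_result = engine.copy_file("jxta://machine-a/data.h5", "jxta://machine-b/data.h5")
print(grid_result)
print(p2p_result)
```

The application code calls `engine.copy_file` in both cases; only the adapter list decides whether the copy runs in a Grid or a P2P environment.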

51 GAP Interface
Motivated by GAT
A simple service-based API for
–Service Deployment
–Service Discovery
–Pipe Based Communication
Static application interface with multiple middleware bindings
–P2PS (name…?)
–JXTA
–Web services
[Figure: GAP Interface with bindings for P2PS (P2PS Discovery, P2PS Pipes), JXTA (JXTA Discovery, JXTA Pipes) and Web Services (UDDI, SOAP)]

52 Deploying and Connecting To Remote Services
Running services are automatically discovered via the GAP Interface, and appear in the tool tree
User can drag remote services onto the workspace and connect cables to them like standard tools (except the cables represent actual JXTA/P2PS pipes)
[Figure: Remote Services]

53 Web Service Discovery
Triana allows users to query UDDI repositories
Alternatively, users can import services directly from WSDL

54 Complex Data Types
Users can build their own interface for creating/mediating between complex types
Alternatively, Triana can dynamically generate an interface from the WSDL2Java generated bean class

55 Askalon
Slides courtesy of Thomas Fahringer

56 ASKALON: Application Development and Runtime Environment for the Grid
Goal: simple, efficient, effective application development for the Grid
Invisible Grid
Application modeling (UML) and programming at a high level of abstraction (AGWL)
Semantics technologies
Semi-automatic deployment
SOA-based runtime environment with stateful services
Analysis and optimization of performance, costs and reliability

57 ASKALON Workflow Composition and Runtime Environment
[Figure: UML-based workflow composition produces AGWL; the runtime middleware services (Execution Engine, Scheduler, Resource Manager, Data Repository, Performance Analysis) are built on WSRF and run jobs and activities on the Grid through the Globus toolkit]

58 Austrian Grid
517 CPUs distributed across 5 cities and over 20 parallel computers
[Figure: map of Austrian Grid sites with their CPU counts and local schedulers (Torque, PBS, PBS/Torque, SGE, MAUI)]
[Table: columns Parallel computer, #CPU, Clock, Architecture, Location. Parallel computers: altix1.jku, hydra.gup, schafberg.sbg, grid.fhv.at, gescher.vcpc, karwendel.dps, altix1.uibk, hc-ma.uibk, zid-grid. #CPU values as extracted: 64, 16, 21, 32, 80, 16, 272. Processor types as extracted: ITA2, Athlon, ITA2, Xeon, Opteron, ITA2, Opteron, P4. Clock values as extracted: 1.6, 3, 2.2, 1.6, 2.2, 1.8. Architectures as extracted: ccNUMA, COW, ccNUMA, COW, ccNUMA, COW, NOW. Locations: Linz, Salzburg, Vorarlberg, Vienna, Innsbruck.]

59 ASKALON Workflows
Activity = basic or atomic unit of computation
Activity type
–Functional description of the activity: signature specified by data input/output ports
–Semantically meaningful name, e.g. matrix multiplication, Gaussian elimination, povray, png2yuv, ffmpeg, FFT, LAPW, WASIM, …
–Implementation-independent
Workflow = collection of activity types interconnected through control flow and data flow dependencies
–Plus some advanced constructs
Activity deployment
–Binds an activity type to a concrete installed implementation
–Description of how to instantiate the activity
–Registered by the application provider in a special registry of the Resource Management service

60 ASKALON: Abstract Grid Workflow Language (AGWL)
Atomic activities
–abstract from the real implementation, e.g. Web services, legacy applications
Basic compound activities
–Sequential constructs
–Conditional constructs
–Loop constructs
–Directed Acyclic Graph constructs
Advanced compound activities
–Parallel section constructs
–Parallel loop constructs
Data flow constructs
–dataIn/dataOut ports, collections, data repositories, data set distributions, etc.
Properties
–provide hints about the behavior of activities
–Predicted I/O data size, computational complexity, non-functional parameters
Constraints
–Optimization metric (e.g. performance, cost, fault tolerance)
–Scheduling constraints (e.g. compute architecture, disk, memory)

61 ASKALON Workflow Development Stack
[Figure: a Grid application developer builds a UML workflow model in the portal; the model is expressed in the Abstract Grid Workflow Language (AGWL, XML); the ASKALON middleware concretizes it into the Concrete Grid Workflow Representation (CGWR), going from activity type to activity deployment to Grid activity instance. Labels: UML, XML, Java.]

62 Real-world Scientific Workflows with ASKALON
WIEN2k
Material science application
Technical University of Vienna
–Institute of Theoretical Chemistry
Seven activity types
Over 500 activity instances
Statically unknown number of sequential loop iterations

63 Resource Management
Resource brokerage
–Interface to MDS information service for resource discovery
–Selection based on matchmaking
Advance reservation
–Useful for co-allocation purposes
GLARE
–Registry of activity deployments
Activity deployment
–Binds an abstract activity type to a concrete implementation
–Refers to an installed executable or a deployed Web/Grid service
–Description of how to instantiate the activity
–Registered in GLARE by the application provider
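The split between abstract activity types and concrete activity deployments can be sketched as a small registry. This does not reproduce GLARE's interface: the class names, fields, site names and locations below are hypothetical, chosen only to show one type resolving to several deployments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivityType:
    """Implementation-independent description: a name plus a port signature."""
    name: str
    inputs: tuple
    outputs: tuple


@dataclass(frozen=True)
class ActivityDeployment:
    """A concrete implementation of an activity type on some site."""
    activity_type: str
    site: str
    kind: str       # "executable" or "service"
    location: str   # path of the executable or address of the service


class DeploymentRegistry:
    """Hypothetical stand-in for a registry of activity deployments."""

    def __init__(self):
        self._types = {}
        self._deployments = {}

    def register_type(self, activity_type):
        self._types[activity_type.name] = activity_type
        self._deployments.setdefault(activity_type.name, [])

    def register_deployment(self, deployment):
        if deployment.activity_type not in self._types:
            raise KeyError(f"unknown activity type {deployment.activity_type!r}")
        self._deployments[deployment.activity_type].append(deployment)

    def deployments_for(self, type_name):
        return list(self._deployments.get(type_name, []))


registry = DeploymentRegistry()
registry.register_type(ActivityType("png2yuv", inputs=("png",), outputs=("yuv",)))
registry.register_deployment(
    ActivityDeployment("png2yuv", "site-a", "executable", "/opt/tools/png2yuv"))
registry.register_deployment(
    ActivityDeployment("png2yuv", "site-b", "service", "https://site-b.example/png2yuv"))
found = registry.deployments_for("png2yuv")
print(found)
```

A workflow would refer only to the type name `png2yuv`; choosing between the two deployments is left to the resource manager and scheduler at run time.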

64 Askalon Runtime Environment
Dynamic Bindings of Workflow: Abstract to Concrete
[Figure: the Resource Manager binds the activity types (abstract) of the abstract workflow to activity deployments (Web services, executables) on Node 1, Node 2, Node 3 and Node 4, producing the concrete workflow]

65 Composite Activities
Composite activity
–Sequence
–Parallel activities
–Conditional activities: if, switch
–Sequential loops: for, while, for each
–Parallel loops: parallel for, parallel for each
–Sub-workflows
[Figure: Sequence of activities A1 and A2, with data flow and control flow edges]

66 If-then-else
[Figure: if-then-else composite activity over activities A0, A1, A2 and A3, with then and else branches and steps labelled (1) to (4)]

67 Execution Engine
Workflow controller
–Converts XML-based specification (AGWL) to internal representation
–Executes the workflow according to control and data flow dependencies
One separate Controller for every workflow instance
Event system
–Other components can subscribe to the internal events
–e.g. logging, controller, tool (WS-Notification), ...
Logging and database
–For post-mortem performance analysis
GT4 WSRF wrapper
–Sends WS-Notifications to the portal
Scheduler
–Receives jobs ready to execute from the task loop
–Retrieves the available resources from GridARM
–Assigns the task to the best machine according to the selection criteria
  Clock speed * number of free processors
  Prediction information, memory available, …
[Figure: components Core, Task Loop, Fault Handler, Controller, AGWL Interpreter, Event System, GT4 WSRF Service, Logging & Database, Scheduler, Execution / Launching Framework, GridARM; input: AGWL]
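The first selection criterion named on this slide, clock speed times number of free processors, can be written as a scoring function. The sketch below covers only that criterion; the `Machine` class, the `pick_machine` helper and the machine names and figures are invented for illustration and are not taken from ASKALON.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    clock_ghz: float
    free_processors: int


def pick_machine(machines):
    """Return the machine maximizing clock speed * free processors, or None."""
    candidates = [m for m in machines if m.free_processors > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m.clock_ghz * m.free_processors)


# Hypothetical resource snapshot, as a resource manager might report it.
machines = [
    Machine("cluster-a", clock_ghz=2.2, free_processors=10),  # score 22.0
    Machine("cluster-b", clock_ghz=1.6, free_processors=16),  # score 25.6
    Machine("cluster-c", clock_ghz=3.0, free_processors=0),   # no free processors
]
best = pick_machine(machines)
print(best.name)
```

A fuller scheduler would also weigh the other criteria the slide lists, such as prediction information and available memory.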

