Building Scientific Workflows with Taverna and BPEL: a Comparative Study in caGrid
Wei Tan 1, Paolo Missier 2, Ravi Madduri 1, Ian Foster 1
1 University of Chicago and Argonne National Laboratory, USA
2 School of Computer Science, University of Manchester, Manchester, U.K.
W. Tan, et al. Building Scientific Workflow with Taverna and BPEL
Agenda
- Introduction to caGrid
- Why scientific workflows in caGrid?
- BPEL and Taverna comparison
  - Service discovery
  - Service composition & workflow execution
    - Data-driven vs. control-driven modeling
    - Implicit vs. explicit definition of data
    - Implicit vs. explicit iteration on data
  - Workflow result analysis
- Conclusion
Introduction: caBIG and caGrid
As of Oct 19, 2008: 122 participants; 105 services (70 data, 35 analytical)
Introduction: caGrid and workflow
[Figure: caGrid (virtualization, security, connectivity) over data, instruments, and computation resources; scientific workflow lifecycle: Discovery, Composition, Execution, Analysis, with Community (reuse, generate).]
Challenges faced by caGrid users
[Figure: workflow lifecycle on caGrid (Discovery, Composition, Execution, Analysis; Community: reuse, generate), annotated with the challenges listed here.]
- Locating needed services
- Determining their function
- Accessing services from a workflow
- GUI for building workflows easily
- Executing workflows efficiently
- Persisting and visualizing results
- Sharing and reusing workflows
Our goals in this paper
- Communicate practical experiences based on our work in the caGrid project
- Cover the entire scientific workflow lifecycle: service discovery, service composition, workflow execution, and workflow result analysis
- Based on caGrid requirements for workflow language and tooling
- Also applicable to other areas in data-intensive and exploratory science?
BPEL and Taverna
Not the only two options, but representative choices
BPEL
- XML-based specification for web-service-based process behavior
- Industry standard adopted by IBM, SAP, Oracle, etc.
- Has also attracted attention from the scientific community because of its support for the SOA paradigm
Taverna
- Open source, from the myGrid consortium in the UK
- Design and execution of scientific workflows
- Plug-in architecture for extension (access more applications, visualize more data types, etc.)
Querying semantic data in cancer research
Identify description logic concepts relating to a particular context, e.g., “caCore”:
1) Query all projects related to the context “caCore”
2) Find the UML classes in each project
3) Use the project and UML class information to query the semantic metadata
4) Retrieve the concept code
We adopt this query as a use case to guide our comparison.
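The four query steps can be sketched in Python as a nested lookup. This is only an illustration under stated assumptions: the dictionaries stand in for the remote caGrid services, the project names, class names, and concept codes are invented, and `find_semantic_metadata` is a hypothetical helper. Only the names findProjects and findClassesInProject echo operations mentioned in this talk.

```python
# Invented sample data standing in for the remote services (assumption).
PROJECTS_BY_CONTEXT = {"caCore": ["projA", "projB"]}
CLASSES_BY_PROJECT = {"projA": ["Gene", "Protein"], "projB": ["Taxon"]}
CONCEPT_CODES = {
    ("projA", "Gene"): ["C0001"],
    ("projA", "Protein"): ["C0002"],
    ("projB", "Taxon"): ["C0003"],
}


def find_projects(context):
    """Step 1: all projects related to a context."""
    return PROJECTS_BY_CONTEXT.get(context, [])


def find_classes_in_project(project):
    """Step 2: UML classes in a single project."""
    return CLASSES_BY_PROJECT.get(project, [])


def find_semantic_metadata(project, uml_class):
    """Step 3 (hypothetical helper): concept codes for one class."""
    return CONCEPT_CODES.get((project, uml_class), [])


def concept_codes_for_context(context):
    """Step 4: collect the concept codes across all projects and classes."""
    codes = []
    for project in find_projects(context):
        for uml_class in find_classes_in_project(project):
            codes.extend(find_semantic_metadata(project, uml_class))
    return codes
```

The two nested loops are the part that a workflow language has to express, either explicitly or implicitly.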
Support for service discovery
Before building a workflow
- Need to find appropriate services to compose
- Service endpoints are not known to users in advance
- Exact semantics of those services are not known
Taverna offers
- An extensible scavenger interface for arbitrary service discovery according to users' needs (see next page)
- A native semantic discovery facility called Feta: myGrid-ontology-based service annotation and search
BPEL offers
- UDDI, which is not widely adopted
- Research efforts such as WSMO and OWL-S, which remain largely at the specification level
- No open-source tool that works with a service query component in an integrated way
Solution for caGrid: Metadata-based service query
1. Semantic/metadata-based service discovery.
2. Build a workflow using the services obtained by discovery.
3. Execute the workflow and view the results.
caGrid service metadata
caGrid scavenger: query the CaDSR Service in the use case
Types of query
- String based
- Property based
- Semantic based
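The three query types can be illustrated with a minimal Python sketch over an in-memory registry. Everything here is an assumption made for illustration: `ServiceRecord`, its fields, the sample entries, and the query functions are hypothetical and are not the caGrid scavenger's API. In particular, semantic search is reduced to exact matching on concept codes.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRecord:
    """Hypothetical metadata record; real caGrid service metadata is richer."""
    name: str
    description: str
    properties: dict = field(default_factory=dict)
    concept_codes: set = field(default_factory=set)


# Invented sample registry, for illustration only.
REGISTRY = [
    ServiceRecord("CaDSRService", "Query UML models and semantic metadata",
                  {"hosting_center": "CenterA"}, {"C0001"}),
    ServiceRecord("ImageAnalysis", "Analytical service for image data",
                  {"hosting_center": "CenterB"}, {"C0002"}),
]


def query_by_string(registry, text):
    """String based: case-insensitive match on name or description."""
    needle = text.lower()
    return [s.name for s in registry
            if needle in s.name.lower() or needle in s.description.lower()]


def query_by_property(registry, key, value):
    """Property based: exact match on one metadata property."""
    return [s.name for s in registry if s.properties.get(key) == value]


def query_by_concept(registry, concept_code):
    """Semantic based (simplified): services annotated with a concept code."""
    return [s.name for s in registry if concept_code in s.concept_codes]
```

For example, `query_by_string(REGISTRY, "cadsr")` selects the service used in the use case, after which it can be added to a workflow.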
Service composition & workflow execution
- Data-driven vs. control-driven modeling
- Implicit vs. explicit definition of data
- Implicit vs. explicit iteration on data
Data-driven vs. control-driven modeling
Comparison of BPEL and Taverna (Scufl) w.r.t. control/data-flow

| Aspect | BPEL | Taverna (Scufl) |
| Activities in model | Basic and structured activities | Processors as data processing units with input/output ports |
| Semantics of links | Transfer of control | Transfer of data |
| Data definition | Explicitly defined (global variables) | Implicitly defined (processor's input/output) |
| Data initialization | Complex data types must be explicitly initialized | Automatic |
| Control logic | Full-fledged: sequence, conditional, parallel, event-triggered, etc. | Limited: sequential, parallel, and conditional |
| Parallel execution | Defined explicitly in the process model | By default |
Implicit vs. explicit definition of data
Taverna
- Processors have input/output ports with an associated data type
- Data travels from the output port of a processor to the input port of one or more downstream processors
- Interaction among processors is defined entirely by the arcs in the dataflow graph
BPEL
- Requires the explicit definition of variables, and explicit initialization for complex types
- Data are shared among activities (i.e., are global)
- More complexity, but more power and flexibility in data handling
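A rough Python analogy of the two styles follows. It is a sketch, not Scufl or BPEL: the two toy processors and both runner functions are made up for illustration. In the first style, intermediate data never gets a name and exists only on the links between steps; in the second, every value lives in a declared, shared variable.

```python
def to_upper(text):
    """Toy processor: one input port, one output port."""
    return text.upper()


def exclaim(text):
    """Toy processor: appends an exclamation mark."""
    return text + "!"


def run_pipeline(processors, value):
    """Dataflow style: each output feeds the next input; no named variables."""
    for processor in processors:
        value = processor(value)
    return value


def run_with_variables(raw):
    """Variable style: data is declared, initialized, and shared globally."""
    variables = {"input": None, "upper": None, "result": None}  # declaration
    variables["input"] = raw                                    # initialization
    variables["upper"] = to_upper(variables["input"])
    variables["result"] = exclaim(variables["upper"])
    return variables["result"]
```

Both runners compute the same result. The second is longer, but any later step could read or rewrite any variable, which is the extra flexibility noted above.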
Implicit vs. explicit iteration on data
Implicit iteration in Taverna
- Occurs when an input port receives a list where a single element is expected
  - E.g., a processor that outputs a “list of strings” can legally be connected to a processor with an input port of type “string”
- Taverna interprets this type mismatch as an indication that the destination processor must be invoked repeatedly, once for each element of the input list
- This behavior is defined by Taverna's functional programming model
Explicit iteration in BPEL
- BPEL does not allow type mismatches, so iteration must be defined explicitly
- Again, BPEL offers more flexibility to define more advanced iteration patterns (at the cost of more complexity in the model)
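The difference can be sketched in a few lines of Python. This is an analogy under simplifying assumptions, not Taverna's engine: `implicit_iterate` is a hypothetical helper that assumes the processor's input port expects a single (non-list) value, and it maps the processor over any list it receives, however deeply nested.

```python
def implicit_iterate(processor, value):
    """Invoke processor once per element when a list arrives at a
    port that expects a single value (assumed here for simplicity)."""
    if isinstance(value, list):
        return [implicit_iterate(processor, item) for item in value]
    return processor(value)


def explicit_iterate(processor, values):
    """The explicit counterpart: the loop is written out by the modeler."""
    results = []
    for item in values:
        results.append(processor(item))
    return results
```

With the implicit version the caller connects a list to a single-value processor and the loop is inferred. With the explicit version the caller must know the input is a list and write the loop, but can also change it, for instance to stop early or iterate over two lists in step.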
Implicit vs. explicit iteration in CaDSR
- findProjects returns an array Project[]
- findClassesInProject receives type Project and finds all UML classes in this (single) project
- In Taverna, an xmlsplitter extracts the project array and feeds it directly into findClassesInProject
- In BPEL, a ForEach construct is needed to iterate over the array Project[]
Workflow result analysis
Workflow provides a natural framework for data tracking and analysis
- In both Taverna and BPEL
Taverna offers native provenance support
- More precise linkage annotation between services' inputs and outputs
- Semantic support
- Not the focus of our project; see refs. [16], [17] for more details
Conclusion: Taverna offers lifecycle support
[Figure: “+ caGrid = ?” over the workflow lifecycle (Discovery, Composition, Execution, Analysis; Community: reuse, generate).]
- Scavenger: customized service discovery
- Feta: service annotation and discovery
- Scufl: compact modeling of data flow
- Built-in processors: Soaplab, BioMart, etc.
- Customized processors as plug-ins
- Implicit iteration: handles parallel execution
- Result persistence and visualization
- A community for sharing workflows
Provides a compact set of primitives that eases the modeling of data flows
Allows users to specify “what to do” instead of “how to do it”
Conclusion: BPEL offers unique features
Build-time
- A comprehensive set of primitives to model processes of all flavors
  - Control-flow oriented
  - Data-flow oriented (although a little verbose)
  - Event driven, etc.
- Full featured
  - Process logic, data manipulation, event and message processing, fault handling, etc.
Run-time
- BPEL engines typically run inside application servers with
  - Persistent state storage
  - Reliability and scalability guarantees
- Important for long-running and computation-intensive workflows
- For now, the Taverna engine does not provide these capabilities
Conclusion
Factors in deciding which language/tool to choose
- User IT expertise
  - Some prefer a scripting language, others a friendly GUI
- Problem size
  - Taverna often runs on a desktop and handles problems of moderate size (currently common in bioinformatics)
  - Grid/server-based systems like Swift can deal with huge volumes of data and intensive computation (for example, applications in medical informatics, neuroscience, physics)
- Applications involved
  - Web services, batch jobs, shell scripts, etc.
Future work
- Enrich the caGrid workflow tool set based on Taverna
- Build more real workflows to help scientific investigation
- Address issues of scale as they arise
Thank you for your attention