eScience Workshop on Scientific Workflows
Matthew B. Jones
National Center for Ecological Analysis and Synthesis
University of California Santa Barbara
Outline
- What is SEEK?
- What is a scientific workflow system?
- Kepler as an example system
- Interoperability among workflow systems
- Models of computation
- Incorporating space, time, and other constraints
- Languages for representing scientific workflows
- Distributed computation and the Grid
- Challenges from existing scientific codes
- Data and model integration and semantics
- Discussion sessions for the day
What is SEEK?
- Science Environment for Ecological Knowledge (SEEK)
- Multidisciplinary research project to create:
  - Distributed data network (EcoGrid) for environmental, ecological, and systematics data
  - Scalable systems for scientific analysis (workflow systems)
  - Systems for semi-automated data and model integration
- Collaborators: NCEAS, UNM, SDSC, U Kansas, Vermont, Napier, ASU, UNC
What is a scientific workflow?
- Scientists conduct analyses in varied systems
- They mentally coordinate the export and import of data across these systems
- This is a flow of data, analogous to business workflows
- Strong parallels with scripting and visual programming
- Scientific workflows formalize this process so that analytical procedures can be designed, executed, and communicated efficiently
- Systems: Kepler/Ptolemy II, DiscoveryNet, Pipeline Pilot, Taverna, Triana, Chimera, Pegasus, ...
A Trivial Workflow
- Modeled as a directed graph: query the Grid to find data, analyze it, archive output to the Grid
- Data ingestion/cleaning can be metadata driven
- Output generation includes creating appropriate metadata
- The analysis pipeline itself becomes metadata
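To make the directed-graph view concrete, here is a minimal, dependency-free Java sketch of a linear workflow in which each node consumes its predecessor's output; the stage names are illustrative, not part of any SEEK tool:

```java
import java.util.List;
import java.util.function.UnaryOperator;

/** A trivial linear workflow: each step transforms the data produced upstream. */
public class TrivialWorkflow {
    public static void main(String[] args) {
        // Hypothetical stages mirroring the slide: query, ingest/clean, analyze, archive.
        List<UnaryOperator<String>> steps = List.of(
            data -> data + " -> queried from Grid",
            data -> data + " -> ingested and cleaned (metadata driven)",
            data -> data + " -> analyzed",
            data -> data + " -> archived to Grid with metadata");

        String data = "raw input";
        for (UnaryOperator<String> step : steps) {
            data = step.apply(data);  // edges of the graph: output feeds the next node
        }
        System.out.println(data);
    }
}
```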
More realistic workflows
- Scientific workflows represent knowledge about the analytical and modeling process
GARP Invasive Species Model
[Workflow diagram: EcoGrid queries retrieve species presence and absence points (via DiGIR) and environmental layers (via SRB) for both the native range and the invasion area; layer integration and sampling produce training and test samples for a GARP rule-set calculation, which is validated to yield native-range and invasion-area prediction maps with model quality parameters.]
(Slide from D. Pennington)
Metadata driven data ingestion
- Key information needed to read and machine-process a data file is in the metadata:
  - Physical descriptors (CSV, Excel, RDBMS, etc.)
  - Logical Entity (table, image, etc.) and Attribute (column) descriptions
    - Name
    - Type (integer, float, string, etc.)
    - Codes (missing values, nulls, etc.)
  - Integrity constraints
  - Semantic descriptions (ontology-based type systems)
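As an illustration of the idea (not Kepler's actual EML machinery), a small Java sketch in which a hypothetical AttributeMeta record drives parsing: the metadata, not the code, decides each column's type and which code means "missing":

```java
import java.util.List;

/** Hypothetical column metadata: name, type, and the code used for missing values. */
record AttributeMeta(String name, String type, String missingCode) {}

public class MetadataDrivenReader {
    // Parse one CSV row, letting the metadata decide each cell's type.
    static Object[] parseRow(String line, List<AttributeMeta> attributes) {
        String[] cells = line.split(",");
        Object[] row = new Object[cells.length];
        for (int i = 0; i < cells.length; i++) {
            AttributeMeta meta = attributes.get(i);
            if (cells[i].equals(meta.missingCode())) {
                row[i] = null;                        // code declared in metadata, not hard-wired
            } else if (meta.type().equals("integer")) {
                row[i] = Integer.valueOf(cells[i]);
            } else if (meta.type().equals("float")) {
                row[i] = Double.valueOf(cells[i]);
            } else {
                row[i] = cells[i];                    // default: string
            }
        }
        return row;
    }

    public static void main(String[] args) {
        List<AttributeMeta> attrs = List.of(
            new AttributeMeta("site", "string", "NA"),
            new AttributeMeta("count", "integer", "-999"));
        Object[] row = parseRow("plot7,-999", attrs);
        System.out.println(row[0] + ", " + row[1]);   // plot7, null
    }
}
```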
Provenance of derived data
- Metadata needs to be revised following any data transformation
- Versioning metadata and data is important for reuse and repeatability
- The workflow describes the lineage of data processing
- Derived data sets can be stored in the Grid with provenance
- Open question: which workflow languages are most effective for archiving?
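A minimal sketch of what a provenance record for a derived data set might carry; the field names are illustrative rather than a SEEK or EML schema:

```java
import java.time.Instant;
import java.util.List;

/** Illustrative provenance record: enough to trace a derived data set back to its inputs. */
record Provenance(
        String derivedDatasetId,
        List<String> inputDatasetIds,   // lineage: which input versions were consumed
        String workflowId,              // the workflow (and version) that produced it
        Instant executedAt) {}

class ProvenanceDemo {
    public static void main(String[] args) {
        Provenance p = new Provenance(
                "richness-map-v2",
                List.of("species-obs-v5", "climate-layers-v1"),
                "garp-workflow-v3",
                Instant.now());
        System.out.println(p);
    }
}
```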
Kepler: scientific workflows
- Open, collaborative effort of SEEK, SciDAC/SDM, GEON, and the Ptolemy Project
  - Ecology, biodiversity, molecular biology, geology, engineering
- Kepler aims to extend the Ptolemy system with:
  - Domain-specific computational models
  - Web and grid service access
  - Data integration support
  - Semantic reasoning
- Kepler actors are written in Java but can wrap other applications (such as MATLAB, GRASS)
- Actors can call arbitrary Web (or Grid) Services
- Ptolemy already has a very large inventory of actors
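To give a flavor of writing an actor, here is a skeletal Ptolemy II-style actor in Java. The class and port APIs follow my reading of the Ptolemy II library, but treat this as an unverified sketch rather than canonical Kepler code:

```java
import ptolemy.actor.TypedAtomicActor;
import ptolemy.actor.TypedIOPort;
import ptolemy.data.DoubleToken;
import ptolemy.data.type.BaseType;
import ptolemy.kernel.CompositeEntity;
import ptolemy.kernel.util.IllegalActionException;
import ptolemy.kernel.util.NameDuplicationException;

/** A toy actor that doubles each value flowing through it. */
public class Doubler extends TypedAtomicActor {
    public TypedIOPort input;
    public TypedIOPort output;

    public Doubler(CompositeEntity container, String name)
            throws IllegalActionException, NameDuplicationException {
        super(container, name);
        input = new TypedIOPort(this, "input", true, false);    // input port
        output = new TypedIOPort(this, "output", false, true);  // output port
        input.setTypeEquals(BaseType.DOUBLE);
        output.setTypeEquals(BaseType.DOUBLE);
    }

    /** Called by the director each time the actor fires. */
    @Override
    public void fire() throws IllegalActionException {
        if (input.hasToken(0)) {
            double value = ((DoubleToken) input.get(0)).doubleValue();
            output.send(0, new DoubleToken(2.0 * value));
        }
    }
}
```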
Kepler understands EML data*
* EML = Ecological Metadata Language; support is only partially implemented
Kepler: database access
Kepler: web services access
Kepler: grid services access
Kepler: ecological modeling
Models of Computation
- How data flows among workflow nodes is typically not explicitly represented
- Scientific models have specific data flow requirements
  - E.g., simulations sometimes use discrete and sometimes continuous time
- Ptolemy introduced specific “Directors” that explicitly control data flow
  - Process Networks, Discrete Event, Continuous Time, Synchronous Data Flow
  - Spatial/Temporal/Taxonomic domains
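In Ptolemy II, the model of computation is selected by placing a Director in the model; the sketch below builds a tiny model programmatically and runs it under Synchronous Data Flow. The class names are real Ptolemy II classes, but the snippet is a sketch and should be checked against the release in use:

```java
import ptolemy.actor.Manager;
import ptolemy.actor.TypedCompositeActor;
import ptolemy.actor.lib.Ramp;
import ptolemy.actor.lib.gui.Display;
import ptolemy.domains.sdf.kernel.SDFDirector;

public class SdfDemo {
    public static void main(String[] args) throws Exception {
        TypedCompositeActor top = new TypedCompositeActor();
        top.setName("demo");

        // The Director fixes the model of computation: here, Synchronous Data Flow.
        // Swapping in a different director changes the execution semantics
        // without touching the actors themselves.
        SDFDirector director = new SDFDirector(top, "director");
        director.iterations.setExpression("5");    // fire the model five times, then stop

        Ramp ramp = new Ramp(top, "ramp");          // produces 0, 1, 2, ...
        Display display = new Display(top, "display");
        top.connect(ramp.output, display.input);

        Manager manager = new Manager(top.workspace(), "manager");
        top.setManager(manager);
        manager.execute();                          // run under SDF scheduling
    }
}
```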
Workflow languages
- Modeling Markup Language (MoML)
- Discovery Process Markup Language (DPML)
- ...
- BPEL
- WS Invocation Framework (WSIF)
- WS Choreography
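For a sense of what MoML looks like, a hedged Java sketch that hands a tiny MoML description to Ptolemy's MoMLParser; the MoML elements shown are standard, though the exact parser entry point should be verified against the Ptolemy II version at hand:

```java
import ptolemy.kernel.util.NamedObj;
import ptolemy.moml.MoMLParser;

public class MomlDemo {
    public static void main(String[] args) throws Exception {
        // A minimal MoML model: a composite actor holding a director and one actor.
        String moml =
              "<entity name=\"demo\" class=\"ptolemy.actor.TypedCompositeActor\">"
            + "  <property name=\"director\""
            + "            class=\"ptolemy.domains.sdf.kernel.SDFDirector\"/>"
            + "  <entity name=\"ramp\" class=\"ptolemy.actor.lib.Ramp\"/>"
            + "</entity>";

        NamedObj top = new MoMLParser().parse(moml);
        System.out.println(top.exportMoML());  // round-trip the model back to MoML
    }
}
```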
Distributed Computation
- Traditional distributed systems: CORBA, DCOM, RMI
- Emerging distributed systems: Web services, Grid
- Existing scheduling systems
- Challenge of linking these together in integrated workflows
- Data movement can be limiting, so mobile code is attractive
  - Moving code among computational nodes is limiting
  - Security issues for mobile code
- Implicit models of computation hinder interoperability
  - Among workflow execution systems
  - Among existing scientific models
Existing scientific codes
- Many existing applications in science
  - Codes in analytical environments (SAS, Matlab, ArcGIS, R, ...)
  - Custom models and simulations (C, C++, FORTRAN, ...)
  - Network-accessible services (e.g., Web and Grid services)
- All use different models of computation
- Granularity of implementation is always an issue for use in modular workflows
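A common tactic for the granularity problem is to wrap a whole legacy code as one coarse-grained workflow step. A generic JDK-only Java sketch follows; the Rscript invocation and analysis.R script are illustrative assumptions, not an existing wrapper:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Wraps an external analysis code as a coarse-grained workflow step. */
public class ExternalCodeStep {
    public static String run(Path inputCsv) throws IOException, InterruptedException {
        Path outputCsv = Files.createTempFile("result", ".csv");

        // Illustrative: invoke an R script the way a wrapper actor might
        // (assumes R's Rscript is on the PATH).
        Process process = new ProcessBuilder(
                "Rscript", "analysis.R", inputCsv.toString(), outputCsv.toString())
                .inheritIO()
                .start();

        if (process.waitFor() != 0) {
            throw new IOException("external code failed");
        }
        return Files.readString(outputCsv);  // hand the result to the next step
    }
}
```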
Data and Model Integration
- Complex workflows utilize a variety of data
  - E.g., in ecology: species distribution, climate, hydrology, molecular genetics, physiology
- Challenges:
  - Easily bind heterogeneous data to workflows
  - Locate type-compatible workflow components
  - Create semantically correct metadata for derived products of workflows
Homogeneous data integration
- Integration of homogeneous or mostly homogeneous data via EML metadata is relatively straightforward
Heterogeneous data integration
- Requires advanced metadata and processing
  - Attributes must be semantically typed
  - Collection protocols must be known
  - Units and measurement scale must be known
  - Measurement relationships must be known (e.g., that ArealDensity = Count / Area)
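A toy worked example of applying the ArealDensity = Count / Area relationship so that two data sets reporting the same quantity differently become comparable; all names are hypothetical:

```java
/** Toy reconciliation: one data set reports areal density directly,
 *  the other reports raw counts per plot; the measurement relationship
 *  ArealDensity = Count / Area lets us compare them. */
public class MeasurementRelationship {
    static double arealDensity(double count, double areaSquareMeters) {
        return count / areaSquareMeters;  // individuals per square meter
    }

    public static void main(String[] args) {
        double siteA = 2.5;                     // already individuals / m^2
        double siteB = arealDensity(50, 25.0);  // 50 individuals on a 25 m^2 plot -> 2.0 / m^2
        System.out.printf("A=%.1f, B=%.1f per m^2%n", siteA, siteB);
    }
}
```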
Label data with semantic types
- Label inputs and outputs of analytical components with semantic types
- Use reasoning engines to generate transformation steps
  - Beware analytical constraints
- Use reasoning engines to discover relevant components
[Diagram: semantic mediation connects Data, Ontology, and Workflow Components]
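A minimal sketch of type-compatible component discovery; purely illustrative, since a real system would consult an ontology reasoner rather than a hand-written parent table:

```java
import java.util.Map;
import java.util.Set;

/** Illustrative semantic matching: a component is relevant if its declared
 *  input type subsumes (is, or generalizes) the data's semantic type. */
public class SemanticMatch {
    // Hand-written stand-in for an ontology: child type -> direct parents.
    static final Map<String, Set<String>> PARENTS = Map.of(
            "ArealDensity", Set.of("Density"),
            "Density", Set.of("Measurement"));

    static boolean subsumes(String general, String specific) {
        if (general.equals(specific)) return true;
        for (String parent : PARENTS.getOrDefault(specific, Set.of())) {
            if (subsumes(general, parent)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // A component declaring input "Measurement" accepts ArealDensity data.
        System.out.println(subsumes("Measurement", "ArealDensity"));  // true
        System.out.println(subsumes("ArealDensity", "Density"));      // false
    }
}
```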
Discussion sessions
- Challenges with making web services work together (compatibility, composition)
- Workflow language interoperability
- Workflow environment interoperability
- Distributed computation
- Models of computation
- Workshop findings
Discussion Points 1
- Workflows are not necessary in some contexts: pre-compute intermediate products that can then be accessed by database lookup, especially when computing that product is expensive
- Workflows are a way of documenting what has been done (provenance)
- Workflows can be seen as the scientist's conceptual model of what needs to be done; there is a need for more descriptive information in the process
  - Highlights a hot topic: combine the conceptual view with the executable workflow
  - Go from napkin diagram to formal conceptual workflow to executable workflow
  - Designing the workflow is as important an aspect as executing it, or more so
- Need to be able to get more information about the workflow than the WSDL provides
- Existing work has been done on getting people involved in the documentation of processes: see Soft Systems Methodology by Peter Checkland
- Documentation contributes to reproducibility of results because of the exact record a workflow creates
- Annotation of usage history for workflows gives new users an idea of the quality, appropriateness, and reliability of the workflow for their own usage
- Useful to be able to print the workflow in a publication, perhaps as part of the methods section, or at least cite it
Discussion Points 2
- Distributed computing with workflows is a good idea, but the human cost of coordinating the system is still too high to be practical
  - Still, we need to make progress through projects that focus on infrastructure
  - Process flows could also demonstrate the benefits of infrastructure development to the domain scientists
- The last mile of usability is often missed by pure infrastructure efforts; domain investment is needed to make it seamless
  - Build collaboration into the proposals, but what is the real research reward in that for the domain scientists?
- WebServices++: includes “agreement” on how to pass data by reference (e.g., by LSID)
  - This also needs to be a long-term solution, which is harder to achieve, yet we can't really wait for the WS-* standards before trying to make progress
Discussion Points 3
- Models of computation
  - There's an important point in them, but it has as much to do with how you separate different scientific problems, i.e., does ecology have needs, implicit in the discipline, that differ from bioinformatics?
  - Need much clearer ways of communicating about these models; the need for different models may never arise
  - Partly driven by how you scope the domain of usefulness for a tool; for example, if you're handling just web services you'll never need a continuous-time model
  - The user probably shouldn't have to select the model of computation, especially for workflows that can only use one model
- How should an end user choose a workflow system?
  - We don't really have a good comparison of the various workflow systems out there
  - Track time to create workflows to get an estimate of effort
Discussion Points 4
- Workflow languages
  - It doesn't matter too much that they don't interoperate, because there are so few workflows
  - People aren't used to digitizing these methodologies, so it's not considered an issue
- Two separate languages: one for describing the actors and one for the workflow
  - You can describe the workflow without understanding what each component does
  - Need another language to describe the semantics of individual components (e.g., OWL-S, Web Service Modeling Ontology (WSMO))
  - Our current efforts focus on describing the semantics of data flow, not processing
- The simplest description of a component is its name; this can be extended over time with better and better approximations of a formal specification
  - Inputs and outputs alone don't cut it
  - A mathematical description alone doesn't cut it
  - Really need concepts that constrain how the statistical approach is used
- Mathematically simple models are rare in ecology; complex, arbitrary designs are common and extremely difficult to design
  - Until we learn how to represent models declaratively, we'll never fully understand these complex models
Acknowledgements
This material is based upon work supported by the National Science Foundation under awards for SEEK and (AWSFL008-DS3) for GEON, by the Department of Energy under Contract No. DE-FC02-01ER25486 for SciDAC/SDM, and by DARPA under Contract No. F C-1703 for Ptolemy. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).
Additional support: the National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number ), the University of California, and the UC Santa Barbara campus; and the Andrew W. Mellon Foundation.
PBI Collaborators: NCEAS, University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research)
Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON