Scientific Workflow Systems: Summary and Opportunities for SEEK and e-Science
Scientific workflow systems
- Workflows are a way of documenting what has been done (provenance)
- Workflows can be seen as a conceptual model of what needs to be done; the process needs more descriptive information
  - Combine the conceptual view with the executable workflow
  - Go from napkin diagram to formal conceptual workflow to executable workflow
  - Designing the workflow is as important as executing it
- Documentation contributes to reproducibility of results because of the exact record a workflow creates
- Annotation of usage history for workflows gives new users an idea of the quality, appropriateness, and reliability of the workflow for their own use
- Need to be able to get more information about the workflow than the WSDL provides
- Strong ties to semantic mediation, in terms of:
  - Integration, composition, discovery
  - User interface
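The provenance point can be made concrete with a minimal Python sketch. It is not how Kepler or any other system records provenance; the names `Workflow`, `ProvenanceRecord`, and `digest` are hypothetical and chosen only for illustration. Each step that runs appends a record of its name, its input, and a digest of its output, so a later run can be compared against the log.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


def digest(value: Any) -> str:
    """Short, stable fingerprint of a JSON-serializable value."""
    encoded = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


@dataclass
class ProvenanceRecord:
    step: str                 # name of the step that ran
    inputs: Dict[str, Any]    # what the step received
    output_digest: str        # fingerprint of what it produced


@dataclass
class Workflow:
    name: str
    steps: List[Tuple[str, Callable[[Any], Any]]] = field(default_factory=list)
    log: List[ProvenanceRecord] = field(default_factory=list)

    def add_step(self, name: str, func: Callable[[Any], Any]) -> None:
        self.steps.append((name, func))

    def run(self, value: Any) -> Any:
        # Execute steps in order; every execution leaves a provenance record.
        for name, func in self.steps:
            result = func(value)
            self.log.append(ProvenanceRecord(name, {"value": value}, digest(result)))
            value = result
        return value


# Hypothetical two-step analysis: drop missing counts, then average them.
wf = Workflow("mean_abundance")
wf.add_step("drop_missing", lambda xs: [x for x in xs if x is not None])
wf.add_step("mean", lambda xs: sum(xs) / len(xs))
result = wf.run([4, None, 6])
```

After the run, `wf.log` holds one record per executed step. Matching digests across two runs on the same input is a cheap check that a result was reproduced.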
Distributed computing
- Distributed computing with workflows
  - Good idea, but the human cost of coordinating the system is still too high to be practical when ad-hoc analytical services are considered
  - Gains may be made by leveraging existing systems like Condor and Pegasus
- Process flows could also demonstrate the benefits of infrastructure development to the domain scientists
Models of computation
- There’s an important point in them, but it has as much to do with how you separate different scientific problems, i.e., does ecology have needs different from those of bioinformatics that are implicit in the discipline?
- Need much clearer ways of communicating about these models; the need for different models may never arise
- Partly driven by how you scope the domain of usefulness for a tool; for example, if you’re handling just web services you’ll never need a continuous-time model
- User probably shouldn’t have to select the model of computation, especially for workflows that can only use one model
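One way to spare the user this choice is sketched below in Python, under stated assumptions. Each component declares the models of computation it can run under, and the tool picks a model automatically when only one is admissible for the whole workflow. The function names and the model labels (`dataflow`, `discrete_event`, `continuous_time`) are hypothetical and not taken from any particular system.

```python
from typing import Dict, FrozenSet, Optional


def admissible_models(components: Dict[str, FrozenSet[str]]) -> FrozenSet[str]:
    """Models of computation that every component in the workflow supports."""
    supported = list(components.values())
    if not supported:
        return frozenset()
    return frozenset.intersection(*supported)


def choose_model(components: Dict[str, FrozenSet[str]]) -> Optional[str]:
    """Return the model if exactly one is admissible; otherwise None,
    meaning the user (or some further heuristic) has to decide."""
    models = admissible_models(components)
    return next(iter(models)) if len(models) == 1 else None


# A workflow built only from web-service calls: a single model fits,
# so nothing needs to be asked of the user.
web_service_workflow = {
    "fetch_sequences": frozenset({"dataflow"}),
    "align": frozenset({"dataflow", "discrete_event"}),
}

# A workflow whose components all support two models: the choice stays open.
ambiguous_workflow = {
    "simulate": frozenset({"continuous_time", "discrete_event"}),
    "plot": frozenset({"continuous_time", "discrete_event", "dataflow"}),
}
```

With these inputs, `choose_model(web_service_workflow)` selects `dataflow` and `choose_model(ambiguous_workflow)` returns `None`.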
Workflow languages
- Two separate languages: one for designing the actors and one for designing the workflow
  - You can describe the workflow without understanding what each component does
  - Need another language to describe the semantics of individual components (e.g. OWL-S, Web Service Modeling Ontology (WSMO))
- Our current efforts focus on describing the semantics of data flow, not processing
- The simplest description of a component is its name; it can be extended over time with better and better approximations of a formal specification
  - Inputs and outputs alone don’t cut it
  - Mathematical description alone doesn’t cut it
  - Really need a concept that constrains how the statistical approach is used
- Mathematically simple models are rare in ecology; complex arbitrary designs are common and extremely difficult to describe
- Until we learn how to represent models declaratively, we’ll never fully understand these complex models
- Shared language: good idea, but all current languages incorporate references that can only be interpreted within one specific environment
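The idea of a component description that starts as a bare name and is refined over time can be sketched in Python as follows. This is an illustration only, not OWL-S, WSMO, or any existing workflow language. The class, the port names, and the concept labels (such as `eco:SpeciesAbundance`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ComponentDescription:
    name: str                                                      # the minimum: a name
    inputs: Dict[str, str] = field(default_factory=dict)           # port -> structural type
    outputs: Dict[str, str] = field(default_factory=dict)          # port -> structural type
    input_concepts: Dict[str, str] = field(default_factory=dict)   # port -> concept label
    output_concepts: Dict[str, str] = field(default_factory=dict)  # port -> concept label
    usage_constraint: Optional[str] = None                         # when the method is valid

    def layers_described(self) -> int:
        """Count how many optional layers of description have been filled in."""
        return sum([
            bool(self.inputs or self.outputs),
            bool(self.input_concepts or self.output_concepts),
            self.usage_constraint is not None,
        ])


def can_connect(src: ComponentDescription, out_port: str,
                dst: ComponentDescription, in_port: str) -> bool:
    """Check a data-flow link: structural types must match, and where both
    ports carry a concept label, the labels must match too."""
    if src.outputs.get(out_port) != dst.inputs.get(in_port):
        return False
    src_concept = src.output_concepts.get(out_port)
    dst_concept = dst.input_concepts.get(in_port)
    if src_concept is not None and dst_concept is not None:
        return src_concept == dst_concept
    return True


# Hypothetical components at different stages of description.
bare = ComponentDescription("ShannonIndex")
counts = ComponentDescription(
    "SiteCounts",
    outputs={"table": "list[float]"},
    output_concepts={"table": "eco:SpeciesAbundance"},
)
shannon = ComponentDescription(
    "ShannonIndex",
    inputs={"table": "list[float]"},
    outputs={"index": "float"},
    input_concepts={"table": "eco:SpeciesAbundance"},
    usage_constraint="assumes counts come from a single sampling event",
)
rainfall = ComponentDescription(
    "RainfallSeries",
    outputs={"table": "list[float]"},
    output_concepts={"table": "env:Precipitation"},
)
```

Here `rainfall` and `counts` have the same structural output type, but only `counts` can be wired into `shannon`. This is the sense in which inputs and outputs alone are not enough.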
Collaboration opportunities
- Shared workflow languages
  - Scufl/MoML/DPML/…
- Shared work on semantic annotation of workflow components
- Shared ontologies that cross domains
  - SEEK ontologies focus on ecology & environment
  - myGrid ontologies focus on molecular biology
- Shared case study: conservation genetics
  - Incorporates data from multiple disciplines
  - Incorporates workflows, mediation, grid issues all in one issue
- Ecoinformatics.org
Acknowledgements
This material is based upon work supported by the National Science Foundation under awards for SEEK and (AWSFL008-DS3) for GEON, by the Department of Energy under Contract No. DE-FC02-01ER25486 for SciDAC/SDM, and by DARPA under Contract No. F C-1703 for Ptolemy. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).
The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number ), the University of California, and the UC Santa Barbara campus.
The Andrew W. Mellon Foundation.
PBI Collaborators: NCEAS, University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research)
Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON