Requirements for caBIG Infrastructure to Support Semantic Workflows
Yolanda Gil, PhD
Information Sciences Institute and Department of Computer Science
University of Southern California
January 10, 2010

Outline
– Brief background on semantic workflows
– Semantic workflow representations in Wings
– Five uses of semantic workflows to assist users and their resulting requirements
    – Reproducibility
    – Validation
    – Metadata generation
    – Data discovery
    – Workflow discovery
– Requirements for architecture components
    – Ontology repositories and services
    – Data/metadata catalogs and services
    – Component/service catalogs and services
    – Workflow catalogs and services

Benefits of Semantic Workflows [Gil JSP-09]
– Execution management:
    – Automation of workflow execution
    – Managing distributed computation
    – Managing large data sets
    – Security and access control
    – Provenance recording
    – Low-cost high fidelity reproducibility
– Semantics and reasoning:
    – Workflow retrieval and discovery
    – Automation of workflow generation
    – Systematic exploration of design space
    – Validation of workflows
    – Automated generation of metadata
    – Guarantees of data pedigree
    – “Conceptual” reproducibility

Semantic Workflows in Wings
[Kim et al CCPEJ 08; Gil et al IEEE eScience 09; Gil et al K-CAP 09; Kim et al IUI 06; Gil et al IEEE IS 2010]
– Workflows augmented with semantic constraints
    – Each workflow constituent has a variable associated with it
        – Nodes, links, workflow components, datasets
        – Workflow variables can represent collections of data as well as classes of software components
    – Constraints are used to restrict variables, and include:
        – Metadata properties of datasets
        – Constraints across workflow variables
    – Incorporate the function of workflow components: how data is transformed
– Reasoning about semantic constraints in a workflow
    – Algorithms for semantic enrichment of workflow templates
    – Algorithms for matching queries against workflow catalogs
    – Algorithms for generating workflows from high-level user requests
    – Algorithms for generating metadata of new data products
    – Algorithms for assisting users with the creation of valid workflow templates

Semantic Workflows in WINGS
– Workflow templates
    – Dataflow diagram
    – Each constituent (node, link, component, dataset) has a corresponding variable
– Semantic properties
    – Constraints on workflow variables
        (TestData dcdom:isDiscrete false)
        (TrainingData dcdom:isDiscrete false)

Semantic Constraints as Metadata Properties
– Constraints on reusable template (shown below)
– Constraints on current user request (shown above)

[modelerInput_not_equal_to_classifierInput:
    (:modelerInput wflow:hasDataBinding ?ds1)
    (:classifierInput wflow:hasDataBinding ?ds2)
    equal(?ds1, ?ds2)
    (?t rdf:type wflow:WorkflowTemplate)
    -> (?t wflow:isInvalid "true"^^xsd:boolean)]

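The rule on this slide marks a template invalid when the modeler input and the classifier input are bound to the same dataset. A minimal Python sketch of that check follows. It is an illustration only, not how Wings evaluates rules: the function name and the dictionary-of-bindings representation are assumptions made for this example.

```python
from typing import Optional


def is_invalid_template(bindings: dict[str, Optional[str]]) -> bool:
    """Mirror the slide's rule: a template is invalid when modelerInput and
    classifierInput are both bound and point to the same dataset.

    `bindings` maps a workflow variable name to a dataset identifier
    (a hypothetical stand-in for wflow:hasDataBinding).
    """
    ds1 = bindings.get("modelerInput")
    ds2 = bindings.get("classifierInput")
    # The rule only fires when both bindings exist and are equal.
    return ds1 is not None and ds2 is not None and ds1 == ds2


if __name__ == "__main__":
    print(is_invalid_template({"modelerInput": "ds_train", "classifierInput": "ds_train"}))
    print(is_invalid_template({"modelerInput": "ds_train", "classifierInput": "ds_test"}))
```
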
Uses of Semantic Workflows: 1) Easily Replicate Previously Published Results
– A catalog of carefully crafted workflows of select state-of-the-art methods to cover a wide range of common analyses
    – Many implementations of the same algorithm, some proprietary
    – Same implementation but new versions and bug fixes
– With such a catalog, the effort involved in reproducing results is greatly reduced
– Semantics needed to assist users to use workflows correctly

Resulting Requirements (1)
– Semantic representations of workflows need to abstract from software implementation
    – Representing abstract classes of software components
        – Instances are the implemented codes
        – Workflow steps refer to component classes
    – Representing abstract kinds of data (e.g., exclude format)
– Semantic reasoning needed to specialize workflow
    – To map the abstract workflow into an execution-ready workflow
    – To insert lower level steps (e.g., data transformations)

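The specialization step above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Wings algorithm: the `Implementation` record, the `specialize` function, the format labels, and the rule of picking the first matching implementation are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Implementation:
    """A concrete code that is an instance of an abstract component class."""
    name: str
    component_class: str
    input_format: str
    output_format: str


def specialize(steps: list[str], catalog: list[Implementation], source_format: str) -> list[str]:
    """Map abstract steps (component class names) to an execution-ready plan.

    A conversion step is inserted whenever the format of the data at hand
    differs from the format the chosen implementation expects.
    """
    plan: list[str] = []
    current_format = source_format
    for component_class in steps:
        # Assumption: take the first implementation of the requested class.
        impl = next((i for i in catalog if i.component_class == component_class), None)
        if impl is None:
            raise LookupError(f"no implementation for {component_class}")
        if impl.input_format != current_format:
            plan.append(f"convert:{current_format}->{impl.input_format}")
        plan.append(impl.name)
        current_format = impl.output_format
    return plan


if __name__ == "__main__":
    catalog = [
        Implementation("normalize_v2", "Normalizer", "arff", "arff"),
        Implementation("tree_modeler_v1", "Modeler", "arff", "model"),
    ]
    print(specialize(["Normalizer", "Modeler"], catalog, source_format="csv"))
```

The abstract workflow names only component classes; the implementation names and the format conversion appear only in the specialized plan.
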
Uses of Semantic Workflows: 2) Ensure Correct Use of State-of-the-Art Methods
– Analytic software and methods are well documented, but only as text (papers, manuals, etc.)
    – Time consuming, hard to spot interdependencies, no validation
– Semantics needed to guide users to set up workflows correctly and customize them to their datasets and goals

Requirements (2)
– Semantic workflows can check constraints and guide users
    – Representing requirements of software components
        – Constraints on input data
        – Constraints on parameter settings given properties of input data
    – Representing metadata properties of datasets
– Semantic reasoning needed:
    – To check constraints of each workflow step
    – To propagate constraints across the workflow

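Checking the constraints of a single workflow step can be illustrated as a comparison between a component's stated requirements and the metadata of the dataset bound to its input. The sketch below is hypothetical: the `check_step` function and the flat property dictionaries are assumptions for illustration, and `isDiscrete` is borrowed from the constraint example earlier in the deck.

```python
def check_step(requirements: dict[str, object], metadata: dict[str, object]) -> list[str]:
    """Return a list of violated constraints for one workflow step.

    `requirements` holds the property values a component expects of its input;
    `metadata` holds the properties recorded for the bound dataset.
    A missing property counts as a violation.
    """
    violations = []
    for prop, required in requirements.items():
        actual = metadata.get(prop)
        if actual != required:
            violations.append(f"{prop}: required {required!r}, found {actual!r}")
    return violations


if __name__ == "__main__":
    modeler_requirements = {"isDiscrete": False}
    print(check_step(modeler_requirements, {"isDiscrete": False}))
    print(check_step(modeler_requirements, {"isDiscrete": True}))
```
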
Uses of Semantic Workflows: 3) Automatic Generation of Metadata
– Metadata annotations are tedious and involved
    – Often not done, an obstacle to sharing and to reuse
– Semantic workflows can automate the generation of metadata for analysis data products

Requirements (3)
– Semantic representations needed:
    – Representing expected characteristics of output dataset for each software component given the input metadata
    – Representing metadata properties of input datasets
– Semantic reasoning needed:
    – To propagate metadata for each workflow step
    – To propagate metadata across the workflow

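Metadata propagation can be pictured as applying one rule per component, each mapping input metadata to output metadata, and chaining those rules along the workflow. The following Python sketch assumes a linear workflow; the rule functions and the property names `isNormalized` and `hasMissingValues` are hypothetical examples, not part of any caBIG or Wings vocabulary.

```python
from typing import Callable

Metadata = dict[str, object]
Rule = Callable[[Metadata], Metadata]


def normalize_rule(meta: Metadata) -> Metadata:
    """Hypothetical rule: a normalizer keeps all properties and adds isNormalized."""
    out = dict(meta)
    out["isNormalized"] = True
    return out


def discretize_rule(meta: Metadata) -> Metadata:
    """Hypothetical rule: a discretizer produces discrete data."""
    out = dict(meta)
    out["isDiscrete"] = True
    return out


def propagate(input_meta: Metadata, rules: list[Rule]) -> list[Metadata]:
    """Apply each step's rule in order; return the metadata after every step.

    The first entry is the input metadata, the last describes the final product.
    """
    trace = [dict(input_meta)]
    for rule in rules:
        trace.append(rule(trace[-1]))
    return trace


if __name__ == "__main__":
    start = {"isDiscrete": False, "hasMissingValues": True}
    for stage in propagate(start, [normalize_rule, discretize_rule]):
        print(stage)
```

Each intermediate entry in the trace is the generated metadata for one intermediate data product, so no manual annotation is needed for those products.
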
Uses of Semantic Workflows: 4) Discovery of Relevant Data
– Need a dataset of updated common (known) loci to annotate findings, where can I find one?
– Workflows reused from a catalog may require additional data besides what is provided by the user
– Semantic workflows can help identify characteristics of required datasets and query data catalogs to find them for the user

Requirements (4)
– Semantic representations needed:
    – Metadata properties of any additional input datasets in the workflow, including:
        – Default properties for the given workflow
        – Augmented properties that result from the specific input data provided by the user
– Semantic reasoning needed:
    – Propagation of semantic constraints through the workflow
    – Formulation of queries to data catalogs based on semantic properties required of datasets in the workflow

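Formulating a data catalog query from the workflow's requirements can be sketched as merging the default properties with the augmented ones and filtering the catalog on the result. Everything in this Python example is an assumption made for illustration: the in-memory catalog, the function names, and the property names `type` and `build` with their values.

```python
Metadata = dict[str, object]


def build_query(defaults: Metadata, augmented: Metadata) -> Metadata:
    """Combine the workflow's default properties with those derived from the
    user's input data. Augmented properties override defaults on conflict."""
    query = dict(defaults)
    query.update(augmented)
    return query


def find_datasets(catalog: dict[str, Metadata], query: Metadata) -> list[str]:
    """Return the names of catalog entries whose metadata satisfies every
    property in the query, in sorted order."""
    return sorted(
        name
        for name, meta in catalog.items()
        if all(meta.get(prop) == value for prop, value in query.items())
    )


if __name__ == "__main__":
    # Hypothetical catalog entries for the known-loci example.
    catalog = {
        "loci_a": {"type": "KnownLoci", "build": "A"},
        "loci_b": {"type": "KnownLoci", "build": "B"},
        "expression_b": {"type": "Expression", "build": "B"},
    }
    query = build_query({"type": "KnownLoci"}, {"build": "B"})
    print(find_datasets(catalog, query))
```
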
Uses of Semantic Workflows: 5) Retrieval of Workflows
– Hard to find workflows for the type of analysis a user wants
    – Semantic information is not provided when creating the workflow
    – However, retrieval queries are often based on metadata properties of data
        – e.g., “Find workflows that can normalize data which is continuous and has missing values [<- constraints on inputs] to create a decision tree model [constraint on intermediate data products]”
– Semantic workflows needed to augment user-provided workflows with semantic constraints from metadata catalogs and component catalogs

Requirements (5)
– Semantic representations are needed:
    – For workflow constituents
        – Metadata properties of input, intermediate and final data products
        – Metadata properties of workflow and component function
    – For user queries
        – Express workflow sketches containing partial data descriptions (constraints)
– Reasoning capabilities
    – Automatic creation of metadata for expected workflow data products
    – Workflow matching to queries (exact and partial)

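Exact and partial matching of a query against a workflow catalog can be illustrated with a simple overlap score. The sketch below is a deliberately naive stand-in for semantic matching: constraints are plain (variable, property, value) triples, the score is the fraction of query constraints a workflow satisfies, and all names and catalog entries are hypothetical.

```python
Constraint = tuple[str, str, object]


def match_score(query: set[Constraint], workflow: set[Constraint]) -> float:
    """Fraction of query constraints found in the workflow (1.0 = exact match)."""
    if not query:
        return 1.0
    return len(query & workflow) / len(query)


def rank(query: set[Constraint], catalog: dict[str, set[Constraint]]) -> list[tuple[float, str]]:
    """Return (score, name) pairs for workflows matching at least one query
    constraint, best match first, ties broken by name."""
    scored = [(match_score(query, constraints), name) for name, constraints in catalog.items()]
    return sorted((pair for pair in scored if pair[0] > 0), key=lambda p: (-p[0], p[1]))


if __name__ == "__main__":
    query = {
        ("input", "isContinuous", True),
        ("input", "hasMissingValues", True),
        ("model", "type", "DecisionTree"),
    }
    catalog = {
        "wf_a": set(query),
        "wf_b": {("input", "isContinuous", True), ("model", "type", "DecisionTree")},
        "wf_c": {("input", "isContinuous", False)},
    }
    for score, name in rank(query, catalog):
        print(f"{name}: {score:.2f}")
```

A real matcher would also need subsumption reasoning over the ontology (for example, treating a more specific model type as satisfying a more general one), which this overlap score does not attempt.
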
Requirements on Core Ontology Repositories and Services
– Component/service ontologies
    – Extend with semantic representations that support reasoning, not just their execution
– Workflow ontologies
    – Develop workflow ontologies that enable shared workflow repositories
    – Develop semantic layer for the workflow ontologies
        – Workflow steps must be able to represent component classes
        – Support reasoning about workflows in all architecture components

Requirements on Data/Metadata Catalogs and Services
– Representing abstract kinds of data (e.g., exclude format)
– Representing metadata properties that are relevant to data analysis
    – E.g., the organization that contributed the data may be less relevant than the instrument used to collect it, its calibration, its quality and accuracy, etc.

Requirements on Component/Service Catalogs and Services
– Represent abstract classes of software components
    – Instances correspond to implemented codes/services
– Represent constraints on input data
    – Metadata properties that make the component appropriate for a given input dataset
– Represent constraints on output data
    – Metadata properties of expected input datasets given the required outcome of the execution of the component
– Represent constraints on parameter values
    – Constraints on parameter settings given properties of input or output data
– Represent how metadata properties of inputs are related to metadata of outputs
    – Metadata properties of output datasets given the properties of the input datasets

Requirements on Workflow Catalogs and Services
– Semantic reasoning to specialize workflows
    – Given user requirements and a high-level workflow, automatically generate valid execution-ready workflows
    – Automatically insert lower level steps when needed (e.g., data format conversions)
– Semantic reasoning to propagate constraints of each workflow step
    – Check constraints of each workflow step and propagate them throughout the workflow
    – Incorporate constraints coming from the user’s requirements with constraints from the individual steps of the workflow
– Formulation of data catalog queries based on the metadata properties of a given dataset in the workflow
– Workflow discovery and matching for a given user query
    – Need a language to express user queries as workflow sketches containing partial data descriptions (constraints) and partial dataflow patterns
    – Need semantic reasoning for matching such queries, both exact and partial matching