Semantic Mediation in SEEK/Kepler: Exploiting Semantic Annotation for Discovery, Analysis, and Integration of Scientific Data and Workflows

Bertram Ludäscher
Dept. of Computer Science, UC Davis
UC Davis Genome Center
ucdavis.edu

Shawn Bowers
UC Davis Genome Center
ucdavis.edu

seek.ecoinformatics.org | kepler-project.org | www.sdsc.edu | dbis.ucdavis.edu | genomics.ucdavis.edu

Semantic Mediation System, SEEK/Kepler

Science Environment for Ecological Knowledge
SEEK is an NSF-funded, multidisciplinary research project to facilitate …
Access to distributed ecological, environmental, and biodiversity data
  – Enable data sharing & reuse
  – Enhance data discovery at global scales
Scalable analysis and synthesis
  – Taxonomic, spatial, temporal, conceptual integration of data, addressing data heterogeneity issues
  – Enable communication and collaboration for analysis
  – Enable reuse of analytical components
  – Support scientific workflow design and modeling

SEEK data access, analysis, mediation
Data Access (EcoGrid)
  – Distributed data network for environmental, ecological, and systematics data
  – Interoperate diverse environmental data systems
Workflow Tools (Kepler)
  – Problem-solving environment for scientific data analysis and visualization (“scientific workflows”)
Semantic Mediation (SMS)
  – Leverage ontologies for “smart” data/component discovery and integration

Managing Data Heterogeneity
Data comes from heterogeneous sources
  – Real-world observations
  – Spatial-temporal contexts
  – Collection/measurement protocols and procedures
  – Many representations for the same information (count, area, density)
  – Data, syntax, schema, and semantic heterogeneity
Discovery and “synthesis” (integration) performed manually
  – Discovery often based on intuitive notion of “what is out there”
  – Synthesis of data is very time consuming, and limits use

Scientific workflow systems support data analysis
[Figure: the KEPLER workflow environment]

A simple Kepler workflow (T. McPhillips)
[Figure callouts]
  – Composite component (sub-workflow)
  – Loops are often used in SWFs, e.g., in genomics and bioinformatics (collections of data, nested data, statistical regressions, ...)

A simple Kepler workflow (T. McPhillips)
[Figure callouts, one per workflow component]
  – Lists Nexus files to process (project)
  – Reads text files
  – Parses Nexus format
  – PhylipPars infers trees from discrete, multi-state characters
  – Workflow runs PhylipPars iteratively to discover all of the most parsimonious trees
  – UniqueTrees discards redundant trees in each collection
  – Draws phylogenetic trees

A simple Kepler workflow
An example workflow run, executed as a Dataflow Process Network

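To make "executed as a Dataflow Process Network" concrete, here is a minimal sketch in Python, not Kepler's actual implementation: each actor runs in its own thread and talks to its neighbors only through FIFO queues. The names `source`, `transform`, `sink`, `run_pipeline`, and the `DONE` end-of-stream marker are assumptions made for this illustration.

```python
import queue
import threading

DONE = object()  # end-of-stream marker (an assumption of this sketch)

def source(items, out):
    """Actor with no inputs: emits each item as a token, then DONE."""
    for item in items:
        out.put(item)
    out.put(DONE)

def transform(fn, inp, out):
    """Actor that blocks on its input queue and emits fn(token) per token."""
    while True:
        token = inp.get()
        if token is DONE:
            out.put(DONE)
            return
        out.put(fn(token))

def sink(inp, results):
    """Actor with no outputs: collects tokens until DONE arrives."""
    while True:
        token = inp.get()
        if token is DONE:
            return
        results.append(token)

def run_pipeline(items, fn):
    """Wire source -> transform -> sink and run all actors concurrently."""
    q1, q2 = queue.Queue(), queue.Queue()
    results = []
    actors = [
        threading.Thread(target=source, args=(items, q1)),
        threading.Thread(target=transform, args=(fn, q1, q2)),
        threading.Thread(target=sink, args=(q2, results)),
    ]
    for actor in actors:
        actor.start()
    for actor in actors:
        actor.join()
    return results
```

Because every channel is a FIFO queue with a single writer, the tokens reach the sink in the order they were produced, regardless of how the threads are scheduled.
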
SMS motivation
Scientific Workflow Life-cycle
  – Resource Discovery
      discover relevant datasets
      discover relevant actors or workflow templates
  – Workflow Design and Configuration
      data-to-actor (data binding)
      data-to-data (data integration / merging / interlinking)
      actor-to-actor (actor / workflow composition)
Challenge: do all this in the presence of …
  – 100’s of workflows and templates
  – 1000’s of actors (e.g. actors for web services, data analytics, …)
  – 10,000’s of datasets
  – 1,000,000’s of data items
  – … highly complex, heterogeneous data
  – price to pay for these resources: $$$ (lots)
  – scientist’s time wasted: priceless!

Approach & SMS capabilities
[Overview diagram; labels:]
  – Ontologies
  – Semantic Annotation
  – Iterative Development
  – Resource Discovery
  – Workflow Validation
  – Resource Integration
  – Workflow Elaboration

Approach & SMS capabilities: Ontologies
SEEK KR group is developing OWL-DL ontologies:
  – Various workflow-component ontologies (for categorizing by function, project, scientific discipline, …)
  – Scientific observation ontology (OBOE), an upper ontology for defining and relating observations, measurements, and units
  – Domain-specific ontologies that extend OBOE (standard and derived units, ecology and biodiversity concepts, …)

Approach & SMS capabilities: Semantic Annotation
Annotations “connect” resources to ontologies
  – Conceptually describe a resource and/or its “data schema”
  – Annotations provide the means for ontology-based discovery, integration, …

“Hybrid” types … Semantic + Structural Typing
Structural types: given a structural type language S
  – Datasets, inputs, and outputs can be assigned structural types S
Semantic types: given an ontology language O (e.g., OWL-DL)
  – Datasets, inputs, and outputs can be assigned ontology types O
Example types:
  O : $\mathit{Observation} \sqcap \exists\, \mathit{obsProperty}.\mathit{SpeciesOccurrence}$
  S : SpeciesData(site, day, spp, occ)
Connecting actors A1 and A2: the output types (S_out, O_out) and the input types (S_in, O_in) can be semantically compatible but structurally incompatible
Semantic & structural types can be combined using logic constraints:
  $\forall\, site, day, sp, occ:\ \mathit{SpeciesData}(site, day, sp, occ) \rightarrow \exists\, y:\ \mathit{Observation}(y) \wedge \mathit{obsProp}(y, occ) \wedge \mathit{SpeciesOccurrence}(occ)$

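The distinction between semantic and structural compatibility can be sketched in a few lines of Python. This is an illustration only, not the SMS implementation: the toy class hierarchy in `SUPERCLASSES`, the `Port` record, and the rule that structural compatibility means identical field tuples are all assumptions made for this example. A real system would use a structural subtyping relation and an OWL-DL reasoner.

```python
from dataclasses import dataclass

# Hypothetical toy ontology: class name -> direct superclasses.
SUPERCLASSES = {
    "SpeciesOccurrence": {"Occurrence"},
    "Occurrence": set(),
}

def is_subclass(sub, sup, superclasses):
    """True if `sub` equals `sup` or reaches it by following superclass links."""
    seen, stack = set(), [sub]
    while stack:
        cls = stack.pop()
        if cls == sup:
            return True
        if cls in seen:
            continue
        seen.add(cls)
        stack.extend(superclasses.get(cls, ()))
    return False

@dataclass(frozen=True)
class Port:
    structural: tuple  # field names of the structural type, e.g. ("site", "day")
    semantic: str      # name of the ontology class annotating the port

def check_connection(out_port, in_port, superclasses):
    """Report semantic and structural compatibility of one connection."""
    return {
        # Semantic: the produced class must be subsumed by the expected class.
        "semantic": is_subclass(out_port.semantic, in_port.semantic, superclasses),
        # Structural (simplified assumption): both sides use the same fields.
        "structural": out_port.structural == in_port.structural,
    }

a1_out = Port(("site", "day", "spp", "occ"), "SpeciesOccurrence")
a2_in = Port(("spp", "occ"), "Occurrence")
result = check_connection(a1_out, a2_in, SUPERCLASSES)
```

Here `result` reports the connection as semantically compatible but structurally incompatible, which is the case where a structural adapter between the two actors would be needed.
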
Semantic Type Annotation in Kepler
Component input and output port annotation
  – Each port can be annotated with multiple classes from multiple ontologies
  – Annotations are stored within the component metadata

Component Annotation and Indexing
Component Annotations
  – New components can be annotated and indexed into the component library (e.g., specializing generic actors)
  – Existing components can also be revised, annotated, and indexed (hiding previous versions)

Approach & SMS capabilities: Resource Discovery
Ontology-based “smart” search
  – Find components by semantic types
  – Find components by input/output semantic types
  – Ontology-based query rewriting for discovery/integration
Joint work with GEON project (see SSDBM-04, SWDB-04)

Smart Search
Find a component (here: an actor) in different locations (“categories”)
… based on the semantic annotation of the component (or its ports)
[Figure callouts]
  – Browse for components
  – Search for component name
  – Search for category / keyword

Searching in context
Search for components with compatible input/output semantic types
  – … searches over actor library
  – … applies subsumption checking on port annotations

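A context-sensitive search of this kind can be sketched as follows. The library layout (actor name mapped to input ports, each port mapped to a list of annotation classes), the toy `PARENTS` hierarchy, and the actor names are hypothetical. The matching rule used here, that every class on an input port must subsume some class of the upstream output, is one plausible reading of "compatible" and not necessarily the rule Kepler applies.

```python
def ancestors(cls, parents):
    """Return `cls` together with every class reachable via parent links."""
    found, todo = set(), [cls]
    while todo:
        current = todo.pop()
        if current not in found:
            found.add(current)
            todo.extend(parents.get(current, ()))
    return found

def find_compatible_actors(output_classes, library, parents):
    """Names of actors with an annotated input port whose classes all
    subsume at least one class of the upstream output annotation."""
    reachable = set()
    for cls in output_classes:
        reachable |= ancestors(cls, parents)
    hits = []
    for name, input_ports in library.items():
        # Unannotated ports (empty lists) are skipped in this sketch.
        if any(req and set(req) <= reachable for req in input_ports.values()):
            hits.append(name)
    return sorted(hits)

# Hypothetical ontology fragment and actor library.
PARENTS = {
    "SpeciesOccurrence": ["Occurrence"],
    "Occurrence": [],
    "Biomass": ["Measurement"],
    "Measurement": [],
}
LIBRARY = {
    "OccurrencePlotter": {"in": ["Occurrence"]},
    "BiomassRegression": {"in": ["Biomass"]},
    "GenericViewer": {"in": []},
}
matches = find_compatible_actors(["SpeciesOccurrence"], LIBRARY, PARENTS)
```

With these inputs only `OccurrencePlotter` matches, because `Occurrence` subsumes `SpeciesOccurrence` while `Biomass` does not.
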
Approach & SMS capabilities: Workflow Validation
Workflow validation and analysis
  – Check that workflows are semantically & structurally well-typed
  – Infer semantic type annotations of derived data (i.e., type inference)
      An initial approach and prototype based on mapping composition (see QLQP-05)
  – User-oriented provenance
      Collect & query data lineage of WF runs (see IPAW-06)

Workflow validation in Kepler
Statically perform semantic and structural type checking
Navigate errors and warnings within the workflow
  – Search for and insert “adapters” to fix (structural and semantic) errors …

Approach & SMS capabilities: Resource Integration
Integrating and transforming data
  – Merge (“smart union”) datasets
  – Find mappings between data schemas for transformation
      data binding, component connections (see DILS-04)

Smart (Data) Integration: Merge
Discover data of interest
… connect to merge actor
… “compute merge”
  – align attributes via annotations
  – open dialog for user refinement
  – store merge mapping in MOML
… enjoy!
  – … your merged dataset
  – almost, can be much more complicated

Under the hood of “Smart Merge” …
Exploits semantic type annotations and ontology definitions to find mappings between sources
Executing the merge actor results in an integrated data product (via “outer union”)
[Figure: two source tables with columns a1, a2, a3, a4 and a5, a6, a7, a8 are fed to Merge; the merge aligns a1 with a8 and a3 with a6 and keeps a4, producing a merge result with columns a1, a3, a4; annotation labels shown: Biomass, Site]

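The alignment-plus-outer-union idea can be sketched in Python as below. This is a simplified illustration, not the Kepler merge actor: columns are aligned only when they carry the same annotation class (the real system also uses ontology definitions), unannotated columns are dropped, and the table contents, column names, and the `Temperature` class are hypothetical.

```python
def smart_merge(tables):
    """Outer union of annotated tables.

    `tables` is a list of (rows, annotations) pairs, where `rows` is a list
    of dicts and `annotations` maps a column name to a semantic class.
    Columns with the same class are aligned into one output column.
    """
    # One output column per semantic class, in first-seen order.
    concepts = []
    for _, annotations in tables:
        for concept in annotations.values():
            if concept not in concepts:
                concepts.append(concept)

    merged = []
    for rows, annotations in tables:
        for row in rows:
            out = dict.fromkeys(concepts)  # missing values stay None
            for column, value in row.items():
                if column in annotations:  # unannotated columns are dropped
                    out[annotations[column]] = value
            merged.append(out)
    return merged

# Hypothetical sources: "site" and "plot" are both annotated as Site.
t1 = ([{"site": "a", "biomass": 5}], {"site": "Site", "biomass": "Biomass"})
t2 = ([{"plot": "c", "temp": 0.2}], {"plot": "Site", "temp": "Temperature"})
merged = smart_merge([t1, t2])
```

Each source row appears once in `merged`, with `None` in the columns its source does not provide, which is what distinguishes an outer union from a join.
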
Approach & SMS capabilities: Workflow Elaboration
Workflow design support (automated SWF refinement)
  – (Semi-)automatically combine resource discovery, integration, and validation
  – Refine an abstract WF into an executable WF
  – … ongoing work!

Summary
Outlook:
  – Ontologies and semantic annotations for WF design & reuse
  – Put ontologies to actual use in Kepler
  – Continue to develop Kepler tools for annotation (KR observation ontology), discovery, integration, design, …
Issues & Challenges:
  – Tools/approaches for ontology (OWL) management, organization, reasoning
  – Open source (distributed) ontology (OWL) storage and reasoning
  – Tools and techniques for robust ontology versioning and extension
Acknowledgements
  – Timothy McPhillips, Dave Thau (UC Davis)
  – Mark Schildhauer, Josh Madin, Matt Jones (UCSB)
  – Deana Pennington (UNM)
  – Rich Williams (Microsoft Research)
  – Ferdinando Villa, Sergey Krivov (UVM)