Semantic Extensions for Scientific Workflows on the Grid Bertram Ludäscher San Diego Supercomputer Center Associate Professor Dept.

Presentation transcript:

Semantic Extensions for Scientific Workflows on the Grid
Bertram Ludäscher
Associate Professor, Dept. of Computer Science & Genome Center, University of California, Davis
Fellow, San Diego Supercomputer Center, University of California, San Diego

ISGC’2005, April 25-29, 2005 SWDBAug 29,
Overview
Science Environment for Ecological Knowledge (SEEK)
Scientific Workflows
– What are they?
– Why do we need them?
The Kepler Scientific Workflow System
Adding Semantics to Scientific Workflows

Science Environment for Ecological Knowledge
Large collaborative NSF/ITR ( )
Bringing together ecologists, IT experts, CS researchers, …
SEEK.ecoinformatics.org

SEEK: Multidisciplinary research to facilitate …
Access to ecological, environmental, and biodiversity data
– Enable data sharing & re-use
– Enhance data discovery at global scales
Scalable analysis and synthesis
– Taxonomic, spatial, temporal, and conceptual integration of data, addressing data heterogeneity issues
– Enable communication and collaboration for analysis
– Enable re-use of analytical components

SEEK Main Components
Kepler
– Problem-solving environment for scientific data analysis and visualization → “scientific workflows”
EcoGrid*
– Distributed data network for environmental, ecological, and systematics data
– Making diverse environmental data systems interoperate
Semantic Mediation System
– “Smart” data discovery and integration
Knowledge Representation WG
Taxon WG
BEAM WG
Education, Outreach, Training
*Name clash: cf. the other Eco-Grid project!

Overview
Science Environment for Ecological Knowledge (SEEK)
Scientific Workflows
– What are they?
– Why do we need them?
The Kepler Scientific Workflow System
Adding Semantics to Scientific Workflows

Ecology Scientific Workflow: Invasive Species Prediction
Workflow steps: EcoGrid Query, Layer Integration, Sample Data, Data Calculation (+A1, +A2, +A3), Map Generation, Validation, User Validation, Generate Metadata, Archive to EcoGrid
Data sources: registered EcoGrid databases
Data products:
(a) Species presence & absence points (native range; invasion area)
(b) Environmental layers (native range; invasion area)
(c) Integrated layers (native range; invasion area)
(d) Training sample; test sample
(e) GARP rule set
(f) Prediction map (native range; invasion area)
(g) Model quality parameter
(h) Selected prediction maps
Source: NSF SEEK (Deana Pennington et al., UNM)

Scientific Workflows
Model the way scientists work with their data and tools
– Mentally coordinate export and import of data among software systems
Scientific workflows emphasize data flow (≠ business workflows)
Metadata (incl. provenance info, semantic types, etc.) is crucial for automated data ingestion, data analysis, …
Goals:
– SWF automation
– SWF & component reuse
– SWF design & documentation
– Making scientists’ data analysis and management easier!

Commercial & Open Source Scientific “Workflow” (Dataflow) Systems
Kensington Discovery Edition from InforSense
Taverna
Triana

Overview
Science Environment for Ecological Knowledge (SEEK)
Scientific Workflows
– What are they?
– Why do we need them?
The Kepler Scientific Workflow System
Adding Semantics to Scientific Workflows

Kepler Starting Point: UC Berkeley’s Ptolemy II
Large, polymorphic component (“Actors”) and Directors libraries (drag & drop)
“Directors” define the component interaction & execution semantics
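The split between actors (what is computed) and directors (how and when components fire) can be illustrated with a toy sketch. This is not Ptolemy II or Kepler code: the `Actor`, `SequentialDirector`, and `TracingDirector` classes and the linear firing scheme are hypothetical names and simplifications chosen only for illustration.

```python
from typing import Callable, List


class Actor:
    """A reusable component: it only knows how to turn one input token into one output token."""

    def __init__(self, name: str, fire: Callable[[object], object]):
        self.name = name
        self.fire = fire


class SequentialDirector:
    """Hypothetical director: fires the actors of a linear pipeline once, in order."""

    def run(self, actors: List[Actor], token: object) -> object:
        for actor in actors:
            token = actor.fire(token)
        return token


class TracingDirector(SequentialDirector):
    """Same actors, different execution semantics: every intermediate token is recorded."""

    def __init__(self):
        self.trace = []

    def run(self, actors: List[Actor], token: object) -> object:
        for actor in actors:
            token = actor.fire(token)
            self.trace.append((actor.name, token))
        return token


# The pipeline is defined once and can be handed to either director unchanged.
pipeline = [Actor("double", lambda x: x * 2), Actor("increment", lambda x: x + 1)]
result = SequentialDirector().run(pipeline, 5)
tracer = TracingDirector()
traced_result = tracer.run(pipeline, 5)
```

The point of the sketch is that the actors stay untouched when the director is swapped, which is the reuse argument made on the slide.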

Kepler Scientific Workflows
e.g. from Web Services → “Minute-made” (MM) WS-based application integration
Similarly: MM workflow design & sharing w/o implemented components

Job Management (here: with NIMROD)
Job management infrastructure in place
Results database: under development
Goal: 1000’s of GAMESS jobs (quantum mechanics)

Some Kepler Actor Additions

Ecological Niche Model in Kepler
(200 to 500 runs per species × 2000 mammal species × 3 minutes/run) = 833 to 2083 days

Grid-enabled Kepler
Utilize distributed computing resources
Execute single steps or sub-workflows on distributed machines
Initially, focus on “trivially parallel” workflows
Support collaboration through the formation of ad-hoc grids
Implementations
– Peer to peer using JXTA
– Traditional HPC-based batch job submission (e.g., NIMROD, Condor)
KeplerGrid for Niche Modeling
KeplerGrid for Biodiversity
(200 to 500 runs per species × 2000 mammal species × 3 minutes/run) / 100 nodes = 8 to 20 days
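The run-time estimates on the niche-modeling slides can be reproduced with a few lines of arithmetic. The sketch below assumes perfect speedup across nodes, which is only reasonable for the “trivially parallel” case the slide mentions; the function name `total_days` is made up for this example.

```python
MINUTES_PER_DAY = 24 * 60


def total_days(runs_per_species, species=2000, minutes_per_run=3, nodes=1):
    """Wall-clock days for all runs, assuming the work divides evenly over the nodes."""
    total_minutes = runs_per_species * species * minutes_per_run
    return total_minutes / MINUTES_PER_DAY / nodes


# Single machine: the "833 to 2083 days" figure.
serial = (total_days(200), total_days(500))

# 100 nodes: the "8 to 20 days" figure (the slide rounds down).
parallel = (total_days(200, nodes=100), total_days(500, nodes=100))

print([int(d) for d in serial], [int(d) for d in parallel])
```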


A GEON Data Analysis Workflow

Statistics Packages (here: R) in Kepler
Source: Dan Higgins, Kepler/SEEK

ORB

KEPLER: An OPEN SOURCE, cross-project collaboration
Ilkay Altintas (SDM, Resurgence, NLADR, …)
Kim Baldridge (Resurgence, NMI)
Chad Berkley (SEEK)
Shawn Bowers (SEEK)
Terence Critchlow (SDM)
Tobin Fricke (ROADNet)
Jeffrey Grethe (BIRN)
Christopher H. Brooks (Ptolemy II)
Zhengang Cheng (SDM)
Dan Higgins (SEEK)
Efrat Jaeger (GEON)
Matt Jones (SEEK)
Werner Krebs (EOL)
Edward A. Lee (Ptolemy II)
Kai Lin (GEON)
Bertram Ludaescher (SDM, SEEK, GEON, BIRN, ROADNet)
Mark Miller (EOL)
Steve Mock (NMI)
Steve Neuendorffer (Ptolemy II)
Jing Tao (SEEK)
Mladen Vouk (SDM)
Xiaowen Xin (SDM)
Yang Zhao (Ptolemy II)
Bing Zhu (SEEK)
Your Logos & Names HERE!!!

GEON Dataset Generation & Registration (a co-development in KEPLER)
Contributions shown in the workflow: Xiaowen (SDM); Edward et al. (Ptolemy); Yang (Ptolemy); Efrat (GEON); Ilkay (SDM); Matt, Chad, Dan et al. (SEEK)
SQL database access (JDBC)
% Makefile
$> ant run

Kepler today
Supports scientific workflows
– Ecology, molecular bio, geology, …
– Variety of analytical components (including spatial data transformations)
– Support for R scripts and Matlab scripts
– Real-time data access via Antelope ORB
EcoGrid access to heterogeneous data
– EML data support: experimental data, survey data, spatial raster and vector data, etc.
– DarwinCore data support: museum collections
– EcoGrid registry to discover data sources
Ontology-based browsing for analytical components
– Exploit semantics to improve the user experience
Demonstration workflows
– Ecology: Ecological Niche Modeling, Biodiversity Analysis, …
– Genomics: Promoter Identification Workflow
– Geology: Geologic Map Integration, Rock-type distribution analysis
– Oceanography: Real-time Revelle example of data access

Kepler soon (this year mostly …)
Usability engineering
– Full evaluation and user-oriented customization of all UI components
Distributed computing/grid computing
– Large jobs, lots of machines
– Detached execution
“Smart” data and component discovery
– Support annotating data sources
Component repository / downloadable components
Automated data and service integration and transformation using ontologies
Complete EcoGrid access
– Full EML support
– Support for “large” data and 3rd-party transfer
– More data sources and types of data sources (e.g., JDBC, GEON data)
Provenance and metadata propagation

Joint Ptolemy/Kepler Meeting (a plug on our own behalf ;-)

Overview
Science Environment for Ecological Knowledge (SEEK)
Scientific Workflows
– What are they?
– Why do we need them?
The Kepler Scientific Workflow System
Adding Semantics to Scientific Workflows

Kepler Actor-Library w/ Concept Index
How do you find the right component (actor)?
→ Ontology-based actor organization / browsing
→ Simple text-based and concept-based searching
Next: ontology-based workflow design
Figure labels: Workflow Components (MoML); Ontologies (OWL), default + other; Semantic Annotations (urn ids, instance expressions)
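Concept-based searching differs from plain text search in that a query for a general concept also returns actors filed under its sub-concepts. A minimal sketch of that idea follows; the concept hierarchy, the actor names, and the `find_actors` helper are all hypothetical and do not reflect Kepler's actual library or ontology.

```python
# Hypothetical is-a hierarchy: child concept -> parent concept.
SUBCLASS_OF = {
    "EcologicalNicheModel": "Model",
    "GARP": "EcologicalNicheModel",
    "Regression": "StatisticalAnalysis",
}

# Hypothetical semantic annotations: actor name -> concept it is filed under.
ACTOR_CONCEPT = {
    "GarpAlgorithm": "GARP",
    "NicheModelRunner": "EcologicalNicheModel",
    "LinearFit": "Regression",
}


def is_a(concept, ancestor):
    """True if `concept` equals `ancestor` or is reachable from it via is-a links."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = SUBCLASS_OF.get(concept)
    return False


def find_actors(concept):
    """All actors annotated with `concept` or any of its sub-concepts."""
    return sorted(a for a, c in ACTOR_CONCEPT.items() if is_a(c, concept))


model_actors = find_actors("Model")
```

Searching for the text "Model" would miss `GarpAlgorithm`; the concept lookup finds it because GARP is a kind of ecological niche model in this toy hierarchy.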

Ecological ontologies
What was measured (e.g., biomass)
Type of measurement (e.g., Energy)
Context of measurement (e.g., Psychotria limonensis)
How it was measured (e.g., dry weight)
SEEK intends to enable community-created ecological ontologies using OWL
– Represents a controlled vocabulary for ecological metadata

Ontology O (in Description Logic … cf. OWL-DL)

SEEK KR (Knowledge Representation) Working Group
Current Ontologies
– Ecological Concepts, Models, Networks
– Measurements
– Properties
– Statistical Analyses
– Time and Space
– Taxonomic Identifiers
– Units
– Symbiosis
Recent Developments
– Biodiversity (measured traits, computation of traits)
– Descriptive Terminology for Plant Communities
– Ontology documentation
Future Goals
– “Fill in” existing concepts, evolve the ontology framework
– More domains …

Need for Semantic Annotations of data & actors
Label data with semantic types (concept expressions from an ontology)
Label inputs and outputs of analytical components with semantic types
Example: data has COUNT and AREA; workflow wants DENSITY
→ via the ontology, the system “knows” that the data can still be used (because DENSITY := COUNT/AREA)
Use reasoning engines to generate transformation steps
Use reasoning engines to discover relevant components
Figure labels: Data, Ontology, Workflow Components
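The DENSITY example can be sketched as a lookup over derivation rules: if the requested semantic type is missing from the data but a rule derives it from types that are present, the value is computed instead of rejected. The rule table, the `provide` function, and the sample numbers are assumptions for illustration; a real system would obtain such rules from the ontology through a reasoner.

```python
# Derivation rules: derived semantic type -> (required types, formula).
# The single rule encodes DENSITY := COUNT / AREA from the slide.
DERIVATIONS = {
    "DENSITY": (("COUNT", "AREA"), lambda count, area: count / area),
}


def provide(required, record):
    """Return the value for `required`, deriving it from other fields if a rule allows."""
    if required in record:
        return record[required]
    if required in DERIVATIONS:
        inputs, formula = DERIVATIONS[required]
        if all(name in record for name in inputs):
            return formula(*(record[name] for name in inputs))
    raise KeyError(f"cannot provide {required} from {sorted(record)}")


# Hypothetical dataset row labelled with semantic types COUNT and AREA.
plot = {"COUNT": 120, "AREA": 40.0}
density = provide("DENSITY", plot)
```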

A Scientist’s “Semantic” View of Actors
Actors: S1 (life stage property), S2 (mortality rate for period); ports P1 to P5
Population samples for life stages of the common field grasshopper [Begon et al, 1996]:
Phase        Observed
Eggs         44,000
Instar I     3,513
Instar II    2,529
Instar III   1,922
Instar IV    1,461
Adults       1,300
Periods of development in terms of phases:
Period       Phases
Nymphal      {Instar I, Instar II, Instar III, Instar IV}
Port labels: observations; life stage periods; k-value for each period of observation [(nymphal, 0.44)]
Source: [Bowers-Ludaescher, DILS’04]

Structural Type (XML DTD) Annotations
Actors: S1 (life stage property), S2 (mortality rate for period); ports P1 to P5
structType(P2):
root population = (sample)*
elem sample = (meas, lsp)
elem meas = (cnt, acc)
elem cnt = xsd:integer
elem acc = xsd:double
elem lsp = xsd:string
Instance fragment: 44, Eggs …
structType(P3):
root cohortTable = (measurement)*
elem measurement = (phase, obs)
elem phase = xsd:string
elem obs = xsd:integer
Instance fragment: Eggs 44,000 …
Source: [Bowers-Ludaescher, DILS’04]

Semantic Type Annotations
Take concepts and relationships from an ontology to “semantically type” the data-in/out ports
Application: e.g., design support:
– smart/semi-automatic wiring, generation of “adaptor actors”
Actor (normalize), ports p_in and p_out
– Takes: Abundance Count Measurements for Life Stages
– Returns: Mortality Rate Derived Measurements for Life Stages
Source: [Bowers-Ludaescher, DILS’04]



A KR+DI+Scientific Workflow Problem
Services can be semantically compatible, but structurally incompatible
Source Service (port Ps) → desired connection → Target Service (port Pt)
Semantic Type Ps ⊑ Semantic Type Pt: compatible, with respect to the ontologies (OWL)
Structural Type Ps ⋠ Structural Type Pt: incompatible
Needed: a transformation of Ps whose result is structurally compatible (≺) with Pt
Source: [Bowers-Ludaescher, DILS’04]
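The two-level check on this slide can be sketched as follows: a connection is tested once against the semantic types and once against the structural types, and the interesting case is the one where only the first test passes. Everything here is a simplification for illustration: the concept names, the port descriptions, and the use of plain equality as the structural test are assumptions, not the subtyping relations of the cited framework.

```python
# Hypothetical concept hierarchy: concept -> parent concept.
SUBCONCEPT_OF = {"AbundanceCount": "Measurement", "MortalityRate": "Measurement"}


def subsumed_by(concept, other):
    """Semantic compatibility: `concept` is the same as, or a sub-concept of, `other`."""
    while concept is not None:
        if concept == other:
            return True
        concept = SUBCONCEPT_OF.get(concept)
    return False


def check_connection(source, target):
    """Classify a desired connection between an output port and an input port."""
    semantic_ok = subsumed_by(source["semantic"], target["semantic"])
    structural_ok = source["structure"] == target["structure"]  # simplified structural test
    if semantic_ok and structural_ok:
        return "connect directly"
    if semantic_ok:
        return "needs structural transformation"
    return "incompatible"


# Output port of the source service and input port of the target service (made-up descriptions).
source_port = {"semantic": "AbundanceCount", "structure": ("population", "sample")}
target_port = {"semantic": "Measurement", "structure": ("cohortTable", "measurement")}
verdict = check_connection(source_port, target_port)
```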

The Ontology-Driven Framework
Source Service (port Ps) → desired connection → Target Service (port Pt)
Semantic Type Ps ⊑ Semantic Type Pt: compatible, with respect to the ontologies (OWL)
Registration Mapping (Output): links Structural Type Ps to Semantic Type Ps
Registration Mapping (Input): links Structural Type Pt to Semantic Type Pt
Correspondence: between Structural Type Ps and Structural Type Pt

Correspondence Example
Source (P2):
/population/sample == semType(P2)
/population/sample/meas/cnt == semType(P2).itemMeasured
/population/sample/meas/cnt/text() == semType(P2).itemMeasured.hasCount
/population/sample/meas/acc == semType(P2).hasProperty
/population/sample/meas/acc/text() == semType(P2).hasProperty.hasValue
/population/sample/lsp/text() == semType(P2).hasContext.appliesTo
Target (P3):
/cohortTable/measurement == semType(P3)
/cohortTable/measurement/obs == semType(P3).itemMeasured
/cohortTable/measurement/obs/text() == semType(P3).itemMeasured.hasCount
/cohortTable/measurement/phase/text() == semType(P3).hasContext.appliesTo
Source structure: population → sample* → (meas → (cnt: xsd:integer, acc: xsd:double), lsp: xsd:string)
Target structure: cohortTable → measurement* → (phase: xsd:string, obs: xsd:integer)
We want to exploit the semantic information to obtain structural correspondences

Correspondence Example (continued)
These fragments correspond:
/population/sample == semType(P2)
/cohortTable/measurement == semType(P3)

Correspondence Example (continued)
These fragments correspond:
/population/sample/meas/cnt == semType(P2).itemMeasured
/cohortTable/measurement/obs == semType(P3).itemMeasured

Correspondence Example (continued)
These fragments correspond:
/population/sample/meas/cnt/text() == semType(P2).itemMeasured.hasCount
/cohortTable/measurement/obs/text() == semType(P3).itemMeasured.hasCount

Correspondence Example (continued)
These fragments correspond:
/population/sample/lsp/text() == semType(P2).hasContext.appliesTo
/cohortTable/measurement/phase/text() == semType(P3).hasContext.appliesTo
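The pairing shown across the correspondence slides can be reproduced mechanically: two schema paths correspond when they are registered to the same expression relative to their port's semantic type. The sketch below uses the registrations from the slides as data; matching expressions by string equality is a simplifying assumption made here (an ontology-aware comparison would be needed in general), and the `correspondences` helper is a name invented for this example.

```python
# Registration mappings from the slides: schema path -> expression relative to semType(port).
SOURCE_P2 = {
    "/population/sample": "",
    "/population/sample/meas/cnt": "itemMeasured",
    "/population/sample/meas/cnt/text()": "itemMeasured.hasCount",
    "/population/sample/meas/acc": "hasProperty",
    "/population/sample/meas/acc/text()": "hasProperty.hasValue",
    "/population/sample/lsp/text()": "hasContext.appliesTo",
}
TARGET_P3 = {
    "/cohortTable/measurement": "",
    "/cohortTable/measurement/obs": "itemMeasured",
    "/cohortTable/measurement/obs/text()": "itemMeasured.hasCount",
    "/cohortTable/measurement/phase/text()": "hasContext.appliesTo",
}


def correspondences(source, target):
    """Pair source and target paths that are registered to the same semantic expression."""
    target_by_expr = {expr: path for path, expr in target.items()}
    return [
        (s_path, target_by_expr[expr])
        for s_path, expr in source.items()
        if expr in target_by_expr
    ]


pairs = correspondences(SOURCE_P2, TARGET_P3)
for s_path, t_path in pairs:
    print(f"{s_path}  <->  {t_path}")
```

The two `acc` paths stay unmatched because the target schema registers nothing under `hasProperty`.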

Ontology-Guided Data Transformation
Source Service (port Ps) → desired connection → Target Service (port Pt)
Semantic Type Ps ⊑ Semantic Type Pt: compatible, with respect to the ontologies (OWL)
Structural/Semantic Association: links each port’s structural type to its semantic type
Correspondence: between Structural Type Ps and Structural Type Pt
Generate: a transformation of Ps, derived from the correspondence
Source: [Bowers-Ludaescher, DILS’04]
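Once correspondences between the two schemas are known, a structural transformation can be driven by them. The following sketch converts a `population` document into a `cohortTable` document with Python's standard `xml.etree.ElementTree`. The field map is written by hand from the correspondences in the example slides rather than generated, the `acc` values in the sample document are invented, and the counts are written without thousands separators; treat it as an illustration of the idea, not the framework's generated code.

```python
import xml.etree.ElementTree as ET

# Sample source document; cnt/lsp values follow the grasshopper table, acc values are made up.
SOURCE_XML = """
<population>
  <sample><meas><cnt>44000</cnt><acc>0.9</acc></meas><lsp>Eggs</lsp></sample>
  <sample><meas><cnt>3513</cnt><acc>0.9</acc></meas><lsp>Instar I</lsp></sample>
</population>
"""

# (target child element, path of the corresponding source element relative to <sample>)
FIELD_MAP = [("phase", "lsp"), ("obs", "meas/cnt")]


def transform(population):
    """Build a <cohortTable> with one <measurement> per source <sample>."""
    table = ET.Element("cohortTable")
    for sample in population.findall("sample"):
        measurement = ET.SubElement(table, "measurement")
        for target_tag, source_path in FIELD_MAP:
            ET.SubElement(measurement, target_tag).text = sample.findtext(source_path)
    return table


cohort_table = transform(ET.fromstring(SOURCE_XML.strip()))
rows = [(m.findtext("phase"), m.findtext("obs")) for m in cohort_table.findall("measurement")]
print(ET.tostring(cohort_table, encoding="unicode"))
```

The `acc` element is dropped because nothing in the target schema corresponds to it.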

Linking Structural and Semantic Types
α : S → O
S: schema elements / structural type
O: ontology / semantic type

Propagating Semantic Annotations
Given:
– structural schemas S (input) and S’ (output), and an ontology O
– a semantic annotation α : S → O
– a query annotation q : S → S’
Problem: compute α’

Applications
WF design time:
– Actor-to-actor connections
Data binding time:
– Actor-to-data connections (“data binding”)
WF runtime:
– “semantic tagging” of derived data products

Semantic Propagation
Infer annotations for derived products:
– When a (partial) specification of an actor is given (e.g., as a query q), exploit it to propagate semantic annotations from S to T
→ minimize costly semantic annotation
→ check for consistency
Figure labels: query q maps annotated source S to target T; source annotation; new target annotation
Traditional LAV query answering; Chase & Backchase, e.g., via MARS
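For the simplest kind of query, one that only projects and renames attributes, propagation amounts to carrying each source annotation over to the target attribute derived from it. The sketch below covers only that special case and does not implement LAV query answering or chase and backchase; the schema attributes, concept names, and the `propagate` helper are hypothetical.

```python
# Source annotation: source attribute -> ontology concept (hypothetical names).
source_annotation = {"biom": "Biomass", "seas": "Season", "site": "Location"}

# A projection/renaming query described as: target attribute -> source attribute it copies.
query = {"biomass": "biom", "season": "seas"}


def propagate(annotation, projection):
    """Derive the target annotation for a projection/renaming query."""
    return {
        target_attr: annotation[source_attr]
        for target_attr, source_attr in projection.items()
        if source_attr in annotation
    }


target_annotation = propagate(source_annotation, query)
```

The derived data product is tagged without anyone annotating it by hand, which is the cost saving the slide points to.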

Biodiversity Workflow w/ Query Annotations q

Annotation Constraint α : S → O
α = ∀x (φ_s(x) ∧ φ_c(x)) → ∃y φ_o(z), where z = x ∪ y
– φ_s links the variables x to schema elements of S
– φ_c is a conjunction of comparisons over x and constants
– φ_o “populates” the ontology structure O
Example: X : biom[seas=S], S = ‘w’ → X : observation[temporalContext = S : WinterSeason]
– φ_s(x): X : biom[seas=S]
– φ_c(x): S = ‘w’
– φ_o(z): X : observation[temporalContext = S : WinterSeason]
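Read operationally, the example constraint says: for every `biom` element whose `seas` value is ‘w’, assert an observation whose temporal context is a winter season. A small sketch of applying that one rule to rows of data follows; the row layout with an `id` field and the dictionary form of the asserted facts are assumptions made for illustration.

```python
def annotate(biom_rows):
    """Apply the slide's example constraint to a list of biom rows."""
    observations = []
    for row in biom_rows:                 # schema part: X ranges over biom elements, S = seas value
        if row["seas"] == "w":            # comparison part: S = 'w'
            observations.append({         # ontology part: X is an observation in a WinterSeason
                "instance": row["id"],
                "concept": "observation",
                "temporalContext": "WinterSeason",
            })
    return observations


# Hypothetical rows: one winter measurement, one from another season.
rows = [{"id": "b1", "seas": "w"}, {"id": "b2", "seas": "s"}]
facts = annotate(rows)
```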

Example (Biodiversity Workflow)

Scientific Workflow (SWF) Design Methodology
Support SWF design & reuse, via:
– Structural data types
– Semantic types
– Associations (= constraints) between them
– Type checking, inference, propagation
→ Separation of concerns:
– structure, semantics, WF orchestration, etc.

Figure labels: Data Integration; Knowledge Representation; Process Integration (Scientific Workflows); Data Federation; EcoGrid
Source: ECS-289 Scientific Data Management, WQ’05

Q & A

KEPLER: An Open Collaboration
Initiated by members from DOE SDM/SPA and NSF SEEK; now several other projects (GEON, Ptolemy II, EOL, Resurgence/NMI, …)
Open Source (BSD-style license)
Intensive communications:
– Web-archived mailing lists
– IRC (!)
– Meetings, hackathons
Co-development:
– via shared CVS repository
– joining as a new co-developer (currently):
  get a CVS account (read-only)
  local development + contribution via existing KEPLER member
  be voted “in” as a member/co-developer
Software & social engineering
– How to better accommodate new groups/communities?
– How to better accommodate different usage/contribution models (core dev … special purpose extender … user)?

GrOWL: Graphical Ontology Editing and Browsing …
Krivov and Villa (UVM)

Data Procurement using Semantics
“Find all datasets that contain abundance measurements of ‘Manica bradleyi’ inter-ant parasites observed within California”

Related Publications
Scientific Workflows
Scientific Workflow Management and the Kepler System, B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger-Frank, M. Jones, E. Lee, J. Tao, Y. Zhao, Concurrency and Computation: Practice & Experience, Special Issue on Scientific Workflows, to appear, 2005.
A Framework for the Design and Reuse of Grid Workflows, I. Altintas, A. Birnbaum, K. Baldridge, W. Sudholt, M. Miller, C. Amoreira, Y. Potier, and B. Ludäscher, Intl. Workshop on Scientific Applications on Grid Computing (SAG'04), LNCS 3458, Springer, 2005.
Kepler: An Extensible System for Design and Execution of Scientific Workflows, I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludäscher, S. Mock, 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), June 2004, Santorini Island, Greece.
Kepler: Towards a Grid-Enabled System for Scientific Workflows, I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludäscher, S. Mock, Workflow in Grid Systems (GGF10), Berlin, March 9th, 2004.
An Ontology-Driven Framework for Data Transformation in Scientific Workflows, S. Bowers and B. Ludäscher, Intl. Workshop on Data Integration in the Life Sciences (DILS'04), March 25-26, 2004, Leipzig, Germany, LNCS 2994.
A Web Service Composition and Deployment Framework for Scientific Workflows, I. Altintas, E. Jaeger, K. Lin, B. Ludäscher, A. Memon, 2nd Intl. Conference on Web Services (ICWS), San Diego, California, July 2004.

Related Publications
Semantic Data Registration and Integration
On Integrating Scientific Resources through Semantic Registration, S. Bowers, K. Lin, and B. Ludäscher, 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), June 2004, Santorini Island, Greece.
A System for Semantic Integration of Geologic Maps via Ontologies, K. Lin and B. Ludäscher, Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW), Sanibel Island, Florida, 2003.
Towards a Generic Framework for Semantic Registration of Scientific Data, S. Bowers and B. Ludäscher, Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW), Sanibel Island, Florida, 2003.
The Role of XML in Mediated Data Integration Systems with Examples from Geological (Map) Data Interoperability, B. Brodaric, B. Ludäscher, and K. Lin, Geological Society of America (GSA) Annual Meeting, volume 35(6), November 2003.
Semantic Mediation Services in Geologic Data Integration: A Case Study from the GEON Grid, K. Lin, B. Ludäscher, B. Brodaric, D. Seber, C. Baru, and K. A. Sinha, Geological Society of America (GSA) Annual Meeting, volume 35(6), November 2003.
Query Planning and Rewriting
Processing First-Order Queries under Limited Access Patterns, A. Nash and B. Ludäscher, Proc. 23rd ACM Symposium on Principles of Database Systems (PODS'04), Paris, France, June 2004.
Processing Unions of Conjunctive Queries with Negation under Limited Access Patterns, A. Nash and B. Ludäscher, 9th Intl. Conference on Extending Database Technology (EDBT'04), Heraklion, Crete, Greece, March 2004, LNCS 2992.
Web Service Composition Through Declarative Queries: The Case of Conjunctive Queries with Union and Negation, B. Ludäscher and A. Nash, research abstract (poster), 20th Intl. Conference on Data Engineering (ICDE'04), Boston, IEEE Computer Society, April 2004.