Tutorial #5: Scientific Data Integration and Mediation. San Diego Supercomputer Center, U.C. San Diego. Bertram Ludäscher, Ilkay Altintas, Amarnath Gupta, Kai Lin

Presentation transcript:

1 Tutorial #5: Scientific Data Integration and Mediation
San Diego Supercomputer Center, U.C. San Diego
Bertram Ludäscher, Ilkay Altintas, Amarnath Gupta, Kai Lin

2 Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure
Acknowledgements
– National Science Foundation (NSF)
– GEOsciences Network (NSF)
– Biomedical Informatics Research Network (NIH)
– Science Environment for Ecological Knowledge (NSF): seek.ecoinformatics.org
– Scientific Data Management Center (DOE): sdm.lbl.gov/sdmcenter/

3 Outline
8:30 – 10:30am: Tutorial: Data Integration & Mediation
– Introduction to database mediation: motivation and architecture; XML-based data integration
– Database mediation theory primer: logic view definitions, view unfolding, computing feasible plans
– From XML-based to knowledge-based mediation: use of ontologies in data integration, ...
10:30 – 10:45am: BREAK
10:45 – 12:00: Applications and Demos
– 10:45 – 11:05 Mediator Demo
– 11:05 – 11:20 Queries w/ Ontology Support
– 11:20 – 11:40 Scientific Workflows
– 11:40 – 12:00 KNOW-ME Ontology Tool

4 Information Integration Challenges
System aspects: “Grid” Middleware
– distributed data & computing
– Web Services, WSDL/SOAP, …
– sources = functions, files, databases, …
Syntax & Structure: XML-Based Mediators
– wrapping, restructuring
– XML queries and views
– sources = XML databases
Semantics: Model-Based/Semantic Mediators
– conceptual models and declarative views
– Semantic Web/Knowledge Grid stuff: ontologies, description logics (RDF(S), DAML+OIL, OWL, ...)
– sources = knowledge bases (DB+CMs+ICs)
Syntax, Structure, Semantics, System aspects:
– reconciling the S⁴ heterogeneities
– “gluing” together multiple data sources
– bridging information and knowledge gaps computationally

5 Information Integration from a DB Perspective
Information Integration Problem
– Given: data sources S_1, ..., S_k (DBMS, web sites, ...) and user questions Q_1, ..., Q_n that can be answered using the S_i
– Find: the answers to Q_1, ..., Q_n
The Database Perspective: source = “database”
– S_i has a schema (relational, XML, OO, ...)
– S_i can be queried
– define virtual (or materialized) integrated views V over S_1, ..., S_k using database query languages (SQL, XQuery, ...)
– questions become queries Q_i against V(S_1, ..., S_k)

6 Standard (XML-Based) Mediator Architecture
[Architecture diagram] The user/client poses a query Q(G(S_1, ..., S_k)) to the MEDIATOR and exchanges XML queries and results with it. The mediator holds the integrated global (XML) view G, given by the integrated view definition G(..) ← S_1(..) … S_k(..). Each source S_1, S_2, ..., S_k is exported through a wrapper as an (XML) view; wrappers are implemented as web services.

7 Some BIRNing Data Integration Questions (Biomedical Informatics Research Network)
Data Integration Approaches:
– Let’s just share data, e.g., link everything from a web page!
– ... or better, put everything into a relational or XML database
– ... and do remote access using the Grid
– ... or just use Web services!
Nice try. But:
– “Find the files where the amygdala was segmented.”
– “Which other structures were segmented in the same files?”
– “Did the volume of any of those structures differ much from normal?”
– “What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?”

An Online Shopper’s Information Integration Problem
El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logico-Philosophicus within a week?”
[Diagram] Information integration (addall.com) over amazon.com, A1books.com, half.com, barnes&noble.com, WWW, public library
“One-World” Mediation

A Home Buyer’s Information Integration Problem
What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population?
[Diagram] Information integration over: Realtor, Demographics, School Rankings, Crime Stats
“Multiple-Worlds” Mediation

A Geoscientist’s Information Integration Problem
What is the distribution and U/Pb zircon ages of A-type plutons in VA? How about their 3-D geometry? How does it relate to host rock structures?
[Diagram] Information integration over: Geologic Map (Virginia), GeoChemical, GeoPhysical (gravity contours), GeoChronologic (Concordia), Foliation Map (structure DB)
“Complex Multiple-Worlds” Mediation

A Neuroscientist’s Information Integration Problem (Biomedical Informatics Research Network)
What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
[Diagram] Information integration over: protein localization (NCMIR), neurotransmission (SENSELAB), sequence info (CaPROT), morphometry (SYNAPSE)
“Complex Multiple-Worlds” Mediation

12 Structural / XML-Based Mediation

13 Abstract XML-Based Mediator Architecture
[Architecture diagram] The user/client poses a query Q o V(S_1, ..., S_k) to the MEDIATOR and exchanges XML queries and results with it. The mediator holds the integrated XML view V, given by the integrated view definition IVD(S_1, ..., S_k). Each source S_1, S_2, ..., S_k is exported through a wrapper as an XML view.

14 Extensible Markup Language (XML)
(meta)language for marking up text & data with user-definable tags
– (X)HTML, XSLT, XML Schema, ...
– MathML, BioML, GeoML, NeuroML, ...
– XML-RPC, SOAP, ...
semistructured tree data model
– flexible: marked-up text, web pages, databases, ...
container model:
– “boxes within boxes”
Example text: “... in their wonderful book called SemWeb Tractat by B. Schatz and T.B. Lee, the authors show how ...”
As a tree: book with children title (“SemWeb Tractat”), author (“B. Schatz”), author (“T.B. Lee”)

15 Example: Relational Data => XML
Relation R:
A   B   C
a1  b1  c1
a2  b2  c2
a3  b3  c3
As XML:
<R>
  <tuple> <A>a1</A> <B>b1</B> <C>c1</C> </tuple>
  <tuple> <A>a2</A> <B>b2</B> <C>c2</C> </tuple>
  …
</R>
As a tree: R with three tuple children, each with children A, B, C holding (a1, b1, c1), (a2, b2, c2), (a3, b3, c3)
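The relational-to-XML wrapping on this slide can be sketched in a few lines of Python. This is only an illustration using the standard library's `xml.etree.ElementTree`; the function name `relation_to_xml` is made up here and is not part of any mediator prototype.

```python
import xml.etree.ElementTree as ET

def relation_to_xml(name, columns, rows):
    """Wrap a relation as an XML tree: one <tuple> per row, one child per column."""
    root = ET.Element(name)
    for row in rows:
        tup = ET.SubElement(root, "tuple")
        for col, val in zip(columns, row):
            ET.SubElement(tup, col).text = str(val)
    return root

# The relation R(A, B, C) from the slide.
R = relation_to_xml("R", ["A", "B", "C"],
                    [("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a3", "b3", "c3")])
xml_text = ET.tostring(R, encoding="unicode")
print(xml_text)
```

Column names become tag names, so this simple wrapper only works when the column names are valid XML element names.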

16 Tag Names & Nesting => XML DTDs (Grammars)
XML DTD, read as grammar rules:
bibliography → paper*
paper → authors fullPaper? title booktitle
authors → author+

17 XML DTDs vs. XML Schema
XML DTDs
– set of allowed tag names
– their nesting structure (via grammar rules)
XML Schema
– tag names and nesting structure
– user-defined complex data types
– subtyping (no multiple inheritance): RESTRICT and EXTEND
– separate “namespace” for type names and tag (= element) names
– ...

18 XML Schema: User-Defined Type/Class Hierarchy

19 XML Schema Declarations (“home-style” syntax): Complex Type Declarations

20 XML Schema (“home-style”) Complex Types: Simple Type Declarations

21 XML Schema: Substitution Groups
Elements of a substitution group (hexagons) and associated complex types (boxes)

22 XML Schema Declarations (W3C syntax)

23 XML Query Languages
XPath:
– root//books/book[cover_style="paperback"][price<80]
XQuery
– the W3C XML query language
XSLT
– XML transformations (XML=>HTML, XML=>XML)
...
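To make the XPath example concrete, here is a minimal Python sketch. It assumes a small made-up document (titles T1 to T3 are placeholders). The standard library's ElementTree supports only a subset of XPath: the `[cover_style='paperback']` predicate works, but the numeric comparison `[price<80]` does not, so that filter is applied in Python instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data shaped to match the XPath expression on the slide.
doc = ET.fromstring("""
<root><books>
  <book><title>T1</title><cover_style>paperback</cover_style><price>25</price></book>
  <book><title>T2</title><cover_style>hardcover</cover_style><price>60</price></book>
  <book><title>T3</title><cover_style>paperback</cover_style><price>95</price></book>
</books></root>
""")

# Structural part: descendant step plus a child-text predicate.
paperbacks = doc.findall(".//books/book[cover_style='paperback']")
# Value part: price < 80, evaluated outside the XPath subset.
cheap = [b for b in paperbacks if float(b.findtext("price")) < 80]
titles = [b.findtext("title") for b in cheap]
print(titles)
```

A full XPath engine would evaluate both predicates in a single expression; the split here is a limitation of the library used, not of XPath.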

24 Transforming and Rendering XML: XSLT

25 XMAS: XML Matching And Structuring language
Integrated View Definition: “Find books from amazon.com and DBLP, join on author, group by authors and title”
CONSTRUCT $a1 $t $p { $p } { $a1, $t }
WHERE $a1 : $t : IN "amazon.com"
AND $a2 : $p : IN " "
AND value( $a1 ) = value( $a2 )
XMAS => XMAS Algebra
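The intent of this integrated view (join on author, group by author and title) can be sketched in Python. This is not XMAS and not the mediator's implementation; the two source lists and all names in them are invented placeholders standing in for the wrapped amazon.com and DBLP sources.

```python
from collections import defaultdict

# Hypothetical wrapped sources: (author, title) and (author, publication) pairs.
amazon_books = [("A. Author", "Book One"), ("B. Writer", "Book Two")]
dblp_pubs = [("A. Author", "Paper X"), ("A. Author", "Paper Y"), ("C. Other", "Paper Z")]

def integrated_view(books, pubs):
    """Join books and publications on author; group publications by (author, title)."""
    grouped = defaultdict(list)
    for a1, title in books:
        for a2, pub in pubs:
            if a1 == a2:  # corresponds to value($a1) = value($a2)
                grouped[(a1, title)].append(pub)
    return dict(grouped)

view = integrated_view(amazon_books, dblp_pubs)
print(view)
```

The nested loop is a naive join; a mediator would normally push such joins to the sources or use an index.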

26 Database Mediation Theory Primer

27 Mediator Query Processing
[Pipeline diagram] Components: Query Q, Integrated View Definition V, Composition (Q o V), Translator, Rewriter/Optimizer, Plan Execution. Intermediate results: parsed plan, composed plan, optimized plan. Phases: compile-time, run-time.

28 Logic View Definitions (Global-as-View) or Querying and Reasoning with the Family...
Warm up: Who says this?
– “You are my son, but I’m not your father!”
The mother!

29 Logic View Definitions (Global-as-View)
Global-as-View (GAV)
– Integrated view V is defined in terms of the sources Src_1, ..., Src_k
Given the following source databases:
– Src_1 schema = { father(Father,Child), mother(Mother,Child) }
– Src_2 schema = { spouse(Spouse, Spouse) }
– Src_3 schema = { male(Person), female(Person) }
Can you define integrated views V for ... ?
– parent(Parent,Child), short: parent/2, i.e., table/relation name is ‘parent’, arity (#columns) is 2
– son/2, daughter/2
– brother/2, sister/2
– brother_in_law/2, sister_in_law/2
– aunt/2, uncle/2
– married/2, bachelor/2

30 Logic View Definitions (Global-as-View)
Source relations: father/2, mother/2, spouse/2, male/1, female/1
Notation: ∧ = “,” = conjunction (and); ∨ = “;” = disjunction (or); ¬ = “not” = negation
parent(C,P) ← father(C,P) ; mother(C,P).
son(P,S) ← parent(S,P), male(S).
brother(X,B) ← parent(X,P), son(P,B), X ≠ B.
brother_in_law(X,B) ← sister(X,Z), spouse(Z,B) ; spouse(X,Z), brother(Z,B).

31 Logic View Definitions (Global-as-View)
Source relations: father/2, mother/2, spouse/2, male/1, female/1
Notation: ∧ = “,” = conjunction (and); ∨ = “;” = disjunction (or); ¬ = “not” = negation
uncle(X,U) ← parent(X,Z), brother(Z,U) ; parent(X,Z), brother_in_law(Z,U).
aunt(X,A) ← parent(X,Z), sister(Z,A) ; parent(X,Z), sister_in_law(Z,A).
married(X) ← spouse(X,_).
bachelor(X) ← [person(X)], not married(X).
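The GAV view definitions can be tried out on a toy database. The Python sketch below evaluates the views bottom-up as set comprehensions. All facts are invented for illustration. Two assumptions are made: father/mother facts are stored as (Child, Parent) pairs, which is the argument order the rule bodies use, and daughter/2 and sister/2, which the slides do not spell out, are defined by analogy with son/2 and brother/2.

```python
# Source relations (toy data, invented for illustration).
father = {("tom", "bob"), ("ann", "bob"), ("bob", "carl"), ("dan", "carl")}
mother = {("tom", "eve"), ("ann", "eve")}
spouse = {("bob", "eve"), ("eve", "bob")}
male = {"tom", "bob", "carl", "dan"}
female = {"ann", "eve"}

# parent(C,P) <- father(C,P) ; mother(C,P).
parent = father | mother
# son(P,S) <- parent(S,P), male(S).
son = {(p, s) for (s, p) in parent if s in male}
# Assumed by analogy (not on the slides): daughter/2 and sister/2.
daughter = {(p, d) for (d, p) in parent if d in female}
# brother(X,B) <- parent(X,P), son(P,B), X != B.
brother = {(x, b) for (x, p) in parent for (q, b) in son if p == q and x != b}
sister = {(x, s) for (x, p) in parent for (q, s) in daughter if p == q and x != s}
# brother_in_law(X,B) <- sister(X,Z), spouse(Z,B) ; spouse(X,Z), brother(Z,B).
brother_in_law = (
    {(x, b) for (x, z) in sister for (y, b) in spouse if z == y}
    | {(x, b) for (x, z) in spouse for (y, b) in brother if z == y}
)
# uncle(X,U) <- parent(X,Z), brother(Z,U) ; parent(X,Z), brother_in_law(Z,U).
uncle = {(x, u) for (x, z) in parent
         for (y, u) in (brother | brother_in_law) if z == y}
# married(X) <- spouse(X,_).   bachelor(X) <- person(X), not married(X).
married = {x for (x, _) in spouse}
bachelor = (male | female) - married

print(sorted(uncle))
```

This materializes every view. A mediator would instead keep the views virtual and rewrite each query against the sources.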

32 Query Rewriting and Query Evaluation
Query Rewriting:
– Given a user query Q in terms of virtual views V ...
– Find an equivalent query Q’ in terms of the sources Src_1, ..., Src_k
Query Evaluation:
– Given a query Q’, evaluate Q’ over the source databases D := Src_1 ∪ ... ∪ Src_k
Examples:
– Q_uncle/2 = { (X,Y) | uncle(X,Y) holds in D }
– Q_tom’s_uncle/1 = { X | uncle(tom, X) holds in D }
– Q_whose_uncle_is_tom/1 = { X | uncle(X, tom) holds in D }

33 Query Rewriting (for GAV)
Query rewriting:
– Given a user query Q in terms of virtual views V ...
– Find an equivalent query Q’ in terms of the sources Src_1, ..., Src_k
Input: query Q, views V, source schemas S
View unfolding:
– starting with Q, repeatedly replace view predicates by their definitions
Creating a feasible plan:
– here: compute disjunctive normal form (DNF)
– DNF = disjunction of conjunctions (= “union of joins”)
– order goals within each conjunction according to the sources’ query capabilities

34 Example
?- plan(brother(X0,X1)).
brother(X0, X1) ==LQP==>
(father(X0, X2) v mother(X0, X2)) & (father(X1, X2) v mother(X1, X2)) & male(X1) & neq(X0, X1)
brother(X0, X1) ==NNF LQP==>
(father(X0, X2) v mother(X0, X2)) & (father(X1, X2) v mother(X1, X2)) & male(X1) & neq(X0, X1)

35 Example (Cont’d)
?- plan(brother(X0,X1)).
brother(X0, X1) ==DNF LQP==>
father(X0, X2) & father(X1, X2) & male(X1) & neq(X0, X1)
v mother(X0, X2) & father(X1, X2) & male(X1) & neq(X0, X1)
v father(X0, X2) & mother(X1, X2) & male(X1) & neq(X0, X1)
v mother(X0, X2) & mother(X1, X2) & male(X1) & neq(X0, X1)

36 Example (Cont’d)
?- plan(brother(X0,X1)).
brother(X0, X1) ==Bp ordered LQP==>
parentDb(father(X1, X2) & father(X0, X2)) & genderDb(male(X1)) & mediator(neq(X0, X1))
v parentDb(father(X1, X2) & mother(X0, X2)) & genderDb(male(X1)) & mediator(neq(X0, X1))
v parentDb(mother(X1, X2) & father(X0, X2)) & genderDb(male(X1)) & z_mediator(neq(X0, X1))
v parentDb(mother(X1, X2) & mother(X0, X2)) & genderDb(male(X1)) & z_mediator(neq(X0, X1))

37 Computing Feasible Plans (Goal Ordering)
A conjunctive query Q is an expression of the form
– q(X) ← p1(X1), ..., pn(Xn)
– order of subgoals p_i is irrelevant
An ordered plan P is an expression of the form
– q(X) ← [p1(X1), ..., pn(Xn)]
– order of subgoals p_i is important
Problem:
– given Q, compute P which is feasible, i.e., observes the limited query capabilities of sources
– Here: binding patterns, i.e., predicates’ arguments can be
  “b” – bound
  “f” – free
  “_” – bound or free

38 A Simple Algorithm for Ordering Goals
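The slide's algorithm is not preserved in this transcript. As a stand-in, here is a minimal greedy sketch in Python: repeatedly pick any subgoal that can be called with the variables bound so far, then mark its variables as bound. One simplifying assumption is made and should be kept in mind: only “b” positions are enforced, so “f” is treated like “_”. Under that assumption, binding more variables never makes a subgoal uncallable, so the greedy choice does not miss feasible orders. The function name and the access patterns in the example are hypothetical.

```python
def feasible_order(subgoals, patterns, bound=()):
    """Greedy goal ordering under binding patterns.

    subgoals: list of (pred, args) with args a tuple of variable names
    patterns: pred -> list of pattern strings over 'b', 'f', '_'
    bound:    variables bound before the plan starts
    Returns an ordered plan, or None if no feasible order exists.
    """
    bound = set(bound)
    remaining = list(subgoals)
    plan = []
    while remaining:
        for goal in remaining:
            pred, args = goal
            # Callable if some pattern has all of its 'b' positions bound.
            if any(all(p != "b" or a in bound for p, a in zip(pat, args))
                   for pat in patterns[pred]):
                plan.append(goal)
                bound.update(args)  # the call binds all of its variables
                remaining.remove(goal)
                break
        else:
            return None  # no remaining subgoal is callable
    return plan

# Hypothetical capabilities: father needs one argument bound; male can only be tested.
patterns = {"father": ["b_", "_b"], "male": ["b"]}
query = [("male", ("B",)), ("father", ("B", "P")), ("father", ("X", "P"))]
plan = feasible_order(query, patterns, bound={"X"})
print(plan)
```

With X bound, the plan first looks up X's father P, then P's children B, and only then tests male(B), which is the only order the assumed patterns allow.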

39 Query Containment
A query Q1 is contained in Q2, denoted Q1 ⊆ Q2
– if for all possible database instances, the set of answers to Q1 is contained in the set of answers to Q2.
Q1 and Q2 are called equivalent
– if Q1 ⊆ Q2 and Q2 ⊆ Q1.
Query containment is undecidable for many languages, e.g., for the relational calculus (SQL).
For conjunctive queries, the problem is NP-complete (and thus decidable)
– Since query sizes tend to be “small” (in particular, when compared to database sizes), query containment is still of use in practice (indeed, it is one of the most fundamental tools for logic-based query optimization).

40 Query Containment
Q1(Xs,Ys) is contained in Q2(Xs,Zs)
iff ALL Xs: (EXISTS Ys: Q1(Xs,Ys)) → (EXISTS Zs: Q2(Xs,Zs))
iff we can refute its negation
iff NOT ALL Xs: (EXISTS Ys: Q1(Xs,Ys)) → (EXISTS Zs: Q2(Xs,Zs)) |= []
iff EXISTS Xs: (EXISTS Ys: Q1(Xs,Ys)) AND NOT (EXISTS Zs: Q2(Xs,Zs)) |= []
iff canonical_db(Q1) AND ¬Q2(Xs,Zs) |= []
– create database from Q1, then run Q2 as a query ...
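The “create database from Q1, then run Q2 as a query” idea can be sketched in Python for conjunctive queries. This is an illustrative sketch, not the Prolog algorithm from the tutorial. It assumes that all arguments are variables, that both queries are safe (every head variable occurs in the body), and that predicates have consistent arity. The names `evaluate` and `contained_in` are made up.

```python
def evaluate(body, db, env=None):
    """Yield every variable binding that satisfies all atoms of body in db."""
    env = env or {}
    if not body:
        yield env
        return
    (pred, args), rest = body[0], body[1:]
    for fact in db.get(pred, ()):
        new = dict(env)
        # Bind each variable, or check consistency with an earlier binding.
        if all(new.setdefault(a, v) == v for a, v in zip(args, fact)):
            yield from evaluate(rest, db, new)

def contained_in(q1, q2):
    """True iff Q1 is contained in Q2. A query is (head_vars, [(pred, args), ...])."""
    head1, body1 = q1
    head2, body2 = q2
    # Canonical database of Q1: freeze its variables into constants.
    canonical = {}
    for pred, args in body1:
        canonical.setdefault(pred, set()).add(tuple(args))
    # Q1 is contained in Q2 iff Q2 returns Q1's frozen head on that database.
    return any(tuple(env[v] for v in head2) == tuple(head1)
               for env in evaluate(body2, canonical))

Q1 = (("X",), [("father", ("X", "P")), ("male", ("X",))])
Q2 = (("X",), [("father", ("X", "P"))])
Q3 = (("X",), [("father", ("X", "P")), ("father", ("X", "P2"))])

print(contained_in(Q1, Q2), contained_in(Q2, Q1))  # Q1 is the more selective query
print(contained_in(Q2, Q3), contained_in(Q3, Q2))  # Q3's second conjunct is redundant
```

The last line illustrates query minimization: Q2 and Q3 contain each other, so the extra conjunct of Q3 can be dropped.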

41 Query Containment Algorithm (in Prolog)
Applications:
– query minimization (a conjunctive query is minimal if no conjunct can be dropped)
– semantic query optimization: Q ⊆ denial
  here: a denial is an integrity constraint and states what must not hold
  example: denial = false ← mother(X,M), father(Y,M)

42 Example
50% of the clauses of the executable plan are irrelevant...

43 Mediator Demo
Computer Science Challenges:
– Given a query Q over a virtual integrated database V, how to come up with Q’ over the source schemas? (cf. Garlic, DiscoveryLink, ...)
  – query rewriting of Q(V) into Q’(SRCs) using unfolding and normalization
  – computation of feasible orders (NP-complete!?) while minimizing the number of “chunks” sent to sources
  – semantic query optimization (reasoning over plans!); e.g., conjunctive query containment is NP-complete [Chandra-Merlin-77]
A Quick Demo of the current prototype:
– Find 3D reconstructions of cells found in ‘cerebellar cortex’: ?- ccdbData('cerebellar cortex').
  – Join everything reachable along ‘cerebellar-cortex’.(has-a)* in UMLS ...
  – ... with concept markup in CCDB ...
  – ... retrieve (links to) results ...
  – ... also show on SmartAtlas tool
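The demo query combines a path query over the domain map, (has-a)*, with a join against annotated data. A minimal Python sketch of that pattern follows. The has-a edges and the annotated records are toy placeholders, not actual UMLS or CCDB content, and `reachable` and `data_for` are hypothetical helper names.

```python
from collections import deque

# Toy has-a edges standing in for a domain map.
has_a = {
    "cerebellar cortex": ["purkinje cell layer", "granule cell layer"],
    "purkinje cell layer": ["purkinje cell"],
}
# Toy records: (data item, concept it is marked up with).
annotations = [
    ("recon_1", "purkinje cell"),
    ("recon_2", "granule cell layer"),
    ("recon_3", "hippocampus"),
]

def reachable(start, edges):
    """All concepts reachable from start via zero or more has-a steps."""
    seen, queue = {start}, deque([start])
    while queue:
        concept = queue.popleft()
        for part in edges.get(concept, ()):
            if part not in seen:
                seen.add(part)
                queue.append(part)
    return seen

def data_for(concept, edges, annotated):
    """Join the has-a closure of a concept with the concept markup of the data."""
    concepts = reachable(concept, edges)
    return [item for item, c in annotated if c in concepts]

hits = data_for("cerebellar cortex", has_a, annotations)
print(hits)
```

The breadth-first search terminates on cyclic maps as well, because visited concepts are tracked in `seen`.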

44 Mediator Demo

45 From XML-Based to Logic and Model-Based (“Semantic”) Mediation

46 What’s the Problem with XML & Complex Multiple-Worlds?
XML is Syntax
– DTDs talk about element nesting
– XML Schema schemas give you data types
– need anything else? => write comments!
Domain Semantics is complex:
– implicit assumptions, hidden semantics
– sources seem unrelated to the non-expert
Need Structure and Semantics beyond XML trees!
– employ richer OO models
– make domain semantics and “glue knowledge” explicit
– use ontologies to fix terminology and conceptualization
– avoid ambiguities by using formal semantics

47 From XML-Based to Model-Based Mediation
Data and Knowledge Sharing Potential:
Database Mediation + Knowledge Representation = Model-Based Mediation
Basic Ideas:
– turn primary data sources into knowledge sources
– employ secondary glue knowledge sources
  – generic: UMLS, ...
  – specific: community/laboratory ontologies

48 Information Integration Landscape
[Diagram] Axes: conceptual distance (one-world to multiple-worlds) and conceptual complexity/depth (low to high). Regions: DB mediation techniques; ontologies and KR formalisms; model-based mediation. Entries placed on the map: addall, book-buyer, BLAST, EcoCyc, Cyc, WordNet, GO, home-buyer, 24x7 consumer, UMLS, MIA, Entrez, RiboWeb, Tambis, Bioinformatics, Geoinformatics.

49 Knowledge Representation: Relating Theory to the World via Formal Models
John F. Sowa, Knowledge Representation: Logical, Philosophical, and Computational Foundations
“All models are wrong, but some are useful!”

XML-Based vs. Model-Based Mediation
XML-based: raw data as XML elements; XML models; structural constraints (DTDs, e.g., A = (B*|C),D; B = ...), parent, child, sibling, ...; no domain constraints; Integrated-DTD := XML-QL(Src1-DTD, ...)
Model-based: raw data as (XML) objects; conceptual models with classes, relations, is-a, has-a, ... (C1, C2, C3, R); logical domain constraints (IF ... THEN ...); glue maps (DMs, PMs); Integrated-CM := CM-QL(Src1-CM, ...)
CM ~ {Descr. Logic, ER, UML, RDF/XML(-Schema), …}
CM-QL ~ {F-Logic, DAML+OIL, …}

51 What’s the Glue? What’s in a Link?
Syntactic Joins
– (X,Y) := X.SSN = Y.SSN (equality)
– (X,Y) := X.UMLS-ID = Y.UID
“Speciality” Joins
– (X,Y,Score) := BLAST(X,Y,Score) (similarity)
Semantic/Rule-Based Joins
– (X,Y,C) := X isa C, Y isa C, BLAST(X,Y,S), S>0.8 (homology, lub)
– (X,Y,[produces,B,increased_in]) := X produces B, B increased_in Y. (rule-based)
  e.g., X = [...]-secretase, B = beta amyloid, Y = Alzheimer’s disease
Challenge:
– compile semantic joins into efficient syntactic ones

52 Model-Based Mediation Methodology ...
Lift Sources to export CMs: CM(S) = OM(S) + KB(S) + CON(S)
Object Model OM(S):
– complex objects (frames), class hierarchy, OO constraints
Knowledge Base KB(S):
– explicit representation of (“hidden”) source semantics
– logic rules over OM(S)
Contextualization CON(S):
– situate OM(S) data using “glue maps” (GMs):
  – domain maps DMs (ontology) = terminological knowledge: concepts + roles
  – process maps PMs = “procedural knowledge”: states + transitions

53 ... Model-Based Mediation Methodology
Integrated View Definition (IVD)
– declarative (logic) rules with object-oriented features
– defined over CM(S), domain maps, process maps
– needs “mediation engineers” = domain + KRDB experts
Knowledge-Based Querying and Browsing (runtime):
– mediator composes the user query Q with the IVD
– ... rewrites (Q o IVD), sends subqueries to sources
– ... post-processes returned results (e.g., situate in context)

54 Model-Based Mediator Architecture
[Architecture diagram] The user/client exchanges CM queries and results (in XML) with the mediator, which exposes the CM (integrated view) defined by the Integrated View Definition (IVD). Mediator engine: FL rule processing, LP rule processing, graph processing, on the XSB engine. Sources S1, S2, S3 are lifted through (XML-wrapper) CM-wrappers to GCM CM S1, GCM CM S2, GCM CM S3, with CM(S) = OM(S) + KB(S) + CON(S). “Glue” maps (GMs), i.e., domain maps (DMs) and process maps (PMs), provide the semantic context CON(S).
First results & demos: KIND prototype, formal DM semantics, PMs [SSDBM00] [VLDB00] [ICDE01] [NIH-HB01] (w/ Gupta, Martone)

55 Domain Maps (Ontologies) as Glue Knowledge Sources
Domain Map = Ontology
– representation of terminological knowledge
Use in Model-Based Mediation
– (derived) concepts as “drop points”, “anchor points”, “context” for source classes
– compile-time use: view definition, subsumption, classification, ...
– runtime use: querying/deduction, path queries, ...
Formalisms:
– Semantic nets, Thesauri, Frame-logic, Description logics, ...

56 Ontologies
So what is an Ontology?
–definition of things that are relevant to your application
–representation of terminological knowledge (“TBox”)
–explicit specification of a conceptualization
–concept hierarchy (“is-a”)
–further semantic relationships between concepts
–abstractions of relational schemas, (E)ER, UML classes, XML Schemas
Examples:
–NCMIR ANATOM
–GO (Gene Ontology)
–UMLS (Unified Medical Language System)
–CYC

57 Formalism for Ontologies: Description Logic
DL definition of “Happy Father” (Example from Ian Horrocks, U Manchester, UK)
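The DL formula itself was an image and is not in the transcript. Reading the first-order rule form on the next slide back into DL notation suggests a definition along these lines. The concept and role names are taken from that rule; the exact notation used on the original slide is an assumption.

```latex
\mathit{HappyFather} \equiv \mathit{Man}
  \sqcap \exists \mathit{child}.\mathit{Blue}
  \sqcap \exists \mathit{child}.\mathit{Green}
  \sqcap \forall \mathit{child}.(\mathit{Rich} \sqcup \mathit{Happy})
```

The last conjunct corresponds to the negated subgoal in the rule: no child is both not rich and not happy.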

58 Description Logic Statements as Rules
In first-order logic (rule form):
  happyFather(X) ← man(X), child(X,C1), child(X,C2), blue(C1), green(C2),
                   not ( child(X,C3), poorunhappyChild(C3) ).
  poorunhappyChild(C) ← not rich(C), not happy(C).
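To see what the rule computes, it can be evaluated over a small database held in Python sets. This is a sketch with negation read as negation as failure; the individuals `al`, `bo`, `c1` to `c4` are made up for the example.

```python
def happy_fathers(man, child, blue, green, rich, happy):
    """Evaluate the happyFather rule over a database given as Python sets.

    child is a set of (parent, child) pairs; the other arguments are sets
    of individuals. Negation is evaluated as negation as failure.
    """
    def poor_unhappy(c):
        # poorunhappyChild(C) <- not rich(C), not happy(C).
        return c not in rich and c not in happy

    result = set()
    for x in man:
        kids = {c for (p, c) in child if p == x}
        has_blue = any(c in blue for c in kids)
        has_green = any(c in green for c in kids)
        no_poor_unhappy = not any(poor_unhappy(c) for c in kids)
        if has_blue and has_green and no_poor_unhappy:
            result.add(x)
    return result


# Hypothetical database: al's children are rich or happy, bo has a poor unhappy child.
man = {"al", "bo"}
child = {("al", "c1"), ("al", "c2"), ("bo", "c3"), ("bo", "c4")}
blue, green = {"c1", "c3"}, {"c2", "c4"}
rich, happy = {"c1", "c4"}, {"c2"}
```

With this data only `al` qualifies, because `c3` is neither rich nor happy.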

59 Description Logics
Terminological Knowledge (TBox)
–Concept Definition (naming of concepts):
–Axiom (constraining of concepts):
=> a mediator’s “glue knowledge source”
Assertional Knowledge (ABox)
–the marked neuron in image 27
=> the concrete instances/individuals of the concepts/classes that your sources export

60 Querying vs. Reasoning
Querying:
–given a DB instance I (= logic interpretation), evaluate a query expression (e.g. SQL, FO formula, Prolog program,...)
–boolean query: check if I |= φ (i.e., if I is a model of φ)
–(ternary) query: { (X, Y, Z) | I |= φ(X,Y,Z) }
=> check happyFathers in a given database
Reasoning:
–check if I |= φ implies I |= ψ for all databases I,
–i.e., if φ => ψ
–undecidable for FO, F-logic, etc.
–Description Logics are decidable fragments
  • concept subsumption, concept hierarchy, classification
  • semantic tableaux, resolution, specialized algorithms
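The difference between the two tasks shows up even in a propositional toy setting, which stands in here for a decidable fragment. This is a sketch under that assumption: formulas are Python functions over a truth assignment, the atoms `man` and `has_child` are illustrative, and real DL reasoners use tableau-style algorithms rather than enumeration.

```python
from itertools import product


def holds(formula, interp):
    """Querying: evaluate a formula in one given interpretation I."""
    return formula(interp)


def entails(phi, psi, atoms):
    """Reasoning: check that every interpretation satisfying phi also satisfies psi."""
    for values in product([False, True], repeat=len(atoms)):
        interp = dict(zip(atoms, values))
        if phi(interp) and not psi(interp):
            return False  # found a counter-model
    return True


atoms = ["man", "has_child"]
father = lambda i: i["man"] and i["has_child"]
parent = lambda i: i["has_child"]
```

`holds` looks at a single database, while `entails` quantifies over all of them, so `father` is subsumed by `parent` but not the other way round.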

61 What’s in an Answer? (What’s in a Link? revisited)
Semantic/Rule-Based Joins
–rule-based: (X,Y,[produces,B,increased_in]) := X produces B, B increased_in Y.
–e.g., X=β-secretase, B=beta amyloid, Y=Alzheimer’s disease
What is the Erdős number of person P?
–3
Really? Why?
–authority based: said so
–faith based: don’t know but firmly believe
–query statement Q =... derived it from DB I
–query Q =... derived it from DB I and KB T using derivation D
=> logic-based systems often “come with explanations” (“computations as proofs”)
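A rule-based join that carries its own explanation can be sketched as follows. The function name `semantic_join` and the tuple layout are assumptions for illustration; the facts mirror the slide's example.

```python
def semantic_join(produces, increased_in):
    """Join 'X produces B' with 'B increased_in Y', keeping the derivation path."""
    by_b = {}
    for b, y in increased_in:
        by_b.setdefault(b, []).append(y)

    links = []
    for x, b in produces:
        for y in by_b.get(b, []):
            # The third component records why X and Y are linked.
            links.append((x, y, ["produces", b, "increased_in"]))
    return links


produces = [("secretase", "beta amyloid")]
increased_in = [("beta amyloid", "Alzheimer's disease")]
links = semantic_join(produces, increased_in)
```

Each answer tuple contains the intermediate binding for B, which is the kind of "explanation" a purely syntactic join does not provide.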

62 Formalizing Glue Knowledge: Domain Map for SYNAPSE and NCMIR
Domain Map = labeled graph with concepts ("classes") and roles ("associations")
additional semantics: expressed as logic rules (F-logic)
Domain Expert Knowledge:
Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines. Dendritic spines are ion (calcium) regulating components. Spines have ion binding proteins. Neurotransmission involves ionic activity (release). Ion-binding proteins control ion activity (propagation) in a cell. Ion-regulating components of cells affect ionic activity (release).
[Figures: Domain Map (DM); DM in Description Logic]
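Part of the expert knowledge above can be written down as a labeled graph and queried with a path query. This is a sketch: the concept identifiers, the role names `has`, `contains`, `isa`, and the choice of which sentences to encode are assumptions, not the actual SYNAPSE/NCMIR domain map.

```python
# Labeled edges (concept, role, concept) encoding a few of the expert statements.
DOMAIN_MAP = [
    ("purkinje_cell", "has", "dendrite"),
    ("pyramidal_cell", "has", "dendrite"),
    ("dendrite", "has", "branch"),
    ("branch", "contains", "spine"),
    ("spine", "isa", "ion_regulating_component"),
    ("spine", "has", "ion_binding_protein"),
]


def reachable(start, roles, edges=DOMAIN_MAP):
    """Concepts reachable from start along edges labeled with one of the given roles."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for src, role, dst in edges:
            if src == node and role in roles and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen
```

For example, `reachable("purkinje_cell", {"has", "contains"})` follows the part-of style roles down to spines and their ion-binding proteins.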

63 Source Contextualization & DM Refinement
In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map...
=> sources can register new concepts at the mediator...

Example: ANATOM Domain Map

65 Browsing Registered Data with Domain Maps

66 Process Maps with Abstractions and Elaborations: From Terminological to Procedural Glue
–nodes ~ states
–edges ~ processes, transitions
–blue/red edges: processes in Src1/Src2
–general form of edges:
–related formalisms

67 Summary: Mediation Scenarios & Techniques
Federated Databases       XML-Based Mediation       Model-Based Mediation
One-World                 One-/Multiple-Worlds      Complex Multiple-Worlds
Common Schema             Mediated Schema           Common Glue Maps
SQL, rules                XML query languages       DOOD query languages
Schema Transformations    Syntax-Aware Mappings     Semantics-Aware Mappings
Syntactic Joins           Syntactic Joins           “Semantic” Joins via Glue Maps
DB expert                 DB expert                 KRDB + domain expert

68 Semantic (Community) Webs
“Within the next decade, computing technology will transform the Internet into the Interspace, an information infrastructure that supports semantics indexing and concept navigation across widely distributed community repositories.” Bruce Schatz, IEEE Computer, Jan
"The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." Tim Berners-Lee et al., 2001

69 Combine Everything: Die eierlegende Wollmilchsau (the “egg-laying wool-milk-sow”, i.e., one system that does it all):
Database Federation/Mediation
–query rewriting under GAV/LAV
–w/ binding pattern constraints
–distributed query processing
Semantic Mediation
–semantic integrity constraints, reasoning w/ plans, automated deduction
–deductive database/logic programming technology, AI “stuff”...
–Semantic Web technology
Scientific Workflow Management
–more procedural than database mediation (often the scientist is the query planner)
–deployment using web services

70 BREAK... followed by demos...


72 GEON SMART Metadata: Multihierarchical Rock Classification for “Thematic Queries” (GSC)
–concept hierarchies shown: Composition, Genesis, Fabric, Texture
–“smart discovery & querying” via multiple, independent concept hierarchies (controlled vocabularies)
–data at different description levels can be found and processed
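Querying over several independent hierarchies can be sketched as below. All terms and samples are placeholders invented for the example; they are not the GSC controlled vocabularies, and only two of the four hierarchies are modeled.

```python
# One parent map per hierarchy (term -> parent term, None at the top). Placeholder terms.
PARENT = {
    "composition": {"felsic": None, "granitic": "felsic", "mafic": None},
    "texture": {"fine_grained": None, "coarse_grained": None,
                "very_coarse_grained": "coarse_grained"},
}


def is_a(dimension, term, ancestor):
    """True if term equals ancestor or lies below it in the given hierarchy."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT[dimension][term]
    return False


# Hypothetical samples, described at different levels of detail.
SAMPLES = [
    {"id": "r1", "composition": "granitic", "texture": "very_coarse_grained"},
    {"id": "r2", "composition": "felsic", "texture": "fine_grained"},
    {"id": "r3", "composition": "mafic", "texture": "coarse_grained"},
]


def thematic_query(**constraints):
    """Return ids of samples matching every constraint, at any description level."""
    return [s["id"] for s in SAMPLES
            if all(is_a(dim, s[dim], term) for dim, term in constraints.items())]
```

A query for `composition="felsic"` finds both the sample labeled `felsic` and the more specifically labeled `granitic` one, which is the point of matching across description levels.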

73 GEON SMART Metadata: Multihierarchical Rock Classification for “Thematic Queries”

74 GEON Ontology Demo

75 Architecture of Ontology Based Map Integration
[Diagram labels: Ontology Mapping; Web Map Server; Database; Global Web Map Server]

76 DOE Scientific Data Management Center: Scientific Workflow Demo

77 Example: A Scientific Workflow
[Workflow figure, adapted from Thomas Werner, Biomolecular Engineering, 17: (2001). Labels: Microarray analysis; cDNA; Cluster; Database search for promoter identification; Database search; ABC; Promoter model; Common promoter alignment; Promoter sequences; New candidate target genes]

78 Conceptual Workflow
[Workflow diagram. Step labels: Compute clusters (min. distance); Select gene-set (cluster-level); For each gene; Retrieve Transcription factors; Arrange Transcription factors; For each promoter; Compute Subsequence labels; With all Promoter Models; Compute Joint Promoter Model; Retrieve matching cDNA; Retrieve genomic Sequence; Extract promoter Region(begin, end); Create consensus sequence; Align promoters]

79 Mapping This Workflow To Web Sites

80 Customized CGI Application

83 ClustalW Output; Transfac Query Results

84 SDM-SciDAC System Architecture
[Architecture diagram. Components: WF-Pilot; WF-Compiler (AWF → EWF Translation); WF-Engine (Scheduling and execution); Abstract Task (AT) Repository; Executable Task (ET) Repository; Datatype & Conversion Repository; Data & Parameter Ontologies. Other labels: AWF; EWF; Design; Execution monitoring; AAV rules; ET schemas; User ET; Genbank; BLAST; query rewriting; web service invocation; semantic type checking; conversion rules; data type conversion; web service matching]

85 AWF to EWF
Declarative specification:
  GetGenomicSequence(+{selectedGene}, -{{GenomicSequence}}) :-
      GENBANK(+{selectedGene}, -{cDNASequence}),
      BLAST(+{cDNASequence}, +dbName, +format, -{rankedGenomicSequenceList}).
  GetGenomicSequence(+{selectedGene}, -{{GenomicSequence}}) :-
      GENBANK(+{selectedGene}, -{cDNASequence}),
      BLAT(+{cDNASequence}, +QueryType, +SortCriteria, +OutputType, -{rankedGenomicSequenceList}).
  IdentifyPromoterElements(+{rankedGenomicSequenceList}, -{element}) :-
      PromoterSequences(+{rankedGenomicSequenceList}, getBeginEnd(+Species, -Begin, -End), -{element}).
Conceptual steps: For each gene: Retrieve matching cDNA; Retrieve genomic Sequence; Extract promoter Region(begin, end)
Annotations:
–Same functionality, different operational constraints and availability
–Need extra domain knowledge
–User supplied
–Translation to EWF needs creation of iterators
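The two GetGenomicSequence rules describe alternative executable plans for one abstract task. A compiler could pick between them by availability and wrap the chosen plan in an iterator, roughly as sketched here. The names `PLANS`, `compile_task`, `run_for_each`, and the stub services are hypothetical; the real services take more parameters, which are omitted.

```python
# Abstract task -> alternative executable plans (service names from the rules above).
PLANS = {
    "GetGenomicSequence": [
        ["GENBANK", "BLAST"],
        ["GENBANK", "BLAT"],
    ],
}


def compile_task(abstract_task, available):
    """Pick the first executable plan whose services are all available."""
    for plan in PLANS[abstract_task]:
        if all(service in available for service in plan):
            return plan
    raise LookupError(f"no executable plan for {abstract_task}")


def run_for_each(genes, plan, services):
    """The iterator the translation has to create: run the plan once per gene."""
    results = {}
    for gene in genes:
        value = gene
        for name in plan:
            value = services[name](value)  # pipe each output into the next service
        results[gene] = value
    return results


# Stub services standing in for web service invocations (BLAST assumed unavailable).
services = {
    "GENBANK": lambda gene: f"cDNA({gene})",
    "BLAT": lambda seq: [f"genomic({seq})"],
}
plan = compile_task("GetGenomicSequence", available=set(services))
```

Because only GENBANK and BLAT are available in this setup, the second rule is selected, which illustrates "same functionality, different operational constraints and availability".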

87 Abstract Task (AT) Registration

88 Abstract Task (AT) View and Delete

89 Abstract Task (AT) Update

90 AWF Design

91 EWF Planning and Compilation

92 EWF Execution

93 BIRN Tools Demo

94 Some References (starting points)
XML
–General:
–XQuery:
–XSLT:
Query Rewriting:
–database research literature
Logic Programming
–Learn Prolog Now!
–SWI-Prolog (nice free Prolog system):
Ontologies
–Web Ontology Language (OWL):
Model-Based Mediation:
Semantic Web:

95 References: Project Web Sites
GEOsciences Network (NSF) –
Biomedical Informatics Research Network (NIH) –
Science Environment for Ecological Knowledge (NSF) –seek.ecoinformatics.org
Scientific Data Management Center (DOE) –sdm.lbl.gov/sdmcenter/