Published by Brice Woods. Modified over 9 years ago.
1
From Data Integration To Semantic Mediation: Addressing Heterogeneities in Data
Bertram Ludäscher, LUDAESCH@SDSC.EDU
Knowledge-Based Information Systems Lab, San Diego Supercomputer Center and Department of Computer Science & Engineering, University of California, San Diego
2
Outline
1. Information Integration from a Database Perspective
2. XML-Based Data Integration
3. Model-Based / Semantic Mediation
4. Discussion
3
An Online Shopper’s Information Integration Problem
El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logico-Philosophicus within a week?”
? Information Integration ?
Diagram labels: addall.com, amazon.com, A1books.com, half.com, barnes&noble.com
Mediator (virtual DB) (vs. Datawarehouse)
“One-World” Scenario: XML-based mediator
4
A Home Buyer’s Information Integration Problem
Which houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population?
? Information Integration ?
Sources: Realtor, Demographics, School Rankings, Crime Stats
“Multiple-Worlds” Scenario: XML-based mediator
5
A Neuroscientist’s Information Integration Problem
What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
? Information Integration ?
Sources: protein localization (NCMIR), neurotransmission (SENSELAB), sequence info (CaPROT), morphometry (SYNAPSE)
“Complex Multiple-Worlds” Scenario: Model-based mediator
6
A Geoscientist’s Information Integration Problem
What are the distribution and U/Pb zircon ages of A-type plutons in VA? How about their 3-D geometry? How does it relate to host rock structures?
? Information Integration ?
Sources: Geologic Map (Virginia), GeoChemical, GeoPhysical (gravity contours), GeoChronologic (Concordia), Foliation Map (structure DB)
“Complex Multiple-Worlds” Scenario: Model-based mediator
7
Information Integration Challenges: Heterogeneities = S^4 ...
System Aspects
– platforms, devices, distribution, APIs, protocols, …
Syntaxes
– heterogeneous data formats (one for each tool ...)
Structures
– heterogeneous schemas (one for each DB ...)
– heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs, flat files, …)
Semantics
– unclear & “hidden” semantics: e.g., incoherent terminology, multiple / informal taxonomies, implicit assumptions, ...
8
Information Integration Challenges
System aspects: “Grid” middleware
– distributed data & computing
– Web services, WSDL/SOAP, …
– sources = functions, files, databases, …
Syntax & Structure: (XML-Based) Mediators
– wrapping, restructuring
– (XML) queries and views
– sources = (XML) databases
Semantics: Model-Based/Semantic Mediators
– conceptual models and declarative views
– Semantic Web: ontologies, description logics, RDF(S), DAML+OIL, OWL, ...
– sources = knowledge bases (DB+CMs+ICs)
reconciling S^4 heterogeneities (system aspects, syntax, structure, semantics)
“gluing” together multiple data sources
bridging information and knowledge gaps computationally
9
Information Integration from a DB Perspective
Information Integration Problem
– Given: data sources S_1, ..., S_k (DBMS, web sites, ...) and user questions Q_1, ..., Q_n that can be answered using the S_i
– Find: the answers to Q_1, ..., Q_n
The Database Perspective: source = “database”
– S_i has a schema (relational, XML, OO, ...)
– S_i can be queried
– define virtual (or materialized) integrated views V over S_1, ..., S_k using database query languages (SQL, XQuery, ...)
– questions become queries Q_i against V(S_1, ..., S_k)
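The database perspective can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two sources, their field names, and the prices are made up, and a plain generator stands in for a view defined in SQL or XQuery. It only shows the shape of the idea: a virtual view V over S_1 and S_2 maps each local schema to a common one, and a user question becomes a query against V.

```python
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]
Source = Callable[[], Iterable[Row]]  # a source S_i that can be queried

# Hypothetical sources with different local schemas (made-up data).
def s1() -> List[Row]:
    return [{"title": "Tractatus", "price": 12.0, "shipping": 4.0}]

def s2() -> List[Row]:
    return [{"name": "Tractatus", "total_cost": 15.5}]

def integrated_view(sources: Dict[str, Source]) -> Iterable[Row]:
    """Virtual view V(S_1, S_2): rows are computed on demand, never stored."""
    for row in sources["s1"]():
        yield {"title": row["title"], "cost": row["price"] + row["shipping"], "src": "s1"}
    for row in sources["s2"]():
        yield {"title": row["name"], "cost": row["total_cost"], "src": "s2"}

def cheapest(title: str, view: Iterable[Row]) -> Row:
    """A user question expressed as a query Q against V."""
    return min((r for r in view if r["title"] == title), key=lambda r: r["cost"])

best = cheapest("Tractatus", integrated_view({"s1": s1, "s2": s2}))
```

A materialized variant would store the rows of `integrated_view` once (the data warehouse approach) instead of recomputing them per query.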
10
Outline
1. Information Integration from a Database Perspective
2. XML-Based Data Integration
3. Model-Based / Semantic Mediation
4. Discussion
11
Extensible Markup Language (XML)
(meta)language for marking up text & data with user-definable tags
– (X)HTML, XSLT, XML Schema, ...
– MathML, BioML, GeoML, NeuroML, ...
– XML-RPC, SOAP, WSDL, OWL, ...
semistructured tree data model
– flexible: marked-up text, web-pages, databases, ...
container model:
– “boxes within boxes”
Example (marked-up text): ... in their wonderful book called SemWeb Tractat by B. Schatz and T.B. Lee, the authors show how ...
Example (tree): book: title: “SemWeb Tractat”; author: “B. Schatz”; author: “T.B. Lee”
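The book example can be written out as an XML document and read back as a tree. The sketch below uses Python's standard-library `xml.etree.ElementTree`; the element names `book`, `title`, and `author` follow the slide, while the exact serialization is an assumption.

```python
import xml.etree.ElementTree as ET

# "Boxes within boxes": a book element containing a title and two authors.
doc = """<book>
  <title>SemWeb Tractat</title>
  <author>B. Schatz</author>
  <author>T.B. Lee</author>
</book>"""

book = ET.fromstring(doc)                            # root of the labeled ordered tree
title = book.findtext("title")                       # text of the first <title> child
authors = [a.text for a in book.findall("author")]   # children in document order
```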
12
XML-Based Mediator Architecture
USER/Client: Query Q(G(S_1, ..., S_k))
XML Queries & Results
MEDIATOR: Integrated Global XML View G; Integrated View Definition G(..) S_1(..) … S_k(..)
Sources S_1, S_2, ..., S_k: each with a Wrapper exporting an XML View
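A rough sketch of the wrapper layer, under stated assumptions: the function names `wrap_rows` and `global_view`, the element names `source`, `item`, and `global`, and the sample rows are all hypothetical. Each wrapper exports a source as an XML view, and the mediator builds the global view G from those views.

```python
import xml.etree.ElementTree as ET

def wrap_rows(source_name, rows):
    """Wrapper: export the rows of one source as an XML view."""
    root = ET.Element("source", name=source_name)
    for row in rows:
        item = ET.SubElement(root, "item")
        for field, value in row.items():
            ET.SubElement(item, field).text = str(value)
    return root

def global_view(wrapped_views):
    """Mediator: integrated global XML view G over all wrapped sources."""
    g = ET.Element("global")
    for view in wrapped_views:
        g.extend(view.findall("item"))
    return g

v1 = wrap_rows("s1", [{"title": "A"}])
v2 = wrap_rows("s2", [{"title": "B"}, {"title": "C"}])
g = global_view([v1, v2])
titles = [item.findtext("title") for item in g.findall("item")]
```

This version builds G eagerly; a virtual mediator would instead translate the client query into subqueries against the wrappers.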
13
Some Challenges in XML-Based Integration ...
XML Query/Transformation Languages
– DB community: QLs for semistructured data, e.g., TSIMMIS/MSL, Lorel, Yatl, ..., Florid/F-logic [InfSystems98]
– CSE/SDSC: XMAS [SSD99, SIGMOD99, WebDB99, EDBT00]
– W3C: XPath, XSLT, XQuery (Working Draft, June 2001)
XML Schema Languages
– DTDs, RELAX NG, XML Schema, ... [XMLDM02]
DB Theoreticians:
– Expressiveness/Complexity Trade-Off
– querying: FO, (WF/S-)Datalog, FO(LFP), FO(PFP), ..., all
– reasoning: query satisfiability, containment, equivalence
...
14
XMAS: XML Matching And Structuring language
Integrated View Definition: “Find books from amazon.com and DBLP, join on author, group by authors and title”
CONSTRUCT $a1 $t $p { $p } { $a1, $t }
WHERE $a1 : $t : IN "amazon.com"
AND $a2 : $p : IN "www...DBLP… "
AND value( $a1 ) = value( $a2 )
XMAS; XMAS Algebra [QL98, SIGMOD99] [EDBT00]
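The intent of this view (join on author, group publications by author and title) can be approximated in Python. This is not XMAS and not the mediator's evaluation strategy; the tuples below are made-up stand-ins for the two wrapped sources, and `books_with_pubs` is a hypothetical name.

```python
from collections import defaultdict

# Made-up stand-ins for the two sources: (author, title) and (author, publication).
amazon = [("B. Schatz", "SemWeb Tractat"), ("A. Nobody", "Other Book")]
dblp = [("B. Schatz", "Paper 1"), ("B. Schatz", "Paper 2"), ("C. Else", "Paper 3")]

def books_with_pubs(amazon_rows, dblp_rows):
    """Join on author value, then group publications by (author, title)."""
    grouped = defaultdict(list)
    for a1, t in amazon_rows:
        for a2, p in dblp_rows:
            if a1 == a2:                      # value($a1) = value($a2)
                grouped[(a1, t)].append(p)    # collect $p per group {$a1, $t}
    return dict(grouped)

result = books_with_pubs(amazon, dblp)
```

The nested loop is the simplest possible join; a real query processor would pick a join algorithm from an algebraic plan.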
15
XML (XMAS) Query Processing
Diagram labels: XML Query Q; XML Global View Definition G(S); Composition Q(G); Translator; algebraic plans; composed plan; Rewriter/Optimizer: Q’(S); optimized plan; Plan Execution
Compile-time
Run-time: query evaluation
16
… New Challenges in (XML-Based) Mediation
Global-As-View (GAV)
– user query Q over global relations G: Q(G)
– global relations G defined over source relations S: G(S)
– challenge: compute answers Q(G(V(S))) without computing all of V and G
– query rewriting (with limited source capabilities): Q’(S) = Q(G)
Local-As-View (LAV)
– user query Q over global relations G: Q(G)
– source relations S defined over global relations G: S(G)
– challenge: “reverse/rewrite rules” from S(G) to some G’(S)
– answering queries using views: equivalent rewritings may not exist; find maximally contained ones: Q’(G’(S)) ⊆ Q(G)
Inter(CS)disciplinary research needed: DB, FP, LP
– GAV/LAV view (un)folding, Clark’s completion, resolution, factoring
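GAV view unfolding can be sketched in a few lines of Python. The relation names, the data, and the helpers `unfold` and `execute` are assumptions for illustration only. The point is that a query over a global relation is rewritten into per-source subqueries, so the selection runs at each source and the full global relation is never built.

```python
# Made-up source relations.
sources = {
    "s1_books": [{"isbn": "1", "title": "A", "price": 10}],
    "s2_books": [{"isbn": "2", "title": "B", "price": 30},
                 {"isbn": "3", "title": "C", "price": 5}],
}

# G(S): the global relation "book" is defined as the union of two source relations.
gav_views = {"book": ["s1_books", "s2_books"]}

def unfold(global_rel, predicate):
    """Rewrite Q(G) into Q'(S): replace the global relation by its definition
    and attach the selection to each source relation."""
    return [(src, predicate) for src in gav_views[global_rel]]

def execute(plan):
    """Run each subquery against its source and union the results."""
    rows = []
    for src, predicate in plan:
        rows.extend(r for r in sources[src] if predicate(r))
    return rows

plan = unfold("book", lambda r: r["price"] < 20)
answers = execute(plan)
```

LAV is harder precisely because the definitions point the other way (sources described in terms of G), so no direct unfolding like this is available.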
17
Querying XML Streams: A New Frontier
New applications for stream-based XML processing:
– Continuous, real-time data streams (wireless sensor networks, …)
– Data / message transformation in Web services (SOAP, RMI, processing …)
– Extract-transform-load applications (Tera/Peta-byte archival migration, …)
… leading to a new XML querying & transformation paradigm:
– how to execute (some) XML queries & transformations on very large (infinite) data streams using only limited memory
– XML stream machine (XSM): extended XML transducers with buffers
– XQuery => XSM network
XSMs clearly outperform tree-based approaches on streamable queries (100x over Xalan) [A Transducer-Based XML Query Processor, Ludäscher, Mukhopadhyay, Papakonstantinou, VLDB’02]
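To make the streaming idea concrete, here is a small sketch that uses Python's standard-library `ET.iterparse` rather than the XSM transducers described above. The element names and the price filter are assumptions. Each `book` is processed when its end tag arrives and its contents are then released, instead of first building the whole document tree.

```python
import io
import xml.etree.ElementTree as ET

def stream_titles(xml_stream, max_price):
    """Yield titles of <book> elements cheaper than max_price, one at a time."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == "book":
            if float(elem.findtext("price")) < max_price:
                yield elem.findtext("title")
            elem.clear()  # release this book's children and text after use

data = io.BytesIO(
    b"<books>"
    b"<book><title>A</title><price>10</price></book>"
    b"<book><title>B</title><price>30</price></book>"
    b"</books>"
)
titles = list(stream_titles(data, 20))
```

Note that cleared elements still remain as empty shells under the root, so this only approximates bounded-memory evaluation; a transducer with explicit buffers avoids that growth.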
18
Outline
1. Information Integration from a Database Perspective
2. XML-Based Data Integration
3. Model-Based / Semantic Mediation
4. Discussion
19
A Neuroscientist’s Information Integration Problem
What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
? Information Integration ?
Sources: protein localization (NCMIR), neurotransmission (SENSELAB), sequence info (CaPROT), morphometry (SYNAPSE)
“Complex Multiple-Worlds” Mediation
20
A Geoscientist’s Information Integration Problem
What are the distribution and U/Pb zircon ages of A-type plutons in VA? How about their 3-D geometry? How does it relate to host rock structures?
? Information Integration ?
Sources: Geologic Map (Virginia), GeoChemical, GeoPhysical (gravity contours), GeoChronologic (Concordia), Foliation Map (structure DB)
“Complex Multiple-Worlds” Mediation
21
What’s the Problem with XML & Complex Multiple-Worlds?
XML is Syntax
– ... for labeled ordered trees
– ... all semantics lies outside of XML
XML DTDs => tags + nesting
XML Schema => DTDs + data modeling
need anything else? => write comments!
Domain Semantics is Complex:
– implicit assumptions, hidden semantics
– sources seem unrelated to the non-expert
Need Structure and Semantics beyond trees!
– employ richer OO models
– make domain semantics and “glue knowledge” explicit
– use ontologies to fix terminology and conceptualization
– avoid ambiguities by using KR and formal semantics
22
Information Integration Landscape
Axes: conceptual distance (one-world to multiple-worlds); conceptual complexity/depth (low to high)
Techniques: DB mediation techniques; Ontologies, KR formalisms; Model-Based Mediation
Systems and scenarios plotted: addall, book-buyer, home-buyer, 24x7 consumer, BLAST, EcoCyc, Cyc, WordNet, GO, UMLS, MIA, Entrez, RiboWeb, Tambis
Application areas: Bioinformatics; Geo-, Ecoinformatics
23
XML-Based vs. Model-Based Mediation
XML-Based: Raw Data; XML Elements; XML Models
– Integrated-DTD XML-QL(Src1-DTD, ...)
– No Domain Constraints
– Structural Constraints (DTDs): A = (B*|C),D; B = ...; Parent, Child, Sibling, ...
Model-Based: Raw Data; (XML) Objects; Conceptual Models
– Integrated-CM CM-QL(Src1-CM, ...)
– Logical Domain Constraints (IF ... THEN ...)
– Classes (C1, C2, C3), Relations (R), is-a, has-a, ...
– “Glue Maps” = Domain & Process Maps (ontologies)
CM ~ {Descr.Logic, ER, UML, RDF/XML(-Schema), …}
CM-QL ~ {F-Logic, DAML+OIL, …}
24
What’s the Glue? What’s in a Link?
Syntactic Joins
– (X,Y) := X.SSN = Y.SSN (equality)
– (X,Y) := X.UMLS-ID = Y.UID
“Speciality” Joins
– (X,Y,Score) := BLAST(X,Y,Score) (similarity)
Semantic/Rule-Based Joins
– (X,Y,C) := X isa C, Y isa C, BLAST(X,Y,S), S>0.8 (homology, lub)
– (X,Y,[produces,B,increased_in]) := X produces B, B increased_in Y. (rule-based), e.g., X= - secretase, B=beta amyloid, Y=Alzheimer’s disease
CS Challenge:
– compile semantic joins into efficient syntactic ones
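The difference between a syntactic join and a rule-based join can be shown with a small Python sketch. The records and facts are made up (the term `secretase` is used without the prefix that is unreadable in the slide), and the function names are hypothetical. The rule-based join follows the rule "X produces B, B increased_in Y" and returns the connecting path as the link.

```python
# Made-up records and facts for illustration.
people = [{"ssn": "123", "name": "Ann"}]
claims = [{"ssn": "123", "claim": "c1"}, {"ssn": "999", "claim": "c2"}]
produces = {("secretase", "beta amyloid")}
increased_in = {("beta amyloid", "Alzheimer's disease")}

def syntactic_join(xs, ys, key_x, key_y):
    """(X, Y) := X.key_x = Y.key_y  (plain equality on identifiers)."""
    return [(x, y) for x in xs for y in ys if x[key_x] == y[key_y]]

def rule_based_join(produces_facts, increased_facts):
    """(X, Y, [produces, B, increased_in]) := X produces B, B increased_in Y."""
    return [(x, y, ["produces", b, "increased_in"])
            for (x, b) in produces_facts
            for (b2, y) in increased_facts
            if b == b2]

equal_pairs = syntactic_join(people, claims, "ssn", "ssn")
links = rule_based_join(produces, increased_in)
```

Here the rule-based join already reduces to an equality on the shared term B, which hints at what compiling semantic joins into syntactic ones means.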
25
Semantic Mediation Methodology @ SOURCES
Lift Sources to export CMs: CM(S) = OM(S) + KB(S) + CON(S)
Object Model OM(S):
– complex objects (frames), class hierarchy, OO constraints
Knowledge Base KB(S):
– explicit representation of (“hidden”) source semantics
– logic rules over OM(S)
Contextualization CON(S):
– situate OM(S) data using “glue maps” (ontologies):
– domain maps DMs = terminological knowledge: concepts + roles
– process maps PMs = “procedural knowledge”: states + transitions
26
Semantic Mediation Methodology @ MEDIATOR
Integrated View Definition (IVD)
– declarative (logic) rules with object-oriented features
– defined over CM(S), domain maps, process maps
– needs “mediation engineers” = domain + KRDB experts
Knowledge-Based Querying and Browsing (runtime):
– mediator composes the user query Q with the IVD
– ... rewrites (Q o IVD), sends subqueries to sources
– ... post-processes returned results (e.g., situate in context)
27
Model-Based Mediator Architecture
USER/Client
CM Queries & Results (exchanged in XML)
Mediator Engine: FL rule proc., LP rule proc., Graph proc., XSB Engine
CM (Integrated View); Integrated View Definition IVD
“Glue” Maps GMs: Domain Maps DMs, Process Maps PMs; semantic context CON(S)
Sources S1, S2, S3: (XML-Wrapper), CM-Wrapper; CM(S) = OM(S)+KB(S)+CON(S); GCM CM S1, GCM CM S2, GCM CM S3
First results & Demos: KIND prototype, formal DM semantics, PMs [SSDBM00] [VLDB00] [ICDE01] [NIH-HB01] [BNCOD02] [ER02] [EDBT02] [BioInf02]
28
Formalizing Glue Knowledge: Domain Map for SYNAPSE and NCMIR
Domain Map (DM) = labeled graph with concepts ("classes") and roles ("associations"); additional semantics: expressed as logic rules (F-logic)
Domain Expert Knowledge:
– Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines.
– Dendritic spines are ion (calcium) regulating components.
– Spines have ion binding proteins.
– Neurotransmission involves ionic activity (release).
– Ion-binding proteins control ion activity (propagation) in a cell.
– Ion-regulating components of cells affect ionic activity (release).
DM in Description Logic
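A domain map of this kind can be prototyped as a list of labeled edges. The sketch below paraphrases a few of the expert statements as (concept, role, concept) triples; the role names, the simplified concept names, and the `reachable` helper are assumptions, and plain graph traversal stands in for the description-logic and F-logic semantics.

```python
# Labeled graph: (concept, role, concept), paraphrasing the expert statements.
domain_map = [
    ("Purkinje cell", "has", "dendrite"),
    ("Pyramidal cell", "has", "dendrite"),
    ("dendrite", "has", "branch"),
    ("branch", "contains", "spine"),
    ("spine", "has", "ion-binding protein"),
    ("spine", "isa", "ion-regulating component"),
]

def reachable(dm, start, roles):
    """Concepts reachable from start along edges whose role is in roles."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for src, role, dst in dm:
            if src == node and role in roles and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen

# Everything a Purkinje cell has or contains, directly or indirectly.
parts = reachable(domain_map, "Purkinje cell", {"has", "contains"})
```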
29
Source Contextualization & DM Refinement
In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map...
– sources can register new concepts at the mediator ...
30
Example: ANATOM Domain Map
31
Browsing Registered Data with Domain Maps
32
Query Processing Demo
Query results in context: Contextualization CON(Result) wrt. ANATOM.
Mediator View Definition (provided by the domain expert and mediation engineer; deductive OO language, here: F-logic):
DERIVE protein_distribution(Protein, Organism, Brain_region, Feature_name, Anatom, Value)
WHERE
  I: protein_label_image[ proteins ->> {Protein}; organism -> Organism; anatomical_structures ->> {AS: anatomical_structure[ name->Anatom ] } ], % from PROLAB
  NAE: neuro_anatomic_entity[ name->Anatom; % from ANATOM
    located_in->>{Brain_region} ],
  AS..segments..features[ name->Feature_name; value->Value ].
33
Example: Inside Query Evaluation
“How does the parallel fiber output (Yale/SENSELAB) relate to the distribution of Ryanodine Receptors (UCSD/NCMIR)?”
1. push selection @SENSELAB: X1 := select targets of “output from parallel fiber”;
2. determine source context @MEDIATOR: X2 := “find and situate” X1 in ANATOM Domain Map;
3. compute region of interest (here: downward closure) @MEDIATOR: X3 := subregion-closure(X2);
4. push selection @NCMIR: X4 := select PROT-data(X3, Ryanodine Receptors);
5. compute protein distribution @MEDIATOR: X5 := compute aggregate(X4);
6. display in context @MEDIATOR/GUI: display X5 in context (ANATOM)
=> DEMONSTRATION
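The evaluation plan can be mimicked end to end with toy data. Every table below is a hypothetical stand-in (the real SENSELAB, NCMIR, and ANATOM contents are not shown in the slides), and the amounts are made up. The sketch only mirrors the plan's shape: select at one source, situate in the domain map, take the downward closure, select at the second source, then aggregate.

```python
# Hypothetical stand-ins for the sources and the ANATOM domain map.
SENSELAB_TARGETS = {"output from parallel fiber": ["Purkinje cell"]}
ANATOM_LOCATED_IN = {"Purkinje cell": "cerebellar cortex"}
ANATOM_SUBREGIONS = {"cerebellar cortex": ["molecular layer", "Purkinje layer"]}
NCMIR_PROT = [  # (region, protein, amount); made-up numbers
    ("molecular layer", "Ryanodine Receptor", 3.0),
    ("Purkinje layer", "Ryanodine Receptor", 5.0),
    ("hippocampus", "Ryanodine Receptor", 9.0),
]

def subregion_closure(regions):
    """Downward closure: the given regions plus all their subregions."""
    closed, stack = set(), list(regions)
    while stack:
        region = stack.pop()
        if region not in closed:
            closed.add(region)
            stack.extend(ANATOM_SUBREGIONS.get(region, []))
    return closed

def evaluate(term, protein):
    x1 = SENSELAB_TARGETS[term]                       # push selection @SENSELAB
    x2 = {ANATOM_LOCATED_IN[c] for c in x1}           # situate in domain map @MEDIATOR
    x3 = subregion_closure(x2)                        # region of interest @MEDIATOR
    x4 = [r for r in NCMIR_PROT if r[0] in x3 and r[1] == protein]  # selection @NCMIR
    x5 = {}
    for region, _protein, amount in x4:               # aggregate @MEDIATOR
        x5[region] = x5.get(region, 0.0) + amount
    return x5

distribution = evaluate("output from parallel fiber", "Ryanodine Receptor")
```

Pushing the selections to the sources first keeps the intermediate results X1 and X4 small before the mediator does any closure or aggregation.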
34
Open Database & Knowledge Representation Issues
Mix of Query Processing and Reasoning
– GAV & LAV with semantic query optimization (NIH BIRN, NSF GEON)
– description logic reasoner for DMs (FaCT)?
– reconciliation of conflicting DMs via argumentation-frameworks (“games”) using well-founded and stable models of logic programs [ICDT97, PODS97, TCS00, TODS02]
Modeling “Process Knowledge” => Process Maps
– formal semantics? (dynamic/temporal/Kripke models/Petri nets?)
– executable semantics? (Statelog?)
Graph Queries over DMs and PMs
– expressible in F-logic [InfSystems98]
– scalability? (UMLS Domain Map has millions of entries)
How to incorporate “procedural features”?
– Bioinformatics, Ecoinformatics, … => sources = DBs + analytical tools + …
– scientific workflow planning and management (“promoter identification workflow” for DOE SciDAC, NSF/ITR SEEK)
35
Process Maps with Abstractions and Elaborations: From Terminological to Procedural Glue
nodes ~ states
edges ~ processes, transitions
blue/red edges: processes in Src1/Src2
general form of edges:
related formalisms
36
A Scientific Workflow: Promoter Identification (Matthew Coleman, LLNL, 2002)
Questions: Are chr#’s in common? Are chr#’s locations in common? Are there conserved upstream sequences? Are gene locations conserved across species?
Questions: RNA POLII promoter? GpC Island present? Are there common TAF’s across genomic gi#?
Questions: Are there other common genes?
Diagram labels: gi#’s from clusfavor; cDNA gi#; Gene name; blast; blast human; Genomic gi#; Chr #; Gene location; TAF’s; Location on Genomic gi#’s; Probabilities of match; Probabilities of random match; TRANSFAC; GC Island location; Exon/intron location; Repeats location; Promoter location; GRAIL; Validates polII promoter location; promoter location; Shared TAF’s across cluster; Common consensus sequence; Data Consolidation; Consensus sequences; CLUSTAL; blast other species
37
SDM Demo & Architecture
Translation Approach: Abstract Workflow (AWF) => Executable Workflow (EWF)
38
Analytical Pipelines: An Open Source Tool
39
A Commercial Tool for Analytical Pipelines
40
Summary: Mediation Scenarios & Techniques
Federated Databases     | XML-Based Mediation     | Model-Based Mediation
One-World               | One-/Multiple-Worlds    | Complex Multiple-Worlds
Common Schema           | Mediated Schema         | Common Glue Maps
SQL, rules              | XML query languages     | DOOD query languages
Schema Transformations  | Syntax-Aware Mappings   | Semantics-Aware Mappings
Syntactic Joins         | Syntactic Joins         | “Semantic” Joins via Glue Maps
DB expert               | DB expert               | KRDB + domain experts
Glue?
41
GEON vs. SEEK
42
Outline
1. Information Integration from a Database Perspective
2. XML-Based Data Integration
3. Model-Based / Semantic Mediation
4. Discussion
43
Thank you! Questions? Queries?
44
Some References
Model-Based Mediation:
– A Model-Based Mediator System for Scientific Data Management, B. Ludäscher, A. Gupta, M. Martone, in Bioinformatics: Managing Scientific Data, Lacroix, Critchlow (eds), Morgan Kaufmann, to appear, 2003.
– Model-Based Mediation with Domain Maps, B. Ludäscher, A. Gupta, M. E. Martone, 17th Intl. Conference on Data Engineering (ICDE’01), Heidelberg, Germany, IEEE Computer Society, 2001.
– Managing Semistructured Data with FLORID: A Deductive Object-Oriented Perspective, B. Ludäscher, R. Himmeröder, G. Lausen, W. May, C. Schlepphorst, Information Systems, 23(8), Special Issue on Semistructured Data, 1998.
XML-Based Mediation:
– VXD/Lazy Mediators: Navigation-Driven Evaluation of Virtual Mediated Views, B. Ludäscher, Y. Papakonstantinou, P. Velikhov, Intl. Conference on Extending Database Technology (EDBT’00), Konstanz, Germany, LNCS 1777, Springer, 2000.
– XML Streams: A Transducer-Based XML Query Processor, B. Ludäscher, P. Mukhopadhyay, Y. Papakonstantinou, Intl. Conference on Very Large Databases (VLDB’02), Hong Kong, 2002.
45
Knowledge Representation: Relating Theory to the World via Formal Models
John F. Sowa, Knowledge Representation: Logical, Philosophical, and Computational Foundations
“All models are wrong, but some are useful!”