Data Integration: A Status Report Alon Halevy University of Washington, Seattle BTW 2003.

Data Integration: A Status Report Alon Halevy University of Washington, Seattle BTW 2003

February 27 th, 2003BTW 2003 Data Integration Report Recent progress Mediation languages Query processing (XML and other) Commercial Current challenges Flexible architectures: peer-data mgmt. Getting to the root of semantic heterogeneity: schema mapping.

Data Integration Systems This is one possible architecture (virtual integration) Only logical mediated schema is central. Data stays at the sources.

February 27 th, 2003BTW 2003 Motivation and Activity Application areas of data integration: Enterprise information integration ($$) The government Data sources on the web Scientific data sharing. Many research projects: Mine: Information Manifold, Tukwila, LSD. Companies: Many startups, big guys getting in.

February 27 th, 2003BTW 2003 Outline Recent progress Mediation languages Adaptive Query processing XML data management Commercial Current challenges Flexible architectures: peer-data mgmt. Getting to the root of semantic heterogeneity: schema mapping. Crossing the Structure Chasm.

February 27 th, 2003BTW 2003 Mediation Languages Goal: Mediated Schema Source Language for Specifying Semantic relationships Q Q’Q’ Q’Q’ Q’Q’ Q’Q’ Q’Q’

February 27 th, 2003BTW 2003 Global-as-View (GAV) Mediated Schema Source R1R2R3R4R5 Title, Actor, … Create view Actor AS R1 Union Select A,B From S2 Union …

February 27 th, 2003BTW 2003 Local-as-View (LAV) Mediated Schema Source R1R2R3R4R5 Title, Actor … Create View R1 as Select title, name From Title Join Actor Where Year>1970 Create View R5 as Select * From Movie Where lang=“German” (GLAV)

February 27 th, 2003BTW 2003 Adaptive Query Processing Problem: no stats, network unstable Cannot ‘ Plan and then execute ’ Need to adapt plan during execution. Idea already in Ingres (1976) Proposed before data integration: Cole and Graefe (choose nodes) Kabra and Dewitt (mid-query re-opt).

February 27 th, 2003BTW 2003 Convergent Query Processing [Zack Ives, Ph.D 2002, U. Penn] Processor starts with initial plan Monitors execution, accumulating stats. Switches plan when a better one found Reuses intermediate results. Final, cleanup phase. Possible transformation types: Plan partitioning, data partitioning, low-level rescheduling. Can be aggressive (e.g., with aggregations).

February 27 th, 2003BTW 2003 XML Query Processing XML facilitates integration. Mediator query processor may manipulate XML directly. Progress on: Publishing to XML, XML views on relations Physical algebras for manipulating XML Optimization of XQuery.

February 27 th, 2003BTW 2003 The Commercial World Some startups: Nimble, MetaMatrix, Calixa, Enosys, … Big guys making announcements: IBM, BEA, MS, (Oracle still being defiant). Progress: analysts have buzzword -- EII. Challenges: Integration with EAI? Yet another middleware? Horizontal vs. vertical?

February 27 th, 2003BTW 2003 Outline Recent progress Mediation languages Adaptive Query processing XML data management Commercial Current challenges Flexible architectures: peer-data mgmt. Getting to the root of semantic heterogeneity: schema mapping.

February 27 th, 2003BTW 2003 Peer Data-Management PDMS: a network of peers Peers can: Export base data Provide views on base data Serve as logical mediators for other peers A peer can be both a server and a client. Semantic relationships are specified locally (between small sets of peers).

Network of Mappings (Piazza) UWStanford DBLP Saarbruecken Leipzig CiteSeer Berlin GAV, LAV GLAV Q Q’Q’ Q’Q’ Q ’’

February 27 th, 2003BTW 2003 Advantages of PDMS No need for a central mediated schema. Can map data opportunistically, as is most convenient. Queries are posed using the peer ’ s schema. Answers come from anywhere in the system. Semantic Web. This is not P2P file sharing. Data has rich semantics Membership is not as dynamic.

Schema Mediation UWStanford DBLP Saarbruecken Leipzig CiteSeer Berlin GAV, LAV GLAV Q Q’Q’ Q’Q’ Q ’’ When can LAV and GAV be combined to form such a network structure? [ICDE-03], [WWW-03 for XML]

Query Optimization UWStanford DBLP Saarbruecken Leipzig CiteSeer Berlin Q Q’Q’ Q’Q’ Q ’’ Problems: redundant paths expensive reformulation. Possible solution: Pre-compose some paths

February 27 th, 2003BTW 2003 Mapping Composition Incredibly subtle! [w/ Madhavan] In general, composition can be an infinite set of GLAV formulas. Results: Finite in many cases Even when infinite, often has finite, useful encoding. Hence, compositions can usually be pre- optimized.

Management of Updates [w/ Mork, Gribble] UWStanford DBLP Saarbruecken Leipzig CiteSeer Berlin Q Q’Q’ Q’Q’ Q ’’ Problem: when updates are generated, we don ’ t know who will use them. Solution: represent updates as first-class citizens Complement with boosters Rules for usage.

Other Research Issues UWStanford DBLP Saarbruecken Leipzig CiteSeer Berlin Q Q’Q’ Q’Q’ Q ’’ Intelligent data placement Management of mapping networks Improving networks: finding additional connections. Indexing of views

February 27 th, 2003BTW 2003 Schema Matching/Mapping Given S 1 and S 2: a pair of schemas/DTDs/ontologies, … Possibly, data accompanying instances Additional domain knowledge Find: A match between S 1 and S 2 A set of correspondences between the terms. Ultimately, a mapping Should enable translating data between the schemas.

Example: House Listings house location view house address front back num-baths full-bathshalf-baths Water view Lake Mountains 1-1 mapping non 1-1 mapping ?

February 27 th, 2003BTW 2003 Motivations Heart of any data sharing architecture Virtual, warehouse, messaging, web services, semantic web Translation of legacy data, EAI, … Key operator in model management Algebra for manipulating models of data See [Bernstein, CIDR-03], Melnik et al. [SIGMOD 03]. Currently, a bottleneck. Done mostly by hand.

February 27 th, 2003BTW 2003 Approaches to Matching Matching is hard because schema does not fully capture the semantics. Many techniques proposed. They consider similarities in: Attribute names (synonyms) Data values, data types Relationships between columns Structural similarities Anything a human expert would try! Hence, let ’ s try to simulate a human.

February 27 th, 2003BTW 2003 Philosophy of Solutions Effective schema matching requires a principled combination of techniques. Like human experts, the matcher should improve over time Learn from seeing many schemas, matches. LSD [Doan, Ph.D 2002, U. of Illinois] COMA [Do et al.]

February 27 th, 2003BTW 2003 Corpus Based Solution [Madhavan, Bernstein, Chen, Halevy, Shenoy] Collect a corpus of schemas and matches. Learn from the corpus: Create a classifier for every corpus element Use multi-strategy learning. Given S 1 and S 2 : Compare each schema element to corpus elements. If two elements ’ similarity vectors are close, then maybe they match each other.

February 27 th, 2003BTW 2003 Learning from Corpus vs. Learning from the schemas

February 27 th, 2003BTW 2003 Finding Different Matches

February 27 th, 2003BTW 2003 Other Corpus Based Tools Conjecture: a corpus of schemas can be the basis for many useful tools. Auto-complete: I start creating a schema (or show sample data), and the tool suggests a completion. Query reformulation: I ask a query using my terminology, and it gets reformulated appropriately. Improving structured queries over structured web sites (and focused crawling, a la BINGO!)

February 27 th, 2003BTW 2003 The Corpus Contents: Schemas, ontologies, meta-data, data, queries. Sample statistics: How often does a word appear as a relation name? When it does, what tend to be the attribute names? What other tables are there? What are the foreign keys?

February 27 th, 2003BTW 2003 Conclusion: Crossing the Structure Chasm Data authoring, querying and sharing is everywhere; done by novices too. Semantic web: the extreme example. Corpus Of schemas schema mapping

February 27 th, 2003BTW 2003 Some References www.cs.washington.edu/homes/alon Piazza: WebDB01, ICDE03, WWW03 The Structure Chasm: CIDR-03 Mediation surveys: VLDB Journal 01 Lenzerini, PODS 02 tutorial. Schema matching: Rahm and Bernstein, VLDB Journal 01.

Data Integration: A Status Report Alon Halevy University of Washington, Seattle BTW 2003.

Similar presentations

Presentation on theme: "Data Integration: A Status Report Alon Halevy University of Washington, Seattle BTW 2003."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Data Integration: A Status Report Alon Halevy University of Washington, Seattle BTW 2003.

Similar presentations

Presentation on theme: "Data Integration: A Status Report Alon Halevy University of Washington, Seattle BTW 2003."— Presentation transcript:

Similar presentations

About project

Feedback