Data Integration: A Status Report. Alon Halevy, University of Washington, Seattle. BTW 2003.



Data Integration Report (February 27th, 2003, BTW 2003)
Recent progress:
- Mediation languages
- Query processing (XML and other)
- Commercial
Current challenges:
- Flexible architectures: peer data management
- Getting to the root of semantic heterogeneity: schema mapping

Data Integration Systems
- This is one possible architecture (virtual integration).
- Only the logical mediated schema is central.
- Data stays at the sources.

Motivation and Activity
Application areas of data integration:
- Enterprise information integration ($$)
- The government
- Data sources on the web
- Scientific data sharing
Many research projects:
- Mine: Information Manifold, Tukwila, LSD
Companies:
- Many startups; big guys getting in

Outline
Recent progress:
- Mediation languages
- Adaptive query processing
- XML data management
- Commercial
Current challenges:
- Flexible architectures: peer data management
- Getting to the root of semantic heterogeneity: schema mapping
- Crossing the Structure Chasm

Mediation Languages
Goal: a language for specifying semantic relationships between the mediated schema and the sources.
[Figure: mediated schema above the sources; a query Q is reformulated into queries Q' over the sources.]

Global-as-View (GAV)
[Figure: mediated schema (Title, Actor, …) above sources R1, R2, R3, R4, R5.]
    Create View Actor As
      R1
      Union
      Select A, B From S2
      Union
      …
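
To make the GAV idea concrete, here is a minimal sketch, assuming that answering a query over a GAV mediated relation amounts to unfolding its view definition into the source queries and taking the union. The source names R1 and S2 come from the slide; the rows, the `answer` helper, and the filter are hypothetical illustrations, not part of any system named in the talk.

```python
# Hypothetical source contents; only the names R1 and S2 come from the slide.
SOURCES = {
    "R1": [("Metropolis", "Brigitte Helm")],
    "S2": [("M", "Peter Lorre"), ("Metropolis", "Brigitte Helm")],
}

# GAV: each mediated relation is defined as a union of queries over the sources.
GAV_VIEWS = {
    "Actor": [
        lambda src: src["R1"],   # first branch of the union
        lambda src: src["S2"],   # second branch of the union
    ],
}


def answer(mediated_relation, predicate=lambda row: True):
    """Unfold the mediated relation into its source queries and union the results."""
    rows = set()
    for source_query in GAV_VIEWS[mediated_relation]:
        rows.update(row for row in source_query(SOURCES) if predicate(row))
    return sorted(rows)


result = answer("Actor", lambda row: row[0] == "Metropolis")
```

The set union removes the row that both hypothetical sources share, which mirrors SQL `Union` (as opposed to `Union All`).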

Local-as-View (LAV)
[Figure: mediated schema (Title, Actor, …) above sources R1, R2, R3, R4, R5.]
    Create View R1 As
      Select title, name
      From Title Join Actor
      Where Year > 1970

    Create View R5 As
      Select *
      From Movie
      Where lang = 'German'
(GLAV)
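
In LAV the sources are described as views over the mediated schema, so one early step in answering a query is deciding which sources can contribute at all. The sketch below illustrates only that pruning step, under strong simplifying assumptions: it keeps just the selection conditions of the two views on the slide, ignores which mediated relations each view ranges over, and does not produce a rewritten query. `SourceDescription` and `relevant_sources` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceDescription:
    """What a source promises to contain, stated over the mediated schema."""
    name: str
    year_greater_than: Optional[int] = None  # source only holds rows with Year > this
    lang: Optional[str] = None               # source only holds rows in this language


# Selection conditions paraphrased from the two views on the slide.
DESCRIPTIONS = [
    SourceDescription("R1", year_greater_than=1970),
    SourceDescription("R5", lang="German"),
]


def relevant_sources(year=None, lang=None):
    """Return the sources whose description does not contradict the query's equality constraints."""
    relevant = []
    for d in DESCRIPTIONS:
        if year is not None and d.year_greater_than is not None and year <= d.year_greater_than:
            continue  # the source cannot hold rows for this year
        if lang is not None and d.lang is not None and lang != d.lang:
            continue  # the source cannot hold rows in this language
        relevant.append(d.name)
    return relevant
```

For example, a query for German movies from 1965 would skip R1 and keep R5, while a query for French movies from 1999 would do the opposite.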

Adaptive Query Processing
- Problem: no statistics, and the network is unstable.
- Cannot "plan and then execute"; need to adapt the plan during execution.
- The idea was already in Ingres (1976).
- Proposed before data integration: Cole and Graefe (choose nodes); Kabra and DeWitt (mid-query re-optimization).

Convergent Query Processing [Zack Ives, Ph.D 2002, U. Penn]
- The processor starts with an initial plan.
- It monitors execution, accumulating statistics.
- It switches plans when a better one is found, reusing intermediate results.
- A final cleanup phase follows.
- Possible transformation types: plan partitioning, data partitioning, low-level rescheduling.
- Can be aggressive (e.g., with aggregations).
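
The monitor-and-switch loop can be illustrated with a much simpler technique than convergent query processing itself: re-ordering a chain of filters between batches, based on the pass rates observed so far. This is a toy sketch under that assumption; `adaptive_filter` and its parameters are hypothetical, and it does not model plan partitioning, result reuse, or the cleanup phase.

```python
def adaptive_filter(rows, predicates, batch_size=100):
    """Apply all predicates to rows, re-ordering them between batches.

    After each batch the predicates are sorted by observed pass rate, so the
    most selective one runs first. The output is the same for any order.
    """
    seen = [0] * len(predicates)     # how often each predicate was evaluated
    passed = [0] * len(predicates)   # how often each predicate returned True
    order = list(range(len(predicates)))
    out = []
    for start in range(0, len(rows), batch_size):
        for row in rows[start:start + batch_size]:
            keep = True
            for i in order:
                seen[i] += 1
                if predicates[i](row):
                    passed[i] += 1
                else:
                    keep = False
                    break
            if keep:
                out.append(row)
        # Re-plan: lowest observed pass rate first; predicates not yet
        # evaluated get rate 0.0 so they are tried early.
        order.sort(key=lambda i: passed[i] / seen[i] if seen[i] else 0.0)
    return out, order


rows = list(range(1000))
predicates = [lambda x: x % 2 == 0, lambda x: x % 100 == 0]
kept, final_order = adaptive_filter(rows, predicates)
```

Here the initial order runs the weak filter first; after the first batch the statistics favor the second predicate, and the executor switches without changing the result.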

XML Query Processing
- XML facilitates integration.
- The mediator query processor may manipulate XML directly.
Progress on:
- Publishing to XML, XML views on relations
- Physical algebras for manipulating XML
- Optimization of XQuery

The Commercial World
- Some startups: Nimble, MetaMatrix, Calixa, Enosys, …
- Big guys making announcements: IBM, BEA, MS (Oracle still being defiant).
- Progress: analysts have a buzzword: EII.
- Challenges: Integration with EAI? Yet another middleware? Horizontal vs. vertical?

Outline
Recent progress:
- Mediation languages
- Adaptive query processing
- XML data management
- Commercial
Current challenges:
- Flexible architectures: peer data management
- Getting to the root of semantic heterogeneity: schema mapping

Peer Data-Management
PDMS: a network of peers. Peers can:
- Export base data
- Provide views on base data
- Serve as logical mediators for other peers
A peer can be both a server and a client.
Semantic relationships are specified locally (between small sets of peers).

Network of Mappings (Piazza)
[Figure: peers UW, Stanford, DBLP, Saarbruecken, Leipzig, CiteSeer, and Berlin connected by GAV, LAV, and GLAV mappings; a query Q is reformulated into queries Q' and Q''.]

Advantages of PDMS
- No need for a central mediated schema.
- Can map data opportunistically, as is most convenient.
- Queries are posed using the peer's schema.
- Answers come from anywhere in the system.
- Semantic Web.
This is not P2P file sharing:
- Data has rich semantics.
- Membership is not as dynamic.

Schema Mediation
[Figure: the peer network of UW, Stanford, DBLP, Saarbruecken, Leipzig, CiteSeer, and Berlin, with GAV, LAV, and GLAV mappings; a query Q is reformulated into Q' and Q''.]
When can LAV and GAV be combined to form such a network structure? [ICDE-03], [WWW-03 for XML]

Query Optimization
[Figure: the peer network of UW, Stanford, DBLP, Saarbruecken, Leipzig, CiteSeer, and Berlin; a query Q is reformulated into Q' and Q''.]
- Problems: redundant paths, expensive reformulation.
- Possible solution: pre-compose some paths.
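
The redundant-paths problem can be seen by enumerating the mapping paths between two peers. The sketch below is illustrative only: the peer names come from the slide, but the edges in `MAPPINGS` are hypothetical, since the transcript does not record which peers are actually mapped to each other.

```python
# Hypothetical mapping edges between peers named on the slide.
MAPPINGS = {
    "UW": ["Stanford", "DBLP"],
    "Stanford": ["DBLP"],
    "DBLP": ["CiteSeer"],
    "CiteSeer": [],
}


def mapping_paths(graph, start, goal, path=None):
    """Enumerate all acyclic chains of mappings leading from start to goal."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    paths = []
    for nxt in graph.get(start, []):
        if nxt not in path:  # never revisit a peer, so every path is acyclic
            paths.extend(mapping_paths(graph, nxt, goal, path))
    return paths


paths = mapping_paths(MAPPINGS, "UW", "CiteSeer")
```

With these assumed edges, a query posed at UW can reach CiteSeer along two different chains of mappings, so a naive reformulation would do overlapping work; pre-composing a frequently used chain is one way to avoid repeating it.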

Mapping Composition
- Incredibly subtle! [w/ Madhavan]
- In general, a composition can be an infinite set of GLAV formulas.
- Results: finite in many cases; even when infinite, it often has a finite, useful encoding.
- Hence, compositions can usually be pre-optimized.

Management of Updates [w/ Mork, Gribble]
[Figure: the peer network of UW, Stanford, DBLP, Saarbruecken, Leipzig, CiteSeer, and Berlin; a query Q is reformulated into Q' and Q''.]
Problem: when updates are generated, we don't know who will use them.
Solution:
- Represent updates as first-class citizens
- Complement with boosters
- Rules for usage

Other Research Issues
[Figure: the peer network of UW, Stanford, DBLP, Saarbruecken, Leipzig, CiteSeer, and Berlin; a query Q is reformulated into Q' and Q''.]
- Intelligent data placement
- Management of mapping networks
- Improving networks: finding additional connections
- Indexing of views

Schema Matching/Mapping
Given:
- S1 and S2: a pair of schemas/DTDs/ontologies, …
- Possibly, accompanying data instances
- Additional domain knowledge
Find:
- A match between S1 and S2: a set of correspondences between the terms.
- Ultimately, a mapping: it should enable translating data between the schemas.

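Example: House Listings
[Figure: two house-listing schemas with elements including house, location, view, address, front, back, num-baths, full-baths, and half-baths (sample values: Water view, Lake, Mountains). Correspondences are labeled as 1-1 mappings or non 1-1 mappings; one is marked "?".]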

Motivations
- Heart of any data sharing architecture: virtual, warehouse, messaging, web services, semantic web
- Translation of legacy data, EAI, …
- Key operator in model management: an algebra for manipulating models of data. See [Bernstein, CIDR-03], Melnik et al. [SIGMOD 03].
- Currently a bottleneck: done mostly by hand.

Approaches to Matching
Matching is hard because the schema does not fully capture the semantics.
Many techniques have been proposed. They consider similarities in:
- Attribute names (synonyms)
- Data values, data types
- Relationships between columns
- Structural similarities
- Anything a human expert would try!
Hence, let's try to simulate a human.
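
As a small illustration of the first cue in the list (attribute names and synonyms), here is a sketch of a name-based matcher. It is not one of the systems discussed in the talk: the synonym table, the threshold, and the assignment of the house-listing element names to two schemas are assumptions made for the example. String similarity comes from Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table; a real matcher would use a thesaurus or learned data.
SYNONYMS = {frozenset({"location", "address"})}


def name_similarity(a, b):
    """Score two attribute names in [0, 1] using synonyms, then string similarity."""
    a, b = a.lower(), b.lower()
    if a == b or frozenset({a, b}) in SYNONYMS:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def best_matches(schema1, schema2, threshold=0.5):
    """For each element of schema1, propose its most similar element of schema2."""
    matches = {}
    for a in schema1:
        score, b = max((name_similarity(a, b), b) for b in schema2)
        if score >= threshold:
            matches[a] = (b, round(score, 2))
    return matches


matches = best_matches(
    ["location", "num-baths"],
    ["address", "full-baths", "half-baths"],
)
```

A matcher of this shape can only propose 1-1 correspondences. It pairs num-baths with a single element, whereas the house-listing example suggests num-baths relates to both full-baths and half-baths, which is exactly the non 1-1 case that names alone do not resolve.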

Philosophy of Solutions
- Effective schema matching requires a principled combination of techniques.
- Like human experts, the matcher should improve over time: learn from seeing many schemas and matches.
- LSD [Doan, Ph.D 2002, U. of Illinois]
- COMA [Do et al.]

Corpus Based Solution [Madhavan, Bernstein, Chen, Halevy, Shenoy]
Collect a corpus of schemas and matches.
Learn from the corpus:
- Create a classifier for every corpus element.
- Use multi-strategy learning.
Given S1 and S2:
- Compare each schema element to the corpus elements.
- If two elements' similarity vectors are close, then maybe they match each other.
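
The similarity-vector step can be sketched as follows. This is a toy stand-in, not the authors' method: the learned per-element classifiers are replaced by plain string similarity (`difflib.SequenceMatcher`), the corpus is a hypothetical list of four element names, and closeness is measured with cosine similarity, which the slide does not specify.

```python
from difflib import SequenceMatcher
from math import sqrt

# Hypothetical corpus elements; a real corpus would hold many schemas and matches.
CORPUS = ["address", "price", "bathrooms", "phone"]


def score(corpus_element, name):
    """Stand-in for a learned classifier: how strongly does name resemble the corpus element?"""
    return SequenceMatcher(None, corpus_element, name.lower()).ratio()


def similarity_vector(name):
    """Describe a schema element by its score against every corpus element."""
    return [score(c, name) for c in CORPUS]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def closeness(a, b):
    """Two elements are candidate matches when their similarity vectors are close."""
    return cosine(similarity_vector(a), similarity_vector(b))
```

The point of the indirection is that two elements never have to resemble each other directly; they only need to relate to the corpus in similar ways.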

Learning from Corpus vs. Learning from the Schemas

Finding Different Matches

Other Corpus Based Tools
Conjecture: a corpus of schemas can be the basis for many useful tools.
- Auto-complete: I start creating a schema (or show sample data), and the tool suggests a completion.
- Query reformulation: I ask a query using my terminology, and it gets reformulated appropriately.
- Improving structured queries over structured web sites (and focused crawling, a la BINGO!).

The Corpus
Contents: schemas, ontologies, meta-data, data, queries.
Sample statistics:
- How often does a word appear as a relation name?
- When it does, what tend to be the attribute names?
- What other tables are there?
- What are the foreign keys?
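
The first three sample statistics can be computed with simple counting. The sketch below assumes a corpus represented as a list of schemas, each a dictionary from relation name to attribute names; that representation, the example schemas, and the function name are hypothetical. Foreign keys are left out because the representation does not record them.

```python
from collections import Counter, defaultdict

# Hypothetical corpus: each schema maps relation names to attribute names.
CORPUS = [
    {"movie": ["title", "year", "lang"], "actor": ["name", "title"]},
    {"movie": ["title", "director"], "review": ["title", "rating"]},
    {"listing": ["movie", "price"]},
]


def corpus_statistics(corpus):
    """Count relation-name usage, typical attributes, and co-occurring relations."""
    relation_counts = Counter()            # how often a word is a relation name
    attrs_by_relation = defaultdict(Counter)  # attribute names seen for that relation
    co_relations = defaultdict(Counter)    # other relations in the same schema
    for schema in corpus:
        for relation, attributes in schema.items():
            relation_counts[relation] += 1
            attrs_by_relation[relation].update(attributes)
            co_relations[relation].update(r for r in schema if r != relation)
    return relation_counts, attrs_by_relation, co_relations


relation_counts, attrs_by_relation, co_relations = corpus_statistics(CORPUS)
```

Note that in the third hypothetical schema "movie" appears as an attribute rather than a relation, so it is not counted in `relation_counts`; distinguishing these roles is what the first statistic on the slide asks for.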

Conclusion: Crossing the Structure Chasm
- Data authoring, querying and sharing is everywhere; done by novices too.
- Semantic web: the extreme example.
[Figure: corpus of schemas; schema mapping.]

Some References
- Piazza: WebDB 01, ICDE 03, WWW 03
- The Structure Chasm: CIDR 03
- Mediation surveys: VLDB Journal 01; Lenzerini, PODS 02 tutorial
- Schema matching: Rahm and Bernstein, VLDB Journal 01