Data Integration

1 Data Integration

2 Motivating Examples An organization has, on average, 49 databases –they may cover the same topic but use different vocabularies and different schemas –how can we access them as if accessing a single database? Hundreds of online bookstores –amazon.com, barnes&noble.com, etc. –how can we query them as if querying a single source? Hundreds of CS websites in the US, in text format –can we consolidate information about all of them and query them as if querying a giant relational database?

3 Data Integration The general problem –how can we access a set of heterogeneous, distributed, autonomous databases as if accessing a single database? Arises in numerous contexts –on the Web, at enterprises, military, scientific cooperation, bio-informatics domains, e-commerce, etc. Currently very hot –in both database research and industry

4 What is Data Integration Providing –Uniform (same query interface to all sources) –Access to (queries; eventually updates too) –Multiple (we want many, but 2 is hard too) –Autonomous (DBA doesn’t report to you) –Heterogeneous (data models are different) –Structured (or at least semi-structured) –Data Sources (not only databases).

5 The Problem: Data Integration Uniform query capability across autonomous, heterogeneous data sources on LAN, WAN, or Internet

6 Motivation(s) Enterprise data integration; web-site construction. WWW: –Comparison shopping –Portals integrating data from multiple sources –B2B, electronic marketplaces Science and culture: –Medical genetics: integrating genomic data –Astrophysics: monitoring events in the sky. –Environment: Puget Sound Regional Synthesis Model –Culture: uniform access to all cultural databases produced by countries in Europe.

7 Current State of Affairs Mostly ad-hoc programming: create a special solution for every case; pay consultants a lot of money. Long-standing challenge in the DB community AI/WWW communities are on board Annual workshops, vision papers,... Companies –Informatica, many others,...

8 A Brief History Many early ad-hoc solutions Converged into two approaches –data warehousing vs. virtual DI systems Semi-structured data, XML Wrappers Other issues: query optimization, schema matching,... Current directions –DI for specialized domains (e.g., bioinformatics) –on-the-fly DI, entity-centric DI New types of data sharing systems –P2P systems, Semantic Web

9 Data warehousing vs. Virtual DI systems

10 Data Warehouse Architecture Data sources → data extraction programs → data cleaning/scrubbing → relational database (warehouse) → user queries; OLAP / decision support / data cubes / data mining

11 Data warehousing Data warehousing: load all the data periodically into a warehouse. –6-18 months lead time –Separates operational DBMS from decision support DBMS (so it is not only a solution to data integration). –Performance is good; data may not be fresh. –Need to clean and scrub your data.

12 The Virtual Integration Architecture Leave the data in the sources. When a query comes in: –Determine the relevant sources to the query –Break down the query into sub-queries for the sources. –Get the answers from the sources, and combine them appropriately. Data is fresh. Challenge: many
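
The three steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source names, the in-memory data, and the helpers `make_lookup`, `relevant_sources`, and `answer` are all hypothetical, and real sources would sit behind wrappers rather than Python lists.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def make_lookup(rows: List[Record]) -> Callable[[str], List[Record]]:
    """Build a stand-in wrapper that answers 'books by author' over local rows."""
    return lambda author: [r for r in rows if r["author"] == author]


# Hypothetical catalog: the topics each source covers and how to query it.
CATALOG = {
    "store_a": {"topics": {"books"}, "lookup": make_lookup([
        {"title": "Foundation", "author": "Isaac Asimov", "price": 12.0},
        {"title": "Dune", "author": "Frank Herbert", "price": 14.0},
    ])},
    "store_b": {"topics": {"books"}, "lookup": make_lookup([
        {"title": "I, Robot", "author": "Isaac Asimov", "price": 18.0},
        {"title": "Foundation", "author": "Isaac Asimov", "price": 11.0},
    ])},
    "car_dealer": {"topics": {"cars"}, "lookup": make_lookup([])},
}


def relevant_sources(topic: str) -> List[str]:
    """Step 1: keep only the sources whose contents match the query topic."""
    return sorted(name for name, s in CATALOG.items() if topic in s["topics"])


def answer(author: str, max_price: float) -> List[Record]:
    """Steps 2 and 3: send a sub-query to each source, then combine the answers."""
    combined: List[Record] = []
    for name in relevant_sources("books"):
        for row in CATALOG[name]["lookup"](author):  # sub-query run at the source
            if row["price"] < max_price:             # filter applied at the mediator
                combined.append({**row, "source": name})
    return sorted(combined, key=lambda r: r["price"])


results = answer("Isaac Asimov", 15.0)
```

Here the price filter runs at the mediator because the stand-in sources only support lookups by author; a source with richer query capabilities could evaluate it locally instead.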

13 Virtual Integration Architecture Data sources, each behind a wrapper. Sources can be: relational, hierarchical (IMS), structured files, web sites. Mediator: user queries, mediated schema, data source catalog, reformulation engine, optimizer, execution engine. Which data model?

14 Architecture of (Virtual) Data Integration System A global query interface sits over the sources' own query interfaces (query interface 1, query interface 2, query interface 3) for amazon.com, powell.com, and bn.com. Example query: Find books written by Isaac Asimov & priced under $15

15 A Brief History Many early ad-hoc solutions Converged into two approaches –data warehousing vs. virtual DI systems Semi-structured data, XML Wrappers Other issues: query optimization, schema matching,... Current directions –DI for specialized domains (e.g., bioinformatics) –on-the-fly DI, entity-centric DI New types of data sharing systems –P2P systems, Semantic Web

16 Semi-structured data XML

17 Semi-structured Data What should be the underlying data model for DI contexts? –relational model is not an ideal choice Developed semi-structured data model –started with the OEM (object exchange model) Then XML came along It is now the most well-known semi-structured data model Generating much research in the DB community

18 HTML Bibliography Foundations of Databases Abiteboul, Hull, Vianu Addison Wesley, 1995 Data on the Web Abiteboul, Buneman, Suciu Morgan Kaufmann, 1999 HTML is hard for applications

19 XML Foundations… Abiteboul Hull Vianu Addison Wesley 1995 … XML describes the content: easy for applications

20 DTDs as Grammars Same thing as:
db ::= (book|publisher)*
book ::= (title,author*,year?)
title ::= string
author ::= string
year ::= string
publisher ::= string
A DTD is an EBNF (Extended BNF) grammar. An XML tree is precisely a derivation tree. XML documents that have a DTD and conform to it are called valid.
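
To make the grammar reading concrete, here is a minimal sketch that checks a document against the two content models on this slide by treating each one as a regular expression over child tag names. It is not a real DTD validator (Python's `xml.etree.ElementTree` does not validate against DTDs); the `CONTENT_MODELS` table, the `is_valid` helper, and the sample documents are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the slide, written as regexes over child tag names.
# Each tag name is followed by a space so that names cannot run together.
CONTENT_MODELS = {
    "db": r"(book |publisher )*",
    "book": r"title (author )*(year )?",
}


def is_valid(elem: ET.Element) -> bool:
    """Check elem and all its descendants against CONTENT_MODELS.

    Elements without an entry (title, author, year, publisher) are treated
    as plain strings and therefore must have no child elements.
    """
    children = "".join(child.tag + " " for child in elem)
    model = CONTENT_MODELS.get(elem.tag, r"")
    if re.fullmatch(model, children) is None:
        return False
    return all(is_valid(child) for child in elem)


good = ET.fromstring(
    "<db><book><title>Data on the Web</title><author>Abiteboul</author>"
    "<author>Buneman</author><year>1999</year></book>"
    "<publisher>Morgan Kaufmann</publisher></db>"
)
# year before title violates book ::= (title,author*,year?)
bad = ET.fromstring("<db><book><year>1999</year><title>T</title></book></db>")
```

Calling `is_valid(good)` succeeds because the element tree is a derivation tree of the grammar, while `is_valid(bad)` fails at the `book` element.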

21 More on DTDs as Grammars <!DOCTYPE paper [ ]> … XML documents can be nested arbitrarily deep

22 XML for Representing Data Relational table persons:
name | phone
John | 3634
Sue | 6343
Dick | 6363
XML: a persons element with one row element per tuple; each row has a name child (e.g., “John”) and a phone child (e.g., 3634)
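
As a small sketch of this mapping, the following Python builds the XML form of the persons table with the standard library. The element names `persons`, `row`, `name`, and `phone` follow the labels on the slide; the variable names are just for illustration.

```python
import xml.etree.ElementTree as ET

# The three tuples of the persons(name, phone) table from the slide.
rows = [("John", "3634"), ("Sue", "6343"), ("Dick", "6363")]

persons = ET.Element("persons")
for name, phone in rows:
    row = ET.SubElement(persons, "row")      # one <row> per tuple
    ET.SubElement(row, "name").text = name   # attribute names become tags
    ET.SubElement(row, "phone").text = phone

xml_text = ET.tostring(persons, encoding="unicode")
```

Note how the column names, which appear once in the relational schema, are repeated as tags inside every row of `xml_text`.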

23 XML vs Data Models XML is self-describing Schema elements become part of the data –Relational schema: persons(name,phone) –In XML, the tags <persons>, <name>, <phone> are part of the data, and are repeated many times Consequence: XML is much more flexible XML = semistructured data

24 Semi-structured Data Explained Missing attributes: John 1234, Joe ← no phone! Repeated attributes: Mary ← two phones!

25 Semistructured Data Explained Attributes with different types in different objects: John Smith 1234 ← structured name! Nested collections (no 1NF) Heterogeneous collections: –<db> contains both <book>s and <publisher>s
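
The irregularities listed on these two slides (a missing phone, a repeated phone, a structured name) can all be handled by code that does not assume a fixed schema. The sketch below is an assumption-laden illustration: the element names, Mary's two phone numbers, and the `first`/`last` split of the structured name are invented, since the original markup did not survive in this transcript.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<persons>"
    "<row><name>John</name><phone>1234</phone></row>"
    "<row><name>Joe</name></row>"                                    # no phone
    "<row><name>Mary</name><phone>2345</phone><phone>3456</phone></row>"  # two phones
    "<row><name><first>John</first><last>Smith</last></name>"        # structured name
    "<phone>1234</phone></row>"
    "</persons>"
)


def phones_by_name(root: ET.Element) -> dict:
    """Map each person's name to a list of phones (possibly empty)."""
    result = {}
    for row in root.findall("row"):
        name_elem = row.find("name")
        # itertext() flattens both a plain and a structured name.
        name = " ".join(t.strip() for t in name_elem.itertext() if t.strip())
        result[name] = [p.text for p in row.findall("phone")]
    return result


phones = phones_by_name(doc)
```

A relational table would need NULLs, a second table, or extra columns to hold the same data; the XML document simply varies from row to row.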

26 XML Data vs. E/R, ODL, Relational Q: is XML better or worse? A: it serves different purposes –E/R, ODL, Relational models: for centralized processing, when we control the data –XML: for data sharing between different systems, when we do not have control over the entire data, e.g. on the Web Do NOT use XML to model your data! Use E/R, ODL, or relational instead.

27 Exporting Relational Data to XML Product(pid, name, weight) Company(cid, name, address) Makes(pid, cid, price) Diagram: product and company entities linked by the makes relationship

28 Export data grouped by companies GizmoWorks Tacoma gizmo … Bang Kirkland gizmo … … Redundant representation of products
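
A minimal sketch of the company-grouped export, assuming the three relations from the previous slide. The tuple values other than the names and cities shown on this slide (ids, weight, prices) are invented, and the element names `db`, `company`, `product`, `name`, `address`, and `price` are assumptions, since the slide's actual tags were lost.

```python
import xml.etree.ElementTree as ET

# Product(pid, name, weight), Company(cid, name, address), Makes(pid, cid, price)
product = {1: ("gizmo", 10)}                      # weight is invented
company = {1: ("GizmoWorks", "Tacoma"), 2: ("Bang", "Kirkland")}
makes = [(1, 1, 19.99), (1, 2, 22.99)]            # prices are invented

db = ET.Element("db")
for cid, (cname, address) in company.items():
    c = ET.SubElement(db, "company")
    ET.SubElement(c, "name").text = cname
    ET.SubElement(c, "address").text = address
    # Nest every product this company makes under the company element.
    for pid, made_by, price in makes:
        if made_by == cid:
            p = ET.SubElement(c, "product")
            ET.SubElement(p, "name").text = product[pid][0]
            ET.SubElement(p, "price").text = str(price)

exported = ET.tostring(db, encoding="unicode")
```

The single `gizmo` tuple from Product appears once under each company that makes it, which is the redundancy this slide points out.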

29 The DTD

30 Export Data by Products Gizmo GizmoWorks Tacoma Bang Kirkland … OneClick … Redundant representation of companies

31 Which One Do We Choose ? The structure of the XML data is determined by agreement, with our partners, or dictated by committees –Many XML dialects (called applications) XML Data is often nested, irregular, etc No normal forms for XML

32 XML Query Languages XPath, XML-QL, XQuery

33 A Brief History Many early ad-hoc solutions Converged into two approaches –data warehousing vs. virtual DI systems Semi-structured data, XML Wrappers Other issues: query optimization, schema matching,... Current directions –DI for specialized domains (e.g., bioinformatics) –on-the-fly DI, entity-centric DI New types of data sharing systems –P2P systems, Semantic Web

34 Wrappers Information Extraction

35 Virtual Integration Architecture Data sources, each behind a wrapper. Sources can be: relational, hierarchical (IMS), structured files, web sites. Mediator: user queries, mediated schema, data source catalog, reformulation engine, optimizer, execution engine. Which data model?

36 Wrapper Programs Task: to communicate with the data sources and do format translations. They are built w.r.t. a specific source. They can sit either at the source or at the mediator. Often hard to build (very little science). Can be “intelligent”: perform source-specific optimizations.

37 Example Transform: Introduction to DB Phil Bernstein Eric Newcomer Addison Wesley, 1999 into: Introduction to DB Phil Bernstein Eric Newcomer Addison Wesley 1999
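
The markup on both sides of this example did not survive in the transcript, so the following is only a sketch of what such a wrapper might look like. The input HTML (title in `<b>`, authors in `<i>`), the output element names, and the `wrap` function are all assumptions; a real wrapper would be written against the actual page layout of one specific source.

```python
import re
import xml.etree.ElementTree as ET

# Assumed source markup for the book entry shown on the slide.
html = ("<b>Introduction to DB</b> <i>Phil Bernstein</i> "
        "<i>Eric Newcomer</i> Addison Wesley, 1999")


def wrap(page: str) -> ET.Element:
    """Translate one presentation-oriented entry into a structured record."""
    book = ET.Element("book")
    ET.SubElement(book, "title").text = re.search(r"<b>(.*?)</b>", page).group(1)
    for author in re.findall(r"<i>(.*?)</i>", page):
        ET.SubElement(book, "author").text = author
    # Whatever follows the last closing tag is read as "publisher, year".
    tail = page.rsplit(">", 1)[1]
    publisher, year = (part.strip() for part in tail.rsplit(",", 1))
    ET.SubElement(book, "publisher").text = publisher
    ET.SubElement(book, "year").text = year
    return book


book = wrap(html)
```

The regular expressions encode layout conventions of this one source, which is why hand-written wrappers like this tend to break when the page design changes.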

38 Wrapper Construction Huge amount of research in the past decade Two major approaches –machine learning: typically requires some hand-labeled data –data-intensive, completely automatic Different focuses –pull out each record (i.e., segment page into records) –pull out fields in each record –remove junk portions (ads, etc.) Current solutions are still brittle Unclear whether “standards” such as XML & Web services will eliminate the problem –the need likely will still remain

39 Information Extraction If the source cannot be wrapped with a grammar or some easy-to-parse rules –must do information extraction A huge amount of research in the AI community

40 A Brief History Many early ad-hoc solutions Converged into two approaches –data warehousing vs. virtual DI systems Semi-structured data, XML Wrappers Other issues: query optimization, schema matching,... Current directions –DI for specialized domains (e.g., bioinformatics) –on-the-fly DI, entity-centric DI New types of data sharing systems –P2P systems, Semantic Web

41 Other Issues

42 Data Source Catalog Contains all meta-information about the sources: –Logical source contents (books, new cars). –Source capabilities (can answer SQL queries) –Source completeness (has all books). –Physical properties of source and network. –Statistics about the data (like in an RDBMS) –Source reliability –Mirror sources –Update frequency.
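
One way to picture a catalog entry is as a record with a field for each kind of meta-information listed above. The sketch below is hypothetical throughout: the class name `SourceEntry`, its fields, the sample sources, and the ranking rule in `sources_for` are illustrative choices, not a description of any particular system.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SourceEntry:
    """One catalog record; field names are illustrative, not from the slides."""
    name: str
    contents: Set[str]            # logical contents, e.g. {"books"}
    answers_sql: bool             # capability: does the source accept SQL?
    complete: bool = False        # e.g. "has all books"
    reliability: float = 1.0      # assumed fraction of requests that succeed
    mirrors: List[str] = field(default_factory=list)
    updates_per_day: float = 0.0


CATALOG = [
    SourceEntry("store_a", {"books"}, answers_sql=True, complete=True),
    SourceEntry("store_b", {"books"}, answers_sql=False, reliability=0.9,
                mirrors=["store_b_mirror"]),
    SourceEntry("dealer", {"new cars"}, answers_sql=True),
]


def sources_for(topic: str) -> List[str]:
    """Relevant sources for a topic, complete and reliable ones first."""
    hits = [s for s in CATALOG if topic in s.contents]
    hits.sort(key=lambda s: (not s.complete, -s.reliability))
    return [s.name for s in hits]
```

For example, `sources_for("books")` skips the car dealer entirely and lists the complete source first, which is the kind of pruning and ordering the catalog is meant to support.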

43 Content Descriptions User queries refer to the mediated schema. Data is stored in the sources in a local schema. Content descriptions provide the semantic mappings between the different schemas. Data integration system uses the descriptions to translate user queries into queries on the sources.

44 Desiderata from Source Descriptions Expressive power: distinguish between sources with closely related data. Hence, be able to prune access to irrelevant sources. Easy addition: make it easy to add new data sources. Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively.

45 Reformulation Problem Given: –A query Q posed over the mediated schema –Descriptions of the data sources Find: –A query Q’ over the data source relations, such that: Q’ provides only correct answers to Q, and Q’ provides all possible answers to Q given the sources.

46 Approaches to Specifying Source Descriptions Global-as-view: express the mediated schema relations as a set of views over the data source relations Local-as-view: express the source relations as views over the mediated schema. Can be combined with no additional cost.

47 Global-as-View Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Sources: S1(title, dir, year, genre), S2(title, dir, year, genre), S3(title, dir), S4(title, year, genre)
Create View Movie AS
select * from S1
union
select * from S2
union
select S3.title, S3.dir, S4.year, S4.genre from S3, S4 where S3.title=S4.title
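
The view definition on this slide can be tried out directly in SQLite. This is a sketch under stated assumptions: the four sources are modeled as local tables in one in-memory database (real sources would be remote and autonomous), and the sample movie tuples are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Stand-ins for the four sources on the slide.
    CREATE TABLE S1(title, dir, year, genre);
    CREATE TABLE S2(title, dir, year, genre);
    CREATE TABLE S3(title, dir);
    CREATE TABLE S4(title, year, genre);

    INSERT INTO S1 VALUES ('Annie Hall', 'Allen', 1977, 'Comedy');
    INSERT INTO S2 VALUES ('Alien', 'Scott', 1979, 'Horror');
    INSERT INTO S3 VALUES ('Vertigo', 'Hitchcock');
    INSERT INTO S4 VALUES ('Vertigo', 1958, 'Thriller');

    -- Global-as-view: the mediated relation is a view over the sources.
    CREATE VIEW Movie AS
        SELECT * FROM S1
        UNION
        SELECT * FROM S2
        UNION
        SELECT S3.title, S3.dir, S4.year, S4.genre
        FROM S3, S4
        WHERE S3.title = S4.title;
""")

# A user query over the mediated schema; the engine unfolds the view.
titles = [row[0] for row in con.execute(
    "SELECT title FROM Movie WHERE year < 1978 ORDER BY title")]
```

The query mentions only `Movie`; the database expands it into queries over S1 to S4, which is the view unfolding that global-as-view reformulation relies on.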

48 Global-as-View: Example 2 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Sources: S1(title, dir, year), S2(title, dir, genre)
Create View Movie AS
select title, dir, year, NULL from S1
union
select title, dir, NULL, genre from S2

49 Global-as-View: Example 3 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Source S4: S4(cinema, genre)
Create View Movie AS select NULL, NULL, NULL, genre from S4
Create View Schedule AS select cinema, NULL, NULL from S4
But what if we want to find which cinemas are playing comedies?
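
The question at the end of this slide can be checked with a short SQLite sketch. The cinema names and rows are invented, and S4 is modeled as a local table for illustration. The point is that the join on `title` between the two views can never succeed, because both sides are NULL.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE S4(cinema, genre);
    INSERT INTO S4 VALUES ('Odeon', 'Comedy');
    INSERT INTO S4 VALUES ('Rex', 'Drama');

    -- The two global-as-view definitions from the slide.
    CREATE VIEW Movie AS
        SELECT NULL AS title, NULL AS dir, NULL AS year, genre FROM S4;
    CREATE VIEW Schedule AS
        SELECT cinema, NULL AS title, NULL AS time FROM S4;
""")

# Query over the mediated schema: cinemas playing comedies.
via_mediated = con.execute("""
    SELECT s.cinema FROM Movie m, Schedule s
    WHERE m.title = s.title AND m.genre = 'Comedy'
""").fetchall()

# The same question asked of the source directly.
direct = con.execute(
    "SELECT cinema FROM S4 WHERE genre = 'Comedy'").fetchall()
```

The source knows the answer (`direct` is non-empty), but splitting S4 across two views discards the association between cinema and genre, so the mediated query returns nothing.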

50 Global-as-View Summary Query reformulation boils down to view unfolding. Very easy conceptually. Can build hierarchies of mediated schemas. You sometimes lose information. Not always natural. Adding sources is hard: you need to consider all other sources that are available.

51 Local-as-View: Example 1 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Create Source S1 AS select * from Movie
Create Source S3 AS select title, dir from Movie [S3(title, dir)]
Create Source S5 AS select title, dir, year from Movie where year > 1960 AND genre='Comedy'

52 Local-as-View: Example 2 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Source S4: S4(cinema, genre)
Create Source S4 AS select cinema, genre from Movie m, Schedule s where m.title=s.title
Now if we want to find which cinemas are playing comedies, there is hope!
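
To see why there is hope, the sketch below materializes a tiny mediated schema in SQLite and defines S4 exactly as the source description says. This is only for illustration: in a real data integration system Movie and Schedule are virtual and only S4's contents exist, and all tuples here are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Mediated schema, populated here only to illustrate the description.
    CREATE TABLE Movie(title, dir, year, genre);
    CREATE TABLE Schedule(cinema, title, time);
    INSERT INTO Movie VALUES ('Annie Hall', 'Allen', 1977, 'Comedy');
    INSERT INTO Movie VALUES ('Alien', 'Scott', 1979, 'Horror');
    INSERT INTO Schedule VALUES ('Odeon', 'Annie Hall', '19:00');
    INSERT INTO Schedule VALUES ('Rex', 'Alien', '21:00');

    -- Local-as-view: the source is described as a view over the mediated schema.
    CREATE VIEW S4 AS
        SELECT s.cinema, m.genre
        FROM Movie m, Schedule s
        WHERE m.title = s.title;
""")

# The user query over the mediated schema...
over_mediated = con.execute("""
    SELECT s.cinema FROM Movie m, Schedule s
    WHERE m.title = s.title AND m.genre = 'Comedy'
""").fetchall()

# ...and a rewriting that uses only the source S4.
over_source = con.execute(
    "SELECT cinema FROM S4 WHERE genre = 'Comedy'").fetchall()
```

Because the description records that S4 is the join of Movie and Schedule, a reformulation algorithm can recognize that the query over S4 alone gives the same cinemas as the query over the mediated schema.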

53 Local-as-View Summary Very flexible. You have the power of the entire query language to define the contents of the source. Hence, can easily distinguish between contents of closely related sources. Adding sources is easy: they’re independent of each other. Query reformulation: answering queries using views!

54 The General Problem Given a set of views V1,…,Vn, and a query Q, can we answer Q using only the answers to V1,…,Vn? Many, many papers on this problem. The best performing algorithm: The MiniCon Algorithm, (Pottinger & Levy, 2000). Great survey on the topic: (Halevy, 2001).

55 Query Optimization Closely related to query reformulation! Goal of the optimizer: find a physical plan with minimal cost. Key components in optimization: –Search space of plans –Search strategy –Cost model

56 Optimization in Distributed DBMS A distributed database (2-minute tutorial): –Data is distributed over multiple nodes, but is uniform. –Query execution can be distributed to sites. –Communication costs are significant. Consequences for optimization: –Optimizer needs to decide locality –Need to exploit independent parallelism. –Need operators that reduce communication costs (semi-joins).
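
The semi-join idea mentioned above can be sketched with two in-memory "sites". Everything here is an assumption made for illustration: the Orders and Customers relations, their tuples, and the simple count of shipped values used as a stand-in for communication cost.

```python
# Site A holds Orders(order_id, cust_id); site B holds Customers(cust_id, name, city).
orders = [(1, "c1"), (2, "c1"), (3, "c3")]
customers = [("c1", "Ann", "Tacoma"), ("c2", "Bob", "Kirkland"),
             ("c3", "Cy", "Seattle"), ("c4", "Di", "Tacoma")]


def join(order_rows, customer_rows):
    """Join orders with customers on cust_id (executed at site A)."""
    by_id = {c[0]: c for c in customer_rows}
    return [(o[0],) + by_id[o[1]] for o in order_rows if o[1] in by_id]


# Plan 1: ship every customer row from B to A, then join at A.
naive_values_shipped = len(customers) * 3
naive_result = join(orders, customers)

# Plan 2 (semi-join): A ships only its distinct join keys to B, and
# B ships back only the customer rows that match one of those keys.
keys = {o[1] for o in orders}
reduced = [c for c in customers if c[0] in keys]
semi_values_shipped = len(keys) + len(reduced) * 3
semi_result = join(orders, reduced)
```

Both plans produce the same join result, but the semi-join plan ships fewer values whenever many rows at the remote site have no join partner.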

57 DDBMS vs. Data Integration In a DDBMS, data is distributed over a set of uniform sites with precise rules. In a data integration context: –Data sources may provide only limited access patterns to the data. –Data sources may have additional query capabilities. –Cost of answering queries at sources unknown. –Statistics about data unknown. –Transfer rates unpredictable.