Download presentation
Presentation is loading. Please wait.
Published byBruce Cummings Modified over 9 years ago
1
1 Data Integration
2
2 Motivating Examples An organization has on average 49 databases –can talk about the same topic, but use different vocabularies, different schemas –how can we access them as if accessing a single db? Hundreds of online bookstores –amazon.com, barnes&noble.com, etc. –how can we query them as if querying a single source? Hundreds of CS websites in US, in text format –can we consolidate information about all of them and query them as if querying a giant relational database?
3
3 Data Integration The general problem –how can we access a set of heterogeneous, distributed, autonomous databases as if accessing a single database? Arises in numerous contexts –on the Web, at enterprises, military, scientific cooperation, bio-informatics domains, e- commerce, etc. Currently very hot –in both database research and industry
4
4 What is Data Integration Providing –Uniform (same query interface to all sources) –Access to (queries; eventually updates too) –Multiple (we want many, but 2 is hard too) –Autonomous (DBA doesn’t report to you) –Heterogeneous (data models are different) –Structured (or at least semi-structured) –Data Sources (not only databases).
5
5 The Problem: Data Integration Uniform query capability across autonomous, heterogeneous data sources on LAN, WAN, or Internet
6
6 Motivation(s) Enterprise data integration; web-site construction. WWW: –Comparison shopping –Portals integrating data from multiple sources –B2B, electronic marketplaces Science and culture: –Medical genetics: integrating genomic data –Astrophysics: monitoring events in the sky. –Environment: Puget Sound Regional Synthesis Model –Culture: uniform access to all cultural databases produced by countries in Europe.
7
7 Current State of Affairs Mostly ad-hoc programming: create a special solution for every case; pay consultants a lot of money. Long-standing challenge in the DB community AI/WWW communities are on board Annual workshops, vision papers,... Companies –Informatica, many others,...
8
8 A Brief History Many early ad-hoc solutions Converged into two approaches –data warehousing vs. virtual DI systems Semi-structured data, XML Wrappers Other issues: query optimization, schema matching,... Current directions –DI for specialized domains (e.g., bioinformatics) –on-the-fly DI, entity-centric DI New types of data sharing systems –P2P systems, Semantic Web
9
9 Data warehousing vs. Virtual DI systems
10
10 Data Warehouse Architecture Data source Data source Data source Relational database (warehouse) User queries Data extraction programs Data cleaning/ scrubbing OLAP / Decision support/ Data cubes/ data mining
11
11 Data warehousing Data warehousing: load all the data periodically into a warehouse. –6-18 months lead time –Separates operational DBMS from decision support DBMS. (not only a solution to data integration). –Performance is good; data may not be fresh. –Need to clean, scrub you data.
12
12 The Virtual Integration Architecture Leave the data in the sources. When a query comes in: –Determine the relevant sources to the query –Break down the query into sub-queries for the sources. –Get the answers from the sources, and combine them appropriately. Data is fresh. Challenge: many
13
13 Virtual Integration Architecture Data source wrapper Data source wrapper Data source wrapper Sources can be: relational, hierarchical (IMS), structure files, web sites. Mediator: User queries Mediated schema Data source catalog Reformulation engine optimizer Execution engine Which data model?
14
14 Architecture of (Virtual) Data Integration System global query interface powell.comamazon.com query interface 2 bn.com query interface 3query interface 1 Find books written by Isaac Asimov & priced under $15
15
15 A Brief History Many early ad-hoc solutions Converged into two approaches –data warehousing vs. virtual DI systems Semi-structured data, XML Wrappers Other issues: query optimization, schema matching,... Current directions –DI for specialized domains (e.g., bioinformatics) –on-the-fly DI, entity-centric DI New types of data sharing systems –P2P systems, Semantic Web
16
16 Semi-structured data XML
17
17 Semi-structured Data What should be the underlying data model for DI contexts? –relational model is not an ideal choice Developed semi-structured data model –started with the OEM (object exchange model) Then XML came along It is now the most well-known semi-structured data model Generating much research in the DB community
18
18 HTML Bibliography Foundations of Databases Abiteboul, Hull, Vianu Addison Wesley, 1995 Data on the Web Abiteboul, Buneman, Suciu Morgan Kaufmann, 1999 Bibliography Foundations of Databases Abiteboul, Hull, Vianu Addison Wesley, 1995 Data on the Web Abiteboul, Buneman, Suciu Morgan Kaufmann, 1999 HTML is hard for applications
19
19 XML Foundations… Abiteboul Hull Vianu Addison Wesley 1995 … Foundations… Abiteboul Hull Vianu Addison Wesley 1995 … XML describes the content: easy for applications
20
20 DTDs as Grammars Same thing as: A DTD is a EBNF (Extended BNF) grammar An XML tree is precisely a derivation tree XML Documents that have a DTD and conform to it are called valid db ::= (book|publisher)* book ::= (title,author*,year?) title ::= string author ::= string year ::= string publisher ::= string db ::= (book|publisher)* book ::= (title,author*,year?) title ::= string author ::= string year ::= string publisher ::= string
21
21 More on DTDs as Grammars <!DOCTYPE paper [ ]> <!DOCTYPE paper [ ]> … XML documents can be nested arbitrarily deep
22
22 XML for Representing Data John 3634 Sue 6343 Dick 6363 John 3634 Sue 6343 Dick 6363 row name phone “John”3634“Sue”“Dick”63436363 persons XML: persons
23
23 XML vs Data Models XML is self-describing Schema elements become part of the data –Relational schema: persons(name,phone) –In XML,, are part of the data, and are repeated many times Consequence: XML is much more flexible XML = semistructured data
24
24 Semi-structured Data Explained Missing attributes: Repeated attributes John 1234 Joe John 1234 Joe no phone ! Mary 2345 3456 Mary 2345 3456 two phones !
25
25 Semistructured Data Explained Attributes with different types in different objects Nested collections (no 1NF) Heterogeneous collections: – contains both s and s John Smith 1234 John Smith 1234 structured name !
26
26 XML Data v.s. E/R, ODL, Relational Q: is XML better or worse ? A: serves different purposes –E/R, ODL, Relational models: For centralized processing, when we control the data –XML: Data sharing between different systems we do not have control over the entire data E.g. on the Web Do NOT use XML to model your data ! Use E/R, ODL, or relational instead.
27
27 Exporting Relational Data to XML Product(pid, name, weight) Company(cid, name, address) Makes(pid, cid, price) productcompany makes
28
28 Export data grouped by companies GizmoWorks Tacoma gizmo 19.99 … Bang Kirkland gizmo 22.99 … … GizmoWorks Tacoma gizmo 19.99 … Bang Kirkland gizmo 22.99 … … Redundant representation of products
29
29 The DTD
30
30 Export Data by Products Gizmo GizmoWorks 19.99 Tacoma Bang 22.99 Kirkland … OneClick … Gizmo GizmoWorks 19.99 Tacoma Bang 22.99 Kirkland … OneClick … Redundant Representation of companies
31
31 Which One Do We Choose ? The structure of the XML data is determined by agreement, with our partners, or dictated by committees –Many XML dialects (called applications) XML Data is often nested, irregular, etc No normal forms for XML
32
32 XML Query Languages Xpath XML-QL Xquery
33
33 A Brief History Many early ad-hoc solutions Converged into two approaches –data warehousing vs. virtual DI systems Semi-structured data, XML Wrappers Other issues: query optimization, schema matching,... Current directions –DI for specialized domains (e.g., bioinformatics) –on-the-fly DI, entity-centric DI New types of data sharing systems –P2P systems, Semantic Web
34
34 Wrappers Information Extraction
35
35 Virtual Integration Architecture Data source wrapper Data source wrapper Data source wrapper Sources can be: relational, hierarchical (IMS), structure files, web sites. Mediator: User queries Mediated schema Data source catalog Reformulation engine optimizer Execution engine Which data model?
36
36 Wrapper Programs Task: to communicate with the data sources and do format translations. They are built w.r.t. a specific source. They can sit either at the source or at the mediator. Often hard to build (very little science). Can be “intelligent”: perform source-specific optimizations.
37
37 Example Introduction to DB Phil Bernstein Eric Newcomer Addison Wesley, 1999 Introduction to DB Phil Bernstein Eric Newcomer Addison Wesley 1999 Transform: into:
38
38 Wrapper Construction Huge amount of research in the past decade Two major approaches –machine learning: typically requires some hand-labeled data –data-intensive, completely automatic Different focuses –pull out each record (i.e., segment page into records) –pull out fields in each record –remove junk portions (ads, etc.) Current solutions are still brittle Unclear whether “standards” such as XML & Web services will eliminate the problem –the need likely will still remain
39
39 Information Extraction If the source cannot be wrapped with a grammar or some easy-to-parse rules –must do information extraction Huge research in the AI community
40
40 A Brief History Many early ad-hoc solutions Converged into two approaches –data warehousing vs. virtual DI systems Semi-structured data, XML Wrappers Other issues: query optimization, schema matching,... Current directions –DI for specialized domains (e.g., bioinformatics) –on-the-fly DI, entity-centric DI New types of data sharing systems –P2P systems, Semantic Web
41
41 Other Issues
42
42 Data Source Catalog Contains all meta-information about the sources: –Logical source contents (books, new cars). –Source capabilities (can answer SQL queries) –Source completeness (has all books). –Physical properties of source and network. –Statistics about the data (like in an RDBMS) –Source reliability –Mirror sources –Update frequency.
43
43 Content Descriptions User queries refer to the mediated schema. Data is stored in the sources in a local schema. Content descriptions provide the semantic mappings between the different schemas. Data integration system uses the descriptions to translate user queries into queries on the sources.
44
44 Desiderata from Source Descriptions Expressive power: distinguish between sources with closely related data. Hence, be able to prune access to irrelevant sources. Easy addition: make it easy to add new data sources. Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively.
45
45 Reformulation Problem Given: –A query Q posed over the mediated schema –Descriptions of the data sources Find: –A query Q’ over the data source relations, such that: Q’ provides only correct answers to Q, and Q’ provides all possible answers from to Q given the sources.
46
46 Approaches to Specifying Source Descriptions Global-as-view: express the mediated schema relations as a set of views over the data source relations Local-as-view: express the source relations as views over the mediated schema. Can be combined with no additional cost.
47
47 Global-as-View Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Create View Movie AS select * from S1 [S1(title,dir,year,genre)] union select * from S2 [S2(title, dir,year,genre)] union [S3(title,dir), S4(title,year,genre)] select S3.title, S3.dir, S4.year, S4.genre from S3, S4 where S3.title=S4.title
48
48 Global-as-View: Example 2 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Create View Movie AS [S1(title,dir,year)] select title, dir, year, NULL from S1 union [S2(title, dir,genre)] select title, dir, NULL, genre from S2
49
49 Global-as-View: Example 3 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Source S4: S4(cinema, genre) Create View Movie AS select NULL, NULL, NULL, genre from S4 Create View Schedule AS select cinema, NULL, NULL from S4. But what if we want to find which cinemas are playing comedies?
50
50 Global-as-View Summary Query reformulation boils down to view unfolding. Very easy conceptually. Can build hierarchies of mediated schemas. You sometimes loose information. Not always natural. Adding sources is hard. Need to consider all other sources that are available.
51
51 Local-as-View: example 1 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Create Source S1 AS select * from Movie Create Source S3 AS [S3(title, dir)] select title, dir from Movie Create Source S5 AS select title, dir, year from Movie where year > 1960 AND genre=“Comedy”
52
52 Local-as-View: Example 2 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Source S4: S4(cinema, genre) Create Source S4 select cinema, genre from Movie m, Schedule s where m.title=s.title. Now if we want to find which cinemas are playing comedies, there is hope!
53
53 Local-as-View Summary Very flexible. You have the power of the entire query language to define the contents of the source. Hence, can easily distinguish between contents of closely related sources. Adding sources is easy: they’re independent of each other. Query reformulation: answering queries using views!
54
54 The General Problem Given a set of views V1,…,Vn, and a query Q, can we answer Q using only the answers to V1,…,Vn? Many, many papers on this problem. The best performing algorithm: The MiniCon Algorithm, (Pottinger & Levy, 2000). Great survey on the topic: (Halevy, 2001).
55
55 Query Optimization Very related to query reformulation! Goal of the optimizer: find a physical plan with minimal cost. Key components in optimization: –Search space of plans –Search strategy –Cost model
56
56 Optimization in Distributed DBMS A distributed database (2-minute tutorial): –Data is distributed over multiple nodes, but is uniform. –Query execution can be distributed to sites. –Communication costs are significant. Consequences for optimization: –Optimizer needs to decide locality –Need to exploit independent parallelism. –Need operators that reduce communication costs (semi-joins).
57
57 DDBMS vs. Data Integration In a DDBMS, data is distributed over a set of uniform sites with precise rules. In a data integration context: –Data sources may provide only limited access patterns to the data. –Data sources may have additional query capabilities. –Cost of answering queries at sources unknown. –Statistics about data unknown. –Transfer rates unpredictable.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.