Planning for the Web I Data Integration Dan Weld University of Washington June, 2003.


© Daniel S. Weld, for PLANET 2003 Tutorial on Data Integration

Acknowledgements
- Alon Halevy
- Zack Ives
- Rao Kambhampati
- UW students

My Two Talks for Today
- Data Integration: providing uniform access to disparate data sources
  - AI meets DB
  - Answering queries using views
  - Execution in the face of uncertainty, latency
- Service Integration: invoking and composing web services
  - Query and update
  - Planning with incomplete information

Overview: Data Integration
- Motivation
- Wrappers / information extraction
- Database Review
- Modeling data sources
  - Content, completeness, capabilities
- Execution
  - Coping with incomplete statistics, latency
- History & Refs

What is Data Integration?
A system providing:
- Uniform (same query interface to all sources)
- Access to (queries; eventually updates too)
- Multiple (we want many, but 2 is hard too)
- Autonomous (DBA doesn't report to you)
- Heterogeneous (data models are different)
- Structured (or at least semi-structured)
- Data Sources (not only databases)

- User enters query
- Formulate queries
- Lycos, Excite, ...
- Collate results
- Remove duplicates
- Post-process + rank
- Download?
- Present to user

Meta-?
- Web Search
- Shopping
- Product Reviews
- Chat Finder
- Columnists (e.g. jokes, sports, ...)
- Lookup
- Event Finder
- People Finder
- Restaurant Reviews
- Job Listings
- Classifieds
- Apartment + Real Estate

Intuition: Info Integration
- Info aggregation ... on steroids!
- Want agent such that
  - User says what she wants
  - Agent decides how & when to achieve it
- Example: Show me all reviews of movies starring Matt Damon that are currently playing in Seattle
- Sources: Ebert, IMDB, Fandango, Sidewalk

Info. Aggregation vs. Integration
- More complex queries
- Dynamic generation/optimization of execution plan
- Applicable to wider range of problems
- Much harder to implement efficiently
[Figure: aggregation ("prices of laptop with ...": sort over store1, store2, ..., storeN) versus integration ("movies in Seattle starring ...": join of IMDB and Sidewalk, then join, sort, aggregate with rev1, rev2, ..., revN)]

Challenges
- User must know which sites have relevant info
- User must go to each one in turn
- Slow: sequential access takes time
- Confusing: each site has a different interface
- User must manually integrate information

Practical Motivation
- Enterprise
  - Business "dashboard"; web-site construction
- WWW
  - Comparison shopping
  - Portals integrating data from multiple sources
  - B2B, electronic marketplaces
- Science and culture
  - Medical genetics: integrating genomic data
  - Astrophysics: monitoring events in the sky
  - Environment: Puget Sound Regional Synthesis Model
  - Culture: uniform access to all cultural databases produced by countries in Europe

The Problem: Data Integration
Uniform query capability across autonomous, heterogeneous data sources on LAN, WAN, or Internet

Current Solutions
- Mostly ad-hoc programming: create a special solution for every case; pay consultants a lot of money.
- Data warehousing: load all the data periodically into a warehouse.
  - Months of lead time
  - Separates operational DBMS from decision support DBMS (not only a solution to data integration).
  - Performance is good; data may not be fresh.
  - Need to clean and scrub your data.

Data Warehouse Architecture
[Figure: data sources -> data extraction programs (data extraction, cleaning/scrubbing) -> relational database (warehouse) -> user queries; OLAP / decision support / data cubes / data mining]

Warehouse Summary
- Pro
  - Relatively simple
  - Good performance (OLAP support)
  - Mature technology (ETL industry)
- Con
  - Expensive
  - Stale data
  - Risky: most warehouse projects fail
  - Rigid architecture
  - Fixed schema
  - Must know all queries ahead of time

Architecture for Virtual Integration
- Leave the data in the sources.
- When a query comes in:
  1) Determine the sources relevant to the query.
  2) Break down the query into sub-queries for the sources.
  3) Get the answers from the sources, and combine them appropriately.
- Data is fresh.
- Challenge: performance.

Virtual Integration Architecture
[Figure: user queries -> mediator (mediated schema, data source catalog, reformulation engine, optimizer, execution engine) -> wrappers -> data sources]
- Sources can be: relational, hierarchical (IMS), structured files, web sites.
- Which data model?

Research Projects
- Garlic (IBM)
- Information Manifold (AT&T)
- Tsimmis, InfoMaster (Stanford)
- The Internet Softbot/Razor/Tukwila (UW)
- Hermes (Maryland)
- DISCO, Agora (INRIA, France)
- SIMS/Ariadne (USC/ISI)
- Emerac/Havasu (ASU)

Industry
- Nimble Technology
- Enosys Markets
- IBM starting to announce stuff
- BEA marketing announcing stuff too

Dimensions to Consider
- How many sources are we accessing?
- How autonomous are they?
- Meta-data about sources?
- Is the data structured?
- Queries or also updates?
- Requirements: accuracy, completeness, performance, handling inconsistencies.
- Closed world assumption vs. open world?

Outline
- Motivation
- Wrappers / information extraction
- Database Review
- Modeling data sources
  - Content, completeness, capabilities
- Execution
  - Coping with incomplete statistics, latency
- History & Refs

Wrapper Programs
- Task: to communicate with the data sources and do format translations.
- Built w.r.t. a specific source.
- Can sit either at the source or at the mediator.
- Often hard to build (very little science).
- Can be "intelligent": perform source-specific optimizations.

Example
Transform:
  Introduction to DB
  Phil Bernstein
  Eric Newcomer
  Addison Wesley, 1999
into:
  Introduction to DB
  Phil Bernstein
  Eric Newcomer
  Addison Wesley
  1999
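The markup on this example slide did not survive, so the following is only a minimal sketch of what a hand-written wrapper might look like. The HTML fragment, the tag choices, and the function name `wrap_book` are all assumptions made for illustration; a real wrapper is tied to the actual layout of its source.

```python
import re

# Hypothetical page fragment; the tags are an assumption, not the slide's markup.
HTML = """<b>Introduction to DB</b>
<i>Phil Bernstein</i>
<i>Eric Newcomer</i>
Addison Wesley, 1999"""

def wrap_book(html: str) -> dict:
    """Translate one source-specific HTML fragment into a structured record."""
    title = re.search(r"<b>(.*?)</b>", html).group(1)
    authors = re.findall(r"<i>(.*?)</i>", html)
    # The last line is assumed to have the form "publisher, year".
    publisher, year = html.strip().splitlines()[-1].rsplit(",", 1)
    return {"title": title, "authors": authors,
            "publisher": publisher.strip(), "year": int(year)}

book = wrap_book(HTML)
```

Any change to the page layout breaks the regular expressions, which is one reason hand-built wrappers are costly to maintain.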

Wrapper Construction
- Use PERL, or
- Generate wrappers automatically
  - Get training examples: human marks up selected pages with GUI tool
  - Use shallow NLP to create features
  - Favorite learning method: HMMs; VS on prefix, postfix strings; ??; Boosting; Co-training
- See research on information extraction

Outline
- Motivation
- Wrappers / information extraction
- Database Review
  - Relational algebra, SQL, datalog
  - Views
  - Optimization (query planning)
- Modeling data sources
  - Content, completeness, capabilities
- Execution
  - Coping with incomplete statistics, latency
- History & Refs

Traditional Database Architecture
[Figure: Query (SQL) -> Database Manager (DBMS) -> Answer (relation); the DBMS sits on top of a relational database]
Database Manager (DBMS):
- Storage mgmt
- Query processing
- View management
- (Transaction processing)

Relational Data: Terminology
Relation Product (arity = 4); each row is a tuple, each column an attribute:

  Name         Price   Category     Manufacturer
  gizmo        $19.99  gadgets      GizmoWorks
  Power gizmo  $29.99  gadgets      GizmoWorks
  SingleTouch  $       photography  Canon
  MultiTouch   $       household    Hitachi

Schema: Product(Name: string, Price: real, Category: enum, Manufacturer: string)

Relational Algebra
- Operators: tuple sets as input, new set as output
- Operations
  - Union, intersection, difference, ...
  - Selection (σ)
  - Projection (π)
  - Cartesian product (×)
  - Join (⋈)

  Manufacturer  City
  GizmoWorks    Tempe
  Canon         Kyoto
  Hitachi       Dayton
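To make the operators concrete, here is a small sketch of selection, projection, and join over lists of dictionaries. The helper names `select`, `project`, and `join` are hypothetical, and the manufacturer/city pairing is an assumption read off the order of the values on the slide.

```python
Product = [
    {"name": "gizmo", "category": "gadgets", "manufacturer": "GizmoWorks"},
    {"name": "Power gizmo", "category": "gadgets", "manufacturer": "GizmoWorks"},
    {"name": "SingleTouch", "category": "photography", "manufacturer": "Canon"},
    {"name": "MultiTouch", "category": "household", "manufacturer": "Hitachi"},
]
# Assumed pairing of manufacturers and cities.
Company = [
    {"manufacturer": "GizmoWorks", "city": "Tempe"},
    {"manufacturer": "Canon", "city": "Kyoto"},
    {"manufacturer": "Hitachi", "city": "Dayton"},
]

def select(rel, pred):
    """Selection: keep the tuples that satisfy the predicate."""
    return [t for t in rel if pred(t)]

def project(rel, attrs):
    """Projection with set semantics: keep some attributes, drop duplicates."""
    seen, out = set(), []
    for t in rel:
        key = tuple(t[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out

def join(r, s, on):
    """Equi-join on one shared attribute."""
    return [{**a, **b} for a in r for b in s if a[on] == b[on]]

# Cities of companies that make gadgets.
gadgets = select(Product, lambda t: t["category"] == "gadgets")
cities = project(join(gadgets, Company, "manufacturer"), ["city"])
```

Each operator takes relations in and returns a relation, which is what lets them be composed into plan trees.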

SQL: A query language for Relational Algebra
- Many standards out there: SQL92, SQL2, SQL3, SQL99
- Select attributes
- From relations (possibly multiple, joined)
- Where conditions (selections)
- Other features: aggregation, group-by etc.

"Find companies that manufacture products bought by Joe Blow"

  SELECT Company.name
  FROM Company, Product
  WHERE Company.name = Product.maker
    AND Product.name IN (SELECT product FROM Purchase WHERE buyer = 'Joe Blow');

Deductive Databases
- Tables viewed as predicates.
- Operations on tables expressed as "datalog" rules (Horn clauses, without function symbols)

  Enames(Name) :- Employee(Name, SSN)   [Projection]
  Wealthy-Employee(Name) :- Employee(Name, SSN), Salary(SSN, Money), Money>   [Selection]
  Ed(Name, Dname) :- Employee(Name, SSN), Employee_Dependents(SSN, Dname)   [Join]
  Emprelated(Name, Dname) :- Ed(Name, Dname)
  Emprelated(Name, Dname) :- Ed(Name, D1), Emprelated(D1, Dname)   [Recursion]
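The recursive pair of rules can be evaluated bottom-up until nothing new is derived. The sketch below assumes the recursive rule is read as Emprelated(Name, Dname) :- Ed(Name, D1), Emprelated(D1, Dname); the function name `emprelated` and the sample facts are hypothetical.

```python
def emprelated(ed):
    """Naive bottom-up evaluation of the two Emprelated rules until fixpoint."""
    result = set(ed)  # base rule: Emprelated(N, D) :- Ed(N, D)
    while True:
        # recursive rule: Emprelated(N, D) :- Ed(N, D1), Emprelated(D1, D)
        derived = {(n, d) for (n, d1) in ed for (x, d) in result if x == d1}
        if derived <= result:  # nothing new: fixpoint reached
            return result
        result |= derived

# Hypothetical Ed facts forming a chain ann -> bob -> cy -> dee.
ed = {("ann", "bob"), ("bob", "cy"), ("cy", "dee")}
closure = emprelated(ed)
```

Because datalog has no function symbols, the set of derivable facts is finite and the loop always terminates.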

More datalog terminology
- A datalog program is a set of datalog rules.
- A program with a single rule is a conjunctive query.
- We distinguish EDB predicates and IDB predicates:
  - EDBs are stored in the database and appear only in the bodies.
  - IDBs are intensionally defined and appear in both bodies and heads.

Views
- Views are relations, except that they are not physically stored.
- Uses: simplify complex queries, and define conceptually different views of the DB for different users.
- Example: purchases of telephony products:

  CREATE VIEW telephony-purchases AS
  SELECT product, buyer, seller, store
  FROM Purchase, Product
  WHERE Purchase.product = Product.name
    AND Product.category = 'telephony'

A Different View

  CREATE VIEW Seattle-view AS
  SELECT buyer, seller, product, store
  FROM Person, Purchase
  WHERE Person.city = 'Seattle'
    AND Person.name = Purchase.buyer

We can later use the view:

  SELECT name, store
  FROM Seattle-view, Product
  WHERE Seattle-view.product = Product.name
    AND Product.category = 'shoes'

What's really happening when we query a view?

Materialized Views
- Views whose corresponding queries have been executed and whose data is stored in a separate database
- Uses: caching
- Issues
  - Using views in answering queries
    - Normally, the views are available in addition to the DB (so views are local caches).
    - In information integration, views may be the only things we have access to.
      - An internet source that specializes in Woody Allen movies can be seen as a view on a database of all movies.
      - Except, there is no DB out there which contains all movies.

Query Optimization
- Goal: declarative SQL query -> imperative query execution plan
- Ideally: want to find the best plan. Practically: avoid the worst plans!
- Inputs:
  - the query
  - statistics about the data (indexes, cardinalities, selectivity factors)
  - available memory

  SELECT P.buyer
  FROM Purchase P, Person Q
  WHERE P.buyer = Q.name AND Q.city = 'seattle' AND Q.phone > ' '

[Figure: plan tree. Projection on buyer, over the selections city = 'seattle' and phone > ' ', over the join Buyer = name (Simple Nested Loops), over the leaves Purchase and Person (table scan, index scan)]

  SELECT S.sname
  FROM Reserves R, Sailors S
  WHERE R.sid = S.sid AND R.bid = 100 AND S.rating > 5

[Figure: three alternative plans for this query]
- Selections pushed below the join: bid = 100 on Reserves (scan; write to temp T1), rating > 5 on Sailors (scan; write to temp T2), join sid = sid (Sort-Merge Join), projection on sname (on-the-fly)
- Join first: join sid = sid of Reserves and Sailors (Simple Nested Loops), selections bid = 100 and rating > 5 (on-the-fly), projection on sname (on-the-fly)
- Index plan: bid = 100 on Reserves (use hash index; do not write result to temp), join sid = sid with Sailors (with pipelining), rating > 5 (on-the-fly), projection on sname (on-the-fly)

Goal of optimization: to find more efficient plans that compute the same answer.

Relational Algebra Equivalences
Allow us to choose different join orders and to "push" selections and projections ahead of joins.
- Selections:
  - $\sigma_{c_1 \wedge \dots \wedge c_n}(R) \equiv \sigma_{c_1}(\dots \sigma_{c_n}(R))$ (Cascade)
  - $\sigma_{c_1}(\sigma_{c_2}(R)) \equiv \sigma_{c_2}(\sigma_{c_1}(R))$ (Commute)
- Projections:
  - $\pi_{a_1}(R) \equiv \pi_{a_1}(\dots(\pi_{a_n}(R)))$ (Cascade)
- Joins:
  - $R \bowtie (S \bowtie T) \equiv (R \bowtie S) \bowtie T$ (Associative)
  - $(R \bowtie S) \equiv (S \bowtie R)$ (Commute)

Optimizing Joins
Q(u,x) :- R(u,v), S(v,w), T(w,x), i.e. R ⋈ S ⋈ T
- Many ways of doing a single join R ⋈ S
  - Symmetric vs. asymmetric join operations: nested join, hash join, double pipelined hash join etc.
  - Processing costs alone vs. processing + transfer costs: get R and S together vs. get R, then get just the tuples of S that will join with R ("semi-join")
- Many orders in which to do the join
  - (R join S) join T
  - (S join R) join T
  - (T join S) join R etc.
- All with different costs
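The semi-join idea can be illustrated with a toy remote source. In this sketch `S_REMOTE` stands in for a source across the network and `fetch_s` for its hypothetical query interface; shipping only the join keys of R lets the source return just the S tuples that will participate in the join.

```python
R = [("u1", 1), ("u2", 2)]                           # local tuples (u, v)
S_REMOTE = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]  # remote tuples (v, w)

def fetch_s(keys=None):
    """Stand-in for a remote call; with keys, the source filters before shipping."""
    return [t for t in S_REMOTE if keys is None or t[0] in keys]

def join(r, s):
    """Join R(u, v) with S(v, w) on v."""
    return [(u, v, w) for (u, v) in r for (v2, w) in s if v == v2]

full = fetch_s()                              # ship all of S
reduced = fetch_s(keys={v for (_, v) in R})   # semi-join: ship only matching tuples
same_answer = join(R, full) == join(R, reduced)
```

Both strategies compute the same join, but the semi-join transfers two tuples instead of four, which matters when transfer costs dominate.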

Determining Join Order
- In principle, we need to consider all possible join orderings.
- As the number of joins increases, the number of alternative plans grows rapidly; we need to restrict the search space.
- System-R: consider only left-deep join trees.
  - Left-deep trees allow us to generate all fully pipelined plans: intermediate results are not written to temporary files.
  - Not all left-deep trees are fully pipelined (e.g., SM join).
[Figure: three join trees over relations A, B, C, D]

Enumeration of Left-Deep Plans
- Naïve approach: n! combinations.
- Principle of optimality: the best plan for the join of R_1, ..., R_{n-1} will be part of the best plan for the join of R_1, ..., R_n.
- Enumerated using N passes (if N relations joined):
  - Pass 1: Find best 1-relation plan for each relation.
  - Pass 2: Find best way to join result of each 1-relation plan (as outer) to another relation. (All 2-relation plans.)
  - Pass N: Find best way to join result of a (N-1)-relation plan (outer) to the N'th relation. (All N-relation plans.)
- For each subset of relations, retain only:
  - Cheapest plan overall, plus
  - Cheapest plan for each interesting order of the tuples.
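The pass structure above amounts to dynamic programming over subsets of relations. The following is a simplified sketch, not System-R itself: the cardinalities in `CARD`, the selectivities in `SEL`, and the cost model (sum of estimated intermediate result sizes) are assumptions, and interesting orders are ignored.

```python
from itertools import combinations

CARD = {"R": 1000, "S": 100, "T": 10}  # assumed cardinalities
# Assumed join selectivities; pairs without a predicate behave as cross products.
SEL = {frozenset("RS"): 0.01, frozenset("ST"): 0.1}

def size(rels):
    """Estimated result size of joining a set of relations (independence assumed)."""
    n = 1.0
    for r in rels:
        n *= CARD[r]
    for a, b in combinations(sorted(rels), 2):
        n *= SEL.get(frozenset((a, b)), 1.0)
    return n

def best_left_deep(relations):
    """Return (cost, join order) of the cheapest left-deep plan."""
    # Pass 1: best single-relation plans (scans are treated as free here).
    best = {frozenset([r]): (0.0, (r,)) for r in relations}
    # Pass k: extend the best (k-1)-relation plan (outer) with one more relation.
    for k in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, k)):
            candidates = []
            for inner in subset:
                cost, order = best[subset - {inner}]
                candidates.append((cost + size(subset), order + (inner,)))
            best[subset] = min(candidates)  # keep only the cheapest plan per subset
    return best[frozenset(relations)]

cost, order = best_left_deep(["R", "S", "T"])  # order == ("S", "T", "R")
```

Only one plan per subset is retained, so the search visits 2^n subsets instead of n! orderings; with these numbers the plan that joins S and T first wins because it avoids the large R ⋈ S intermediate result.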

Cost Estimation
- For each plan considered, estimate cost:
  - Estimate cost of each operation in plan tree. Depends on input cardinalities.
  - Estimate size of result for each op in tree!
    - Use information about the input relations.
    - Selectivity (histograms)
    - For selections and joins, assume independence of predicates.
- System R cost estimation approach:
  - Very inexact, but works ok in practice.
  - More sophisticated techniques known now.
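Under the independence assumption, the size estimate is a product of input cardinalities and predicate selectivities. The notation below ($N_i$ for the cardinality of the $i$-th input, $s_j$ for the selectivity of the $j$-th predicate, $\mathrm{NKeys}$ for the number of distinct values) is introduced here for illustration, and the equality-join rule is the usual textbook heuristic in the System R style rather than something stated on the slide.

```latex
\[
  \widehat{|Q|} \;=\; \prod_{i=1}^{k} N_i \;\times\; \prod_{j=1}^{m} s_j ,
  \qquad
  s_{(a = b)} \;\approx\; \frac{1}{\max\bigl(\mathrm{NKeys}(a),\, \mathrm{NKeys}(b)\bigr)}
\]
```

Correlated predicates violate the independence assumption, which is one reason these estimates are inexact.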

Key Lessons in Optimization
- Classic planning / execution scenario
  - Uncertainty / replanning key for data integration
- Main points
  - Disk IO as cost metric
  - Algebraic rules / use in query transformation
  - Join ordering via dynamic programming
  - Estimating cost of plans: size of intermediate results

Integrator vs. DBMS (Reprise)
- No common schema
  - Sources with heterogeneous schemas
  - Semi-structured sources
- Legacy sources
  - Not relational-complete
  - Variety of access/process limitations
- Autonomous sources
  - No central administration
  - Uncontrolled source content overlap
  - Lack of source statistics
- Tradeoffs between query plan cost, coverage, quality etc.
  - Multi-objective cost models
- Unpredictable run-time behavior
  - Makes query execution hard

Outline
- Motivation
- Wrappers / information extraction
- Database Review
- Modeling data sources
  - Content, completeness, capabilities
- Execution
  - Coping with incomplete statistics, latency
- History & Refs

Data Source Catalog
Contains meta-information about sources:
- Logical source contents (books, new cars)
- Source capabilities (can answer SQL queries)
- Source completeness (has all books)
- Physical properties of source and network
- Statistics about the data (like in an RDBMS)
- Source reliability
- Mirror sources
- Update frequency

Content Descriptions
- User queries refer to the mediated schema.
- Source data is stored in a local schema.
- Content descriptions provide semantic mappings between the different schemas.
- The data integration system uses the descriptions to translate user queries into queries on the sources.

Desiderata for Source Descriptions
- Expressive power: distinguish between sources with closely related data. Enable pruning of access to irrelevant sources.
- Easy addition: make it easy to add new data sources.
- Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively.

Reformulation Problem
- Given:
  - A query Q posed over the mediated schema
  - Descriptions of the data sources
- Find: a query Q' over the data source relations, such that:
  - Q' provides only correct answers to Q, and
  - Q' provides all possible answers to Q given the sources.

Approaches to Specifying Source Descriptions
- Global-as-view: express the mediated schema relations as a set of views over the data source relations.
- Local-as-view: express the source relations as views over the mediated schema.
- Can be combined with no additional cost.

Global-as-View
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Sources: S1(title, dir, year, genre), S2(title, dir, year, genre), S3(title, dir), S4(title, year, genre)

  Create View Movie AS
    select * from S1
    union
    select * from S2
    union
    select S3.title, S3.dir, S4.year, S4.genre
    from S3, S4
    where S3.title = S4.title
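With a global-as-view description, a query over the mediated schema is answered by substituting the view definition. A minimal sketch of that unfolding follows; the source contents are made-up sample rows and the function names `movie` and `titles_by_genre` are hypothetical.

```python
# Illustrative source contents, matching the schemas S1..S4 of the view above.
S1 = [("Annie Hall", "Allen", 1977, "Comedy")]
S2 = [("Alien", "Scott", 1979, "Horror")]
S3 = [("Heat", "Mann")]
S4 = [("Heat", 1995, "Crime")]

def movie():
    """GAV definition of Movie(title, dir, year, genre) as a view over the sources."""
    joined = [(t, d, y, g) for (t, d) in S3 for (t2, y, g) in S4 if t == t2]
    return set(S1) | set(S2) | set(joined)

def titles_by_genre(genre):
    """A query over the mediated schema, answered by unfolding the Movie view."""
    return sorted(t for (t, _, _, g) in movie() if g == genre)

crime = titles_by_genre("Crime")
```

The mediator never stores Movie; each query re-evaluates the view body against the sources.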

Global-as-View: Example 2
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Sources: S1(title, dir, year), S2(title, dir, genre)

  Create View Movie AS
    select title, dir, year, NULL from S1
    union
    select title, dir, NULL, genre from S2

Global-as-View: Example 3
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Source S4: S4(cinema, genre)

  Create View Movie AS
    select NULL, NULL, NULL, genre from S4
  Create View Schedule AS
    select cinema, NULL, NULL from S4

But what if we want to find which cinemas are playing comedies?

Global-as-View Summary
- Query reformulation boils down to view unfolding.
- Very easy conceptually.
- Can build hierarchies of mediated schemas.
- You sometimes lose information. Not always natural.
- Adding sources is hard. Need to consider all other sources that are available.

Local-as-View: Example 1
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).

  Create Source S1 AS
    select * from Movie
  Create Source S3 AS   [S3(title, dir)]
    select title, dir from Movie
  Create Source S5 AS
    select title, dir, year from Movie
    where year > 1960 AND genre = 'Comedy'

Local-as-View: Example 2
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Source S4: S4(cinema, genre)

  Create Source S4 AS
    select cinema, genre
    from Movie m, Schedule s
    where m.title = s.title

Now if we want to find which cinemas are playing comedies, there is hope!
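The contrast between the two descriptions of S4(cinema, genre) can be checked with SQLite. In this sketch the table contents and cinema names are made up; the two views reproduce the NULL-padded global-as-view definitions, and the last query relies on the local-as-view reading, under which S4 already is the projection of Movie joined with Schedule.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE S4 (cinema TEXT, genre TEXT);
    INSERT INTO S4 VALUES ('Neptune', 'Comedy'), ('Egyptian', 'Drama');
    -- Global-as-view: each mediated relation padded with NULLs.
    CREATE VIEW Movie AS
        SELECT NULL AS title, NULL AS dir, NULL AS year, genre FROM S4;
    CREATE VIEW Schedule AS
        SELECT cinema, NULL AS title, NULL AS time FROM S4;
""")

# Which cinemas are playing comedies? The join on title compares NULL with NULL,
# which is never true, so the connection between cinema and genre is lost.
gav_rows = con.execute("""
    SELECT s.cinema FROM Movie m JOIN Schedule s ON m.title = s.title
    WHERE m.genre = 'Comedy'
""").fetchall()

# Local-as-view: S4 is described as the join itself, so the query maps onto S4.
lav_rows = con.execute("SELECT cinema FROM S4 WHERE genre = 'Comedy'").fetchall()
```

Here `gav_rows` is empty while `lav_rows` contains the comedy cinema, which is the information loss the global-as-view summary refers to.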

Local-as-View Summary
- Very flexible. You have the power of the entire query language to define the contents of the source.
- Hence, can easily distinguish between contents of closely related sources.
- Adding sources is easy: they're independent of each other.
- Query reformulation: answering queries using views!

The General Problem
- Given a set of views V1, ..., Vn, and a query Q, can we answer Q using only the answers to V1, ..., Vn?
- Many, many papers on this problem.
- The best performing algorithm: the MiniCon Algorithm (Pottinger & Levy, 2000).
- Great survey on the topic: (Halevy, 2001).

Local Completeness Information
- If sources are incomplete, we need to look at each one of them.
- Often, sources are locally complete.
- Movie(title, director, year) complete for years after 1960, or for American directors.
- Question: given a set of local completeness statements, is a query Q' a complete answer to Q?

Example
Movie(title, director, year) (complete after 1960).
Show(title, theater, city, hour)
Query: find movies (and directors) playing in Seattle:

  Select m.title, m.director
  From Movie m, Show s
  Where m.title = s.title AND city = 'Seattle'

Complete or not?

Example #2
Movie(title, director, year), Oscar(title, year)
Query: find directors whose movies won Oscars after 1965:

  select m.director
  from Movie m, Oscar o
  where m.title = o.title AND m.year = o.year AND o.year > 1965

Complete or not?

Query Optimization
- Very related to query reformulation!
- Goal of the optimizer: find a physical plan with minimal cost.
- Key components in optimization:
  - Search space of plans
  - Search strategy
  - Cost model

Optimization in Distributed DBMS
- A distributed database (2-minute tutorial):
  - Data is distributed over multiple nodes, but is uniform.
  - Query execution can be distributed to sites.
  - Communication costs are significant.
- Consequences for optimization:
  - Optimizer needs to decide locality.
  - Need to exploit independent parallelism.
  - Need operators that reduce communication costs (semi-joins).

DDBMS vs. Data Integration
- In a DDBMS, data is distributed over a set of uniform sites with precise rules.
- In a data integration context:
  - Data sources may provide only limited access patterns to the data.
  - Data sources may have additional query capabilities.
  - Cost of answering queries at sources unknown.
  - Statistics about data unknown.
  - Transfer rates unpredictable.

Modeling Source Capabilities
- Negative capabilities:
  - A web site may require certain inputs (in an HTML form).
  - Need to consider only valid query execution plans.
- Positive capabilities:
  - A source may be an ODBC compliant system.
  - Need to decide placement of operations according to capabilities.
- Problem: how to describe and exploit source capabilities.

Example #1: Access Patterns
Mediated schema relation: Cites(paper1, paper2)

  Create Source S1 as
    select * from Cites given paper1
  Create Source S2 as
    select paper1 from Cites

Query: select paper1 from Cites where paper2 = 'Hal00'

Example #1: Continued
Given the same two sources, S2 supplies the paper1 bindings that S1 requires:

  Select S1.paper1
  From S1, S2
  Where S2.paper1 = S1.paper1 AND S1.paper2 = 'Hal00'
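A plan that respects the binding restriction has to call S2 first and feed each result into S1. The sketch below models the two sources as Python functions; the contents of `CITES` and the function names are made up for illustration.

```python
# Hypothetical contents of the mediated relation Cites(paper1, paper2).
CITES = {("A", "Hal00"), ("B", "Hal00"), ("C", "Lev96")}

def s2():
    """S2: no input required; returns every citing paper (paper1)."""
    return {p1 for (p1, _) in CITES}

def s1(paper1):
    """S1: requires paper1 to be bound, then returns its Cites tuples."""
    return {(p1, p2) for (p1, p2) in CITES if p1 == paper1}

def citing(target):
    """Dependent join: bindings from S2 are fed one at a time into S1."""
    return sorted(p1 for b in s2() for (p1, p2) in s1(b) if p2 == target)

answer = citing("Hal00")
```

The selection on paper2 cannot be pushed into S1, because S1 only accepts a value for paper1; it has to be applied by the mediator after the call.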

© Daniel S. Weld, for PLANET 2003 Tutorial on Data Integration 67 Example #2: Access Patterns Create Source S1 as select * from Cites given paper1 Create Source S2 as select paperID from UW-Papers Create Source S3 as select paperID from AwardPapers given paperID Query: select * from AwardPapers

Example #2: Solutions
Can’t go directly to S3 because it requires a binding.
Can go to S2, get UW papers, and check if they’re in S3.
Can go to S2, get UW papers, feed them into S1, and feed the results into S3.
Can go to S2, feed results into S1, feed results into S1 again, and then feed results into S3.
Strictly speaking, we can’t a priori decide when to stop. Need recursive query processing.
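
Because the number of citation hops cannot be fixed in advance, the plan is naturally a fixpoint computation. The sketch below uses the source definitions of Example #2 (S1 = Cites given paper1, S2 = UW papers, S3 = AwardPapers given paperID); the data and the function names are assumptions for illustration only.

```python
# Hypothetical source contents.
UW_PAPERS = ["a", "b"]
CITES = {"a": ["c"], "b": ["c", "d"], "c": ["e"], "d": [], "e": []}
AWARD_PAPERS = {"b", "e", "z"}   # "z" cannot be reached through the sources


def s1(paper1):
    """Cites given paper1: returns (paper1, paper2) tuples."""
    return [(paper1, p2) for p2 in CITES.get(paper1, [])]


def s2():
    """UW papers: no binding required."""
    return list(UW_PAPERS)


def s3(paper_id):
    """AwardPapers given paperID: a membership probe."""
    return paper_id in AWARD_PAPERS


def reachable_award_papers():
    """Collect every paper ID obtainable from the sources, then probe S3."""
    known = set(s2())
    frontier = set(known)
    while frontier:                                  # stop when no new IDs appear
        cited = {p2 for p1 in frontier for _, p2 in s1(p1)}
        frontier = cited - known
        known |= frontier
    return sorted(p for p in known if s3(p))


awards = reachable_award_papers()  # -> ["b", "e"]
```

The answer is only the set of award papers reachable through the available bindings; a paper such as "z" that no source ever mentions stays invisible.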

Handling Positive Capabilities
Characterizing positive capabilities: Schema independent (e.g., can always perform joins, selections). Schema dependent: can join R and S, but not T. Given a query, tells you whether it can be handled.
Key issue: how do you search for plans?
Garlic approach (IBM): Given a query, STAR rules determine which subqueries are executable by the sources. Then proceed bottom-up as in System-R.

Matching Objects Across Sources
How do I know that A. Halevy in source 1 is the same as Alon Halevy in source 2?
If there are uniform keys across sources, no problem.
If not: Use domain-specific solutions (e.g., maybe look at the address, SSN). Use information retrieval techniques (Cohen, 98): judge similarity as you would between documents. Use concordance tables. These are time-consuming to build, but you can then sell them for lots of money.
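
Judging similarity "as you would between documents" can be illustrated with a bag-of-words cosine measure. This is a deliberately simplified sketch and not the method of (Cohen, 98): it uses raw term counts without inverse document frequency weighting, and the helper names `tokens` and `cosine` are assumptions.

```python
import math
import re
from collections import Counter


def tokens(name):
    """Lower-case the string and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", name.lower()))


def cosine(a, b):
    """Cosine similarity between the token-count vectors of two strings."""
    va, vb = tokens(a), tokens(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


same_person = cosine("A. Halevy", "Alon Halevy")   # shares the token "halevy"
different = cosine("A. Halevy", "Dan Weld")        # no shared tokens
```

A real matcher would weight rare tokens (such as an unusual surname) more heavily than common ones and pick a decision threshold from labeled examples.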

Optimization and Execution
Problem: Few and unreliable statistics about the data. Unexpected (possibly bursty) network transfer rates. Generally, an unpredictable environment.
General solution (research area): Adaptive query processing. Interleave optimization and execution. As you get to know more about your data, you can improve your plan.

Tukwila Data Integration System
Novel components: Event handler. Optimization-execution loop.

Double Pipelined Join (Tukwila)
Hash Join: Partially pipelined: no output until the inner relation is read. Asymmetric (inner vs. outer): optimization requires knowledge of source behavior.
Double Pipelined Hash Join: Outputs data immediately. Symmetric: requires less source knowledge to optimize.
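
The symmetric behavior can be sketched as a generator that keeps one hash table per input and, for each arriving tuple, inserts it into its own table and probes the other. This is a minimal in-memory illustration, not Tukwila's implementation: the interleaved `stream` of tagged tuples and all names are assumptions, and memory overflow handling is omitted.

```python
from collections import defaultdict


def double_pipelined_join(stream, key_left, key_right):
    """Symmetric hash join over an interleaved stream of tagged rows.

    `stream` yields ("L", row) or ("R", row) in arrival order. Each row is
    inserted into its side's hash table and probed against the other side's
    table, so matches are emitted as soon as both partners have arrived.
    """
    tables = {"L": defaultdict(list), "R": defaultdict(list)}
    keys = {"L": key_left, "R": key_right}
    for side, row in stream:
        other = "R" if side == "L" else "L"
        k = row[keys[side]]
        tables[side][k].append(row)                 # build on own side
        for match in tables[other].get(k, []):      # probe the other side
            yield (row, match) if side == "L" else (match, row)


# Hypothetical arrival order from two network sources.
arrivals = [
    ("L", {"id": 1, "x": "a"}),
    ("R", {"id": 1, "y": "b"}),   # first result is produced here
    ("R", {"id": 2, "y": "c"}),
    ("L", {"id": 2, "x": "d"}),   # second result is produced here
]
results = list(double_pipelined_join(arrivals, "id", "id"))
```

Neither input has to be fully read before output starts, and the operator does not care which source is faster, which is the property that reduces the need for source knowledge at optimization time.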

Piazza: A Peer-Data Management System
Goal: To enable users to share data across local or wide area networks in an ad-hoc, highly dynamic distributed architecture.
Peers share data, mediated views.
Peers act as both clients and servers.
Rich semantic relationships between peers.
Ad-hoc collaborations (peers join and leave at will).

Extending the Vision to Data Sharing

The Structure Mapping Problem
Types of structures: Database schemas, XML DTDs, ontologies, …
Input: Two (or more) structures, S1 and S2. (Perhaps) data instances for S1 and S2. Background knowledge.
Output: A mapping between S1 and S2. Should enable translating between data instances.

Semantic Mappings between Schemas
Source schemas = XML DTDs
[Figure: two XML DTD trees for house listings, with element labels house, location, contact, name, phone and house, address, num-baths, full-baths, half-baths, contact-info, agent-name, agent-phone; edges between the trees are labeled “1-1 mapping” and “non 1-1 mapping”.]

Why Matching is Difficult
Structures represent the same entity differently:
different names => same entity: area & address => location
same names => different entities: area => location or square-feet
Intended semantics is typically subjective! IBM Almaden Lab = IBM?
Schema, data and rules never fully capture semantics! They are not adequately documented, certainly not for machine consumption.
Often hard for humans (committees are formed!)

Desiderata from Proposed Solutions
Accuracy, efficiency, ease of use.
Realistic expectations: Unlikely to be fully automated. Need user in the loop.
Some notion of semantics for mappings.
Extensibility: Solution should exploit additional background knowledge.
“Memory”, knowledge reuse: System should exploit previous manual or automatically generated matchings. Key idea behind LSD.

Learning for Mapping
Context: generating semantic mappings between a mediated schema and a large set of data source schemas.
Key idea: generate the first mappings manually, and learn from them to generate the rest.
Technique: multi-strategy learning (extensible!)
LSD: Learning Source Descriptions [SIGMOD 2001].

Data Integration (a simple PDMS)
Example query: Find houses with four bathrooms priced under $500,000.
[Figure: a mediated schema over source schema 1, source schema 2 and source schema 3, for the sources homes.com, realestate.com and homeseekers.com.]
Applications: WWW, enterprises, science projects.
Techniques: virtual data integration, warehousing, custom code. Query reformulation and optimization.

Learning from the Manual Mappings
Mediated schema: price, agent-name, agent-phone, office-phone, description
Schema of realestate.com: listed-price, contact-name, contact-phone, office, comments
realestate.com data:
$250K   James Smith   (305)   (305)   Fantastic house
$320K   Mike Doan     (617)   (617)   Great location
Learned hypothesis: If “fantastic” & “great” occur frequently in data instances => description
Schema of homes.com: sold-at, contact-agent, extra-info
homes.com data:
$350K   (206)   Beautiful yard
$230K   (617)   Close to Seattle
$190K   (512)   Great lot
Learned hypothesis: If “office” occurs in the name => office-phone

Multi-Strategy Learning
Use a set of base learners: Name learner, Naïve Bayes, Whirl, XML learner.
And a set of recognizers: County name, zip code, phone numbers.
Each base learner produces a prediction weighted by a confidence score.
Combine base learners with a meta-learner, using stacking.
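
The combination step can be sketched as a weighted sum of the base learners' confidence scores per mediated-schema label. This is only an illustration of the idea: in stacking the per-learner weights are learned from training data, whereas here they are fixed by hand, and the function name `combine`, the learner names, and all numbers are assumptions.

```python
def combine(predictions, weights):
    """Combine base-learner predictions into one score per label.

    predictions: {learner_name: {label: confidence}}
    weights:     {learner_name: weight}  (hand-set here; stacking would learn them)
    Returns the best label and the full score table.
    """
    scores = {}
    for learner, dist in predictions.items():
        w = weights.get(learner, 0.0)
        for label, conf in dist.items():
            scores[label] = scores.get(label, 0.0) + w * conf
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical predictions for the homes.com element "contact-agent".
predictions = {
    "name_learner": {"agent-name": 0.6, "agent-phone": 0.4},
    "naive_bayes": {"agent-name": 0.1, "agent-phone": 0.9},
}
weights = {"name_learner": 0.4, "naive_bayes": 0.6}
best_label, label_scores = combine(predictions, weights)  # best: "agent-phone"
```

Here the element name alone suggests agent-name, but the learner that looks at data values (phone numbers) outweighs it, which is the kind of disagreement a meta-learner is meant to resolve.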

The Semantic Web
How does it relate to data integration? How are we going to do it? Why should we do it? Do we need a killer app, or is the semantic web a killer app?

Outline (Alon, DBPL99)
Answering Queries Using Views: Applications, Algorithms and Opportunities
Motivation and formal definition.
Basic decidability result.
Algorithms (data integration context).
Extensions to bounds on rewriting size.
Language extensions.
Computing the certain answers.
Current work and open problems.