Credit: Slides are an adaptation of slides from Jeffrey D. Ullman 1.

Slides:



Advertisements
Similar presentations
Uncertainty in Data Integration Ai Jing
Advertisements

CSE 636 Data Integration Conjunctive Queries Containment Mappings / Canonical Databases Slides by Jeffrey D. Ullman.
2005conjunctive-ii1 Query languages II: equivalence & containment (Motivation: rewriting queries using views)  conjunctive queries – CQ’s  Extensions.
Information Integration Using Logical Views Jeffrey D. Ullman.
1 Lecture 10: Transactions. 2 The Setting uDatabase systems are normally being accessed by many users or processes at the same time. wBoth queries and.
Copyright © C. J. Date 2005page 97 S#Y S1DURINGS3DURING [d04:d10][d08:d10] S2DURINGS4DURING [d02:d04][d04:d10] [d08:d10] WITH ( EXTEND T2 ADD ( COLLAPSE.
CPSC 504: Data Management Discussion on Chandra&Merlin 1977 Laks V.S. Lakshmanan Dept. of CS UBC.
1 Introduction to SQL Select-From-Where Statements Multirelation Queries Subqueries.
SECTION 21.5 Eilbroun Benjamin CS 257 – Dr. TY Lin INFORMATION INTEGRATION.
Database Management Systems Chapter 3 The Relational Data Model (II) Instructor: Li Ma Department of Computer Science Texas Southern University, Houston.
1 Introduction to SQL Multirelation Queries Subqueries Slides are reused by the approval of Jeffrey Ullman’s.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification.
Advanced Database Systems September 2013 Dr. Fatemeh Ahmadi-Abkenari 1.
1 Improvements to A-Priori Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms Compacting Results.
1 More SQL Database Modification Defining a Database Schema Views Source: slides by Jeffrey Ullman.
Chapter 21.2 Modes of Information Integration ID: 219 Name: Qun Yu Class: CS Spring 2009 Instructor: Dr. T.Y.Lin.
SECTIONS 21.4 – 21.5 Sanuja Dabade & Eilbroun Benjamin CS 257 – Dr. TY Lin INFORMATION INTEGRATION.
1 Multivalued Dependencies Fourth Normal Form Source: Slides by Jeffrey Ullman.
1 Using Views to Implement Datalog Programs Inverse Rules Duschka’s Algorithm.
1 The Relational Data Model Functional Dependencies.
CS 257 Database Systems Principles Assignment 1 Instructor: Student: Dr. T. Y. Lin Rajan Vyas (119)
1 Distributed Databases Chapter Two Types of Applications that Access Distributed Databases The application accesses data at the level of SQL statements.
1 More SQL Defining a Database Schema Views. 2 Defining a Database Schema uA database schema comprises declarations for the relations (“tables”) of the.
1 Multivalued Dependencies Fourth Normal Form Sources: Slides by Jeffrey Ullman book by Ramakrishnan & Gehrke.
Local-as-View Mediators Priya Gangaraju(Class Id:203)
1 SQL Authorization Privileges Grant and Revoke Grant Diagrams.
1 Functional Dependencies Meaning of FD’s Keys and Superkeys Inferring FD’s Source: slides by Jeffrey Ullman.
1 Information Integration Mediators Semistructured Data Answering Queries Using Views.
2005Integration-intro1 Data Integration Systems overview The architecture of a data integration system:  Components and their interaction  Tasks  Concepts.
CSE 636 Data Integration Answering Queries Using Views Overview.
1 Multivalued Dependencies Fourth Normal Form. 2 A New Form of Redundancy uMultivalued dependencies (MVD’s) express a condition among tuples of a relation.
1 Where Is Database Research Headed? Jeffrey D. Ullman DASFAA March 26, 2003.
Information Integration Using Logical Views Jeffrey D. Ullman.
1 Semantics of Datalog With Negation Local Stratification Stable Models Well-Founded Models.
Distributed Databases and Query Processing. Distributed DB’s vs. Parallel DB’s Many autonomous processors that may participate in database operations.
1 XML Semistructured Data Extensible Markup Language Document Type Definitions.
Credit: Slides are an adaptation of slides from Jeffrey D. Ullman 1.
1 Information Integration Mediators Warehousing Answering Queries Using Views.
1 Searching for Solutions Careful Analysis of Expansions The Bucket Algorithm.
Credit: Slides are an adaptation of slides from Jeffrey D. Ullman 1.
Chapter 21.2 Modes of Information Integration ID: 219 Name: Qun Yu Class: CS Spring 2009 Instructor: Dr. T.Y.Lin.
Databases From A to Boyce Codd. What is a database? It depends on your point of view. For Manovich, a database is a means of structuring information in.
© 2007 by Prentice Hall 1 Introduction to databases.
Mediators, Wrappers, etc. Based on TSIMMIS project at Stanford. Concepts used in several other related projects. Goal: integrate info. in heterogeneous.
 Three-Schema Architecture Three-Schema Architecture  Internal Level Internal Level  Conceptual Level Conceptual Level  External Level External Level.
Answering Queries Using Views LMSS’95 Laks V.S. Lakshmanan Dept. of Comp. Science UBC.
CSE 636 Data Integration Conjunctive Queries Containment Mappings / Canonical Databases Slides by Jeffrey D. Ullman Fall 2006.
Fall 2013, Databases, Exam 2 Questions for the second exam. Your answers are due by Dec. 18 at 4PM. (This is the final exam slot.) And please type your.
1 Information Integration Mediators Warehousing Answering Queries Using Views Slides are modified from Dr. Ullman’s notes.
Information Integration By Neel Bavishi. Mediator Introduction A mediator supports a virtual view or collection of views that integrates several sources.
Data Integration Hanna Zhong Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009.
COMP3030 Database Management System Final Review
1 Multivalued Dependencies Fourth Normal Form Reasoning About FD’s + MVD’s.
1 Theory, Practice & Methodology of Relational Database Design and Programming Copyright © Ellis Cohen Relational State Assertions These slides.
Databases : SQL-Schema Definition and View 2007, Fall Pusan National University Ki-Joune Li These slides are made from the materials that Prof. Jeffrey.
1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linköpings universitet.
Section 20.1 Modes of Information Integration Anilkumar Panicker CS257: Database Systems ID: 118.
1 10 Systems Analysis and Design in a Changing World, 2 nd Edition, Satzinger, Jackson, & Burd Chapter 10 Designing Databases.
1 Datalog with negation Adapted from slides by Jeff Ullman.
1 Data Modification with SQL CREATE TABLE, INSERT, DELETE, UPDATE Slides from
Where Is Database Research Headed?
CPSC-310 Database Systems
Slides are reused by the approval of Jeffrey Ullman’s
Foreign Keys Local and Global Constraints Triggers
Introduction to Database Systems, CS420
Containment Mappings Canonical Databases Sariaya’s Algorithm
Local-as-View Mediators
Information Integration
Materializing Views With Minimal Size To Answer Queries
Select-From-Where Statements Multirelation Queries Subqueries
Presentation transcript:

Credit: Slides are an adaptation of slides from Jeffrey D. Ullman 1

2 Information Integration Mediators Semistructured Data Answering Queries Using Views

3 Importance of Information Integration uVery many modern DB applications involve combining databases. wE.g., mashups uSometimes a “database” is not stored in a DBMS --- it could be a spreadsheet, flat file, XML document, etc.

4 Example Applications 1.Enterprise Information Integration: making separate DB’s, all owned by one company, work together. 2.Catalog integration: combining product information from all your suppliers. 3.Etc., etc.

5 Challenges 1.Legacy databases : DB’s get used for many applications. uYou can’t change its structure for the sake of one application, because it will cause others to break. 2.Incompatibilities : Two, supposedly similar databases, will mismatch in many ways.

6 Examples: Incompatibilities  Lexical : addr in one DB is address in another. uValue mismatches : is a “red” car the same color in each DB? Is 20 degrees Fahrenheit or Celsius? uSemantic : are “employees” in each database the same? What about consultants? Retirees? Contractors?

7 What Do You Do About It? uGrubby, handwritten translation at each interface. wSome research on automatic inference of relationships. uWrapper (aka “adapter”) translates incoming queries and outgoing answers.

8 Integration Architectures 1.Federation : everybody talks directly to everyone else. 2.Warehouse : Sources are translated from their local schema to a global schema and copied to a central DB. 3.Mediator : Virtual warehouse --- turns a user query into a sequence of source queries.

9 Federations Wrapper What problems do you see here?

10 Warehouse Diagram Warehouse Wrapper Source 1Source 2 What problems do you see here?

11 A Mediator Mediator Wrapper Source 1Source 2 User query Query Result

12 Two Mediation Approaches 1.Query-centric : Mediator processes queries into steps executed at sources. 2.View-centric : Sources are defined in terms of global relations; mediator finds all ways to build query from views.

13 Example uSuppose Dell wants to buy a bus and a disk that share the same protocol.  Global schema: Buses(manf,model,protocol) Disks(manf,model,protocol) uLocal schemas: each bus or disk manufacturer has a (model,protocol) relation --- manf is implied.

14 Example: Query-Centric uMediator might start by querying each bus manufacturer for model-protocol pairs. wThe wrapper would turn them into triples by adding the manf component. uThen, for each protocol returned, mediator queries disk manufacturers for disks with that protocol. wAgain, wrapper adds manf component.

15 Example: View-Centric uSources’ capabilities are defined in terms of the global predicates. wE.g.,Quantum’s disk database could be defined by QuantumView(M,P) = Disks(’Quantum’,M,P). uMediator discovers all combinations of a bus and disk “view,” equijoined on the protocol components.

16 Comparison uQuery-centric is simpler to implement. wLets you have control of what the mediator does. uView centric is more extensible. wSame query engine works for any number of sources. wAdd a new source simply by defining what it contributes as a view of the global schema.

17 View-Centric Mediation uKey assumptions: 1.There is a set of global predicates that define the schema. uThese do not exist as stored relations. 2.Each data source has its capabilities defined by views, which are (typically) CQ’s whose subgoals involve the global predicates.

18 Assumptions --- Continued 3. A query is (typically) a CQ over the global predicates. 4. A solution is an expression (union of CQ’s, typically) involving the views. wIdeally, the solution is equivalent to the query. wIn practice, we have to be happy with a solution maximally contained in the query.

19 Interpretation of Views uA view describes (some of) the facts that are available at the source. uA view does not define exactly what is at the source.  Example: a view v(X) :- p(X,10) says that the source has some p -facts with second component v could even be empty although p(X,10) is not.

20 Put Another Way …  The :- separator between head and body of a view definition should not be interpreted as “if.” uRather, it is “only if.”

21 Example uGlobal predicates: emp(E) = “E is an employee.” phone(E,P) = “P is a phone of E.” office(E,O) = “O is an office of E.” mgr(E,M) = “M is E’s manager.” dept(E,D) = “D is E’s department.”

22 Example --- Continued uThree sources each provide one view: At source S1: view v1(E,P,M) defined by: v1(E,P,M) :- emp(E) & phone(E,P) & mgr(E,M) wInterpretation: “every triple (e,p,m) at S1 is an employee, one of their phones, and their manager.” wIt does not say “S1 has all E-P-M facts.”

23 Example: Sources S2 and S3 uAt S2: v2(E,O,D) :- emp(E) & office(E,O) & dept(E,D) wS2 has (some of the) employee-office- department facts. uAt S3: v3(E,P) :- emp(E) & phone(E,P) & dept(E, ‘toy’) wS3 has (some) toy-department phones.

24 Example: A Query q1(P,O) :- phone(’sally’,P) & office(’sally’,O) wFind Sally’s office and phone. uThere are two useful solutions: s1(P,O) :- v1(’sally’,P,M) & v2(’sally’,O,D) s2(P,O) :- v3(’sally’,P) & v2(’sally’,O,D)

25 What Makes a Solution S Useful? 1.There must be no other solution containing S. 2.S, when expanded from views into global predicates, is contained in the query. Why?

26 Expanding Views uSuppose we have a subgoal v(X,Y) in a solution, and v is defined by: v(A,B) :- p(A,X) & q(X,B) 1.Find unique variables for the local variables of the view (those that appear only in the body). 2.Substitute variables of the subgoal for variables of the head. 3.Use the resulting body as the substitution.

27 Example v(A,B) :- p(A,X) & q(X,B) becomes: v(A,B) :- p(A,X1) & q(X1,B) Then substitute A->X, B->Y; yields body: p(X,X1) & q(X1,Y)

Test Yourself v1(E,P,M):-emp(E)& phone(E,P)& mgr(E,M) v2(E,O,D):-emp(E)& office(E,O)&dept(E,D) v3(E,P):-emp(E)& phone(E,P)& dept(E,‘toy’) Expand: s1(P,O):-v1(’sally’,P,M)&v2(’sally’,O,D) s2(P,O):-v3(’sally’,P) & v2(’sally’,O,D) 28

29 Important Points uTo test containment of a solution in a query, we expand the solution first, then test CQ containment of the expansion in the query. uThe view definition describes what any tuples of the view look like, so CQ containment implies that the solution will provide only true answers.

30 The Picture Query: q(X,Y) :- p(X,Z) & … Soln: q(A,B) :- v(A,C,D) & w(B,E) & … Exp: q(A,B) :- p(A,U) & … & r(B,V) & … Is there a containment mapping?

31 Important Points --- (2) uThere is no guarantee a solution supplies any answers to the query. uComparing different solutions by testing if one solution is contained in another must be done at the level of the unexpanded views.

32 Example  Two sources might have similar views, defined by: v1(X,Y) :- p(X,Y) v2(X,Y) :- p(X,Y) uBut the sources actually have different sets of p -facts.

33 Example --- Continued  Then, the two solutions: s1(X,Y) :- v1(X,Y) s2(X,Y) :- v2(X,Y) have the same expansions, p(X,Y), but there is no reason to believe one solution is contained in the other. wOne view could provide lots of p -facts, the other, few or none.

34 Important Points --- (3) uOn the other hand, when one solution, unexpanded, is contained in another, we can be sure the first provides no answers the second does not.

35 Example  Here are two solutions: s1(X,Y) :- v1(X,Z) & v2(Z,Y) s2(X,Y) :- v1(X,Z) & v2(W,Y) uThere is a containment mapping s2 -> s1. wThus, s1  s2 at the level of views. uNo matter what tuples v1 and v2 represent, s2 provides all answers s1 provides.

36 The Office Example q1(P,O) :- phone(’sally’,P) & office(’sally’,O) v1(E,P,M) :- emp(E) & phone(E,P) & mgr(E,M) v2(E,O,D) :- emp(E) & office(E,O) & dept(E,D) v3(E,P) :- emp(E) & phone(E,P) & dept(E, ‘toy’)

37 Office Example --- Solutions s1(P,O) :- v1(’sally’,P,M) & v2(’sally’,O,D) s2(P,O) :- v3(’sally’,P) & v2(’sally’,O,D)

38 Expansion of S1 s1(P,O) :- emp(’sally’) & phone(’sally’,P) & mgr(’sally’,M) & emp(’sally’) & office(’sally’,O) & dept(’sally’,D) q1(P,O) :- phone(’sally’,P) & office(’sally’,O) Containment mapping q1->e1

39 Office Example --- Concluded uMapping from q1 to s2 is similar. uNotice we have used the head predicate to name the solution, expansion, etc. wTechnically, head predicates have to be the same, but that’s not a problem here. uExpansions are properly contained in query --- not equivalent.

40 Finding All Solutions to a Query uKey idea: LMSS (Levy-Mendelzon- Sagiv-Srivastava) test. uIf a query has n subgoals, then we only need to consider solutions with at most n subgoals. wAny other solution must be contained in one with < n subgoals.

41 Proof of LMSS Theorem uSuppose the query has n subgoals, and a solution S has >n subgoals. uLook at the expansion diagram again – at least one subgoal (view) in the solution has an expansion to which no query subgoal maps.

42 Expansion Diagram Query: q(X,Y) :- p(X,Z) & … Soln: q(A,B) :- v(A,C,D) & w(B,E) & … Exp: q(A,B) :- p(A,U) & … & r(B,V) & … n of these More than n of these

43 Proof --- Continued uConsider the new solution S ’, which removes from S every subgoal whose expansion is not a target of the CM from the query. uClearly S  S ’. wIn general, throwing away subgoals grows the result of the CQ. wAnd, Q  S ’ also holds – Why? uBut S ’ has at most n subgoals.

44 Example uIn our running “office” example, we can immediately conclude that the solution s3(P,O) :- v1(‘’sally’,P,M) & v2(‘’sally’,O,D) & v3(E,P) is not minimal. wIt has more subgoals than the query. wIn fact, it is contained in s1.