1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine.


2 Motivation
Support seamless access to autonomous and heterogeneous information sources, such as a legacy database, plain text files, and a bibliography server.

3 Applications
Comparison shopping, e.g., what is the lowest price of the DVD “The Matrix”?
Supply-chain management
[Diagram: an Integrator connects Buyer 1 … Buyer M with Supplier 1 … Supplier M.]

4 Mediation architecture
[Diagram: a Mediator on top of one Wrapper per source: Source 1, Source 2, …, Source n.]

5 Challenges
Sources are heterogeneous:
–Different data models: relational, object-oriented, XML, …
–Different schemas and representations, e.g., “Keanu Reeves” or “Reeves, Keanu” or “Reeves, K.”
Describe source contents
Use source data to answer queries
Sources have limited query capabilities
Data quality
Performance
…

6 Research projects
Garlic (IBM)
Information Manifold (AT&T)
InfoSleuth (MCC)
Tsimmis, InfoMaster (Stanford)
Internet Softbot/Razor/Tukwila (U Wash.)
Hermes (Maryland)
Telegraph / Eddies (UC Berkeley)
Niagara (Univ Wisconsin)
DISCO, Agora (INRIA, France)
SIMS/Ariadne (USC/ISI)
Emerac/Havasu (ASU)

7 Industry
Nimble Technology, Enosys Markets, IBM, BEA

8 Virtual integration
Leave the data in the sources.
When a query comes in:
–Determine the sources relevant to the query
–Break down the query into sub-queries for the sources
–Get the answers from the sources, filter them if needed, and combine them appropriately
Data is fresh.
Otherwise known as on-demand integration.
Slides from Dr. Michalis Petropoulos

9 Virtual Integration Architecture
[Diagram labels: End User, Query, Result, Mediator, Global Schema, Wrapper, Local Schema, Data Source. Design-Time: Mediation Language, Mapping Tool. Run-Time: Query Reformulation, Optimization & Execution. XML, Web Services.]
Slides from Dr. Michalis Petropoulos


15 Outline Basics: theories of conjunctive queries Global-as-view (GAV) approach to data integration Local-as-view (LAV) approach to data integration

16 Conjunctive Queries (CQ’s) in Datalog
Most common form of query; equivalent to select-project-join (SPJ) queries.
Useful for data integration.
Form: q(X) :- p1(X1), p2(X2), …, pn(Xn).
Head q(X) represents the query answers.
Body p1(X1), p2(X2), …, pn(Xn) represents the query conditions.
–The head is true if all the subgoals are true.
–Each pi(Xi) is called a subgoal. Xi is a vector of variables or constants.
–Shared variables represent join conditions.
–Constants represent “Attribute=const” selection conditions.
–A relation can appear in multiple predicates (subgoals).

17 Conjunctive queries Head and subgoals are atoms. An atom consists of a predicate applied to zero or more arguments Predicates represent relations. An atom is true for given values of its variables iff the arguments form a tuple of the relation. Whenever an assignment of values to all variables makes all subgoals true, the rule asserts that the resulting head is also true.

18 Conjunctive Queries: example
Schema: student(name, courseNum), course(number, instructor)
SQL:
SELECT name FROM student, course WHERE student.courseNum = course.number AND instructor = 'Li';
Equivalent to: ans(SN) :- student(SN, CN), course(CN, 'Li').
–Predicates student and course correspond to relation names
–Two subgoals: student(SN, CN) and course(CN, 'Li')
–Variables: SN, CN. Constant: 'Li'
–Shared variable CN corresponds to “student.courseNum = course.number”
–Variable SN in the head: the answer to the query

19 Why not SQL?
Datalog is more concise.
It lets us state some general principles (e.g., containment of rules) that are almost impossible to state correctly in SQL.
–We will see that later
Recursion is much easier to express in Datalog.

20 Answer to a CQ
For a CQ Q on database D, the answer Q(D) is the set of heads of Q obtained when we:
–Substitute constants for variables in the body of Q in all possible ways
–Require all subgoals to be true
Example: ans(SN) :- student(SN, CN), course(CN, 'Li').
–Tuples are also called facts: student(Jack, 184), student(Tom, 215), …, course(184, Li), course(215, Li), …
–Answer “Jack”: SN → Jack, CN → 184
–Answer “Tom”: SN → Tom, CN → 215
–Answer “Jack”: SN → Jack, CN → 215 (duplicate eliminated)
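The definition above can be turned into a small evaluator. The following is a minimal sketch, not the slides' own implementation: it assumes variables are written as strings starting with "?", constants are any other values, and a database is a dict from predicate names to sets of tuples. The fact student(Jack, 215) is an assumption added so that the duplicate answer "Jack" actually arises.

```python
def is_var(term):
    # Assumed convention for this sketch: variables are strings starting with "?".
    return isinstance(term, str) and term.startswith("?")


def evaluate(head, body, db):
    """Return the set of head tuples produced by every substitution
    that makes all subgoals of the body true in db."""
    answers = set()

    def extend(i, subst):
        if i == len(body):
            answers.add(tuple(subst.get(t, t) for t in head))
            return
        pred, args = body[i]
        for fact in db.get(pred, ()):
            new = dict(subst)
            # A variable must agree with its earlier binding; a constant must match.
            if len(fact) == len(args) and all(
                (new.setdefault(a, v) == v) if is_var(a) else a == v
                for a, v in zip(args, fact)
            ):
                extend(i + 1, new)

    extend(0, {})
    return answers


# Facts from the slide, plus the assumed fact student(Jack, 215).
DB = {
    "student": {("Jack", 184), ("Tom", 215), ("Jack", 215)},
    "course": {(184, "Li"), (215, "Li")},
}

# ans(SN) :- student(SN, CN), course(CN, 'Li').
head = ("?SN",)
body = [("student", ("?SN", "?CN")), ("course", ("?CN", "Li"))]
answers = evaluate(head, body, DB)
print(answers)
```

Because the answers are collected in a set, the second derivation of "Jack" is eliminated automatically, which mirrors the set semantics on the slide.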

21 Query containment
For two queries Q1 and Q2, we say Q1 is contained in Q2, denoted Q1 ⊆ Q2, if for every database D we have Q1(D) ⊆ Q2(D).
We say Q1 and Q2 are equivalent, denoted Q1 ≡ Q2, if Q1(D) ⊆ Q2(D) and Q2(D) ⊆ Q1(D) for every database D.
Example:
Q1: ans(SN) :- student(SN, CN), course(CN, 'Li').
Q2: ans(SN) :- student(SN, CN), course(CN, INS).
We have: Q1(D) ⊆ Q2(D) for every D, i.e., Q1 ⊆ Q2.

22 Another example
Q1: p(X,Y) :- r(X,W), b(W,Z), r(Z,Y).
Q2: p(X,Y) :- r(X,W), b(W,W), r(W,Y).
We have: Q2 ⊆ Q1
Proof:
–For any DB D, suppose p(x,y) is in Q2(D). Then there is a w such that r(x,w), b(w,w), and r(w,y) are in D.
–For Q1, consider the substitution: X → x, W → w, Z → w, Y → y.
–This makes every subgoal of Q1 true and the head of Q1 becomes p(x,y), meaning that p(x,y) is also in Q1(D).
In general, how to test containment of CQ’s?
–Containment mappings

23 Test containment
Two approaches:
1. Containment mappings.
2. Canonical databases.
They are really the same in the simple CQ case covered so far.
The containment test is NP-complete, but CQ’s tend to be small, so here is one case where intractability doesn’t hurt you.
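The canonical-database approach can be sketched as follows, for plain CQ's without built-in predicates: freeze the variables of Q1 into fresh constants, treat the frozen body as a tiny database, and check whether Q2 produces the frozen head of Q1 on it. The term encoding ("?"-prefixed variables) and the "c_" prefix for frozen constants are assumptions of this sketch, which also assumes no real constant already starts with "c_". The two queries are the ones from the preceding example.

```python
def is_var(term):
    # Assumed convention: variables are strings starting with "?".
    return isinstance(term, str) and term.startswith("?")


def evaluate(head, body, db):
    """Set of head tuples over all substitutions satisfying every subgoal."""
    answers = set()

    def extend(i, subst):
        if i == len(body):
            answers.add(tuple(subst.get(t, t) for t in head))
            return
        pred, args = body[i]
        for fact in db.get(pred, ()):
            new = dict(subst)
            if len(fact) == len(args) and all(
                (new.setdefault(a, v) == v) if is_var(a) else a == v
                for a, v in zip(args, fact)
            ):
                extend(i + 1, new)

    extend(0, {})
    return answers


def freeze(term):
    # Turn a variable into a fresh constant; leave constants untouched.
    return "c_" + term[1:] if is_var(term) else term


def contained_in(q1, q2):
    """True if q1 is contained in q2, tested on the canonical database of q1."""
    (head1, body1), (head2, body2) = q1, q2
    canonical_db = {}
    for pred, args in body1:
        canonical_db.setdefault(pred, set()).add(tuple(freeze(a) for a in args))
    frozen_head = tuple(freeze(t) for t in head1)
    return frozen_head in evaluate(head2, body2, canonical_db)


# Q1: p(X,Y) :- r(X,W), b(W,Z), r(Z,Y).
Q1 = (("?X", "?Y"), [("r", ("?X", "?W")), ("b", ("?W", "?Z")), ("r", ("?Z", "?Y"))])
# Q2: p(X,Y) :- r(X,W), b(W,W), r(W,Y).
Q2 = (("?X", "?Y"), [("r", ("?X", "?W")), ("b", ("?W", "?W")), ("r", ("?W", "?Y"))])

print(contained_in(Q2, Q1), contained_in(Q1, Q2))
```

The substitution found while evaluating Q2 on the frozen body of Q1 is exactly a containment mapping, which is why the two approaches coincide for plain CQ's.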

24 Containment mappings
A containment mapping from Q2 to Q1 maps the variables of Q2 to variables of Q1 such that:
–The head of Q2 becomes the head of Q1;
–Each subgoal of Q2 becomes some subgoal of Q1.
It is not necessary that every subgoal of Q1 is the target of some subgoal of Q2.
Q1 ⊆ Q2 iff there is a containment mapping from Q2 to Q1.
–Note that the containment mapping goes in the opposite direction to the containment: from the larger (containing) CQ to the smaller (contained) CQ.
Example:
Q1: p(X,Y) :- r(X,W), b(W,Z), r(Z,Y).
Q2: p(X,Y) :- r(X,W), b(W,W), r(W,Y).
–Containment mapping from Q1 to Q2: X → X, Y → Y, W → W, Z → W
–No containment mapping from Q2 to Q1: for b(W,W) in Q2, the only possible target in Q1 is b(W,Z). However, we cannot have both W → W and W → Z, since a variable cannot be mapped to two different variables.
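A brute-force search for a containment mapping can be sketched as below. This is an illustrative backtracking search under assumed conventions ("?"-prefixed variables, queries as (head, body) pairs), not an optimized algorithm; its worst-case cost is exponential, in line with the NP-completeness of the test.

```python
def is_var(term):
    # Assumed convention: variables are strings starting with "?".
    return isinstance(term, str) and term.startswith("?")


def extend_mapping(args, targets, mapping):
    """Try to extend mapping so that args become targets; None on conflict."""
    if len(args) != len(targets):
        return None
    new = dict(mapping)
    for a, t in zip(args, targets):
        if is_var(a):
            if new.setdefault(a, t) != t:
                return None  # variable already mapped to something else
        elif a != t:
            return None      # a constant can only map to itself
    return new


def find_containment_mapping(src, dst):
    """Return a containment mapping from query src to query dst, or None."""
    (src_head, src_body), (dst_head, dst_body) = src, dst

    def search(i, mapping):
        if i == len(src_body):
            return mapping
        pred, args = src_body[i]
        for d_pred, d_args in dst_body:
            if d_pred == pred:
                new = extend_mapping(args, d_args, mapping)
                if new is not None:
                    found = search(i + 1, new)
                    if found is not None:
                        return found
        return None

    start = extend_mapping(src_head, dst_head, {})  # head must map to head
    return None if start is None else search(0, start)


Q1 = (("?X", "?Y"), [("r", ("?X", "?W")), ("b", ("?W", "?Z")), ("r", ("?Z", "?Y"))])
Q2 = (("?X", "?Y"), [("r", ("?X", "?W")), ("b", ("?W", "?W")), ("r", ("?W", "?Y"))])

print(find_containment_mapping(Q1, Q2))
print(find_containment_mapping(Q2, Q1))
```

A mapping from Q1 to Q2 exists and none exists from Q2 to Q1, so Q2 ⊆ Q1 but not the reverse, matching the example on the slide.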

25 A slightly different example Q1: p(X,Y):- r(X,Z), g(Z,Z), r(Z,Y). Q2: p(A,B):- r(A,C), g(C,D), r(D,B). Containment mapping m: m(A)=X; m(B)=Y; m(C)=m(D)=Z.

26 Another Example
Q1: p(X,Y) :- r(X,Y), g(Y,Z).
Q2: p(A,B) :- r(A,B), r(A,C).
[Figure: the patterns that Q1 and Q2 look for, drawn over the variables X, Y, Z and A, B, C.]

27 Example - Continued
Q1: p(X,Y) :- r(X,Y), g(Y,Z).
Q2: p(A,B) :- r(A,B), r(A,C).
Containment mapping: m(A)=X; m(B)=m(C)=Y.
Notice that two subgoals can map to one, and not every subgoal need be a target.

28 Example - Concluded
Q1: p(X,Y) :- r(X,Y), g(Y,Z).
Q2: p(A,B) :- r(A,B), r(A,C).
No containment mapping from Q1 to Q2.
–g(Y,Z) cannot map anywhere, since there is no g subgoal in Q2.
Thus, Q1 is properly contained in Q2.

29 Extending CQ’s
CQ’s with built-in predicates:
–We can add more conditions to variables in a CQ.
–Example: student(name, GPA, courseNum), course(number, instructor, year)
Q1(SN) :- student(SN, G, CN), course(CN, 'Li', Y), G >= 3.5.
Q2(SN) :- student(SN, G, CN), course(CN, 'Li', Y), G >= 3.5, Y < …
Q2 ⊆ Q1.
Datalog queries:
–a (possibly infinite) set of CQ’s with (possibly) recursion
–Example: parent(Parent, Child)
–Query: finding all ancestors of Tom
ancestor(P,C) :- parent(P, C).
ancestor(P,C) :- ancestor(P,X), parent(X, C).
result(P) :- ancestor(P, 'tom').
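The recursive ancestor program can be evaluated bottom-up by applying the rules until no new facts appear. The sketch below uses naive fixpoint iteration; the parent facts are hypothetical and only serve to illustrate the recursion.

```python
def ancestors_of(parent_facts, person):
    """Naive bottom-up evaluation of:
         ancestor(P,C) :- parent(P,C).
         ancestor(P,C) :- ancestor(P,X), parent(X,C).
         result(P)     :- ancestor(P, person).
    """
    ancestor = set(parent_facts)  # first rule: every parent is an ancestor
    changed = True
    while changed:
        changed = False
        # Second rule: join ancestor(P,X) with parent(X,C) on the shared variable X.
        derived = {
            (p, c)
            for (p, x) in ancestor
            for (x2, c) in parent_facts
            if x == x2
        }
        if not derived <= ancestor:
            ancestor |= derived
            changed = True
    return {p for (p, c) in ancestor if c == person}


# Hypothetical parent(Parent, Child) facts, not taken from the slides.
parent = {("ann", "bob"), ("bob", "tom"), ("carl", "ann"), ("tom", "zoe")}
result = ancestors_of(parent, "tom")
print(sorted(result))
```

The loop stops when an iteration derives nothing new, which is guaranteed here because only finitely many pairs can be formed from the constants in the input.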

30 Although CQ theory first appeared at a database conference, the AI community has taken CQ’s to heart. CQ’s, or similar logics like description logic, are used in a number of AI applications. –Again, their design theory is really containment and equivalence.

31 Outline Basics: theories of conjunctive queries Global-as-view (GAV) approach to data integration Local-as-view (LAV) approach to data integration

32 GAV approach to data integration Readings: –Jeffrey Ullman, Information Integration Using Logical Views, ICDT –Ramana Yerneni, Chen Li, Hector Garcia-Molina, and Jeffrey Ullman, Computing Capabilities of Mediators, SIGMOD 1999.

33 Global-as-view Approach
The mediator exports views defined on source relations:
R1(Dealer, City), R2(Dealer, Make, Year)
med(Dealer,City,Make,Year) = R1 ⋈ R2
A query is posted on mediator views:
SELECT * FROM med WHERE Year = '2001';
ans(D,C,M,'2001') :- med(D,C,M,'2001').
The mediator expands the query to source queries:
SELECT * FROM R1, R2 WHERE R1.Dealer = R2.Dealer AND Year = '2001';
ans(D,C,M,'2001') :- R1(D,C), R2(D,M,'2001').
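The expansion step in GAV amounts to query unfolding: each mediator-view atom in the query is replaced by the view's body, with the view's head variables substituted by the query's arguments. The sketch below assumes "?"-prefixed variables, view heads made of distinct variables, and a hypothetical VIEWS dictionary holding the definition of med as a Datalog rule.

```python
import itertools


def is_var(term):
    # Assumed convention: variables are strings starting with "?".
    return isinstance(term, str) and term.startswith("?")


# med(Dealer,City,Make,Year) :- R1(Dealer,City), R2(Dealer,Make,Year).
VIEWS = {
    "med": (
        ("?Dealer", "?City", "?Make", "?Year"),
        [("R1", ("?Dealer", "?City")), ("R2", ("?Dealer", "?Make", "?Year"))],
    )
}


def unfold(body, views):
    """Replace every view atom in body by the view's definition."""
    counter = itertools.count()
    result = []
    for pred, args in body:
        if pred not in views:
            result.append((pred, args))
            continue
        v_head, v_body = views[pred]
        subst = dict(zip(v_head, args))  # view head variable -> query argument
        suffix = f"_{next(counter)}"     # keeps non-head view variables fresh

        def rename(term):
            if not is_var(term):
                return term
            return subst[term] if term in subst else term + suffix

        for v_pred, v_args in v_body:
            result.append((v_pred, tuple(rename(a) for a in v_args)))
    return result


# ans(D,C,M,'2001') :- med(D,C,M,'2001').
query_body = [("med", ("?D", "?C", "?M", "2001"))]
expanded = unfold(query_body, VIEWS)
print(expanded)
```

The unfolded body corresponds to R1(D,C), R2(D,M,'2001') on the slide; this directness is why query transformation in GAV is considered straightforward.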

34 GAV Approach
Project: TSIMMIS at Stanford
Advantages:
–User queries are easy to define
–Query transformation generation is straightforward
Disadvantages:
–Not all source information is exported
–Not easily scalable: every time a new source is added, mediator views need to be changed
Research issues:
–Efficient query execution?
–Deal with limited source capabilities?

35 Limited source capabilities
Complete scans of relations are not possible.
Reasons:
–Legacy databases or structured files: limited interfaces
–Security/privacy
–Performance concerns
Example 1: legacy databases with restrictive interfaces. Given an author, return the books.
author    title
Ullman    DBMS
Knuth     TeX
…         …

36 Another example: Web search forms

37 Problems How to describe source restrictions? How to compute mediator restrictions from sources? How to answer queries efficiently given these restrictions? How to compute as many answers as possible to a query? …

38 Computing mediator restrictions
Motivation: we do not want users to be frustrated by submitting a query that cannot be answered by the mediator.
Example:
–Source 1: book(author?, title, price). Capability “bff”, i.e., we must provide an author, and can get title and price info.
–Source 2: review(title?, reviewer, rate). Capability “bff”, i.e., we must provide a book title, and can get the other info.
–Mediator view: MedView(A?,T,P,RV,RT) :- book(A,T,P), review(T,RV,RT).
–Query on the mediator view: Ans(RT) :- MedView(A, 'db', P, RV, RT). I.e., “find the review rates of DB books”.
–But the mediator cannot answer this query, since we do not know the authors.
We want to tell the user beforehand what queries can be answered.
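Whether a query can be executed under such binding patterns can be checked greedily: repeatedly pick a subgoal whose "b" positions are all constants or already-bound variables, then mark its variables as bound. The sketch below assumes "?"-prefixed variables and a hypothetical PATTERNS table ("b" = must be bound, "f" = may be free); the author name in the second query is an illustrative input, not part of the slide's example.

```python
def is_var(term):
    # Assumed convention: variables are strings starting with "?".
    return isinstance(term, str) and term.startswith("?")


# Binding patterns of the two sources in the example above.
PATTERNS = {"book": "bff", "review": "bff"}


def executable_order(body, patterns, bound=()):
    """Return an order in which the subgoals can be sent to the sources,
    or None if some required binding can never be supplied."""
    bound = set(bound)
    remaining = list(body)
    order = []
    while remaining:
        for atom in remaining:
            pred, args = atom
            needed = [a for a, p in zip(args, patterns[pred]) if p == "b"]
            if all((not is_var(a)) or a in bound for a in needed):
                order.append(atom)
                bound.update(a for a in args if is_var(a))  # outputs become bound
                remaining.remove(atom)
                break
        else:
            return None  # no remaining subgoal is executable
    return order


# Body of MedView with the title fixed to 'db': the author is never bound.
q_db = [("book", ("?A", "db", "?P")), ("review", ("db", "?RV", "?RT"))]
# The same body with a given author instead (hypothetical input).
q_author = [("review", ("?T", "?RV", "?RT")), ("book", ("Ullman", "?T", "?P"))]

print(executable_order(q_db, PATTERNS))
print(executable_order(q_author, PATTERNS))
```

The greedy choice is safe because the set of bound variables only grows, so executing an executable subgoal early never blocks another one later.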

39 Outline Basics: theories of conjunctive queries; Global-as-view (GAV) approach to data integration; Local-as-view (LAV) approach to data integration.

40 Local-as-view (LAV) approach
There are global predicates, e.g., “car,” “person,” “book,” etc. They can be seen as mediator views.
The content of each source is described using these global predicates.
A query to the mediator is also defined on the global predicates.
The mediator finds a way to answer the query using the source contents.

41 Example
Global predicates: Loc(Dealer,City), Sell(Dealer,Make,Year)
Source content defined on global predicates:
S1(Dealer,City) :- Loc(Dealer, City).
S2(Dealer,Make,Year) :- Sell(Dealer, Make, Year).
In general, each definition could be more complicated than a direct copy.
Queries are defined on global predicates:
Q: ans(D,M,Y) :- Loc(D, 'windsor'), Sell(D, M, Y).
–Users do not know source views. The mediator decides how to use source views to answer queries.
–“Answering queries using views”: ans(D,M,Y) :- S1(D,'windsor'), S2(D,M,Y).

42 Answering queries using views
Source views can be complicated: SPJs, arithmetic comparisons, …
It is not easy to decide how to answer a query using source views.
Views:
V1(Dealer,City) :- Loc(Dealer, City).
V2(Dealer,Make,Year) :- Sell(Dealer, Make, Year).
V3(D,C,M,Y) :- Loc(D,C), Sell(D,M,Y).
V4(D,C,M,Y) :- Loc(D,C), Sell(D,M,Y), Y<1970.
Query: ans(D,M) :- Loc(D,'windsor'), Sell(D,M,Y).
Rewritings:
ans(D,M) :- V3(D,'windsor',M,Y).
ans(D,M) :- V1(D,'windsor'), V2(D,M,Y).
…
–“Equivalent rewriting”: computes the “same” answer as the query
–A rewriting can join multiple source views

43 Arithmetic comparisons
Comparisons can make the problem even trickier.
View: V(D,C,M,Y) :- Loc(D,C), Sell(D,M,Y), Y<1970.
Query: ans(D,M) :- Loc(D,'windsor'), Sell(D,M,Y).
Rewriting: ans(D,M) :- V(D,'windsor',M,Y).
Contained rewriting: only retrieves cars before …
Query: ans(D,M) :- Loc(D,'windsor'), Sell(D,M,Y), Y < …
Rewriting: ans(D,M) :- V(D,'windsor',M,Y), Y < …

44 Local-as-View (LAV)
Global schema: Book(ISBN, Title, Genre, Year), Author(ISBN, Name)
Sources 1 to 5 each have a local schema, described as a view over the global schema.
R1(ISBN, Title, Name): books before 1970
Create View R1 AS SELECT B.ISBN, B.Title, A.Name FROM Book B, Author A WHERE A.ISBN = B.ISBN AND B.Year < 1970
R1(ISBN, Title, Name) :- Book(ISBN, Title, Genre, Year), Author(ISBN, Name), Year<1970.
R5(ISBN, Title): humor books
Create View R5 AS SELECT B.ISBN, B.Title FROM Book B WHERE B.Genre = 'Humor'
R5(ISBN, Title) :- Book(ISBN, Title, 'humor', Year).

45 LAV details
Query: find authors of humor books
Q(Name) :- Book(ISBN, Title, 'humor', Year), Author(ISBN, Name).
Views:
R1(ISBN, Title, Name) :- Book(ISBN, Title, Genre, Year), Author(ISBN, Name), Year<1970.
R5(ISBN, Title) :- Book(ISBN, Title, 'humor', Year).
Rewriting of Q using views:
Q’(Name) :- R1(ISBN, Title, Name), R5(ISBN, Title).
Expansion of Q’:
Q’’(Name) :- Book(ISBN, Title, Genre, Year), Author(ISBN, Name), Year<1970, Book(ISBN, Title, 'humor', Year).
Q’’’(Name) :- Author(ISBN, Name), Year<1970, Book(ISBN, Title, 'humor', Year).
Q’’’ is contained in Q.

46 Query Rewritings
Given a query Q and a set of views V:
–A conjunctive query P is called a “rewriting” of Q using V if P only uses views in V, and P computes a partial answer of Q. That is: P^exp ⊆ Q, where P^exp is the expansion of P. A rewriting is also called a “contained rewriting” (CR).
–A conjunctive query P is called an “equivalent rewriting” (ER) of Q using V if P only uses views in V, and P computes the exact answer of Q. That is: P^exp ≡ Q.

47 Bucket algorithm
It is the basic method for query rewriting.
Each subgoal must be “covered” by some view.
Make a list of candidates (buckets) per query subgoal.
Consider combinations of candidates from different buckets.
Not all combos are “compatible”.
Keep the compatible ones and minimize them.
Discard the ones contained in another.
Take their union.

48 The Bucket Algorithm: Example
Views:
V1(Std,Crs,Qtr,Title) :- reg(Std,Crs,Qtr), course(Crs,Title), Crs ≥ 500, Qtr ≥ Aut98
V2(Std,Prof,Crs,Qtr) :- reg(Std,Crs,Qtr), teaches(Prof,Crs,Qtr)
V3(Std,Crs) :- reg(Std,Crs,Qtr), Qtr ≤ Aut94
V4(Prof,Crs,Title,Qtr) :- reg(Std,Crs,Qtr), course(Crs,Title), teaches(Prof,Crs,Qtr), Qtr ≤ Aut97
Query:
q(S,C,P) :- teaches(P,C,Q), reg(S,C,Q), course(C,T), C ≥ 300, Q ≥ Aut95
Step 1: For each query subgoal, put the relevant sources into a bucket

49 The Bucket Algorithm: Example
Views V1 to V4 and query q as on slide 48.
Bucket for teaches(P,C,Q), with P → Prof, C → Crs, Q → Qtr: V2, V4
Note: Arithmetic predicates don’t pose a problem in this step

50 The Bucket Algorithm: Example
Views V1 to V4 and query q as on slide 48.
Bucket for reg(S,C,Q), with S → Std, C → Crs, Q → Qtr: V1, V2
Note: V3 doesn’t work: arithmetic predicates not consistent. V4 doesn’t work: S not in the output of V4.
Buckets so far: teaches: V2, V4; reg: V1, V2
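The consistency test that rules out V3 for the reg subgoal can be sketched as an interval check on a single variable: collect the lower and upper bounds the view and the query place on the quarter and see whether any value satisfies all of them. Encoding Aut94, Aut95, and so on as the integers 1994, 1995, … is an assumption of this sketch, as is the restriction to ≤ and ≥ comparisons against constants.

```python
def consistent(constraints):
    """constraints: list of (op, bound) pairs on one variable, op in {'<=', '>='}.
    Returns True if some value satisfies all of them."""
    lo, hi = float("-inf"), float("inf")
    for op, bound in constraints:
        if op == ">=":
            lo = max(lo, bound)
        elif op == "<=":
            hi = min(hi, bound)
        else:
            raise ValueError(f"unsupported comparison: {op}")
    return lo <= hi


# Assumed encoding: autumn quarter of year Y is the integer Y.
QUERY_QTR = [(">=", 1995)]        # Q >= Aut95 in the query
VIEW_QTR = {
    "V1": [(">=", 1998)],         # Qtr >= Aut98
    "V2": [],                     # no arithmetic predicate
    "V3": [("<=", 1994)],         # Qtr <= Aut94
    "V4": [("<=", 1997)],         # Qtr <= Aut97
}

compatible = {name: consistent(cs + QUERY_QTR) for name, cs in VIEW_QTR.items()}
print(compatible)
```

V4 passes this arithmetic check, so it is excluded from the reg bucket for the other reason given above: S is not in its output.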

51 The Bucket Algorithm: Example
Views V1 to V4 and query q as on slide 48.
Bucket for course(C,T), with C → Crs, T → Title: V1, V4
Buckets: teaches: V2, V4; reg: V1, V2; course: V1, V4

52 The Bucket Algorithm: Example
Buckets: teaches: V2, V4; reg: V1, V2; course: V1, V4
Step 2: Try all combos of views, one from each bucket
Test satisfaction of arithmetic predicates in each case
–e.g., two views may not overlap, i.e., they may be inconsistent
Desired rewriting = union of surviving ones
Query rewriting 1: q1(S,C,P) :- V2(S’,P,C,Q), V1(S,C,Q,T’), V1(S”,C,Q’,T)
–no problem from arithmetic predicates (none in V2)
–May or may not be minimal (why?)

53 The Bucket Algorithm: Example
Unfolding of rewriting 1 (r, t, and c abbreviate reg, teaches, and course):
q1’(S,C,P) :- r(S’,C,Q), t(P,C,Q), r(S,C,Q), c(C,T’), r(S”,C,Q’), c(C,T), C ≥ 500, Q ≥ Aut98, C ≥ 500, Q’ ≥ Aut98
The subgoals r(S’,C,Q) and r(S”,C,Q’) (black on the slide) can be mapped to r(S,C,Q) (green): S’ → S, S” → S, Q’ → Q
The subgoal c(C,T) (black) can be mapped to c(C,T’) (green): just extend the above mapping with T → T’
Minimized unfolding of rewriting 1: q1m’(S,C,P) :- t(P,C,Q), r(S,C,Q), c(C,T’), C ≥ 500, Q ≥ Aut98
Minimized rewriting 1: q1m(S,C,P) :- V2(S’,P,C,Q), V1(S,C,Q,T’)

54 The Bucket Algorithm: Example
Query rewriting 2: q2(S,C,P) :- V2(S’,P,C,Q), V1(S,C,Q,T’), V4(P’,C,T,Q’)
Unfolding: q2’(S,C,P) :- r(S’,C,Q), t(P,C,Q), r(S,C,Q), c(C,T’), C ≥ 500, Q ≥ Aut98, r(S”,C,Q’), c(C,T), t(P’,C,Q’), Q’ ≤ Aut97
This combo is infeasible: consider the conjunction of arithmetic predicates in V1 and V4
Query rewriting 3: q3(S,C,P) :- V2(S’,P,C,Q), V2(S,P’,C,Q), V4(P”,C,T,Q’)

55 The Bucket Algorithm: Example
Unfolding of rewriting 3: q3’(S,C,P) :- r(S’,C,Q), t(P,C,Q), r(S,C,Q), t(P’,C,Q), r(S”,C,Q’), c(C,T), t(P”,C,Q’), Q’ ≤ Aut97
The green subgoals can cover the black ones under the mapping: S’ → S, S” → S, P’ → P, P” → P, Q’ → Q
Minimized rewriting 3: q3m(S,C,P) :- V2(S,P,C,Q), V4(P,C,T,Q)
Verify that there are only two rewritings that are not covered by others
Maximally Contained Rewriting: q’ = q1m ∪ q3m

56 The Bucket Algorithm: Example 2
Query: q(X) :- cites(X,Y), cites(Y,X), sameTopic(X,Y)
Views:
V4(A) :- cites(A,B), cites(B,A)
V5(C,D) :- sameTopic(C,D)
V6(F,H) :- cites(F,G), cites(G,H), sameTopic(F,G)
Buckets: cites(X,Y): V4, V6; cites(Y,X): V4, V6; sameTopic(X,Y): V5, V6
Note: Should we list V4(X) twice in the buckets?

57 Bucket algorithm
Query: q(x) :- car(x), sell(x,d), loc(d,'windsor').
Views:
v1(x) :- car(x).
v2(x) :- car(x), sell(x,d).
v3(x,d) :- sell(x,d), loc(d,'windsor').
v4(x) :- sell(x,d), loc(d,'windsor').
Buckets (columns Car(x), Sell(x,d), Loc(d,'windsor')): V1(x), V2(x), v2(x), V3(x,d), V4(x)
Candidate rewritings:
q(x) :- v1(x), v2(x), v3(x,d).
q(x) :- v1(x), v3(x,d).
q(x) :- v1(x), v4(x).
q(x) :- v2(x), v3(x,d).
q(x) :- v2(x), v4(x).
…
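Step 1 of the bucket algorithm for this car example can be sketched as follows. This is a simplified illustration, not the full algorithm: a view enters a subgoal's bucket if it has a subgoal with the same predicate, constants do not clash, and every query head variable or query constant in the subgoal lands on an output (head) variable of the view. It ignores repeated-variable consistency and arithmetic predicates, and it does not perform the containment check of step 2, so the combinations it enumerates are only candidates. The "?"-prefixed variable encoding is an assumption of the sketch.

```python
import itertools


def is_var(term):
    # Assumed convention: variables are strings starting with "?".
    return isinstance(term, str) and term.startswith("?")


QUERY = (("?x",), [("car", ("?x",)), ("sell", ("?x", "?d")), ("loc", ("?d", "windsor"))])
VIEWS = {
    "v1": (("?x",), [("car", ("?x",))]),
    "v2": (("?x",), [("car", ("?x",)), ("sell", ("?x", "?d"))]),
    "v3": (("?x", "?d"), [("sell", ("?x", "?d")), ("loc", ("?d", "windsor"))]),
    "v4": (("?x",), [("sell", ("?x", "?d")), ("loc", ("?d", "windsor"))]),
}


def covers(q_head, q_atom, v_head, v_atom):
    """Simplified test: can view subgoal v_atom cover query subgoal q_atom?"""
    (q_pred, q_args), (v_pred, v_args) = q_atom, v_atom
    if q_pred != v_pred or len(q_args) != len(v_args):
        return False
    for qa, va in zip(q_args, v_args):
        if is_var(va):
            # A query head variable or constant must land on a view output column.
            if (qa in q_head or not is_var(qa)) and va not in v_head:
                return False
        elif not is_var(qa) and qa != va:
            return False  # two different constants cannot match
    return True


def build_buckets(query, views):
    q_head, q_body = query
    return {
        q_atom: [
            name
            for name, (v_head, v_body) in views.items()
            if any(covers(q_head, q_atom, v_head, v_atom) for v_atom in v_body)
        ]
        for q_atom in q_body
    }


buckets = build_buckets(QUERY, VIEWS)
# One view per bucket; each combination still needs the step-2 containment check.
candidates = list(itertools.product(*buckets.values()))
for atom, names in buckets.items():
    print(atom, names)
```

The buckets printed here are what this sketch computes under its simplified rule; they should be read as an illustration of the idea rather than a reconstruction of the table on the slide.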

58 Projects using the LAV approach
Projects: Information Manifold, Infomaster, Tukwila, …
Advantages:
–Scalable: new sources are easy to add without modifying the mediator views
–All we need to do is to define the new source using the existing mediator views (predicates)
Disadvantages:
–Hard to decide how to answer a query using views
Reading: Alon Halevy, Answering Queries Using Views: A Survey.