CSE 636 Data Integration Limited Source Capabilities Slides by Hector Garcia-Molina Fall 2006.

1 CSE 636 Data Integration: Limited Source Capabilities. Slides by Hector Garcia-Molina. Fall 2006.

2 Heterogeneous Databases. Diagram: a distributed database system spanning heterogeneous sources, each with its own data: DBMS 1, DBMS 2, a legacy system, and a web site.

3 Limited Capabilities

4 Example: Amazon.com. Search form fields: author, title, subject, format, price. Callouts on the form: must specify at least one of these; this attribute not returned; cannot query on this attribute; menu of choices.

5 Example: BarnesAndNoble.com. Search form fields: author, title, subject, format, price. Callouts on the form: must specify at least one of these; can query if one of the other attributes is specified; menu of choices.

6 Why Limited Capabilities? Search forms; security; indexes; legacy.

7 Capability vs. Content. Capability description: the source can only be searched for subject = “art,” “history,” or “science.” Content description: the source only contains subject = “art,” “history,” or “science.”

8 Outline: describing source capabilities; extending source capabilities; how mediators cope with limited capabilities; mediator capabilities; other topics. Diagram: Mediator, Wrapper, Source.

9 Describing Query Capabilities. Relation: R(X, Y, ..., Z). Adornments: f: may or may not be specified; u: cannot be specified; b: must be specified; c[S]: must be specified, chosen from list S; o[S]: optional, chosen from list S.

10 Describing Query Capabilities. Relation: R(X, Y, ..., Z). Adornments: f: may or may not be specified; u: cannot be specified; b: must be specified; c[S]: must be specified, chosen from list S; o[S]: optional, chosen from list S. With output restriction: f’, u’, b’, c’[S], o’[S].

11 Example. Relation: R(X, Y, Z). Description templates: bu’f and uf’c[z1, z2]. Answerable queries: R(x1, Y, Z), R(X, Y, z1). Unanswerable queries: R(X, y1, Z), R(X, Y, z3).
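The template check in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, a query is modeled as one value per attribute (a constant if bound, None if left free), and the primed (output restriction) adornments are treated like their unprimed forms because only the input side is checked here. The templates are written with a plain apostrophe for the prime.

```python
from typing import List, Optional, Sequence

# One entry per attribute: a constant if the query binds it, None otherwise.
Query = Sequence[Optional[str]]


def attribute_ok(adornment: str, value: Optional[str]) -> bool:
    """Check one query attribute against one adornment (input side only)."""
    kind, _, rest = adornment.partition("[")
    kind = kind.rstrip("'")  # assumption: a prime does not change input rules
    choices = [c.strip() for c in rest.rstrip("]").split(",")] if rest else []
    if kind == "f":   # may or may not be specified
        return True
    if kind == "u":   # cannot be specified
        return value is None
    if kind == "b":   # must be specified
        return value is not None
    if kind == "c":   # must be specified, chosen from the list
        return value in choices
    if kind == "o":   # optional, but if given it must come from the list
        return value is None or value in choices
    raise ValueError(f"unknown adornment: {adornment}")


def answerable(templates: List[List[str]], query: Query) -> bool:
    """A query is answerable if at least one template accepts every attribute."""
    return any(
        len(template) == len(query)
        and all(attribute_ok(a, v) for a, v in zip(template, query))
        for template in templates
    )


# Templates from the slide: bu'f and uf'c[z1, z2] over R(X, Y, Z).
TEMPLATES = [["b", "u'", "f"], ["u", "f'", "c[z1, z2]"]]
```

Calling answerable(TEMPLATES, ("x1", None, None)) models the query R(x1, Y, Z), and answerable(TEMPLATES, (None, None, "z3")) models R(X, Y, z3).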

12 Extending Source Capabilities. Query sent to the amazon wrapper: author = “Freud” AND price > 10. Source: R(author, price, ...). Template: b, u, ...

13 Extending Source Capabilities. Source: R(author, price, ...). Template: b, u, ... Query: author = “Freud” AND price > 10. The amazon wrapper sends the source query author = “Freud” and applies the filter price > 10 itself.
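A minimal sketch of this wrapper-side filtering, in Python. Everything named here is hypothetical: the CATALOG rows are made-up stand-ins for the source's data, and source_query and wrapper_query are illustrative names, not part of any real bookstore API.

```python
from typing import Callable, Dict, List

Book = Dict[str, object]

# Made-up rows standing in for the source's data (prices are invented).
CATALOG: List[Book] = [
    {"author": "Freud", "title": "The Interpretation of Dreams", "price": 15.0},
    {"author": "Freud", "title": "Totem and Taboo", "price": 8.0},
    {"author": "Jung", "title": "Man and His Symbols", "price": 12.0},
]


def source_query(author: str) -> List[Book]:
    """The source: author must be bound (b); price cannot be specified (u)."""
    return [book for book in CATALOG if book["author"] == author]


def wrapper_query(author: str, price_filter: Callable[[float], bool]) -> List[Book]:
    """The wrapper: push the author condition to the source, filter price locally."""
    candidates = source_query(author)
    return [book for book in candidates if price_filter(book["price"])]


# Query: author = "Freud" AND price > 10
result = wrapper_query("Freud", lambda price: price > 10)
```

The wrapper accepts a richer query than the source does, at the cost of fetching every book by the author before filtering.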

14 Another Example. Query sent to the Barnes&Noble wrapper: (author = “Freud” OR author = “Jung”) AND price < 10. Source: R(author, price, …). Restrictions: no disjunctive conditions; price can only be specified with author.

15 Another Example. Query: (author = “Freud” OR author = “Jung”) AND price < 10. Source: R(author, price, …). Restrictions: no disjunctive conditions; price can only be specified with author. The Barnes&Noble wrapper issues Q1: author = “Freud” AND price < 10 and Q2: author = “Jung” AND price < 10, then returns the union of the two answers.
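The split into Q1 and Q2 followed by a union might look like the following Python sketch. The catalog rows and both function names are hypothetical; the source is modeled as accepting exactly one author per call, optionally with a price bound, which is one simple way to encode the two restrictions above.

```python
from typing import List, Optional, Sequence, Tuple

Row = Tuple[str, str, float]  # (author, title, price)

# Made-up rows standing in for the source's data (prices are invented).
CATALOG: List[Row] = [
    ("Freud", "Totem and Taboo", 8.0),
    ("Freud", "The Interpretation of Dreams", 15.0),
    ("Jung", "Psychology and Alchemy", 9.0),
    ("Jung", "Man and His Symbols", 12.0),
    ("Adler", "Understanding Human Nature", 7.0),
]


def source_query(author: str, max_price: Optional[float] = None) -> List[Row]:
    """The source: one author per call (no disjunction); price only with author."""
    return [
        row for row in CATALOG
        if row[0] == author and (max_price is None or row[2] < max_price)
    ]


def wrapper_query(authors: Sequence[str], max_price: float) -> List[Row]:
    """The wrapper: one conjunctive source query per author, then a union."""
    seen = set()
    union: List[Row] = []
    for author in authors:
        for row in source_query(author, max_price):
            if row not in seen:  # union semantics: drop duplicates
                seen.add(row)
                union.append(row)
    return union


# Query: (author = "Freud" OR author = "Jung") AND price < 10
rows = wrapper_query(["Freud", "Jung"], 10)
```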

16 Other Description Mechanisms. Tsimmis: query templates; Information Manifold: capability records (# bound attrs, conditions ok, ...); Disco; Garlic: black box.

17 Extending Source Capabilities. General scheme: try many query rewritings; check whether the query fragments are supported by the source; check whether the wrapper can combine the answer fragments; do all this very efficiently! See H. Garcia-Molina, W. Labio, R. Yerneni: Capability-Sensitive Query Processing on Internet Sources, ICDE 1999. Limitations of other systems: Tsimmis, Info Manifold: no disjunctive queries; DISCO: no query splitting; Garlic: only CNF queries.

18 Tsimmis. Suppose a database contains information about employees and students, and the only queries accepted by the database are: retrieve person records by specifying the last name; retrieve person records by specifying the first and the last name; retrieve all person records by issuing the command

19 Tsimmis. Query templates. Retrieve person records by specifying the last name: O :- }> Retrieve person records by specifying the first and the last name: O :- Retrieve all person records by issuing the command: O :-

20 Tsimmis. Directly supported queries: Q :- }> Logically supported queries: Q :- }> Indirectly supported queries: Q :- }>

21 Information Manifold (IM). IM uses capability records to capture two kinds of capabilities: the ability of a source to apply a number of selections, and the limited forms of variable bindings that a source can accept. A capability record has the form (S_in, S_out, S_sel, min, max). The source must be given bindings for at least min elements of S_in. The elements of S_out are the parameters that can be returned from the source.

22 Information Manifold (IM). Example: a car reviews database containing reviews for cars manufactured after 1990. V(m, y, r) :- Car(c), Model(c, m), Year(c, y), ProductReview(m, y, r). Capabilities: ({m}, {m, y, r}, {y}, 1, 2).
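A capability record like the one above can be modeled as a small data structure. The Python sketch below is an assumption-laden illustration: the class and function names are hypothetical, and the check covers only the two rules the slides state (at least min elements of S_in must be bound; only elements of S_out can be returned). The slides do not spell out how S_sel and max are used, so the sketch stores them without checking them.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class CapabilityRecord:
    """(S_in, S_out, S_sel, min, max) as written on the slide."""
    s_in: FrozenSet[str]
    s_out: FrozenSet[str]
    s_sel: FrozenSet[str]   # stored only; its use is not spelled out here
    min_value: int
    max_value: int          # stored only; its use is not spelled out here


def accepts(record: CapabilityRecord, bound: Set[str], requested: Set[str]) -> bool:
    """Check the two stated rules: enough S_in bindings, outputs within S_out."""
    enough_bindings = len(bound & record.s_in) >= record.min_value
    outputs_available = requested <= record.s_out
    return enough_bindings and outputs_available


# Capabilities of the car reviews source: ({m}, {m, y, r}, {y}, 1, 2)
CAR_REVIEWS = CapabilityRecord(
    s_in=frozenset({"m"}),
    s_out=frozenset({"m", "y", "r"}),
    s_sel=frozenset({"y"}),
    min_value=1,
    max_value=2,
)
```

Under these assumptions, a request that binds the model m and asks for y and r is accepted, while a request that binds nothing is rejected.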

23 Garlic. Allows unrestricted condition expressions. The condition expression is transformed into CNF, and each clause of the CNF expression is then considered for evaluation at the source. If the source cannot evaluate a clause, Garlic evaluates it itself by downloading the source's data. This approach is not only expensive but may also not be allowed by the source.

24 DISCO. DISCO does not explore the possibility of splitting the condition expression into parts. It considers only those options in which the source processes either the entire condition expression or no part of it. This strategy limits DISCO’s ability to generate feasible plans for many queries.

25 An Example. Suppose we are looking for books on the topic of dreams written by Freud or Jung in the Internet bookstore BarnesAndNoble, which does not allow us to search for two authors at once. Query: (author = “Freud” OR author = “Jung”) AND (title contains “dreams”). Garlic cannot evaluate the first clause at the source. The second clause extracts over 2,000 entries.

26 An Example (cont.). A better plan is to break the query into two: first search for (author = “Freud” AND title contains “dreams”); then search for (author = “Jung” AND title contains “dreams”); then take the union of the two results. This plan extracts fewer than 20 entries.

27 Mediator Processing. Sources: R(X, Y, Z) with template f, f, b; T(Z, W, U) with template f, u, b. Mediator view: M(X, Y, Z, W, U) = Join(R, T). Query: M(5, Y, Z, W, 3). Diagram: Mediator, Wrapper, Source.

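28 Plan 1. Sources: R(X, Y, Z) with template f, f, b; T(Z, W, U) with template f, u, b. Mediator view: M(X, Y, Z, W, U) = Join(R, T). Query: M(5, Y, Z, W, 3). Steps: (1) R(5, Y, Z); (2) T(Z, W, 3); (3) join the answers. Note that step (1) leaves Z unbound, whereas the template for R marks Z as b (must be specified).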

29 Plan 2. Sources: R(X, Y, Z) with template f, f, b; T(Z, W, U) with template f, u, b. Mediator view: M(X, Y, Z, W, U) = Join(R, T). Query: M(5, Y, Z, W, 3). Steps: (1) P = T(Z, W, 3); (2) for each (z, w, u) ∈ P: R(5, Y, z); (3) join the answers.
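The difference between Plan 1 and Plan 2 can be made concrete with a small Python sketch. The rows in R_DATA and T_DATA are made up, and the function names are hypothetical; each source function enforces its template by raising an error when a b attribute is left unbound. With these templates, Plan 1 fails at its first step because R needs Z, while Plan 2 first queries T and then passes each returned z to R.

```python
from typing import List, Optional, Tuple

# Made-up data for the two sources.
R_DATA: List[Tuple[int, int, int]] = [(5, 10, 1), (5, 11, 2), (6, 12, 1)]  # (X, Y, Z)
T_DATA: List[Tuple[int, int, int]] = [(1, 7, 3), (2, 8, 4), (9, 9, 3)]     # (Z, W, U)


def query_r(x: Optional[int] = None, y: Optional[int] = None,
            z: Optional[int] = None) -> List[Tuple[int, int, int]]:
    """R(X, Y, Z) with template f, f, b: Z must be bound."""
    if z is None:
        raise ValueError("R requires Z to be bound")
    return [r for r in R_DATA
            if (x is None or r[0] == x) and (y is None or r[1] == y) and r[2] == z]


def query_t(z: Optional[int] = None,
            u: Optional[int] = None) -> List[Tuple[int, int, int]]:
    """T(Z, W, U) with template f, u, b: W has no parameter; U must be bound."""
    if u is None:
        raise ValueError("T requires U to be bound")
    return [t for t in T_DATA if (z is None or t[0] == z) and t[2] == u]


def plan_1(x: int, u: int) -> List[Tuple[int, int, int, int, int]]:
    """Query both sources independently, then join on Z."""
    r_rows = query_r(x=x)   # raises: Z is not bound
    t_rows = query_t(u=u)
    return [(rx, ry, rz, tw, tu)
            for (rx, ry, rz) in r_rows for (tz, tw, tu) in t_rows if rz == tz]


def plan_2(x: int, u: int) -> List[Tuple[int, int, int, int, int]]:
    """Query T first, then feed each returned z into R."""
    answers = []
    for (z, w, u_val) in query_t(u=u):
        for (x_val, y, _) in query_r(x=x, z=z):
            answers.append((x_val, y, z, w, u_val))
    return answers
```

Calling plan_2(5, 3) answers the query M(5, Y, Z, W, 3) over the made-up data, while plan_1(5, 3) raises a ValueError at its first source call.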

30 Mediator Plan Generation. Need a feasible and efficient plan. The search space is huge. Tsimmis, Info Manifold, Garlic: exponential algorithms. Polynomial algorithms: often find an optimal or near-optimal plan; bounded performance; see R. Yerneni, C. Li, J. D. Ullman, H. Garcia-Molina: Optimizing Large Join Queries in Mediation Systems, ICDT 1999.

31 Conclusion. Not all sources are created equal! Need to: describe what sources can do; efficiently process queries with limited sources; describe what mediators can do; exploit content information; deal with unavailable sources.

32 References. Ramana Yerneni, Chen Li, Hector Garcia-Molina, Jeffrey D. Ullman: Computing Capabilities of Mediators, SIGMOD Conference 1999. Vasilis Vassalos, Yannis Papakonstantinou: Describing and Using Query Capabilities of Heterogeneous Sources, VLDB 1997.

