
1 Efficient Keyword Search across Heterogeneous Relational Databases
Mayssam Sayyadian, AnHai Doan (University of Wisconsin - Madison), Hieu LeKhac (University of Illinois - Urbana), Luis Gravano (Columbia University)

2 Key Message of Paper
Precise data integration is expensive. But we can do IR-style data integration very cheaply, with no manual cost:
– just apply automatic schema/data matching
– then do keyword search across the databases
– no need to verify anything manually
Already very useful. Builds upon keyword search over a single database...

3 Keyword Search over a Single Relational Database
A growing field with numerous recent works:
– DBXplorer [ICDE02], BANKS [ICDE02]
– DISCOVER [VLDB02]
– Efficient IR-style keyword search in databases [VLDB03]
– VLDB-05, SIGMOD-06, etc.
Many related works over XML / other types of data:
– XKeyword [ICDE03], XRank [SIGMOD03]
– TeXQuery [WWW04]
– ObjectRank [SIGMOD06]
– TopX [VLDB05], etc.
More are coming at SIGMOD-07...

4 A Typical Scenario

Customers:
  tid  custid  name   contact        addr
  t1   c124    Cisco  Michael Jones  …
  t2   c533    IBM    David Long     …
  t3   c333    MSR    David Ross     …

Complaints:
  tid  id    emp-name       comments
  u1   c124  Michael Smith  Repair didn't work
  u2   c124  John           Deferred work to John Smith

Q = [Michael Smith Cisco], evaluated over the foreign-key join Customers.custid = Complaints.id, returns a ranked list of answers:
– t1 ⋈ u1 (Cisco / Michael Smith), score = 0.8
– t1 ⋈ u2 (Cisco / "Deferred work to John Smith"), score = 0.7
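The single-database flow on this slide can be sketched in a few lines of Python. This is a toy illustration over the slide's two tables, not the paper's method: the ranking function here (keyword coverage divided by answer size) is a stand-in for the real IR-style score.

```python
# Toy sketch of single-database keyword search: join keyword-matching tuples
# across the foreign key, keep answers that cover every query keyword.

customers = [  # (tid, custid, name, contact)
    ("t1", "c124", "Cisco", "Michael Jones"),
    ("t2", "c533", "IBM", "David Long"),
    ("t3", "c333", "MSR", "David Ross"),
]
complaints = [  # (tid, id, emp-name, comments)
    ("u1", "c124", "Michael Smith", "Repair didn't work"),
    ("u2", "c124", "John", "Deferred work to John Smith"),
]

def covered(row, keywords):
    """Query keywords that appear somewhere in the tuple's text."""
    text = " ".join(row).lower()
    return {k for k in keywords if k.lower() in text}

def search(query):
    keywords = set(query.split())
    answers = []
    for c in customers:
        for p in complaints:
            if c[1] != p[1]:  # foreign-key join: Customers.custid = Complaints.id
                continue
            hits = covered(c, keywords) | covered(p, keywords)
            if hits == keywords:  # AND semantics: every keyword must appear
                answers.append((len(hits) / 2, c[0], p[0]))  # size(A) = 2 tuples
    return sorted(answers, reverse=True)

results = search("Michael Smith Cisco")
```

As on the slide, both (t1, u1) and (t1, u2) cover all three keywords and are returned.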

5 Our Proposal: Keyword Search across Multiple Databases → IR-style Data Integration

HR-DB:
  Employees:
    tid  empid  name
    v1   e23    Mike D. Smith
    v2   e14    John Brown
    v3   e37    Jack Lucas
  Groups:
    tid  eid  reports-to
    x1   e23  e37
    x2   e14  e37

Service-DB:
  Customers:
    tid  custid  name   contact        addr
    t1   c124    Cisco  Michael Jones  …
    t2   c533    IBM    David Long     …
    t3   c333    MSR    Joan Brown     …
  Complaints:
    tid  id    emp-name       comments
    u1   c124  Michael Smith  Repair didn't work
    u2   c124  John           Deferred work to John Smith

Query: [Cisco Jack Lucas]
One answer spans both databases: t1 (Cisco) ⋈ u1 (Michael Smith) ⋈ v1 (Mike D. Smith) ⋈ x1 (e23 reports to e37) ⋈ v3 (Jack Lucas)

6 A Naïve Solution
1. Manually identify FK joins across DBs
2. Manually identify matching data instances across DBs
3. Treat the combination of DBs as a single DB → apply current keyword search techniques
Just like in traditional data integration, this is too much manual work.

7 Kite Solution
(Same HR-DB and Service-DB tables as on the previous slide.)
Automatically find FK joins and matching data instances across databases → no manual work is required from the user.

8 Automatically Find FK Joins across Databases
Current solutions analyze data values only (e.g., Bellman) → limited accuracy
– e.g., "waterfront" with values yes/no and "electricity" with values yes/no look joinable even though they are unrelated
Our solution: data analysis + schema matching
– improves accuracy drastically (by as much as 50% F-1)
(Figure: the Employees and Complaints tables, illustrating a candidate cross-database join.)
Automatic join/data matching can be wrong → incorporate confidence scores into answer scores.
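As a rough illustration of why combining signals helps (this is not Kite's actual matcher), one can score a candidate join by mixing value overlap with a crude name-similarity measure; the slide's yes/no columns then stop looking like a perfect join. The weights and the trigram similarity below are assumptions chosen for the demo.

```python
# Illustrative join-confidence score: value overlap (data analysis) plus a
# toy character-trigram name similarity (a stand-in for schema matching).

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def trigrams(s):
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def join_confidence(col1, col2, w_data=0.5, w_schema=0.5):
    (name1, values1), (name2, values2) = col1, col2
    return (w_data * jaccard(values1, values2)
            + w_schema * jaccard(trigrams(name1), trigrams(name2)))

# Bellman-style value overlap alone scores these yes/no columns a perfect 1.0;
# the name signal pulls the combined confidence down to 0.5.
spurious = join_confidence(("waterfront", ["yes", "no"]),
                           ("electricity", ["yes", "no"]))
# A genuine FK pair: overlapping ids and similar names score higher.
genuine = join_confidence(("custid", ["c124", "c533", "c333"]),
                          ("cust-id", ["c124", "c533", "c333"]))
```

With these weights, the genuine pair outranks the spurious yes/no pair, which value overlap alone would rank as a perfect match.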

9 Incorporate Confidence Scores into Answer Scores

score(A, Q) = [α · score_kw(A, Q) + β · score_join(A, Q) + γ · score_data(A, Q)] / size(A)

Recall the answer example in the single-DB setting: t1 ⋈ u1 (Cisco / Michael Smith), score = 0.8.
Recall the answer example in the multiple-DB setting: t1 ⋈ u1 ⋈ v1 ⋈ x1 ⋈ v3, which uses an FK join with confidence score 0.9 and a data matching with confidence score 0.7.
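In code, the formula on this slide is just a weighted combination normalized by answer size. The weights α, β, γ are tunable; the values plugged in below are illustrative, echoing the slide's 0.9 join confidence and 0.7 data-matching confidence (score_kw = 0.8 is an assumed keyword score, not taken from the paper).

```python
# score(A, Q) = (alpha*score_kw + beta*score_join + gamma*score_data) / size(A)
def score(size, score_kw, score_join, score_data,
          alpha=1.0, beta=1.0, gamma=1.0):
    return (alpha * score_kw + beta * score_join + gamma * score_data) / size

# A 5-tuple cross-database answer whose FK join was found with confidence 0.9
# and whose data matching has confidence 0.7:
s = score(size=5, score_kw=0.8, score_join=0.9, score_data=0.7)  # 0.48
```

Note how a low-confidence join or data matching directly drags down the answer's final score, which is the point of the slide.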

10 Summary of Trade-Offs
Precise data integration
– the holy grail: supports full SQL queries
IR-style data integration, naïve way
– manually identify FK joins, matching data
– still too expensive
IR-style data integration, using Kite
– automatic FK join finding / data matching
– cheap
– only approximates the "ideal" ranked list found by the naïve way

11 Kite Architecture
Offline preprocessing over databases D1 ... Dn:
– Index Builder → one IR index per database
– Foreign-Key Join Finder (schema matcher + data-based join finder) → foreign-key joins
– Data instance matcher
Online querying, e.g. Q = [Smith Cisco]:
– Condensed CN Generator
– Top-k Searcher, driven by refinement rules (Partial, Full, Deep) → distributed SQL queries over D1 ... Dn

12 Online Querying
What current solutions do (illustrated on two databases, each with two relations):
1. Create answer templates
2. Materialize answer templates to obtain answers

13 Create Answer Templates
Find tuples that contain query keywords, using each DB's IR index.
Example: Q = [Smith Cisco]. Tuple sets:
– Service-DB: Complaints^Q = {u1, u2}, Customers^Q = {v1}
– HR-DB: Employees^Q = {t1}, Groups^Q = {}
Then build the tuple-set graph from the schema graph:
– Schema graph: Customers –J1– Complaints –J4– Emps –J2, J3– Groups
– Tuple-set graph: nodes are the keyword tuple sets (Customers^Q, Complaints^Q, Emps^Q) plus the free tuple sets (Customers^{}, Complaints^{}, Emps^{}, Groups^{}); edges carry the joins J1 ... J4

14 Create Answer Templates (cont.)
Search the tuple-set graph to generate answer templates, also called Candidate Networks (CNs). Each answer template is one way to join tuples to form an answer. Sample CNs:
– CN1: Customers^Q
– CN2: Customers^Q –J1– Complaints^{Q}
– CN3: Emps^Q –J2– Groups^{} –J2– Emps^{} –J4– Complaints^{Q}
– CN4: Emps^Q –J2– Groups^{} –J3– Emps^{} –J4– Complaints^{Q}
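A minimal sketch of this generation step: breadth-first enumeration of acyclic paths in the tuple-set graph whose endpoints are keyword tuple sets. Join labels are ignored here, so variants that differ only in joins (like CN3 and CN4) appear once; a real generator would also check keyword coverage and minimality. The graph below is a simplified hand-coded version of the slide's tuple-set graph.

```python
# Hedged sketch of CN enumeration: BFS over the tuple-set graph, emitting
# acyclic paths that start and end in keyword tuple sets (names ending "Q").
from collections import deque

graph = {  # undirected adjacency, simplified from the slide
    "CustomersQ": ["ComplaintsQ"],
    "ComplaintsQ": ["CustomersQ", "Emps{}"],
    "EmpsQ": ["Groups{}"],
    "Groups{}": ["EmpsQ", "Emps{}"],
    "Emps{}": ["Groups{}", "ComplaintsQ"],
}

def candidate_networks(graph, max_size):
    cns = []
    frontier = deque([node] for node in graph if node.endswith("Q"))
    while frontier:
        path = frontier.popleft()
        # emit each path once, not once per direction
        if path[-1].endswith("Q") and path[0] <= path[-1]:
            cns.append(path)
        if len(path) < max_size:
            for nbr in graph[path[-1]]:
                if nbr not in path:  # keep the network acyclic
                    frontier.append(path + [nbr])
    return cns

cns = candidate_networks(graph, max_size=4)
```

On this graph the enumeration yields five templates, including the single-node Customers^Q (CN1) and the four-node Emps^Q … Complaints^Q chain behind CN3/CN4.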

15 Materialize Answer Templates to Generate Answers
By generating and executing SQL queries.
CN: Customers^Q –J1– Complaints^Q, with Customers^Q = {v1} and Complaints^Q = {u1, u2}
SQL: SELECT * FROM Customers C, Complaints P WHERE C.cust-id = P.id AND (C.tuple-id = v1) AND (P.tuple-id = u1 OR P.tuple-id = u2)
Naïve solution:
– materialize all answer templates, score, rank, then return answers
Current solutions:
– find only top-k answers
– materialize only certain answer templates
– make decisions using refinement rules + statistics
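Generating such a SQL statement is mechanical once a CN and its tuple sets are known; a sketch follows. The hyphens in the slide's column names are replaced with underscores so the identifiers need no quoting, which is an assumption about the schema.

```python
# Build the materialization SQL for the slide's CN: Customers^Q -J1- Complaints^Q.
def materialize_sql(customers_q, complaints_q):
    c_pred = " OR ".join(f"C.tuple_id = '{t}'" for t in customers_q)
    p_pred = " OR ".join(f"P.tuple_id = '{t}'" for t in complaints_q)
    return ("SELECT * FROM Customers C, Complaints P "
            "WHERE C.cust_id = P.id "
            f"AND ({c_pred}) AND ({p_pred})")

sql = materialize_sql(["v1"], ["u1", "u2"])
```

The resulting query restricts each relation to its keyword-matching tuple ids, so the database only joins the handful of tuples the CN actually needs.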

16 Challenges for the Kite Setting
More databases → way too many answer templates to generate
– can take hours on just 3-4 databases
Materializing an answer template takes way too long
– requires SQL query execution across multiple databases
– invoking each database incurs a large overhead
Difficult to obtain reliable statistics across databases
See the paper (or the backup slides) for our solutions.

17 Empirical Evaluation
Two domains: DBLP (2 databases, ~500K tuples per table, 400M total size) and Inventory (8 databases, ~2K tuples per table, 50M total size). The domain table also reports, per domain, the average number of tables per DB, attributes per schema, and approximate FK joins (total, across DBs, and per pair of DBs).
The DBLP schemas: DBLP 1 has CNF(id, name), CITE(id1, id2), AR(id, title), AU(id, name); DBLP 2 has AR(aid, biblo), PU(aid, uid).
Sample Inventory schema (Inventory 1): WAREHOUSE, AUTHOR, BOOK, WH2BOOK, CD, ARTIST, WH2CD.

18 Runtime Performance (1)
Systems compared: the Hybrid algorithm adapted to run over multiple databases; Kite without condensed CNs; Kite without adaptive rule selection and without rule Deep; Kite without rule Deep; the full-fledged Kite algorithm.
Plots: runtime vs. maximum CCN size (DBLP: 2-keyword queries, k = 10, 2 databases; Inventory: 2-keyword queries, k = 10, 5 databases) and runtime vs. number of databases (Inventory: maximum CCN size = 4, 2-keyword queries, k = 10).

19 Runtime Performance (2)
Plots: runtime vs. number of keywords in the query (DBLP: max CCN = 6, k = 10, 2 databases; Inventory: max CCN = 4, k = 10, 5 databases) and runtime vs. number of answers requested k (Inventory: 2-keyword queries, max CCN = 4, 5 databases).

20 Query Result Quality
Pr@k = the fraction of the top-k answers that appear in the "ideal" list. Plots show Pr@k vs. k for both OR-semantic and AND-semantic queries.

21 Summary
Kite executes IR-style data integration:
– performs some automatic preprocessing
– then immediately allows keyword querying
Relatively painless:
– no manual work!
– no need to create a global schema, nor to understand SQL
Can be very useful in many settings, e.g., on-the-fly, best-effort integration for non-technical people:
– enterprises, on the Web: need only a few answers
– emergencies (e.g., hospital + police): need answers quickly

22 Future Directions
Incorporate user feedback → interactive IR-style data integration
More efficient query processing
– large # of databases, network latency
Extend to other types of data
– XML, ontologies, extracted data, Web data
Takeaway: IR-style data integration is feasible and useful; it extends current work on keyword search over databases and raises many opportunities for future work.

23 BACKUP

24 Condensing Candidate Networks
In multi-database settings there is an unmanageable number of CNs:
– many CNs share the same tuple sets and differ only in the associated joins
– group such CNs into condensed candidate networks (CCNs), and condense the tuple-set graph the same way
Example: CN3 (Emps^Q –J2– Groups^{} –J2– Emps^{} –J4– Complaints^{Q}) and CN4 (Emps^Q –J2– Groups^{} –J3– Emps^{} –J4– Complaints^{Q}) condense into a single CCN whose middle edge carries the join set {J2, J3}.
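The grouping itself is a small dictionary trick. This sketch assumes a CN is represented as its sequence of tuple sets plus a parallel sequence of joins, which is a simplification of the real data structure.

```python
# Condense CNs: CNs with the same tuple-set sequence are merged into one CCN
# that records, per edge, the set of alternative joins.
def condense(cns):
    ccns = {}
    for tuple_sets, joins in cns:
        slots = ccns.setdefault(tuple_sets, [set() for _ in joins])
        for slot, j in zip(slots, joins):
            slot.add(j)
    return ccns

cns = [  # the slide's CN3 and CN4 differ only in their middle join
    (("EmpsQ", "Groups{}", "Emps{}", "Complaints{Q}"), ("J2", "J2", "J4")),
    (("EmpsQ", "Groups{}", "Emps{}", "Complaints{Q}"), ("J2", "J3", "J4")),
]
ccns = condense(cns)
```

The two CNs collapse into one CCN whose middle edge carries {J2, J3}, so the searcher handles one template instead of two.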

25 Top-k Search
Main ideas for top-k keyword search:
– no need to materialize all CNs
– sometimes, even partially materializing a CN is enough
– estimate score intervals for CNs, then do branch-and-bound search
Example (three CNs P [0.6, 1], Q [0.5, 0.7], R [0.4, 0.9]): partially materializing P reveals answers P2 (0.9) and P3 (0.7), leaving P1 at [0.6, 0.8]; partially materializing R reveals R2 (0.85), leaving R1 at [0.4, 0.6]. The result is {P2, R2} with min score 0.85, so Q, whose upper bound is 0.7, is never materialized.
Kite approach: materialize CNs using refinement rules.
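The branch-and-bound idea can be sketched as follows. Score intervals and per-CN answer scores are given up front here for illustration; in Kite they come from statistics and materialization, and materialization is incremental rather than all-at-once.

```python
# Interval-based top-k: repeatedly materialize the CN with the highest upper
# bound until the current k-th best answer beats every unexplored CN.
import heapq

def top_k(cns, k):
    """cns: name -> (lo, hi, list of true answer scores)."""
    frontier = [(-hi, name) for name, (lo, hi, _) in cns.items()]
    heapq.heapify(frontier)
    results = []
    while frontier:
        neg_hi, name = heapq.heappop(frontier)
        if len(results) >= k:
            kth_best = sorted(results, reverse=True)[k - 1][0]
            if kth_best >= -neg_hi:  # bound: no remaining CN can do better
                break
        results.extend((s, name) for s in cns[name][2])  # "materialize" the CN
    return sorted(results, reverse=True)[:k]

# Echoing the slide: Q's upper bound 0.7 cannot beat the top-2 {0.9, 0.85},
# so Q is never materialized.
answers = top_k({"P": (0.6, 1.0, [0.9, 0.7, 0.65]),
                 "Q": (0.5, 0.7, [0.6]),
                 "R": (0.4, 0.9, [0.85, 0.5])}, k=2)
```

As on the slide, the search stops after materializing P and R, returning the two answers scored 0.9 and 0.85.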

26 Top-k Search Using Refinement Rules
In the single-database setting, rules are selected based on database statistics. In the multi-database setting, statistics are inaccurate, and inaccurate statistics lead to inappropriate rule selection.

27 Refinement Rules
Full:
– exhaustively extract all answers from a CN (fully materialize it)
– → too much data to move around the network (high data-transfer cost)
Partial:
– try to extract only the most promising answer from a CN
– → invokes remote databases for just one answer (high cost of database invocation)
Deep:
– a middle-ground approach: once a table in a remote database is invoked, extract all answers involving that table
– takes database invocation cost into account
(Figure: T^Q and U^Q tuple lists with per-tuple scores, showing which joined answers each rule extracts.)

28 Adaptive Search
Question: which refinement rule should be applied next?
– single-database setting: decide based on database statistics
– multi-database setting: statistics are inaccurate
Kite approach: adaptively select rules.
goodness-score(rule, cn) = benefit(rule, cn) - cost(rule, cn)
– cost(rule, cn): the optimizer's estimated cost for the rule's SQL statements
– benefit(rule, cn): reduced if the rule has been applied for a while without making any progress
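A toy version of this adaptive loop, under stated assumptions: the decay factor and the benefit/cost numbers below are made up for illustration and are not Kite's actual estimates.

```python
# Adaptive rule selection: always pick the rule with the best
# goodness = benefit - cost; decay a rule's benefit when applying it makes
# no progress, so stalled rules are eventually abandoned.
class RuleChooser:
    def __init__(self, rules):
        # rules: name -> (estimated benefit, estimated cost)
        self.benefit = {name: b for name, (b, c) in rules.items()}
        self.cost = {name: c for name, (b, c) in rules.items()}

    def next_rule(self):
        return max(self.benefit, key=lambda n: self.benefit[n] - self.cost[n])

    def report(self, rule, made_progress):
        if not made_progress:
            self.benefit[rule] *= 0.5  # illustrative decay factor

chooser = RuleChooser({"Full": (1.0, 0.6), "Partial": (0.8, 0.2), "Deep": (0.9, 0.4)})
first = chooser.next_rule()            # Partial wins with goodness 0.6
chooser.report(first, made_progress=False)
second = chooser.next_rule()           # Partial decayed; Deep now wins
```

After one fruitless application of Partial, its benefit halves and Deep becomes the best choice, which is the behavior the slide describes.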

29 Other Experiments
Schema matching drastically improves the join discovery algorithm (accuracy, F1).
Kite also improves the single-database keyword search algorithm (mHybrid): runtime vs. max CCN size over a single database.

