Data Integration with Dependent Sources
Anish Das Sarma, Xin (Luna) Dong, Alon Halevy
Yahoo! Research, AT&T Labs-Research, Google Inc.
November 20, 2018
Query Answering in Data Integration
- Q: “France capital”; the combined answer is A = ∪i Ai
- The Ai's may conflict: D1, D2, D3 say Paris; D4, D5 say Inria
- Count the number of sources for each answer (“Best guess, based on 7 websites”)
- Counting assumes independence, but sources can copy from each other
- Instead, consider the number of independent sources for each answer
[Figure: mediated schema over sources D1-D5, with sub-queries Q1-Q5 and answers A1-A5]
Motivation [Solomon project, Luna Dong et al.]
- Web data consists of an ecosystem of dependent sources
- Information extracted from AbeBooks.com:
  - Data from 877 bookstores
  - 465 pairs of sources involved copying
  - 314 copiers, 202 of which copy from a single source
  - Some copy all tuples, some copy a fraction
Goal
- Build a system, IDS, for Integrating Dependent Sources
- This paper:
  - Proposes a system for query answering with dependent sources
  - Addresses theoretical challenges in building such a system
- Note: detecting dependencies is part of other work [Solomon]
[Figure: system architecture with components Source Selection, Source Ordering, Cost Computation, Coverage Computation, Data Sources Configuration, and Query Answering; input query Q, output Answer]
1) Given a query Q, which data sources should be used to answer it? (cost-coverage tradeoff)
2) In which order should we query a set S of sources to get answers as soon as possible?
3) For a subset S of sources, what fraction of the total set of tuples is captured by tuples in S? Coverage is a core part of the previous two problems.
Answers to the source selection and ordering problems depend on the cost model.
Next
- Formal problem definitions
  - Dependency model
  - Cost model for query answering
  - Coverage and optimization problems
- Algorithms and complexity results summary
  - Coverage problem (CP)
  - Cost minimization problem (CMP)
  - Maximum coverage problem (MCP)
  - Source ordering problem (SOP)
Dependency Model: Example
[Figure: copying graph over the sources; node labels give the #tuples provided independently; edges depict copying]
- (1) Fraction-copying: S6 copies a random 0.8 fraction of the tuples from S2
Dependency Model: Example
[Figure: copying graph over S1, S2, S3, S4 with edge predicates A<4, 2<B<=5, (B>5) ^ (A>2), and true]
- (2) Selection-copying: S2 copies all tuples with A<4 from S1
- (3) Histogram-copying combines selection- and fraction-copying
Query Answering
- This talk: queries that find all tuples (“select *”)
- Given a set S = {S1,…,Sn} of sources, we want Q(S) = ∪i Q(Si)
- Technical point: assume each tuple t is annotated with the source Si providing it; i.e., a “tuple” is (t, Si)
- The paper extends the results to other queries: selections, projections, joins
Cost Model
Given a set S = {S1,…,Sn} of sources to query, we consider three models for the cost of querying S:
- Linear cost model: Cost = Σi |Si|
  - Data is stored locally, and scanning (I/O) cost dominates
- Number-of-sources cost model: Cost = |S| = n
  - When “charged” for every source (e.g., web services)
- Arbitrary source cost model: Cost = Σi ci
  - Each source Si has an arbitrary cost ci
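The three cost models can be written down directly. This is a minimal sketch: the function names are illustrative, the source sizes are the ones used in the cost-minimization example of this talk, and the per-source fees are made-up values.

```python
from typing import Mapping, Sequence


def linear_cost(sizes: Mapping[str, int], chosen: Sequence[str]) -> int:
    """Linear model: sum of the sizes |Si| of the queried sources."""
    return sum(sizes[s] for s in chosen)


def number_of_sources_cost(chosen: Sequence[str]) -> int:
    """Number-of-sources model: one unit per queried source."""
    return len(chosen)


def arbitrary_cost(fees: Mapping[str, float], chosen: Sequence[str]) -> float:
    """Arbitrary model: sum of the per-source costs ci."""
    return sum(fees[s] for s in chosen)


# Sizes from the cost-minimization example (S3 = 100, S4 = 100, S5 = 200).
sizes = {"S3": 100, "S4": 100, "S5": 200}
# Hypothetical per-source fees, for illustration only.
fees = {"S3": 1.5, "S4": 0.5, "S5": 4.0}

chosen = ["S3", "S4", "S5"]
print(linear_cost(sizes, chosen))
print(number_of_sources_cost(chosen))
print(arbitrary_cost(fees, chosen))
```

The same set of sources can rank very differently under the three models, which is why the selection and ordering answers depend on the cost model.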
Coverage Problem
- Coverage Problem: what fraction of the total tuples is covered by a subset of sources?
- Example: what is the coverage of {S4, S5}?
  - Total #tuples = 300
  - S5 gets all tuples from S1 and S2
  - S4 provides 50 new tuples
  - Coverage = 250/300
Cost Minimization Problem
- Cost Min. Problem: which sources should we query to (1) get all tuples and (2) minimize total cost?
- Linear cost model: {S3, S4, S5} (cost = 100 + 100 + 200 = 400)
- Num-sources cost model: {S5, S6} (cost = 2)
Maximum Coverage Problem
- Max. Coverage Problem: given a cost bound, which sources should we query to (1) get the maximum number of tuples while (2) staying within the cost bound?
- Cost bound = 1 (num-sources model): query S6 and get 255 tuples:
  - 80 from S2
  - 50 each from S3 and S4
  - 75 from S1 (through S3 and S4, because the two copying steps are independent)
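The count of 255 can be reproduced with a short calculation. The sketch below rests on assumptions that are not stated in the slide text: S1 and S2 each provide 100 tuples, and each S1 tuple reaches S6 through S3 with probability 0.5 and, independently, through S4 with probability 0.5. These values are chosen because they match the numbers on the slide.

```python
def expected_tuples_from_s6(p_via_s3: float = 0.5, p_via_s4: float = 0.5) -> float:
    """Expected number of distinct tuples obtained by querying S6 alone.

    The path probabilities are assumptions chosen to match the slide.
    """
    from_s2 = 0.8 * 100  # S6 copies a random 0.8 fraction of S2's 100 tuples
    from_s3 = 50         # S3's own tuples
    from_s4 = 50         # S4's own tuples
    # An S1 tuple is missed only if both independent paths miss it.
    p_from_s1 = 1 - (1 - p_via_s3) * (1 - p_via_s4)
    from_s1 = p_from_s1 * 100
    return from_s2 + from_s3 + from_s4 + from_s1


print(expected_tuples_from_s6())
```

Independence matters here: if the two paths delivered the same half of S1, S6 would see only 50 of S1's tuples rather than 75.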
Source Ordering Problem
- Source Ordering Problem: in what order should we query sources to obtain tuples “as fast as possible”? (Intuitively, maximize the area under the curve plotting tuples retrieved against cost)
- Example:
  - Query S6 -> S5 -> other sources
    - S6 gives 255 tuples
    - S5 gives the remaining 45 tuples
  - Query S1 -> S2 -> S3 -> S4 -> S5 -> S6
    - S1 gives 100 tuples
    - S2 gives 100 tuples
    - S3 gives 50 tuples
    - S4 gives 50 tuples
[Figure: number of tuples retrieved versus source ordering, for the two orderings]
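The area-under-curve intuition can be checked on the two orderings of the example. This is a sketch assuming the number-of-sources cost model (unit cost per query) and a simple discrete area: the sum of the cumulative tuple counts after each query. The paper's exact objective may be defined differently.

```python
from typing import Sequence


def area_under_curve(new_tuples: Sequence[int]) -> int:
    """Sum of cumulative tuple counts after each unit-cost query."""
    area = 0
    total = 0
    for gained in new_tuples:
        total += gained
        area += total
    return area


# New tuples gained at each step, taken from the example;
# sources queried after all 300 tuples are in hand add nothing.
order_s6_first = [255, 45, 0, 0, 0, 0]    # S6, S5, then the other sources
order_s1_first = [100, 100, 50, 50, 0, 0]  # S1, S2, S3, S4, S5, S6

print(area_under_curve(order_s6_first))
print(area_under_curve(order_s1_first))
```

Both orderings end with all 300 tuples, but querying S6 first accumulates them earlier, which is what the larger area reflects.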
Summary of results
- The paper presents a detailed theoretical investigation:
  - All four problems: coverage, cost-minimization, max-coverage, source-ordering
  - All cost models: linear, num-sources, arbitrary
  - Hardness results, polynomial-time algorithms, tractable sub-classes, approximations
- Next: 1-2 slides on each of the problems, highlighting main results
  - Focus on the fraction-copying model
Coverage Problem
- #P-complete in general
  - Reduction from counting the satisfying assignments of a monotone 2-DNF formula
Coverage Problem
- PTIME when each copy-fraction is 0 or 1
- Algorithm-A:
  - Compute the exact coverage of the subset of sources
  - Compute the total number of tuples among all sources
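One way to see why the all-or-nothing case is easy: a source then holds its own tuples plus everything held by the sources it copies from, so coverage reduces to graph reachability. The sketch below is an illustration under that reading, not the paper's Algorithm-A itself. It assumes tuples provided independently by different sources are disjoint, and the graph is a hypothetical one whose counts match the {S4, S5} example.

```python
from typing import Dict, Iterable, List, Set


def covered_tuples(own: Dict[str, int],
                   copies_from: Dict[str, List[str]],
                   subset: Iterable[str]) -> int:
    """Count tuples held by `subset` when every copy edge copies everything."""
    seen: Set[str] = set()
    stack = list(subset)
    while stack:
        source = stack.pop()
        if source in seen:
            continue
        seen.add(source)
        # Follow copy edges back to the sources the tuples came from.
        stack.extend(copies_from.get(source, []))
    return sum(own[s] for s in seen)


def coverage(own: Dict[str, int],
             copies_from: Dict[str, List[str]],
             subset: Iterable[str]) -> float:
    """Covered tuples divided by the total number of tuples."""
    return covered_tuples(own, copies_from, subset) / sum(own.values())


# Hypothetical graph: own-tuple counts and all-or-nothing copy edges.
own = {"S1": 100, "S2": 100, "S3": 50, "S4": 50, "S5": 0}
copies_from = {"S5": ["S1", "S2"]}

print(covered_tuples(own, copies_from, ["S4", "S5"]))
print(coverage(own, copies_from, ["S4", "S5"]))
```

The traversal visits each source and edge at most once, so the running time is linear in the size of the copying graph.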
Coverage Problem
- PTIME with selection-copying
  - Many sources copy by applying selection queries
  - Knowing the selection predicate makes the problem tractable
Coverage Problem
- PTIME randomized approximation algorithm
Coverage: Randomized Algorithm
- Polynomial-time randomized (Monte Carlo) algorithm
  - Runs in O(NE log(1/d) / e^2) time
  - Coverage is within error e with probability ≥ (1 - d)
- Randomized algorithm:
  - Randomly include/omit each edge with probability based on the edge's copy-fraction
  - Run Algorithm-A on the resulting graph
  - Final coverage: average over multiple iterations
Coverage: Randomized Algorithm
- Coverage of S6:
  - Retain edges based on fraction
  - Compute coverage
  - Correct answer = 255/300 tuples
- Iteration 1: #covered tuples = 300
- Iteration 2: #covered tuples = 200
- Compute average coverage
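The sampling loop can be sketched as follows. The copying graph and its fractions are a reconstruction chosen to match the numbers quoted in this talk (only the 0.8 edge from S2 to S6 is stated explicitly), and the exact step is a plain reachability count standing in for Algorithm-A. All names are illustrative.

```python
import random
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, float]  # (copier, original, copy-fraction)


def covered(own: Dict[str, int], kept: Dict[str, List[str]],
            subset: Iterable[str]) -> int:
    """Exact count on a graph whose retained edges copy everything."""
    seen: Set[str] = set()
    stack = list(subset)
    while stack:
        source = stack.pop()
        if source not in seen:
            seen.add(source)
            stack.extend(kept.get(source, []))
    return sum(own[s] for s in seen)


def estimate_coverage(own: Dict[str, int], edges: List[Edge],
                      subset: Iterable[str], iterations: int = 4000,
                      seed: int = 0) -> float:
    """Average coverage over randomly sampled all-or-nothing graphs."""
    rng = random.Random(seed)
    subset = list(subset)
    total = sum(own.values())
    acc = 0.0
    for _ in range(iterations):
        kept: Dict[str, List[str]] = {}
        for copier, original, fraction in edges:
            # Retain the edge with probability equal to its copy-fraction.
            if rng.random() < fraction:
                kept.setdefault(copier, []).append(original)
        acc += covered(own, kept, subset) / total
    return acc / iterations


# Assumed graph, reconstructed to be consistent with the slide numbers.
own = {"S1": 100, "S2": 100, "S3": 50, "S4": 50, "S5": 0, "S6": 0}
edges = [("S3", "S1", 0.5), ("S4", "S1", 0.5),
         ("S5", "S1", 1.0), ("S5", "S2", 1.0),
         ("S6", "S2", 0.8), ("S6", "S3", 1.0), ("S6", "S4", 1.0)]

print(estimate_coverage(own, edges, ["S6"]))
```

On this graph every sampled iteration covers at least 200 and at most 300 tuples, as in the two iterations shown on the slide, and the average should approach 255/300 as the number of iterations grows.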
CMP, MCP, SOP: Complete Results
[Table: results for CMP, MCP, and SOP. Notes: (a) number-of-sources cost model; (b) linear cost model and arbitrary source cost model; (c) with a PTIME coverage algorithm]
- Next: a sample of results from the table
  - Greedy algorithm for picking sources
Other Results
[Table: results for CMP, MCP, and SOP. Notes: (a) number-of-sources cost model; (b) linear cost model and arbitrary source cost model; (c) with a PTIME coverage algorithm]
- All problems are intractable in general
Greedy Algorithm
- Greedy algorithm for cost minimization, maximum coverage, and source ordering:
  - Repeatedly pick the source S maximizing I(S)/c(S)
  - I(S) = number of new tuples S contributes
  - c(S) = cost of querying S
- The greedy algorithm is optimal when every source:
  - Copies from at most one source
  - Copies all or zero tuples (i.e., every copy-fraction is 0 or 1)
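The greedy rule can be sketched on explicit tuple sets. This assumes each source's contents are known exactly (as they are when every copy-fraction is 0 or 1). The sources, tuple ids, and unit costs are hypothetical, and ties are broken by source name purely to make the output deterministic.

```python
from typing import Dict, List, Set


def greedy_order(contents: Dict[str, Set[int]],
                 cost: Dict[str, float]) -> List[str]:
    """Repeatedly pick the source with the most new tuples per unit cost."""
    remaining = set(contents)
    have: Set[int] = set()
    order: List[str] = []
    while remaining:
        # I(S) / c(S): new tuples contributed by S divided by its cost.
        best = max(sorted(remaining),
                   key=lambda s: len(contents[s] - have) / cost[s])
        if not contents[best] - have:
            break  # no remaining source adds anything new
        order.append(best)
        have |= contents[best]
        remaining.remove(best)
    return order


# Hypothetical sources: S3 copies all of S1 and adds 20 tuples of its own.
contents = {
    "S1": set(range(0, 100)),
    "S2": set(range(100, 150)),
    "S3": set(range(0, 100)) | set(range(150, 170)),
}
unit_cost = {"S1": 1.0, "S2": 1.0, "S3": 1.0}

print(greedy_order(contents, unit_cost))
```

The returned list serves as a source ordering; truncating it at a cost bound gives a maximum-coverage choice, and running it until nothing new is added gives a cost-minimization choice.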
Greedy Algorithm Results
[Table: results for CMP, MCP, and SOP. Notes: (a) number-of-sources cost model; (b) linear cost model and arbitrary source cost model; (c) with a PTIME coverage algorithm]
- Optimal for single-source copying:
  - Each source copies from at most one source
  - Each source copies all or zero tuples (i.e., every copy-fraction is 0 or 1)
Greedy Algorithm Results
[Table: results for CMP, MCP, and SOP. Notes: (a) number-of-sources cost model; (b) linear cost model and arbitrary source cost model; (c) with a PTIME coverage algorithm]
- Approximations in the general case:
  - Log-approximation (in the size of the largest source) for cost minimization
  - (1 - 1/e)-approximation for maximum coverage
  - 2-approximation for source ordering
Summary
- Contributions:
  - Studied query answering with dependent sources
  - Simple model to capture dependencies
  - Four important problems: coverage, cost-minimization, maximum-coverage, source ordering
  - Algorithms and complexity results for the problems
- Future:
  - Build a system based on our theoretical foundations
  - Specific open problems laid out in the paper
Thanks!