Data Integration with Dependent Sources


Data Integration with Dependent Sources Anish Das Sarma, Xin (Luna) Dong, Alon Halevy Yahoo! Research, AT&T Labs-Research, Google Inc. November 20, 2018

Query Answering in Data Integration
Q: “France capital”
[Figure: a mediated schema issues queries Q1 to Q5 to sources D1 to D5, which return answers A1 to A5; the result is labeled “Best guess, based on 7 websites”.]
- Final answer: A = ∪i Ai
- The Ai's may have conflicts: D1, D2, D3 say Paris; D4, D5 say Inria
- Simple approach: count the number of sources for each answer
- That count assumes independence, but sources can copy from each other
- Instead, consider the number of independent sources for each answer

Motivation [Solomon project, Luna Dong et al.]
- Web data consists of an ecosystem of dependent sources
- Information extracted from AbeBooks.com:
  - Data from 877 bookstores
  - 465 pairs of sources involved copying
  - 314 copiers; 202 copy from a single source
- Some copy all tuples, some copy a fraction

Goal
Build a system, IDS, for Integrating Dependent Sources.
This paper:
- Proposes a system for query answering with dependent sources
- Addresses theoretical challenges in building such a system
Note: detecting dependencies is part of other work [Solomon].

[Figure: IDS architecture. Input is a query Q and output is the Answer; components are Source Selection, Source Ordering, Query Answering, Cost Computation, Coverage Computation, and the Data Sources Configuration.]

1) Given a query Q, which data sources should we use to answer it? (cost-coverage tradeoff)

2) In which order should we query a set S of sources to get answers as soon as possible?

3) For a subset S of sources, what fraction of the total set of tuples is captured by the tuples in S? Coverage is a core part of the previous problems.

Answers to the source selection and ordering problems depend on the cost model.

Next
- Formal problem definitions
  - Dependency model
  - Cost model for query answering
  - Coverage and optimization problems
- Algorithms and complexity results summary
  - Coverage problem (CP)
  - Cost minimization problem (CMP)
  - Maximum coverage problem (MCP)
  - Source ordering problem (SOP)

Dependency Model: Example
[Figure: dependency graph over the sources. Each node is labeled with the #tuples it provides independently; edges depict copying.]
(1) Fraction-copying: S6 copies a random 0.8 fraction of tuples from S2.

Dependency Model: Example
[Figure: sources S1, S2, S3, S4 with copy edges labeled by selection predicates: A<4, 2<B<=5, (B>5) ^ (A>2), true.]
(2) Selection-copying: S2 copies all tuples with A<4 from S1.
(3) Histogram-copying combines selection- and fraction-copying.

Query Answering
- This talk: queries that find all tuples ("select *")
- Given a set S = {S1,…,Sn}, we want Q(S) = ∪i Q(Si)
- Technical point: assume each tuple t is annotated with the source S providing it; i.e., a "tuple" is (t,S)
- The paper extends the results to other queries: selections, projections, joins

Cost Model
Given a set T = {S1,…,Sn} of sources to query, we consider three models for the cost of querying T:
- Linear cost model: Cost = Σi |Si|. Applies when data is stored locally and scanning (I/O) cost dominates.
- Number-of-sources cost model: Cost = |T|. Applies when we are "charged" for every source (e.g., web services).
- Arbitrary source cost model: Cost = Σi ci. Each source Si has an arbitrary cost ci.
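The three cost models can be written down in a few lines. The following is only an illustrative sketch: the function names and the `sizes` dictionary are hypothetical, and the sizes are chosen to match the 100 + 100 + 200 example used later in the talk.

```python
from typing import Dict, Iterable

def linear_cost(chosen: Iterable[str], sizes: Dict[str, int]) -> int:
    """Linear cost model: sum of the sizes |Si| of the queried sources."""
    return sum(sizes[s] for s in chosen)

def num_sources_cost(chosen: Iterable[str]) -> int:
    """Number-of-sources cost model: one unit per distinct queried source."""
    return len(set(chosen))

def arbitrary_cost(chosen: Iterable[str], costs: Dict[str, float]) -> float:
    """Arbitrary source cost model: sum of the per-source costs ci."""
    return sum(costs[s] for s in chosen)

# Hypothetical source sizes (number of tuples each source holds).
sizes = {"S3": 100, "S4": 100, "S5": 200}

print(linear_cost(["S3", "S4", "S5"], sizes))  # 400
print(num_sources_cost(["S5", "S6"]))          # 2
```

Note that the linear and number-of-sources models are special cases of the arbitrary model, with ci = |Si| and ci = 1 respectively.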

Coverage Problem
What fraction of the total tuples is covered by a subset of sources?
Example: What is the coverage of {S4,S5}?
- Total #tuples = 300
- S5 gets all tuples from S1 and S2
- S4 provides 50 new tuples
- Coverage = 250/300

Cost Minimization Problem
Which sources should we query to (1) get all tuples and (2) minimize total cost?
- Linear cost model: {S3,S4,S5} (cost = 100 + 100 + 200 = 400)
- Num-sources cost model: {S5,S6} (cost = 2)

Maximum Coverage Problem
Given a maximum cost bound, which sources should we query to (1) get the maximum number of tuples and (2) stay within the cost bound?
Example: cost bound = 1 (num-sources model). Query S6 and get 255 tuples:
- 80 from S2
- 50 each from S3 and S4
- 75 from S1 (through S3 and S4, because of independence of copying)
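The 255 figure can be reproduced as an expected value. The transcript does not preserve the copy fractions on the edges into S3 and S4, so the worked equation below assumes (hypothetically) that S3 and S4 each independently copy a random fraction p = 0.5 of S1's 100 tuples and that S6 copies everything held by S3 and S4; these values are chosen only because they are consistent with the numbers on the slide.

```latex
\begin{align*}
\mathbb{E}\bigl[|S_6|\bigr]
  &= \underbrace{0.8 \cdot 100}_{\text{from } S_2}
   + \underbrace{50 + 50}_{\text{from } S_3,\ S_4}
   + \underbrace{\bigl(1 - (1 - p)^2\bigr) \cdot 100}_{\text{from } S_1 \text{ via } S_3 \text{ or } S_4} \\
  &= 80 + 100 + \bigl(1 - 0.5^2\bigr) \cdot 100 \qquad (p = 0.5 \text{, assumed}) \\
  &= 80 + 100 + 75 = 255 .
\end{align*}
```

The last term is where independence of copying matters: a tuple of S1 is missed only if both S3 and S4 fail to copy it.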

Source Ordering Problem
In what order should we query sources to obtain tuples "as fast as possible"? (Intuitively, maximize the area under the curve plotting cost versus tuples retrieved.)
Example:
Query S6 -> S5 -> other sources
- S6 gives 255 tuples
- S5 gives the remaining 45 tuples
Query S1 -> S2 -> S3 -> S4 -> S5 -> S6
- S1 gives 100 tuples
- S2 gives 100 tuples
- S3 gives 50 tuples
- S4 gives 50 tuples
[Figure: number of tuples retrieved versus source ordering, for the orderings starting S6, S5 and S1, S2, S3, S4.]
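To make the area-under-curve intuition concrete, the two orderings above can be compared numerically. This is a minimal sketch under stated assumptions: every source has unit cost (the num-sources model), the curve is treated as a step function, and `area_under_curve` is a hypothetical helper, not a definition taken from the paper.

```python
from itertools import accumulate
from typing import List

def area_under_curve(new_tuples: List[int], horizon: int) -> int:
    """Area under the step curve of cumulative tuples versus cost.

    Assumes unit cost per source. Sources queried after the listed
    ones are assumed to contribute no new tuples.
    """
    cumulative = list(accumulate(new_tuples))[:horizon]
    last = cumulative[-1] if cumulative else 0
    cumulative += [last] * (horizon - len(cumulative))
    return sum(cumulative)

order_a = [255, 45]            # S6, S5, then the other sources add nothing
order_b = [100, 100, 50, 50]   # S1, S2, S3, S4, then S5 and S6 add nothing

print(area_under_curve(order_a, 6))  # 1755
print(area_under_curve(order_b, 6))  # 1450
```

Under these assumptions the ordering that starts with S6 has the larger area, which matches the intuition that it returns tuples sooner.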

Summary of results
The paper presents a detailed theoretical investigation:
- All four problems: coverage, cost-minimization, max-coverage, source-ordering
- All cost models: linear, num-sources, arbitrary
- Hardness results, polynomial-time algorithms, tractable sub-classes, approximations
Next: 1-2 slides on each of the problems, highlighting the main results. Focus on the fraction-copying model.

Coverage Problem
- #P-complete in general
  - Reduction from the problem of counting the satisfying assignments of a monotone 2-DNF formula

Coverage Problem
- PTIME when each copy-fraction is 0 or 1
  - Algorithm-A:
    - Compute the exact coverage of the subset of sources
    - Compute the total number of tuples among all sources
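When every copy-fraction is 0 or 1, coverage reduces to graph reachability. The sketch below is one plausible reading of Algorithm-A, not the paper's pseudocode: it assumes that copying is transitive (a copier also holds whatever its original copied) and that independently provided tuples are disjoint across sources. The graph is hypothetical and keeps only the fraction-1 edges of the running example.

```python
from typing import Dict, Iterable, List, Set

def reachable(start: Iterable[str], copies_from: Dict[str, List[str]]) -> Set[str]:
    """All sources whose tuples are held by some source in `start`."""
    seen: Set[str] = set()
    stack = list(start)
    while stack:
        source = stack.pop()
        if source not in seen:
            seen.add(source)
            stack.extend(copies_from.get(source, []))
    return seen

def coverage(chosen: Iterable[str], own: Dict[str, int],
             copies_from: Dict[str, List[str]]) -> float:
    """Fraction of all tuples held by the chosen sources (0/1 copying only)."""
    covered = sum(own[s] for s in reachable(chosen, copies_from))
    total = sum(own.values())
    return covered / total

# Hypothetical example: tuples each source provides independently,
# and fraction-1 copy edges (copier -> list of originals).
own = {"S1": 100, "S2": 100, "S3": 50, "S4": 50, "S5": 0, "S6": 0}
copies_from = {"S5": ["S1", "S2"]}

print(coverage(["S4", "S5"], own, copies_from))  # 250/300
```

Both steps are linear in the size of the graph, which is why this case is polynomial.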

Coverage Problem
- PTIME with selection-copying
  - Many sources copy by applying selection queries
  - Knowing the selection predicate makes the problem tractable

Coverage Problem
- PTIME randomized approximation algorithm

Coverage: Randomized Algorithm
Polynomial-time randomized (Monte Carlo) algorithm:
- Runs in O(NE log(1/d) / e^2)
- Coverage is within error e with probability ≥ (1-d)
Randomized algorithm:
- Randomly include/omit each edge with a probability based on the copy-fraction of the edge
- Run Algorithm-A on the resulting graph
- Final coverage: average over multiple iterations
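The sampling procedure can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: the edge fractions into S3 and S4 (0.5 each) are hypothetical values consistent with the 255-tuple example, copying is assumed transitive, and `iterations_needed` uses a standard Hoeffding bound, whose constants need not match the bound on the slide.

```python
import math
import random
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str, float]  # (copier, original, copy fraction)

def covered_tuples(chosen: Iterable[str], own: Dict[str, int],
                   kept: List[Tuple[str, str]]) -> int:
    """Exact count for a 0/1 graph: sum own tuples over reachable sources."""
    copies_from: Dict[str, List[str]] = {}
    for copier, original in kept:
        copies_from.setdefault(copier, []).append(original)
    seen, stack = set(), list(chosen)
    while stack:
        source = stack.pop()
        if source not in seen:
            seen.add(source)
            stack.extend(copies_from.get(source, []))
    return sum(own[s] for s in seen)

def estimate_coverage(chosen: Iterable[str], own: Dict[str, int],
                      edges: List[Edge], iterations: int,
                      rng: random.Random) -> float:
    """Average coverage over sampled graphs; each edge kept w.p. its fraction."""
    chosen = list(chosen)
    total = sum(own.values())
    acc = 0.0
    for _ in range(iterations):
        kept = [(c, o) for c, o, f in edges if rng.random() < f]
        acc += covered_tuples(chosen, own, kept) / total
    return acc / iterations

def iterations_needed(e: float, d: float) -> int:
    """Hoeffding bound: samples for error <= e with probability >= 1 - d."""
    return math.ceil(math.log(2 / d) / (2 * e * e))

# Hypothetical running example (fractions on S3/S4 edges are assumed).
own = {"S1": 100, "S2": 100, "S3": 50, "S4": 50, "S5": 0, "S6": 0}
edges = [("S6", "S2", 0.8), ("S6", "S3", 1.0), ("S6", "S4", 1.0),
         ("S3", "S1", 0.5), ("S4", "S1", 0.5),
         ("S5", "S1", 1.0), ("S5", "S2", 1.0)]

n = iterations_needed(0.05, 0.05)
print(estimate_coverage(["S6"], own, edges, n, random.Random(0)))
```

With these assumed fractions the estimate should land near 255/300 = 0.85; individual iterations return values such as 300/300 or 200/300 depending on which edges survive.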

Coverage: Randomized Algorithm
Example: coverage of S6
- Retain edges based on fraction
- Compute coverage
- Iteration-1: #covered tuples = 300
- Iteration-2: #covered tuples = 200
- Compute average coverage
- Correct answer = 255/300 tuples

CMP, MCP, SOP: Complete Results
[Table: complexity results for CMP, MCP, and SOP; the table body is not preserved in this transcript.]
Table footnotes:
a: Number-of-sources cost model
b: Linear cost model and arbitrary source cost model
c: With PTIME coverage algorithm
Next: a sample of results from the table above; greedy algorithm for picking sources.

Other Results
All problems are intractable in general.
[Results table repeated; body not preserved in this transcript.]

Greedy Algorithm
Greedy algorithm for cost-minimization, maximum coverage, and source ordering:
- Pick the source S maximizing I(S)/c(S)
  - I(S) = total number of new tuples
  - c(S) = cost of querying S
The greedy algorithm is optimal when all sources:
- Copy from at most one source
- Copy all or zero tuples (i.e., fraction = 1)
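The greedy rule can be sketched in a few lines. This is a simplified illustration, not the paper's algorithm: it assumes the contents of every source are known explicitly as sets of tuple ids (in the talk, I(S) would instead come from a coverage computation), and the data and the name `greedy_order` are hypothetical.

```python
from typing import Dict, List, Set

def greedy_order(contents: Dict[str, Set[int]],
                 cost: Dict[str, float]) -> List[str]:
    """Repeatedly pick the source maximizing new tuples per unit cost."""
    covered: Set[int] = set()
    remaining = set(contents)
    order: List[str] = []
    while remaining:
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining),
                   key=lambda s: len(contents[s] - covered) / cost[s])
        if not contents[best] - covered:
            break  # no remaining source adds new tuples
        order.append(best)
        covered |= contents[best]
        remaining.remove(best)
    return order

# Hypothetical sources with explicit tuple ids and unit costs.
contents = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {5}}
cost = {"A": 1.0, "B": 1.0, "C": 1.0}

print(greedy_order(contents, cost))  # ['A', 'B']
```

The returned list doubles as a source ordering; truncating it at a cost bound gives a maximum-coverage choice, and running it until all tuples are covered gives a cost-minimization choice.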

Greedy Algorithm Results
[Results table repeated; body not preserved in this transcript.]
Optimal for single-source copying, where every source:
- Copies from at most one source
- Copies all or zero tuples (i.e., fraction = 1)

Greedy Algorithm Results
[Results table repeated; body not preserved in this transcript.]
Approximations in the general case:
- Log-approximation (in the size of the largest source) for cost-minimization
- (1-1/e)-approximation for maximum-coverage
- 2-approximation for source-ordering

Summary
Contributions:
- Studied query answering with dependent sources
- Simple model to capture dependencies
- Four important problems: coverage, cost-minimization, maximum-coverage, source ordering
- Algorithms and complexity results for the problems
Future:
- Build a system based on our theoretical foundations
- Specific open problems laid out in the paper

Thanks!