Published by Kenzie Frampton. Modified over 10 years ago.
1
Xin Luna Dong (AT&T Labs, Google Inc.), Barna Saha, Divesh Srivastava (AT&T Labs-Research). VLDB 2013
5
* Lots of money
6
* Lots of machines
7
* Lots of people
8
CS books from AbeBooks.com
* 1096 books from the largest source
* 1213 books from the 2 largest sources
* 1250 books from the 10 largest sources
* 1260 books from the first 35 sources
* All 1265 books from the first 537 sources
* In total: 894 sources, 1265 CS books
9
CS books from AbeBooks.com
* All 100 books (gold standard) from the first 548 sources
* 90 > 80 books with correct authors after 579 sources (Accu)
* 93 > 80 books with correct authors after 583 sources (Vote)
* 78 books with correct authors for Vote
* 80 books with correct authors for Accu
10
* Questions
  * Is it best to integrate all data?
  * How to spend the computing resources in a wise way?
  * How to wisely select sources before real integration to balance the gain and the cost?
* Source selection is a prelude to data integration, outside the traditional integration tasks (schema mapping, entity resolution, data fusion)
11
CS books from AbeBooks.com
* 17 books with correct authors from 300 sources (budget)
* 14 books (17.6% fewer) with correct authors from the first 200 sources (33% fewer resources)
12
CS books from AbeBooks.com
* 65 books with correct authors (quality requirement) from the first 520 sources
* 81 books (25% more) with correct authors from 526 sources (1% more)
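The percentages quoted on the budget and quality-requirement slides can be checked with a few lines of arithmetic. This is only a sanity check of the slide numbers; `relative_change` is a hypothetical helper name, not something from the talk.

```python
def relative_change(new: float, base: float) -> float:
    """Relative change of `new` with respect to `base`, as a fraction."""
    return (new - base) / base

# Budget slide: 17 books from 300 sources vs. 14 books from the first 200.
fewer_books = relative_change(14, 17)       # fraction of correct books lost
fewer_sources = relative_change(200, 300)   # fraction of sources saved

# Quality slide: 65 books from 520 sources vs. 81 books from 526.
more_books = relative_change(81, 65)        # fraction of correct books gained
more_sources = relative_change(526, 520)    # fraction of extra sources needed

print(f"{fewer_books:.1%} books, {fewer_sources:.1%} sources")
print(f"{more_books:.1%} books, {more_sources:.1%} sources")
```

In both cases a small change in the number of sources goes with a disproportionate change in the number of correct books, which is the motivation for selecting sources carefully.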
13
The Law of Diminishing Returns: the largest profit is reached where marginal gain = marginal cost
14
CS books from AbeBooks.com
* Marginal point with the largest profit in this ordering: 548 sources
* Challenge 1. The Law of Diminishing Returns does not necessarily hold, so there can be multiple marginal points
* Challenge 2. Sources differ in quality, so different orderings lead to different marginal points; the best solution integrates 26 sources
* Challenge 3. Estimating gain and cost without real integration
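A minimal sketch of Challenge 1, assuming one fixed ordering of sources and known cumulative gain and cost after each prefix. The numbers are invented for illustration and are not the AbeBooks measurements; `marginal_points` is a hypothetical helper. A marginal point is taken here to be a prefix length where profit has just risen and does not rise at the next step.

```python
from typing import List

def marginal_points(gains: List[float], costs: List[float]) -> List[int]:
    """Prefix lengths k at which profit = gain - cost is a local maximum.

    gains[k] and costs[k] are cumulative values after integrating the
    first k sources in the ordering (index 0 means no sources).
    """
    profit = [g - c for g, c in zip(gains, costs)]
    points = []
    for k in range(1, len(profit)):
        rises = profit[k] > profit[k - 1]
        is_last = k == len(profit) - 1
        no_further_rise = is_last or profit[k + 1] <= profit[k]
        if rises and no_further_rise:
            points.append(k)
    return points

# Invented data: gain flattens out, then jumps when a high-quality
# source arrives late in the ordering, so diminishing returns fails.
gains = [0, 10, 14, 15, 15, 25, 27]
costs = [0, 2, 4, 6, 8, 10, 12]

points = marginal_points(gains, costs)
best_k = max(points, key=lambda k: gains[k] - costs[k])
print(points, best_k)
```

Stopping at the first marginal point would miss the later, more profitable one, which is why a single pass along one ordering is not enough.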
15
* Input
  * S: a set of available sources
  * F: integration model
* Output: subset Ŝ ⊆ S that maximizes the profit G_F(Ŝ) - C_F(Ŝ)
  * G_F(Ŝ): gain of integrating Ŝ using model F
  * C_F(Ŝ): cost of integrating Ŝ using model F
* Gain and cost need to be in the same unit to be comparable; e.g., $
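The problem statement can be made concrete with an exhaustive search over all subsets, which is only feasible for a handful of sources. This is a sketch under stated assumptions: the coverage-based gain (one unit per distinct book covered), the per-source costs, and the names `best_subset`, `COVERAGE`, and `COST` are all illustrative and do not come from the talk.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, Tuple

Subset = FrozenSet[str]

def best_subset(
    sources: Iterable[str],
    gain: Callable[[Subset], float],
    cost: Callable[[Subset], float],
) -> Tuple[Subset, float]:
    """Try every subset and return the one with the largest gain - cost."""
    names = sorted(sources)
    best: Subset = frozenset()
    best_profit = gain(best) - cost(best)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            candidate = frozenset(combo)
            profit = gain(candidate) - cost(candidate)
            if profit > best_profit:
                best, best_profit = candidate, profit
    return best, best_profit

# Toy data (assumed): which books each source covers, and what it costs.
COVERAGE = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {1, 2}, "D": {6}}
COST = {"A": 2.0, "B": 1.5, "C": 0.5, "D": 1.5}

def gain(selected: Subset) -> float:
    """One unit of gain per distinct book covered by the selection."""
    covered = set().union(*(COVERAGE[s] for s in selected))
    return 1.0 * len(covered)

def cost(selected: Subset) -> float:
    """Total cost is the sum of the selected sources' costs."""
    return sum(COST[s] for s in selected)

chosen, profit = best_subset(COVERAGE, gain, cost)
print(sorted(chosen), profit)
```

In this toy instance the most profitable selection leaves out the largest source, A, because two cheaper sources cover more books between them.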
16
* Theorem I (NP-Completeness). Under the arbitrary cost model (i.e., different sources have different costs), Marginalism is NP-complete.
* Theorem II (A greedy solution can obtain arbitrarily bad results). Let d_opt be the optimal profit and d be the profit by a greedy solution. For any θ, there exists an input set of sources and a gain model s.t. d/d_opt < θ.
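To give some intuition for Theorem II, here is a constructed instance (not the proof from the paper) in which a greedy strategy does badly. It assumes the greedy solution adds, at each step, the source with the largest resulting profit and stops when no addition improves profit. The gain model is artificial: sources `a` and `b` are only valuable together, while `c` offers a small gain on its own. All names are hypothetical.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List

Subset = FrozenSet[str]

def greedy_select(sources: List[str], profit: Callable[[Subset], float]) -> Subset:
    """Add the best single source until no addition increases profit."""
    selected: Subset = frozenset()
    while True:
        remaining = [s for s in sources if s not in selected]
        if not remaining:
            return selected
        best = max(remaining, key=lambda s: profit(selected | {s}))
        if profit(selected | {best}) <= profit(selected):
            return selected
        selected = selected | {best}

def make_profit(big: float) -> Callable[[Subset], float]:
    """Artificial model: unit cost per source, gain 2 for c, `big` for a and b together."""
    def profit(selected: Subset) -> float:
        gain = 2.0 if "c" in selected else 0.0
        if {"a", "b"} <= selected:
            gain += big
        return gain - len(selected)
    return profit

def optimal_profit(sources: List[str], profit: Callable[[Subset], float]) -> float:
    """Exhaustive optimum, feasible only for tiny inputs."""
    return max(
        profit(frozenset(combo))
        for size in range(len(sources) + 1)
        for combo in combinations(sources, size)
    )

sources = ["a", "b", "c"]
profit = make_profit(big=100.0)
d = profit(greedy_select(sources, profit))
d_opt = optimal_profit(sources, profit)
print(d, d_opt, d / d_opt)
```

Greedy stops after picking `c`, since adding `a` or `b` alone only adds cost. Increasing `big` pushes the ratio d/d_opt as close to zero as desired.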
17
* Improvement I. Randomly select from Top-k solutions
* Improvement II. Hill climbing to improve the initial solution
* Improvement III. Repeat r times and choose the best solution
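The three improvements can be sketched together as a small randomized-greedy procedure. This is a simplified illustration, not the authors' implementation: the neighbourhood used for hill climbing (add, remove, or swap one source), the toy coverage data, and all function names are assumptions.

```python
import random
from typing import Callable, FrozenSet, Iterator, List

Subset = FrozenSet[str]
Profit = Callable[[Subset], float]

def construct(sources: List[str], profit: Profit, k: int, rng: random.Random) -> Subset:
    """Improvement I: pick randomly among the top-k profit-improving additions."""
    selected: Subset = frozenset()
    while True:
        scored = sorted(
            ((profit(selected | {s}), s) for s in sources if s not in selected),
            reverse=True,
        )
        top = [(p, s) for p, s in scored[:k] if p > profit(selected)]
        if not top:
            return selected
        _, choice = rng.choice(top)
        selected = selected | {choice}

def neighbours(current: Subset, sources: List[str]) -> Iterator[Subset]:
    """Selections reachable by adding, removing, or swapping one source."""
    outside = [s for s in sources if s not in current]
    for s in outside:
        yield current | {s}
    for s in current:
        yield current - {s}
        for t in outside:
            yield (current - {s}) | {t}

def hill_climb(start: Subset, sources: List[str], profit: Profit) -> Subset:
    """Improvement II: move to a better neighbour until none exists."""
    current, improved = start, True
    while improved:
        improved = False
        for candidate in neighbours(current, sources):
            if profit(candidate) > profit(current):
                current, improved = candidate, True
                break
    return current

def grasp(sources: List[str], profit: Profit, k: int, r: int, seed: int = 0) -> Subset:
    """Improvement III: repeat r times and keep the most profitable result."""
    rng = random.Random(seed)
    best: Subset = frozenset()
    for _ in range(r):
        candidate = hill_climb(construct(sources, profit, k, rng), sources, profit)
        if profit(candidate) > profit(best):
            best = candidate
    return best

# Toy data (assumed): books covered by each source and per-source cost.
COVERAGE = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {1, 2}, "D": {6}}
COST = {"A": 2.0, "B": 1.5, "C": 0.5, "D": 1.5}

def toy_profit(selected: Subset) -> float:
    covered = set().union(*(COVERAGE[s] for s in selected))
    return len(covered) - sum(COST[s] for s in selected)

names = sorted(COVERAGE)
plain_greedy = grasp(names, toy_profit, k=1, r=1)
randomized = grasp(names, toy_profit, k=2, r=30)
print(sorted(plain_greedy), sorted(randomized))
```

With `k=1` and `r=1` the procedure reduces to plain greedy, which here commits to the largest source and gets stuck. Allowing a random choice among the top two candidates and repeating gives the search a chance to start from a cheaper source and reach a more profitable selection.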
18
* Side contributions on data fusion
  * The PopAccu model: monotonicity, i.e., adding a source should never decrease fusion quality
  * Algorithms to estimate fusion quality: dynamic programming
19
* Book data set: CS books at AbeBooks.com in 2007
  * 894 sources
  * 1265 books
  * 24364 records
* Flight data set: Deep-Web sources for “flight status” in 2011
  * 38 sources
  * 1200 flights
  * 27469 records
20
* 228 sources provide books in the gold standard
* Marginalism selects 165 sources, reaching the highest quality
* PopAccu outperforms Vote and Accu, and is nearly monotonic for “good” sources
21
Marginalism has higher profit than MaxGLimitC and MinCLimitG most of the time
22
* Greedy solution often cannot find the optimal solution
* GRASP (top-10, repeating 320 times) obtains nearly optimal results
23
* Full-fledged source selection for data integration
  * Other quality measures: e.g., freshness, consistency, redundancy; correlations, copying relationships between sources
  * Complex cost and gain models
  * Selecting subsets of data from each source
  * Other components of data integration: schema mapping, entity resolution
24
The More the Better? OR Less is More?