Published by Loren Cecily Riley. Modified over 9 years ago.
1
CHARACTERIZING AND SELECTING FRESH DATA SOURCES. Theodoros Rekatsinas (University of Maryland), Xin Luna Dong (Google Inc.), Divesh Srivastava (AT&T Labs-Research)
2
DATA IS A COMMODITY myriads of data sources
3
DATA IS A COMMODITY: myriads of data sources. FREELY AVAILABLE SOURCES: open data initiative, world data bank, crawling the web
4
DATA IS A COMMODITY: myriads of data sources. FREELY AVAILABLE SOURCES. DATA MARKETS (SELL YOUR DATA TO OTHERS): DataSift, Microsoft Azure Marketplace, datamarket.com, Infochimps
5
DATA IS A COMMODITY: myriads of data sources. FREELY AVAILABLE SOURCES. DATA MARKETS: SELL YOUR DATA TO OTHERS. HETEROGENEOUS DATA SOURCES: different quality and cost; cover different topics; static or dynamic; exhibit different update patterns
6
“Truth, like gold, is to be obtained not by its growth, but by washing away from it all that is not gold.” (Leo Tolstoy)
7
So, can we find gold in a systematic and automated fashion? Yes! Use source selection to reason about the benefits and costs of acquiring and integrating data sources [Dong et al., 2013]. However, existing techniques are agnostic to time and focus on the accuracy of static sources.
8
TIME MATTERS Select sources before actual data integration. When do we need to use the integration result? Data in the world and in the sources change.
10
IT IS A DYNAMIC WORLD [Timeline figure: the world changes over time; one data source is updated every 2 time points, another every 3 time points]
11
CHALLENGES AND OPPORTUNITIES Business listings (BL): ~40 sources, 2 years, ~1,400 categories, 51 locations
12
QUALITY CHANGES OVER TIME The optimal set of sources changes over time
13
LOWER COST OPPORTUNITIES Integrate updates at a lower frequency to reduce cost
14
TIME-BASED SOURCE QUALITY Entry types: up-to-date, out-of-date, and non-deleted entries. Coverage(source, time) = Entries(source, time) / Entries(world, time). Freshness(source, time) = UpToDate Entries(source, time) / Entries(source, time). Coverage ~ Recall; Freshness ~ Precision. Combined: Accuracy(source, time).
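The two ratios on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the world and a source are each modeled as a hypothetical mapping from entity id to its current value, coverage counts only source entries that still exist in the world, and an entry is "up to date" when its value equals the world's value. None of these modeling choices come from the slides.

```python
def coverage(source: dict, world: dict) -> float:
    """Share of world entities that also appear in the source (recall-like)."""
    if not world:
        return 0.0
    # Assumption: only non-deleted entries (still present in the world) count.
    present = sum(1 for entity in source if entity in world)
    return present / len(world)


def freshness(source: dict, world: dict) -> float:
    """Share of the source's entries whose value matches the world (precision-like)."""
    if not source:
        return 0.0
    up_to_date = sum(
        1 for entity, value in source.items()
        if entity in world and world[entity] == value
    )
    return up_to_date / len(source)


# Hypothetical business listings: entity -> address at the time of measurement.
world = {"cafe": "Main St 1", "gym": "Oak Ave 9", "bank": "Elm Rd 4", "spa": "Pine Ct 2"}
source = {"cafe": "Main St 1", "gym": "Oak Ave 3", "diner": "Lake Dr 7"}

print(coverage(source, world), freshness(source, world))
```

Here the source knows two of the four world entities, and only one of its three entries is current, which shows how a source can have moderate coverage but low freshness.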
15
SELECTING FRESH SOURCES Time-aware source selection
16
EXTENSIONS: optimal frequency; subset of provided data
17
EXTENSIONS: optimal frequency; subset of provided data. Time-aware source selection with many more sources
18
PROPOSED FRAMEWORK: HISTORICAL SNAPSHOTS OF AVAILABLE SOURCES -> Pre-processing (Statistical Modeling): UPDATE MODELS FOR SOURCES, EVOLUTION MODELS FOR DATA DOMAIN -> Source selection: USE STATISTICAL MODELS TO ESTIMATE QUALITY OF INTEGRATED DATA, INTEGRATION COST MODEL, Maximize Quality - Cost Tradeoff -> SELECT OPTIMAL SUBSET OF SOURCES
20
WORLD EVOLUTION MODELS Integrate available data source snapshots to extract the evolution of the world. Poisson random process (exponentially distributed changes). Ensemble of parametric models.
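As a rough illustration of the Poisson assumption, the intensity rate λ can be estimated from a count of observed changes, and then used to project forward. The function names and the example numbers below are hypothetical; the estimator shown is the standard maximum-likelihood rate for a homogeneous Poisson process, not necessarily the exact procedure used in the talk.

```python
import math


def estimate_rate(num_events: int, observation_time: float) -> float:
    """Maximum-likelihood intensity of a homogeneous Poisson process."""
    if observation_time <= 0:
        raise ValueError("observation_time must be positive")
    return num_events / observation_time


def expected_events(rate: float, horizon: float) -> float:
    """Expected number of changes in a window of length `horizon`."""
    return rate * horizon


def prob_change_within(rate: float, horizon: float) -> float:
    """Probability of at least one change within `horizon`.

    Follows from the exponential waiting time between Poisson events.
    """
    return 1.0 - math.exp(-rate * horizon)


# Hypothetical numbers: 30 new listings observed across 10 weekly snapshots.
rate = estimate_rate(30, 10.0)
print(rate, expected_events(rate, 2.0), prob_change_within(rate, 0.5))
```

Separate rates would typically be fitted per change type (for example appearances and disappearances), which is one way to read "ensemble of parametric models".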
21
SOURCE UPDATE MODELS Shall we consider only the update frequency?
22
SOURCE UPDATE MODELS High update frequency does not imply high freshness
23
SOURCE UPDATE MODELS Update frequency of the source; empirical effectiveness distributions; ensemble of non-parametric models
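One way to picture an empirical (non-parametric) effectiveness distribution is as the fraction of world changes that a source has captured within a given delay. The sketch below assumes that reading, which is an interpretation rather than something the slide spells out; `empirical_effectiveness` and its inputs are hypothetical names.

```python
from bisect import bisect_right


def empirical_effectiveness(capture_delays, total_changes):
    """Build G(d): fraction of world changes the source captured within delay d.

    `capture_delays` lists the observed delay for each change the source
    eventually captured; changes it never captured only count in
    `total_changes`, so G can level off below 1.
    """
    if total_changes <= 0:
        raise ValueError("total_changes must be positive")
    ordered = sorted(capture_delays)

    def effectiveness(delay):
        # Number of captured changes with delay <= `delay`.
        return bisect_right(ordered, delay) / total_changes

    return effectiveness


# Hypothetical history: 8 world changes, 4 of them captured after 1, 1, 2, 5 time points.
G = empirical_effectiveness([1, 1, 2, 5], total_changes=8)
print(G(1), G(2), G(10))
```

A source that updates often but captures few changes has a flat, low curve, which matches the point that high update frequency does not imply high freshness.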
25
SOURCE QUALITY ESTIMATION Combine statistical models over time: express NewQuality as a function of OldQuality and ΔQuality.
26
SOURCE QUALITY ESTIMATION Combine statistical models over time. Coverage(source, time) = Entries(source, time) / Entries(world, time). At a future time, both counts are unknown (?) and must be estimated.
27
SOURCE QUALITY ESTIMATION Combine statistical models over time. Estimating Entries(world, time): start from the current world entries and use the intensity rates λ of the Poisson models to add the expected change.
28
SOURCE QUALITY ESTIMATION Combine statistical models over time. Estimating the entries a source will contain: Entries × Pr(Exist) + New Entries × Pr(Exist), i.e., existing and new entries are each weighted by the probability that they exist in the source.
29
SOURCE QUALITY ESTIMATION Combine statistical models over time. Estimating Entries.
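The estimation steps on these slides can be combined into a small numeric sketch. Everything here is an assumption made for illustration: the world count is projected linearly from hypothetical appearance and disappearance rates, and the source count weights existing and new entries by hypothetical existence probabilities. The slides do not give the exact formulas.

```python
def expected_world_entries(current, insert_rate, delete_rate, horizon):
    """Project the world size using Poisson intensity rates (assumed linear drift)."""
    change = (insert_rate - delete_rate) * horizon
    return max(0.0, current + change)


def expected_source_entries(existing, p_existing, new, p_new):
    """Weight existing and new entries by the probability that the source holds them."""
    return existing * p_existing + new * p_new


def expected_coverage(source_entries, world_entries):
    """Ratio of the two estimates; zero when the world is estimated to be empty."""
    return source_entries / world_entries if world_entries > 0 else 0.0


# Hypothetical inputs for a forecast four time points ahead.
world_future = expected_world_entries(current=1000, insert_rate=10.0, delete_rate=5.0, horizon=4.0)
source_future = expected_source_entries(existing=500, p_existing=0.5, new=40, p_new=0.5)
estimated = expected_coverage(source_future, world_future)
print(world_future, source_future, estimated)
```

In practice the existence probabilities would come from the source update models and the rates from the world evolution models, so the forecast needs no access to future data.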
31
SOLVING SOURCE SELECTION Maximize marginal gain
32
SOLVING SOURCE SELECTION Greedy: start with an empty solution and add sources greedily. Highly efficient, but no quality guarantees: solutions can be arbitrarily bad.
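A minimal sketch of the greedy strategy, assuming the objective is a gain function (benefit minus cost) over sets of sources. The catalog, its costs, and the `gain` function are hypothetical and only serve to make the example runnable.

```python
def greedy_select(sources, gain):
    """Repeatedly add the source that raises the gain the most; stop when none does."""
    selected = set()
    remaining = set(sources)
    current = gain(selected)
    while remaining:
        best, best_value = None, current
        for candidate in sorted(remaining):  # sorted for deterministic tie-breaking
            value = gain(selected | {candidate})
            if value > best_value:
                best, best_value = candidate, value
        if best is None:
            break  # no source improves the objective
        selected.add(best)
        remaining.remove(best)
        current = best_value
    return selected


# Hypothetical catalog: source -> (entities it covers, integration cost).
catalog = {
    "A": ({1, 2, 3}, 1.0),
    "B": ({3, 4}, 1.5),
    "C": ({1, 2}, 0.5),
    "D": ({5}, 2.0),
}


def gain(selected):
    """Benefit (number of covered entities) minus a linear cost."""
    covered = set()
    for name in selected:
        covered |= catalog[name][0]
    return len(covered) - sum(catalog[name][1] for name in selected)


print(greedy_select(catalog, gain))
```

Each round evaluates every remaining source once, which is why the approach is cheap, but it never revisits an earlier choice.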
33
ARBITRARY OBJECTIVE FUNCTIONS GRASP(k, r) [used in Dong et al., '13]: local search and randomized hill-climbing. Run r times and keep the best solution. Empirically high-quality solutions, but very expensive.
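The GRASP idea can be sketched as follows, assuming k is the size of the candidate list sampled during randomized construction and r is the number of restarts. This is a simplified illustration, not the implementation from Dong et al.; the catalog and `gain` function are hypothetical.

```python
import random


def grasp(sources, gain, k, r, seed=0):
    """Run r randomized-greedy constructions, hill-climb each, keep the best."""
    rng = random.Random(seed)
    best_set, best_value = set(), gain(set())
    for _ in range(r):
        # Construction: pick at random among the top-k improving sources.
        selected, current = set(), gain(set())
        while True:
            scored = sorted(
                ((gain(selected | {s}), s) for s in sources if s not in selected),
                reverse=True,
            )
            improving = [(value, s) for value, s in scored if value > current]
            if not improving:
                break
            current, choice = rng.choice(improving[:k])
            selected.add(choice)
        # Hill climbing: toggle single sources while the gain strictly improves.
        improved = True
        while improved:
            improved = False
            for s in sources:
                neighbor = selected ^ {s}
                value = gain(neighbor)
                if value > current:
                    selected, current, improved = neighbor, value, True
        if current > best_value:
            best_set, best_value = set(selected), current
    return best_set


# Hypothetical catalog: source -> (entities it covers, integration cost).
catalog = {
    "A": ({1, 2, 3}, 1.0),
    "B": ({3, 4}, 1.5),
    "C": ({1, 2}, 0.5),
    "D": ({5}, 2.0),
}


def gain(selected):
    covered = set()
    for name in selected:
        covered |= catalog[name][0]
    return len(covered) - sum(catalog[name][1] for name in selected)


result = grasp(catalog, gain, k=2, r=20)
print(result, gain(result))
```

The cost is visible in the structure: every restart re-scores all remaining sources at every construction step and again during hill climbing.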
34
INSIGHTS FOR QUALITY GUARANTEES A large family of benefit functions is monotone submodular (e.g., functions of coverage). Under a linear cost function the marginal gain is submodular. Submodularity (diminishing returns): for $A \subseteq B$ and $x \notin B$, $f(A \cup \{x\}) - f(A) \ge f(B \cup \{x\}) - f(B)$.
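The diminishing-returns property can be checked by brute force on a small example. The sketch below is only feasible for a handful of sources, and the catalog and functions are hypothetical; it tests whether adding a source to a smaller set always helps at least as much as adding it to a superset.

```python
from itertools import combinations


def is_submodular(ground, f, tol=1e-12):
    """Check f(A+x) - f(A) >= f(B+x) - f(B) for all A subset of B and x not in B."""
    items = sorted(ground)
    subsets = [
        frozenset(combo)
        for size in range(len(items) + 1)
        for combo in combinations(items, size)
    ]
    for a in subsets:
        for b in subsets:
            if not a <= b:
                continue
            for x in items:
                if x in b:
                    continue
                gain_small = f(a | {x}) - f(a)
                gain_large = f(b | {x}) - f(b)
                if gain_small < gain_large - tol:
                    return False
    return True


# Hypothetical catalog: source -> entities it covers.
catalog = {"A": {1, 2, 3}, "B": {3, 4}, "C": {1, 2}}


def covered(selected):
    """Coverage-style benefit: number of distinct entities covered."""
    out = set()
    for name in selected:
        out |= catalog[name]
    return len(out)


def gain_with_linear_cost(selected):
    """Coverage benefit minus a linear cost of 0.5 per source."""
    return covered(selected) - 0.5 * len(selected)


print(is_submodular(catalog, covered))
print(is_submodular(catalog, gain_with_linear_cost))
print(is_submodular(catalog, lambda s: len(s) ** 2))
```

Subtracting a linear cost keeps the function submodular because each source's cost is the same regardless of the set it is added to, whereas the squared set size has increasing returns and fails the check.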
35
SUBMODULAR OBJECTIVE FUNCTIONS Submodular Maximization (MaxSub): start by selecting the best source. Explore the local neighborhood: add/delete sources. Return the better of the locally optimal set and its complement. Constant-factor approximation [Feige, '11]. Highly efficient. Empirically high-quality even for non-submodular functions.
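A simplified sketch of this style of local search, loosely following the add/delete scheme for submodular maximization. The improvement threshold, the `eps` parameter, and the example catalog are assumptions for illustration; the approximation guarantee in the literature is stated for non-negative submodular functions, which this toy objective does not enforce.

```python
def maxsub_local_search(sources, f, eps=0.01):
    """Local search: start from the best single source, then add or delete
    sources while the objective improves by a small relative margin."""
    ground = set(sources)
    n = len(ground)
    if n == 0:
        return set()
    start = max(sorted(ground), key=lambda s: f({s}))
    selected = {start}
    improved = True
    while improved:
        improved = False
        current = f(selected)
        # Require a strict improvement so the search always terminates.
        margin = (eps / (n * n)) * abs(current)
        for s in sorted(ground):
            candidate = selected ^ {s}  # add s if absent, delete it if present
            if f(candidate) > current + margin:
                selected = candidate
                improved = True
                break
    complement = ground - selected
    return selected if f(selected) >= f(complement) else complement


# Hypothetical catalog: source -> (entities it covers, integration cost).
catalog = {
    "A": ({1, 2, 3}, 1.0),
    "B": ({3, 4}, 1.5),
    "C": ({1, 2}, 0.5),
    "D": ({5}, 2.0),
}


def gain(selected):
    covered = set()
    for name in selected:
        covered |= catalog[name][0]
    return len(covered) - sum(catalog[name][1] for name in selected)


chosen = maxsub_local_search(catalog, gain)
print(chosen, gain(chosen))
```

Compared with restarting many randomized searches, this performs a single deterministic pass of add/delete moves, which is consistent with the efficiency claim on the slide.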
36
SELECTED EXPERIMENTS Business listings (BL): ~40 sources, 2 years, ~1,400 categories, 51 locations. World-wide event listings (GDELT, gdeltproject.org): 15,275 sources, 1 month, 236 event types, 242 locations.
37
WORLD CHANGE ESTIMATION Small relative error even with little training data Expected increasing trend over time
38
SOURCE CHANGE ESTIMATION Small relative error for source quality
39
SELECTION QUALITY Grasp finds the best solution most of the time. Percentage of times each method finds the best solution:
| BENEFIT | METRIC | MSR. | GREEDY | MAXSUB | GRASP |
|---|---|---|---|---|---|
| LINEAR | cov. | best | 16.7% | 50% | 100% (5,20) |
| LINEAR | acc. | best | 0.0% | 33.3% | 83.3% (2,100) |
| STEP | cov. | best | 50.0% | 66.7% | 83.3% (10,100) |
| STEP | acc. | best | 50% | 66.7% | 83.3% (5,100) |
40
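SELECTION QUALITY MaxSub solutions are mostly comparable to Grasp. Rows marked "diff." give the avg. and worst quality loss, written as avg. (worst):
| BENEFIT | METRIC | MSR. | GREEDY | MAXSUB | GRASP |
|---|---|---|---|---|---|
| LINEAR | cov. | best | 16.7% | 50% | 100% (5,20) |
| LINEAR | cov. | diff. | .005 (.01)% | .001 (.007)% | - |
| LINEAR | acc. | best | 0.0% | 33.3% | 83.3% (2,100) |
| LINEAR | acc. | diff. | 9.5 (53.7)% | .39 (2.31)% | 8.9% (53.7)% |
| STEP | cov. | best | 50.0% | 66.7% | 83.3% (10,100) |
| STEP | cov. | diff. | 7.45 (27.8)% | .012 (.06)% | .7 (4.2)% |
| STEP | acc. | best | 50% | 66.7% | 83.3% (5,100) |
| STEP | acc. | diff. | 6 (23.98)% | 1.76 (10.6)% | 3.99 (23.98)% |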
41
SELECTION QUALITY (same table as on the previous slide) but there are cases when Grasp is significantly worse: for the LINEAR benefit with the accuracy metric, its avg. (worst) quality loss is 8.9% (53.7%), versus .39% (2.31%) for MaxSub.
42
INCREASING NUMBER OF SOURCES MaxSub is one to two orders of magnitude faster
43
SELECTION CHARACTERISTICS Optimizing for accuracy selects fewer, more focused sources
44
CONCLUSIONS Source selection before data integration to increase quality and reduce cost. Collection of statistical models to describe the evolution of the world and the updates of sources. Exploiting submodularity gives more efficient solutions with rigorous guarantees. Thank you! thodrek@cs.umd.edu