CHARACTERIZING AND SELECTING FRESH DATA SOURCES. THEODOROS REKATSINAS (UNIVERSITY OF MARYLAND), XIN LUNA DONG (GOOGLE INC.), DIVESH SRIVASTAVA (AT&T LABS-RESEARCH)
DATA IS A COMMODITY: myriads of data sources. FREELY AVAILABLE SOURCES: open data initiative, world data bank, crawling the web. DATA MARKETS (SELL YOUR DATA TO OTHERS): datasift, microsoft azure marketplace, datamarket.com, infochimps. HETEROGENEOUS DATA SOURCES: different quality and cost, cover different topics, static or dynamic, exhibit different update patterns.
“Truth, like gold, is to be obtained not by its growth, but by washing away from it all that is not gold.” - LEO TOLSTOY
So, can we find gold in a systematic and automated fashion? Yes! Use source selection to reason about the benefits and costs of acquiring and integrating data sources [Dong et al., 2013]. Existing techniques are agnostic to time and focus on the accuracy of static sources.
TIME MATTERS. Select sources before actual data integration. When do we need to use the integration result? Data in the world and in the sources change.
IT IS A DYNAMIC WORLD. [Figure: timeline of changes in the world alongside data sources, one updated every 2 time points and one updated every 3 time points.]
CHALLENGES AND OPPORTUNITIES Business listings (BL) ~40 sources, 2 years ~1,400 categories, 51 locations
QUALITY CHANGES OVER TIME The optimal set of sources changes over time
LOWER COST OPPORTUNITIES Integrate updates at a lower frequency to lower cost
TIME-BASED SOURCE QUALITY (S: a source, W: the world, t: a time point). Legend: UpToDate Entries, OutOfDate Entries, NonDeleted Entries. Coverage(S, t) = Entries(S, t) / Entries(W, t). Freshness(S, t) = UpToDate Entries(S, t) / Entries(S, t). Coverage ~ Recall, Freshness ~ Precision. Combine: Accuracy(S, t).
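The two ratios above can be illustrated with a small worked example. This is only a sketch under stated assumptions: the snapshot data is invented, entries are modeled as a mapping from an entry id to its current value, and coverage counts only source entries that still exist in the world. The function names `coverage` and `freshness` are hypothetical helpers, not part of the original work.

```python
def coverage(source_entries, world_entries):
    """Recall-like: fraction of world entries that the source contains."""
    if not world_entries:
        return 0.0
    return len(source_entries & world_entries) / len(world_entries)


def freshness(source_records, world_records):
    """Precision-like: fraction of source entries whose value matches the world."""
    if not source_records:
        return 0.0
    up_to_date = sum(
        1 for key, value in source_records.items()
        if world_records.get(key) == value
    )
    return up_to_date / len(source_records)


# Hypothetical snapshot at one time point: entry id -> current value.
world = {"cafe": "open", "bakery": "open", "gym": "moved", "bar": "open"}
# The source has one up-to-date entry, one out-of-date entry ("gym"),
# and one entry that no longer exists in the world ("diner").
source = {"cafe": "open", "gym": "old address", "diner": "open"}

cov = coverage(set(source), set(world))   # 2 of 4 world entries are present
fresh = freshness(source, world)          # 1 of 3 source entries is up to date
print(f"coverage={cov:.2f} freshness={fresh:.2f}")
```

In this toy snapshot the source covers half of the world but only a third of what it holds is current, which is why the two measures are tracked separately.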
SELECTING FRESH SOURCES Time-aware source selection
EXTENSIONS Optimal frequency Subset of provided data Time-aware source selection with many more sources
PROPOSED FRAMEWORK. HISTORICAL SNAPSHOTS OF AVAILABLE SOURCES -> Pre-processing (Statistical Modeling): UPDATE MODELS FOR SOURCES, EVOLUTION MODELS FOR DATA DOMAIN -> Source selection: USE STATISTICAL MODELS TO ESTIMATE QUALITY OF INTEGRATED DATA, INTEGRATION COST MODEL, Maximize Quality - Cost Tradeoff -> SELECT OPTIMAL SUBSET OF SOURCES
WORLD EVOLUTION MODELS Poisson Random Process Exponentially distributed changes Integrate available data source snapshots to extract the evolution of the world Ensemble of parametric models
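As a rough illustration of the Poisson view of world changes, the sketch below estimates a constant intensity rate from observed change times and uses it to project forward. The change times, the observation window, and all function names are assumptions made for this example; the slides do not specify how the ensemble of parametric models is fitted.

```python
import math


def estimate_rate(change_times, window_length):
    """Maximum-likelihood intensity of a homogeneous Poisson process:
    number of observed changes divided by the length of the window."""
    return len(change_times) / window_length


def expected_changes(rate, horizon):
    """Expected number of changes over a future horizon."""
    return rate * horizon


def prob_at_least_one_change(rate, horizon):
    """With exponentially distributed waiting times,
    P(next change within horizon) = 1 - exp(-rate * horizon)."""
    return 1.0 - math.exp(-rate * horizon)


# Hypothetical times at which new entries appeared, observed over 12 time units.
appearance_times = [0.5, 1.2, 3.1, 4.0, 7.7, 9.3]
lam = estimate_rate(appearance_times, window_length=12.0)

print(lam)                                  # changes per time unit
print(expected_changes(lam, 4.0))           # expected changes in the next 4 units
print(prob_at_least_one_change(lam, 4.0))   # chance of at least one change
```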
SOURCE UPDATE MODELS Shall we consider only the update frequency?
SOURCE UPDATE MODELS High update frequency does not imply high freshness
SOURCE UPDATE MODELS Update frequency of the source Empirical Effectiveness distributions Ensemble of non-parametric models
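One way to picture a non-parametric effectiveness distribution is as an empirical CDF over capture delays. The sketch below assumes, purely for illustration, that effectiveness means the probability that a change in the world shows up in the source within a given delay, and that changes the source never captured still count in the denominator. The data and the helper `empirical_capture_cdf` are hypothetical.

```python
from bisect import bisect_right


def empirical_capture_cdf(delays, total_changes):
    """Build an empirical CDF: cdf(tau) is the fraction of all world changes
    that the source reflected within tau time points."""
    if total_changes <= 0:
        raise ValueError("total_changes must be positive")
    sorted_delays = sorted(delays)

    def cdf(tau):
        # Count captured changes whose delay is at most tau.
        return bisect_right(sorted_delays, tau) / total_changes

    return cdf


# Hypothetical history: 10 world changes, 6 of which the source eventually captured.
observed_delays = [0, 1, 1, 2, 5, 8]
capture_cdf = empirical_capture_cdf(observed_delays, total_changes=10)

print(capture_cdf(1))   # share of changes captured within 1 time point
print(capture_cdf(8))   # share of changes ever captured in this history
```

Under this reading, a source can update often and still have a flat curve, which matches the observation that high update frequency does not imply high freshness.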
SOURCE QUALITY ESTIMATION Combine statistical models (S: a source, W: the world, t0: current time, t: future time). NewQuality at the future time is estimated as a function of OldQuality and ΔQuality.
Example: Coverage(S, t) = Entries(S, t) / Entries(W, t), where both quantities are unknown for a future t.
Estimating Entries(W, t): use the intensity rates λ of the Poisson models to add the expected change to Entries(W, t0).
Estimating Entries(S, t): Entries × Pr(Exist) + New Entries × Pr(Exist), i.e. previously existing entries and newly appearing entries, each weighted by the probability that the entry exists in S at t.
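The estimation steps can be combined into a back-of-the-envelope calculation. This is a minimal sketch with several assumptions that go beyond the slides: constant appearance and disappearance rates, a single inclusion probability for old entries and another for new entries, and invented numbers throughout. All function and parameter names are hypothetical.

```python
def expected_world_size(current_size, appear_rate, disappear_rate, horizon):
    """Current world size plus the expected net change over the horizon
    (assumes constant Poisson intensities for appearances and disappearances)."""
    net_change = (appear_rate - disappear_rate) * horizon
    return max(current_size + net_change, 0.0)


def expected_source_entries(existing, p_existing, new, p_new):
    """Existing entries and new entries, each weighted by the probability
    that such an entry exists in the source at the future time."""
    return existing * p_existing + new * p_new


def expected_coverage(source_entries, world_size):
    """Ratio of expected source entries to expected world size."""
    return source_entries / world_size if world_size > 0 else 0.0


# Hypothetical numbers: 1000 entries now, 20 appear and 10 disappear per time point.
horizon = 5
world_future = expected_world_size(1000, appear_rate=20, disappear_rate=10, horizon=horizon)
surviving_old = 1000 - 10 * horizon   # old entries expected to remain in the world
new_entries = 20 * horizon            # entries expected to appear
in_source = expected_source_entries(surviving_old, 0.6, new_entries, 0.3)

print(world_future, in_source, expected_coverage(in_source, world_future))
```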
SOLVING SOURCE SELECTION Maximize marginal gain
SOLVING SOURCE SELECTION Greedy: start with an empty solution and add sources greedily. No quality guarantees: solutions can be arbitrarily bad. Highly efficient.
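A greedy selector of this kind might look like the sketch below. It assumes, for illustration only, that benefit is the fraction of a known universe of entries covered by the selected sources and that cost is a per-source constant; the source names, contents, and costs are invented.

```python
def profit(selected, sources, costs, universe_size):
    """Benefit (fraction of the universe covered) minus linear cost."""
    covered = set().union(*(sources[s] for s in selected))
    return len(covered) / universe_size - sum(costs[s] for s in selected)


def greedy_select(sources, costs, universe_size):
    """Start empty; repeatedly add the source with the largest profit,
    stopping when no addition improves the current profit."""
    selected, current = [], 0.0
    remaining = set(sources)
    while remaining:
        best, best_value = None, current
        for name in sorted(remaining):          # sorted for deterministic ties
            value = profit(selected + [name], sources, costs, universe_size)
            if value > best_value:
                best, best_value = name, value
        if best is None:
            break
        selected.append(best)
        remaining.remove(best)
        current = best_value
    return selected, current


# Hypothetical sources, each described by the set of entry ids it provides.
sources = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {6}, "D": {1, 2}}
costs = {name: 0.1 for name in sources}

chosen, value = greedy_select(sources, costs, universe_size=6)
print(chosen, round(value, 3))
```

Source D is never picked here because everything it provides is already covered by A, so it adds cost without benefit.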
ARBITRARY OBJECTIVE FUNCTIONS GRASP (k,r) [used in Dong et al., '13]: local search and randomized hill-climbing. Run r times and keep the best solution. Empirically high-quality solutions. Very expensive.
INSIGHTS FOR QUALITY GUARANTEES A large family of benefit functions is monotone submodular (e.g., functions of coverage). Under a linear cost function the marginal gain is submodular. Submodularity (diminishing returns): for A ⊆ B and x ∉ B, f(A ∪ {x}) − f(A) ≥ f(B ∪ {x}) − f(B).
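The diminishing-returns property can be checked by brute force on a tiny example. The sketch below is illustrative only: the sources are invented, the coverage benefit is simply the number of distinct entries covered, and `is_submodular` is a hypothetical helper that enumerates every pair of nested subsets, so it is only practical for a handful of sources.

```python
from itertools import combinations

# Hypothetical sources, each described by the set of entry ids it provides.
sources = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {6}, "D": {1, 2}}


def coverage_benefit(selected):
    """Number of distinct entries covered by the selected sources."""
    return len(set().union(*(sources[s] for s in selected)))


def is_submodular(f, ground):
    """Check f(A + x) - f(A) >= f(B + x) - f(B) for all A subset of B, x not in B."""
    items = sorted(ground)
    subsets = [frozenset(c) for r in range(len(items) + 1)
               for c in combinations(items, r)]
    for a in subsets:
        for b in subsets:
            if not a <= b:
                continue
            for x in items:
                if x in b:
                    continue
                gain_a = f(a | {x}) - f(a)
                gain_b = f(b | {x}) - f(b)
                if gain_a < gain_b - 1e-12:
                    return False
    return True


cov_ok = is_submodular(coverage_benefit, sources)
# Subtracting a linear cost (0.5 per source here) keeps the property.
profit_ok = is_submodular(lambda s: coverage_benefit(s) - 0.5 * len(s), sources)
# A counterexample: squared set size has increasing, not diminishing, returns.
square_ok = is_submodular(lambda s: len(s) ** 2, sources)
print(cov_ok, profit_ok, square_ok)
```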
SUBMODULAR OBJECTIVE FUNCTIONS Submodular Maximization (MaxSub): start by selecting the best source; explore the local neighborhood by adding/deleting sources; either the selected set or its complement is a local optimum. Constant-factor approximation [Feige, '11]. Highly efficient. Empirically high-quality even for non-submodular functions.
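The add/delete local search can be sketched as follows. This is a simplified variant written for illustration, not the published algorithm: it accepts any strict improvement instead of using an approximate-improvement threshold, so it carries no running-time bound. The objective, sources, and costs are invented.

```python
def local_search_maxsub(f, ground):
    """Start from the best single source, then add or delete one source at a
    time while the objective strictly improves; finally return the better of
    the local optimum and its complement."""
    ground = frozenset(ground)
    if not ground:
        return frozenset()
    best_single = max(sorted(ground), key=lambda x: f(frozenset({x})))
    selected = frozenset({best_single})
    improved = True
    while improved:
        improved = False
        for x in sorted(ground - selected):      # try additions first
            if f(selected | {x}) > f(selected):
                selected, improved = selected | {x}, True
                break
        if improved:
            continue
        for x in sorted(selected):               # then try deletions
            if f(selected - {x}) > f(selected):
                selected, improved = selected - {x}, True
                break
    return max((selected, ground - selected), key=f)


# Hypothetical sources, each described by the set of entry ids it provides.
sources = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {6}, "D": {1, 2}}


def objective(selected):
    """Fraction of 6 entries covered, minus a cost of 0.1 per selected source."""
    covered = set().union(*(sources[s] for s in selected))
    return len(covered) / 6 - 0.1 * len(selected)


result = local_search_maxsub(objective, sources)
print(sorted(result), round(objective(result), 3))
```

Each accepted move strictly increases the objective and there are finitely many subsets, so the loop terminates.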
SELECTED EXPERIMENTS Business listings (BL) ~40 sources, 2 years ~1,400 categories, 51 locations World-wide Event listings 15,275 sources, 1 month 236 event types, 242 locations
WORLD CHANGE ESTIMATION Small relative error even with little training data Expected increasing trend over time
SOURCE CHANGE ESTIMATION Small relative error for source quality
SELECTION QUALITY
best: perc. of times finding the best solution. diff.: avg. (and worst) quality loss. GRASP parameters (k,r) are shown next to its best values.

BENEFIT  METRIC  MSR.   GREEDY         MAXSUB         GRASP
LINEAR   cov.    best   16.7%          50%            100% (5,20)
                 diff.  .005 (.01)%    .001 (.007)%   -
         acc.    best   0.0%           33.3%          83.3% (2,100)
                 diff.  9.5 (53.7)%    .39 (2.31)%    8.9 (53.7)%
STEP     cov.    best   50.0%          66.7%          83.3% (10,100)
                 diff.  7.45 (27.8)%   .012 (.06)%    .7 (4.2)%
         acc.    best   50%            66.7%          83.3% (5,100)
                 diff.  6 (23.98)%     1.76 (10.6)%   3.99 (23.98)%

Grasp finds the best solution most of the time. MaxSub solutions are mostly comparable to Grasp, but there are cases when Grasp is significantly worse.
INCREASING NUMBER OF SOURCES MaxSub is one to two orders of magnitude faster
SELECTION CHARACTERISTICS Optimizing for accuracy selects fewer, more focused sources
CONCLUSIONS Source selection before data integration increases quality and reduces cost. A collection of statistical models describes the evolution of the world and the updates of sources. Exploiting submodularity gives more efficient solutions with rigorous guarantees. Thank you!