UNIVERSITY OF MARYLAND GOOGLE INC. AT&T LABS-RESEARCH THEODOROS REKATSINAS XIN LUNA DONG DIVESH SRIVASTAVA CHARACTERIZING AND SELECTING FRESH DATA SOURCES.

1 CHARACTERIZING AND SELECTING FRESH DATA SOURCES Theodoros Rekatsinas (University of Maryland), Xin Luna Dong (Google Inc.), Divesh Srivastava (AT&T Labs-Research)

2 DATA IS A COMMODITY Myriads of data sources.

3 DATA IS A COMMODITY Myriads of data sources. FREELY AVAILABLE SOURCES: open data initiative, world data bank, crawling the web.

4 DATA IS A COMMODITY Myriads of data sources. FREELY AVAILABLE SOURCES. DATA MARKETS (SELL YOUR DATA TO OTHERS): DataSift, Microsoft Azure Marketplace, datamarket.com, Infochimps.

5 DATA IS A COMMODITY Myriads of data sources. FREELY AVAILABLE SOURCES. DATA MARKETS (SELL YOUR DATA TO OTHERS). HETEROGENEOUS DATA SOURCES: different quality and cost; cover different topics; static or dynamic; exhibit different update patterns.

6 “Truth, like gold, is to be obtained not by its growth, but by washing away from it all that is not gold.” – Leo Tolstoy

7 So, can we find gold in a systematic and automated fashion? Yes! Use source selection to reason about the benefits and costs of acquiring and integrating data sources [Dong et al., 2013]. These techniques are agnostic to time and focus on the accuracy of static sources.

8 [TIME] MATTERS Select sources before actual data integration. When do we need to use the integration result? Data in the world and in the sources change.

10 IT IS A DYNAMIC WORLD [Figure: timeline showing the world and two data sources, one updated every 2 time points and one updated every 3 time points]

11 CHALLENGES AND OPPORTUNITIES Business listings (BL): ~40 sources, 2 years, ~1,400 categories, 51 locations.

12 QUALITY CHANGES OVER TIME The optimal set of sources changes over time

13 LOWER COST OPPORTUNITIES Integrate updates at a lower frequency to reduce cost.

14 TIME-BASED SOURCE QUALITY Entry categories: up-to-date, out-of-date, non-deleted. For a source S, the world W, and a time t: Coverage(S, t) = Entries(S, t) / Entries(W, t), analogous to recall. Freshness(S, t) = UpToDate Entries(S, t) / Entries(S, t), analogous to precision. The two are combined into Accuracy(S, t).
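
The two ratios on this slide can be illustrated with a small sketch. The dictionary representation, the function name `time_based_quality`, and the choice to count only source entries that still exist in the world toward coverage are assumptions made for illustration, not details taken from the slides.

```python
def time_based_quality(source: dict, world: dict) -> dict:
    """Coverage and freshness of one source snapshot against the world.

    Both arguments map an entity id to its attribute value at the same time t.
    """
    up_to_date = [e for e, v in source.items() if e in world and world[e] == v]
    out_of_date = [e for e, v in source.items() if e in world and world[e] != v]
    # Entries that no longer exist in the world but are still listed by the source.
    non_deleted = [e for e in source if e not in world]

    # Assumption: coverage counts only source entries that exist in the world,
    # which keeps it recall-like and bounded by 1.
    coverage = (len(up_to_date) + len(out_of_date)) / len(world) if world else 0.0
    # Freshness is precision-like: up-to-date entries over everything the source lists.
    freshness = len(up_to_date) / len(source) if source else 0.0
    return {"coverage": coverage, "freshness": freshness,
            "non_deleted": len(non_deleted)}


# Hypothetical toy snapshot of a business-listing world and one source.
world = {"cafe": "open", "bank": "open", "gym": "closed", "bar": "open"}
source = {"cafe": "open", "bank": "closed", "diner": "open"}
print(time_based_quality(source, world))
```

In the toy snapshot the source lists two of the four world entries, and only one of its three entries is up to date.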

15 SELECTING FRESH SOURCES Time-aware source selection

16 EXTENSIONS Optimal frequency; subset of provided data.

17 EXTENSIONS Optimal frequency; subset of provided data; time-aware source selection with many more sources.

18 PROPOSED FRAMEWORK Input: historical snapshots of available sources. Pre-processing and statistical modeling: update models for sources; evolution models for the data domain. Source selection: use the statistical models to estimate the quality of integrated data; integration cost model; maximize the quality-cost tradeoff. Output: select the optimal subset of sources.

20 WORLD EVOLUTION MODELS Integrate available data source snapshots to extract the evolution of the world. Ensemble of parametric models: Poisson random process; exponentially distributed changes.
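
As a minimal sketch of what a Poisson evolution model provides, the snippet below uses standard facts about a homogeneous Poisson process: the maximum-likelihood rate is the event count divided by the observed time, the expected number of events in a horizon is rate times horizon, and the probability of no event in a horizon is exp(-rate * horizon). The function names and the toy numbers are hypothetical; the slides do not show how the models are fitted.

```python
import math


def estimate_rate(event_count: int, observed_time: float) -> float:
    """Maximum-likelihood intensity of a homogeneous Poisson process."""
    if observed_time <= 0:
        raise ValueError("observed_time must be positive")
    return event_count / observed_time


def expected_events(rate: float, horizon: float) -> float:
    """Expected number of events (e.g. newly appearing entries) within the horizon."""
    return rate * horizon


def prob_no_change(rate: float, horizon: float) -> float:
    """Probability that no event occurs within the horizon.

    Inter-event times of a Poisson process are exponentially distributed,
    so the survival probability is exp(-rate * horizon).
    """
    return math.exp(-rate * horizon)


# Hypothetical history: 30 new entries observed over 10 time points.
insert_rate = estimate_rate(30, 10.0)
print(expected_events(insert_rate, 2.0))   # expected insertions over 2 time points
print(prob_no_change(0.5, 2.0))            # chance an entry with change rate 0.5 is unchanged
```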

21 SOURCE UPDATE MODELS Shall we consider only the update frequency?

22 SOURCE UPDATE MODELS High update frequency does not imply high freshness

23 SOURCE UPDATE MODELS Ensemble of non-parametric models: update frequency of the source; empirical effectiveness distributions.
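
One way to picture an empirical effectiveness distribution is as the fraction of world changes that a source has captured within a given delay. The sketch below uses a plain empirical cumulative distribution; the function name, the `None` convention for changes that were never captured, and the sample delays are assumptions for illustration, and the estimator used by the authors may differ.

```python
from typing import Callable, Optional, Sequence


def empirical_effectiveness(
    delays: Sequence[Optional[float]],
) -> Callable[[float], float]:
    """Build a function d -> fraction of world changes captured within delay d.

    Each item in `delays` is the observed delay between a change in the world
    and its appearance in the source, or None if the source never captured it.
    """
    total = len(delays)
    captured = sorted(d for d in delays if d is not None)

    def prob_captured_within(d: float) -> float:
        if total == 0:
            return 0.0
        return sum(1 for x in captured if x <= d) / total

    return prob_captured_within


# Hypothetical source: it updates often, yet misses one change in four.
effectiveness = empirical_effectiveness([1, 2, 2, None])
print(effectiveness(1), effectiveness(2), effectiveness(10))
```

The curve plateaus below 1 when some changes are never captured, which is one way a frequently updated source can still have low freshness.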

24 PROPOSED FRAMEWORK Input: historical snapshots of available sources. Pre-processing and statistical modeling: update models for sources; evolution models for the data domain. Source selection: use the statistical models to estimate the quality of integrated data; integration cost model; maximize the quality-cost tradeoff. Output: select the optimal subset of sources.

25 SOURCE QUALITY ESTIMATION Combine statistical models over time: NewQuality(·, ·; ·) as a function of OldQuality(·, ·; ·) and ΔQuality(·, ·; ·).

26 SOURCE QUALITY ESTIMATION Combine statistical models over time: how to estimate Coverage(·, ·) = Entries(·, ·) / Entries(·, ·)?

27 SOURCE QUALITY ESTIMATION Combine statistical models over time. Estimating Entries(·, ·): use the intensity rates λ of the Poisson models, Entries(·, ·) + …

28 SOURCE QUALITY ESTIMATION Combine statistical models over time. Estimating …: Entries(·, ·) × Pr(Exist(·, ·)) + New Entries(·, ·) × Pr(Exist(·, ·))

29 SOURCE QUALITY ESTIMATION Combine statistical models over time. Estimating …: Entries(·, ·)

30 PROPOSED FRAMEWORK Input: historical snapshots of available sources. Pre-processing and statistical modeling: update models for sources; evolution models for the data domain. Source selection: use the statistical models to estimate the quality of integrated data; integration cost model; maximize the quality-cost tradeoff. Output: select the optimal subset of sources.

31 SOLVING SOURCE SELECTION Maximize marginal gain

32 SOLVING SOURCE SELECTION Greedy: start with an empty solution and add sources greedily. Highly efficient, but no quality guarantees: solutions can be arbitrarily bad.
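
The greedy strategy can be sketched as follows: start from the empty set and keep adding the source that raises the objective the most, stopping when no addition helps. The `gain` callback (benefit minus cost of a source set), the toy `COVERED` data, and the cost of 0.5 per source are hypothetical stand-ins for the quality and cost models.

```python
from typing import Callable, FrozenSet, Iterable

Gain = Callable[[FrozenSet[str]], float]


def greedy_select(sources: Iterable[str], gain: Gain) -> FrozenSet[str]:
    """Add the source with the largest objective value until nothing improves."""
    selected: FrozenSet[str] = frozenset()
    current = gain(selected)
    remaining = sorted(sources)  # sorted for deterministic tie-breaking
    while remaining:
        best = max(remaining, key=lambda s: gain(selected | {s}))
        best_value = gain(selected | {best})
        if best_value <= current:
            break  # no remaining source has a positive marginal gain
        selected = selected | {best}
        current = best_value
        remaining.remove(best)
    return selected


# Hypothetical toy instance: each source covers some items and costs 0.5.
COVERED = {"s1": {1, 2, 3}, "s2": {3, 4}, "s3": {1}}


def toy_gain(chosen: FrozenSet[str]) -> float:
    covered = set().union(*(COVERED[s] for s in chosen))
    return len(covered) - 0.5 * len(chosen)


print(sorted(greedy_select(COVERED, toy_gain)))
```

On this toy instance greedy happens to reach the best set; the slide's caveat is that in general it carries no such guarantee.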

33 ARBITRARY OBJECTIVE FUNCTIONS GRASP(k, r) [used in Dong et al., `13]: local search and randomized hill-climbing; run r times and keep the best solution. Empirically high-quality solutions, but very expensive.
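
A generic GRASP loop can be sketched as below. It assumes that k is the size of the candidate list from which each construction step picks at random and that r is the number of restarts; the single-flip neighborhood, the `gain` callback, and the toy data are illustrative choices, not the exact procedure from the cited work.

```python
import random
from typing import Callable, FrozenSet, Iterable

Gain = Callable[[FrozenSet[str]], float]


def grasp(sources: Iterable[str], gain: Gain, k: int, r: int, seed: int = 0) -> FrozenSet[str]:
    """Randomized greedy construction plus hill-climbing, repeated r times."""
    rng = random.Random(seed)
    items = sorted(sources)
    best_set: FrozenSet[str] = frozenset()
    best_value = gain(best_set)
    for _ in range(r):
        # Construction: pick at random among the top-k positive marginal gains.
        current: FrozenSet[str] = frozenset()
        while True:
            base = gain(current)
            candidates = [(gain(current | {s}) - base, s) for s in items if s not in current]
            candidates = sorted((c for c in candidates if c[0] > 0), reverse=True)[:k]
            if not candidates:
                break
            current = current | {rng.choice(candidates)[1]}
        # Local search: flip one source in or out while the objective improves.
        improved = True
        while improved:
            improved = False
            for s in items:
                neighbor = current ^ {s}
                if gain(neighbor) > gain(current):
                    current, improved = neighbor, True
        if gain(current) > best_value:
            best_set, best_value = current, gain(current)
    return best_set


# Hypothetical toy instance: each source covers some items and costs 0.5.
COVERED = {"s1": {1, 2, 3}, "s2": {3, 4}, "s3": {1}}


def toy_gain(chosen: FrozenSet[str]) -> float:
    covered = set().union(*(COVERED[s] for s in chosen))
    return len(covered) - 0.5 * len(chosen)


print(sorted(grasp(COVERED, toy_gain, k=2, r=5)))
```

Every restart re-evaluates the objective many times, which is consistent with the slide's remark that the approach is expensive.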

34 INSIGHTS FOR QUALITY GUARANTEES A large family of benefit functions is monotone submodular (e.g., functions of coverage). Under a linear cost function the marginal gain is submodular. Submodularity (diminishing returns): for A ⊆ B and x ∉ B, f(A ∪ {x}) − f(A) ≥ f(B ∪ {x}) − f(B).
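
The diminishing-returns property, f(A ∪ {x}) − f(A) ≥ f(B ∪ {x}) − f(B) for A ⊆ B and x ∉ B, can be checked by brute force on a small instance. The sketch below does this for a hypothetical coverage-minus-linear-cost objective; the helper names and the toy data are assumptions for illustration.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable


def is_submodular(ground: Iterable[str], f: Callable[[FrozenSet[str]], float]) -> bool:
    """Exhaustively test diminishing returns over all A subset of B and x not in B."""
    items = sorted(ground)
    subsets = [frozenset(c) for n in range(len(items) + 1) for c in combinations(items, n)]
    for a in subsets:
        for b in subsets:
            if not a <= b:
                continue
            for x in items:
                if x in b:
                    continue
                gain_small = f(a | {x}) - f(a)
                gain_large = f(b | {x}) - f(b)
                if gain_small < gain_large - 1e-12:  # tolerance for float rounding
                    return False
    return True


# Hypothetical toy instance: each source covers some items and costs 0.5.
COVERED = {"s1": {1, 2, 3}, "s2": {3, 4}, "s3": {1}}


def coverage_minus_cost(chosen: FrozenSet[str]) -> float:
    covered = set().union(*(COVERED[s] for s in chosen))
    return len(covered) - 0.5 * len(chosen)


def squared_size(chosen: FrozenSet[str]) -> float:
    return float(len(chosen) ** 2)  # increasing returns, so not submodular


print(is_submodular(COVERED, coverage_minus_cost))
print(is_submodular(COVERED, squared_size))
```

The exhaustive check is exponential in the number of sources, so it is only meant for sanity-testing an objective on small examples.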

35 SUBMODULAR OBJECTIVE FUNCTIONS Submodular Maximization (MaxSub): start by selecting the best source; explore the local neighborhood by adding or deleting sources; either the selected set or its complement is a local optimum. Constant factor approximation [Feige, `11]. Highly efficient. Empirically high-quality even for non-submodular functions.
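
The local-search steps listed on this slide can be sketched as follows. The improvement threshold of 1 + eps / n², the default eps, the assumption that the objective is non-negative, and the toy data are illustrative choices in the spirit of local search for submodular maximization, not the authors' exact implementation.

```python
from typing import Callable, FrozenSet, Iterable

Gain = Callable[[FrozenSet[str]], float]


def max_sub(sources: Iterable[str], f: Gain, eps: float = 0.01) -> FrozenSet[str]:
    """Local search: best singleton, then add/delete moves, then compare with complement."""
    items = sorted(sources)
    n = len(items)
    if n == 0:
        return frozenset()
    threshold = 1 + eps / (n * n)
    # Start from the single best source.
    current = frozenset({max(items, key=lambda s: f(frozenset({s})))})
    improved = True
    while improved:
        improved = False
        value = f(current)
        for s in items:
            neighbor = current ^ {s}  # add s if absent, delete it if present
            # Require a strict and sufficiently large improvement so the loop terminates.
            if f(neighbor) > max(value, threshold * value):
                current, improved = neighbor, True
                break
    complement = frozenset(items) - current
    return max(current, complement, key=f)


# Hypothetical toy instance: each source covers some items and costs 0.5.
COVERED = {"s1": {1, 2, 3}, "s2": {3, 4}, "s3": {1}}


def toy_gain(chosen: FrozenSet[str]) -> float:
    covered = set().union(*(COVERED[s] for s in chosen))
    return len(covered) - 0.5 * len(chosen)


print(sorted(max_sub(COVERED, toy_gain)))
```

Each pass evaluates at most one move per source, which is why this kind of search tends to need far fewer objective evaluations than repeated randomized restarts.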

36 SELECTED EXPERIMENTS Business listings (BL): ~40 sources, 2 years, ~1,400 categories, 51 locations. World-wide event listings (GDELT @gdeltproject.org): 15,275 sources, 1 month, 236 event types, 242 locations.

37 WORLD CHANGE ESTIMATION Small relative error even with little training data. Expected increasing trend over time.

38 SOURCE CHANGE ESTIMATION Small relative error for source quality

39 SELECTION QUALITY GRASP finds the best solution most of the time. Table: percentage of times each algorithm finds the best solution.
BENEFIT | METRIC | MSR. | GREEDY | MAXSUB | GRASP
LINEAR | cov. | best | 16.7% | 50% | 100% (5,20)
LINEAR | acc. | best | 0.0% | 33.3% | 83.3% (2,100)
STEP | cov. | best | 50.0% | 66.7% | 83.3% (10,100)
STEP | acc. | best | 50% | 66.7% | 83.3% (5,100)

40 SELECTION QUALITY MaxSub solutions are mostly comparable to GRASP. Table: "best" rows give the percentage of times finding the best solution; "diff." rows give the average (worst) quality loss.
BENEFIT | METRIC | MSR. | GREEDY | MAXSUB | GRASP
LINEAR | cov. | best | 16.7% | 50% | 100% (5,20)
LINEAR | cov. | diff. | .005 (.01)% | .001 (.007)% | -
LINEAR | acc. | best | 0.0% | 33.3% | 83.3% (2,100)
LINEAR | acc. | diff. | 9.5 (53.7)% | .39 (2.31)% | 8.9 (53.7)%
STEP | cov. | best | 50.0% | 66.7% | 83.3% (10,100)
STEP | cov. | diff. | 7.45 (27.8)% | .012 (.06)% | .7 (4.2)%
STEP | acc. | best | 50% | 66.7% | 83.3% (5,100)
STEP | acc. | diff. | 6 (23.98)% | 1.76 (10.6)% | 3.99 (23.98)%

41 SELECTION QUALITY (same table as slide 40) But there are cases when GRASP is significantly worse.

42 INCREASING NUMBER OF SOURCES MaxSub is one to two orders of magnitude faster

43 SELECTION CHARACTERISTICS Optimizing for accuracy selects fewer, more focused sources.

44 CONCLUSIONS Source selection before data integration to increase quality and reduce cost. Collection of statistical models to describe the evolution of the world and the updates of sources. Exploiting submodularity gives more efficient solutions with rigorous guarantees. Thank you! thodrek@cs.umd.edu

