CHARACTERIZING AND SELECTING FRESH DATA SOURCES. Theodoros Rekatsinas (University of Maryland), Xin Luna Dong (Google Inc.), Divesh Srivastava (AT&T Labs-Research).

Presentation transcript:

DATA IS A COMMODITY
- Myriads of data sources
- Freely available sources: open data initiative, world data bank, crawling the web
- Data markets (sell your data to others): DataSift, Microsoft Azure Marketplace, datamarket.com, Infochimps
- Heterogeneous data sources: different quality and cost, cover different topics, static or dynamic, exhibit different update patterns

“Truth, like gold, is to be obtained not by its growth, but by washing away from it all that is not gold.” (Leo Tolstoy)

So, can we find gold in a systematic and automated fashion?
- Yes! Use source selection to reason about the benefits and costs of acquiring and integrating data sources [Dong et al., 2013]
- However, existing techniques are agnostic to time and focus on the accuracy of static sources

TIME MATTERS
- Select sources before actual data integration
- When do we need to use the integration result?
- Data in the world and in the sources changes

IT IS A DYNAMIC WORLD
[Figure: timeline of the world and the data sources; one source is updated every 2 time points, another every 3 time points]

CHALLENGES AND OPPORTUNITIES
- Business listings (BL): ~40 sources over 2 years, ~1,400 categories, 51 locations

QUALITY CHANGES OVER TIME The optimal set of sources changes over time

LOWER COST OPPORTUNITIES Integrate updates at a lower frequency to lower cost

TIME-BASED SOURCE QUALITY
[Figure legend: UpToDate Entries, OutOfDate Entries, NonDeleted Entries]
- Coverage(source, time) = Entries(source, time) / Entries(world, time)
- Freshness(source, time) = UpToDate Entries(source, time) / Entries(source, time)
- Coverage ~ recall; freshness ~ precision
- Combined measure: Accuracy(source, time)
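
The two ratios on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the world and a source snapshot are each modeled as a dictionary from entity id to value, coverage counts only source entries that also exist in the world, and the names `coverage`, `freshness`, `world`, and `source` are hypothetical rather than taken from the talk.

```python
def coverage(source: dict, world: dict) -> float:
    """Recall-like: share of world entities that the source contains."""
    if not world:
        return 0.0
    return len(source.keys() & world.keys()) / len(world)


def freshness(source: dict, world: dict) -> float:
    """Precision-like: share of source entries that match the current world."""
    if not source:
        return 0.0
    up_to_date = sum(
        1 for key, value in source.items()
        if key in world and world[key] == value
    )
    return up_to_date / len(source)


# Hypothetical snapshot at one time point: entity id -> current value.
world = {"a": "open", "b": "open", "c": "closed", "d": "open"}
# "b" is stale in the source and "e" no longer exists in the world.
source = {"a": "open", "b": "closed", "e": "open"}

print(coverage(source, world))   # 2 of the 4 world entities are present
print(freshness(source, world))  # 1 of the 3 source entries is up to date
```

A source can therefore score well on one measure and poorly on the other, which is why the slide treats them as separate quantities.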

SELECTING FRESH SOURCES Time-aware source selection

EXTENSIONS
- Optimal frequency
- Subset of provided data
- Time-aware source selection with many more sources

PROPOSED FRAMEWORK
- Historical snapshots of available sources
- Pre-processing (statistical modeling): update models for sources, evolution models for the data domain
- Source selection: use the statistical models to estimate the quality of integrated data, combine with an integration cost model, and maximize the quality-cost tradeoff
- Select the optimal subset of sources

WORLD EVOLUTION MODELS
- Poisson random process: exponentially distributed times between changes
- Integrate available data source snapshots to extract the evolution of the world
- Ensemble of parametric models
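
As a rough illustration of the Poisson assumption, the sketch below fits a single homogeneous rate from observed change times and uses it to predict future changes. It is a simplification, not the ensemble of models the slide refers to; the function names and the sample timestamps are hypothetical.

```python
import math


def estimate_rate(event_times: list, horizon: float) -> float:
    """Maximum-likelihood rate of a homogeneous Poisson process:
    number of observed events divided by the observation length."""
    if horizon <= 0:
        raise ValueError("horizon must be positive")
    return len(event_times) / horizon


def expected_events(rate: float, dt: float) -> float:
    """Expected number of changes in an interval of length dt."""
    return rate * dt


def prob_at_least_one_change(rate: float, dt: float) -> float:
    """Probability that at least one change occurs within dt,
    using the exponential waiting-time distribution."""
    return 1.0 - math.exp(-rate * dt)


# Hypothetical change timestamps observed over 10 time points.
changes = [1.0, 3.0, 4.0, 8.0]
lam = estimate_rate(changes, horizon=10.0)

print(lam)
print(expected_events(lam, dt=5.0))
print(prob_at_least_one_change(lam, dt=5.0))
```

The same rate can be fitted separately for appearances, disappearances, and value changes, assuming each is observed in the integrated snapshots.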

SOURCE UPDATE MODELS Shall we consider only the update frequency?

SOURCE UPDATE MODELS High update frequency does not imply high freshness

SOURCE UPDATE MODELS
- Update frequency of the source
- Empirical effectiveness distributions
- Ensemble of non-parametric models
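
One way to read "empirical effectiveness distribution" is as an empirical distribution of capture delays: for each world change, how long the source took to reflect it, if it ever did. The sketch below follows that reading, which is an assumption rather than the talk's definition; the `effectiveness` function and the sample delays are hypothetical.

```python
from bisect import bisect_right


def effectiveness(delays: list, within: float) -> float:
    """Empirical probability that the source reflects a world change
    within `within` time units. A delay of None means the change was
    never captured by the source."""
    if not delays:
        return 0.0
    captured = sorted(d for d in delays if d is not None)
    return bisect_right(captured, within) / len(delays)


# Hypothetical delays (in time points) for eight world changes.
delays = [0, 1, 1, 3, None, 7, None, 2]

print(effectiveness(delays, within=2))   # share captured within 2 time points
print(effectiveness(delays, within=10))  # share captured within 10 time points
```

Because never-captured changes stay in the denominator, a source that updates often but misses many changes still scores low, in line with the point that high update frequency does not imply high freshness.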

SOURCE QUALITY ESTIMATION
- Combine the statistical models over time: estimate NewQuality as a function of OldQuality and ΔQuality
- Coverage(source, time) = Entries(source, time) / Entries(world, time), where the values at the future time are unknown and must be estimated
- Estimating the entries of the world: add the change predicted by the intensity rates λ of the Poisson models to the current Entries
- Estimating the entries of the source: Entries × Pr(Exist) + New Entries × Pr(Exist)
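
The estimate of the source entries can be written as simple arithmetic. The sketch below is a hedged reading of the slide: it assumes one existence probability for entries already in the source and another for newly appearing entries, and every name and number in it is hypothetical.

```python
def expected_source_entries(current_entries: float, p_exist_current: float,
                            new_entries: float, p_exist_new: float) -> float:
    """Expected entries in the source at the future time:
    surviving current entries plus newly captured entries."""
    return current_entries * p_exist_current + new_entries * p_exist_new


def estimated_coverage(expected_in_source: float, expected_in_world: float) -> float:
    """Estimated coverage as a ratio of expected counts, capped at 1."""
    if expected_in_world <= 0:
        return 0.0
    return min(1.0, expected_in_source / expected_in_world)


# Hypothetical inputs: 100 entries now, 90% still present later,
# 20 new world entries of which the source is expected to hold half.
in_source = expected_source_entries(100, 0.9, 20, 0.5)
print(in_source)
print(estimated_coverage(in_source, expected_in_world=125))
```

Using a ratio of expectations is itself an approximation; it is chosen here only to keep the example short.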

SOLVING SOURCE SELECTION Maximize marginal gain

SOLVING SOURCE SELECTION
- Greedy: start with an empty solution and add sources greedily
- No quality guarantees: solutions can be arbitrarily bad
- Highly efficient
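
A minimal sketch of the greedy strategy follows, assuming a toy objective of coverage minus a linear per-source cost. The `COVERS` data, the cost value, and the function names are hypothetical and stand in for the quality and cost models of the talk.

```python
# Hypothetical sources and the world entities (1..6) each one covers.
COVERS = {"s1": {1, 2, 3, 4}, "s2": {3, 4, 5}, "s3": {5, 6}, "s4": {1, 2}}
WORLD_SIZE = 6
COST_PER_SOURCE = 0.1


def profit(selected) -> float:
    """Toy objective: coverage of the selected sources minus a linear cost."""
    covered = set().union(*(COVERS[s] for s in selected))
    return len(covered) / WORLD_SIZE - COST_PER_SOURCE * len(selected)


def greedy_select(sources, objective) -> frozenset:
    """Repeatedly add the source with the largest positive marginal gain."""
    selected = frozenset()
    remaining = set(sources)
    while remaining:
        base = objective(selected)
        # Sorting makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda s: objective(selected | {s}) - base)
        if objective(selected | {best}) - base <= 0:
            break  # no remaining source improves the objective
        selected = selected | {best}
        remaining.remove(best)
    return selected


chosen = greedy_select(COVERS, profit)
print(sorted(chosen), profit(chosen))
```

On this toy instance greedy stops after two sources because every further addition costs more than it gains; on other objectives the same rule can stop at a poor solution, which is the lack of guarantees the slide mentions.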

ARBITRARY OBJECTIVE FUNCTIONS
- GRASP (k, r) [used in Dong et al., '13]
- Local search and randomized hill-climbing
- Run r times and keep the best solution
- Empirically high-quality solutions
- Very expensive
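
The slide names GRASP (k, r) without spelling out its steps. The sketch below is a generic GRASP loop, not the implementation used by the authors. It assumes that k bounds the randomized candidate list during construction and that r is the number of restarts; the toy data, `profit`, and all function names are hypothetical.

```python
import random

# Hypothetical sources and the world entities (1..6) each one covers.
COVERS = {"s1": {1, 2, 3, 4}, "s2": {3, 4, 5}, "s3": {5, 6}, "s4": {1, 2}}


def profit(selected) -> float:
    """Toy objective: coverage of 6 world entities minus 0.1 per source."""
    covered = set().union(*(COVERS[s] for s in selected))
    return len(covered) / 6 - 0.1 * len(selected)


def hill_climb(current, universe, objective, tol=1e-12):
    """Flip single sources in or out while that strictly improves the objective."""
    improved = True
    while improved:
        improved = False
        for s in universe:
            neighbor = current - {s} if s in current else current | {s}
            if objective(neighbor) > objective(current) + tol:
                current, improved = neighbor, True
                break
    return current


def grasp(sources, objective, k, r, seed=0):
    """Run r randomized constructions, each followed by local search."""
    rng = random.Random(seed)
    universe = sorted(sources)
    best = frozenset()
    for _ in range(r):
        current = frozenset()
        while True:
            base = objective(current)
            gains = [(objective(current | {s}) - base, s)
                     for s in universe if s not in current]
            # Restricted candidate list: the k best strictly improving sources.
            candidates = sorted((g for g in gains if g[0] > 0), reverse=True)[:k]
            if not candidates:
                break
            _, pick = rng.choice(candidates)
            current = current | {pick}
        current = hill_climb(current, universe, objective)
        if objective(current) > objective(best):
            best = current
    return best


result = grasp(COVERS, profit, k=2, r=20)
print(sorted(result), profit(result))
```

Each restart re-evaluates the objective many times, which is consistent with the slide's remark that the method is very expensive.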

INSIGHTS FOR QUALITY GUARANTEES
- A large family of benefit functions is monotone submodular (e.g., functions of coverage)
- Under a linear cost function, the marginal gain is submodular
- Submodularity (diminishing returns): for $A \subseteq B$ and $x \notin B$, $f(A \cup \{x\}) - f(A) \ge f(B \cup \{x\}) - f(B)$
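
The diminishing-returns property can be checked by brute force on a small ground set. The sketch below does that for a toy coverage function and for a deliberately non-submodular one; the data and function names are hypothetical, and the exhaustive check is only practical for a handful of sources.

```python
from itertools import combinations

# Hypothetical sources and the entities each one covers.
COVERS = {"s1": {1, 2, 3}, "s2": {3, 4}, "s3": {4, 5, 6}}


def covered_count(selected) -> int:
    """Number of distinct entities covered by the selected sources."""
    return len(set().union(*(COVERS[s] for s in selected)))


def powerset(items):
    """Yield every subset of items as a frozenset."""
    items = sorted(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def is_submodular(ground, f, tol=1e-12) -> bool:
    """Check f(A + x) - f(A) >= f(B + x) - f(B) for all A within B, x outside B."""
    ground = frozenset(ground)
    for b in powerset(ground):
        for a in powerset(b):
            for x in ground - b:
                gain_a = f(a | {x}) - f(a)
                gain_b = f(b | {x}) - f(b)
                if gain_a < gain_b - tol:
                    return False
    return True


print(is_submodular(COVERS, covered_count))           # coverage: diminishing returns
print(is_submodular(COVERS, lambda s: len(s) ** 2))   # squared size: increasing returns
```

Subtracting a linear cost from a submodular benefit keeps the property, because the cost contributes the same amount to both sides of the inequality.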

SUBMODULAR OBJECTIVE FUNCTIONS
- Submodular Maximization (MaxSub)
- Start by selecting the best source
- Explore the local neighborhood: add/delete sources
- Either the selected set or its complement is a local optimum
- Constant-factor approximation [Feige, '11]
- Highly efficient
- Empirically high-quality even for non-submodular functions
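
The steps listed on this slide can be sketched as a local search in the spirit of the cited approximation result. This is an illustrative reading, not the authors' code: the improvement threshold `eps / n**2`, the toy objective, and all names are assumptions, and the approximation guarantee in the literature applies to non-negative submodular functions.

```python
# Hypothetical sources and the world entities (1..6) each one covers.
COVERS = {"s1": {1, 2, 3, 4}, "s2": {3, 4, 5}, "s3": {5, 6}, "s4": {1, 2}}


def profit(selected) -> float:
    """Toy objective: coverage of 6 world entities minus 0.1 per source."""
    covered = set().union(*(COVERS[s] for s in selected))
    return len(covered) / 6 - 0.1 * len(selected)


def max_sub(sources, f, eps=0.1) -> frozenset:
    """Local search: start from the best single source, then add or delete
    one source at a time while the objective improves noticeably."""
    universe = frozenset(sources)
    n = len(universe)
    if n == 0:
        return frozenset()
    step = eps / (n * n)  # assumed relative improvement threshold
    current = frozenset({max(sorted(universe), key=lambda s: f(frozenset({s})))})
    improved = True
    while improved:
        improved = False
        value = f(current)
        moves = [current | {s} for s in sorted(universe - current)]
        moves += [current - {s} for s in sorted(current)]
        for neighbor in moves:  # additions are tried before deletions
            if f(neighbor) - value > step * abs(value):
                current, improved = neighbor, True
                break
    # Return the better of the local optimum and its complement.
    complement = universe - current
    return current if f(current) >= f(complement) else complement


solution = max_sub(COVERS, profit)
print(sorted(solution), profit(solution))
```

Each move needs only one evaluation of the objective per candidate source, which is one plausible reason the slide calls the approach highly efficient compared with repeated randomized restarts.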

SELECTED EXPERIMENTS
- Business listings (BL): ~40 sources over 2 years, ~1,400 categories, 51 locations
- World-wide event listings: 15,275 sources over 1 month, 236 event types, 242 locations

WORLD CHANGE ESTIMATION
- Small relative error even with little training data
- Expected increasing trend over time

SOURCE CHANGE ESTIMATION Small relative error for source quality

SELECTION QUALITY
best = percentage of times finding the best solution; diff. = average (worst) quality loss

BENEFIT  METRIC  MSR.   GREEDY         MAXSUB         GRASP
LINEAR   cov.    best   16.7%          50%            100% (5,20)
LINEAR   cov.    diff.  .005 (.01)%    .001 (.007)%   -
LINEAR   acc.    best   0.0%           33.3%          83.3% (2,100)
LINEAR   acc.    diff.  9.5 (53.7)%    .39 (2.31)%    8.9% (53.7)%
STEP     cov.    best   50.0%          66.7%          83.3% (10,100)
STEP     cov.    diff.  7.45 (27.8)%   .012 (.06)%    .7 (4.2)%
STEP     acc.    best   50%            66.7%          83.3% (5,100)
STEP     acc.    diff.  6 (23.98)%     1.76 (10.6)%   3.99 (23.98)%

- Grasp finds the best solution most of the time
- MaxSub solutions are mostly comparable to Grasp
- But there are cases when Grasp is significantly worse

INCREASING NUMBER OF SOURCES MaxSub is one to two orders of magnitude faster

SELECTION CHARACTERISTICS Optimizing for accuracy selects fewer, more focused sources

CONCLUSIONS
- Source selection before data integration to increase quality and reduce cost
- Collection of statistical models to describe the evolution of the world and the updates of sources
- Exploiting submodularity gives more efficient solutions with rigorous guarantees
Thank you!