Online Data Fusion School of Computing National University of Singapore AT&T Shannon Research Labs Xuan Liu, Xin Luna Dong, Beng Chin Ooi, Divesh Srivastava.

Slides:



Advertisements
Similar presentations
Xin Luna Dong (AT&T Labs  Google Inc.) Barna Saha, Divesh Srivastava (AT&T Labs-Research) VLDB’2013.
Advertisements

Xin Luna Dong (Google Inc.) Divesh Srivastava (AT&T 5/2013.
Hybrid Context Inconsistency Resolution for Context-aware Services
Xin Luna Dong AT&T Labs-Research Joint work w. Laure Berti-Equille, Yifan Hu, Divesh
Laure Berti (Universite de Rennes 1), Anish Das Sarma (Stanford), Xin Luna Dong (AT&T), Amelie Marian (Rutgers), Divesh Srivastava (AT&T)
Probabilistic Skyline Operator over Sliding Windows Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Xuemin Lin, Ying Zhang, Wei.
Best-Effort Top-k Query Processing Under Budgetary Constraints
Comparison of parallel and random approach to a candidate list in the multifeature querying Peter Gurský Institute of Computer Science UPJŠ, Košice, Slovakia.
TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets Chun Chen 1, Feng Li 2, Beng Chin Ooi 2, and Sai Wu 2 1 Zhejiang University, 2 National.
Fusion in web data extraction
Entity Profiling with Varying Source Reliabilities
Online Data Fusion School of Computing National University of Singapore AT&T Shannon Research Labs Xuan Liu, Xin Luna Dong, Beng Chin Ooi, Divesh Srivastava.
Effectively Indexing Uncertain Moving Objects for Predictive Queries School of Computing National University of Singapore Department of Computer Science.
Active Learning and Collaborative Filtering
A Generic Framework for Handling Uncertain Data with Local Correlations Xiang Lian and Lei Chen Department of Computer Science and Engineering The Hong.
Algorithmic Complexity Nelson Padua-Perez Bill Pugh Department of Computer Science University of Maryland, College Park.
On Computing Compression Trees for Data Collection in Wireless Sensor Networks Jian Li, Amol Deshpande and Samir Khuller Department of Computer Science,
Subscription Subsumption Evaluation for Content-Based Publish/Subscribe Systems Hojjat Jafarpour, Bijit Hore, Sharad Mehrotra, and Nalini Venkatasubramanian.
Clustering… in General In vector space, clusters are vectors found within  of a cluster vector, with different techniques for determining the cluster.
Ph.D. DefenceUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:
Circumventing Data Quality Problems Using Multiple Join Paths Yannis Kotidis, Athens University of Economics and Business Amélie Marian, Rutgers University.
Exploiting Correlated Attributes in Acquisitional Query Processing Amol Deshpande University of Maryland Joint work with Carlos Sam
An Incremental Refining Spatial Join Algorithm for Estimating Query Results in GIS Wan D. Bae, Shayma Alkobaisi, Scott T. Leutenegger Department of Computer.
Comparing path-based and vertically-partitioned RDF databases Preetha Lakshmi & Chris Mueller 12/10/2007 CSCI 8715 Shashi Shekhar.
1 Ensembles of Nearest Neighbor Forecasts Dragomir Yankov, Eamonn Keogh Dept. of Computer Science & Eng. University of California Riverside Dennis DeCoste.
Affinity Rank Yi Liu, Benyu Zhang, Zheng Chen MSRA.
A Local Facility Location Algorithm Supervisor: Assaf Schuster Denis Krivitski Technion – Israel Institute of Technology.
Mariam Salloum (YP.com) Xin Luna Dong (Google) Divesh Srivastava (AT&T Research) Vassilis J. Tsotras (UC Riverside) 1 Online Ordering of Overlapping Data.
1 Software Testing and Quality Assurance Lecture 5 - Software Testing Techniques.
Large-Scale Cost-sensitive Online Social Network Profile Linkage.
Quality-aware Collaborative Question Answering: Methods and Evaluation Maggy Anastasia Suryanto, Ee-Peng Lim Singapore Management University Aixin Sun.
1 Efficiently Learning the Accuracy of Labeling Sources for Selective Sampling by Pinar Donmez, Jaime Carbonell, Jeff Schneider School of Computer Science,
Cluster based fact finders Manish Gupta, Yizhou Sun, Jiawei Han Feb 10, 2011.
Predicting and Bypassing End-to-End Internet Service Degradation Anat Bremler-BarrEdith CohenHaim KaplanYishay Mansour Tel-Aviv UniversityAT&T Labs Tel-Aviv.
A Comparative Study of Search Result Diversification Methods Wei Zheng and Hui Fang University of Delaware, Newark DE 19716, USA
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 On-line Learning of Sequence Data Based on Self-Organizing.
Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.
Trust-Aware Optimal Crowdsourcing With Budget Constraint Xiangyang Liu 1, He He 2, and John S. Baras 1 1 Institute for Systems Research and Department.
Searching for Extremes Among Distributed Data Sources with Optimal Probing Zhenyu (Victor) Liu Computer Science Department, UCLA.
Additional Problems.
Qingqing Gan Torsten Suel CSE Department Polytechnic Institute of NYU Improved Techniques for Result Caching in Web Search Engines.
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
Stefan Mutter, Mark Hall, Eibe Frank University of Freiburg, Germany University of Waikato, New Zealand The 17th Australian Joint Conference on Artificial.
Energy-Efficient Monitoring of Extreme Values in Sensor Networks Loo, Kin Kong 10 May, 2007.
Leonardo Guerreiro Azevedo Geraldo Zimbrão Jano Moreira de Souza Approximate Query Processing in Spatial Databases Using Raster Signatures Federal University.
Characterizing the Uncertainty of Web Data: Models and Experiences Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Università degli Studi.
Efficient Processing of Top-k Spatial Preference Queries
Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.
1 Efficient Algorithms for Incremental Update of Frequent Sequences Minghua ZHANG Dec. 7, 2001.
On Computing Top-t Influential Spatial Sites Authors: T. Xia, D. Zhang, E. Kanoulas, Y.Du Northeastern University, USA Appeared in: VLDB 2005 Presenter:
9/2/2005VLDB 2005, Trondheim, Norway1 On Computing Top-t Most Influential Spatial Sites Tian Xia, Donghui Zhang, Evangelos Kanoulas, Yang Du Northeastern.
QED: A Novel Quaternary Encoding to Completely Avoid Re-labeling in XML Updates Changqing Li,Tok Wang Ling.
Neural Network Implementation of Poker AI
Joseph M. Hellerstein Peter J. Haas Helen J. Wang Presented by: Calvin R Noronha ( ) Deepak Anand ( ) By:
Answering Top-k Queries Using Views Gautam Das (Univ. of Texas), Dimitrios Gunopulos (Univ. of California Riverside), Nick Koudas (Univ. of Toronto), Dimitris.
11 A Classification-based Approach to Question Routing in Community Question Answering Tom Chao Zhou 1, Michael R. Lyu 1, Irwin King 1,2 1 The Chinese.
UNIT: User-ceNtrIc Transaction Management in Web-Database Systems Huiming Qu, Alexandros Labrinidis, Daniel Mosse Advanced Data Management Technologies.
1 Chapter 8: Model Inference and Averaging Presented by Hui Fang.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Learning to Rank: From Pairwise Approach to Listwise Approach Authors: Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li Presenter: Davidson Date:
On-Line Algorithms in Machine Learning By: WALEED ABDULWAHAB YAHYA AL-GOBI MUHAMMAD BURHAN HAFEZ KIM HYEONGCHEOL HE RUIDAN SHANG XINDI.
Computer Science and Engineering Jianye Yang 1, Ying Zhang 2, Wenjie Zhang 1, Xuemin Lin 1 Influence based Cost Optimization on User Preference 1 The University.
Slides from Luna Dong’s VLDB Tutorials
Tian Xia and Donghui Zhang Northeastern University
Staging User Feedback toward Rapid Conflict Resolution in Data Fusion
Data Integration with Dependent Sources
Sequential Data Cleaning: A Statistical Approach
Jonathan Elsas LTI Student Research Symposium Sept. 14, 2007
Efficient Processing of Top-k Spatial Preference Queries
Presentation transcript:

Online Data Fusion School of Computing National University of Singapore AT&T Shannon Research Labs Xuan Liu, Xin Luna Dong, Beng Chin Ooi, Divesh Srivastava

Conflicting Data on the Web What’s the temperature and humidity of Seattle?

Solution 1: Choose from One Source What’s the status of flight CO 1581? –Result of Google

Solution 2: List All Values What’s the length of Mississippi River? –Results on the National Park Service website

Solution 3: Best Guess on the True Value What’s the capital of Washington state? Google

Copying Between Sources finance.boston.com finance.bostonmerchant.com financial.businessinsider.com markets.chron.com finance.abc7.com

Data Fusion Resolving conflicts –Where is AT&T Shannon Research Labs? –9 sources provide 3 different answers: NY, NJ, TX –Answer: NJ Accuracy Copying

Motivation Problem: offline –Inappropriate for web-scale data and frequent updates –Long waiting time if applied online Our proposal : Online Data Fusion –

Online Data Fusion

Advantages of Return answers to users while probing sources, no waiting Provide the likelihood of the correctness of the answers to the users Terminate as early as possible once the system gains enough confidence

Framework Source ordering Truth finding Probability computation Offline Online Fusion Queries Probing order Source probing Result output Terminated? Y N Q1: Incremental vote counting Q2: Compute probabilities Q3: Termination justification Q4: Ordering Sources

Outline Motivation & framework Preliminaries of Online Data Fusion Techniques Experimental results Conclusions

Problem Input S O1O1 O2O2 O3O3 OnOn … …

Problem Output

Preliminaries on Data Fusion * Dong et al., VLDB 2009

Example of Data Fusion

Outline Motivation & framework Preliminaries of Online Data Fusion Technology –Independent sources –Dependent sources Experimental results Conclusions

Probability Computation

Example of Independent Sources RoundTXNJNYResult 1500TX RoundTXNJNYResult 1500TX 2550 RoundTXNJNYResult 1500TX NJ RoundTXNJNYResult 1500TX NJ 49100NJ RoundTXNJNYResult 1500TX NJ 49100NJ 59104NJ RoundTXNJNYResult 1500TX NJ 49100NJ 59104NJ 69144NJ RoundTXNJNYResult 1500TX NJ 49100NJ 59104NJ 69144NJ NJ RoundTXNJNYResult 1500TX NJ 49100NJ 59104NJ 69144NJ NJ TX RoundTXNJNYResult 1500TX NJ 49100NJ 59104NJ 69144NJ NJ TX TX Order S9S9 S5S5 S3S3 S8S8 S6S6 S2S2 S7S7 S4S4 S1S1 Order Sources by accuracy Terminate: min(v 1 )>exp(v 2 )Terminate: min(v 1 )>max(v 2 )

Outline Motivation & Framework Preliminaries of Online Data Fusion Technology –Independent sources –Dependent sources Experimental results Conclusions

Challenges and Solutions Challenge: Independent vote count or dependent vote count? –When a copier is probed earlier than the copied source, we do not know whether they provide the same value No-over-counting principle –For each value, among its providers that could have copying relationships on it, at any time we apply the independent vote count for at most one source

1. Incremental Vote Counting - Conservative Before probing the copied source –Assumes the copier provides the same value as the copied source –Use dependent vote count After probing the copied source –If observe a different value from the copier Dependent vote count -> Independent vote count for the copier Features –Pro: monotonic increase of vote counts –Con: may under-counting

1. Incremental Vote Counting - Pragmatic Before probing the copied source –Assumes the copier provides a different value from the copied source –Use independent vote count After probing the copied source –If observe a same value as the copier Independent vote count -> Dependent vote count for the copier Features –Pro: no under-counting or over-counting –Con: vote counts can decrease after seeing more sources

Example of Two Voting Methods Assume probing order: S3, S2, S1 Ind: 3 Dep: 3 Ind: 4 Dep:.8 Ind: 5 Dep: 1

2. Probability Computation

3. Source Ordering Worst case assumption –All sources are assumed to provide the same value Pragmatic ordering –Iteratively choose the source that increases the total vote count most –Co-copier Condition: order the copied source before ordering both co-copiers

Example of Source Ordering Condition vote count in each round of computing

Outline Motivation & Framework Preliminaries of Online Data Fusion Technology Experimental results Conclusions

Experiment Settings Dataset: Abebooks data –894 bookstores (data sources) –1263 books (objects) –24364 listings –1758 pair of copyings Queries and measures –Query author by ISBN –Golden standard: the authors of 100 randomly selected books (manually checked from the book cover) –Measure precision by the percentage of correctly returned author lists

Output by Pragmatic A large fraction of answers get stable quickly The number of terminated answers grows much slower

Comparison of Different Algorithms Implementations 1.NAÏVE: probe all sources in a random order and repeatedly apply fusion from scratch on probed sources. 2.ACCU: use accuracy only. 3.CONSERVATIVE: use conservative ordering and vote counting 4.PRAGMATIC: use pragmatic ordering and vote counting

Stable Correct Values Pragmatic performs best Pragmatic dominates Conservative Pragmatic provide more correct values than Accu Naïve performs worst

Precision of Different Methods Pragmatic has the highest precision Accu ignores copying Conservative may terminate with incorrect values early

Scalability Pragmatic is the fastest on each data set Probing all sources before returning an answer can take a long time Vote counting from scratch in each iteration takes a long CPU time Number of sources:

Related work Online aggregation –[Hellerstein et al. 97] Data fusion – resolving conflicts –[Blanco et al. 10] [Dong et al. 09] [Galland et al. 10] [Wu et al. 11] [Yin et al. 08] Quality-aware query answering –[Mihaila et al. 00] [Naumann et al. 02] [Sarma et al. 11] [Suryanto et al. 09] [Yeganeh et al. 09]

Conclusions The first online data fusion system Address challenges in building an online data fusion system –incremental vote counting –computing probabilities –termination justification –source ordering

Thanks! Q & A

Observations of output probabilities by PRAGMATIC

Fusion CPU time

Comparison of different source ordering strategies -precision

Comparison of different source ordering strategies - #probed sources

Comparison of different source ordering strategies – fusion time

Comparison of different vote counting strategies -precision

Comparison of different vote counting strategies - #probed sources

Comparison of different vote counting strategies – fusion time

Comparison of different termination conditions - precision

Comparison of different termination conditions - #probed sources

Comparison of different termination conditions – fusion time

Coverage vs. accuracy

Query-answering time

Fusion time