Xin Luna Dong, Laure Berti-Equille, Divesh Srivastava (AT&T Labs-Research)

The WWW is Great

A Lot of Information on the Web!

Information Can Be Erroneous (7/2009)

Information Can Be Out-Of-Date (7/2009)

Information Can Be Ahead-Of-Time The story, marked “Hold for release – Do not use”, was sent in error to the news service’s thousands of corporate clients.

False Information Can Be Propagated (I) Maurice Jarre (1924-2009), French conductor and composer. “One could say my life itself has been one long soundtrack. Music was my life, music brought me to life, and music is how I will be remembered long after I leave this life. When I die there will be a final waltz playing in my head and that only I can hear.” -- 2:29, 30 March 2009

False Information Can Be Propagated (II) UA’s bankruptcy: a Chicago Tribune story from 2002 resurfaced via Sun-Sentinel.com, Google News, and Bloomberg.com; the UAL stock plummeted from $12.5 to $3.

Wrong information can be worse than lack of information. The Internet needs a way to help people separate rumor from real science. – Tim Berners-Lee

Why is the Problem Hard? "Facts and truth really don’t have much to do with each other." — William Faulkner

            S1      S2        S3
Stonebraker MIT     Berkeley  MIT
Dewitt      MSR     MSR       UWisc
Bernstein   MSR     MSR       MSR
Carey       UCI     AT&T      BEA
Halevy      Google  Google    UW

With three independent sources, naïve voting works.

Why is the Problem Hard? "A lie told often enough becomes the truth." — Vladimir Lenin

            S1      S2        S3     S4     S5
Stonebraker MIT     Berkeley  MIT    MIT    MS
Dewitt      MSR     MSR       UWisc  UWisc  UWisc
Bernstein   MSR     MSR       MSR    MSR    MSR
Carey       UCI     AT&T      BEA    BEA    BEA
Halevy      Google  Google    UW     UW     UW

Naïve voting works only if data sources are independent: once S4 and S5 copy from S3, the copied wrong values (e.g., BEA for Carey, UW for Halevy) win the vote.

Goal: Discovery of Truth and Dependence
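
To make the failure concrete, here is a minimal Python sketch of naïve voting over the reconstructed table above (the dictionary layout and names are illustrative, not from the talk):

```python
from collections import Counter

# Reconstructed claims from the five-source example; S4 and S5 largely copy S3.
claims = {
    "Carey":  {"S1": "UCI",    "S2": "AT&T",   "S3": "BEA", "S4": "BEA", "S5": "BEA"},
    "Halevy": {"S1": "Google", "S2": "Google", "S3": "UW",  "S4": "UW",  "S5": "UW"},
}

def naive_vote(values_by_source):
    """Return the value claimed by the most sources (ties broken arbitrarily)."""
    return Counter(values_by_source.values()).most_common(1)[0][0]

for obj, values in claims.items():
    print(obj, "->", naive_vote(values))
# Carey -> BEA, Halevy -> UW: the copied wrong values outvote the truth.
```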

Challenges in Dependence Discovery (same five-source table as above)
1. Sharing common data does not in itself imply copying.
2. With only a snapshot, it is hard to decide which source is the copier.
3. A copier can also provide or verify some data by itself, so it is inappropriate to ignore all of its data.

Intuitions for Dependence Detection
Intuition I (decides dependence, without direction): sources S1 and S2 are likely to be dependent if they share a lot of false values.

Dependence?
Source 1 on USA Presidents: 1st: George Washington, 2nd: John Adams, 3rd: Thomas Jefferson, 4th: James Madison, ..., 41st: George H.W. Bush, 42nd: William J. Clinton, 43rd: George W. Bush, 44th: Barack Obama
Source 2 on USA Presidents: the identical list
Are Source 1 and Source 2 dependent? Not necessarily: everything they share is true.

Dependence? -- Common Errors
Source 1 on USA Presidents: 1st: George Washington, 2nd: Benjamin Franklin, 3rd: John F. Kennedy, 4th: Abraham Lincoln, ..., 41st: George W. Bush, 42nd: Hillary Clinton, 43rd: Dick Cheney, 44th: Barack Obama
Source 2 on USA Presidents: the same list, except 44th: John McCain
Are Source 1 and Source 2 dependent? Very likely: they share many false values.

Intuitions for Dependence Detection
Intuition I (decides dependence, without direction): sources S1 and S2 are likely to be dependent if they share a lot of false values.
Intuition II (decides the copying direction): source S1 is likely to copy from S2 if the accuracy of the common data is very different from the overall accuracy of S1.

Dependence? -- Different Accuracy
Source 1 on USA Presidents: 1st: George Washington, 2nd: John Adams, 3rd: Thomas Jefferson, 4th: Abraham Lincoln, ..., 41st: George W. Bush, 42nd: Hillary Clinton, 43rd: George W. Bush, 44th: John McCain
Source 2 on USA Presidents: 1st: George Washington, 2nd: Benjamin Franklin, 3rd: John F. Kennedy, 4th: Abraham Lincoln, ..., 41st: George W. Bush, 42nd: Hillary Clinton, 43rd: Dick Cheney, 44th: John McCain
Are Source 1 and Source 2 dependent? S1 is more likely to be the copier: the values it shares with S2 are far less accurate than the rest of its data.

Outline
- Motivation and intuitions for the solution
- For a static world [VLDB’09]: techniques; experimental results
- For a dynamic world [VLDB’09]: techniques; experimental results

Problem Definition
INPUT:
- Objects: an aspect of a real-world entity (e.g., the director of a movie, the author list of a book), each associated with one true value.
- Sources: each provides values for some of the objects.
OUTPUT: the true value for each object.

Source Dependence
- Source dependence: two sources S and T derive the same part of their data, directly or transitively, from a common source (which can be S or T itself).
- An independent source provides all of its values independently.
- A copier copies part (or all) of its data from other sources; it may verify or revise some of the copied values and may add values of its own.
- Assumptions: independent values, independent copying, no loop copying.

Models for a Static World
Core-case conditions:
1. Same source accuracy
2. Uniform false-value distribution
3. Categorical values
Proposition: with independent “good” sources, naïve voting selects the values with the highest probability of being true.
Models: Depen (considers dependence); Accu (removes Cond 1); NonUni (removes Cond 2); Sim (removes Cond 3); AccuPR (considers value probabilities in the dependence analysis).

I. Dependence Detection
Intuition I. If two sources share many true values, they are not necessarily dependent. (Diagram: the values of S1 and S2, partitioned into different values and same values, with the shared values split into TRUE and FALSE.)

I. Dependence Detection
Intuition I (cont.). If two sources share many false values, they are more likely to be dependent.

Bayesian Analysis -- Basics
Partition the objects covered by both sources into Od (different values), Ot (same true value), and Of (same false value); write S1⊥S2 for independence and S1~S2 for dependence.
- Observation: Φ.
- Goal: Pr(S1~S2 | Φ) and Pr(S1⊥S2 | Φ), which sum to 1.
- By Bayes' rule, we need Pr(Φ | S1~S2) and Pr(Φ | S1⊥S2).
- Key: computing Pr(Φ(O) | S1~S2) and Pr(Φ(O) | S1⊥S2) for each O ∈ S1 ∩ S2.

Bayesian Analysis -- Probabilities
Per-object probabilities (ε: error rate; n: number of wrong values; c: copy rate):

              Independence          Dependence
O ∈ Ot        (1-ε)²                c(1-ε) + (1-c)(1-ε)²
O ∈ Of        ε²/n                  cε + (1-c)ε²/n
O ∈ Od        1-(1-ε)²-ε²/n         (1-c)(1-(1-ε)²-ε²/n)

A shared false value is far more likely under dependence: cε + (1-c)ε²/n >> ε²/n.
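
Written out, the update is a likelihood-ratio computation over the three object sets. Below is a minimal Python sketch of my reading of the table, with an assumed prior beta = Pr(S1~S2); it is not code from the paper:

```python
import math

def pr_dependence(kt, kf, kd, eps=0.2, n=100, c=0.8, beta=0.5):
    """Pr(S1~S2 | Phi) given counts of shared true values (kt), shared
    false values (kf), and differing values (kd), per the table above."""
    pt_i = (1 - eps) ** 2            # independence: same true value
    pf_i = eps ** 2 / n              # independence: same false value
    pd_i = 1 - pt_i - pf_i           # independence: different values
    pt_d = c * (1 - eps) + (1 - c) * pt_i   # dependence, copy rate c
    pf_d = c * eps + (1 - c) * pf_i
    pd_d = (1 - c) * pd_i
    # Log space avoids underflow when sources share thousands of objects.
    log_ind = kt * math.log(pt_i) + kf * math.log(pf_i) + kd * math.log(pd_i)
    log_dep = kt * math.log(pt_d) + kf * math.log(pf_d) + kd * math.log(pd_d)
    log_odds = math.log(beta / (1 - beta)) + log_dep - log_ind
    return 1 / (1 + math.exp(-log_odds))

# Shared false values are far stronger evidence than shared true ones:
print(pr_dependence(kt=5, kf=0, kd=5))  # ~0.001: probably independent
print(pr_dependence(kt=0, kf=5, kd=5))  # ~1.0: almost surely dependent
```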

II. Finding the True Value
Example: 10 sources (S1-S10) voting for an object. A copier's vote is discounted by the probability that its value was copied: with dependence probability .4 and copy rate .8, a direct copier's vote counts 1 - .4*.8 = .68 and a transitive copier's counts .68², so a chain of three sources contributes 1 + .68 + .68² ≈ 2.14 votes, versus 2 for two independent sources. The order in which sources are considered matters; see the paper.
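
The discount in the example can be reproduced in a few lines: each source's vote is weighted by the probability that its value was not copied, compounded along the copying chain (.4 and .8 are the dependence probability and copy rate from the slide). This is a simplification of the paper's vote counting, which also fixes an order over the sources:

```python
def discounted_count(depends_on, pr_dep=0.4, copy_rate=0.8):
    """Vote count for one value. depends_on[s] is the source s copies from,
    or None if s is independent."""
    def weight(s, seen=()):
        upstream = depends_on.get(s)
        if upstream is None or s in seen:
            return 1.0
        # The vote survives only if s did not copy this value, and the
        # discount compounds along transitive copying chains.
        return (1 - pr_dep * copy_rate) * weight(upstream, seen + (s,))
    return sum(weight(s) for s in depends_on)

# S2 copies S1 and S3 copies S2: 1 + .68 + .68^2 = 2.14 (vs. 3 if independent)
print(round(discounted_count({"S1": None, "S2": "S1", "S3": "S2"}), 2))
```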

Models in This Paper
Core-case conditions as above (1. same source accuracy; 2. uniform false-value distribution; 3. categorical values), relaxed by the models Depen, Accu, NonUni, Sim, and AccuPR.

III. Considering Source Accuracy
Intuition II. S1 is more likely to copy from S2 if the accuracy of the common data is very different from the overall accuracy of S1. (Table: the per-object probabilities for Ot, Of, and Od under independence vs. dependence, as above.)

III. Considering Source Accuracy (cont.)
With per-source error rates ε1 and ε2, the per-object probabilities for Ot, Of, and Od differ between "S1 copies S2" and "S2 copies S1"; e.g., a shared true value has probability c(1-ε2) + (1-c)(1-ε1)(1-ε2) if S1 copies S2, but c(1-ε1) + (1-c)(1-ε1)(1-ε2) if S2 copies S1. Unless ε1 = ε2, these differ, so the direction of copying can be inferred.

Source Accuracy
A source's accuracy is estimated as the average probability that its values are true, and dependence is taken into account when those probabilities are computed.

IV. Combining Accuracy and Dependence
Iterate three steps until convergence: Step 1, dependence detection; Step 2, truth discovery; Step 3, source-accuracy computation.
- Theorem: without the accuracy component, the iteration converges.
- Observation: with accuracy, it converges when #objects >> #sources.
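
A runnable sketch of the loop's shape, with the dependence step left out for brevity (a full Step 1 would plug in the Bayesian detector above); all names are illustrative:

```python
from collections import defaultdict

def iterate_truth_accuracy(claims, rounds=10):
    """claims[obj][src] = claimed value. Alternates truth discovery
    (accuracy-weighted voting) with source-accuracy recomputation;
    the dependence-detection step (Step 1) is omitted here."""
    sources = {s for vals in claims.values() for s in vals}
    accuracy = {s: 0.8 for s in sources}   # uniform starting accuracy
    truths = {}
    for _ in range(rounds):
        # Step 2: pick, per object, the value with the highest weighted vote.
        for obj, vals in claims.items():
            scores = defaultdict(float)
            for s, v in vals.items():
                scores[v] += accuracy[s]
            truths[obj] = max(scores, key=scores.get)
        # Step 3: a source's accuracy = fraction of its claims judged true.
        for s in sources:
            mine = [(o, v) for o, vals in claims.items()
                    for src, v in vals.items() if src == s]
            accuracy[s] = sum(v == truths[o] for o, v in mine) / max(len(mine), 1)
    return truths, accuracy

truths, acc = iterate_truth_accuracy(
    {"Carey": {"S1": "UCI", "S2": "AT&T", "S3": "BEA"},
     "Halevy": {"S1": "Google", "S2": "Google", "S3": "UW"}})
print(truths)
```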

The Motivating Example
On the five-source table above, the detected copying relationships among S1-S5 are refined from round to round (Rnd 2, Rnd 3, ..., Rnd 11) until they stabilize.

Experimental Setup
- Dataset: AbeBooks; 877 bookstores, 1263 CS books; listings with ISBN and author list.
- After pre-cleaning, each book has on average 19 listings and 4 distinct author lists (ranging from 1 to 23).
- Gold standard: 100 random books, with author lists manually checked against the book covers.
- Measure: precision = #(correct author lists) / #(all lists).
- Parameters: c = .8, ε = .2, n = 100; varying these did not change the results much.
- Environment: Windows XP, 2 GHz CPU, 960 MB memory.

Naïve Voting and Types of Errors
Naïve voting has precision .71.

Contributions of Various Components
Methods compared (per-method precision, #rounds, and time in seconds not shown): Naïve; only value similarity; only source accuracy; only source dependence; Depen + Accu; Depen + Accu + Sim.
- Precision improves by 25.4% over Naïve.
- Considering dependence improves the results the most.
- The methods are reasonably fast.

Discovered Dependence
2916 bookstore pairs provide data on at least the same 10 books; 508 pairs are likely to be dependent.
The bookstores with the most copiers include Caiman, MildredsBooks, COBU GmbH & Co. KG, THESAINTBOOKSTORE, Limelight Bookshop, Revaluation Books, Players Quest, AshleyJohnson, Powell’s Books, and AlphaCraze.com (per-store #copiers, #books, and accuracy not shown).
- Across all bookstores, each provides 28 books on average, conforming to the intuition that small bookstores are more likely to copy from large ones.
- The accuracy of the copied stores is not very high; applying Naïve voting on their data obtains a precision of only .58.

Outline
- Motivation and intuitions for the solution
- For a static world [VLDB’09]: techniques; experimental results (above)
- For a dynamic world [VLDB’09]: techniques; experimental results (next)

Challenges for a Dynamic World

            S1      S2      S3    S4    S5
Stonebraker MIT     UCB     MIT   MIT   MS
Dewitt      MSR     MSR     Wisc  Wisc  Wisc
Bernstein   MSR     MSR     MSR   MSR   MSR
Carey       UCI     AT&T    BEA   BEA   BEA
Halevy      Google  Google  UW    UW    UW

Challenges for a Dynamic World
1. True values can evolve over time.
2. Low-quality data can be caused by different reasons: erroneous values (ERR!), out-of-date values, and slow updates (SLOW!).
Update histories (true lifespan first; per-source attribution lost):
- Stonebraker: truth (Ѳ, UCB), (02, MIT); updates (03, MIT), (00, UCB), (01, UCB), (06, MIT), (05, MIT), (03, UCB), (05, MS).
- Dewitt: truth (Ѳ, Wisc), (08, MSR); updates (00, Wisc), (09, MSR), (00, UW), (01, Wisc), (08, MSR), (01, UW), (02, Wisc), (05, Wisc), (03, UW), (05, ∅), (07, Wisc).
- Bernstein: truth (Ѳ, MSR); updates (00, MSR), (01, MSR), (07, MSR), (03, MSR).
- Carey: truth (Ѳ, Propell), (02, BEA), (08, UCI); updates (04, BEA), (09, UCI), (05, AT&T), (06, BEA), (07, BEA).
- Halevy: truth (Ѳ, UW), (05, Google); updates (00, UW), (07, Google), (00, Wisc), (02, UW), (05, Google), (01, Wisc), (06, UW), (05, UW), (03, Wisc), (05, Google), (07, UW).

Problem Definition (static vs. dynamic world)
- Objects: static -- each associated with a value (e.g., Google for Halevy); dynamic -- each associated with a lifespan (e.g., (00, UW), (05, Google) for Halevy).
- Sources: static -- each can provide a value for an object (e.g., S1 providing Google); dynamic -- each can have a list of updates for an object (e.g., S1’s updates for Halevy: (00, UW), (07, Google)).
- OUTPUT: static -- the true value for each object; dynamic -- (1) lifespans: the true value for each object at each time point; (2) copying: the probability that S1 is a copier of S2, and the probability that S1 is actively copying, at each time point.

Contributions I. Quality measures of data sources II. Dependence detection (HMM model) III. Lifespan discovery (Bayesian model) IV. Considering delayed publishing

I. Quality of Data Sources
Three orthogonal quality measures (the CEF-measure):
- Coverage: how many transitions are captured. Coverage = #Captured / #Capturable (e.g., 1/4 = .25).
- Exactness: how many transitions are not mis-captured. Exactness = 1 - #Mis-captured / #Mis-capturable (e.g., 1 - 2/5 = .6).
- Freshness: how quickly transitions are captured. Freshness(Δ) = #(Captured with delay <= Δ) / #Captured (e.g., F(0) = 0, F(1) = 0, F(2) = 1/1 = 1).
Running example: Dewitt on S5, with real transitions Ѳ(2000) Wisc -> MSR against the source's published Wisc and UW updates.
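
Read off these definitions, coverage and freshness can be computed from a source's captured transitions. A small sketch under assumed inputs (exactness needs the mis-capturable count and is left out):

```python
def coverage_freshness(real, published, delta=1):
    """real / published: {(object, value): time the value was adopted}.
    A real transition is 'captured' if the source eventually publishes
    the same (object, value); its lag is the publication delay."""
    lags = {k: published[k] - t for k, t in real.items() if k in published}
    coverage = len(lags) / len(real) if real else 0.0
    freshness = (sum(lag <= delta for lag in lags.values()) / len(lags)
                 if lags else 0.0)
    return coverage, freshness

# Toy version of the Dewitt example: two real transitions, one captured late.
real = {("Dewitt", "Wisc"): 2000, ("Dewitt", "MSR"): 2008}
published = {("Dewitt", "Wisc"): 2002}
print(coverage_freshness(real, published, delta=2))  # (0.5, 1.0)
```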

II. Dependence Detection
Intuition I. S1 and S2 are likely to be dependent if they share common mistakes, or if their overlapping updates are performed after the real values have already changed. (Same update-history table as above.)

The Copying-Detection HMM Model
States: I (S1 and S2 independent), C1c (S1 as an active copier), C1~c (S1 as an idle copier), C2c (S2 as an active copier), C2~c (S2 as an idle copier).
A period of copying starts from and ends with a real copying, so a copying phase is entered through the active state.
Parameters: α = Pr(initial independence); f = Pr(a copier actively copying); ti = Pr(remaining independent); tc = Pr(remaining a copier).
Initial probabilities: Pr(I) = α; Pr(C1c) = Pr(C2c) = (1-α)/2; Pr(C1~c) = Pr(C2~c) = 0.
Transitions: I stays independent with ti and moves to C1c or C2c with (1-ti)/2 each; a copier remains a copier with tc, split into f·tc (active) and (1-f)·tc (idle), and leaves copying with 1-tc, split into (1-tc)·ti (back to independent) and (1-tc)·(1-ti) (copying in the other direction).
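
Assembling the chain explicitly may help; the numpy sketch below builds the 5x5 transition matrix from the four parameters. The exact split of the leave-copying mass is my reconstruction of the slide's garbled labels, so treat those entries as an assumption:

```python
import numpy as np

def hmm_matrices(alpha, f, ti, tc):
    """States: 0 = I, 1 = C1-active, 2 = C1-idle, 3 = C2-active, 4 = C2-idle."""
    T = np.zeros((5, 5))
    T[0] = [ti, (1 - ti) / 2, 0, (1 - ti) / 2, 0]   # independent pair
    for active, idle, other_active in [(1, 2, 3), (3, 4, 1)]:
        for s in (active, idle):
            T[s, active] = f * tc                     # keep copying, copy now
            T[s, idle] = (1 - f) * tc                 # keep copying, idle now
            T[s, 0] = (1 - tc) * ti                   # stop copying
            T[s, other_active] = (1 - tc) * (1 - ti)  # reverse the direction
    # A copying period starts with a real copying, so idle states get no
    # initial mass.
    init = np.array([alpha, (1 - alpha) / 2, 0, (1 - alpha) / 2, 0])
    return init, T

init, T = hmm_matrices(alpha=0.5, f=0.5, ti=0.99, tc=0.99)
assert np.allclose(T.sum(axis=1), 1) and np.isclose(init.sum(), 1)
```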

III. Lifespan Discovery  Algorithm: for each object O (Details in the paper)

Iterative Process
Iterate three steps until convergence: Step 1, dependence detection; Step 2, lifespan discovery; Step 3, CEF-measure computation.
Typically converges when #objects >> #sources.

The Motivating Example
Lifespan estimated for Halevy over the rounds, with CEF-measures C, E, F(0), F(1) recomputed for S1 and S2 each round (numeric values not shown):
- Round 1: (Ѳ, Wisc), (2002, UW), (2003, Google)
- Round 2: (Ѳ, UW), (2002, Google)
- Final: (Ѳ, UW), (2005, Google)
Halevy's updates: truth (Ѳ, UW), (05, Google); S1: (00, UW), (07, Google); S2: (00, Wisc), (02, UW), (05, Google); S3: (01, Wisc), (06, UW); S4: (05, UW); S5: (03, Wisc), (05, Google), (07, UW).

Experimental Setup
- Dataset: Manhattan restaurants, crawled from 12 restaurant websites; 8 weekly snapshots from 1/22/2009 to 3/12/2009.
- 5269 restaurants in total; 5231 appear in the first crawl and 5251 in the last.
- 467 restaurants were deleted from some websites; 280 actually closed before 3/15/2009 (gold standard).
- Measures: precision, recall, and F-measure, where G is the set of really closed restaurants and R the set of detected closed restaurants.
- Parameters: s = .8, α = f = .5, ti = tc = .99, n = 1 (open/closed).
- Environment: Windows XP, 2 GHz CPU, 960 MB memory.
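
For reference, the measures over the gold set G (really closed) and the detected set R, as a tiny hypothetical helper:

```python
def precision_recall_f(G, R):
    """G: really closed restaurants; R: detected closed restaurants."""
    tp = len(G & R)
    p = tp / len(R) if R else 0.0
    r = tp / len(G) if G else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

print(precision_recall_f({"a", "b", "c"}, {"b", "c", "d"}))  # ~(.667, .667, .667)
```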

Contributions of Various Components
Methods compared on ever-existing and closed restaurants (#restaurants, precision, recall, F-measure, #rounds, time; per-method figures not shown): ALL, Naïve, CEF, CopyCEF, Google.
- CEF and CopyCEF obtain high precision and recall.
- Applying rules is inadequate, and Naïve missed a lot of restaurants.
- Google Maps lists many out-of-business restaurants.

Computed CEF-Measure
Per-source coverage, exactness, freshness, and #closed restaurants (numeric values not shown) for MenuPages, TasteSpace, NYMagazine, NYTimes, ActiveDiner, TimeOut, SavoryCities, VillageVoice, FoodBuzz, NewYork, OpenTable, and DiningGuide; GoogleMaps has no CEF values and lists 228 closed restaurants.

Discovered Dependence
12 out of 66 source pairs are likely to be dependent, among TasteSpace, FoodBuzz, VillageVoice, ActiveDiner, NYTimes, TimeOut, MenuPages, NYMagazine, NewYork, OpenTable, DiningGuide, and SavoryCities.

Related Work
- Data provenance [Buneman et al., PODS’08]: focuses on effective presentation and retrieval; assumes knowledge of provenance/lineage.
- Opinion pooling [Clemen & Winkler, 1985]: combines probability distributions from multiple experts; again assumes knowledge of the dependence.
- Plagiarism detection for programs [Schleimer et al., SIGMOD’03]: targets unstructured data.

Data Integration Faces 3 Challenges
(Illustrated on the slides with scissors, paper, and glue.)

Existing Solutions Assume INDEPENDENCE of Data Sources
- Schema matching, model management, query answering using views
- Information extraction
- String matching (edit distance, token-based, etc.)
- Object matching (a.k.a. record linkage, reference reconciliation, ...)
- Data fusion, truth discovery

Source Dependence Adds A New Dimension to Data Integration

Research Agenda: Solomon