Laure Berti (Universite de Rennes 1), Anish Das Sarma (Stanford), Xin Luna Dong (AT&T), Amelie Marian (Rutgers), Divesh Srivastava (AT&T)



Challenges that Data Integration Faces

- Structure heterogeneity: schema matching, model management, query answering using views, information extraction
- Instance heterogeneity: string matching (edit distance, token-based, etc.), object matching (a.k.a. record linkage, reference reconciliation, …)
- Data conflicts: data fusion, truth discovery

Existing Solutions Assume Independence of Data Sources

- Structure heterogeneity: schema matching, model management, query answering using views, information extraction
- Instance heterogeneity: string matching (edit distance, token-based, etc.), object matching (a.k.a. record linkage, reference reconciliation, …)
- Data conflicts: data fusion, truth discovery

All of these techniques assume INDEPENDENCE of data sources.
However, technologies such as the Web ease the copying of data between data sources. Such copying can significantly affect the effectiveness of existing techniques.

False Information on the Web

UA's bankruptcy:
- Chicago Tribune, 2002
- Sun-Sentinel.com
- Google News
- Bloomberg.com

The UAL stock plummeted to $3 from $12.5.

How to Find the Truth?

- Naïve voting: among conflicting values, choose the one asserted by the largest number of data sources.
- However, "A lie told often enough becomes the truth." (Vladimir Lenin)
- Identify dependence between data sources:
  - One source copies from other sources
  - The opinion of one source is influenced by others
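Naïve voting can be sketched in a few lines. The example below is only an illustration: the source names S1 to S3, the `claims` layout, and the function name `naive_vote` are assumptions, not part of the slides. It shows how a value repeated by a copier can outvote a correct value.

```python
from collections import Counter

def naive_vote(claims):
    """For each object, return the value asserted by the most sources.

    `claims` maps a source name to a dict of {object: value}.
    """
    counts_by_object = {}
    for values in claims.values():
        for obj, value in values.items():
            counts_by_object.setdefault(obj, Counter())[value] += 1
    # most_common(1) returns the single most frequent value for the object
    return {obj: counts.most_common(1)[0][0]
            for obj, counts in counts_by_object.items()}

# Hypothetical data: S3 simply repeats S2, so the false value wins the vote.
claims = {
    "S1": {"43rd president": "George W. Bush"},
    "S2": {"43rd president": "Mickey Mouse"},
    "S3": {"43rd president": "Mickey Mouse"},
}
print(naive_vote(claims))  # {'43rd president': 'Mickey Mouse'}
```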

I. Identifying Dependence Between Sources

- Intuition I: decide dependence (without direction)
  Let D1, D2 be data from two sources. D1 and D2 are dependent if
  Pr(D1, D2) ≠ Pr(D1) * Pr(D2).

Dependence?

No.    Source 1 on USA Presidents   Source 2 on USA Presidents
1st    George Washington            George Washington
2nd    John Adams                   John Adams
3rd    Thomas Jefferson             Thomas Jefferson
4th    James Madison                James Madison
…      …                            …
41st   George H.W. Bush             George H.W. Bush
42nd   William J. Clinton           William J. Clinton
43rd   George W. Bush               George W. Bush
44th   Barack Obama                 Barack Obama

Are Source 1 and Source 2 dependent? Not necessarily.

Dependence? (Common Errors)

No.    Source 1 on USA Presidents   Source 2 on USA Presidents
1st    George Washington            George Washington
2nd    Benjamin Franklin            Benjamin Franklin
3rd    Tom Jefferson                Tom Jefferson
4th    Abraham Lincoln              Abraham Lincoln
…      …                            …
41st   George W. Bush               George W. Bush
42nd   Hillary Clinton              Hillary Clinton
43rd   Mickey Mouse                 Mickey Mouse
44th   Barack Obama                 John McCain

Are Source 1 and Source 2 dependent? Very likely.
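The "common errors" signal can be made concrete by counting how the two sources overlap. This is a minimal sketch under two loud assumptions: the true values are known (in practice they are not), and values are compared by exact string match, so "Tom Jefferson" counts as wrong. The function name `overlap_profile` is hypothetical.

```python
def overlap_profile(d1, d2, truth):
    """Count (shared true, shared false, different) values on common objects."""
    shared_true = shared_false = different = 0
    for obj in d1.keys() & d2.keys():
        if d1[obj] != d2[obj]:
            different += 1
        elif d1[obj] == truth[obj]:
            shared_true += 1
        else:
            shared_false += 1  # same wrong value in both sources
    return shared_true, shared_false, different

# Assumption: ground truth is available for illustration only.
truth = {
    "1st": "George Washington", "2nd": "John Adams",
    "3rd": "Thomas Jefferson", "4th": "James Madison",
    "41st": "George H.W. Bush", "42nd": "William J. Clinton",
    "43rd": "George W. Bush", "44th": "Barack Obama",
}
source1 = {
    "1st": "George Washington", "2nd": "Benjamin Franklin",
    "3rd": "Tom Jefferson", "4th": "Abraham Lincoln",
    "41st": "George W. Bush", "42nd": "Hillary Clinton",
    "43rd": "Mickey Mouse", "44th": "Barack Obama",
}
source2 = {**source1, "44th": "John McCain"}

print(overlap_profile(source1, source2, truth))  # (1, 6, 1)
```

Sharing one true value says little, since accurate sources agree on the truth independently. Sharing six identical false values is hard to explain without copying.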

I. Identifying Dependence Between Sources

- Intuition I: decide dependence (without direction)
  Let D1, D2 be data from two sources. D1 and D2 are dependent if
  Pr(D1, D2) ≠ Pr(D1) * Pr(D2).
- Intuition II: decide copying direction
  Let F be a property function of the data; e.g., accuracy of data. D1 is likely to be dependent on D2 if
  |F(D1 ∩ D2) - F(D1 - D2)| > |F(D1 ∩ D2) - F(D2 - D1)|.

Dependence? (Different Accuracy)

No.    Source 1 on USA Presidents   Source 2 on USA Presidents
1st    George Washington            George Washington
2nd    John Adams                   Benjamin Franklin
3rd    Thomas Jefferson             Tom Jefferson
4th    Abraham Lincoln              Abraham Lincoln
…      …                            …
41st   George W. Bush               George W. Bush
42nd   Hillary Clinton              Hillary Clinton
43rd   George W. Bush               Mickey Mouse
44th   John McCain                  John McCain

Are Source 1 and Source 2 dependent? S1 is more likely to be a copier.
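Intuition II can be checked on this example by taking F to be accuracy. The sketch below assumes the true values are known and uses exact string matching. The names `accuracy` and `direction_gaps` are hypothetical, and the code illustrates the inequality rather than the authors' detection algorithm.

```python
def accuracy(pairs, truth):
    """Fraction of (object, value) pairs that match the assumed ground truth."""
    if not pairs:
        raise ValueError("accuracy is undefined on an empty set of values")
    return sum(truth[obj] == value for obj, value in pairs) / len(pairs)

def direction_gaps(d1, d2, truth):
    """Return (|F(D1∩D2) - F(D1-D2)|, |F(D1∩D2) - F(D2-D1)|) with F = accuracy."""
    p1, p2 = set(d1.items()), set(d2.items())
    shared = accuracy(p1 & p2, truth)
    return (abs(shared - accuracy(p1 - p2, truth)),
            abs(shared - accuracy(p2 - p1, truth)))

# Assumption: ground truth is available for illustration only.
truth = {
    "1st": "George Washington", "2nd": "John Adams",
    "3rd": "Thomas Jefferson", "4th": "James Madison",
    "41st": "George H.W. Bush", "42nd": "William J. Clinton",
    "43rd": "George W. Bush", "44th": "Barack Obama",
}
s1 = {
    "1st": "George Washington", "2nd": "John Adams",
    "3rd": "Thomas Jefferson", "4th": "Abraham Lincoln",
    "41st": "George W. Bush", "42nd": "Hillary Clinton",
    "43rd": "George W. Bush", "44th": "John McCain",
}
s2 = {
    "1st": "George Washington", "2nd": "Benjamin Franklin",
    "3rd": "Tom Jefferson", "4th": "Abraham Lincoln",
    "41st": "George W. Bush", "42nd": "Hillary Clinton",
    "43rd": "Mickey Mouse", "44th": "John McCain",
}

gap1, gap2 = direction_gaps(s1, s2, truth)
# S1 is the likely copier when its own data differs more from the shared data.
print("S1 likely copies S2" if gap1 > gap2 else "no evidence that S1 copies S2")
```

Here the shared values are 20% accurate, S1's own values are 100% accurate, and S2's own values are 0% accurate. The gaps are therefore about 0.8 for S1 and 0.2 for S2, which matches the slide's conclusion.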

II. Applying Dependence Between Sources in Data Integration

- Data fusion: truth discovery; integrating probabilistic data
- Record linkage: improve record linkage; distinguish between wrong values and alternative representations
- Query answering: query optimization; improve schema matching
- Source recommendation: recommend trustworthy, up-to-date, and independent sources

Research Agenda: Solomon

- Discovery
  - Discovery of copying for snapshots of data
  - Discovery of copying for update history
  - Discovery of opinion influence in reviews
  - …
- Applications
  - Truth discovery
  - Record linkage
  - Query optimization
  - Source recommendation
  - …

Related Work

- Data provenance [Buneman et al., PODS'08]
  - Assumes knowledge of provenance/lineage
  - Focuses on effective presentation and retrieval
- Opinion pooling [Clemen&Winkler, 1985]
  - Combines probability distributions from multiple experts
  - Again, assumes knowledge of dependence
- Detecting plagiarism of programs [Schleimer, Sigmod'03]
  - Unstructured data

Discovering Dependence Between Sources

- Challenges
  - Accurate sources: independently provide true values
  - Different coverage and expertise: specialist sources vs. generalist sources
  - Lazy copiers and slow providers
  - Partial dependence: copy only a subset of data, reformat some of the copied values, provide some info independently, etc.
  - Correlated information: common interest/belief system
  - Incomplete observations: hidden data, undiscovered sources, missing updates, etc.
- Sub-problems
  - Discovery of copying for snapshots of data
    - Sharing common false data
    - Different accuracy on common data and distinct data
  - Discovery of copying for update history
    - Same updates in a close enough time frame
    - Different accuracy on pre-provided data and post-provided data
  - Discovery of opinion influence in ratings
  - …

App I. Data Fusion with Source Dependence

- Truth discovery
  - Decide one true value for each object.
  - Challenge: interdependence between truth discovery and dependence detection.
- Integrating probabilistic data
  - Generate a probability distribution of possible values for each object.
  - Challenge: the dependence between sources may also be probabilistic.
- Finding consensus opinions in recommendation systems.
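One simple way to see how known dependence could change truth discovery is to discount votes that a copier shares with its original. The sketch below is an illustrative assumption, not the method described in the slides. The `copies_from` map, the fixed `copy_prob` discount, and the source names are hypothetical. It also sidesteps the stated challenge by treating copying as already known.

```python
from collections import defaultdict

def dependence_aware_vote(claims, copies_from, copy_prob=0.8):
    """Vote per object, discounting values a copier shares with its original.

    claims:      {source: {object: value}}
    copies_from: {copier: original}  (assumed to be known in advance)
    copy_prob:   assumed probability that a shared value was copied
    """
    scores = defaultdict(lambda: defaultdict(float))
    for source, values in claims.items():
        original = copies_from.get(source)
        for obj, value in values.items():
            shared_with_original = (
                original is not None
                and claims.get(original, {}).get(obj) == value
            )
            # A likely-copied value contributes only a fraction of a vote.
            weight = 1.0 - copy_prob if shared_with_original else 1.0
            scores[obj][value] += weight
    return {obj: max(vals, key=vals.get) for obj, vals in scores.items()}

# Hypothetical data: S4 and S5 copy the false value from S3.
claims = {
    "S1": {"43rd president": "George W. Bush"},
    "S2": {"43rd president": "George W. Bush"},
    "S3": {"43rd president": "Mickey Mouse"},
    "S4": {"43rd president": "Mickey Mouse"},
    "S5": {"43rd president": "Mickey Mouse"},
}
copies_from = {"S4": "S3", "S5": "S3"}

print(dependence_aware_vote(claims, copies_from))
```

With plain counting, "Mickey Mouse" would win 3 to 2. With the discount, it scores roughly 1.4 against 2.0, so the independently supported value is chosen.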

App II. Record Linkage with Source Dependence

- Record linkage
  - Knowledge of dependence between sources can improve record linkage.
- Challenges
  - Again, interdependence between record linkage and dependence detection.
  - Distinguish alternative representations from wrong values; e.g.,
    - Xin Dong (official name)
    - Luna Dong (alternative)
    - Xin Deng (wrong value)

App III. Query Answering with Source Dependence

- Query answering
  - Optimization: avoid visiting sources that depend on, or have been copied by, sources already visited.
  - Online query answering: first return partially computed answers, then update the answers as more sources are queried; sources need to be ordered so as to provide complete and accurate answers from the beginning.
- Schema matching
  - Knowledge of dependence between sources can improve schema matching.
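The optimization rule can be sketched as a filter over a candidate visiting order. This is a simplified illustration that assumes copying relationships are known and that a copier adds nothing beyond its original. The slides note that partial copiers may also provide some data independently. The function name `plan_visits` and the `copies_from` map are hypothetical.

```python
def plan_visits(sources, copies_from):
    """Return the sources to visit, in order, skipping redundant ones.

    sources:     candidate sources in preferred visiting order
    copies_from: {copier: original}  (assumed to be known in advance)
    """
    visited = []
    for source in sources:
        # Skip a source that copies from one we already visited.
        depends_on_visited = copies_from.get(source) in visited
        # Skip a source that an already visited source copied from.
        copied_by_visited = any(copies_from.get(v) == source for v in visited)
        if depends_on_visited or copied_by_visited:
            continue
        visited.append(source)
    return visited

# Hypothetical setup: S2 copies from S1, and S3 copies from S4.
print(plan_visits(["S1", "S2", "S3", "S4"], {"S2": "S1", "S3": "S4"}))
# ['S1', 'S3']
```

Choosing the initial order itself, for example by accuracy and coverage, is the harder part of online query answering, and this sketch does not address it.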