Xin Luna Dong AT&T Labs-Research Joint work w. Laure Berti-Equille, Yifan Hu, Divesh

Presentation transcript:

Information Propagation Becomes Much Easier with Web Technologies

False Information Can Be Propagated
Posted by Andrew Breitbart in his blog …

"The Internet needs a way to help people separate rumor from real science." - Tim Berners-Lee
"We now live in this media culture where something goes up on YouTube or a blog and everybody scrambles." - Barack Obama

Large-Scale Copying on Structured Data (Copying of AbeBooks Data)
Data collected from AbeBooks [Yin et al., 2007]

Observation I. Intuitively Meaningful Clusters According to the Copying Relationships

Observation II. Complex Copying Relationships: Co-copying

Observation II. Complex Copying Relationships: Transitive copying, Multi-source copying

Understanding Complex Copying Relationships
• Benefits
  - Business purpose: data are valuable
  - In-depth data analysis: information dissemination
  - Improve data integration: truth discovery, entity resolution, schema mapping, query optimization
• Current techniques make local decisions [Dong et al., 09a][Dong et al., 09b][Blanco et al., 10]
  - Cannot distinguish co-copying, transitive copying, direct copying from multiple sources

Our Contributions
Local Detection
• More accurate decisions on copying direction (important for global detection)
  - Glean information from completeness, formatting
  - Consider correlated copying: e.g., a source copying the name of a book can also copy its author list
Global Detection
• Global detection of copying
  - Discovering co-copying and transitive copying

Outline
• Motivation and contributions
• Problem definition and techniques
  - Local Detection and Global Detection: Intuitions, Techniques
• Experimental results
• Related work and conclusions

Problem Definition: Input

Src  ISBN  Name                                             Author
S1   1     IPV6: Theory, Protocol, and Practice             Loshin, Peter
S1   2     Web Usability: A User-Centered Design Approach   Lazar, Jonathan
S2   1     IPV4: Theory, Protocol, and Practice             -
S2   2     Web Usability: A User                            Jonathan Lazar
S3   1     IPV6: Theory, Protocol, and Practice             Loshin, Peter
S3   2     Web Usability: A User                            Jonathan Lazar
S4   1     IPV6: Theory, Protocol, and Practice             Loshin
S4   2     Web Usability: A User                            Lazar

Slide callouts on the table: Missing values, Different formats, Incorrect values

• Objects: a real-world entity, described by a set of attributes
  - Each associated with a true value
• Sources: each providing data for a subset of objects
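The input above can be held in a small nested mapping. The sketch below is only an illustration: the `canonical_author` normalizer and the four comparison labels are assumptions made here, not part of the talk, and the slide's "sub-value" case (e.g. "Loshin" vs. "Loshin, Peter") is simply reported as a different value.

```python
# Input table from the slide: source -> ISBN -> (name, author).
# None marks a missing value ("-" on the slide).
DATA = {
    "S1": {1: ("IPV6: Theory, Protocol, and Practice", "Loshin, Peter"),
           2: ("Web Usability: A User-Centered Design Approach", "Lazar, Jonathan")},
    "S2": {1: ("IPV4: Theory, Protocol, and Practice", None),
           2: ("Web Usability: A User", "Jonathan Lazar")},
    "S3": {1: ("IPV6: Theory, Protocol, and Practice", "Loshin, Peter"),
           2: ("Web Usability: A User", "Jonathan Lazar")},
    "S4": {1: ("IPV6: Theory, Protocol, and Practice", "Loshin"),
           2: ("Web Usability: A User", "Lazar")},
}


def canonical_author(author):
    """Hypothetical normalizer: rewrite 'Last, First' as 'First Last'."""
    if author is None:
        return None
    if "," in author:
        last, first = (part.strip() for part in author.split(",", 1))
        return f"{first} {last}"
    return author.strip()


def compare_authors(a, b):
    """Classify two author values; the label names are assumptions."""
    if a is None or b is None:
        return "missing"
    if a == b:
        return "same value, same format"
    if canonical_author(a) == canonical_author(b):
        return "same value, different format"
    return "different value"


def author(source, isbn):
    """Author value that `source` provides for the book with this ISBN."""
    return DATA[source][isbn][1]
```

For instance, `compare_authors(author("S1", 2), author("S2", 2))` distinguishes a formatting difference from a real disagreement, which is the distinction the later slides rely on.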

Problem Definition: Output
• For each pair S1, S2, decide the probability of S1 copying directly from S2
• A copier copies all or a subset of data
• A copier can add values and verify/modify copied values (independent contribution)
• A copier can re-format copied values (still considered as copied)
[Figure: copying graph over S1, S2, S3, S4, shown next to the input table of the Problem Definition slide]

Intuitions for Local Copying Detection
Pr(Φ(S1) | S1→S2) >> Pr(Φ(S1) | S1⊥S2)  ⇒  S1→S2
• Overlap on unpopular values ⇒ Copying
• Changes in quality of different parts of data ⇒ Copying direction
• [VLDB'09] Consider correctness of data

Correctness of Data as Evidence for Copying
[Figure: the input table of the Problem Definition slide, with a copying graph over S1, S2, S3, S4]

Intuitions for Local Copying Detection
Pr(Φ(S1) | S1→S2) >> Pr(Φ(S1) | S1⊥S2)  ⇒  S1→S2
• Overlap on unpopular values ⇒ Copying
• Changes in quality of different parts of data ⇒ Copying direction
• [VLDB'09] Consider correctness of data
• Consider additional evidence

Formatting as Evidence for Copying
[Figure: the input table of the Problem Definition slide, with callouts "Different formats" and "SubValues" and a copying graph over S1, S2, S3, S4]

Intuitions for Local Copying Detection
Pr(Φ(S1) | S1→S2) >> Pr(Φ(S1) | S1⊥S2)  ⇒  S1→S2
• Overlap on unpopular values ⇒ Copying
• Changes in quality of different parts of data ⇒ Copying direction
• [VLDB'09] Consider correctness of data
• Consider additional evidence
• Consider correlated copying

Correlated Copying

Pattern 1:
     K   A1  A2  A3  A4
O1   S   S   S   D   D
O2   S   D   S   S   D
O3   S   S   D   S   D
O4   S   S   S   D   S
O5   S   D   S   S   S

Pattern 2:
     K   A1  A2  A3  A4
O1   S   S   S   S   S
O2   S   S   S   S   S
O3   S   S   S   S   S
O4   S   D   D   D   D
O5   S   D   D   D   D

Both patterns: 17 same values and 8 different values
Slide label: Copying
S: Two sources providing the same value
D: Two sources providing different values
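A short check of the two S/D matrices on this slide, written as a sketch: it confirms that both have the same totals, so a per-value count cannot separate them, while a per-object view can. The helper names are made up for this illustration, and "number of objects matching on every attribute" is just one simple way to expose the difference, not necessarily the measure used in the talk.

```python
# Each string is one object (O1..O5); columns are K, A1, A2, A3, A4.
# "S" = the two sources give the same value, "D" = different values.
PATTERN_1 = ["SSSDD", "SDSSD", "SSDSD", "SSSDS", "SDSSS"]
PATTERN_2 = ["SSSSS", "SSSSS", "SSSSS", "SDDDD", "SDDDD"]


def totals(rows):
    """Return (number of same values, number of different values)."""
    same = sum(row.count("S") for row in rows)
    diff = sum(row.count("D") for row in rows)
    return same, diff


def fully_shared_objects(rows):
    """Count objects on which the two sources agree on every attribute."""
    return sum(1 for row in rows if set(row) == {"S"})


for name, rows in (("pattern 1", PATTERN_1), ("pattern 2", PATTERN_2)):
    print(name, totals(rows), fully_shared_objects(rows))
```

Both patterns report the same totals, but only the second has objects that agree on every attribute, which is the kind of per-object correlation the slide title points at.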

Intuitions for Local Copying Detection
Pr(Φ(S1) | S1→S2) >> Pr(Φ(S1) | S1⊥S2)  ⇒  S1→S2
• Overlap on unpopular values ⇒ Copying
• Changes in quality of different parts of data ⇒ Copying direction
• [VLDB'09] Consider correctness of data
• Consider additional evidence
• Consider correlated copying
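One way to read the inequality on this slide is as a comparison of two likelihoods that feeds a posterior probability of copying. The sketch below assumes a prior probability of copying (`prior_copy`, a hypothetical parameter with an arbitrary default) and plain Bayes' rule; it is not necessarily the exact formulation used in the talk.

```python
def copy_posterior(lik_copy, lik_indep, prior_copy=0.2):
    """Posterior probability that S1 copies from S2.

    lik_copy   -- Pr(observation | S1 copies from S2)
    lik_indep  -- Pr(observation | S1 and S2 are independent)
    prior_copy -- assumed prior probability of copying (hypothetical value)
    """
    numerator = prior_copy * lik_copy
    denominator = numerator + (1.0 - prior_copy) * lik_indep
    if denominator == 0.0:
        return prior_copy  # no evidence either way: fall back to the prior
    return numerator / denominator


# Observation far more likely under copying: posterior close to 1.
print(copy_posterior(1e-3, 1e-9))
# Observation equally likely under both hypotheses: posterior stays at the prior.
print(copy_posterior(1e-6, 1e-6))
```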

Experimental Results for Local Copying Detection on Synthetic Data

Multi-Source Copying? Co-copying? Transitive Copying?

Three scenarios over sources S1, S2, S3; S1 provides {V1-V100} in each:
• Multi-source copying: the other two sources provide {V51-V130} and {V1-V50, V101-V130}; pairwise shared values are V1-V50, V101-V130, and V51-V100
• Co-copying: the other two sources provide {V21-V70} and {V1-V50}; pairwise shared values are V1-V50, V21-V50, and V21-V70
• Transitive copying: the other two sources provide {V21-V50, V81-V100} and {V1-V50} (V81-V100 are popular values); pairwise shared values are V1-V50, V21-V50, and V21-V50 with V81-V100

Local copying detection results:
X Looking at the copying probabilities? (slide annotation: 1)
X Counting shared values? (slide annotation: 50)
X Comparing the set of shared values? (V21-V50 shared by 3 sources)

We need to reason for each data item in a principled way!
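The value sets on this slide can be checked directly. The sketch below builds the three scenarios as Python sets and shows that the pairwise overlap sizes are identical across them, and that co-copying and transitive copying also share the same three-way overlap. The order of the second and third set within each scenario is arbitrary here, since the transcript does not reliably say which set belongs to S2 and which to S3; the function names are made up for the illustration.

```python
from itertools import combinations


def vrange(lo, hi):
    """Set of value ids V<lo>..V<hi>, inclusive."""
    return {f"V{i}" for i in range(lo, hi + 1)}


# First set is S1; the other two are the remaining sources (order arbitrary).
SCENARIOS = {
    "multi-source copying": [vrange(1, 100), vrange(51, 130),
                             vrange(1, 50) | vrange(101, 130)],
    "co-copying": [vrange(1, 100), vrange(21, 70), vrange(1, 50)],
    "transitive copying": [vrange(1, 100),
                           vrange(21, 50) | vrange(81, 100), vrange(1, 50)],
}


def pairwise_shared_counts(sources):
    """Sorted sizes of the overlaps between every pair of sources."""
    return sorted(len(a & b) for a, b in combinations(sources, 2))


def shared_by_all(sources):
    """Values provided by every source in the scenario."""
    return set.intersection(*sources)


for name, sources in SCENARIOS.items():
    print(name, pairwise_shared_counts(sources), len(shared_by_all(sources)))
```

Because the counts coincide, neither counting shared values nor comparing the shared sets pair by pair separates the scenarios, which is consistent with the slide's call for per-item reasoning.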

Global Copying Detection
1. First find a set of copyings R that significantly influence the rest of the copyings
   • How to find such R?
2. Adjust copying probability for the rest of the copyings: P(S1→S2 | R)
   • How to compute P(S1→S2 | R)?

Computing P(S1→S2 | R)
Pr(Φ(S1) | S1→S2) >> Pr(Φ(S1) | S1⊥S2)  ⇒  S1→S2
• Replace Pr(Φ(S1) | S1→S2) everywhere with Pr(Φ(S1) | S1→S2, R)
• For each O.A, consider sources associated with S1 in R
  - S_f(O.A): sources providing the same value in the same format on O.A as S1
  - S_v(O.A): sources providing the same value in a different format on O.A as S1
  - P_f / P_v: probability that S1 does not copy O.A from any source in S_f(O.A) / S_v(O.A)
• Pr(Φ_O.A(S1) | S1→S2, R) = (1 − P_f P_v) + P_f P_v · Pr(Φ_O.A(S1) | S1→S2)
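The per-item adjustment on this slide is a one-line formula, sketched below. P_f and P_v are taken as given inputs because the transcript does not say how they are computed; the function and parameter names are made up for this illustration.

```python
def adjusted_likelihood(pr_local, p_f, p_v):
    """Pr(Phi_O.A(S1) | S1->S2, R) following the formula on the slide.

    pr_local -- Pr(Phi_O.A(S1) | S1->S2), the likelihood without R
    p_f      -- probability that S1 copies O.A from no source in S_f(O.A)
    p_v      -- probability that S1 copies O.A from no source in S_v(O.A)
    """
    not_copied_via_r = p_f * p_v
    return (1.0 - not_copied_via_r) + not_copied_via_r * pr_local


# No source in R explains the value (P_f = P_v = 1): likelihood unchanged.
print(adjusted_likelihood(0.2, 1.0, 1.0))
# A source in R surely explains the value (P_f = 0): likelihood becomes 1.
print(adjusted_likelihood(0.2, 0.0, 1.0))
```

The two boundary cases match the slide's reading: when R offers no alternative origin for a value the likelihood is untouched, and when R fully explains it the shared value stops counting as evidence for S1→S2.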

Multi-Source Copying? Co-copying? Transitive Copying?

Conditioning on R in the three scenarios (S1 provides {V1-V100} in each):
• Multi-source copying ({V51-V130}, {V1-V50, V101-V130}): R = {S3→S1}; Pr(Φ(S3)) = Pr(Φ(S3) | R) for V101-V130
• Co-copying ({V21-V70}, {V1-V50}): R = {S3→S1}; Pr(Φ(S3)) << Pr(Φ(S3) | R) for V21-V50
• Transitive copying ({V21-V50, V81-V100}, {V1-V50}; V81-V100 are popular values): R = {S3→S2}; Pr(Φ(S3)) << Pr(Φ(S3) | R) for V21-V50; Pr(Φ(S3)) is high for V81-V100

[Figure: copying graphs for the three scenarios, with edges marked X or ?]

Finding R
• R (most influential copying relationships): Maximize [objective formula not preserved]
• Finding R is NP-complete (Reduction from HITTING SET problem)
• We need a fast greedy algorithm

Greedy Algorithm for Finding R
• Goal: Maximize [objective formula not preserved]
• Intuitions
  - For each source, find the most "influential" sources from which it copies
  - Order the original sources by their accumulated influence on others, and iteratively add each corresponding copying to R, pruning a copying if either of the following holds:
    - It has less accumulated influence on others than the influence it receives from others
    - It can be significantly influenced by the already selected copyings
• E.g.,
  P(S4→S1) − P(S4→S1 | S4→S3) = .8, P(S4→S2) − P(S4→S2 | S4→S3) = .8
  P(S4→S3) − P(S4→S3 | S4→S1) = .5, P(S4→S3) − P(S4→S3 | S4→S2) = .5
  Accumulated influence: .8 + .8 = 1.6
[Figure: copying graph over S1, S2, S3, S4 with two X marks]
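The two pruning rules can be sketched on the slide's example numbers. This is a simplified illustration, not the talk's algorithm: it ranks copyings directly rather than ordering original sources first, and the `threshold` for "significantly influenced" is a hypothetical parameter. The `influence` table holds the four differences listed on the slide.

```python
from collections import defaultdict

# influence[(a, b)] = P(b) - P(b | a): how much knowing copying `a`
# lowers the probability of copying `b` (numbers from the slide).
INFLUENCE = {
    ("S4->S3", "S4->S1"): 0.8,
    ("S4->S3", "S4->S2"): 0.8,
    ("S4->S1", "S4->S3"): 0.5,
    ("S4->S2", "S4->S3"): 0.5,
}


def greedy_select(influence, threshold=0.5):
    """Pick influential copyings; `threshold` is an assumed cut-off."""
    exerted = defaultdict(float)   # accumulated influence on others
    received = defaultdict(float)  # accumulated influence from others
    for (a, b), delta in influence.items():
        exerted[a] += delta
        received[b] += delta

    candidates = sorted(set(exerted) | set(received),
                        key=lambda c: (-exerted[c], c))
    selected = []
    for c in candidates:
        # Rule 1: prune if it influences others less than it is influenced.
        if exerted[c] < received[c]:
            continue
        # Rule 2: prune if an already selected copying influences it strongly.
        if any(influence.get((s, c), 0.0) >= threshold for s in selected):
            continue
        selected.append(c)
    return selected, dict(exerted)


selected, exerted = greedy_select(INFLUENCE)
print(selected, exerted["S4->S3"])
```

On these numbers only S4→S3 survives, with the accumulated influence of 1.6 shown on the slide, while S4→S1 and S4→S2 are pruned by the first rule.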

Experimental Results for Global Detection on Synthetic Data
• Sensitivity: percentage of copying cases that are identified with the correct direction
• Specificity: percentage of non-copying cases that are identified as such

Experimental Setup
• Dataset: Weather data
  - 18 weather websites
  - for 30 major USA cities
  - collected every 45 minutes for a day
  - 33 collections, so 990 objects
  - 28 distinct attributes
• Challenges
  - No true/false notion, only popularity
  - Frequent updates: up-to-date data may not have been copied at crawling
  - Complete data and standard formatting: lack evidence from completeness & formatting
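The dataset sizes on this slide follow from simple arithmetic, sketched below. Counting both the first and the last crawl of the day (32 intervals, 33 collections) is an assumption made here to match the slide's figure; the slide itself only states the totals.

```python
MINUTES_PER_DAY = 24 * 60
CRAWL_INTERVAL_MINUTES = 45
CITIES = 30

# 32 full intervals fit in a day; counting both endpoints gives 33 crawls
# (the endpoint convention is an assumption that reproduces the slide's 33).
collections = MINUTES_PER_DAY // CRAWL_INTERVAL_MINUTES + 1

# One object per (city, collection) pair.
objects = CITIES * collections

print(collections, objects)
```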

Golden Standard

Silver Standard

Results of Global Detection

Results of Local Detection

Experiment Results
• Measure: Precision, Recall, F-measure
• C: real copying; D: detected copying

Methods                        Precision  Recall  F-measure
Corr (Only correctness)
Enriched (More evidence)
Local (correlated copying)
Global (global detection)      .79 (only value preserved; column unclear)

Slide callouts:
• Transitive/co-copying not removed
• Ignoring evidence from correlated copying
• Enriched improves over Corr when true/false notion does apply
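The formulas for the three measures did not survive in this transcript. The sketch below uses the standard set-based definitions over C (real copying) and D (detected copying), on the assumption that these are what the slide showed; the example pairs are invented purely to exercise the function.

```python
def precision_recall_f(real, detected):
    """Standard precision, recall, and F-measure over sets of copyings.

    real     -- C, the set of real copying pairs (copier, original)
    detected -- D, the set of detected copying pairs
    """
    hits = len(real & detected)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(real) if real else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical example: directed pairs, so ("S3", "S2") differs from ("S2", "S3").
C = {("S2", "S1"), ("S3", "S1"), ("S4", "S3")}
D = {("S2", "S1"), ("S3", "S1"), ("S3", "S2")}
print(precision_recall_f(C, D))
```

Treating each copying as an ordered pair means a detection with the wrong direction counts as a miss.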

Related Work
• Copying detection
  - Texts/Programs [Schleimer et al., 03][Buneman, 71]
  - Videos [Law-To et al., 07]
  - Structured sources
    - [Dong et al., 09a][Dong et al., 09b]: Local decision
    - [Blanco et al., 10]: Assume a copier must copy all attribute values of an object
• Data provenance [Buneman et al., PODS'08]
  - Focus on effective presentation and retrieval
  - Assume knowledge of provenance/lineage

Conclusions and Future Work
• Conclusions
  - Improve previous techniques for pairwise copying detection by
    - plugging in different types of copying evidence
    - considering correlations between copying
  - Global detection for eliminating co-copying and transitive copying
• Ongoing and future work
  - Categorization and summarization of the copied instances
  - Visualization of copying relationships [VLDB'10 demo]