Fusion in web data extraction

Presentation transcript:

Fusing Data with Correlations Ravali Pochampally, Anish Das Sarma, Luna Dong, Alexandra Meliou, Divesh Srivastava AT&T Research

Fusion in web data extraction
Imagine that you have a large collection of web sources that you process with multiple extraction systems to derive facts. These facts are usually knowledge triples of the form <subject, predicate, object>, for example <Daniel Radcliffe, played role, Harry Potter>. Unfortunately, extractors often make mistakes, and some of the extracted knowledge triples are incorrect. The problem that we want to solve is how to identify and remove these wrong triples from the dataset. This problem matters in many applications, such as building knowledge bases, answering questions, and facilitating data mining.
How can we purge wrong triples from the dataset?
Applications: building knowledge bases, answering questions, facilitating data mining

The data fusion problem
Example knowledge triples, each extracted by some subset of the sources S1 to S4:
<Daniel Radcliffe, played role, Harry Potter>
<Daniel Radcliffe, spouse, Bonnie Wright>
<Daniel Radcliffe, acted in, Frankenstein>
<Emma Watson, acted in, Harry Potter>
<J. K. Rowling, acted in, Harry Potter>
<Richard Harris, played role, Dumbledore>
<Michael Gambon, played role, Dumbledore>
<Tim Burton, directed, Harry Potter>
<Daniel Craig, acted in, Harry Potter>
<Rupert Grint, acted in, Harry Potter>
Source labels on the slide: good, bad; correlated, anti-correlated.
So, why is this problem challenging? Perhaps we can use simple voting: accept a triple as true only if it is returned by a large portion of the extractors. Unfortunately, such approaches often perform poorly. Bad web sources that copy from each other, or low-quality extractors that are otherwise correlated, can lead us to accept incorrect results. At the same time, we may end up rejecting correct triples derived by good extractors if these triples do not appear in the outputs of other extractors.
Contribution: fusion techniques that consider the quality of different sources and their correlations, which can be used to derive high-quality datasets.

This talk
2 techniques:
PrecRec: consider source quality
PrecRecCorr: consider correlations
Approximations
Evaluation
Future directions: extractor diagnosis

High-level intuition
Researcher affiliation, as reported by sources S1, S2, S3: Jagadish UMich ATT; Dewitt MSR UWisc; Bernstein; Carey UCI BEA; Franklin UCB UMD

High-level intuition
Voting: trust the majority

High-level intuition
Quality-based: give more votes to accurate sources

Source quality in extraction
Actors/actresses in “Harry Potter” films, as extracted by S1, S2, S3: Daniel Radcliffe, Emma Watson, J. K. Rowling, Daniel Craig, Rupert Grint
Source profiles: high recall; high precision; medium precision/recall
Considering source quality:
-- A triple is more likely to be correct if it is extracted by a high-precision source.
-- A triple is more likely to be wrong if it is not extracted by a high-recall source.

Source quality metrics
Recall r_i: the probability that source S_i returns a true triple
False positive rate q_i: the probability that source S_i returns a false triple
A source is good if r_i > q_i
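
As a rough illustration of these two metrics, the sketch below estimates recall and false positive rate by simple counting. The data is the ten-triple Obama example (sources S1 to S5, six true and four false triples) from the notes of the backup slide "The data fusion problem". The names PROVIDED, TRUE_TRIPLES, and source_quality are hypothetical, and treating the observed fractions as the probabilities r_i and q_i is an assumption made for illustration.

```python
# Triples t1..t10 are represented by their index (data from the Obama example table).
PROVIDED = {
    "S1": {1, 2, 6, 7, 8, 9, 10},
    "S2": {1, 2, 4, 5, 7, 8, 9},
    "S3": {3, 4, 5, 7, 10},
    "S4": {1, 4, 6, 8, 9, 10},
    "S5": {1, 4, 6, 8, 9, 10},
}
ALL_TRIPLES = set(range(1, 11))
TRUE_TRIPLES = {1, 3, 4, 6, 7, 10}
FALSE_TRIPLES = ALL_TRIPLES - TRUE_TRIPLES


def source_quality(provided):
    """Return (recall, false positive rate) for one source's set of triples."""
    recall = len(provided & TRUE_TRIPLES) / len(TRUE_TRIPLES)
    fpr = len(provided & FALSE_TRIPLES) / len(FALSE_TRIPLES)
    return recall, fpr


QUALITY = {name: source_quality(triples) for name, triples in PROVIDED.items()}

# A source is "good" when its recall exceeds its false positive rate.
GOOD_SOURCES = sorted(name for name, (r, q) in QUALITY.items() if r > q)

for name, (r, q) in QUALITY.items():
    print(f"{name}: recall={r:.2f} fpr={q:.2f}")
print("good sources:", GOOD_SOURCES)
```

On this small example S2 returns every false triple, so its estimated false positive rate is 1; any score that divides by 1 - q_i would need smoothing or clamping before using such an estimate.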

Accounting for quality
Compute a score for each triple by multiplying one factor per source S_i:
If S_i extracts the triple, multiply by r_i / q_i. A good source gives a higher score; a bad source gives a lower score.
If S_i does not extract the triple, multiply by (1 - r_i) / (1 - q_i). A good source gives a lower score; a bad source gives a higher score.
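
The scoring rule can be sketched as a product of per-source likelihood ratios turned into a probability. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes independent sources, the factors r_i / q_i (source extracts the triple) and (1 - r_i) / (1 - q_i) (source does not), and a prior probability of 0.5 that a triple is true. The function name precrec_probability, the prior, and the quality values in the usage lines are all hypothetical.

```python
import math


def precrec_probability(providers, sources, prior=0.5):
    """Probability that a triple is true, given which sources extracted it.

    providers: set of source names that extracted the triple
    sources:   dict mapping source name -> (recall r, false positive rate q),
               with 0 < r < 1 and 0 < q < 1
    prior:     assumed prior probability that a triple is true
    """
    log_score = 0.0
    for name, (r, q) in sources.items():
        if name in providers:
            # Extracted: a good source (r > q) raises the score.
            log_score += math.log(r) - math.log(q)
        else:
            # Not extracted: a good source (r > q) lowers the score.
            log_score += math.log(1 - r) - math.log(1 - q)
    posterior_odds = prior / (1 - prior) * math.exp(log_score)
    return posterior_odds / (1 + posterior_odds)


# Two hypothetical good sources with recall 0.8 and false positive rate 0.2.
sources = {"A": (0.8, 0.2), "B": (0.8, 0.2)}
p_both = precrec_probability({"A", "B"}, sources)
p_one = precrec_probability({"A"}, sources)
p_none = precrec_probability(set(), sources)
print(p_both, p_one, p_none)
```

Working in log space avoids underflow when there are many sources (the Book dataset in this talk has 879).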

Correlation scenarios
Setting: a triple is provided by good sources with recall r and FPR q
Scenarios: copying; overlapping on true triples; overlapping on false triples; complementary sources
Correlations capture richer information than copying relationships

Correlation in web extraction [Dong et al. PVLDB 2014]
Significant negative correlation
The Kappa measure is considered more robust than simply measuring the intersection, because it accounts for the intersection that can occur even under independence. A positive Kappa indicates positive correlation, a negative Kappa indicates negative correlation, and a value close to 0 indicates independence. Among the 66 pairs of extractors, 53% are independent. Five pairs of sources are positively correlated (although their Kappa values are very close to 0), because they apply the same extraction techniques (sometimes differing only in parameter settings) or examine the same type of Web content. We observe negative correlation on 40% of the pairs. It is often caused by considering different types of Web content, but sometimes even extractors working on the same type of Web content are highly anti-correlated when they apply different techniques.
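
To make the idea of "intersection beyond what independence would produce" concrete, here is one possible formulation of a Kappa-style measure for two extractors. It is an assumption for illustration and may differ from the exact definition used in the cited study: the observed overlap fraction is compared with the overlap expected if the two outputs were independent samples from a common universe of triples. The function name kappa and the toy sets are hypothetical.

```python
def kappa(triples_a, triples_b, universe_size):
    """Kappa-style correlation between two sets of extracted triples.

    observed: fraction of the universe extracted by both
    expected: fraction expected in both if the two were independent
    Result > 0 suggests positive correlation, < 0 negative, near 0 independence.
    """
    observed = len(triples_a & triples_b) / universe_size
    expected = len(triples_a) * len(triples_b) / universe_size ** 2
    if expected == 1:
        raise ValueError("both extractors return the whole universe")
    return (observed - expected) / (1 - expected)


# Toy universe of 100 triples, each extractor returns 50 of them.
same = kappa(set(range(50)), set(range(50)), 100)          # identical outputs
half = kappa(set(range(50)), set(range(25, 75)), 100)      # overlap of 25
disjoint = kappa(set(range(50)), set(range(50, 100)), 100) # no overlap
print(same, half, disjoint)
```

With 50 triples each out of 100, independence predicts an overlap of 25, so the middle case scores exactly 0 even though the two outputs share many triples.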

Considering correlations
Positive and negative correlation are characterized through the joint recall of the sources.
Exact solution: we can express these probabilities using an exponential number of correlation parameters
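
A small sketch of how joint recall can signal correlation, assuming that two sources are positively correlated on true triples when their joint recall exceeds the product of their individual recalls, and negatively correlated when it falls below. That criterion is stated here as an assumption, since the formulas on the slide are not preserved. The data is the ten-triple Obama example from the backup slide notes; the function names are hypothetical.

```python
import math

# Triples t1..t10 by index (data from the Obama example table).
PROVIDED = {
    "S1": {1, 2, 6, 7, 8, 9, 10},
    "S2": {1, 2, 4, 5, 7, 8, 9},
    "S3": {3, 4, 5, 7, 10},
    "S4": {1, 4, 6, 8, 9, 10},
    "S5": {1, 4, 6, 8, 9, 10},
}
TRUE_TRIPLES = {1, 3, 4, 6, 7, 10}


def joint_recall(*names):
    """Fraction of true triples provided by every one of the named sources."""
    common = set(TRUE_TRIPLES)
    for name in names:
        common &= PROVIDED[name]
    return len(common) / len(TRUE_TRIPLES)


def correlation_sign(a, b):
    """Compare joint recall with the value expected under independence."""
    joint = joint_recall(a, b)
    independent = joint_recall(a) * joint_recall(b)
    if math.isclose(joint, independent):
        return "independent"
    return "positive" if joint > independent else "negative"


print("S4, S5:", correlation_sign("S4", "S5"))
print("S1, S3:", correlation_sign("S1", "S3"))
```

Storing one such joint parameter for every subset of sources is what makes the exact solution exponential in the number of sources.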

Aggressive approximation
Partial independence assumptions
One parameter per source: the correlation between S_i and the other sources
Linear number of parameters
But: low accuracy

Approximation levels
Exact solution: no independence assumptions; high accuracy; exponential size
Elastic approximation: add parameters for a closer approximation; trade efficiency for accuracy
Aggressive approximation: partial independence assumptions; low accuracy; linear size

Elastic approximation
3 iterations (steps) achieve near-optimal accuracy
[Figure: iterations of the elastic approximation]

Comparisons
Our techniques: PrecRec & PrecRecCorr
Union-K: a triple is correct if at least K% of sources provide it
3-Estimate [Galland et al. WSDM 2010]: iteratively computes trustworthiness
LTM [Zhao et al. PVLDB 2012]: uses graphical models and Gibbs sampling
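
The Union-K baseline is simple enough to sketch directly from its one-line definition. The example below applies it to the ten-triple Obama table from the backup slide notes (sources S1 to S5; t1, t3, t4, t6, t7, t10 are true). The function name union_k and the integer encoding of triples are hypothetical.

```python
# Triples t1..t10 by index (data from the Obama example table).
PROVIDED = {
    "S1": {1, 2, 6, 7, 8, 9, 10},
    "S2": {1, 2, 4, 5, 7, 8, 9},
    "S3": {3, 4, 5, 7, 10},
    "S4": {1, 4, 6, 8, 9, 10},
    "S5": {1, 4, 6, 8, 9, 10},
}
TRUE_TRIPLES = {1, 3, 4, 6, 7, 10}


def union_k(provided, k):
    """Accept a triple if at least k percent of the sources provide it."""
    n_sources = len(provided)
    all_triples = set().union(*provided.values())
    accepted = []
    for triple in sorted(all_triples):
        votes = sum(triple in triples for triples in provided.values())
        # Integer comparison avoids floating-point issues at the threshold.
        if votes * 100 >= k * n_sources:
            accepted.append(triple)
    return accepted


for k in (25, 50, 75):
    accepted = union_k(PROVIDED, k)
    correct = [t for t in accepted if t in TRUE_TRIPLES]
    print(f"Union-{k}: accepted={accepted} "
          f"precision={len(correct) / len(accepted):.2f} "
          f"recall={len(correct) / len(TRUE_TRIPLES):.2f}")
```

On this table Union-50 accepts the false triples t8 and t9, which four of the five sources provide, and rejects the true triple t3, which only S3 provides. This is the weakness of voting that the quality-aware techniques target.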

Three real-world datasets
Restaurant [Marian et al. DE Bull, 2011]: 7 sources, 93 triples
Book [Dong et al. PVLDB, 2009]: 879 sources, 225 triples
ReVerb [Fader et al. EMNLP, 2011]: 6 extractors, 2407 triples

Restaurant [figure]

Book [figure]

ReVerb [figure]

Synthetic data: low precision [figure]

Synthetic data: high precision [figure]

Synthetic data: low recall [figure]

Synthetic data: correlations [figure]

Error diagnosis

Contributions
Fusion techniques that consider source quality and correlations
The number of correlation parameters grows exponentially, but we provide a scalable solution
Evaluation on real-world and synthetic data shows that our techniques are more effective than the state of the art

The data fusion problem
Naïve approach: simple majority voting achieves relatively low precision and recall
Example knowledge triples, each extracted by some subset of the sources S1 to S5:
<Daniel Radcliffe, played role, Harry Potter>
<Daniel Radcliffe, spouse, Bonnie Wright>
<Daniel Radcliffe, acted in, Frankenstein>
<Emma Watson, acted in, Harry Potter>
<J. K. Rowling, acted in, Harry Potter>
<Richard Harris, played role, Dumbledore>
<Michael Gambon, played role, Dumbledore>
<Tim Burton, directed, Harry Potter>
<Daniel Craig, acted in, Harry Potter>
<Rupert Grint, acted in, Harry Potter>
Example table from the slide notes (✓ = the source provides the triple, - = it does not):
ID   Knowledge triple                              Correct?  S1  S2  S3  S4  S5
t1   <Obama, profession, president>                Yes       ✓   ✓   -   ✓   ✓
t2   <Obama, died, 1982>                           No        ✓   ✓   -   -   -
t3   <Obama, profession, lawyer>                   Yes       -   -   ✓   -   -
t4   <Obama, religion, Christian>                  Yes       -   ✓   ✓   ✓   ✓
t5   <Obama, age, 50>                              No        -   ✓   ✓   -   -
t6   <Obama, support, White Sox>                   Yes       ✓   -   -   ✓   ✓
t7   <Obama, spouse, Michelle>                     Yes       ✓   ✓   ✓   -   -
t8   <Obama, administered by, John G. Roberts>     No        ✓   ✓   -   ✓   ✓
t9   <Obama, surgical operation, 05/01/2011>       No        ✓   ✓   -   ✓   ✓
t10  <Obama, profession, community organizer>      Yes       ✓   -   ✓   ✓   ✓

Extracting web data
Each extractor outputs <subject, predicate, object> triples.
Different extractors can extract different data from the same document

Semantics
Triple independence: whether a source provides triple t1 is independent of whether it provides t2.
Open-world: if a triple is not provided by a source, it is considered unknown rather than false.

Independence assumption
Assumes source independence! Do we need to worry about correlations?

Experimental evaluation
Effectiveness: comparison with state-of-the-art techniques on real-world data
Efficiency: evaluation of the approximation algorithms
Pushing the limits with synthetic data

Execution time
Time (sec)         ReVerb   Restaurant   Book
Union-25           0.39     0.56         3.86
Union-50           0.14     0.32         3.71
Union-75           0.11     0.35         3.00
3-Estimate         0.7      0.06         39
LTM (10 iter)      49       5.3          3791
PrecRec            2.6      0.3          35
PrecRecCorr        124      5.4          6786
PrecRecCorr-LVL3   79       2.25         2452