Characterizing the Uncertainty of Web Data: Models and Experiences
Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti
Università degli Studi Roma Tre, Dipartimento di Informatica ed Automazione


Outline
− Introduction and goals
− Probabilistic models to evaluate the accuracy of web data sources
− Experiencing the models on real-life web data
− Lessons learned

The Web as a Source of Information
Opportunities
− a huge amount of information publicly available
− valuable data repositories can be built by aggregating information spread over many sources
− abundance of redundancy for data of many domains

The Web as a Source of Information
[Blanco et al., WebDB 2010]
[Figure: stock quote data (Min, Max, Vol, Open) for symbols such as IBM, CSCO, AAPL, aggregated from multiple web sources.]

Limitations
− sources are inaccurate, uncertain, and unreliable
− some sources reproduce the contents published by others
[Figure: data conflicts: sources report different values for the max price of the HRBN stock quote.]

Popularity-based Rankings
Several ranking methods exist for web sources (e.g., Google PageRank, Alexa Traffic Rank), mainly based on the popularity of the sources. However, several factors can compromise the quality of data even when it is extracted from authoritative sources:
− errors in the editorial process
− errors in the publishing process
− errors in the data extraction process

Problem Definition
A set of sources (possibly with copiers) provides values of several attributes for a common set of objects.
[Figure: tables published by sources w1, w2, w3; errors in bold.]

Problem Definition
A set of sources (possibly with copiers) provides values of several attributes for a common set of objects. We want to compute automatically:
− a score of accuracy for each web source
− the probability distribution for each value
[Figure: sources w1, w2, w3, and w4 (a copier); what are score(w1), …, score(w4)?]

State-of-the-art
Probabilistic models to evaluate the accuracy of web data sources (i.e., algorithms to reconcile data from inaccurate sources):
− NAIVE (voting)
− ACCU [Yin et al., TKDE 2008; Wu & Marian, WebDB 2007; Galland et al., WSDM 2010]
− DEP [Dong et al., PVLDB 2009]
− M-DEP [Blanco et al., CAiSE 2010; Dong et al., PVLDB 2010]

Goals
The goal of our work is twofold:
− illustrate the state-of-the-art models
− compare the results of these models on the same real-world datasets

NAIVE
− independent sources
− consider a single attribute at a time
− count the votes for each possible value
[Figure: two voting examples over sources and truth: 381 gets 2 votes and 380 gets 1 vote; in one case voting works, in the other it does not.]
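The voting scheme above can be sketched in a few lines (a minimal illustration; the function name and source names are ours, not from the paper):

```python
from collections import Counter

def naive_vote(observations):
    """NAIVE truth discovery for one attribute: every source casts one
    vote and the most frequent value wins."""
    votes = Counter(observations.values())
    return votes.most_common(1)[0][0]

# The slide's example: three sources report 381, 381, and 380.
obs = {"w1": 381, "w2": 381, "w3": 380}
winner = naive_vote(obs)  # 381 gets 2 votes, 380 gets 1 vote
```
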

Limitations of the NAIVE Model
Real sources can exhibit different accuracies, yet every source is considered equivalent, independently of its authority and accuracy. More accurate sources should weigh more than inaccurate sources.

ACCU: a Model Considering the Accuracy of the Sources
The vote of a source is weighted according to its accuracy with respect to that attribute.
Main intuition: it is unlikely that sources agree on errors! Consensus on (many) true values allows the algorithm to compute accuracy; the two steps, Truth Discovery (consensus) and Source Accuracy Discovery, reinforce each other.
[Figure: sources with accuracies 3/3 and 1/3, the truth, and the result of the weighted vote.]
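A minimal sketch of the weighted vote (the published ACCU model uses a more refined, probabilistically normalized weighting; here each vote simply counts as the source's accuracy, as in the slide's example):

```python
def accu_vote(observations, accuracy):
    """Accuracy-weighted vote for one attribute: each source's vote
    counts as its current accuracy estimate."""
    scores = {}
    for source, value in observations.items():
        scores[value] = scores.get(value, 0.0) + accuracy[source]
    # Return the value with the highest weighted vote, plus all scores.
    return max(scores, key=scores.get), scores

obs = {"w1": 381, "w2": 381, "w3": 380}
acc = {"w1": 1.0, "w2": 1.0, "w3": 1 / 3}  # accuracies 3/3, 3/3, 1/3
winner, scores = accu_vote(obs, acc)       # 381 wins with weight 2.0
```
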

Limitations of the ACCU Model
Misleading majorities might be formed by copiers: copiers have to be detected to neutralize the "copied" portion of their votes.
[Figure: independent sources with accuracies 3/3, 2/3, 1/3 plus a copier; both values (380 and 381) get a weighted vote of 3/3.]

A Generative Model of Copiers
[Figure: sources 1 and 2 independently produce their objects from the truth, introducing errors e1 and e2; a copier mixes independently produced objects with copied objects, thereby inheriting the errors of the copied source.]

DEP: A Model to Consider Source Dependencies
Main intuition: copiers can be detected because they propagate false values (i.e., errors). Only the "portion" of a source's opinion that is independent is counted.
[Figure: independent sources with accuracies 3/3, 2/3, 1/3 and a copier that copies 2/3 of its tuples; 380 gets 3/3 as independent weighted vote, while 381 gets 2/3 × 3/3 + 1/3 × 1/3 = 7/9.]
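The slide's arithmetic can be reproduced by discounting each source's accuracy-weighted vote by the fraction of tuples it is believed to provide independently (a sketch with illustrative numbers; DEP estimates these fractions probabilistically):

```python
def dep_vote(observations, accuracy, independent_fraction):
    """Copier-aware weighted vote: a source's vote counts as its
    accuracy times the fraction of tuples it provides independently
    (1.0 for a non-copier)."""
    scores = {}
    for source, value in observations.items():
        scores[value] = (scores.get(value, 0.0)
                         + accuracy[source] * independent_fraction[source])
    return max(scores, key=scores.get), scores

obs = {"w1": 380, "w2": 381, "w4": 381}
acc = {"w1": 1.0, "w2": 2 / 3, "w4": 1 / 3}
frac = {"w1": 1.0, "w2": 1.0, "w4": 1 / 3}  # w4 copies 2/3 of its tuples
winner, scores = dep_vote(obs, acc, frac)
# 380 gets 3/3 = 1; 381 gets 2/3 * 3/3 + 1/3 * 1/3 = 7/9, so 380 wins.
```
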

Contextual Analysis of Truth, Accuracies, and Dependencies
− Truth Discovery
− Source Accuracy Discovery
− Dependence Detection

M-DEP: Improved Evidence from MULTIATT Analysis
An analysis based only on the Volume attribute would fail in this example: it would recognize w2 as a copier of w1, but it would not detect w4 as a copier of w3. Actually, w1 and w2 are independent sources sharing a common format for volumes.
[Figure: MULTIATT(3) example with the truth and sources w1, w2, w3, and w4 (a copier of w3); errors in bold.]

Experiments with Web Data
− Soccer players. Truth: hand-crafted from official pages. Stats: 976 objects and 510 symbols (on average)
− Videogames. Truth: Stats: 227 objects and 40 symbols (on average)
− NASDAQ stock quotes. Truth: Stats: 819 objects and 2902 symbols (on average)

Sample Accuracies of the Sources
Sampled accuracy: the number of true values correctly reported over the number of objects. The Pearson correlation coefficient shows that data quality and popularity do not overlap.
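Both quantities are straightforward to compute; the sketch below uses illustrative data (the symbols and numbers are not taken from the experiments):

```python
def sampled_accuracy(reported, truth):
    """Fraction of objects in the hand-crafted truth for which the
    source reports the correct value."""
    correct = sum(1 for obj, v in truth.items() if reported.get(obj) == v)
    return correct / len(truth)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

truth = {"HRBN": 381, "IBM": 160, "AAPL": 95}
source = {"HRBN": 380, "IBM": 160, "AAPL": 95}
sa = sampled_accuracy(source, truth)  # 2 of 3 objects correct -> 2/3
```
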

Experiments with Models
Probability Concentration measures the performance in computing probability distributions for the observed objects.
− Low scores for Soccer: no authority on the Web
− Differences in Videogames: number of distinct symbols (5 vs 75)
− High SA scores in Finance for every model: large number of distinct symbols

Global Execution Times

Per-Attribute Execution Times

Lessons Learned
Three dimensions to decide which technique to use:
− Characteristics of the domain: domains where authoritative sources exist are much easier to handle; a large number of distinct symbols helps a lot too
− Requirements on the results: on average, more complex models return better results, especially for Probability Concentration
− Execution times: depend on the number of objects and the number of distinct symbols; NAIVE always scales well

Thanks!

Bayesian Analysis (1)
− A random variable X models the possible values of the observed objects
− o is the observation of the value provided by a source for a single object
− Accuracy represents the probability of the event X = x_t, where x_t is the true value
− o_1, …, o_k are the observations for k objects
Goal: compute P(X = x | o_1, …, o_k).

Bayesian Analysis (2)
Goal: compute P(X = x | o_1, …, o_k). According to Bayes' rule, this requires knowing P(o_1, …, o_k | X = x).
Main idea: compute P(o_1, …, o_k | X = x) based on a generative model of the sources.

A Simple Probabilistic Model for ACCU Sources
− Independence assumptions: sources, values, attributes
− Uniform distribution assumption
The accuracy of a source w.r.t. an attribute is the average of the probabilities associated with the values that it provides.

How to Compute the Independent Weighted Vote
Thanks to the independent copying assumption, the vote of a source can be discounted by the probability that the source w is a copier of the source w'.
− c: prior probability of copying a tuple
− 1 − c: prior probability of providing a tuple independently
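Under these priors, the "independent" portion of a source's vote can be sketched as follows (our formulation, assuming the copying decisions for a tuple are independent across the potential source pairs):

```python
def independent_portion(copier_probs, c):
    """Probability that a given tuple of source w was provided
    independently: for each potential original w', either w is not a
    copier of w' or it did not copy this particular tuple (prior c)."""
    portion = 1.0
    for p in copier_probs.values():  # p = P(w is a copier of w')
        portion *= 1.0 - c * p
    return portion

# A source believed to copy from one other source with probability 0.9,
# with a prior c = 2/3 of copying any given tuple:
portion = independent_portion({"w_prime": 0.9}, c=2 / 3)  # 1 - 0.6 = 0.4
```
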

Bayesian Analysis of the Relationships between Sources
Bayes' rule is applied on all observations, i.e., all tuples provided by all the sources. We need to consider all possible relationships between w1 and w2 in our model. A partition of the space of events:
− w1 and w2 are independent
− w1 is a copier of w2
− w2 is a copier of w1

Dealing with Many Attributes
Let us consider only two attributes, A and B, and two sources, w1 and w2. O is the set of objects for which both sources provide a value; it splits into the objects on which the sources provide the same values and those on which they provide different values. E.g.: w1 and w2 provide the same values, where the value of A is true but the value of B is false.

Independent Sources: Both Sources Are Correct
Error rate: ε = 1 − accuracy. Notation: ε_1^A denotes the error rate of source w1 w.r.t. attribute A.
Thanks to the independent attributes assumption:
− (1 − ε_1^A)(1 − ε_1^B): probability that w1 provides a correct value for both attributes
− (1 − ε_2^A)(1 − ε_2^B): probability that w2 provides a correct value for both attributes

Independent Sources: Other Cases
− the possible false values of attribute A
− the possible false values of attribute B
Remaining cases: computed analogously from the error rates and the number of possible false values (uniform distribution assumption).

Wrap-Up
Thanks to the independent values assumption, the probabilities computed for the individual objects can be combined by multiplication into the probability of the whole set of observations.

Dependent Sources
Thanks to the independent attributes assumption, two cases are combined:
− w1 is acting independently: the probability that w1 and w2 both independently provide a correct tuple
− w1 is acting like a copier: the probability that w1 is copying a true tuple produced by w2

How to Compute the Accuracy?
Bayesian analysis of the mutual dependence between consensus and accuracy. Iterate two steps:
− Consensus Analysis: based on the agreement of the sources among their observations on individual objects, and on the current accuracy of the sources, compute the probability for the attributes of every object
− Accuracy Analysis: based on the current probability distributions of the observed object attributes, evaluate the accuracy of the sources
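The two-step iteration can be sketched end-to-end (an ACCU-style simplification without copier detection; names and starting values are illustrative):

```python
def discover(observations, rounds=10):
    """Iterate consensus analysis and accuracy analysis.
    observations: {source: {object: value}}."""
    sources = list(observations)
    objects = {o for vals in observations.values() for o in vals}
    accuracy = {w: 0.8 for w in sources}  # uninformative starting point
    probs = {}
    for _ in range(rounds):
        # Consensus analysis: accuracy-weighted vote per object,
        # normalized into a probability distribution over values.
        for o in objects:
            scores = {}
            for w in sources:
                v = observations[w].get(o)
                if v is not None:
                    scores[v] = scores.get(v, 0.0) + accuracy[w]
            total = sum(scores.values())
            probs[o] = {v: s / total for v, s in scores.items()}
        # Accuracy analysis: a source's accuracy is the average
        # probability of the values it provides.
        for w in sources:
            vals = observations[w]
            accuracy[w] = sum(probs[o][v] for o, v in vals.items()) / len(vals)
    return probs, accuracy

obs = {"w1": {"HRBN": 381, "IBM": 160},
       "w2": {"HRBN": 381, "IBM": 160},
       "w3": {"HRBN": 380, "IBM": 160}}
probs, accuracy = discover(obs)
```

After a few rounds, the minority source w3 ends up with a lower accuracy estimate, and most of the probability mass for HRBN concentrates on 381.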