Resolving Conflicts in Heterogeneous Data by Truth Discovery and Source Reliability Estimation Qi Li 1, Yaliang Li 1, Jing Gao 1, Bo Zhao 2, Wei Fan 3,

Slides:



Advertisements
Similar presentations
On the Optimality of Probability Estimation by Random Decision Trees Wei Fan IBM T.J.Watson.
Advertisements

Systematic Data Selection to Mine Concept Drifting Data Streams Wei Fan IBM T.J.Watson.
A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.
Dialogue Policy Optimisation
AGIFORS--RM Study Group New York City, March 2000 Lawrence R. Weatherford, PhD University of Wyoming Unconstraining Methods.
Maintaining Variance and k-Medians over Data Stream Windows Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O’Callaghan Stanford University.
On the alternative approaches to ITRF formulation. A theoretical comparison. Department of Geodesy and Surveying Aristotle University of Thessaloniki Athanasios.
Truth Discovery with Multiple Confliction Information Providers on the Web Xiaoxin Yin, Jiawei Han, Philip S.Yu Industrial and Government Track short paper.
An Approach to Evaluate Data Trustworthiness Based on Data Provenance Department of Computer Science Purdue University.
Chapter 7 – K-Nearest-Neighbor
On Appropriate Assumptions to Mine Data Streams: Analyses and Solutions Jing Gao† Wei Fan‡ Jiawei Han† †University of Illinois at Urbana-Champaign ‡IBM.
Heterogeneous Consensus Learning via Decision Propagation and Negotiation Jing Gao † Wei Fan ‡ Yizhou Sun † Jiawei Han † †University of Illinois at Urbana-Champaign.
Heterogeneous Consensus Learning via Decision Propagation and Negotiation Jing Gao† Wei Fan‡ Yizhou Sun†Jiawei Han† †University of Illinois at Urbana-Champaign.
1 Ensembles of Nearest Neighbor Forecasts Dragomir Yankov, Eamonn Keogh Dept. of Computer Science & Eng. University of California Riverside Dennis DeCoste.
Clustering Ram Akella Lecture 6 February 23, & 280I University of California Berkeley Silicon Valley Center/SC.
CSci 6971: Image Registration Lecture 5: Feature-Base Regisration January 27, 2004 Prof. Chuck Stewart, RPI Dr. Luis Ibanez, Kitware Prof. Chuck Stewart,
Principles of Time Scales
Reconstructing the Emission Height of Volcanic SO2 from Back Trajectories: Comparison of Explosive and Effusive Eruptions Modeling trace gas transport.
Principles of the Global Positioning System Lecture 11 Prof. Thomas Herring Room A;
Fenglong Ma1, Yaliang Li1, Qi Li1, Minghui Qiu2,
Incomplete Graphical Models Nan Hu. Outline Motivation K-means clustering Coordinate Descending algorithm Density estimation EM on unconditional mixture.
Confidence Intervals and Hypothesis Testing - II
Latent (S)SVM and Cognitive Multiple People Tracker.
Applications of Bayesian sensitivity and uncertainty analysis to the statistical analysis of computer simulators for carbon dynamics Marc Kennedy Clive.
DATA FROM A SAMPLE OF 25 STUDENTS ABBAB0 00BABB BB0A0 A000AB ABA0BA.
Slide 1 Estimating Performance Below the National Level Applying Simulation Methods to TIMSS Fourth Annual IES Research Conference Dan Sherman, Ph.D. American.
Trust-Aware Optimal Crowdsourcing With Budget Constraint Xiangyang Liu 1, He He 2, and John S. Baras 1 1 Institute for Systems Research and Department.
A Confidence-Aware Approach for Truth Discovery on Long-Tail Data
Time Series Data Analysis - I Yaji Sripada. Dept. of Computing Science, University of Aberdeen2 In this lecture you learn What are Time Series? How to.
BIA 2610 – Statistical Methods Chapter 1 – Data and Statistics.
High-resolution computational models of genome binding events Yuan (Alan) Qi Joint work with Gifford and Young labs Dana-Farber Cancer Institute Jan 2007.
Basic Business Statistics
ECE 8443 – Pattern Recognition LECTURE 07: MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION Objectives: Class-Conditional Density The Multivariate Case General.
Earth, Atmospheric and Planetary Sciences Massachusetts Institute of Technology 77 Massachusetts Avenue | Cambridge MA V F
The Database and Info. Systems Lab. University of Illinois at Urbana-Champaign User Profiling in Ego-network: Co-profiling Attributes and Relationships.
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
Stefano Vezzoli, CROS NT
Uncertainty Management in Rule-based Expert Systems
MARE 250 Dr. Jason Turner Introduction to Statistics.
28. Multiple regression The Practice of Statistics in the Life Sciences Second Edition.
Chapter1: Introduction Chapter2: Overview of Supervised Learning
Evaluating the Impact of Job Scheduling and Power Management on Processor Lifetime for Chip Multiprocessors (SIGMETRICS 2009) Authors: Ayse K. Coskun,
Consensus Extraction from Heterogeneous Detectors to Improve Performance over Network Traffic Anomaly Detection Jing Gao 1, Wei Fan 2, Deepak Turaga 2,
Chapter 20 Classification and Estimation Classification – Feature selection Good feature have four characteristics: –Discrimination. Features.
Towards Social User Profiling: Unified and Discriminative Influence Model for Inferring Home Locations Rui Li, Shengjie Wang, Hongbo Deng, Rui Wang, Kevin.
1 Module One: Measurements and Uncertainties No measurement can perfectly determine the value of the quantity being measured. The uncertainty of a measurement.
Classification and Prediction: Ensemble Methods Bamshad Mobasher DePaul University Bamshad Mobasher DePaul University.
NetQuest: A Flexible Framework for Large-Scale Network Measurement Lili Qiu University of Texas at Austin Joint work with Han Hee Song.
BIVARIATE/MULTIVARIATE DESCRIPTIVE STATISTICS Displaying and analyzing the relationship between continuous variables.
Crowdsourcing High Quality Labels with a Tight Budget Qi Li 1, Fenglong Ma 1, Jing Gao 1, Lu Su 1, Christopher J. Quinn 2 1 SUNY Buffalo; 2 Purdue University.
Part I Project Initiation.
Data Mining.
Statistics in Management
Data Mining K-means Algorithm
Nodal Methods for Core Neutron Diffusion Calculations
Fast Preprocessing for Robust Face Sketch Synthesis
Collective Network Linkage across Heterogeneous Social Platforms
A Consensus-Based Clustering Method
How to handle missing data values
Enhanced-alignment Measure for Binary Foreground Map Evaluation
Data Mining Practical Machine Learning Tools and Techniques
Community Distribution Outliers in Heterogeneous Information Networks
Ensembles.
Integration of sensory modalities
Ch10 Analysis of Variance.
Estimation Error and Portfolio Optimization
Principles of the Global Positioning System Lecture 11
Psy 425 Tests & Measurements
Mathematical Foundations of BME Reza Shadmehr
STATISTICS INFORMED DECISIONS USING DATA
Presentation transcript:

Resolving Conflicts in Heterogeneous Data by Truth Discovery and Source Reliability Estimation Qi Li 1, Yaliang Li 1, Jing Gao 1, Bo Zhao 2, Wei Fan 3, Jiawei Han 4 1 SUNY Buffalo; 2 Microsoft Research; 3 Huawei Noah’s Ark Lab; 4 University of Illinois 1

Jing Gao UIUChttp:// What Is The Height Of Mount Everest?

6 Source 1 Source 2 Source 3 Source 4 Source 5 IntegrationIntegration Object

A Straightforward Solution Voting/Averaging – Take the value that is claimed by majority of the sources – Or compute the mean of all the claims Limitation – Ignore source reliability Source reliability – Is crucial for finding the true fact but unknown 7

Truth Discovery Principle – Infer both truth and source reliability from the data A source is reliable if it provides many pieces of true information A piece of information is likely to be true if it is provided by many reliable sources 8

Existing Work on Truth Discovery Existing methods – Tackle different challenges in truth discovery Source correlations, source costs, streaming data, …… Limitation when handling data of various types – Apply on one type only: Not enough information to estimate source reliability accurately – Apply on all types but treat them the same way: Each data type’s characteristics are not taken into account 9

Overview of Our Work A framework for discovering truth from data of various types – Integrate all the data of various types in truth and source reliability estimation – Take unique characteristics of data type into consideration 10

Problem Setting Source 2Source 3 ObjectCityHeightCityHeightCityHeight BobNYC1.72NYC1.70NYC1.90 MaryLA1.62LA1.61LA1.85 KateNYC1.74NYC1.72LA1.65 MikeNYC1.72LA1.70DC1.85 JoeDC1.72NYC1.71NYC

Problem Formulation 12

Further Notation 13

CRH Framework 14 Basic idea Truths should be close to the observations from reliable sources Minimize the overall weighted distance to the truths in which reliable sources have high weights

Functions 15

Iterative Procedure 16

Truth Computation 17

Truth Computation 18

Source Weight Assignment 19

20 ObjectCityHeightCityHeightCityHeight BobNYC1.72NYC1.70NYC1.90 MaryLA1.62LA1.61LA1.85 KateNYC1.74NYC1.72LA1.65 MikeNYC1.72LA1.70DC1.85 JoeDC1.72NYC1.71NYC1.85 Input

21 ObjectCityHeightCityHeightCityHeight BobNYC1.72NYC1.70NYC1.90 MaryLA1.62LA1.61LA1.85 KateNYC1.74NYC1.72LA1.65 MikeNYC1.72LA1.70DC1.85 JoeDC1.72NYC1.71NYC1.85 Initialization ObjectCityHeight BobNYC1.77 MaryLA1.69 KateNYC1.70 MikeDC1.76 JoeNYC1.76 Input Truth Initialize by Voting or Averaging

WeightIteration 1 Source Source Source ObjectCityHeightCityHeightCityHeight BobNYC1.72NYC1.70NYC1.90 MaryLA1.62LA1.61LA1.85 KateNYC1.74NYC1.72LA1.65 MikeNYC1.72LA1.70DC1.85 JoeDC1.72NYC1.71NYC1.85 Initialization ObjectCityHeight BobNYC1.77 MaryLA1.69 KateNYC1.70 MikeDC1.76 JoeNYC1.76 Input Truth SourceWeight Initialize by Voting or Averaging

WeightIteration 1 Source Source Source ObjectCityHeightCityHeightCityHeight BobNYC1.72NYC1.70NYC1.90 MaryLA1.62LA1.61LA1.85 KateNYC1.74NYC1.72LA1.65 MikeNYC1.72LA1.70DC1.85 JoeDC1.72NYC1.71NYC1.85 Input Truth SourceWeight InitializationIteration 1 ObjectCityHeightCityHeight BobNYC1.77NYC1.72 MaryLA1.69LA1.62 KateNYC1.70NYC1.72 MikeDC1.76NYC1.72 JoeNYC1.76NYC1.72

24 ObjectCityHeightCityHeightCityHeight BobNYC1.72NYC1.70NYC1.90 MaryLA1.62LA1.61LA1.85 KateNYC1.74NYC1.72LA1.65 MikeNYC1.72LA1.70DC1.85 JoeDC1.72NYC1.71NYC1.85 InitializationIteration 1 ObjectCityHeightCityHeight BobNYC1.77NYC1.72 MaryLA1.69LA1.62 KateNYC1.70NYC1.72 MikeDC1.76NYC1.72 JoeNYC1.76NYC1.72 WeightIteration 1Iteration 2 Source Source Source Input Truth SourceWeight

25 ObjectCityHeightCityHeightCityHeight BobNYC1.72NYC1.70NYC1.90 MaryLA1.62LA1.61LA1.85 KateNYC1.74NYC1.72LA1.65 MikeNYC1.72LA1.70DC1.85 JoeDC1.72NYC1.71NYC1.85 InitializationIteration 1Iteration 2 ObjectCityHeightCityHeightCityHeight BobNYC1.77NYC1.72NYC1.72 MaryLA1.69LA1.62LA1.62 KateNYC1.70NYC1.72NYC1.74 MikeDC1.76NYC1.72NYC1.72 JoeNYC1.76NYC1.72DC1.72 WeightIteration 1Iteration 2 Source Source Source Input Truth SourceWeight

26 ObjectCityHeightCityHeightCityHeight BobNYC1.72NYC1.70NYC1.90 MaryLA1.62LA1.61LA1.85 KateNYC1.74NYC1.72LA1.65 MikeNYC1.72LA1.70DC1.85 JoeDC1.72NYC1.71NYC1.85 InitializationIteration 1Iteration 2Iteration 3Iteration 4 ObjectCityHeightCityHeightCityHeightCityHeightCityHeight BobNYC1.77NYC1.72NYC1.72NYC1.72NYC1.72 MaryLA1.69LA1.62LA1.62LA1.62LA1.62 KateNYC1.70NYC1.72NYC1.74NYC1.74NYC1.74 MikeDC1.76NYC1.72NYC1.72NYC1.72NYC1.72 JoeNYC1.76NYC1.72DC1.72DC1.72DC1.72 WeightIteration 1Iteration 2Iteration 3Iteration 4 Source Source Source Input Truth SourceWeight

Experiment Evaluation Performance Measure – Error rate: for categorical data type – Mean normalized absolute distance (MNAD): for continuous data type – The lower the better for both 27

Data Sets Weather Forecast Data – We crawled temperature and weather condition from various platforms for 20 cities over a month Stock Data and Flight Data – Different properties about stocks crawled from various websites – Departure and arrival time and gate information crawled from various websites – Available at UCI Adult and Bank Data – We simulate multiple conflicting sources by injecting different levels of noise on original data 28

Baseline Methods Applied to one type only – Categorical data only: Voting – Continuous data only: Mean, Median, GTM Applied to multiple types – Investment, PooledInvestment, 2-Estimates, 3- Estimates, TruthFinder, AccuSim 29

Performance Comparison 30

31 CRH derives better estimates of source reliability by effectively characterizing different data types in a joint model

Varying Number of Reliable Sources Categorical data Continuous data 32 Continuous data

Summary 33 Truth Discovery on Heterogeneous Data – Provide a nice way to combine data of various types when deriving source weights and truths – Present several common loss functions and effective solutions under this framework – Unique characteristics of each data type are considered, and all types contribute to source reliability estimation together – This joint inference improves source reliability estimation and leads to better truth discovery on heterogeneous data Slides, code and datasets available –

34