Published by Janel Miller. Modified over 8 years ago.
1
Resolving Conflicts in Heterogeneous Data by Truth Discovery and Source Reliability Estimation
Qi Li (1), Yaliang Li (1), Jing Gao (1), Bo Zhao (2), Wei Fan (3), Jiawei Han (4)
(1) SUNY Buffalo; (2) Microsoft Research; (3) Huawei Noah’s Ark Lab; (4) University of Illinois
2
What Is the Height of Mount Everest?
Jing Gao, UIUC, http://www.ews.uiuc.edu/~jinggao32/61
6
[Figure: five sources (Source 1 to Source 5), an integration step, and an object]
7
A Straightforward Solution
Voting/Averaging
– Take the value claimed by the majority of the sources
– Or compute the mean of all the claims
Limitation
– Ignores source reliability
Source reliability
– Is crucial for finding the true fact, but is unknown
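A minimal sketch of the voting and averaging baseline, assuming categorical claims are plain strings and continuous claims are floats. The function names are illustrative, and the sample claims are the three sources' values for Joe in the example table later in the deck.

```python
from collections import Counter
from statistics import mean

def majority_vote(claims):
    """Return the most frequently claimed value (ties go to the first one seen)."""
    return Counter(claims).most_common(1)[0][0]

def average(claims):
    """Return the arithmetic mean of all numeric claims."""
    return mean(claims)

# Claims about Joe from Source 1, Source 2 and Source 3
joe_city = majority_vote(["DC", "NYC", "NYC"])
joe_height = round(average([1.72, 1.71, 1.85]), 2)
print(joe_city, joe_height)
```

Both functions treat every source equally, which is exactly the limitation the slide points out.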
8
Truth Discovery
Principle
– Infer both truth and source reliability from the data
– A source is reliable if it provides many pieces of true information
– A piece of information is likely to be true if it is provided by many reliable sources
9
Existing Work on Truth Discovery
Existing methods
– Tackle different challenges in truth discovery: source correlations, source costs, streaming data, ...
Limitation when handling data of various types
– Applied to one type only: not enough information to estimate source reliability accurately
– Applied to all types but treating them the same way: each data type’s characteristics are not taken into account
10
Overview of Our Work
A framework for discovering truth from data of various types
– Integrate data of all types in truth and source reliability estimation
– Take the unique characteristics of each data type into consideration
11
Problem Setting
         Source 1        Source 2        Source 3
Object   City  Height    City  Height    City  Height
Bob      NYC   1.72      NYC   1.70      NYC   1.90
Mary     LA    1.62      LA    1.61      LA    1.85
Kate     NYC   1.74      NYC   1.72      LA    1.65
Mike     NYC   1.72      LA    1.70      DC    1.85
Joe      DC    1.72      NYC   1.71      NYC   1.85
12
Problem Formulation
13
Further Notation
14
CRH Framework
Basic idea
– Truths should be close to the observations from reliable sources
– Minimize the overall weighted distance to the truths, in which reliable sources have high weights
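The slide's equation did not survive extraction. One way to write the stated idea is sketched below. All symbols are assumptions introduced here: K sources with weights w_k, N objects, M properties, v_{im}^{(k)} the value source k claims for property m of object i, v_{im}^{(*)} the estimated truth, and d_m a distance chosen to suit the data type of property m.

```latex
\min_{\{v_{im}^{(*)}\},\,\{w_k\}} \;
\sum_{k=1}^{K} w_k \sum_{i=1}^{N} \sum_{m=1}^{M}
d_m\!\left(v_{im}^{(*)},\, v_{im}^{(k)}\right)
\quad \text{subject to a normalization constraint on } \{w_k\}
```

Some constraint on the weights is needed, since otherwise setting every weight to zero would trivially minimize the sum.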
15
Functions
16
Iterative Procedure
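The body of this slide is missing, so the following is only a sketch of an alternating update loop of the kind the deck walks through later: fix the truths and update the source weights, then fix the weights and update the truths, and repeat. The function names, the inverse-squared-error weights, and the weighted-mean truths are stand-ins chosen for brevity, not the deck's update rules. The sample claims are the Bob and Mary heights from the example table.

```python
def alternate(claims, truths, update_weights, update_truths, max_iters=50, tol=1e-9):
    """Alternate weight and truth updates until the truths move less than tol."""
    weights = {}
    for _ in range(max_iters):
        weights = update_weights(claims, truths)      # truths held fixed
        new_truths = update_truths(claims, weights)   # weights held fixed
        change = max(abs(a - b) for a, b in zip(new_truths, truths))
        truths = new_truths
        if change < tol:
            break
    return truths, weights

def inverse_error_weights(claims, truths, eps=1e-6):
    """Stand-in rule: weight is the inverse of a source's total squared error."""
    return {s: 1.0 / (eps + sum((v - t) ** 2 for v, t in zip(vals, truths)))
            for s, vals in claims.items()}

def weighted_mean_truths(claims, weights):
    """Stand-in rule: each truth is the weighted mean of the sources' claims."""
    total = sum(weights.values())
    n = len(next(iter(claims.values())))
    return [sum(weights[s] * claims[s][i] for s in claims) / total for i in range(n)]

# Heights claimed for Bob and Mary by the three sources
claims = {"Source 1": [1.72, 1.62], "Source 2": [1.70, 1.61], "Source 3": [1.90, 1.85]}
start = weighted_mean_truths(claims, {s: 1.0 for s in claims})  # plain averaging
truths, weights = alternate(claims, start, inverse_error_weights, weighted_mean_truths)
print(truths, weights)
```

The inverse-error stand-in tends to concentrate weight on a single source, which is one reason a gentler weight function may be preferable in practice.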
17
Truth Computation
18
Truth Computation
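The formulas on the two Truth Computation slides were lost. As a hedged illustration, the sketch below assumes weighted voting for categorical properties and a weighted median for continuous ones. These two rules are assumptions, although they do reproduce the Iteration 1 truths in the worked example later in the deck when given the Iteration 1 weights shown there.

```python
from collections import defaultdict

def weighted_vote(claims, weights):
    """Categorical truth: the value with the largest total source weight."""
    score = defaultdict(float)
    for value, w in zip(claims, weights):
        score[value] += w
    return max(score, key=score.get)

def weighted_median(claims, weights):
    """Continuous truth: smallest value whose cumulative weight reaches half the total."""
    half = sum(weights) / 2.0
    running = 0.0
    for value, w in sorted(zip(claims, weights)):
        running += w
        if running >= half:
            return value
    raise ValueError("no claims given")

# Iteration 1 weights for Source 1, Source 2, Source 3 from the worked example
weights = [0.2789, 0.2521, 0.1318]
joe_city = weighted_vote(["DC", "NYC", "NYC"], weights)
mike_city = weighted_vote(["NYC", "LA", "DC"], weights)
kate_height = weighted_median([1.74, 1.72, 1.65], weights)
print(joe_city, mike_city, kate_height)
```

Unlike plain voting, Mike's three-way tie between NYC, LA and DC is broken in favor of the source with the highest weight.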
19
Source Weight Assignment
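The weight formula on this slide is also missing. One plausible rule, assumed here, takes the negative logarithm of each source's share of the total loss, so that a source with a small share receives a large weight. The loss totals below are hypothetical, and the small eps guard is an addition for numerical safety.

```python
import math

def source_weights(losses, eps=1e-12):
    """Map each source's total loss to a weight: -log(loss_k / sum of all losses)."""
    total = sum(losses.values()) + eps
    return {s: -math.log((loss + eps) / total) for s, loss in losses.items()}

# Hypothetical total distances between each source's claims and the current truths
losses = {"Source 1": 0.2, "Source 2": 0.3, "Source 3": 1.5}
weights = source_weights(losses)
print(weights)
```

Because the logarithm grows slowly, a source that agrees closely with the current truths gains weight without immediately drowning out the others.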
20
Input
         Source 1        Source 2        Source 3
Object   City  Height    City  Height    City  Height
Bob      NYC   1.72      NYC   1.70      NYC   1.90
Mary     LA    1.62      LA    1.61      LA    1.85
Kate     NYC   1.74      NYC   1.72      LA    1.65
Mike     NYC   1.72      LA    1.70      DC    1.85
Joe      DC    1.72      NYC   1.71      NYC   1.85
21
Input: same three-source table as on slide 20

Truth (initialize by voting or averaging)
         Initialization
Object   City  Height
Bob      NYC   1.77
Mary     LA    1.69
Kate     NYC   1.70
Mike     DC    1.76
Joe      NYC   1.76
22
Input: same three-source table as on slide 20

Truth (initialize by voting or averaging)
         Initialization
Object   City  Height
Bob      NYC   1.77
Mary     LA    1.69
Kate     NYC   1.70
Mike     DC    1.76
Joe      NYC   1.76

Source Weight
Weight     Iteration 1
Source 1   0.2789
Source 2   0.2521
Source 3   0.1318
23
Input: same three-source table as on slide 20

Truth
         Initialization   Iteration 1
Object   City  Height     City  Height
Bob      NYC   1.77       NYC   1.72
Mary     LA    1.69       LA    1.62
Kate     NYC   1.70       NYC   1.72
Mike     DC    1.76       NYC   1.72
Joe      NYC   1.76       NYC   1.72

Source Weight
Weight     Iteration 1
Source 1   0.2789
Source 2   0.2521
Source 3   0.1318
24
Input: same three-source table as on slide 20

Truth
         Initialization   Iteration 1
Object   City  Height     City  Height
Bob      NYC   1.77       NYC   1.72
Mary     LA    1.69       LA    1.62
Kate     NYC   1.70       NYC   1.72
Mike     DC    1.76       NYC   1.72
Joe      NYC   1.76       NYC   1.72

Source Weight
Weight     Iteration 1   Iteration 2
Source 1   0.2789        0.5552
Source 2   0.2521        0.4539
Source 3   0.1318        0.0149
25
Input: same three-source table as on slide 20

Truth
         Initialization   Iteration 1     Iteration 2
Object   City  Height     City  Height    City  Height
Bob      NYC   1.77       NYC   1.72      NYC   1.72
Mary     LA    1.69       LA    1.62      LA    1.62
Kate     NYC   1.70       NYC   1.72      NYC   1.74
Mike     DC    1.76       NYC   1.72      NYC   1.72
Joe      NYC   1.76       NYC   1.72      DC    1.72

Source Weight
Weight     Iteration 1   Iteration 2
Source 1   0.2789        0.5552
Source 2   0.2521        0.4539
Source 3   0.1318        0.0149
26
Input: same three-source table as on slide 20

Truth
         Initialization   Iteration 1     Iteration 2     Iteration 3     Iteration 4
Object   City  Height     City  Height    City  Height    City  Height    City  Height
Bob      NYC   1.77       NYC   1.72      NYC   1.72      NYC   1.72      NYC   1.72
Mary     LA    1.69       LA    1.62      LA    1.62      LA    1.62      LA    1.62
Kate     NYC   1.70       NYC   1.72      NYC   1.74      NYC   1.74      NYC   1.74
Mike     DC    1.76       NYC   1.72      NYC   1.72      NYC   1.72      NYC   1.72
Joe      NYC   1.76       NYC   1.72      DC    1.72      DC    1.72      DC    1.72

Source Weight
Weight     Iteration 1   Iteration 2   Iteration 3   Iteration 4
Source 1   0.2789        0.5552        0.6734
Source 2   0.2521        0.4539        0.4077
Source 3   0.1318        0.0149        0.0141
27
Experiment Evaluation
Performance measures
– Error rate: for categorical data
– Mean normalized absolute distance (MNAD): for continuous data
– Lower is better for both
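A rough sketch of the two measures follows. The slide does not say how MNAD is normalized, so the per-property `scale` argument is an assumption (for instance, a standard deviation of that property), and the ground-truth values in the usage lines are made up purely for illustration.

```python
def error_rate(predicted, truth):
    """Fraction of categorical entries whose predicted value differs from the truth."""
    wrong = sum(p != t for p, t in zip(predicted, truth))
    return wrong / len(truth)

def mnad(predicted, truth, scale):
    """Mean absolute distance to the truth, divided by an assumed per-property scale."""
    total = sum(abs(p - t) for p, t in zip(predicted, truth))
    return total / (scale * len(truth))

# Made-up ground truth, compared against the voting/averaging initialization
city_err = error_rate(["NYC", "LA", "NYC", "DC", "NYC"],
                      ["NYC", "LA", "NYC", "NYC", "DC"])
height_err = mnad([1.77, 1.69], [1.72, 1.62], scale=0.05)
print(city_err, height_err)
```

Dividing by a scale keeps properties with large numeric ranges from dominating the average when several continuous properties are evaluated together.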
28
Data Sets
Weather Forecast Data
– We crawled temperature and weather conditions from various platforms for 20 cities over a month
Stock Data and Flight Data
– Stock: different properties of stocks crawled from various websites
– Flight: departure and arrival times and gate information crawled from various websites
– Available at http://lunadong.com/fusionDataSets.htm
UCI Adult and Bank Data
– We simulate multiple conflicting sources by injecting different levels of noise into the original data
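The deck does not describe the noise model used for the simulated sources. As one assumed possibility for categorical properties, the sketch below copies the ground truth and, with probability `flip_prob`, replaces each value with a different category chosen at random. The function name, parameter names, and sample values are hypothetical.

```python
import random

def make_noisy_source(truth_values, categories, flip_prob, rng):
    """Copy the truth, replacing each value by another category with probability flip_prob."""
    noisy = []
    for value in truth_values:
        if rng.random() < flip_prob:
            noisy.append(rng.choice([c for c in categories if c != value]))
        else:
            noisy.append(value)
    return noisy

rng = random.Random(0)                      # fixed seed for repeatability
truth = ["NYC", "LA", "NYC", "DC"]          # hypothetical ground truth
categories = ["NYC", "LA", "DC"]
reliable = make_noisy_source(truth, categories, 0.1, rng)
unreliable = make_noisy_source(truth, categories, 0.6, rng)
print(reliable, unreliable)
```

Varying `flip_prob` per source gives a set of sources with different, known reliability levels to test against.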
29
Baseline Methods
Applied to one type only
– Categorical data only: Voting
– Continuous data only: Mean, Median, GTM
Applied to multiple types
– Investment, PooledInvestment, 2-Estimates, 3-Estimates, TruthFinder, AccuSim
30
Performance Comparison
31
CRH derives better estimates of source reliability by effectively characterizing different data types in a joint model
32
Varying Number of Reliable Sources
[Figure panels: Categorical data; Continuous data]
33
Summary
Truth Discovery on Heterogeneous Data
– Provides a way to combine data of various types when deriving source weights and truths
– Presents several common loss functions and effective solutions under this framework
– Unique characteristics of each data type are considered, and all types contribute jointly to source reliability estimation
– This joint inference improves source reliability estimation and leads to better truth discovery on heterogeneous data
Slides, code and datasets available
– http://www.cse.buffalo.edu/~jing