Resolving Conflicts in Heterogeneous Data by Truth Discovery and Source Reliability Estimation Qi Li 1, Yaliang Li 1, Jing Gao 1, Bo Zhao 2, Wei Fan 3,


1 Resolving Conflicts in Heterogeneous Data by Truth Discovery and Source Reliability Estimation
Qi Li¹, Yaliang Li¹, Jing Gao¹, Bo Zhao², Wei Fan³, Jiawei Han⁴
¹ SUNY Buffalo; ² Microsoft Research; ³ Huawei Noah’s Ark Lab; ⁴ University of Illinois

2 What Is The Height Of Mount Everest?
Jing Gao, UIUC, http://www.ews.uiuc.edu/~jinggao32/61

6 [Figure: Source 1, Source 2, Source 3, Source 4, Source 5; Integration; Object]

7 A Straightforward Solution
Voting/Averaging
– Take the value that is claimed by the majority of the sources
– Or compute the mean of all the claims
Limitation
– Ignores source reliability
Source reliability
– Is crucial for finding the true fact, but is unknown
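As a minimal sketch of this baseline, the snippet below applies majority voting to the categorical column and averaging to the continuous column of the three-source City/Height table used as the running example in this talk. The names `claims`, `majority_vote`, and `baseline_truths` are illustrative assumptions, not taken from the slides.

```python
from collections import Counter
from statistics import mean

# Running example from the talk: claims[source][object] = (city, height).
claims = {
    "Source 1": {"Bob": ("NYC", 1.72), "Mary": ("LA", 1.62), "Kate": ("NYC", 1.74),
                 "Mike": ("NYC", 1.72), "Joe": ("DC", 1.72)},
    "Source 2": {"Bob": ("NYC", 1.70), "Mary": ("LA", 1.61), "Kate": ("NYC", 1.72),
                 "Mike": ("LA", 1.70), "Joe": ("NYC", 1.71)},
    "Source 3": {"Bob": ("NYC", 1.90), "Mary": ("LA", 1.85), "Kate": ("LA", 1.65),
                 "Mike": ("DC", 1.85), "Joe": ("NYC", 1.85)},
}


def majority_vote(values):
    """Return the most frequent value; ties are broken by first occurrence."""
    return Counter(values).most_common(1)[0][0]


def baseline_truths(claims):
    """Vote on the city and average the height, treating all sources equally."""
    objects = next(iter(claims.values())).keys()
    truths = {}
    for obj in objects:
        cities = [table[obj][0] for table in claims.values()]
        heights = [table[obj][1] for table in claims.values()]
        truths[obj] = (majority_vote(cities), round(mean(heights), 2))
    return truths


baseline = baseline_truths(claims)
for obj, (city, height) in baseline.items():
    print(obj, city, height)
```

Every source counts equally here, which is exactly the limitation named on the slide. Mike's city is a three-way tie, so the voted value depends on the tie-breaking rule.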

8 Truth Discovery
Principle
– Infer both truth and source reliability from the data
  A source is reliable if it provides many pieces of true information
  A piece of information is likely to be true if it is provided by many reliable sources

9 Existing Work on Truth Discovery
Existing methods
– Tackle different challenges in truth discovery
  Source correlations, source costs, streaming data, ...
Limitation when handling data of various types
– Apply to one type only: not enough information to estimate source reliability accurately
– Apply to all types but treat them the same way: each data type’s characteristics are not taken into account

10 Overview of Our Work
A framework for discovering truth from data of various types
– Integrate all the data of various types in truth and source reliability estimation
– Take the unique characteristics of each data type into consideration

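11 Problem Setting

| Object | Source 1 City | Source 1 Height | Source 2 City | Source 2 Height | Source 3 City | Source 3 Height |
| --- | --- | --- | --- | --- | --- | --- |
| Bob | NYC | 1.72 | NYC | 1.70 | NYC | 1.90 |
| Mary | LA | 1.62 | LA | 1.61 | LA | 1.85 |
| Kate | NYC | 1.74 | NYC | 1.72 | LA | 1.65 |
| Mike | NYC | 1.72 | LA | 1.70 | DC | 1.85 |
| Joe | DC | 1.72 | NYC | 1.71 | NYC | 1.85 |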

12 Problem Formulation

13 Further Notation

14 CRH Framework
Basic idea
– Truths should be close to the observations from reliable sources
– Minimize the overall weighted distance to the truths, in which reliable sources have high weights
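The basic idea can be written as a constrained optimization problem. The formulas on the slides did not survive in this transcript, so the notation below is assumed: K sources, N objects, M properties, v_im^(k) the value source k claims for property m of object i, v_im^(*) the estimated truth, w_k the weight of source k, d_m a loss function suited to the data type of property m, and δ a regularization function that rules out trivial weight assignments.

```latex
\min_{\mathcal{X}^{(*)},\, \mathcal{W}} \; f\bigl(\mathcal{X}^{(*)}, \mathcal{W}\bigr)
  = \sum_{k=1}^{K} w_k \sum_{i=1}^{N} \sum_{m=1}^{M}
    d_m\!\left(v_{im}^{(*)},\, v_{im}^{(k)}\right)
\qquad \text{s.t.} \quad \delta(\mathcal{W}) = 1
```

Read this way, a source with a large weight w_k contributes heavily to the objective, so the truths are pulled toward its observations. A source that sits far from the truths can only keep the objective small by receiving a small weight.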

15 Functions

16 Iterative Procedure

17 Truth Computation

18 Truth Computation

19 Source Weight Assignment
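The formula on this slide is not preserved in the transcript. As a sketch under stated assumptions: if the weights are constrained so that the values exp(-w_k) sum to 1, then with the truths held fixed each weight is the negative log of that source's share of the total loss. The function name `update_weights`, the smoothing constant `eps`, and the loss values are hypothetical and chosen for illustration only.

```python
import math


def update_weights(losses, eps=1e-12):
    """Assumed update rule: w_k = -log(loss_k / sum of all losses).

    `losses` maps each source to its total distance from the current truths.
    `eps` is a hypothetical smoothing constant that avoids log(0) for a
    source that matches the truths exactly.
    """
    smoothed = {source: loss + eps for source, loss in losses.items()}
    total = sum(smoothed.values())
    return {source: -math.log(loss / total) for source, loss in smoothed.items()}


# Hypothetical per-source losses, not taken from the slides.
example_losses = {"Source 1": 1.0, "Source 2": 1.5, "Source 3": 7.5}
weights = update_weights(example_losses)
for source, weight in weights.items():
    print(source, round(weight, 4))
```

Under this rule a source with a smaller loss always receives a larger weight. The weights it produces need not match the values shown in the worked example, which may use a different normalization.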

20 Input

| Object | Source 1 City | Source 1 Height | Source 2 City | Source 2 Height | Source 3 City | Source 3 Height |
| --- | --- | --- | --- | --- | --- | --- |
| Bob | NYC | 1.72 | NYC | 1.70 | NYC | 1.90 |
| Mary | LA | 1.62 | LA | 1.61 | LA | 1.85 |
| Kate | NYC | 1.74 | NYC | 1.72 | LA | 1.65 |
| Mike | NYC | 1.72 | LA | 1.70 | DC | 1.85 |
| Joe | DC | 1.72 | NYC | 1.71 | NYC | 1.85 |

21 Truth, Initialization (initialize by voting or averaging)

| Object | City | Height |
| --- | --- | --- |
| Bob | NYC | 1.77 |
| Mary | LA | 1.69 |
| Kate | NYC | 1.70 |
| Mike | DC | 1.76 |
| Joe | NYC | 1.76 |

22 Source Weight, Iteration 1

| Source | Weight |
| --- | --- |
| Source 1 | 0.2789 |
| Source 2 | 0.2521 |
| Source 3 | 0.1318 |

23 Truth, Iteration 1

| Object | City | Height |
| --- | --- | --- |
| Bob | NYC | 1.72 |
| Mary | LA | 1.62 |
| Kate | NYC | 1.72 |
| Mike | NYC | 1.72 |
| Joe | NYC | 1.72 |

24 Source Weight, Iteration 2

| Source | Weight |
| --- | --- |
| Source 1 | 0.5552 |
| Source 2 | 0.4539 |
| Source 3 | 0.0149 |

25 Truth, Iteration 2

| Object | City | Height |
| --- | --- | --- |
| Bob | NYC | 1.72 |
| Mary | LA | 1.62 |
| Kate | NYC | 1.74 |
| Mike | NYC | 1.72 |
| Joe | DC | 1.72 |

26 Truth and Source Weight, Initialization through Iteration 4

| Object | Initialization | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 |
| --- | --- | --- | --- | --- | --- |
| Bob | NYC, 1.77 | NYC, 1.72 | NYC, 1.72 | NYC, 1.72 | NYC, 1.72 |
| Mary | LA, 1.69 | LA, 1.62 | LA, 1.62 | LA, 1.62 | LA, 1.62 |
| Kate | NYC, 1.70 | NYC, 1.72 | NYC, 1.74 | NYC, 1.74 | NYC, 1.74 |
| Mike | DC, 1.76 | NYC, 1.72 | NYC, 1.72 | NYC, 1.72 | NYC, 1.72 |
| Joe | NYC, 1.76 | NYC, 1.72 | DC, 1.72 | DC, 1.72 | DC, 1.72 |

| Source | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 |
| --- | --- | --- | --- | --- |
| Source 1 | 0.2789 | 0.5552 | 0.6734 | |
| Source 2 | 0.2521 | 0.4539 | 0.4077 | |
| Source 3 | 0.1318 | 0.0149 | 0.0141 | |
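The truth columns in this worked example can be checked against the source weights. The sketch below assumes a weighted vote for City and a weighted median for Height. Those two rules are an assumption, since the truth-computation formulas are not preserved in this transcript, but with the weights shown on the slides they yield the Iteration 1 to 3 truth values. All function and variable names are illustrative.

```python
from collections import defaultdict

# Input table from the slides: CLAIMS[source][object] = (city, height).
CLAIMS = {
    "Source 1": {"Bob": ("NYC", 1.72), "Mary": ("LA", 1.62), "Kate": ("NYC", 1.74),
                 "Mike": ("NYC", 1.72), "Joe": ("DC", 1.72)},
    "Source 2": {"Bob": ("NYC", 1.70), "Mary": ("LA", 1.61), "Kate": ("NYC", 1.72),
                 "Mike": ("LA", 1.70), "Joe": ("NYC", 1.71)},
    "Source 3": {"Bob": ("NYC", 1.90), "Mary": ("LA", 1.85), "Kate": ("LA", 1.65),
                 "Mike": ("DC", 1.85), "Joe": ("NYC", 1.85)},
}

# Source weights as shown on the slides.
SLIDE_WEIGHTS = {
    "Iteration 1": {"Source 1": 0.2789, "Source 2": 0.2521, "Source 3": 0.1318},
    "Iteration 2": {"Source 1": 0.5552, "Source 2": 0.4539, "Source 3": 0.0149},
    "Iteration 3": {"Source 1": 0.6734, "Source 2": 0.4077, "Source 3": 0.0141},
}


def weighted_vote(values, weights):
    """Categorical truth: the value with the largest total source weight."""
    score = defaultdict(float)
    for value, weight in zip(values, weights):
        score[value] += weight
    return max(score, key=score.get)


def weighted_median(values, weights):
    """Continuous truth: smallest value whose cumulative weight reaches half."""
    half = sum(weights) / 2
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value


def compute_truths(claims, weights):
    """Recompute every object's (city, height) with the source weights fixed."""
    sources = list(claims)
    w = [weights[s] for s in sources]
    truths = {}
    for obj in claims[sources[0]]:
        cities = [claims[s][obj][0] for s in sources]
        heights = [claims[s][obj][1] for s in sources]
        truths[obj] = (weighted_vote(cities, w), weighted_median(heights, w))
    return truths


truths = {label: compute_truths(CLAIMS, w) for label, w in SLIDE_WEIGHTS.items()}
for label, table in truths.items():
    print(label, table)
```

With the Iteration 2 weights, Source 1 alone outweighs Sources 2 and 3 combined. That is why Joe's city flips from NYC to DC and Kate's height moves from 1.72 to 1.74.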

27 Experimental Evaluation
Performance Measure
– Error rate: for categorical data
– Mean normalized absolute distance (MNAD): for continuous data
– Lower is better for both
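A minimal sketch of the two measures follows. The slide does not state how MNAD is normalized, so dividing each property's mean absolute distance by the standard deviation of its ground-truth values is an assumption made here. The function names and the toy inputs are hypothetical.

```python
from statistics import pstdev


def error_rate(predicted, truth):
    """Fraction of categorical entries whose predicted value is wrong."""
    wrong = sum(p != t for p, t in zip(predicted, truth))
    return wrong / len(truth)


def mnad(predicted_by_property, truth_by_property):
    """Mean normalized absolute distance over continuous properties.

    Assumption: each property's mean absolute distance is divided by the
    standard deviation of its ground-truth values, so that properties on
    different scales contribute comparably.
    """
    per_property = []
    for name, truth in truth_by_property.items():
        predicted = predicted_by_property[name]
        scale = pstdev(truth) or 1.0  # guard against a constant property
        mean_abs = sum(abs(p - t) for p, t in zip(predicted, truth)) / len(truth)
        per_property.append(mean_abs / scale)
    return sum(per_property) / len(per_property)


# Hypothetical toy inputs, not taken from the experiments in the talk.
city_error = error_rate(["NYC", "LA", "DC", "NYC"], ["NYC", "LA", "NYC", "NYC"])
height_mnad = mnad({"height": [2.0, 4.0]}, {"height": [1.0, 3.0]})
print(city_error, height_mnad)
```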

28 Data Sets
Weather Forecast Data
– We crawled temperature and weather conditions from various platforms for 20 cities over a month
Stock Data and Flight Data
– Different properties of stocks, crawled from various websites
– Departure and arrival time and gate information, crawled from various websites
– Available at http://lunadong.com/fusionDataSets.htm
UCI Adult and Bank Data
– We simulate multiple conflicting sources by injecting different levels of noise into the original data
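The simulation for the UCI data can be sketched as follows. The exact noise model is not given in this transcript, so the choices here are assumptions: a categorical value is replaced by a different category with probability `noise_level`, and a continuous value receives Gaussian noise whose standard deviation scales with `noise_level`. The function name, its parameters, and the sample records are hypothetical.

```python
import random


def make_noisy_source(records, noise_level, categories, sigma, rng):
    """Return a noisy copy of `records`, a list of (label, value) pairs.

    Assumed noise model: each label is swapped for a different category with
    probability `noise_level`; each value gets Gaussian noise with standard
    deviation `sigma * noise_level`.
    """
    noisy = []
    for label, value in records:
        if rng.random() < noise_level:
            label = rng.choice([c for c in categories if c != label])
        value = value + rng.gauss(0.0, sigma * noise_level)
        noisy.append((label, value))
    return noisy


# Hypothetical ground truth and three simulated sources of decreasing quality.
ground_truth = [("NYC", 1.72), ("LA", 1.62), ("NYC", 1.74), ("DC", 1.70)]
categories = ["NYC", "LA", "DC"]
rng = random.Random(0)
simulated = {
    f"Source {k}": make_noisy_source(ground_truth, level, categories, 0.1, rng)
    for k, level in enumerate([0.1, 0.3, 0.6], start=1)
}
for name, records in simulated.items():
    print(name, records)
```

Giving each simulated source its own noise level yields sources of differing reliability, which is the condition a truth discovery method needs in order to show an advantage over plain voting or averaging.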

29 Baseline Methods
Applied to one type only
– Categorical data only: Voting
– Continuous data only: Mean, Median, GTM
Applied to multiple types
– Investment, PooledInvestment, 2-Estimates, 3-Estimates, TruthFinder, AccuSim

30 Performance Comparison

31 CRH derives better estimates of source reliability by effectively characterizing different data types in a joint model

32 Varying Number of Reliable Sources
[Figure panels: Categorical data; Continuous data]

33 Summary
Truth Discovery on Heterogeneous Data
– Provide a nice way to combine data of various types when deriving source weights and truths
– Present several common loss functions and effective solutions under this framework
– Unique characteristics of each data type are considered, and all types contribute to source reliability estimation together
– This joint inference improves source reliability estimation and leads to better truth discovery on heterogeneous data
Slides, code and datasets available
– http://www.cse.buffalo.edu/~jing
