Published by Marjorie Harvey. Modified over 9 years ago.
1 Truth Discovery with Multiple Conflicting Information Providers on the Web (KDD '07)
2 Motivation
Example: authors of books
– We tried to find out who wrote the book “Rapid Contextual Design”.
– Different online bookstores list many different sets of authors: some give accurate information, others give incomplete information.
3 Motivation
Is the World Wide Web always trustworthy? Unfortunately, the answer is “no”. Nothing guarantees that information on the web is correct.
4 Motivation
Different web sites often provide conflicting information on a subject.
Share of Internet users who trust each kind of site:
– news web sites: 54%
– web sites that sell products: 26%
– blogs: 12%
5 Veracity
Veracity, i.e., conformity to truth.
Problem: how to find true facts from a large amount of conflicting information on many subjects, provided by various web sites.
This paper proposes an algorithm called TruthFinder:
– A web site is trustworthy if it provides many pieces of true information, and a piece of information is likely to be true if it is provided by many trustworthy web sites.
6 Problem Definitions
Facts: properties of the objects.
[Figure labels: Bookstore, Books, Authors; Readers, News, Emotions]
7 Problem Definitions
Definition 1 (Confidence of facts):
– The confidence of a fact f, denoted s(f), is the probability that f is correct, to the best of our knowledge.
Definition 2 (Trustworthiness of web sites):
– The trustworthiness of a web site w, denoted t(w), is the expected confidence of the facts provided by w.
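Definition 2 can be illustrated with a minimal sketch. It assumes that "expected confidence" is a plain, unweighted average over the facts a site provides; the function name `trustworthiness` and the sample confidences are hypothetical.

```python
from statistics import mean

def trustworthiness(fact_confidences):
    """Return t(w) as the unweighted average of s(f) over the facts site w provides.

    Assumption: every fact from the site counts equally.
    """
    if not fact_confidences:
        raise ValueError("a site must provide at least one fact")
    return mean(fact_confidences)

# Hypothetical confidences s(f) for three facts offered by one site.
t_w = trustworthiness([0.9, 0.8, 0.7])
print(round(t_w, 3))
```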
8 Problem Definitions
Implication between facts
– Imp(f₁ → f₂): how much f₂’s confidence should be increased or decreased according to f₁’s confidence.
– Imp(f₁ → f₂) is a value between -1 and 1. A positive value indicates that if f₁ is correct, f₂ is likely to be correct. A negative value means that if f₁ is correct, f₂ is likely to be wrong.
– Imp(f₁ → f₂) = sim(f₁, f₂) - base_sim, where sim(f₁, f₂) is the similarity between f₁ and f₂, and base_sim is a threshold for similarity.
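The implication formula can be sketched as follows. The slide leaves sim domain-specific, so this example assumes a generic string similarity from Python's `difflib`; that choice, the function name `implication`, and the default `base_sim=0.5` are illustrative assumptions.

```python
from difflib import SequenceMatcher

def implication(f1, f2, base_sim=0.5):
    """Imp(f1 -> f2) = sim(f1, f2) - base_sim.

    Assumption: sim is a character-level string similarity in [0, 1].
    """
    sim = SequenceMatcher(None, f1, f2).ratio()
    return sim - base_sim

# Identical facts support each other; unrelated facts count against each other.
print(implication("Jane Smith", "Jane Smith"))
print(implication("abc", "xyz"))
```

Because sim lies in [0, 1], the result lies in [-base_sim, 1 - base_sim], which stays inside the [-1, 1] range stated on the slide.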
9 Computational Model
– Heuristic 1: Usually there is only one true fact for a property of an object.
– Heuristic 2: This true fact appears to be the same or similar on different web sites.
– Heuristic 3: The false facts on different web sites are less likely to be the same or similar.
– Heuristic 4: In a certain domain, a web site that provides mostly true facts for many objects will likely provide true facts for other objects.
10 Computational Model
– If a fact is provided by many trustworthy web sites, it is likely to be true.
– If a fact conflicts with the facts provided by many trustworthy web sites, it is unlikely to be true.
– A web site is trustworthy if it provides facts with high confidence.
– Web site trustworthiness and fact confidence can each be determined from the other.
– True facts are more consistent than false facts (Heuristic 3).
11 Computational Model - Basic Inference
Suppose t(w₁) = 0.9 and t(w₂) = 0.99.
– In terms of trustworthiness, t(w₂) = 1.1 × t(w₁).
– In terms of trustworthiness score, τ(w₂) = 2 × τ(w₁), where τ(w) = −ln(1 − t(w)).
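The factor of 2 on this slide can be checked numerically. The sketch below assumes the score is the log-scale transform tau(w) = -ln(1 - t(w)), which reproduces the slide's numbers; the function name `trust_score` is hypothetical.

```python
import math

def trust_score(t):
    """Log-scale trust score: tau(w) = -ln(1 - t(w)) (assumed definition)."""
    return -math.log(1.0 - t)

tau1 = trust_score(0.9)    # about 2.30
tau2 = trust_score(0.99)   # about 4.61
print(0.99 / 0.9, tau2 / tau1)
```

On the raw scale the two sites differ by only 10%, while on the log scale the second site is about twice as strong, which better reflects that it is wrong ten times less often.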
12 Computational Model - Basic Inference
Suppose f₁ is provided by w₁ and w₂. If f₁ is wrong, then both w₁ and w₂ are wrong.
The probability that both of them are wrong is
– (1 − t(w₁)) · (1 − t(w₂))
The probability that f₁ is not wrong is
– 1 − (1 − t(w₁)) · (1 − t(w₂))
In general, s(f) can be computed as:
– s(f) = 1 − ∏_{w ∈ W(f)} (1 − t(w)), where W(f) is the set of web sites that provide f.
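The two-site computation above generalizes to any number of sites. A minimal sketch, assuming the sites err independently; the function name `fact_confidence` is hypothetical.

```python
from math import prod

def fact_confidence(site_trust):
    """s(f) = 1 - product of (1 - t(w)) over the sites that provide f.

    Assumption: the sites make their errors independently.
    """
    return 1.0 - prod(1.0 - t for t in site_trust)

# The slide's case: f1 is provided by w1 (t = 0.9) and w2 (t = 0.99).
print(round(fact_confidence([0.9, 0.99]), 6))
```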
13 Computational Model - Basic Inference
14 Computational Model - Inferences between Facts
– Adjusted confidence score
– New score of the fact
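The formulas for this slide did not survive in the transcript. The sketch below shows one plausible form, and every detail of it is an assumption: each fact has a log-scale confidence score sigma(f) = -ln(1 - s(f)), the adjusted score adds the scores of related facts weighted by their implication and by a factor `rho`, and the new confidence is recovered as 1 - exp(-score). The default 0.5 is borrowed from one of the two parameter values listed on the Experiments slide.

```python
import math

def adjusted_scores(sigma, imp, rho=0.5):
    """Adjust each fact's score by the scores of related facts.

    sigma: dict fact -> confidence score (assumed log scale).
    imp:   dict (source_fact, target_fact) -> implication in [-1, 1].
    rho:   weight of the influence from related facts (assumed parameter).
    """
    adjusted = {}
    for f in sigma:
        influence = sum(sigma[g] * imp.get((g, f), 0.0) for g in sigma if g != f)
        adjusted[f] = sigma[f] + rho * influence
    return adjusted

def score_to_confidence(score):
    """Map a log-scale score back to a confidence: s = 1 - exp(-score)."""
    return 1.0 - math.exp(-score)

# Two hypothetical conflicting facts about the same object.
sigma = {"f1": 2.0, "f2": 1.0}
imp = {("f1", "f2"): -0.5, ("f2", "f1"): -0.5}
adj = adjusted_scores(sigma, imp)
print(adj, {f: round(score_to_confidence(v), 3) for f, v in adj.items()})
```

In this example the stronger fact loses little, while the weaker fact loses half of its score because a well-supported fact contradicts it.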
15 Computational Model - Handling Additional Subtlety
Different web sites are not independent of each other
– dampening factor
The confidence of a fact f can easily be negative
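Both subtleties can be addressed with a squashing function. The slide's formula was not preserved, so the logistic form below is an assumption, as are the names `undamped_confidence` and `dampened_confidence` and the parameter `gamma`; its default 0.3 is borrowed from one of the two parameter values listed on the Experiments slide.

```python
import math

def undamped_confidence(score):
    """Plain inverse of the log-scale score; goes negative when score < 0."""
    return 1.0 - math.exp(-score)

def dampened_confidence(score, gamma=0.3):
    """Logistic squashing with dampening factor gamma (assumed form).

    The result always lies in (0, 1), and gamma < 1 shrinks the effect of
    scores that were inflated by treating dependent sites as independent.
    """
    return 1.0 / (1.0 + math.exp(-gamma * score))

# A fact contradicted by strong facts may end up with a negative score.
print(undamped_confidence(-1.0))   # negative, not a valid probability
print(dampened_confidence(-1.0))   # stays between 0 and 0.5
```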
16 Computational Model - Iterative Computation New score of the fact
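The iterative computation can be sketched as a fixed-point loop that alternates between site trustworthiness and fact confidence. This is a simplified illustration, not the presented algorithm: it omits the implication between facts, and the initial trust `t0`, the dampening factor `gamma`, the tolerance, and the function name `truth_finder_sketch` are all assumptions.

```python
import math

def truth_finder_sketch(claims, t0=0.9, gamma=0.3, tol=1e-6, max_iter=100):
    """claims: dict site -> non-empty set of facts. Returns (trust, confidence)."""
    facts = set().union(*claims.values())
    trust = {w: t0 for w in claims}
    conf = {}
    for _ in range(max_iter):
        # Site trust -> log-scale score (clamped so log never receives 0).
        tau = {w: -math.log(max(1.0 - t, 1e-12)) for w, t in trust.items()}
        # Fact score = sum of scores of the sites providing it, then squash.
        conf = {}
        for f in facts:
            sigma = sum(tau[w] for w, fs in claims.items() if f in fs)
            conf[f] = 1.0 / (1.0 + math.exp(-gamma * sigma))
        # Site trust = average confidence of the facts it provides.
        new_trust = {w: sum(conf[f] for f in fs) / len(fs) for w, fs in claims.items()}
        change = max(abs(new_trust[w] - trust[w]) for w in claims)
        trust = new_trust
        if change < tol:
            break
    return trust, conf

# Hypothetical data: sites A and B agree on fact x, site C claims y instead.
trust, conf = truth_finder_sketch({"A": {"x"}, "B": {"x"}, "C": {"y"}})
print(trust, conf)
```

The loop stops once no site's trustworthiness changes by more than `tol`, which is one simple convergence test among several possible ones.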
17 Experiments
Baseline
– Voting: this method chooses the fact that is provided by the most web sites.
Parameters: ρ = 0.5, γ = 0.3
Book Authors Dataset:
– 1,265 computer science books
– Collected using ISBN search on www.abebooks.com
– 894 bookstores and 34,031 listings
– 5.4 different sets of authors per book on average
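The voting baseline is simple enough to sketch in a few lines. The function name `vote` and the sample listings are hypothetical; ties are broken by first occurrence, which is an assumption the slide does not address.

```python
from collections import Counter

def vote(listings):
    """Return the fact provided by the most web sites.

    listings: one claimed fact per web site, e.g. an author string.
    """
    return Counter(listings).most_common(1)[0][0]

# Hypothetical author listings from four bookstores for one book.
print(vote(["A. Author", "A. Author", "A. Author; B. Writer", "A. Author"]))
```

Voting treats every site as equally reliable, which is exactly what the trustworthiness-based approach is meant to improve on.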
18 Experiments
Evaluation: randomly select 100 books and manually find out their authors.
Accuracy: partially correct facts receive partial credit:
– last name, first name, and middle name are weighted 3:2:1
– for example, “Graeme Simsion” scores 5/6 (middle name omitted)
If f₁ has x authors and f₂ has y authors, and there are z shared ones, then imp(f₁ → f₂) = z/x − base_sim
– base_sim = 0.5
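The partial-credit rule and the author-set implication can be sketched together. The dictionary representation of a name, the function names, and the middle initial "X" are hypothetical; only the 3:2:1 weights, the 5/6 example, and the z/x − 0.5 formula come from the slide.

```python
WEIGHTS = {"last": 3, "first": 2, "middle": 1}

def name_score(claimed, truth):
    """Score one author name with last:first:middle weighted 3:2:1.

    A part earns its weight only if it is present and equal to the truth.
    """
    total = sum(WEIGHTS.values())
    matched = sum(
        w for part, w in WEIGHTS.items()
        if claimed.get(part) and claimed.get(part) == truth.get(part)
    )
    return matched / total

def author_implication(f1_authors, f2_authors, base_sim=0.5):
    """imp(f1 -> f2) = z/x - base_sim, with x = |f1| and z = shared authors."""
    f1, f2 = set(f1_authors), set(f2_authors)
    return len(f1 & f2) / len(f1) - base_sim

# Hypothetical true name with middle initial "X"; the listing omits it.
truth = {"first": "Graeme", "middle": "X", "last": "Simsion"}
claimed = {"first": "Graeme", "last": "Simsion"}
print(name_score(claimed, truth))             # 5/6
print(author_implication(["a"], ["a", "b"]))  # 0.5
print(author_implication(["a", "b"], ["a"]))  # 0.0
```

Because the formula divides by x rather than y, the implication is asymmetric: a listing with a subset of the authors implies the full listing more strongly than the reverse.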
19 Experiments - Book Authors Dataset
20 Experiments - Book Authors Dataset
One book's result may contain multiple errors.
Missing authors:
– only a subset of all authors is provided
21 Experiments - Query with Google
Google is good at finding authoritative web sites. But do these web sites provide accurate information?
Compare the online bookstores
– with the highest ranks by Google vs.
– with the highest trustworthiness found by TruthFinder
Querying Google with “bookstore”
– Find all bookstores that exist in their dataset from the top 300 Google results.
22 Experiments - Query with Google
23 Conclusion
– Veracity problem
– TruthFinder algorithm