1 Truth Validation and Veracity Analysis with Information Networks Jiawei Han Data Mining Group, Computer Science University of Illinois at Urbana-Champaign.

Presentation transcript:

1 Truth Validation and Veracity Analysis with Information Networks Jiawei Han Data Mining Group, Computer Science University of Illinois at Urbana-Champaign May 22, 2015

2 Outline: TruthFinder: Truth Validation by Information Network Analysis; Beyond TruthFinder: Multiple Versions of Truth and Evolution of Truth; Enhancing Truth Validation by InfoNet Analysis: The RankClus & NetClus Methodology; Summary

3 Motivation: Why truth validation and veracity analysis? Information sharing: sharing trustworthy, quality information; identifying false information among many conflicting claims. Information security: protecting trustworthy information and its sources; identifying suspicious information providers (those that frequently provide false information); tracing back suspicious information providers via information networks.

4 Truth Validation and Veracity Analysis by Information Network Analysis. The trustworthiness problem of the web (according to a survey): 54% of Internet users trust news web sites most of the time, 26% trust web sites that sell products, and 12% trust blogs. TruthFinder: truth discovery on the Web by link analysis. Among multiple conflicting results, can we automatically identify which one is likely the true fact? Veracity (conformity to truth): given a large amount of conflicting information about many objects, provided by multiple web sites (or other information providers), how do we discover the true fact about each object? Xiaoxin Yin, Jiawei Han, Philip S. Yu, “Truth Discovery with Multiple Conflicting Information Providers on the Web”, TKDE’08

5 Conflicting Information on the Web. Different websites often provide conflicting info. on a subject, e.g., authors of “Rapid Contextual Design”:
Online Store       Authors
Powell’s books     Holtzblatt, Karen
Barnes & Noble     Karen Holtzblatt, Jessamyn Wendell, Shelley Wood
A1 Books           Karen Holtzblatt, Jessamyn Burns Wendell, Shelley Wood
Cornwall books     Holtzblatt-Karen, Wendell-Jessamyn Burns, Wood
Mellon’s books     Wendell, Jessamyn
Lakeside books     WENDELL, JESSAMYN / HOLTZBLATT, KAREN / WOOD, SHELLEY
Blackwell online   Wendell, Jessamyn, Holtzblatt, Karen, Wood, Shelley

6 Mapping It to Information Networks. Each object may have a set of conflicting facts, e.g., different author names for a book. Each web site provides some facts. How do we find the true fact for each object? [Figure: three-layer network of web sites (w1–w4), facts (f1–f5), and objects (o1, o2)]

7 Basic Heuristics for Problem Solving
1. There is usually only one true fact for a property of an object.
2. This true fact appears to be the same or similar on different web sites, e.g., “Jennifer Widom” vs. “J. Widom”.
3. The false facts on different web sites are less likely to be the same or similar; false facts are often introduced by random factors.
4. A web site that provides mostly true facts for many objects will likely provide true facts for other objects.

8 Mutual Consolidation between Confidence of Facts and Trustworthiness of Providers. Confidence of facts ↔ trustworthiness of web sites: a fact has high confidence if it is provided by (many) trustworthy web sites; a web site is trustworthy if it provides many facts with high confidence. The TruthFinder mechanism, an overview: initially, each web site is equally trustworthy; based on the above four heuristics, infer fact confidence from web site trustworthiness, and then web site trustworthiness from fact confidence; repeat until a stable state is reached.

9 Analogy to Authority-Hub Analysis. Facts ↔ authorities, web sites ↔ hubs. Differences from authority-hub analysis: linear summation cannot be used; a web site is trustable if it provides accurate facts, rather than many facts; confidence is the probability of being true; different facts about the same object influence each other. [Figure: web site w1 (hub, high trustworthiness) linked to fact f1 (authority, high confidence)]

10 Inference on Trustworthiness. Inference of web site trustworthiness and fact confidence: true facts and trustable web sites become apparent after some iterations. [Figure: network of web sites (w1–w4), facts (f1–f4), and objects (o1, o2)]

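11 Computation Model: t(w) and s(f). The trustworthiness of a web site w, t(w), is the average confidence of the facts it provides: $t(w) = \frac{\sum_{f \in F(w)} s(f)}{|F(w)|}$, where the numerator is the sum of fact confidence and $F(w)$ is the set of facts provided by w. The confidence of a fact f, s(f), is one minus the probability that all web sites providing f are wrong: $s(f) = 1 - \prod_{w \in W(f)} (1 - t(w))$, where $1 - t(w)$ is the probability that w is wrong and $W(f)$ is the set of web sites providing f. [Figure: web sites w1 and w2, with trustworthiness t(w1) and t(w2), linked to fact f1 with confidence s(f1)]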
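The computation model on this slide, combined with the iterate-until-stable loop from the overview, can be sketched in Python. This is a minimal illustration under stated assumptions, not the published TruthFinder implementation: the name `truthfinder_sketch`, the initial trust value `t0`, the fixed number of rounds, and the toy claims are all hypothetical, and the influence between different facts about the same object is left out. In this plain form the scores tend to drift toward 1 as rounds accumulate, so the sketch runs only a few rounds and the scores are best compared relative to each other.

```python
def truthfinder_sketch(claims, rounds=2, t0=0.5):
    """claims maps each web site to the non-empty list of facts it provides.

    Returns (t, s): site trustworthiness and fact confidence after `rounds`
    alternating updates. `rounds` and `t0` are illustrative assumptions.
    """
    # Invert the mapping: which sites provide each fact?
    sites_for = {}
    for site, facts in claims.items():
        for fact in facts:
            sites_for.setdefault(fact, []).append(site)

    t = {site: t0 for site in claims}  # initially every site is equally trustworthy
    s = {}
    for _ in range(rounds):
        # s(f) = 1 - probability that every site providing f is wrong
        s = {}
        for fact, sites in sites_for.items():
            p_all_wrong = 1.0
            for site in sites:
                p_all_wrong *= 1.0 - t[site]
            s[fact] = 1.0 - p_all_wrong
        # t(w) = average confidence of the facts that w provides
        t = {site: sum(s[f] for f in facts) / len(facts)
             for site, facts in claims.items()}
    return t, s


# Toy input (hypothetical): f1 and f2 conflict on one object, f3 and f4 on another.
toy_claims = {
    "w1": ["f1", "f3"],
    "w2": ["f1", "f3"],
    "w3": ["f2", "f3"],
    "w4": ["f2", "f4"],
}
trust, confidence = truthfinder_sketch(toy_claims, rounds=2)
print(sorted(confidence.items(), key=lambda kv: -kv[1]))
```

On this toy input f1 ends up with higher confidence than f2, because one of f2's two backers (w4) also backs the weakly supported f4 and therefore earns lower trustworthiness.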

12 Experiments: Finding Truth of Facts. Determining authors of books: the dataset contains 1265 books listed on abebooks.com; we analyze 100 random books (using book images).
Case                       Voting   TruthFinder   Barnes & Noble
Correct
Miss author(s)             12       2             4
Incomplete names           18       5             6
Wrong first/middle names   1        1             3
Has redundant names        0        2             23
Add incorrect names        1        5             5
No information             0        0             2

13 Experiments: Trustable Info Providers. Finding trustworthy information sources: most trustworthy bookstores found by TruthFinder vs. top-ranked bookstores by Google (query “bookstore”). TruthFinder table (columns: Bookstore, Trustworthiness, #book, Accuracy): TheSaintBookstore, MildredsBooks, Alphacraze.com. Google table (columns: Bookstore, Google rank, #book, Accuracy): Barnes & Noble, Powell’s books.

14 Outline: TruthFinder: Truth Validation by Information Network Analysis; Beyond TruthFinder: Multiple Versions of Truth and Evolution of Truth; Enhancing Truth Validation by InfoNet Analysis: The RankClus & NetClus Methodology; Summary

15 Beyond TruthFinder: Extensions. Limitations of TruthFinder: it assumes only one version of truth, but people may have different, contrasting opinions; it does not consider the time factor, but truth may change with time, e.g., Obama’s status in 2008 and 2009. Needed extensions: multiple versions of truth or opinions; evolution of truth. Philosophy: truth is a relative, evolving, and dynamically changing judgment.

16 Multiple Versions of Truth. Watch out for copy-cats! Copy-cat: some information providers or even news agencies simply copy each other. Falsity could be amplified by copy-cats. How to judge copy-cats: they always copy in a certain dimensional space. Treat copy-cats as one source instead of multiple. Statements can be clustered into multiple centers. False statements remain diverse and spread out, and lack convergence. Statements could be clustered based on different dimensional spaces (contexts), e.g., Java. [Figure: network of web sites (w1–w4), facts (f1–f5), and objects (o1, o2)]
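Treating copy-cats as one source can be sketched as a preprocessing step that merges each known group of copiers into a single pseudo-source before any fact-finding is run. The sketch below assumes the copy-cat groups have already been identified by some other means (the slide does not specify a detection procedure); `merge_copycats` and its arguments are hypothetical names.

```python
def merge_copycats(claims, copy_groups):
    """Collapse each group of copy-cat sites into one pseudo-source.

    claims:      dict mapping site name -> list of facts it provides
    copy_groups: iterable of sets of site names assumed to copy each other
                 (how these groups are detected is outside this sketch)
    """
    merged = {}
    grouped = set()
    for group in copy_groups:
        members = sorted(group)
        facts = set()
        for site in members:
            facts.update(claims.get(site, ()))
        # One vote for the whole group, however many copies exist.
        merged["+".join(members)] = sorted(facts)
        grouped.update(members)
    # Sites outside every group are kept unchanged.
    for site, facts in claims.items():
        if site not in grouped:
            merged[site] = sorted(set(facts))
    return merged


claims = {"w1": ["f1"], "w2": ["f1"], "w3": ["f1"], "w4": ["f2"]}
print(merge_copycats(claims, [{"w1", "w2", "w3"}]))
# {'w1+w2+w3': ['f1'], 'w4': ['f2']}
```

After merging, f1 is backed by one pseudo-source instead of three, so copying no longer amplifies its apparent support.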

17 Transition/Evolution of Truth. Truth is not static: it changes dynamically with time. Associate different versions of truth with different time periods by clustering statements based on time durations. Statements: identify clusters (density-based clustering) and distinguish time-based clusters from outliers. Information providers: leaders, followers, and old-timers. Information-network-based ranking and clustering: powerful analysis by information network analysis.
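Clustering statements by time and separating clusters from outliers can be illustrated with a very small one-dimensional density-style grouping. This is a simplified stand-in, not the method used in the talk: statements are sorted by timestamp, a new group starts whenever the gap to the previous statement exceeds `eps`, and groups smaller than `min_pts` are reported as outliers. The names `time_clusters`, `eps`, and `min_pts` and the toy data are assumptions.

```python
def time_clusters(statements, eps=10, min_pts=2):
    """Group (timestamp, value) statements into dense time periods.

    Returns (clusters, outliers): clusters is a list of lists of statements,
    outliers is a flat list of statements that fell into sparse groups.
    """
    ordered = sorted(statements, key=lambda st: st[0])
    clusters, outliers, current = [], [], []

    def close(group):
        # A group is a cluster only if it is dense enough.
        if len(group) >= min_pts:
            clusters.append(group)
        else:
            outliers.extend(group)

    for st in ordered:
        if current and st[0] - current[-1][0] > eps:
            close(current)
            current = []
        current.append(st)
    if current:
        close(current)
    return clusters, outliers


# Hypothetical statements: (day number, claimed value)
data = [(1, "A"), (3, "A"), (4, "A"), (50, "B"), (52, "B"), (120, "C")]
clusters, outliers = time_clusters(data, eps=10, min_pts=2)
print(clusters)   # [[(1, 'A'), (3, 'A'), (4, 'A')], [(50, 'B'), (52, 'B')]]
print(outliers)   # [(120, 'C')]
```

Each resulting cluster can then be read as one version of the truth tied to its time period, while isolated statements are set aside.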

18 Outline: TruthFinder: Truth Validation by Information Network Analysis; Beyond TruthFinder: Multiple Versions of Truth and Evolution of Truth; Enhancing Truth Validation by InfoNet Analysis: The RankClus & NetClus Methodology; Summary

19 Why RankClus? More meaningful clusters: within each cluster, a ranking score for every object is available as well. More meaningful ranking: ranking within a cluster is more meaningful than ranking in the whole network. It addresses the problem of clustering in heterogeneous networks: there is no need to compute pair-wise similarity of objects; each object is mapped into a low-dimensional measure space. What type of objects are clustered: target objects (specified by the user). Clustering of target objects can induce a sub-network of the original network.

20 Algorithm Framework - Illustration [Figure: sub-network, ranking, clustering]

21 Algorithm Framework - Summary
Step 0. Initialization: randomly partition target objects into K clusters.
Step 1. Ranking: rank each sub-network induced from each cluster; the ranking serves as the feature for that cluster.
Step 2. Generating new measure space: estimate mixture model coefficients for each target object.
Step 3. Adjusting clusters.
Step 4. Repeat Steps 1-3 until stable.
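The steps above can be sketched for a conference-author weight matrix with NumPy. This is a simplified illustration under several assumptions, not the authors' implementation: Step 1 uses degree-based (simple) ranking, Step 2 fits only the cluster priors of a mixture of per-cluster rank distributions with a short EM loop, Step 3 reassigns each target object to the nearest cluster center by cosine similarity, and empty clusters are simply given zero rank. All function and parameter names (`rankclus_sketch`, `init_labels`, `n_em`, and so on) are hypothetical.

```python
import numpy as np


def conditional_ranks(W, members):
    """Step 1: simple ranking on the sub-network induced by target rows `members`."""
    col = W[members].sum(axis=0)
    if col.sum() == 0:  # empty or link-free cluster (assumption: give it zero rank)
        return np.zeros(W.shape[0]), np.zeros(W.shape[1])
    r_y = col / col.sum()   # rank of attribute objects within the cluster
    r_x = W @ r_y           # rank of every target object under this cluster
    return r_x / r_x.sum(), r_y


def mixture_coefficients(W, ranks, n_em=20):
    """Step 2: EM over cluster priors; returns an (m, K) coefficient matrix."""
    K = len(ranks)
    joint_k = np.stack([np.outer(rx, ry) for rx, ry in ranks])  # (K, m, n)
    prior = np.full(K, 1.0 / K)
    for _ in range(n_em):
        joint = prior[:, None, None] * joint_k
        total = joint.sum(axis=0)
        post = np.divide(joint, total, out=np.zeros_like(joint), where=total > 0)
        prior = (W * post).sum(axis=(1, 2)) / W.sum()
    pi = np.stack([rx for rx, _ in ranks], axis=1) * prior
    row = pi.sum(axis=1, keepdims=True)
    return np.divide(pi, row, out=np.zeros_like(pi), where=row > 0)


def rankclus_sketch(W, K, init_labels=None, max_iter=20, seed=0):
    m = W.shape[0]
    rng = np.random.default_rng(seed)
    # Step 0: random partition unless an initial labeling is supplied.
    labels = (np.array(init_labels) if init_labels is not None
              else rng.integers(K, size=m))
    for _ in range(max_iter):
        ranks = [conditional_ranks(W, np.flatnonzero(labels == k)) for k in range(K)]
        pi = mixture_coefficients(W, ranks)
        # Step 3: cluster centers in the new measure space, then reassignment.
        centers = np.stack([pi[labels == k].mean(axis=0) if np.any(labels == k)
                            else np.zeros(K) for k in range(K)])
        norms = np.linalg.norm(pi, axis=1, keepdims=True) * np.linalg.norm(centers, axis=1)
        new_labels = ((pi @ centers.T) / (norms + 1e-12)).argmax(axis=1)
        if np.array_equal(new_labels, labels):  # Step 4: stop when stable
            break
        labels = new_labels
    return labels


# Toy network (hypothetical): 4 conferences x 6 authors, two research areas.
W = np.array([[3, 2, 1, 0, 0, 0],
              [2, 3, 1, 0, 0, 0],
              [0, 0, 0, 3, 2, 1],
              [0, 0, 0, 1, 2, 3]], dtype=float)
print(rankclus_sketch(W, K=2, init_labels=[0, 0, 0, 1]))
```

Starting from the deliberately wrong partition `[0, 0, 0, 1]`, the loop moves the third conference to the second cluster and then stops. Like other iterative clustering schemes, the result depends on the initial partition, so several restarts are a reasonable precaution.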

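22 Focus on a Bi-type Network Case. Conference-author network: links can exist between conference (X) and author (Y), and between author (Y) and author (Y). Use W to denote the links and their weights: $W = \begin{pmatrix} 0 & W_{XY} \\ W_{YX} & W_{YY} \end{pmatrix}$ (the conference-conference block is zero because no such links exist).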

23 Step 1: Feature Extraction - Ranking. Simple ranking: proportional to the degree count of objects, e.g., the number of publications of an author; considers only the immediate neighborhood in the network. Authority ranking: an extension of HITS to the weighted bi-type network, based on three rules. Rule 1: Highly ranked authors publish many papers in highly ranked conferences. Rule 2: Highly ranked conferences attract many papers from many highly ranked authors. Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors.
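Both ranking functions can be sketched with NumPy on a conference-author weight matrix `W_xy` and a co-author matrix `W_yy`. The slide states the three rules but not the update equations, so the iteration below is one plausible reading rather than the authors' exact formulation: author rank mixes conference links (Rules 1 and 2) with co-author links (Rule 3) through a weight `alpha`, whose value here is an arbitrary assumption, and ranks are renormalized to sum to 1 after each step.

```python
import numpy as np


def simple_ranking(W_xy):
    """Rank proportional to weighted degree (e.g., publication counts)."""
    r_x = W_xy.sum(axis=1)
    r_y = W_xy.sum(axis=0)
    return r_x / r_x.sum(), r_y / r_y.sum()


def authority_ranking(W_xy, W_yy, alpha=0.95, n_iter=50):
    """HITS-style mutual reinforcement between conferences (X) and authors (Y).

    alpha (assumption) balances conference links against co-author links.
    """
    m, n = W_xy.shape
    r_x = np.full(m, 1.0 / m)
    r_y = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        # Rules 1 and 3: authors gain rank from ranked conferences and co-authors.
        r_y_new = alpha * (W_xy.T @ r_x) + (1.0 - alpha) * (W_yy @ r_y)
        r_y = r_y_new / r_y_new.sum()
        # Rule 2: conferences gain rank from the ranked authors they attract.
        r_x_new = W_xy @ r_y
        r_x = r_x_new / r_x_new.sum()
    return r_x, r_y


# Toy network (hypothetical): 2 conferences x 3 authors, plus co-author counts.
W_xy = np.array([[5.0, 3.0, 0.0],
                 [1.0, 0.0, 1.0]])
W_yy = np.array([[0.0, 2.0, 0.0],
                 [2.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0]])
print(simple_ranking(W_xy))
print(authority_ranking(W_xy, W_yy))
```

Simple ranking looks only at each object's own links, whereas authority ranking lets rank propagate across the network, so an author with few papers in a highly ranked conference can outrank one with more papers in a weaker venue.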

24 Rules in Authority Ranking. Rule 1: Highly ranked authors publish many papers in highly ranked conferences. Rule 2: Highly ranked conferences attract many papers from many highly ranked authors. Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors.

25 Example: Authority Ranking in the 2-Area Conference-Author Network. Given the correct clusters, the rankings of authors are quite distinct from each other.

26 Example: 2-D Coefficients in the 2-Area Conference-Author Network. The conferences are well separated in the new measure space. [Figure: scatter plots of two conferences and component coefficients]

27 A Running Case Illustration for 2-Area Conf-Author Network. [Figure panel captions: Initially, ranking distributions are mixed together; two clusters of objects are mixed together but preserve some similarity; improved a little; two clusters are almost well separated; improved significantly; stable; well separated]

28 Time Complexity Analysis. At each iteration (|E|: number of edges in the network, m: number of target objects, K: number of clusters): ranking for a sparse network ~O(|E|); mixture model estimation ~O(K|E|+mK); cluster adjustment ~O(mK^2). In all, linear in |E|: ~O(K|E|).

29 Case Study: Dataset: DBLP All the 2676 conferences and 20,000 authors with most publications, from the time period of year 1998 to year Both conference-author relationships and co-author relationships are used. K=15

30 Beyond RankClus: A NetClus Model. RankClus combines ranking and clustering successfully to analyze information networks: a study of how ranking and clustering can mutually reinforce each other in information network analysis. RankClus works well on bi-typed information networks. NetClus extends the bi-type network model to a star-network model, e.g., DBLP: author - paper - conference - title (subject). [Figure: star network schema relating Author, Conference, Subject, and Paper]

31 NetClus: Database System Cluster
Terms: database, databases, system, data, query, systems, queries, management, object, relational, processing, based, distributed, xml, oriented, design, web, information, model, efficient
Authors: Surajit Chaudhuri, Michael Stonebraker, Michael J. Carey, C. Mohan, David J. DeWitt, Hector Garcia-Molina, H. V. Jagadish, David B. Lomet, Raghu Ramakrishnan, Philip A. Bernstein, Joseph M. Hellerstein, Jeffrey F. Naughton, Yannis E. Ioannidis, Jennifer Widom, Per-Åke Larson, Rakesh Agrawal, Dan Suciu, Michael J. Franklin, Umeshwar Dayal, Abraham Silberschatz
Conferences: VLDB, SIGMOD Conf, ICDE, PODS, EDBT
Ranking authors in XML

32 Outline: TruthFinder: Truth Validation by Information Network Analysis; Beyond TruthFinder: Multiple Versions of Truth and Evolution of Truth; Enhancing Truth Validation by InfoNet Analysis: The RankClus & NetClus Methodology; Summary

33 Summary. Progress highlights: 3 PhDs graduated in 2009; currently over 20 Ph.D. students working on closely related projects; attracted more funded projects: 3 NSFs, NASA, DHS, …; industry collaborations: Microsoft Research, IBM Research, Boeing, HP Labs, Yahoo!, Google, …; research papers published in 2008 & 2009: 8 journal papers and 53 conference papers, including KDD, NIPS, SIGMOD, VLDB, ICDM, SDM, ICDE, ECML/PKDD, SenSys, ICDCS, IJCAI, AAAI, Discovery Science, PAKDD, SSDBM, ACM Multimedia, EDBT, CIKM, … Truth validation by information network analysis is a promising direction: TruthFinder, iNextCube, and beyond. Knowledge is power, but knowledge is hidden in massive links. Integration of data mining with the project: much more to be explored!

34 Recent Publications Related to the Talk
X. Yin, J. Han, and P. S. Yu, “Truth Discovery with Multiple Conflicting Information Providers on the Web”, TKDE’08
Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, and T. Wu, “RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis”, EDBT’09
Y. Sun, Y. Yu, and J. Han, “Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema”, KDD’09
Y. Sun, J. Han, J. Gao, and Y. Yu, “iTopicModel: Information Network-Integrated Topic Modeling”, ICDM’09
J. Han, “Mining Heterogeneous Information Networks by Exploring the Power of Links”, Discovery Science’09 (Invited Keynote Speech)
M.-S. Kim and J. Han, “A Particle-and-Density Based Evolutionary Clustering Method for Dynamic Networks”, VLDB’09
Y. Yu, C. Lin, Y. Sun, C. Chen, J. Han, B. Liao, T. Wu, C. Zhai, D. Zhang, and B. Zhao, “iNextCube: Information Network-Enhanced Text Cube”, VLDB’09 (system demo)