Cluster-Based Fact Finders
Manish Gupta, Yizhou Sun, Jiawei Han
Feb 10, 2011


Why perform cluster-based fact finding?
– Books: Goldstone Books is a highly trustworthy provider, but it is not the best for history books.
– Google/Yahoo/Bing are good search engines, but I would prefer Monster for jobs or 101apartments for apartments.
– CNN, CBS, or Google News are best for news, but I prefer Slashdot or TechCrunch for technical news and ESPN or Cricinfo for sports news. Aljazeera for Middle East news!
– Providers excel in their fields of focus.

Our Contributions
– Formally define the problem of cluster-based fact finding
– An algorithm that performs trust analysis and clustering of objects iteratively
– Comparison of our algorithm using different fact finders on multiple datasets, showing better accuracy and interesting clusters
– Analysis of cluster-based fact finders using a synthetic dataset

Related work
– Yin et al. [TKDE 2008]: TruthFinder
– Dong et al. [PVLDB 2009]: time-varying truth, copycat detection
– Pasternack et al. [COLING 2010]: multiple fact finders and the effect of priors
– Sun et al. [EDBT 2009]: alternate ranking-clustering framework (RankClus)
– Gupta et al. [WWW 2011]: trust analysis with clustering
– Work in agent-based systems (trust of agents in each other based on past mutual interactions, etc.)

The iterative fact finder model
Three components of the model:
– Trustworthiness of providers (sources)
– Confidence (belief) of facts (claims)
– Implications between facts

Basic Fact Finder Algorithm
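The algorithm itself appears only as a figure on this slide. As a rough illustration of the iterative model above, here is a minimal Sums-style fact finder sketch: provider trustworthiness is the sum of the confidences of the facts a provider claims, and fact confidence is the sum of the trustworthiness of the providers claiming it, iterated with normalization. The data and naming are made up for illustration, and the actual algorithm in the slides may normalize or weight differently (e.g., TruthFinder uses implications between facts, which this sketch omits).

```python
def sums_fact_finder(claims, n_iters=20):
    """claims: dict mapping provider -> set of facts it asserts."""
    facts = {f for fs in claims.values() for f in fs}
    trust = {p: 1.0 for p in claims}   # uniform prior trustworthiness
    conf = {f: 1.0 for f in facts}
    for _ in range(n_iters):
        # Fact confidence: sum of trust of providers asserting it.
        conf = {f: sum(trust[p] for p, fs in claims.items() if f in fs)
                for f in facts}
        m = max(conf.values())
        conf = {f: c / m for f, c in conf.items()}
        # Provider trust: sum of confidences of its asserted facts.
        trust = {p: sum(conf[f] for f in fs) for p, fs in claims.items()}
        m = max(trust.values())
        trust = {p: t / m for p, t in trust.items()}
    return trust, conf

# Hypothetical toy data: two stores agree on an author, one dissents.
claims = {
    "store_a": {"authorX", "authorY"},
    "store_b": {"authorX"},
    "store_c": {"authorZ"},
}
trust, conf = sums_fact_finder(claims)
```

The mutual reinforcement drives up the confidence of the majority fact and the trust of the providers asserting it, mirroring the basic fact finder loop described on the previous slide.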

Intuitive example

Drawbacks of basic fact finders
– No object-specific trust ranking is generated; only a global trustworthiness ranking of providers is computed.
– The confidence ranking of facts for an object is influenced by the trustworthiness of providers who are not so "good" for this object or for objects related to it.

Our hypothesis
– Objects can be clustered based on provider trustworthiness profiles, t_o(p), personalized to the particular object.
– Restricting the flow of trust information across objects, using clusters, can improve the ranking accuracy of facts and providers.
– Iteratively alternating clustering and trust analysis can produce high-quality trust-based clusters and can improve the accuracy of the trust ranking of providers and the confidence ranking of facts.

Clustering before Trust Analysis
Drawbacks:
– Does not use information about the providers related to objects in other clusters.
– Needs some input clustering: clusters are fixed and depend on a particular dimension. In many cases such a clustering is not available, or the desired trustworthiness-based clustering may not follow any natural clustering of the objects along just a single dimension.

Clustering in provider trustworthiness space
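This slide's content is a figure. The idea it depicts, as stated in the hypothesis above, is to represent each object by its provider-trustworthiness profile t_o(p) and cluster the objects in that space. A minimal sketch of such clustering with a tiny deterministic 2-means pass follows; the profile values and the choice of k-means with Euclidean distance are illustrative assumptions, not necessarily the slides' exact procedure.

```python
def kmeans(vectors, k=2, n_iters=10):
    # Deterministic init: use the first k profiles as centroids.
    centroids = [list(vectors[i]) for i in range(k)]
    assign = [0] * len(vectors)
    for _ in range(n_iters):
        # Assign each object profile to its nearest centroid (squared Euclidean).
        for i, v in enumerate(vectors):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return assign, centroids

# Made-up object-conditional trust vectors t_o over 3 providers:
profiles = [
    (0.9, 0.1, 0.2),  # objects best served by provider 1
    (0.8, 0.2, 0.1),
    (0.1, 0.9, 0.8),  # objects best served by providers 2 and 3
    (0.2, 0.8, 0.9),
]
assign, centroids = kmeans(profiles)
```

Objects whose trust profiles favor the same providers end up in the same cluster, which is exactly the grouping the cluster-based fact finders exploit.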

Basic Cluster-Based Fact Finder
Drawbacks:
– There is no trustworthiness information sharing between objects in BCFF2.
– Every iteration in Algorithm 3 simply re-computes the trustworthiness of providers based on implications between the various facts about the same object.

Clustering with Trust Analysis

Smoothing
Three kinds of providers:
– those providing "correct" information about each object
– those providing "wrong" information for each object
– those "correct" for some objects and "wrong" for others
Our cluster-based algorithms would intuitively work best in the third case. If the trust vectors are quite close to each other, clustering is not really effective, hence we smooth using the global scores. s_C is the cluster-based score and s_G is the global score; α is set to the average inter-cluster similarity.
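The smoothing formula itself is shown only as an image on the slide and is not reproduced in the transcript. One plausible reading, given that α is the average inter-cluster similarity, is a linear interpolation that falls back to the global score when the clusters are nearly indistinguishable: s = α·s_G + (1 − α)·s_C. All names below are illustrative, and this exact form is an assumption.

```python
def smooth(s_cluster, s_global, alpha):
    """Interpolate a cluster-based score with the global score.

    alpha (average inter-cluster similarity) near 1 means the clusters are
    almost identical, so the smoothed score reverts to the global score;
    alpha near 0 lets the cluster-based score dominate.
    """
    return alpha * s_global + (1 - alpha) * s_cluster

# Example: a provider scores 0.2 within this object's cluster, 0.8 globally.
blended = smooth(0.2, 0.8, alpha=0.5)
```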

Datasets
Books (Yin et al.): author listings for 1,265 books provided by 894 online bookstores.
– Ground truth: compiled manually from scanned book covers.
– Accuracy and implication values computed as the match between the best author list and the golden list.

Datasets
Wikipedia Biography Infobox dataset (Pasternack et al.)
– Accuracy for dates measured as …
– Accuracy of strings: using edit distance (counted if >75%, else 0)
Population dataset (Pasternack et al.)
– Population claims by 1,361 contributors about 30K cities. Golden truth from US Census data.
– Accuracy measured as …

Analysis of clustering profiles

Accuracy results

Synthetic dataset
– 60 objects, 21 providers, 3 clusters
– Each object has 4-5 different facts
– Providers and objects are assigned to clusters
– A provider can provide a fact for an object within its cluster with probability 0.8
– For a set of dicy objects (for which the most frequent fact is the true fact), prolific providers from other clusters provide a false fact with total frequency = 1 + max frequency
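The within-cluster part of this generation process can be sketched as follows. The round-robin cluster assignment and the single true fact per object are simplifying assumptions of mine; the dicy-object injection step described in the last bullet is omitted here.

```python
import random

def generate(n_objects=60, n_providers=21, n_clusters=3, p_claim=0.8, seed=0):
    rng = random.Random(seed)
    # Assign objects and providers to clusters round-robin (an assumption;
    # the slides only say they "are assigned to clusters").
    obj_cluster = {o: o % n_clusters for o in range(n_objects)}
    prov_cluster = {p: p % n_clusters for p in range(n_providers)}
    claims = []  # (provider, object, fact) triples
    for o in range(n_objects):
        true_fact = f"true_{o}"
        for p in range(n_providers):
            # A provider claims a fact for an object in its own cluster w.p. 0.8.
            if prov_cluster[p] == obj_cluster[o] and rng.random() < p_claim:
                claims.append((p, o, true_fact))
    return claims

claims = generate()
```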

Improvement in accuracy
Parameters:
– maximum support for the true fact of dicy objects
– number of dicy objects
– original strength of the providers
Gains are larger when there are more dicy objects and the best fact for them is not supported by many providers within their cluster.

Comparison of various cluster-based fact finders
Sums performs best: it applies no normalization and hence has the most room to improve.

Conclusion
– We identified the problem of cluster-based fact finding.
– We proposed algorithms for trust analysis using cluster-based methods.
– Using four datasets, we showed that our algorithms perform better than traditional fact finders and generate interesting clusters.
– In the future, we plan to use the network information among objects to influence the clustering of objects.

Acknowledgements
– Xiaoxin Yin for the basic code base and the books and movies datasets
– Jeff Pasternack for the Wikipedia datasets
– Dr. Dan Roth for interesting discussions
– Vinod Vydiswaran for reviewing a preliminary version of the work
– NSF (IIS ) and ARL-NSCTA (W911NF ) for funding

References

Thanks!

Variants of clustering with trust analysis
This version of ACFF gives more importance to trust analysis and tries to organize the clusters around the results of trust analysis.
Drawback: cluster-conditional trust computations are used both to re-compute object-conditional trust vectors and as centroids for clustering the object-conditional trust vectors. This may bias the algorithm heavily towards changes in trust analysis.

Variants of clustering with trust analysis
– Use the object-conditional trustworthiness vectors computed initially using BCFF2 and avoid re-computing them after the cluster-conditional trust analysis iterations.
– Iterative trust analysis is done with the sole purpose of improving the cluster centroids; the representation of each object is kept fixed.
– Intuition: the cluster centroids organize themselves as far away from each other as possible in the trust space and hence lead to distinct clusters.

Variants of clustering with trust analysis
Perform clustering in a secondary, richer space: the i-th element of the vector V is computed as the cosine similarity between the object-conditional trust vector t_o and the i-th cluster-conditional trust vector t_ci.
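The secondary-space construction just described can be sketched directly; the trust vectors below are made-up values for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def secondary_representation(t_o, cluster_vectors):
    """V[i] = cosine similarity of the object-conditional trust vector t_o
    with the i-th cluster-conditional trust vector t_ci."""
    return [cosine(t_o, t_ci) for t_ci in cluster_vectors]

# An object trusted mainly by provider 1, against two cluster trust profiles:
t_o = [0.9, 0.1, 0.0]
clusters = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
V = secondary_representation(t_o, clusters)
```

Each object is thus re-encoded by how closely its trust profile matches each cluster's profile, giving a low-dimensional space in which to re-cluster.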