Community Distribution Outliers in Heterogeneous Information Networks

Slides:

Advertisements

Similar presentations

Data Mining Tools Overview Business Intelligence for Managers.

Advertisements

BiG-Align: Fast Bipartite Graph Alignment

Learning Trajectory Patterns by Clustering: Comparative Evaluation Group D.

Multi-label Relational Neighbor Classification using Social Context Features Xi Wang and Gita Sukthankar Department of EECS University of Central Florida.

Active Learning for Streaming Networked Data Zhilin Yang, Jie Tang, Yutao Zhang Computer Science Department, Tsinghua University.

Constructing Popular Routes from Uncertain Trajectories Authors of Paper: Ling-Yin Wei (National Chiao Tung University, Hsinchu) Yu Zheng (Microsoft Research.

Constructing Popular Routes from Uncertain Trajectories Ling-Yin Wei 1, Yu Zheng 2, Wen-Chih Peng 1 1 National Chiao Tung University, Taiwan 2 Microsoft.

Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu.

Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts.

A General Model for Relational Clustering Bo Long and Zhongfei (Mark) Zhang Computer Science Dept./Watson School SUNY Binghamton Xiaoyun Wu Yahoo! Inc.

On Community Outliers and their Efficient Detection in Information Networks Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1.

Commentary-based Video Categorization and Concept Discovery By Janice Leung.

Heterogeneous Consensus Learning via Decision Propagation and Negotiation Jing Gao † Wei Fan ‡ Yizhou Sun † Jiawei Han † †University of Illinois at Urbana-Champaign.

Heterogeneous Consensus Learning via Decision Propagation and Negotiation Jing Gao† Wei Fan‡ Yizhou Sun†Jiawei Han† †University of Illinois at Urbana-Champaign.

Semi-Supervised Clustering Jieping Ye Department of Computer Science and Engineering Arizona State University

Birch: An efficient data clustering method for very large databases

1 A Topic Modeling Approach and its Integration into the Random Walk Framework for Academic Search 1 Jie Tang, 2 Ruoming Jin, and 1 Jing Zhang 1 Knowledge.

LÊ QU Ố C HUY ID: QLU OUTLINE  What is data mining ?  Major issues in data mining 2.

COMMUNITIES IN MULTI-MODE NETWORKS 1. Heterogeneous Network Heterogeneous kinds of objects in social media – YouTube Users, tags, videos, ads – Del.icio.us.

Data Mining Techniques

Tag-based Social Interest Discovery

«Tag-based Social Interest Discovery» Proceedings of the 17th International World Wide Web Conference (WWW2008) Xin Li, Lei Guo, Yihong Zhao Yahoo! Inc.,

Kansas State University Department of Computing and Information Sciences CIS 830: Advanced Topics in Artificial Intelligence From Data Mining To Knowledge.

Cao et al. ICML 2010 Presented by Danushka Bollegala.

Cluster based fact finders Manish Gupta, Yizhou Sun, Jiawei Han Feb 10, 2011.

Jaegul Choo1*, Changhyun Lee1, Chandan K. Reddy2, and Haesun Park1

GLOSSARY COMPILATION Alex Kotov (akotov2) Hanna Zhong (hzhong) Hoa Nguyen (hnguyen4) Zhenyu Yang (zyang2)

Recommendation system MOPSI project KAROL WAGA

Garrett Poppe, Liv Nguekap, Adrian Mirabel CSUDH, Computer Science Department.

Co-clustering Documents and Words Using Bipartite Spectral Graph Partitioning Jinghe Zhang 10/28/2014 CS 6501 Information Retrieval.

Discovering Meta-Paths in Large Heterogeneous Information Network

Wei Feng , Jiawei Han, Jianyong Wang , Charu Aggarwal , Jianbin Huang

Ground Truth Free Evaluation of Segment Based Maps Rolf Lakaemper Temple University, Philadelphia,PA,USA.

Finding Top-k Shortest Path Distance Changes in an Evolutionary Network SSTD th August 2011 Manish Gupta UIUC Charu Aggarwal IBM Jiawei Han UIUC.

Exploit of Online Social Networks with Community-Based Graph Semi-Supervised Learning Mingzhen Mo and Irwin King Department of Computer Science and Engineering.

Probabilistic Latent Query Analysis for Combining Multiple Retrieval Sources Rong Yan Alexander G. Hauptmann School of Computer Science Carnegie Mellon.

Data Mining: Knowledge Discovery in Databases Peter van der Putten ALP Group, LIACS Pre-University College LAPP-Top Computer Science February 2005.

Radial Basis Function ANN, an alternative to back propagation, uses clustering of examples in the training set.

CoNMF: Exploiting User Comments for Clustering Web2.0 Items Presenter: He Xiangnan 28 June School of Computing National.

Date: 2012/08/21 Source: Zhong Zeng, Zhifeng Bao, Tok Wang Ling, Mong Li Lee (KEYS’12) Speaker: Er-Gang Liu Advisor: Dr. Jia-ling Koh 1.

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Mining Advisor-Advisee Relationships from Research Publication.

Anomaly Detection. Network Intrusion Detection Techniques. Ştefan-Iulian Handra Dept. of Computer Science Polytechnic University of Timișoara June 2010.

Learning to Rank: From Pairwise Approach to Listwise Approach Authors: Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li Presenter: Davidson Date:

Query-based Graph Cuboid Outlier Detection

Paper Presentation Social influence based clustering of heterogeneous information networks Qiwei Bao & Siqi Huang.

CLUSTER ANALYSIS. Cluster Analysis  Cluster analysis is a major technique for classifying a ‘mountain’ of information into manageable meaningful piles.

Outlier Detection for Information Networks Manish Gupta 15 th Jan 2013.

Cohesive Subgraph Computation over Large Graphs

Clustering Anna Reithmeir Data Mining Proseminar 2017

Exploring Social Tagging Graph for Web Object Classification

What Is Cluster Analysis?

Semi-Supervised Clustering

Discrete ABC Based on Similarity for GCP

CACTUS-Clustering Categorical Data Using Summaries

Constrained Clustering -Semi Supervised Clustering-

Collective Network Linkage across Heterogeneous Social Platforms

A Consensus-Based Clustering Method

Integrating Meta-Path Selection With User-Guided Object Clustering in Heterogeneous Information Networks Yizhou Sun†, Brandon Norick†, Jiawei Han†, Xifeng.

RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng,

Discovering Functional Communities in Social Media

Data Mining 資料探勘分群分析 (Cluster Analysis) Min-Yuh Day 戴敏育

Noémi Gaskó, Rodica Ioana Lung, Mihai Alexandru Suciu

Jiawei Han Department of Computer Science

Personalized Celebrity Video Search Based on Cross-space Mining

Example: Academic Search

Paper Reading Dalong Du April.08, 2011.

Graph-based Security and Privacy Analytics via Collective Classification with Joint Weight Learning and Propagation Binghui Wang, Jinyuan Jia, and Neil.

Mingzhen Mo and Irwin King

Topic 5: Cluster Analysis

Presentation transcript:

Community Distribution Outliers in Heterogeneous Information Networks Manish Gupta1, Jing Gao2, Jiawei Han3 1Microsoft, India 2SUNY Buffalo, NY, USA 3UIUC, IL, USA EUROPEAN CONFERENCE ON MACHINE LEARNING AND PRINCIPLES AND PRACTICE OF KNOWLEDGE DISCOVERY IN DATABASES 2013 11/18/2018 gmanish@microsoft.com

Heterogeneous Networks are Ubiquitous Studio IMDB Network DBLP Network Facebook Network Actor Movie Studio MultiMediaN E-Culture By linking the collections and vocabularies from multiple museums together the E-Culture project constructed a graph of cultural heritage data. Within the MultimediaN E-Culture team I investigated how keyword search can be applied to this heterogeneous graph to find artworks that are semantically related artworks to a keyword query? Director 11/18/2018 Image Credits: http://cairo.cs.uiuc.edu/projects/firstResponder/pics/fsnetwork.jpg http://h71000.www7.hp.com/network/works.gif http://www.few.vu.nl/~michielh/img/eculture_datacloud.png http://www.pnas.org/content/101/9/2981/F2.large.jpg

Abstract Community Distribution Outliers (CDOutliers) for heterogeneous information networks: Objects whose community distribution does not follow any of the popular community distribution patterns Distribution Patterns Discovered by exploiting both intra-type and inter-type information An iterative two-stage approach to identify outliers Experiments on multiple real and synthetic datasets 11/18/2018 gmanish@microsoft.com

Community Distribution Outliers Type x y z Pattern “b” 0.8 0.0 0.2 Pattern “g” Pattern “r” Pattern “c” 0.4 0.6 Pattern “m” Pattern “y” Outlier 1 Outlier 2 0.33 0.34 Distribution Pattern A cluster obtained by grouping rows of a belongingness matrix Can be represented using cluster centroids Community Distribution Outliers: Objects whose community distribution does not follow any of the popular community distribution patterns 11/18/2018 gmanish@microsoft.com

Real-life Examples User Tag URL Arts Science Fashion Sports User Tag EXPERT User Tag Video Arts Science Fashion Sports MARKETER 11/18/2018 gmanish@microsoft.com

Existing Work Outlier Detection for Networks with Global Context Sub-Structure Outliers [Noble and Cook, 2003], SCAN [Xu et al., 2007], Stream Outliers [Aggarwal et al., 2011b] Global versus community context Outlier Detection for Homogeneous Networks Community outlier detection algorithms [Gao et al., 2010], [Gupta et al., 2012a], [Gupta et al., 2012b] Homogeneous versus heterogeneous networks. 11/18/2018 gmanish@microsoft.com

Our Approach in Brief Joint NMF Pattern Discovery Outlier Detection H1 Top 𝜅 Outliers T1 W1 Joint NMF H2 Top 𝜅 Outliers W2 T2 H3 Top 𝜅 Outliers W3 T3 Remove Outliers from Ti gmanish@microsoft.com

Community Distribution Outlier Detection Problem CDOutlier: An object in a heterogeneous network for which the distance to the nearest cluster in the community membership space, is very high Problem Definition Input: Community membership matrices 𝑇 1 , 𝑇 2 , …, 𝑇 𝐾 for the types 𝜏 1 , 𝜏 2 ,…, 𝜏 𝐾 Output: Top 𝜅 outlier objects of each type Broad solution Iterative two stage approach: Pattern discovery and outlier detection Pattern discovery using joint NMF 11/18/2018 gmanish@microsoft.com

Brief Overview of NMF Given a non-negative matrix 𝑇∈ 𝑅 𝑁×𝐶 Compute a factorization of T with two factors 𝑊∈ 𝑅 𝑁× 𝐶 ′ and H∈ 𝑅 𝐶 ′ ×𝐶 Such that 𝑇≈𝑊𝐻 And both W and H are non-negative NMF is similar to KMeans clustering [Ding and He, 2005], [Zass and Shashua, 2005] W is cluster indicator matrix H is cluster centroid matrix Optimization problem min 𝑊,𝐻 𝑇−𝑊𝐻 2 subject to the constraints 𝑊≥0, 𝐻≥0 11/18/2018 gmanish@microsoft.com

Discovery of Distribution Patterns Each of the T matrices can be clustered individually But the membership matrices T Are defined for objects that are connected to each other Represent objects in the same space of C dimensions Hidden structures across types should be consistent with each other Divergence between any two clusterings should be small 11/18/2018 gmanish@microsoft.com

Optimization & Iterative Update Rules min 𝑊,𝐻 𝑘=1 𝐾 𝑇 𝑘 − 𝑊 𝑘 𝐻 𝑘 2 +𝛼 𝑘=1 𝑙=1 𝑘<𝑙 𝐾 𝐻 𝑘 − 𝐻 𝑙 2 subject to the constraints 𝑊 𝑘 ≥0 ∀𝑘=1,2,…𝐾 𝐻 𝑘 ≥0 ∀𝑘=1,2,…𝐾 𝑊 𝑘 ← 𝑊 𝑘 ⊙ 𝑇 𝑘 𝐻 𝑘 𝑇 𝑊 𝑘 𝐻 𝑘 𝐻 𝑘 𝑇 ∀𝑘=1,2,…, 𝐾 𝐻 𝑘 ← 𝐻 𝑘 ⊙ 𝑊 𝑘 𝑇 𝑇 𝑘 +𝛼 𝑙=1 𝑘≠𝑙 𝐾 𝐼 𝐶 ′ × 𝐶 ′ 𝐻 𝑙 𝑊 𝑘 𝑇 𝑊 𝑘 𝐻 𝑘 +𝛼 𝑙=1 𝑘≠𝑙 𝐾 𝐼 𝐶 ′ × 𝐶 ′ 𝐻 𝑘 ∀𝑘=1,2,…, 𝐾 ⊙ denotes the Hadamard Product and 𝐴 𝐵 denotes the element-wise division 11/18/2018 gmanish@microsoft.com

Outlier Detection Joint NMF outputs the 𝑊 𝑘 and 𝐻 𝑘 matrices Each row of 𝐻 𝑘 is a distribution pattern Each element (i,j) of 𝑊 𝑘 denotes probability with which object i belongs to community j Outlier score of an object i is the distance of the object from the nearest cluster centroid 𝑂𝑆 𝑖 =𝑎𝑟𝑔𝑚𝑖 𝑛 𝑗 𝐷𝑖𝑠𝑡( 𝑇 𝑘 𝑖,. , 𝐻 𝑘 𝑗,. ) Objects far away from nearest cluster centroids get higher outlier score 11/18/2018 gmanish@microsoft.com

Iterative Refinement Algorithm 𝑶(𝑵 𝑰 ′ 𝑲[𝑲𝑰 𝑪 ′𝟐 +𝐥𝐨𝐠⁡(𝜿)]) Linear in number of objects 𝑶( 𝑲 𝟐 𝑰𝑵 𝑪 ′𝟐 ) 𝑶(𝑵 𝑲𝑪 ′𝟐 ) 𝑶(𝑲𝑵𝒍𝒐𝒈(𝜿)) 11/18/2018 gmanish@microsoft.com

Experiments Lack of ground truth Multiple Synthetic Datasets with different parameter settings Number of objects Number of object types Percent outlierness Number of dimensions Two real datasets: DBLP, Delicious Case Studies and Analysis 11/18/2018 gmanish@microsoft.com

Baselines SingleIteration (SI) Homogeneous (Homo) CDO performs community pattern discovery and outlier detection iteratively SI performs only one iteration Pattern Discovery in SI suffers from presence of CDOutliers Homogeneous (Homo) CDO performs pattern discovery using joint-NMF Homo treats all objects to be of the same type and so performs a single matrix NMF 11/18/2018 gmanish@microsoft.com

Synthetic Dataset Results Summary Synthetic Dataset Results (CDO =The Proposed Algorithm CDODA, SI = Single Iteration Baseline, Homo = Homogenous (Single NMF) Baseline) for C=6 SI (2.9%) Homo(21%) 100 experiments per setting Average std deviations are 3.07% for CDO, 3.48% for SI and 2.19% for Homo 11/18/2018 gmanish@microsoft.com

Running Time and Convergence 11/18/2018 gmanish@microsoft.com

Real Dataset Case Studies (DBLP) Each research area appears as a pattern and then there are other patterns with distributions across multiple areas. E.g., “Data Mining” and “Computational Biology” is a pattern Some patterns are specific to particular types “Software engineering” and “Operating systems” for conferences “Concurrent Distributed and Parallel Computing” and “Security and privacy” for authors “Security and privacy” and “Education” for terms Top Outlier Author: Giuseppe de Giacomo - Algorithms and Theory (0.25), Databases (0.47), Artificial Intelligence (0.13), Human Computer Interaction (0.06) Top conference outlier: From integrated publication and information systems to virtual information and knowledge environments - Databases (0.5), Artificial Intelligence (0.09), Human Computer interaction (0.4) Top terms outlier: military - Algorithms and theory (0.02), Security and Privacy (0.37), Databases (0.22), Computer Graphics (0.37)

Real Dataset Case Studies (Delicious) Outlier users saassaga: Arts and Design (0.25), Food (0.04), Lifestyle (0.35), Travel (0.34) lbbrad: Food (0.24), LifeStyle (0.37), News and Politics (0.37) Outlier tags canoeing - Sports (0.62), Travel (0.38) xmen - Entertainment (0.39), Sports (0.61) Outlier pages (for categories Lifestyle and Travel) http://globetrooper.com/ http://vandelaydesign.com/blog/galleries/travelwebsites/ 11/18/2018 gmanish@microsoft.com

Conclusions Introduced outliers with respect to latent communities for heterogeneous networks Proposed a joint-NMF optimization framework to learn distribution patterns across multiple object types Proposed an iterative two stage approach for outlier detection Experimented with multiple real and synthetic datasets 11/18/2018 gmanish@microsoft.com

Thanks! gmanish@microsoft.com 11/18/2018 gmanish@microsoft.com