On Community Outliers and their Efficient Detection in Information Networks Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1.

Slides:



Advertisements
Similar presentations
A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.
Advertisements

Modeling of Data. Basic Bayes theorem Bayes theorem relates the conditional probabilities of two events A, and B: A might be a hypothesis and B might.
CO-AUTHOR RELATIONSHIP PREDICTION IN HETEROGENEOUS BIBLIOGRAPHIC NETWORKS Yizhou Sun, Rick Barber, Manish Gupta, Charu C. Aggarwal, Jiawei Han 1.
Ranking Outliers Using Symmetric Neighborhood Relationship Wen Jin, Anthony K.H. Tung, Jiawei Han, and Wei Wang Advances in Knowledge Discovery and Data.
A Phrase Mining Framework for Recursive Construction of a Topical Hierarchy Date : 2014/04/15 Source : KDD’13 Authors : Chi Wang, Marina Danilevsky, Nihit.
Multi-label Relational Neighbor Classification using Social Context Features Xi Wang and Gita Sukthankar Department of EECS University of Central Florida.
Active Learning for Streaming Networked Data Zhilin Yang, Jie Tang, Yutao Zhang Computer Science Department, Tsinghua University.
Sampling: Final and Initial Sample Size Determination
Patch to the Future: Unsupervised Visual Prediction
1 Social Influence Analysis in Large-scale Networks Jie Tang 1, Jimeng Sun 2, Chi Wang 1, and Zi Yang 1 1 Dept. of Computer Science and Technology Tsinghua.
Autocorrelation and Linkage Cause Bias in Evaluation of Relational Learners David Jensen and Jennifer Neville.
Graph Data Management Lab School of Computer Science , Bristol, UK.
1 1 Chenhao Tan, 1 Jie Tang, 2 Jimeng Sun, 3 Quan Lin, 4 Fengjiao Wang 1 Department of Computer Science and Technology, Tsinghua University, China 2 IBM.
Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu.
On Appropriate Assumptions to Mine Data Streams: Analyses and Solutions Jing Gao† Wei Fan‡ Jiawei Han† †University of Illinois at Urbana-Champaign ‡IBM.
Experimental Design, Statistical Analysis CSCI 4800/6800 University of Georgia Spring 2007 Eileen Kraemer.
Lecture 5: Learning models using EM
Descriptive statistics Experiment  Data  Sample Statistics Experiment  Data  Sample Statistics Sample mean Sample mean Sample variance Sample variance.
Heterogeneous Consensus Learning via Decision Propagation and Negotiation Jing Gao † Wei Fan ‡ Yizhou Sun † Jiawei Han † †University of Illinois at Urbana-Champaign.
Heterogeneous Consensus Learning via Decision Propagation and Negotiation Jing Gao† Wei Fan‡ Yizhou Sun†Jiawei Han† †University of Illinois at Urbana-Champaign.
Query-Based Outlier Detection in Heterogeneous Information Networks Jonathan Kuck 1, Honglei Zhuang 1, Xifeng Yan 2, Hasan Cam 3, Jiawei Han 1 1 University.
Honglei Zhuang1, Jing Zhang2, George Brova1,
Learning In Bayesian Networks. Learning Problem Set of random variables X = {W, X, Y, Z, …} Training set D = { x 1, x 2, …, x N }  Each observation specifies.
Models of Influence in Online Social Networks
Extracting Places and Activities from GPS Traces Using Hierarchical Conditional Random Fields Yong-Joong Kim Dept. of Computer Science Yonsei.
Evolutionary Clustering and Analysis of Bibliographic Networks Manish Gupta (UIUC) Charu C. Aggarwal (IBM) Jiawei Han (UIUC) Yizhou Sun (UIUC) ASONAM 2011.
Bayesian inference review Objective –estimate unknown parameter  based on observations y. Result is given by probability distribution. Bayesian inference.
Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models Jing Gao 1, Feng Liang 2, Wei Fan 3, Yizhou Sun 1, Jiawei Han 1 1.
Page 1 Ming Ji Department of Computer Science University of Illinois at Urbana-Champaign.
第十讲 概率图模型导论 Chapter 10 Introduction to Probabilistic Graphical Models
Department of Computer Science City University of Hong Kong Department of Computer Science City University of Hong Kong 1 Probabilistic Continuous Update.
Direct Message Passing for Hybrid Bayesian Networks Wei Sun, PhD Assistant Research Professor SFL, C4I Center, SEOR Dept. George Mason University, 2009.
Uncovering Overlap Community Structure in Complex Networks using Particle Competition Fabricio A. Liang
P-Rank: A Comprehensive Structural Similarity Measure over Information Networks CIKM’ 09 November 3 rd, 2009, Hong Kong Peixiang Zhao, Jiawei Han, Yizhou.
Empirical Research Methods in Computer Science Lecture 7 November 30, 2005 Noah Smith.
Markov Random Fields Probabilistic Models for Images
On Node Classification in Dynamic Content-based Networks.
An Asymptotic Analysis of Generative, Discriminative, and Pseudolikelihood Estimators by Percy Liang and Michael Jordan (ICML 2008 ) Presented by Lihan.
Anomaly Detection in Data Mining. Hybrid Approach between Filtering- and-refinement and DBSCAN Eng. Ştefan-Iulian Handra Prof. Dr. Eng. Horia Cioc ârlie.
Finding Top-k Shortest Path Distance Changes in an Evolutionary Network SSTD th August 2011 Manish Gupta UIUC Charu Aggarwal IBM Jiawei Han UIUC.
Probabilistic Coverage in Wireless Sensor Networks Authors : Nadeem Ahmed, Salil S. Kanhere, Sanjay Jha Presenter : Hyeon, Seung-Il.
A Passive Approach to Sensor Network Localization Rahul Biswas and Sebastian Thrun International Conference on Intelligent Robots and Systems 2004 Presented.
1 Data Mining: Concepts and Techniques (3 rd ed.) — Chapter 12 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign.
1 LinkClus: Efficient Clustering via Heterogeneous Semantic Links Xiaoxin Yin, Jiawei Han Univ. of Illinois at Urbana-Champaign Philip S. Yu IBM T.J. Watson.
Computing & Information Sciences Kansas State University IJCAI HINA 2015: 3 rd Workshop on Heterogeneous Information Network Analysis KSU Laboratory for.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Mining Logs Files for Data-Driven System Management Advisor.
1 Standard error Estimated standard error,s,. 2 Example 1 While measuring the thermal conductivity of Armco iron, using a temperature of 100F and a power.
Chapter 7 Statistical Inference: Estimating a Population Mean.
Consensus Extraction from Heterogeneous Detectors to Improve Performance over Network Traffic Anomaly Detection Jing Gao 1, Wei Fan 2, Deepak Turaga 2,
Towards Social User Profiling: Unified and Discriminative Influence Model for Inferring Home Locations Rui Li, Shengjie Wang, Hongbo Deng, Rui Wang, Kevin.
Towards Total Scene Understanding: Classification, Annotation and Segmentation in an Automatic Framework N 工科所 錢雅馨 2011/01/16 Li-Jia Li, Richard.
Ahmad Salam AlRefai.  Introduction  System Features  General Overview (general process)  Details of each component  Simulation Results  Considerations.
Relation Strength-Aware Clustering of Heterogeneous Information Networks with Incomplete Attributes ∗ Source: VLDB.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Mining Advisor-Advisee Relationships from Research Publication.
Anomaly Detection. Network Intrusion Detection Techniques. Ştefan-Iulian Handra Dept. of Computer Science Polytechnic University of Timișoara June 2010.
10.1 – Estimating with Confidence. Recall: The Law of Large Numbers says the sample mean from a large SRS will be close to the unknown population mean.
Introduction to Micro-economics Technote Creating a distribution; calculating a probability.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Query-based Graph Cuboid Outlier Detection
Paper Presentation Social influence based clustering of heterogeneous information networks Qiwei Bao & Siqi Huang.
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 12 —
Collective Network Linkage across Heterogeneous Social Platforms
Integrating Meta-Path Selection With User-Guided Object Clustering in Heterogeneous Information Networks Yizhou Sun†, Brandon Norick†, Jiawei Han†, Xifeng.
RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng,
Community Distribution Outliers in Heterogeneous Information Networks
Estimating Networks With Jumps
Jiawei Han Department of Computer Science
Dynamic Supervised Community-Topic Model
ADVANCED ANOMALY DETECTION IN CANARY TESTING
Modeling Topic Diffusion in Scientific Collaboration Networks
Presentation transcript:

On Community Outliers and their Efficient Detection in Information Networks Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 1 University of Illinois 2 IBM TJ Watson

2 Information Networks Node represents an entity –Each node has feature values –e.g., users in social networks, webpages on internet Link represents relationship between entities –e.g., two users are linked if they are friends; two webpages are linked through hyper-links Information networks are ubiquitous

3 Outlier (Anomaly, Novelty) Detection Goal –Identify points that deviate significantly from the majority of the data Outliers from A. Banerjee, V.Chandola, V. Kumar, J.Srivastava. Anomaly Detection: A Tutorial. SDM’08

4 Community Outliers Definition –Two information sources: links, node features –There exist communities based on links and node features –Objects that have feature values deviating from those of other members in the same community are defined as community outliers high-income low-income community outlier

5 Contextual (Conditional) Outliers Global vs. Contextual –Global: identify outliers among all the data –Contextual: identify outliers within a subset of data defined by contextual features

6 Examples Contexts –a subset of features, temporal, spatial, or communities in networks (in this paper) temporal contexts spatial contexts

7 Outliers in Information Networks 1)Global outlier: only consider node features 2) Structural outlier: only consider links 3) Local outlier: only consider the feature values of direct neighbors structural outlier local outlier

8 Unified Model is Needed Links and node features –More meaningful to identify communities based on links and node features together Community discovery and outlier detection –Outliers affect the discovery of communities consider links only

9 A Unified Probabilistic Model (1) community label Z outlier node features X link structure W high-income: mean: 116k std: 35k low-income: mean: 20k std: 12k model parameters K: number of communities

10 A Unified Probabilistic Model (2) Probability –Maximize P(X) P(X|Z)P(Z) –P(X|Z) depends on the community label and model parameters eg., salaries in the high or low-income communities follow Gaussian distributions defined by mean and std –P(Z) is higher if neighboring nodes from normal communities share the same community label eg., two linked persons are more likely to be in the same community outliers are isolated—does not depend on the labels of neighbors

11 Modeling Continuous or Text Data Continuous –Gaussian distribution –Model parameters: mean, standard deviation Text –Multinomial distribution –Model parameters: probability of a word appearing in a community

12 Fix, find Z that maximizes P(Z|X) Fix Z, find that maximizes P(X|Z) Community Outlier Detection Algorithm Initialize Z Inference Parameter estimation : model parameters Z: community labels

13 Inference (1) Calculate Z –Model parameters are known –Iteratively update the community labels of nodes –Select the label that maximizes P(Z|X,Z N ) 100k low- income high- income high-income: 110k mean: low-income: 30k high-income? 80% low-income? 10% outlier? 10%

14 Inference (2) Calculate P(Z|X,Z N ) –Consider both the node features and community labels of neighbors if Z indicates a normal community –If the probability of a node belonging to any community is low enough, label it as an outlier 100k low- income high- income high-income: 100k mean: low-income: 30k P(salary=100k|high-income) P(salary=100k|low-income) constant P(high-income|neighbors) P(low-income|neighbors) high-income: low-income: outlier:

15 Parameter Estimation Calculate model parameters –maximum likelihood estimation Continuous –mean: sample mean of the community –standard deviation: square root of the sample variance of the community Text –probability of a word appearing in the community: empirical probability

16 Simulated Experiments Data –Generate continuous data based on Gaussian distributions and generate labels according to the model –r: percentage of outliers, K: number of communities Baseline models –GLODA: global outlier detection (based on node features only) –DNODA: local outlier detection (check the feature values of direct neighbors) –CNA: partition data into communities based on links and then conduct outlier detection in each community

17 Precision

18 Experiments on DBLP Data –DBLP: computer science bibliography –Areas: data mining, artificial intelligence, database, information analysis Case studies –Conferences: Links: percentage of common authors among two conferences Node features: publication titles in the conference –Authors: Links: co-authorship relationship Node features: titles of publications by an author

19 Case Studies on Conferences Community outliers: CVPR CIKM

20 Conclusions Community Outliers –Nodes that have different behaviors compared with the others in the community Community Outlier Detection –A unified probabilistic model –Conduct community discovery and outlier detection simultaneously –Consider both links and node features

21 Thanks! Any questions?