Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1
1 University of Illinois, 2 IBM TJ Watson
Debapriya Basu
Determine outliers in information networks
Compare various algorithms that do the same
E.g. the Internet, social networking sites
Nodes: characterized by feature values
Links: represent relations between nodes
Outliers: anomalies, novelties
Different kinds of outliers
◦ Global
◦ Contextual
Unified model considering both nodes and links
Community discovery and outlier detection are related processes
Treat each object as a multivariate data point
Use K components to describe normal community behavior and one component to denote outliers
Introduce a hidden variable $z_i$ at each object indicating its community
Treat network information as a graph
Model the graph as a Hidden Markov Random Field on the $z_i$
Find a local minimum of the posterior energy of the model (equivalently, a local maximum of the posterior probability)
[Figure: model overview. Community labels Z (including an outlier label), node features X, link structure W. Example model parameters: high-income community with mean 116k, std 35k; low-income community with mean 20k, std 12k. K: number of communities.]
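Symbol | Definition
$I = \{1, 2, \ldots, i, \ldots, M\}$ | Indices of the objects
$V = \{v_1, v_2, \ldots, v_M\}$ | Set of objects
$S = \{s_1, s_2, \ldots, s_M\}$ | Given attributes of the objects
$W_{M \times M} = \{w_{ij}\}$ | Adjacency matrix containing the weights of the links
$Z = \{z_1, \ldots, z_M\}$ | Random variables for the hidden labels of the objects
$X = \{x_1, \ldots, x_M\}$ | Random variables for the observed data
$N_i \ (i \in I)$ | Neighborhood of object $v_i$
$1, \ldots, k, \ldots, K$ | Indices of the normal communities
$\Theta = \{\Theta_1, \Theta_2, \ldots, \Theta_K\}$ | Random variables for the model parameters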
◦ The random variables X are conditionally independent given their labels: $P(X = S \mid Z) = \prod_{i=1}^{M} P(x_i = s_i \mid z_i)$
◦ The k-th normal community is characterized by a set of parameters: $P(x_i = s_i \mid z_i = k) = P(x_i = s_i \mid \Theta_k)$
◦ Outliers are characterized by a uniform distribution: $P(x_i = s_i \mid z_i = 0) = \rho_0$
◦ A Markov random field is defined over the hidden variables Z: $P(z_i \mid z_{I - \{i\}}) = P(z_i \mid z_{N_i})$
◦ The equivalent Gibbs distribution is $P(Z) = \frac{1}{H_1} \exp(-U(Z))$, where $H_1$ is a normalizing constant and $U(Z)$ is the sum of clique potentials
◦ The goal is to find the configuration of Z that maximizes $P(X = S \mid Z) P(Z)$ for a given $\Theta$
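Taking the negative logarithm of this objective makes the link to the potential energy explicit. The short derivation below uses only the definitions on this slide; the symbol $U(Z \mid X)$ for the resulting posterior energy is notation assumed here for illustration, and the constant $\ln H_1$ is dropped because it does not depend on Z.

```latex
\begin{aligned}
\hat{Z} &= \arg\max_{Z} \; P(X = S \mid Z)\, P(Z) \\
        &= \arg\min_{Z} \; \Big[ -\sum_{i=1}^{M} \ln P(x_i = s_i \mid z_i) + U(Z) + \ln H_1 \Big] \\
        &= \arg\min_{Z} \; U(Z \mid X),
\qquad U(Z \mid X) = -\sum_{i=1}^{M} \ln P(x_i = s_i \mid z_i) + U(Z)
\end{aligned}
```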
Continuous data
◦ Modeled as a Gaussian distribution
◦ Model parameters: mean, standard deviation
Text data
◦ Modeled as a multinomial distribution
◦ Model parameters: probability of a word appearing in a community
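As a rough illustration of the two data models, the sketch below evaluates the log-likelihood of one node's features under one community's parameters. The function names and the dictionary representation of word counts are assumptions made for this example, the Gaussian is univariate for simplicity, and the multinomial coefficient is dropped because it is the same for every community.

```python
import math


def gaussian_log_likelihood(x, mean, std):
    """Log density of a univariate Gaussian with the given mean and std at x."""
    var = std * std
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def multinomial_log_likelihood(word_counts, word_probs):
    """Log-likelihood of word counts given per-community word probabilities.

    The multinomial coefficient is omitted (constant across communities).
    Assumes every counted word has a strictly positive probability.
    """
    return sum(c * math.log(word_probs[w]) for w, c in word_counts.items() if c > 0)


# Income example (in thousands): a node with income 100 fits the
# high-income community (mean 116, std 35) better than the
# low-income one (mean 20, std 12).
high = gaussian_log_likelihood(100.0, 116.0, 35.0)
low = gaussian_log_likelihood(100.0, 20.0, 12.0)
print(high > low)
```

In practice the word probabilities would need smoothing so that a word unseen in a community does not produce a logarithm of zero; that step is left out of this sketch.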
Initialize Z, then alternate between two steps:
◦ Inference: given Θ, find Z that maximizes $P(Z \mid X)$
◦ Parameter estimation: given Z, find Θ that maximizes $P(X \mid Z)$
Θ: model parameters
Z: community labels
Calculate model parameters
◦ Maximum likelihood estimation
Continuous
◦ Mean: sample mean of the community
◦ Standard deviation: square root of the sample variance of the community
Text
◦ Probability of a word appearing in the community: empirical probability
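The estimates above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, features are univariate, the variance divides by n (the maximum likelihood form), and nodes labeled 0 (outliers) are assumed to be excluded from every community's estimate.

```python
import math
from collections import Counter, defaultdict


def estimate_gaussian_params(values, labels):
    """Return {community: (mean, std)} from nodes with a non-zero label."""
    groups = defaultdict(list)
    for x, z in zip(values, labels):
        if z != 0:  # assumption: outliers do not contribute to any community
            groups[z].append(x)
    params = {}
    for k, xs in groups.items():
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)  # MLE: divide by n
        params[k] = (mean, math.sqrt(var))
    return params


def estimate_word_probs(docs, labels):
    """Return {community: {word: empirical probability}} from word lists."""
    counts = defaultdict(Counter)
    for words, z in zip(docs, labels):
        if z != 0:
            counts[z].update(words)
    probs = {}
    for k, counter in counts.items():
        total = sum(counter.values())
        probs[k] = {w: c / total for w, c in counter.items()}
    return probs


params = estimate_gaussian_params([1.0, 3.0, 10.0, 20.0, 100.0], [1, 1, 2, 2, 0])
word_probs = estimate_word_probs([["data", "mining"], ["data", "graph"]], [1, 1])
print(params)
print(word_probs)
```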
Calculate $z_i$ values
◦ Given the model parameters, iteratively update the community labels of the nodes at each time step
◦ Select the label that maximizes $P(Z \mid X, Z_N)$
Calculate $P(Z \mid X, Z_N)$ values
◦ Depends on both the node features and the community labels of the neighbors if Z indicates a normal community
◦ If the probability of a node belonging to any community is low enough, label it as an outlier
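One way to sketch this update is a single sweep of iterated conditional modes over the nodes. The energy used here is an assumption made for illustration: a normal label k costs the negative Gaussian log-likelihood minus `lam` times the total link weight to neighbors currently labeled k, while the outlier label 0 costs a fixed threshold `a0`. The function names and the list-of-dicts representation of the link weights are hypothetical.

```python
import math


def gaussian_log_likelihood(x, mean, std):
    """Log density of a univariate Gaussian with the given mean and std at x."""
    var = std * std
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def update_labels(values, weights, labels, params, lam, a0):
    """One sweep: give each node the label with the lowest energy.

    values:  node features (one float per node)
    weights: weights[i] maps neighbor index j to link weight w_ij
    labels:  current labels (0 = outlier, 1..K = communities)
    params:  {community: (mean, std)}
    """
    new_labels = list(labels)
    for i, x in enumerate(values):
        best_label, best_energy = 0, a0  # outlier energy is the threshold a0
        for k, (mean, std) in params.items():
            # Total weight of links to neighbors currently in community k.
            agreement = sum(w for j, w in weights[i].items() if new_labels[j] == k)
            energy = -gaussian_log_likelihood(x, mean, std) - lam * agreement
            if energy < best_energy:
                best_label, best_energy = k, energy
        new_labels[i] = best_label
    return new_labels


values = [1.0, 1.2, 0.9, 10.0, 10.3, 50.0]
weights = [{1: 1.0, 5: 1.0}, {0: 1.0, 2: 1.0}, {1: 1.0}, {4: 1.0}, {3: 1.0}, {0: 1.0}]
params = {1: (1.0, 0.5), 2: (10.0, 0.5)}
result = update_labels(values, weights, [1, 1, 1, 2, 2, 1], params, lam=1.0, a0=10.0)
print(result)  # the last node is far from both communities and becomes an outlier
```

Labels are updated in place during the sweep, so later nodes see the new labels of earlier ones; repeating the sweep until no label changes gives the iterative procedure described on the slide.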
Setting hyperparameters
◦ $a_0$ = threshold
◦ $\lambda$ = confidence in the network
◦ K = number of communities
Initialization
◦ Outliers are initially grouped into clusters.
◦ This is eventually corrected by the iterative updates.
Data generation
◦ Generate continuous data based on Gaussian distributions and generate labels according to the model
◦ Define r: percentage of outliers, K: number of communities
Baseline models
◦ GLODA: global outlier detection (based on node features only)
◦ DNODA: local outlier detection (check the feature values of direct neighbors)
◦ CNA: partition data into communities based on links and then conduct outlier detection in each community
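A simplified version of such a generator might look like the sketch below. It is not the experimental setup from the paper: labels are drawn uniformly rather than from the Markov random field, links are not generated, r is treated as a fraction in [0, 1], and the community means, unit standard deviation, and uniform range for outliers are all hypothetical choices.

```python
import random


def generate_synthetic(n, K, r, seed=0):
    """Return (values, labels) for n nodes, K communities, outlier fraction r."""
    rng = random.Random(seed)
    means = [10.0 * k for k in range(1, K + 1)]  # hypothetical, well-separated means
    n_outliers = round(r * n)
    # Label 0 marks an outlier; labels 1..K mark normal communities.
    labels = [0] * n_outliers + [rng.randint(1, K) for _ in range(n - n_outliers)]
    rng.shuffle(labels)
    values = []
    for z in labels:
        if z == 0:
            # Outliers: uniform over a range covering all communities.
            values.append(rng.uniform(0.0, 10.0 * (K + 1)))
        else:
            values.append(rng.gauss(means[z - 1], 1.0))
    return values, labels


values, labels = generate_synthetic(n=100, K=3, r=0.05)
print(len(values), labels.count(0))
```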
Communities
◦ Data mining, artificial intelligence, database, information analysis
Subnetwork of conferences
◦ Links: percentage of common authors between two conferences
◦ Node features: publication titles in the conference
Subnetwork of authors
◦ Links: co-authorship relationship
◦ Node features: titles of publications by an author
Community outliers: CVPR, CIKM
Community Outliers
Community Outlier Detection
Questions
On Community Outliers and their Efficient Detection in Information Networks. Gao, Liang, Fan, Wang, Sun, Han
Outlier detection. Irad Ben-Gal
Automated detection of outliers in real-world data. Last, Kandel
Outlier Detection for High Dimensional Data. Aggarwal, Yu