Community Detection with Edge Content in Social Media Networks Paper presented by Konstantinos Giannakopoulos.

Outline
Definitions
– Social Networks & Big Data
– Community Detection
The framework of Matrix Factorization algorithms
– Steps, Goals, Solution
– The PCA approach
The EIMF algorithm
– Description, Performance Metrics, Evaluation
Other Approaches
– Algorithms, Models, Metrics

From Social Networks to Big Data
[Figure: Network, Social Network, Big Data]

Social Networks
– Users act (conversations, likes, shares).
– Users are connected.

Community Detection
– Links and content
– Density of links
– Some of the less strongly linked vertices may belong to the same community if they share similar content.

General Methodology of MF Models
Decomposition of a matrix into a product of matrices.
– M: a matrix representation of the social network.
– $M \approx A B$, with M: [m × n], A: [m × k], B: [k × n], i.e. a product of two low-rank matrices.
– The factors hold k-dimensional feature vectors.
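The decomposition above can be illustrated with a minimal sketch. The slide does not say how A and B are obtained, so a truncated SVD is swapped in here as one standard way to get a rank-k product; the matrix `M` and the helper `low_rank_factors` are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def low_rank_factors(M, k):
    """Return A (m x k) and B (k x n) whose product is the best rank-k
    approximation of M in the Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    A = U[:, :k] * s[:k]   # k leading left singular vectors, scaled by singular values
    B = Vt[:k, :]          # k leading right singular vectors
    return A, B

# Hypothetical 4 x 5 matrix of rank 2 (two groups of identical rows).
M = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=float)

A, B = low_rank_factors(M, k=2)
error = np.linalg.norm(M - A @ B, "fro")   # Frobenius discrepancy between M and A B
```

Each row of `A` is then a 2-dimensional feature vector for the corresponding row of `M`; because this toy matrix has rank 2, the reconstruction error is zero up to rounding.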

What’s Next?
Two sub-models:
– Link matrix factorization $F_L$
– Content matrix factorization $F_C$
Each matrix contains a k-dimensional feature vector.
$F = \min \|M - P\|_F^2 + \text{regularization term}$
– P: product of matrices that approximates M.
Content is incorporated in $F_C$ using:
– cosine similarity
– the normalized Laplacian matrix.
Regularization term:
– improves robustness
– prevents overfitting.

Goal & Solution
Goal: to find an optimal representation of the latent vectors.
Optimization problem:
– The squared Frobenius norm $\|\cdot\|_F^2$ measures the discrepancy between matrices.
– When $F_L$ and $F_C$ are convex functions, the minimization problem is solved using conjugate gradient or quasi-Newton methods.
– Then $F_L$ and $F_C$ are incorporated into one objective function, which is usually convex too.
Obtain high-quality communities:
– use traditional clustering or classification methods such as k-means or SVMs.

The PCA Algorithm
State-of-the-art method for this model:
– PCA (similar to LSI).
Optimization problem: $\min_{Z,U} \|M - Z U^T\|_F^2 + \gamma \|U\|_F^2$
– Z: [n × l] matrix with each row being the l-dimensional feature vector of a node,
– $U^T$: [l × n] matrix, and
– $\|\cdot\|_F$: the Frobenius norm.
Goal: approximate M by $Z U^T$, a product of two low-rank matrices, with a regularization on U.

Edge-Induced Matrix Factorization (EIMF)
Partitions the edge set into k communities based on both linkage and content.
– Edges: latent vector space based on link structure.
– Content is incorporated into edges, so that the latent vectors of edges with similar content are clustered together.
Two objective functions:
– Linkage-based connectivity/density, captured by $O_l$
– Content-based similarity among the messages, captured by $O_c$

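$O_l$: link structure for any vertex and its incident edges
Approximate the link matrix $\Gamma$: [m × n]
$O_l(E) = \|E^T V - \Gamma\|_F^2$ or $O_l(E) = \|E^T E \Delta - \Gamma\|_F^2$
– E: [k × m], V: [k × n]
[Figure: link matrix with edge rows and vertex columns v1 to v6; edge factor with rows k1, k2 and columns e1 to e5; vertex factor with rows k1, k2 and columns v1 to v6]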

$O_c$: link incorporating edge content
Each edge has content associated with it.
– Each document is represented with a d-dimensional feature vector.
– Cosine similarity: similarity measure of two corresponding feature vectors.
– Normalized Laplacian matrix L: used to minimize the content-based objective function.
$O_c(E) = \min_E \operatorname{tr}(E^T \cdot L \cdot E)$
[Figure: content matrix C: [d × m], with rows term1, ..., termd and columns e1 to e5]
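A minimal sketch of the two ingredients named on this slide, cosine similarity between edge content vectors and a normalized Laplacian built from the resulting similarity matrix. The slide does not spell out the exact construction, so the symmetric form L = I - D^{-1/2} S D^{-1/2} is an assumption, and the term-count vectors are made up for illustration.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

def normalized_laplacian(S):
    """Assumed form: L = I - D^{-1/2} S D^{-1/2}, where D_ii = sum_j S_ij."""
    n = len(S)
    degree = [sum(row) for row in S]
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            scaled = 0.0
            if degree[i] > 0 and degree[j] > 0:
                scaled = S[i][j] / math.sqrt(degree[i] * degree[j])
            L[i][j] = (1.0 if i == j else 0.0) - scaled
    return L

# Hypothetical d = 4 term-count vectors for m = 3 edges (the columns of C).
content = [[2, 1, 0, 0], [4, 2, 0, 0], [0, 0, 3, 1]]
S = [[cosine_similarity(a, b) for b in content] for a in content]
L = normalized_laplacian(S)
```

The first two edges use the same terms in the same proportions, so their similarity is 1, while the third shares no terms with them and gets similarity 0.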

To Sum Up
Two objective functions:
– Linkage-based connectivity/density. The link structure for any vertex and its incident edges is $O_l(E) = \|E^T V - \Gamma\|_F^2$.
– Content-based similarity among text documents: $O_c(E) = \min_E \operatorname{tr}(E^T \cdot L \cdot E)$.
Goal
– Minimize the objective function $O(E) = O_l(E) + \lambda \cdot O_c(E)$.
Solution
– Convex functions => every local minimum is a global minimum => gradient-based methods.
– Apply k-means for the detection of the final communities.
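The combined objective can be sketched with plain gradient descent. This is an illustration under stated assumptions, not the paper's solver: V is held fixed, E is stored as [k × m] so the content term is evaluated as tr(E L E^T) to make the shapes agree, and the matrices `Gamma`, `S`, `V` and the value of `lam` are made up. NumPy is assumed to be available.

```python
import numpy as np

def objective(E, V, Gamma, L, lam):
    """O(E) = ||E^T V - Gamma||_F^2 + lam * tr(E L E^T), with E stored as k x m."""
    link_term = np.linalg.norm(E.T @ V - Gamma, "fro") ** 2
    content_term = np.trace(E @ L @ E.T)
    return link_term + lam * content_term

def gradient(E, V, Gamma, L, lam):
    residual = E.T @ V - Gamma              # m x n
    grad_link = 2.0 * V @ residual.T        # k x m
    grad_content = E @ (L + L.T)            # k x m
    return grad_link + lam * grad_content

def minimize(E, V, Gamma, L, lam, steps=200):
    # Step size 1 / (bound on the gradient's Lipschitz constant), so each step cannot increase O.
    lipschitz = 2.0 * np.linalg.norm(V, 2) ** 2 + lam * np.linalg.norm(L + L.T, 2)
    step = 1.0 / lipschitz
    history = [objective(E, V, Gamma, L, lam)]
    for _ in range(steps):
        E = E - step * gradient(E, V, Gamma, L, lam)
        history.append(objective(E, V, Gamma, L, lam))
    return E, history

rng = np.random.default_rng(0)
k, m, n = 2, 5, 6
# Hypothetical edge-by-vertex link matrix for a path of 5 edges over 6 vertices.
Gamma = np.zeros((m, n))
for edge, (u, v) in enumerate([(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]):
    Gamma[edge, u] = Gamma[edge, v] = 1.0
# Hypothetical content similarity between edges: {e1, e2} and {e3, e4, e5} are alike.
S = np.array([
    [1.0, 0.9, 0.0, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.8, 0.8],
    [0.0, 0.0, 0.8, 1.0, 0.8],
    [0.0, 0.0, 0.8, 0.8, 1.0],
])
d = S.sum(axis=1)
L = np.eye(m) - S / np.sqrt(np.outer(d, d))   # normalized Laplacian of S
V = rng.normal(size=(k, n))
E0 = rng.normal(size=(k, m))
E_opt, history = minimize(E0, V, Gamma, L, lam=0.5)
```

The columns of `E_opt` are the latent vectors of the edges; in the pipeline described above they would then be passed to k-means to obtain the final communities.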

Experiments
Characteristics of the datasets
– Enron dataset
  # of messages: 200,399
  # of users: 158
  # of communities: 53
– Flickr social network dataset
  # of users:
  # of communities: 15
  # of images: 26,920

Performance Metrics
Supervised
– Precision: the fraction of retrieved docs that are relevant, e.g. high precision means every result on the first page is relevant.
– Recall: the fraction of relevant docs that are retrieved, e.g. retrieve all the relevant results.
– Pairwise F-measure: a higher value suggests that the clustering is of good quality.
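The slide states precision and recall in retrieval terms. One common way to apply them to a clustering is over pairs of items: a pair is "retrieved" if the clustering puts both items together and "relevant" if the ground truth does. The sketch below assumes that pairwise reading; the label lists are made up.

```python
from itertools import combinations

def pairwise_scores(predicted, truth):
    """Pairwise precision, recall and F-measure of a clustering against ground truth."""
    same_pred, same_true = set(), set()
    for i, j in combinations(range(len(predicted)), 2):
        if predicted[i] == predicted[j]:
            same_pred.add((i, j))
        if truth[i] == truth[j]:
            same_true.add((i, j))
    true_positives = len(same_pred & same_true)
    precision = true_positives / len(same_pred) if same_pred else 0.0
    recall = true_positives / len(same_true) if same_true else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical cluster labels and ground-truth communities for six items.
predicted = [0, 0, 0, 1, 1, 1]
truth = [0, 0, 1, 1, 1, 1]
precision, recall, f_measure = pairwise_scores(predicted, truth)
```

Here 6 pairs are clustered together, 7 pairs truly belong together, and 4 pairs satisfy both, giving precision 4/6, recall 4/7 and F-measure 8/13.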

Performance Metrics Average Cluster Purity (ACP) – The average percentage of the dominant community in the different clusters.
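A minimal sketch of ACP as described above: for each cluster, take the share of its members that belong to the dominant ground-truth community, then average over clusters. An unweighted average is assumed, and the labels are made up.

```python
from collections import Counter

def average_cluster_purity(predicted, truth):
    """Mean, over clusters, of the fraction of members in the dominant true community."""
    clusters = {}
    for label, community in zip(predicted, truth):
        clusters.setdefault(label, []).append(community)
    purities = [
        Counter(members).most_common(1)[0][1] / len(members)
        for members in clusters.values()
    ]
    return sum(purities) / len(purities)

# Hypothetical cluster labels and ground-truth communities for six items.
predicted = [0, 0, 0, 1, 1, 1]
truth = [0, 0, 1, 1, 1, 1]
acp = average_cluster_purity(predicted, truth)
```

Cluster 0 is two-thirds pure and cluster 1 is fully pure, so the ACP is 5/6.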

Evaluation
Four sets of experiments with other algorithms:
– Link only: Newman, LDA-Link
– Content: LDA-Word, NCUT-Content
– Link + node content: LDA-Link-Word, NCUT-Link-Content
– Link + edge content: EIMF-Lap, EIMF-LP
Tuning the balancing parameter λ

Strong/Weak Points
Strong points
– Incorporation of message content into link connectivity.
– Detection of overlapping communities.
Weak points
– Tested mainly on datasets (directed communication) and on a dataset with tags, not on a social network (broadcast communication).
– The experiments do not treat it as a unified model.

More Link-Based Algorithms
Modularity
– Measures the strength of division of a network into modules.
– High modularity => dense inner connections & sparse outer connections.

$k_1 = k_2 = 1$, $k_3 = k_4 = k_5 = 2$, $M = 2|E| = 8$, $P_{ij} = \frac{k_i k_j}{M}$
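The degrees above can be turned into a modularity score with the usual formula Q = (1/M) Σ_ij (A_ij − P_ij) δ(c_i, c_j). The slide's graph is not recoverable from the transcript, so the sketch below assumes a path 1-3-4-5-2, which has exactly these degrees and |E| = 4; the community split is also made up.

```python
def modularity(edges, community):
    """Q = (1/M) * sum over same-community pairs (i, j) of (A_ij - k_i * k_j / M), M = 2|E|."""
    nodes = sorted(community)
    degree = {v: 0 for v in nodes}
    adjacent = set()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        adjacent.add((u, v))
        adjacent.add((v, u))
    M = 2 * len(edges)
    total = 0.0
    for i in nodes:
        for j in nodes:
            if community[i] != community[j]:
                continue
            a_ij = 1.0 if (i, j) in adjacent else 0.0
            p_ij = degree[i] * degree[j] / M   # expected edges under the null model
            total += a_ij - p_ij
    return total / M

# Assumed graph: path 1-3-4-5-2, so k1 = k2 = 1 and k3 = k4 = k5 = 2, M = 8.
edges = [(1, 3), (3, 4), (4, 5), (5, 2)]
# Hypothetical split into two communities.
community = {1: "a", 3: "a", 4: "b", 5: "b", 2: "b"}
Q = modularity(edges, community)
```

For this split each community contributes 0.109375, so Q = 0.21875; putting all five nodes in one community gives Q = 0.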

Even More Link-Based Algorithms
Betweenness
– Measures a node’s centrality in a network.
– It is the number of shortest paths from all vertices to all others that pass through that node.
Normalized Cut (Spectral Clustering)
– Uses the eigenvalues of the similarity matrix S of the data points to perform dimensionality reduction before clustering in fewer dimensions.
– Partitions the points into two sets based on the eigenvector corresponding to the second-smallest eigenvalue of the normalized Laplacian matrix of S, $L = I - D^{-1/2} S D^{-1/2}$, where D is the diagonal matrix with $D_{ii} = \sum_j S_{ij}$.
Clique-based
PHITS
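Node betweenness as defined above can be computed with Brandes' algorithm, sketched here for an unweighted, undirected graph. When several shortest paths connect a pair, each path counts fractionally, which is the standard convention and an assumption relative to the slide; the example graph is made up.

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm: shortest-path betweenness for an unweighted, undirected graph."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order = []                               # nodes in order of non-decreasing distance
        preds = {v: [] for v in adj}             # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}              # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:                  # first visit
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:       # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):                # accumulate dependencies back towards s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: c / 2 for v, c in score.items()}

# Hypothetical path graph a - b - c - d - e.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
scores = betweenness(adj)
```

On this path the middle node `c` lies between 4 pairs of other nodes, `b` and `d` between 3 each, and the endpoints between none.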

Other Algorithms
Node-Content
– PLSA: probabilistic model. The data are observations that arise from a generative probabilistic process that includes hidden variables. Posterior inference is used to infer the hidden structure.
– LDA: each piece of content is a mixture of various topics.
– SVM (on content and/or on links-content): a vector-based method that finds a decision boundary between two classes.
Combined Link Structure and Node-Content Analysis
– NCUT-Link-Content
– LDA-Link-Content

Other Community Detection Models
Discriminative
– Given a value vector c that the model aims to predict and a vector x that contains the values of the input features, the goal is to find the conditional distribution p(c | x).
– p(c | x) is described by a parametric model.
– Maximum likelihood estimation is used to find optimal values of the model parameters.
– State-of-the-art approach: PLSI

Generative
– Given some hidden parameters, the model randomly generates data. The goal is to find the joint probability distribution p(x, c).
– The conditional probability p(c | x) can be estimated through the joint distribution p(x, c), e.g. $P(c, u, z, \omega) = P(\omega \mid u)\, P(u \mid z)\, P(z \mid c)\, P(c)$.
– State-of-the-art approach: LDA
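The step from the joint distribution to the conditional one is the definition of conditional probability. The second line applies it to the factorization on this slide; treating ω as the observed variable and summing out u and z is an assumption made for illustration.

```latex
p(c \mid x) = \frac{p(x, c)}{\sum_{c'} p(x, c')}

P(c \mid \omega) \propto \sum_{u} \sum_{z} P(\omega \mid u)\, P(u \mid z)\, P(z \mid c)\, P(c)
```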

Bayesian Models
1. Estimate prior distributions for the model parameters (e.g. Dirichlet distribution with Gamma function, Beta distribution).
2. Estimate the joint probability of the complete data.
3. A Bayesian inference framework is used to maximize the posterior probability. The problem is intractable, thus optimization is necessary.
4. Apply the Gibbs sampling approach for parameter estimation, to compute the conditional probability.
   1. Compute statistics with initial assignments.
   2. For each iteration and for each node:
      a. Estimate the objective function.
      b. Sample the community assignment for node i according to the above distribution.
      c. Update the statistics.

Additional Evaluation Metrics
– Normalized Mutual Information (NMI): the mutual information between the detected communities and the ground-truth communities, normalized so that the score lies in [0, 1].
– Modularity
– NCUT
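A minimal sketch of NMI between a detected clustering and ground-truth communities. Several normalizations are in use; this one divides by the arithmetic mean of the two entropies, which is an assumption, and the label lists are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two label assignments over the same items."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (n_xy / n) * math.log(n_xy * n / (count_a[x] * count_b[y]))
        for (x, y), n_xy in joint.items()
    )

def nmi(a, b):
    """NMI = 2 I(a; b) / (H(a) + H(b)); defined as 1.0 when both entropies are zero."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a + h_b == 0:
        return 1.0
    return 2 * mutual_information(a, b) / (h_a + h_b)

# Hypothetical examples: same partition under different label names, and an unrelated one.
same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = nmi([0, 1, 0, 1], [0, 0, 1, 1])
```

The score depends only on how items are grouped, not on the label names, so the relabeled partition scores 1 and the unrelated one scores 0.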

Additional Evaluation Metrics Perplexity – A metric for evaluating language models (topic models). – A higher value of perplexity implies a lesser model likelihood and hence lesser generative power of the model.
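Perplexity is commonly computed as the exponential of the negative average log-likelihood per token. The sketch below assumes that definition and takes the model's per-token probabilities as given; the probability lists are made up.

```python
import math

def perplexity(token_probabilities):
    """exp(-(1/N) * sum_i log p(w_i)) for the probabilities a model assigns to N tokens."""
    n = len(token_probabilities)
    log_likelihood = sum(math.log(p) for p in token_probabilities)
    return math.exp(-log_likelihood / n)

# Hypothetical models: one assigns probability 0.25 to every token, the other 0.5.
weaker_model = perplexity([0.25] * 8)
stronger_model = perplexity([0.5] * 8)
```

A model that spreads its probability uniformly over 4 choices has perplexity 4; the model that assigns higher likelihood to the same tokens gets the lower perplexity, matching the statement above.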

Comparative Analysis
Three models
– MF, Discriminative (D), Generative (G)
Parameter estimation
– (MF) Minimization of an objective function: Frobenius norm, cosine similarity, Laplacian norm, quasi-Newton.
– (D) EM & MLE
– (G) Gibbs sampling (entropy-based, blocked)
Metrics
– (MF) PWF, ACP
– (D) NMI, PWF, Modularity
– (G) NMI, Modularity, Perplexity, Running time, # of iterations