Analysis of Massive Data Sets Prof. dr. sc. Siniša Srbljić Doc. dr. sc. Dejan Škvorc Doc. dr. sc. Ante Đerek Faculty of Electrical Engineering and Computing.

Slides:



Advertisements
Similar presentations
Clustering II.
Advertisements

Community Detection and Graph-based Clustering
Social network partition Presenter: Xiaofei Cao Partick Berg.
Hierarchical Clustering. Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram – A tree-like diagram that.
Hierarchical Clustering, DBSCAN The EM Algorithm
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
22C:19 Discrete Math Graphs Fall 2014 Sukumar Ghosh.
Community Detection with Edge Content in Social Media Networks Paper presented by Konstantinos Giannakopoulos.
CSE 5243 (AU 14) Graph Basics and a Gentle Introduction to PageRank 1.
Analysis and Modeling of Social Networks Foudalis Ilias.
Graph Partitioning Dr. Frank McCown Intro to Web Science Harding University This work is licensed under Creative Commons Attribution-NonCommercial 3.0Attribution-NonCommercial.
Crew Scheduling Housos Efthymios, Professor Computer Systems Laboratory (CSL) Electrical & Computer Engineering University of Patras.
Data Mining Techniques: Clustering
V4 Matrix algorithms and graph partitioning
Lectures on Network Flows
Author: Jie chen and Yousef Saad IEEE transactions of knowledge and data engineering.
Lecture 6 Image Segmentation
Clustering II.
Communities in Heterogeneous Networks Chapter 4 1 Chapter 4, Community Detection and Mining in Social Media. Lei Tang and Huan Liu, Morgan & Claypool,
Clustering… in General In vector space, clusters are vectors found within  of a cluster vector, with different techniques for determining the cluster.
Region Segmentation. Find sets of pixels, such that All pixels in region i satisfy some constraint of similarity.
Fast algorithm for detecting community structure in networks.
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland Tuesday, June 29, 2010 This work is licensed.
Chapter 9 Graph algorithms Lec 21 Dec 1, Sample Graph Problems Path problems. Connectedness problems. Spanning tree problems.
Computer Science 1 Web as a graph Anna Karpovsky.
The Shortest Path Problem
Clustering Unsupervised learning Generating “classes”
Models of Influence in Online Social Networks
Graph partition in PCB and VLSI physical synthesis Lin Zhong ELEC424, Fall 2010.
CHAMELEON : A Hierarchical Clustering Algorithm Using Dynamic Modeling
Graph Algorithms Ch. 5 Lin and Dyer. Graphs Are everywhere Manifest in the flow of s Connections on social network Bus or flight routes Social graphs:
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
Computer Science 112 Fundamentals of Programming II Introduction to Graphs.
Principles of Social Network Analysis. Definition of Social Networks “A social network is a set of actors that may have relationships with one another”
Automated Social Hierarchy Detection through Network Analysis (SNAKDD07) Ryan Rowe, Germ´an Creamer, Shlomo Hershkop, Salvatore J Stolfo 1 Advisor:
Mining Social Network Graphs Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata November 13, 17, 2014.
Vladyslav Kolbasin Stable Clustering. Clustering data Clustering is part of exploratory process Standard definition:  Clustering - grouping a set of.
Algorithm Course Dr. Aref Rashad February Algorithms Course..... Dr. Aref Rashad Part: 5 Graph Algorithms.
Special Topics in Educational Data Mining HUDK5199 Spring 2013 March 25, 2012.
CLUSTERING. Overview Definition of Clustering Existing clustering methods Clustering examples.
Chapter 3. Community Detection and Evaluation May 2013 Youn-Hee Han
Detecting Communities Via Simultaneous Clustering of Graphs and Folksonomies Akshay Java Anupam Joshi Tim Finin University of Maryland, Baltimore County.
Most of contents are provided by the website Graph Essentials TJTSD66: Advanced Topics in Social Media.
Concept Switching Azadeh Shakery. Concept Switching: Problem Definition C1C2Ck …
Network Community Behavior to Infer Human Activities.
Community Discovery in Social Network Yunming Ye Department of Computer Science Shenzhen Graduate School Harbin Institute of Technology.
CS 590 Term Project Epidemic model on Facebook
1 EE5900 Advanced Embedded System For Smart Infrastructure Static Scheduling.
Community structure in graphs Santo Fortunato. More links “inside” than “outside” Graphs are “sparse” “Communities”
Prof. Swarat Chaudhuri COMP 382: Reasoning about Algorithms Fall 2015.
Network Partition –Finding modules of the network. Graph Clustering –Partition graphs according to the connectivity. –Nodes within a cluster is highly.
Importance Measures on Nodes Lecture 2 Srinivasan Parthasarathy 1.
Analysis of Massive Data Sets Prof. dr. sc. Siniša Srbljić Doc. dr. sc. Dejan Škvorc Doc. dr. sc. Ante Đerek Faculty of Electrical Engineering and Computing.
Social Networks Strong and Weak Ties
Analysis of Massive Data Sets Prof. dr. sc. Siniša Srbljić Doc. dr. sc. Dejan Škvorc Doc. dr. sc. Ante Đerek Faculty of Electrical Engineering and Computing.
Alan Mislove Bimal Viswanath Krishna P. Gummadi Peter Druschel.
GUILLOU Frederic. Outline Introduction Motivations The basic recommendation system First phase : semantic similarities Second phase : communities Application.
Graph clustering to detect network modules
Graphs and Social Networks
Approximating the MST Weight in Sublinear Time
Lectures on Network Flows
Community detection in graphs
Network Science: A Short Introduction i3 Workshop
Why Social Graphs Are Different Communities Finding Triangles
EE5900 Advanced Embedded System For Smart Infrastructure
Affiliation Network Models of Clusters in Networks
Algorithm Course Dr. Aref Rashad
Hierarchical Clustering
Clustering.
Analysis of Large Graphs: Overlapping Communities
Presentation transcript:

Analysis of Massive Data Sets Prof. dr. sc. Siniša Srbljić Doc. dr. sc. Dejan Škvorc Doc. dr. sc. Ante Đerek Faculty of Electrical Engineering and Computing Consumer Computing Laboratory

Copyright © 2016 CCL: Analysis of massive data sets2 of 44Copyright © 2016 CCL: Analysis of massive data sets Community Detection in Social Network Graphs Goran Delač, PhD

Copyright © 2016 CCL: Analysis of massive data sets3 of 44 Outline □Social Networks o Social graph □Detecting Communities o Traditional approaches □Affiliation Graph Model o Detecting communities using AGM □BigCLAM Approach

Copyright © 2016 CCL: Analysis of massive data sets4 of 44 Social Networks □Modern social interactions

Copyright © 2016 CCL: Analysis of massive data sets5 of 44 Social Networks □Modern social interactions □Massive communities (Q4 2015) o Google+ 2,200,000,000 (profiles, low activity) o Facebook 1,591,000,000 o Instagram 400,000,000 o Twitter 320,000,000

Copyright © 2016 CCL: Analysis of massive data sets6 of 44 Social Networks □Vast amounts of data – immense opportunities o Trend analysis, information cascades o Sentiment analysis o Social search o Recommendations o…o… o Detecting communities

Copyright © 2016 CCL: Analysis of massive data sets7 of 44 Social Networks □What are social networks? 1.Collection of entities (usually people, but not necessary) 2.At least one relationship exists between entities (friendship, follower, …)  Unidirectional/Bidirectional  Binary/Weighted

Copyright © 2016 CCL: Analysis of massive data sets8 of 44 Social Networks □What are social networks? 3.Assumption of nonrandomness (locality)  If entity A is related to both B and C, there is higher then average probability that B and C are related Facebook social graph visualization

Copyright © 2016 CCL: Analysis of massive data sets9 of 44 Social Networks □Social Network Representation o Social graph  Entities are nodes  Connections are edges o Connections can have a degree  Labeled edges o Connections can have a direction  Directed (G+, Twitter)  Undirected (Facebook)

Copyright © 2016 CCL: Analysis of massive data sets10 of 44 Social Networks □Social Network Representation o Social graph  Entities are nodes  Connections are edges o Entities can have different types  e.g. users, pages, tags users can be related if they tag the same pages  k-partite social graph

Copyright © 2016 CCL: Analysis of massive data sets11 of 44 Social Networks □Relationships other than “friendship” o Telephone networks o networks o Collaboration networks o Information nets, infrastructure nets, … They all exhibit locality of “friendship” …

Copyright © 2016 CCL: Analysis of massive data sets12 of 44 Social Networks □Goal o Extract knowledge of social communities Partition the social graph

Copyright © 2016 CCL: Analysis of massive data sets13 of 44 Detecting Communities □Traditional methods o Minimal cut  One of the oldest methods  Minimize number of edges between communities  Problem: Will find communities even if they do not manifest o Hierarchical clustering  Apply similarity measure on the adjacency matrix Cosine similarity, Jaccard similarity, Hamming distance  Single-linkage clustering All node pairs in different communities have similarity below a certain threshold 1 i j

Copyright © 2016 CCL: Analysis of massive data sets14 of 44 Detecting Communities □Traditional methods o Girvan-Newman Algorithm  Identifies edges that lie between the communities and removes them  Measure: betweenness  Betweenness – number of times a node acts as a connection along the shortest path between other nodes in the graph  Edge betweenness – number of times an edge is within a shortest path between any two nodes in a graph

Copyright © 2016 CCL: Analysis of massive data sets15 of 44 Detecting Communities □Traditional methods o Girvan-Newman Algorithm 1.Calculate edge betweenness 2.Remove edge with highest betweenness 3.Recalculate betweenness 4.Repeat steps 1-3 until graph brakes down into communities  Effective but computationally heavy: O(m 2 n), m - edges, n - vertices

Copyright © 2016 CCL: Analysis of massive data sets16 of 44 Detecting Communities You High school friends Friends and family abroad College friends Football friends Communities can overlap!

Copyright © 2016 CCL: Analysis of massive data sets17 of 44 Detecting Communities You Goal

Copyright © 2016 CCL: Analysis of massive data sets18 of 44 □Problem o Communities can overlap! □Solution o Clique detection methods  All nodes within in a clique are directly connected (dense graph)  Clique percolation method (CPM) o Affiliation graph model (AGM)  Yang, Leskovec 2012 [3]3 Detecting Communities

Copyright © 2016 CCL: Analysis of massive data sets19 of 44 Affiliation Graph Model □Approach: Utilize generative models 1.Using a model generate the social network 2.Given a social network, derive the most appropriate model that describes it Generative model

Copyright © 2016 CCL: Analysis of massive data sets20 of 44 Affiliation Graph Model □Goal: Derive a model that generates social networks o The model has a set of parameters that need to be estimated  In doing so we in effect detect communities o Given network nodes (entities), how do communities (defined by AGM) generate the edges of the network? Generative model

Copyright © 2016 CCL: Analysis of massive data sets21 of 44 Affiliation Graph Model □Model o Two types of nodes  Social network entities (nodes)  Communities o Edges  Community membership AB Communities Nodes Memberships ?

Copyright © 2016 CCL: Analysis of massive data sets22 of 44 Affiliation Graph Model □Model o Probability that a node links to other nodes in community C pcpc o AGM(N, C, M, {p c }) AB Communities Nodes Memberships ? pApA pBpB

Copyright © 2016 CCL: Analysis of massive data sets23 of 44 Affiliation Graph Model □Generative process o Each pair of nodes in community C is connected with probability p c □Total probability o Nodes u and v are connected

Copyright © 2016 CCL: Analysis of massive data sets24 of 44 Affiliation Graph Model □Generative process o Go through all node pairs (u, v) and generate a connection (edge) with probability:

Copyright © 2016 CCL: Analysis of massive data sets25 of 44 Affiliation Graph Model □Advantage of AGM o Flexible community representation  Non-overlapping  Overlapping  Nested

Copyright © 2016 CCL: Analysis of massive data sets26 of 44 Detecting Communities using AGM □Finding the model = finding the communities AGM Input: social networkOutput: AGM Community memberships

Copyright © 2016 CCL: Analysis of massive data sets27 of 44 Detecting Communities using AGM □Find AGM parameters o N → social net (entities) → directly from graph o C → number of comminutes → estimate form graph [3]3 o p c → prob. that two nodes in community are connected o M → community memberships

Copyright © 2016 CCL: Analysis of massive data sets28 of 44 Detecting Communities using AGM □Solution: Maximum Likelihood Estimation o Given a social graph G o Given a model f(param) o We want to estimate P f (G | param)  Conditional probability  The probability that AGM generated G given parameters param □Find the most likely model that generated graph G

Copyright © 2016 CCL: Analysis of massive data sets29 of 44 Detecting Communities using AGM □MLE: Example o Suppose we have a sequence of events (e.g. coin toss, rainy days etc.) o X = [0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0] o Model f(Y) – return 1 with probability Y o What is P f (X | Y) ? o Assuming the events are independent  P f (X | Y) = P f (0 | Y) * P f (1 | Y) * … * P f (0 | Y) = Y 5 (1-Y) 7  P f (X | Y = 5 / 12) =  X was most probably generated if Y = 5 / 12

Copyright © 2016 CCL: Analysis of massive data sets30 of 44 Detecting Communities using AGM

Copyright © 2016 CCL: Analysis of massive data sets31 of 44 Detecting Communities using AGM □MLE for AGM o Goal: find parameters param = (N, C, M, {p c }) such that: o Problem  Finding param is equal to finding bipartite affiliation network  Too hard for large data sets!

Copyright © 2016 CCL: Analysis of massive data sets32 of 44 BigCLAM Approach □Solution: Relax the model □BigCLAM (Yang, Leskovec 2013) [2]2 o Cluster Affiliation Model for Big Networks o Idea: avoid discrete memberships  Introduce strengths to memberships  Strengths are non negative values  If strength is 0, the entity is not a member of the given community  If strength is high, the entity is a very active community member o Implications  Moving to continuous domain enables usage of very effective approaches, like gradient descent

Copyright © 2016 CCL: Analysis of massive data sets33 of 44 BigCLAM Approach □Membership strengths o F uA > 0: membership strength o Probability that nodes u and v are connected in A: A B

Copyright © 2016 CCL: Analysis of massive data sets34 of 44 BigCLAM Approach □Membership strength matrix F Communities Nodes F u A v FuFu

Copyright © 2016 CCL: Analysis of massive data sets35 of 44 BigCLAM Approach □Probability that at least one common node connects nodes u and v

Copyright © 2016 CCL: Analysis of massive data sets36 of 44 BigCLAM Approach □Goal: Find such F so that:

Copyright © 2016 CCL: Analysis of massive data sets37 of 44 BigCLAM Approach □Modification: log likelihood o Why?  Sums instead of products  Errors are less pronounced when summing small numbers

Copyright © 2016 CCL: Analysis of massive data sets38 of 44 BigCLAM Approach □Goal o Find F that maximizes:

Copyright © 2016 CCL: Analysis of massive data sets39 of 44 BigCLAM Approach □Gradient descent 1.Compute a gradient for a single row 2.Update row – move in the direction of gradient 3.Repeat for all rows until F stops changing

Copyright © 2016 CCL: Analysis of massive data sets40 of 44 BigCLAM Approach □Gradient descent

Copyright © 2016 CCL: Analysis of massive data sets41 of 44 BigCLAM Approach □Gradient descent Update row

Copyright © 2016 CCL: Analysis of massive data sets42 of 44 BigCLAM Approach However! Node degree is much smaller than the total number of nodes in the network! Compute once at the beginning of a pass

Copyright © 2016 CCL: Analysis of massive data sets43 of 44 BigCLAM Approach [2] Yang, Leskovec ~ 5 min for 300k nodes ~ 1 day for network with 100M edges

Copyright © 2016 CCL: Analysis of massive data sets44 of 44 Literature 1.J. Leskovec, A. Rajaraman, and J. D. Ullman, "Mining of Massive Datasets", 2014, Chapter 6 : "Mining Social-Network Graphs" (link)link 2.J. Yang, J. Leskovec: "Overlapping community detection at scale: a nonnegative matrix factorization approach ", WSDM '13 Proceedings of the sixth ACM international conference on Web search and data mining, Rome, Italy, February 2013, pp. 587 – 596. (link)link 3.J. Yang, J. Leskovec: "Community-Affiliation Graph Model for Overlapping Network Community Detection", ICDM '12 Proceedings of the 2012 IEEE 12th International Conference on Data Mining, Brussels, Belgium, December 2012, pp. 1170– (link)link