Finding community structure in very large networks

Presentation transcript:

Finding community structure in very large networks, by Aaron Clauset, M. E. J. Newman, and Cristopher Moore

Talk outline: Introduction and reminder; The algorithm; Example: Amazon.com; Summary

Girvan & Newman: betweenness clustering. Edge betweenness is the number of shortest paths between vertex pairs that run along an edge. The divisive (splitting) algorithm: (1) compute all-pairs shortest paths; (2) for each edge, compute the number of such paths it belongs to; (3) remove the edge with the highest betweenness; (4) repeat from step 1 until no edges are left.
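
As a minimal sketch of step 2, edge betweenness on an unweighted, undirected graph can be computed with one breadth-first search per source vertex, accumulating path counts backwards (the approach usually attributed to Brandes). The function name `edge_betweenness` and the adjacency-dictionary input format are assumptions made for this illustration, not part of the talk.

```python
from collections import deque, defaultdict

def edge_betweenness(adj):
    """Edge betweenness for an undirected, unweighted graph.

    adj: dict mapping each vertex to the set of its neighbours (hypothetical format).
    Returns a dict mapping frozenset({u, v}) to the number of shortest paths
    between vertex pairs that run along edge (u, v); when several shortest paths
    tie, each contributes a fractional share.
    """
    bet = defaultdict(float)
    for s in adj:
        # Breadth-first search from s: count shortest paths and record predecessors.
        dist = {s: 0}
        sigma = defaultdict(float)
        sigma[s] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest vertices, pushing path credit onto edges.
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                credit = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += credit
                delta[v] += credit
    # Every unordered vertex pair was counted from both of its endpoints.
    return {edge: value / 2.0 for edge, value in bet.items()}
```

On two triangles joined by a single edge, the joining edge receives the highest betweenness, so it is the first edge the divisive algorithm would remove.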

Girvan & Newman: disadvantages. Betweenness needs to be recalculated at each step, because removing an edge can change the betweenness of other edges. This is very expensive: all-pairs shortest paths cost O(n^3), and the full algorithm costs O(m^2 n), that is, O(mn) to calculate edge betweenness, repeated for m iterations. It does not scale to more than a few hundred nodes.

Dendrogram (hierarchical tree). A dendrogram illustrates the output of hierarchical clustering algorithms. Leaves represent graph nodes; the top represents the original graph. As we move down the tree, larger communities are partitioned into smaller ones.

Quality functions. Hierarchical clustering algorithms create numerous partitions. In general, we do not know how many communities we should seek. How can we know that our clustering is “good”? We need a quality function.

The modularity quality function. Modularity Q is designed to measure the strength of a division of a network into clusters/communities. It measures whether the division is a good one, in the sense that there are many edges within communities and only a few between them. If a high value of Q represents a good community division, why not simply optimize Q over all possible divisions to find the best one?

Is there community structure in very large networks? How can we find it?

Newman's fast algorithm (2003). Greedy optimization of modularity: starting with each vertex as the sole member of one of n communities, we repeatedly join communities together in pairs, choosing at each step the join that results in the greatest increase (or smallest decrease) in Q (an agglomerative algorithm). The progress of the algorithm can be represented as a “dendrogram,” a tree that shows the order of the joins. A naive implementation runs in time O((m + n)n), or O(n^2) on a sparse graph. Thank you, Shlomi, for introducing it to us!

The algorithm Introduction and reminder The algorithm Example: Amazon.com Summary

The algorithm (2004). Here we propose a new algorithm that performs the same greedy optimization, using more sophisticated data structures to reduce time complexity and memory use. It runs far more quickly, in time O(md log n), where d is the depth of the “dendrogram” describing the network’s community structure.

Definitions. $A_{vw}$ is an element of the adjacency matrix of the network: $A_{vw} = 1$ if vertices v and w are connected and 0 otherwise. Vertex v belongs to community $c_v$.

Definitions (cont.). The δ-function $\delta(c_i, c_j)$ is 1 if $c_i = c_j$ and 0 otherwise. The degree $k_v$ of a vertex v is defined to be the number of edges incident upon it:
$$k_v = \sum_w A_{vw}$$

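Definitions (cont.). $e_{ij}$ is the fraction of edges that join vertices in community i to vertices in community j:
$$e_{ij} = \frac{1}{2m} \sum_{vw} A_{vw}\, \delta(c_v, i)\, \delta(c_w, j)$$
$a_i$ is the fraction of ends of edges that are attached to vertices in community i:
$$a_i = \frac{1}{2m} \sum_v k_v\, \delta(c_v, i)$$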

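The modularity calculation.
$$Q = \frac{1}{2m} \sum_{vw} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \delta(c_v, c_w)$$
Here Q is the modularity value, m is the number of edges, $A_{vw}$ is the graph adjacency matrix, $k_v k_w$ is the product of the degrees of the node pair, $k_v k_w / 2m$ is the probability of an edge in a random graph where that probability is only a function of the node degrees, and $\delta(c_v, c_w)$ is the in-same-cluster indicator variable.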

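The modularity calculation (cont.).
$$\frac{1}{2m} \sum_{vw} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \delta(c_v, c_w) = \sum_i \left( e_{ii} - a_i^2 \right) = Q$$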
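
The community form of modularity, Q as the sum over communities of e_ii minus a_i squared, is easy to evaluate directly from an edge list. The sketch below assumes an undirected simple graph given as a list of vertex pairs and a dictionary from vertex to community label; the name `modularity` and both input formats are illustrative choices, not taken from the talk.

```python
def modularity(edges, community):
    """Modularity Q = sum_i (e_ii - a_i^2) for a given community assignment.

    edges: list of (u, v) pairs of an undirected simple graph (hypothetical format).
    community: dict mapping each vertex to its community label.
    """
    m = len(edges)
    e_in = {}  # e_ii: fraction of edges with both ends inside community i
    a = {}     # a_i: fraction of edge ends attached to community i
    for u, v in edges:
        cu, cv = community[u], community[v]
        # Each edge has two ends out of 2m edge ends in total.
        a[cu] = a.get(cu, 0.0) + 1.0 / (2 * m)
        a[cv] = a.get(cv, 0.0) + 1.0 / (2 * m)
        if cu == cv:
            # An internal edge appears twice in the symmetric adjacency matrix,
            # so it contributes 2 / (2m) = 1 / m to e_ii.
            e_in[cu] = e_in.get(cu, 0.0) + 1.0 / m
    return sum(e_in.get(c, 0.0) - a[c] ** 2 for c in a)
```

For two triangles joined by one edge, splitting at the joining edge gives Q = 6/7 - 1/2, while putting every vertex in a single community gives Q = 0.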

Purpose of the algorithm The operation of the algorithm involves finding the changes in Q that would result from the join of each pair of communities, choosing the largest of them, and performing the corresponding join

Updates on the previous algorithm. In the previous algorithm the operation is carried out explicitly on the entire matrix. If the adjacency matrix is sparse, however, the operation can be carried out more efficiently using data structures for sparse matrices.

Data structures. 1. A sparse matrix containing ∆Qij for each pair i, j of communities with at least one edge between them. We store each row of the matrix both as a balanced binary tree (so that elements can be found or inserted in O(log n) time) and as a max-heap (so that the largest element can be found in constant time). Note: ai measures how many edges touch community i in any way, including its internal edges.

Data structures (cont.). 2. A max-heap H containing the largest element of each row of the matrix ∆Qij, along with the labels i, j of the corresponding pair of communities. 3. An ordinary vector array with elements ai. 4. Q, the maximal modularity. [Figure: rows j and k of ∆Q and the max-heap H of their largest elements.]
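
Python's standard library has no balanced binary tree, so the sketch below swaps in a different and simpler technique for the heap H: a single `heapq` heap with lazy invalidation, where outdated entries are skipped when they reach the top. This is not the per-row tree-plus-heap layout described in the talk; the class name `LazyMaxHeap` and its methods are hypothetical.

```python
import heapq

class LazyMaxHeap:
    """Max-heap of (value, i, j) entries with lazy invalidation.

    heapq implements a min-heap, so values are negated when stored.
    Outdated entries stay in the heap and are skipped when popped.
    """

    def __init__(self):
        self._heap = []
        self.current = {}  # (i, j) -> most recent value for that pair

    def push(self, i, j, value):
        """Insert or overwrite the value for pair (i, j) in O(log n)."""
        self.current[(i, j)] = value
        heapq.heappush(self._heap, (-value, i, j))

    def discard(self, i, j):
        """Mark pair (i, j) as removed; its heap entries become stale."""
        self.current.pop((i, j), None)

    def pop_max(self):
        """Return the largest live (value, i, j), or None if none is left."""
        while self._heap:
            neg_value, i, j = heapq.heappop(self._heap)
            if self.current.get((i, j)) == -neg_value:
                del self.current[(i, j)]
                return -neg_value, i, j
        return None
```

The trade-off is that stale entries take up memory until they are popped, in exchange for not having to locate and adjust an entry inside the heap whenever a ∆Q value changes.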

Initialization. We start off with each vertex being the sole member of a community of one, in which case, for each i, eij = 1/2m if i and j are connected and zero otherwise, and ai = ki/2m.

The algorithm. (1) Calculate the initial values of ∆Qij and ai according to the initialization, and populate the max-heap with the largest element of each row of the matrix ∆Q. (2) Select the largest ∆Qij from H, join the corresponding communities, update the matrix ∆Q, the heap H and ai (as described later), and increment Q by ∆Qij. (3) Repeat step 2 until only one community remains.
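
The three steps can be sketched end to end in a deliberately simplified form. The version below keeps ∆Q as a dictionary of dictionaries and finds the largest entry by scanning all of it, so it does not reach the O(md log n) bound; it only illustrates the join-and-update logic. It assumes an undirected simple graph, takes ∆Qij = 2(eij - ai aj) as the change in Q for joining i and j (the form used in Newman's earlier fast algorithm, assumed here rather than stated in the talk), and applies a three-case update to the surviving row depending on whether a neighbor k is connected to both communities or to only one. The name `greedy_modularity` is hypothetical.

```python
def greedy_modularity(edges):
    """Greedy agglomerative modularity optimisation (simplified sketch).

    edges: list of (u, v) pairs of an undirected simple graph.
    Returns (best_q, best_partition); best_partition maps vertex -> label.
    """
    m = len(edges)
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    # Step 1: every vertex is its own community, so a_i = k_i / 2m.
    a = {v: k / (2.0 * m) for v, k in deg.items()}
    dq = {v: {} for v in deg}  # sparse dQ: rows hold connected pairs only
    for u, v in edges:
        value = 2.0 * (1.0 / (2 * m) - a[u] * a[v])
        dq[u][v] = value
        dq[v][u] = value
    label = {v: v for v in deg}
    q = -sum(x * x for x in a.values())  # e_ii = 0 for singletons
    best_q, best = q, dict(label)
    while True:
        # Step 2: pick the largest dQ (full scan stands in for the heap H).
        pairs = [(val, i, j) for i, row in dq.items() for j, val in row.items()]
        if not pairs:
            break  # no connected pair of communities is left
        val, i, j = max(pairs, key=lambda t: t[0])
        row_i, row_j = dq.pop(i), dq[j]
        del row_i[j]
        del row_j[i]
        for k in set(row_i) | set(row_j):
            if k in row_i and k in row_j:      # k connected to both i and j
                new = row_i[k] + row_j[k]
            elif k in row_i:                   # k connected to i only
                new = row_i[k] - 2.0 * a[j] * a[k]
            else:                              # k connected to j only
                new = row_j[k] - 2.0 * a[i] * a[k]
            row_j[k] = new
            dq[k][j] = new
            dq[k].pop(i, None)
        a[j] += a.pop(i)
        for v, c in label.items():
            if c == i:
                label[v] = j
        q += val
        if q > best_q:
            best_q, best = q, dict(label)
    # Step 3 is the loop itself: it ends when nothing can be joined.
    return best_q, best
```

On two triangles joined by one edge, this sketch recovers the two triangles as communities at the point of maximum Q.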

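Update the matrix ∆Q. If we join communities i and j, labeling the combined community j, say, we need only update the jth row and column, and remove the ith row and column altogether. If community k is connected to both i and j, then
$$\Delta Q'_{jk} = \Delta Q_{ik} + \Delta Q_{jk} \tag{10a}$$
If k is connected to i but not to j, then
$$\Delta Q'_{jk} = \Delta Q_{ik} - 2 a_j a_k \tag{10b}$$
If k is connected to j but not to i, then
$$\Delta Q'_{jk} = \Delta Q_{jk} - 2 a_i a_k \tag{10c}$$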
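
The three cases can be checked with a short derivation. It assumes the form $\Delta Q_{ij} = 2(e_{ij} - a_i a_j)$ for the change in Q on joining i and j, taken from Newman's earlier fast algorithm and not stated on the slide, together with the fact that joining i into j adds the corresponding rows of e and the corresponding entries of a.

```latex
% After joining i into j: e'_{jk} = e_{ik} + e_{jk} and a'_j = a_i + a_j.
\begin{align*}
\Delta Q'_{jk} &= 2\left(e'_{jk} - a'_j a_k\right) \\
               &= 2\left(e_{ik} + e_{jk} - (a_i + a_j)\,a_k\right) \\
               &= 2\left(e_{ik} - a_i a_k\right) + 2\left(e_{jk} - a_j a_k\right).
\end{align*}
% k connected to both i and j: both brackets are stored entries of the matrix,
%   so \Delta Q'_{jk} = \Delta Q_{ik} + \Delta Q_{jk}.
% k connected to i only: e_{jk} = 0, so the second bracket is -2 a_j a_k,
%   giving \Delta Q'_{jk} = \Delta Q_{ik} - 2 a_j a_k.
% k connected to j only: e_{ik} = 0, so the first bracket is -2 a_i a_k,
%   giving \Delta Q'_{jk} = \Delta Q_{jk} - 2 a_i a_k.
```

The one-sided cases need the extra term because ∆Q is stored only for connected pairs, so the missing entry cannot simply be read from the matrix.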

Reminder of how modularity can help us visualize large networks

Reminder: run time. Insertion in a balanced binary tree takes O(log n). Updating the max-heap for the kth row by inserting, raising, or lowering ∆Qkj takes O(log |k|) ≤ O(log n) time. Binary heap operations: find-max Θ(1); delete-max Θ(log n); insert O(log n); merge Θ(n). https://www.youtube.com/watch?v=vDHFF4wjWYU

Run time. |i| is the degree of community i, that is, the number of neighboring communities. Joining i and j takes O((|i| + |j|) log n): for (10a), inserting each of the |i| elements of the ith row into the jth row costs O(log |j|) each; for (10b) and (10c), each of the at most |i| + |j| insertions costs O(log(|i| + |j|)); in the kth row, updating a single element costs O(log |k|), at most O(log n), and there are at most |i| + |j| values of k for which we have to do this. Total: O((|i| + |j|) log n), for both the binary tree and the max-heap.

Run time (cont.). The total running time is at most O(log n) times the sum, over all nodes of the dendrogram, of the degrees of the corresponding communities. In the worst case, the degree of a community is the sum of the degrees of all the vertices in the original network comprising it. In that case, each vertex of the original network contributes its degree to all of the communities it is a part of, along the path in the dendrogram from it to the root.

Run time (cont.). If the dendrogram has depth d, there are at most d nodes in this path, and since the total degree of all the vertices is 2m, we have a running time of O(md log n), as stated.

Practical situations. It is usually unnecessary to maintain the separate max-heaps for each row: their maintenance takes a moderate amount of effort, and this effort is wasted if the largest element in a row does not change when two rows are joined. Instead, if the largest element of the kth row was ∆Qki or ∆Qkj and is now reduced by Eq. (10b) or (10c), we simply scan the kth row to find the new largest element. The average-case running time is often better than that of the more sophisticated algorithm.

Example: Amazon.com Introduction and reminder The algorithm Summary

The connections: Amazon. The network we study consists of items listed on the Amazon web site. The network has 409,687 items and 2,464,630 edges. Items can be books, music, video games, etc. There is an edge from A to B if and only if B was frequently purchased by buyers of A.

Bridge: an edge that, when removed, splits off a community. Bridges can act as bottlenecks for information flow.

Looking at the largest communities in the network, we find that they tend to consist of items (books, music) in similar genres or on similar topics

Power law. Partitioned at the point of maximum modularity, the distribution of community sizes s appears to have a power-law form. [Figure: cumulative distribution of community sizes when the network is partitioned at the maximum modularity found by the algorithm.]
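
A cumulative distribution of community sizes can be tabulated from any vertex-to-community assignment. The sketch below computes, for each observed size s, the fraction of communities with size at least s; the function name and input format are assumptions for illustration, and no data from the Amazon network is reproduced here.

```python
from collections import Counter

def cumulative_size_distribution(community):
    """Fraction of communities with size >= s, for each observed size s.

    community: dict mapping each vertex to its community label (hypothetical format).
    Returns a list of (s, fraction) pairs sorted by increasing s.
    """
    sizes = Counter(community.values())    # label -> community size
    size_counts = Counter(sizes.values())  # size -> number of communities
    total = len(sizes)
    remaining = total
    result = []
    for s in sorted(size_counts):
        result.append((s, remaining / total))
        remaining -= size_counts[s]
    return result
```

Plotted on log-log axes, a power-law form would show up as an approximately straight line.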

Summary Introduction and reminder The algorithm Example: Amazon.com

Summary. Run time: O(md log n), where n is the number of vertices, m the number of edges, and d the depth of the dendrogram. For a balanced dendrogram (d ∼ log n) and a sparse network (m ∼ n), the run time is O(n log^2 n). The algorithm should allow researchers to analyze even larger networks, with millions of vertices and tens of millions of edges, using current computing resources.
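
To get a feel for the gap between the two bounds, the operation-count estimates can be evaluated for the Amazon network sizes quoted in the talk. This is only a rough illustration: it ignores constant factors, uses base-2 logarithms, and assumes a balanced dendrogram with d = log2(n), which the talk does not claim for this network. The function name `estimated_costs` is hypothetical.

```python
import math

def estimated_costs(n, m, d=None):
    """Rough operation counts for the naive and the fast greedy algorithm.

    n: number of vertices, m: number of edges, d: dendrogram depth.
    If d is not given, a balanced dendrogram (d = log2 n) is assumed.
    """
    if d is None:
        d = math.log2(n)  # assumption: balanced dendrogram
    return {
        "naive": (m + n) * n,            # O((m + n) n)
        "fast": m * d * math.log2(n),    # O(m d log n)
    }

# Network sizes quoted in the talk for the Amazon example.
costs = estimated_costs(n=409_687, m=2_464_630)
ratio = costs["naive"] / costs["fast"]
```

Under these assumptions the ratio comes out on the order of a thousand; an unbalanced dendrogram (larger d) would shrink it.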

Improvements. Unfortunately, the algorithm does not scale well, and its use is practically limited to networks of up to 500,000 nodes. We show that this inefficiency is caused by merging communities in an unbalanced manner, and that simple heuristics that attempt to merge community structures in a balanced manner can dramatically improve community structure analysis. (A heuristic is a simple rule of thinking, a kind of rule of thumb based on simple logic or intuition, that offers an easy and fast way to make decisions without in-depth analysis, at the cost of lower accuracy.) http://dl.acm.org/citation.cfm?id=1242805

Improvements (cont.). The proposed techniques are tested using data sets obtained from an existing social networking service that hosts 5.5 million users. We have tested two variations of the heuristics. The fastest method processes an SNS friendship network with 1 million users in 5 minutes (70 times faster than our algorithm) and another friendship network with 4 million users in 35 minutes.

Credits

The End