1 Modularity and Community Structure in Networks* Final project *Based on a paper by M.E.J Newman in PNAS 2006.

Presentation transcript:

1 Modularity and Community Structure in Networks* Final project *Based on a paper by M.E.J Newman in PNAS 2006

2 Introduction

3 Networks A network is represented by a graph G(V,E): V = nodes, E = edges (links between node pairs). Examples of real-life networks: –social networks (V = people) –World Wide Web (V = webpages) –protein-protein interaction networks (V = proteins)

4 Protein-protein Interaction Networks Nodes – proteins (6K), edges – interactions (15K). Reflect the cell’s machinery and signaling pathways.

5 Communities (clusters) in a network A community (cluster) is a densely connected group of vertices, with only sparser connections to other groups.

6 Searching for communities in a network There are numerous algorithms with different "target functions": –"Homogeneity": densely connected clusters –"Separation": graph partitioning, min-cut approach. Clustering is important for understanding the structure of the network: –it provides an overview of the network

7 Distilling Modules from Networks Motivation: identifying protein complexes responsible for certain functions in the cell

8 Newman's network division algorithm

9 Important features of Newman's clustering algorithm
–The number and size of the clusters are determined by the algorithm
–Attempts to find a division that maximizes a modularity score Q (a heuristic algorithm)
–Notifies when the network is non-modular

10 Modularity of a division (Q)
Q = #(edges within groups) - E(#(edges within groups in a RANDOM graph with the same node degrees))
Trivial division: all vertices in one group ==> Q(trivial division) = 0
–$k_i$ = degree of node i
–$M = \sum_i k_i = 2|E|$
–$A_{ij} = 1$ if $(i,j) \in E$, 0 otherwise
–$E_{ij}$ = expected number of edges between i and j in a random graph with the same node degrees
Lemma: $E_{ij} \approx k_i k_j / M$
$Q = \sum_{i,j \text{ in the same group}} \left( A_{ij} - \frac{k_i k_j}{M} \right)$
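The definition above can be turned into a direct computation. The following is a minimal sketch, assuming a dense 0/1 adjacency matrix and a per-node group id array; the function name `modularity_score` and the dense layout are illustrative choices, not part of the project specification. The sum runs over ordered pairs and includes i == j, which is what makes the trivial division score 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Modularity score in the slide's unnormalized form:
 *   Q = sum over ordered pairs (i,j) in the same group of (A_ij - k_i*k_j/M),
 * with the diagonal pairs i == j included.
 * adj:   dense n*n 0/1 adjacency matrix, row-major (assumption for brevity)
 * group: group[i] is the group id of node i
 * Returns 0.0 for an edgeless graph or on allocation failure. */
double modularity_score(const int *adj, const int *group, size_t n)
{
    assert(adj != NULL && group != NULL);
    double *k = calloc(n, sizeof *k);   /* node degrees */
    double M = 0.0, Q = 0.0;
    if (k == NULL)
        return 0.0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++)
            k[i] += adj[i * n + j];
        M += k[i];                      /* M = sum of degrees = 2|E| */
    }
    if (M > 0.0) {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++)
                if (group[i] == group[j])
                    Q += adj[i * n + j] - k[i] * k[j] / M;
    }
    free(k);
    return Q;
}
```

For two triangles joined by a single edge, splitting along that edge gives Q = 12 - 98/14 = 5, while putting all six nodes in one group gives Q = 0.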

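11 Algorithm 1: Division into two groups (1) Suppose we have n vertices {1,...,n}. A vector $s \in \{\pm 1\}^n$ represents a 2-division: –$s_i = s_j$ iff i and j are in the same group –$\frac{1}{2}(s_i s_j + 1) = 1$ if $s_i = s_j$, 0 otherwise ==> $Q = \sum_{i,j \text{ in the same group}} \left( A_{ij} - \frac{k_i k_j}{M} \right) = \frac{1}{2} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{M} \right) (s_i s_j + 1)$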

12 Algorithm 1: Division into two groups (2) Since $\sum_{i,j} B_{ij} = 0$: $Q = \frac{1}{2} s^T B s$, where B = the modularity matrix, $B_{ij} = A_{ij} - k_i k_j / M$: –symmetric –every row sums to 0 ==> 0 is an eigenvalue of B
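As a sanity check on the modularity matrix, the sketch below builds B from a dense adjacency matrix and evaluates the score of a 2-division as 0.5 * s^T B s. It assumes the unnormalized convention B_ij = A_ij - k_i*k_j/M used in these slides; the function names and the dense storage are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fills B (n*n, row-major) with B_ij = A_ij - k_i*k_j/M.
 * adj is a dense 0/1 adjacency matrix (assumption for brevity).
 * Returns 0 on success, -1 on allocation failure or an edgeless graph. */
int modularity_matrix(const int *adj, size_t n, double *B)
{
    assert(adj != NULL && B != NULL);
    double *k = calloc(n, sizeof *k);
    double M = 0.0;
    if (k == NULL)
        return -1;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++)
            k[i] += adj[i * n + j];
        M += k[i];
    }
    if (M <= 0.0) {
        free(k);
        return -1;
    }
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            B[i * n + j] = adj[i * n + j] - k[i] * k[j] / M;
    free(k);
    return 0;
}

/* Score of the 2-division encoded by s (entries +1/-1): 0.5 * s^T B s. */
double two_division_score(const double *B, const int *s, size_t n)
{
    double q = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            q += B[i * n + j] * s[i] * s[j];
    return 0.5 * q;
}
```

Every row of the resulting B sums to zero, which is why the all-ones vector is an eigenvector with eigenvalue 0.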

13 Modularity matrix: example

14 Algorithm 1: Division into two groups (3)
B is symmetric ==> B is diagonalizable (real eigenvalues)
–B's eigenvalues: $\beta_1 \ge \beta_2 \ge \dots \ge \beta_n$
–B's corresponding eigenvectors: $u_1, \dots, u_n$, with $B u_i = \beta_i u_i$
–$s = \sum_i a_i u_i$, so $n = \|s\|^2 = \sum_i a_i^2$
Which vector s maximizes Q?
–clearly $s \sim u_1$ maximizes Q, but $u_1$ may not be a {±1} vector
–Greedy heuristic: choose $s \sim u_1$: $s_i = +1$ if $(u_1)_i > 0$, $s_i = -1$ otherwise
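The reason the leading eigenvector is the natural target can be spelled out in one line. The derivation below is a sketch that assumes the eigenvectors are chosen orthonormal, that eigenvalues are sorted so that the first is the largest, and that Q = (1/2) s^T B s in the slides' unnormalized convention; the symbols beta_i and a_i = u_i^T s are notation assumed here.

```latex
Q = \tfrac{1}{2}\, s^{T} B s
  = \tfrac{1}{2} \Big(\sum_i a_i u_i\Big)^{T} B \Big(\sum_j a_j u_j\Big)
  = \tfrac{1}{2} \sum_i a_i^{2} \beta_i
  \le \tfrac{1}{2}\, \beta_1 \sum_i a_i^{2}
  = \tfrac{1}{2}\, n\, \beta_1
```

Equality would require s to be parallel to u_1, which a {±1} vector usually cannot be; rounding the entries of u_1 to their signs is the greedy compromise.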

15

16 Example: a 2-division of a social network A network showing relationships between people in a karate club which eventually split into 2. The division algorithm predicts exactly the two groups after the split. [Figure: the karate club network, with the known group leaders marked.] Color matches the entries of the eigenvector $u_1$: light = positive entry ($s_i = 1$), dark = negative entry ($s_i = -1$)

17 Dividing into more than 2 (1) How to divide into more than 2 groups? Idea: apply the algorithm recursively on every group. Splitting a group ==> update Q. Only the {i,j} pairs inside the split group need to be updated in Q: $Q = \sum_{i,j} B_{ij} c_{ij}$, where $c_{ij} \in \{0,1\}$ and $c_{ij} = 1$ iff i and j are in the same group, 0 otherwise

18 Dividing into more than 2 (2)
g: a group of $n_g$ vertices
s: a {±1} vector of size $n_g$
Compute $\Delta Q$ for a 2-division of g:
$\Delta Q = \frac{1}{2} \sum_{i,j \in g} B_{ij} (s_i s_j + 1) - \sum_{i,j \in g} B_{ij}$
–New (first term): the elements of g are split into two subgroups (corresponding to s)
–Old (second term): all the elements of g are within one group (g)

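19 Dividing into more than 2 (3)
$\Delta Q = \frac{1}{2} s^T \hat{B}[g] s$, with the generalized modularity matrix $\hat{B}[g] = B[g] - \mathrm{diag}(f_1(g), \dots, f_{n_g}(g))$, where
–B[g] = the submatrix of B defined by g
–$f_i(g)$ = sum of the ith row of B[g]
–$f_i(\{1,...,n\}) = 0$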
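A minimal sketch of how the generalized modularity matrix could be formed explicitly, assuming a precomputed dense B and a list of node indices for the group. The function name and dense storage are hypothetical; an efficient implementation would avoid materializing this matrix at all.

```c
#include <assert.h>
#include <stddef.h>

/* Builds the generalized modularity matrix of group g:
 *   Bg = B[g] - diag(f), with f_a = sum of row a of B[g].
 * B:  dense n*n modularity matrix, row-major (assumed precomputed)
 * g:  the ng node indices of the group
 * Bg: output, ng*ng, row-major */
void generalized_modularity_matrix(const double *B, size_t n,
                                   const size_t *g, size_t ng, double *Bg)
{
    assert(B != NULL && g != NULL && Bg != NULL);
    for (size_t a = 0; a < ng; a++) {
        double f = 0.0;
        assert(g[a] < n);
        for (size_t b = 0; b < ng; b++) {
            Bg[a * ng + b] = B[g[a] * n + g[b]];
            f += Bg[a * ng + b];
        }
        Bg[a * ng + a] -= f;   /* subtract the row sum on the diagonal */
    }
}
```

When g is the whole network, every row sum f is 0 and the result is B itself; for a proper subgroup, each row of the result again sums to zero.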

20 Generalized modularity matrix: example g = {1, 4, 5} (1 is the minimal index). What is $\hat{B}[\{1,\dots,5\}]$?

21 A "generalized" 2-division algorithm (divides a group in a network)

22

23 Further techniques for modularity maximization (Combined with Newman's "generalized" 2-division algorithm)

24 A heuristic for 2-division
1. {g1, g2}: an initial 2-division of g
2. While there is an unmoved node:
   1. Let v be an unmoved node whose move between g1 and g2 maximizes $\Delta Q$
   2. Move v between g1 and g2
3. From the $n_g$ 2-divisions generated in the previous step, let {g1, g2} be the one with maximum $\Delta Q$
4. If $\Delta Q > 0$ ==> go to 1
The last iteration of step 2 produces a 2-division which equals the initial 2-division (every node has been moved, so only the group labels are swapped)

25 Choosing j' with maximum $\Delta Q$
2. While there is an unmoved node:
   1. Let v be an unmoved node whose move between g1 and g2 maximizes $\Delta Q$
   2. Move v between g1 and g2
Computing $\Delta Q$ for moving each node j' and storing its $\Delta Q$

26 Algorithm 4 (cont.)
3. From the $n_g$ 2-divisions generated in the previous step, let {g1, g2} be the one with maximum $\Delta Q$
4. If $\Delta Q > 0$ ==> go to 1
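The improvement heuristic of the last three slides can be sketched as follows. This is an illustrative, unoptimized version that assumes a dense symmetric matrix B (either the modularity matrix or a generalized one), a score of the form Q = 0.5 * s^T B s, and a caller-supplied threshold `eps` for "positive"; all names are hypothetical. It recomputes every gain from scratch, so one pass costs O(n^3) rather than the faster bookkeeping the project asks for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Change in Q = 0.5 * s^T B s caused by flipping s[k].
 * Diagonal entries of B cancel out because s[k]^2 == 1. */
static double flip_gain(const double *B, const int *s, size_t n, size_t k)
{
    double sum = 0.0;
    for (size_t j = 0; j < n; j++)
        if (j != k)
            sum += B[k * n + j] * s[j];
    return -2.0 * s[k] * sum;
}

/* Improves the 2-division s (entries +1/-1) in place and returns the total
 * gain in Q. B is dense, symmetric, n*n, row-major. */
double improve_division(const double *B, int *s, size_t n, double eps)
{
    assert(B != NULL && s != NULL && n > 0);
    int *work = malloc(n * sizeof *work);
    size_t *order = malloc(n * sizeof *order);
    unsigned char *moved = malloc(n);
    double total = 0.0;
    if (work == NULL || order == NULL || moved == NULL) {
        free(work); free(order); free(moved);
        return 0.0;
    }
    for (;;) {
        double running = 0.0, best = 0.0;
        size_t best_len = 0;
        memcpy(work, s, n * sizeof *work);
        memset(moved, 0, n);
        for (size_t step = 0; step < n; step++) {   /* move every node once */
            size_t pick = n;
            double pick_gain = 0.0;
            for (size_t v = 0; v < n; v++) {
                double gain;
                if (moved[v])
                    continue;
                gain = flip_gain(B, work, n, v);
                if (pick == n || gain > pick_gain) {
                    pick = v;
                    pick_gain = gain;
                }
            }
            work[pick] = -work[pick];
            moved[pick] = 1;
            order[step] = pick;
            running += pick_gain;
            if (running > best) {                   /* best prefix of moves so far */
                best = running;
                best_len = step + 1;
            }
        }
        if (best <= eps)
            break;                                  /* no prefix improves Q */
        for (size_t t = 0; t < best_len; t++)
            s[order[t]] = -s[order[t]];
        total += best;
    }
    free(work); free(order); free(moved);
    return total;
}
```

On two triangles joined by one edge, starting from a division with one misplaced node, the first pass moves that node back (a gain of 23/7) and the second pass finds nothing better, so the loop stops.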

27 Finding the leading eigen-pair The power method

28 The Power Method (1) A: a diagonalizable matrix. Let $(\lambda_1, V_1), \dots, (\lambda_n, V_n)$ be n eigenpairs of A, where $|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \dots \ge |\lambda_n|$. The power method finds the dominant eigenpair of A, i.e. $(\lambda_1, V_1)$. (Note that $\lambda_1$ is not necessarily the leading eigenvalue.) $X_0$ = any vector ==> $X_0 = c_1 V_1 + \dots + c_n V_n$, where $c_i = X_0 \cdot V_i$

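29 The Power Method (2)
$X_1 = A X_0 = A (c_1 V_1 + \dots + c_n V_n) = c_1 A V_1 + \dots + c_n A V_n = c_1 \lambda_1 V_1 + \dots + c_n \lambda_n V_n$
$X_2 = A^2 X_0 = A X_1 = A (c_1 \lambda_1 V_1 + \dots + c_n \lambda_n V_n) = c_1 \lambda_1^2 V_1 + \dots + c_n \lambda_n^2 V_n$
...
$X_m = A^m X_0 = A X_{m-1} = A (c_1 \lambda_1^{m-1} V_1 + \dots + c_n \lambda_n^{m-1} V_n) = c_1 \lambda_1^m V_1 + \dots + c_n \lambda_n^m V_n \approx c_1 \lambda_1^m V_1$ if m is large enough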

30 Power Method (3) Suppose $V_1 \cdot Y \ne 0$. For m large enough: $X_m = A X_{m-1} = A^m X_0$. For simplicity, $Y = X_m$

31 Power method - Example We perform only matrix-vector multiplications! Convergence usually occurs within O(n) iterations

32 Power method – convergence condition To avoid numerical problems due to large numbers, normalize $X_i$ before computing $X_{i+1} = A X_i$: $X_0 = X / \|X\|$, $X_1 = A X_0 / \|A X_0\|$, $X_2 = A X_1 / \|A X_1\|$, ... Iterate until the desired precision is reached
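A normalized power iteration can be sketched as below. This is a minimal illustration on a dense matrix, with several assumptions that the slides do not fix: each iterate is scaled by its largest-magnitude entry (rather than the Euclidean norm), convergence means that no entry changes by more than `eps`, and the eigenvalue is estimated by the Rayleigh quotient. The function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Power iteration on a dense n*n matrix A (row-major).
 * x: start vector on input; approximate dominant eigenvector on output,
 *    scaled so that its largest-magnitude entry equals 1.
 * Stops when no entry changes by more than eps, or after max_iter steps.
 * Returns the Rayleigh quotient (x.Ax)/(x.x) as the eigenvalue estimate,
 * or 0.0 on allocation failure. */
double power_method(const double *A, size_t n, double *x,
                    double eps, size_t max_iter)
{
    assert(A != NULL && x != NULL && n > 0);
    double *y = malloc(n * sizeof *y);
    double num = 0.0, den = 0.0;
    if (y == NULL)
        return 0.0;
    for (size_t it = 0; it < max_iter; it++) {
        double scale = 0.0, change = 0.0;
        for (size_t i = 0; i < n; i++) {            /* y = A x */
            y[i] = 0.0;
            for (size_t j = 0; j < n; j++)
                y[i] += A[i * n + j] * x[j];
            if (abs_d(y[i]) > abs_d(scale))
                scale = y[i];                       /* signed largest entry */
        }
        if (scale == 0.0)
            break;                                  /* A x == 0: cannot normalize */
        for (size_t i = 0; i < n; i++) {
            double next = y[i] / scale;
            if (abs_d(next - x[i]) > change)
                change = abs_d(next - x[i]);
            x[i] = next;
        }
        if (change < eps)
            break;
    }
    for (size_t i = 0; i < n; i++) {                /* Rayleigh quotient */
        double ax = 0.0;
        for (size_t j = 0; j < n; j++)
            ax += A[i * n + j] * x[j];
        num += x[i] * ax;
        den += x[i] * x[i];
    }
    free(y);
    return den > 0.0 ? num / den : 0.0;
}
```

Dividing by the signed largest entry keeps the iterate's sign stable even when the dominant eigenvalue is negative. For the 2x2 matrix with rows (2, 1) and (1, 2), starting from (1, 0), the iterates approach (1, 1) and the estimate approaches 3.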

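33 Finding the leading eigenpair using matrix shifting Let $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$ be the eigenvalues of A, and $U_1, \dots, U_n$ their corresponding eigenvectors. Let $\|A\|_1 = \max_j \sum_i |A_{ij}|$ (the 1-norm); then $\|A\|_1 \ge \max_i |\lambda_i|$ (exercise). Q: What is the dominant eigenpair of $A + \|A\|_1 I$? A: $(\lambda_1 + \|A\|_1, U_1)$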
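The shift itself is a small computation. The sketch below assumes a dense matrix and uses the 1-norm as the largest column sum of absolute values; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* 1-norm of a dense n*n matrix (row-major): the largest column sum of
 * absolute values. */
double one_norm(const double *A, size_t n)
{
    assert(A != NULL);
    double norm = 0.0;
    for (size_t j = 0; j < n; j++) {
        double col = 0.0;
        for (size_t i = 0; i < n; i++) {
            double v = A[i * n + j];
            col += v < 0.0 ? -v : v;
        }
        if (col > norm)
            norm = col;
    }
    return norm;
}

/* Replaces A by A + shift*I, where shift = ||A||_1, and returns shift.
 * Every eigenvalue moves up by shift, so none is negative afterwards. */
double shift_by_one_norm(double *A, size_t n)
{
    double shift = one_norm(A, n);
    for (size_t i = 0; i < n; i++)
        A[i * n + i] += shift;
    return shift;
}
```

For example, the matrix with rows (-2, 1) and (1, -2) has eigenvalues -1 and -3, so its dominant eigenvalue (-3) is not its leading one (-1). Its 1-norm is 3, and the shifted matrix has eigenvalues 2 and 0, so the power method on the shifted matrix finds 2 = -1 + 3 together with the leading eigenvector.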

34 Implementation Robustness and Efficiency

35 Checking "positiveness" #define IS_POSITIVE(X) ((X) > ) Instead of "x>0" ==> use IS_POSITIVE(X)
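To illustrate why a tolerance matters, here is a small sketch of the macro applied to the sign-rounding step of the spectral division. The threshold value `EPSILON` below is purely an assumption for illustration, since the value used on the slide is not recoverable; the helper `signs_from_vector` is likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tolerance: the slide's actual threshold is not shown. */
#define EPSILON 1e-5
#define IS_POSITIVE(X) ((X) > EPSILON)

/* Rounds a real-valued vector u to a division vector s with entries +1/-1.
 * Entries that are positive only by rounding noise are treated as
 * non-positive. */
void signs_from_vector(const double *u, int *s, size_t n)
{
    assert(u != NULL && s != NULL);
    for (size_t i = 0; i < n; i++)
        s[i] = IS_POSITIVE(u[i]) ? 1 : -1;
}
```

With this threshold, an entry such as 1e-12 (which a plain `x > 0` test would count as positive) is assigned -1, so results do not depend on floating-point noise around zero.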

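36 Efficient multiplications in the (extended) modularity matrix: O(n) instead of O(n²)
The product of the (shifted) generalized modularity matrix with a vector x splits into cheap terms:
–$A[g] x$: multiplication in a sparse matrix
–$k (k \cdot x) / M$: an inner product
–$f_i(g) x_i$: a diagonal term
–$\|\hat{B}[g]\|_1 x_i$: "matrix shifting"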
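One way to realize this multiplication without ever forming the dense matrix is sketched below. It assumes the group's adjacency submatrix A[g] is stored in compressed-row form with an n+1 entry row-pointer array and implicit values of 1, that `k` holds the full-network degrees of the group's nodes, and that `f` and `shift` were computed beforehand; the function name and parameter layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y = (Bhat[g] + shift*I) x without forming the dense matrix:
 *   y_i = sum_j A[g]_ij x_j - k_i (k.x)/M - f_i x_i + shift x_i
 * A[g] is given in compressed-row form with implicit values of 1:
 *   colind[rowptr[i] .. rowptr[i+1]-1] are the neighbours of node i in g.
 * rowptr is assumed to hold n+1 entries. */
void modularity_mult(size_t n, const int *rowptr, const int *colind,
                     const double *k, double M, const double *f,
                     double shift, const double *x, double *y)
{
    assert(rowptr != NULL && colind != NULL && k != NULL);
    assert(f != NULL && x != NULL && y != NULL && M > 0.0);
    double kx = 0.0;
    for (size_t i = 0; i < n; i++)          /* inner product k.x, O(n) */
        kx += k[i] * x[i];
    for (size_t i = 0; i < n; i++) {
        double ax = 0.0;                    /* row i of the sparse product */
        for (int p = rowptr[i]; p < rowptr[i + 1]; p++)
            ax += x[colind[p]];
        y[i] = ax - k[i] * kx / M + (shift - f[i]) * x[i];
    }
}
```

The total work is proportional to n plus the number of stored nonzeros, instead of the n² operations a dense product would need.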

37 sparse_matrix_arr
typedef struct {
    int n;        /* matrix size */
    elem* values; /* the nonzero elements, ordered by rows */
    int* colind;  /* column indices */
    int* rowptr;  /* pointers to where rows begin in the values array */
} sparse_matrix_arr;
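A matrix-vector product over this structure might look as follows. The sketch makes two assumptions the slide does not state: `elem` is taken to be `double`, and `rowptr` holds n+1 entries so that `rowptr[n]` is the number of stored nonzeros. The function name `sparse_mult` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef double elem;   /* assumption: the definition of elem is not shown */

typedef struct {
    int n;        /* matrix size */
    elem *values; /* the nonzero elements, ordered by rows */
    int *colind;  /* column indices */
    int *rowptr;  /* row starts in values; assumed to hold n+1 entries */
} sparse_matrix_arr;

/* y = A x, visiting only the stored nonzeros of each row. */
void sparse_mult(const sparse_matrix_arr *A, const elem *x, elem *y)
{
    assert(A != NULL && x != NULL && y != NULL);
    for (int i = 0; i < A->n; i++) {
        elem sum = 0;
        for (int p = A->rowptr[i]; p < A->rowptr[i + 1]; p++)
            sum += A->values[p] * x[A->colind[p]];
        y[i] = sum;
    }
}
```

For the path graph on three nodes (values {1,1,1,1}, colind {1,0,2,1}, rowptr {0,1,3,4}) and x = (1, 2, 3), the product is (2, 4, 2).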

38 Fast score computations
–Computing $\Delta Q$ for each node ==> O(n²)
–Computing $\Delta Q$ for each node in O(n) before moving the 1st node
–Updating the score AFTER a move of a node k (s is already updated)
(Algorithm 4)

39 Project specifications

40 Programs
1. sparse_mlpl < matrix_vec.in: for the power method
2. modularity_mat
3. spectral_div: computing a 2-division
4. improve_div
5. cluster: the complete clustering algorithm (including the improvement)

41 Implementation process
–Read and understand the document
–Design ALL programs:
  –Data structures
  –Functions used by more than one program
–Check your code:
  –"Toy" examples on the website (easy to debug)
  –Your own LARGE examples
–Run your code on the yeast/fly networks

42 Analyzing clusters in yeast (Saccharomyces cerevisiae) and fly (Drosophila melanogaster) protein-protein interaction networks Input: true PPI network + 2 random networks. Task 1: infer the true network. Solution: the true network is more modular. Task 2: compute associated functions (using Cytoscape + BiNGO)

43 Cytoscape, BiNGO
Cytoscape (version 2.5.1):
–A framework for analyzing networks
–Provides visualization of networks and clusters
BiNGO:
–Finds functions associated with a gene cluster
–Runs from Cytoscape
–Version 2.3 is not suitable for our project!!! (due to a bug) ==> use version 2.4 (when available) or version 2.0 (available under ~ozery/public/cytoscape-v2.5.1/plugins/BiNGO.jar).

44 BiNGO output (GO = Gene Ontology)

45 Visualization with cytoscape

46 How is the project checked? Most checks (points): "BLACK BOX" –The common checks in "real world" –Running with fixed input files, comparing to fixed output files –Score = #(successful checks) / #(total checks) "WHITE BOX" checks: code review (10 points maximum) –code simplicity / efficiency

47 A simple data structure for maintaining a division
typedef struct Division_ {
    int n;          /* #nodes in the network */
    int* group_ids; /* for each node - its group id (initially 0 - all nodes within one group) */
    int numGroups;
    double Q;
} Division;
Complexity:
–Finding all the elements of a group: O(n)
–Splitting a group into 2: O(n)
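The O(n) split could be implemented along these lines. This is a sketch under stated assumptions: the field is spelled `group_ids` (a hyphen is not valid in a C identifier), the vector s is indexed by position within the group in increasing node order, and the function name `division_split` is hypothetical. Updating Q is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Division_ {
    int n;          /* #nodes in the network */
    int *group_ids; /* for each node: its group id (initially all 0) */
    int numGroups;
    double Q;
} Division;

/* Splits group `group` in two according to s (entries +1/-1), where s[t]
 * belongs to the t-th node of the group in increasing node order.
 * Nodes with s[t] == -1 move to a new group; nodes with +1 stay.
 * Returns the id of the new group. One pass over the nodes: O(n). */
int division_split(Division *d, int group, const int *s)
{
    assert(d != NULL && s != NULL);
    assert(group >= 0 && group < d->numGroups);
    int new_id = d->numGroups;
    int t = 0;
    for (int i = 0; i < d->n; i++) {
        if (d->group_ids[i] != group)
            continue;
        if (s[t] < 0)
            d->group_ids[i] = new_id;
        t++;
    }
    d->numGroups++;
    return new_id;
}
```

Starting from five nodes in group 0, splitting with s = (1, -1, 1, -1, -1) yields ids (0, 1, 0, 1, 1); splitting group 1 again with s = (1, -1, -1) yields (0, 1, 0, 2, 2).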

48 Maintaining the generalized modularity matrix Should we maintain the modularity matrix? –No: 1) we do not use it explicitly 2) it is a dense matrix - consumes a large memory space –Yes: 1) Despite its large size - can be kept in memory 2) Can simplify code (e.g. deriving B[g] from B, computing the L1-norm) 3) Can be used in validating the correctness of optimized multiplications (debug mode only!)

49 Suggestion for modules
Sparse matrices:
–Data structure: sparse_matrix_lst
–Reading a sparse matrix (file / stdin)
–Multiplication by a vector
–Computing A[g]
–Methods hiding the inner structure (allows a simple replacement of sparse_matrix_lst with another data structure for holding sparse matrices)
Division
Group
The spectral algorithm:
–2-division
–full-division
The improvement algorithm
The generalized modularity matrix:
–Data structure: A[g], k[g], M, f[g], L1-norm
–Multiplication by a vector
–Computing Q
–Printing the modularity matrix

50 Good luck! (and have fun...)