Computational Molecular Biology


1 Computational Molecular Biology
Community Structures

2 What is Community Structure
Definition: A community is a group of nodes in which there are more edges (interactions) between nodes within the group than to nodes outside of it. My T. Thai

3 Why Community Structure (CS)?
Many systems can be expressed as a network, in which nodes represent the objects and edges represent the relations between them:
- Social networks: collaboration, online social networks
- Technological networks: IP address networks, WWW, software dependency
- Biological networks: protein interaction networks, metabolic networks, gene regulatory networks

4 Why CS? Yeast protein interaction networks

5 Why CS? IP address network

6 Why Community Structure?
Nodes in a community have some common properties, and communities represent some properties of a network. Examples:
- In social networks, communities represent social groupings based on interest or background
- In citation networks, they represent related papers on one topic
- In metabolic networks, they represent cycles and other functional groupings

7 How to detect a community?

8 Early Work
Using hierarchical clustering. Overview of this method:
- For each pair (u,v), calculate a weight w_uv that represents how closely connected u and v are
- Initialize G = (V, ∅)
- At each iteration, add the edge with the strongest weight
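The steps above can be sketched as follows. This is a minimal illustration, not the method from the slides: it assumes the weights w_uv are already given as a dictionary, and it uses a union-find structure (an implementation choice made here) to track which groups an added edge joins. The function name `hierarchical_merges` is hypothetical.

```python
def hierarchical_merges(nodes, weights):
    """Add edges in order of decreasing weight; record each edge that joins two groups.

    weights maps a pair (u, v) to its precomputed weight w_uv (assumed given).
    """
    parent = {v: v for v in nodes}

    def find(v):
        # Follow parent pointers to the group representative (with path halving).
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    merges = []
    for (u, v), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ru, rv = find(u), find(v)
        if ru != rv:              # this edge joins two previously separate groups
            parent[rv] = ru
            merges.append((u, v, w))
    return merges
```

Reading the returned list from first to last gives the hierarchy from the bottom up: early merges are the most tightly connected groups.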

9 Early Work

10 Early Work
How to define the weight w_uv? Many different methods have been proposed:
- Number of disjoint paths between u and v
- Number of all possible paths between u and v
Disadvantage: a tendency to separate boundary vertices from the communities to which they should belong

11 An Overview of Recent Work
Table (layout lost in the transcript): columns Disjoint CS and Overlapping CS; rows Centralized Approach and Localized Approach. Entries: define the quantity of modularity and use greedy algorithms, IP, SDP; spectral clustering; random walk, clique percolation.

12 Edge Betweenness
Focus on the edges which are least central, i.e., the edges which are most “between” communities. Instead of adding edges to G = (V, ∅), progressively remove edges from the original graph G = (V,E).

13 Edge Betweenness
Definition: For each edge (u,v), the edge betweenness of (u,v) is the number of shortest paths between any pair of nodes in the network that run through (u,v):
betweenness(u,v) = |{ P_xy | x, y in V, P_xy is a shortest path between x and y, and (u,v) in P_xy }|
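A direct, unoptimized sketch of this definition, assuming an unweighted, undirected graph stored as a dict of neighbor sets with orderable node labels (all of these are assumptions made for the example). It counts every shortest path separately, as the definition states. Note that many published variants instead weight each path by one over the number of shortest paths between its endpoints. The names `bfs_counts` and `edge_betweenness` are hypothetical.

```python
from collections import deque
from itertools import combinations

def bfs_counts(adj, s):
    """Return (dist, sigma): hop distance from s and number of shortest s-v paths."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
    return dist, sigma

def edge_betweenness(adj):
    """Count, for each edge, the shortest paths (over unordered node pairs) using it."""
    info = {s: bfs_counts(adj, s) for s in adj}
    score = {(u, v): 0 for u in adj for v in adj[u] if u < v}
    for x, y in combinations(adj, 2):
        (dx, sx), (dy, sy) = info[x], info[y]
        if y not in dx:
            continue  # x and y are not connected
        for u, v in score:
            for a, b in ((u, v), (v, u)):
                # Edge a->b lies on a shortest x-y path iff the distances add up.
                if a in dx and b in dy and dx[a] + 1 + dy[b] == dx[y]:
                    score[(u, v)] += sx[a] * sy[b]
    return score
```

On two triangles joined by a single bridge edge, the bridge carries one shortest path for each of the 3 × 3 cross-triangle pairs, so it gets the highest score.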

14 Why Edge Betweenness

15 Algorithm
1. Initialize G = (V,E) representing a network
2. While E is not empty:
   - Calculate the betweenness of all edges in G
   - Remove the edge e with the highest betweenness: G = (V, E – e)
In fact, only the betweenness of the edges affected by the removal needs to be recalculated.
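The loop above can be sketched as follows, under several assumptions made for the example: the graph is unweighted and undirected, stored as a dict of neighbor sets with no self-loops; betweenness is recomputed from scratch after every removal (the simple variant, not the optimized one); and the betweenness routine uses Brandes-style accumulation, which weights each path by the fraction of shortest paths it represents. All function names are hypothetical.

```python
from collections import deque

def edge_betweenness(adj):
    """Brandes-style edge betweenness for an unweighted, undirected graph."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w], sigma[w], preds[w] = dist[v] + 1, 0.0, []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(order, 0.0)
        for w in reversed(order):          # accumulate from the farthest nodes back
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += c
                delta[v] += c
    return score  # every pair is counted from both ends; the ranking is unaffected

def components(adj):
    """Connected components as a list of node sets."""
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        parts.append(comp)
    return parts

def girvan_newman(adj):
    """Remove the highest-betweenness edge until no edges remain.

    Returns the list of partitions seen each time the component count grows.
    """
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    levels = [components(adj)]
    while any(adj.values()):
        score = edge_betweenness(adj)
        u, v = max(score, key=score.get)
        adj[u].discard(v)
        adj[v].discard(u)
        parts = components(adj)
        if len(parts) > len(levels[-1]):
            levels.append(parts)
    return levels
```

The returned levels form the hierarchy: the first entry is the whole graph's components, the last entry is all singletons.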

16 Time Complexity
Let |V| = n and |E| = m.
- Calculating the betweenness of all edges: O(mn)
- Since we need to recalculate each time we remove an edge: O(m^2 n) in total

17 An Example

18 Disadvantages/Improvements
- Can we improve the time complexity?
- The communities come in hierarchical form; can we find disjoint communities?

19 Define the quantity (measurement) of modularity Q and find an approximation algorithm to maximize Q

20 How to define Q
Let A be the adjacency matrix of the network.
Fraction of edges that fall within communities: [equation not preserved in the transcript]
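The equation on this slide was an image and is missing from the transcript. A plausible reconstruction, assuming the standard modularity setup in which c_v denotes the community of node v and m is the number of edges (both symbols are assumptions here, chosen to agree with the k_v k_w / 2m term used later in the deck):

```latex
\frac{\sum_{vw} A_{vw}\,\delta(c_v, c_w)}{\sum_{vw} A_{vw}}
  = \frac{1}{2m} \sum_{vw} A_{vw}\,\delta(c_v, c_w),
\qquad
\delta(c_v, c_w) =
\begin{cases}
1 & \text{if } c_v = c_w \\
0 & \text{otherwise}
\end{cases},
\qquad
m = \frac{1}{2} \sum_{vw} A_{vw}
```

Under this form the quantity equals 1 when all nodes share one community, which is the degenerate maximum the next slide points out.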

21 How to define Q
What is the problem? If we try to maximize the above quantity, we may simply put all nodes into one single community. How to fix it?

22 How to define Q?
Let k_v be the degree of node v.
The term k_v k_w / 2m represents the probability of an edge existing between vertices v and w if connections are made at random but respecting vertex degrees.
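The formula for Q itself did not survive in the transcript. The sketch below assumes the widely used form Q = (1/2m) Σ_vw [A_vw − k_v k_w / 2m] δ(c_v, c_w), where δ is 1 when v and w are in the same community and 0 otherwise; this is an assumption consistent with the term described above, not a quote from the slide. The graph is taken to be undirected and unweighted, stored as a dict of neighbor sets, and `modularity` is a hypothetical name.

```python
def modularity(adj, community):
    """Q = (1/2m) * sum over same-community pairs (v, w) of [A_vw - k_v*k_w/(2m)].

    adj: dict node -> set of neighbors (undirected, no self-loops).
    community: dict node -> community label.
    """
    two_m = sum(len(ns) for ns in adj.values())   # sum of degrees = 2m
    q = 0.0
    for v in adj:
        for w in adj:
            if community[v] != community[w]:
                continue                           # delta(c_v, c_w) = 0
            a_vw = 1.0 if w in adj[v] else 0.0
            q += a_vw - len(adj[v]) * len(adj[w]) / two_m
    return q / two_m
```

With this correction term, placing every node in one community gives Q = 0 rather than the maximum, which addresses the problem raised on the previous slide.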

23 Greedy Algorithm
- Initially, we have n communities (each node is a community)
- At each step, join the two communities whose merger gives the largest increase in Q (the merges form a hierarchical tree)
- Stop when we are left with a single community (after n - 1 steps)
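A minimal sketch of this agglomerative procedure, under assumptions made for the example: the graph is undirected, unweighted, has at least one edge and orderable node labels; only communities joined by at least one edge are considered for merging; and Q is taken in the standard form Q = Σ_c [e_c / m − (d_c / 2m)^2], so that merging communities c and d changes Q by e_cd / m − d_c d_d / (2 m^2). The name `greedy_modularity` is hypothetical. The sketch keeps the partition with the highest Q seen along the way.

```python
def greedy_modularity(adj):
    """Repeatedly merge the pair of linked communities with the largest gain in Q."""
    m = sum(len(ns) for ns in adj.values()) / 2
    comm = {v: {v} for v in adj}                       # community id -> members
    deg = {v: len(adj[v]) for v in adj}                # total degree per community
    links = {v: {w: 1 for w in adj[v]} for v in adj}   # edge counts between communities
    q = -sum((d / (2 * m)) ** 2 for d in deg.values()) # Q of the all-singletons partition
    best_q, best = q, [set(c) for c in comm.values()]
    while True:
        cand = [(links[c][d] / m - deg[c] * deg[d] / (2 * m * m), c, d)
                for c in links for d in links[c] if c < d]
        if not cand:
            break                                      # no linked pair left to merge
        dq, c, d = max(cand)
        comm[c] |= comm.pop(d)                         # merge d into c
        deg[c] += deg.pop(d)
        for x, cnt in links.pop(d).items():
            del links[x][d]
            if x != c:
                links[c][x] = links[c].get(x, 0) + cnt
                links[x][c] = links[c][x]
        q += dq
        if q > best_q:
            best_q, best = q, [set(s) for s in comm.values()]
    return best, best_q
```

Because each merge is chosen greedily and never undone, the result can be a suboptimal maximum.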

24 Disadvantages
- It is still a hierarchical approach
- It cannot escape a suboptimal maximum
How can we avoid the suboptimal maximum?

25 Local Communities

26 Overview
Find communities based on local information (not information about the entire network). Two ways:
- Detect the communities
- Define local modularity, then greedily optimize this function

27 What We Have Learnt
Can we use this Q? How can we “twist” it to make it work?

28 Consider this Figure
Suppose we have perfect knowledge of some subgraph C. Then we also know of some neighbors of C, which lie in U. Visiting these neighbors in U may extend our knowledge of C. Now, can we reuse Q (defined before) on C?

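29 Some Definitions
Again, define an adjacency matrix A (with respect to C) as follows: [equation not preserved in the transcript]
Consider this quantity: [equation not preserved in the transcript], where [symbol not preserved] is the number of edges in the partial adjacency matrix.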

30 Relationship between B, C, U
Consider the nodes in B. If C is a sharp community, then:
- Nodes in B have more connections to nodes in C
- Nodes in B have fewer (only a few) connections to nodes in U

31 Definitions
Define a boundary-adjacency matrix B as follows: [equation not preserved in the transcript]
Define local modularity R: [equation not preserved in the transcript]
where δ(i,j) = 1 iff v_i is in B and v_j is in C, or vice versa; otherwise δ(i,j) = 0.
- T: number of edges with one or more endpoints in B
- I: number of those edges with neither endpoint in U
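A small sketch of how R could be computed from the two counts defined above. It assumes R = I / T, which is what those definitions suggest (the slide's own equation is not in the transcript), and it assumes B is the set of nodes in C that have at least one neighbor outside C. The graph is a dict of neighbor sets, and `local_modularity` is a hypothetical name.

```python
def local_modularity(adj, C):
    """Return R = I / T for the known subgraph C, or None when T = 0.

    T counts edges with at least one endpoint in the boundary B;
    I counts those of them with no endpoint outside C (i.e. none in U).
    """
    C = set(C)
    # Assumed definition: boundary nodes are members of C with a neighbor outside C.
    B = {v for v in C if any(w not in C for w in adj[v])}
    T = I = 0
    seen = set()
    for v in B:
        for w in adj[v]:
            edge = frozenset((v, w))
            if edge in seen:
                continue          # count each undirected edge once
            seen.add(edge)
            T += 1
            if w in C:
                I += 1
    return I / T if T else None
```

In this sketch R comes back as None when C has no boundary at all, for example when C already covers a whole connected component.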

32 Properties of R
- 0 < R < 1
- Directly proportional to the sharpness of the boundary given by B
- When is R undefined?

33 A Greedy Algorithm

34 Overview of Second Method
Start at a vertex and check the degree of each vertex with respect to its one-hop neighbors, two-hop neighbors, …, l-hop neighbors. Why?
- If the community is highly connected, edges from the l-hop neighbors tend to lead back to nodes already visited
- At the boundary, the number of newly added edges decreases

35 Some Definitions
- k_i^e(j): the emerging degree of a vertex i that is l hops away from vertex j, defined as the number of edges (u,i) where u is not within l hops of j
- K_j^l: the total emerging degree of all nodes exactly l hops away from j, K_j^l = Σ_{i in S_j^l} k_i^e(j), where S_j^l is the set of all vertices exactly l hops away from j
- Initially, K_j^0 = degree of node j = k_j

36 Some Definitions
The change in total emerging degree, ΔK_j^l: [equation not preserved in the transcript]

37 Algorithm
1. Randomly choose a starting vertex j
2. Initially, l = 0; add j to C (C is a community), and K_j^0 = k_j
3. l = l + 1; add all l-hop neighbors of j to C
4. Compute ΔK_j^l. If ΔK_j^l < α, return C. Otherwise, repeat step 3
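The steps above can be sketched as follows. The definition of ΔK_j^l is not in the transcript; this sketch assumes it is the ratio K_j^l / K_j^(l-1), and labels that as an assumption in the code. It also takes the starting vertex as an argument rather than choosing it at random, so the result is reproducible. The graph is a dict of neighbor sets, and `l_shell_community` is a hypothetical name.

```python
def l_shell_community(adj, j, alpha):
    """Grow a community outward from j one hop at a time.

    Stops when the assumed ratio K_j^l / K_j^(l-1) drops below alpha,
    or when no edges lead out of the current community.
    """
    C = {j}
    shell = {j}
    k_prev = len(adj[j])                 # K_j^0 = degree of j
    while k_prev > 0:
        # Nodes exactly one hop beyond the current shell.
        shell = {w for v in shell for w in adj[v]} - C
        C |= shell
        # Total emerging degree: edges from the new shell to nodes not yet in C.
        k_cur = sum(1 for v in shell for w in adj[v] if w not in C)
        if k_cur / k_prev < alpha:       # assumed form of the stopping test
            break
        k_prev = k_cur
    return C
```

With alpha = 0 the ratio test never fires, so the sketch explores the entire connected component of j before stopping.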

38 Any Problem?
- How to define α? Do we need to define α? If not, what should we change?
- What if the starting vertex is a “bridge” vertex? What can we do?

39 Impact of α
- α = 0: never stop until the entire connected component has been explored
- α large: stop sooner (l is small), resulting in many small communities
- α too large: return n singleton communities (α > k_max, where k_max is the largest degree)

40 A Small Example
[Figure panels: actual CS of the Karate Club; CS obtained by the algorithm]

41 Dynamic Communities

42 Dynamic Networks
A dynamic network: a collection of network snapshots at many time points (figure: snapshots at t = 0, 1, 2, 3). Changes are frequently introduced:
- Insertions / removals of nodes
- Insertions / removals of edges
Event decomposition:
- Insertions / removals of nodes = {insert/remove a node}
- Insertions / removals of edges = {insert/remove an edge}

43 Recall…
Modularity function Q: [equation not preserved in the transcript]
However, maximizing Q is NP-hard.

44 Adaptive Solutions
An adaptive method:
- Aims to maximize the gained modularity with low computational complexity
- Locally computes a new structure based on local information after each change of the network
A basic community structure:
- Computed only for the first snapshot
- The network communities are adaptively updated based on this basic structure
- Method: Blondel et al. (2008)

45 QCA: An Adaptive Method
[Diagram labels: input network; Blondel’s method; basic communities; network changes; updated communities]
Need to handle:
- Node insertion
- Edge insertion
- Node removal
- Edge removal

46 Membership determination
A node actively determines its membership.
[Figure labels: node u; community S with F_in^S(u); community C with F_out^C(u)]

47 Introducing a new node
Possibilities:
- No new edges
- New edges linking to one community
- New edges linking to multiple communities
[Figure labels: u, C1, C2, C3]

48 Handling node insertion
Join u to the community C with the highest F_out^C(u).
[Figure labels: u; C1, C2, C3 with F_out^C1(u), F_out^C2(u), F_out^C3(u)]

49 Introducing a new edge
Possibilities:
- The new edge is inside a single community
- The new edge joins two communities
[Figure labels: u, v]

50 Handling edge insertion
- Keep the current community structure intact
- Find q_{u,C,D} and q_{v,D,C}
- Join a to C or D according to q_{u,C,D} and q_{v,C,D}
- If a (or b) changes its membership, check all of a’s neighbors
[Figure labels: a, b]

51 Time complexity
Inserting a new node:
- Visit all neighbors of u at most once: O(d_u)
Inserting a new edge:
- Computing q_{u,C(u),C(v)} and [symbol not preserved in the transcript] in constant time: O(1)

52 Removing an edge
The resulting community either:
- Remains unchanged
- Breaks up into smaller communities, if it contains substructures that are less attracted to each other
[Figure labels: u, v]

53 Handling edge removal
Strategy:
- Find maximal ‘quasi’-cliques
- Let the other singletons determine their best communities

54 Removing a node
All edges connected to u will be removed. The resulting community either:
- Remains unchanged
- Breaks up into smaller pieces, which are merged into other communities

55 Handling node removal
Strategy:
- 3-clique percolation
- Let the leftover nodes determine their best communities

56 Experimental Results
Test our algorithms on real-world data traces: Enron, ArXiv citation, and Facebook networks, in comparison with Blondel’s method at each snapshot.
Metrics:
- Modularity
- Number of communities
- Normalized Mutual Information (NMI)
- Running time

57 ArXiv e-print Citation network
[Figure panels: Modularity, # Communities, NMI, Running Time]

58 Facebook
[Figure panels: Modularity, # Communities, NMI, Running Time]

