Jure Leskovec (CMU); Kevin Lang, Anirban Dasgupta, Michael Mahoney (Yahoo! Research)


1 Jure Leskovec (CMU); Kevin Lang, Anirban Dasgupta, Michael Mahoney (Yahoo! Research)

2 Big data. Study emerging behaviors. How are small networks different from large ones?

3 Communities (groups, clusters, modules): sets of nodes with many connections inside and few to the outside (the rest of the network).

4 Nodes represent proteins. Edges represent interactions/associations. Proteins with the same function interact more. The network can be used to discover functional groups. [Figure: yeast transcriptional regulatory modules, Bar-Joseph et al., 2003]

5 Clusters correspond to social communities and organizational units (e.g., departments). [Figure: Zachary’s karate club network] During the study the club split into 2. The split corresponds to the min-cut (● vs. ■).

6 [Figure: Democrat vs. Republican blogs, Adamic-Glance 2005]

7 [Figures: citations and collaborations, Newman 2003]

8 Nested communities: the modular structure of networks is hierarchically organized. [Figure labels: University; Science, Arts; CS, Math, Drama, Music]

9 Recursive hierarchical network: (a) N=5, E=8; (b) N=25, E=56; (c) N=125, E=344.

10 Intuition: find nodes that can be easily separated from the rest of the network. Various objective functions: min-cut, normalized cut, centrality, modularity. Various algorithms: spectral clustering (random walks), Girvan-Newman (centrality), Metis (contraction based). Girvan-Newman: 1) compute betweenness centrality, the number of shortest paths passing through an edge; 2) remove edges in order of decreasing centrality.
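The two Girvan-Newman steps named on this slide can be sketched in plain Python. This is a minimal illustration, not the presenters' code: the function names, the adjacency-dict graph representation, and the stopping rule (stop at the first split) are assumptions made for the example. The graph is assumed to be simple, undirected and unweighted.

```python
from collections import defaultdict, deque


def edge_betweenness(adj):
    """Number of shortest paths through each edge (fractional when paths tie)."""
    score = defaultdict(float)
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        dist = {s: 0}
        sigma = defaultdict(float)
        sigma[s] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, pushing path credit onto edges.
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                credit = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += credit
                delta[v] += credit
    # Every node pair was counted once from each endpoint.
    return {edge: c / 2.0 for edge, c in score.items()}


def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def girvan_newman_split(adj):
    """Remove the highest-betweenness edge repeatedly until the graph splits."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        eb = edge_betweenness(adj)
        if not eb:  # no edges left to remove
            break
        u, v = max(eb, key=eb.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)
```

On two triangles joined by a single edge, the joining edge carries all 9 cross-triangle shortest paths, so it is removed first and the two triangles come out as the two parts.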


12 Statistical properties of community structure. Instead of searching for communities, we measure how well expressed communities are. Questions: What is the community structure of real-world networks? How can we measure and quantify it? What does this tell us about network structure? What is a good model (intuition)? What are the consequences for clustering/partitioning algorithms?

13 How community-like is a set of nodes? We need a natural, intuitive measure: conductance (normalized cut), Φ(S) = # edges cut / # edges inside. Small Φ(S) corresponds to a more community-like set of nodes S.
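The score on this slide is simple enough to compute directly. The sketch below implements exactly the ratio written on the slide (edges cut over edges inside); the function name and the edge-list representation are assumptions for illustration. Note that conductance is more commonly defined with the volume of S (the sum of its degrees) in the denominator, so values from this simplified score can differ from library implementations.

```python
def community_score(edges, S):
    """Slide's score: (# edges leaving S) / (# edges with both ends in S)."""
    S = set(S)
    cut = sum(1 for u, v in edges if (u in S) != (v in S))
    inside = sum(1 for u, v in edges if u in S and v in S)
    # A set with no internal edges is treated as the worst possible community.
    return cut / inside if inside else float("inf")


# Two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
triangle_score = community_score(edges, {0, 1, 2})  # 1 cut edge, 3 inside
```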

14 Score: Φ(S) = # edges cut / # edges inside. What is the “best” community of 5 nodes?

15 Bad community: Φ = 5/6 = 0.83.

16 Better community: Φ = 5/7 = 0.71. Better still: Φ = 2/5 = 0.4.

17 Best community: Φ = 2/8 = 0.25.

18 We define the network community profile (NCP) plot: plot the score of the best community of size k. Search over all subsets of size k and find the best: Φ(k=5) = 0.25. The NCP plot is intractable to compute.
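To make the definition concrete, and to show why it is intractable, here is a brute-force sketch that tries every subset of each size k. The names are hypothetical and the score is the slide's ratio of cut edges to inside edges. The number of subsets grows combinatorially, so this is only usable on toy graphs.

```python
from itertools import combinations


def score(edges, S):
    """Slide's score: edges cut / edges inside (infinite if none inside)."""
    cut = sum(1 for u, v in edges if (u in S) != (v in S))
    inside = sum(1 for u, v in edges if u in S and v in S)
    return cut / inside if inside else float("inf")


def ncp_brute_force(nodes, edges, max_k):
    """Best (lowest) score over all node subsets of size k, for k = 1..max_k."""
    profile = {}
    for k in range(1, max_k + 1):
        profile[k] = min(score(edges, set(S)) for S in combinations(nodes, k))
    return profile


# Two triangles joined by the single edge (2, 3).
nodes = [0, 1, 2, 3, 4, 5]
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
profile = ncp_brute_force(nodes, edges, max_k=3)
```

For this toy graph the best set of size 3 is one of the triangles, which has one cut edge and three inside edges.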

19 We define the network community profile (NCP) plot: plot the score of the best community of size k. [Figure: log Φ(k) vs. community size, log k; marked points k=5, Φ(k)=0.25 and k=7, Φ(k)=0.18]

20 [Figure: NCP plot; x-axis: community size, log k; y-axis: community score, log Φ(k)]

21 Local spectral clustering algorithm: pick a seed node; slowly diffuse mass around it (via a PageRank-like random walk); find the bottleneck; repeat many times. Use many seed nodes for very local walks and fewer seed nodes for more global (longer) walks.
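The seed, diffuse, and bottleneck steps can be sketched as a personalized PageRank followed by a sweep cut. The transcript does not say which diffusion procedure the authors used, so the plain power iteration below is a stand-in. The function names, the teleport probability `alpha`, and the iteration count are assumptions, and the sweep uses the slide's score (cut edges over inside edges). The graph is assumed to have no isolated nodes.

```python
def personalized_pagerank(adj, seed, alpha=0.15, iters=100):
    """Random walk that restarts at `seed` with probability alpha each step."""
    p = {v: 0.0 for v in adj}
    p[seed] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        nxt[seed] += alpha  # total mass is 1, so alpha of it restarts
        for v, mass in p.items():
            share = (1.0 - alpha) * mass / len(adj[v])
            for w in adj[v]:
                nxt[w] += share
        p = nxt
    return p


def sweep_cut(adj, p):
    """Scan prefixes of nodes ordered by p[v]/degree; keep the best-scoring one."""
    order = sorted(adj, key=lambda v: p[v] / len(adj[v]), reverse=True)
    S, cut, inside = set(), 0, 0
    best_score, best_set = float("inf"), None
    for v in order[:-1]:  # never take the whole graph
        for w in adj[v]:
            if w in S:  # this edge was cut, now it is inside
                inside += 1
                cut -= 1
            else:
                cut += 1
        S.add(v)
        if inside and cut / inside < best_score:
            best_score, best_set = cut / inside, set(S)
    return best_set, best_score


# Two triangles joined by the single edge (2, 3); seed the walk at node 0.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
community, phi = sweep_cut(adj, personalized_pagerank(adj, seed=0))
```

Repeating this from many seeds and with different `alpha` values (smaller `alpha` gives longer, more global walks) yields candidate communities at many sizes, which is what an approximate NCP plot needs.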


23 Dolphin social network: two communities of dolphins. [Figures: network and NCP plot]

24 Zachary’s university karate club social network. During the study the club split into 2. The split (squares vs. circles) corresponds to cut B. [Figures: network and NCP plot]

25 Collaborations between scientists in Networks. [Figures: network and NCP plot]

26 [Figures: network and NCP plot]

27 [Figures: network and NCP plot]

28 Manifold learning dataset (Hands). [Figures: network and NCP plot]

29 Eastern US power grid.

30 Small social networks, geometric networks, and hierarchical networks have downward-sloping NCP plots. What about large networks? [Figures: network and NCP plot]


32 Previously, researchers examined the community structure of small networks (~100 nodes). We examined more than 70 different large networks. Large real-world networks look very different!

33 Typical example: general relativity collaboration network (4,158 nodes, 13,422 edges).

34 [Figure: NCP plot; x-axis: community size; y-axis: community score. Annotations: communities get better and better up to the best community, which has 100 nodes; beyond that size the best communities get worse and worse.]

35 Whiskers are responsible for the downward slope of the NCP plot. A whisker is a set of nodes connected to the rest of the network by a single edge. [Figure: NCP plot with the largest whisker marked]
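Following the definition on this slide, whiskers can be found by testing each edge: if removing it disconnects the graph, the smaller side is a whisker. The sketch below does this by brute force (one graph traversal per edge), which is fine for illustration but slow on large networks. The function names are hypothetical, and the graph is assumed to be connected with orderable node labels.

```python
def reachable(adj, start, banned_edge):
    """Nodes reachable from `start` without crossing `banned_edge`."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if frozenset((v, w)) != banned_edge and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def find_whiskers(adj):
    """Return (edge, whisker_nodes) for every edge whose removal splits the graph."""
    n = len(adj)
    found = []
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                side = reachable(adj, u, frozenset((u, v)))
                if v not in side:  # (u, v) is the single connecting edge
                    small = side if len(side) <= n - len(side) else set(adj) - side
                    found.append(((u, v), small))
    return found


# A 4-clique core {0, 1, 2, 3} with a triangle {4, 5, 6} hanging off edge (3, 4).
adj = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
whiskers = find_whiskers(adj)
```

The whisker in this example contains a cycle, so it is not a tree.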

36 Each new edge inside the community costs more. Each node has twice as many children. [Figure: NCP plot with Φ = 1/3 = 0.33, Φ = 2/4 = 0.5, Φ = 8/6 = 1.33, Φ = 64/14 = 4.57]
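The four scores on this slide follow a simple recurrence, under one reading of the figure that is an assumption here: the community grows by absorbing the next level of a tree, and the branching factor doubles at each level (2, then 4, then 8). Absorbing a level turns every cut edge into an inside edge, and every absorbed node exposes its children as new cut edges.

```python
def tree_scores(branching=(2, 4, 8), inside=3, cut=1):
    """(cut, inside) pairs as a community absorbs successive tree levels."""
    pairs = [(cut, inside)]
    for b in branching:
        inside += cut  # each former cut edge now lies inside the community
        cut *= b       # each absorbed node has b children outside
        pairs.append((cut, inside))
    return pairs


pairs = tree_scores()
ratios = [c / i for c, i in pairs]
```

The starting values (1 cut edge, 3 inside edges) are taken from the smallest score on the slide. The cut grows multiplicatively while the inside count grows only additively, which is why the score keeps rising.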

37 Take a real network G. Rewire its edges for a long time. We obtain a random graph with the same degree distribution as the real network G.
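One common way to do this kind of rewiring is the double edge swap: pick two edges a-b and c-d and replace them with a-d and c-b, which leaves every node's degree unchanged. The transcript does not say which procedure the authors used, so this is a sketch under that assumption. The function names, the attempt cap, and the fixed random seed are illustrative choices. The input is assumed to be a simple graph (no self-loops or duplicate edges).

```python
import random
from collections import Counter


def degrees(edges):
    """Degree of every node in an undirected edge list."""
    return Counter(v for edge in edges for v in edge)


def rewire(edges, n_swaps, seed=0):
    """Randomize a simple graph by double edge swaps, preserving all degrees."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:  # try both orientations of the second edge
            c, d = d, c
        if a == d or c == b:  # swap would create a self-loop
            continue
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:  # swap would duplicate an edge
            continue
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


# A 10-node ring with three chords.
original = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5), (2, 7), (3, 8)]
rewired = rewire(original, n_swaps=50, seed=1)
```

Comparing the NCP plot of `rewired` with that of `original` then isolates what the degree distribution alone can explain.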

38 Rewired network: a random network with the same degree distribution.

39 Whiskers in real networks are larger than expected.

40 Whiskers in real networks are non-trivial (richer than trees). [Figure annotation: edge to cut]

41 What if we allow cuts that give disconnected communities? Cut all whiskers and compose communities out of whiskers. How good are the “communities” we get?

42 We get better community scores when composing disconnected sets of whiskers. [Figure: community score vs. community size; curves: connected communities, bag of whiskers]

43 Nothing happens! Now we have 2-edge connected whiskers to deal with.

44 [Figure curves: connected communities, bag of whiskers, rewired network]

45 Network structure: core-periphery (jellyfish, octopus). Whiskers are responsible for good communities. The core of the network gets denser and denser. The core contains 60% of the nodes and 80% of the edges.


47 (Sparse) random graph: start with N nodes; pick pairs of nodes uniformly at random and connect them. NCP plot: flat (long random connections). Theorem (works for any degree distribution). Sparsity does not explain our observation.

48 Preferential attachment [Price 1965, Albert & Barabasi 1999]: add a new node and create m out-links; the probability of linking to node i is proportional to its degree k_i. Based on Herbert Simon’s result: power laws arise from “rich get richer” (cumulative advantage). NCP plot: flat (connections to hubs, no locality).
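A minimal generator for the preferential attachment process described on this slide might look as follows. The function name, the initial clique of m+1 nodes, and the choice to make each new node's m targets distinct are assumptions for the sketch. Degree-proportional sampling is done by drawing uniformly from a list in which each node appears once per incident edge.

```python
import random


def preferential_attachment(n, m, seed=0):
    """Grow a graph to n nodes; each new node links to m existing nodes,
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Start from a clique on m + 1 nodes so every node has positive degree.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears in `ends` once per incident edge.
    ends = [v for edge in edges for v in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))  # degree-proportional pick
        for t in targets:
            edges.append((new, t))
            ends.extend((new, t))
    return edges


pa_edges = preferential_attachment(n=50, m=2, seed=3)
```

High-degree nodes occupy more slots in `ends`, so they keep attracting new links, which is the "rich get richer" effect named on the slide.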

49 Let’s exploit local connections. NCP plot: down (locally the network looks like a mesh) and flat (at large scale the network looks random).

50 Geometric preferential attachment: place nodes at random in 2D; pick a node; pick nodes within a radius; connect preferentially. NCP plot: flat (locally the network is random) and down (globally the network is a mesh, a union of local expanders).

51 Forest Fire: connections spread like a fire. A new node joins the network, selects a seed node, connects to some of its neighbors, and continues recursively. As a community grows it blends into the core of the network.
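The Forest Fire steps listed on this slide can be sketched as below. This is a simplified, undirected variant written for illustration: each unburned neighbor catches fire independently with a fixed probability `p`, which is an assumption and not necessarily how the published model draws the number of links to burn. The function name and parameters are hypothetical.

```python
import random


def forest_fire(n, p=0.3, seed=0):
    """Grow an undirected graph on n nodes by a simplified forest-fire process."""
    rng = random.Random(seed)
    adj = {0: set()}
    for new in range(1, n):
        # The new node selects a seed node uniformly among existing nodes.
        first = rng.randrange(new)
        burned = {first}
        frontier = [first]
        # The fire spreads recursively to neighbors of burned nodes.
        while frontier:
            v = frontier.pop()
            for w in sorted(adj[v] - burned):
                if rng.random() < p:
                    burned.add(w)
                    frontier.append(w)
        # The new node connects to every node the fire reached.
        adj[new] = set()
        for v in burned:
            adj[new].add(v)
            adj[v].add(new)
    return adj


ff = forest_fire(n=200, p=0.3, seed=1)
```

Because every new node links at least to its seed node, the resulting graph is connected. Larger values of `p` let fires spread further and produce a denser graph.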

52 [Figure curves: rewired network, bag of whiskers]

53 Whiskers: the largest whisker has ~100 nodes, independent of network size. Dunbar number: a person can maintain social relationships with at most 150 people. Core: the core has little structure (hard to cut), but still more structure than the random network.

54 Other researchers examined small networks, so they did not hit Dunbar’s limit. Some evidence: in the 400k-node Amazon co-purchasing network [Clauset et al. 2004], the largest community has 50% of all nodes and was labeled “Miscellaneous”; the karate club has no significant community structure [Newman et al. 2007].

55 Bond vs. identity communities. Multiple hierarchies that blur the community boundaries.

56 Ground truth. Yes: use attributes, better link semantics.

57 The NCP plot is a way to analyze network community structure. Our results agree with previous work on small networks (which are commonly used for testing community-finding algorithms). But large networks are different: they have a whiskers + core structure, and small, well-isolated communities blend into the core of the network as they grow.

