Presentation is loading. Please wait.

Presentation is loading. Please wait.

Lecture 4. Modules/communities in networks What is a module? Nodes in a given module (or community group or a functional unit) tend to connect with other.

Similar presentations


Presentation on theme: "Lecture 4. Modules/communities in networks What is a module? Nodes in a given module (or community group or a functional unit) tend to connect with other."— Presentation transcript:

1 Lecture 4

2 Modules/communities in networks

3 What is a module? Nodes in a given module (or community group or a functional unit) tend to connect with other nodes in the same module Biology: proteins of the same function (e.g. DNA repair) or sub-cellular localization (e.g. nucleus) WWW – websites on a common topic (e.g. physics) or organization (e.g. EPFL) Internet – Autonomous systems/routers by geography (e.g. Switzerland) or domain (e.g. educational or military)

4 Sometimes easy to discover

5 Sometimes hard

6 Hierarchical clustering calculating the “similarity weight” W ij for all pairs of vertices (e.g. # of independent paths i  j) start with all n vertices disconnected add edges between pairs one by one in order of decreasing weight result: nested components, where one can take a ‘slice’ at any level of the tree

7 Girvan & Newman (2002): betweenness clustering Betweenness of and edge i -- j is the # of shortest paths going through this edge Algorithm compute the betweenness of all edges remove edge with the lowest betweenness recalculate betweenness Caveats: Betweenness needs to be recalculated at each step very expensive: all pairs shortest path – O(N 3 ) may need to repeat up to N times does not scale to more than a few hundred nodes, even with the fastest algorithms

8 Using random walks/diffusion to discover modules in networks K. Eriksen, I. Simonsen, S. Maslov, K. Sneppen, PRL 90, 148701(2003)

9 Why diffusion? Any dynamical process would equilibrate faster on modules and slower between modules Thus its slow modes reveal modules Diffusion is the simplest dynamical process (people also use others like Ising/Potts model, etc.)

10 Random walkers on a network Study the behavior of many VIRTUAL random walkers on a network At each time step each random walker steps on a randomly selected neighbor They equilibrate to a steady state n i ~ k i (solid state physics: n i = const) Slow modes of equilibration to the steady state allow to detect modules in a network

11 Matrix formalism

12 Eigenvectors of the transfer matrix T ij

13 Similarity transformation Matrix T ij is asymmetric  Could in principle result to complex eigenvalues/eigenvectors Luckily, S ij =1/(  K i  K j ) has the same eigenvalues and eigenvectors v i /  K i Known as similarity transformation

14 Density of states  ( ) filled circles –real AS-network empty squares – degree-preserving randomized version

15 Participation ratio: PR (  ) =  i 1/(v (  ) i ) 4 -0.500.51 0 50 100 150 200 250 Participation Ratio

16 US Military Russia

17 2 0.9626 RU RU RU RU CA RU RU ?? ?? US US US US ?? (US Department of Defence) 3 0.9561 ?? FR FR FR ?? FR ?? RU RU RU ?? ?? RU ?? 4 0.9523 US ?? US ?? ?? ?? ?? (US Navy) NZ NZ NZ NZ NZ NZ NZ 5. 0.9474 KR KR KR KR KR ?? KR UA UA UA UA UA UA UA

18 Hacked Ford AS

19

20 Using random walks/diffusion to rank information networks e.g. Google’s PageRank made it 160 billion $

21 Information networks 3x10 5 Phys Rev articles connected by 3x10 6 citation links 10 10 webpages in the world To find relevant information one needs to efficiently search and rank!!

22 Ranking webpages Assign an “importance factor” G i to every webpage Given a keyword (say “jaguar”) find all the pages that have it in their text and display them in the order of descending G i. One solution still used in scientific publishing is G i =K in (i) (the number of incoming links), but: Too democratic: It doesn’t take into account the importance of nodes sending links Easy to trick and artificially boost the ranking (for the WWW)

23 How Google works Google’s recipe (circa 1998) is to simulate the behavior of many virtual “random surfers” PageRank: G i ~ the number of virtual hits the page gets. It is also ~ the steady state number of random surfers at a given page Popular pages send more surfers your way  PageRank ~ K in is weighted by the popularity of a webpage sending each hyperlink Surfers get bored following links  with probability  =0.15 a surfer jumps to a randomly selected page (not following any hyperlinks)

24 How communities in the WWW influence Google ranking H. Xie, K.-K. Yan, SM, cond-mat/0409087 physics/0510107 Physica A 373 (2007) 831–836

25 How do WWW communities influence their average G i ? Pages in a web-community preferentially link to each other. Examples: Pages from the same organization (e.g. EPFL) Pages devoted to a common topic (e.g. Physics) Pages in the same geographical location (e.g Switzerland) Naïve argument: communities “trap” random surfers to spend more time inside  they should increase the average Google ranking of the community

26 log 10 ( c ) # of intra-community links Test of a naïve argument Naïve argument is wrong: it could go either way Community #1 Community #2

27 E cc E ww

28 G c – average Google rank of pages in the community; G w  1 – in the outside world E cw G c / c – current from C to W It must be equal to: E wc G w / w – current from W to C Thus G c depends on the ratio between E cw and E wc – the number of edges (hyperlinks) between the community and the world

29 Balancing currents for nonzero  J cw =(1-  ) E cw G c / c +  G c N c – current from C to W It must be equal to: J cw =(1-  ) E wc G w / +  G w N w (N c /N w ) – current from W to C

30 For very isolated communities (E cw /E (r) cw <  and E wc /E (r) wc <  ) one has G c =1. Their Google rank is decoupled from the outside world! Overall range:  <G c <1/  What are the consequences?

31 WWW - the empirical data We have data for ~10 US universities (+ all UK and Australian Universities) Looked closely at UCLA and Long Island University (LIU) UCLA has different departments LIU has 4 campuses

32  =0.15

33  =0.001 Abnormally high PageRank

34 Top PageRank LIU websites for  =0.001 don’t make sense #1 www.cwpost.liu.edu/cwis/cwp/edu/edleader/higher_ed/ hear.html' #5 …/higher_ed/ index.html #9 …/higher_ed/courses.html World Strongly connected component

35

36

37 What about citation networks? Better use  =0.5 instead of  =0.15: people don’t click through papers as easily as through webpages Time arrow: papers only cite older papers: Small values of  give older papers unfair advantage New algorithm CiteRank (as in PageRank). Random walkers start from recent papers ~exp(-t/  d )

38

39 Summary Diffusion and modules (communities) in a network affect each other In the “hardware” part of the Internet (Autonomous systems or routers ) diffusion allows one to detect modules In the “software” part Diffusion-like process is used for ranking (Google’s PageRank) WWW communities affect this ranking in a non- trivial way

40 THE END

41

42 Part 2: Opinion networks "Extracting Hidden Information from Knowledge Networks", S. Maslov, and Y-C. Zhang, Phys. Rev. Lett. (2001). "Exploring an opinion network for taste prediction: an empirical study", M. Blattner, Y.-C. Zhang, and S. Maslov, in preparation.

43 Predicting customers’ tastes from their opinions on products Each of us has personal tastes Information about them is contained in our opinions on products Matchmaking: opinions of customers with tastes similar to mine could be used to forecast my opinions on untested products Internet allows to do it on large scale (see amazon.com and many others)

44 Opinion networks Webapges Other webpages 2 1 3 4 1 2 3 WWW Opinions of movie-goers on movies Customers Movies opinion 2 1 3 4 1 2 3

45 Storing opinions XXX29?? XXX?8?8 XXX??1? 2??XXXX 98?XXXX ??1XXXX ?8?XXXX Matrix of opinions  IJ Network of opinions 9 8 8 1 2 Customers 2 1 3 4 1 2 3 Movies

46 Using correlations to reconstruct customer’s tastes Similar opinions  similar tastes Simplest model: Movie-goers  M-dimensional vector of tastes T I Movies  M-dimensional vector of features F J Opinions  scalar product:  IJ = T I  F J 9 8 8 1 2 Customers 2 1 3 4 1 2 3 Movies

47 Loop correlation Predictive power 1/M (L-1)/2 One needs many loops to best reconstruct unknown opinions L=5 known opinions: Predictive power of an unknown opinion is 1/M 2 An unknown opinion

48 Main parameter: density of edges The larger is the density of edges p the easier is the prediction At p 1  1/N (N=N costomers +N movies ) macroscopic prediction becomes possible. Nodes are connected but vectors T I and F J are not fixed: ordinary percolation threshold At p 2  2M/N > p 1 all tastes and features ( T I and F J ) can be uniquely reconstructed: rigidity percolation threshold

49 Real empirical data (EachMovie dataset) on opinions of customers on movies: 5-star ratings of 1600 movies by 73000 users 1.6 million opinions!

50 Spectral properties of  For M<N the matrix  IJ has N-M zero eigenvalues and M positive ones:  = R  R +. Using SVD one can “diagonalize” R = U  D  V + such that matrices V and U are orthogonal V +  V = 1, U  U + = 1, and D is diagonal. Then  = U  D 2  U + The amount of information contained in  : NM-M(M-1)/2 << N(N-1)/2 - the # of off- diagonal elements

51 Recursive algorithm for the prediction of unknown opinions 1. Start with  0 where all unknown elements are filled with (zero in our case) 2. Diagonalize and keep only M largest eigenvalues and eigenvectors 3. In the resulting truncated matrix  ’ 0 replace all known elements with their exact values and go to step 1

52 Convergence of the algorithm Above p 2 the algorithm exponentially converges to the exact values of unknown elements The rate of convergence scales as (p-p 2 ) 2

53 Reality check: sources of errors Customers are not rational!  IJ = r I  b J +  IJ (idiosyncrasy) Opinions are delivered to the matchmaker through a narrow channel: Binary channel S IJ = sign(  IJ ) : 1 or 0 (liked or not) Experience rated on a scale 1 to 5 or 1 to 10 at best If number of edges K, and size N are large, while M is small these errors could be reduced

54 How to determine M? In real systems M is not fixed: there are always finer and finer details of tastes Given the number of known opinions K one should choose M eff  K/(N readers +N books ) so that systems are below the second transition p 2  tastes should be determined hierarchically

55 Avoid overfitting Divide known votes into training and test sets Select M eff so that to avoid overfitting !!! Reasonable fit Overfit

56 Knowledge networks in biology Interacting biomolecules: key and lock principle Matrix of interactions (binding energies)  IJ = k I  l J + l I  k J Matchmaker (bioinformatics researcher) tries to guess yet unknown interactions based on the pattern of known ones Many experiments measure S IJ =  (  IJ -  th ) k (1) k (2) l (2) l (1)

57 THE END


Download ppt "Lecture 4. Modules/communities in networks What is a module? Nodes in a given module (or community group or a functional unit) tend to connect with other."

Similar presentations


Ads by Google