CSE 522 – Algorithmic and Economic Aspects of the Internet
Instructors: Nicole Immorlica, Mohammad Mahdian
This lecture
- Probabilistic generative models for social networks (in particular, the web graph)
Why look for generative models?
- Designing and testing algorithms for the web, e.g.:
  - Compressing the web graph
  - Designing crawling strategies
  - Search algorithms on P2P networks
  - …
- Explaining why the web has certain properties
  - For example, the central limit theorem tells us why we often see the Gaussian distribution in practice. Is there a similar explanation for the power law distribution?
- Predicting what “might” happen in the future, e.g.:
  - An AIDS epidemic?
  - An Internet blackout?
  - Residential segregation?
Characteristics of a good model
- Simple
- Plausible
- Exhibits the observed properties
  - Power law
  - Small world
  - Locally dense, globally sparse
Power law distribution
From last lecture: power laws everywhere!
- Income distribution (Pareto 1896)
- Word frequencies (Estoup 1916, Zipf 1932)
- City population (Auerbach 1913, Zipf 1949)
- Scientific productivity (Lotka 1926)
- Internet graph degree distribution (FFF 1999)
- Web graph degree distribution (BKMRRSTW 2000)
- Distribution of file sizes
- …
Why?
Models and explanations for power law
- Optimization (“power law is the best design”)
  - Mandelbrot 1953: Zipf’s law is the most efficient design.
  - Carlson & Doyle 1999, Fabrikant et al. (HOT)
- Monkeys typing randomly
  - Miller 1957: even a monkey typing randomly can generate a power law.
- Multiplicative processes & log-normal distribution
  - Gibrat 1930, Champernowne 1953, Gabaix 1999
- Preferential growth (“the rich get richer”)
  - Simon 1955, Yule 1925
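The random-typing explanation can be tried out with a small simulation. This is only a sketch under assumed parameters: the two-letter alphabet, the space probability `p_space = 0.2`, and the function name `random_typing` are illustrative choices, not taken from Miller's paper. It types random characters, splits the text into words, and prints word counts at geometrically spaced ranks so the rank-frequency decay can be inspected.

```python
import random
from collections import Counter

def random_typing(num_chars, rng, letters="ab", p_space=0.2):
    """Type num_chars keys at random: space with probability p_space,
    otherwise one of the letters, all letters equally likely."""
    keys = letters + " "
    p_letter = (1 - p_space) / len(letters)
    weights = [p_letter] * len(letters) + [p_space]
    return "".join(rng.choices(keys, weights=weights, k=num_chars))

rng = random.Random(4)  # fixed seed so the run is repeatable
text = random_typing(200_000, rng)

# Words are the maximal runs of letters between spaces.
ranked = Counter(text.split()).most_common()

# Print rank, word, and count at ranks 1, 2, 4, ..., 64.
for rank in (1, 2, 4, 8, 16, 32, 64):
    word, count = ranked[rank - 1]
    print(rank, repr(word), count)
```

Plotting log(count) against log(rank) for the full `ranked` list is the usual way to judge how close the decay is to a straight line.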
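Log-normal distribution
- Central limit theorem: the product of many independent positive random variables is approximately log-normal (the logarithm of the product is a sum of independent terms, which is approximately normal).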
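A quick numerical check of this statement, as a sketch only: the factor distribution (uniform on [0.5, 1.5]), the number of factors, and the helper names are assumptions chosen for illustration. The product itself is heavily skewed, while its logarithm should look close to symmetric, as a normal distribution would.

```python
import math
import random

def product_of_factors(num_factors, rng):
    """Product of num_factors independent factors, each uniform on [0.5, 1.5]."""
    x = 1.0
    for _ in range(num_factors):
        x *= rng.uniform(0.5, 1.5)
    return x

def skewness(values):
    """Sample skewness: mean cubed deviation divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [product_of_factors(200, rng) for _ in range(5000)]
logs = [math.log(s) for s in samples]

print("skewness of X:    ", skewness(samples))
print("skewness of log X:", skewness(logs))
```

Skewness is only one symptom of normality; a histogram of `logs` gives a fuller picture.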
Multiplicative process and power law
Multiplicative processes can sometimes generate a power law instead of a log-normal:
- Multiplicative process with a minimum
  - Champernowne 1953, Gabaix 1999
- Random stopping time
  - Montroll and Shlesinger 1982, 1983
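The effect of a minimum can be seen in a toy simulation. This is a sketch under assumptions that are not in the cited papers: at each step the value is doubled with probability 1/3 and halved with probability 2/3, and it is never allowed to fall below `x_min = 1`. For this particular choice, log2 of the value is a reflected random walk whose stationary distribution is geometric, so the tail works out to P(X >= x) = 1/x at powers of two.

```python
import random

def simulate_with_floor(steps, rng, x_min=1.0):
    """Multiplicative process X <- max(x_min, X * F), where F = 2 with
    probability 1/3 and F = 1/2 with probability 2/3 (assumed for illustration)."""
    x = x_min
    trajectory = []
    for _ in range(steps):
        factor = 2.0 if rng.random() < 1 / 3 else 0.5
        x = max(x_min, x * factor)
        trajectory.append(x)
    return trajectory

def tail_fraction(values, threshold):
    """Fraction of values that are at least threshold."""
    return sum(1 for v in values if v >= threshold) / len(values)

rng = random.Random(1)  # fixed seed so the run is repeatable
xs = simulate_with_floor(200_000, rng)[1_000:]  # drop a short burn-in

# Compare the empirical tail with 1/x at x = 1, 2, 4, ..., 32.
for k in range(6):
    x = 2.0 ** k
    print(f"P(X >= {x:4.0f}) ~ {tail_fraction(xs, x):.4f}   (1/x = {1 / x:.4f})")
```

Removing the `max(x_min, ...)` floor turns this back into a plain multiplicative process, whose logarithm is an unrestricted random walk.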
Preferential growth
- The system “grows”.
- The probability of a new member joining a group is proportional to the group’s current size.
- Simon 1955, Yule 1925 (for biological systems)
- Barabasi and Albert 1999: preferential attachment for the web graph
Random graph models
Erdos-Renyi random graphs G(n,p)
- n vertices; there is an edge between each pair independently with probability p.
G(n,p) at a glance (the threshold statements hold with high probability as n grows):
- Average degree approximately np.
- Binomial degree distribution.
- p < 1/n: union of small, simple connected components.
- p > 1/n: a “giant” complex component emerges (still many small connected components).
- p > ln(n)/n: connected.
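The change around p = 1/n can be observed directly on small instances. The following is a minimal sketch; the function names, n = 1000, and the two values of c in p = c/n are assumptions picked for illustration. It samples G(n,p) by flipping a coin for every pair, then measures the average degree and the largest connected component with a breadth-first search.

```python
import random
from collections import deque

def sample_gnp(n, p, rng):
    """Adjacency lists of G(n,p): each pair is an edge independently with probability p."""
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj

def largest_component_size(adj):
    """Size of the largest connected component, found by breadth-first search."""
    seen = [False] * len(adj)
    best = 0
    for start in range(len(adj)):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    queue.append(v)
        best = max(best, size)
    return best

rng = random.Random(2)  # fixed seed so the run is repeatable
n = 1000
largest = {}
for c in (0.5, 2.0):  # p = c/n, below and above the 1/n threshold
    adj = sample_gnp(n, c / n, rng)
    avg_degree = sum(len(nb) for nb in adj) / n
    largest[c] = largest_component_size(adj)
    print(f"c = {c}: average degree {avg_degree:.2f}, largest component {largest[c]}")
```

The all-pairs loop takes time quadratic in n, which is fine for a demonstration but not for graphs of web scale.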
The ACL model
Proposed by Aiello, Chung, and Lu.
- Fix a degree sequence d (e.g., power law).
- Put d_i copies of the i’th vertex.
- Pick a random perfect matching on all the copies.
- Contract the d_i copies of the i’th vertex.
Essentially a variant of G(n,p), with the degree distribution explicitly enforced.
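The four steps above can be sketched in a few lines. This is an illustration, not the authors' code: the function names and the example degree sequence are hypothetical, and the sketch assumes the degree sum is even so that a perfect matching exists. Shuffling the list of copies and pairing consecutive entries yields a uniformly random perfect matching; recording each pair by its vertex labels performs the contraction. The result may contain self-loops and parallel edges.

```python
import random

def random_matching_graph(degrees, rng):
    """Edge list of a multigraph with the given degree sequence, built from
    a uniformly random perfect matching on the vertex copies."""
    if sum(degrees) % 2 != 0:
        raise ValueError("the degree sum must be even")
    # d_i copies of vertex i.
    copies = [v for v, d in enumerate(degrees) for _ in range(d)]
    # Shuffle, then pair consecutive copies: a uniform random perfect matching.
    rng.shuffle(copies)
    # Labelling each matched pair by its vertices contracts the copies.
    return [(copies[i], copies[i + 1]) for i in range(0, len(copies), 2)]

def degree_sequence(edges, n):
    """Degrees in the multigraph; a self-loop contributes 2 to its vertex."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

rng = random.Random(3)  # fixed seed so the run is repeatable
degrees = [3, 3, 2, 2, 1, 1]  # hypothetical example sequence
edges = random_matching_graph(degrees, rng)
print(edges)
print(degree_sequence(edges, len(degrees)) == degrees)
```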
Preferential attachment
- Start with a graph with one node.
- Vertices arrive one by one.
- When a vertex arrives, it connects itself to one (m, in general) of the previous vertices, chosen with probability proportional to their degrees.
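A minimal simulation of this process for m = 1 is sketched below. One detail is an assumption: the initial node has degree 0, so "proportional to degree" is undefined for the first arrival, and here that vertex simply connects to the initial node. The function name is hypothetical. The list `endpoints` holds each vertex once per unit of degree, so a uniform draw from it selects a vertex with probability proportional to its degree.

```python
import random
from collections import Counter

def preferential_attachment(n, rng):
    """Degrees of an n-vertex tree grown by preferential attachment with m = 1."""
    degree = [0] * n
    endpoints = []  # vertex v appears degree[v] times
    for new in range(1, n):
        # First arrival: only the initial node exists (assumed convention).
        target = 0 if not endpoints else rng.choice(endpoints)
        degree[new] += 1
        degree[target] += 1
        endpoints.extend((new, target))
    return degree

rng = random.Random(5)  # fixed seed so the run is repeatable
n = 20_000
degree = preferential_attachment(n, rng)
counts = Counter(degree)

# Empirical fraction of vertices at a few small degrees, and the largest degree.
for d in (1, 2, 3, 4, 8, 16):
    print(f"degree {d:2d}: fraction {counts[d] / n:.4f}")
print("maximum degree:", max(degree))
```

For general m, one common variant repeats the target draw m times per arriving vertex; how repeated targets are handled differs between formulations.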
Preferential attachment
- Heuristic analysis (Barabasi-Albert): degree distribution follows a power law with exponent -3.
- Theorem (Bollobas, Riordan, Spencer, Tusnady). For d < n^{1/16}, the fraction of vertices that have degree d is almost surely around