Jure Leskovec Machine Learning Department Carnegie Mellon University
The emergence of ‘cyberspace’ and the World Wide Web is like the discovery of a new continent. Jim Gray, 1998 Turing Award address 2
Large on-line systems have detailed records of human activity On-line communities: ▪ Facebook (64 million users, billion dollar business) ▪ MySpace (300 million users) Communication: ▪ Instant Messenger (~1 billion users) News and Social media: ▪ Blogging (250 million blogs world-wide, presidential candidates run blogs) On-line worlds: ▪ World of Warcraft (internal economy 1 billion USD) ▪ Second Life (GDP of 700 million USD in ‘07) Opportunities for impact in science and industry 3
b) Internet (AS) c) Social networks a) World wide web d) Communicatione) Citationsf) Protein interactions 4
We know lots about the network structure: Properties: Scale free [Barabasi ’99], 6-degrees of separation [Milgram ’67], Navigation [Adamic-Adar ’03, LibenNowell ’05], Bipartite cores [Kumar et al. ’99], Network motifs [Milo et al. ‘02], Communities [Newman ‘99], Conductance [Mihail-Papadimitriou-Saberi ‘06], Hubs and authorities [Page et al. ’98, Kleinberg ‘99] Models: Preferential attachment [Barabasi ’99], Small- world [Watts-Strogatz ‘98], Copying model [Kleinberg el al. ’99], Heuristically optimized tradeoffs [Fabrikant et al. ‘02], Congestion [Mihail et al. ‘03], Searchability [Kleinberg ‘00], Bowtie [Broder et al. ‘00], Transit-stub [Zegura ‘97], Jellyfish [Tauro et al. ‘01] We know much less about processes and dynamics of networks 5
Network Dynamics: Network evolution ▪ How network structure changes as the network grows and evolves? Diffusion and cascading behavior ▪ How do rumors and diseases spread over networks? 6
We need massive network data for the patterns to emerge: MSN Messenger network [WWW ’08, Nature ‘08] (the largest social network ever analyzed) ▪ 240M people, 255B messages, 4.5 TB data Product recommendations [EC ‘06] ▪ 4M people, 16M recommendations Blogosphere [work in progress] ▪ 60M posts, 120M links 7
Diffusion & Cascades Network Evolution Patterns & Observations Q1: What do cascades look like? Q2: How likely are people to get influenced? Q5: How does network structure change as the network grows/evolves? Models & Algorithms Q3: How do we find influential nodes? Q4: How do we quickly detect epidemics? Q6:How do we generate realistic looking and evolving networks? 8
Diffusion & Cascades Network Evolution Patterns & Observations Q1: What do cascades look like? Q2: How likely are people to get influenced? Q5: How does network structure change as the network grows/evolves? Models & Algorithms Q3: How do we find influential nodes? Q4: How do we quickly detect epidemics? Q6:How do we generate realistic looking and evolving networks? 9
Behavior that cascades from node to node like an epidemic News, opinions, rumors Word-of-mouth in marketing Infectious diseases As activations spread through the network they leave a trace – a cascade Cascade (propagation graph) Network 10
People send and receive product recommendations, purchase products Data: Large online retailer: 4 million people, 16 million recommendations, 500k products 10% credit10% off 11 [w/ Adamic-Huberman, EC ’06]
Bloggers write posts and refer (link) to other posts and the information propagates Data: 10.5 million posts, 16 million links 12 [w/ Glance-Hurst et al., SDM ’07]
Viral marketing cascades are more social: Collisions (no summarizers) Richer non-tree structures Are they stars? Chains? Trees? Information cascades (blogosphere): Viral marketing (DVD recommendations): (ordered by frequency) 13 propagation [w/ Kleinberg-Singh, PAKDD ’06]
Prob. of adoption depends on the number of friends who have adopted [Bass ‘69, Granovetter ’78] What is the shape? Distinction has consequences for models and algorithms k = number of friends adopting Prob. of adoption k = number of friends adopting Prob. of adoption Diminishing returns? Critical mass? To find the answer we need lots of data 14
Later similar findings were made for group membership [Backstrom-Huttenlocher- Kleinberg ‘06], and probability of communication [Kossinets-Watts ’06] Probability of purchasing DVD recommendations (8.2 million observations) # recommendations received Adoption curve follows the diminishing returns. Can we exploit this? 15 [w/ Adamic-Huberman, EC ’06]
Diffusion & Cascades Network Evolution Patterns & Observations A1: Cascade shapes A2: Human adoption follows diminishing returns Q5: How does network structure change as the network grows/evolves? Models & Algorithms Q3: How do we find influential nodes? Q4: How do we quickly detect epidemics? Q6:How do we generate realistic looking and evolving networks? 16
Blogs – information epidemics Which are the influential/infectious blogs? Viral marketing [Domingos-Richardson] Who are the trendsetters? Influential people? Disease spreading Where to place monitoring stations to detect epidemics? 17
How to quickly detect epidemics as they spread? 18 c1c1 c2c2 c3c3 [w/ Krause-Guestrin et al., KDD ’07] (best student paper)
Cost: Cost of monitoring is node dependent Reward: Minimize the number of affected nodes: ▪ If A are the monitored nodes, let R(A) denote the number of nodes we save We also consider other rewards: ▪ Minimize time to detection ▪ Maximize number of detected outbreaks R(A) A 19 ( ) [w/ Krause-Guestrin et al., KDD ’07] (best student paper)
Reward for detecting cascade i Given: Graph G(V,E), budget M Data on how cascades C 1, …, C i, …,C K spread over time Select a set of nodes A maximizing the reward subject to cost(A) ≤ M Solving the problem exactly is NP-hard Max-cover [Khuller et al. ’99] 20
Submodularity of the reward functions CELF: algorithm with approximate guarantee Lazy evaluation to speed-up CELF We extend the result of [Kempe-Kleinberg-Tardos ’03] 21 [w/ Krause-Guestrin et al., KDD ’07] (best student paper)
If objective function exhibits diminishing returns: Submodularity (think of it as “convexity”) Theorem [Nemhauser et al. ‘78] If R is a function that is monotone and submodular, then k-step greedy produces set A for which R(A) is within (1-1/e) ~63% of optimal. a b c a b c d d Marginal reward e Greedy Add sensor with highest marginal gain Slow! O(kN) 22
R(A) is monotone: monitoring more nodes never hurts We must show R is submodular: A B Natural example: Sets A 1, A 2,…, A n R(A) = size of union of A i (size of covered area) If R 1,…,R K are submodular, then ∑p i R i is submodular A B u Gain of adding a node to a small setGain of adding a node to a large set R(A {u}) – R(A) ≥ R(B {u}) – R(B) 23
Theorem: Reward function is submodular Consider cascade i: R i (u k ) = set of nodes saved from u k R i (A) = size of union R i (u k ), u k A R i is submodular Global optimization: R(A) = Prob(i) R i (A) R is submodular u1u1 R i (u 1 ) Cascade i u2u2 R i (u 2 ) 24 (best student paper) [w/ Krause-Guestrin et al., KDD ’07]
We extend the setting to variable node cost case We develop CELF (cost-effective lazy forward- selection) algorithm: Two independent runs of a modified greedy ▪ Solution set A’: ignore cost, greedily optimize reward ▪ Solution set A’’: greedily optimize reward/cost ratio Pick best of the two: arg max(R(A’), R(A’’)) Theorem: CELF is near optimal CELF achieves ½(1-1/e) factor approximation For the size of our problems naïve CELF is too slow 25 (best student paper) [w/ Krause-Guestrin et al., KDD ’07]
Observation: Submodularity guarantees that marginal rewards decrease with the solution size Idea: Use marginal reward from previous step as a upper bound on current marginal reward d Marginal reward a b c d e 26
CELF algorithm: Keep an ordered list of marginal rewards r i from previous step Re-evaluate r i only for the top node a b c a b c d d e e Marginal reward 27
CELF algorithm: Keep an ordered list of marginal rewards r i from previous step Re-evaluate r i only for the top node a a b c d dbce e Marginal reward 28
CELF algorithm: Keep an ordered list of marginal rewards r i from previous step Re-evaluate r i only for the top node a c a b c d d b Marginal reward e e 29
For more info see our website: Which blogs should one read to catch big stories? CELF In-links Random # posts Out-links Number of selected blogs (sensors) Reward (higher is better) 30 (used by Technorati)
CELF Greedy Exhaustive search CELF runs 700x faster than simple greedy algorithm Number of selected blogs (sensors) Run time (seconds) (lower is better) 31
Given: a real city water distribution network data on how contaminants spread over time Place sensors (to save lives) Problem posed by the US Environmental Protection Agency SS 32 c1c1 c2c2 [w/ Krause et al., J. of Water Resource Planning]
Our approach performed best at the Battle of Water Sensor Networks competition CELF Population Random Flow Degree AuthorScore CMU (CELF) 26 Sandia21 U Exter20 Bentley systems19 Technion (1)14 Bordeaux12 U Cyprus11 U Guelph7 U Michigan4 Michigan Tech U3 Malcolm2 Proteo2 Technion (2)1 Number of placed sensors Population saved (higher is better) 33 [w/ Ostfeld et al., J. of Water Resource Planning]
Diffusion & Cascades Network Evolution Patterns & Observations A1: Cascade shapes A2: Human adoption follows diminishing returns Q5: How does network structure change as the network grows/evolves? Models & Algorithms A3, A4: CELF algorithm for detecting cascades and outbreaks Q6: How do we generate realistic looking and evolving networks? 34
Empirical findings on real graphs led to new network models Such models make assumptions/predictions about other network properties What about network evolution? 35 log degree log prob. Model Power-law degree distributionPreferential attachment Explains
Networks are denser over time Densification Power Law: a … densification exponent (1 ≤ a ≤ 2) What is the relation between the number of nodes and the edges over time? Prior work assumes: constant average degree over time Internet Citations a=1.2 a=1.6 N(t) E(t) N(t) E(t) 36 [w/ Kleinberg-Faloutsos, KDD ’05] (best research paper)
Prior models and intuition say that the network diameter slowly grows (like log N, log log N) time diameter size of the graph Internet Citations Diameter shrinks over time as the network grows the distances between the nodes slowly decrease 37 [w/ Kleinberg-Faloutsos, KDD ’05] (best research paper)
None of the existing models generates graphs with these properties We need a new model Idea: Generate graphs recursively 38 Recursion (fractals) Skewed distributions (power-laws)
We prove Kronecker graphs mimic real graphs: Power-law degree distribution, Densification, Shrinking/stabilizing diameter, Spectral properties Initiator (9x9) (3x3) (27x27) Kronecker product of graph adjacency matrices (actually, there is also a nice social interpretation of the model) 39 [w/ Chakrabarti-Kleinberg-Faloutsos, PKDD ’05] p ij Edge probability
Want to generate realistic networks: Why synthetic graphs? Anomaly detection, Simulations, Predictions, Null-model, Sharing privacy sensitive graphs, … Q: Which network properties do we care about? A: Don’t commit, let’s match adjacency matrices 40 Compare graphs properties, e.g., degree distribution Given a real network Generate a synthetic network
Maximum likelihood estimation Naïve estimation takes O(N!N 2 ): N! for different node labelings: ▪ Our solution: Metropolis sampling: N! (big) const N 2 for traversing graph adjacency matrix ▪ Our solution: Kronecker product (E << N 2 ): N 2 E Do stochastic gradient descent ab cd P(|) Kronecker arg max We estimate the model in O(E) 41 [w/ Faloutsos, ICML ’07]
We search the space of ~10 1,000,000 permutations Fitting takes 2 hours Real and Kronecker are very close 42 Degree distribution node degree probability Path lengths number of hops # reachable pairs “Network” values rank network value [w/ Faloutsos, ICML ’07]
Diffusion & Cascades Network Evolution Patterns & Observations A1: Cascade shapes A2: Human adoption follows diminishing returns A5: Densification, Shrinking diameters Models & Algorithms algorithm for detecting cascades and outbreaks A6: Kronecker graphs, Estimating Kronecker initiator A3, A4: CELF 43 Two quick applications Conclusion and Future work
P1) Web Projections: Predicting search result quality without page content P2) MSN Instant Messenger: Social network of the whole world 44
User types in a query to a search engine Search engine returns results: Is this a good set of search results? Result returned by the search engine Hyperlinks Non-search results connecting the graph 45 [w/ Dumais-Horvitz, WWW ’07]
We can predict search result quality with 80% accuracy just from the connection patterns between the results 46 Predict “Good” Predict “Poor” Good? Poor? [w/ Dumais-Horvitz, WWW ’07]
Milgram’s small world experiment (i.e., hops + 1) 47 Lots of engineering effort and good hardware: 4 dual-core Opteron server 48GB memory 6.8TB of fast SCSI disks How to learn short paths? Small-world experiment [Milgram ‘67] People send letters from Nebraska to Boston How many steps does it take? Messenger social network – largest network analyzed 240M people, 30B conversations, 4.5TB data [w/ Horvitz, WWW ’08, Nature ‘08] MSN Messenger network Number of steps between pairs of people
Diffusion & Cascades Network Evolution Patterns & Observations A1: Cascade shapes A2: Human adoption follows diminishing returns A5: Densification, Shrinking diameters Models & Algorithms A3, A4: CELF algorithm for detecting cascades and outbreaks A6: Kronecker graphs, Estimating Kronecker initiator Big data matters 48
How do news and information spread New ranking and influence measures for blogs Sentiment analysis from cascade structure Crawling 2 million blogs/day (with Spinn3r) Obscure technology story Small tech blog Wired Slashdot New Scientist New York Times CNN BBC 49
Models of information diffusion When, where and what post will create a cascade? Where should one tap the network to get the effect they want? Social Media Marketing How to handle richer classes of graphs Interaction of network structure and node/edge attributes 50
Why are networks the way they are? Continue work on fundamental properties of networks Build models and understanding Network community structure Health of a social network How to better design and steer networks Adoption of social networking services (with Facebook) Predictive modeling of large communities Online multi-player games are closed worlds with detailed traces of human activity Predict success of guilds, battles, elections (with EVE online) 51
Map-reduce for massive graphs Algorithms and techniques for massive graphs With Yahoo! Research: 4000 node Hadoop cluster ▪ world’s 50 th fastest supercomputer ▪ 3 TB memory ▪ 1.5 PB storage We already have new ways to peek into the structure of massive networks 52
Computer systems Theory and algorithms Machine learning / Data mining 53 (complex) networks
Computer systems Theory and algorithms Machine learning / Data mining 54 Social Sciences Biology Physics Applications & Industry Statistics (complex) networks
Christos Faloutsos Jon Kleinberg Andreas Krause Carlos Guestrin Eric Horvitz Susan Dumais Andrew Tomkins Ravi Kumar Lada Adamic Bernardo Huberman Kevin Lang Michael Manohey Zoubin Gharamani Anirban Dasgupta Tom Mitchell John Lafferty Larry Wasserman Avrim Blum Samuel Madden Michalis Faloutsos Ajit Singh John Shawe-Taylor Lars Backstrom Marko Grobelnik Natasa Milic-Frayling Dunja Mladenic Jeff Hammerbacher Matt Hurst Deepay Chakrabarti Natalie Glance Jeanne VanBriesen Microsoft Research Yahoo Research Facebook Eve online Spinn3r LinkedIn Nielsen BuzzMetrics Microsoft Live Labs Google 55
Diffusion & Cascades Network Evolution Patterns & Observations A1: Cascade shapes A2: Human adoption follows diminishing returns A5: Densification, Shrinking diameters Models & Algorithms A3, A4: CELF algorithm for detecting cascades and outbreaks A6: Kronecker graphs, Estimating Kronecker initiator Big data matters 56