Jure Leskovec Computer Science Department Cornell University / Stanford University.

Large on-line systems have detailed records of human activity
On-line communities:
  ▪ Facebook (64 million users, billion dollar business)
  ▪ MySpace (300 million users)
Communication:
  ▪ Instant Messenger (~1 billion users)
News and Social media:
  ▪ Blogging (250 million blogs world-wide, presidential candidates run blogs)
On-line worlds:
  ▪ World of Warcraft (internal economy 1 billion USD)
  ▪ Second Life (GDP of 700 million USD in ’07)

[Figure: example networks: a) World wide web, b) Internet (AS), c) Social networks, d) Communication, e) Citations, f) Protein interactions]

We know lots about the network structure:
Properties:
  ▪ Scale free [Barabasi-Albert ’99, Faloutsos 3 ’99]
  ▪ 6-degrees of separation [Milgram ’67]
  ▪ Small-diameter [Albert-Jeong-Barabasi ’99]
  ▪ Bipartite cores [Kumar et al. ’99]
  ▪ Network motifs [Milo et al. ’02]
  ▪ Communities [Newman ’99]
  ▪ Hubs and authorities [Page et al. ’98, Kleinberg ’99]
Models:
  ▪ Preferential attachment [Barabasi ’99]
  ▪ Small-world [Watts-Strogatz ’98]
  ▪ Copying model [Kleinberg et al. ’99]
  ▪ Heuristically optimized tradeoffs [Fabrikant et al. ’02]
  ▪ Latent Space Models [Raftery et al. ’02]
  ▪ Searchability [Kleinberg ’00]
  ▪ Bowtie [Broder et al. ’00]
  ▪ Exponential Random Graphs [Frank-Strauss ’86]
  ▪ Transit-stub [Zegura ’97]
  ▪ Jellyfish [Tauro et al. ’01]
We know much less about processes and dynamics of networks

Network Dynamics:
Network evolution
  ▪ How does network structure change as the network grows and evolves?
Diffusion and cascading behavior
  ▪ How do rumors and diseases spread over networks?

We need massive network data for the patterns to emerge:
MSN Messenger network [WWW ’08] (the largest social network ever analyzed)
  ▪ 240M people, 255B messages, 4.5 TB data
Product recommendations [EC ’06]
  ▪ 4M people, 16M recommendations
Blogosphere [work in progress]
  ▪ 60M posts, 120M links

Patterns & Observations
  Network Evolution:
    ▪ Q1: How does the network structure change as the network evolves/grows?
    ▪ Q2: What is edge attachment at the microscopic level?
  Diffusion & Cascades:
    ▪ Q4: What do cascades look like?
    ▪ Q5: How likely are people to get influenced?
Models & Algorithms
  Network Evolution:
    ▪ Q3: How do we generate realistic-looking and evolving networks?
  Diffusion & Cascades:
    ▪ Q6: How do we find influential nodes and quickly detect epidemics?

Empirical findings on real graphs led to new network models
  ▪ Example: the Preferential attachment model explains the Power-law degree distribution
Such models make assumptions/predictions about other network properties
What about network evolution?
[Figure: power-law degree distribution, log prob. vs. log degree]

Prior models and intuition say that the network diameter slowly grows (like log N, log log N)
Diameter shrinks over time
  ▪ As the network grows, the distances between the nodes slowly decrease
[Figure: diameter vs. time / size of the graph, for the Internet and Citations networks]
[w/ Kleinberg-Faloutsos, KDD ’05]

Networks are denser over time
What is the relation between the number of nodes N(t) and the number of edges E(t) over time?
Prior work assumes: constant average degree over time
Densification Power Law: $E(t) \propto N(t)^a$, where a is the densification exponent (1 ≤ a ≤ 2)
  ▪ Internet: a = 1.2
  ▪ Citations: a = 1.6
Is shrinking diameter just a consequence of densification?
[Figure: E(t) vs. N(t) for the Internet and Citations networks]
[w/ Kleinberg-Faloutsos, KDD ’05]
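A minimal sketch of how the densification exponent a could be estimated from snapshots of a growing network, assuming only that we have node and edge counts over time. The function name and the snapshot values are hypothetical; the fit is an ordinary least-squares slope on log-log axes, which may differ from the procedure used in the original study.

```python
import math

def densification_exponent(nodes, edges):
    """Least-squares slope of log E(t) against log N(t)."""
    xs = [math.log(n) for n in nodes]
    ys = [math.log(e) for e in edges]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Hypothetical snapshots constructed to follow E = N^1.2 exactly.
nodes = [100, 1_000, 10_000, 100_000]
edges = [n ** 1.2 for n in nodes]
a = densification_exponent(nodes, edges)
```

A slope of 1 corresponds to constant average degree; a slope above 1 means the average degree grows as the network grows.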

Densifying random graph has increasing diameter
There is more to shrinking diameter than just densification
[Figure: diameter vs. size of the graph for an Erdos-Renyi random graph with densification exponent a = 1.3]

Compare diameter of a:
  ▪ True network (red)
  ▪ Random network with the same degree distribution (blue)
Densification + scale free structure give shrinking diameter
[Figure: diameter vs. size of the graph, Citations network]

We directly observe atomic events of network evolution (and not only network snapshots)
We observe evolution at finest scale:
  ▪ Test individual edge attachment
  ▪ Directly observe events leading to network properties
  ▪ Compare network models by likelihood (and not by just summary network statistics)
[Figure: sequence of individual edge arrivals, and so on for millions…]
[w/ Backstrom-Kumar-Tomkins, KDD ’08]

Network datasets
  ▪ Full temporal information from the first edge onwards
  ▪ LinkedIn (N=7m, E=30m), Flickr (N=600k, E=3m), Delicious (N=200k, E=430k), Answers (N=600k, E=2m)
How are edge destinations selected? Where do new edges attach?
  ▪ Are edges more likely to attach to high degree nodes?
  ▪ Are edges more likely to attach to nodes that are close?
[w/ Backstrom-Kumar-Tomkins, KDD ’08]

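Are edges more likely to connect to higher degree nodes?
τ is the fitted exponent of the edge-attachment probability, p(d) ∝ d^τ for a node of degree d

Network      τ
G_np         0
PA           1
Flickr       1
Delicious    1
Answers      0.9
LinkedIn     0.6

[Figure: edge-attachment probability vs. node degree, panels G_np, PA, Flickr]
[w/ Backstrom-Kumar-Tomkins, KDD ’08]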

Just before the edge (u,w) is placed, how many hops are there between u and w?
Fraction of triad closing edges (u and w share a neighbor v):

Network      % Δ
Flickr       66%
Delicious    28%
Answers      23%
LinkedIn     50%

Real edges are local. Most of them close triangles!
[Figure: number of hops spanned by new edges, panels G_np, PA, Flickr]
[w/ Backstrom-Kumar-Tomkins, KDD ’08]
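The triad-closing measurement can be sketched as follows, assuming the network arrives as a time-ordered stream of undirected edges. The function name and the toy stream are hypothetical, and the original study's exact bookkeeping (for example, how repeated edges are handled) is an assumption here.

```python
from collections import defaultdict

def triad_closing_fraction(edge_stream):
    """Fraction of new edges (u, w) whose endpoints already share a neighbor."""
    neighbors = defaultdict(set)
    closing = total = 0
    for u, w in edge_stream:
        if u == w or w in neighbors[u]:
            continue  # skip self-loops and repeated edges
        total += 1
        if neighbors[u] & neighbors[w]:
            closing += 1  # u and w were exactly two hops apart
        neighbors[u].add(w)
        neighbors[w].add(u)
    return closing / total if total else 0.0

# Hypothetical toy stream: (a, c) and (b, d) each close a triangle.
stream = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("b", "d")]
fraction = triad_closing_fraction(stream)
```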

Want to generate realistic networks:
  ▪ Given a real network, generate a synthetic network
  ▪ Compare graph properties, e.g., degree distribution
Why synthetic graphs?
  ▪ Anomaly detection, Simulations, Predictions, Null-model, Sharing privacy sensitive graphs, …
Q: Which network properties do we care about?
Q: What is a good model and how do we fit it?

We prove Kronecker graphs mimic real graphs:
  ▪ Power-law degree distribution, Densification, Shrinking/stabilizing diameter, Spectral properties
Kronecker product of graph adjacency matrices: Initiator (3x3), then (9x9), then (27x27)
  ▪ p_ij: edge probability
(actually, there is also a nice social interpretation of the model)
Given a real graph, how do we estimate the initiator G_1?
[w/ Chakrabarti-Kleinberg-Faloutsos, PKDD ’05]
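A small sketch of the Kronecker construction, assuming a hypothetical 3x3 initiator (the actual initiator drawn on the slide is not recoverable from the transcript). It uses plain Python lists so it runs without extra libraries; `kron` and `kron_power` are illustrative names.

```python
def kron(a, b):
    """Kronecker product of two square matrices given as lists of rows."""
    n, m = len(a), len(b)
    size = n * m
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(size)]
            for i in range(size)]

def kron_power(initiator, k):
    """k-th Kronecker power: initiator (x) initiator (x) ... (k factors)."""
    result = initiator
    for _ in range(k - 1):
        result = kron(result, initiator)
    return result

# Hypothetical initiator: a 3-node chain with self-loops.
g1 = [[1, 1, 0],
      [1, 1, 1],
      [0, 1, 1]]
g2 = kron_power(g1, 2)  # 9x9 adjacency matrix
g3 = kron_power(g1, 3)  # 27x27 adjacency matrix
```

With a stochastic initiator (entries between 0 and 1) the same product yields edge probabilities p_ij instead of 0/1 adjacencies.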

Initiator matrix is a similarity matrix: $G_1 = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$
Node u is described with k binary attributes $u_1, u_2, \ldots, u_k$
Probability of link between nodes u, v: $P(u,v) = \prod_{i=1}^{k} G_1[u_i, v_i]$
  ▪ Example: $P(u,v) = b \cdot d \cdot c \cdot b$
This is exactly Kronecker product
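The attribute view can be sketched directly, assuming a 2x2 initiator indexed as G_1[0][0]=a, G_1[0][1]=b, G_1[1][0]=c, G_1[1][1]=d. The numeric values and the two attribute vectors below are hypothetical; they are chosen so that the product comes out as b·d·c·b, matching the example on the slide.

```python
def edge_probability(initiator, u_bits, v_bits):
    """P(u, v) as the product of initiator entries over paired attributes."""
    p = 1.0
    for u_i, v_i in zip(u_bits, v_bits):
        p *= initiator[u_i][v_i]
    return p

# Hypothetical initiator values for a, b, c, d.
a, b, c, d = 0.9, 0.5, 0.4, 0.1
g1 = [[a, b],
      [c, d]]

# Attribute pairs (0,1), (1,1), (1,0), (0,1) select b, d, c, b in turn.
u = [0, 1, 1, 0]
v = [1, 1, 0, 1]
p = edge_probability(g1, u, v)
```

Reading the attribute vectors as binary node ids, this value is the (u, v) entry of the fourth Kronecker power of the initiator.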

Maximum likelihood estimation: $\arg\max_{G_1} P(G \mid G_1)$, with G the observed graph and $G_1 = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ the Kronecker initiator
Naïve estimation takes $O(N!\,N^2)$:
N! for different node labelings:
  ▪ Our solution: Metropolis sampling: N! → (big) const
$N^2$ for traversing graph adjacency matrix:
  ▪ Our solution: Kronecker product ($E \ll N^2$): $N^2$ → E
Do stochastic gradient descent
We estimate the model in O(E)
[w/ Faloutsos, ICML ’07]

We search the space of ~$10^{1{,}000{,}000}$ permutations
Fitting takes 2 hours
Fitted initiator: [0.99 0.54; 0.49 0.13]
Real and Kronecker are very close
[Figures: Degree distribution (probability vs. node degree); Path lengths (# reachable pairs vs. number of hops); “Network” values (network value vs. rank)]
[w/ Faloutsos, ICML ’07]

Fitting Epinions we obtained G_1 = [0.91 0.54; 0.49 0.13]
What does this tell about the network structure?
  ▪ Core: 0.9 edges
  ▪ Periphery: 0.1 edges
  ▪ Between core and periphery: 0.5 edges
Core-periphery (jellyfish, octopus)
[w/ Dasgupta-Lang-Mahoney, WWW ’08]

Small and large networks are fundamentally different
  ▪ Collaboration network (N=4,158, E=13,422)
  ▪ Scientific collaborations (N=397, E=914)
Fitted initiators G_1 shown on the slide: [0.91 0.54; 0.49 0.13] and a second matrix with entries 0.9, 0.1, 0.9

Patterns & Observations
  Network Evolution:
    ▪ A1: Densification, shrinking diameters
    ▪ A2: Local edge attachment
  Diffusion & Cascades:
    ▪ Q4: What do cascades look like?
    ▪ Q5: How likely are people to get influenced?
Models & Algorithms
  Network Evolution:
    ▪ A3: Kronecker graphs
  Diffusion & Cascades:
    ▪ Q6: How do we find influential nodes and quickly detect epidemics?

Behavior that cascades from node to node like an epidemic
  ▪ News, opinions, rumors
  ▪ Word-of-mouth in marketing
  ▪ Infectious diseases
As activations spread through the network they leave a trace: a cascade
[Figure: Network and the resulting Cascade (propagation graph)]

People send and receive product recommendations, purchase products
Data: Large online retailer: 4 million people, 16 million recommendations, 500k products
[Figure: recommendation incentive, 10% credit / 10% off]
[w/ Adamic-Huberman, EC ’06]

Bloggers write posts and refer (link) to other posts and the information propagates
Data: 10.5 million posts, 16 million links
[w/ Glance-Hurst et al., SDM ’07]

Are they stars? Chains? Trees?
[Figure: most common propagation shapes (ordered by frequency), for Information cascades (blogosphere) and Viral marketing (DVD recommendations)]
Viral marketing cascades are more social:
  ▪ Collisions (no summarizers)
  ▪ Richer non-tree structures
[w/ Kleinberg-Singh, PAKDD ’06]

Prob. of adoption depends on the number of friends who have adopted [Bass ’69, Schelling ’78]
What is the shape?
  ▪ Diminishing returns?
  ▪ Critical mass?
Distinction has consequences for models and algorithms
To find the answer we need lots of data
[Figure: two candidate curves, Prob. of adoption vs. k = number of friends adopting]

Probability of purchasing vs. # recommendations received: DVD recommendations (8.2 million observations)
Adoption curve follows the diminishing returns. Can we exploit this?
Later similar findings were made for group membership [Backstrom-Huttenlocher-Kleinberg ’06], and probability of communication [Kossinets-Watts ’06]
[w/ Adamic-Huberman, EC ’06]
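One way to build such an adoption curve from raw observations, and to test whether it shows diminishing returns, is sketched below. It assumes each observation is a pair (number of recommendations received, whether the person purchased); the function names and the toy data are hypothetical and are not the retailer data.

```python
from collections import defaultdict

def adoption_curve(observations):
    """Map k -> empirical P(adopt | k exposures) from (k, adopted) pairs."""
    counts = defaultdict(lambda: [0, 0])  # k -> [adoptions, trials]
    for k, adopted in observations:
        counts[k][0] += int(adopted)
        counts[k][1] += 1
    return {k: hits / trials for k, (hits, trials) in sorted(counts.items())}

def has_diminishing_returns(curve):
    """True if each extra exposure adds no more than the previous one did."""
    probs = [curve[k] for k in sorted(curve)]
    gains = [later - earlier for earlier, later in zip(probs, probs[1:])]
    return all(g2 <= g1 + 1e-12 for g1, g2 in zip(gains, gains[1:]))

# Hypothetical data: 10 people per k, with 1, 3 and 4 adopters for k = 1, 2, 3.
observations = []
for k, adopters in [(1, 1), (2, 3), (3, 4)]:
    observations += [(k, i < adopters) for i in range(10)]
curve = adoption_curve(observations)
concave = has_diminishing_returns(curve)
```

A critical-mass curve would fail the check because its gains grow with k before they level off.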

Blogs: information epidemics
  ▪ Which are the influential/infectious blogs?
Viral marketing
  ▪ Who are the trendsetters?
  ▪ Influential people?
Disease spreading
  ▪ Where to place monitoring stations to detect epidemics?

How to quickly detect epidemics as they spread?
[Figure: cascades c_1, c_2, c_3 spreading over a network]
[w/ Krause-Guestrin et al., KDD ’07]

Cost:
  ▪ Cost of monitoring is node dependent
Reward:
  ▪ Minimize the number of affected nodes: if A are the monitored nodes, let R(A) denote the number of nodes we save
We also consider other rewards:
  ▪ Minimize time to detection
  ▪ Maximize number of detected outbreaks
[Figure: monitored nodes A and the saved region R(A)]
[w/ Krause-Guestrin et al., KDD ’07]

Given:
  ▪ Graph G(V,E), budget M
  ▪ Data on how cascades $C_1, \ldots, C_i, \ldots, C_K$ spread over time
Select a set of nodes A maximizing the reward R(A), built from the reward for detecting each cascade i, subject to cost(A) ≤ M
Solving the problem exactly is NP-hard
  ▪ Max-cover [Khuller et al. ’99]

Theorem: Reward function R is submodular (diminishing returns, think of it as “concavity”):
$R(A \cup \{u\}) - R(A) \ge R(B \cup \{u\}) - R(B)$ for $A \subseteq B$
  ▪ Left side: gain of adding a node to a small set
  ▪ Right side: gain of adding a node to a large set
Example with new monitored node S’:
  ▪ Placement A = {S_1, S_2}: adding S’ helps a lot
  ▪ Placement B = {S_1, S_2, S_3, S_4}: adding S’ helps very little
[w/ Krause-Guestrin et al., KDD ’07]
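The diminishing-returns inequality can be checked on a toy instance. The sketch below assumes the simplest of the rewards, the number of detected cascades, and a hypothetical set of five cascades, each given as the set of nodes it reaches; the node names mirror the placements A and B from the slide.

```python
def reward(placement, cascades):
    """Number of cascades that reach at least one monitored node."""
    return sum(1 for cascade in cascades if placement & cascade)

# Hypothetical cascades, each listed as the set of nodes it reaches.
cascades = [{"s1"}, {"s2", "s'"}, {"s3", "s'"}, {"s4", "s'"}, {"s'"}]

small = {"s1", "s2"}              # placement A
large = {"s1", "s2", "s3", "s4"}  # placement B, a superset of A
new_node = {"s'"}

gain_small = reward(small | new_node, cascades) - reward(small, cascades)
gain_large = reward(large | new_node, cascades) - reward(large, cascades)
```

Here adding S’ to the small placement detects three more cascades, while adding it to the large placement detects only one more.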

We develop the CELF algorithm:
Two independent runs of a modified greedy:
  ▪ Solution set A’: ignore cost, greedily optimize reward
  ▪ Solution set A’’: greedily optimize reward/cost ratio
Pick the best of the two: arg max(R(A’), R(A’’))
Theorem: If R is submodular then CELF is near optimal: CELF achieves a ½(1-1/e) factor approximation
[Figure: marginal rewards of nodes a, b, c, d, e as the current solution grows from {} to {a} to {a, c}]
[w/ Krause-Guestrin et al., KDD ’07]
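A compact sketch of the two-run scheme, under stated assumptions: the reward is any submodular set function passed in as a callable, and the "modified greedy" is taken to be a lazy greedy that re-evaluates a cached marginal gain only when the node reaches the top of a priority queue. Function names, costs, and the toy cascades are hypothetical; this is not the authors' implementation.

```python
import heapq

def lazy_greedy(nodes, cost, reward, budget, use_ratio):
    """Greedy selection under a budget with lazily refreshed marginal gains."""
    selected, spent, current = set(), 0.0, reward(set())
    # heapq is a min-heap, so keys are negated. Each entry stores the round
    # in which its (possibly stale) gain was computed.
    heap = []
    for node in nodes:
        gain = reward({node}) - current
        heap.append((-(gain / cost[node] if use_ratio else gain), node, 0))
    heapq.heapify(heap)
    round_id = 0
    while heap:
        neg_key, node, stamp = heapq.heappop(heap)
        if spent + cost[node] > budget:
            continue  # spending only grows, so this node stays unaffordable
        if stamp == round_id:
            if neg_key >= 0:
                break  # best remaining gain is zero
            # Gain is fresh; by submodularity no stale entry can beat it.
            selected.add(node)
            spent += cost[node]
            current = reward(selected)
            round_id += 1
        else:
            gain = reward(selected | {node}) - current
            key = gain / cost[node] if use_ratio else gain
            heapq.heappush(heap, (-key, node, round_id))
    return selected

def celf(nodes, cost, reward, budget):
    """Run both greedy variants and keep the placement with higher reward."""
    by_reward = lazy_greedy(nodes, cost, reward, budget, use_ratio=False)
    by_ratio = lazy_greedy(nodes, cost, reward, budget, use_ratio=True)
    return max(by_reward, by_ratio, key=reward)

# Hypothetical instance: reward = number of cascades touching the placement.
cascades = [{"a", "b"}, {"a", "c"}, {"a"}, {"d"}, {"d", "e"}, {"b"}]
cost = {"a": 3, "b": 1, "c": 1, "d": 1, "e": 1}

def detected(placement):
    return sum(1 for cascade in cascades if placement & cascade)

best = celf(list(cost), cost, detected, budget=3)
```

In this instance the cost-blind run spends the whole budget on the expensive node a, while the ratio run picks three cheap nodes and detects more cascades, which illustrates why the slide keeps the better of the two.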

Question: Which blogs should one read to be most up to date?
Idea: Select blogs to cover the blogosphere.
[Figure: each dot is a blog; proximity is based on the number of common cascades]

Which blogs should one read to catch big stories?
[Figure: Reward (higher is better) vs. Number of selected blogs (sensors), comparing CELF, In-links, Out-links, # posts, Random]
(used by Technorati)
For more info see our website: www.blogcascade.org
[w/ Krause-Guestrin et al., KDD ’07]

Given:
  ▪ a real city water distribution network
  ▪ data on how contaminants spread over time
Place sensors (to save lives)
Problem posed by the US Environmental Protection Agency
[Figure: sensors S placed in a water network, with contamination cascades c_1, c_2]
[w/ Krause et al., J. of Water Resource Planning]

Our approach performed best at the Battle of Water Sensor Networks competition

Author             Score
CMU (CELF)         26
Sandia             21
U Exeter           20
Bentley systems    19
Technion (1)       14
Bordeaux           12
U Cyprus           11
U Guelph           7
U Michigan         4
Michigan Tech U    3
Malcolm            2
Proteo             2
Technion (2)       1

[Figure: Population saved (higher is better) vs. Number of placed sensors, comparing CELF, Population, Random, Flow, Degree]
[w/ Ostfeld et al., J. of Water Resource Planning]

Why are networks the way they are?
Only recently have basic properties been observed on a large scale
  ▪ Confirms some social science intuitions; calls others into question
Benefits of working with large data
  ▪ What patterns do we observe in massive networks?
  ▪ What microscopic mechanisms cause them?
Social network of the whole world?

Small-world experiment [Milgram ’67]
  ▪ People send letters from Nebraska to Boston
  ▪ How many steps does it take?
Messenger social network: largest network analyzed
  ▪ 240M people, 30B conversations, 4.5TB data
[Figure: Number of steps between pairs of people (i.e., hops + 1), Milgram’s small world experiment vs. MSN Messenger network]
[w/ Horvitz, WWW ’08, Nature ’08]
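Counting steps between pairs of people amounts to a breadth-first search. The sketch below assumes an undirected graph stored as an adjacency dictionary and computes the hop distribution from one source node; the graph and names are hypothetical, and at Messenger scale one would presumably sample sources rather than search from every node.

```python
from collections import Counter, deque

def hop_distribution(adjacency, source):
    """Count how many nodes sit at each BFS distance (in hops) from source."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return Counter(d for node, d in distance.items() if node != source)

# Hypothetical graph: a path 0-1-2-3 plus an extra edge 0-4.
adjacency = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2], 4: [0]}
hops = hop_distribution(adjacency, source=0)
```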

Predictive modeling of information diffusion
  ▪ When, where and what information will create a cascade?
  ▪ Where should one tap the network to get the effect they want?
Social Media Marketing
  ▪ How do news and information spread
  ▪ New ranking and influence measures
  ▪ Sentiment analysis from cascade structure
  ▪ How to introduce incentives?

Observations: Data analysis
Models: Predictions
Algorithms: Applications
Actively influencing the network
