Jure Leskovec Machine Learning Department Carnegie Mellon University.

Slides:



Advertisements
Similar presentations
Cost-effective Outbreak Detection in Networks Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, Natalie Glance.
Advertisements

1 Dynamics of Real-world Networks Jure Leskovec Machine Learning Department Carnegie Mellon University
Cost-effective Outbreak Detection in Networks Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, Natalie Glance.
Cost-effective Outbreak Detection in Networks Jure Leskovec Machine Learning Department, CMU Joint work with Andreas Krause, Carlos Guestrin, Christos.
Analysis and Modeling of Social Networks Foudalis Ilias.
Online Distributed Sensor Selection Daniel Golovin, Matthew Faulkner, Andreas Krause theory and practice collide 1.
1 Connectivity Structure of Bipartite Graphs via the KNC-Plot Erik Vee joint work with Ravi Kumar, Andrew Tomkins.
Lecture 21 Network evolution Slides are modified from Jurij Leskovec, Jon Kleinberg and Christos Faloutsos.
Kronecker Graphs: An Approach to Modeling Networks Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, Zoubin Ghahramani Presented.
Jure Leskovec, CMU Kevin Lang, Anirban Dasgupta, Michael Mahoney Yahoo! Research.
Advanced Topics in Data Mining Special focus: Social Networks.
CS 599: Social Media Analysis University of Southern California1 The Basics of Network Analysis Kristina Lerman University of Southern California.
CS728 Lecture 5 Generative Graph Models and the Web.
Modeling Real Graphs using Kronecker Multiplication
Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts.
1 Complex systems Made of many non-identical elements connected by diverse interactions. NETWORK New York Times Slides: thanks to A-L Barabasi.
CS 345A Data Mining Lecture 1 Introduction to Web Mining.
Web as Graph – Empirical Studies The Structure and Dynamics of Networks.
INFERRING NETWORKS OF DIFFUSION AND INFLUENCE Presented by Alicia Frame Paper by Manuel Gomez-Rodriguez, Jure Leskovec, and Andreas Kraus.
1 Efficient planning of informative paths for multiple robots Amarjeet Singh *, Andreas Krause +, Carlos Guestrin +, William J. Kaiser *, Maxim Batalin.
Web Projections Learning from Contextual Subgraphs of the Web Jure Leskovec, CMU Susan Dumais, MSR Eric Horvitz, MSR.
Simpath: An Efficient Algorithm for Influence Maximization under Linear Threshold Model Amit Goyal Wei Lu Laks V. S. Lakshmanan University of British Columbia.
Leveraging Big Data: Lecture 11 Instructors: Edith Cohen Amos Fiat Haim Kaplan Tova Milo.
Peer-to-Peer and Social Networks Random Graphs. Random graphs E RDÖS -R ENYI MODEL One of several models … Presents a theory of how social webs are formed.
Topic 13 Network Models Credits: C. Faloutsos and J. Leskovec Tutorial
Log Dimension Hypothesis1 The Logarithmic Dimension Hypothesis Anthony Bonato Ryerson University MITACS International Problem Solving Workshop July 2012.
Jure Leskovec Computer Science Department Cornell University / Stanford University Joint work with: Eric Horvitz, Michael Mahoney,
Dynamics of networks Jure Leskovec Computer Science Department
Jure Leskovec PhD: Machine Learning Department, CMU Now: Computer Science Department, Stanford University.
1 1 Stanford University 2 MPI for Biological Cybernetics 3 California Institute of Technology Inferring Networks of Diffusion and Influence Manuel Gomez.
1 1 Stanford University 2 MPI for Biological Cybernetics 3 California Institute of Technology Inferring Networks of Diffusion and Influence Manuel Gomez.
Influence Maximization in Dynamic Social Networks Honglei Zhuang, Yihan Sun, Jie Tang, Jialin Zhang, Xiaoming Sun.
COM1721: Freshman Honors Seminar A Random Walk Through Computing Lecture 2: Structure of the Web October 1, 2002.
Jure Leskovec Computer Science Department Cornell University / Stanford University Joint work with: Jon Kleinberg (Cornell), Christos.
December 7-10, 2013, Dallas, Texas
Maximizing the Spread of Influence through a Social Network David Kempe, Jon Kleinberg, Eva Tardos Cornell University KDD 2003.
Maximizing the Spread of Influence through a Social Network Authors: David Kempe, Jon Kleinberg, É va Tardos KDD 2003.
Jure Leskovec Computer Science Department Cornell University / Stanford University.
Jure Leskovec Machine Learning Department Carnegie Mellon University.
On-line Social Networks - Anthony Bonato 1 Dynamic Models of On-Line Social Networks Anthony Bonato Ryerson University WAW’2009 February 13, 2009 nt.
Most of contents are provided by the website Network Models TJTSD66: Advanced Topics in Social Media (Social.
RTM: Laws and a Recursive Generator for Weighted Time-Evolving Graphs Leman Akoglu, Mary McGlohon, Christos Faloutsos Carnegie Mellon University School.
Cost-effective Outbreak Detection in Networks Presented by Amlan Pradhan, Yining Zhou, Yingfei Xiang, Abhinav Rungta -Group 1.
1 1 Stanford University 2 MPI for Biological Cybernetics 3 California Institute of Technology Inferring Networks of Diffusion and Influence Manuel Gomez.
Netlogo demo. Complexity and Networks Melanie Mitchell Portland State University and Santa Fe Institute.
Topics In Social Computing (67810) Module 1 Introduction & The Structure of Social Networks.
GRAPH AND LINK MINING 1. Graphs - Basics 2 Undirected Graphs Undirected Graph: The edges are undirected pairs – they can be traversed in any direction.
Dynamics of Real-world Networks
Inferring Networks of Diffusion and Influence
Modeling networks using Kronecker multiplication
Who are the most influential bloggers?
Introduction to Web Mining
How Do “Real” Networks Look?
Social and Information Network Analysis: Review of Key Concepts
Generative Model To Construct Blog and Post Networks In Blogosphere
How Do “Real” Networks Look?
How Do “Real” Networks Look?
Lecture 13 Network evolution
The likelihood of linking to a popular website is higher
Dynamics of Real-world Networks
Graph and Tensor Mining for fun and profit
Cost-effective Outbreak Detection in Networks
Lecture 21 Network evolution
CS 345A Data Mining Lecture 1
Modelling and Searching Networks Lecture 2 – Complex Networks
CS 345A Data Mining Lecture 1
Introduction to Web Mining
Navigation and Propagation in Networks
CS 345A Data Mining Lecture 1
Advanced Topics in Data Mining Special focus: Social Networks
Presentation transcript:

Jure Leskovec Machine Learning Department Carnegie Mellon University

 Today: Large on-line systems have detailed records of human activity  On-line communities: ▪ Facebook (64 million users, billion dollar business) ▪ MySpace (300 million users)  Communication: ▪ Instant Messenger (~1 billion users)  News and Social media: ▪ Blogging (250 million blogs world-wide, presidential candidates run blogs)  On-line worlds: ▪ World of Warcraft (internal economy 1 billion USD) ▪ Second Life (GDP of 700 million USD in ‘07) Opportunities for impact in science and industry 2

b) Internet (AS) c) Social networks a) World wide web d) Communicatione) Citationsf) Protein interactions 3

 We know lots about the network structure:  Properties: Scale free [Barabasi ’99], 6-degrees of separation [Milgram ’67], Navigation [Adamic-Adar ’03, LibenNowell ’05], Bipartite cores [Kumar et al. ’99], Network motifs [Milo et al. ‘02], Communities [Newman ‘99], Conductance [Mihail-Papadimitriou-Saberi ‘06], Hubs and authorities [Page et al. ’98, Kleinberg ‘99]  Models: Preferential attachment [Barabasi ’99], Small- world [Watts-Strogatz ‘98], Copying model [Kleinberg el al. ’99], Heuristically optimized tradeoffs [Fabrikant et al. ‘02], Congestion [Mihail et al. ‘03], Searchability [Kleinberg ‘00], Bowtie [Broder et al. ‘00], Transit-stub [Zegura ‘97], Jellyfish [Tauro et al. ‘01] We know much less about processes and dynamics of networks 4

 Network Dynamics:  Network evolution ▪ How network structure changes as the network grows and evolves?  Diffusion and cascading behavior ▪ How do rumors and diseases spread over networks? 5

 We need massive network data for the patterns to emerge:  MSN Messenger network [WWW ’08, Nature ‘08] (the largest social network ever analyzed) ▪ 240M people, 255B messages, 4.5 TB data  Product recommendations [EC ‘06] ▪ 4M people, 16M recommendations  Blogosphere [work in progress] ▪ 60M posts, 120M links 6

Diffusion & Cascades Network Evolution Patterns & Observations Q1: What do cascades look like? Trees? Stars? Chains? Q4: How does network structure change as the network grows/evolves? Models & Algorithms Q2: How do we find influential nodes? Q3: How do we quickly detect epidemics? Q5: How do we generate realistic looking and evolving networks? 7

Diffusion & Cascades Network Evolution Patterns & Observations Q1: What do cascades look like? Trees? Stars? Chains? Q4: How does network structure change as the network grows/evolves? Models & Algorithms Q2: How do we find influential nodes? Q3: How do we quickly detect epidemics? Q5: How do we generate realistic looking and evolving networks? 8

 Behavior that cascades from node to node like an epidemic  News, opinions, rumors  Word-of-mouth in marketing  Infectious diseases  As activations spread through the network they leave a trace – a cascade Cascade (propagation graph) Network 9

 People send and receive product recommendations, purchase products  Data: Large online retailer: 4 million people, 16 million recommendations, 500k products 10% credit10% off 10 [w/ Adamic-Huberman, EC ’06]

 Bloggers write posts and refer (link) to other posts and the information propagates  Data: 10.5 million posts, 16 million links 11 [w/ Glance-Hurst et al., SDM ’07]

 Viral marketing cascades are more social:  Collisions (no summarizers)  Richer non-tree structures  Are they stars? Chains? Trees?  Information cascades (blogosphere):  Viral marketing (DVD recommendations): (ordered by frequency) 12 propagation [w/ Kleinberg-Singh, PAKDD ’06]

Diffusion & Cascades Network Evolution Patterns & Observations A1: Cascade shapes. Viral marketing cascades more social. Q4: How does network structure change as the network grows/evolves? Models & Algorithms Q2: How do we find influential nodes? Q3: How do we quickly detect epidemics? Q5:How do we generate realistic looking and evolving networks? 13

 Blogs – information epidemics  Which are the influential/infectious blogs?  Viral marketing  Who are the trendsetters?  Influential people?  Disease spreading  Where to place monitoring stations to detect epidemics? 14

How to quickly detect epidemics as they spread? 15 c1c1 c2c2 c3c3 [w/ Krause-Guestrin et al., KDD ’07] (best student paper)

 Cost:  Cost of monitoring is node dependent  Reward:  Minimize the number of affected nodes: ▪ If A are the monitored nodes, let R(A) denote the number of nodes we save We also consider other rewards: ▪ Minimize time to detection ▪ Maximize number of detected outbreaks R(A) A 16 ( ) [w/ Krause-Guestrin et al., KDD ’07] (best student paper)

Reward for detecting cascade i  Given:  Graph G(V,E), budget M  Data on how cascades C 1, …, C i, …,C K spread over time  Select a set of nodes A maximizing the reward subject to cost(A) ≤ M  Solving the problem exactly is NP-hard  Max-cover [Khuller et al. ’99] 17

 Problem structure: Submodularity of the reward functions  CELF: algorithm with approximate guarantee  Speed up: Lazy evaluation to speed-up CELF We extend the result of [Kempe-Kleinberg-Tardos ’03] 18 [w/ Krause-Guestrin et al., KDD ’07] (best student paper)

 Theorem: Reward function R is submodular (diminishing returns, think of it as “convexity”) 19 (best student paper) [w/ Krause-Guestrin et al., KDD ’07] Gain of adding a node to a small setGain of adding a node to a large set R(A  {u}) – R(A) ≥ R(B  {u}) – R(B) A  B S1S1 S2S2 Placement A={S 1, S 2 } S’ New monitored node: Adding S’ helps a lot S2S2 S4S4 S1S1 S3S3 Placement B={S 1, S 2, S 3, S 4 } S’ Adding S’ helps very little

 We develop CELF (cost-effective lazy forward-selection) algorithm:  Two independent runs of a modified greedy ▪ Solution set A’: ignore cost, greedily optimize reward ▪ Solution set A’’: greedily optimize reward/cost ratio  Pick best of the two: arg max(R(A’), R(A’’))  Theorem: If R is submodular then CELF is near optimal  CELF achieves ½(1-1/e) factor approximation For the size of our problems naïve CELF is too slow 20 (best student paper) [w/ Krause-Guestrin et al., KDD ’07]

 Observation: Submodularity guarantees that marginal rewards decrease with the solution size  Idea:  Use marginal reward from previous step as a upper bound on current marginal reward d Marginal reward a b c d e 21

 CELF algorithm:  Keep an ordered list of marginal rewards r i from previous step  Re-evaluate r i only for the top node a b c a b c d d e e Marginal reward 22

 CELF algorithm:  Keep an ordered list of marginal rewards r i from previous step  Re-evaluate r i only for the top node a a b c d dbce e Marginal reward 23

 CELF algorithm:  Keep an ordered list of marginal rewards r i from previous step  Re-evaluate r i only for the top node a c a b c d d b Marginal reward e e 24

For more info see our website:  Which blogs should one read to catch big stories? CELF In-links Random # posts Out-links Number of selected blogs (sensors) Reward (higher is better) 25 (used by Technorati)

CELF Greedy Exhaustive search CELF runs 700x faster than simple greedy algorithm Number of selected blogs (sensors) Run time (seconds) (lower is better) 26

 Given:  a real city water distribution network  data on how contaminants spread over time  Place sensors (to save lives)  Problem posed by the US Environmental Protection Agency SS 27 c1c1 c2c2 [w/ Krause et al., J. of Water Resource Planning]

 Our approach performed best at the Battle of Water Sensor Networks competition CELF Population Random Flow Degree AuthorScore CMU (CELF) 26 Sandia21 U Exter20 Bentley systems19 Technion (1)14 Bordeaux12 U Cyprus11 U Guelph7 U Michigan4 Michigan Tech U3 Malcolm2 Proteo2 Technion (2)1 Number of placed sensors Population saved (higher is better) 28 [w/ Ostfeld et al., J. of Water Resource Planning]

Diffusion & Cascades Network Evolution Patterns & Observations A1: Cascade shapes. Viral marketing cascades more social. Q4: How does network structure change as the network grows/evolves? Models & Algorithms A2, A3: CELF algorithm for detecting cascades and outbreaks Q5: How do we generate realistic looking and evolving networks? 29

 Empirical findings on real graphs led to new network models  Such models make assumptions/predictions about other network properties  What about network evolution? 30 log degree log prob. Model Power-law degree distributionPreferential attachment Explains

 Networks are denser over time  Densification Power Law: a … densification exponent (1 ≤ a ≤ 2)  What is the relation between the number of nodes and the edges over time?  Prior work assumes: constant average degree over time Internet Citations a=1.2 a=1.6 N(t) E(t) N(t) E(t) 31 [w/ Kleinberg-Faloutsos, KDD ’05] (best research paper)

 Prior models and intuition say that the network diameter slowly grows (like log N, log log N) time diameter size of the graph Internet Citations  Diameter shrinks over time  as the network grows the distances between the nodes slowly decrease 32 [w/ Kleinberg-Faloutsos, KDD ’05] (best research paper)

 Want to generate realistic networks:  Why synthetic graphs?  Anomaly detection, Simulations, Predictions, Null-model, Sharing privacy sensitive graphs, …  Q: What is a good model that we can fit to the data?  A: Next slide.  Q: Which network properties do we care about?  A: Don’t commit, let’s match adjacency matrices 33 Compare graphs properties, e.g., degree distribution Given a real network Generate a synthetic network

 We prove Kronecker graphs mimic real graphs:  Power-law degree distribution, Densification, Shrinking/stabilizing diameter, Spectral properties Initiator (9x9) (3x3) (27x27) Kronecker product of graph adjacency matrices 34 [w/ Chakrabarti-Kleinberg-Faloutsos, PKDD ’05] p ij Edge probability

 Maximum likelihood estimation  Naïve estimation takes O(N!N 2 ):  N! for different node labelings: ▪ Our solution: Metropolis sampling: N!  (big) const  N 2 for traversing graph adjacency matrix ▪ Our solution: Kronecker product (E << N 2 ): N 2  E  Do stochastic gradient descent ab cd P(|) Kronecker arg max We estimate the model in O(E) 35 [w/ Faloutsos, ICML ’07]

 We search the space of ~10 1,000,000 permutations  Fitting takes 2 hours  Real and Kronecker are very close 36 Degree distribution node degree probability Path lengths number of hops # reachable pairs “Network” values rank network value [w/ Faloutsos, ICML ’07]

Diffusion & Cascades Network Evolution Patterns & Observations A1: Cascade shapes. Viral marketing cascades more social. A4: Densification, Shrinking diameters Models & Algorithms algorithm for detecting cascades and outbreaks A5: Kronecker graphs, Estimating Kronecker initiator Other topics Conclusion and Future work A2, A3: CELF 37

 P1) Query Projections:  Predicting search result quality without page content  P2) MSN Instant Messenger:  Social network of the whole world 38

 User types in a query to a search engine  Search engine returns results: Is this a good set of search results? Result returned by the search engine Hyperlinks Non-search results connecting the graph 39 [w/ Dumais-Horvitz, WWW ’07]

Q QueryResults Projection on the web graph Query projection graph Generate graphical features Query connection graph Construct case library Predictions

 The idea  Find features that describe the structure of the graph  Then use the features for machine learning  Want features that describe  Connectivity of the graph  Centrality of projection and connector nodes  Clustering and density of the core of the graph vs.

 Dataset:  Graph: 40 million nodes, 720 million edges  30,000 queries with top 20 results for each ▪ Human assigned relevance. 6-point scale: Perfect to Bad  Task:  Predict the highest rating in the set of results ▪ 2-class problem: “Good” (top 3 ratings) vs. “Poor” (bottom 3)  Result:  Predict search result quality with 80% accuracy just from the connection patterns between the results

43 Predict “Good” Predict “Poor” Good? Poor? [w/ Dumais-Horvitz, WWW ’07]

Good result sets have:  Search result nodes are hub nodes in the graph  Small connector node degrees  Big connected component  Few isolated nodes in projection graph  Few connector nodes AttributesAccuracy Marginals0.55 RankNet0.63 Query projection0.82

Milgram’s small world experiment (i.e., hops + 1) 45 Lots of engineering effort and good hardware:  4 dual-core Opteron server  48GB memory  6.8TB of fast SCSI disks How to learn short paths?  Small-world experiment [Milgram ‘67]  People send letters from Nebraska to Boston  How many steps does it take?  Messenger social network – largest network analyzed  240M people, 30B conversations, 4.5TB data [w/ Horvitz, WWW ’08, Nature ‘08] MSN Messenger network Number of steps between pairs of people

Diffusion & Cascades Network Evolution Patterns & Observations A1: Cascade shapes. Viral marketing cascades more social. A4: Densification, Shrinking diameters Models & Algorithms A2, A3: CELF algorithm for detecting cascades and outbreaks A5: Kronecker graphs, Estimating Kronecker initiator Big data matters 46

 How do news and information spread  New ranking and influence measures for blogs  Sentiment analysis from cascade structure  Crawling 2 million blogs/day (with Spinn3r) Obscure technology story Small tech blog Wired Slashdot New Scientist New York Times CNN BBC 47

 Models of information diffusion  When, where and what post will create a cascade?  Where should one tap the network to get the effect they want?  Social Media Marketing  How to handle richer classes of graphs  Interaction of network structure and node/edge attributes 48

 Why are networks the way they are?  Continue work on fundamental properties of networks  Build models and understanding  Network community structure  Health of a social network  Steer the network evolution  Adoption of social networking services (with Facebook)  Predictive modeling of large communities  Online multi-player games are closed worlds with detailed traces of human activity  Predict success of guilds, battles, elections (with EVE online) 49

 Map-reduce for massive graphs  Algorithms and techniques for massive graphs  With Yahoo! Research: 4000 node Hadoop cluster ▪ world’s 50 th fastest supercomputer ▪ 3 TB memory ▪ 1.5 PB storage  We already have new ways to peek into the structure of massive networks 50

Computer systems Theory and algorithms Machine learning / Data mining 51 (complex) networks

Computer systems Theory and algorithms Machine learning / Data mining 52 Social Sciences Biology Physics Applications & Industry Statistics (complex) networks

Christos Faloutsos Jon Kleinberg Andreas Krause Carlos Guestrin Eric Horvitz Susan Dumais Andrew Tomkins Ravi Kumar Lada Adamic Bernardo Huberman Kevin Lang Michael Manohey Zoubin Gharamani Anirban Dasgupta Tom Mitchell John Lafferty Larry Wasserman Avrim Blum Samuel Madden Michalis Faloutsos Ajit Singh John Shawe-Taylor Lars Backstrom Marko Grobelnik Natasa Milic-Frayling Dunja Mladenic Jeff Hammerbacher Matt Hurst Deepay Chakrabarti Natalie Glance Jeanne VanBriesen Microsoft Research Yahoo Research Facebook Eve online Spinn3r LinkedIn Nielsen BuzzMetrics Microsoft Live Labs Google 53

Diffusion & Cascades Network Evolution Patterns & Observations A1: Cascade shapes. Viral marketing cascades more social. A4: Densification, Shrinking diameters Models & Algorithms A2, A3: CELF algorithm for detecting cascades and outbreaks A5: Kronecker graphs, Estimating Kronecker initiator Big data matters 54