Graphs (Part I) Shannon Quinn (with thanks to William Cohen of CMU and Jure Leskovec, Anand Rajaraman, and Jeff Ullman of Stanford University)


2 Why I’m talking about graphs Lots of large data is graphs – Facebook, Twitter, citation data, and other social networks – The web, the blogosphere, the semantic web, Freebase, Wikipedia, Twitter, and other information networks – Text corpora (like RCV1), large datasets with discrete feature values, and other bipartite networks: nodes = documents or words; links connect document → word or word → document – Computer networks, biological networks (proteins, ecosystems, brains, …), … – Heterogeneous networks with multiple types of nodes: people, groups, documents

3 Properties of Graphs Descriptive Statistics Number of connected components Diameter Degree distribution … Models of Growth/Formation Erdos-Renyi Preferential attachment Stochastic block models …. Let’s look at some examples of graphs … but first, why are these statistics important?

4 An important question How do you explore a dataset? – compute statistics (e.g., feature histograms, conditional feature histograms, correlation coefficients, …) – sample and inspect run a bunch of small-scale experiments How do you explore a graph? – compute statistics (degree distribution, …) – sample and inspect how do you sample?

5 Protein-Protein Interactions Can we identify functional modules? Nodes: Proteins Edges: Physical interactions J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets

6 Protein-Protein Interactions Functional modules Nodes: Proteins Edges: Physical interactions

7 Facebook Network Can we identify social communities? Nodes: Facebook Users Edges: Friendships

8 Facebook Network Social communities: High school, Summer internship, Stanford (Squash), Stanford (Basketball) Nodes: Facebook Users Edges: Friendships

13 Graphs Set V of vertices/nodes v1,…, set E of edges (u,v),…. – Edges might be directed and/or weighted and/or labeled Degree of v is #edges touching v – Indegree and Outdegree for directed graphs Path is a sequence of edges (v0,v1),(v1,v2),…. Geodesic path between u and v is the shortest path connecting them – Diameter is the max over u,v in V of the length of the geodesic between u and v – Effective diameter is the 90th percentile of geodesic lengths – Mean diameter is the mean geodesic length over connected pairs (Connected) component is a subset of nodes that are all pairwise connected via paths Clique is a subset of nodes that are all pairwise connected via edges Triangle is a clique of size three
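The definitions on this slide can be checked on a toy graph. The following is a minimal sketch, assuming an undirected, unweighted graph stored as a dictionary of neighbor lists; the function names and the toy graph are illustrative and not from the slides.

```python
import math
from collections import deque


def bfs_distances(adj, source):
    """Geodesic (shortest-path) length from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def connected_components(adj):
    """Partition the nodes into sets that are pairwise connected via paths."""
    seen, components = set(), []
    for v in adj:
        if v not in seen:
            comp = set(bfs_distances(adj, v))
            seen |= comp
            components.append(comp)
    return components


def path_length_stats(adj):
    """Return (diameter, effective diameter) over connected pairs of nodes."""
    lengths = sorted(
        d for u in adj for v, d in bfs_distances(adj, u).items() if v != u
    )
    diameter = lengths[-1]
    # Smallest length L such that at least 90% of connected pairs are within L.
    effective = lengths[math.ceil(0.9 * len(lengths)) - 1]
    return diameter, effective


# Toy graph: a path 0-1-2-3 plus a separate edge 4-5.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
degree = {v: len(nbrs) for v, nbrs in adj.items()}
components = connected_components(adj)
diameter, effective_diameter = path_length_stats(adj)
```

Running one breadth-first search per node is only practical for small graphs; on large graphs these statistics are usually estimated from a sample of source nodes.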

14 Graphs Some common properties of graphs: – Distribution of node degrees – Distribution of cliques (e.g., triangles) – Distribution of paths: Diameter (max shortest-path), Effective diameter (90th percentile), Connected components – … Some types of graphs to consider: – Real graphs (social & otherwise) – Generated graphs: Erdos-Renyi “Bernoulli” or “Poisson”, Watts-Strogatz “small world” graphs, Barabasi-Albert “preferential attachment” …

15 Erdos-Renyi graphs Take n nodes, and connect each pair with probability p – Mean degree is z=p(n-1)

16 Erdos-Renyi graphs Take n nodes, and connect each pair with probability p – Mean degree is z=p(n-1) – Mean number of neighbors at distance d from v is z^d – How large does d need to be so that z^d >= n? If z>1, d = log(n)/log(z). If z<1, you can’t do it – So: there tend to be either many small components (z<1) or one large one (z>1, the giant connected component) – Another intuition: if there are two large connected components, then with high probability a few random edges will link them up.

17 Erdos-Renyi graphs Take n nodes, and connect each pair with probability p – Mean degree is z=p(n-1) – Mean number of neighbors at distance d from v is z^d – How large does d need to be so that z^d >= n? If z>1, d = log(n)/log(z). If z<1, you can’t do it – So: if z>1, diameters tend to be small (relative to n)
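The claims about mean degree and the giant component can be tried out numerically. This is a rough sketch, assuming n = 1000 and the two mean degrees z = 0.5 and z = 4 (values chosen only for illustration); the helper names are hypothetical.

```python
import math
import random
from itertools import combinations


def erdos_renyi(n, p, rng):
    """Connect each of the n*(n-1)/2 node pairs independently with probability p."""
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def largest_component_size(adj):
    """Size of the largest connected component, found by depth-first search."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best


def summarize(n, z, seed):
    """Build G(n, p) with target mean degree z; return (mean degree, largest component)."""
    adj = erdos_renyi(n, z / (n - 1), random.Random(seed))
    mean_degree = sum(len(nbrs) for nbrs in adj.values()) / n
    return mean_degree, largest_component_size(adj)


n = 1000
sparse = summarize(n, 0.5, seed=0)  # z < 1: only small components expected
dense = summarize(n, 4.0, seed=0)   # z > 1: one giant component expected
predicted_d = math.log(n) / math.log(4.0)  # d with z^d = n, roughly 5 here
```

With z = 4 the largest component should cover most of the 1000 nodes, while with z = 0.5 it should stay small; exact sizes depend on the random seed.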

18 Sociometry, Vol. 32, No. 4. (Dec., 1969), pp. 425-443. 64 of 296 chains succeed, avg chain length is 6.2

21 Erdos-Renyi graphs Take n nodes, and connect each pair with probability p – Mean degree is z=p(n-1) This is usually not a good model of degree distribution in natural networks

22 Degree distribution Plot cumulative degree – X axis is degree – Y axis is #nodes that have degree at least k Typically use a log-log scale – Straight lines are a power law; normal curve dives to zero at some point – Left: trust network in epinions web site from Richardson & Domingos

23 Degree distribution Plot cumulative degree – X axis is degree – Y axis is #nodes that have degree at least k Typically use a log-log scale – Straight lines are a power law; normal curve dives to zero at some point. This defines a “scale” for the network
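The cumulative degree plot described here needs, for each degree k, the number of nodes with degree at least k. A small sketch of that computation, assuming the degrees are already available as a list (the function names are illustrative):

```python
import math
from collections import Counter


def cumulative_degree_counts(degrees):
    """Return sorted (k, number of nodes with degree >= k) pairs."""
    counts = Counter(degrees)
    result, at_least = [], 0
    # Walk from the largest degree down, accumulating the tail count.
    for k in sorted(counts, reverse=True):
        at_least += counts[k]
        result.append((k, at_least))
    return result[::-1]


def log_log_points(pairs):
    """Coordinates for a log-log plot; degree-0 nodes must be excluded first."""
    return [(math.log10(k), math.log10(c)) for k, c in pairs if k > 0]


pairs = cumulative_degree_counts([1, 1, 1, 2, 2, 5])
points = log_log_points(pairs)
```

If the points fall roughly on a straight line, the degree distribution is consistent with a power law; a curve that drops off sharply suggests a characteristic scale.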


25 Graphs Some common properties of graphs: – Distribution of node degrees: often scale-free – Distribution of cliques (e.g., triangles) – Distribution of paths: Diameter (max shortest-path), Effective diameter (90th percentile): often small; Connected components: usually one giant CC – … Some types of graphs to consider: – Real graphs (social & otherwise) – Generated graphs: Erdos-Renyi “Bernoulli” or “Poisson”, Watts-Strogatz “small world” graphs, Barabasi-Albert “preferential attachment” (generates scale-free graphs) …

26 Barabasi-Albert Networks Science 286 (1999) Start from a small number of nodes, then repeatedly add a new node with m links Preferential Attachment: the probability that one of these links connects to an existing node is proportional to that node’s degree ‘Rich gets richer’ This creates ‘hubs’: a few nodes with very large degrees
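The growth process on this slide can be sketched in a few lines. The version below is an illustration under stated assumptions, not the exact procedure from the Science paper: it starts from a clique on m+1 nodes (an arbitrary choice for the seed graph) and forces each new node's m targets to be distinct.

```python
import random
from collections import Counter


def preferential_attachment(n, m, rng):
    """Grow a graph to n nodes; each new node attaches m links to existing
    nodes chosen with probability proportional to their current degree."""
    # Seed graph: a clique on m + 1 nodes (an assumption of this sketch).
    edges = [(u, v) for u in range(m + 1) for v in range(u)]
    # Every node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over nodes.
    endpoints = [x for edge in edges for x in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges


edges = preferential_attachment(n=2000, m=2, rng=random.Random(0))
degree = Counter(x for edge in edges for x in edge)
mean_degree = 2 * len(edges) / len(degree)
max_degree = max(degree.values())
```

The mean degree stays close to 2m, while the maximum degree is typically many times larger: those high-degree nodes are the hubs.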

27 Preferential attachment (Barabasi-Albert) Random graph (Erdos Renyi)


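29 Homophily One definition: excess edges between similar nodes – E.g., assume nodes are male and female and Pr(male)=p, Pr(female)=q. – With random mixing, an edge joins nodes of different gender with probability 2pq. Is Pr(gender(u) ≠ gender(v) | edge (u,v)) >= 2pq? If it is smaller, similar nodes are linked more often than chance, i.e., there is homophily. Another def’n: excess edges between common neighbors of v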
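The first definition can be turned into a simple check: compare the observed fraction of cross-group edges with the 2pq baseline. This is a minimal sketch for exactly two groups, with p and q estimated from the node labels; the function names and toy data are made up for illustration.

```python
def cross_group_edge_fraction(edges, group):
    """Observed estimate of Pr(group(u) != group(v) | edge (u, v))."""
    cross = sum(1 for u, v in edges if group[u] != group[v])
    return cross / len(edges)


def random_mixing_baseline(group):
    """2pq, where p and q are the node fractions of the two groups."""
    labels = list(group.values())
    p = labels.count(labels[0]) / len(labels)
    return 2 * p * (1 - p)


# Toy network: two same-gender edges and one cross-gender edge.
group = {1: "m", 2: "m", 3: "f", 4: "f"}
edges = [(1, 2), (3, 4), (2, 3)]
observed = cross_group_edge_fraction(edges, group)
baseline = random_mixing_baseline(group)
shows_homophily = observed < baseline
```

Here one of three edges crosses groups, against a baseline of one half, so this toy network shows homophily under the first definition.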

30 Homophily Another def’n: excess edges between common neighbors of v

31 Homophily In a random Erdos-Renyi graph: In natural graphs two of your mutual friends might well be friends: Like you they are both in the same class (club, field of CS, …) You introduced them

32 Watts-Strogatz model Start with a ring Connect each node to k nearest neighbors → homophily Add some random shortcuts from one point to another → small diameter Degree distribution not scale-free Generalizes to d dimensions
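Both effects on this slide can be observed on a small example. The sketch below follows the slide's description (a ring lattice plus added shortcuts) rather than the edge-rewiring variant; the parameter values and helper names are assumptions chosen for illustration.

```python
import random
from collections import deque


def ring_with_shortcuts(n, k, n_shortcuts, rng):
    """Ring of n nodes, each linked to its k nearest neighbors (k even),
    plus n_shortcuts extra edges between random non-adjacent node pairs."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for step in range(1, k // 2 + 1):
            w = (v + step) % n
            adj[v].add(w)
            adj[w].add(v)
    added = 0
    while added < n_shortcuts:
        u, v = rng.sample(range(n), 2)
        if v not in adj[u]:
            adj[u].add(v)
            adj[v].add(u)
            added += 1
    return adj


def mean_clustering(adj):
    """Average, over nodes, of the fraction of neighbor pairs that are linked."""
    total = 0.0
    for v in adj:
        nbrs = list(adj[v])
        d = len(nbrs)
        if d < 2:
            continue
        links = sum(
            1 for i in range(d) for j in range(i) if nbrs[j] in adj[nbrs[i]]
        )
        total += links / (d * (d - 1) / 2)
    return total / len(adj)


def mean_path_length(adj):
    """Mean geodesic length over all connected ordered pairs (BFS per node)."""
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


ring = ring_with_shortcuts(200, 4, 0, random.Random(0))
small_world = ring_with_shortcuts(200, 4, 20, random.Random(0))
ring_clustering = mean_clustering(ring)
ring_path = mean_path_length(ring)
small_world_path = mean_path_length(small_world)
```

The plain ring has high clustering (neighbors of a node tend to be neighbors of each other) but long paths; a handful of shortcuts shortens the paths while leaving most of the local structure intact.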

33 An important question How do you explore a dataset? – compute statistics (e.g., feature histograms, conditional feature histograms, correlation coefficients, …) – sample and inspect run a bunch of small-scale experiments How do you explore a graph? – compute statistics (degree distribution, …) – sample and inspect how do you sample?

34 KDD 2006

35 Brief summary Define goals of sampling: – “scale-down” – find G’<G with similar statistics – “back in time”: for a growing G, find G’<G that is similar (statistically) to an earlier version of G Experiment on real graphs with plausible sampling methods, such as – RN – random nodes, sampled uniformly – … See how well they perform

36 Brief summary Experiment on real graphs with plausible sampling methods, such as – RN – random nodes, sampled uniformly – RPN – random nodes, sampled by PageRank – RDN – random nodes, sampled by in-degree – RE – random edges – RJ – run PageRank’s “random surfer” for n steps – RW – run RWR’s “random surfer” for n steps – FF – repeatedly pick r(i) neighbors of i to “burn”, and then recursively sample from them

37 10% sample – pooled on five datasets

38 d-statistic measures agreement between distributions – D = max over x of |F(x) - F’(x)|, where F, F’ are cdf’s – max over nine different statistics
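The D-statistic is the largest gap between two empirical cdf's. A short sketch of that computation for two samples of a single statistic, for example the degree sequences of the full graph and of a sampled graph (the function name is illustrative):

```python
from bisect import bisect_right


def d_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical cdf's."""
    a, b = sorted(sample_a), sorted(sample_b)

    def cdf(sorted_values, x):
        # Fraction of values <= x.
        return bisect_right(sorted_values, x) / len(sorted_values)

    # The gap can only change at observed values, so checking those suffices.
    return max(abs(cdf(a, x) - cdf(b, x)) for x in set(a) | set(b))


same = d_statistic([1, 2, 3, 4], [1, 2, 3, 4])
shifted = d_statistic([1, 2, 3, 4], [3, 4, 5, 6])
disjoint = d_statistic([1, 2], [3, 4])
```

A value of 0 means the two empirical distributions agree exactly; a value of 1 means they do not overlap at all.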

39 Parallel Graph Computation Distributed computation and/or multicore parallelism – Sometimes confusing. We will talk mostly about distributed computation. Are classic graph algorithms parallelizable? What about distributed? – Depth-first search? – Breadth-first search? – Priority-queue based traversals (Dijkstra’s, Prim’s algorithms)

40 MapReduce for Graphs Graph computation is almost always iterative MapReduce ends up shipping the whole graph over the network on each iteration (map->reduce->map->reduce->...) – Mappers and reducers are stateless

41 Iterative Computation is Difficult System is not optimized for iteration [Figure: on each of the Iterations, Data is passed through CPU 1, CPU 2, CPU 3 again; labels: Disk Penalty, Startup Penalty]

42 MapReduce and Partitioning Map-Reduce splits the keys randomly between mappers/reducers But on natural graphs, high-degree vertices (keys) may have a million times more edges than the average → Extremely uneven distribution → Time of iteration = time of slowest job.

43 Curse of the Slow Job [Figure: Data is passed through CPU 1, CPU 2, CPU 3 on each of the Iterations, with a Barrier between consecutive iterations]

44 Map-Reduce is Bulk-Synchronous Parallel Bulk-Synchronous Parallel = BSP (Valiant, 80s) – Each iteration sees only the values of previous iteration. – In linear systems literature: Jacobi iterations Pros: – Simple to program – Maximum parallelism – Simple fault-tolerance Cons: – Slower convergence – Iteration time = time taken by the slowest node

45 Triangle Counting in Twitter Graph 40M Users 1.2B Edges Total: 34.8 Billion Triangles Hadoop results from [Suri & Vassilvitskii '11] 1536 Machines 423 Minutes 64 Machines, 1024 Cores 1.5 Minutes

46 PageRank 40M Webpages, 1.4 Billion Links (100 iterations) 5.5 hrs 1 hr 8 min Hadoop results from [Kang et al. '11] Twister (in-memory MapReduce) [Ekanayake et al. ‘10]

47 Graph algorithms PageRank implementations – in memory – streaming, node list in memory – streaming, no memory – map-reduce A little like Naïve Bayes variants – data in memory – word counts in memory – stream-and-sort – map-reduce

48 Google’s PageRank [Figure: example web sites xxx, yyyy, and pdq with links a b c d e f g] Inlinks are “good” (recommendations). Inlinks from a “good” site are better than inlinks from a “bad” site, but inlinks from sites with many outlinks are not as “good”... “Good” and “bad” are relative.

49 Google’s PageRank [Figure: example web sites xxx, yyyy, and pdq with links a b c d e f g] Imagine a “pagehopper” that always either follows a random link, or jumps to a random page

50 Google’s PageRank (Brin & Page) [Figure: example web sites xxx, yyyy, and pdq with links a b c d e f g] Imagine a “pagehopper” that always either follows a random link, or jumps to a random page. PageRank ranks pages by the amount of time the pagehopper spends on a page: or, if there were many pagehoppers, PageRank is the expected “crowd size”

51 Random Walks avoids messy “dead ends”….

54 Graph = Matrix, Vector = Node → Weight [Figure: a graph on nodes A through J, its 10×10 adjacency matrix M, and a node-weight vector v with entries such as A=3, B=2, C=3]

55 PageRank in Memory Let u = (1/N, …, 1/N) – dimension = #nodes N Let A = adjacency matrix: [a_ij = 1 ⇔ i links to j] Let W = [w_ij = a_ij / outdegree(i)] – w_ij is the probability of a jump from i to j Let v_0 = (1, 1, …, 1) – or anything else you want Repeat until converged: – Let v_{t+1} = cu + (1-c)Wv_t – c is the probability of jumping “anywhere randomly”
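The in-memory iteration can be sketched directly from these definitions. The version below is a minimal illustration, assuming every node has at least one out-link (no dead ends) and starting from the uniform vector rather than all ones. It applies the update so that weight flows along links from i to j, i.e. v_{t+1}[j] = c/N + (1-c) Σ_i w_ij v_t[i], which matches the streaming update used later in the deck. The function name and toy graph are hypothetical.

```python
def pagerank(out_links, c=0.15, tol=1e-12, max_iter=1000):
    """Power iteration for PageRank on a dict: node -> list of out-neighbors.

    Assumes every node has at least one out-link.
    """
    nodes = list(out_links)
    n = len(nodes)
    v = {i: 1.0 / n for i in nodes}
    for _ in range(max_iter):
        new = {j: c / n for j in nodes}  # the c*u term
        for i in nodes:
            # Node i splits (1-c)*v[i] evenly over its out-links: w_ij = 1/outdegree(i).
            share = (1 - c) * v[i] / len(out_links[i])
            for j in out_links[i]:
                new[j] += share
        delta = sum(abs(new[j] - v[j]) for j in nodes)
        v = new
        if delta < tol:
            break
    return v


# Toy graph: a links to b and c, b links to c, c links back to a.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

On this toy graph c collects weight from both a and b and ends up ranked highest, a receives all of c's weight and comes second, and b comes last.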

56 Streaming PageRank Assume we can store v but not W in memory Repeat until converged: – Let v_{t+1} = cu + (1-c)Wv_t Store A as a row matrix: each line is – i j_{i,1}, …, j_{i,d} [the neighbors of i] Store v’ and v in memory: v’ starts out as cu For each line “i j_{i,1}, …, j_{i,d}” – For each j in j_{i,1}, …, j_{i,d}: v’[j] += (1-c)v[i]/d Everything needed for the update is right there in the row….
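One streaming pass can be sketched as follows. This is an illustration only: the adjacency rows are whitespace-separated text lines read from any iterable (an in-memory `io.StringIO` stands in for a file on disk), and every node is assumed to appear as the first field of exactly one row.

```python
import io


def streaming_pass(lines, v, c):
    """One PageRank iteration that reads the graph row by row.

    Only v and v' are held in memory; each row is used once and discarded.
    """
    n = len(v)
    v_new = {node: c / n for node in v}  # v' starts out as c*u
    for line in lines:
        i, *neighbors = line.split()
        d = len(neighbors)
        for j in neighbors:
            v_new[j] += (1 - c) * v[i] / d
    return v_new


rows = "a b c\nb c\nc a\n"  # each line: node i followed by its neighbors
v = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
for _ in range(200):
    # Re-open the "file" on every iteration: the graph is re-scanned each time.
    v = streaming_pass(io.StringIO(rows), v, c=0.15)
```

The cost profile is the point: memory holds two vectors of size N, and each iteration is one sequential scan of the graph.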

57 Streaming PageRank: with some long rows Repeat until converged: – Let v_{t+1} = cu + (1-c)Wv_t Store A as a list of edges: each line is “i d(i) j” Store v’ and v in memory: v’ starts out as cu For each line “i d j”: v’[j] += (1-c)v[i]/d We need to get the degree of i and store it locally

58 Streaming PageRank: preprocessing Original encoding is edges (i,j) Mapper replaces i,j with i,1 Reducer is a SumReducer Result is pairs (i,d(i)) Then: join this back with edges (i,j) For each i,j pair: – send j as a message to node i in the degree table messages always sorted after non-messages – the reducer for the degree table sees i,d(i) first then j1, j2, …. can output the key,value pairs with key=i, value=d(i), j
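The preprocessing and the edge-list update can be sketched together. This is an in-memory stand-in for the map-reduce job described above, not a Hadoop implementation: the `Counter` plays the role of the mapper emitting (i, 1) and the SumReducer, and the list comprehension plays the role of the join. All names are illustrative.

```python
from collections import Counter


def add_degrees(edges):
    """Turn (i, j) edges into (i, d(i), j) records."""
    # "Map" each edge (i, j) to (i, 1) and "reduce" by summing, giving d(i).
    degree = Counter(i for i, _ in edges)
    # Join the degree table back onto the edge list.
    return [(i, degree[i], j) for i, j in edges]


def edge_list_pass(records, v, c):
    """One PageRank iteration over "i d j" records, one edge at a time."""
    n = len(v)
    v_new = {node: c / n for node in v}  # v' starts out as c*u
    for i, d, j in records:
        v_new[j] += (1 - c) * v[i] / d
    return v_new


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
records = add_degrees(edges)
v0 = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
v1 = edge_list_pass(records, v0, c=0.15)
```

Because each record carries d(i), the update for an edge needs nothing beyond the record itself and the two in-memory vectors, which is what makes long rows unproblematic.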

59 PageRank in MapReduce

60 More on graph algorithms PageRank is one simple example of a graph algorithm – but an important one – personalized PageRank (aka “random walk with restart”) is an important operation in machine learning/data analysis settings PageRank is typical in some ways – Trivial when the graph fits in memory – Easy when node weights fit in memory – More complex to do with constant memory – A major expense is scanning through the graph many times … same as with SGD/logistic regression: disk-based streaming is much more expensive than memory-based approaches Locality of access is very important! – gains if you can pre-cluster the graph, even approximately – avoid sending messages across the network: keep them local

61 Machine Learning in Graphs - 2010

62 Some ideas Combiners are helpful – Store outgoing incrementVBy messages and aggregate them – This is great for high indegree pages Hadoop’s combiners are suboptimal – Messages get emitted before being combined – Hadoop makes weak guarantees about combiner usage

66 Some ideas Most hyperlinks are within a domain – If we keep domains on the same machine this will mean more messages are local – To do this, build a custom partitioner that knows about the domain of each nodeId and keeps nodes on the same domain together – Assign node id’s so that nodes in the same domain are together – partition node ids by range – Change Hadoop’s Partitioner for this
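The partitioning idea can be illustrated outside Hadoop. The sketch below is in Python rather than Java and is not Hadoop's Partitioner interface; it assumes, hypothetically, that node ids are URLs, and it uses a CRC32 checksum of the host name so that the assignment is stable across processes.

```python
import zlib
from urllib.parse import urlsplit


def domain_partition(url, num_partitions):
    """Assign a node to a partition based only on its host name,
    so that all pages of one domain land on the same machine."""
    host = urlsplit(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % num_partitions


p1 = domain_partition("http://example.org/index.html", 8)
p2 = domain_partition("http://example.org/about/team.html", 8)
```

Both pages of `example.org` map to the same partition, so a message sent along a link between them never has to cross the network.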

67 Some ideas Repeatedly shuffling the graph is expensive – We should separate the messages about the graph structure (fixed over time) from messages about PageRank weights (variable) – compute and distribute the edges once – read them in incrementally in the reducer (not easy to do in Hadoop!) – call this the “Schimmy” pattern

68 Schimmy Relies on fact that keys are sorted, and sorts the graph input the same way…..

69 Schimmy

70 Results

71 More details at… Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach by J. Yang, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2013. Detecting Cohesive and 2-mode Communities in Directed and Undirected Networks by J. Yang, J. McAuley, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2014. Community Detection in Networks with Node Attributes by J. Yang, J. McAuley, J. Leskovec. IEEE International Conference On Data Mining (ICDM), 2013.

72 Semi-Supervised Learning With Graphs Shannon Quinn (with thanks to William Cohen at CMU)

73 Semi-supervised learning A pool of labeled examples L A (usually larger) pool of unlabeled examples U Can you improve accuracy somehow using U?

74 Semi-Supervised Bootstrapped Learning/Self-training Paris Pittsburgh Seattle Cupertino mayor of arg1 live in arg1 San Francisco Austin denial arg1 is home of traits such as arg1 anxiety selfishness Berlin Extract cities:

75 Semi-Supervised Bootstrapped Learning via Label Propagation Paris live in arg1 San Francisco Austin traits such as arg1 anxiety mayor of arg1 Pittsburgh Seattle denial arg1 is home of selfishness

76 Semi-Supervised Bootstrapped Learning via Label Propagation Paris live in arg1 San Francisco Austin traits such as arg1 anxiety mayor of arg1 Pittsburgh Seattle denial arg1 is home of selfishness Nodes “near” seeds; Nodes “far from” seeds Information from other categories tells you “how far” (when to stop propagating) arrogance traits such as arg1 denial selfishness

77 Semi-Supervised Learning as Label Propagation on a (Bipartite) Graph Paris live in arg1 San Francisco Austin traits such as arg1 anxiety mayor of arg1 Pittsburgh Seattle denial arg1 is home of selfishness I like arg1 beer Propagate labels to nearby nodes: X is “near” Y if there is a high probability of reaching X from Y with a random walk where each step is either (a) move to a random neighbor or (b) jump back to start node Y, if you’re at an NP node – rewards multiple paths – penalizes long paths – penalizes high-fanout paths Propagation methods: “personalized PageRank” (aka damped PageRank, random-walk-with-reset)
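The "nearness" score described here can be sketched as a random walk with restart on a small bipartite graph of noun phrases and contexts. This is a simplified illustration: the walker may restart from any node (the slide restricts restarts to NP nodes), the edge list is a made-up fragment loosely based on the example, and the restart probability of 0.15 is an arbitrary choice.

```python
def random_walk_with_restart(adj, seed, restart=0.15, iters=100):
    """Approximate the probability of being at each node for a walk that,
    at every step, jumps back to `seed` with probability `restart` and
    otherwise moves to a uniformly random neighbor."""
    score = {v: 0.0 for v in adj}
    score[seed] = 1.0
    for _ in range(iters):
        new = {v: 0.0 for v in adj}
        new[seed] = restart
        for u, nbrs in adj.items():
            share = (1 - restart) * score[u] / len(nbrs)
            for w in nbrs:
                new[w] += share
        score = new
    return score


# Hypothetical NP-to-context edges; every node has at least one neighbor.
pairs = [
    ("Paris", "live in arg1"), ("Paris", "mayor of arg1"),
    ("Pittsburgh", "live in arg1"), ("Pittsburgh", "mayor of arg1"),
    ("Seattle", "live in arg1"), ("Seattle", "arg1 is home of"),
    ("denial", "arg1 is home of"), ("denial", "traits such as arg1"),
    ("anxiety", "traits such as arg1"),
]
adj = {}
for np_node, context in pairs:
    adj.setdefault(np_node, set()).add(context)
    adj.setdefault(context, set()).add(np_node)

near = random_walk_with_restart(adj, seed="Paris")
```

Pittsburgh shares two contexts with the seed Paris and scores higher than Seattle, which shares one; denial and anxiety are reachable only through longer paths and score lower still.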

78 ASONAM-2010 (Advances in Social Networks Analysis and Mining)

79 Network Datasets with Known Classes UBMCBlog AGBlog MSPBlog Cora Citeseer

80 RWR - fixpoint of: Seed selection 1.order by PageRank, degree, or randomly 2.go down list until you have at least k examples/class

81 CoEM/HF/wvRN One definition [Macskassy & Provost, JMLR 2007]:…

82 CoEM/HF/wvRN Another definition in [X. Zhu, Z. Ghahramani, and J. Lafferty, ICML 2003] – A harmonic field – the score of each node in the graph is the harmonic, or linearly weighted, average of its neighbors’ scores (harmonic field, HF)
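The harmonic-field idea can be sketched as repeated neighbor averaging with the labeled nodes held fixed. This is a minimal illustration, assuming an unweighted graph (all edge weights equal to 1) and a fixed number of sweeps; it is not the closed-form solution from the cited paper, and the names are hypothetical.

```python
def harmonic_scores(adj, labels, sweeps=200):
    """Iterate until each unlabeled node's score is (approximately) the
    average of its neighbors' scores; labeled nodes stay clamped."""
    score = {v: labels.get(v, 0.5) for v in adj}
    for _ in range(sweeps):
        for v in adj:
            if v not in labels:
                score[v] = sum(score[w] for w in adj[v]) / len(adj[v])
    return score


# Path graph 0-1-2-3-4 with node 0 labeled positive and node 4 negative.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
scores = harmonic_scores(adj, labels={0: 1.0, 4: 0.0})
```

On the path, the scores interpolate linearly between the two labeled endpoints: 0.75, 0.5, and 0.25 for the three unlabeled nodes.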

