CMU SCS Graph Analytics wkshp C. Faloutsos (CMU)
Graph Analytics Workshop: Tools
Christos Faloutsos, CMU
Welcome!
Time          Tue            Wed           Thu
9:00-10:30    Tools          Laplacians    Parallelism
11:00-12:30   NELL           Rich graphs   Communities
1:30-3:00     Exercises      Panel         Scalability
3:30-5:00     Graph. models  Posters       Graph ‘Laws’
Reception
Roadmap
- Introduction - Motivation
- Task 1: Node importance
- Task 2: Community detection
- Task 3: Mining graphs over time - Tensors
- Task 4: Theory - intro to Laplacians
- Conclusions
Graphs - why should we care?
- Internet Map [lumeta.com]
- Food Web [Martinez ’91]
- >$10B revenue; >0.5B users
- IR: bi-partite graphs (doc-terms), linking documents D1 ... DN to terms T1 ... TM
- ‘NELL’: facts such as (‘merkel’, ‘chancellor’, ‘germany’) -> tensors
- web: hyper-text graph
- ‘viral’ marketing
- web-log (‘blog’) news propagation
- computer network security: IP traffic and anomaly detection
- ... and more
Any M:N relationship -> Graph
Any subject-verb-object construct -> Graph/Tensor
Graphs and matrices
- Closely related
- Powerful tools from matrix algebra, for graph mining
Examples of Matrices
- Graph - social network (rows and columns: John, Peter, Mary, Nick, ...)
- Market basket, as in Association Rules (rows: John, Peter, Mary, Nick, ...; columns: milk, bread, choc., wine, ...)
- Documents and terms (rows: Paper#1, Paper#2, Paper#3, Paper#4, ...; columns: data, mining, classif., tree, ...)
- Authors and terms (rows: John, Peter, Mary, Nick, ...; columns: data, mining, classif., tree, ...)
Node importance - Motivation
Given a graph (e.g., web pages containing the desired query word)
Q: Which node is the most important?
A1: HITS (SVD = Singular Value Decomposition)
A2: eigenvector (PageRank)
‘I am important, if my friends are important’ -> fixed point / eigenvector
Node importance - Motivation: SVD and eigenvector analysis are very closely related.
Task 1 - SVD - Detailed outline
- Motivation
- Definition - properties
- Interpretation
- Complexity
- Case studies: HITS, PageRank
SVD - Motivation
- problem #1: text - LSI: find ‘concepts’
- problem #2: compression / dim. reduction
SVD - Motivation: customer-product matrix, for recommendation systems (products: bread, lettuce, tomatoes, beef, chicken; customer groups: vegetarians, meat eaters)
SVD - Motivation, problem #2: compress / reduce dimensionality
Problem - specs: visualize customers
SVD - Definition
$A_{[n \times m]} = U_{[n \times r]} \, \Sigma_{[r \times r]} \, (V_{[m \times r]})^T$
- A: n x m matrix (e.g., n documents, m terms)
- U: n x r matrix (n documents, r concepts)
- Σ: r x r diagonal matrix (strength of each ‘concept’; r: rank of the matrix)
- V: m x r matrix (m terms, r concepts)
SVD - Properties
THEOREM [Press+92]: it is always possible to decompose matrix A into $A = U \Sigma V^T$, where
- U, Σ, V: unique (*)
- U, V: column orthonormal (i.e., columns are unit vectors, orthogonal to each other): $U^T U = I$; $V^T V = I$ (I: identity matrix)
- Σ: singular values are positive, and sorted in decreasing order
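The properties above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the small document-term matrix is made up for illustration and is not the example from the slides.

```python
import numpy as np

# Hypothetical 4-document x 3-term matrix (illustrative values only).
A = np.array([[1.0, 1.0, 0.0],
              [2.0, 2.0, 0.0],
              [0.0, 0.0, 2.0],
              [0.0, 0.0, 1.0]])

# "Thin" SVD: U is n x min(n, m), s holds the singular values, Vt is V^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the r non-zero singular values (r = rank of A).
r = int(np.sum(s > 1e-10))
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

print("rank r =", r)
print("U^T U = I:", np.allclose(U.T @ U, np.eye(r)))
print("V^T V = I:", np.allclose(Vt @ Vt.T, np.eye(r)))
print("sorted decreasingly:", bool(np.all(np.diff(s) <= 0)))
print("A = U S V^T:", np.allclose(U @ np.diag(s) @ Vt, A))
```

Here the two ‘concepts’ are the block of the first two terms and the third term on its own.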
SVD - Example
$A = U \Sigma V^T$ for a document-term matrix with terms (data, inf., retrieval, brain, lung) and two groups of documents (CS, MD):
- U: doc-to-concept similarity matrix (columns: CS-concept, MD-concept)
- Σ: first diagonal entry = ‘strength’ of the CS-concept
- $V^T$: term-to-concept similarity matrix (first row: CS-concept)
Task 1 - SVD - Detailed outline
- Motivation
- Definition - properties
- Interpretation
- Complexity
- Case studies
- Additional properties
SVD - Interpretation #1: ‘documents’, ‘terms’ and ‘concepts’
- U: document-to-concept similarity matrix
- V: term-to-concept similarity matrix
- Σ: its diagonal elements give the ‘strength’ of each concept
SVD - Interpretation #1: ‘documents’, ‘terms’ and ‘concepts’
Q: if A is the document-to-term matrix, what is $A^T A$?
A: term-to-term ([m x m]) similarity matrix
Q: $A A^T$?
A: document-to-document ([n x n]) similarity matrix
Copyright: Faloutsos, Tong (2009), ICDE’09
SVD properties
- V are the eigenvectors of the covariance matrix $A^T A$ (a covariance matrix, up to scaling, when the columns of A are zero-mean)
- U are the eigenvectors of the Gram (inner-product) matrix $A A^T$
Further reading:
1. Ian T. Jolliffe, Principal Component Analysis (2nd ed), Springer, 2002.
2. Gilbert Strang, Linear Algebra and Its Applications (4th ed), Brooks Cole, 2005.
SVD - Interpretation #2
- best axis to project on (‘best’ = min sum of squares of projection errors, i.e., minimum RMS error)
- SVD gives the best axis to project on: v1, the first singular vector (first column of V)
- σ1, the first singular value, reflects the variance (‘spread’) on the v1 axis
- U Σ gives the coordinates of the points on the projection axes
SVD - Interpretation #2: more details
Q: how exactly is dim. reduction done?
A: set the smallest singular values to zero; the product of the truncated factors then approximates A ($A \approx U \Sigma V^T$ with the smallest entries of Σ zeroed, so the corresponding columns of U and V can be dropped)
SVD - Interpretation #2
Exactly equivalent: ‘spectral decomposition’ of the matrix:
$A_{[n \times m]} = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \dots$
(r terms; each $u_i$ is n x 1 and each $v_i^T$ is 1 x m)
SVD - Interpretation #2
approximation / dim. reduction: keep the first few terms of $A = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \dots$ (assume $\sigma_1 \ge \sigma_2 \ge \dots$)
Q: how many?
A (heuristic - [Fukunaga]): keep 80-90% of the ‘energy’ (= sum of squares of the $\sigma_i$’s)
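The energy heuristic above can be sketched as follows, assuming NumPy. The function names `rank_for_energy` and `truncated_svd` and the default of 90% are choices made for this illustration, not part of the slides.

```python
import numpy as np

def rank_for_energy(s, fraction=0.9):
    """Smallest k whose first k singular values keep `fraction` of the energy."""
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(energy, fraction)) + 1
    return min(k, len(s))  # guard against rounding when fraction is close to 1

def truncated_svd(A, fraction=0.9):
    """Rank-k approximation of A: keep the first k terms sigma_i u_i v_i^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = rank_for_energy(s, fraction)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :], k

# Hypothetical matrix with singular values 3, 1 and 0.5.
A = np.diag([3.0, 1.0, 0.5])
A_k, k = truncated_svd(A, fraction=0.9)
print("kept k =", k, "terms; Frobenius error =", np.linalg.norm(A - A_k))
```

The Frobenius norm of the error equals the square root of the discarded energy, which is why the heuristic is phrased in terms of squared singular values.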
Pictorially
- matrix form of SVD: A (n x m) ≈ $U \Sigma V^T$
- spectral form of SVD: A (n x m) ≈ $\sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \dots$
- both give the best rank-k approximation in L2
Task 1 - SVD - Detailed outline
- Motivation
- Definition - properties
- Interpretation
  - #1: documents/terms/concepts
  - #2: dim. reduction
  - #3: picking non-zero, rectangular ‘blobs’
- Complexity
- Case studies
- Additional properties
SVD - Interpretation #3
finds non-zero ‘blobs’ in a data matrix = ‘communities’ (bi-partite cores, here)
[Figure labels: Row 1, Row 4, Row 5, Row 7; Col 1, Col 3, Col 4]
SVD - Complexity
- O(n * m * m) or O(n * n * m) (whichever is less)
- less work, if we just want the singular values, or if we want the first k singular vectors, or if the matrix is sparse [Berry]
- Implemented in any linear algebra package (LINPACK, matlab, Splus, mathematica ...)
SVD - conclusions so far
- SVD: $A = U \Sigma V^T$: unique (*)
- U: document-to-concept similarities
- V: term-to-concept similarities
- Σ: strength of each concept
- dim. reduction: keep the first few strongest singular values (80-90% of ‘energy’)
- SVD: picks up linear correlations
- SVD: picks up non-zero ‘blobs’
Kleinberg’s algo (HITS)
Kleinberg, Jon (1998). Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms.
Recall the problem dfn: given a graph (e.g., web pages containing the desired query word), which node is the most important?
Kleinberg’s algorithm
Problem dfn: given the web and a query, find the most ‘authoritative’ web pages for this query
- Step 0: find all pages containing the query terms
- Step 1: expand by one move forward and backward
Kleinberg’s algorithm
On the resulting graph:
- give a high score (= ‘authorities’) to nodes that many important nodes point to
- give a high importance score (‘hubs’) to nodes that point to good ‘authorities’
Observations:
- recursive definition!
- each node (say, the i-th node) has both an authoritativeness score $a_i$ and a hubness score $h_i$
Kleinberg’s algorithm
Let E be the set of edges and A be the adjacency matrix: the (i,j) entry is 1 if the edge from i to j exists.
Let h and a be [n x 1] vectors with the ‘hubness’ and ‘authoritativeness’ scores. Then, for a node i pointed to by nodes k, l, m:
$a_i = h_k + h_l + h_m$, that is, $a_i = \sum_j h_j$ over all j such that the edge (j,i) exists, or $a = A^T h$
Symmetrically, for the ‘hubness’ of a node i pointing to nodes n, p, q:
$h_i = a_n + a_p + a_q$, that is, $h_i = \sum_j a_j$ over all j such that the edge (i,j) exists, or $h = A a$
Kleinberg’s algorithm
In conclusion, we want vectors h and a such that (up to a scaling constant):
$h = A a$
$a = A^T h$
SVD properties: $A_{[n \times m]} \, v_{1\,[m \times 1]} = \sigma_1 \, u_{1\,[n \times 1]}$ and $u_1^T A = \sigma_1 v_1^T$
In short, the solutions to h = A a, a = A^T h are the left- and right-singular-vectors of the adjacency matrix A.
Starting from a random a’ and iterating, we’ll eventually converge.
Q: to which of all the singular-vectors? why?
A: to the ones of the strongest singular value: $(A^T A)^k v' \sim (\text{constant}) \, v_1$
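The iteration described above can be sketched as a power iteration, assuming NumPy. The tiny graph, the function name `hits`, the uniform starting vector (instead of a random one) and the normalization after each step are illustrative assumptions; this is not Kleinberg's full pipeline, which first builds the query-specific subgraph.

```python
import numpy as np

def hits(A, iters=100):
    """Alternate a = A^T h and h = A a, renormalizing after each step."""
    n = A.shape[0]
    h = np.ones(n)
    a = np.ones(n)
    for _ in range(iters):
        a = A.T @ h
        a /= np.linalg.norm(a)   # keep the scores from growing without bound
        h = A @ a
        h /= np.linalg.norm(h)
    return h, a

# Hypothetical graph with edges 0->2, 0->3, 1->2 (A[i, j] = 1 if i points to j).
A = np.zeros((4, 4))
A[0, 2] = A[0, 3] = A[1, 2] = 1.0

h, a = hits(A)
U, s, Vt = np.linalg.svd(A)
print("hub scores      :", h)
print("authority scores:", a)
print("match u1 (up to sign):", np.allclose(h, np.abs(U[:, 0])))
print("match v1 (up to sign):", np.allclose(a, np.abs(Vt[0])))
```

In this toy graph node 2 receives two links and ends up with the highest authority score, while node 0, which points to both authorities, is the strongest hub.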
Kleinberg’s algorithm - results
E.g., for the query ‘java’: java.sun.com (“the java developer”)
Kleinberg’s algorithm - discussion
The ‘authority’ score can be used to find ‘similar pages’ (how?)
PageRank (google)
Brin, Sergey and Lawrence Page (1998). Anatomy of a Large-Scale Hypertextual Web Search Engine. 7th Intl World Wide Web Conf.
Problem: PageRank
Given a directed graph, find its most interesting/central node.
A node is important, if it is connected with important nodes (recursive, but OK!)
Proposed solution: random walk; spot the most ‘popular’ node (-> steady state prob. (ssp))
A node has high ssp, if it is connected with high ssp nodes (recursive, but OK!)
(Simplified) PageRank algorithm
Let A be the adjacency matrix; let B be the transition matrix (transposed, column-normalized, so that rows are ‘to’ and columns are ‘from’). Then $B p = p$.
Definitions
- A: adjacency matrix (from-to)
- D: degree matrix = diag(d1, d2, ..., dn)
- B: transition matrix: to-from, column-normalized, $B = A^T D^{-1}$
(Simplified) PageRank algorithm
- B p = 1 * p; thus, p is the eigenvector that corresponds to the highest eigenvalue (= 1, since the matrix is column-normalized)
- Why does such a p exist? p exists if B is n x n, nonnegative, irreducible [Perron-Frobenius theorem]
- In short: imagine a particle randomly moving along the edges; compute its steady-state probabilities (ssp)
- Full version of algo: with occasional random jumps. Why? To make the matrix irreducible
Full algorithm
- With probability 1-c, fly out to a random node
- Then, we have $p = c B p + \frac{1-c}{n} \mathbf{1}$, so $p = \frac{1-c}{n} [I - c B]^{-1} \mathbf{1}$
Alternative notation
- M: modified transition matrix, $M = c B + \frac{1-c}{n} \mathbf{1} \mathbf{1}^T$
- Then $p = M p$; that is, the steady-state probabilities = PageRank scores form the first eigenvector of the ‘modified transition matrix’
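The two formulations above (iterating $p = c B p + \frac{1-c}{n}\mathbf{1}$ and the closed form) can be compared with a minimal sketch, assuming NumPy. The 3-node graph, the value c = 0.85 and the function names are assumptions for illustration; the sketch also assumes every node has at least one out-link, so that $D^{-1}$ exists.

```python
import numpy as np

def pagerank(A, c=0.85, iters=200):
    """Power iteration for p = c B p + (1 - c)/n * 1, with B = A^T D^{-1}."""
    n = A.shape[0]
    out_degree = A.sum(axis=1)
    # Assumption: no dangling nodes (all out-degrees are positive).
    B = A.T / out_degree          # divides column j of A^T by d_j
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        p = c * (B @ p) + (1.0 - c) / n
    return p

def pagerank_closed_form(A, c=0.85):
    """p = (1 - c)/n * [I - c B]^{-1} 1, solved as a linear system."""
    n = A.shape[0]
    B = A.T / A.sum(axis=1)
    return (1.0 - c) / n * np.linalg.solve(np.eye(n) - c * B, np.ones(n))

# Hypothetical graph with edges 0->1, 0->2, 1->2, 2->0.
A = np.array([[0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

p = pagerank(A)
print("PageRank scores:", p, "sum =", p.sum())
print("agrees with closed form:", np.allclose(p, pagerank_closed_form(A)))
```

Solving the linear system is fine for a toy graph; on large sparse graphs the iterative form is the practical choice, since each step is one sparse matrix-vector product.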
Parenthesis: intuition behind eigenvectors
Formal definition: if A is an (n x n) square matrix, $(\lambda, x)$ is an eigenvalue/eigenvector pair of A if $A x = \lambda x$. CLOSELY related to singular values:
Property #1: Eigen- vs singular-values
If $B_{[n \times m]} = U_{[n \times r]} \, \Sigma_{[r \times r]} \, (V_{[m \times r]})^T$, then $A = B^T B$ is symmetric and $B^T B \, v_i = \sigma_i^2 \, v_i$, i.e., $v_1, v_2, \dots$ are eigenvectors of $A = B^T B$.
Property #2
If A [n x n] is a real, symmetric matrix, then it has n real eigenvalues (if A is not symmetric, some eigenvalues may be complex).
Property #3
If A [n x n] is a real, symmetric matrix, then its n real eigenvalues agree with its n singular values, except possibly for the sign.
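Properties #1 to #3 can be checked on small matrices. This is a sketch assuming NumPy; both matrices are arbitrary values chosen for illustration.

```python
import numpy as np

# Property #1: eigenvalues of B^T B are the squared singular values of B.
B = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
U, s, Vt = np.linalg.svd(B, full_matrices=False)
A = B.T @ B
eigvals = np.linalg.eigvalsh(A)           # ascending order, real for symmetric A
print("eigenvalues of B^T B :", eigvals[::-1])
print("squared sing. values :", s ** 2)
for i in range(len(s)):
    v_i = Vt[i]                           # i-th right singular vector of B
    print("B^T B v_i = sigma_i^2 v_i:", np.allclose(A @ v_i, s[i] ** 2 * v_i))

# Properties #2 and #3: a real symmetric matrix has real eigenvalues,
# equal to its singular values except possibly for the sign.
S = np.array([[2.0, 1.0],
              [1.0, -3.0]])
lam = np.linalg.eigvalsh(S)
sv = np.linalg.svd(S, compute_uv=False)
print("eigenvalues    :", lam)
print("singular values:", sv)
```

The symmetric example has one negative eigenvalue, so the sign caveat in Property #3 is visible in the output.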
Intuition
- A as a vector transformation: $x' = A x$
- By defn., eigenvectors remain parallel to themselves (‘fixed points’): $A v_1 = \lambda_1 v_1$, with $\lambda_1 = 3.62$ in the slide’s example
Convergence
Usually, fast: depends on the ratio $\lambda_1 : \lambda_2$
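Why the ratio matters can be seen from a standard power-iteration argument, sketched here under the assumptions that A is diagonalizable with eigenpairs $(\lambda_i, v_i)$, that $|\lambda_1| > |\lambda_2| \ge \dots$, and that the starting vector $x_0 = \sum_i c_i v_i$ has $c_1 \neq 0$ (the coefficients $c_i$ are notation introduced for this sketch):

```latex
A^k x_0 = \sum_i c_i \lambda_i^k v_i
        = \lambda_1^k \left[ c_1 v_1
          + \sum_{i \ge 2} c_i \left( \frac{\lambda_i}{\lambda_1} \right)^{k} v_i \right]
```

After normalization, the components along $v_2, v_3, \dots$ shrink at least as fast as $|\lambda_2 / \lambda_1|^k$, so a large gap between the two leading eigenvalues means few iterations are needed.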
Kleinberg/google - conclusions
SVD helps in graph analysis:
- hub/authority scores: strongest left- and right-singular-vectors of the adjacency matrix
- random walk on a graph: steady-state probabilities are given by the strongest eigenvector of the (modified) transition matrix
Conclusions
SVD: a valuable tool
- given a document-term matrix, it finds ‘concepts’ (LSI)
- ... and can find fixed points or steady-state probabilities (google / Kleinberg / Markov chains)
Conclusions cont’d
(We didn’t discuss/elaborate, but) SVD
- ... can reduce dimensionality (KL)
- ... and can find rules (PCA; RatioRules)
- ... and can optimally solve over- and under-constrained linear systems (least squares / query feedbacks)
References
- Berry, Michael:
- Brin, S. and L. Page (1998). Anatomy of a Large-Scale Hypertextual Web Search Engine. 7th Intl World Wide Web Conf.
- Faloutsos, C. Searching Multimedia Databases by Content, Springer (App. D).
- Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition, Academic Press.
- Jolliffe, I. T. Principal Component Analysis, Springer, 2002 (2nd ed.).
- Kleinberg, J. (1998). Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms.
- Press, W. H., S. A. Teukolsky, et al. (1992). Numerical Recipes in C, Cambridge University Press.
PART 2: Communities
Task 2 - Communities - Detailed outline
- Motivation
- Hard clustering - k pieces
- Hard clustering - optimal # pieces
- Observations
Problem
Given a graph, and k: break it into k (disjoint) communities (e.g., k = 2)
Solution #1: METIS
- Arguably, the best algorithm
- Open source, with *many* related papers at the same url
- Main idea: coarsen the graph; partition; un-coarsen
G. Karypis and V. Kumar. METIS 4.0: Unstructured graph partitioning and sparse matrix ordering system. TR, Dept. of CS, Univ. of Minnesota, 1998.
Solution #2 (problem: hard clustering, k pieces)
Spectral partitioning: consider the eigenvector of the 2nd smallest eigenvalue of the (normalized) Laplacian.
See details in ‘Task 4’ (Theory - intro to Laplacians), later
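A minimal sketch of spectral bisection (k = 2), assuming NumPy. For simplicity it uses the unnormalized Laplacian L = D - A rather than the normalized one mentioned above, and splits the nodes by the sign of the second eigenvector; the graph (two triangles joined by one edge) and the function name are made up for illustration.

```python
import numpy as np

def spectral_bisection(A):
    """Split an undirected graph in two by the sign of the Fiedler vector."""
    degrees = A.sum(axis=1)
    L = np.diag(degrees) - A                 # unnormalized Laplacian (a simplification)
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                  # eigenvector of the 2nd smallest eigenvalue
    return fiedler >= 0                      # boolean side label per node

# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5}, joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

side = spectral_bisection(A)
print("side of each node:", side)
```

Which triangle gets labelled True depends on the arbitrary sign of the eigenvector, so only the grouping is meaningful, not the labels.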
Solutions #3, ...
Many more ideas:
- Clustering on $A^2$ (square of the adjacency matrix) [Zhou, Woodruff, PODS’04]
- Minimum cut / maximum flow [Flake+, KDD’00]
- ...
Cross-association
Desiderata:
- Simultaneously discover row and column groups
- Fully automatic: no “magic numbers”
- Scalable to large matrices
Reference: Chakrabarti et al. Fully Automatic Cross-Associations, KDD’04
What makes a cross-association “good”?
[Figure: two arrangements of the same matrix into row groups and column groups, one versus the other]
Why is this better? It is simpler; easier to describe; easier to compress!
Problem definition: given an encoding scheme
- decide on the # of col. and row groups k and l
- and reorder rows and columns,
- to achieve best compression
Main idea
Total Encoding Cost = $\sum_i \text{size}_i \cdot H(x_i)$ (code cost) + cost of describing the cross-associations (description cost)
- Good compression <-> better clustering
- Minimize the total cost (# bits) for lossless compression
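The code-cost term $\sum_i \text{size}_i \cdot H(x_i)$ can be illustrated with a short sketch in plain Python. It assumes a binary matrix, takes $x_i$ to be the fraction of ones in block i, and leaves out the description cost entirely; the function names and the toy matrix are hypothetical, and this is not the search algorithm of the KDD’04 paper.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a 0/1 source with P(1) = p."""
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def code_cost(matrix, row_groups, col_groups):
    """Sum over blocks of size_i * H(x_i); x_i = density of ones in block i."""
    k = max(row_groups) + 1
    l = max(col_groups) + 1
    ones = [[0] * l for _ in range(k)]
    size = [[0] * l for _ in range(k)]
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            ones[row_groups[r]][col_groups[c]] += value
            size[row_groups[r]][col_groups[c]] += 1
    total = 0.0
    for i in range(k):
        for j in range(l):
            if size[i][j] > 0:
                total += size[i][j] * binary_entropy(ones[i][j] / size[i][j])
    return total

# Hypothetical block-diagonal binary matrix.
M = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]

one_group = code_cost(M, [0, 0, 0, 0], [0, 0, 0, 0])    # k = l = 1
two_groups = code_cost(M, [0, 0, 1, 1], [0, 0, 1, 1])   # k = l = 2
print("code cost, single block:", one_group, "bits")
print("code cost, 2 x 2 blocks:", two_groups, "bits")
```

Homogeneous blocks (all ones or all zeros) cost nothing to encode, which is why the grouping that matches the block structure wins; the description cost is what stops the method from simply putting every row and column in its own group.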
Algorithm
The numbers of row groups (k) and column groups (l) grow in turn: k=1, l=2 -> k=2, l=2 -> k=2, l=3 -> k=3, l=3 -> k=3, l=4 -> k=4, l=4 -> k=4, l=5, ending with k = 5 row groups and l = 5 col groups.
Experiments: “CLASSIC”
- 3,893 documents; 4,303 words; 176,347 “dots”
- Combination of 3 sources: MEDLINE (medical), CISI (info. retrieval), CRANFIELD (aerodynamics)
- Result on the “CLASSIC” graph of documents & words: k=15, l=19
- Word groups found:
  - MEDLINE (medical): insipidus, alveolar, aortic, death, prognosis, intravenous; blood, disease, clinical, cell, tissue, patient
  - CISI (Information Retrieval): providing, studying, records, development, students, rules; abstract, notation, works, construct, bibliographies
  - CRANFIELD (aerodynamics): shape, nasa, leading, assumed, thin
  - another word group: paint, examination, fall, raise, leave, based
Algorithm
- Code for cross-associations (matlab): tgz
- Variations and extensions: ‘Autopart’ [Chakrabarti, PKDD’04]
- Hadoop implementation [ICDM’08]: Spiros Papadimitriou, Jimeng Sun: DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining. ICDM 2008
Observation #1
Skewed degree distributions: there are nodes with huge degree (>O(10^4), in facebook/linkedIn popularity contests!)
Observation #2
Maybe there are no good cuts: ‘jellyfish’ shape [Tauro+’01], [Siganos+’06]; strange behavior of cuts [Chakrabarti+’04], [Leskovec+’08]
Jellyfish model [Tauro+]
- A Simple Conceptual Model for the Internet Topology, L. Tauro, C. Palmer, G. Siganos, M. Faloutsos, Global Internet, November 25-29, 2001
- Jellyfish: A Conceptual Model for the AS Internet Topology, G. Siganos, Sudhir L. Tauro, M. Faloutsos, J. of Communications and Networks, Vol. 8, No. 3, Sept
Strange behavior of min cuts
‘negative dimensionality’ (!)
- NetMine: New Mining Tools for Large Graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SDM 2004 Workshop on Link Analysis, Counter-terrorism and Privacy
- Statistical Properties of Community Structure in Large Social and Information Networks, J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney. WWW 2008.
“Min-cut” plot
- Do min-cuts recursively; plot log(mincut-size / #edges) against log(# edges)
- Example: a 2-dimensional grid with N nodes has mincut size = sqrt(N); after each new min-cut the points fall on a line with slope = -0.5
- For a d-dimensional grid, the slope is -1/d
- For a random graph, the slope is 0
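The -1/d slope follows from a back-of-the-envelope count, sketched here for a d-dimensional grid with N nodes (side length $N^{1/d}$), ignoring constant factors and boundary effects:

```latex
\#\text{edges} \approx d\,N, \qquad
\text{mincut-size} \approx N^{(d-1)/d}
\;\Longrightarrow\;
\frac{\text{mincut-size}}{\#\text{edges}} \propto N^{-1/d}
\;\Longrightarrow\;
\log \frac{\text{mincut-size}}{\#\text{edges}}
  = -\frac{1}{d}\,\log(\#\text{edges}) + \text{const}
```

For d = 2 this gives the sqrt(N) cut and the slope of -0.5 quoted above.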
“Min-cut” plot
What does it look like for a real-world graph?
Experiments
Datasets:
- Google Web Graph: 916,428 nodes and 5,105,039 edges
- Lucent Router Graph: undirected graph of network routers; 112,969 nodes and 181,639 edges
- User Website Clickstream Graph: 222,704 nodes and 952,580 edges
Source: NetMine: New Mining Tools for Large Graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SDM 2004 Workshop on Link Analysis, Counter-terrorism and Privacy
Experiments
- Used the METIS algorithm [Karypis, Kumar, 1995]
- Google Web graph: values along the y-axis are averaged; there is a “lip” for large numbers of edges; the slope of about -0.4 corresponds to a 2.5-dimensional grid!
- Same results for other graphs too
[Figure: min-cut plots, log(mincut-size / #edges) versus log(# edges), for the Lucent Router graph and the Clickstream graph; a slope of about -0.45 is marked]
Task 2 - Communities: Conclusions - Practitioner’s guide
- Hard clustering - k pieces: METIS
- Hard clustering - optimal # pieces: Cross-associations
- Observations: ‘jellyfish’: no good cuts
PART 3: Tensors
Task 3 - Tensors - Detailed roadmap
- Motivation
- Definitions: PARAFAC
- Case study: web mining
Examples of Matrices: Authors and terms (rows: John, Peter, Mary, Nick, ...; columns: data, mining, classif., tree, ...)
But: what if it changes over time??
A: treat it as a ‘tensor’, with one author-term matrix per year (KDD’07, KDD’08, KDD’09)
Motivation: Why tensors?
Q: what is a tensor?
A: N-D generalization of a matrix, e.g., authors x terms x conferences (KDD’07, KDD’08, KDD’09)
Tensors are useful for 3 or more modes
Terminology: ‘mode’ (or ‘aspect’): Mode (== aspect) #1, Mode #2, Mode #3
Notice:
- the 3rd mode does not need to be time
- we can have more than 3 modes, e.g., IP source, IP destination, dest. port, ...
Background: Tensors
Tensors (= multi-dimensional arrays) are everywhere:
- Sensor stream (time, location, type)
- Predicates (subject, verb, object) in a knowledge base, e.g., “Barack Obama is the president of U.S.”, “Eric Clapton plays guitar”
NELL (Never Ending Language Learner) data: dimensions marked (26M) and (48M) in the figure; nonzeros = 144M
Tensor basics
Multi-mode extensions of SVD. Recall that:
- matrix form of SVD: A (n x m) ≈ $U \Sigma V^T$
- spectral form of SVD: A (n x m) ≈ $\sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \dots$
- both give the best rank-k approximation in L2
Extension to (>=) 3 modes
Main points:
- 2 major types of tensor decompositions: PARAFAC and Tucker (not examined here)
- both can be solved with ‘alternating least squares’ (ALS)
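A compact sketch of PARAFAC fitted by ALS for a dense 3-mode tensor, assuming NumPy. It approximates $T_{ijk} \approx \sum_r A_{ir} B_{jr} C_{kr}$ and updates one factor matrix at a time by least squares; the helper names, the random initialization and the fixed iteration count are choices made for this illustration, and ALS in general is not guaranteed to reach the global optimum.

```python
import numpy as np

def khatri_rao(X, Y):
    """Column-wise Kronecker product: (I x R), (J x R) -> (I*J x R)."""
    R = X.shape[1]
    return (X[:, None, :] * Y[None, :, :]).reshape(-1, R)

def unfold(T, mode):
    """Matricize T: rows index `mode`, columns the other two modes (C order)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def parafac_als(T, rank, iters=200, seed=0):
    """Alternating least squares for a rank-`rank` PARAFAC model of a 3-mode tensor."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((dim, rank)) for dim in T.shape]
    for _ in range(iters):
        for mode in range(3):
            X, Y = [factors[m] for m in range(3) if m != mode]
            gram = (X.T @ X) * (Y.T @ Y)   # equals khatri_rao(X, Y)^T khatri_rao(X, Y)
            factors[mode] = unfold(T, mode) @ khatri_rao(X, Y) @ np.linalg.pinv(gram)
    return factors

def reconstruct(factors):
    """Sum of R rank-1 tensors a_r o b_r o c_r."""
    A, B, C = factors
    return np.einsum('ir,jr,kr->ijk', A, B, C)

# Hypothetical 4 x 5 x 6 tensor built from known rank-2 factors.
rng = np.random.default_rng(42)
true_factors = [rng.standard_normal((dim, 2)) for dim in (4, 5, 6)]
T = reconstruct(true_factors)

estimated = parafac_als(T, rank=2)
error = np.linalg.norm(T - reconstruct(estimated)) / np.linalg.norm(T)
print("relative reconstruction error:", error)
```

Each update is an ordinary least-squares problem because, with two factor matrices held fixed, the model is linear in the third; this is the multi-mode analogue of keeping the first few terms of the spectral form of SVD.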
Discoveries: Problem definition
Most important concepts and synonyms in the NELL (Never Ending Language Learner) data? (dimensions marked (26M) and (48M) in the figure; nonzeros = 144M)
A1: Concept discovery in the knowledge base
A2: Synonym discovery in the knowledge base: using the factor vectors $a_1, a_2, \dots, a_R$, go from a (given) noun phrase to (discovered) synonym 1, (discovered) synonym 2, ...
GigaTensor: Scaling Tensor Analysis Up By 100 Times - Algorithms and Discoveries. U Kang, Evangelos Papalexakis, Abhay Harpale, Christos Faloutsos. KDD 2012
Experiments: GigaTensor solves a 100x larger problem. [Figure: GigaTensor vs. Tensor Toolbox on a tensor with modes (I), (J), (K), number of nonzeros = I / 50; Tensor Toolbox runs out of memory]
Conclusions: Real data may have multiple aspects (modes). Tensors provide elegant theory and algorithms: PARAFAC (and Tucker) discover groups. GigaTensor scales up (Hadoop/PEGASUS).
References
T. G. Kolda, B. W. Bader and J. P. Kenny. Higher-Order Web Link Analysis Using Multilinear Algebra. In: ICDM 2005, Pages , November
Jimeng Sun, Spiros Papadimitriou, Philip Yu. Window-based Tensor Analysis on High-dimensional and Multi-aspect Streams. Proc. of the Int. Conf. on Data Mining (ICDM), Hong Kong, China, Dec 2006
Resources: See tutorial on tensors, KDD’07 (w/ Tamara Kolda and Jimeng Sun):
Tensor tools - resources
Toolbox: from Tamara Kolda: csmr.ca.sandia.gov/~tgkolda/TensorToolbox
T. G. Kolda and B. W. Bader. Tensor Decompositions and Applications. SIAM Review, Volume 51, Number 3, September 2009. csmr.ca.sandia.gov/~tgkolda/pubs/bibtgkfiles/TensorReview-preprint.pdf
T. Kolda and J. Sun. Scalable Tensor Decomposition for Multi-Aspect Data Mining (ICDM 2008)
(Copyright: Faloutsos, Tong (2009), ICDE’09)
PART 4: Theory
Roadmap: Introduction - Motivation; Task 1: Node importance; Task 2: Community detection; Task 3: Mining graphs over time - Tensors; Task 4: Theory - intro to Laplacians; Conclusions
Task 4 - Theory - Detailed roadmap: Adjacency matrix; Laplacian (Connected Components; Intuition: 2nd smallest eigenvalue -> ‘good cut’)
Adjacency matrix: A = [matrix figure]
Adjacency matrix: powers of A count step-away paths: entry (i,j) of $A^k$ is the number of k-step paths from node i to node j. [matrix figure]
Adjacency matrix: obvious extensions for the directed and/or weighted cases.
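The path-counting property of the adjacency matrix can be checked on a toy graph. The Python sketch below is an illustration under stated assumptions: the helper names `mat_mul` and `mat_pow` and the 4-node example graph are made up for the example, and "paths" here means walks that may revisit nodes.

```python
# Sketch: entry (i, j) of A^k counts the k-step walks from node i to node j.
# Hypothetical helpers; the graph is a triangle 0-1-2 plus a pendant node 3 on node 2.

def mat_mul(P, Q):
    n, m, p = len(P), len(Q), len(Q[0])
    return [[sum(P[i][t] * Q[t][j] for t in range(m)) for j in range(p)] for i in range(n)]


def mat_pow(A, k):
    """A multiplied by itself k times (k >= 1)."""
    result = A
    for _ in range(k - 1):
        result = mat_mul(result, A)
    return result


edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
n = 4
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = A[j][i] = 1  # undirected, unweighted

A2 = mat_pow(A, 2)
A3 = mat_pow(A, 3)

degrees = [A2[i][i] for i in range(n)]               # 2-step closed walks = degree
triangles = sum(A3[i][i] for i in range(n)) // 6     # each triangle is counted 6 times
```

For a directed graph the same code applies with a non-symmetric `A`; for a weighted graph the entries of $A^k$ become sums of products of edge weights rather than plain counts.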
Main upcoming result: the eigenvector $u_2$ of the second smallest eigenvalue of the Laplacian gives a good cut: nodes with positive scores should go to one group, and the rest to the other.
Laplacian: $L = D - A$, where $D$ is the diagonal matrix with $d_{ii} = d_i$ (the degree of node $i$).
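A minimal Python sketch of the definition $L = D - A$, assuming an undirected, unweighted graph stored as a dense 0/1 adjacency matrix. The function names and the 3-node path graph are illustrative choices, not part of the slides.

```python
# Sketch: build the Laplacian L = D - A and evaluate its quadratic form.
# Hypothetical helpers; the example graph is the path 0 - 1 - 2.

def laplacian(A):
    n = len(A)
    deg = [sum(row) for row in A]  # d_i = degree of node i
    return [[(deg[i] if i == j else 0) - A[i][j] for j in range(n)] for i in range(n)]


def quad_form(L, x):
    """x^T L x; for a Laplacian this equals the sum over edges of (x_i - x_j)^2."""
    n = len(L)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))


A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
L = laplacian(A)

row_sums = [sum(row) for row in L]  # every row of L sums to zero
x = [1, 3, 6]
q = quad_form(L, x)                 # (1 - 3)^2 + (3 - 6)^2
```

Because every row sums to zero, the all-ones vector is always an eigenvector of `L` with eigenvalue 0, which is the starting point for the connected-components result.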
Connected Components. Lemma: Let G be a graph with n vertices and c connected components. If L is the Laplacian of G, then rank(L) = n - c. Proof: see p. 279, Godsil-Royle.
Connected Components [Figures: example graphs G(V,E), each with its Laplacian L and eig(L)]: the number of zero eigenvalues equals the number of connected components (#zeros = #components); eig(L) can also indicate a “good cut”.
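The lemma rank(L) = n - c can be verified on a small example. The Python sketch below computes the rank exactly with rational arithmetic and compares it against a component count from a graph traversal. Since L is symmetric, n - rank(L) is also the number of zero eigenvalues. All function names and the example graph (a triangle plus a separate edge) are assumptions for illustration.

```python
# Sketch: check rank(L) = n - c on a graph with two connected components.
# Hypothetical helpers; exact arithmetic via fractions avoids tolerance issues.
from fractions import Fraction


def laplacian(A):
    n = len(A)
    deg = [sum(row) for row in A]
    return [[(deg[i] if i == j else 0) - A[i][j] for j in range(n)] for i in range(n)]


def matrix_rank(M):
    """Rank by Gaussian elimination over the rationals."""
    M = [[Fraction(v) for v in row] for row in M]
    rows, cols = len(M), len(M[0])
    r = 0
    for col in range(cols):
        pivot = next((i for i in range(r, rows) if M[i][col] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, rows):
            f = M[i][col] / M[r][col]
            if f != 0:
                M[i] = [x - f * y for x, y in zip(M[i], M[r])]
        r += 1
        if r == rows:
            break
    return r


def count_components(A):
    """Number of connected components, by depth-first search."""
    n = len(A)
    seen = [False] * n
    count = 0
    for start in range(n):
        if seen[start]:
            continue
        count += 1
        seen[start] = True
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(n):
                if A[u][v] and not seen[v]:
                    seen[v] = True
                    stack.append(v)
    return count


n = 5
A = [[0] * n for _ in range(n)]
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4)]:  # triangle {0,1,2} and edge {3,4}
    A[i][j] = A[j][i] = 1

c = count_components(A)
rank_L = matrix_rank(laplacian(A))
num_zero_eigenvalues = n - rank_L  # equals c when the lemma holds
```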
Task 4 - Theory - Detailed roadmap: Reminders; Adjacency matrix; Laplacian (Connected Components; Intuition: 2nd smallest eigenvalue -> ‘good cut’)
Example: Spectral Partitioning. [Figure: $K_{500}$ dumbbell graph; Montagues, Capulets, Romeo, Juliet]
Example: Spectral Partitioning. This is how the adjacency matrix B looks: spy(B)
Example: Spectral Partitioning. 2nd eigenvector $u_2$ of the Laplacian L of B: $L u_2 = \lambda_2 u_2$. L = diag(sum(B))-B; [u v] = eigs(L,2,'SM'); plot(u(:,1),'x') Not so much information yet… [Plot: $u_{2,i}$ score vs. node-id ‘i’]
Example: Spectral Partitioning. 2nd eigenvector after sorting on the $u_{2,i}$ score: [ign ind] = sort(u(:,1)); plot(u(ind),'x') But now we see the two communities! [Plot: $u_{2,i}$ score vs. node-id ‘i’]
Example: Spectral Partitioning. This is how the adjacency matrix B looks now: spy(B(ind,ind))
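The MATLAB steps on these slides can be mirrored in plain Python on a much smaller dumbbell graph. The sketch below is an approximation under stated assumptions: instead of `eigs`, it finds $u_2$ by power iteration on a shifted Laplacian while projecting out the all-ones eigenvector; the graph is two 5-node cliques joined by one edge rather than $K_{500}$; the names `dumbbell` and `fiedler_vector` are hypothetical.

```python
# Sketch: spectral partitioning of a small dumbbell graph using u_2 of L = D - A.
# Stand-in for eigs(L, 2, 'SM'): power iteration on (shift*I - L), restricted to
# vectors orthogonal to the all-ones eigenvector. Assumes a connected graph.

def dumbbell(m):
    """Two m-node cliques joined by a single edge between nodes m-1 and m."""
    n = 2 * m
    A = [[0] * n for _ in range(n)]
    for base in (0, m):
        for i in range(base, base + m):
            for j in range(base, base + m):
                if i != j:
                    A[i][j] = 1
    A[m - 1][m] = A[m][m - 1] = 1
    return A


def fiedler_vector(A, iters=500):
    n = len(A)
    deg = [sum(row) for row in A]
    shift = 2.0 * max(deg)  # upper bound on the largest eigenvalue of L
    x = [float(i) for i in range(n)]  # deterministic start vector
    for _ in range(iters):
        mean = sum(x) / n
        x = [v - mean for v in x]  # remove the component along the all-ones vector
        # y = (shift*I - L) x, with (L x)_i = deg_i * x_i - sum_j A_ij * x_j
        y = [shift * x[i] - (deg[i] * x[i] - sum(A[i][j] * x[j] for j in range(n)))
             for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return x


A = dumbbell(5)
u2 = fiedler_vector(A)

# Counterpart of: [ign ind] = sort(u(:,1))
ind = sorted(range(len(u2)), key=lambda i: u2[i])

# The cut: nodes with positive score go to one group, the rest to the other.
group_pos = [i for i in range(len(u2)) if u2[i] > 0]
group_neg = [i for i in range(len(u2)) if u2[i] <= 0]
```

Reordering the rows and columns of `A` by `ind` corresponds to `spy(B(ind,ind))`: the two cliques appear as two dense diagonal blocks. The overall sign of `u2` is arbitrary, so which clique ends up in `group_pos` carries no meaning.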
Why $\lambda_2$? Each ball has 1 unit of mass; the balls, at displacements $x_1, \dots, x_n$, oscillate. Matrix viewpoint: definition of eigenvector, $L x = \lambda x$.
Why $\lambda_2$? Physics viewpoint: in $L x = \lambda x$, $L x$ is the force due to neighbors, $x$ is the displacement, and $\lambda$ is Hooke’s constant.
Why $\lambda_2$? For the first eigenvector, all nodes have the same displacement (= value). [Plot: eigenvector value vs. node id]
Why $\lambda_2$? [Plot: eigenvector value vs. node id]
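The link between the oscillating-balls picture and the cut can be written out in a few lines. The following is a standard derivation for an undirected, unweighted graph, given here as a supplement rather than as the exact equations shown on the slides; $j \sim i$ denotes that $j$ is a neighbor of $i$, and $E$ is the edge set.

```latex
\begin{align*}
(Lx)_i &= d_i x_i - \sum_{j \sim i} x_j = \sum_{j \sim i} (x_i - x_j), \\
x^T L x &= \sum_{(i,j) \in E} (x_i - x_j)^2 \;\ge\; 0, \\
\lambda_1 &= 0 \quad \text{with } u_1 \propto (1, \dots, 1)^T, \\
\lambda_2 &= \min_{x \perp u_1,\ \|x\| = 1} \; \sum_{(i,j) \in E} (x_i - x_j)^2 .
\end{align*}
```

The first line is the net pull on node $i$ from its neighbors. The third line is the case where all nodes share the same displacement, so nothing is stretched. The last line says that $u_2$ is the balanced assignment of scores that stretches the edges least, so connected nodes get similar scores and few edges cross between the positive and negative groups.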
Conclusions: The spectrum tells us a lot about the graph: Adjacency: #paths; Laplacian: sparse cut.
References
Fan R. K. Chung: Spectral Graph Theory (AMS)
Chris Godsil and Gordon Royle: Algebraic Graph Theory (Springer)
Bojan Mohar and Svatopluk Poljak: Eigenvalues in Combinatorial Optimization, IMA Preprint Series #939
Gilbert Strang: Introduction to Applied Mathematics (Wellesley-Cambridge Press)
PART 5: Conclusions
Summary: Task 1: Node importance -> SVD, PageRank, HITS; Task 2: Community detection -> METIS, ‘no good cuts’; Task 3: Mining graphs over time -> Tensors; Task 4: Spectral graph theory -> Laplacians
Acknowledgements: Funding: IIS , IIS , DBI , CNS
THANK YOU! Christos Faloutsos