SCS CMU Proximity on Large Graphs Speaker: Hanghang Tong Guest Lecture
SCS CMU 2 Graphs are everywhere!
SCS CMU 3 Food-web: example
SCS CMU 4 Graph Mining: the big picture Graph/Global Level Subgraph/ Community Level Node Level We are here!
SCS CMU 5 Proximity on Graph: What? a.k.a Relevance, Closeness, ‘Similarity’…
SCS CMU 6 Proximity is the main tool behind… Link prediction [Liben-Nowell+], [Tong+] Ranking [Haveliwala], [Chakrabarti+] Management [Minkov+] Image caption [Pan+] Neighborhood Formation [Sun+] Conn. subgraph [Faloutsos+], [Tong+], [Koren+] Pattern match [Tong+] Collaborative Filtering [Fouss+] Many more… Will return to this later
SCS CMU 7 Roadmap Motivation Part I: Definitions Part II: Fast Solutions Part III: Applications Conclusion Basic: RWR Variants Asymmetry of Prox. Group Prox Prox w/ Attributes Prox w/ Time
SCS CMU 8 Why not shortest path? ‘pizza delivery guy’ problem ‘multi-facet’ relationship Some “bad” proximities
SCS CMU 9 Why not max. netflow? No punishment on long paths Some “bad” proximities
SCS CMU 10 Why not “effective conductance”? Some “bad” proximities ‘pizza delivery guy’ problem
SCS CMU 11 What is a “good” Proximity? Multiple connections Quality of connection Direct & indirect connections Length, degree, weight… …
SCS CMU Random walk with restart
SCS CMU 13 Random walk with restart: ranking vector for query Node 4 over the graph's nodes [figure] More red, more relevant Nearby nodes, higher scores
SCS CMU 14 Why is RWR a good score? It sums over all paths from i to j: those of length 1, of length 2, of length 3, …
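A minimal sketch of the "all paths" reading of RWR, under stated assumptions: W is a column-normalized adjacency matrix, c < 1 is the probability of continuing the walk, and the matrix Q = (I - cW)^-1 is compared with the truncated series I + cW + c^2 W^2 + …, whose k-th term weights walks of length k by c^k. The toy graph and the function names are hypothetical, chosen only for illustration.

```python
import numpy as np

def rwr_closed_form(W, c):
    """Q = (I - cW)^-1, the un-scaled RWR matrix."""
    n = W.shape[0]
    return np.linalg.inv(np.eye(n) - c * W)

def rwr_path_series(W, c, max_len):
    """Sum of c^k W^k for k = 0..max_len (walks of length k, damped by c^k)."""
    n = W.shape[0]
    total = np.eye(n)
    term = np.eye(n)
    for _ in range(max_len):
        term = c * (term @ W)   # next power, one more step on every walk
        total += term
    return total

# Hypothetical 4-node undirected toy graph.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = A / A.sum(axis=0, keepdims=True)   # column normalization
c = 0.9

Q_exact = rwr_closed_form(W, c)
Q_series = rwr_path_series(W, c, max_len=400)
print(np.max(np.abs(Q_exact - Q_series)))
```

Because every term is damped by another factor of c, long paths contribute less, which is the "punishment on long paths" that max-flow lacks.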
SCS CMU 16 Variant: escape probability Define Random Walk (RW) on the graph Esc_Prob(A→B) –Prob (starting at A, reaches B before returning to A) Esc_Prob = Pr (smile before cry) [figure: A, B, and the remaining graph]
SCS CMU 18 Other Variants Other measures by RWs –Commute Time/Hitting Time [Fouss+] –SimRank [Jeh+] Equivalence of Random Walks –Electric Networks: EC [Doyle+]; SAEC [Faloutsos+]; CFEC [Koren+] –Spring Systems Katz [Katz], [Huang+], [Scholkopf+] Matrix-Forest-based Alg [Chebotarev+] All are related to, or similar to, random walk with restart!
SCS CMU 20 Asymmetry of Proximity [Tong+ KDD07 a] What is Prox from A to B? What is Prox from B to A? What is Prox between A and B?
SCS CMU 21 Asymmetry also exists in un-directed graphs Hanghang’s most important conf. is KDD The most important author in KDD is... So is love… Hanghang KDD
SCS CMU 23 Group Proximity [Tong+ 2007] Q: How close are Accountants to SECs? A: Prob (starting at any RED, reaches any GREEN before touching any RED again)
SCS CMU 24 Proximity on Attribute Graphs What is the proximity from node 7 to 10? If we know that…
SCS CMU 25 Sol: Augmented graphs
SCS CMU 26 Attributes on nodes/edges (ER graph) [Chakrabarti+ WWW07] Edge types: Wrote, Sent, Received, In-Replied-to, Cited, Works skip
SCS CMU 27 Proximity w/ Time Sol #1: treat time as a categorical attr. [Minkov+] Sol #2: aggregate slice matrices over time [Tong+]: global aggregation, sliding window, exponential emphasis
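The three aggregation schemes named for Sol #2 can be sketched as follows. This is an illustration under assumptions, not the exact weighting of [Tong+]: each time step is taken to contribute one adjacency "slice" matrix, and the exponential scheme is assumed to multiply the running aggregate by a hypothetical decay factor before adding the newest slice.

```python
import numpy as np

def global_aggregation(slices):
    """Every slice counts equally, from the first time step to the last."""
    return np.sum(slices, axis=0)

def sliding_window(slices, window):
    """Only the most recent `window` slices count."""
    return np.sum(slices[-window:], axis=0)

def exponential_emphasis(slices, decay):
    """Older slices fade: agg_t = decay * agg_{t-1} + slice_t (assumed form)."""
    agg = np.zeros_like(slices[0], dtype=float)
    for s in slices:
        agg = decay * agg + s
    return agg

# Three hypothetical 2 x 2 slices for t = 1, 2, 3.
slices = [np.eye(2) * 1.0, np.eye(2) * 2.0, np.eye(2) * 3.0]
g = global_aggregation(slices)
w = sliding_window(slices, window=2)
x = exponential_emphasis(slices, decay=0.5)
print(g[0, 0], w[0, 0], x[0, 0])
```

Any static proximity (for example RWR) can then be run on the aggregated matrix for the time step of interest.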
SCS CMU 28 Summary of Part I Goal: Summarize multiple … relationships Solutions –Basic: Random Walk with Restart –Property: Asymmetry –Variants: Esc_Prob and many others. –Generalization: Group Prox.; w/ Attr.; w/ Time
SCS CMU 29 Motivation Part I: Definitions Part II: Fast Solutions Part III: Applications Conclusion Roadmap B_Lin: RWR FastAllDAP: Esc_Prob BB_Lin: Skewed BGs FastUpdate: Time-Evolving
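SCS CMU 30 Preliminary: Sherman–Morrison Lemma If: $\tilde{A} = A + u v^{T}$, with $A$ invertible and $1 + v^{T} A^{-1} u \neq 0$ Then: $\tilde{A}^{-1} = A^{-1} - \dfrac{A^{-1} u\, v^{T} A^{-1}}{1 + v^{T} A^{-1} u}$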
SCS CMU SM Lemma: Applications RLS –and almost any algorithm in time series! Leave-one-out cross validation for LS Kalman filtering Incremental matrix decomposition … and all the fast sols we will introduce! 31
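A quick numerical check of the Sherman–Morrison lemma in its standard rank-one form: given the inverse of A, the inverse of A + u v^T is obtained without a second inversion. The matrices below are random test data and the function name is hypothetical.

```python
import numpy as np

def sherman_morrison_inverse(A_inv, u, v):
    """Inverse of (A + u v^T), given A_inv = A^-1, in O(n^2) time."""
    Au = A_inv @ u                  # A^-1 u
    vA = v @ A_inv                  # v^T A^-1
    denom = 1.0 + v @ Au            # must be non-zero for the lemma to apply
    if abs(denom) < 1e-12:
        raise ValueError("A + u v^T is (numerically) singular")
    return A_inv - np.outer(Au, vA) / denom

rng = np.random.default_rng(0)
n = 5
A = 10.0 * np.eye(n) + rng.random((n, n))   # diagonally dominant, so invertible
u = rng.random(n)
v = rng.random(n)

updated = sherman_morrison_inverse(np.linalg.inv(A), u, v)
direct = np.linalg.inv(A + np.outer(u, v))
print(np.max(np.abs(updated - direct)))
```

The same idea, applied to updates of rank greater than one, is what lets the fast solutions reuse a pre-computed inverse.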
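SCS CMU 32 Computing RWR: $r_i = c\,W r_i + (1-c)\,e_i$, where $r_i$ is the n x 1 ranking vector, $W$ the n x n (normalized) adjacency matrix, $e_i$ the n x 1 starting vector, and $1-c$ the restart probability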
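A minimal sketch of the two obvious ways to compute an RWR ranking vector, assuming the form r = c W r + (1 - c) e with a column-normalized W and restart probability 1 - c: iterate the equation until it stops changing, or solve the linear system directly. The toy graph, the tolerance, and the function names are assumptions made for illustration.

```python
import numpy as np

def rwr_power_iteration(W, e, c=0.9, tol=1e-12, max_iter=10000):
    """Iterate r <- c W r + (1 - c) e until the L1 change drops below tol."""
    r = e.copy()
    for _ in range(max_iter):
        r_next = c * (W @ r) + (1.0 - c) * e
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

def rwr_closed_form(W, e, c=0.9):
    """r = (1 - c) (I - cW)^-1 e, one linear solve per query."""
    n = W.shape[0]
    return (1.0 - c) * np.linalg.solve(np.eye(n) - c * W, e)

# Hypothetical 4-node undirected toy graph; the query is node 3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = A / A.sum(axis=0, keepdims=True)
e = np.zeros(4)
e[3] = 1.0

r_iter = rwr_power_iteration(W, e)
r_exact = rwr_closed_form(W, e)
print(r_iter, r_exact)
```

The iterative version needs no pre-computation but repeats the work for every query; the closed form is the quantity that pre-computation schemes try to store or approximate.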
SCS CMU 33 Beyond RWR P-PageRank [Haveliwala] PageRank [Haveliwala] RWR [Pan, Sun] SM Learning [Zhou, Zhu] RL in CBIR [He] Fast RWR (B_Lin) Finds the Root Solution ! : Maxwell Equation for Web! [Chakrabarti]
SCS CMU 34 Beyond RWR RWR is the building block for computing… –Escape Probability (augmented w/ sink) [Tong+] –Effective Conductance; Resistance Dist.; Commute Time –MRF (special structure) [Cohen] Similar idea to B_Lin for computing other measurements
SCS CMU 35 Q: Given query i, how to solve it? ? ? Adjacency matrix Starting vector
SCS CMU 36 OnTheFly: No pre-computation/light storage Slow on-line response O(mE)
SCS CMU 37 PreCompute [Haveliwala] R: [figure]
SCS CMU 38 PreCompute: Fast on-line response Heavy pre-computation/storage cost: O(n^3) / O(n^2)
SCS CMU 39 Q: How to Balance? On-line Off-line
SCS CMU 40 B_Lin: Basic Idea [Tong+] Find Community Fix the remaining Combine
SCS CMU 41 Pre-computational stage Q: Efficiently compute and store Q A: A few small, instead of ONE BIG, matrix inversions
SCS CMU 42 On-Line Query Stage Q: Efficiently recover one column of Q A: A few, instead of MANY, matrix-vector multiplications
SCS CMU 43 Pre-compute Stage p1: B_Lin Decomposition –P1.1 partition –P1.2 low-rank approximation p2: Q matrices –P2.1 computing (for each partition) –P2.2 computing (for concept space)
SCS CMU 44 P1.1: partition Within-partition linkscross-partition links skip
SCS CMU 45 P1.1: block-diagonal skip
SCS CMU 46 P1.2: low-rank approximation (LRA) of W2, for |S| << |W2| skip
SCS CMU 47 + = skip
SCS CMU 48 p2.1 Computing c skip
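SCS CMU 49 Comparing Q1 = diag(Q1,1, Q1,2, …, Q1,k) with one full-size inversion Computing Time –100,000 nodes; 100 partitions –Computing is 10,000x faster (assuming equal-size partitions and cubic-cost inversion: 100 x 1,000^3 = 10^11 vs. 100,000^3 = 10^15 operations) Storage Cost –100x saving (100 x 1,000^2 = 10^8 vs. 100,000^2 = 10^10 entries) skip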
SCS CMU 50 Q: How to fix the green portions? + ~ ~ ~ + ? skip
SCS CMU 51 p2.2 Computing: U V = _ Q 1,1 Q 1,2 Q 1,k skip
SCS CMU 52 SM Lemma says: We have: Communities Bridges skip
SCS CMU 53 On-Line Stage Q + Query Result ? A (SM lemma) Pre-Computation skip
SCS CMU 54 On-Line Query Stage q1: q2: q3: q4: q5: q6: skip
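The two stages can be sketched end to end as follows. This is a toy version under assumptions, not the authors' implementation: the partition is given by hand rather than found by a graph partitioner, the cross-partition part W2 is approximated by a truncated SVD (W2 ≈ U S V), and all matrices are dense. The on-line formula is the Sherman–Morrison–Woodbury identity applied to I - c(W1 + U S V); the function names are hypothetical.

```python
import numpy as np

def b_lin_precompute(W, blocks, c, rank):
    """Off-line: invert each diagonal block, low-rank the rest, build Lambda."""
    W1 = np.zeros_like(W)
    Q1 = np.zeros_like(W)
    for idx in blocks:                       # a few small inversions
        ix = np.ix_(idx, idx)
        W1[ix] = W[ix]
        Q1[ix] = np.linalg.inv(np.eye(len(idx)) - c * W[ix])
    W2 = W - W1                              # cross-partition links
    U, s, Vt = np.linalg.svd(W2)
    U, s, Vt = U[:, :rank], s[:rank], Vt[:rank, :]
    Lam = np.linalg.inv(np.diag(1.0 / s) - c * Vt @ Q1 @ U)
    return Q1, U, Lam, Vt

def b_lin_query(pre, e, c):
    """On-line: a few matrix-vector products recover one ranking vector."""
    Q1, U, Lam, Vt = pre
    q = Q1 @ e
    return (1.0 - c) * (q + c * (Q1 @ (U @ (Lam @ (Vt @ q)))))

# Hypothetical graph: two triangles joined by the single bridge edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
W = A / A.sum(axis=0, keepdims=True)         # column normalization
c = 0.9
e = np.zeros(6)
e[0] = 1.0

pre = b_lin_precompute(W, blocks=[[0, 1, 2], [3, 4, 5]], c=c, rank=2)
r_fast = b_lin_query(pre, e, c)
r_exact = (1.0 - c) * np.linalg.solve(np.eye(6) - c * W, e)
print(np.max(np.abs(r_fast - r_exact)))
```

In this toy case W2 has rank exactly 2, so the result matches the exact solution; on a real graph the rank is truncated and the answer is an approximation, which is where the quality figure on the results slide comes from.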
SCS CMU 56 Query Time vs. Pre-Compute Time Log Query Time Log Pre-compute Time Quality: 90%+ On-line: Up to 150x speedup Pre-computation: Two orders saving
SCS CMU 58 FastAllDAP [Tong+] Footnote: augmented w/ universal sink as practical modification A B the remaining graph Q: How to compute –Esc_Prob = Pr (smile before cry)?
SCS CMU 59 Solving DAP (Straight-forward way) One matrix inversion, one proximity! 1 x (n-2) (n-2) x (n-2) 1-c: fly-out probability (to black-hole)
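SCS CMU 60 $\mathrm{Esc\_Prob}(1 \to 5) = c^{2}\, P(1, \mathcal{I})\, \bigl(I - c\,P(\mathcal{I}, \mathcal{I})\bigr)^{-1} P(\mathcal{I}, 5) + c\,P(1, 5)$, where P is the transition matrix (row norm.) and $\mathcal{I}$ is the set of the other n-2 nodes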
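The straightforward solver can be sketched as below, assuming a row-normalized transition matrix P and a walker that survives each step with probability c (so 1 - c is the fly-out probability): one (n-2) x (n-2) inversion per source-target pair. The 3-node path graph is a hypothetical example small enough to check by hand, and the function name is an assumption.

```python
import numpy as np

def esc_prob_direct(P, i, j, c):
    """Pr(start at i, reach j before returning to i), one inversion per pair."""
    n = P.shape[0]
    rest = [k for k in range(n) if k not in (i, j)]     # the other n-2 nodes
    G = np.linalg.inv(np.eye(len(rest)) - c * P[np.ix_(rest, rest)])
    return c**2 * P[i, rest] @ G @ P[rest, j] + c * P[i, j]

# Path graph 0 - 1 - 2, row-normalized.
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
c = 0.9

p_02 = esc_prob_direct(P, 0, 2, c)   # 0 -> 1 (prob c), then 1 -> 2 (prob c/2)
p_01 = esc_prob_direct(P, 0, 1, c)   # single forced step, prob c
p_10 = esc_prob_direct(P, 1, 0, c)   # step to 0 with prob c/2
print(p_02, p_01, p_10)
```

The last two values differ (0.9 vs. 0.45) even though the graph is undirected, which is the asymmetry of proximity discussed in Part I.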
SCS CMU 61 Challenges Case 1: Medium Size Graph –Matrix inversion is feasible, but… –What if we want many proximities? –Q: How to get all (n^2) proximities efficiently? –A: FastAllDAP! Case 2: Large Size Graph –Matrix inversion is infeasible –Q: How to get one proximity efficiently? –A: FastOneDAP! skip
SCS CMU 62 FastAllDAP Q1: How to efficiently compute all possible proximities on a medium size graph? –a.k.a. how to efficiently solve multiple linear systems simultaneously? Goal: reduce # of matrix inversions!
SCS CMU 63 FastAllDAP: Observation Need two different matrix inversions! P=
SCS CMU 64 FastAllDAP: Rescue Redundancy among different linear systems! P= Overlap between two gray parts! Prox(1→5) Prox(1→6)
SCS CMU 65 FastAllDAP: Theorem Theorem: Proof: by SM Lemma Example:
SCS CMU 66 FastAllDAP: Algorithm Alg. –Compute Q –For i,j = 1,…, n, compute Prox(i→j) from Q Computational Save: O(1) instead of O(n^2) matrix inversions! Example –w/ 1000 nodes, –1M matrix inversions vs. 1 matrix inversion!
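A sketch of the "one inversion for all pairs" idea. The theorem's formula did not survive on the slide, so the identity used here is an assumption derived from the definitions above rather than a quotation: with Q = (I - cP)^-1, the escape probability is taken as Q(i,j) / (Q(i,i) Q(j,j) - Q(i,j) Q(j,i)). The code checks it numerically against the one-inversion-per-pair solver on a hypothetical 5-node graph.

```python
import numpy as np

def esc_prob_direct(P, i, j, c):
    """Reference solver: one (n-2) x (n-2) inversion per pair."""
    n = P.shape[0]
    rest = [k for k in range(n) if k not in (i, j)]
    G = np.linalg.inv(np.eye(len(rest)) - c * P[np.ix_(rest, rest)])
    return c**2 * P[i, rest] @ G @ P[rest, j] + c * P[i, j]

def all_esc_probs(P, c):
    """All n^2 proximities from a single n x n inversion (assumed identity)."""
    n = P.shape[0]
    Q = np.linalg.inv(np.eye(n) - c * P)
    d = np.diag(Q)
    denom = np.outer(d, d) - Q * Q.T          # Q_ii Q_jj - Q_ij Q_ji
    prox = np.zeros((n, n))
    off = ~np.eye(n, dtype=bool)              # skip the diagonal (i == j)
    prox[off] = Q[off] / denom[off]
    return prox

# Hypothetical 5-node undirected graph.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 0, 1],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)          # row normalization
c = 0.9

fast = all_esc_probs(P, c)
slow = np.zeros_like(fast)
for i in range(5):
    for j in range(5):
        if i != j:
            slow[i, j] = esc_prob_direct(P, i, j, c)
print(np.max(np.abs(fast - slow)))
```

After Q is computed, each proximity costs a constant number of arithmetic operations, which is the source of the saving claimed on the slide.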
SCS CMU 67 FastAllDAP Size of Graph Time (sec) Straight-Solver FastAllDAP 1,000x faster!
SCS CMU 69 RWR on Bipartite Graph: n authors x m conferences (Author-Conf. matrix) Observation: n >> m! Examples: 1. DBLP: 400k authors, 3.5k confs 2. NetFlix: 2.7M users, 18k movies
SCS CMU 70 Q: Given query i, how to solve it? RWR on Skewed bipartite graphs ? ? ….... ….. … n m Ar ….... ….. …... Ac
SCS CMU 73 BB_Lin: Pre-Computation [Tong+ 06] Step 1: M = Ac × Ar (2-step RWR for conferences; m x m, built from the E edges of Ac/Ar) Step 2: all Conf-Conf prox. scores Cost: Examples –NetFlix: 1.5hr for pre-computation; –DBLP: a few minutes
SCS CMU BB_Lin: On-Line Stage 74 Ac/Ar E edges Case 1: - Conf - Conf authors Conferences Read out !
SCS CMU BB_Lin: On-Line Stage 75 Ac/Ar E edges Case 2: - Au - Conf authors Conferences 1 matrix-vec!
SCS CMU 76 BB_Lin: On-Line Stage Ac/Ar E edges Case 3: - Au - Au authors Conferences 2 matrix-vec!
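The pre-computation and the three on-line cases can be sketched as follows. The notation is an assumption: the full normalized matrix is split into an author-by-conference block W_ac (n x m) and a conference-by-author block W_ca (m x n), which are taken to play the roles of Ar and Ac on the slides, and the core matrix is Lam = (I - c^2 W_ca W_ac)^-1, obtained from block inversion of I - cW. Function names and the toy data are hypothetical.

```python
import numpy as np

def bb_lin_precompute(W_ac, W_ca, c):
    """Off-line: one m x m inversion on the small (conference) side."""
    m = W_ca.shape[0]
    M = W_ca @ W_ac                           # 2-step walk conf -> author -> conf
    return np.linalg.inv(np.eye(m) - c**2 * M)

def conf_to_conf(Lam, j, c):
    """Case 1: read out one column of the core matrix."""
    return (1.0 - c) * Lam[:, j]

def author_to_conf(Lam, W_ca, a, c):
    """Case 2: one matrix-vector product."""
    return (1.0 - c) * c * (Lam @ W_ca[:, a])

def author_to_author(Lam, W_ac, W_ca, a, c):
    """Case 3: two matrix-vector products."""
    e = np.zeros(W_ac.shape[0])
    e[a] = 1.0
    return (1.0 - c) * (e + c**2 * (W_ac @ (Lam @ W_ca[:, a])))

# Hypothetical author-conference counts: n = 5 authors, m = 2 conferences.
B = np.array([[1, 0], [2, 1], [0, 1], [1, 1], [0, 3]], dtype=float)
n, m = B.shape
A = np.block([[np.zeros((n, n)), B], [B.T, np.zeros((m, m))]])
W = A / A.sum(axis=0, keepdims=True)          # column normalization
W_ac, W_ca = W[:n, n:], W[n:, :n]
c = 0.9

Lam = bb_lin_precompute(W_ac, W_ca, c)
full = (1.0 - c) * np.linalg.inv(np.eye(n + m) - c * W)   # reference answer
print(conf_to_conf(Lam, 0, c), author_to_conf(Lam, W_ca, 1, c))
```

Only an m x m matrix is ever inverted or stored, which is why the approach pays off when n >> m.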
SCS CMU BB_Lin: Examples NetFlix dataset (2.7m user x 18k movies) –1.5hr for pre-computation; –<1 sec for on-line DBLP dataset (400k authors x 3.5k confs) –A few minutes for pre-computation –<0.01 sec for on-line 77
SCS CMU 79 Challenges BB_Lin is good for skewed bipartite graphs –for NetFlix (2.7M nodes and 100M edges) –w/ 1.5 hr pre-computation for m x m core matrix –fraction of seconds for on-line query But…what if the graph is evolving over time –New edges/nodes arrive; edge weights increase… –1.5hr itself becomes a part of on-line cost!
SCS CMU 80 t=0 Q: How to update the core matrix? t=1 ~ ~ ?
SCS CMU 81 Update the core matrix Step 1: $\tilde{M} = \tilde{A}_c \times \tilde{A}_r = M + (\text{rank-2 update})$ Step 2:
SCS CMU 82 Update: General Case [Tong+ 2008] E’ edges changed, involving n’ authors and m’ confs. (out of n authors and m conferences) $\tilde{M} = \tilde{A}_c \times \tilde{A}_r$
SCS CMU 83 Update: General Case Observation: –the rank of update is small! Algorithm: –E’ edges changed –Involves n’ authors, m’ confs. –our Alg. –(details in the paper)
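Since the slides defer the algorithm to the paper, the following is only a generic sketch of why a small-rank update is cheap, not the FastUpdate procedure itself. It assumes the core matrix is Lam = (I - c^2 M)^-1 and that the change in M is given in factored form X Y with k columns/rows; the Woodbury identity then refreshes Lam with one k x k inversion. All names and the random test data are hypothetical.

```python
import numpy as np

def update_core_inverse(Lam, X, Y, c):
    """Return (I - c^2 (M + X Y))^-1 given Lam = (I - c^2 M)^-1."""
    k = X.shape[1]
    small = np.linalg.inv(np.eye(k) - c**2 * (Y @ Lam @ X))   # k x k only
    return Lam + c**2 * (Lam @ X @ small @ Y @ Lam)

rng = np.random.default_rng(1)
m, k, c = 6, 2, 0.9
M = rng.random((m, m))
M /= M.sum(axis=0, keepdims=True)            # column-stochastic core matrix
X = 0.01 * rng.random((m, k))                # small rank-k change, M_new = M + X Y
Y = 0.01 * rng.random((k, m))

Lam_old = np.linalg.inv(np.eye(m) - c**2 * M)
Lam_fast = update_core_inverse(Lam_old, X, Y, c)
Lam_full = np.linalg.inv(np.eye(m) - c**2 * (M + X @ Y))
print(np.max(np.abs(Lam_fast - Lam_full)))
```

The cost depends on m and on the rank k of the change rather than on a fresh m x m inversion, which matches the observation that the rank of the update is small.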
SCS CMU 84 FastOneUpdate 176x speedup 40x speedup Time (Seconds) Datasets
SCS CMU 85 Fast-Batch-Update Min (n’, m’)E’ Time (Seconds) 15x speed-up on average!
SCS CMU 86 Summary of Part II Goal: Efficiently Solve Linear System(s) Sols. –B_Lin: approximate one large linear system –FastAllDAP: multiple inter-related linear systems –BB_Lin: the intrinsic complexity is small –FastUpdate: (smooth) dynamic linear system
SCS CMU 87 B_Lin FastAllDAP … BB_Lin … FastUpdate
SCS CMU 88 Motivation Part I: Definitions Part II: Fast Solutions Part III: Applications Conclusion Roadmap Link Prediction NF gCap CePS G-Ray pTrack/cTrack
SCS CMU 89 Link Prediction: existence [figure: density of Prox(i→j) + Prox(j→i), with link vs. no link] Prox. is effective to distinguish red and blue!
SCS CMU 90 Link Prediction: direction Q: Given the existence of the link, what is the direction of the link? A: Compare Prox(i→j) and Prox(j→i) >70% [figure: density of Prox(i→j) - Prox(j→i)]
SCS CMU 91 Neighborhood Formation Q: What is the most related conference to ICDM? A: RWR! [Sun ICDM2005] [figure: Conference-Author bipartite graph]
SCS CMU 92 NF: example
SCS CMU 93 gCaP: Automatic Image Caption Q … SeaSunSkyWave {} {} CatForestGrassTiger {?, ?, ?,} ? A: RWR! [Pan KDD2004]
SCS CMU 94 Test Image SeaSunSkyWaveCatForestTigerGrass Image Keyword Region
SCS CMU 95 Test Image SeaSunSkyWaveCatForestTigerGrass Image Keyword Region {Grass, Forest, Cat, Tiger}
SCS CMU 96 Center-Piece Subgraph(CePS) ? Original Graph Black: query nodes CePS Q A: RWR! [Tong KDD 2006] Red: Max (Prox(Red, A) x Prox(Red, B) x Prox(Red, C)) CePS guy
SCS CMU 97 CePS: Example
SCS CMU 98 K_SoftAnd: Relaxation of AND Asking AND query? No Answer! Disconnected Communities Noise
SCS CMU 99 2_SoftAnd And 1_SoftAnd (OR) x 1e-4
SCS CMU 100 CePS: 2 Soft_AND Stat. DB
SCS CMU 101 OutputInput Attributed Data Graph Query Graph Matching Subgraph Graph X-Ray
SCS CMU 102 G-Ray: How to? matching node Goodness = Prox (12, 4) x Prox (4, 12) x Prox (7, 4) x Prox (4, 7) x Prox (11, 7) x Prox (7, 11) x Prox (12, 11) x Prox (11, 12)
SCS CMU 103 Effectiveness: star-query Query Result
SCS CMU 104 Effectiveness: line-query Query Result
SCS CMU 105 Query Result Effectiveness: loop-query
SCS CMU 106 pTrack [Given] –(1) a large, skewed, time-evolving bipartite graph, –(2) the query nodes of interest [Track] –(1) top-k most related nodes for each query node at each time step t; –(2) the proximity score (or rank of proximity) between any two query nodes at each time step t [figure: author A's rank in KDD, by year]
SCS CMU 107 Philip S. Yu’s Top-5 conferences up to each year ICDE ICDCS SIGMETRICS PDIS VLDB CIKM ICDCS ICDE SIGMETRICS ICMCS KDD SIGMOD ICDM CIKM ICDCS ICDM KDD ICDE SDM VLDB Databases Performance Distributed Sys. Databases Data Mining
SCS CMU 108 KDD’s Rank wrt. VLDB over years [figure: rank vs. year] Data Mining and Databases are more and more relevant!
SCS CMU 109 cTrack [Given] –(1) a large, skewed, time-evolving graph, –(2) the query nodes of interest [Track] –(1) top-k most central nodes at each time step t; –(2) the centrality score (or rank of centrality) for each query node at each time step t
SCS CMU 110 Ranking of Centrality up to each year (in NIPS) M. Jordan G.Hinton C. Koch T. Sejnowski Year Rank of Influential-ness
SCS CMU most influential authors up to each year Author-paper bipartite graph from NIPS k papers, 2037 authors, spreading over 13 years T. Sejnowski M. Jordan
SCS CMU 112 RWR Variants w/ Time w/ Attribute Group Prox. Definitions B_Lin FastAllDAP BB_Lin FastUpdate Computations Link Prediction NF gCap CePS G-Ray pTrack cTrack Applications Proximity On Graphs Weighted Multiple Relationship Efficiently Solve Linear System(s) Use Proximity as Building block
SCS CMU Take-home Messages Proximity Definitions –RWR –and a lot of variants Computations –SM Lemma 113
SCS CMU References
L. Page, S. Brin, R. Motwani & T. Winograd. (1998) The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford Library.
T.H. Haveliwala. (2002) Topic-Sensitive PageRank. In WWW.
J.Y. Pan, H.J. Yang, C. Faloutsos & P. Duygulu. (2004) Automatic multimedia cross-modal correlation discovery. In KDD.
C. Faloutsos, K.S. McCurley & A. Tomkins. (2004) Fast discovery of connection subgraphs. In KDD.
J. Sun, H. Qu, D. Chakrabarti & C. Faloutsos. (2005) Neighborhood Formation and Anomaly Detection in Bipartite Graphs. In ICDM.
W. Cohen. (2007) Graph Walks and Graphical Models. Draft.
P. Doyle & J. Snell. (1984) Random walks and electric networks, volume 22. Mathematical Association of America, New York.
Y. Koren, S.C. North & C. Volinsky. (2006) Measuring and extracting proximity in networks. In KDD, 245–255.
A. Agarwal, S. Chakrabarti & S. Aggarwal. (2006) Learning to rank networked entities. In KDD, 14–23.
S. Chakrabarti. (2007) Dynamic personalized pagerank in entity-relation graphs. In WWW.
F. Fouss, A. Pirotte, J.-M. Renders & M. Saerens. (2007) Random-Walk Computation of Similarities between Nodes of a Graph with Application to Collaborative Recommendation. IEEE Trans. Knowl. Data Eng. 19(3).
H. Tong & C. Faloutsos. (2006) Center-piece subgraphs: problem definition and fast solutions. In KDD.
H. Tong, C. Faloutsos & J.Y. Pan. (2006) Fast Random Walk with Restart and Its Applications. In ICDM.
H. Tong, Y. Koren & C. Faloutsos. (2007) Fast direction-aware proximity for graph mining. In KDD.
H. Tong, B. Gallagher, C. Faloutsos & T. Eliassi-Rad. (2007) Fast best-effort pattern matching in large attributed graphs. In KDD.
H. Tong, S. Papadimitriou, P.S. Yu & C. Faloutsos. (2008) Proximity Tracking on Time-Evolving Bipartite Graphs. To appear in SDM.
SCS CMU 117 Thank you!