Speaker: Hanghang Tong Carnegie Mellon University

Presentation transcript:

Proximity on Large Graphs: definitions, fast solutions and applications. Speaker: Hanghang Tong, Carnegie Mellon University. 2008-7-31, IBM T.J. Watson

Joint work with Christos Faloutsos (CMU), Jia-Yu Pan (Google), Yehuda Koren (AT&T Labs), Spiros Papadimitriou, Philip S. Yu, Huiming Qu and Hani Jamjoom (IBM), Tina Eliassi-Rad (LLNL), Brian Gallagher (LLNL), Kensuke Oonuma (Sony Corp.), Yasushi Sakurai (NTT Labs)

Graphs are everywhere!

Graph Mining: Big Picture. Graph Level: Patterns, Laws, Generators. Subgraph Level: Community. Node Level: Association, Correlation, Causality, Proximity (We are here!). Not covered parts: generator, sampling, optimal graphs…; anomaly could cross all 3 levels. [Figure: example graph of people linked through Cell Phone, SameTime and Lotus Mail]

Proximity on Graph: What? a.k.a Relevance, Closeness, ‘Similarity’…

Proximity on Graphs: Why? Link prediction [Liben-Nowell+], [Tong+]; Ranking [Haveliwala], [Chakrabarti+]; Email Management [Minkov+]; Image caption [Pan+]; Neighborhood Formation [Sun+]; Conn. subgraph [Faloutsos+], [Tong+], [Koren+]; Pattern match [Tong+]; Collaborative Filtering [Fouss+]; Many more… Will return to this later

Link Prediction. Q: How to predict the existence of the link? A: Proximity! [Liben-Nowell+ 2003] Prox. is effective for both ‘deleted’ and absent edges! [Figures: Prox. histograms (density vs. Prox(i→j) + Prox(j→i)), one for a set of deleted links and one for a set of absent links]

Neighborhood Search on graphs … … … … Conference Author Q: what is most related conference to ICDM? A: Proximity! [Sun+ ICDM2005]

Automatic Image Caption Region Image Test Image Keyword Sea Sun Sky Wave Cat Forest Tiger Grass Q: How to assign keywords to the test image? A: Proximity! [Pan+ 2004]

Center-Piece Subgraph(CePS) Input Output CePS guy CePS Original Graph Q: How to find hub for the black nodes? A: Proximity! [Tong+ KDD 2006]

Best-Effort Pattern Match. Input: Query Graph and Data Graph. Output: Matching Subgraph. Q: How to find matching subgraph? A: Proximity! [Tong+ KDD 2007 b]

Roadmap: Motivation; Part I: Definitions (Basic: RWR; Variants; Properties; Generalizations); Part II: Fast Solutions; Part III: Applications; Conclusion

Some ``bad’’ proximities. Why not shortest path? ‘Pizza delivery guy’ problem; ‘multi-facet’ relationship

Some ``bad’’ proximities. Why not max. netflow? No punishment on long paths

What is a ``good’’ Proximity? Multiple connections; quality of connection: direct & indirect conns; length, degree, weight…

Random walk with restart [Figure: example graph with nodes 1 to 12]

Random walk with restart [Figure: the same 12-node graph with query node 4 and its ranking vector over Node 1 to Node 12; scores shown include 0.22, 0.13, 0.10, 0.08, 0.05, 0.04, 0.03, 0.02] Nearby nodes, higher scores. More red, more relevant

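Why RWR is a good score? $Q = (I - c\tilde{W})^{-1} = I + c\tilde{W} + c^2\tilde{W}^2 + c^3\tilde{W}^3 + \cdots$, where $\tilde{W}$: adjacency matrix, c: damping factor. $\tilde{W}(i,j)$ covers all paths from i to j with length 1, $\tilde{W}^2(i,j)$ those with length 2, $\tilde{W}^3(i,j)$ those with length 3, so $Q(i,j)$ is a weighted sum over all paths from i to j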
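The path-sum reading of RWR can be checked numerically. The sketch below is only an illustration under stated assumptions: the 4-node graph, the helper name `column_normalize` and the value c = 0.9 are all made up for the example, and the adjacency matrix is assumed to be column-normalized.

```python
import numpy as np

def column_normalize(adj):
    """Divide each column by its sum so that every non-empty column sums to 1."""
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0          # leave empty columns as zeros
    return adj / col_sums

# Hypothetical undirected graph: path 0-1-2-3 plus the extra edge 1-3.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 1],
                [0, 1, 0, 1],
                [0, 1, 1, 0]], dtype=float)
W = column_normalize(adj)
c = 0.9                                     # damping factor (assumed value)

# Closed form: Q = (I - cW)^{-1}
Q_closed = np.linalg.inv(np.eye(4) - c * W)

# Path sum: I + cW + c^2 W^2 + ... ; the k-th term weights length-k paths by c^k.
Q_series = np.zeros((4, 4))
term = np.eye(4)
for _ in range(500):
    Q_series += term
    term = c * W @ term

print(np.abs(Q_closed - Q_series).max())    # difference between the two forms
```

Because c < 1 and the columns of W sum to 1, long paths are discounted geometrically and the series converges to the closed form.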

Roadmap: Motivation; Part I: Definitions (Basic: RWR; Variants; Properties; Generalizations); Part II: Fast Solutions; Part III: Applications; Conclusion

Variant: escape probability. Define Random Walk (RW) on the graph. Esc_Prob(CMU → IBM) = Prob (starting at CMU, reaches IBM before returning to CMU). [Figure: CMU and IBM connected through the remaining graph] Esc_Prob = Pr (smile before cry). This measurement, with some small modifications, also meets our requirements.
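One standard way to evaluate such an escape probability on a small graph is to solve a linear system for the hitting probabilities of the intermediate nodes. The sketch below assumes a row-stochastic transition matrix `P` and a hypothetical helper name `escape_probability`; it does not include the "small modifications" mentioned on the slide.

```python
import numpy as np

def escape_probability(P, src, dst):
    """Probability that a walk leaving src reaches dst before returning to src.

    P is row-stochastic: P[u, v] = Pr(next node is v | current node is u).
    """
    n = P.shape[0]
    inner = [k for k in range(n) if k not in (src, dst)]
    # v[k] = Pr(walk started at k hits dst before src); v[dst] = 1, v[src] = 0.
    v = np.zeros(n)
    v[dst] = 1.0
    if inner:
        # For inner nodes: v = P_inner v + P[inner, dst], a linear system.
        A = np.eye(len(inner)) - P[np.ix_(inner, inner)]
        v[inner] = np.linalg.solve(A, P[inner, dst])
    # First step out of src, then hit dst before src.
    return float(P[src] @ v)

# Hypothetical path graph 0 - 1 - 2, with src = 0 and dst = 2.
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
print(escape_probability(P, src=0, dst=2))   # the walk must pass node 1, then picks 0 or 2
```

The solve step assumes every intermediate node can reach `src` or `dst`; otherwise the system is singular.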

Other Variants. Other measures by RWs: Commute Time/Hitting Time [Fouss+]; SimRank [Jeh+]. Equivalence of Random Walks: Electric Networks: EC [Doyle+], SAEC [Faloutsos+], CFEC [Koren+]; String Systems. Katz [Katz], [Huang+], [Scholkopf+]. Matrix-Forest-based Alg [Chebotarev+]. All are “related to” or “similar to” random walk with restart! All these measurements are very similar or closely related to RWR. By related, I mean that we can actually compute them based on RWR. By similar, I mean we can use a similar idea to do the fast computation!

Chaptering different measurements [Diagram relating RWR, Katz, Esc_Prob, Esc_Prob + Sink, Hitting Time/Commute Time, Harmonic Func. and Effective Conductance; views: Regularized Un-constrained Quad Opt., Constrained Quad Opt., String System (“voltage = position”), Physical Models, Mathematic Tools; edge labels include “Normalize”, “relax”, “X out-degree” and “4 ssp decides 1 esc_prob”]

Roadmap: Motivation; Part I: Definitions (Basic: RWR; Variants; Properties; Generalizations); Part II: Fast Solutions; Part III: Applications; Conclusion

Property: Monotonicity We want: A: degree preserving! [Koren+ KDD06][Tong+ KDD07a][Tong+ SDM08]

Property: Asymmetry [Tong+ KDD07 a] What is Prox from A to B? What is Prox from B to A? What is Prox between A and B?

Asymmetry in un-directed graphs. Hanghang’s #1 employer is IBM. The #1 employee of IBM is ... [Figure: Hanghang and IBM] So is love…
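Asymmetry on an undirected graph is easy to reproduce with RWR. The following is a small sketch, not the direction-aware proximity of [Tong+ KDD07 a]: it assumes a hypothetical star graph, a column-normalized adjacency matrix and c = 0.9, and compares the score of the hub seen from a leaf with the score of a leaf seen from the hub.

```python
import numpy as np

def rwr_scores(adj, query, c=0.9):
    """Solve r = c*W*r + (1-c)*e for a column-normalized W (dense, small graphs only)."""
    n = adj.shape[0]
    W = adj / adj.sum(axis=0)          # assumes every node has at least one edge
    e = np.zeros(n)
    e[query] = 1.0
    return (1 - c) * np.linalg.solve(np.eye(n) - c * W, e)

# Hypothetical star graph: node 0 is the hub, nodes 1..k are leaves.
k = 3
adj = np.zeros((k + 1, k + 1))
adj[0, 1:] = 1.0
adj[1:, 0] = 1.0

leaf_to_hub = rwr_scores(adj, query=1)[0]   # Prox(leaf -> hub)
hub_to_leaf = rwr_scores(adj, query=0)[1]   # Prox(hub -> leaf)
print(leaf_to_hub, hub_to_leaf)
```

The graph is symmetric, yet a leaf has only one neighbor to care about while the hub spreads its score over k leaves, so the two directions differ by a factor of k.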

Roadmap: Motivation; Part I: Definitions (Basic: RWR; Variants; Properties; Generalizations); Part II: Fast Solutions; Part III: Applications; Conclusion

Group Proximity [Tong+ KDD07 a] Q: How close are Accountants to SECs? A: Prob (starting at any RED, reaches any GREEN before touching any RED again)

Proximity on Attributed Graphs [Tong+ KDD07 b] What is the proximity from node 7 to 10? If we know that…

A: Augmented graphs [Tong+ KDD07 b]

More on Generalizations: Attributes on edges [Chakrabarti+ KDD 06]; Proximity w/ Time [Minkov+], [Tong+ SDM 2008], [Tong+ CIKM 2008]; Proximity w/ Side Information [Tong+ 2008]; …

Summary of Part I. Goal: Summarize multiple … relationship. Solutions: Basic: Random Walk with Restart [Pan+ 2004][Sun+ 2006][Tong+ 2006]; Properties: Asymmetry, monotonicity [Koren+ 2006][Tong+ 2007][Tong+ 2008]; Variants: Esc_Prob and many others [Faloutsos+ 2004][Koren+ 2006][Tong+ 2007]; Generalizations: Group Prox, w/ Attr., w/ Time, w/ Side Information [Chakrabarti+ 2006][Tong+ 2007][Tong+ 2008]

Roadmap Motivation Part I: Definitions Part II: Fast Solutions Part III: Applications Conclusion

Roadmap: Motivation; Part I: Definitions; Part II: Fast Solutions (B_Lin: RWR; BB_Lin: Skewed BGs; FastUpdate: Time-Evolving); Part III: Applications; Conclusion

Preliminary: Sherman–Morrison Lemma. If: $\tilde{A} = A + uv^T$. Then: $\tilde{A}^{-1} = A^{-1} - \frac{A^{-1}uv^TA^{-1}}{1 + v^TA^{-1}u}$
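A quick numerical check of the Sherman–Morrison lemma may help before it is used in the fast solutions. The matrix and vectors below are arbitrary illustrative values, and `sherman_morrison` is a hypothetical helper name.

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    """Inverse of (A + u v^T), given A^{-1} and column vectors u, v.

    Uses (A + u v^T)^{-1} = A^{-1} - A^{-1} u v^T A^{-1} / (1 + v^T A^{-1} u).
    """
    denom = 1.0 + (v.T @ A_inv @ u).item()
    return A_inv - (A_inv @ u @ v.T @ A_inv) / denom

# Arbitrary small example (values chosen only for illustration).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
u = np.array([[1.0], [2.0], [3.0]])
v = np.array([[1.0], [0.0], [1.0]])

A_inv = np.linalg.inv(A)
updated = sherman_morrison(A_inv, u, v)     # costs O(n^2) once A_inv is known
direct = np.linalg.inv(A + u @ v.T)         # costs O(n^3) from scratch
print(np.abs(updated - direct).max())
```

The point of the lemma is the cost gap: a rank-1 change to A needs only matrix-vector products on the already known inverse.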

SM: The block-form. [Figure: block matrix with blocks A, B, C, D] Or… And many other variants… Also known as Woodbury Identity
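For reference, the two standard identities behind this slide can be written out as follows. The symbols U, C, V and S are the usual textbook names and are assumptions here, since the slide's own formulas did not survive in the transcript.

```latex
% Woodbury identity (rank-k generalization of Sherman-Morrison):
(A + UCV)^{-1} = A^{-1} - A^{-1}U\left(C^{-1} + VA^{-1}U\right)^{-1}VA^{-1}

% Block-form inverse, with Schur complement S = D - CA^{-1}B:
\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1}
=
\begin{bmatrix}
A^{-1} + A^{-1}BS^{-1}CA^{-1} & -A^{-1}BS^{-1} \\
-S^{-1}CA^{-1} & S^{-1}
\end{bmatrix}
```

Both hold whenever the inverses on the right-hand side exist.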

SM Lemma: Applications. RLS (Recursive least squares) and almost any algorithm in time series! Leave-one-out cross validation for LSR; Kalman filtering; Incremental matrix decomposition; … and all the fast sol.s we will introduce!

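Computing RWR: $r_i = c\tilde{W}r_i + (1-c)e_i$, where $r_i$: ranking vector (n x 1); $\tilde{W}$: adjacency matrix (n x n); $e_i$: starting vector (n x 1); $1-c$: restart p. [Figure: the 12-node example graph]

Q: Given query i, how to solve it? In $r_i = c\tilde{W}r_i + (1-c)e_i$, the adjacency matrix $\tilde{W}$ and the starting vector $e_i$ are given; the ranking vector $r_i$ appears on both sides and is the unknown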

OntheFly: No pre-computation/light storage. Slow on-line response: O(mE). [Figure: the 12-node example graph with the ranking scores for the query node]

PreCompute [Haveliwala+ 2002]: R: c x Q [Figure: the 12-node example graph and the pre-computed matrix R holding the ranking vectors of all nodes]

PreCompute: Fast on-line response. Heavy pre-computation/storage cost: O(n^3) / O(n^2). [Figure: the 12-node example graph with the ranking scores for the query node]

Q: How to Balance? On-line Off-line

B_Lin: Basic Idea [Tong+ ICDM 2006]. Find Community; Fix the remaining; Combine. [Figure: the 12-node example graph split into the communities {1, 2, 3, 4} and {5, …, 12}, then recombined into the ranking scores for the query node]

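B_Lin: details. $\tilde{W} = \tilde{W}_1 + \tilde{W}_2$, where $\tilde{W}_1$: within community; $\tilde{W}_2$: cross-community

B_Lin: details. $(I - c\tilde{W})^{-1} \approx (I - c\tilde{W}_1 - cUSV)^{-1}$, where $USV$ is the LRA of the difference $\tilde{W} - \tilde{W}_1$. $I - c\tilde{W}_1$ is easy to be inverted; the rest follows from the SM Lemma!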
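The decomposition above can be sketched end to end on a toy graph. This is an illustration under assumptions rather than the published implementation: the communities are given by hand instead of being found by a partitioner, the low-rank approximation is a plain truncated SVD, and the function names `b_lin_precompute` and `b_lin_query` are hypothetical.

```python
import numpy as np

def b_lin_precompute(W, communities, c, rank):
    """Pre-compute the pieces needed to answer RWR queries.

    communities must partition all nodes; rank must not exceed rank(W2).
    """
    W1 = np.zeros_like(W)
    for nodes in communities:
        idx = np.ix_(nodes, nodes)
        W1[idx] = W[idx]                      # within-community part
    W2 = W - W1                               # cross-community part
    # Invert I - c*W1 one small community block at a time.
    Q1_inv = np.zeros_like(W)
    for nodes in communities:
        idx = np.ix_(nodes, nodes)
        Q1_inv[idx] = np.linalg.inv(np.eye(len(nodes)) - c * W1[idx])
    # Low-rank approximation W2 ~ U diag(s) Vt.
    U, s, Vt = np.linalg.svd(W2)
    U, s, Vt = U[:, :rank], s[:rank], Vt[:rank, :]
    # Small rank x rank core from the Woodbury / SM lemma.
    core = np.linalg.inv(np.diag(1.0 / s) - c * Vt @ Q1_inv @ U)
    return Q1_inv, U, core, Vt

def b_lin_query(pre, query, c):
    """On-line stage: a few matrix-vector products per query."""
    Q1_inv, U, core, Vt = pre
    e = np.zeros(Q1_inv.shape[0])
    e[query] = 1.0
    base = Q1_inv @ e
    return (1 - c) * (base + c * Q1_inv @ (U @ (core @ (Vt @ base))))

# Hypothetical graph: two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
adj = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[a, b] = adj[b, a] = 1.0
W = adj / adj.sum(axis=0)
c = 0.9

pre = b_lin_precompute(W, [[0, 1, 2], [3, 4, 5]], c, rank=2)
r_fast = b_lin_query(pre, query=0, c=c)
r_exact = (1 - c) * np.linalg.solve(np.eye(6) - c * W, np.eye(6)[:, 0])
print(np.abs(r_fast - r_exact).max())
```

Here the cross-community part has exact rank 2, so the result matches the exact RWR; on real graphs the rank is truncated and the scores are approximate.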

B_Lin: summary. Pre-Computational Stage. Q: Efficiently compute and store Q. A: A few small, instead of ONE BIG, matrix inversions. On-Line Stage. Q: Efficiently recover one column of Q. A: A few, instead of MANY, matrix-vector multiplications

Query Time vs. Pre-Compute Time. [Figure: Log Query Time vs. Log Pre-compute Time] Quality: 90%+. On-line: up to 150x speedup. Pre-computation: two orders of magnitude saving

Roadmap: Motivation; Part I: Definitions; Part II: Fast Solutions (B_Lin: RWR; BB_Lin: Skewed BGs; FastUpdate: Time-Evolving); Part III: Applications; Conclusion

RWR on Bipartite Graph. Author-Conf. Matrix: n authors x m conferences. Observation: n >> m! Examples: 1. DBLP: 400k aus, 3.5k confs; 2. NetFlix: 2.7M usrs, 18k mvs

RWR on Skewed bipartite graphs. Q: Given query i, how to solve it? [Figure: block matrix in which Ar and Ac link the n aus and the m confs]

BB_Lin: Pre-Computation [Tong+ ICDM 06]. 2-step RWR for Conferences: M = Ac x Ar (m x m), built from the Ac/Ar blocks with E edges over n authors and m conferences. [Figure: Step 1 and Step 2, producing all Conf-Conf Prox. Scores] Cost examples: NetFlix: 1.5 hr for pre-computation; DBLP: a few minutes

BB_Lin: On-Line Stage. (Base) Case 1: Conf - Conf. Read out! [Figure: authors and conferences, Ac/Ar with E edges]

BB_Lin: On-Line Stage. Case 2: Au - Conf. 1 matrix-vec!

BB_Lin: On-Line Stage. Case 3: Au - Au. 2 matrix-vec!
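The three on-line cases follow from block inversion once the small m x m core is available. The sketch below is an assumption-laden illustration: it takes the two pre-computation steps to be forming M = Ac x Ar and inverting I - c^2 M (which is what block inversion suggests, not something the transcript spells out), and every function name is hypothetical. Ar is the n x m block and Ac the m x n block of the column-normalized bipartite adjacency matrix.

```python
import numpy as np

def bb_lin_precompute(Ar, Ac, c):
    """Return the m x m core (I - c^2 * Ac @ Ar)^{-1}."""
    m = Ac.shape[0]
    M = Ac @ Ar                                   # m x m, cheap because m << n
    return np.linalg.inv(np.eye(m) - c**2 * M)

def conf_to_conf(core, j, c):
    """Case 1: scores of all conferences for query conference j (read out)."""
    return (1 - c) * core[:, j]

def author_to_conf(core, Ac, i, c):
    """Case 2: scores of all conferences for query author i (1 matrix-vec)."""
    return (1 - c) * c * (core @ Ac[:, i])

def author_to_author(core, Ar, Ac, i, c):
    """Case 3: scores of all authors for query author i (2 matrix-vecs)."""
    scores = c**2 * (Ar @ (core @ Ac[:, i]))
    scores[i] += 1.0
    return (1 - c) * scores

# Hypothetical author x conference matrix: 4 authors, 2 conferences.
B = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1]], dtype=float)
Ar = B / B.sum(axis=0)        # n x m block (columns of conference nodes)
Ac = B.T / B.sum(axis=1)      # m x n block (columns of author nodes)
c = 0.9
core = bb_lin_precompute(Ar, Ac, c)
print(conf_to_conf(core, 0, c))
print(author_to_conf(core, Ac, 2, c))
print(author_to_author(core, Ar, Ac, 2, c))
```

Only the m x m core is stored, so the heavy (n + m) x (n + m) inverse is never formed.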

BB_Lin: Examples
Dataset   Size                         Off-Line Cost   On-Line Cost
DBLP      400k authors x 3.5k conf.s   a few minutes   frac. of sec.
NetFlix   2.7M users x 18k movies      1.5 hours       <0.01 sec.

Roadmap: Motivation; Part I: Definitions; Part II: Fast Solutions (B_Lin: RWR; BB_Lin: Skewed BGs; FastUpdate: Time-Evolving); Part III: Applications; Conclusion

Challenges. BB_Lin is good for skewed bipartite graphs: for NetFlix (2.7M nodes and 100M edges), the on-line cost for a query is a fraction of a second, w/ 1.5 hr pre-computation for the m x m core matrix. But… what if the graph is evolving over time? New edges/nodes arrive; edge weights increase… On-line cost: the 1.5 hr itself becomes a part of this!

Q: How to update the core matrix? [Figure: the core matrix before and after the change]

Update the core matrix. Step 1; Step 2: Rank 2 update. [Figure: the updated M, Ac and Ar written as the old matrices plus a correction term] (Speaker note: Model - descriptive for static. Generative, why?)

Update: General Case. E’ edges changed, involving n’ authors and m’ confs. [Figure: n authors x m conferences; M = Ac x Ar]

Update: General Case. Observation: the rank of update is small! Real Example (DBLP Post): 1258 time steps; E’ up to ~20,000; min(n’, m’) <= 132. [Figure: Our Algorithm]
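Because the change to M has small rank, the core matrix can be refreshed with the Woodbury identity instead of being re-inverted. The following is a generic low-rank sketch under assumptions, not necessarily the algorithm of [Tong+ SDM08]: the core is taken to be (I - c^2 M)^{-1}, the change is written as M_new = M + X Y with thin factors X (m x k) and Y (k x m), and all names and numbers are illustrative.

```python
import numpy as np

def update_core(core, X, Y, c):
    """Return (I - c^2 (M + X @ Y))^{-1} from core = (I - c^2 M)^{-1}.

    Only a k x k system is solved, where k is the rank of the change.
    """
    k = X.shape[1]
    small = np.eye(k) - c**2 * (Y @ core @ X)            # k x k
    return core + c**2 * core @ X @ np.linalg.solve(small, Y @ core)

# Hypothetical 3 x 3 matrix M and a rank-1 change (values are illustrative).
M = np.array([[0.5, 0.2, 0.1],
              [0.3, 0.6, 0.2],
              [0.2, 0.2, 0.7]])
X = np.array([[0.05], [-0.05], [0.0]])
Y = np.array([[1.0, 0.0, 0.0]])
c = 0.9

core_old = np.linalg.inv(np.eye(3) - c**2 * M)
core_fast = update_core(core_old, X, Y, c)
core_direct = np.linalg.inv(np.eye(3) - c**2 * (M + X @ Y))
print(np.abs(core_fast - core_direct).max())
```

With min(n’, m’) in the low hundreds and m in the thousands, solving a k x k system per time step is far cheaper than a fresh m x m inversion.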

Fast-Single-Update. [Figure: log(Time) (Seconds) per dataset, with our method marked] 176x speedup; 40x speedup

Fast-Batch-Update. 15x speed-up on average! [Figure: Time (Seconds), with our method marked] DBLP post time: 1,258 time steps; E’: 1~18,000 edges changed at each time step; min(n’, m’): 1~132 for rank of update!

More on “Fast Solutions” FastAllDAP Simultaneously solve multiple linear systems [Tong+ KDD 2007 a] MT3 Multiple-Resolution Analysis on Time [Tong+ CIKM 2008] Fast-ProSIN On-Line response for users’ feedback [Tong+ 2008]

Summary of Part II Goal: Efficiently Solve Linear System(s) Sols. B_Lin: one large linear system [Tong+ ICDM06] BB_Lin: the intrinsic complexity is small [Tong+ ICDM06] FastUpdate: dynamic linear system [Tong+ SDM08] FastAllDAP: multiple linear systems [Tong+ KDD07 a] MT3: [Tong+ CIKM 2008] Fast-ProSIN: [Tong+ 2008]

Roadmap Motivation Part I: Definitions Part II: Fast Solutions Part III: Applications Conclusion

Roadmap: Motivation; Part I: Definitions; Part II: Fast Solutions; Part III: Applications (Link Prediction & +; Ranking Related Tasks; User Specific Patterns; Time Related Tasks); Conclusion

Link Prediction: existence. Q: How to predict the existence of the link? A: Proximity! [Liben-Nowell+ 2003] Prox. is effective for both ‘deleted’ and absent edges! [Figures: Prox. histograms (density vs. Prox(i→j) + Prox(j→i)), one for a set of deleted links and one for a set of absent links]

Link Prediction: direction [Tong+ KDD 07 a]. Q: Given the existence of the link, what is the direction of the link? A: Compare Prox(i→j) and Prox(j→i). >70% [Figure: density vs. Prox(i→j) - Prox(j→i)]

Beyond Link Prediction: Collaborative Filtering [Fouss+]; Name Disambiguation [Minkov+ SIGIR 06]; Anomaly Nodes/Edges: ‘a’ is abnormal if the neighborhood of ‘a’ is so different [Sun+ ICDM 2005]. Here (link prediction, disambiguation, anomaly detection), we want to use proximity to directly quantify/study the relationship between nodes on graphs

Roadmap: Motivation; Part I: Definitions; Part II: Fast Solutions; Part III: Applications (Link Prediction & +; Ranking Related Tasks; User Specific Patterns; Time Related Tasks); Conclusion

Neighborhood Search on graphs … … … … Conference Author Q: what is most related conference to ICDM? A: Proximity! [Sun+ ICDM2005]

NF: example

gCaP: Automatic Image Caption. Q: given images captioned {Cat, Forest, Grass, Tiger} and {Sea, Sun, Sky, Wave}, which keywords {?, ?, ?} go with the test image? A: Proximity! [Pan+ KDD2004]

[Figure: graph with Region, Image and Keyword nodes; keywords Sea, Sun, Sky, Wave, Cat, Forest, Tiger, Grass; the Test Image is attached through its regions]

[Figure: the same graph; the Test Image receives the keywords {Grass, Forest, Cat, Tiger}]

C-DEM: Multi-Modal Query System for Drosophila Embryo Databases [Fan+ VLDB 2008]. C-DEM is isomorphic to gCap.

Roadmap: Motivation; Part I: Definitions; Part II: Fast Solutions; Part III: Applications (Link Prediction & +; Ranking Related Tasks; User Specific Patterns; Time Related Tasks); Conclusion

Center-Piece Subgraph(CePS) Input Output CePS guy CePS Original Graph Q: How to find hub for the black nodes? A: Proximity! [Tong+ KDD 2006] Red: Max (Prox(A, Red) x Prox(B, Red) x Prox(C, Red))
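The AND-style score on this slide, a product of proximities from every query node, can be sketched with plain RWR. This is a simplified illustration, not the full CePS algorithm of [Tong+ KDD 2006] (which also extracts a connecting subgraph): the graph, c = 0.9 and the names `rwr_scores` and `and_query_center` are assumptions.

```python
import numpy as np

def rwr_scores(W, query, c=0.9):
    """Solve r = c*W*r + (1-c)*e for one query node (dense, small graphs only)."""
    n = W.shape[0]
    e = np.zeros(n)
    e[query] = 1.0
    return (1 - c) * np.linalg.solve(np.eye(n) - c * W, e)

def and_query_center(adj, queries, c=0.9):
    """Pick the non-query node maximizing the product of Prox(q, node) over all queries."""
    W = adj / adj.sum(axis=0)              # assumes every node has at least one edge
    combined = np.ones(adj.shape[0])
    for q in queries:
        combined *= rwr_scores(W, q, c)
    combined[list(queries)] = -np.inf      # a query node is not its own center-piece
    return int(np.argmax(combined)), combined

# Hypothetical star graph: hub 0, leaves 1..4; leaves 1, 2, 3 are the query nodes.
adj = np.zeros((5, 5))
adj[0, 1:] = 1.0
adj[1:, 0] = 1.0
center, combined = and_query_center(adj, queries=[1, 2, 3])
print(center)
```

The product rewards nodes that are close to all query nodes at once, which is why the hub beats the remaining leaf here.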

CePS: Example

K_SoftAnd: Relaxation of AND. [Figures: Noise (left); Disconnected Communities (right)] Another reason to motivate the k_softand query is that, in some situations, asking an AND query might be too restrictive, especially when the # of queries is large, when there is noise in the query set (as in the left figure), or when the queries belong to different remote/disconnected communities (as in the right figure). In both examples, the algorithm will output no answer for the AND query. Asking AND query? No Answer!

CePS: 2 SoftAND DB Stat.

Best-Effort Pattern Match. Input: Query Graph and Data Graph. Output: Matching Subgraph. Q: How to find matching subgraph? A: Proximity! [Tong+ KDD 2007 b]

G-Ray: How to? [Figure: matching nodes 4, 7, 11 and 12] Goodness = Prox (12, 4) x Prox (4, 12) x Prox (7, 4) x Prox (4, 7) x Prox (11, 7) x Prox (7, 11) x Prox (12, 11) x Prox (11, 12)

Effectiveness: star-query. [Figure: Query and Result] Here is a star-query: we want to find a star-shaped group of co-authors, with one author coming from each of PODS, IAT and ISBMS. We see Dr. Philip S. Yu is in the center, and the rest of the matching authors are well-known domain experts in each conf.

Effectiveness: line-query. [Figure: Result] And here is a line query: we want to find authors from 4 different conferences who cooperate in a line fashion.

Effectiveness: loop-query And this is a loop query. Result

Roadmap: Motivation; Part I: Definitions; Part II: Fast Solutions; Part III: Applications (Link Prediction & +; Ranking Related Tasks; User Specific Patterns; Time Related Tasks); Conclusion

Challenge: Graphs are evolving over time! New nodes/edges show up; existing nodes/edges die out; edge weights change… Q: How to generalize everything? A: Track Proximity! [Tong+ SDM 2008] How can we apply trend analysis tools, e.g. Google Trends, at the graph level?

pTrack/cTrack: Trend analysis on graph level. Let me give you an example. [Figure: Rank of Influential-ness vs. Year for T. Sejnowski, G. Hinton, C. Koch and M. Jordan]

pTrack: Philip S. Yu’s Top-5 conferences up to each year. [Figure: timeline 1992, 1997, 2002, 2007 listing ICDE, ICDCS, SIGMETRICS, PDIS, VLDB, CIKM, ICMCS, KDD, SIGMOD, ICDM, SDM; topics move from Databases, Performance, Distributed Sys. to Databases, Data Mining] DBLP (Au. x Conf.): 400k aus, 3.5k confs, 20 yrs

KDD’s Rank wrt. VLDB over years. [Figure: Prox. Rank vs. Year] Data Mining and Databases are more and more relevant!

cTrack: 10 most influential authors in NIPS community up to each year. [Figure: T. Sejnowski, M. Jordan] Author-paper bipartite graph from NIPS 1987-1999: 1740 papers, 2037 authors, spreading over 13 years

T3: Understand Time in Complex Context [Tong+ CIKM 2008]. Input: table of time stamps t1 to t6, events e1 to e9 and entities b1 to b8. Output: Time Cluster, rep. entities: b7, b6, b8; Abnormal Time, rep. entities: b5, b4; Time Cluster, rep. entities: b3, b2, b1

T3: Time-to-Time Proximity Matrix. [Figure: proximity matrix for the same table of time stamps t1 to t6, events e1 to e9 and entities b1 to b8]

More Applications… Clustering: Proximity as input [Ding+ KDD 2007]; Email management [Minkov+ CEAS 06]; Business Process Management [Qu+ 2008]; ProSIN: Listen to clients’ comments [Tong+ 2008]; TANGENT: Broaden Users’ Horizon [Oonuma & Tong+ 2008]; Ghost Edge: Within Network Classification [Gallagher & Tong+ KDD08 b]; …

Proximity On Graphs (Weighted Multiple Relationship). Applications (use Proximity as building block): gCap, pTrack/cTrack, LP, NF, CePS, G-Ray, T3, GhostEdge [Liben-Nowell+][Pan+ 2004][Pan+ 2005][Tong+ 2006][Tong+ 2007][Tong+ 2008][Gallagher & Tong+ 2008]. Computations (Efficiently Solve Linear System(s)): B_Lin [Tong+ 2006]; BB_Lin [Tong+ 2006]; FastAllDAP [Tong+ 2007]; FastUpdate [Tong+ 2008]; MT3 [Tong+ 2008]; Fast-ProSIN [Tong+ 2008]. Definitions: RWR [Pan+ 2004][Sun+ 2006][Tong+ 2006]; Properties [Koren+ 2006][Tong+ 2007, 2008]; Variants [Faloutsos+ 2004][Koren+ 2006][Tong+ 2007]; Generalizations [Chakrabarti+ 2006][Tong+ 2007, 2008]

Take-home messages. Proximity: Definitions, Computations, Applications. Definitions: RWR and a lot of variants. Computations: find out “smoothness”; SM Lemma. Applications: Proximity as a building block

References
L. Page, S. Brin, R. Motwani & T. Winograd. (1998) The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford Library.
T.H. Haveliwala. (2002) Topic-Sensitive PageRank. In WWW, 517-526, 2002.
J.Y. Pan, H.J. Yang, C. Faloutsos & P. Duygulu. (2004) Automatic multimedia cross-modal correlation discovery. In KDD, 653-658, 2004.
C. Faloutsos, K. S. McCurley & A. Tomkins. (2004) Fast discovery of connection subgraphs. In KDD, 118-127, 2004.
J. Sun, H. Qu, D. Chakrabarti & C. Faloutsos. (2005) Neighborhood Formation and Anomaly Detection in Bipartite Graphs. In ICDM, 418-425, 2005.
W. Cohen. (2007) Graph Walks and Graphical Models. Draft.

References
P. Doyle & J. Snell. (1984) Random walks and electric networks, volume 22. Mathematical Association of America, New York.
Y. Koren, S. C. North & C. Volinsky. (2006) Measuring and extracting proximity in networks. In KDD, 245-255, 2006.
A. Agarwal, S. Chakrabarti & S. Aggarwal. (2006) Learning to rank networked entities. In KDD, 14-23, 2006.
S. Chakrabarti. (2007) Dynamic personalized pagerank in entity-relation graphs. In WWW, 571-580, 2007.
F. Fouss, A. Pirotte, J.-M. Renders & M. Saerens. (2007) Random-Walk Computation of Similarities between Nodes of a Graph with Application to Collaborative Recommendation. IEEE Trans. Knowl. Data Eng. 19(3), 355-369, 2007.

References
H. Tong & C. Faloutsos. (2006) Center-piece subgraphs: problem definition and fast solutions. In KDD, 404-413, 2006.
H. Tong, C. Faloutsos & J.Y. Pan. (2006) Fast Random Walk with Restart and Its Applications. In ICDM, 613-622, 2006.
H. Tong, Y. Koren & C. Faloutsos. (2007) Fast direction-aware proximity for graph mining. In KDD, 747-756, 2007.
H. Tong, B. Gallagher, C. Faloutsos & T. Eliassi-Rad. (2007) Fast best-effort pattern matching in large attributed graphs. In KDD, 737-746, 2007.
H. Tong, S. Papadimitriou, P.S. Yu & C. Faloutsos. (2008) Proximity Tracking on Time-Evolving Bipartite Graphs. To appear in SDM 2008.

References
B. Gallagher, H. Tong, T. Eliassi-Rad & C. Faloutsos. Using Ghost Edges for Classification in Sparsely Labeled Networks. KDD 2008.
H. Tong, Y. Sakurai, T. Eliassi-Rad & C. Faloutsos. Fast Mining of Complex Time-Stamped Events. CIKM 08.
H. Tong, H. Qu & H. Jamjoom. Measuring Proximity on Graphs with Side Information. Submitted.
K. Oonuma, H. Tong & C. Faloutsos. TANGENT: A Novel, “Surprise-me”, Recommendation Algorithm. Submitted.

Thank you! htong@cs.cmu.edu www.cs.cmu.edu/~htong