Fast Algorithms for Querying and Mining Large Graphs Hanghang Tong Machine Learning Department Carnegie Mellon University


1 Fast Algorithms for Querying and Mining Large Graphs. Hanghang Tong, Machine Learning Department, Carnegie Mellon University. htong@cs.cmu.edu, http://www.cs.cmu.edu/~htong

2 Why Do We Care? Graphs are everywhere! Internet Map [Koren 2009]; Food Web [2007]; Protein Network [Salthe 2004]; Social Network [Newman 2005]; Web Graph; Terrorist Network [Krebs 2002].

3 Research Theme: Help users to understand and utilize large graph-related data.

4 A1: Social Networks
– Facebook (300m users, $10bn value, $500mn revenue)
– MSN (240m users, 4.5 PB); Myspace (110m users)
– LinkedIn (50m users, $1bn value); Twitter (18m users)
How to help users explore such networks? (e.g., find strange persons, communities, locate common friends, etc.)

5 A2: Network Forensics [Sun+ 2007]. How to detect abnormal traffic? Figure: a traffic graph (e.g., ibm.com, cmu.edu) and its adjacency matrix under normal traffic, port scanning, and DDoS. Footnote: Rows are IP sources; columns are IP destinations.

6 A3: Business Intelligence. What is IBM's rank in the global service business over the years? Figure: one graph per year (2005, 2006, 2007, …) linking NY Times, Forbes, and Reuters to the keywords Hardware, Service, and IBM; plot of Year vs. Rank of IBM in Global Service (higher is better). Footnote: nodes are business reviews and keywords; an edge means 'reporting'.

7 A4: Financial Fraud Detection [Tong+ 2007]
– 7.5% of U.S. adults lost money to financial fraud
– 50%+ of U.S. corporations lost >= $500,000 [Albrecht+ 2001]
– e.g., Enron ($70bn)
– Total cost of financial fraud: $1 trillion [Ansari 2006]
How to detect abnormal transaction patterns? (e.g., a money-laundering ring)
Figure legend: anonymous accounts; anonymous banks.

8 A5: Immunization. How to select the k 'best' nodes for immunization? Figure: example graph with 34 numbered nodes. Footnote: SARS cost 700+ lives and $40+ bn.

9 This Talk
Querying [Goal: query complex relationships]
– Q.1. Find complex user-specific patterns;
– Q.2. Proximity tracking;
– Q.3. Answer all the above questions quickly.
Mining [Goal: find interesting patterns]
– M.1. Immunization;
– M.2. Spot anomalies.

10 Tasks vs. Applications
Table: tasks (rows Q1, Q2, Q3, M1, M2) against applications (columns A1 to A5).
Applications: A1: Social Networks; A2: Network Forensics; A3: Business Intelligence; A4: Financial Fraud; A5: Immunization.
Tasks: Q1: Complex User-Specific Patterns; Q2: Proximity Tracking; Q3: Fast Proximity Computing; M1: Immunization; M2: Anomaly Detection.

11 Overview Q1 Q3 Q2 Q3 M1 M2

12 Overview
– Q1: CePS, iPoG (KDD06, ICDM08, CIKM09)
– Q3: FastProx (ICDM06, KAIS07, KDD07b, ICDM08)
– Q2: pTrack/cTrack (SDM08, SAM08)
– Q3: FastProx (SDM08, SAM08)
– M1: NetShield
– M2: Colibri-S (KDD08)
– M2: Colibri-D (KDD08)

13 Background: Proximity Measurement. Q: How close is A to B? A.k.a. relevance, closeness, 'similarity'…

14 Background: Random Walk with Restart [Tong+ ICDM 2006]. Figure: example graph with 12 numbered nodes, colored by relevance to the query node (Node 4): more red, more relevant; nearby nodes get higher scores. Ranking vector values shown for Node 1 to Node 12: 0.13, 0.10, 0.13, 0.22, 0.13, 0.05, 0.08, 0.04, 0.03, 0.04, 0.02.

15 Background: Intuitions: Why is RWR a Good Score? Figure: several colored paths from the source (node 1) to the target (node 20). Score(Red Path) = $(1-c)\, c^{6} \times W(1,3) \times W(3,4) \times \dots \times W(14,20)$. The factor in c is the penalty for the length of the path; the product of W entries is the probability of traversing the path. Footnote: (1-c) is the restart probability in RWR; W is the normalized adjacency matrix of the graph.

16 Background: Intuitions: Why is RWR a Good Score? Prox(1, 20) = Score(Red Path) + Score(Green Path) + Score(Yellow Path) + Score(Purple Path) + … A high proximity means many short, highly weighted paths between the source (node 1) and the target (node 20).
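The path-sum intuition can be written compactly. As a sketch, assuming W(a, b) is the normalized weight of the step from a to b (as in the red-path product) and that every path of length k carries the factor $(1-c)\,c^{k}$, summing over all paths of all lengths gives a geometric (Neumann) series. It converges because c < 1 and W is normalized:

```latex
\mathrm{Prox}(i, j)
  \;=\; (1-c) \sum_{k \ge 0} c^{k} \, \bigl[ W^{k} \bigr]_{i,j}
  \;=\; (1-c) \, \bigl[ (I - cW)^{-1} \bigr]_{i,j}
```

Here $[W^{k}]_{i,j}$ collects the traversal probabilities of all length-k paths from i to j, so short and heavily weighted paths dominate the sum.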

18 Q1: Find Complex User-Specific Patterns
Q1.1. Center-Piece Subgraph Discovery
– e.g., find the master-mind criminal given some suspects X, Y and Z?
Q1.2. Interactive Querying (e.g., Negation)
– e.g., find the conferences most similar to KDD, but not like ICML?
Footnote: Our algorithms for both Q1.1 and Q1.2 are to be deployed in a real system (Cyano) at IBM.

20 Q1.1 Center-Piece Subgraph Discovery [Tong+ KDD 06]. Q: How to find the hub for the black nodes? Input: original graph; output: CePS. The CePS node (red) is the one that maximizes Prox(A, Red) × Prox(B, Red) × Prox(C, Red).

21 CePS: Example (AND Query). DBLP co-authorship network: 400,000 authors, 2,000,000 edges.

23 K_SoftAND: Relaxation of AND (details). Asking an AND query? No answer! Figure labels: disconnected communities; noise.

24 CePS: 2_SoftAND (details). Figure labels: Stat.; DB.

26 Q1.2: Interactive Querying 26 User Feedback

27 Q1.2 iPoG for Interactive Querying [Tong+ ICDM 08, CIKM 09]
What are the most related conferences wrt KDD? (DBLP author-conference bipartite graph)
– Initial results: 'ICDM', 'ICML', 'SDM', 'VLDB', 'ICDE', 'SIGMOD', 'NIPS', 'PKDD', 'IJCAI', 'PAKDD'
– After "No" to 'ICML': 'ICDM', 'SDM', 'PKDD', 'ICDE', 'VLDB', 'SIGMOD', 'PAKDD', 'CIKM', 'SIGIR', 'WWW'
– After "Yes" to 'SIGIR': 'SIGIR', 'TREC', 'CIKM', 'ECIR', 'CLEF', 'ICDM', 'JCDL', 'VLDB', 'ACL', 'ICDE'
Two main sub-communities in KDD: DBs (green) vs. Stat (red). Negative feedback on ICML will exclude other stats confs (NIPS, IJCAI). Positive feedback on SIGIR will bring in more IR (brown) conferences.

31 Q2.2 pTrack: Challenge [Tong+ SDM 08]
Observations (CePS, iPoG…)
– All for static graphs
– Proximity: main tool
Graphs are evolving over time!
– New nodes/edges show up;
– Existing nodes/edges die out;
– Edge weights change…
Q: How close is Philip Yu to DBs over the years?
A: Track proximity, incrementally!

32 Author-Keyword Bipartite Graphs (NIPS). Figure: one bipartite graph per year (NIPS 1993, NIPS 1994, NIPS 1995, …) linking authors (Sejnowski, Jordan) to keywords (Neural Network, ICA, Bayes).

33 pTrack: Trend analysis on graph level. Figure: Year vs. Rank of Influence for M. Jordan, G. Hinton, C. Koch, and T. Sejnowski.

34 pTrack: Problem Definitions
[Given]
– a large, skewed, time-evolving bipartite graph,
– the query nodes of interest
[Track]
– (1) the top-k most related nodes for each query node at each time step t;
– (2) the proximity score (or rank of proximity) between any two query nodes at each time step t.

35 pTrack: Philip S. Yu's Top-5 conferences up to each year
– 1992: ICDE, ICDCS, SIGMETRICS, PDIS, VLDB
– 1997: CIKM, ICDCS, ICDE, SIGMETRICS, ICMCS
– 2002: KDD, SIGMOD, ICDM, CIKM, ICDCS
– 2007: ICDM, KDD, ICDE, SDM, VLDB
Figure labels: Databases; Performance; Distributed Sys.; Databases; Data Mining.
DBLP (Au. x Conf.): 400k authors, 3.5k conferences, 20 years.

36 Data Mining and Databases are getting closer & closer. Figure: KDD's rank wrt. VLDB over the years (Year vs. Prox. Rank), shown next to an author-conference bipartite graph (authors John, Tom, Bob, Carl, Van, Roy; conferences KDD, RECOMB, ICML, VLDB, …).

37 Q2: pTrack on Bipartite Graphs Computational Challenges (assuming ) – Iterative method O(m) – Straight-forward update Example – NetFlix (2.6m users x 18k movies, 100m ratings) – Both need >1hr 37

38 Q2: pTrack on Bipartite Graphs
Observation #1
– $n_1$ authors; $n_2$ conferences;
– $n_1 \gg n_2$, e.g., 400k authors, 3.5k conf.s in DBLP
Observation #2
– m edges changed ($n_1$ authors, $n_2$ conf.s)
– rank of update = = update
Proposed algorithm: Fast-Update
Theorem (Tong+ 2008): (1) Fast-Update has no quality loss; (2) Fast-Update is ~~~ KDD

39 Q2: Speed Comparison. Figure: log(Time) (seconds) per data set; annotated speedups of our method: 176x and 40x.

41 Background: RWR: Think of it as Wine Spill. 1. Spill a drop of wine on cloth. 2. It spreads/diffuses to the neighborhood.

42 Background: RWR: Wine Spill on a Graph. Figure: wine spill on cloth next to RWR on a graph (12 numbered nodes, one query node).

43 Background: Random Walk with Restart. Figure: example graph with 12 numbered nodes.

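44 Computing RWR. $\vec{r} = c\,W\,\vec{r} + (1-c)\,\vec{e}$, where $\vec{r}$ is the ranking vector (n x 1), W is the (normalized) adjacency matrix (n x n), $\vec{e}$ is the starting vector (n x 1), and (1-c) is the restart probability. Figure: example graph with 12 numbered nodes. Footnote: Maxwell Equation for Web [Chakrabarti]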

45 Computing RWR. $Q = (I - c \times W)^{-1}$. Q: How to get (elements of) Q? Footnote: 1-c is the restart prob.; W is the normalized adjacency matrix.
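The on-the-fly way to get one column of the proximity matrix is to iterate the RWR equation until it stops changing. A minimal pure-Python sketch follows; the toy graph, the function name `rwr`, the default `c=0.9`, and the column normalization are illustrative assumptions, not taken from the talk, and every node is assumed to have at least one edge.

```python
def rwr(edges, n, query, c=0.9, iters=200):
    """Random walk with restart by power iteration (illustrative sketch)."""
    # Build a column-normalized adjacency matrix W for an undirected graph.
    adj = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        adj[a][b] = adj[b][a] = 1.0
    col_sum = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    W = [[adj[i][j] / col_sum[j] for j in range(n)] for i in range(n)]

    # Starting vector: all mass on the query node.
    e = [1.0 if i == query else 0.0 for i in range(n)]
    r = e[:]
    for _ in range(iters):
        # r <- c * W * r + (1 - c) * e, with (1 - c) the restart probability.
        r = [c * sum(W[i][j] * r[j] for j in range(n)) + (1 - c) * e[i]
             for i in range(n)]
    return r


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the bridge 2-3.
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
scores = rwr(EDGES, n=6, query=0)
```

Each iteration touches every edge once in a sparse implementation, which is where the O(m x Iter) on-line cost comes from; the dense lists here are only for readability.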

46 Computing RWR
OnTheFly
– No pre-computation;
– Light storage cost (W)
– Slow on-line response: O(m x Iter)
Pre-Compute
– Fast on-line response
– Prohibitive pre-compute cost: $O(n^3)$
– Prohibitive storage cost: $O(n^2)$

47 Q: How to Balance On-line and Off-line cost? Goal: Efficiently get (elements of) Q.

48 B_Lin: Basic Idea [Tong+ ICDM 2006]. Figure: three steps on the 12-node example graph: find communities, fix the remaining (cross-community) links, combine.

49 B_Lin: Basic Idea [Tong+ ICDM 2006] Pre-Compute Stage – Find Communities – Pre-compute within-community scores On-Line Stage – Fix the influence of the bridges (cross-community links) 49

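50 B_Lin: details. $W = W_1 + \text{(cross-community links)} \approx W_1 + USV$, where $W_1$ holds the within-community links and $USV$ is a low-rank approximation of the cross-community part.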

51 B_Lin: details. $I - cW \approx I - cW_1 - cUSV$. The term $I - cW_1$ is easy to invert, and $cUSV$ is the low-rank approximation (LRA) of the difference, so the inverse follows from the Sherman–Morrison lemma.
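The inversion step can be spelled out with the matrix-inversion (Sherman–Morrison–Woodbury) identity. This is a worked sketch rather than a quote from the slide; it writes $Q_1 = (I - cW_1)^{-1}$ and introduces $\Lambda$ as a helper symbol, and it assumes S and $S^{-1} - cVQ_1U$ are invertible:

```latex
\begin{aligned}
(I - cW_1 - cUSV)^{-1}
  &= Q_1 + c\, Q_1 U \Lambda V Q_1, \\
\Lambda &= \bigl( S^{-1} - c\, V Q_1 U \bigr)^{-1},
\qquad Q_1 = (I - cW_1)^{-1}.
\end{aligned}
```

Since $W_1$ is block-diagonal across communities, $Q_1$ needs only a few small inversions, and $\Lambda$ has the size of the low-rank approximation, not of the graph.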

52 B_Lin: Pre-Compute Stage. Q: Efficiently compute and store Q. A: A few small, instead of ONE BIG, matrix inversions. Footnote: $Q_1 = (I - cW_1)^{-1}$

53 B_Lin: On-Line Stage Q: Efficiently recover one column of Q A: A few, instead of MANY, matrix-vector multiplications 53

54 Our Results: Query Time vs. Pre-Compute Time. Figure: log query time vs. log pre-compute time. Quality: 90%+. On-line: up to 150x speedup. Pre-computation: two orders of magnitude saving.

55 More on Scalability Issues for Querying (the spectrum of "FastProx")
– B_Lin: one large linear system [Tong+ ICDM06, KAIS08]
– BB_Lin: the intrinsic complexity is small [Tong+ KAIS08]
– FastUpdate: time-evolving linear system [Tong+ SDM08, SAM08]
– FastAllDAP: multiple linear systems [Tong+ KDD07 a]
– Fast-iPoG: dealing w/ on-line feedback [Tong+ ICDM 2008, Tong+ CIKM09]

57 A5: Immunization. How to select the k 'best' nodes for immunization? Figure: example graph with 34 numbered nodes.

58 Background: M1: SIS Virus Model [Chakrabarti+ 2008]. 'Flu'-like: Susceptible-Infectious-Susceptible. If the virus 'strength' $s < 1/\lambda_{1,A}$, an epidemic cannot happen, where $\lambda_{1,A}$ is the leading eigenvalue of the adjacency matrix A. Footnote: Think of s as the number of sneezes before healing.
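The threshold test is easy to try on a small graph. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the graph is undirected, connected, and given as a dense 0/1 matrix, and the leading eigenvalue is estimated by plain power iteration on A + I (the shift keeps bipartite graphs from oscillating).

```python
import math


def leading_eigenvalue(adj, iters=500):
    """Leading eigenvalue of a symmetric adjacency matrix by power iteration."""
    n = len(adj)
    x = [1.0 / math.sqrt(n)] * n
    lam = 0.0
    for _ in range(iters):
        # y = (A + I) x; the shift by I avoids oscillation on bipartite graphs.
        y = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        lam = sum(a * b for a, b in zip(x, y)) - 1.0  # Rayleigh quotient of A
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return lam


def below_epidemic_threshold(adj, strength):
    """True when strength s < 1 / lambda_1, i.e. no epidemic under the SIS condition."""
    return strength < 1.0 / leading_eigenvalue(adj)


# Complete graph on 4 nodes: its leading eigenvalue is 3, so the threshold is 1/3.
K4 = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
```

For example, `below_epidemic_threshold(K4, 0.3)` holds while `below_epidemic_threshold(K4, 0.4)` does not, because 1/3 lies between the two strengths.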

59 M1: Optimal Method. Select k nodes whose absence creates the largest drop in $\lambda_{1,A}$. Figure: the original graph, with leading eigenvalue $\lambda_{1,A}$, next to the graph without nodes {2, 6}, with leading eigenvalue $\lambda_{1,\tilde{A}}$.

60 M1: Optimal Method. Select k nodes whose absence creates the largest drop in $\lambda_{1,A}$ (the leading eigenvalue w/o the subset of nodes S). But, we need … in time
– Example: 1,000 nodes, with 10,000 edges
– It takes 0.01 seconds to compute λ
– It takes 2,615 years to find the best-5 nodes!
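The size of that number can be checked with back-of-the-envelope arithmetic. The sketch assumes the brute-force method evaluates one leading eigenvalue per candidate subset of 5 nodes at the quoted 0.01 seconds each, and uses a 365-day year; the variable names are illustrative.

```python
import math

n, k = 1000, 5
seconds_per_eigenvalue = 0.01           # figure quoted on the slide
subsets = math.comb(n, k)               # candidate node sets of size k
total_seconds = subsets * seconds_per_eigenvalue
years = total_seconds / (365 * 24 * 3600)
print(f"{subsets:,} subsets, roughly {years:,.0f} years")
```

The result lands on the order of 2,600 years, in line with the slide's figure; the small gap depends on the year length and rounding assumed.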

61 M1: Netshield to Rescue. Theorem (Tong+ 2009): (1) … $Au = \lambda_{1,A} \times u$; u(i): eigen-score. Think of u(i) as PageRank or in-degree. Figure: 16-node example graph with the eigen-score vector u.

62 M1: Netshield to Rescue (Intuition). Find a set of nodes S which (1) each have high eigen-scores and (2) are diverse among themselves. Theorem (Tong+ 2009): (1) … Figure: three copies of the 16-node example graph with candidate node sets.

63 M1: Netshield to Rescue
Example:
– 1,000 nodes, with 10,000 edges
– Netshield takes < 0.1 seconds to find the best-5 nodes!
– … as opposed to 2,615 years
Theorem (Tong+ 2009): (1) …; (2) Br(S) is sub-modular; (3) Netshield is near-optimal (wrt max Br(S)); (4) Netshield is $O(nk^2 + m)$
Footnote: near-optimal means $Br(S_{\text{Netshield}}) \ge (1 - 1/e)\, Br(S_{\text{Opt}})$
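The "high eigen-score but diverse" intuition can be sketched as a greedy loop. The scoring formula is not recoverable from this transcript; the version below assumes the form I recall from the published NetShield paper (a gain of $2\lambda u(j)^2$ for picking node j, minus an interaction penalty with nodes already picked) and should be checked against that paper. All function names are hypothetical, and the dense pure-Python matrices do not reach the stated $O(nk^2 + m)$ cost.

```python
import math


def leading_eigenpair(adj, iters=1000):
    """Leading eigenvalue and eigenvector of a symmetric adjacency matrix."""
    n = len(adj)
    u = [1.0 / math.sqrt(n)] * n
    lam = 0.0
    for _ in range(iters):
        # y = (A + I) u; the shift keeps bipartite graphs from oscillating.
        y = [u[i] + sum(adj[i][j] * u[j] for j in range(n)) for i in range(n)]
        lam = sum(a * b for a, b in zip(u, y)) - 1.0
        norm = math.sqrt(sum(v * v for v in y))
        u = [v / norm for v in y]
    return lam, u


def netshield_sketch(adj, k):
    """Greedily pick k nodes with high eigen-scores that are diverse (assumed scoring)."""
    n = len(adj)
    lam, u = leading_eigenpair(adj)
    # Stand-alone gain of each node (assumed form: (2*lambda - A[j][j]) * u[j]^2).
    base = [(2 * lam - adj[j][j]) * u[j] ** 2 for j in range(n)]
    chosen = []
    for _ in range(k):
        # b[j]: eigen-score mass of already chosen neighbours of j (diversity penalty).
        b = [sum(adj[j][i] * u[i] for i in chosen) for j in range(n)]
        best = max((j for j in range(n) if j not in chosen),
                   key=lambda j: base[j] - 2 * b[j] * u[j])
        chosen.append(best)
    return chosen


# Star graph: node 0 is the hub, nodes 1..4 are leaves.
STAR = [[0, 1, 1, 1, 1]] + [[1, 0, 0, 0, 0] for _ in range(4)]
first_pick = netshield_sketch(STAR, 1)
```

On the star graph the hub has by far the largest eigen-score, so it is the first pick; the penalty term then discounts its neighbours in later rounds, which is the diversity effect described above.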

64 Why is Netshield Near-Optimal? Figure: two node sets, A and B, on a 10-node example graph. Blue bar: marginal benefit of deleting the blue nodes; green bar: benefit of deleting the green nodes. Sub-modular (i.e., diminishing returns): the blue bar for A >= the blue bar for B. Theorem: a k-step greedy alg. to maximize a sub-modular function guarantees (1-1/e) of the optimal [Nemhauser+ 78]

65 M1: Why is Br(S) sub-modular? (details) Marginal benefit = (pure benefit from blue) - (interaction between blue and green). Only the purple (interaction) term depends on {1, 2}! Footnote: the green nodes {1, 2} are already deleted; the blue nodes {5, 6} are the nodes to be deleted.

66 M1: Why is Br(S) sub-modular? (details) Marginal benefit = Blue - Purple. More green nodes means more purple and less red, so the marginal benefit of the left graph >= the marginal benefit of the right graph. Footnote: the green nodes are already deleted; the blue nodes {5, 6} are the nodes to be deleted.

67 M1: Quality of Netshield. Figure: eigen-drop vs. k (higher is better) for Netshield, Optimal, and (1-1/e) x Optimal.

68 M1: Speed of Netshield. Figure: time vs. k on the NIPS co-authorship network (lower is better); annotations: > 10 days vs. 0.1 seconds for Netshield.

69 Scalability of Netshield. Figure: time vs. # of edges (x $10^8$); lower is better.

71 Motivation [Tong+ KDD 08 b]. Q: How to find patterns from a large graph? (e.g., communities, anomalies, etc.) Figure: author-conference bipartite graph (authors John, Tom, Bob, Carl, Van, Roy; conferences KDD, ICDM, RECOMB, ISMB).

72 Motivation [Tong+ KDD 08 b]. Q: How to find patterns from a large graph? (e.g., communities, anomalies, etc.) A: Low-Rank Approximation (LRA) for the adjacency matrix of the graph: $A \approx L \times M \times R$.

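73 LRA for Graph Mining. Author-conference graph (authors John, Tom, Bob, Carl, Van, Roy; conferences KDD, ICDM, RECOMB, ISMB) and its adjacency matrix A (rows: authors; columns: conferences):
John 1 1 0 0
Tom  1 1 0 0
Bob  1 1 0 0
Carl 0 1 1 1
Van  0 0 1 1
Roy  0 0 1 1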

74 LRA for Graph Mining: Communities. Adj. matrix: $A \approx L \times M \times R$, where L: author group; M: group-group interaction; R: conf. group.

75 LRA for Graph Mining: Anomalies. Compare the adj. matrix A with the reconstructed $\tilde{A}$: the recon. error is high → 'Carl' is abnormal.
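The anomaly argument can be reproduced on the 6 x 4 author-conference matrix from the LRA slides. This is a sketch under assumptions: the row order (John, Tom, Bob, Carl, Van, Roy) and the rank of 2 (one dimension per community) are my reading of the slides, and the low-rank factors come from a truncated SVD computed with plain power iteration to stay dependency-free, not from the method proposed in the talk.

```python
import math

AUTHORS = ["John", "Tom", "Bob", "Carl", "Van", "Roy"]
A = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0],
     [0, 1, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1]]


def top_right_singular_vectors(A, rank, iters=500):
    """Top right singular vectors of A: power iteration with deflation on A^T A."""
    cols = len(A[0])
    G = [[sum(row[i] * row[j] for row in A) for j in range(cols)]
         for i in range(cols)]
    vectors = []
    for _ in range(rank):
        v = [float(i + 1) for i in range(cols)]  # fixed, generic start vector
        for _ in range(iters):
            for w in vectors:  # deflation: drop directions already found
                dot = sum(a * b for a, b in zip(v, w))
                v = [a - dot * b for a, b in zip(v, w)]
            v = [sum(G[i][j] * v[j] for j in range(cols)) for i in range(cols)]
            norm = math.sqrt(sum(a * a for a in v))
            v = [a / norm for a in v]
        vectors.append(v)
    return vectors


def row_reconstruction_errors(A, rank):
    """Distance between each row of A and its projection onto the top-rank subspace."""
    V = top_right_singular_vectors(A, rank)
    errors = []
    for row in A:
        approx = [0.0] * len(row)
        for v in V:
            coef = sum(a * b for a, b in zip(row, v))
            approx = [p + coef * b for p, b in zip(approx, v)]
        errors.append(math.sqrt(sum((a - p) ** 2 for a, p in zip(row, approx))))
    return errors


errors = row_reconstruction_errors(A, rank=2)
most_anomalous = AUTHORS[errors.index(max(errors))]
```

Carl's row straddles both communities, so a rank-2 model reconstructs it worst and `most_anomalous` comes out as Carl, matching the slide.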

76 Challenges: How to Get (L, M, R)?
– Efficient in both time and space
– Intuitive and easy to interpret
– Able to track patterns dynamically over time
None of the existing methods fully meets our wish list!

77 Why Not SVD and CUR/CMD?
SVD (optimal in $L_2$ and $L_F$)
– Efficiency: Time: …; Space: …; (L, R) are dense
– Interpretation: linear combination of many columns
– Dynamic: not easy
CUR/CMD (example-based)
– Efficiency: better than SVD; redundancy in L
– Interpretation: actual columns from A
– Dynamic: not easy

78 Solutions: Colibri [Tong+ KDD 08 b]
Colibri-S: for static graphs
– Basic idea: remove linear redundancy
Colibri-D: for dynamic graphs
– Basic idea: leverage smoothness over time
Theorem (Tong+ 2008): (1) Colibri = CUR/CMD in accuracy; (2) Colibri <= CUR/CMD in time; (3) Colibri <= CUR/CMD in space
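The "remove linear redundancy" idea can be illustrated with a Gram–Schmidt pass that keeps a column only if the columns kept so far cannot already explain it. This is an illustrative stand-in, not the Colibri-S algorithm itself: the function name and tolerance are assumptions, and the matrix is the author-conference example with the row order assumed as John, Tom, Bob, Carl, Van, Roy.

```python
import math

A = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0],
     [0, 1, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1]]


def independent_columns(A, tol=1e-9):
    """Indices of columns that are not linear combinations of earlier kept columns."""
    rows, cols = len(A), len(A[0])
    basis, kept = [], []  # orthonormal basis spanning the kept columns
    for j in range(cols):
        residual = [float(A[i][j]) for i in range(rows)]
        for q in basis:  # subtract what the kept columns already explain
            dot = sum(a * b for a, b in zip(residual, q))
            residual = [a - dot * b for a, b in zip(residual, q)]
        norm = math.sqrt(sum(a * a for a in residual))
        if norm > tol:  # the column adds a new direction, so keep it
            basis.append([a / norm for a in residual])
            kept.append(j)
    return kept


kept = independent_columns(A)
```

The last two conference columns are identical, so only one of them is kept; a factor built from the kept columns spans the same space with less storage, which is the intuition behind the time and space claims.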

79 Comparison: SVD, CUR vs. Colibri (details). Table: wish list (Efficiency, Interpretation, Dynamics) against methods: SVD [Golub+ 1989], CUR [Drineas+ 2005], Colibri [Tong+ 2008].

80 Performance of Colibri-S. Figure: time and space for Ours, CUR, CMD, and SVD. Accuracy: same, 91%+. Time: 12x of CMD, 28x of CUR. Space: ~1/3 of CMD, ~10% of CUR.

81 Performance of Colibri-D. Figure: time vs. # of changed cols for CMD (prior best method), Colibri-S, and Colibri-D. Colibri-D achieves up to 112x speedup. Accuracy: same, 93%+. Data: network traffic, 21,837 nodes, 1,220 hours, 22,800 edge/hr.

83 Some of my other work
#1: FastDAP (in KDD07 a) – Predict Link Direction
#2: Graph X-Ray (in KDD 07 b) – Best Effort Pattern Match in Attributed Graphs
#3: GhostEdge (in KDD 08 a) – Classification in Sparsely Labeled Network
#4: TANGENT (in KDD09) – "surprise-me" recommendation
#5: GMine (in VLDB 06) – Interactive Graph Visualization and Mining
#6: Graphite (in ICDM 08) – Visual Query System for Attributed Graphs
#7: T3/MT3 (in CIKM 08) – Mine Complex Time-stamped Events
#8: BlurDetect (in ICME 04) – Determine whether or not, and how, an image is blurred
#9: MRBIR (in MM 04, TIP06) – Manifold-Ranking based Image Retrieval
#10: GBMML (in CVPR05, ACM/Multimedia 05) – Graph-based Multiple Modality Learning

84 Overview (this talk + others)
Tasks | Static Graphs | Dynamic Graphs | Images
Querying | CePS, iPoG, Basset, DAP, G-Ray, Graphite, TANGENT, FastRWR (KDD06, ICDM06, KDD07a, KDD07b, ICDM08, KAIS08, CIKM09, KDD09) | pTrack, cTrack, Fast-Update (SDM08, SAM08) | MRBIR, UOLIR (MM04, CVPR05)
Mining | Netshield, Colibri-S, GhostEdge, Gmine, Pack, Shiftr (VLDB06, KDD08a, KDD08b, SDM-LinkAnalysis 09) | T3/MT3, Colibri-D (KDD08a, CIKM08) | BlurDetect, GBMML, iQuality, iExpertise (ICDE04, ICIP04, MMM05, PCM05, MM05)

85 What is Next? Research Theme: Help users to understand and utilize large graph-related data
Goals | Step 1 (this talk) | Step 2 (medium term) | Step 3 (long term)
G1 Querying | CePS, iPoG, pTrack | Recommendation; Interpretable Q | Querying rich data
G2 Mining | Netshield, Colibri | Immunization; Interpretable M | Mining rich data
G3 Scalability | All above O(m) or better (single machine) | Scalable by parallel | Scalable on rich data

86 Current Recommendation (focus on relevance). Figure: movie graph with genre clusters (sci. fiction, comedy, horror, adventure); red nodes: recommended by (most of) the existing algorithms. Footnote: nodes are movies; an edge is the similarity between movies.

87 "Broad Spectrum Recommendation" (focus on completeness = relevance + diversity + novelty). Figure: movie graph with genre clusters (adventure, sci. fiction, comedy, horror). Footnote: nodes are movies; an edge is the similarity between movies.

89 Interpretable Recommendation. Current recommendation: Amazon.com recommends (based on items you purchased or told us you own).

90 Interpretable Recommendation
Current recommendation: Amazon.com recommends (based on items you purchased or told us you own).
Interpretable recommendation: Amazing.com recommends
– because it has the topics you are interested in: graph mining, linear algebra
– topics you might be interested in: Hadoop, submodularity

92 Immunization
This talk: SIS (e.g., flu)
In the future
– Immunize for SIR (e.g., chicken pox)
– Immunize in dynamic settings
  – dynamics of graphs, e.g., edges/nodes are changing
  – dynamics of the virus, e.g., the infection/healing rates are changing
Footnote: SIR stands for susceptible-infectious-recovered.

94 Interpretable Mining
– Find communities
– Find a few nodes/edges to describe
  – each community
  – the relationship between 2 communities
Footnote: Nodes are actors; edges indicate co-play in a movie.

96 Querying Rich Graphs (e.g., geo-coded, attributed). What is the difference between North America and Asia? Figure labels: Teenager; Adult; Phone; MSN.

97 Mining Rich Graphs (e.g., geo-coded, attributed). How to find patterns? (e.g., communities, anomalies) Figure labels: Teenager; Adult; Phone; MSN; telemarketer.

99 Scalability. Two orthogonal efforts:
– E1: O(m) or better on a single machine
– E2: Parallelism (e.g., Hadoop) (implementation, decouple, analysis)

100 Research Theme: Help users to understand and utilize large graph-related data. Figure labels: Real Data; User; Scalability.

101 My Collaboration Graph (During Ph.D. Study). Legend: green: querying; yellow: mining; purple: others. Node labels: CePS, iPoG, Basset, pTrack, BLin, BBLin, FastUpdate, Fast-iPoG, Colibri, GhostEdge, Graphite, Pack, TANGENT, GMine, T3, G-Ray, DAP, NBLin, cTrack, MT3, NetShield; group labels: Q1, Q2, Q3, M1, M2, M3.

102 Q & A Thank you! 102

