Published by Gerald Green. Modified over 9 years ago.
1
School of Computer Science, Carnegie Mellon
LLNL, Feb. '07, C. Faloutsos
Mining static and time-evolving graphs
Christos Faloutsos
Carnegie Mellon University
2
Overview
Mining static graphs
–CenterPiece Subgraphs (CePS)
–Fast RWR computation
–‘best-effort’ subgraph matching (in progress)
Mining time-evolving graphs
–Tensors + intrusion detection
–Sparse graphs
Other topics
–Graph sampling
–Graph generators
3
CePS
w/ Hanghang Tong, KDD 2006
htong@cs.cmu.edu
4
Center-Piece Subgraph (CePS)
Given Q query nodes
Find Center-piece ( )
Input of CePS
–Q query nodes
–Budget b
–k softAND coefficient
Applications
–Social networks
–Law enforcement, …
5
Challenges in CePS
Q1: How to measure the importance?
Q2: How to extract connection subgraph?
Q3: How to do it efficiently?
6
An Illustrating Example
[Figure: example graph with nodes 1 to 13]
Starting from node 1
–Move randomly to a neighbor
–With some probability p, return to node 1
Prob (RW will finally stay at j)
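The walk on this slide can be sketched as a simple power iteration. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `rwr_scores`, the toy path graph, and the restart probability `p = 0.15` are all hypothetical, since the slide gives neither a value for p nor the normalization used.

```python
def rwr_scores(neighbors, start, p=0.15, iters=200):
    """Approximate steady-state probabilities of a random walk with restart.

    neighbors: dict mapping each node to a non-empty list of its neighbors.
    start:     node the walker jumps back to with probability p (assumed value).
    """
    nodes = list(neighbors)
    score = {v: 0.0 for v in nodes}
    score[start] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        for v in nodes:
            # With probability 1 - p, move to a uniformly chosen neighbor.
            share = (1.0 - p) * score[v] / len(neighbors[v])
            for u in neighbors[v]:
                nxt[u] += share
        # With probability p, return to the starting node.
        nxt[start] += p
        score = nxt
    return score

# Hypothetical toy graph: the path 1 - 2 - 3 - 4.
toy = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
scores = rwr_scores(toy, start=1)
```

The returned values sum to 1 and can be read as the probability that the walker is found at each node j in the long run.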
7
Individual Score Calculation
          Q1      Q2      Q3
Node 1    0.5767  0.0088  0.0088
Node 2    0.1235  0.0076  0.0076
Node 3    0.0283  0.0283  0.0283
Node 4    0.0076  0.1235  0.0076
Node 5    0.0088  0.5767  0.0088
Node 6    0.0076  0.0076  0.1235
Node 7    0.0088  0.0088  0.5767
Node 8    0.0333  0.0024  0.1260
Node 9    0.1260  0.0024  0.0333
Node 10   0.1260  0.0333  0.0024
Node 11   0.0333  0.1260  0.0024
Node 12   0.0024  0.1260  0.0333
Node 13   0.0024  0.0333  0.1260
9
AND: Combining Scores
Q: How to combine scores?
A: Multiply
… = prob. that 3 random particles coincide on node j
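A minimal sketch of the AND combination, assuming the three particles move independently. The function name `and_score` is hypothetical, and the two score triples are taken from the individual-score slide on the assumption that its values are listed row by row, one triple per node.

```python
from math import prod

def and_score(individual_scores):
    """Probability that independent particles, one per query node, coincide on node j."""
    return prod(individual_scores)

# Triple with equal closeness to Q1, Q2, and Q3.
balanced = and_score([0.0283, 0.0283, 0.0283])
# Triple that is close to one query node and far from the other two.
lopsided = and_score([0.1235, 0.0076, 0.0076])
```

Multiplication rewards nodes that are reasonably close to every query node: here `balanced` comes out larger than `lopsided` even though the latter has the larger single score.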
11
K_SoftAnd: Combining Scores
Generalization – softAND: We want nodes close to k of Q (k<Q) query nodes.
Q: How to do that?
A: Prob(at least k-out-of-Q will meet each other at j)
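One way to evaluate "at least k out of Q" is a small dynamic program over the Q individual scores. This is a sketch that assumes the Q particles are independent; the function name `k_softand_score` is hypothetical and the slides do not show how the authors compute this quantity.

```python
def k_softand_score(individual_scores, k):
    """P(at least k of the independent particles are at node j)."""
    # dist[m] = probability that exactly m of the particles seen so far are at j.
    dist = [1.0]
    for r in individual_scores:
        new = [0.0] * (len(dist) + 1)
        for m, pm in enumerate(dist):
            new[m] += pm * (1.0 - r)      # this particle is elsewhere
            new[m + 1] += pm * r          # this particle is at j
        dist = new
    return sum(dist[k:])

triple = [0.5, 0.5, 0.5]                  # made-up scores for illustration
soft2 = k_softand_score(triple, 2)        # at least 2 of 3
hard_and = k_softand_score(triple, 3)     # k = Q reduces to the plain product
soft_or = k_softand_score(triple, 1)      # k = 1 is "at least one", i.e. OR
```

With k = Q this reduces to the AND product, and with k = 1 it gives one minus the probability that no particle is at j.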
12
AND query vs. K_SoftAnd query
[Figure: node scores for the AND query and the 2_SoftAnd query; scale x 1e-4]
13
1_SoftAnd query = OR query
14
Challenges in CePS
Q1: How to measure the importance?
Q2: How to extract connection subgraph?
Q3: How to do it efficiently?
15
“Extract” Alg.
Goal
–Maximize total scores and
–‘Appropriate’ connections
How to… “Extract” Alg.
–Dynamic programming
–Greedy alg.: pick up a promising node, find the ‘best’ path
[Figure: two example graphs, with nodes 1 to 16 and nodes 1 to 13]
16
Challenges in CePS
Q1: How to measure the importance?
Q2: How to extract connection subgraph?
Q3: How to do it efficiently?
17
Graph Partition: Efficiency Issue
Straightforward way
–Q linear systems
–Linear in the # of edges
Observation
–Skewed dist.
How to…
–Graph partition
18
Even better:
We can correct for the deleted edges (Tong+, ICDM’06, best paper award)
But let’s omit the details
19
Experimental Setup
Dataset
–DBLP/authorship
–Author-Paper
–315k nodes
–1,800k edges
20
Experimental Setup
We want to check
–Do the goodness criteria make sense?
–Does the “Extract” alg. capture most of the important nodes/edges?
–Efficiency
21
Case Study: AND query
22
2_SoftAnd query
[Figure labels: Statistic, database]
23
Evaluation of “Extract” Alg.
[Figure: node ratio vs. budget (b), for 2 query nodes and 3 query nodes]
20 nodes: 90%+ preserved
24
Running Time vs. Quality for Fast CePS
[Figure: quality vs. running time]
~90% quality, 6:1 speedup
25
Conclusion
Q1: How to measure the importance?
A1: RWR + K_SoftAnd
Q2: How to find connection subgraph?
A2: “Extract” Alg.
Q3: How to do it efficiently?
A3: Graph Partition (Fast CePS)
–~90% quality
–6:1 speedup
26
Overview
Mining static graphs
–CenterPiece Subgraphs (CePS)
–Fast RWR computation
–‘best-effort’ subgraph matching (in progress)
Mining time-evolving graphs
Other topics
27
Random walk with restart
[Figure: 12-node example graph and the ranking vector for node 4; scores range from 0.02 to 0.22]
More red, more relevant
Nearby nodes, higher scores
28
Computing RWR
[Figure: 12-node example graph; update equation relating the ranking vector (n x 1) at step t+1 to the adjacency matrix (n x n), the ranking vector at step t, and the starting vector (n x 1)]
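The equation on this slide did not survive extraction. One common way to write the RWR update is given below; the symbols are assumptions made here, not taken from the slide: $c$ is the restart probability, $\tilde{W}$ the column-normalized adjacency matrix (n x n), $\vec{r}^{(t)}$ the ranking vector (n x 1) at step $t$, and $\vec{e}$ the starting vector (n x 1) with a 1 at the query node.

```latex
\vec{r}^{(t+1)} = (1 - c)\,\tilde{W}\,\vec{r}^{(t)} + c\,\vec{e}
```

The authors' papers may use a different convention for which of the two coefficients is called the restart probability.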
30
Alternatives
On-the-fly: precompute nothing -> slow
Precompute a little, and adjust on-the-fly
Precompute everything -> O(N*N) space
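The space cost of the "precompute everything" option can be seen from the closed form of the RWR fixed point. As a sketch, with assumed symbols (restart probability $c$, column-normalized adjacency matrix $\tilde{W}$, starting vector $\vec{e}$), setting $\vec{r} = (1 - c)\,\tilde{W}\,\vec{r} + c\,\vec{e}$ and solving for $\vec{r}$ gives:

```latex
\bigl(I - (1 - c)\,\tilde{W}\bigr)\,\vec{r} = c\,\vec{e}
\quad\Longrightarrow\quad
\vec{r} = c\,\bigl(I - (1 - c)\,\tilde{W}\bigr)^{-1}\,\vec{e}
```

The inverse is an N x N matrix that is generally dense even when $\tilde{W}$ is sparse, which is where the O(N*N) storage comes from; answering a query then amounts to reading off one column.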
32
Computing RWR
[Figure: 12-node example graph]
Break into ‘communities’
34
FastRWR
Instead of ONE BIG (and dense) inverted matrix
Several, smaller matrices, plus info about the ‘bridges’
37
Query Time vs. Pre-Compute Time
[Figure: log query time vs. log pre-compute time]
Quality: 90%+
On-line: up to 150x speedup
Pre-computation: two orders of magnitude saving
38
Query Time vs. Pre-Storage
[Figure: log query time vs. log storage]
Quality: 90%+
On-line: up to 150x speedup
Pre-storage: three orders of magnitude saving
39
Conclusion
FastRWR
–Good accuracy (90%+)
–150x speed-up: query time
–Orders of magnitude saving: pre-compute & storage
40
Overview
Mining static graphs
–CenterPiece Subgraphs (CePS)
–Fast RWR computation
–‘best-effort’ subgraph matching (in progress)
Mining time-evolving graphs
Other topics
41
Best-effort Sub-Graph Matching, on Attributed Graphs
42
‘Best-effort’: problem dfn.
Nodes have one (categorical) attribute
Query: e.g., loop -> ‘money laundering’
Synthetic data
43
‘Best-effort’: problem dfn.
[Figure: loop query and its results]
44
[Figure: star query and its results]
45
DBLP dataset
Authorship graph
–Nodes: authors
–Edges: # of co-authored papers
–Attributes: conference and year
~300k nodes, ~1m edges
46
Line Query: Results
Footnote for results
–Red nodes: qualifying nodes
–White nodes: intermediate nodes
47
Star-Query: Results
Footnote for results
–Red nodes: qualifying nodes
–White nodes: intermediate nodes
48
Loop-Query: Results
49
P.I.T. Terrorist Relations
Nodes: terrorist relationships
–Attributes: Family, Contact, Colleague, Congregate
Edges: two relationships share a common person
~1k nodes and ~8k edges
50
Star-Query: Results
51
Overview
Mining static graphs
–CenterPiece Subgraphs (CePS)
–Fast RWR computation
–‘best-effort’ subgraph matching (in progress)
Mining time-evolving graphs
–Tensors + intrusion detection
–Other tools (MDL)
Other topics
52
Tensors for time evolving graphs
[Jimeng Sun+ KDD’06]
[ “, SDM’07]
[CF, Kolda, Sun, SDM’07 tutorial]
53
Social network analysis
Static: find community structures
Dynamic: monitor community structure evolution; spot abnormal individuals; abnormal time-stamps
54
Network Forensics
Directional network flows
A large ISP with 100 POPs, each POP with 10Gbps link capacity [Hotnets2004]
Task: identify abnormal traffic patterns and find the cause
[Figure: source vs. destination plots for normal traffic and abnormal traffic]
55
Tensors - outline
Motivation
Main ideas
Experiments
56
Static case
For a timestamp, data can be modeled using a tensor (matrix == 2-mode tensor)
[Figure: Location x Type matrix at Time = 0; types include temperature and light]
57
Dynamic case: Tensor streams
[Figure: Location x Type matrix at Time = 0; types include temperature and light]
59
Dynamic Data model: Tensor streams
Streams come with structure
–(time, location, sensor-modality)
–(time, host-id, measurement-type)
[Figure: Location x Type matrices along the time axis; example entry (Jimeng’s Desk, light)]
61
What is the factor?
Factor is a set of 1D summaries
Multi-linear approximation on all aspects
[Figure: 1st factor along the Location, Type, and Time modes; labels: Day, Night, Close to window, Away from window]
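To make "a set of 1D summaries" concrete, a single factor can be pictured as one weight vector per mode plus a scaling factor, whose outer product approximates the data tensor. The sketch below is only an illustration of that idea, not the method from the talk; the function name `rank1_tensor` and every number in it are made up.

```python
def rank1_tensor(scale, a, b, c):
    """Outer product scale * a x b x c, as a nested list indexed [i][j][k]."""
    return [[[scale * x * y * z for z in c] for y in b] for x in a]

# Hypothetical 1D summaries, one per mode (all values invented for illustration).
location = [0.9, 0.1]       # close to window, away from window
sensor_type = [0.2, 0.8]    # temperature, light
time = [1.0, 0.0]           # day, night

approx = rank1_tensor(10.0, location, sensor_type, time)
# Light reading close to the window during the day:
bright = approx[0][1][0]
```

A factor of this form stores one number per location, per type, and per time step, rather than one number per (location, type, time) cell.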
62
Tensors - outline
Motivation
Main ideas
Experiments
63
WTA on real sensor data
1st factor (scaling factor 250) consists of the main trends:
–Daily periodicity on time
–Uniform on all locations
–Temp, Light and Volt are positively correlated with each other and negatively correlated with Humid
[Figure: 1st factor along the time, location, and type modes]
64
WTA on real sensor data (cont.)
2nd factor (scaling factor 154) captures an atypical trend:
–Uniformly across all time
–Concentrating on 3 locations
–Mainly due to voltage
Interpretation: two sensors have low battery, and the other one has high battery.
[Figure: 2nd factor along the time, location, and type modes]
65
Application 1: Multiway latent semantic indexing (LSI)
Projection matrices specify the clusters
Core tensors give cluster activation level
[Figure: authors x keyword tensors from 1990 to 2004, with projection matrices U_authors and U_keyword; DB and DM clusters; example authors Michael Stonebraker and Philip Yu; query and pattern]
66
Bibliographic data (DBLP)
Papers from VLDB and KDD conferences
Construct 2nd order tensors with yearly windows
Each tensor: 4584 x 3741
11 timestamps (years)
67
Multiway LSI
DB, 1995
–Authors: michael carey, michael stonebraker, h. jagadish, hector garcia-molina
–Keywords: queri, parallel, optimization, concurr, objectorient
DB, 2004
–Authors: surajit chaudhuri, mitch cherniack, michael stonebraker, ugur cetintemel
–Keywords: distribut, systems, view, storage, servic, process, cache
DM, 2004
–Authors: jiawei han, jian pei, philip s. yu, jianyong wang, charu c. aggarwal
–Keywords: streams, pattern, support, cluster, index, gener, queri
Two groups are correctly identified: Databases and Data mining
People and concepts are drifting over time
68
Application 2: Network Anomaly Detection
Anomaly detection
–Reconstruction error driven
–Multiple resolution
Data
–TCP flows collected at CMU backbone
–Raw data: 500GB with compression
–Construct 3rd order tensors with hourly windows
–1200 timestamps (hours)
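"Reconstruction error driven" detection can be sketched as follows: compare each hourly window with its low-rank approximation and flag the hours whose error stands out. This is a simplified illustration, not the system from the talk. It uses a source x destination matrix per hour instead of a 3rd order tensor, it takes the approximation as given, and the flagging rule (mean plus two standard deviations) and all numbers are assumptions.

```python
from statistics import mean, pstdev

def reconstruction_error(observed, approx):
    """Relative squared error between two equally shaped matrices (nested lists)."""
    num = sum((o - a) ** 2
              for row_o, row_a in zip(observed, approx)
              for o, a in zip(row_o, row_a))
    den = sum(o ** 2 for row_o in observed for o in row_o)
    return num / den if den else 0.0

def flag_anomalies(errors, n_sigma=2.0):
    """Indices of windows whose error exceeds mean + n_sigma * std (assumed rule)."""
    mu, sigma = mean(errors), pstdev(errors)
    return [t for t, e in enumerate(errors) if e > mu + n_sigma * sigma]

# Made-up per-hour errors; hour 4 is poorly explained by the model.
hourly_errors = [0.05, 0.06, 0.04, 0.05, 0.60, 0.05, 0.06, 0.05]
suspicious = flag_anomalies(hourly_errors)

# Error of a toy window against a toy approximation.
toy_error = reconstruction_error([[1, 0], [0, 1]], [[1, 0], [0, 0]])
```

Once an hour is flagged, the same error can be broken down per source and destination to locate where the anomaly occurred.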
69
Network anomaly detection
Identify when and where anomalies occurred.
The prominent difference between normal and abnormal hours is mainly due to unusual scanning activity (confirmed by the campus admin).
[Figure: error vs. time (hour); source vs. destination plots for a normal and an abnormal hour, with scanners marked]
70
Computational cost
[Figure: CPU time on the 3rd order network tensor and the 2nd order DBLP tensor]
OTA is the offline tensor analysis
Performance metric: CPU time (sec)
Observations:
–DTA and STA are orders of magnitude faster than OTA
–The slight upward trend in DBLP is due to the increasing number of papers each year (data become denser over time)
71
Accuracy comparison
[Figure: error ratio on the 3rd order network tensor and the 2nd order DBLP tensor]
Performance metric: the ratio of reconstruction error between DTA/STA and OTA, fixing the error of OTA to 20%
Observation: DTA performs very close to OTA on both datasets; STA performs worse on DBLP due to the bigger changes.
72
InteMon: intelligent monitoring system on large clusters
[VLDB06 demo]
[Operating System Review 06]
73
System Architecture
74
Case 1: Environmental Monitoring
Abnormal dehumidification and reheating cycle is identified
[Figure: temperature and humidity]
76
Overview
Mining static graphs
–CenterPiece Subgraphs (CePS)
–Fast RWR computation
–‘best-effort’ subgraph matching (in progress)
Mining time-evolving graphs
–Tensors + intrusion detection
–Other tools (MDL)
Other topics
77
Parameter-free mining
Using MDL, to
–Find ‘natural’ communities
–‘natural’ cut-points
(under submission)
78
MDL mining on time-evolving graph (Enron emails)
79
Overview
Mining static graphs
Mining time-evolving graphs
Other topics
–Graph sampling
–Graph generators
81
Overview
Mining static graphs
–CenterPiece Subgraphs (CePS)
–Fast RWR computation
–‘best-effort’ subgraph matching (in progress)
Mining time-evolving graphs
–Tensors + intrusion detection
–Sparse graphs
Other topics
–Graph sampling
–Graph generators
82
Realistic graph generation
Kronecker graphs
[Leskovec+, PKDD’05]
[Leskovec+, under review]
83
Why fit graph models?
Parameters tell us about the structure of a graph
Extrapolation: given a graph today, how will it look in a year?
Sampling: can I get a smaller graph with similar properties?
Anonymization: instead of releasing the real graph (e.g., an email network), we can release a synthetic version of it
84
Experiments on real AS graph
[Figure panels: degree distribution, hop plot, network value, adjacency matrix eigenvalues]
85
Intro to Kronecker graphs
86
Problem Definition
Given a growing graph with count of nodes N1, N2, …
Generate a realistic sequence of graphs that will obey all the patterns
Idea: Self-similarity
–Leads to power laws
–Communities within communities
–…
87
Recursive Graph Generation
There are many obvious (but wrong) ways
–Does not obey Densification Power Law
–Has increasing diameter
Kronecker Product is exactly what we need
[Figure: initial graph and its recursive expansion]
88
Kronecker Product – a Graph
[Figure: a graph and its adjacency matrix; intermediate stage and its adjacency matrix]
89
Kronecker Product – a Graph
Continuing multiplying with G1 we obtain G4 and so on …
[Figure: G4 adjacency matrix]
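The repeated multiplication with G1 can be sketched with a plain Kronecker product on 0/1 adjacency matrices. This is a minimal illustration: the function names and the 3-node initiator below are hypothetical choices, not the initiator shown in the talk's figures.

```python
def kronecker(a, b):
    """Kronecker product of two square matrices given as nested lists."""
    n, m = len(a), len(b)
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]

def kronecker_power(g1, k):
    """G_k = G1 (x) G1 (x) ... (x) G1, with k copies of the initiator G1."""
    g = g1
    for _ in range(k - 1):
        g = kronecker(g, g1)
    return g

# Hypothetical initiator G1: a 3-node path with self-loops.
g1 = [[1, 1, 0],
      [1, 1, 1],
      [0, 1, 1]]
g2 = kronecker_power(g1, 2)
g4 = kronecker_power(g1, 4)

nodes_g4 = len(g4)                      # 3 ** 4
ones_g4 = sum(sum(row) for row in g4)   # 7 ** 4
```

Both the number of nodes and the number of nonzero entries grow as powers of the initiator's counts, so the ratio of their logarithms stays fixed across G1, G2, G3, ….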
90
Conclusions
Static graphs: Random Walks, “CePS”, best-effort sub-graph matching
Dynamic graphs: Tensors (intrusion/change detection)
Graph generation: Kronecker
91
References
Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast Random Walk with Restart and Its Applications. ICDM 2006, Hong Kong.
Hanghang Tong and Christos Faloutsos. Center-Piece Subgraphs: Problem Definition and Fast Solutions. KDD 2006, Philadelphia, PA.
92
References
Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations. KDD 2005, Chicago, IL ("Best Research Paper" award).
Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, and Christos Faloutsos. Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication. ECML/PKDD 2005, Porto, Portugal.
93
References
Jimeng Sun, Dacheng Tao, and Christos Faloutsos. Beyond Streams and Graphs: Dynamic Tensor Analysis. KDD 2006, Philadelphia, PA.
Jimeng Sun, Yinglian Xie, Hui Zhang, and Christos Faloutsos. Less is More: Compact Matrix Decomposition for Large Sparse Graphs. SDM, Minneapolis, Minnesota, Apr 2007.
94
Thank you!
Contact info: {christos, htong, jimeng, jure} cs.cmu.edu
www.cs.cmu.edu/~christos (w/ papers, datasets, code, etc)