School of Computer Science Carnegie Mellon LLNL, Feb. '07C. Faloutsos1 Mining static and time-evolving graphs Christos Faloutsos Carnegie Mellon University
Overview: Mining static graphs –CenterPiece Subgraphs (CePS) –Fast RWR computation –‘best-effort’ subgraph matching (in progress); Mining time-evolving graphs –Tensors + intrusion detection –Sparse graphs; Other topics –Graph sampling –Graph generators
CePS, w/ Hanghang Tong, KDD 2006
Center-Piece Subgraph (CePS). Given: Q query nodes. Find: center-piece ( ). Input of CePS: –Q query nodes –Budget b –k, the softAND coefficient. Applications: –Social networks –Law enforcement, …
Challenges in CePS: Q1: How to measure the importance? Q2: How to extract the connection subgraph? Q3: How to do it efficiently?
An Illustrating Example: start from node 1; move randomly to a neighbor; with some probability p, return to node 1. Score of node j = Prob(the random walk will finally stay at j)
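The walk described on this slide can be sketched as a power iteration. This is a minimal illustration, not the authors' code: the function name `rwr_scores`, the toy path graph, the column normalization and the default restart probability 0.15 are all assumptions.

```python
import numpy as np

def rwr_scores(adj, start, restart=0.15, tol=1e-12, max_iter=1000):
    """Random walk with restart by power iteration (illustrative sketch)."""
    adj = np.asarray(adj, dtype=float)
    # Column-normalize: column i holds the move probabilities out of node i.
    # Assumes every node has at least one edge.
    trans = adj / adj.sum(axis=0, keepdims=True)
    e = np.zeros(adj.shape[0])
    e[start] = 1.0                      # the walker always restarts here
    r = e.copy()
    for _ in range(max_iter):
        # Move to a random neighbor, or jump back to the start node.
        r_next = (1.0 - restart) * trans @ r + restart * e
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# Hypothetical toy graph: the path 0 - 1 - 2 - 3.
ADJ = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
scores = rwr_scores(ADJ, start=0)
```

`scores[j]` plays the role of Prob(the random walk will finally stay at j); the entries sum to 1.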
Individual Score Calculation [table: scores of Node 1 through Node 12 with respect to query nodes Q1, Q2, Q3]
AND: Combining Scores. Q: How to combine scores? A: Multiply … = prob. that 3 random particles coincide on node j
K_SoftAnd: Combining Scores. Generalization – SoftAND: we want nodes close to k of the Q (k<Q) query nodes. Q: How to do that? A: Prob(at least k-out-of-Q will meet each other at j)
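Both combination rules can be sketched in a few lines, assuming the Q particles move independently. The function names and the example probabilities below are hypothetical.

```python
from math import prod

def and_score(probs):
    """AND query: probability that all Q independent particles are at node j."""
    return prod(probs)

def k_softand_score(probs, k):
    """Probability that at least k of the Q independent particles are at node j."""
    # dist[m] = probability that exactly m of the particles seen so far are at j.
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for m, mass in enumerate(dist):
            nxt[m] += mass * (1.0 - p)      # this particle is elsewhere
            nxt[m + 1] += mass * p          # this particle is at j
        dist = nxt
    return sum(dist[k:])

# Hypothetical individual scores of one node j for Q = 3 query nodes.
node_probs = [0.5, 0.2, 0.1]
```

With k = Q the soft version reduces to the plain product, and with k = 1 it becomes an OR query.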
AND query vs. K_SoftAnd query [figure: AND query and 2_SoftAnd query results; scale x 1e-4]
1_SoftAnd query = OR query
Challenges in CePS: Q1: How to measure the importance? Q2: How to extract the connection subgraph? Q3: How to do it efficiently?
“Extract” Alg. Goal: –Maximize total scores and –‘Appropriate’ connections. How to: “Extract” Alg. –Dynamic programming –Greedy alg.: pick up a promising node, find the ‘best’ path
Challenges in CePS: Q1: How to measure the importance? Q2: How to extract the connection subgraph? Q3: How to do it efficiently?
Graph Partition: Efficiency Issue. Straightforward way: –Q linear systems –linear in # of edges. Observation: –Skewed dist. How to: –Graph partition
Even better: we can correct for the deleted edges (Tong+, ICDM’06, best paper award). But let’s omit the details.
Experimental Setup. Dataset: –DBLP/authorship –Author-Paper –315k nodes –1,800k edges
Experimental Setup. We want to check: –Do the goodness criteria make sense? –Does the “extract” alg. capture most of the important nodes/edges? –Efficiency
Case Study: AND query
2_SoftAnd query [figure labels: Statistic, database]
Evaluation of “Extract” Alg. [plot: Node Ratio vs. Budget (b), for 2 query nodes and 3 query nodes] 20 nodes: 90%+ preserved
Running Time vs. Quality for Fast CePS [plot: Quality vs. Running Time] ~90% quality, 6:1 speedup
Conclusion. Q1: How to measure the importance? A1: RWR + K_SoftAnd. Q2: How to find the connection subgraph? A2: “Extract” Alg. Q3: How to do it efficiently? A3: Graph Partition (Fast CePS) –~90% quality –6:1 speedup
Overview: Mining static graphs –CenterPiece Subgraphs (CePS) –Fast RWR computation –‘best-effort’ subgraph matching (in progress); Mining time-evolving graphs; Other topics
Random walk with restart [figure: graph and per-node ranking vector] More red, more relevant. Nearby nodes, higher scores.
Computing RWR [equation; labels: n x n matrix, n x 1 ranking vector, n x 1 starting vector]
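One common way to write the computation this slide depicts is given below. It is a reconstruction, not copied from the slide: the symbols are assumptions, with \tilde{W} the normalized n x n adjacency matrix, \vec{r}_i the n x 1 ranking vector, \vec{e}_i the n x 1 starting vector, and 1 - c the restart probability.

```latex
\vec{r}_i = c\,\tilde{W}\,\vec{r}_i + (1-c)\,\vec{e}_i
\quad\Longrightarrow\quad
\vec{r}_i = (1-c)\,\bigl(I - c\,\tilde{W}\bigr)^{-1}\,\vec{e}_i
```

The closed form on the right needs the inverse of an n x n matrix, which is generally dense even when \tilde{W} is sparse.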
Alternatives: On-the-fly: precompute nothing -> slow. Precompute a little, and adjust on-the-fly. Precompute everything -> O(N*N) space.
Computing RWR: break into ‘communities’
FastRWR: instead of ONE BIG (and dense) inverted matrix: several smaller matrices, plus info about the ‘bridges’
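A rough sketch of the idea follows: invert each community block on its own, then correct for the bridge edges with the Sherman-Morrison-Woodbury identity. This is an illustration in the spirit of the slide, not the authors' implementation. All names, the toy graph and the column normalization are assumptions, and the blocks are assumed to cover every node exactly once with at least one bridge edge between them.

```python
import numpy as np

def apply_q(blocks, q_blocks, x):
    """Multiply x by the block-diagonal inverse, one small block at a time."""
    out = np.zeros_like(x, dtype=float)
    for idx, qb in zip(blocks, q_blocks):
        out[idx] = qb @ x[idx]
    return out

def precompute(trans, blocks, restart):
    """Several small inverses plus a low-rank summary of the bridges."""
    c = 1.0 - restart
    within = np.zeros_like(trans)
    q_blocks = []
    for idx in blocks:
        ix = np.ix_(idx, idx)
        within[ix] = trans[ix]
        q_blocks.append(np.linalg.inv(np.eye(len(idx)) - c * trans[ix]))
    # Bridges = cross-community entries; keep only their nonzero singular part.
    u, s, vt = np.linalg.svd(trans - within)
    keep = s > 1e-12
    u, s, vt = u[:, keep], s[keep], vt[keep]
    lam = np.linalg.inv(np.diag(1.0 / s) - c * vt @ apply_q(blocks, q_blocks, u))
    return q_blocks, u, lam, vt

def fast_rwr(pre, blocks, start, restart):
    """Query-time ranking vector from the precomputed pieces."""
    q_blocks, u, lam, vt = pre
    c = 1.0 - restart
    e = np.zeros(u.shape[0])
    e[start] = 1.0
    qe = apply_q(blocks, q_blocks, e)
    correction = apply_q(blocks, q_blocks, u @ (lam @ (vt @ qe)))
    return restart * (qe + c * correction)

# Hypothetical toy graph: two triangles joined by the bridge edge 2 - 3.
ADJ = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    ADJ[a, b] = ADJ[b, a] = 1.0
TRANS = ADJ / ADJ.sum(axis=0, keepdims=True)
BLOCKS = [[0, 1, 2], [3, 4, 5]]
RESTART = 0.15
PRE = precompute(TRANS, BLOCKS, RESTART)
fast = fast_rwr(PRE, BLOCKS, start=0, restart=RESTART)
```

Keeping every nonzero singular value of the bridge matrix makes the correction exact. Truncating to fewer singular values would trade accuracy for less storage.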
Query Time vs. Pre-Compute Time [plot: Log Query Time vs. Log Pre-compute Time] Quality: 90%+. On-line: up to 150x speedup. Pre-computation: two orders of magnitude saving.
Query Time vs. Pre-Storage [plot: Log Query Time vs. Log Storage] Quality: 90%+. On-line: up to 150x speedup. Pre-storage: three orders of magnitude saving.
Conclusion: FastRWR –Good accuracy (90%+) –150x speed-up in query time –Orders of magnitude saving in pre-computation & storage
Overview: Mining static graphs –CenterPiece Subgraphs (CePS) –Fast RWR computation –‘best-effort’ subgraph matching (in progress); Mining time-evolving graphs; Other topics
Best-effort Sub-Graph Matching, on Attributed Graphs
‘Best-effort’: problem dfn. Nodes have one (categorical) attribute. Query: e.g., loop -> ‘money laundering’. Synthetic data.
‘Best-effort’: problem dfn. [figure: Loop-Query and Results]
Star-Query: Results
DBLP dataset. Authorship graph: –Nodes: authors –Edges: # of co-authored papers –Attributes: conference and year. ~300k nodes, ~1m edges
Line Query: Results. Footnote for results: –Red nodes: qualifying nodes –White nodes: intermediate nodes
Star-Query: Results. Footnote for results: –Red nodes: qualifying nodes –White nodes: intermediate nodes
Loop-Query: Results
P.I.T. Terrorist Relations. Nodes: terrorist relationships –Attributes: Family, Contact, Colleague, Congregate. Edges: two relationships share a common person. ~1k nodes and ~8k edges
Star-Query: Results
Overview: Mining static graphs –CenterPiece Subgraphs (CePS) –Fast RWR computation –‘best-effort’ subgraph matching (in progress); Mining time-evolving graphs –Tensors + intrusion detection –Other tools (MDL); Other topics
Tensors for time evolving graphs [Jimeng Sun+, KDD’06] [Jimeng Sun+, SDM’07] [CF, Kolda, Sun, SDM’07 tutorial]
Social network analysis. Static: find community structures. Dynamic: monitor community structure evolution; spot abnormal individuals; spot abnormal time-stamps.
Network Forensics. Directional network flows. A large ISP with 100 POPs, each POP with 10Gbps link capacity [Hotnets2004]. Task: identify abnormal traffic patterns and find out the cause. [figure: source vs. destination, normal traffic and abnormal traffic]
Tensors - outline: Motivation; Main ideas; Experiments
Static case: for a timestamp, data can be modeled using a tensor (matrix == 2-mode tensor). [figure: Location x Type at Time = 0; types: temperature, light]
Dynamic case: Tensor streams [figure: Location x Type at Time = 0; types: temperature, light]
Dynamic Data model: Tensor streams. Streams come with structure: –(time, location, sensor-modality) –(time, host-id, measurement-type) [figure: Location x Type over time; example cell: (Jimeng’s Desk, light)]
What is the factor? A factor is a set of 1D summaries. Multi-linear approximation on all aspects. [figure: 1st factor over Location, Type, Time; labels: Day, Night, Close to window, Away from window]
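To make “a set of 1D summaries” concrete, the sketch below takes the leading singular vector of each mode's unfolding of a small (location, type, time) tensor. This is a generic HOSVD-style illustration, not the method used in the talk; the function name and the toy values are assumptions.

```python
import numpy as np

def first_factor(tensor):
    """Return (scale, [one 1D summary per mode]) for a 3-mode tensor."""
    summaries = []
    for mode in range(tensor.ndim):
        # Unfold: put `mode` first and flatten the other modes into columns.
        unfolded = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
        u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
        vec = u[:, 0]
        # Fix the sign so the largest-magnitude entry is positive.
        vec = vec * np.sign(vec[np.argmax(np.abs(vec))])
        summaries.append(vec)
    # Scaling factor: how strongly the tensor loads on this combination.
    scale = np.einsum("ijk,i,j,k->", tensor, *summaries)
    return scale, summaries

# Hypothetical readings: 2 locations x 2 types x 4 time ticks (day/night).
location = np.array([1.0, 2.0])          # away from window, close to window
sensor_type = np.array([1.0, 3.0])       # temperature, light
time = np.array([1.0, 0.5, 1.0, 0.5])    # day, night, day, night
data = np.einsum("i,j,k->ijk", location, sensor_type, time)

scale, (loc_sum, type_sum, time_sum) = first_factor(data)
approx = scale * np.einsum("i,j,k->ijk", loc_sum, type_sum, time_sum)
```

Because the toy tensor is exactly rank 1, `approx` reproduces `data`; on real data the first factor would only capture the main trend.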
Tensors - outline: Motivation; Main ideas; Experiments
WTA on real sensor data. [figure: 1st factor, scaling factor 250; modes: type, location, time] The 1st factor consists of the main trends: –Daily periodicity on time –Uniform on all locations –Temp, Light and Volt are positively correlated with each other and negatively correlated with Humid
WTA on real sensor data (cont.) [figure: 2nd factor, scaling factor 154; modes: type, location, time] The 2nd factor captures an atypical trend: –Uniform across all time –Concentrated on 3 locations –Mainly due to voltage. Interpretation: two sensors have low battery, and the other one has high battery.
Application 1: Multiway latent semantic indexing (LSI). [figure: query pattern over the keyword and authors modes, with projection matrices U; clusters DB (Michael Stonebraker) and DM (Philip Yu)] Projection matrices specify the clusters. Core tensors give cluster activation level.
Bibliographic data (DBLP). Papers from VLDB and KDD conferences. Construct 2nd order tensors with yearly windows with Each tensor: 4584 timestamps (years)
Multiway LSI
Authors | Keywords | Year
michael carey, michael stonebraker, h. jagadish, hector garcia-molina | queri, parallel, optimization, concurr, objectorient | 1995
surajit chaudhuri, mitch cherniack, michael stonebraker, ugur cetintemel | distribut, systems, view, storage, servic, process, cache | 2004
jiawei han, jian pei, philip s. yu, jianyong wang, charu c. aggarwal | streams, pattern, support, cluster, index, gener, queri | 2004
Two groups are correctly identified: Databases (DB) and Data mining (DM). People and concepts are drifting over time.
Application 2: Network Anomaly Detection. Anomaly detection: –Reconstruction error driven –Multiple resolution. Data: –TCP flows collected at CMU backbone –Raw data 500GB with compression –Construct 3rd order tensors with hourly windows with –1200 timestamps (hours)
Network anomaly detection. Identify when and where anomalies occurred. The prominent difference between normal and abnormal ones is mainly due to unusual scanning activity (confirmed by the campus admin). [figure: reconstruction error vs. time (hour); source vs. destination for Abnormal and Normal, with scanners marked]
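Reconstruction-error-driven detection can be sketched on synthetic source x destination matrices: learn low-rank projections from normal hours, then flag hours whose error is unusually large. The data, function names, rank and the mean + 3 standard deviations threshold are all assumptions for illustration, not the setup used on the CMU traffic.

```python
import numpy as np

def fit_subspaces(train, rank):
    """Source- and destination-side bases from normal (training) matrices."""
    src = np.hstack(list(train))        # sources x (destinations * T)
    dst = np.vstack(list(train)).T      # destinations x (sources * T)
    u = np.linalg.svd(src, full_matrices=False)[0][:, :rank]
    v = np.linalg.svd(dst, full_matrices=False)[0][:, :rank]
    return u, v

def reconstruction_errors(stream, u, v):
    """Squared error of each matrix after projecting onto both subspaces."""
    errors = []
    for x in stream:
        approx = u @ (u.T @ x @ v) @ v.T
        errors.append(np.sum((x - approx) ** 2))
    return np.array(errors)

def flag_anomalies(errors, num_std=3.0):
    """Indices whose error exceeds mean + num_std standard deviations."""
    threshold = errors.mean() + num_std * errors.std()
    return np.flatnonzero(errors > threshold)

# Synthetic traffic: 24 hourly 8 x 8 source-destination matrices.
rng = np.random.default_rng(0)
base = np.outer(np.linspace(1.0, 2.0, 8), np.linspace(1.0, 2.0, 8))
stream = [base * (1.0 + 0.1 * np.sin(t)) + 0.01 * rng.standard_normal((8, 8))
          for t in range(24)]
stream[17][3, :] += 5.0   # hour 17: source 3 contacts every destination ("scanner")

u, v = fit_subspaces(stream[:8], rank=1)
errors = reconstruction_errors(stream, u, v)
flagged = flag_anomalies(errors)
```

The error curve says when something happened; inspecting the residual matrix at a flagged hour would then show where, for example which source row stands out.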
Computational cost. [plots: 3rd order network tensor; 2nd order DBLP tensor] OTA is the offline tensor analysis. Performance metric: CPU time (sec). Observations: –DTA and STA are orders of magnitude faster than OTA –The slight upward trend in DBLP is due to the increasing number of papers each year (data become denser over time)
Accuracy comparison. [plots: 3rd order network tensor; 2nd order DBLP tensor] Performance metric: the ratio of reconstruction error between DTA/STA and OTA, fixing the error of OTA to 20%. Observation: DTA performs very close to OTA on both datasets; STA performs worse on DBLP due to the bigger changes.
InteMon: intelligent monitoring system on large clusters [VLDB06 demo] [Operating Systems Review 06]
System Architecture
Case 1: Environmental Monitoring. An abnormal dehumidification and reheating cycle is identified. [figure: Temperature, Humidity]
Overview: Mining static graphs –CenterPiece Subgraphs (CePS) –Fast RWR computation –‘best-effort’ subgraph matching (in progress); Mining time-evolving graphs –Tensors + intrusion detection –Other tools (MDL); Other topics
Parameter-free mining. Using MDL, to –Find ‘natural’ communities –Find ‘natural’ cut-points (under submission)
MDL mining on time-evolving graph (Enron emails)
Overview: Mining static graphs; Mining time-evolving graphs; Other topics –Graph sampling –Graph generators
Overview: Mining static graphs –CenterPiece Subgraphs (CePS) –Fast RWR computation –‘best-effort’ subgraph matching (in progress); Mining time-evolving graphs –Tensors + intrusion detection –Sparse graphs; Other topics –Graph sampling –Graph generators
Realistic graph generation: Kronecker graphs [Leskovec+, PKDD’05] [Leskovec+, under review]
Why fit graph models? Parameters tell us about the structure of a graph. Extrapolation: given a graph today, how will it look in a year? Sampling: can I get a smaller graph with similar properties? Anonymization: instead of releasing the real graph (e.g., network), we can release a synthetic version of it.
Experiments on real AS graph [plots: Degree distribution; Hop plot; Network value; Adjacency matrix eigenvalues]
Intro to Kronecker graphs
Problem Definition. Given a growing graph with node counts N1, N2, …, generate a realistic sequence of graphs that will obey all the patterns. Idea: self-similarity –Leads to power laws –Communities within communities –…
Recursive Graph Generation. [figure: Initial graph, Recursive expansion] There are many obvious (but wrong) ways: –Does not obey Densification Power Law –Has increasing diameter. Kronecker Product is exactly what we need.
Kronecker Product – a Graph [figure: adjacency matrix, intermediate stage, adjacency matrix]
Kronecker Product – a Graph. Continuing to multiply with G1 we obtain G4 and so on … [figure: G4 adjacency matrix]
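The repeated multiplication with G1 can be sketched with NumPy's `np.kron`. The 3-node initiator below (a chain with self-loops) and the function name are assumptions chosen for illustration.

```python
import numpy as np

def kronecker_graph(initiator, power):
    """Adjacency matrix of G_power, built by repeated Kronecker products with G1."""
    initiator = np.asarray(initiator)
    graph = initiator
    for _ in range(power - 1):
        graph = np.kron(graph, initiator)
    return graph

# Hypothetical initiator G1: 3-node chain with self-loops.
G1 = np.array([[1, 1, 0],
               [1, 1, 1],
               [0, 1, 1]])
G2 = kronecker_graph(G1, 2)
G4 = kronecker_graph(G1, 4)
```

Each step replaces every nonzero entry of the current matrix by a copy of G1, so with this initiator the node count grows as 3^k while the number of nonzero entries grows as 7^k.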
Conclusions. Static graphs: Random Walks, “CePS”, best-effort sub-graph matching. Dynamic graphs: Tensors (intrusion/change detection). Graph generation: Kronecker.
References
Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast Random Walk with Restart and Its Applications. ICDM 2006, Hong Kong.
Hanghang Tong, Christos Faloutsos. Center-Piece Subgraphs: Problem Definition and Fast Solutions. KDD 2006, Philadelphia, PA.
Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations. KDD 2005, Chicago, IL ("Best Research Paper" award).
Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos. Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication. ECML/PKDD 2005, Porto, Portugal.
Jimeng Sun, Dacheng Tao, Christos Faloutsos. Beyond Streams and Graphs: Dynamic Tensor Analysis. KDD 2006, Philadelphia, PA.
Jimeng Sun, Yinglian Xie, Hui Zhang, Christos Faloutsos. Less is More: Compact Matrix Decomposition for Large Sparse Graphs. SDM, Minneapolis, Minnesota, Apr
Thank you! Contact info: {christos, htong, jimeng, jure}@cs.cmu.edu; www.cs.cmu.edu/~christos (w/ papers, datasets, code, etc.)