1
Tools for Large Graph Mining
Deepayan Chakrabarti. Thesis Committee: Christos Faloutsos, Chris Olston, Guy Blelloch, Jon Kleinberg (Cornell).
2
Introduction ► Graphs are ubiquitous
Protein interactions [genomebiology.com], an Internet map [lumeta.com], a food web [Martinez ’91], a friendship network [Moody ’01] — graphs are everywhere in the real world, and there is an incredible variety of such graph datasets once we start looking for them.
3
“Needle exchange” networks of drug users [Weeks et al. 2002]
What can we do with graphs? How quickly will a disease spread on this graph? Can we do anything useful by mining these graphs? Yes, and in many disciplines, not just computer science — this “needle exchange” network of drug users [Weeks et al. 2002], though it might not look like much, is very important in disease prevention and public policy.
4
Introduction What can we do with graphs?
What can we do with graphs? How quickly will a disease spread on this graph? Who are the “strange bedfellows”? Who are the key people — e.g., the “key” terrorist in the hijacker network [Krebs ’01]? ► Graph analysis can have great impact.
5
Graph Mining: Two Paths
Specific applications: node grouping, viral propagation, frequent pattern mining, fast message routing. General issues: realistic graph generation, graph patterns and “laws”, graph evolution over time?
6
Specific applications
Our Work
7
Specific applications
Our Work. Node Grouping: find “natural” partitions and outliers automatically. Viral Propagation: will a virus spread and become an epidemic? Graph Generation: how can we mimic a given real-world graph?
8
Roadmap Find “natural” partitions and outliers automatically 1 3 2 4
Focus of this talk: find “natural” partitions and outliers automatically (node grouping), then viral propagation, realistic graph generation, graph patterns and “laws”, and conclusions.
9
Node Grouping [KDD 04]: simultaneously group customers and products — or documents and words, or users and preferences … (the customers × products matrix is reordered into customer groups × product groups).
10
Node Grouping [KDD 04] Row and column groups
Row and column groups need not lie along a diagonal, and need not be equal in number — both cases are fine.
11
Motivation: visualization, summarization, detection of outlier nodes and edges, compression, and others…
12
Node Grouping Desiderata:
Simultaneously discover row and column groups Fully Automatic: No “magic numbers” Scalable to large matrices Online: New data should not require full recomputations
13
Closely Related Work: Information-Theoretic Co-clustering [Dhillon+ 2003] — the number of row and column groups must be specified by the user, so it is not fully automatic.
14
Other Related Work.
K-means and variants [Pelleg+ 2000, Hamerly+ 2003]: do not cluster rows and columns simultaneously.
“Frequent itemsets” [Agrawal+ 1994]: the user must specify the “support”.
Information retrieval [Deerwester+ 1990, Hofmann 1999]: choosing the number of “concepts”.
Graph partitioning [Karypis+ 1998]: the number of partitions, and the measure of imbalance between clusters.
15
What makes a cross-association “good”?
Why is this better? Similar nodes are grouped together, with as few groups as necessary, giving a few homogeneous blocks. Good Clustering implies Good Compression.
16
Main Idea: Good Compression implies Good Clustering. Split the binary matrix into row and column groups; each block i has density pi1 = % of dots. Then
Total Encoding Cost = Description Cost + Code Cost = [cost of describing ni1, ni0 and the groups] + Σi sizei × H(pi1).
17
Examples: with one row group and one column group, the description cost is low but the code cost Σi sizei × H(pi1) is high. With m row groups and n column groups (one per row and column), the code cost is low but the description cost is high. The description cost is the stopping criterion.
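The two costs trade off as sketched above; the code cost Σi sizei × H(pi1) is easy to compute directly. A minimal Python sketch (ours, not the thesis implementation; `matrix` is a list of 0/1 rows, and `row_groups`/`col_groups` map each row/column to its group):

```python
import math

def entropy(p):
    """Binary Shannon entropy H(p), in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def code_cost(matrix, row_groups, col_groups, k, l):
    """Sum over all k*l blocks of: block size * H(block density of 1s)."""
    total = 0.0
    for i in range(k):
        rows = [r for r, g in enumerate(row_groups) if g == i]
        for j in range(l):
            cols = [c for c, g in enumerate(col_groups) if g == j]
            size = len(rows) * len(cols)
            if size == 0:
                continue
            ones = sum(matrix[r][c] for r in rows for c in cols)
            total += size * entropy(ones / size)
    return total
```

On a perfectly block-diagonal matrix with the right grouping every block is homogeneous, so the code cost is zero; with a single group the whole matrix is one 50%-dense block and the cost is size × H(0.5).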
18
What makes a cross-association “good”?
Why is this better? The good cross-association has both low code cost Σi sizei × H(pi1) and low description cost (cost of describing ni1, ni0 and the groups) — hence the lowest total encoding cost.
19
Formal problem statement
Given a binary matrix, Re-organize the rows and columns into groups, and Choose the number of row and column groups, to Minimize the total encoding cost.
20
Formal problem statement
Note: No Parameters.
21
Algorithms l = 5 col groups k = 5 row groups k=1, l=2 k=2, l=2
22
Algorithms: start with the initial matrix, then alternate two steps — find good groups for fixed k and l, and choose better values for k and l — each lowering the encoding cost, until the final cross-association is reached.
23
Fixed k and l: find good groups for this fixed k and l (the inner loop of the algorithm).
24
Re-assign: for each row x, re-assign it to the row group which minimizes the code cost; then do the same for each column. Alternate row re-assigns and column re-assigns, and repeat …
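One row re-assignment pass can be sketched as follows — a simplified illustration (names and structure ours, not the thesis code) that fixes the block densities from the current assignment and moves each row to the row group giving it the cheapest code length:

```python
import math

def reassign_rows(matrix, row_groups, col_groups, k, l):
    """One pass of row re-assignment against current block densities."""
    # Per (row group, column group): count of 1s and of cells.
    ones = [[0] * l for _ in range(k)]
    size = [[0] * l for _ in range(k)]
    for r, rg in enumerate(row_groups):
        for c, cg in enumerate(col_groups):
            size[rg][cg] += 1
            ones[rg][cg] += matrix[r][c]
    new_groups = []
    for r in range(len(matrix)):
        # This row's 1s and cells, per column group.
        row_ones = [0] * l
        row_tot = [0] * l
        for c, cg in enumerate(col_groups):
            row_tot[cg] += 1
            row_ones[cg] += matrix[r][c]
        best, best_cost = row_groups[r], float("inf")
        for g in range(k):
            cost = 0.0
            for j in range(l):
                p = ones[g][j] / size[g][j] if size[g][j] else 0.5
                p = min(max(p, 1e-9), 1 - 1e-9)  # avoid log(0)
                n1, n0 = row_ones[j], row_tot[j] - row_ones[j]
                # Bits to encode this row's cells under group g's densities.
                cost += -n1 * math.log2(p) - n0 * math.log2(1 - p)
            if cost < best_cost:
                best, best_cost = g, cost
        new_groups.append(best)
    return new_groups
```

A row that was mis-assigned to a group whose densities fit it poorly gets pulled back to the group that encodes it cheaply.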
25
Choosing k and l: the outer loop — choose better values for k and l.
26
Split: find the most “inhomogeneous” group, remove the rows/columns which make it inhomogeneous, and create a new group for these rows/columns.
27
Algorithms: re-assigns (good groups for fixed k and l) alternate with splits (better values of k and l), lowering the encoding cost until the final cross-association.
28
Experiments: “Customer-Product” graph with Zipfian sizes, no noise — k = 5 row groups and l = 5 column groups recovered.
29
Experiments: “Quasi block-diagonal” graph with Zipfian sizes, noise = 10% — k = 6 row groups and l = 8 column groups.
30
Experiments: “White Noise” graph — k = 2 row groups and l = 3 column groups; we find the existing spurious patterns.
31
Experiments: “CLASSIC” — 3,893 documents × 4,303 words, 176,347 “dots”; a combination of 3 sources: MEDLINE (medical), CISI (information retrieval), CRANFIELD (aerodynamics).
32
Experiments on the “CLASSIC” graph of documents & words: k = 15 row groups and l = 19 column groups found.
33
Word groups within MEDLINE (medical): {blood, disease, clinical, cell, …} and {insipidus, alveolar, aortic, death, …}. The difference in shading is hard to see, but the groups are distinct.
34
Word groups within CISI (information retrieval): {abstract, notation, works, construct, …} and {providing, studying, records, development, …}.
35
Word group within CRANFIELD (aerodynamics): {shape, nasa, leading, assumed, …} — note “assumed”, common in aerodynamics.
36
Word group spanning MEDLINE (medical), CISI (information retrieval) and CRANFIELD (aerodynamics): {paint, examination, fall, raise, leave, based, …}.
37
Experiments: “GRANTS” — 13,297 NSF grant proposals × 5,298 words in abstracts, 805,063 “dots”.
38
Experiments on the “GRANTS” graph of documents & words: k = 41, l = 28.
39
The Cross-Associations correspond to topics. Genetics: {encoding, characters, bind, nucleus}.
40
Physics: {coupling, deposition, plasma, beam}.
41
Mathematics: {manifolds, operators, harmonic}, and so on.
42
Experiments: running time for both splits and re-assigns is linear in the number of “dots” — the method is scalable.
43
Summary of Node Grouping
Desiderata: Simultaneously discover row and column groups Fully Automatic: No “magic numbers” Scalable to large matrices Online: New data does not need full recomputation
44
Extensions We can use the same MDL-based framework for other problems:
Self-graphs Detection of outlier edges
45
Extension #1 [PKDD 04]: Self-graphs, such as co-authorship graphs, social networks, the Internet and the World Wide Web. (A bipartite graph relates customers to products; a self-graph relates authors to authors.)
46
Extension #1 [PKDD 04], Self-graphs: rows and columns represent the same nodes, so row re-assigns affect column re-assigns…
47
Experiments: DBLP dataset — 6,090 authors who have published in SIGMOD, ICDE, VLDB, PODS or ICDT, with 175,494 co-citation or co-authorship links.
48
k = 8 author groups found. Some groups are very small, typically the few people who have published with many others — one such group contains Stonebraker, DeWitt and Carey. The big groups consist mostly of people who have rarely published in these conferences, and so are not well connected to the small, dense “core” groups.
49
Extension #2 [PKDD 04] Outlier edges
Which links should not exist? (illegal contact/access?) Which links are missing? (missing data?)
50
How do we find outlier edges? Outliers are deviations from “normality”: if everything were homogeneous we would get excellent compression, so outliers lower the quality of compression. Algorithm: find the edges whose removal maximally reduces the encoding cost — the reduction in cost measures the “outlierness” of each edge.
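This edge-scoring idea can be sketched as follows; `block_of`, `ones` and `size` are hypothetical inputs describing which block each edge falls in and each block's statistics (an illustrative sketch, not the PKDD 04 code):

```python
import math

def entropy(p):
    """Binary Shannon entropy H(p), in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def outlier_scores(edges, block_of, ones, size):
    """Score each edge by the drop in code cost if it were removed.
    A lone edge in a near-empty block is expensive to encode, so
    removing it causes a large drop -> high outlier score."""
    scores = {}
    for e in edges:
        b = block_of(e)
        n1, sz = ones[b], size[b]
        before = sz * entropy(n1 / sz)
        after = sz * entropy((n1 - 1) / sz)
        scores[e] = before - after
    return scores
```

An edge sitting alone in a sparse block scores far higher than an edge inside a dense, well-explained block.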
51
Roadmap — next: will a virus spread and become an epidemic? (viral propagation), followed by realistic graph generation, graph patterns and “laws”, and conclusions.
52
The SIS (or “flu”) model
(Virus) birth rate β: the probability that an infected neighbor attacks. (Virus) death rate δ: the probability that an infected node heals. Cured = susceptible again. The network is undirected.
53
The SIS (or “flu”) model
Competition between virus birth and death: epidemic or extinction? The outcome depends on the ratio β/δ — but also on the network topology.
54
Epidemic threshold: the epidemic threshold τ is the value such that if β/δ < τ there is no epidemic, where β = birth rate and δ = death rate.
55
Question: What is the epidemic threshold?
Previous models. Answer #1: τ = 1/⟨k⟩ [Kephart and White ’91, ’93], under the homogeneity assumption that all nodes have the same degree — but most real graphs have power-law degree distributions. Answer #2: τ = ⟨k⟩/⟨k²⟩ [Pastor-Satorras and Vespignani ’01], under the mean-field assumption that all nodes of the same degree are equally affected — but susceptibility should also depend on a node’s position in the network.
56
The full solution is intractable!
The full Markov Chain has 2^N states — intractable — so a simplification is needed. Independence assumption: the probability that two neighbors are both infected equals the product of their individual infection probabilities. This gives a point estimate of the full Markov Chain.
57
Our model A non-linear dynamical system (NLDS)
which makes no assumptions about the topology. Let p_i,t be the probability that node i is infected at time t, and A the adjacency matrix. Then
1 − p_i,t = [(1 − p_i,t−1) + δ·p_i,t−1] · ∏_{j=1..N} (1 − β·A_ji·p_j,t−1)
i.e., node i is healthy at time t if it was healthy at t−1 (or infected but cured), and received no infection from any other node.
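The NLDS update can be simulated directly; a small numpy sketch (variable names ours), iterated here below threshold so the infection dies out:

```python
import numpy as np

def nlds_step(p, A, beta, delta):
    """One NLDS step:
    1 - p[i,t] = ((1 - p[i]) + delta * p[i]) * prod_j (1 - beta * A[j,i] * p[j])."""
    # Element (j, i) of A * p[:, None] is A[j, i] * p[j]; product over j.
    no_infection = np.prod(1.0 - beta * A * p[:, None], axis=0)
    return 1.0 - ((1.0 - p) + delta * p) * no_infection

# Two connected nodes: lambda_1 = 1, so s = (beta/delta) * lambda_1 = 1/9 < 1
# (below threshold) and the infection should die out.
A = np.array([[0.0, 1.0], [1.0, 0.0]])
p = np.array([1.0, 1.0])
for _ in range(200):
    p = nlds_step(p, A, beta=0.1, delta=0.9)
```

Starting from everyone infected, the infection probabilities shrink geometrically toward zero.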
58
Epidemic threshold [Theorem 1]: there is no epidemic if β/δ < τ = 1/λ1,A, where β is the (virus) birth rate, δ the (virus) death rate, and λ1,A the largest eigenvalue of the adjacency matrix A. ► λ1,A alone decides viral epidemics!
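Theorem 1 makes the threshold straightforward to compute numerically; a small numpy sketch (the star-graph example is illustrative, not from the slides):

```python
import numpy as np

def epidemic_threshold(A):
    """tau = 1 / lambda_1, where lambda_1 is the largest eigenvalue
    (in magnitude) of the adjacency matrix A."""
    lam = max(abs(np.linalg.eigvals(A)))
    return 1.0 / lam

# A star with 100 leaves has lambda_1 = sqrt(100) = 10, so tau = 0.1:
# hubs make a network much easier for a virus to invade.
n = 100
A = np.zeros((n + 1, n + 1))
A[0, 1:] = 1.0
A[1:, 0] = 1.0
tau = epidemic_threshold(A)
```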
59
Recall the definition of eigenvalues
A·X = λ·X; λ1,A = largest eigenvalue ≈ size of the largest “blob”.
60
Experiments (100-node Star)
β/δ > τ (above threshold); β/δ = τ (close to the threshold); β/δ < τ (below threshold).
61
Experiments (Oregon): 10,900 nodes and 31,180 edges. β/δ > τ (above threshold); β/δ = τ (at the threshold); β/δ < τ (below threshold).
62
Extensions: this dynamical-systems framework can be exploited further — for the rate of decay of the infection, and for information survival thresholds in sensor/P2P networks.
63
Extension #1 Below the threshold: How quickly does an infection die out? [Theorem 2] Exponentially quickly
64
Experiment (10K Star Graph)
The number of infected nodes (log scale) vs. time-steps (linear scale) is linear on this log-lin plot, i.e., exponential decay. “Score” s = (β/δ)·λ1,A = “fraction” of the threshold.
65
Experiment (Oregon Graph)
Linear on a log-lin scale — exponential decay: log n_inf(t) = C(initial conditions, eigenspectrum) + t·log[1 − (1 − s)·δ], with “score” s = (β/δ)·λ1,A.
66
Extension #2 Information survival in sensor networks [+ Leskovec, Faloutsos, Guestrin, Madden] Sensors gain new information
67
Extension #2 Information survival in sensor networks [+ Leskovec, Faloutsos, Guestrin, Madden] Sensors gain new information but they may die due to harsh environment or battery failure so they occasionally try to transmit data to nearby sensors and failed sensors are occasionally replaced.
68
Under what conditions does the information survive? (Assuming uncorrelated failures.)
69
Extension #2 [Theorem 1]: the information dies out exponentially quickly if a threshold condition holds; the condition combines the retransmission rate, the resurrection rate, the failure rate of the sensors, and the largest eigenvalue of the “link quality” matrix.
70
Specific applications
Roadmap — Specific applications: node grouping, viral propagation. General issues: realistic graph generation, graph patterns and “laws”. Next: how can we generate a “realistic” graph that mimics a given real-world graph? Then conclusions.
71
Experiments (Clickstream bipartite graph)
Count vs. in-degree of websites: Yahoo, Google and others sit at the high-degree end, personal webpages at the low end. R-MAT (parameters a=0.55, b=0.13, c=0.20, d=0.12; n1=15, n2=18) closely matches the real Clickstream data.
72
Experiments (Clickstream bipartite graph)
-checking surfers Clickstream R-MAT + x Count Websites “All-night” surfers Users Out-degree
73
Experiments (Clickstream bipartite graph)
R-MAT matches the Clickstream graph on count vs. out-degree, count vs. in-degree, the hop-plot, singular value vs. rank, and left/right “network value” ► R-MAT can match real-world graphs.
74
Specific applications
Roadmap Specific applications Node grouping Viral propagation General issues Realistic graph generation Graph patterns and “laws” 1 3 2 4 Conclusions
75
Conclusions Two paths in graph mining: Specific applications:
Specific applications — Node Grouping: an MDL-based approach for automatic grouping. Viral Propagation: a non-linear dynamical system; the epidemic depends on the largest eigenvalue. General issues — Graph Patterns: marks of “realism” in a graph. Graph Generators: R-MAT, a scalable generator matching many of the patterns.
76
Software (http://www-2.cs.cmu.edu/~deepay/#Sw): CrossAssociations — finds natural node groups; used by an “anonymous” large accounting firm, Intel Research (Cambridge, UK), UC Riverside (network intrusion detection), and the University of Porto, Portugal. NetMine — extracts graph patterns quickly and builds realistic graphs; used by Northrop Grumman Corp. F4 — a non-linear time-series forecasting package.
77
===CROSS-ASSOCIATIONS===
Why simultaneous grouping? Differences from co-clustering and others? Other parameter-fitting criteria? Cost surface Exact cost function Exact complexity, wall-clock times Soft clustering Different weights for code and description costs? Precision-recall for CLASSIC Inter-group “affinities” Collaborative filtering and recommendation systems? CA versus bipartite cores Extras General comments on CA communities
78
===Viral Propagation===
Comparison with previous methods Accuracy of dynamical system Relationship with full Markov chain Experiments on information survival threshold Comparison with Infinite Particle Systems Intuition behind the largest eigenvalue Correlated failures
79
===R-MAT=== Graph patterns Generator desiderata Description of R-MAT
Experiments on a directed graph R-MAT communities via Cross-Associations? R-MAT versus tree-based generators
80
===Graphs in general===
Relational learning Graph Kernels
81
Simultaneous grouping is useful
Sparse blocks, with little in common between rows Grouping rows first would collapse these two into one! Index
82
Cross-Associations ≠ Co-clustering!
Information-theoretic co-clustering: lossy compression — approximates the original matrix while trying to minimize KL-divergence; the number of row and column groups must be given by the user.
Cross-Associations: lossless compression — always provides complete information about the matrix, for any number of row and column groups; the numbers are chosen automatically using the MDL principle.
Index
83
Other parameter-fitting methods
The Gap statistic [Tibshirani+ ’01]: minimize the “gap” between the log-likelihood of intra-cluster distances and its expected log-likelihood. But it needs a distance function between graph nodes, a “reference” distribution, and multiple MCMC runs to remove “variance due to sampling” — hence more time. Index
84
Other parameter-fitting methods
Stability-based method [Ben-Hur+ ’02, ‘03] Run clustering multiple times on samples of data, for several values of “k” For low k, clustering is stable; for high k, unstable Choose this transition point. But Needs many runs of the clustering algorithm Arguments possible over definition of transition point Index
85
Precision-Recall for CLASSIC
Index
86
Cost surface (total cost)
Surface and contour plots over k and l. With increasing k and l, the total cost decays very rapidly at first, but then starts increasing slowly. Index
87
Cost surface (code cost only)
Surface and contour plots over k and l. With increasing k and l, the code cost decays very rapidly. Index
88
Encoding Cost Function
Total encoding cost =
log*(k) + log*(l) (number of clusters)
+ N·log(N) + M·log(M) (row/column order)
+ Σi log(ai) + Σj log(bj) (cluster sizes)
+ Σi Σj log(ai·bj + 1) (block densities)
— the terms above are the description cost —
+ Σi Σj ai·bj · H(pi,j) — the code cost.
Index
89
Complexity of CA: O(E·(k² + l²)), ignoring the number of re-assign iterations, which is typically low. Index
90
Complexity of CA Time / Σ(k+l) Number of edges Index
91
Inter-group distances
Two groups are “close” if merging them does not increase the cost by much: distance(i,j) = relative increase in cost on merging i and j, i.e., Dist(I,J) = [cost(merged) − cost(I) − cost(J)] / [cost(I) + cost(J)]. The numerator measures the increase in cost (the lower the increase, the lower the distance); the denominator normalizes out group size, since large groups normally have higher cost and would otherwise never appear close to any other group. Index
92
Inter-group distances
Example: Grp1 and Grp2 have no “bridges” and are very distinct — their distance is 5.5. Grp2 and Grp3 share many cross-edges, so merging them yields a block that is not totally inhomogeneous, and their distance is the lowest of the three. Grp1 and Grp3 share some cross-edges, fewer than Grp2–Grp3, so their distance is intermediate. distance(i,j) = relative increase in cost on merging i and j. Index
93
Experiments Inter-group distances can aid in visualization
Inter-group distances can aid in visualization. Plotted with Graphviz (which treats the provided distances only as hints), Grp7 and Grp8 form the “core” of the network, close to many other groups (the figure labels Stonebraker, DeWitt and Carey), while Grp1, and to an extent Grp2, lies far from everyone else. Index
94
Collaborative filtering and recommendation systems
Q: if someone likes product X, will (s)he like product Y? A: check whether others who liked X also liked Y. The focus is on distances between people (typically cosine similarity), not on clustering. Index
95
CA and bipartite cores: related but different
Hubs Authorities A 3x2 bipartite core Kumar et al. [1999] say that bipartite cores correspond to communities. Index
96
CA and bipartite cores: related but different
CA finds two communities there: one for the hubs and one for the authorities. It gracefully handles cases where a few links are missing, and it considers connections between all pairs of clusters, not just two sets — not every node need belong to a non-trivial bipartite core. CA is, informally, a generalization. Index
97
Comparison with soft clustering
Soft clustering each node belongs to each cluster with some probability Hard clustering one cluster per node Index
98
Comparison with soft clustering
Soft clustering has far more degrees of freedom: parameter fitting is harder and algorithms can be costlier. Hard clustering is better for exploratory data analysis, and some real-world problems require it — e.g., fraud detection for accountants. Index
99
Weights for code cost vs description cost
Total = 1·(code cost) + 1·(description cost) — physical meaning: total number of bits. Total = α·(code cost) + β·(description cost) — physical meaning: number of encoding bits under some prior. Index
100
Formula for re-assigns
Re-assign: for each row x Column groups Row groups Index
101
Choosing k and l — Split: find the row group R with the maximum entropy per row; choose the rows in R whose removal reduces the entropy per row in R; send these rows to a new row group and set k = k+1. Index
102
Experiments: Epinions dataset — 75,888 users and 508,960 “dots”, one per “trust” relationship; k = 19 groups found. A few small groups of roughly 10 users form a small, dense “core”: they are trusted by many other people, and could be very interesting for viral or focused marketing. (Blocks are shaded darker when they contain more “dots”; the densest blocks sit at the bottom-right corner, the small dense core.) Index
103
Comparison with previous methods
Our threshold subsumes the homogeneous model (proof), and we are more accurate than the mean-field-assumption model.
104
Comparison with previous methods
10K Star Graph Index
105
Comparison with previous methods
Oregon Graph Index
106
Accuracy of dynamical system
10K Star Graph Index
107
Accuracy of dynamical system
Oregon Graph Index
108
Accuracy of dynamical system
10K Star Graph Index
109
Accuracy of dynamical system
Oregon Graph Index
110
Relationship with full Markov Chain
The full Markov Chain is of the form: Prob(infection at time t) = X_{t−1} + Y_{t−1} − Z_{t−1}, where Z_{t−1} is the non-linear component. The independence assumption leads to a point estimate for Z_{t−1}, giving the non-linear dynamical system — still non-linear, but now tractable. Index
111
Experiments: Information survival
INTEL sensor map (54 nodes) MIT sensor map (40 nodes) and others… Index
112
Experiments: Information survival
INTEL sensor map Index
113
Survival threshold on INTEL
Index
114
Survival threshold on INTEL
Index
115
Experiments: Information survival
MIT sensor map Index
116
Survival threshold on MIT
Index
117
Survival threshold on MIT
Index
118
Infinite Particle Systems
“Contact Process” ≈ SIS model. Differences: infinite graphs only (so the questions asked are different) and very specific topologies (lattices, trees). Exact thresholds have not been found for these; proving the existence of thresholds is important. Our results match those on the finite line graph [Durrett+ ’88]. Index
119
Intuition behind the largest eigenvalue
Approximately size of the largest “blob” Consider the special case of a “caveman” graph Largest eigenvalue = 4 Index
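The λ1 = 4 of the slide can be checked with power iteration, a standard way to compute the largest eigenvalue (a sketch, not thesis code; we assume the largest “cave” is a 5-node clique, whose λ1 is exactly 4):

```python
import numpy as np

def largest_eigenvalue(A, iters=500):
    """Power iteration: repeatedly apply A and normalize; the Rayleigh
    quotient x.(A x) converges to lambda_1 for a connected graph."""
    x = np.ones(A.shape[0])
    for _ in range(iters):
        x = A @ x
        x /= np.linalg.norm(x)
    return float(x @ (A @ x))

# Adjacency matrix of a 5-node clique (a single "cave"): lambda_1 = 4.
A = np.ones((5, 5)) - np.eye(5)
lam = largest_eigenvalue(A)
```

For a clique of size c, λ1 = c − 1, which is why λ1 tracks the size of the largest “blob”.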
120
Intuition behind the largest eigenvalue
Approximately size of the largest “blob” Largest eigenvalue = 4.016 Index
121
Graph Patterns Power Laws Count vs Outdegree Count vs Indegree
The “epinions” graph with 75,888 nodes and 508,960 edges Count vs Indegree Index
122
123
Graph Patterns Power Laws and deviations (DGX/Lognormals [Bi+ ’01])
Count vs Indegree Count Degree Index
124
Graph Patterns Power Laws and deviations Small-world
“Community” effect … # reachable pairs Effective Diameter hops Index
125
Graph Generator Desiderata
Match the graph patterns (power laws and deviations, small-world, “community” effect, …) plus other desiderata: few parameters, fast parameter-fitting, scalable graph generation, and simple extension to undirected, bipartite and weighted graphs. Most current graph generators fail to match some of these. Index
126
The R-MAT generator [SIAM DM’04] — intuition: the “80-20 law”. Subdivide the 2^n × 2^n adjacency matrix into quadrants and choose one quadrant with probability (a, b, c, d) — e.g., a = 0.5, b = 0.1, c = 0.15, d = 0.25. Index
127
The R-MAT generator [SIAM DM’04] Intuition: The “80-20 law” a b a c d
Subdivide the adjacency matrix and choose one quadrant with probability (a, b, c, d); recurse until we reach a 1×1 cell, where we place an edge; repeat for all edges. Index
128
The R-MAT generator [SIAM DM’04] Intuition: The “80-20 law”
Only 3 parameters: a, b and c (d = 1 − a − b − c). We have a fast parameter-fitting algorithm. Index
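The recursive quadrant-picking above fits in a few lines; an illustrative sketch (ours, not the released code), generating distinct edges on a 2^n_levels × 2^n_levels adjacency matrix:

```python
import random

def rmat_edges(n_levels, n_edges, a=0.5, b=0.1, c=0.15, seed=None):
    """R-MAT: for each edge, recursively pick a quadrant with
    probabilities (a, b, c, d) until a single cell is reached.
    n_edges must not exceed the 2^(2*n_levels) available cells."""
    rng = random.Random(seed)
    d = 1.0 - a - b - c  # probability of the bottom-right quadrant
    edges = set()
    while len(edges) < n_edges:  # retry duplicates until distinct
        row = col = 0
        for _ in range(n_levels):
            r = rng.random()
            row, col = row * 2, col * 2
            if r < a:
                pass                          # top-left quadrant
            elif r < a + b:
                col += 1                      # top-right
            elif r < a + b + c:
                row += 1                      # bottom-left
            else:
                row, col = row + 1, col + 1   # bottom-right (prob d)
        edges.add((row, col))
    return edges
```

The skew in (a, b, c, d) is what produces power-law-like degree distributions: quadrants chosen with higher probability accumulate far more edges.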
129
Experiments (Epinions directed graph)
Effective Diameter Count vs Indegree Count vs Outdegree Hop-plot Count vs Stress Eigenvalue vs Rank “Network value” ►R-MAT matches directed graphs Index
130
R-MAT communities and Cross-Associations
R-MAT builds communities in graphs, and Cross-Associations finds them — what is the relationship? R-MAT builds a hierarchy of communities, while CA finds a flat set. There is linkage in community sizes: when the R-MAT parameters are very skewed, the community sizes found by CA are skewed, and vice versa. Index
131
R-MAT and tree-based generators
Recursive splitting in R-MAT ≈ following a tree from root to leaf. Relationship with other tree-based generators [Kleinberg ’01, Watts+ ’02]? The R-MAT tree has edges as leaves, the others have nodes Tree-distance between nodes is used to connect nodes in other generators, but what does tree-distance between edges mean? Index
132
Comparison with relational learning.
Relational learning (typical): aims to find small structures/patterns at the local level; labeled nodes and edges, where the semantics of the labels are important; algorithms are typically costlier.
Graph mining (typical): emphasis on global aspects of large graphs; unlabeled graphs, with more focus on topological structure and properties; scalability is more important.
Index
133
===OTHER WORK=== OTHER WORK
134
Other Work Time Series Prediction [CIKM 2002]
We use the fractal dimension of the data This is related to chaos theory and Lyapunov exponents…
135
Other Work Logistic Parabola Time Series Prediction [CIKM 2002]
136
Other Work Lorenz attractor Time Series Prediction [CIKM 2002]
137
Other Work Laser fluctuations Time Series Prediction [CIKM 2002]
138
Other Work: adaptive histograms with error guarantees [+ Ashraf Aboulnaga, Yufei Tao, Christos Faloutsos] — under insertions and deletions, maintain count probabilities for the buckets, giving statistically correct query result-size estimation and query feedback.
139
Other Work User-personalization
Patent number 6,611,834 (IBM) Relevance feedback in multimedia image search Filed for patent (IBM) Building 3D models using robot camera and rangefinder data [ICML 2001]
140
===EXTRAS===
141
Conclusions Two paths in graph mining: Specific applications:
Viral Propagation Resilience testing, information dissemination, rumor spreading Node Grouping automatically grouping nodes, AND finding the correct number of groups References: Fully automatic Cross-Associations, by Chakrabarti, Papadimitriou, Modha and Faloutsos, in KDD 2004 AutoPart: Parameter-free graph partitioning and Outlier detection, by Chakrabarti, in PKDD 2004 Epidemic spreading in real networks: An eigenvalue viewpoint, by Wang, Chakrabarti, Wang and Faloutsos, in SRDS 2003
142
Conclusions Two paths in graph mining: Specific applications
General issues: Graph Patterns Marks of “realism” in a graph Graph Generators R-MAT, a fast, scalable generator matching many of the patterns References: R-MAT: A recursive model for graph mining, by Chakrabarti, Zhan and Faloutsos in SIAM Data Mining 2004. NetMine: New mining tools for large graphs, by Chakrabarti, Zhan, Blandford, Faloutsos and Blelloch, in the SIAM 2004 Workshop on Link analysis, counter-terrorism and privacy
143
Other References F4: Large Scale Automated Forecasting using Fractals, by D. Chakrabarti and C. Faloutsos, in CIKM 2002. Using EM to Learn 3D Models of Indoor Environments with Mobile Robots, by Y. Liu, R. Emery, D. Chakrabarti, W. Burgard and S. Thrun, in ICML 2001 Graph Mining: Laws, Generators and Algorithms, by D. Chakrabarti and C. Faloutsos, under submission to ACM Computing Surveys
144
References --- graphs R-MAT: A recursive model for graph mining, by D. Chakrabarti, Y. Zhan, C. Faloutsos in SIAM Data Mining 2004. Epidemic spreading in real networks: An eigenvalue viewpoint, by Y. Wang, D. Chakrabarti, C. Wang and C. Faloutsos, in SRDS 2003 Fully automatic Cross-Associations, by D. Chakrabarti, S. Papadimitriou, D. Modha and C. Faloutsos, in KDD 2004 AutoPart: Parameter-free graph partitioning and Outlier detection, by D. Chakrabarti, in PKDD 2004 NetMine: New mining tools for large graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SIAM 2004 Workshop on Link analysis, counter-terrorism and privacy
145
Specific applications
Roadmap Specific applications Node grouping Viral propagation General issues Realistic graph generation Graph patterns and “laws” 1 3 2 4 Other Work 5 Conclusions
146
Experiments (Clickstream bipartite graph)
Some personal webpage Clickstream + Count Websites Yahoo, Google and others Users In-degree
147
Experiments (Clickstream bipartite graph)
-checking surfers Clickstream + Count Websites “All-night” surfers Users Out-degree
148
Experiments (Clickstream bipartite graph)
Clickstream R-MAT # Reachable pairs Websites Users Hops
149
Graph Generation Important for:
Simulations of new algorithms Compression using a good graph generation model Insight into the graph formation process Our R-MAT (Recursive MATrix) generator can match many common graph patterns.
150
Recall the definition of eigenvalues
A·X = λA·X, where λA is an eigenvalue of A; λ1,A = the largest eigenvalue. No epidemic if β/δ < τ = 1/λ1,A.
151
Tools for Large Graph Mining
Deepayan Chakrabarti, Carnegie Mellon University. This talk covers work done at CMU with my advisor Christos Faloutsos and other researchers, developing tools and algorithms to mine large graph datasets.