1
Chen Chen¹, Cindy X. Lin¹, Matt Fredrikson², Mihai Christodorescu³, Xifeng Yan⁴, Jiawei Han¹
¹ University of Illinois at Urbana-Champaign
² University of Wisconsin at Madison
³ IBM T. J. Watson Research Center
⁴ University of California at Santa Barbara
2
Outline
- Motivation
  - The efficiency bottleneck encountered in big networks
  - Patterns must be preserved
- Summarize-Mine
- Experiments
- Summary
4
Frequent Subgraph Mining
- Find all graphs p such that |D_p| >= min_sup, where D_p is the set of graphs in the database D that contain p
- Exposes the topological structures of graph data
- Useful for many downstream applications
(Figure: a query graph matched against a graph database)
5
Challenges
- Subgraph isomorphism checking is unavoidable in any frequent subgraph mining algorithm
- This causes problems on big networks
  - Suppose there is only one triangle in the network, but 1,000,000 length-2 paths
  - We must enumerate all 1,000,000 paths, because any one of them has the potential to grow into a full triangle
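The gap between small-pattern embeddings and large-pattern embeddings can be illustrated with a toy graph. The sketch below is only an illustration, not part of the talk: the hub-shaped graph and the helper names (`hub_with_one_triangle`, `count_length2_paths`, `count_triangles`) are assumptions made for this example.

```python
from itertools import combinations
from math import comb


def hub_with_one_triangle(n_leaves):
    """Hypothetical toy graph: hub 0 joined to n_leaves leaves, plus one leaf-leaf edge."""
    adj = {0: set()}
    for leaf in range(1, n_leaves + 1):
        adj[0].add(leaf)
        adj[leaf] = {0}
    # The single extra edge closes exactly one triangle (0, 1, 2).
    adj[1].add(2)
    adj[2].add(1)
    return adj


def count_length2_paths(adj):
    """A path u - v - w is a middle vertex v plus an unordered pair of its neighbors."""
    return sum(comb(len(neighbors), 2) for neighbors in adj.values())


def count_triangles(adj):
    """Brute-force triangle count; fine for a toy graph."""
    return sum(
        1
        for u, v, w in combinations(sorted(adj), 3)
        if v in adj[u] and w in adj[u] and w in adj[v]
    )


graph = hub_with_one_triangle(100)
print(count_length2_paths(graph), "length-2 paths,", count_triangles(graph), "triangle(s)")
```

With only 100 leaves there are already thousands of length-2 paths for a single triangle, and the path count grows quadratically with the hub's degree.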
6
Too Many Embeddings
- Subgraph isomorphism is NP-hard
  - So, when the problem size increases, …
- During the checking, large graphs are grown from small subparts
- For small subparts, there might be too many (overlapping) embeddings in a big network
- Enumerating all these embeddings eventually becomes prohibitive
7
Motivating Application
- System call graphs from security research
  - Model dependencies among system calls
  - Unique subgraph signatures for malicious programs
  - Compare malicious/benign programs
- These graphs are very big
  - Thousands of nodes on average
- We tried state-of-the-art mining technologies, but they failed
8
Our Approach
- Subgraph isomorphism checking is infeasible on large networks
- So we do it on small graphs
- Summarize-Mine
  - Summarize: merge nodes by label and collapse the corresponding edges
  - Mine: state-of-the-art algorithms should now work
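The summarization step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph representation (a `labels` dict and an edge list), the function name `summarize`, the uniform random group assignment, and the choice to drop edges that fall inside a single group are all assumptions made here.

```python
import random


def summarize(labels, edges, groups_per_label, rng):
    """Merge same-label nodes into random groups and collapse edges between groups.

    labels: dict node -> label
    edges: iterable of (u, v) pairs
    groups_per_label: dict label -> number of summary nodes for that label
    rng: a random.Random instance
    """
    # Assign every node to one of the groups reserved for its label.
    group_of = {
        node: (label, rng.randrange(groups_per_label[label]))
        for node, label in labels.items()
    }
    # Parallel edges between two groups collapse into one summary edge.
    # Edges whose endpoints share a group are dropped (an assumption of this sketch).
    summary_edges = set()
    for u, v in edges:
        gu, gv = group_of[u], group_of[v]
        if gu != gv:
            summary_edges.add(frozenset((gu, gv)))
    summary_nodes = set(group_of.values())
    return summary_nodes, summary_edges, group_of


labels = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b"}
edges = [(1, 4), (2, 4), (3, 5), (1, 2)]
nodes, collapsed, _ = summarize(labels, edges, {"a": 2, "b": 1}, random.Random(0))
print(len(nodes), "summary nodes,", len(collapsed), "summary edges")
```

Because the assignment is random, different runs produce different summaries, which is what the multi-round remedy for false negatives relies on.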
9
Mining after Summarization
10
Remedy for Pattern Changes
- Frequent subgraphs are presented on a different abstraction level
- This yields false negatives and false positives, compared to the true patterns mined from the un-summarized database D
  - False negatives (recover): randomized technique + multiple rounds
  - False positives (delete): verify against D
- Substantial work can be transferred to the summaries
11
Outline
- Motivation
- Summarize-Mine
  - The algorithm flow-chart
  - Recovering false negatives
  - Verifying false positives
- Experiments
- Summary
13
False Negatives
- For a pattern p, if each of its vertices bears a different label, then the embeddings of p must be preserved after summarization
  - Since we merge groups of vertices by label, the nodes of p stay in different groups
- Otherwise, …
14
Missing Probability of Embeddings
- Suppose
  - We assign x_j nodes for label l_j (j = 1, …, L) in the summary S_i => x_j groups of nodes with label l_j in the original graph G_i
  - Pattern p has m_j nodes with label l_j
- Then …
15
No “Collision” for Same Labels
- Consider a specific embedding f: p -> G_i; f is preserved if the vertices in f(p) stay in different groups
- Randomly assign m_j nodes with label l_j to x_j groups; the probability that they will not “collide” is x_j (x_j - 1) … (x_j - m_j + 1) / x_j^{m_j}
- Multiply the per-label probabilities, since the events are independent
16
Example
- A pattern with 5 labels, each label => 2 vertices
  - m_1 = m_2 = m_3 = m_4 = m_5 = 2
- Assign 20 nodes in the summary (i.e., 20 node groups in the original graph) for each label
  - x_1 = x_2 = x_3 = x_4 = x_5 = 20
  - The summary has 100 vertices
- The probability that an embedding will persist: (20 · 19 / 20^2)^5 = 0.95^5 ≈ 0.774
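The example can be checked numerically. The sketch below assumes each node is placed uniformly at random into one of the x_j groups for its label, so the no-collision probability for one label is x_j (x_j - 1) … (x_j - m_j + 1) / x_j^{m_j}; the function name `persistence_probability` is introduced here for illustration only.

```python
from math import perm


def persistence_probability(pattern_counts, group_counts):
    """Probability that one embedding keeps all its vertices in distinct groups.

    pattern_counts: m_j, the number of pattern vertices carrying label j
    group_counts:   x_j, the number of summary nodes (groups) for label j
    """
    q = 1.0
    for m, x in zip(pattern_counts, group_counts):
        # perm(x, m) = x * (x - 1) * ... * (x - m + 1); it is 0 when m > x.
        q *= perm(x, m) / x**m
    return q


q = persistence_probability([2] * 5, [20] * 5)
print(round(q, 3))
```

The result agrees with the value q(p) = 0.774 used in the continuation of this example.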
17
Extend to Multiple Graphs
- With x_1, …, x_L set to the same values across all G_i's in the database, the persistence probability depends only on m_1, …, m_L, i.e., pattern p's vertex label distribution
  - We denote this probability as q(p)
- Each of p's support graphs in D continues to support p with probability at least q(p)
- Thus, the overall support can be bounded below by a binomial random variable
18
Support Moves Downward
19
False Negative Bound
20
Example, Cont.
- As above, q(p) = 0.774
- min_sup = 50

min_sup'    40      39      38      37      36      35
1 round     0.5966  0.4622  0.3346  0.2255  0.1412  0.0820
2 rounds    0.3559  0.2136  0.1119  0.0508  0.0199  0.0067
3 rounds    0.2123  0.0988  0.0374  0.0115  0.0028  0.0006
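The tabulated values appear consistent with the following reading, which is an assumption of this sketch rather than something stated on the slide: a pattern with true support min_sup is missed in one round when a Binomial(min_sup, q(p)) variable falls below the lowered threshold min_sup', and it is missed in t independent rounds with that probability raised to the power t. The function name `miss_probability` is hypothetical.

```python
from math import comb


def miss_probability(min_sup, lowered_min_sup, q, rounds=1):
    """Bound on the chance that a truly frequent pattern is missed in every round."""
    # P(Binomial(min_sup, q) < lowered_min_sup): too few support graphs survive.
    one_round = sum(
        comb(min_sup, k) * q**k * (1 - q) ** (min_sup - k)
        for k in range(lowered_min_sup)
    )
    # Rounds use independent random summaries, so the miss events multiply.
    return one_round**rounds


q = 0.95**5  # q(p) from the running example, about 0.774
for lowered in range(40, 34, -1):
    row = [miss_probability(50, lowered, q, t) for t in (1, 2, 3)]
    print(lowered, [round(value, 4) for value in row])
```

Lowering min_sup' and adding rounds both shrink the bound, at the cost of more false positives to verify and more mining passes.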
21
False Positives
- Much easier to handle
- Just check against the original database D
- Discard the pattern if this “actual” support is less than min_sup
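A minimal sketch of this verification step is shown below. It assumes graphs are stored as a pair (`labels`, `adj`) with vertex labels only, and it uses a plain backtracking matcher rather than any optimized isomorphism routine; the names `contains` and `verify` are introduced here for illustration.

```python
def contains(pattern, graph):
    """Return True if pattern has a label-preserving (non-induced) embedding in graph."""
    p_labels, p_adj = pattern
    g_labels, g_adj = graph
    order = list(p_labels)

    def extend(mapping):
        if len(mapping) == len(order):
            return True
        u = order[len(mapping)]
        for v in g_labels:
            if v in mapping.values() or g_labels[v] != p_labels[u]:
                continue
            # Every already-mapped neighbor of u must be adjacent to v in the graph.
            if all(mapping[w] in g_adj[v] for w in p_adj[u] if w in mapping):
                mapping[u] = v
                if extend(mapping):
                    return True
                del mapping[u]
        return False

    return extend({})


def verify(candidates, database, min_sup):
    """Keep only the candidates whose actual support in the original database reaches min_sup."""
    verified = []
    for pattern in candidates:
        support = sum(contains(pattern, graph) for graph in database)
        if support >= min_sup:
            verified.append(pattern)
    return verified
```

For example, calling `verify(candidates, database, min_sup=2)` returns the patterns that occur in at least two graphs of `database`. This direct check is exactly the expensive operation the summaries are meant to reduce, so it is run only on patterns that survived mining.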
22
The Same Skeleton as gSpan
- DFS code tree
- Depth-first search
- Minimum DFS code?
- Check support by isomorphism tests
- Record all one-edge extensions along the way
- Pass down the projected database and recurse
23
Integrate Verification Schemes
- Top-down and bottom-up
- Possible factors
  - Amount of false positives
  - Top-down verification can be performed early
- Top-down preferred by experiments
- Top-down (a sub-pattern p_1 is already verified): transaction ID list for p_1 => D_p1; just search within D_p1
- Bottom-up (a super-pattern p_2 is already verified): transaction ID list for p_2 => D_p2; just search within D - D_p2; if frequent, can stop
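The two ways of reusing transaction ID lists can be sketched as below. This is an illustrative sketch only: the containment test is passed in as a parameter named `contains`, the database is assumed to be a dict from transaction ID to graph, and the toy "graphs" in the usage lines are plain edge sets, none of which comes from the talk.

```python
def top_down_support(pattern, parent_tids, database, contains):
    """A pattern can only occur in graphs that already support its verified sub-pattern."""
    return {tid for tid in parent_tids if contains(pattern, database[tid])}


def bottom_up_is_frequent(pattern, child_tids, database, contains, min_sup):
    """Graphs supporting a verified super-pattern support the pattern for free."""
    support = len(child_tids)
    if support >= min_sup:
        return True  # already frequent, no search needed
    for tid, graph in database.items():
        if tid in child_tids:
            continue  # only the remaining graphs, D - D_p2, need a test
        if contains(pattern, graph):
            support += 1
            if support >= min_sup:
                return True  # stop as soon as the threshold is reached
    return False


# Toy usage: graphs and patterns are edge sets, containment is the subset test.
database = {0: {"ab", "bc"}, 1: {"ab"}, 2: {"cd"}}
is_subset = lambda pattern, graph: pattern <= graph
print(top_down_support({"ab", "bc"}, {0, 1}, database, is_subset))
print(bottom_up_is_frequent({"ab"}, {0}, database, is_subset, min_sup=2))
```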
24
Summary-Guided Verification
- Substantial verification work can be performed on the summaries as well
25
Iterative Summarize-Mine
- Use a single pattern tree to hold all results across multiple iterations
  - No need to combine pattern sets in a final step
  - Avoid verifying patterns that have already been checked in previous iterations
  - Verified support graphs are accurate, so they can help pre-pruning in later iterations
- Details omitted
26
Outline
- Motivation
- Summarize-Mine
- Experiments
- Summary
27
Dataset
- Real data
  - W32.Stration, a family of mass-mailing worms
  - W32.Virut, W32.Delf, W32.Ldpinch, W32.Poisonivy, etc.
  - Number of vertices up to 20,000, with even more edges
  - Average number of vertices: 1,300
- Synthetic data
  - Size, number of distinct node/edge labels, etc.
  - Generator details omitted
28
A Sample Malware Signature
- Mined from W32.Stration
- The malware reads and leaks certain registry settings related to network devices
29
Comparison with gSpan
- gSpan is an efficient graph pattern mining algorithm
- Graphs of different sizes are randomly drawn
- Eventually, gSpan stops working
30
The Influence of min_sup'
- Total vs. false positives
  - The gap corresponds to true patterns
  - It gradually widens as we decrease min_sup'
31
Summarization Ratio
- 10 nodes before summarization for every 1 node after => ratio = 10
- Trade off min_sup' and t as the inner loop
- A range of reasonable parameters lies in the middle
32
Scalability
- On the synthetic data
- Parameters are tuned as done above
33
Outline
- Motivation
- Summarize-Mine
- Experiments
- Summary
34
Summary
- We solve the frequent subgraph mining problem for large graphs
  - We found interesting malware signatures
  - Our algorithm is much more efficient, while state-of-the-art mining technologies do not work
- We show that patterns can be well preserved at a higher level by a good generalization scheme
  - Very useful, given the emerging trend of huge networks
  - The data has to be preprocessed and summarized
35
Summary
- Our method is orthogonal to much previous work on this topic => combine for further improvement
  - Efficient pattern space traversal
  - Other data space reduction techniques, different from our compression within individual transactions
    - Transaction sampling, merging, etc.
    - They perform compression between transactions