Mining Graph Patterns Efficiently via Randomized Summaries (VLDB'09)

Chen Chen 1, Cindy X. Lin 1, Matt Fredrikson 2, Mihai Christodorescu 3, Xifeng Yan 4, Jiawei Han 1. Affiliations: 1 University of Illinois at Urbana-Champaign; 2 University of Wisconsin at Madison; 3 IBM T. J. Watson Research Center; 4 University of California at Santa Barbara.

Outline: Motivation (the efficiency bottleneck encountered in big networks; patterns must be preserved); Summarize-Mine; Experiments; Summary.


Frequent Subgraph Mining: Find all graphs p such that |D_p| >= min_sup, where D_p is the set of graphs in the database D that contain p. This exposes the topological structure of graph data and is useful for many downstream applications.

Challenges: Subgraph isomorphism checking is inevitable for any frequent subgraph mining algorithm, and this causes problems on big networks. Suppose the network contains only one triangle but 1,000,000 length-2 paths. We must enumerate all 1,000,000 of them, because any one of them has the potential to grow into a full triangle.
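The blow-up described on this slide can be reproduced on a toy graph. The sketch below is illustrative only: the hub-shaped graph and both function names are assumptions, not taken from the talk. It counts length-2 paths and triangles in an undirected graph stored as an adjacency dictionary.

```python
from itertools import combinations


def count_two_paths_and_triangles(adj):
    """adj maps each vertex to the set of its neighbours (undirected, no self-loops)."""
    two_paths = 0
    closed = 0
    for center, nbrs in adj.items():
        for u, v in combinations(sorted(nbrs), 2):
            two_paths += 1          # u - center - v is a length-2 path
            if v in adj[u]:
                closed += 1         # the edge u - v closes it into a triangle
    # Every triangle is seen once from each of its three corners.
    return two_paths, closed // 3


def hub_with_one_triangle(n_leaves):
    """Hypothetical example graph: vertex 0 is joined to n_leaves leaves,
    and one extra edge between leaves 1 and 2 creates a single triangle."""
    adj = {0: set()}
    for leaf in range(1, n_leaves + 1):
        adj[0].add(leaf)
        adj[leaf] = {0}
    adj[1].add(2)
    adj[2].add(1)
    return adj


if __name__ == "__main__":
    print(count_two_paths_and_triangles(hub_with_one_triangle(1000)))
```

The number of length-2 paths grows quadratically with the hub's degree while the triangle count stays at one, which is the imbalance a pattern-growth miner has to enumerate through.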

Too Many Embeddings: Subgraph isomorphism is NP-hard. So, when the problem size increases, … During the checking, large graphs are grown from small subparts. For small subparts, there may be too many (overlapping) embeddings in a big network. Enumerating all of these embeddings eventually makes mining infeasible.

Motivating Application: System call graphs from security research. They model dependencies among system calls. Goal: find unique subgraph signatures for malicious programs by comparing malicious and benign programs. These graphs are very big, with thousands of nodes on average. We tried state-of-the-art mining technologies, but they failed.

Our Approach: Subgraph isomorphism checking cannot be done on large networks, so we do it on small graphs. Summarize-Mine has two steps. Summarize: merge nodes by label and collapse the corresponding edges. Mine: now, state-of-the-art algorithms should work.
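The Summarize step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes undirected graphs, assigns every vertex uniformly at random to one of the groups reserved for its label, and drops self-loops created by merging. The function name and the data layout are hypothetical.

```python
import random


def summarize(labels, edges, groups_per_label, rng):
    """Randomly merge same-label vertices into groups and collapse edges.

    labels: dict vertex -> label
    edges: iterable of (u, v) pairs, treated as undirected
    groups_per_label: dict label -> number of groups for that label
    rng: a random.Random instance (passed in so runs are reproducible)
    """
    # Each vertex lands in one of its label's groups, chosen uniformly.
    group_of = {
        v: (lab, rng.randrange(groups_per_label[lab]))
        for v, lab in labels.items()
    }
    # Only groups that received at least one vertex become summary nodes.
    summary_labels = {g: g[0] for g in group_of.values()}
    summary_edges = set()
    for u, v in edges:
        gu, gv = group_of[u], group_of[v]
        if gu != gv:  # assumption: edges inside one group are discarded
            summary_edges.add(frozenset((gu, gv)))
    return summary_labels, summary_edges, group_of


if __name__ == "__main__":
    labels = {1: "a", 2: "a", 3: "b", 4: "b"}
    edges = [(1, 3), (2, 4), (1, 2)]
    print(summarize(labels, edges, {"a": 1, "b": 1}, random.Random(0)))
```

Parallel edges between two groups collapse into one because the summary edges are kept in a set.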

Mining after Summarization

Remedy for Pattern Changes: Frequent subgraphs are presented on a different abstraction level, which introduces false negatives and false positives compared to the true patterns mined from the un-summarized database D. False negatives (recover): randomized technique plus multiple rounds. False positives (delete): verify against D. Substantial work can be transferred to the summaries.

Outline: Motivation; Summarize-Mine (the algorithm flow-chart; recovering false negatives; verifying false positives); Experiments; Summary.


False Negatives: For a pattern p, if each of its vertices bears a different label, then the embeddings of p must be preserved after summarization. Since we merge groups of vertices by label, the nodes of p stay in different groups. Otherwise, vertices of p that share a label may be merged into the same group, and the embedding may be lost.

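Missing Prob. of Embeddings: Suppose we assign x_j nodes to label l_j (j = 1, …, L) in the summary S_i, i.e., there are x_j groups of nodes with label l_j in the original graph G_i, and pattern p has m_j nodes with label l_j. Then an embedding of p in G_i is missed in S_i with probability at most $$1 - \prod_{j=1}^{L} \frac{x_j (x_j - 1) \cdots (x_j - m_j + 1)}{x_j^{m_j}}$$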

No “Collision” for Same Labels: Consider a specific embedding f: p -> G_i. f is preserved if the vertices in f(p) stay in different groups. If m_j nodes with label l_j are randomly assigned to x_j groups, the probability that they do not “collide” is $$\frac{x_j (x_j - 1) \cdots (x_j - m_j + 1)}{x_j^{m_j}}$$ Multiply these probabilities over the labels j = 1, …, L, since the events are independent.

Example: A pattern with 5 labels and 2 vertices per label, so m_1 = m_2 = m_3 = m_4 = m_5 = 2. Assign 20 nodes in the summary (i.e., 20 node groups in the original graph) to each label, so the summary has 100 vertices and x_1 = x_2 = x_3 = x_4 = x_5 = 20. The probability that an embedding will persist is $(20 \cdot 19 / 20^2)^5 = 0.95^5 \approx 0.774$.
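The persistence probability for this example can be checked numerically. The sketch below assumes that every vertex is assigned to one of its label's groups uniformly and independently, so the per-label no-collision probability is a falling factorial divided by a power; the function names are illustrative.

```python
from math import perm


def no_collision_probability(m, x):
    """Probability that m vertices thrown uniformly into x groups land in
    m distinct groups: x * (x - 1) * ... * (x - m + 1) / x**m."""
    return perm(x, m) / x ** m


def embedding_persistence_probability(ms, xs):
    """Product of the per-label no-collision probabilities, treating the
    labels as independent."""
    q = 1.0
    for m, x in zip(ms, xs):
        q *= no_collision_probability(m, x)
    return q


if __name__ == "__main__":
    # Five labels, two pattern vertices per label, twenty groups per label.
    q = embedding_persistence_probability([2] * 5, [20] * 5)
    print(round(q, 3))
```

For the slide's parameters this evaluates 0.95 to the fifth power, which rounds to the q(p) = 0.774 quoted in the continued example.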

Extend to Multiple Graphs: Set x_1, …, x_L to the same values across all G_i in the database. The persistence probability then depends only on m_1, …, m_L, i.e., on pattern p's vertex label distribution. We denote this probability as q(p). Each of p's support graphs in D continues to support p with probability at least q(p). Thus, the overall support after summarization can be bounded below by a binomial random variable.

Support Moves Downward

False Negative Bound

Example, Cont.: As above, q(p) = 0.774, and min_sup = 50. [Table: min_sup' versus number of rounds; entries not preserved in the transcript.]
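The trade-off between the lowered threshold min_sup' and the number of rounds can be explored with an exact binomial tail. This is a sketch under stated assumptions, not the bound shown on the slide: it takes the worst case of a pattern whose true support equals min_sup, models its support in one summarized database as Binomial(min_sup, q(p)), and treats rounds as independent. Function names are hypothetical.

```python
from math import comb


def miss_probability_one_round(support, q, min_sup_prime):
    """P[Binomial(support, q) < min_sup_prime]: the chance that a pattern
    with the given true support drops below the lowered threshold."""
    return sum(
        comb(support, k) * q ** k * (1 - q) ** (support - k)
        for k in range(min_sup_prime)
    )


def miss_probability(support, q, min_sup_prime, rounds):
    """The pattern is lost only if every independent round misses it."""
    return miss_probability_one_round(support, q, min_sup_prime) ** rounds


if __name__ == "__main__":
    q, min_sup = 0.774, 50
    for min_sup_prime in (30, 35, 40):
        for rounds in (1, 2, 5):
            p = miss_probability(min_sup, q, min_sup_prime, rounds)
            print(min_sup_prime, rounds, p)
```

Lowering min_sup' or adding rounds both shrink the miss probability; lowering min_sup' admits more false positives to verify, while more rounds cost more mining time.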

False Positives: Much easier to handle. Just check each candidate against the original database D, and discard it if this “actual” support is less than min_sup.
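The false-positive filter can be sketched as a support count over D. The containment test below is a brute-force labeled subgraph check, included only to keep the example self-contained; a real miner would use a far more efficient matcher. The graph layout (a pair of a vertex-label dictionary and a set of frozenset edges) and both function names are assumptions.

```python
from itertools import permutations


def contains(graph, pattern):
    """Brute-force test for a label-preserving, injective mapping of the
    pattern's vertices into the graph that maps every pattern edge to a
    graph edge. Exponential time; suitable for tiny inputs only."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_vertices = list(p_labels)
    for image in permutations(g_labels, len(p_vertices)):
        f = dict(zip(p_vertices, image))
        if all(g_labels[f[v]] == p_labels[v] for v in p_vertices) and all(
            frozenset(f[v] for v in e) in g_edges for e in p_edges
        ):
            return True
    return False


def verify(candidates, database, min_sup):
    """Keep only candidates whose actual support in the original
    database reaches min_sup."""
    kept = []
    for p in candidates:
        support = sum(1 for g in database if contains(g, p))
        if support >= min_sup:
            kept.append(p)
    return kept
```

Here `candidates` would be the patterns mined from the summaries and `database` the un-summarized graphs.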

The Same Skeleton as gSpan: DFS code tree, traversed by depth-first search. At each node: Is this the minimum DFS code? Check support by isomorphism tests. Record all one-edge extensions along the way. Pass down the projected database and recurse.

Integrate Verification Schemes: Top-down and bottom-up. Possible factors: the amount of false positives; top-down verification can be performed early. Top-down is preferred by experiments. Transaction ID list for p_1 => D_{p_1}: just search within D_{p_1}. Transaction ID list for p_2 => D_{p_2}: just search within D - D_{p_2}; if frequent, can stop.
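One reading of the two transaction-ID-list notes, assuming p_1 is a subpattern of p_2: every graph that supports p_2 also supports p_1. Top-down verification of p_2 therefore only has to look inside D_{p_1}, and bottom-up verification of p_1 only has to look outside D_{p_2}. The sketch below illustrates this pruning; the function names, the `contains` callback, and the dictionary layout of the database are hypothetical.

```python
def verify_top_down(sub_tids, database, contains, p_super):
    """sub_tids: verified transaction IDs of a subpattern of p_super.
    Supporters of p_super must be among them, so only those are tested."""
    return {tid for tid in sub_tids if contains(database[tid], p_super)}


def verify_bottom_up(super_tids, database, contains, p_sub, min_sup):
    """super_tids: verified transaction IDs of a superpattern of p_sub.
    All of them already support p_sub, so only the rest of the database
    is searched, and the search stops once p_sub is known to be frequent.
    Returns (supporting IDs found so far, whether p_sub is frequent)."""
    tids = set(super_tids)
    if len(tids) >= min_sup:
        return tids, True
    for tid, graph in database.items():
        if tid not in tids and contains(graph, p_sub):
            tids.add(tid)
            if len(tids) >= min_sup:
                return tids, True
    return tids, False


if __name__ == "__main__":
    # Stand-in data: item sets instead of graphs, subset test as containment.
    database = {0: {"a", "b", "c"}, 1: {"a", "b"}, 2: {"a"}, 3: {"b"}}
    contains = lambda g, p: p <= g
    print(verify_top_down({0, 1, 2}, database, contains, {"a", "b"}))
    print(verify_bottom_up({0, 1}, database, contains, {"a"}, 3))
```

Because of the early stop, the bottom-up result confirms frequency but does not necessarily list every supporting graph.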

Summary-Guided Verification: Substantial verification work can also be performed on the summaries.

Iterative Summarize-Mine: Use a single pattern tree to hold all results across multiple iterations. There is no need to combine pattern sets in a final step. Avoid verifying patterns that have already been checked in previous iterations. Verified support graphs are accurate, so they can help pre-pruning in later iterations. Details omitted.

Outline: Motivation; Summarize-Mine; Experiments; Summary.

Dataset: Real data: W32.Stration, a family of mass-mailing worms; W32.Virut, W32.Delf, W32.Ldpinch, W32.Poisonivy, etc. Up to 20,000 vertices per graph, with an even higher number of edges; 1,300 vertices on average. Synthetic data: varying size, number of distinct node/edge labels, etc. Generator details omitted.

A Sample Malware Signature: Mined from W32.Stration. It shows malware reading and leaking certain registry settings related to the network devices.

Comparison with gSpan: gSpan is an efficient graph pattern mining algorithm. Graphs of different sizes are randomly drawn. As the graphs grow, gSpan eventually stops working.

The Influence of min_sup': Total patterns vs. false positives. The gap corresponds to true patterns. It gradually widens as we decrease min_sup'.

Summarization Ratio: 10/1 node(s) before/after summarization => ratio = 10. Trade off min_sup' and t as the inner loop. A range of reasonable parameters lies in the middle.

Scalability: Measured on the synthetic data. Parameters are tuned as done above.

Outline: Motivation; Summarize-Mine; Experiments; Summary.

Summary: We solve the frequent subgraph mining problem for large graphs. We found interesting malware signatures. Our algorithm is much more efficient, while state-of-the-art mining technologies do not work. We show that patterns can be well preserved at a higher level by a good generalization scheme. This is very useful given the emerging trend of huge networks: the data has to be preprocessed and summarized.

Summary (cont.): Our method is orthogonal to many previous works on this topic, so they can be combined for further improvement. Examples: efficient pattern space traversal; other data space reduction techniques that differ from our compression within individual transactions, such as transaction sampling and merging, which perform compression between transactions.
