Mining Graph Patterns Efficiently via Randomized Summaries Chen Chen, Cindy X. Lin, Matt Fredrikson, Mihai Christodorescu, Xifeng Yan, Jiawei Han VLDB’09.

Outline: Motivation, Preliminaries, SUMMARIZE-MINE Framework, Bounding the False Negative Rate, Experiments, Conclusion.

Motivation: Graph pattern mining is heavily needed in many real applications, such as bioinformatics, hyperlinked webs, and social network analysis. Unfortunately, because of the fundamental role subgraph isomorphism plays in existing methods, they may all run into a pitfall when the cost of enumerating a huge set of isomorphic embeddings blows up, especially in large graphs where many vertices carry identical labels.

Motivation (cont.): Consider possible ways to reduce the number of embeddings. In particular, since many embeddings overlap substantially in real applications, we explore the possibility of “merging” these embeddings to significantly reduce their overall cardinality.

Preliminaries

SUMMARIZE-MINE FRAMEWORK (raw DB → summarized DB):
1. Summarization: For the raw database D with frequency threshold min_sup, randomly partition the vertices of each label into groups, bind each group into a single node, and collapse every graph correspondingly into a smaller summarized version. This step generalizes our view of the data to a higher level.
2. Mining: Apply any state-of-the-art frequent subgraph mining algorithm to the summarized database D' = {S_1, S_2, ..., S_n} with a slightly lowered support threshold min_sup', which generates the pattern set FP(D').
3. Verification: Check the patterns in FP(D') against the original database D, remove those p ∈ FP(D') whose support in D is below min_sup, and collect the remaining patterns into R'.
4. Iteration: Repeat steps 1-3 t times and combine the results. Let R'_1, R'_2, ..., R'_t be the patterns obtained from the different iterations; the final result is R' = R'_1 ∪ R'_2 ∪ ... ∪ R'_t. This step guarantees that the overall probability of missing any frequent pattern is bounded.
Steps 2-3 deal with false positives; step 4 deals with false negatives. A Python sketch of how these steps fit together is given below.
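
As a rough illustration, the following minimal Python sketch shows how the four steps could fit together. The Graph class, the num_groups parameter, and the mine/support callables are placeholders introduced for this sketch only, not the paper's actual data structures; in the paper, mining is done with gSpan and verification with subgraph isomorphism tests.

import random
from dataclasses import dataclass

@dataclass
class Graph:
    vertex_labels: dict   # vertex id -> label
    edges: set            # set of (u, v) vertex-id pairs

def summarize(g, num_groups):
    # Randomized summarization: vertices sharing a label are split at random
    # into at most num_groups groups, each group collapses into one summary
    # vertex, and edges are redirected to the summary vertices.
    group_of = {v: (lbl, random.randrange(num_groups))
                for v, lbl in g.vertex_labels.items()}
    s_labels = {grp: grp[0] for grp in group_of.values()}
    s_edges = {(group_of[u], group_of[v]) for (u, v) in g.edges
               if group_of[u] != group_of[v]}
    return Graph(vertex_labels=s_labels, edges=s_edges)

def summarize_mine(D, min_sup, min_sup_prime, t, num_groups, mine, support):
    # Summarize-Mine skeleton: summarize, mine D' with a lowered threshold,
    # verify candidates against the raw database D, and union over t rounds.
    result = set()
    for _ in range(t):
        D_prime = [summarize(g, num_groups) for g in D]    # 1. Summarization
        candidates = mine(D_prime, min_sup_prime)          # 2. Mining on D'
        result |= {p for p in candidates
                   if support(p, D) >= min_sup}            # 3. Verification
    return result                                          # 4. Union over iterations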

SUMMARIZE-MINE FRAMEWORK (Mining step): We take gSpan as the skeleton of the mining algorithm. Each labeled graph pattern can be transformed into a sequential representation called a DFS code. With a lexicographic order defined on the DFS code space, all subgraph patterns can be organized into a tree structure, where (1) patterns with k edges are placed on the k-th level, and (2) a preorder traversal of this tree generates the DFS codes of all possible patterns in lexicographic order.
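
A simplified, hypothetical sketch of the DFS code idea (not gSpan's exact construction): a code is represented as a sequence of edge tuples (i, j, l_i, l_edge, l_j), and two codes are compared tuple by tuple.

def lex_less(code_a, code_b):
    # A DFS code is modeled here as a list of edge tuples (i, j, l_i, l_edge, l_j),
    # where i and j are DFS discovery positions and l_* are labels. Python compares
    # lists of tuples element by element, which already gives a lexicographic order;
    # gSpan's real order adds extra rules for forward vs. backward edges, omitted here.
    return code_a < code_b

# Two 2-edge patterns sharing the same first edge:
code1 = [(0, 1, "A", "-", "B"), (1, 2, "B", "-", "A")]
code2 = [(0, 1, "A", "-", "B"), (1, 2, "B", "-", "C")]
print(lex_less(code1, code2))   # True: the codes first differ in the final label, "A" < "C"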

SUMMARIZE-MINE FRAMEWORK (figure): patterns enumerated according to the DFS lexicographic order.

SUMMARIZE-MINE FRAMEWORK (Verification step): false embeddings introduced by summarization can lead to false positives, so every candidate must be checked against the raw database. Two techniques reduce the verification cost:
Technique 1 (bottom-up): sup(p_1) > sup(p_2) > min_sup
Technique 2 (top-down): min_sup > sup(p_1) > sup(p_2)
After verification it is guaranteed that there are no false positives.
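
As an illustration of the bottom-up idea, the sketch below verifies candidates from small to large and uses the Apriori property (a pattern can be frequent in D only if its sub-patterns are) to skip descendants of patterns that already failed. The support, is_subpattern, and num_edges helpers are assumptions made for this example.

def verify_bottom_up(candidates, D, min_sup, support, is_subpattern, num_edges):
    # Check candidates mined from the summarized database against the raw
    # database D, smallest patterns first. If a pattern turns out infrequent
    # in D, every candidate containing it is pruned without a further test.
    verified, failed = set(), []
    for p in sorted(candidates, key=num_edges):
        if any(is_subpattern(q, p) for q in failed):
            continue                      # a sub-pattern already failed in D: prune
        if support(p, D) >= min_sup:
            verified.add(p)               # truly frequent, kept; no false positive
        else:
            failed.append(p)              # remember failures to prune their supergraphs
    return verified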


Bounding the False Negative Rate: summarization can also destroy embeddings, and missed embeddings lead to false negatives. Let q(p) denote the probability that a particular embedding f of pattern p survives summarization. For each label l_j, suppose f contains m_j vertices with label l_j, each assigned uniformly at random to one of the x_j groups for that label; f continues to exist only if these m_j vertices fall into different groups, which happens with probability x_j(x_j − 1)···(x_j − m_j + 1) / x_j^{m_j}. Multiplying the probabilities for all L labels gives q(p) = ∏_{j=1}^{L} x_j(x_j − 1)···(x_j − m_j + 1) / x_j^{m_j}.
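
A small numeric sketch of this survival probability, assuming the m[j] same-labeled vertices of an embedding are assigned independently and uniformly at random to x[j] groups:

def embedding_survival_prob(m, x):
    # Probability that, for every label j, the m[j] embedding vertices of that label
    # land in m[j] different groups out of x[j], so the embedding survives
    # summarization: prod_j x_j * (x_j - 1) * ... * (x_j - m_j + 1) / x_j**m_j.
    prob = 1.0
    for m_j, x_j in zip(m, x):
        if m_j > x_j:
            return 0.0                    # more vertices than groups forces a collision
        for i in range(m_j):
            prob *= (x_j - i) / x_j
    return prob

# e.g., two labels, 3 same-labeled vertices each, 10 groups per label:
print(embedding_survival_prob([3, 3], [10, 10]))   # 0.72 * 0.72 = 0.5184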

Bounding the False Negative Rate

The false negative rate after t iterations is (1 − P)^t, where P is the probability that a given frequent pattern is reported in a single iteration. To make (1 − P)^t less than some small ε:
Technique 1: for a raw database with frequency threshold min_sup, adopt a lower frequency threshold min_sup' on the summarized database.
Technique 2: iterate the mining steps t times and combine the results generated each time.
It is NOT guaranteed that there are no false negatives, but the probability of missing any frequent pattern is bounded by (1 − P)^t, which can be driven below any given ε by choosing t large enough.
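
A small helper, introduced here for illustration, that computes the smallest t with (1 − P)^t ≤ ε:

import math

def iterations_needed(P, eps):
    # Smallest t such that (1 - P)**t <= eps, i.e. the number of
    # summarize-mine-verify rounds needed to push the false negative
    # rate below eps.
    if P >= 1.0:
        return 1                      # every round already succeeds
    if P <= 0.0:
        raise ValueError("P must be positive for the bound to hold")
    return math.ceil(math.log(eps) / math.log(1.0 - P))

print(iterations_needed(P=0.5, eps=0.01))   # 7, since 0.5**7 is about 0.0078 <= 0.01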

Experiments

Conclusion: Isomorphism tests on small, summarized graphs are much easier. Each graph is summarized and mined t times to reduce the false negative rate; an open question is how t should be set (t = ?).