Mining Top-K Large Structural Patterns in a Massive Network

Presentation transcript:

Mining Top-K Large Structural Patterns in a Massive Network
Feida Zhu (1), Qiang Qu (2), David Lo (1), Xifeng Yan (3), Jiawei Han (4), and Philip S. Yu (5)
(1) Singapore Management University, (2) Peking University, (3) University of California, Santa Barbara, (4) University of Illinois at Urbana-Champaign, (5) University of Illinois at Chicago
Reported by Luyiqi

Presentation at VLDB 2011 – Seattle, WA

Motivation: Why large graph patterns?
- Graph data is getting ever bigger, and so are the patterns.
  - E.g., social networks such as Facebook and Twitter.
- Large patterns are often more informative in characterizing large graph data.
  - E.g., in DBLP, small patterns are ubiquitous, while larger patterns better characterize different research communities.
  - E.g., in software engineering, large patterns can correspond to software backbones.

Motivation: Why is it challenging?
- Larger input graphs produce larger frequent patterns.
- Pattern explosion is notorious in frequent graph mining, even for small patterns and small data.
- Frequent pattern mining in the single-graph setting is tricky.
  - Support computation and embedding maintenance are both harder in a single graph.
  - Most large graph datasets are no longer graph transaction databases; they are single graphs.

Talk Outline
- Motivation
- Problem Definition
- Our Solution: SpiderMine
- Experiments
- Conclusion and Future Work

Notations
- Radius
- Diameter
- Support

Problem
- Given a graph, mine the top-K largest patterns.
- To capture exactly those patterns, no more and no less, we might have to generate all the smaller ones, which we cannot afford.
- Instead, find them probabilistically, with a user-defined error bound.
- Problem definition: "Mine the top-K largest frequent patterns whose diameters are bounded by D_max, with a probability of at least 1 - ε."

Solution: SpiderMine

Main Idea
- How to capture large graph patterns?
- Observation: large patterns are composed of a large number of small components, called "spiders", which will eventually connect together after some rounds of pattern growth.

r-Spider
- An r-spider is a frequent graph pattern P for which there exists a vertex u of P such that all other vertices of P are within distance r of u.
- u is called the head vertex.
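The structural half of the r-spider definition can be checked with a breadth-first search from each candidate head. The sketch below is only an illustration: the adjacency-dictionary format and the function names `bfs_distances` and `spider_heads` are assumptions, and it does not test whether the pattern is frequent, which the definition also requires.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return hop distances from source to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def spider_heads(adj, r):
    """List the vertices u such that every vertex of the pattern is within
    distance r of u. A non-empty result means the pattern satisfies the
    distance condition of an r-spider (frequency is NOT checked here)."""
    heads = []
    for u in adj:
        dist = bfs_distances(adj, u)
        if len(dist) == len(adj) and max(dist.values()) <= r:
            heads.append(u)
    return heads


# A path a-b-c-d-e: only the middle vertex reaches everything within 2 hops.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
print(spider_heads(path, 2))  # ['c']
print(spider_heads(path, 1))  # []
```

On the 5-vertex path, the pattern can be a 2-spider headed at `c` but cannot be a 1-spider for any choice of head.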

SpiderMine Overview
1. Mine the set S of all the r-spiders.
2. Randomly draw M r-spiders from S as the initial set of patterns.
3. Grow these patterns for t iterations, where t = D_max / (2r).
   A. Extend the pattern boundary with spiders.
   B. At each iteration, increase the radius of a pattern by r.
   C. Merge two patterns whenever possible.
4. Discard unmerged patterns.
5. Continue to grow the remaining ones to maximum size.
6. Return the top-K largest ones in the result.
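The control flow of steps 2 to 6 can be sketched as a small skeleton. This is a toy under strong assumptions, not the authors' implementation: patterns are modeled as plain vertex sets, "merge whenever possible" is replaced by "merge when two vertex sets overlap", and `extend`, `grow_to_max`, `REGIONS`, and every other name here are hypothetical stand-ins supplied by the caller. Step 1 (mining the r-spiders) and all support counting are left out.

```python
import random


def merge_overlapping(patterns):
    """Union patterns that share a vertex. A result is flagged as merged if it
    absorbed at least two inputs or one of its inputs was already flagged."""
    result = []  # pairwise-disjoint [vertex_set, merged_flag] entries
    for verts, flag in patterns:
        verts = set(verts)
        keep = []
        for other, other_flag in result:
            if verts & other:
                verts |= other
                flag = True
            else:
                keep.append([other, other_flag])
        keep.append([verts, flag])
        result = keep
    return [(frozenset(v), f) for v, f in result]


def spidermine_skeleton(spiders, m, d_max, r, k, extend, grow_to_max, rng):
    t = d_max // (2 * r)                                    # t = D_max / (2r)
    patterns = [(frozenset(s), False) for s in rng.sample(spiders, m)]  # step 2
    for _ in range(t):                                      # step 3
        patterns = [(extend(p), flag) for p, flag in patterns]  # 3A and 3B
        patterns = merge_overlapping(patterns)                  # 3C
    survivors = [p for p, merged in patterns if merged]     # step 4
    final = {frozenset(grow_to_max(p)) for p in survivors}  # step 5
    return sorted(final, key=len, reverse=True)[:k]         # step 6


# Toy data: one large and one small "pattern region" on integer vertices.
REGIONS = [frozenset(range(0, 12)), frozenset(range(20, 23))]


def region_of(pattern):
    return next(reg for reg in REGIONS if pattern <= reg)


def extend(pattern):
    """Grow by one hop (r = 1) along the line, staying inside the region."""
    return frozenset(v + d for v in pattern for d in (-1, 0, 1)) & region_of(pattern)


def grow_to_max(pattern):
    return region_of(pattern)


spiders = [frozenset({v - 1, v, v + 1}) & reg for reg in REGIONS for v in reg]
top = spidermine_skeleton(spiders, m=len(spiders), d_max=4, r=1, k=1,
                          extend=extend, grow_to_max=grow_to_max,
                          rng=random.Random(0))
print(sorted(top[0]))
```

Here every spider is drawn, so both regions survive and the larger one is returned. With a smaller `m`, the small region is more likely to be missed or hit only once and then discarded in step 4.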

Large patterns vs small patterns
Why does SpiderMine have a good chance of keeping large patterns and pruning small ones?
1. Small patterns are less likely to be hit in the random draw.
   - First pruning: at the initial random draw.
2. Even if a small pattern is hit, it is much less likely to be hit multiple times.
   - Second pruning: after t iterations of pattern growth.
3. The larger the pattern, the greater the chance that it is hit and saved.
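The two-stage pruning argument can be made concrete with a simplified model. Assume, purely for illustration, that each of the M draws independently lands inside a given pattern with probability p (for example, the fraction of spiders in S that belong to that pattern), and that a pattern survives only if it is hit at least twice. The number of hits is then binomial. This model and the function name are assumptions; it is not the lemma from the talk.

```python
def prob_hit_at_least_twice(p, m):
    """P[X >= 2] for X ~ Binomial(m, p): one minus the probability of
    zero hits and of exactly one hit."""
    return 1.0 - (1.0 - p) ** m - m * p * (1.0 - p) ** (m - 1)


# Hypothetical numbers: 1000 spiders in total, M = 100 draws.
large = prob_hit_at_least_twice(p=50 / 1000, m=100)  # pattern covering 50 spiders
small = prob_hit_at_least_twice(p=5 / 1000, m=100)   # pattern covering 5 spiders
print(f"large pattern survives with probability {large:.3f}")
print(f"small pattern survives with probability {small:.3f}")
```

Under these made-up numbers the large pattern is hit twice or more with probability well above 0.9, while the small one stays below 0.1, which is the gap the pruning relies on.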

How many r-spiders to draw?
With a user-defined error threshold ε, we solve for M by setting:

Proof of Lemma 2


How to grow?

Why Spiders?
- Spiders reduce the combinatorial complexity of pattern growth.
- Observation:
  - Spiders are shared by many larger patterns.
  - Once obtained, they can be efficiently assembled to generate large patterns.

Why Spiders?
- Spiders improve graph isomorphism checking.
- We propose a novel graph pattern representation: the spider-set representation.
  - A pattern is represented by the set of its constituent r-spiders.
  - Two isomorphic patterns must have the same spider-set representation.
  - Two patterns with the same spider-set representation are highly likely to be isomorphic.
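The spider-set idea works as a cheap filter: compare an isomorphism-invariant summary of each pattern first, and run a full isomorphism test only when the summaries agree. The sketch below uses a deliberately simplified stand-in for a spider, namely a signature of each vertex's r-hop neighborhood (the sorted distance and label pairs plus the number of edges inside the ball). The signature, the function names, and the graph format are all assumptions for illustration; the talk's representation is built from the mined r-spiders themselves.

```python
from collections import Counter, deque


def ball_signature(adj, labels, u, r):
    """Isomorphism-invariant summary of the r-hop neighborhood of u."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        if dist[v] == r:
            continue  # do not expand beyond radius r
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    edges_inside = sum(1 for v in dist for w in adj[v] if w in dist) // 2
    return (tuple(sorted((d, labels[v]) for v, d in dist.items())), edges_inside)


def spider_set(adj, labels, r):
    """Multiset of neighborhood signatures, one per vertex."""
    return Counter(ball_signature(adj, labels, u, r) for u in adj)


def may_be_isomorphic(g1, g2, r):
    """False means certainly not isomorphic; True means 'run a full check'."""
    return spider_set(*g1, r) == spider_set(*g2, r)


def cycle(vertices):
    adj = {v: [] for v in vertices}
    for i, v in enumerate(vertices):
        w = vertices[(i + 1) % len(vertices)]
        adj[v].append(w)
        adj[w].append(v)
    return adj


labels = {v: "A" for v in range(8)}
one_8_cycle = (cycle(list(range(8))), labels)
two_4_cycles = ({**cycle([0, 1, 2, 3]), **cycle([4, 5, 6, 7])}, labels)

print(may_be_isomorphic(one_8_cycle, two_4_cycles, r=1))  # True
print(may_be_isomorphic(one_8_cycle, two_4_cycles, r=2))  # False
```

The two graphs are not isomorphic, yet the r = 1 summaries cannot tell them apart; at r = 2 the summaries differ.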

Why Spiders? (Example)
- The larger the r, the more effective our spider-based isomorphism detection is.
  - A larger r imposes more topological constraints.

Experimental Results

Synthetic Datasets
- Random Network (Erdős-Rényi)
  - Generate a background graph and inject frequent patterns.
  - |V|, f: number of vertices and labels, respectively
  - d: average degree
  - m, n: number of small or large patterns injected
  - |V_L|, |V_S| (L_sup, S_sup): number of vertices of injected large/small patterns (with their supports)
- Scale-Free Network (Barabási-Albert)
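A background graph of the Erdős-Rényi kind can be generated as follows. This is a minimal sketch assuming the G(n, p) variant with edge probability p = d / (|V| - 1), so that the expected average degree is d, and uniformly random vertex labels. The function name and the choice of variant are assumptions, and pattern injection is not shown.

```python
import random


def random_labeled_graph(num_vertices, avg_degree, num_labels, rng):
    """Erdős-Rényi G(n, p) background graph with uniformly random labels."""
    p = avg_degree / (num_vertices - 1)  # expected degree of each vertex is d
    adj = {v: set() for v in range(num_vertices)}
    for u in range(num_vertices):
        for v in range(u + 1, num_vertices):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    labels = {v: rng.randrange(num_labels) for v in range(num_vertices)}
    return adj, labels


adj, labels = random_labeled_graph(num_vertices=200, avg_degree=4,
                                   num_labels=10, rng=random.Random(42))
observed = sum(len(nbrs) for nbrs in adj.values()) / len(adj)
print(f"observed average degree: {observed:.2f}")
```

The pairwise loop is quadratic in |V|, which is fine for a small illustration but would need a sparse sampling scheme for massive graphs.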

Experiments (I): Random Network

Experiments (I): Random Network
- Runtime comparison with SUBDUE, SEuS, and MoSS

Experiments (I): Random Network
- Further increasing input graph size to

Experiments (II): Scale-free Network
- Barabási-Albert model
- Generate graphs with a power-law degree distribution.
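Preferential attachment, the mechanism behind the Barabási-Albert model, can be sketched in a few lines. Each new vertex attaches to `m` distinct existing vertices chosen with probability proportional to their current degree. The function name, the seed graph (the first new vertex connects to `m` isolated initial vertices), and the parameters are assumptions for illustration, not the experimental setup from the talk.

```python
import random


def barabasi_albert(num_vertices, m, rng):
    """Grow a graph by preferential attachment; requires 1 <= m < num_vertices."""
    adj = {v: set() for v in range(num_vertices)}
    targets = list(range(m))  # the first new vertex attaches to the m initial ones
    repeated = []             # each vertex appears once per incident edge
    for new in range(m, num_vertices):
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
        repeated.extend(targets)
        repeated.extend([new] * m)
        # Sampling from `repeated` picks a vertex with probability
        # proportional to its degree; the set keeps the targets distinct.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(repeated))
        targets = list(chosen)
    return adj


graph = barabasi_albert(num_vertices=500, m=2, rng=random.Random(7))
degrees = sorted((len(nbrs) for nbrs in graph.values()), reverse=True)
print("five largest degrees:", degrees[:5])
```

A handful of early vertices typically end up with far more neighbors than the rest, which is the heavy-tailed behavior the slide refers to.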

Experiments (IV): DBLP data
- authors in DB/DM
- Label authors by number of papers:
  - Prolific (P): >= 50 papers
  - Senior (S): 20~49 papers
  - Junior (J): 10~19 papers
  - Beginner (B): 5~9 papers
- 6508 authors, edges
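The labeling rule is a simple threshold function on paper counts. In the sketch below the function name is hypothetical, and returning `None` for authors with fewer than 5 papers is an assumption, since the slide does not say how such authors are handled.

```python
def author_label(num_papers):
    """Map a paper count to the vertex label used in the DBLP experiment."""
    if num_papers >= 50:
        return "P"  # Prolific
    if num_papers >= 20:
        return "S"  # Senior
    if num_papers >= 10:
        return "J"  # Junior
    if num_papers >= 5:
        return "B"  # Beginner
    return None     # assumption: authors below 5 papers are left unlabeled


print([author_label(n) for n in (3, 5, 12, 49, 50)])  # [None, 'B', 'J', 'S', 'P']
```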

Experiments (IV): DBLP data

Conclusion
- We propose a novel probabilistic algorithm, SpiderMine, for top-K large pattern mining from a single graph with a user-defined error bound.
- We propose the new concept of an r-spider, which reduces both the complexity of pattern growth and the cost of graph isomorphism checking.
- Extensive experiments on both synthetic and real data demonstrate the effectiveness and efficiency of SpiderMine.

Thank You