Slides are modified from Jiawei Han & Micheline Kamber

Presentation transcript:

Lecture 30: Graph Data Mining

Graph Data Mining
[Figure: examples of graph data: DNA sequence, RNA]

Graph Data Mining
[Figure: examples of graph data: compounds, texts]

Outline
- Graph Pattern Mining
  - Mining Frequent Subgraph Patterns
  - Graph Indexing
  - Graph Similarity Search
- Graph Classification
  - Graph pattern-based approach
  - Machine Learning approaches
- Motifs
- Graph Clustering
  - Link-density-based approach

Frequent subgraphs
- A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
- The support of a graph g is the percentage of graphs in the dataset G that have g as a subgraph
Applications of graph pattern mining
- Mining biochemical structures
- Program control flow analysis
- Mining XML structures or Web communities
- Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
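The support definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: graphs are tiny, vertices carry labels, "subgraph" means a label-preserving, not necessarily induced, subgraph, and the helper names (make_graph, contains, support) and the three-graph dataset are hypothetical. The containment test is brute force, so it is usable only on toy inputs.

```python
from itertools import permutations

def make_graph(labels, edge_list):
    """labels: dict vertex -> label; edge_list: iterable of (u, v) pairs."""
    return labels, {frozenset(e) for e in edge_list}

def contains(graph, pattern):
    """True if `pattern` occurs in `graph` as a label-preserving subgraph.

    Tries every injective mapping of pattern vertices to graph vertices,
    so the cost grows factorially; toy graphs only.
    """
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if any(p_labels[v] != g_labels[mapping[v]] for v in p_nodes):
            continue  # vertex labels do not match
        if all(frozenset(mapping[v] for v in e) in g_edges for e in p_edges):
            return True  # every pattern edge is present in the graph
    return False

def support(pattern, dataset):
    """Fraction of graphs in the dataset that contain the pattern."""
    return sum(contains(g, pattern) for g in dataset) / len(dataset)

# Hypothetical dataset: a C-C-O triangle, a C-C-N path, and a single C-O edge.
dataset = [
    make_graph({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2), (0, 2)]),
    make_graph({0: "C", 1: "C", 2: "N"}, [(0, 1), (1, 2)]),
    make_graph({0: "C", 1: "O"}, [(0, 1)]),
]
c_c = make_graph({0: "C", 1: "C"}, [(0, 1)])
n_o = make_graph({0: "N", 1: "O"}, [(0, 1)])

min_support = 0.5
is_frequent = support(c_c, dataset) >= min_support  # "no less than" the threshold
```

Here the C-C edge occurs in two of the three graphs, so it is frequent at a 50% threshold, while the N-O edge occurs in none.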

Example: Frequent Subgraphs
[Figure: a graph dataset with graphs (A), (B), (C) and the frequent patterns (1), (2) found when the minimum support is 2]

Example
[Figure: a graph dataset and its frequent patterns when the minimum support is 2]

Graph Mining Algorithms
- Incomplete beam search, greedy (Subdue)
- Inductive logic programming (WARMR)
- Graph theory-based approaches
  - Apriori-based approach
  - Pattern-growth approach

Properties of Graph Mining Algorithms
- Search order: breadth vs. depth
- Generation of candidate subgraphs: apriori vs. pattern growth
- Elimination of duplicate subgraphs: passive vs. active
- Support calculation: embedding store or not
- Discovery order of patterns: path → tree → graph

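Apriori-Based Approach
[Figure: frequent k-edge graphs G1, G2, …, Gn are joined into (k+1)-edge candidates G, G’, G’’]
- Join: combine k-edge graphs into (k+1)-edge candidates
- Prune: check the frequency of each candidate
- Checking frequency requires a subgraph isomorphism test, which is NP-complete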

Apriori-Based, Breadth-First Search
- Methodology: breadth-first search, joining two graphs
- AGM (Inokuchi et al.) generates new graphs with one more node
- FSG (Kuramochi and Karypis) generates new graphs with one more edge

Pattern Growth Method
[Figure: a k-edge graph G is extended to (k+1)-edge graphs G1, G2, …, Gn and then to (k+2)-edge graphs; the same graph can be reached along different extension paths (duplicate graph)]

Apriori:
- Step 1: Join two (k-1)-edge graphs that share the same (k-2)-edge subgraph to generate a k-edge graph.
- Step 2: Join the tid-lists of these two (k-1)-edge graphs and check whether the resulting count reaches the minimum support.
- Step 3: Check that all (k-1)-edge subgraphs of this k-edge graph are frequent.
- Step 4: If G passes Steps 1-3, compute the support of G in the graph dataset to see whether it is really frequent.

gSpan:
- Step 1: Right-most extend a (k-1)-edge graph to several k-edge graphs.
- Step 2: Enumerate the occurrences of this (k-1)-edge graph in the graph dataset while counting these k-edge graphs.
- Step 3: Output those k-edge graphs whose support reaches the minimum support.

Pros of gSpan:
1. It avoids costly candidate generation and the testing of some infrequent subgraphs.
2. It needs no complicated graph operations, such as joining two graphs and computing their (k-1)-edge subgraphs.
3. It is very simple.

The key is how to do right-most extension efficiently in a graph. We invented the DFS code for graphs.
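Step 2 of the Apriori procedure can be illustrated with plain set intersection. The sketch below assumes a tid-list is a Python set of graph identifiers and that the threshold is an absolute count compared with "no less than", as in the definition of frequent subgraphs; the function names and the example tid-lists are hypothetical. The intersection is only an upper bound on the candidate's support, which is why Step 4 still has to count the candidate directly.

```python
def join_tid_lists(tids_a, tids_b):
    """Graphs that contain BOTH (k-1)-edge parents of a candidate.

    Any graph containing the k-edge candidate must contain both parents,
    so the size of this set bounds the candidate's support from above.
    """
    return tids_a & tids_b

def passes_tid_filter(tids_a, tids_b, min_support_count):
    """Keep the candidate only if the upper bound reaches the threshold."""
    return len(join_tid_lists(tids_a, tids_b)) >= min_support_count

# Hypothetical tid-lists of two (k-1)-edge graphs that share a (k-2)-edge core.
tids_a = {1, 2, 3, 5}
tids_b = {2, 3, 4, 5}

joined = join_tid_lists(tids_a, tids_b)            # {2, 3, 5}
keep_at_3 = passes_tid_filter(tids_a, tids_b, 3)   # upper bound 3 reaches 3
keep_at_4 = passes_tid_filter(tids_a, tids_b, 4)   # upper bound 3 is below 4
```

A candidate rejected here never needs a subgraph isomorphism test, which is the point of running this cheap filter first.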

Graph Pattern Explosion Problem
- If a graph is frequent, all of its subgraphs are frequent (the Apriori property)
- An n-edge frequent graph may have 2^n subgraphs
- Among 422 chemical compounds that are confirmed to be active in an AIDS antiviral screen dataset, there are 1,000,000 frequent graph patterns if the minimum support is 5%

Closed Frequent Graphs
- A frequent graph G is closed if there exists no supergraph of G that carries the same support as G
- If some of G’s subgraphs have the same support as G, it is unnecessary to output these subgraphs (nonclosed graphs)
- Lossless compression: still ensures that the mining result is complete
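The closedness test and the "lossless" claim can be checked on a toy example. The sketch below makes a strong simplifying assumption: each pattern is a frozenset of named edges over a fixed vertex naming, so "supergraph" reduces to "proper superset" and no isomorphism test is needed. The function names and support values are hypothetical.

```python
def closed_patterns(support):
    """Return the patterns that have no proper superset with equal support."""
    closed = {}
    for g, sup in support.items():
        has_equal_super = any(g < h and support[h] == sup for h in support)
        if not has_equal_super:
            closed[g] = sup
    return closed

def recover_support(g, closed):
    """Support of any frequent pattern, recovered from the closed set alone.

    By the Apriori property a supergraph never has higher support, so the
    maximum over closed supergraphs is the pattern's own support.
    """
    return max(s for h, s in closed.items() if g <= h)

ab = frozenset({"a-b"})
bc = frozenset({"b-c"})
abc = frozenset({"a-b", "b-c"})

# Hypothetical supports (absolute counts) of three frequent patterns.
support = {ab: 3, bc: 2, abc: 2}
closed = closed_patterns(support)   # bc is dropped: abc has the same support
```

Only two of the three patterns are kept, yet the support of the dropped pattern can still be recovered from the closed set.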

Graph Search
Querying graph databases: given a graph database and a query graph, find all the graphs containing this query graph
[Figure: a query graph and a graph database]

Scalability Issue
- Naïve solution
  - Sequential scan (disk I/O)
  - Subgraph isomorphism test (NP-complete)
- Problem: scalability is a big issue
- An indexing mechanism is needed

Indexing Strategy
[Figure: query graph (Q), graph (G), substructure]
- If graph G contains query graph Q, G should contain any substructure of Q
- Remarks: index substructures of a query graph to prune graphs that do not contain these substructures
- Our work, like all the previous work, follows this indexing strategy

Indexing Framework
Two steps in processing graph queries
- Step 1. Index construction: enumerate structures in the graph database and build an inverted index between structures and graphs
- Step 2. Query processing
  - Enumerate structures in the query graph
  - Calculate the candidate graphs containing these structures
  - Prune the false positive answers by performing a subgraph isomorphism test
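The two steps can be sketched with a dictionary-based inverted index. To stay short, this example assumes the indexed "structures" are single labeled edges (a pair of vertex labels) rather than the larger substructures a real index would use, and it stops at the candidate set: the final subgraph isomorphism check is left out. The database contents and function names are hypothetical.

```python
from collections import defaultdict

def features(edges):
    """Structures of a graph: here, its set of label pairs (one per edge)."""
    return {tuple(sorted(e)) for e in edges}

def build_index(database):
    """Step 1: inverted index from each structure to the graphs containing it."""
    index = defaultdict(set)
    for gid, edges in database.items():
        for f in features(edges):
            index[f].add(gid)
    return index

def candidates(index, database, query_edges):
    """Step 2: graphs that contain every structure found in the query.

    The result may include false positives; these would be removed by a
    subgraph isomorphism test, which this sketch does not implement.
    """
    result = set(database)
    for f in features(query_edges):
        result &= index.get(f, set())
    return result

# Hypothetical database: each graph is given as a list of labeled edges.
database = {
    "g1": [("C", "C"), ("C", "O")],
    "g2": [("C", "C"), ("C", "N")],
    "g3": [("C", "O"), ("O", "H")],
}
index = build_index(database)

hits = candidates(index, database, [("O", "C"), ("C", "C")])
```

Only g1 contains both a C-C edge and a C-O edge, so it is the single candidate passed on to verification.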

Why Frequent Structures?
- We cannot index (or even search) all substructures
- Large structures will likely be indexed well by their substructures
- Size-increasing support threshold
[Figure: minimum support threshold as a function of structure size; axes: size, support]

Structural Graph Indexing index pre-defined structures that are commonly observed

Structure Similarity Search CHEMICAL COMPOUNDS (a) caffeine (b) diurobromine (c) sildenafil QUERY GRAPH

Substructure Similarity Measure
Feature-based similarity measure
- Each graph is represented as a feature vector X = {x1, x2, …, xn}
- Similarity is defined by the distance between the corresponding vectors
- Advantages: easy to index, fast
- Disadvantage: rough measure
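A minimal sketch of the feature-based measure follows. The slide does not say which distance is used, so the L1 (Manhattan) distance here is an assumption, as are the feature names and the representation of a graph as a list of the feature occurrences found in it.

```python
from collections import Counter

def feature_vector(occurrences, feature_list):
    """x_i = number of times feature i occurs in the graph."""
    counts = Counter(occurrences)
    return [counts[f] for f in feature_list]

def l1_distance(x, y):
    """Sum of absolute differences; smaller means more similar."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical feature set and feature occurrences for two graphs.
feature_list = ["C-C", "C-O", "C-N"]
graph_x = feature_vector(["C-C", "C-C", "C-O"], feature_list)   # [2, 1, 0]
query_x = feature_vector(["C-C", "C-N"], feature_list)          # [1, 0, 1]

distance = l1_distance(graph_x, query_x)
```

Because the vectors only count features, two structurally different graphs can end up with the same vector, which is the sense in which the measure is rough.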

Some “Straightforward” Methods
- Method 1: Directly compute the similarity between the graphs in the DB and the query graph
  - Sequential scan
  - Subgraph similarity computation
- Method 2: Form a set of subgraph queries from the original query graph and use exact subgraph search
  - Costly: if we allow 3 edges to be missed in a 20-edge query graph, it may generate 1,140 subgraphs
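The figure of 1,140 is the number of ways to choose which 3 of the 20 edges to drop, C(20, 3). A quick check in Python (the function name is hypothetical, and the count is an upper bound because some edge subsets may be disconnected or isomorphic to each other):

```python
from itertools import combinations
from math import comb

def num_relaxed_queries(num_edges, num_missing):
    """Number of edge subsets obtained by deleting exactly `num_missing` edges."""
    return comb(num_edges, num_missing)

total = num_relaxed_queries(20, 3)

# Cross-check by explicit enumeration of the deleted-edge triples.
enumerated = sum(1 for _ in combinations(range(20), 3))
```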

Index: Precise vs. Approximate Search
- Precise search
  - Use frequent patterns as indexing features
  - Select features in the database space based on their selectivity
  - Build the index
- Approximate search
  - Hard to build indices covering similar subgraphs: explosive number of subgraphs in databases
  - Idea: (1) keep the index structure, (2) select features in the query space

Resolving local structure: network motifs
[Figure: a motif to be found and the motif matches in the target graph; source: http://mavisto.ipk-gatersleben.de/frequency_concepts.html]

Examples of network motifs (3 nodes)
- Feed forward loop
  - Found in neural networks
  - Seems to be used to neutralize “biological noise”
- Single-Input Module
  - e.g. gene control networks
[Figure: two three-node motifs on nodes X, Y, Z]

Examples of network motifs (4 nodes)
- Parallel paths
  - Found in neural networks
  - Food webs
[Figure: a four-node motif on nodes W, X, Y, Z]

All 3 node motifs

4 node subgraphs (computational expense increases with the size of the graph!)

Network motif detection
- Some motifs will occur more often in real world networks than in random networks
- Technique:
  - construct many random graphs with the same number of nodes and edges (same node degree distribution?)
  - count the number of motifs in those graphs
  - calculate the Z score, which indicates how likely it is that the number of motifs in the real world network could have occurred by chance
- Software available:
  - http://www.weizmann.ac.il/mcb/UriAlon/ (the original)
  - http://theinf1.informatik.uni-jena.de/~wernicke/motifs/index.html (faster and more user friendly)
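The technique can be sketched with the standard library alone. This is an illustration under several assumptions, not the method used by the tools listed above: the motif is an undirected triangle, the random graphs keep only the number of nodes and edges (the degree distribution is not preserved), the sample standard deviation is used, and all function names and the example network are hypothetical.

```python
import random
from itertools import combinations
from statistics import mean, stdev

def count_triangles(n, edges):
    """Count undirected triangles among vertices 0..n-1 by brute force."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return sum(1 for a, b, c in combinations(range(n), 3)
               if b in adj[a] and c in adj[a] and c in adj[b])

def random_graph(n, m, rng):
    """Uniformly random simple graph with n nodes and exactly m edges."""
    return rng.sample(list(combinations(range(n), 2)), m)

def z_score(observed, random_counts):
    """(observed - mean) / standard deviation of the random counts.

    Raises ZeroDivisionError if all random counts are identical.
    """
    return (observed - mean(random_counts)) / stdev(random_counts)

def motif_z_score(n, edges, trials=200, seed=0):
    rng = random.Random(seed)
    counts = [count_triangles(n, random_graph(n, len(edges), rng))
              for _ in range(trials)]
    return z_score(count_triangles(n, edges), counts)

# Hypothetical network: a 4-clique (4 triangles) plus a 4-node path.
clustered = list(combinations(range(4), 2)) + [(4, 5), (5, 6), (6, 7)]
z = motif_z_score(8, clustered)
```

A positive z here says that triangles are more common in the example network than in random graphs of the same size. Preserving the degree distribution, as the slide's question suggests, would need a different random graph generator, for example one based on edge swaps.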

What the Z score means
$z_x = \frac{x - \mu_x}{\sigma_x}$
- $x$ = number of times the motif appeared in the network being tested
- $\mu_x$ = mean number of times the motif appeared in the random graphs
- $\sigma_x$ = standard deviation of that count over the random graphs
- The probability of observing a Z score of 2 or more is 0.02275
In the context of motifs:
- Z > 0: the motif occurs more often than in random graphs
- Z < 0: the motif occurs less often than in random graphs
- Z > 1.65: only about a 5% chance of random occurrence (one-sided)

software: FANMOD (also igraph) http://theinf1.informatik.uni-jena.de/~wernicke/motifs/index.html

Superfamilies of networks source: Milo et al., Superfamilies of Evolved and Designed Networks, Science 303:1538-1542, 2004

Quiz Q: Which of the following triads is underrepresented in social networks?


Motifs: recap
- Given a particular structure, search for it in the network, e.g. complete triads
- Advantage: motifs can correspond to particular functions, e.g. in biological networks
- Disadvantage: we don’t know if a motif is part of a larger cohesive community


Substructure-Based Graph Classification
- Basic idea
  - Extract graph substructures
  - Represent a graph with a feature vector x = {x1, …, xn}, where xi is the frequency of the i-th substructure in that graph
  - Build a classification model
- Different features and representative work
  - Fingerprint
  - Maccs keys
  - Tree and cyclic patterns [Horvath et al.]
  - Minimal contrast subgraph [Ting and Bailey]
  - Frequent subgraphs [Deshpande et al.; Liu et al.]
  - Graph fragments [Wale and Karypis]

Direct Mining of Discriminative Patterns
- Avoid mining the whole set of patterns
  - Harmony [Wang and Karypis]
  - DDPMine [Cheng et al.]
  - LEAP [Yan et al.]
  - MbT [Fan et al.]
- Find the most discriminative pattern
  - A search problem?
  - An optimization problem?
- Extensions
  - Mining top-k discriminative patterns
  - Mining approximate/weighted discriminative patterns

Graph Kernels
- Motivation
  - Kernel-based learning methods do not need to access the data points directly
  - They rely on the kernel function between the data points
  - They can be applied to any complex structure, provided you can define a kernel function on it
- Basic idea
  - Map each graph to some significant set of patterns
  - Define a kernel on the corresponding sets of patterns

Kernel-based Classification
- Random walk
  - Basic idea: count the matching random walks between the two graphs
- Marginalized kernels (Gärtner ’02, Kashima et al. ’02, Mahé et al. ’04)
  - The definition combines paths in the two graphs, probability distributions on those paths, and a kernel between paths
  - [Formula and symbols not preserved in this transcript]
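The basic idea of counting matching walks can be illustrated as follows. This is a deliberately simple fixed-length variant, not the marginalized kernel from the cited papers: two walks match when their vertex label sequences are identical, every walk has equal weight, and only walks with exactly `length` edges are considered. The graph representation (a labels dict plus an adjacency dict) and function names are assumptions.

```python
from collections import Counter

def walk_label_counts(labels, adj, length):
    """Count the label sequences of all walks that use `length` edges."""
    counts = Counter()

    def extend(v, seq, remaining):
        if remaining == 0:
            counts[seq] += 1
            return
        for w in adj[v]:
            extend(w, seq + (labels[w],), remaining - 1)

    for v in adj:
        extend(v, (labels[v],), length)
    return counts

def walk_kernel(g1, g2, length):
    """Number of pairs of walks (one per graph) with equal label sequences."""
    labels1, adj1 = g1
    labels2, adj2 = g2
    c1 = walk_label_counts(labels1, adj1, length)
    c2 = walk_label_counts(labels2, adj2, length)
    return sum(c1[s] * c2[s] for s in c1)

# Hypothetical graphs: the path A-B-A and the single edge A-B.
path_aba = ({0: "A", 1: "B", 2: "A"}, {0: [1], 1: [0, 2], 2: [1]})
edge_ab = ({0: "A", 1: "B"}, {0: [1], 1: [0]})

k1 = walk_kernel(path_aba, edge_ab, 1)
k2 = walk_kernel(path_aba, edge_ab, 2)
```

The value is an inner product of walk-count vectors, so it is a valid kernel. Enumerating walks explicitly grows exponentially with the walk length, so this form is suitable only for small examples.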


Graph Compression Extract common subgraphs and simplify graphs by condensing these subgraphs into nodes

Graph/Network Clustering Problem
- Networks made up of the mutual relationships of data elements usually have an underlying structure
  - Because relationships are complex, it is difficult to discover these structures
  - How can the structure be made clear?
- Given simple information of who associates with whom, could one identify clusters of individuals with common interests or special relationships?
  - E.g., families, cliques, terrorist cells…

An Example of Networks
- How many clusters?
- What size should they be?
- What is the best partitioning?
- Should some points be segregated?

A Social Network Model
- Individuals in a tight social group, or clique, know many of the same people, regardless of the size of the group
- Individuals who are hubs know many people in different groups but belong to no single group
  - E.g., politicians bridge multiple groups
- Individuals who are outliers reside at the margins of society
  - E.g., hermits know few people and belong to no group

The Neighborhood of a Vertex
Define Γ(v) as the immediate neighborhood of a vertex v, i.e. the set of people that an individual knows

Structure Similarity The desired features tend to be captured by a measure called Structural Similarity Structural similarity is large for members of a clique and small for hubs and outliers.
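The slide does not give a formula for structural similarity. The sketch below assumes the definition used by the SCAN clustering algorithm: the neighborhood of a vertex includes the vertex itself, and the similarity of two vertices is the size of their shared neighborhood divided by the geometric mean of the two neighborhood sizes. The function names and the example network are hypothetical.

```python
from math import sqrt

def neighborhood(adj, v):
    """Closed neighborhood: the vertex together with its neighbors (assumed)."""
    return set(adj[v]) | {v}

def structural_similarity(adj, v, w):
    """|N(v) & N(w)| / sqrt(|N(v)| * |N(w)|), a value in (0, 1]."""
    nv, nw = neighborhood(adj, v), neighborhood(adj, w)
    return len(nv & nw) / sqrt(len(nv) * len(nw))

# Hypothetical network: clique {0,1,2,3}, clique {5,6,7}, and hub 4 bridging them.
adj = {
    0: {1, 2, 3, 4},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2},
    4: {0, 5},
    5: {4, 6, 7},
    6: {5, 7},
    7: {5, 6},
}

inside_clique = structural_similarity(adj, 1, 2)   # two clique members
clique_to_hub = structural_similarity(adj, 0, 4)   # clique member and hub
```

Vertices 1 and 2 share their whole neighborhood, while the hub shares little with the clique member it touches, which matches the behavior described above.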

Graph Mining
[Figure: taxonomy of graph mining]
- Branches: Frequent Subgraph Mining (FSM), Variant Subgraph Pattern Mining, Applications of Frequent Subgraph Mining
- FSM categories: Apriori based, Pattern Growth based, Approximate methods
- Variant categories: Closed Subgraph mining, Coherent Subgraph mining, Dense Subgraph Mining
- Application categories: Indexing and Search, Classification, Clustering
- Algorithms and systems named in the figure: GraphGrep, Daylight, gIndex (Є Grafil), gSpan, MoFa, GASTON, FFSM, SPIN, CSA, CLAN, AGM, FSG, PATH, SUBDUE, GBI, Kernel Methods (Graph Kernels), CloseGraph, CloseCut, Splat, CODENSE