Data Mining: Principles and Algorithms
Mining Homogeneous Networks
Jiawei Han
Department of Computer Science, University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2013 Jiawei Han. All rights reserved.
Mining Homogeneous Networks
A Taxonomy of Link Mining Tasks
Graph Pattern Mining
Graph Classification
Graph Clustering
Summary
Graph, Graph, Everywhere
Example graphs: aspirin (a chemical compound), a yeast protein interaction network (H. Jeong et al., Nature 411, 41, 2001), the Internet, and a co-author network.
A Taxonomy of Link Mining Tasks
Object-related tasks: link-based object ranking; link-based object classification; object clustering (group detection); object identification (entity resolution).
Link-related tasks: link prediction.
Graph-related tasks: subgraph discovery; graph classification; generative models for graphs.
Link-Based Object Ranking (LBR)
Exploit the link structure of a graph to order or prioritize the set of objects within the graph. Focused on graphs with a single object type and a single link type. A primary focus of the link analysis community.
Web information analysis: PageRank and HITS are typical LBR approaches.
In social network analysis (SNA), LBR is a core analysis task. Objective: rank individuals in terms of "centrality" (degree centrality vs. eigenvector/power centrality). Objects may be ranked relative to one or more relevant objects in the graph, or ranked over time in dynamic graphs.
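PageRank in particular can be summarized in a few lines; below is a minimal power-iteration sketch in Python (NumPy, the 0.85 damping factor, and the toy 3-node graph are illustrative choices, not from the slides):

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-9, max_iter=100):
    """Power-iteration PageRank on a dense adjacency matrix (row i = out-links of i)."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    out_deg[out_deg == 0] = 1          # avoid division by zero; dangling nodes simply leak rank here
    transition = adj / out_deg         # row-stochastic transition matrix
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1 - damping) / n + damping * transition.T @ rank
        if np.abs(new_rank - rank).sum() < tol:
            break
        rank = new_rank
    return rank

# Tiny example: a 3-node directed graph
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
print(pagerank(A))
```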
Block-Level Link Analysis (Cai et al., 2004)
Most existing link analysis algorithms, e.g., PageRank and HITS, treat a web page as a single node in the web graph. However, in most cases a web page contains multiple semantics, so it should not be considered an atomic, homogeneous node. Instead, a web page is partitioned into blocks using a vision-based page segmentation algorithm, and page-to-block and block-to-page relationships are extracted, yielding block-level PageRank and block-level HITS.
Link-Based Object Classification (LBC)
Predict the category of an object based on its attributes, its links, and the attributes of linked objects.
Web: predict the category of a web page based on the words on the page, links between pages, anchor text, HTML tags, etc.
Citation: predict the topic of a paper based on word occurrence, citations, and co-citations.
Epidemics: predict disease type based on characteristics of the patients infected by the disease.
Communication: predict whether a communication contact is by email, phone call, or mail.
Challenges in Link-Based Classification
Labels of related objects tend to be correlated. Collective classification explores such correlations and jointly infers the categorical values associated with the objects in the graph. Ex.: classifying related news items in the Reuters data sets (Chak'98); simply incorporating words from neighboring documents is not helpful. Multi-relational classification is another solution for link-based classification.
Group Detection
Cluster the nodes of the graph into groups that share common characteristics. Web: identifying communities. Citation: identifying research communities.
Methods: hierarchical clustering; blockmodeling from SNA; spectral graph partitioning; stochastic blockmodeling; multi-relational clustering.
Entity Resolution
Predict when two objects are the same, based on their attributes and their links. Also known as deduplication, reference reconciliation, co-reference resolution, or object consolidation.
Applications: Web: predict when two sites are mirrors of each other. Citation: predict when two citations refer to the same paper. Epidemics: predict when two disease strains are the same. Biology: learn when two names refer to the same protein.
Entity Resolution Methods
Earlier work viewed this as a pair-wise resolution problem, resolved based on the similarity of the objects' attributes. It is important to also consider links: coauthor links in bibliographic data, hierarchical links between spatial references, co-occurrence links between name references in documents.
Use of links in resolution: collective entity resolution, where one resolution decision affects another if they are linked; propagating evidence over links in a dependency graph; probabilistic models that capture the interactions among different entity resolution decisions.
Link Prediction
Predict whether a link exists between two entities, based on attributes and other observed links.
Applications: Web: predict if there will be a link between two pages. Citation: predict if a paper will cite another paper. Epidemics: predict who a patient's contacts are.
Methods: often viewed as a binary classification problem; a local conditional probability model based on structural and attribute features. Difficulty: sparseness of existing links. Collective prediction, e.g., a Markov random field model.
Link Cardinality Estimation
Predicting the number of links to an object. Web: predict the authority of a page from its number of in-links, and identify hubs from the number of out-links. Citation: predict the impact of a paper from its number of citations. Epidemics: predict the number of people who will be infected from the infectiousness of a disease.
Predicting the number of objects reached along a path from an object. Web: predict the number of pages retrieved by crawling a site. Citation: predict the number of citations of a particular author in a specific journal.
Subgraph Discovery
Find characteristic subgraphs; the focus of graph-based data mining.
Applications: biology: protein structure discovery; communications: legitimate vs. illegitimate groups; chemistry: chemical substructure discovery.
Methods: subgraph pattern mining; graph classification based on subgraph pattern analysis.
Information Diffusion, Evolution, and Contagion
Network structure and information diffusion: how information propagates across a network.
Contagion and opinion formation: how opinions form based on network structure.
Evolution of network structures: community formation, merging, splitting, fading, and disappearance.
Metadata Mining
Schema mapping, schema discovery, and schema reformulation. Citation: matching between two bibliographic sources. Web: discovering schema from unstructured or semi-structured data. Bio: mapping between two medical ontologies.
Link Mining Challenges
Logical vs. statistical dependencies.
Feature construction: aggregation vs. selection.
Instances vs. classes.
Collective classification and collective consolidation.
Effective use of labeled and unlabeled data.
Link prediction.
Closed-world vs. open-world assumptions.
These challenges are common to any link-based statistical model (Bayesian logic programs, conditional random fields, probabilistic relational models, relational Markov networks, relational probability trees, and stochastic logic programming, to name a few).
Mining Homogeneous Networks
A Taxonomy of Link Mining Tasks
Graph Pattern Mining
Graph Classification
Graph Clustering
Summary
Why Graph Pattern Mining?
Graphs are ubiquitous: chemical compounds (cheminformatics); protein structures and biological pathways/networks (bioinformatics); program control flow, traffic flow, and workflow analysis; XML databases, the Web, and social network analysis.
Graphs are a general model: trees, lattices, sequences, and items are degenerate graphs.
Diversity of graphs: directed vs. undirected, labeled vs. unlabeled (edges and vertices), weighted, with angles and geometry (topological vs. 2-D/3-D).
Complexity of algorithms: many problems are of high complexity.
Graph Pattern Mining
Frequent subgraphs: a (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold.
Applications of graph pattern mining: mining biochemical structures; program control flow analysis; mining XML structures or Web communities; building blocks for graph classification, clustering, compression, comparison, and correlation analysis.
Example: Frequent Subgraphs
[Figure: a graph dataset of three graphs (A), (B), (C) and the frequent patterns (1), (2) mined with minimum support 2.]
Example (II)
[Figure: a second graph dataset and its frequent patterns with minimum support 2.]
Graph Mining Algorithms
Incomplete beam search, greedy (SUBDUE).
Inductive logic programming (WARMR).
Graph-theory-based approaches: the Apriori-based approach and the pattern-growth approach.
SUBDUE (Holder et al., KDD'94)
Start with single vertices and expand the best substructures with a new edge, limiting the number of best substructures kept. Substructures are evaluated by their ability to compress the input graphs, using minimum description length (DL): the best substructure S in graph G minimizes DL(S) + DL(G\S). Terminate when no new substructure is discovered.
WARMR (Dehaspe et al., KDD'98)
Graphs are represented by Datalog facts: atomel(C, A1, c), bond(C, A1, A2, BT), atomel(C, A2, c) describes a carbon atom bonded to a carbon atom with bond type BT. WARMR, the first general-purpose ILP system, performs a level-wise search that simulates Apriori for frequent pattern discovery.
Frequent Subgraph Mining Approaches
Apriori-based approach: AGM/AcGM (Inokuchi et al., PKDD'00); FSG (Kuramochi and Karypis, ICDM'01); PATH# (Vanetik and Gudes, ICDM'02, ICDM'04); FFSM (Huan et al., ICDM'03).
Pattern-growth approach: MoFa (Borgelt and Berthold, ICDM'02); gSpan (Yan and Han, ICDM'02); Gaston (Nijssen and Kok, KDD'04).
Properties of Graph Mining Algorithms
Search order: breadth-first vs. depth-first.
Generation of candidate subgraphs: Apriori (join) vs. pattern growth.
Elimination of duplicate subgraphs: passive vs. active.
Support calculation: store embeddings or not.
Discovery order of patterns: paths, then trees, then graphs.
Apriori-Based Approach
[Figure: two frequent k-edge graphs G' and G'' are joined to generate a (k+1)-edge candidate G, level by level.]
Apriori-Based, Breadth-First Search
AGM (Inokuchi et al., PKDD'00) generates new graphs with one more node; methodology: breadth-first search, joining two graphs.
FSG (Kuramochi and Karypis, ICDM'01) generates new graphs with one more edge.
PATH# (Vanetik and Gudes, ICDM'02, '04)
An Apriori-based approach whose building blocks are edge-disjoint paths (e.g., a graph decomposed into 3 edge-disjoint paths). Construct frequent paths; construct frequent graphs with 2 edge-disjoint paths; construct graphs with k+1 edge-disjoint paths from graphs with k edge-disjoint paths; repeat.
FFSM (Huan et al., ICDM'03)
Represent graphs using a canonical adjacency matrix (CAM). Join two CAMs, or extend a CAM, to generate a new graph. Store the embeddings of CAMs (all embeddings of a pattern in the database), from which the embeddings of newly generated CAMs can be derived.
Pattern Growth Method
[Figure: a k-edge graph G is extended to (k+1)-edge and (k+2)-edge graphs; the same graph may be generated along different growth paths, producing duplicate graphs.]
MoFa (Borgelt and Berthold, ICDM'02)
Extend graphs by adding a new edge. Store the embeddings of discovered frequent graphs for fast support calculation (embedding lists are also used in later algorithms such as FFSM and Gaston), at the cost of expensive memory usage. Apply local structural pruning.
gSpan (Yan and Han, ICDM'02)
Right-most extension. Theorem (completeness): the enumeration of graphs using right-most extension is complete.
DFS Code
Flatten a graph into a sequence using depth-first search. Example (a 5-vertex graph, vertices numbered 0-4 in discovery order): e0: (0,1), e1: (1,2), e2: (2,0), e3: (2,3), e4: (3,1), e5: (2,4).
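A small sketch of this flattening for unlabeled graphs (networkx is assumed; real gSpan DFS codes also carry vertex and edge labels and take the lexicographic minimum over all starting points):

```python
import networkx as nx

def dfs_code(graph, start):
    """Flatten a graph into an edge sequence via DFS: forward edges discover
    new vertices, backward edges close cycles. A simplified, unlabeled DFS code."""
    order = {start: 0}                      # discovery order of each vertex
    code, visited_edges = [], set()
    def visit(u):
        for v in graph.neighbors(u):
            e = frozenset((u, v))
            if e in visited_edges:
                continue
            visited_edges.add(e)
            if v not in order:              # forward edge: discovers v
                order[v] = len(order)
                code.append((order[u], order[v]))
                visit(v)
            else:                           # backward edge: closes a cycle
                code.append((order[u], order[v]))
    visit(start)
    return code

G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3), (3, 1), (2, 4)])
print(dfs_code(G, 0))  # e.g. [(0,1), (1,2), (2,0), (2,3), (3,1), (2,4)]; order may vary
```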
DFS Lexicographic Order
Let Z be the set of DFS codes of all graphs. Two DFS codes a = (x_0, x_1, …, x_m) and b = (y_0, y_1, …, y_n) satisfy a <= b (DFS lexicographic order in Z) if and only if one of the following conditions holds:
(i) there exists t, 0 <= t <= min(m, n), such that x_k = y_k for all k < t and x_t < y_t; or
(ii) x_k = y_k for all k with 0 <= k <= m, and m <= n.
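Since condition (ii) just says that a prefix precedes its extensions, the relation is an ordinary prefix-aware lexicographic comparison; a minimal sketch, assuming each code element is already encoded as a comparable tuple:

```python
def dfs_code_leq(a, b):
    """DFS lexicographic order a <= b, per conditions (i)/(ii) above."""
    for x, y in zip(a, b):
        if x < y:
            return True      # condition (i): first differing element decides
        if x > y:
            return False
    return len(a) <= len(b)  # condition (ii): a is a prefix of b
```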
DFS Code Extension
Let a be the minimum DFS code of a graph G and b a non-minimum DFS code of G. For any DFS code d generated from b by one right-most extension: (i) d is not a minimum DFS code; (ii) min_dfs(d) cannot be extended from b; and (iii) min_dfs(d) is either less than a or can be extended from a.
Theorem (right-extension): the DFS code of a graph extended from a non-minimum DFS code is not minimum.
Gaston (Nijssen and Kok, KDD'04)
Extend graphs directly and store embeddings. Separate the discovery of different types of graphs: paths, then trees, then general graphs. Simple structures are easier to mine, and duplicate detection for them is much simpler.
Graph Pattern Explosion Problem
If a graph is frequent, all of its subgraphs are frequent (the Apriori property), so an n-edge frequent graph may have 2^n frequent subgraphs. Among 422 chemical compounds confirmed active in an AIDS antiviral screen dataset, there are 1,000,000 frequent graph patterns at a minimum support of 5%.
Closed Frequent Graphs
Motivation: handle the graph pattern explosion problem.
A frequent graph G is closed if there exists no supergraph of G that carries the same support as G. If some of G's subgraphs have the same support as G, it is unnecessary to output those subgraphs (the non-closed graphs). This is lossless compression: the mining result is still complete.
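For intuition, here is a naive post-filtering sketch that derives the closed set from an already-mined frequent set (CloseGraph itself avoids this cost by pruning during mining; networkx's subgraph-isomorphism matcher stands in for a real containment test, and patterns are assumed pairwise non-isomorphic):

```python
from networkx.algorithms import isomorphism

def closed_patterns(frequent):
    """Keep pattern g unless some other frequent pattern g2 contains g with
    equal support. `frequent` maps each pattern graph to its support count."""
    closed = []
    for g, sup in frequent.items():
        dominated = any(
            g is not g2 and sup == sup2 and
            isomorphism.GraphMatcher(g2, g).subgraph_is_isomorphic()
            for g2, sup2 in frequent.items()
        )
        if not dominated:
            closed.append(g)
    return closed
```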
CloseGraph (Yan & Han, KDD'03): A Pattern-Growth Approach
Under what condition can we stop searching a pattern's children, i.e., terminate early? Suppose G and G' are frequent and G is a subgraph of G'. If G' also occurs in every part of every dataset graph where G occurs, then we need not grow G, since none of G's children will be closed except those that are also children of G'.
Handling Tricky Exception Cases
[Figure: two example graphs and two patterns over vertices a, b, c, d illustrating an exception case for the early-termination condition.]
Experimental Result
The AIDS antiviral screen compound dataset from NCI/NIH contains 43,905 chemical compounds. Among them, 423 belong to class CA (confirmed active), 1,081 to CM (confirmed moderately active), and the remaining to CI (confirmed inactive).
Discovered Patterns: [Figure: example patterns discovered at minimum support 20%, 10%, and 5%.]
Performance (1), Run Time: [Figure: run time per pattern (msec) vs. minimum support (%).]
Performance (2), Memory Usage: [Figure: memory usage (GB) vs. minimum support (%).]
Number of Patterns, Frequent vs. Closed: [Figure: pattern counts on the CA dataset vs. minimum support.]
Runtime, Frequent vs. Closed: [Figure: run time (sec) on the CA dataset vs. minimum support.]
Do the Odds Beat the Curse of Complexity?
Potentially exponential number of frequent patterns: the worst-case complexity vs. the expected probability. Ex.: suppose Walmart has 10^4 kinds of products. The chance of picking one particular product is 10^-4, and the chance of picking a particular set of 10 products is 10^-40. What is the chance that this particular set of 10 products is frequent, appearing 10^3 times in 10^9 transactions?
Have we solved the NP-hard problem of subgraph isomorphism testing? No. But real graphs in biology and chemistry are not so bad: a carbon atom has only 4 bonds, and most proteins in a network have distinct labels.
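The back-of-the-envelope arithmetic behind the example (a Poisson tail bound; the calculation is my own illustration of the slide's point, not from the slides):

```python
from math import lgamma, log

n_trans, p_set, k = 10**9, 10**-40, 10**3
lam = n_trans * p_set                 # expected occurrences: 1e9 * 1e-40 = 1e-31
# For X ~ Poisson(lam) with k >> lam, P(X >= k) <= lam^k / k!; take log10:
log10_bound = (k * log(lam) - lgamma(k + 1)) / log(10)
print(f"expected count = {lam:.0e}, log10 P(X >= {k}) <= {log10_bound:.0f}")
# expected count = 1e-31, log10 P(X >= 1000) <= about -33568: astronomically unlikely
```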
Graph Search
Querying graph databases: given a graph database and a query graph, find all graphs containing the query graph.
Scalability Issue
A sequential scan incurs disk I/Os and a subgraph isomorphism test per graph, so an indexing mechanism is needed. Existing systems: DayLight (daylight.com, commercial); GraphGrep (Dennis Shasha et al., PODS'02); Grace (Srinath Srinivasa et al., ICDE'03).
Indexing Strategy
Key property: if graph G contains query graph Q, then G contains every substructure of Q.
Remark: index substructures, so that graphs lacking some substructure of the query can be pruned.
Indexing Framework
Two steps in processing graph queries (a sketch of both follows).
Step 1, index construction: enumerate structures in the graph database and build an inverted index between structures and graphs.
Step 2, query processing: enumerate structures in the query graph; compute the candidate graphs containing all of these structures; prune false positives by performing subgraph isomorphism tests.
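A minimal sketch of the two-step framework (the features() enumerator is a hypothetical stand-in for the index structures, paths in GraphGrep or discriminative frequent subgraphs in gIndex; networkx performs the verification):

```python
from collections import defaultdict
from networkx.algorithms import isomorphism

def build_index(database, features):
    """Step 1: inverted index mapping each structure to ids of graphs containing it."""
    index = defaultdict(set)
    for gid, g in enumerate(database):
        for f in features(g):
            index[f].add(gid)
    return index

def query(q, database, index, features):
    """Step 2: intersect posting lists to get candidates C_q, then verify each."""
    candidates = set(range(len(database)))
    for f in features(q):
        if f in index:                 # conservatively skip features never indexed
            candidates &= index[f]
    return [gid for gid in candidates
            if isomorphism.GraphMatcher(database[gid], q).subgraph_is_isomorphic()]
```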
Cost Analysis
Query response time = T_index (fetching the index) + |C_q| x T_iso_test (verifying each of the |C_q| candidates).
Remark: make |C_q|, the number of candidates, as small as possible.
Path-Based Approach
For the graph database of graphs (a), (b), (c), enumerate paths:
0-length: C, O, N, S
1-length: C-C, C-O, C-N, C-S, N-N, S-O
2-length: C-C-C, C-O-C, C-N-C, ...
3-length: ...
Build an inverted index between paths and graphs.
Path-Based Approach (cont.)
For the query graph: 0-edge: S_C = {a, b, c}, S_N = {a, b, c}; 1-edge: S_C-C = {a, b, c}, S_C-N = {a, b, c}; 2-edge: S_C-N-C = {a, b}; ...
Intersecting these sets yields the candidate answers, graphs (a) and (b), which may contain the query graph.
Problems with the Path-Based Approach
In the example database, only graph (c) contains the query graph; however, if we index only the paths C, C-C, C-C-C, and C-C-C-C, we cannot prune graphs (a) and (b).
gIndex: Indexing Graphs by Data Mining
Our methodology for graph indexing: identify frequent structures in the database (subgraphs that appear often in the graph database); prune redundant frequent structures to maintain a small set of discriminative structures; create an inverted index between the discriminative frequent structures and the graphs in the database.
IDEAS: Indexing with Two Constraints
From all structures (>10^6), keep the frequent structures (~10^5), and from those the discriminative structures (~10^3).
Why Discriminative Subgraphs?
All graphs in the sample database (a), (b), (c) contain the structures C, C-C, and C-C-C, so why bother indexing these redundant frequent structures? Only index structures that provide more information than the structures already indexed.
Discriminative Structures
Pinpoint the most useful frequent structures: given the set of already-selected structures f_1, ..., f_n and a new structure x, measure the extra indexing power x provides, e.g., via the conditional probability Pr(x | f_1, ..., f_n) that a graph containing the selected structures also contains x. When this measure is small enough, x is a discriminative structure and should be included in the index.
Indexing only discriminative frequent structures reduces the index size by an order of magnitude.
Why Frequent Structures?
We cannot index (or even search) all substructures, and large structures will likely be indexed well by their substructures. Hence a size-increasing support threshold is used: [Figure: the minimum support required for a structure to be indexed grows with its size.] A sketch of such a threshold function follows.
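As referenced above, a sketch of one size-increasing support threshold function (the linear shape is one of the options discussed for gIndex; the parameter values here are illustrative):

```python
def size_increasing_support(size, theta=0.1, max_size=10):
    """Support threshold psi(size): near zero for tiny structures, rising
    linearly to theta (as a fraction of the database) at max_size edges."""
    return theta * min(size, max_size) / max_size

# e.g. with a 10% ceiling and maximum indexed size of 10 edges:
for s in (1, 5, 10):
    print(s, size_increasing_support(s))   # 0.01, 0.05, 0.1
```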
Experimental Setting
The AIDS antiviral screen compound dataset from NCI/NIH, containing 43,905 chemical compounds; query graphs are randomly extracted from the dataset. GraphGrep: maximum path length set at 10 edges. gIndex: maximum structure size set at 10 edges.
Experiments, Index Size: [Figure: number of index features vs. database size.]
Experiments, Answer Set Size: [Figure: number of candidates vs. query size.]
Experiments: Incremental Maintenance
Frequent structures are stable under database updates, so the index can be built from a small portion of a graph database yet used for the whole database.
Mining Homogeneous Networks
A Taxonomy of Link Mining Tasks
Graph Pattern Mining
Graph Classification
Graph Clustering
Summary
Substructure-Based Graph Classification
Data: graphs with class labels, e.g., chemical compounds, software behavior graphs, social networks.
Basic idea: extract graph substructures g_1, ..., g_n; represent each graph by a feature vector x = (x_1, ..., x_n), where x_i is the frequency (or presence) of substructure g_i in that graph; build a classification model on these vectors (a sketch follows the feature list below).
Different features and representative work: fingerprints; MACCS keys; tree and cyclic patterns [Horvath et al., KDD'04]; minimal contrast subgraphs [Ting and Bailey, SDM'06]; frequent subgraphs [Deshpande et al., TKDE'05; Liu et al., SDM'05]; graph fragments [Wale and Karypis, ICDM'06].
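The sketch promised above (pattern mining is abstracted behind a mined_patterns list; scikit-learn's SVC is one possible classifier, and the binary-occurrence encoding is an illustrative choice):

```python
import numpy as np
from sklearn.svm import SVC
from networkx.algorithms import isomorphism

def to_feature_vector(g, mined_patterns):
    """x_i = 1 if pattern g_i occurs in graph g (counts would also work)."""
    return np.array([
        1.0 if isomorphism.GraphMatcher(g, p).subgraph_is_isomorphic() else 0.0
        for p in mined_patterns
    ])

def train(graphs, labels, mined_patterns):
    X = np.stack([to_feature_vector(g, mined_patterns) for g in graphs])
    clf = SVC(kernel="linear")   # class_weight can offset a skewed class distribution
    clf.fit(X, labels)
    return clf
```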
Fingerprints (fp-n)
Enumerate all paths up to length l, plus certain cycles, in a chemical compound, then hash each feature to one or more positions in a fixed-length bit-vector of n bits. (Courtesy of Nikil Wale)
MACCS Keys (MK)
A domain expert identifies fragments "important" for bioactivity; each fragment forms a fixed dimension in the descriptor space. (Courtesy of Nikil Wale)
Cycles and Trees (CT) [Horvath et al., KDD'04]
Identify the bi-connected components of a chemical compound (bounded cyclicity: a fixed number of cycles, captured via bi-connected components); delete the bi-connected components; the left-over trees become the remaining features. (Courtesy of Nikil Wale)
Frequent Subgraphs (FS) [Deshpande et al., TKDE'05]
Run frequent subgraph discovery over the chemical compounds with a minimum support; the discovered subgraphs capture topological features via the graph representation. Class-conditional supports matter: a subgraph with support 40% in positives and 0% in negatives is far more useful than one with 1% in positives and 30% in negatives. (Courtesy of Nikil Wale)
Graph Fragments (GF) [Wale & Karypis, ICDM'06]
Tree Fragments (TF): at least one node of the fragment has degree greater than 2; no cycles.
Path Fragments (PF): all nodes have degree at most 2; no cycles.
Acyclic Fragments (AF) = TF ∪ PF; acyclic fragments are also termed free trees. (Courtesy of Nikil Wale)
Comparison of Different Features [Wale & Karypis, ICDM'06]
[Table: comparison of the descriptor spaces above (coverage, precision, complexity of generation).]
Minimal Contrast Subgraphs [Ting and Bailey, SDM'06]
A contrast graph is a subgraph appearing in one class of graphs and never in another class. It is minimal if none of its subgraphs is a contrast, and it may be disconnected. Contrast graphs allow a succinct description of differences but require a larger search space.
Mining contrast subgraphs: find the maximal common edge sets (these may be disconnected); apply a minimal hypergraph transversal operation to derive the minimal contrast edge sets from the maximal common edge sets; compute the minimal contrast vertex sets separately and then take the minimal union with the minimal contrast edge sets. (Courtesy of Bailey and Dong)
Frequent Subgraph-Based Classification [Deshpande et al., TKDE'05]
Frequent subgraphs: a graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold.
Feature generation: frequent topological subgraphs by FSG; frequent geometric subgraphs with 3-D shape information.
Feature selection: sequential covering paradigm.
Classification: use an SVM to learn a classifier on the feature vectors, assigning different misclassification costs to different classes to address the skewed class distribution.
Varying Minimum Support: [Figure: classification performance as the minimum support threshold varies.]
Varying Misclassification Cost: [Figure: classification performance as the misclassification cost varies.]
Graph Fragments [Wale and Karypis, ICDM'06]
All graph substructures up to a given length (size, or number of bonds).
Determined dynamically: dataset-dependent descriptor space.
Complete coverage: descriptors for every compound.
Precise representation: one-to-one mapping.
Complex fragments: arbitrary topology.
A recurrence relation generates the graph fragments of length l. (Courtesy of Nikil Wale)
Performance Comparison
[Figure: classification performance of the descriptor spaces compared above (from the ICDM'08 tutorial).]
Re-examination of Pattern-Based Classification
Positive and negative training instances are mapped into a feature space via pattern-based feature construction, a model is learned, and test instances are transformed the same way for prediction. The feature-space transformation step is computationally expensive.
Computational Bottleneck
The two-step approach first mines the whole set of frequent patterns (typically 10^4 to 10^6 of them) and then filters for discriminative patterns; both steps are expensive. Direct mining instead transforms the data (e.g., into an FP-tree) and mines discriminative patterns directly, which is efficient.
Challenge: Non-Anti-Monotonic Measures
Discriminative scores are neither anti-monotonic nor monotonic over the pattern lattice, so frequency-style pruning does not directly apply. Must we enumerate all subgraphs, from small size to large size, and then check their scores?
Direct Mining of Discriminative Patterns
Avoid mining the whole set of patterns: Harmony [Wang and Karypis, SDM'05]; DDPMine [Cheng et al., ICDE'08]; LEAP [Yan et al., SIGMOD'08]; MbT [Fan et al., KDD'08].
Find the most discriminative pattern: a search problem? an optimization problem?
Extensions: mining top-k discriminative patterns; mining approximate/weighted discriminative patterns.
Mining the Most Significant Graph with Leap Search [Yan et al., SIGMOD'08]
[Figure: objective functions scoring a pattern by its frequencies in the positive and negative sets, e.g., G-test score or information gain.]
Upper Bound
[Figure: an upper bound on the objective function, derived from a pattern's positive and negative frequencies.]
Upper Bound: Anti-Monotonic
Rule of thumb: if the difference between a graph pattern's frequency in the positive dataset and in the negative dataset increases, the pattern becomes more interesting. We can recycle existing graph mining algorithms to accommodate non-monotonic functions.
Structural Similarity
Sibling patterns in the search tree (e.g., size-4, size-5, and size-6 graphs differing by one extension) tend to be structurally similar, and structural similarity implies similarity in significance scores.
Structural Leap Search
The search mines one part of the tree and leaps over similar sibling subtrees: given a discovered graph g and a sibling g' of g, leap over the g' subtree if the frequency difference between g and g' (in both the positive and negative sets) is within σ, the leap length, a tolerance on structure/frequency dissimilarity.
Frequency Association
There is an association between a pattern's frequency and its objective score, so start with a high frequency threshold and gradually decrease it.
LEAP Algorithm
1. Structural leap search with a frequency threshold.
2. Support-descending mining, repeated until F(g*) converges.
3. Branch-and-bound search with bound F(g*).
Branch-and-Bound vs. LEAP
                    Branch-and-Bound                    LEAP
Pruning base        parent-child bound ("vertical"),    sibling similarity ("horizontal"),
                    strict pruning                      approximate pruning
Feature optimality  guaranteed                          near-optimal
Efficiency          good                                better
NCI Anti-Cancer Screen Datasets (data description)
Name      Assay ID   Size     Tumor Description
MCF-7     83         27,770   Breast
MOLT-4    123        39,765   Leukemia
NCI-H23   1          40,353   Non-Small Cell Lung
OVCAR-8   109        40,516   Ovarian
P388      330        41,472   Leukemia
PC-3      41         27,509   Prostate
SF-295    47         40,271   Central Nervous System
SN12C     145        40,004   Renal
SW-620    81         40,532   Colon
UACC257   33         39,988   Melanoma
YEAST     167        79,601   Yeast anti-cancer
Efficiency Tests
[Figures: search efficiency (runtime) and search quality (G-test score) of LEAP on the NCI datasets.]
Mining Quality: Graph Classification (AUC)
Name      OA Kernel*   LEAP    OA Kernel (6x)   LEAP (6x)
MCF-7     0.68         0.67    0.75             0.76
MOLT-4    0.65         0.66    0.69             0.72
NCI-H23   0.79         0.76    0.77             0.79
OVCAR-8   0.67         0.72    0.79             0.78
P388      0.79         0.82    –                0.81
PC-3      0.66         0.69    0.79             0.76
Average   0.70         0.72    0.75             0.77
* OA Kernel: Optimal Assignment Kernel [Frohlich et al., ICML'05]; LEAP: LEAP search. In runtime, OA Kernel suffers a scalability problem.
Mining Homogeneous Networks
A Taxonomy of Link Mining Tasks
Graph Pattern Mining
Graph Classification
Graph Clustering
Summary
Graph Compression
Extract common subgraphs and simplify graphs by condensing these subgraphs into nodes.
Graph/Network Clustering Problem
X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: A Structural Clustering Algorithm for Networks", Proc. 2007 ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases (KDD'07), San Jose, CA, Aug. 2007.
Networks built from the mutual relationships of data elements usually have an underlying structure, but because the relationships are complex, it is difficult to discover. How can the structure be made clear? Given only information about who associates with whom, can one identify clusters of individuals with common interests or special relationships (families, cliques, terrorist cells)?
Graph Clustering: Sparsest Cut
Given G = (V, E) and a cut (S, T), a partition of V, the cut set is {(u, v) ∈ E | u ∈ S, v ∈ T}, and the size of the cut is the number of edges in the cut set. A min-cut (e.g., C1) is not necessarily a good partition.
A better measure is sparsity: Φ(S, T) = (cut size) / min(|S|, |T|). A cut is sparsest if its sparsity is not greater than that of any other cut. Ex.: the cut C2 = ({a, b, c, d, e, f, l}, {g, h, i, j, k}) is the sparsest cut.
For k clusters, the modularity of a clustering assesses its quality:
Q = Σ_{i=1..k} [ l_i / |E| − (d_i / (2|E|))² ]
where l_i is the number of edges between vertices in the i-th cluster and d_i is the sum of the degrees of the vertices in the i-th cluster. Modularity is the difference between the fraction of all edges that fall within individual clusters and the fraction that would do so if the graph vertices were randomly connected; the optimal clustering of a graph maximizes modularity.
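The modularity formula transcribes directly into code; a minimal sketch (networkx and its bundled karate-club graph are assumptions used only for the demo):

```python
import networkx as nx

def modularity(graph, clusters):
    """Q = sum_i [ l_i/|E| - (d_i/(2|E|))^2 ], where l_i counts intra-cluster
    edges and d_i sums the degrees of the i-th cluster's members."""
    m = graph.number_of_edges()
    q = 0.0
    for nodes in clusters:
        l_i = graph.subgraph(nodes).number_of_edges()
        d_i = sum(deg for _, deg in graph.degree(nodes))
        q += l_i / m - (d_i / (2 * m)) ** 2
    return q

G = nx.karate_club_graph()   # classic 34-node test network bundled with networkx
part = [{n for n in G if n < 17}, {n for n in G if n >= 17}]
print(modularity(G, part))
```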
Graph Clustering: Challenges of Finding Good Cuts
High computational cost: many graph-cut problems are computationally expensive (the sparsest cut problem is NP-hard), so one must trade off efficiency/scalability against quality.
Sophisticated graphs: may involve weights and/or cycles.
High dimensionality: a graph can have many vertices; in a similarity matrix, a vertex is represented as a vector (a row of the matrix) whose dimensionality is the number of vertices in the graph.
Sparsity: a large graph is often sparse, with each vertex on average connecting to only a small number of other vertices; a similarity matrix from a large sparse graph can also be sparse.
Two Approaches for Graph Clustering
(1) Use generic clustering methods for high-dimensional data: extract a similarity matrix from the graph using a similarity measure, then apply a generic clustering method to the similarity matrix to discover clusters. Ex.: spectral clustering, which approximates optimal graph-cut solutions.
(2) Methods designed specifically for graphs: search the graph for well-connected components as clusters. Ex.: SCAN (Structural Clustering Algorithm for Networks); X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: A Structural Clustering Algorithm for Networks", KDD'07.
An Example of Networks
How many clusters? What sizes should they be? What is the best partitioning? Should some points be segregated?
A Social Network Model
Individuals in a tight social group, or clique, know many of the same people, regardless of the size of the group. Individuals who are hubs know many people in different groups but belong to no single group; politicians, for example, bridge multiple groups. Individuals who are outliers reside at the margins of society; hermits, for example, know few people and belong to no group.
The Neighborhood of a Vertex
Define Γ(v) = {w | (v, w) ∈ E} ∪ {v} as the immediate neighborhood of a vertex v (i.e., the set of people an individual knows, plus the individual).
Structural Similarity
The desired features tend to be captured by a measure we call structural similarity:
σ(v, w) = |Γ(v) ∩ Γ(w)| / sqrt(|Γ(v)| · |Γ(w)|)
Structural similarity is large for members of a clique and small for hubs and outliers.
Structural Connectivity [1]
ε-neighborhood: N_ε(v) = {w ∈ Γ(v) | σ(v, w) ≥ ε}
Core: CORE_{ε,μ}(v) holds iff |N_ε(v)| ≥ μ
Directly structure reachable: DirReach_{ε,μ}(v, w) holds iff CORE_{ε,μ}(v) and w ∈ N_ε(v)
Structure reachable: the transitive closure of direct structure reachability
Structure connected: v and w are structure connected if there exists u such that both v and w are structure reachable from u
[1] M. Ester, H. P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases", KDD'96.
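A minimal sketch of these definitions (networkx assumed); this is the vocabulary the SCAN walk-through below operates on:

```python
import math
import networkx as nx

def gamma(g, v):
    """Immediate neighborhood including v itself, as SCAN defines it."""
    return set(g.neighbors(v)) | {v}

def sigma(g, v, w):
    """Structural similarity: |Gamma(v) & Gamma(w)| / sqrt(|Gamma(v)| * |Gamma(w)|)."""
    gv, gw = gamma(g, v), gamma(g, w)
    return len(gv & gw) / math.sqrt(len(gv) * len(gw))

def eps_neighborhood(g, v, eps):
    return {w for w in gamma(g, v) if sigma(g, v, w) >= eps}

def is_core(g, v, eps, mu):
    return len(eps_neighborhood(g, v, eps)) >= mu
```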
Structure-Connected Clusters
A structure-connected cluster C satisfies:
Connectivity: any two vertices in C are structure connected.
Maximality: if v ∈ C and w is structure reachable from v, then w ∈ C.
Hubs: belong to no cluster but bridge many clusters.
Outliers: belong to no cluster and connect to fewer clusters.
Algorithm (ε = 0.7, μ = 2)
[Figure sequence: a step-by-step run of the SCAN algorithm on a 14-vertex example network with ε = 0.7 and μ = 2. The algorithm visits each unclassified vertex, computes structural similarities to its neighbors (values such as 0.63, 0.75, 0.67, 0.82, 0.73, 0.51, and 0.68 appear in the animation), expands clusters from core vertices, and finally labels the remaining vertices as hubs or outliers.]
Running Time
Running time is O(|E|); for sparse networks, O(|V|).
[2] A. Clauset, M. E. J. Newman, and C. Moore, Phys. Rev. E 70, 066111 (2004).
Summary: Mining Homogeneous Networks
A Taxonomy of Link Mining Tasks
Graph Pattern Mining
Graph Classification
Graph Clustering
References: Graph Pattern Mining (1)
T. Asai, et al., "Efficient substructure discovery from large semi-structured data", SDM'02
C. Borgelt and M. R. Berthold, "Mining molecular fragments: Finding relevant substructures of molecules", ICDM'02
M. Deshpande, M. Kuramochi, and G. Karypis, "Frequent sub-structure based approaches for classifying chemical compounds", ICDM'03
M. Deshpande, M. Kuramochi, and G. Karypis, "Automated approaches for classifying structures", BIOKDD'02
L. Dehaspe, H. Toivonen, and R. King, "Finding frequent substructures in chemical compounds", KDD'98
C. Faloutsos, K. McCurley, and A. Tomkins, "Fast discovery of connection subgraphs", KDD'04
L. Holder, D. Cook, and S. Djoko, "Substructure discovery in the SUBDUE system", KDD'94
J. Huan, W. Wang, D. Bandyopadhyay, J. Snoeyink, J. Prins, and A. Tropsha, "Mining spatial motifs from protein structure graphs", RECOMB'04
J. Huan, W. Wang, and J. Prins, "Efficient mining of frequent subgraphs in the presence of isomorphism", ICDM'03
H. Hu, X. Yan, Y. Huang, J. Han, and X. J. Zhou, "Mining coherent dense subgraphs across massive biological networks for functional discovery", ISMB'05
A. Inokuchi, T. Washio, and H. Motoda, "An Apriori-based algorithm for mining frequent substructures from graph data", PKDD'00
C. James, D. Weininger, and J. Delany, "Daylight Theory Manual, Daylight Version 4.82", Daylight Chemical Information Systems, Inc., 2003
G. Jeh and J. Widom, "Mining the space of graph properties", KDD'04
M. Koyuturk, A. Grama, and W. Szpankowski, "An efficient algorithm for detecting frequent subgraphs in biological networks", Bioinformatics, 20:I200-I207, 2004
References: Graph Pattern Mining (2)
M. Kuramochi and G. Karypis, "Frequent subgraph discovery", ICDM'01
M. Kuramochi and G. Karypis, "GREW: A scalable frequent subgraph discovery algorithm", ICDM'04
B. McKay, "Practical graph isomorphism", Congressus Numerantium, 30:45-87, 1981
S. Nijssen and J. Kok, "A quickstart in frequent structure mining can make a difference", KDD'04
J. Prins, J. Yang, J. Huan, and W. Wang, "SPIN: Mining maximal frequent subgraphs from graph databases", KDD'04
D. Shasha, J. T.-L. Wang, and R. Giugno, "Algorithmics and applications of tree and graph searching", PODS'02
J. R. Ullmann, "An algorithm for subgraph isomorphism", J. ACM, 23:31-42, 1976
N. Vanetik, E. Gudes, and S. E. Shimony, "Computing frequent graph patterns from semistructured data", ICDM'02
C. Wang, W. Wang, J. Pei, Y. Zhu, and B. Shi, "Scalable mining of large disk-based graph databases", KDD'04
T. Washio and H. Motoda, "State of the art of graph-based data mining", SIGKDD Explorations, 5:59-68, 2003
X. Yan and J. Han, "gSpan: Graph-based substructure pattern mining", ICDM'02
X. Yan and J. Han, "CloseGraph: Mining closed frequent graph patterns", KDD'03
X. Yan, P. S. Yu, and J. Han, "Graph indexing: A frequent structure-based approach", SIGMOD'04
X. Yan, X. J. Zhou, and J. Han, "Mining closed relational graphs with connectivity constraints", KDD'05
X. Yan, P. S. Yu, and J. Han, "Substructure similarity search in graph databases", SIGMOD'05
X. Yan, F. Zhu, J. Han, and P. S. Yu, "Searching substructures with superimposed distance", ICDE'06
M. J. Zaki, "Efficiently mining frequent trees in a forest", KDD'02
P. Zhao and J. Han, "On graph query optimization in large networks", VLDB'10
References: Graph Classification (1)
G. Cong, K. Tan, A. Tung, and X. Xu, "Mining top-k covering rule groups for gene expression data", SIGMOD'05
M. Deshpande, M. Kuramochi, N. Wale, and G. Karypis, "Frequent substructure-based approaches for classifying chemical compounds", TKDE'05
G. Dong and J. Li, "Efficient mining of emerging patterns: Discovering trends and differences", KDD'99
G. Dong, X. Zhang, L. Wong, and J. Li, "CAEP: Classification by aggregating emerging patterns", DS'99
R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification (2nd ed.), John Wiley & Sons, 2001
W. Fan, K. Zhang, H. Cheng, J. Gao, X. Yan, J. Han, P. S. Yu, and O. Verscheure, "Direct mining of discriminative and essential graphical and itemset features via model-based search tree", KDD'08
D. Cai, Z. Shao, X. He, X. Yan, and J. Han, "Community mining from multi-relational networks", PKDD'05
H. Fröhlich, J. Wegner, F. Sieker, and A. Zell, "Optimal assignment kernels for attributed molecular graphs", ICML'05
T. Gärtner, P. Flach, and S. Wrobel, "On graph kernels: Hardness results and efficient alternatives", COLT/Kernel'03
H. Kashima, K. Tsuda, and A. Inokuchi, "Marginalized kernels between labeled graphs", ICML'03
T. Kudo, E. Maeda, and Y. Matsumoto, "An application of boosting to graph classification", NIPS'04
C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu, "Mining behavior graphs for 'backtrace' of noncrashing bugs", SDM'05
P. Mahé, N. Ueda, T. Akutsu, J. Perret, and J. Vert, "Extensions of marginalized graph kernels", ICML'04
References: Graph Classification (2)
T. Horvath, T. Gärtner, and S. Wrobel, "Cyclic pattern kernels for predictive graph mining", KDD'04
T. Kudo, E. Maeda, and Y. Matsumoto, "An application of boosting to graph classification", NIPS'04
W. Li, J. Han, and J. Pei, "CMAR: Accurate and efficient classification based on multiple class-association rules", ICDM'01
B. Liu, W. Hsu, and Y. Ma, "Integrating classification and association rule mining", KDD'98
H. Liu, J. Han, D. Xin, and Z. Shao, "Mining frequent patterns on very high dimensional data: A top-down row enumeration approach", SDM'06
S. Nijssen and J. Kok, "A quickstart in frequent structure mining can make a difference", KDD'04
F. Pan, G. Cong, A. Tung, J. Yang, and M. Zaki, "CARPENTER: Finding closed patterns in long biological datasets", KDD'03
F. Pan, A. Tung, G. Cong, and X. Xu, "COBBLER: Combining column and row enumeration for closed pattern discovery", SSDBM'04
Y. Sun, Y. Wang, and A. K. C. Wong, "Boosting an associative classifier", TKDE'06
P.-N. Tan, V. Kumar, and J. Srivastava, "Selecting the right interestingness measure for association patterns", KDD'02
R. Ting and J. Bailey, "Mining minimal contrast subgraph patterns", SDM'06
N. Wale and G. Karypis, "Comparison of descriptor spaces for chemical compound retrieval and classification", ICDM'06
H. Wang, W. Wang, J. Yang, and P. S. Yu, "Clustering by pattern similarity in large data sets", SIGMOD'02
J. Wang and G. Karypis, "HARMONY: Efficiently mining the best rules for classification", SDM'05
X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: A structural clustering algorithm for networks", KDD'07
X. Yan, H. Cheng, J. Han, and P. S. Yu, "Mining significant graph patterns by scalable leap search", SIGMOD'08