1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda
2 Graphs …
3 Graph Structures in Biology: DNA sequence (e.g., ACGC), RNA, compounds, and texts in literature (e.g., "Amitriptyline inhibits adenosine uptake") [figure: example graphs for each data type]
4 Substructure Representation: 0/1 vector of pattern indicators. Huge dimensionality! Need graph mining for selecting features. Patterns are better than paths (as used in marginalized graph kernels)
5 Overview Quick Review on Graph Mining EM-based Clustering algorithm Mixture model with L1 feature selection Graph Boosting Supervised Regression for QSAR Analysis Linear programming meets graph mining
6 Quick Review of Graph Mining
7 Graph Mining: analysis of graph databases. Find all patterns satisfying predetermined conditions. Frequent substructure mining: combinatorial, exhaustive. Recently developed methods: AGM (Inokuchi et al., 2000), gSpan (Yan and Han, 2002), Gaston (2004)
8 Graph Mining: Frequent Substructure Mining. Enumerate all patterns that occur in at least m graphs. Indicator of pattern k in graph i: 1 if the pattern occurs in the graph, 0 otherwise. Support(k): number of graphs in which pattern k occurs
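The support definition can be illustrated with a tiny indicator matrix. This is a minimal sketch: the matrix `X`, the threshold, and the function names are made up for illustration and are not taken from the slides.

```python
# Rows are graphs, columns are patterns; an entry of 1 means the pattern
# occurs in that graph. The matrix is made-up illustration data.
X = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
]

def support(X, k):
    """Number of graphs in which pattern k occurs."""
    return sum(row[k] for row in X)

def frequent_patterns(X, m):
    """Indices of patterns that occur in at least m graphs."""
    n_patterns = len(X[0])
    return [k for k in range(n_patterns) if support(X, k) >= m]

supports = [support(X, k) for k in range(3)]
frequent = frequent_patterns(X, 3)
```

In real graph mining the columns are never materialized in full; patterns are enumerated on demand, which is what the search tree on the following slides is for.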
9 gSpan (Yan and Han, 2002): efficient frequent substructure mining method. DFS code: efficient detection of isomorphic patterns. We extend gSpan for our work
10 Enumeration on Tree-shaped Search Space Each node has a pattern Generate nodes from the root: Add an edge at each step
11 Tree Pruning. Anti-monotonicity: if support(g) < m, stop exploring! Every supergraph of g has support at most support(g), so the nodes below g are not generated. Support(g): number of graphs in which pattern g occurs
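The pruning rule can be sketched on a simpler pattern class. The example below uses item sets as a stand-in for subgraphs (an assumption made to keep the sketch short; it is not gSpan and does no isomorphism checking): extending an item set by one item plays the role of adding one edge.

```python
def mine_frequent(transactions, m):
    """Depth-first enumeration of item sets with support >= m.

    Item sets stand in for subgraph patterns; adding one item corresponds
    to adding one edge in the graph setting.
    """
    items = sorted({item for t in transactions for item in t})
    results = {}

    def support(pattern):
        return sum(1 for t in transactions if pattern <= t)

    def expand(pattern, start):
        for idx in range(start, len(items)):
            child = pattern | {items[idx]}
            s = support(child)
            if s < m:
                # Anti-monotonicity: no superset of child can reach
                # support m, so its subtree is never generated.
                continue
            results[frozenset(child)] = s
            expand(child, idx + 1)

    expand(frozenset(), 0)
    return results

transactions = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("b")]
found = mine_frequent(transactions, 2)
```

With threshold 2, the set {b, c} has support 1, so neither it nor anything below it is kept.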
12 Discriminative patterns: Weighted Substructure Mining. w_i > 0: positive class; w_i < 0: negative class. Finds patterns with a large frequency difference between the classes. Not anti-monotonic: use a bound for pruning
13 Multiclass version: multiple weight vectors, one per class (one value if the graph belongs to the class, another otherwise). Search for patterns overrepresented in a class
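The slide's exact scoring function and bound are not recoverable from the transcript, so the following is only a sketch of one simple possibility. It assumes the score of a pattern is the weighted frequency sum_i w_i x_i and bounds the absolute score of every superpattern; the names `gain` and `gain_bound` are hypothetical.

```python
def gain(w, occurs):
    """Weighted frequency sum_i w_i * x_i for one pattern (x_i is 0 or 1)."""
    return sum(wi for wi, xi in zip(w, occurs) if xi)

def gain_bound(w, occurs):
    """Upper bound on |gain| of any superpattern of the current pattern.

    A superpattern can only occur in a subset of the graphs where the
    current pattern occurs, so its |gain| is at most the larger of the
    positive weight mass and the negative weight mass among those graphs.
    """
    pos = sum(wi for wi, xi in zip(w, occurs) if xi and wi > 0)
    neg = -sum(wi for wi, xi in zip(w, occurs) if xi and wi < 0)
    return max(pos, neg)

w = [0.5, 0.5, -0.5, -0.5]   # two positive-class and two negative-class graphs
occurs = [1, 1, 1, 0]        # the pattern occurs in the first three graphs
g = gain(w, occurs)
b = gain_bound(w, occurs)
```

If the bound falls below the best score found so far (or below a fixed threshold), the subtree under the pattern can be skipped even though the score itself is not anti-monotonic.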
14 EM-based clustering of graphs. Tsuda, K. and T. Kudo: Clustering Graphs by Weighted Substructure Mining. ICML 2006
15 EM-based graph clustering. Motivation: learning a mixture model in the feature space of patterns; basis for more complex probabilistic inference. L1 regularization & graph mining. E-step -> Mining -> M-step
16 Probabilistic Model: binomial mixture. Defined by the mixing weight for each cluster, the feature vector of a graph (entries 0 or 1), and the parameter vector for each cluster
17 Function to minimize: L1-regularized log likelihood. Baseline constant: ML parameter estimate using a single binomial distribution. In the solution, most parameters are exactly equal to the baseline constants
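The formulas for this slide did not survive extraction. As a sketch only, a mixture of independent binary features is commonly written as below; the symbols (alpha for mixing weights, theta for logit parameters, c clusters, d patterns) are assumed notation, not necessarily the slide's.

```latex
p(x \mid \Theta) = \sum_{\ell=1}^{c} \alpha_\ell \, p(x \mid \theta_\ell),
\qquad
p(x \mid \theta_\ell) = \prod_{k=1}^{d} \frac{\exp(\theta_{\ell k} x_k)}{1 + \exp(\theta_{\ell k})},
\qquad x_k \in \{0, 1\}
```

Under this parameterization, pattern k occurs in cluster \ell with probability \exp(\theta_{\ell k}) / (1 + \exp(\theta_{\ell k})).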
18 E-step. Active pattern: a pattern whose parameter differs from the baseline constant in at least one cluster. The E-step is computed only with active patterns (computable!)
19 M-step. Putative cluster assignment by the E-step. Each parameter is solved separately. Use graph mining to find active patterns; then solve only for active patterns
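Why the E-step needs only active patterns can be checked numerically: a pattern whose parameter is identical in every cluster contributes the same factor to each cluster's likelihood, and that factor cancels in the posterior. The sketch below assumes a logit-parameterized mixture of independent binary features; all values and function names are made up for illustration.

```python
import math

def log_bernoulli(x, theta):
    """Log-probability of binary vector x under independent logit parameters."""
    return sum(t * xi - math.log1p(math.exp(t)) for xi, t in zip(x, theta))

def responsibilities(x, alphas, thetas, patterns):
    """Posterior cluster probabilities using only the listed pattern indices."""
    logs = [
        math.log(a) + log_bernoulli([x[k] for k in patterns],
                                    [th[k] for k in patterns])
        for a, th in zip(alphas, thetas)
    ]
    top = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [v / total for v in weights]

alphas = [0.4, 0.6]
# Patterns 1 and 2 have the same parameter in both clusters (inactive);
# only pattern 0 differs between clusters (active).
thetas = [[1.5, -0.3, 0.2], [-1.0, -0.3, 0.2]]
x = [1, 0, 1]

full = responsibilities(x, alphas, thetas, [0, 1, 2])
active_only = responsibilities(x, alphas, thetas, [0])
```

The two results agree up to floating-point error, which is what makes the E-step feasible even though the full pattern space is intractably large.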
20 Solution Occurrence probability in a cluster Overall occurrence probability
21 Important Observation For active pattern k, the occurrence probability in a graph cluster is significantly different from the average
22 Mining for Active Patterns. The set of active patterns F is rewritten in the following form. Active patterns can be found by (multiclass) graph mining!
23 Experiments: RNA graphs Stem as a node Secondary structure by RNAfold 0/1 Vertex label (self loop or not)
24 Clustering RNA graphs. Three Rfam families: Intron GP I (Int, 30 graphs), SSU rRNA 5 (SSU, 50 graphs), RNase bact a (RNase, 50 graphs). Three bipartition problems. Results evaluated by ROC scores (area under the ROC curve)
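The ROC score can be computed directly as the fraction of (positive, negative) pairs ranked correctly. The sketch below assumes each graph gets a real-valued score (for example, its posterior probability for one cluster) and a 0/1 label for its true family; the data are made up.

```python
def roc_score(scores, labels):
    """Area under the ROC curve.

    Equals the probability that a randomly chosen positive example is
    scored above a randomly chosen negative one; ties count as 1/2.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
auc = roc_score(scores, labels)
```

Here three of the four positive/negative pairs are ordered correctly, so the score is 0.75.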
25 Examples of RNA Graphs
26 ROC Scores
27 No of Patterns & Time
28 Found Patterns
29 Summary (EM). Probabilistic clustering based on substructure representation. Inference helped by graph mining. Many possible extensions: Naïve Bayes; Graph PCA, LFD, CCA; semi-supervised learning. Applications in biology?
30 Graph Boosting Saigo, H., T. Kadowaki and K. Tsuda: A Linear Programming Approach for Molecular QSAR analysis. International Workshop on Mining and Learning with Graphs, 85-96, 2006
31 Graph Regression Problem. Known as the QSAR problem in chemical informatics (Quantitative Structure-Activity Relationship analysis). Given a graph, predict a real value. Typically, features (descriptors) are given
32 QSAR with conventional descriptors [table columns: #atoms, #bonds, #rings, …, Activity]
33 Motivation of Graph Boosting Descriptors are not always available New features by obtaining informative patterns (i.e., subgraphs) Greedy pattern discovery by Boosting + gSpan Linear Programming (LP) Boosting for reducing the number of graph mining calls Accurate prediction & interpretable results
34 Molecule as a labeled graph [figure: a molecule drawn as a graph with atom labels such as C and O on the vertices]
35 QSAR with patterns [figure: table whose columns are subgraph patterns, …, Activity]
36 Sparse regression in a very high dimensional space. G: all possible patterns (intractably large). |G|-dimensional feature vector x for a molecule. Linear regression. Use an L1 regularizer to obtain a sparse α. Select a tractable number of patterns
37 Problem formulation. We introduce the ε-insensitive loss and an L1 regularizer. m: number of training graphs; d = |G|; ξ⁺, ξ⁻: slack variables; ε: parameter
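The optimization problem itself did not survive extraction. A standard way to write an ε-insensitive loss with an L1 regularizer as a linear program is sketched below; the trade-off constant C, the targets y_i, and the exact scaling are assumptions and may differ from the slide's LP1.

```latex
\begin{aligned}
\min_{\alpha,\ \xi^{+},\ \xi^{-}} \quad & \|\alpha\|_1 + C \sum_{i=1}^{m} \left( \xi_i^{+} + \xi_i^{-} \right) \\
\text{s.t.} \quad & \alpha^{\top} x_i - y_i \le \epsilon + \xi_i^{+}, \\
& y_i - \alpha^{\top} x_i \le \epsilon + \xi_i^{-}, \\
& \xi_i^{+} \ge 0,\ \xi_i^{-} \ge 0, \qquad i = 1, \dots, m
\end{aligned}
```

Because α has d = |G| components, this primal has an intractable number of variables, which motivates working with the dual.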
38 Dual LP Primal: Huge number of weight variables Dual: Huge number of constraints LP1-Dual
39 Column Generation Algorithm for LPBoost (Demiriz et al., 2002). Start from the dual with no constraints. Add the most violated constraint each time. Guaranteed to converge [figure: constraint matrix with the used part highlighted]
40 Finding the most violated constraint Constraint for a pattern (shown again) Finding the most violated one Searched by weighted substructure mining
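The search for the most violated constraint can be sketched over an explicit feature matrix. The sketch assumes each pattern j contributes a dual constraint of the form |sum_i u_i x_ij| <= 1; that form, the function name, and the data are assumptions for illustration. In the real setting the columns are not enumerated explicitly but searched by weighted substructure mining with pruning.

```python
def most_violated(u, X, limit=1.0, tol=1e-9):
    """Index of the pattern whose constraint is violated the most, or None.

    u: dual weights, one per training graph.
    X: 0/1 matrix, rows are graphs and columns are candidate patterns.
    """
    best_j, best_val = None, limit + tol
    for j in range(len(X[0])):
        val = abs(sum(ui * row[j] for ui, row in zip(u, X)))
        if val > best_val:
            best_j, best_val = j, val
    return best_j

X = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
]
violated = most_violated([0.8, 0.7, -0.6], X)
satisfied = most_violated([0.3, 0.2, -0.1], X)
```

A return value of `None` corresponds to the stopping condition "all constraints are satisfied".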
41 Algorithm Overview. Iteration: (1) find a new pattern by graph mining with weights u; (2) if all constraints are satisfied, break; (3) add a new constraint; (4) update u by LP1-Dual. Return: convert the dual solution to obtain the primal solution α
42 Speed-up by adding multiple patterns (multiple pricing). So far, only the most violated pattern is chosen. Mining and inclusion of the top k patterns at each iteration. Reduction of the number of mining calls
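Multiple pricing replaces "take the single worst constraint" with "take the k worst". A minimal sketch, again assuming dual constraints of the form |sum_i u_i x_ij| <= 1 and an explicit feature matrix (both assumptions; the function name and data are hypothetical):

```python
import heapq

def top_k_violations(u, X, k, limit=1.0):
    """Indices of the k most violated pattern constraints, worst first."""
    scored = []
    for j in range(len(X[0])):
        val = abs(sum(ui * row[j] for ui, row in zip(u, X)))
        if val > limit:
            scored.append((val, j))
    return [j for _, j in heapq.nlargest(k, scored)]

X = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 1],
]
u = [0.9, 0.8, 0.7]
top2 = top_k_violations(u, X, 2)
```

Adding several constraints per iteration makes each LP slightly larger but reduces how often the expensive mining step has to be run.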
43 Speed-up by multiple pricing
44 Clearly negative data [table columns: #atoms, #bonds, #rings, …, Activity]
45 Inclusion of clearly negative data. LP2-Primal. l: number of clearly negative data; z: predetermined upper bound; ξ′: slack variable
46 Experiments. Data from the Endocrine Disruptors Knowledge Base. 59 compounds labeled by a real number and 61 compounds labeled by a large negative number. The label (target) is the log-transformed relative proliferative potency (log(RPP)), normalized between −1 and 1. Comparison with (1) marginalized graph kernel + ridge regression and (2) marginalized graph kernel + kNN regression
47 Results with or without clearly negative data LP2 LP1
48 Extracted patterns: interpretable, in contrast to the implicitly expressed features of the marginalized graph kernel
49 Summary (Graph Boosting). Graph Boosting simultaneously generates patterns and learns their weights. Finite convergence by column generation. Potentially interpretable by chemists. Flexible constraints and speed-up by LP.
50 Concluding Remarks Using graph mining as a part of machine learning algorithms Weights are essential Please include weights when you implement your item-set/tree/graph mining algorithms Make it available on the web! Then ML researchers can use it