1 Graph Mining Applications to Machine Learning Problems. Koji Tsuda, Max Planck Institute for Biological Cybernetics
2 Graphs …
3 Graph Structures in Biology. DNA sequences (ACGC…), RNA secondary structures (base-paired stems: CG, UU, UA), chemical compounds (atom-bond graphs of C, H, O, …), and texts in literature ("Amitriptyline inhibits adenosine uptake").
4 Substructure Representation. 0/1 vector of pattern indicators; huge dimensionality! Graph mining is needed for selecting features. Better than paths (marginalized graph kernels).
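As a concrete illustration of the slide above, the following is a minimal sketch (not the talk's implementation) of the 0/1 pattern-indicator representation. The "graphs" and "patterns" here are simplified to labeled edge sets, and the containment test is a stand-in for real subgraph isomorphism, which the talk handles with graph mining.

```python
# Minimal sketch: represent each graph by a 0/1 vector indicating which
# candidate patterns occur in it.  Real graph mining would enumerate subgraph
# patterns and test subgraph isomorphism; here a "pattern" is just a set of
# labeled edges, which keeps the idea visible.
from typing import FrozenSet, List

Graph = FrozenSet[str]      # toy graph = a set of labeled edges, e.g. {"C-C", "C-O"}
Pattern = FrozenSet[str]    # toy pattern = a smaller set of labeled edges

def contains(graph: Graph, pattern: Pattern) -> bool:
    """Toy stand-in for subgraph isomorphism: all of the pattern's edges appear."""
    return pattern <= graph

def feature_vector(graph: Graph, patterns: List[Pattern]) -> List[int]:
    """0/1 indicator vector over the candidate pattern set."""
    return [1 if contains(graph, p) else 0 for p in patterns]

if __name__ == "__main__":
    graphs = [frozenset({"C-C", "C-O", "C-H"}), frozenset({"C-C", "C-N"})]
    patterns = [frozenset({"C-O"}), frozenset({"C-C", "C-N"})]
    for g in graphs:
        print(feature_vector(g, patterns))   # prints [1, 0] then [0, 1]
```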
5 Overview. Quick review of graph mining. EM-based clustering algorithm: mixture model with L1 feature selection. Graph Boosting: supervised regression for QSAR analysis, where linear programming meets graph mining.
6 Quick Review of Graph Mining
7 Graph Mining. Analysis of graph databases: find all patterns satisfying predetermined conditions. Frequent substructure mining is combinatorial and exhaustive. Recently developed methods: AGM (Inokuchi et al., 2000), gSpan (Yan and Han, 2002), Gaston (2004).
8 Graph Mining: Frequent Substructure Mining. Enumerate all patterns that occur in at least m graphs. x_{ik}: indicator of pattern k in graph i. Support(k): number of graphs in which pattern k occurs.
9 gSpan (Yan and Han, 2002): an efficient frequent substructure mining method. Its DFS code allows efficient detection of isomorphic patterns. We extend gSpan for our work.
10 Enumeration on Tree-shaped Search Space Each node has a pattern Generate nodes from the root: Add an edge at each step
11 Tree Pruning. Anti-monotonicity: if support(g) < m, stop exploring; no supergraph of g is generated. Support(g): number of graphs in which pattern g occurs.
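The enumeration and pruning on slides 10-11 can be sketched as follows. This is a hedged, simplified version: it mines itemsets instead of subgraphs, so "add an edge at each step" becomes "add an item", but the tree-shaped search space and the anti-monotone pruning rule (if support(g) < m, skip the whole subtree) are the same ideas gSpan relies on.

```python
# Minimal sketch of frequent-pattern enumeration on a tree-shaped search space
# with anti-monotone pruning.  For clarity it mines itemsets rather than
# subgraphs; support never grows when the pattern grows, so a low-support node
# lets us prune its entire subtree.
from typing import FrozenSet, List, Tuple

def support(pattern: FrozenSet[str], database: List[FrozenSet[str]]) -> int:
    """Number of transactions (graphs, in the talk) containing the pattern."""
    return sum(1 for t in database if pattern <= t)

def mine(database: List[FrozenSet[str]], items: List[str], m: int) -> List[Tuple[FrozenSet[str], int]]:
    """Enumerate all patterns occurring in at least m transactions."""
    results = []

    def expand(pattern: FrozenSet[str], start: int) -> None:
        for i in range(start, len(items)):       # generate children in the search tree
            child = pattern | {items[i]}
            s = support(child, database)
            if s < m:                            # anti-monotonicity: prune the subtree
                continue
            results.append((child, s))
            expand(child, i + 1)

    expand(frozenset(), 0)
    return results

if __name__ == "__main__":
    db = [frozenset("abc"), frozenset("abd"), frozenset("acd"), frozenset("bcd")]
    for pat, s in mine(db, ["a", "b", "c", "d"], m=2):
        print(sorted(pat), s)
```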
12 Discriminative Patterns: Weighted Substructure Mining. w_i > 0: positive class; w_i < 0: negative class. Weighted substructure mining finds patterns with a large frequency difference between the classes. The weighted objective is not anti-monotonic, so a bound is used for pruning.
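A small sketch of why a bound is needed and what one such bound can look like. The weighted gain |sum_i w_i x_{ik}| is not anti-monotonic, but because occurrence is (a super-pattern occurs in no more graphs than its sub-pattern), the gain of every super-pattern can be bounded using only the current pattern's occurrences. The bound below is a simple illustrative one; the exact bound used in the talk and papers may differ.

```python
# Minimal sketch of a pruning bound for weighted substructure mining.
# gain(k) = | sum_i w_i * x_ik | is not anti-monotonic, but for any
# super-pattern k' of k we have x_ik' <= x_ik, so gain(k') is bounded by
# looking only at the graphs that already contain k.
from typing import List

def gain(w: List[float], x_k: List[int]) -> float:
    """Weighted frequency difference of pattern k (x_k is its 0/1 occurrence vector)."""
    return abs(sum(wi * xi for wi, xi in zip(w, x_k)))

def prune_bound(w: List[float], x_k: List[int]) -> float:
    """Upper bound on gain(k') for every super-pattern k' of k."""
    pos = sum(wi * xi for wi, xi in zip(w, x_k) if wi > 0)
    neg = -sum(wi * xi for wi, xi in zip(w, x_k) if wi < 0)
    return max(pos, neg)

if __name__ == "__main__":
    w = [0.5, 0.5, -0.5, -0.5]    # positive / negative class weights
    x_k = [1, 1, 1, 0]            # pattern k occurs in graphs 0, 1, 2
    print(gain(w, x_k), prune_bound(w, x_k))   # 0.5 and 1.0: if the bound falls
    # below the current threshold, the whole subtree under k can be skipped.
```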
13 Multiclass version. Multiple weight vectors, one per class: a graph gets one weight value if it belongs to the class and another otherwise. Search for patterns overrepresented in a class.
14 EM-based clustering of graphs. Tsuda, K. and Kudo, T.: Clustering Graphs by Weighted Substructure Mining. ICML 2006, pp. 953-960.
15 EM-based graph clustering. Motivation: learn a mixture model in the feature space of patterns, as a basis for more complex probabilistic inference. L1 regularization and graph mining: E-step -> Mining -> M-step.
16 Probabilistic Model: binomial mixture, p(x) = sum_c pi_c prod_k theta_{ck}^{x_k} (1 - theta_{ck})^{1 - x_k}. pi_c: mixing weight for cluster c; x: feature vector of a graph (each entry 0 or 1); theta_c: parameter vector for cluster c.
17 Function to minimize: the L1-regularized negative log likelihood, where each parameter is regularized toward a baseline constant, the ML parameter estimate under a single binomial distribution. In the solution, most parameters are exactly equal to their baseline constants.
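The formula itself did not survive extraction; the LaTeX below is a plausible reconstruction of the objective from the legend on this slide, with \bar{\theta}_k denoting the baseline constant (the ML estimate under a single binomial distribution). The exact form and normalization in Tsuda and Kudo (2006) may differ.

```latex
% Hedged reconstruction of the objective; \bar{\theta}_k is the baseline
% constant and \lambda controls the strength of the L1 penalty.
\min_{\pi,\,\theta}\;
  -\sum_{i=1}^{n} \log \sum_{c} \pi_c
      \prod_{k} \theta_{ck}^{\,x_{ik}} \bigl(1-\theta_{ck}\bigr)^{1-x_{ik}}
  \;+\; \lambda \sum_{c}\sum_{k} \bigl|\theta_{ck} - \bar{\theta}_k\bigr|
```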
18 E-step. Active pattern: a pattern whose parameter differs from its baseline constant. The E-step can be computed using only the active patterns, so it remains computable despite the huge feature space.
19 M-step. Given the putative cluster assignment from the E-step, each parameter is solved separately. Graph mining is used to find the active patterns; the problem is then solved only for those patterns.
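A minimal, dense-feature EM sketch for the binomial (Bernoulli) mixture of slide 16, written with plain NumPy. It assumes the full 0/1 feature matrix can be materialized, which is exactly what the talk avoids: in the actual method the E-step and M-step are restricted to active patterns found by weighted substructure mining, and the M-step includes the L1 penalty that pins most parameters to their baselines. Both of those ingredients are omitted here.

```python
# Dense-feature EM sketch for a Bernoulli ("binomial") mixture over 0/1
# pattern indicators.  The active-pattern restriction and the L1 penalty
# described in the talk are omitted for brevity.
import numpy as np

def em_bernoulli_mixture(X: np.ndarray, n_clusters: int, n_iter: int = 50, seed: int = 0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(n_clusters, 1.0 / n_clusters)               # mixing weights
    theta = rng.uniform(0.25, 0.75, size=(n_clusters, d))    # occurrence probabilities

    for _ in range(n_iter):
        # E-step: responsibilities r[i, c] = P(cluster c | graph i)
        log_p = (X @ np.log(theta).T) + ((1 - X) @ np.log(1 - theta).T) + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)             # numerical stability
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: each parameter has a separate closed-form update
        nk = r.sum(axis=0)
        pi = nk / n
        theta = (r.T @ X + 1e-3) / (nk[:, None] + 2e-3)       # smoothed to stay in (0, 1)
    return pi, theta, r

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # two planted clusters of graphs with different pattern-occurrence profiles
    X = np.vstack([rng.binomial(1, 0.8, size=(30, 10)), rng.binomial(1, 0.2, size=(30, 10))])
    pi, theta, r = em_bernoulli_mixture(X, n_clusters=2)
    print(np.round(pi, 2), np.round(theta.mean(axis=1), 2))
```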
20 Solution. Each parameter compares the occurrence probability of the pattern within the cluster to its overall occurrence probability; when the two are close, the parameter stays at the baseline constant.
21 Important Observation For active pattern k, the occurrence probability in a graph cluster is significantly different from the average
22 Mining for Active Patterns. The criterion F can be rewritten in a form such that the active patterns can be found by (multiclass) weighted substructure mining.
23 Experiments: RNA graphs. Each stem is a node; secondary structure is computed by RNAfold; 0/1 vertex labels (self loop or not).
24 Clustering RNA graphs. Three Rfam families: Intron GP I (Int, 30 graphs), SSU rRNA 5 (SSU, 50 graphs), RNase bact a (RNase, 50 graphs). Three bipartition problems; results evaluated by ROC scores (area under the ROC curve).
25 Examples of RNA Graphs
26 ROC Scores
27 No of Patterns & Time
28 Found Patterns
29 Summary (EM). Probabilistic clustering based on substructure representation; inference helped by graph mining. Many possible extensions: Naïve Bayes, graph PCA, LFD, CCA, semi-supervised learning. Applications in biology?
30 Graph Boosting Saigo, H., T. Kadowaki and K. Tsuda: A Linear Programming Approach for Molecular QSAR analysis. International Workshop on Mining and Learning with Graphs, 85-96, 2006
31 Graph Regression Problem. Known as the QSAR (Quantitative Structure-Activity Relationship) problem in chemical informatics: given a graph, predict a real value. Typically, features (descriptors) are given.
32 QSAR with conventional descriptors. [Table: each compound is described by numeric descriptors (#atoms, #bonds, #rings, …) together with a real-valued Activity to be predicted.]
33 Motivation of Graph Boosting. Descriptors are not always available; create new features from informative patterns (i.e., subgraphs). Greedy pattern discovery by boosting + gSpan; Linear Programming (LP) boosting reduces the number of graph mining calls. Accurate prediction and interpretable results.
34 Molecule as a labeled graph: atoms (C, O, …) as labeled vertices, bonds as edges.
35 QSAR with patterns. [Table: each compound is described by 0/1 indicators of subgraph patterns together with a real-valued Activity to be predicted.]
36 Sparse regression in a very high-dimensional space. G: the set of all possible patterns (intractably large); each molecule has a |G|-dimensional feature vector x. Linear regression with an L1 regularizer gives a sparse weight vector α, selecting a tractable number of patterns.
37 Problem formulation. We introduce the ε-insensitive loss and the L1 regularizer. m: number of training graphs; d = |G|; ξ+, ξ-: slack variables; ε: insensitivity parameter.
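The LP formulation did not survive extraction; below is one standard way to write an ε-insensitive, L1-regularized regression LP that is consistent with the variable legend on this slide. The constant C and the exact normalization used in LP1 may differ.

```latex
% Hedged reconstruction of LP1-Primal.  The weight of pattern j is
% \alpha_j = \alpha_j^+ - \alpha_j^-, so |\alpha_j| = \alpha_j^+ + \alpha_j^-.
\begin{aligned}
\min_{\alpha^{\pm} \ge 0,\ \xi^{\pm} \ge 0}\quad
  & \sum_{j=1}^{d} \bigl(\alpha_j^{+} + \alpha_j^{-}\bigr)
    + C \sum_{i=1}^{m} \bigl(\xi_i^{+} + \xi_i^{-}\bigr) \\
\text{s.t.}\quad
  & y_i - \sum_{j=1}^{d} \bigl(\alpha_j^{+} - \alpha_j^{-}\bigr) x_{ij}
      \le \varepsilon + \xi_i^{+}, \\
  & \sum_{j=1}^{d} \bigl(\alpha_j^{+} - \alpha_j^{-}\bigr) x_{ij} - y_i
      \le \varepsilon + \xi_i^{-}, \qquad i = 1,\dots,m .
\end{aligned}
```

Taking the LP dual of this formulation yields one constraint per pattern of (roughly) the form |sum_i u_i x_{ij}| <= 1 over dual weights u_i, which is what makes the search for the most violated constraint a weighted substructure mining problem (slides 38-40).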
38 Dual LP. Primal: huge number of weight variables (one per pattern). Dual (LP1-Dual): huge number of constraints (one per pattern).
39 Column Generation Algorithm for LPBoost (Demiriz et al., 2002). Start from the dual with no pattern constraints; add the most violated constraint each time; guaranteed to converge. Only a small part of the full constraint matrix is ever used.
40 Finding the most violated constraint. The dual constraint for a pattern (shown again) limits a weighted sum of its occurrences; the most violated pattern maximizes that weighted sum, and it is searched for by weighted substructure mining.
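A brute-force stand-in for this mining step, assuming the candidate patterns' occurrence columns are available as a 0/1 matrix and that the pattern constraint has the weighted-support form described above. In the actual method the argmax is found by weighted gSpan with a pruning bound rather than by scanning columns.

```python
# Brute-force stand-in for "find the most violated constraint": given dual
# weights u over training graphs and a 0/1 occurrence matrix X (graphs x
# candidate patterns), assume the constraint for pattern j is
# |sum_i u_i * X[i, j]| <= 1; the most violated pattern maximizes the left side.
import numpy as np

def most_violated_pattern(u: np.ndarray, X: np.ndarray, tol: float = 1e-6):
    scores = np.abs(X.T @ u)            # weighted support of every candidate pattern
    j = int(np.argmax(scores))
    violated = scores[j] > 1.0 + tol    # violated -> add this column to the LP
    return j, float(scores[j]), violated

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.binomial(1, 0.3, size=(6, 8))   # 6 graphs, 8 candidate patterns
    u = rng.normal(scale=0.8, size=6)       # dual weights from the master LP
    print(most_violated_pattern(u, X))
```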
41 Algorithm Overview. Iterate: find a new pattern by graph mining with weights u; if all constraints are satisfied, break; otherwise add the new constraint and update u by solving LP1-Dual. Finally, convert the dual solution to obtain the primal solution α.
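A hedged end-to-end sketch of this loop as column generation on the dual LP, using SciPy's LP solver (the ineqlin.marginals field it returns requires a reasonably recent SciPy). Two assumptions not taken from the slides: the LP is the ε-insensitive L1 regression LP reconstructed after slide 37, and graph mining is replaced by a brute-force column scan over a given pattern-occurrence matrix.

```python
# Hedged sketch of graph boosting as column generation on the dual LP.
# The graph-mining oracle is replaced by a brute-force scan over the columns
# of a given 0/1 pattern-occurrence matrix X (n_graphs x n_candidate_patterns).
import numpy as np
from scipy.optimize import linprog

def graph_boost_sketch(X, y, C=1.0, eps=0.1, tol=1e-6, max_iter=50):
    n = X.shape[0]
    # dual variables z = (u_plus, u_minus), each entry in [0, C]
    c = np.concatenate([eps - y, eps + y])        # minimize -(dual objective)
    bounds = [(0.0, C)] * (2 * n)
    A_rows, selected = [], []

    def solve_master():
        A_ub = np.array(A_rows) if A_rows else None
        b_ub = np.ones(len(A_rows)) if A_rows else None
        return linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")

    for _ in range(max_iter):
        res = solve_master()
        u = res.x[:n] - res.x[n:]                 # current dual weights over graphs
        scores = np.abs(X.T @ u)                  # "mining": most violated |X^T u| <= 1
        scores[selected] = -np.inf
        j = int(np.argmax(scores))
        if scores[j] <= 1.0 + tol:                # all pattern constraints satisfied
            break
        selected.append(j)
        xj = X[:, j].astype(float)
        A_rows.append(np.concatenate([xj, -xj]))  # +side of the constraint for pattern j
        A_rows.append(np.concatenate([-xj, xj]))  # -side of the constraint for pattern j

    res = solve_master()                          # final solve covering every added row
    # convert dual solution to primal alpha: multipliers of the added constraints
    # (SciPy reports marginals of "<=" constraints as non-positive numbers)
    if selected:
        marg = res.ineqlin.marginals
        alpha = np.array([marg[2 * t + 1] - marg[2 * t] for t in range(len(selected))])
    else:
        alpha = np.zeros(0)
    return selected, alpha

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.binomial(1, 0.3, size=(40, 60)).astype(float)   # toy pattern indicators
    w_true = np.zeros(60)
    w_true[[3, 17, 42]] = [1.5, -2.0, 1.0]
    y = X @ w_true + 0.05 * rng.normal(size=40)              # toy activities
    patterns, alpha = graph_boost_sketch(X, y, C=10.0, eps=0.05)
    print(sorted(zip(patterns, np.round(alpha, 2))))
```

Reading the primal weights α off the multipliers of the added pattern constraints is the "convert dual solution to obtain primal solution α" step on this slide.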
42 Speed-up by adding multiple patterns (multiple pricing). So far, only the most violated pattern is chosen; instead, mine and include the top k patterns at each iteration, reducing the number of mining calls.
43 Speed-up by multiple pricing
44 Clearly negative data. [Table: the descriptor table of slide 32 augmented with compounds whose Activity is a large negative number (-10000).]
45 Inclusion of clearly negative data (LP2-Primal). l: number of clearly negative data; z: predetermined upper bound; ξ′: slack variables.
46 Experiments. Data from the Endocrine Disruptors Knowledge Base: 59 compounds labeled by a real number and 61 compounds labeled by a large negative number. The label (target) is the log-transformed relative proliferative potency, log(RPP), normalized between -1 and 1. Comparison with marginalized graph kernel + ridge regression and marginalized graph kernel + kNN regression.
47 Results with or without clearly negative data (LP1 vs. LP2).
48 Extracted patterns: interpretable, compared with the implicitly expressed features of the marginalized graph kernel.
49 Summary (Graph Boosting). Graph Boosting simultaneously generates patterns and learns their weights. Finite convergence by column generation. Potentially interpretable by chemists. Flexible constraints and speed-up by LP.
50 Concluding Remarks. Use graph mining as a part of machine learning algorithms; weights are essential. Please include weights when you implement your itemset/tree/graph mining algorithms, and make it available on the web so that ML researchers can use it.