1
Hyper-parameter tuning for graph kernels via Multiple Kernel Learning
Carlo M. Massimo, Nicolò Navarin and Alessandro Sperduti Department of Mathematics University of Padova, Italy
2
Machine learning on graphs
Motivations: many real-world problems require learning on data that can be naturally represented in structured form (XML, parse trees, graphs), e.g. predicting the mutagenicity of chemical compounds. In these domains, kernel methods are an interesting approach because they do not require an explicit transformation into vectorial form. Note: we focus on graphs, but the proposed techniques can be applied to virtually any domain.
3
Kernel methods. Learning algorithms: SVM, SVR, K-means.
The learning algorithm accesses examples only via dot products. Kernel function: informally, a similarity measure between two examples. Any symmetric positive semidefinite function is a kernel function (Mercer's theorem): k(x, x') = <ɸ(x), ɸ(x')>, which projects the entities into a feature space and computes the dot product between these projections. This mapping may be done implicitly or explicitly. Input space → feature space via ɸ.
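A minimal sketch of the implicit/explicit distinction, using a degree-2 polynomial kernel on plain 2-D vectors as a stand-in (the kernel, the inputs and the function names below are illustrative, not taken from the presentation):

```python
import numpy as np

def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel for a 2-D input."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

def k_implicit(x, xp):
    """The same kernel computed implicitly, without building the feature vectors."""
    return np.dot(x, xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(phi(x), phi(xp)))  # explicit: <phi(x), phi(x')>
print(k_implicit(x, xp))        # implicit: (x . x')^2, same value (16.0)
```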
4
Graph kernels (Haussler's convolution framework) count matching subgraphs in two graphs: ɸ gives the explicit feature-space representation of each graph as a vector of substructure counts, and the kernel is the dot product between the two feature vectors. [Figure: two example graphs G and G' with their feature vectors over the substructures A, B, C, D.]
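As an illustration, a small hedged sketch of the explicit representation: each graph is reduced to a sparse dictionary of substructure counts (the feature names below are made up) and the kernel is just the dot product of those counts:

```python
from collections import Counter

def convolution_kernel(counts_g, counts_g2):
    """Dot product between two sparse substructure-count vectors."""
    if len(counts_g) > len(counts_g2):          # iterate over the smaller dict
        counts_g, counts_g2 = counts_g2, counts_g
    return sum(c * counts_g2.get(f, 0) for f, c in counts_g.items())

# hypothetical substructure counts extracted from two graphs G and G'
phi_G  = Counter({"A": 1, "B": 2, "A-B": 2, "B-D": 1})
phi_G2 = Counter({"A": 2, "B": 1, "A-B": 1, "C-D": 1})
print(convolution_kernel(phi_G, phi_G2))        # 1*2 + 2*1 + 2*1 = 6
```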
5
The kernel (Gram) matrix
Given a set of examples S = {x1, x2, ..., xm} and a kernel function kθ depending on some parameters θ, the kernel matrix K_θ^S is the m×m matrix whose (i, j) entry is kθ(xi, xj), i.e. the matrix with rows [kθ(x1, x1), kθ(x1, x2), ..., kθ(x1, xm)], ..., [kθ(xm, x1), ..., kθ(xm, xm)].
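A short sketch of how such a matrix can be built for any kernel function and parameter setting; the RBF kernel on plain vectors below is only a placeholder for a parametrized graph kernel kθ:

```python
import numpy as np

def gram_matrix(examples, kernel, **theta):
    """Build the m x m matrix K[i, j] = k_theta(x_i, x_j), exploiting symmetry."""
    m = len(examples)
    K = np.empty((m, m))
    for i in range(m):
        for j in range(i, m):
            K[i, j] = K[j, i] = kernel(examples[i], examples[j], **theta)
    return K

# placeholder kernel: RBF on vectors (a graph kernel would be plugged in the same way)
rbf = lambda x, y, gamma: np.exp(-gamma * np.sum((x - y) ** 2))
S = [np.array([0.0, 1.0]), np.array([1.0, 1.0]), np.array([2.0, 0.0])]
print(gram_matrix(S, rbf, gamma=0.5))
```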
6
Overview of graph kernels
Different graph kernels consider different sub-structures (e.g. random walks, shortest paths, ...): Weisfeiler-Lehman (WL): subtree patterns, 1 parameter [Shervashidze and Borgwardt, 2011]; WLC: subtree patterns with contextual information (i.e. the structure of the graph surrounding the feature), derived from [Navarin et al., ICONIP 2015]; ODD_ST: sub-DAGs, 2 parameters [Da San Martino et al., SDM 2012]; TCK_ST: sub-DAGs with contextual information [Navarin et al., ICONIP 2015]. Problem: different kernels induce different similarity measures, hence different biases and different predictive performances. There is no way to know a priori which kernel will perform better.
7
The Parameter selection problem
Given a task, we have to choose among: several (graph) kernels, each one depending on different parameters, and the learning algorithm parameter(s) (e.g. the C of the SVM). How can we select the best kernel/parameters? Grid-search approach: fix a priori a set of values for each parameter and select the best-performing combination; this may be costly.
8
Grid-search approach. Error estimation procedure (e.g. k-fold CV).
Hyper-parameter selection procedure (e.g. nested k-fold CV): one Gram matrix for each point in the grid. For each classifier parameter: Grid_size error estimation procedures; for nested k-fold CV, k·Grid_size trainings.
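A sketch of this procedure with precomputed Gram matrices and scikit-learn's SVC(kernel="precomputed"): one SVM is trained per (Gram matrix, C) point of the grid. The single held-out split below stands in for the (nested) k-fold CV of the slides, and all names are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

def grid_search_precomputed(grams, C_values, y, train_idx, val_idx):
    """Naive grid search: one SVM fit per (Gram matrix, C) pair, scored on a
    held-out split (nested k-fold CV would multiply this cost by k)."""
    best = (None, None, -np.inf)
    for name, K in grams.items():                 # one Gram matrix per grid point
        K_tr = K[np.ix_(train_idx, train_idx)]    # train-vs-train kernel values
        K_va = K[np.ix_(val_idx, train_idx)]      # validation-vs-train kernel values
        for C in C_values:
            clf = SVC(kernel="precomputed", C=C).fit(K_tr, y[train_idx])
            acc = clf.score(K_va, y[val_idx])
            if acc > best[2]:
                best = (name, C, acc)
    return best                                   # (best kernel, best C, val accuracy)
```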
9
The Parameter selection problem
Given a task, we have to choose among: several (graph) kernels, each one depending on different parameters, and the learning algorithm parameter(s) (e.g. the C of the SVM). How can we select the best kernel/parameters? Grid-search approach: fix a priori a set of values for each parameter and select the best-performing combination; may be costly. Possible solution: randomized search in the parameter grid; it still depends on the number of considered trials and is not optimal. Is there a different alternative, possibly faster, that can lead to better results? Our proposal: MKL.
10
Multiple Kernel Learning Aiolli and Donini, Neurocomputing 2015
Goal: learning a combination of “weak” kernels that hopefully performs better than the single ones: K = Σ_{i=1..R} η_i K_i, with η_i ≥ 0. EasyMKL computes the kernel combination and solves an SVM-like learning problem at the same cost as a single SVM (linear in the number of kernels). Another perspective: avoid kernel hyper-parameter optimization by combining all the kernels in the grid together. Computational benefit: a single learning procedure has to be computed. Predictive performance: in principle, information from different kernels can be combined. Warning: we are not selecting but combining. As an edge case, it may happen that just a single kernel is selected (all the weight on a single kernel); in general, a few important kernels will get most of the weight.
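A minimal sketch of the combination step K = Σ_i η_i K_i with η_i ≥ 0: the weights here are simply uniform for illustration, whereas EasyMKL learns them from the data together with the classifier; a single precomputed-kernel SVM on the combined matrix then replaces the Grid_size trainings of grid search (variable names are illustrative):

```python
import numpy as np

def combine_kernels(gram_list, eta=None):
    """Return K = sum_i eta_i * K_i with eta_i >= 0.
    Uniform weights are a placeholder: EasyMKL would learn eta from the data."""
    R = len(gram_list)
    eta = np.full(R, 1.0 / R) if eta is None else np.asarray(eta, dtype=float)
    assert np.all(eta >= 0), "kernel weights must be non-negative"
    return sum(e * K for e, K in zip(eta, gram_list))

# usage sketch: combine all Gram matrices of the grid, then train ONE classifier
# K = combine_kernels(list_of_gram_matrices)
# clf = SVC(kernel="precomputed", C=1.0).fit(K[np.ix_(tr, tr)], y[tr])
```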
11
Proposed approach. Error estimation procedure (e.g. k-fold CV).
Hyper-parameter selection procedure (e.g. nested k-fold CV). Proposed approach: for each classifier parameter, 1 error estimation (MKL) procedure; for nested k-fold CV, k trainings. Grid search: for each classifier parameter, Grid_size error estimation procedures; for nested k-fold CV, k·Grid_size trainings.
12
Experimental setting Every example is a graph
Chemical graphs / protein secondary structures. [Slide with dataset descriptions and statistics (number of graphs, ...), also as motivation.]
13
Experimental results (1)
Comparable performances! hs: grid search, AUROC with SVM classifier; c: kernel combination, AUROC with EasyMKL. Kernels: ODD_ST (2 params, ~100 matrices), TCK_ST (2 params, ~100 matrices), ODD_ST + TCK_ST (~200 matrices).
14
Kernel Orthogonalization
MKL performs better when the information provided by each kernel is orthogonal w.r.t. the others. Idea: partition the feature space of a given kernel so that the resulting subsets have non-overlapping features [Aiolli et al., CIDM 2015]: if ɸ(x) is split into ɸ1(x) and ɸ2(x), then K(x, y) = K1(x, y) + K2(x, y). The ODD_ST and WL feature spaces have an inherent hierarchical structure that makes such a split natural. The WL orthogonalization is a minor contribution of the paper.
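A toy sketch under the assumption that every feature can be tagged with the hierarchy level that generated it (e.g. the WL iteration or the ODD_ST subtree depth); partitioning the count vectors by that tag yields non-overlapping feature subsets whose per-part kernels sum back to the original kernel (feature names are made up):

```python
from collections import defaultdict

def dot(c1, c2):
    """Kernel value as dot product of sparse count vectors."""
    return sum(v * c2.get(f, 0) for f, v in c1.items())

def split_by_part(counts, part_of):
    """Partition a feature-count dict into disjoint sub-dicts (e.g. per WL iteration)."""
    parts = defaultdict(dict)
    for f, v in counts.items():
        parts[part_of(f)][f] = v
    return parts

# hypothetical features tagged with the iteration/depth that generated them
phi_x = {("h0", "A"): 2, ("h1", "A|BB"): 1}
phi_y = {("h0", "A"): 1, ("h1", "A|BC"): 1}
px, py = split_by_part(phi_x, lambda f: f[0]), split_by_part(phi_y, lambda f: f[0])
per_part = {h: dot(px.get(h, {}), py.get(h, {})) for h in set(px) | set(py)}
assert sum(per_part.values()) == dot(phi_x, phi_y)   # K(x, y) = K1(x, y) + K2(x, y)
```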
15
Experimental results (2)
For this dataset, the grid was reduced due to computational constraints. 90.38 ± 0.10: improved performances! hs: grid search, AUROC with SVM classifier; c: kernel combination, AUROC with EasyMKL; oc: orthogonalized combination. Kernels: ODD_ST (2 params, ~100 matrices), TCK_ST (2 params, ~100 matrices), ODD_ST + TCK_ST (~200 matrices).
16
Experimental results (3)
WL (1 param, 10 matrices); WLC (1 param, 10 matrices); WL + WLC (20 matrices). c: slightly improved performances! oc: no significant improvement in this case. hs: grid search, AUROC with SVM classifier; c: kernel combination, AUROC with EasyMKL; oc: orthogonalized combination.
17
Experimental results (4)
Combining different kernel functions: consistent results; the method can be applied also in this case. hs: grid search, AUROC with SVM classifier; c: kernel combination, AUROC with EasyMKL; oc: orthogonalized combination.
18
Runtimes
The runtime of the proposed method depends on: the Gram matrix size, and the number of matrices to consider. Reported: TCK_ST kernel (100 matrices) and WLC kernel (10 matrices).
19
Conclusions. We presented an alternative to kernel hyper-parameter selection that combines kernels depending on their effectiveness; it is faster w.r.t. grid search if the number of kernels is large, since complexity grows linearly in the number of kernels. Two possible approaches: combine the different kernel matrices as-is, or divide the Reproducing Kernel Hilbert Space in a principled way. The method achieves results comparable to grid search. Future work: application to other (similar) domains (kernels on graph nodes, commonly used in bioinformatics applications); speedup of the MKL computation; adopting SVM instead of KOMD as the classifier; selection of the kernel matrices to use (instead of combination), which may be beneficial according to the weight analysis in the paper.
20
Weights analysis: ODD_ST on the AIDS dataset; WLC on the AIDS dataset.