Support Feature Machine for DNA microarray data
Tomasz Maszczyk and Włodzisław Duch
Department of Informatics, Nicolaus Copernicus University, Toruń, Poland
RSCTC 2010
Plan
- Main idea
- SFM vs SVM
- Description of our approach
- Results
- Conclusions
Main idea
- SVM is based on LDA with margin maximization (good generalization, control of complexity).
- Non-linear decision borders are linearized by projecting into a high-dimensional feature space.
- Cover theorem: the projection increases the probability that the data are separable, flattening decision borders.
- Kernel methods: new features $z_i(x) = k(x, x_i)$ are constructed around support vectors (SV) $x_i$, i.e. vectors close to the decision borders.
- Instead of the original input space $x_i$, SVM works in the space of kernel features $z_i(x)$, called "the kernel space".
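The kernel-feature idea above can be illustrated with a minimal NumPy sketch. It assumes a Gaussian kernel, uses every training vector as a reference $x_i$ (rather than only support vectors), and fits the linear discriminant by least squares instead of margin maximization; the function name `gaussian_kernel_features` and the toy data are illustrative only.

```python
import numpy as np

def gaussian_kernel_features(X, refs, beta=1.0):
    """Map rows of X to kernel features z_i(x) = exp(-beta * ||x - x_i||^2)."""
    # Squared Euclidean distance between every row of X and every reference vector.
    diff = X[:, None, :] - refs[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    return np.exp(-beta * sq_dist)

# Toy XOR-like data: not linearly separable in the input space.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([-1, -1, 1, 1])

# One kernel feature per training vector (assumption: all vectors are references).
Z = gaussian_kernel_features(X, X, beta=2.0)

# Least-squares linear discriminant in the kernel space (stand-in for a linear SVM).
w, *_ = np.linalg.lstsq(Z, y.astype(float), rcond=None)
pred = np.sign(Z @ w)
```

In the kernel space the XOR-like classes become linearly separable, which is the point of working with $z_i(x)$ instead of the raw inputs.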
Main idea
- Each SV ⇒ a useful feature; this is optimal for data with particular distributions, but does not work on parity or other problems with complex logical structure.
- For some highly non-separable problems, localized linear projections may easily solve the problem. New useful features: random linear projections, principal components derived from data, or projection pursuit algorithms based on Quality of Projected Clusters (QPC).
- Appropriate feature space ⇒ optimal solutions; learn from other models what interesting features they have discovered: prototypes, linear combinations, or fragments of branches in decision trees.
- The final model (linear discrimination, Naive Bayes, nearest neighbor or a decision tree) is secondary if an appropriate space has been set up.
SFM vs SVM
SFM generalizes SVM by explicitly building an enhanced space that includes kernel features $z_i(x) = k(x, x_i)$ together with any other features that may provide useful information. This approach has several advantages compared to standard SVM:
- With an explicit representation of features, interpretation of the discriminant function is as simple as in any linear discrimination method.
- Kernel-based SVM is equivalent to linear SVM in the explicitly constructed kernel space, therefore enhancing this space should lead to improvement of results.
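The equivalence claimed above can be written out as a short worked equation. This is a sketch in standard SVM notation, not taken from the slides: the multipliers $\alpha_i$, labels $y_i$, bias $b$, and the coefficients $v_j$ of the extra features $r_j(x)$ are symbols assumed here for illustration.

```latex
% Kernel SVM discriminant, with multipliers \alpha_i, labels y_i and bias b:
% it is linear in the kernel features z_i(x) = k(x, x_i).
g(x) = \sum_{i} \alpha_i y_i \, k(x, x_i) + b
     = \sum_{i} w_i \, z_i(x) + b, \qquad w_i = \alpha_i y_i .

% Enhanced (support feature) space: kernel features plus other features r_j(x).
g_{\mathrm{SFM}}(x) = \sum_{i} w_i \, z_i(x) + \sum_{j} v_j \, r_j(x) + b .
```

The first line shows only that the discriminant has the same functional form in both views; the second adds further explicit features to the same linear model.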
SFM vs SVM
- Kernels with various parameters may be used, including various degrees of localization, and the resulting discriminant may select global features, combining them with local features that handle exceptions.
- Complexity of SVM is $O(n^2)$ due to the need to generate the kernel matrix; SFM may select a smaller number of kernel features from those vectors that project on overlapping regions in linear projections.
- Many feature selection methods may be used to estimate the usefulness of new features that define the support feature space.
- Many algorithms may be used in the support feature space to generate the final solution.
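One possible reading of "vectors that project on overlapping regions" is sketched below for two classes. The slides do not specify the selection rule, so everything here is an assumption: the overlap is taken as the interval shared by the projected ranges of the two classes, and `overlap_candidates` is a hypothetical helper name.

```python
import numpy as np

def overlap_candidates(X, y, w):
    """Indices of vectors whose projection on w lies where classes 0 and 1 overlap."""
    r = X @ w
    lo = max(r[y == 0].min(), r[y == 1].min())
    hi = min(r[y == 0].max(), r[y == 1].max())
    if lo > hi:  # the projected classes do not overlap at all
        return np.array([], dtype=int)
    return np.flatnonzero((r >= lo) & (r <= hi))

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0],
              [2.5, 0.0], [4.0, 1.0], [5.0, 0.0], [6.0, 1.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Direction connecting the two class centers.
w = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)

# Only these vectors would serve as reference vectors for kernel features.
idx = overlap_candidates(X, y, w)
```

Under this reading only two of the eight vectors become kernel centers, which is how the number of kernel features could stay well below $n$.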
SFM
The SFM algorithm starts from standardization, followed by feature selection (Relief, keeping only features with positive weights). The reduced, but still high-dimensional, data are used to generate two types of new features:
- Projections on $m = N_c(N_c - 1)/2$ directions obtained by connecting pairs of centers, $w_{ij} = c_i - c_j$, where $c_i$ is the mean of all vectors that belong to class $C_i$, $i = 1 \ldots N_c$. In high-dimensional space such features $r_i(x) = w_i \cdot x$ help a lot (hist). FDA ⇒ better directions, but more expensive.
- Features based on kernels. Many types of kernels may be mixed together, including the same types of kernels with different parameters (here only Gaussian kernels with fixed dispersion $\beta$): $t_i(x) = \exp(-\beta \sum |x_i - x|^2)$. QPC is run on this feature space, generating additional orthogonal directions that are useful as new features. $N_Q = 5$ is used here, but selecting it by CV should work better.
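The first feature type, projections on directions connecting class centers, can be sketched as follows. The helper name `center_directions` and the toy data are illustrative; the formula $w_{ij} = c_i - c_j$ with $i > j$ follows the slide.

```python
import numpy as np
from itertools import combinations

def center_directions(X, y):
    """Return directions w_ij = c_i - c_j (i > j) and the matching class pairs."""
    classes = np.unique(y)
    centers = {c: X[y == c].mean(axis=0) for c in classes}
    # combinations yields (smaller, larger); swap so that each pair is (i, j) with i > j.
    pairs = [(ci, cj) for cj, ci in combinations(classes, 2)]
    W = np.array([centers[ci] - centers[cj] for ci, cj in pairs])
    return W, pairs

# Three classes in two dimensions, two vectors per class.
X = np.array([[0.0, 0.0], [0.0, 2.0],
              [4.0, 0.0], [4.0, 2.0],
              [0.0, 6.0], [2.0, 6.0]])
y = np.array([0, 0, 1, 1, 2, 2])

W, pairs = center_directions(X, y)
m = len(pairs)   # N_c (N_c - 1) / 2 directions
R = X @ W.T      # R[k, p] is the projection of vector k on direction p
```

With $N_c = 3$ classes this gives $m = 3$ new features per vector, regardless of the input dimensionality.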
Algorithm
1. Fix the Gaussian dispersion $\beta$ and the number of QPC features $N_Q$
2. Standardize the dataset
3. Normalize the length of each vector to 1
4. Perform Relief feature ranking, select only those features with positive weights $RW_i > 0$
5. Calculate class centers $c_i$, $i = 1 \ldots N_c$, create $m$ directions $w_{ij} = c_i - c_j$, $i > j$
6. Project all vectors on these directions, $r_{ij}(x) = w_{ij} \cdot x$ (features $r_{ij}$)
7. Create kernel features $t_i(x) = \exp(-\beta \sum |x_i - x|^2)$
8. Create $N_Q$ QPC directions $w_i$ in the kernel space, adding QPC features $s_i(x) = w_i \cdot x$
9. Build a linear model on the new feature space
10. Classify test data mapped into the new feature space
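The steps above can be sketched end to end in NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: the QPC step is omitted, Relief is the basic nearest-hit/nearest-miss variant, every training vector is used as a kernel reference, and a ridge-regularized least-squares model on one-hot targets stands in for the linear SVM. The names `SFMSketch`, `relief_weights`, `gaussian_features` and the `ridge` parameter are hypothetical.

```python
import numpy as np
from itertools import combinations

def relief_weights(X, y):
    """Basic Relief: a feature gains weight when it differs on the nearest miss
    (other class) and loses weight when it differs on the nearest hit (same class)."""
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    w = np.zeros(d)
    for i in range(n):
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf                      # never pick the vector itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))
        miss = np.argmin(np.where(~same, dist, np.inf))
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / span
    return w / n

def gaussian_features(A, refs, beta):
    sq = ((A[:, None, :] - refs[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-beta * sq)

class SFMSketch:
    def __init__(self, beta=1.0, ridge=1e-2):
        self.beta, self.ridge = beta, ridge   # beta is fixed in advance

    def _prepare(self, X):
        Xs = (X - self.mu) / self.sigma                     # standardize
        norms = np.linalg.norm(Xs, axis=1, keepdims=True)
        norms[norms == 0] = 1.0
        return Xs / norms                                   # unit-length vectors

    def _features(self, A):
        R = A @ self.W.T                                    # projections r_ij
        T = gaussian_features(A, self.refs, self.beta)      # kernel features t_i
        return np.hstack([R, T, np.ones((len(A), 1))])      # plus a bias column

    def fit(self, X, y):
        self.mu, self.sigma = X.mean(axis=0), X.std(axis=0)
        self.sigma[self.sigma == 0] = 1.0
        A = self._prepare(X)
        self.keep = relief_weights(A, y) > 0                # positive Relief weights
        if not self.keep.any():                             # fallback: keep everything
            self.keep[:] = True
        A = A[:, self.keep]
        self.classes = np.unique(y)
        centers = [A[y == c].mean(axis=0) for c in self.classes]
        self.W = np.array([centers[i] - centers[j]          # w_ij = c_i - c_j, i > j
                           for j, i in combinations(range(len(centers)), 2)])
        self.refs = A                                       # kernel reference vectors
        H = self._features(A)
        Y = (y[:, None] == self.classes[None, :]).astype(float)  # one-hot targets
        G = H.T @ H + self.ridge * np.eye(H.shape[1])
        self.coef = np.linalg.solve(G, H.T @ Y)             # ridge least squares
        return self

    def predict(self, X):
        A = self._prepare(X)[:, self.keep]
        return self.classes[np.argmax(self._features(A) @ self.coef, axis=1)]

# Synthetic two-class data: 2 informative features, 3 noise features.
rng = np.random.default_rng(0)
n = 30
X0 = rng.normal(0.0, 1.0, size=(n, 5)); X0[:, :2] -= 2.0
X1 = rng.normal(0.0, 1.0, size=(n, 5)); X1[:, :2] += 2.0
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

model = SFMSketch(beta=1.0).fit(X, y)
train_acc = (model.predict(X) == y).mean()
```

Test vectors go through the same standardization, normalization, feature mask, projections and kernel features as the training data, which is what "mapped into the new feature space" requires.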
SFM - summary
- In essence SFM requires construction of new features, followed by a simple linear model (LSVM) or any other learning model.
- More attention is paid to generation of features than to sophisticated optimization algorithms or new classification methods.
- Several parameters may be used to control the process of feature creation and selection, but here they are fixed or set in an automatic way. Solutions are given in the form of a linear discriminant function and thus are easy to understand.
- New features created in this way are based on those transformations of inputs that have been found interesting for some task, and thus have a meaningful interpretation.
Datasets
Results (SVM vs SFM in the kernel space only)
Results (SFM in extended spaces)
$K = K(X, X_i)$, $Z = WX$, $H = [Z_1, Z_2]$
Datasets
Results
Summary
- SFM focuses on generation of new features, rather than on improvement of optimization and classification algorithms. It may be regarded as an example of a mixture of experts, where each expert is a simple model based on projection on some specific direction (random, or connecting clusters), localization of projected clusters (QPC), optimized directions (for example by FDA), or kernel methods based on similarity to reference vectors. For some data kernel-based features are most important; for other data, projections and restricted projections discover more interesting aspects.
- Kernel-based SVM is equivalent to the use of kernel features combined with LSVM. Mixing different kernels and different types of features creates a much better enhanced feature space than a single-kernel solution.
- Complex data may require decision borders of different complexity, and it is rather straightforward to introduce multiresolution in the presented algorithm, for example using a different dispersion $\beta$ for every $t_i$, while in the standard SVM approach this is difficult to achieve.
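The multiresolution idea can be sketched by giving each kernel feature its own dispersion. How the individual $\beta_i$ values would be chosen is not stated in the slides; the values below are arbitrary and the function name is illustrative.

```python
import numpy as np

def multiresolution_kernel_features(X, refs, betas):
    """Kernel features t_i(x) = exp(-beta_i * ||x_i - x||^2), one beta per reference."""
    sq = ((X[:, None, :] - refs[None, :, :]) ** 2).sum(axis=2)
    # Broadcasting applies betas[i] to column i, i.e. to reference vector i.
    return np.exp(-np.asarray(betas)[None, :] * sq)

refs = np.array([[0.0, 0.0], [1.0, 0.0]])
betas = [0.5, 2.0]   # wide kernel around the first reference, narrow around the second
T = multiresolution_kernel_features(refs, refs, betas)
```

Because the features are built explicitly, mixing resolutions only changes how the columns of the feature matrix are computed; the linear model on top is unchanged.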
Thank You!