Support Feature Machine for DNA microarray data
Tomasz Maszczyk and Włodzisław Duch
Department of Informatics, Nicolaus Copernicus University, Toruń, Poland
RSCTC 2010

Plan
Main idea
SFM vs SVM
Description of our approach
Results
Conclusions

Main idea
SVM is based on linear discrimination (LDA) with margin maximization (good generalization, control of complexity).
Non-linear decision borders are linearized by projecting the data into a high-dimensional feature space.
Cover's theorem: such projections increase the probability that the data become linearly separable, flattening decision borders.
Kernel methods construct new features z_i(x) = k(x, x_i) around support vectors x_i (vectors close to the decision borders).
Instead of working in the original input space, SVM effectively works in the space of kernel features z_i(x), called "the kernel space".

Main idea
Is each support vector a useful feature? Kernel features based on SVs are optimal for data with particular distributions, but do not work on parity or other problems with complex logical structure.
For some highly non-separable problems localized linear projections may easily solve the problem. New useful features: random linear projections, principal components derived from the data, or projection pursuit algorithms based on the Quality of Projected Clusters (QPC) index, as sketched below.
Does an appropriate feature space guarantee optimal solutions? Learn from other models what interesting features they have discovered: prototypes, linear combinations, or fragments of branches in decision trees.
The final model – linear discrimination, Naive Bayes, nearest neighbor or a decision tree – is secondary, if an appropriate space has been set up.
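As a concrete illustration of two of these feature types, here is a minimal Python sketch (assuming NumPy and scikit-learn, with illustrative data shapes; the QPC projection pursuit itself is not implemented):

```python
# Hypothetical sketch: random linear projections and principal components as
# new features; QPC-based projection pursuit is NOT implemented here.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(72, 2000))          # e.g. 72 samples, 2000 genes (illustrative)

# Random linear projections: unit-length random directions
W = rng.normal(size=(X.shape[1], 10))
W /= np.linalg.norm(W, axis=0)
random_features = X @ W

# Principal components derived from the data
pca_features = PCA(n_components=5).fit_transform(X)

new_features = np.hstack([random_features, pca_features])   # shape (72, 15)
```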

SFM vs SVM
SFM generalizes SVM by explicitly building an enhanced space that includes the kernel features z_i(x) = k(x, x_i) together with any other features that may provide useful information. This approach has several advantages compared to standard SVM:
With an explicit representation of features, interpretation of the discriminant function is as simple as in any linear discrimination method.
Kernel-based SVM is equivalent to linear SVM in the explicitly constructed kernel space, therefore enhancing this space should lead to improved results.
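The equivalence claim can be illustrated with a short sketch (assuming scikit-learn; the regularization terms of the two formulations differ slightly, so the accuracies are close rather than identical):

```python
# Hypothetical illustration: a linear SVM trained on explicit kernel features
# z_i(x) = k(x, x_i) behaves like a kernel SVM trained on the original inputs.
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

gamma = 0.05
kernel_svm = SVC(kernel="rbf", gamma=gamma).fit(Xtr, ytr)

Ztr = rbf_kernel(Xtr, Xtr, gamma=gamma)   # explicit kernel features z_i(x)
Zte = rbf_kernel(Xte, Xtr, gamma=gamma)
linear_svm_in_kernel_space = LinearSVC(max_iter=10000).fit(Ztr, ytr)

print(kernel_svm.score(Xte, yte), linear_svm_in_kernel_space.score(Zte, yte))
```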

SFM vs SVM
Kernels with various parameters may be used, including various degrees of localization, and the resulting discriminant may select global features, combining them with local features that handle exceptions.
Complexity of SVM is O(n²) due to the need to generate the kernel matrix; SFM may select a smaller number of kernel features, from those vectors that project onto overlapping regions of linear projections.
Many feature selection methods may be used to estimate the usefulness of the new features that define the support feature space.
Many algorithms may be used in the support feature space to generate the final solution.

SFM
The SFM algorithm starts from standardization, followed by feature selection (Relief, keeping only features with positive weights). Such reduced, but still high-dimensional, data is used to generate two types of new features:
Projections on m = N_c(N_c - 1)/2 directions obtained by connecting pairs of class centers, w_ij = c_i - c_j, where c_i is the mean of all vectors that belong to class C_i, i = 1...N_c. In high-dimensional spaces such features r_ij(x) = w_ij·x help a lot (as histograms of these projections show). FDA could provide better directions, but is more expensive.
Features based on kernels. Many types of kernels may be mixed together, including the same type of kernel with different parameters; here only Gaussian kernels with a fixed dispersion β are used, t_i(x) = exp(-β ||x - x_i||²). QPC is run on this feature space, generating additional orthogonal directions that are useful as new features. N_Q = 5 is used, but selecting it by crossvalidation should work better.

Algorithm
Fix the Gaussian dispersion β and the number of QPC features N_Q.
Standardize the dataset.
Normalize the length of each vector to 1.
Perform Relief feature ranking, keeping only features with positive weights RW_i > 0.
Calculate class centers c_i, i = 1...N_c; create m directions w_ij = c_i - c_j, i > j.
Project all vectors on these directions, r_ij(x) = w_ij·x (features r_ij).
Create kernel features t_i(x) = exp(-β ||x - x_i||²).
Create N_Q QPC directions w_i in the kernel space, adding QPC features s_i(x) = w_i·x.
Build a linear model in the new feature space.
Classify test data mapped into the new feature space.
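A minimal end-to-end sketch of this feature construction, assuming scikit-learn; Relief is replaced by a simple univariate filter and the QPC step is omitted, so this is an approximation of the procedure above rather than the authors' implementation:

```python
# Hypothetical sketch of the support-feature construction described above.
# Stand-ins: univariate F-score filter instead of Relief; QPC step omitted.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import StandardScaler, normalize

def sfm_features(X_train, y_train, X_test, beta=0.01, k_keep=500):
    # Standardize the data and normalize each vector to unit length
    scaler = StandardScaler().fit(X_train)
    Xtr = normalize(scaler.transform(X_train))
    Xte = normalize(scaler.transform(X_test))

    # Feature filtering (stand-in for Relief with positive weights)
    sel = SelectKBest(f_classif, k=min(k_keep, Xtr.shape[1])).fit(Xtr, y_train)
    Xtr, Xte = sel.transform(Xtr), sel.transform(Xte)

    # Directions connecting pairs of class centers: w_ij = c_i - c_j, i > j
    classes = np.unique(y_train)
    centers = np.array([Xtr[y_train == c].mean(axis=0) for c in classes])
    W = np.array([centers[i] - centers[j]
                  for i in range(len(classes)) for j in range(i)])
    Rtr, Rte = Xtr @ W.T, Xte @ W.T        # projection features r_ij(x)

    # Gaussian kernel features t_i(x) = exp(-beta * ||x - x_i||^2)
    Ktr = rbf_kernel(Xtr, Xtr, gamma=beta)
    Kte = rbf_kernel(Xte, Xtr, gamma=beta)

    # (QPC directions in the kernel space would be added here.)
    return np.hstack([Rtr, Ktr]), np.hstack([Rte, Kte])

# Usage: build the support feature space, then train any linear model on it, e.g.
#   from sklearn.svm import LinearSVC
#   Htr, Hte = sfm_features(X_train, y_train, X_test)
#   clf = LinearSVC().fit(Htr, y_train); y_pred = clf.predict(Hte)
```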

SFM – recap
In essence, SFM requires construction of new features, followed by a simple linear model (linear SVM) or any other learning model.
More attention is paid to the generation of features than to sophisticated optimization algorithms or new classification methods.
Several parameters may be used to control the process of feature creation and selection, but here they are fixed or set in an automatic way. Solutions are given in the form of a linear discriminant function and are thus easy to understand.
New features created in this way are based on those transformations of inputs that have been found interesting for some task, and thus have a meaningful interpretation.

Datasets

Results (SVM vs SFM in the kernel space only)

Results (SFM in extended spaces): K = K(X, X_i), Z = WX, H = [Z_1, Z_2]

Datasets

Results

Summary
SFM is focused on the generation of new features rather than on improvement of optimization and classification algorithms. It may be regarded as an example of a mixture of experts, where each expert is a simple model based on projection on some specific direction (random, or connecting clusters), localization of projected clusters (QPC), optimized directions (for example by FDA), or kernel methods based on similarity to reference vectors. For some data kernel-based features are most important; for other data, projections and restricted projections discover more interesting aspects.
Kernel-based SVM is equivalent to the use of kernel features combined with linear SVM. Mixing different kernels and different types of features creates a much better enhanced feature space than a single-kernel solution.
Complex data may require decision borders of different complexity, and it is rather straightforward to introduce multiresolution in the presented algorithm, for example by using a different dispersion β for every t_i (see the sketch below), while in the standard SVM approach this is difficult to achieve.
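A minimal sketch of this multiresolution idea, assuming scikit-learn: kernel feature blocks t_i(x) computed with several dispersions β are simply concatenated, and a single linear model then weights features from all resolutions (the β values below are illustrative only):

```python
# Hypothetical sketch of multiresolution kernel features: concatenate Gaussian
# kernel feature blocks computed with several dispersions beta and let one
# linear model select among them.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def multiresolution_kernel_features(X, X_ref, betas=(0.001, 0.01, 0.1)):
    # One block of kernel features per dispersion, stacked side by side
    return np.hstack([rbf_kernel(X, X_ref, gamma=b) for b in betas])

# Usage (X_train, y_train, X_test assumed to exist):
#   from sklearn.svm import LinearSVC
#   Htr = multiresolution_kernel_features(X_train, X_train)
#   Hte = multiresolution_kernel_features(X_test, X_train)
#   clf = LinearSVC().fit(Htr, y_train); y_pred = clf.predict(Hte)
```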

Thank You!