Gene selection using Random Voronoi Ensembles
Stefano Rovetta, Department of Computer and Information Sciences, University of Genoa, Italy
Francesco Masulli, Department of Computer Science, University of Pisa, Italy

The input selection problem
Hard: given d inputs, there are 2^d possible subsets, and there is no guarantee that a larger subset performs better or worse than a smaller one (a.k.a. no monotonicity).
Classic: a large literature dating back to about the mid-seventies.
Important: curse of dimensionality, generalization, cost of measurements, cost of computation...

A different perspective
Although old, the input selection problem is actively studied today.
From optimization... the classic approach: improve training speed, generalization ability, or computational resource requirements.
...to model analysis: the mainstream approach today is to find the subset of inputs that accounts most for the observed phenomenon.
A tool for scientific inquiry, not for system design.

Gene selection
Bioinformatics is where input selection is currently a hot topic.
DNA microarrays provide bulk simultaneous measurements, e.g., gene expression levels.
We have to find out which genes are the most relevant to a given pathology (good candidates to be the true cause).
We are interested in a specific approach: assessing the relative importance of each input variable (gene).

Problem statement
We address:
– classification problems
– with 2 classes only, to simplify the analysis (the approach can be extended to multiclass)
– seeking a saliency ranking
– on a d-dimensional vector space: x ∈ ℝ^d
A single separating function is assumed, denoted by g(x).

Outline of the technique
The proposed technique has three components:
1 – a local analysis step with a basic classifier
2 – a resampling procedure to iterate step 1
3 – an integration step

Saliency (or importance, or sensitivity, or...)
Many definitions exist. Intuitively: some attribute of an input variable that measures its influence on the solution of a given (classification) problem.
The derivative of the output w.r.t. each input variable is a natural measure of influence:
∇g(x) = (∂g(x)/∂x_1, ..., ∂g(x)/∂x_d)
But...

Finite sample effects
The rule is learned from a training set, so it is subject to random variability.
Derivatives pick up local fluctuations: it is often better to study difference ratios
( f(x+Δ) – f(x) ) / Δ
rather than derivatives f'(x).
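
As a concrete illustration, here is a minimal Python sketch (not from the slides; the step size Δ and the toy decision function are assumptions) of estimating per-input influence with difference ratios instead of analytical derivatives:

```python
import numpy as np

def difference_ratio_saliency(g, x, delta=0.1):
    """Estimate the influence of each input on g at point x
    using difference ratios (g(x + delta*e_i) - g(x)) / delta."""
    x = np.asarray(x, dtype=float)
    base = g(x)
    saliency = np.empty_like(x)
    for i in range(x.size):
        x_step = x.copy()
        x_step[i] += delta
        saliency[i] = (g(x_step) - base) / delta
    return saliency

# Toy usage with a hypothetical decision function
g = lambda x: np.tanh(2.0 * x[0] - 0.5 * x[1])
print(difference_ratio_saliency(g, np.array([0.3, 1.2]), delta=0.1))
```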

Use of linear separators
If the decision function is of the form g(x) = w · x, then the derivatives w.r.t. the inputs are constant and given directly by the coefficient vector w.
SVMs can provide the optimal linear separator w.r.t. a given generalization bound; 2-norm soft-margin optimization gives a bound on the generalization error based on the (soft) margin.
Such linear separators are robust to sample variations (they depend on the support vectors only).
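
A hedged sketch of this idea with scikit-learn's linear SVM (the synthetic data and the particular soft-margin formulation are illustrative assumptions, not the authors' exact setup): the coefficient vector w is read off the fitted model and normalized into a saliency ranking.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative only: a linear soft-margin SVM on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)   # only features 0 and 2 matter

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w = clf.coef_.ravel()                            # constant derivatives of g(x) = w.x + b
saliency = np.abs(w) / np.abs(w).max()           # normalized saliency ranking
print(np.argsort(-saliency), saliency.round(2))
```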

Local analysis
The linear separator is applied on a local basis: a nonlinear g(x) can be studied by local linearization.
Voronoi partitioning: a Voronoi tessellation is performed on the training set, and the linear analysis is applied within each Voronoi polyhedron (a localized subset of training samples).
We obtain a saliency ranking directly as t = w / max_i{w_i} (signs can be discarded and analyzed separately).
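
One possible reading of this step, sketched in Python (drawing the random sites from the training set, the minimum cell size, and normalizing by the largest coefficient magnitude are assumptions made for illustration):

```python
import numpy as np
from sklearn.svm import SVC

def local_voronoi_saliency(X, y, n_sites=4, rng=None):
    """One random Voronoi tessellation: pick sites, assign each sample to its
    nearest site, fit a linear separator in every cell containing both classes,
    and return the (saliency t, site y) pairs for those cells."""
    rng = np.random.default_rng(rng)
    sites = X[rng.choice(len(X), size=n_sites, replace=False)]
    cell = np.argmin(((X[:, None, :] - sites[None]) ** 2).sum(-1), axis=1)
    pairs = []
    for k in range(n_sites):
        Xk, yk = X[cell == k], y[cell == k]
        if len(np.unique(yk)) < 2 or len(yk) < 5:   # skip degenerate cells (assumed threshold)
            continue
        w = SVC(kernel="linear", C=1.0).fit(Xk, yk).coef_.ravel()
        t = w / np.abs(w).max()                     # local saliency ranking, sign kept
        pairs.append((t, sites[k]))
    return pairs
```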

Drawbacks
Several, mainly border effects and small sample size within the Voronoi polyhedra.
Solution: resampling. The Voronoi tessellation is performed several times, with a different random Voronoi tessellation each time.
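
Continuing the sketch above (it reuses the hypothetical local_voronoi_saliency function defined there; the number of resamplings is an assumed parameter), the resampling step simply repeats the random tessellation and pools the local results:

```python
import numpy as np

def ensemble_local_saliency(X, y, n_resamples=20, n_sites=4, seed=0):
    """Repeat the random Voronoi tessellation and pool the (t_i, y_i) pairs."""
    rng = np.random.default_rng(seed)
    pooled = []
    for _ in range(n_resamples):
        pooled.extend(local_voronoi_saliency(X, y, n_sites=n_sites,
                                             rng=int(rng.integers(1 << 31))))
    return pooled   # list of (t_i, y_i) pairs fed to the integration step
```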

An ensemble method
The procedure can be seen as an ensemble of localized linear classifiers; the necessary classifier diversity is provided by the random Voronoi tessellations.
What we need next: integration of the local analyses.

Integrating by clustering
For each Voronoi polyhedron of each resampling step, we obtain a pair of d-dimensional vectors (or a combined 2d-dimensional vector)
v_i = (t_i, y_i)
where t_i is the saliency ranking and y_i is the Voronoi centroid (site).
To integrate the local analyses, we perform a c-means-type clustering on the vectors v_i.
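
A rough sketch of the integration step; since the Graded Possibilistic c-Means algorithm is not available in scikit-learn, plain k-means stands in for it here — an explicit substitution, not the authors' method:

```python
import numpy as np
from sklearn.cluster import KMeans

def integrate_by_clustering(pairs, n_clusters=3, seed=0):
    """Stack each local saliency t_i with its Voronoi site y_i into a combined
    2d-dimensional vector v_i and cluster; return the averaged (t, y) per cluster."""
    V = np.array([np.concatenate([t, site]) for t, site in pairs])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(V)
    d = V.shape[1] // 2
    # Each cluster center splits back into an averaged saliency pattern and a location
    return [(c[:d], c[d:]) for c in km.cluster_centers_]
```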

Some details on the clustering step
- The clustering technique is the Graded Possibilistic c-Means algorithm.
- The dimensionality problem is easily tackled by working only within the subspace spanned by the training set.
- Clusters are obtained by merging (averaging) sets of vectors v_i that are close either in their y (location) or in their t (saliency pattern) components.
- The number of clusters currently has to be prespecified (as in standard c-means); it is independent of the number of Voronoi sites used.
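
The slides do not spell out how the subspace restriction is implemented; one standard way, shown here purely as an assumption, is to project the (centered) data onto the right singular vectors of the training matrix:

```python
import numpy as np

def project_to_training_subspace(X, tol=1e-10):
    """Work in the subspace spanned by the n training samples (n << d for
    microarray data): project onto the right singular vectors of centered X."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    basis = Vt[s > tol]          # orthonormal basis of the spanned subspace
    return Xc @ basis.T, basis   # coordinates (n x rank) and basis (rank x d)
```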

Results “Leukemia” data set by Golub et al.

Discussion and future work
The results indicate that some of the genes identified as important in the original work by Golub et al. are also found to be important by our approach. Extensive validation (with the help of domain experts or biologists) remains to be done.
The direction (sign) of the saliency has always been found to agree with the statistical correlation indicated in the original work.
Further experiments: a new data set (still unpublished) is currently being investigated.
An interesting tweak: replacing the general c-means-type clustering with a technique specifically tailored to rank data.