4th NETTAB Workshop, Camerino, 5th-7th September 2004
Alberto Bertoni, Raffaella Folgieri, Giorgio Valentini


Random subspace ensembles for the bio-molecular diagnosis of tumors
NETTAB 2004 Workshop on Models and Metaphors from Biology to Bioinformatics Tools

Outline
- The problem of the bio-molecular diagnosis of tumors using gene expression data
- Current approaches to bio-molecular diagnosis
- The Random Subspace (RS) ensemble method
- Reasons to apply RS ensembles to the bio-molecular diagnosis of tumors
- Experimental results
- Open problems

Bio-molecular diagnosis of malignancies: motivations
- Traditional clinical diagnostic approaches may sometimes fail in detecting tumors (Alizadeh et al., 2001)
- Several results showed that bio-molecular analysis of malignancies (e.g. gene expression profiling) may help to better characterize them
- Information supporting both diagnosis and prognosis of malignancies at the bio-molecular level may be obtained from high-throughput bio-technologies (e.g. DNA microarrays)

Bio-molecular diagnosis of malignancies: current approaches
- Huge amounts of data are available from bio-technologies: the analysis and extraction of significant biological knowledge is critical
- Current approaches: statistical methods and machine learning methods (Golub et al., 1999; Furey et al., 2000; Ramaswamy et al., 2001; Khan et al., 2001; Dudoit et al., 2002; Lee & Lee, 2003; Weston et al., 2003)

Main problems with gene expression data for bio-molecular diagnosis
- High dimensionality combined with low cardinality: the curse of dimensionality
- Data are usually noisy: noise in the gene expression measurements and labeling errors

Current approaches against the curse of dimensionality
- Selection of significant subsets of components (genes), e.g. filter methods, forward selection, backward selection, recursive feature elimination, entropy- and mutual-information-based feature selection methods (see Guyon & Elisseeff, 2003 for a recent review)
- Extraction of significant subsets of features, e.g. Principal Component Analysis or Independent Component Analysis
Anyway, both approaches have problems...
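The two families of approaches can be illustrated with a short sketch (using scikit-learn rather than the tools cited on the slide; the data are synthetic, with dimensions chosen to mimic the colon data set discussed later):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for a gene expression matrix: 62 samples, 2000 "genes"
X, y = make_classification(n_samples=62, n_features=2000,
                           n_informative=20, random_state=0)

# Selection: keep the 50 genes with the largest ANOVA F-statistic (a filter method)
X_sel = SelectKBest(f_classif, k=50).fit_transform(X, y)

# Extraction: build 50 new features as the first principal components
X_pca = PCA(n_components=50).fit_transform(X)

print(X_sel.shape, X_pca.shape)  # both reduced to (62, 50)
```

Selection keeps a subset of the original genes (and so stays interpretable), while extraction builds new features as combinations of all genes; both reduce the dimensionality from 2000 to 50 here.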

An alternative approach based on ensemble methods
Random subspace (RS) ensembles:
- RS (Ho, 1998) reduces the high dimensionality of the data by randomly selecting subsets of genes
- Aggregating different base learners trained on different subsets of features may reduce variance and improve diversity
[Slide diagram: the data set D is projected onto random subspaces D1, ..., Dm; a learning algorithm trains hypotheses h1, ..., hm on them, which are aggregated into the final hypothesis h]

The RS algorithm
Input: a d-dimensional labelled gene expression data set
1. Select a random projection from the d-dimensional input space to a k-dimensional subspace
2. Project the data from the d-dimensional space into the selected k-dimensional subspace
3. Train a classifier on the resulting k-dimensional gene expression data, using a suitable learning algorithm
4. Repeat steps 1-3 m times
5. Aggregate the m trained classifiers by majority (or weighted) voting
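The five steps above can be sketched as follows. This is a minimal illustration with scikit-learn's LinearSVC as base learner (matching the linear SVMs used in the experiments); the function names are ours and this is not the original C++ NEURObjects implementation:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_rs_ensemble(X, y, k, m, seed=0):
    """Steps 1-4: train m linear SVMs, each on its own random k-dimensional subspace."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(m):
        feats = rng.choice(X.shape[1], size=k, replace=False)  # step 1: pick k of d genes
        clf = LinearSVC(dual=False).fit(X[:, feats], y)        # steps 2-3: project, train
        members.append((feats, clf))
    return members

def predict_rs_ensemble(members, X):
    """Step 5: aggregate the m classifiers by majority voting (binary 0/1 labels)."""
    votes = np.stack([clf.predict(X[:, feats]) for feats, clf in members])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

Any other learning algorithm could be plugged into step 3; the subspace dimension k and the ensemble size m are the tuning parameters studied in the results below.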

Reasons to apply RS ensembles to the bio-molecular diagnosis of tumors
- Gene expression data are usually very high dimensional; RS ensembles reduce the dimensionality and are effective with high-dimensional data (Skurichina and Duin, 2002)
- Co-regulated genes show correlated gene expression levels (see e.g. Gasch and Eisen, 2002), and RS ensembles are effective with correlated sets of features (Bingham and Mannila, 2001)
- Random projections may improve the diversity between base learners
- The overall accuracy of the ensemble may be enhanced through aggregation (at least w.r.t. the variance component of the error)

Experimental environment
We considered two bio-medical problems, both based on gene-expression profiles of relatively small groups of patients:
1. Colon adenocarcinoma diagnosis (Alon et al., 1999): 62 samples (40 colon tumors and 22 normal colon samples), 2000 genes
2. Medulloblastoma clinical outcome prediction (Pomeroy et al., 2002): 60 samples (39 survivors and 21 treatment failures), 7129 genes
Methods: RS ensembles with linear SVMs as base learners; single linear SVMs
Software: C++ NEURObjects library (Valentini, 2002)
Hardware: Avogadro cluster of dual-processor Xeon workstations

Results
Colon tumor prediction (5-fold cross-validation)
Medulloblastoma clinical outcome prediction (5-fold cross-validation)
[Results tables shown on the original slide; the numeric values are not preserved in the transcript]

Colon tumor prediction: error as a function of the subspace dimension
[Plot shown on the original slide, with the single SVM test error marked as a baseline]

Colon tumor prediction: sensitivity/specificity as a function of the subspace dimension

Average base learner error
The better accuracy of the RS ensembles does not simply depend on the better accuracy of their component base learners

Results summary
- Statistically significant difference in favour of RS ensembles vs. single SVMs
- RS ensembles outperform single SVMs for a large range of subspace dimensions
- No learning if too small a subspace dimension is selected (because of the low accuracy of the corresponding base learners)
- The results cannot be explained only through the accuracy of the base learners
What are the reasons for the effectiveness of the random subspace approach?

Effectiveness of the RS method and other open problems
1. Can we explain the effectiveness of RS through the diversity of the base learners?
2. Can we get a bias-variance interpretation?
3. Can we get quantitative relationships between dimensionality reduction, redundant features and correlation of gene expression levels?
4. What about the "optimal" subspace dimension?
5. Are feature selection and random subspace ensembles alternative approaches, or may it be useful to combine them?

Combining feature selection and random subspace ensemble methods
Random Subspace on Selected Features (the RS-SF algorithm), a two-step algorithm:
1. Select a subset of features (genes) according to a suitable feature selection method
2. Apply the random subspace ensemble method to the subset of selected features
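A minimal sketch of the two steps follows. The slide does not fix a particular feature selection method, so the univariate F-test filter used for step 1, and the helper names, are illustrative assumptions:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC

def train_rs_sf(X, y, n_selected, k, m, seed=0):
    # Step 1: keep the n_selected genes ranked best by a filter method (here: F-test)
    selector = SelectKBest(f_classif, k=n_selected).fit(X, y)
    Xs = selector.transform(X)
    # Step 2: random subspace ensemble restricted to the selected genes
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(m):
        feats = rng.choice(n_selected, size=k, replace=False)
        members.append((feats, LinearSVC(dual=False).fit(Xs[:, feats], y)))
    return selector, members

def predict_rs_sf(selector, members, X):
    Xs = selector.transform(X)  # apply the same gene selection at prediction time
    votes = np.stack([clf.predict(Xs[:, feats]) for feats, clf in members])
    return (votes.mean(axis=0) >= 0.5).astype(int)  # majority vote, binary 0/1 labels
```

The random subspaces are thus drawn only from genes that already passed the filter, which removes many uninformative genes before the ensemble stage.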

Preliminary results on combining feature selection with random subspace ensembles - 1
[Results table: columns Test error, St. dev., Train error, St. dev., Sensitivity, Specificity, Precision; rows RS-SF ensemble, RS ensemble, Single FS-SVM, Single SVM; the numeric values are not preserved in the transcript]

Preliminary results on combining feature selection with random subspace ensembles - 2

Conclusions
- RS ensembles can improve the accuracy of bio-molecular diagnosis tasks characterized by very high-dimensional data
- Several questions about the reasons for the effectiveness of the proposed approach remain open
- A new promising approach consists in combining feature (gene) selection with RS ensembles