Supervised gene expression data analysis using SVMs and MLPs Giorgio Valentini

Outline
A real problem: lymphoma gene expression data analysis by machine learning methods:
- Diagnosis of tumors using a supervised approach
- Discovering groups of genes related to carcinogenic processes
- Discovering subgroups of diseases using gene expression data

DNA microarray
DNA hybridization microarrays supply information about gene expression through measurements of the mRNA levels of large numbers of genes in a cell. They offer a snapshot of the overall functional status of a cell: virtually all differences in cell type or state are associated with changes in the mRNA levels of many genes. DNA microarrays have been used in mutational analyses, genetic mapping studies, genome-wide monitoring of gene expression, pharmacogenomics, and metabolic pathway analysis.

A DNA microarray image (E. coli)
- Each spot corresponds to the expression level of a particular gene
- Red spots correspond to over-expressed genes
- Green spots to under-expressed genes
- Yellow spots correspond to intermediate levels of gene expression

Analyzing microarray data by machine learning methods
The large amount of gene expression data requires machine learning methods to analyze and extract significant knowledge from DNA microarray data.

Unsupervised approach: no or limited a priori knowledge. Clustering algorithms are used to group together similar expression patterns:
- grouping sets of genes
- grouping different cells or different functional states of the cell
Examples: hierarchical clustering, fuzzy or possibilistic clustering, self-organizing maps.

Supervised approach: "a priori" biological and medical knowledge on the problem domain. Learning algorithms with labeled examples are used to associate gene expression data with classes:
- separating normal from cancerous tissues
- classifying different classes of cells on a functional basis
- predicting the functional class of unknown genes
Examples: multi-layer perceptrons, support vector machines, decision trees, ensembles of classifiers.
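
A minimal sketch contrasting the two approaches, in Python with scikit-learn and SciPy. The expression matrix and labels here are random placeholders, not the Lymphochip data; only the workflow is illustrated.

```python
# Sketch only: X is a (samples x genes) expression matrix, y the diagnostic
# labels; both are toy stand-ins for real microarray data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(96, 4026))      # toy stand-in for expression data
y = rng.integers(0, 2, size=96)      # toy stand-in for tumor / normal labels

# Unsupervised: hierarchical clustering of the samples, no labels used.
Z = linkage(X, method="average", metric="correlation")
clusters = fcluster(Z, t=2, criterion="maxclust")
labels, counts = np.unique(clusters, return_counts=True)
print("cluster sizes:", dict(zip(labels.tolist(), counts.tolist())))

# Supervised: a linear SVM trained on labeled examples,
# evaluated by 10-fold cross-validation.
svm = SVC(kernel="linear", C=1.0)
acc = cross_val_score(svm, X, y, cv=10, scoring="accuracy")
print("10-fold CV accuracy of the supervised SVM:", acc.mean())
```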

A real problem: a gene expression analysis of lymphoma
Biological problems:
1. Separating cancerous and normal tissues using the overall information available.
2. Identifying groups of genes specifically related to the expression of two different tumour phenotypes through expression signatures, using a two-step method: a priori knowledge and unsupervised methods select "candidate" subgroups; SVMs or MLPs identify the most correlated subgroups.

Machine learning methods:
- Support Vector Machines (SVM): linear, RBF and polynomial kernels
- Multi-Layer Perceptron (MLP)
- Linear Perceptron (LP)

The data
Data from a specialized DNA microarray, named "Lymphochip", developed at the Stanford University School of Medicine:
- 96 tissue samples from normal and cancerous populations of human lymphocytes
- 4026 different genes, preferentially expressed in lymphoid cells or with known roles in processes important in immunology or cancer
High-dimensional data and a small sample size: a challenging machine learning problem.

Types of lymphoma
Three main classes of lymphoma: Diffuse Large B-Cell Lymphoma (DLBCL), Follicular Lymphoma (FL) and Chronic Lymphocytic Leukemia (CLL), plus Transformed Cell Lines (TCL) and normal lymphoid tissues.

Type of tissue          Number of samples
Normal lymphoid cells   24
DLBCL                   46
FL                      9
CLL                     11
TCL                     6

Visualizing data with Tree View

The first problem: separating normal from cancerous tissues
Our first task consists in distinguishing cancerous from normal tissues using the overall information available, i.e. all the gene expression data. From a machine learning standpoint it is a dichotomic (two-class) classification problem.
Data characteristics:
- small sample size
- high dimensionality
- missing values
- noise
Main applicative goal: supporting functional-molecular diagnosis of tumors and polygenic diseases.
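
A minimal preprocessing-and-evaluation sketch for this dichotomic setting (Python/scikit-learn). Random toy data stand in for the 96 x 4026 matrix, with artificially inserted missing values; leave-one-out is attractive here because only 96 labeled samples are available.

```python
# Sketch only: impute missing expression values, standardize genes, and
# estimate the error of a linear SVM with leave-one-out cross-validation.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(96, 500))            # toy stand-in, not Lymphochip data
X[rng.random(X.shape) < 0.02] = np.nan    # simulate missing microarray spots
y = rng.integers(0, 2, size=96)           # toy tumor / normal labels

clf = make_pipeline(
    SimpleImputer(strategy="mean"),       # fill missing spots
    StandardScaler(),                     # put all genes on a comparable scale
    SVC(kernel="linear", C=1.0),
)
loo_err = 1.0 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
print("leave-one-out error estimate:", loo_err)
```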

Supervised approaches to molecular classification of diseases
Several supervised methods have been applied to the analysis of cDNA microarrays and high-density oligonucleotide chips:
- Decision trees
- Fisher linear discriminant
- Multi-Layer Perceptrons
- Nearest-Neighbour classifiers
- Linear discriminant analysis
- Parzen windows
- Support Vector Machines
Proposed by different authors: Golub et al. (1999), Pavlidis et al. (2001), Khan et al. (2001), Furey et al. (2000), Ramaswamy et al. (2001), Yeang et al. (2001), Dudoit et al. (2002).

Why use Support Vector Machines?
"General" motivations:
- SVMs are two-class classifiers theoretically founded on Vapnik's Statistical Learning Theory.
- They act as linear classifiers in a high-dimensional feature space originated by a projection of the original input space; the resulting classifier is in general non-linear in the input space.
- SVMs achieve good generalization performance by maximizing the margin between the classes.
- The SVM learning algorithm has no local minima.
"Specific" motivations:
- Kernels are well-suited to working with high-dimensional data.
- Small sample sizes require algorithms with good generalization capabilities.
- Automatic diagnosis of tumors requires highly sensitive and very effective classifiers.
- SVMs can identify mislabeled data (i.e. incorrect diagnoses).
- We could design specific kernels to incorporate "a priori" knowledge about the problem.
The margin-maximization idea is summarized by the standard soft-margin formulation below.
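
For reference, the standard soft-margin SVM training problem whose solution maximizes the margin (a textbook formulation, not taken from these slides; phi denotes the kernel-induced feature map and C the regularization factor varied in the experiments below):

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\; \frac{1}{2}\,\lVert \mathbf{w} \rVert^{2}
  + C \sum_{i=1}^{m} \xi_i
\quad \text{subject to} \quad
  y_i \left( \mathbf{w} \cdot \phi(\mathbf{x}_i) + b \right) \ge 1 - \xi_i,
  \qquad \xi_i \ge 0, \;\; i = 1, \dots, m
```

Because this problem is convex, the learning algorithm has no local minima, and the maximized margin underlies the generalization behaviour mentioned above.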

SVMs to classify cancerous and normal cells
We consider 3 standard SVM kernels: Gaussian, polynomial, dot-product (linear), varying:
- the values of the kernel parameters
- the regularization factor C
Estimation of the generalization error through:
- 10-fold cross-validation
- leave-one-out
We compare them with MLPs and LPs, varying:
- the number of hidden units
- the backpropagation parameters
A sketch of this comparison follows.
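
A hedged sketch of this comparison in Python with scikit-learn; the data, kernel parameters and network size are illustrative placeholders, not those used in the study.

```python
# Sketch: compare SVMs with three kernels, an MLP and a linear perceptron
# by 10-fold cross-validation and leave-one-out; toy data only.
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score, LeaveOneOut

rng = np.random.default_rng(2)
X = rng.normal(size=(96, 1000))   # toy stand-in for the expression matrix
y = rng.integers(0, 2, size=96)   # toy tumor / normal labels

models = {
    "SVM dot-product (linear)": SVC(kernel="linear", C=1.0),
    "SVM polynomial (deg 2)":   SVC(kernel="poly", degree=2, C=1.0),
    "SVM Gaussian (RBF)":       SVC(kernel="rbf", gamma=1e-3, C=1.0),
    "MLP (8 hidden units)":     MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000),
    "Linear perceptron":        Perceptron(max_iter=1000),
}
for name, model in models.items():
    cv10 = 1.0 - cross_val_score(model, X, y, cv=10).mean()             # 10-fold CV error
    loo = 1.0 - cross_val_score(model, X, y, cv=LeaveOneOut()).mean()   # leave-one-out error
    print(f"{name}: 10-fold error {cv10:.3f}, LOO error {loo:.3f}")
```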

Results (10-fold cross-validation and leave-one-out estimates of the error)
[Table: generalization error, standard deviation, precision and sensitivity for SVM-linear, SVM-poly, SVM-RBF, MLP and LP.]
- SVM-linear achieves the best results.
- High sensitivity, no matter what type of kernel function is used.
- The radial basis SVM shows a high misclassification rate and a high estimated VC dimension.

ROC analysis
- The ROC curve of the linear SVM is ideal.
- The polynomial SVM also achieves a reasonably good ROC curve.
- The RBF SVM shows a diagonal ROC curve: the highest sensitivity is achieved only when it completely fails to correctly detect normal cells.
- The ROC curve of the MLP is also nearly optimal.
- The linear perceptron shows a worse ROC curve, but with reasonable values lying in the upper-left part of the ROC plane.
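
One way to produce such curves is from out-of-fold decision scores, as in this minimal Python/scikit-learn sketch (toy data standing in for the expression matrix; it does not reproduce the curves discussed above).

```python
# Sketch: ROC curve and AUC from out-of-fold decision scores of a linear SVM.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(3)
X = rng.normal(size=(96, 1000))   # toy expression matrix
y = rng.integers(0, 2, size=96)   # toy tumor / normal labels

scores = cross_val_predict(SVC(kernel="linear", C=1.0), X, y,
                           cv=10, method="decision_function")
fpr, tpr, _ = roc_curve(y, scores)
print("area under the ROC curve:", auc(fpr, tpr))
```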

Summary of the results on the first problem
- Using hierarchical clustering, 14.6% of the examples are misclassified (Alizadeh et al., 2000), against 1.04% for the SVM, 2.08% for the MLP and 9.38% for the LP.
- Supervised methods exploit a priori biological knowledge (i.e. labeled data), while clustering methods use only gene expression data to group together different tissues, without any labels.
- Linear SVMs achieve the best results, but the MLP and the 2nd-degree polynomial SVM also show a relatively low generalization error.
- Linear SVMs and MLPs can be used to build classifiers with high sensitivity and a low rate of false positives.
- These results must be considered with caution, because the available data set is too small to infer general statements about the performance of the proposed learning machines.

The second problem: identifying DLBCL subgroups
It starts from a hypothesis of Alizadeh et al. about the existence of two distinct functional types of lymphoma inside DLBCL. We actually consider two problems:
1. Validation of Alizadeh's hypothesis. They identified two molecularly distinct subgroups of DLBCL: germinal centre B-like (GCB-like) and activated B-like (AB-like) cells. These two classes correspond to patients with very different prognoses.
2. Finding the groups of genes most related to this separation. Different subsets of genes could be responsible for the distinction between these two DLBCL subgroups: the expression signatures Proliferation, T-cell, Lymphnode and GCB (Lossos, 2000).

A feature selection approach based on "a priori" knowledge
Finding the most correlated genes would require evaluating an exponential number of gene subsets (2^n - 1), where n is usually of the order of thousands. We therefore need greedy algorithms and heuristic methods. Can we exploit "a priori" biological knowledge about the problem?

A heuristic method (1)
A two-stage approach:
I. Select groups of coordinately expressed genes.
II. Identify among them the ones most correlated to the disease.
We do not consider single genes; we consider only groups of coordinately expressed genes.

A heuristic method (2)
I. Selecting groups of coordinately expressed genes:
- use "a priori" biological and medical knowledge about groups of genes with known or suspected roles in carcinogenic processes, and/or
- use unsupervised methods, such as clustering algorithms, to identify coordinately expressed sets of genes.
II. Identifying the subgroups of genes most related to the disease:
1. Train a set of classifiers using only the subgroups of genes selected in the first stage.
2. Evaluate and rank the performance of the trained classifiers.
3. Select the subgroups whose corresponding classifiers achieve the best ranking.
(A sketch of stage II follows.)
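
A minimal sketch of stage II in Python/scikit-learn. The expression matrix, the labels and the column indices assigned to each signature are hypothetical placeholders; only the ranking logic is illustrated.

```python
# Sketch: rank candidate gene subgroups by the cross-validated error of a
# classifier trained on that subgroup alone (toy data, hypothetical indices).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(96, 4026))
y = rng.integers(0, 2, size=96)    # e.g. GCB-like vs AB-like labels

signatures = {                     # hypothetical column indices per subgroup
    "Proliferation": np.arange(0, 400),
    "T-cell":        np.arange(400, 700),
    "Lymphnode":     np.arange(700, 900),
    "GCB":           np.arange(900, 1100),
}
signatures["All"] = np.concatenate(list(signatures.values()))

ranking = []
for name, cols in signatures.items():
    err = 1.0 - cross_val_score(SVC(kernel="linear", C=1.0),
                                X[:, cols], y, cv=10).mean()
    ranking.append((err, name))

for err, name in sorted(ranking):  # best (lowest error) subgroup first
    print(f"{name}: estimated error {err:.3f}")
```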

Applying the heuristic method
1. Selecting "candidate" subgroups of genes. We used biological knowledge and hierarchical clustering algorithms to select four subgroups:
- Proliferation: sets of genes involved in the biological process of proliferation
- T-cell: genes preferentially expressed in T-cells
- Lymphnode: sets of genes normally expressed in lymph nodes
- GCB: genes that distinguish germinal centre B-cells from other stages in B-cell ontogeny
2. Identifying the subgroups of genes most related to the GCB-like / AB-like separation. SVMs, MLPs and LPs are trained as classifiers using each subgroup of genes and all the subgroups together (All): 5 classification tasks.
- Leave-one-out estimation with Gaussian, polynomial and linear SVMs.
- 10-fold cross-validation with Gaussian, polynomial and linear SVMs, MLP and LP.

GCB signature
[Table: generalization error, standard deviation, precision and sensitivity for SVM-linear, SVM-poly, SVM-RBF, MLP and LP trained on the GCB signature.]

All signatures
[Table: generalization error, standard deviation, precision and sensitivity for SVM-linear, SVM-poly, SVM-RBF, MLP and LP trained on all the signatures together.]

Results

The second problem: summary
- The results support the hypothesis of Alizadeh et al. about the existence of two distinct subgroups within DLBCL.
- The heuristic method identifies the GCB signature as a cluster of coordinately expressed genes related to the separation between the GCB-like and AB-like DLBCL subgroups.

Developments
I. Methods to discover subclasses of tumors on a molecular basis, integrating "a priori" biological knowledge, supervised machine learning methods and unsupervised clustering methods:
- stratifying patients into molecularly relevant categories, enhancing the discriminative power and precision of clinical trials
- new perspectives on the development of cancer therapeutics based on a molecular understanding of the cancer phenotype
II. Methods to identify small subsets of genes correlated to tumors:
- refinements of the proposed heuristic method, using clustering algorithms with semi-automatic selection of the number of significant subgroups of genes
- greedy algorithms based on mutual information measures
Expected outcomes: enhancing biological knowledge about tumoral processes, automatic diagnosis of tumors using DNA microchips, discovery of new subclasses of tumors.