Evaluation of Supervised Learning Algorithms on Gene Expression Data CSCI 6505 – Machine Learning Adan Cosgaya Winter 2006 Dalhousie University.

Presentation transcript:

Evaluation of Supervised Learning Algorithms on Gene Expression Data
CSCI 6505 – Machine Learning
Adan Cosgaya
Winter 2006, Dalhousie University

2 / 18 Outline
- Introduction
- Definition of the Problem
- Related Work
- Algorithms
- Description of the Data
- Methodology of Experiments
- Results
- Relevance of Results
- Conclusions & Future Work

3 / 18 Introduction
- Machine learning has gained attention in the biomedical field, where there is a need to turn biomedical data into meaningful information.
- Microarray technology is used to generate gene expression data.
- Gene expression data involves a huge number of numeric attributes (gene expression measurements) but only a small number of instances.
- This work investigates the classification problem on such data.

4 / 18 Definition of the Problem: Classifying Gene Expression Data
- The number of features (n) is much greater than the number of sample instances (m): n >> m.
- Typical data: n > 5000 and m < 100.
- There is a high risk of overfitting due to the abundance of attributes and the shortage of available samples.
- The datasets produced by microarray experiments are high-dimensional and often noisy because of the experimental process.

5 / 18 Related Work
- Using gene expression data for classification has recently gained attention in the biomedical community.
- Golub et al. describe an approach to cancer classification based on gene expression, applied to human acute leukemia (ALL vs. AML).
- Rosenwald et al. developed a predictor of patient survival after chemotherapy (Alive vs. Dead).
- Furey et al. present a method to analyze microarray expression data using SVMs.
- Guyon et al. experiment with reducing the dimensionality of gene expression data.

6 / 18 Algorithms
- K-Nearest Neighbor (KNN): one of the simplest and most widely used algorithms for data classification.
- Naive Bayes (NB): assumes that the effect of a feature value on a given class is independent of the values of the other features.
- Decision Trees (DT): internal nodes represent tests on one or more attributes, and leaf nodes indicate decision outcomes.
- Support Vector Machines (SVM): work well on high-dimensional data.
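As a minimal sketch of how these four classifier families could be set up today, the snippet below uses scikit-learn; this is not the toolkit used in the original 2006 experiments, and all hyperparameters shown are illustrative assumptions.

```python
# Illustrative setup of the four classifier families named on this slide.
# scikit-learn is an assumption for demonstration; hyperparameters are not
# those of the original experiments.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=3),   # distance-based, no explicit model is built
    "NB":  GaussianNB(),                          # assumes feature independence given the class
    "DT":  DecisionTreeClassifier(max_depth=5),   # internal nodes test attributes, leaves give the class
    "SVM": SVC(kernel="linear"),                  # linear kernels are a common choice for high-dimensional data
}
```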

7 / 18 Description of the Data
- Leukemia dataset: a collection of 72 expression measurements. The samples are divided into two variants of leukemia: 25 samples of acute myeloid leukemia (AML) and 47 samples of acute lymphoblastic leukemia (ALL).
- Diffuse Large-B-Cell Lymphoma (DLBCL) dataset: biopsy samples examined for gene expression with DNA microarrays. Each sample is labeled with survival after chemotherapy for diffuse large-B-cell lymphoma (Alive, Dead).

8 / 18 Methodology of Experiments
- Feature selection: remove features that are irrelevant to classification (although they may have biological meaning); GainRatio is used for ranking.
- Selecting a supervised learning method: KNN, NB, DT, SVM.
- Testing methodology: evaluation over an independent test set (train/test splits of 66/34, 80/20, and 90/10) and 10-fold cross-validation; compare both methods and see if they are in logical agreement.
[Slide diagram: all features -> feature selection (gene subset) -> algorithm]
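The sketch below illustrates this evaluation protocol in Python with scikit-learn. The original work used GainRatio (e.g. as implemented in WEKA), so the mutual-information scorer here is an assumed stand-in, and the expression matrix X and labels y are synthetic placeholders with the dimensions quoted earlier.

```python
# Sketch of the evaluation protocol: rank genes, keep a subset, then assess a
# classifier with both a held-out split and 10-fold cross-validation.
# NOTE: SelectKBest with mutual information is an assumed stand-in for the
# GainRatio ranking used in the original experiments; X and y are toy data.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(72, 5000))      # toy stand-in: 72 samples, 5000 genes
y = rng.integers(0, 2, size=72)      # toy binary labels (e.g. ALL vs. AML)

# Feature selection lives inside the pipeline so it is re-fit on training data
# only, avoiding selection bias in the cross-validation estimate.
model = make_pipeline(SelectKBest(mutual_info_classif, k=50), SVC(kernel="linear"))

# Independent test set (a 66/34 split, one of the ratios on this slide)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.34, stratify=y, random_state=0)
print("hold-out accuracy:", model.fit(X_tr, y_tr).score(X_te, y_te))

# 10-fold cross-validation on the full data
print("10-fold CV accuracy:", cross_val_score(model, X, y, cv=10).mean())
```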

9 / 18 Methodology of Experiments (cont.)
Measuring performance:
- Accuracy
- Precision (p)
- Recall (r)
- F-Measure
It is hard to compare two classifiers using two separate measures. The F-Measure combines precision and recall into a single measure: it is the harmonic mean of precision and recall, so for F to be large, both p and r must be large.
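A tiny sketch of the harmonic-mean behaviour described above (the example values are hypothetical):

```python
# F-Measure as the harmonic mean of precision (p) and recall (r):
#   F = 2 * p * r / (p + r)
# F is large only when both p and r are large.
def f_measure(p, r):
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

print(f_measure(0.90, 0.60))  # 0.72: pulled toward the weaker of the two measures
```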

10 / 18 Results
Without feature selection (result tables not reproduced in the transcript):
- Naive Bayes and SVM perform better on one dataset.
- KNN and SVM perform better on the other.
Cross-validation results are lower; cross-validation uses nearly all the data for training and testing, giving a more realistic estimate.

11 / 18 Results (cont.)
With feature selection (result tables not reproduced in the transcript):
- KNN and SVM perform better on one dataset.
- NB and SVM perform better on the other.
There is an increase in overall accuracy, most noticeable on the DLBCL dataset.

12 / 18 Results (cont.)
[Table: summary of classification accuracies with cross-validation]
[Table: F-Measures for both datasets, with and without feature selection]

13 / 18 Relevance of Results
- Performance depends on the characteristics of the problem, the quality of the measurements in the data, and the ability of the classifier to find regularities in the data.
- Feature selection helps to minimize the use of redundant and/or noisy features.
- SVMs gave the best results: they perform well with high-dimensional data and also benefit from feature selection.
- Decision Trees had the worst overall performance; however, they still work at a competitive level.

14 / 18 Relevance of Results (cont.)
- Surprisingly, KNN behaves relatively well despite its simplicity, and it scales well to large feature spaces.
- On the Leukemia dataset, very high accuracies were achieved by all the algorithms; perfect accuracy was achieved in many cases.
- The DLBCL dataset shows lower accuracies, although feature selection improved them.
- Overall, the accuracy results are consistent with the F-measure results, giving us confidence in the relevance of the results obtained.

15 / 18 Conclusions & Future Work
- Supervised learning algorithms can be used to classify gene expression data from DNA microarrays with high accuracy.
- SVMs, by their very nature, deal well with high-dimensional gene expression data.
- We have verified that some subsets of features (genes) are more relevant than others and better separate the classes.
- The choice of one algorithm over another should be evaluated on a case-by-case basis.

16 / 18 Conclusions & Future Work (cont.)
- Feature selection proved beneficial for improving the overall performance of the algorithms. This idea can be extended to other feature selection methods or to data transformations such as PCA.
- Future work: analyze the effect of noisy gene expression data on the reliability of the classifier.
- While the scope of our experimental results is confined to a couple of datasets, the analysis can be used as a baseline for future applications of supervised learning algorithms to gene expression data.
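For the PCA alternative mentioned above, a minimal sketch of how the transformation could replace gene selection in the same pipeline; the component count and the placeholder names X_train, y_train are illustrative assumptions, not part of the original experiments.

```python
# Sketch of the PCA-based transformation mentioned as future work: project the
# expression matrix onto a few principal components before classification,
# instead of selecting individual genes. Component count is illustrative.
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

pca_model = make_pipeline(PCA(n_components=10), SVC(kernel="linear"))
# pca_model.fit(X_train, y_train)  # X_train: samples x genes, y_train: class labels (placeholders)
```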

17 / 18 References
- T.R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene-expression monitoring. Science, Vol. 286, 531-537, 1999.
- A. Rosenwald, G. Wright, W.C. Chan, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large B-cell lymphoma. New England Journal of Medicine, Vol. 346, 1937-1947, 2002.
- T.S. Furey, N. Cristianini, et al. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, Vol. 16, 906-914, 2000.
- I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. BIOWulf Technical Report.
- E. Alpaydin. Introduction to Machine Learning. The MIT Press.
- I.H. Witten, E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition. Morgan Kaufmann Publishers, 2005.
- Wikipedia.
- A. Brazma, H. Parkinson, T. Schlitt, M. Shojatalab. A quick introduction to elements of biology: cells, molecules, genes, functional genomics, microarrays. European Bioinformatics Institute.

18 / 18 Thank You!