
A comparison of PLS-based and other dimension reduction methods for tumour classification using microarray data
Cameron Hurst, Institute of Health and Biomedical Innovation, Queensland University of Technology
Janet Chaseling, Griffith University
Michael Steele, Bond University

Developments in ‘omics Developments in genomics, proteomics and metabolomics have seen the generation of huge amounts of data. Of particular interest in this study are large-scale microarray studies.

Microarray data Microarray datasets involve the expression levels of many genes (circa 1-50K). Gene expression data can be thought of as representing how much a gene is turned off/on. To date, microarray studies have generally involved a small number of patients (<100).

Microarray studies There are a number of reasons why microarray studies are conducted. I will focus on the area of cancer classification studies. However, the techniques I am evaluating readily translate to other types of ‘omic classification studies.

Objectives in cancer studies based on microarrays There are three main types of objectives in microarray experiments involving different cancers: Identification of genes that are differentially expressed among cancer classes Class discovery Class prediction (e.g. tumour classification) It is this last objective that is the focus of my study. However, the first objective can also be examined in this type of study.

Tumour class classification [Diagram: the data matrix X (n samples × p genes) is mapped to a class vector Y (n samples × 1) whose entries are the tumour classes, Class 1 … Class k]

Analytical problems in tumour classification The large number of genes and small number of subjects in microarray datasets present problems for many traditional statistical methods. This ‘small n, large p’ problem has led to two main approaches to the analysis of these data: Machine learning methods Dimension reduction methods Which approach is used has usually been a matter of the discipline of the analyst (informatics or statistics).

Many methods but limited knowledge A large number of methods have been proposed for tumour classification. Many previous studies comparing tumour classification methods have not been systematic about comparing ‘classes’ of methods, i.e. previous comparisons have differed in a number of ways, so it is unclear which properties of the methods represent their relative strengths and weaknesses. So there is a lack of knowledge about the reasons for differences in performance.

Techniques considered….. The present study involves the comparison of three ‘classes’ of methods for tumour classification. 1. Indirect methods Principal Component Analysis (PCA) → DFA Spectral Map Analysis (SMA) → DFA 2. Canonical (direct) ordination methods Redundancy Analysis (RDA) Canonical Correspondence Analysis (CCoA) Canonical Analysis of Principal Coordinates (CAP)

Techniques considered….. The within-class differences among these techniques are really about the analysis space employed. For example, the main difference between the canonical ordination methods Redundancy Analysis and Canonical Correspondence Analysis is that the former eigen-decomposes a Euclidean dissimilarity matrix and the latter a χ² dissimilarity matrix, i.e. differences in their performance reflect implicit data standardizations.
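To make the ‘analysis space’ point concrete, here is a small NumPy sketch (illustrative only, not code from the study) computing the pairwise Euclidean distances eigen-decomposed by RDA and the χ² distances between row profiles implicit in correspondence-analysis methods:

```python
import numpy as np

def euclidean_dist(X):
    # Pairwise Euclidean distances between samples (rows), as used by RDA
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def chi2_dist(X):
    # Pairwise chi-squared distances between row profiles, as implicitly
    # used by correspondence-analysis methods (X must be non-negative)
    P = X / X.sum()                    # relative frequencies
    r = P.sum(axis=1, keepdims=True)   # row masses
    c = P.sum(axis=0)                  # column masses
    prof = P / r                       # row profiles
    diff = prof[:, None, :] - prof[None, :, :]
    return np.sqrt((diff ** 2 / c).sum(axis=2))

# Toy expression matrix: 3 samples x 3 genes
X = np.array([[10., 2., 8.],
              [ 9., 3., 7.],
              [ 1., 8., 2.]])
print(euclidean_dist(X).round(2))
print(chi2_dist(X).round(2))
```

The same samples can rank quite differently under the two metrics, which is the sense in which each method carries an implicit data standardization.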

Techniques considered….. 3. PLS-based methods PLS data reduction followed by Linear Discriminant Analysis (PLS) PLS data reduction followed by a ridge-penalized logistic regression – ridge PLS (rPLS) PLS data reduction followed by iteratively reweighted least squares regression – generalized PLS (gPLS) All of these methods involve a two-step process in which PLS components are derived (step 1) and subsequently used to train some classification rule (step 2). In this respect, these methods differ only in the classification rule used in step 2.

[Diagram: STEP 1, PLS dimension reduction, maps X (n samples × p genes) to a score matrix T (n samples × m components, m << p); STEP 2, discriminant analysis, maps T to the tumour class Y (n samples × (k−1)), with classes Class 1 … Class k]

Comparison of methods….. All eight methods were run on three benchmark datasets: Colon cancer: p = 2000 genes; n = 62 (ncases = 40 + ncontrols = 22) Small Round Blue Cell Tumours (SRBCT): p = 2308 genes; n = 83 (distributed among 4 classes) Brain tumours: p = 5597 genes; n = 42 (distributed among 5 classes)

Comparison of methods…… Effectiveness of classification rules was gauged using misclassification rates based on leave-one-out cross-validation. As the number of components retained represents a meta-parameter for both the indirect and PLS-based methods, misclassifications were evaluated using 2, 4 and 6 components. Methods were also run on datasets which had or had not been reduced using a priori feature selection (Significance Analysis of Microarrays).

Example: Results for SRBCT dataset Here, PLS methods outperformed all other methods. Indirect methods only performed comparably where either a priori feature selection was employed, or where a larger number of components was retained. PLS misclass. < Canonical ordination misclass. < Indirect misclass.

Results… Indirect methods performed comparably only when non-differentially expressed genes were removed prior to classification rule training. This removal made little difference to PLS-based methods.

Results…. Canonical methods were highly inconsistent in their performance across datasets, varying from quite effective in the colon and SRBCT datasets to very poor (worse than indirect methods) for the brain tumour dataset.

Results….. PLS-based methods generally performed better than both indirect and canonical ordination methods In most cases, rPLS and gPLS were superior to PLS using linear discriminant functions

Conclusions Indirect methods should be restricted to a class discovery role (e.g. examining gene-gene and gene-patient interactions). Relative performance of techniques tended to align with the ‘class’ of the method (indirect, canonical ordination or PLS-based). That is, within a given class (e.g. among the indirect methods, or among the canonical ordination methods), the analysis space of the method tended to make little difference to misclassification rates.

Further work…. It is not clear which properties of the various datasets lead to inconsistencies in the performance of the classification methods. Protocols to systematically evaluate microarray classification methods need further development. The most promising avenue is likely to involve simulated microarray datasets, allowing aspects of the data to be systematically varied so that the strengths/limitations of the methods can be directly linked to data properties.
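One way such a simulation protocol might look (purely illustrative; the function and parameter names are my assumptions, not the authors'): a generator in which the signal strength, the number of informative genes and the within-sample correlation can each be varied systematically.

```python
import numpy as np

def simulate_microarray(n=60, p=2000, n_informative=50, effect=1.0,
                        rho=0.0, n_classes=2, seed=0):
    """Toy expression matrix with controllable signal strength (effect),
    number of informative genes, and within-sample correlation (rho)."""
    rng = np.random.default_rng(seed)
    y = np.arange(n) % n_classes
    z = rng.normal(size=(n, 1))           # shared latent factor per sample
    X = np.sqrt(rho) * z + np.sqrt(1 - rho) * rng.normal(size=(n, p))
    for k in range(1, n_classes):
        X[y == k, :n_informative] += k * effect   # class-specific mean shift
    return X, y

X, y = simulate_microarray(effect=1.5, rho=0.3)
print(X.shape, np.bincount(y))
```

Sweeping `effect`, `n_informative` or `rho` over a grid, then rerunning each classification method, would let strengths and limitations be tied directly to known data properties.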

Further work…. The focus in this study has been solely on dimension reduction methods Any comprehensive comparative study should include promising machine learning methods

Thank you Questions???