Published by Sara Preston. Modified over 6 years ago.
1
A comparison of PLS-based and other dimension reduction methods for tumour classification using microarray data
Cameron Hurst - Institute of Health and Biomedical Innovation, Queensland University of Technology
Janet Chaseling - Griffith University
Michael Steele - Bond University
2
Developments in ‘omics
Developments in genomics, proteomics and metabolomics have generated huge amounts of data. Of particular interest in this study are large-scale microarray studies.
3
Microarray data
Microarray datasets involve the expression levels of many genes (circa 1-50K). Gene expression data can be thought of as representing how much a gene is turned off/on. To date, microarray studies have generally involved a small number of patients (<100).
4
Microarray studies
There are a number of reasons why microarray studies are conducted. I will focus on the area of cancer classification studies. However, the techniques I am evaluating readily translate to other types of ‘omic classification studies.
5
Objectives in cancer studies based on microarrays
There are three main types of objectives in microarray experiments involving different cancers:
- Identification of genes that are differentially expressed among cancer classes
- Class discovery
- Class prediction (e.g. tumour classification)
It is this last objective that is the focus of my study. However, the first objective can also be examined in this type of study.
6
Tumour class classification
[Figure: data matrix X (n samples x p genes) alongside response vector Y (n samples x 1), which gives the tumour class (Class 1, ..., Class k) of each sample]
7
Analytical problems in tumour classification
The large number of genes and small number of subjects in microarray datasets present problems for many traditional statistical methods. This ‘small n, large p’ problem has led to two main approaches to the analysis of these data:
- Machine learning methods
- Dimension reduction methods
Which approach is used has usually depended on the discipline of the analyst (informatics or statistics).
8
Many methods but limited knowledge
A large number of methods have been proposed for tumour classification. Many previous studies comparing tumour classification methods have not systematically compared ‘classes’ of methods: the comparisons have differed in a number of ways, so it is unclear which properties of the methods account for their relative strengths and weaknesses. As a result, little is known about the reasons for differences in performance.
9
Techniques considered…..
The present study compares three ‘classes’ of methods for tumour classification.
1. Indirect methods
   - Principal Component Analysis (PCA) followed by DFA
   - Spectral Map Analysis (SMA) followed by DFA
2. Canonical (direct) ordination methods
   - Redundancy Analysis (RDA)
   - Canonical Correspondence Analysis (CCoA)
   - Canonical Analysis of Principal Coordinates (CAP)
10
Techniques considered…..
The within-class differences among these techniques are really about the analysis space employed. For example, the main difference between the canonical ordination methods Redundancy Analysis and Canonical Correspondence Analysis is that the former eigen-decomposes a Euclidean dissimilarity matrix and the latter a χ² dissimilarity matrix, i.e. differences in their performance reflect implicit data standardizations.
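The contrast between the two analysis spaces can be illustrated with a small Python sketch. It is only an illustration: the toy table and the function names are invented here, and the chi-square distance is written in its usual correspondence-analysis form (row profiles weighted by inverse column masses), which assumes non-negative data. It is not taken from the presentation.

```python
import math


def euclidean_distance(a, b):
    """Plain Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def chi_square_distance(table, i, k):
    """Chi-square distance between rows i and k of a non-negative table.

    Rows are first converted to profiles (divided by their row totals),
    and each squared difference is weighted by the inverse column mass.
    """
    grand_total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    n_cols = len(table[0])
    col_masses = [sum(row[j] for row in table) / grand_total for j in range(n_cols)]
    profile_i = [x / row_totals[i] for x in table[i]]
    profile_k = [x / row_totals[k] for x in table[k]]
    return math.sqrt(
        sum((p - q) ** 2 / c for p, q, c in zip(profile_i, profile_k, col_masses))
    )


# Hypothetical toy data: 3 samples x 3 genes.
# Samples 0 and 1 share the same relative profile but differ in overall intensity.
expression = [
    [10.0, 20.0, 30.0],
    [20.0, 40.0, 60.0],
    [30.0, 10.0, 20.0],
]

d_euclid = euclidean_distance(expression[0], expression[1])
d_chi2 = chi_square_distance(expression, 0, 1)
print(d_euclid, d_chi2)
```

In this toy case the Euclidean distance separates samples 0 and 1 because of their different overall intensity, whereas the chi-square distance treats them as identical because it compares row profiles. This is one sense in which the choice of dissimilarity acts as an implicit standardization.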
11
Techniques considered…..
3. PLS-based methods
   - PLS data reduction followed by Linear Discriminant Analysis (PLS)
   - PLS data reduction followed by a ridge-penalized logistic regression: ridge PLS (rPLS)
   - PLS data reduction followed by iteratively reweighted least squares regression: generalized PLS (gPLS)
All of these methods involve a two-step process in which PLS components are derived (step 1) and subsequently used to train a classification rule (step 2). In this respect, these methods differ only in the classification rule used in step 2.
12
STEP 1: PLS dimension reduction STEP 2: Discriminant Analysis
[Figure: Step 1 uses X (n samples x p genes) and Y (n samples x (k-1), coding tumour class, Class 1, ..., Class k) to derive T (n samples x m components), with m << p. Step 2 uses T to predict Y (tumour class, Class 1, ..., Class k).]
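The two-step idea can be sketched in Python with NumPy for the two-class case. This is a minimal illustration under stated assumptions, not the authors' implementation: the simulated data, the function names (`fit_pls`, `fit_lda`, `predict_lda`), the NIPALS-style PLS1 algorithm and the equal-prior two-class LDA are all choices made here for brevity.

```python
import numpy as np


def fit_pls(X, y, m):
    """Step 1: derive m PLS components (NIPALS-style PLS1, single response).

    Returns the column means of X and a rotation matrix R such that
    scores are computed as T = (X - x_mean) @ R.
    """
    x_mean = X.mean(axis=0)
    Xd = X - x_mean
    yd = y - y.mean()
    W, P = [], []
    for _ in range(m):
        w = Xd.T @ yd                 # weights: covariance of each gene with y
        w = w / np.linalg.norm(w)
        t = Xd @ w                    # component scores
        p = Xd.T @ t / (t @ t)        # X loadings
        q = (yd @ t) / (t @ t)        # y loading
        Xd = Xd - np.outer(t, p)      # deflate X
        yd = yd - q * t               # deflate y
        W.append(w)
        P.append(p)
    W = np.array(W).T
    P = np.array(P).T
    R = W @ np.linalg.inv(P.T @ W)
    return x_mean, R


def fit_lda(T, labels):
    """Step 2: two-class linear discriminant rule on the PLS scores."""
    classes = np.unique(labels)
    T0, T1 = T[labels == classes[0]], T[labels == classes[1]]
    mu0, mu1 = T0.mean(axis=0), T1.mean(axis=0)
    centered = np.vstack([T0 - mu0, T1 - mu1])
    pooled = centered.T @ centered / (len(T) - 2)   # pooled within-class covariance
    direction = np.linalg.solve(pooled, mu1 - mu0)
    threshold = direction @ (mu0 + mu1) / 2          # assumes equal class priors
    return classes, direction, threshold


def predict_lda(T, classes, direction, threshold):
    return np.where(T @ direction > threshold, classes[1], classes[0])


# Simulated 'small n, large p' data (hypothetical): 40 samples, 200 genes,
# with the first 10 genes shifted in class 1.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 200))
X[labels == 1, :10] += 2.0

x_mean, R = fit_pls(X, labels.astype(float), m=2)
T = (X - x_mean) @ R                                 # n samples x m components
classes, direction, threshold = fit_lda(T, labels)
predicted = predict_lda(T, classes, direction, threshold)
print("training misclassification rate:", np.mean(predicted != labels))
```

The rate printed here is computed on the training samples and is therefore optimistic; it only shows that the pipeline runs. Swapping `fit_lda` for a different rule in step 2 would correspond to the rPLS and gPLS variants, which are not sketched here.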
13
Comparison of methods…..
All eight methods were run on three benchmark datasets:
- Colon cancer: p = 2000 genes; n = 62 [n_cases = 40 + n_controls = 22]
- Small Round Blue Cell Tumours [SRBCT]: p = 2308 genes; n = 83 [distributed among 4 classes]
- Brain tumours: p = 5597 genes; n = 42 [distributed among 5 classes]
14
Comparison of methods……
Effectiveness of classification rules was gauged using misclassification rates based on leave-one-out cross-validation (LOOCV). As the number of components retained is a meta-parameter for both the indirect and PLS-based methods, misclassification rates were evaluated using 2, 4 and 6 components. Methods were also run on datasets both with and without prior reduction by a priori feature selection (Significance Analysis of Microarrays).
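A leave-one-out misclassification rate can be sketched in plain Python as follows. The nearest-centroid classifier is a stand-in chosen here only to keep the example short; it is not one of the eight methods compared, and the toy data and function names are hypothetical.

```python
def loocv_misclassification_rate(samples, labels, fit, predict):
    """Hold out each sample in turn, train on the rest, and count errors."""
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        if predict(model, samples[i]) != labels[i]:
            errors += 1
    return errors / len(samples)


def fit_centroids(xs, ys):
    """Stand-in classifier: store the mean profile of each class."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {
        y: [sum(col) / len(rows) for col in zip(*rows)]
        for y, rows in groups.items()
    }


def predict_centroid(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean)."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
    )


# Hypothetical toy data: 6 samples, 2 features, 2 well-separated classes.
samples = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [2.0, 2.1], [2.2, 1.9], [1.9, 2.3]]
labels = ["A", "A", "A", "B", "B", "B"]

rate = loocv_misclassification_rate(samples, labels, fit_centroids, predict_centroid)
print("LOOCV misclassification rate:", rate)
```

To evaluate a method at 2, 4 and 6 components, the same loop could be run three times with a `fit` function that fixes the number of components retained.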
15
Example: Results for SRBCT dataset
Here, PLS methods outperformed all other methods. Indirect methods performed comparably only where either a priori feature selection was employed or a larger number of components was retained. Overall: PLS misclass. < Canonical ordination misclass. < Indirect misclass.
16
Results…
Indirect methods performed comparably only when non-differentially expressed genes were removed prior to classification rule training. Removal of non-differentially expressed genes made little difference to PLS-based methods.
17
Results….
Canonical methods were highly inconsistent in their performance across datasets, varying from quite effective in the colon and SRBCT datasets to very poor (worse than indirect methods) for the brain tumour dataset.
18
Results…..
PLS-based methods generally performed better than both indirect and canonical ordination methods. In most cases, rPLS and gPLS were superior to PLS using linear discriminant functions.
19
Conclusions
Indirect methods should be restricted to a class discovery role (e.g. examining gene-gene and gene-patient interactions). Relative performance of techniques tended to align with the ‘class’ of the method (indirect, canonical ordination or PLS-based). That is, the analysis space of the method (i.e. within the indirect methods, or within the canonical ordination methods) tended to make little difference to misclassification rates.
20
Further work….
It is not clear which properties of the various datasets lead to inconsistencies in the performance of the classification methods. Protocols to systematically evaluate microarray classification methods need further development. The most promising avenue is likely to involve simulated microarray datasets, allowing aspects of the data to be systematically varied so that the strengths and limitations of the methods can be directly linked to data properties.
21
Further work….
The focus in this study has been solely on dimension reduction methods. Any comprehensive comparative study should include promising machine learning methods.
22
Thank you Questions???