Published by Denis Holmes. Modified over 9 years ago.
1
SVM-based techniques for biomarker discovery in proteomic pattern data
Elena Marchiori
Department of Computer Science, Vrije Universiteit Amsterdam
2
Overview
- Variable selection
- SVM-based techniques
- Application to proteomic pattern data
- Results
- Conclusion
3
Variable Selection
Select a small subset of input variables (for example, genes in gene expression data or m/z values in proteomic pattern data) to be used for building a classifier.
Advantages:
- it is cheaper to measure fewer variables
- the resulting classifier is simpler and potentially faster
- prediction accuracy may improve when irrelevant variables are discarded
- identifying relevant variables gives more insight into the nature of the corresponding classification problem (biomarker detection)
4
Support Vector Machines
Advantages:
- maximize the margin between two classes in the feature space characterized by a kernel function
- are robust with respect to high input dimension
Disadvantages:
- difficult to incorporate background knowledge
- sensitive to outliers
5
Binary classification
- Separating hyperplane: $w^T x + b = 0$
- Two half-spaces: $w^T x + b < 0$ and $w^T x + b > 0$
- Classifier: $f(x) = \mathrm{sign}(w^T x + b)$
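The decision rule on this slide can be sketched in a few lines of plain Python. This is only an illustration: the function names `decision_value` and `classify` are hypothetical, and the choice to assign points lying exactly on the hyperplane to the negative class is an assumption, since the slide does not say how that case is handled.

```python
def decision_value(w, x, b):
    """Return w^T x + b for weight vector w, input x and offset b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def classify(w, x, b):
    """Return +1 or -1 according to the sign of w^T x + b.

    Assumption: a point exactly on the hyperplane (value 0) is labelled -1.
    """
    return 1 if decision_value(w, x, b) > 0 else -1


# Example values chosen only for illustration.
w_demo, b_demo = [1.0, -1.0], 0.5
print(classify(w_demo, [2.0, 1.0], b_demo))
print(classify(w_demo, [0.0, 2.0], b_demo))
```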
6
Linear Separators
7
SVM: separable classes
[Figure: optimal hyperplane, margin ρ, and support vectors]
Support vectors uniquely characterize the optimal hyperplane.
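For reference, the margin picture on this slide corresponds to the standard textbook hard-margin formulation. It is not written out on the slide; the training points $x_i$, the labels $y_i \in \{-1, +1\}$ and the sample count $n$ are notation assumed here.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w^{T} x_i + b) \ge 1, \qquad i = 1, \dots, n,
\qquad \text{with margin} \quad \rho = \frac{2}{\lVert w \rVert}.
```

The support vectors are the training points for which the constraint holds with equality.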
8
SVM and outliers
[Figure: example with a labelled outlier]
9
SVM-RFE
Linear binary classifier decision function: $f(x) = \mathrm{sign}(w^T x + b)$
Recursive Feature Elimination (SVM-RFE), at each iteration:
1) eliminate threshold% of the variables with the lowest scores
2) recompute the scores of the remaining variables
SVM-RFE based algorithms: run SVM-RFE with different thresholds, then
- JOIN: select variables occurring more than cutoff times
- ENSEMBLE: consider the majority vote of the resulting classifiers
10
SVM-RFE
I. Guyon et al., Machine Learning, 46, 389-422, 2002
11
SVM-RFE variant
Input: train set, threshold T, number N of variables to be selected
Output: subset of variables of size N
RFE:
- Train: run a linear SVM on the train set
- Score: generate a sequence of variables ordered with respect to the absolute value of their weight
- Eliminate: remove the T% of variables with the smallest absolute weight from the ordered sequence
- Repeat (train, score, eliminate) on the train set restricted to the remaining variables until only N variables are left
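The train, score, eliminate loop can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the study: the linear SVM is fitted by simple batch subgradient descent on the regularised hinge loss (the slides do not name a solver), the threshold is taken as a fraction such as 0.5, at least one variable is removed per iteration, and the names `train_linear_svm`, `svm_rfe`, `X_demo` and `y_demo` are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit (w, b) by batch subgradient descent on the L2-regularised hinge loss.

    Assumption: this simple solver stands in for whatever SVM package was used.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # this sample violates the margin
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_rfe(X, y, threshold, n_select):
    """Return the indices of the n_select variables that survive elimination."""
    remaining = list(range(len(X[0])))
    while len(remaining) > n_select:
        # Train on the data restricted to the remaining variables.
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = train_linear_svm(X_sub, y)
        # Score: order positions by absolute weight, smallest first.
        order = sorted(range(len(remaining)), key=lambda k: abs(w[k]))
        # Eliminate a fraction `threshold`, at least one, never below n_select.
        n_drop = max(1, int(threshold * len(remaining)))
        n_drop = min(n_drop, len(remaining) - n_select)
        dropped = set(order[:n_drop])
        remaining = [v for k, v in enumerate(remaining) if k not in dropped]
    return remaining


# Toy data for illustration only: column 2 carries the class signal.
X_demo = [[0.1, 0.2, 2.0, -0.1],
          [-0.1, -0.2, 2.0, 0.1],
          [0.1, 0.2, -2.0, -0.1],
          [-0.1, -0.2, -2.0, 0.1]]
y_demo = [1, 1, -1, -1]
print(svm_rfe(X_demo, y_demo, threshold=0.5, n_select=1))
```

Retraining after every elimination step is what distinguishes RFE from a one-shot ranking: the weights of the surviving variables are recomputed without the discarded ones.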
12
JOIN and ENSEMBLE SVM-RFE
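As a rough illustration of the two combination schemes, the sketch below takes the variable subsets (or classifiers) produced by several SVM-RFE runs as given. The names `join_select` and `ensemble_predict` are hypothetical, classifiers are modelled as plain functions returning +1 or -1, and resolving a tied vote as -1 is an assumption, since the slides do not say how ties are handled.

```python
from collections import Counter


def join_select(subsets, cutoff):
    """JOIN: keep the variables that occur in more than `cutoff` subsets."""
    counts = Counter(v for subset in subsets for v in set(subset))
    return sorted(v for v, c in counts.items() if c > cutoff)


def ensemble_predict(classifiers, x):
    """ENSEMBLE: majority vote of classifiers that each return +1 or -1.

    Assumption: a tied vote is resolved as -1.
    """
    votes = sum(clf(x) for clf in classifiers)
    return 1 if votes > 0 else -1


# Hypothetical subsets from three SVM-RFE runs with different thresholds.
subsets_demo = [[1, 2, 3], [2, 3, 4], [3, 5]]
print(join_select(subsets_demo, cutoff=1))

# Hypothetical classifiers: two vote +1, one votes -1.
classifiers_demo = [lambda x: 1, lambda x: 1, lambda x: -1]
print(ensemble_predict(classifiers_demo, [0.0]))
```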
13
Case Study: proteomic pattern data
- Petricoin et al. papers
- Commercial analysis software (Proteome Quest): http://www.correlogic.com/
- Data sets available at: http://ncifdaproteomics.com/ppatterns.php
14
Data generation: SELDI-TOF MS
Surface-enhanced laser desorption/ionization time-of-flight mass spectrometry
- Method for profiling a population of proteins in a sample according to the size and net charge of individual proteins.
- The readout is a spectrum of peaks. The position of a protein in the spectrum corresponds to its “time of flight”, because small proteins fly faster than heavy ones.
Steps:
1. Serum on protein binding plate
2. Insert plate in vacuum chamber
3. Irradiate plate with laser
4. This “launches” the proteins / peptides
5. Measure “time of flight” (TOF) of ions, related to the molecular weight of proteins
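The link between time of flight and mass can be made explicit with the idealised relation for a linear time-of-flight analyser. This formula is not on the slide; the drift length $L$, accelerating voltage $U$, elementary charge $e$, ion mass $m$ and charge number $z$ are symbols assumed here, and real instruments are calibrated empirically.

```latex
z e U = \tfrac{1}{2}\, m v^{2}
\quad \Longrightarrow \quad
t = \frac{L}{v} = L \sqrt{\frac{m}{2\, z\, e\, U}}
\quad \Longrightarrow \quad
t \propto \sqrt{m/z}.
```

Under these assumptions, an ion with four times the m/z value takes twice as long to reach the detector.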
15
Example of proteomic pattern profile from one blood sample
[Figure: spectrum; axes: time of flight, abundance]
- Heavier peptides move slower, so time of flight corresponds to weight
- Weight corresponds to peptides
- Measurement of the relative abundance of detected peptides in serum
16
How to use such data?
- Diagnostic tool: design a classifier for discriminating healthy from disease samples
- Biomarker identification: variable subset selection (VSS), that is, select a subset of input variables (m/z values) that best discriminate the two classes (potential biomarkers)
17
Commercial Tools
- Proteome Quest (Correlogic): GA + clustering, no pre-selection (Petricoin et al., The Lancet 2002)
- Propeak (3Z Informatics): separability analysis + bootstrap
- Biomarker AMplification Filter BAMF (Eclipse Diagnostics): ?
18
Non-commercial Techniques
- Pre-processing + ranking + kNN (Zhu et al., PNAS 2003)
- Pre-selection + boosted decision trees (Qu et al., Clin. Chem. 2002)
- Filter FS + classifier (Liu et al., Genome Informatics 2002)
- GA + SVM, SVM-RFE ensemble (Jong et al., EvoBIO 2004; Jong et al., CIBCB 2004)
- Many others: any ML method for classification/FS (see, e.g., special issue on FS, JMLR 2003)
19
Goal and Methods
Goal: analyze the performance of SVM-based techniques for classification and variable selection with proteomic pattern data
Methods:
- SVM
- SVM-RFE
- Ensemble SVM-RFE: majority vote of the classifiers obtained from SVM-RFE with different threshold values
- Join SVM-RFE: SVM trained on the N variables that have been selected most often by SVM-RFE with different threshold values
20
DataSets
Two proteomic pattern datasets, from prostate and ovarian cancer, from the NCI/CCR and FDA/CBER Clinical Proteomics Program Databank.
Data sets available at: http://ncifdaproteomics.com/ppatterns.php

Dataset           tot #   cancer   healthy           M/z values
Ovarian 4/03/02   215     100      115 (15 benign)   15154
Prostate          322     69       253               15154
21
Experimental Setup
10 random partitions of the dataset: T (50%), H (25%), V (25%)
Algorithms:
- SVM trained on the union of T and H
- SVM-RFE(threshold) with thresholds = 0.2, 0.3, 0.4, 0.5, 0.6, 0.7; choose the threshold giving the best classifier sensitivity on H
- JOIN(cutoff, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7) with cutoffs = 1, 2, 3, 4, 5; choose the cutoff giving the best classifier sensitivity on H
Performance: average (over the 10 V's)
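The partitioning and the selection criterion can be sketched as follows. This is an illustration under assumptions: the slides do not say whether the splits were stratified by class or how rounding was handled, so the sketch uses an unstratified shuffle with integer division, and the names `split_thv` and `sensitivity` are hypothetical. Sensitivity is taken in its usual sense, the fraction of positive (disease) samples that are classified as positive.

```python
import random


def split_thv(n_samples, seed):
    """Randomly split sample indices into T (about 50%), H (about 25%), V (rest)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # seeded, so each partition is reproducible
    n_t = n_samples // 2
    n_h = n_samples // 4
    return idx[:n_t], idx[n_t:n_t + n_h], idx[n_t + n_h:]


def sensitivity(y_true, y_pred, positive=1):
    """Fraction of positive samples predicted as positive: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if tp + fn else 0.0


# Ten random partitions of a dataset with 215 samples (one seed per partition).
partitions = [split_thv(215, seed) for seed in range(10)]
T, H, V = partitions[0]
print(len(T), len(H), len(V))
```

In this scheme H is used only to pick the threshold or cutoff, and V is held back so that the reported performance is not biased by that choice.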
22
Results Prostate Dataset
23
Results Ovarian Dataset
24
Controversy
Noise, bias, results reliability and reproducibility in serum proteomics:
- Sorace, Zhan, BMC Bioinformatics, 2004
- Petricoin, BMC Bioinformatics, 2004
- Baggerly, Journal of the National Cancer Institute, vol. 97, no. 4, 2005
- Liotta, Journal of the National Cancer Institute, vol. 97, no. 4, 2005
- Ransohoff, Journal of the National Cancer Institute, vol. 97, no. 4, 2005
25
Conclusion
- Many machine learning techniques can be used for potential biomarker detection with proteomic pattern data.
- SVM-based techniques are a potentially effective choice because of the high input dimension of such data.
- Computational analysis of proteomic pattern data has to use a correct methodology that considers the biases induced by the selection and classification algorithms and by the data splitting.
- Problems related to the reliability and reproducibility of data are inherent to the laboratory technology and are currently being addressed by researchers and practitioners.
26
Acknowledgments
- Connie Jimenez (Biology, VUMC)
- Aad van der Vaart (Statistics, VUA)