Machine Learning techniques for biomarker discovery in proteomic pattern data Elena Marchiori Department of Computer Science Vrije Universiteit Amsterdam.



Overview
Proteomic pattern data
How to use the data
Approaches
Methodology
Case study
Conclusion

SELDI-TOF MS
Surface-enhanced laser desorption/ionization time-of-flight mass spectrometry
A method for profiling a population of proteins in a sample according to the size and net charge of individual proteins. The readout is a spectrum of peaks. The position of a protein in the spectrum corresponds to its “time of flight”, because small proteins fly faster than heavy ones.
1. Place serum on a protein-binding plate
2. Insert the plate in a vacuum chamber
3. Irradiate the plate with a laser
4. This “launches” the proteins / peptides
5. Measure the “time of flight” (TOF) of the ions, which corresponds to the molecular weights of the proteins
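The link between flight time and mass can be written out for an idealized linear TOF instrument. This is a textbook sketch, not taken from the slides: it assumes an ion of mass m and charge ze that is accelerated through a potential U and then drifts over a field-free length L (all symbols are introduced here for illustration).

```latex
\begin{align}
  z e U &= \tfrac{1}{2} m v^{2}, \\
  t &= \frac{L}{v} = L \sqrt{\frac{m}{2 z e U}}
  \quad\Longrightarrow\quad
  \frac{m}{z} = \frac{2 e U}{L^{2}}\, t^{2}.
\end{align}
```

Under these assumptions the flight time grows with the square root of the mass-to-charge ratio, which is why heavier ions arrive later.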

Example
[Figure: example spectrum; axes: time of flight, abundance]
Heavier peptides move slower -> time of flight corresponds to weight
Weight corresponds to peptides
Measuring relative abundance of detected proteins in serum

How to use the data?
Diagnostic tool:
  – design a classifier for discriminating healthy from diseased samples
Biomarker identification:
  – feature selection (FS): select the features (peptides / proteins) that best discriminate the two classes (potential biomarkers)

Classification / FS
Diagnostic tool => classifier
  – train a classifier that separates the two classes of diseased and healthy examples
Biomarkers => feature subset selection
  – for a given type of classifier (e.g. KNN, SVM), find a small set of features that optimizes the performance of the classifier when restricted to the selected features
  – for a given clustering algorithm, find a small set of features that maximizes the coherence of class labels of examples in the clusters (Petricoin et al., The Lancet 2002)

Approaches: Commercial
Proteome Quest (Correlogic): GA + clustering, no pre-selection (Petricoin et al., The Lancet 2002)
Propeak (3Z Informatics): separability analysis + bootstrap
Biomarker AMplification Filter, BAMF (Eclipse Diagnostics): ?

Approaches: Non-commercial
Pre-processing + ranking + kNN (Zhu et al., PNAS 2003)
Pre-selection + boosted decision trees (Qu et al., Clin. Chem. 2002)
Filter FS + classifier (Liu et al., Genome Informatics 2002)
GA + SVM (Jong et al., EvoBIO 2004)
Many others: any ML method for classification/FS (see, e.g., special issue on FS, JMLR 2003)

SVM-based methods
  – Linear Support Vector Machine

GA_SVM
Training set T = T_1 ∪ T_2.
A genetic algorithm evolves a number of populations. Each population consists of sets of features of a given size.
The fitness of an individual of the population is based on the performance of an SVM: the SVM is trained on T_1 using only the features of the individual, and the fitness is the SVM error over T_2.
At each generation, new individuals are created and inserted into the population by selecting fit parents, which are mutated and recombined.
Individuals may migrate to neighbor populations.
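The GA wrapper described above can be sketched roughly as follows. This is a simplified illustration, not the implementation used in the study: it evolves a single population (no migration), uses tournament selection and steady-state replacement as assumed design choices, and takes the fitness as a pluggable `error_fn`. In the slide's setting `error_fn` would train an SVM on T_1 and return its error on T_2; here a toy stand-in (`toy_error`, with a hypothetical set of informative features) replaces the SVM so the block runs with the standard library alone.

```python
import random

def ga_feature_selection(n_features, subset_size, error_fn,
                         pop_size=20, generations=200,
                         mutation_rate=0.3, seed=0):
    """Evolve fixed-size feature subsets that minimise error_fn(subset)."""
    rng = random.Random(seed)
    pop = [frozenset(rng.sample(range(n_features), subset_size))
           for _ in range(pop_size)]
    cache = {}

    def fitness(ind):
        # Lower is better: fitness is the (cached) error of the subset.
        if ind not in cache:
            cache[ind] = error_fn(ind)
        return cache[ind]

    def tournament():
        # Assumed selection scheme: best of two random individuals.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) <= fitness(b) else b

    for _ in range(generations):
        p1, p2 = tournament(), tournament()
        # Recombination: draw the child's features from the parents' union.
        child = set(rng.sample(sorted(p1 | p2), subset_size))
        # Mutation: swap one selected feature for an unselected one.
        if rng.random() < mutation_rate:
            outside = [f for f in range(n_features) if f not in child]
            if outside:
                child.remove(rng.choice(sorted(child)))
                child.add(rng.choice(outside))
        child = frozenset(child)
        # Steady-state replacement: the child replaces the worst individual
        # only if it is strictly better.
        worst = max(pop, key=fitness)
        if fitness(child) < fitness(worst):
            pop[pop.index(worst)] = child

    best = min(pop, key=fitness)
    return sorted(best), fitness(best)

# Toy stand-in for "SVM error on T_2": fraction of hypothetical
# informative features missing from the subset.
INFORMATIVE = {2, 5, 7}

def toy_error(subset):
    return len(INFORMATIVE - set(subset)) / len(INFORMATIVE)

best, err = ga_feature_selection(n_features=30, subset_size=3,
                                 error_fn=toy_error)
print(best, err)
```

Because the result depends on the random seed, repeated runs with different seeds can return different subsets, which is the algorithm-induced variability the experiments report.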

Ensemble SVM-RFE
SVM-RFE(a cutoff, a training set T = T_1 ∪ T_2)
1. Train a linear soft-margin SVM (C, class label penalties) on T_1
2. Order the features using the weights of the resulting classifier
3. Eliminate the features with weight smaller than the cutoff
4. Repeat the process with T_1 restricted to the remaining features
This algorithm generates a chain of feature sets F_1 ⊃ F_2 ⊃ … ⊃ F_k.
SVM-RFE selects from {F_1, …, F_k} the set F* that minimizes the error over T_2 of the classifier restricted to the feature set, plus a term for penalizing large feature sets.
We proposed a variant of this FS algorithm that uses ensembles of results of SVM-RFE over different cutoff values.
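A single SVM-RFE pass (not the ensemble variant) might look like the sketch below. Several details are assumptions made for illustration, since the slide does not specify them: the linear SVM is trained by plain stochastic subgradient descent on the regularized hinge loss, without a bias term and without class label penalties; the cutoff is interpreted relative to the largest absolute weight; and the penalty for large feature sets is taken to be linear in the set size. The toy data (one informative feature plus small noise features) is made up.

```python
import random

def train_linear_svm(X, y, features, lam=0.01, eta=0.1, epochs=50, seed=0):
    """Subgradient descent on the L2-regularised hinge loss (labels +1/-1).
    Simplified stand-in for a real SVM solver: no bias, no class penalties."""
    rng = random.Random(seed)
    w = {f: 0.0 for f in features}
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * sum(w[f] * X[i][f] for f in features)
            for f in features:
                w[f] *= 1.0 - eta * lam            # L2 shrinkage
                if margin < 1:                     # hinge subgradient step
                    w[f] += eta * y[i] * X[i][f]
    return w

def error_rate(w, X, y):
    wrong = sum(1 for xi, yi in zip(X, y)
                if yi * sum(w[f] * xi[f] for f in w) <= 0)
    return wrong / len(X)

def svm_rfe(X1, y1, X2, y2, cutoff=0.5, size_penalty=0.01):
    """Return (selected feature set F*, chain of (feature set, error on T_2))."""
    features = list(range(len(X1[0])))
    chain = []
    while features:
        w = train_linear_svm(X1, y1, features)      # step 1: train on T_1
        chain.append((list(features), error_rate(w, X2, y2)))
        if len(features) == 1:
            break
        top = max(abs(w[f]) for f in features)      # steps 2-3: rank, cut
        keep = [f for f in features if abs(w[f]) >= cutoff * top]
        if len(keep) == len(features):              # always drop at least one
            keep.remove(min(keep, key=lambda f: abs(w[f])))
        features = keep                             # step 4: repeat
    # Choose F*: error on T_2 plus an assumed linear size penalty.
    best, _ = min(chain, key=lambda fe: fe[1] + size_penalty * len(fe[0]))
    return best, chain

# Toy data: feature 0 carries the label, features 1-4 are small noise.
rng = random.Random(1)

def make(n):
    X, y = [], []
    for i in range(n):
        label = 1 if i % 2 == 0 else -1
        X.append([2.0 * label] + [rng.uniform(-0.1, 0.1) for _ in range(4)])
        y.append(label)
    return X, y

X1, y1 = make(20)   # plays the role of T_1
X2, y2 = make(10)   # plays the role of T_2
selected, chain = svm_rfe(X1, y1, X2, y2)
print(selected, [(len(f), e) for f, e in chain])
```

An ensemble variant would repeat `svm_rfe` for several values of `cutoff` and combine the results; how they are combined is not described on the slide, so it is left out here.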

Methodology
Cross Validation
  – split the data randomly into a train and a test set
  – apply the classification/FS method to the training set
  – use the test set only to assess the performance of the method
  – repeat the process a number of times to analyze the bias induced by the data splitting
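The protocol above can be sketched as a small harness. This is an illustrative outline under stated assumptions: the function name `repeated_holdout`, the 30% test fraction, and the callback `fit_and_score` are all hypothetical, and a majority-class baseline on made-up labels stands in for the real classification/FS method. The key point is that the callback receives the training indices and must do all fitting, including any feature selection, on those alone.

```python
import random

def repeated_holdout(n_samples, fit_and_score, n_repeats=10,
                     test_fraction=0.3, seed=0):
    """Repeat random train/test splitting; return mean, std and all errors.

    fit_and_score(train_idx, test_idx) must fit everything (feature
    selection included) on train_idx only and return the error on test_idx.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    n_test = max(1, round(test_fraction * n_samples))
    errors = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        errors.append(fit_and_score(train_idx, test_idx))
    mean = sum(errors) / len(errors)
    std = (sum((e - mean) ** 2 for e in errors) / len(errors)) ** 0.5
    return mean, std, errors

# Made-up labels and a trivial stand-in method: predict the majority
# class of the training set.
labels = [0] * 12 + [1] * 8

def majority_baseline(train_idx, test_idx):
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] != majority for i in test_idx) / len(test_idx)

mean_err, std_err, errs = repeated_holdout(len(labels), majority_baseline)
print(f"error over {len(errs)} splits: {mean_err:.2f} ({std_err:.2f})")
```

Reporting the mean together with the spread over splits gives the "value (deviation)" style of summary, and the spread is a direct view of the variability due to data splitting.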

About Methodology
Examples of recent papers that do NOT use a correct methodology:
  – Qu et al. (Clin. Chem. 2002): perform feature pre-selection before application of CV
  – Villanueva et al. (Anal. Chem. 2004): use the entire dataset for feature ranking
  – Petricoin et al. (The Lancet 2002): consider one data split into train/test set
Papers addressing methodology pitfalls:
  – Simon et al., J. Natl. Cancer Inst. 2003
  – Ambroise and McLachlan, PNAS 2002

Case Study: Data
Used in Petricoin et al. papers
  – Commercial analysis software (Proteome Quest)
  – Data sets:
Ovarian data set:
  – 162 Positive (Cancer), 92 Negative (Healthy)
  – 15154 Variables (Peptides / Proteins)
Prostate data set:
  – 69 Positive, 253 Negative
  – 15154 Variables
Number of variables >> number of examples

Preliminary analysis
Few visible differences in means between healthy/cancer groups
But many very low p-values (in particular ovarian -> easy)
[Figures: difference in means and histogram of p-values, for the prostate data and the ovarian data]

The Methods
Diagnostic tool:
  – Support Vector Machine with linear and polynomial kernel
Biomarker Detection and Diagnostic:
  – Feature subset selection, using Genetic Algorithms and Support Vector Machine

Diagnostics: Results
Support Vector Machine (SVM) on all features
  – Linear and quadratic kernel
Evaluation measures:
  – Error: (fp + fn) / total
  – Sensitivity: tp / (tp + fn)
  – Specificity: tn / (fp + tn)
  – Positive Predictive Value: tp / (tp + fp)
Results seem consistent with the preliminary analysis: ovarian easier than prostate
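The four evaluation measures translate directly into code. The sketch below is only an illustration of the arithmetic; the function name and the confusion-matrix counts are made up and are not results from the study.

```python
def diagnostic_measures(tp, fp, tn, fn):
    """Evaluation measures computed from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "error": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "ppv": tp / (tp + fp),
    }

# Hypothetical counts, chosen only to illustrate the formulas.
m = diagnostic_measures(tp=40, fp=5, tn=45, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these counts the error is 15/100, the sensitivity 40/50, the specificity 45/50, and the positive predictive value 40/45.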

Biomarker Detection: Results
[Figures: Linear SVM, Prostate data set; Quadratic SVM, Prostate data set]
Bigger error than SVM on all features (+/- 0.06)

Results of Experiments
Results of experiments with GA-SVM indicate that there is variability due to both the data splitting and the algorithm.
Different sets of features are obtained at each run; however, there is a group of about 50 features that occur more often over all the runs.
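One simple way to find the features that recur across runs is to count how often each one is selected. The sketch below assumes each run yields a list of feature indices; the function names, the 50% threshold, and the example indices are hypothetical and are not the study's results.

```python
from collections import Counter

def selection_frequency(runs):
    """Fraction of runs in which each feature index was selected."""
    counts = Counter(f for selected in runs for f in set(selected))
    return {f: c / len(runs) for f, c in counts.items()}

def stable_features(runs, min_fraction=0.5):
    """Features selected in at least min_fraction of the runs."""
    freq = selection_frequency(runs)
    return sorted(f for f, p in freq.items() if p >= min_fraction)

# Hypothetical feature indices from four runs.
runs = [[3, 17, 42], [3, 42, 99], [3, 8, 42], [17, 42, 250]]
print(stable_features(runs))  # → [3, 17, 42]
```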

Results of Experiments
Ensemble-RFE-SVM achieves perfect classification on the ovarian dataset, while on the prostate dataset it achieves a sensitivity of 0.97 (0.04) and a specificity of 0.89 (0.06).
Ensemble-RFE-SVM outperforms both GA-SVM and the commercial software of Petricoin et al. However, it finds feature sets of larger sizes.
Features provided on the Petricoin et al. website yield poor performance when an SVM is used, showing that performance depends on the type of classifier used…

Diagnostic tool Design
Effective FS algorithms, like ensemble SVM-RFE, have to be enhanced with a user-friendly interface and visualization features in order to become operative in research laboratories and hospitals.
The resulting tools can be used by biologists and pathologists for analyzing their data without need of direct support from CS people.

Conclusion
Many machine learning techniques can be used for the analysis of proteomic pattern data.
SVM-based approaches are effective.
Computational analysis of proteomic pattern data has to use a correct methodology that considers the biases induced by the selection and classification algorithms and by the data splitting.
Collaboration:
  – Connie Jimenez
  – Gus Smit
  – Kees Jong
  – Aad van der Vaart