Proteomics Informatics – Molecular signatures (Week 11)

Slides:



Advertisements
Similar presentations
Statistical Machine Learning- The Basic Approach and Current Research Challenges Shai Ben-David CS497 February, 2007.
Advertisements

Introduction to Machine Learning Sackler Yin Aphinyanaphongs MD/ PhD 12/11/2014.
Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.
CSCE555 Bioinformatics Lecture 15 classification for microarray data Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page:
SVM - Support Vector Machines A new classification method for both linear and nonlinear data It uses a nonlinear mapping to transform the original training.

An Introduction of Support Vector Machine
SVM—Support Vector Machines
Machine learning continued Image source:
Clinical Trial Designs for the Evaluation of Prognostic & Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch National Cancer.
Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data.
Minimum Redundancy and Maximum Relevance Feature Selection
Correlation Aware Feature Selection Annalisa Barla Cesare Furlanello Giuseppe Jurman Stefano Merler Silvano Paoli Berlin – 8/10/2005.
Support Vector Machines (SVMs) Chapter 5 (Duda et al.)
Model and Variable Selections for Personalized Medicine Lu Tian (Northwestern University) Hajime Uno (Kitasato University) Tianxi Cai, Els Goetghebeur,
Part II: Discriminative Margin Clustering Joint work with: Rob Tibshirani, Dept of Statistics Patrick O. Brown, School of Medicine Stanford University.
Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical.
Reduced Support Vector Machine
Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical.
Bioinformatics Challenge  Learning in very high dimensions with very few samples  Acute leukemia dataset: 7129 # of gene vs. 72 samples  Colon cancer.
Modeling Gene Interactions in Disease CS 686 Bioinformatics.
What is Learning All about ?  Get knowledge of by study, experience, or being taught  Become aware by information or from observation  Commit to memory.
Statistical Learning: Pattern Classification, Prediction, and Control Peter Bartlett August 2002, UC Berkeley CIS.
Predictive Analysis of Clinical Trials Richard Simon, D.Sc. Chief, Biometric Research Branch National Cancer Institute
Re-Examination of the Design of Early Clinical Trials for Molecularly Targeted Drugs Richard Simon, D.Sc. National Cancer Institute linus.nci.nih.gov/brb.
MammaPrint, the story of the 70-gene profile
Expression profiling of peripheral blood cells for early detection of breast cancer Introduction Early detection of breast cancer is a key to successful.
JM - 1 Introduction to Bioinformatics: Lecture VIII Classification and Supervised Learning Jarek Meller Jarek Meller Division.
Tang G et al. Proc SABCS 2010;Abstract S4-9.
This week: overview on pattern recognition (related to machine learning)
CZ5225: Modeling and Simulation in Biology Lecture 6, Microarray Cancer Classification Prof. Chen Yu Zong Tel:
Molecular Diagnosis Florian Markowetz & Rainer Spang Courses in Practical DNA Microarray Analysis.
Integration II Prediction. Kernel-based data integration SVMs and the kernel “trick” Multiple-kernel learning Applications – Protein function prediction.
Sgroi DC et al. Proc SABCS 2012;Abstract S1-9.
Metabolic Syndrome and Recurrence within the 21-Gene Recurrence Score Assay Risk Categories in Lymph Node Negative Breast Cancer Lakhani A et al. Proc.
BIOMARKERS Diagnostics and Prognostics. OMICS Molecular Diagnostics: Promises and Possibilities, p. 12 and 26.
1 SUPPORT VECTOR MACHINES İsmail GÜNEŞ. 2 What is SVM? A new generation learning system. A new generation learning system. Based on recent advances in.
Dubsky P et al. Proc SABCS 2012;Abstract S4-3.
The Broad Institute of MIT and Harvard Classification / Prediction.
1 Critical Review of Published Microarray Studies for Cancer Outcome and Guidelines on Statistical Analysis and Reporting Authors: A. Dupuy and R.M. Simon.
SVM Support Vector Machines Presented by: Anas Assiri Supervisor Prof. Dr. Mohamed Batouche.
Previous Lecture: Bioimage Informatics Agullo-Pascual E, Reid DA, Keegan S, Sidhu M, Fenyö D, Rothenberg E, Delmar M, "Super-resolution fluorescence microscopy.
Use of Candidate Predictive Biomarkers in the Design of Phase III Clinical Trials Richard Simon, D.Sc. Chief, Biometric Research Branch National Cancer.
The Use of Predictive Biomarkers in Clinical Trial Design Richard Simon, D.Sc. Chief, Biometric Research Branch National Cancer Institute
JM - 1 Machine Learning for Studies of Genotype-Phenotype Correlations Jarek Meller Jarek Meller Division of Biomedical Informatics,
Using Predictive Classifiers in the Design of Phase III Clinical Trials Richard Simon, D.Sc. Chief, Biometric Research Branch National Cancer Institute.
Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.
SVM-based techniques for biomarker discovery in proteomic pattern data Elena Marchiori Department of Computer Science Vrije Universiteit Amsterdam.
Class 23, 2001 CBCl/AI MIT Bioinformatics Applications and Feature Selection for SVMs S. Mukherjee.
Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute.
Computational Approaches for Biomarker Discovery SubbaLakshmiswetha Patchamatla.
Supervised Learning. CS583, Bing Liu, UIC 2 An example application An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc)
Support Vector Machines and Gene Function Prediction Brown et al PNAS. CS 466 Saurabh Sinha.
Introduction to Biostatistics and Bioinformatics Experimental Design.
A comparative study of survival models for breast cancer prognostication based on microarray data: a single gene beat them all? B. Haibe-Kains, C. Desmedt,
CZ5225: Modeling and Simulation in Biology Lecture 7, Microarray Class Classification by Machine learning Methods Prof. Chen Yu Zong Tel:
Data Mining and Decision Support
Supervised Machine Learning: Classification Techniques Chaleece Sandberg Chris Bradley Kyle Walsh.
SUPERVISED AND UNSUPERVISED LEARNING Presentation by Ege Saygıner CENG 784.
Tree and Forest Classification and Regression Tree Bagging of trees Boosting trees Random Forest.
Classification with Gene Expression Data
Introduction to translational and clinical bioinformatics Connecting complex molecular information to clinically relevant decisions using molecular.
Glenn Fung, Murat Dundar, Bharat Rao and Jinbo Bi
Estimating Link Signatures with Machine Learning Algorithms
Basic machine learning background with Python scikit-learn
Claudio Lottaz and Rainer Spang
Support Vector Machines Introduction to Data Mining, 2nd Edition by
Single Sample Expression-Anchored Mechanisms Predict Survival in Head and Neck Cancer Yang et al Presented by Yves A. Lussier MD PhD The University.
Claudio Lottaz and Rainer Spang
Proteomics Informatics –
Presentation transcript:

Proteomics Informatics – Molecular signatures (Week 11)

Definition of a molecular signature FDA calls them “in vitro diagnostic multivariate assays” A molecular signature is a computational or mathematical model that links high-dimensional molecular information to phenotype or other response variable of interest.

1.Models of disease phenotype/clinical outcome Diagnosis Prognosis, long-term disease management Personalized treatment (drug selection, titration) 2.Biomarkers for diagnosis, or outcome prediction Make the above tasks resource efficient, and easy to use in clinical practice 3.Discovery of structure & mechanisms (regulatory/interaction networks, pathways, sub- types) Leads for potential new drug candidates Uses of molecular signatures

Example of a molecular signature

Developed by Agendia ( 70-gene signature to stratify women with breast cancer that hasn’t spread into “low risk” and “high risk” for recurrence of the disease Independently validated in >1,000 patients So far performed >10,000 tests Cost of the test is ~$3,000 In February, 2007 the FDA cleared the MammaPrint test for marketing in the U.S. for node negative women under 61 years of age with tumors of less than 5 cm. TIME Magazine’s 2007 “medical invention of the year”. MammaPrint

Oncotype DX Breast Cancer Assay Developed by Genomic Health ( 21-gene signature to predict whether a woman with localized, ER+ breast cancer is at risk of relapse Independently validated in thousands of patients So far performed >100,000 tests Price of the test is $4,175 Not FDA approved but covered by most insurances including Medicare Its sales in 2010 reached $170M and with a compound annual growth rate is projected to hit $300M by 2015.

Improved Survival and Cost Savings In a 2005 economic analysis of recurrence in LN-,ER+ patients receiving tamoxifen, Hornberger et al. performed a cost-utility analysis using a decision analytic model. Using a model, recurrence Score result was predicted on average to increase quality-adjusted survival by 16.3 years and reduce overall costs by $155,128. In a 2 million member plan, approximately 773 women are eligible for the test. If half receive the test, given the high and increasing cost of adjuvant chemotherapy, supportive care and management of adverse events, the use of the Oncotype DX assay is estimated to save approximately $1,930 per woman tested (given an aggregate 34% reduction in chemotherapy use).

Molecular signatures on the market

EF Petricoin III, AM Ardekani, BA Hitt, PJ Levine, VA Fusaro, SM Steinberg, GB Mills, C Simone, DA Fishman, EC Kohn, LA Liotta, "Use of proteomic patterns in serum to identify ovarian cancer", Lancet 359 (2002) 572–77

Example: OvaCheck Developed by Correlogic ( Blood test for the early detection of epithelial ovarian cancer Failed to obtain FDA approval Looks for subtle changes in patterns among the tens of thousands of proteins, protein fragments and metabolites in the blood Signature developed by genetic algorithm Significant artifacts in data collection & analysis questioned validity of the signature: -Results are not reproducible -Data collected differently for different groups of patients a.html

Main ingredients for developing a molecular signature

Base-Line Characteristics DF Ransohoff, "Bias as a threat to the validity of cancer molecular-marker research", Nat Rev Cancer 5 (2005)

How to Address Bias DF Ransohoff, "Bias as a threat to the validity of cancer molecular-marker research", Nat Rev Cancer 5 (2005)

Principles and geometric representation for supervised learning Want to classify objects as boats and houses.

All objects before the coast line are boats and all objects after the coast line are houses. Coast line serves as a decision surface that separates two classes. Principles and geometric representation for supervised learning

Longitude Latitude Boat House Principles and geometric representation for supervised learning

Longitude Latitude Boat House Then the algorithm seeks to find a decision surface that separates classes of objects Principles and geometric representation for supervised learning

Longitude Latitude ??? ? ? ? These objects are classified as boats These objects are classified as houses Unseen (new) objects are classified as “boats” if they fall below the decision surface and as “houses” if the fall above it Principles and geometric representation for supervised learning

These boats will be misclassified as houses This house will be misclassified as boat Principles and geometric representation for supervised learning

10,000-50,000 (gene expression microarrays, aCGH, and early SNP arrays) >500,000 (tiled microarrays, SNP arrays) 10,000-1,000,000 (MS based proteomics) >100,000,000 (next-generation sequencing) This is the ‘curse of dimensionality’ In 2-D this looks simple but what happens in higher dimensional data…

Some methods do not run at all (classical regression) Some methods give bad results (KNN, Decision trees) Very slow analysis Very expensive/cumbersome clinical application Tends to “overfit” High-dimensionality (especially with small samples) causes:

Over-fitting (a model to your data) = building a model that is good in original data but fails to generalize well to new/unseen data. Under-fitting (a model to your data) = building a model that is poor in both original data and new/unseen data. Two problems: Over-fitting & Under-fitting

Over/under-fitting are related to complexity of the decision surface and how well the training data is fit

Predictor X Outcome of Interest Y Training Data Future Data This line is good! This line overfits! Over/under-fitting are related to complexity of the decision surface and how well the training data is fit

Predictor X Outcome of Interest Y Training Data Future Data This line is good! This line underfits! Over/under-fitting are related to complexity of the decision surface and how well the training data is fit

Successful data analysis methods balance training data fit with complexity –Too complex signature (to fit training data well)  overfitting (i.e., signature does not generalize) –Too simplistic signature (to avoid overfitting)  underfitting (will generalize but the fit to both the training and future data will be low and predictive performance small).

Challenges in computational analysis of omics data Relatively easy to develop a predictive model and even easier to believe that a model is when it is not. There are both practical and theoretical problems. Omics data has many special characteristics and it is difficult to analyze. Example: OvaCheck, a blood test for early detection of epithelial ovarian cance, failed FDA approval. - Looks for subtle changes in patterns of proteins levels – Signature developed by genetic algorithm - Data collected differently for the different patient groups

The Support Vector Machine (SVM) approach for building molecular signatures Support vector machines (SVMs) is a binary classification algorithm. SVMs are important because of (a) theoretical reasons: -Robust to very large number of variables and small samples -Can learn both simple and highly complex classification models -Employ sophisticated mathematical principles to avoid overfitting and (b) superior empirical results.

Cancer patientsNormal patients Gene X Gene Y Consider example dataset described by 2 genes, gene X and gene Y Represent patients geometrically (by “vectors”) The Support Vector Machine (SVM) approach for building molecular signatures

Find a linear decision surface (“hyperplane”) that can separate patient classes and has the largest distance (i.e., largest “gap” or “margin”) between border-line patients (i.e., “support vectors”); Gap Cancer patientsNormal patients Gene X Gene Y The Support Vector Machine (SVM) approach for building molecular signatures

If such linear decision surface does not exist, the data is mapped into a much higher dimensional space (“feature space”) where the separating decision surface is found; The feature space is constructed via very clever mathematical projection (“kernel trick”). The Support Vector Machine (SVM) approach for building molecular signatures

Estimation of signature accuracy test train data train test train test train test train test data Large sample case: use hold-out validation Small sample case: use N-fold cross-validation

Challenges in computational analysis of omics data for development of molecular signatures Signature multiplicity (Rashomon effect) Poor experimental design Is there predictive signal? Assay validity/reproducibility Efficiency (Statistical and computational) Causality vs predictivness Methods development (reinventing the wheel) Many variables, few samples, noise, artifacts Editorialization/Over-simplification/Sensationalism

General conclusions 1.Molecular signatures play a crucial role in personalized medicine and translational bioinformatics. 2.Molecular signatures are being used to treat patients today, not in the future. 3.Development of accurate molecular signature should rely on use of supervised methods. 4.In general, there are many challenges for computational analysis of omics data for development of molecular signatures. 5.One of these challenges is molecular signature multiplicity. 6.There exist an algorithm that can extract the set of maximally predictive and non-redundant molecular signatures from high- throughput data.

Proteomics Informatics – Molecular signatures (Week 11)