Super Learning in Prediction: HIV Example
Mark van der Laan, Division of Biostatistics, University of California, Berkeley
www.bepress.com/ucbbiostat

Outline
Super Learning in Prediction of HIV Phenotype Based on HIV Genotype

Scientific Goal
Predict phenotype from genotype of the HIV virus:
– Phenotype: in vitro drug susceptibility
– Genotype: mutations in the protease and reverse transcriptase regions of the viral genome

HIV-1 Data (Rhee et al.)
HIV-1 sequences from publicly available isolates in the Stanford HIV Sequence Database (Bob Shafer)
Predictor: Genotype
– Based on amino acid sequences of protease positions 1–99
– Mutations defined as differences from the subtype B consensus wildtype sequence
– We used a subset consisting of 58 treatment-selected mutations (Rhee et al.)
Outcome: Drug Susceptibility
– Standardized log fold change in susceptibility to Nelfinavir (NFV) (n = 740 isolates)
– Fold change defined as the ratio of the IC50 of an isolate to that of a standard wildtype control isolate
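As a concrete, purely hypothetical illustration of this data layout, the sketch below uses made-up isolates and only three of the 58 mutation indicators; the column names and numeric values are invented for illustration, not taken from the Stanford database.

```python
import numpy as np
import pandas as pd

# Toy design matrix: one row per isolate, one binary column per mutation
# (column names and values are invented; the real data have 58 such columns).
X = pd.DataFrame(
    {"P30N": [1, 0, 0], "P54V": [0, 1, 1], "P90M": [1, 1, 0]},
    index=["isolate_1", "isolate_2", "isolate_3"],
)

# Outcome: standardized log fold change in NFV susceptibility, where
# fold change = IC50(isolate) / IC50(wildtype control); ratios below are made up.
fold_change = np.array([12.0, 3.5, 0.9])
y = np.log10(fold_change)
y = (y - y.mean()) / y.std()
```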

Possible Prediction Algorithms
Rhee et al., for example, applied:
1. Decision Trees
2. Neural Networks
3. Support Vector Regression
4. Main Term Linear Regression
5. Least Angle Regression (LARS)
6. Random Forest
We also applied:
1. Logic Regression
2. Deletion/Substitution/Addition (DSA) Regression

Super Learner
– Selects the best learner from a set of candidates
– Selection based on cross-validation
– Performs (asymptotically) as well as the oracle selector
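In symbols (notation introduced here, assuming squared-error loss; it is not taken verbatim from the slides), the cross-validation selector picks the candidate with the smallest cross-validated risk:

```latex
% Cross-validated risk of candidate k over V folds (squared-error loss assumed);
% \hat\Psi_k^{(-v)} is candidate k fit on the data with validation fold \mathcal{V}_v held out.
\[
  \widehat{\mathrm{CV}}_k \;=\; \frac{1}{n}\sum_{v=1}^{V}\sum_{i \in \mathcal{V}_v}
  \Bigl( Y_i - \hat\Psi_k^{(-v)}(X_i) \Bigr)^2,
  \qquad
  \hat{k} \;=\; \arg\min_{k}\, \widehat{\mathrm{CV}}_k .
\]
```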

Super Learning: Minimizing cross-validated risk over all linear combinations of the candidate algorithms

The Super Learner as a Linear Combination
Cross-validated risk is used to determine appropriate weights for each candidate.
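Below is a minimal sketch of this weighted (stacking-style) super learner, assuming squared-error loss and using scikit-learn learners as stand-ins for the candidate algorithms listed earlier; the function name and the unconstrained least-squares weighting are illustrative choices, not the exact procedure used in the analysis.

```python
# Minimal sketch of a linear-combination super learner (stacking) under
# squared-error loss. The scikit-learn learners below are stand-ins for the
# candidates on the previous slides, not the software used in the analysis.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression, Lars
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

def super_learner_weights(X, y, candidates, n_folds=10, seed=0):
    """X: (n, p) array; candidates: callables returning unfitted regressors.
    Returns weights minimizing the cross-validated risk of a linear combination."""
    Z = np.zeros((len(y), len(candidates)))          # cross-validated predictions
    for train, valid in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        for k, make_learner in enumerate(candidates):
            fit = make_learner().fit(X[train], y[train])
            Z[valid, k] = fit.predict(X[valid])
    # least-squares weights on the CV predictions (no constraint imposed here)
    weights, *_ = np.linalg.lstsq(Z, y, rcond=None)
    cv_risk = np.mean((y - Z @ weights) ** 2)
    return weights, cv_risk

candidates = [LinearRegression, Lars, DecisionTreeRegressor,
              lambda: RandomForestRegressor(n_estimators=200)]
# weights, cv_risk = super_learner_weights(X, y, candidates)
# Final predictor: refit each candidate on all data, combine with `weights`.
```

Because each candidate is fit on the training folds and evaluated on the held-out fold, the matrix Z contains honest predictions, so the weights reward how well each candidate generalizes rather than how well it fits the data it was trained on.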

10-Fold Cross-Validation Results
Candidate                       Mean CV Risk
Main Term Linear Regression
LARS
Logic Regression
CART
Random Forest
Super Learner                   0.1505

DSA Estimator
Plot of cross-validated risk (v = 10) versus number of terms, for main-terms-only models with number of terms in {1, …, 50}; the minimum CV risk is attained at 40 terms.
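A conceptual sketch of the model-size selection summarized above: for each candidate size, fit a main-terms model and record its 10-fold cross-validated risk, then keep the size with the smallest risk. This is not the DSA algorithm itself (which searches candidate models through deletion, substitution, and addition moves); the fixed column ordering below merely stands in for that search.

```python
# Conceptual sketch: choose the number of main terms by 10-fold CV risk.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

def best_size_by_cv(X_ranked, y, max_terms=50, n_folds=10):
    """X_ranked: (n, p) array with columns ordered so that the first k columns
    define the size-k main-terms model. Returns (best size, dict of CV risks)."""
    cv_risk = {}
    for k in range(1, min(max_terms, X_ranked.shape[1]) + 1):
        scores = cross_val_score(LinearRegression(), X_ranked[:, :k], y,
                                 scoring="neg_mean_squared_error", cv=n_folds)
        cv_risk[k] = -scores.mean()
    return min(cv_risk, key=cv_risk.get), cv_risk
```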

DSA Estimator: Best Model of Sizes 1–20
Mutation   Ranking     Mutation   Ranking
90M        1           20I        11
30N        2           50L        12
54V        3           73S        13
46I        4           24I        14
84C        5           54S        15
84A        6           74S        16
88S        7           82F        17
54T        8           10F        18
84V        9           54M        19
82A        10          88D        20

Super Learner
Final estimator: least squares regression with all mutations included as main terms.
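In symbols (notation ours, not from the slides), the selected fit is ordinary least squares on the 58 mutation indicators:

```latex
% Ordinary least squares over the 58 mutation indicators M_1, ..., M_58.
\[
  \hat{Y} \;=\; \hat\beta_0 + \sum_{j=1}^{58} \hat\beta_j M_j ,
  \qquad
  \hat\beta \;=\; \arg\min_{\beta}\; \sum_{i=1}^{n}
  \Bigl( Y_i - \beta_0 - \sum_{j=1}^{58} \beta_j M_{ij} \Bigr)^2 .
\]
```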

Closing Remarks
– We do not know a priori which candidate will work best, but the Super Learner is data adaptive.
– Unlike other “meta-learners” in the machine learning literature (that we know of), we use cross-validated risk to estimate the candidate weights.
– Combining super learning with Targeted MLE (in the estimation of the Q(A,W) function) gives better efficiency in the variable importance problem.

References for Section 1
M. J. van der Laan, E. C. Polley, and A. E. Hubbard. Super Learner. U.C. Berkeley Division of Biostatistics Working Paper Series, July 2007.
L. Breiman. Random Forests. Machine Learning, 45:5–32, 2001.
L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. The Wadsworth Statistics/Probability Series. Wadsworth International Group, 1984.
T. J. Hastie. Generalized additive models. Chapter 7 of Statistical Models in S, eds J. M. Chambers and T. J. Hastie. Wadsworth & Brooks/Cole, 1991.
W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Springer, New York, 2002.
S. Dudoit and M. J. van der Laan. Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Statistical Methodology, 2:131–154, 2005.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.
J. H. Friedman. Multivariate adaptive regression splines (with discussion by A. R. Barron and X. Xiao). Annals of Statistics, 19(1):1–141, 1991.
A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(3):55–67, 1970.
S. Rhee, J. Taylor, G. Wadhera, J. Ravela, A. Ben-Hur, D. Brutlag, and R. W. Shafer. Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences USA, 2006.

References for Section 1 (cont'd)
I. Ruczinski, C. Kooperberg, and M. LeBlanc. Logic regression. Journal of Computational and Graphical Statistics, 12(3):475–511, 2003.
S. E. Sinisi and M. J. van der Laan. Deletion/Substitution/Addition algorithm in learning with applications in genomics. Statistical Applications in Genetics and Molecular Biology, 3(1), Article 18, 2004.
S. E. Sinisi, E. C. Polley, S. Y. Rhee, and M. J. van der Laan. Super learning: An application to the prediction of HIV-1 drug resistance. Statistical Applications in Genetics and Molecular Biology, 6(1), 2007.
M. J. van der Laan and S. Dudoit. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. Technical Report 130, Division of Biostatistics, University of California, Berkeley, November 2003.
M. J. van der Laan and D. Rubin. Targeted maximum likelihood learning. International Journal of Biostatistics, 2(1), 2006.
M. J. van der Laan, S. Dudoit, and A. W. van der Vaart. The cross-validated adaptive epsilon-net estimator. Statistics and Decisions, 24(3):373–395, 2006.
A. W. van der Vaart, S. Dudoit, and M. J. van der Laan. Oracle inequalities for multi-fold cross-validation. Statistics and Decisions, 24(3), 2006.