Prediction Methods. Mark J. van der Laan, Division of Biostatistics, U.C. Berkeley. www.stat.berkeley.edu/~laan

Outline
- Overview of Common Approaches to Prediction
  - Regression
  - randomForest
  - DSA
- Cross-Validation
- Super Learner Method for Prediction
- Example
- Conclusion

If the scientific goal is to predict phenotype from genotype of the HIV virus... Prediction!
If the scientific goal is to determine, for an HIV-positive patient, the importance of genetic mutations on treatment response... Variable Importance!

Common Methods
- Linear Regression
- Penalized Regression:
  - Lasso Regression
  - Least Angle Regression
  - Ridge Regression
  - Forward Stagewise Regression (simple, less greedy)
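For concreteness, here is a minimal Python sketch (not from the original slides) comparing several of the methods above on synthetic sparse data using scikit-learn; the toy data, penalty values, and variable names are all illustrative assumptions. Forward stagewise regression has no standard scikit-learn implementation, but its coefficient path is closely approximated by least angle regression.

```python
# A minimal sketch, assuming synthetic data: compare OLS and three penalized
# regression fits on a sparse linear signal. All settings are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge, Lars

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))  # sparse true coefficients
y = X @ beta + rng.normal(scale=0.5, size=n)

models = {
    "OLS": LinearRegression(),
    "Lasso (L1)": Lasso(alpha=0.1),   # shrinks and zeroes out coefficients
    "Ridge (L2)": Ridge(alpha=1.0),   # shrinks but keeps all coefficients
    "Least Angle Regression": Lars(), # moves coefficients along the LARS path
}
for name, model in models.items():
    model.fit(X, y)
    print(f"{name:>24}: first coefs {np.round(model.coef_[:4], 2)}")
```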

Common Methods
- Non-parametric Regression:
  - Polymars: uses piecewise linear splines, with knots selected using generalized cross-validation
- Semi-parametric Regression:
  - Logic Regression: finds predictors that are Boolean (logical) combinations of the original (binary) predictors
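Polymars and logic regression are R packages without direct Python equivalents. The sketch below (an illustrative assumption, not either package's API) only mimics the logic-regression idea: it regresses the outcome on hand-built Boolean combinations of binary predictors, where the real method would search over such combinations.

```python
# A minimal sketch of the idea behind logic regression: regress on Boolean
# (AND/OR) combinations of binary predictors. Real logic regression searches
# over these combinations; here they are fixed by hand purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500
X = rng.integers(0, 2, size=(n, 3))  # three binary predictors X1, X2, X3

# Assumed truth for the toy example: outcome depends on (X1 AND X2) OR X3.
signal = (X[:, 0] & X[:, 1]) | X[:, 2]
y = 2.0 * signal + rng.normal(scale=0.3, size=n)

# Candidate Boolean features that a logic-regression search might discover.
features = np.column_stack([
    X[:, 0] & X[:, 1],             # X1 AND X2
    (X[:, 0] & X[:, 1]) | X[:, 2], # (X1 AND X2) OR X3
])
fit = LinearRegression().fit(features, y)
print("coefficients:", np.round(fit.coef_, 2), "intercept:", round(fit.intercept_, 2))
```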

Random Forest (Breiman 1996, 1999)
A classification and regression algorithm that seeks to estimate E[Y|A,W], i.e., the prediction of Y given a set of covariates {A, W}.
- Bootstrap aggregation of classification trees: attempts to reduce the variance of a single tree
- Cross-validation to assess misclassification rates: the out-of-bag (oob) error rate
- Permutation to determine variable importance
- Assumes all trees are independent draws from an identical distribution, minimizing the loss function at each node of a given tree by randomly drawing data for each tree and variables for each node
(Figure: a classification tree splitting on sets of covariates W = {W1, W2, W3, ...})
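As a concrete illustration (not part of the original slides), here is a minimal Python sketch using scikit-learn's RandomForestClassifier as a stand-in for the randomForest package; the synthetic data and settings are assumptions. Note that scikit-learn's permutation_importance is computed on whatever data it is given rather than per-tree oob samples, so it only approximates the oob permutation scheme described above.

```python
# A minimal sketch of the ingredients just described: bagged trees with a
# random covariate subset at each node, oob error, and permutation importance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",  # random subset of covariates tried at each node
    oob_score=True,       # track out-of-bag accuracy
    random_state=0,
).fit(X, y)

print("oob error rate:", round(1 - rf.oob_score_, 3))
imp = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print("permutation importances:", np.round(imp.importances_mean, 3))
```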

Random Forest: The Algorithm
- Draw a bootstrap sample of the data.
- Using 2/3 of the sample, fit a tree to its greatest depth, determining the split at each node by minimizing the loss function over a random sample of covariates (of user-specified size).
- For each tree:
  - Predict the classification of the leftover 1/3 using the tree, and calculate the misclassification rate: the out-of-bag (oob) error rate.
  - For each variable in the tree, permute the variable's values and recompute the oob error; the increase over the original oob error is an indication of the variable's importance.
- Aggregate the oob error and importance measures from all trees to determine the overall oob error rate and variable importance measure.
  - Oob error rate: the overall percentage of misclassification.
  - Variable importance: the average increase in oob error over all trees; assuming a normal distribution of the increase among the trees, determine an associated p-value.
- The resulting predictor set is high-dimensional.
A from-scratch sketch of these steps follows below.
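The following sketch (an illustration, not the randomForest package's implementation) follows the 2/3 in-bag / 1/3 out-of-bag convention stated above; the synthetic data, tree settings, and the normal approximation for p-values are simplifying assumptions.

```python
# A from-scratch sketch of the algorithm above: grow each tree on 2/3 of the
# sample, score it on the held-out 1/3 (oob), permute each variable's oob
# values to measure importance, then aggregate across trees.
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           random_state=0)
n, p = X.shape
n_trees = 200
oob_errors = np.empty(n_trees)
importance = np.zeros((n_trees, p))

for t in range(n_trees):
    # 2/3 of the sample grows the tree; the remaining 1/3 is out-of-bag.
    in_bag = rng.choice(n, size=2 * n // 3, replace=False)
    oob = np.setdiff1d(np.arange(n), in_bag)
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=t)
    tree.fit(X[in_bag], y[in_bag])

    # Out-of-bag misclassification rate for this tree.
    oob_errors[t] = np.mean(tree.predict(X[oob]) != y[oob])

    # Permute each variable's oob values; the error increase is its importance.
    for j in range(p):
        Xp = X[oob].copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        importance[t, j] = np.mean(tree.predict(Xp) != y[oob]) - oob_errors[t]

print("overall oob error rate:", round(oob_errors.mean(), 3))
mean_imp = importance.mean(axis=0)
# Normal approximation across trees gives a rough p-value per variable.
z = mean_imp / (importance.std(axis=0, ddof=1) / np.sqrt(n_trees))
print("importance:", np.round(mean_imp, 3))
print("p-values:  ", np.round(1 - stats.norm.cdf(z), 3))
```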

Deletion/Substitution/Addition Algorithm (DSA)