CRISP: Consensus Regularized Selection based Prediction

CRISP: Consensus Regularized Selection based Prediction
Ping Wang¹, Karthik K. Padthe², Bhanukiran Vinzamuri², Chandan K. Reddy¹
¹Virginia Tech, Arlington, VA 22203
²Wayne State University, Detroit, MI 48202

Outline
- Motivation
- Objective
- CRISP Method
- Experimental Results
- Conclusion

Motivation
In many real-world domains, predictions based on historical information are helpful for making plans for the future.
[Slide diagram: Dataset → Features/Variables → Linear Regression model → Prediction → Results]
Disadvantages of the plain least-squares (MSE) solution: large variance, poor interpretability, and poor generalization.
Solution: regularization. Convex penalties: Lasso, Ridge, Elastic Net. Non-convex penalties: MC+, SCAD, LSP.

Objective
The main objective of this work is to integrate the consensus regularization problem using non-convex regularizers into the regression framework, and to determine the optimal regularizer for the predictive model.

Linear Regression Model
The basic linear regression model aims at estimating the relationship between the features X = (x_1, x_2, ..., x_n)^T and the corresponding response variable Y = (y_1, y_2, ..., y_n)^T. Parameter estimation and feature selection are done simultaneously by solving a penalized minimization problem.
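The minimization problem itself appears only as an equation image on the original slide; a minimal sketch of the standard penalized least-squares form it refers to, with a generic penalty P over the m coefficients and regularization parameter λ, would be:

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - x_i^{T}\beta\bigr)^{2}
\;+\; \sum_{j=1}^{m} P\bigl(|\beta_j|;\,\lambda,\gamma\bigr)
```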

MC+ Penalty
The minimax concave plus (MC+) penalty is a fast, continuous, nearly unbiased and accurate method of penalized variable selection in linear regression. For each λ, there is a continuum of penalties and threshold operators as the non-convexity parameter γ varies from +∞ (soft thresholding, i.e., the Lasso) down to 1+ (hard thresholding). Varying the value of λ, together with the non-convexity parameter γ, determines a specific model.
1. R. Mazumder, J. Friedman, and T. Hastie. SparseNet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106(495):1125–1138, 2011.
2. C. H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.
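The penalty formula on this slide did not survive extraction; the MC+ definition given in the cited Zhang (2010) paper, which the slide appears to reproduce, is:

```latex
P(t;\,\lambda,\gamma)
\;=\; \lambda \int_{0}^{|t|}\Bigl(1 - \frac{x}{\gamma\lambda}\Bigr)_{+}\,dx
\;=\;
\begin{cases}
\lambda|t| - \dfrac{t^{2}}{2\gamma}, & |t| \le \gamma\lambda,\\[4pt]
\dfrac{\gamma\lambda^{2}}{2}, & |t| > \gamma\lambda,
\end{cases}
\qquad \gamma > 1 .
```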

Optimization
The coordinate descent method is used to minimize a multivariable objective function by solving a series of univariate optimization problems in a loop. It performs a line search along one coordinate direction at the current point and cyclically iterates through the remaining directions.
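A minimal Python sketch of this coordinate-wise update, assuming standardized predictor columns and the MC+ thresholding operator described in the SparseNet reference above (function and variable names are illustrative, not taken from the authors' implementation):

```python
import numpy as np

def mcp_threshold(z, lam, gamma):
    """MC+ generalized thresholding operator S_gamma(z, lam), gamma > 1.
    Behaves like soft thresholding as gamma -> infinity and like hard
    thresholding as gamma -> 1+."""
    az = abs(z)
    if az <= lam:
        return 0.0
    if az <= gamma * lam:
        return np.sign(z) * (az - lam) / (1.0 - 1.0 / gamma)
    return z

def coordinate_descent_mcp(X, y, lam, gamma, beta_init, tol=1e-6, max_iter=500):
    """Cyclic coordinate descent for MC+-penalized least squares.
    Assumes each column of X is standardized so that (1/n) x_j^T x_j = 1."""
    n, m = X.shape
    beta = beta_init.copy()
    residual = y - X @ beta
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(m):
            # Univariate least-squares coefficient on the partial residual.
            z_j = X[:, j] @ residual / n + beta[j]
            b_new = mcp_threshold(z_j, lam, gamma)
            if b_new != beta[j]:
                residual -= X[:, j] * (b_new - beta[j])
                max_change = max(max_change, abs(b_new - beta[j]))
                beta[j] = b_new
        if max_change < tol:
            break
    return beta
```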

Consensus Criterion
How do we find the optimal parameters corresponding to the best solution?
Squared Error (se): measures the squares of the errors (deviations) and assesses the quality of an estimator. For the training sample and a linear fit with the values of λ and γ fixed, the squared error of the predictor is computed as shown below.
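The expression for the squared error appears only as an image on the slide; a minimal form consistent with the notation above would be:

```latex
\mathrm{se}(\lambda,\gamma) \;=\; \sum_{i=1}^{n}\bigl(y_i - x_i^{T}\hat{\beta}_{\lambda,\gamma}\bigr)^{2}
```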

CRISP Method
The two-stage CRISP algorithm:
1. Generate a committee of regularization models. Each model in this committee differs from the others, but the solution for each one of them can be expressed using a unique generalized thresholding operator.
2. Obtain a consensus matrix using a consensus criterion based on the family of thresholding operators obtained by varying λ and γ, and select the optimal parameters among all these candidate regularization models for the final prediction.

CRISP Algorithm
Input: predictor matrix X; response variable Y; regularization parameter sequence Λ; non-convexity parameter sequence Γ; length L of Λ; length K of Γ.
Output: optimal model parameters (λ*, γ*) and regression coefficient vector β*.
1.  for l = L, ..., 2, 1 do
2.      Use the warm-start Lasso solution β̂_{λ_{l+1},γ_k}; initialize β̂ ← β̂_{λ_{l+1},γ_k}
3.      for k = K, ..., 2, 1 do
4.          repeat
5.              for i = 1, 2, ..., m do
6.                  β̂_i ← S_{γ_k}(β̂, λ_l)
7.              end
8.          until β̂ converges to β̂_{λ_l,γ_k}
9.          Estimate η_{λ_l,γ_k} for β̂_{λ_l,γ_k}
10.     end
11. end
12. (λ*, γ*) ← argmin_{(λ_l,γ_k)} Q(β̂)
13. Select the final model β* corresponding to (λ*, γ*)

A grid of increasing values Λ = {λ_1, λ_2, ..., λ_L, λ_{L+1}} and Γ = {γ_1, γ_2, ..., γ_K} is used to traverse different combinations of λ and γ and generate the candidate models in the ensemble. The value λ_{L+1} is used for the Lasso warm start of the CRISP algorithm.
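A compact Python sketch of the two-stage traversal described above, reusing the coordinate_descent_mcp routine sketched earlier. The selection step is simplified here to the squared-error criterion from the Consensus Criterion slide rather than the full consensus matrix η and criterion Q, and all names are illustrative rather than taken from the authors' code:

```python
import numpy as np

def crisp_select(X, y, lam_grid, gamma_grid):
    """Sketch of the two-stage CRISP traversal: stage 1 builds a committee of
    MC+ models over a (lambda, gamma) grid with warm starts; stage 2 scores
    each candidate and keeps the best (lambda*, gamma*, beta*)."""
    n, m = X.shape
    lam_grid = np.sort(np.asarray(lam_grid, dtype=float))[::-1]      # lambda_L, ..., lambda_1
    gamma_grid = np.sort(np.asarray(gamma_grid, dtype=float))[::-1]  # gamma_K, ..., gamma_1
    beta_warm = np.zeros(m)          # stand-in for the Lasso warm start at lambda_{L+1}
    best = {"score": np.inf, "lam": None, "gamma": None, "beta": None}
    for lam in lam_grid:
        beta = beta_warm.copy()
        for gamma in gamma_grid:
            # Candidate model beta_{lambda_l, gamma_k}.
            beta = coordinate_descent_mcp(X, y, lam, gamma, beta)
            score = np.sum((y - X @ beta) ** 2)   # simplified selection criterion
            if score < best["score"]:
                best = {"score": score, "lam": lam, "gamma": gamma, "beta": beta.copy()}
        beta_warm = beta              # warm start for the next (smaller) lambda
    return best["lam"], best["gamma"], best["beta"]
```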

Data Description
Electronic Health Records (EHRs): obtained from the Henry Ford Health System in Detroit, Michigan, United States, for patients admitted with a chronic heart failure (CHF) condition over a period of 10 years.
Synthetic datasets.

Dataset     # Features   # Instances
HF-cohort   77           8132
EHR-0       73           4416
EHR-1       72           3409
EHR-2       72           2748
EHR-3       72           2208
EHR-4       71           1800
Syn-1       1000         500
Syn-2       1000         5000
Syn-3       1000         10000

Evaluation Metrics
AUC: the area under the receiver operating characteristic (ROC) curve, which is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
MSE: the mean squared error, which measures how accurately the model predicts the outcome as the average squared difference between predicted and observed values.
R²: the coefficient of determination, which measures the performance of regression models as the proportion of variance in the response explained by the fit.
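For reference, the standard definitions of these two regression metrics, which are not written out on the slide, are:

```latex
\mathrm{MSE} \;=\; \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^{2},
\qquad
R^{2} \;=\; 1 - \frac{\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^{2}}{\sum_{i=1}^{n}\bigl(y_i - \bar{y}\bigr)^{2}} .
```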

Goodness of Prediction
Comparison using AUC ± std for 30-day readmission problems on longitudinal EHRs.

Model              HF-cohort        EHR-0            EHR-1            EHR-2            EHR-3            EHR-4
Logit              0.5700±0.012     0.6060±0.013     0.5270±0.027     0.5490±0.013     0.6000±0.024     0.5960±0.035
GAM                0.6274±0.016     0.5944±0.015     0.5778±0.010     0.5990±0.040     0.6027±0.022     0.5728±0.019
GA2M               0.6192±0.013     0.5719±0.012     0.5546±0.032     0.5743±0.017     0.5894±0.015     0.5514±0.018
hiernet-weak       0.5980±0.011     0.5735±0.022     0.5657±0.010     0.5718±0.013     0.6163±0.038     0.5549±0.021
hiernet-strong     0.5887±0.010     0.5706±0.021     0.5628±0.026     0.5690±0.030     0.6055±0.041     0.5590±0.035
EN-linear          0.6181±0.009     0.6129±0.014     0.6185±0.026     0.6103±0.021     0.6351±0.025     0.6201±0.018
EN-logit           0.6184±0.021     0.6138±0.029     0.6192±0.018     0.6109±0.010     0.6350±0.050     0.6199±0.031
SGL                0.6233±0.010     0.6117±0.028     0.6095±0.016     0.5991±0.030     0.6222±0.050     —
L1-SVR             0.5171±0.016     0.5157±0.008     0.5070±0.018     0.5189±0.014     0.5919±0.013     0.5822±0.057
L2-SVR             0.6269±0.017     0.6075±0.016     0.5892±0.013     0.6041±0.031     0.6258±0.033     0.5939±0.014
CRISP(se)          0.6504±0.008     0.6224±0.017     0.6194±0.025     0.6366±0.019     0.6433±0.033     0.6428±0.043
  (p-value)        (0.0013)         (7.85e-08)       (0.0003)         (5.725e-07)      (0.0031)         (0.0012)

Goodness of Prediction
Comparison using MSE ± std for 30-day readmission problems on longitudinal EHRs.

Model              HF-cohort        EHR-0            EHR-1            EHR-2            EHR-3            EHR-4
Logit              0.2103±0.008     0.2056±0.004     0.2333±0.006     0.2254±0.010     0.2194±0.008     0.2283±0.011
GAM                0.1811±0.005     0.2122±0.010     0.2197±0.004     0.2246±0.009     0.2308±0.011     0.2488±0.014
GA2M               0.2238±0.010     0.2736±0.023     0.3154±0.048     0.3177±0.028     0.3089±0.038     0.3302±0.019
hiernet-weak       0.1914±0.008     0.2232±0.010     0.2226±0.002     0.2293±0.004     0.2309±0.010     0.2551±0.002
hiernet-strong     0.1933±0.007     0.2256±0.007     0.2297±0.005     0.2335±0.009     0.2250±0.008     0.2559±0.019
EN-linear          0.1832±0.003     0.2059±0.001     0.2075±0.002     0.2152±0.004     0.2145±0.004     0.2263±0.001
EN-logit           0.1833±0.006     0.2061±0.008     0.2077±0.009     0.2153±0.002     0.2146±0.004     0.2265±0.004
SGL                0.1816±0.003     0.2050±0.004     0.2065±0.008     0.2149±0.007     0.2151±0.009     0.2272±0.009
L1-SVR             0.7166±0.022     0.9585±0.037     0.9756±0.027     1.0635±0.041     1.0861±0.053     1.0814±0.107
L2-SVR             0.2402±0.007     0.2985±0.016     0.3104±0.026     0.3333±0.028     0.3441±0.049     0.3980±0.016
CRISP(se)          0.1775±0.002     0.2030±0.003     0.2050±0.003     0.2110±0.003     0.2083±0.004     0.2202±0.003

Goodness of Prediction
Comparison using AUC for 365-day readmission problems on EHRs.

Goodness of Prediction
Comparison using MSE ± std on synthetic datasets.

Model        Syn-1           Syn-2           Syn-3
L0           0.3677±0.030    0.9598±0.156    1.0391±0.087
L∞           0.2439±0.032    0.8806±0.140    1.0214±0.070
EN-linear    0.1892±0.032    0.7832±0.087    1.0020±0.138
SGL          0.1744±0.030    0.8392±0.097    0.9028±0.059
CRISP(se)    0.0861±0.012    0.7698±0.179    1.0015±0.188

Comparison using R² on synthetic datasets.

Model        Syn-1     Syn-2     Syn-3
L0           0.6269    0.1539    0.1602
L∞           0.6197    0.2510    0.1269
EN-linear    0.8093    0.2064    0.1181
SGL          0.5046    0.1682    0.1038
CRISP(se)    0.9124    0.2215    0.2057

Scalability Experiments
[Figure panels: Syn-2 and Syn-3.]
The MC+ penalty runs faster compared to other non-convex penalties. The CRISP method, which uses the MC+ penalty, can therefore perform efficiently on high-dimensional datasets.

Conclusion
We presented a method named CRISP for solving the consensus regularization problem for regression, which has not previously been addressed in the literature. This method generates a committee of non-convex regularized linear regression models using the minimax concave plus (MC+) penalty, and it applies a consensus criterion to select the best model for prediction. We evaluated this model on both longitudinal EHRs and high-dimensional synthetic datasets using diverse metrics such as AUC, MSE, and R², and assessed the scalability of CRISP. The results indicate that CRISP performs better than various other additive, interaction-based, and sparse regression models.

Thank You
Questions and Comments?
Chandan K. Reddy: reddy@cs.vt.edu
Ping Wang: ping@vt.edu

Notations

Name   Description
n      number of instances
m      number of features
X      feature vector matrix
Y      response variable
β      regression coefficient vector
λ      scalar regularization parameter
γ      scalar non-convexity parameter
η      consensus matrix
L      length of the regularization sequence
K      length of the non-convexity sequence

Note: In high-dimensional data, m is much greater than n. This motivates the use of a relatively small number of predictors to accurately predict the outcome.