Institute of Statistics and Decision Sciences
In Defense of a Dissertation Submitted for the Degree of Doctor of Philosophy
26 July 2005
Regression Model Search and Uncertainty with Many Predictors
Christopher M. Hans

26 July 2005 Regression Model Search & Uncertainty
Overview
Regression Model Space Exploration
SSS Methodology
–Comparisons to MCMC
–Analytic Evaluation
Sparsity in the Normal Linear Model
Several examples from cancer genomics

Variable Selection & Model Uncertainty
Regression Modeling
–Many possible predictors
Model/Variable Selection
–Choose one set of “relevant” predictors
Model Averaging
–Use all models (or at least a set of high-probability models)

Model Search Strategies
Stepwise Methods
–Forward/Backward selection
–“Leaps and Bounds” [Furnival & Wilson, 1974]
MCMC
–Gibbs sampling [George & McCulloch, 1993, 1997; Geweke, 1996; Smith & Kohn, 1996, 1997; Brown et al., 1998]
–Metropolis-Hastings [Madigan & York, 1995; Raftery et al., 1997; Green, 1995]

Sparsity via p(γ)
Focus is on “sparse” models
Sparsity is encoded in the model via the prior
Specification of important

Parameter Space Priors
Encompassing Model
Induced Priors
Closed form calculation of p(y | γ)

Shotgun Stochastic Search
Shoot out many proposals
Evaluate them in parallel
Sample a new model from the proposals

Criteria for Model Search
At each iteration:
1. Move across dimension effectively
2. Allow each variable to be considered
3. Quickly identify “similar” models

Regression Model SSS
Defining the Neighborhood:
Current model γ of size k
Consider three types of proposals:
–neighboring models γ⁻ of dimension k - 1
–neighboring models γ° of dimension k
–neighboring models γ⁺ of dimension k + 1
These are the models “shot out” (in parallel)

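The three proposal sets can be enumerated directly. The sketch below is illustrative only: the function name `neighborhood` and the representation of a model as a set of column indices are assumptions, not notation from the slides. It treats the deletion set as empty when k = 1, so the null model is never proposed.

```python
from itertools import product

def neighborhood(model, p):
    """Return the deletion, swap, and addition neighbors of `model`.

    `model` is a set of predictor indices drawn from range(p)
    (a hypothetical encoding of the indicator vector).
    """
    current = frozenset(model)
    outside = [j for j in range(p) if j not in current]
    # Deletion set: drop one variable (left empty when k = 1).
    deletion = [current - {i} for i in current] if len(current) > 1 else []
    # Swap set: replace one included variable with one excluded variable.
    swap = [(current - {i}) | {j} for i, j in product(sorted(current), outside)]
    # Addition set: add one excluded variable.
    addition = [current | {j} for j in outside]
    return deletion, swap, addition

deletion, swap, addition = neighborhood({0, 3}, p=6)
print(len(deletion), len(swap), len(addition))  # 2 8 4
```

With k = 2 and p = 6 the set sizes are k, k(p - k), and p - k, so the swap set dominates the cost of each iteration as p grows.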
Choosing the New Model
Would like dimensional balance:
Bayes: sample based on relative posterior probabilities
Otherwise: BIC, R², F statistic, etc.

Regression Model SSS
[Diagram: Current Model → Parallel Computing Step → Three Proposals → New Model]

SSS Output
As the search progresses:
–Maintain a list of the best models evaluated
–Based on a score function [log p(y | γ) + log p(γ)]
–Use these models to summarize the posterior

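A minimal sketch of the search loop described above, under stated assumptions: the score is a BIC-based stand-in for log p(y | γ) + log p(γ) (the slides list BIC as a non-Bayesian option), proposals are sampled with probability proportional to exp(score), and one candidate is drawn from each of the three sets before choosing among them to keep dimensional balance. The names `bic_score`, `sample_by_score`, and `sss` are hypothetical, and the scoring loop runs serially where the slides evaluate proposals in parallel.

```python
import heapq
import numpy as np

def bic_score(X, y, model):
    """Negative half-BIC of a least squares fit with intercept (higher is better)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X[:, sorted(model)]])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    rss = float(resid @ resid)
    return -0.5 * (n * np.log(rss / n) + A.shape[1] * np.log(n))

def sample_by_score(models, scores, rng):
    """Draw one model with probability proportional to exp(score)."""
    s = np.asarray(scores)
    w = np.exp(s - s.max())  # subtract the max for numerical stability
    idx = rng.choice(len(models), p=w / w.sum())
    return models[idx], scores[idx]

def sss(X, y, start, n_iter=20, keep=10, seed=0):
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    current = frozenset(start)
    evaluated = {current: bic_score(X, y, current)}
    for _ in range(n_iter):
        outside = [j for j in range(p) if j not in current]
        sets = [
            [current - {i} for i in current] if len(current) > 1 else [],  # deletion
            [(current - {i}) | {j} for i in current for j in outside],     # swap
            [current | {j} for j in outside],                              # addition
        ]
        picks = []
        for models in sets:
            if not models:
                continue
            scores = [bic_score(X, y, m) for m in models]  # the parallelizable step
            evaluated.update(zip(models, scores))
            picks.append(sample_by_score(models, scores, rng))
        cand, cand_scores = zip(*picks)
        current, _ = sample_by_score(list(cand), list(cand_scores), rng)
    # Report the best models evaluated, sorted by score.
    return heapq.nlargest(keep, evaluated.items(), key=lambda kv: kv[1])

rng = np.random.default_rng(1)
X = rng.standard_normal((60, 12))
y = X[:, 0] - 2.0 * X[:, 3] + 0.5 * rng.standard_normal(60)
top = sss(X, y, start={5})
print(sorted(top[0][0]), round(top[0][1], 2))
```

Every scored proposal is recorded, not only the models the chain visits, which is what lets the list of top models grow much faster than the number of iterations.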
Posterior Summarization
Condition on the list of best models
–Norming Constant:
–Model probabilities:
–Dimension Importance:
–Variable importance:

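The formulas on this slide did not survive extraction. One common way to compute the four summaries is to renormalize the scores over the retained list; the sketch below assumes that approach. The function name `summarize` and the set-of-indices model encoding are illustrative, not taken from the slides.

```python
import numpy as np

def summarize(models, log_scores, p):
    """Renormalize log scores over a retained model list and summarize.

    models     : list of non-empty sets of predictor indices
    log_scores : log p(y | model) + log p(model) for each model
    """
    s = np.asarray(log_scores, dtype=float)
    w = np.exp(s - s.max())                # subtract the max for stability
    probs = w / w.sum()                    # model probabilities given the list
    log_norm = s.max() + np.log(w.sum())   # log of the norming constant
    inclusion = np.zeros(p)                # variable importance
    size_probs = {}                        # dimension importance
    for m, pr in zip(models, probs):
        inclusion[list(m)] += pr
        size_probs[len(m)] = size_probs.get(len(m), 0.0) + pr
    return probs, log_norm, inclusion, size_probs

models = [{0}, {0, 1}, {2}]
log_scores = np.log([0.5, 0.3, 0.2])
probs, log_norm, inclusion, size_probs = summarize(models, log_scores, p=3)
print(probs, inclusion, size_probs)
```

These quantities are conditional on the list: they describe the models that were found, and say nothing about posterior mass outside it.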
Relationship to MCMC
Metropolis-Hastings:
–Use P(x) restricted to a neighborhood B(x) as the proposal distribution
–Acceptance Probability:

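The acceptance formula itself was lost in extraction. Under the standard Metropolis-Hastings construction it can be worked out as follows, assuming the proposal is P renormalized over B(x) and that the neighborhoods are symmetric (x is in B(x') whenever x' is in B(x)). This is a reconstruction under those assumptions, not necessarily the expression shown on the slide.

```latex
% Proposal: target P restricted to the neighborhood B(x)
q(x \to x') = \frac{P(x')}{\sum_{z \in B(x)} P(z)}, \qquad x' \in B(x)

% Metropolis-Hastings acceptance probability; P(x) P(x') cancels
\alpha(x \to x')
  = \min\left\{ 1,\ \frac{P(x')\, q(x' \to x)}{P(x)\, q(x \to x')} \right\}
  = \min\left\{ 1,\ \frac{\sum_{z \in B(x)} P(z)}{\sum_{z \in B(x')} P(z)} \right\}
```

The ratio compares the total mass of the two neighborhoods, so computing it requires scoring B(x') as well as B(x).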
Analysis of SSS Performance
Fixed dimensional SSS: nbd(γ) = γ°
Expected time to find γ*
Orthogonal design

Simulation Study
Real Data: n = 41, p = 8,408
Simulated y from “true” model: |γ*| = 4
p ∈ { 10, 100, 500, 1000, 2500, 5000, 7500, 8408, 10000, 12500, 15000, 17500, 20000, 22500, }

Comparison to MCMC
40,000 SSS iterations
SSS: 11 hrs. 53 min.
29,163 Gibbs iterations
Gibbs: 75.41%
1,137,195,208 model evaluations
135,252 Gibbs iterations
Gibbs: 55 hrs. 13 min.
Gibbs: 97.49%

Illustration: Glioblastoma Survival Study
Keck Center for Neurooncogenomics at Duke
–n = 41 (patients), p = 8,408 (genes)
–Expression levels are standardized
–Priors: = 1, = 3, = 10/p

Posterior Summarization
1,000,000 models from 40,000 iterations (<12 hours)

Assessing Model Fit
Model Averaged Fit
–Sample from

Extension to Binary Regression

Extension to Weibull Survival Models

Sparsity in the Normal Linear Model
Sparsity via p(γ)
Sparsity via p(y | γ): shrinkage
Marginal Likelihood:

Lower Bound on the Marginal Likelihood
Theorem:
–y and x_j are centered, scaled
–rank(X) = k for all k < n
–For fixed >0, >0 and y
–Equality achieved when:

Implications
Model “fit” has been removed
–Latent penalty for adding an irrelevant predictor
= 1:

Inference on Sparsity
Estimate average value of p(y | γ) for models in

Inference on Sparsity
Shift by lower bound
Approximate parametrically
Estimate missing posterior mass

Stochastic Version of Lower Bound

Stochastic Version of Lower Bound
First order Taylor expansion gives

Assessing the Approximation
[Figure panels: Random Models; Keck Models]

Marginal Likelihood for

Marginal Posterior p( | y)
Assign prior distribution:

Future Considerations
Connections to other modeling frameworks
–Gaussian graphical models
Model space prior distributions
–p( | X)
–p( | ), X ∼ F
p*(y | ) for other priors
Extended analysis of SSS, connections to MCMC

Notation
Regression Subsets / Models:
–γ is a p × 1 indicator vector
γ_j = 1 if x_j is in the model
γ_j = 0 if x_j is not in the model
γ′ = (1,0,0,1,…) → model has x_1 and x_4
“Dimension” or Size of a model

Linear Model Framework
For a given model
Bayesian framework

Parameter Space Priors
Regression Model:
Implied Regression:

Parameter Space Priors
Derive from an encompassing model
–Consistency across regression models
Assign a prior

Neighborhood Example
deletion set
swap set
addition set
Note
–|γ⁻| = k if k > 2 (γ⁻ = ∅ if k = 1)
–|γ°| = k (p - k)
–|γ⁺| = p - k

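As a worked instance of these counts, take p = 8,408 (the glioblastoma data) and a current model of size k = 4. The choice k = 4 is purely illustrative.

```latex
|\gamma^{-}| = k = 4, \qquad
|\gamma^{\circ}| = k\,(p - k) = 4 \times 8404 = 33{,}616, \qquad
|\gamma^{+}| = p - k = 8404

|\gamma^{-}| + |\gamma^{\circ}| + |\gamma^{+}| = 42{,}024 \ \text{models scored in one iteration}
```

The swap set accounts for about 80% of the proposals, which is why evaluating the neighborhood in parallel matters at this scale.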
Relationship to MCMC
Closely related to Metropolis-Hastings
–Use P(x) restricted to a neighborhood B(x) as the proposal distribution

Relationship to MCMC
Accept move with probability
Relating Notation:

Relationship to MCMC
Can’t use “two stage” sampling
–Say we sample a model γ′ from γ⁺

Analysis of SSS Performance
Define the map
Z_t ∈ {0,…,k}
Analyze the induced chain {Z_t}

Analysis of SSS Performance
Consider the random variable
Interest is in
Focus on v_0: Z_0 = 0

Analysis of SSS Performance
Need to specify the transition matrix P_{p,k}
The vector of expected hitting times is

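The expression for the hitting-time vector was lost in extraction. For a finite Markov chain, the standard result is v = (I - Q)^{-1} 1, where Q is the transition matrix restricted to the non-target states. The sketch below applies that result to a small, made-up three-state chain; the matrix is hypothetical and is not the P_{p,k} of the slides.

```python
import numpy as np

def expected_hitting_times(P, target):
    """Expected number of steps to first reach `target` from each other state."""
    P = np.asarray(P, dtype=float)
    others = [s for s in range(P.shape[0]) if s != target]
    Q = P[np.ix_(others, others)]  # transitions among non-target states
    # Solve (I - Q) v = 1 instead of forming the inverse explicitly.
    v = np.linalg.solve(np.eye(len(others)) - Q, np.ones(len(others)))
    return dict(zip(others, v))

# Hypothetical chain on {0, 1, 2}; state 2 plays the role of Z_t = k.
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.00, 1.00]]
v = expected_hitting_times(P, target=2)
print(v)  # v[0] = 8, v[1] = 6 up to floating point rounding
```

The entry for state 0 corresponds to the v_0 that the analysis focuses on: the expected time to reach the target when the chain starts at Z_0 = 0.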
Analysis of SSS for Orthogonal Designs
x_i′x_j = 0 for all i ≠ j
Two models a, b
X_a = (X_1 X_2) and X_b = (X_1 X_3)
X_1 is a set of k - 1 common variables

Analysis of SSS for Orthogonal Designs
Ratio of marginal likelihoods
Least squares coefficient

Analysis of SSS for Orthogonal Designs
Simplifying assumption:

Simulation Study
Time for SSS to find γ* as p increases
Based on brain cancer survival data
–n = 41, p = 8,408
–“True” model has four variables
–Simulated m = 1,...,50 response vectors, y^(m)
–ε^(m)_i ∼ N(0, 0.5), i = 1,…,n

Simulation Study
p ∈ { 10, 100, 500, 1000, 2500, 5000, 7500, 8408, 10000, 12500, 15000, 17500, 20000, 22500, }
Reorder X so that “true” variables are 1,2,3,4
p ≤ 8,408
–Take first p columns of X and randomly permute
p > 8,408
–Take first 8,408 columns, add p - 8,408 columns of random noise, and permute

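The two cases above can be sketched as a single helper. The function name `design_for_p` is hypothetical, and drawing the noise columns from a standard normal is an assumption; the slides say only “random noise”.

```python
import numpy as np

def design_for_p(X, p, rng):
    """Build an n x p design matrix following the two cases on the slide.

    X is assumed to be reordered already so the "true" variables come first.
    Returns the permuted design and the permutation that was applied.
    """
    n, p_full = X.shape
    if p <= p_full:
        Xp = X[:, :p]  # keep the first p columns
    else:
        noise = rng.standard_normal((n, p - p_full))  # assumed noise distribution
        Xp = np.hstack([X, noise])
    perm = rng.permutation(p)  # shuffle the column order
    return Xp[:, perm], perm

rng = np.random.default_rng(0)
X = rng.standard_normal((41, 50))  # small stand-in for the 41 x 8,408 matrix
X_small, perm_small = design_for_p(X, 10, rng)
X_large, perm_large = design_for_p(X, 80, rng)
print(X_small.shape, X_large.shape)  # (41, 10) (41, 80)
```

Because the true variables occupy the leading columns before the permutation, they stay in the design for every p of at least 4, and the permutation hides their position from the search.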
Marginal Posterior for
Assign prior distribution
When p >> k,

Example: Binary Regression
Can extend SSS to binary regression
Use Laplace approximation

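For binary regression the marginal likelihood has no closed form, which is where the Laplace approximation comes in. The sketch below assumes a logistic link and an independent N(0, tau2) prior on each coefficient. Both are illustrative choices, as are the names `laplace_log_marginal` and `tau2`; the slides do not give the prior or the link here.

```python
import numpy as np

def laplace_log_marginal(X, y, tau2=1.0, n_newton=50):
    """Laplace approximation to log p(y | model) for logistic regression.

    Assumes beta ~ N(0, tau2 * I); X holds only the model's columns.
    """
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_newton):  # Newton steps toward the posterior mode
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - mu) - beta / tau2
        H = X.T @ (X * (mu * (1.0 - mu))[:, None]) + np.eye(d) / tau2
        step = np.linalg.solve(H, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    eta = X @ beta
    mu = 1.0 / (1.0 + np.exp(-eta))
    # Negative Hessian of the log posterior, evaluated at the mode.
    H = X.T @ (X * (mu * (1.0 - mu))[:, None]) + np.eye(d) / tau2
    log_lik = float(y @ eta - np.sum(np.logaddexp(0.0, eta)))
    log_prior = -0.5 * d * np.log(2.0 * np.pi * tau2) - 0.5 * float(beta @ beta) / tau2
    _, logdet = np.linalg.slogdet(H)
    return log_lik + log_prior + 0.5 * d * np.log(2.0 * np.pi) - 0.5 * logdet

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = (rng.random(100) < 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))).astype(float)
print(laplace_log_marginal(X[:, [0]], y), laplace_log_marginal(X[:, [1]], y))
```

A value like this can stand in for log p(y | γ) in the SSS score, so the search itself is unchanged; only the model evaluation step differs from the normal linear case.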
Predicting Lymph Node Status
X: Gene Expression (breast cancer tumors)
Y: Lymph Node Positivity Status
n = 148
n_0 = 100 low risk (node negative)
n_1 = 48 high risk (high node positive)
p = 4,512

SSS Results: 100,000 Models

Model Fit and Predictive Accuracy

Example: Survival Regression
Weibull survival models
Use Laplace approximation

Predicting Survival Time
X: Gene Expression (lung cancer tumors)
Y: Survival Time
n = 91 patients
d = 45 observed survival times
n - d = 46 censored times
p = 2,717

SSS Results: 100,000 models

Survival Predictions

Sensitivity/Specificity