Bayesian Semi-Parametric Multiple Shrinkage


Bayesian Semi-Parametric Multiple Shrinkage. Paper by Richard F. MacLehose and David B. Dunson. Duke University Machine Learning Group. Presented by Lu Ren.

Outline: Motivation; Model and Lasso Prior; Semi-Parametric Multiple Shrinkage Priors; Posterior Computation; Experiment Results; Conclusions.

Motivation 1. Non-identified effects are commonplace with high-dimensional or correlated data, such as gene microarrays. 2. Standard techniques use independent normal priors centered at zero, with the degree of shrinkage controlled by the prior variance. 3. When sufficient prior knowledge is available, coefficients can be assumed exchangeable within specific groups and allowed to shrink toward different, group-specific means. 4. When such prior knowledge is lacking, the paper proposes a Bayesian semiparametric hierarchical model that places a DP prior on the unknown mean and scale parameters.

Model and Lasso Prior Suppose we collect data (y_i, x_i), i = 1, ..., n, where x_i is a p x 1 vector of predictors and y_i is a binary outcome. A standard approach is to estimate the coefficients beta in a regression model for Pr(y_i = 1 | x_i). For large p, maximum likelihood estimates will tend to have high variance and may not be unique. However, we could incorporate a penalty by using a lasso prior, beta_j ~ DE(lambda). Here DE denotes a double exponential (Laplace) distribution, equivalent to the scale mixture of normals beta_j | tau_j ~ N(0, tau_j) with tau_j ~ Exp(lambda^2 / 2).
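
To make the scale-mixture equivalence concrete, here is a small numerical check. This is only a sketch: the rate-lambda parameterization of the DE density and the Exp(lambda^2/2) mixing distribution follow the usual Bayesian-lasso convention and are assumptions of this illustration, not notation recovered from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0          # DE rate; density is (lam/2) * exp(-lam * |beta|)
n = 200_000

# Direct double-exponential (Laplace) draws with scale 1/lam
beta_direct = rng.laplace(loc=0.0, scale=1.0 / lam, size=n)

# Scale-mixture-of-normals draws: tau ~ Exp(rate = lam^2/2), beta | tau ~ N(0, tau)
tau = rng.exponential(scale=2.0 / lam**2, size=n)   # numpy uses scale = 1/rate
beta_mix = rng.normal(0.0, np.sqrt(tau))

print(np.quantile(beta_direct, [0.05, 0.5, 0.95]))
print(np.quantile(beta_mix, [0.05, 0.5, 0.95]))     # should closely agree
```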

Multiple Shrinkage Prior In many situations shrinkage toward non-null values will be beneficial. Instead of inducing shrinkage toward zero, the lasso model is extended by introducing a mixture prior with separate prior location and scale parameters for each coefficient: beta_j ~ DE(mu_j, lambda_j), with (mu_j, lambda_j) drawn from an unknown distribution that is given a DP prior. The data then play more of a role in the choice of these hyperparameters, while sparsity is still favored through a carefully tailored hyperprior. The DP prior is nonparametric and allows clustering of the parameters, which helps reduce dimensionality.

Multiple Shrinkage Prior The proposed prior structure: beta_j ~ DE(mu_j, lambda_j), (mu_j, lambda_j) ~ P, P ~ DP(alpha P_0), with base measure P_0 over the locations and scales. The amount of shrinkage a coefficient exhibits toward its prior mean mu_j is determined by lambda_j, with larger values resulting in greater shrinkage. Therefore, the hyperparameters of P_0 are specified to make the prior as sparse as possible.

Multiple Shrinkage Prior Assume the coefficient-specific hyperparameter values (mu_j, lambda_j) group into k <= p clusters. The number of clusters is controlled by the DP precision alpha, and the coefficients are adaptively shrunk toward non-zero locations. The prior has an equivalent stick-breaking form, P = sum_h pi_h delta_(mu*_h, lambda*_h), with pi_1 = V_1 and pi_h = V_h prod_{l<h} (1 - V_l) for h > 1, where the random variables V_h ~ Beta(1, alpha) and the atoms (mu*_h, lambda*_h) ~ P_0.
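
A hedged sketch of a draw from this prior under a truncated stick-breaking approximation follows; the truncation level, the normal-times-gamma base measure, and all hyperparameter values below are illustrative assumptions rather than the paper's defaults.

```python
import numpy as np

rng = np.random.default_rng(1)
p, alpha, H = 200, 1.0, 50        # coefficients, DP precision, truncation level

# Stick-breaking weights: pi_h = V_h * prod_{l<h} (1 - V_l), with V_h ~ Beta(1, alpha)
V = rng.beta(1.0, alpha, size=H)
pi = V * np.cumprod(np.concatenate(([1.0], 1.0 - V[:-1])))

# Cluster-specific (location, rate) atoms from an assumed base measure P_0
mu_star = rng.normal(0.0, 2.0, size=H)      # assumed N(0, 2^2) over prior means
lam_star = rng.gamma(2.0, 2.0, size=H)      # assumed Gamma(2, scale=2) over DE rates

# Each coefficient picks a cluster, then beta_j ~ DE(mu*_h, lambda*_h)
h = rng.choice(H, size=p, p=pi / pi.sum())
beta = rng.laplace(loc=mu_star[h], scale=1.0 / lam_star[h])

print("number of occupied clusters:", np.unique(h).size)
```

Because many coefficients share the same atom, the draw illustrates how the DP clusters coefficients around a small number of shrinkage targets.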

Multiple Shrinkage Prior Small values of alpha make the number of clusters increase more slowly than the number of coefficients. Choosing a relatively diffuse base measure over the prior locations can give support to a wide range of possible prior means.

Multiple Shrinkage Prior Treat coefficients falling within a small range around zero as having no meaningful biologic effect. Default prior specification: 1. For the hyperprior on the DE scale parameters, choose values that are large enough to encourage shrinkage but not so large as to overwhelm the data. 2. Specify the rate so that the implied DE prior has prior credible intervals of unit width. 3. Set the base measure over the prior locations to assign 95% probability to a very wide range of plausible prior effects.
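
As a small illustration of item 2, the rate of a DE(0, lambda) prior whose central credible interval has unit width can be solved in closed form; the 95% level used here is an assumption for illustration only.

```python
import numpy as np

def de_rate_for_interval(width=1.0, level=0.95):
    """Rate lambda such that the central `level` credible interval of DE(0, lambda)
    has the given width: P(|beta| <= width/2) = 1 - exp(-lambda * width / 2)."""
    return -np.log(1.0 - level) / (width / 2.0)

print(de_rate_for_interval())   # about 5.99 under the assumed 95% level
```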

Multiple Shrinkage Prior Some testing methods: Assume a coefficient with |beta_j| <= delta is null, and let H_j = 1 indicate that predictor j has some effect, i.e. |beta_j| > delta. From the MCMC output we can estimate the posterior probability Pr(H_j = 1 | data); or we can estimate the posterior expected false discovery rate (FDR) for a threshold on these probabilities; or simply list the predictors ordered by their posterior probabilities of a non-null effect.
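
These quantities are straightforward to compute from MCMC output. The sketch below assumes a draws-by-coefficients matrix of posterior samples; the null half-width delta and the probability threshold kappa are illustrative values, not the paper's choices.

```python
import numpy as np

def posterior_inclusion(beta_draws, delta=0.1):
    """Estimate Pr(|beta_j| > delta | data) from MCMC draws (n_draws x p)."""
    return (np.abs(beta_draws) > delta).mean(axis=0)

def expected_fdr(post_prob, kappa=0.9):
    """Posterior expected FDR when flagging predictors with post_prob >= kappa."""
    flagged = post_prob >= kappa
    if not flagged.any():
        return 0.0
    return float((1.0 - post_prob[flagged]).sum() / flagged.sum())

# Toy usage with fabricated draws, just to show the call pattern
draws = np.random.default_rng(2).normal(size=(1000, 20))
p_hat = posterior_inclusion(draws, delta=0.1)
print(expected_fdr(p_hat, kappa=0.9))
```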

Posterior Computation Assume the outcome y_i = 1 occurs when a latent variable w_i exceeds zero, where w_i = x_i' beta + epsilon_i. 1a. Augment the data with latent variables w_i sampled from their truncated full conditionals. 1b. Update the latent scale parameters by sampling from their full conditionals. 2. Update the regression coefficients beta, given the current values of the other parameters, by sampling from the multivariate normal full conditional below.

Posterior Computation The full-conditional mean and covariance for beta take the usual conjugate normal form, built from two diagonal matrices: one holding the observation-specific latent scales and one holding the coefficient-specific prior variances. 3. Update the scale-mixture mixing parameters from their full conditionals. 4a. Update the prior location and scale parameters (mu*_h, lambda*_h) using a modified version of the retrospective stick-breaking algorithm, sampling each atom from the conditional distribution given below.
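
The transcript has lost the exact conditionals, so the following is only a structural sketch of steps 1a and 2: it assumes a probit link with Albert-Chib truncated-normal augmentation and a conditionally normal prior beta_j | tau_j ~ N(mu_j, tau_j) (the scale-mixture form of the DE prior). The function name and these modeling choices are assumptions; the paper's own updates may differ in detail.

```python
import numpy as np
from scipy import stats

def gibbs_sweep(X, y, beta, mu, tau, rng):
    """One sweep of steps 1a and 2 under the assumed probit augmentation."""
    n, p = X.shape
    eta = X @ beta
    # 1a. latent w_i ~ N(eta_i, 1) truncated to (0, inf) if y_i = 1, (-inf, 0) otherwise,
    #     sampled by inverting the truncated CDF
    u = rng.uniform(size=n)
    lo = np.where(y == 1, stats.norm.cdf(-eta), 0.0)
    hi = np.where(y == 1, 1.0, stats.norm.cdf(-eta))
    w = eta + stats.norm.ppf(lo + u * (hi - lo))
    # 2. beta from its multivariate normal full conditional given w, mu, tau
    prior_prec = np.diag(1.0 / tau)
    post_cov = np.linalg.inv(X.T @ X + prior_prec)
    post_mean = post_cov @ (X.T @ w + prior_prec @ mu)
    return rng.multivariate_normal(post_mean, post_cov)
```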

Posterior Computation The conditional distribution for each atom combines the base measure with the DE likelihood of the coefficients currently assigned to that cluster. 4b. Sample the stick-breaking weights V_h from their updated Beta full conditionals. 4c. Update the vector of coefficient cluster configurations using a Metropolis step.

Posterior Computation To determine the proposal configuration, sample a candidate cluster index for coefficient j from the stick-breaking weights. If the candidate index lies beyond the currently instantiated clusters, extend the stick and draw new atom values from the base measure until the candidate index is reached. The proposed configuration moves coefficient j to the candidate cluster. The acceptance probability of moving from the current configuration to the proposed one is the usual Metropolis-Hastings ratio for this proposal.
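
A heavily hedged sketch of this proposal-and-accept step for a single coefficient follows. The base-measure draws, the function name, and the reduction of the acceptance ratio to a ratio of DE densities reflect a generic independence proposal from the prior weights; the paper's exact algorithm may differ.

```python
import numpy as np

def de_logpdf(beta, mu, lam):
    # log density of DE(mu, lam): (lam/2) * exp(-lam * |beta - mu|)
    return np.log(lam / 2.0) - lam * np.abs(beta - mu)

def metropolis_cluster_move(beta_j, cur_h, V, mu_star, lam_star, alpha, rng):
    """Propose a cluster for one coefficient from the stick-breaking weights,
    retrospectively extending the stick with atoms from an assumed base measure
    if needed, then accept/reject with a ratio of DE densities."""
    u = rng.uniform()
    while True:
        pi = V * np.cumprod(np.concatenate(([1.0], 1.0 - V[:-1])))
        cum = np.cumsum(pi)
        if u < cum[-1]:
            cand = int(np.searchsorted(cum, u))
            break
        # extend the stick: new stick length and new atoms from the assumed base measure
        V = np.append(V, rng.beta(1.0, alpha))
        mu_star = np.append(mu_star, rng.normal(0.0, 2.0))
        lam_star = np.append(lam_star, rng.gamma(2.0, 2.0))
    log_ratio = (de_logpdf(beta_j, mu_star[cand], lam_star[cand])
                 - de_logpdf(beta_j, mu_star[cur_h], lam_star[cur_h]))
    new_h = cand if np.log(rng.uniform()) < log_ratio else cur_h
    return new_h, V, mu_star, lam_star
```

Because the proposal is drawn from the prior cluster weights, those weights cancel in the Metropolis-Hastings ratio, leaving only the DE likelihood ratio shown above.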

Experiment Results 1. Simulation (50 data sets): (a) 400 observations and 20 parameters, with 10 of the parameters having a true effect of 2 and the remaining 10 a true effect of 0; (b) 100 observations and 200 parameters, 10 of which have a true effect of 2 while the remaining have a true effect of 0. The results show that the multiple shrinkage prior offers improvement over the standard Bayesian lasso, and the reduction in MSE is largely a result of decreased bias. Scenario (a): for the first 10 coefficients, MSE = 0.03 under the multiple shrinkage prior compared to 1.08 under the standard lasso; for the remaining 10, MSE = 0.01 versus 0.04.
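
For reference, data in the spirit of these two scenarios might be generated as follows; the predictor distribution and the link function are not stated in this transcript, so standard normal predictors and a logistic link are assumptions of the sketch.

```python
import numpy as np

def simulate(n, p, n_signal=10, effect=2.0, seed=0):
    """n observations, p coefficients, the first n_signal with true effect `effect`
    and the rest 0; N(0,1) predictors and a logistic link are assumed here."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    beta = np.concatenate((np.full(n_signal, effect), np.zeros(p - n_signal)))
    prob = 1.0 / (1.0 + np.exp(-X @ beta))
    y = rng.binomial(1, prob)
    return X, y, beta

X, y, beta_true = simulate(400, 20)   # scenario (a); simulate(100, 200) gives (b)
```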

Experiment Results Scenario (b): the MSE of the 10 coefficients with an effect of 2 is much lower under the multiple shrinkage model (1.5 vs. 3.2); the remaining 190 coefficients are estimated with slightly higher MSE under the multiple shrinkage prior than under the standard lasso (0.08 vs. 0.01). 2. Experiments on the Diabetes (Pima) data.

Experiment Results The multiple shrinkage prior offered an improvement, with a lower misclassification rate than both the standard lasso and an SVM (21% vs. 22% and 23%). 3. Multiple myeloma: data from 80 individuals diagnosed with multiple myeloma are analyzed to determine whether any polymorphisms are related to early age at onset. The predictor dimension is 135.


Conclusions The multiple shrinkage prior provides greater flexibility in both the amount of shrinkage and the value toward which coefficients are shrunk. The new method can greatly decrease MSE (largely as a result of decreasing bias), as demonstrated in the experimental results.