The Group Lasso for Logistic Regression
Lukas Meier, Sara van de Geer and Peter Bühlmann
Presenter: Lu Ren, ECE Dept., Duke University
Sept. 19, 2008

Outline
- From lasso to group lasso
- Logistic group lasso
- Algorithms for the logistic group lasso
- Logistic group lasso-ridge hybrid
- Simulation and application to splice site detection
- Discussion

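Lasso
A popular model selection and shrinkage estimation method. In a linear regression set-up:
- $Y$: continuous response
- $X$: design matrix ($n \times p$)
- $\beta$: parameter vector
The lasso estimator is then defined as
$$\hat\beta_\lambda = \arg\min_\beta \Big( \|Y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j| \Big),$$
where $\|u\|_2^2 = \sum_{i=1}^{n} u_i^2$, and a larger $\lambda$ sets some $\hat\beta_j$ exactly to 0.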
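To illustrate how a larger penalty sets coefficients exactly to 0, here is a minimal sketch under a simplifying assumption that is not on the slide: the design is orthonormal ($X^T X = I$). In that case the lasso objective separates per coordinate and the solution is soft-thresholding of $z = X^T Y$ at $\lambda/2$. The names `soft_threshold` and `lasso_orthonormal` and the values in `z` are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward 0 by t; return exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_orthonormal(z, lam):
    """Lasso solution for an orthonormal design, given z = X^T Y.

    Assumption: X^T X = I, so the objective reduces (up to a constant) to
    sum_j (z_j - b_j)^2 + lam * |b_j|, minimized by thresholding at lam / 2.
    """
    return [soft_threshold(zj, lam / 2.0) for zj in z]


z = [3.0, -0.4, 1.2]               # hypothetical least-squares coefficients
small = lasso_orthonormal(z, 0.5)  # threshold 0.25: all coefficients survive
large = lasso_orthonormal(z, 3.0)  # threshold 1.5: two coefficients become 0
```

With the small penalty every coefficient is only shrunk; with the large one, the two coefficients whose magnitude is below the threshold are set exactly to zero.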

Group Lasso
When categorical predictors (factors) are present in addition to continuous ones, the lasso solution is unsatisfactory because it selects individual dummy variables rather than whole factors.
Extending the lasso penalty, the group lasso estimator is
$$\hat\beta_\lambda = \arg\min_\beta \Big( \|Y - X\beta\|_2^2 + \lambda \sum_{g=1}^{G} \|\beta_{I_g}\|_2 \Big),$$
where $I_g$ is the index set belonging to the $g$th group of variables.
The penalty performs variable selection at the group level and is intermediate between the $\ell_1$- and $\ell_2$-type penalties. It encourages that either $\hat\beta_g = 0$ or $\hat\beta_{g,j} \neq 0$ for all $j \in \{1, \dots, df_g\}$.
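The "intermediate" nature of the group penalty can be checked numerically. The sketch below computes the sum of Euclidean norms over groups and compares it with the plain $\ell_1$ and $\ell_2$ penalties; the function names, the coefficient vector and the grouping are illustrative assumptions.

```python
import math


def l1_penalty(beta):
    """Sum of absolute values of the coefficients."""
    return sum(abs(b) for b in beta)


def l2_penalty(beta):
    """Euclidean norm of the whole coefficient vector."""
    return math.sqrt(sum(b * b for b in beta))


def group_lasso_penalty(beta, groups):
    """Sum over groups of the Euclidean norm of each group's coefficients."""
    return sum(math.sqrt(sum(beta[j] ** 2 for j in idx)) for idx in groups)


beta = [3.0, 4.0, 2.0]     # hypothetical coefficients
groups = [[0, 1], [2]]     # one two-dimensional factor and one scalar

grouped = group_lasso_penalty(beta, groups)                  # 5 + 2
as_l1 = group_lasso_penalty(beta, [[0], [1], [2]])           # singleton groups
as_l2 = group_lasso_penalty(beta, [[0, 1, 2]])               # one big group
```

With singleton groups the group penalty reduces to the $\ell_1$ penalty, and with a single group containing every coefficient it reduces to the $\ell_2$ norm; any other grouping lies between the two.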

Connection
Consider a case with two factors, with coefficients $\beta_1 = (\beta_{11}, \beta_{12})^T$ and a scalar $\beta_2$, and observe the contours of the penalty functions:
- The $\ell_1$-penalty treats the three co-ordinate directions differently from all other directions, which encourages sparsity in individual coefficients.
- The $\ell_2$-penalty treats all directions equally and does not encourage sparsity.
Ref: Ming Yuan and Yi Lin, Model selection and estimation in regression with grouped variables, J. R. Statist. Soc. B, 2006.

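Logistic Group Lasso
Independent and identically distributed observations $(x_i, y_i)$, $i = 1, \dots, n$:
- $x_i$: $p$-dimensional vector of $G$ predictors
- $y_i \in \{0, 1\}$: a binary response variable
- $df_g$: degrees of freedom of the $g$th predictor
The conditional probability $p_\beta(x_i) = P_\beta(Y = 1 \mid x_i)$ is modelled by
$$\log\frac{p_\beta(x_i)}{1 - p_\beta(x_i)} = \eta_\beta(x_i), \quad \text{with} \quad \eta_\beta(x_i) = \beta_0 + \sum_{g=1}^{G} x_{i,g}^T \beta_g.$$
The estimator $\hat\beta_\lambda$ is given by the minimizer of the convex function
$$S_\lambda(\beta) = -\ell(\beta) + \lambda \sum_{g=1}^{G} s(df_g) \|\beta_g\|_2,$$
where $\ell(\beta) = \sum_{i=1}^{n} \big[ y_i \eta_\beta(x_i) - \log\{1 + \exp(\eta_\beta(x_i))\} \big]$ is the log-likelihood.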

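Logistic Group Lasso (continued)
- $\lambda \geq 0$ controls the amount of penalization.
- $s(\cdot)$ rescales the penalty with respect to the dimensionality of $\beta_g$; the usual choice is $s(df_g) = df_g^{1/2}$.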
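As a rough sketch of the criterion being minimized, the function below evaluates the negative logistic log-likelihood plus a group penalty rescaled by $s(df_g) = df_g^{1/2}$. It assumes each group's degrees of freedom equal its number of columns; the names `objective` and `log1p_exp` and the toy data are hypothetical.

```python
import math


def log1p_exp(eta):
    """Numerically stable log(1 + exp(eta))."""
    if eta > 0:
        return eta + math.log1p(math.exp(-eta))
    return math.log1p(math.exp(eta))


def objective(beta0, beta, X, y, groups, lam):
    """Negative log-likelihood plus lam * sum_g sqrt(df_g) * ||beta_g||_2.

    Assumption: df_g is taken to be len(group), i.e. s(df_g) = sqrt(df_g).
    """
    loglik = 0.0
    for xi, yi in zip(X, y):
        eta = beta0 + sum(xj * bj for xj, bj in zip(xi, beta))
        loglik += yi * eta - log1p_exp(eta)
    penalty = sum(
        math.sqrt(len(idx)) * math.sqrt(sum(beta[j] ** 2 for j in idx))
        for idx in groups
    )
    return -loglik + lam * penalty


# Toy data: 4 observations, one 2-column factor and one single column.
X = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [-1.0, -1.0, 1.0], [0.0, 0.0, 0.0]]
y = [1, 0, 1, 0]
groups = [[0, 1], [2]]

at_zero = objective(0.0, [0.0, 0.0, 0.0], X, y, groups, lam=1.0)
```

At $\beta = 0$ every fitted probability is 1/2 and the penalty vanishes, so the value is $n \log 2$ regardless of $\lambda$.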

Optimization Algorithms
1. Block co-ordinate descent
Cycle through the parameter groups and minimize the objective function $S_\lambda$ with respect to the current group, keeping all other groups fixed.

Optimization Algorithms (continued)
- $\beta_{-g}$: the parameter vector $\beta$ with $\beta_g$ set to 0 while all other components remain unchanged.
- $\beta^{(t)}$: the parameter vector after $t$ block updates. It can be shown that every limit point of the sequence $\{\beta^{(t)}\}_{t \geq 0}$ is a minimum point of $S_\lambda$.
- The blockwise minimizations of the active groups must be performed numerically; this is sufficiently fast for small group sizes and dimensions.
2. Block co-ordinate gradient descent
Combine a quadratic approximation of the log-likelihood with an additional line search.
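For the block co-ordinate gradient descent idea, a single-group update direction can be sketched as follows. This is a derivation under stated assumptions rather than something shown on the slide: the Hessian block of the log-likelihood is approximated by $h_g I$ with a scalar $h_g < 0$, and the direction $d_g$ minimizes the quadratic model $-(d_g^T \nabla\ell_g + \tfrac{1}{2} h_g \|d_g\|_2^2) + \lambda s_g \|\beta_g + d_g\|_2$. The function name and arguments are hypothetical.

```python
import math


def block_direction(beta_g, grad_g, h_g, lam, s_g):
    """Update direction for one parameter group.

    beta_g : current coefficients of the group
    grad_g : gradient of the log-likelihood with respect to the group
    h_g    : negative scalar; the Hessian block is assumed to be h_g * I
    lam    : penalty parameter, s_g : rescaling factor s(df_g)
    """
    v = [g - h_g * b for g, b in zip(grad_g, beta_g)]
    v_norm = math.sqrt(sum(vi * vi for vi in v))
    if v_norm <= lam * s_g:
        # The penalty dominates: the whole group moves to exactly zero.
        return [-b for b in beta_g]
    scale = lam * s_g / v_norm
    return [-(g - scale * vi) / h_g for g, vi in zip(grad_g, v)]


# Weak gradient: the group is set to zero (d = -beta).
d_zero = block_direction([0.1, -0.1], [0.05, 0.02], -1.0, 1.0, math.sqrt(2))

# Strong gradient: the group becomes active.
d_active = block_direction([0.0, 0.0], [3.0, 4.0], -1.0, 1.0, 1.0)
```

In the second call the new coefficients point in the gradient direction with length $(5 - 1)/1 = 4$, which is the group-level analogue of soft-thresholding.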

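Armijo rule: an inexact line search. Let $\alpha^{(t)}$ be the largest value in $\{\alpha_0 \delta^l\}_{l \geq 0}$ so that
$$S_\lambda(\beta^{(t)} + \alpha^{(t)} d^{(t)}) - S_\lambda(\beta^{(t)}) \leq \alpha^{(t)} \sigma \Delta^{(t)},$$
where $0 < \delta < 1$, $0 < \sigma < 1$, $\alpha_0 > 0$, and $\Delta^{(t)}$ is the improvement in the objective function predicted when the log-likelihood is replaced by its linear approximation:
$$\Delta^{(t)} = -d^{(t)T} \nabla\ell(\beta^{(t)}) + \lambda s(df_g) \|\beta_g^{(t)} + d_g^{(t)}\|_2 - \lambda s(df_g) \|\beta_g^{(t)}\|_2.$$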
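A generic backtracking search of this kind can be sketched as below. It is not tied to the logistic objective: the caller supplies the objective `f`, the direction `d` and the predicted (negative) improvement `delta_pred`. The parameter names, default values and the `max_halvings` safeguard are assumptions made for illustration.

```python
def armijo_step(f, x, d, delta_pred, alpha0=1.0, shrink=0.5, sigma=0.1,
                max_halvings=50):
    """Return the largest alpha in {alpha0 * shrink**l} with sufficient decrease.

    Sufficient decrease means f(x + alpha * d) - f(x) <= alpha * sigma * delta_pred,
    where delta_pred < 0 is the improvement predicted by a linear model.
    Returns 0.0 if no step is accepted within max_halvings trials.
    """
    fx = f(x)
    alpha = alpha0
    for _ in range(max_halvings):
        candidate = [xi + alpha * di for xi, di in zip(x, d)]
        if f(candidate) - fx <= alpha * sigma * delta_pred:
            return alpha
        alpha *= shrink
    return 0.0


# Toy example: f(x) = x^2 at x = 1, direction = minus the gradient.
f = lambda v: v[0] ** 2
step = armijo_step(f, [1.0], [-2.0], delta_pred=-4.0)
```

Here the full step overshoots to $x = -1$ and gives no decrease, so it is rejected; the halved step lands on the minimum and is accepted.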

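Optimization Algorithms (continued)
Minimization with respect to the $g$th parameter group depends on $H_{gg}^{(t)}$ only; here define $H_{gg}^{(t)} = h_g^{(t)} I_{df_g}$. A proper choice is
$$h_g^{(t)} = -\max\big[\mathrm{diag}\{-\nabla^2 \ell(\beta^{(t)})_{gg}\},\, c_*\big],$$
where $c_* > 0$ is a lower bound to ensure convergence.
To calculate the solutions $\hat\beta_\lambda$ on a grid of the penalty parameter $0 \leq \lambda_K < \dots < \lambda_1 \leq \lambda_{\max}$, we can start at
$$\lambda_{\max} = \max_{g} \frac{1}{s(df_g)} \|X_g^T (y - \bar{y})\|_2.$$
We use $\hat\beta_{\lambda_k}$ as a starting value for $\hat\beta_{\lambda_{k+1}}$ and proceed iteratively until $\hat\beta_{\lambda_K}$, with $\lambda_K$ equal or close to 0.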
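The starting point of the penalty grid can be sketched as follows, assuming the formula $\lambda_{\max} = \max_g \|X_g^T (y - \bar y)\|_2 / \sqrt{df_g}$ with $df_g$ taken as the number of columns in group $g$. The geometric spacing of the grid (`ratio`) is an assumption of this sketch, as are the function names and toy data.

```python
import math


def lambda_max(X, y, groups):
    """Smallest penalty at which every group is zero (intercept-only model)."""
    n = len(y)
    y_bar = sum(y) / n
    resid = [yi - y_bar for yi in y]
    best = 0.0
    for idx in groups:
        score = [sum(X[i][j] * resid[i] for i in range(n)) for j in idx]
        norm = math.sqrt(sum(s * s for s in score))
        best = max(best, norm / math.sqrt(len(idx)))
    return best


def lambda_grid(lam_max, n_values=5, ratio=0.5):
    """Decreasing grid lam_max, lam_max * ratio, ... (geometric, an assumption)."""
    return [lam_max * ratio ** k for k in range(n_values)]


X = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]]
y = [1, 0, 1, 0]
groups = [[0, 1], [2]]

lam_max = lambda_max(X, y, groups)
grid = lambda_grid(lam_max)
```

A solver would then be run once per grid value in decreasing order, passing the previous solution as the starting value so that each fit begins close to its optimum.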

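Hybrid Methods: logistic group lasso-ridge hybrid
- The models selected by the group lasso are large compared with the underlying true models.
- The ordinary lasso can obtain good prediction with smaller models by using the lasso with relaxation.
Define $\hat{I}_\lambda = \{j : \hat\beta_{\lambda,j} \neq 0\}$, the index set of predictors selected by the group lasso with penalty parameter $\lambda$, and $\hat{M}_\lambda = \{\beta : \beta_j = 0 \text{ for } j \notin \hat{I}_\lambda\}$, the set of possible parameter vectors of the corresponding submodel.
The group lasso-ridge hybrid estimator is
$$\hat\beta_{\lambda,\kappa} = \arg\min_{\beta \in \hat{M}_\lambda} \Big( -\ell(\beta) + \kappa \sum_{g=1}^{G} \frac{s(df_g)}{\sqrt{df_g}} \|\beta_g\|_2^2 \Big).$$
$\kappa = 0$ is a special case called the group lasso-MLE hybrid.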
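The two-stage idea (select groups with the group lasso, then refit with a ridge penalty on the selected submodel) can be sketched as follows. The sketch only shows the selection step and the second-stage criterion, not the refit itself, and it assumes $s(df_g) = \sqrt{df_g}$ so that the ridge term is simply $\kappa$ times the squared coefficients. All names and numbers are hypothetical.

```python
import math


def active_groups(beta, groups, tol=0.0):
    """Groups whose coefficient block has Euclidean norm above tol."""
    return [idx for idx in groups
            if math.sqrt(sum(beta[j] ** 2 for j in idx)) > tol]


def hybrid_objective(beta0, beta, X, y, active, kappa):
    """Second-stage criterion restricted to the selected groups.

    Coefficients outside `active` are treated as zero. With kappa = 0 this is
    the plain negative log-likelihood of the submodel (the MLE hybrid).
    """
    keep = sorted(j for idx in active for j in idx)
    neg_loglik = 0.0
    for xi, yi in zip(X, y):
        eta = beta0 + sum(xi[j] * beta[j] for j in keep)
        # Stable log(1 + exp(eta)).
        log_term = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        neg_loglik += log_term - yi * eta
    ridge = sum(beta[j] ** 2 for j in keep)
    return neg_loglik + kappa * ridge


X = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [-1.0, -1.0, 1.0], [0.0, 0.0, 0.0]]
y = [1, 0, 1, 0]
groups = [[0, 1], [2]]
beta_lasso = [0.5, -0.2, 0.0]   # hypothetical group lasso fit

selected = active_groups(beta_lasso, groups)
value = hybrid_objective(0.0, beta_lasso, X, y, selected, kappa=2.0)
```

Separating selection from the second-stage criterion keeps the ridge parameter `kappa` independent of the penalty that determined which groups are in the model.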

Simulation
- First sample instances of a nine-dimensional multivariate normal distribution with mean 0 and covariance matrix $\Sigma$.
- Each component is transformed into a four-valued categorical variable by using the quartiles of the standard normal distribution.
- Simulate the coefficients from independent standard normal variables.
- Four different cases are studied.
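The quartile transformation can be sketched with the standard library. For brevity this example draws independent standard normal values rather than correlated nine-dimensional vectors, which is a simplification of the set-up described above; the names `CUTS` and `to_category` and the sample size are hypothetical.

```python
import random
from statistics import NormalDist

# Quartiles of the standard normal: roughly -0.674, 0.0 and 0.674.
CUTS = [NormalDist().inv_cdf(q) for q in (0.25, 0.5, 0.75)]


def to_category(z):
    """Map a real value to a level in {0, 1, 2, 3} by counting quartiles below it."""
    return sum(z > c for c in CUTS)


random.seed(0)
sample = [to_category(random.gauss(0.0, 1.0)) for _ in range(2000)]
counts = [sample.count(level) for level in range(4)]
```

Because the cut points are quartiles of the sampling distribution, the four levels occur with roughly equal frequency.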

Observations:
- The group lasso seems to select unnecessarily large models with many noise variables.
- The group lasso-MLE hybrid is very conservative in selecting terms.
- The group lasso-ridge hybrid seems to be the best compromise and has the best prediction performance in terms of the log-likelihood score.

Application: Experiment
Splice sites: the regions between coding (exons) and non-coding (introns) DNA segments.
Two training data sets:
- 5610 true and 5610 false donor sites
- 2805 true and [...] false donor sites
Test set: 4208 true and [...] false donor sites.
For a threshold $\tau$ we assign observation $i$ to class 1 if $p_{\hat\beta}(x_i) > \tau$ and to class 0 otherwise.
Performance measure: the Pearson correlation between true class membership and predicted class membership.
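The thresholding rule and the correlation measure can be sketched as follows. Searching over a small set of thresholds for the one with the highest correlation is an assumption of this sketch, as are the function names, the convention of returning 0.0 when a correlation is undefined, and the toy probabilities.

```python
import math


def classify(probs, tau):
    """Assign class 1 when the fitted probability exceeds tau, else class 0."""
    return [1 if p > tau else 0 for p in probs]


def pearson(a, b):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either sequence is constant (correlation undefined);
    this convention is an assumption of the sketch.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (v - mean_b) for x, v in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def best_threshold(probs, truth, taus):
    """Return (correlation, tau) for the threshold with the highest correlation."""
    return max((pearson(truth, classify(probs, t)), t) for t in taus)


truth = [1, 0, 1, 0, 1]
probs = [0.9, 0.2, 0.7, 0.4, 0.45]   # hypothetical fitted probabilities
rho, tau = best_threshold(probs, truth, [0.3, 0.42, 0.5])
```

For binary labels this correlation is useful on heavily unbalanced data, where plain accuracy would be dominated by the majority class.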

The corresponding correlation values on the test set are [...] and [...], respectively.
- Whereas the group lasso solution has some active three-way interactions, the group lasso-ridge hybrid and the group lasso-MLE hybrid contain only two-way interactions.
- The three-way interactions of the group lasso solution seem to be very weak.
- The best model with respect to the log-likelihood score on the validation set is the group lasso estimator.

Conclusions
- Study the group lasso for logistic regression
- Present an efficient algorithm (automatic and much faster)
- Propose the group lasso-ridge hybrid method
- Apply to short DNA motif modelling and splice site detection