High-dimensional Error Analysis of Regularized M-Estimators. Ehsan Abbasi, Christos Thrampoulidis, Babak Hassibi. Allerton Conference, Wednesday, September 30.

Presentation transcript:


Linear Regression Model. Estimate an unknown signal from noisy linear measurements; the ingredients are the measurement/design matrix, the unknown signal, and the noise vector (the model is written out below).
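The measurement model, written out; the symbols y, A, x0, z and the dimensions m, n are assumptions of this rendering, chosen to match the rest of the talk:

```latex
% Noisy linear measurements (assumed notation: m measurements, ambient dimension n)
y = A x_0 + z, \qquad y \in \mathbb{R}^m,\; A \in \mathbb{R}^{m \times n},\; x_0 \in \mathbb{R}^n,\; z \in \mathbb{R}^m .
```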

M-estimators. For some convex loss function, solve the minimization below (formulation sketched after this slide). Does this include Maximum Likelihood (ML) estimators? Examples: least squares, least absolute deviations, the Huber loss, etc. Classical keywords: Fisher information, consistency, asymptotic normality, the Cramér-Rao bound, ML, robust statistics, Huber loss, optimal loss, ...
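A hedged rendering of the M-estimation program the slide refers to; the loss symbol ℓ and the row-wise form are assumptions of this rendering:

```latex
\hat{x} \;=\; \arg\min_{x \in \mathbb{R}^n} \; \sum_{i=1}^{m} \ell\!\left(y_i - a_i^{\top} x\right),
\qquad
\ell(t) = t^2 \ (\text{least squares}), \quad \ell(t) = |t| \ (\text{least absolute deviations}), \quad \ell = \text{Huber}, \ \dots
```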

Why revisit & what changes? Traditional: the number of measurements grows, but the ambient dimension n is fixed. Modern: n is increasingly large, as in machine learning, image processing, sensor/social networks, DNA microarrays, ... Structured signals: sparse, low-rank, block-sparse, low-varying, ... (compressive sensing). This motivates regularized M-estimators (see the sketch below), where the regularizer is structure inducing, convex, and typically non-smooth: L1, nuclear, L1/L2 norms, total variation, atomic norms, ...
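The regularized program alluded to above, again with the symbols (λ for the tuning parameter, f for the regularizer) assumed here:

```latex
\hat{x} \;=\; \arg\min_{x \in \mathbb{R}^n} \; \sum_{i=1}^{m} \ell\!\left(y_i - a_i^{\top} x\right) \;+\; \lambda\, f(x),
\qquad
f \in \left\{ \|\cdot\|_1,\ \|\cdot\|_{*},\ \|\cdot\|_{1,2},\ \mathrm{TV},\ \text{atomic norms}, \dots \right\}.
```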

Classical question, modern regime: new results & phenomena. High-dimensional proportional regime (m and n grow together). The question goes back to the 1950s (Huber, Kolmogorov, ...). Only very recent advances, for special instances and under strict assumptions; no general theory! Assumption: the measurement matrix has entries iid Gaussian, the benchmark in CS/statistics theory (universality).

Contribution. Assume the measurement matrix has entries iid Gaussian, plus mild regularity conditions on the loss, p_z, f, and p_x0. Then, with probability one, the error converges (at a rate) to a limit determined by the unique solution to a system of four nonlinear equations in four unknowns (sketched below).
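A hedged reading of the missing display: the statement has roughly the following flavor, with the exact normalization of the error and the form of the four equations deliberately left unspecified here, since they are not recoverable from this transcript.

```latex
\|\hat{x} - x_0\|_2 \;\longrightarrow\; \alpha_* \quad \text{with probability one, as } m, n \to \infty \text{ proportionally},
```

where α* is determined by the unique solution of the system of four nonlinear equations in four unknowns.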

The Equations. Let's parse them to get some insight ...

The Explicit Ones. Certain problem parameters appear in the equations explicitly.

The Loss and the Regularizer. The loss function and the regularizer appear through their Moreau envelope approximations. In the traditional regime, the functions themselves appear instead of their Moreau envelopes.

The Distributions. The convolution of the pdf of the noise with a Gaussian is a completely new phenomenon compared to the traditional regime.
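For concreteness, the convolution referred to is the density of the noise perturbed by an independent Gaussian; the symbols (p_z for the noise pdf, σ for the Gaussian standard deviation) are assumed here:

```latex
(p_z \ast \varphi_{\sigma})(t) \;=\; \int_{-\infty}^{\infty} p_z(s)\, \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(t-s)^2/(2\sigma^2)} \, ds ,
```

i.e., the pdf of z + σ g with g ~ N(0, 1) independent of z.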

The Expected Moreau Envelope. The role of the loss and the regularizer is summarized in how their expected Moreau envelopes affect the error performance of the M-estimator. The expected Moreau envelope is (strictly) convex and continuously differentiable, even if the underlying function is non-differentiable! It generalizes the "Gaussian width", the "Gaussian distance squared", and the "statistical dimension". The same holds for both the loss and the regularizer.

Reminder: Moreau Envelopes. The Moreau-Yosida envelope of f evaluated at x with parameter τ (definition below): it always underestimates f at x, and the smaller the τ, the closer it is to f; it is a smooth approximation, always continuously differentiable in both x and τ (even if f is non-differentiable); it is jointly convex in x and τ; the optimal v is unique (the proximal operator); everything extends to functions f of a vector argument.
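The standard definition the slide is reminding us of; this is the textbook Moreau-Yosida envelope, not anything specific to the paper:

```latex
e_{f}(x;\tau) \;=\; \min_{v} \left\{ f(v) + \frac{1}{2\tau}\,\|x - v\|_2^2 \right\},
\qquad
\mathrm{prox}_{\tau f}(x) \;=\; \arg\min_{v} \left\{ f(v) + \frac{1}{2\tau}\,\|x - v\|_2^2 \right\}.
```

The unique minimizer v is the proximal operator mentioned on the slide.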

Examples.
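A small numerical illustration (not necessarily one of the slide's examples, which do not appear in the transcript): the Moreau envelope of the absolute value is the Huber function, a fact worth keeping in mind for the non-smooth losses later in the talk. The grid-based minimization below is a brute-force sketch.

```python
import numpy as np

def moreau_envelope_1d(f, x, tau, grid):
    """Brute-force Moreau envelope e_f(x; tau) = min_v f(v) + (x - v)^2 / (2 tau)."""
    return np.min(f(grid) + (x - grid) ** 2 / (2.0 * tau))

def huber(x, tau):
    """Closed form of the Moreau envelope of |.| with parameter tau."""
    return x**2 / (2.0 * tau) if abs(x) <= tau else abs(x) - tau / 2.0

grid = np.linspace(-20.0, 20.0, 400001)   # fine grid for the inner minimization
tau = 1.0
for x in [0.0, 0.4, 1.0, 3.5]:
    approx = moreau_envelope_1d(np.abs, x, tau, grid)
    print(f"x = {x:4.1f}   grid envelope = {approx:.6f}   Huber = {huber(x, tau):.6f}")
```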

Set Indicator Function; Gaussian width.
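This is presumably the connection being drawn: for the indicator ι_C of a closed convex set C (zero on C, +∞ off it), the Moreau envelope is a scaled squared distance, and its expectation at a Gaussian point recovers the Gaussian-distance-squared quantities familiar from compressed sensing. Written out as standard facts, with the particular set left abstract:

```latex
e_{\iota_{\mathcal{C}}}(x;\tau) \;=\; \frac{1}{2\tau}\,\mathrm{dist}^2(x,\mathcal{C}),
\qquad
\mathbb{E}_{g \sim \mathcal{N}(0, I_n)}\!\left[ e_{\iota_{\mathcal{C}}}(g;\tau) \right]
\;=\; \frac{1}{2\tau}\,\mathbb{E}\!\left[ \mathrm{dist}^2(g,\mathcal{C}) \right].
```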

Summarizing Key Features. Squared error of general regularized M-estimators. Minimal and generic regularity assumptions: non-smooth losses, heavy tails, non-separable functions, ... Key role of expected Moreau envelopes: strictly convex and smooth, and they generalize known geometric summary parameters. Observation: fast solution by a simple iterative scheme (sketched below)!
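A minimal sketch of the kind of iterative scheme one might use on a small nonlinear system. The map `F` below is a hypothetical contractive placeholder, not the paper's actual four equations:

```python
import numpy as np

def fixed_point(F, x0, damping=0.5, tol=1e-10, max_iter=1000):
    """Damped fixed-point iteration x <- (1 - d) * x + d * F(x) on a small system."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = (1 - damping) * x + damping * F(x)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Hypothetical placeholder map in four unknowns (NOT the paper's equations):
# a contractive toy system used only to illustrate the iteration.
F = lambda v: np.array([0.5 * np.cos(v[1]),
                        0.5 * np.sin(v[0]) + 0.1,
                        0.3 * v[3] + 0.2,
                        0.3 * v[2] + 0.1])
print(fixed_point(F, np.ones(4)))
```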

Simulations. Optimal tuning?

Non-smooth losses.

Non-smooth losses. Optimal loss?

Non-smooth losses. Consistent estimators?

Heavy-tailed noise. Huber loss function + iid Cauchy noise. Robustness?
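A minimal sketch of the kind of experiment this slide reports on; it is not the authors' simulation. The dimensions, the Huber threshold, and the use of cvxpy are arbitrary choices made here for illustration.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n, k = 200, 100, 10
A = rng.standard_normal((m, n))
x0 = np.zeros(n); x0[:k] = rng.standard_normal(k)
z = rng.standard_cauchy(m)              # heavy-tailed (Cauchy) noise
y = A @ x0 + z

x = cp.Variable(n)
# Huber-loss M-estimator (threshold 1.0 chosen arbitrarily)
cp.Problem(cp.Minimize(cp.sum(cp.huber(A @ x - y, 1.0)))).solve()
x_huber = x.value

# Plain least squares for comparison
cp.Problem(cp.Minimize(cp.sum_squares(A @ x - y))).solve()
x_ls = x.value

print("Huber error:", np.linalg.norm(x_huber - x0))
print("LS    error:", np.linalg.norm(x_ls - x0))
```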

Non-separable loss: the square-root LASSO.
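To make "non-separable" concrete, here is a hedged sketch of the square-root LASSO, whose loss ||Ax - y||_2 does not decompose across measurements; the dimensions, noise level, and value of λ are placeholder choices, not the slide's settings.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
m, n, k = 150, 300, 10
A = rng.standard_normal((m, n)) / np.sqrt(m)
x0 = np.zeros(n); x0[rng.choice(n, k, replace=False)] = 1.0
y = A @ x0 + 0.1 * rng.standard_normal(m)

lam = 0.1
x = cp.Variable(n)
# Square-root LASSO: non-separable L2 loss plus L1 regularizer
prob = cp.Problem(cp.Minimize(cp.norm(A @ x - y, 2) + lam * cp.norm(x, 1)))
prob.solve()
print("relative error:", np.linalg.norm(x.value - x0) / np.linalg.norm(x0))
```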

Beyond Gaussian Designs. The analysis framework directly applies to elliptically distributed designs (with modified equations). For the LASSO, we have extended the ideas to IRO matrices. Universality over iid entries (empirical observation).

Convex Gaussian Min-max Theorem (CGMT) [TAH'15, TOH'15]. Apply the CGMT to the primary optimization (PO) and relate it to an auxiliary optimization (AO).
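From memory of the cited works, the primary and auxiliary problems have roughly the following shape; the precise constraint sets, convexity requirements, and probabilistic comparison inequalities are omitted, so treat this as a sketch rather than the theorem's exact statement:

```latex
\text{(PO)} \quad \Phi(G) = \min_{w \in \mathcal{S}_w} \max_{u \in \mathcal{S}_u} \; u^{\top} G\, w + \psi(w, u),
\qquad
\text{(AO)} \quad \phi(g, h) = \min_{w \in \mathcal{S}_w} \max_{u \in \mathcal{S}_u} \; \|w\|_2\, g^{\top} u + \|u\|_2\, h^{\top} w + \psi(w, u),
```

with G an iid Gaussian matrix and g, h independent iid Gaussian vectors; the CGMT lets one transfer concentration statements about Φ(G) to the simpler φ(g, h).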

Proof Diagram. M-estimator → (PO), via duality → (AO), via the CGMT → (DO), a deterministic min-max optimization in 4 variables → The Equations, via first-order optimality conditions.

Related Literature. [El Karoui 2013, 2015]: ridge regularization, smooth loss, no structured x0; elliptical distributions and iid entries beyond Gaussian. [Donoho, Montanari 2013]: no regularizer; smooth + strongly convex loss, bounded noise.

Conclusions. A master theorem for general M-estimators: minimal assumptions; 4 nonlinear equations with a unique solution and a fast iterative solution (why?); summary parameters given by expected Moreau envelopes. Opportunities, lots to be asked: optimal loss function? Optimal regularizer? When can we be consistent? Optimally tuning the tuning parameter? LASSO: linear = non-linear [TAH'15 NIPS]. The CGMT framework is powerful: non-linear measurements y = g(A x0), beyond squared-error analysis, applying the CGMT for a different set S ... [TAYH'15 ICASSP]