Raymond J. Carroll Texas A&M University Postdoctoral Training Program: Non/Semiparametric.

Raymond J. Carroll Texas A&M University Postdoctoral Training Program: Non/Semiparametric Regression and Clustered/Longitudinal Data

Where am I From? [Map of Texas] College Station, home of Texas A&M; Wichita Falls, my hometown; Big Bend National Park; highways I-35 and I-45.

Acknowledgments: Raymond Carroll, Alan Welsh, Naisyin Wang, Enno Mammen, Xihong Lin, Oliver Linton. A series of papers is on my web site. Lin, Wang and Welsh: longitudinal data (Mammen and Linton for pseudo-observation methods). Linton and Mammen: time series data.

Outline: Longitudinal models (panel data). Background: splines = kernels for independent data. Correlated data: do splines = kernels? Semiparametric case (partially linear model): does it matter what nonparametric method is used?

Panel Data (for simplicity): i = 1,…,n clusters/individuals; j = 1,…,m observations per cluster.

Subject   Wave 1   Wave 2   …   Wave m
1         X        X            X
2         X        X            X
…
n         X        X            X

Panel Data (for simplicity): important points. The cluster size m is meant to be fixed. This is not a multiple time series problem in which the cluster size increases to infinity. Some comments on the single time series problem are given near the end of the talk.

The Marginal Nonparametric Model: Y = response; X = time-varying covariate. Question: can we improve efficiency by accounting for correlation?

The Marginal Nonparametric Model: important assumption. Covariates at other waves are not conditionally predictive, i.e., they are surrogates. This assumption is required for any GLS fit, including parametric GLS.

Independent Data: Splines (smoothing splines, P-splines, etc.) with penalty parameter λ = ridge regression fit. Some bias, smaller variance. λ = 0 is over-parameterized least squares; λ = ∞ is a polynomial regression.
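The "spline = ridge regression" remark can be illustrated with a minimal Python sketch. It assumes a truncated power basis with knots at sample quantiles and a ridge penalty on the knot coefficients only; the function name `pspline_fit`, the knot placement, and the symbol `lam` for the penalty parameter are illustrative choices, not taken from the talk.

```python
import numpy as np

def pspline_fit(x, y, num_knots=20, lam=1.0, degree=1):
    """Penalized spline fitted as a ridge regression (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Assumption: interior knots at equally spaced sample quantiles of x.
    probs = np.linspace(0, 1, num_knots + 2)[1:-1]
    knots = np.quantile(x, probs)

    def basis(t):
        t = np.asarray(t, dtype=float)
        poly = np.vander(t, degree + 1, increasing=True)   # 1, t, ..., t^degree
        trunc = np.clip(t[:, None] - knots[None, :], 0, None) ** degree
        return np.hstack([poly, trunc])

    B = basis(x)
    # Ridge penalty on the knot coefficients only; the polynomial part is free.
    D = np.diag(np.r_[np.zeros(degree + 1), np.ones(num_knots)])
    coef = np.linalg.solve(B.T @ B + lam * D, B.T @ y)
    return lambda t: basis(t) @ coef
```

With a very large `lam` the knot coefficients are shrunk to essentially zero and the fit collapses to a polynomial of the chosen degree; with a small `lam` the fit approaches over-parameterized least squares.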

Independent Data: Kernels (local averages, local linear, etc.), with kernel density function K and bandwidth h. As the bandwidth h → 0, only observations with X near t get any weight in the fit.
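As a rough illustration of the kernel approach, here is a minimal local linear estimator in Python. The Gaussian kernel and the function name `local_linear` are assumptions made for the sketch; the slide does not specify a particular kernel.

```python
import numpy as np

def local_linear(x, y, t, h):
    """Local linear kernel estimate of E[Y | X = t] (Gaussian kernel assumed)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    est = []
    for t0 in np.atleast_1d(t):
        w = np.exp(-0.5 * ((x - t0) / h) ** 2)          # kernel weights K((X - t)/h)
        Z = np.column_stack([np.ones_like(x), x - t0])  # local intercept and slope
        A = Z.T @ (w[:, None] * Z)
        b = Z.T @ (w * y)
        est.append(np.linalg.solve(A, b)[0])            # intercept = fit at t0
    return np.array(est)
```

Shrinking `h` concentrates the weights on observations with X close to t, which is the locality property the slide describes.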

Independent Data: the major methods are splines and kernels. Smoothing parameters are required for both. Fits: similar in many (most?) datasets. Expectation: some combination of bandwidths and kernel functions looks like splines.

Independent Data: Splines and kernels are linear in the responses. Silverman showed that there is a kernel function and a bandwidth so that the weight functions are asymptotically equivalent. In this sense, splines = kernels. This talk is about the same result for correlated data.
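For orientation, the equivalence can be written out as follows. This is a sketch of the result as it is usually stated for the cubic smoothing spline; the notation (the estimate written as a weighted average, f for the density of X, λ for the penalty in the criterion n^{-1} Σ {Y_i − g(X_i)}² + λ ∫ g''²) is assumed here rather than taken from the slides.

```latex
% Any linear smoother: a weighted average of the responses
\hat{\Theta}(t) = \frac{1}{n} \sum_{i=1}^{n} G_n(t, X_i)\, Y_i .

% Silverman-type approximation for the smoothing-spline weight function
G_n(t, x) \approx \frac{1}{f(x)\, h(x)}\, K\!\left\{ \frac{t - x}{h(x)} \right\},
\qquad
h(x) = \left\{ \frac{\lambda}{f(x)} \right\}^{1/4},

% with the equivalent kernel
K(u) = \tfrac{1}{2} \exp\!\left( -\frac{|u|}{\sqrt{2}} \right)
       \sin\!\left( \frac{|u|}{\sqrt{2}} + \frac{\pi}{4} \right).
```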

[Figure] The weight functions G_n(t = 0.25, x) in a specific case for independent data. Panels: kernel; smoothing spline. Note the similarity of shape and the locality: only X's near t = 0.25 get any weight.

Working Independence: ignore all correlations and fix up the standard errors at the end. Advantage: the assumption that covariates at other waves are not conditionally predictive is not required. Disadvantage: possible severe loss of efficiency if carried too far.

Working Independence: ignore all correlations, but posit some reasonable marginal variances; weighting is important for efficiency. Weighted versions: splines and kernels have obvious analogues. Standard method: Zeger & Diggle; Hoover, Rice, Wu & Yang; Lin & Ying; etc.

Working Independence: Weighted splines and weighted kernels are linear in the responses. The Silverman result still holds. In this sense, splines = kernels.

Accounting for Correlation: Splines have an obvious analogue for non-independent data. Choose a working covariance matrix. Penalized generalized least squares (GLS) = GLS ridge regression. Because splines are based on likelihood ideas, they generalize quickly to new problems.
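A hedged sketch of penalized GLS as "GLS ridge regression" follows. It assumes balanced clusters stored as (n, m) arrays, a linear truncated power basis, and a single working covariance matrix `Sigma` shared by all clusters; the function name `gls_pspline` and these choices are illustrative, not the speaker's implementation.

```python
import numpy as np

def gls_pspline(X, Y, Sigma, knots, lam):
    """Penalized GLS spline for clustered data (illustrative sketch).

    X, Y: (n, m) arrays, one row per cluster. Sigma: (m, m) working covariance.
    Minimizes sum_i (Y_i - B_i b)' Sigma^{-1} (Y_i - B_i b) + lam * b' D b.
    """
    knots = np.asarray(knots, dtype=float)

    def basis(t):
        t = np.asarray(t, dtype=float).ravel()
        trunc = np.clip(t[:, None] - knots[None, :], 0, None)
        return np.column_stack([np.ones_like(t), t, trunc])

    Sinv = np.linalg.inv(Sigma)
    p = 2 + len(knots)
    A = np.zeros((p, p))
    c = np.zeros(p)
    for Xi, Yi in zip(X, Y):
        Bi = basis(Xi)
        A += Bi.T @ Sinv @ Bi          # GLS normal equations, cluster by cluster
        c += Bi.T @ Sinv @ Yi
    D = np.diag(np.r_[0.0, 0.0, np.ones(len(knots))])  # penalize knot terms only
    coef = np.linalg.solve(A + lam * D, c)
    return lambda t: basis(t) @ coef
```

Setting `Sigma` to the identity recovers the working independence spline, so the same code covers both cases.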

Accounting for Correlation: Splines have an obvious analogue for non-independent data. Kernels are not so obvious. Local likelihood kernel ideas are standard in independent data problems. Most attempts at kernels for correlated data have tried to use local likelihood kernel methods.

Kernels and Correlation: Problem: how to define locality for kernels? Goal: estimate the function at t. Form a diagonal matrix of standard kernel weights. Standard kernel method: GLS, pretending that the inverse covariance matrix is the working inverse covariance matrix weighted by these kernel weights. The estimate is inherently local.
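One common way to make this concrete is sketched below in Python. It assumes a local linear fit and assumes the kernel-weighted inverse covariance has the symmetric form K^{1/2} Σ^{-1} K^{1/2}, with K the diagonal matrix of kernel weights for a cluster; the exact matrix on the original slide is not recoverable from this transcript, and the name `kernel_gls` is illustrative.

```python
import numpy as np

def kernel_gls(X, Y, Sigma, t0, h):
    """Local linear 'standard kernel' GLS fit at t0 (illustrative sketch).

    Assumption: each cluster's weight matrix is K^{1/2} Sigma^{-1} K^{1/2},
    where K is diagonal with Gaussian kernel weights K_h(X_ij - t0).
    """
    n, m = X.shape
    Sinv = np.linalg.inv(Sigma)
    A = np.zeros((2, 2))
    c = np.zeros(2)
    for Xi, Yi in zip(X, Y):
        k = np.exp(-0.5 * ((Xi - t0) / h) ** 2) / h       # kernel weights
        root = np.sqrt(k)
        W = root[:, None] * Sinv * root[None, :]          # K^{1/2} Sinv K^{1/2}
        Z = np.column_stack([np.ones(m), Xi - t0])
        A += Z.T @ W @ Z
        c += Z.T @ W @ Yi
    return np.linalg.solve(A, c)[0]                        # intercept = fit at t0
```

Because an observation with X far from t0 receives a near-zero kernel weight, both its row and its column of the weight matrix vanish, which is why this estimate stays local whatever `Sigma` is.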

[Figure] Kernels and Correlation: the weight functions G_n(t = 0.25, x) in a specific case: m = 3, n = 35, exchangeable correlation structure. Red: ρ = 0.0; green: ρ = 0.4; blue: ρ = 0.8. Note the locality of the kernel method.

[Figure] Splines and Correlation: the weight functions G_n(t = 0.25, x) in a specific case: m = 3, n = 35, exchangeable correlation structure. Red: ρ = 0.0; green: ρ = 0.4; blue: ρ = 0.8. Note the lack of locality of the spline method.

[Figure] Splines and Correlation: the weight functions G_n(t = 0.25, x) in a specific case: m = 3, n = 35, complex correlation structure. Red: nearly singular; green: ρ = 0.0; blue: ρ = AR(0.8). Note the lack of locality of the spline method.

Splines and Standard Kernels: Accounting for correlation: standard kernels remain local; splines are not local. Numerical results can be confirmed theoretically. Don't we want our nonparametric regression estimates to be local?

Results on Kernels and Correlation: GLS with kernel weights. The optimal working covariance matrix is working independence! Using the correct covariance matrix increases variance and increases MSE. Splines ≠ kernels (or at least these kernels).

Pseudo-Observation Kernel Methods: Better kernel methods are possible. Pseudo-observation: the original method. Construction: a specific linear transformation of Y with mean Θ(X) and diagonal covariance matrix. This adjusts the original responses without affecting the mean.

Pseudo-Observation Kernel Methods: Construction: a specific linear transformation of Y with mean Θ(X) and diagonal covariance. Iterative. Efficiency: more efficient than working independence. Proof of principle: kernel methods can be constructed to take advantage of correlation.

[Figure] Efficiency of Splines and Pseudo-Observation Kernels. Exchng: exchangeable with correlation 0.6. AR: autoregressive with correlation 0.6. Near Sing: a nearly singular matrix.

Better Kernel Methods: SUR. Simulations of the original pseudo-observation method show that it is not as efficient as splines, which suggests room for a better estimate. Naisyin Wang: her talk will describe an even better kernel method. Basis: seemingly unrelated regression (SUR) ideas. Generalizable: based on likelihood ideas.

SUR Kernel Methods: It is well known that the GLS spline has an exact, analytic expression. We have shown that the Wang SUR kernel method has an exact, analytic expression. Both methods are linear in the responses.

SUR Kernel Methods: The two methods differ only in one matrix term. This turns out to be exactly the same matrix term considered by Silverman in his work. Relatively nontrivial calculations show that Silverman's result still holds. Splines = SUR kernels.

Nonlocality: The lack of locality of GLS splines and SUR kernels is surprising. Suppose we want to estimate the function at t. If any observation has an X near t, then all observations in the cluster contribute to the fit, not just those with covariates near t. Splines, pseudo-kernels and SUR kernels all borrow strength.

Nonlocality: Wang's SUR kernels = BLUP-like pseudo-kernels with a clever linear transformation. Once the pseudo-observations are defined by that transformation, SUR kernels are working independence kernels.

Locality of Kernels: Original pseudo-observation method: the pseudo-observations are uncorrelated. SUR kernels: the pseudo-observations are correlated. SUR kernels are not local. SUR kernels are local in (the same!) pseudo-observations.

Locality of Splines: Splines = SUR kernels (Silverman-type result). GLS spline: iterative standard independent spline smoothing of the SUR pseudo-observations at each iteration. GLS splines are not local. GLS splines are local in (the same!) pseudo-observations.

Time Series Problems: many of the same issues arise. Original pseudo-observation method: two stages; a linear transformation with mean Θ(X) and independent errors; a single standard kernel applied. Potential for great gains in efficiency (even infinite for AR problems with large correlation).

Time Series: AR(1) Illustration, First Pseudo-Observation Method. AR(1), correlation ρ: regress Y_t^0 on X_t.
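A minimal sketch of a two-stage pseudo-observation fit for AR(1) errors is given below. It assumes the pseudo-observation is Y_t^0 = Y_t − ρ{Y_{t−1} − Θ(X_{t−1})}, which has mean Θ(X_t) and white-noise errors when ρ is known; the definition, the Nadaraya-Watson smoother, and the names `nw`, `ar1_pseudo_obs` and `ar1_pseudo_fit` are assumptions for illustration, since the slide's formula did not survive.

```python
import numpy as np

def nw(x, y, t, h):
    """Nadaraya-Watson kernel regression of y on x, evaluated at points t."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    w = np.exp(-0.5 * ((t[:, None] - x[None, :]) / h) ** 2)
    return (w @ y) / w.sum(axis=1)

def ar1_pseudo_obs(x, y, rho, h):
    """Stage 1: pilot fit ignoring correlation, then transform the responses."""
    pilot = nw(x, y, x, h)                      # working independence estimate
    lagged_resid = y[:-1] - pilot[:-1]          # estimate of the error at t-1
    return y[1:] - rho * lagged_resid           # assumed form of Y_t^0

def ar1_pseudo_fit(x, y, rho, t, h):
    """Stage 2: a single standard kernel applied to the pseudo-observations."""
    y0 = ar1_pseudo_obs(x, y, rho, h)
    return nw(x[1:], y0, t, h)
```

In practice ρ would itself be estimated from the pilot residuals; it is taken as known here to keep the sketch short.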

Time Series Problems: More efficient methods can be constructed. Series of regression problems: for all j, construct pseudo-observations with the appropriate mean and white noise errors. Regress for each j: the fits are asymptotically independent. Then take a weighted average. Time series version of SUR kernels for longitudinal data?

Time Series: AR(1) Illustration, New Pseudo-Observation Method. AR(1), correlation ρ: regress Y_t^0 on X_t and Y_t^1 on X_{t-1}. Weights: 1 and ρ^2.
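One construction that reproduces the stated weights is worked out below. The definitions of the two pseudo-observations are assumptions chosen to match the weights 1 and ρ², not formulas recovered from the slide; u_t denotes the white-noise innovation with variance σ_u².

```latex
% Model with AR(1) errors
Y_t = \Theta(X_t) + \varepsilon_t, \qquad
\varepsilon_t = \rho\, \varepsilon_{t-1} + u_t, \qquad
\operatorname{var}(u_t) = \sigma_u^2 .

% Assumed pseudo-observations
Y_t^{0} = Y_t - \rho \{ Y_{t-1} - \Theta(X_{t-1}) \} = \Theta(X_t) + u_t ,
\qquad
Y_t^{1} = Y_{t-1} - \rho^{-1} \{ Y_t - \Theta(X_t) \} = \Theta(X_{t-1}) - u_t / \rho .

% Error variances and inverse-variance weights
\operatorname{var}(Y_t^{0} \mid X) = \sigma_u^2 , \qquad
\operatorname{var}(Y_t^{1} \mid X) = \sigma_u^2 / \rho^2
\;\Longrightarrow\; \text{weights} \propto 1 \text{ and } \rho^2 .

% Combining two asymptotically independent fits with these weights
\frac{\sigma_u^2}{1 + \rho^2}
\quad \text{versus} \quad \sigma_u^2 \ \text{(first method)}
\quad \text{versus} \quad \frac{\sigma_u^2}{1 - \rho^2} \ \text{(working independence)} .
```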

Time Series Problems: AR(1) errors with correlation ρ. Relative efficiencies considered: the original pseudo-observation method versus working independence, and the new (SUR?) pseudo-observation method versus the original method.

The Semiparametric Model: Y = response; X, Z = time-varying covariates. Question: can we improve efficiency for β by accounting for correlation?

Profile Methods: Given β, solve for Θ. Basic idea: a nonparametric regression using working independence, standard kernels, pseudo-observation kernels, or SUR kernels.

Profile Methods: Given β, solve for Θ. Then fit GLS or W.I. to the resulting model for the mean. Question: does it matter what kernel method is used? Question: how bad is using W.I. everywhere? Question: are there efficient choices?
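The profiling idea can be sketched in its simplest form: working independence everywhere, with a kernel smoother for the nonparametric part. The sketch assumes the partially linear mean Zβ + Θ(X), a Nadaraya-Watson smoother, and pooled (not cluster-structured) observations; `smoother_matrix` and `profile_fit` are hypothetical names, and this is not the efficient estimator discussed in the talk.

```python
import numpy as np

def smoother_matrix(x, h):
    """Nadaraya-Watson smoother matrix S, so that S @ v smooths v against x."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return w / w.sum(axis=1, keepdims=True)

def profile_fit(x, z, y, h):
    """Working independence profile estimate of (beta, Theta).

    x: (N,) covariate for the nonparametric part; z: (N, p); y: (N,).
    For fixed beta, Theta(., beta) is the smooth of y - z @ beta on x;
    substituting it back gives least squares of (I - S) y on (I - S) z.
    """
    S = smoother_matrix(x, h)
    z_res = z - S @ z
    y_res = y - S @ y
    beta = np.linalg.lstsq(z_res, y_res, rcond=None)[0]
    theta_hat = S @ (y - z @ beta)              # profiled nonparametric fit
    return beta, theta_hat
```

Swapping the smoother or replacing the final least squares step with GLS gives the other members of the profile class listed on the slide.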

The Semiparametric Model: Special Case. If X does not vary with time, a simple semiparametric efficient method is available. The basic point is that the response, adjusted for the parametric part, has a common mean and covariance matrix. If Θ were a polynomial, GLS likelihood methods would be natural.

The Semiparametric Model: Special Case. Method: replace the polynomial GLS likelihood with a GLS local likelihood with kernel weights. Then do GLS on the derived variable. The result is semiparametric efficient.

Profile Method: General Case. Given β, solve for Θ. Then fit GLS or W.I. to the resulting model for the mean. In this general case, how you estimate Θ matters: working independence, standard kernel, pseudo-observation kernel, or SUR kernel.

Profile Methods: In this general case, how you estimate Θ matters: working independence, standard kernel, pseudo-observation kernel, or SUR kernel. We have published the asymptotically efficient score, but not how to implement it.

Profile Methods: Naisyin Wang's talk will describe these phenomena; the search for an efficient estimator; the loss of efficiency from using working independence to estimate Θ; and examples where ignoring the correlation can change conclusions.

Conclusions (1/3): Nonparametric Regression. Kernels = splines for working independence (W.I.). Weighting is important for W.I. Working independence is inefficient. Standard kernels ≠ splines for correlated data.

Conclusions (2/3): Nonparametric Regression. Pseudo-observation methods improve upon working independence. SUR kernels = splines for correlated data. Splines and SUR kernels are not local. Splines and SUR kernels are local in pseudo-observations.

Conclusions (3/3): Semiparametric Regression. Profile methods are a general class. Fully efficient parameter estimates are easily constructed if X is not time-varying. When X is time-varying, the method of estimating Θ affects the properties of the parameter estimates. Ignoring correlations can change conclusions (see N. Wang's talk).

Conclusions: Splines versus Kernels. One has to be struck by the fact that all the grief in this problem has come from trying to define kernel methods. At the end of the day, they are no more efficient than splines, and harder and more subtle to define. Showing equivalence as we have done suggests the good properties of splines.