
1 Regression, Part B: Going a bit deeper
Stat 391 – Lecture 14
Assaf Oron, May 2008

2 Overview
We introduced simple linear regression, and some responsible-use tips (dot your t's and cross your i's, etc.). Today, we go behind the scenes:
Regression with binary X, and t-tests
The statistical approach to regression
Multiple regression
Regression hypothesis tests and inference
Regression with categorical X, and ANOVA
Advanced model selection in regression
Advanced regression alternatives

3 Binary X and t-tests
It is convenient to introduce regression using continuous X, but it can also be done when X is limited to a finite number of values, or even to non-numerical values. We use the exact same formulae and framework.
When X is binary – that is, it divides the data into two groups (e.g., “male” vs. “female”) – the regression is completely equivalent to the two-sample t-test (the version with the equal-variance assumption).
The regression assigns x = 0 to one group and x = 1 to the other, so our “slope” becomes the difference between group means, and our “intercept” is the mean of the x = 0 group.
Let’s see this in action:
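For instance, here is a quick R sketch; the data are made up purely to illustrate (any two-group dataset would do):

# Hypothetical data: two groups of measurements
y <- c(4.1, 5.0, 4.6, 6.3, 7.1, 6.8)
group <- factor(c("A", "A", "A", "B", "B", "B"))

# Equal-variance two-sample t-test
t.test(y ~ group, var.equal = TRUE)

# Regression with a binary covariate (R codes group A as x = 0, group B as x = 1)
summary(lm(y ~ group))
# The "groupB" slope equals the difference between group means, the intercept
# equals the mean of group A, and the slope's t-statistic and p-value match
# the t-test above.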

4 Regression: the Statistical Approach
Our treatment of regression thus far has been devoid of any probability assumptions: all we saw was least-squares optimization, partition of sums of squares, some diagnostics, etc.
But regression can be viewed via a probability model:
y_i = β0 + β1 x_i + ε_i,   i = 1, …, n
The β's are seen (in classical statistics) as fixed constant parameters, to be estimated. The x's are fixed as well.
The ε's are random, and differ between different y's. They have expectation 0, and under standard regression are assumed i.i.d. normal: ε_i ~ N(0, σ²).

5 Regression: the Statistical Approach (2)
The equation in the previous slide is a simple example of a probabilistic regression model.
Such models describe observations as a function of fixed explanatory variables (x) – known as covariates – plus random noise.
The linear-regression formula can also be written as a conditional probability:
Y | X = x ~ N(β0 + β1 x, σ²), i.e., given x, the response is normal with mean β0 + β1 x and variance σ².

6 Regression: the Statistical Approach (3)
The probability framework allows us to use the tools of statistical estimation, hypothesis testing, and confidence intervals.
Under the i.i.d.-normal-error assumption, the MLEs for intercept and slope are identical to the least-squares solutions (this is because the log-likelihood is quadratic in the parameters, so maximizing it is equivalent to least-squares optimization).
Hence the “hats” in the fitted-value formula ŷ = β̂0 + β̂1 x: they denote estimates.
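A small sketch of this equivalence in R. The dataset (the built-in cars data) and the starting values are illustrative choices, not part of the lecture:

x <- cars$speed
y <- cars$dist

# Least-squares fit via lm
coef(lm(y ~ x))

# Direct maximization of the normal log-likelihood over (b0, b1, log sigma)
negloglik <- function(par) {
  mu <- par[1] + par[2] * x
  -sum(dnorm(y, mean = mu, sd = exp(par[3]), log = TRUE))
}
mle <- optim(c(mean(y), 0, log(sd(y))), negloglik)
mle$par[1:2]   # intercept and slope MLEs: close to lm's, up to numerical error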

7 Multiple Regression
Often, our response y can potentially be explained by more than one covariate. For example, earthquake ground movement at a specific location is affected by both the magnitude of the quake and the distance from the epicenter (the attenu dataset).
It turns out that everything we did for a single x can be done with p covariates, using analogous formulae. Instead of finding the least-squares line in 2D, we find the least-squares hyperplane in p+1 dimensions.
We have to convert to matrix-vector terminology:
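In R this is still a one-liner; a sketch with the built-in attenu data (the exact model here, acceleration on magnitude and distance, is a natural reading of the slide but is our illustrative choice):

data(attenu)
# Ground acceleration explained by event magnitude and distance from epicenter
fit <- lm(accel ~ mag + dist, data = attenu)
summary(fit)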

8 Multiple Regression (2)
In matrix form, the model is y = Xβ + ε, where:
Responses y: a vector of length n.
Errors ε: an i.i.d. vector of normal r.v.'s; length n.
Model matrix X: n rows, (p+1) columns.
Parameter vector β, quantifying the effects; length p+1.
Where has the intercept term gone? It is merged into X, the model matrix, as the first column: a column of 1's (check it out). Each covariate takes up a subsequent column.
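You can inspect the model matrix directly; a quick sketch, continuing the hypothetical attenu model from above:

X <- model.matrix(accel ~ mag + dist, data = attenu)
head(X)   # first column is all 1's (the intercept), then one column per covariate
dim(X)    # n rows, p + 1 = 3 columns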

9 Multiple Regression (3)
This was the math; conceptually, what does multiple regression do?
It calculates the “pure” effect of each covariate, while neutralizing the effect of the other covariates (we call this “adjusting for the other covariates”).
If a covariate is NOT in the model, it is NOT neutralized – so it becomes a potential confounder.
Let’s see this in action on the attenu data:
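One way to see the adjustment at work, continuing the same illustrative attenu model (run it to see how the distance coefficient changes):

# Distance effect WITHOUT adjusting for magnitude
coef(lm(accel ~ dist, data = attenu))["dist"]
# Distance effect AFTER adjusting for magnitude
coef(lm(accel ~ mag + dist, data = attenu))["dist"]
# Magnitude is related to acceleration and may also be related to distance,
# so leaving it out can change the apparent effect of dist (confounding).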

10 Multiple Regression (4)
…and now for the first time, we actually write out the solutions.
First for the parameters: β̂ = (XᵀX)⁻¹ Xᵀ y
And these are the fitted values for y: ŷ = X β̂ = X (XᵀX)⁻¹ Xᵀ y
Note the matrix transpose and inverse operators.
All this is a function of X, and can be written as a single matrix H = X (XᵀX)⁻¹ Xᵀ, a.k.a. “the Hat Matrix” (why? because it puts the hat on y: ŷ = H y).
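A quick numerical check of these formulae, again using the illustrative attenu model:

X <- model.matrix(accel ~ mag + dist, data = attenu)
y <- attenu$accel
beta_hat <- solve(t(X) %*% X, t(X) %*% y)        # (X'X)^{-1} X'y
cbind(beta_hat, coef(lm(accel ~ mag + dist, data = attenu)))   # the two columns agree
H <- X %*% solve(t(X) %*% X) %*% t(X)            # the hat matrix
max(abs(H %*% y - fitted(lm(accel ~ mag + dist, data = attenu))))   # ~ 0: fitted values = Hy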

11 Regression Inference
Why did we bother with these matrix formulae? To show you that both parameter estimates and fitted values are linear combinations of the original observations: each individual estimate or fitted value can be written as a weighted sum of the y's.
Useful fact: linear combinations of normal r.v.'s are also normal r.v.'s.
So if our model assumptions hold, the beta-hats and y-hats are all normal (how convenient).

12 Regression Inference (2)
What if the observation errors are not normal?
Well, recall that a sum is just the mean multiplied by n, so its shape also becomes “approximately normal” as n increases, due to the CLT. This holds for weighted sums as well, under fairly general conditions.
The bottom line: if the errors are “reasonably well-behaved” and n is large enough, our estimates are still approximately normal.

13 Regression Inference (3)
So each individual beta-hat or y-hat can be assumed (approximately) normal, with variance proportional to σ²; for the parameters, Var(β̂) = σ² (XᵀX)⁻¹ (the diagonal gives the individual variances).
The only missing piece is to estimate σ², the variance of the observation errors.
…And σ² is easily estimated using the residuals: σ̂² = Σ e_i² / (n − p − 1), where e_i = y_i − ŷ_i.
But since we estimate the variance from the data, all our inference is based on the t-distribution.
We lose a degree of freedom for each parameter, including the intercept, ending up with n − p − 1.
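As a sanity check, a short R sketch (same illustrative attenu fit as before):

fit <- lm(accel ~ mag + dist, data = attenu)
n <- nrow(attenu); p <- 2
sigma_hat <- sqrt(sum(residuals(fit)^2) / (n - p - 1))
c(by_hand = sigma_hat, from_summary = summary(fit)$sigma)   # identical
df.residual(fit)                                            # n - p - 1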

14 Back to that Printout again…
Call: lm(formula = y1 ~ x1, data = anscombe)

Residuals:
     Min       1Q   Median       3Q      Max
-1.92127 -0.45577 -0.04136  0.70941  1.83882

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0001     1.1247   2.667  0.02573 *
x1            0.5001     0.1179   4.241  0.00217 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217

The t-statistics and p-values are for tests against a null hypothesis that the true parameter value is zero (each parameter tested separately). If your null is different, you’ll have to do the test on your own.
Which parameter does this null usually NOT make sense for?
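If your null is, say, β1 = 1 rather than β1 = 0, a hand-rolled sketch of the test looks like this (the value 1 is just an illustrative choice):

fit <- lm(y1 ~ x1, data = anscombe)
est <- coef(summary(fit))["x1", "Estimate"]
se  <- coef(summary(fit))["x1", "Std. Error"]
t_stat <- (est - 1) / se                             # test H0: slope = 1
p_val  <- 2 * pt(-abs(t_stat), df = df.residual(fit))
c(t = t_stat, p = p_val)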

15 Note: Beware The (Model) Matrix
If X has a column which is a linear function of other column(s), it is said to be a singular matrix: it cannot be inverted, and the software may scream at you. Conceptually, you are asking the method to decide between two identical explanations; it cannot do this.
If X has a column which is “almost” a linear function of other column(s), it is said to suffer from collinearity: your beta-hat S.E.'s will be huge. Conceptually, you are asking the method to decide between two nearly-identical explanations; still not a good prospect.
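A tiny sketch of both situations (made-up data; the redundant covariate is deliberate):

set.seed(1)
x1 <- rnorm(20)
x2 <- 2 * x1                                # exactly a linear function of x1: singular
y  <- 1 + x1 + rnorm(20)
coef(lm(y ~ x1 + x2))                       # R drops the redundant column and reports NA
x3 <- 2 * x1 + rnorm(20, sd = 0.001)        # "almost" a linear function: collinearity
summary(lm(y ~ x1 + x3))$coefficients       # note the hugely inflated standard errors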

16 Categorical X and ANOVA
We saw that simple regression with binary X is equivalent to a t-test.
Similarly, we can model a categorical covariate having k > 2 categories within multiple regression. For example: ethnic origin vs. life expectancy.
The covariate will take up k−1 columns in X – meaning there'll be k−1 parameters to estimate.
R interprets text covariates as categorical; you can also convert numerical values to categorical using factor.
Let’s see this in action:
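A brief sketch with a built-in dataset, chosen here purely for illustration (chickwts has a 6-level feed factor):

data(chickwts)
levels(chickwts$feed)            # k = 6 categories
fit <- lm(weight ~ feed, data = chickwts)
coef(fit)                        # intercept plus k - 1 = 5 dummy coefficients
head(model.matrix(fit))          # the factor occupies k - 1 columns of X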

17 Categorical X and ANOVA (2)
Regression on a categorical variable is equivalent to a technique called ANOVA: analysis of variance. ANOVA is used to analyze designed experiments.
ANOVA's name is derived from the fact that its hypothesis tests are performed by comparing sums of square deviations (such as those shown last lecture). This is known as the F test, and appears in our standard regression printout.
ANOVA is considered an older technology, but is still very useful in engineering, agriculture, etc.
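To see the sums of squares and the F test laid out as an ANOVA table, continuing the illustrative chickwts example:

anova(lm(weight ~ feed, data = chickwts))     # sums of squares, mean squares, F test
summary(aov(weight ~ feed, data = chickwts))  # the same analysis in ANOVA-flavored form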

18 Regression Inference and Model Selection
So… can we keep fitting the data better by adding as many covariates as we want? Not quite.
If p ≥ n−1, you can fit the observations perfectly. This is known as a saturated model, and it is pretty useless for drawing conclusions (in statistics jargon, you will have used up all your degrees of freedom).
Before reaching n−1, each additional covariate improves the fit. Where to stop? Obviously, there is a tradeoff; we seek the optimum between over-fitting and under-fitting.
From a conceptual perspective, we usually prefer the simpler models (fewer covariates). However, given all possible covariates, “how to find the optimal combination?” is an open question.

19 Regression Inference: Nested Models
If two models are nested, we can make a formal hypothesis test between them, called a likelihood-ratio test (LRT). This test checks whether the gain in explained variability is “worth” the price paid in degrees of freedom.
But when are two models nested? The simplest case: if model B = model A + some added terms, then A is nested in B (sometimes, nesting includes simplification of more complicated multi-level covariates: e.g., region vs. state).
In R, the LRT is available via lrtest, in the lmtest package (and also via the anova function).
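A sketch of both routes, continuing the illustrative attenu models:

fitA <- lm(accel ~ dist, data = attenu)        # smaller model
fitB <- lm(accel ~ mag + dist, data = attenu)  # A plus one added term, so A is nested in B
anova(fitA, fitB)                              # compares the nested fits via an F test
# If the lmtest package is installed, the likelihood-ratio version is:
# library(lmtest)
# lrtest(fitA, fitB)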

20 Open-Ended Model Selection
Where all else fails… use common sense. Your final model is not necessarily the best in terms of “bang for the buck.”
Your covariate of interest should definitely go in. “Nuisance covariates” required by the client or by the accepted wisdom should go in as well.
Causal diagrams are a must for nontrivial problems; covariates with a clear causal connection to the response should go in first.
(There are also model-selection tools, known as AIC, BIC, cross-validation, BMA, etc. – see the brief sketch below.)
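For example, AIC and BIC comparisons are one line each in R (same illustrative attenu models; smaller values are preferred):

fitA <- lm(accel ~ dist, data = attenu)
fitB <- lm(accel ~ mag + dist, data = attenu)
AIC(fitA, fitB)   # Akaike information criterion for each candidate model
BIC(fitA, fitB)   # Bayesian information criterion: penalizes extra covariates more heavily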

21 Open-Ended Model Selection (2)
Additionally, the goal of the model matters:
If it is formal inference/policy/scientific conclusions, you should be more conservative (fewer covariates, less effort to fit the data closely).
If it is prediction and forecasting under conditions similar to those observed, you can be a bit more aggressive.
In any case, there is no magic solution. Always remember not to put too much faith in the model.

22 More Sophisticated Regressions
The assumptions of linearity-normality-i.i.d. are quite restrictive.
Some violations can be handled via standard regression:
Nonlinearity – transform the variables
Unequal variances – weighted least squares
For other violations, extensions of ordinary regression have been developed.
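A minimal sketch of those two standard fixes in R; the dataset, the log transformation, and the weights are all illustrative assumptions, not a recipe:

# Nonlinearity: transform the variables (e.g., a log-log fit on the cars data)
data(cars)
fit_log <- lm(log(dist) ~ log(speed), data = cars)
coef(fit_log)
# Unequal variances: weighted least squares via lm's weights argument
w <- 1 / cars$speed                          # assumed: error variance grows with speed
fit_wls <- lm(dist ~ speed, data = cars, weights = w)
summary(fit_wls)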

23 More Sophisticated Regressions (2)
For some types of non-normality we have generalized linear models (GLMs). The GLM solution is also an MLE.
GLMs cover a family of distributions that includes the normal, exponential, Gamma, binomial, and Poisson.
The variant with binomial responses is known as logistic regression; let's see it in action.
If we suffer from outliers or heavy tails, there are many types of robust regression to choose from.
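A sketch of logistic regression in R; the dataset and model (transmission type in mtcars) are our illustrative choices, not necessarily the lecture's example:

data(mtcars)
# Binary response (am: 0 = automatic, 1 = manual) modeled via the binomial family
fit_logit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(fit_logit)
head(predict(fit_logit, type = "response"))   # predicted probabilities of a manual transmission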

24 More Sophisticated Regressions (3)
If observations are not i.i.d., but are instead divided into groups, we can use hierarchical or “mixed” models (this is very common).
Of course, any regression can be done using Bayesian methods; these are especially useful for complicated hierarchical models.
Finally, if y's dependence upon x is not described well by any single function, there is nonparametric regression (“smoothing”) – some of which we may see next week.
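As a small taste of the nonparametric option, a sketch using base R's loess smoother (the dataset and span are illustrative):

data(cars)
smooth_fit <- loess(dist ~ speed, data = cars, span = 0.75)
plot(dist ~ speed, data = cars)
# Smoothed curve: no single functional form is assumed
lines(sort(cars$speed), predict(smooth_fit)[order(cars$speed)])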

