Computational Statistics
Basic ideas
Predict values that are hard to measure in real life by using covariables (other properties from the same measurement in a sample population). We will often consider two (or more) variables simultaneously:
1) The data (x1, y1), ..., (xn, yn) are considered independent replications of a pair of random variables (X, Y) (observational studies).
2) The data are described by a linear regression model (planned experiments):
yi = a + b * xi + εi, i = 1, ..., n
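As a minimal sketch of setting 2), the snippet below simulates data from such a model and recovers a and b by least squares. The parameter values (a = 1.0, b = 2.0, error SD 0.5) are made up for illustration; Python with numpy is assumed throughout these examples.

```python
# Minimal sketch of the planned-experiment setting: y_i = a + b*x_i + eps_i.
# The parameter values below (a = 1.0, b = 2.0, error SD 0.5) are made up.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.linspace(0.0, 10.0, n)        # fixed design points (planned experiment)
eps = rng.normal(0.0, 0.5, size=n)   # random errors with E(eps) = 0
y = 1.0 + 2.0 * x + eps              # response generated by the linear model

# Least-squares estimates (np.polyfit returns the highest degree first).
b_hat, a_hat = np.polyfit(x, y, deg=1)
print(f"a_hat = {a_hat:.3f}, b_hat = {b_hat:.3f}")
```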
Regression
The linear model
A multiple regression model predicts a response variable using a linear function of covariables (predictor variables).
Goals: estimate the unknown parameter βp of each covariable Xp (its weight/significance), and estimate the error variance.
Y = β1 + β2 * X1 + β3 * X2 + ... + ε
ε = systematic errors + random errors
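A sketch of fitting such a model by least squares, assuming simulated data with two covariables and made-up coefficients; np.linalg.lstsq returns both the estimates and the residual sum of squares.

```python
# Sketch: estimate beta in Y = beta1 + beta2*X1 + beta3*X2 + eps.
# All data below are simulated for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 100
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
y = 3.0 + 1.5 * X1 - 2.0 * X2 + rng.normal(0.0, 1.0, size=n)

# Design matrix with an intercept column of ones.
X = np.column_stack([np.ones(n), X1, X2])
beta_hat, rss, rank, _ = np.linalg.lstsq(X, y, rcond=None)
sigma2_hat = rss[0] / (n - X.shape[1])   # estimated error variance
print("beta_hat =", beta_hat, " sigma^2_hat =", float(sigma2_hat))
```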
The linear model II
Quantify the uncertainty in empirical data.
Assign significance to the various components (covariables).
Find a good compromise between model size and the ability to describe the data (and hence the response).
The linear model III
Sample size n > number of predictors p.
The p column vectors (of the design matrix) are linearly independent.
The errors are random => the responses are random as well.
E(ε) = 0; if E(ε) ≠ 0, there is a systematic error (assuming the model itself is correct).
Model variations
Linear regression through the origin (p = 1): Y = β1 * X + ε
Simple linear regression (p = 2): Y = β1 + β2 * X + ε
Quadratic regression (p = 3): Y = β1 + β2 * X + β3 * X² + ε
Regression with transformed predictor variables (example): Y = β1 + β2 * log(X) + β3 * sin(X) + ε
All of these remain linear in the parameters β. The data need to be checked for linearity to identify the appropriate variation; see the design-matrix sketch below.
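Because every variation is linear in β, each one only changes the design matrix. A sketch of the four design matrices, assuming a made-up 1-D predictor x (kept positive for the log transform):

```python
# Sketch: each model variation corresponds to a different design matrix.
# x is a made-up 1-D predictor, x > 0 so that log(x) is defined.
import numpy as np

x = np.linspace(0.1, 10.0, 50)
ones = np.ones_like(x)

X_origin    = x[:, None]                                     # through the origin
X_simple    = np.column_stack([ones, x])                     # simple linear
X_quadratic = np.column_stack([ones, x, x**2])               # quadratic
X_transform = np.column_stack([ones, np.log(x), np.sin(x)])  # transformed predictors
```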
Goals of analysis
A good fit with small errors, using the method of least squares.
Good parameter estimates: how much each predictor variable explains (contributes to) the response in the chosen model.
Good prediction of the response as a function of the predictor variables.
Confidence intervals and statistical tests help us reach these goals.
Find the best model in an iterative process, often using heuristics.
Least Squares
Residuals: ri = Yi (empirical) − Ŷi(β, covariates)
The best β̂ minimizes the sum of squared residuals Σ ri² for the chosen set of covariates (in the given model).
Least squares estimates are based on the random errors => the estimates are random too => a different β̂ for each measured sample => different regression lines.
(Still, with "enough" samples the fitted line approaches the "true" regression line; the Central Limit Theorem describes how β̂ fluctuates around it.)
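A sketch of the computation itself, using the normal equations (XᵀX)β = Xᵀy, which characterize the minimizer of the sum of squared residuals; the data are simulated.

```python
# Sketch: residuals r_i = y_i - yhat_i and the normal equations
# (X^T X) beta = X^T y that minimize the sum of squared residuals.
# Data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(2)
n = 80
x = rng.uniform(0.0, 5.0, size=n)
y = 2.0 + 0.7 * x + rng.normal(0.0, 0.3, size=n)

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations
y_hat = X @ beta_hat
r = y - y_hat                                  # residuals
print("beta_hat =", beta_hat, " RSS =", float(r @ r))
```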
Linear Model Assumptions
E(εi) = 0 (the linear equation is correct).
All xi's are exact (no systematic error in the predictors).
The error variance is constant (homoscedasticity), i.e., the empirical error variance matches the theoretical variance; otherwise weighted least squares can be used.
Uncorrelated errors: Cov(εi, εj) = 0 for all i ≠ j; otherwise generalized least squares can be used.
Errors are normally distributed => Y is normally distributed; otherwise robust methods can be used instead of least squares.
Model cautions
Covariates can change over time (a time-based covariate problem).
It is dangerous to use a fitted model to extrapolate into regions where no predictor values have been observed.
Example: extrapolating a linear trend of body height over historical time would absurdly suggest the average Viking was only a few centimeters tall.
Test and Confidence (any predictor)
Test predictor p using the null hypothesis H0,p: βp = 0 against the alternative HA,p: βp ≠ 0.
Use the t-test and p-values to determine relevance.
This quantifies the effect of the p-th predictor variable after subtracting the linear effects of all other predictor variables on Y.
Problem: due to correlation among predictor variables, the individual tests can be misleading (a predictor's apparent significance depends on which other predictors are in the model).
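A sketch of the individual t-tests, assuming simulated data in which X2 is irrelevant; the standard errors come from the diagonal of σ̂²(XᵀX)⁻¹, and scipy supplies the t distribution.

```python
# Sketch: t-statistics and two-sided p-values for H0: beta_p = 0,
# computed from a least-squares fit. Data are simulated; scipy is
# used only for the t distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 100
X1, X2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * X1 + 0.0 * X2 + rng.normal(size=n)   # X2 is irrelevant

X = np.column_stack([np.ones(n), X1, X2])
p = X.shape[1]
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
r = y - X @ beta_hat
sigma2_hat = (r @ r) / (n - p)
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))  # standard errors
t = beta_hat / se
p_values = 2.0 * stats.t.sf(np.abs(t), df=n - p)            # two-sided test
print("t =", t, "\np-values =", p_values)
```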
Test and Confidence (global)
Use ANOVA (ANalysis Of VAriance) to test the null hypothesis H0: all βs = 0 (no relevance) versus the alternative HA: at least one β ≠ 0.
The F-test quantifies the overall statistical significance of the predictor variables.
Describe the goodness of fit using sums of squares:
R² = SS(explained) / SS(total) = Σ(Ŷi − Ȳ)² / Σ(Yi − Ȳ)²
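A sketch of the global F-test and R² on the same kind of simulated setup; under H0, F = (SS(explained)/(p−1)) / (SS(residual)/(n−p)) follows an F distribution with (p−1, n−p) degrees of freedom.

```python
# Sketch: global F-test (H0: all non-intercept betas are 0) and R^2
# from a least-squares fit. Data are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 100
X1, X2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * X1 - 1.0 * X2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), X1, X2])
p = X.shape[1]
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

ss_total = np.sum((y - y.mean()) ** 2)
ss_resid = np.sum((y - y_hat) ** 2)
ss_expl = ss_total - ss_resid
r2 = ss_expl / ss_total                              # coefficient of determination

F = (ss_expl / (p - 1)) / (ss_resid / (n - p))       # global F-statistic
p_value = stats.f.sf(F, dfn=p - 1, dfd=n - p)
print(f"R^2 = {r2:.3f}, F = {F:.2f}, p = {p_value:.3g}")
```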
Tukey-Anscombe plot (linearity assumption)
Use the residuals as an approximation of the unobservable errors to check linearity.
Plot the residuals against the fitted values Ŷ.
The correlation should be zero -> the values fluctuate randomly around a horizontal line through zero.
A trend in the plot is evidence of a non-linear relation (or a systematic error).
Possible solutions: transform the response variable or perform a weighted regression.
If the SD grows with the fitted values: Y -> log(Y)
If the SD grows as the square root: Y -> sqrt(Y)
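A sketch of such a plot with matplotlib, on simulated data where the linearity assumption holds, so the residuals should show no trend:

```python
# Sketch: a Tukey-Anscombe plot (residuals vs. fitted values).
# Data are simulated; residuals should scatter around the zero line.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 5.0, size=100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=100)

X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
r = y - y_hat

plt.scatter(y_hat, r, s=15)
plt.axhline(0.0, linestyle="--")   # no trend expected around this line
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.title("Tukey-Anscombe plot")
plt.show()
```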
The Normal/QQ Plot (normal distribution assumption)
Check the normal distribution of the errors using a quantile-quantile plot (QQ plot), also called a normal plot.
y-axis = empirical quantiles of the residuals; x-axis = theoretical quantiles of N(0, 1).
For normally distributed residuals, the normal plot gives (approximately) a straight line with intercept equal to the mean and slope equal to the standard deviation.
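A sketch using scipy.stats.probplot with simulated stand-in residuals; points close to the reference line support the normality assumption.

```python
# Sketch: a normal QQ plot of the residuals via scipy.stats.probplot.
# r would normally hold residuals from a fit; here they are simulated.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(6)
r = rng.normal(loc=0.0, scale=0.5, size=100)   # stand-in residuals

stats.probplot(r, dist="norm", plot=plt)       # points near a line => normality
plt.xlabel("theoretical quantiles of N(0,1)")
plt.ylabel("quantiles of the residuals")
plt.show()
```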
Weighted regression
When the error variance is not constant (see the assumptions above), weighted least squares can be used: observations are weighted, e.g., inversely to their error variance, so that less precise observations influence the fit less.
Model selection
We want the model to be as simple as possible. Which predictors should be included?
We want the best/optimal model, not necessarily the true model.
More predictors -> lower bias, but higher variance of the estimates.
Optimize the bias-variance trade-off.
Searching for the best model
Forward selection: start with the smallest model and repeatedly include the predictor that reduces the residual sum of squares the most, until a large number of predictors has been selected. Choose the model with the smallest Cp statistic.
Backward selection: start with the full model and repeatedly exclude the predictor that increases the residual sum of squares the least, until all, or most, predictor variables have been deleted. Choose the model with the smallest Cp statistic.
The cross-validated R² can be used to choose among the candidate models identified by forward or backward selection. A forward-selection sketch follows below.
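A sketch of forward selection using Mallows' Cp = RSS/σ̂²_full + 2p − n, with σ̂²_full estimated from the full model; the data are simulated so that only X1 and X2 truly matter.

```python
# Sketch: forward selection guided by Mallows' Cp statistic.
# Data are simulated; only X1 and X2 enter the true response.
import numpy as np

rng = np.random.default_rng(7)
n, k = 100, 5
X_all = rng.normal(size=(n, k))
y = 1.0 + 2.0 * X_all[:, 0] - 1.5 * X_all[:, 1] + rng.normal(size=n)

def rss(cols):
    # residual sum of squares of the least-squares fit on the given columns
    X = np.column_stack([np.ones(n)] + [X_all[:, j] for j in cols])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return float(r @ r)

sigma2_full = rss(range(k)) / (n - (k + 1))        # error variance, full model

selected, remaining, candidates = [], list(range(k)), []
while remaining:
    # include the predictor that reduces the RSS the most
    j = min(remaining, key=lambda c: rss(selected + [c]))
    selected.append(j)
    remaining.remove(j)
    p = len(selected) + 1                          # parameters incl. intercept
    cp = rss(selected) / sigma2_full + 2 * p - n   # Mallows' Cp
    candidates.append((cp, list(selected)))
    print(f"added X{j + 1}: Cp = {cp:.2f}")

best_cp, best_model = min(candidates)
print("model with smallest Cp:", [f"X{j + 1}" for j in best_model])
```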