Presentation on theme: "Linear and generalised linear models Purpose of linear models Least-squares solution for linear models Analysis of diagnostics Exponential family and generalised."— Presentation transcript:

1 Linear and generalised linear models
Purpose of linear models
Least-squares solution for linear models
Analysis of diagnostics
Exponential family and generalised linear model

2 Reason for linear models
The purpose of regression is to reveal statistical relations between input and output variables. Statistics cannot reveal a functional relationship; that is the purpose of other scientific studies. Statistics can, however, help validate various functional relationships (models). Assume we suspect that the functional relationship is

    y = f(x, β) + ε

where β is a vector of unknown parameters, x = (x₁, x₂, ..., x_p) is a vector of controllable variables, y is the output, and ε is an error associated with the experiment. We can then run experiments for various values of x and record the output (or response) for each. If the number of experiments is n, we obtain n output values; denote them as a vector y = (y₁, y₂, ..., y_n). The purpose of statistics is to estimate the parameter vector from the input and output values. If the function f is linear in the parameters and the errors are additive, then we are dealing with a linear model. For this model we can write

    y = β₀ + β₁x₁ + ... + β_p x_p + ε

A linear model is linear in the parameters but not necessarily in the input variables. For example, a model such as y = β₀ + β₁x + β₂x² is a linear model, but y = β₀ + exp(β₁x) is not.
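As a small illustration (simulated data; all variable names here are invented for the sketch), the quadratic model above is linear in the parameters even though it is nonlinear in x, so it can be fitted by ordinary least squares:

    # Simulate data from a model that is linear in the parameters
    # but quadratic in the input variable x.
    set.seed(1)
    n <- 100
    x <- runif(n, 0, 2)
    y <- 1 + 2 * x - 0.5 * x^2 + rnorm(n, sd = 0.3)  # additive error

    # lm() fits any model that is linear in the parameters;
    # I(x^2) tells R to treat x^2 as an ordinary regressor.
    fit <- lm(y ~ x + I(x^2))
    coef(fit)  # estimates of beta0, beta1, beta2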

3 Assumptions
The basic assumptions for the analysis of a linear model are: 1) the model is linear in the parameters; 2) the error structure is additive; 3) the random errors have zero mean and equal variances, and are uncorrelated. These assumptions are sufficient to deal with linear models. The equal-variance and uncorrelatedness parts of assumption 3) can be removed, but then the treatment becomes a little more complicated. Note that the general solution does not use a normality assumption; normality is needed only to design test statistics. If this assumption does not hold, the bootstrap can be used to design test statistics instead. These assumptions can be written in vector form:

    y = Xβ + ε,   E(ε) = 0,   V(ε) = σ²I

where y, 0 and ε are vectors and X is a matrix, called the design matrix (or input matrix), and I is the n×n identity matrix.

4 Solution
The least-squares solution for the linear model under the given assumptions is:

    β̂ = (XᵀX)⁻¹Xᵀy

Let us show this. Using the form of the model, the least-squares criterion (since we want the solution with minimum squared error) is

    S(β) = (y − Xβ)ᵀ(y − Xβ)

Taking the first derivative and setting it to zero gives the normal equations XᵀXβ = Xᵀy, and solving them yields the expression above. If we substitute the expression y = Xβ + ε into the solution we can write:

    E(β̂) = (XᵀX)⁻¹Xᵀ E(Xβ + ε) = β

So the solution is unbiased. The variance of the estimator is:

    V(β̂) = (XᵀX)⁻¹Xᵀ V(ε) X(XᵀX)⁻¹ = σ²(XᵀX)⁻¹

Here we used the form of the solution and assumption 3).
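A minimal sketch (simulated data, names invented here) of the closed-form solution, checked against R's built-in lm():

    # Build a design matrix with an intercept column and compute
    # the least-squares solution directly from the normal equations.
    set.seed(2)
    n <- 50
    x1 <- rnorm(n); x2 <- rnorm(n)
    X <- cbind(1, x1, x2)                 # n x p design matrix
    beta <- c(1, 2, -1)
    y <- X %*% beta + rnorm(n, sd = 0.5)

    beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'y
    data.frame(direct = as.vector(beta_hat),
               lm     = coef(lm(y ~ x1 + x2)))  # should agree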

5 Variance
To calculate the covariance matrix we need to be able to estimate σ². Since it is the variance of the error term, we can find it using the form of the solution. For the estimated errors (residuals, denoted r) we can write:

    r = y − Xβ̂ = y − X(XᵀX)⁻¹Xᵀy = (I − X(XᵀX)⁻¹Xᵀ)y = My

Using y = Xβ + ε, it gives r = Mε. Since the matrix M is idempotent and symmetric, i.e. M² = M = Mᵀ, we can write:

    E(rᵀr) = E(εᵀMε) = σ² tr(M) = σ²(n − p)

where n is the number of observations and p is the number of fitted parameters. Then for the unbiased estimator of the variance of the residual we can write:

    s² = rᵀr / (n − p)
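Continuing the previous sketch (reusing X, y, n and beta_hat from it), s² and the covariance matrix s²(XᵀX)⁻¹ can be computed directly and compared with R's vcov():

    # Residuals, unbiased variance estimate and covariance matrix.
    p <- ncol(X)
    r <- y - X %*% beta_hat               # residual vector
    s2 <- sum(r^2) / (n - p)              # unbiased estimator of sigma^2
    cov_beta <- s2 * solve(t(X) %*% X)    # covariance of beta_hat
    all.equal(cov_beta, vcov(lm(y ~ x1 + x2)),
              check.attributes = FALSE)   # should be TRUE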

6 Singular case
The form of the solution given above is valid when the matrices X and XᵀX are non-singular, i.e. when the rank of X is equal to the number of parameters. If that is not true, then either singular value decomposition or eigenvalue-filtering techniques are used. Fortunately, most good properties of the linear model remain. Singular value decomposition (SVD): any n×p matrix can be decomposed in the form:

    X = UDVᵀ

where U is an n×n and V a p×p orthogonal matrix (inverse equal to transpose), and D is an n×p diagonal matrix of the singular values. If X is singular then the number of non-zero diagonal elements of D is less than p. For XᵀX we can then write:

    XᵀX = VDᵀDVᵀ

DᵀD is a p×p diagonal matrix. If the matrix is non-singular we can write:

    (XᵀX)⁻¹ = V(DᵀD)⁻¹Vᵀ

Since DᵀD is a diagonal matrix, its inverse is also diagonal. The main trick used in the SVD technique for equation solving is that when diagonal elements are 0 or close to 0, zero is used instead of their inverse. I.e. the pseudo-inverse is calculated using:

    β̂ = VD⁺Uᵀy,   where D⁺ᵢᵢ = 1/dᵢ if dᵢ is (numerically) non-zero, and 0 otherwise
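A sketch of the SVD pseudo-inverse idea in R, reusing x1 and y from the earlier sketch (the relative tolerance is an arbitrary choice here):

    # Least squares via SVD with small singular values zeroed out.
    # Works even when X is rank deficient.
    svd_solve <- function(X, y, tol = 1e-8) {
      s <- svd(X)                        # X = U D V'
      d_inv <- ifelse(s$d > tol * max(s$d), 1 / s$d, 0)  # D^+
      s$v %*% (d_inv * (t(s$u) %*% y))   # beta = V D^+ U' y
    }

    # Rank-deficient example: the third column duplicates the second.
    X_sing <- cbind(1, x1, x1)
    svd_solve(X_sing, y)   # still returns a (minimum-norm) solution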

7 Analysis of diagnostics
Residuals and the hat matrix: residuals are the differences between observed and fitted values:

    r = y − ŷ = y − Hy = (I − H)y,   where H = X(XᵀX)⁻¹Xᵀ

H is called the hat matrix. Its diagonal terms hᵢ are the leverages of the observations. If such a value is close to one, then the corresponding fitted value is determined almost entirely by that single observation. Sometimes hᵢ' = hᵢ/(1 − hᵢ) is used to enhance high leverages. A Q-Q plot can be used to check the normality assumption. A Q-Q plot is a plot of the quantiles of two distributions against each other; if the distributional assumption is correct, this plot should be nearly linear. If the distribution is normal then tests designed for normal distributions can be used; otherwise the bootstrap can be used to derive the desired distributions.
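In R, leverages and a residual Q-Q plot are available directly from a fitted lm object; a sketch, reusing the simulated y, x1 and x2 from above:

    # Leverage and normality diagnostics for a fitted linear model.
    fit <- lm(y ~ x1 + x2)
    h <- hatvalues(fit)                # diagonal of the hat matrix H
    h_enh <- h / (1 - h)               # "enhanced" leverage h/(1-h)
    head(sort(h, decreasing = TRUE))   # most influential design points

    qqnorm(rstandard(fit))   # Q-Q plot of standardised residuals
    qqline(rstandard(fit))   # should be nearly linear under normality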

8 Analysis of diagnostics: Cont.
Other analysis tools include the standardised and studentised (deletion) residuals:

    rᵢ* = rᵢ / (s √(1 − hᵢ)),   tᵢ = rᵢ / (sᵢ √(1 − hᵢ))

where hᵢ is the leverage, hᵢ' is the enhanced leverage, s² is the unbiased estimator of σ², and sᵢ² is the unbiased estimator of σ² after removal of the i-th observation.

9 Bootstrap
The simplest application of the bootstrap to this problem is as follows (a code sketch follows below):
1) Calculate the residuals r = y − Xβ̂
2) Sample with replacement from the residual vector; denote the result r_random
3) Build new "observations" y* = Xβ̂ + r_random
4) Estimate the parameters from y*
5) Repeat steps 2, 3 and 4
6) From the bootstrap replicates, estimate the parameters, their variances, the covariance matrix or the full distribution
Another bootstrapping technique is to resample the observations and the corresponding rows of the design matrix simultaneously: (yᵢ, x₁ᵢ, x₂ᵢ, ..., x_pᵢ), i = 1, ..., n. It is meant to be less sensitive to misspecified models. Note that for some samples the matrix may become singular and the problem ill-defined.
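A minimal sketch of the residual bootstrap above, again reusing the simulated y, x1 and x2 (B = 1000 replicates is an arbitrary choice):

    # Residual bootstrap for the linear model: resample residuals,
    # rebuild pseudo-observations, refit, and collect the estimates.
    fit <- lm(y ~ x1 + x2)
    fitted_y <- fitted(fit)
    r <- residuals(fit)
    B <- 1000
    boot_beta <- replicate(B, {
      y_star <- fitted_y + sample(r, replace = TRUE)  # new "observations"
      coef(lm(y_star ~ x1 + x2))                      # re-estimate parameters
    })
    apply(boot_beta, 1, sd)   # bootstrap standard errors of the coefficients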

10 Generalised linear models
One of the main assumptions of the linear model is that the errors are additive, i.e. observations are equal to their expected value plus an error. What happens if this assumption breaks down, e.g. if the errors are additive only for some function of the expected value? In general we can of course use maximum likelihood (or Bayesian estimation) for these cases. However, there is a class of problems, widely encountered in fields such as medicine and the biosciences, that is especially important when the observations are categorical, i.e. take discrete values. This class of problems is usually dealt with using generalised linear models. Let us consider these problems, starting with the generalised exponential family.

11 Generalised linear model: Exponential family
The natural exponential family of distributions has the form:

    f(y; θ, φ) = exp{ (y A(θ) − g(θ)) / S(φ) + h(y, φ) }

where S(φ) is a scale parameter. We can replace A(θ) with θ by a change of variables, giving the canonical form

    f(y; θ, φ) = exp{ (y θ − b(θ)) / S(φ) + c(y, φ) }

Many distributions, including the normal, binomial, Poisson and exponential distributions, belong to this family. The moment generating function is:

    M(t) = exp{ (b(θ + t S(φ)) − b(θ)) / S(φ) }

Then the first moment (the mean) and the second central moment (the variance) are:

    E(y) = b'(θ),   V(y) = b''(θ) S(φ)
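As a worked example (not on the original slide), the Poisson distribution can be put in this canonical form:

    \begin{aligned}
    P(y;\lambda) &= \frac{\lambda^{y} e^{-\lambda}}{y!}
                  = \exp\{\, y\ln\lambda - \lambda - \ln y! \,\} \\
    \theta &= \ln\lambda,\quad b(\theta) = e^{\theta},\quad
    S(\phi) = 1,\quad c(y,\phi) = -\ln y! \\
    E(y) &= b'(\theta) = e^{\theta} = \lambda, \qquad
    V(y) = b''(\theta)\,S(\phi) = \lambda
    \end{aligned}

This recovers the familiar fact that the Poisson mean and variance are both λ.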

12 Generalised linear model
If the distribution of the observations is one of the distributions from the exponential family, and some function of the expected value of the observations is a linear function of the parameters, then a generalised linear model is used:

    g(E(y)) = Xβ

The function g is called the link function. Here is a list of popular distributions and their corresponding link functions:
binomial - logit = ln(p/(1-p))
normal - identity
Gamma - inverse
Poisson - log
All good statistical packages have implementations of several generalised linear models. To fit a generalised linear model, the likelihood function is written down and maximised; the most natural choice is the canonical link θ = Xβ. The optimisation for this kind of function is done iteratively.
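A minimal glm() sketch with simulated Poisson counts and the log link (data and names invented here):

    # Poisson regression: the log of the expected count is linear
    # in the parameters, i.e. g = log is the link function.
    set.seed(3)
    n <- 200
    x <- runif(n)
    mu <- exp(0.5 + 1.5 * x)        # E(y) on the original scale
    y_count <- rpois(n, mu)

    fit_glm <- glm(y_count ~ x, family = poisson(link = "log"))
    summary(fit_glm)$coefficients   # fitted iteratively (IWLS)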

13 Bootstrap
Three bootstrapping techniques can be used for generalised linear models (a sketch of technique II follows the list below):
I) Resampling differences between observations and expected values
1) Calculate the differences between the observations and the expected (fitted) values
2) Sample from these differences
3) Add them to the fitted values, making sure the resulting "observations" have the properties they are meant to have
4) Estimate the parameters
5) Repeat steps 2-4
II) Parametric resampling using the form of the distribution and the estimated parameters
1) Build the distribution using the estimated parameters
2) Resample using these distributions; note that each observation may have a different distribution
3) Estimate the parameters
4) Repeat steps 2 and 3 and build up bootstrap estimates and distributions
III) Resampling observations and corresponding rows of the design matrix simultaneously
1) Resample from the vectors (yᵢ, x₁ᵢ, x₂ᵢ, ..., x_pᵢ), i = 1, ..., n
2) Estimate the parameters
3) Repeat steps 1 and 2
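A sketch of technique II (parametric resampling), continuing from the Poisson fit in the previous sketch; each observation is resampled from its own fitted distribution:

    # Parametric bootstrap for a Poisson GLM: simulate each observation
    # from Poisson(mu_i_hat), refit, and collect the estimates.
    mu_hat <- fitted(fit_glm)   # one fitted mean per observation
    B <- 1000
    boot_coef <- replicate(B, {
      y_star <- rpois(length(mu_hat), mu_hat)  # own distribution per y_i
      coef(glm(y_star ~ x, family = poisson))
    })
    apply(boot_coef, 1, sd)     # bootstrap standard errors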

14 R commands
The R command for fitting a linear model is lm; the R command for a generalised linear model is glm.

