Regression
Types of Linear Regression Model
Ordinary Least Squares Model (OLS)
–Minimizes the sum of squared residuals about the regression line
–Most commonly used method
Generalized Linear Model (GLM)
–Flexible generalization of ordinary linear regression that allows for response variables whose distribution is other than normal
–Can be used with categorical data
Regression is used to create empirical models
Common Regression – Interval/Ratio (Real) Dependent factor
Logistic Regression – Nominal/Ordinal (Integer) Dependent factor
Basic Assumption: There is a set of X independent factors that controls the magnitude of a dependent factor Y.
Basic Form
Y = a + b1 * X1 + b2 * X2 + b3 * X3 + … + bn * Xn + e
The equation assumes that the relationship between Y and each Xi is linear and that the effects of the variables are additive.
Assumptions:
–Variables are measured without error
–There is a linear relationship and all relevant variables are included
–No autocorrelation (the errors of different observations are independent of one another)
–No perfect collinearity (correlation) between independent variables
–Errors are normally distributed for each value of the independent variables
–The variance of the error term is constant (homoscedasticity, not heteroscedasticity)
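The basic form above can be fitted by ordinary least squares. The following is a minimal sketch, assuming NumPy is available; the function name `fit_ols` and the small data set are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Return [a, b1, ..., bn] that minimize the sum of squared residuals."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept a.
    design = np.column_stack([np.ones(len(X)), X])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs

# Hypothetical, noise-free data generated from Y = 2 + 3*X1 - 1*X2 (e = 0).
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 1.0], [4.0, 5.0]])
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]

coeffs = fit_ols(X, y)
print(np.round(coeffs, 6))  # intercept a first, then b1 and b2
```

Because the example data contain no error term, the fit should recover the generating coefficients; with real data the estimates would only approximate them.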
Simple Linear Regression
A straight line, Y = b0 + b1 * X, is uniquely determined by its y-intercept b0 and its steepness, or slope, b1. For any given value of X, we can find a corresponding value of Y on the line.
Simple Linear Regression
If the relationship between X and Y is linear, the average Y value for any given X will lie right on the regression line. Realistically, in any population, there is bound to be some variation between observations. The variation is due either to the data itself or to some kind of error in the measurement. Even in simple linear regression, the data points will often not all fall exactly on the regression line. We therefore account for the observations that do not fall on the regression line with the error term, e. The error term ei for an observation i is the vertical difference between the observed value Yi of the data point (Xi, Yi) and the value on the theoretical regression line at Xi.
So, even though there may be several Y values with the same X value, the relationship can still be considered linear if we assume the average Y value for any given X value is on the regression line. In a regression model we also assume that for any given value of X, the errors are normally distributed with a mean of zero and a constant variance σ². The negative and positive error values essentially cancel each other out, so the mean error is 0.
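The error terms can be illustrated with a small simple linear regression. This is a sketch only, assuming NumPy and a hypothetical five-point data set. Note that it shows the residuals of a fitted least squares line, which sum to zero by construction when an intercept is included; that is related to, but not the same as, the model assumption that the true errors have a mean of zero.

```python
import numpy as np

# Hypothetical observations (Xi, Yi).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least squares estimates of the slope b1 and intercept b0.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Residual ei = observed Yi minus the value on the fitted line at Xi.
residuals = y - (b0 + b1 * x)

print("slope:", round(b1, 4), "intercept:", round(b0, 4))
print("residuals:", np.round(residuals, 4))
print("sum of residuals:", round(float(residuals.sum()), 10))
```

Some residuals are positive and some negative, and together they cancel out.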
Linear Regression Example
Now that we have our coefficients, the next question is how well the model fits the data. One way to test this is to calculate the coefficient of determination (COD), commonly written R². We will not go into detail about what the COD is, but suffice it to say that it measures how well the equation we created actually fits the data. The COD takes a value between 0 and 1. The closer the value is to 1, the better the fit.
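For reference, the COD is commonly computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch, assuming NumPy; the helper name `r_squared` and the data are hypothetical.

```python
import numpy as np

def r_squared(y_observed, y_predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_observed = np.asarray(y_observed, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    ss_residual = np.sum((y_observed - y_predicted) ** 2)
    ss_total = np.sum((y_observed - y_observed.mean()) ** 2)
    return 1.0 - ss_residual / ss_total

# Hypothetical data and a straight-line least squares fit.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
slope, intercept = np.polyfit(x, y, deg=1)

cod = r_squared(y, intercept + slope * x)
print("COD:", round(float(cod), 4))
```

A model that always predicts the mean of Y scores 0, and a model that reproduces every observation scores 1.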
Least Squares Regression
y = β0 + β1 x + ε
Y = β X + ε
Two Step Process
1. Generalized Linear Regression for Binary or Ordinal Data (using a GLM):
Y = (1,0)
X = Set of independent variables
prediction = a0 + a1x1 + a2x2 + a3x3 + ...
2. Logit Transformation:
Probability (0-1) = 1 / (1 + exp(-(a0 + a1x1 + a2x2 + a3x3 + ...)))
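The two steps can be sketched as follows, assuming NumPy and that the coefficients have already been estimated. The function name `predict_probability` and the coefficient values are hypothetical and not taken from any fitted model.

```python
import numpy as np

def predict_probability(a0, a, x):
    """Step 1: linear prediction; step 2: transform it to a 0-1 probability."""
    linear = a0 + np.dot(x, a)            # a0 + a1*x1 + a2*x2 + ...
    return 1.0 / (1.0 + np.exp(-linear))  # logistic transformation

# Hypothetical coefficients for three independent variables.
a0 = -1.0
a = np.array([0.5, -0.25, 2.0])

# Each row is one observation (x1, x2, x3).
x = np.array([
    [2.0, 4.0, 0.5],   # linear prediction = 0.0
    [4.0, 0.0, 1.0],   # linear prediction = 3.0
    [0.0, 8.0, 0.0],   # linear prediction = -3.0
])

prob = predict_probability(a0, a, x)
print(np.round(prob, 4))
```

A linear prediction of 0 maps to a probability of 0.5, large positive predictions approach 1, and large negative predictions approach 0.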
Christopherson et al.
The fit statistics for Logistic Regression can be misleading.
The “best” fitted model may be relatively “poor” with regard to predictive power.
The goal is usually to delineate the area with the best chance of finding the event. The smaller the area that contains most of the observed events, the better.
You can always increase the number of events in the “high” probability areas by generalizing the model.
You should perform additional tests to assess model performance. Chi-square and Kolmogorov-Smirnov (K-S) tests can be used, as can relative indices.
If possible, use an independent data set for testing.
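One way to carry out such checks on an independent test set is sketched below, assuming NumPy. The helper names, the 0.5 threshold, and the eight test cells are hypothetical. The first helper reports how much of the area is classed as "high" probability and what share of the observed events it captures; the second computes the two-sample K-S statistic between the predicted probabilities of event and non-event cells.

```python
import numpy as np

def capture_summary(prob, observed, threshold):
    """Share of cells classed 'high' and share of observed events they hold."""
    high = prob >= threshold
    area_share = high.mean()
    events_captured = observed[high].sum() / observed.sum()
    return area_share, events_captured

def ks_statistic(a, b):
    """Two-sample K-S statistic: largest gap between the empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    pooled = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, pooled, side="right") / len(a)
    cdf_b = np.searchsorted(b, pooled, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

# Hypothetical independent test cells: predicted probability and outcome.
prob = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05])
observed = np.array([1, 1, 0, 1, 0, 0, 0, 0])

area_share, events_captured = capture_summary(prob, observed, threshold=0.5)
ks = ks_statistic(prob[observed == 1], prob[observed == 0])
print("area classed high:", area_share)
print("events captured:", round(float(events_captured), 3))
print("K-S statistic:", round(float(ks), 3))
```

A small "high" area that still captures most events, together with a K-S statistic near 1, would suggest that the model separates events from non-events well on data it was not fitted to.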
Model Strength
Methods - Deforestation Probability Surface
Cell by Cell Logistic Regression for Each Analysis Year (1986 to 1999) using 5 % Stratified Random Samples (> 1,100,000 cells):
–Dependent Variable: Deforested (1) / Forested (0)
–Independent Variables: LN Distance to Roads, LN Distance to Settlements, Well (1) / Poorly (0) Draining Soils
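The sampling and variable preparation could look roughly like the sketch below, assuming NumPy. The slides do not say which strata were used, so stratifying on the deforested/forested classes is an assumption here, and the cell values are randomly generated placeholders rather than study data.

```python
import numpy as np

def stratified_sample(labels, fraction, rng):
    """Return sorted indices drawing `fraction` of the cells in each class."""
    labels = np.asarray(labels)
    picked = []
    for value in np.unique(labels):
        members = np.flatnonzero(labels == value)
        n = max(1, int(round(fraction * len(members))))
        picked.append(rng.choice(members, size=n, replace=False))
    return np.sort(np.concatenate(picked))

rng = np.random.default_rng(0)

# Placeholder cells: 1 = deforested, 0 = forested (assumed strata).
deforested = (rng.random(10_000) < 0.2).astype(int)
# Placeholder distances to roads in metres, kept above zero so LN is defined.
dist_roads_m = rng.uniform(1.0, 5000.0, size=10_000)

idx = stratified_sample(deforested, fraction=0.05, rng=rng)
y_sample = deforested[idx]                # dependent variable
ln_dist_roads = np.log(dist_roads_m[idx]) # LN distance to roads

print("cells sampled:", len(idx))
print("deforested share in sample:", round(float(y_sample.mean()), 3))
```

Sampling within each class keeps the deforested share of the sample close to its share in the full set of cells.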
Results 1986
[Figure panels: Deforestation Probability Surface; Observed Deforestation]