Simple Linear Regression


1 Simple Linear Regression

2 Simple Regression Simple regression analysis is a statistical tool that estimates the mathematical relationship between a dependent variable (usually called y) and an independent variable (usually called x). The dependent variable is the variable for which we want to make a prediction. While various non-linear forms may be used, simple linear regression models are the most common.

3 Introduction The primary goal of quantitative analysis is to use current information about a phenomenon to predict its future behavior. Current information is usually in the form of a set of data. In a simple case, when the data form a set of pairs of numbers, we may interpret them as representing the observed values of an independent (or predictor) variable X and a dependent (or response) variable Y. In the accompanying table, the response variable Y represents the man-hours of labor for manufacturing a certain product in lots (X) whose size varies as demand fluctuates.

4 Introduction The goal of the analyst who studies the data is to find a functional relation between the response variable y and the predictor variable x. The data in this example concern 10 recent production runs of a spare part manufactured by the Westwood Company. Notice that some values of X (X = 30 and X = 60) correspond to more than one value of Y. This is not an ordinary functional relationship between X and Y, in which each value of X corresponds to a unique value of Y. Question: what is a plausible way of thinking about this situation that would still lead to a model y = f(x) in which a unique value of y corresponds to each value of X?

5 Regression Function The statement that the relation between X and Y is statistical should be interpreted as providing the following guidelines: 1. Regard Y as a random variable. 2. For each X, take f(x) to be the expected value (i.e., mean value) of Y. 3. Given that E(Y) denotes the expected value of Y, call the equation E(Y) = f(X) the regression function. The idea behind this is: when we have several values of Y observed for one value of X, we take the average of those Y values to assign to that x. This underlying idea is refined and made sophisticated by the theory of regression. According to this theory, we should: 1. Decide what distribution Y has (often taken to be normal). 2. Decide what form f(x) has (here we have taken it to be a line). 3. Then use the data to estimate the parameters of our line so that the expected squared deviation E[(Y − f(X))²] is minimized.
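
To make the averaging idea concrete, here is a minimal sketch in Python. The (x, y) pairs are hypothetical stand-ins for the Westwood table (which is not reproduced in this transcript); the point is only that the empirical regression function assigns to each x the mean of the y values observed there.

```python
from collections import defaultdict

# Hypothetical (x, y) pairs standing in for the Westwood lot-size data;
# note that x = 30 and x = 60 each appear with more than one y value.
pairs = [(30, 73), (20, 50), (60, 128), (80, 170), (40, 87),
         (50, 108), (60, 135), (30, 69), (70, 148), (60, 132)]

# Empirical regression function: for each x, average the observed y values.
groups = defaultdict(list)
for x, y in pairs:
    groups[x].append(y)

f_hat = {x: sum(ys) / len(ys) for x, ys in groups.items()}
for x in sorted(f_hat):
    print(f"x = {x}: mean y = {f_hat[x]:.1f}")
```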

6 Construction of Regression Models
Selection of independent variables. Since reality must be reduced to manageable proportions whenever we construct models, only a limited number of independent or predictor variables can or should be included in a regression model. A central problem is therefore choosing the most important predictor variables. The major considerations are: 1. The extent to which a chosen variable contributes to reducing the remaining variation in Y after allowance is made for the contributions of other variables tentatively included in the model. 2. The importance of the variable as a causal agent in the process under analysis. 3. The degree to which observations on the variable can be obtained accurately, quickly, or economically. Functional form of the regression relation. Sometimes, relevant theory may indicate the appropriate functional form. More frequently, however, the functional form is not known in advance and must be decided once the data have been collected and analyzed. Scope of the model. In formulating a regression model, we usually need to restrict the coverage of the model to some interval or region of values of the independent variables. The scope is determined either by the design of the investigation or by the range of the data at hand. For example, a company studying the effect of price on sales volume investigated six price levels, ranging from $4.95 to $6.95; the scope of the model would then be limited to price levels from near $5 to near $7.

7 Uses of Regression Analysis
Regression analysis serves three major purposes: description, control, and prediction. These purposes frequently overlap in practice. In a study of branch office operating costs, the purpose was administrative control: by developing a usable statistical relation between costs and independent variables in the system, management was able to set a cost standard for each branch office in the company chain. In the Westwood Company lot size example, knowledge of the relation between lot size and man-hours in past production runs enables management to predict the man-hours for the next production run of a given lot size, for purposes of cost estimation and production scheduling. After the run is completed, management can compare the actual man-hours against the predicted hours for purposes of administrative control.

8 Linear Regression Model
General regression model 0, and 1 are parameters Xi is a known constant Deviations i are independent N(o, 2)
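
As an illustration, this sketch simulates data from the model above; the parameter values β0 = 10, β1 = 2, σ = 5 are arbitrary choices for the demonstration, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed parameter values, for illustration only.
beta0, beta1, sigma = 10.0, 2.0, 5.0

x = np.array([20, 30, 30, 40, 50, 60, 60, 60, 70, 80], dtype=float)
eps = rng.normal(0.0, sigma, size=x.size)   # independent N(0, sigma^2) errors
y = beta0 + beta1 * x + eps                 # Y_i = beta0 + beta1 * X_i + eps_i
print(np.column_stack([x, y]))
```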

9 Meaning of Regression Coefficients
The values of the regression parameters 0, and 1 are unknown. We estimate them from data. 1 indicates the change in the mean response per unit increase in X.

10 Estimated Regression Line
If the scatter plot of our sample data suggests a linear relationship between the two variables, we can summarize the relationship by drawing a straight line on the plot. The least squares method gives us the "best" estimated line for our set of sample data.

11 Estimated Regression Line
We will write the estimated regression line based on sample data as ŷ = b0 + b1 x. The method of least squares chooses the values of b0 and b1 that minimize the sum of squared errors Σ(y_i − b0 − b1 x_i)².

12 Estimated Regression Line
Using calculus, we obtain the estimating formulas: b1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)², or equivalently b1 = (Σx_i y_i − n x̄ ȳ) / (Σx_i² − n x̄²), and b0 = ȳ − b1 x̄.
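
A minimal sketch of these estimating formulas in Python, using placeholder data (the slides' tables are not reproduced in this transcript):

```python
import numpy as np

# Hypothetical data standing in for the weekly advertising (x) and sales (y) table.
x = np.array([40., 20., 25., 20., 30., 50., 40., 20., 50., 40.])
y = np.array([385., 400., 395., 365., 475., 440., 490., 420., 560., 525.])

n = x.size
x_bar, y_bar = x.mean(), y.mean()

# Least squares estimates of the slope and intercept.
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```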

13 Estimation of Mean Response
The fitted regression line can be used to estimate the mean value of y for a given value of x. Example: the weekly advertising expenditure (x) and weekly sales (y) are presented in the accompanying table.

14 Point Estimation of Mean Response
From the previous table we obtain the required sums, and the least squares estimates b0 and b1 of the regression coefficients follow from the estimating formulas above.

15 Point Estimation of Mean Response
The estimated regression function has slope b1 = 10.8. This means that if the weekly advertising expenditure is increased by $1, we would expect weekly sales to increase by $10.80.

16 Point Estimation of Mean Response
Fitted values for the sample data are obtained by substituting the corresponding x value into the estimated regression function. For example, if the advertising expenditure is $50, substituting x = 50 into the estimated regression function gives the estimated sales. This is called the point estimate (forecast) of the mean response (sales).

17 Residual The residual is the difference between the observed value y_i and the corresponding fitted value ŷ_i: e_i = y_i − ŷ_i. Residuals are highly useful for studying whether a given regression model is appropriate for the data at hand.

18 Estimation of the variance of the error terms, σ²
The variance σ² of the error terms ε_i in the regression model needs to be estimated for a variety of purposes. It gives an indication of the variability of the probability distributions of y, and it is needed for making inferences about the regression function and for predicting y.

19 Regression Standard Error
To estimate σ we work with the variance and take its square root to obtain the standard deviation. For simple linear regression, the estimate of σ² is the average squared residual, where we average by dividing by the degrees of freedom n − 2 rather than n: s² = SSE/(n − 2) = Σ(y_i − ŷ_i)²/(n − 2). To estimate σ, use s = √(s²); s estimates the standard deviation σ of the error term ε in the statistical model for simple linear regression.
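
A short sketch computing the residuals, SSE, and the regression standard error s on the same kind of placeholder data:

```python
import numpy as np

# Hypothetical advertising (x) and sales (y) data; placeholders for the slides' table.
x = np.array([40., 20., 25., 20., 30., 50., 40., 20., 50., 40.])
y = np.array([385., 400., 395., 365., 475., 440., 490., 420., 560., 525.])

n = x.size
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)        # e_i = y_i - yhat_i
sse = np.sum(residuals ** 2)         # SSE
s = np.sqrt(sse / (n - 2))           # regression standard error
print(f"SSE = {sse:.1f}, s = {s:.2f}")
```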

20 Confidence Intervals and Significance Tests
The standard deviations of b0 and b1 are multiples of the model standard deviation σ. Estimating σ by s gives the standard errors SE(b1) = s/√(Σ(x_i − x̄)²) and SE(b0) = s√(1/n + x̄²/Σ(x_i − x̄)²).

21 Confidence Intervals and Significance Tests
A level 1 − α confidence interval for β1 is b1 ± t(α/2; n − 2) SE(b1), and similarly for β0. To test H0: β1 = 0, use the t statistic t = b1/SE(b1), which has a t distribution with n − 2 degrees of freedom when H0 holds.

22 Confidence interval for 1
Now that our test showed that there is a linear association between advertising expenditure and weekly sales, management wishes an estimate of β1 with a 95% confidence coefficient.

23 Confidence interval for 1
For a 95 percent confidence coefficient, we require t(.025; 8). From Table B in Appendix III, we find t(.025; 8) = 2.306. The 95% confidence interval is therefore b1 ± 2.306 SE(b1).
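
A sketch of the interval and the accompanying t test in Python, with scipy's quantile function standing in for the printed t table; the data are placeholders, so the numbers will not match the slides':

```python
import numpy as np
from scipy import stats

# Hypothetical advertising (x) and sales (y) data; placeholders for the slides' table.
x = np.array([40., 20., 25., 20., 30., 50., 40., 20., 50., 40.])
y = np.array([385., 400., 395., 365., 475., 440., 490., 420., 560., 525.])

n = x.size
sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

se_b1 = s / np.sqrt(sxx)                      # SE(b1)
t_crit = stats.t.ppf(0.975, df=n - 2)         # t(.025; n-2)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
t_stat = b1 / se_b1                           # test of H0: beta1 = 0
p_val = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(f"95% CI for beta1: ({ci[0]:.2f}, {ci[1]:.2f}); t = {t_stat:.2f}, p = {p_val:.4f}")
```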

24 Prediction of a new response (y_new)
We now consider the prediction of a new observation y corresponding to a given level x of the independent variable. In our advertising expenditure and weekly sales example, management wishes to predict the weekly sales corresponding to an advertising expenditure of x = $50.

25 Interval Estimation of a new response (y_new)
The fitted value ŷ = b0 + b1 x gives the point estimator (forecast) for y. A 1 − α prediction interval for a new observation is ŷ ± t(α/2; n − 2) s_pred, where s_pred = s√(1 + 1/n + (x − x̄)²/Σ(x_i − x̄)²).
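
A sketch of this prediction interval at x = 50 on placeholder data:

```python
import numpy as np
from scipy import stats

# Hypothetical advertising (x) and sales (y) data; placeholders for the slides' table.
x = np.array([40., 20., 25., 20., 30., 50., 40., 20., 50., 40.])
y = np.array([385., 400., 395., 365., 475., 440., 490., 420., 560., 525.])

n = x.size
x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
b1 = np.sum((x - x_bar) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x_bar
s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

x_new = 50.0
y_hat = b0 + b1 * x_new                                     # point forecast
s_pred = s * np.sqrt(1 + 1/n + (x_new - x_bar) ** 2 / sxx)  # prediction std. error
t_crit = stats.t.ppf(0.975, df=n - 2)
print(f"forecast = {y_hat:.1f}, "
      f"95% PI = ({y_hat - t_crit * s_pred:.1f}, {y_hat + t_crit * s_pred:.1f})")
```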

26 Analysis of variance approach to Regression analysis
Analysis of Variance is the term for statistical analyses that break down the variation in data into separate pieces that correspond to different sources of variation. It is based on the partitioning of sums of squares and degrees of freedom associated with the response variable. In the regression setting, the observed variation in the responses (y_i) comes from two sources.

27 Analysis of variance approach to Regression analysis
Consider the weekly advertising expenditure and weekly sales example. There is variation in the amount ($) of weekly sales, as in all statistical data. The variation of the y_i is conventionally measured in terms of the deviations y_i − ȳ.

28 Analysis of variance approach to Regression analysis
The measure of total variation, denoted by SST, is the sum of these squared deviations: SST = Σ(y_i − ȳ)². If SST = 0, all observations are the same (no variability). The greater SST is, the greater the variation among the y values. When we use the regression model, the measure of variation is that of the y observations' variability around the fitted line: y_i − ŷ_i.

29 Analysis of variance approach to Regression analysis
The measure of variation in the data around the fitted regression line is the sum of squared deviations (errors), denoted SSE: SSE = Σ(y_i − ŷ_i)². For our weekly expenditure example, SSE comes out substantially smaller than SST. What accounts for the substantial difference between these two sums of squares?

30 Analysis of variance approach to Regression analysis
The difference is another sum of squares, SSR = SST − SSE = Σ(ŷ_i − ȳ)², where SSR stands for the regression sum of squares. SSR measures the variation among the predicted responses ŷ_i. The predicted responses lie on the least-squares line; they show how y moves in response to x. The larger SSR is relative to SST, the greater the role of the regression line in explaining the total variability in the y observations.

31 Formal Development of the Partitioning
We can decompose the total variability in the observations y_i as follows: y_i − ȳ = (ŷ_i − ȳ) + (y_i − ŷ_i). The total deviation can be viewed as the sum of two components: the deviation of the fitted value around the mean, ŷ_i − ȳ, and the deviation of y_i around the fitted regression line, y_i − ŷ_i.

32 Formal Development of the Partitioning
Skipping quite a bit of messy algebra, we simply state that this analysis of variance equation always holds: SST = SSR + SSE. The degrees of freedom break down the same way: n − 1 = 1 + (n − 2).

33 Mean Squares A sum of squares divided by its degrees of freedom is called a mean square (MS). The regression mean square is MSR = SSR/1, and the error mean square is MSE = SSE/(n − 2). Note: mean squares are not additive.

34 Analysis of Variance Table
The breakdowns of the total sum of squares and the associated degrees of freedom are displayed in a table called the analysis of variance table (ANOVA table):

Source of Variation    SS     df      MS                  F-Test
Regression             SSR    1       MSR = SSR/1         MSR/MSE
Error                  SSE    n - 2   MSE = SSE/(n - 2)
Total                  SST    n - 1
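
The sketch below computes the quantities in this table, including the F ratio discussed on the following slides, again on placeholder data:

```python
import numpy as np
from scipy import stats

# Hypothetical advertising (x) and sales (y) data; placeholders for the slides' table.
x = np.array([40., 20., 25., 20., 30., 50., 40., 20., 50., 40.])
y = np.array([385., 400., 395., 365., 475., 440., 490., 420., 560., 525.])

n = x.size
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)   # total sum of squares
sse = np.sum((y - y_hat) ** 2)      # error sum of squares
ssr = sst - sse                     # regression sum of squares

msr = ssr / 1                       # regression mean square
mse = sse / (n - 2)                 # error mean square
f_stat = msr / mse
p_val = stats.f.sf(f_stat, 1, n - 2)
print(f"SSR = {ssr:.1f}, SSE = {sse:.1f}, SST = {sst:.1f}")
print(f"F = {f_stat:.2f}, p = {p_val:.4f}")
```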

35 Analysis of Variance Table
The analysis of variance table reports, in a different way, quantities such as r² and s that are needed in regression analysis. It also gives an alternative form of the test for the overall significance of the regression. If regression on x has no value for predicting y, we expect the slope of the population regression line to be close to 0.

36 Analysis of Variance Table
That is, the null hypothesis of "no linear relationship" is H0: β1 = 0. We standardize the slope of the least-squares line to get a t statistic, t = b1/SE(b1).

37 F-Test for 1= 0 versus 1 0
The analysis of variance approach starts with sums of squares. If regression on x has no value for predicting y, we expect the SSR to be only a small part of the SST, most of which will be made of the SSE. The proper way to standardize this comparison is to use the ratio

38 Coefficient of Determination
Recall that SST measures the total variation in the y_i when no account is taken of the independent variable x, while SSE measures the variation in the y_i when a regression model with the independent variable x is used. A natural measure of the effect of x in reducing the variation in y can therefore be defined as R² = (SST − SSE)/SST = SSR/SST.

39 Coefficient of Determination
R² is called the coefficient of determination. Since 0 ≤ SSE ≤ SST, it follows that 0 ≤ R² ≤ 1. We may interpret R² as the proportionate reduction of the total variability in y associated with the use of the independent variable x. The larger R² is, the more the total variation of y is reduced by including the variable x in the model.

40 Coefficient of Determination
If all the observations fall on the fitted regression line, SSE = 0 and R² = 1. If the slope of the fitted regression line is b1 = 0, so that ŷ_i = ȳ for all i, then SSE = SST and R² = 0. The closer R² is to 1, the greater the degree of linear association between x and y. The square root of R² (taken with the sign of b1) is called the coefficient of correlation.

41 Correlation Coefficient
Recall that the algebraic expression for the correlation coefficient is r = Σ(x_i − x̄)(y_i − ȳ) / √(Σ(x_i − x̄)² Σ(y_i − ȳ)²).
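
A final sketch verifying on placeholder data that R² = SSR/SST and that r from the algebraic formula carries the sign of b1:

```python
import numpy as np

# Hypothetical advertising (x) and sales (y) data; placeholders for the slides' table.
x = np.array([40., 20., 25., 20., 30., 50., 40., 20., 50., 40.])
y = np.array([385., 400., 395., 365., 475., 440., 490., 420., 560., 525.])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
r_squared = 1 - sse / sst           # coefficient of determination, SSR/SST

# Correlation coefficient from the algebraic formula.
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

print(f"R^2 = {r_squared:.3f}, r = {r:.3f}, sign matches b1: {np.sign(r) == np.sign(b1)}")
```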

