Presentation transcript:

MGS3100_04.ppt/Sep 29, 2015/Page 1 Georgia State University - Confidential MGS 3100 Business Analysis Regression Sep 29 and 30, 2015

MGS3100_04.ppt/Sep 29, 2015/Page 2 Georgia State University - Confidential Agenda: Regression Statistics & ANOVA, Statistical Significance, Overview of the Regression

MGS3100_04.ppt/Sep 29, 2015/Page 3 Georgia State University - Confidential What is Regression Analysis? The regression procedure is used when you are interested in describing the linear relationship between independent variables and a dependent variable. A line in a two-dimensional or two-variable space is defined by the equation Y = a + b*X. In full text: the Y variable can be expressed in terms of a constant (a) plus a slope (b) times the X variable. The constant is also referred to as the intercept, and the slope as the regression coefficient or B coefficient. For example, GPA may best be predicted as 1 + 0.02*IQ. Thus, knowing that a student has an IQ of 130 would lead us to predict that her GPA would be 3.6 (since 1 + 0.02*130 = 3.6).
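As a quick illustration of the Y = a + b*X form, here is a minimal Python sketch using the hypothetical intercept (1) and slope (0.02) from the GPA/IQ example above; the numbers come from the slide, not from fitted data:

```python
# Simple linear prediction: Y = a + b*X
a = 1.0    # intercept (constant), hypothetical value from the slide
b = 0.02   # slope (regression or B coefficient), hypothetical value from the slide

def predict_gpa(iq):
    """Predicted GPA for a given IQ, using the line GPA = 1 + 0.02*IQ."""
    return a + b * iq

print(predict_gpa(130))  # 3.6, matching the example above
```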

MGS3100_04.ppt/Sep 29, 2015/Page 4 Georgia State University - Confidential What is Regression Analysis? In the multivariate case, when there is more than one independent variable, the regression line cannot be visualized in the two-dimensional space, but can be computed just as easily. For example, if in addition to IQ we had additional predictors of achievement (e.g., Motivation, Self-discipline) we could construct a linear equation containing all those variables. In general then, multiple regression procedures will estimate a linear equation of the form: Y = a + b1*X1 + b2*X2 + ... + bp*Xp
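A minimal sketch of estimating such a multiple regression equation by ordinary least squares with numpy; the two predictors (IQ, Motivation) and the outcome values are made up purely for illustration:

```python
import numpy as np

# Hypothetical data: columns are IQ and Motivation; y is the achievement outcome (e.g., GPA)
X = np.array([[110, 3.0],
              [120, 3.5],
              [130, 2.5],
              [100, 4.0],
              [125, 3.8]])
y = np.array([2.9, 3.3, 3.1, 2.8, 3.6])

# Prepend a column of ones so the first coefficient is the constant a
X1 = np.column_stack([np.ones(len(y)), X])

# Least squares estimate of [a, b1, b2] in Y = a + b1*X1 + b2*X2
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
a, b1, b2 = coef
print(a, b1, b2)
```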

MGS3100_04.ppt/Sep 29, 2015/Page 5 Georgia State University - Confidential Agenda: Regression Statistics & ANOVA, Statistical Significance, Overview of the Regression

MGS3100_04.ppt/Sep 29, 2015/Page 6 Georgia State University - Confidential 1) Predicted and Residual Scores The regression line expresses the best prediction of the dependent variable (Y), given the independent variables (X). However, nature is rarely (if ever) perfectly predictable, and usually there is substantial variation of the observed points around the fitted regression line. The deviation of a particular point from the regression line (its predicted value) is called the residual value.
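The predicted values and residuals can be computed directly once a line is fitted; a minimal numpy sketch on made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # independent variable (illustrative)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])     # observed dependent variable

# Fit Y = a + b*X by least squares; np.polyfit returns [slope, intercept]
b, a = np.polyfit(x, y, 1)

y_hat = a + b * x        # predicted values on the regression line
residuals = y - y_hat    # deviation of each observed point from its predicted value
print(residuals)
```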

MGS3100_04.ppt/Sep 29, 2015/Page 7 Georgia State University - Confidential 2) Residual Variance and R-square The smaller the variability of the residual values around the regression line relative to the overall variability, the better our prediction. For example, if there is no relationship between the X and Y variables, then the ratio of the residual variability of the Y variable to the original variance is equal to 1.0. If X and Y are perfectly related then there is no residual variance and the ratio would be 0.0. In most cases, the ratio falls somewhere between these extremes, that is, between 0.0 and 1.0. One minus this ratio is referred to as R-square or the coefficient of determination. This value is immediately interpretable in the following manner. If we have an R-square of 0.4, then we know that the variability of the Y values around the regression line is 1 - 0.4 = 0.6 times the original variance; in other words, we have explained 40% of the original variability and are left with 60% residual variability. Ideally, we would like to explain most if not all of the original variability. The R-square value is an indicator of how well the model fits the data (e.g., an R-square close to 1.0 indicates that we have accounted for almost all of the variability with the variables specified in the model).

MGS3100_04.ppt/Sep 29, 2015/Page 8 Georgia State University - Confidential 2) R-square A mathematical term describing how much of the variation in Y is being explained by the X variables. R-square = SSR / SST, where SSR = SS (Regression) and SST = SS (Total).
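A minimal sketch of computing R-square from the sums of squares (same illustrative data as above; any small data set works):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b, a = np.polyfit(x, y, 1)       # fitted line
y_hat = a + b * x

sst = np.sum((y - y.mean()) ** 2)      # SS (Total): overall variability of Y
ssr = np.sum((y_hat - y.mean()) ** 2)  # SS (Regression): variability explained by the model
r_square = ssr / sst
print(r_square)
```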

MGS3100_04.ppt/Sep 29, 2015/Page 9 Georgia State University - Confidential 3) Adjusted R-square Adjusted R-square is an adjusted value of R-square; it will be equal to or smaller than the regular R-square. The adjusted R-square corrects for a bias in R-square: R-square tends to overestimate the variance accounted for compared to the estimate that would be obtained from the population. There are two reasons for the overestimate: a large number of predictors and a small sample size. So, with a large sample and few predictors, the adjusted R-square should be very similar to the R-square value. Researchers and statisticians differ on whether to use the adjusted R-square. It is probably a good idea to look at it to see how much your R-square might be inflated, especially with a small sample and many predictors. Adjusted R-square = 1 - [MSE / (SST/(n - 1))], where MSE = MS (Residual/Error) and SST = SS (Total).
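A minimal sketch of the adjusted R-square computation under that formula, again on illustrative data (n observations, k predictors):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n, k = len(y), 1                      # sample size and number of predictors

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

sst = np.sum((y - y.mean()) ** 2)     # SS (Total)
sse = np.sum((y - y_hat) ** 2)        # SS (Residual / Error)
mse = sse / (n - k - 1)               # MS (Residual)

r_square = 1 - sse / sst
adj_r_square = 1 - mse / (sst / (n - 1))   # always <= r_square
print(r_square, adj_r_square)
```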

MGS3100_04.ppt/Sep 29, 2015/Page 10 Georgia State University - Confidential 4) Coefficient R (Multiple R) Customarily, the degree to which two or more predictors (independent or X variables) are related to the dependent (Y) variable is expressed in the correlation coefficient R, which is the square root of R-square. In multiple regression, R can assume values between 0 and 1. To interpret the direction of the relationship between variables, one looks at the signs (plus or minus) of the regression or B coefficients. If a B coefficient is positive, then the relationship of this variable with the dependent variable is positive (e.g., the greater the IQ the better the grade point average); if the B coefficient is negative then the relationship is negative (e.g., the lower the class size the better the average test scores). Of course, if the B coefficient is equal to 0 then there is no relationship between the variables.
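A small numerical check that Multiple R is the square root of R-square, and equivalently the correlation between the observed and predicted values (illustrative data again):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)

multiple_r = np.sqrt(ssr / sst)          # square root of R-square
corr_r = np.corrcoef(y, y_hat)[0, 1]     # correlation of observed vs predicted values
print(multiple_r, corr_r)                # the two values agree
```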

MGS3100_04.ppt/Sep 29, 2015/Page 11 Georgia State University - Confidential 5) ANOVA In general, the purpose of analysis of variance (ANOVA) is to test for significant differences between means. At the heart of ANOVA is the fact that variances can be divided up, that is, partitioned. Remember that the variance is computed as the sum of squared deviations from the overall mean, divided by n-1 (sample size minus one). Thus, given a certain n, the variance is a function of the sums of (deviation) squares, or SS for short. Partitioning works by splitting the total sum of squares into the portion explained by the model and the residual portion that remains unexplained.

MGS3100_04.ppt/Sep 29, 2015/Page 12 Georgia State University - Confidential 6) Degrees of Freedom (df) Statisticians use the term "degrees of freedom" to describe the number of values in the final calculation of a statistic that are free to vary. Consider, for example, the statistic s-square (the sample variance).

MGS3100_04.ppt/Sep 29, 2015/Page 13 Georgia State University - Confidential 7) S-square & Sums of (deviation) squares The statistic s-square is a measure on a random sample that is used to estimate the variance of the population from which the sample is drawn. Numerically, it is the sum of the squared deviations around the mean of a random sample divided by the sample size minus one. Regardless of the size of the population, and regardless of the size of the random sample, it can be algebraically shown that if we repeatedly took random samples of the same size from the same population and calculated the variance estimate on each sample, these values would cluster around the exact value of the population variance. In short, the statistic s-square is an unbiased estimate of the variance of the population from which a sample is drawn.
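A small simulation sketch of that claim: repeatedly drawing samples and averaging the s-square estimates clusters around the true population variance (the normal population and sample size here are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Population: Normal with standard deviation 2, so the true variance is 4.0

# Draw many samples of size 10 and compute s-square for each (ddof=1 divides by n - 1)
estimates = [rng.normal(0, 2, size=10).var(ddof=1) for _ in range(100_000)]

print(np.mean(estimates))   # close to 4.0, the population variance
```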

MGS3100_04.ppt/Sep 29, 2015/Page 14 Georgia State University - Confidential 7) S-square & Sums of (deviation) squares When the regression model is used for prediction, the error (the amount of uncertainty that remains) is the variability about the regression line. This is the Residual Sum of Squares (residual meaning "left over"); it is sometimes called the Error Sum of Squares. The Regression Sum of Squares is the difference between the Total Sum of Squares and the Residual Sum of Squares. Since the total sum of squares is the total amount of variability in the response, and the residual sum of squares is the variability that still cannot be accounted for after the regression model is fitted, the regression sum of squares is the amount of variability in the response that is accounted for by the regression model.
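A quick numerical check of that decomposition, SS (Total) = SS (Regression) + SS (Residual), on the same illustrative data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

sst = np.sum((y - y.mean()) ** 2)      # Total Sum of Squares
sse = np.sum((y - y_hat) ** 2)         # Residual (Error) Sum of Squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # Regression Sum of Squares
print(sst, ssr + sse)                  # the two totals agree
```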

MGS3100_04.ppt/Sep 29, 2015/Page 15 Georgia State University - Confidential 8) Mean Square Error ANOVA is a good example of why many statistical tests represent ratios of explained to unexplained variability. The mean square error is an estimate of the population variance based on the variability among a given set of measures; in ANOVA it is the estimate of the population variance based on the average of all the s-square values within the several samples.
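For the regression setting covered in this deck, the analogous quantities are the regression and residual mean squares, MSR = SSR / k and MSE = SSE / (n - k - 1); a minimal sketch on the illustrative data used earlier:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n, k = len(y), 1                        # observations and predictors

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)

msr = ssr / k              # Mean Square (Regression), df = k
mse = sse / (n - k - 1)    # Mean Square Error (Residual), df = n - k - 1
print(msr, mse)
```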

MGS3100_04.ppt/Sep 29, 2015/Page 16 Georgia State University - Confidential Agenda: Statistical Significance, Regression Statistics & ANOVA, Overview of the Regression

MGS3100_04.ppt/Sep 29, 2015/Page 17 Georgia State University - Confidential 1) What is "statistical significance" (p-value) The statistical significance of a result is the probability that the observed relationship (e.g., between variables) or a difference (e.g., between means) in a sample occurred by pure chance ("luck of the draw"), and that in the population from which the sample was drawn, no such relationship or differences exist. Using less technical terms, one could say that the statistical significance of a result tells us something about the degree to which the result is "true" (in the sense of being "representative of the population"). More technically, the value of the p-value represents a decreasing index of the reliability of a result (see Brownlee, 1960). The higher the p-value, the less we can believe that the observed relation between variables in the sample is a reliable indicator of the relation between the respective variables in the population. Specifically, the p-value represents the probability of error that is involved in accepting our observed result as valid, that is, as "representative of the population."

MGS3100_04.ppt/Sep 29, 2015/Page 18 Georgia State University - Confidential 1) What is "statistical significance" (p-value) For example, a p-value of .05 (i.e., 1/20) indicates that there is a 5% probability that the relation between the variables found in our sample is a "fluke." In other words, assuming that in the population there was no relation between those variables whatsoever, and we were repeating experiments like ours one after another, we could expect that in approximately every 20 replications of the experiment there would be one in which the relation between the variables in question would be equal to or stronger than in ours. (Note that this is not the same as saying that, given that there IS a relationship between the variables, we can expect to replicate the results 5% of the time or 95% of the time; when there is a relationship between the variables in the population, the probability of replicating the study and finding that relationship is related to the statistical power of the design.) In many areas of research, the p-value of .05 is customarily treated as a borderline acceptable error level; it identifies a significant trend.

MGS3100_04.ppt/Sep 29, 2015/Page 19 Georgia State University - Confidential 2) What is "statistical significance" (F-test & t-test)
F test: The F test employs the statistic F to test various statistical hypotheses about the mean (or means) of the distributions from which a sample or a set of samples have been drawn. The t test is a special form of the F test.
F-value: The F-value is the ratio MSR/MSE, that is, the ratio of the variation explained by the regression (per degree of freedom) to the variation that is still unexplained (per degree of freedom). Thus, the higher the F, the better the model, and the more confidence we have that the model we derived from sample data actually applies to the whole population and is not just an aberration found in the sample.
Significance of F: This value is computed from standardized F tables (or by software), which consider the F-value and your sample size. If the significance of F is lower than an alpha of 0.05, the overall regression model is significant.
t-test: The t test employs the statistic t to test a given statistical hypothesis about the mean of a population (or about the means of two populations).
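A minimal sketch of computing the F-value (MSR/MSE), its Significance of F via the F distribution, and the t statistic for the slope, assuming numpy and scipy are available and using the same illustrative data:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n, k = len(y), 1

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
msr, mse = ssr / k, sse / (n - k - 1)

f_value = msr / mse
sig_f = stats.f.sf(f_value, k, n - k - 1)   # Significance of F (p-value for the overall model)

# t-test for the slope b in simple regression; here t**2 equals F
se_b = np.sqrt(mse / np.sum((x - x.mean()) ** 2))
t_value = b / se_b
p_slope = 2 * stats.t.sf(abs(t_value), n - k - 1)

print(f_value, sig_f, t_value, p_slope)
```

With the customary alpha of 0.05, the model is judged significant when Significance of F falls below that cutoff, mirroring the rule stated above.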