
SIMPLE LINEAR REGRESSION

Last week
 Discussed the ideas behind:
   Hypothesis testing
   Random Sampling Error
   Statistical Significance, Alpha, and p-values
 Examined Correlation – specifically Pearson’s r
   What it’s used for, when to use it (and when not to)
   Statistical Assumptions
   Interpretation of r (direction/magnitude) and p

Tonight
 Extend our discussion of correlation into simple linear regression
 Correlation and regression are closely linked, conceptually and mathematically
   You will often see correlations paired with regression
 Regression is just one step past r
   You’ve all done it in high school math
 First… a brief review…

Quick Review/Quiz
 A health researcher plans to determine if there is an association between physical activity and body composition.
 Specifically, the researcher thinks that people who are more physically active (PA) will have a lower percent body fat (%BF).
 Write out a null and alternative hypothesis

PA and %BF
 H0:
   There is no association between PA and %BF
 HA:
   People with ↑ PA will have ↓ %BF
 The researcher will use a Pearson correlation to determine this association. He sets alpha ≤ 0.05.
 Write out what that means (alpha ≤ 0.05)

Alpha
 If the researcher sets alpha ≤ 0.05, this means that he/she will reject the null hypothesis if the p-value of the correlation is equal to or less than 0.05
 This is the level of confidence/risk the researcher is willing to accept
 If the p-value of the test is greater than 0.05, there is a greater than 5% chance that the result could be due to ___________________, rather than a real effect/association
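To make the decision rule concrete, here is a minimal Python sketch (not part of the original slides; the p-value shown is a hypothetical output of a correlation test):

```python
# Minimal sketch of the alpha decision rule.
# The p-value here is a hypothetical result from a correlation test.
alpha = 0.05
p_value = 0.02

if p_value <= alpha:
    print("Reject the null hypothesis")        # result unlikely under H0
else:
    print("Fail to reject the null hypothesis")
```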

Results
 The researcher runs the correlation in SPSS and this is in the output:
   n = 100, r = -0.75, p = 0.02
 1) What is the direction of the correlation? What does this mean?
 2) What is the sample size?
 3) Describe the magnitude of the association.
 4) Is this result statistically significant?
 5) Did he/she fail to reject the null hypothesis OR reject the null hypothesis?

Results defined
 There is a negative, moderate-to-strong relationship between PA and %BF (r = -0.75, p = 0.02).
 Those with higher levels of physical activity tended to have lower %BF (or vice versa)
 Reject the null hypothesis and accept the alternative
 Based on this correlation alone, does PA cause %BF to change? Why or why not?

Error
 Assume the association seen here between PA and %BF is REAL (not due to RSE).
 What type of error is made if the researcher fails to reject the null hypothesis (and accepts H0)?
   Says there is no association when there really is
   Type II Error
 Assume the association seen here between PA and %BF is due to RSE (not REAL).
 What type of error is made if the researcher rejects the null hypothesis (and rejects H0)?
   Says there is an association when there really is not
   Type I Error

Our Decision

                      Reject H0        Accept H0
What is True
  H0 (no assoc.)      Type I Error     Correct
  HA (assoc.)         Correct          Type II Error

 HA: There is an association between PA and %BF
 H0: There is no association between PA and %BF
Questions…?

Back to correlations
 Recall, correlations provide two critical pieces of information about a relationship between two variables:
 1) Direction (+ or -)
 2) Strength/Magnitude
 However, the correlation coefficient (r) can also be used to describe how well one variable can be used to predict another.
   A frequent goal of statistics
 For example…

Association vs Prediction
 Is undergrad GPA associated with grad school GPA?
   Can grad school GPA be predicted by undergrad GPA?
 Are skinfold measurements associated with %BF?
   Can %BF be predicted by skinfolds?
 Is muscular strength associated with injury risk?
   Can muscular strength be predictive of injury risk?
 Is event attendance associated with ticket price?
   Can event attendance be predicted by ticket price?
   (i.e., what ticket price will maximize profits?)

Correlation and Prediction
 This idea should seem reasonable.
 Look at the three correlations below. In which of the three do you think it would be easiest (most accurate) to predict the y variable from the x variable?
[Three scatterplots: A, B, C]

Correlation and Prediction
 The stronger the relationship between two variables, the more accurately you can use information from one of those variables to predict the other
   Which do you think you could predict more accurately?
   Bench press repetitions from body weight? Or 40-yard dash from 10-yard dash?

Explained Variance
 The stronger the relationship between two variables, the more accurately you can use information from one of those variables to predict the other
 This concept is “explained variance” or “variance accounted for”
 Variance = the spread of the data around the center
   Why the values are different for everyone
 Calculated by squaring the correlation coefficient, r²
   Above correlation: r = 0.62 and r² = 0.39
   aka, Coefficient of Determination
 What percentage of the variability in y is explained by x
   The 10-yard dash explains 39% of the variance in the 40-yard dash
   If we could explain 100% of the variance – we’d be able to make a perfect prediction

Coefficient of Determination, r²
 What percentage of the variability in y is explained by x
   The 10-yard dash explains 39% of the variance in the 40-yard dash
   So – about 61% (100% - 39% = 61%) of the variance remains unexplained (is due to other things)
 The more variance you can explain, the better the prediction
 The less variance that is explained, the more error in the prediction
 Examples – notice how quickly the prediction degrades:
   r = 1.00; r² = 100%
   r = 0.87; r² = 75%
   r = 0.71; r² = 50%
   r = 0.50; r² = 25%
   r = 0.22; r² = 5%
 Example with BP…
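As a sketch of how r and r² relate in code (assuming Python with NumPy/SciPy rather than the lecture’s SPSS workflow, and made-up dash times):

```python
# Sketch: compute Pearson's r and r² for two hypothetical variables.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
ten_yd = rng.normal(1.55, 0.08, 100)                # hypothetical 10-yd times (s)
forty_yd = 2.9 * ten_yd + rng.normal(0, 0.15, 100)  # hypothetical 40-yd times (s)

r, p = stats.pearsonr(ten_yd, forty_yd)
print(f"r = {r:.2f}, r^2 = {r**2:.2f}, p = {p:.4f}")  # r² = share of variance explained
```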

Variance: BP
 Average systolic blood pressure in the United States
 Note the mean – and the variation (variance) in the values
[Histogram of systolic BP: Mean = 119 mmHg, SD = 20, N = 22,270]
 Why are these values so spread out?

What things influence blood pressure
 Age
 Gender
 Physical Activity
 Diet
 Stress
 Which of these variables do you think is most important? Least important?
 If we could measure all of these, could we perfectly predict blood pressure?
 Correlating each variable with BP would allow us to answer these questions using r²

Beyond r²
 Obviously you want to have an estimate of how well a prediction might work – but r² does not tell you how to make that prediction
 For that we use some form of regression
 Regression is a generic term (like correlation)
 There are several different methods to create a prediction equation:
   Simple Linear Regression
   Multiple Linear Regression
   Logistic Regression (pregnancy test)
   and many more…
 Example using Height to predict Weight

Let’s start with a scatterplot between the two variables…
[Scatterplot of height vs. weight, r = 0.81]
Note the correlation coefficient above (r² = 0.66)
SPSS is going to do all the work. It will use a process called: Least Squares Estimation

[Scatterplot with a candidate line, r = 0.81]
Least squares estimation: the process by which SPSS finds, out of every possible line through the points, the line where the squared vertical deviations from that line are smallest
The green line indicates a possible line; the blue arrows indicate the deviations – longer arrows = bigger deviations
This is a crappy attempt – it will keep trying new lines until it finds the best one

[Scatterplot with the fitted line, r = 0.81]
Least squares estimation: the process by which SPSS finds, out of every possible line through the points, the line where the squared vertical deviations from that line are smallest
Eventually, SPSS will get it right, finding the line that minimizes those deviations, known as: Line of Best Fit
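In practice no trial-and-error is needed: the least-squares slope and intercept have closed-form solutions. A Python sketch with hypothetical height/weight data (the numbers here are invented for illustration, not the lecture’s dataset):

```python
# Sketch: closed-form least-squares estimates for a simple linear regression.
import numpy as np

height = np.array([62.0, 65, 66, 68, 70, 71, 73, 75])         # inches (hypothetical)
weight = np.array([115.0, 130, 128, 145, 150, 160, 170, 185])  # lbs (hypothetical)

# slope = sum of cross-deviations / sum of squared x-deviations
slope = (np.sum((height - height.mean()) * (weight - weight.mean()))
         / np.sum((height - height.mean()) ** 2))
intercept = weight.mean() - slope * height.mean()  # line passes through (x-bar, y-bar)

print(f"weight ≈ {intercept:.1f} + {slope:.2f} * height")
```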

[Scatterplot with the fitted line, r = 0.81]
The Line of Best Fit is the end product of regression.
This line will have a certain slope (SLOPE): up so many units for every so many units across.
And it will have a value on the y-axis at the zero value of the x-axis (INTERCEPT): here, -234.

The intercept can be seen more clearly if we redraw the graph with appropriate axes… (-234 lbs)
The intercept will sometimes be a nonsense value – in this case, nobody is 0 inches tall or weighs -234 lbs.

[Scatterplot with the fitted line, r = 0.81]
From the line (its equation), we can predict that an increase in height of 1 inch predicts a rise in weight of 5.4 lbs (Slope = 5.4)
We can now estimate weight from height. A person who is 68 inches tall should weigh about 135 lbs
SPSS will output the equation, among a number of other items, if you ask for them

SPSS output:
The β-coefficient is the SLOPE of the line
The (Constant) is the INTERCEPT of the line
The p-value is still here. In this case, height is a statistically significant predictor of weight (the association is likely NOT due to RSE)

We can use those two values to write out the equation for our line. Depending on your high school math teacher:
Y = a + bX   or   Y = b + mX   (INTERCEPT first, then SLOPE × X)
Weight = -234 + 5.4 × (Height)
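A quick sketch applying the slide’s fitted equation (using the rounded coefficients from the SPSS output above):

```python
# Sketch: predictions from the slide's (rounded) height-weight equation.
def predict_weight(height_in: float) -> float:
    """Predicted weight (lbs) from height (inches): Weight = -234 + 5.4*Height."""
    return -234 + 5.4 * height_in

print(predict_weight(68))  # 133.2 lbs with these rounded coefficients
                           # (the slide's ~135 lbs reflects unrounded SPSS output)
```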

Model Fit?
 Once you create your regression equation, this equation is called the ‘model’
   i.e., we just modeled (created a model for) the association between height and weight
 How good is the model? How well do the data fit?
 Can use r² for a general comparison
   How well one variable can predict the other
   Lower r² means less variance accounted for, more error
   Our r = 0.81 for height/weight, so r² = 0.66
 We can also use the Standard Error of the Estimate

How good, generally, is the fit?
 Standard Error of the Estimate (SEE)
 Imagine we used our prediction equation to predict weight for each subject in our dataset (X to predict Y)
 Will our equation perfectly estimate each Y from X?
   Unless r² = 1.0, there will be some error between the real Y and the predicted Y
 The SEE is the standard deviation of those differences
   The standard deviation of actual Y’s about predicted Y’s
   Estimates the typical size of the error in predicting Y (sort of)
 Critically related to r², but SEE is more specific to your equation

[Scatterplot with the fitted line, r = 0.81]
Let’s go back to our line of best fit (this line represents the predicted value of Y for each X):
Notice some real Y’s are closer to the line than others – some errors are large, some small, some very small
SEE = the standard deviation of actual Y’s about predicted Y’s
SEE is the standard deviation of all these errors

SEE
 Why calculate the ‘standard deviation’ of these errors instead of just calculating the ‘average error’?
 By using the standard deviation instead of the mean, we can describe what percentage of estimates fall within 1 SEE of the line
 In other words, if we used this prediction equation, we would expect that
   68% fall within 1 SEE
   95% fall within 2 SEE
   99% fall within 3 SEE
 Knowing, “How often is this accurate?” is probably more important than asking, “What’s the average error?”
 Of course, how large the SEE is depends on your r² and your sample size (larger samples make more accurate predictions)
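A sketch of the SEE calculation (the actual/predicted values below are hypothetical; note that the formal SEE divides the squared residuals by n − 2, which is slightly different from a plain standard deviation):

```python
# Sketch: SEE = spread of actual Y's about predicted Y's.
import numpy as np

y_actual = np.array([115.0, 130, 128, 145, 150, 160, 170, 185])  # hypothetical
y_pred   = np.array([118.0, 128, 133, 143, 153, 158, 168, 178])  # hypothetical

residuals = y_actual - y_pred
see = np.sqrt(np.sum(residuals**2) / (len(residuals) - 2))  # n - 2 degrees of freedom

print(f"SEE = {see:.1f} lbs")
print("share within 1 SEE:", np.mean(np.abs(residuals) <= see))      # ≈ 68% in theory
print("share within 2 SEE:", np.mean(np.abs(residuals) <= 2 * see))  # ≈ 95% in theory
```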

[Scatterplot with the fitted line, r = 0.81]
Let’s go back to our line of best fit:
In regression, we call these errors/deviations “residuals” – some large, some small, some very small
Residual = Real Y – Predicted Y
Notice that some of the residuals are - and some are +, where we over-estimated (-) or under-estimated (+) weight
SEE is the standard deviation of the residuals

Residuals
 The line of best fit is the line where the residuals are minimized (least error)
 The residuals will sum to 0
   The mean of the residuals will also be 0
   The Line of Best Fit is the ‘balance point’ of the scatterplot
 The standard deviation of the residuals is the SEE
 Recognize this concept/terminology – if there is a residual, that means the effect of other variables is creating error
   Confounding variables create residuals
QUESTIONS…?
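A small sketch verifying those residual properties, using NumPy’s polyfit on made-up data (not the lecture’s dataset):

```python
# Sketch: least-squares residuals sum (and average) to ~0.
import numpy as np

x = np.array([1.0, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

slope, intercept = np.polyfit(x, y, 1)      # degree-1 least-squares fit
residuals = y - (intercept + slope * x)

print(round(residuals.sum(), 10))   # ~0: the line is the 'balance point'
print(round(residuals.mean(), 10))  # ~0
```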

Statistical Assumptions of Simple Linear Regression
 See last week’s notes on the assumptions of correlation…
 Variables are normally distributed
 Homoscedasticity of variance
 Sample is representative of the population
 Relationship is linear (remember, Y = a + bX)
 The variables are ratio/interval (continuous)
   Can’t use nominal or ordinal variables
   …at least pretend for now; we’ll break this one next week.

Simple Linear Regression: Example
 Let’s start simple, with two variables we know to be very highly correlated
   40-yard dash and 20-yard dash
 Can we predict 40-yard dash from 20-yard dash?

SLR
 Trimmed the dataset down to just two variables
 Let’s look at a scatterplot first

[Scatterplot of 20-yard vs. 40-yard dash times]
All my assumptions are good; we should be able to produce a decent prediction.
Next step: correlation

Correlation
 Strength? Direction?
 Statistically significant correlations will (usually) produce statistically significant predictors
 r² = ?? → 0.66
 Now, run the regression in SPSS

SPSS
 The ‘predictor’ is the independent variable

Model Outputs
 Adjusted r²: adjusts the r² value based on sample size… small samples tend to overestimate the ability to predict the DV with the IV (our sample is 428, so the adjusted value is similar to r²)
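The adjustment itself is a small formula; a sketch using the slide’s n = 428 and r² = 0.66 with one predictor:

```python
# Sketch: adjusted r² for a model with p predictors.
n, p, r2 = 428, 1, 0.66
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))  # 0.6592 – barely below r², since n is large
```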

Model Outputs
 Notice our SEE of 0.06 seconds.
   68% of residuals are within 0.06 seconds of predicted
   95% of residuals are within 0.12 seconds of predicted

Model Outputs
 The ‘ANOVA’ portion of the output tells you if the entire model is statistically significant. However, since our model includes just one variable (20-yard dash), the p-value here will match the one to follow

Outputs
 Y-intercept = 1.26
 Slope = 1.245
 20-yard dash is a statistically significant predictor
 What is our equation to predict the 40-yard dash?

Equation
 40-yard dash time = 1.26 + 1.245 × (20-yard time)
 If a player ran the 20-yard dash in 2.5 seconds, what is their estimated 40-yard dash time?
   1.26 + 1.245(2.5) = 4.37 seconds
 If the player actually ran 4.53 seconds, what is the residual?
   Residual = Real – Predicted
   4.53 – 4.37 = 0.16 seconds
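A sketch of the worked example in code (the slope 1.245 is from the SPSS output; the intercept 1.26 is inferred from the slide’s arithmetic, since 1.26 + 1.245 × 2.5 ≈ 4.37):

```python
# Sketch: predict 40-yd time from 20-yd time and compute a residual.
def predict_40(t20: float) -> float:
    return 1.26 + 1.245 * t20  # intercept inferred, slope from the output

pred = predict_40(2.5)
print(round(pred, 2))         # 4.37 s predicted
print(round(4.53 - pred, 2))  # residual = real - predicted = 0.16 s
```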

Significance vs. Importance in Regression
 A statistically significant model/variable does NOT mean the equation is good at predicting
 The p-value tells you whether the independent variable (predictor) can be used as a predictor of the dependent variable
 The r² tells you how good the independent variable might be as a predictor (variance accounted for)
 The SEE tells you how good the predictor (model) is at predicting
QUESTIONS…?

Upcoming…
 In-class activity…
 Homework:
   Cronk Section 5.3
   Holcomb Exercises 29, 44, 46 and 33
 Multiple Linear Regression next week