Seating chart: Physics-Atmospheric Sciences (PAS), Room 201.
MGMT 276: Statistical Inference in Management Fall 2015
Schedule of readings. Before our next exam (December 3rd): OpenStax Chapters 1–13 (Chapter 12 is emphasized); Plous Chapter 17: Social Influences; Plous Chapter 18: Group Judgments and Decisions.
Over the next couple of lectures (11/17/15): Logic of hypothesis testing with correlations; Interpreting correlations and scatterplots; Simple and multiple regression; Using correlation for predictions; r versus r². Regression uses the predictor variable (independent) to make predictions about the predicted variable (dependent). Coefficient of correlation is the name for "r". Coefficient of determination is the name for "r²" (remember it is always positive, so it carries no direction information). Standard error of the estimate is our measure of the variability of the dots around the regression line (the average deviation of each data point from the regression line, like a standard deviation). Coefficient of regression is the name for "b", one for each predictor variable (like a slope).
Regression: what do we need to define a line? Y-intercept = "a" (also "b₀"): where the line crosses the Y axis. Slope = "b" (also "b₁"): how steep the line is. Example: expenses per year and yearly income (if you spend this much, you probably make this much). The predicted variable goes on the "Y" axis and is called the dependent variable; the predictor variable goes on the "X" axis and is called the independent variable. Revisit this slide
Assumptions Underlying Linear Regression. For each value of X, there is a group of Y values. These Y values are normally distributed. The means of these normal distributions of Y values all lie on the straight line of regression. The standard deviations of these normal distributions are equal. Revisit this slide
Correlation: the prediction line. What is it good for? The prediction line makes the relationship easier to see (even if specific observations, the dots, are removed); identifies the center of the cluster of (paired) observations; identifies the central tendency of the relationship (kind of like a mean); can be used for prediction; should be drawn to provide a "best fit" for the data; should be drawn to provide maximum predictive power for the data; should be drawn to provide minimum predictive error. Revisit this slide
Predicting Restaurant Bill. Prediction line: Y' = a + b₁X₁ (Y-intercept plus slope), so Cost = a + b₁(Persons), with the number of people on the X axis and cost on the Y axis. If "Persons" = 4, what is the prediction for "Cost"? Plugging 4 into the line gives $95.06: the expected cost for dinner for two couples (4 people) would be about $95.06. If "Persons" = 1, what is the prediction for "Cost"? Plug 1 into the same line. Revisit this slide
Predicting Rent. Prediction line: Y' = a + b₁X₁ (Y-intercept plus slope); here Rent = 150 + 1.05(SqFt), with square feet on the X axis and rent on the Y axis. If "SqFt" = 800, what is the prediction for "Rent"? Rent = 150 + 1.05(800) = 150 + 840 = 990: the expected rent on an 800 square foot apartment is about $990. If "SqFt" = 2500, what is the prediction for "Rent"? Rent = 150 + 1.05(2500) = 150 + 2,625 = 2,775. Revisit this slide
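A minimal Python sketch of this plug-in arithmetic; the intercept (150) and slope (1.05) are simply the values implied by the two predictions above, not estimates from raw data:

```python
# Minimal sketch of plugging X into a prediction line Y' = a + b1*X.
# The intercept (150) and slope (1.05) are the values implied by the
# two predictions on this slide; they are illustrative, not from raw data.

def predict_rent(sq_ft, a=150.0, b1=1.05):
    """Return predicted rent Y' for a given square footage X."""
    return a + b1 * sq_ft

print(predict_rent(800))   # 990.0
print(predict_rent(2500))  # 2775.0
```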
Regression Example. Rory is the owner of a small software company and employs 10 sales staff. Rory sends his staff all over the world consulting, selling, and setting up his system. He wants to evaluate his staff in terms of who are the most (and least) productive salespeople, and also whether more sales calls actually result in more systems being sold. So he simply measures the number of sales calls made by each salesperson and how many systems they successfully sold.
Regression Example. Do more sales calls result in more sales made? Independent variable: number of sales calls made. Dependent variable: number of systems sold. (Scatterplot of the salespeople, including Ethan, Isabella, Ava, Emma, Emily, Jacob, and Joshua.) Step 1: Draw scatterplot. Step 2: Estimate r.
Regression Example. Do more sales calls result in more sales made? Step 3: Calculate r. Step 4: Is it a significant correlation?
Do more sales calls result in more sales made? Step 3: Calculate r. Step 4: Is it a significant correlation? n = 10, df = 8, alpha = .05. The observed r is larger than the critical r (0.71 > 0.632), therefore we reject the null hypothesis. Yes, it is a significant correlation: r(8) = 0.71; p < 0.05.
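For anyone who wants to check this kind of significance test numerically, here is a minimal Python sketch. The ten salespeople's raw scores are not reproduced on the slide, so the `calls` and `sold` arrays below are hypothetical placeholders; the logic (compare the observed r to the critical r for df = n – 2, or equivalently check the p-value) is the point:

```python
# Sketch of the significance check for r. The raw data are not shown on the
# slide, so these arrays are hypothetical placeholders; the decision rule
# (compare observed r to the critical r for df = n - 2) is what matters.
from scipy import stats

calls = [1, 2, 2, 3, 3, 3, 4, 4, 1, 2]              # hypothetical X values
sold = [10, 30, 40, 50, 60, 30, 70, 60, 30, 40]     # hypothetical Y values

r, p_value = stats.pearsonr(calls, sold)
critical_r = 0.632        # critical r for df = 8, alpha = .05 (from the slide)

if abs(r) > critical_r:   # equivalently: p_value < .05
    print(f"r({len(calls) - 2}) = {r:.2f}; p < .05 -> reject the null")
else:
    print(f"r({len(calls) - 2}) = {r:.2f}; n.s. -> fail to reject the null")
```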
Regression: Predicting sales. Step 1: Draw the prediction line. What are we predicting? r = 0.71; b = slope; a = intercept. Draw a regression line and regression equation.
Regression: Predicting sales. Step 1: Predict sales for a certain number of sales calls. What should you expect from a salesperson who makes 1 call? Step 2: State the regression equation: Y' = a + bX. Step 3: Solve for that value of X: Y' = a + b(1). If you make one sales call, you should sell about Y' systems. Madison and Joshua, who made one call, should each sell about that many systems; if they sell more, they are over-performing; if they sell fewer, they are under-performing.
Regression: Predicting sales. Step 1: Predict sales for a certain number of sales calls. What should you expect from a salesperson who makes 2 calls? Step 2: State the regression equation: Y' = a + bX. Step 3: Solve for that value of X: Y' = a + b(2). If you make two sales calls, you should sell about Y' systems. Isabella and Jacob, who made two calls, should each sell about that many systems; if they sell more, they are over-performing; if they sell fewer, they are under-performing.
Regression: Predicting sales. Step 1: Predict sales for a certain number of sales calls. What should you expect from a salesperson who makes 3 calls? Step 2: State the regression equation: Y' = a + bX. Step 3: Solve for that value of X: Y' = a + b(3). If you make three sales calls, you should sell about Y' systems. Ava and Emma, who made three calls, should each sell about that many systems; if they sell more, they are over-performing; if they sell fewer, they are under-performing.
Regression: Predicting sales. Step 1: Predict sales for a certain number of sales calls. What should you expect from a salesperson who makes 4 calls? Step 2: State the regression equation: Y' = a + bX. Step 3: Solve for that value of X: Y' = a + b(4). If you make four sales calls, you should sell about Y' systems. Emily, who made four calls, should sell about that many systems; if she sells more, she is over-performing; if she sells fewer, she is under-performing.
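A short Python sketch of Steps 2 and 3 above. The slide's actual intercept and slope values are not shown, so `A_INTERCEPT` and `B_SLOPE` below are hypothetical placeholders:

```python
# Sketch of "state the regression equation, then solve for a value of X".
# The course's actual a and b are not shown on the slides, so the values
# below are hypothetical placeholders used only to illustrate the steps.

A_INTERCEPT = 10.0   # hypothetical "a"
B_SLOPE = 12.0       # hypothetical "b"

def expected_sales(calls, a=A_INTERCEPT, b=B_SLOPE):
    """Y' = a + bX: expected number of systems sold for a given call count."""
    return a + b * calls

for n_calls in (1, 2, 3, 4):
    print(f"{n_calls} call(s): expect about {expected_sales(n_calls):.1f} systems")
```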
Regression: Evaluating Staff. Step 1: Compare expected sales levels to actual sales levels. What should you expect from each salesperson (Madison, Isabella, Ava, Emma, Emily, Jacob, Joshua)? They should each sell about Y' systems, depending on their number of sales calls. If they sell more, they are over-performing; if they sell fewer, they are under-performing.
Regression: Evaluating Staff. Step 1: Compare expected sales levels to actual sales levels. How did Ava do? Ava sold 14.7 more systems than expected, taking into account how many sales calls she made: she is over-performing. The difference between the predicted Y' and the actual Y is called the "residual" (it's a deviation score): for Ava, Y – Y' = 14.7.
Regression: Evaluating Staff. Step 1: Compare expected sales levels to actual sales levels. How did Jacob do? Jacob sold 23.7 fewer systems than expected, taking into account how many sales calls he made: he is under-performing. The difference between the predicted Y' and the actual Y is called the "residual" (it's a deviation score): for Jacob, Y – Y' = -23.7.
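A tiny Python sketch of the residual idea (residual = actual Y minus predicted Y'). The actual and expected sales figures below are made up; only the two residuals quoted on these slides (14.7 and -23.7) come from the lecture:

```python
# Residual = actual Y minus predicted Y'. Positive -> over-performing,
# negative -> under-performing. The actual/expected numbers below are
# hypothetical; only the resulting residuals match the slides.

def residual(actual_sold, expected_sold):
    """Return Y - Y' for one salesperson."""
    return actual_sold - expected_sold

print(residual(60.0, 45.3))   #  14.7 -> over-performing (like Ava)
print(residual(30.0, 53.7))   # -23.7 -> under-performing (like Jacob)
```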
Regression: Evaluating Staff. Step 1: Compare expected sales levels to actual sales levels for each salesperson (Madison, Isabella, Ava, Emma, Emily, Jacob, Joshua). The difference between the predicted Y' and the actual Y is called the "residual" (it's a deviation score); for example, Ava's residual is 14.7.
The difference between the predicted Y' and the actual Y is called the "residual" (it's a deviation score), e.g. 14.7 for Ava and -23.7 for Jacob. Does the prediction line perfectly predict the predicted variable when using the predictor variable? No, we are wrong sometimes. The green lines show how much "error" there is in our prediction line: how much we are wrong in our predictions. How can we estimate exactly how much "error" we have? How would we find our "average residual"?
The difference between the predicted Y' and the actual Y is called the "residual" (it's a deviation score). How do we find the average amount of error in our prediction? The green lines show how much "error" there is in our prediction line: how much we are wrong in our predictions. How would we find our "average residual", the average amount by which actual scores deviate on either side of the predicted score? Step 1: Find the error for each value (just the residuals), Y – Y'. For example, Ava is 14.7, Emily is -6.8, Madison is 7.9, and Jacob is -23.7. Step 2: Add up the residuals. Big problem: Σ(Y – Y') = 0, just as deviation scores always sum to zero around the mean. So we square the deviations, Σ(Y – Y')², divide by df (n – 2), and take the square root: √[Σ(Y – Y')² / (n – 2)].
The difference between the predicted Y' and the actual Y is called the "residual" (it's a deviation score). How do we find the average amount of error in our prediction? The green lines show how much "error" there is in our prediction line: how much we are wrong in our predictions. How would we find our "average residual"? Step 1: Find the error for each value (just the residuals), Y – Y'. Step 2: Find the average: √[Σ(Y – Y')² / (n – 2)]. Compare this with the deviation scores from the earlier height example (Diallo is 0", Mike is -4", Hunter is -2", Preston is 2") and the standard deviation, √[Σ(x – x̄)² / N]. Sound familiar?
These would be helpful to know by heart; please memorize this formula. Standard error of the estimate (line) = √[Σ(Y – Y')² / (n – 2)].
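A minimal Python sketch of that formula, built exactly the way the previous slides build it (square each residual, sum, divide by n – 2, take the square root); the example numbers are hypothetical:

```python
# Standard error of the estimate: square each residual (Y - Y'), sum them,
# divide by n - 2, and take the square root.
import math

def standard_error_of_estimate(actual_y, predicted_y):
    n = len(actual_y)
    sum_sq_residuals = sum((y - y_hat) ** 2
                           for y, y_hat in zip(actual_y, predicted_y))
    return math.sqrt(sum_sq_residuals / (n - 2))

# Hypothetical illustration (not the course data):
print(standard_error_of_estimate([10, 20, 30, 40], [12, 18, 33, 37]))
```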
Finding the standard error of the estimate (line). How well does the prediction line predict the predicted variable when using the predictor variable? The slope doesn't give "variability" info; the intercept doesn't give "variability" info; the correlation "r" does give "variability" info; the residuals do give "variability" info. What if we want to know the "average deviation score"? Standard error of the estimate: a measure of the average amount of predictive error; the average amount that Y' scores differ from Y scores; a mean of the lengths of the green lines.
Residuals: how well does the prediction line predict the Ys from the Xs? Shorter green lines suggest better prediction (smaller error); longer green lines suggest worse prediction (larger error). Why are the green lines vertical? Remember, we are predicting the variable on the Y axis, so error is how we are wrong about Y (vertical). A note about curvilinear relationships and patterns of the residuals: see the sketch below.
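One way to look for such a pattern is to plot the residuals against X: a random band around zero supports a straight-line model, while a curve or fan shape suggests a curvilinear relationship or unequal spread. A minimal sketch with hypothetical data (this is an illustration, not the course dataset):

```python
# Sketch: plot residuals (the "green lines") against X. A random-looking band
# around zero supports a straight-line model; a curved pattern suggests a
# curvilinear relationship. The data below are hypothetical.
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([12, 18, 25, 33, 38, 45, 50, 58, 63, 70], dtype=float)

b, a = np.polyfit(x, y, 1)      # least-squares slope (b) and intercept (a)
residuals = y - (a + b * x)     # Y - Y' for each point

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("X (predictor)")
plt.ylabel("Residual (Y - Y')")
plt.show()
```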
When would our predictions be perfect (with no error at all)? Perfect correlation = +1.00 or -1.00: one variable perfectly predicts the other; no variability in the scatterplot; the dots all fall on a straight line. Any residuals?
Assumptions Underlying Linear Regression. For each value of X, there is a group of Y values. These Y values are normally distributed. The means of these normal distributions of Y values all lie on the straight line of regression. The standard deviations of these normal distributions are equal.
The difference between the predicted Y' and the actual Y is called the "residual" (it's a deviation score), e.g. 14.7 for Ava. Does the prediction line perfectly predict the predicted variable when using the predictor variable? No, we are wrong sometimes. The green lines show how much "error" there is in our prediction line: how much we are wrong in our predictions. How can we estimate how much "error" we have? Only with a perfect correlation (+1.00 or -1.00) does each variable perfectly predict the other: no variability in the scatterplot, and the dots all fall on a straight line.
Regression Analysis – Least Squares Principle. When we calculate the regression line, we try to minimize the distance between the predicted Ys and the actual (data) Y points (the length of the green lines). Remember, because the negative and positive values would cancel each other out, we have to square those distances (deviations). So we are trying to minimize the "sum of squares of the vertical distances between the actual Y values and the predicted Y values".
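A small Python sketch of the least-squares computation: the slope and intercept below are the standard closed-form values that minimize that sum of squared vertical distances. The data are hypothetical:

```python
# Least-squares fit: the slope b and intercept a below minimize
# sum((Y - Y')**2), the sum of squared vertical distances.

def least_squares_fit(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data, used only to illustrate the computation:
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
a, b = least_squares_fit(xs, ys)
print(f"Y' = {a:.2f} + {b:.2f}X")

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
print(f"sum of squared residuals: {sse:.3f}")   # minimized by this a and b
```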
Is the regression line better than just guessing the mean of the Y variable? How much does the information about the relationship actually help? Which minimizes error better? How much better does the regression line predict the observed results? That is what r² tells us. Wow!
What is r²? r² = the proportion of the total variance in one variable that is predictable by its relationship with the other variable. Example: If mothers' and daughters' heights are correlated with an r = .8, then what amount (proportion or percentage) of the variance of mothers' height is accounted for by daughters' height? Answer: .64, because (.8)² = .64.
What is r²? r² = the proportion of the total variance in one variable that is predictable by its relationship with the other variable. Example: If mothers' and daughters' heights are correlated with an r = .8, then what proportion of the variance of mothers' height is not accounted for by daughters' height? Answer: .36, because 1 – .64 = .36, or 36% because 100% – 64% = 36%.
What is r²? r² = the proportion of the total variance in one variable that is predictable by its relationship with the other variable. Example: If ice cream sales and temperature are correlated with an r = .5, then what amount (proportion or percentage) of the variance of ice cream sales is accounted for by temperature? Answer: .25, because (.5)² = .25.
What is r²? r² = the proportion of the total variance in one variable that is predictable by its relationship with the other variable. Example: If ice cream sales and temperature are correlated with an r = .5, then what amount (proportion or percentage) of the variance of ice cream sales is not accounted for by temperature? Answer: .75, because 1 – .25 = .75, or 75% because 100% – 25% = 75%.
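The arithmetic in these examples is just r squared and its complement; a short Python sketch:

```python
# r-squared arithmetic from the examples above: the proportion of variance
# accounted for is r**2, and the proportion NOT accounted for is 1 - r**2.

for label, r in [("mothers'/daughters' heights", 0.8),
                 ("ice cream sales vs. temperature", 0.5)]:
    r_squared = r ** 2
    print(f"{label}: r = {r}, "
          f"accounted for = {r_squared:.2f}, "
          f"not accounted for = {1 - r_squared:.2f}")
```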
Some useful terms. Regression uses the predictor variable (independent) to make predictions about the predicted variable (dependent). Coefficient of correlation is the name for "r". Coefficient of determination is the name for "r²" (remember it is always positive, so it carries no direction information). Standard error of the estimate is our measure of the variability of the dots around the regression line (the average deviation of each data point from the regression line, like a standard deviation).
Regression: Evaluating Staff. Step 1: Compare expected sales levels to actual sales levels for each salesperson (Madison, Isabella, Ava, Emma, Emily, Jacob, Joshua). The difference between the predicted Y' and the actual Y is called the "residual" (it's a deviation score); for example, 14.7 for Ava and -23.7 for Jacob.
Summary: Interpret r = 0.71. There is a positive relationship between the number of sales calls and the number of systems sold, and it is a strong relationship. Remember, we have not demonstrated cause and effect here, only that the two variables (sales calls and systems sold) are related.
Correlation Coefficient – Excel Example. Interpret r = 0.71. Does this correlation reach significance? n = 10, df = 8, alpha = .05. The observed r is larger than the critical r (0.71 > 0.632), therefore we reject the null hypothesis: r(8) = 0.71; p < 0.05.
Coefficient of Determination – Excel Example. Interpret r² = .504 (.71² = .504): we can say that 50.4 percent of the variation in the number of systems sold is explained, or accounted for, by the variation in the number of sales calls. Remember, we lose the directionality of the relationship with r².
Homework Review
The relationship between hours worked and weekly pay is a strong positive correlation: as hours worked go up, weekly pay goes up (and as hours go down, pay goes down). This correlation is significant, r(3) = 0.92; p < 0.05. Regression equation: y' = a + 6.09x; for each additional hour worked, weekly pay will increase by $6.09. r² = .84, or 84%: 84% of the total variance of "weekly pay" is accounted for by "hours worked".
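A small Python sketch of this answer's arithmetic. The slope ($6.09 of weekly pay per additional hour) comes from the answer above; the intercept is not shown on the slide, so `A_INTERCEPT` below is a hypothetical placeholder:

```python
# Check of the "hours worked vs. weekly pay" answer: r-squared from r,
# and a prediction from the regression line. The slope comes from the
# answer above; the intercept is a hypothetical placeholder.

r = 0.92
print(f"r-squared = {r ** 2:.4f}")   # 0.8464 -> about 84% of the variance

B_SLOPE = 6.09        # dollars of weekly pay per additional hour worked
A_INTERCEPT = 100.0   # hypothetical intercept (not given on the slide)

def predicted_pay(hours):
    """y' = a + b * hours, under the assumptions noted above."""
    return A_INTERCEPT + B_SLOPE * hours

print(f"predicted weekly pay for 30 hours: ${predicted_pay(30):.2f}")
```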
(Homework scatterplot: number of operators on the X axis, wait time in seconds on the Y axis.)
r = -0.73. The relationship between wait time and number of operators working is negative and moderate: as the number of operators increases, wait time decreases. This correlation is not significant, r(3) = -0.73, n.s. (critical r = 0.878; 0.73 is not larger than 0.878, so we do not reject the null). Regression equation: y' = -18.5x + a; for each additional operator added, wait time will decrease by 18.5 seconds. One of the predicted wait times is 328 seconds. About 54% of the total variance of wait time is accounted for by the number of operators.
(Homework scatterplot: median income on the X axis, percent of residents with BA degrees on the Y axis.)
The relationship between median income and percent of residents with BA degrees is strong and positive: as median income goes up, so does the percent of residents who have a BA degree. This correlation is significant, r(8) = 0.89; p < 0.05 (critical r = 0.632, and 0.89 is larger, so we reject the null). Regression equation: y' = .0005x + a; for each additional $1 in income, the percent of BAs increases by .0005. One of the predicted values is 35% of residents. r² = .78, or 78%: the proportion of the total variance of percent of BAs accounted for by median income is 78%.
(Homework scatterplot: median income on the X axis, crime rate on the Y axis.)
The relationship between crime rate and median income is negative and moderate: as median income goes up, the crime rate tends to go down. This correlation is not significant, r(8) = -0.63, n.s. (0.63 is not bigger than the critical r of 0.632, so we do not reject the null). Regression equation: y' = -.0499x + a; for each additional $1 in income, thefts go down by .0499. One of the predicted values is 1,418.5 thefts. r² = .396, or about 40%: the proportion of the total variance of thefts accounted for by median income is 40%.
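The same decision pattern as the other homework problems, sketched in Python: compare |r| to the critical value for the given df, then report r². The r of -0.63 and critical value of 0.632 come from the answer above:

```python
# Homework decision pattern: is |r| larger than the critical r for this df?
# If yes, reject the null; if no, report n.s. Then report r-squared.

r = -0.63          # from the crime-rate answer above
critical_r = 0.632 # critical r for df = 8, alpha = .05
df = 8

significant = abs(r) > critical_r
verdict = "p < .05, reject the null" if significant else "n.s., do not reject the null"
print(f"r({df}) = {r}; {verdict}")
print(f"r-squared = {r ** 2:.3f}")   # about .40 -> roughly 40% of the variance
```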