Statistics for clinicians
Biostatistics course by Kevin E. Kip, Ph.D., FAHA
Professor and Executive Director, Research Center, University of South Florida, College of Nursing
Professor, College of Public Health, Department of Epidemiology and Biostatistics
Associate Member, Byrd Alzheimer’s Institute, Morsani College of Medicine
Tampa, FL, USA
SECTION 6.1 Correlation versus linear regression
Learning Outcome: Distinguish between correlation and linear regression
Correlation and regression are both measures of association.
Some terms for the variables in an association:
Variable 1 (“x” variable): independent variable, predictor variable, exposure variable
Variable 2 (“y” variable): dependent variable, outcome variable
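Correlation Coefficient
Computation form of the Pearson correlation (“r”):
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}$$
where $\bar{x}$ and $\bar{y}$ are the sample means of X and Y, and $s_x$ and $s_y$ are the sample standard deviations of X and Y. The numerator measures the co-variation of X and Y.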
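The computation form can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual form of the formula (co-variation divided by (n - 1) times the two sample standard deviations). The function name `pearson_r` and the BMI and blood pressure values are hypothetical and are not data from the course.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by (n - 1) * s_x * s_y."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-products of deviations from the means (the co-variation)
    co_variation = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return co_variation / ((n - 1) * stdev(xs) * stdev(ys))

# Hypothetical BMI (x) and systolic blood pressure (y) pairs
bmi = [22.1, 24.5, 26.0, 27.3, 30.2, 33.8]
sbp = [118, 121, 130, 128, 141, 150]
r = pearson_r(bmi, sbp)
print(round(r, 3))
```

Whatever data are supplied, the result is bounded between -1 and 1, which is the contrast with the unbounded regression slope introduced next.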
Introduction to Linear Regression
Like correlation, the data are pairs of independent (e.g. “X”) and dependent (e.g. “Y”) variables: $\{(x_i, y_i): i = 1, \dots, n\}$. However, here we seek to predict values of Y from X.
The fitted equation is written:
$$\hat{y} = b_0 + b_1 x$$
where $\hat{y}$ is the predicted value of the response (e.g. blood pressure) obtained by using the equation. This is the line that best represents the association between the independent variable and the dependent variable.
The residuals are the differences between the observed and the predicted values: $\{(y_i - \hat{y}_i): i = 1, \dots, n\}$
Introduction to Linear Regression
[Figure: scatter plot with the best-fitting line, r = 0.76. The line minimizes the distance between predicted and actual values.]
Introduction to Linear Regression
$$\hat{y} = b_0 + b_1 x$$
$\hat{y}$ = predicted value of the response (outcome) variable
$b_0$ = constant: the intercept (the value of $\hat{y}$ when x = 0)
$b_1$ = constant: the slope of the regression line, i.e. the expected change in y for a one-unit change in x
$x_i$ = value of the independent (predictor) variable for subject i
Note: unlike the correlation coefficient, $b_1$ is unbounded.
SECTION 6.2 Least squares regression and predicted values
Learning Outcomes:
Describe the theoretical basis of least squares regression
Calculate and interpret predicted values from a linear regression model
Introduction to Linear Regression
$$\hat{y} = b_0 + b_1 x$$
In the above equation, the values of the slope ($b_1$) and intercept ($b_0$) define the line that best predicts Y from X. More precisely, the goal of regression is to minimize the sum of the squares of the vertical distances of the points from the line, i.e. minimize $\sum (y_i - \hat{y}_i)^2$. This is the method of “least squares” regression.
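To make the least squares idea concrete, the sketch below fits a line to a small hypothetical data set and compares its sum of squared residuals with that of a few nearby lines. The helper names `fit_line` and `sse` and the data values are assumptions for illustration only. The slope is computed from raw data as the sum of cross-products divided by the sum of squared x deviations, which is algebraically the same as r times s_y / s_x.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least squares intercept (b0) and slope (b1) from raw data."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sse(xs, ys, b0, b1):
    """Sum of squared vertical distances between the points and a line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical BMI (x) and systolic blood pressure (y) pairs
xs = [22.0, 24.5, 26.1, 28.3, 30.0, 33.4]
ys = [118, 125, 124, 135, 138, 149]

b0, b1 = fit_line(xs, ys)
best = sse(xs, ys, b0, b1)

# Any other line gives a larger sum of squared residuals
for shift_b0, shift_b1 in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    other = sse(xs, ys, b0 + shift_b0, b1 + shift_b1)
    print(round(best, 2), "<", round(other, 2))
```

Moving the intercept or the slope away from the fitted values in either direction increases the sum of squared residuals, which is what "best fitting" means here.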
Least squares estimates:
$$b_1 = r\,\frac{s_y}{s_x} \qquad b_0 = \bar{Y} - b_1 \bar{X}$$
Example: We wish to estimate total cholesterol level (y) from BMI (x).
Assume $r_{xy} = 0.78$; $\bar{Y} = 205.9$, $s_y = 30.8$; $\bar{X} = 27.4$, $s_x = 3.7$
$$b_1 = r\,\frac{s_y}{s_x} = 0.78 \times \frac{30.8}{3.7} = 6.49$$
$$b_0 = \bar{Y} - b_1 \bar{X} = 205.9 - 6.49(27.4) = 28.07$$
The equation of the regression line is: $\hat{y} = 28.07 + 6.49(\text{BMI})$
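The same calculation can be written as a small function that takes only summary statistics. This is a sketch, not part of the original slides: the function name `least_squares_from_summary` is hypothetical, and the slope is rounded to two decimals before the intercept is computed, on the assumption that the hand calculation carries the rounded slope forward.

```python
def least_squares_from_summary(r, x_bar, s_x, y_bar, s_y):
    """Slope and intercept from r, the two means, and the two SDs."""
    b1 = round(r * s_y / s_x, 2)        # slope: b1 = r * (s_y / s_x)
    b0 = round(y_bar - b1 * x_bar, 2)   # intercept: b0 = Y-bar - b1 * X-bar
    return b0, b1

# Total cholesterol (y) from BMI (x), using the summary values in the example
b0, b1 = least_squares_from_summary(r=0.78, x_bar=27.4, s_x=3.7,
                                    y_bar=205.9, s_y=30.8)
print(f"y-hat = {b0} + {b1}(BMI)")
```

With these inputs the slope is 6.49 and the intercept works out to 205.9 - 6.49(27.4) = 28.07.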
Least squares estimates: (Practice)
$$b_1 = r\,\frac{s_y}{s_x} \qquad b_0 = \bar{Y} - b_1 \bar{X}$$
Example: We wish to estimate systolic blood pressure (y) from BMI (x).
Assume $r_{xy} = 0.46$; $\bar{Y} = 133.8$, $s_y = 18.4$; $\bar{X} = 26.6$, $s_x = 3.5$
Calculate $b_1$ and $b_0$, and write the equation of the regression line.
Answer:
$$b_1 = r\,\frac{s_y}{s_x} = 0.46 \times \frac{18.4}{3.5} = 2.42$$
$$b_0 = \bar{Y} - b_1 \bar{X} = 133.8 - 2.42(26.6) = 69.43$$
The equation of the regression line is: $\hat{y} = 69.43 + 2.42(\text{BMI})$
Least squares estimates: (Practice)
The equation of the regression line is: $\hat{y} = 69.43 + 2.42(\text{BMI})$
Predict systolic blood pressure for the following 3 individuals:
Person 1 has BMI of 26.4
Person 2 has BMI of 28.9
Person 3 has BMI of 34.8
Answer:
$\hat{y}_1 = 69.43 + 2.42(26.4) = 133.3$
$\hat{y}_2 = 69.43 + 2.42(28.9) = 139.4$
$\hat{y}_3 = 69.43 + 2.42(34.8) = 153.6$
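The three predictions can be reproduced with a one-line function. This sketch assumes the regression line y-hat = 69.43 + 2.42(BMI), which follows from the practice summary statistics (b1 = 0.46 x 18.4 / 3.5 and b0 = 133.8 - b1 x 26.6). The function name `predict_sbp` is hypothetical.

```python
def predict_sbp(bmi, b0=69.43, b1=2.42):
    """Predicted systolic blood pressure from BMI using y-hat = b0 + b1 * BMI."""
    return b0 + b1 * bmi

# BMI values for the three individuals in the practice exercise
for person, bmi in enumerate([26.4, 28.9, 34.8], start=1):
    print(f"Person {person}: BMI {bmi} -> predicted SBP {predict_sbp(bmi):.1f}")
```

Each 1-unit increase in BMI raises the predicted systolic blood pressure by 2.42, so Person 3, whose BMI is 8.4 units higher than Person 1's, has a prediction about 20 units higher.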
SECTION 6.3 Assumptions and sources of variation in linear regression
Learning Outcomes:
Describe the assumptions required for valid use of the linear regression model
Describe the partitioning of sum of squares in the linear regression model
Introduction to Linear Regression
Some assumptions for linear regression:
The dependent variable Y has a linear relationship to the independent variable X. This includes checking whether the dependent variable is approximately normally distributed.
Independence of the errors (no serial correlation)
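[Figure: Y plotted against age with the fitted regression line; R = 0.597]
[Table and figure: ID, X, Y data with fitted line; R = 0.573]
[Table and figure: ID, X, Y data with log-transformed Y (LOG_Y)]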
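Fundamental Equations for Regression
Coefficient of determination ($r^2$): proportion of variation in Y “explained” by the regression on X
$$R^2 = \frac{\text{explained variation}}{\text{total variation}} = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
where SST is the total sum of squares, SSR is the regression sum of squares, and SSE is the error sum of squares.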
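The partition of the total variation can be checked numerically. The sketch below fits a least squares line to a small hypothetical data set (not course data) and computes the three sums of squares. The function name `sums_of_squares` is an assumption for illustration.

```python
from statistics import mean

def sums_of_squares(xs, ys):
    """Return (SST, SSR, SSE) for the least squares line of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    preds = [b0 + b1 * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)               # total: observed vs mean
    ssr = sum((p - y_bar) ** 2 for p in preds)            # regression: predicted vs mean
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))    # error: observed vs predicted
    return sst, ssr, sse

# Hypothetical data
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 5, 4, 8, 7, 10]
sst, ssr, sse = sums_of_squares(xs, ys)
r_squared = ssr / sst
print(round(sst, 2), round(ssr, 2), round(sse, 2), round(r_squared, 2))
```

For a least squares fit, SST equals SSR plus SSE, so SSR / SST and 1 - SSE / SST give the same value of R-squared.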
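Example: Fundamental Equations for Regression
[Table: ID, Y, and X for each subject, with summary rows N, Mean, SD, and Sum. Fitted line $\hat{y} = b_0 + b_1 x$; r = 0.42]
[Table: the same data extended with the predicted value $\hat{Y}$ and, for each subject, the total (T) deviation $(Y_i - \bar{Y})$, the regression (R) deviation $(\hat{Y}_i - \bar{Y})$, and the error (E) deviation $(Y_i - \hat{Y}_i)$, each with its square.]
SST = 132, $df_T$ = 11
SSR = 23, $df_R$ = 1
SSE = 109, $df_E$ = 10
R = 0.42
$$R^2 = \frac{SSR}{SST} = 0.18$$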
Practice: Fundamental Equations for Regression
Complete the entries in the table below to determine SST, SSR, SSE, and $R^2$.
[Table: ID, Y, X, and $\hat{Y}$ for each subject, with the total (T), regression (R), and error (E) deviations and their squares, and summary rows N, Mean, and Sum.]
Answer:
SST = 80.5, $df_T$ = 9
SSR = 19.1, $df_R$ = 1
SSE = 61.4, $df_E$ = 8
$$R^2 = \frac{SSR}{SST} = 0.24$$
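As a quick arithmetic check on the practice answers, the sums of squares and the degrees of freedom should each add up, and $R^2$ follows from the ratio:

```latex
\begin{align*}
SST  &= SSR + SSE = 19.1 + 61.4 = 80.5 \\
df_T &= df_R + df_E = 1 + 8 = 9 \\
R^2  &= \frac{SSR}{SST} = \frac{19.1}{80.5} \approx 0.24
\end{align*}
```

Assuming the usual convention that $df_T = n - 1$, a total of 9 degrees of freedom corresponds to 10 subjects.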
SECTION 6.4 Multiple linear regression model
Learning Outcome: Calculate and interpret predicted values from the multiple regression model
Multiple Linear Regression
Extension of simple linear regression to assess the association between 2 or more independent variables and a single continuous dependent variable.
The multiple linear regression equation is:
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$$
Each regression coefficient represents the change in y associated with a one-unit change in the respective independent variable, holding the remaining independent variables constant.
The $R^2$ from the multiple linear regression model represents the proportion (often expressed as a percentage) of variation in the dependent variable “explained” by the set of predictors.
Multiple Linear Regression
Example: Predictors of systolic blood pressure
Regression table (columns: Independent Variable, Regression Coefficient, t, p-value), with one row for each of:
Intercept
BMI (per 1 unit)
Age (in years)
Male gender
Treatment for hypertension
$$\hat{y} = b_0 + b_1(\text{BMI}) + b_2(\text{age}) + b_3(\text{male}) + b_4(\text{tx-hypertension})$$
Practice: Estimate systolic blood pressure for the following persons:
Regression table (columns: Independent Variable, Regression Coefficient, t, p-value), with one row for each of:
Intercept
BMI (per 1 unit)
Age (in years)
Male gender (1=yes)
Treatment for hypertension (1=yes)
Person 1: BMI=27.9; age=54; female; on treatment for hypertension
Person 2: BMI=34.9; age=66; male; on treatment for hypertension
Person 3: BMI=24.8; age=47; female; not on treatment for hypertension
Answer, substituting each person's values into $\hat{y} = b_0 + b_1(\text{BMI}) + b_2(\text{age}) + b_3(\text{male}) + b_4(\text{tx-hypertension})$:
$\hat{y}_1 = b_0 + b_1(27.9) + b_2(54) + b_3(0) + b_4(1)$
$\hat{y}_2 = b_0 + b_1(34.9) + b_2(66) + b_3(1) + b_4(1)$
$\hat{y}_3 = b_0 + b_1(24.8) + b_2(47) + b_3(0) + b_4(0) = 113.1$
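The substitution step can be sketched as a function with one argument per predictor. The coefficient values below are assumptions for illustration only: the numeric table entries are not reproduced in this text, and these values were chosen so that Person 3's prediction comes out near the 113.1 shown above. The function name `predict_sbp` and the constant `COEF` are hypothetical.

```python
# Assumed (illustrative) coefficients, not taken from the regression table
COEF = {"intercept": 68.15, "bmi": 0.58, "age": 0.65, "male": 0.94, "treated": 6.44}

def predict_sbp(bmi, age, male, treated, coef=COEF):
    """Predicted systolic blood pressure; male and treated are coded 1=yes, 0=no."""
    return (coef["intercept"]
            + coef["bmi"] * bmi
            + coef["age"] * age
            + coef["male"] * male
            + coef["treated"] * treated)

persons = [
    ("Person 1", dict(bmi=27.9, age=54, male=0, treated=1)),
    ("Person 2", dict(bmi=34.9, age=66, male=1, treated=1)),
    ("Person 3", dict(bmi=24.8, age=47, male=0, treated=0)),
]
for label, values in persons:
    print(label, round(predict_sbp(**values), 1))
```

Because the indicator variables are coded 0 or 1, the treatment coefficient is simply the difference in predicted blood pressure between a treated and an untreated person with the same BMI, age, and gender.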
Framingham Risk Calculation (10-Year Risk):
Dependent Variable: 10-year risk of CVD
Independent Variables: age, gender, total cholesterol, HDL cholesterol, smoker, systolic BP, on medication for BP
SECTION 6.5 SPSS for linear regression analysis
Learning Outcome: Analyze and interpret linear regression models using SPSS
SPSS
Analyze > Regression > Linear
Dependent Variable
Independent Variable(s)
Statistics
---Estimates
---Confidence intervals
---Model fit
---Partial correlations
---Descriptives
Example:
Dependent variable: HDL Cholesterol
Independent variable: BMI
Fitted equation from the SPSS output, where $b_0$ is the intercept reported in the output:
$$\hat{y} = b_0 - 0.442(\text{BMI})$$
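SPSS
Analyze > Regression > Linear
Dependent Variable
Independent Variable(s)
Statistics
---Estimates
---Confidence intervals
---Model fit
---Partial correlations
---Descriptives
Example:
Dependent variable: HDL Cholesterol
Independent variable(s): BMI, gender (1=male, 2=female)
Fitted equation from the SPSS output, where $b_0$ and $b_2$ are the intercept and the gender coefficient reported in the output:
$$\hat{y} = b_0 - 0.481(\text{BMI}) + b_2(\text{female})$$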
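SPSS
Analyze > Regression > Linear
Dependent Variable
Independent Variable(s)
Statistics
---Estimates
---Confidence intervals
---Model fit
---Partial correlations
---Descriptives
Example:
Dependent variable: HDL Cholesterol
Independent variable(s): BMI, gender, age
Fitted equation from the SPSS output, where $b_0$, $b_2$, and $b_3$ are the intercept, gender coefficient, and age coefficient reported in the output:
$$\hat{y} = b_0 - 0.464(\text{BMI}) + b_2(\text{female}) + b_3(\text{age})$$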
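Each coefficient is read holding the other predictors constant. As a worked illustration using only the BMI coefficient from this model, consider two people of the same gender and age whose BMI differs by 5 units (a hypothetical difference chosen for illustration):

```latex
\Delta\hat{y} = -0.464 \times (\text{BMI}_2 - \text{BMI}_1) = -0.464 \times 5 = -2.32
```

The predicted HDL cholesterol is therefore about 2.3 units lower for the person with the higher BMI, whatever the values of the gender and age terms.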
Practice: Estimate HDL cholesterol levels for the following persons:
Person 1: BMI=25.7; female; age=60
Person 2: BMI=36.9; male; age=66
Person 3: BMI=31.8; female; age=51
Answer, substituting into $\hat{y} = b_0 - 0.464(\text{BMI}) + b_2(\text{female}) + b_3(\text{age})$ with female entered as 1 and male as 0:
$\hat{y}_1 = b_0 - 0.464(25.7) + b_2(1) + b_3(60) = 51.8$
$\hat{y}_2 = b_0 - 0.464(36.9) + b_2(0) + b_3(66) = 36.9$
$\hat{y}_3 = b_0 - 0.464(31.8) + b_2(1) + b_3(51) = 47.5$