Statistical Inference
Correlation & Simple Linear Regression April 17, 2018
Correlation analysis*
- Correlation analysis measures the degree of association between two continuous variables, x and y.
- We have a linear relationship between x and y if a straight line drawn through the midst of the points provides the most appropriate approximation to the observed relationship.
- We measure how close the observations are to the straight line that best describes their linear relationship by calculating the Pearson product moment correlation coefficient, usually simply called the correlation coefficient.

*The following slides were adapted from Prof. Trinquart's R Course (BS730)
Example
Pearson product moment correlation coefficient

$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}} $$

r is the sample estimate of the population correlation coefficient, ρ.
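As an illustration, r can be computed directly from this formula in R; a minimal sketch, assuming hypothetical vectors x and y (not from the slides):

x <- c(2, 4, 6, 8, 10)            # hypothetical example data
y <- c(1.9, 4.4, 5.8, 8.3, 9.6)

# Pearson's r from the definition
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_manual        # matches R's built-in result below
cor(x, y)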
Pearson product moment correlation coefficient
- r lies between -1 and +1 and indicates the direction and strength of the linear relationship.
- r > 0: one variable increases as the other variable increases.
- r < 0: one variable decreases as the other increases.
- r = 0: no linear correlation (although there may be a non-linear relationship).
- r = +1 or -1: perfect linear correlation, with all the points lying on the line.
- The closer r is to the extremes, the greater the degree of linear association.
Rule of thumb
- r ≥ 0.9: very highly correlated variables (r² ≥ 81%)
- 0.7 ≤ r < 0.9: highly correlated (49% ≤ r² < 81%)
- 0.5 ≤ r < 0.7: moderately correlated (25% ≤ r² < 49%)
- 0.3 ≤ r < 0.5: low correlation (9% ≤ r² < 25%)
- r < 0.3: little if any (linear) correlation (r² < 9%)
Pearson product moment correlation coefficient
- r is valid only within the range of values of x and y in the sample.
- x and y can be interchanged without affecting the value of r.
- (Linear) correlation does not imply a cause/effect relationship.
- r² is the proportion of variability in y that can be explained by its linear relationship with x (coefficient of determination).
When not to calculate Pearson's r
- when there is a non-linear relationship (e.g. U-shaped and J-shaped relationships)
- when there are outliers
- when there are subgroups for which the mean of at least one of the variables differs
Pearson Correlation Assumptions
- Observations are independent.
- The association is linear.
- Variables are approximately normally distributed.
Statistical significance ≠ Clinical relevance
The significance of a given correlation coefficient is a function of sample size; i.e., even a low correlation can be statistically significant if the sample size is large enough.
Hypothesis Testing
H0: There is no correlation between the two variables (ρ = 0)
H1: There is a correlation between the two variables (ρ ≠ 0)
Pearson correlation test statistic
The Student's t distribution is used:

$$ t = \frac{r}{se(r)}, \qquad se(r) = \sqrt{\frac{1 - r^2}{n - 2}} $$

Note that the standard error is inversely related to n: a larger sample size corresponds to a smaller se(r). If the population correlation is zero and assuming x and y follow normal distributions, the test statistic has a Student's t distribution with n-2 degrees of freedom.
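A minimal R sketch of this test statistic, reusing the hypothetical x and y from the earlier sketch; cor.test() reports the same t, df, and p-value:

n <- length(x)
r <- cor(x, y)
se_r <- sqrt((1 - r^2) / (n - 2))          # standard error of r
t_stat <- r / se_r                         # t statistic with n-2 df
p_val <- 2 * pt(-abs(t_stat), df = n - 2)  # two-sided p-value
c(t = t_stat, p = p_val)
cor.test(x, y)                             # same t, df, and p-value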
Confidence Interval for ρ
Since r is an estimate of a parameter, we can calculate a confidence interval for the population correlation coefficient, ρ. It is based on Fisher's z transformation,

$$ z = \tfrac{1}{2}\ln\!\left(\frac{1+r}{1-r}\right), $$

which is approximately normally distributed with standard error $1/\sqrt{n-3}$; the interval is computed on the z scale and then back-transformed to the r scale. Because the transformation is a non-linear function of r, the confidence interval is not symmetric around r.
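A sketch of this interval in R, continuing the same hypothetical example:

z <- 0.5 * log((1 + r) / (1 - r))   # Fisher's z transformation (= atanh(r))
se_z <- 1 / sqrt(n - 3)             # approximate standard error on the z scale
ci_z <- z + c(-1, 1) * qnorm(0.975) * se_z
tanh(ci_z)                          # back-transform to the r scale
cor.test(x, y)$conf.int             # cor.test() uses the same method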
In R
Begin with a scatter plot!

plot(x, y, xlab = "x-label", ylab = "y-label")
abline(lm(y ~ x))
cor(x, y, method = "pearson")
cor.test(x, y, method = "pearson")

The default method is "pearson", so you may omit the method option.
Example
Example: Reporting Results
H0: There is no linear association between age and SBP (ρ = 0)
H1: There is a linear association between age and SBP (ρ ≠ 0)
Pearson's correlation was used to determine if there was a linear association between age and systolic blood pressure (SBP). There is significant evidence at α = 0.05 of a moderate, positive, linear association between age and SBP, r = 0.65 with a 95% C.I. of 0.09 to 0.90, p-value =
Simple Linear Regression
Background
Even as more and more sophisticated statistical procedures are developed, linear regression remains a fundamental element of the quantitative health sciences.
- The concepts of regression and correlation were invented by Francis Galton (b. 1822, d. 1911), who studied the relationship between parental and child height.
- A contemporary paper in the European Journal of Human Genetics showed that modern methods improved little on the original predictive model.
Source: Predicting Height: The Victorian Approach Beats Modern Genomics
Modern Example from Simply Statistics
"Data supports claim that if Kobe stops ball hogging the Lakers will win more"
"Linear regression suggests that an increase of 1% in % of shots taken by Kobe results in a drop of 1.16 points (+/- 0.22) in score differential."
- How do you interpret linear regression results?
- Showing standard errors is good practice.
- Do you agree with the analysis? Is it complete?
Source:
Questions we might ask with linear regression
- To use the parents' heights to predict child heights.
- To try to find a parsimonious, easily described mean relationship between parents' and children's heights.
- To investigate the variation in child heights that appears unrelated to parents' heights (residual variation).
- To quantify what impact genotype information has beyond parental height in explaining child height.
- To figure out how/whether and what assumptions are needed to generalize findings beyond the data in question.
- Why do children of very tall parents tend to be tall but a little shorter than their parents, and why do children of very short parents tend to be short but a little taller than their parents? (This is a famous question called "regression to the mean".)

Notes:
- Parsimonious: the simplest possible relationship, in contrast to machine learning; machine learning produces highly accurate models but does not generate new parsimonious knowledge.
- Residual variation: unexplained variation; what residual variation is left after using parental height.
- Inference: how do we take the data, which is just a sample, and figure out what assumptions are needed to extrapolate to a larger population? This is the subject of statistical inference; we will apply the tools of inference to regression.
- Regression to the mean (also called regression to mediocrity) was studied by Galton. Now let's look at Galton's data.
Linear Regression
If we believe y is dependent on x, with a change in y being attributed to a change in x rather than the other way round, we can determine the linear regression line (the regression of y on x) that best describes the straight-line relationship between the two variables.
Model
Find the equation of the line that best fits the data. The generic equation of the line relating y to x is of the form:

$$ y = \beta_0 + \beta_1 x + \varepsilon $$

In the sample, the equation of the line of "best fit" is written as:

$$ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x $$

NOTE: The actual data will NEVER fall exactly on this "best fit" line. That's why the model contains an error term, ε. The error term contains the influences of other factors not captured by the model.
- For a given value of x, $\hat{y}$ is the value of y which lies on the estimated line. It is the value we expect for y (i.e. its average) if we know the value of x, and is called the fitted value of y.
- $\hat{\beta}_0$ is the intercept; it is the value of y when x = 0.
- $\hat{\beta}_1$ is the slope; it represents the amount by which y increases on average if we increase x by one unit.
Residual
Residual = observed - predicted:

$$ e_i = y_i - \hat{y}_i $$
Method of least squares
The "best fit" line is the line that minimizes the sum of the squared residuals. We want to minimize:

$$ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 $$

For linear regressions with only one independent variable, X, this yields the following equations:

$$ \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} $$
Example: Model for SBP and Age
For the previous example using SBP and age:
Example: best fit and predicted value
The "best fit" line for these data is:
Predicted SBP for a person who is 24 years old:
There was an observed SBP value of 116 at age = 24. The residual at age = 24 is
The range of age in the given data was
Therefore, we should not use the above model to predict the SBP for a 65-year-old person.
Correlation and coefficient of simple linear regression
In a simple linear regression model, the sample correlation coefficient, r, and the estimated slope are related:

$$ r = \hat{\beta}_1 \frac{s_x}{s_y} $$
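This relationship can be verified directly in R (same hypothetical data):

b1 <- coef(lm(y ~ x))[["x"]]   # estimated slope
b1 * sd(x) / sd(y)             # equals cor(x, y)
cor(x, y)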
In R
The lm() function: the first argument of lm() is a formula object, with the outcome specified followed by the ~ operator and then the predictors.

mod1 <- lm(y ~ x, data = ds)
summary(mod1)
summary.aov(mod1)
Example: SBP and age
Example: ANOVA table (sums of squares)
Example: Sum of Squares
- Model SS = sum of squared differences between the y values predicted by the model and the overall average.
- Error SS = sum of squared differences between the observed y values and the y values predicted by the model.
- Total SS = sum of squared differences between the observed y values and the overall average.
The better the model fits, the larger the Model SS and the smaller the Error SS.
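A minimal sketch of these quantities in R, assuming a model fitted to the hypothetical x and y; it also verifies that Total SS = Model SS + Error SS:

mod <- lm(y ~ x)
ss_total <- sum((y - mean(y))^2)              # Total SS
ss_error <- sum(residuals(mod)^2)             # Error SS
ss_model <- sum((fitted(mod) - mean(y))^2)    # Model SS
all.equal(ss_total, ss_model + ss_error)      # TRUE: the SS decompose
ss_model / ss_total                           # R-squared (see the slides below)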
F test
F value: the test statistic for the overall model.
In a situation with only a single X variable, the F-test is equivalent to the t-test for the null hypothesis that β1 = 0.
p-value of the goodness of the fit
Pr(>F): the p-value for the above F test.
R²
R² is the proportion of variability in y explained by the independent variable(s): R² = Model SS / Total SS = 0.427 in this example. R² takes values between 0 and 1 and measures the goodness of fit of the model. Values close to 1 are indicative of a good fit.
R²
When there is only one independent variable, x:
- R² is the proportion of the variability in y that can be explained by x.
- R² is equal to the square of the Pearson correlation between x and y.
Parameter Estimates
The estimated coefficients are $\hat{\beta}_0$ and $\hat{\beta}_1$. Of particular interest is $\hat{\beta}_1$, the slope of the linear regression.
Interpretation: on average, every one-unit change in X corresponds to a $\hat{\beta}_1$-unit change in Y.
Confidence interval

model1 <- lm(sbp ~ age, data = ds)
confint(model1)
Example: Confidence Intervals for β0 and β1
In this example there are n - 2 = 9 degrees of freedom, and for α = 0.05 the critical t is 2.262. The 95% confidence interval for β1 is:
0.73 ± (2.262)(0.28)
0.73 ± 0.63
[0.09, 1.37]
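A sketch of this hand computation in R; confint(model1) performs the same calculation from the fitted model (0.73 and 0.28 are the slope estimate and its standard error from the slides):

est <- 0.73                           # estimated slope from the slides
se  <- 0.28                           # its standard error
df  <- 9                              # n - 2 degrees of freedom
t_crit <- qt(0.975, df)               # critical t = 2.262
est + c(-1, 1) * t_crit * se          # 95% CI: approximately [0.09, 1.37]
# confint(model1) returns the same interval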
Example: Testing of parameters
t-tests of H0: β0 = 0 and H0: β1 = 0. Prob > |t| are the p-values for the t-tests.
Example: testing the parameter
Another way to view the test of β1 = 0 is that it tells us whether the model provides a significant improvement over the information about y provided only by the sample mean. Notice that, in the case of one predictor, F is the square of t.
Example: Reporting Results
The null hypothesis is H0: β1 = 0, i.e. "the slope of the model is zero" or "there is no linear association between X and Y". Be sure to interpret the estimated regression coefficient of interest ($\hat{\beta}_1$).
Example: Reporting Results – Technical Summary
H0: There is no linear association between SBP and age among women ages
H1: There is a linear association between SBP and age among women ages
OR: H0: βage = 0 vs H1: βage ≠ 0
Example: Reporting Results – Technical Summary
Conclusion of hypothesis test: Reject H0. There is significant evidence at α = 0.05 of a linear association between SBP and age among women ages (t = 2.59, df = 9).
Sample Write-up
Methods: To examine the association between SBP (mmHg) and age (years), we performed a linear regression analysis. We used a 0.05 level of significance.
Results: In the linear regression analysis of the association of SBP and age, we found that there was a significant, positive linear association, with an estimated slope of 0.73 (t = 2.59, df = 9; p = ). On average, a one-year increase in age corresponds to an increase of 0.73 mmHg in SBP. The R² for this model was 0.43 (age accounts for 43% of the variability of SBP in these data).
Express results for a continuous independent variable
On average, a one-year increase in age corresponds to a 0.73 mmHg increase in SBP. It is better to report the effect using:
- a clinically meaningful increment Δ (e.g. a 10-year increase in age), if known, or
- an increase of one standard deviation (SD).
On average, a Δ-year increase in age corresponds to a Δ·$\hat{\beta}_1$ mmHg increase in SBP.
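For instance, the rescaled effect is just a multiple of the estimated coefficient; a sketch in R, assuming the model1 fit of SBP on age and a data frame ds with an age column (hypothetical names):

b_age <- coef(model1)[["age"]]   # 0.73 mmHg per one-year increase
10 * b_age                       # effect of a 10-year increase: 7.3 mmHg
sd(ds$age) * b_age               # effect of a one-SD increase in age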
Deliverables
- Due 4/22
- Due 4/29
- Due 5/8: Complete draft
- Final Research Paper: due 5/8
- Final presentation