MASH R workshop 4.

1 MASH R workshop 4

2 In this session you will learn:
Correlation coefficients
Simple linear regression
Multiple linear regression
Binary logistic regression
Multinomial regression
Ordinal regression
Poisson regression / negative binomial regression

3 Correlation coefficients: Pearson & Spearman’s rho
A correlation coefficient measures how correlated two scale/continuous variables are. If both are normally distributed, we use Pearson's correlation coefficient. If either of the two is skewed, we use Spearman's rho instead.

4 Correlation coefficients: Pearson & Spearman’s rho
A correlation coefficient is a number between -1 and 1: 1 means perfect positive correlation, i.e. as x increases, y increases. -1 means perfect negative correlation, i.e. as x increases, y decreases. 0 means no correlation at all between x and y!

5 Correlation coefficients: Pearson & Spearman’s rho
There are levels of small, medium, strong and very strong positive/negative correlation: the scale runs from very strong negative, through strong, medium and small negative, past zero, to small, medium, strong and very strong positive. We can also say that x has a small, medium, strong or very strong positive/negative effect on y.

6 Correlation coefficients: Pearson & Spearman’s rho
Download the Birthweight data set for R (.csv format) from the website and store it in the correct working directory, then open the .csv file:

7 Correlation coefficients: Pearson & Spearman’s rho
Remember to attach the data! It is easier to call the variables by their names, e.g. “motherage”, rather than using “dataset$motherage”.
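A minimal sketch of loading and attaching the data; the filename "Birthweight.csv" and the column names used throughout are assumptions based on the slides and may differ in your copy:

```r
# Assuming the file is saved as "Birthweight.csv" in the working directory
dataset <- read.csv("Birthweight.csv")
attach(dataset)   # lets us write motherage instead of dataset$motherage
head(dataset)     # quick check that the file was read correctly
```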

8 Correlation coefficients: Pearson & Spearman’s rho
Research Question: Is there a correlation between the weight of babies at birth and their mothers' height? Histograms of mother height and Birthweight look symmetrical, and the Q-Q plots are consistent with a normal distribution for both variables.
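The visual checks above might be produced like this (column names "Birthweight" and "mheight" are assumed from the slides):

```r
# Histograms and Q-Q plots to assess normality visually
par(mfrow = c(2, 2))
hist(Birthweight, main = "Birthweight")
hist(mheight, main = "Mother height")
qqnorm(Birthweight); qqline(Birthweight)
qqnorm(mheight); qqline(mheight)
par(mfrow = c(1, 1))
```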

9 Correlation coefficients: Pearson & Spearman’s rho
None of these tests rejects the null hypothesis of normally distributed data (p-value > 0.05), so we can use a Pearson correlation coefficient ("pearson" must be specified in the call). There is a medium positive correlation between Birthweight and mother height (mheight).
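A sketch of the tests behind this slide, assuming the attached column names:

```r
# Formal normality tests (null hypothesis: normally distributed)
shapiro.test(Birthweight)
shapiro.test(mheight)

# Pearson's correlation coefficient and its significance test
cor.test(Birthweight, mheight, method = "pearson")
```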

10 Correlation coefficients: Pearson & Spearman’s rho
The test associated with Pearson's correlation coefficient evaluates whether we reject the null hypothesis "no correlation between our two variables". The p-value of the test is less than 0.05, so we reject the null hypothesis and conclude that there is a significant correlation between Birthweight and mheight.

11 Correlation coefficients: Pearson & Spearman’s rho
Scatter plot between Birthweight and mother height: the function abline() adds the line of best fit. The so-called line of best fit in fact comes from the equation obtained by a simple linear regression of Birthweight on mother height.
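The scatter plot and line of best fit might be drawn as follows (column names assumed as before):

```r
plot(mheight, Birthweight,
     xlab = "Mother height", ylab = "Birthweight")
abline(lm(Birthweight ~ mheight))  # line of best fit from a simple linear regression
```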

12 Correlation coefficients: Pearson & Spearman’s rho
Research Question: Does the age of the mothers depend on the age of the fathers? Histograms of father age and mother age look skewed to the left, and the Q-Q plots do not suggest normally distributed data.

13 Correlation coefficients: Pearson & Spearman’s rho
Father age and mother age both reject the null hypothesis of normal distribution. We can then conclude that they are NOT normally distributed and use a SPEARMAN'S RHO coefficient ("spearman" must be specified in the call). There is a very strong positive correlation between mother and father age.
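A sketch of the corresponding call, assuming the columns are named "motherage" and "fage" as on the earlier slides:

```r
# Spearman's rho: note method = "spearman", not the default "pearson"
cor.test(motherage, fage, method = "spearman")
```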

14 Correlation coefficients: Pearson & Spearman’s rho
The test associated with Spearman's rho evaluates whether we reject the null hypothesis "no correlation between our two variables". The p-value of the test is less than 0.001, so we very strongly reject the null hypothesis and conclude that there is a significant correlation between mother age and father age at the birth of their child.


16 Simple linear regression
When calculating the correlation coefficients, we drew the line of best fit on our scatterplot. This line of best fit is defined by an equation: Y = aX + b, where Y is the predicted (dependent) variable, e.g. motherage, and X is the predictor (independent) variable, e.g. fage. That equation is obtained by fitting a simple linear regression of Y on X. In a simple linear regression, we estimate two parameters: the slope "a" and the intercept "b". We test whether the slope "a" is equal to 0; if we reject this null hypothesis, we conclude that there is a statistically significant impact of X on Y. This does much the same job as the correlation coefficient, but the idea carries over to all the other regressions.

17 Simple linear regression
In the Birthweight data set, we will fit a simple linear regression between Birthweight (Y) and Gestation (X). We use the function "lm", as in Linear Model. In the formula, we always write: Predicted ~ Predictor(s), i.e. Y ~ X.

18 Simple linear regression
We will first of all store the output of the regression in "simreg": "simreg" contains the fitted values of the intercept b and the slope a.
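A sketch of the fit, using the object name "simreg" from the slide (column names assumed as before):

```r
simreg <- lm(Birthweight ~ Gestation)  # Predicted ~ Predictor
coef(simreg)                           # returns the intercept b and the slope a
```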

19 Simple linear regression
The equation of the fitted line reads: Birthweight = intercept + slope × Gestation.


21 Simple linear regression
By using the function summary(), we have access to the statistics of the linear regression. The p-value indicates that we strongly reject the null hypothesis "a = 0", i.e. "Gestation does not affect Birthweight". The R-squared is the proportion of the variance in Birthweight explained by the model; it lies between 0 and 1, and the closer to 1, the better.
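Assuming the model was stored as "simreg" as above:

```r
# Coefficient table (with the test of a = 0) and R-squared
summary(simreg)
```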

22 Simple linear regression
Assumptions: 1) The residuals of the regression should be approximately normally distributed (it is not necessary for the variables themselves to be normally distributed). 2) There should be homoscedasticity, i.e. constant variance of the residuals. 3) There should be no outliers among the residuals.

23 Simple linear regression
Assumption 1: the residuals of the regression should be normally distributed. What does that mean? A simple linear regression is a model: based on a sample of data points, we estimated the slope and the intercept of the equation of the line, because in theory we want to predict future Birthweights from other Gestation values. In reality, each point (Gestation, Birthweight) does not lie exactly on the line of best fit. The distance from each point to the line of best fit is called a "residual"; if you have 50 points, you have 50 residuals. By plotting the histogram of those residuals, we can see whether they look normally distributed.

24 Simple linear regression
42 points in total, so 42 distances from the line: 42 residuals. Each point represents one individual; the blue line here represents one of the residuals.

25 Simple linear regression
The 42 residuals look normally distributed. You can also confirm this with a test of normality:
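A sketch of that check, assuming the model object "simreg" from earlier:

```r
res <- resid(simreg)
hist(res, main = "Residuals")  # should look roughly bell-shaped
shapiro.test(res)              # p-value > 0.05 supports normality
```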

26 Simple linear regression
Assumption 2: homoscedasticity. What does that mean? It means that the residuals should keep the same variance throughout. It is violated if, for example, part of the residuals lie very close to 0 and another part lie far away from 0. Heteroscedasticity is the opposite of homoscedasticity.

27 Simple linear regression
The points are equally spread around the 0 line, so we can consider that we have homoscedasticity.
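A sketch of the plot behind this slide, again assuming the model object "simreg":

```r
# Residuals vs fitted values: look for an even spread around 0
plot(fitted(simreg), resid(simreg),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0)
```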

28 Simple linear regression
There is no outlier in the residuals. All assumptions are checked!

29 MULTIPLE Linear regression
Basically, the same as simple linear regression, but with more than one predictor: Y = a1X1 + a2X2 + a3X3 + b, where X1, X2 and X3 are three predictors of Y. When analysing the predictor X1, we analyse its effect on Y while controlling for the other predictors X2 and X3.

30 MULTIPLE Linear regression
Assumptions: 1) Predicted variable: scale/continuous. 2) Predictor variables: scale or categorical. 3) The residuals should be approximately normally distributed. 4) The residuals should show homoscedasticity. 5) No outliers in the residuals.

31 MULTIPLE Linear regression
Research Question: Which variables can predict Birthweight? We will use as predictors: Gestation, Smoker, Mother age, Father age and Mother height.
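A sketch of the fit; the column names ("smoker", "motherage", "fage", "mheight") and the object name "mulreg" are assumptions based on the slides:

```r
mulreg <- lm(Birthweight ~ Gestation + smoker + motherage + fage + mheight)
summary(mulreg)  # one test per predictor, plus R-squared
```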

32 MULTIPLE Linear regression
For each predictor, we have one test that evaluates whether it has a significant impact on the predicted variable (Birthweight). We usually do not look at the intercept test. The R-squared is quite large (a good model).

33 MULTIPLE Linear regression
Only three predictors show a significant p-value (<0.05): Gestation, smoker and mheight have a significant impact on Birthweight.

34 MULTIPLE Linear regression
The residuals of this regression can be taken as normally distributed. Assumption 3 checked!

35 MULTIPLE Linear regression
The residuals more or less show homoscedasticity, which is acceptable.

36 MULTIPLE Linear regression
No outlier in the residuals. Assumption 5 checked!

37 Binary/binomial Logistic regression
The binomial logistic regression is a multiple regression in which the outcome variable (the predicted variable) is binary. A binary variable takes only 2 values: "0" or "1"; "Yes" or "No"; "Male" or "Female"; "Survived" or "Did not survive".

38 Binary/binomial Logistic regression
Research Question: We will investigate which variables predict “survival” status in the Titanic disaster. The variables used will be “Gender”, “Class”, “Age”.

39 Binary/binomial Logistic regression
Assumptions: The independent variables can be continuous, or categorical (nominal or ordinal).

40 Binary/binomial Logistic regression
From the MASH website, download the Titanic data set in .csv format. The variable “pclass” does not read well. We can change the name.

41 Binary/binomial Logistic regression
We need to set the factor variables with their real names instead of numbers. We can keep the variable "pclass" as ordinal: class 1, 2 and 3. Survived, Gender and Residence are nominal (not ordinal) variables. We need to add the factor variables to the data frame before running the regression.
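A sketch of creating those factors; the data frame name "Titanic" and the level codings below are assumptions (only Gender's 0 = Male, 1 = Female coding is stated later in the slides), so check them against your copy of the data:

```r
Titanic <- read.csv("Titanic.csv")
Titanic$Survived.f  <- factor(Titanic$survived, levels = c(0, 1),
                              labels = c("Died", "Survived"))
Titanic$Gender.f    <- factor(Titanic$Gender, levels = c(0, 1),
                              labels = c("Male", "Female"))
Titanic$Residence.f <- factor(Titanic$Residence,
                              labels = c("American", "British", "Other"))
```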

42 Binary/binomial Logistic regression
We need to use the nominal variables whose values are words, not numbers: Survived.f, Gender.f and Residence.f, NOT survived, Gender and Residence!
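A sketch of the regression itself; the object name "logreg" is an assumption, and "pclass" is kept numeric so it is treated as ordinal, as on the next slide:

```r
logreg <- glm(Survived.f ~ Gender.f + Residence.f + pclass + Age,
              data = Titanic, family = binomial)
summary(logreg)
```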

43 Binary/binomial Logistic regression
- We never look at the intercept p-value.
- Gender is very strongly associated with survival (p-value < 0.001).
- Residence: the reference group is American, and the coefficient for British is negative (-0.52), so you are more likely to die if you are British than if you are American.
- Residence: if you are from another country (not British), you are more likely to die than an American, but this result is not significant (p = 0.07 > 0.05).
- Class is strongly significant (p-value < 0.001). The coefficient is negative (-1.04), so as the class value increases you move towards survived = 0, i.e. you are more likely to die: class 3 is more likely to die than class 2. Here we treated class as an ordinal variable!
- Age is strongly significant (p < 0.001) and the coefficient is negative (-0.03), so as you get older your chance of survival decreases (moving towards the value 0).

44 Binary/binomial Logistic regression
Odds ratios: if you increase "Gender.f" by one unit, i.e. go from Male (0) to Female (1), then your odds of survival are multiplied by the corresponding odds ratio. If you increase "pclass" by one unit, e.g. go from class 2 to class 3, then your odds of survival are multiplied by 0.35. Since this number is less than 1, you are more likely to die: the odds of dying are 1/0.35 = 2.86 times higher!
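Assuming the fitted model was stored as "logreg", the odds ratios are the exponentiated coefficients:

```r
exp(coef(logreg))  # odds ratios for each predictor
```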

45 Binary/binomial Logistic regression
95% confidence intervals of the odds ratio for a given predictor: if the confidence interval of an odds ratio contains 1, then that predictor does not have a significant impact on the outcome (survival). This is equivalent to a p-value > 0.05. Remember: this predictor (Residence: a country other than Britain) was not significant. Here the confidence interval contains 1, so coming from another country has no significant impact on survival compared with the Americans (reference group).
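Again assuming the model object "logreg":

```r
# 95% confidence intervals on the odds-ratio scale
# (confint() on a glm uses profile likelihood and may load MASS)
exp(confint(logreg))
```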

46 Multinomial Logistic regression
The multinomial logistic regression is a multiple regression in which the outcome variable (the predicted variable) is nominal with three or more levels: "0", "1" or "2"; "Yes", "No" or "Do not know"; "Asian", "European", "American" or "African"; "Blue", "Red", "Yellow" or "Orange".

47 Multinomial Logistic regression
Research Question: We will investigate which variables predict favourite ice cream flavour. The variables used will be gender ("female"), video game score ("video") and puzzle game score ("puzzle").

48 Multinomial Logistic regression
From the MASH website, download the ice_cream.csv file and paste it into the current directory where RStudio is working. Don't forget to attach the data set!

49 Multinomial Logistic regression
We need to define the categories of the variables "ice_cream" and "female" so that we are not lost; otherwise we only see numbers and do not know what they really mean. For ice_cream, 1 means Vanilla, 2 means Chocolate and 3 means Strawberry; for gender, 0 means female and 1 means male.

50 Multinomial Logistic regression
We will use the variables "ice_cream.f" and "female.f" instead of "ice_cream" and "female" from now on.
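A sketch of the regression, using multinom() from the nnet package; the object name "mnl" is an assumption, and "ice_cream.f" and "female.f" are the factors created on the previous slide:

```r
library(nnet)  # provides multinom()
mnl <- multinom(ice_cream.f ~ female.f + video + puzzle)
summary(mnl)   # coefficients and standard errors, but no p-values
```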

51 Multinomial Logistic regression
You get 2 blocks of coefficients: the first row contains the coefficients of the binomial regression between Chocolate and Vanilla (the reference); the second row contains the coefficients of the binomial regression between Strawberry and Vanilla (the reference).

52 Multinomial Logistic regression
The only things missing are the p-values, which test the significance of each variable for the favourite ice cream flavour. The p-values are derived from the estimated coefficients and their standard errors. Write this little piece of code to obtain them: "puzzle" is significantly associated with favourite ice cream flavour; "video" score is not. Gender is significantly associated with the favourite ice cream when comparing Chocolate with Vanilla, but not when comparing Strawberry with Vanilla.
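Assuming the model was stored as "mnl", the Wald z statistics and two-sided p-values can be computed by hand:

```r
z <- summary(mnl)$coefficients / summary(mnl)$standard.errors
p <- 2 * (1 - pnorm(abs(z)))  # two-sided p-values from the normal distribution
p
```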

53 Multinomial Logistic regression
For the odds ratios we still have 2 rows, corresponding to the 2 regressions. You are 2.26 times more likely to prefer Chocolate over Vanilla if you are a woman, but 1/0.96 ≈ 1.04 times less likely to prefer Strawberry over Vanilla. The confidence intervals for those odds ratios are divided into 2 blocks: the first block compares Chocolate with Vanilla, the second compares Strawberry with Vanilla.

54 Multinomial Logistic regression
What if you wanted "Chocolate" as the reference value of "ice_cream.f"? After releveling, the reference is "Chocolate".
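A sketch of changing the reference level and refitting (object names are assumptions):

```r
ice_cream.f2 <- relevel(ice_cream.f, ref = "Chocolate")
mnl2 <- multinom(ice_cream.f2 ~ female.f + video + puzzle)
summary(mnl2)  # coefficients now compare Vanilla and Strawberry with Chocolate
```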

55 Ordinal regression
Ordinal regression differs from multinomial regression in that the dependent variable is ordinal. An ordinal variable is categorical, but its categories follow an order (e.g. a mark from 1 to 3). In multinomial regression, the outcome variable is nominal, taking qualitative values (e.g. ethnicity and favourite_colour are nominal variables). A nominal variable is categorical with no order or ranking between its categories ("Apple" is not better than "Banana", so favourite_juice is not ordinal). In some cases you can treat an ordinal variable as a scale/continuous variable (if it has more than 7 ordered categories). Ethnicity is a nominal variable that can be used as the outcome in a multinomial logistic regression, but not in an ordinal regression. A Likert scale (a mark from 1 to 5, from Strongly Disagree to Strongly Agree) is, however, an ordinal variable and can definitely be used in an ordinal regression.

56 Ordinal regression
Assumptions: make sure that your dependent variable is ordinal; your independent variables can be nominal, ordinal or scale. Data: a study looks at factors that influence the decision of whether to apply to graduate school. College juniors are asked whether they are unlikely, somewhat likely or very likely to apply to graduate school. Research analysis: we want to investigate whether their parents' education, their GPA score and whether they come from a private school influence this decision.

57 Ordinal regression
From the MASH website, download the Graduate.csv file and put it in the current directory where RStudio works. Read the csv file and attach the data. "apply" is an ordinal variable taking the values 1 (unlikely), 2 (somewhat likely) and 3 (very likely); it will be the outcome. Independent variables: "pared" (parental education) is binary; "public" (public/private) is binary; "gpa" is a score.

58 Ordinal regression
We need to define "apply" as a categorical variable, and we need the function "polr" from the MASS package:
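A sketch of the factor definition and the fit; the object names "apply.f" and "ordreg" are assumptions:

```r
library(MASS)  # provides polr()
apply.f <- factor(apply, levels = c(1, 2, 3),
                  labels = c("unlikely", "somewhat likely", "very likely"),
                  ordered = TRUE)
ordreg <- polr(apply.f ~ pared + public + gpa, Hess = TRUE)
summary(ordreg)
```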

59 Ordinal regression
This little piece of code gives you the p-values for assessing the impact of each variable on the outcome (application decision): parental education has a strong significant impact on the decision (p-value < 0.05); public/private institution does not have any significant effect; GPA score has a significant impact on the decision. There is a strong significant difference between Unlikely and Somewhat likely, and an even stronger significant difference between Somewhat likely and Very likely.
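Assuming the model object "ordreg" from above, the p-values can be derived from the t values:

```r
ctable <- coef(summary(ordreg))
p <- 2 * pnorm(abs(ctable[, "t value"]), lower.tail = FALSE)
cbind(ctable, "p value" = p)
```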

60 Ordinal regression
The odds ratios are the following: if we increase the parental education variable by 1 (i.e. the parents are educated, a binary variable), the odds of moving towards applying are multiplied by 2.85, i.e. you are more likely to move from category 1 (Unlikely) towards category 3 (Very likely) if your parents are educated. If the school is private, the odds of applying are actually a little lower (but this variable does not have a significant impact on the outcome). If we increase the GPA score by 1, the odds of moving towards applying are multiplied by 1.85.
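Assuming the model object "ordreg", the odds ratios and their confidence intervals are:

```r
exp(coef(ordreg))     # odds ratios for pared, public and gpa
exp(confint(ordreg))  # their 95% confidence intervals
```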

61 Poisson regression
Download the .csv file "awards" from the MASH website. In this example, num_awards is the outcome variable and indicates the number of awards earned by students at a high school in a year; math is a continuous predictor variable representing students' scores on their final math exam; and prog is a categorical predictor variable with three levels indicating the type of programme in which the students were enrolled, coded as 1 = "General", 2 = "Academic" and 3 = "Vocational".

62 Poisson regression
Poisson regression is used to predict a dependent variable consisting of "count data", given one or more independent variables. Assumption: the distribution of the counts follows a Poisson distribution, which entails equidispersion: mean(counts) = variance(counts). If this assumption is not met, you can use a negative binomial regression instead (only one part of the model call needs to change).

63 Poisson regression
Assumption to check: whether the data are roughly Poisson distributed. You can check this by checking whether the mean is roughly equal to the variance of the data (here, num_awards). The mean and the variance are not too far from each other, and the histogram shows a Poisson distribution of mean 1. Assumption checked!
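A sketch of that check; the filename "awards.csv" is an assumption based on the slides:

```r
awards <- read.csv("awards.csv")
mean(awards$num_awards)
var(awards$num_awards)      # should be close to the mean for a Poisson
hist(awards$num_awards)     # shape should resemble a Poisson distribution
```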

64 Poisson regression
There is a significant association between programme course and number of awards. There is a strong significant association between math score and number of awards; this association is positive because the coefficient is positive: as the math score increases, the number of awards increases.
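A sketch of the fit behind these results; the object names "prog.f" and "pois1" are assumptions:

```r
awards$prog.f <- factor(awards$prog, levels = c(1, 2, 3),
                        labels = c("General", "Academic", "Vocational"))
pois1 <- glm(num_awards ~ prog.f + math, data = awards, family = poisson)
summary(pois1)
```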

65 Poisson regression
In this slide, we build the "null model" (model 0): a model in which we assume that the number of awards does not depend on any of the independent variables "prog" and "math". We then assess whether model 1 is better than model 0. This is called an analysis of deviance. The null hypothesis is "model 0 is equivalent to model 1"; if it were correct, it would mean that model 1 is no better than model 0. The p-value is very small, so we strongly reject the null hypothesis and conclude that model 1 is better than model 0.
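A sketch of the comparison, refitting both models so the block is self-contained (object names are assumptions):

```r
pois1 <- glm(num_awards ~ prog.f + math, data = awards, family = poisson)
pois0 <- glm(num_awards ~ 1, data = awards, family = poisson)  # null model
anova(pois0, pois1, test = "Chisq")  # analysis of deviance
```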
