MASH R workshop 2:
In this session you will learn: how to check normality in R and determine when to use a parametric or a non-parametric test; and how to run the main parametric tests: - T-test, paired or unpaired (independent) - ANOVA, one-way or repeated measures.
WHY check normality? If your data is approximately normally distributed, then you will use parametric tests in your statistical analysis. If your data is not normally distributed, then you are more likely to use non-parametric tests. The non-parametric tests will be covered in the next session (R Workshop Session 3).
Checking “NORMALITY” for your data. By “normality”, we mean checking whether your column of measurements (your data) is approximately distributed as a “bell shape” when you plot its histogram. (Figure: a bell-shaped histogram, so the data is approximately normally distributed: yes!)
Checking “NORMALITY” for your data. By “bell” shape, we also mean symmetry around the mean. The normal distribution is a symmetric distribution (see R Workshop Session 1). In a symmetric distribution, it is important to note that: you should see symmetry between the left-hand side and the right-hand side of the mean (or mode, or median), and the axis of symmetry should pass through the mean. Mean = Mode = Median.
Checking “NORMALITY” for your data. Don’t forget: your data should be approximately normally distributed, not necessarily exactly normally distributed! (Figure: the axis of symmetry passes (approximately!) through the mean; the left-hand side mirrors the right-hand side; the mode (68) is approximately equal to the mean (68.30).)
Checking “NORMALITY” for your data. Example of skewed data: no symmetry, and the mode, median and mean no longer coincide. (Figure: a skewed histogram with the mode, median and mean marked at different positions.)
Checking “NORMALITY” for your data: Plotting histograms in R. Following what you learned in Session 1, you can plot a histogram of your data with the command “hist()”. From the MASH website, download the ‘normR’ data set at https://www.sheffield.ac.uk/mash/statistics/datasets. Put the data sets in a folder and set this folder as the working directory.
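As a minimal sketch (the column name “score” inside normR.csv is an assumption; check your own file with head()):

```r
# Read the normR data set from the working directory
normR <- read.csv("normR.csv")
head(normR)

# Plot a histogram of a numeric column; "score" is a hypothetical column name
hist(normR$score, main = "Histogram of score", xlab = "Score")
```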
Checking “NORMALITY” for your data: Plotting histograms in R. (Figures: one histogram of skewed data, one of symmetrical data.)
Checking “NORMALITY” for your data: Test for Normality There are tests to assess whether or not your data is normally distributed. In all cases, the Null Hypothesis is: “H0 : Data Normally distributed”. If the p-value is smaller than 0.05, then you reject the null and therefore conclude that the data is not normally distributed.
Checking “NORMALITY” for your data: NORMALITY TESTS. If the sample has less than 50 participants, use the Shapiro-Wilk test. If the sample has more than 50 participants, use the Kolmogorov-Smirnov test.
Checking “NORMALITY” for your data: Test for Normality: Shapiro-Wilk (<50 people). The size of each sample is less than 50, so I can use the Shapiro-Wilk test in both cases. First case: p-value > 0.05, therefore the null hypothesis is not rejected: the data is normally distributed. Second case: p-value < 0.05, therefore the null hypothesis is rejected: the data is not normally distributed.
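A minimal example of the Shapiro-Wilk test, here run on simulated data since it works on any numeric vector:

```r
set.seed(1)
x <- rnorm(30, mean = 68, sd = 3)   # simulated sample of fewer than 50 values

# H0: data normally distributed; p > 0.05 means we do not reject H0
shapiro.test(x)
```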
Checking “NORMALITY” for your data: Test for Normality: Kolmogorov-Smirnov (>50 people). If the sample is more than 50 people, the Kolmogorov-Smirnov test is preferred. Let us just assume that our data set has more than 50 people. Watch out! The syntax differs a little from the Shapiro-Wilk test!
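For the one-sample Kolmogorov-Smirnov test you must name the reference distribution (“pnorm”) and its parameters yourself, which is the syntax difference mentioned above:

```r
set.seed(1)
x <- rnorm(80)   # pretend this is a sample of more than 50 people

# Compare the sample with a normal distribution whose mean and sd
# are estimated from the data itself
ks.test(x, "pnorm", mean = mean(x), sd = sd(x))
```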
Checking “NORMALITY” for your data: TESTS. However, be cautious with these tests: the presence of a single outlier can cause the null hypothesis to be rejected even though the rest of your data is perfectly symmetric. We recommend checking both the graphs and the tests. If you are still undecided, don’t hesitate to come to MASH and ask us!
Checking “NORMALITY” for your data. You can also show a P-P plot to assess whether your data is normally distributed (or not). It is more common, however, to show the Q-Q plot.
Parametric tests. In this session, we will only study parametric tests, that is, tests to use when your data is normally distributed.
Comparing a measurement between 2 independent groups: INDEPENDENT T-TEST. For the assumptions of each test, search for “LAERD SPSS name_of_the_test”! An independent t-test will detect whether there is a statistically significant difference in a measurement (score) between 2 groups (Group 1 and Group 2). We therefore have one categorical variable (Group) and one continuous variable (score). You need to check that your measurement is normally distributed in both groups. You also need to check whether there are any outliers; if so, it is better to remove them (although outliers that are not too extreme can be kept). Finally, our last assumption is that the variance in each group is roughly the same. We verify this assumption with Levene’s test.
Comparing a measurement between 2 independent groups: INDEPENDENT T-TEST. Download the Birthweight data set for R (.csv format) from the website and store it in the correct working directory. Then open the .csv file:
Comparing a measurement between 2 independent groups: INDEPENDENT T-TEST. head() shows the first 6 rows of the data set. attach() makes R recognise the variables when you call them directly: “id”, “length”, “Gestation”, etc. factor() translates the binary code into words: 0 means Non-Smoker and 1 means Smoker (the original data set contains 0 and 1 instead of the words “Non-Smoker” and “Smoker”).
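Putting those steps together might look like this (the column name “smoker” and its 0/1 coding are assumptions; check your own file with head()):

```r
Birthweight <- read.csv("Birthweight.csv")
head(Birthweight)   # shows the first 6 rows

# Translate the 0/1 coding into words ("smoker" is a hypothetical column name)
Birthweight$smoker <- factor(Birthweight$smoker,
                             levels = c(0, 1),
                             labels = c("Non-Smoker", "Smoker"))

# Let R recognise the variables when you call them directly
attach(Birthweight)
```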
Comparing a measurement between 2 independent groups: INDEPENDENT T-TEST. Is my measurement (Birthweight) normally distributed in both groups, “Smoker” and “Non-Smoker”? I am plotting 2 histograms representing the distribution of Birthweight in the two groups. This function allows you to plot 2 graphs in the same plot. Does this look symmetric to you? If you are not sure, make a Q-Q plot of Birthweight for the 2 groups and see if the points are close to the line.
Comparing a measurement between 2 independent groups: INDEPENDENT T-TEST. In order to plot 2 graphs in the same plot: first plot, second plot. The 2 Q-Q plots seem to correspond to a normal distribution: there is no “S” shape around the line and the points are close to it.
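A sketch of the two Q-Q plots side by side (the column names “Birthweight” and “smoker”, with labels as created earlier, are assumptions):

```r
Birthweight <- read.csv("Birthweight.csv")
Birthweight$smoker <- factor(Birthweight$smoker, levels = c(0, 1),
                             labels = c("Non-Smoker", "Smoker"))

par(mfrow = c(1, 2))   # 1 row, 2 columns: 2 graphs in the same plot

bw_s  <- Birthweight$Birthweight[Birthweight$smoker == "Smoker"]
bw_ns <- Birthweight$Birthweight[Birthweight$smoker == "Non-Smoker"]

qqnorm(bw_s, main = "Smoker");      qqline(bw_s)
qqnorm(bw_ns, main = "Non-Smoker"); qqline(bw_ns)
```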
Comparing a measurement between 2 independent groups: INDEPENDENT T-TEST. We can also perform a normality test. The size of each group does not exceed 50, so we can run a Shapiro-Wilk test for Smoker and Non-Smoker. Both of these tests have a p-value > 0.05, so we retain the null. Reminder: the null is “my data is normally distributed”. Hence we can conclude that the data is normally distributed.
Comparing a measurement between 2 independent groups: INDEPENDENT T-TEST No outliers!
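The outlier check can be sketched with a boxplot (same assumed column names as above; points beyond the whiskers would be flagged as outliers):

```r
Birthweight <- read.csv("Birthweight.csv")
Birthweight$smoker <- factor(Birthweight$smoker, levels = c(0, 1),
                             labels = c("Non-Smoker", "Smoker"))

# One box per group; isolated points outside the whiskers indicate outliers
boxplot(Birthweight ~ smoker, data = Birthweight,
        ylab = "Birthweight", xlab = "Group")
```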
Comparing a measurement between 2 independent groups: INDEPENDENT T-TEST. Finally, our last assumption is that the variance in each group is roughly the same. We verify this assumption with Levene’s test. In this test, the null hypothesis is “Variance Group 1 = Variance Group 2”. This test is contained in the R package “car”. The command for Levene’s test is then as follows. The p-value is more than 0.05, so we can retain the null hypothesis of equal variances. Last assumption checked!
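With the same assumed column names, the Levene’s test call is:

```r
# install.packages("car")   # run once if the package is not yet installed
library(car)

Birthweight <- read.csv("Birthweight.csv")
Birthweight$smoker <- factor(Birthweight$smoker, levels = c(0, 1),
                             labels = c("Non-Smoker", "Smoker"))

# H0: Variance Group 1 = Variance Group 2
leveneTest(Birthweight ~ smoker, data = Birthweight)
```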
Comparing a measurement between 2 independent groups: INDEPENDENT T-TEST. We can finally run the independent t-test. The null hypothesis is “the birthweight is the same in both groups”. If Levene’s test fails, you need to set var.equal to FALSE. The p-value is less than 0.05, so we reject the null hypothesis. We conclude that there is a significant difference in the birthweight of the babies between the smoking and the non-smoking mothers.
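With the same assumed column names, the t-test call is:

```r
Birthweight <- read.csv("Birthweight.csv")
Birthweight$smoker <- factor(Birthweight$smoker, levels = c(0, 1),
                             labels = c("Non-Smoker", "Smoker"))

# var.equal = TRUE because Levene's test retained equal variances;
# set var.equal = FALSE if Levene's test fails
t.test(Birthweight ~ smoker, data = Birthweight, var.equal = TRUE)
```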
Comparing a measurement between 3 independent groups: one-way anova. For the assumptions of each test, search for “LAERD SPSS name_of_the_test”! A one-way ANOVA will detect whether there is a statistically significant difference in a measurement (score) between 3 or more groups (Group 1, Group 2, Group 3, etc.). We therefore have one categorical variable (Group) and one continuous variable (score). You need to check that your measurement is normally distributed in each group. You also need to check whether there are any outliers per group; if so, you will need to remove them! Finally, our last assumption is that the variance in each group is roughly the same. We verify this assumption with Levene’s test.
Comparing a measurement between 3 independent groups: one-way anova. The one-way ANOVA can be seen as the same test as the independent t-test, but for comparing a measurement between 3 or more independent groups. From the MASH website, download the Diet.csv file into the working directory, i.e. the same directory where the Birthweight.csv file is located.
Comparing a measurement between 3 independent groups: one-way anova. The research question is: “Which of the 3 diets was the best for losing weight?” There are 3 diets, hence 3 groups. The variable Diet is our categorical variable. The weight lost will be our measurement (continuous/scale) used to compare the 3 diets. We therefore need to create another column representing the weight lost, by subtracting “weight6weeks” from “pre.weight”:
Comparing a measurement between 3 independent groups: one-way anova. You can create and add the weight-lost variable to your DietR data set directly. Each row does the subtraction pre.weight − weight6weeks, e.g. for row 6: 64 − 61.1 = 2.9 (weightlost). Don’t forget to attach the file so that R recognises the variables when you call them! The command shown defines Diet as a categorical variable.
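The whole preparation step might look like this (column names taken from the slide; check them against your own file with head()):

```r
DietR <- read.csv("Diet.csv")

# weightlost = pre.weight - weight6weeks, e.g. row 6: 64 - 61.1 = 2.9
DietR$weightlost <- DietR$pre.weight - DietR$weight6weeks

# Define Diet as a categorical variable
DietR$Diet <- factor(DietR$Diet)

attach(DietR)
```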
Comparing a measurement between 3 independent groups: one-way anova Assumption 1: Your measurement (weightlost) should be approximately normally distributed in each group. If in one group, it is not normally distributed, then it is better to choose the non-parametric alternative (Kruskal-Wallis). The non-parametric tests are seen in the next R session (R Workshop 3).
Comparing a measurement between 3 independent groups: one-way anova. The 3 Shapiro-Wilk tests show p-values higher than 0.05, so we retain the null hypothesis “normally distributed data”. Assumption 1 is then checked!
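One way to run the three Shapiro-Wilk tests in a single call (continuing with the DietR data frame and the weightlost column created on the previous slide):

```r
# Apply shapiro.test to weightlost separately within each Diet group
tapply(DietR$weightlost, DietR$Diet, shapiro.test)
```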
Comparing a measurement between 3 independent groups: one-way anova. Assumption 2: No outliers. Your data should have no outliers in the 3 groups. 2 outliers in Diet 1! We can see on the boxplot that the outliers lie above 8. The code below eliminates the participants of Diet 1 whose “weightlost” is more than 8.
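A sketch of the boxplot and the removal step (continuing with DietR; the cut-off of 8 comes from the boxplot described above):

```r
boxplot(weightlost ~ Diet, data = DietR)

# Drop Diet 1 participants whose weightlost exceeds 8
DietR <- DietR[!(DietR$Diet == 1 & DietR$weightlost > 8), ]
```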
Comparing a measurement between 3 independent groups: one-way anova. Assumption 3: Homogeneity of variance. Basically, this means that the variance of each group is the same. Levene’s test checks this. If the p-value is above 0.05, then you can treat the variances of the different groups as approximately equal. The p-value indicated here (0.5377) is more than 0.05, so we retain the null hypothesis that the groups have similar variances.
Comparing a measurement between 3 independent groups: one-way anova. Running the ANOVA: if Levene’s test fails, you can replace the parameter “var.equal=TRUE” by “var.equal=FALSE”. The p-value is 0.003229, which is smaller than 0.05, so you reject the null hypothesis. The null hypothesis is: “The weight lost is the same in every group”. You conclude that there is a statistically significant difference in weight lost between the 3 diets.
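One R function with a var.equal parameter for this ANOVA is oneway.test() (a sketch, continuing with DietR):

```r
# H0: the weight lost is the same in every group
# Replace var.equal = TRUE by FALSE if Levene's test fails
oneway.test(weightlost ~ Diet, data = DietR, var.equal = TRUE)
```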
Comparing a measurement between 3 independent groups: one-way anova. Now that you know there is a statistically significant difference between the diets, you may want to know which groups differ most. You need to run multiple-comparison tests, often called post-hoc tests because they are run after the ANOVA. There are 3 possible comparisons: Group 1 vs. Group 2 (p-value > 0.05), Group 3 vs. Group 1 (p-value < 0.05), Group 3 vs. Group 2 (p-value < 0.05). Therefore we conclude that there is a statistically significant difference between Group 3 and the other 2 groups, but no significant difference between Group 1 and Group 2.
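One common way to run the post-hoc comparisons is pairwise.t.test(); the Bonferroni adjustment shown here is one possible choice:

```r
# All pairwise t-tests between the 3 diets, with a p-value adjustment
pairwise.t.test(DietR$weightlost, DietR$Diet, p.adjust.method = "bonferroni")
```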
Comparing a measurement twice on the same group: paired t-test. For the assumptions of each test, search for “LAERD SPSS name_of_the_test”! A paired t-test will detect whether there is a statistically significant difference in a measurement (score) for the same group of participants at 2 different times or under 2 different conditions. We therefore have one categorical variable (Time) and one continuous variable (score). You need to check that the difference in measurement between time 1 and time 2 is normally distributed. You also need to check whether there are any outliers in this difference; if so, you will need to remove them! (No need to check the equality of variances between time 1 and time 2 for this test!)
Comparing a measurement twice on the same group: paired t-test. You will need to download the Cholesterol file for R from the MASH website and put it in the same directory as the Diet and Birthweight files.
Comparing a measurement twice on the same group: paired t-test. Research question: is there a statistically significant difference in cholesterol level between Before and After 4 weeks? We have the cholesterol level at “Before” and the cholesterol level at “After4weeks”. In order to check our assumptions, we need to create the difference between the cholesterol level After4weeks and the cholesterol level Before. We will then check whether there are any outliers and whether the distribution of the difference is approximately normal.
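A sketch of that preparation step (the column names “Before” and “After4weeks” are taken from the slide; check them with head()):

```r
Cholesterol <- read.csv("Cholesterol.csv")
head(Cholesterol)

# Difference between the two measurement times
Cholesterol$diff <- Cholesterol$After4weeks - Cholesterol$Before
```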
Comparing a measurement twice on the same group: paired t-test. Assumption 1: Your measurement difference should be approximately normally distributed. If this is not the case, then it is better to choose the non-parametric alternative (Wilcoxon). The non-parametric tests will be covered in the next R session (R Workshop 3). Difficult to conclude from the histogram alone! We might need the normality test.
Comparing a measurement twice on the same group: paired t-test. The null hypothesis of the Shapiro-Wilk test is that the difference is normally distributed. The p-value is more than 0.05, therefore we can keep the null hypothesis and conclude that the data is normally distributed. By data, we mean the difference in cholesterol level between After 4 weeks and Before. Normality assumption checked!
Comparing a measurement twice on the same group: paired t-test Assumption 2: No outliers. The cholesterol difference should have no outliers. One outlier here!
Comparing a measurement twice on the same group: paired t-test. The p-value is very small (0.00000000001958)! There is strong evidence against the null hypothesis (no difference between after and before). Therefore there is a statistically significant difference between Before and After.
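With the same assumed column names, the paired t-test call is:

```r
Cholesterol <- read.csv("Cholesterol.csv")

# H0: no difference in cholesterol level between After4weeks and Before
t.test(Cholesterol$After4weeks, Cholesterol$Before, paired = TRUE)
```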
Comparing a measurement +3 times on the same group: repeated measures anova. A repeated measures ANOVA will detect whether there is a statistically significant difference in a measurement (score) for the same group of participants at 3 or more different times or under 3 or more different conditions. We therefore have one categorical variable (Time) and one continuous variable (score). You need to check that your measurement at each time is normally distributed. You also need to check whether there are any outliers at each time; if so, you will need to remove them! Assumption of sphericity: this assumption will be computed by our function. Sphericity means that all possible differences between times have the same variance. We will not go into too much detail, but this assumption needs to be checked: Variance(time2 − time1) = Variance(time3 − time1) = Variance(time3 − time2), where time1, time2 and time3 are the measurements made at these times.
Comparing a measurement +3 times on the same group: repeated measures anova Assumption 1: Measurement normally distributed at each time.
Comparing a measurement +3 times on the same group: repeated measures anova. Each p-value is more than 0.05, so we retain the null hypothesis that the cholesterol level is normally distributed at each time. Assumption 1 checked!
Comparing a measurement +3 times on the same group: repeated measures anova. Assumption 2: there are no outliers in the data at any time. No outlier! Assumption 2 checked!
Comparing a measurement +3 times on the same group: repeated measures anova. Unfortunately, the format in which the data is presented is not ready for a repeated measures ANOVA! We need to go from Format 1 (WIDE data) to Format 2 (LONG data). There are two fundamental verbs of data tidying: gather() takes multiple columns and gathers them into key-value pairs: it makes “wide” data longer. spread() takes two columns (key & value) and spreads them into multiple columns: it makes “long” data wider.
Comparing a measurement +3 times on the same group: repeated measures anova. Install the library “tidyr” and use the function “gather” contained in that library. Its arguments are the data frame to modify and the name of the measure (the cholesterol level); the new variable “Times” will take the following values: “Before”, “After4weeks” and “After8weeks”.
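The reshaping step might look like this (column names as assumed above):

```r
# install.packages("tidyr")   # run once if needed
library(tidyr)

Cholesterol <- read.csv("Cholesterol.csv")

# Gather the three time columns into one key column (Times) and
# one value column (Cholesterol): wide -> long
CholesterolLong <- gather(Cholesterol, key = "Times", value = "Cholesterol",
                          Before, After4weeks, After8weeks)
head(CholesterolLong)
```

Note that newer versions of tidyr recommend pivot_longer(), which does the same job; gather() still works.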
Comparing a measurement +3 times on the same group: repeated measures anova. Here we just replace the name “..ID” by “Subject”, which looks nicer; but you can keep “..ID” if you like. Ready for the repeated ANOVA! You need to install the library “ez” in order to perform a repeated measures ANOVA.
Comparing a measurement +3 times on the same group: repeated measures anova. The format is a little complicated: the first parameter is the data set (data frame) in which you will find the variables needed for the repeated ANOVA. The 2nd parameter is the dependent variable (dv), the 3rd parameter is the subject identifier (wid) and the last parameter is the within-subject variable (Times). Unfortunately, there is a 3rd assumption in the repeated measures ANOVA, which consists in not violating sphericity! If the test rejects sphericity, then you should look at the p-values under “Sphericity Corrections”.
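Assuming the long data set with a “Subject” column from the previous slide, the call is as follows; ezANOVA reports Mauchly’s sphericity test and the sphericity corrections automatically:

```r
# install.packages("ez")   # run once if needed
library(ez)

# CholesterolLong: long-format data with columns Subject, Times, Cholesterol
ezANOVA(data = CholesterolLong,
        dv = Cholesterol,   # dependent variable
        wid = Subject,      # subject identifier
        within = Times)     # within-subject factor
```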
Comparing a measurement +3 times on the same group: repeated measures anova. As for the one-way ANOVA, we need to find between which times we detect a significant difference. These 3 p-values indicate that there is a strong statistically significant difference for each pair of comparisons.
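A sketch of the paired post-hoc comparisons; this relies on the rows being ordered by subject within each time, which is the case after gather():

```r
# Paired pairwise comparisons between the 3 times
pairwise.t.test(CholesterolLong$Cholesterol, CholesterolLong$Times,
                paired = TRUE, p.adjust.method = "bonferroni")
```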
EXTRA: how to convert from LONG to WIDE data. spread() takes two columns (key & value) and spreads them into multiple columns: it makes “long” data wider, going from Format 2 (LONG data) back to Format 1 (WIDE data). It will split the file by Times: After4weeks, Before, After8weeks.
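A sketch of the reverse operation, starting from the long data set built earlier:

```r
library(tidyr)

# Spread the Times/Cholesterol key-value pair back into separate columns:
# long -> wide, one column per time (Before, After4weeks, After8weeks)
CholesterolWide <- spread(CholesterolLong, key = Times, value = Cholesterol)
head(CholesterolWide)
```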