Statistical Programming Using the R Language


Statistical Programming Using the R Language Lecture 3 Hypothesis Testing Darren J. Fitzpatrick, Ph.D June 2017

How to prep data for use in R?

I am guessing that most people interact with data using MS Excel or OpenOffice Calc. R works with text files: the kind of files created by applications like Notepad (Windows), TextWrangler (Mac) or gedit (Linux), which by convention have the extension .txt. If you have your data in spreadsheet format, you can save it as either a tab-delimited text file (.txt) or a comma-separated text file (.csv). R can work with both of these.

How to prep data for use in R?

To save data from Excel, select File > Save As and choose one of the .txt options from the menu. If you save or receive data as .csv, R can read it using the read.csv() function.
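As a quick sketch of both readers (the file name demo.csv and its contents are invented for illustration; the numbers are borrowed from the exercise solutions below):

```r
# Write a tiny comma-separated file so the example is self-contained.
# 'demo.csv' is a made-up file name, not part of the course material.
writeLines(c("gene,normal,tumour",
             "guanylin,2261,175",
             "apolipoprotein_A,412,8"),
           "demo.csv")

df1 <- read.csv("demo.csv")                              # header=TRUE, sep="," by default
df2 <- read.table("demo.csv", header = TRUE, sep = ",")  # equivalent read.table() call

df1$gene  # the text column read back in
```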

How to prep data for use in R?

Note, the first row of the file is the header.

df <- read.table("file.txt", header=T)
df <- read.table("file.txt", header=T, row.names=1)  # use the first column as row names

Solutions I

1.1 – 1.3 Extract data for three genes of interest, create two data frames and give new names to the columns.

affected_genes <- affected[, c('M97496', 'M77836', 'M10373')]
unaffected_genes <- unaffected[, c('M97496', 'M77836', 'M10373')]
new_names <- c('guanylin', 'pyrroline_reductase', 'apolipoprotein_A')
names(affected_genes) <- new_names
names(unaffected_genes) <- new_names

Solutions II

2.1 Find the x-axis limits (example for Guanylin).

max_normal <- max(unaffected_genes[,1])  # [1] 2261
min_normal <- min(unaffected_genes[,1])  # [1] 412
max_tumour <- max(affected_genes[,1])    # [1] 175
min_tumour <- min(affected_genes[,1])    # [1] 8
# x limits are c(min_tumour, max_normal)

Solutions II

2.2

par(mfrow=c(1,2))
hist(unaffected_genes[,1], main='Guanylin (Normal)',
     xlab='Guanylin Expression', breaks=15, col='darkgrey',
     xlim=c(min_tumour, max_normal), ylim=c(0, 6))
abline(v=mean(unaffected_genes[,1]), col='red')
abline(v=median(unaffected_genes[,1]), col='blue')
hist(affected_genes[,1], main='Guanylin (Tumour)',
     breaks=5, col='lightgrey', ylim=c(0, 6))
abline(v=mean(affected_genes[,1]), col='red')
abline(v=median(affected_genes[,1]), col='blue')

Solutions III

3.2

boxplot(unaffected_genes[,1], affected_genes[,1],
        col=c('darkgrey', 'lightgrey'),
        main='Guanylin Expression', names=c('Normal', 'Tumour'))

Solutions IV

4

plot(unaffected_genes[,3][-9], affected_genes[,3][-9],
     main='Normal vs Tumour Apolipoprotein A Expression',
     xlab='Normal', ylab='Tumour', pch=5, cex=0.5)

Hypothesis Testing I

A statistical hypothesis is an assumption about a population parameter (e.g., the mean or the variance), or an assumption about the "differences" in population parameters for two or more populations.

H0: Null Hypothesis
H1: Alternative Hypothesis

Hypothesis testing is a formal procedure that allows one to decide whether or not to reject the Null Hypothesis.

Hypothesis Testing II

To conduct a hypothesis test:
1. Visualise the data (e.g. histograms)
2. Define the Null Hypothesis
3. Decide on an appropriate statistical test
4. Analyse sample data
5. Interpret results

Hypothesis Testing III

Statistical hypothesis tests typically:
1. Compute a test statistic from the data, e.g. the T value.
2. Produce a p-value from the observed test statistic.

By convention, the null hypothesis is rejected if p < 0.05.
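For intuition, here is a sketch of those two steps for a one-sample t-test on simulated data; t.test() performs the same computation internally, so the two results should agree.

```r
set.seed(1)
x <- rnorm(20, mean = 0.5)  # simulated sample

# Step 1: compute the test statistic (one-sample t statistic, H0: mu = 0)
t_stat <- mean(x) / (sd(x) / sqrt(length(x)))

# Step 2: convert the statistic to a two-sided p-value using the t distribution
p_val <- 2 * pt(-abs(t_stat), df = length(x) - 1)

# t.test() reports the same statistic and p-value
res <- t.test(x, mu = 0)
```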

Hypothesis Testing

                      Parametric                Non-Parametric
Distribution          Normal                    Any
Variance              Homogeneous               Any
Data                  Continuous                Ordinal/Nominal
Central Measure       Mean                      Median

Tests
Correlation           Pearson                   Spearman
2 groups              T-Test (unpaired)         Mann-Whitney Test
> 2 groups            One-way ANOVA             Kruskal-Wallis Test
2 groups (paired)     T-Test (paired)           Wilcoxon Test
> 2 groups (paired)   Repeated-measures ANOVA   Friedman Test

Test for Normality I

One assumption for parametric statistics is that your data (if continuous) follow a normal distribution. The Shapiro-Wilk Test is one formal test for Normality.

H0: normally distributed
H1: not normally distributed

shapiro.test(x)
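A minimal sketch on simulated data; the exact p-values depend on the random draw, hence the hedged comments:

```r
set.seed(42)
norm_data <- rnorm(50)  # drawn from a normal distribution
skew_data <- rexp(50)   # drawn from a skewed (exponential) distribution

shapiro.test(norm_data)$p.value  # typically > 0.05: cannot reject normality
shapiro.test(skew_data)$p.value  # typically < 0.05: reject normality
```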

Test for Normality II

Problem: Test whether the guanylin expression values (Normal and Tumour) are normally distributed. H0: Normal; H1: Not Normal

shapiro.test(unaffected_genes[,1])

        Shapiro-Wilk normality test
data: unaffected_genes[, 1]
W = 0.94367, p-value = 0.3344

p > 0.05: Cannot Reject Null Hypothesis. Normally Distributed.

shapiro.test(affected_genes[,1])

        Shapiro-Wilk normality test
data: affected_genes[, 1]
W = 0.86345, p-value = 0.0139

p < 0.05: Reject Null Hypothesis. Not Normally Distributed.

Test for Normality III: The Importance of Eyeballing

The Shapiro-Wilk test 'thinks' that guanylin expression in Normal cells is normally distributed. QQ-plots are another way to visualise distributions.

par(mfrow=c(1,2))
qqnorm(unaffected_genes[,1], main='Guanylin (Normal)')
qqline(unaffected_genes[,1], col='red')
qqnorm(affected_genes[,1], main='Guanylin (Tumour)')
qqline(affected_genes[,1], col='red')

Manipulating Function Outputs

When you run a hypothesis test function, R returns lots of information. This information can be stored as a variable and used.

shapiro.test(unaffected_genes[,1])

        Shapiro-Wilk normality test
data: unaffected_genes[, 1]
W = 0.94367, p-value = 0.3344

stest <- shapiro.test(unaffected_genes[,1])
str(stest)  # see what is in stest
List of 4
 $ statistic: Named num 0.944
  ..- attr(*, "names")= chr "W"
 $ p.value  : num 0.334
 $ method   : chr "Shapiro-Wilk normality test"
 $ data.name: chr "unaffected_genes[, 1]"
 - attr(*, "class")= chr "htest"

stest$p.value  # access elements of stest
[1] 0.3344408

Automatically accessing relevant information from a hypothesis test is important when running many hypothesis tests.
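For example, here is a sketch of running one test per column of a data frame and collecting just the p-values; the data frame is simulated here rather than the gene-expression data from the exercises:

```r
set.seed(7)
# Simulated stand-in for a samples-by-genes data frame (18 samples, 3 'genes')
genes <- data.frame(g1 = rnorm(18), g2 = rexp(18), g3 = runif(18))

# Apply shapiro.test() to each column and keep only the p-value from each htest object
pvals <- sapply(genes, function(g) shapiro.test(g)$p.value)
pvals  # named numeric vector, one p-value per column
```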

Comparing Two Sample Means I

Statistically, when comparing two samples, we are comparing the means (or middles) of the two samples: do the samples originate from the same distribution? Typically, this is done with a T-Test, but:
T-Tests assume normality.
Student's T-Test assumes equal variance in both samples, whereas Welch's T-Test does not.
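One way to check the equal-variance assumption before choosing between the two is an F test via var.test(); a sketch on simulated samples with deliberately different variances:

```r
set.seed(3)
a <- rnorm(20, mean = 0, sd = 1)
b <- rnorm(20, mean = 1, sd = 3)  # much larger variance by construction

var.test(a, b)$p.value          # small p-value: variances look unequal
t.test(a, b)                    # Welch's T-Test (R's default)
t.test(a, b, var.equal = TRUE)  # Student's T-Test (assumes equal variances)
```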

Comparing Two Sample Means II

H0: μ1 = μ2 (no difference in means)

Possible Alternative Hypotheses:
H1: μ1 ≠ μ2 (difference in means)
H1: μ1 > μ2 (mean 1 greater than mean 2)
H1: μ1 < μ2 (mean 1 less than mean 2)

t.test(unaffected_genes[,1], affected_genes[,1])
t.test(unaffected_genes[,1], affected_genes[,1], alternative="greater")
t.test(unaffected_genes[,1], affected_genes[,1], alternative="less")

Comparing Two Sample Means III

t.test(unaffected_genes[,1], affected_genes[,1])

        Welch Two Sample t-test
data: unaffected_genes[, 1] and affected_genes[, 1]
t = 7.1863, df = 30.999, p-value = 4.433e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  784.3935 1406.0509
sample estimates:
 mean of x  mean of y
1082.94444  -12.27778

Comparing Two Sample Means IV

By default, R does a Welch's T-Test (unequal variances).

t.test(unaffected_genes[,1], affected_genes[,1], var.equal=T)  # Student's T-Test

Our data is paired, i.e., two measures from the same individual. (Paired T-Tests do not rely on the assumption of equal variances.)

t.test(unaffected_genes[,1], affected_genes[,1], paired=T)

Comparing Two Sample Means V

But our data for guanylin expression certainly doesn't look normal. In this case, we can use the non-parametric Wilcoxon Test. Non-parametric tests typically look at the ranks of the data rather than the raw values.

wilcox.test(unaffected_genes[,1], affected_genes[,1], paired=T)

        Wilcoxon signed rank test
data: unaffected_genes[, 1] and affected_genes[, 1]
V = 171, p-value = 7.629e-06
alternative hypothesis: true location shift is not equal to 0
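To see that such tests depend only on ranks, note that applying the same monotone transformation (e.g. log) to both samples leaves the result unchanged. A sketch on simulated data, using the unpaired form of the test (the paired signed-rank test works on differences, so this invariance applies to the unpaired case):

```r
set.seed(5)
x <- rexp(15)      # positive, skewed sample
y <- rexp(15) + 1  # shifted upwards

p_raw <- wilcox.test(x, y)$p.value
p_log <- wilcox.test(log(x), log(y))$p.value  # log preserves the pooled ranks

p_raw == p_log  # identical p-values: only the ranks matter
```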

Permutation Testing I

Whether to use parametric or non-parametric statistics is sometimes clear cut:
Assumptions met => parametric
Assumptions not met => non-parametric
Assumptions "almost" met => try both

An alternative is permutation testing. I like permutation testing because it:
Doesn't depend on distributional assumptions.
Uses the data itself to make decisions.
Is robust to outliers.

But permutation tests can be computationally expensive. They are best illustrated with an example.

Permutation Testing II

Note: set.seed() sets a start point in a random process, so you can regenerate the same random data.

1. Generate some random data.

set.seed(100)           # to regenerate the same data
v1 <- rnorm(10, 0, 1)   # mean=0, sd=1
v2 <- rnorm(10, 0, 1)   # mean=0, sd=1

2. Generate all possible permutations of the data.

diff_means <- mean(v1) - mean(v2)  # -0.2516486
v1v2 <- c(v1, v2)   # combine
N <- length(v1v2)   # 20
n <- length(v1)     # 10
p <- combn(N, n)    # all combinations of n chosen from N
# 184756 permutations

Permutation Testing III

3. Compute the difference in means for every possible permutation of the data.

holder <- numeric(ncol(p))  # initialise results holder
for (i in 1:ncol(p)) {
  # difference in means for permutation i; add to holder
  holder[i] <- mean(v1v2[p[, i]]) - mean(v1v2[-p[, i]])
}

Permutation Testing IV

The holder contains the differences in means for all possible permutations of the data.

holder[1:10]
[1] -0.25164862 -0.16169897 -0.16042130 -0.22000299 ......

Produce a distribution of the differences in means, showing the location of the actual difference (-0.2516486) computed from the original data.

hist(holder)
abline(v=diff_means, lty=2)
abline(v=-diff_means, lty=2)

Permutation Testing V

P-values are computed as the proportion of permuted differences in means at least as extreme as the observed difference:

One-tailed (less): permuted difference <= observed difference
sum(holder <= -abs(diff_means)) / dim(p)[2]

One-tailed (greater): permuted difference >= observed difference
sum(holder >= abs(diff_means)) / dim(p)[2]

Two-tailed (p = 0.48): permuted difference at least as extreme in either direction
(sum(holder <= -abs(diff_means)) + sum(holder >= abs(diff_means))) / dim(p)[2]

Permutation Testing VI install.packages('perm') library(perm) Permutation Testing VI R has a package to compute permutation tests called 'perm'. permTS(v1, v2, alternative="two.sided", method="exact.ce", control=permControl(tsmethod="abs")) Exact Permutation Test (complete enumeration) data: v1 and v2 p-value = 0.4786 # Equivalent to our permutation example alternative hypothesis: true mean v1 - mean v2 is 0 sample estimates: mean v1 - mean v2 -0.2516486

Computing Correlations

Correlation coefficients are computed using the cor() function.

cor(unaffected_genes[,1], affected_genes[,1], method='pearson')
cor(unaffected_genes[,1], affected_genes[,1], method='spearman')

These functions return only the correlation coefficient as a number:
method='pearson' returns r (Pearson's r)
method='spearman' returns rho (ρ)
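If you also want a p-value for the correlation, cor.test() (not shown on the slide) returns the coefficient together with a hypothesis test. A sketch on simulated data:

```r
set.seed(9)
x <- rnorm(30)
y <- x + rnorm(30)  # correlated with x by construction

cor(x, y, method = "pearson")             # just the coefficient r
ct <- cor.test(x, y, method = "pearson")  # coefficient plus a hypothesis test
ct$estimate  # r, the same value cor() returns
ct$p.value   # p-value for H0: true correlation is 0
```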

Lecture 3 – problem sheet A problem sheet entitled lecture_3_problems.pdf is located on the course website. Some of the code required for the problem sheet has been covered in this lecture. Consult the help pages if unsure how to use a function. Please attempt the problems for the next 30-45 mins. We will be on hand to help out. Solutions will be posted this afternoon.

Thank You