Statistical Programming Using the R Language, Lecture 3: Hypothesis Testing. Darren J. Fitzpatrick, Ph.D., April 2016.

How to prep data for use in R? Most people interact with data using MS Excel or Open Office Calc, but R works with text files. These are the kind of files created by applications like NotePad (Windows), TextWrangler (Mac) or gedit (Linux), and by convention they have the extension .txt. If you have your data in spreadsheet format, you can save it as either a tab-delimited text file (.txt) or a comma-separated text file (.csv). R can work with both of these.

How to prep data for use in R? To save data from Excel, select File > Save As and choose one of the .txt options from the menu. If you save or receive data as .csv, R can read it using the read.csv() function.

How to prep data for use in R? If the first row of your data is a header and the first column holds row names, then your header should not include a name for the row-name column.
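As a minimal sketch of the above (using a tiny made-up data frame rather than the lecture's expression data), the snippet below writes a tab-delimited file with a header and row names, then reads it back; write.table() with col.names=NA leaves the row-name column unnamed, exactly as the slide recommends:

```r
# Hypothetical stand-in data: two 'genes' measured on two samples
df <- data.frame(gene1=c(1.2, 3.4), gene2=c(5.6, 7.8),
                 row.names=c('sample1', 'sample2'))

# col.names=NA writes an empty name above the row-name column
write.table(df, 'example.txt', sep='\t', quote=FALSE, col.names=NA)

# Tab-delimited text: header row, first column used as row names
dat <- read.table('example.txt', header=TRUE, sep='\t', row.names=1)

# The csv equivalent would be: read.csv('example.csv', row.names=1)
```

head(dat) and str(dat) are useful first checks that the file was parsed as intended.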

Solutions I (1.1 – 1.3): Extract data for three genes of interest, create two data frames and give new names to the columns.

affected_genes <- affected[, c('M97496', 'M77836', 'M10373')]
unaffected_genes <- unaffected[, c('M97496', 'M77836', 'M10373')]

new_names <- c('guanylin', 'pyrroline_reductase', 'apolipoprotein_A')
names(affected_genes) <- new_names
names(unaffected_genes) <- new_names

Trinity College Dublin, The University of Dublin

Solutions II: Find the x-axis limits (example for Guanylin).

2.1
max_normal <- max(unaffected_genes[,1])  # [1] 2261
min_normal <- min(unaffected_genes[,1])  # [1] 412
max_tumour <- max(affected_genes[,1])    # [1] 175
min_tumour <- min(affected_genes[,1])    # [1] 8
# x limits are c(min_tumour, max_normal)

Solutions II

2.2
par(mfrow=c(1,2))
hist(unaffected_genes[,1], main='Guanylin (Normal)',
     xlab='Guanylin Expression', breaks=15, col='darkgrey',
     xlim=c(min_tumour, max_normal), ylim=c(0, 6))
abline(v=mean(unaffected_genes[,1]), col='red')
abline(v=median(unaffected_genes[,1]), col='blue')

hist(affected_genes[,1], main='Guanylin (Tumour)',
     xlab='Guanylin Expression', breaks=5, col='lightgrey',
     xlim=c(min_tumour, max_normal), ylim=c(0, 6))
abline(v=mean(affected_genes[,1]), col='red')
abline(v=median(affected_genes[,1]), col='blue')

Solutions III

3.2
boxplot(unaffected_genes[,1], affected_genes[,1],
        col=c('darkgrey', 'lightgrey'),
        main='Guanylin Expression', names=c('Normal', 'Tumour'))

Solutions IV

4
plot(unaffected_genes[,3][-9], affected_genes[,3][-9],
     main='Normal vs Tumour Apolipoprotein A Expression',
     xlab='Normal', ylab='Tumour', pch=5, cex=0.5)

Hypothesis Testing I

A statistical hypothesis is an assumption about a population parameter, e.g. the mean or the variance, or an assumption about the "differences" in population parameters for two or more populations.

H0: Null Hypothesis
H1: Alternative Hypothesis

Hypothesis testing is a formal procedure that allows one to decide whether to accept or reject the Null Hypothesis.

Hypothesis Testing II

To conduct a hypothesis test:
1. Visualise the data (e.g. histograms)
2. Define the Null Hypothesis
3. Decide on an appropriate statistical test
4. Analyse sample data
5. Interpret results

Hypothesis Testing III

Statistical hypothesis tests typically:
1. Compute a test statistic from the data, e.g. the T value.
2. Produce a p-value from the observed test statistic.
3. By convention, the null hypothesis is rejected if p < 0.05.

Hypothesis Testing

                      Parametric              Non-Parametric
Distribution          Normal                  Any
Variance              Homogeneous             Any
Data                  Continuous              Ordinal/Nominal
Central Measure       Mean                    Median

Tests
Correlation           Pearson                 Spearman
2 groups              T-Test (unpaired)       Mann-Whitney Test
> 2 groups            One-way ANOVA           Kruskal-Wallis Test
2 groups (paired)     T-Test (paired)         Wilcoxon Test
> 2 groups (paired)   One-way ANOVA (paired)  Friedman Test

Test for Normality I

One assumption for parametric statistics is that your data (if continuous) follow a normal distribution. The Shapiro-Wilk test is one formal test for normality.

H0: normally distributed
H1: not normally distributed

shapiro.test(x)
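As a small illustrative sketch (simulated data, not the lecture's expression values): draw one sample from a normal distribution and one from a skewed (exponential) distribution, and apply shapiro.test() to each.

```r
set.seed(42)                  # reproducible random data
normal_sample <- rnorm(50)    # drawn from a normal distribution
skewed_sample <- rexp(50)     # drawn from an exponential (right-skewed)

# p-values typically land on opposite sides of 0.05 here:
shapiro.test(normal_sample)$p.value   # usually large: cannot reject H0
shapiro.test(skewed_sample)$p.value   # usually tiny: reject H0
```

Because the test is applied to random draws, the exact p-values vary with the seed; the point is the interpretation of p relative to 0.05, not the numbers themselves.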

Test for Normality II

Problem: test whether the guanylin expression values (Normal and Tumour) are normally distributed.

H0: Normal; H1: Not Normal

shapiro.test(unaffected_genes[,1])

	Shapiro-Wilk normality test
data:  unaffected_genes[, 1]
W = , p-value = 

p > 0.05: cannot reject the Null Hypothesis; normally distributed.

shapiro.test(affected_genes[,1])

	Shapiro-Wilk normality test
data:  affected_genes[, 1]
W = , p-value = 

p < 0.05: reject the Null Hypothesis; not normally distributed.

Test for Normality III: The Importance of Eyeballing

The Shapiro-Wilk test 'thinks' that guanylin expression in Normal cells is normally distributed. QQ-plots are another way to visualise distributions.

par(mfrow=c(1,2))
qqnorm(unaffected_genes[,1], main='Guanylin (Normal)')
qqline(unaffected_genes[,1], col='red')
qqnorm(affected_genes[,1], main='Guanylin (Tumour)')
qqline(affected_genes[,1], col='red')

Manipulating Function Outputs

When you run a hypothesis test function, R returns lots of information. This information can be stored as a variable and used.

shapiro.test(unaffected_genes[,1])

	Shapiro-Wilk normality test
data:  unaffected_genes[, 1]
W = , p-value = 

stest <- shapiro.test(unaffected_genes[,1])
str(stest)  # see what is in stest

List of 4
 $ statistic: Named num
  ..- attr(*, "names")= chr "W"
 $ p.value  : num
 $ method   : chr "Shapiro-Wilk normality test"
 $ data.name: chr "unaffected_genes[, 1]"
 - attr(*, "class")= chr "htest"

stest$p.value  # access elements of stest
[1]

Automatically accessing relevant information from a hypothesis test is important when running many hypothesis tests.
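A sketch of that idea in practice (on simulated data standing in for an expression data frame): sapply() applies the test to every column and collects just the p-values, which is how you would scale from one test to hundreds.

```r
set.seed(1)
# Hypothetical stand-in for an expression data frame: 5 'genes', 20 samples
expr <- as.data.frame(matrix(rnorm(100), ncol=5))

# Run shapiro.test() on each column, keeping only the $p.value element
pvals <- sapply(expr, function(column) shapiro.test(column)$p.value)
pvals   # one named p-value per column
```

The same pattern works for t.test(), wilcox.test() and the other htest-class functions, since they all store their p-value in $p.value.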

Comparing Two Sample Means I

Statistically, when comparing two samples, we are comparing the means (or middles) of the two samples: do the samples originate from the same distribution?

Typically, this is done with a T-Test, but:
T-Tests assume normality.
Student's T-Test assumes equal variance in both samples, whereas Welch's T-Test does not.

Comparing Two Sample Means II

H0: μ1 = μ2 (no difference in means)

Possible alternative hypotheses:
H1: μ1 ≠ μ2 (difference in means)
H1: μ1 > μ2 (mean 1 greater than mean 2)
H1: μ1 < μ2 (mean 1 less than mean 2)

t.test(unaffected_genes[,1], affected_genes[,1])
t.test(unaffected_genes[,1], affected_genes[,1], alternative="greater")
t.test(unaffected_genes[,1], affected_genes[,1], alternative="less")

Comparing Two Sample Means III

t.test(unaffected_genes[,1], affected_genes[,1])

	Welch Two Sample t-test
data:  unaffected_genes[, 1] and affected_genes[, 1]
t = , df = , p-value = 4.433e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
sample estimates:
mean of x  mean of y

Comparing Two Sample Means IV

By default, R does a Welch's T-Test (unequal variances). For a Student's T-Test:

t.test(unaffected_genes[,1], affected_genes[,1], var.equal=T)  # Student's T-Test

Our data are paired, i.e., two measures from the same individual. (Paired T-Tests do not rely on the assumption of equal variances.)

t.test(unaffected_genes[,1], affected_genes[,1], paired=T)

Comparing Two Sample Means V

But our data for guanylin expression certainly don't look normal. In this case, we can use the non-parametric Wilcoxon test. Non-parametric tests typically look at the ranks of the data.

wilcox.test(unaffected_genes[,1], affected_genes[,1], paired=T)

	Wilcoxon signed rank test
data:  unaffected_genes[, 1] and affected_genes[, 1]
V = 171, p-value = 7.629e-06
alternative hypothesis: true location shift is not equal to 0

Permutation Testing I

Whether to use parametric or non-parametric statistics is sometimes clear cut:
Assumptions met => parametric
Assumptions not met => non-parametric
Assumptions "almost" met => try both

An alternative is permutation testing. I like permutation testing because it:
Doesn't depend on assumptions.
Uses the data itself to make decisions.
Is robust to outliers.

But it can be computationally expensive. It is best illustrated with an example.

Permutation Testing II

1. Generate some random data.

set.seed(100)          # to regenerate the same data
v1 <- rnorm(10, 0, 1)  # mean=0, sd=1
v2 <- rnorm(10, 0, 1)  # mean=0, sd=1
diff_means <- mean(v1) - mean(v2)  # observed difference in means

2. Generate all possible permutations of the data.

v1v2 <- c(v1, v2)  # combine
N <- length(v1v2)  # 20
n <- length(v1)    # 10
p <- combn(N, n)   # all combinations of n chosen from N (permutations)

Note! set.seed() allows you to set a start point in a random process and thus to regenerate the same random data.

Permutation Testing III

3. Compute all possible differences in the means from the data.

holder <- numeric(ncol(p))  # initialise results holder
for (i in 1:ncol(p)) {
  # Compute difference in means for this permutation; add to holder
  holder[i] <- mean(v1v2[p[, i]]) - mean(v1v2[-p[, i]])
}

Permutation Testing IV

The holder contains differences in the mean for all possible permutations of the data.

holder[1:10]
[1]

4. Produce the distribution of differences in means, showing the location of the actual difference in means computed from the original data.

hist(holder)
abline(v=diff_means, lty=2)
abline(v=-diff_means, lty=2)

Permutation Testing V

P-values are computed by counting the permuted differences in means that are at least as extreme as the observed difference:
perm <= observed difference (less)
perm >= observed difference (greater)
|perm| >= |observed difference| (two-tailed)

sum(holder <= -abs(diff_means))/dim(p)[2]      # one-tailed (less), p=0.24
sum(holder >= abs(diff_means))/dim(p)[2]       # one-tailed (greater), p=0.24
sum(abs(holder) >= abs(diff_means))/dim(p)[2]  # two-tailed, p=0.48

Permutation Testing VI

R has a package to compute permutation tests called 'perm'.

install.packages('perm')
library(perm)

permTS(v1, v2, alternative="two.sided", method="exact.ce",
       control=permControl(tsmethod="abs"))

	Exact Permutation Test (complete enumeration)
data:  v1 and v2
p-value =   # equivalent to our permutation example
alternative hypothesis: true mean v1 - mean v2 is 0
sample estimates:
mean v1 - mean v2

Computing Correlations

Correlation coefficients are computed using the cor() function.

cor(unaffected_genes[,1], affected_genes[,1], method='pearson')
cor(unaffected_genes[,1], affected_genes[,1], method='spearman')

These calls just return the correlation coefficient as a number:
Pearson's: returns r (Pearson's r)
Spearman's: returns rho (ρ)
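If you also want a hypothesis test for the correlation, base R's cor.test() returns the coefficient together with a p-value (shown here on simulated data rather than the lecture's expression values):

```r
set.seed(7)
x <- rnorm(20)
y <- x + rnorm(20, sd=0.5)  # correlated with x by construction

cor(x, y, method='pearson')  # coefficient only

ct <- cor.test(x, y, method='pearson')
ct$estimate  # Pearson's r
ct$p.value   # p-value for H0: true correlation is 0
```

Like shapiro.test() and t.test(), cor.test() returns an htest object, so the same $p.value extraction pattern applies when testing many gene pairs.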

Lecture 3 – problem sheet A problem sheet entitled lecture_3_problems.pdf is located on the course website. Some of the code required for the problem sheet has been covered in this lecture. Consult the help pages if unsure how to use a function. Please attempt the problems for the next mins. We will be on hand to help out. Solutions will be posted this afternoon.

Thank You