Basic statistic inference in R


1 Basic statistic inference in R
Shai Meiri

2 “We expect to find differences between x and y” is a trivial saying
Everything differs!
“We expect to find differences between x and y” is a trivial statement.
The statistician within you asks: “Are the differences we found larger than expected by chance?”
The biologist within you asks: “Why are the differences I found in the direction and of the magnitude they are?”

3 Moments of central tendency
Mean
Arithmetic mean: Σxi/n
Geometric mean: (x1*x2*…*xn)^(1/n)
Harmonic mean: n/Σ(1/xi)

4 Moments of central tendency in R
1. Arithmetic mean: Σxi/n
Example: use the function “mean”
data<-c(2,3,4,5,6,7,8)
mean(data)
[1] 5
You can also use the .csv file:
dat<-read.csv("island_type_final2.csv")
attach(dat)
mean(lat)
[1]
2. Geometric mean: (x1*x2*…*xn)^(1/n)
Example:
data<-c(2,3,4,5,6,7,8)
exp(mean(log(data)))
[1]
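The three means listed on the previous slide can be compared side by side. The following is a minimal sketch using the same example vector; the variable names are illustrative, and the harmonic mean uses the standard definition n/Σ(1/xi), which is an assumption here because the slide's own formula did not survive in the transcript.

```r
# Three means of the same positive data (values taken from the slide)
data <- c(2, 3, 4, 5, 6, 7, 8)

arithmetic <- mean(data)               # sum(x) / n
geometric  <- exp(mean(log(data)))     # (x1 * x2 * ... * xn)^(1/n)
harmonic   <- 1 / mean(1 / data)       # n / sum(1 / x)

# The product form gives the same geometric mean,
# but the product can overflow for long vectors
geometric_product <- prod(data)^(1 / length(data))

print(c(arithmetic = arithmetic, geometric = geometric, harmonic = harmonic))
```

For positive data that are not all equal, the harmonic mean is smaller than the geometric mean, which in turn is smaller than the arithmetic mean.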

5 Moments of central tendency
A. Mean
B. Median
C. Mode
General example:
data<-c(2,3,4,5,6,7,8)
median(data)
[1] 5
Example from the .csv:
median(mass)
[1] 0.69
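The slide gives no example for the mode. Base R's `mode()` reports the storage mode of an object, not the most frequent value, so a small helper is needed. The sketch below is one possible way to write it; the function name `stat_mode` and the example vector are hypothetical.

```r
# Hypothetical helper: returns the most frequent value(s) of a vector
stat_mode <- function(x) {
  counts <- table(x)                              # frequency of each distinct value
  modes  <- names(counts)[counts == max(counts)]  # keep all values tied for first place
  # table() stores the values as character names, so convert back for numeric input
  if (is.numeric(x)) as.numeric(modes) else modes
}

sizes <- c(2, 3, 3, 4, 5, 5, 5, 8)   # made-up example data
stat_mode(sizes)
```

When several values are tied for the highest frequency, this version returns all of them.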

6 Moments of central tendency
Mean
Variance = Σ(xi-μ)^2 / n
Is the mean a good measure of what is happening in the population when the variance is low?
Example:
data<-c(2,3,4,5,6,7,8)
var(data)
[1] 4.666667
var(lat)
[1]
Note that var() in R returns the sample variance, which divides by n-1 rather than by n.

7 Moments of central tendency
Mean
Variance
The second moment of central tendency measures how much the data are scattered around the first moment (the mean).
Examples of measures based on the second moment are the variance, the standard deviation, the standard error, the coefficient of variation and the 90%, 95% and 99% confidence intervals.

8 Moments of central tendency
#for:
data<-c(2,3,4,5,6,7,8)
Sample size:
length(data)
Variance:
var(data)
Standard deviation:
sd(data)
Standard error:
se<-(sd(data)/length(data)^0.5)
se
[1] 0.8164966
Coefficient of variation:
CV<-sd(data)/mean(data)
CV
[1] 0.4320494
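The slides mention 90%, 95% and 99% confidence intervals without showing how to compute them. One common approach, sketched below, builds them from the standard error and a quantile of the t distribution. Using the t distribution and the helper name `mean_ci` are assumptions of this sketch, not something stated in the slides.

```r
data <- c(2, 3, 4, 5, 6, 7, 8)

# Hypothetical helper: t-based confidence interval for the mean
mean_ci <- function(x, level = 0.95) {
  n      <- length(x)
  se     <- sd(x) / sqrt(n)                        # standard error of the mean
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)    # two-sided critical value
  c(lower = mean(x) - t_crit * se, upper = mean(x) + t_crit * se)
}

ci90 <- mean_ci(data, 0.90)
ci95 <- mean_ci(data, 0.95)
ci99 <- mean_ci(data, 0.99)
rbind(ci90, ci95, ci99)
```

The 95% interval should agree with the one reported by `t.test(data)`, and the intervals widen as the confidence level rises.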

9 Moments of central tendency
Mean
Variance
Skew
A skewed frequency distribution is not symmetric.
Do you think that the arithmetic mean is a good measure of central tendency for a skewed frequency distribution?
What is the mean salary of the students here and Bill Gates?

10 Moments of central tendency
Skew
skew<-function(data){
m3<-sum((data-mean(data))^3)/length(data)
s3<-sqrt(var(data))^3
m3/s3}
skew(data)
The SE of skewness:
sdskew<-function(x) sqrt(6/length(x))

11 Moments of central tendency
Mean Variance Skew Kurtosis

12 Moments of central tendency
Kurtosis
kurtosis<-function(x){
m4<-sum((x-mean(x))^4)/length(x)
s4<-var(x)^2
m4/s4-3 }
kurtosis(x)
SE of kurtosis:
sdkurtosis<-function(x) sqrt(24/length(x))

13 A normal distribution can have any mean and variance, but its skewness and kurtosis should equal zero
Values of skew and kurtosis have their own sampling variance, and zero must lie outside their confidence interval for them to be significantly different from zero
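The check described here can be sketched by combining the `skew` and `sdskew` functions from the earlier slides. The data are simulated for illustration only, and the ±1.96 standard-error interval is a large-sample approximation assumed for this sketch; `skew_ci` is a hypothetical helper name.

```r
# Functions as defined on the earlier slides
skew <- function(data) {
  m3 <- sum((data - mean(data))^3) / length(data)
  s3 <- sqrt(var(data))^3
  m3 / s3
}
sdskew <- function(x) sqrt(6 / length(x))

set.seed(1)                        # simulated data, for illustration only
symmetric    <- rnorm(500)         # drawn from a normal distribution
right_skewed <- rexp(500)          # exponential: strongly right-skewed

# Approximate 95% interval: estimate +/- 1.96 standard errors
skew_ci <- function(x) skew(x) + c(-1.96, 1.96) * sdskew(x)

skew_ci(symmetric)      # an interval that includes zero gives no evidence of skew
skew_ci(right_skewed)   # an interval entirely above zero indicates right skew
```

The same pattern works for kurtosis by swapping in `kurtosis` and `sdkurtosis`.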

14 Residuals When doing statistics we’re creating models of the reality
One of the simplest models is the mean:
The mean height of Israeli citizens is 173 cm
The mean salary is 9271 ₪ (correct for April 2014)
The mean service in the IDF is 24 months (I guess)
(Pictured: 46,699 ₪ per month, excluding the bottles; Rabbi Dov Lior, served in the IDF for 1 month; 2.06 m)

15 Residuals When doing statistics we’re creating models of the reality
We can see here that our models (24 months, 9271 ₪ and 173 cm) are not very successful
The residual is how far a certain value is from the prediction of the model.
Omri Caspi is 33 cm away from the model “Israeli = 173” and 29 cm away from the more complicated model “Israeli man = 177, Israeli woman = 168”
Residual = ₪ 37428
Residual = -23 months of IDF service
Residual = 33 cm
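The residuals quoted on this slide are simply observed value minus model prediction. The sketch below redoes that arithmetic with the numbers from the slides; the vector names and the small `heights` sample are made up for illustration.

```r
# Model predictions (the means quoted on the slides)
model_mean <- c(salary_nis = 9271, service_months = 24, height_cm = 173)
# Observed values for the three individuals pictured on the slides
observed   <- c(salary_nis = 46699, service_months = 1, height_cm = 206)

# Residual = observed value - value predicted by the model
residuals_mean_model <- observed - model_mean
residuals_mean_model

# A slightly richer model: a separate mean height for men (177 cm)
residual_sex_model <- observed[["height_cm"]] - 177

# In R, a "mean only" model is lm(y ~ 1); its residuals are y - mean(y)
heights <- c(160, 168, 173, 177, 206)   # made-up sample
fit <- lm(heights ~ 1)
all.equal(unname(residuals(fit)), heights - mean(heights))
```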

16 Residuals When doing statistics we’re creating models of the reality
dat<-read.csv("island_type_final2.csv")
model<-lm(mass~iso+area+age+lat, data=dat)
out<-model$residuals
out
write.table(out, file = "residuals.txt",sep="\t",col.names=F,row.names=F)
#note that residual values are in the order entered (i.e., not alphabetic, not by residual size: first in, first out)
Residual = ₪ 37428
Residual = 33 cm
Residual = -23 months of service

17 Theoretical statistics and statistical inference
When we have data it is best that we first describe them: plot graphs, calculate the mean and so on
In statistical inference we test the behavior of our data against a certain hypothesis
We can present our hypothesis as a statistical model
For example:
The distribution of heights is normal
The number of species increases with area
The number of species increases with area as a power function with an exponent of 0.25

18 Frequency distribution*
How many observations are in each bin?
Describes the distribution of all observations
dat<-read.csv("island_type_final2.csv")
attach(dat)
names(dat)
hist(mass)
*graphic form = “histogram”

19 Frequency distribution
What did we learn?
dat<-read.csv("island_type_final2.csv")
attach(dat)
hist(mass)
There are no masses smaller than one tenth of a gram or larger than 100 kg
Lizards with a mass between 1 and 10 are very common; larger or smaller lizards are rare
The distribution is unimodal and skewed to the right

20 Frequency distribution
Histograms don’t have to be so ugly
dat<-read.csv("island_type_final2.csv")
attach(dat)
hist(mass, col="purple",breaks=25,xlab="log mass (g)",main="masses of island lizards - great data by Maria",cex.axis=1.2,cex.lab=1.5)

21 Presenting a categorical predictor with a continuous response variable
dat<-read.csv("island_type_final2.csv")
attach(dat)
plot(type,brood)
Always prefer a boxplot to a barplot

22 Presenting a continuous variable against another continuous variable
dat<-read.csv("island_type_final2.csv")
attach(dat)
plot(mass,clutch)
plot(mass,clutch,pch=16, col="blue")

23 Which test should we choose?
It changes according to the nature of our response variable (= y variable), and mostly according to the nature of our predictor variables
If the response variable is “success or failure” and the null hypothesis is that both are equally likely, we’ll use a binomial test
If the response variable is counts we’ll usually use chi-square or G
In many cases our response variable will be continuous (14 species, 78 individuals, 54 heartbeats per second, 7.3 eggs, 23 degrees)

24 Which test should we choose?
What is your response variable?
Continuous (14 species, 78 individuals, 23 degrees, 7.3 eggs): Soon…
Counts (frequency: 6 females, 4 males): Chi-square or G (= log-likelihood)
Success or failure (found the cheese/idiot): Binomial

25 Binomial test in R
You need to define the number of successes out of the whole sample size.
For example: 19 out of 34 is not significant; 19 out of 20 is significant
binom.test(19,34)
Exact binomial test
data: 19 and 34
number of successes = 19, number of trials = 34, p-value =
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
sample estimates:
probability of success
binom.test(19,20)
Exact binomial test
data: 19 and 20
number of successes = 19, number of trials = 20, p-value = 4.005e-05
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
sample estimates:
probability of success
0.95

26 Chi-square test in R: chisq.test
Data: lizard insularity & diet:
habitat diet species#
island carnivore 488
island herbivore 43
island omnivore 177
mainland carnivore 1901
mainland herbivore 101
mainland omnivore 269
M<-as.table(rbind(c(1901,101,269),c(488,43,177)))
chisq.test(M)
data: M
χ2 = 80.04, df = 2, p-value < 2.2e-16
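To see what `chisq.test` computes, the statistic can be rebuilt by hand from the same counts. This is a minimal sketch: the row and column labels are added for readability, and the variable names are illustrative.

```r
# Observed species counts from the slide: rows = habitat, columns = diet
M <- as.table(rbind(
  mainland = c(carnivore = 1901, herbivore = 101, omnivore = 269),
  island   = c(carnivore = 488,  herbivore = 43,  omnivore = 177)
))

# Expected count per cell under independence:
# row total * column total / grand total
expected <- outer(rowSums(M), colSums(M)) / sum(M)

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected
chi_sq <- sum((M - expected)^2 / expected)
dof    <- (nrow(M) - 1) * (ncol(M) - 1)
p_val  <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

test <- chisq.test(M)   # built-in test, for comparison
c(manual = chi_sq, builtin = unname(test$statistic))
```

The degrees of freedom are (rows - 1) × (columns - 1), which gives the df = 2 reported on the slide.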

27 Chi-square test in R: chisq.test
Now let’s use our dataset:
dat<-read.csv("island_type_final2.csv")
install.packages("reshape")
library(reshape)
cast(dat, type ~ what, length)
type anoles else gecko
Continental 7 45 45
Land_bridge 1 30 14
Oceanic 23 110 44
M<-as.table(rbind(c(7,45,45),c(1,30,14),c(23,110,44)))
chisq.test(M)
data: M
χ2 = 17.568, df = 4, p-value = 0.0015

28 Which test should we choose?
If our response variable is continuous then we’ll choose our test based on the predictor variables
If our predictor variable is categorical (Area 1, Area 2, Area 3 or species A, species B, species C) we’ll use ANOVA
If our predictor variable is continuous (temperature, body mass, height) we’ll use REGRESSION

29 t-test in R: t.test(x,y)
Sex size
female 79.7
male 85
female 120
male 133.0
female 118
male 126.0
female 105.8
male 112
female 106
male 121.0
female 95
male 111.0
female 86
male 93.0
female 65
male 75.0
female 230
male 240.0
dimorphism<-read.csv("ssd.csv",header=T)
attach(dimorphism)
names(dimorphism)
males<-size[Sex=="male"]
females<-size[Sex=="female"]
t.test(females,males)
Welch Two Sample t-test
data: females and males
t = , df = , p-value =
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
sample estimates:
mean of x mean of y

30 t-test in R (2): lm(x~y)
Sex size (same data as in the previous slide)
dimorphism<-read.csv("ssd.csv",header=T)
attach(dimorphism)
names(dimorphism)
model<-lm(size~Sex,data=dimorphism)
summary(model)
            Estimate  standard error  t      p value
(Intercept) 88.17     1.291           68.32  <2e-16 ***
Sexmale     3.932     1.825           2.154  0.031 *
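The link between the two slides can be checked directly: a linear model with a two-level factor gives the same t and p-value as a two-sample t-test with pooled variance. The sketch below uses only the nine female/male pairs visible in the slide's data excerpt, not the full `ssd.csv` file, so the numbers will differ from those on the slides. Setting `var.equal = TRUE` is needed because `t.test` otherwise runs the Welch version.

```r
# The nine female/male pairs shown in the slide's data excerpt
females <- c(79.7, 120, 118, 105.8, 106, 95, 86, 65, 230)
males   <- c(85, 133, 126, 112, 121, 111, 93, 75, 240)

dimorphism <- data.frame(
  size = c(females, males),
  Sex  = factor(rep(c("female", "male"), each = length(females)))
)

# Classical two-sample t-test with pooled variance (not R's Welch default)
tt <- t.test(size ~ Sex, data = dimorphism, var.equal = TRUE)

# The same comparison as a linear model with a two-level factor
model  <- lm(size ~ Sex, data = dimorphism)
lm_row <- summary(model)$coefficients["Sexmale", ]

c(t_test_p = tt$p.value, lm_p = lm_row[["Pr(>|t|)"]])
```

The intercept of the model is the mean of the first factor level (females), and the `Sexmale` estimate is the difference between the male and female means.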

31 Paired t-test in R: t.test(x,y,paired=TRUE)
Species size Sex
Xenagama_zonura 79.7 female
Xenagama_zonura 85 male
Xenosaurus_grandis 120 female
Xenosaurus_grandis 133.0 male
Xenosaurus_newmanorum 118 female
Xenosaurus_newmanorum 126.0 male
Xenosaurus_penai 105.8 female
Xenosaurus_penai 112 male
Xenosaurus_platyceps 106 female
Xenosaurus_platyceps 121.0 male
Xenosaurus_rectocollaris 95 female
Xenosaurus_rectocollaris 111.0 male
Zonosaurus_anelanelany 86 female
Zonosaurus_anelanelany 93.0 male
Zootoca_vivipara 65 female
Zootoca_vivipara 75.0 male
Zygaspis_nigra 230 female
Zygaspis_nigra 240.0 male
Zygaspis_quadrifrons 195 female
Zygaspis_quadrifrons 227.0 male
dimorphism<-read.csv("ssd.csv",header=T)
attach(dimorphism)
names(dimorphism)
males<-size[Sex=="male"]
females<-size[Sex=="female"]
t.test(females,males, paired=TRUE)
Paired t-test
data: females and males
t = , df = 3503, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
sample estimates:
mean of the differences
tapply(size,Sex,mean)
female male
88.17 92.10
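A paired t-test is the same as a one-sample t-test on the within-pair differences, which is why it can detect a consistent difference that the unpaired test misses. The sketch below illustrates this with the ten species pairs visible in the slide's excerpt only, so the results will not match the full-data output above.

```r
# The ten female/male pairs (one pair per species) from the slide's excerpt
females <- c(79.7, 120, 118, 105.8, 106, 95, 86, 65, 230, 195)
males   <- c(85, 133, 126, 112, 121, 111, 93, 75, 240, 227)

paired   <- t.test(females, males, paired = TRUE)
# Equivalent: one-sample t-test asking whether the mean difference is zero
one_samp <- t.test(females - males, mu = 0)
# Welch two-sample test, which ignores the pairing
unpaired <- t.test(females, males)

c(paired = paired$p.value, one_sample = one_samp$p.value,
  unpaired = unpaired$p.value)
```

In this excerpt the males are larger in every pair, but body size varies so much among species that only the paired analysis picks the difference up clearly.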

32 ANOVA in R: aov
species type clutch
Trachylepis_sechellensis Continental 0.6
Trachylepis_wrightii Continental 0.65
Tropidoscincus_boreus Continental 0.4
Tropidoscincus_variabilis Continental 0.45
Urocotyledon_inexpectata Continental 0.3
Varanus_beccarii Continental 0.58
Algyroides_fitzingeri Land_bridge
Anolis_wattsi Land_bridge
Archaeolacerta_bedriagae Land_bridge
Cnemaspis_affinis Land_bridge
Cnemaspis_limi Land_bridge 0.18
Cnemaspis_monachorum Land_bridge
Amblyrhynchus_cristatus Oceanic 0.35
Ameiva_erythrocephala Oceanic
Ameiva_fuscata Oceanic
Ameiva_plei Oceanic 0.41
Anolis_acutus Oceanic
Anolis_aeneus Oceanic
Anolis_agassizi Oceanic
Anolis_bimaculatus Oceanic
Anolis_bonairensis Oceanic
model<-aov(x~y)
island<-read.csv("island_type_final2.csv",header=T)
names(island)
[1] "species" "what" "family" "insular" "Archipelago" "largest_island"
[7] "area" "type" "age" "iso" "lat" "mass"
[13] "clutch" "brood" "hatchling" "productivity"
model<-aov(clutch~type,data=island)
summary(model)
           Df   Sum Sq  Mean Sq  F value  Pr(>F)
type       2    0.466            2.784    0.0635 .
Residuals  289  24.184

33 Post-hoc test for ANOVA in R
TukeyHSD(model)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = clutch ~ type, data = island)
$type
                         diff     lwr  upr     p adj
Land_bridge-Continental  0.124         0.2505  0.0561
Oceanic-Continental      0.0218        0.1108  0.8318
Oceanic-Land_bridge      -0.102        0.0163  0.1066
The differences are not significant. Notice that zero is always in the confidence interval. The difference between land-bridge islands and continental islands is very close to significance (p = 0.056)

34 Correlation in R: cor.test(x,y)
[Table: excerpt of the mass data]
island<-read.csv("island_type_final2.csv",header=T)
names(island)
[1] "species" "what" "family" "insular" "Archipelago" "largest_island"
[7] "area" "type" "age" "iso" "lat" "mass"
[13] "clutch" "brood" "hatchling" "productivity"
attach(island)
cor.test(mass,lat)
Pearson's product-moment correlation
data: mass and lat
t = , df = 317, p-value = 0.256
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
sample estimates:
cor
The value “cor” is the correlation coefficient r

35 Regression in R
Same data as in the previous example
lm (= “linear model”): lm(y~x)
model<-lm(mass~lat,data=island)
summary(model)
Call: lm(formula = mass ~ lat, data = island)
Residuals:
Min 1Q Median 3Q Max
Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                        9.934    <2e-16 ***
lat                                -1.138   0.256
Residual standard error: on 317 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 317 DF, p-value: 0.256
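The correlation test and the regression above report the same p-value (0.256), and this is no coincidence: with a single predictor, the test of the slope is equivalent to the test of Pearson's r, and R-squared equals r squared. The sketch below demonstrates this on simulated data, since the original .csv file is not available here; the variable names simply mirror the slide.

```r
set.seed(123)                          # simulated data, for illustration only
lat  <- runif(100, min = 0, max = 60)
mass <- 1 + 0.01 * lat + rnorm(100)

ct    <- cor.test(mass, lat)           # Pearson correlation test
model <- lm(mass ~ lat)                # simple linear regression
sm    <- summary(model)

# Same p-value from both analyses
c(cor_p = ct$p.value, slope_p = sm$coefficients["lat", "Pr(>|t|)"])
# R-squared of the regression is the squared correlation coefficient
c(r_squared = unname(ct$estimate)^2, lm_r_squared = sm$r.squared)
```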

36 lm vs. aov
We can also use ‘lm’ with data that fit ANOVA
In this case we’ll receive everything that ‘summary’ gives for the ‘lm’ function in regression, including parameter estimates, SE, differences between factor levels and p-values for contrasts between categories of our predictor variable

37 aov vs. lm
We can also use ‘lm’ on data that fit ANOVA
In this case we’ll receive everything that ‘summary’ gives for the ‘lm’ function in regression, including parameter estimates, SE, differences between factor levels and p-values for contrasts between category pairs of our predictor variable
island<-read.csv("island_type_final2.csv",header=T)
model<-aov(clutch~type,data=island)
model2<-lm(clutch~type,data=island)
summary(model)
summary(model2)
aov results:
           Df   Sum Sq  Mean Sq  F value  Pr(>F)
type       2    0.466            2.784    0.0635 .
Residuals  289  24.184
lm results:
                 Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                            11.11    <2e-16 ***
typeLand_bridge                        2.309    0.0216 *
typeOceanic                            0.578    0.5635
Residual standard error: on 289 degrees of freedom
(27 observations deleted due to missingness)
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 2 and 289 DF, p-value:
More later on
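The equivalence of the two fits can be verified on any one-way design. The sketch below uses `PlantGrowth`, a small dataset that ships with base R, as a stand-in for the island data; it is not the lecture's dataset.

```r
# PlantGrowth ships with base R: plant weight under a control and two treatments
fit_aov <- aov(weight ~ group, data = PlantGrowth)
fit_lm  <- lm(weight ~ group, data = PlantGrowth)

# Overall F statistic from each fit
f_aov <- summary(fit_aov)[[1]][["F value"]][1]
f_lm  <- unname(summary(fit_lm)$fstatistic["value"])
c(aov = f_aov, lm = f_lm)

anova(fit_lm)                   # the same ANOVA table as summary(fit_aov)
summary(fit_lm)$coefficients    # contrasts of each treatment with the control
```

The `lm` coefficients compare each level with the first (reference) level only, whereas `TukeyHSD` on the `aov` fit compares all pairs with a multiple-comparison correction, so their p-values need not agree.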

38 Assumptions of statistical tests (all statistical tests)
Random sampling (an assumption of all tests, not only parametric ones)
Independence (spatial, phylogenetic etc.)
(Pictured: a non-random, non-independent sample of Israeli people)

39 Assumptions of parametric tests. A. ANOVA
In addition to the assumptions of all tests:
Homoscedasticity
Normal distribution of the residuals
“Comments on earlier drafts of this manuscript made it clear that for many readers who analyze data but who are not particularly interested in statistical questions, any discussion of statistical methods becomes uncomfortable when the term ‘error variance’ is introduced.”
Smith, R. J Use and misuse of the reduced major axis for line-fitting. American Journal of Physical Anthropology 140:
(Pictured: Richard Smith & 3 friends)
Reading material: Sokal & Rohlf Biometry. 3rd edition. Pages (especially for normality)

40 Always look at your data
Don’t just rely on the statistics!
Anscombe's quartet
Summary statistics are the same for all four data sets:
n = 11
means of x & y (9, 7.5)
variance of y (4.12)
regression & residual SS
correlation r = 0.816 (R2 = 0.67)
regression line (y = 3 + 0.5x)
Anscombe Graphs in statistical analysis. The American Statistician 27: 17–21.
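Anscombe's quartet ships with base R as the data frame `anscombe` (columns `x1` to `x4` and `y1` to `y4`), so the claim can be checked directly. The following is a minimal sketch that tabulates the summary statistics for the four sets; the object names are illustrative.

```r
# `anscombe` is part of base R (datasets package)
quartet <- lapply(1:4, function(i) {
  x   <- anscombe[[paste0("x", i)]]
  y   <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(n = length(x), mean_x = mean(x), mean_y = mean(y), var_y = var(y),
    r = cor(x, y),
    intercept = unname(coef(fit)[1]), slope = unname(coef(fit)[2]))
})

# One row per data set, rounded to two decimals
summary_table <- round(do.call(rbind, quartet), 2)
rownames(summary_table) <- paste("set", 1:4)
summary_table
```

The rows come out nearly identical, yet plotting each pair, for example with `plot(anscombe$x2, anscombe$y2)`, shows four very different patterns.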

41 Assumptions of parametric tests. B. Regression
1. Homoscedasticity
Smith, R. J Use and misuse of the reduced major axis for line-fitting. American Journal of Physical Anthropology 140:

42 Assumptions of parametric tests. B. Regression
1. Homoscedasticity
2. The explanatory variable was sampled without error

43 Assumptions of parametric tests. B. Regression
1. Homoscedasticity
2. The explanatory variable was sampled without error
3. Normal distribution of the residuals of each response variable

44 Assumptions of parametric tests. B. Regression
1. Homoscedasticity
2. The explanatory variable was sampled without error
3. Normal distribution of the residuals of each response variable
4. Equality of variance between the values of the explanatory variables

45 Assumptions of parametric tests. B. Regression
1. Homoscedasticity
2. The explanatory variable was sampled without error
3. Normal distribution of the residuals of each response variable
4. Equality of variance between the values of the explanatory variables
5. Linear relationship between the response and the predictor

46 How will we test if our model follows the assumptions?
R has very useful model diagnostic functions, which allow us to evaluate graphically how well our model meets the model assumptions (especially in regression)
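The slide does not name the functions it has in mind. A likely candidate is the `plot` method for `lm` objects, which draws the standard diagnostic plots. The sketch below assumes that is what is meant and uses the built-in `cars` dataset rather than the lecture data; the Shapiro-Wilk test is added here as a numerical complement and is not mentioned on the slide.

```r
# `cars` ships with base R: stopping distance (dist) against speed
model <- lm(dist ~ speed, data = cars)

# Four standard diagnostic plots in one window:
# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
old_par <- par(mfrow = c(2, 2))
plot(model)
par(old_par)

# A numerical complement to the Q-Q plot: Shapiro-Wilk test on the residuals
normality <- shapiro.test(residuals(model))
normality$p.value
```

In the residuals vs fitted plot, a funnel shape suggests heteroscedasticity and a curve suggests a non-linear relationship; in the Q-Q plot, points far from the line suggest non-normal residuals.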

47 What can we do when our data doesn’t follow the assumptions?
We can ignore it and hope that our test is robust to violations of the assumptions: this is not as unreasonable as it sounds
Use non-parametric tests
Use generalized linear models (glm), which means:
Transformation (in glm this means changing the link function)
Changing the error distribution (in glm) to a non-normal distribution
Use non-linear tests
Use randomization (more about it in Roi’s lessons)
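As an illustration of the glm option, count data are often modelled with a Poisson error distribution and a log link instead of a normal-error linear model. The sketch below uses simulated data; the variable names and the true coefficients (0.2 and 0.08) are made up for the example.

```r
set.seed(42)                                   # simulated data, for illustration only
temperature <- runif(200, min = 10, max = 30)
# Counts whose expected value rises exponentially with temperature
eggs <- rpois(200, lambda = exp(0.2 + 0.08 * temperature))

# Ordinary linear model: assumes normal errors with constant variance
fit_lm  <- lm(eggs ~ temperature)
# Generalized linear model: Poisson errors with a log link
fit_glm <- glm(eggs ~ temperature, family = poisson(link = "log"))

coef(fit_glm)   # on the log scale; compare with the simulated 0.2 and 0.08
```

Changing `family` changes the error distribution, and changing `link` inside the family changes the transformation, which are the two adjustments listed on the slide.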

48 I think it is really wrong to have a presentation without any animal pictures in it
Non-parametric tests
Non-parametric tests do not assume equality of variance or a normal distribution. They are based on ranks
Disadvantages:
There are no tests for models with multiple predictors
Their statistical power is often very low compared to an equivalent parametric test
They do not give you parameter estimates (slopes, intercepts)

49 I think it is really wrong to have a whole presentation without any animal pictures in it
Non-parametric tests
Non-parametric tests do not assume equality of variance or a normal distribution. They are based on ranks
Disadvantages:
There are no tests for models with multiple predictors
Their statistical power is often lower than that of an equivalent parametric test
They do not allow parameter estimation (slopes and intercepts)

50 The animal photographed is not related to the lectures
A few useful non-parametric tests
(Pictured: Orycteropus afer)
The chi-square test is a non-parametric test
Kolmogorov-Smirnov is a non-parametric test used to compare two frequency distributions (or to compare “our” distribution to a known distribution, for example a normal distribution)
Mann-Whitney U (= Wilcoxon rank sum) is a non-parametric test equivalent to Student’s t-test
The Wilcoxon signed-rank test replaces the paired t-test
Kruskal-Wallis replaces one-way ANOVA
The Spearman and Kendall’s tau tests replace correlation tests
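Each test in this list has a counterpart in base R's `stats` package. The sketch below shows the calls on simulated data; the group names and values are made up, and treating `group_a` and `group_b` as paired is only for demonstrating the syntax.

```r
set.seed(7)                                    # simulated data, for illustration only
group_a <- rnorm(30, mean = 10)
group_b <- rnorm(30, mean = 11)
group_c <- rnorm(30, mean = 12)

# Mann-Whitney U / Wilcoxon rank sum: two independent samples
mw <- wilcox.test(group_a, group_b)
# Wilcoxon signed-rank: paired samples
sr <- wilcox.test(group_a, group_b, paired = TRUE)
# Kruskal-Wallis: more than two groups
kw <- kruskal.test(list(group_a, group_b, group_c))
# Spearman and Kendall rank correlations
sp <- cor.test(group_a, group_b, method = "spearman")
kt <- cor.test(group_a, group_b, method = "kendall")
# Kolmogorov-Smirnov: a sample against a fully specified normal distribution
ks <- ks.test(group_a, "pnorm", mean = 10, sd = 1)

sapply(list(mann_whitney = mw, signed_rank = sr, kruskal = kw,
            spearman = sp, kendall = kt, ks = ks),
       function(result) result$p.value)
```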

51 Non-parametric tests in R
Kolmogorov-Smirnov is a non-parametric test used to compare two frequency distributions (or to compare “our” distribution to a known distribution, for example a normal distribution)
(Pictured: Orycteropus afer. The animal photographed is not related to the lectures)
We need to define in R the grouping variable and the response: let’s say we want to compare the frequency distributions of lizard body mass on oceanic and land-bridge islands
island<-read.csv("island_type_final2.csv",header=T)
attach(island)
levels(type)
[1] "Continental" "Land_bridge" "Oceanic"
Land_bridge<-mass[type=="Land_bridge"]
Oceanic<-mass[type=="Oceanic"]
ks.test(Land_bridge, Oceanic)
Two-sample Kolmogorov-Smirnov test
data: Land_bridge and Oceanic
D = , p-value =
alternative hypothesis: two-sided

