1
Statistical Concepts and Analysis in R Fish 552: Lecture 9
2
Outline Probability Distributions Exploratory Data Analysis Comparison of two samples
3
Random variables For a sample space, S, a random variable is any rule that associates a number with each outcome in S. A random variable X is continuous if its set of possible values is an entire interval of numbers. A random variable X is discrete if its set of possible values is a finite set or a countably infinite sequence. Often we describe discrete or continuous observations by a probability model –e.g. binomial, normal, ...
4
Probability distributions in R R includes a comprehensive set of probability distributions that can be used to simulate and model data. If the function for the probability model is named xxx –pxxx: evaluate the cumulative distribution function P(X ≤ x) –dxxx: evaluate the probability mass or density function f(x) –qxxx: evaluate the quantile function (given q, the smallest x such that P(X ≤ x) ≥ q) –rxxx: generate random draws from the model xxx
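As a minimal illustration (the binomial example below is an assumption for demonstration, not from the slides), the four prefixes work the same way for every distribution listed on the next slide:
# binomial with 10 trials and success probability 0.5
dbinom(5, size = 10, prob = 0.5)     # P(X = 5): probability mass at 5
pbinom(5, size = 10, prob = 0.5)     # P(X <= 5): cumulative probability
qbinom(0.95, size = 10, prob = 0.5)  # smallest x with P(X <= x) >= 0.95
rbinom(3, size = 10, prob = 0.5)     # three random draws from the model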
5
Probability distributions in R
Distribution         R name    Additional arguments
beta                 beta      shape1, shape2
binomial             binom     size, prob
Cauchy               cauchy    location, scale
chi-squared          chisq     df
exponential          exp       rate
F                    f         df1, df2
gamma                gamma     shape, scale
geometric            geom      prob
hypergeometric       hyper     m, n, k
log-normal           lnorm     meanlog, sdlog
logistic             logis     location, scale
negative binomial    nbinom    size, prob
normal               norm      mean, sd
Poisson              pois      lambda
Student's t          t         df
uniform              unif      min, max
Weibull              weibull   shape, scale
Wilcoxon             wilcox    m, n
6
Standard normal distribution
7
Quantile qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) –Do these numbers look familiar? > quants <- qnorm(c(0.01, 0.025, 0.05, 0.95, 0.975, 0.99)) > round(quants, 2) [1] -2.33 -1.96 -1.64 1.64 1.96 2.33 Probability pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) > pnorm(quants) [1] 0.010 0.025 0.050 0.950 0.975 0.990
8
Standard normal distribution Density dnorm(x, mean = 0, sd = 1, log = FALSE) > dnorm(quants) [1] 0.02665214 0.05844507 0.10313564 0.10313564 0.05844507 0.02665214 Random normal variable rnorm(n, mean = 0, sd = 1) > rnorm(1) [1] -0.9392975 Did you get the same random number?
9
Random number generation Random number generators are actually pseudo-random number generators: deterministic sequences of numbers that behave like random numbers. The state of the seed can be viewed with –.Random.seed By default, R will initialize the random sequence based on the start time of the program. The user can initialize the random sequence with the set.seed() function –set.seed(seed) –seed can be any integer from −2147483648 through 2147483647
10
Random number generation When simulating data or working with random numbers, ALWAYS use set.seed() and save your script or the number used to generate the random seed. > set.seed(34) > rnorm(1) [1] -0.1388900
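A minimal sketch of why this matters: re-running with the same seed reproduces exactly the same draws, so a simulation can be checked later.
set.seed(34)
x1 <- rnorm(5)
set.seed(34)
x2 <- rnorm(5)
identical(x1, x2)  # TRUE: the same seed gives the same sequence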
11
sample() function sample() can be used to generate random numbers from a discrete distribution. –With and without replacement –Equal and weighted probabilities This is the “work-horse” function of many modern statistical techniques –Bootstrap, MCMC, ... ?sample
12
sample() function Roll a die 10 times > sample(1:6, 10, replace = TRUE) [1] 6 2 2 6 2 5 3 4 2 3 Flip a coin ten times > sample(c("H", "T"), 10, replace = TRUE) [1] "T" "H" "H" "T" "T" "H" "H" "H" "H" "T" Pick 5 cards > cards <- paste(rep(c("A", 2:10, "J", "Q", "K"), 4), c("Heart", "Diamond", "Spade", "Club")) > sample(cards, 5) [1] "A Club" "Q Club" "9 Diamond" "9 Club" "2 Heart" paste() concatenates two vectors after converting to strings
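The weighted-probability case mentioned on the previous slide uses the prob argument; the weights below are only an illustrative assumption.
# a loaded die that lands on 6 half the time
sample(1:6, 10, replace = TRUE, prob = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.5))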
13
replicate() function Useful way to avoid loops (FISH 553). replicate() repeatedly evaluates an expression n times. We are interested in the statistical properties of the median as an estimator of central tendency for an exponential distribution with rate 1 and small sample sizes. What is the standard deviation of this estimator? –Simulate it! > medianResults <- replicate(n = 999, expr = median(rexp(n = 10, rate = 1))) > sd(medianResults) [1] 0.3012640 rexp() generates random draws from the exponential distribution
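replicate() also combines naturally with sample() for the bootstrap mentioned two slides back; a minimal sketch, assuming a hypothetical data vector x:
x <- rexp(30, rate = 1)                                           # hypothetical data
bootMedians <- replicate(999, median(sample(x, replace = TRUE)))  # resample with replacement
sd(bootMedians)                                                   # bootstrap standard error of the median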
14
Hands-on exercise 1 Generate 100 random normal numbers with mean 100 and standard deviation 10. What proportion of random numbers are 2 standard deviations away from the mean? Select 6 numbers from a lottery containing 56 balls. Go to: http://www.walottery.com/sections/WinningNumbers/ Did you win? For a standard normal random variable, find the number z such that P(-z ≤ Z ≤ z) = 0.23 –Use the symmetry of the normal distribution
15
Exploratory data analysis Important starting point in any study, analysis, etc. Numerical summaries can be used to quantitatively assess characteristics of data –summary() –boxplot() –fivenum() –sd(), range(), etc. Visualizing characteristics of data is often more informative –There are a lot of built-in plots; find the one you want!
16
Edgar Anderson’s Iris Data > head(iris) Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa Species is coded as factor –Important for several plotting routines > is.factor(iris$Species) [1] TRUE
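The numerical summaries listed on the previous slide apply directly to this data frame; a few illustrative calls:
summary(iris$Sepal.Length)                       # min, quartiles, mean, max
fivenum(iris$Sepal.Length)                       # Tukey's five-number summary
sd(iris$Sepal.Length)                            # standard deviation
range(iris$Sepal.Length)                         # minimum and maximum
tapply(iris$Sepal.Length, iris$Species, mean)    # mean sepal length by species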
17
pairs() Produces a matrix of scatterplots pairs(iris[,1:4], main = "Edgar Anderson's Iris Data", pch = 21, bg = rep(c("red", "green3", "blue"), table(iris$Species))) rep() here repeats “red” once for each setosa observation, “green3” once for each versicolor observation, and “blue” once for each virginica observation
19
boxplot() ?boxplot boxplot(Sepal.Length ~ Species, data=iris, col=c("red", "green3", "blue"), main = "Edgar Anderson's Iris Data")
20
Histograms Useful for examining the shape, center and spread of data ?hist –Lots of options When values are continuous, measurements must be sub-divided into intervals. In R the user can specify this by a –Vector of break points –Number of breaks –Character string specifying an algorithm (default = nclass.Sturges) Use the default to avoid introducing bias
21
hist(iris$Sepal.Length[iris$Species == "setosa"], col="red", xlab="Setosa Sepal Length", main="Histogram")
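The three ways of specifying the intervals listed on the previous slide would look something like this (the particular break points and counts below are illustrative choices, not from the slides):
setosaSL <- iris$Sepal.Length[iris$Species == "setosa"]
hist(setosaSL, breaks = seq(4, 6, by = 0.25))   # explicit vector of break points
hist(setosaSL, breaks = 5)                      # suggested number of breaks
hist(setosaSL, breaks = "Sturges")              # named algorithm (the default)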
22
Adding density curves Kernel density estimation is a non-parametric way of estimating a probability density function –The bandwidth determines how smooth the estimated curve is Smaller bandwidths = more wiggly Bias can be introduced, and often it is best to let R choose the optimal bandwidth There are also many kernel choices, and for now it is best to just use the default
23
hist(iris$Sepal.Length[iris$Species == "setosa"], col="red", freq=FALSE, border="gray", xlab="Setosa Sepal Length", main="Histogram") lines(density(iris$Sepal.Length[iris$Species == "setosa"]), lwd=2)
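To see the effect of the bandwidth on the fitted curve, density() lets you scale the automatically chosen bandwidth with its adjust argument; the values below are only illustrative:
setosaSL <- iris$Sepal.Length[iris$Species == "setosa"]
hist(setosaSL, col = "red", freq = FALSE, border = "gray", xlab = "Setosa Sepal Length", main = "Histogram")
lines(density(setosaSL), lwd = 2)                  # default bandwidth
lines(density(setosaSL, adjust = 0.5), lty = 2)    # half the bandwidth: wigglier
lines(density(setosaSL, adjust = 2), lty = 3)      # twice the bandwidth: smoother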
24
Other useful plots
Plot                                               R function
Barplot                                            barplot()
Contour lines of a two-dimensional distribution    contour()
Plot of two variables conditioned on the other     coplot()
Dotchart                                           dotchart()
Pie chart                                          pie()
3-dimensional surface plot                         persp()
Quantile-quantile plot                             qqplot()
Stripchart                                         stripchart()
25
Basic statistical tests in R R has many built in functions to perform classical statistical tests –correlation – cor.test() –Chi squared – chisq.test() –Test of equal proportions - prop.test() Covered in Lecture 10 –ANOVA – aov() –Linear Models – lm()
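A minimal sketch of two of these tests on made-up numbers (all counts below are purely illustrative):
# test of equal proportions: 40/100 successes vs 30/100 successes
prop.test(x = c(40, 30), n = c(100, 100))
# chi-squared test of independence on a hypothetical 2 x 2 table of counts
counts <- matrix(c(20, 30, 25, 25), nrow = 2)
chisq.test(counts)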
26
Comparison of two samples The t-test tests whether two population means (with unknown population variances) are significantly different from each other (or whether one is smaller/larger). H0: μ1 − μ2 = 0 H1: μ1 − μ2 ≠ 0 Independent vs. paired t-test The two-sample independent t-test assumes –Normality of populations (but not an issue for large samples) –Equal variances (this can be relaxed) –Independent samples
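For reference, the equal-variance form of the test statistic (a standard textbook formula, not given on the slide) is t = (x̄1 − x̄2) / (sp · sqrt(1/n1 + 1/n2)), where sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2), compared against a t-distribution with n1 + n2 − 2 degrees of freedom.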
27
t.test() The t.test() function in R can be used to perform many variants of the t-test –?t.test Specify direction, μ, α Two methods were used to determine the latent heat of fusion of ice. The investigator wishes to find out how much (if at all) the methods differed. –The data are entered as methodA and methodB
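The direction, hypothesized difference and confidence level map onto t.test() arguments; a sketch with hypothetical vectors x and y (not the slide's data):
t.test(x, y,
       alternative = "greater",  # one-sided: mean of x greater than mean of y
       mu = 0,                   # hypothesized difference in means
       conf.level = 0.95,        # confidence level, i.e. 1 - alpha
       var.equal = FALSE,        # Welch test (the default)
       paired = FALSE)           # independent rather than paired samples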
28
Assumptions If these populations were normal, the points should fall close to the line –Method A has a strong left skew; method B has a right skew qqnorm(methodA); qqline(methodA)
29
Assumptions Equal variance seems like a reasonable assumption
30
t-test The Welch two sample t-test (default) assumes unequal variances > t.test(methodA, methodB) Welch Two Sample t-test data: methodA and methodB t = 3.274, df = 12.03, p-value = 0.006633 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.01405393 0.06992684 sample estimates: mean of x mean of y 80.02062 79.97862
31
Accessing output The results shown in the previous slide can be accessed by assigning a name to the output and accessing individual elements as usual –What data type do you think this is? > resultsAB <- t.test(methodA, methodB) > names(resultsAB) [1] "statistic" "parameter" "p.value" "conf.int" "estimate" "null.value" "alternative" "method" "data.name" > resultsAB$p.value [1] 0.006633411
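The returned object is in fact a list (with class "htest"), so it can also be inspected with str() or indexed with [[ ]]:
class(resultsAB)         # "htest"
is.list(resultsAB)       # TRUE
str(resultsAB)           # structure of every component
resultsAB[["conf.int"]]  # same as resultsAB$conf.int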
32
t-test The box-plot suggested that the two variances might be approximately equal > var(methodA) [1] 0.0005654231 > var(methodB) [1] 0.0009679821 More formally we can perform an F-test to test for equality in the variances
33
F-test If we take two samples of sizes n1 and n2 from normal populations, then under H0 the ratio of the sample variances has an F-distribution with n1 − 1 and n2 − 1 degrees of freedom. H0: σ1²/σ2² = 1 H1: σ1²/σ2² ≠ 1
34
F-test There is no evidence of a significant difference between the two variances > var.test(methodA, methodB) F test to compare two variances data: methodA and methodB F = 0.5841, num df = 12, denom df = 7, p-value = 0.3943 alternative hypothesis: true ratio of variances is not equal to 1 95 percent confidence interval: 0.1251922 2.1066573 sample estimates: ratio of variances 0.5841255
35
t-test We should now specify equal variances > t.test(methodA, methodB, var.equal = TRUE) Two Sample t-test data: methodA and methodB t = 3.4977, df = 19, p-value = 0.002408 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.01686368 0.06711709 sample estimates: mean of x mean of y 80.02062 79.97862
36
Non-parametric tests The normality assumption (based on the qq-plots) is probably not true in this case, particularly with small sample sizes The two-sample Wilcoxon (Mann-Whitney) test is a useful alternative when the assumptions of the t-test are not met –Have you seen this test before?
37
Mann-Whitney test How it works: –Arrange all the observations into a single ranked series –Sum the ranks from sample 1 (R1) and sample 2 (R2) –Calculate the test statistic U The distribution under H0 can be found by enumerating all possible subsets of the ranks (assuming each subset is equally likely) and comparing the observed test statistic to this distribution. –This can be a cumbersome calculation for large sample sizes
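In one common formulation (not spelled out on the slide), with sample sizes n1 and n2 and rank sum R1 for sample 1: U1 = R1 − n1(n1 + 1)/2, U2 = n1·n2 − U1, and the test statistic is U = min(U1, U2). Note that R's wilcox.test() reports its statistic as W, which corresponds to U1 here.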
38
Mann-Whitney test When there are ties in the data, this method provides an approximate p-value If both samples contain fewer than 50 observations and there are no ties, by default R will calculate an exact p-value –When this is not the case, a normal approximation is used ?wilcox.test
39
Mann-Whitney test Once again we reach the same conclusion > wilcox.test(methodA, methodB) Wilcoxon rank sum test with continuity correction data: methodA and methodB W = 88.5, p-value = 0.008995 alternative hypothesis: true location shift is not equal to 0 Warning message: In wilcox.test.default(methodA, methodB) : cannot compute exact p-value with ties Ranks are sensitive to rounding, and R will produce a warning if rounding makes observations tie
40
Hands-on exercise 2 Create 10 qqnorm plots (including qqline ) by sampling 30 points from the following distributions: normal, exponential, t, Cauchy Make between 2 and 5 of the plots come from a normal distribution Have your partner guess which plots are actually from a normal distribution