Sihua Peng, PhD Shanghai Ocean University 2018.10 Modern Biostatistics 2. Data sets
Contents
  Introduction to R
  Data sets
  Introductory Statistical Principles
  Sampling and experimental design with R
  Graphical data presentation
  Simple hypothesis testing
  Introduction to Linear models
  Correlation and simple linear regression
  Single factor classification (ANOVA)
  Nested ANOVA
  Factorial ANOVA
  Simple Frequency Analysis
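R functions
Each function performs a specific task and is called by writing its name followed by parentheses, for example:
mean(): arithmetic mean
sum(): summation
plot(): plotting
sort(): sorting
log(), log2(), log10(): natural, base-2 and base-10 logarithms
exp(): exponential
sin(), cos(): sine and cosine
sd(): standard deviation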
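A minimal sketch of calling some of the functions listed above. The vector x and its values are made up for illustration and are not part of the MACNALLY example.

```r
# Hypothetical example vector (illustration only)
x <- c(2, 4, 4, 10)

x_mean   <- mean(x)                    # arithmetic mean: 5
x_sum    <- sum(x)                     # total: 20
x_sorted <- sort(x, decreasing = TRUE) # largest first: 10 4 4 2
x_sd     <- sd(x)                      # sample standard deviation (n - 1 denominator)

log10(1000)  # base-10 logarithm: 3
log2(8)      # base-2 logarithm: 3
exp(0)       # e^0: 1
```

Arguments such as decreasing = TRUE are passed by name inside the parentheses; typing ?sort at the prompt opens the help page that lists them.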
Data frames: An example
First, generate the three variables separately (the site labels are excluded because they are not variables):
> HABITAT <- factor(c("Mixed", "Gipps.Manna", "Gipps.Manna", "Gipps.Manna", "Mixed", "Mixed", "Mixed", "Mixed"))
> GST <- c(3.4, 3.4, 8.4, 3, 5.6, 8.1, 8.3, 4.6)
> EYR <- c(0, 9.2, 3.8, 5, 5.6, 4.1, 7.1, 5.3)
Next, pass the names of the vectors as arguments to the data.frame() function to combine the three separate variables into a single data frame (data set), which we will call MACNALLY.
> MACNALLY <- data.frame(HABITAT, GST, EYR)
Notice that each vector (variable) becomes a column in the data frame and that each row represents a single sampling unit. By default, the rows are named 1, 2, ..., n, where n is the number of rows in the data frame. These names can be changed to the names of the sampling units by assigning a character vector of alternative names with row.names().
> row.names(MACNALLY) <- c("Reedy Lake", "Pearcedale", "Warneet", "Cranbourne",
+     "Lysterfield", "Red Hill", "Devilbend", "Olinda")
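The steps of this example can be put together in one script, as sketched below. The values are those from the slides; the checks at the end (dim(), levels()) are additions for illustration.

```r
# Build the three variables
HABITAT <- factor(c("Mixed", "Gipps.Manna", "Gipps.Manna", "Gipps.Manna",
                    "Mixed", "Mixed", "Mixed", "Mixed"))
GST <- c(3.4, 3.4, 8.4, 3, 5.6, 8.1, 8.3, 4.6)
EYR <- c(0, 9.2, 3.8, 5, 5.6, 4.1, 7.1, 5.3)

# Combine them into one data frame and name the rows after the sites
MACNALLY <- data.frame(HABITAT, GST, EYR)
row.names(MACNALLY) <- c("Reedy Lake", "Pearcedale", "Warneet", "Cranbourne",
                         "Lysterfield", "Red Hill", "Devilbend", "Olinda")

dim(MACNALLY)             # 8 rows (sites), 3 columns (variables)
levels(MACNALLY$HABITAT)  # factor levels are sorted alphabetically
MACNALLY                  # print the whole data frame
```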
Access the data in a data frame
MACNALLY$HABITAT      column 1 (HABITAT)
MACNALLY$GST          column 2 (GST)
MACNALLY$EYR          column 3 (EYR)
MACNALLY[1,]          first row
MACNALLY[,3]          third column
MACNALLY[3,2]         element in the third row and second column
i=1:4; MACNALLY[i,]   rows 1 to 4
MACNALLY[,2:3]        columns 2 to 3
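The indexing forms above can be tried on the eight-site example, as in the sketch below. Indexing by row and column name and the drop = FALSE argument are not shown on the slide and are added here for illustration.

```r
MACNALLY <- data.frame(
  HABITAT = factor(c("Mixed", "Gipps.Manna", "Gipps.Manna", "Gipps.Manna",
                     "Mixed", "Mixed", "Mixed", "Mixed")),
  GST = c(3.4, 3.4, 8.4, 3, 5.6, 8.1, 8.3, 4.6),
  EYR = c(0, 9.2, 3.8, 5, 5.6, 4.1, 7.1, 5.3),
  row.names = c("Reedy Lake", "Pearcedale", "Warneet", "Cranbourne",
                "Lysterfield", "Red Hill", "Devilbend", "Olinda")
)

cell         <- MACNALLY[3, 2]               # third row, second column: 8.4
cell_by_name <- MACNALLY["Warneet", "GST"]   # same element, addressed by name
third_col    <- MACNALLY[, 3]                # a plain numeric vector
third_col_df <- MACNALLY[, 3, drop = FALSE]  # still a one-column data frame
first_four   <- MACNALLY[1:4, ]              # rows 1 to 4, all columns
```

Selecting a single column with [, 3] drops the data frame structure; drop = FALSE keeps it, which matters when later code expects a data frame.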
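Importing (reading) data
Read a comma-separated file, using the first column as row names:
> MACNALLY <- read.table(
+     'macnally.csv', header=TRUE,
+     row.names=1, sep=',')
Read a tab-delimited file in the same way:
> MACNALLY <- read.table(
+     'macnally.txt', header=TRUE,
+     row.names=1, sep='\t')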
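Because macnally.csv itself is not reproduced here, the sketch below writes a small stand-in file to a temporary location and reads it back. The file layout (a SITE column followed by the three variables) and the use of stringsAsFactors = TRUE are assumptions made for illustration.

```r
# Create a small stand-in CSV file (hypothetical layout, three sites only)
csv_file <- tempfile(fileext = ".csv")
writeLines(c(
  "SITE,HABITAT,GST,EYR",
  "Reedy Lake,Mixed,3.4,0.0",
  "Pearcedale,Gipps.Manna,3.4,9.2",
  "Warneet,Gipps.Manna,8.4,3.8"
), csv_file)

# header = TRUE: the first line holds the column names
# row.names = 1: the first column (SITE) supplies the row names
MACNALLY <- read.table(csv_file, header = TRUE, row.names = 1, sep = ",",
                       stringsAsFactors = TRUE)

str(MACNALLY)  # HABITAT is read as a factor, GST and EYR as numeric
```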
Reviewing a data frame - fix()
A data frame can also be viewed as a simple spreadsheet in a separate window by using the name of the data frame as an argument in the fix() function. The fix() function also enables simple editing of the data frame.
> fix(MACNALLY)
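fix() opens an interactive window, so it cannot be used in a script. As a non-interactive alternative (an addition, not part of the original slide), a data frame can be reviewed at the console with head(), str() and summary():

```r
MACNALLY <- data.frame(
  HABITAT = factor(c("Mixed", "Gipps.Manna", "Gipps.Manna", "Gipps.Manna",
                     "Mixed", "Mixed", "Mixed", "Mixed")),
  GST = c(3.4, 3.4, 8.4, 3, 5.6, 8.1, 8.3, 4.6),
  EYR = c(0, 9.2, 3.8, 5, 5.6, 4.1, 7.1, 5.3),
  row.names = c("Reedy Lake", "Pearcedale", "Warneet", "Cranbourne",
                "Lysterfield", "Red Hill", "Devilbend", "Olinda")
)

top3 <- head(MACNALLY, 3)  # first three rows
top3
str(MACNALLY)              # type of each column and the first few values
summary(MACNALLY)          # counts per factor level, quartiles for numeric columns
```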
Saving and loading of R objects
Any object in R (including data frames) can be saved to a native R workspace image file (*.RData), either individually or together with other objects, using the save() function. For example:
> save(MACNALLY, file='macnally.RData')
The saved object(s) can be loaded in subsequent sessions by passing the name of the saved workspace image file to the load() function. For example:
> load("macnally.RData")
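A minimal round trip with save() and load(), sketched with a temporary file instead of macnally.RData and a cut-down three-row data frame (both are assumptions for illustration):

```r
# Cut-down stand-in for the MACNALLY data frame
MACNALLY <- data.frame(GST = c(3.4, 3.4, 8.4), EYR = c(0, 9.2, 3.8))
original <- MACNALLY

rdata_file <- tempfile(fileext = ".RData")
save(MACNALLY, file = rdata_file)

rm(MACNALLY)                      # remove the object, as if starting a new session
loaded_names <- load(rdata_file)  # load() returns the names of the restored objects
loaded_names

identical(MACNALLY, original)     # the object is restored under its original name
```

Note that load() restores objects under the names they were saved with, so it silently overwrites any existing object of the same name.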
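Exporting (writing) data
The write.table() function writes a data frame to a text file. Here the file is comma-separated, strings are not quoted, and the row names are written as the first column:
> write.table(MACNALLY, "macnally.csv", quote = FALSE, row.names = TRUE, sep = ",")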
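The sketch below exports the example data frame to a temporary file (used here instead of macnally.csv) and reads it back with read.table() to check that the row names and values survive the round trip.

```r
MACNALLY <- data.frame(
  HABITAT = factor(c("Mixed", "Gipps.Manna", "Gipps.Manna", "Gipps.Manna",
                     "Mixed", "Mixed", "Mixed", "Mixed")),
  GST = c(3.4, 3.4, 8.4, 3, 5.6, 8.1, 8.3, 4.6),
  EYR = c(0, 9.2, 3.8, 5, 5.6, 4.1, 7.1, 5.3),
  row.names = c("Reedy Lake", "Pearcedale", "Warneet", "Cranbourne",
                "Lysterfield", "Red Hill", "Devilbend", "Olinda")
)

out_file <- tempfile(fileext = ".csv")
write.table(MACNALLY, out_file, quote = FALSE, row.names = TRUE, sep = ",")

# The header line has one field fewer than the data lines,
# because the row-name column has no header of its own
readLines(out_file, n = 2)

back <- read.table(out_file, header = TRUE, row.names = 1, sep = ",",
                   stringsAsFactors = TRUE)
same_rows <- identical(row.names(back), row.names(MACNALLY))
same_gst  <- isTRUE(all.equal(back$GST, MACNALLY$GST))
```

quote = FALSE is safe here only because none of the values contain a comma; a site name with a comma in it would break the column layout.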
Dummy data sets - generating random data
Normal
> # generate 5 random numbers from a normal
> # distribution with a mean of 10 and a standard
> # deviation of 1
> rnorm(5,mean=10,sd=1)
[1] 11.564555  9.732885  8.357070  8.690451 12.272846
Log-Normal
> # generate 5 random numbers from a log-normal
> # distribution whose logarithm has a mean of 2 and a
> # standard deviation of 1
> rlnorm(5,meanlog=2,sdlog=1)
[1]  8.157636 30.914781 20.175299  5.071559 16.364014
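For the log-normal case, the parameters describe the logarithm of the values rather than the values themselves. The sketch below illustrates this with a larger sample; the sample size of 10000 and the seed are arbitrary choices made for illustration.

```r
set.seed(1)  # arbitrary seed, only so that the run can be repeated

x <- rlnorm(10000, meanlog = 2, sdlog = 1)

log_mean <- mean(log(x))  # close to meanlog = 2
log_sd   <- sd(log(x))    # close to sdlog = 1
raw_mean <- mean(x)       # much larger: the theoretical mean is exp(2 + 1/2), about 12.18

c(log_mean = log_mean, log_sd = log_sd, raw_mean = raw_mean)
```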
Dummy data sets - generating random data
Poisson
> # generate 5 random numbers from a Poisson
> # distribution with a lambda parameter of 4
> rpois(5,lambda=4)
[1] 4 4 2 6 1
Binomial
> # generate 5 random numbers from a binomial
> # distribution based on 10 Bernoulli trials and
> # a prob. of 0.5
> rbinom(5,size=10,prob=.5)
[1] 4 4 1 4 6
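Random draws differ from run to run, so the numbers printed on the slides will not reappear exactly. If a dummy data set needs to be reproducible, the random number generator can be seeded with set.seed() first. This is an addition to the slides, and the seed value 123 is arbitrary.

```r
set.seed(123)
pois_a  <- rpois(5, lambda = 4)
binom_a <- rbinom(5, size = 10, prob = 0.5)

set.seed(123)  # same seed, so the same sequence of draws follows
pois_b  <- rpois(5, lambda = 4)
binom_b <- rbinom(5, size = 10, prob = 0.5)

identical(pois_a, pois_b)    # TRUE
identical(binom_a, binom_b)  # TRUE

# With many draws, the Poisson sample mean and variance both approach lambda
many <- rpois(10000, lambda = 4)
c(mean = mean(many), variance = var(many))
```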
Manipulating data sets
Subsets of data frames - data frame indexing
> # extract all the bird densities from sites that have GST values greater than 3
> subset(MACNALLY, GST>3)
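The sketch below applies subset() to the eight-site example and shows the equivalent logical indexing. The combined condition and the select argument are additions for illustration.

```r
MACNALLY <- data.frame(
  HABITAT = factor(c("Mixed", "Gipps.Manna", "Gipps.Manna", "Gipps.Manna",
                     "Mixed", "Mixed", "Mixed", "Mixed")),
  GST = c(3.4, 3.4, 8.4, 3, 5.6, 8.1, 8.3, 4.6),
  EYR = c(0, 9.2, 3.8, 5, 5.6, 4.1, 7.1, 5.3),
  row.names = c("Reedy Lake", "Pearcedale", "Warneet", "Cranbourne",
                "Lysterfield", "Red Hill", "Devilbend", "Olinda")
)

# 7 of the 8 sites; Cranbourne (GST = 3) is excluded
high_gst <- subset(MACNALLY, GST > 3)

# The same rows selected with a logical index
high_gst_idx <- MACNALLY[MACNALLY$GST > 3, ]

# Conditions can be combined, and select keeps only the named columns
mixed_high <- subset(MACNALLY, GST > 3 & HABITAT == "Mixed", select = c(GST, EYR))
```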
The %in% matching operator
Subset the MACNALLY data set to the rows whose HABITAT is 'Montane Forest' or 'Foothills Woodland':
> MACNALLY[MACNALLY$HABITAT %in% c("Montane Forest", "Foothills Woodland"),]
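The eight-site data frame built earlier contains only the habitats "Mixed" and "Gipps.Manna"; the two habitat names used above presumably come from the full data set, which is an assumption. The sketch below therefore shows %in% on the small example, both with a level that is present and with the two that are not.

```r
MACNALLY <- data.frame(
  HABITAT = factor(c("Mixed", "Gipps.Manna", "Gipps.Manna", "Gipps.Manna",
                     "Mixed", "Mixed", "Mixed", "Mixed")),
  GST = c(3.4, 3.4, 8.4, 3, 5.6, 8.1, 8.3, 4.6),
  EYR = c(0, 9.2, 3.8, 5, 5.6, 4.1, 7.1, 5.3),
  row.names = c("Reedy Lake", "Pearcedale", "Warneet", "Cranbourne",
                "Lysterfield", "Red Hill", "Devilbend", "Olinda")
)

# %in% returns one TRUE/FALSE value per row
is_gipps <- MACNALLY$HABITAT %in% c("Gipps.Manna")
gipps <- MACNALLY[is_gipps, ]  # Pearcedale, Warneet, Cranbourne

# Habitats that do not occur in this small example match nothing
none <- MACNALLY[MACNALLY$HABITAT %in% c("Montane Forest", "Foothills Woodland"), ]
nrow(none)  # 0
```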
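Sorting data sets
Sort the rows by HABITAT and, within each habitat, by GST. The order() function returns the row indices in sorted order, which are then used to index the data frame:
> MACNALLY[order(MACNALLY$HABITAT, MACNALLY$GST), ]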
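Applied to the eight-site example, the sort can be sketched as follows. The descending sort with decreasing = TRUE is an addition for illustration.

```r
MACNALLY <- data.frame(
  HABITAT = factor(c("Mixed", "Gipps.Manna", "Gipps.Manna", "Gipps.Manna",
                     "Mixed", "Mixed", "Mixed", "Mixed")),
  GST = c(3.4, 3.4, 8.4, 3, 5.6, 8.1, 8.3, 4.6),
  EYR = c(0, 9.2, 3.8, 5, 5.6, 4.1, 7.1, 5.3),
  row.names = c("Reedy Lake", "Pearcedale", "Warneet", "Cranbourne",
                "Lysterfield", "Red Hill", "Devilbend", "Olinda")
)

# Factors sort by level order (Gipps.Manna before Mixed), then by GST within each level
sorted <- MACNALLY[order(MACNALLY$HABITAT, MACNALLY$GST), ]
row.names(sorted)
# Cranbourne, Pearcedale, Warneet, Reedy Lake, Olinda, Lysterfield, Red Hill, Devilbend

# Largest GST first
by_gst_desc <- MACNALLY[order(MACNALLY$GST, decreasing = TRUE), ]
```

Indexing with order() returns a sorted copy; MACNALLY itself stays in its original order unless the result is assigned back to it.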