Introduction to R Lecture 3: Data Manipulation Andrew Jaffe 9/27/10
Overview Practice Solutions Indexing Data Management Data Summaries
Practice Make a 2 x 2 table of sex and dog > table(dat$sex, dat$dog) no yes F M
Practice Create a 'BMI' variable using height and weight > dat$bmi = dat$weight*703/dat$height^2 > head(dat$bmi) [1]
Practice Create an 'overweight' variable, which gives the value 1 for people with BMI > 30 and 0 otherwise > dat$overweight = ifelse(dat$bmi > 30, 1, 0) > head(dat$overweight) [1]
Practice Add those two variables to the dataset and save it as a text file somewhere
write.table(dat, "lec2_practice.txt", quote = F, row.names = F, sep="\t")
Indexing Vectors: vector[index] takes the ‘index’ elements from vector and returns them
> x = c(1,3,7,34,435)
> x[1]
[1] 1
> x[c(1,4)]
[1]  1 34
> x[2:4]
[1]  3  7 34
> 2:4
[1] 2 3 4
Indexing Replace elements in a vector by combining indexing, is.na(), and rep()
> x = c(1,3,NA,6,NA,8)
> which(is.na(x))
[1] 3 5
> x[is.na(x)] = 0 # or rep(0)
> x
[1] 1 3 0 6 0 8
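A minimal sketch of the same replacement pattern, assuming a made-up vector `y` (not part of the lecture data). It suggests that indexing by the logical vector from is.na() and indexing by the positions from which() select the same elements.

```r
# Hypothetical vector with missing values (illustration only)
y = c(10, NA, 30, NA, 50)

# Logical index: TRUE where y is missing
miss = is.na(y)

# which() converts the logical index into positions
pos = which(miss)

# Replace the missing entries using either kind of index
y_logical = y
y_logical[miss] = 0

y_position = y
y_position[pos] = 0

# Both approaches give the same vector
identical(y_logical, y_position)
```

The logical form is usually shorter; which() is handy when you want to look at the positions themselves.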
Indexing Data.frames/matrices: dat[row,col]
Can subset/extract a row: dat[row,]
Can subset/extract a column: dat[,col]
> x = matrix(c(1,2,3,4,5,6), ncol = 3)
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
Indexing
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> x[1,]
[1] 1 3 5
> x[,1]
[1] 1 2
> x[1,1]
[1] 1
> x[1:2,1:2]
     [,1] [,2]
[1,]    1    3
[2,]    2    4
Indexing > x[1,] = rep(1) > x [,1] [,2] [,3] [1,] [2,] > x[,1] = rep(2) > x [,1] [,2] [,3] [1,] [2,] > x [,1] [,2] [,3] [1,] [2,] 2 4 6
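A minimal sketch of assigning into a row or a column of a matrix. It rebuilds the 2 x 3 matrix from the earlier slide and, as an assumption, applies each assignment to a separate copy so that both start from the original values. The names `row_filled` and `col_filled` are hypothetical. Since rep(1) is just the single value 1, a plain number is used here.

```r
# Rebuild the 2 x 3 matrix used on the slides
x = matrix(c(1, 2, 3, 4, 5, 6), ncol = 3)

# Work on copies so each assignment starts from the original matrix
row_filled = x
row_filled[1, ] = 1   # the single value is recycled across row 1

col_filled = x
col_filled[, 1] = 2   # the single value is recycled down column 1

row_filled
col_filled
```

Typing the two assignments one after the other on the same `x` would instead apply the second one to the already modified matrix.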
Data Management An aside: save() and load()
save(obj_1,…,obj_n, file = "filename.rda") saves R objects (vectors, matrices, or data.frames) as an .rda file (similar to Stata's .dta)
load("filename.rda") loads whatever objects were saved in the .rda file
Easier than reading/writing tables
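A minimal sketch of a save()/load() round trip. The objects `ids` and `scores` are invented for illustration, and the file is written to a temporary location (via tempfile()) rather than the working directory.

```r
# Hypothetical objects to save (illustration only)
ids = c(101, 102, 103)
scores = data.frame(id = ids, score = c(7.5, 8.0, 6.5))

# Write to a temporary file so nothing in the working directory is touched
rda_file = tempfile(fileext = ".rda")
save(ids, scores, file = rda_file)

# Remove the objects, then restore them from the file
rm(ids, scores)
restored = load(rda_file)   # load() returns the names of the restored objects

restored
head(scores)
```

Note that load() puts the objects back under their original names, so there is nothing to assign unless you want the list of names.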
Data Management Your workspace can be saved as an .rda (or .Rdata) file
R asks whether to do this every time you close it
save.image("filename.Rdata") saves all objects in your workspace (what ls() returns)
Each folder might have its own .Rdata file
Doing this is personal preference: if you have a script and it's a quick analysis, you probably don't need a saved image
Data Management "lec3_data.rda" can be downloaded from the website
Similar method to reading in the data: load("lec3_data.rda")
Put it in the same directory as your script
Set your working directory
Use the full filename
Data Management What are the dimensions of the dataset? > dim(dog_dat) [1] 482 6
Data Management How many dogs are in this dataset? Is each dog ID unique?
> length(unique(dog_dat$dog_id))
[1] 482
> length(dog_dat$dog_id)
[1] 482
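A minimal sketch of the same uniqueness check on a made-up id vector that does contain a repeat (`ids` is hypothetical, not the lecture data). It also uses duplicated(), a base R function not shown on the slide, to find which ids are repeated.

```r
# Hypothetical id vector with one repeated id (illustration only)
ids = c(1, 2, 3, 3, 4)

# The comparison from the slide: equal lengths mean every id appears once
all_unique = length(unique(ids)) == length(ids)

# duplicated() flags the second and later occurrences of a value
repeated = ids[duplicated(ids)]

all_unique
repeated
```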
Data Management What are the column/variable names?
> head(dog_dat) dog_id owner_id dog_type dog_wt_mo1 dog_len_mo1 dog_food_mo1 lab lab poodle lab husky poodle
> names(dog_dat)
[1] "dog_id" "owner_id" "dog_type" "dog_wt_mo1"
[5] "dog_len_mo1" "dog_food_mo1"
Data Management Some explanation of the variables dog_id: id of dog owner_id: id of owner dog_type: type of dog dog_wt_mo1: dog weight at month 1 (baseline) dog_len_mo1: dog length at month 1 dog_food_mo1: baseline dog food consumption
Data Management Subsetting data: separate data into two data.frames based on a variable:
> lab = dog_dat[dog_dat$dog_type == "lab",]
> head(lab) dog_id owner_id dog_type dog_wt_mo1 dog_len_mo1 dog_food_mo1 lab lab lab lab lab lab
Data Management > lab = dog_dat[dog_dat$dog_type == "lab",] > head(which(dog_dat$dog_type == "lab")) [1] Taking those specific rows, and all of the columns of the original data
Data Management
> lab2 = dog_dat[dog_dat$dog_type == "lab",1:3]
> head(lab2,3) dog_id owner_id dog_type lab lab lab
Taking those specific rows, and the first 3 columns of the original data
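A minimal sketch of row and column subsetting on a small stand-in data.frame. The data.frame `toy` and all of its values are invented for illustration; only the column names follow dog_dat.

```r
# Hypothetical stand-in for dog_dat with the same column names (values invented)
toy = data.frame(
  dog_id = 1:6,
  owner_id = c(11, 12, 13, 14, 15, 16),
  dog_type = c("lab", "poodle", "lab", "husky", "retriever", "lab"),
  dog_wt_mo1 = c(60, 45, 65, 50, 70, 55),
  dog_len_mo1 = c(30, 20, 32, 28, 33, 29),
  dog_food_mo1 = c(100, 80, 110, 90, 120, 105)
)

# Logical vector: one TRUE/FALSE per row
is_lab = toy$dog_type == "lab"

# Rows where is_lab is TRUE, all columns
toy_lab = toy[is_lab, ]

# Same rows, first three columns only
toy_lab3 = toy[is_lab, 1:3]

dim(toy_lab)
names(toy_lab3)
```

Storing the logical vector under its own name makes it easy to reuse the same row selection with different column choices.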
Data Management Note (Stata users…) that we now have more than one data.frame in our workspace at the same time! Check with ls()
Data Management Remember we used ifelse() for binary conversions? > heavy = ifelse(dog_dat[,4] > mean(dog_dat[,4]), 1, 0) > head(heavy) [1] Note that you can use column indexing instead of $name for data.frames This is just the mean of that column: > mean(dog_dat[,4]) [1]
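A minimal sketch of the binary conversion on a made-up weight vector (`wt` is hypothetical). It also shows, as an alternative not on the slide, that the comparison alone already gives TRUE/FALSE, which as.numeric() turns into 1/0.

```r
# Hypothetical weights (illustration only)
wt = c(40, 50, 60, 70)

# 1 if above the mean weight, 0 otherwise
heavy = ifelse(wt > mean(wt), 1, 0)

# The comparison alone gives TRUE/FALSE; as.numeric() converts it to 1/0
heavy2 = as.numeric(wt > mean(wt))

identical(heavy, heavy2)
```

ifelse() is the more general tool, since the two outcomes can be any values, not just 1 and 0.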
Data Management The cut() function can split data into more groups – quintiles, tertiles, etc cut(dat, breaks) dat is a vector of numerical or integer values breaks is where to make the cuts
Data Management If ‘breaks’ is one number (n), cut() splits the range of the data into ‘n’ equal-width intervals
> x = 1:5 # or seq(1,5)
> cut(x, 2)
[1] (0.996,3] (0.996,3] (0.996,3] (3,5] (3,5]
Levels: (0.996,3] (3,5]
> cut(x, 3)
[1] (0.996,2.33] (0.996,2.33] (2.33,3.67] (3.67,5] (3.67,5]
Levels: (0.996,2.33] (2.33,3.67] (3.67,5]
> cut(x,3, labels=F) # returns integers of groups, not factors
[1] 1 1 2 3 3
Without labels=F, cut() returns FACTORS!
Data Management What is a factor?
Similar to terms like ‘category’ and ‘enumerated type’
Has ‘levels’ associated with it; can be ordinal with factor(…, ordered = T)
A vector only needs an as.character() method and to be sortable to be converted to a factor using factor()
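A minimal sketch of what a factor stores, using made-up character vectors (`type` and `size` are hypothetical). It shows the levels, the underlying integer codes, and an ordered factor with an explicit level order.

```r
# Hypothetical dog types (illustration only)
type = c("lab", "poodle", "lab", "husky")

# factor() stores the distinct values as levels, sorted by default
f = factor(type)
levels(f)

# as.integer() shows the underlying level codes
codes = as.integer(f)
codes

# An ordered factor with an explicit level order
size = factor(c("small", "large", "medium"),
              levels = c("small", "medium", "large"), ordered = TRUE)

# Comparisons follow the level order, not the alphabet
size[1] < size[2]
```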
Data Management If ‘breaks’ is more than one number, cut() splits the vector at those numbers
> x = 1:10
> cut(x, c(0,3,6,10))
[1] (0,3] (0,3] (0,3] (3,6] (3,6] (3,6] (6,10] (6,10] (6,10] (6,10]
Levels: (0,3] (3,6] (6,10]
> cut(x, c(0,3,6,10), FALSE) # the third argument is labels
[1] 1 1 1 2 2 2 3 3 3 3
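A minimal sketch of cut() with explicit break points on a made-up weight vector (`wt` and the group names are hypothetical). It shows the three forms of the labels argument: the default interval labels, integer codes, and custom names.

```r
# Hypothetical weights (illustration only)
wt = c(12, 25, 38, 41, 57, 60)

# Explicit break points give the intervals (0,20], (20,40], (40,60]
grp = cut(wt, breaks = c(0, 20, 40, 60))

# labels = FALSE returns integer group codes instead of a factor
code = cut(wt, breaks = c(0, 20, 40, 60), labels = FALSE)

# A character vector of labels names the groups
size = cut(wt, breaks = c(0, 20, 40, 60),
           labels = c("small", "medium", "large"))

table(size)
```

Intervals are closed on the right by default, so the value 60 falls in the last group.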
Data Management Something more applicable for cut: the quantile(x,probs) function
Default ‘probs’ is seq(0,1,0.25), i.e. quartiles
seq(start, end, by) creates a sequence from the starting value to the ending value by the specified amount
seq(0,10) ~ 0:10 # 0, 1, 2, …, 9, 10
seq(0,10,0.5) # 0, 0.5, 1.0, …, 9.5, 10.0
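A minimal sketch of quantile() with the default and with custom probabilities, on made-up data (`v` is hypothetical). The tertile call uses the length.out argument of seq(), which is not covered on the slide, to get four evenly spaced probabilities.

```r
# Hypothetical data: the integers 1 to 9 (illustration only)
v = 1:9

# Default probs = seq(0, 1, 0.25) gives the quartile cut points
q4 = quantile(v)

# Tertile cut points: probabilities 0, 1/3, 2/3, 1
q3 = quantile(v, seq(0, 1, length.out = 4))

q4
q3
```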
Data Management Now for stuff with our data: > quantile(dog_dat$dog_wt_mo1) 0% 25% 50% 75% 100% > quantile(dog_dat$dog_wt_mo1, seq(0,1,0.5)) 0% 50% 100% > quantile(dog_dat$dog_wt_mo1, 0.6) 60% 51.5 > quantile(dog_dat$dog_wt_mo1, c(0.4,0.6)) 40% 60%
Data Management > sp = quantile(dog_dat$dog_wt_mo1, 0.75) > big = ifelse(dog_dat$dog_wt_mo1 > sp, 1, 0) > head(big) [1] > quant = cut(dog_dat$dog_wt_mo1, quantile(dog_dat$dog_wt_mo1)) > head(quant) [1] (49.2,55.3] (44.6,49.2] (55.3,72.5] (44.6,49.2] (44.6,49.2] (44.6,49.2] Levels: (10.6,44.6] (44.6,49.2] (49.2,55.3] (55.3,72.5]
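One detail worth checking when quantile() output is used as breaks: cut() intervals are open on the left by default, so the smallest value (10.6 in the Levels line above) lands in no interval and becomes NA. A minimal sketch with invented weights (`w` is hypothetical, not the lecture data), using the include.lowest argument of cut():

```r
# Hypothetical weights (illustration only)
w = c(10, 20, 30, 40, 50, 60, 70, 80)

# Quartile cut points used as breaks
br = quantile(w)

# Default intervals are open on the left, so the minimum is in no interval
q_default = cut(w, br)
n_missing_default = sum(is.na(q_default))

# include.lowest = TRUE closes the first interval on the left
q_closed = cut(w, br, include.lowest = TRUE)
n_missing_closed = sum(is.na(q_closed))

n_missing_default
n_missing_closed
table(q_closed)
```

Whether the minimum should be kept is a judgment call for the analysis; the sketch only shows how to keep it.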
Data Summaries This is some of the only “statistics” in the course. R functions can perform statistics well; here are some basics for summaries.
Data Summaries mean(dat, na.rm = F) median(dat, na.rm=F) > x = c(1,2,4,6,NA) > mean(x) [1] NA > mean(x, na.rm=T) [1] 3.25 > median(x,na.rm=T) [1] 3
Data Summaries
> x = c(1,2,4,7,9,11)
> mean(x)
[1] 5.666667
> median(x)
[1] 5.5
> var(x)
[1] 15.86667
> sd(x)
[1] 3.983298
Data Summaries Let’s combine some concepts! Take the mean food consumption of all of the labs
Data Summaries First, figure out which entries correspond to dogs that are labs > Index = which(dog_dat$dog_type == "lab") > head(Index) [1]
Data Summaries Then, take the mean of the data you want > mean(dog_dat$dog_food_mo1[Index]) [1] Note that we first created a vector of dog food, then indexed it - there are no commas needed for the indexing (because it’s a vector)
Data Summaries Combined into 1 line/command: > mean(dog_dat$dog_food_mo1[dog_dat$dog_type == "lab"]) [1] > mean(dog_dat[dog_dat$dog_type == "lab",6]) [1] > mean(dog_dat[dog_dat$dog_type == "lab","dog_food_mo1"]) [1] Pick your favorite – they’re all the same! Note that the first option might make the most sense…
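A minimal sketch checking that the three forms agree, on a small stand-in data.frame (`toy` and its values are invented for illustration; only the column names follow dog_dat).

```r
# Hypothetical stand-in for dog_dat (values invented for illustration)
toy = data.frame(
  dog_type = c("lab", "poodle", "lab", "husky", "lab"),
  dog_wt_mo1 = c(60, 45, 65, 50, 55),
  dog_food_mo1 = c(100, 80, 110, 90, 105)
)

is_lab = toy$dog_type == "lab"

# 1. Extract the column as a vector, then index the vector (no comma)
m1 = mean(toy$dog_food_mo1[is_lab])

# 2. Index rows and a column position in one step
m2 = mean(toy[is_lab, 3])

# 3. Index rows and a column name in one step
m3 = mean(toy[is_lab, "dog_food_mo1"])

c(m1, m2, m3)
```

Indexing by column name tends to be safer than by position, since it keeps working if the column order changes.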
Practice Compute the average dog weight, dog length, and dog food consumption for each dog type at baseline Reminder: the dog types are lab, poodle, husky, and retriever