R Programming II EPID 799C Fall 2017
Overview Review of data types and operators Functions (galore) Reading & writing files Exploring with real datasets A little group work today…
Data Structures Review & deeper: 5 basic data structures, 4(+) atomic types
Data structures
        Homogeneous      Heterogeneous
1D      Atomic vector    List*
2D      Matrix           Data frame
ND      Array
Atoms are: logical, integer, double (aka numeric), and character. There are also two rare atomic types (complex and raw [2]) and other aggregate types. We'll introduce factors and dates soon. We make these things using the functions c(), matrix(), array(), list() and data.frame().
* Note: Lists are recursive!
[2] Do genetic work? An R package that works with DNA sequences uses raw format (byte code/math) to efficiently store and operate on those ATGCs. It'll be largely invisible to you, but you can thank bit math for speedy comparison of sequences.
We Try Let’s make some!
You try: data types
- Create three atomic vectors (length 5), one of each of these types: character, integer, logical. Name them whatever you want.
- Use the ":" shorthand to create a vector of numbers, again of the same length.
- Create a data.frame using those four atomic vectors, and take a look at it by printing it to the console.
- Create a 3x3 matrix of (any) numbers, then of logicals. (Hint: the rep() function may be useful for logicals.)
- Create a 3x3x3 array (27 elements) of the numbers 1 to 27. You may need the help page for array()…
- Create a list that includes another list in it.
Answers
# Atomic vectors
a = c(1, 2, 3, 4)
b = 1:4
c = c("one", "two", "three", "four")
d = c(T, F, T, F)
my_df = data.frame(a, b, c, d)
# In R versions before 4.0, stringsAsFactors = F would keep c from turning into a factor
# (from R 4.0 on, that is already the default)
str(my_df)
# Matrices and arrays
matrix(1:9, nrow = 3)
array(1:27, dim = c(3, 3, 3))
# A list that contains another list
my_list = list(1, "a", list("mike", "shoes"))
Notes on Data Structures
Everything is a vector in R. Atoms are vectors of length 1. (As reviewed Monday: this is crazy important and useful.) Operators therefore expect vectors and know how to operate on entire vectors at once.
Lists are recursive and heterogeneous. They can make up the building blocks of more complex objects.
Data frames are really… lists of atomic (homogeneous) vectors*. We'll verify this later.
* Fancy note: technically a column can itself be a list (each element is a list) and it's still a data.frame. Some recent spatial packages take advantage of this (e.g. simple features geometries).
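A minimal sketch of these notes in code. The object names (x, doubled, df, tags) are made up for illustration, and the list-column at the end is just a toy version of the "fancy note".

```r
# Operators work on whole vectors at once
x <- c(1, 2, 3)
doubled <- x * 2           # 2 4 6

# A single number is just a vector of length 1
length(5)                  # 1

# A data frame is a list of equal-length columns
df <- data.frame(id = 1:3, name = c("a", "b", "c"))
is.list(df)                # TRUE
typeof(df)                 # "list"
length(df)                 # 2, the number of columns

# Fancy note: a column can itself be a list (a "list-column")
df$tags <- list("x", c("y", "z"), character(0))
class(df$tags)             # "list"
```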
Type Coercion
Explicit: as.thing(this) will convert this into the other thing, e.g.
as.numeric(c("1", "2", "3"))
as.character(1:3)
# as.... other stuff
Or as(thing, "class").
Implicit: R tries to help when it can, e.g. sum(T, F, T, F, T)
If you're going to lose information (get new NAs), R will let you know with a warning. Generally, arithmetic operators coerce to numbers, logical operators coerce to logicals, etc.
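A short sketch of explicit and implicit coercion. The variable names are illustrative only, and suppressWarnings() is used here just to keep the output quiet; without it, R warns about the new NA.

```r
# Explicit coercion
nums  <- as.numeric(c("1", "2", "3"))    # 1 2 3
chars <- as.character(1:3)               # "1" "2" "3"

# Implicit coercion: logicals become numbers in arithmetic
n_true <- sum(c(TRUE, FALSE, TRUE, FALSE, TRUE))   # 3

# Mixing types in c() coerces to the most flexible type (character here)
mixed <- c(1, "a", TRUE)                 # "1" "a" "TRUE"

# Losing information: "two" cannot become a number, so it turns into NA
lossy <- suppressWarnings(as.numeric(c("1", "two", "3")))   # 1 NA 3
```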
Functions data, str, class, summary, head, tail, setwd, getwd, View, plot, dim, nrow/ncol, sd, hist, boxplot, table, typeof, sum, read.csv and write.csv…
Our first functions (vocabulary!) str(), typeof(), length(), attributes(), names(), class()
We Try Let’s explore data types with our functions!
Sidenote 1: Factors & Dates! We’ve got the functions to make sense of these
Aggregate Data Types: Factors
Now we're ready: What is a factor? Let's find out!
Create one: roles = factor(c("student", "faculty", "staff"))
^ NOTE: there was a typo during class. I forgot the c()!
Find out: use str(), class(), levels(), attributes(), as.numeric(), typeof() on roles.
How are factors different? Why are they here? Also see: ordered()
Stuck with factors? Check out the forcats package at http://forcats.tidyverse.org/ and this chapter on factors: http://r4ds.had.co.nz/factors.html
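One way the exploration above might go. An extra "student" is added to roles here (an assumption, not part of the slide) so the integer codes are easier to read, and the sizes factor is a made-up example for ordered factors.

```r
roles <- factor(c("student", "faculty", "staff", "student"))

class(roles)        # "factor"
typeof(roles)       # "integer": a factor is integer codes plus labels
levels(roles)       # "faculty" "staff" "student" (sorted alphabetically)
as.integer(roles)   # 3 1 2 3, the position of each value in levels()

# Ordered factors know which level is "bigger"
sizes <- factor(c("low", "high", "medium"),
                levels = c("low", "medium", "high"),
                ordered = TRUE)
sizes[1] < sizes[2] # TRUE, because low < high
```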
Aggregate Data Types: Dates
What is a date? Two things really…
today_date = as.Date("2017/08/30")
typeof(today_date)
today_date_lt = as.POSIXlt("2017/08/30")
typeof(today_date_lt)
(Try our other functions too!)
But hint: futzing with dates can be a hassle. We'll use the lubridate package to make that easier, later.
<2m skim this: http://r4ds.had.co.nz/dates-and-times.html
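A small sketch of the "two things" a date can be. The tz = "UTC" argument and the tomorrow variable are additions for illustration; the slide's own calls do not set a time zone.

```r
# Date: a number of days since 1970-01-01 with a class attribute on top
today_date <- as.Date("2017/08/30")
class(today_date)            # "Date"
typeof(today_date)           # "double"
tomorrow <- today_date + 1   # date arithmetic works in days

# POSIXlt: a list of components (year, month, day, hour, ...)
today_date_lt <- as.POSIXlt("2017/08/30", tz = "UTC")
typeof(today_date_lt)        # "list"
today_date_lt$year + 1900    # 2017 (years are stored as an offset from 1900)
today_date_lt$mon + 1        # 8 (months are stored as 0-11)
today_date_lt$mday           # 30
```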
Sidenote 2: Operators? C’mon, I thought we were doing functions!
Reminder: Operators… are functions!
`+`(1, 2)
`%in%`(1:4, c(2,3))
…so are assignment, indexing and (technically) function calls themselves. Meaning, hey, you can easily define your own binary operators. More often we'll define our own functions, but it's important to know: EVERYTHING is a function / object, and can be passed around.
`%add_wrong%` <- function(a, b) {a + b + 1}
1 %add_wrong% 2   # 4
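The same idea as a runnable sketch. The Reduce() line is an extra illustration (not from the slide) of passing an operator around like any other function.

```r
# Operators called as ordinary functions
three   <- `+`(1, 2)                  # 3
is_in   <- `%in%`(1:4, c(2, 3))       # FALSE TRUE TRUE FALSE
second  <- `[`(c(10, 20, 30), 2)      # indexing is a function too: 20

# Define your own binary operator (deliberately wrong, as on the slide)
`%add_wrong%` <- function(a, b) {a + b + 1}
wrong_sum <- 2 %add_wrong% 3          # 6

# Functions are objects: pass `+` to another function
total <- Reduce(`+`, 1:4)             # 10
```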
Sidenote 3: Write your own Often *super* useful
We Try: The best way to learn functions is to write our own
my_first_function = function(param, param2 = 4){
  # function body, using the parameters…
  my_return_val = param + param2   # placeholder computation
  return(my_return_val)
  # ^ note: will return last value if left out!
}
my_first_function(1, 2)                    # calling it like this, or
my_first_function(param2 = 2, param = 1)   # this, or
my_first_function(1)                       # this
R functions are scoped (e.g. variables created inside don't exist outside). Arguments behave as if passed by value, but R is smart about it (copy-on-modify): it doesn't create a new copy of what's passed in unless the function changes it.
Let's write get_older and hello_world
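One possible version of the two functions named above. The slides do not say what they should do, so the bodies, argument names and defaults here are assumptions for illustration.

```r
# Hypothetical hello_world: builds a greeting, with a default argument
hello_world <- function(name = "world") {
  paste0("Hello, ", name, "!")   # last value is returned automatically
}

# Hypothetical get_older: adds years to an age
get_older <- function(age, years = 1) {
  age <- age + years             # changes the local copy only
  return(age)
}

my_age <- 30
older  <- get_older(my_age, years = 5)   # 35
my_age                                   # still 30: the caller's value is untouched

greeting <- hello_world("EPID 799C")     # "Hello, EPID 799C!"
```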
We Try: write.csv and read.csv
What is iris? (see data()) Save the data.frame iris to iris_lower, and change all the variable names to lower case.
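A sketch of this exercise, including a round trip through a CSV file. Writing to a tempfile() and the names csv_path and iris_back are choices made here so the example runs anywhere; in class you would more likely write into your working directory.

```r
# Copy iris and lower-case its variable names
iris_lower <- iris
names(iris_lower) <- tolower(names(iris_lower))
names(iris_lower)   # "sepal.length" "sepal.width" "petal.length" "petal.width" "species"

# Write it out, then read it back in
csv_path <- tempfile(fileext = ".csv")
write.csv(iris_lower, csv_path, row.names = FALSE)   # skip the row-number column
iris_back <- read.csv(csv_path, stringsAsFactors = TRUE)

dim(iris_back)      # 150 5
```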
You try: Function Vocab Injection Using the iris dataset and Advanced R: Function Vocabulary (http://adv-r.had.co.nz/Vocabulary.html) Try out as many functions as you can in your group! (Feel free to split them up and work in groups) Suggestions: Some of these are actually pretty advanced - consider not diving into EVERY function. Some you might want to skip that are a bit of a rabbit hole… <<- get assign rle
Sidenote 4: classes and functions You’ll never need to know this until you do.
Classes and functions How do functions like dim() or plot() know how to handle all these things? Technically, they're generics: they look at the class() of the object and call (effectively hiding from you) a class-specific function, like dim.data.frame or plot.factor. Look up the help for plot.factor and dim.data.frame.
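To see the mechanism, here is a toy generic written from scratch. The describe() function and its methods are invented for this sketch; they are not part of R or of any package.

```r
# A generic just hands off to a method chosen by class(x)
describe <- function(x, ...) UseMethod("describe")

# Methods are ordinary functions named generic.class
describe.default <- function(x, ...) {
  paste("an object of type", typeof(x))
}
describe.factor <- function(x, ...) {
  paste("a factor with", nlevels(x), "levels")
}
describe.data.frame <- function(x, ...) {
  paste("a data frame with", nrow(x), "rows and", ncol(x), "columns")
}

describe(iris)           # "a data frame with 150 rows and 5 columns"
describe(iris$Species)   # "a factor with 3 levels"
describe(1:3)            # "an object of type integer"

# Built-in generics work the same way: dim() on a data frame ends up here
dim.data.frame(iris)     # 150 5
```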
Super Duper Sub-setting Review from last time: [], [[]], $ and their many flexibilities
Super Duper Sub-setting: Vectors [] is the atomic subset operator (by location). (Given that R is "vectorized", like almost all of our data, think matrix notation.) [[]] ("double brackets") is the subset-into operator (think subset, then look inside that thing). It is most commonly used on a named list, like… a data.frame!
We Try: Super Duper Sub-setting: Vectors [] can subset a vector by: Numeric vectors (negative to drop, repeats, etc.) Logical vectors Character vectors (IF you've named those elements) [] can subset a 2+ dimensional object (matrix, array, data.frame) in similar ways… …but then accepts a few other higher order versions of the above. Technically, [] used on a vector always returns a vector, right?
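The three ways of subsetting with [], sketched on a small named vector and a 3x3 matrix. The objects v and m are made up for illustration.

```r
v <- c(a = 10, b = 20, c = 30, d = 40)

# Numeric vectors: positions, negatives to drop, repeats allowed
v[c(1, 3)]      # a = 10, c = 30
v[-1]           # everything except the first element
v[c(1, 1, 2)]   # a, a, b

# Logical vectors: keep elements where the condition is TRUE
big <- v[v > 15]          # b, c, d

# Character vectors: only works because the elements are named
v[c("b", "d")]            # b = 20, d = 40

# Two dimensions: [rows, columns]; leave one blank to keep everything
m <- matrix(1:9, nrow = 3)   # filled column by column
m[2, ]          # second row: 2 5 8
m[, 3]          # third column: 7 8 9
m[m > 6]        # a logical matrix works too: 7 8 9
```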
Super Duper Subsetting: Lists [] returns a smaller list from a list. But often we want to look inside that list element (e.g. in data frames). So for lists we use [[]], e.g. iris[["Petal.Length"]]. But that's a hassle, so x$y is a convenience shorthand for the same operator (equivalent to x[["y", exact = FALSE]]), which we'll use ALL THE TIME. * Fancy note: see that exact = FALSE? That means partial names match, so you can do some crazy stuff, like iris$Petal.Len . Really. Don't do this!
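The difference between [], [[]] and $ on a data frame, as a sketch. The variable names are illustrative; the last lines show the partial matching from the fancy note, as it behaves under R's default options.

```r
# [] keeps the container: a one-column data frame
one_col <- iris["Petal.Length"]
class(one_col)                      # "data.frame"

# [[]] looks inside: the numeric vector itself
petal_len <- iris[["Petal.Length"]]
class(petal_len)                    # "numeric"

# $ is shorthand for [[]] with partial matching allowed
identical(iris$Petal.Length, petal_len)   # TRUE

# Fancy note in action (don't do this in real code!)
partial <- iris$Petal.Len           # partial name still finds Petal.Length
exact   <- iris[["Petal.Len"]]      # exact matching finds nothing: NULL
```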
Ready for Reality! Introducing the class dataset
Putting it all together We have basic data types to hold our vectorized, atomic data. We have a wealth of functions to operate on them, usually on a whole vector (think “column”) at once. We can write our own if we need to. We have powerful subsetting (see L2 for the full rundown) to select, rewrite, extract and perform other actions on slices of our data.
NC Birth Data The “small” dataset contains (N) columns of (M) rows of data. Check the documentation for what these values really mean. Mdif, visits, wksgest, mrace, cores, bfed. The overall question: does prenatal care reduce preterm birth?
You try: Functions Read in the NC births (small) file, and rename the variables to all lower case. Explore the dataset as a small group using as many relevant functions as you can from the Advanced R function vocabulary, and report out to the group. Try str(), length(), dim(), typeof(), attributes(). Try head(), tail(), subset().
You try: Tour the Dataset
Download and unzip the Births Dataset, then use read.csv() (and maybe setwd()) to import the small version of the dataset: births2012_small.csv
Use these functions to answer the questions below: dim() summary() table() hist() plot()
- Use an expression with assignment to make a working copy of the dataset with a simpler name.
- How many observations, and how many variables, are in the (small) births dataset?
- What is the average maternal age (mage)? How many mothers have the value 99?
- Make a histogram of gestational age (WKSGEST). What are the minimum and maximum (non-99) gestational ages?
- How many mothers smoked (CIGDUR)?
- Make a scatterplot of maternal age versus gestational age.
You try You can flexibly program with [] and [[]], but not as flexibly with $, even though almost always we’ll use $. Can you see why? Using the births data…
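A sketch of why [[]] is the one to reach for when programming. It uses the built-in iris data as a stand-in, since the births file is not bundled with R; col_mean and the other names are hypothetical.

```r
col_name <- "Sepal.Length"

# $ takes the name literally: it looks for a column called "col_name"
dollar_result <- iris$col_name        # NULL, no such column

# [[]] evaluates the variable, so the column can be chosen at run time
bracket_result <- iris[[col_name]]    # the Sepal.Length column

# That makes it easy to write functions that work on any column...
col_mean <- function(df, col) {
  mean(df[[col]], na.rm = TRUE)
}
col_mean(iris, "Petal.Width")

# ...or to loop over several columns by name
sepal_means <- sapply(c("Sepal.Length", "Sepal.Width"),
                      function(nm) col_mean(iris, nm))
```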