R Programming III: Real Things with Real Data!


1 R Programming III: Real Things with Real Data!
EPID 799C Fall 2017

2 Overview
Introduction to the homework project
Practice data exploration
Data recoding
A little group work today…

3 Homework Project Split into HW 1-5

4 Motivating Question Does early prenatal care (PNC) reduce preterm birth?

5 Prototypical Epi Analysis
Literature review
Nope! We’re picking it.
Hypothesis / question generation
Prepare, Explore & Recode Data (HW1*)
Select / Subset Covariates (HW2)
Functional Form & Crude/Basic Models (HW3)
Confounding & Effect Measure Modification (HW4)
Graphics & Outputs (HW5)
…Bonus stuff
*We’ll revisit some early stuff in future assignments, since you’d generally want to apply some of the fancier, more powerful tools right away! But this is the gist.

6 Prototypical Epi Analysis
Note a little overlap of HW2. We’ll occasionally learn some useful tools “out of order” – slightly more advanced concepts that you’d often want to pull out right away. But generally we’re working in order.

7 Important Epidemiology Concepts
…that we will be using but abbreviating hard: Exposure, Outcome, Risk, Risk Difference/Ratio, Rate, Confounder, Mediator, Effect Measure Modifier

8 DAGs
Directed Acyclic Graphs (DAGs) inform our variable selection and treatment in models (based on their status as mediators, confounders, effect measure modifiers, etc.). We will not elaborate in this class! Take the Epi sequence for more. DAG from EPID 716 / Christy Avery

9 Important Epidemiology Concepts
…that we will not be using or only minimally use: Hand calculations, Confidence intervals (minimal), Odds ratios, DAGs …and many more! Covered in depth in EPID 715/716 in a SAS base and the core EPID sequence.

10 Motivating Question… …a bit more specifically
Does early prenatal care (= PNC during or before the 5th month) reduce preterm birth (= less than 37 weeks) …when controlling for obvious confounders?
Literature note: …uh, only maybe sorta? It’s more complex than we’ll be treating it for this class. But let’s mostly drop that for now! Feel free to explore the literature here.
Think: PNC seems to be good. Let’s figure out how good!

11 Relevant Variables*
Exposure/Outcome
Mdif: Month Prenatal Care Began
Wksgest: Calculated Estimate of Gestation
Covariates
Mage: Maternal Age
Mrace: Maternal Race
Methnic: Hispanic Origin of Mother
Cigdur: Cigarette Smoking During Pregnancy
Cores: Residence of Mother -- County
Look ahead: actually, we’ll be creating some modified versions of these, but these are our base elements. And a sidenote on case / style….

12 Relevant Variables
Selection Criteria
Plur: Plurality of birth (twins, triplets, etc.)
Wksgest: Calculated Estimate of Gestation
DOB: Date of birth of baby
Congenital Anomalies: multiple variables with congenital anomaly status
Sex: Infant sex
Visits: Total Number of Prenatal Care Visits

13 We Try Let’s start the homework together.  Explore & Recoding (PNC)

14 Answers
table(births$mdif)
births$mdif[births$mdif %in% c(88, 99)] = NA  # 88 and 99 are recoded to missing
table(births$mdif, useNA = "always")
births$pnc5 = as.numeric(births$mdif <= 5)  # 1 = PNC began in month 5 or earlier
table(births$pnc5, births$mdif, useNA = "always")
births$pnc5_f = factor(births$pnc5, levels = 0:1, labels = c("No Early PNC", "Early PNC"))  # labels describe PNC timing, not term/preterm
table(births$pnc5_f, births$mdif, useNA = "always")
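The same recode can be tried without the course data. This is a minimal sketch on a made-up data frame: the name `toy` and its values are assumptions for illustration, while the 88/99 missing codes and the month-5 cutoff come from the slide.

```r
# Toy stand-in for the births data (values are made up for illustration).
toy <- data.frame(mdif = c(1, 3, 5, 6, 9, 88, 99))

# Treat the special codes 88 and 99 as missing.
toy$mdif[toy$mdif %in% c(88, 99)] <- NA

# 1 = prenatal care began in month 5 or earlier, 0 = later; NA stays NA.
toy$pnc5 <- as.numeric(toy$mdif <= 5)

# Labelled factor version of the same indicator.
toy$pnc5_f <- factor(toy$pnc5, levels = 0:1,
                     labels = c("No Early PNC", "Early PNC"))

# Cross-tabulate the new variable against its source to check the recode.
print(table(toy$pnc5_f, toy$mdif, useNA = "always"))
```

Tabulating the recoded variable against the original after every step is a cheap way to catch a flipped cutoff or a forgotten missing code.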

15 You try: other variables
At your table, split up and assign the other variables listed in the Relevant Variables slides among you all for exploration.
Explore the structure of the variable(s) you’ve chosen using functions like table(), hist(), summary(), head(), plot() and others. With table(), the useNA = "always" parameter may be useful. You may or may not benefit from referencing the Excel coding manual included in the data package / reference folder.
Describe what would have to be done to properly recode the variable to its intended end structure (Do missing values need to be recoded? Will it become a categorical factor variable? Etc.)
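A first pass at exploring a variable might look like the sketch below. It runs on a made-up data frame, and the value 99 is used as a hypothetical missing code; the real codes are whatever the coding manual says.

```r
# Toy stand-in for the births data (values are made up for illustration).
toy <- data.frame(
  mage    = c(19, 24, 31, 99, 28, 35),  # 99 is a hypothetical missing code
  wksgest = c(39, 36, 40, 38, 99, 41)
)

str(toy)                                  # column types at a glance
print(head(toy, 3))                       # first few records
print(summary(toy$mage))                  # a max of 99 hints at a special code
print(table(toy$mage, useNA = "always"))  # counts per value, including NA
hist(toy$wksgest, main = "Toy gestational age")

# Count the values that look like a special code rather than a real age.
n_suspect <- sum(toy$mage == 99)
```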

16 Control: Iterations & Conditionals
if () {}, else if () {}, else {}, ifelse(), for () {} … and always, vectors

17 Control To do…or not to do…or do repeatedly. R has “traditional” control structures… but also benefits from vectorization. Traditional first:

18 Conditionals
Simple:
if (boolean_test) {
  # do stuff
}
Complex:
if (boolean_test) {
  # do stuff
} else if (test2) {
} else {
}
(R is case-sensitive: the keywords are lowercase if and else.)
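As a small runnable illustration of the if / else if / else chain: `classify_gest()` is a hypothetical helper written for this sketch, and the 37-week cutoff is the preterm definition from the motivating question.

```r
# Hypothetical helper: classify ONE gestational age (in weeks).
# if () expects a single TRUE/FALSE, so this works on one value at a time.
classify_gest <- function(weeks) {
  if (is.na(weeks)) {
    "missing"
  } else if (weeks < 37) {
    "preterm"
  } else {
    "term"
  }
}

print(classify_gest(35))
print(classify_gest(40))
print(classify_gest(NA))
```

The NA check comes first because a test like `NA < 37` evaluates to NA, which if () cannot use.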

19 Conditionals
Common usage: iterate (coming up) through a vector, assigning a value, using the ifelse() function.
new_data = ifelse(boolean_vector, a, b)
…returns a where boolean_vector is TRUE and b where it is FALSE, element by element.
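A minimal sketch of ifelse() on a whole vector at once, using made-up gestational ages and the 37-week preterm cutoff from the motivating question:

```r
# Made-up gestational ages in weeks; NA marks a missing value.
wksgest <- c(35, 40, NA, 36, 39)

# The test is evaluated for every element; a missing test gives a missing result.
preterm <- ifelse(wksgest < 37, "preterm", "term")
print(preterm)
```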

20 Speaking of Iteration…
for (varname in vector_of_values) {
  # Do stuff
}
# But don’t forget about vectorization!

21 Iteration: Examples
for (i in 1:nrow(births)) { # seq_len(nrow(births)) (or seq_along() on a vector) is technically safer…
  cat("Record", i, "is a birth with", births$plur[i], "baby\n")
}
cols_to_use = c("wksgest", "plur", "kotel")
for (this_name in cols_to_use) { # note I’m iterating over a character vector…
  print(summary(births[, this_name]))
}
# ^ iteration (and functions) are by default “quiet.” print/cat sidenote
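Why the safer sequence functions matter shows up only on empty input. A small sketch, with variable names made up for illustration:

```r
empty <- character(0)

print(1:length(empty))   # 1 0  : counts DOWN from 1 to 0
print(seq_along(empty))  # integer(0)

# The 1:length() loop body runs twice on an empty vector...
count_bad <- 0
for (i in 1:length(empty)) count_bad <- count_bad + 1

# ...while the seq_along() loop correctly runs zero times.
count_good <- 0
for (i in seq_along(empty)) count_good <- count_good + 1
```

For the rows of a data frame, the matching idiom is `seq_len(nrow(x))`.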

22 Iteration: Examples
# But don’t forget about vectorization!
for (i in 1:nrow(births)) {
  births$onemorebaby[i] = births$plur[i]
}
# ...and don’t forget about the index!
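Both reminders can be seen side by side in the sketch below. It uses a made-up data frame, and the `+ 1` arithmetic is my own illustration rather than something taken from the slide.

```r
# Toy stand-in for the births data (values are made up for illustration).
toy <- data.frame(plur = c(1, 2, 1, 3))

# Loop version: the index [i] appears on both sides of the assignment.
toy$plus_loop <- NA_real_
for (i in seq_len(nrow(toy))) {
  toy$plus_loop[i] <- toy$plur[i] + 1
}

# Vectorized version: one line, no index, same result.
toy$plus_vec <- toy$plur + 1

# Forgetting the index on the left overwrites the whole column on every pass,
# so only the value computed from the last row survives.
for (i in seq_len(nrow(toy))) {
  toy$forgot <- toy$plur[i] + 1
}
print(toy)
```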

23 We Try For loops if() else{} switch() ifelse()
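Of the four items, switch() is the only one without an example so far. A minimal sketch: `describe_var()` is a hypothetical helper, and the descriptions are copied from the Relevant Variables slide.

```r
# Hypothetical helper: map a variable name to its description.
describe_var <- function(name) {
  switch(name,
    mage    = "Maternal age",
    mrace   = "Maternal race",
    wksgest = "Calculated estimate of gestation",
    "Unknown variable"  # unnamed last argument acts as the default
  )
}

print(describe_var("mage"))
print(describe_var("xyz"))
```

switch() picks one branch for a single value, so it replaces a long if / else if chain rather than ifelse().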

24 Control: Function…als
Functions that take functions

25 Functionals Take a function as an input, return a vector. “For each of these things, do this, mush it together and give me the results.”

26 lapply()
cols_to_use = c("wksgest", "plur", "kotel")
lapply(births[,cols_to_use], summary)
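To see what lapply() hands back, here is a sketch on a made-up data frame (the name `toy` and its values are assumptions; only two of the slide's column names are used):

```r
# Toy stand-in for the births data (values are made up for illustration).
toy <- data.frame(
  wksgest = c(39, 36, 40),
  plur    = c(1, 2, 1),
  mage    = c(19, 24, 31)
)
cols_to_use <- c("wksgest", "plur")

# A data frame is a list of columns, so lapply() calls summary() once per
# selected column and returns a list with one element per column.
out <- lapply(toy[, cols_to_use], summary)
print(out)

# Elements keep the column names, so results can be pulled out by name.
print(out$wksgest)
```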

27 lapply()

28 sapply() and vapply()
Like lapply(), but simplify output to produce atomic vectors. sapply() is shorthand, vapply() is more explicit.
sapply(iris, is.numeric)
vapply(iris, is.numeric, logical(1))
vapply(iris, is.numeric, numeric(1))
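What "more explicit" buys you shows up in edge cases. A small sketch using the built-in iris data; the variable names are made up for illustration.

```r
# On ordinary input the two agree: a named logical vector, one entry per column.
num_cols <- sapply(iris, is.numeric)
print(num_cols)
print(vapply(iris, is.numeric, logical(1)))

# On empty input, sapply() falls back to an empty list,
# while vapply() still returns the type it was promised.
empty_s <- sapply(list(), is.numeric)
empty_v <- vapply(list(), is.numeric, logical(1))

# vapply() stops with an error when a result does not have the declared shape:
# class() returns one string per column here, not two.
res <- tryCatch(
  vapply(iris, class, character(2)),
  error = function(e) "error"
)
print(res)
```

That predictability is why vapply() is often preferred inside functions, with sapply() kept for interactive work.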

29 sapply() and vapply()

30 Map()/mapply/tapply()
Want multiple inputs or to iterate over objects? Map(), etc.
Different length objects? tapply()
Skim: , Functional Programming / Functionals / Function Operators for more, or the purrr:: package.
Maybe later: anonymous functions defined inline
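A quick sketch of the three functions on made-up vectors, using anonymous functions defined inline. The variable names and values are assumptions for illustration.

```r
# Map(): walk over two inputs in parallel; always returns a list.
sums <- Map(function(x, y) x + y, 1:3, 4:6)

# mapply(): same idea, simplified to a vector when possible.
sums_vec <- mapply(function(x, y) x + y, 1:3, 4:6)

# tapply(): split one vector into groups defined by another,
# then apply a function to each group (groups may differ in size).
wksgest <- c(39, 36, 40, 35)
plur    <- c(1, 2, 1, 2)
mean_by_plur <- tapply(wksgest, plur, mean)

print(sums_vec)
print(mean_by_plur)
```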

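31 We Try
lapply() motivation: Don’t Repeat Yourself (DRY code)
# Generate a sample dataset
set.seed(1014)
df <- data.frame(replicate(6, sample(c(1:10, -99), 6, replace = TRUE)))
names(df) <- letters[1:6]
df
# Repetitive: recode one column at a time
df$a[df$a == -99] <- NA
# Better: write the fix once as a function (define it before using it)...
fix_missing <- function(x) { x[x == -99] <- NA; x }
df$a <- fix_missing(df$a) # etc.
# ...and apply it to every column
lapply(df, fix_missing) # returns a list: cast to df, or assign into df[]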

32 You try
You can flexibly program with [] and [[]], but not as flexibly with $, even though almost always we’ll use $. Can you see why?
Using the births data…
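One way to explore the question is to put a column name in a variable and try each operator. This sketch uses a made-up data frame; `toy` and `this_name` are names chosen for illustration.

```r
# Toy stand-in for the births data (values are made up for illustration).
toy <- data.frame(wksgest = c(39, 36, 40), plur = c(1, 2, 1))
this_name <- "wksgest"

# [[ ]] and [ ] evaluate their argument, so a variable holding a name works.
via_double <- toy[[this_name]]
via_single <- toy[, this_name]

# $ takes the name literally: this looks for a column called "this_name".
via_dollar <- toy$this_name

print(via_double)
print(via_dollar)
```

This is why loops over column names, like the cols_to_use examples earlier, index with brackets rather than $.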

