(& Generalized Linear Models) Nathaniel MacHardy Fall 2014
Why care about data management?
Everything in R is technically data:
"classic" data: tables
Model results
Functions
Cooking with Data
Simple recipes we'll learn today:
Subset according to some rule
Code a variable into simpler categories
Merge data on a key value (ID)
Recode particular values of a variable
Recipe 1: Subset
object.to.subset[subset instructions]
By index: vector[c(1,2,3)], data[2, 4], data[3, ], data[ , 2]
By name: cars[c("speed","dist")]
By logical: (1:3)[c(TRUE, FALSE, TRUE)]
By expression: cars$dist[cars$speed == 7]
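A minimal runnable sketch of these subsetting forms, using the built-in cars data (the specific rows and columns chosen here are just for illustration):
data(cars)
cars[c(1, 2, 3), ]        # by index: first three rows
cars[ , "dist"]           # by name: the dist column
cars[cars$speed > 20, ]   # by logical expression: rows where speed exceeds 20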
Operators: Generate TRUE/FALSE
Feed these into square brackets!
foods %in% c("pizza", "pie")
age > 18
eyecolor == "red"
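A quick sketch of feeding a logical vector into brackets (the foods and age vectors are invented for illustration):
foods <- c("pizza", "salad", "pie")
age   <- c(12, 25, 40)
foods %in% c("pizza", "pie")   # TRUE FALSE TRUE
age[age > 18]                  # keeps 25 and 40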
Basic Functions: Combine/Transform
c()           combine
data.frame()  bind into data frame
rbind()       stack data
merge()       merge data by key
t()           swap rows/columns
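A small sketch putting these together (the data frames here are invented for illustration):
x   <- c(1, 2, 3)                                    # combine values into a vector
df  <- data.frame(id = c("a", "b", "c"), score = x)  # bind into a data frame
df2 <- rbind(df, data.frame(id = "d", score = 4))    # stack on an extra row
t(df2)                                               # swap rows and columns (returns a matrix)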
Functions II: Shortcuts
# the long way to classify
cat <- rep(NA, length(values))  # empty vector
cat[values < 30] <- "low"
cat[values >= 30 & values < 70] <- "medium"
cat[values >= 70] <- "high"
table(cat)
  high    low medium
    90     90    120
Recipe 2: Classifying Data with Cut
cat <- cut(values, c(0, 30, 70, 100))
table(cat)
  (0,30]  (30,70] (70,100]
      90      120       90
cat <- cut(values, c(0, 30, 70, 100), labels = c("low", "med", "high"))  # if desired
Recipe 3: Merge Data
health <- data.frame(id = c("a", "b", "c"), age = c(33, 56, 13))
out    <- data.frame(id = c("b", "c", "a"), outcome = c(0, 0, 1))
merged <- merge(health, out, by = "id")
Recipe 4: Replacing Data
What if you want to recode part of an object?
Easy: subset the "destination" of <-
x <- c(1, NA, 0, NA, 3, NA, 4, 6)
x[is.na(x)] <- 0   # replace NA
y <- c(1, 1, 4, 10, 99, 100)
y[y > 10] <- 10    # clip range
Part II: Generalized Linear Models
Why care about Generalized Linear Models (GLMs)?
Almost all Epi models are GLMs:
Odds ratios (logistic)
Risk ratios (log-binomial)
"Fancy" models are usually just extensions of GLMs:
GWR (geographically weighted regression) adds a "weights matrix"
MCMC (Markov Chain Monte Carlo) fits GLMs with priors (Bayesian estimation)
Basic Parts of GLMs
Y: Outcome, Dependent Variable
X: Predictor(s), Covariates, Exposure(s), Independent Variable(s)
Y = mX + b
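(Aside: what makes the model "generalized" is a link function g relating the mean of Y to the linear predictor, g(E[Y]) = b0 + b1*X. The identity link gives the familiar Y = mX + b; the logit link gives logistic regression, which is where the family argument below comes in.)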
Specifying a model in R
A "formula" with:
Outcome on left
Exposures on right
Y ~ X
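A few common formula patterns (the variable names are placeholders):
y ~ x                 # one exposure
y ~ x1 + x2           # multiple exposures (e.g. exposure plus a confounder)
y ~ x1 + x2 + x1:x2   # with an interaction term (can also be written y ~ x1 * x2)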
Example: Using glm()
data(CO2); attach(CO2)
glm(uptake ~ Type)

Call: glm(formula = uptake ~ Type)
Coefficients:
    (Intercept)  TypeMississippi
          33.54           -12.66
Degrees of Freedom: 83 Total (i.e. Null); 82 Residual
Null Deviance:     9707
Residual Deviance: 6341    AIC: 607.6
Example: Using glm()
data(CO2); attach(CO2)
glm(uptake ~ Type + conc)

Call: glm(formula = uptake ~ Type + conc)
Coefficients:
    (Intercept)  TypeMississippi             conc
       25.83005        -12.65952          0.01773
Degrees of Freedom: 83 Total (i.e. Null); 81 Residual
Null Deviance:     9707
Residual Deviance: 4056    AIC: 572.1
More info: glm() object
m1 <- glm(uptake ~ Type + conc)
summary(m1)

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      25.830052   1.579918  16.349  < 2e-16 ***
TypeMississippi -12.659524   1.544261  -8.198 3.06e-12 ***
conc              0.017731   0.002625   6.755 2.00e-09 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
More info: glm() object
> confint(m1)
Waiting for profiling to be done...
                      2.5 %       97.5 %
(Intercept)      22.7334700  28.92663347
TypeMississippi -15.6862201  -9.63282749
conc              0.0125859   0.02287528
Getting Epidemiological Measures
Specify the family argument:
glm(y ~ x)                               # default (gaussian): risk difference
glm(y ~ x, family = binomial("logit"))   # odds ratio
For odds ratios and risk ratios, make sure to exponentiate the output with exp():
R gives you the beta coefficient (the log odds ratio or log risk ratio).
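A minimal sketch (assuming y is a 0/1 outcome and x an exposure in a data frame called mydata; these names are invented for illustration):
m.or <- glm(y ~ x, family = binomial("logit"), data = mydata)  # logistic model
exp(coef(m.or))      # odds ratios
exp(confint(m.or))   # 95% CIs on the odds ratio scale
m.rr <- glm(y ~ x, family = binomial("log"), data = mydata)    # log-binomial model
exp(coef(m.rr))      # risk ratios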
Lab: Practice with GLMs
You'll learn more about model-building in courses:
What goes in the model (confounders)
Choosing a "best" model
Practice data management:
Preparing data for analyses
Formatting/extracting results