R Programming I: Basic data types, structures & subsetting


1 R Programming I: Basic data types, structures & subsetting
EPID 799C Fall 2018

2 Suggestion for Class Arrival
Download the lecture.
Open a lecture-specific scratchpad R script to “take notes”.
Open the homework script / assignment and half-listen for nuggets.

3 A clear honor code note!
Technically, the HW answers (among other things) are online (old site). We’re asking you not to look in advance. That’s it. That’s how the honor code works.

4 Prototypical Epi Analysis
Speaking of the homework, here’s the gist: Note a little overlap of HW2. We’ll occasionally learn some useful tools “out of order” – slightly more advanced concepts that you’d often want to pull out right away. But generally we’re working in order.

5 Overview
Functional focus: exploration & recoding.
Building toolset: rules of R syntax; basic object types & data structures; basic operators.
Suggest: get two scripts open, the class scratchpad and HW1. Load births into the scratchpad at the top.

6 Elements of R Syntax
Objects: 3 pi Mydata somevariable
Operators: + - * / & | %in%
Functions: mean() sd() plot() glm()

7 Rules of R Grammar
R evaluates expressions. Expressions are objects linked using operators and functions:
Operators link objects side-by-side:
    weight/height^2
    data$variable
Functions link objects in (optionally) named groups:
    sum(1,2,3,4)
    rnorm(n=10, mean=0, sd=1)

8 Everything else is vocabulary!
Recommend: Try using all (/most) of these!

9 Organizing Syntax Elements
1. Objects: classes, data structures
2. Operators: assignment, infix operators
3. Functions (more next class)

10 1A. Classes: typeof() or class()
Basic types: logical, numeric, integer, real, complex, character.
Useful compound types: dates, factors, … so many others; we’ll start Wednesday.
Coercion, e.g.:
    as.numeric()
    str(c("a", 1))
    sum(T, T, F)
    class(births_sm)
    typeof()
Coercion can be explicit or implicit. There is no character vector in our small set, so let’s make one. There is no good logical either, so let’s make one: is_preterm. Summing a logical vector is implicit coercion.
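A minimal sketch of explicit versus implicit coercion. The vectors here (x_chr, mixed, is_preterm) are made-up values for illustration, not columns from births_sm:

```r
# Explicit coercion: ask for the conversion by name.
x_chr <- c("1", "2", "3")
x_num <- as.numeric(x_chr)        # character -> numeric

# Implicit coercion: c() promotes mixed inputs to one common type.
mixed <- c("a", 1)                # the number 1 becomes the string "1"

# Implicit coercion: sum() counts TRUE as 1 and FALSE as 0.
is_preterm <- c(TRUE, TRUE, FALSE)   # hypothetical logical vector
n_preterm <- sum(is_preterm)

print(class(x_num))
print(typeof(mixed))
print(n_preterm)
```

Summing a logical vector is a quick way to count how many elements meet a condition.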

11 1B. Data Structures: str()
        Homogeneous      Heterogeneous
1D      Atomic vector    List
2D      Matrix           Data frame
ND      Array
…plus compound types.
Create with, conveniently: c() list() matrix() data.frame() * array()
Most relevant functions: str() class() typeof() length() nrow() ncol() dim() attributes() attr() names() rownames() colnames()
Lists are recursive vectors. They are used to make most compound types, but can be tough!
    c(1:10, c(1,2,3))
    matrix(1:10, nrow=2)
    array(1:24, dim=c(2,3,4))
    list(1:2, "label", fun="sum")
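As a rough sketch, the constructors above can be run and inspected like this. The object names and the small data frame are invented for the example, and the array call assumes the intent was a 2 x 3 x 4 array:

```r
v  <- c(1:10, c(1, 2, 3))                # atomic vector
m  <- matrix(1:10, nrow = 2)             # 2 rows, 5 columns
a  <- array(1:24, dim = c(2, 3, 4))      # 3-dimensional array
l  <- list(1:2, "label", fun = "sum")    # list: elements of different types
df <- data.frame(id = c("A", "B"), bp = c(115, 120))  # hypothetical data

# Inspect structure and shape.
str(l)
print(length(v))
print(dim(m))
print(dim(a))
print(c(nrow(df), ncol(df)))
print(names(l))
```

str() is usually the fastest way to see both the structure and the types inside it.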

12 Sidenote: names()
Names will play a big role later, but in short: you can name the “cells” of a vector or list, or the rows or columns of a matrix/data.frame. They get stored in attributes.
    QE_scores = c("Student A" = 80, "Student B" = 90, "Student C" = 75)
    typeof(QE_scores)
    names(QE_scores)
    str(QE_scores)

13 Assignment: = or <-
To define an object, use <- or =
    students <- 20      # [no output]
    students            # 20
    births_top = head(births_sm)
    births_top
An expression without assignment prints the result but does not modify any objects. An expression with assignment defines an object but does not display the result. Just making stuff without saving it isn’t that useful. We can create and save things into the data frame, or as free-floating helper variables.

14 Atoms: The Basic Building Block
One “unit” of data:
    my_year = 2012
    my_study = "Births: PNC & Preterm Birth"
    my_year
    my_study
Technically, every atom is a vector of length 1.
Diagram: a value (2012) bound to a symbol (my_year) makes an object.

15 Arithmetic Operators
All of the basic operators (and order of operations) work like you [should] expect with atoms:
    /3
    births_top$wksgest + 1
    %%     # remainder
    %/%    # integer division (quotient)
You can create your own operators, which we won’t cover much but can be nice. That’s what %over% is.

16 Logical Operators
If you ask R to evaluate an equation, inequality, or Boolean expression of atoms, it will return TRUE or FALSE:
    1 == 2+3            # FALSE
    3 < 4               # TRUE
    12 >= 13-1          # TRUE
    TRUE & FALSE        # FALSE
    TRUE | FALSE        # TRUE
    (3<4) & !(FALSE)    # TRUE

17 Aside: Fancier Binary Operators
    %in%      # try 1 %in% 1:4 …and 1:4 %in% 1
    %>%       # pipe (magrittr)
    %over%    # spatial “over” (sp)
… and define your own! “Infix” operators (e.g. a FUNCTION b) are really just calling FUNCTION(a, b). More next class on functions. See: infix operators
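A small sketch of how infix operators relate to ordinary function calls. The operator %divides% is a made-up name for this example, not something from a package:

```r
# %in% asks, for each element on the left, whether it appears on the right.
in_a <- 1 %in% 1:4      # one answer, because the left side has length 1
in_b <- 1:4 %in% 1      # four answers, one per element of 1:4

# Any operator can be called like a regular function by quoting it in backticks.
five <- `+`(2, 3)

# Define a custom infix operator: the name must start and end with %.
`%divides%` <- function(a, b) b %% a == 0   # hypothetical operator
three_divides_twelve <- 3 %divides% 12

print(in_a)
print(in_b)
print(five)
print(three_divides_twelve)
```

The direction of %in% matters: the result always has the length of the left-hand side.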

18 Vectors: Atoms in Sequence
Multiple “units” of data:
    locker_combo = c(12,24,7)
    foods = c("Pie", "Pizza", "Tofu")
    top_gestation_obs = births_top$wksgest

19 Arithmetic with Vectors
Arithmetic operators can be used on vectors with other vectors or atoms:
    top_gestation_obs + 1:6
    top_gestation_obs + 1
    top_gestation_obs + 1:2    # recycling!
    top_gestation_obs + births_top$weeknum
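To make recycling concrete, here is a sketch with a made-up six-element vector standing in for top_gestation_obs (the values are invented):

```r
gest <- c(39, 40, 38, 41, 37, 40)   # hypothetical gestation weeks

plus_one  <- gest + 1      # the single value 1 is reused for every element
plus_seq  <- gest + 1:6    # same length: added element by element
plus_pair <- gest + 1:2    # 1:2 is recycled as 1, 2, 1, 2, 1, 2

print(plus_one)
print(plus_seq)
print(plus_pair)
```

R recycles silently when the longer length is a multiple of the shorter one, so it is worth checking vector lengths before relying on it.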

20 Vectorized Arithmetic
The heights and weights of five patients in a cohort study at baseline were 64, 72, 70, 67, 73 inches and 80, 85, 79, 72, and 90 kilograms.
1. Create a separate height vector and a weight vector containing the data.
2. Convert the height vector to centimeters (1 inch = 2.54 cm).
3. Use vector arithmetic to calculate a patient bmi* vector (bmi = weight[kg] / height[m]^2, so divide the centimeters by 100 first).
4. Now do the same thing inside a data.frame.
* Problematic as BMI is…
On #3: why do we do that? Free floating vs. row bound.
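One possible way to work the exercise, as a sketch rather than the official answer. BMI is conventionally weight in kilograms over height in meters squared, so this version divides the centimeter heights by 100; the object names are our own:

```r
# Free-floating vectors.
height_in <- c(64, 72, 70, 67, 73)
weight_kg <- c(80, 85, 79, 72, 90)

height_cm <- height_in * 2.54               # vectorized unit conversion
bmi <- weight_kg / (height_cm / 100)^2      # kg / m^2, element by element

# The same calculation, row-bound inside a data frame.
patients <- data.frame(height_in = height_in, weight_kg = weight_kg)
patients$height_cm <- patients$height_in * 2.54
patients$bmi <- patients$weight_kg / (patients$height_cm / 100)^2

print(round(bmi, 1))
print(patients)
```

Keeping the columns in one data frame means each patient's values stay on the same row if the data are later sorted or filtered.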

21 Logic with Vectors
Logical operators can also be used on vectors with other vectors or atoms:
    a = 1:5
    a                   # 1 2 3 4 5
    a>2                 # apply to all: F F T T T
    b = c(3,2,1,3,5)
    b                   # 3 2 1 3 5
    a==b                # element-wise: F T F F T
    a>=b                # F T T T T

22 Pause to Reflect
We have basic types of data: numbers, logicals, characters, etc. We’ll see more later, but they’ll follow similar rules.
We’ve seen basic data structures, most notably for now vectors and data.frames. We’ll see more later (especially lists), but again, similar.
We’re about to hit the first powerful R concepts: vectorization (operating on a whole vector at once), and vectorized subsetting, including with data.frames.

23 Subsetting Indexing vectors, lists, matrices and data.frames

24 Slicing Vectors with Atoms
Slice a vector using the square brackets: []
    top_gestation_obs[3]
    births_top$weeknum[1]     # 44
    births_top$smoker_f[2]
Think of this as “indexing”, or “referencing” part of a vector.

25 Slicing Vectors with Vectors
Slice a vector using square brackets: []
    births_top$weeknum[1:3]
    births_top$weeknum[c(T,T,T,F,F,F)]
    births_top$mage[c(T,F,T,T,F,T)]    # >20
    # Remember, nothing “happens” to our original
    # vector unless we are using an assignment!
Here’s where the power comes in.

26 Subsetting with expressions
Combine a slice with a logical test to query a vector (return all elements that match a condition):
    # Step by step…
    births_top$mage>20
    births_gte_20 = births_top$mage>20
    births_top$mage[births_gte_20]         # >20
    # But usually just…
    births_top$mage[births_top$mage>20]    # >20
    # Or just as valid
    births_top$mage[births_top$raceeth_f == "Other"]
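A self-contained sketch of the same pattern, using two made-up vectors in place of the births_top columns (the values and category labels are invented):

```r
mage      <- c(19, 25, 31, 18, 22, 35)                           # hypothetical ages
raceeth_f <- c("Other", "White", "Other", "Black", "White", "Other")

# Step by step: the test returns a logical vector, which then selects elements.
is_over_20 <- mage > 20
over_20 <- mage[is_over_20]

# The usual one-liner gives the same result.
over_20_again <- mage[mage > 20]

# The test can use a different vector, as long as the lengths line up.
mage_other <- mage[raceeth_f == "Other"]

print(is_over_20)
print(over_20)
print(mage_other)
```

The logical vector works as a mask: positions marked TRUE are kept, and everything else is dropped.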

27 Combining subsetting & queries
We now have some powerful, one-line tools. Using births_sm (all records):
1. What’s the mean of the weeks gestation for everyone?
2. What’s the mean for smokers*? Non-smokers? Those missing the smoking variable (hint: use the is.na() function)?
3. What’s the mean of wksgest for moms with mage* <20? >=20? >=30?
*Remember to deal with missing values. Forget how? Try tab autocomplete, ?mean, or F1 on mean() to get that syntax.
This is powerful, but not powerful enough. Later we’ll have much more efficient ways to do this…
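The pattern behind these questions can be sketched on a tiny invented data frame. births_toy and its values are hypothetical, and the "Smoker" / "Non-smoker" labels are assumed rather than taken from births_sm:

```r
births_toy <- data.frame(
  wksgest  = c(39, 40, NA, 36, 41, 38),
  smoker_f = c("Smoker", "Non-smoker", NA, "Smoker", "Non-smoker", NA),
  mage     = c(19, 25, 31, NA, 22, 35)
)

# Overall mean, dropping missing gestation values.
mean_all <- mean(births_toy$wksgest, na.rm = TRUE)

# Mean within a subgroup: subset the vector first, then summarize.
is_smoker <- births_toy$smoker_f == "Smoker"     # NA where smoking is missing
mean_smokers <- mean(births_toy$wksgest[is_smoker], na.rm = TRUE)

# Rows where the smoking variable itself is missing.
mean_unknown <- mean(births_toy$wksgest[is.na(births_toy$smoker_f)], na.rm = TRUE)

# Subgroup defined by a numeric cutoff.
mean_under_20 <- mean(births_toy$wksgest[births_toy$mage < 20], na.rm = TRUE)

print(c(mean_all, mean_smokers, mean_unknown, mean_under_20))
```

A comparison against NA returns NA, and indexing with NA yields an NA element, which is why na.rm = TRUE is needed even after subsetting.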

28 Lists: Mixed Vectors
A list is a vector that can have multiple modes (flavors). Lists work like vectors but can also be referenced slightly differently (double brackets: [[ ]]) to return not just the subset of the list, but what’s in that subset. For named elements, [[ ]] behaves like $!
    list(thing="A", 1, TRUE)
    list(thing="A", 1, TRUE)[1]
    list(thing="A", 1, TRUE)[[1]]
    list(thing="A", 1, TRUE)$thing
Lists are a useful object for complex operations and objects. We will cover them later, but this is a useful glimpse for data.frames... We will spend a dedicated class, if not a week, on lists later. But for now, try / watch this.
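A short sketch of the difference between single and double brackets on the list from this slide (the object name x is ours):

```r
x <- list(thing = "A", 1, TRUE)

sub_list <- x[1]          # single bracket: still a list, of length 1
contents <- x[[1]]        # double bracket: the element itself
by_name  <- x$thing       # $ is shorthand for x[["thing"]]

print(class(sub_list))
print(class(contents))
print(identical(by_name, x[["thing"]]))
```

A common mental picture: [ ] returns a smaller container, while [[ ]] and $ open the container and hand back what is inside.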

29 Sidenote: Matrices are organized Vectors
Vectors can be connected into a matrix with rbind() or cbind():
    a = c(1,2,3)
    b = c(4,5,6)
    rbind(a,b)    # stack as rows
    cbind(a,b)    # bind as columns

30 Slicing Matrices
Like vectors, matrices can be sliced using []. Give slice instructions for both rows and columns (leave one blank to specify “all”), separated by a comma:
    m = rbind( 4:6, 7:9 )    # stack rows
    m[1, ]                   # row 1, all columns
    m[ ,2]                   # all rows, column 2
    m[1:2,2:3]               # row 1 to 2, col 2 to 3

31 Slicing with Matrices
Matrices (rectangular data) can also be sliced by a logical matrix (or by extension, a logical test that returns a logical matrix):
    m = cbind( c(4,7), c(5,8), c(6,9) )    # same m
    m[rbind( c(T,F,T), c(F,T,F) )]         # 4 8 6
    m[m%%2==0]                             # even numbers: 4 8 6
    # Note that this approach returns a vector!
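The order of the result (4 8 6 rather than 4 6 8) follows from how R stores a matrix: column by column. A small sketch, with our own variable names:

```r
m <- rbind(4:6, 7:9)

# A matrix is a vector with dimensions, stored one column after another.
stored_order <- as.vector(m)

# The logical test returns a matrix of the same shape...
is_even <- m %% 2 == 0

# ...and subsetting with it walks the cells in storage (column) order.
evens <- m[is_even]

# which(..., arr.ind = TRUE) reports the row/column position of each match.
positions <- which(is_even, arr.ind = TRUE)

print(stored_order)
print(evens)
print(positions)
```

Because the result is a plain vector, the row and column positions are lost unless they are recovered separately, for example with which().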

32 Double Slicing
Remember, output can always be input: you can also slice the result of a slice as an alternative specification:
    m = rbind( 4:6, 7:9 )    # better
    v = m[1, ]               # v is 4 5 6
    v[2]                     # 5
    # one step
    m[1, ][2]                # 5

33 Data Frames: Mixing and Naming
Data frames allow you to mix-and-match different modes (flavors) of vectors into a matrix-like table you can reference by name. This is a data set. The benefit is treating related data together (vs. all free-floating vectors). We’ve been using this since day 1.
    id = c("A","B","C")
    bp = c(115, 120, 130)
    dx = c(0, 0, 1)
    data.frame(id,bp,dx)
What has happened here? We’ve bound together some named vectors!

34 Slicing & Assignment
Remember names? In addition to using the matrix methods, you can also make references by name using the [] or $ operators*:
    names(births_top)
    births_top["wksgest"]      # return df
    births_top$wksgest
    births_top[, "wksgest"]
    births_top[1:3, 1:3]
    births_top[births_top$mage == 20, "wksgest"]
    subset(births_top, subset = births_top$mage == 20, select = "wksgest")    # Rarely do this
    births_top[, c("wksgest", "weeknum", "mage")]
One of the few good reasons for an alternate data structure we’ll cover later (the tibble).
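A sketch of what each slicing form returns, on a small invented data frame that borrows the column names from the slide (births_toy and its values are hypothetical):

```r
births_toy <- data.frame(
  wksgest = c(39, 40, 38),
  weeknum = c(1, 2, 3),
  mage    = c(20, 25, 20)
)

one_col_df  <- births_toy["wksgest"]                  # list-style: a data frame
one_col_vec <- births_toy[, "wksgest"]                # matrix-style: drops to a vector
kept_df     <- births_toy[, "wksgest", drop = FALSE]  # matrix-style, kept as a data frame
wks_age_20  <- births_toy[births_toy$mage == 20, "wksgest"]  # rows by test, column by name

print(class(one_col_df))
print(class(one_col_vec))
print(class(kept_df))
print(wks_age_20)
```

With a base data frame, selecting a single column in matrix style changes the return type unless drop = FALSE is given, which can surprise code that expects a data frame back.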

35 Slicing & Assignment
    # Power and danger: We’re allowed to
    # (and often going to) do this!
    births_top$mage[births_top$mage < 20] = NA
Careful! Don’t overwrite your original births_sm data! (Meh, if you do, Ctrl+Shift+F10 and start over.)
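One way to stay safe is to recode a working copy and leave the original untouched. A minimal sketch with invented data (births_toy and births_work are hypothetical names):

```r
births_toy <- data.frame(mage = c(18, 25, 19, 30))   # stands in for the original data

births_work <- births_toy                            # work on a copy
births_work$mage[births_work$mage < 20] <- NA        # assign into the matching cells only

print(births_work$mage)
print(births_toy$mage)    # the original is unchanged
```

The left-hand side selects which cells to change, and the right-hand side supplies the new value for each of them.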

36 Slicing Data Frames: Advanced Note (for later!)
    births_top["wksgest"]
    births_top[["wksgest"]]    # *technically... $ == [[ ]]
    births_top[, sapply(births_top, is.numeric)]    # ^ looking weeks ahead, but can you guess?

37 Functions: Taking Action
Functions enable you to perform tasks. A function takes one or more arguments, separated by commas. We’ve been using them! Arguments can go in order, or directly by name:
    mean(dat$bp)                # one argument
    table(dat$id, dat$bp)       # two arguments
    rnorm(n=10,mean=1,sd=2)     # named arguments
…More layers Wednesday.
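A sketch of positional versus named arguments. The data frame dat is a made-up stand-in, since the slide does not define it:

```r
dat <- data.frame(id = c("A", "B", "C"), bp = c(115, 120, 130))  # hypothetical data

mean_bp <- mean(dat$bp)                        # one positional argument

# Arguments matched by position and by name give the same call.
set.seed(42)
by_position <- rnorm(3, 1, 2)                  # n, mean, sd in order
set.seed(42)
by_name <- rnorm(sd = 2, mean = 1, n = 3)      # any order once named

# Named arguments are the usual way to set options such as na.rm.
mean_no_na <- mean(c(1, NA, 3), na.rm = TRUE)

print(mean_bp)
print(identical(by_position, by_name))
print(mean_no_na)
```

Naming arguments makes a call easier to read and protects against passing values in the wrong order.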

38 You’re getting dangerous! 
Data types & data structures to hold them.
Vectorization & vectorized subsetting for efficiency.
Basic operators and functions when you need them.
Enough, already, to do a lot of exploration.

39 Activity: Tour (full/narrow/recoded) Dataset
Using births_sm from the rdata file, answer the questions below. Also note the “resources” folder! Use these functions (and others): dim() summary() table() hist() plot()
1. How many observations are there, and how many variables are in the (small) births dataset? (Hint: see HW1!)
2. What is the average maternal age (mage)?
3. Make a histogram of gestational age (WKSGEST). What are the minimum and maximum (non-99) gestational ages?
4. How many mothers smoked (smoker_f)?
5. Make a scatterplot of maternal age versus gestational age.

40 Activity: Tour (full/narrow/unrecoded) Dataset
Now use read.csv() to read births2012_sm.csv.
1. How many observations are there, and how many variables are in the (full/wide) births dataset?
2. How do the types of variables compare to those in births_sm? (Hint: we’ve got some recoding to do!)
3. What is the average maternal age (mage) now? How many mothers have the value 99?
4. How many mothers smoked (smoker_CIGDUR)?
5. Make a histogram of gestational age (WKSGEST). What are the minimum and maximum (non-99) gestational ages?
Try the same questions with births2012.csv, the full/wide/unrecoded dataset. Note: much bigger!
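The "non-99" questions hint at the recoding step: 99 appears to be a missing-value code rather than a real measurement. A sketch of one way to handle it on an invented vector (wksgest_raw is hypothetical and is not read from the CSV):

```r
wksgest_raw <- c(39, 99, 40, 36, 99, 41)    # made-up values; 99 assumed to mean "unknown"

n_coded_99 <- sum(wksgest_raw == 99)        # how many carry the code

wksgest <- wksgest_raw                      # recode a copy, keep the raw values
wksgest[wksgest == 99] <- NA

gest_range <- range(wksgest, na.rm = TRUE)  # minimum and maximum of the real values

print(n_coded_99)
print(gest_range)
```

Without the recode, the 99s would be treated as real weeks and inflate the maximum and the mean.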

41 Packages: Ready for next class
Packages are extensions to base R. They contain additional functions, documentation for using them, and sample data. Packages are available from the Comprehensive R Archive Network (CRAN). The “tidyverse” is a set of packages for data manipulation, exploration, and visualization. They share a common design and work in harmony. We’ll be using it extensively.
    # Install and load the package 'tidyverse'
    install.packages('tidyverse')    # only need to run once
    library(tidyverse)               # run once per R session to load it

