R Programming I: Basic data types, structures & subsetting


R Programming I: Basic data types, structures & subsetting EPID 799C Fall 2018

Suggestion for Class Arrival Download lecture Open lecture-specific scratchpad R script to “take notes” Open Homework script / assignment and half-listen for nuggets. 

A clear honor code note! Technically, the HW answers (among other things) are online (old site) We’re asking you not to look in advance. That’s it. That’s how the honor code works. 

Prototypical Epi Analysis Speaking of the homework, here’s the gist: Note a little overlap of HW2. We’ll occasionally learn some useful tools “out of order” – slightly more advanced concepts that you’d often want to pull out right away. But generally we’re working in order.

Overview Functional Focus: Exploration & Recoding Building toolset: Rules of R syntax Basic object types & data structures Basic operators Suggest: get two scripts open. Class scratchpad, and HW1. Load births into the scratchpad at the top.

Elements of R Syntax Objects: 3 pi Mydata somevariable Operators: + - * / & | %in% Functions: mean() sd() plot() glm()

Rules of R Grammar R evaluates expressions. Expressions are objects linked using operators and functions: Operators link objects side-by-side. 1+2 weight/height^2 data$variable Functions link objects in (optionally) named groups. sum(1,2,3,4) rnorm(n=10, mean=0,sd=1)
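A minimal sketch of the two linking rules above. The weight and height values are made up for illustration; only sum() and rnorm() come from the slide.

```r
# Operators link objects side by side
weight <- 70       # kilograms (made-up value)
height <- 1.75     # metres (made-up value)
bmi <- weight / height^2    # ^ is evaluated before /

# Functions link objects in (optionally) named groups
total <- sum(1, 2, 3, 4)                   # positional arguments
draws <- rnorm(n = 10, mean = 0, sd = 1)   # named arguments

print(bmi)
print(total)
print(length(draws))
```

Typing an expression on its own, such as weight / height^2, prints the result; wrapping it in an assignment stores it instead.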

Everything else is vocabulary! Recommend: Try using all (/most) of these! http://adv-r.had.co.nz/Vocabulary.html

Organizing Syntax Elements 1. Objects Classes Data structures 2. Operators Assignment Infix operators 3. Functions (More next class)

1A. Classes Inspect with typeof() or class(). Basic types: logical, numeric (integer, real), complex, character. Useful compound types: dates, factors … and so many others; we’ll start Wednesday. Coercion, e.g.: as.numeric() str(c("a", 1)) sum(T, T, F) class(births_sm) typeof() Coercion can be explicit or implicit: CODE No character vector in our small set. Let’s make one. No good logical in our small set either. Let’s make one: is_preterm. sum() on a logical vector is implicit coercion.
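A short sketch of implicit versus explicit coercion. The vectors here (mixed, is_preterm, wks_chr) are made-up stand-ins, not columns of the births data.

```r
# Implicit coercion: mixing types in c() promotes everything to the most flexible type
mixed <- c("a", 1)
class(mixed)                 # "character": the number 1 became "1"

# Implicit coercion: logicals are treated as 1/0 in arithmetic
is_preterm <- c(TRUE, TRUE, FALSE)   # hypothetical logical vector
n_preterm <- sum(is_preterm)         # 2

# Explicit coercion: we ask for the conversion ourselves
wks_chr <- c("39", "40", "unknown")
wks_num <- suppressWarnings(as.numeric(wks_chr))  # "unknown" cannot be converted, so it becomes NA
print(wks_num)
```

Without suppressWarnings(), R warns that NAs were introduced by coercion, which is usually a helpful signal that some values need recoding.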

1B. Data Structures Inspect with str(). Homogeneous: atomic vector (1D), matrix (2D), array (ND). Heterogeneous: list (1D), data frame (2D) …plus compound types. Create with, conveniently: c() list() matrix() data.frame() * array() Most relevant Functions: str() class() typeof() length() nrow() ncol() dim() attributes() attr() names() rownames() colnames() Lists are recursive vectors. They are used to make most compound types, but can be tough! c(1:10, c(1,2,3)) matrix(1:10, nrow=2) array(1:24, dim=c(2,3,4)) list(1:2, "label", fun="sum")
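The constructors above can be tried side by side. This is a sketch with toy values; the data frame columns id and bp are illustrative only.

```r
v  <- c(1:10, c(1, 2, 3))               # atomic vector (homogeneous, 1D)
m  <- matrix(1:10, nrow = 2)            # matrix (homogeneous, 2D): 2 rows, 5 columns
a  <- array(1:24, dim = c(2, 3, 4))     # array (homogeneous, ND)
l  <- list(1:2, "label", fun = "sum")   # list (heterogeneous, 1D)
df <- data.frame(id = c("A", "B"), bp = c(115, 120))  # data frame (heterogeneous, 2D)

length(v)    # 13
dim(m)       # 2 5
dim(a)       # 2 3 4
str(l)       # compact summary of each list element
names(df)    # "id" "bp"
```

str() is usually the quickest way to see both the structure and the type of an unfamiliar object.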

Sidenote: names() Names will play a big role later, but in short: you can name the “cells” of a vector or list, or the rows or columns of a matrix/data.frame. They get stored in attributes. QE_scores = c("Student A" = 80, "Student B" = 90, "Student C" = 75) typeof(QE_scores) names(QE_scores) str(QE_scores)

Assignment: = or <- To define an object, use <- or = students <- 20 [no output] students 20 births_top = head(births_sm) births_top An expression without assignment prints the result but does not modify any objects. An expression with assignment defines an object but does not display the result. Just making stuff without saving it isn’t that useful. We can create and save things into the data frame, or free floating helper variables.

Atoms: The Basic Building Block One “unit” of data my_year = 2012 my_study = "Births: PNC & Preterm Birth" my_year my_study Technically, every atom is a vector of length 1. In each assignment the name (my_year, my_study) is the symbol, the right-hand side (2012, "Births: PNC & Preterm Birth") is the value, and together they make up the object.

Arithmetic Operators All of the basic operators (and order of operations) work like you [should] expect with atoms: 1+1 18-19 100/3 births_top$wksgest + 1 %% # remainder %/% # integer division (quotient) You can create your own operators, which we won’t cover much but can be nice. That’s what %over% is.
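A quick sketch of the less familiar operators and of operator precedence. The wksgest vector is a made-up stand-in for births_top$wksgest.

```r
remainder <- 17 %% 5     # remainder after division: 2
quotient  <- 17 %/% 5    # integer division (quotient): 3
ordered   <- 2 + 3 * 4^2 # ^ first, then *, then +: 50

wksgest <- c(39, 40, 36)   # hypothetical gestational ages in weeks
wks_plus_one <- wksgest + 1   # 40 41 37
days_gest    <- wksgest * 7   # 273 280 252
print(days_gest)
```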

Logical Operators If you ask R to evaluate an equation, inequality, or Boolean expression of atoms, it will return TRUE or FALSE: 1 == 2+3 FALSE 3 < 4 TRUE 12 >= 13-1 TRUE TRUE & FALSE FALSE TRUE | FALSE TRUE (3<4) & !(FALSE) TRUE

Aside: Fancier Binary Operators %in% # try 1 %in% 1:4 # …and 1:4 %in% 1 %>% # pipe (magrittr) %over% # spatial “over” (sp) … and define your own! “Infix” operators (e.g. a FUNCTION b) are really just calling FUNCTION(a, b). More next class on functions See: infix operators
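A sketch of how infix operators relate to ordinary function calls. The operator %btw% is a hypothetical name invented here for illustration; it is not part of base R or of any package mentioned in the slides.

```r
in_one  <- 1 %in% 1:4    # TRUE: is 1 found anywhere in 1:4?
in_many <- 1:4 %in% 1    # TRUE FALSE FALSE FALSE: which elements of 1:4 are found in 1?

# Every operator is a function underneath
three <- `+`(1, 2)       # same as 1 + 2

# Defining your own: any function named %something% can be used infix
`%btw%` <- function(x, rng) x >= rng[1] & x <= rng[2]   # hypothetical helper
ages <- c(18, 25, 40)
in_range <- ages %btw% c(20, 35)   # FALSE TRUE FALSE
print(in_range)
```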

Vectors: Atoms in Sequence Multiple “units” of data locker_combo = c(12,24,7) foods = c("Pie", "Pizza", "Tofu") top_gestation_obs = births_top$wksgest locker_combo holds 12 24 7; foods holds "Pie" "Pizza" "Tofu".

Arithmetic with Vectors Arithmetic operators can be used on vectors with other vectors or atoms: top_gestation_obs + 1:6 top_gestation_obs + 1 top_gestation_obs + 1:2 #recycling! top_gestation_obs + births_top$weeknum
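Recycling is easiest to see with small numbers. In this sketch top_gestation_obs is filled with made-up values rather than taken from births_top.

```r
top_gestation_obs <- c(39, 40, 38, 41, 37, 40)   # hypothetical values, length 6

plus_seq   <- top_gestation_obs + 1:6   # element-wise: lengths match
plus_one   <- top_gestation_obs + 1     # the single 1 is recycled six times
plus_cycle <- top_gestation_obs + 1:2   # 1, 2 is recycled: 1 2 1 2 1 2

print(plus_cycle)   # 40 42 39 43 38 42
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which usually indicates a mistake.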

Vectorized Arithmetic The heights and weights of five patients in a cohort study at baseline were 64, 72, 70, 67, 73 inches and 80, 85, 79, 72, and 90 kilograms. Create a separate height vector and a weight vector containing the data. Convert the height vector to centimeters (1 inch = 2.54 cm). Use vector arithmetic to calculate a patient bmi* vector (bmi = weight[kg] / height[m]^2, so divide the centimeter heights by 100 first). Now do the same thing inside a data.frame. * Problematic as BMI is… On #3 – why do we do that? Free floating vs. row bound.
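One possible way to work the exercise, offered as a sketch rather than the official answer. It uses the conventional BMI definition in kg/m^2, so the centimeter heights are divided by 100; the object names (height_in, weight_kg, patients) are choices made here.

```r
# Free-floating vectors
height_in <- c(64, 72, 70, 67, 73)
weight_kg <- c(80, 85, 79, 72, 90)
height_cm <- height_in * 2.54               # one multiplication converts all five values
bmi <- weight_kg / (height_cm / 100)^2      # kg divided by metres squared, element by element

# The same calculation inside a data frame: values stay bound to their rows
patients <- data.frame(height_in, weight_kg)
patients$height_cm <- patients$height_in * 2.54
patients$bmi <- patients$weight_kg / (patients$height_cm / 100)^2

print(round(patients, 1))
```

Keeping the columns in one data frame means that sorting or filtering patients later cannot accidentally misalign heights, weights, and BMI.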

Logic with Vectors Logical operators can also be used on vectors with other vectors or atoms: a = 1:5 a # [1] 1 2 3 4 5 a>2 # apply to all: F F T T T b = c(3,2,1,3,5) b # [1] 3 2 1 3 5 a==b # element-wise: F T F F T a>=b # F T T T T

Pause to Reflect We have basic types of data Numbers, logicals, characters, etc. We’ll see more later, but they’ll follow similar rules We’ve seen basic data structures Most notably for now: vectors and data.frames We’ll see more later (especially lists), but again, similar We’re about to hit the first powerful R concepts: vectorization (operating on a whole vector at once), and vectorized subsetting, including with data.frames

Subsetting Indexing vectors, lists, matrices and data.frames

Slicing Vectors with Atoms Slice a vector using the square brackets: [] top_gestation_obs[3] births_top$weeknum[1] 44 births_top$smoker_f[2] Think of this as “indexing”, or “referencing” part of a vector

Slicing Vectors with Vectors Slice a vector using square brackets: [] births_top$weeknum[1:3] births_top$weeknum[c(T,T,T,F,F,F)] births_top$mage[c(T,F,T,T,F,T)] #>20 # Remember, nothing “happens” to our original # vector unless we are using an assignment! Here’s where the power comes in.

Subsetting with expressions Combine a slice with a logical test to query a vector (return all elements that match a condition): # Step by step… births_top$mage>20 births_gte_20 = births_top$mage>20 births_top$mage[births_gte_20] #>20 # But usually just… births_top$mage[births_top$mage>20] #>20 # Or just as valid births_top$mage[births_top$raceeth_f == "Other"]
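The same steps on a toy data frame, as a sketch. The values and the raceeth_f category labels below are invented for illustration and only borrow the column names from births_top.

```r
births_top <- data.frame(
  mage      = c(19, 25, 31, 22, 18, 27),                            # made-up maternal ages
  raceeth_f = c("White", "Other", "Black", "Other", "White", "Black"),  # made-up labels
  wksgest   = c(39, 40, 38, 41, 37, 40)                             # made-up gestational ages
)

# Step by step: build the logical vector, then use it as an index
is_over_20 <- births_top$mage > 20          # FALSE TRUE TRUE TRUE FALSE TRUE
ages_over_20 <- births_top$mage[is_over_20] # 25 31 22 27

# In one line, and with a test on a different column
ages_other <- births_top$mage[births_top$raceeth_f == "Other"]   # 25 22
print(ages_over_20)
print(ages_other)
```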

Combining subsetting & queries We now have some powerful, 1 line tools. Using births_sm (all records) What’s the mean of the weeks gestation for everyone? What’s the mean for Smokers*? Non-smokers? Those missing the smoking variable (hint: use is.na() function) What’s the mean for wksgest for moms with mage* <20? >=20? >= 30? *remember to deal with missing values. Forget how? Try tab autocomplete or ?mean or F1 on mean() to get that syntax. This is powerful, but not powerful enough. Later we’ll have much more efficient ways to do this…
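A sketch of the pattern these questions call for, using a small made-up data frame (births_demo) in place of births_sm. The smoker_f labels are assumptions; check the real factor levels with table() first.

```r
births_demo <- data.frame(
  wksgest  = c(39, 40, NA, 36, 41, 38),
  smoker_f = c("Smoker", "Non-smoker", NA, "Smoker", "Non-smoker", NA),
  mage     = c(19, 25, 31, 22, 18, 27)
)

# Everyone: na.rm = TRUE drops missing gestational ages before averaging
mean_all <- mean(births_demo$wksgest, na.rm = TRUE)

# Smokers: which() keeps only rows where the test is TRUE (it drops NA results)
is_smoker <- births_demo$smoker_f == "Smoker"
mean_smokers <- mean(births_demo$wksgest[which(is_smoker)], na.rm = TRUE)

# Those missing the smoking variable
mean_missing <- mean(births_demo$wksgest[is.na(births_demo$smoker_f)], na.rm = TRUE)

# Mothers younger than 20
mean_young <- mean(births_demo$wksgest[births_demo$mage < 20], na.rm = TRUE)

print(c(all = mean_all, smokers = mean_smokers, missing = mean_missing, young = mean_young))
```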

Lists: Mixed Vectors A list is a vector that can have multiple modes (flavors). They work like vectors but can also be referenced slightly differently (double brackets: [[ ]]) to return not just the subset of the list, but what’s in that subset. [[]] == $ ! list(thing="A", 1, TRUE) list(thing="A", 1, TRUE)[1] list(thing="A", 1, TRUE)[[1]] list(thing="A", 1, TRUE)$thing Lists are a useful object for complex operations and objects. Will cover later, but useful glimpse for data.frames... Will spend a dedicated class, if not a week on lists later. But for now, try / watch this.
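The difference between single and double brackets shows up in the class of what comes back. A brief sketch using the list from the slide, stored in a variable named x for convenience:

```r
x <- list(thing = "A", 1, TRUE)

sub_list <- x[1]        # single brackets: still a list, of length 1
contents <- x[[1]]      # double brackets: the element itself, the string "A"
by_name  <- x$thing     # $ is shorthand for x[["thing"]]

class(sub_list)   # "list"
class(contents)   # "character"
identical(contents, by_name)   # TRUE
```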

Sidenote: Matrices are organized Vectors Vectors can be connected into a matrix with rbind() (stack as rows) or cbind() (stack as columns): a = c(1,2,3) b = c(4,5,6) rbind(a,b) cbind(a,b)

Slicing Matrices Like vectors, matrices can be sliced using []. Give slice instructions for both rows and columns (leave one blank to specify “all”), separated by a comma: m = rbind( 4:6, 7:9 ) # stack rows: row 1 is 4 5 6, row 2 is 7 8 9 m[1, ] # row 1, all columns: 4 5 6 m[ ,2] # all rows, column 2: 5 8 m[1:2,2:3] # row 1 to 2, col 2 to 3: 5 6 / 8 9

Slicing with Matrices Matrices (rectangular data) can also be sliced by a logical matrix (or by extension, a logical test that returns a logical matrix): m = cbind( c(4,7), c(5,8), c(6,9) ) # same m: row 1 is 4 5 6, row 2 is 7 8 9 m[rbind( c(T,F,T),c(F,T,F) ) ] # 4 8 6 m[m%%2==0] # even numbers # 4 8 6 # Note that this approach returns a vector!

Double Slicing Remember, output can always be input: you can also slice the result of a slice as an alternative specification: m = rbind( 4:6, 7:9 ) # better v = m[1, ] # v is 4 5 6 v[2] # 5 # one step m[1, ][2] # 5

Data Frames: Mixing and Naming Data frames allow you to mix-and-match different modes (flavors) of vectors into a matrix you can reference by name. This is a data set. The benefit is treating related data together (vs. all free-floating vectors). We’ve been using this since day 1. id = c("A","B","C") bp = c(115, 120, 130) dx = c(0, 0, 1) data.frame(id,bp,dx) What has happened here? We’ve bound together some named vectors!

Slicing & Assignment Remember names? In addition to using the matrix methods, you can also make references by name using the [] or $ operators*: names(births_top) births_top["wksgest"] # return df births_top$wksgest births_top[, "wksgest"] births_top[1:3, 1:3] births_top[births_top$mage == 20, "wksgest"] subset(births_top, subset = births_top$mage == 20, select = "wksgest") # Rarely do this births_top[, c("wksgest", "weeknum", "mage")] One of the few good reasons for an alternate data structure we’ll cover later (the tibble)
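A sketch comparing what each reference style returns, on a toy births_top with invented values. The drop = FALSE argument is a standard option of `[` for data frames that the slide does not mention.

```r
births_top <- data.frame(
  wksgest = c(39, 40, 38, 41, 37, 40),   # made-up values
  weeknum = c(44, 12, 30, 5, 21, 50),
  mage    = c(20, 25, 31, 20, 18, 27)
)

as_df   <- births_top["wksgest"]          # one-column data frame
as_vec  <- births_top$wksgest             # plain numeric vector
as_vec2 <- births_top[, "wksgest"]        # also a vector: a single column is "dropped" to a vector
kept_df <- births_top[, "wksgest", drop = FALSE]  # stays a data frame

corner  <- births_top[1:3, 1:3]                        # rows 1-3, columns 1-3
age_20  <- births_top[births_top$mage == 20, "wksgest"]  # 39 41

class(as_df)    # "data.frame"
class(as_vec2)  # "numeric"
```

The silent switch from data frame to vector in births_top[, "wksgest"] is the behavior the tibble footnote refers to.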

Slicing & Assignment # Power and danger: We’re allowed to # (and often going to) do this! births_top$mage[births_top$mage < 20] = NA Careful! Don’t overwrite your original births_sm data! (meh, if you do, control-shift-F10 and start over).
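A cautious version of the same recode, sketched on made-up data: work on a copy (here called births_copy, a name chosen for this example) so the original stays intact.

```r
births_top <- data.frame(mage = c(19, 25, 31, 22, 18, 27))   # made-up ages

births_copy <- births_top                            # copy first, then modify the copy
births_copy$mage[births_copy$mage < 20] <- NA        # only the matching cells are overwritten

print(births_copy$mage)    # NA 25 31 22 NA 27
print(births_top$mage)     # original is unchanged: 19 25 31 22 18 27
```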

Slicing Data Frames: Advanced Note (for later!) births_top["wksgest"] births_top[["wksgest"]] # *technically, $ is shorthand for [[ ]] births_top[, sapply(births_top, is.numeric)] # ^ looking weeks ahead, but can you guess?
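For those who want to guess and then check, here is a sketch on a small made-up data frame (demo) rather than births_top.

```r
demo <- data.frame(id = c("A", "B", "C"), bp = c(115, 120, 130), dx = c(0, 0, 1))

# sapply() applies is.numeric to each column and returns one TRUE/FALSE per column
is_num <- sapply(demo, is.numeric)     # id: FALSE, bp: TRUE, dx: TRUE

# That logical vector then selects columns, just like any other logical index
numeric_cols <- demo[, is_num]
names(numeric_cols)                    # "bp" "dx"

identical(demo[["bp"]], demo$bp)       # TRUE
```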

Functions: Taking Action Functions enable you to perform tasks. A function takes one or more arguments, separated by commas. We’ve been using them! Arguments can be matched by position (in order) or directly by name: mean(dat$bp) # one argument table(dat$id, dat$bp) # two arguments rnorm(n=10,mean=1,sd=2) # named arguments …More layers Wednesday
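A sketch showing that positional and named arguments reach the same function in the same way. The data frame dat is made up here, and set.seed() is used only so the random draws can be compared.

```r
dat <- data.frame(id = c("A", "B", "C"), bp = c(115, 120, 130))   # hypothetical data

avg_bp <- mean(dat$bp)                 # one positional argument
counts <- table(dat$id, dat$bp)        # two positional arguments

set.seed(42)
by_position <- rnorm(3, 1, 2)                  # n, mean, sd matched by order
set.seed(42)
by_name <- rnorm(sd = 2, mean = 1, n = 3)      # named arguments may appear in any order

identical(by_position, by_name)        # TRUE
```

Naming arguments costs a few keystrokes but makes calls easier to read and protects against putting values in the wrong order.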

You’re getting dangerous! Data types & data structures to hold them. Vectorization & vectorized subsetting for efficiency. Basic operators and functions when you need them. Enough, already, to do a lot of exploration.

Activity: Tour (full/narrow/recoded) Dataset Using births_sm from the rdata file, answer the questions below. Also note the “resources” folder! Use these functions (and others): dim() summary() table() hist() plot() How many observations are there, and how many variables are in the (small) births dataset? (Hint: see HW1!) What is the average maternal age (mage)? Make a histogram of gestational age (WKSGEST). What are the minimum and maximum (non-99) gestational ages? How many mothers smoked (smoker_f)? Make a scatterplot of maternal age versus gestational age.

Activity: Tour (full/narrow/unrecoded) Dataset Now use read.csv() to read births2012_sm.csv. How many observations are there, and how many variables are in the (full/wide) births dataset? How do the types of variables compare to those in births_sm? (Hint: we’ve got some recoding to do!) What is the average maternal age (mage) now? How many mothers have the value 99? How many mothers smoked (smoker_CIGDUR)? Make a histogram of gestational age (WKSGEST). What are the minimum and maximum (non-99) gestational ages? Try the same questions with births2012.csv, the full/wide/unrecoded dataset. Note: much bigger!

Packages: Ready for next class Packages are extensions to base R. They contain additional functions, documentation for using them, and sample data. Packages are available from the Comprehensive R Archive Network (CRAN). https://cran.r-project.org/web/packages/available_packages_by_name.html The “tidyverse” is a set of packages for data manipulation, exploration, and visualization. They share a common design and work in harmony. We’ll be using it extensively. # Install and load the package 'tidyverse' install.packages('tidyverse') # only need to run once library(tidyverse) # run once per R session to load it