R Programming II EPID 799C Fall 2017.

Overview
- Review of data types and operators
- Functions (galore)
- Reading & writing files
- Exploring with real datasets
- A little group work today…

Data Structures Review & deeper: 5 basic data structures, 4(+) atomic types

Data structures

      Homogeneous      Heterogeneous
1D    Atomic vector    List*
2D    Matrix           Data frame
ND    Array

* Note: Lists are recursive!

Atoms are: logical, integer, double (aka numeric), and character. There are also two rare atomic types (complex and raw [2]) and other aggregate types. We'll introduce factors and dates soon. We make these things using the functions c(), matrix(), array(), list() and data.frame().

[2] Do genetic work? An R package that works with DNA sequences uses raw format (byte code/math) to efficiently store and operate on those ATGCs. It'll be largely invisible to you, but you can thank bit math for speedy comparison of sequences.

We Try Let’s make some!

You try: data types
1. Create three atomic vectors (length 5), one of each of these types: character, integer, logical. Name them whatever you want.
2. Use the ":" shorthand to create a vector of numbers, again of the same length.
3. Create a data.frame using those four atomic vectors, and take a look at it by printing it to the console.
4. Create a 3x3 matrix of (any) numbers, then of logicals. (Hint: the rep() function may be useful for logicals.)
5. Create a 3x3x3 array (27 elements) of the numbers 1 to 27. You may need help on array()…
6. Create a list that includes another list in it.

Answers
    # Atomic vectors
    a = c(1, 2, 3, 4)
    b = 1:4
    c = c("one", "two", "three", "four")
    d = c(T, F, T, F)
    # stringsAsFactors = F keeps c from turning into a factor
    # (this is already the default in R 4.0.0 and later)
    my_df = data.frame(a, b, c, d)
    str(my_df)
    # Matrices and arrays
    matrix(1:9, nrow = 3)
    array(1:27, dim = c(3, 3, 3))
    # A list that includes another list
    my_list = list(1, "a", list("mike", "shoes"))

Notes on Data Structures
- Everything is a vector in R. Atoms are vectors of length 1. (As reviewed Monday: this is crazy important and useful.) Operators therefore expect vectors and know how to operate on entire vectors at once.
- Lists are recursive and heterogeneous. They can make up the building blocks of more complex objects.
- Data.frames are really… lists of atomic (homogeneous) vectors*. We'll verify this later.
* Fancy note: technically this means a column could itself be a list (each element is a list) and it's still a data.frame. Some recent spatial packages take advantage of this (e.g. simple features geometries).
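A minimal sketch of the two claims above (operators act on whole vectors, and a data.frame is a list of equal-length vectors). The object names are made up for illustration.

```r
# Operators work on entire vectors at once.
x <- c(1, 2, 3, 4)
doubled <- x * 2        # every element is multiplied
is_big  <- x > 2        # element-wise comparison gives a logical vector

# A single value is just a vector of length 1.
single_length <- length(5)

# A data.frame is stored as a list whose elements are the columns.
df <- data.frame(id = 1:3, name = c("a", "b", "c"))
df_type     <- typeof(df)    # underlying storage type
df_is_list  <- is.list(df)
col_lengths <- lengths(df)   # one length per column, all equal
```

Printing df_type and col_lengths is a quick way to convince yourself that each column is simply one element of a list.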

Type Coercion
Explicit: as.thing(this) will convert this into the other thing, e.g.
    as.numeric(c("1", "2", "3"))
    as.character(1:3)
    # as.... other stuff, or as(thing, "class")
Implicit: R tries to help when it can, e.g.
    sum(T, F, T, F, T)
If you're going to lose information (get new NAs), R will let you know with a warning. Generally, arithmetic operators coerce to numbers, logical operators coerce to logicals, etc.
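The coercion rules above can be sketched as follows. The variable names are illustrative; suppressWarnings() is used only so the example runs quietly, and without it R warns that NAs were introduced by coercion.

```r
# Explicit coercion
nums  <- as.numeric(c("1", "2", "3"))   # character -> double
chars <- as.character(1:3)              # integer -> character

# Implicit coercion: TRUE counts as 1, FALSE as 0
n_true <- sum(TRUE, FALSE, TRUE, FALSE, TRUE)

# c() coerces mixed input to the most flexible type (here: character)
mixed <- c(1, "a", TRUE)

# Lossy coercion: "two" cannot become a number, so it turns into NA
lossy <- suppressWarnings(as.numeric(c("1", "two", "3")))
```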

Functions
data, str, class, summary, head, tail, setwd, getwd, View, plot, dim, nrow/ncol, sd, hist, boxplot, table, typeof, sum, read.csv and write.csv…

Our first functions (vocabulary!)
str(), typeof(), length(), attributes(), names(), class()
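As a rough illustration of what these vocabulary functions report, here they are applied to a small named vector and a matrix (both invented for this sketch):

```r
x <- c(a = 1.5, b = 2, c = 3)   # a named double vector

x_type   <- typeof(x)      # how R stores it
x_class  <- class(x)       # how functions treat it
x_length <- length(x)
x_names  <- names(x)
x_attrs  <- attributes(x)  # names are stored as an attribute

m <- matrix(1:6, nrow = 2)
m_type <- typeof(m)        # a matrix is an integer vector...
m_dim  <- attributes(m)$dim  # ...plus a dim attribute

str(x)   # compact summary of structure, printed to the console
str(m)
```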

We Try Let’s explore data types with our functions!

Sidenote 1: Factors & Dates! We’ve got the functions to make sense of these

Aggregate Data Types: Factors
Now we're ready: What is a factor? Let's find out!
Create one:
    roles = factor(c("student", "faculty", "staff"))
    # ^ NOTE there was a typo during class. I forgot the c!
Find out: use str(), class(), levels(), attributes(), as.numeric(), typeof() on roles.
How are factors different? Why are they here? Also see: ordered()
Stuck with factors? Check out the forcats package at http://forcats.tidyverse.org/ and this on factors: http://r4ds.had.co.nz/factors.html
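One way to run the exploration suggested above. This sketch adds a repeated "student" value (not in the slide) so the counts are more interesting, and the number-like factor at the end is a made-up example of a common gotcha.

```r
roles <- factor(c("student", "faculty", "staff", "student"))

roles_class  <- class(roles)        # what R calls it
roles_type   <- typeof(roles)       # how it is actually stored
roles_levels <- levels(roles)       # the distinct labels, sorted
roles_codes  <- as.integer(roles)   # the integer codes behind the labels
roles_counts <- table(roles)        # frequency of each level

# Gotcha: converting a number-like factor gives the codes, not the values.
f <- factor(c("10", "20", "10"))
codes_not_values <- as.numeric(f)
real_values      <- as.numeric(as.character(f))
```

So a factor is an integer vector with a levels attribute and a class, which is why as.numeric() on a factor returns codes.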

Aggregate Data Types: Dates
What is a date? Two things really…
    today_date = as.Date("2017/08/30")
    typeof(today_date)
    today_date_lt = as.POSIXlt("2017/08/30")
    typeof(today_date_lt)
(Try our other functions too!)
But hint: futzing with dates can be a hassle. We'll use the lubridate package to make that easier, later. A <2 minute skim: http://r4ds.had.co.nz/dates-and-times.html
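A short sketch of the "two things" a date can be, using the same date as the slide. The extra variable names are illustrative.

```r
today_date    <- as.Date("2017/08/30")
today_date_lt <- as.POSIXlt("2017/08/30")

date_type <- typeof(today_date)      # a Date is a number (days since 1970-01-01)
lt_type   <- typeof(today_date_lt)   # a POSIXlt is a list of date-time parts

# Because a Date is a number underneath, arithmetic just works.
next_week <- today_date + 7

# Because a POSIXlt is a list, its components can be pulled out with $.
lt_year  <- today_date_lt$year + 1900   # years are stored as an offset from 1900
lt_month <- today_date_lt$mon + 1       # months are stored as 0-11
lt_day   <- today_date_lt$mday

iso_text <- format(today_date, "%Y-%m-%d")
```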

Sidenote 2: Operators? C’mon, I thought we were doing functions!

Reminder: Operators… are functions!
    `+`(1, 2)
    `%in%`(1:4, c(2,3))
…so are assignment, indexing and (technically) function calls themselves. Meaning, hey, you can easily define your own binary operators. More often we'll define our own functions, but it's important to know: EVERYTHING is a function / object, and can be passed around.
    `%add_wrong%` <- function(a, b) {a + b + 1}
    a %add_wrong% b # for any numeric a and b
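A runnable version of the idea above, with concrete values filled in. The use of Reduce() to show an operator being "passed around" is an addition for illustration.

```r
# Operators called as ordinary functions
plus_result <- `+`(1, 2)
in_result   <- `%in%`(1:4, c(2, 3))

# A user-defined binary operator (deliberately wrong, as in the slide)
`%add_wrong%` <- function(a, b) {
  a + b + 1
}
wrong_sum <- 1 %add_wrong% 2

# Since `+` is a function, it can be handed to another function.
total <- Reduce(`+`, 1:4)
```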

Sidenote 3: Write your own Often *super* useful

We Try: The best way to learn functions is to write our own
    my_first_function = function(param, param2 = 4){
      # function body, using the parameters…
      my_return_val = param + param2 # placeholder body so the example runs
      return(my_return_val)
      # ^ note: the value of the last expression is returned if return() is left out!
    }
    my_first_function(1, 2) # calling it like this, or
    my_first_function(param2 = 2, param = 1) # this, or
    my_first_function(1) # this
R functions are scoped (e.g. variables created inside don't exist outside). Arguments behave as if passed by value, but R is smart about it (copy-on-modify): it doesn't create a new copy of what's passed in unless that copy is changed.
Let's write get_older and hello_world.
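One possible take on get_older and hello_world. The slide only names these functions, so their arguments and behavior here are assumptions, as is the helper bump_first used to show that changing an argument inside a function leaves the caller's object alone.

```r
# Assumed behavior: add some number of years to an age (vectorized).
get_older <- function(age, years = 1) {
  new_age <- age + years
  return(new_age)
}

# Assumed behavior: build a greeting; the last value is returned implicitly.
hello_world <- function(name = "world") {
  paste0("Hello, ", name, "!")
}

ages          <- c(20, 35, 50)
older         <- get_older(ages)                    # default: one year
older_by_five <- get_older(years = 5, age = ages)   # named arguments, any order
greeting      <- hello_world("EPID 799C")

# Scope: new_age was created inside get_older and does not exist out here.
leaked <- exists("new_age")

# Modifying an argument changes a copy, not the original.
bump_first <- function(v) {
  v[1] <- 99
  v
}
original <- c(1, 2, 3)
changed  <- bump_first(original)
```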

We Try: The best way to learn functions is to write our own
write.csv and read.csv
- What is iris? (see data())
- Save the data.frame iris to iris_lower, and change all the variable names to lower case.
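A sketch of this exercise with a write/read round trip. Writing to a temporary file (rather than the working directory) is a choice made here so the example cleans up after itself.

```r
# Copy iris and lower-case its variable names.
iris_lower <- iris
names(iris_lower) <- tolower(names(iris_lower))

# Write it out, then read it back in.
out_file <- tempfile(fileext = ".csv")
write.csv(iris_lower, out_file, row.names = FALSE)   # skip the row-number column
iris_back <- read.csv(out_file, stringsAsFactors = FALSE)

back_names <- names(iris_back)
back_dim   <- dim(iris_back)

unlink(out_file)   # remove the temporary file
```

Note that a CSV stores only text, so the factor column species comes back as character here.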

You try: Function Vocab Injection
Using the iris dataset and Advanced R: Function Vocabulary (http://adv-r.had.co.nz/Vocabulary.html), try out as many functions as you can in your group! (Feel free to split them up and work in groups.)
Suggestions: Some of these are actually pretty advanced, so consider not diving into EVERY function. Some you might want to skip that are a bit of a rabbit hole: <<-, get, assign, rle

Sidenote 4: classes and functions You’ll never need to know this until you do.

Classes and functions
How do functions like dim() or plot() know how to handle all these things? Technically, they're generics: they hand the work off to (and effectively mask) methods like dim.data.frame or plot.factor, chosen based on the class() of the object. Look up the help for plot.factor and dim.data.frame.
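To see the dispatch mechanism at small scale, here is a toy generic. The function describe and its methods are hypothetical and written only for this sketch; they are not part of base R.

```r
# A generic: it only decides which method to call, based on class(x).
describe <- function(x, ...) UseMethod("describe")

# Methods are named generic.class
describe.data.frame <- function(x, ...) paste("a data frame with", nrow(x), "rows")
describe.factor     <- function(x, ...) paste("a factor with", nlevels(x), "levels")
describe.default    <- function(x, ...) paste("an object of class", class(x)[1])

desc_df     <- describe(iris)           # dispatches to describe.data.frame
desc_factor <- describe(iris$Species)   # dispatches to describe.factor
desc_other  <- describe(1:3)            # falls back to describe.default

# The same thing happens with built-in generics:
same_dims <- identical(dim(iris), dim.data.frame(iris))
```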

Super Duper Sub-setting Review from last time: [], [[]], $ and their many flexibilities

Super Duper Sub-setting: Vectors
- [] is the atomic subset operator (by location). (Given R is "vectorized", like almost all of our data, think matrix notation.)
- [[]] ("double brackets") is the subset-into operator (think subset, then look inside that thing). Most commonly used on a named list, like… a data.frame!

We Try: Super Duper Sub-setting: Vectors
[] can subset a vector by:
- Numeric vectors (negative to drop, repeats, etc.)
- Logical vectors
- Character vectors (IF you've named those elements)
[] can subset a 2+ dimensional object (matrix, array, data.frame) in similar ways…
…but then accepts a few other higher order versions of the above.
Technically, [] used on a vector always returns a vector, right?
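The subsetting styles listed above, sketched on a small named vector and a matrix (both made up for illustration):

```r
v <- c(a = 10, b = 20, c = 30, d = 40)

by_position <- v[c(1, 3)]       # numeric: keep elements 1 and 3
dropped     <- v[-1]            # negative: drop element 1
repeated    <- v[c(1, 1, 2)]    # repeats are allowed
by_logical  <- v[v > 15]        # logical: keep where TRUE
by_name     <- v[c("b", "d")]   # character: only works because v is named

# Two dimensions: one index per dimension, leave one blank for "all"
m <- matrix(1:9, nrow = 3)
second_row   <- m[2, ]
third_col    <- m[, 3]
big_elements <- m[m > 6]        # a logical matrix also works

# The same ideas apply to a data.frame
small_iris <- iris[1:3, c("Species", "Sepal.Length")]
```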

Super Duper Subsetting: Lists
[] returns a smaller list from a list. But often we want to look inside that list element (e.g. in data frames). So for lists we use [[]], e.g. iris[["Petal.Length"]]. But that's a hassle, so x$y is a convenience wrapper for the same operator (equivalent to x[["y", exact=F]]) which we'll use ALL THE TIME.
* Fancy note: see that exact=F? That means you can do some crazy stuff, like iris$Petal.Len . Really. Don't do this!
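The difference between the three operators can be checked directly on iris and on a nested list like the one from the earlier answers slide:

```r
# On a data.frame (a list of columns)
one_col_df <- iris["Petal.Length"]     # [ ]  -> still a data.frame, one column
col_vec    <- iris[["Petal.Length"]]   # [[ ]] -> the numeric vector inside
same_vec   <- iris$Petal.Length        # $ -> shorthand for the line above

# On a nested list
my_list <- list(1, "a", list("mike", "shoes"))
still_wrapped <- my_list[3]        # a list of length 1 holding the inner list
inner_list    <- my_list[[3]]      # the inner list itself
first_inner   <- my_list[[3]][[1]] # look inside twice
```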

Ready for Reality! Introducing the class dataset

Putting it all together
- We have basic data types to hold our vectorized, atomic data.
- We have a wealth of functions to operate on them, usually on a whole vector (think "column") at once. We can write our own if we need to.
- We have powerful subsetting (see L2 for the full rundown) to select, rewrite, extract and perform other actions on slices of our data.

NC Birth Data
The "small" dataset contains (N) columns of (M) rows of data. Check the documentation for what these values really mean: mdif, visits, wksgest, mrace, cores, bfed.
The overall question: does prenatal care reduce preterm birth?

You try: Functions
- Read in the NC births (small) file, and rename the variables to all lower case.
- Explore the dataset as a small group using as many relevant functions as you can from the Advanced R function vocabulary, and report out to the group.
- Try str(), length(), dim(), typeof(), attributes()
- Try head(), tail(), subset()

You try: Tour the Dataset
Download and unzip the Births Dataset, then use read.csv() (and maybe setwd()) to import the small version of the dataset: births2012_small.csv
Use these functions to answer the questions below: dim(), summary(), table(), hist(), plot()
1. Use an expression with assignment to make a working copy of the dataset with a simpler name.
2. How many observations, and how many variables, are in the (small) births dataset?
3. What is the average maternal age (mage)? How many mothers have the value 99?
4. Make a histogram of gestational age (WKSGEST). What are the minimum and maximum (non-99) gestational ages?
5. How many mothers smoked (CIGDUR)?
6. Make a scatterplot of maternal age versus gestational age.
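The "non-99" questions come down to counting and excluding a placeholder code. The sketch below shows the pattern on a tiny invented vector; the values are NOT from the births dataset, and it assumes 99 is a placeholder code rather than a real measurement (check the documentation).

```r
# Made-up values for illustration only, not taken from births2012_small.csv
wksgest_example <- c(39, 40, 99, 37, 99, 41)

# How many entries carry the 99 code?
n_code_99 <- sum(wksgest_example == 99)

# Minimum and maximum among the remaining values
valid_weeks <- wksgest_example[wksgest_example != 99]
week_range  <- range(valid_weeks)

# Alternative: recode 99 to NA, then ask functions to skip NAs
wksgest_na <- ifelse(wksgest_example == 99, NA, wksgest_example)
mean_weeks <- mean(wksgest_na, na.rm = TRUE)
```

The same expressions should carry over to a real column once the file has been read in with read.csv().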

You try You can flexibly program with [] and [[]], but not as flexibly with $, even though almost always we’ll use $. Can you see why? Using the births data…
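One way to see the difference, sketched here with the built-in iris data instead of the births file so it runs anywhere: [[]] accepts a column name stored in a variable, while $ takes the text after it literally.

```r
col_name <- "Sepal.Length"

# [[ ]] evaluates col_name and fetches that column
mean_via_brackets <- mean(iris[[col_name]])

# $ looks for a column literally called "col_name", finds none, returns NULL
via_dollar <- iris$col_name

# This is what makes [[ ]] usable inside loops and functions
wanted    <- c("Sepal.Length", "Petal.Length")
col_means <- sapply(wanted, function(nm) mean(iris[[nm]]))
```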