Introduction to R Lecture 3: Data Manipulation Andrew Jaffe 9/27/10.

Overview
- Practice Solutions
- Indexing
- Data Management
- Data Summaries

Practice
Make a 2 x 2 table of sex and dog
> table(dat$sex, dat$dog)
    no yes
  F
  M

Practice
Create a 'BMI' variable using height and weight
> dat$bmi = dat$weight*703/dat$height^2
> head(dat$bmi)
[1]

Practice
Create an 'overweight' variable, which gives the value 1 for people with BMI > 30 and 0 otherwise
> dat$overweight = ifelse(dat$bmi > 30, 1, 0)
> head(dat$overweight)
[1]

Practice
Add those two variables to the dataset and save it as a text file somewhere
> write.table(dat, "lec2_practice.txt", quote = F, row.names = F, sep="\t")
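
The practice steps above can be checked end to end on a toy dataset. This is only a sketch: the three-row `dat` below, its values, and the temporary file are made up for illustration, since the lecture's own data are not reproduced in this transcript.

```r
# Hypothetical stand-in for the lecture's `dat` (values invented for illustration)
dat <- data.frame(sex    = c("F", "M", "F"),
                  height = c(64, 70, 62),     # inches
                  weight = c(130, 220, 180))  # pounds

# 703 converts pounds / inches^2 to the usual kg / m^2 BMI scale
dat$bmi <- dat$weight * 703 / dat$height^2
dat$overweight <- ifelse(dat$bmi > 30, 1, 0)  # 0 1 1 for these rows

# Write a tab-delimited text file to a temporary location, then read it back
out_file <- tempfile(fileext = ".txt")
write.table(dat, out_file, quote = FALSE, row.names = FALSE, sep = "\t")
dat_back <- read.table(out_file, header = TRUE, sep = "\t")
```

Reading the file back with read.table() is a quick way to confirm that the separator and header settings match what was written.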

Indexing
Vectors: vector[index] takes the ‘index’ elements from vector and returns them
> x = c(1,3,7,34,435)
> x[1]
[1] 1
> x[c(1,4)]
[1]  1 34
> x[2:4]
[1]  3  7 34
> 2:4
[1] 2 3 4

Indexing
Replace elements in a vector – combining indexing, is.na(), and rep()
> x = c(1,3,NA,6,NA,8)
> which(is.na(x))
[1] 3 5
> x[is.na(x)] = 0 # or rep(0)
> x
[1] 1 3 0 6 0 8
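
A short sketch of why the replacement works: is.na() returns a logical vector, and a logical vector can be used as an index directly, while which() turns it into positions. The names `y` and `z` are introduced here only for illustration.

```r
x <- c(1, 3, NA, 6, NA, 8)

is.na(x)         # FALSE FALSE TRUE FALSE TRUE FALSE
which(is.na(x))  # 3 5

y <- x
y[is.na(y)] <- 0         # logical indexing: replace wherever TRUE

z <- x
z[which(is.na(z))] <- 0  # positional indexing: replace elements 3 and 5
```

Both routes give the same vector, so which() is optional when the goal is only to replace values.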

Indexing
Data.frames/matrices: dat[row,col]
- Can subset/extract a row: dat[row,]
- Can subset/extract a column: dat[,col]
> x = matrix(c(1,2,3,4,5,6), ncol = 3)
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Indexing
> x[1,]
[1] 1 3 5
> x[,1]
[1] 1 2
> x[1,1]
[1] 1
> x[1:2,1:2]
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Indexing
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> x[1,] = rep(1)
> x
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    2    4    6
> x[,1] = rep(2)
> x
     [,1] [,2] [,3]
[1,]    2    1    1
[2,]    2    4    6
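
One detail worth seeing in code is what comes back from a matrix index. The sketch below uses the same `x`; the names `row1` and `row1_mat` are illustrative, and `drop = FALSE` is a standard argument of `[` that the slides do not mention.

```r
x <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 3)

row1     <- x[1, ]                # a plain vector: 1 3 5
row1_mat <- x[1, , drop = FALSE]  # stays a 1 x 3 matrix

dim(row1)      # NULL, because a vector has no dimensions
dim(row1_mat)  # 1 3

# A single value on the right-hand side is recycled across the whole row
x[1, ] <- 1
```

Extracting a single row or column drops to a vector by default, which is why x[,1] printed as a plain vector above.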

Data Management
An aside: save() and load()
save(obj_1,…,obj_n, file = "filename.rda")
- Saves R objects (vectors, matrices, or data.frames) as an .rda file (similar to .dta)
load("filename.rda")
- Loads whatever objects were saved in the .rda file
Easier than reading/writing tables
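
A minimal round trip with save() and load() might look like the sketch below. The objects `a` and `b` and the temporary file are hypothetical; any file name ending in .rda would behave the same way.

```r
# Two hypothetical objects to store
a <- 1:5
b <- data.frame(id = 1:3, type = c("lab", "poodle", "husky"))

rda_file <- tempfile(fileext = ".rda")
save(a, b, file = rda_file)  # several objects can go into one file

rm(a, b)                     # remove them from the workspace

# load() restores the objects under their original names and
# returns those names as a character vector
loaded <- load(rda_file)
```

Because load() recreates objects under the names they were saved with, there is no assignment like `dat = load(...)`; the return value holds only the names.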

Data Management
Your workspace can be saved as an .rda file
- You get asked this every time you close R
- save.image("filename.Rdata") saves all objects in your workspace (what ls() returns)
- Each folder might have its own .Rdata file
Doing this is personal preference: if you have a script and it’s a quick analysis, you probably don’t need a saved image

Data Management
"lec3_data.rda" can be downloaded from the website
Similar method to read in the data: load("lec3_data.rda")
- Put it in the same directory as your script
- Set your working directory
- Use the full filename

Data Management
What are the dimensions of the dataset?
> dim(dog_dat)
[1] 482   6

Data Management
How many dogs are in this dataset? Is this dataset unique?
> length(unique(dog_dat$dog_id))
[1] 482
> length(dog_dat$dog_id)
[1] 482
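
The same uniqueness check can be tried on a small vector where a repeat does exist. The `ids` values below are invented; duplicated() and anyDuplicated() are base R functions that the slides do not use, shown here as one possible alternative.

```r
ids <- c(101, 102, 103, 102)  # hypothetical IDs, one of them repeated

length(unique(ids)) == length(ids)  # FALSE: the IDs are not unique

anyDuplicated(ids)    # 4: position of the first repeat (0 if there is none)
ids[duplicated(ids)]  # 102: the value that appears again
```

Comparing the two lengths answers "is it unique?", while duplicated() also shows which entries are the repeats.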

Data Management
What are the column/variable names?
> head(dog_dat)
  dog_id owner_id dog_type dog_wt_mo1 dog_len_mo1 dog_food_mo1
                       lab
                       lab
                    poodle
                       lab
                     husky
                    poodle
> names(dog_dat)
[1] "dog_id"       "owner_id"     "dog_type"     "dog_wt_mo1"
[5] "dog_len_mo1"  "dog_food_mo1"

Data Management
Some explanation of the variables
- dog_id: id of dog
- owner_id: id of owner
- dog_type: type of dog
- dog_wt_mo1: dog weight at month 1 (baseline)
- dog_len_mo1: dog length at month 1
- dog_food_mo1: baseline dog food consumption

Data Management
Subsetting data: separate data into two data.frames based on a variable:
> lab = dog_dat[dog_dat$dog_type == "lab",]
> head(lab)
  dog_id owner_id dog_type dog_wt_mo1 dog_len_mo1 dog_food_mo1
                       lab
                       lab
                       lab
                       lab
                       lab
                       lab

Data Management
> lab = dog_dat[dog_dat$dog_type == "lab",]
> head(which(dog_dat$dog_type == "lab"))
[1]
Taking those specific rows, and all of the columns of the original data

Data Management
> lab2 = dog_dat[dog_dat$dog_type == "lab",1:3]
> head(lab2,3)
  dog_id owner_id dog_type
                       lab
                       lab
                       lab
Taking those specific rows, and the first 3 columns of the original data
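
Since the lecture data are not included here, the row/column subsetting above can be sketched on a made-up data.frame. The `dogs` table, its values, and the `is_lab` and `not_lab` names are all hypothetical.

```r
# Hypothetical miniature version of dog_dat
dogs <- data.frame(dog_id     = 1:5,
                   owner_id   = c(11, 12, 13, 14, 15),
                   dog_type   = c("lab", "poodle", "lab", "husky", "lab"),
                   dog_wt_mo1 = c(50, 20, 55, 45, 60))

is_lab <- dogs$dog_type == "lab"  # TRUE for the rows to keep

lab     <- dogs[is_lab, ]         # those rows, all columns
not_lab <- dogs[!is_lab, ]        # the complementary rows
lab2    <- dogs[is_lab, c("dog_id", "dog_type")]  # columns chosen by name
```

Storing the logical condition once makes it easy to build both halves of the split, and selecting columns by name keeps working if the column order changes.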

Data Management
Note (Stata users…) that we have two data.frames in our workspace! Check with ls()

Data Management
Remember we used ifelse() for binary conversions?
> heavy = ifelse(dog_dat[,4] > mean(dog_dat[,4]), 1, 0)
> head(heavy)
[1]
Note that you can use column indexing instead of $name for data.frames
This is just the mean of that column:
> mean(dog_dat[,4])
[1]

Data Management
The cut() function can split data into more groups – quintiles, tertiles, etc.
cut(dat, breaks)
- dat is a vector of numerical or integer values
- breaks is where to make the cuts

Data Management
If ‘breaks’ is one number (n), it splits the range of the data into ‘n’ equal-width intervals
> x = 1:5 # or seq(1,5)
> cut(x, 2)
[1] (0.996,3] (0.996,3] (0.996,3] (3,5]     (3,5]
Levels: (0.996,3] (3,5]
> cut(x, 3)
[1] (0.996,2.33] (0.996,2.33] (2.33,3.67]  (3.67,5]     (3.67,5]
Levels: (0.996,2.33] (2.33,3.67] (3.67,5]
> cut(x,3, labels=F) # returns integers of groups, not factors
[1] 1 1 2 3 3
The outputs with ‘Levels’ are FACTORS!

Data Management
What is a factor?
Similar to terms like ‘category’ and ‘enumerated type’
Has ‘levels’ associated with it – could be ordinal if factor(…,ordered = T)
To be converted with factor(), data only needs an as.character() method and to be sortable
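
A small sketch of what a factor holds, using a made-up character vector `sizes`; the names `f` and `f_ord` are illustrative only.

```r
sizes <- c("small", "large", "medium", "small")  # hypothetical categories

f <- factor(sizes)
levels(f)      # sorted by default: "large" "medium" "small"
as.integer(f)  # the underlying integer codes: 3 1 2 3

# Supplying the levels in a meaningful order and ordered = TRUE
# makes comparisons between categories possible
f_ord <- factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)
f_ord < "large"  # TRUE FALSE TRUE TRUE
```

The default level order is the sorted order of the values, so an ordinal variable usually needs its levels spelled out.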

Data Management
If ‘breaks’ is more than one number, cut() splits the vector at those numbers
> x = 1:10
> cut(x, c(0,3,6,10))
[1] (0,3]  (0,3]  (0,3]  (3,6]  (3,6]  (3,6]  (6,10] (6,10] (6,10] (6,10]
Levels: (0,3] (3,6] (6,10]
> cut(x, c(0,3,6,10), FALSE)
[1] 1 1 1 2 2 2 3 3 3 3
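
The third argument of cut() is `labels`, which can also be a vector of names rather than FALSE. A sketch with the same `x`; the label names "low", "mid", and "high" are invented for illustration.

```r
x <- 1:10

# One label per interval: (0,3], (3,6], (6,10]
grp <- cut(x, breaks = c(0, 3, 6, 10), labels = c("low", "mid", "high"))

levels(grp)  # "low" "mid" "high"
table(grp)   # counts per group: 3 3 4
```

Naming the argument (`labels = FALSE`) is easier to read than passing FALSE by position, and gives the same result.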

Data Management
Something more applicable for cut: the quantile(x,probs) function – the default ‘probs’ is seq(0,1,0.25), i.e. quartiles
seq(start, end, by) – creates a sequence from the starting value to the ending value, by the specified amount
- seq(0,10) ~ 0:10 # 0, 1, 2, …, 9, 10
- seq(0,10,0.5) # 0, 0.5, 1.0, …, 9.5, 10.0

Data Management
Now for stuff with our data:
> quantile(dog_dat$dog_wt_mo1)
  0%  25%  50%  75% 100%
> quantile(dog_dat$dog_wt_mo1, seq(0,1,0.5))
  0%  50% 100%
> quantile(dog_dat$dog_wt_mo1, 0.6)
 60%
51.5
> quantile(dog_dat$dog_wt_mo1, c(0.4,0.6))
40% 60%

Data Management
> sp = quantile(dog_dat$dog_wt_mo1, 0.75)
> big = ifelse(dog_dat$dog_wt_mo1 > sp, 1, 0)
> head(big)
[1]
> quant = cut(dog_dat$dog_wt_mo1, quantile(dog_dat$dog_wt_mo1))
> head(quant)
[1] (49.2,55.3] (44.6,49.2] (55.3,72.5] (44.6,49.2] (44.6,49.2] (44.6,49.2]
Levels: (10.6,44.6] (44.6,49.2] (49.2,55.3] (55.3,72.5]
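
One thing to watch when passing quantile() output to cut(): the intervals are open on the left, so the minimum value falls outside the first interval and becomes NA unless include.lowest = TRUE is set. A sketch on made-up weights (`wt` and the other names are hypothetical, not the lecture data):

```r
wt <- seq(20, 75, by = 5)  # 12 hypothetical weights: 20, 25, ..., 75

q <- quantile(wt)          # 0%, 25%, 50%, 75%, 100%

quart_default <- cut(wt, q)
sum(is.na(quart_default))  # 1: the minimum (20) is not in any interval

quart <- cut(wt, q, include.lowest = TRUE)
table(quart)               # 3 values in each of the 4 groups
```

Checking for NA after cutting is a quick way to confirm that every observation landed in a group.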

This is some of the only “statistics” in the course
R functions can perform statistics well; here are some basics for summaries

Data Summaries
mean(dat, na.rm = F)
median(dat, na.rm = F)
> x = c(1,2,4,6,NA)
> mean(x)
[1] NA
> mean(x, na.rm=T)
[1] 3.25
> median(x,na.rm=T)
[1] 3

Data Summaries
> x = c(1,2,4,7,9,11)
> mean(x)
[1] 5.666667
> median(x)
[1] 5.5
> var(x)
[1] 15.86667
> sd(x)
[1] 3.983298
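
As a check on what var() and sd() compute, the sample variance can be written out by hand with the n - 1 denominator. The names `var_by_hand` and `sd_by_hand` are illustrative.

```r
x <- c(1, 2, 4, 7, 9, 11)
n <- length(x)

# Sample variance: squared deviations from the mean, divided by n - 1
var_by_hand <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation is the square root of the variance
sd_by_hand <- sqrt(var_by_hand)

all.equal(var_by_hand, var(x))  # TRUE
all.equal(sd_by_hand, sd(x))    # TRUE
```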

Data Summaries
Let’s combine some concepts!
Take the mean food consumption of all of the labs

Data Summaries
First, figure out which entries correspond to dogs that are labs
> Index = which(dog_dat$dog_type == "lab")
> head(Index)
[1]

Data Summaries
Then, take the mean of the data you want
> mean(dog_dat$dog_food_mo1[Index])
[1]
Note that we first created a vector of dog food, then indexed it; no commas are needed for the indexing (because it’s a vector)

Data Summaries
Combined into 1 line/command:
> mean(dog_dat$dog_food_mo1[dog_dat$dog_type == "lab"])
[1]
> mean(dog_dat[dog_dat$dog_type == "lab",6])
[1]
> mean(dog_dat[dog_dat$dog_type == "lab","dog_food_mo1"])
[1]
Pick your favorite – they’re all the same!
Note that the first option might make the most sense…
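
That the three forms agree can be confirmed on a toy table. The `dogs` data.frame and its values below are hypothetical, with the food column placed second rather than sixth.

```r
# Hypothetical miniature data: dog type and baseline food consumption
dogs <- data.frame(dog_type     = c("lab", "poodle", "lab", "husky", "lab"),
                   dog_food_mo1 = c(3.0, 1.5, 3.5, 2.5, 4.0))

m1 <- mean(dogs$dog_food_mo1[dogs$dog_type == "lab"])    # vector, then index
m2 <- mean(dogs[dogs$dog_type == "lab", 2])              # rows, column number
m3 <- mean(dogs[dogs$dog_type == "lab", "dog_food_mo1"]) # rows, column name

c(m1, m2, m3)  # 3.5 3.5 3.5
```

The column-number form depends on where the column sits in the table, which is one reason the name-based forms are easier to keep correct.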

Practice
Compute the average dog weight, dog length, and dog food consumption for each dog type at baseline
Reminder: the dog types are lab, poodle, husky, and retriever