Introduction to R: Data structure. Karim & Maria, June 27 2015.



Objectives: Data Structure
o Understand Vectors, Lists, Matrices and Dataframes
o Understand R functions for accessing Dataframes
o Know how to access objects, subset and work with your data
o Know how to deal with missing data
o Know how to read and write data

Overview
A note about R: statistics vs signal processing
o R is predominantly statistical software, as opposed to numerical computation software. It provides functions for statistical operations and complex analysis.
o R is an interpreted language. Advantage: less time writing code. Disadvantage: computations can be slower. But for most uses it handles large data well and reasonably fast.
o It is a real programming language: free and flexible. As with every programming language, there is always more than one way to do things and never a single "right way".
o R tends to load all the data into memory (workarounds exist with R version 3).
o It is a language of choice for prototyping, and it can access databases.

Data Structures
Tables and numbers: how many kinds of tables are there?
o Vectors (tables of dimension 1)
o Matrices (tables of dimension 2)
o Arrays (tables of any dimension)
o Data Frames (tables of dimension 2 in which each column may contain a different type of data, e.g. with one row per subject and one column per variable)

Vectors: Dimension 1
#create your first vector (c stands for concatenate)
c(1,2,3,4,5)
[1] 1 2 3 4 5
1:5
[1] 1 2 3 4 5
seq(1, 5, by=1)
[1] 1 2 3 4 5
#read values typed at the console
scan()

Vectors: Selecting elements
#create a vector object called "data" containing the numbers from -2 to 2 in 0.1 increments
data <- seq(-2, 2, by=.1)
data
#list your objects
ls()
[1] "data"
#to remove your object use rm(data)
str(data)

Vectors: Indexing
data <- seq(-2, 2, by=.1)
ls();str(data)
data
#extract elements 5 to 10
data[5:10]
[1] -1.6 -1.5 -1.4 -1.3 -1.2 -1.1
#extract element 2 and elements 14 to 19
data[c(2,14:19)]
[1] -1.9 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2
#create a new vector without elements 19 through 22
new<-data[-(19:22)]
new;length(data);length(new)

Vectors: Indexing
#find which elements are positive (larger than 0)
data>0
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
[25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[37]  TRUE  TRUE  TRUE  TRUE  TRUE
#return the elements that are positive (greater than 0)
data[data>0]
#create a shorter vector from data
short<-data[1:10]
names(short)
NULL
#add names to your elements using the built-in vector letters
letters
names(short)<-letters[1:length(short)]
short["e"]
   e
-1.6
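The indexing forms from the last two slides can be collected into one short sketch. The object names `v`, `first_block`, `dropped`, `pos` and `by_name` are illustrative and do not come from the slides; `v` is used instead of `data` only to avoid overwriting the object created earlier.

```r
# Sketch of the vector indexing forms shown above (names are illustrative).
v <- seq(-2, 2, by = 0.1)          # 41 values from -2 to 2

first_block <- v[5:10]             # positional indexing: elements 5 to 10
dropped     <- v[-(19:22)]         # negative indexing: all but elements 19 to 22
pos         <- v[v > 0]            # logical indexing: keep only positive values

short <- v[1:10]
names(short) <- letters[1:length(short)]   # label the elements "a" to "j"
by_name <- short["e"]              # name indexing: look up the element called "e"

length(v); length(dropped); length(pos)
```

Logical indexing is usually the most readable choice when the selection depends on the values themselves rather than on their positions.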

Data Structure: Factors
A factor is a vector coding for a qualitative variable, for example colour or gender, or a variable whose numeric values carry no numeric meaning, such as zip codes.
?factor
#create a factor
food <- factor( sample(c("pizza", "pasta", "ravioli"), 10, replace=T) )
#take a look and check that it is indeed a factor
food
class(food)
g <-c("pizza", "pasta", "ravioli")
x <- factor( sample(food, 5, replace=T), levels=g )
levels(x)
#the counts vary from run to run because the sample is random
table(x)
  pizza   pasta ravioli
      3       1       1
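Because the slide example relies on `sample()`, its output changes on every run. The sketch below uses a fixed, made-up vector (`answers` is a hypothetical name) so the result is reproducible, and it also shows the integer codes a factor stores internally.

```r
# Reproducible factor sketch; the input values are made up for illustration.
answers <- c("pizza", "pasta", "pizza", "ravioli", "pizza")
f <- factor(answers, levels = c("pizza", "pasta", "ravioli"))

levels(f)        # the allowed categories, in the order given by 'levels'
counts <- table(f)   # how often each level occurs
codes  <- as.integer(f)   # underlying integer codes (position of each value in levels)

counts
codes
```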

Data Structure: Factors
x <- c("A", "B", "C")
y <- 1:2
z <- c("male", "female")
expand.grid(x,y,z)
data<-expand.grid(x,y,z)
#this is now a dataframe (more about this later)
class(data)
names(data)
str(data)
names(data)<-c("group", "order", "gender")

Data Structure: Dataframes
Broadly, a dataframe is a list of vectors of the same length. What is special about a dataframe is that it can contain both quantitative variables (numbers; each column may hold a measurement in a different unit) and qualitative variables (strings or factors).
n <- 10
df <- data.frame( x=rnorm(n), y=sample(c(T,F), n, replace=T) )
df
class(df)
[1] "data.frame"
dim(df)
[1] 10  2
#remember: it is always rows first and then columns
names(df)
[1] "x" "y"

Data Structure: Dataframes
summary(df)
(output: for x, the Min., 1st Qu., Median, Mean, 3rd Qu. and Max.; for y, Mode: logical, FALSE: 7, TRUE: 3, NA's: 0)
Change the names of your columns:
names(df) <- c("Z", "Case")
names(df)
[1] "Z"    "Case"
na<-letters[1:10]
row.names(df)<-na
row.names(df)
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
str(df)
'data.frame': 10 obs. of 2 variables:
 $ Z   : num ...
 $ Case: logi FALSE TRUE FALSE TRUE TRUE FALSE ...

Access Dataframes
Use the $ operator to access a specific column by variable name. To access column "Z" or "Case" we simply write:
df$Z #access column Z
df$Case #access column Case
An alternative is to use indexing:
df[,1] #this reads: access data frame df; the blank before the comma means "all rows" and the "1" means column 1. Remember: rows, then columns
df[,2] #accesses column 2, "Case"

Access Dataframes
To access row 2, column 1:
df[2,1]
To access the elements in rows 2 to 5 for both variables in your dataframe:
df[2:5,]
df[2:5,1:2] #exactly the same as above
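The access patterns above can be checked against each other on a small data frame. This is a minimal sketch with made-up values; the object names other than `df`, `Z` and `Case` are illustrative.

```r
# Small data frame with fixed, made-up values so results are reproducible.
df <- data.frame(Z = c(0.5, -1.2, 2.0, 0.3),
                 Case = c(TRUE, FALSE, FALSE, TRUE))

col_by_name  <- df$Z        # column selected by name
col_by_index <- df[, 1]     # same column selected by position (all rows, column 1)
same_column  <- identical(col_by_name, col_by_index)

one_cell  <- df[2, 1]       # row 2, column 1
some_rows <- df[2:3, ]      # rows 2 to 3, all columns (still a data frame)

same_column
one_cell
some_rows
```

Selecting by name is generally more robust than selecting by position, because it keeps working if the column order changes.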

Access Dataframes: Attach & Detach
Columns in a dataframe can be temporarily turned into actual variables with the attach command, which makes their elements easier to access (the idea is similar to namespaces in C++).
Access one of R's built-in datasets, in this case observations from the Old Faithful geyser.
?faithful
data<-faithful
names(data)
dim(data)
class(data)
str(data)
attach(data)
str(eruptions)

Access Dataframes: Attach & Detach
Access the 5th value in the column eruptions:
eruptions[5]
Test the strength of the association between eruption duration and waiting time. While the data are attached, both calls give the same result:
cor.test(data$eruptions,data$waiting)
cor.test(eruptions,waiting)
Remember to detach:
detach(data)

Subsetting data
Use another built-in dataset called "Orange".
data(Orange)
#explore your data
class(Orange);dim(Orange);names(Orange);str(Orange)
#use the subset command to select only Tree 1
dim(Orange)
Orange
d<-subset(Orange, Tree==1) #"==" stands for "exactly equal"
dim(d)
d

Subsetting data: subset and select
Create a new dataframe with only the Tree and age columns for those measurements with a circumference less than 150:
d1<-subset(Orange, circumference<150, select = c(Tree, age))
dim(d1)
str(d1)
d1
Do you need the subset command? No, it is just a convenient wrapper. You can do the same with indexing:
d2<-Orange[Orange$circumference<150,1:2]
dim(d2)
d2
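The claim that `subset` is only a wrapper around indexing can be checked directly. The sketch below builds the same selection both ways on the built-in `Orange` dataset and compares the results; the names `by_subset` and `by_index` are illustrative.

```r
# Same selection written two ways on the built-in Orange dataset.
by_subset <- subset(Orange, circumference < 150, select = c(Tree, age))
by_index  <- Orange[Orange$circumference < 150, c("Tree", "age")]

# Compare shape and contents of the two results.
dim(by_subset)
dim(by_index)
same_age  <- identical(by_subset$age, by_index$age)
same_tree <- identical(as.character(by_subset$Tree), as.character(by_index$Tree))
same_age
same_tree
```

One practical difference to keep in mind: `subset` drops rows where the condition evaluates to NA, whereas plain logical indexing returns rows of NAs for them. `Orange` has no missing values, so the two agree here.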

Missing data
R codes missing values as NA ("Not Available").
#create a vector with a missing value
a<-c(1:5,NA,7:10)
a
str(a)
length(a)
#find the mean of a
mean(a)
[1] NA
#you need to tell R to ignore the missing value
mean(a, na.rm=T)
a1<-mean(a, na.rm=T)
a1

Missing data
In fact, you could remove the NAs yourself with the na.omit function:
b<-na.omit(a)
b
[1]  1  2  3  4  5  7  8  9 10
attr(,"na.action")
[1] 6
attr(,"class")
[1] "omit"
You can use this with dataframes too.
#create a dataframe called d
d <- data.frame(a, b=rev(a))
#check d
d
#omit all rows that contain at least one NA
e<-na.omit(d)
#look at e, you will see all rows with NAs removed
e

Missing data
Important! Do not use NA in a boolean test.
#use your vector with missing data, "a"
a == 8
#now try with NA
a==NA
 [1] NA NA NA NA NA NA NA NA NA NA
To test for missing values use the is.na function:
is.na(a)
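A short sketch of how `is.na` is typically combined with other functions to locate, count and drop missing values. It reuses the vector `a` from the slides; the other object names are illustrative.

```r
# Locating, counting and dropping missing values with is.na().
a <- c(1:5, NA, 7:10)

missing_at <- which(is.na(a))    # position(s) of the missing values
n_missing  <- sum(is.na(a))      # TRUE counts as 1, so this counts the NAs
complete   <- a[!is.na(a)]       # keep only the non-missing elements
avg        <- mean(a, na.rm = TRUE)   # mean over the non-missing elements

missing_at
n_missing
complete
avg
```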

Data Structure: Matrices
Matrices are 2-dimensional tables. They differ from dataframes in that all their elements must be of the same type. R handles matrices reasonably well, which makes it a good open-source platform for building mathematical tools that rely on matrix algebra. Other languages (e.g. Fortran) may be better at this but have their own disadvantages.

Data Structure: Matrices
#create your first matrix
m <- matrix( c(1,2,3,4), nrow=2 )
#check that you have indeed created a matrix
is.matrix(m)
#create a larger matrix
matrix( 1:3, nrow=3, ncol=3 )
#Note! A matrix is filled column by column. You can fill it row by row instead, either by transposing the matrix or by telling R in the first place

Data Structure: Matrices
matrix( 1:3, nrow=3, ncol=3, byrow=T )
or
t(matrix( 1:3, nrow=3, ncol=3 ))
You can perform matrix algebra easily and manipulate your matrices like any other object. R also has some built-in functions that make working with matrices easier (or add functionality using the "Matrix" package).
#create two matrices and try adding, subtracting, multiplying, transposing
c1<-cbind( c(1,2), c(3,4)); c2<-rbind( c(1,3), c(2,4))
You can index elements in a matrix using the same indexing techniques we have looked at previously.
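The exercise above can be sketched as follows. Note that `*` multiplies element by element, while `%*%` is the matrix product; the result names are illustrative.

```r
# Two small matrices built by binding columns and rows.
c1 <- cbind(c(1, 2), c(3, 4))   # columns (1,2) and (3,4)
c2 <- rbind(c(1, 3), c(2, 4))   # rows (1,3) and (2,4): the same matrix as c1

added       <- c1 + c2          # element-wise addition
subtracted  <- c1 - c2          # element-wise subtraction
elementwise <- c1 * c2          # element-wise product
matprod     <- c1 %*% c2        # matrix product
transposed  <- t(c1)            # transpose

matprod
c1[2, 1]                        # indexing works as for data frames: row 2, column 1
```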

Data Structure: Lists
Lists are useful when we need to store complex data, as they can contain anything.
Let's clear our objects first:
ls()
rm(list=ls()) #will remove ALL objects
ls()
#create a list (it will be empty at first)
mylist <- list()
#let's add something to our list
mylist[["foo"]]<-1
#and let's add some more
mylist[["bar"]] <- c("a", "b", "c")
#let's check what we have so far
mylist
str(mylist)

Data Structure: Lists
List of 2
 $ foo: num 1
 $ bar: chr [1:3] "a" "b" "c"
#use the "[[" operator to access one element in the list
#use the "[" operator to access several elements
#example: let's access the element "bar"
mylist[["bar"]]
[1] "a" "b" "c"
OR
mylist[[2]]
[1] "a" "b" "c"
#this is different from accessing the second element of a vector: here you are accessing the second element of a list
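The difference between `[[` and `[` is easiest to see by looking at what each one returns. A minimal sketch, rebuilding `mylist` in one step; the names `one` and `sub` are illustrative.

```r
# "[[" returns the element itself; "[" returns a (shorter) list.
mylist <- list(foo = 1, bar = c("a", "b", "c"))

one <- mylist[["bar"]]    # the character vector stored under "bar"
sub <- mylist["bar"]      # a list of length 1 that contains that vector

class(one)
class(sub)
via_dollar <- mylist$bar  # "$" is shorthand for "[[" with a name
```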

Data Structure: Lists
Why are lists particularly important in R? Most predefined statistical functions return their results in a list, so you need to know how to access the elements of a list to extract the results you need.
Example: let's generate some data
n<-20
predictor<-rnorm(n)
predicted<-1-2*predictor+rnorm(n)
results<-lm(predicted~predictor)

Data Structure: Lists
results
Call:
lm(formula = predicted ~ predictor)
Coefficients:
(Intercept)    predictor
          …            …
summary(results)
Call:
lm(formula = predicted ~ predictor)
Residuals:
 Min  1Q  Median  3Q  Max
   …   …       …   …    …
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)        …          …       …    …e-05 ***
predictor          …          …       …    …e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: … on 18 degrees of freedom
Multiple R-squared: …, Adjusted R-squared: …
F-statistic: 131 on 1 and 18 DF, p-value: 1.076e-09

Data Structure: Lists
Let us take a closer look at this object called "results":
str(results)
List of 12
 $ coefficients : Named num [1:2] …
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "predictor"
 $ residuals : Named num [1:20] …
  ..- attr(*, "names")= chr [1:20] "1" "2" "3" "4" ...
 $ effects : Named num [1:20] …
  ..- attr(*, "names")= chr [1:20] "(Intercept)" "predictor" "" "" ...
 etc.
The "$" signs mark the accessible items in the list.
str(summary(results)) reveals another list.

Data Structure: Lists
savemyresults<-summary(results) #save your summary results as an object
ls() #savemyresults should be one of your objects
#look at your residuals
savemyresults$residuals
#OR
savemyresults[[3]]
#extract the p-value of your predictor, which is what summary(results) reported above (with a single predictor it is also the significance of your model)
#the coefficients are stored as a matrix; counting down the columns, the p-value is the eighth value
savemyresults$coefficients
            Estimate Std. Error t value  Pr(>|t|)
(Intercept)    … (1)      … (3)   … (5) …e-05 (7)
predictor      … (2)      … (4)   … (6) …e-09 (8)
savemyresults$coefficients[8]
#this will do the exact same thing:
savemyresults[[4]][8]
[1] …e-09
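Counting positions in the coefficient matrix works, but it breaks silently if the model gains another predictor. A sketch of the same extraction by row and column name follows. The `set.seed(1)` call is an assumption added here only to make the sketch reproducible, and the names `fit`, `p_by_position` and `p_by_name` are illustrative.

```r
# Extracting the predictor's p-value by position and by name.
set.seed(1)                      # assumption: fixed seed, not part of the slides
n <- 20
predictor <- rnorm(n)
predicted <- 1 - 2 * predictor + rnorm(n)

fit <- summary(lm(predicted ~ predictor))

dim(fit$coefficients)            # 2 rows (terms) by 4 columns (statistics)
p_by_position <- fit$coefficients[8]                        # eighth value, column-wise
p_by_name     <- fit$coefficients["predictor", "Pr(>|t|)"]  # same cell, by name

p_by_position
p_by_name
```

Indexing by name states the intent in the code and keeps working when the number of rows in the coefficient table changes.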

Delete Elements: Lists
#to delete an element in a list, go back to your list
mylist
str(mylist)
mylist[["bar"]] <- NULL
str(mylist)
List of 1
 $ foo: num 1


Data Structure: Attributes
Attributes will likely be covered in more detail when you write your own functions. Briefly, so you can sleep well tonight: attributes are simply metadata. In the previous example, the names of the individual elements of a list are an attribute.
Example:
mysecondlist <- list(a=100, b=200, c=300)
str(mysecondlist)
List of 3
 $ a: num 100
 $ b: num 200
 $ c: num 300
attributes(mysecondlist)
$names
[1] "a" "b" "c"

Data Structure: Attributes
The names of the rows and columns in a data frame are also attributes.
Example:
dat<-data.frame(x=5:6, y=7:8)
str(dat)
'data.frame': 2 obs. of 2 variables:
 $ x: int 5 6
 $ y: int 7 8
attributes(dat)
$names
[1] "x" "y"
$row.names
[1] 1 2
$class
[1] "data.frame"

Data Structure: Attributes
Let's go ahead and change the row and column names of our dataframe:
rownames(dat)<-c("participant1","participant2"); names(dat)<-c("name1", "name2")
dat
             name1 name2
participant1     5     7
participant2     6     8
attributes(dat)
$names
[1] "name1" "name2"
$row.names
[1] "participant1" "participant2"
$class
[1] "data.frame"

Data Structure: Attributes
Attributes are also used to store the comments of a function (more about this later in the week).
Briefly: create a little function that computes the mean, which is simply the sum of the values divided by the number of values:
tt<-function(x){sum(x)/length(x)}
Comment your code (enter one line at a time):
tt<-function(x){
#Function to compute the mean
sum(x)/length(x)
}
str(tt)
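Attributes can also be read and set directly with `attr()` and `attributes()`. In the minimal sketch below, the attribute name "units" and its value are hypothetical and chosen only for illustration.

```r
# Reading and setting attributes by hand.
x <- 1:3
attr(x, "units") <- "cm"          # hypothetical attribute attached to the vector
names(x) <- c("a", "b", "c")      # names() sets the special "names" attribute

all_attrs <- attributes(x)        # a list holding every attribute of x
all_attrs

attr(x, "units") <- NULL          # assigning NULL removes an attribute
remaining <- names(attributes(x))
remaining
```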

Data Structure: Recycling Rule
In vector arithmetic, R performs element-by-element operations.
o For example, try:
x<-c(2,4,6)
y<-c(7,9,10)
x+y
[1]  9 13 16
This is straightforward because the vectors have the same length. But what happens when the vectors are of different lengths?

Data Structure: Recycling Rule
Create two vectors of different lengths and try it. This complicates things because R processes the elements in pairs, and when one vector is exhausted the other still has elements left.

Data Structure: Recycling Rule
In this case R invokes the "Recycling Rule". When the shorter vector is exhausted, R returns to its beginning, "recycling" its elements, while it continues taking elements from the longer vector until the operation is complete. It will recycle the elements of the shorter vector as many times as necessary.

Data Structure: Recycling Rule
Example with two vectors V1 and V2:
1:6+1:
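The recycling rule can be tried out as below. The slide's own example is cut off in this transcript, so the vectors used here (`1:6` with `1:2`, and `1:6` with `1:4`) are assumptions chosen for illustration, as are the result names.

```r
# Recycling: the shorter vector is reused from its start until the longer one is done.
even <- 1:6 + 1:2                    # 1:2 is used three times: 1 2 1 2 1 2
by_hand <- 1:6 + c(1L, 2L, 1L, 2L, 1L, 2L)   # the same sum written out in full
identical(even, by_hand)

# When the longer length is not a multiple of the shorter one, R still recycles
# (here 1:4 becomes 1 2 3 4 1 2) but also emits a warning.
uneven <- 1:6 + 1:4

even
uneven
```

The warning in the second case is worth paying attention to, since a length mismatch of this kind is more often a mistake than an intention.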

Acknowledgements: