R Introduction, Data Structures

An Excellent R Book (among many others): R in Action: Data Analysis and Graphics with R, by Robert I. Kabacoff



Steps in a typical data analysis (Kabacoff, 2011)

R features (Kabacoff, 2011)
– R is free! (SPSS, SAS, etc. cost thousands or tens of thousands of dollars.)
– R is a comprehensive statistical platform, offering all manner of data analytic techniques.
– R has state-of-the-art graphics capabilities.
– R is a powerful platform for interactive data analysis and exploration.
– R can easily import data from a wide variety of sources, including text files, database management systems, statistical packages, and specialized data repositories. It can write data out to these systems as well.
– R provides an unparalleled platform for programming new statistical methods in an easy and straightforward manner. It’s easily extensible and provides a natural language for quickly programming recently published methods.
– R contains advanced statistical routines not yet available in other packages. In fact, new methods become available for download on a weekly basis.
– A variety of graphical user interfaces (GUIs) are available, offering the power of R through menus and dialogs.
– R runs on a wide array of platforms, including Windows, Unix, and Mac OS X.

Data structures in R

Vectors
– Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data.
– The combine function c() is used to form the vector:
> x = c(1, 3, 5, 7, 25, -13, 47)
> y = c("unu", "doi", "trei", "opt")
– All the data in a vector must be of one type (numeric, character, or logical).
– Elements of a vector are referred to using a numeric vector of positions within brackets: x[c(4, 6)] refers to the 4th and 6th elements of vector x.
> x = c(1, 3, 5, 7, 25, -13, 47)
> x[3]
[1] 5
> x[c(1, 2, 4)]
[1] 1 3 7
> x[2:6]
[1]   3   5   7  25 -13
– The colon operator in the last statement generates a sequence of numbers; c(2:6) gives the same values as c(2, 3, 4, 5, 6).
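The two rules above (one type per vector, indexing by position) can be sketched as follows. This is only an illustration; the vector name mixed is hypothetical and does not appear in the slides.

```r
# Mixing types in c() does not fail: R coerces every element to one common type.
mixed <- c(1, "doi", TRUE)
class(mixed)    # "character"

x <- c(1, 3, 5, 7, 25, -13, 47)
x[c(4, 6)]      # 4th and 6th elements: 7 and -13
x[-1]           # a negative position drops that element
length(x)       # 7
```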

Date type
– Dates are more difficult to handle than the other basic types.
– Dates are represented as the number of days since 1970-01-01, with negative values for earlier dates.
– as.Date() converts strings to dates:
> mydates <- as.Date(c(' ', ' ', ' '))
> # number of days between 10/11/2013 and 3/10/2013
> days <- mydates[3] - mydates[2]
> days
> # notice the way of displaying the result
> # print today's date
> today <- Sys.Date()
> format(today, format="%d %B %Y")
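A small runnable sketch of date arithmetic. The three date strings below are hypothetical values chosen only for illustration (written in the default yyyy-mm-dd format); they are not the values from the original slide.

```r
# Hypothetical dates, chosen only for illustration.
mydates <- as.Date(c("2013-01-15", "2013-10-03", "2013-11-10"))

# Subtracting two Date values gives a "difftime" object, not a plain number.
days <- mydates[3] - mydates[2]
class(days)         # "difftime"
as.numeric(days)    # 38

# A Date is stored as the number of days since 1970-01-01.
as.numeric(as.Date("1970-01-10"))   # 9
```

Printing days directly shows the value together with its unit, which is why the result does not look like an ordinary number.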

Symbols used with format()
Symbol  Meaning                        Example
%d      day of the month as a number   01-31
%a      abbreviated weekday            Mon
%A      unabbreviated weekday          Monday
%m      month as a number              01-12
%b      abbreviated month              Jan
%B      unabbreviated month            January
%y      2-digit year                   07
%Y      4-digit year                   2007

Date conversions
– Character to Date: as.Date(x, "format")
> # convert date info in format 'dd/mm/yyyy'
> strDates = c("01/10/2013", "31/10/2013")
> dates = as.Date(strDates, "%d/%m/%Y")
– Date to Character: as.character()
> # convert dates to character data
> strDates2 = as.character(dates)
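The conversion in both directions can be checked with a short round trip. This sketch reuses the strDates values from the slide and only the locale-independent symbols %d, %m and %Y.

```r
strDates <- c("01/10/2013", "31/10/2013")
dates <- as.Date(strDates, format = "%d/%m/%Y")
class(dates)                    # "Date"
format(dates, "%Y-%m-%d")       # "2013-10-01" "2013-10-31"

# Formatting with the same pattern gives back the original strings.
identical(format(dates, "%d/%m/%Y"), strDates)   # TRUE

# A string that does not match the pattern becomes NA rather than an error.
is.na(as.Date("2013-10-31", format = "%d/%m/%Y"))
```

Symbols such as %a, %A, %b and %B return names in the language of the current locale, so their output can differ between machines.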

Matrices
– Two-dimensional arrays where each element has the same type (numeric, character, or logical)
– Created with the matrix function. Format:
> mymatrix <- matrix(vector, nrow=number_of_rows,
+ ncol=number_of_columns, byrow=logical_value,
+ dimnames=list(char_vector_rownames, char_vector_colnames))
– vector contains the elements for the matrix
– nrow and ncol specify the row and column dimensions
– dimnames contains optional row and column labels stored in character vectors
– byrow indicates whether the matrix should be filled in by row (byrow=TRUE) or by column (byrow=FALSE); the default is by column

Creating matrices (1)
– First example (a 5 x 4 matrix)
> m1 <- matrix(1:20, nrow=5, ncol=4)
> m1
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
– Second example (a 2 x 2 matrix, filled by rows)
> cells <- c(1,26,24,68)
> rownames <- c("Row1", "Row2")
> colnames <- c("Col1", "Col2")
> m2 <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
+ dimnames=list(rownames, colnames))
> m2
     Col1 Col2
Row1    1   26
Row2   24   68

Creating matrices (2)
– Third example (a 2 x 2 matrix, filled by columns)
> m3 <- matrix(cells, nrow=2, ncol=2,
+ byrow=FALSE, dimnames=list(rownames,
+ colnames))
> m3
     Col1 Col2
Row1    1   24
Row2   26   68
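The difference between the two fill orders can be verified directly. In this sketch the names by_row and by_col are hypothetical; the cells vector is the one used in the examples above.

```r
cells <- c(1, 26, 24, 68)
by_row <- matrix(cells, nrow = 2, byrow = TRUE)
by_col <- matrix(cells, nrow = 2)     # byrow = FALSE is the default

by_row[1, 2]    # 26: second value lands in row 1
by_col[1, 2]    # 24: second value lands in column 1, so [1, 2] is the third

# Filling by row is the transpose of filling by column.
identical(by_row, t(by_col))    # TRUE
```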

Accessing matrix elements (1)
– (re)create the matrix
> m1 <- matrix(1:20, nrow=5)
> m1
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
– display the 3rd row
> m1[3,]
[1]  3  8 13 18
– display the 3rd column
> m1[,3]
[1] 11 12 13 14 15

Accessing matrix elements (2)
– display the element in the 2nd row and 3rd column
> m1[2,3]
[1] 12
– display two elements from the same row: m1[2,3] and m1[2,4]
> m1[2, c(3,4)]
[1] 12 17
– display three elements from the same column: m1[1,2], m1[2,2] and m1[3,2]
> m1[c(1,2,3), 2]
[1] 6 7 8
– display a "submatrix", from m1[2,2] to m1[4,4]
> m1[c(2,3,4), c(2,3,4)]
     [,1] [,2] [,3]
[1,]    7   12   17
[2,]    8   13   18
[3,]    9   14   19
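One detail worth seeing next to these examples: selecting a single row or column returns a plain vector unless drop = FALSE is given. A minimal sketch, with hypothetical names row3, row3m and sub:

```r
m1 <- matrix(1:20, nrow = 5)

row3 <- m1[3, ]                  # a plain vector: 3 8 13 18
is.matrix(row3)                  # FALSE

row3m <- m1[3, , drop = FALSE]   # the same row, kept as a 1 x 4 matrix
dim(row3m)                       # 1 4

sub <- m1[2:4, 2:4]              # same submatrix as m1[c(2,3,4), c(2,3,4)]
dim(sub)                         # 3 3
```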

Arrays
– Similar to matrices but can have more than two dimensions
– Elements must be of the same type
– Created with the array function:
> myarray <- array(vector,
+ dimensions, dimnames)
– vector contains the data for the array
– dimensions is a numeric vector giving the maximal index for each dimension
– dimnames is an optional list of dimension labels
– Elements in arrays are accessed similarly to those in matrices

Create and access arrays (1)
> dim1 <- c("A1", "A2")
> dim2 <- c("B1", "B2", "B3")
> dim3 <- c("C1", "C2",
+ "C3", "C4")
> a1 <- array(1:24, c(2, 3, 4),
+ dimnames=list(dim1, dim2,
+ dim3))
> a1
, , C1
   B1 B2 B3
A1  1  3  5
A2  2  4  6
, , C2
   B1 B2 B3
A1  7  9 11
A2  8 10 12
, , C3
   B1 B2 B3
A1 13 15 17
A2 14 16 18
, , C4
   B1 B2 B3
A1 19 21 23
A2 20 22 24
– display element [2,2,3]
> a1[2,2,3]
[1] 16

Create and access arrays (2)
– display a matrix from the elements of A and B for the first level of C
> a1[,,1]
   B1 B2 B3
A1  1  3  5
A2  2  4  6
– display the elements of A for the 3rd level of B and the 2nd level of C
> a1[,3,2]
A1 A2
11 12
– display a subarray containing the elements from the first two levels of A, B and C
> a1[c(1,2),c(1,2),c(1,2)]
, , C1
   B1 B2
A1  1  3
A2  2  4
, , C2
   B1 B2
A1  7  9
A2  8 10
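Array elements can also be addressed by dimension name, and the value 16 at position [2,2,3] follows from the column-major storage order. The sketch below rebuilds a1 as above; the position arithmetic is added here only as an illustration.

```r
a1 <- array(1:24, c(2, 3, 4),
            dimnames = list(c("A1", "A2"),
                            c("B1", "B2", "B3"),
                            c("C1", "C2", "C3", "C4")))
dim(a1)                   # 2 3 4

# Names can replace numeric positions.
a1["A2", "B2", "C3"]      # 16, same as a1[2, 2, 3]

# Storage is column-major: the first index varies fastest.
pos <- 2 + (2 - 1) * 2 + (3 - 1) * 2 * 3
pos                       # 16
a1[pos]                   # 16, because the array was filled with 1:24
```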

Data Frames
– The most important data structure in R (at least for us)
– A data frame is a structure in R that holds data and is similar to the datasets found in standard statistical packages (for example, SAS, SPSS, and Stata) and databases
– The columns are variables and the rows are observations
– Variables can have different types (for example, numeric, character) in the same data frame
– Data frames are the main structures we’ll use to store datasets

data.frame function
– A data frame is created with the data.frame() function:
> mydata <- data.frame(col1, col2, col3,…)
– col1, col2, col3, … are column vectors of any type (such as character, numeric, or logical)
– names for each column can be provided with the names function
> studentID <- c(1, 2, 3, 4, 5)
> name <- c("Popescu I. Vasile", "Ianos W. Adriana",
+ "Kovacz V. Iosef", "Babadag I. Maria", "Pop P. Ion")
> age <- c(23, 19, 21, 22, 31)
> scholarship <- c("Social","Studiu1","Studiu2","Merit","Studiu1")
> lab_assessment <- c("Bine", "Foarte bine", "Excelent", "Bine", "Slab")
> final_grade <- c(9, 9.45, 9.75, 7.21, 6)
> student_gi <- data.frame(studentID, name, age, scholarship,
+ lab_assessment, final_grade)
> student_gi
  studentID              name age scholarship lab_assessment final_grade
1         1 Popescu I. Vasile  23      Social           Bine        9.00
2         2  Ianos W. Adriana  19     Studiu1    Foarte bine        9.45
3         3   Kovacz V. Iosef  21     Studiu2       Excelent        9.75
4         4  Babadag I. Maria  22       Merit           Bine        7.21
5         5        Pop P. Ion  31     Studiu1           Slab        6.00
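A shortened sketch of the same construction, showing how names() reads and changes column names. The data frame name students and the new column name grade are hypothetical; stringsAsFactors = FALSE is passed explicitly so that the result is the same in every R version.

```r
studentID <- c(1, 2, 3)
name <- c("Popescu I. Vasile", "Ianos W. Adriana", "Kovacz V. Iosef")
final_grade <- c(9, 9.45, 9.75)

students <- data.frame(studentID, name, final_grade,
                       stringsAsFactors = FALSE)
names(students)                  # "studentID" "name" "final_grade"
dim(students)                    # 3 rows, 3 columns

names(students)[3] <- "grade"    # rename the third column
names(students)                  # "studentID" "name" "grade"

# Each column keeps its own type.
sapply(students, class)
```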

Accessing elements of a data frame (1)
– display the first two columns (studentID and name)
> student_gi[1:2]
  studentID              name
1         1 Popescu I. Vasile
2         2  Ianos W. Adriana
3         3   Kovacz V. Iosef
4         4  Babadag I. Maria
5         5        Pop P. Ion
– the same operation could be done with
> student_gi[c("studentID", "name")]
– display the final_grade column as a vector
> student_gi$final_grade
[1] 9.00 9.45 9.75 7.21 6.00

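Accessing elements of a data frame (2)
– cross tabulate (a sort of pivot table) lab_assessment by final_grade
> table(student_gi$lab_assessment,
+ student_gi$final_grade)
              6 7.21 9 9.45 9.75
  Bine        0    1 1    0    0
  Excelent    0    0 0    0    1
  Foarte bine 0    0 0    1    0
  Slab        1    0 0    0    0
– summary statistics of final_grade
> summary(student_gi$final_grade)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  6.000   7.210   9.000   8.282   9.450   9.750
– two plots (the first one requires lab_assessment to be a factor, which data.frame() created automatically before R 4.0; in later versions wrap the column in factor())
> plot(student_gi$lab_assessment, student_gi$final_grade)
> plot(student_gi$age, student_gi$final_grade)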
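Besides selecting columns, rows can be selected with a logical condition. This sketch goes slightly beyond the slides; it rebuilds a reduced version of student_gi, and the name top is hypothetical.

```r
student_gi <- data.frame(
  name = c("Popescu I. Vasile", "Ianos W. Adriana", "Kovacz V. Iosef",
           "Babadag I. Maria", "Pop P. Ion"),
  age = c(23, 19, 21, 22, 31),
  final_grade = c(9, 9.45, 9.75, 7.21, 6),
  stringsAsFactors = FALSE
)

# A logical condition before the comma selects rows.
top <- student_gi[student_gi$final_grade > 9, ]
top$name       # "Ianos W. Adriana" "Kovacz V. Iosef"

# Rows and columns can be restricted in one step.
student_gi[student_gi$age < 22, c("name", "age")]

# [[ ]] extracts one column as a vector, like $.
identical(student_gi[["age"]], student_gi$age)    # TRUE
```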

attach()
– The attach() function adds the data frame to the R search path
– When a variable name is encountered, data frames in the search path are checked in order to locate the variable
– But first we'll delete the vectors that formed the data frame (to avoid confusion)
> rm(studentID, name, age, scholarship, lab_assessment,
+ final_grade)
– Now we'll run the previous commands again, but with attach()
> attach(student_gi)
> final_grade
> table(lab_assessment, final_grade)
> summary(final_grade)
> plot(lab_assessment, final_grade)
> plot(age, final_grade)
– At the end, detach() removes the data frame from the R search path
> detach(student_gi)
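As an alternative that the slides do not cover, base R also offers with(), which evaluates one expression using the columns of a data frame and leaves the search path unchanged. A minimal sketch on a reduced student_gi (the name avg is hypothetical):

```r
student_gi <- data.frame(
  age = c(23, 19, 21, 22, 31),
  final_grade = c(9, 9.45, 9.75, 7.21, 6)
)

# Column names are resolved inside the data frame for this call only.
avg <- with(student_gi, mean(final_grade))
avg                               # 8.282
with(student_gi, range(age))      # 19 31
```

Because nothing is attached, there is no detach() step to forget and no risk of a column being masked by a vector of the same name.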

Case Identifiers
– Can be specified with the row.names option of the data.frame() function
– New values for studentID (to avoid confusion with regular row numbers)
> studentID <- c(1001, 1002, 1003, 1004, 1005)
– Vectors name, age, scholarship, lab_assessment and final_grade are the same
– (Slightly) new version of the data frame
> student_gi <- data.frame(studentID, name, age,
+ scholarship, lab_assessment, final_grade,
+ row.names = studentID)
– studentID is the variable to use in labeling cases
> student_gi
     studentID              name age scholarship lab_assessment final_grade
1001      1001 Popescu I. Vasile  23      Social           Bine        9.00
1002      1002  Ianos W. Adriana  19     Studiu1    Foarte bine        9.45
1003      1003   Kovacz V. Iosef  21     Studiu2       Excelent        9.75
1004      1004  Babadag I. Maria  22       Merit           Bine        7.21
1005      1005        Pop P. Ion  31     Studiu1           Slab        6.00
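Once case identifiers are set, a row can be looked up by its identifier instead of its position. A short sketch on a reduced version of the data frame; note that row names are stored as character strings, so the identifier is quoted.

```r
studentID <- c(1001, 1002, 1003)
name <- c("Popescu I. Vasile", "Ianos W. Adriana", "Kovacz V. Iosef")
final_grade <- c(9, 9.45, 9.75)

student_gi <- data.frame(name, final_grade, row.names = studentID,
                         stringsAsFactors = FALSE)
rownames(student_gi)                  # "1001" "1002" "1003"

# Look up one case by identifier.
student_gi["1003", "final_grade"]     # 9.75
student_gi["1002", ]
```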

Factors (1)
– Variables can be described as nominal, ordinal, or continuous
– Nominal variables are categorical, without an implied order. Examples: MaritalStatus, Sex, Job, MasterProgramme
– Ordinal variables imply order but not amount. Examples: Status (poor, improved, excellent), LabAssessment (slab, bine, foarteBine, excelent)
– Continuous variables can take on any value within some range, and both order and amount are implied. Examples: LitersPer100Km, Height, Weight, FinalGrade (with decimals)
– Categorical (nominal) and ordered categorical (ordinal) variables are called factors. Factors determine how data will be analyzed and presented visually
– The function factor() stores the categorical values as a vector of integers in the range [1... k] (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers

factor function
– A nominal variable
> scholarship <- c("Social","Studiu1","Studiu2","Merit",
+ "Studiu1")
– factor function
> scholarship_f <- factor(scholarship)
> scholarship_f
[1] Social  Studiu1 Studiu2 Merit   Studiu1
Levels: Merit Social Studiu1 Studiu2
– Ordinal variable
> lab_assessment <- c("Bine", "Foarte bine", "Excelent",
+ "Bine", "Slab")
> lab_assessment
[1] "Bine"        "Foarte bine" "Excelent"    "Bine"        "Slab"
> lab_assessment <- factor(lab_assessment, order=TRUE,
+ levels=c("Slab", "Bine", "Foarte bine", "Excelent"))
> lab_assessment
[1] Bine        Foarte bine Excelent    Bine        Slab
Levels: Slab < Bine < Foarte bine < Excelent
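The integer storage behind a factor can be inspected with levels() and as.integer(). This sketch reuses the two variables from the slide; the short name lab is hypothetical.

```r
scholarship_f <- factor(c("Social", "Studiu1", "Studiu2", "Merit", "Studiu1"))
levels(scholarship_f)        # "Merit" "Social" "Studiu1" "Studiu2"
as.integer(scholarship_f)    # 2 3 4 1 3: positions in the levels vector

lab <- factor(c("Bine", "Foarte bine", "Excelent", "Bine", "Slab"),
              ordered = TRUE,
              levels = c("Slab", "Bine", "Foarte bine", "Excelent"))
as.integer(lab)              # 2 3 4 2 1

# Ordered factors support comparisons that follow the level order.
lab[1] < lab[3]              # TRUE: Bine < Excelent
```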

Data Frame with Factors (1)
– Vectors studentID, name, age and final_grade are the same as before
– scholarship and lab_assessment are factors
> scholarship <- c("Social", "Studiu1", "Studiu2", "Merit", "Studiu1")
> scholarship <- factor(scholarship)
> lab_assessment <- c("Bine", "Foarte bine", "Excelent", "Bine", "Slab")
> lab_assessment <- factor(lab_assessment, order=TRUE, levels=c("Slab",
+ "Bine", "Foarte bine", "Excelent"))
– Another version of the data frame (column studentID is removed and becomes the row identifier)
> student_gi <- data.frame(name, age, scholarship,
+ lab_assessment, final_grade, row.names = studentID)
– Structure of the data frame
> str(student_gi)
'data.frame': 5 obs. of 5 variables:
 $ name          : Factor w/ 5 levels "Babadag I. Maria",..:
 $ age           : num 23 19 21 22 31
 $ scholarship   : Factor w/ 4 levels "Merit","Social",..: 2 3 4 1 3
 $ lab_assessment: Ord.factor w/ 4 levels "Slab"<"Bine"<..: 2 3 4 2 1
 $ final_grade   : num 9 9.45 9.75 7.21 6

Data Frame with Factors (2)
– Basic statistics about the variables in the data frame
> summary(student_gi)

Factors and Value Labels
– The factor() function can be used to create value labels for categorical variables
> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> diabetes <- factor(diabetes)
> status <- factor(status, order=TRUE)
> gender <- c(1, 2, 2, 1)
> patientdata <- data.frame(patientID, age, diabetes,
+ status, gender)
– Variable gender is coded 1 for male and 2 for female. Create value labels:
> patientdata$gender <- factor(patientdata$gender,
+ levels = c(1,2), labels = c("male", "female"))
– levels indicates the actual values of the variable
– labels refers to a character vector containing the desired labels
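A compact check of how levels and labels interact, using the gender coding from the slide. The names gender_f, status_f and other are hypothetical, and the third call uses an extra code 3 that is not in the original data, added only to show what happens to unlisted values.

```r
gender <- c(1, 2, 2, 1)
gender_f <- factor(gender, levels = c(1, 2), labels = c("male", "female"))
as.character(gender_f)     # "male" "female" "female" "male"

# Without a levels argument, an ordered factor uses alphabetical order.
status_f <- factor(c("Poor", "Improved", "Excellent", "Poor"), ordered = TRUE)
levels(status_f)           # "Excellent" "Improved" "Poor"

# A value that is not listed in levels becomes NA.
other <- factor(c(1, 2, 3), levels = c(1, 2), labels = c("male", "female"))
is.na(other)               # FALSE FALSE TRUE
```

The alphabetical default happens to give a meaningful order for status here; for a variable such as lab_assessment the levels have to be listed explicitly.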

Lists
– Lists are the most complex of the R data types
– A list is an ordered collection of objects (components)
– A list allows gathering a large variety of (possibly unrelated) objects under one name
– A list can contain a combination of vectors, matrices, data frames, and even other lists
– Created using the list() function:
mylist <- list(object1, object2, …)
where the objects are any of the structures seen so far
– Optionally, the objects in a list can be named:
mylist <- list(name1=object1,
+ name2=object2, …)
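A minimal sketch of creating a named list and reaching its components. The component names title, ages and m, and their contents, are hypothetical values chosen for illustration.

```r
mylist <- list(title = "My first list",
               ages = c(25, 26, 18, 39),
               m = matrix(1:10, nrow = 5))
names(mylist)            # "title" "ages" "m"
length(mylist)           # 3

# Double brackets or $ return the component itself.
mylist[["ages"]]         # 25 26 18 39
mylist$ages[2]           # 26

# Single brackets return a shorter list, not the component.
class(mylist["ages"])    # "list"
```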

Useful functions for Data Objects