Published by Cordelia Stevenson. Modified over 9 years ago.
1
Introduction to R
"Statistics are no substitute for judgment." Henry Clay, U.S. congressman and senator
2
R
R is a free software environment for statistical computing and graphics
Object-oriented
It runs on a wide variety of platforms
Highly extensible
Command line and GUI
Conflict between extensible and GUI
3
RStudio
Scripts
Datasets
Results
Files, plots, packages, & help
4
Creating a project
Store all R scripts and data in the same folder or directory by creating a project
File > New Project…
5
Script
A script is a set of R commands
A program
c is short for combine in c(369.40, …)
    # CO2 parts per million for 2000-2009
    co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78)
    year <- (2000:2009) # a range of values
    # show values
    co2
    year
    # compute mean and standard deviation
    mean(co2)
    sd(co2)
    plot(year,co2)
6
Exercise
Plot kWh per square foot by year for the following University of Georgia data.
    # Data in R format
    year <- (2007:2012)
    sqft <- c(14214216, 14359041, 14752886, 15341886, 15573100, 15740742)
    kwh <- c(2141705, 2108088, 2150841, 2211414, 2187164, 2057364)
Smart editing
1. Copy each column to a word processor
2. Convert table to text
3. Search and replace commas with null
4. Search and replace returns with commas
5. Edit to put R text around numbers
7
Datasets
A dataset is a table
  One row for each observation
  Columns contain observation values
Same as the relational model
R supports multiple data structures and multiple data types
8
Data structures
Vector: a single-row table where data are all of the same type
Matrix: a table where all data are of the same type
    co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78)
    year <- (2000:2009)
    co2[2] # get the second value
    m <- matrix(1:12, nrow=4,ncol=3)
    m[4,3] # get the value in row 4, column 3
9
Exercise
Create a matrix with 6 rows and 3 columns containing the numbers 1 through 18
10
Data structures
Array: extends a matrix beyond two dimensions
Data frame: same as a relational table
  Columns can have different data types
  Typically, read a file to create a data frame
    a <- array(1:24, c(4,3,2))
    a[1,1,1]
    gender <- c("m","f","f")
    age <- c(5,8,3)
    df <- data.frame(gender,age)
    df[1,2] # row 1, column 2
    df[1,]  # all of row 1
    df[,2]  # all of column 2
11
Data structures
List: an ordered collection of objects
  Can store a variety of objects under one name
    l <- list(co2,m,df)
    l[[3]] # list 3
    l[[1]][2] # second element of list 1
12
Logical operations
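A minimal sketch of R's standard comparison and logical operators, applied to a made-up vector. The name `temps` and its values are illustrative and do not come from the slides.

```r
# Illustrative values only; `temps` is a hypothetical vector
temps <- c(28, 35, 41, 30, 52)

above30     <- temps > 30                  # element-wise comparison
in_range    <- temps >= 30 & temps <= 41   # element-wise AND
extreme     <- temps < 30 | temps > 50     # element-wise OR
not_extreme <- !extreme                    # negation

temps[in_range]   # a logical vector can index another vector
sum(above30)      # TRUE counts as 1, so this counts the matches
any(temps > 50)   # is at least one element TRUE?
all(temps > 20)   # are all elements TRUE?
```

The single-character forms `&` and `|` work element by element, which is what indexing a vector or a data frame column needs.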
13
Objects
Anything that can be assigned to a variable
  Constant
  Data structure
  Function
  Graph
  …
14
Types of data
Classification: nominal
Sorting or ranking: ordinal
Measurement: interval, ratio
15
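Factors
Nominal and ordinal data are factors
Before R 4.0.0, data.frame() and read.csv() converted strings to factors by default; from R 4.0.0 onward strings stay as character unless stringsAsFactors = TRUE is set
Factors determine how data are analyzed and presented
Failure to realize that a column contains a factor can cause confusion
Use str() to find out a data frame's structure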
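A short sketch of creating nominal and ordinal factors explicitly, so the result does not depend on the R version's default. The data frame and its column names are hypothetical.

```r
# Illustrative data; the column names are hypothetical
size <- c("small", "large", "medium", "small")
df <- data.frame(id = 1:4, size = size, stringsAsFactors = FALSE)
str(df)                       # size is a character column here

# Nominal factor: categories with no inherent order
df$size_nominal <- factor(df$size)
levels(df$size_nominal)       # levels are sorted alphabetically by default

# Ordinal factor: state the order of the levels explicitly
df$size_ordinal <- factor(df$size,
                          levels = c("small", "medium", "large"),
                          ordered = TRUE)
df$size_ordinal < "large"     # comparisons are defined for ordered factors
table(df$size_ordinal)        # counts are reported in level order
```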
16
Missing values
Missing values are indicated by NA (not available)
Arithmetic expressions and functions containing missing values generate missing values
Use the na.rm=T option to exclude missing values from calculations
    sum(c(1,NA,2))
    sum(c(1,NA,2),na.rm=T)
17
Missing values
You remove rows with missing values by using na.omit()
    gender <- c("m","f","f","f")
    age <- c(5,8,3,NA)
    df <- data.frame(gender,age)
    df2 <- na.omit(df)
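Before excluding or dropping missing values it can help to locate and count them. The sketch below reuses the slide's small example and relies only on base R functions (`is.na`, `complete.cases`).

```r
age <- c(5, 8, 3, NA)
is.na(age)                 # TRUE marks the missing position
sum(is.na(age))            # number of missing values
mean(age, na.rm = TRUE)    # mean of the non-missing values only

gender <- c("m", "f", "f", "f")
df <- data.frame(gender, age)

# complete.cases() flags rows with no NA; this keeps the same rows as na.omit(df)
complete <- df[complete.cases(df), ]
nrow(complete)
```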
18
Packages
R's base set of packages can be extended by installing additional packages
Over 4,000 packages
Search the R Project site to identify packages and functions
Install using RStudio
Packages must be installed prior to use and their use specified in a script
    library(packagename)
19
Packages
    # install ONCE on your computer
    # can also use RStudio to install
    install.packages("knitr")
    # library EVERY TIME before using a package in a session
    # loads the package to memory
    library(knitr)
20
Exercise
Install the package birk and use one of its functions to do the following conversions:
100°F to °C
100 meters to feet
21
Compile a notebook
A notebook is a report of an analysis
Interweaves R code and output
File > Compile Notebook …
Select html, pdf, or Word output
Install knitr before use
Install suggested packages
22
PDF
23
Reading a file
R can read a wide variety of input formats
  Text
  Statistical package formats (e.g., SAS)
  DBMS
24
Reading a text file
Delimited text file, such as CSV
Creates a data frame
Specify as required
  Presence of header
  Separator
  Row names
    library(readr)
    # Read local
    url <- "~/Dropbox/Documents/Web sites/terry/data/centralparktemps.txt"
    t <- read_delim(url,delim=',')
This path exists only on the presenter's computer, so R will not find the file on yours.
25
Reading a text file
Read a file using a URL
    library(readr)
    # Read a file with a URL
    url <- 'http://people.terry.uga.edu/rwatson/data/centralparktemps.txt'
    t <- read_delim(url,delim=',')
26
Learning about an object
    library(readr)
    url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
    t <- read_delim(url, delim=',')
    head(t) # first few rows
    tail(t) # last few rows
    dim(t) # dimension
    str(t) # structure of a dataset
    class(t) # type of object
Click on the name of the file in the top-right window to see its content
Click on the blue icon of the file in the top-right window to see its structure
27
Referencing data
datasetName$columnName (data set name, then column name)
    library(readr)
    url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
    t <- read_delim(url, delim=',')
    # qualify with the data set name to reference columns
    mean(t$temperature)
    max(t$year)
    range(t$month)
28
Creating a new column
    library(birk)
    library(readr)
    url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
    t <- read_delim(url,delim=',')
    # compute Celsius; the unit names are passed as quoted strings
    t$Ctemp = round(conv_unit(t$temperature,"F","C"),1)
29
External files & RStudio server
Upload a file
Download a file
  More > Export …
30
Reshaping
Converting data from one format to another
  Wide to narrow
Melt
Cast
31
Reshaping
    library(reshape)
    library(readr)
    url <- 'http://people.terry.uga.edu/rwatson/data/meltExample.csv'
    # no column names and tab as delimiter
    s <- read_delim(url,col_names=F,delim='\t')
    head(s)
    colnames(s) <- c('year', 1:12)
    head(s)
    # melt (normalization)
    m <- melt(s,id='year')
    head(m)
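What a wide-to-narrow conversion does can be sketched in base R without the reshape package. The table `wide`, its column names, and its values are hypothetical; this illustrates the idea and is not how `melt` is implemented.

```r
# Hypothetical wide table: one row per year, one column per month
wide <- data.frame(year = c(2000, 2001),
                   jan  = c(31.3, 33.6),
                   feb  = c(37.3, 35.9))

value_cols <- setdiff(names(wide), "year")   # every column except the id

# Narrow form: one row per (year, month) observation
narrow <- data.frame(
  year  = rep(wide$year, times = length(value_cols)),
  month = rep(value_cols, each = nrow(wide)),
  value = unlist(wide[value_cols], use.names = FALSE)
)
narrow
```

Each measurement ends up on its own row, identified by the id column plus the name of the column it came from, which is the layout a relational table expects.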
32
Writing files
    library(birk)
    library(readr)
    url <- 'http://people.terry.uga.edu/rwatson/data/centralparktemps.txt'
    t <- read_delim(url, delim=',')
    # compute Celsius and round to one decimal place
    t$Ctemp = round((t$temperature-32)*5/9,1)
    colnames(t)[3] <- 'Ftemp' # rename third column to indicate Fahrenheit
    write_csv(t,"centralparktempsCF.txt")
The file is stored in the project's folder
33
sqldf
An R package for using SQL with data frames
Returns a data frame
Supports MySQL
34
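Subset
Selecting rows
Selecting columns
Selecting rows and columns
    library(sqldf)
    options(sqldf.driver = "SQLite") # to avoid a conflict with RMySQL
    trowSQL <- sqldf("select * from t where year = 1999")
    tcolSQL <- sqldf("select year, month, Ctemp from t")
    # the column list in the next query is assumed; only the where clause survived extraction
    trowcolSQL <- sqldf("select year, month, Ctemp from t where year > 1989 and year < 2000")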
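The same three kinds of subset can also be written with base R indexing, shown here as a sketch on a small made-up data frame. The values are illustrative, and the name `t` is kept only to match the slides.

```r
# Hypothetical stand-in for the temperature data frame
t <- data.frame(year  = c(1989, 1995, 1999, 1999, 2000),
                month = c(1, 6, 1, 2, 1),
                Ctemp = c(-1.2, 22.1, 1.4, 2.0, -0.5))

trow    <- t[t$year == 1999, ]                # rows: condition before the comma
tcol    <- t[, c("year", "month", "Ctemp")]   # columns: names after the comma
trowcol <- t[t$year > 1989 & t$year < 2000,   # rows and columns together
             c("year", "Ctemp")]
```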
35
Sort
Sorting on column name
    sSQL <- sqldf("select * from t order by year desc, month")
36
Recoding
Some analyses might be facilitated by the recoding of data
Split a continuous measure into two categories
    t$Category <- 'Other'
    t$Category[t$Ftemp >= 30] <- 'Hot'
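Two other base R ways to express the same split, as a sketch on illustrative values: `ifelse()` for a two-way recode and `cut()` when more bands may be needed later. The threshold of 30 is the one used in the slide.

```r
Ftemp <- c(12.5, 30, 29.9, 75.2)   # illustrative values

# Same rule as the two-step assignment: 'Hot' when Ftemp >= 30
category <- ifelse(Ftemp >= 30, "Hot", "Other")

# cut() generalizes to more than two bands; right = FALSE closes each
# interval on the left, so a value of exactly 30 falls in [30, Inf)
band <- cut(Ftemp, breaks = c(-Inf, 30, Inf),
            labels = c("Other", "Hot"), right = FALSE)
```

`cut()` returns a factor, whereas `ifelse()` here returns a character vector.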
37
Deleting a column
    t$Category <- NULL
38
Exercise
Download the spreadsheet of monthly mean CO2 measurements (PPM) taken at the Mauna Loa Observatory from 1958 onwards
  http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html
Export a CSV file that contains three columns: year, month, and average CO2
Read the file into R
Recode missing values (-99.99) to NA
Plot year versus CO2
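The recoding step can be sketched on a few made-up rows in the three-column layout the exercise describes. The column name `average` and all values are assumptions for illustration, not the actual Mauna Loa data.

```r
# Made-up rows; only the -99.99 sentinel comes from the exercise text
co2 <- data.frame(year    = c(1958, 1958, 1958),
                  month   = c(3, 4, 5),
                  average = c(315.7, -99.99, 317.5))

co2$average[co2$average == -99.99] <- NA   # the sentinel value becomes NA
mean(co2$average, na.rm = TRUE)            # NA is then skipped in calculations
```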
39
Summarizing data
    library(sqldf)
    library(readr)
    options(sqldf.driver = "SQLite") # to avoid a conflict with RMySQL
    url <- 'http://people.terry.uga.edu/rwatson/data/centralparktemps.txt'
    t <- read_delim(url, delim=',')
    # mean temperature for each year
    w <- sqldf("select year, avg(temperature) as mean from t group by year")
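A base R sketch of the same group-by summary using `aggregate()`, on a small made-up data frame. The values are illustrative; the column names follow the slide.

```r
# Hypothetical stand-in for the temperature data
t <- data.frame(year        = c(2000, 2000, 2001, 2001),
                temperature = c(30, 40, 32, 44))

# One row per year, holding the mean of temperature within that year
w <- aggregate(temperature ~ year, data = t, FUN = mean)
names(w)[2] <- "mean"   # match the column name used in the SQL version
w
```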
40
Merging files
There must be a common column in both files
    library(sqldf)
    library(readr)
    options(sqldf.driver = "SQLite") # to avoid a conflict with RMySQL
    url <- 'http://people.terry.uga.edu/rwatson/data/centralparktemps.txt'
    t <- read_delim(url, delim=',')
    # average monthly temp for each year
    a <- sqldf("select year, avg(temperature) as mean from t group by year")
    # read yearly carbon data (source: http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html)
    url <- 'http://people.terry.uga.edu/rwatson/data/carbon1959-2011.txt'
    carbon <- read_delim(url, delim=',')
    # join the two data frames on the common column, year
    m <- sqldf("select a.year, CO2, mean from a, carbon where a.year = carbon.year")
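The join on a common column can also be sketched with base R's `merge()`. Both data frames below are made up for illustration; only the column names follow the slide.

```r
# Hypothetical yearly means and yearly CO2 readings
a      <- data.frame(year = c(2000, 2001, 2002), mean = c(55.1, 56.0, 55.4))
carbon <- data.frame(year = c(2001, 2002, 2003), CO2  = c(371.1, 373.2, 375.8))

# Inner join: only years present in both data frames are kept
m <- merge(a, carbon, by = "year")
m
```

By default `merge()` drops rows without a match, which corresponds to the `where a.year = carbon.year` condition in the SQL version.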
41
Correlation coefficient
    cor.test(m$mean,m$CO2)

        Pearson's product-moment correlation

    data:  m$mean and m$CO2
    t = 3.1173, df = 51, p-value = 0.002997
    95 percent confidence interval:
     0.1454994 0.6049393
    sample estimates:
          cor
    0.4000598

Significant
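The object returned by `cor.test()` can be stored and its parts read directly instead of copied from the printed report. A sketch on synthetic data follows; `x` and `y` are generated for illustration and are unrelated to the temperature and CO2 series.

```r
set.seed(1)                 # make the synthetic noise reproducible
x <- 1:20
y <- 2 * x + rnorm(20)      # strongly correlated with x by construction

result <- cor.test(x, y)    # Pearson correlation by default
result$estimate             # the correlation coefficient
result$p.value              # the p-value shown in the printed report
result$conf.int             # the 95 percent confidence interval
result$parameter            # degrees of freedom, n - 2
```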
42
Concatenating files
Taking a set of files with the same structure and creating a single file
Same type of data in corresponding columns
Files should be in the same directory
43
Concatenating files
Local directory
    library(readr)
    # read the file names from a local directory
    filenames <- list.files("homeC-all/homeC-power", pattern="\\.csv$", full.names=TRUE)
    # append the files one after another
    for (i in 1:length(filenames)) {
      # Create the concatenated data frame using the first file
      if (i == 1) {
        cp <- read_delim(filenames[i], col_names=F, delim=',')
      } else {
        temp <- read_delim(filenames[i], col_names=F, delim=',')
        cp <- rbind(cp, temp) # append to existing data frame
        rm(temp) # remove the temporary data frame
      }
    }
    colnames(cp) <- c('time','watts')
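A loop-free variant of the same idea, sketched with base R's `read.csv()` rather than readr so that it runs without extra packages. The folder `power_demo`, the two files, and their contents are created here purely for illustration.

```r
# Write two small header-less CSV files to a temporary folder
dir <- file.path(tempdir(), "power_demo")
dir.create(dir, showWarnings = FALSE)
write.table(data.frame(c(1, 2), c(100, 110)), file.path(dir, "a.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(data.frame(c(3, 4), c(120, 130)), file.path(dir, "b.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)

# List the files, read each into a data frame, then stack them row-wise
filenames <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
parts <- lapply(filenames, read.csv, header = FALSE,
                col.names = c("time", "watts"))
cp <- do.call(rbind, parts)
cp
```

Reading every file first and binding once avoids growing the data frame inside a loop, which tends to matter as the number of files increases.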
44
Concatenating files
Remote directory with FTP
    # read the file names from a remote directory (FTP)
    library(RCurl)
    library(readr)
    url <- "ftp://watson_ftp:bulldawg1989@people.terry.uga.edu/rwatson/power/"
    dir <- getURL(url, dirlistonly = T)
    filenames <- unlist(strsplit(dir,"\n")) # split into filenames
    # append the files one after another
    for (i in 1:length(filenames)) {
      file <- paste0(url, filenames[i]) # concatenate for url
      if (i == 1) {
        cp <- read_delim(file, col_names=F, delim=',')
      } else {
        temp <- read_delim(file, col_names=F, delim=',')
        cp <- rbind(cp, temp) # append to existing data frame
        rm(temp) # remove the temporary data frame
      }
    }
    colnames(cp) <- c('time','kwh')
Takes a while to run
45
Database access
MySQL access
    library(DBI)
    conn <- dbConnect(RMySQL::MySQL(), "richardtwatson.com", dbname="Weather", user="db2", password="student")
    # Query the database and create file t for use with R
    t <- dbGetQuery(conn,"SELECT timestamp, airTemp from record;")
    head(t)
46
Exercise
Using the Atlanta weather database and the lubridate package
  Compute the average temperature at 5 pm in August
  Determine the maximum temperature for each day in August for each year
47
Resources
R books
Reference card
Quick-R
48
Key points
R is a platform for a wide variety of data analytics
  Statistical analysis
  Data visualization
  HDFS and MapReduce
  Text mining
  Energy Informatics
R is a programming language
  Much to learn