Introduction to R Statistics are no substitute for judgment Henry Clay, U.S. congressman and senator.


1 Introduction to R
"Statistics are no substitute for judgment." Henry Clay, U.S. congressman and senator

2 R
R is a free software environment for statistical computing and graphics
Object-oriented
Runs on a wide variety of platforms
Highly extensible
Command line and GUI
Tension between extensibility and a GUI

3 RStudio
Scripts
Datasets
Results
Files, plots, packages, & help

4 Creating a project
Store all R scripts and data in the same folder or directory by creating a project
File > New Project…

5 Script
A script is a set of R commands (a program)
c is short for combine in c(369.40, …)
    # CO2 parts per million for 2000-2009
    co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78)
    year <- (2000:2009) # a range of values
    # show values
    co2
    year
    # compute mean and standard deviation
    mean(co2)
    sd(co2)
    plot(year,co2)

6 Exercise
Plot kWh per square foot by year for the following University of Georgia data.
    # Data in R format
    year <- (2007:2012)
    sqft <- c(14214216, 14359041, 14752886, 15341886, 15573100, 15740742)
    kwh <- c(2141705, 2108088, 2150841, 2211414, 2187164, 2057364)
Smart editing
1. Copy each column to a word processor
2. Convert table to text
3. Search and replace commas with null
4. Search and replace returns with commas
5. Edit to put R text around numbers

7 Datasets
A dataset is a table
One row for each observation
Columns contain observation values
Same as the relational model
R supports multiple data structures and multiple data types

8 Data structures
Vector: a single-row table where data are all of the same type
Matrix: a table where all data are of the same type
    co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78)
    year <- (2000:2009)
    co2[2] # get the second value
    m <- matrix(1:12, nrow=4, ncol=3)
    m[4,3] # get the value in row 4, column 3

9 Exercise Create a matrix with 6 rows and 3 columns containing the numbers 1 through 18

10 Data structures
Array: extends a matrix beyond two dimensions
Data frame: same as a relational table
Columns can have different data types
Typically, read a file to create a data frame
    a <- array(1:24, c(4,3,2))
    a[1,1,1]
    gender <- c("m","f","f")
    age <- c(5,8,3)
    df <- data.frame(gender,age)
    df[1,2] # row 1, column 2
    df[1,]  # all of row 1
    df[,2]  # all of column 2

11 Data structures
List: an ordered collection of objects
Can store a variety of objects under one name
    l <- list(co2,m,df)
    l[[3]] # third element of the list (the data frame df)
    l[[1]][2] # second value of the first element (the vector co2)

12 Logical operations
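The slide's table of operators did not survive in the transcript. As a minimal sketch, here are R's standard comparison and logical operators applied to the first five co2 values from slide 5; the variable names (above, in_band, and so on) are illustrative and not from the original deck.

```r
co2 <- c(369.40, 371.07, 373.17, 375.78, 377.52)

# Comparisons are element-wise and return a logical vector
above <- co2 > 372

# & is element-wise AND, | is element-wise OR, ! is NOT
in_band <- co2 >= 370 & co2 <= 376
outside <- co2 < 370 | co2 > 376
not_above <- !above

# A logical vector can select values or be counted (TRUE counts as 1)
high <- co2[above]
n_high <- sum(above)

# any() and all() reduce a logical vector to a single value
any(co2 > 377)
all(co2 > 369)
```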

13 Objects
Anything that can be assigned to a variable
Constant
Data structure
Function
Graph
…

14 Types of data
Classification: nominal
Sorting or ranking: ordinal
Measurement: interval, ratio

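15 Factors
Nominal and ordinal data are factors
Before R 4.0.0, data.frame() and read.table() converted strings to factors by default; since R 4.0.0 strings stay as character unless stringsAsFactors = TRUE is set
Factors determine how data are analyzed and presented
Failure to realize that a column contains a factor can cause confusion
Use str() to find out a data frame's structure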
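A minimal sketch of creating factors explicitly and checking them with str(). The size column and its level order are made up for illustration; gender reuses the values from slide 10.

```r
gender <- c("m", "f", "f")
size <- c("small", "large", "medium")   # hypothetical ordinal data

# Nominal: levels have no order (they are stored alphabetically)
g <- factor(gender)

# Ordinal: state the level order and mark the factor as ordered
s <- factor(size, levels = c("small", "medium", "large"), ordered = TRUE)

df <- data.frame(gender = g, size = s)
str(df)          # reports which columns are factors and their levels
levels(g)
s[1] < s[2]      # comparisons are meaningful only for ordered factors
table(df$gender) # counts per level
```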

16 Missing values
Missing values are indicated by NA (not available)
Arithmetic expressions and functions containing missing values generate missing values
Use the na.rm=TRUE option to exclude missing values from calculations
    sum(c(1,NA,2))            # NA
    sum(c(1,NA,2),na.rm=TRUE) # 3

17 Missing values
Remove rows with missing values by using na.omit()
    gender <- c("m","f","f","f")
    age <- c(5,8,3,NA)
    df <- data.frame(gender,age)
    df2 <- na.omit(df) # drops the fourth row, where age is NA

18 Packages
R's base set of packages can be extended by installing additional packages
Over 4,000 packages
Search the R Project site to identify packages and functions
Install using RStudio
Packages must be installed prior to use and their use specified in a script
    library(packagename)

19 Packages
    # install ONCE on your computer
    # can also use RStudio to install
    install.packages("knitr")
    # library EVERY TIME before using a package in a session
    # loads the package to memory
    library(knitr)

20 Exercise
Install the package birk and use one of its functions to do the following conversions:
100ºF to ºC
100 meters to feet

21 Compile a notebook
A notebook is a report of an analysis
Interweaves R code and output
File > Compile Notebook …
Select html, pdf, or Word output
Install knitr before use
Install suggested packages

22 PDF

23 Reading a file
R can read a wide variety of input formats
Text
Statistical package formats (e.g., SAS)
DBMS

24 Reading a text file
Delimited text file, such as CSV
Creates a data frame
Specify as required: presence of header, separator, row names
    library(readr)
    # Read a local file
    url <- "~/Dropbox/Documents/Web sites/terry/data/centralparktemps.txt"
    t <- read_delim(url, delim=',')
This path exists only on the author's computer, so R will not find the file on yours.

25 Reading a text file
Read a file using a URL
    library(readr)
    # Read a file with a URL
    url <- 'http://people.terry.uga.edu/rwatson/data/centralparktemps.txt'
    t <- read_delim(url, delim=',')

26 Learning about an object
    library(readr)
    url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
    t <- read_delim(url, delim=',')
    head(t)  # first few rows
    tail(t)  # last few rows
    dim(t)   # dimensions (rows, columns)
    str(t)   # structure of a dataset
    class(t) # type of object
Click on the name of the file in the top-right window to see its content
Click on the blue icon of the file in the top-right window to see its structure

27 Referencing data
datasetName$columnName (the data set, then the column)
    library(readr)
    url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
    t <- read_delim(url, delim=',')
    # qualify with the data set name to reference columns
    mean(t$temperature)
    max(t$year)
    range(t$month)

28 Creating a new column
    library(birk)
    library(readr)
    url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
    t <- read_delim(url, delim=',')
    # compute Celsius; conv_unit() takes the unit names as strings
    t$Ctemp <- round(conv_unit(t$temperature, 'F', 'C'), 1)

29 External files & RStudio server
Upload a file
Download a file
More > Export …

30 Reshaping
Converting data from one format to another
Wide to narrow
Melt
Cast

31 Reshaping
    library(reshape)
    library(readr)
    url <- 'http://people.terry.uga.edu/rwatson/data/meltExample.csv'
    # no column names and tab as delimiter
    s <- read_delim(url, col_names=FALSE, delim='\t')
    head(s)
    colnames(s) <- c('year', 1:12)
    head(s)
    # melt (normalization)
    m <- melt(s, id='year')
    head(m)
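For readers without the reshape package, the same wide-to-narrow step can be sketched with base R's reshape() function. This is a swapped-in technique, not the slide's melt() call, and the two-month table below uses made-up values and hypothetical column names (m1, m2).

```r
# Wide format: one row per year, one column per month (made-up values)
wide <- data.frame(year = c(2000, 2001),
                   m1 = c(31.3, 33.6),
                   m2 = c(37.3, 35.9))

# Narrow (long) format: one row per year-month combination
long <- reshape(wide,
                direction = "long",
                varying = c("m1", "m2"),   # the columns to stack
                v.names = "temperature",   # name for the stacked values
                timevar = "month",         # name for the new key column
                times = 1:2,               # key values for m1 and m2
                idvar = "year")
long
```

Each year now appears once per month, which is the normalized layout that melt() produces on the slide.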

32 Writing files
    library(readr)
    url <- 'http://people.terry.uga.edu/rwatson/data/centralparktemps.txt'
    t <- read_delim(url, delim=',')
    # compute Celsius and round to one decimal place
    t$Ctemp <- round((t$temperature-32)*5/9, 1)
    colnames(t)[3] <- 'Ftemp' # rename third column to indicate Fahrenheit
    write_csv(t, "centralparktempsCF.txt")
The file is stored in the project's folder

33 sqldf
An R package for using SQL with data frames
Returns a data frame
Supports MySQL

34 Subset
Selecting rows
Selecting columns
Selecting rows and columns
    library(sqldf)
    options(sqldf.driver = "SQLite") # to avoid a conflict with RMySQL
    trowSQL <- sqldf("select * from t where year = 1999")
    tcolSQL <- sqldf("select year, month, Ctemp from t")
    trowcolSQL <- sqldf("select year, month, Ctemp from t where year > 1989 and year < 2000")

35 Sort
Sorting on column name
    sSQL <- sqldf("select * from t order by year desc, month")

36 Recoding
Some analyses might be facilitated by the recoding of data
Split a continuous measure into two categories
    t$Category <- 'Other'
    t$Category[t$Ftemp >= 30] <- 'Hot'
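A minimal sketch of the same recoding on a small stand-alone vector, followed by a split into more than two categories with cut(). The temperatures, the break points (32 and 60), and the band labels are made up for illustration.

```r
Ftemp <- c(25.1, 30.0, 48.6, 71.9)   # made-up Fahrenheit values

# Two categories, using the slide's threshold of 30
category <- ifelse(Ftemp >= 30, "Hot", "Other")

# Three categories; right = FALSE makes each interval closed on the left,
# so the bands are [-Inf, 32), [32, 60), and [60, Inf)
band <- cut(Ftemp,
            breaks = c(-Inf, 32, 60, Inf),
            labels = c("Freezing", "Mild", "Warm"),
            right = FALSE)

data.frame(Ftemp, category, band)
```

cut() returns a factor, so the result can be tabulated or used as a grouping variable directly.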

37 Deleting a column
    t$Category <- NULL

38 Exercise
Download the spreadsheet of monthly mean CO2 measurements (PPM) taken at the Mauna Loa Observatory from 1958 onwards: http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html
Export a CSV file that contains three columns: year, month, and average CO2
Read the file into R
Recode missing values (-99.99) to NA
Plot year versus CO2
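The recoding step can be sketched on a short vector before applying it to the real file. The values below are made up; only the -99.99 sentinel comes from the exercise.

```r
co2 <- c(315.71, -99.99, 317.45, -99.99, 316.99)   # made-up values

# Replace the sentinel with NA so it is not treated as a measurement
co2[co2 == -99.99] <- NA

sum(is.na(co2))          # how many values were missing
mean(co2, na.rm = TRUE)  # mean of the remaining values
```

For a data frame the same assignment works on a single column, for example df$co2[df$co2 == -99.99] <- NA, where df and co2 are whatever names your file produces.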

39 Summarizing data
    library(sqldf)
    library(readr)
    options(sqldf.driver = "SQLite") # to avoid a conflict with RMySQL
    url <- 'http://people.terry.uga.edu/rwatson/data/centralparktemps.txt'
    t <- read_delim(url, delim=',')
    # mean temperature for each year
    w <- sqldf("select year, avg(temperature) as mean from t group by year")
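The same group-by summary can be sketched in base R with aggregate(), which avoids the sqldf dependency. This is an alternative to the slide's SQL, and the small temps data frame is made up so the example runs without downloading the file.

```r
# Made-up monthly temperatures for two years
temps <- data.frame(year = c(1999, 1999, 2000, 2000),
                    month = c(1, 2, 1, 2),
                    temperature = c(30, 40, 32, 36))

# Equivalent of: select year, avg(temperature) as mean from t group by year
w <- aggregate(temperature ~ year, data = temps, FUN = mean)
names(w)[2] <- "mean"
w
```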

40 Merging files
There must be a common column in both files
    library(sqldf)
    library(readr)
    options(sqldf.driver = "SQLite") # to avoid a conflict with RMySQL
    url <- 'http://people.terry.uga.edu/rwatson/data/centralparktemps.txt'
    t <- read_delim(url, delim=',')
    # mean temperature for each year
    a <- sqldf("select year, avg(temperature) as mean from t group by year")
    # read yearly carbon data
    # (source: http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html)
    url <- 'http://people.terry.uga.edu/rwatson/data/carbon1959-2011.txt'
    carbon <- read_delim(url, delim=',')
    # join the two data frames on the common column, year
    m <- sqldf("select a.year, CO2, mean from a, carbon where a.year = carbon.year")
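A minimal sketch of the same join using base R's merge(), as an alternative to the SQL on the slide. Both data frames hold made-up values; only the column names (year, mean, CO2) follow the slide.

```r
# Made-up yearly summaries
a <- data.frame(year = c(2009, 2010, 2011),
                mean = c(55.9, 56.7, 57.0))
carbon <- data.frame(year = c(2010, 2011, 2012),
                     CO2 = c(389.9, 391.6, 393.8))

# Inner join on the common column: only years present in both are kept
m <- merge(a, carbon, by = "year")
m
```

Years that appear in only one data frame (2009 and 2012 here) are dropped, which matches the behavior of the where clause in the SQL version.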

41 Correlation coefficient
    cor.test(m$mean, m$CO2)
Output:
    Pearson's product-moment correlation
    data:  m$mean and m$CO2
    t = 3.1173, df = 51, p-value = 0.002997
    95 percent confidence interval:
     0.1454994 0.6049393
    sample estimates:
          cor
    0.4000598
Significant (p-value = 0.002997)

42 Concatenating files
Taking a set of files with the same structure and creating a single file
Same type of data in corresponding columns
Files should be in the same directory

43 Concatenating files
Local directory
    library(readr)
    # read the file names from a local directory
    filenames <- list.files("homeC-all/homeC-power", pattern="\\.csv$", full.names=TRUE)
    # append the files one after another
    for (i in seq_along(filenames)) {
      # create the concatenated data frame using the first file
      if (i == 1) {
        cp <- read_delim(filenames[i], col_names=FALSE, delim=',')
      } else {
        temp <- read_delim(filenames[i], col_names=FALSE, delim=',')
        cp <- rbind(cp, temp) # append to the existing data frame
        rm(temp) # remove the temporary data frame
      }
    }
    colnames(cp) <- c('time','watts')
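A self-contained sketch of the same idea that can be run anywhere: it writes two small made-up CSV files to a temporary directory, then reads and stacks them with lapply() and rbind() instead of an explicit loop. The directory name, file names, and values are hypothetical, and base read.csv() stands in for readr's read_delim().

```r
# Create a scratch directory with two small files of identical structure
demo_dir <- file.path(tempdir(), "power_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(1:2, c(100, 110)), file.path(demo_dir, "a.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(data.frame(3:4, c(120, 130)), file.path(demo_dir, "b.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)

# List the files; the pattern is a regular expression, not a shell glob
filenames <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file into a list of data frames, then stack them by row
parts <- lapply(filenames, read.csv, header = FALSE,
                col.names = c("time", "watts"))
cp <- do.call(rbind, parts)
cp
```

Reading all files first and binding once avoids growing the data frame inside a loop, which matters when there are many files.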

44 Concatenating files
Remote directory with FTP
    library(RCurl)
    library(readr)
    # read the file names from a remote directory (FTP)
    url <- "ftp://watson_ftp:bulldawg1989@people.terry.uga.edu/rwatson/power/"
    dir <- getURL(url, dirlistonly = TRUE)
    filenames <- unlist(strsplit(dir, "\n")) # split into filenames
    # append the files one after another
    for (i in seq_along(filenames)) {
      file <- paste0(url, filenames[i]) # concatenate for url
      if (i == 1) {
        cp <- read_delim(file, col_names=FALSE, delim=',')
      } else {
        temp <- read_delim(file, col_names=FALSE, delim=',')
        cp <- rbind(cp, temp) # append to the existing data frame
        rm(temp) # remove the temporary data frame
      }
    }
    colnames(cp) <- c('time','kwh')
Takes a while to run

45 Database access
MySQL access
    library(DBI)
    conn <- dbConnect(RMySQL::MySQL(), "richardtwatson.com", dbname="Weather", user="db2", password="student")
    # Query the database and create data frame t for use with R
    t <- dbGetQuery(conn, "SELECT timestamp, airTemp from record;")
    head(t)

46 Exercise
Using the Atlanta weather database and the lubridate package:
Compute the average temperature at 5 pm in August
Determine the maximum temperature for each day in August for each year

47 Resources
R books
Reference card
Quick-R

48 Key points
R is a platform for a wide variety of data analytics
Statistical analysis
Data visualization
HDFS and MapReduce
Text mining
Energy Informatics
R is a programming language
Much to learn

