Introduction to R Statistics are no substitute for judgment


1 Introduction to R
"Statistics are no substitute for judgment"
Henry Clay, U.S. congressman and senator

2 R
R is a free software environment for statistical computing and graphics
Object-oriented
It runs on a wide variety of platforms
Highly extensible
Command line and GUI
Conflict between extensibility and a GUI

3 R

4 RStudio
Datasets
Scripts
Results
Files, plots, packages, & help

5 Creating a project
Store all R scripts and data in the same folder or directory by creating a project
File > New Project…

6 Script
A script is a set of R commands
A program
c is short for combine in c(369.55, …)
# CO2 parts per million
# ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_annmean_mlo.txt
co2 <- c(369.55, , , , , , , , , , , , , , , , )
year <- (2000:2016) # a range of values
co2
year
# compute mean and standard deviation
mean(co2)
sd(co2)
plot(year,co2)

7 Exercise
Plot kWh per square foot by year for the following University of Georgia data.
year  sqfeet      kWh
2007  14,214,216  2,141,705
2008  14,359,041  2,108,088
2009  14,752,886  2,150,841
2010  15,341,886  2,211,414
2011  15,573,100  2,187,164
2012  15,740,742  2,057,364
Smart editing
Copy each column to a word processor
Convert table to text
Search and replace commas with null
Search and replace returns with commas
Edit to put R text around numbers
# Data in R format
year <- (2007:2012)
sqft <- c( , , , , , )
kwh <- c( , , , , , )
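One way to approach the exercise on slide 7 is sketched below. The values are copied from the table with the commas removed; the name kwh_per_sqft is introduced here for illustration and is not part of the original slide.

```r
# University of Georgia data from the table on slide 7, commas removed
year <- 2007:2012
sqft <- c(14214216, 14359041, 14752886, 15341886, 15573100, 15740742)
kwh  <- c(2141705, 2108088, 2150841, 2211414, 2187164, 2057364)

# element-wise division gives one ratio per year
kwh_per_sqft <- kwh / sqft   # hypothetical name, chosen for this sketch
kwh_per_sqft

# points joined by lines make the year-to-year change easier to see
plot(year, kwh_per_sqft, type = "b",
     xlab = "Year", ylab = "kWh per square foot")
```

Because R arithmetic is vectorized, no loop is needed to divide the two columns.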

8 Packages
R’s base set of packages can be extended by installing additional packages
Over 6,000 packages
Search the R Project site to identify packages and functions
Install using RStudio
Packages must be installed prior to use and their use specified in a script
library(packagename)

9 Packages
# install ONCE on your computer
# can also use RStudio to install
install.packages("knitr")
# library EVERY TIME before using a package in a session
# loads the package to memory
library(knitr)

10 Datasets
A dataset is a table
One row for each observation
Columns contain observation values
Same as the relational model
R supports multiple data structures and multiple data types

11 Data structures
Vector
A single row table where data are all of the same type
Matrix
A table where all data are of the same type
co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78)
year <- (2000:2009)
co2[2] # get the second value
m <- matrix(1:12, nrow=4,ncol=3)
m[4,3]

12 Exercise
Create a matrix with 6 rows and 3 columns containing the numbers 1 through 18
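A possible answer to the exercise on slide 12, following the matrix() call on slide 11. The second matrix, mr, is an extra added here to show the byrow option; the exercise does not ask for it.

```r
# fill column by column (the default)
m <- matrix(1:18, nrow = 6, ncol = 3)
m

# fill row by row instead
mr <- matrix(1:18, nrow = 6, ncol = 3, byrow = TRUE)
mr

dim(m)    # number of rows, then number of columns
m[6, 3]   # value in the last row and last column
```

The two matrices hold the same numbers in different positions, so it is worth printing the result before indexing into it.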

13 Data structures
Array
Extends a matrix beyond two dimensions
Data frame
Same as a relational table
Columns can have different data types
Typically, read a file to create a data frame
a <- array(1:24, c(4,3,2))
a[1,1,1]
gender <- c("m","f","f")
age <- c(5,8,3)
df <- data.frame(gender,age)
df
df[1,2]
df[1,]
df[,2]

14 Data structures
Tibble
A rethinking of data frames
Tibbles have a nice printing method that shows only the first 10 rows and all the columns that fit on the screen
When printed, the data type of each column is specified
library(tibble)
gender <- c("m","f","f")
age <- c(5,8,3)
dft <- tibble(gender,age) # creates a tibble
dft

15 Data structures
List
An ordered collection of objects
Can store a variety of objects under one name
l <- list(co2,m,df)
l[[3]] # third object in the list
l[[1]][2] # second element of the first object in the list

16 Logical operations
Logical operator  Symbol
EQUAL             ==
AND               &
OR                |
NOT               !
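The operators in the table on slide 16 work element by element on vectors. The sketch below reuses the gender and age vectors from slide 13; the particular comparisons are chosen here for illustration.

```r
gender <- c("m", "f", "f")
age <- c(5, 8, 3)

age == 8                   # EQUAL: one TRUE or FALSE per element
gender == "f" & age > 4    # AND: both conditions must hold
gender == "m" | age < 4    # OR: either condition may hold
!(gender == "f")           # NOT: reverses each value

# a logical vector can be used to pick out elements
age[gender == "f"]
```

A single = is assignment or argument passing in R, so a comparison always needs the doubled ==.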

17 Objects
Anything that can be assigned to a variable
Constant
Data structure
Function
Graph
Time series

18 Types of data
Classification: Nominal
Sorting or ranking: Ordinal
Measurement: Interval, Ratio
For ratio data, the zero point means none of the variable
Kelvin (ratio) vs Celsius (interval) for temperature

19 Factors
Nominal and ordinal data are factors
Factors determine how data are analyzed and presented
Data frames: before R 4.0.0, strings are treated as factors by default; from R 4.0.0 on, strings remain strings unless stringsAsFactors = TRUE is specified
Tibble: strings remain strings
Failure to realize a column contains a factor can cause confusion
Convert to a tibble
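A short sketch of how factors behave, using base R only. The gender values come from slide 13 and the Cold/Mild/Hot labels from slide 32; the variable name category and the specific values in it are illustrative assumptions.

```r
# nominal data: categories with no order
gender <- factor(c("m", "f", "f"))
levels(gender)   # the distinct categories, sorted alphabetically
table(gender)    # count of observations per category

# ordinal data: categories with a stated order
category <- factor(c("Mild", "Hot", "Cold", "Mild"),
                   levels = c("Cold", "Mild", "Hot"), ordered = TRUE)
category < "Hot"   # comparisons follow the level order

# ask for factors explicitly so the result does not depend on the R version
df <- data.frame(gender = c("m", "f", "f"), age = c(5, 8, 3),
                 stringsAsFactors = TRUE)
class(df$gender)
as.character(df$gender)   # recover the original strings
```

Checking class() or str() on a column is a quick way to find out whether it is a factor before analyzing it.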

20 Missing values
Missing values are indicated by NA (not available)
Arithmetic expressions and functions containing missing values generate missing values
Use the na.rm=T option to exclude missing values from calculations
sum(c(1,NA,2))
sum(c(1,NA,2),na.rm=T)

21 Missing values
You remove rows with missing values by using na.omit()
gender <- c("m","f","f","f")
age <- c(5,8,3,NA)
df <- data.frame(gender,age)
df2 <- na.omit(df)

22 Exercise
Install the package measurements and use one of its functions to do the following conversions:
100ºF to ºC
100 meters to feet

23 Compile a notebook
A notebook is a report of an analysis
Interweaves R code and output
Install knitr before use
Install suggested packages
File > Compile Notebook …
Select HTML
Select icon for ‘Show in new window'
Convert web page to PDF for assignment submission
There can be problems with PDF conversion on Windows.

24 Notebook output
Open in browser and save as PDF

25 PDF output

26 Reading a file
R can read a wide variety of input formats
Text
Excel spreadsheet
Statistical package formats (e.g., SAS)
DBMS

27 Reading a local text file
Use with RStudio installed on your computer
Delimited text file, such as CSV
readr functions create a tibble
Specify as required
Presence of column names (col_names = TRUE)
Delimiter (delim = ',')
library(readr)
# Read local
# It will not find this local file on your computer.
url <- "~/Dropbox/0Documents/Web sites/terry/data/centralparktemps.txt"
t <- read_delim(url,delim=',')

28 Reading a remote text file
Read a file using a URL
library(readr)
# Read a file with a URL
url <- '
t <- read_delim(url,delim=',')

29 Learning about a file
Click on the name of the file in the top-right window to see its content
Click on the blue icon of the file in the top-right window to see its structure
url <- "
t <- read_delim(url, delim=',')
head(t) # first few rows
tail(t) # last few rows
dim(t) # dimension
str(t) # structure of a dataset
class(t) # type of object

30 Referencing data
datasetName$columnName
The data set name comes before the $ and the column name after it
url <- "
t <- read_delim(url, delim=',')
# qualify with tablename to reference fields
mean(t$temperature)
max(t$year)
range(t$month)

31 Creating a new column
library(measurements)
library(readr)
url <- "
t <- read_delim(url,delim=',')
# compute Celsius
t$Ctemp <- round(conv_unit(t$temperature, 'F', 'C'),1)

32 Recoding
Some analyses might be facilitated by the recoding of data
Split a continuous measure into multiple categories
library(dplyr) # case_when() is a dplyr function
t$Category <- case_when(
  t$Ctemp >= 25 ~ 'Hot',
  t$Ctemp < 25 & t$Ctemp >= 5 ~ 'Mild',
  t$Ctemp < 5 ~ 'Cold')
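The same three-way split can also be sketched with base R's cut(), which is a different technique from the case_when() call on slide 32 and returns a factor rather than a character column. The temperatures below are made up for illustration and are not taken from the Central Park file.

```r
# illustrative Celsius values, chosen to include the boundary cases 5 and 25
Ctemp <- c(-2.5, 4.9, 5, 18.3, 25, 31.2)

# right = FALSE makes each interval include its lower bound, e.g. [5, 25),
# which matches the >= 5 and >= 25 conditions on the slide
Category <- cut(Ctemp,
                breaks = c(-Inf, 5, 25, Inf),
                labels = c("Cold", "Mild", "Hot"),
                right = FALSE)

data.frame(Ctemp, Category)
table(Category)   # number of observations in each category
```

Testing the boundary values is worthwhile whenever a continuous measure is recoded, because the default of cut() is to include the upper bound instead.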

33 Deleting a column
t$Category <- NULL

34 External files & RStudio server
Upload a file
Download a file
More > Export …

35 tidyr
Converting a spreadsheet for use in R

36 Gather & Spread
library(readr)
library(tidyr)
url <- '
t <- read_csv(url)
t
colnames(t) <- c('year',1:4)
# gather with data in columns 2 through 5
g <- gather(t,'quarter','value',2:5)
g$quarter <- as.integer(g$quarter)
g
# spread
s <- spread(g,quarter,value)
s
colnames(s) <- c('year', 'Q1','Q2','Q3','Q4')
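In tidyr 1.0.0 and later, gather() and spread() still work but are superseded by pivot_longer() and pivot_wider(). The sketch below shows the same round trip with the newer functions. The small data frame is invented for illustration, since the slide reads its data from a URL.

```r
library(tidyr)

# illustrative quarterly figures, one column per quarter
t <- data.frame(year = c(2015, 2016),
                Q1 = c(10, 14), Q2 = c(11, 15),
                Q3 = c(12, 16), Q4 = c(13, 17))

# wide to long: one row per year and quarter (the role of gather)
g <- pivot_longer(t, cols = Q1:Q4, names_to = "quarter", values_to = "value")
g

# long to wide: back to one column per quarter (the role of spread)
s <- pivot_wider(g, names_from = quarter, values_from = value)
s
```

The long form is usually the one that plotting and grouping functions expect, while the wide form matches how spreadsheets are typically laid out.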

37 Writing files
The file is stored in the project's folder
library(measurements)
library(readr)
url <- '
t <- read_delim(url, delim=',')
# compute Celsius and round to one decimal place
t$Ctemp <- round(conv_unit(t$temperature,'F','C'),1)
colnames(t)[3] <- 'Ftemp' # rename third column to indicate Fahrenheit
write_csv(t,"centralparktempsCF.txt")

38 dplyr
An R package of primitives (basic commands) for data frames and tibbles
A grammar of data manipulation
Integration with R commands
Function                     Purpose
filter()                     Select rows
select()                     Select columns
arrange()                    Sort rows
summarize()                  Compute a single summary statistic
group_by()                   Pair with summarize() to analyze groups within a dataset
inner_join()                 Join two tables
mutate()                     Create a new column
distinct()                   Select distinct rows
sample_n() & sample_frac()   Select a random sample by number or fraction

39 dplyr subset
Selecting rows
Selecting columns
Selecting rows and columns
library(dplyr)
library(readr)
url <- "
t <- read_delim(url,delim=',')
trow <- filter(t, year==1999)
tcol <- select(t, year)
trowcol <- t %>% select(year, month, temperature) %>% filter(year > 1989 & year < 2000)

40 dplyr sort
Sorting on column name
t <- arrange(t,desc(year),month)

41 dplyr - summarizing data
library(dplyr)
library(readr)
url <- '
t <- read_delim(url, delim=',')
summarize(t,mean(temperature))
w <- t %>% group_by(year) %>% summarize(averageF = mean(temperature))
w
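The filter, select, and group_by steps from slides 39 and 41 can be tried without a network connection on a small data frame. The column names follow the slides; the four rows of temperatures are invented for illustration and are not the Central Park data.

```r
library(dplyr)

# illustrative monthly temperatures in Fahrenheit
t <- data.frame(year = c(1998, 1998, 1999, 1999),
                month = c(1, 7, 1, 7),
                temperature = c(40, 76, 34, 80))

# keep the rows for one year, then keep two columns
t %>% filter(year == 1999) %>% select(month, temperature)

# one average per year: group first, then summarize each group
w <- t %>% group_by(year) %>% summarize(averageF = mean(temperature))
w
```

Without group_by(), summarize() collapses the whole table to a single row; with it, the result has one row per group.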

42 dplyr – mutating data
library(dplyr)
library(readr)
library(measurements)
url <- '
t <- read_delim(url, delim=',')
# add column
t <- mutate(t,CTemp = (temperature-32)*5/9)
# summarize
summarize(t,mean(CTemp))
Write as a pipe using measurements::conv_unit

43 Exercise
View the web page of yearly CO2 emissions (million metric tons) since the beginning of the industrial revolution
Create a new text file using R
Clean up the file for use with R and save it as CO2.txt
Import (Import Dataset) the file into R
Plot year versus CO2 emissions

44 Merging files
There must be a common column in both files
library(dplyr)
library(readr) # needed for read_delim()
url <- '
t <- read_delim(url, delim=',')
# average monthly temp for each year
a <- t %>% group_by(year) %>% summarize(mean = mean(temperature))
# read yearly carbon data (source:
url <- '
carbon <- read_delim(url, delim=',')
m <- inner_join(a,carbon)
m
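When inner_join() is called without a by argument, it joins on every column name the two tables share and prints a message saying which ones it used. Naming the common column makes the intent explicit. The sketch below uses the column names from slide 44 with invented values.

```r
library(dplyr)

# illustrative values; only the column names come from the slide
a <- data.frame(year = c(2000, 2001, 2002), mean = c(56.1, 56.4, 56.0))
carbon <- data.frame(year = c(2001, 2002, 2003), CO2 = c(371.1, 373.2, 375.8))

# name the common column rather than relying on the default match
m <- inner_join(a, carbon, by = "year")
m   # only years present in both tables remain
```

Rows without a partner in the other table are dropped, so comparing nrow() before and after the join is a simple check that the common column matched as expected.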

45 Correlation coefficient
cor.test(m$mean,m$CO2)
Pearson's product-moment correlation
data: m$mean and m$CO2
t = , df = 56, p-value =
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
sample estimates:
cor
Significant (p-value)

46 Correlation coefficient
The p-value indicates whether a correlation coefficient is significant
The probability of getting a value at least this large by chance when there is no true correlation
By convention, p < 0.05 indicates significance
To interpret the size of the effect
Correlation coefficient  Effect size
                         Small
                         Moderate
> .50                    Large
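The pieces of a cor.test() result can be pulled out by name, which makes it easier to compare them with the conventions on slide 46. The data below are constructed for illustration (a straight line plus a small alternating offset) and are not the temperature and CO2 series from the slides.

```r
# illustrative data: a linear trend with a small alternating offset
CO2  <- seq(320, 400, by = 10)
temp <- 50 + 0.02 * CO2 + rep(c(-0.1, 0.1), length.out = length(CO2))

r <- cor.test(temp, CO2)   # Pearson correlation by default

r$estimate    # the correlation coefficient
r$p.value     # compare with the 0.05 convention
r$conf.int    # 95 percent confidence interval
r$parameter   # degrees of freedom: number of pairs minus 2
```

Significance and effect size answer different questions: with many observations, even a small correlation can have a p-value below 0.05.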

47 Linear relationship
The correlation coefficient does not provide information about the nature of the linear relationship
mod <- lm(m$mean ~ m$CO2)
summary(mod)
Mean temperature = *CO2
Call:
lm(formula = m$mean ~ m$CO2)
Residuals:
Min 1Q Median 3Q Max
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) < 2e-16 ***
m$CO2 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: on 56 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 56 DF, p-value:
Variance explained (R-squared)
Significant (p-value)
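A variation on the lm() call from slide 47 is sketched below. It passes the data frame through the data argument instead of writing m$mean ~ m$CO2, which keeps the coefficient names short and lets predict() accept new values. The data frame is constructed for illustration and only its column names come from the slides.

```r
# illustrative data frame: a linear trend with a small alternating offset
m <- data.frame(CO2 = seq(320, 400, by = 10))
m$mean <- 50 + 0.02 * m$CO2 + rep(c(-0.1, 0.1), length.out = nrow(m))

# data = m lets the formula refer to columns by name
mod <- lm(mean ~ CO2, data = m)

coef(mod)                # intercept and slope of the fitted line
summary(mod)$r.squared   # share of the variance explained

# estimate the mean temperature at a CO2 level not in the data
predict(mod, newdata = data.frame(CO2 = 410))
```

The slope is the quantity the correlation coefficient does not give: the change in mean temperature associated with a one-unit change in CO2.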

48 sqldf
SQL with data frames/tibbles
Not completely identical to MySQL
Cannot embed R commands within an SQL statement, which is an advantage of dplyr
library(sqldf)
library(readr)
options(sqldf.driver = "SQLite") # to avoid a conflict with RMySQL
url <- '
t <- read_delim(url, delim=',')
trowcol <- sqldf("select year, month, temperature from t where year > 1989 and year < 2000")

49 Concatenating files
Taking a set of files with the same structure and creating a single file
Same type of data in corresponding columns
Files should be in the same directory

50 Concatenating files
Remote directory with FTP
Takes a while to run
# read the file names from a remote directory (FTP)
library(RCurl)
library(readr) # needed for read_delim()
url <-
dir <- getURL(url, dirlistonly = T)
filenames <- unlist(strsplit(dir,"\n")) # split into filenames
# append the files one after another
for (i in 1:length(filenames)) {
  file <- paste(url,filenames[i],sep='') # concatenate for url
  if (i == 1) {
    cp <- read_delim(file, col_names=F, delim=',')
  } else {
    temp <- read_delim(file, col_names=F, delim=',')
    cp <- rbind(cp, temp) # append to existing file
    rm(temp) # remove the temporary file
  }
}
colnames(cp) <- c('time','kwh')
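The same idea can be tried locally without FTP. The sketch below is a base R variation, not the slide's RCurl and readr approach: it first writes three small files so that it is self-contained, then lists, reads, and stacks them. The folder name, file names, and values are all invented for illustration; only the column names time and kwh come from the slide.

```r
# create three small files with the same structure in a temporary folder
folder <- file.path(tempdir(), "concat_demo")
dir.create(folder, showWarnings = FALSE)
for (i in 1:3) {
  part <- data.frame(time = i * 10 + 1:2, kwh = c(1.5, 2.5) * i)
  write.table(part, file.path(folder, paste0("part", i, ".csv")),
              sep = ",", row.names = FALSE, col.names = FALSE)
}

# list the files, read each one, and stack the results
filenames <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
parts <- lapply(filenames, read.csv, header = FALSE,
                col.names = c("time", "kwh"))
cp <- do.call(rbind, parts)
cp
```

Reading everything into a list and binding once avoids growing cp inside a loop, which tends to matter as the number of files increases.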

51 Read a spreadsheet
A multi-step process
library(readxl)
url <- "
destfile <- "InternetCompanies.xlsx"
download.file(url, destfile)
InternetCompanies <- read_excel(destfile)
InternetCompanies

52 Database access
MySQL access
library(DBI)
library(RMySQL)
conn <- dbConnect(RMySQL::MySQL(), "richardtwatson.com", dbname="Weather", user="student", password="student")
# Query the database and create file t for use with R
t <- dbGetQuery(conn,"SELECT * from record;")
head(t)

53 Security
For security reasons, it is not a good idea to put database access details in your R code
Hide in a file
Create a csv file within your R code folder containing database access parameters

54 Security
Text file (weather_richardtwatson.csv)
url,dbname,user,password
richardtwatson.com,Weather,student,student
R code
# Database access
library(readr)
library(DBI)
url <- 'dbaccess/weather_richardtwatson.csv'
d <- read_csv(url)
conn <- dbConnect(RMySQL::MySQL(), d$url, dbname=d$dbname, user=d$user, password=d$password)
t <- dbGetQuery(conn,"SELECT timestamp, airTemp from record;")
head(t)

55 Timestamps
A timestamp reports when an observation was recorded
:00:00
Components of a timestamp can be extracted with lubridate
library(lubridate)
library(DBI)
conn <- dbConnect(RMySQL::MySQL(), "richardtwatson.com", dbname="Weather", user="student", password="student")
# Query the database and create file t for use with R
t <- dbGetQuery(conn,"select * from record;")
t$year <- year(t$Timestamp)
t$month <- month(t$Timestamp)
head(t)
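The lubridate accessors can be tried without a database connection. In the sketch below, the column names timestamp and airTemp follow the query on slide 54, while the four readings and the name aug5pm are invented for illustration.

```r
library(lubridate)

# illustrative readings; ymd_hms() parses text into date-times (UTC by default)
t <- data.frame(
  timestamp = ymd_hms(c("2010-08-01 17:00:00", "2010-08-02 17:00:00",
                        "2010-09-01 17:00:00", "2011-08-01 09:00:00")),
  airTemp = c(88, 92, 80, 75))

# each accessor returns one component of the timestamp
t$year  <- year(t$timestamp)
t$month <- month(t$timestamp)
t$day   <- day(t$timestamp)
t$hour  <- hour(t$timestamp)   # 24-hour clock, so 5 pm is 17

# readings taken at 5 pm in August, across all years
aug5pm <- t[t$month == 8 & t$hour == 17, ]
mean(aug5pm$airTemp)
```

Once the components are stored as columns, they can be used in filter() and group_by() like any other column.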

56 Exercise
Using the Atlanta weather database and the lubridate package
Compute the average temperature at 5 pm in August
Determine the maximum temperature for each day in August across all years in the input file

57 Resources
R books
Reference card
Quick-R

58 Key points
R is a platform for a wide variety of data analytics
Statistical analysis
Data visualization
HDFS and Cluster Computing
Text mining
Energy Informatics
R is a programming language
Much to learn

