Introduction to R
"Statistics are no substitute for judgment." Henry Clay, U.S. congressman and senator


R
R is a free software environment for statistical computing and graphics
Object-oriented
It runs on a wide variety of platforms
Highly extensible
Command line and GUI
Conflict between extensible and GUI

RStudio
Scripts
Datasets
Results
Files, plots, packages, & help

Creating a project
Store all R scripts and data in the same folder or directory by creating a project
File > New Project…

Script
A script is a set of R commands
A program
c is short for combine in c(369.40, …)
    # CO2 parts per million
    co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78)
    year <- (2000:2009) # a range of values
    # show values
    co2
    year
    # compute mean and standard deviation
    mean(co2)
    sd(co2)
    plot(year,co2)

Exercise
Plot kWh per square foot by year for the following University of Georgia data.
    # Data in R format
    year <- (2007:2012)
    sqft <- c( , , , , , )
    kwh <- c( , , , , , )
Smart editing
1. Copy each column to a word processor
2. Convert table to text
3. Search and replace commas with null
4. Search and replace returns with commas
5. Edit to put R text around numbers

Datasets
A dataset is a table
One row for each observation
Columns contain observation values
Same as the relational model
R supports multiple data structures and multiple data types

Data structures
Vector
A single row table where data are all of the same type
Matrix
A table where all data are of the same type
    co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78)
    year <- (2000:2009)
    co2[2] # get the second value
    m <- matrix(1:12, nrow=4,ncol=3)
    m[4,3]
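To make the indexing concrete, here is a minimal sketch of a vector and two small matrices. The names by_col and by_row are illustrative, not from the slides. It shows that matrix() fills column by column unless byrow = TRUE is given.

```r
# A numeric vector: every element has the same type
co2 <- c(369.40, 371.07, 373.17)
second <- co2[2]          # indexing starts at 1

# matrix() fills column by column unless byrow = TRUE is given
by_col <- matrix(1:6, nrow = 2, ncol = 3)
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

by_col[2, 3]   # row 2, column 3
by_row[2, 1]   # row 2, column 1
dim(by_col)    # number of rows, then number of columns
```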

Exercise Create a matrix with 6 rows and 3 columns containing the numbers 1 through 18

Data structures
Array
Extends a matrix beyond two dimensions
Data frame
Same as a relational table
Columns can have different data types
Typically, read a file to create a data frame
    a <- array(1:24, c(4,3,2))
    a[1,1,1]
    gender <- c("m","f","f")
    age <- c(5,8,3)
    df <- data.frame(gender,age)
    df[1,2]
    df[1,]
    df[,2]
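The three index forms on this slide return different kinds of objects. The sketch below, using an illustrative data frame named people, shows what each one gives back.

```r
a <- array(1:24, dim = c(4, 3, 2))   # 4 rows, 3 columns, 2 layers
last <- a[4, 3, 2]                   # the final cell of the array

people <- data.frame(gender = c("m", "f", "f"), age = c(5, 8, 3))
cell <- people[1, 2]       # single value: row 1, column 2
row1 <- people[1, ]        # one-row data frame
ages <- people[, 2]        # column 2 as a plain vector
sapply(people, class)      # column types can differ
```

Leaving an index blank means "everything along this dimension", which is why people[1, ] keeps all columns.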

Data structures
List
An ordered collection of objects
Can store a variety of objects under one name
    l <- list(co2,m,df)
    l[[3]] # third object in the list (the data frame df)
    l[[1]][2] # second element of the first object (the vector co2)

Logical operations
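A short sketch of the usual comparison and logical operators, applied to a made-up vector named temps. The values are illustrative only.

```r
temps <- c(28, 35, 41, 30)

above <- temps > 30             # element-wise comparison gives a logical vector
temps >= 30 & temps < 40        # & is element-wise AND
temps < 30 | temps > 40         # | is element-wise OR
!(temps == 35)                  # ! negates

hot <- temps[temps > 30]        # a logical vector can select elements
any(temps > 40)                 # TRUE if at least one element qualifies
all(temps > 20)                 # TRUE only if every element qualifies
```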

Objects
Anything that can be assigned to a variable
Constant
Data structure
Function
Graph
…

Types of data
Classification: nominal
Sorting or ranking: ordinal
Measurement: interval, ratio

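Factors
Nominal and ordinal data are factors
Before R 4.0.0, strings were converted to factors by default when a data frame was created or read; since R 4.0.0 they stay as character unless stringsAsFactors=TRUE is set
Factors determine how data are analyzed and presented
Failure to realize that a column contains a factor can cause confusion
Use str() to find out a data frame's structure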
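A minimal sketch of nominal versus ordinal factors, using a made-up size variable. The level names and their ordering are assumptions for illustration.

```r
size <- c("small", "large", "medium", "small")

nominal <- factor(size)                       # levels sorted alphabetically
ordinal <- factor(size,
                  levels = c("small", "medium", "large"),
                  ordered = TRUE)             # explicit ranking

levels(nominal)
table(ordinal)              # counts reported in level order
ordinal[2] > ordinal[1]     # comparisons are defined for ordered factors

df <- data.frame(size = size, stringsAsFactors = TRUE)
str(df)                     # shows that size is stored as a factor
```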

Missing values
Missing values are indicated by NA (not available)
Arithmetic expressions and functions containing missing values generate missing values
Use the na.rm=T option to exclude missing values from calculations
    sum(c(1,NA,2))
    sum(c(1,NA,2),na.rm=T)

Missing values
You remove rows with missing values by using na.omit()
    gender <- c("m","f","f","f")
    age <- c(5,8,3,NA)
    df <- data.frame(gender,age)
    df2 <- na.omit(df)
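The same ideas in one short sketch, reusing the age vector from the slide: finding missing values, excluding them from a calculation, and dropping incomplete rows. The name complete is illustrative.

```r
age <- c(5, 8, 3, NA)

is.na(age)                 # flags the missing entry
n_missing <- sum(is.na(age))
mean(age)                  # NA, because one input is NA
known_mean <- mean(age, na.rm = TRUE)   # mean of the three known values

df <- data.frame(gender = c("m", "f", "f", "f"), age = age)
complete <- na.omit(df)    # drops the row whose age is NA
nrow(complete)
```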

Packages
R's base set of packages can be extended by installing additional packages
Over 4,000 packages
Search the R Project site to identify packages and functions
Install using RStudio
Packages must be installed prior to use and their use specified in a script
    library(packagename)

Packages
    # install ONCE on your computer
    # can also use RStudio to install
    install.packages("knitr")
    # library EVERY TIME before using a package in a session
    # loads the package to memory
    library(knitr)

Exercise
Install the package birk and use one of its functions to do the following conversions:
100°F to °C
100 meters to feet

Compile a notebook
A notebook is a report of an analysis
Interweaves R code and output
File > Compile Notebook …
Select html, pdf, or Word output
Install knitr before use
Install suggested packages

PDF

Reading a file
R can read a wide variety of input formats
Text
Statistical package formats (e.g., SAS)
DBMS

Reading a text file
Delimited text file, such as CSV
Creates a data frame
Specify as required: presence of header, separator, row names
    library(readr)
    # Read local
    url <- "~/Dropbox/Documents/Web sites/terry/data/centralparktemps.txt"
    t <- read_delim(url,delim=',')
It will not find this local file on your computer.
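Because the slide's file path only exists on the presenter's machine, here is a self-contained sketch that reads delimited text from an in-memory string instead. It swaps in base R's read.csv() for readr's read_delim(), and the three data rows are made up for illustration.

```r
# Made-up sample standing in for a small comma-delimited file
csv_text <- "year,month,temperature
1999,1,33.9
1999,2,38.7
2000,1,31.3"

# header: the first line holds column names; sep: the delimiter
temps <- read.csv(text = csv_text, header = TRUE, sep = ",")

str(temps)     # one row per observation, one column per field
```

With a real file, the text argument would be replaced by a path or URL as the first argument.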

Reading a text file
Read a file using a URL
    library(readr)
    # Read a file with a URL
    url <- '…'
    t <- read_delim(url,delim=',')

Learning about an object
    url <- "…"
    t <- read_delim(url, delim=',')
    head(t) # first few rows
    tail(t) # last few rows
    dim(t) # dimension
    str(t) # structure of a dataset
    class(t) # type of object
Click on the name of the file in the top-right window to see its content
Click on the blue icon of the file in the top-right window to see its structure

Referencing data
datasetName$columnName (data set, then column)
    url <- "…"
    t <- read_delim(url, delim=',')
    # qualify with tablename to reference fields
    mean(t$temperature)
    max(t$year)
    range(t$month)

Creating a new column
    library(birk)
    library(readr)
    url <- "…"
    t <- read_delim(url,delim=',')
    # compute Celsius; the unit names are quoted strings, because a bare F means FALSE in R
    t$Ctemp = round(conv_unit(t$temperature,'F','C'),1)
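A sketch of the same two ideas without any packages or remote files: referencing columns with $ and creating a column by assignment. The three temperature readings are made up, and the conversion uses plain arithmetic rather than birk.

```r
# Made-up Fahrenheit readings in a small data frame
t <- data.frame(year = c(1999, 1999, 2000),
                month = c(1, 7, 1),
                temperature = c(32, 77, 50))

avg <- mean(t$temperature)   # dataset$column picks out one column
range(t$month)               # smallest and largest month

# Assigning to a column name that does not exist yet creates the column
t$Ctemp <- round((t$temperature - 32) * 5 / 9, 1)
t
```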

External files & RStudio server
Upload a file
Download a file
More > Export …

Reshaping
Converting data from one format to another
Wide to narrow
Melt
Cast

Reshaping
    library(reshape)
    library(readr)
    url <- '…'
    # no column names and tab as delimiter
    s <- read_delim(url,col_names=F,delim='\t')
    head(s)
    colnames(s) <- c('year', 1:12)
    head(s)
    # melt (normalization)
    m <- melt(s,id='year')
    head(m)
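To show what melting does to the table, here is a sketch that performs the wide-to-narrow step by hand in base R rather than with reshape's melt(). The table s below is a made-up stand-in with only three month columns, and the output column names variable and value are chosen to mirror what melt() typically produces.

```r
# Made-up wide table: one row per year, one column per month (3 months shown)
s <- data.frame(year = c(2000, 2001),
                m1 = c(31.3, 33.6),
                m2 = c(37.3, 35.9),
                m3 = c(47.2, 39.6))
colnames(s) <- c("year", 1:3)

# Stack the month columns: each (year, month) pair becomes one row
months <- colnames(s)[-1]
long <- data.frame(
  year     = rep(s$year, times = length(months)),
  variable = rep(months, each = nrow(s)),
  value    = unlist(s[-1], use.names = FALSE)
)
long
```

Two rows of three months become six rows of one measurement each, which is the normalized form the slide refers to.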

Writing files
    library(birk)
    library(readr)
    url <- '…'
    t <- read_delim(url, delim=',')
    # compute Celsius and round to one decimal place
    t$Ctemp = round((t$temperature-32)*5/9,1)
    colnames(t)[3] <- 'Ftemp' # rename third column to indicate Fahrenheit
    write_csv(t,"centralparktempsCF.txt")
The file is stored in the project's folder

sqldf
An R package for using SQL with data frames
Returns a data frame
Supports MySQL

Subset
Selecting rows
Selecting columns
Selecting rows and columns
    library(sqldf)
    options(sqldf.driver = "SQLite") # to avoid a conflict with RMySQL
    trowSQL <- sqldf("select * from t where year = 1999")
    tcolSQL <- sqldf("select year, month, Ctemp from t")
    trowcolSQL <- sqldf("select year, month, Ctemp from t where year > 1989 and year < 2000")
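For comparison, the same three selections can be sketched in base R without SQL. The data frame below is made up, and the result names trow, tcol, and trowcol are illustrative.

```r
t <- data.frame(year = c(1985, 1995, 1999, 2005),
                month = c(1, 6, 12, 3),
                Ctemp = c(-1.2, 21.5, 3.4, 5.0))

trow    <- t[t$year == 1999, ]                    # rows only
tcol    <- t[, c("year", "Ctemp")]                # columns only
trowcol <- subset(t, year > 1989 & year < 2000,   # rows and columns
                  select = c(year, month, Ctemp))
```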

Sort
Sorting on column name
    sSQL <- sqldf("select * from t order by year desc, month")
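A base R sketch of the same sort (year descending, then month ascending) using order() on a made-up table. The name sorted is illustrative.

```r
t <- data.frame(year = c(1999, 2000, 1999, 2000),
                month = c(2, 1, 1, 2))

# order() returns row positions; the minus sign reverses a numeric key
sorted <- t[order(-t$year, t$month), ]
rownames(sorted) <- NULL
sorted
```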

Recoding
Some analyses might be facilitated by the recoding of data
Split a continuous measure into two categories
    t$Category <- 'Other'
    t$Category[t$Ftemp >= 30] <- 'Hot'
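Two other common ways to recode a continuous measure, sketched on a made-up vector: ifelse() for a two-way split and cut() for several bands. The band thresholds and labels are assumptions chosen only for illustration.

```r
Ftemp <- c(25, 30, 48, 71, 90)

# ifelse() recodes in one step: test, value if TRUE, value if FALSE
two_way <- ifelse(Ftemp >= 30, "Hot", "Other")

# cut() handles more than two bands; each interval includes its upper bound
bands <- cut(Ftemp, breaks = c(-Inf, 32, 70, Inf),
             labels = c("Freezing", "Mild", "Warm"))
table(bands)
```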

Deleting a column
    t$Category <- NULL

Exercise
Download the spreadsheet of monthly mean CO2 measurements (PPM) taken at the Mauna Loa Observatory from 1958 onwards
co2-data.html
Export a CSV file that contains three columns: year, month, and average CO2
Read the file into R
Recode missing values (-99.99) to NA
Plot year versus CO2
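The recoding step of this exercise can be sketched on a tiny illustrative table. The rows and the column name average are assumptions, not the real download.

```r
# Illustrative rows only; -99.99 marks a missing measurement
co2 <- data.frame(year = c(1958, 1958, 1958),
                  month = c(3, 4, 5),
                  average = c(315.71, -99.99, 317.51))

# Replace the sentinel value with NA so it is excluded from calculations
co2$average[co2$average == -99.99] <- NA
known_mean <- mean(co2$average, na.rm = TRUE)
```

When reading with base R, read.csv() also accepts an na.strings argument that can do this conversion at import time.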

Summarizing data
    library(sqldf)
    options(sqldf.driver = "SQLite") # to avoid a conflict with RMySQL
    url <- '…'
    t <- read_delim(url, delim=',')
    w <- sqldf("select year, avg(temperature) as mean from t group by year")
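The group-by query has a base R counterpart in aggregate(). This sketch uses a made-up table and is an alternative to the sqldf call, not a replacement prescribed by the slides.

```r
t <- data.frame(year = c(1999, 1999, 2000, 2000),
                temperature = c(30, 40, 32, 44))

# Formula interface: summarize temperature within each year
w <- aggregate(temperature ~ year, data = t, FUN = mean)
names(w)[2] <- "mean"
w
```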

Merging files
There must be a common column in both files
    library(sqldf)
    options(sqldf.driver = "SQLite") # to avoid a conflict with RMySQL
    url <- '…'
    t <- read_delim(url, delim=',')
    # average monthly temp for each year
    a <- sqldf("select year, avg(temperature) as mean from t group by year")
    # read yearly carbon data (source: Now/noaa-mauna-loa-co2-data.html)
    url <- '…'
    carbon <- read_delim(url, delim=',')
    m <- sqldf("select a.year, CO2, mean from a, carbon where a.year = carbon.year")
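The join on the common column can also be sketched with base R's merge(). The yearly mean temperatures below are made up; the CO2 values for 2000 to 2002 are taken from the co2 vector earlier in the slides.

```r
a <- data.frame(year = c(1999, 2000, 2001), mean = c(35, 38, 36))
carbon <- data.frame(year = c(2000, 2001, 2002),
                     CO2 = c(369.40, 371.07, 373.17))

# Inner join on the shared column; years present in only one table are dropped
m <- merge(a, carbon, by = "year")
m
```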

Correlation coefficient
    cor.test(m$mean,m$CO2)
Pearson's product-moment correlation
data: m$mean and m$CO2
t = , df = 51, p-value =
percent confidence interval:
sample estimates: cor
Significant
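To show how the pieces of cor.test() output can be read programmatically, here is a sketch on the cars dataset that ships with R, standing in for the temperature and CO2 data. The 0.05 threshold is a conventional assumption, not something stated on the slide.

```r
# cars ships with R: speed and stopping distance for 50 cars
result <- cor.test(cars$speed, cars$dist)

result$estimate    # Pearson correlation coefficient
result$parameter   # degrees of freedom, n - 2
result$p.value     # p-value for the null hypothesis of zero correlation
result$conf.int    # 95 percent confidence interval for the coefficient

significant <- result$p.value < 0.05
```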

Concatenating files
Taking a set of files with the same structure and creating a single file
Same type of data in corresponding columns
Files should be in the same directory

Concatenating files
Local directory
    library(readr)
    # read the file names from a local directory (pattern is a regular expression)
    filenames <- list.files("homeC-all/homeC-power", pattern="\\.csv$", full.names=TRUE)
    # append the files one after another
    for (i in seq_along(filenames)) {
      # Create the concatenated data frame using the first file
      if (i == 1) {
        cp <- read_delim(filenames[i], col_names=F, delim=',') # readr uses col_names, not header
      } else {
        temp <- read_delim(filenames[i], col_names=F, delim=',')
        cp <- rbind(cp, temp) # append to existing file
        rm(temp) # remove the temporary file
      }
    }
    colnames(cp) <- c('time','watts')
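A more compact variant of the same idea reads every file into a list and stacks the pieces in one call. The sketch below is self-contained: it first writes two small made-up files to a temporary directory (the directory name power_demo and the file contents are assumptions), and it uses base R's read.csv() in place of readr.

```r
# Write two small made-up files to a temporary directory
dir <- file.path(tempdir(), "power_demo")
dir.create(dir, showWarnings = FALSE)
write.table(data.frame(c(1, 2), c(100, 110)), file.path(dir, "a.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(data.frame(c(3, 4), c(120, 130)), file.path(dir, "b.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)

filenames <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file into a list, then stack the pieces in one call
pieces <- lapply(filenames, read.csv, header = FALSE)
cp <- do.call(rbind, pieces)
colnames(cp) <- c("time", "watts")
cp
```

Binding once at the end avoids growing the data frame inside a loop, which can matter when there are many files.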

Concatenating files
Remote directory with FTP
    # read the file names from a remote directory (FTP)
    library(RCurl)
    library(readr)
    url <- "…people.terry.uga.edu/rwatson/power/"
    dir <- getURL(url, dirlistonly = T)
    filenames <- unlist(strsplit(dir,"\n")) # split into filenames
    # append the files one after another
    for (i in seq_along(filenames)) {
      file <- paste(url,filenames[i],sep='') # concatenate for url
      if (i == 1) {
        cp <- read_delim(file, col_names=F, delim=',')
      } else {
        temp <- read_delim(file, col_names=F, delim=',')
        cp <- rbind(cp, temp) # append to existing file
        rm(temp) # remove the temporary file
      }
    }
    colnames(cp) <- c('time','kwh')
Takes a while to run

Database access
MySQL access
    library(DBI)
    conn <- dbConnect(RMySQL::MySQL(), "richardtwatson.com", dbname="Weather", user="db2", password="student")
    # Query the database and create file t for use with R
    t <- dbGetQuery(conn,"SELECT timestamp, airTemp from record;")
    head(t)

Exercise
Using the Atlanta weather database and the lubridate package
Compute the average temperature at 5 pm in August
Determine the maximum temperature for each day in August for each year

Resources
R books
Reference card
Quick-R

Key points
R is a platform for a wide variety of data analytics
Statistical analysis
Data visualization
HDFS and MapReduce
Text mining
Energy Informatics
R is a programming language
Much to learn