Introduction to R August 2016.


Presentation on theme: "Introduction to R August 2016."— Presentation transcript:

1 Introduction to R August 2016

2 What is R? R is:
  A programming language used for statistical computing and graphics.
  Open source and freely available under the GNU General Public License.
  Supported by the R Project for Statistical Computing.
Download the latest version of R via:
The use of R can be facilitated through the use of RStudio. Download RStudio via:

3 R Packages The capabilities of R are being continually expanded and improved through the creation of “packages”.
A package may be developed to:
  Develop capabilities previously unavailable
  Improve upon existing capabilities
  Focus upon the needs of a particular user group
Currently R packages
Explore the world of R packages: project.org/web/packages/available_packages_by_name.html
Examples of R packages:
  dplyr: manipulates data by taking subsets, summarizing, rearranging and joining data sets.
  tidyr: reformats layouts of data sets to make them more compatible with R.
  lubridate: simplifies working with dates and times.
  survival: used for survival analysis.
  babynames: US Baby Names
  WDI: used for downloading World Development Indicators data from the World Bank

4 Some Distinguishing Characteristics
R does not use GUIs; it is entirely command-line based. Coding can be simplified through the use of products such as RStudio.
Commands typically consist of a function with associated arguments that define the manner in which the function should be executed. For example, in this command to load a comma-delimited data file, read.table is the function and everything inside the parentheses is its arguments:
  read.table("2009education.csv", header=TRUE, sep=",", colClasses = c("character", rep("numeric",3)))
Functions are case-sensitive. For example, typing “xhwe” will not execute the XHWE function.

5 Installing Packages Most commonly used functions are part of “Base R”.
If you have installed R, you automatically have access to these functions. But what if… You need to identify the package that contains “babynames” (strictly a data set rather than a function). In this case, the package is also named “babynames”. Install the package and add it to the library.

6 Installing Packages To install package: Add package to library:
Use your function(s):
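The three steps above can be sketched as follows. This is a minimal example assuming the babynames package mentioned on the previous slide, an internet connection for the one-time install, and the CRAN cloud mirror as the download location (the mirror choice is an assumption, not something the slides specify).

```r
# Step 1: install the package once per machine (skipped if already installed).
# The repos URL is an assumed mirror; any CRAN mirror would do.
if (!requireNamespace("babynames", quietly = TRUE)) {
  install.packages("babynames", repos = "https://cloud.r-project.org")
}

# Step 2: add the package to the library for the current session.
library(babynames)

# Step 3: use what the package provides; here, peek at the babynames data set.
head(babynames)
```

Installation is needed only once, but library() must be called in every new R session that uses the package.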

7 Unsure about syntax? If unsure about the arguments or syntax for a function, type a question mark (“?”) followed by the function name (no arguments are needed). For example, if uncertain how to use the read.table function shown earlier:
  ?read.table
This will pull up a reference sheet detailing the proper use of the function.

8 Accessing Data Sets Data sets may be imported from the Internet or from a computer directory. R can import data in a wide variety of formats; some of the more common are: Excel, CSV, TXT, SPSS, Access, SQL Server, Minitab.
R also includes a set of standard data sets (for you to practice/play with). To list the available data sets, type the following command in R or RStudio:
  library(help="datasets")
A more complete description of many of these may be found on the following site:

9 Basic R Data Modes Numeric, Integer, Complex, Logical, Character, Factor
While other modes of data exist, they are far less common than those listed above.
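A quick way to see which of these applies to a value is class(). The sketch below uses made-up example values, one per mode listed above.

```r
# One made-up value for each of the six modes listed on the slide.
examples <- list(3.14, 7L, 2 + 3i, TRUE, "Denver",
                 factor(c("low", "high", "low")))

# class() reports how R treats each value.
modes <- sapply(examples, class)
print(modes)
```

Note the L suffix in 7L: without it, R stores a whole number as numeric rather than integer.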

10 Common R Objects
Name (dimensions): contents
  Vector (1): series of values; single data mode.
  Matrix (2): values stored in rows and columns; all values of the same data mode.
  List: allows multiple modes. Example values: Denver, TRUE.
  Data Frame: different columns may have different data modes. Example values: Topeka, FALSE.
In all of the above, integers are treated as numeric values.
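The four object types can be built with base R functions as sketched below. All values and the names city, flag, and count are hypothetical, chosen only to show how modes may or may not be mixed.

```r
# Vector: one dimension, a single data mode.
v <- c(1.5, 2.5, 4)

# Matrix: rows and columns, still a single data mode.
m <- matrix(1:6, nrow = 2, ncol = 3)

# List: elements may have different modes.
l <- list(city = "Denver", flag = TRUE, count = 3)

# Data frame: each column has one mode, but columns may differ.
df <- data.frame(city = c("Denver", "Topeka"),
                 flag = c(TRUE, FALSE),
                 stringsAsFactors = FALSE)

str(df)  # shows the mode of each column
```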

11 Examples of How to Specify Elements within Objects
For a one-dimensional object named “x”:
  x[n] indicates the nth element
  x[m:n] indicates the mth through the nth elements
  x[c(m, n)] indicates the mth and the nth elements
  x["name"] indicates the element named “name”
  x[x > k] indicates those elements that are greater than k
For a two-dimensional object named “x”:
  x[m, n] indicates the element in the mth row and the nth column
  x[m, ] indicates all of the mth row
  x[, n] indicates all of the nth column
  x[, j:n] indicates all of the jth through the nth columns
  x[, c(j, n)] indicates all of the jth and nth columns
  x["name", ] indicates all of the row named “name”
  x$name indicates all of the column named “name” (for data frames only)
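The indexing forms above can be tried on a small named vector and on the built-in mtcars data frame. The vector x and its element names are made up for illustration.

```r
# One-dimensional: a named numeric vector (made-up values).
x <- c(a = 10, b = 20, c = 30, d = 40)
x[2]          # the 2nd element
x[2:3]        # the 2nd through 3rd elements
x[c(1, 4)]    # the 1st and 4th elements
x["c"]        # the element named "c"
big <- x[x > 25]  # elements greater than 25
big

# Two-dimensional: the built-in mtcars data frame.
mtcars[1, 2]            # row 1, column 2
mtcars[1, ]             # all of row 1
mtcars[, 1:3]           # columns 1 through 3
mtcars["Fiat 128", ]    # the row named "Fiat 128"
hp <- mtcars$hp         # the column named "hp"
```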

12 Fire it up!!! (in other words, how to start R)
A couple of different options (via the Windows “Start” button): start “R” directly, or start “RStudio”, which provides a more user-friendly “front end” for using R.

13 Explore the Neighborhood
Where am I?
  getwd()        Get working directory
What else is here?
  list.files()   List files
  list.dirs()    List directories

14 Move to a New Neighborhood
Want to change your working directory from /udrive/faculty/rsippel/Jaguars to /udrive/faculty/rsippel/Tigers?
  setwd("/udrive/faculty/rsippel/Tigers")
or
  setwd("../Tigers")
setwd is the function (set working directory); the quoted path is its argument.
“..” is shorthand for the parent directory of the directory you are now in. Therefore, “../Tigers” means move up one level in the file structure (to the parent directory) and then move down to the “Tigers” directory.
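To try the relative-path form without touching real folders, the sketch below recreates the Jaguars and Tigers directories under R's temporary directory. The temporary location is an assumption made so the example runs anywhere; the directory names are those from the slide.

```r
old_wd <- getwd()                       # remember where we started

# Build a throwaway copy of the folder layout under tempdir().
base_dir <- file.path(tempdir(), "rsippel")
dir.create(file.path(base_dir, "Jaguars"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(base_dir, "Tigers"),  recursive = TRUE, showWarnings = FALSE)

setwd(file.path(base_dir, "Jaguars"))   # start in Jaguars
setwd("../Tigers")                      # up one level, then down into Tigers
moved_to <- basename(getwd())
print(moved_to)

setwd(old_wd)                           # return to the original directory
```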

15 Let’s get some data! To download data from the Internet:
  download.file("en.openei.org/doe- opendata/dataset/3e a146-49b5-978a- e699334d2e1f/resource/3f00482e-8ea0-4b a212b6322e74/download/iouzipcodes2011.csv", "electricity_rates.csv")
download.file is the function. Its first argument indicates the URL where the data is located; its second provides a filename for the downloaded data.
…and what if the files are in a zip file?
  unzip("electricity.zip")
unzip is the function. Its argument provides the pathname for the zip file from which the files will be extracted.

16 Importing Data into R Generic read.table function to read data from a table:
  rates <- read.table("electricity_rates.csv", header = TRUE, sep = ",", colClasses = c("character", "character", "factor", "factor", "factor", "factor", "numeric", "numeric", "numeric"), na.strings = "0")
rates: object being created
read.table: function; everything inside the parentheses is its arguments
"electricity_rates.csv": data source
header = TRUE: data has headers
sep = ",": values separated by commas
colClasses: specify mode corresponding to each column
na.strings = "0": treat “0” values as being not available
The colClasses argument may also be written:
  colClasses = c(rep("character", 2), rep("factor", 4), rep("numeric", 3))
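The effect of colClasses and na.strings is easier to see on a tiny file. The sketch below writes three made-up rows to a temporary CSV and reads them back; the zip codes and rates are invented, and only the column names state and res_rate echo the workshop data.

```r
# Write a small, made-up CSV to a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("zip,state,res_rate",
             "35218,AL,0.11",
             "32003,FL,0",
             "33101,FL,0.12"), csv_path)

# Read it with one mode per column and "0" treated as not available.
demo <- read.table(csv_path, header = TRUE, sep = ",",
                   colClasses = c("character", "factor", "numeric"),
                   na.strings = "0")

str(demo)   # zip is character, state is a factor, res_rate is numeric
demo        # the "0" rate in row 2 shows as NA
```

Reading zip codes as character rather than numeric keeps any leading zeros intact.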

17 Variants of read.table Variants of read.table for handling specific file formats:
  read.csv(): for reading CSV files; assumes comma-separated fields.
  read.csv2(): used for data formatted in countries that use commas as decimal points. Assumes use of semi-colon instead of comma to separate fields.
  read.delim()/read.delim2(): assumes delimited values; defaults to tab-delimited values.
All of the above variants assume the presence of headers in the data file being read.

18 Quick Peeks Getting a Feel for your Data
Does your data come with a “code book”? Code books define the contents of a data set, including parameter definitions, units, data precision, locations and dates of collection, etc.
How big is my data set?
  object.size(rates)
object.size() reports the memory used by an R object, so pass it the imported data frame (rates) rather than the quoted file name.

19 Quick Peeks Getting a Feel for your Data
Summarize the structure of an R object:
  str(rates)
Or perhaps you want only some more specific information?
  class(rates)
  names(rates)
  dim(rates)     “dim” = dimensions

20 Quick Peeks Getting a Feel for your Data
Look at the top or bottom lines of your data set:
  head(rates, 10)   Lists top lines; optional 2nd argument specifies number of lines
  tail(rates)       Lists bottom lines; if 2nd argument is not defined, defaults to 6 lines
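The same quick peeks can be tried on the built-in mtcars data frame, used here as a stand-in because the rates file may not be at hand.

```r
str(mtcars)             # structure: one line per column with its mode

class(mtcars)           # "data.frame"
names(mtcars)           # column names
size <- dim(mtcars)     # rows, then columns
size

top10 <- head(mtcars, 10)   # first 10 rows
bottom <- tail(mtcars)      # last 6 rows (the default)
```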

21 Quick Peeks Getting a Feel for your Data
What if you want to identify the highest (or lowest) value records? Start by ordering the contents by the parameter of interest.
  order_by_hp <- mtcars[order(mtcars$hp, decreasing = TRUE), ]
order_by_hp: place sorted contents in a new object
mtcars: object being sorted
order: function; everything inside its parentheses is its arguments
mtcars$hp: use this field as the basis for sorting
decreasing = TRUE: sort in descending order
Display the top lines:
  head(order_by_hp)

22 Quick Peeks Getting a Feel for your Data
What if you want to know if values in a data set meet a certain criterion? For example, do any of the cars in the mtcars data frame have a horsepower (“hp”) exceeding 200?
  any(mtcars$hp > 200)
…but do all of the cars exceed 200 horsepower?
  all(mtcars$hp > 200)
…of course, you knew the answer to this from the previous slide.
In both commands, any and all are the functions and the comparison inside the parentheses is the argument.
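Putting the sorting and the any/all checks together on mtcars gives a short, self-contained run:

```r
# Sort by horsepower, highest first, and look at the top rows.
order_by_hp <- mtcars[order(mtcars$hp, decreasing = TRUE), ]
head(order_by_hp[, c("mpg", "cyl", "hp")])

# Does at least one car exceed 200 hp? Do all of them?
any_over_200 <- any(mtcars$hp > 200)   # TRUE
all_over_200 <- all(mtcars$hp > 200)   # FALSE
```

The sorted listing already answers both questions: the top rows exceed 200 hp, while the bottom rows do not.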

23 Quick Peeks Getting a Feel for your Data
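Perhaps you want to know the unique values in a data field. For example, how many test subjects are in the dataset Theoph (which contains pharmacokinetics data for the drug Theophylline)?
  unique(Theoph$Subject)
unique is the function and Theoph$Subject is its argument. unique() lists the distinct values; wrapping it in length(), as in length(unique(Theoph$Subject)), returns the count.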

24 Uh-Oh That data’s not right!!
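What can you do if you discover some data values that you know are incorrect? One option, if you know what the values should be, is to replace the incorrect values with the corrected ones. For example, suppose that, in the Theoph dataset, the test subject whose weight was recorded as 86.4 actually weighs 84.6?
  Theoph$Wt <- as.numeric(sub("86.4", "84.6", Theoph$Wt, fixed = TRUE))
Theoph$Wt (left of the arrow): field to receive results; places the changed values back in the original data field
sub: function; everything inside its parentheses is its arguments
"86.4": existing value
"84.6": new value
Theoph$Wt (last argument): impacted field
sub() works on text and returns character values, so fixed = TRUE makes it match “86.4” literally and as.numeric() converts the result back to numbers.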
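An alternative to text substitution, not shown on the slide, is to pick out the incorrect values with a logical index and assign the correction directly, which keeps the column numeric throughout. The sketch works on a copy named th (a hypothetical name) so the built-in Theoph data set is left untouched.

```r
# Work on a plain data-frame copy of the built-in data set.
th <- as.data.frame(Theoph)

# Select the rows where the weight is 86.4 and overwrite them with 84.6.
th$Wt[th$Wt == 86.4] <- 84.6

is.numeric(th$Wt)        # the column is still numeric
sort(unique(th$Wt))      # 84.6 now appears; 86.4 does not
```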

25 Quick Statistics For a quick “grab bag” of popular statistics…
  summary(rates)
summary is the function and rates is its argument. The statistics generated will depend upon the mode of each field (characters, factors, numbers).
You may also generate a summary for a specific field (e.g. summary(rates$ind_rate)).

26 Quick Statistics How about some quantiles?
Basic quantile:
  quantile(rates$ind_rate, na.rm = TRUE)
rates$ind_rate: specify field
na.rm = TRUE: remove NA values
…or, you can specify your break points:
  quantile(rates$ind_rate, probs = seq(0, 1, 0.1), na.rm = TRUE)
seq(0, 1, 0.1): generates a sequence from 0 to 1 at intervals of 0.1

27 Quick Statistics Perhaps some boxplots for your amusement?
  boxplot(res_rate~state, data=rates, main="Residential Electricity Rates", xlab="State", ylab="Rate")
boxplot: function; everything inside the parentheses is its arguments
res_rate~state: plot residential rate as a function of state
data=rates: data source
main: main title
xlab: X-axis label
ylab: Y-axis label

28 Quick Statistics …and this is what the boxplot looks like.

29 Quick Statistics …or you can have a histogram.
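  hist(rates$ind_rate[!is.na(rates$ind_rate)])
hist is the function. The index inside the square brackets means: use this data where the data is not NA. is.na() is the reliable test for missing values; comparing against the text "NA" is not.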

30 Quick Statistics Other popular statistics.
Function: description
  mean(x): mean of object x
  sd(x): standard deviation of object x
  median(x): median of object x
  range(x): range of object x
  sum(x): sum of object x
  min(x): minimum
  max(x): maximum
Note: if data set includes NA values, use “na.rm = TRUE” argument to remove NA values from computations.
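The note about NA values matters in practice: a single NA makes most of these functions return NA. A small sketch with a made-up vector:

```r
# Made-up values with one missing entry.
x <- c(0.08, 0.12, NA, 0.10, 0.15)

with_na <- mean(x)                    # NA, because of the missing value
without_na <- mean(x, na.rm = TRUE)   # mean of the four known values

sd(x, na.rm = TRUE)
median(x, na.rm = TRUE)
rng <- range(x, na.rm = TRUE)         # smallest and largest known values
sum(x, na.rm = TRUE)
```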

31 Popular Mathematical Functions
All of the following functions may be applied to either single values or to vectors of values. For example, abs(-7.74) or abs(rates$comm_rate).
Function: description
  abs(x): absolute value(s)
  sqrt(x): square root(s)
  ceiling(x): rounds to nearest integer of greater value
  floor(x): rounds to nearest integer of lesser value
  trunc(x): truncates to integer value
  round(x, n): rounds to specified (n) number of places beyond the decimal point
  signif(x, n): rounds to specified (n) number of significant digits
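The differences among the rounding functions show up most clearly on a negative number. Using the slide's own example value, -7.74:

```r
v <- -7.74

abs(v)          # 7.74
ceiling(v)      # -7  (next integer up)
floor(v)        # -8  (next integer down)
trunc(v)        # -7  (drops the fractional part)
round(v, 1)     # -7.7
signif(v, 2)    # -7.7

# The same functions work element by element on vectors.
roots <- sqrt(c(4, 9, 16))   # 2 3 4
```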

32 Popular Mathematical Functions
Function: description
  sin(x): sine
  cos(x): cosine
  tan(x): tangent
  log(x): natural logarithm (base = e)
  log10(x): base-10 logarithm
  exp(x): e to the x-power
Note: the trigonometric functions (sine, cosine, tangent) assume units of radians.
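Because the trigonometric functions expect radians, an angle in degrees has to be converted first. A brief sketch, where the variable names are illustrative:

```r
# Convert 30 degrees to radians before taking the sine.
deg <- 30
sine_30 <- sin(deg * pi / 180)   # approximately 0.5

# log() and exp() undo each other; log10() uses base 10.
back <- log(exp(2))              # approximately 2
thousand <- log10(1000)          # 3
```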

33 Aggregation (no, not aggravation)
What if you wanted to calculate mean commercial electricity rates by state?
  aggregate(rates$comm_rate, by=list(rates$state), FUN=mean, na.rm=TRUE)
aggregate: function; everything inside the parentheses is its arguments
rates$comm_rate: aggregate the commercial rate data
by=list(rates$state): group the data by state
FUN=mean: calculate mean values
na.rm=TRUE: remove NA values when doing computations
Can also use other functions: median, sd, sum, min, max, etc.
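The same call pattern can be tried on the built-in mtcars data, used here as a stand-in for the rates data: mean miles per gallon grouped by number of cylinders.

```r
# Mean mpg for each cylinder count (4, 6, 8).
mpg_by_cyl <- aggregate(mtcars$mpg, by = list(mtcars$cyl),
                        FUN = mean, na.rm = TRUE)
mpg_by_cyl

# The result is a data frame: Group.1 holds the group, x holds the mean.
names(mpg_by_cyl)
```

The default column names Group.1 and x explain why the sorting command on the next slide refers to rates_agg$Group.1.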

34 Aggregation You can also group data using multiple fields.
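  rates_agg <- aggregate(rates$comm_rate, by=list(rates$state, rates$utility_name), FUN=median, na.rm=TRUE)
  rates_agg[order(rates_agg$Group.1, decreasing = FALSE), ]
rates_agg: new object
aggregate: function; everything inside the parentheses is its arguments
by=list(rates$state, rates$utility_name): group using state and utility name
The second command lists the result sorted by the first grouping field (state).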

35 Taking subsets of a data set
Suppose that you are only interested in the electricity rates for Florida.
  rates_FL <- subset(rates, state == "FL")
rates_FL: new object
subset: command; everything inside the parentheses is its arguments
rates: data set
state == "FL": criteria
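The same pattern on the built-in mtcars data, as a stand-in for the rates data, keeps only the four-cylinder cars. The object name four_cyl is illustrative.

```r
# Keep only the rows where cyl equals 4.
four_cyl <- subset(mtcars, cyl == 4)

nrow(four_cyl)          # number of rows kept
unique(four_cyl$cyl)    # only the value 4 remains
```

Note the double equals sign: == tests for equality, whereas a single = would be read as an argument assignment.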

36 Object Conversions Functions are often specific to only one type of object (vector, list, matrix, or data frame) or mode (e.g. numerical or character). However, functions are available to convert (coerce) types.
  as.data.frame(X, …): object “X” coerced into a Data Frame object
  as.list(X, …): object “X” coerced into a List object
  as.character(X, …): object “X” coerced to Character mode
  as.numeric(X, …): object “X” coerced to Numeric mode
In addition to specifying the object name, arguments may specify row names, column names, whether character vectors should be converted to factors, etc.
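A few small coercions illustrate the idea. All values below are made up.

```r
num <- as.numeric("3.14")                    # character to numeric
chr <- as.character(42)                      # numeric to character: "42"
lst <- as.list(1:3)                          # vector to a list of 3 elements
df  <- as.data.frame(matrix(1:4, nrow = 2))  # matrix to data frame

# Text that does not look like a number becomes NA (with a warning).
bad <- suppressWarnings(as.numeric("Denver"))
```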

37 Exporting R Data Suppose that you have placed data in a data frame (for example, the Florida subset “rates_FL” created earlier). How can you export that data from R? General purpose function: write.table()
  write.table(rates_FL, file = "FL_Rates.csv", sep = ",", row.names = FALSE, col.names = TRUE)
write.table: command; everything inside the parentheses is its arguments
rates_FL: object being exported
file: path name for new file
sep: how values will be separated
row.names = FALSE: do not include row names
col.names = TRUE: include column names
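A round trip is a simple way to check an export: write a small data frame, then read the file back. The sketch uses three rows of the built-in mtcars data and a temporary file; the file name example_export.csv is hypothetical.

```r
# A small data frame to export.
small <- head(mtcars[, c("mpg", "cyl", "hp")], 3)

# Write it as comma-separated text, without row names.
out_path <- file.path(tempdir(), "example_export.csv")
write.table(small, file = out_path, sep = ",",
            row.names = FALSE, col.names = TRUE)

# Read it back to confirm the contents survived.
back <- read.csv(out_path)
back
```

With row.names = FALSE the car names are not written, so include them as an ordinary column first if they need to be kept.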

38 Variants of write.table
Variants of write.table for handling specific file formats:
  write.csv(): for writing to CSV files; assumes comma-separated fields.
  write.csv2(): used for data formatted in countries that use commas as decimal points. Assumes use of semi-colon instead of comma to separate fields.
  write.xlsx(): for writing to an Excel spreadsheet (from the xlsx package).
  write.foreign(): for writing to SPSS, Stata, SAS (from the foreign package).

39 Appendix Reference Materials for Commands used in this Workshop

40 Function Package Reference Materials
Functions grouped by package:
  base: abs, all, any, as.character, as.data.frame (devel/library/base/html/as.data.frame.html), as.list, as.numeric, c, ceiling, class, cos, dim, exp
  stats: aggregate
  babynames: babynames
  graphics: boxplot
  utils: download.file (devel/library/utils/html/download.file.html)

41 Function Package Reference Materials
Functions grouped by package:
  base: floor, getwd, library, list.dirs, list.files, log, log10, max, mean, min, names
  utils: head, install.packages (devel/library/utils/html/install.packages.html)
  graphics: hist
  stats: median

42 Function Package Reference Materials
Functions grouped by package:
  base: order, range, rep, round, setwd, signif, sin
  utils: object.size, read.csv, read.csv2, read.delim, read.delim2, read.table
  stats: quantile, sd

43 Function Package Reference Materials
Functions grouped by package:
  base: sqrt, sub, subset, sum, summary, tan, trunc, unique
  utils: str, tail, unzip, write.csv, write.csv2, write.table
  foreign: write.foreign (devel/library/foreign/html/write.foreign.html)
  xlsx: write.xlsx

44 We would appreciate your feedback on this workshop

