Basics of R INSTRUCTOR: AMANDA MCGOUGH TUESDAY, MARCH 29, 2016.


1 Basics of R INSTRUCTOR: AMANDA MCGOUGH TUESDAY, MARCH 29, 2016

2 About LISA o LISA is the source for statistical collaboration and consulting services at VT, free of charge to all current students and faculty. o LISA provides three types of services: Collaboration, Walk-in Consulting, and Short Courses. o Collaboration: In-depth statistical advice from LISA collaborators, including meetings about your specific research questions. It is best to meet with LISA before collecting your data. Request a meeting here: http://www.lisa.stat.vt.edu/?q=collaboration o Walk-in Consulting: Answers to short research questions. Schedule available here: http://www.lisa.stat.vt.edu/?q=walk_in o Short Courses: Workshops on a variety of topics. Schedule available here: http://www.lisa.stat.vt.edu/?q=short_courses

3 Outline o What is R? o Why use R? o Download R and RStudio o Basic Commands in R o Getting Started using R Scripts o Prices Data Set o Variables in R o Exploratory Data Analysis o Plotting your Data o Practice on your own

4 What is R? o R is a well-developed, simple, and effective programming language that is free. Many scientists, statisticians, analysts, students and others use R for statistical analysis and data visualization. o Data analysis is done in R by writing code and using built-in scripts in the R language. The R environment is equipped with common methods and recent cutting-edge techniques.

5 Why use R? o R is FREE and open to use. o R provides a wide variety of statistical and graphical techniques. o R is an easy programming language to learn. It is more than just point and click. o If you have a question about R, GOOGLE it. There are plenty of resources online that can help to answer your question.

6 How to Download R for your computer? o Download R and RStudio o R can run on Unix, Windows, or Mac OS X operating systems. o The R software can be downloaded from CRAN here: http://cran.r-project.org/ o Once you have clicked on the version for your operating system, the download window should appear. o For Windows users, click on Install R for the first time and then download R 3.1.2 for Windows. o For Mac users, click on the package that fits your operating system.

7 How to Download RStudio for your computer? o After installing the R software, download RStudio, which provides an easy-to-use Graphical User Interface (GUI). o Download RStudio here: http://www.rstudio.com/ o Download RStudio Desktop. o Install RStudio.

8 What does RStudio look like?

9 Using R as a Calculator o The simplest thing that R can do is calculate basic arithmetic expressions. o In the console, type any arithmetic expressions in and then hit the Enter key. o Try typing in several arithmetic expressions yourself! 3*(2 + 2) 3^2 sqrt(2) o What can go wrong here?
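A minimal sketch of what these expressions evaluate to, with the results stored under illustrative names that are not part of the slides. The last expression shows one possible answer to "what can go wrong", namely leaving out the parentheses; that reading of the question is an assumption.

```r
# Basic arithmetic, as it could be typed at the console
a <- 3 * (2 + 2)   # parentheses are evaluated first
b <- 3^2           # exponentiation
r <- sqrt(2)       # square root

# One common slip: without parentheses, multiplication happens first
no_parens <- 3 * 2 + 2

print(c(a, b, r, no_parens))
```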

10 R Script Files o For your project, you may want to save the R commands that you use in your analysis. o You can open a new script with: File -> New File -> R Script o As you work on a script, you should add comments to it to remind yourself what the script is doing. A comment is added with the # sign. Comments are written to the right of the # sign and should show up in green. x <- sqrt(3) # x is the square root of 3 o Save your script file by clicking on the floppy disk icon or by using: File -> Save

11 Running your Script Once you have added text to your script you can run it in several ways: 1. One line at a time 2. Sections at a time (by selecting what you want) 3. The whole thing You can either click the Run button in RStudio, or use the shortcuts below: PC: Ctrl+Enter (Ctrl+R in the base R GUI) Mac: Command+Enter

12 Creating Variables o You can store values in variables by assigning a value to a name, using either the = or <- operator. > x = 5 > x <- 2 > y = x + 1 > w = abs(-5) o Storing variables is very helpful when you have a lot of code and you are referring back to it throughout your analysis. These stored variables will show up in the environment.
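The assignments on this slide can be run as a short script. This is only a sketch of the same four lines, with comments added to show what each name holds afterwards.

```r
x = 5          # assignment with =
x <- 2         # assignment with <- overwrites the earlier value of x
y = x + 1      # uses the current value of x
w = abs(-5)    # the result of a function call can be stored too

# Print the stored values; the same names appear in the Environment pane
print(c(x = x, y = y, w = w))
```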

13 Packages in R o Packages are collections of functions and data sets; installed packages are stored in libraries. o library() This command lists all of the packages installed on your computer. o sessionInfo() This command lists the packages that are loaded during your session. o You can also look at the Packages tab in the bottom-right panel and see which packages are loaded by seeing which boxes are checked. If you need to load a different package, click on the box next to its name.

14 Naming your objects o We have named several basic objects earlier. o How should you name your objects? Here are some tips to consider: 1. Use only upper- and lower-case letters, numbers, underscores (_), and periods (.). 2. Begin with a letter (upper or lower case) or a period. 3. R is case sensitive. 4. Do not use one of R's reserved words. The command help(reserved) will give you a list of these words.
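One way to check a candidate name against these rules is the base function make.names(), which rewrites an invalid name into a valid one; a name that comes back unchanged was already valid. The candidate names below are made up for illustration.

```r
# Hypothetical candidate names, chosen only to illustrate the rules
candidates <- c("my_var", "2cool", "if", ".hidden")

# make.names() returns a syntactically valid version of each name,
# so comparing with the input flags which candidates were valid as given
is_valid <- make.names(candidates) == candidates
names(is_valid) <- candidates

print(is_valid)
```

Here "2cool" fails because it starts with a number, and "if" fails because it is a reserved word.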

15 Vectors o A vector is an object that holds several data values of the same type, arranged in a particular order. To create vectors, we use the c() function. o Suppose we have data on whale beachings per year in Texas starting in 1990: 74 122 235 111 292 111 211 133 156 79 o Create a new object called whales written in vector form: whales <- c(74, 122, 235, 111, 292, 111, 211, 133, 156, 79)

16 Vectors cont’d o Different commands can be used with vectors, such as the following: whales + 1 whales^2 mean(whales) var(whales) exp(whales) length(whales) whales[3]
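A short sketch of a few of these commands applied to the whales vector. The names used to store the results are illustrative only.

```r
whales <- c(74, 122, 235, 111, 292, 111, 211, 133, 156, 79)

plus_one <- whales + 1    # arithmetic is applied to every element
avg      <- mean(whales)  # average number of beachings per year
n        <- length(whales) # number of values in the vector
third    <- whales[3]     # square brackets pick out one element

print(plus_one)
print(c(mean = avg, n = n, third = third))
```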

17 Vectors cont’d o You can also combine data vectors into one vector. Suppose we have: temp1 <- c(3, 3.76, -0.35) temp2 <- c(1, 2.5, -5) temp <- c(temp1, temp2) o A combined vector holds a single type: if the types differ, R converts every value to a common type (for example, numbers become character strings). So far, all of the vectors that we have created are numeric. One example of a character vector is: pets <- c('dog', 'cat', 'parrot', 'snake')
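To see how types behave when vectors are combined, the sketch below joins the two temperature vectors and then, as an extra illustration not on the slide, mixes numbers with a character string. R does not raise an error in the mixed case; it converts everything to character.

```r
temp1 <- c(3, 3.76, -0.35)
temp2 <- c(1, 2.5, -5)
temp  <- c(temp1, temp2)    # numeric + numeric stays numeric

mixed <- c(temp1, "dog")    # numeric + character: all values become character

print(class(temp))
print(class(mixed))
```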

18 Sequences and Repeating Values o The : operator creates a sequence that increments/decrements by 1: 1:5 5:1 o You can create a sequence of values by specifying its length or the step between values: seq(0,5,length=15) seq(1,10,by=2) o The rep() command repeats values or sets of values: rep(1,5) rep(c(1,2,3),5)
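A sketch of the sequence and repetition commands, with results stored under illustrative names. The full argument name length.out is used here in place of the abbreviated length.

```r
up      <- 1:5                          # counts up by 1
down    <- 5:1                          # counts down by 1
by_two  <- seq(1, 10, by = 2)           # fixed step between values
fifteen <- seq(0, 5, length.out = 15)   # fixed number of evenly spaced values
ones    <- rep(1, 5)                    # one value repeated five times
cycle   <- rep(c(1, 2, 3), 5)           # a set of values repeated five times

print(by_two)
print(cycle)
```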

19 Data Frames o A data frame stores a data table as a list of vectors of equal length, which are displayed vertically and arranged side by side as columns. o All of the values in the same column must be of the same type, but each column can hold different types of data. (e.g. pets, temperature, age, gender) o This helps us to store data sets with each column representing a variable and each row representing an observation. o First, we will work with some data sets available in R and later you can use your own dataset!
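As a small illustration of columns with different types, the sketch below builds a data frame from three vectors of equal length. The values and the names animals, age, and indoor are made up for this example.

```r
# Three vectors of the same length, each with its own type (made-up values)
pets   <- c("dog", "cat", "parrot")
age    <- c(4, 2, 7)
indoor <- c(TRUE, TRUE, FALSE)

# Each vector becomes a column; each position becomes a row
animals <- data.frame(pets, age, indoor, stringsAsFactors = FALSE)

shape <- dim(animals)   # rows first, then columns
print(animals)
print(shape)
```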

20 Importing .csv Files o In a CSV file, the data values are arranged with one observation per line. o Data values are separated by commas within each line. o You can import a CSV file using: read.csv('folder/filename.csv') o We will name our data prices, so we have prices <- read.csv('/Users/amandamcgough/Desktop/prices.csv') o BE AWARE: Make sure that your slashes are all forward slashes, as shown in the examples above. When you copy a file location on Windows, the computer may use backslashes instead of forward slashes. o Another option to import your data set is under Tools in R.
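Since prices.csv itself is not reproduced here, the sketch below writes a tiny stand-in file to a temporary location and reads it back with read.csv(). The three data rows are made-up values, and prices_demo and csv_path are illustrative names.

```r
# Write a small stand-in CSV file (made-up values, not the real prices data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("PRICE,SQFT,AGE,NE",
             "1500,1800,10,1",
             "1720,2100,5,0",
             "990,1200,32,0"),
           csv_path)

# The first line of the file supplies the column names
prices_demo <- read.csv(csv_path)

print(prices_demo)
```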

21 Prices Data Set (prices.csv) o The prices data set is a random sample of records of resales of homes from Feb 15 to Apr 30 in 1993 from the files maintained by the Albuquerque Board of Realtors. This type of data was collected by multiple listing agencies in many cities and is used by realtors as an information base. o Number of cases: 65 o Variable names: - PRICE = Selling price in hundreds of dollars - SQFT = Square feet of living space - AGE = Age of home in years - NE = Located in northeast sector of city (1) or not (0)

22 Investigating the Prices Data Set o Once the data set is stored in our environment, we can quickly view the data by clicking on prices over in the environment. o Each row corresponds to a particular house. o Each column represents a variable that was measured (price, sqft, age, ne) o Using bracket notation [row,column] we can find particular pieces of our data: prices[1,1] prices[10,] prices[,2] o Also, you can use dollar sign notation. This only works for the variable names. prices$SQFT

23 More Investigating o You can use the minus sign to exclude part of the data set: prices[-1] (excludes the first column) prices[-1,] (excludes the first row) o You can use sequences and vectors like before in the bracket notation: prices[1:5,] (returns the first five rows, i.e. the first five houses) prices[c(1,2,4,8),] (returns rows 1, 2, 4, and 8)
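The bracket, dollar sign, and minus sign notations can be tried on a small stand-in data frame. The four rows below are made-up values, not the real prices data, and prices_demo is an illustrative name.

```r
# Stand-in for the prices data (made-up values)
prices_demo <- data.frame(PRICE = c(1500, 1720, 990, 2100),
                          SQFT  = c(1800, 2100, 1200, 2600),
                          AGE   = c(10, 5, 32, 2),
                          NE    = c(1, 0, 0, 1))

first_cell   <- prices_demo[1, 1]          # row 1, column 1
second_row   <- prices_demo[2, ]           # all columns of row 2
sqft_bracket <- prices_demo[, 2]           # all rows of column 2
sqft_dollar  <- prices_demo$SQFT           # the same column, by name
no_first_col <- prices_demo[-1]            # every column except the first
no_first_row <- prices_demo[-1, ]          # every row except the first
some_rows    <- prices_demo[c(1, 2, 4), ]  # rows picked with a vector

print(some_rows)
```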

24 Other Commands for Data Sets o The function names(prices) gives the names of the variables inside the data set. o The function head(prices) gives the first six observations of the data set. o The function tail(prices) gives the last six observations of the data set. o The function dim(prices) gives the number of rows first (how many total observations) and the number of columns second (how many variables).

25 Variable Classifications o Every variable in a dataset has a class. The class describes the type of data the variable contains. o To determine the class of a variable use: class(dataset$variable) class(prices$SQFT) o To check all of the classes at the same time use: sapply(dataset, class) sapply(prices, class)
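A sketch of both commands on a small stand-in data frame. The values are made up, and the columns are deliberately created with three different classes so the output has something to show.

```r
# Stand-in data with three different column classes (made-up values)
prices_demo <- data.frame(PRICE = c(1500.5, 1720, 990),
                          SQFT  = c(1800L, 2100L, 1200L),  # L marks integers
                          NE    = factor(c(1, 0, 0)))

one_class <- class(prices_demo$SQFT)    # class of a single variable
classes   <- sapply(prices_demo, class)  # class of every variable at once

print(one_class)
print(classes)
```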

26 Types of Variables o There are five different types of variables that can be found in a data frame: 1) numeric: contains real numbers; can be positive or negative; with or without decimals; missing values are represented as NA 2) integer: can be positive or negative; NO decimals; if a fractional part is included then an integer variable is automatically converted to a numeric variable 3) factor: used for categorical data; values can either be character strings or numbers (representing categories) 4) Date and POSIXlt: contain dates in a special format 5) character: contains character strings; suitable for any data that does not belong in one of the other types of variables above

27 Changing the Type of a Variable o Change to factor: dataset$variable <- as.factor(dataset$variable) prices$NE <- as.factor(prices$NE) o Change to numeric: dataset$variable <- as.numeric(dataset$variable) prices$SQFT <- as.numeric(prices$SQFT) o Change to character: dataset$variable <- as.character(dataset$variable)
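One detail worth checking when converting back from a factor: as.numeric() applied directly to a factor returns the internal level codes, not the original numbers. The sketch below uses a made-up vector in the style of the NE variable; going through as.character() first is a common way to recover the original values.

```r
ne <- c(1, 0, 0, 1)               # made-up 0/1 values in the style of NE

ne_factor <- as.factor(ne)        # categories with levels "0" and "1"
codes     <- as.numeric(ne_factor)                # level codes: 1 for "0", 2 for "1"
original  <- as.numeric(as.character(ne_factor))  # the original 0/1 values

print(levels(ne_factor))
print(codes)
print(original)
```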

28 Exploratory Data Analysis o You can produce a summary for all of the variables in a dataset, or calculate statistics one at a time. o summary(prices) o mean(prices$SQFT) o SQFT <- prices$SQFT # store the column under a shorter name o mean(SQFT) o median(SQFT) o sd(SQFT)
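The same summary commands can be tried on a short stand-in vector. The five values below are made up and simply take the place of prices$SQFT.

```r
# Stand-in for prices$SQFT (made-up values)
SQFT <- c(1800, 2100, 1200, 2600, 1500)

avg    <- mean(SQFT)     # arithmetic mean
middle <- median(SQFT)   # middle value after sorting
spread <- sd(SQFT)       # sample standard deviation

print(summary(SQFT))     # minimum, quartiles, mean, and maximum in one call
print(c(mean = avg, median = middle, sd = spread))
```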

29 Plotting your Data o The first thing that you do when conducting a statistical analysis is PLOTTING YOUR DATA. o Plots help you display your data and results in a way that others can understand, and they allow you to spot features of the data like outliers and the shape of the distribution.

30 The Simplest Plot o The most basic plot shows a continuous variable against the observation index: plot(variable, type="p") plot(SQFT) o There are many options that you can modify for your plot. To change the title, axis labels, and more, go to Help on the right and type in plot. You can also type ?plot into the console.

31 Other Plots o A histogram helps to show the distribution of continuous variables. The data is divided into equal length intervals and the number of observations that fall into each interval is counted. hist(variable, breaks=10) hist(SQFT, breaks=10) o A scatter plot helps to show the relationship between two continuous variables. The order matters: in the formula y~x, the first variable is displayed on the y-axis and the second variable on the x-axis. plot(variable1~variable2, dataset) plot(PRICE~SQFT, prices, pch=8) o To connect the dots in a scatter plot, you can use type='l' in the command.

32 Other Plots o A boxplot can be used to display the summary statistics of a dataset. boxplot(variable) boxplot(SQFT) o Remember that you can add a title and axis labels to any plot. There is a variety of different options in order to make your plot look nice for a paper or presentation. o R makes it easy to export plots by clicking on the Export tab above your plot. Here you can save your image and add it to your paper or presentation!
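A sketch that draws a histogram, a scatter plot, and a boxplot with titles and axis labels added. The six rows are made-up stand-ins for the prices data, and the titles and labels are illustrative choices.

```r
# When run as a script rather than interactively, draw to a throwaway device
if (!interactive()) pdf(NULL)

# Stand-in for the prices data (made-up values)
prices_demo <- data.frame(PRICE = c(990, 1300, 1500, 1720, 2100, 2500),
                          SQFT  = c(1200, 1500, 1800, 2100, 2600, 3000))

# Histogram: breaks suggests roughly how many intervals to use
h <- hist(prices_demo$SQFT, breaks = 10,
          main = "Living space", xlab = "Square feet")

# Scatter plot: PRICE on the y-axis, SQFT on the x-axis
plot(PRICE ~ SQFT, data = prices_demo, pch = 8,
     main = "Price against living space",
     xlab = "Square feet", ylab = "Price (hundreds of dollars)")

# Boxplot: the returned object holds the five-number summary it draws
b <- boxplot(prices_demo$SQFT, main = "Living space", ylab = "Square feet")
```

Both hist() and boxplot() return the numbers behind the picture (h$counts and b$stats here), which can be handy for checking what a plot is showing.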

33 Practice – National Longitudinal Mortality Survey 1. Read the data file NLMS.csv into R. It may take a few minutes to load. 2. Delete column 14 of the data set and store this revised data set. Note: This will take a while to load again, so name it before you run it. 3. Determine the class of variables in the data set. 4. Obtain summary statistics for the variable povpct. 5. Create a histogram and boxplot for the variable povpct.

34 Questions o If you have any basic R questions, first try to GOOGLE it. o If you have no luck with that, then shoot me an email at: yamanda1@vt.edu o Make sure you signed in on the sign-in sheet and complete the survey that will be sent to you by email. THANK YOU!!!

