Download presentation
1
Technology: Statistical Analysis Software & R
Marketing Analytics Technology: Statistical Analysis Software & R R Basics Stephan Sorger Disclaimer: All images such as logos, photos, etc. used in this presentation are the property of their respective copyright owners and are used here for educational purposes only Some material adapted from: Sorger, “Marketing Analytics: Strategic Models and Metrics” Slide 1 Welcome to a special technology session for Marketing Analytics. This technology discusses statistical analysis software in general, and the statistical analysis software program called R in particular. © Stephan Sorger 2015: Marketing Analytics; R Basics: 1
2
Statistical Analysis Software: Introduction
Topic Definition Definition Software designed for in-depth analysis Unlike MS Excel (general purpose spreadsheet) Origins SAS conceived in 1966 by Anthony J. Barr Placed statistical procedures in formatted file framework Uses Advanced statistical techniques Nonlinear functions; Multiple regression; Conjoint Advantages Powerful; Accurate; Specific tools Disadvantages Command line interface; steep learning curve Very expensive Slide 2 Statistical analysis software is defined as software designed for in-depth data analysis. It differs from Microsoft Excel, which was designed as a general purpose spreadsheet. Excel was never designed as a robust analytics tool. For that, we turn to statistical analysis software. The category started in 1966 with the development of Statistical Analysis System, known as SAS, by Anthony J. Barr. SAS placed statistical procedures in a formatted file framework for improved execution of complex data analysis. Statistical analysis software can be used for a variety of advanced statistical techniques, such as nonlinear functions and others. The software has the advantage that it is powerful and accurate. On the other hand, it can be difficult to learn. Another disadvantage is that it is not always designed to be user-friendly, unlike desktop software and smartphone apps. © Stephan Sorger 2015: Marketing Analytics; R Basics: 2
3
Statistical Analysis Software: Major Suppliers
Criteria SAS SPSS R Market Fortune 500 Universities Universities Focus Power Ease of use Price User Power user Student Price-sensitive Origins Industry Education Open Source Learning Difficult Moderate Moderate Cost $86,600/yr+ $16,000/yr+ Free UI Command Line Point & Click Command Line Database 32,768 var. 1 file at a time Graphics SAS/Graph High quality Different packages Analogy Microsoft Apple Linux UCLA, Statistical Software Packages Comparison, ats.ucla.edu: MineQuest Business Analytics, “Cost of Licensing WPS 3.0 vs. SAS 9.3.” February 2013. Slide 3 We can examine a few of the major suppliers in the category. We look at SAS, SPSS, and R here. Many other programs exist. We start with SAS. In general, large organizations use SAS, because it is the market leader. Its focus is on the power user working in industry that faces demanding data problems. Its complexity means that it can be difficult to learn. Users pay for all that power, and it can be the most expensive of the three. SAS features a huge database capability, with some graphics capabilities. Its market-leading status and focus on business make it somewhat analogous to Microsoft. SPSS, on the other hand, is focused on universities. Typical users are students involved in education. The power is not as great as that of SAS, but the price is not as great as well. To the extent that it is an alternative to SAS, and that it focuses on education, SPSS is a bit analogous to Apple. R is the open source alternative, and is free to use within the parameters of the license. It is designed for programming-savvy analysts who are not afraid of the program’s steep learning curve. To the extent that it is open source and designed for hard-core programmers, the program is analogous to Linux. IBM SPSS Statistics website, “Buy IBM SPSS Statistics Now” © Stephan Sorger 2015: Marketing Analytics; R Basics: 3
4
R: Introduction Topic Description
Description Free statistical computing and graphics software package Widely used among statisticians and data miners Increased popularity in on History Started in 1993 as implementation of S programming language (1976) R developed by Ross Ihaka and Robert Gentleman “R” from Ross & Robert, as well as play on “S” Functions R includes many functions, which can be expanded through packages Data Can handle multiple simultaneous data sets, unlike Excel Data types: scalars, vectors, matrices, data frames, and lists Vectors: numerical, character, logical Commercial Revolution Analytics offers enterprise version ($) References: 1. Venables, W.N., Smith, D.M., “An Introduction to R.” Version May 16, 2013. Slide 4 R is a free, open source statistical computing and graphics software package. It is widely used, and has particularly grown in popularity since 2010. R started in 1993 as an implementation of the S programming language. It includes many functions, which can be expanded through sets of additional functionality known as packages. R can handle multiple simultaneous data sets easily, unlike Microsoft Excel. It can also handle multiple types of data types. From a commercial perspective, a company called Revolution Analytics, acquired by Microsoft in July 2015, offers an enterprise version of R. © Stephan Sorger 2015: Marketing Analytics; R Basics: 4
5
R: Basics Topic Description Commands Based on UNIX; case sensitive
Commands separated by “;” or by newline <CR> Comments #Hashtags to indicate comments Prompt > #system is waiting for you to type something Traditional version not menu-driven, unlike consumer software Arithmetic > 5 + 4 [1] #system returns the sum of 5 + 4, which is 9 Assignment (=) > x <- 3 # assign the number “3” to the object “x”; similar to “=“ sign Help 2 ways to get help; Example: Get help with “read.csv” command ?(read.csv) help(read.csv) Slide 5 The commands for R are based on UNIX, the multitasking, multiuser computer operating system derived from the original AT&T Unix, developed in the 1970s at Bell Labs. Commands are case sensitive. We can denote comments using hashtags. R uses the greater than symbol as its prompt. R also uses a command line interface, which means users type in commands, rather than selecting them from a graphical user interface such as a pull-down menu. R uses the assignment operator created by the less than sign and the dash symbol instead of the traditional equal sign. Users can get help by typing a question mark followed by the help topic in parentheses Or by typing the word help, followed by the help topic in parentheses. © Stephan Sorger 2015: Marketing Analytics; R Basics: 5
6
R: Basics Topic Description
Functions R features a rich set of functions c() : Function c Statistics functions: mean(x); median(x); range(x); etc. Arithmetic functions: 4^2; log (10); sqrt (16) Vector > x <- v(1, 2, 3) # assign v, a vector of numbers to the object x Matrix > y <- matrix(c(1, 2, 3, 4, 5, 6), 2, # create 2 x 3 matrix Print Ask R to print out numbers inside an object, such as a vector by printing it > print (x) # ask R to print out x > x # Or, you can just type the variable and hit return Plot Ask R to plot out lines based on a dataset by plotting the data > plot(data) Small subset R is a large, complex language. We cover only a small % in this class Slide 6 R features a rich set of functions, including many statistics-related functions. Users can define vectors and matrices using the v and matrix commands, respectively. To print out a dataset, type out the word print followed by the filename in parentheses. Or just type the dataset name and enter return. R can also plot out data using the plot command. Many many functions and commands are available in R. We cover only a small percentage of R’s functionality in this class. © Stephan Sorger 2015: Marketing Analytics; R Basics: 6
7
R: Getting Started Topic Description Download R Windows:
Mac: Launch R Double-click to launch Will see prompt in “R Console” > R Console > Slide 7 To get started, we can obtain R by downloading it from the Comprehensive R Archive Network, known as the CRAN. Both Windows and Macintosh versions are available. Upon program launch, R displays the R Console, shown in the image. © Stephan Sorger 2015: Marketing Analytics; R Basics: 7
8
R Editor and R Console: Layout
Topic Description R Console Traditional R interface; Command Line Interface R Editor Attempt at easier to use User Interface New Script To open Editor, click on Select File > New Script Arrange Editor window on left; Console on right Run Line Execute (run) line: Highlight line on R editor; Click on “Run Line” Open Script Save Script Run Line Return focus to Console Print RGui Icons Untitled—R Editor vector<-c(2,4,6,8) R Console > vector<-c(2,4,6,8) Slide 8 The R console on the right represents the traditional R interface. Other interfaces are available, such as R Editor and R Studio. To open the Editor, select File, and then new script. One can arrange the editor and console windows side by side When using the r editor, type the command in the r editor, then execute the Run Line command by selecting Run Line icon shown in the figure. To be clear, the R Editor is optional. Users can run R directly from the R console if they prefer. © Stephan Sorger 2015: Marketing Analytics; R Basics: 8
9
R Editor: Typical Usage
Topic Description Statistics Find statistics mean(vector) <RUN LINE> (mean) var(vector) <RUN LINE> (variance) sd(vector) <RUN LINE> (standard deviation) Run Line Select Run Line icon to move to R Console and execute Untitled—R Editor vector<-c(2,4,6,8) mean(vector) var(vector) sd(vector) R Console > vector<-c(2,4,6,8) > mean(vector) [1] 5 > var(vector) [1] > sd(vector) [1] Slide 9 This slide shows an example of how users can calculate statistics functions. Here, we want to calculate the mean, variance, and standard deviation of vector c. We first define vector c in the first line, and then invoke the commands for the functions. We type the commands in the R Editor, and then select the Run Line icon to execute them in the R console. © Stephan Sorger 2015: Marketing Analytics; R Basics: 9
10
Sample R Session: Regression Analysis
Topic Description 1. Preparation Remove introductory content; First line should be data headers Save Excel file as Comma Separated Values (CSV) 2. Directory Optional: Set up working directory for dataset; allows shorter filepaths Windows: See “Windows Explorer help” for more info Mac: See “Finder help” for more info 3. Filename Need complete filename Example: “C:\My Documents\Folder A\Filename.csv” Alternative 1: Right click to see filename Alternative 2: Find filename in Windows Explorer (Windows); Finder (Mac) Alternative 3: Drag csv file and drop into R Console; Will show filename Slide 10 In the next two slides, we summarize a small sample R session. We then follow up with an example session including actual screenshots. In our example, we wish to execute a regression analysis on some real estate data. First, we prepare the data by removing any introductory content. We save it in CSV format. Second, we can take the optional step of setting up a working directory for the dataset. Third, we need to find the complete filename of our dataset. Several methods exist. For Windows systems, we can right click to see the filename. We can also find the filename in Windows Explorer or in Mac Finder. Alternatively, we can drag and drop the csv file into the R console. The R console will automatically detect the filename and display it. © Stephan Sorger 2015: Marketing Analytics; R Basics: 10
11
Sample R Session: Regression Analysis
Topic Description 4. Read CSV data Datafile <- read.csv(“C:\\My Documents\\Desktop\\Filename.csv”, header=T) 5. Check data Print out dataset to ensure it was loaded correctly print(Datafile): will print out entire datafile; OK for small datasets str(Datafile): Shows structure of Datafile; “data.frame: 4 obs. of 4 variables” summary(Datafile): Shows summary: Min; Max; Mean; Median 6. Run regression lm: Regression analysis in R; stands for Linear Model lm(Dependent~Independent+Independent, Dataset) 7. Interpret Results Compare results obtained with R with those from Microsoft Excel Slide 11 In step 4, we read the csv data using the read.csv command. In our case, our data currently resides in our computer in csv form in a file called filename. We want to read our data into R. We select the name Datafile as the name for the place for R to put our data. The choice of name is arbitrary. We could have selected a different name instead, such as Sam or Javier. We read in the csv data into R, we use the read.csv command. We refer to the structure of computer programming language statements as syntax. In the syntax for the read.csv command, we show the datafile name at the left, followed by the assignment operator, followed by the command read.csv. To the right of the read.csv command, we place the entire filename between parentheses. We also add the term, header = T, which means that we are including the header row. The header row is the row of names representing the contents of each column. We will show a typical header in the next slide. In step 5, we want to ensure the data was loaded correctly. For small datasets, we can simply ask R to print out the entire dataset. For larger datasets, we invoke the str or summary commands to see the structure or summary, respectively, of the data. In step 6, we invoke R’s regression function, which R calls lm. The syntax includes the term lm on the left, and the dependent and independent variables, as well as the name of the dataset included within parentheses on the right. The lm function separates the dependent variable from the independent variables with a tilde. In step 7, we interpret the results and compare them with those obtained by Excel. We now show an actual R session with screenshots. © Stephan Sorger 2015: Marketing Analytics; R Basics: 11
12
Example: Causal Analysis Forecast for Real Estate
Step Description 1. Preparation Remove introductory information; First row = header row “Save As” CSV Slide 12 In step 1, we prepare the dataset. In our case, we want to prepare a dataset that consists of real estate information. The real estate information includes the price, the house size in thousands of square feet, and the lot size in thousands of square feet, for 18 homes sold in the San Francisco bay area town of Hillsborough in 2010. In the dataset, the first row should be the header of the dataset. If our dataset includes introductory information, such as date or a description of the dataset, remove it to make the header the first row. The header is defined as the names, or labels, identifying the data found in the columns. For example, in our case the header consists of three labels: Price, House, and Lot. We then save our dataset using the Save as command, selecting the CSV option. © Stephan Sorger 2015: Marketing Analytics; R Basics: 12
13
Example: Causal Analysis Forecast for Real Estate
Step Description 2. Directory Optional: Can set up working directory” In R, select File Change dir… then select where you want to put R files Slide 13 In the second step, we can set up a working directory. R will then look in that directory when searching for R files. Working directories can be especially helpful when working with Macintosh systems. © Stephan Sorger 2015: Marketing Analytics; R Basics: 13
14
Example: Causal Analysis Forecast for Real Estate
Step Description 3. Filename “C:\\Users\\user\\Desktop\\RealData.csv” Windows: Right-click on file to get file properties; will show full filename under “Location” Mac: Check Finder to find full filename OR: Drag file into R Slide 14 The third step is to find the complete filename. For Windows systems, right click on the file to get the file properties. For Macintosh systems, check the filename in Finder. Alternatively, users can drag and drop the file into the R console. © Stephan Sorger 2015: Marketing Analytics; R Basics: 14
15
Example: Causal Analysis Forecast for Real Estate
Step Description 4. Read Data Datafile <- read.csv(“C:\\Users\\user\\Desktop\\RealData.csv”, header=T) Alternative: Set up working directory Slide 15 In step 4, we read the data into R by invoking the read.csv command. In the syntax, we show the file into which we want to load the data on the left. In our case, we load the data into a datafile we call datafile. Next, type out the assignment operator. R uses a less than symbol and a dash symbol instead of an equal sign for the assignment operator. Next, we type the read.csv function, followed by the complete filename and the phrase header=T between parentheses, just as it is shown in the screenshot. Now, hit enter. To make a shorter filename, we can set up a working directory. © Stephan Sorger 2015: Marketing Analytics; R Basics: 15
16
Example: Causal Analysis Forecast for Real Estate
Step Description 5. Check Data print (Datafile) ; check if dataset looks OK For large datasets, ask R to provide summary data instead of printing out entire dataset Slide 16 In step 5, we check the data by asking R to print it out. We examine the data, and note that it appears to be the same as our original Excel dataset. Because the data looks good, we proceed to the next step. © Stephan Sorger 2015: Marketing Analytics; R Basics: 16
17
Example: Causal Analysis Forecast for Real Estate
Step Description 6. Run Regression lm(Dependent~Independent+Independent, Dataset) Dependent variable: Price; Independent variable: House; Lot Equation: Price = c1 + c2*(House Size) + c3*(Lot Size) RealRegression <- lm(Price ~ House + Lot, Datafile) Find tilde symbol “ ~ “ at upper left of keyboard, to left of number “1” Slide 17 In step 6, we run the regression analysis against the data. Earlier in the course, we discussed regression analysis. We discussed how the goal of regression analysis is to find the relationship between the dependent variable and one or more independent variables. The relationship is defined by an equation, where the dependent variable is equal to a constant plus the coefficient of an independent variable multiplied by the independent variable. We repeat the term coefficient times independent variable for as many independent variables as we have in the data. In our case, price will be equal to a constant, which we call c1, plus a coefficient we call c2 multiplied by the first independent variable, House Size, plus a coefficient we call c3 multiplied by the second independent variable, Lot Size. We use the lm function within R to execute the linear regression model to obtain the values for c1, c2, and c3. We invoke the lm command by typing lm, followed by the variables price, house and lot, as well as the datafile, enclosed by parentheses. When using the lm command, we add a tilde (squiggly line) after the dependent variable. We separate the independent variables with + symbols. In our case, we ask R to place the result in a dataset we call RealRegression. The name RealRegression is arbitrary. We could have called it Earl instead. We find out what is in the RealRegression dataset in the next slide. © Stephan Sorger 2015: Marketing Analytics; R Basics: 17
18
Example: Causal Analysis Forecast for Real Estate
Topic Description 7. Interpret Results Compare results from R with those from Excel Method Coefficient House Size Lot Size Excel R R results agree well with those of Excel Slide 18 In step 7, we interpret the results. The image in the slide shows a screenshot from R. The screenshot shows the formula we used for regression, as well as the 3 results we sought. Based on the data we gave it, R reports to us that the three results are: Intercept equals House coefficient equals and Lot intercept equals We compare those results with the results we obtained using Microsoft Excel, and we can see the results are virtually identical. Welcome to R! © Stephan Sorger 2015: Marketing Analytics; R Basics: 18
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.