MIS2502: Data Analytics Advanced Analytics Using R

Presentation transcript:

MIS2502: Data Analytics Advanced Analytics Using R Zhe (Joe) Deng deng@temple.edu http://community.mis.temple.edu/zdeng

The Information Architecture of an Organization Now we’re here… Data entry Transactional Database Data extraction Analytical Data Store Data analysis Stores real-time transactional data Stores historical transactional and summary data

What is Advanced Data Analytics/Mining? The examination of data or content using sophisticated techniques and tools, to discover deeper insights, make predictions, or generate recommendations. Goals: Extraction of implicit, previously unknown, and potentially useful information from data Exploration and analysis of large data sets to discover meaningful patterns Prediction of future events based on historical data

What data analytics/mining is not… How do sales compare in two different stores in the same state? Sales analysis Which product lines are the highest revenue producers this year? Profitability analysis Did salesperson X meet this quarter’s target? Sales force analysis If these aren’t data mining examples, then what are they?

Advanced data analytics/mining is about… Why do sales differ in two stores in the same state? Sales analysis Which product lines will be the highest revenue producers next year? Profitability analysis How likely is salesperson X to meet next quarter’s target? Sales force analysis

Example: Smarter Customer Retention Consider a marketing manager for a brokerage company Problem: High churn (customers leave) Customers get an average reward of $160 to open an account 40% of customers leave after the 6 month introductory period Giving incentives to everyone who might leave is expensive Getting a customer back after they leave is expensive

Answer: Not all customers have the same value One month before the end of the introductory period, predict which customers will leave Offer those customers something based on their future value Ignore the ones that are not predicted to churn

Three Analytics Tasks We Will Be Doing in this Class Classification (Decision Tree Approach) Clustering Analysis Association Rule Learning

Decision Trees (To Realize Classification) Used to classify data according to a pre-defined outcome Based on characteristics of that data Uses Predict whether a customer should receive a loan Flag a credit card charge as legitimate Determine whether an investment will pay off

Cluster Analysis Used to determine distinct groups of data Based on data across multiple dimensions Uses Customer segmentation Identifying patient care groups Performance of business sectors

Association Rule Learning Find out which events predict the occurrence of other events Often used to see which products are bought together Uses What products are bought together? Amazon’s recommendation engine Telephone calling patterns

Introduction to R and RStudio R has become one of the dominant languages for data analysis A large user community Thousands of third-party packages that contribute functionality Install both R and RStudio on your computer, following the installation instructions on our website.

http://www.kdnuggets.com/2015/05/poll-r-rapidminer-python-big-data-spark.html

R: Software development platform and language Open source, free Many, many, many statistical add-on “packages” that perform data analysis RStudio: Integrated Development Environment (IDE) for R Nicer interface that makes R easier to use Requires R to run After installing both, you only need to interact with RStudio Mostly, you do not need to touch R directly

Environment Panel Script Panel Utility Panel Console Panel

RStudio Interface Script Panel This is where the R code is shown and edited When you open an R code file, its content shows up here Console Panel This is where R code is executed. Results will show up here If there is an error in your code, the error message will also show up here Environment Panel This is where the variables and data are displayed It helps to keep track of the variables and data you have Utility Panel This window includes several tabs Files: shows the path to your current file, not often used Plots: if you use R to plot a graph, it will show up here Packages: install/import packages, more on this later Help: manuals and documentation for every R function, very useful

Creating and opening a .R file The R script is where you keep a record of your work in R/RStudio. To create a .R file Click “File|New File|R Script” in the menu To save the .R file click “File|Save” To open an existing .R file click “File|Open File” to browse for the .R file

The Basics: Calculation Variable & Value Function & Argument (Parameter) Basic Data Types: Numeric, Character, Logical Advanced Data Types: Vector, Data Frame Packages Loading data to R Working Directory

The Basics: Calculations In its simplest form, R can be used as a calculator: Type commands into the console and it will give you an answer
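As a small illustration of the calculator idea, the lines below can be typed at the console one at a time. The variable name total is only an illustrative choice and does not come from the slides.

```r
# Basic arithmetic typed at the console
2 + 3          # addition: 5
10 / 4         # division: 2.5
2^3            # exponentiation: 8
7 %% 3         # remainder (modulo): 1

# Parentheses control the order of operations
total <- (1 + 2) * 4
total          # 12
```

Each line prints its result in the Console Panel immediately after you press Enter.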

The Basics: Variable & Value Read from right to left as “Assign [value] 5 to [variable] x”. Conceptually, R asks the operating system for memory to hold the value 5 and then binds the name “x” to that stored value.

The Basics: Variable & Value Variables are named containers for data The assignment operator in R is: <- or = Variable names must start with a letter (or a dot that is not followed by a digit) and can contain letters, digits, dots, and underscores. They cannot start with a digit. Examples: result, x1, b2 (not 2b or 2) R is case-sensitive (i.e. Result is a different variable than result) <- and = do the same thing for simple assignments x, y, and z are variables that can be manipulated
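A minimal sketch of assignment and case sensitivity. The values and the names Result and result are chosen only for illustration.

```r
x <- 5          # assign the value 5 to the variable x
y = 2           # = also assigns when used at the top level
z <- x * y      # variables can be combined in expressions
z               # 10

# R is case-sensitive: these are two different variables
Result <- "upper"
result <- "lower"
identical(Result, result)   # FALSE
```

After running these lines, x, y, z, Result, and result all appear in the Environment Panel.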

The Basics: Function & Argument (Parameter) Function: rm(ARGUMENT). rm() here is a built-in function. You can also define your own functions. Some functions are used to return a value, such as AVG() in SQL. Others are used to complete an operation, such as rm(), which removes objects from the environment. A function can take no argument, a single argument, or multiple arguments.

The Basics: Function & Argument(Parameter) sqrt(), log(), abs(), and exp() are functions. Functions accept parameters (in parentheses) and return a value
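The sketch below shows built-in functions that return a value, a user-defined function, and rm() as a function that performs an operation. The function to_percent and its arguments are hypothetical and exist only for this example.

```r
# Built-in functions that return a value
sqrt(16)        # 4
abs(-3.5)       # 3.5
log(exp(1))     # 1 (log() is the natural logarithm by default)

# A user-defined function (hypothetical name and arguments)
to_percent <- function(part, whole = 100) {
  part / whole * 100
}
to_percent(25)      # one argument, default whole = 100: 25
to_percent(1, 4)    # two arguments: 25

# rm() performs an operation: it removes an object
x <- 5
rm(x)
exists("x", inherits = FALSE)   # FALSE, x no longer exists here
```

Arguments can be matched by position, as above, or by name, for example to_percent(part = 1, whole = 4).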

Simple statistics with R You can get descriptive statistics from a vector
> scores
[1] 65 75 80 88 82 99 100 100 50
> length(scores)
[1] 9
> min(scores)
[1] 50
> max(scores)
[1] 100
> mean(scores)
[1] 82.11111
> median(scores)
[1] 82
> sd(scores)
[1] 17.09857
> var(scores)
[1] 292.3611
> summary(scores)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  50.00   75.00   82.00   82.11   99.00  100.00
Again, length(), min(), max(), mean(), median(), sd(), var() and summary() are all functions. These functions accept vectors as parameters.

The Basics: Basic Data Types
Type               Range            Assign a Value
Numeric            Numbers          X <- 1, Y <- -2.5
Character          Text strings     name <- "Mark", color <- "red"
Logical (Boolean)  TRUE or FALSE    female <- TRUE
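To check which basic type a variable holds, class() can be used. This is a small sketch that reuses the example values from the slide.

```r
x <- 1
name <- "Mark"
female <- TRUE

class(x)        # "numeric"
class(name)     # "character"
class(female)   # "logical"

# Logical values come out of comparisons
x > 0           # TRUE
```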

The Basics: Advanced Data Types – Vector & Data Frame Vector: a combination of elements (i.e. numbers, words) of the same basic type, usually created using c(), seq(), or rep() Data frame: a table consisting of one or more vectors of equal length

Vector Examples c() and rep() are functions
> scores<-c(65,75,80,88,82,99,100,100,50)
> scores
[1] 65 75 80 88 82 99 100 100 50
> studentnum<-1:9
> studentnum
[1] 1 2 3 4 5 6 7 8 9
> ones<-rep(1,4)
> ones
[1] 1 1 1 1
> names<-c("Nikita","Dexter","Sherlock")
> names
[1] "Nikita" "Dexter" "Sherlock"

Indexing Vectors We use brackets [ ] to pick specific elements in the vector. In R, the index of the first element is 1
> scores
[1] 65 75 80 88 82 99 100 100 50
> scores[1]
[1] 65
> scores[2:3]
[1] 75 80
> scores[c(1,4)]
[1] 65 88
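A few more vector idioms, sketched with the same scores vector: seq() for regular sequences, negative indices to drop elements, and logical conditions inside the brackets. The names evens, without_first, and top are illustrative only.

```r
scores <- c(65, 75, 80, 88, 82, 99, 100, 100, 50)

# seq() builds a regular sequence
evens <- seq(2, 10, by = 2)      # 2 4 6 8 10

# A negative index drops that element
without_first <- scores[-1]      # everything except the first score

# A logical condition keeps only the matching elements
top <- scores[scores >= 90]      # 99 100 100

# which() returns the positions where a condition is TRUE
which(scores == max(scores))     # 7 8
```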

Data Frames A data frame is a type of variable used for storing data tables. It is a special type of list where every element of the list has the same length (i.e. a data frame is a “rectangular” list or table)
> BMI<-data.frame(
+ gender = c("Male","Male","Female"),
+ height = c(152,171.5,165),
+ weight = c(81,93,78),
+ Age = c(42,38,26)
+ )
> BMI
  gender height weight Age
1   Male  152.0     81  42
2   Male  171.5     93  38
3 Female  165.0     78  26
> nrow(BMI)
[1] 3
> ncol(BMI)
[1] 4

Identify elements of a data frame Retrieving cell values
> BMI[1,3]
[1] 81
> BMI[1,]
  gender height weight Age
1   Male    152     81  42
> BMI[,3]
[1] 81 93 78
More ways to retrieve columns as vectors
> BMI[[2]]
[1] 152.0 171.5 165.0
> BMI$height
> BMI[["height"]]
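Building on the BMI data frame, the sketch below adds a computed column and selects rows by condition. It assumes height is in centimeters and weight in kilograms, which the slides do not state. The names bmi, females, and older are illustrative.

```r
BMI <- data.frame(
  gender = c("Male", "Male", "Female"),
  height = c(152, 171.5, 165),
  weight = c(81, 93, 78),
  Age    = c(42, 38, 26)
)

# Add a new column computed from existing ones
# (assumption: height in cm, weight in kg)
BMI$bmi <- BMI$weight / (BMI$height / 100)^2

# Keep only the rows that satisfy a condition
females <- BMI[BMI$gender == "Female", ]

# Select rows by condition and columns by name at the same time
older <- BMI[BMI$Age > 30, c("gender", "Age")]

str(BMI)   # compact overview of columns and their types
```

The pattern is always BMI[rows, columns]; leaving one side empty means "all rows" or "all columns".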

The Basics: Packages Packages (add-ons) are collections of R functions and code in a well-defined format. To install a package: install.packages("psych") For every new R session (i.e., every time you re-open RStudio), you must load the package before it can be used: library(psych) or require(psych) Each package only needs to be installed once, but must be loaded in every new R session

Packages Downloads and installs the package (once per R installation)
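One common pattern, sketched here under the assumption that you want to avoid reinstalling a package that is already present, is to check for the package first and load it only if it is available. The variable names pkg and installed are illustrative.

```r
pkg <- "psych"

# requireNamespace() returns TRUE if the package is installed
installed <- requireNamespace(pkg, quietly = TRUE)

if (!installed) {
  # Installation needs an internet connection and is done once
  message("Package not installed yet; run install.packages(\"", pkg, "\") once.")
} else {
  # Loading is needed in every new R session
  library(pkg, character.only = TRUE)
}
```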

The Basics: Loading Data into R R can handle many kinds of data files. We will mostly deal with csv files Use the read.csv() function to import csv data You need to specify the path to the data file By default, a comma is used as the field delimiter and the first row is used as variable names You can keep the path simple by downloading the source file and the csv file into the same folder (e.g., C:\RFiles) and then setting that folder as the working directory (set the working directory to the source file location) Very Important!!!

Working directory The working directory is where RStudio will look first for scripts and files Keeping everything in a self-contained directory helps organize code and analyses Check your current working directory with getwd()

To change working directory Use the Session | Set Working Directory Menu If you already have an .R file open, you can select “Set Working Directory>To Source File Location”.
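The same thing can be done from code with getwd() and setwd(). The sketch below switches to R's temporary directory only so that it runs anywhere; in practice you would pass the folder that holds your script and data.

```r
old_wd <- getwd()      # remember the current working directory

setwd(tempdir())       # switch to another folder (here: a temp folder)
getwd()                # confirm the change
list.files()           # files R can now find by name alone

setwd(old_wd)          # switch back
```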

Loading data from a file Usually you won’t type in data manually, you’ll get it from a file Example: 2009 Baseball Statistics (http://www2.stetson.edu/~jrasp/data.htm) read.csv() reads data from a CSV file and creates a data frame, here called teamData, that stores the data table. Reference the HomeRuns column in the data frame using teamData$HomeRuns (names are case-sensitive)
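A self-contained sketch of the read.csv() workflow. Because the course data file is not reproduced here, the example first writes a tiny CSV with made-up values to a temporary file; the team names and numbers are placeholders, not the real 2009 statistics.

```r
# Create a small stand-in CSV file (made-up values)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Team,League,HomeRuns",
             "A,NL,150",
             "B,AL,180",
             "C,NL,165"),
           csv_path)

# Read the file; the first row becomes the column names
teamData <- read.csv(csv_path)

head(teamData)             # first rows of the data frame
teamData$HomeRuns          # one column as a vector
mean(teamData$HomeRuns)    # 165
```

With a real file in the working directory, the call reduces to read.csv() with just the file name in quotes.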

More On Loading Datasets Suppose you want to load a dataset called “MIS2502”. If the dataset is in an existing R package, load the package and type data(MIS2502) .RData format, type load("MIS2502.RData") .txt or other text formats, type read.table("MIS2502.txt") .csv format, type read.csv("MIS2502.csv") .dta (Stata) format, load the foreign library and type read.dta("MIS2502.dta") Remember “function & argument” in the first part To save objects into these formats, use the equivalent write.table(), write.csv(), etc. commands.
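A round-trip sketch of saving and reloading, using a small made-up data frame and files in R's temporary directory. The object name df and its contents are illustrative.

```r
df <- data.frame(id = 1:3, score = c(90, 85, 70))

# CSV: write.csv() and read.csv() are a matching pair
csv_file <- file.path(tempdir(), "MIS2502.csv")
write.csv(df, csv_file, row.names = FALSE)
df_csv <- read.csv(csv_file)

# RData: save() stores R objects, load() restores them by name
rdata_file <- file.path(tempdir(), "MIS2502.RData")
save(df, file = rdata_file)
rm(df)              # remove the object...
load(rdata_file)    # ...and bring it back from the file
```

Setting row.names = FALSE keeps write.csv() from adding an extra column of row numbers to the file.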

The Basics: Summary Calculation Variable & Value Function & Argument (Parameter) Basic Data Types: Numeric, Character, Logical Advanced Data Types: Vector, Data Frame Packages Loading data to R Working Directory

Analysis Examples Student t-Test: Compare means Histogram Plotting data

Analysis Example: [Student] t-Test Compare differences across groups: We want to know if National League (NL) teams scored more runs than American League (AL) teams, and if that difference is statistically significant To do this, we use the “psych” package install.packages("psych") downloads and installs the package (once per R installation)

t-Test: Compare Differences Across Groups describeBy(teamData$Runs, teamData$League) Variable of interest (Runs) Broken up by group (League) Results of t-test for differences in Runs by League
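A minimal sketch of the group comparison using only base R, so it runs without installing anything: aggregate() stands in for the per-group summary and t.test() performs the test. The Runs values below are made up for illustration and are not the real league data.

```r
# Made-up stand-in data: 5 teams per league
teamData <- data.frame(
  League = rep(c("AL", "NL"), each = 5),
  Runs   = c(780, 810, 745, 800, 765,
             720, 735, 760, 710, 750)
)

# Mean runs per league (base R alternative to a per-group summary)
aggregate(Runs ~ League, data = teamData, FUN = mean)

# Two-sample t-test: does mean Runs differ by League?
result <- t.test(Runs ~ League, data = teamData)
result            # full test output
result$p.value    # p-value on its own
```

The formula Runs ~ League reads as "Runs broken up by League". A small p-value (commonly below 0.05) suggests the difference in means is statistically significant.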

Analysis Example: Histogram hist(teamData$BattingAvg, xlab="Batting Average", main="Histogram: Batting Average") hist() first parameter – data values xlab parameter – label for x axis main parameter - sets title for chart

Analysis Example: Plotting data plot(teamData$BattingAvg,teamData$WinningPct, xlab="Batting Average", ylab="Winning Percentage", main="Do Teams With Better Batting Averages Win More?") plot() first parameter – x data values second parameter – y data values xlab parameter – label for x axis ylab parameter – label for y axis main parameter - sets title for chart
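Both charts can be tried without the course data file. The sketch below uses simulated values, so the numbers and the relationship between the two variables are assumptions made only to have something to plot.

```r
set.seed(42)   # make the simulated values reproducible

# Simulated stand-ins for the two columns (not real statistics)
batting_avg <- round(runif(30, min = 0.240, max = 0.290), 3)
winning_pct <- 0.5 + 2 * (batting_avg - 0.265) + rnorm(30, sd = 0.03)

# Histogram of one variable
h <- hist(batting_avg,
          xlab = "Batting Average",
          main = "Histogram: Batting Average")

# Scatter plot of two variables
plot(batting_avg, winning_pct,
     xlab = "Batting Average",
     ylab = "Winning Percentage",
     main = "Do Teams With Better Batting Averages Win More?")
```

In RStudio both charts appear in the Plots tab of the Utility Panel.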

Execute the script Use the Code | Run Region | Run All Menu Commands can be entered one at a time, but usually they are all put into a single file that can be saved and run over and over again.

Getting help
help.start()               general help
help(mean)                 help about function mean()
?mean                      same as help(mean)
example(mean)              show an example of function mean()
help.search("regression")  get help on a specific topic such as regression
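Two further base R helpers can be handy when you do not remember a function's exact name or arguments. This is a small supplementary sketch; the variable name found is illustrative.

```r
# args() shows the arguments a function accepts
args(sd)

# apropos() lists objects whose names match a pattern
found <- apropos("^read\\.csv")
found
```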

Online Tutorials If you’d like to know more about R, check these out: Quick-R (http://www.statmethods.net/index.html) R Tutorial (http://www.r-tutor.com/r-introduction) Learn R Programming (https://www.tutorialspoint.com/r/index.htm) Programming with R (https://swcarpentry.github.io/r-novice-inflammation/) There is also an interactive tutorial to learn R basics, highly recommended! (http://tryr.codeschool.com/)

Time for our 9th ICA!