Arko Barman with modification by C.F. Eick COSC 4335 Data Mining Spring 2015.

Slides:



Advertisements
Similar presentations
Summary Statistics/Simple Graphs in SAS/EXCEL/JMP.
Advertisements

R for Macroecology Aarhus University, Spring 2011.
PRE-SCHOOL QUANT WORKSHOP II R THROUGH EXCEL. NEW YORK TIMES INFOGRAPHICS GALARY The Jobless Rate for People Like You Home Prices in Selected Cities For.
Jack Davis Andrew Henrey FROM N00B TO PRO. PURPOSE Create a simulator from scratch that: Generates data from a variety of distributions Makes a response.
 Statistics package  Graphics package  Programming language  Can be used to share/reproduce analyses  Many new packages being created - can be downloaded.
Basics of Using R Xiao He 1. AGENDA 1.What is R? 2.Basic operations 3.Different types of data objects 4.Importing data 5.Basic data manipulation 2.
Lecture 2 LISAM. Statistical software.. LISAM What is LISAM? Social network for Creating personal pages Creating courses  Storing course materials (lectures,
SPSS 1: An Introduction to the Statistical Package SPSS Suzie Cro MRC Clinical Trials Unit.
How to Use the R Programming Language for Statistical Analyses Part I: An Introduction to R Jennifer Urbano Blackford, Ph.D. Department of Psychiatry Kennedy.
SPSS Statistical Package for the Social Sciences is a statistical analysis and data management software package. SPSS can take data from almost any type.
LISA Short Course Series Basics of R Lin Zhang Feb. 16, 2015 LISA: Basics of RFeb. 16, 2015.
Basic R Programming for Life Science Undergraduate Students Introductory Workshop (Session 1) 1.
MATLAB Lecture One Monday 4 July Matlab Melvyn Sim Department of Decision Sciences NUS Business School
LISA Short Course Series R Basics Ana Maria Ortega Villa Fall 2013 LISA: R BasicsFall 2013.
Carolina Environmental Program UNC Chapel Hill The Analysis Engine – A New Tool for Model Evaluation, Sensitivity and Uncertainty Analysis, and more… Alison.
An introduction to R: get familiar with R Guangxu Liu Bio7932.
732A44 Programming in R.  Self-studies of the course book  2 Lectures (1 in the beginning, 1 in the end)  Labs (computer). Compulsory submission of.
Hands-on Introduction to R. Outline R : A powerful Platform for Statistical Analysis Why bother learning R ? Data, data, data, I cannot make bricks without.
Introduction to SAS BIO 226 – Spring Outline Windows and common rules Getting the data –The PRINT and CONTENT Procedures Manipulating the data.
Intro to R R is a free version of S-plus R is a free version of S-plus Can be used interactively but script or syntax files are commonly used to record.
Introduction to R Part 2. Working Directory The working directory is where you are currently saving data in R. What is the current working directory?
Learning the TSP2: a guide for students at the 国際総合学類筑波大学 RUNNING REGRESSIONS FROM A SPREADSHEET FILE If you are using a network browser to view this program,
Data Objects in R Vector1 dimensionAll elements have the same data types Data types: numeric, character logic, factor Matrix2 dimensions Array2 or more.
Piotr Wolski Introduction to R. Topics What is R? Sample session How to install R? Minimum you have to know to work in R Data objects in R and how to.
What is MATLAB? MATLAB is one of a number of commercially available, sophisticated mathematical computation tools. Others include Maple Mathematica MathCad.
Using the R software R is an open source comprehensive statistical package, more and more used around the world. R project web site:
Hands-on Introduction to R. We live in oceans of data. Computers are essential to record and help analyse it. Competent scientists speak C/C++, Java,
Then click the box for Normal probability plot. In the box labeled Standardized Residual Plots, first click the checkbox for Histogram, Multiple Linear.
R Programming Yang, Yufei. Normal distribution.
R packages/libraries Data input/output Rachel Carroll Department of Public Health Sciences, MUSC Computing for Research I, Spring 2014.
Introduction to Programming in R Department of Statistical Sciences and Operations Research Computation Seminar Series Speaker: Edward Boone
1 Data Manipulation (with SQL) HRP223 – 2010 October 13, 2010 Copyright © Leland Stanford Junior University. All rights reserved. Warning: This.
An Introduction to R Statistical Computing AMS 597 Stony Brook University Spring 2009 By Tianyi Zhang.
Introduction to R Introductions What is R? RStudio Layout Summary Statistics Your First R Graph 17 September 2014 Sherubtse Training.
Lecture 20: Choosing the Right Tool for the Job. What is MATLAB? MATLAB is one of a number of commercially available, sophisticated mathematical computation.
Scientific Computing (w2) An Introduction to Scientific Computing workshop 2.
Scientific Computing (w1) R Computing Workshops An Introduction to Scientific Computing workshop 1.
Files: By the end of this class you should be able to: Prepare for EXAM 1. create an ASCII file describe the nature of an ASCII text Use and describe string.
DATA MINING Pandas. Python Data Analysis Library A library for data analysis of (mostly) tabular data Gives capabilities similar to Excel and SQL but.
1 Lecture 4 Post-Graduate Students Advanced Programming (Introduction to MATLAB) Code: ENG 505 Dr. Basheer M. Nasef Computers & Systems Dept.
Bioinformatics for biologists
1 FREE SAS SOFTWARE. 2 FREE SOFTWARE Free SAS ® software. SAS STUDIO; An interactive, online community. Superior training and documentation. And the analytical.
R objects  All R entities exist as objects  They can all be operated on as data  We will cover:  Vectors  Factors  Lists  Data frames  Tables 
R tutorial Stat 140 Linjuan Qian
Introduction to CADStat. CADStat and R R is a powerful and free statistical package [
Data & Graphing vectors data frames importing data contingency tables barplots 18 September 2014 Sherubtse Training.
1 Data Manipulation (with SQL) HRP223 – 2009 October 12, 2009 Copyright © Leland Stanford Junior University. All rights reserved. Warning: This.
Lecture 11 Introduction to R and Accessing USGS Data from Web Services Jeffery S. Horsburgh Hydroinformatics Fall 2013 This work was funded by National.
Descriptive Statistics using R. Summary Commands An essential starting point with any set of data is to get an overview of what you are dealing with You.
Before the class starts: 1) login to a computer 2) start RStudio 3) download Intro.R from MyCourses 4) open Intro.R in Rstudio 5) Download “R in Action”
Introduction to R and Data Science Tools in the Microsoft Stack Jamey Johnston.
1-2 What is the Matlab environment? How can you create vectors ? What does the colon : operator do? How does the use of the built-in linspace function.
MIS2502: Data Analytics Introduction to Advanced Analytics and R.
Review > x[-c(1,4,6)] > Y[1:3,2:8] > island.data fishData$weight[1] > fishData[fishData$weight < 20 & fishData$condition.
16BIT IITR Data Collection Module If you have not already done so, download and install R from download.
Working with data in R 2 Fish 552: Lecture 3. Recommended Reading An Introduction to R (R Development Core Team) –
Introduction to R and Data Science Tools in the Microsoft Stack Jamey Johnston.
Introduction to R Dr. Satish Nargundkar. What is R? R is a free software environment for statistical computing and graphics. It compiles and runs on a.
Arko Barman COSC 6335 Data Mining Fall  Free, open source statistical analysis software  Competitor to commercial softwares like MATLAB and SAS.
Programming in R Intro, data and programming structures
Introduction to R Samal Dharmarathna.
Arko Barman COSC 6335 Data Mining Fall 2014
Lab 1 Introductions to R Sean Potter.
JavaScript Charting Library
This is where R scripts will load
Amos Introduction In this tutorial, you will be briefly introduced to the student version of the SEM software known as Amos. You should download the current.
This is where R scripts will load
Data analysis with R and the tidyverse
Creating a dataset in R Instructor: Li, Han
Presentation transcript:

Arko Barman with modification by C.F. Eick COSC 4335 Data Mining Spring 2015

 Free, open source statistical analysis software  Competitor to commercial softwares like MATLAB and SAS – none of which are free  Need to install only basic functionalities  Install packages as and when needed – somewhat like Python  Excellent visualization features!

 Link for installation:  Choose a mirror link to download and install  Latest version is R  Choose default options for installation  Don’t worry about customizing startup options!  Just click ‘Okay’, ‘Next’ and ‘Finish’!

Command line

 R is an interpreter  Case sensitive  Comment lines start with #  Variable names can be alphanumeric, can contain ‘.’ and ‘_’  Variable names must start with a letter or ‘.’  Names starting with ‘.’ cannot have a digit as the second character  To use an additional package, use library( ) e.g. library(matrixStats)  Package you are trying to use must be installed before using

 Variable declaration not needed (like MATLAB)  Variables are also called objects  Commands are separated by ‘;’ or newline  If a command is not complete at the end of a newline, prompt shows ‘+’  Use command q() to quit  For help on a command, use help( ) or ?  For examples using a command, use example( )

 setwd( )  source( ) – runs the R script  sink( ) – saves outputs to  sink() – restores output to console  ls() – prints list of objects in workspace  rm(,, … ) – removes objects from workspace

 Writing scripts - use Notepad/Notepad++ and save with extension.r OR use Rcmdr package  Installing a package - Packages>Install package(s)… choose the package you want to install

assign("x", c(10.4, 5.6, 3.1, 6.4, 21.7)) x  c() concatenates vectors/numbers end to end x<- c(10.4, 5.6, 3.1, 6.4, 21.7) x y<-c(x,0,x) y v<-2*x+y+1 v min(v) max(y) range(x) cat(v)

v<-c(1:10) w<-1:10 x<-2*1:10 (Note operator precedence here) y<-seq(-4,7) y<-seq(-4,7,2)

x <- array(1:20, dim=c(4,5)) X i <- array(c(1:3,3:1), dim=c(3,2)) i x[i] <- 0 #(1,3), (2,2), (3,1 are set to 0) x xy <- x %o% y #(outer product of x and y) xy z <- aperm(xy, c(2,1)) (Permutations of an array xy) z

seq1<-seq(1:6) mat1<-matrix(seq1,2) mat1 mat2<-matrix(seq1,2,byrow=T) mat2  General use: matrix(data,nrow,ncol,byrow)  Notice that ncol is redundant and optional! Number of rows

seq1<-seq(1:6) mat1<-matrix(seq1,2) mat1 mat2<-matrix(seq1,2,byrow=T) mat2  General use: matrix(data,nrow,ncol,byrow)  Notice that ncol is redundant and optional! Number of rows

> x*c(3,4,0,1,2) [,1] [,2] [,3] [,4] [,5] [1,] [2,] [3,] [4,] > x [,1] [,2] [,3] [,4] [,5] [1,] [2,] [3,] [4,] > x%*%c(3,4,0,1,2) #Matrix Multiplication [,1] [1,] 70 [2,] 80 [3,] 90 [4,] 100 Number of rows

 Efficient way to store and manipulate tables e.g. v1 = c(2, 3, 5) v2 = c("aa", "bb", "cc") v3 = c(TRUE, FALSE, TRUE) df = data.frame(v1, v2, v3) df

 A vector of categorical data e.g. > data = c(1,2,2,3,1,2,3,3,1,2,3,3,1) > fdata = factor(data) > fdata [1] Levels: 1 2 3

mydata <- read.table("c:/mydata.csv", header=TRUE, sep=",", row.names="id") Path to the file Note the use of / instead of \ Whether or not to treat the first row as header Delimiter (default – White space) Row names (optional) mydata, header = TRUE, row.names = ) Commonly used for CSV files Skip today!

Download the file Notice that the data is in CSV format and there are no headers! mydata, header = FALSE)  You can also read xls files using read.xls which is a part of gdata package! install.packages(pkgs="gdata") library(gdata) data ) data Skip today!

 > mydata<-data.frame(a=1:5,b=2:6,c=c(1,1,1,3,3))  > mydata[c(-1,-2)]  c  1 1  2 1  3 1  4 3  5 3  > mydata  a b c     

 Selecting observations mySubset2<-mydata[1:5,] mySubset2  Selecting observations based on variable values mySubset3 2),] mySubset3  To avoid using mydata$ prefix in columns attach(mydata) mySubset3<-mydata[which(c==1,] mySubset3 detach(mydata) Be careful when using multiple datasets!!! Remember this if you use multiple datasets!!!

mySubset4<-mydata[sample(nrow(mydata), 3),] mySubset4 mydata1<-split(mydata,mydata$class) mydata1 Groups data of different classes together! mySubset5 0,select=c(a,c)) mySubset5 Most general form for manipulating data! Total number of observations Number of samples Good Manual on Data Import: > mydata1<- split(mydata,mydata$c) > mydata a b c > mydata1 $`1` a b c $`3` a b c

 merge() - used for merging two or more data frames into a single frame  cbind() – adds new column(s) to an existing data frame  rbind() – adds new row(s) to an existing data frame

 Different types of plots – helpful for data visualization and statistical analysis  Basic plot Download ‘sampledata2’ from TA webpage>Resources mydata,header=TRUE) attach(mydata) plot(x1,x2)

 abline() – adds one or more straight lines to a plot  lm() – function to fit linear regression model > x1<-c(1:5,1:3) > x2<-c(2,2,2,3,6,7,5,1) > abline(lm(x2~x1)) title('Regression of x2 on X1') > plot(x1,x2) > abline(lm(x2~x1)) > title('Regression of x2 on + x1') > lm(x1~x2) Coefficients: (Intercept) x > lm(x2~x1) Coefficients: (Intercept) x > abline(1,2)

 Boxplot boxplot(x2~x3,data=mydata,main="x2 versus x3", xlab="x3",ylab="x2")

 Histogram z = rnorm(1000, mean=0,sd=1) hist (z) hist (z, breaks = 5) bins = seq (-5, 5, by = 0.5) hist (z, breaks = bins) Generate 1000 samples of a standard normal random variable

 scatterplot() – for 2D scatterplots  scatterplot3d() – for 3D scatterplots  plot3d() – for interactive 3D plots (rgl package)  scatter3d() – another interactive 3d plot (Rcmdr package)  dotchart() – for dot plots  lines() – draws one or more lines  pie() – for drawing pie chart  pie3d() – for drawing 3D exploded pie charts (plotrix package)

Questions?