Introduction to R Basics * Based on R tutorial by Lorenza Bordoli.

Slides:



Advertisements
Similar presentations
Introduction to R Brody Sandel. Topics Approaching your analysis Basic structure of R Basic programming Plotting Spatial data.
Advertisements

R for Macroecology Aarhus University, Spring 2011.
MATLAB – What is it? Computing environment / programming language Tool for manipulating matrices Many applications, you just need to get some numbers in.
R tutorial Lorenza Bordoli R commands are in written in red.
 Statistics package  Graphics package  Programming language  Can be used to share/reproduce analyses  Many new packages being created - can be downloaded.
Introduction to Data Mining with XLMiner
Introduction to GTECH 201 Session 13. What is R? Statistics package A GNU project based on the S language Statistical environment Graphics package Programming.
R for Research Data Analysis using R Day1: Basic R Baburao Kamble University of Nebraska-Lincoln.
Classifier Decision Tree A decision tree classifies data by predicting the label for each record. The first element of the tree is the root node, representing.
An introduction to R Honors 207 Cognitive Science (These Slides were Shamelessly Stolen from Dr. Pablo Gomez, DePaul University)
Alternative text for elementary statistics –Elementary Concepts –Basic Statistics.
R – a brief introduction Johannes Freudenberg Cincinnati Children’s Hospital Medical Center
C++ for Engineers and Scientists Third Edition
Chapter 8 Arrays and Strings
How to Use the R Programming Language for Statistical Analyses Part I: An Introduction to R Jennifer Urbano Blackford, Ph.D. Department of Psychiatry Kennedy.
Introduction to Array The fundamental unit of data in any MATLAB program is the array. 1. An array is a collection of data values organized into rows and.
Introduction to R: The Basics Rosales de Veliz L., David S.L., McElhiney D., Price E., & Brooks G. Contributions from Ragan. M., Terzi. F., & Smith. E.
Statistical Software An introduction to Statistics Using R Instructed by Jinzhu Jia.
Applied Bioinformatics Introduction to Linux and R Bing Zhang Department of Biomedical Informatics Vanderbilt University
Mathcad Variable Names A string of characters (including numbers and some “special” characters (e.g. #, %, _, and a few more) Cannot start with a number.
1 An Introduction – UCF, Methods in Ecology, Fall 2008 An Introduction By Danny K. Hunt & Eric D. Stolen Getting Started with R (with speaker notes)
Introduction to MATLAB Session 1 Prepared By: Dina El Kholy Ahmed Dalal Statistics Course – Biomedical Department -year 3.
Chapter 7: Arrays. In this chapter, you will learn about: One-dimensional arrays Array initialization Declaring and processing two-dimensional arrays.
732A44 Programming in R.  Self-studies of the course book  2 Lectures (1 in the beginning, 1 in the end)  Labs (computer). Compulsory submission of.
Data, graphics, and programming in R 28.1, 30.1, Daily:10:00-12:45 & 13:45-16:30 EXCEPT WED 4 th 9:00-11:45 & 12:45-15:30 Teacher: Anna Kuparinen.
Lists in Python.
Introduction to to R Emily Kalah Gade University of Washington Credit to Kristin Siebel for development of much of this PowerPoint.
Sébastien Lê Agrocampus Rennes A very short introduction to “R” The “Rcmdr” package and its environment.
1 Lab of COMP 406 Teaching Assistant: Pei-Yuan Zhou Contact: Lab 1: 12 Sep., 2014 Introduction of Matlab (I)
Session 3: More features of R and the Central Limit Theorem Class web site: Statistics for Microarray Data Analysis.
Piotr Wolski Introduction to R. Topics What is R? Sample session How to install R? Minimum you have to know to work in R Data objects in R and how to.
Introduction to MATLAB Session 3 Simopekka Vänskä, THL Department of Mathematics and Statistics University of Helsinki 2011.
R Programming Yang, Yufei. Normal distribution.
C++ for Engineers and Scientists Second Edition Chapter 11 Arrays.
R packages/libraries Data input/output Rachel Carroll Department of Public Health Sciences, MUSC Computing for Research I, Spring 2014.
Chapter 1 – Matlab Overview EGR1302. Desktop Command window Current Directory window Command History window Tabs to toggle between Current Directory &
An Introduction to R Statistical Computing AMS 597 Stony Brook University Spring 2009 By Tianyi Zhang.
1Ellen L. Walker Category Recognition Associating information extracted from images with categories (classes) of objects Requires prior knowledge about.
STAT 534: Statistical Computing Hari Narayanan
Digital Image Processing Introduction to MATLAB. Background on MATLAB (Definition) MATLAB is a high-performance language for technical computing. The.
Introduction to MATLAB 1.Basic functions 2.Vectors, matrices, and arithmetic 3.Flow Constructs (Loops, If, etc) 4.Create M-files 5.Plotting.
© 2015 by Wade Rogers Introduction to R Cytomics Workshop December, 2015.
R objects  All R entities exist as objects  They can all be operated on as data  We will cover:  Vectors  Factors  Lists  Data frames  Tables 
Data & Graphing vectors data frames importing data contingency tables barplots 18 September 2014 Sherubtse Training.
Math 252: Math Modeling Eli Goldwyn Introduction to MATLAB.
Gist 2.3 John H. Phan MIBLab Summer Workshop June 28th, 2006.
Lecture 11 Introduction to R and Accessing USGS Data from Web Services Jeffery S. Horsburgh Hydroinformatics Fall 2013 This work was funded by National.
SUPERVISED AND UNSUPERVISED LEARNING Presentation by Ege Saygıner CENG 784.
An Introduction to Programming in Matlab Emily Blumenthal
Data Analysis with R. Many data mining methods are also supported in R core package or in R modules –Kmeans clustering: Kmeans() –Decision tree: rpart()
1 Introduction to R A Language and Environment for Statistical Computing, Graphics & Bioinformatics Introduction to R Lecture 3
Review > x[-c(1,4,6)] > Y[1:3,2:8] > island.data fishData$weight[1] > fishData[fishData$weight < 20 & fishData$condition.
16BIT IITR Data Collection Module If you have not already done so, download and install R from download.
Working with data in R 2 Fish 552: Lecture 3. Recommended Reading An Introduction to R (R Development Core Team) –
Introduction to R user-friendly and absolutely free
Sihua Peng, PhD Shanghai Ocean University
EMPA Statistical Analysis
Programming in R Intro, data and programming structures
R programming language
Introduction to R Samal Dharmarathna.
Sihua Peng, PhD Shanghai Ocean University
Introduction to R.
Naomi Altman Department of Statistics (Based on notes by J. Lee)
Lab 1 Introductions to R Sean Potter.
Use of Mathematics using Technology (Maltlab)
Sihua Peng, PhD Shanghai Ocean University
Installing Packages Introduction to R, Part II
CSCI N317 Computation for Scientific Applications Unit R
R Course 1st Lecture.
Data analysis with R and the tidyverse
Presentation transcript:

Introduction to R Basics * Based on R tutorial by Lorenza Bordoli

R-project background Origin and History –initially written by Ross Ihaka and Robert Gentleman at Dep. of Statistics of U of Auckland, New Zealand during 1990s. –International project since 1997 Open source with GPL license –Free to anyone –In actively development –

What R does R is a programming environment for statistical and data analysis computations. Core Package Statistical functions plotting and graphics Data handling and storage predefined data reader textual, regular expressions hashing Data analysis functions Programming support: loops, branching, subroutines Object Oriented More additional developed packages.

Basic Math operations R as a calculator –+, -, /, *, ^, log, exp, …

Variables Numeric Character String Logical

Assigning Values to Variables “<-” or “=“ Assign multiple values –Concatenate, c() –From stdin, scan() –Series : Seq()

NA: Missing Value Variables of each data type (numeric, character, logical) can also take the value NA: not available. NA is not the same as 0 NA is not the same as “” NA is not the same as FALSE Any operations (calculations, comparisons) that involve NA may or may not produce NA:

Basic Data Structure Vector –an ordered collection of data of the same type –a single number is the special case of a vector with 1 element. –Usually accessed by index Matrix –A rectangular table of data of the same type

Basic Data Structure List –an ordered collection of data of arbitrary types. –name-value pair –Accessible by name

Basic Data Structure Hash Table –In R, a hash table is the same as a workspace for variables, which is the same as an environment. –Store Key-value pairs. –Value can be accessed by key

Dataframes R handles data in objects known as dataframes; –rows: data items; –columns: values of the different attributes Values in each column should be from the same type.

Read Dataframes From File read.table() > worms<-read.table(“worms.txt",header=T,row.names=1) –Read tab-delimited file directly. –Variable name in header row cannot have space. To see the content of the dataframes (object) just type is name: > worms the first row contains the variables names the first column contains data label path: in double quotes

Selecting Data from Dataframes Subscripts within square brackets –[, means “all the rows” and –,] means “all the columns” To select the first three column of the dataframe

Selecting Data from Dataframes names() –Get a list of variables attached to the input name attach() –Make the variables accessible by name: > attach(worms)

Selecting Data from Dataframes Using logic expression while selecting:

subset rows by a logical vector subset a column comparison resulting in logical vector subset the selected rows Selecting Data From a Dataframe More examples:

Sorting Data in Data frames order() >worms[order(worms[,1]),1:6] Area Slope Vegetation Soil.pH Damp Worm.density Farm.Wood Scrub 5.1 TRUE 3 Rookery.Slope Grassland 5.0 TRUE 7 Observatory.Ridge Grassland 3.8 FALSE 0 The.Orchard Orchard 5.7 FALSE 9 Ashurst Arable 4.8 FALSE 4 Cheapside Scrub 4.7 TRUE 4 Rush.Meadow Meadow 4.9 TRUE 5 Nursery.Field Grassland 4.3 FALSE 2 (…) State columns to be sorted State the Area for sorting order

Sorting Data in Dataframes sorted in descending order More on sorting selected

Flow Control If … else loops if (logical expression) { statements } else { alternative statements } * else branch is optional for(i in 1:10) { print(i*i) } i=1 while(i<=10) { print(i*i) i=i+sqrt(i) }

Flow Control apply (arr, margin, fct ) –Applies the function fct along some dimensions of the vector/matrix arr, according to margin, and returns a vector or array of the appropriate size.

Flow Control lapply (list, fct) and sapply (list, fct) –To each element of the list li, the function fct is applied. The result is a list whose elements are the individual fct results. –Sapply, converting results into a vector or array of appropriate size

Create Statistical Summary Descriptive summary for numerical variables: –arithmetic mean; –maximum, minimum, median, 25 and 75 percentiles (first and third quartile); Levels of categorical variables are counted

Create Plots plot(…) –Create scatter plot. > plot(Area, Soil.pH) Automatically create a postscript file with default name

Other Common Plots Univariate: –histograms, –density curves, –Boxplots, quantile-quantile plots Bivariate: –scatter plots with trend lines, –side-by-side boxplots Several variables: –scatter plot matrices, lattice –3-dimensional plots, –heatmap

Saving your work history(Inf) –To review the command lines entered during the sessions savehistory(“history.txt”) –Save the history of command lines to a text file loadhistory(“history.txt”) –read it back into R save(list=ls(),file=“all.Rdata”) –The session as a whole can be saved as a binary file. load(“c:\\temp\\ all.Rdata”) –Read back saved sessions.

Importing and exporting data There are many ways to get data into R and out of R. Most programs (e.g. Excel), as well as humans, know how to deal with rectangular tables in the form of tab- delimited text files. > x = read.delim(“filename.txt”) also: read.table, read.csv > write.table(x, file=“x.txt”, sep=“\t”)

Getting help “?” Or “help” Details about a specific command whose name you know (input arguments, options, algorithm, results): e.g. >? t.test or >help(t.test)

Installing R packages CRAN Comprehensive R Archive Network Collection of numerous R packages To Install, use install.packages() Example: install.packages('ggplot2') To load the package, use library() Example: library(‘ggplot2’)

Data Mining with R

Data mining with R Many data mining methods are also supported in R core package or in R modules –Kmeans clustering: Kmeans() –Decision tree: rpart() in rpart library –Nearest Neighbour Knn() in class library –…

Additional Libraries and Packages Libraries –Comes with Package installation (Core or others) –library() shows a list of current installed –library must be loaded before use e.g. library(rpart) Packages –Developed code/libraries outside the core packages –Can be downloaded and installed separately Install.package(“name”) –There are currently 2561 packages at project.org/web/packages/ project.org/web/packages/ E.g. Rweka, interface to Weka.

Common Data Mining Methods Clustering analysis –Grouping data object into different bucket. –Common methods: Distance based clustering, e.g. k-means Density based clustering e.g. DBSCAN Hierarchical clustering e.g. Aggregative hierarchical clustering Classification –Assigning labels to each data object based on training data. –Common methods: Distance based classification: e.g. SVM Statistic based classification: e.g. Naïve Bayesian Rule based classification: e.g. Decision tree classification

Cluster Analysis Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups –Inter-cluster distance: maximized –Intra-cluster distance: minimized

An Example of k-means Clustering K=3 Examples are from Tan, Steinbach, Kumar Introduction to Data Mining

K-means clustering Example login1% more kmeans.R x<-read.csv("../data/cluster.csv",header=F) fit<-kmeans(x, 2) plot(x,pch=19,xlab=expression(x[1]), ylab=expression(x[2])) points(fit$centers,pch=19,col="blue",cex=2) points(x,col=fit$cluster,pch=19)

> fit K-means clustering with 2 clusters of sizes 49, 51 Cluster means: V1 V Clustering vector: [1] [38] [75] Within cluster sum of squares by cluster: [1] Available components: [1] "cluster" "centers" "withinss" "size" >

Classification Tasks

Support Vector Machine Classification A distance based classification method. The core idea is to find the best hyperplane to separate data from two classes. The class of a new object can be determined based on its distance from the hyperplane.

Binary Classification with Linear Separator Red and blue dots are representations of objects from two classes in the training data The line is a linear separator for the two classes The closets objects to the hyperplane is the support vectors. ρ

SVM Classification Example install.packages("e1071") library(e1071) train<- read.csv("sonar_train.csv",header=FALSE) y<-as.factor(train[,61]) x<-train[,1:60] fit<-svm(x,y) 1-sum(y==predict(fit,x))/length(y))

SVM Classification Example test<- read.csv("sonar_test.csv",header=FALSE) y_test<-as.factor(test[,61]) x_test<-test[,1:60] 1- sum(y_test==predict(fit,x_test))/length (y_test)

Reminder Start R sessions –ssh –sbatch job.Rstudio.training get exemplar code cp –R /work/00791/xwj/R-0915 ~/

Further references R –M. Crawley, Statistics An Introduction using R, Wiley –J. Verzani, SimpleR Using R for Introductory Statistics –Programming manual: Using R for data mining –Data Mining with R: Learning with case studies, Luis Togo Contact Info –Weijia Xu

End of Morning Session Get on the Maverick and start R sessions Basics of R –Variable types –Data structure –Flow controls Using R for data mining –Code examples.

Afternoon Agenda 11:30-1:00 Lunch Break –Hands on with R Try with exemplar code, Try your own code/data, 1:00-1:30 Scaling up R computations 1:30- 2:00 A walkthrough with parallel package in R 2:00- 3:00 Hands on Lab session 3:00- 4:00 Understand the performance of R program