Unit 3: R for Data Analysis
R Language
R is a programming language and software environment for statistical analysis, graphical representation, and reporting. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Core Team.
Windows Installation
Download the Windows installer for R (for example, R-3.2.2 for Windows, 32/64 bit) and save it in a local directory. The installer is an .exe file with a name of the form "R-version-win.exe". Double-click it and run it, accepting the default settings. On 32-bit Windows it installs the 32-bit version; on 64-bit Windows it installs both the 32-bit and 64-bit versions. After installation, the program can be started from "R\R-3.2.2\bin\i386\Rgui.exe" under the Windows Program Files directory. Clicking this icon opens R-GUI, the R console used for R programming.
Linux Installation
R is available as a binary for many versions of Linux at the location R Binaries. The installation instructions vary by distribution and are listed under each Linux version at that link. If you are in a hurry, you can install R with the yum command:
$ yum install R
This command installs the core functionality of R along with the standard packages.
Basic Syntax: First Hello World Program
> a<-" Hello World"
> print(a)
[1] " Hello World"
Comments
A single-line comment starts with # at the beginning of the statement:
# This is my First Program
R does not support multi-line comments.
Data Types
Programs use variables to store various kinds of information. Variables are reserved memory locations that store values, so creating a variable reserves some space in memory. In R, variables are not declared with a data type. Instead, a variable is assigned an R object, and the data type of that object becomes the data type of the variable. There are many types of R objects. The frequently used ones are:
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
Vector Object
The simplest of these objects is the vector. Atomic vectors come in six data types, also called the six classes of vectors. The other R objects are built upon the atomic vectors.
Logical
> v<-TRUE
> class(v)
[1] "logical"
Note that a quoted value such as "TRUE" is a character string, not a logical value.
Numeric
> v<-77.5
> class(v)
[1] "numeric"
Integer
> v<-4L
> class(v)
[1] "integer"
Complex
> v<-3.5-8i
> class(v)
[1] "complex"
Character
> v<-"yes"
> class(v)
[1] "character"
Single quotes work the same way:
> v<-'yes'
Raw
> v<-charToRaw("Hello")
> v
[1] 48 65 6c 6c 6f
> class(v)
[1] "raw"
To create a vector with more than one element, use the c() function, which combines the elements into a vector.
> apple<-c('red','green','yellow')
> apple
[1] "red"    "green"  "yellow"
> print(apple)
[1] "red"    "green"  "yellow"
> print(class(apple))
[1] "character"
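The six classes can be checked side by side. The following is a minimal sketch; the list name `examples` and its element names are illustrative and not part of the original slides. It also shows what c() does when the combined elements have different types.

```r
# One value of each atomic type, collected in a named list (names are illustrative).
examples <- list(
  logical   = TRUE,
  numeric   = 77.5,
  integer   = 4L,
  complex   = 3.5-8i,
  character = "yes",
  raw       = charToRaw("Hello")
)

# class() reports the class of each stored value.
classes <- sapply(examples, class)
print(classes)

# c() coerces mixed elements to a single common type, here character.
mixed <- c(1, "a", TRUE)
print(class(mixed))
```

Because a vector holds only one type, mixing numbers, text and logical values in c() yields a character vector. A list is the object to use when the elements must keep their own types.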
List
A list is an R object that can contain many different types of elements, such as vectors, functions and even another list.
> list1<-list(c(1,2,5),32.3,sin,"shweta")
> list1
[[1]]
[1] 1 2 5

[[2]]
[1] 32.3

[[3]]
function (x)  .Primitive("sin")

[[4]]
[1] "shweta"
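List elements can be retrieved by position or by name. The sketch below assumes a named version of the list above; the names `numbers`, `value`, `fn` and `name` are hypothetical and chosen only for illustration.

```r
# A list like list1, but with (hypothetical) element names.
items <- list(numbers = c(1, 2, 5), value = 32.3, fn = sin, name = "shweta")

first_element <- items[[1]]   # double brackets return the element itself
first_sublist <- items[1]     # single brackets return a list of length 1
who <- items$name             # $ accesses an element by name
zero <- items$fn(0)           # a stored function can be called directly

print(class(first_element))
print(class(first_sublist))
```

The difference between `[[ ]]` and `[ ]` matters in practice: the first gives the stored object, the second gives a smaller list that still wraps it.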
Questions
What are the different data types available in R?
What is the difference between the numeric and integer vector classes?
What is the difference between cat and print?
What is the significance of a list?
What are the different classes of vectors?
Matrix
Syntax: matrix(data,nrow,ncol,byrow,dimnames)
By default a matrix is filled column by column.
> M=matrix(c('a','a','a','b','b','b'),nrow=2,ncol=3)
> M
     [,1] [,2] [,3]
[1,] "a"  "a"  "b"
[2,] "a"  "b"  "b"
> M=matrix(c('a','a','a','b','b','b'),nrow=3,ncol=2)
> M
     [,1] [,2]
[1,] "a"  "b"
[2,] "a"  "b"
[3,] "a"  "b"
With byrow=TRUE the values are filled row by row.
> M=matrix(c(3:14),nrow=4,byrow=TRUE)
> M
     [,1] [,2] [,3]
[1,]    3    4    5
[2,]    6    7    8
[3,]    9   10   11
[4,]   12   13   14
> M=matrix(c(3:14),nrow=4,byrow=FALSE)
> M
     [,1] [,2] [,3]
[1,]    3    7   11
[2,]    4    8   12
[3,]    5    9   13
[4,]    6   10   14
To access the matrix values:
> M[1,3]
[1] 11
> M[,3]
[1] 11 12 13 14
> M[3,]
[1]  5  9 13
Rows and columns can be named with dimnames:
> rowname=c("r1","r2","r3","r4")
> colname=c("c1","c2","c3")
> M=matrix(c(3:14),nrow=4,byrow=TRUE,dimnames=list(rowname,colname))
> M
   c1 c2 c3
r1  3  4  5
r2  6  7  8
r3  9 10 11
r4 12 13 14
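Array
An array stores data in more than two dimensions.
Syntax: array(data,dim)
> v1=c(1,2,3)
> v2=c(4,5,6,7,8,9)
> A=array(c(v1,v2),dim=c(3,3,2))
> A
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Only nine values are supplied for 18 positions, so R recycles them and the second slice repeats the first.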
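Indexing works the same way for matrices and arrays, with one index per dimension. The sketch below reuses the values from the slides; the variable names `by_name`, `first_col`, `element` and `second_slice` are illustrative only.

```r
# Matrix with row and column names, as in the dimnames example.
M <- matrix(3:14, nrow = 4, byrow = TRUE,
            dimnames = list(c("r1", "r2", "r3", "r4"), c("c1", "c2", "c3")))
print(dim(M))                 # number of rows and columns
by_name <- M["r2", "c3"]      # same element as M[2, 3]
first_col <- M[, "c1"]        # whole column, returned as a named vector

# Array built from nine values recycled into two 3 x 3 slices.
A <- array(c(1, 2, 3, 4, 5, 6, 7, 8, 9), dim = c(3, 3, 2))
element <- A[2, 3, 1]         # row 2, column 3, slice 1
second_slice <- A[, , 2]      # the whole second matrix
print(second_slice)
```

Leaving an index empty, as in `M[, "c1"]` or `A[, , 2]`, selects everything along that dimension.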
Question
Create two 2-D arrays.
Assign variable1 as 4,5,6,7 (using the <- operator).
Assign variable2 as "My","Name","is","shweta" (using the = operator).
Assign variable3 as TRUE,1 (using the -> operator).
Display the output as: variable1 is ……… variable2 is ……… & variable3 is ………
The cursor should then move to a new line.
Find out the class of each variable.
Factor
Factors are data objects used to categorize data and store it as levels. They can store both strings and integers. They are useful for columns that have a limited number of unique values, such as "Male", "Female" or TRUE, FALSE. They are useful in data analysis for statistical modeling.
Variable Names
Variable Name          Validity   Reason
var_name2.             Valid      Has letters, numbers, dot and underscore
var_name%              Invalid    Has the character '%'. Only dot (.) and underscore are allowed.
2var_name              Invalid    Starts with a number
.var_name, var.name    Valid      Can start with a dot (.), but the dot should not be followed by a number
.2var_name             Invalid    The starting dot is followed by a number
_var_name              Invalid    Starts with _, which is not valid
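The rules in the table can be checked from R itself. One possible approach, sketched here as an assumption rather than as the method used in the slides, relies on base R's make.names(), which rewrites a string into a syntactically valid name: a candidate is valid when make.names() returns it unchanged. Note that this check also treats reserved words such as `if` as invalid.

```r
# Candidate names taken from the table above.
candidates <- c("var_name2.", "var_name%", "2var_name",
                ".var_name", "var.name", ".2var_name", "_var_name")

# A name is syntactically valid if make.names() leaves it unchanged.
valid <- make.names(candidates) == candidates

# Show each candidate next to the result of the check.
print(data.frame(name = candidates, valid = valid))
```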
Variable Assignment
> var.1<-c(4,5,7,9)
> var.2=c("Hello","we","r","learning R")
> c(TRUE,1)->var.3
> print(var.1)
[1] 4 5 7 9
> cat("var 1 is", var.1, "\n")
var 1 is 4 5 7 9
> cat("var 2 is", var.2, "\n")
var 2 is Hello we r learning R
> cat("var 3 is", var.3, "\n")
var 3 is 1 1
> class(var.1)
[1] "numeric"
> class(var.2)
[1] "character"
> class(var.3)
[1] "numeric"
In var.3 the logical TRUE is coerced to the number 1, so the vector is numeric.
To list all the variables currently available in the workspace, use the ls() function.
> print(ls())
[1] "a"     "apple" "list1" "v"     "var.1" "var.2" "var.3"
Operators: Arithmetic Operators
> a<-c(1,2,3)
> b<-c(4,5,6)
> a+b
[1] 5 7 9
> a-b
[1] -3 -3 -3
> a*b
[1]  4 10 18
> a/b
[1] 0.25 0.40 0.50
> a^b
[1]   1  32 729
> a%%b
[1] 1 2 3
> b%a
Error: unexpected input in "b%a"
> b%%a
[1] 0 1 0
The remainder operator is %%; a single % is not an operator, which is why b%a fails.
Relational Operators
> a>b
[1] FALSE FALSE FALSE
> a<b
[1] TRUE TRUE TRUE
> a==b
[1] FALSE FALSE FALSE
> a<=b
[1] TRUE TRUE TRUE
> a!=b
[1] TRUE TRUE TRUE
Miscellaneous Operators (: and %in%)
> c=5:18
> c
 [1]  5  6  7  8  9 10 11 12 13 14 15 16 17 18
> v1<-5
> v2<-16
> t=1:10
> print(v1%in%t)
[1] TRUE
> print(v2%in%t)
[1] FALSE
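The %in% operator is vectorized over its left-hand side, so several values can be tested at once and the result can be used to filter a vector. A small sketch, with `one_to_ten` and `candidates` as illustrative names (chosen to avoid reusing `c` and `t`, which are also names of built-in functions):

```r
one_to_ten <- 1:10                 # the colon operator builds the sequence 1, 2, ..., 10
candidates <- c(5, 16, 10, 0)

# One TRUE/FALSE per element of candidates.
found <- candidates %in% one_to_ten
print(found)

# Logical indexing keeps only the values that were found.
kept <- candidates[found]
print(kept)
```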
Decision Making (if-else)
> x<-40L
> if(is.integer(x)){
+ print("x is integer")}
[1] "x is integer"
> x<-c("what","is","R")
> if("R"%in%x){
+ print("R is in x")
+ }else{
+ print("R is not in x")}
[1] "R is in x"
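When more than two outcomes are possible, conditions can be chained with else if. The function `describe` below is a hypothetical helper written for this sketch, not something from the slides; it returns the value of whichever branch runs.

```r
# Classify a value by testing conditions in order; the first match wins.
describe <- function(x) {
  if (is.integer(x)) {
    "integer"
  } else if (is.numeric(x)) {
    "numeric but not integer"
  } else if (is.character(x)) {
    "character"
  } else {
    "something else"
  }
}

print(describe(40L))
print(describe(40))
print(describe("R"))
print(describe(TRUE))
```

The order of the tests matters here: is.numeric() is also TRUE for integers, so the integer test has to come first.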
Functions
Built-in functions for complex numbers:
> z<-3.5-8i
> Re(z)
[1] 3.5
> Im(z)
[1] -8
> Mod(z)
[1] 8.732125
> Conj(z)
[1] 3.5+8i
> is.complex(z)
[1] TRUE
> is.numeric(z)
[1] FALSE
> as.numeric(z)
[1] 3.5
Warning message:
imaginary parts discarded in coercion
> as.complex(z)
[1] 3.5-8i
Rounding and mathematical functions:
> floor(5.98)
[1] 5
> ceiling(5.9)
[1] 6
> ceiling(5.1)
[1] 6
> floor(5.18)
[1] 5
> trunc(5.4)
[1] 5
> trunc(-5.4)
[1] -5
> signif(12345678,6)
[1] 12345700
> signif(12345678,5)
[1] 12346000
> log(10)
[1] 2.302585
> sin(pi)
[1] 1.224647e-16
> pi
[1] 3.141593
> sin(pi/2)
[1] 1
Sequences and summaries:
> seq(3,8)
[1] 3 4 5 6 7 8
> 3:8
[1] 3 4 5 6 7 8
> mean(3:6)
[1] 4.5
> sum(4,5)
[1] 9
> sum(4:8)
[1] 30
User-defined functions:
> new<-function(a)
+ {for(i in 1:a){
+ b<-i^2
+ print(b)}}
> new(4)
[1] 1
[1] 4
[1] 9
[1] 16
> new<-function(a,b,c){
+ result<-(a*b+c)
+ print(result)
+ }
> new(2,3,2)
[1] 8
> new(a=3,b=5,c=2)
[1] 17
+ print(a)
+ print(b)
+ print(c)
Factor
> data<-c("East","West","East","North","North","East")
> print(is.factor(data))
[1] FALSE
> Factor_data=factor(data)
> Factor_data
[1] East  West  East  North North East
Levels: East North West
> is.factor(Factor_data)
[1] TRUE
We can generate factor levels by using the gl() function.
Syntax: gl(n,k,labels)
> v<-gl(4,4,labels=c("East","West","North","South"))
> v
 [1] East  East  East  East  West  West  West  West  North North North North
[13] South South South South
Levels: East West North South
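A factor stores each value as an integer code that points into its set of levels. The sketch below, using the same region data and illustrative variable names, shows the levels, the underlying codes, and a count per level.

```r
regions <- c("East", "West", "East", "North", "North", "East")
f <- factor(regions)

lv <- levels(f)          # distinct categories, sorted alphabetically by default
codes <- as.integer(f)   # position of each value within the levels
counts <- table(f)       # number of observations per level

print(lv)
print(codes)
print(counts)
```

This is why factors suit columns with few distinct values: the text of each category is stored once, and the data itself is a vector of small integers.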
Data Frame
Data frames are tabular data objects. Unlike a matrix, each column of a data frame can contain a different mode of data.
> empdata<-data.frame(
+ emp_id=c(1:5),
+ emp_name=c("Shweta","Sonal","Shipra","Manisha","Varsha"))
> empdata
  emp_id emp_name
1      1   Shweta
2      2    Sonal
3      3   Shipra
4      4  Manisha
5      5   Varsha
> height=c(132,166,123,145)
> weight=c(48,50,64,44)
> gender=c("Female","Male","Male","Female")
> data<-data.frame(height,weight,gender)
> data
  height weight gender
1    132     48 Female
2    166     50   Male
3    123     64   Male
4    145     44 Female
> is.factor(data$gender)
[1] TRUE
This result applies to R versions before 4.0.0, such as the R 3.2.2 used here, where data.frame() converts character columns to factors by default. From R 4.0.0 onward the default is stringsAsFactors=FALSE, so the same call returns FALSE unless stringsAsFactors=TRUE is passed.
Question: Data Frame
Create student data using the data frame data type with the following fields:
Student_Rollno
Student_Name
Student_Gender
Student_marksinDS
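Whether a text column such as a gender field is stored as a factor depends on the stringsAsFactors argument of data.frame(), whose default differs between R versions. Setting it explicitly, as in this minimal sketch with illustrative variable names, makes the result the same everywhere.

```r
gender <- c("Female", "Male", "Male", "Female")

# Keep the column as plain character strings.
df_chr <- data.frame(gender, stringsAsFactors = FALSE)

# Convert the column to a factor.
df_fct <- data.frame(gender, stringsAsFactors = TRUE)

print(is.factor(df_chr$gender))
print(is.factor(df_fct$gender))
print(levels(df_fct$gender))
```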
To check the structure of a data frame, use str(emp). To get a summary, use summary(emp). R is case-sensitive, so both function names must be written in lowercase.
> emp<-data.frame(emp_id=c(1:3),emp_name=c("Shweta","Gargi","Sumit"),emp_salary=c(300,200,400))
> emp
  emp_id emp_name emp_salary
1      1   Shweta        300
2      2    Gargi        200
3      3    Sumit        400
> str(emp)
'data.frame':	3 obs. of  3 variables:
 $ emp_id    : int  1 2 3
 $ emp_name  : Factor w/ 3 levels "Gargi","Shweta",..: 2 1 3
 $ emp_salary: num  300 200 400
To select rows and columns:
> emp[1:2,]
  emp_id emp_name emp_salary
1      1   Shweta        300
2      2    Gargi        200
> emp[2:3,]
  emp_id emp_name emp_salary
2      2    Gargi        200
3      3    Sumit        400
> emp[c(2,3),c(2,3)]
  emp_name emp_salary
2    Gargi        200
3    Sumit        400
To add a column:
> emp$emp_dept<-c("CSE","ECE","Medical")
> emp
  emp_id emp_name emp_salary emp_dept
1      1   Shweta        300      CSE
2      2    Gargi        200      ECE
3      3    Sumit        400  Medical
To add a row, create a data frame with the same columns and combine the two with rbind():
> emp.new<-data.frame(emp_id=34,emp_name="dddd",emp_salary=2222,emp_dept="dd")
> rbind(emp,emp.new)
  emp_id emp_name emp_salary emp_dept
1      1   Shweta        300      CSE
2      2    Gargi        200      ECE
3      3    Sumit        400  Medical
4     34     dddd       2222       dd
Function
> new<-function(a)
+ {for(i in 1:a){
+ b<-i^2
+ print(b)}}
> new(4)
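A function can also give its arguments default values, so that it can be called with or without them, and it can return its result instead of printing it. The function `combine` below is a hypothetical example written for this sketch; it computes the same a*b+c expression used earlier.

```r
# b and c have default values, so only a is required.
combine <- function(a, b = 1, c = 0) {
  result <- a * b + c
  return(result)          # hand the value back to the caller
}

all_given <- combine(2, 3, 2)              # positional arguments
defaults_used <- combine(5)                # b = 1 and c = 0 are filled in
by_name <- combine(c = 2, a = 3, b = 5)    # named arguments can come in any order

print(all_given)
print(defaults_used)
print(by_name)
```

Returning the value rather than printing it lets the caller store it or use it in a further calculation.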
Question
Create a function to display the multiplication table of any number:
with an argument
without an argument
Reading a CSV File
To get the current working directory:
> print(getwd())
Create a CSV file in that directory with the fields id, name, salary, dept. Fill in the data, separating the values with commas.
To read the data from the CSV file:
data<-read.csv("input.csv")
Example session:
> print(getwd())
[1] "C:/Users/SHWETA MONGIA/Documents"
> data<-read.csv("input.csv")
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
  incomplete final line found by readTableHeader on 'input.csv'
> data
  id name salary
1  1 Rick  40000
2  2 Gary 200000
3  3 Ryan 300000
The warning only means that the last line of input.csv does not end with a newline; the data is still read correctly.
> is.data.frame(data)
[1] TRUE
> ncol(data)
[1] 3
> nrow(data)
[1] 3
> sal<-max(data$salary)
> sal
[1] 300000
> subset(data, salary==max(salary))
  id name salary
3  3 Ryan 300000
> subset(data, salary>100000 & id>2)
  id name salary
3  3 Ryan 300000
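The whole workflow can be tried without creating a file by hand. This sketch writes the three sample rows to a temporary file (an assumption made so the example is self-contained; the slides use input.csv in the working directory), reads it back, and repeats the queries. The variable names are illustrative.

```r
# Write the sample data to a temporary CSV file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,salary",
             "1,Rick,40000",
             "2,Gary,200000",
             "3,Ryan,300000"), csv_path)

# read.csv() returns a data frame; the first line supplies the column names.
data <- read.csv(csv_path)

top_earner <- subset(data, salary == max(salary))    # row with the highest salary
filtered <- subset(data, salary > 100000 & id > 2)   # both conditions must hold

print(top_earner)
print(filtered)

# Remove the temporary file again.
unlink(csv_path)
```

Because writeLines() ends every line with a newline, reading this file does not trigger the "incomplete final line" warning.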