Data Management Module: Subset, Sort, and Format data


1 Data Management Module: Subset, Sort, and Format data
Programming in R
Data Management Module: Subset, Sort, and Format data

2 Data Management Module
Importing and Exporting
Inputting data directly into R
Creating, Adding and Dropping Variables
Assigning objects
Subsetting and Formatting
Working with SAS Files
Using SQL in R

3 Subset, Sort, and Format data
In this session, I will introduce these topics: subsetting the observations in a data frame, sorting a data frame, and formatting values in a data frame.

4 Data Management: Subset
There are different ways to subset a data frame, that is, to select rows based on some criterion. First, check the values of the variable:
    pricedata$region == 3
This line of R code returns a logical vector with one element per row: TRUE where the value of region is 3 and FALSE otherwise.
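To make the logical test concrete, here is a minimal sketch. The contents of pricedata are not given in the slides, so the small data frame below (its columns region, line, cost and all of its values) is a made-up stand-in for illustration only.

```r
# Hypothetical stand-in for pricedata; the values are invented for illustration.
pricedata <- data.frame(
  region = c(1, 3, 3, 2, 3, 1),
  line   = c(4, 4, 2, 4, 4, 1),
  cost   = c(120, 85, 60, 200, 150, 95)
)

# The comparison is evaluated row by row and yields one TRUE/FALSE per row.
is_region3 <- pricedata$region == 3
is_region3        # FALSE TRUE TRUE FALSE TRUE FALSE

# TRUE counts as 1, so sum() gives the number of matching rows.
sum(is_region3)   # 3
```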

5 Data Management: Subset
Then select all the rows of the data frame for which the criterion is true:
    pricedata[pricedata$region == 3, ]
This line of R code returns all rows of pricedata where region is 3. Leaving the position after the comma empty keeps every column. Using ‘<-’ we can assign the result to a new object:
    Newdata <- pricedata[pricedata$region == 3, ]
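One detail worth knowing about bracket subsetting is how it treats missing values. The sketch below uses a made-up pricedata (the values, including the NA, are assumptions for illustration) to show that a row whose test evaluates to NA comes back as an all-NA row, and that wrapping the test in which() is one way to avoid this.

```r
# Hypothetical data with one missing region code.
pricedata <- data.frame(
  region = c(1, 3, NA, 3),
  cost   = c(120, 85, 60, 150)
)

# The test is NA for the third row, and bracket indexing returns
# an all-NA row for it in addition to the two real matches.
Newdata <- pricedata[pricedata$region == 3, ]
nrow(Newdata)   # 3

# which() keeps only the positions where the test is TRUE.
clean <- pricedata[which(pricedata$region == 3), ]
nrow(clean)     # 2
```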

6 Data Management: Subset
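We can also use the function subset(data.frame, criteria):
    subset(pricedata, region == 3)
This line of R code selects the same rows as the bracket form described previously, except that rows where region is NA are dropped. Inside subset(), variables can be named without the pricedata$ prefix. Again, we can assign the result to a new object with
    newpricedata <- subset(pricedata, region == 3)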
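As a rough sketch of subset() in use, the example below compares it with the bracket form and also shows its select argument for keeping only some columns. The data frame is a made-up stand-in, since the slides do not show the real pricedata.

```r
# Hypothetical stand-in for pricedata; the values are invented for illustration.
pricedata <- data.frame(
  region = c(1, 3, 3, 2, 3, 1),
  line   = c(4, 4, 2, 4, 4, 1),
  cost   = c(120, 85, 60, 200, 150, 95)
)

# Rows where region is 3, written two ways.
newpricedata <- subset(pricedata, region == 3)
by_bracket   <- pricedata[pricedata$region == 3, ]
all.equal(newpricedata, by_bracket)

# select keeps only the named columns in the result.
region_cost <- subset(pricedata, region == 3, select = c(region, cost))
names(region_cost)   # "region" "cost"
```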

7 Data Management: Subset
We can also check multiple criteria using the and/or operators: & is “and” and | is “or”. We can select records where region == 3 and line == 4. Only rows that satisfy both conditions are returned, which is the intersection of the two sets of rows.
    subset(pricedata, region == 3 & line == 4)
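A small sketch of the “and” case follows, using an invented pricedata (all values are assumptions). It uses the single & because that operator works element by element on whole columns, whereas && is meant for single TRUE/FALSE values.

```r
# Hypothetical stand-in for pricedata; the values are invented for illustration.
pricedata <- data.frame(
  region = c(1, 3, 3, 2, 3, 1),
  line   = c(4, 4, 2, 4, 4, 1),
  cost   = c(120, 85, 60, 200, 150, 95)
)

# Keep only rows where both conditions hold.
both <- subset(pricedata, region == 3 & line == 4)
nrow(both)   # 2
both$cost    # 85 150
```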

8 Data Management: Subset
We can also select records where region is 3 or line is 4. This criterion returns at least as many records, since it does not require both conditions to be true at the same time. The result is the union of the two sets of rows.
    subset(pricedata, region == 3 | line == 4)
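To see how the “or” case compares with the “and” case, the sketch below counts the rows returned by each on a made-up pricedata (all values are assumptions for illustration).

```r
# Hypothetical stand-in for pricedata; the values are invented for illustration.
pricedata <- data.frame(
  region = c(1, 3, 3, 2, 3, 1),
  line   = c(4, 4, 2, 4, 4, 1),
  cost   = c(120, 85, 60, 200, 150, 95)
)

# Rows where at least one condition holds.
either <- subset(pricedata, region == 3 | line == 4)
# Rows where both conditions hold, for comparison.
both   <- subset(pricedata, region == 3 & line == 4)

nrow(either)   # 5
nrow(both)     # 2
```

Every row returned by the “and” version also appears in the “or” version, so the “or” result can never be smaller.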

9 Data Management: Sorting
If you are new to R, the natural function to consider is sort(). However, sort() only sorts a single vector. To sort a data frame, the function to use is order(), which returns the row positions in sorted order. The function is order(variable(s), decreasing=FALSE). To sort the data frame, place the call in the row index:
    data.frame[order(variables), ]
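The difference between sort() and order() is easiest to see on a single vector. The cost values below are made up for illustration.

```r
# Hypothetical cost values.
cost <- c(120, 85, 60, 200, 150, 95)

# sort() returns the values themselves in increasing order.
sort(cost)                       # 60 85 95 120 150 200

# order() returns the positions that would put the values in order.
order(cost)                      # 3 2 6 1 5 4

# Indexing by those positions reproduces the sorted vector.
cost[order(cost)]                # 60 85 95 120 150 200

# decreasing = TRUE reverses the direction.
order(cost, decreasing = TRUE)   # 4 5 1 6 2 3
```

Because order() returns positions rather than values, the same positions can be used to rearrange every column of a data frame at once.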

10 Data Management: Sorting
If we want to sort the price data by the cost of the devices, we would use
    pricedata[order(pricedata$cost), ]
We can also sort by multiple variables. Rows are ordered by the first variable, and ties are broken by the next:
    pricedata[order(pricedata$region, pricedata$cost), ]
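Here is a short sketch of sorting a whole data frame, again on a made-up pricedata (all values are assumptions). The last call negates cost to sort it in descending order within each region, which is a common idiom for numeric columns and not something the slides themselves cover.

```r
# Hypothetical stand-in for pricedata; the values are invented for illustration.
pricedata <- data.frame(
  region = c(1, 3, 3, 2, 3, 1),
  line   = c(4, 4, 2, 4, 4, 1),
  cost   = c(120, 85, 60, 200, 150, 95)
)

# Sort by cost, lowest first.
by_cost <- pricedata[order(pricedata$cost), ]
by_cost$cost          # 60 85 95 120 150 200

# Sort by region, then by cost within each region.
by_region_cost <- pricedata[order(pricedata$region, pricedata$cost), ]
by_region_cost$cost   # 95 120 200 60 85 150

# Region ascending, cost descending within region.
by_region_desc <- pricedata[order(pricedata$region, -pricedata$cost), ]
by_region_desc$cost   # 120 95 200 150 85 60
```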

11 Data Management: Labels
A label can be used to give categorical variables coded with numbers a meaningful description. For instance, the data may have region coded as 1. The value 1 really means “East” so we want to label the value 1 as East.

12 Data Management: Labels
Labels are stored as part of an object's attributes. To check the attributes, use the function attributes(). Use the function str() to see a variable's structure and how it is stored.
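The sketch below shows what attributes() and str() report before and after labels are attached. Only the label “East” for code 1 comes from the slides; the other codes and the labels “Central” and “West” are assumptions made up for the example.

```r
# Hypothetical region codes; only 1 = "East" is given in the slides.
region <- c(1, 3, 3, 2)

# A plain numeric vector carries no attributes.
attributes(region)     # NULL
str(region)

# After labelling, the labels are stored in the "levels" attribute.
region_f <- factor(region, levels = c(1, 2, 3),
                   labels = c("East", "Central", "West"))
attributes(region_f)
str(region_f)
levels(region_f)       # "East" "Central" "West"
```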

13 Data Management: Labels
In R, there are two functions to use. If the data is nominal, use factor(variable, levels, labels). If the data is ordinal, use ordered(variable, levels, labels). In both, levels lists the coded values and labels gives the description for each value, in the same order.
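A minimal sketch of both functions follows. The region codes, the labels other than “East”, and the whole size variable with its Low/Medium/High labels are hypothetical and only serve to illustrate the calls.

```r
# Nominal example: region codes (only 1 = "East" comes from the slides).
region <- c(1, 3, 3, 2, 3, 1)
region_f <- factor(region, levels = c(1, 2, 3),
                   labels = c("East", "Central", "West"))
as.character(region_f)   # "East" "West" "West" "Central" "West" "East"
table(region_f)          # counts per label

# Ordinal example: a made-up size variable coded 1 to 3.
size <- c(2, 1, 3, 2)
size_o <- ordered(size, levels = c(1, 2, 3),
                  labels = c("Low", "Medium", "High"))

# Ordered factors support comparisons that respect the level order.
size_o < "High"          # TRUE TRUE FALSE TRUE
max(size_o)              # High
```

With a plain factor() the levels have no ranking, so comparisons such as < are not meaningful. That is the practical reason to choose ordered() for ordinal data.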

