Published by Róbert Stloukal. Modified over 6 years ago.
1
CSCI N317 Computation for Scientific Applications Unit 2 - 2 R
Data processing
2
Create Data: Use R commands to enter data. Good for small amounts of data.
About data frames. Note: “.” is an ordinary character in R variable and function names and can be read like an underscore.
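Entering a small data set by hand might look like the sketch below. The vectors and the name team.stats are made up for illustration.

```r
# Hypothetical example data; none of these names come from the slides.
team   <- c("A", "B", "C")
wins   <- c(10, 7, 12)
losses <- c(5, 8, 3)

# The "." in team.stats is just part of the variable name.
team.stats <- data.frame(team, wins, losses)

team.stats$wins    # access one column by name
nrow(team.stats)   # number of rows
```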
3
Create Data: Edit data. Use the edit() function; you must assign its output to a variable to keep the result. Use the fix() function to assign the result back to the same variable. Or use the “Data editor” feature in the GUI, which calls the fix() function on the object. There are no undo, redo or save options.
4
Import Data: Import data from external files. Delimited text files.
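A minimal sketch of reading delimited text with read.table(). To keep the example self-contained, the text= argument stands in for a real file path; the tab-delimited content is invented.

```r
# Hypothetical tab-delimited content standing in for an external file.
raw.text <- "ticker\tclose\nAAA\t10.5\nBBB\t20.25"

# header = TRUE treats the first line as column names; sep sets the delimiter.
prices <- read.table(text = raw.text, header = TRUE, sep = "\t",
                     stringsAsFactors = FALSE)
str(prices)
```

With a real file, the first argument would be the file path instead of text=.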
5
Import Data: Import data from external files. Other options: csv files.
From a URL. Use data retrieval packages, e.g. the “quantmod” package for finance data. See files dowGetData.R, get.multiple.quotes.R.
6
Export Data: Export data (usually data frames and matrices) as text files with write.table(), write.csv(), write.csv2(), …
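A small round trip, sketched with a temporary file so nothing is left on disk: write a data frame with write.csv(), then read it back with read.csv(). The data frame scores is hypothetical.

```r
# Hypothetical data frame to export.
scores <- data.frame(id = 1:3, score = c(88.5, 92, 79))

out.file <- tempfile(fileext = ".csv")

# row.names = FALSE avoids writing an extra column of row numbers.
write.csv(scores, out.file, row.names = FALSE)

scores.back <- read.csv(out.file)
unlink(out.file)  # remove the temporary file
```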
7
Combine Data
“I’d estimate that 80% of the effort on a typical project is spent on finding, cleaning and preparing data for analysis. Less than 5% of the effort is devoted to analysis. (The rest of the time is spent on writing up what you did.)” - Joseph Adler, “R in a Nutshell”
Combining Data Sets: Data files are stored at different locations.
paste(): concatenate multiple vectors, element by element, into a single character vector.
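A short sketch of paste() on two made-up vectors, showing the sep and collapse arguments:

```r
# Hypothetical input vectors.
first  <- c("IBM", "GE")
second <- c("2010", "2011")

p1 <- paste(first, second)              # "IBM 2010" "GE 2011"
p2 <- paste(first, second, sep = "_")   # "IBM_2010" "GE_2011"
p3 <- paste(first, collapse = "+")      # "IBM+GE"
```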
8
Combine Data: cbind() combines objects by adding columns.
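A minimal cbind() sketch, assuming two hypothetical data frames with the same number of rows:

```r
# Hypothetical data frames with matching row counts.
a <- data.frame(id = 1:3, x = c(2.5, 3.1, 4.0))
b <- data.frame(y = c("low", "mid", "high"))

ab <- cbind(a, b)   # columns of b are appended to the right of a
dim(ab)             # 3 rows, 3 columns
```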
9
Combine Data: rbind() combines objects by adding rows.
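A matching rbind() sketch, assuming two hypothetical data frames with the same column names:

```r
# Hypothetical data frames with identical columns.
top    <- data.frame(id = 1:2, x = c(2.5, 3.1))
bottom <- data.frame(id = 3:4, x = c(4.0, 5.2))

stacked <- rbind(top, bottom)   # rows of bottom are appended below top
```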
10
Combine Data: merge() joins two data frames on common columns or row names. See file merge.R.
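The file merge.R is not reproduced here, so the following is only a sketch of merge() on two invented data frames that share a symbol column.

```r
# Hypothetical data; not the contents of merge.R.
quotes  <- data.frame(symbol = c("AA", "BB", "CC"), close = c(10, 20, 30))
sectors <- data.frame(symbol = c("BB", "CC", "DD"),
                      sector = c("Tech", "Energy", "Health"))

# Default: keep only symbols present in both data frames.
both <- merge(quotes, sectors, by = "symbol")

# all = TRUE keeps every symbol and fills the gaps with NA.
all.rows <- merge(quotes, sectors, by = "symbol", all = TRUE)
```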
11
Transformation: Reassign variables and generate new columns.
Note: dow30_2.csv is one of the output files of the “quantmod” example on slide 5, with adjusted file name and column names. Create a new field.
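Since dow30_2.csv is not available here, this sketch uses a tiny stand-in data frame; the column names Open and Close are assumptions.

```r
# Hypothetical stand-in for the dow30_2.csv data.
dow <- data.frame(symbol = c("AA", "BB"), Open = c(10, 20), Close = c(11, 19))

# Create a new field from existing columns.
dow$mid <- (dow$Open + dow$Close) / 2

# Reassign an existing column, here converting its type.
dow$symbol <- as.character(dow$symbol)
```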
12
Transformation: Use the transform() function.
Specify a data frame and a set of expressions that use variables within the data frame
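A minimal transform() sketch on invented data. The expressions refer to columns directly, without the dow$ prefix.

```r
# Hypothetical data frame.
dow <- data.frame(Open = c(10, 20), Close = c(11, 19))

# Each named expression becomes a new column in the returned data frame.
dow2 <- transform(dow,
                  change = Close - Open,
                  pct    = 100 * (Close - Open) / Open)
```

The original dow is left unchanged; transform() returns a modified copy.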
13
Transformation: Applying a Function to Each Element of an Object
When transforming data, one common operation is to apply a function to a set of objects (or each part of a composite object) and return a new set of objects (or a new composite object). The base R library includes a set of different functions for doing this. Applying a function to an array: the apply() function accepts three arguments. X is the array to which the function is applied, MARGIN specifies the dimensions over which to apply it (1 for rows, 2 for columns of a matrix), and FUN specifies the function. You can also define your own function.
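A short apply() sketch on a small matrix, including a user-defined function. The helper name spread is made up.

```r
# 2 x 3 matrix filled by column: row 1 is 1 3 5, row 2 is 2 4 6.
m <- matrix(1:6, nrow = 2)

row.sums <- apply(m, 1, sum)   # MARGIN = 1: over rows    -> 9 12
col.max  <- apply(m, 2, max)   # MARGIN = 2: over columns -> 2 4 6

# A user-defined function works the same way.
spread <- function(v) max(v) - min(v)
col.spread <- apply(m, 2, spread)   # 1 1 1
```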
14
Transformation: Applying a Function to a List or Vector - lapply()
See the list data type. Applied to a list, lapply() returns a list. Applied to a vector, lapply() also returns a list; sapply() returns a vector when the results can be simplified.
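A sketch comparing the return types. Note that lapply() always returns a list, even for vector input; sapply(), from base R, is shown alongside because it simplifies the result to a vector where it can.

```r
x <- list(a = 1:3, b = 4:6)

l1 <- lapply(x, mean)   # list with elements a = 2 and b = 5
s1 <- sapply(x, mean)   # named numeric vector

v <- c(1, 4, 9)
l2 <- lapply(v, sqrt)   # still a list, of length 3
s2 <- sapply(v, sqrt)   # numeric vector 1 2 3
```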
15
Subsets: Bracket Notation
Use a simple expression describing the set of rows to select from a data frame as an index. The subset() function is an alternative to bracket notation: subset(dataset, rowexpression, columnexpression).
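Both forms side by side, sketched on a made-up data frame:

```r
# Hypothetical data frame.
df <- data.frame(symbol = c("AA", "BB", "CC"),
                 close  = c(10, 20, 30),
                 volume = c(100, 200, 300))

# Bracket notation: a logical row expression, then the columns to keep.
by.bracket <- df[df$close > 15, c("symbol", "close")]

# subset(): columns can be named directly, without the df$ prefix.
by.subset <- subset(df, close > 15, select = c(symbol, close))
```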
16
Binning Data: cut() divides a numeric vector into intervals and returns a factor.
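A minimal cut() sketch; the break points and labels are invented. By default each interval is closed on the right, so 17 falls in the first bin.

```r
# Hypothetical ages to bin.
ages <- c(5, 17, 18, 42, 70)

# Intervals are (0,17], (17,64], (64,Inf].
groups <- cut(ages,
              breaks = c(0, 17, 64, Inf),
              labels = c("child", "adult", "senior"))
table(groups)
```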
17
Sampling Data Combine a set of vectors or data frames
18
Sampling Data Random Sampling
Use the sample() function, specifying the values to draw from and the sample size.
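A sketch of sample() on vectors and on data frame rows. The seed value and the data are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed so the draws are repeatable

# Three distinct values from 1..10 (sampling without replacement).
picked <- sample(1:10, size = 3)

# Five coin flips (sampling with replacement).
flips <- sample(c("H", "T"), size = 5, replace = TRUE)

# Random rows of a hypothetical data frame: sample the row indices.
df <- data.frame(id = 1:6, value = c(3, 1, 4, 1, 5, 9))
sampled.rows <- df[sample(nrow(df), size = 2), ]
```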
19
Summarizing Functions
tapply(X=…, INDEX=…, FUN=, …): summarizes X by applying the function to each subset specified by INDEX.
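A small tapply() sketch with invented sales figures grouped by region:

```r
# Hypothetical values and grouping vector of the same length.
sales  <- c(10, 20, 30, 40, 50)
region <- c("east", "west", "east", "west", "east")

# One sum per distinct region: east = 90, west = 60.
totals <- tapply(X = sales, INDEX = region, FUN = sum)
```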
20
Summarizing Functions
aggregate(x=…, by=…, FUN=, …): similar to tapply(), but works on data frames.
rowsum(x, group=…): similar, but applies only the sum function.
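Both functions applied to the same made-up data frame, as a sketch:

```r
# Hypothetical data frame.
df <- data.frame(region = c("east", "west", "east", "west", "east"),
                 sales  = c(10, 20, 30, 40, 50),
                 units  = c(1, 2, 3, 4, 5))

# Mean of each numeric column per region; by= must be a list.
agg <- aggregate(x = df[, c("sales", "units")],
                 by = list(region = df$region),
                 FUN = mean)

# Column sums per region; group labels become the row names.
rs <- rowsum(df[, c("sales", "units")], group = df$region)
```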
21
Counting Values: tabulate() counts how many times each positive integer occurs in a vector.
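A minimal tabulate() sketch. Position i of the result holds the count of the integer i.

```r
x <- c(1, 2, 2, 5)

counts <- tabulate(x)              # 1 2 0 0 1
first3 <- tabulate(x, nbins = 3)   # 1 2 0 (values above 3 are ignored)
```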
22
Counting Values: table() function for categorical values.
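A table() sketch on invented categorical data, with one and two factors:

```r
# Hypothetical categorical vectors of equal length.
colors <- c("red", "blue", "red", "green", "red")
size   <- c("S", "M", "S", "M", "S")

counts  <- table(colors)         # one-way counts per color
two.way <- table(colors, size)   # cross-tabulation of color by size

counts[["red"]]                  # 3
```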
23
Reshaping Data: transpose with t().
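Transposing is done with the base function t(). A brief sketch:

```r
m <- matrix(1:6, nrow = 2)   # 2 x 3
tm <- t(m)                   # 3 x 2; tm[i, j] equals m[j, i]

# t() on a data frame converts it to a matrix first.
df <- data.frame(a = 1:2, b = 3:4)
tdf <- t(df)
```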
24
Reshaping Data
25
Reshaping Data: unstack()
Change the format of a data frame from a stacked form to an unstacked form. The “form” argument specifies a formula. The left side of ~ represents the vector to be unstacked. The right side of ~ indicates the groups to create.
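A minimal unstack() sketch on invented data. In the formula, the left side names the values and the right side names the grouping column.

```r
# Hypothetical stacked (long) data: one value column, one group column.
long <- data.frame(values = c(1, 2, 3, 4),
                   ind    = factor(c("a", "a", "b", "b")))

# values ~ ind: unstack `values`, creating one column per level of `ind`.
wide <- unstack(long, values ~ ind)
```

Because both groups have the same number of values, the result is a data frame with columns a and b.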
26
Reshaping Data: reshape(). Specify row IDs and expand values to columns.
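A sketch of reshape() going from long to wide. The column names id, year and value are assumptions chosen for the example.

```r
# Hypothetical long data: one row per id and year.
long <- data.frame(id    = c(1, 1, 2, 2),
                   year  = c(2010, 2011, 2010, 2011),
                   value = c(5, 6, 7, 8))

# idvar identifies the rows; each distinct timevar value becomes a column.
wide <- reshape(long, idvar = "id", timevar = "year", direction = "wide")
names(wide)   # "id" "value.2010" "value.2011"
```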
27
Sorting: Sort a single vector with sort(). Order a data frame with order().
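A sketch of both cases, assuming the base functions sort() and order() and a made-up data frame:

```r
v <- c(3, 1, 2)
asc  <- sort(v)                      # 1 2 3
desc <- sort(v, decreasing = TRUE)   # 3 2 1

# order() returns the row indices that put a column in sorted order.
df <- data.frame(symbol = c("AA", "BB", "CC"), close = c(30, 10, 20))
by.close      <- df[order(df$close), ]    # ascending by close
by.close.desc <- df[order(-df$close), ]   # descending by close
```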
28
Data Cleaning: Identifying problems caused by data collection, processing and storage processes, and modifying the data so that these problems don’t interfere with analysis, e.g. duplicate patient records, incorrect credit scores (outside of the 340 – 840 range), null values. This can be achieved through functions or programming methods.
29
Data Cleaning Finding and Removing Duplicates
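One way to find and drop duplicate rows, sketched with the base functions duplicated() and unique() on invented patient records:

```r
# Hypothetical records; row 3 repeats row 2.
patients <- data.frame(id   = c(1, 2, 2, 3),
                       name = c("Ann", "Bob", "Bob", "Cy"))

dup.flags <- duplicated(patients)     # FALSE FALSE TRUE FALSE

deduped  <- patients[!dup.flags, ]    # keep the first copy of each row
deduped2 <- unique(patients)          # same result in one call
```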
30
Data Cleaning: Use programming methods to remove rows that contain invalid or null values. E.g. using NationalSalaries.xlsx, write a program to remove rows that have null values and rows that are summarized data, e.g. major groups, all occupations.
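NationalSalaries.xlsx is not available here, so the sketch below uses a small invented data frame. The column names title, level and salary, and the level codes, are assumptions about how such a file might be laid out.

```r
# Hypothetical stand-in for the salary data; layout is assumed.
salaries <- data.frame(
  title  = c("All Occupations", "Management Occupations",
             "Chief Executives", "Actors"),
  level  = c("total", "major", "detail", "detail"),
  salary = c(45000, 105000, 170000, NA)
)

# Step 1: drop rows that contain any missing (NA) value.
no.na <- salaries[complete.cases(salaries), ]

# Step 2: drop summary rows, keeping only detailed occupations.
clean <- no.na[no.na$level == "detail", ]
```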