Organizing Data from Long-to-Wide Format: Issues and Troubleshooting


1 Organizing Data from Long-to-Wide Format: Issues and Troubleshooting
EPA R Workshop | September 13th, 2017 | Austin Heinrich

2 Overview
A common task in data analysis is organizing tables into formats that are in tune with the analyst’s objectives.
More often than not, data tables need to be reformatted into a more intuitive layout.
In this presentation, we’ll cover observations from an analysis where drinking water contaminant occurrence data was provided in long format and needed to be reorganized into wide format:
Data structure
Issues encountered
Solutions discovered

3 Long Format
Columns: Analyte.Name, PWSID, Laboratory.Assigned.ID, Sample.ID, Sample.Collection.Date, Detect, Value
Example rows (one row per analyte): C NY 3 MIXER DBP 10/14/2010 1 1.43; B 7.84; BDCM 3.7; DBCM 8.15; DBAA 1.8; DCAA NA; MBAA; MCAA 2.9; TCAA

4 Wide Format
Columns: PWSID, Sample.ID, Laboratory.Assigned.ID, Sample.Collection.Date, C, B, BDCM, DBCM, MCAA, DCAA, TCAA, MBAA, DBAA
Example row: NY 3 MIXER DBP 10/14/2010 1.43 7.84 3.7 8.15 2.9 1.8

5 Process: Contaminant Occurrence Data Case Study
1. Import data
options(stringsAsFactors = FALSE)
X <- read.delim()
Each of the nine analytes is in its own tab-delimited text file.
2. Data manipulation
For records that are non-detects (“detect” field = “0”), the “value” field is null. During import, R reads this as NA.
There are also nulls in the sample ID and lab ID fields.
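The import step could look roughly like the sketch below. The inline lines stand in for one analyte file, and the key values ("PWS001", "LAB-1", "S-1") are made up for illustration. The na.strings setting is an assumption, not something the slides show: blank numeric fields become NA on their own, but blank character fields are read as empty strings unless "" is listed in na.strings.

```r
# Hypothetical tab-delimited lines standing in for one analyte file.
lines <- c(
  "Analyte.Name\tPWSID\tLaboratory.Assigned.ID\tSample.ID\tSample.Collection.Date\tDetect\tValue",
  "C\tPWS001\tLAB-1\tS-1\t10/14/2010\t1\t1.43",
  "C\tPWS001\t\t\t10/15/2010\t0\t"  # non-detect with blank lab ID, sample ID, and value
)

# Read the text as if it were a file; treat blanks as NA in every column.
c_df <- read.delim(
  text = lines,
  stringsAsFactors = FALSE,
  na.strings = c("NA", "")
)

str(c_df)
```

With a real file, the text argument would be replaced by the file path.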

6 Process, cont.: Contaminant Occurrence Data Case Study
3. Merging of individual text files
a. Organize the data so there are multiple observations per row (i.e., “wide” format)
b. wideformat <- merge(c, b, by = c("PWSID", "Sample.ID", "Laboratory.Assigned.ID", "Sample.Collection.Date"))
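A minimal sketch of the merge step on two small, made-up analyte tables follows. The helpers make_analyte() and as_wide_piece() are hypothetical and not from the presentation. Renaming each Value column to its analyte name before merging is one possible way to arrive at the wide layout; the slides do not show how the columns were prepared.

```r
keys <- c("PWSID", "Sample.ID", "Laboratory.Assigned.ID", "Sample.Collection.Date")

# Hypothetical helper: build a small long-format table for one analyte.
make_analyte <- function(analyte, values) {
  data.frame(
    Analyte.Name = analyte,
    PWSID = c("PWS001", "PWS002"),
    Sample.ID = c("S-1", "S-2"),
    Laboratory.Assigned.ID = c("LAB-1", "LAB-2"),
    Sample.Collection.Date = c("10/14/2010", "10/15/2010"),
    Value = values,
    stringsAsFactors = FALSE
  )
}

c_df <- make_analyte("C", c(1.43, 2.10))
b_df <- make_analyte("B", c(7.84, NA))  # NA marks a non-detect

# Hypothetical helper: keep the key columns and rename Value after the analyte.
as_wide_piece <- function(df) {
  piece <- df[, c(keys, "Value")]
  names(piece)[names(piece) == "Value"] <- df$Analyte.Name[1]
  piece
}

# Inner join on the four key columns, as in step 3b.
wideformat <- merge(as_wide_piece(c_df), as_wide_piece(b_df), by = keys)
print(wideformat)
```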

7 Inside merge()
Joins data frames in “wide” format
Key arguments:
x, y = data frames, or objects to be coerced to one
by, by.x, by.y = specifications of the columns used for merging
Documentation
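As a small illustration of the by.x and by.y arguments, the sketch below merges two made-up tables whose key columns have different names. The column name System.ID and all values are hypothetical. The all argument is a standard merge() argument that the slide does not list; it is included only to contrast an inner join with a full outer join.

```r
# Two made-up tables whose key columns are named differently.
x <- data.frame(PWSID = c("PWS001", "PWS002"), C = c(1.43, 2.10),
                stringsAsFactors = FALSE)
y <- data.frame(System.ID = c("PWS001", "PWS003"), B = c(7.84, 3.30),
                stringsAsFactors = FALSE)

# Default behaviour: keep only keys present in both tables.
inner <- merge(x, y, by.x = "PWSID", by.y = "System.ID")

# all = TRUE keeps unmatched rows from both sides and fills the gaps with NA.
outer <- merge(x, y, by.x = "PWSID", by.y = "System.ID", all = TRUE)

print(inner)
print(outer)
```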

8 Possible Issues: NAs and duplicate records can lead to errors
When merging datasets, the number of records that share the common keys should never increase as more datasets are merged in.
For example, if you merge analyte files “c” (434,624 records) and “b” (433,636 records), the most primary-key matches you could have is 433,636.
With NAs and duplicates, you could instead get this:
Mergedfile <- merge(c, b, by = c("PWSID", "Laboratory.Assigned.ID", "Sample.ID", "Sample.Collection.Date"))
The result is 860,192 records!
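The row inflation can be reproduced on a toy example. merge() pairs every matching row of x with every matching row of y, and by default NA keys match each other, so a key combination that appears twice in each table contributes four rows. The sketch below uses only two key columns and made-up values to keep it short.

```r
keys <- c("PWSID", "Sample.ID")

# Each table has one clean record and two records with a missing sample ID.
x <- data.frame(PWSID = c("PWS001", "PWS002", "PWS002"),
                Sample.ID = c("S-1", NA, NA),
                C = c(1.43, 0.50, 0.60),
                stringsAsFactors = FALSE)
y <- data.frame(PWSID = c("PWS001", "PWS002", "PWS002"),
                Sample.ID = c("S-1", NA, NA),
                B = c(7.84, 3.10, 3.20),
                stringsAsFactors = FALSE)

merged <- merge(x, y, by = keys)

# 1 match for the clean key plus 2 x 2 matches for the repeated NA key.
nrow(merged)

# A quick pre-merge check: a non-zero result means the keys are not unique.
anyDuplicated(x[keys])
```

Running a uniqueness check like anyDuplicated() on each table before merging is one way to catch this early.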

9 Two Options for Finding NAs
1. Look at the count of NAs in individual fields or all fields using the summary(x) function
summary(c$Sample.ID)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
2. Simple indexing
c[is.na(c$Sample.ID), ]
By assigning this to an object, you get a data frame of all the records (45,185) that have NA in the sample ID field.
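Both options can be tried on a small made-up table, as sketched below. One caveat: summary() reports an NA's count for numeric columns, but for a character column it prints only length, class, and mode, so counting with is.na() is a handy alternative there. The data frame name and its values are hypothetical.

```r
# Made-up records; Sample.ID is stored as character here.
c_df <- data.frame(PWSID = c("PWS001", "PWS002", "PWS003", "PWS004"),
                   Sample.ID = c("S-1", NA, "S-3", NA),
                   Value = c(1.43, NA, 3.70, 8.15),
                   stringsAsFactors = FALSE)

# Option 1: summary() shows the NA count for the numeric Value column.
summary(c_df$Value)

# NA counts for every column at once, regardless of column type.
na_counts <- colSums(is.na(c_df))
print(na_counts)

# Option 2: indexing returns the records whose sample ID is missing.
missing_ids <- c_df[is.na(c_df$Sample.ID), ]
print(missing_ids)
```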

10 Options for Treating NAs and Duplicates
Records with NAs in one or more fields should not necessarily be deleted; valuable information may still remain.
Substitute NAs with values:
c$Sample.ID[is.na(c$Sample.ID)] <- "999999"
Duplicate records (across all fields) may be a reporting issue; remove them:
c <- c[!duplicated(c), ]
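The two treatments can be sketched together on made-up data, as below. The placeholder "999999" comes from the slide; the data frame name and its values are hypothetical. Note that every record given the same placeholder still shares that key value, so checking key uniqueness after substitution remains worthwhile.

```r
# Made-up records: rows 1 and 2 are exact duplicates, row 3 lacks a sample ID.
c_df <- data.frame(PWSID = c("PWS001", "PWS001", "PWS002"),
                   Sample.ID = c("S-1", "S-1", NA),
                   Value = c(1.43, 1.43, 7.84),
                   stringsAsFactors = FALSE)

# Replace missing sample IDs with a placeholder instead of dropping the record.
c_df$Sample.ID[is.na(c_df$Sample.ID)] <- "999999"

# Drop records that are identical across all fields, keeping the first copy.
c_df <- c_df[!duplicated(c_df), ]

print(c_df)
```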

11 Additional Reformatting Options
dplyr functions:
inner_join() returns all rows from x where there are matching values in y, and all columns from x and y
left_join() returns all rows from x, and all columns from x and y
right_join() returns all rows from y, and all columns from x and y
semi_join() returns all rows from x where there are matching values in y, keeping just the columns from x
reshape()
aggregate()
Others?
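As one example of these alternatives, base R's reshape() can go from long to wide in a single call when all analytes sit in one table. The sketch below uses made-up values and assumes the key columns contain no NAs; it is an illustration of the function, not the approach used in the case study.

```r
keys <- c("PWSID", "Sample.ID", "Laboratory.Assigned.ID", "Sample.Collection.Date")

# Made-up long-format data: two samples, two analytes each.
long <- data.frame(
  Analyte.Name = c("C", "B", "C", "B"),
  PWSID = c("PWS001", "PWS001", "PWS002", "PWS002"),
  Sample.ID = c("S-1", "S-1", "S-2", "S-2"),
  Laboratory.Assigned.ID = c("LAB-1", "LAB-1", "LAB-2", "LAB-2"),
  Sample.Collection.Date = c("10/14/2010", "10/14/2010", "10/15/2010", "10/15/2010"),
  Value = c(1.43, 7.84, 2.10, NA),
  stringsAsFactors = FALSE
)

# One row per key combination; one Value.<analyte> column per analyte.
wide <- reshape(long,
                idvar = keys,
                timevar = "Analyte.Name",
                v.names = "Value",
                direction = "wide")

print(wide)
```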

12 Takeaways
Drinking water contaminant occurrence data was successfully reformatted from long to wide using merge().
Other functions intended to perform the same task exist. Recommendations?
Careful attention should be given to data frame components, e.g., NAs and duplicates.
Without accounting for these, a simple conversion may become a headache.

13 Thank you
Email: Heinrich.Austin@Epa.gov
Phone: (202) 564-6723

