Organizing Data from Long-to-Wide Format: Issues and Troubleshooting
EPA R Workshop | September 13th, 2017
Austin Heinrich
Overview
A common task in data analysis is reorganizing tables into a format that suits the analyst's objectives.
More often than not, data tables need to be reformatted into a more intuitive layout before analysis.
In this presentation, we'll cover observations from an analysis where drinking water contaminant occurrence data was provided in long format and needed to be reorganized into wide format:
- Data structure
- Issues encountered
- Solutions discovered
Long Format

Analyte.Name  PWSID      Laboratory.Assigned.ID  Sample.ID  Sample.Collection.Date  Detect  Value
C             NY0600363  3 MIXER DBP             1100626    10/14/2010              1       1.43
B                                                                                           7.84
BDCM                                                                                        3.7
DBCM                                                                                        8.15
DBAA                                                                                        1.8
DCAA                                                                                        NA
MBAA
MCAA                                                                                        2.9
TCAA
Wide Format

PWSID      Sample.ID  Laboratory.Assigned.ID  Sample.Collection.Date  C     B     BDCM  DBCM  MCAA  DCAA  TCAA  MBAA  DBAA
NY0600363  1100626    3 MIXER DBP             10/14/2010              1.43  7.84  3.7   8.15  2.9                     1.8
Process
Contaminant Occurrence Data Case Study
1. Import data
   a. options(stringsAsFactors = FALSE)
   b. X <- read.delim()
   c. Each of the nine analytes is in its own tab-delimited text file
2. Data manipulation
   a. For records that are non-detects ("detect" field = "0"), the "value" field is null; during import, R reads this as NA
   b. The sample and lab ID fields also contain nulls
Process, cont.
Contaminant Occurrence Data Case Study
3. Merging of individual text files
   a. Organize data so there are multiple observations per row (i.e., "wide" format)
   b. wideformat <- merge(c, b, by = c("PWSID", "Sample.ID", "Laboratory.Assigned.ID", "Sample.Collection.Date"))
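The import and merge steps above can be sketched end to end as follows. This is a minimal illustration, not the workshop's actual script: the two temporary files stand in for two of the nine analyte files (the real file names are not given), the helper `read_analyte()` is hypothetical, and the "B" row's Detect value is assumed. Renaming each `Value` column to its analyte before merging is also an assumption, made so the wide columns are unambiguous.

```r
# Key fields shared by every analyte file (from the slides).
keys <- c("PWSID", "Sample.ID", "Laboratory.Assigned.ID", "Sample.Collection.Date")

# Hypothetical stand-ins for two analyte files, written to temp paths.
header <- paste("Analyte.Name", "PWSID", "Laboratory.Assigned.ID", "Sample.ID",
                "Sample.Collection.Date", "Detect", "Value", sep = "\t")
c_path <- tempfile(fileext = ".txt")
b_path <- tempfile(fileext = ".txt")
writeLines(c(header, "C\tNY0600363\t3 MIXER DBP\t1100626\t10/14/2010\t1\t1.43"), c_path)
writeLines(c(header, "B\tNY0600363\t3 MIXER DBP\t1100626\t10/14/2010\t1\t7.84"), b_path)

# Hypothetical helper: read one tab-delimited file and keep keys + value.
read_analyte <- function(path, analyte) {
  # stringsAsFactors = FALSE keeps ID fields as character instead of factor.
  df <- read.delim(path, stringsAsFactors = FALSE)
  df <- df[, c(keys, "Value")]
  # Rename Value to the analyte so each wide column is self-describing.
  names(df)[names(df) == "Value"] <- analyte
  df
}

c_df <- read_analyte(c_path, "C")
b_df <- read_analyte(b_path, "B")

# One row per sample, one column per analyte.
wideformat <- merge(c_df, b_df, by = keys)
print(wideformat)
```

With all nine analytes read into a list, the same merge could be applied pairwise with something like `Reduce(function(x, y) merge(x, y, by = keys), analyte_list)`, where `analyte_list` is a hypothetical list of the nine data frames.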
Inside merge()
Joins data frames on common columns; used here to build the "wide" format
Key Arguments
- x, y = data frames, or objects to be coerced to one
- by, by.x, by.y = specifications of the columns used for merging
Documentation
https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html
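As a small illustration of the key arguments, the sketch below merges two toy data frames whose key columns have different names, which is the case `by.x` and `by.y` are meant for. The column name `System.ID` and all values other than the first row are made up for the example; `all = TRUE` is a further merge() argument, not covered in the slide, that keeps unmatched rows from both sides.

```r
# Toy data: the key is called PWSID in x but System.ID (hypothetical) in y.
x <- data.frame(PWSID = c("NY0600363", "NY0000001"), C = c(1.43, 2.10),
                stringsAsFactors = FALSE)
y <- data.frame(System.ID = c("NY0600363", "NY0000002"), B = c(7.84, 0.40),
                stringsAsFactors = FALSE)

# Default: only rows whose key appears in both data frames are returned.
inner <- merge(x, y, by.x = "PWSID", by.y = "System.ID")

# all = TRUE: unmatched rows from either side are kept, padded with NA.
outer <- merge(x, y, by.x = "PWSID", by.y = "System.ID", all = TRUE)

print(inner)
print(outer)
```

The merged key column takes its name from `x`, so both results have a `PWSID` column and no `System.ID` column.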
Possible Issues
NAs and duplicate records can lead to errors
When merging datasets whose key combinations are unique within each dataset, the number of records that share the common keys should never increase as more datasets are merged in
For example, if you merge analyte files "c" (434,624 records) and "b" (433,636 records), the most primary-key matches you could have is 433,636
With NAs and duplicates in the key fields, merge() pairs every record in one file with every record in the other that has the same key (NA keys match each other by default), so you could instead get this:
   Mergedfile <- merge(c, b, by = c("PWSID", "Laboratory.Assigned.ID", "Sample.ID", "Sample.Collection.Date"))
Result is 860,192 records!
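The row inflation described above can be reproduced on a tiny scale. In this sketch (toy values, single key column for brevity), each data frame has three records, two of which have a missing key. Because merge() treats NA keys as equal by default, each NA record in `x` is paired with each NA record in `y`.

```r
# Toy data: one valid key and two missing keys in each data frame.
x <- data.frame(Sample.ID = c(1100626, NA, NA), C = c(1.43, 0.50, 0.70))
y <- data.frame(Sample.ID = c(1100626, NA, NA), B = c(7.84, 2.00, 3.00))

merged <- merge(x, y, by = "Sample.ID")

# 1 match on the valid key + 2 x 2 matches among the NA keys = 5 rows,
# which is more than either input (3 rows each).
print(nrow(merged))

# A quick pre-merge check: any repeated key (NA counts as a value here)
# means the merge can return more rows than the smaller input.
print(anyDuplicated(x$Sample.ID) > 0)
```

Exact duplicate records behave the same way as repeated NA keys: any key that appears m times in one file and n times in the other contributes m x n rows to the result.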
Two Options for Finding NAs
- Look at the count of NAs in individual or all fields using the summary(x) function
     summary(c$Sample.ID)
        Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
         352  399400  743800  757700  913800 2843000   45185
- Simple indexing
     c[is.na(c$Sample.ID), ]
  By assigning this to an object, you get a data frame of all the records (45,185) that have NA in the sample ID field
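Both options can be tried on a toy data frame, as sketched below. The values are made up, and `colSums(is.na(...))` is added here as a compact alternative to calling summary() on each field; it is not part of the original slides.

```r
# Toy stand-in for one analyte file, with missing sample and lab IDs.
c_df <- data.frame(
  PWSID = c("NY0600363", "NY0600363", "NY0600363"),
  Sample.ID = c(1100626, NA, NA),
  Laboratory.Assigned.ID = c("3 MIXER DBP", NA, "3 MIXER DBP"),
  stringsAsFactors = FALSE
)

# Option 1: summary() reports the NA count for a numeric field.
print(summary(c_df$Sample.ID))

# Alternative (assumption, not from the slides): NA counts for every field at once.
na_counts <- colSums(is.na(c_df))
print(na_counts)

# Option 2: simple indexing keeps the records whose sample ID is missing.
missing_sample <- c_df[is.na(c_df$Sample.ID), ]
print(missing_sample)
```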
Options for Treating NAs and Duplicates
Records with NAs in one or more fields should not necessarily be deleted
- Valuable information may still remain
- Substitute NAs with values
     c$Sample.ID[is.na(c$Sample.ID)] <- "999999"
Duplicate records (across all fields) may indicate a reporting issue
     c <- c[!duplicated(c), ]
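The two treatments can be combined and checked against the row-count rule, as in the sketch below. The data are made up, the helper `clean()` is hypothetical, and a numeric placeholder is used because the toy `Sample.ID` column is numeric (the slide uses the string "999999").

```r
# Toy data: one missing key in each frame and one exact duplicate record in c_df.
c_df <- data.frame(Sample.ID = c(1100626, NA, 1100627, 1100627),
                   C = c(1.43, 0.50, 0.90, 0.90))
b_df <- data.frame(Sample.ID = c(1100626, NA, 1100627),
                   B = c(7.84, 2.00, 3.10))

# Hypothetical helper applying both treatments from the slide.
clean <- function(df, placeholder = 999999) {
  # Keep the record but flag the missing ID with a placeholder value.
  df$Sample.ID[is.na(df$Sample.ID)] <- placeholder
  # Drop records that are identical across all fields.
  df[!duplicated(df), ]
}

c_clean <- clean(c_df)
b_clean <- clean(b_df)

merged <- merge(c_clean, b_clean, by = "Sample.ID")

# With unique keys, the merge cannot return more rows than the smaller input.
stopifnot(nrow(merged) <= min(nrow(c_clean), nrow(b_clean)))
print(merged)
```

One caveat with a single placeholder: every record that was missing an ID ends up with the same key value, so if several such records remain after deduplication, they are still told apart only by the other key fields.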
Additional Reformatting Options
dplyr functions
- inner_join() returns all rows from x where there are matching values in y, and all columns from x and y
- left_join() returns all rows from x, and all columns from x and y
- right_join() returns all rows from y, and all columns from x and y
- semi_join() returns all rows from x where there are matching values in y, keeping only the columns from x
reshape()
aggregate()
Others?
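Of the options listed, base R's reshape() can go from long to wide in a single call when all analytes are already stacked in one data frame. The sketch below uses four of the values from the example record and only two of the key fields for brevity; stripping the `Value.` prefix from the resulting column names is an assumption about the desired output.

```r
# Long format: one row per analyte for a single sample (values from the example record).
long <- data.frame(
  PWSID = "NY0600363",
  Sample.ID = 1100626,
  Analyte.Name = c("C", "B", "BDCM", "DBCM"),
  Value = c(1.43, 7.84, 3.7, 8.15),
  stringsAsFactors = FALSE
)

# idvar identifies a sample, timevar names the new columns, v.names holds the values.
wide <- reshape(long,
                idvar = c("PWSID", "Sample.ID"),
                timevar = "Analyte.Name",
                v.names = "Value",
                direction = "wide")

# reshape() names the new columns Value.C, Value.B, ...; strip the prefix.
names(wide) <- sub("^Value\\.", "", names(wide))
print(wide)
```

As with merge(), NAs and duplicates in the identifying fields deserve a check first, since reshape() expects each sample and analyte combination to appear only once.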
Takeaways
Drinking water contaminant occurrence data was successfully reformatted from long to wide using merge()
Other functions intended to perform the same task exist; recommendations?
Careful attention should be given to data frame components (e.g., NAs and duplicates)
Without accounting for these, a simple conversion may become a headache
Thank you Email: Heinrich.Austin@Epa.gov Phone: (202) 564-6723