Organizing Data from Long-to-Wide Format: Issues and Troubleshooting

EPA R Workshop | September 13th, 2017 | Austin Heinrich

Overview
- A common task in data analysis is organizing tables into formats that suit the analyst's objectives.
- More often than not, data tables need to be reformatted before they can be analyzed.
- This presentation covers observations from an analysis in which drinking water contaminant occurrence data was provided in long format and needed to be reorganized into wide format:
  - Data structure
  - Issues encountered
  - Solutions discovered

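Long Format

Analyte.Name  PWSID      Laboratory.Assigned.ID  Sample.ID  Sample.Collection.Date  Detect  Value
C             NY0600363  3 MIXER DBP             1100626    10/14/2010              1       1.43
B             NY0600363  3 MIXER DBP             1100626    10/14/2010                      7.84
BDCM          NY0600363  3 MIXER DBP             1100626    10/14/2010                      3.7
DBCM          NY0600363  3 MIXER DBP             1100626    10/14/2010                      8.15
DBAA          NY0600363  3 MIXER DBP             1100626    10/14/2010                      1.8
DCAA          NY0600363  3 MIXER DBP             1100626    10/14/2010                      NA
MBAA          NY0600363  3 MIXER DBP             1100626    10/14/2010
MCAA          NY0600363  3 MIXER DBP             1100626    10/14/2010                      2.9
TCAA          NY0600363  3 MIXER DBP             1100626    10/14/2010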

Wide Format

PWSID      Sample.ID  Laboratory.Assigned.ID  Sample.Collection.Date  C     B     BDCM  DBCM  MCAA  DCAA  TCAA  MBAA  DBAA
NY0600363  1100626    3 MIXER DBP             10/14/2010              1.43  7.84  3.7   8.15  2.9                     1.8
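The relationship between the two layouts above can be illustrated with a small sketch. It uses base R's reshape() rather than the merge() workflow the presentation describes, and the toy data frame `long` and its three rows are assumptions made up for illustration, not the presenter's data.

```r
# Toy long-format data modeled on the slide's example (illustrative values only).
long <- data.frame(
  PWSID = "NY0600363",
  Sample.ID = 1100626,
  Analyte.Name = c("C", "B", "BDCM"),
  Value = c(1.43, 7.84, 3.7),
  stringsAsFactors = FALSE
)

# reshape() turns one row per analyte into one column per analyte.
# idvar   = columns that identify a sample
# timevar = column whose values become the new column suffixes
wide <- reshape(
  long,
  idvar = c("PWSID", "Sample.ID"),
  timevar = "Analyte.Name",
  direction = "wide"
)

print(wide)
```

The result has a single row per sample, with the analyte values spread across columns named Value.C, Value.B, and Value.BDCM.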

Process: Contaminant Occurrence Data Case Study
1. Import data
   options(stringsAsFactors = FALSE)
   X <- read.delim()
   - Each of the nine analytes is in its own tab-delimited text file.
2. Data manipulation
   - Where a record is a non-detect ("Detect" field = 0), the "Value" field is null; during import, R reads this as NA.
   - Nulls also occur in the sample and lab ID fields.
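A minimal sketch of the import step follows. The file contents, the reduced set of columns, and the name `c_df` are assumptions for illustration; the real analyte files are not shown in the presentation. The example writes a tiny tab-delimited file to a temporary path so it can run on its own.

```r
# Write a small tab-delimited file so the example is self-contained.
# The second record is a non-detect with an empty Value field.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Analyte.Name\tPWSID\tSample.ID\tDetect\tValue",
  "C\tNY0600363\t1100626\t1\t1.43",
  "C\tNY0600363\t1100627\t0\t"
), path)

# stringsAsFactors = FALSE keeps text columns as character vectors.
c_df <- read.delim(path, stringsAsFactors = FALSE)

str(c_df)
# The empty numeric field is read in as NA.
print(is.na(c_df$Value))
```

Passing stringsAsFactors directly to read.delim() has the same effect here as setting it globally through options().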

Process, cont.: Contaminant Occurrence Data Case Study
3. Merging of individual text files
   a. Organize the data so that each row holds multiple observations (i.e., "wide" format).
   b. wideformat <- merge(c, b, by = c("PWSID", "Sample.ID", "Laboratory.Assigned.ID", "Sample.Collection.Date"))
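The merge step can be sketched on two small data frames. The frames `c_df` and `b_df`, their values, and the assumption that each file's value column has already been renamed after its analyte are all hypothetical; the presentation does not show how the columns were named before merging.

```r
# The four key columns named on the slide.
keys <- c("PWSID", "Sample.ID", "Laboratory.Assigned.ID", "Sample.Collection.Date")

# Hypothetical per-analyte data frames with made-up second samples.
c_df <- data.frame(
  PWSID = c("NY0600363", "NY0600363"),
  Sample.ID = c(1100626, 1100627),
  Laboratory.Assigned.ID = c("3 MIXER DBP", "3 MIXER DBP"),
  Sample.Collection.Date = c("10/14/2010", "10/15/2010"),
  C = c(1.43, 2.10),
  stringsAsFactors = FALSE
)
b_df <- data.frame(
  PWSID = c("NY0600363", "NY0600363"),
  Sample.ID = c(1100626, 1100627),
  Laboratory.Assigned.ID = c("3 MIXER DBP", "3 MIXER DBP"),
  Sample.Collection.Date = c("10/14/2010", "10/15/2010"),
  B = c(7.84, 6.02),
  stringsAsFactors = FALSE
)

# Column names passed to `by` must be quoted strings.
wideformat <- merge(c_df, b_df, by = keys)
print(wideformat)
```

With more than two analyte files, one option is to fold the same call over a list, for example Reduce(function(x, y) merge(x, y, by = keys), list(c_df, b_df)).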

Inside merge()
- Joins data frames on shared key columns, producing the "wide" layout
- Key arguments
  - x, y = data frames, or objects to be coerced to one
  - by, by.x, by.y = specifications of the columns used for merging
- Documentation: https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html

Possible Issues
- NAs and duplicate records can lead to errors.
- When the key fields uniquely identify each record, the number of records that share the common keys should never increase as more datasets are merged in.
- For example, if you merge analyte files "c" (434,624 records) and "b" (433,636 records), the most primary-key matches you could have is 433,636.
- With NAs and duplicates, you could instead get this:
  Mergedfile <- merge(c, b, by = c("PWSID", "Laboratory.Assigned.ID", "Sample.ID", "Sample.Collection.Date"))
  The result is 860,192 records!
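The row inflation described above can be reproduced on a toy example. By default merge() treats NA keys as matching one another, so every NA-keyed row in one frame pairs with every NA-keyed row in the other. The frames `x` and `y` below are made up for illustration and use a single key column to keep the sketch short.

```r
# Two small frames: one real key plus two records with a missing key in each.
x <- data.frame(Sample.ID = c(1, NA, NA), C = c(1.43, 0.5, 0.7))
y <- data.frame(Sample.ID = c(1, NA, NA), B = c(7.84, 2.0, 3.0))

merged <- merge(x, y, by = "Sample.ID")

# 1 match for key 1, plus 2 x 2 = 4 pairings among the NA keys.
print(nrow(merged))  # 5, more than either input

# A quick pre-merge check: anyDuplicated() returns 0 when keys are unique.
print(anyDuplicated(x$Sample.ID))  # 3, the position of the first repeated key
```

Checking that the key columns are unique in each input before merging is one way to catch this early, since the merged row count can only exceed the smaller input when keys repeat.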

Two Options for Finding NAs
1. Look at the count of NAs in an individual field, or in all fields, using the summary(x) function:
   summary(c$Sample.ID)
      Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
       352  399400  743800  757700  913800 2843000   45185
2. Simple indexing:
   c[is.na(c$Sample.ID), ]
   Assigning this to an object gives a data frame of all the records (45,185) that have NA in the sample ID field.
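Both approaches can be tried on a small frame. The sketch below also uses colSums(is.na()) as a compact per-column count, which is an addition not mentioned on the slide; the data frame `c_df` and its values are made up for illustration.

```r
c_df <- data.frame(
  PWSID = c("NY0600363", "NY0600363", NA),
  Sample.ID = c(1100626, NA, NA),
  Value = c(1.43, NA, 0.9),
  stringsAsFactors = FALSE
)

# Count of NAs in every column at once.
na_counts <- colSums(is.na(c_df))
print(na_counts)

# Simple indexing: keep only the records with a missing sample ID.
missing_id <- c_df[is.na(c_df$Sample.ID), ]
print(missing_id)
```

Here `na_counts` reports 1, 2, and 1 missing values for PWSID, Sample.ID, and Value, and `missing_id` holds the two records without a sample ID.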

Options for Treating NAs and Duplicates
- Records with NAs in one or more fields should not necessarily be deleted; valuable information may still remain.
- Substitute NAs with values:
  c$Sample.ID[is.na(c$Sample.ID)] <- "999999"
- Duplicate records (across all fields) may be a reporting issue:
  c <- c[!duplicated(c), ]
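A minimal sketch of both treatments, on made-up data. It departs from the slide in one respect, as an assumption: the placeholder is the number 999999 rather than the string "999999", so that a numeric Sample.ID column is not coerced to character.

```r
c_df <- data.frame(
  Sample.ID = c(1100626, 1100626, NA, NA),
  Value = c(1.43, 1.43, 0.5, 0.7)
)

# Replace missing sample IDs with a placeholder (numeric here, by assumption).
c_df$Sample.ID[is.na(c_df$Sample.ID)] <- 999999

# Drop records that are identical across all fields.
c_df <- c_df[!duplicated(c_df), ]

print(c_df)
```

Three records remain: the exact duplicate is removed, while the two formerly-NA records are kept because their values differ. Records that share the same placeholder still match one another as keys, so a placeholder alone may not prevent extra rows in a later merge.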

Additional Reformatting Options
- dplyr functions
  - inner_join() returns all rows from x where there are matching values in y, and all columns from x and y
  - left_join() returns all rows from x, and all columns from x and y
  - right_join() returns all rows from y, and all columns from x and y
  - semi_join() returns all rows from x where there are matching values in y, keeping only the columns from x
- reshape()
- aggregate()
- Others?
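For readers without dplyr installed, the join types listed above have rough base R analogues. The sketch below uses merge() and %in% instead of the dplyr functions themselves; the frames `x` and `y` and the key name `id` are hypothetical, and row order may differ from what dplyr would return because merge() sorts on the key by default.

```r
x <- data.frame(id = c(1, 2, 3), C = c(1.43, 2.0, 3.1))
y <- data.frame(id = c(2, 3, 4), B = c(7.84, 6.0, 5.5))

inner <- merge(x, y, by = "id")                # comparable to inner_join(x, y)
left  <- merge(x, y, by = "id", all.x = TRUE)  # comparable to left_join(x, y)
right <- merge(x, y, by = "id", all.y = TRUE)  # comparable to right_join(x, y)
semi  <- x[x$id %in% y$id, ]                   # comparable to semi_join(x, y)

print(inner)
print(left)
print(right)
print(semi)
```

Unmatched rows kept by the left and right variants carry NA in the columns that come from the other frame, and the semi-join analogue returns only the columns of x.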

Takeaways
- Drinking water contaminant occurrence data was successfully reformatted from long to wide using merge().
- Other functions intended to perform the same task exist. Recommendations?
- Careful attention should be given to data frame components, e.g., NAs and duplicates. Without accounting for these, a simple conversion may become a headache.

Thank you
Email: Heinrich.Austin@Epa.gov
Phone: (202) 564-6723