Tidy Data Global Health 811 April 9th, 2018.

Happy families are all alike; every unhappy family is unhappy in its own way — Leo Tolstoy

Concept of Tidy Data
• Data is often messy!
• We need a precise way to talk about “tidy” data.
• Goal: represent one fact in one place.
• If one fact is stored in multiple places, the copies can end up recorded with different values.

Data Semantics
The dataset contains 18 values representing three variables and six observations. The information remains the same in the tidy dataset, but the values, variables, and observations are clearer.
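The counting above can be checked on a small table. This is only a sketch: the slide's actual table is not reproduced in the transcript, so the data frame `wide`, its column names, and all of its values are made up for illustration. It uses tidyr's gather(), which the demo later in the deck also lists.

```r
library(tidyr)

# Hypothetical wide table: one row per person, one column per treatment.
# Every value here is invented for illustration.
wide <- data.frame(
  person = c("A", "B", "C"),
  treatment_a = c(10, 16, 3),
  treatment_b = c(2, 11, 1),
  stringsAsFactors = FALSE
)

# Gather the two treatment columns into key/value pairs.
tidy <- gather(wide, key = "treatment", value = "result",
               treatment_a, treatment_b)

# Three variables (columns) times six observations (rows) gives 18 values.
dim(tidy)
```

In the tidy form, person, treatment, and result are each a column, and each row is one person-treatment observation.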

Common problems with messy data
• Column headers are values, not variable names.
• Multiple variables are stored in one column.
• Variables are stored in both rows and columns.
• Multiple types of observational units are stored in the same table.
• A single observational unit is stored in multiple tables.

Columns are values, not variables
Cases in which you may come across data of this nature:
• Tabular data designed for presentation
• Sometimes used to record regularly spaced observations over time

Example 1: Pew Survey Data
What are the variables and observations in this dataset? What would the tidy version look like?

The Tidy Version
The first ten rows of the tidied survey dataset on income and religion. This version is tidy because each column represents a variable and each row represents an observation. In this case, each observation is a demographic unit corresponding to a combination of religion and income.
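A minimal sketch of how a table like this might be tidied with tidyr's gather(). The data frame `pew_wide`, the religion labels, the income bracket names, and the counts are all hypothetical stand-ins, not the actual survey figures.

```r
library(tidyr)

# Hypothetical counts: the income brackets appear as column headers,
# which is the "column headers are values" problem.
pew_wide <- data.frame(
  religion = c("Religion X", "Religion Y"),
  `<$10k` = c(5, 8),
  `$10-20k` = c(7, 6),
  check.names = FALSE,       # keep the bracket labels as column names
  stringsAsFactors = FALSE
)

# Move every column except religion into income/freq pairs.
pew_tidy <- gather(pew_wide, key = "income", value = "freq", -religion)
pew_tidy
```

Each row of `pew_tidy` is now one religion-income combination with its frequency.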

Example 2: Billboard Data

The Tidy Version
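The transcript does not show the Billboard table itself, so the following is only a sketch under an assumed layout: one row per track and one rank column per week, named wk1, wk2, and so on. The artists, tracks, and ranks are invented.

```r
library(tidyr)

# Assumed layout with made-up values: weekly ranks spread across columns.
billboard_wide <- data.frame(
  artist = c("Artist 1", "Artist 2"),
  track = c("Track 1", "Track 2"),
  wk1 = c(87, 91),
  wk2 = c(82, NA),   # NA: the track was no longer on the chart
  wk3 = c(72, NA),
  stringsAsFactors = FALSE
)

# Gather the week columns and drop weeks with no rank.
billboard_tidy <- gather(billboard_wide, key = "week", value = "rank",
                         wk1:wk3, na.rm = TRUE)

# Turn "wk1", "wk2", ... into the integers 1, 2, ...
billboard_tidy$week <- as.integer(sub("^wk", "", billboard_tidy$week))
billboard_tidy
```

Dropping the missing ranks with `na.rm = TRUE` is a design choice here: it treats a missing week as "not charted" rather than as an observation with an unknown rank.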

Your reward. Thank me later! https://youtu.be/F7lfNXddV6A

“Melting” Data

Multiple variables stored in one column
After melting (reshaping) the data, the column that holds the former column names often combines several underlying variables.

Example: WHO TB Dataset

After melting, the data still need tidying
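A sketch of the two steps, melt and then split, using tidyr's gather() and separate(). It assumes column names that encode sex in the first letter followed by an age range (for example m014); that naming scheme, the data frame `tb_wide`, and all counts are assumptions for illustration, not the actual WHO data.

```r
library(tidyr)

# Hypothetical TB counts; names such as "m014" are assumed to mean
# sex "m" and age range "014".
tb_wide <- data.frame(
  country = c("AA", "BB"),
  year = c(2000, 2000),
  m014 = c(1, 4),
  f014 = c(2, 5),
  m1524 = c(3, 6),
  stringsAsFactors = FALSE
)

# Step 1: melt. The "column" variable still mixes sex and age.
tb_molten <- gather(tb_wide, key = "column", value = "cases",
                    -country, -year)

# Step 2: split after the first character into sex and age.
tb_tidy <- separate(tb_molten, column, into = c("sex", "age"), sep = 1)
tb_tidy
```

A numeric `sep` tells separate() to split at a character position, which fits names where the first variable always has a fixed width.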

Variables stored in both rows & columns
The most complicated form of messy data occurs when variables are stored in both rows and columns.

Example: Climate Database
- Data are drawn from the Global Historical Climatology Network
- One weather station (MX17004) in Mexico
- Five-month period in 2010

Example: Climate Database

Example: Climate Database
In the tidy dataset, each row represents the meteorological measurements for a single day. There are two measured variables, minimum and maximum temperature; all other variables are fixed.
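A sketch of how such a table might be tidied with gather() followed by spread(). The layout is assumed: one row per station, month, and element (tmax or tmin), with one column per day named d1, d2, and so on. Only the station id and year come from the slides; the temperatures are made up.

```r
library(tidyr)

# Assumed layout with invented temperatures: days run across the columns
# and the measured variable names (tmax, tmin) sit in the rows.
weather_wide <- data.frame(
  id = "MX17004",
  year = 2010,
  month = c(1, 1),
  element = c("tmax", "tmin"),
  d1 = c(30, 15),
  d2 = c(NA, NA),   # no measurement recorded that day
  d3 = c(28, 14),
  stringsAsFactors = FALSE
)

# Step 1: gather the day columns into rows, dropping missing days.
weather_molten <- gather(weather_wide, key = "day", value = "value",
                         d1:d3, na.rm = TRUE)
weather_molten$day <- as.integer(sub("^d", "", weather_molten$day))

# Step 2: spread the element column so tmax and tmin become variables.
weather_tidy <- spread(weather_molten, key = element, value = value)
weather_tidy
```

After both steps, each row holds one day's measurements, with tmax and tmin as separate columns.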

For more on tidy data …see the link on the GH 811 site

R tidyr Interactive Demo
- gather()
- separate()
- spread()

Installing packages
Install the whole tidyverse (warning: this takes a while): install.packages("tidyverse")
OR
Just install tidyr: install.packages("tidyr")
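One possible pattern, not part of the original slides, is to install tidyr only when it is missing and then load it, so the script can be rerun without reinstalling each time:

```r
# Install tidyr only if it is not already available, then load it.
# In a non-interactive session install.packages() may also need a
# repos argument pointing at a CRAN mirror.
if (!requireNamespace("tidyr", quietly = TRUE)) {
  install.packages("tidyr")
}
library(tidyr)
```

Installing is needed once per machine; library(tidyr) is needed once per R session.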

Upcoming deadlines
Sunday, November 4th at 5pm
- Data dictionary
- Table shells
- Methods section
- Team charter review
Tuesday, November 6th at 2pm
- Journal 2