Best Practices for Preparing Data Sets
Non-CO2 Synthesis Workshop, Boulder, Colorado, 22-23 October 2008
Compiled by: A. Dayalu, Harvard University
Adapted from: Best Practices for Preparing Environmental Data Sets to Share and Archive, by L.A. Hook, T.W. Beaty, S. Santhana Vannan, L. Baskaran, and R.B. Cook. June

Seven Best Practices: Summary
1. Assign Descriptive File Names
2. Use Consistent and Stable File Formats
3. Define the Contents of Your Data Files
4. Use Consistent Data Organization
5. Perform Basic Quality Assurance
6. Assign Descriptive Data Set Titles
7. Provide Documentation

I. Assign Descriptive File Names
File names should reflect the contents of the file and include enough information to uniquely identify the data file.
- File names can contain identifiers such as project acronym, study title, location, investigator, year(s), version, and file type/extension.
- File names should contain only numbers, letters, dashes, and underscores; no spaces or special characters.
- When compressing files, use an acceptable compression format: *.zip, *.gz, or *.tar.
Examples (a small naming sketch follows this slide):
- Good: cobra_2003_flasks.csv (COBRA 2003 aircraft mission flask data)
- Poor: data 2003.dat (contains a space and does not describe the contents)
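A minimal sketch of how a descriptive file name might be assembled programmatically. The function name and the naming pattern (project_year_content.extension) are illustrative assumptions, not part of the workshop material.

```python
import re

def make_file_name(project, year, content, extension="csv"):
    """Build a descriptive file name such as 'cobra_2003_flasks.csv'.

    Spaces and special characters are replaced with underscores so the
    result contains only letters, numbers, dashes, and underscores.
    """
    parts = [str(project), str(year), str(content)]
    cleaned = [re.sub(r"[^A-Za-z0-9-]+", "_", p).strip("_").lower() for p in parts]
    return "_".join(cleaned) + "." + extension

print(make_file_name("COBRA", 2003, "flasks"))         # cobra_2003_flasks.csv
print(make_file_name("My Project", 2003, "raw data"))  # my_project_2003_raw_data.csv
```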

II. Use Consistent and Stable File Formats for Tabular Data
In choosing a file format, select a consistent format that can be read well into the future and is independent of changes in applications.
- Use ASCII file formats delimited by commas, tabs, or semicolons (in order of preference).
  - Use the same delimiter throughout the file.
  - Use a consistent format across all data files for a study.
  - Report figures and analyses in companion documents; do not place figures or summary statistics in the data file.
- Include header rows at the top of the data file (see the sketch after this slide).
  - First row: descriptors linking the file to the data set (file name, data set title, author, today's date, date of last data modification, companion file names).
  - Remaining rows: describe the content of each column, including one row for parameter names and units.
  - Column headings should contain only numbers, letters, and underscores; no spaces or special characters.
- In the data set documentation, include:
  - Descriptions of data file names (expand acronyms, site abbreviations, etc.)
  - Expanded parameter descriptions
  - Missing value codes
  - Example data file records
  - Other data file documentation useful to a secondary user (see Section VII)
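A minimal sketch of a comma-delimited file with descriptive header rows. The file name, data set title, parameter names, and data values are hypothetical placeholders, not taken from the workshop material.

```python
import csv
from datetime import date

# Illustrative rows: one descriptor row linking the file to its data set,
# one row of parameter names, one row of units, then data records.
rows = [
    ["# file: cobra_2003_flasks.csv; data set: COBRA 2003 Flask Measurements;"
     " author: A. Dayalu; created: %s" % date.today().isoformat()],
    ["site_id", "sample_date", "co2_ppm", "ch4_ppb"],   # parameter names
    ["", "YYYYMMDD", "ppm", "ppb"],                     # units
    ["HFM", "20030612", "373.21", "1791.4"],            # placeholder records
    ["HFM", "20030613", "374.05", "1793.2"],
]

with open("cobra_2003_flasks.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```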

III. Define the Contents of Your Data Files
For others to use your data, they must fully understand the contents of the data set, including the parameter names, units of measure, formats, and definitions of coded values.
- Parameter names should describe the contents; accompanying documentation should completely describe each parameter. Use consistent capitalization and only letters, numerals, and underscores.
- Units must be explicitly stated in the data file and in the documentation.
- Formats for each parameter should be consistent across data sets, particularly for dates, times, and spatial coordinates (see the sketch after this slide).
  - Dates: YYYYMMDD format.
  - Time: report in UTC, using 24-hour notation.
  - Spatial coordinates: record in decimal degrees to at least four decimal places. Be consistent with, and document, the coordinate type, datum, and spheroid. Mixing coordinate systems (e.g., NAD83 and NAD27) will cause errors in subsequent geographic analysis.
  - Elevation: provide elevation in meters, with information on the vertical datum used (e.g., NAVD 88).
- Coded fields, such as data quality flags or data qualifiers, should be consistent across parameters and files, and should be explained in detail in the accompanying documentation.
- Missing values should be specified using an extreme value not likely to be confused with a measured value (e.g., -9999 or NA). Except in the case of NA, do not use character codes in an otherwise numeric field.
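A minimal sketch applying these formatting conventions to a single record. The field names, the -9999 missing-value code, and the example coordinates are assumptions for illustration only.

```python
from datetime import datetime, timezone

MISSING = -9999  # assumed missing-value code; document whatever code you actually use

def format_record(timestamp_utc, lat, lon, value):
    """Format one observation using the slide's conventions (illustrative only)."""
    return {
        "date": timestamp_utc.strftime("%Y%m%d"),        # YYYYMMDD
        "time_utc": timestamp_utc.strftime("%H:%M:%S"),  # 24-hour UTC
        "lat_dd": round(lat, 4),                         # decimal degrees, 4 places
        "lon_dd": round(lon, 4),
        "co2_ppm": value if value is not None else MISSING,
    }

print(format_record(datetime(2003, 6, 12, 14, 30, tzinfo=timezone.utc),
                    42.53783, -72.17148, None))
```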

IV. Use Consistent Data Organization
- Place each observation on a separate line (row). Most often, each row in a file represents a complete record and the columns represent all the parameters that make up the record, an arrangement similar to a spreadsheet or matrix.
- Keep similar information together. Do not break up your data set into many small files (e.g., by month); instead, make the month a parameter and put all the data in one large file (see the sketch after this slide). This spares secondary users from having to open and process many files.
- Size limitations: some applications have size restrictions. For example, Excel 2003 limits worksheets to 65,536 rows and 256 columns, so very large files may have to be broken into logical smaller files. Excel 2007 raises these limits to 1,048,576 rows and 16,384 columns, but they are still finite.
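A minimal sketch of merging many small monthly files into one file with a month column. The file naming pattern (flasks_2003_01.csv, etc.) and the combined file name are assumptions for illustration.

```python
import csv
import glob

# Assumes monthly files named like 'flasks_2003_01.csv', each with the same
# header row; they are merged into one file with an added 'month' column.
OUT = "flasks_2003_all.csv"

with open(OUT, "w", newline="") as out:
    writer = csv.writer(out)
    header_written = False
    for path in sorted(glob.glob("flasks_2003_*.csv")):
        if path == OUT:  # skip the combined file itself if it already exists
            continue
        month = path.rsplit("_", 1)[-1].split(".")[0]
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            if not header_written:
                writer.writerow(["month"] + header)
                header_written = True
            for row in reader:
                writer.writerow([month] + row)
```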

V. Perform Basic Quality Assurance (QA)
In addition to scientific QA, perform basic QA on the data files themselves (see the sketch after this slide).
- Check the file format. Make sure the data are properly delimited and line up in the correct columns.
- Check the file organization and descriptors to ensure there are no missing values for key parameters (e.g., location, time, sample ID).
- Review the documentation to ensure content and parameter descriptions are accurate.
- Check measured or derived values to detect impossible or anomalous values (e.g., negative mixing ratios).
- Generate basic plots to aid in QA. Perform statistical summaries and review the results. Map locations to check for coordinate errors.
- Verify data transfers from notebooks, instruments, etc. For transfers done by hand, consider double data entry and compare the two data sets. Where possible, compare summary statistics before and after data transformation.
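A minimal QA sketch using pandas, assuming a combined file with the hypothetical columns used in the earlier sketches (site_id, sample_date, lat_dd, lon_dd, co2_ppm) and a -9999/NA missing-value code.

```python
import pandas as pd

df = pd.read_csv("flasks_2003_all.csv", na_values=[-9999, "NA"])

# 1. Missing values in key parameters
key_cols = ["site_id", "sample_date"]
print(df[key_cols].isna().sum())

# 2. Impossible or anomalous values (e.g., negative mixing ratios)
print(df[df["co2_ppm"] < 0])

# 3. Coordinate range checks before mapping locations
bad_coords = df[(df["lat_dd"].abs() > 90) | (df["lon_dd"].abs() > 180)]
print(bad_coords)

# 4. Summary statistics to review, and to compare before/after transformations
print(df.describe())
```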

VI. Assign Descriptive Data Set Titles
Data set titles should be as descriptive as possible. When giving titles to your data sets and associated documentation, be aware that these data sets may be accessed many years in the future by people unaware of the project details.
- Data set titles should include the type of data, date range, location, instruments, and parent project.
- Limit title length to 80 characters, including spaces.
- Titles should contain only numbers, letters, dashes, underscores, and spaces.
- The data set title should be similar to the name(s) of the data file(s). A data set might contain one to thousands of data files.
Examples:
- Good: SAFARI 2000 Upper Air Meteorological Profiles, Skukuza, Dry Seasons
- Poor: The Aerostar 100 Data Set (no data type, date range, or location)

VII. Provide Data Set Documentation
The documentation accompanying your data set should be written for a user 20 years into the future, so consider what that investigator needs to know to use your data. Write the documentation for a user who is unfamiliar with your project, sites, methods, or observations. Documentation can never be too complete. The following information should be considered essential (a sketch of a companion documentation file follows this list):
- Name of the data set and names of the files in the set
- Why and what data were collected
- Instruments used, including model and serial number
- Who collected the data, whom to contact regarding the data, and how to cite the data
- Where and with what spatial resolution the data were collected
- Definitions of any codes used in the documentation
- Frequency of data collection
- How each parameter was measured (methods), and its units
- Environmental conditions at the time of sampling (temperature, cloud cover, etc.)
- Data processing and screening methods
- Standards or calibrations used
- Details of the QA/QC that was applied
- Known issues that limit the data's use
- Software (including version number) used to prepare and read the files
- Date of last modification
- Pertinent notes
- Summary statistics generated from the final file, to verify future file transformations and transfers
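A minimal sketch of generating a plain-text companion documentation (README) stub that lists the essential items from this slide. The output file name and data set title are hypothetical; the field values are left as TODO placeholders to be filled in by the data provider.

```python
fields = [
    "Data set name and file names",
    "Why and what data were collected",
    "Instruments (model and serial number)",
    "Who collected the data / contact / how to cite",
    "Location and spatial resolution",
    "Definitions of codes",
    "Frequency of data collection",
    "Measurement methods and units",
    "Environmental conditions at time of sampling",
    "Data processing and screening methods",
    "Standards or calibrations used",
    "QA/QC applied",
    "Known issues limiting use",
    "Software (with versions) used to prepare and read files",
    "Date of last modification",
    "Pertinent notes",
    "Summary statistics of the final file",
]

with open("README_cobra_2003_flasks.txt", "w") as f:
    f.write("Documentation: COBRA 2003 Flask Measurements (hypothetical title)\n\n")
    for item in fields:
        f.write(item + ":\n    TODO\n\n")
```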