Fundamental Practices for Preparing Data Sets

Slides:

Advertisements

Similar presentations

Organising and Documenting Data Stuart Macdonald EDINA & Data Library DIY Research Data Management Training Kit for Librarians.

Advertisements

Local Data Management: Building understandable spreadsheets Jeff Arnfield National Climatic Data Center Version 1.0 Review Date.

John Porter Why this presentation? The forms data take for analysis are often different than the forms data take for archival storage Spreadsheets are.

Data - Information - Knowledge

1 ORNL DAAC: Data and Services Robert Cook and Suresh SanthanaVannan Environmental Sciences Division Oak Ridge National Laboratory Oak Ridge, TN Presentation.

Fundamental Practices for Preparing Data Sets Bob Cook Environmental Sciences Division Oak Ridge National Laboratory 5 th NACP Principal Investigator’s.

Elements of a Data Management Plan Alison Boyer Environmental Sciences Division Oak Ridge National Laboratory.

Elements of a Data Management Plan

Best Practices for Preserving Data Bob Cook Environmental Sciences Division Oak Ridge National Laboratory.

ORGANIZING AND STRUCTURING DATA FOR DIGITAL PROJECTS Suzanne Huffman Digital Resources Librarian Simpson Library.

U.S. Department of the Interior U.S. Geological Survey Data Management Training Modules: Best Practices for Preparing Science Data to Share.

SAFARI 2000 Data Activities at the ORNL DAAC Bob Cook, Les Hook, Stan Attenberger, Dick Olson, and Tim Rhyne Oak Ridge National Laboratory.

From Best Practices for Preserving Data by Bob Cook, Environmental Sciences Division Oak Ridge National Laboratory Module 9.

Fundamental Practices for Preparing Data Sets Robert Cook ORNL Distributed Active Archive Center Environmental Sciences Division Oak Ridge National Laboratory.

U.S. Department of the Interior U.S. Geological Survey Best Practices for Preparing Science Data to Share.

Inter-American Workshop on Environmental Data Access Panel discussion on scientific and technical issues Merilyn Gentry, LBA-ECO Data Coordinator NASA.

AON Data Questionnaire Results 21 Respondents Last Updated 27 March 2007 First AON PI Meeting Scot Loehrer, Jim Moore.

Data management in the field Ari Haukijärvi 2nd EHES training seminar.

Best Practices for Preparing Data Sets Non-CO2 Synthesis Workshop Boulder, Colorado October 2008 Compiled by: A. Dayalu, Harvard University Adapted.

Managing Your Data: Backing Up Your Data Robert Cook Oak Ridge National Laboratory Version 1.0 Review Date.

CC&E Best Data Management Practices, April 19, 2015 Please take the Workshop Survey 1.

Enhancing Linkages Between Projects and Datasets: Examples from LBA-ECO for NACP Lisa Wilcox, Amy L. Morrell,

Managing the Impacts of Programmatic Scale and Enhancing Incentives for Data Archiving A Presentation for “International Workshop on Strategies for Preservation.

Fundamental Practices for Preparing Data Sets Bob Cook Environmental Sciences Division Oak Ridge National Laboratory.

Module 6. Data Management Plans  Definitions ◦ Quality assurance ◦ Quality control ◦ Data contamination ◦ Error Types ◦ Error Handling  QA/QC best practices.

Data Management Practices for Early Career Scientists: Closing Robert Cook Environmental Sciences Division Oak Ridge National Laboratory Oak Ridge, TN.

Managing Your Data: Backing Up Your Data Robert Cook Oak Ridge National Laboratory Section: Local Data Management Version 1.0 October 2012.

1 ORNL DAAC WebGIS Demonstration Suresh Santhana Vannan, Robert Cook, Tammy W. Beaty and Yaxing Wei ORNL DAAC Oak Ridge National Laboratory Distributed.

WK 13 - How to Prepare Ecological Data Sets for Effective Analysis and Sharing 2:00 PM-5:00 PM August 1 st, 2010.

Fundamental Practices for Preparing Data Sets Bob Cook Environmental Sciences Division Oak Ridge National Laboratory.

Managing the Impacts of Change on Archiving Research Data A Presentation for “International Workshop on Strategies for Preservation of and Open Access.

Responsible Data Use and Local Data Management Ruth Duerr National Snow and Ice Data Center.

Data Management 101 for Earth Scientists Managing Your Data Robert Cook Environmental Sciences Division Oak Ridge National Laboratory.

Managing Your Data: Assign Descriptive File Names Robert Cook Oak Ridge National Laboratory Section: Local Data Management Version 1.0 October 2012.

DataONE: Preserving Data and Enabling Data-Intensive Biological and Environmental Research Bob Cook Environmental Sciences Division Oak Ridge National.

IDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF ).

Cyberinfrastructure to promote Model - Data Integration Robert Cook, Yaxing Wei, and Suresh S. Vannan Oak Ridge National Laboratory Presented at the Model-Data.

The US Long Term Ecological Research (LTER) Network: Site and Network Level Information Management Kristin Vanderbilt Department of Biology University.

Special Considerations for Archiving Data from Field Observations A Presentation for “International Workshop on Strategies for Preservation of and Open.

Data Organization Quality Assurance and Transformations.

Managing Your Data: Assign Descriptive File Names Robert Cook Oak Ridge National Laboratory Version 1.0 Review Date.

Data Management Practices for Early Career Scientists: Closing Robert Cook Environmental Sciences Division Oak Ridge National Laboratory Oak Ridge, TN.

SWRCBSWRCBSWRCBSWRCB AB2886 Implementation San Jose Training San Jose Training July 30, 2001 Marilyn R. Arsenault ArsenaultLegg, Inc.

Hands-On Microsoft Windows Server 2008 Chapter 7 Configuring and Managing Data Storage.

Lesson 9: SOFTWARE ICT Fundamentals 2nd Semester SY

Water quality data - providing information for management

Data quality & VALIDATION

NASA Earth Science Data Stewardship

Lecture 24: Uncertainty and Geovisualization

Graphical Data Engineering

Open Exeter Project Team

Slide Template for Module 2: Types, Formats, and Stages of Data

Miscellaneous Excel Combining Excel and Access.

CHP - 9 File Structures.

Standardization Promotes Biogeochemical Data Management and Use in Multidisciplinary Environmental Research Yaxing Wei, Suresh Vannan, Robert B. Cook,

INTRODUCTION TO GEOGRAPHICAL INFORMATION SYSTEM

Data Sharing We all need data

Required Data Files Review

Active Data Management in Space 20m DG

Retain Data Commensurate with Value

Research Data Management: Store and Analyze

Data Management: Documentation & Metadata

Considerations for the Paperless Laboratory

Storage Basic recommendations:

Alexander Sterin, Dmitri Nikolaev, RIHMI-WDC

Chapter 2: Operating-System Structures

The role of metadata in census data dissemination

Chapter 2: Operating-System Structures

Instructor Materials Chapter 5: Ensuring Integrity

Presentation transcript:

Fundamental Practices for Preparing Data Sets Bob Cook Environmental Sciences Division Oak Ridge National Laboratory

The 20-Year Rule The metadata accompanying a data set should be written for a user 20 years into the future--what does that investigator need to know to use the data? Prepare the data and documentation for a user who is unfamiliar with your project, methods, and observations NRC (1991)

Information Entropy Information Content Time Paper publication Specific details are lost General details are lost Information Content Retirement or career change This graph illustrates the phenomenon of “information entropy”, associated with research. At the time of the research project, a scientists memory is fresh. Details about the development of the dataset are easily recalled, and it is a good time to document information about the process. Over time, memory of the details begins to fade. A variety of circumstances can intervene, and eventually detailed knowledge about the dataset fades. Without a metadata record, this data might be unusable. A dataset it not considered complete without a metadata record to accompany it. Accident or technology change Loss of data developer Time (From Michener et al 1997)

Metadata needed to Understand Data The details of the data …. Parameter name Measurement date Sample ID location An investigator and their team need only a small amount of metadata. They know that the steamwater sample was collected at the weir near the orchard and was analyzed by the ion chromatograph in Room 220. But for others outside the team, more information is needed. Courtesy of Raymond McCord, ORNL

Metadata Needed to Understand Data Units method Parameter def. lab field Method def. method Units def. parameter name Units media date words QA def. QA flag Measurement Record system records generator To share the data more broadly more information is needed: analytical method, uncertainty, and units Geocoordinates, elevation, datum Sample ID location date GIS org.type name custodian address, etc. coord. elev. type depth Sample def. Type, date location generator 5

Fundamental Data Practices Define the contents of your data files Use consistent data organization Use stable file formats Assign descriptive file names Preserve information Perform basic quality assurance Provide documentation Protect your data Main part of my talk will be to describe these 8 fundamental data practices. Basic habits that will help improve the information content of your data and make it easier to share with others

1. Define the contents of your data files Content flows from science plan (hypotheses) and is informed from requirements of final archive. Keep a set of similar measurements together in one file same investigator, methods, time basis, and instrument No hard and fast rules about contents of each file. Researchers who later use your relatively large data file won't have to process many small files individually.

1. Define the Contents of Your Data Files Define the parameters Ehleringer, et al. 2010. LBA-ECO CD-02 Carbon, Nitrogen, Oxygen Stable Isotopes in Organic Material, Brazil. Data set. Available on-line [http://daac.ornl.gov] from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. doi: 10.3334/ORNLDAAC/983 NACP Data Management Practices, February 3, 2013

1. Define the Contents of Your Data Files Define the parameters (cont) Be consistent Choose a format for each parameter, Explain the format in the metadata, and Use that format throughout the file Use commonly accepted parameter names and units (SI Units) e.g., use yyyymmdd; January 2, 1999 is 19990102 Use 24-hour notation (13:30 hrs instead of 1:30 p.m. and 04:30 instead of 4:30 a.m.) Report in both local time and Coordinated Universal Time (UTC) See Hook et al. (2010) for additional examples of parameter formats Check your discipline and adopt commonly used parameter names and units

1. Define the Contents of Your Data Files Define the parameters (cont) Check your discipline and adopt commonly used parameter names and units Global Change Master Directory Names only Climate and Forecast Standard Names FLUXNET Standards AmeriFlux

1. Define the contents of your data files Site Table …… Ehleringer, et al. 2010. LBA-ECO CD-02 Carbon, Nitrogen, Oxygen Stable Isotopes in Organic Material, Brazil. Data set. Available on-line [http://daac.ornl.gov] from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. doi: 10.3334/ORNLDAAC/983

2. Use consistent data organization (one good approach) Each row in a file represents a complete record, and the columns represent all the parameters that make up the record. Station Date Temp Precip Units YYYYMMDD C mm HOGI 19961001 12 19961002 14 3 19961003 19 -9999 Choose one way to organize your data and stick to that way throughout the file Make month or site a parameter and have all the data in one large file. There is an upper limit to the size of files, though. Note: -9999 is a missing value code for the data set

2. Use consistent data organization (a 2nd good approach) Parameter name, value, and units are placed in individual rows. This approach is used in relational databases. Station Date Parameter Value Unit HOGI 19961001 Temp 12 C 19961002 14 Precip mm 3 For data that is sparse (a sparsely populated matrix)

2. Use consistent data organization (cont) Be consistent in file organization and formatting don’t change or re-arrange columns Include header rows (first row should contain file name, data set title, author, date, and companion file names) column headings should describe content of each column, including one row for parameter names and one for parameter units

Collaboration and Data Sharing 2. Use consistent data organization (cont) Collaboration and Data Sharing A personal example of bad practice… Courtesy of Stefanie Hampton, NCEAS

3. Use stable file formats Los[e] years of critical knowledge because modern PCs could not always open old file formats. Lesson: Avoid proprietary formats. They may not be readable in the future http://news.bbc.co.uk/2/hi/6265976.stm 16

3. Use stable file formats (cont) Use text-based comma separated values (csv) Aranibar, J. N. and S. A. Macko. 2005. SAFARI 2000 Plant and Soil C and N Isotopes, Southern Africa, 1995-2000. Data set. Available on-line [http://daac.ornl.gov/] from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. doi:10.3334/ORNLDAAC/783

4. Assign descriptive file names Use descriptive file names Unique Reflect contents ASCII characters only Avoid spaces Bad: Mydata.xls 2001_data.csv best version.txt Better: bigfoot_agro_2000_gpp.tiff Project Name File Format Year Site name What was measured 18

Courtesy of PhD Comics

4. Assign descriptive file names Organize files logically Make sure your file system is logical and efficient Biodiversity Lake Biodiv_H20_heatExp_2005_2008.csv Experiments Biodiv_H20_predatorExp_2001_2003.csv … As a corollary, need to make sure that your well-named files are organized in a logical directory structure Biodiv_H20_planktonCount_start2001_active.csv Field work Biodiv_H20_chla_profiles_2003.csv … Grassland From S. Hampton

5. Preserve information Keep your raw data raw No transformations, interpolations, etc, in raw file Raw Data File Processing Script (R) ### Giles_zoop_temp_regress_4jun08.r ### Load data Giles<-read.csv("Giles_zoopCount_Diel_2001_2003.csv") ### Look at the data Giles plot(COUNT~ TEMPC, data=Giles) ### Log Transform the independent variable (x+1) Giles$Lcount<-log(Giles$COUNT+1) ### Plot the log-transformed y against x plot(Lcount ~ TEMPC, data=Giles) Giles_zoopCount_Diel_2001_2003.csv TAX COUNT TEMPC C 3.97887358 12.3 F 0.97261354 12.7 M 0.53051648 12.1 F 0 11.9 C 10.8823893 12.8 F 43.5295571 13.1 M 21.7647785 14.2 N 61.6668725 12.9 … Do actions such as transformations, sorts and analyses in separate files – e.g., here I’m using R to call on this data set and to make plots of the data and do a log transform – this way the changes won’t be retained in the original, raw data file. You may want to make your raw data file “read only”. From S. Hampton

5. Preserve information (cont) Use a scripted language to process data R Statistical package (free, powerful) SAS MATLAB Processing scripts are records of processing Scripts can be revised, rerun Graphical User Interface-based analyses may seem easy, but don’t leave a record

6. Perform basic quality assurance Assure that data are delimited and line up in proper columns Check that there no missing values (blank cells) for key parameters Scan for impossible and anomalous values Perform and review statistical summaries Map location data (lat/long) and assess errors No better QA than to analyze data

6. Perform basic quality assurance (con’t) Place geographic data on a map to ensure that geographic coordinates are correct. Place geographic data on a map to ensure that geographic coordinates are correct. Example is from OBFS Web site. Illinois site (longitude has the wrong sign, should be negative) Brazil Site (latitude is positive, but should be negative (should be south of the equator).

Model-Observation Intercomparison 6. Perform basic quality assurance (con’t) Plot information to examine outliers NACP Site Synthesis Model-Observation Intercomparison Model X uses UTC time, all others use Eastern Time Data from the North American Carbon Program Site Synthesis (Courtesy of Dan Ricciuto and Yaxing Wei, ORNL)

Model-Observation Intercomparison 6. Perform basic quality assurance (con’t) Plot information to examine outliers NACP Site Synthesis Model-Observation Intercomparison Data from the North American Carbon Program Site Synthesis (Courtesy of Dan Ricciuto and Yaxing Wei, ORNL)

7. Provide Documentation / Metadata What does the data set describe? Why was the data set created? Who produced the data set and Who prepared the metadata? When and how frequently were the data collected? Where were the data collected and with what spatial resolution? (include coordinate reference system) How was each parameter measured? How reliable are the data?; what is the uncertainty, measurement accuracy?; what problems remain in the data set? What assumptions were used to create the data set? What is the use and distribution policy of the data set? How can someone get a copy of the data set? Provide any references to use of data in publication(s)

8. Protect data Ensure that file transfers are done without error Compare checksums before and after transfers Example tools to generate checksums http://www.pc-tools.net/win32/md5sums/ http://corz.org/windows/software/checksum/ Checksums to verify that files received are the same as those sent

8. Protect data (cont) Create back-up copies often Ideally three copies original, one on-site (external), and one off-site Frequency based on need / risk Remember that disaster may strike not just your computer, but also your building or city. Public repositories provide mirroring of your data at multiple, managed sites.

8. Protect data (cont) Use reliable devices for backups Removable storage device Managed network drive Raid, tape system Managed cloud file-server DropBox, Amazon Simple Storage Service (S3), Carbonite 30

8. Protect data (cont) Test your backups Automatically test backup copies Media degrade over time Annually test copies using checksums or file compare Know that you can recover from a data loss Periodically test your ability to restore information (at least once a year) Each year simulate an actual loss, by trying to recover solely from the backed up copies 31

Fundamental Data Practices Define the contents of your data files Use consistent data organization Use stable file formats Assign descriptive file names Preserve information Perform basic quality assurance Provide documentation Protect your data

Proper Curation Enables Data Reuse Time Information Content Planning Collection Assure Documentation Archive Sufficient for Sharing and Reuse Compile information during different parts of the data life cycle. Need to create sufficient information content so that users can find, understand, and use data 20 years into the future. The time axis should be shorter than one year to fully capture the information.

Best Practices: Conclusions Data management is important in today’s science Well organized data: enables researchers to work more efficiently can be shared easily by collaborators can potentially be re-used in ways not imagined when originally collected

Bibliography Cook, Robert B., Richard J. Olson, Paul Kanciruk, and Leslie A. Hook. 2001. Best Practices for Preparing Ecological Data Sets to Share and Archive. Bulletin of the Ecological Society of America, Vol. 82, No. 2, April 2001. Hook, L. A., T. W. Beaty, S. Santhana-Vannan, L. Baskaran, and R. B. Cook. 2010. Best Practices for Preparing Environmental Data Sets to Share and Archive. http://dx.doi.org/10.3334/ORNLDAAC/BestPractices-2010 Michener, W. K., J. W. Brunt, J. Helly, T. B. Kirchner, and S. G. Stafford. 1997. Non- Geospatial Metadata for Ecology. Ecological Applications. 7:330-342.

Questions?

Additional Slides