Best Practices for Preserving Data Bob Cook Environmental Sciences Division Oak Ridge National Laboratory.

Slides:



Advertisements
Similar presentations
Organising and Documenting Data Stuart Macdonald EDINA & Data Library DIY Research Data Management Training Kit for Librarians.
Advertisements

Peter Griffith and Megan McGroddy 4 th NACP All Investigators Meeting February 3, 2013 Expectations and Opportunities for NACP Investigators to Share and.
ORNL DAAC Experience With Digital Object Identifiers (DOIs) Bruce Wilson, ORNL DAAC Manager for NASA Data Center Managers telecon 22 Feb 2010.
John Porter Why this presentation? The forms data take for analysis are often different than the forms data take for archival storage Spreadsheets are.
Announcements ●Exam II range ; mean 72
The Visibility Information Exchange Web System (VIEWS): An Approach to Air Quality Data Management and Presentation In a broader sense, VIEWS facilitates.
1 ORNL DAAC: Data and Services Robert Cook and Suresh SanthanaVannan Environmental Sciences Division Oak Ridge National Laboratory Oak Ridge, TN Presentation.
Fundamental Practices for Preparing Data Sets Bob Cook Environmental Sciences Division Oak Ridge National Laboratory 5 th NACP Principal Investigator’s.
Elements of a Data Management Plan Alison Boyer Environmental Sciences Division Oak Ridge National Laboratory.
Elements of a Data Management Plan
Introduction to Ecoinformatics: Past, Present & Future William Michener LTER Network Office, University of New Mexico January 2007.
ORGANIZING AND STRUCTURING DATA FOR DIGITAL PROJECTS Suzanne Huffman Digital Resources Librarian Simpson Library.
U.S. Department of the Interior U.S. Geological Survey Data Management Training Modules: Best Practices for Preparing Science Data to Share.
SAFARI 2000 Data Activities at the ORNL DAAC Bob Cook, Les Hook, Stan Attenberger, Dick Olson, and Tim Rhyne Oak Ridge National Laboratory.
From Best Practices for Preserving Data by Bob Cook, Environmental Sciences Division Oak Ridge National Laboratory Module 9.
Fundamental Practices for Preparing Data Sets Robert Cook ORNL Distributed Active Archive Center Environmental Sciences Division Oak Ridge National Laboratory.
U.S. Department of the Interior U.S. Geological Survey Best Practices for Preparing Science Data to Share.
Inter-American Workshop on Environmental Data Access Panel discussion on scientific and technical issues Merilyn Gentry, LBA-ECO Data Coordinator NASA.
1 CDIAC Data Support for SPRUCE and NGEE Les A. Hook and Ranjeet Devarakonda Environmental Sciences Division Oak Ridge National Laboratory CDIAC User Working.
Preserving the Scientific Record: Establishing Relationships with Archives Matthew Mayernik National Center for Atmospheric Research Version 1.0 Review.
Data Management Plans Bill Michener University Libraries and Biology Dept. University of New Mexico.
Data Organization Data Collection and Spreadsheets.
AON Data Questionnaire Results 21 Respondents Last Updated 27 March 2007 First AON PI Meeting Scot Loehrer, Jim Moore.
Recordkeeping for Good Governance Toolkit Digital Recordkeeping Guidance Funafuti, Tuvalu – June 2013.
Best Practices for Preparing Data Sets Non-CO2 Synthesis Workshop Boulder, Colorado October 2008 Compiled by: A. Dayalu, Harvard University Adapted.
Why Create a Data Management Plan? Ruth Duerr National Snow and Ice Data Center Version 1.0 Review Date Section: Data Management Plans.
Elements of a Data Management Plan Bill Michener University Libraries University of New Mexico Data Management Practices for.
Introduction to Geospatial Metadata – ISO 191** Metadata National Centers for Environmental Information (NCEI)
Best Practices for Collecting Data Bill Corey Data Consultant University of Virginia Library Andrea Horne Denton Health Sciences Data.
Managing Your Data: Backing Up Your Data Robert Cook Oak Ridge National Laboratory Version 1.0 Review Date.
CC&E Best Data Management Practices, April 19, 2015 Please take the Workshop Survey 1.
Enhancing Linkages Between Projects and Datasets: Examples from LBA-ECO for NACP Lisa Wilcox, Amy L. Morrell,
Data Citation and Data Attribution A View from the Data Center Perspective Bruce E. Wilson Group Lead, Client & Collaboration Technologies Oak Ridge National.
Managing the Impacts of Programmatic Scale and Enhancing Incentives for Data Archiving A Presentation for “International Workshop on Strategies for Preservation.
Fundamental Practices for Preparing Data Sets Bob Cook Environmental Sciences Division Oak Ridge National Laboratory.
Module 6. Data Management Plans  Definitions ◦ Quality assurance ◦ Quality control ◦ Data contamination ◦ Error Types ◦ Error Handling  QA/QC best practices.
Data Management Practices for Early Career Scientists: Closing Robert Cook Environmental Sciences Division Oak Ridge National Laboratory Oak Ridge, TN.
Introduction to Geospatial Metadata – FGDC CSDGM National Coastal Data Development Center A division of the National Oceanographic Data Center Please .
Global map layers Additional global data sets such as Hydrology data (Hydrosheds), new and updated Landcover data (Globcover), demographic data and others.
Managing Your Data: Backing Up Your Data Robert Cook Oak Ridge National Laboratory Section: Local Data Management Version 1.0 October 2012.
DATABASES Southern Region CEO Wednesday 13 th October 2010.
WK 13 - How to Prepare Ecological Data Sets for Effective Analysis and Sharing 2:00 PM-5:00 PM August 1 st, 2010.
Fundamental Practices for Preparing Data Sets Bob Cook Environmental Sciences Division Oak Ridge National Laboratory.
1 NARSTO Quality Systems Science Center Les A. Hook and Sigurd W. Christensen NARSTO QSSC Environmental Sciences Division Oak Ridge National Laboratory.
Managing the Impacts of Change on Archiving Research Data A Presentation for “International Workshop on Strategies for Preservation of and Open Access.
Responsible Data Use and Local Data Management Ruth Duerr National Snow and Ice Data Center.
Data Management 101 for Earth Scientists Managing Your Data Robert Cook Environmental Sciences Division Oak Ridge National Laboratory.
Managing and Curating Data Chapter 8. Introduction Data organization Data management Data curation Raw data is required to repeat a scientific study Any.
Managing Your Data: Assign Descriptive File Names Robert Cook Oak Ridge National Laboratory Section: Local Data Management Version 1.0 October 2012.
Reconstituting the Ocean: a tale from U.S. JGOFS Cyndy Chandler (MCG, WHOI) U.S. JGOFS Data Management Office and Ocean Carbon and Biogeochemistry Coordination.
Biological and Chemical Oceanography Data Management Office slide 1 of 19 CAMEO Data Management Bob Groman Biological and Chemical Oceanography Data Management.
DataONE: Preserving Data and Enabling Data-Intensive Biological and Environmental Research Bob Cook Environmental Sciences Division Oak Ridge National.
IDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF ).
Cyberinfrastructure to promote Model - Data Integration Robert Cook, Yaxing Wei, and Suresh S. Vannan Oak Ridge National Laboratory Presented at the Model-Data.
ORNL DAAC: Introduction Bob Cook ORNL DAAC Environmental Sciences Division Oak Ridge National Laboratory.
The US Long Term Ecological Research (LTER) Network: Site and Network Level Information Management Kristin Vanderbilt Department of Biology University.
Metadata Training for Gulf Restoration Partners Module 1 – Introduction to Metadata and Metadata Standards.
Terra MODIS Collection 4 / 4.5 and Aqua MODIS Collection 4; Sinusoidal Projection Data from 2000 to present; 8-day, 16-day, or annual composites Sites.
Special Considerations for Archiving Data from Field Observations A Presentation for “International Workshop on Strategies for Preservation of and Open.
Data Organization Quality Assurance and Transformations.
Creating Documentation and Metadata: Creating a Citation for Your Data Robert Cook Oak Ridge National Laboratory Section: Local Data Management Copyright.
Data Systems Integration Committee of the Earth Science Data System Working Group (ESDSWG) on Data Quality Robert R. Downs 1 Yaxing Wei 2, and David F.
Global Change Master Directory (GCMD) Mission “To assist the scientific community in the discovery of Earth science data, related services, and ancillary.
Managing Your Data: Assign Descriptive File Names Robert Cook Oak Ridge National Laboratory Version 1.0 Review Date.
Data Management Practices for Early Career Scientists: Closing Robert Cook Environmental Sciences Division Oak Ridge National Laboratory Oak Ridge, TN.
SWRCBSWRCBSWRCBSWRCB AB2886 Implementation San Jose Training San Jose Training July 30, 2001 Marilyn R. Arsenault ArsenaultLegg, Inc.
Understanding the Value and Importance of Proper Data Documentation 5-1 At the conclusion of this module the participant will be able to List the seven.
NASA Earth Science Data Stewardship
Standardization Promotes Biogeochemical Data Management and Use in Multidisciplinary Environmental Research Yaxing Wei, Suresh Vannan, Robert B. Cook,
Fundamental Practices for Preparing Data Sets
Presentation transcript:

Best Practices for Preserving Data Bob Cook Environmental Sciences Division Oak Ridge National Laboratory

Environmental Information Management Course, May 25, 2011 Presenter Bob Cook –Biogeochemist –Chief Scientist, NASA’s ORNL Distributed Active Archive Center for Biogeochemical Dynamics –Associate Editor, Biogeochemistry –Oak Ridge National Laboratory, Oak Ridge, TN –Phone: ORNL, Oak Ridge, TN

Environmental Information Management Course, May 25, 2011 Best Practices for Preserving Data Introduction 10 Best Practices ORNL DAAC DataONE (Observation Network For Earth) EIM Lecture Agenda:

Planning Review Problem Definition Research Objectives Analysis and modeling –Cycles of Research –“An Information View” Measurements Publications Original Observations Data Center/ Archive Planning Discover, analyze, and extract information Multi-Domain Observations Courtesy of Raymond McCord, ORNL Cycles of Research – An Information View 4

Environmental Information Management Course, May 25, 2011 Benefits of Good Data Management Practices Short-term Spend less time doing data management and more time doing research Easier to prepare and use data for yourself Collaborators can readily understand and use data files Long-term (data publication) Scientists outside your project can find, understand, and use your data to address broad questions You get credit for archived data products and their use in other papers Sponsors protect their investment 5

Environmental Information Management Course, May 25, 2011 Metadata Information to let you find, understand, and use the data – descriptors –documentation 6

Environmental Information Management Course, May 25, 2011 The 20-Year Rule The metadata accompanying a data set should be written for a user 20 years into the future--what does that investigator need to know to use the data? Prepare the data and documentation for a user who is unfamiliar with your project, methods, and observations NRC (1991) 7

Environmental Information Management Course, May 25, 2011 Proper Curation Enables Data Reuse Time Information Content Planning Collection Assure Documentation Archive Sufficient for Sharing and Reuse

Environmental Information Management Course, May 25, 2011 Proper Curation Enables Data Reuse Time Information Content Planning Collection Assure Documentation Archive Sufficient for Sharing and Reuse

Environmental Information Management Course, May 25, 2011 Metadata needed to Understand Data The details of the data …. Measurement date Sample ID Parameter name location 10

Metadata Needed to Understand Data – Measurement QA flag media generator method date sample ID parameter name location records Units Sample def. type date location generator lab field Method def. units method Parameter def. org.type name custodian address, etc. coord. elev. type depth Record system date words, words. QA def. Units def. GIS 11

Environmental Information Management Course, May 25, 2011 Fundamental Data Practices 1.Define the contents of your data files 2.Use consistent data organization 3.Use stable file formats 4.Assign descriptive file names 5.Preserve information 6.Perform basic quality assurance 7.Assign descriptive data set titles 8.Provide documentation 9.Protect your data 10.Acknowledge contributions 12

Environmental Information Management Course, May 25, Define the contents of your data files Content flows from science plan (hypotheses) and is informed from requirements of final archive Keep a set of similar measurements together in one file (e.g., same investigator, methods, time basis, and instruments) –No hard and fast rules about contents of each files 13

Environmental Information Management Course, May 25, Define the Contents of Your Data Files Define the parameters Use commonly accepted parameter names that describe the contents (e.g., precip for precipitation) Use consistent capitalization (e.g., not temp, Temp, and TEMP in same file) Explicitly state units of reported parameters in the data file and the metadata –SI units are recommended 14

Environmental Information Management Course, May 25, Define the Contents of Your Data Files Define the parameters (cont) Choose a format for each parameter, explain the format in the metadata, and use that format throughout the file –e.g., use yyyymmdd; January 2, 1999 is –Use 24-hour notation (13:30 hrs instead of 1:30 p.m. and 04:30 instead of 4:30 a.m.) –Report in both local time and Coordinated Universal Time (UTC) What’s the difference between UTC and GMT? What is the time zone of Arizona? –See Hook et al. (2010) for additional examples of parameter formats 15

Environmental Information Management Course, May 25, Define the Contents of Your Data Files (cont) 16 Scholes (2005)

Environmental Information Management Course, May 25, Define the contents of your data files Site Table 17 Site NameSite Code Latitude (deg ) Longitude (deg) Elevation (m) Date Kataba (Mongu)k Pandamatengap Skukuza Flux Tower skukuza Scholes, R. J SAFARI 2000 Woody Vegetation Characteristics of Kalahari and Skukuza Sites. Data set. Available on-line [ from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. doi: /ORNLDAAC/777 ……

Environmental Information Management Course, May 25, Use consistent data organization (one good approach) StationDateTempPrecip Units YYYYMMDDCmm HOGI HOGI HOGI Note: is a missing value code for the data set 18 Each row in a file represents a complete record, and the columns represent all the parameters that make up the record.

Environmental Information Management Course, May 25, Use consistent data organization (a 2 nd good approach) StationDateParameterValueUnit HOGI Temp12C HOGI Temp14C HOGI Precip0mm HOGI Precip3mm 19 Parameter name, value, and units are placed in individual columns. This approach is used in relational databases.

Environmental Information Management Course, May 25, Use consistent data organization (cont) Be consistent in file organization and formatting –don’t change or re-arrange columns –Include header rows (first row should contain file name, data set title, author, date, and companion file names) –column headings should describe content of each column, including one row for parameter names and one for parameter units 20

Environmental Information Management Course, May 25, Collaboration and Data Sharing A personal example of bad practice…

Environmental Information Management Course, May 25, 2011 Allometry Data at ORNL DAAC: Providing data for users –Data are also offered in comma separated variable (csv) text files 22

Environmental Information Management Course, May 25, Use stable file formats Use text (ASCII) file formats for tabular data –(e.g.,.txt or.csv (comma-separated values) –within the ASCII file, delimit fields using commas, pipes (|), tabs, or semicolons (in order of preference) Use GeoTiffs / shapefiles for spatial data Avoid proprietary formats –They may not be readable in the future 23

Environmental Information Management Course, May 25, Use stable file formats (cont) 24 Aranibar, J. N. and S. A. Macko SAFARI 2000 Plant and Soil C and N Isotopes, Southern Africa, Data set. Available on-line [ from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. doi: /ORNLDAAC/783

Environmental Information Management Course, May 25, Assign descriptive file names File names should be unique and reflect the file contents Bad file names –Mydata –2001_data A better file name –bigfoot_agro_2000_gpp.tif BigFoot is the project name Agro is the field site name 2000 is the calendar year GPP represents Gross Primary Productivity data tif is the file type – GeoTIFF 25

Environmental Information Management Course, May 25, 2011 Dilbert’s file naming convention 26

Environmental Information Management Course, May 25,

Environmental Information Management Course, May 25, 2011 Biodiversity Lake Experiments Field work Grassland Biodiv_H20_heatExp_2005_2008.csv Biodiv_H20_predatorExp_2001_2003.csv Biodiv_H20_planktonCount_start2001_active.csv Biodiv_H20_chla_profiles_2003.csv … … 4. Assign descriptive file names Organize files logically Make sure your file system is logical and efficient 28 From S. Hampton

Environmental Information Management Course, May 25, Preserve information –Keep your raw data raw –No transformations, interpolations, etc, in raw file Giles_zoopCount_Diel_2001_2003.csv TAXCOUNTTEMPC C F M F011.9 C F M N –… –### Giles_zoop_temp_regress_4jun08.r –### Load data –Giles<- read.csv("Giles_zoopCount_Diel_2001_2003.csv") –### Look at the data –Giles –plot(COUNT~ TEMPC, data=Giles) –### Log Transform the independent variable (x+1) –Giles$Lcount<-log(Giles$COUNT+1) –### Plot the log-transformed y against x –plot(Lcount ~ TEMPC, data=Giles) 29 Raw Data File Processing Script (R) From S. Hampton

Environmental Information Management Course, May 25, Preserve information (cont) Use a scripted language to process data –R Statistical package (free, powerful) –SAS –MATLAB Processing scripts are records of processing –Scripts can be revised, rerun Graphical User Interface-based analyses may seem easy, but don’t leave a record 30

Environmental Information Management Course, May 25, Perform basic quality assurance Assure that data are delimited and line up in proper columns Check that there no missing values (blank cells) for key parameters Scan for impossible and anomalous values Perform and review statistical summaries Map location data (lat/long) and assess errors No better QA than to analyze data 31

Environmental Information Management Course, May 25, Perform basic quality assurance (con’t) Place geographic data on a map to ensure that geographic coordinates are correct. 32

Environmental Information Management Course, May 25, Perform basic quality assurance (con’t) Plot information to examine outliers 33 Model X uses UTC time, all others use Eastern Time Data from the North American Carbon Program Interim Synthesis (Courtesy of Yaxing Wei, ORNL) Model-Observation Intercomparison

Environmental Information Management Course, May 25, Perform basic quality assurance (con’t) Plot information to examine outliers 34 Data from the North American Carbon Program Interim Synthesis (Courtesy of Yaxing Wei, ORNL) Model-Observation Intercomparison

Environmental Information Management Course, May 25, Assign descriptive data set titles Data set titles should ideally describe the type of data, time period, location, and instruments used (e.g., Landsat 7). Titles should be concise (< 85 characters) Data set title should be similar to names of data files –Good: SAFARI 2000 Upper Air Meteorological Profiles, Skukuza, Dry Seasons " –Bad: “Productivity Data” 35

Environmental Information Management Course, May 25, Provide Documentation / Metadata What does the data set describe? Why was the data set created? Who produced the data set and Who prepared the metadata? How was each parameter measured? What assumptions were used to create the data set? When and how frequently were the data collected? Where were the data collected and with what spatial resolution? (include coordinate reference system) How reliable are the data?; what is the uncertainty, measurement accuracy?; what problems remain in the data set? What is the use and distribution policy of the data set? How can someone get a copy of the data set? Provide any references to use of data in publication(s) 36

Environmental Information Management Course, May 25, Protect data Create back-up copies often –Ideally three copies original, one on-site (external), and one off-site –Frequency based on need / risk Know that you can recover from a data loss –Periodically test your ability to restore information 37

Environmental Information Management Course, May 25, Protect data (cont) Ensure that file transfers are done without error –Compare checksums before and after transfers Example tools to generate checksums

Environmental Information Management Course, May 25, Acknowledge contributions –Who contributed data to your data set? –How should your data be acknowledged / used by others? All data have been collected and curated by Dr. C. Hippocrepis at All Rotifers University. Data may be freely downloaded for non-commercial uses. Please contact Dr. C. Hippocrepis to let us know if you use these data. We report uses of the public data to our funders, and it is extremely helpful to know if others have been able to use these data in teaching or 39 From S. Hampton

Environmental Information Management Course, May 25, 2011 When your project is over, where should the data be archived? Part of project planning Contact archive / data center early to find out their requirements –What additional data management steps would they like you to do? Suggested data centers / archives: 40 –ORNL DAAC –DataONE (Observation Network for Earth) –National Biological Information Infrastructure –Dryad –Ecological Archives –Knowledge Network for Biocomplexity (KNB)

Environmental Information Management Course, May 25, 2011 Fundamental Data Practices 1.Define the contents of your data files 2.Use consistent data organization 3.Use stable file formats 4.Assign descriptive file names 5.Preserve information 6.Perform basic quality assurance 7.Assign descriptive data set titles 8.Provide documentation 9.Protect your data 10.Acknowledge contributions 41

Environmental Information Management Course, May 25, 2011 Best Practices: Conclusions Data management is important in today’s science Well organized data: –enables researchers to work more efficiently –can be shared easily by collaborators –can potentially be re-used in ways not imagined when originally collected 42

Environmental Information Management Course, May 25, 2011 Bibliography Ball, C. A., G. Sherlock, and A. Brazma Funding high-throughput data sharing. Nature Biotechnology 22: doi: /nbt Borer, ET., EW. Seabloom, M.B. Jones, and M. Schildhauer Some Simple Guidelines for Effective Data Management,Bulletin of the Ecological Society of America. 90(2): Christensen, S. W. and L. A. Hook NARSTO Data and Metadata Reporting Guidance. Provides reference tables of chemical, physical, and metadata variable names for atmospheric measurements. Available on-line at: Cook, Robert B, Richard J. Olson, Paul Kanciruk, and Leslie A. Hook Best Practices for Preparing Ecological Data Sets to Share and Archive. Bulletin of the Ecological Society of America, Vol. 82, No. 2, April Hook, L. A., T. W. Beaty, S. Santhana-Vannan, L. Baskaran, and R. B. Cook. June Best Practices for Preparing Environmental Data Sets to Share and Archive. Kanciruk, P., R.J. Olson, and R.A. McCord Quality Control in Research Databases: The US Environmental Protection Agency National Surface Water Survey Experience. In: W.K. Michener (ed.). Research Data Management in the Ecological Sciences. The Belle W. Baruch Library in Marine Science, No. 16,

Environmental Information Management Course, May 25, 2011 Bibliography Management in the Ecological Sciences. The Belle W. Baruch Library in Marine Science, No. 16, National Research Council (NRC) Solving the Global Change Puzzle: A U.S. Strategy for Managing Data and Information. Report by the Committee on Geophysical Data of the National Research Council Commission on Geosciences, Environment and Resources. National Academy Press, Washington, D.C.Michener, W K Meta-information concepts for ecological data management. Ecological Informatics. 1:3-7. Michener, W.K. and J.W. Brunt (ed.) Ecological Data: Design, Management and Processing, Methods in Ecology, Blackwell Science. 180p. Michener, W. K., J. W. Brunt, J. Helly, T. B. Kirchner, and S. G. Stafford Non- Geospatial Metadata for Ecology. Ecological Applications. 7: U.S. EPA Environmental Protection Agency Substance Registry System (SRS). SRS provides information on substances and organisms and how they are represented in the EPA information systems. Available on-line at: USGS Metadata in plain language. Available on-line at: