Fundamental Practices for Preparing Data Sets Bob Cook Environmental Sciences Division Oak Ridge National Laboratory.

1 Fundamental Practices for Preparing Data Sets
Bob Cook
Environmental Sciences Division, Oak Ridge National Laboratory

2 The 20-Year Rule
NASA TE Best Data Management Practices, May 2, 2013
- The metadata accompanying a data set should be written for a user 20 years into the future: what does that investigator need to know to use the data?
- Prepare the data and documentation for a user who is unfamiliar with your project, methods, and observations.
NRC (1991)

3 Information Entropy
[Figure: Information Content plotted against Time. Events marked along the curve: paper publication; specific details are lost; general details are lost; accident or technology change; retirement or career change; loss of data developer.]
(From Michener et al. 1997)

4 Metadata Needed to Understand Data
The details of the data ….
[Diagram: a measurement linked to its date, sample ID, parameter name, and location]
Courtesy of Raymond McCord, ORNL

5 Metadata Needed to Understand Data
[Diagram: the measurement record expanded to show the definitions behind each field. Labels in the figure, in extraction order: Measurement; QA flag; media; generator; method; date; Sample ID; parameter name; location; records; Units; Sample def.; Type, date; location; generator; lab; field; Method def.; Units; method; Parameter def.; org. type; name; custodian; address, etc.; coord.; elev.; type; depth; Record system; date; words; QA def.; Units def.; GIS]

6 Fundamental Data Practices
1. Define the contents of your data files
2. Use consistent data organization
3. Use stable file formats
4. Assign descriptive file names
5. Preserve information
6. Perform basic quality assurance
7. Provide documentation
8. Protect your data

7 1. Define the contents of your data files
- Content flows from the science plan (hypotheses) and is informed by the requirements of the final archive.
- Keep a set of similar measurements together in one file: same investigator, methods, time basis, and instrument.
  – There are no hard and fast rules about the contents of each file.

8 1. Define the Contents of Your Data Files
Define the parameters
NACP Data Management Practices, February 3, 2013
Ehleringer, et al. 2010. LBA-ECO CD-02 Carbon, Nitrogen, Oxygen Stable Isotopes in Organic Material, Brazil. Data set. Available on-line [http://daac.ornl.gov] from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. doi: 10.3334/ORNLDAAC/983

9 1. Define the Contents of Your Data Files
Define the parameters (cont)
- Be consistent
- Choose a format for each parameter,
  – Explain the format in the metadata, and
  – Use that format throughout the file
- Use commonly accepted parameter names and units (SI units)
  – e.g., for dates use yyyymmdd; January 2, 1999 is 19990102
  – Use 24-hour notation (13:30 hrs instead of 1:30 p.m. and 04:30 instead of 4:30 a.m.)
  – Report in both local time and Coordinated Universal Time (UTC)
  – See Hook et al. (2010) for additional examples of parameter formats
- Check your discipline and adopt commonly used parameter names and units
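
The date and time conventions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `format_timestamp` and the fixed UTC-5 local offset are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumption for illustration: the site keeps local standard time at UTC-5.
LOCAL_TZ = timezone(timedelta(hours=-5), name="LST")


def format_timestamp(local_dt: datetime) -> dict:
    """Return yyyymmdd dates and 24-hour times in local time and in UTC."""
    if local_dt.tzinfo is None:
        # Treat naive timestamps as local time (a hypothetical convention).
        local_dt = local_dt.replace(tzinfo=LOCAL_TZ)
    utc_dt = local_dt.astimezone(timezone.utc)
    return {
        "date_local": local_dt.strftime("%Y%m%d"),
        "time_local": local_dt.strftime("%H:%M"),
        "date_utc": utc_dt.strftime("%Y%m%d"),
        "time_utc": utc_dt.strftime("%H:%M"),
    }


# 1:30 p.m. local time on January 2, 1999
record = format_timestamp(datetime(1999, 1, 2, 13, 30))
print(record)
```

Storing both the local and the UTC columns means a later user does not have to guess the offset; whichever format is chosen still needs to be explained in the metadata.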

10 1. Define the Contents of Your Data Files
Define the parameters (cont)
- Check your discipline and adopt commonly used parameter names and units
  – Global Change Master Directory (names only)
  – Climate and Forecast Standard Names
  – FLUXNET Standards, AmeriFlux

11 1. Define the contents of your data files
Site Table
[Table image not recoverable from the transcript]
Ehleringer, et al. 2010. LBA-ECO CD-02 Carbon, Nitrogen, Oxygen Stable Isotopes in Organic Material, Brazil. Data set. Available on-line [http://daac.ornl.gov] from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. doi: 10.3334/ORNLDAAC/983

12 2. Use consistent data organization (one good approach)
Each row in a file represents a complete record, and the columns represent all the parameters that make up the record.
    Station  Date      Temp  Precip
    Units    YYYYMMDD  C     mm
    HOGI     19961001  12    0
    HOGI     19961002  14    3
    HOGI     19961003  19    -9999
Note: -9999 is a missing value code for the data set
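
A minimal sketch of how a file laid out this way might be read, assuming the table is stored as comma-separated text with a units row directly under the parameter names. The function name `read_wide` is hypothetical; the `-9999` code is the one stated on the slide.

```python
import csv
import io

MISSING = "-9999"  # missing value code stated on the slide

# The slide's table written as CSV text (the comma delimiter is an assumption).
WIDE_CSV = """Station,Date,Temp,Precip
Units,YYYYMMDD,C,mm
HOGI,19961001,12,0
HOGI,19961002,14,3
HOGI,19961003,19,-9999
"""


def read_wide(text):
    """Return (units, records); missing value codes become None."""
    rows = list(csv.reader(io.StringIO(text)))
    names, unit_row, data = rows[0], rows[1], rows[2:]
    units = dict(zip(names[1:], unit_row[1:]))
    records = []
    for row in data:
        record = {"Station": row[0], "Date": row[1]}
        for name, value in zip(names[2:], row[2:]):
            record[name] = None if value == MISSING else float(value)
        records.append(record)
    return units, records


units, records = read_wide(WIDE_CSV)
print(units)
print(records[-1])
```

Converting the missing value code to `None` on read keeps it out of later averages and plots, where a literal -9999 would otherwise look like a measurement.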

13 2. Use consistent data organization (a 2nd good approach)
Parameter name, value, and units are placed in individual rows. This approach is used in relational databases.
    Station  Date      Parameter  Value  Unit
    HOGI     19961001  Temp       12     C
    HOGI     19961002  Temp       14     C
    HOGI     19961001  Precip     0      mm
    HOGI     19961002  Precip     3      mm
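
The two layouts carry the same information, so one can be derived from the other. The sketch below, with a hypothetical helper `wide_to_long` and hand-typed sample records, shows one way to turn the one-record-per-row layout into the one-measurement-per-row layout.

```python
# Hypothetical in-memory form of the one-record-per-row layout.
wide = [
    {"Station": "HOGI", "Date": "19961001", "Temp": 12, "Precip": 0},
    {"Station": "HOGI", "Date": "19961002", "Temp": 14, "Precip": 3},
]
units = {"Temp": "C", "Precip": "mm"}


def wide_to_long(records, units):
    """Emit one (Station, Date, Parameter, Value, Unit) row per measurement."""
    long_rows = []
    for parameter, unit in units.items():
        for rec in records:
            long_rows.append(
                (rec["Station"], rec["Date"], parameter, rec[parameter], unit)
            )
    return long_rows


long_rows = wide_to_long(wide, units)
for row in long_rows:
    print(*row)
```

Because every row names its own parameter and unit, new parameters can be added to the long layout without changing the column structure.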

14 2. Use consistent data organization (cont)
- Be consistent in file organization and formatting
  – Don’t change or re-arrange columns
  – Include header rows (the first row should contain the file name, data set title, author, date, and companion file names)
  – Column headings should describe the content of each column, including one row for parameter names and one for parameter units

15 2. Use consistent data organization (cont)
Collaboration and Data Sharing
A personal example of bad practice…
Courtesy of Stefanie Hampton, NCEAS

16 3. Use stable file formats
- Los[e] years of critical knowledge because modern PCs could not always open old file formats.
- Lesson: Avoid proprietary formats. They may not be readable in the future.
http://news.bbc.co.uk/2/hi/6265976.stm

17 3. Use stable file formats (cont)
Use text-based comma separated values (csv)
Aranibar, J. N. and S. A. Macko. 2005. SAFARI 2000 Plant and Soil C and N Isotopes, Southern Africa, 1995-2000. Data set. Available on-line [http://daac.ornl.gov/] from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. doi:10.3334/ORNLDAAC/783

18 4. Assign descriptive file names
Use descriptive file names
- Unique
- Reflect contents
- ASCII characters only
- Avoid spaces
Bad: Mydata.xls, 2001_data.csv, best version.txt
Better: bigfoot_agro_2000_gpp.tiff
  – bigfoot: project name
  – agro: site name
  – 2000: year
  – gpp: what was measured
  – tiff: file format
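
As a rough sketch of these naming rules, the Python below assembles a name from its descriptive parts and rejects spaces and non-ASCII characters. The helpers `build_file_name` and `is_safe_name`, and the choice to lowercase the parts, are assumptions made for the example.

```python
import re

# ASCII letters, digits, underscore, hyphen, and dot only; no spaces.
SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def is_safe_name(name):
    """True when the whole name uses only the allowed ASCII characters."""
    return SAFE_NAME.fullmatch(name) is not None


def build_file_name(project, site, year, measured, extension):
    """Join the descriptive parts as project_site_year_measured.extension."""
    stem = "_".join([project, site, str(year), measured]).lower()
    name = f"{stem}.{extension}"
    if not is_safe_name(name):
        raise ValueError(f"spaces or non-ASCII characters in {name!r}")
    return name


print(build_file_name("BigFoot", "agro", 2000, "gpp", "tiff"))
print(is_safe_name("best version.txt"))
```

The character check only covers the mechanical rules. A name such as `Mydata.xls` passes it while still saying nothing about the contents, so descriptiveness has to be judged by a person.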

19 [Comic image] Courtesy of PhD Comics

20 4. Assign descriptive file names
Organize files logically: make sure your file system is logical and efficient.
[Diagram: folder tree with folders labeled Biodiversity, Lake, Experiments, Field work, and Grassland]
- Biodiv_H20_heatExp_2005_2008.csv
- Biodiv_H20_predatorExp_2001_2003.csv
- Biodiv_H20_planktonCount_start2001_active.csv
- Biodiv_H20_chla_profiles_2003.csv
- … …
From S. Hampton

21 5. Preserve information
Keep your raw data raw: no transformations, interpolations, etc., in the raw file.
Raw data file: Giles_zoopCount_Diel_2001_2003.csv
    TAX  COUNT       TEMPC
    C    3.97887358  12.3
    F    0.97261354  12.7
    M    0.53051648  12.1
    F    0           11.9
    C    10.8823893  12.8
    F    43.5295571  13.1
    M    21.7647785  14.2
    N    61.6668725  12.9
    …
Processing script (R):
    ### Giles_zoop_temp_regress_4jun08.r
    ### Load data
    Giles <- read.csv("Giles_zoopCount_Diel_2001_2003.csv")
    ### Look at the data
    Giles
    plot(COUNT ~ TEMPC, data=Giles)
    ### Log transform the dependent variable (y+1)
    Giles$Lcount <- log(Giles$COUNT+1)
    ### Plot the log-transformed y against x
    plot(Lcount ~ TEMPC, data=Giles)
From S. Hampton

22 5. Preserve information (cont)
- Use a scripted language to process data
  – R statistical package (free, powerful)
  – SAS
  – MATLAB
- Processing scripts are records of processing
  – Scripts can be revised and rerun
- Graphical User Interface-based analyses may seem easy, but don’t leave a record
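
The same idea applies in any scripted language. As a hedged sketch in Python, the script below reads a raw file, writes the log(COUNT + 1) values to a separate derived file, and leaves the raw file untouched. The function `add_log_count`, the `LCOUNT` column, and the derived file name are hypothetical; the three sample rows are copied from the slide's raw data table.

```python
import csv
import math
import tempfile
from pathlib import Path


def add_log_count(raw_path, derived_path):
    """Read the raw file and write a derived copy with a log(COUNT + 1) column."""
    with open(raw_path, newline="") as src, open(derived_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["LCOUNT"])
        writer.writeheader()
        for row in reader:
            row["LCOUNT"] = f"{math.log(float(row['COUNT']) + 1):.6f}"
            writer.writerow(row)


# Demonstration in a throwaway directory so no real data are touched.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "Giles_zoopCount_Diel_2001_2003.csv"
raw.write_text("TAX,COUNT,TEMPC\nC,3.97887358,12.3\nF,0,11.9\nN,61.6668725,12.9\n")
raw_before = raw.read_bytes()

derived = workdir / "Giles_zoopCount_Diel_2001_2003_logcount.csv"
add_log_count(raw, derived)
raw_unchanged = raw.read_bytes() == raw_before
print(derived.read_text())
```

Because the transformation lives in the script rather than in the raw file, it can be revised and rerun, and the script itself documents what was done.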

23 6. Perform basic quality assurance
- Assure that data are delimited and line up in proper columns
- Check that there are no missing values (blank cells) for key parameters
- Scan for impossible and anomalous values
- Perform and review statistical summaries
- Map location data (lat/long) and assess errors
- There is no better QA than to analyze the data
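
Several of these checks are easy to script. The sketch below is one possible arrangement, not a prescribed procedure: the function `qa_report`, the plausibility limits in `LIMITS`, and the sample records are all assumptions for illustration.

```python
import statistics

# Hypothetical plausibility limits; choose limits that fit your own parameters.
LIMITS = {"Temp": (-90.0, 60.0), "Precip": (0.0, 2000.0)}


def qa_report(records, key_fields=("Station", "Date")):
    """Collect blank key fields, out-of-range values, and (min, mean, max) summaries."""
    problems = []
    for i, rec in enumerate(records):
        for field in key_fields:
            if rec.get(field) in (None, ""):
                problems.append((i, field, "missing key value"))
        for name, (low, high) in LIMITS.items():
            value = rec.get(name)
            if value is not None and not low <= value <= high:
                problems.append((i, name, f"out of range: {value}"))
    summary = {}
    for name in LIMITS:
        values = [r[name] for r in records if r.get(name) is not None]
        if values:
            summary[name] = (min(values), statistics.mean(values), max(values))
    return problems, summary


records = [
    {"Station": "HOGI", "Date": "19961001", "Temp": 12.0, "Precip": 0.0},
    {"Station": "HOGI", "Date": "", "Temp": 14.0, "Precip": 3.0},
    {"Station": "HOGI", "Date": "19961003", "Temp": 190.0, "Precip": None},
]
problems, summary = qa_report(records)
print(problems)
print(summary)
```

Here the blank date and the 190-degree temperature are both reported, and the inflated mean temperature in the summary is a second hint that something is wrong.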

24 6. Perform basic quality assurance (cont)
Place geographic data on a map to ensure that geographic coordinates are correct.
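
Plotting on a map is the real check. As a complement, a few lines of code can catch the most common coordinate mistakes (dropped signs, swapped latitude and longitude) before mapping. The function `check_coordinates`, the site names, and the rough bounding box are hypothetical values for illustration.

```python
def check_coordinates(points, bbox):
    """Return points with invalid coordinates or outside the expected bounding box.

    bbox is (min_lat, max_lat, min_lon, max_lon) in decimal degrees.
    """
    min_lat, max_lat, min_lon, max_lon = bbox
    flagged = []
    for name, lat, lon in points:
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            flagged.append((name, "not a valid latitude/longitude"))
        elif not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
            flagged.append((name, "outside expected study area"))
    return flagged


# Hypothetical sites and a rough bounding box around Brazil, for illustration only.
STUDY_AREA = (-34.0, 6.0, -74.0, -34.0)
sites = [
    ("site_a", -2.85, -54.96),  # plausible
    ("site_b", -2.85, 54.96),   # longitude sign dropped
    ("site_c", -54.96, -2.85),  # latitude and longitude swapped
    ("site_d", 95.0, -54.96),   # impossible latitude
]
flagged = check_coordinates(sites, STUDY_AREA)
print(flagged)
```

A bounding box cannot confirm that a point is in the right place, only that it is not in an obviously wrong one, which is why the map remains the final check.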

25 6. Perform basic quality assurance (cont)
Plot information to examine outliers
[Figure: NACP Site Synthesis Model-Observation Intercomparison. Annotation: Model X uses UTC time, all others use Eastern Time]
Data from the North American Carbon Program Site Synthesis (Courtesy of Dan Ricciuto and Yaxing Wei, ORNL)

26 6. Perform basic quality assurance (cont)
Plot information to examine outliers
[Figure: NACP Site Synthesis Model-Observation Intercomparison]
Data from the North American Carbon Program Site Synthesis (Courtesy of Dan Ricciuto and Yaxing Wei, ORNL)

27 7. Provide Documentation / Metadata
- What does the data set describe?
- Why was the data set created?
- Who produced the data set and who prepared the metadata?
- When and how frequently were the data collected?
- Where were the data collected and with what spatial resolution? (Include the coordinate reference system.)
- How was each parameter measured?
- How reliable are the data? What is the uncertainty and measurement accuracy? What problems remain in the data set?
- What assumptions were used to create the data set?
- What is the use and distribution policy of the data set? How can someone get a copy of the data set?
- Provide any references to use of the data in publication(s)

28 8. Protect data
- Ensure that file transfers are done without error
  – Compare checksums before and after transfers
- Example tools to generate checksums
  – http://www.pc-tools.net/win32/md5sums/
  – http://corz.org/windows/software/checksum/
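
Checksums can also be computed with a short script instead of a separate tool. The sketch below uses Python's standard `hashlib` module; the function name `file_checksum`, the SHA-256 default, and the file names are assumptions for the example, and a local copy stands in for a real transfer.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration: a local copy stands in for a file transfer.
workdir = Path(tempfile.mkdtemp())
source = workdir / "bigfoot_agro_2000_gpp.csv"
source.write_text("Station,Date,Temp\nHOGI,19961001,12\n")
received = workdir / "received.csv"
shutil.copyfile(source, received)

transfer_ok = file_checksum(source) == file_checksum(received)
print("transfer verified" if transfer_ok else "checksum mismatch: transfer again")
```

Recording the checksum alongside the file at the time of creation gives later users something to compare against, whichever algorithm is chosen.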

29 8. Protect data (cont)
- Create back-up copies often
  – Ideally three copies: original, one on-site (external), and one off-site
  – Frequency based on need / risk

30 8. Protect data (cont)
Use reliable devices for backups
- Removable storage device
- Managed network drive
  – RAID, tape system
- Managed cloud file-server
  – Dropbox, Amazon Simple Storage Service (S3), Carbonite

31 8. Protect data (cont)
Test your backups
- Automatically test backup copies
  – Media degrade over time
  – Annually test copies using checksums or file compare
- Know that you can recover from a data loss
  – Periodically test your ability to restore information (at least once a year)
  – Each year simulate an actual loss by trying to recover solely from the backed-up copies
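
A file-compare test of a backup can be automated. The following is a minimal sketch using Python's standard `filecmp` module; the function `verify_backup` and the directory layout are hypothetical, and throwaway directories stand in for real storage.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path


def verify_backup(original_dir, backup_dir):
    """Return relative paths that are missing from, or differ in, the backup."""
    filecmp.clear_cache()  # avoid reusing results from an earlier comparison
    original_dir, backup_dir = Path(original_dir), Path(backup_dir)
    failures = []
    for path in sorted(original_dir.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(original_dir)
        copy = backup_dir / relative
        if not copy.is_file():
            failures.append((str(relative), "missing"))
        elif not filecmp.cmp(path, copy, shallow=False):
            failures.append((str(relative), "differs"))
    return failures


# Demonstration with throwaway directories standing in for real storage.
root = Path(tempfile.mkdtemp())
data, backup = root / "data", root / "backup"
data.mkdir()
(data / "a.csv").write_text("1,2,3\n")
(data / "b.csv").write_text("4,5,6\n")
shutil.copytree(data, backup)

clean = verify_backup(data, backup)
(backup / "b.csv").write_text("4,5,7\n")  # simulate a degraded copy
damaged = verify_backup(data, backup)
print(clean, damaged)
```

Passing `shallow=False` makes `filecmp.cmp` compare file contents rather than only size and modification time. A check like this confirms the copies match; it does not replace the yearly exercise of restoring solely from the backup.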

32 Fundamental Data Practices
1. Define the contents of your data files
2. Use consistent data organization
3. Use stable file formats
4. Assign descriptive file names
5. Preserve information
6. Perform basic quality assurance
7. Provide documentation
8. Protect your data

33 Proper Curation Enables Data Reuse
[Figure: Information Content plotted against Time. Stages marked: Planning, Collection, Assure, Documentation, Archive. Level marked: Sufficient for Sharing and Reuse]

34 Best Practices: Conclusions
- Data management is important in today’s science
- Well-organized data:
  – enable researchers to work more efficiently
  – can be shared easily by collaborators
  – can potentially be re-used in ways not imagined when originally collected

35 Bibliography
Cook, Robert B., Richard J. Olson, Paul Kanciruk, and Leslie A. Hook. 2001. Best Practices for Preparing Ecological Data Sets to Share and Archive. Bulletin of the Ecological Society of America, Vol. 82, No. 2, April 2001.
Hook, L. A., T. W. Beaty, S. Santhana-Vannan, L. Baskaran, and R. B. Cook. 2010. Best Practices for Preparing Environmental Data Sets to Share and Archive. http://dx.doi.org/10.3334/ORNLDAAC/BestPractices-2010
Michener, W. K., J. W. Brunt, J. Helly, T. B. Kirchner, and S. G. Stafford. 1997. Non-Geospatial Metadata for Ecology. Ecological Applications. 7:330-342.

36 Questions?

37 Additional Slides

