Data Formats: Choosing and Adopting Community Accepted Standards

Presentation on theme: "Data Formats: Choosing and Adopting Community Accepted Standards"— Presentation transcript:

1 Data Formats: Choosing and Adopting Community Accepted Standards
Section: Local Data Management
Curt Tilmes, NASA. Version 1.0, February 2013. Copyright 2013 Curt Tilmes.
Slide 1: Introduction
This training module is part of the Federation of Earth Science Information Partners (ESIP Federation) Data Management for Scientists Short Course, Section: Local Data Management – Data Formats. The subject of this module is "Choosing and Adopting Community Accepted Standards". The module was authored by Curt Tilmes of the National Aeronautics and Space Administration (NASA). Besides the ESIP Federation, the sponsors of this short course are the Data Conservancy and the United States National Oceanic and Atmospheric Administration (NOAA).

2 Overview
Some guidelines for choosing and adopting community accepted standards
Slide 2: Overview
In this module, we’re going to talk about formats for your data, and provide some guidelines for choosing and adopting community accepted standards for data formats.

3 Background
Most projects (rightly so) focus on the content of their data files, but you need to consider the format as well.
Since you captured or created the data and stored them in your own files, you know how the data are organized, how to read them, how to use them, and which characteristics of the data could constrain their use.
The goal of a good data format is to make it easier for others to read the data too.
Many hours have gone into developing standards for formats – try to learn from them.
Slide 3: Background
There are a number of reasons why you would want to adopt community accepted standards rather than develop your own for a new project. In most projects, the investigators are very concerned about the project’s data, and rightly so, because the data are the main focus of the project and its reason for being. Still, it is the technical format in which the data are represented that allows the content to be conveyed to other people. As either a principal investigator or primary researcher, you have captured or created your own data and organized them in your own way. As a result, you know the characteristics of your data, how the data are organized, how to read them, and how to use them. Others who would like to use your content will have an easier time if the data are expressed in a standard data format. Many hours have gone into the development of standard formats that make data easier for others to read, use, and understand. Using community developed standards allows you to leverage those efforts.

4 Why use community standards?
If you try to develop your data format from scratch, you will forget something.
Build on the experience and improvements built into the community standards over years of use.
Tools and analysis software natively support reading community standard data.
Reduce development effort and support reuse.
Positive feedback – they are more likely to be adopted by others.
Slide 4: Why use community standards? (1 of 2)
If you try to develop your own data format from scratch, you will almost certainly forget something important that is probably covered by a standard format. Standard formats are usually developed in response to the roadblocks that a large number of people have encountered when representing their data in various formats. Using the standards allows you to build on that community-based experience and take advantage of the improvements that have been put into those standards. Another good reason to use a standard format is that tools and third-party analysis software often natively understand it. If you invent your own format, there is a much greater likelihood that existing tools and software will not support it. Using a community developed standard allows your data to be supported by those third-party tools out of the box. In addition, coming up with a good format is a real pain. You can save yourself a lot of time and effort by adopting a standard that already exists.

5 Why use community standards?
Slide 5: Why use community standards? (2 of 2)
Of course, you might ask why there are so many standards. It is a very good question, and this little cartoon from XKCD makes the point: even if you think you can invent the one best way to do something, you are unlikely to come up with a way that everyone agrees is so much better than all the others that they will drop the other standards and switch to yours. You are better off taking advantage of one that already exists.

6 A few guidelines
Consider your archive: Do they have any recommendations?
Consider your users: Who wants this data? Why do they want it? What do they want to do with it? Will they be using your data in concert with other data?
Consider heritage: What worked well for similar data in the past? What could be done better for newly created data?
Consider tools: Try to use data formats supported by the software you intend to use the data with.
Slide 6: A few guidelines
We’d like to offer you a few guidelines that should help you choose which of the standards to use and how you will use it. If you are planning to transfer your data to a long term archive, definitely check with archive staff for recommendations. They will be very familiar with their users and will know if there are certain data formats that are commonly used. Consider your user base. What type of person is going to get this data? With what other data will they already be familiar? What do they want to do with your data? Are they going to be using your data in concert with other content that already has a data format? If so, it might well benefit your project to represent your data in a similar format so that they are more easily used with the other data. Consider heritage. What has worked well in the past? In some cases, data creation may have been done poorly in the past and you know that you have a better way to do it in your own project. Still, it would behoove you to at least take a look at what has been done previously. Consider tools. Are there specific tools that visualize, analyze, or convert data like yours and that work with a particular data format? If so, your users may want to use your data with those tools, so the formats those tools support should be considered.

7 Some examples
HDF – Hierarchical Data Format
HDF4 and HDF5 versions are in use today.
A NASA variant called HDF-EOS is used within the Earth Observing System program.
The Aura project developed a common approach across their instruments and released guidelines as a Technical Note.
NetCDF – Network Common Data Form
Widely used by agencies including NASA and NOAA.
Climate and Forecast (CF) metadata conventions help standardize how data are described within NetCDF files.
Slide 7: Some examples
We’d like to illustrate these points with a couple of examples of formats used by NASA, NOAA, and others in Earth science that are converging to become the go-to formats for data. The first is HDF, the Hierarchical Data Format, which has several versions and variants in use today. The Aura project mentioned on this slide illustrates how this format has been used. The second is the Network Common Data Form (NetCDF), which is widely used by many agencies including NASA and NOAA, and which is further supported by the Climate and Forecast metadata conventions.
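As a rough illustration of the kind of thing metadata conventions standardize, the sketch below describes one variable with the CF attribute names `standard_name`, `long_name`, and `units`, and checks that none is missing. It is only a sketch under stated assumptions: the variable `sst`, the helper `missing_attributes`, and the checklist of required attributes are hypothetical and not taken from the module, and a plain Python dictionary stands in for a real NetCDF file so that the example runs without any NetCDF library.

```python
# Hypothetical description of one variable. A plain dict stands in for a
# variable in a NetCDF file so the sketch needs only the standard library.
sea_surface_temp = {
    "name": "sst",
    "dimensions": ("time", "lat", "lon"),
    "attributes": {
        "standard_name": "sea_surface_temperature",
        "long_name": "Sea surface temperature",
        "units": "K",
    },
}

# Attributes this sketch chooses to require (an assumption for illustration;
# a project's own checklist may differ).
REQUIRED_ATTRIBUTES = ("units", "long_name", "standard_name")


def missing_attributes(variable, required=REQUIRED_ATTRIBUTES):
    """Return the required attribute names that are absent or empty."""
    attrs = variable.get("attributes", {})
    return [name for name in required if not attrs.get(name)]


if __name__ == "__main__":
    # An empty list means every attribute on the checklist is present.
    print(missing_attributes(sea_surface_temp))
```

A check like this could be run on sample files before they are shared, so that gaps in the metadata are caught early rather than by the first outside user.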

8 Adopting standards
The standard gives you a starting point, not a complete solution.
Communicate early with a broad range of data users: archivists, software engineers, scientists.
Consider how you will be writing the data and how you will be reading the data.
Get feedback before making final decisions.
Start sharing sample data in the proposed format to nail down specifics and work out ambiguities.
Document your use and application of the standard completely.
Slide 8: Adopting standards
Once you have considered a standard and chosen to use it, you should know that making this choice is just a starting point for deciding how to represent your data in that format. Even with a community based standard, you’ll find that there are good ways and poor ways of representing certain types of data with certain formats. It’s an excellent idea to communicate as early as possible with a broad range of the potential end users of your data. Think about who is going to use and archive your data. Will they be software engineers or scientists? Figure out how you will be writing the data and how your users will be reading the data so that you can come up with the best way to organize that data and optimize your software’s ability to deal with that format. Communicate with all the parties who have an interest in what the representation of your data will ultimately be rather than inventing a format in isolation. We advise that you come up with a proposal for the format that you want to use and circulate it for feedback from your potential users. A plan of action is always better than just saying, “Well I have a bunch of data and I’m going to just start dumping it into this file,” since the data dump could end up being your format by default. Chances are that this kind of data representation will not be optimal for many uses.
Once the proposal has been vetted and accepted, it’s a good idea to share sample data in order to make sure that people understand the specific choices that you’ve made, and so that you can work out any ambiguities. We also recommend that you very carefully document your choices and why you made them, so that someone else can understand not just that you chose a specific data format, but the manner in which you are implementing that format for your specific data and your specific project. The more documentation and user guides you provide, the better the chances that your users will use your data in the way that you intend.
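One lightweight way to start documenting how a project applies a standard is to generate a short plain-text summary from the same description used to write the files. The sketch below is a minimal example under stated assumptions: the function `describe_format`, the project title, and the variable entries are all hypothetical and are not part of the module or of any standard.

```python
def describe_format(title, base_standard, variables):
    """Build a plain-text summary of how a project applies a standard."""
    lines = [title, "Base standard: " + base_standard, "Variables:"]
    for var in variables:
        dims = ", ".join(var["dimensions"])
        lines.append("  {name} ({dims}) [{units}]: {note}".format(
            name=var["name"], dims=dims, units=var["units"], note=var["note"]))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical project description, for illustration only.
    print(describe_format(
        "Example project format proposal",
        "NetCDF with CF metadata conventions",
        [{
            "name": "sst",
            "dimensions": ("time", "lat", "lon"),
            "units": "K",
            "note": "Daily mean; fill value documented in the user guide.",
        }],
    ))
```

A summary like this can be circulated with the sample data so that reviewers comment on concrete choices (names, dimensions, units) rather than on the standard in general.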

9 Resources
HDF: http://www.hdfgroup.org
HDF-EOS: http://hdfeos.org
HDF-EOS Aura File Format Guidelines: Aura_File_Format_Guidelines.pdf /auraasabestpracticerev2.pdf
NetCDF:
CF:
Slide 9: Resources
On this slide, you will find a linked listing of some additional resources you might find helpful should you need more information about some of the data formats and guidelines for using them. Even if you don’t use these specific formats, it can be useful to review them because you will see some of the rationale for choosing a data format and why certain decisions were made as these formats were constructed.

10 Other Relevant Modules
Local Data Management – Data Formats: Using Self-describing Data Formats
Learn more about the advantages of using formats for your data that have important metadata and other information embedded within them.
Slide 10: Other Relevant Modules
The modules of the ESIP Data Management for Scientists Short Course have been designed to complement and supplement each other. In light of this plan, we think you may find the following module relevant to you as you seek to gain a better understanding of data formatting: Local Data Management – Data Formats: Using Self-describing Data Formats.

11 Recommended Citations
Tilmes, C. “Local Data Management – Data Formats: Choosing and Adopting Community Accepted Standards.” In Data Management for Scientists Short Course, edited by Ruth Duerr and Nancy J. Hoebelheinrich, Federation of Earth Science Information Partners: ESIP Commons. doi: /P33N21B6
Slide 11: Recommended Citation
This module is available under a Creative Commons Attribution 3.0 license that allows you to share and adapt the work as long as you cite the work according to the citation provided. Thank you very much for your interest in the ESIP Federation’s Data Management for Scientists Short Course. Copyright 2013 Curt Tilmes.

