Lecture 4 Data Management & Metadata Steve Burian Hydroinformatics Fall 2014 This work was funded by National Science Foundation Grants EPS 1135482 and EPS 1208732
Objectives Describe the data life cycle and data management Develop data management techniques that improve organization, facilitate analysis, improve reproducibility, and improve capacity for data re-use Identify the types of information included in metadata records for environmental datasets Determine the dimensionality of a dataset, including the scale triplet of support, spacing, and extent for both space and time
Quiz You have 5 minutes to show us what you know/learned.
The Data Life Cycle Plan Collect Assure Describe Preserve Discover Integrate Analyze
Activity 1 Work in teams of 2 or 3 to share ideas, but you are required to submit your own data management plan (DMP) with Assignment 1. Send me (steve.burian@utah.edu) your draft plan in 20 minutes.
What is Metadata? Metadata is “Information about Data” WHO created the data? WHAT is the content of the data? WHEN were the data created? WHERE are the data located geographically? WHY were the data developed? HOW were the data developed? Meta (Greek): with, about, between, or among; typically used as a prefix to mean “one level of description higher” Metadata describes the content, quality, condition, and other characteristics of the data
The Purpose of Metadata Support discovery of scientific data Facilitate acquisition, comprehension, and use of data by HUMANS Enable automated discovery, ingestion, processing and analysis by MACHINES There is a saying: “order is for the feeble minded only, while the genius masters the chaos” – not operationally practical…
Data vs. Metadata Data: 15.9
Data vs. Metadata Data: 15.9 Metadata: Little Bear River at Mendon Road; Latitude = 43.0000; Longitude = -111.0000; Water temperature; Degrees Celsius; 9/30/2011 5:00 PM
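The contrast on this slide can be sketched in a few lines of Python. This is only an illustration: the `Observation` class and `describe` function are hypothetical names introduced here, and the field values are the ones shown on the slide.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Observation:
    """A single data value plus the metadata needed to interpret it."""
    value: float          # the data
    site_name: str        # everything below is metadata
    latitude: float
    longitude: float
    variable: str
    units: str
    timestamp: datetime


obs = Observation(
    value=15.9,
    site_name="Little Bear River at Mendon Road",
    latitude=43.0000,
    longitude=-111.0000,
    variable="Water temperature",
    units="Degrees Celsius",
    timestamp=datetime(2011, 9, 30, 17, 0),  # 9/30/2011 5:00 PM
)


def describe(o: Observation) -> str:
    """Turn the bare number into a statement a reader can interpret."""
    return (
        f"{o.variable} = {o.value} {o.units} at {o.site_name} "
        f"({o.latitude:.4f}, {o.longitude:.4f}) on {o.timestamp:%Y-%m-%d %H:%M}"
    )


print(describe(obs))
```

Without the metadata fields, the value 15.9 alone says nothing about what was measured, where, when, or in what units.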
Sharing Data Providing data: Why were the data created? What limitations do the data have? What do the data mean? How should the data be cited if they are re-used in a new study? Receiving data: What are the data gaps? What processes were used for creating the data? Are there any fees associated with the data? At what scale were the data created? What do the values in the tables mean? What software do I need in order to read the data? What projection are the data in? Can I give these data to someone else?
Necessary Meta/data Structure The degree of metadata format and structure necessary for different levels of projected secondary data utilization. (adapted from Michener et al., 1997).
Metadata to Support Understanding and Using Data
Metadata for Data Use Research context Hypotheses, site characteristics, experimental design, research methods Status of the dataset (e.g., raw? processed?) Spatial and temporal domain of the dataset Physical structure of the data
Scale Issues in Interpretation of Measurements and Modeling Results The Scale Triplet of Measurements (support, spacing, extent) Interpretation for Geospatial Data Extent: spatial extent represented by grid Spacing: grid cell size Support: average over grid cell? Sample value at grid cell center? Adapted from: Blöschl (1996)
Issues in Data Interpretation Spacing too large - aliasing Adapted from: Blöschl (1996)
Issues in Data Interpretation Extent too small - trend Adapted from: Blöschl (1996)
Issues in Data Interpretation Support too large - smoothing Adapted from: Blöschl (1996)
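The three interpretation issues above can be reproduced with a small numerical experiment. The sketch below is an assumption-laden illustration, not taken from Blöschl (1996): the signal `diurnal_temperature` is a made-up 24-hour sine wave, and `sample` and `value_range` are hypothetical helper names.

```python
import math


def diurnal_temperature(t_hours: float) -> float:
    """Hypothetical signal: 15 degC mean with a 5 degC daily (24 h) cycle."""
    return 15.0 + 5.0 * math.sin(2.0 * math.pi * t_hours / 24.0)


def sample(spacing_h: float, extent_h: float, support_h: float = 0.0,
           substeps: int = 24) -> list:
    """Sample the signal over [0, extent_h) every spacing_h hours.

    support_h = 0 gives instantaneous values. Otherwise each value is the
    mean of the signal over a window of support_h hours that starts at the
    sample time (approximated with `substeps` midpoints).
    """
    values = []
    n_samples = int(extent_h // spacing_h)
    for i in range(n_samples):
        t0 = i * spacing_h
        if support_h == 0.0:
            values.append(diurnal_temperature(t0))
        else:
            points = [
                diurnal_temperature(t0 + support_h * (k + 0.5) / substeps)
                for k in range(substeps)
            ]
            values.append(sum(points) / substeps)
    return values


def value_range(values: list) -> float:
    """Spread between the largest and smallest sampled value."""
    return max(values) - min(values)


hourly = sample(spacing_h=1, extent_h=72)                       # resolves the cycle
daily_points = sample(spacing_h=24, extent_h=72)                # spacing too large
short_record = sample(spacing_h=1, extent_h=6)                  # extent too small
daily_means = sample(spacing_h=24, extent_h=72, support_h=24)   # support too large

print("hourly range:", round(value_range(hourly), 3))
print("daily point range:", round(value_range(daily_points), 3))
print("daily mean range:", round(value_range(daily_means), 3))
print("short record rises monotonically:",
      all(a < b for a, b in zip(short_record, short_record[1:])))
```

With hourly spacing the full daily swing is visible. Sampling once per day at the same hour hides the cycle entirely (aliasing), a six-hour record looks like a steady trend, and 24-hour averaging removes the cycle (smoothing). This is why support, spacing, and extent belong in the metadata record.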
Metadata Format and Standards
What Does Scientific Metadata Look Like?
What is a Metadata Standard? A structure to describe data with: Common terms to allow consistency between records Common definitions for easier interpretation Common language for ease of communication Common structure to quickly locate information Encoding – structured text or Extensible Markup Language (XML) In search and retrieval, standards provide: Documentation structure in a reliable and predictable format for computer interpretation A uniform summary description of the dataset
General Metadata Organization Information for data discovery Title, keywords, spatial and temporal domain, abstract Information for interpretation and appropriate use Research objectives, experimental design, sampling procedures, site selection, variables and units, data processing Information for automated use Structural attributes of the data (schema) and format of the data (syntax)
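One way to make this three-part organization concrete is a small completeness check. The sketch below is illustrative only: the field names are simply the items listed on the slide turned into dictionary keys, and `REQUIRED`, `missing_fields`, and the `draft` record are hypothetical.

```python
# Field groups taken from the slide; the key spellings are an assumption.
REQUIRED = {
    "discovery": [
        "title", "keywords", "spatial_domain", "temporal_domain", "abstract",
    ],
    "interpretation": [
        "research_objectives", "experimental_design", "sampling_procedures",
        "site_selection", "variables_and_units", "data_processing",
    ],
    "automated_use": ["schema", "syntax"],
}


def missing_fields(record: dict) -> list:
    """Return 'group.field' names that are absent or empty in the record."""
    missing = []
    for group, names in REQUIRED.items():
        present = record.get(group, {})
        for name in names:
            if not present.get(name):
                missing.append(f"{group}.{name}")
    return missing


# A deliberately incomplete draft record (placeholder values).
draft = {
    "discovery": {
        "title": "Water temperature, Little Bear River at Mendon Road",
        "keywords": ["water temperature"],
    },
    "automated_use": {"syntax": "CSV"},
}

for name in missing_fields(draft):
    print("missing:", name)
```

A check like this is no substitute for a metadata standard, but it shows how the three groups serve different readers: discovery fields for searchers, interpretation fields for analysts, and structural fields for software.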
Examples of Metadata Standards Dublin Core Element Set Emphasis on web resources, publications http://dublincore.org/documents/dces/ Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata (CSDGM) Emphasis on geospatial data Commonly used by federal agencies http://www.fgdc.gov/metadata/geospatial-metadata-standards International Organization for Standardization (ISO) 19115/19139 Geographic information: Metadata Emphasis on geospatial data and services http://www.fgdc.gov/metadata/geospatial-metadata-standards#fgdcendorsedisostandards
Examples of Metadata Standards Ecological Metadata Language (EML) Focus on ecological data http://knb.ecoinformatics.org/eml_metadata_guide.html Water Markup Language (WaterML) Emphasis on time series of hydrologic observations More of a data encoding language https://portal.opengeospatial.org/files/?artifact_id=48531 There are many standards available to document data. Each has a different focus, yet ask for similar information about the data set.
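As one concrete instance of the standards listed above, the sketch below encodes a few Dublin Core elements as XML using Python's standard library. It is a minimal illustration under stated assumptions: the wrapper element name `metadata`, the helper `dublin_core_record`, and all field values are placeholders chosen here, and a real record would follow the encoding guidelines of the repository that receives it.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict) -> ET.Element:
    """Build an XML element holding one dc:* child per (name, text) pair."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, text in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = text
    return root


# Placeholder values for illustration only.
record = dublin_core_record({
    "title": "Water temperature, Little Bear River at Mendon Road",
    "subject": "water temperature",
    "date": "2011-09-30",
    "format": "text/csv",
    "coverage": "Little Bear River at Mendon Road",
})

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Because every element name comes from a shared vocabulary, software that has never seen this dataset can still locate its title, subject, and coverage.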
The Value of Metadata
Data Discovery and Reuse The descriptive content of the metadata file can be used to identify, assess, and access available data resources. IDENTIFY: keywords, geographic location, time period, attributes. ASSESS: use constraints, access constraints, data quality, availability/pricing. ACCESS: online access, order process, contacts.
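The IDENTIFY step can be sketched as a filter over a metadata catalog by keyword, geographic location, and time period. Everything in this example is hypothetical: the `Record` class, the `identify` function, and the two catalog entries (including their bounding boxes and date ranges) are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    """Minimal discovery metadata for one dataset."""
    title: str
    keywords: set
    bbox: tuple      # (west, south, east, north) in decimal degrees
    start: date
    end: date


def identify(catalog: list, keyword: str, point: tuple, when: date) -> list:
    """Return titles of records matching keyword, location, and time."""
    lon, lat = point
    hits = []
    for rec in catalog:
        west, south, east, north = rec.bbox
        in_space = west <= lon <= east and south <= lat <= north
        in_time = rec.start <= when <= rec.end
        if keyword in rec.keywords and in_space and in_time:
            hits.append(rec.title)
    return hits


# Hypothetical catalog entries.
catalog = [
    Record("Stream temperature series", {"water temperature", "hydrology"},
           (-111.5, 42.5, -110.5, 43.5), date(2005, 1, 1), date(2012, 12, 31)),
    Record("Land cover grid", {"land cover"},
           (-111.5, 42.5, -110.5, 43.5), date(2006, 1, 1), date(2006, 12, 31)),
]

print(identify(catalog, "water temperature", (-111.0, 43.0), date(2011, 9, 30)))
print(identify(catalog, "land cover", (-111.0, 43.0), date(2011, 9, 30)))
```

The second query returns nothing because the land cover record does not cover the requested date, which is exactly the kind of mismatch that metadata lets a user catch before downloading anything.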
Data Accountability Metadata allows you to repeat a scientific process if: methodologies are defined, variables are defined, and analytical parameters are defined. Metadata allows you to defend your scientific process: it demonstrates the process from INPUT to RESULTS, and an increasingly GIS/data-savvy public requires metadata for consumer information.
Project Coordination Metadata can be a means to improve communications among project participants using common: descriptions & parameters keywords, vocabularies, thesauri contact information attributes distribution information If reviewed regularly by all participants, metadata created early and updated during the project improves opportunity for coordinating: source data analytical methods new information
Value of Metadata to Data Producers Avoid data duplication Share reliable information Publicize efforts – promote the work of a scientist and his/her contributions to a field of study
Value of Metadata to Data Users Search, retrieve, and evaluate data set information from both inside and outside an organization Find data: Determine what data exist for a geographic location and/or topic Determine applicability: Decide if a data set meets a particular need Discover how to acquire the dataset you identified Process and use the dataset
Value of Metadata to Organizations Metadata helps ensure an organization’s investment in data Documentation of data processing steps, quality control, definitions, data uses, and restrictions Ability to use data after initial intended purpose Transcends people and time Offers data permanence Creates institutional memory Advertises an organization’s research Creates possible new partnerships and collaborations through data sharing
Summary (1) Metadata is documentation of data A metadata record captures critical information about the content of a dataset e.g., spatial and temporal support, spacing, extent Metadata allows data to be discovered, accessed, and re-used
Summary (2) Metadata standards provide structure and consistency to data documentation Standards and tools vary Select according to defined criteria such as data type, organizational guidance, and available resources Metadata is of critical importance to data developers, data users, and organizations
References Michener, W.K. (2006). Meta-information concepts for ecological data management, Ecological Informatics, 1(1), 3-7, http://dx.doi.org/10.1016/j.ecoinf.2005.08.004. Michener, W.K., J.W. Brunt, J.J. Helly, T.B. Kirchner, S.G. Stafford (1997). Nongeospatial metadata for the ecological sciences, Ecological Applications, 7(1), 330-342, http://dx.doi.org/10.1890/1051-0761(1997)007[0330:NMFTES]2.0.CO;2. Blöschl, G. (1996). Scale and Scaling in Hydrology, Habilitationsschrift, Wiener Mitteilungen Wasser Abwasser Gewässer, Wien, 346 p. Credits: Many ideas and some slides in this presentation were taken from: Henkel, H., V. Hutchison, S. Strasser, S. Rebich Hespanha, K. Vanderbilt, L. Wayne (2012). DataONE education modules, DataONE Project, University of New Mexico, Albuquerque, NM. Available at: http://www.dataone.org/education-modules (last accessed 9-4-2012).
Assignment 1. Metadata and the Data Life Cycle Your employer is developing a hydrologic model for the Little Bear River in Cache Valley and wants to model the impact of changes in land cover on hydrology in this watershed between 2002 and 2012. Your boss has asked you whether s/he can use the United States Geological Survey (USGS) National Land Cover Dataset (available for 1992, 2001, and 2006) in the study.
National Land Cover Dataset GIS gridded data product Nation-wide coverage Data available for 1992, 2001, 2006 Vegetation/land cover types Used for model inputs and parameterization
For your recommendation, consider: What do the data represent? How were the data created, collected, and/or observed? What was the source of the data? What is the format or syntax of the data? What manipulations, transformations, or derivations have been performed to produce the data? What are the spatial and temporal support, spacing, and extent for these datasets? What are appropriate uses for the dataset that you have selected? What are the limitations to the data? Are there differences in the way the data for the different years were produced that make them incompatible?
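A quick way to start on the temporal questions is to compare the dataset years with the study period. The sketch below uses only the years stated in the assignment; the variable and function names are hypothetical, and it addresses temporal spacing and extent only, not the production differences between years that the metadata must also be checked for.

```python
# Years stated in the assignment text.
NLCD_YEARS = [1992, 2001, 2006]
STUDY_START, STUDY_END = 2002, 2012

# Dataset years that fall inside the study period.
inside = [y for y in NLCD_YEARS if STUDY_START <= y <= STUDY_END]


def nearest(year: int) -> int:
    """Dataset year closest to the given year."""
    return min(NLCD_YEARS, key=lambda y: abs(y - year))


# Gap, in years, between each end of the study period and the closest dataset.
gap_at_start = abs(nearest(STUDY_START) - STUDY_START)
gap_at_end = abs(nearest(STUDY_END) - STUDY_END)

print("years inside study period:", inside)
print("nearest to start:", nearest(STUDY_START), "gap:", gap_at_start)
print("nearest to end:", nearest(STUDY_END), "gap:", gap_at_end)
```

The numbers only frame the question. Whether the available years are adequate, and whether they can be compared with one another, depends on what the metadata say about how each product was made.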