Published by Donna Fitzgerald. Modified over 6 years ago.
1
Lecture 2: Data Management and the Data Life Cycle
Jeffery S. Horsburgh
Hydroinformatics, Fall 2012
This work was funded by National Science Foundation Grant EPS
2
Objectives
Describe the data life cycle
Provide a more holistic view of data management
Develop data management techniques that:
  Improve data organization
  Facilitate analysis
  Improve reproducibility
  Improve capacity for data re-use
3
The Data Life Cycle
4
The Data Life Cycle
Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze
Slide courtesy DataONE.
5
DataONE: Data Observation Network for Earth
A distributed Earth sciences network supporting the full data life cycle
Focused on creating cyberinfrastructure for long-term preservation of and access to scientific data
6
DataONE Team and Sponsors
Bertram Ludaescher Peter Honeyman Jeff Horsburgh Robert Sandusky Peter Buneman Carole Goble Cliff Duke Donald Hobern Ewa Deelman Amber Budden, Roger Dahl, Rebecca Koskela, Bill Michener, Robert Nahf, Mark Servilla Patricia Cruse, John Kunze Dave Vieglais Paul Allen, Rick Bonney, Steve Kelling Chad Berkley, Stephanie Hampton, Matt Jones Suzie Allard, Carol Tenopir, Maribeth Manoff, Robert Waltz, Bruce Wilson John Cobb, Bob Cook, Giri Palanismy, Line Pouchard Sky Bristol, Mike Frame, Richard Huffine, Viv Hutchison, Jeff Morisette, Jake Weltzin, Lisa Zolly David DeRoure Ryan Scherle, Todd Vision LEON LEVY FOUNDATION Randy Butler
7
The Data Life Cycle
Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze
8
Planning: Consider Data Management Before You Collect Data
What kind of data will be collected?
Which methods will be used (sensors, samples, etc.)?
What data formats/standards are appropriate?
How will the data be used?
How will you share the data?
Will your methods satisfy:
  Funding requirements
  Policies for access, sharing, reuse
  Budget: most of the time this is overlooked!
9
Data Management Plan
Formal document
Outlines what you will do with your data during and after your research
10
Why Prepare a DMP?
Can be an agreement among collaborators on how data will be managed
Can save you time and money in the long run
Meets the requirements of data centers and repositories by design instead of as an afterthought
Because you are required to!
11
NSF DMP Requirements
From Grant Proposal Guidelines:
Plans for data management and sharing of the products of research. Proposals must include a supplementary document of no more than two pages labeled “Data Management Plan”. This supplement should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results (in AAG), and may include:
  the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project
  the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies)
  policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements
  policies and provisions for re-use, re-distribution, and the production of derivatives
  plans for archiving data, samples, and other research products, and for preservation of access to them
Slide courtesy DataONE.
12
NSF DMP Requirements
Summarized from NSF’s Award & Administration Guide:
4. Dissemination and Sharing of Research Results
  Promptly publish with appropriate authorship
  Share data, samples, physical collections, and supporting materials with others, within a reasonable timeframe
  Share software and inventions
  Investigators can keep their legal rights over their intellectual property, but they still have to make their results, data, and collections available to others
  Policies will be implemented via:
    Proposal review
    Award negotiations and conditions
    Support/incentives
Slide courtesy DataONE.
13
Some Example NSF Proposal DMPs (see the lecture materials page in Canvas)
NSF EPSCoR CI-WATER (funded)
  CI-WATER, Cyberinfrastructure to Advance High Performance Water Resource Modeling
NSF EPSCoR iUTAH (funded)
  innovative Urban Transitions and Aridregion Hydro-sustainability
NSF Water Sustainability and Climate (not funded)
  Integrating Social, Hydroclimate and Ecosystem Components of Western Water Systems to Guide Sustainable Responses to Land Use and Climate Changes
14
Tools for Creating Data Management Plans
dmp.cdlib.org (University of California Curation Center of the California Digital Library)
  Create ready-to-use data management plans for specific funding agencies
dmponline.dcc.ac.uk
  Build and edit DMPs according to requirements of major UK funders
15
The Data Life Cycle
Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze
16
Collect
What are some ways that we produce data?
  Experiments, observations, samples, model outputs, etc.
  Varying frequency, temporal and spatial coverage
17
Data Collection Includes Data Entry
Recording observations and notes in a field notebook
Transcribing field notebooks and sheets into digital forms
Automated processing of sensor data streams into a database
18
Strategies for Data Entry
When you create data entry tools:
  Use pre-designed forms or templates (electronic or paper)
  Use lists of valid values rather than free-form text entry
  Use validation checks (e.g., range checks)
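The valid-value list and the range check can be sketched in a few lines of Python. This is only an illustration, not a prescribed tool: the field names, the site codes, and the temperature limits are all hypothetical.

```python
# Hypothetical controlled vocabulary and plausible range for one variable.
VALID_SITES = {"SiteA", "SiteB", "SiteC"}
TEMP_RANGE_C = (-5.0, 40.0)


def validate_entry(entry):
    """Return a list of problems found in one data entry record."""
    problems = []
    # Valid-value list instead of free-form text
    if entry.get("site") not in VALID_SITES:
        problems.append(f"unknown site: {entry.get('site')!r}")
    # Range check on the numeric value
    try:
        temp = float(entry.get("water_temp_c"))
    except (TypeError, ValueError):
        problems.append("water_temp_c is not a number")
    else:
        low, high = TEMP_RANGE_C
        if not low <= temp <= high:
            problems.append(f"water_temp_c out of range: {temp}")
    return problems
```

An empty list means the record passed both checks. Rejecting or flagging a record at entry time is usually cheaper than finding the error during analysis.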
19
The Data Life Cycle
Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze
20
Assure
Strategies for preventing errors from entering datasets
  Standard data entry forms
  Pre-specification of formats, units, etc.
Activities to ensure quality during collection
  Standard field and laboratory procedures
  Automated range checks for sensor data
Activities to “clean” collected data
  Common to sensor data streams
  Dependent upon variable and sensor
  Graphical and statistical summaries
21
Assure
Out-of-range values
Sensor drift
Anomalous values
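A minimal sketch of how out-of-range and anomalous values might be flagged automatically. The thresholds are hypothetical, and comparing against a single median is a deliberately crude stand-in: real rules depend on the variable and the sensor. Sensor drift is not addressed here, since detecting it generally needs calibration or reference measurements.

```python
from statistics import median


def flag_values(values, valid_min, valid_max, max_dev):
    """Label each value 'ok', 'out_of_range', or 'anomalous'.

    A value is 'anomalous' when it is within the valid range but differs
    from the median of the in-range values by more than max_dev.
    """
    in_range = [v for v in values if valid_min <= v <= valid_max]
    center = median(in_range) if in_range else None
    flags = []
    for v in values:
        if not valid_min <= v <= valid_max:
            flags.append("out_of_range")
        elif abs(v - center) > max_dev:
            flags.append("anomalous")
        else:
            flags.append("ok")
    return flags
```

The function only labels values and removes nothing, so a person can review the flags before any correction is made.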
22
The Data Life Cycle
Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze
23
Describe
Metadata
  Documentation and reporting of data
Contextual details
  What is it critical to know about the data?
  Description of temporal and spatial details, instruments/sensors, methods, units, files, etc.
24
Metadata Content and Format
What metadata are needed?
  Details that make data meaningful
How will metadata be created?
  Lab notebooks?
  Automatically generated by a sensor or instrument?
What format will be used for the metadata?
  Standards may be chosen by community or dictated by an agency
25
The Data Life Cycle
Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze
26
Preserve
How are you preserving your data now?
Does Your Office Look Like This?
  What are the potential problems?
  What are some potential solutions?
27
Preserve
What will be preserved?
Where will it be preserved?
Backups?
Version control?
Policies for access, sharing, and reuse
Obligations for sharing
Security and access control
Sensitive data
How long?
Intellectual property issues
Responsible parties
28
Data Loss
Natural disaster
Facilities infrastructure failure
Storage failure
Server hardware/software failure
Application software failure
External dependencies
Format obsolescence
Legal encumbrance
Human error
Malicious attack by human or automated agents
Loss of staffing competencies
Loss of institutional commitment
Loss of financial stability
Changes in user expectations and requirements
CC image by Sharyn Morrow on Flickr
CC image by momboleum on Flickr
Slide courtesy DataONE.
29
New Opportunities for Data Sharing and Preservation
Emerging data archives/repositories
Functionality for collaboration and archival/preservation
Potential ideas for semester projects!
CUAHSI HIS: sharing hydrologic data
30
The Data Life Cycle
Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze
31
Discover
Most data are not easily discoverable
  Encapsulated in databases or files
  Formats not compatible with web indexing technologies
It’s not enough to post a link to a file on a web page
32
Data Search Engines
Discovery by:
  Keywords
  Location
  Time
  Source
33
Conditions for Effective Data Discovery
Highly curated data
Well described via structured metadata
Standards for data and metadata formats
34
The Data Life Cycle
Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze
35
Integrate
Combining data from different sources (e.g., USGS NWIS and EPA STORET)
Creating a unifying view of the data
Overcoming heterogeneity
  Syntax: data formats and organization
  Semantics: vocabularies and meaning
36
Integration Approaches
Data warehousing
  [Diagram: Data Source A, Data Source B, Data Source C; Extract, Transform, Load; Common Database Schema]
  Efficient queries
  Issues with “data freshness”
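As a rough sketch of the warehousing approach, the Python below extracts rows from two made-up sources, transforms them to one layout and one unit, and loads them into a single table. The source layouts, the column names, and the use of an in-memory SQLite database are all assumptions for illustration only.

```python
import sqlite3

# Hypothetical extracts from two sources with different layouts and units.
SOURCE_A = [{"station": "A1", "time": "2012-09-01T00:00", "temp_f": 50.0}]
SOURCE_B = [("B7", "2012-09-01T00:00", 11.5)]  # (site, time, temp in deg C)


def transform_a(row):
    """Rename fields and convert Fahrenheit to Celsius."""
    celsius = (row["temp_f"] - 32.0) * 5.0 / 9.0
    return ("source_a", row["station"], row["time"], round(celsius, 2))


def transform_b(row):
    """Source B already reports Celsius; only the layout changes."""
    site, time, celsius = row
    return ("source_b", site, time, celsius)


def load(rows):
    """Load transformed rows into one table with a common schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observations "
        "(source TEXT, site TEXT, obs_time TEXT, water_temp_c REAL)"
    )
    conn.executemany("INSERT INTO observations VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


rows = [transform_a(r) for r in SOURCE_A] + [transform_b(r) for r in SOURCE_B]
warehouse = load(rows)
```

Queries against `warehouse` are fast because everything sits in one schema, but the table only reflects the sources as they were at load time, which is the “data freshness” issue.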
37
Integration Approaches
[Diagram: Data Source A, Data Source B, and Data Source C, each behind its own Wrapper]
Requires a common information model and a common interface
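A minimal sketch of the wrapper approach, assuming a hypothetical common information model (a site, a timestamp, and a value) and a hypothetical common interface with one method, `get_values`. Each wrapper hides how its own source stores data.

```python
from typing import Iterable, Protocol, Tuple

Observation = Tuple[str, str, float]  # (site, ISO timestamp, value)


class DataSourceWrapper(Protocol):
    """The common interface every wrapper must provide."""

    def get_values(self, site: str) -> Iterable[Observation]: ...


class SourceAWrapper:
    """Wraps a hypothetical source that stores rows as dicts."""

    def __init__(self, rows):
        self._rows = rows

    def get_values(self, site):
        return [(r["station"], r["time"], r["value"])
                for r in self._rows if r["station"] == site]


class SourceBWrapper:
    """Wraps a hypothetical source that stores rows as tuples."""

    def __init__(self, rows):
        self._rows = rows

    def get_values(self, site):
        return [row for row in self._rows if row[0] == site]


def query_all(wrappers, site):
    """Ask every source at query time and combine the answers."""
    results = []
    for wrapper in wrappers:
        results.extend(wrapper.get_values(site))
    return results
```

Because the data stay in their original sources and are fetched at query time, there is no stale copy to refresh. The trade-off is that every query depends on each source being available.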
38
Analyze
Spreadsheets
Workflows
Statistical computing
Scripting/coding environments
Database management systems
39
The Data Life Cycle Summary
Stages through which well-managed data pass from the inception of a research project to its conclusion
The stages do not always follow a continuous circle
Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze
40
Data Management 101
Simple guidelines to improve data management
Benefits
  Improved data organization (facilitates analysis)
  Improved reproducibility
  Improved capacity for data re-use
Borer, E.T., E.W. Seabloom, M.B. Jones, and M. Schildhauer (2009). Some simple guidelines for effective data management, ESA Bulletin, 90(2): ,
41
1. Don’t Mess with the Raw Data
Always store uncorrected data with all of its “bumps and warts”
Do not make any corrections to the raw copy
  You could change something that was actually correct
  You could make mistakes while correcting other mistakes
Script QA/QC procedures and write results to a new file/copy of the data
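One way to script a QA/QC step so that the raw file is only ever read. This is a sketch under assumptions: the input is a CSV file with a numeric column named `value`, and the only correction is blanking out-of-range values. The column name and the limits are hypothetical.

```python
import csv


def write_corrected_copy(raw_path, corrected_path, valid_min, valid_max):
    """Read the raw file (never modified) and write a corrected copy.

    Out-of-range values are blanked in the copy, and a 'qc_flag' column
    records what was done to each row.
    """
    with open(raw_path, newline="") as raw, \
            open(corrected_path, "w", newline="") as out:
        reader = csv.DictReader(raw)
        writer = csv.DictWriter(out, fieldnames=reader.fieldnames + ["qc_flag"])
        writer.writeheader()
        for row in reader:
            value = float(row["value"])
            if valid_min <= value <= valid_max:
                row["qc_flag"] = "ok"
            else:
                row["value"] = ""
                row["qc_flag"] = "out_of_range"
            writer.writerow(row)
```

Because the correction lives in a script, it can be rerun from the untouched raw file whenever the rules change, and the flag column documents every change that was made.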
42
An Example
43
An Example Removal of a calibration shift
44
An Example Removal of anomalous, out of range values
45
An Example Removal of “bad data” – sensor malfunction
46
2. Use Descriptive File Names
Use only plain ASCII characters
Brief, but descriptive of content
Generally, avoid spaces in file names
Include a “readme” file when using many files in a directory
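These naming rules can be applied consistently by generating file names in code. The sketch below assumes one possible convention (site, variable, and start date joined by underscores); that pattern is a hypothetical choice, not one given in the lecture.

```python
import re
import unicodedata


def descriptive_file_name(site, variable, start_date, extension="csv"):
    """Build a brief, ASCII-only, space-free file name from its parts."""
    def clean(text):
        # Strip accents, then drop any remaining non-ASCII characters
        ascii_text = (unicodedata.normalize("NFKD", text)
                      .encode("ascii", "ignore").decode("ascii"))
        # Collapse anything that is not a letter or digit into one hyphen
        return re.sub(r"[^A-Za-z0-9]+", "-", ascii_text).strip("-")

    return f"{clean(site)}_{clean(variable)}_{clean(start_date)}.{extension}"
```

For example, `descriptive_file_name("Logan River", "water temp", "2012-09-01")` gives `Logan-River_water-temp_2012-09-01.csv`.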
47
This might not be the best system…
How could we make this better?
48
3. Use Descriptive Headers in Files and Tables
Standard convention for many software applications
  Excel understands that the first line in a file is the header line
  Subsequent lines are interpreted as data
Encapsulate data and descriptive metadata together
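The same convention holds outside Excel. As a small illustration, Python's standard `csv.DictReader` treats the first line as the header and every later line as data. The column names and values below are made up; putting the units in the header name is one simple way to keep some metadata with the data.

```python
import csv
import io

# A small hypothetical file: the header names each column and its units.
text = (
    "datetime_utc,discharge_cfs,gage_height_ft\n"
    "2012-09-01T00:00,105.0,2.31\n"
    "2012-09-01T00:15,104.0,2.30\n"
)

reader = csv.DictReader(io.StringIO(text))
columns = reader.fieldnames          # taken from the first line
records = list(reader)               # every later line is data
mean_discharge = sum(float(r["discharge_cfs"]) for r in records) / len(records)
```

With descriptive headers, the analysis code refers to `discharge_cfs` by name rather than to an anonymous column position.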
49
Streamflow Data from USGS
50
4. Do Not Mix Data Types in Table Columns
Numeric, strings, date/time, boolean
Most analysis software will not handle mixed data types
Different software packages will handle mixed data types inconsistently
Can be more difficult to detect errors in the data
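A quick way to check a column for mixed types before analysis, sketched in Python. It assumes the rows were read as text (for example with `csv.DictReader`) and only distinguishes numbers, text, and missing values. The column name in the usage example is hypothetical.

```python
def column_types(rows, column):
    """Return the set of value kinds ('number', 'text', 'missing') in a column."""
    kinds = set()
    for row in rows:
        raw = row[column].strip()
        if raw == "":
            kinds.add("missing")
            continue
        try:
            float(raw)
            kinds.add("number")
        except ValueError:
            kinds.add("text")
    return kinds
```

A column that returns both `number` and `text` (say, depths recorded as `1.2`, `0.8`, and `dry`) is a candidate for splitting into a numeric column plus a separate notes or flag column. Note that `float` also accepts strings such as `nan` and `inf`, so those count as numbers here.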
51
5. Archive Data in Non-Proprietary Data Formats
Microsoft Excel is widely available and used now, but what about in 10 years? 20 years? Will your data disappear?
52
6. Consider Media
CDs? DVDs? External hard drives?
Don’t strand your data!
[Image labels: 2000, 1995, 1976, 1986, 1994, 1985]
53
To the Cloud!
Convenience
Accessibility anywhere
Cross platform
Enhanced sharing
Low cost
But…
  Privacy?
  Delay (slow or non-existent internet)
  Storage, but not much else
  File formats and semantics still matter
Disclaimer: Use of these logos is not authorized by, sponsored by, or associated with Microsoft, Google, or Dropbox.
54
7. Automate Analyses
Code creates reproducible results
  Code is a record of the steps involved in processing and analyzing data
  Code can be shared
  Code can be re-executed at any time
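As a small illustration of these points, the Python sketch below computes summary statistics from a CSV file. The data and the column name are made up; the point is that each step is written down and gives the same answer every time it is run.

```python
import csv
import io
from statistics import mean


def summarize(csv_file, column):
    """Return count, min, mean, and max for one numeric column."""
    values = [float(row[column]) for row in csv.DictReader(csv_file)]
    return {
        "count": len(values),
        "min": min(values),
        "mean": mean(values),
        "max": max(values),
    }


# Hypothetical input; in practice this would be an opened data file.
data = "datetime_utc,discharge_cfs\n2012-09-01T00:00,100.0\n2012-09-01T00:15,110.0\n"
summary = summarize(io.StringIO(data), "discharge_cfs")
```

Anyone with the script and the input file can regenerate `summary`, which is harder to guarantee when the same steps are done by hand in a spreadsheet.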
55
Reproducible Visualization in R
56
8. Maintain Metadata
Borer et al.: “Do not underestimate your ability to forget details about a study!”
  When did the tree that was stuck in my cross section get removed?
You may not analyze your data until years down the road
Exact details of methods, names, files, etc. will become fuzzy
57
Summary (1)
Considering the whole data life cycle is important in planning for a project or study
Data management planning helps satisfy institutional or funder requirements
Assuring data quality includes strategies before, during, and after data collection
58
Summary (2)
Describing data via metadata is important for data discovery, interpretation, integration, and analysis
Relatively simple data management practices can improve data organization, facilitate analysis, improve reproducibility, and improve capacity for reuse