Data Management What? Why? How?
What do we mean by … Managing your Research (aka Data) …
- Ensuring physical integrity of files and helping to preserve them
- Ensuring safety of content (data protection, ethics, morality, etc.)
- Describing the data (via metadata) and recording its history (provenance)
- Providing or enabling appropriate access at the right time, or restricting access, as appropriate
- Transferring custody at some point, and possibly destroying
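One common way to check the physical integrity of files is to record a checksum for each file and compare it later. The following is a minimal sketch, assuming SHA-256 checksums and a simple in-memory manifest; the function names `sha256_of`, `build_manifest` and `changed_files` are hypothetical and not part of any tool mentioned in these slides.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> Dict[str, str]:
    """Map each file's path (relative to data_dir) to its checksum."""
    return {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(data_dir: Path, manifest: Dict[str, str]) -> List[str]:
    """List manifest entries that are now missing or have a different checksum."""
    current = build_manifest(data_dir)
    return sorted(
        name for name, checksum in manifest.items()
        if current.get(name) != checksum
    )
```

A manifest built when the data are first stored can be saved next to the data and re-checked after each backup or transfer; any name returned by `changed_files` points to a file that was altered or lost.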
What do we mean by data management?
Simply put, data management is all of the activities necessary to make research data discoverable, accessible and understandable today, tomorrow, and well into the future. And it is done throughout the research lifecycle.
Managing Data in the Research Life Cycle
- Choosing file formats
- Access control & security
- File organization & naming conventions
- Backup & storage
- File format conversions
- Version control
- Sharing and preservation
- Document all project/file details
The activities fall into six areas:
- Organization: file formats, naming conventions, and version control.
- Documentation (metadata): variable names and descriptions, code books to explain classification schemes and codes, algorithms used to transform the data, software (name and version) used to collect, view, or process the data.
- Storage: active storage (where the collected data is stored while the project is ongoing), who is responsible for managing it, backup schedule and location, data privacy and security concerns.
- Sharing: which data are you sharing, who is responsible for managing it, rules for access, intellectual property and/or licensing, data privacy and security concerns.
- Preservation: which data are you keeping, sustainable formats, data/file conversion.
- Archiving: which data are you archiving, location(s), who is responsible for managing it, backups and redundancy, access rules, data privacy and security concerns.
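A naming convention is easier to follow when file names are generated and checked rather than typed by hand. The sketch below assumes one possible convention, `<project>_<description>_<YYYYMMDD>_v<NN>.<ext>`; the pattern and the helper names `slug`, `make_filename` and `follows_convention` are illustrative assumptions, not a standard prescribed by these slides.

```python
import re
from datetime import date

# Assumed convention: project_description_YYYYMMDD_vNN.ext (all lowercase).
NAME_PATTERN = re.compile(r"^[a-z0-9-]+_[a-z0-9-]+_\d{8}_v\d{2,}\.[a-z0-9]+$")


def slug(text: str) -> str:
    """Lowercase the text and replace runs of other characters with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def make_filename(project: str, description: str, day: date,
                  version: int, extension: str) -> str:
    """Build a file name that follows the assumed convention."""
    if version < 1:
        raise ValueError("version numbers start at 1")
    ext = extension.lstrip(".").lower()
    return f"{slug(project)}_{slug(description)}_{day:%Y%m%d}_v{version:02d}.{ext}"


def follows_convention(name: str) -> bool:
    """Check whether an existing file name matches the assumed convention."""
    return NAME_PATTERN.match(name) is not None


print(make_filename("Soil Survey", "pH readings", date(2014, 3, 7), 2, ".CSV"))
# prints: soil-survey_ph-readings_20140307_v02.csv
```

Putting the date and a zero-padded version number in the name keeps files sorted in a predictable order and makes simple version control possible even without dedicated software.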
(Good) Data Management…
…helps research to be:
- Replicated and verified
- Preserved for future use
- Linked with other research products
- Shared and reused
…helps researchers:
- Meet funding requirements
- Increase visibility of research
- Save time and effort (avoid data loss)
- Deal with an ever-increasing amount of data
So what is Data Management?
For research:
- Enables data preservation: makes preserving data for the future easier
- Supports sharing: you can focus on the research and not user requests; increases research impact
For researchers:
- Saves time: simplifies your research and increases your research efficiency
- Encourages better documentation: lets others understand your data
- Keeps funders happy: meets requirements
But most of all, it allows you to focus your energy on the research, which is what you want to be doing! This is why you manage your data.
http://www.healthcare-informatics.com/article/guest-blog-data-management-challenge-unlocking-value-clinical-data-many-times-requires-enter
What is a Data Management Plan?
- A comprehensive plan of how you will manage your research data throughout the lifecycle of your research project
AND
- Brief description of how you will comply with the funder's data sharing policy
- Reviewed as part of a grant application
A data management plan, or DMP, is a document that helps the researcher to deal with the data generated (or otherwise obtained) in a research project. From the funder's viewpoint, a DMP is usually a document, or a section in another document, that must be submitted with a grant proposal and that describes how you will comply with their data SHARING policy.
There are several ways to think about a data management plan:
- A document that is created to manage the data in your lab or project. This is a 'living document' that is designed to evolve over time. It would cover the following topics: Description of research; Data source(s); Data collection, creation and analysis; Data administration; Data sharing; Archiving; Data documentation and metadata; and Budget. A typical data management plan, or handbook, might be as large as 50 pages. It would serve as a resource for the lab members, and could be used for training new members.
- A document that is created at the start of a research project, which describes the data to be collected, probable sizes and formats, collection and analysis methods and tools, software, instruments, processes, workflows, and storage and sharing options. It is the blueprint of the research project.
- A document which is required to be submitted to a funder as part of a grant proposal, and which describes specific data management procedures as specified by the funder.
Who’s Requiring Data Management?
Funders that require a Data Management Plan (DMP) or require sharing of results per a data policy:
- National Science Foundation (NSF)
- National Institutes of Health (NIH)
- National Oceanic and Atmospheric Administration (NOAA)
- Institute of Museum and Library Services (IMLS)
- National Endowment for the Humanities, Office of Digital Humanities (NEH)
- Andrew W. Mellon Foundation
- Bill & Melinda Gates Foundation
- NASA
- NEH, Preservation & Access
- IES, Institute of Education Sciences
- Wellcome Trust
Why do you need a data management plan? Read solicitations for proposals carefully and ask the program director about specific data management requirements. Build time into your proposal development to formulate a data management plan!
Funders are both private and public, in the US, the UK and other countries.
Other agencies require sharing, but do not explicitly require a DMP as part of a proposal: NASA, NEH Preservation & Access.
NEH: sustainability of project deliverables and datasets (long-term preservation); dissemination (sharing).
The DMPTool (http://dmptool.org) is a good place to start to find out if your funder requires a DMP or a sharing plan. This list is not exhaustive.
Parts of a (Generic) NSF Data Management Plan
- Products of the Research: The types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project.
- Data Formats: The standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies).
- Access to Data and Data Sharing Practices and Policies: Policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements.
- Policies for Re-Use, Re-Distribution, and Production of Derivatives.
- Archiving of Data: Plans for archiving data, samples, and other research products, and for preservation of access to them.
These are the general NSF DMP requirements (not specific to a division or directorate). See the DMPTool (http://dmptool.org) for a list of NSF specifications.
Things to keep in mind with the NSF: Every Directorate can have additional rules for any proposal. If the solicitation you are looking at doesn’t mention any additional rules for the DMP, it is always a good idea to go to the parent site and see if they list any additional requirements. For this solicitation, the Division is ‘Research on Learning in Formal and Informal Settings (DRL)’, and the Directorate is the ‘Directorate for Education and Human Resources’ (EHR).
Another, easier, method is to go to the ‘Dissemination and Sharing of Research Results’ page at http://www.nsf.gov/bfa/dias/policy/dmp.jsp, scroll down to the appropriate Directorate, Office, Division, Program or other unit, and see if your solicitation is listed. In this case, it is: EHR has a Directorate-wide Guidance document, http://www.nsf.gov/bfa/dias/policy/dmpdocs/ehr.pdf. Even if neither the solicitation nor the Division refers to it, it is a good idea to follow its guidelines.
Grant Proposal Guide (GPG) Chapter II.C.2.j http://www.nsf.gov/pubs/policydocs/pappguide/nsf13001/gpg_2.jsp#dmp
Department of Energy Data Management Plan
- Data Types and Sources: A brief, high-level description of the data to be generated or used through the course of the proposed research and which of these are considered digital research data necessary to validate the research findings.
- Content and Format: A statement of plans for data and metadata content and format including, where applicable, a description of documentation plans, annotation of relevant software, and the rationale for the selection of appropriate standards.
- Sharing and Preservation: Means for sharing and the rationale for any restrictions, and a timeline for sharing and preservation.
- Protection: A statement of plans, where appropriate and necessary, to protect confidentiality, personal privacy, Personally Identifiable Information
- Rationale: A discussion of the rationale or justification for the proposed data management plan.
- Software: Software and data created by funded research must be released with sufficient descriptions to facilitate the validation of research results. (Optional)
The DOE DMP requirements are included in the DMPTool (http://dmptool.org).
II. DMPs should reflect relevant standards and community best practices for data and metadata, and make use of community accepted repositories whenever practicable.
III. Data sharing means making data available to people other than those who have generated them. Data preservation means providing for the usability of data beyond the lifetime of the research activity that generated them. This is a BIG section on what to include.
IV. Protection: DMPs must protect confidentiality, personal privacy, Personally Identifiable Information, and U.S. national, homeland, and economic security; recognize proprietary interests, business confidential information, and intellectual property rights; avoid significant negative impact on innovation and U.S. competitiveness; and otherwise be consistent with all applicable laws, regulations,
V. Rationale: the potential impact of the data within the immediate field and in other fields, and any broader societal impact.
Suggested Elements for a Data Management Plan: http://science.energy.gov/funding-opportunities/digital-data-management/suggested-elements-for-a-dmp/