Download presentation
Presentation is loading. Please wait.
1
Introduction to Research Data Management
Sarah Jones Digital Curation Centre, Glasgow Carpentry workshop - Open Research Data, 5 December 2016, Oslo
2
What will we cover? What is research data management?
Consideration and pointers when: Creating data Managing data Sharing data Useful tools and resources
3
What is RDM? Image CC-BY-SA by Janneke Staaks
4
What is Research Data Management?
“the active management and appraisal of data over the lifecycle of scholarly and scientific interest” Create Document Use Store Share Preserve Data management is part of good research practice
5
What is involved in RDM? Data Management Planning Data creation
Annotating / documenting data Analysis, use, versioning Storage and backup Publishing papers and data Preparing for deposit Archiving and sharing Licensing Citing… Create Document Use Store Share Preserve
6
Why manage and share data?
Direct benefits for you To make your research easier! Stop yourself drowning in irrelevant stuff Make sure you can understand and reuse your data again later Advance your career – data is growing in significance Research integrity To avoid accusations of fraud or bad science Evidence findings and enable validation of research methods Meet codes of practice on research conduct Many research funders worldwide now require Data Management and Sharing Plans Potential to share data So others can reuse and build on your data To gain credit – several studies have shown higher citation rates when data are shared For greater visibility, impact and new research collaborations Promote innovation and allow research in your field to advance faster Learn good data habits now! You’ll need them later.
7
What if this was your laptop?
Why YOU need a Data Management Plan
8
Good data management is about making informed decisions
10
Creating data Image CC-SA-ND by Bill Dickinson
11
Data creation tips Choose appropriate formats
Adopt a file naming convention Create metadata and documentation as you go Ensure consent forms, licences and agreements don’t restrict opportunities to share data
12
Choose appropriate file formats
Different formats are good for different things open, lossless formats are more sustainable e.g. rtf, xml, tif, wav proprietary and/or compressed formats are less preservable but are often in widespread use e.g. doc, jpg, mp3 One format for analysis then convert to a standard format Data centres may suggest preferred formats for deposit BioformatsConverter batch converts a variety of proprietary microscopy image formats to the Open Microscopy Environment format - OME-TIFF
13
How will you name your files?
Keep file and folder names short, but meaningful Agree a method for versioning Include dates in a set format e.g. YYYYMMDD Avoid using non-alphanumeric characters in file names Use hyphens or underscores not spaces e.g. day-sheet, day_sheet Order the elements in the most appropriate way to retrieve the record Example from ARM Climate Research Facility
14
Data about data What is metadata?
Documentation and metadata are essentially descriptive information about the information contained in a dataset. There should be good documentation at the study level, for example a description of the research methodology that created the data – [the best metadata the data can have is the publication it supports] or a data paper. There should also be documentation at the file, item and variable level suitable so that someone reusing the data can understand it – this could be ensuring that excel spreadsheets have sensible row and column descriptions or that a document is included with the dataset which properly explains any abbreviations used.
15
Documentation and metadata
Standardised Structured Machine and human readable Metadata helps to cite & disambiguate data Documentation aids reuse Documentation Metadata is a subset of documentation Documentation is a catch-all term which can include some very high-level, human-readable, loosely structured information about the dataset – for example a description of the laboratory method used to generate a set of results or a sketch book accompanying a work of sculpture. The term metadata has come to define a subset of documentation information that uses standardised terms and is presented in a structured way. This metadata will be in a form enabling it to be read by machines and/or may be presented in human readable form as well. Metadata
16
Metadata standards These can be general – such as Dublin Core
Or discipline specific Data Documentation Initiative (DDI) – social science Ecological Metadata Language (EML) - ecology Flexible Image Transport System (FITS) – astronomy Provided in catalogues to aid discoverability Structured so search engines can uncover it Exposed in machine-readable form e.g. XML
17
Dublin Core metadata example
Creator: Donald Cooper Role=Photographer Subject: Shakespeare, William, , Antony and Cleopatra [LC] Description: Vanessa Redgrave as Cleopatra Date: Type: Image Format: JPEG Identifier:4150 [catalogue no] Source: negative no 235 Relation: Antony and Cleopatra: Thompson/73-8 IsPartOf Coverage: Bankside Globe Role=Spatial Rights: Donald Cooper Typically, not every one of the DC elements is required to document this resource Although not apparent in this example, repetition of the Creator element is commonly found in metadata describing performing arts materials, which very often have multiple creators. The Creator element is further refined within this record with the use of a Qualifier.
18
http://rd-alliance.github.io/ metadata-directory
Use metadata standard Biosharing A portal of data standards, databases, and policies Focused on life, environmental and biomedical sciences Metadata Standards Directory Broad, disciplinary listing of standards and tools. Maintained by RDA group metadata-directory
19
Documentation Can others understand the data?
Think about what is needed in order to find, evaluate, understand, and reuse the data. Have you documented what you did and how? Did you develop code to run analyses? If so, this should be kept and shared too. Is it clear what each bit of your dataset means? Make sure the columns/rows are labelled, variable ranges defined, abbreviations explained in data dictionaries…
20
ReadMe files We recommend that a ReadMe be a plain text file containing the following: for each filename, a short description of what data it includes, optionally describing the relationship to the tables, figures, or sections within the accompanying publication for tabular data: definitions of column headings and row labels; data codes (including missing data); and measurement units any data processing steps, especially if not described in the publication, that may affect interpretation of results a description of what associated datasets are stored elsewhere, if applicable whom to contact with questions These recommendations for developing a readme file were put together by the Dryad team and are published on the repository’s web-site.
21
Managing data Image ‘tools’ CC-BY by zzpza
22
Legal and ethical issues
Be aware of legislation that applies to you: Offentlighetsloven (FoI) EIR (Environmental Information Regulations) Data Protection Health Research Act Understand what this means in terms of how data are stored, transferred and shared Your organisation must know what data it possesses It must know whether exceptions to access may apply It must know if some of the data belongs to others It must know what data once existed, but has now been deleted – and why These are difficult questions for most of us!
23
Use appropriate services
TSD provides a platform for researchers working at UiO and in other public research institutions to collect, store and analyze sensitive research data. TSD complies with the directive of privacy and electronic communication in Norway.
24
Ask for consent for data sharing
If not, data centres won’t be able to accept the data – regardless of any conditions on the original grant.
25
Where will you store the data?
Your own device (laptop, flash drive, server etc.) And if you lose it? Or it breaks? Departmental drives or university servers “Cloud” storage Do they care as much about your data as you do? The decision will be based on how sensitive your data are, how robust you need the storage to be, and who needs access to the data and when
26
One copy = risk of data loss
CC image by momboleum on Flickr CC image by Sharyn Morrow on Flickr
27
Who will do the backup? Use managed services where possible (e.g. University filestores rather than local or external hard drives), so backup is done automatically 3… 2… 1… backup! at least 3 copies of a file on at least 2 different media with at least 1 offsite Ask central IT team for advice
28
Backup and preservation – not the same thing!
Backups Used to take periodic snapshots of data in case the current version is destroyed or lost Backups are copies of files stored for short or near-long-term Often performed on a somewhat frequent schedule Archiving Used to preserve data for historical reference or potentially during disasters Archives are usually the final version, stored for long-term, and generally not copied over Often performed at the end of a project or during major milestones Remember that backup and preservation are not the same thing (though the terms are often used interchangeably). Backups are performed regularly to take periodic snapshots of the data for the short to medium term, whereas archiving is preserving the final version of the data for the long-term. You should make sure your data are backed-up during the active phase of research and that any data needed for the long-term are archived.
29
How to keep you data secure?
Develop a practical solution that fits your circumstances Store your data on managed servers Restrict access Keep anti-virus software up-to-date Encrypt mobile devices carrying sensitive information
30
Data sharing Image CC-BY-NC-ND by talkingplant
31
The data deluge is upon us
Sensor’s ability to produce data outstrips IT’s ability to process it
32
Why not keep it all? Globally, data volumes are doubling every two years John Gantz and David Reinsel 2011 Extracting Value from Chaos
33
Storage mgmt costs rise long-term
Hardware costs decline, but power and staff costs keep rising David Rosenthal blog.dshr.org/2012/05/lets-just-keep-everything-forever-in.html
34
The storage is cheap fallacy
Decreasing hardware costs offset by exponential growth in data volume Backup and mirroring multiplies cost of preserved data Discovery becomes harder as the chaff outweighs the wheat Curation of unused data is a waste of resources
35
Select what to keep (and share…)
What must be kept to manage compliance risk? What data could be re-used? What data has value and should be kept? Given costs what will or won’t be kept? How will it be kept and shared, on what terms?
36
Kevin Ashley, DCC - Warsaw data workshop
Should all data be open? NO Many reasons – most to do with human subjects But data existence should always be open Allows discovery & negotiation on use Avoids pointless replication Medicine does, however, provide some clear reasons why we can’t just stick all research data on the internet for anyone to trawl through. When human subjects are involved there are real concerns about confidentiality. Yet what alltrials.net and other initiatives make clear is that the *existence* of the data should never be hidden. That allows it to be discovered and for negotiations to take place about its use. It avoids costly replication, which can delay scientific discovery and involve human suffering when the replication takes the form of a clinical trial. CODATA Trieste - Kevin Ashley, DCC - CC-BY 2.0
37
How to make data open? Choose your dataset(s) Apply an open license
What can you may open? You may need to revisit this step if you encounter problems later. Apply an open license Determine what IP exists. Apply a suitable licence e.g. CC-BY Make the data available Provide the data in a suitable format. Use repositories. Make it discoverable Post on the web, register in catalogues…
38
License research data openly
This DCC guide outlines the pros and cons of each approach and gives practical advice on how to implement your licence CREATIVE COMMONS LIMITATIONS NC Non-Commercial What counts as commercial? ND No Derivatives Severely restricts use These clauses are not open licenses Horizon 2020 Open Access guidelines point to: or Guidance from the DCC can also help researchers to understand data licensing. This guide outlines the pros and cons of each approach e.g. the limitations of some CC options The OA guidelines under Horizon 2020 point to CC-0 or CC-BY as a straightforward and effective way to make it possible for others to mine, exploit and reproduce the data. See p11 at: Also in existence: Open Data Commons Open Government Licence 38
39
EUDAT licensing tool Answer questions to determine which licence(s) are appropriate to use
40
Deposit in a data repository
The EC guidelines point to Re3data as one of the registries that can be searched to find a home for data /content/re3data-demo 40
41
How to select a repository?
Look for provision from your community, university, publisher, funder etc Check they match your particular data needs: e.g. formats accepted; mixture of Open and Restricted Access. See if they provide guidance on how to cite the deposited data. Do they assign a persistent & globally unique identifier for sustainable citations and to links back to particular researchers and grants? Look for certification as a ‘Trustworthy Digital Repository’ with an explicit ambition to keep the data available in long term.
42
Norwegian repository landscape
43
Zenodo Zenodo is a multi-disciplinary repository that can be used for the long-tail of research data An OpenAIRE-CERN joint effort Multidisciplinary repository accepting Multiple data types Publications Software Assigns a Digital Object Identifier (DOI) Links funding, publications, data & software 43
44
What are persistent identifiers?
They are an alphanumeric code identifying a resource, organisation or individual They must be Unique Persistent Ideally they should be actionable too
45
How do persistent identifiers work
The likely home of an electronic research data resource attached to a persistent identifier is a data repository (exceptions being resources not available over the web which have a PI attached to a metadata record – physical resource, cumbersomely large dataset etc.)
46
Citing research data: why?
47
How to cite data Key citation elements Author Publication date Title
Location (= identifier) Funder (if applicable)
48
Resources Image ‘Energy Resources | Energie Quelle’ CC-BY-NC by K. H. Reichert
49
Managing and sharing data: a best practice guide
50
Guidance and training resources
ESIP Data Management Training clearing house for environmental sciences DataONE best practices DCC resources FOSTER open science portal
51
Acquire research data skills
52
Tools for managing data
managing-active-research-data
53
Also look for national & local support!
support/research/research-data
54
Finally… Well-managed data makes your research easier, now and in future Well-managed data is easier to share, more likely to be re-used Sharing data is good for you It’s good for all of us It isn’t as hard as you think – we’re here to show you how!
55
How do you share data effectively?
Use appropriate repositories, this catalogue is a good place to start Document and describe it enough for others to understand, use and cite guides/cite-datasets Licence it so others can reuse
56
Thanks for listening For DCC resources see: Follow DCC us on and #ukdcc
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.