Published by Scott Fields. Modified over 7 years ago.
1
The CODATA Vision on Data Publication and Data Citation
Sarah ORCID:
Nordic Workshop on Data Citation Policies and Practices: How to Make it Happen?
Helsinki, 23 December 2016
2
The Data Deluge
3
Data, Reproducibility and Science
Science should be reproducible – other people doing the same experiments in the same way should get the same results. Observational data is not reproducible (unless you have a time machine). Therefore we need to have access to the data to confirm the science is valid! Poor data analysis generates false facts – and false facts and inaccessible data undermine science and its credibility.
4
Journals have always published data…
Suber cells and mimosa leaves: Robert Hooke, Micrographia, 1665. The Scientific Papers of William Parsons, Third Earl of Rosse. …but datasets have gotten so big, it’s not useful to publish them in hard copy anymore.
5
Hard copy of the Human Genome at the Wellcome Collection
6
A crisis of reproducibility and credibility?
Pre-clinical oncology – 89% not reproducible. Why? Misconduct/fraud; invalid reasoning; absent or inadequate data and/or metadata. The fundamental challenge is to scientific self-correction. Journals can no longer contain the data, and neither scientists nor journals have taken the obvious step of having data relevant to a publication concurrently available in an electronic database (for example, last year’s Nature paper revealing that only 11% of results in 50 benchmark papers in pre-clinical oncology were replicable). If lack of Oldenburg’s rigour in presenting evidence is widespread, a failure of replicability risks undermining science as a reliable way of acquiring knowledge and can therefore undermine its credibility. The data providing the evidence for a published concept MUST be concurrently published, together with the metadata. To do otherwise is scientific MALPRACTICE.
7
It’s not just data! Experimental protocols, workflows, software code, metadata, things that went wrong! …
8
Incentives for Open Data
Need reward structures and incentives for researchers to encourage them to make their data open. Data citation and publication (issues with treating data as a special case of publications…).
9
Citeable does not equal Open!
Just like you can cite a paper that is behind a paywall, you can cite a dataset that isn’t open. Making something citeable means that: you know it exists; you know who’s responsible for it; you know where to find it; you know a little bit about it (title, abstract, …), even if you can’t download/read the thing yourself. Citation gives benefits that encourage researchers to make their data open.
10
Should ALL data be open? Most data produced through publicly funded research should be open. But! Confidentiality issues (e.g. named persons’ health records); conservation issues (e.g. maps of locations of rare animals at risk from poachers); security issues (e.g. data and methodologies for building biological weapons). There should be a very good reason for publicly funded data to not be open.
11
Most people have an idea of what a publication is
12
What is a Dataset? DataCite’s definition ( Business_Models_Principles_v1.0.pdf): Dataset: "Recorded information, regardless of the form or medium on which it may be recorded including writings, films, sound recordings, pictorial reproductions, drawings, designs, or other graphic representations, procedural manuals, forms, diagrams, work flow, charts, equipment descriptions, data files, data processing or computer programs (software), statistical records, and other research data." (from the U.S. National Institutes of Health (NIH) Grants Policy Statement via DataCite's Best Practice Guide for Data Citation). In my opinion a dataset is something that is: the result of a defined process; scientifically meaningful; well-defined (i.e. clear definition of what is in the dataset and what isn’t).
13
Some examples of data (just from the Earth Sciences)
Time series, some still being updated, e.g. meteorological measurements; large 4D synthesised datasets, e.g. climate, oceanographic, hydrological and numerical weather prediction model data generated on a supercomputer; 2D scans, e.g. satellite data, weather radar data; 2D snapshots, e.g. cloud camera; traces through a changing medium, e.g. radiosonde launches, aircraft flights, ocean salinity and temperature; datasets consisting of data from multiple instruments as part of the same measurement campaign; physical samples, e.g. fossils.
14
The Understandability Challenge: Article
15
The Understandability Challenge: Data
What the data set looks like on disk. What the raw data files look like. I could make these files open easily, but no one would have a clue how to use them!
16
Creating a dataset is hard work!
"Piled Higher and Deeper" by Jorge Cham. Documenting a dataset so that it is usable and understandable by others is extra work!
17
“I’m all for the free sharing of information, provided it’s them sharing their information with us.” Mustrum Ridcully, D.Thau., D.M., D.S., D.Mn., D.G., D.D., D.C.L., D.M. Phil., D.M.S., D.C.M., D.W., B.El.L, Archchancellor, Unseen University, Ankh-Morpork, Discworld. As quoted in “Unseen Academicals”, by Terry Pratchett.
18
I wasn’t a co-author. I didn’t get an acknowledgement.
Getting scooped. It happened to me! I shared my data with another research group. They published the first results using that data. I wasn’t a co-author. I didn’t get an acknowledgement.
19
Citing Data. We already have a working method for linking between publications which is: commonly used; understood by the research community; used to create metrics to show how much of an impact something has (citation counts); applied to digital objects (digital versions of journal articles). We can extend citation to other things like data, code and multimedia. And the best bit is, researchers don’t need to learn a new method of linking – they cite like they normally would!
20
Out of Cite, Out of Mind: Report of the CODATA Task Group on Data Citation
The report was published in the CODATA Data Science Journal on 13 September 2013.
21
First Principles for Data Citation
1. Status of Data: Data citations should be accorded the same importance in the scholarly record as the citation of other objects.
2. Attribution: A citation to data should facilitate giving scholarly credit and legal attribution to all parties responsible for those data.
3. Persistence: Citations should refer to objects that persist.
4. Access: Citations should facilitate access to data by humans and by machines.
5. Discovery: Citations should support the discovery of data and their documentation.
22
First Principles for Data Citation
6. Provenance: Citations should facilitate the establishment of provenance of data.
7. Granularity: Citations should support the finest-grained description necessary to identify the data.
8. Verifiability: Citations should contain information sufficient to identify the data unambiguously.
9. Metadata Standards: Citations should employ existing metadata standards.
10. Flexibility: Citation methods should be sufficiently flexible to accommodate the variant practices among communities but should not differ so much that they compromise interoperability of data across communities.
23
Principles are supplemented with a glossary, references and examples
The Noble Eight-Fold Path to Citing Data. The Joint Declaration of Data Citation Principles: importance; credit and attribution; evidence; unique identification; access; persistence; specificity and verifiability; interoperability and flexibility.
24
Open/Closed/Published/unpublished
We want to encourage researchers to make their data: open; persistent; quality assured, through scientific peer review or repository-managed processes. Unless there’s a very good reason not to! Publishing = making something public after some formal process which adds value for the consumer (e.g. peer review) and provides commitment to persistence.
[Figure: CD, webpage, OA journal, subscription journal, data repository and shared work space, placed on axes of openness, quality and persistence]
25
How to publish data
Stick it up on a webpage somewhere: issues with stability, persistence, discoverability… and maintenance of the website. Put it in the cloud. Attach it to a journal paper and store it as supplementary materials: journals are not too keen on archiving lots of supplementary data, especially if it’s large volume. Put it in a disciplinary/institutional repository. Write a data article about it and publish it in a data journal. By David Fletcher
26
Part of the Italsat data archive – on CDs on a shelf in my office
How not to do it!
27
Open is not enough! “When required to make the data available by my program manager, my collaborators, and ultimately by law, I will grudgingly do so by placing the raw data on an FTP site, named with UUIDs like 4e283d36-61c4-11df-9a26-edddf420622d. I will under no circumstances make any attempt to provide analysis source code, documentation for formats, or any metadata with the raw data. When requested (and ONLY when requested), I will provide an Excel spreadsheet linking the names to data sets with published results. This spreadsheet will likely be wrong -- but since no one will be able to analyze the data, that won't matter.” -
28
It’s ok, I’ll just put it out there and if it’s important other people will figure it out
Phaistos Disk, 1700 BC. These documents have been preserved for thousands of years! But they’ve both been translated many times, with different meanings each time. Data has to be understandable to be usable (this means documentation, supporting information, metadata).
29
Usability, trust, metadata
When you read a journal paper, it’s easy to read and get a quick understanding of the quality of the paper. You don’t want to be downloading many GB of dataset to open it and see if it’s any use to you. Need to use proxies for quality: Do you know the data source/repository? Can you trust it? Is there enough metadata so that you can understand and/or use the data? In the same way that not all journal publishers are created equal, not all data repositories are created equal. Example metadata from a published dataset: “rain.csv contains rainfall in mm for each month at Marysville, Victoria from January 1995 to February 2009”
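A description like the rain.csv example is only a useful proxy for quality if the file actually matches it. The following is a minimal sketch of such a check, assuming a hypothetical column layout (`year`, `month`, `rainfall_mm`) that the slide does not specify; only the date range comes from the example metadata.

```python
import csv
import io
from datetime import date


def month_range(start: date, end: date):
    """Yield (year, month) for every month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def missing_months(csv_text: str, start: date, end: date):
    """Return the (year, month) pairs the metadata promises but the file lacks.

    The column names 'year' and 'month' are an assumed layout, not taken
    from the published dataset.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    present = {(int(row["year"]), int(row["month"])) for row in reader}
    return [ym for ym in month_range(start, end) if ym not in present]


if __name__ == "__main__":
    # Tiny made-up sample with a gap, just to show the check in action.
    sample = "year,month,rainfall_mm\n1995,1,10.2\n1995,3,4.0\n"
    print(missing_months(sample, date(1995, 1, 1), date(1995, 3, 1)))
```

Under the stated range (January 1995 to February 2009), a complete monthly file would hold one row for each month yielded by `month_range`.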
30
How we (NERC) cite data. We use digital object identifiers (DOIs) as part of our dataset citation because: they are actionable, interoperable, persistent links for (digital) objects; scientists are already used to citing papers using DOIs (and they trust them); academic journal publishers are starting to require datasets be cited in a stable way, i.e. using DOIs. We have a good working relationship with the British Library and DataCite.
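As an illustration of what "actionable" means in practice, the sketch below turns a DOI into a resolvable link and places it in a citation string. The citation layout (creator, year, title, publisher, identifier) is one common pattern rather than a NERC rule stated on the slide, and the DOI, names and titles shown are made-up placeholders.

```python
from dataclasses import dataclass

# Public DOI resolver; appending a DOI gives a link that redirects to the landing page.
DOI_RESOLVER = "https://doi.org/"


def normalise_doi(raw: str) -> str:
    """Strip common prefixes so that only the bare DOI (10.xxxx/suffix) remains."""
    doi = raw.strip()
    for prefix in ("https://doi.org/", "https://dx.doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {raw!r}")
    return doi


@dataclass
class DatasetCitation:
    """Hypothetical minimal record; field names are illustrative only."""
    creators: str
    year: int
    title: str
    publisher: str
    doi: str

    def format(self) -> str:
        link = DOI_RESOLVER + normalise_doi(self.doi)
        return f"{self.creators} ({self.year}): {self.title}. {self.publisher}. {link}"


if __name__ == "__main__":
    citation = DatasetCitation(
        creators="Example, A.",
        year=2016,
        title="Example rainfall dataset",
        publisher="Example Data Centre",
        doi="doi:10.5072/example-dataset",  # placeholder DOI, not a real dataset
    )
    print(citation.format())
```

Because the citation carries the resolver link rather than a repository URL, it stays valid if the data centre later moves the landing page.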
31
What sort of data can we/will we assign a DOI to?
Dataset has to be: stable (i.e. not going to be modified); complete (i.e. not going to be updated); permanent – by assigning a DOI we’re committing to make the dataset available for posterity; good quality – by assigning a DOI we’re giving it our data centre stamp of approval, saying that it’s complete and all the metadata is available. When a dataset is cited that means: there will be bitwise fixity; with no additions or deletions of files; no changes to the directory structure in the dataset “bundle”. A DOI should point to an HTML representation of some record which describes a data object – i.e. a landing page. Upgrades to versions of data formats will result in new editions of datasets.
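The fixity conditions above (no changed bytes, no added or deleted files, no change to the directory structure) can be checked mechanically with a checksum manifest. What follows is a minimal sketch of that general technique, not a description of the data centres' actual tooling; the function names are hypothetical and SHA-256 is an assumed choice of hash.

```python
import hashlib
from pathlib import Path


def build_manifest(root) -> dict:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads whole files into memory; fine for a sketch, not for huge files.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def fixity_report(reference: dict, current: dict) -> dict:
    """Compare two manifests and list added, removed and changed files."""
    return {
        "added": sorted(set(current) - set(reference)),
        "removed": sorted(set(reference) - set(current)),
        "changed": sorted(
            p for p in reference.keys() & current.keys()
            if reference[p] != current[p]
        ),
    }


def is_fixed(reference: dict, current: dict) -> bool:
    """True when the bundle is bit-for-bit identical to the reference manifest."""
    report = fixity_report(reference, current)
    return not (report["added"] or report["removed"] or report["changed"])
```

A manifest built when the DOI is assigned can be stored with the landing-page record and compared against a fresh one at any later date. A file that moves to another directory appears as one removal plus one addition, so changes to the directory structure are caught too; empty directories are not tracked in this sketch.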
32
Example published dataset
33
The traditional online journal model
1) Author prepares the paper using word processing software.
2) Author submits the paper as a PDF/Word file.
3) Reviewer reviews the PDF file against the journal’s acceptance criteria.
Overlay journal model for publishing data
1) Author prepares the data paper using word processing software and the dataset using appropriate tools.
2a) Author submits the data paper to the journal.
2b) Author submits the dataset to a repository.
3) Reviewer reviews the data paper and the dataset it points to against the journal’s acceptance criteria.
[Diagram labels: BADC; BODC; Data; A Journal (any online journal system); Data Journal (Geoscience Data Journal); word processing software with journal template; PDF; html]
35
Summary and maybe conclusions?
We need to open the products of research: to encourage innovation and collaboration; to give credit to the people who’ve created them; to be transparent and trustworthy. Openness does come at a cost! It’s not enough for data to be open: it needs to be usable and understandable too. Data citation and publication are ways of encouraging researchers to make their data open, or at least tell the world that their data exists! We need a culture change – but it’s already happening!
36
Thanks! Any questions? Be careful of your citations!
@sorcha_ni
38
What is a data article? A data article describes a dataset, giving details of its collection, processing, software, file formats, etc., without the requirement of novel analyses or ground-breaking conclusions. It covers the when, how and why the data was collected and what the data product is. Many data journals already exist – see a list (in no particular order) at: ataJournalsList
39
Why bother publishing the dataset in a data journal?
Why not just publish a normal journal paper citing the data? Data journals: peer-review the data; publish negative results; make it quicker to publish the data as they don’t require analysis or novelty – the dataset is published “as-is”; provide attribution and credit for the data collectors who might not be involved with the analysis; make it easier to find datasets, understand them and be sure of their quality and provenance.
40
Peer review, data and data journals
Peer-review of a scientific publication is generally only applied to analysis, interpretation and conclusions, and not the underlying data. But if the conclusions are to be valid, the data must be of good quality. We need quality assurance of the data underlying research publications – either through peer-review or data repository checking. Researchers need credit for creating, managing and opening their data. Data journals provide that credit in an environment where academic status is solely based on publication record.