Published by Samantha Gibbs. Modified over 8 years ago.
1
DATA CITATION
Laurie Goodman, PhD
Editor-in-Chief, GigaScience
Laurie@gigasciencejournal.com
ORCID ID: 0000-0001-9724-5976
Twitter: @GigaScience
Credit Where Credit is Due
2
Journal and database for large-scale data studies Editor-in-Chief: Laurie Goodman Executive Editor: Scott Edmunds Editor: Nicole Nogoy Assistant Editor: Hans Zauner GigaDB: Chris Hunter, Jesse Xiao GigaGalaxy: Peter Li in conjunction with
3
Introduction to GigaScience
Article types include research articles, data notes, technical notes, reviews, commentary, and editorials
Has a linked database, GigaDB, that hosts all data types
Has journal-linked source code sharing (GitHub) and computational platforms (Galaxy), etc.
Has an in-house biocurator and data scientists to aid researchers in sharing information and putting it into appropriate databases
4
GigaDB piggy-backs onto China National Genbank (CNGB)
China National Genbank: launched this month
10PB object storage (Aliyun “S3”)
1PB high-performance storage: Huawei OceanStor disk array
480 cores, 2560G memory
Internet network bandwidth 250M, 500M, 1G (fibre 1)
10G dark fibre direct connect between genebank and BGI headquarters (fibre 2)
Internet network bandwidth 100M, 200M, 300M (fibre 3)
Aliyun private cloud software platform
GigaDB has a dedicated server within CNGB, an offsite server in HK for backup, and is implementing Amazon cloud for storage and data use in 2017.
5
Why Share Data? Reproducibility Transparency Improved data quality More people accessing data speeds scientific discovery People are dying
6
By the end of the day: 334 people will have died of the measles Data from World Health Organization Fact Sheets http://www.who.int/en/
7
Cultural Reasons Not to Share Fear of journals considering data publication as prior publication. Fear of being scooped (Data Parasites*) Lack of career advancement due to no credit for data production, only for analysis/concept papers. *http://www.nejm.org/doi/full/10.1056/NEJMe1516564 Response from Functional Genomics Data Society: http://fged.org/projects/data-sharing-and-research-parasites/
8
A Tale of Two Bacteria
1. On May 2, 2011, German doctors reported the first case of an E. coli infection that was accompanied by hemolytic-uremic syndrome
2. On May 21, 2011, the first death occurred from this bacterium (denoted E. coli O104:H4)
3. On June 3, 2011, BGI completed a draft sequence of E. coli O104:H4 from a sample provided by doctors at the University Medical Centre Hamburg-Eppendorf
4. At this point, the leaders at BGI held a discussion about whether to release the sequence data immediately: what were the potential repercussions of doing so?
The question arose: if the data were released now, would it affect their ability to publish later?
9
A Tale of Two Bacteria
In one world, the researchers (who were concerned about their ability to publish, as this is the way to obtain recognition and grants, which are essential for them to work) waited. The first publication appeared on July 29th.
In another world, the researchers (who decided public health was more important than obtaining a publication) released the data immediately. The first publication appeared on July 29th, but was not from the group who released the data (though information on that data was included).
10
To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001
These data were put on an FTP server under a CC0 waiver and also given a DOI to make access ‘permanent’
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
11
Whether the concern about the ability to publish if data are released early is real or imagined, researchers act on that concern.
Note: Harmsen’s group DID share the O104:H4 outbreak strain data immediately upon sequencing. The data referred to here were from the 2001 strain that was believed to be the strain involved in a 2001 outbreak of similar type. This slide is meant only to highlight that concerns about being scooped drive early sharing decisions. Given that the first paper published did use the early available O104:H4 data, it would be expected that these data, had they been shared, would have been used in that paper as well.
12
By the end of the day: 1,027 people will have died from influenza Data from World Health Organization Fact Sheets http://www.who.int/en/
13
Deconstructing a paper into accessible, useable, trackable, interlinked units Need to provide credit to reward sharing and proper organization of: Narrative Data/Metadata availability/curation Software availability Interoperability Availability of workflows Transparent analyses Data/ MetaData Software Methods Narrative
14
Deconstructing a paper into accessible, useable, trackable, interlinked units Currently publishers provide credit for this: Narrative Data/Metadata availability/curation Software availability Interoperability Availability of workflows Transparent analyses Data/ MetaData Software Methods Narrative
15
Moving Beyond the Narrative
16
How We Envision Research Publication (Communicating Science) Data Sets in GigaDB Analyses in GigaGalaxy Paper in GigaScience Linked to Open-access journal Data Publishing Platform Data Analysis Platform
17
Paper DOI Data set DOI Linking of papers and data by citation of DOIs
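As a minimal sketch of how a data set DOI can be turned into a link and a reference-list entry, the snippet below formats the E. coli O104:H4 data set cited earlier in this deck. The `Record` class, its field names, and the citation layout are illustrative assumptions rather than a GigaScience or DataCite specification, and the author list is abbreviated for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A citable research object. Field names are hypothetical."""
    authors: str
    year: int
    title: str
    publisher: str
    doi: str


def doi_url(doi: str) -> str:
    """Turn a bare DOI into a link through the doi.org resolver."""
    return "https://doi.org/" + doi.strip()


def cite(record: Record) -> str:
    """Format a record as a reference-list entry (assumed layout)."""
    return (f"{record.authors} ({record.year}) {record.title}. "
            f"{record.publisher}. {doi_url(record.doi)}")


# Values taken from the data set citation shown in this deck;
# the author list is shortened to "et al." here.
dataset = Record(
    authors="Li, D et al.",
    year=2011,
    title="Genomic data from Escherichia coli O104:H4 isolate TY-2482",
    publisher="BGI Shenzhen",
    doi="10.5524/100001",
)

print(cite(dataset))
```

Because a paper and a data set would be formatted by the same function, a data set DOI can sit in a reference list alongside paper DOIs and be tracked in the same way.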
18
NO Hosting all data types
19
By the end of the day: 1,718 people will have died of malaria Data from World Health Organization Fact Sheets http://www.who.int/en/
20
Data Publication and Citation Promotes Rapid and Open Sharing It gets to the heart of the cultural reasons not to share
21
What is a Data Publication?
1. Publishing a standard article that describes the data.
2. Making the data itself citable.
22
Make it easy to cite See where it got cited! Describe the data
23
Current list Of Darwin Finch Data Citations on Google Scholar …And more
24
By the end of the day: 4,110 people will have died of complications from diabetes Data from World Health Organization Fact Sheets http://www.who.int/en/
25
Cultural Reasons Not to Share Fear of journals considering data publication as prior publication. Fear of being scooped. (Data Parasites*) Lack of career advancement due to no credit for data production- only papers. *http://www.nejm.org/doi/full/10.1056/NEJMe1516564 Response from Functional Genomics Data Society: http://fged.org/projects/data-sharing-and-research-parasites/
27
Cultural Reasons Not to Share Fear of journals considering data publication as prior publication. Fear of being scooped (Data Parasites*) Lack of career advancement due to no credit for data production- only papers. *http://www.nejm.org/doi/full/10.1056/NEJMe1516564 Response from Functional Genomics Data Society: http://fged.org/projects/data-sharing-and-research-parasites/
28
Releasing Data Early with a Citation
Direct Data Citation Encourages data release prior to publication of data analysis article
THREE YEARS before publication of the analysis article
http://blogs.biomedcentral.com/gigablog/2014/05/14/the-latest-weapon-in-publishing-data-the-polar-bear/
29
The polar bear DATA was released (prepublication) in 2011
Data were used and cited in at least 5 studies
1. Hailer, F et al., Nuclear genomic sequences reveal that polar bears are an old and distinct bear lineage. Science. 2012 Apr 20;336(6079):344-7. doi:10.1126/science.1216424.
2. Cahill, JA et al., Genomic evidence for island population conversion resolves conflicting theories of polar bear evolution. PLoS Genet. 2013;9(3):e1003345. doi:10.1371/journal.pgen.1003345.
3. Morgan, CC et al., Heterogeneous models place the root of the placental mammal phylogeny. Mol Biol Evol. 2013 Sep;30(9):2145-56. doi:10.1093/molbev/mst117.
4. Cronin, MA et al., Molecular Phylogeny and SNP Variation of Polar Bears (Ursus maritimus), Brown Bears (U. arctos), and Black Bears (U. americanus) Derived from Genome Sequences. J Hered. 2014;105(3):312-23. doi:10.1093/jhered/est133.
5. Bidon, T et al., Brown and Polar Bear Y Chromosomes Reveal Extensive Male-Biased Gene Flow within Brother Lineages. Mol Biol Evol. 2014 Apr 4. doi:10.1093/molbev/msu109
http://blogs.biomedcentral.com/gigablog/2014/05/14/the-latest-weapon-in-publishing-data-the-polar-bear/
Analysis Article by data producers was published in 2014 in Cell
The Data Publication has since garnered 6 more citations
30
Cell Press Journals had indicated publishing a dataset prior to publication could be considered as prior publication
31
By the end of the day: 7,671 children (under 5) will have died from Malnutrition Data from World Health Organization Fact Sheets http://www.who.int/en/
32
Cultural Reasons Not to Share Fear of journals considering data publication as prior publication. Fear of being scooped (Data Parasites*) Lack of career advancement due to no credit for data production- only papers. *http://www.nejm.org/doi/full/10.1056/NEJMe1516564 Response from Functional Genomics Data Society: http://fged.org/projects/data-sharing-and-research-parasites/
33
Data as a publication can be cited in the references (like a ‘real’ paper) This rewards authors for making data available AND makes it easier to find the data
36
Cited Data is Being Tracked
37
Funding Agencies are paying attention
Funding agencies are now including data release information in grants, requiring data release on publication, and assessing whether researchers are releasing data
38
By the end of the day: 22,466 people will have died from Cancer Data from World Health Organization Fact Sheets http://www.who.int/en/
39
Data Citation Really is a Major Incentive
Last year, we released the genome sequences from 3000 rice strains (13.4 TB of data)
These data were also deposited in the NIH SRA repository
So why did we do it too?
1. It is linked directly to the Data Paper that provides details of data production, quality, and basic analysis
2. Authors were hesitant to release these data (a HUGE community resource) prior to the analysis paper publication (which, for 3000 strains, could possibly take years). The opportunity to have these data citable (and trackable) encouraged the authors and led to their releasing these data, and doing so in collaboration with GigaScience’s Biocurator
The 3,000 Rice Genomes Project. (2014) GigaScience 3:7 http://dx.doi.org/10.1186/2047-217X-3-7
The 3000 Rice Genomes Project (2014) GigaScience Database. http://dx.doi.org/10.5524/200001
40
Cultural Reasons Not to Publish Data
They aren’t ‘real’ papers
They only pad a researcher’s publication list, and do not add to the lexicon of scientific discoveries.
Data production is not a scholarly pursuit.
41
Padding a Resume: Publishing data is “Salami Slicing”!!
What is Salami Slicing?
Publishing research in several different papers that should form a single cohesive paper
Why is Salami Slicing considered ‘unethical’?
It fragments the scientific literature, wasting researchers’ time as they try to get all the information related to a very specific topic/dataset/method
It can give the appearance (given there are multiple publications) that there is large support for a particular hypothesis
It pads a researcher’s publication record unfairly
42
Publishing Data is “Salami Slicing”! Baloney
1. Those guidelines were developed prior to the year 2000, more than 15 years ago, at a time when data set sizes and data types collected in the life sciences by a single research group were relatively small and primarily suitable for a single or narrow range of disciplines or hypotheses. Most journals were not online (which allows easier identification of and access to closely related articles) until the late ‘90s.
2. In 2005, COPE* ruled that a paper that had data that had been used and described, at least in part, in a previous publication was not unethical
3. Data collection can be (should be!!) a scholarly pursuit: data that is broadly reusable requires care, thought, training, time, and money to be properly collected, curated, stored, and shared.
*Committee on Publication Ethics. http://www.publicationethics.org/case/salami-publication
43
Data Production is not a scholarly pursuit
It doesn’t merit a publication
Contrary to popular belief… there are very few (if any) ‘push-a-button-and-get-it’ reusable data resources
44
You’re not supposed to just collect samples! *Collect ALL available metadata*
45
By the end of the day: 47,945 people will have died from Cardiovascular Disease Data from World Health Organization Fact Sheets http://www.who.int/en/
46
Thanks to:
Scott Edmunds, Executive Editor
Nicole Nogoy, Editor
Hans Zauner, Assistant Editor
Peter Li, Lead Data Manager
Chris Hunter, Lead BioCurator
Xiao (Jesse) Si Zhe, Database Developer
Joseph Hasan, Journal Development Manager
Contact us: editorial@gigasciencejournal.com database@gigasciencejournal.com
Follow us: @GigaScience facebook.com/GigaScience blogs.openaccesscentral.com/blogs/gigablog
www.gigasciencel.com www.gigadb.org