David Shotton Image BioInformatics Research Group Department of Zoology University of Oxford, UK Doing more with less: data sharing.

David Shotton
Image BioInformatics Research Group, Department of Zoology, University of Oxford, UK
Doing more with less: data sharing and integration in an age of data glut and economic contraction
© David Shotton, 2010. Published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Licence
Dryad-UK Discussion Meeting, HEFCE Offices, Centre Point, London, April 2010

We live in an age of bioinformatics data glut...
Attwood TK et al. (2009) Calling International Rescue: knowledge lost in literature and data landslide! Biochemical Journal 424:317–333.
Cochrane GR, Galperin MY (2010) Nucleic Acids Research 38:D1-D4.
• There are now over 1200 bioinformatics databases, between which data integration is difficult
• Data integration for many researchers amounts to nothing more sophisticated than cutting and pasting into a Word document!!

Research data: Universals and Particulars
• Gene sequences and protein structures represent ‘universal truths’
  • The data need only be discovered once
  • The data are intrinsically simple and form bounded data sets
  • Data are cheap per bit, and re-acquisition is becoming cheaper
  • Public databases exist for these data (GenBank, PDB, etc.)
  • The whole of bioinformatics is built on their free availability
• Life science research data can also be ‘particulars’, for example individual assay results, disease reports, observations, electron micrographs, videos
  • These data are heterogeneous and form unbounded data sets, typical of ‘long tail’ science rather than ‘big science’
  • Data collection is costly in human resources, and re-acquisition may be impossible, e.g. for observational data
  • Datasets thus often have a high intrinsic value per bit
  • The majority of such research datasets are never published, rotting on the abandoned hard drives of departed postdocs
  • In this open access age, that is little short of scandalous!

The problems of obtaining infectious disease data
• Quote from Professor Angela McLean, after taking three months to amass appropriate disease incidence data for her 2007 J. Virology paper on HIV escape mutations:
  “When I was a graduate student, I spent long hours in libraries copying numbers from dusty journals. Things have not improved much since!”
• I have a particular concern about infectious disease data, since I believe that timely availability of reliable data in this domain may have an important impact on global health.

The benefits and risks of published data
• The benefits of open data publication:
  • review and validation by others,
  • re-use in other contexts, and
  • integration with other data to create a new greater whole
• Governments, funding agencies, publishers and researchers agree that the results of publicly funded research should be made publicly available
• The problems of sharing data (from the RIN – BL Report, November 2009, Patterns of Information Use and Exchange: Case Studies in the Life Sciences):
  • Ethical constraints and IPR issues
  • Concerns about misuse and data ownership
  • “As researchers, we see data as a critical part of our ‘intellectual capital’, generated by investment of time, effort and skill.”
  • Lack of personal attribution and credit for data publication
  • Difficulties in creating appropriate metadata
  • Appropriate repositories to archive and publish research datasets

Semantic publishing of structured research datasets
Semantic publishing is the use of simple Semantic Web technologies:
• to enhance the meaning of on-line published research articles
• to provide access to the articles’ published data in actionable form
• to facilitate the integration of semantically related data
so that data, information and knowledge can more easily be found, extracted, combined and reused
For research datasets to be maximally useful, they have to be:
• saved in machine-processable form, in conformity with appropriate Web standards (e.g. XML, RDF, OWL)
• published and made freely accessible on the Web
• referenced by globally unique and resolvable identifiers (e.g. DOIs)
• accompanied by useful metadata based upon minimal information standards and ontologies, including provenance information
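The requirements on this slide can be illustrated with a minimal sketch, which is not part of the original presentation. It writes a dataset description as N-Triples (a line-based RDF syntax), uses the DOI resolver address as the dataset's globally unique identifier, and links the dataset back to the article it supports as a simple form of provenance. The Dublin Core terms namespace and the https://doi.org/ resolver are real; the function names and the DOIs, title and creator in the usage lines are hypothetical placeholders.

```python
# Illustrative sketch only: describes one dataset as RDF in N-Triples syntax.
# Real vocabulary: Dublin Core terms. Real resolver: https://doi.org/.
DCTERMS = "http://purl.org/dc/terms/"
DOI_RESOLVER = "https://doi.org/"


def escape_literal(value: str) -> str:
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def dataset_triples(doi: str, title: str, creator: str, article_doi: str) -> list[str]:
    """Return N-Triples lines describing a dataset identified by its DOI."""
    subject = f"<{DOI_RESOLVER}{doi}>"  # resolvable identifier for the dataset
    return [
        f'{subject} <{DCTERMS}title> "{escape_literal(title)}" .',
        f'{subject} <{DCTERMS}creator> "{escape_literal(creator)}" .',
        f'{subject} <{DCTERMS}identifier> "doi:{escape_literal(doi)}" .',
        # Provenance-style link: the article that reports on this dataset.
        f"{subject} <{DCTERMS}isReferencedBy> <{DOI_RESOLVER}{article_doi}> .",
    ]


if __name__ == "__main__":
    # Hypothetical placeholder values, not real DOIs or people.
    for line in dataset_triples(
        doi="10.9999/example.dataset",
        title="Example disease incidence table",
        creator="A. Researcher",
        article_doi="10.9999/example.article",
    ):
        print(line)
```

A real deposit would normally use a richer, community-agreed metadata schema; the four triples here only show the shape of machine-processable, identifier-based metadata.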

Features of the original PLoS NTD article, relating to data
Good
• The article contained a rich variety of data types (geospatial, disease incidence, serological assay, and questionnaire) presented in formats amenable to semantic enrichments (maps, bar charts, tables and graphs)
Poor
• While figures and tables can be downloaded, they can only be downloaded as images!
• The numerical data are not directly available in actionable form

Drosophila gene expression data exists in many databases FlyAtlas

Data from four sources combined in an OpenFlyData window
Query for schuy over cached RDF data from FlyTED, BDGP, FlyAtlas and FlyBase
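The idea behind this slide, one query answered from several cached sources at once, can be sketched as follows. This is not the OpenFlyData implementation: plain Python tuples stand in for RDF triples, a loop stands in for the query engine, and every predicate name and value in the cache is a hypothetical placeholder rather than real expression data.

```python
from collections import defaultdict

# Hypothetical cache: (subject, predicate, object) tuples grouped by source.
# All predicates and values are placeholders, not real database content.
CACHE = {
    "FlyTED": [("schuy", "has_record", "placeholder FlyTED record")],
    "BDGP": [("schuy", "has_record", "placeholder BDGP record")],
    "FlyAtlas": [
        ("schuy", "has_record", "placeholder FlyAtlas record"),
        ("other_gene", "has_record", "placeholder FlyAtlas record"),
    ],
    "FlyBase": [("schuy", "has_record", "placeholder FlyBase record")],
}


def query_gene(cache: dict, gene: str) -> dict:
    """Collect every statement about `gene`, keyed by the source it came from."""
    combined = defaultdict(list)
    for source, triples in cache.items():
        for subject, predicate, obj in triples:
            if subject == gene:
                combined[source].append((predicate, obj))
    return dict(combined)


if __name__ == "__main__":
    for source, statements in query_gene(CACHE, "schuy").items():
        print(source, statements)
```

Keeping the source name attached to each statement preserves provenance, so a reader of the combined view can still tell which database each value came from.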

In conclusion: data publishing and global warming
Waiting for some international committee in Copenhagen to create the perfect solution to the data publication problem is not the way forward.
Just as we can each act locally to reduce our carbon footprint, so we can each do something personally to increase our data footprint.
Each of us, whether researcher, publisher or government agency, can take responsibility for the open publication of our own research data.
The important thing is to make a start!

end

Advantages of repository over supplementary data files
Dryad Suppl
Searchable: published metadata allows Google search for data files 
Confirmable: author can confirm descriptive metadata terms used 
Citable: unique identifiers (DOIs) permit citation of data files  ?
Increased exposure of source journal articles through data citation  ?
Permanent: data files securely archived in perpetuity ??
Linked: datasets linked to article based on them 
Metadata will be available as RDF: part of the “web of linked data”  ?
Curated: quality verified, stable formats used, content virus-checked  ?
Ease of deposit: authors can upload multiple or zipped files  ?
Updatable: new versions of data files can be added, with provenance 
Embargo: can delay release of data up to one year after publication 
Open access: no restrictions for users, no subscription required  ?
Scalable: many journals and societies can leverage economies of scale 

Convergence between journals and databases
PLoS Comp. Biol (3) e34
• In this paper, Philip Bourne, Editor-in-Chief of PLoS Computational Biology and Co-Director of the Protein Data Bank, contends that the distinction between an on-line journal and an on-line database is diminishing
• He calls for “seamless integration” between papers reporting results and the data used to compute those results

My critique of Philip Bourne’s ideas
• We need to maintain a clear distinction between journal publications:
  • peer reviewed
  • immutable dated ‘versions of record’ (part of the history of science)
  • that provide the citable authorities for research datasets
• and research databases:
  • that should present users with access to complete, impartial, up-to-date datasets, both for further exploration and automated data mining
  • with curators responsible for correction of errors after submission
• Thus “seamless integration” is not desirable
  • Articles are rhetorical
  • Datasets are analytical
  • Researchers require the “seams” to be kept clearly visible, so they know which presuppositional spectacles to wear when reading
• Nevertheless, both frictionless interoperability and reciprocal citation between papers and datasets are highly desirable