British Library Datasets Programme JISC RSP Winter School February 2011 Max Wilkinson
2 Today’s Talk: 1. The British Library; 2. Data in scholarly communication; 3. The problem with data; 4. The Datasets Programme: vision, strategy, activity (DataCite); 5. Other projects
3 The British Library. Exists for everyone who wants to do research – for academic, personal, and commercial purposes. Covers all subject areas – sciences, technology, medicine, arts, humanities, social sciences… Receives a copy of every item published in the UK. Holds over 150 million items, with 3 million items added each year. Used by over 16,000 people each day (on site and online).
The British Library: some facts and figures. Helping people advance knowledge to enrich lives. GIA funding 08/09: £94.8m operational, £12m capital. Other funding secured 07/08: c.£33m. National library of the UK. Serves researchers, business, libraries, education & the general public. Collection includes over 2m sound recordings, 5m reports, theses and conference papers, and the world’s largest patents collection (c.50m). 3 main sites in London and Yorkshire. Circa 2,000 staff. Business and IP Centre: providing inspiration, and enabling protection of creative capital and business development. Generates value to the UK economy each year of 4.4 times public funding. Collection fills over 600km of shelving and grows at 11km per year. 70 Tb of digital material through voluntary deposit. British Library Act 1972: national centre for reference, study, bibliographical and other information services, in relation both to scientific and technological matters, and to the humanities. Science and Innovation Investment Framework, H.M. Treasury (2004): UK research base must have ready and efficient access to information of all kinds – such as experimental data sets, journals, theses, conference proceedings and patents. This is the life blood of research and innovation. The largest document supply service in the world. Secure e-delivery and ‘just in time’ digitisation enable desktop delivery within 2 hours.
5 Who do we serve? The Researcher – We provide access to research level materials to all sectors including academia, industry, government, charities and NGOs. Business – The British Library also has a critical role supporting businesses of all sizes, from individual entrepreneurs through to major organisations. The Learner – We have an important role to play in supporting education, from primary schools to developing future researchers of any age. The Library Community – We play a key role in supporting the wider UK Library Community and information network. The General Public – The services we offer include exhibitions and events, tours and web services which digitally showcase our collection.
6 Modern science relies on good data
7 Scholarly record [diagram labels: Discovery, Access, Record, Permanence, Citation, Metadata, Exposure, Trust, Fabrics, Copyright]
8 The Foundation for Research. Data is a crucial component of the scholarly record. Re-acquisition may be impossible. Datasets are essential to the British Library’s mission to advance the world’s knowledge.
9 Current Situation: no effective way to link between datasets and articles; no widely used method to identify datasets; no widely used method to cite datasets.
10 As a result… Datasets are: difficult to discover, difficult to access, in danger of being lost.
11 Difficult to Discover. Good luck finding the data! “Source: Committee on Climate Change”
12 Data are diverse in the Digital Landscape: seismic measurements taken by a geologist; an audio archive of birdsong created by an ornithologist; genetic data collected by a medical researcher; a survey of public opinions collected by a sociologist.
13 Re-join the gap… (No) effective way to link between articles and datasets; (no) widely used method to identify datasets; (no) widely used method to cite datasets. [diagram labels: Articles, Underlying data]
14 Datasets – first class citizens? Data is difficult to manage after project funding ceases. Informal networks provide the primary means of sharing. Only 21% use a national or international facility. Datasets are not included in impact analysis. Good luck finding it or getting permission to use it (your discipline may vary). Source: UKRDS Study: The Data Imperative. Managing the UK’s research data for future use (Feb 2009)
15 Scholarly record [diagram labels: Discovery, Access, Record, Permanence, Citation, Metadata, Exposure, Trust, Fabrics, Copyright]
16 Research training based on scholarly communication: rarely includes data. [diagram labels: Discovery, Access, Record, Permanence, Citation, Metadata, Exposure, Trust, Fabrics, Copyright, Scholarly record]
17 Scholarly communication requires intellectual exchanges: no such data fabric. [diagram labels: Discovery, Access, Record, Permanence, Citation, Metadata, Exposure, Trust, Fabrics, Copyright, Scholarly record]
18 Scholarly discourse requires a record and provenance: almost non-existent for data. [diagram labels: Discovery, Access, Record, Permanence, Citation, Metadata, Exposure, Trust, Fabrics, Copyright, Scholarly record]
19 The Datasets Programme. We envision a future where researchers can: discover, access, reuse, and reference datasets; track the impact of the data that they generate and receive appropriate credit. Our approach is to: provide a focus for the community to establish needs, requirements and agreement; explore novel technology and creative solutions.
20 Two key concepts: INCENTIVE, SUSTAINABILITY
21 Projects and activities. Follow us on Twitter.
22 A Key Component for Many Goals? Persistent Identification. Goals: Cite, Reuse, Verify, Track Impact, Access, Find, Make Visible.
23 Citation using Digital Object Identifiers (DOIs). Dataset: G. Yancheva, N. R. Nowaczyk et al (2007) Rock magnetism and X-ray fluorescence spectrometry analyses on sediment cores of the Lake Huguang Maar, Southeast China, PANGAEA. Article citation: G. Yancheva, N. R. Nowaczyk et al (2007) Influence of the intertropical convergence zone on the East Asian monsoon, Nature 445, doi: /nature05431. How to reference: published article (abstract or full text). The DOI system offers an easy, internet-actionable way to connect the article with the underlying publication. But a complete scholarly record would also link to the evidential datasets and their location, e.g. PANGAEA.
24 doi: /nature05431 leads to a landing page
25 Digital Object Identifiers (DOIs) offer a solution. Most widely used identifier for scientific articles. Researchers, authors and publishers know how to use them. Put datasets on the same playing field as articles. Connecting an article with the underlying data: Dataset: Yancheva et al (2007). Analyses on sediment of Lake Maar. PANGAEA. doi: /PANGAEA. URIs are commonly used but can decay (e.g. Wren JD: URL decay in MEDLINE – a 4-year follow-up study. Bioinformatics. 2008 Jun 1;24(11):1381-5).
26 doi: /PANGAEA
27 Dataset citation using Digital Object Identifiers (DOIs). Dataset: G. Yancheva, N. R. Nowaczyk et al (2007) Rock magnetism and X-ray fluorescence spectrometry analyses on sediment cores of the Lake Huguang Maar, Southeast China, PANGAEA, doi: /PANGAEA. Article: G. Yancheva, N. R. Nowaczyk et al (2007) Influence of the intertropical convergence zone on the East Asian monsoon, Nature 445, doi: /nature05431. Data citation: scholarly record is complete.
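The pairing of a dataset citation with an article citation on this slide can be sketched in a few lines of Python. This is only an illustration: the `CitedWork` type, the two function names, the citation layout (creators, year, title, publisher, DOI) and the `10.1234/...` DOIs are all hypothetical and are not taken from the talk or from any DataCite specification.

```python
from dataclasses import dataclass


@dataclass
class CitedWork:
    """A citable item, either an article or a dataset (hypothetical type)."""
    creators: str
    year: int
    title: str
    publisher: str
    doi: str  # full DOI string; the values used with this sketch are made up


def format_citation(work: CitedWork) -> str:
    """Render a citation in one assumed layout, ending with the DOI."""
    return (f"{work.creators} ({work.year}): {work.title}. "
            f"{work.publisher}. doi:{work.doi}")


def link_article_to_data(article: CitedWork, dataset: CitedWork) -> dict:
    """Return reciprocal links: each DOI maps to the DOI of its counterpart."""
    return {article.doi: dataset.doi, dataset.doi: article.doi}
```

Because both records carry a DOI, the same formatting and linking code serves the article and the dataset alike, which is the sense in which DOIs put datasets on the same footing as articles.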
28 Projects – DataCite. DataCite is an international consortium which aims to: establish easier access to scientific research data on the Internet; increase acceptance of research data as legitimate, citable contributions to the scientific record; support data archiving that will permit results to be verified and re-purposed for future study.
29 DataCite. Support researchers by enabling them to locate, identify, and cite research datasets with confidence. Support data centres by providing persistent identifiers for datasets, workflows and standards for data publication. Support publishers by enabling research articles to be linked to the underlying data. DataCite : Data Centres :: CrossRef : Publishers
30 Digital Object Identifier (DOI). doi: Prefix / Suffix
31 DOI prefix. doi: Prefix / Suffix. The British Library provides data centres with a unique prefix for DataCite DOIs. For example, Archaeology Data Service uses
32 DOI suffix. doi: Prefix / Suffix. Suffix generated by the data centre. Guidelines for DOI syntax are being developed.
33 Resolving a DOI. doi: Prefix / Suffix. Resolving the DOI:
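As a rough illustration of the prefix/suffix structure and of resolution, the Python sketch below splits a DOI at its first slash and builds a resolver link. The function names are hypothetical, the example DOIs used with it are invented, and the use of the public `https://doi.org/` proxy as the resolver is an assumption, since the resolver address is not given on the slide.

```python
from urllib.parse import quote

# Assumption: the public DOI proxy; the slide does not name a resolver.
RESOLVER = "https://doi.org/"


def split_doi(doi: str) -> tuple:
    """Split a DOI into (prefix, suffix) at the first slash.

    The prefix identifies the registrant (for example a data centre);
    the suffix is chosen by that registrant.
    """
    text = doi.strip()
    if text.lower().startswith("doi:"):
        text = text[4:].strip()
    prefix, sep, suffix = text.partition("/")
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix, suffix


def resolver_url(doi: str) -> str:
    """Build the link that a browser would follow to reach the landing page."""
    prefix, suffix = split_doi(doi)
    # Percent-encode characters in the suffix that are unsafe in a URL path.
    return RESOLVER + prefix + "/" + quote(suffix, safe="/")
```

For instance, `split_doi("doi:10.1234/example-dataset.v1")` separates the made-up prefix `10.1234` from the suffix `example-dataset.v1`. Only the first slash is treated as the separator, so a suffix may itself contain slashes.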
34 DOIs resolve to an open landing page
35 DataCite Service. Built a service for data centres to mint DOIs for datasets and store associated metadata. British Library is trialling the service with several UK data centres, including:
36 Projects and activities
37 For more information on the BL Datasets Programme: Max Wilkinson, Programme Manager. Datasets website. Follow us on Twitter.
38 Follow On slides
39 SageCite: Data citation in bioinformatics workflow. Sage Bionetworks data capture and analysis workflow (Taverna: myExperiment). Data citation service integration points and recommendations. Benefits analysis. SageCite: Integration of data citation services into multi-contributor bioinformatics workflow. Establishing data attribution and credit mechanisms. ► INCENTIVE. Sage Bionetworks: Aggregating datasets from contributors to create massive coherent datasets that can be used for systems level analysis of disease.
40 Dryad UK: Repository sustainability. Expand publisher base. Seamless integration into publisher workflow. Sustainability models for datasets supplementary to publication. Dryad UK: Define a business case and pilot service integrating DataCite DOIs and dataset archiving into publisher workflows. ► SUSTAINABILITY. Leveraging the Dryad Consortium, which is addressing the acquisition and storage of long tail supplementary data.
41 Discovery: Science, Technology & Medicine. Focussing on discovery services in the library’s integration engine. Based on commissioned consultations. Data resources. Selection guidelines. Making available through library search facilities.
42 Dataset Discovery Project
43 Access: SSCR. Focussing on streamlining access to established and high value data collections. Resource guides for datasets. Streamlining access to established data centres. Raising profiles of high impact datasets, e.g. Olympics and 2011 census. Also piloting dataset surfacing through the Library’s search facilities.
44 Projects – British Atmospheric Data Centre. British Atmospheric Data Centre (BADC): Natural Environment Research Council's designated data centre for the Atmospheric Sciences. Assists researchers to locate, access and interpret atmospheric data and ensures the long-term integrity of this data. A joint project is underway to improve the citability of BADC datasets. Publications based on the data will underlie the 2013 Intergovernmental Panel on Climate Change (IPCC) Report.
45 Challenges to Explore Helping people to … Developing and sustaining… Providing a…
46 A combination of eight social and technical factors – ideally there would be: personal attribution and credit for data publication; an established mechanism for citation of datasets; a generic minimum metadata standard for datasets; a tool to permit the easy creation of well-structured metadata; a standard mechanism for packaging data files and their metadata; appropriate repositories to archive and publish research datasets; reciprocal citation links between datasets and research articles; mechanisms for quality control of data publications.
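To make the idea of a generic minimum metadata standard concrete, here is a minimal Python sketch of a completeness check. The five field names are an assumption, loosely modelled on the kind of core properties a dataset citation needs (identifier, creator, title, publisher, publication year); they are not quoted from the talk, and `missing_fields` is a hypothetical helper.

```python
# Assumed minimum field set for a dataset record; not taken from the slides.
REQUIRED_FIELDS = ("identifier", "creator", "title", "publisher", "publication_year")


def missing_fields(record: dict) -> list:
    """Return the required field names that are absent or blank in a record."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing
```

A metadata creation tool could run a check like this before a record is accepted, so that every published dataset carries at least enough information to be cited.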