Biodiversity Informatics: Metadata standards and GUIDs in Biological Collections
Anne Fuchs (ANBG/CANBR) and Margaret Cawsey (ANWC)
May 2017
National Research Collections Australia
Biological Collections Manage: preserved specimens and/or parts thereof; living organisms (plants, seeds, algae, bacteria); genetic samples; sounds/images/videos.
Biological collections have been managing their collection objects for a long time, though not for as many millennia as libraries. These are examples of the types of ‘collection objects’ in our collections. Some are dried, e.g. plant specimens and animal skins and bones; others are preserved in other ways, e.g. in ethanol or frozen. How each is managed physically depends on the preservation method involved. http://www.cpbr.gov.au/cpbr/herbarium/specimen/index.html
Biodiversity Informatics in Australian National History Collections; Fuchs & Cawsey
Accessioning in Biological Collections. Long history of cataloguing collections, which included registration of unique institution codes: CANB (Australian National Herbarium); ANIC (Australian National Insect Collection). In the herbarium community, Index Herbariorum: “Each institution is assigned a permanent unique identifier in the form of a one to eight letter code, a practice that dates from the founding of IH in 1935.” (1) More recently, the Global Registry of Biodiversity Repositories. Institutions allocate accession/catalogue numbers internally. For delivery to national/international datasets these are combined as institutionCode:collectionCode:catalogNumber, e.g. ANIC:Hymenoptera:31-035454-384 or CANB:ANH:CANB 621770.1
As part of the cataloguing process, specimens are allocated a unique accession or catalogue number. It was recognised that institutions (or large collections) also needed to be identifiable, so systems were put in place to ‘register’ them, such as the Index Herbariorum for herbaria and the Global Registry of Biodiversity Repositories. In addition, institutions or collections allocate their own catalogue numbers. When these are combined they uniquely identify a specimen.
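The institutionCode:collectionCode:catalogNumber combination described on this slide can be sketched in Python as below. This is an illustration only: the class and function names are hypothetical, and the rule that only the first two colons act as separators is an assumption made so that a catalogue number such as "CANB 621770.1" survives intact.

```python
from typing import NamedTuple


class SpecimenId(NamedTuple):
    """Hypothetical container for the three parts shown on the slide."""
    institution_code: str
    collection_code: str
    catalog_number: str

    def __str__(self) -> str:
        # Join the three parts with colons, as in the slide examples.
        return ":".join(self)


def parse_specimen_id(text: str) -> SpecimenId:
    """Split a combined identifier on its first two colons only (assumption)."""
    parts = [part.strip() for part in text.split(":", 2)]
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected institution:collection:catalog, got {text!r}")
    return SpecimenId(*parts)


herbarium_sheet = parse_specimen_id("CANB:ANH:CANB 621770.1")
insect = SpecimenId("ANIC", "Hymenoptera", "31-035454-384")
```

Parsing and then converting back with str() should reproduce the original string, which is a quick way to check that no part of the catalogue number was lost.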
Metadata in Biological Collections. Extensive metadata is collected and held: who made the collection; when the collection was made; where it was collected (locality, co-ordinates); type of material collected (bird, egg, leaves, fruit etc.); taxon collected; possibly additional data such as habitat. Institutions often hold this metadata in digital repositories.
In the process of collecting specimens, additional metadata is collected. The trend towards managing metadata in specially designed collection management systems has gained momentum since the 1980s. As part of the storage of collection items, labels are produced from this metadata. These collection management systems hold all of the information which lets us curate and track our specimens inside collections and around the world, assisting in their use and re-use through specimen loans, tissue grants etc. (As an aside, CSIRO NRCA will be migrating from our national collections to an enterprise CMS solution in the near future.)
Data sharing and discoverability. Exchange of metadata between institutions for duplicate and loaned specimens. Supply of data to national and international aggregators: Australian Virtual Herbarium (AVH); Online Zoological Collections of Australian Museums (OZCAM); Atlas of Living Australia (ALA); Global Biodiversity Information Facility (GBIF). Therefore, we need standards.
Even prior to the aggregation tools we see today, institutions had established practices for depositing material as “backup” in other institutions and loaning specimens for taxonomic work. More recently this metadata has underpinned the data in national and international aggregators to meet the needs of research, land management, policy, education etc. To deliver data both from institution to institution and to aggregated systems, we need to talk a common language, and therefore need standards.
Data sharing and discoverability: Standards. Introducing TDWG, or the Taxonomic Databases Working Group. “The TDWG community's priority is the development of standards for the exchange of biological/biodiversity data.” Established 1985.
The natural history collections community has been working with data standards for a long time. The Biodiversity Information Standards Working Group, still affectionately known as TDWG (Taxonomic Databases Working Group), was established in 1985. If you visit their website, the work and standards they address are listed.
Data sharing and discoverability: Standards. Darwin Core (DwC); Access to Biological Collection Databases (ABCD); extensions, e.g. Audubon Core (multimedia) and the Global Genome Biodiversity Network Data Standard; Herbarium Interchange Specimen Protocol for Interchange of Data (HISPID).
These are the standards which Australian institutions work with.
Darwin Core: a TDWG standard, and a body of standards. It includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries.
ABCD: a TDWG standard; a comprehensive and commented schema for biological collection records (ABCD Schema). XML based.
Audubon Core: a set of vocabularies designed to represent metadata for biodiversity multimedia resources and collections.
GGBN: the GGBN Data Standard is based on ABCDDNA, which has been developed within the ... The current GGBN Data Standard is a result of further reviews of ABCDDNA done with the GGBN community. The GGBN Data Standard is intended to be used with ABCD or Darwin Core and is not a stand-alone solution!
HISPID: an example of a domain-specific standard. It started specifically for the herbarium community and has evolved through various iterations to the current standard, which follows and maps to the international standards. Additional terms are minted for attributes which are not covered, vocabularies are provided where applicable, and terms are described in a domain-friendly manner.
FCIG: Faunal Collections Informatics Group. HISCOM: Herbarium Information Systems Committee.
In Australia, the natural history collections have peak councils which represent the interests of their members. They are served by informatics committees which provide technical advice and actualise decisions. The Director of the SA Herbarium is the current chair of CHAH, and Anne is a member of HISCOM. The Director of the ANWC is the current chair of CHAFC, and Margaret is a member of the Faunal Collections Informatics Group, or FCIG.
Discoverability: a brief history. 1999: first iteration of AVH. 2001: GBIF. 2003: first iteration of OZCAM. 2007: the ALA.
Discoverability of taxon occurrence information has been on the agenda for a long time in the biodiversity data space, with the development of data aggregators. 1999 saw the first version of the Australian Virtual Herbarium. In 2001 GBIF was initiated in Europe to share taxon occurrence data globally. In 2003 the first iteration of OZCAM made its online debut. In 2007 the Atlas of Living Australia was initiated; it came online in 2010 and now powers the engine for OZCAM and the AVH. The Atlas specified the required data standard as the Darwin Core. Which brings us to the vexed history of GUIDs in the biodiversity informatics space.
GUIDs: a vexed history. 2006: TDWG GUID Task Group. The LSID (Life Science Identifier): URN technology; each collection can mint its own, e.g. urn:lsid:ozcam.taxonomy.org.au:ANWC:Birds:B56401 But ... they don’t resolve; they are not used; ALA mints its own record identifiers, as does GBIF, so why bother?
In 2006 the TDWG GUID Task Group recommended the use of Life Science Identifiers to uniquely identify objects in the biodiversity domain (brandish document); the TDWG document is available from GitHub. At the time there were relatively simple requirements: the identifiers had to be persistent and globally unique. Like the IGSN, their technology is that of a URN (Uniform Resource Name), which makes them harder to implement than simple URLs because you need resolution technologies. However, once you’ve chosen the namespace and the accepted format, each collection can mint its own LSIDs, as in the example on the slide. As you can see, LSIDs were actually implemented by members of CHAFC, including the Wildlife Collection; some LSID providers and services exist, and the GUID technology was tested. However, the tests to date have identified a variety of issues.
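The idea that each collection can mint its own LSIDs can be sketched in Python as follows. The layout (authority, then institution code, collection code and catalogue number) is inferred solely from the single example on the slide and has not been checked against the LSID specification; the function name mint_lsid and the validation rules are assumptions for illustration.

```python
def mint_lsid(authority: str, institution_code: str,
              collection_code: str, catalog_number: str) -> str:
    """Build an LSID string in the layout of the slide's example (assumed)."""
    parts = [authority, institution_code, collection_code, catalog_number]
    for part in parts:
        # Assumption: parts must be non-empty and free of colons and
        # whitespace, so the colon-separated string stays unambiguous.
        if not part or ":" in part or any(ch.isspace() for ch in part):
            raise ValueError(f"unsuitable LSID component: {part!r}")
    return "urn:lsid:" + ":".join(parts)


bird_lsid = mint_lsid("ozcam.taxonomy.org.au", "ANWC", "Birds", "B56401")
```

Minting is the easy part; as the notes above point out, the difficulty with URN-based identifiers lies in resolving them, which this sketch does not address.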
TDWG GUID applicability statement 2010. 14 recommendations. Yes, a GUID is a good idea. GUID technologies: HTTP URI (used as a basis for some of the following options); URN; LSID (Life Science Identifier); DOI (Digital Object Identifier); PURL (Permanent URL); UUID (Universally Unique Identifier); Handle System.
In 2010 the TDWG Globally Unique Identifiers Task Group produced a GUID applicability statement (wave document 2 about; also available from GitHub), from which I’ve derived much of the info in this talk. This document makes 14 recommendations. One of them is that, yes, a GUID is a good idea. Among them is a list of potential technologies, of which the LSID is one. However, because it is URN technology, an LSID cannot function in a linked data environment without being represented as an HTTP URI, for example something like this: http://bioguid.info/urn:lsid:ozcam.taxonomy.org.au:ANWC:Birds:B56401 It should be noted that the document (wave it again) presents reservations about ALL of them.
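Representing an LSID as an HTTP URI, as in the bioguid.info example on this slide, amounts to prefixing the URN with a resolver address. The Python sketch below illustrates that and nothing more: the function names are hypothetical, the resolver base is copied from the slide, and whether that service currently resolves anything is not checked here.

```python
RESOLVER_BASE = "http://bioguid.info/"  # taken from the slide's example


def lsid_to_http_uri(lsid: str, resolver_base: str = RESOLVER_BASE) -> str:
    """Wrap an LSID in an HTTP URI by prefixing a resolver address."""
    if not lsid.lower().startswith("urn:lsid:"):
        raise ValueError(f"not an LSID: {lsid!r}")
    return resolver_base.rstrip("/") + "/" + lsid


def http_uri_to_lsid(uri: str, resolver_base: str = RESOLVER_BASE) -> str:
    """Recover the LSID from a URI produced by lsid_to_http_uri."""
    prefix = resolver_base.rstrip("/") + "/"
    if not uri.startswith(prefix):
        raise ValueError(f"URI does not start with {prefix!r}")
    return uri[len(prefix):]


proxied = lsid_to_http_uri("urn:lsid:ozcam.taxonomy.org.au:ANWC:Birds:B56401")
```

The wrapped form is only as persistent as the resolver's domain name, which is one reason to treat it as a convenience rather than a replacement for the identifier itself.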
TDWG GUID applicability statement 2010. GUIDs apply to objects, e.g. scientific names; datasets; collections; specimens; genetic samples; images, videos, sound recordings etc.; geological samples?
The applicability statement is not prescriptive about which objects GUIDs may apply to, but it is prescriptive about how they should NOT be applied, e.g. more than one GUID of the same technology applied to the same sample. In the past LSIDs have been applied to all of these objects and then some. It is not inconceivable that they might be applied to geological objects, much as it is being suggested that IGSNs could apply to biological objects.
Current situation for the natural history collections... GBIF DOIs for downloads: dataset persistence? dataset content changes possible? Conclusions: no consensus reached ... It is unlikely that any particular GUID technology will be successfully implemented until TDWG achieves consensus.
Recently we’ve heard that GBIF has decided to apply DOIs to data downloads. However, we don’t yet know how persistent these datasets will be; the implication is that the datasets may not be kept for more than 12 months, so the DOIs won’t resolve beyond that time. There’s also the implication that the content of the dataset itself might change: if the download is for all records of a particular species at time X, and more records for that species arrive in the GBIF database at time X+n, then at time X+n the DOI will point to a different complement of records than it did at time X. This makes the usefulness of the DOI arguable.
As a conclusion: there is as yet no consensus within the TDWG community on which GUID technology is acceptable, and it is unlikely that any will be successfully implemented until a consensus is reached. But we all recognise that whatever technology is adopted will have to be compatible with the use of linked data, which means that URN technology is not likely to be the one which hits the jackpot.
Future - Linked Data and the Semantic Web? “Principally, the Semantic Web is a Web 3.0 web technology - a way of linking data between systems or entities that allows for rich, self-describing interrelations of data available across the globe on the web.” (2) W3C Best Practices for Publishing Linked Data (3): data is explicitly connected to a license; URI design (HTTP based, machine readable, unchangeable, opaque); URIs are persistent; vocabulary: based on existing standards wherever possible; machine accessible (RDF/SPARQL, RESTful API).
Linked data is a tool of the semantic web in which data and its relationships are machine readable, opening up possibilities for an environment where applications can query that data and draw inferences using vocabularies (https://www.w3.org/standards/semanticweb/data#summary).
National Species List: Linked data. URI for the name Acacia dealbata Link: https://id.biodiversity.org.au/name/apni/61294 Content negotiation resolves via web services to HTML, JSON, XML or CSV, e.g. https://biodiversity.org.au/nsl/services/name/apni/61294.xml The URI is used in exports and data delivery as the identifier, e.g. the export row: ICNAFP, APNI, scientific, http://id.biodiversity.org.au/name/apni/61294, Acacia dealbata Link, Acacia dealbata Link, ….. URIs are also used for taxon concepts (instances), publications/references, authors, taxonomic classifications, etc. SPARQL service.
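Content negotiation on a name URI, as mentioned on this slide, means asking for a representation through the HTTP Accept header. The Python sketch below only builds such a request with the standard library and does not send it. The helper name is hypothetical, and the media types the National Species List service actually honours are an assumption, not something stated on the slide.

```python
from urllib.request import Request

# Name URI taken from the slide.
NAME_URI = "https://id.biodiversity.org.au/name/apni/61294"


def build_negotiated_request(uri: str, media_type: str) -> Request:
    """Prepare (but do not send) a GET request asking for one representation."""
    return Request(uri, headers={"Accept": media_type})


# Assumption: "application/json" is one of the media types the service accepts.
json_request = build_negotiated_request(NAME_URI, "application/json")

# Sending it needs network access, e.g.:
#   from urllib.request import urlopen
#   with urlopen(json_request) as response:
#       body = response.read()
```

The same URI is reused for every format; only the Accept header changes, which is what lets one persistent identifier serve both people and machines.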
List of Resources
Resource: Link
Global Registry of Biodiversity Repositories: http://grbio.org/
Index Herbariorum (1): http://sciweb.nybg.org/science2/IndexHerbariorum.asp
Australian Virtual Herbarium: http://avh.chah.org.au/
OZCAM: http://ozcam.org.au/
Global Biodiversity Information Facility: http://www.gbif.org/
Taxonomic Databases Working Group (TDWG): http://www.tdwg.org/
Darwin Core: http://rs.tdwg.org/dwc/terms/index.htm
ABCD: http://rs.tdwg.org/abcd/2.06
HISPID: https://github.com/hiscom/hispid
Audubon Core: https://terms.tdwg.org/wiki/Audubon_Core
Global Genome Biodiversity Network Data Standard: https://terms.tdwg.org/wiki/GGBN_Data_Standard
Atlas of Living Australia: http://ala.org.au
LSID: https://github.com/tdwg/guid-as/tree/master/lsid
GUID applicability: https://github.com/tdwg/guid-as/tree/master/guid
Linked data tools (2): http://www.linkeddatatools.com/semantic-web-basics
Data.gov.au statement: https://github.com/AGLDWG/TR/blob/master/guidelines/URI-Guidelines-for-publishing-linked-datasets-on-data.gov.au-latest.md
W3C Best Practices for Publishing Linked Data (3): https://www.w3.org/TR/ld-bp/
Thank you
Presenter details: Anne Fuchs (Centre for Australian National Biodiversity Research); Margaret Cawsey (Australian National Wildlife Collection)
National Facilities and Collections, National Research Collections Australia