www.obis.org.au/irmng IRMNG – the Interim Register of Marine and Nonmarine Genera: rationale and current status Talk prepared for GN-CoL names and taxonomy.

Presentation transcript:

www.obis.org.au/irmng
IRMNG – the Interim Register of Marine and Nonmarine Genera: rationale and current status
Tony Rees – CSIRO Marine and Atmospheric Research, Australia
Talk prepared for: GN-CoL names and taxonomy sharing workshop, Hawaii, March 2012

The Dream…
Imagine a system that would:
- Automatically classify “any” genus and species name to kingdom / phylum / class / order / family (as far down as possible), i.e. “what is this critter”, plus hierarchical relations, e.g. parents / children / siblings
- Return whether a name is current (valid) or non-current, e.g. a synonym
- Check spelling for correctness, also authority details, plus supply the original publication reference as available
- Return associated attributes such as extant / fossil status, habitat information, geographic / geologic range, and more
- Work seamlessly, with a single point of entry, across all groups and geologic epochs including the present day
- Be as up-to-date as possible (latest content), and authoritative (maintained by relevant experts)
Tony Rees: IRMNG March 2012

Realising the Dream…
- For extant taxa: role of Cat. of Life, however ~30% of species still to go; for fossil taxa: PaleoDB (unknown proportion missing, maybe 50%?)
- In the meantime, could make progress by assembling a global genera list, and infilling with species names as available
- IRMNG is an attempt along these lines: a work in progress, with modest resourcing, but available for use now.

IRMNG data sources
- Animal genera and authorities from Nomenclator Zoologicus and elsewhere; taxonomic placements and synonymies from multiple sources including CoL, individual taxon treatments and printed works
- Botanical genera and authorities from Index Nominum Genericorum (ING) supplemented with other sources; taxonomic placements and synonymies from multiple sources including GRIN (APGIII in the main), Index Fungorum, AlgaeBase, CyanoDB, more
- Prokaryote genera, authorities and taxonomic placements from LPSN (Euzéby list); previous/non-valid names from multiple sources
- Virus genera and taxonomic placements from ICTV db (multiple versions, very different through time)
- Species lists (all groups) from CoL 2006, Aphia/WoRMS 2006, AFD, NZ Organisms Register + more.
Hierarchical approach is in contrast to e.g. NameBank, GNI:
- Names are not accepted without a parent (even if this is “Animalia (awaiting allocation)” in a few cases)
- Placeholder groups, e.g. “Mollusca (awaiting allocation)”, are erected at Order and Family level to allow addition of genus names not yet placed to family (for homonymy in particular, also because other details, e.g. publication info and extant/fossil status, may already be available)
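The "no name without a parent" rule and the placeholder groups can be illustrated with a small sketch. This is not IRMNG's actual schema or code: the `Register` class, its method names and the genus name "Examplegenus" are hypothetical, and for brevity the placeholder order is attached directly to the phylum.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

SUFFIX = " (awaiting allocation)"  # placeholder naming pattern quoted on the slide


@dataclass
class Taxon:
    taxon_id: int
    name: str
    rank: str
    parent_id: Optional[int]  # None only for kingdoms in this sketch


class Register:
    """Hypothetical in-memory register that refuses parentless names."""

    def __init__(self) -> None:
        self.taxa: Dict[int, Taxon] = {}

    def add(self, name: str, rank: str, parent_id: Optional[int] = None) -> int:
        # Every name below kingdom must hang off an existing parent.
        if rank != "kingdom" and parent_id not in self.taxa:
            raise ValueError(f"{name!r} needs an existing parent")
        taxon_id = len(self.taxa) + 1
        self.taxa[taxon_id] = Taxon(taxon_id, name, rank, parent_id)
        return taxon_id

    def placeholder(self, parent_id: int, rank: str) -> int:
        # Get or create "<higher taxon> (awaiting allocation)" under the parent.
        parent_name = self.taxa[parent_id].name
        base = parent_name[:-len(SUFFIX)] if parent_name.endswith(SUFFIX) else parent_name
        name = base + SUFFIX
        for taxon in self.taxa.values():
            if (taxon.name, taxon.rank, taxon.parent_id) == (name, rank, parent_id):
                return taxon.taxon_id
        return self.add(name, rank, parent_id)

    def lineage(self, taxon_id: int) -> List[str]:
        # Walk up the parent links and return names from kingdom downwards.
        names: List[str] = []
        current: Optional[int] = taxon_id
        while current is not None:
            taxon = self.taxa[current]
            names.append(taxon.name)
            current = taxon.parent_id
        return names[::-1]


reg = Register()
animalia = reg.add("Animalia", "kingdom")
mollusca = reg.add("Mollusca", "phylum", animalia)
order_ph = reg.placeholder(mollusca, "order")
family_ph = reg.placeholder(order_ph, "family")
genus = reg.add("Examplegenus", "genus", family_ph)  # hypothetical genus name
```

With this arrangement an unplaced genus still has a full lineage, so homonym checks and attribute flags can be applied before its family is known.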

IRMNG content as at March 2012 (cf. e.g. Cat. of Life):
- Cat. of Life (2011 version): 8k families, 178k genera, 2.25m species names (including synonyms)
- IRMNG: 19k families, 454k genera, 1.46m species names (including synonyms)
- Not all IRMNG genera are yet linked to relevant families, but ~370k are (remainder linked to a higher taxon, i.e. phylum, class or order)
- Extant/fossil and marine/nonmarine flags held for the majority of names
- Nomenclatural status known for most names; taxonomic status (i.e. valid name/synonym) for only a subset at this time (varies by group)
- Authority known for >97% of genera, publication details for the “animal” subset (from Nomenclator Zoologicus in the main)
- Fuzzy matching (TAXAMATCH) deployed over all web-based queries for correction of potential errors in input names to be matched.

IRMNG in practice – example genus = “Lawsonia”
- Same name is currently a valid genus in 3 Codes, i.e. plants, animals and bacteria (no barriers to this)
- Homonymy is a big problem: up to 15% of all genus names are homonyms/isonyms* either within or across Codes (including some misspellings which collide with other “good” names)
- Many genus names are valid across more than 1 Code (e.g. used in botany and zoology for different taxa); a handful of genus names are concurrently valid across three Codes, as per this example: Lawsonia
- Worst example currently: “Wagneria”, with 14 instances in IRMNG, 2 valid, the rest synonyms
- Cannot disentangle without a master list of genus names
(*Isonyms: multiple publication instances of the same name as new, based on the same type or concept)
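As a rough illustration of why a master list is needed, the sketch below groups genus records by name and reports which Codes hold a valid instance. The record layout and the helper names `find_homonyms` and `valid_codes` are assumptions for this example, not IRMNG's implementation; the three "Lawsonia" rows only mirror the slide's statement that the name is valid under three Codes, and "Examplea" is a made-up name.

```python
from collections import defaultdict
from typing import Dict, List

# Illustrative records only: name, nomenclatural Code, taxonomic status.
records = [
    {"name": "Lawsonia", "code": "botanical", "status": "valid"},
    {"name": "Lawsonia", "code": "zoological", "status": "valid"},
    {"name": "Lawsonia", "code": "prokaryote", "status": "valid"},
    {"name": "Examplea", "code": "zoological", "status": "valid"},  # hypothetical
]


def find_homonyms(rows: List[dict]) -> Dict[str, List[dict]]:
    """Return every genus name that occurs more than once in the list."""
    by_name: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        by_name[row["name"]].append(row)
    return {name: group for name, group in by_name.items() if len(group) > 1}


def valid_codes(group: List[dict]) -> List[str]:
    """List the Codes under which at least one instance is currently valid."""
    return sorted({row["code"] for row in group if row["status"] == "valid"})


homonyms = find_homonyms(records)
for name, group in homonyms.items():
    print(name, len(group), "instances; valid under:", valid_codes(group))
```

The grouping only works if all instances of a name sit in one list, which is the point the slide makes.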

Required base information is scattered in multiple systems / printed works at this time
[Figure panels: plant, animal, bacterium (etc.)]

IRMNG query as at March 2012
IRMNG is a central aggregation point for all such information as readily available from multiple sources, both electronic and print, although the compilation of names and associated nomenclatural information outstrips the full taxonomic information at this time.
Incorporation of “TAXAMATCH” fuzzy matching also permits return of other names differing only slightly from the entered name, in case one of these is in fact the intended target (it also permits a degree of data cleaning and reconciliation/deduplication).
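The idea of returning near-matches for a slightly misspelled input can be sketched with Python's standard `difflib` module. This is a generic similarity-ratio lookup standing in for the concept only; it is not the TAXAMATCH algorithm, and the function name `suggest_names`, the cutoff value and the short name list are assumptions.

```python
import difflib
from typing import Dict, Iterable, List


def suggest_names(query: str, known_names: Iterable[str],
                  cutoff: float = 0.8, limit: int = 5) -> List[str]:
    """Return known names that are close to the query, best match first."""
    # Compare case-insensitively but return the names as originally spelled.
    lookup: Dict[str, str] = {}
    for name in known_names:
        lookup.setdefault(name.lower(), name)
    hits = difflib.get_close_matches(query.lower(), list(lookup),
                                     n=limit, cutoff=cutoff)
    return [lookup[hit] for hit in hits]


known = ["Lawsonia", "Wagneria", "Homo"]
print(suggest_names("Lawsonnia", known))  # misspelled input
```

A production matcher for taxon names would also need to handle authorities, phonetic variants and rank-specific rules, which a plain similarity ratio does not.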

IRMNG query as at March 2012
The IRMNG web query interface also includes information on extant & habitat flags, synonymy (as held), sources of the data, information about parent and child taxa, and so on.
[Screenshot callouts: synonym of (as known); extant, habitat flags; parents; children]

Note: IRMNG fields displayed on the web are only a subset of full information held for any name, e.g.:

IRMNG core fields
- IRMNG ID, Rank
- Scientific name (for species: epithet + parent ID)
- Authority
- Publication (as “microcitation”; subset with link to refs. module)
- Source(s) for above
- Orthography verified against (authoritative source)
- Parent ID (+ “according to…”), Linnaean ranks only at this time
- Nomenclatural status (+ relation with other names as needed) + “according to…”
- Taxonomic status (same)
- Nomenclatural Code
- Taxonomic or nomenclatural remarks
- Extant/fossil, marine/nonmarine flags + “according to” (could be “as per parent”)
- Date entered, last modified, deprecated (where required)
(under consideration…)
- Intermediate ranks e.g. subfamily, subgenus, also infraspecies (not currently held)
- Type genus / species indicator
- Freshwater / terrestrial flags vs. present “nonmarine”
- Geo flags (country codes etc.)
- Palaeo range (periods/epochs)
- Vernacular names as available
Currently IRMNG is structured around Linnaean ranks only, i.e. kingdom / phylum / class / order / family / genus / species (no infraspecies are held at this time); this may be extended in future.
Deprecated records (e.g. duplicates detected during subsequent QA) are left on the system with their IRMNG ID intact, in case they are referred to elsewhere or require re-activation.
Records are flagged as either current (valid) or non-current at the indicated Rank; it is not yet clear how to handle taxa considered non-current at the designated rank, but current at another.
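A subset of the core fields could be modelled roughly as below. The field names, types and the `inherited_flag` helper are hypothetical and chosen for illustration; they are not the actual IRMNG column names. The sketch shows two behaviours described on the slide: flags that may be taken "as per parent", and deprecation that keeps the ID intact.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Optional


@dataclass
class NameRecord:
    irmng_id: int
    rank: str
    scientific_name: str
    parent_id: Optional[int]
    authority: Optional[str] = None
    nomenclatural_code: Optional[str] = None
    taxonomic_status: Optional[str] = None   # e.g. "valid" or "synonym"
    is_extant: Optional[bool] = None          # None = not stated on this record
    is_marine: Optional[bool] = None
    date_entered: date = field(default_factory=date.today)
    date_deprecated: Optional[date] = None

    @property
    def is_deprecated(self) -> bool:
        return self.date_deprecated is not None

    def deprecate(self, when: Optional[date] = None) -> None:
        # The record stays on file with its ID; only the date is set.
        self.date_deprecated = when or date.today()

    def reactivate(self) -> None:
        self.date_deprecated = None


def inherited_flag(records: Dict[int, NameRecord], irmng_id: int, attr: str):
    """Return a flag from the record, or from the nearest ancestor that has it."""
    current = records.get(irmng_id)
    while current is not None:
        value = getattr(current, attr)
        if value is not None:
            return value
        current = records.get(current.parent_id)
    return None


# Hypothetical example names.
family = NameRecord(1, "family", "Exampleidae", None, is_marine=True)
genus = NameRecord(2, "genus", "Examplea", 1)
records = {1: family, 2: genus}
```

Here `inherited_flag(records, 2, "is_marine")` falls back to the family's value because the genus record carries none of its own.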

IRMNG is not just a “passive” aggregator…
Editorial / curatorial decisions / actions required to:
- Correct obvious data errors
- Assemble “complete” records from multiple sources (where one source is data deficient)
- Normalise authority data (in particular) to a “house style”
- Digitise or transcribe print material into electronic form where not otherwise available
- Decide between conflicting content in data sources, e.g. for authority orthography/year, taxonomic placement, valid/synonym status and more
- Cross-link names, e.g. synonyms -> current names, basionyms -> replacement names, misspelled names to their correctly spelled counterparts, etc.
- Reconcile variant higher taxonomies as supplied to a single hierarchy
- Add nomenclatural or taxonomic remarks as required.
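The cross-linking step (synonyms and misspellings pointing to their current names) implies that a lookup may have to follow more than one link. A minimal sketch, assuming a hypothetical mapping `accepted_of` from a name's ID to the ID of the name it points to (`None` or absent for current names):

```python
from typing import Dict, Optional


def resolve_current(name_id: int, accepted_of: Dict[int, Optional[int]]) -> int:
    """Follow synonym / misspelling links until a current name is reached."""
    seen = set()
    while accepted_of.get(name_id) is not None:
        if name_id in seen:
            # A loop in the links is a data error that needs curation.
            raise ValueError(f"circular cross-link involving ID {name_id}")
        seen.add(name_id)
        name_id = accepted_of[name_id]
    return name_id


# Hypothetical IDs: 1 is a misspelling of 2, 2 is a synonym of 3, 3 is current.
links = {1: 2, 2: 3, 3: None}
print(resolve_current(1, links))
```

The cycle check matters in practice because links assembled from several sources can contradict one another.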

Relevance to present meeting?
- Demonstrates utility of a single entry point to a system permitting query on “any name”, i.e. a [comprehensive] Taxonomic Name Resolution Service (TNRS) covering all life
- Envisage something like OBIS or GBIF, but for taxonomy: the aggregator / central query point is not a content author, but provides integration and value-added services
- IRMNG is based on static snapshot/s of multiple data sources; by contrast, a “super catalogue” should be based on live feeds from relevant authoritative sources, continuously updated as available (? + some static data not available as feeds)
- Maybe the static data lives outside the “data aggregation/query” point and becomes a separately managed source
- How does / should GNA facilitate this?
- Will the need for an IRMNG (or IRMNG equivalent) disappear or grow in the above scenario? (For example, could this role be taken by another player or group of players…)

Thank you!

(supplementary slides)

Size of the task: IRMNG 2011 content cf. Cat. of Life 2011
Rank | Cat. of Life 2011 edition | % with auth's | IRMNG Oct 2011 (extant + fossil) | % with auth's | IRMNG Oct 2011 (fossil only)
Kingdoms | 8 | | 7 | | (0)
Phyla | 111 | | 153 | | (12)
Classes | 288 | | 509 | | (64)
Orders | 1,233 | | 2,645 | | (715)
Families | 8,071 | 0% | 19,639 | 22.1% | (6,542)
Subfamilies | -
Genera | 178,515 | | 452,848 | 97.1% | (90,278)
Subgenera | -
Species (valid) | 1,347,224 | ~100% | 1,020,519 | | (16,792)
Species (synonyms) | 895,441 | | 440,738 | | (100)
- Cat. of Life misses many genus-level synonyms / misspellings recognised elsewhere (including its source DB’s)
- Genera not treated as distinct data objects in CoL (unless changed recently), i.e. no authorities, publication info, nomenclatural or taxonomic remarks
- Coverage of fossils is considered a valuable feature of IRMNG (though no systematic attempt at species ingestion as yet)
- CoL has 70% of valid extant species names (of est. 1.9m total), thus maybe also 70% of valid extant genera (with a subset of genus-level synonyms)
- IRMNG has a further ~180k extant genus names and ~90k fossil names at this time (including syns); est. ~25k still missing
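The rounded figures in the notes can be checked against the table with a few lines of arithmetic. The variable names are illustrative. The split into extant and fossil genera assumes that the fossil-only genera do not overlap with the Cat. of Life genera, which is consistent with CoL omitting fossil taxa but is an assumption here.

```python
# Figures taken from the table above.
col_valid_species = 1_347_224
est_total_extant_species = 1_900_000   # "est. 1.9m total"
col_genera = 178_515
irmng_genera = 452_848                 # extant + fossil
irmng_fossil_genera = 90_278

# Share of estimated extant species covered by Cat. of Life.
col_species_coverage = col_valid_species / est_total_extant_species

# Genus names held by IRMNG beyond those in Cat. of Life.
extra_genera = irmng_genera - col_genera
extra_extant_genera = extra_genera - irmng_fossil_genera

print(f"CoL species coverage: {col_species_coverage:.1%}")
print(f"Extra genus names in IRMNG: {extra_genera:,}")
print(f"  of which extant: {extra_extant_genera:,}")
print(f"  of which fossil: {irmng_fossil_genera:,}")
```

The results are in line with the slide's rounded "70%", "~180k extant" and "~90k fossil".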

Taxonomic names: what the customer is currently offered (+ more…)
- Many single sources of taxon names, often not integrated
- Newly published names discoverable only with some effort (although “official” registries/lists exist for prokaryotes and viruses)
- Considerable latency as names flow from published (at left) to aggregators (at right)
[Diagram stages, left to right: publication; discovery; official registers; taxon-specific DB’s; integrated DB’s; “all names”]
[Diagram labels: New names published (in primary literature); Journal TOC’s, RSS feeds, text mining; Abstracting services; Subject bibliographies; Reviews, secondary literature; Zoological Record; ION (Index of Organism Names); Nomenclator Zoologicus; Botany; Zoology; ICBN Decisions; ICZN Decisions; ZOOBANK?; The Plant List, IPNI, TROPICOS, ING; Index Fungorum; MycoBank; LPSN (Prokaryote names); ICTV Viruses DB; AlgaeBase; CyanoDB; PaleoDB; Plant GSD’s; Animal GSD’s; ITIS; NCBI Taxonomy; WoRMS etc.; Catalogue of Life; ChecklistBank; GNI; GNUB; other compilations e.g. regional lists, Wikispecies, Wikipedia, more…]

Two approaches - GNI and Cat. of Life
GNI / NameBank approach: collect as many namestrings as possible, any rank
- User needs to explore source/s to determine taxonomic hierarchy and other information (if held)
- Or: maybe one day, will be offered in a coherent hierarchy/list (but not any time soon)
NameBank / GNI:
- 20m+ names, all ranks, no hierarchy
- mix of “clean” and “dirty” names
- many duplicates
- extant + fossil, most sectors with at least some names

GNI search result – “Lawsonia” (all ranks returned) (Mar 2012)
- Candidate genus names highlighted in red (although could be other ranks too); need access to original taxonomic / nomenclatural resources to sort out / see if anything is missed
- GNI produces a (partial?) list of known orthographies (mix of all ranks)
- Species and below can generally be eliminated by pattern matching, leaving uninomial names, i.e. genera and above (plus authorities), in multiple potential variants
- Note this suggests that there may be 3 genuinely distinct “Lawsonia” instances known to GNI at this time, although sometimes the situation is more opaque / potentially misleading (similar auth’s but different taxa, different auth’s but the same taxon, or no auth held).
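The pattern-matching step mentioned above can be approximated with a simple heuristic: treat a namestring as a uninomial if its first token is a capitalised word and the next token, if any, does not start with a lowercase letter (and so looks like an authority rather than a species epithet). The function name `is_uninomial` is hypothetical, and "Lawsonia Author, 1900" is a made-up namestring. The rule is deliberately crude: it would misclassify authorities beginning with a lowercase particle (e.g. "de", "von") and subgenus notation in parentheses.

```python
import re
from typing import List

GENUS_LIKE = re.compile(r"[A-Z][a-z]+")  # a capitalised single word


def is_uninomial(namestring: str) -> bool:
    """Heuristic: True for 'Genus' or 'Genus Authority', False for binomials."""
    tokens = namestring.split()
    if not tokens or not GENUS_LIKE.fullmatch(tokens[0]):
        return False
    # A lowercase second token is taken to be a species epithet.
    return len(tokens) == 1 or not tokens[1][0].islower()


def keep_uninomials(namestrings: List[str]) -> List[str]:
    return [s for s in namestrings if is_uninomial(s)]


sample = [
    "Lawsonia",
    "Lawsonia L.",
    "Lawsonia inermis L.",
    "Lawsonia intracellularis",
    "Lawsonia Author, 1900",   # hypothetical authority string
]
print(keep_uninomials(sample))
```

What remains after filtering still has to be compared by authority and checked against original sources, as the slide notes.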

Two approaches - GNI and Cat. of Life
Cat. of Life approach: stitch together authoritative lists for global sectors complete to species
- Some sectors (30% of all extant taxa) not yet sourced, may have no lists
- Information above species level is sketchy (e.g. no genus, family auth’s or other information)
- Fossil taxa are omitted at this time
Cat. of Life:
- <2m names, Linnaean ranks, in hierarchy
- all “clean” / vetted names / relationships
- extant only, sectors either complete or absent
NameBank / GNI:
- 20m+ names, all ranks, no hierarchy
- mix of “clean” and “dirty” names
- many duplicates
- extant + fossil, most sectors with at least some names

Cat. of Life search result – “Lawsonia” (Mar 2012)
- Catalogue of Life largely indexes species and infraspecies; genera are presented last with no authorities (although position in hierarchy can be accessed from this page)
- Note: only 2 “Lawsonia”s are held; at least one more exists that is not known to CoL (either missing, or out of scope, i.e. fossil)