Integrating source modifiers with sequence data through a new GenBank submission module in Symbiota
Andrew N. Miller (1), Phil Anders (1), Neil Cobb (2), Ben Brandt (2), and Ed Gilbert (3)
(1) University of Illinois Urbana-Champaign; (2) Northern Arizona University; (3) Arizona State University
BCoN Meeting, Lawrence, KS, 13 February 2018
Collection Management Systems: Arctos; Emu; FileMaker Pro; Microsoft Access; Microsoft Excel; Paradox; Specify; Symbiota
What is Symbiota? Specimen search engine; Floristic data; Species checklists; Surveys; Identification key; Image library; Distribution maps, descriptions, taxonomic information; Genetic data; Data aggregation
37 Million Records, 40 Portals, 13 Thematic Collection Networks
Key Symbiota Websites Homepage: http://symbiota.org/ Code @ GitHub: https://github.com/Symbiota Citable publication: http://bdj.pensoft.net/articles.php?id=1114 Google Group (support): http://symbiota.org/docs/google-group/ Symbiota Working Group: https://www.idigbio.org/wiki/index.php/Symbiota_Working_Group
The Problem: Source modifiers are seldom populated in GenBank records. Source modifiers: specimen voucher; country; isolation source; host; collected by; collection date; identified by; latitude longitude; altitude
Fungi dataset (1,200,057 records). Query: ((fungi[orgn] NOT srcdb_refseq[prop] NOT wgs[keyword] NOT tsa[keyword] NOT uncultured[filter]) NOT gbdiv_pat[prop]) AND (specimen_voucher[text] OR isolate[text] OR culture_collection[text] OR strain[text]). Source modifiers (% of records populated): specimen voucher 82%; country 52%; isolation source 29%; host 29%; collected by 0.00008%; collection date 15%; identified by 0%; latitude longitude 0%; altitude 0.6%
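A record count for a query like the one above could be requested from NCBI's E-utilities ESearch endpoint. The sketch below only builds the request URL and makes no network call. It assumes the nucleotide database (`nuccore`) was the one searched, and it writes `srcdb_refseq` and `gbdiv_pat` with the underscores Entrez expects. The helper name `count_url` is hypothetical, and the slides do not say how the counts were actually obtained.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Fungi dataset query from the slide (underscores restored in the property terms).
FUNGI_QUERY = (
    "((fungi[orgn] NOT srcdb_refseq[prop] NOT wgs[keyword] NOT tsa[keyword] "
    "NOT uncultured[filter]) NOT gbdiv_pat[prop]) AND (specimen_voucher[text] "
    "OR isolate[text] OR culture_collection[text] OR strain[text])"
)


def count_url(term, db="nuccore"):
    """Build an ESearch URL that asks only for the number of matching records.

    `db="nuccore"` is an assumption; the slides do not name the database.
    """
    params = {"db": db, "term": term, "rettype": "count", "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # Print the URL; fetching it is left to the caller.
    print(count_url(FUNGI_QUERY))
```

Keeping URL construction separate from the HTTP request makes the query easy to inspect before it is sent.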
Arthropod dataset (3,415,661 records). Source modifiers (% of records populated): specimen voucher 29%; country 64%; isolation source 2%; host 4%; collected by 0.0008%; collection date 49%; identified by 0%; latitude longitude 0.0002%; altitude 0.2%
Plant dataset (3,715,413 records). Source modifiers (% of records populated): specimen voucher 33%; country 24%; isolation source 2%; host 0.7%; collected by 0%; collection date 6%; identified by 0%; latitude longitude 4.6%; altitude 0.4%
Vertebrate dataset (6,748,218 records). Source modifiers (% of records populated): specimen voucher 41%; country 16%; isolation source 2%; host 3.5%; collected by 1.2%; collection date 3.5%; identified by 0%; latitude longitude 0%; altitude 0.2%
The Solution: Pull metadata directly from the Collection Management System and submit to GenBank
Symbiota rRNA Submission Tool: User Profile info; Specimen metadata; Sequence; Send to GenBank
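The "specimen metadata" step can be pictured as mapping collection fields onto a tab-delimited source modifier table that accompanies the sequences. The sketch below is an illustration only and is not Symbiota's code (which is PHP/MySQL). The input field names follow Darwin Core-style terms and are assumptions, as are the helper names. The column headers follow GenBank's source modifier naming as best understood here and should be checked against current NCBI submission guidance.

```python
import csv
import io

# Hypothetical mapping from Darwin Core-style specimen fields (assumed, not
# taken from Symbiota's schema) to GenBank source modifier column names.
FIELD_TO_MODIFIER = {
    "catalogNumber": "Specimen_voucher",
    "country": "Country",
    "substrate": "Isolation_source",
    "host": "Host",
    "recordedBy": "Collected_by",
    "eventDate": "Collection_date",
    "identifiedBy": "Identified_by",
}


def format_lat_lon(lat, lon):
    """Format decimal degrees as 'dd.dddd N|S ddd.dddd E|W'."""
    ns = "N" if lat >= 0 else "S"
    ew = "E" if lon >= 0 else "W"
    return f"{abs(lat):.4f} {ns} {abs(lon):.4f} {ew}"


def source_row(seq_id, specimen):
    """Build one table row, skipping fields that are missing or empty."""
    row = {"Sequence_ID": seq_id}
    for field, modifier in FIELD_TO_MODIFIER.items():
        value = specimen.get(field)
        if value not in (None, ""):
            row[modifier] = str(value)
    lat = specimen.get("decimalLatitude")
    lon = specimen.get("decimalLongitude")
    if lat is not None and lon is not None:
        row["Lat_Lon"] = format_lat_lon(float(lat), float(lon))
    elevation = specimen.get("minimumElevationInMeters")
    if elevation not in (None, ""):
        row["Altitude"] = f"{elevation} m"
    return row


def source_table(rows):
    """Serialize rows as tab-delimited text with Sequence_ID as first column."""
    others = sorted({key for row in rows for key in row} - {"Sequence_ID"})
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=["Sequence_ID"] + others,
        delimiter="\t",
        lineterminator="\n",
        restval="",
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Empty fields are dropped rather than written as blanks per row, so a modifier column appears in the table only when at least one specimen supplies a value for it.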
Genetic Data
PHP / MySQL; Open Source; Modular; Specimen; Floristic; Identification