Experience from Mapping Existing Models to the Transfer Schema
Robert Kukla
Introduction
  Three test databases:
    – ITIS (plants part)
    – Berlin Model (mosses/higher plants)
    – Taxonomer (fishes)
  Imported into MySQL
  Java program to generate XML
  Three main aspects:
    – Identifying concepts
    – Extracting relationships
    – Concept details
  No CharacterCircumscription, SpecimenCircumscription
  No hybrids, as their implications are not fully understood
ITIS
  Integrated Taxonomic Information System
  “authoritative” taxonomic information
  Continuously evolving:
    – New records get added
    – Existing records get updated (!)
  taxonomic units (97741 plants) concepts
  Most explored DB
ITIS - Identifying Concepts
  ITIS’ own concepts (type = revision)
    – taxonomic unit
    – usage = “accepted”
  Synonyms (type = referenced)
    – usage = “not accepted”
    – referenced from synonym table
  Vernaculars (type = vernacular)
    – from vernacular table
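The classification rules on this slide could be sketched in Java roughly as follows. The class and method names (`ItisConceptTypes`, `typeForUsage`) are hypothetical, and the sketch assumes the usage value has already been read from the taxonomic unit as a string. Vernacular concepts come from a separate table and are therefore not derived from the usage value here.

```java
import java.util.Locale;

// Hypothetical helper, not part of the original program:
// maps the usage value of a taxonomic unit to the concept type named on the slide.
final class ItisConceptTypes {

    static String typeForUsage(String usage) {
        if (usage == null) {
            throw new IllegalArgumentException("usage must not be null");
        }
        String normalised = usage.trim().toLowerCase(Locale.ROOT);
        if (normalised.equals("accepted")) {
            return "revision";     // ITIS' own concept
        }
        if (normalised.equals("not accepted")) {
            return "referenced";   // synonym, referenced from the synonym table
        }
        // Assumption: any other value is treated as an error rather than guessed at.
        throw new IllegalArgumentException("unexpected usage value: " + usage);
    }

    public static void main(String[] args) {
        System.out.println(typeForUsage("accepted"));
        System.out.println(typeForUsage("not accepted"));
    }
}
```

Failing loudly on unknown usage values is a design choice of this sketch; it makes gaps in the source data visible instead of silently assigning a type.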
ITIS: Extracting Relationships
  Concept Circumscription
    – parent_tsn field
  Synonymy Relationships
    – Explicit synonyms
    – Vernaculars
  Lineage Relationships
    – to concept of same name according to different publication
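As an illustration of how a circumscription might be derived from the parent_tsn field, the sketch below groups child identifiers under their parent. It is only a sketch: the class name, the row layout (`{tsn, parentTsn}`) and the convention that a parent value of 0 means "no parent" are assumptions, and the identifiers in `main` are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: builds "parent -> children" lists from (tsn, parent_tsn) pairs.
final class CircumscriptionSketch {

    // Each row is {tsn, parentTsn}. Assumption: parentTsn == 0 marks a root without parent.
    static Map<Integer, List<Integer>> childrenByParent(int[][] rows) {
        Map<Integer, List<Integer>> result = new TreeMap<>();
        for (int[] row : rows) {
            int tsn = row[0];
            int parentTsn = row[1];
            if (parentTsn == 0) {
                continue; // root: contributes no circumscription entry of its own
            }
            result.computeIfAbsent(parentTsn, key -> new ArrayList<>()).add(tsn);
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative identifiers only, not real ITIS data.
        int[][] rows = { {1, 0}, {2, 1}, {3, 1}, {4, 2} };
        System.out.println(childrenByParent(rows));
    }
}
```

Each map entry then corresponds to one concept whose circumscription lists its direct children.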
ITIS – concept details
  Names:
    – up to 4 epithets (only 3 used) plus 4 category indicators to be interpreted depending on rank
    – authorTeam from separate table
    – NameSimple calculated
  Publications:
    – Multiple publications per taxon_unit
    – Not completely atomised - compromise
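One plausible way to calculate NameSimple is to join the epithets that are actually filled in. The sketch below assumes exactly that and nothing more: it ignores the category indicators and the author team, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: builds a simple name string from up to four name parts,
// skipping parts that are null or blank. Category indicators are not handled here.
final class NameSimpleSketch {

    static String nameSimple(String... parts) {
        List<String> kept = new ArrayList<>();
        for (String part : parts) {
            if (part != null && !part.trim().isEmpty()) {
                kept.add(part.trim());
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        System.out.println(nameSimple("Abies", "alba", null, ""));
    }
}
```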
Berlin Model - Mosses/(German Higher Plants)
  Database of Taxonomic Concepts
    – Records will not change
    – Explicit concept relationships + (name-) synonymy
    – 24368 concepts – concepts
Berlin Model - Identifying Concepts
  From table pTaxon
Taxonomer
  Relational data model for managing information relevant to taxonomic research
  Records get added; not changed
  “Assertion”: mention of a taxonomic name in the taxonomic literature
  “Protonym”: taxonomic name in the context of its first publication
  Relationships between assertions
  assertions – concepts
Taxonomer - Identifying Concepts
  Concepts (type=referenced)
    – from table tbl_Assertions
    – ReliabilityID >= 4 (4 = revision, 5 = original/new combination)
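The selection rule above amounts to a simple threshold filter. A minimal sketch, assuming assertion rows have already been loaded as `{assertionId, reliabilityId}` pairs (this layout and all names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: keeps only assertions whose ReliabilityID is at least 4,
// i.e. the ones the slide treats as concepts.
final class AssertionFilterSketch {

    static final int MIN_RELIABILITY = 4; // 4 = revision, 5 = original/new combination

    static boolean isConcept(int reliabilityId) {
        return reliabilityId >= MIN_RELIABILITY;
    }

    // Each row is {assertionId, reliabilityId}; returns the ids of rows that qualify.
    static List<Integer> conceptAssertionIds(int[][] rows) {
        List<Integer> ids = new ArrayList<>();
        for (int[] row : rows) {
            if (isConcept(row[1])) {
                ids.add(row[0]);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        int[][] rows = { {10, 2}, {11, 4}, {12, 5}, {13, 3} };
        System.out.println(conceptAssertionIds(rows));
    }
}
```

In practice the same condition would more likely sit in the WHERE clause of the query, so that unwanted rows never leave the database.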
Taxonomer – extracting relationships
  ConceptCircumscription
    – ParentAssertionID
  Relationships
    – Table not populated
Taxonomer – concept details
  Number of fields in the database suggested a complexity that was not supported by the data (not all fields filled)
  Atomised name difficult to recreate as only the terminal epithet is stored, so it was omitted
  Use of cheat fields for NameSimple
  Large number of AccordingTo (>4000)
  Publication data transferred 1:1
Technical Aspects
  Database consistency, e.g.
    – getting all publication records
    – no relationships to non-existent concepts
  Charset
    – assume windows-1252 code page
  Slow!
    – indexes essential
    – fewer queries with big result sets are faster
  Recursive approach is more suitable for wrapper
    – guarantees small, consistent subset
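The charset assumption can be made explicit in code. The sketch below decodes raw bytes as windows-1252 and re-encodes them as UTF-8; the choice of UTF-8 as the target encoding and the class name are assumptions, since the slides only state the source code page.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: treats raw database bytes as windows-1252 text.
final class CharsetSketch {

    static final Charset SOURCE = Charset.forName("windows-1252");

    // Decodes bytes under the windows-1252 assumption.
    static String decode(byte[] raw) {
        return new String(raw, SOURCE);
    }

    // Re-encodes the decoded text as UTF-8 (assumed target encoding for the XML output).
    static byte[] toUtf8(byte[] raw) {
        return decode(raw).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = { 'c', 'a', 'f', (byte) 0xE9 }; // 0xE9 is "é" in windows-1252
        System.out.println(decode(raw));
        System.out.println(toUtf8(raw).length);
    }
}
```

Decoding in one place keeps the assumption easy to change if a source database turns out to use a different code page.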
Mapping software
  Universal transformation software to convert relational data to XML (XMlizer)
    – Often GUI based; filling in a skeleton XML file
    – Relate a single query (table or join) to collection of XML nodes
    – Map fields from that query to attributes or child elements of the XML node
  Problems
    – No mechanism to use multiple sources (queries) for one
    – No conditional transformation
    – No splitting of fields
    – Limited merging of fields
  Write our own universal mapping software
    – addresses first 2 problems
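The basic mapping idea (one query row becomes one XML node, with fields mapped to attributes or child elements) can be sketched with the standard Java DOM API. This is not the mapping software described on the slide, only an illustration; the class name, the row representation as a `Map`, and the element and attribute names used in `main` are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch: maps one result row to one XML node.
final class RowMapperSketch {

    // attributeFields and elementFields map a column name to an XML attribute/element name.
    static Element mapRow(Document doc, String nodeName, Map<String, String> row,
                          Map<String, String> attributeFields,
                          Map<String, String> elementFields) {
        Element node = doc.createElement(nodeName);
        for (Map.Entry<String, String> entry : attributeFields.entrySet()) {
            String value = row.get(entry.getKey());
            if (value != null) {
                node.setAttribute(entry.getValue(), value);
            }
        }
        for (Map.Entry<String, String> entry : elementFields.entrySet()) {
            String value = row.get(entry.getKey());
            if (value != null) { // columns without a value produce no child element
                Element child = doc.createElement(entry.getValue());
                child.setTextContent(value);
                node.appendChild(child);
            }
        }
        return node;
    }

    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no XML parser available", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("tsn", "1");          // illustrative values only
        row.put("name", "Abies");
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("tsn", "id");
        Map<String, String> elems = new LinkedHashMap<>();
        elems.put("name", "Name");
        Element concept = mapRow(newDocument(), "Concept", row, attrs, elems);
        System.out.println(concept.getAttribute("id") + " "
                + concept.getFirstChild().getTextContent());
    }
}
```

Combining several queries into one node or transforming conditionally, the two problems the custom software is said to address, would need logic beyond this single-row mapping.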
Conclusion
  Conversion of legacy data is possible but
    – information missing
    – information will be lost
  Data in original DB is open to interpretation, so an expert should be consulted
  Required computing resources should not be underestimated