Data Integration Issues in Biodiversity Research
Jessie Kennedy
Shawn Bowers, Matthew Jones, Josh Madin, Robert Peet, Deana Pennington, Mark Schildhauer, Aimee Stewart


1 Data Integration Issues in Biodiversity Research
Jessie Kennedy
Shawn Bowers, Matthew Jones, Josh Madin, Robert Peet, Deana Pennington, Mark Schildhauer, Aimee Stewart

2 Visual Tools for Managing Taxonomic Concepts
SEEK
• Science Environment for Ecological Knowledge
• Research and develop information technology to radically improve the type and scale of ecological science that can be addressed

3 Science and Scientific Data are Complex
[Figure: scientific disciplines: Biochemistry, Climatology, Taxonomy, Meteorology, Nomenclature, Paleontology, Genomics, Proteomics, Hydrology, Morphology, Geology, Oceanography, Geography, Ecology]

4 [Figure: the same disciplines, shown with the data concepts they share: Organism, Name, Taxon concept, Gene sequence, Pathway, Protein, Location, Temperature, Depth]

5 Scientific Community: complex
[Figure labels: Individual Scientist, Scientific Laboratory, Small Scientific Community, Large Scientific Community]

6 [Figure: the diagram of disciplines and shared data concepts (Organism, Name, Taxon concept, Gene sequence, Pathway, Protein, Location, Temperature, Depth), repeated four times]

7 Science & Scientific Data are Continually Changing
• Conclusions become foundations for new hypotheses
• New experiments invalidate existing knowledge
• Knowledge is open to interpretation
  • Different opinions
• Need to build this into our technological solutions
[Figure labels: observation, experiment, hypothesis, conclusion]

8 Exploiting Scientific Data
• To support scientists in
  • Discovery
  • Access
  • Sharing
  • Integration/Linking
  • Analysis
• Scientists can then improve their potential for new scientific discovery

9 Data Integration/Linking: approaches
• Metadata
  • to describe the data sets and know how to interpret the data sets
• Ontologies
  • to define the terminology used and know how data might be related and to aid automatic transformation of the data
• Standardisation of formats
  • for exchange of data + to ease integration
• LSIDs
  • to uniquely identify things; know when 2 things are the same
• Workflows
  • to enable specification, refinement and repetition of integration/analysis
• Provenance of data
  • to record where the data has come from and what has happened to it en route.
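
The LSID bullet ("know when 2 things are the same") can be illustrated with a minimal Python sketch. It assumes the usual Life Science Identifier layout `urn:lsid:authority:namespace:object[:revision]`. The authority `example.org`, the namespace `concepts`, and the helper names `parse_lsid` and `same_object` are hypothetical, and treating two revisions as "the same thing" is a simplifying assumption rather than anything stated in the talk.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its parts."""
    parts = text.strip().split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    revision = parts[5] if len(parts) == 6 else None
    # Assumption: the authority is compared case-insensitively, the rest is not.
    return LSID(authority.lower(), namespace, object_id, revision)


def same_object(a: str, b: str) -> bool:
    """True if both LSIDs name the same object (revision ignored by assumption)."""
    return parse_lsid(a)[:3] == parse_lsid(b)[:3]


print(same_object("urn:lsid:example.org:concepts:C1.4",
                  "URN:LSID:Example.org:concepts:C1.4:2"))
```

Comparing identifiers rather than names is the point here: two data sets can use the same scientific name and still carry different identifiers.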

10 Projects in most sciences: ESG

11 Ecological Science - Analysis
• Ecological niche modeling of species distributions
  • Where do species occur now?
  • Where will they occur in the future?
[Image from http://www.lifemapper.org]

12 Ecological Niche Modeling
[Figure: modeling workflow]
• Inputs
  • Known Species Locations
  • Environmental Characteristics from gridded GIS layers (Temperature layer, many other layers)
• Develop Model
  • Multidimensional Ecological Space (D1 = Temperature, D2, ..., Dn)
• Predictions
  • Native Distribution Prediction: Environmental Characteristics of Surrounding Geographic Area
  • Invasion Area Prediction: Environmental Characteristics of Different Geographic Area
  • Environmental Change Prediction: Future Scenarios of Environmental Characteristics
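
As a rough illustration of "develop a model in multidimensional ecological space", the sketch below fits a simple bounding-box envelope: for each environmental layer it records the range observed at the known species locations, then calls a location suitable if every layer falls inside that range. This is a deliberately simplified stand-in, not the modeling algorithm used in SEEK; the layer names and all values are made up for the example.

```python
from typing import Dict, List, Tuple

Env = Dict[str, float]                      # one value per environmental layer
Envelope = Dict[str, Tuple[float, float]]   # (min, max) per layer


def fit_envelope(known_sites: List[Env]) -> Envelope:
    """Record the observed range of each layer across the known locations."""
    layers = known_sites[0].keys()
    return {
        name: (min(site[name] for site in known_sites),
               max(site[name] for site in known_sites))
        for name in layers
    }


def is_suitable(envelope: Envelope, cell: Env) -> bool:
    """A cell is suitable if every layer value lies inside the envelope."""
    return all(low <= cell[name] <= high for name, (low, high) in envelope.items())


# Hypothetical environmental values sampled at known species locations.
known = [
    {"temperature": 4.0, "precipitation": 900.0},
    {"temperature": 7.5, "precipitation": 1200.0},
    {"temperature": 6.0, "precipitation": 1000.0},
]
envelope = fit_envelope(known)

present = {"temperature": 5.0, "precipitation": 1100.0}
future = {"temperature": 9.0, "precipitation": 1100.0}   # warmer scenario
print(is_suitable(envelope, present))  # True
print(is_suitable(envelope, future))   # False
```

The same fitted envelope can be applied to the surrounding area, to a different geographic area, or to a future scenario, which mirrors the three prediction branches on the slide.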

13 Sources of Scientific Data
• Data are massively dispersed
  • Ecological field stations and research centers (100’s)
  • Natural history museums and biocollection facilities (100’s)
  • Agency data collections (10’s to 100’s)
  • Individual scientists (1000’s)
• Data are heterogeneous
  • Syntax (format)
  • Schema (model)
  • Semantics (meaning)

14 Challenge: Data Integration

15 SEEK Components

16 Semantic Annotation – SEEK ontologies
• Integration/merge
  • Concept mapping
  • Units conversion
  • Spatial & temporal scaling
• Data discovery
  • Finding relevant data sets
  • Understanding data set content

17 Smart (Data) Integration: Merge
• Discover data of interest
• … connect to merge actor
• … “compute merge”

18 Smart Merge …
• Semantic type annotations and ontology definitions used to find mappings between sources
• Executing the merge actor results in an integrated data product (via “outer union”)
[Figure: two source tables, one with columns a1 to a4 (rows: a, 5, 10; b, 6, 11) and one with columns a5 to a8 (rows: 0.1, a; 0.2, c; 0.3, d). The Merge step pairs a1 with a8 and a3 with a6 using the semantic types Biomass and Site; a4 has no counterpart.]
Merge Result:
a1  a3   a4
a   5.0  10
b   6.0  11
a   0.1
c   0.2
d   0.3
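
The "outer union" on this slide can be sketched in plain Python. Each source gets a mapping from its own column names to the shared target columns; rows from all sources are stacked and any column a source lacks is filled with `None`. The function name `outer_union` is hypothetical, the mappings are written by hand here (in SEEK they would be derived from semantic annotations), and the sample rows only loosely follow the figure.

```python
from typing import Dict, List

Row = Dict[str, object]


def outer_union(sources: List[List[Row]], mappings: List[Dict[str, str]]) -> List[Row]:
    """Rename mapped columns, drop unmapped ones, stack rows, pad gaps with None."""
    # Target columns in first-seen order across all mappings.
    columns: List[str] = []
    for mapping in mappings:
        for target in mapping.values():
            if target not in columns:
                columns.append(target)

    merged: List[Row] = []
    for rows, mapping in zip(sources, mappings):
        for row in rows:
            renamed = {mapping[col]: val for col, val in row.items() if col in mapping}
            merged.append({col: renamed.get(col) for col in columns})
    return merged


# Illustrative data: table1 uses columns a1, a3, a4; table2 uses a6, a8.
table1 = [{"a1": "a", "a3": 5.0, "a4": 10}, {"a1": "b", "a3": 6.0, "a4": 11}]
table2 = [{"a6": 0.1, "a8": "a"}, {"a6": 0.2, "a8": "c"}, {"a6": 0.3, "a8": "d"}]

result = outer_union(
    [table1, table2],
    [{"a1": "a1", "a3": "a3", "a4": "a4"},   # table1 columns keep their names
     {"a8": "a1", "a6": "a3"}],              # a8 matches a1, a6 matches a3
)
for row in result:
    print(row)
```

Unlike a join, the outer union never drops rows: every source row survives, and the missing a4 values for the second table stay visibly empty.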

19 Challenges of Taxonomic Data
Scientific names change in meaning over time and across geographical regions, which affects the conclusions drawn from analysis of data integrated on names.

20 What is Abies lasiocarpa?
[Figure: SubAlpine Fir under two classifications]
• Flora North America: Abies lasiocarpa, Abies bifolia
• USDA Plants & ITIS: Abies lasiocarpa, with var. arizonica and var. lasiocarpa

21 Taxonomic history of imaginary genus Aus L. 1758
[Figure: 5 revisions of Aus and 1 name spelling change, showing changes in meaning of names]
• Linnaeus 1758: Aus L.1758 with Aus aus L.1758
• Archer 1965: Aus L.1758 with Aus aus L.1758, Aus bea Archer 1965
• Fry 1989: Aus L.1758 with Aus aus L.1758, Aus bea Archer 1965, Aus cea Fry 1989
• Pyle 1990: Aus bea and Aus cea noted as invalid names and replaced with Aus beus and Aus ceus
• Tucker 1991: Aus L.1758 with Aus aus L.1758, Aus cea Fry 1989
• Pargiter 2003: Aus L.1758 with Aus aus L. 1758, Aus ceus Fry 1989; Xus Pargiter 2003 with Xus beus (Archer) Pargiter 2003

22 Changes in meaning of names
[Figure: the same revisions of Aus (Linnaeus 1758, Archer 1965, Fry 1989, Pyle 1990, Tucker 1991, Pargiter 2003)]
8 Names: 2 genus, 6 species

23 Each name has many concepts or meanings
8 Names, 17 Concepts
• N0 - Aus L.1758
  • C0.1 - Aus L.1758 sec. Linnaeus 1758
  • C0.2 - Aus L.1758 sec. Archer 1965
  • C0.3 - Aus L.1758 sec. Fry 1989
  • C0.4 - Aus L.1758 sec. Tucker 1991
  • C0.5 - Aus L.1758 sec. Pargiter 2003
• N1 - Aus aus L.1758
  • C1.1 - Aus aus L.1758 sec. Linnaeus 1758
  • C1.2 - Aus aus L.1758 sec. Archer 1965
  • C1.3 - Aus aus L.1758 sec. Fry 1989
  • C1.4 - Aus aus L.1758 sec. Tucker 1991
  • C1.5 - Aus aus L.1758 sec. Pargiter 2003
• N2 - Aus bea Archer 1965
  • C2.2 - Aus bea Archer 1965 sec. Archer 1965
  • C2.3 - Aus bea Archer 1965 sec. Fry 1989
• N3 - Aus cea Fry 1989
  • C3.3 - Aus cea Fry 1989 sec. Fry 1989
  • C3.4 - Aus cea Fry 1989 sec. Tucker 1991
• N4 - Aus beus Archer 1965
• N5 - Aus ceus Fry 1989
  • C5.5 - Aus ceus Fry 1989 sec. Fry 1989
• N6 - Xus beus Pargiter 2003
  • C6.5 - Xus beus Pargiter 2003 sec. Pargiter 2003
• N7 - Xus Pargiter 2003
  • C7.5 - Xus Pargiter 2003 sec. Pargiter 2003

24 Find data sets containing Aus aus
• Many possible interpretations of Aus aus (N1)
  • Original concept: C1.1
  • Most recent concept: C1.5
  • Preferred Authority (e.g. Fry 1989): C1.3
  • Everything ever named N1: Union(C1.1,C1.2,C1.3,C1.4,C1.5)
  • Best fit according to some matching algorithm: Best(C1.1,C1.2,C1.3,C1.4,C1.5)
  • New concept containing only those features common to all concepts with the name N1: Intersection(C1.1,C1.2,C1.3,C1.4,C1.5)
• Is it appropriate to link or merge data sets returned on the scientific names?
  • Depends on the user’s purpose
  • Level of precision required
[Figure: name N1 (Aus aus L.1758) linked to concepts C1.1, C1.2, C1.3, C1.4, C1.5]
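
The interpretations listed on this slide can be written down as small selection functions over the five concepts that share the name N1. The sketch below models each concept's circumscription as a set of specimen ids. Those ids and the `Concept` class are invented for illustration, and the "Best" strategy is left out because the slide does not say which matching algorithm it uses.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Concept:
    concept_id: str
    according_to: str            # the "sec." reference
    year: int
    members: FrozenSet[str]      # hypothetical specimen ids in the circumscription


N1_CONCEPTS: List[Concept] = [
    Concept("C1.1", "Linnaeus 1758", 1758, frozenset({"s1", "s2"})),
    Concept("C1.2", "Archer 1965", 1965, frozenset({"s1", "s2", "s3"})),
    Concept("C1.3", "Fry 1989", 1989, frozenset({"s1", "s2"})),
    Concept("C1.4", "Tucker 1991", 1991, frozenset({"s1", "s2", "s4"})),
    Concept("C1.5", "Pargiter 2003", 2003, frozenset({"s1"})),
]


def original(concepts: List[Concept]) -> Concept:
    return min(concepts, key=lambda c: c.year)


def most_recent(concepts: List[Concept]) -> Concept:
    return max(concepts, key=lambda c: c.year)


def preferred(concepts: List[Concept], authority: str) -> Concept:
    return next(c for c in concepts if c.according_to == authority)


def union(concepts: List[Concept]) -> FrozenSet[str]:
    """Everything ever placed under the name."""
    return frozenset().union(*(c.members for c in concepts))


def intersection(concepts: List[Concept]) -> FrozenSet[str]:
    """Only what every concept of the name agrees on."""
    return frozenset.intersection(*(c.members for c in concepts))


print(original(N1_CONCEPTS).concept_id)               # C1.1
print(most_recent(N1_CONCEPTS).concept_id)            # C1.5
print(preferred(N1_CONCEPTS, "Fry 1989").concept_id)  # C1.3
print(sorted(union(N1_CONCEPTS)))                     # ['s1', 's2', 's3', 's4']
print(sorted(intersection(N1_CONCEPTS)))              # ['s1']
```

Which function is appropriate depends on the user's purpose: the union maximises recall, the intersection maximises precision.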

25 [Figure: graph linking names N0 to N7 with concepts C0.1 to C0.5, C1.1 to C1.5, C2.2, C2.3, C3.3, C3.4, C5.5, C6.5, C7.5]
• Names for each of the concepts
• Parent child relationships in 5 revisions
• Information from literature on synonymy
  • Taxonomists record which names their concepts are synonymous with and any name changes

26 Find data sets with Aus aus (N1)
[Figure, step 1: starting from N1, reached C1.1, C1.2, C1.3, C1.4, C1.5]

27 Find data sets with Aus aus (N1)
[Figure, step 2: adds N2, C2.2, C2.3]

28 Find data sets with Aus aus (N1)
[Figure, step 3: adds N3, C3.3, C3.4]

29 Find data sets with Aus aus (N1)
[Figure, step 4: adds N4, N6, C6.5]

30 Find data sets with Aus aus (N1)
[Figure, step 5: adds N5, C5.5]
Results in everything returned for Aus aus by traversing the synonymy and name links
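
Traversing the synonymy and name links amounts to a graph search starting at N1. The sketch below uses breadth-first search over an undirected graph. The node labels come from the slides, but the individual edges are assumptions made for the example, since the original figure is not recoverable from the transcript.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]

# Assumed links (illustrative only): name-to-concept, concept synonymy, name changes.
NAME_LINKS = [("N1", c) for c in ("C1.1", "C1.2", "C1.3", "C1.4", "C1.5")] + [
    ("N2", "C2.2"), ("N2", "C2.3"), ("N3", "C3.3"), ("N3", "C3.4"),
    ("N5", "C5.5"), ("N6", "C6.5"), ("N7", "C7.5"),
]
SYNONYMY_LINKS = [("C1.1", "C2.2"), ("C2.3", "C3.3")]
NAME_CHANGES = [("N2", "N4"), ("N4", "N6"), ("N3", "N5")]


def build_graph(edges: Iterable[Edge]) -> Dict[str, Set[str]]:
    """Undirected adjacency sets."""
    graph: Dict[str, Set[str]] = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def reachable(graph: Dict[str, Set[str]], start: str) -> Set[str]:
    """Breadth-first search: every node connected to `start` by any chain of links."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


graph = build_graph(NAME_LINKS + SYNONYMY_LINKS + NAME_CHANGES)
found = reachable(graph, "N1")
print(sorted(n for n in found if n.startswith("N")))
print(sorted(n for n in found if n.startswith("C")))
```

With these assumed edges the search starting at Aus aus pulls in six names and eleven concepts, which shows how quickly an unfiltered traversal over-collects.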

31 Information to improve data sets returned
[Figure: the graph of names N0 to N7 and concepts, annotated with relationship symbols between concepts]
• Minimally what we need are set relationships from concepts in any taxonomy to earlier concepts, and name changes related to earlier names
• We can build systems to return data fit for purpose
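
A set relationship between two concepts can be computed directly when their circumscriptions are available as sets. The sketch below classifies a pair and then keeps only the concepts that are safe to merge with a query concept. The relationship labels are an assumed vocabulary, and the concept ids and member sets are hypothetical.

```python
from typing import Dict, FrozenSet, List


def relationship(a: FrozenSet[str], b: FrozenSet[str]) -> str:
    """Describe how circumscription `a` relates to circumscription `b`."""
    if a == b:
        return "is congruent to"
    if a > b:
        return "includes"
    if a < b:
        return "is included in"
    if a & b:
        return "overlaps"
    return "excludes"


def safe_to_merge(query: FrozenSet[str],
                  candidates: Dict[str, FrozenSet[str]]) -> List[str]:
    """Keep candidates that contain nothing outside the query concept."""
    allowed = {"is congruent to", "is included in"}
    return [cid for cid, members in candidates.items()
            if relationship(members, query) in allowed]


query = frozenset({"s1", "s2", "s3"})
candidates = {
    "C_same": frozenset({"s1", "s2", "s3"}),
    "C_narrow": frozenset({"s1"}),
    "C_wide": frozenset({"s1", "s2", "s3", "s4"}),
    "C_partial": frozenset({"s3", "s5"}),
    "C_other": frozenset({"s9"}),
}
print(safe_to_merge(query, candidates))  # ['C_same', 'C_narrow']
```

In practice taxonomists assert these relationships rather than compute them, but once recorded they let a system filter a traversal by the precision the user needs.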

32 Real Biological Taxonomies
• Larger and change more frequently than the Aus example
• German mosses
  • 14 classifications in 73 years
  • covering 1548 taxa
  • only 35% thought to be stable concepts
  • 65% of names used in legacy data sets are ambiguous
• Taxonomic revisions of genus Alteromonas over 34 years, from 1972 to 2006
  • At the species level
    • 18 “emendations”
    • 19 species reassigned to 4 genera
    • 3 new combinations
    • 6 synonyms
    • 2 species to subspecies
    • 2 subspecies to species
    • 21 new species

33 SEEK Taxon Approach
• Use Taxon Concepts for referring to organisms
  • Aus aus L. 1758 sec. Tucker 1991
  • Abies lasiocarpa (Hook) Nutt. sec FNA 1997
• Taxon Concept/Name Resolution
  • International data exchange schema
    • TCS (Taxonomic Concept Schema)
  • Concept Repository and Resolution web service
    • Linked to Kepler workflow system
    • Globally unique identifiers (LSIDs)
• Visualization software for comparing Taxonomies and Asserting Concept Relationships
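
A taxon concept label of the form "name sec. reference" pairs a scientific name with the publication that defines its meaning. The sketch below splits such a label on the "sec." marker, tolerating the missing full stop seen in the second example on the slide. The helper name `parse_concept_label` is hypothetical and the parsing rule is an assumption, not part of TCS.

```python
import re
from typing import NamedTuple


class ConceptLabel(NamedTuple):
    name: str           # scientific name with its author string
    according_to: str   # the defining reference after "sec."


# "sec" or "sec." as a standalone token surrounded by whitespace.
_SEC = re.compile(r"\s+sec\.?\s+", re.IGNORECASE)


def parse_concept_label(label: str) -> ConceptLabel:
    """Split 'Aus aus L. 1758 sec. Tucker 1991' into name and reference."""
    parts = _SEC.split(label.strip(), maxsplit=1)
    if len(parts) != 2:
        raise ValueError(f"no 'sec.' marker in {label!r}")
    return ConceptLabel(parts[0].strip(), parts[1].strip())


print(parse_concept_label("Aus aus L. 1758 sec. Tucker 1991"))
print(parse_concept_label("Abies lasiocarpa (Hook) Nutt. sec FNA 1997"))
```

Two labels with the same name but different references then compare as different concepts, which is exactly the distinction a bare name loses.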

34 Taxon Object Server
[Figure labels: Taxonomic Data Providers; Mammal Species of the World; Taxonomic Literature; Database to TCS Mapping Tool; Concept Extraction Tool; TCS; TOS; SEEK Cache; Concept Mapper]

35 Taxonomic Object Service: SEEK Concept Mapper
http://seek.nhm.ku.edu/TaxObjServ/services
[Figure labels: TCS; services Find All Concepts, Get Synonymous Concepts, Get Best Concept; TOS; SEEK Cache; LSID Authority; Morpho; Data Analysis; EML Datasets; Identify species; EML(TCS); Mark up datasets]

36 Recap…
• Re-emphasised the problems with Taxonomic Names
  • not good identifiers for organisms
  • problem extends to most areas
    • characters, countries, habitats, vegetation types, genes…
• Shown that Taxonomic concepts are better for referring to organisms, specimens, observations…
• but
  • Need better systems for resolving taxonomic names/concepts
  • Which require better information

37 Provide better tools for users
• To help taxonomists create better quality data
  • Better access to reference/legacy data
  • Explore differences/similarities in existing taxonomies
  • To create relationships between concepts
  • Improved data can be made available to the general biology community for incorporating into bio-referenced databases.
• To help end users understand and use the data
  • and its limitations
  • Biologists can use tools to understand the impact of using particular data on their analysis

38 Conclusion
• Science is complex (and therefore split into specialisms)
  • Identify the overlaps/linkages in the different domains
  • Need useful approximations of things to simplify linked domain
  • Need to understand the approximations or linking points well
  • Support re-composition, linking or building on the components
• Science is inherently changing
  • Science is full of legacy data
  • Today’s scientific research is tomorrow’s legacy data
  • Track the changes in the data
    • know when components or links have changed
• Provide long-term persistent storage
  • Any published scientific discovery should store the data as evidence
  • Data needs to be accurately annotated
    • Sufficient to repeat analyses to test hypotheses

39 Acknowledgements
• Colleagues on the SEEK project
• NSF and EPSRC funding
• e-Science Centre funding
• Colleagues in TDWG

40 Thank You Questions…

