Habitat-Lite & EnvO Jin Mao Postdoc, School of Information, University of Arizona Nov. 20, 2015
Outline Habitat-Lite ENVO Cases
Habitat-Lite The association of organisms with their environments is a key issue in exploring biodiversity patterns (Pafilis et al., 2015). There is a need to facilitate the capture of metadata describing the growing number of genomic and metagenomic projects, including information about isolation source and habitat (Field et al., 2008a; Morrison et al., 2006). Motivations
Habitat-Lite Habitat: the place or environment where an organism naturally or normally lives and grows. Sample source (Isolated from): the environmental context in which a sample is collected (Morrison et al., 2006). Definition
Habitat-Lite The literature is scattered and the metadata is difficult to find, even by expert manual extraction. Related fields in databases are sparse and free text. Vocabulary and definitions lack standardization. Challenge
Habitat-Lite Short-term (high-level habitat descriptions): develop a lightweight controlled vocabulary (Habitat-Lite) within the EnvO framework to capture high-level habitat and environmental metadata. Long-term: develop a repeatable process for other types of metadata by identifying key terms based on usage in databases and the open literature. Goal
Habitat-Lite Survey the terms used in a number of relevant sources. Select a set of high-level terms as a strawman for the first iteration of the Habitat-Lite term list. Discuss with annotators at NCBI. Construction Method
Habitat-Lite Construction Method Seed Terms Experiments: “bin” existing entries; usable for human and semiautomated annotation; a “minimal set” of habitat terms that provides good coverage of entries in key resources (NCBI, microbial genomes, 16S sequences); patterns and biases in the complete genome collection
Habitat-Lite Term List
Environment Ontology (ENVO) Biological: data from environmental samples Biomedical: physical environment of organisms Environment-aware analyses Background
Environment Ontology (ENVO) Need for consistent description of the environmental origins of tissue, pathogen, and metagenomics samples Need for the labeling of samples and artifacts in museum collections Needs
Environment Ontology (ENVO) ENVO should be composed of classes (terms) referring to key environment types that may be used to facilitate the retrieval and integration of a broad range of biological data. Interoperability with the numerous biological and biomedical ontologies compliant with Open Biomedical and Biological Ontologies (OBO) Foundry principles. A standardized and semantically controlled representation, like that of the Gene Ontology (GO). Usable both by specialists and by non-experts. Goals
Environment Ontology (ENVO) 24/envo.owl OBO: OBO-Edit ontology development tool OWL CSV Download
Cases The ability to “bin” data into interesting categories for purposes of comparison. To test coverage, utility, and usability, a small experiment was carried out in late 2006 for the Ribosomal Database Project (RDP; Cole et al., 2007). Bin data
Cases Manually classify into habitats the 168,911 rRNA sequences marked as environmental in RDP release 9.44 (November 2006). Host-associated was split into separate categories for plant associated and animal (including human) associated. Sources: isolation_source; the reference titles where it did not exist. Bin data
Cases The biggest category was animal associated, and a large fraction of these were human associated. Bin data
Cases The metadata about habitat or isolation source occurs in many diverse forms, including PDF tables, densely written materials and methods sections, supplementary material, and even in referenced work. Free text metadata already available: the “isolation_source” field from GenBank gene records. GenBank Case
Cases Identify probable classes based on the presence of specific key words in each entry. Matching used Habitat-Lite terms plus synonyms: for “waste water,” the terms used for matching were “waste water,” “waste-water,” “wastewater,” “sewage,” “sewerage,” etc. It also used specializations: for “food,” the terms used for matching included specific kinds of foods, for example, “milk,” “cheese,” “beer,” etc. This is a pattern-matching approach. GenBank Case
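The keyword matching described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' actual implementation: only the “waste water” and “food” term lists come from the slide, while the table layout and the names `HABITAT_KEYWORDS`, `compile_patterns`, and `match_habitats` are hypothetical.

```python
import re

# Hypothetical keyword table. Only these two categories and their terms are
# taken from the slide; a real table would cover all Habitat-Lite categories.
HABITAT_KEYWORDS = {
    "waste water": ["waste water", "waste-water", "wastewater", "sewage", "sewerage"],
    "food": ["food", "milk", "cheese", "beer"],
}


def compile_patterns(keywords):
    """Build one case-insensitive, whole-word regex per category."""
    patterns = {}
    for category, terms in keywords.items():
        # Longest terms first so multi-word phrases are tried before short ones.
        ordered = sorted(terms, key=len, reverse=True)
        alternation = "|".join(re.escape(term) for term in ordered)
        patterns[category] = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return patterns


PATTERNS = compile_patterns(HABITAT_KEYWORDS)


def match_habitats(isolation_source):
    """Return the sorted list of categories whose keywords occur in the entry."""
    return sorted(
        category
        for category, pattern in PATTERNS.items()
        if pattern.search(isolation_source)
    )


print(match_habitats("Activated sludge from wastewater treatment plant"))
print(match_habitats("raw milk cheese"))
print(match_habitats("deep sea sediment"))
```

Whole-word matching is a design choice made here for the sketch: it keeps “seafood” from matching “food,” at the cost of missing plurals such as “beers” unless they are listed explicitly.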
Cases GenBank Case Of the almost 35,000 distinct entries in the isolation_source field, some 22,000 (63%) contained specific words or phrases that could be mapped to the 17 Habitat-Lite categories.
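As a quick check of the coverage figure, the following sketch computes the fraction of entries that map to at least one category. The function name `coverage` and the toy entries are hypothetical; the 22,000 and 35,000 counts are the rounded figures from the slide.

```python
def coverage(labels_per_entry):
    """Fraction of entries that were mapped to at least one category."""
    if not labels_per_entry:
        return 0.0
    mapped = sum(1 for labels in labels_per_entry if labels)
    return mapped / len(labels_per_entry)


# Toy example (made-up entries): two of four entries received a category.
toy = [["soil"], [], ["food", "host"], []]
print(coverage(toy))

# Rounded counts from the slide: some 22,000 of almost 35,000 distinct entries.
print(f"{22_000 / 35_000:.0%}")
```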
Cases Habitat field plus Isolation field. Exact matches for 84% of GOLD Habitat terms, with an additional term “aquatic.” The three most frequent terms (“host,” “aquatic,” and “soil”) covered 75% of GOLD habitat data. Six Habitat-Lite terms were not seen at all in this smaller data set (“air,” “freshwater,” “extreme,” “microbial mat,” “fossil,” “terrestrial”). GOLD
Cases GOLD Comparison of automated mapping and expert mapping. The need for annotation guidelines, to handle situations where a term might be placed in several categories.
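One way such a comparison could be set up is sketched below, assuming each mapping is a dictionary from a habitat string to a set of categories. The function name `compare_mappings` and all example data are hypothetical and are not taken from the GOLD experiment.

```python
def compare_mappings(automated, expert):
    """Compare two term -> set-of-categories mappings on their shared terms.

    Returns the agreement rate and a dict of the terms that disagree.
    """
    shared = sorted(automated.keys() & expert.keys())
    disagreements = {
        term: (automated[term], expert[term])
        for term in shared
        if automated[term] != expert[term]
    }
    rate = (len(shared) - len(disagreements)) / len(shared) if shared else 0.0
    return rate, disagreements


# Made-up example: "hot spring" could plausibly fall into several categories,
# which is the kind of case annotation guidelines would need to settle.
automated = {
    "hot spring": {"aquatic", "extreme"},
    "rice paddy soil": {"soil"},
    "human gut": {"host"},
}
expert = {
    "hot spring": {"extreme"},
    "rice paddy soil": {"soil"},
    "human gut": {"host"},
}

rate, disagreements = compare_mappings(automated, expert)
multi_category = [term for term in sorted(automated) if len(automated[term]) > 1]
print(rate, disagreements, multi_category)
```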
References
Hirschman, L., Clark, C., Cohen, K. B., Mardis, S., Luciano, J., Kottmann, R.,... & Field, D. (2008). Habitat-Lite: a GSC case study based on free text terms for environmental metadata. OMICS: A Journal of Integrative Biology, 12(2),
Buttigieg, P. L., Morrison, N., Smith, B., Mungall, C. J., Lewis, S. E., & ENVO Consortium. (2013). The environment ontology: contextualising biological and biomedical entities. Journal of Biomedical Semantics, 4, 43.
Pafilis, E., Frankild, S. P., Schnetzer, J., Fanini, L., Faulwetter, S., Pavloudi, C.,... & Jensen, L. J. (2015). ENVIRONMENTS and EOL: identification of Environment Ontology terms in text and the annotation of the Encyclopedia of Life. Bioinformatics, 31(11),
Thank you!