1
Bringing Organism Observations Into Bioinformatics Networks
Steve Kelling, Cornell Lab of Ornithology. As the types of data included in biodiversity clearinghouses expand beyond the traditional realm of natural history collections, opportunities and challenges arise. More data provide a greater opportunity for synthetic analysis across broad spatial and temporal scales, but because these data are collected in different ways, more care is required in how they are repurposed.
2
Data to Knowledge
"Digital data are not only the output of research, but the foundation for new scientific insights" (NSF 2007). Observations are not only the output of research, but the foundation for new scientific insights. There has been much discussion about the need for synthesis of biodiversity data: 1. Observations of nature are the foundation of ecological studies. 2. Organizing these data requires methods for data sharing and interoperability. A caveat on data synthesis: most traditional approaches to understanding species occurrence involve tens if not hundreds of potentially important predictors, with species data either gathered during a specific study or reduced to a level at which most of the important information that was collected is removed. Organizing observational data from a variety of projects, and enabling the analysis of the primary occurrence data drawn from them, poses several challenges.
3
Data about the occurrence of an organism
Primary Biodiversity Data: what constitutes an observation of a species' occurrence? What, Where, When, How, and By Whom.
4
Rhipidura leucophrys Willie Wagtail
5
Natural History Collections
Broad-scale Surveys Directed Surveys Natural history collections are zoological, botanical, and paleontological specimens in museums, living collections in botanical or zoological gardens, or microbial strain and tissue collections. They are the foundation for taxonomic research and for the historical record of species occurrence. While most use of specimen collections has been for taxon-oriented research, they have also been used for predictive modeling of species occurrence. Broad-scale surveys generate probabilistic estimates of species occurrence. They do not provide direct evidence, but they allow inferences about the causes of species occurrence. Broad-scale surveys gather tens of millions of observations annually and provide the bulk of the non-specimen observational data available. Directed surveys are used when a priori knowledge of a given system or biological mechanism already exists. The design attempts to control for known sources of variation while sampling one or a few well-defined variables. As such, directed surveys are the form of observational data collection that most closely resembles experimental studies.
16
Organize mountains of observations into standardized structures for the access, analysis, and visualization of biodiversity data. The mission of TDWG is to develop the structures, standards, and processes that allow biodiversity data to be ingested, organized, and accessed. These processes are beginning to structure primary occurrence data at large scales.
17
Not only do we need information on the occurrence of an organism; we also need to better understand how those occurrences were gathered. We need more than just mountains of data. The goal of the Biodiversity Information Standards (TDWG) organization is to take an expansive, not parochial, view of what is needed for biodiversity informatics and act upon it. No more statements like "these data are crap" or "there are not enough data."
18
Data Gathering Information
Project Code Sampling Event Identifier Protocol Identifier Data Gathering Information must be included in any biodiversity data management architecture Project Code: Allows linking of species’ occurrence records to a “project” Sampling Event Identifier: Allows single observations to be grouped. The identifier must be unique within each project. A sampling event is typically defined as a series of observations made during a determined amount of time at a given location (i.e., a checklist of birds or other organisms, marine mammals counted along a transect). Protocol Identifier: Allows the identification of the methods used to collect the species’ occurrence data, using domain specific standards.
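The three data-gathering fields above can be carried alongside the What/Where/When/By Whom core of an occurrence record. A minimal sketch in Python follows; the field and function names are illustrative, not drawn from any particular standard, and the grouping helper simply rebuilds sampling events (e.g. checklists) from a flat list of records:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OccurrenceRecord:
    species: str            # what
    latitude: float         # where
    longitude: float
    observed_on: str        # when (ISO 8601 date)
    observer: str           # by whom
    project_code: str       # links the record to its originating project
    sampling_event_id: str  # groups observations made together; unique within a project
    protocol_id: str        # identifies the data-collection method used

def group_by_event(records):
    """Group records by (project_code, sampling_event_id) to rebuild
    the sampling events (e.g. checklists) they came from."""
    events = {}
    for r in records:
        events.setdefault((r.project_code, r.sampling_event_id), []).append(r)
    return events

# Two observations from the same hypothetical sampling event.
records = [
    OccurrenceRecord("Rhipidura leucophrys", -33.87, 151.21,
                     "2007-06-01", "observer-1", "ABC", "evt-001", "area-search"),
    OccurrenceRecord("Gymnorhina tibicen", -33.87, 151.21,
                     "2007-06-01", "observer-1", "ABC", "evt-001", "area-search"),
]
events = group_by_event(records)
# Both records fall under the single key ("ABC", "evt-001").
```

Because the sampling-event identifier is only unique within a project, the grouping key must pair it with the project code, as shown.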
19
High level processing workflow for integrative data intensive biodiversity research. Physical events and objects are gathered through sensor, observer, and survey networks. These data are stored in heterogeneous repositories. Informatics processes allow heterogeneous data to be synthesized for processing. Exploratory analyses (analyses useful for generating hypotheses) can drive confirmatory summative analyses. A variety of visualization tools allow these data to be viewed by a broad public.
21
New exploratory data analysis tools emerging from machine learning, data mining, and statistics can automatically identify patterns in large, complex biodiversity data sources. For example, bagged decision trees have been used to accurately identify patterns of winter bird distributions across North America. These techniques share an ability to adapt automatically to patterns in the data, making them especially well suited for exploratory analysis.
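As a sketch of what such an exploratory analysis looks like in code, here is a minimal bagged-decision-tree example using scikit-learn (assumed available; the "sites", predictors, and presence/absence labels are synthetic stand-ins, not real survey data):

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier

# Synthetic stand-in data: 500 "sites", each with 4 environmental
# predictors, and a presence/absence label derived from a hidden rule.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)

# BaggingClassifier fits many decision trees (its default base learner)
# on bootstrap resamples of the data and averages their votes, letting
# the ensemble adapt automatically to patterns in the data.
model = BaggingClassifier(n_estimators=50, random_state=0)
model.fit(X, y)
accuracy = model.score(X, y)  # training accuracy on the synthetic data
```

In a real distribution study the predictors would be environmental covariates (habitat, climate, elevation) and the labels would come from survey observations; the ensemble's averaged votes can then be mapped across space to visualize predicted occurrence.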
22
Data In and Garbage Out
Ecoinformatics initiatives must ensure that the data being organized do not lose much of the information that was gathered. This includes not only data on the organism, but also information on how the organism data were collected. Without this information, the data lose their significance.