Directions in observational data organization: from schemas to ontologies Matthew B. Jones 1 Chad Berkley 1 Shawn Bowers 2 Joshua Madin 3 Mark Schildhauer.

Presentation transcript:

Directions in observational data organization: from schemas to ontologies
Matthew B. Jones (1), Chad Berkley (1), Shawn Bowers (2), Joshua Madin (3), Mark Schildhauer (1)
(1) National Center for Ecological Analysis and Synthesis (NCEAS), University of California, Santa Barbara
(2) University of California, Davis
(3) Macquarie University

Ecological studies
Ecological studies focus on
– Distribution and abundance of organisms
– Organism interactions
– Population and community processes
– Ecosystem processes
– Mechanistic understanding of ecosystems
Diverse data sources, e.g.,
– Biodiversity monitoring
– Experimental manipulations
– Environmental monitoring

Synthesis over ecological process
Gruner et al.
– Ecology Letters (2008) 11: 740–755
Meta-analysis of 191 factorial manipulations of nutrients and herbivores
Experimenters manipulated
– nutrient addition
– herbivore removal
Effect on producer biomass

Synthesis over space Costanza et al. Nature 1997

Synthesis over time Jackson et al., Science 2001

How did they do it?
As a scientist, could you:
– Locate the precise data used?
– Locate the analytical processes used? Reconstruct them?
Today, only a slim chance...
– Why?

Insufficient sharing
Researchers don’t publish their data
Researchers don’t publish their analytical code
In general, we have no way to verify or reproduce the conclusions in papers

Preserving data for synthesis
Synthesis requires access to global ecological data
Single-schema databases do not suffice
Loosely-coupled metadata and data collections
– No constraints on data schemas
Knowledge Network for Biocomplexity (KNB)
National Biological Information Infrastructure (NBII)

Ecological Metadata Language
Grass roots metadata
Describe what data you have... rather than prescribe what to produce.
22 independent modules: open, modular, extensible
[Diagram of module groups: Identity and Discovery Information; Coverage: Space, Time, Taxa; Methods; Logical Data Model; Physical Data Format; Access and Distribution]

EML: Selected relationships
[Timeline figure, ’91 to ’09. Labels: FGDC created; CSDGM 1.0; Michener ’97 paper; ESA FLED Report; NBII BDP; ISO; Dublin Core; XML 1.0; successive EML releases, including EML 1.4.x and a tentative EML 2.1.0; OBOE]

Logical Model: Attribute structure
Describes data tables and their attributes
A typical data table with 10 attributes
– some metadata are likely apparent, others ambiguous
– missing value code is present
– definitions need to be explicit, as well as data typing
[Example data table. Columns: YEAR, MONTH, DATE, SITE, TRANSECT, SECTION, SP_CODE, SIZE, OBS_CODE, NOTES. Surviving cell values: ABUR, AHND, CLIN, OPIC, COTT, NF. Callouts: Species Codes, Value bounds, Date Format, Code definitions]

Logical Model: unit Dictionary
Consistent assignment of measurement units
– Quantitative definitions in terms of SI units
– ‘unitType’ expresses dimensionality
time, length, mass, energy are all ‘unitType’s
second, meter, gram, pound, joule are all ‘unit’s
[Diagram: UnitType Mass; Units kilogram and gram, related by a factor of 1000]
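The unit dictionary idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not EML's actual schema: the `Unit` class, the `UNITS` table, and the `convert` function are hypothetical names, and each unit simply carries its unitType and a multiplier to the SI unit of that type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """Hypothetical unit dictionary entry (not the EML schema itself)."""
    name: str
    unit_type: str           # dimensionality, e.g. "mass" or "length"
    multiplier_to_si: float  # factor that converts a value to the SI unit


# A tiny illustrative dictionary; a real one would be far larger.
UNITS = {
    "kilogram": Unit("kilogram", "mass", 1.0),
    "gram": Unit("gram", "mass", 0.001),
    "pound": Unit("pound", "mass", 0.45359237),
    "meter": Unit("meter", "length", 1.0),
    "second": Unit("second", "time", 1.0),
}


def convert(value: float, from_unit: str, to_unit: str) -> float:
    """Convert between two units that share the same unitType."""
    src, dst = UNITS[from_unit], UNITS[to_unit]
    if src.unit_type != dst.unit_type:
        raise ValueError(
            f"cannot convert {src.unit_type} ({from_unit}) "
            f"to {dst.unit_type} ({to_unit})"
        )
    # Go through the SI unit: source -> SI -> destination.
    return value * src.multiplier_to_si / dst.multiplier_to_si
```

Because every unit is defined against the SI unit of its unitType, a conversion such as `convert(2.0, "kilogram", "gram")` needs no pairwise table, and a request to convert grams to meters is rejected on dimensionality alone.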

An EML Record at NCEAS

Knowledge Network for Biocomplexity (KNB)
Building a data preservation network
Preserve primary data
Rich metadata descriptions
Redundant backup via replication
Access controlled by contributors
[Network diagram nodes: KNB 1, KNB II, PISCO, GCE, LTER, NCEAS, ESA, OBFS, AND... (26)]

Knowledge Network for Biocomplexity (KNB)
[Network diagram nodes: KNB 1, KNB II, PISCO, GCE, LTER, NCEAS, ESA, OBFS, AND... (26)]
South African Data Network
[Network diagram nodes: Mozambique, Mapungubwe, Marakele, Kruger, SAEON, Grahamstown, Cape Town, San Parks, Wilderness, Cape Town U, Addo, Karoo, Tsitsikama, Phalabora, Savannah Cluster, Marine Cluster]

South African National Parks Metacat

Metacat deployments

International LTER
Recommendation for producing EML across all ILTER sites
Recommendation for producing continental and regional metadata caches
– one or more in each ILTER region
– initial nodes may use Metacat

Dynamic Data Retrieval
[Architecture diagram. Components: User Client; Metadata Catalog; Data Manager containing a Metadata Parser, a Data Loader, and Data Storage; DataStore; DB. Labels: Metadata, Data, Query, Results, CREATE TABLE..., SELECT * FROM..., and a result table with columns att1 | attr2 | attr3 ...]
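The metadata-driven loading that the diagram labels suggest (parse metadata, issue CREATE TABLE, load data, then SELECT) might look roughly like the sketch below. It is not the Data Manager's actual code: the `metadata` dictionary is a hypothetical, heavily simplified stand-in for the attribute list of an EML record, the sample rows are invented, and SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical, simplified stand-in for parsed EML attribute metadata.
metadata = {
    "table": "fish_counts",
    "attributes": [
        {"name": "YEAR", "type": "INTEGER"},
        {"name": "SITE", "type": "TEXT"},
        {"name": "SP_CODE", "type": "TEXT"},
        {"name": "SIZE", "type": "REAL"},
    ],
}

# Invented sample rows for illustration only.
rows = [
    (2001, "ABUR", "CLIN", 12.5),
    (2001, "ABUR", "OPIC", 8.0),
    (2002, "AHND", "NF", None),
]


def create_table_sql(meta):
    """Build a CREATE TABLE statement from the attribute descriptions."""
    cols = ", ".join(f'"{a["name"]}" {a["type"]}' for a in meta["attributes"])
    return f'CREATE TABLE "{meta["table"]}" ({cols})'


def load(conn, meta, data_rows):
    """Create the table described by the metadata and insert the rows."""
    conn.execute(create_table_sql(meta))
    placeholders = ", ".join("?" for _ in meta["attributes"])
    conn.executemany(
        f'INSERT INTO "{meta["table"]}" VALUES ({placeholders})', data_rows
    )


conn = sqlite3.connect(":memory:")
load(conn, metadata, rows)

# Once loaded, the data can be queried with ordinary SQL.
result = conn.execute(
    'SELECT "SP_CODE", "SIZE" FROM "fish_counts" WHERE "SITE" = ?', ("ABUR",)
).fetchall()
```

The point of the design is that no table schema is fixed in advance: the logical model in the metadata is what determines the table that gets created.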

Join Query
[Diagram: Client; Query Request; Results Response]

Importance of semantics
So far we’ve dealt only with the logical data model
– any semantics in EML are expressed in natural language
The computer doesn’t really understand:
– what is being measured
– how measurements relate to one another
– how semantics map to logical structure
Analysis depends on understanding the semantic contextual relationships among data measurements
– e.g., density measured within subplot

Semantic annotation
[Diagram: Observation Ontology; Data set; mapping between data and the ontology via semantic annotation]
Relational data lacks critical semantic information
– no way for computer to determine that “Ht.” represents a “height” measurement
– no way for computer to determine if Plot is nested within Site or vice-versa
– no way for computer to determine if the Temp applies to Site or Plot or Species
slide from J. Madin
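A column-level annotation can be pictured as a small lookup structure kept alongside the data. The sketch below is illustrative only and does not reproduce any real annotation syntax: the `annotations` and `nested_within` dictionaries are hypothetical, and the choices that Temp applies to Plot and that Plot is nested within Site are assumptions made for the example.

```python
# Hypothetical annotations: each column name maps to the entity it
# describes and the characteristic it measures.
annotations = {
    "Ht.": {"entity": "Tree", "characteristic": "Height"},
    "Temp": {"entity": "Plot", "characteristic": "Temperature"},
    "Plot": {"entity": "Plot", "characteristic": "Name"},
    "Site": {"entity": "Site", "characteristic": "Name"},
}

# Assumed nesting for this example: Tree within Plot, Plot within Site.
nested_within = {"Tree": "Plot", "Plot": "Site"}


def columns_for(characteristic):
    """Return the columns annotated as measuring the characteristic."""
    return [
        column
        for column, note in annotations.items()
        if note["characteristic"] == characteristic
    ]


def entity_of(column):
    """Return the entity that a column's values apply to."""
    return annotations[column]["entity"]


def is_nested_within(inner, outer):
    """Follow the nesting chain upward from inner, looking for outer."""
    current = nested_within.get(inner)
    while current is not None:
        if current == outer:
            return True
        current = nested_within.get(current)
    return False
```

With such annotations in place, the three questions on the slide become simple lookups: `columns_for("Height")` finds the “Ht.” column, `entity_of("Temp")` says what the temperature applies to, and `is_nested_within("Plot", "Site")` settles the nesting direction.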

Scientific Observations An Observation is the Measurement of the Value of a Characteristic of some Entity in a particular Context

Observation ontology (OBOE)
Goal: semantically describe the structure of scientific observation and measurement as found in a data set
Entities represent real-world objects or concepts that can be measured.
Observations are made about particular entities.
Every measurement has a characteristic, which defines the property of the entity being measured.
Observations can provide context for other observations.
Provide extension points for loading specialized domain ontologies
slide from J. Madin
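The relationships among observations, entities, measurements, characteristics, and context can be mirrored in a small object model. This is a rough sketch for intuition, not the OBOE ontology itself (which is not expressed as Python classes): the class names, fields, and sample values are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Measurement:
    """The value of one characteristic (hypothetical model)."""
    characteristic: str
    value: object
    unit: Optional[str] = None


@dataclass
class Observation:
    """An observation of one entity, optionally made within a context."""
    entity: str
    measurements: List[Measurement] = field(default_factory=list)
    context: Optional["Observation"] = None

    def context_path(self):
        """List the entities of the enclosing observations, innermost first."""
        path = []
        obs = self.context
        while obs is not None:
            path.append(obs.entity)
            obs = obs.context
        return path


# Invented sample values: a tree observed in a plot observed at a site.
site = Observation("Site", [Measurement("Name", "ABUR")])
plot = Observation(
    "Plot", [Measurement("Temperature", 14.2, "celsius")], context=site
)
tree = Observation("Tree", [Measurement("Height", 3.1, "meter")], context=plot)
```

Here `tree.context_path()` walks from the tree observation out through the plot to the site, which is the kind of contextual relationship that a flat table leaves implicit.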

Datasets vs. Observations
EML describes “data sets”
– collections of related observations with relatively unspecified semantics
– mostly natural language descriptions
OBOE describes “scientific observations”
– semantically-precise descriptions of scientific measurements
– allows understanding of relationships among measurements and context of an observation

Model correspondences

TDWG Observations Task Group
An Observation is the Measurement of the Value of a Characteristic of some Entity in a particular Context
Create: Community-sanctioned, extensible, and unified ontology model for observational data
– Compatible with existing standards
– Integrate with metadata standards such as EML, CSDGM, etc.
– Reduce the “babel” of scientific dialects

Questions?

Acknowledgments
This material is based upon work supported by:
The National Science Foundation under Grant Numbers [...]
Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis
The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number [...]), the University of California, and the UC Santa Barbara campus.
The Andrew W. Mellon Foundation.
Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON, RoadNet, EOL, Resurgence