Data, Metadata, and Ontology in Ecology Matthew B. Jones National Center for Ecological Analysis and Synthesis (NCEAS) University of California Santa Barbara.

Presentation transcript:

Data, Metadata, and Ontology in Ecology
Matthew B. Jones
National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara
and many major collaborators: Mark Schildhauer, Josh Madin, Jing Tao, Chad Berkley, Dan Higgins, Peter McCartney, Chris Jones, Shawn Bowers, Bertram Ludaescher, and others
April 24, 2007

Scaling-up Synthesis
More than 400 projects at NCEAS
– have produced over 1000 publications that synthesize and re-use existing data
– massive investment in compiling, integrating, and analyzing data
Building a custom database for each project is not logistically feasible
Instead, need loosely-coupled systems that accommodate heterogeneity

Dilemma: no unified model
No single database suffices
– Data warehouses use federated schemas
   any data that does not fit is not captured
   original data transformed to fit federation
– this is a form of data integration for one purpose
– Numerous data warehouses exist
   not extensible for all data
   VegBank, ClimDB, GenBank, PDB, etc.

Data Collections
Metadata-based data collections
– Loosely-coupled metadata and data collections
– No constraints on data schemas
– Data discovery based on metadata
– Dynamic data loading and query based on metadata descriptions

What is EML?
Ecological Metadata Language: A … modular, extensible, comprehensive
Modules shown in the diagram:
Identity and Discovery Information
Coverage: Space, Time, Taxa
Methods
Logical Data Model
Physical Data Format
Access and Distribution

EML: Selected relationships
[Timeline figure, 1991 to 2009. Items shown: FGDC created, CSDGM 1.0, Michener ’97 paper, ESA FLED Report, NBII BDP, ISO, Dublin Core, XML 1.0, several EML releases including EML 1.4.x and EML 2.0.1, OBOE]

A simple EML example
eml (packageId: sbclter, system: knb)
   dataset
      title: Kelp Forest Community Dynamics: Benthic Fish
      creator
         individualName
            surName: Reed
      contact
         individualName
            surName: Evans
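The tree on this slide can be assembled programmatically. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. It reproduces only the elements named on the slide, omits XML namespaces, and is not claimed to validate against the EML schema. The helper name build_eml is hypothetical, and pairing Reed with creator and Evans with contact is an assumption read from the slide layout.

```python
import xml.etree.ElementTree as ET


def build_eml(package_id, system, title, creator_surname, contact_surname):
    """Build a small EML-like tree from the fields shown on the slide.

    Hypothetical helper: namespaces and most required EML content are omitted.
    """
    root = ET.Element("eml", {"packageId": package_id, "system": system})
    dataset = ET.SubElement(root, "dataset")
    ET.SubElement(dataset, "title").text = title
    # creator and contact share the same nested name structure
    for role, surname in (("creator", creator_surname), ("contact", contact_surname)):
        party = ET.SubElement(dataset, role)
        name = ET.SubElement(party, "individualName")
        ET.SubElement(name, "surName").text = surname
    return root


# Values taken from the slide; the creator/contact assignment is an assumption.
doc = build_eml(
    "sbclter",
    "knb",
    "Kelp Forest Community Dynamics: Benthic Fish",
    "Reed",
    "Evans",
)
print(ET.tostring(doc, encoding="unicode"))
```

Because the metadata is a plain tree, discovery tools can address fields with simple paths such as dataset/title.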

Data Discovery
Geographic, Temporal, and Taxonomic coverage

Logical Model: Attribute structure
Describes data tables and their variables/attributes
A typical data table with 10 attributes
– some metadata are likely apparent, others ambiguous
– missing value code is present
– definitions need to be explicit, as well as data typing
Columns: YEAR, MONTH, DATE, SITE, TRANSECT, SECTION, SP_CODE, SIZE, OBS_CODE, NOTES
Sample rows (text values only): ABUR CLIN; ABUR OPIC; ABUR OPIC; ABUR OPIC; ABUR OPIC; ABUR OPIC; ABUR COTT; ABUR CLIN; ABUR NF; AHND NF
Callouts: Species Codes, Value bounds, Date Format, Code definitions
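To illustrate why explicit definitions, missing value codes, bounds, and code lists matter, here is a minimal sketch of attribute-level metadata driving a value check. It is not EML's actual structure or API. The class AttributeSpec, the function check_value, the missing value code "-99999", the SIZE bounds, and the code definitions are all hypothetical; only the column names and species codes come from the slide.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AttributeSpec:
    """Hypothetical attribute description, loosely modeled on the slide's callouts."""
    name: str
    definition: str
    missing_code: Optional[str] = None
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    codes: dict = field(default_factory=dict)  # code -> definition


def check_value(spec, raw):
    """Classify a raw string as ('missing' | 'ok' | 'invalid', parsed value)."""
    if spec.missing_code is not None and raw == spec.missing_code:
        return ("missing", None)
    if spec.codes:
        # Enumerated attribute: the value must be a defined code
        return ("ok", raw) if raw in spec.codes else ("invalid", raw)
    try:
        value = float(raw)
    except ValueError:
        return ("invalid", raw)
    if spec.minimum is not None and value < spec.minimum:
        return ("invalid", value)
    if spec.maximum is not None and value > spec.maximum:
        return ("invalid", value)
    return ("ok", value)


# Assumed metadata: bounds, missing code, and definitions are placeholders.
size = AttributeSpec("SIZE", "body size (unit assumed)", missing_code="-99999",
                     minimum=0.0, maximum=200.0)
sp_code = AttributeSpec("SP_CODE", "species code",
                        codes={"CLIN": "definition not given in source",
                               "OPIC": "definition not given in source",
                               "COTT": "definition not given in source",
                               "NF": "definition not given in source"})

print(check_value(size, "-99999"))
print(check_value(sp_code, "OPIC"))
```

Without the metadata, a reader or program could easily treat the missing value code as a real measurement.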

EML Measurement Scale
Group     Scale      Description                                       Example
Textual   Nominal    Categories                                        Male, Female
Textual   Ordinal    Ordered categories                                Low, Medium, High
Numeric   Interval   Equidistant on number scale                       3 Celsius
Numeric   Ratio      Equidistant on number scale, meaningful ratio     5 meter
Dates     Datetime   Points on calendar timescale                      6-Oct-2004
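One practical use of the measurement scale is deciding which operations are meaningful for a variable. The sketch below encodes the five scales from the slide. The operation sets follow the standard levels-of-measurement reasoning and are an assumption, not something stated on the slide, as are the names Scale, ALLOWED, and supports.

```python
from enum import Enum


class Scale(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"
    DATETIME = "datetime"


# Assumed mapping from scale to meaningful operations (hypothetical labels).
ALLOWED = {
    Scale.NOMINAL: {"equality"},
    Scale.ORDINAL: {"equality", "order"},
    Scale.INTERVAL: {"equality", "order", "difference"},
    Scale.RATIO: {"equality", "order", "difference", "ratio"},
    Scale.DATETIME: {"equality", "order", "difference"},
}


def supports(scale, operation):
    """Return True if the operation is meaningful for values on this scale."""
    return operation in ALLOWED[scale]


# "5 meter" is ratio-scaled, so ratios are meaningful;
# "3 Celsius" is interval-scaled, so ratios are not.
print(supports(Scale.RATIO, "ratio"))
print(supports(Scale.INTERVAL, "ratio"))
```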

Logical Model: unit Dictionary
Consistent assignment of measurement units
– Quantitative definitions in terms of SI units
– ‘unitType’ expresses dimensionality
   time, length, mass, energy are all ‘unitType’s
   second, meter, gram, pound, joule are all ‘unit’s
Diagram: UnitType Mass; Units kilogram and gram, related by a factor of 1000 (x1000)
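The unit dictionary idea can be sketched as a lookup table in which each unit records its unitType and a multiplier to the SI unit of that type. This is a minimal illustration, not the EML unit dictionary itself; the names Unit, UNITS, and convert are hypothetical, and only a few of the units named on the slide are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str
    unit_type: str   # dimensionality, e.g. "mass"
    to_si: float     # multiplier converting this unit to the SI unit of its type


# Small hypothetical dictionary; kilogram is the SI unit of mass.
UNITS = {
    "kilogram": Unit("kilogram", "mass", 1.0),
    "gram": Unit("gram", "mass", 0.001),
    "meter": Unit("meter", "length", 1.0),
    "second": Unit("second", "time", 1.0),
}


def convert(value, source, target):
    """Convert a value between two units of the same unitType."""
    src, dst = UNITS[source], UNITS[target]
    if src.unit_type != dst.unit_type:
        raise ValueError(f"cannot convert {src.unit_type} to {dst.unit_type}")
    return value * src.to_si / dst.to_si


print(convert(2500, "gram", "kilogram"))
```

Checking unitType before converting is what lets software refuse a nonsensical integration, such as combining a mass column with a length column.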

Collating metadata
Most scientists know all of this information about their data
– EML simply provides a standardized format for recording the information
Enables data exchange across organizations and software systems

Knowledge Network for Biocomplexity (KNB)
Building a community data network
Simplified data sharing
Immediate change tracking
Redundant backup
Data maintained by individuals
Access controlled by individuals
Network nodes shown: KNB 1, KNB II, PISCO, GCE, LTER, NCEAS, ESA, OBFS, AND... (26)

EML-described data in the KNB
[Figure: Data Packages in the KNB; axes: Year, Cumulative count]

Kepler: dynamic data loading
Kepler supports dynamic data loading:
Data sources are discovered via metadata queries
EML metadata allows arbitrary schemas to be loaded into an embedded database
Data queries can be performed before data flows downstream
Workflow shown: Data source from EcoGrid (metadata-driven ingestion), followed by an R processing script:
res <- lm(BARO ~ T_AIR)
res
plot(T_AIR, BARO)
abline(res)
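The metadata-driven loading step described on this slide can be sketched outside Kepler. The example below is not Kepler's implementation: it uses Python's standard sqlite3 module as a stand-in for the embedded database, a hypothetical metadata dictionary in place of EML, and made-up sample values. Only the column names T_AIR and BARO come from the slide.

```python
import csv
import io
import sqlite3

# Hypothetical, simplified stand-in for an EML attribute list.
metadata = {
    "table": "met_data",
    "attributes": [
        {"name": "T_AIR", "type": "REAL"},
        {"name": "BARO", "type": "REAL"},
    ],
}

# Made-up sample data for illustration only.
raw = "T_AIR,BARO\n12.1,1013.2\n15.4,1011.8\n9.8,1015.0\n"


def load_table(metadata, text):
    """Create a table from the metadata description and load the data into it."""
    conn = sqlite3.connect(":memory:")
    attrs = metadata["attributes"]
    table = metadata["table"]
    # Identifiers come from trusted metadata in this sketch.
    col_defs = ", ".join(f'"{a["name"]}" {a["type"]}' for a in attrs)
    conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append(tuple(
            float(record[a["name"]]) if a["type"] == "REAL" else record[a["name"]]
            for a in attrs
        ))
    placeholders = ", ".join("?" for _ in attrs)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    return conn


conn = load_table(metadata, raw)
# Query before passing data downstream, e.g. keep only the warmer records.
warm = conn.execute(
    "SELECT T_AIR, BARO FROM met_data WHERE T_AIR > 10 ORDER BY T_AIR"
).fetchall()
print(warm)
```

The table schema is never hard-coded: it is derived entirely from the metadata, which is what allows arbitrary schemas to be loaded.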

Importance of semantics
So far we’ve dealt only with the logical data model
– any semantics in EML are expressed in natural language
The computer doesn’t really understand:
– what is being measured
– how measurements relate to one another
– how semantics map to logical structure
Analysis depends on understanding the semantic contextual relationships among data measurements
– e.g., density measured within subplot

Observation ontology (OBOE)
Goal: semantically describe the structure of scientific observation and measurement as found in a data set
Entities represent real-world objects or concepts that can be measured.
Observations are made about particular entities.
Every measurement has a characteristic, which defines the property of the entity being measured.
Observations can provide context for other observations.
Provide extension points for loading specialized domain ontologies
slide from J. Madin

Semantic annotation
Mapping between data and the ontology via semantic annotation (diagram: Data set, Observation Ontology)
Relational data lacks critical semantic information
no way for computer to determine that “Ht.” represents a “height” measurement
no way for computer to determine if Plot is nested within Site or vice-versa
no way for computer to determine if the Temp applies to Site or Plot or Species
slide from J. Madin

Date    Site        Plot   Species   Height
10/12   Hendricks   1      AHYA
/12     Hendricks   1      AHYA
/12     Hendricks   1      AHYA      9.7
…       …           …      …         …
Characteristic: Date, LocationName, Label, TaxonomicName, Height
Entity: Time, Space, Area, Organism
Relationship: hasContext
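A semantic annotation of the table above can be sketched as a list of column-to-ontology mappings. This is an illustration only, not the OBOE or Semtools annotation format. The column-to-entity and column-to-characteristic pairings are an assumption inferred from the slide's labels, and the names Annotation, ANNOTATIONS, columns_measuring, and columns_about are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    column: str          # column name in the data table
    entity: str          # what kind of thing was observed
    characteristic: str  # which property of that thing was measured


# Assumed pairings, inferred from the slide's Entity and Characteristic labels.
ANNOTATIONS = [
    Annotation("Date", "Time", "Date"),
    Annotation("Site", "Space", "LocationName"),
    Annotation("Plot", "Area", "Label"),
    Annotation("Species", "Organism", "TaxonomicName"),
    Annotation("Height", "Organism", "Height"),
]


def columns_measuring(characteristic):
    """Find columns by what they measure, regardless of how they are named."""
    return [a.column for a in ANNOTATIONS if a.characteristic == characteristic]


def columns_about(entity):
    """Find all columns that describe the same kind of entity."""
    return [a.column for a in ANNOTATIONS if a.entity == entity]


print(columns_measuring("Height"))
print(columns_about("Organism"))
```

With such a mapping, a column labelled "Ht." in another data set could still be found by a search for height measurements, because the search targets the characteristic rather than the column name.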

Tree   Plot   Species   Count
A      1      AHYA      3
A      2      AHYA      2
A      3      AHYA      8
…      …      …         …
Characteristic: Label, Replicate, TaxonomicName, Abundance
Entity: Organism, Space, Area
Relationship: hasContext
Diagram labels: A, B, C

Observation ontology
Extension points
slide from J. Madin

Observation
A high-level assertion that a thing was observed

Entity
All things (concrete and conceptual) that are observable

Entity extension
An extension point for domain-specific terms

Context
Asserts a “containment” relationship between entities

Context
Context is transitive
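Transitivity means that if an organism is observed within a plot and the plot lies within a site, the site also provides context for the organism. A minimal sketch of that inference follows. It assumes, for simplicity, that each entity has at most one direct context; the function name all_contexts and the example entities are hypothetical.

```python
def all_contexts(entity, direct_context):
    """Return every entity that provides context for `entity`, nearest first.

    `direct_context` maps an entity to the entity that directly contains it.
    The guards stop the walk if the mapping contains a cycle.
    """
    found = []
    current = direct_context.get(entity)
    while current is not None and current != entity and current not in found:
        found.append(current)
        current = direct_context.get(current)
    return found


# Hypothetical containment chain: Organism within Plot within Site.
direct_context = {"Organism": "Plot", "Plot": "Site"}

print(all_contexts("Organism", direct_context))
```

Only the direct links need to be recorded in an annotation; the indirect ones can be derived when a query needs them.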

Measurement
Observations are composed of measurements, which relate measurable characteristics to the entity being observed

Characteristic

Summary
EML captures critical metadata
OBOE adds critical semantic descriptions
Data discovery and integration tools can be built that leverage metadata and ontologies
Metadata and ontologies permit:
– Loosely-coupled systems
– Schema independence in data systems
– Semantic data integration
– Capturing data as collected, rather than only derived products

Vegetation Schema Questions
Vegetation schema
– Exchange standard or federation?
Can we accommodate all data that is collected in vegetation plots?
– or just a transformed subset
XML? RDF? OWL? other?
Should a vegetation schema link to other evolving community standards?
– EML?
– OBOE?

Questions?

Acknowledgements
Knowledge Representation Working Group
Mark Schildhauer, Matt Jones (NCEAS)
Shawn Bowers, Bertram Ludaescher, Dave Thau (UCD)
Deana Pennington (UNM)
Serguei Krivov, Ferdinando Villa (UVM)
Corinna Gries, Peter McCartney (ASU)
Rich Williams (Microsoft)

Acknowledgments
This material is based upon work supported by:
The National Science Foundation (grant numbers missing from the transcript)
Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis
The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (grant number missing from the transcript), the University of California, and the UC Santa Barbara campus.
The Andrew W. Mellon Foundation.
Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON, RoadNet, EOL, Resurgence