
Leveraging semantic metadata for ecological data discovery and integration for analysis and modeling
Matthew B. Jones
Mark P. Schildhauer
with contributions from Shawn Bowers, Chad Berkley, Dan Higgins, Rich Williams, Deana Pennington, Jing Tao, and others
National Center for Ecological Analysis and Synthesis
University of California, Santa Barbara

SWDB Aug 29, 2004 August, 2005

Topical Outline
Case study: Predicting species distributions under climate change with ecological niche modeling
Challenges presented by Ecological Niche Modeling
Scientific Workflows and the Kepler system
Improving scientific workflows with semantics
Future work

Ecological Niche Modeling
Correlate current bioclimatic and topography data with observed species distribution to develop prediction algorithm
Project predicted distribution of species using the prediction algorithm against 5 IPCC climate change scenarios
Can use the Genetic Algorithm for Ruleset Prediction (GARP) and other algorithms
[Figure: oyamel fir prediction (Oberhauser and Peterson 2005)]
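
The two modeling steps on this slide (fit a predictor on current conditions, then project it onto a climate scenario) can be sketched with a deliberately simple stand-in: a rectangular bioclimatic envelope. This is not GARP. The variable names, the sample values, and the +2 degree warming scenario are all invented for illustration.

```python
# Minimal bioclimatic-envelope sketch (NOT GARP): fit the range of each
# environmental variable at cells where the species was observed, then
# predict presence wherever all variables fall inside that range.
# All data values below are made up for illustration.

def fit_envelope(observations):
    """Return {variable: (min, max)} over the observed presence cells."""
    variables = observations[0].keys()
    return {
        v: (min(o[v] for o in observations), max(o[v] for o in observations))
        for v in variables
    }

def predict(envelope, cell):
    """True if every variable of the cell lies within the fitted envelope."""
    return all(lo <= cell[v] <= hi for v, (lo, hi) in envelope.items())

# Environmental conditions at cells where the species was observed.
presence = [
    {"temp_c": 10.0, "precip_mm": 900.0},
    {"temp_c": 12.5, "precip_mm": 1100.0},
    {"temp_c": 11.0, "precip_mm": 1000.0},
]
envelope = fit_envelope(presence)

# Landscape under current climate, and a hypothetical warmer scenario.
current = {
    "cell_a": {"temp_c": 11.5, "precip_mm": 950.0},
    "cell_b": {"temp_c": 9.0, "precip_mm": 1000.0},
}
scenario = {
    name: {"temp_c": c["temp_c"] + 2.0, "precip_mm": c["precip_mm"]}
    for name, c in current.items()
}

current_map = {name: predict(envelope, c) for name, c in current.items()}
future_map = {name: predict(envelope, c) for name, c in scenario.items()}
print(current_map)
print(future_map)
```

In this toy landscape the suitable cell changes between the two maps, which is the kind of range shift the projection step is meant to reveal.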

Informatics Challenges
Data discovery, access, and archiving
Data integration
Compute cycles
Alternative model testing
Model complexity

Data Discovery, Access, and Archiving
Niche modeling requires many different data sources
  – Species observation data
  – Environmental data
    – Precipitation, land use, LAI, topography, etc.
  – Climate change data
    – IPCC Climate Change scenarios
Currently, these types of data are either completely inaccessible or accessible only with significant manual effort to locate and retrieve them from multiple independent providers
Need to archive selected model results and data
  – Typically, these are handled on an ad hoc basis, without documentation or archiving plans
For niche modeling, outputs are diverse:
  – GA rule sets
  – Predicted distributions under current and altered conditions
  – Maps of these distributions

Data Integration
To use data, need to normalize and integrate them into a common frame of reference
For niche modeling, that includes finding an optimal extent, resolution, and projection for all data types
Currently, custom scripts or applications are used for such transformations
  – Extremely time consuming
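
As a small illustration of bringing layers onto a common extent and resolution, the sketch below resamples a coarse grid onto a finer target grid by nearest-neighbour lookup. It is a toy under stated assumptions: the dict-based raster layout is hypothetical, both grids are assumed to already share one projection, and reprojection is not handled.

```python
# Nearest-neighbour resampling of a raster onto a target grid.
# A raster here is a hypothetical dict with origin (x0, y0 = lower-left
# corner), square cell size, and rows of values ordered south to north.

def value_at(raster, x, y):
    """Value of the source cell containing point (x, y), or None if outside."""
    col = int((x - raster["x0"]) // raster["cell"])
    row = int((y - raster["y0"]) // raster["cell"])
    rows = raster["values"]
    if 0 <= row < len(rows) and 0 <= col < len(rows[0]):
        return rows[row][col]
    return None

def resample(raster, x0, y0, cell, n_rows, n_cols):
    """Sample the source at the centre of each target cell."""
    values = [
        [value_at(raster, x0 + (c + 0.5) * cell, y0 + (r + 0.5) * cell)
         for c in range(n_cols)]
        for r in range(n_rows)
    ]
    return {"x0": x0, "y0": y0, "cell": cell, "values": values}

# A coarse 2x2 temperature grid (cell size 2) and a finer target grid
# (cell size 1) covering the same 4x4 extent.
coarse = {"x0": 0.0, "y0": 0.0, "cell": 2.0, "values": [[10, 20], [30, 40]]}
fine = resample(coarse, x0=0.0, y0=0.0, cell=1.0, n_rows=4, n_cols=4)
for row in fine["values"]:
    print(row)
```

Target cells that fall outside the source extent come back as None, so a mismatch in extent is visible rather than silently filled.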

Compute cycles
Ecological modeling problems are typically computation-limited
For niche modeling, researchers desire to examine, for example, predicted distribution of all mammals of the Western Hemisphere under current conditions and 5 IPCC climate change scenarios
(200 to 500 runs per species x ~2000 mammal species x 3 minutes/run) = 833 to 2083 days
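
The estimate above can be checked with a few lines of arithmetic. The run counts, species count, and minutes per run are taken from the slide; the worker count at the end is a hypothetical figure added only to show how the total would divide under ideal parallel execution.

```python
# Reproduce the slide's estimate of serial compute time.
MINUTES_PER_RUN = 3
SPECIES = 2000                 # approximate number of mammal species
MINUTES_PER_DAY = 60 * 24

def serial_days(runs_per_species):
    """Days needed to execute every run one after another."""
    total_minutes = runs_per_species * SPECIES * MINUTES_PER_RUN
    return total_minutes / MINUTES_PER_DAY

low, high = serial_days(200), serial_days(500)
print(f"{low:.0f} to {high:.0f} days of serial computation")

# Hypothetical: perfectly parallel execution on 100 workers.
workers = 100
print(f"{low / workers:.1f} to {high / workers:.1f} days on {workers} workers")
```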

Testing alternative models
Researchers want to ‘tweak’ models
  – Explore alternative algorithms
  – Modify parameterization
Iterate over many combinations of these
Final results and intermediate versions of models need to be saved and versioned
For niche modeling, the following algorithms are commonly used by researchers:
  DA, discriminant analysis
  BM, Bayesian model
  BP, bioclimatic profiles
  CART, classification and regression trees
  GAM, generalized additive models
  GLM, generalized linear models
  GARP, genetic algorithm for rule-set production
  MD, Mahalanobis distance method
  NNETW, neural networks
  SI, spatial interpolation
From: Segurado and Araujo. An evaluation of methods for modelling species distributions. Journal of Biogeography 31, 1555–1568.
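
Iterating over algorithm and parameter combinations, and giving each variant a stable version identifier, might look like the sketch below. The algorithm abbreviations come from the slide; the parameter names, their values, and the hashing scheme are hypothetical choices for illustration, not a description of how any existing tool versions its models.

```python
import hashlib
import itertools
import json

# Algorithm abbreviations are from the slide; the parameter names and
# values are hypothetical placeholders.
ALGORITHMS = ["GARP", "GLM", "GAM", "CART"]
PARAMETER_GRID = {
    "max_iterations": [100, 1000],
    "training_fraction": [0.5, 0.7],
}

def enumerate_runs(algorithms, grid):
    """Yield one configuration dict per algorithm/parameter combination."""
    names = sorted(grid)
    for algorithm in algorithms:
        for combo in itertools.product(*(grid[n] for n in names)):
            yield {"algorithm": algorithm, **dict(zip(names, combo))}

def version_id(config):
    """Stable short identifier derived from the configuration contents."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

runs = {version_id(c): c for c in enumerate_runs(ALGORITHMS, PARAMETER_GRID)}
print(len(runs), "configurations")
```

Because the identifier is derived from the configuration itself, rerunning the same variant maps to the same key, which makes saved results easy to match to the settings that produced them.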

Model complexity
Models and analyses typically consist of hundreds of analytical processing steps
  – Understanding the model becomes very difficult
    – “Spaghetti code” is common
  – Only experts can modify or review the model
  – Complexity increases the chance of undetected errors

Current approaches to ENM
Evolving, but typically these are custom simulation models
  – e.g., GARP
Tend towards monolithic applications that handle everything in one place (data ingestion, transformation, model execution, output management, statistical analysis)
  – These models typically are difficult to extend, modify, or understand and require specialized expertise to use
  – This is typical of many models in ecology, and is largely due to the difficulty of managing complexity in modern programming languages

Alternative approach: scientific workflows
What are scientific workflows?
  – Graphical model of data flow among processing steps
  – Inputs and outputs of components are precisely defined
  – Components are modular and reusable
  – Flow of data controlled by a separate execution model
  – Support for hierarchical models
[Figure: workflow diagram with components labeled Source (e.g., data), Processor (e.g., regression), and Sink (e.g., display); node labels A, A’, B, C, D, E, F]
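
The properties listed on this slide (modular components with defined inputs, and data flow governed by a separate execution model) can be sketched in a few lines. The class and function names below are illustrative only and are not Kepler's API; the "director" here is just a function that fires each component once its inputs are available.

```python
# Minimal dataflow sketch: components declare their upstream inputs,
# and a separate function decides the execution order.
# Class and function names are illustrative, not Kepler's API.

class Component:
    def __init__(self, name, func, inputs=()):
        self.name = name              # unique component name
        self.func = func              # callable taking one value per input
        self.inputs = list(inputs)    # names of upstream components

def run_in_dependency_order(components):
    """A simple execution model: fire each component once its inputs exist."""
    results = {}
    pending = list(components)
    while pending:
        ready = [c for c in pending if all(i in results for i in c.inputs)]
        if not ready:
            raise ValueError("workflow has a cycle or a missing input")
        for c in ready:
            results[c.name] = c.func(*(results[i] for i in c.inputs))
            pending.remove(c)
    return results

def mean(values):
    return sum(values) / len(values)

displayed = []  # stands in for a display sink

# Components are listed out of order on purpose: the execution model,
# not the listing order, determines when each one runs.
workflow = [
    Component("sink", displayed.append, inputs=["processor"]),
    Component("processor", mean, inputs=["source"]),
    Component("source", lambda: [2.0, 4.0, 6.0]),
]
results = run_in_dependency_order(workflow)
print(displayed)
```

Swapping the processor for a different function changes the analysis without touching the source, the sink, or the execution logic, which is the modularity the slide describes.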

Kepler Scientific Workflow System
Software to design and execute scientific workflows
  – Variety of analytical components (including spatial data transformations)
  – Support for R scripts and Matlab scripts
  – Real-time data access to sensor networks
  – Cross-project collaboration
    – SEEK, SciDAC, GEON, Ptolemy, RoadNet, EOL, Resurgence
EcoGrid access to heterogeneous environmental data
  – EML Data support
    – Experimental data, survey data, spatial raster and vector data, etc.
  – DarwinCore Data support
    – Museum collections
  – GeoSciences Network (GEON) Data Support
Demonstration workflows from many domains
  – Ecology: Ecological Niche Modeling
  – Genomics: Promoter Identification Workflow
  – Geology: Geologic Map Information Integration
  – Oceanography: Real-time Revelle example of data access

A simple Kepler workflow
Data source from EcoGrid (metadata-driven ingestion)
R processing script:
    res <- lm(BARO ~ T_AIR)
    res
    plot(T_AIR, BARO)
    abline(res)

Ecological Niche Model in Kepler

Have scientific workflows improved ENM?
Data discovery, access, and archiving
  – Direct access to data archives from natural history collections, ecology, and geology
  – Ability to archive outputs back into data storage systems
Data integration
  – Directly handled by specialized components in the workflow
Compute cycles
  – Current and growing grid computing support in Kepler increases runtime efficiency
Alternative model testing
  – Decouples components and allows simple modifications of the model
  – Workflows act as a full description of an executed process
    – Can be saved to the same repositories as data, allowing for complete replication of the model results
Model complexity
  – Visual display that documents and elucidates the model
  – Hierarchical modeling allows abstraction at higher levels

Semantics in scientific workflows
Components and their ports typically have:
  – Explicit ‘structural type’
    – e.g., int, float, string, {double}
  – Implicit semantic type
    – From the structural type alone, it is unclear whether the stream of values from a port represents ‘rainfall’ values or ‘body size’ values
[Figure: components A and B with port labels int, string, rainfall, bodysize]

Ecological ontologies
Model of knowledge in a domain like ecology or biodiversity
  – What was measured (e.g., biomass)
  – Type of measurement (e.g., Energy)
  – Context of measurement (e.g., Psychotria limonensis)
  – How it was measured (e.g., dry weight)

Knowledge Representation
Current SEEK Ontologies
  – Ecological Concepts, Models, Networks
  – Measurements
  – Properties
  – Statistical Analyses
  – Time and Space
  – Taxonomic Identifiers
  – Units
  – Symbiosis
Recent Developments
  – Biodiversity (measured traits, computation of traits)
  – Descriptive Terminology for Plant Communities
  – Analytical components
  – Ontology documentation
Future Goals
  – “Fill in” existing concepts, evolve the ontology framework
  – More domains …

Semantic Annotation
Label data with semantic types
Label inputs and outputs of analytical components with semantic types
[Figure: semantic annotation linking Data, Ontology, and Workflow Components]

Annotating a Component

Semantic workflow validation
Check if an existing workflow is semantically valid
  – All connected ports have compatible semantic types
  – All ports that are required are connected
  – Visually indicate status with red links for invalid connections
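
A rough sketch of the two checks named above (compatible semantic types on connected ports, and all required ports connected) follows. The tiny is-a hierarchy, the port descriptions, and the function names are hypothetical, and "compatible" is assumed here to mean that the output concept equals or specializes the input concept; the slide does not define the rule. Each reported problem corresponds to a link or port that a graphical editor could flag in red.

```python
# Sketch of semantic workflow validation. The ontology, the port
# descriptions, and the function names are hypothetical.

# child concept -> parent concept (a simple is-a hierarchy)
IS_A = {
    "Rainfall": "Precipitation",
    "Precipitation": "ClimateVariable",
    "BodySize": "OrganismTrait",
}

def is_subconcept(concept, ancestor):
    """True if concept equals ancestor or descends from it via IS_A."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False

def validate(ports, links):
    """Return a list of problems; an empty list means the workflow is valid.

    ports: {port_name: {"dtype", "semantic", "kind", "required"}}
    links: [(output_port, input_port), ...]
    """
    problems = []
    connected_inputs = set()
    for out_name, in_name in links:
        out, inp = ports[out_name], ports[in_name]
        connected_inputs.add(in_name)
        if out["dtype"] != inp["dtype"]:
            problems.append(f"{out_name} -> {in_name}: structural type mismatch")
        elif not is_subconcept(out["semantic"], inp["semantic"]):
            problems.append(f"{out_name} -> {in_name}: semantic type mismatch")
    for name, port in ports.items():
        unconnected = name not in connected_inputs
        if port["kind"] == "input" and port["required"] and unconnected:
            problems.append(f"{name}: required input is not connected")
    return problems

ports = {
    "gauge.out": {"dtype": "float", "semantic": "Rainfall",
                  "kind": "output", "required": False},
    "model.climate": {"dtype": "float", "semantic": "Precipitation",
                      "kind": "input", "required": True},
    "model.trait": {"dtype": "float", "semantic": "BodySize",
                    "kind": "input", "required": True},
}

# Only the climate input is wired: the trait input is reported as missing.
missing = validate(ports, [("gauge.out", "model.climate")])
# Wiring rainfall into the trait input passes the structural check
# (float to float) but fails the semantic one.
mismatch = validate(ports, [("gauge.out", "model.climate"),
                            ("gauge.out", "model.trait")])
print(missing)
print(mismatch)
```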

Searching with Semantics

In summary…
Typical analytical models are complex and difficult to comprehend and maintain
Scientific workflows provide an intuitive way to introduce structure and efficiency to the modeling and analysis process
Adding semantic tools to workflow design and execution also increases usability of the workflow tool
Kepler is an evolving but effective tool for scientists

Current and future work
Knowledge Representation
  – Better match between ontology and scientist’s mental model
  – Refined ontologies for biodiversity and niche modeling
  – Refined supporting ontologies (e.g., space & time)
Kepler
  – Semantically-driven data integration
  – Workflow composition and transformation
  – Ontology-directed workflow design
  – Final niche modeling workflow completed

Acknowledgements
This material is based upon work supported by:
The National Science Foundation under Grant Numbers , , , , , and
  – Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis
The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number ), the University of California, and the UC Santa Barbara campus.
The Andrew W. Mellon Foundation.
Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON, RoadNet, EOL, Resurgence