Leveraging semantic metadata for ecological data discovery and integration for analysis and modeling
Matthew B. Jones
Mark P. Schildhauer
with contributions from Shawn Bowers, Chad Berkley, Dan Higgins, Rich Williams, Deana Pennington, Jing Tao, and others
National Center for Ecological Analysis and Synthesis
University of California, Santa Barbara
SWDB Aug 29, 2004 August, 2005
Topical Outline
Case study: Predicting species distributions under climate change with ecological niche modeling
Challenges presented by Ecological Niche Modeling
Scientific Workflows and the Kepler system
Improving scientific workflows with semantics
Future work
Ecological Niche Modeling
Correlate current bioclimatic and topographic data with observed species distributions to develop a prediction algorithm
Use the prediction algorithm to project the species' distribution under 5 IPCC climate change scenarios
Can use the Genetic Algorithm for Rule-set Production (GARP) and other algorithms
Figure: oyamel fir prediction (Oberhauser and Peterson 2005)
Informatics Challenges
Data discovery, access, and archiving
Data integration
Compute cycles
Alternative model testing
Model complexity
Data Discovery, Access, and Archiving
Niche modeling requires many different data sources
–Species observation data
–Environmental data
    Precipitation, land use, leaf area index (LAI), topography, etc.
–Climate change data
    IPCC climate change scenarios
Currently, these types of data are either completely inaccessible or accessible only with significant manual effort to locate and retrieve them from multiple independent providers
Need to archive selected model results and data
–These are typically handled on an ad hoc basis, without documentation or archiving plans
For niche modeling, outputs are diverse:
–GA rule sets
–Predicted distributions under current and altered conditions
–Maps of these distributions
Data Integration
To use the data, they must be normalized and integrated into a common frame of reference
For niche modeling, that includes finding an optimal extent, resolution, and projection for all data types
Currently, custom scripts or applications are used for such transformations
–Extremely time-consuming
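A minimal sketch of the "common frame of reference" step, under stated assumptions. It assumes all layers are already in the same projection, takes the intersection of the layer extents as the common extent, and takes the coarsest cell size as the common resolution. The slide does not say how the optimal extent and resolution are chosen, so this rule is only one plausible choice. The `Layer` class, the `common_frame` function, and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Layer:
    """A raster layer described only by its bounding box and cell size."""
    name: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    cell_size: float  # same units as the extent (e.g., degrees)


def common_frame(layers: List[Layer]) -> Tuple[Tuple[float, float, float, float], float]:
    """Return (extent, cell_size) shared by all layers.

    Assumption: extent = intersection of all bounding boxes,
    cell_size = coarsest resolution among the layers.
    """
    xmin = max(layer.xmin for layer in layers)
    ymin = max(layer.ymin for layer in layers)
    xmax = min(layer.xmax for layer in layers)
    ymax = min(layer.ymax for layer in layers)
    if xmin >= xmax or ymin >= ymax:
        raise ValueError("layers do not overlap")
    cell_size = max(layer.cell_size for layer in layers)
    return (xmin, ymin, xmax, ymax), cell_size


# Made-up example layers, for illustration only.
precipitation = Layer("precipitation", -120.0, 10.0, -80.0, 50.0, 0.5)
elevation = Layer("elevation", -130.0, 15.0, -90.0, 45.0, 0.1)

extent, cell_size = common_frame([precipitation, elevation])
print(extent, cell_size)
```

Resampling each layer to this extent and cell size, and reprojecting where needed, would still be done by a spatial library or a workflow component.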
Compute cycles
Ecological modeling problems are typically computation-limited
For niche modeling, researchers want to examine, for example, the predicted distribution of all mammals of the Western Hemisphere under current conditions and 5 IPCC climate change scenarios
(200 to 500 runs per species) x (~2000 mammal species) x (3 minutes/run) = 1.2 to 3.0 million minutes = 833 to 2083 days of serial computation
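The estimate above can be reproduced with a few lines of arithmetic. The sketch uses only the figures on the slide (200 to 500 runs per species, about 2000 species, 3 minutes per run). The function names and the 100-worker example are illustrative assumptions, and the parallel estimate assumes ideal scaling with no scheduling or data-transfer overhead.

```python
MINUTES_PER_DAY = 24 * 60


def serial_days(runs_per_species: int, n_species: int, minutes_per_run: float) -> float:
    """Days of computation if every run executes one after another."""
    return runs_per_species * n_species * minutes_per_run / MINUTES_PER_DAY


def parallel_days(days: float, n_workers: int) -> float:
    """Idealized wall-clock days when runs are spread evenly over n_workers."""
    return days / n_workers


low = serial_days(200, 2000, 3)
high = serial_days(500, 2000, 3)
print(f"serial: {low:.0f} to {high:.0f} days")

# Hypothetical grid of 100 workers, perfect scaling assumed.
print(f"100 workers: {parallel_days(low, 100):.1f} to {parallel_days(high, 100):.1f} days")
```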
Testing alternative models
Researchers want to ‘tweak’ models
–Explore alternative algorithms
–Modify parameterization
Iterate over many combinations of these
Final results and intermediate versions of models need to be saved and versioned
For niche modeling, the following algorithms are commonly used by researchers:
    DA, discriminant analysis
    BM, Bayesian model
    BP, bioclimatic profiles
    CART, classification and regression trees
    GAM, generalized additive models
    GLM, generalized linear models
    GARP, genetic algorithm for rule-set production
    MD, Mahalanobis distance method
    NNETW, neural networks
    SI, spatial interpolation
From: Segurado and Araujo. An evaluation of methods for modelling species distributions. Journal of Biogeography 31, 1555–1568.
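One way to picture "iterate over many combinations" is to enumerate every algorithm and parameter setting, then give each configuration a stable identifier so its results can be saved and versioned. The sketch below is illustrative only: the parameter names and values are hypothetical, the algorithm labels are taken from the list above, and hashing the configuration is an assumed scheme rather than anything the slide describes.

```python
import hashlib
import json
from itertools import product
from typing import Dict, List

# Algorithm labels come from the slide; the parameter grid is hypothetical.
algorithms = ["GARP", "GLM", "GAM"]
parameter_grid = {
    "max_iterations": [100, 1000],
    "training_fraction": [0.5, 0.8],
}


def enumerate_runs(algorithms: List[str], grid: Dict[str, list]) -> List[dict]:
    """Build one configuration per (algorithm, parameter combination)."""
    names = sorted(grid)
    runs = []
    for algorithm in algorithms:
        for values in product(*(grid[name] for name in names)):
            runs.append({"algorithm": algorithm, **dict(zip(names, values))})
    return runs


def run_id(run: dict) -> str:
    """Stable short identifier derived from the configuration itself."""
    canonical = json.dumps(run, sort_keys=True).encode("utf-8")
    return hashlib.sha1(canonical).hexdigest()[:12]


runs = enumerate_runs(algorithms, parameter_grid)
for run in runs[:3]:
    print(run_id(run), run)
print(len(runs), "configurations")
```

Because the identifier depends only on the configuration, rerunning the same combination maps to the same identifier, which makes it easy to tell which results already exist.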
Model complexity
Models and analyses typically consist of hundreds of analytical processing steps
–Understanding the model becomes very difficult
“Spaghetti code” is common
–Only experts can modify or review the model
–Complexity increases the chance of undetected errors
Current approaches to ENM
Evolving, but typically these are custom simulation models
–e.g., GARP
They tend to be monolithic applications that handle everything in one place (data ingestion, transformation, model execution, output management, statistical analysis)
–These models are typically difficult to extend, modify, or understand, and require specialized expertise to use
–This is typical of many models in ecology, and is largely due to the difficulty of managing complexity in modern programming languages
Alternative approach: scientific workflows
What are scientific workflows?
–Graphical model of data flow among processing steps
–Inputs and outputs of components are precisely defined
–Components are modular and reusable
–Flow of data controlled by a separate execution model
–Support for hierarchical models
Figure: workflow diagram linking a Source (e.g., data), a Processor (e.g., regression), and a Sink (e.g., display)
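The source, processor, and sink idea can be sketched in a few lines. This is not Kepler's API. It is a hypothetical, minimal illustration in which components are plain functions over streams of values and a separate `run_pipeline` function plays the role of the execution model that moves data between them. All names and the example data are made up.

```python
from typing import Callable, Iterable, Iterator, List


def number_source() -> Iterator[float]:
    """Source: emits a small, made-up stream of values."""
    yield from [1.0, 2.0, 3.0, 4.0]


def square(values: Iterable[float]) -> Iterator[float]:
    """Processor: transforms each incoming value independently."""
    for value in values:
        yield value * value


class CollectSink:
    """Sink: stores whatever arrives (a display would render it instead)."""

    def __init__(self) -> None:
        self.received: List[float] = []

    def __call__(self, values: Iterable[float]) -> None:
        for value in values:
            self.received.append(value)


def run_pipeline(
    source: Callable[[], Iterable[float]],
    processors: List[Callable[[Iterable[float]], Iterable[float]]],
    sink: Callable[[Iterable[float]], None],
) -> None:
    """Execution model: wires the components together and drives the data flow."""
    stream = source()
    for processor in processors:
        stream = processor(stream)
    sink(stream)


sink = CollectSink()
run_pipeline(number_source, [square], sink)
print(sink.received)
```

Keeping the wiring in `run_pipeline` rather than inside the components is what lets a processor be swapped or reused without touching the source or the sink.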
Kepler Scientific Workflow System
Software to design and execute scientific workflows
–Variety of analytical components (including spatial data transformations)
–Support for R scripts and Matlab scripts
–Real-time data access to sensor networks
–Cross-project collaboration
    SEEK, SciDAC, GEON, Ptolemy, RoadNet, EOL, Resurgence
EcoGrid access to heterogeneous environmental data
–EML data support
    Experimental data, survey data, spatial raster and vector data, etc.
–DarwinCore data support
    Museum collections
–GeoSciences Network (GEON) data support
Demonstration workflows from many domains
–Ecology: Ecological Niche Modeling
–Genomics: Promoter Identification Workflow
–Geology: Geologic Map Information Integration
–Oceanography: Real-time Revelle example of data access
A simple Kepler workflow
Data source from EcoGrid (metadata-driven ingestion)
R processing script:
    res <- lm(BARO ~ T_AIR)
    res
    plot(T_AIR, BARO)
    abline(res)
Ecological Niche Model in Kepler
Have scientific workflows improved ENM?
Data discovery, access, and archiving
–Direct access to data archives from natural history collections, ecology, and geology
–Ability to archive outputs back into data storage systems
Data integration
–Directly handled by specialized components in the workflow
Compute cycles
–Current and growing grid computing support in Kepler increases runtime efficiency
Alternative model testing
–Decouples components and allows simple modifications of the model
–Workflows act as a full description of an executed process
    Can be saved to the same repositories as data, allowing for complete replication of the model results
Model complexity
–Visual display that documents and elucidates the model
–Hierarchical modeling allows abstraction at higher levels
Semantics in scientific workflows
Components and their ports typically have:
–Explicit ‘structural type’
    e.g., int, float, string, {double}
–Implicit semantic type
    From the structural type alone, one cannot tell whether the stream of values from a port represents ‘rainfall’ values or ‘body size’ values
Figure: components A and B with ports labeled int, string, int; two int ports carry the semantic labels rainfall and bodysize
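The gap between structural and semantic types can be shown with a small sketch. The `Port` class and both checks are hypothetical illustrations, not part of Kepler. The point is only that two `int` ports pass a structural check even when one carries rainfall values and the other expects body sizes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Port:
    name: str
    structural_type: str                 # e.g., "int", "float", "string"
    semantic_type: Optional[str] = None  # e.g., "rainfall"; None if unlabeled


def structurally_compatible(output: Port, input_: Port) -> bool:
    """True when the data representations match."""
    return output.structural_type == input_.structural_type


def semantically_compatible(output: Port, input_: Port) -> Optional[bool]:
    """True/False when both ports are labeled, None when the meaning is unknown."""
    if output.semantic_type is None or input_.semantic_type is None:
        return None
    return output.semantic_type == input_.semantic_type


rain_out = Port("A.out", "int", "rainfall")
size_in = Port("B.in", "int", "bodysize")

print(structurally_compatible(rain_out, size_in))  # the int types match
print(semantically_compatible(rain_out, size_in))  # the meanings differ
```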
Ecological ontologies
Model of knowledge in a domain like ecology or biodiversity
–What was measured (e.g., biomass)
–Type of measurement (e.g., Energy)
–Context of measurement (e.g., Psychotria limonensis)
–How it was measured (e.g., dry weight)
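The four facets listed above can be pictured as a simple record attached to a measurement. This is only an illustrative data structure, not the SEEK ontology's representation. The class name, field names, and the second record are hypothetical, and the first record reuses the examples from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MeasurementAnnotation:
    what: str              # what was measured
    measurement_type: str  # type of measurement
    context: str           # context of the measurement
    method: str            # how it was measured


def find(annotations: List[MeasurementAnnotation], **facets: str) -> List[MeasurementAnnotation]:
    """Return annotations whose named facets all match the given values."""
    return [
        a for a in annotations
        if all(getattr(a, key) == value for key, value in facets.items())
    ]


# First record uses the slide's examples; the second is made up for illustration.
biomass = MeasurementAnnotation("biomass", "Energy", "Psychotria limonensis", "dry weight")
leaf_area = MeasurementAnnotation("leaf area", "Area", "Psychotria limonensis", "leaf tracing")
records = [biomass, leaf_area]

print(find(records, context="Psychotria limonensis"))
print(find(records, what="biomass"))
```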
Knowledge Representation
Current SEEK Ontologies
–Ecological Concepts, Models, Networks
–Measurements
–Properties
–Statistical Analyses
–Time and Space
–Taxonomic Identifiers
–Units
–Symbiosis
Recent Developments
–Biodiversity (measured traits, computation of traits)
–Descriptive Terminology for Plant Communities
–Analytical components
–Ontology documentation
Future Goals
–“Fill in” existing concepts, evolve the ontology framework
–More domains …
Label data with semantic types
Label inputs and outputs of analytical components with semantic types
Figure labels: Data, Ontology, Workflow Components, Semantic Annotation
Annotating a Component
Semantic workflow validation
Check if an existing workflow is semantically valid
–All connected ports have compatible semantic types
–All ports that are required are connected
–Visually indicate status with red links for invalid connections
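The two checks listed above can be sketched as follows. Everything here is a hypothetical illustration rather than Kepler's implementation. The toy is-a hierarchy, the port names, and the rule that an output is compatible when its semantic type equals or specializes the input's type are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical is-a hierarchy: child term -> parent term.
IS_A: Dict[str, str] = {
    "Rainfall": "Precipitation",
    "Precipitation": "EnvironmentalVariable",
    "BodySize": "OrganismTrait",
}


def is_subtype(term: Optional[str], ancestor: str) -> bool:
    """True if term equals ancestor or reaches it by following is-a links."""
    while term is not None:
        if term == ancestor:
            return True
        term = IS_A.get(term)
    return False


@dataclass(frozen=True)
class Port:
    component: str
    name: str
    semantic_type: str
    required: bool = False


def validate(connections: List[Tuple[Port, Port]], input_ports: List[Port]) -> List[str]:
    """Return a list of problems; an empty list means the workflow is valid."""
    problems = []
    connected = set()
    for output, input_ in connections:
        connected.add(input_)
        if not is_subtype(output.semantic_type, input_.semantic_type):
            problems.append(
                f"invalid link: {output.component}.{output.name} ({output.semantic_type})"
                f" -> {input_.component}.{input_.name} ({input_.semantic_type})"
            )
    for port in input_ports:
        if port.required and port not in connected:
            problems.append(f"unconnected required port: {port.component}.{port.name}")
    return problems


rain_out = Port("Gauge", "out", "Rainfall")
size_out = Port("Survey", "out", "BodySize")
precip_in = Port("Model", "precip", "Precipitation", required=True)
env_in = Port("Model", "env", "EnvironmentalVariable", required=True)
occ_in = Port("Model", "occurrences", "OrganismTrait", required=True)

problems = validate(
    [(rain_out, precip_in), (size_out, env_in)],
    [precip_in, env_in, occ_in],
)
for problem in problems:
    print(problem)
```

A user interface could draw the links reported as invalid in red, which is the visual cue the slide describes.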
Searching with Semantics
In summary…
Typical analytical models are complex and difficult to comprehend and maintain
Scientific workflows provide an intuitive way to introduce structure and efficiency to the modeling and analysis process
Adding semantic tools to workflow design and execution also increases the usability of the workflow tool
Kepler is an evolving but effective tool for scientists
Current and future work
Knowledge Representation
–Better match between ontology and scientist’s mental model
–Refined ontologies for biodiversity and niche modeling
–Refined supporting ontologies (e.g., space & time)
Kepler
–Semantically driven data integration
–Workflow composition and transformation
–Ontology-directed workflow design
–Final niche modeling workflow completed
Acknowledgements
This material is based upon work supported by:
The National Science Foundation under Grant Numbers , , , , , and
Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis
The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number ), the University of California, and the UC Santa Barbara campus.
The Andrew W. Mellon Foundation.
Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON, RoadNet, EOL, Resurgence