Challenges with Maintaining Legacy Software to Achieve Reproducible Computational Analyses: An Example for Hydrologic Modeling Data Processing Pipelines.


Challenges with Maintaining Legacy Software to Achieve Reproducible Computational Analyses: An Example for Hydrologic Modeling Data Processing Pipelines
Bakinam T. Essawy, Jonathan L. Goodall, Tanu Malik, Hao Xu, Michael Conway and Yolanda Gil
iEMSs 2016 Conference, Toulouse, France, July 11, 2016

Introduction
- Scientific models may have many modules and components built by different individuals.
- Metadata should capture these contributions and identify the exact versions of software used to perform analyses.
- Scientific models often depend on legacy software, so reproducing model runs across different machines and over time is challenging.
- As an example, we have automated a data processing pipeline for the Variable Infiltration Capacity (VIC) hydrologic model.
- The software was built by many researchers over decades of work, but low-level metadata is scattered and not formally organized.
- Moving the VIC data processing pipelines to new machines remains a significant challenge, due in large part to the need to install legacy software dependencies.

Research Objectives

Using the Variable Infiltration Capacity (VIC) data processing pipeline as a case study, the research objectives are to:
1. Test a software metadata standard for capturing low-level metadata for the VIC data processing pipeline.
2. Develop a methodology for re-execution of the VIC data processing pipeline using Docker containers.

[Figure: basic idea behind Docker. Source: https://www.docker.com/]
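The Docker approach can be sketched as a container recipe that pins the operating system and toolchain the legacy software expects, so the pipeline runs the same way on any machine. The base image, package list, paths, and script name below are illustrative assumptions, not taken from the actual VIC pipeline:

```dockerfile
# Hypothetical sketch: pin an older OS image so legacy tools keep building.
FROM ubuntu:14.04

# Install the compilers and shells that legacy pre-processing tools commonly expect.
RUN apt-get update && apt-get install -y \
    build-essential gfortran csh

# Copy the pipeline scripts and binaries into the image (paths are illustrative).
COPY vic_preprocessing/ /opt/vic_preprocessing/
WORKDIR /opt/vic_preprocessing

# Default command runs a hypothetical pipeline driver script.
CMD ["./run_pipeline.sh"]
```

Once built and pushed to a registry, the same image can be pulled and run unchanged on any host with Docker installed, which is the property the re-execution objective relies on.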

Methodology
- The OntoSoft ontology will be used to organize metadata for the VIC data processing pipeline.
- The GeoDataSpace project will be used to assist in creating a Docker container for the VIC data processing pipeline.
- The DataNet Federation Consortium (DFC) Discovery Environment (DE) will be used to execute the Docker containers as a Web app.

OntoSoft and GeoDataSpace are funded under the NSF EarthCube program; the DFC is funded under the NSF DataNet program.

Background: The VIC Model
- VIC = Variable Infiltration Capacity: a regional-scale land surface hydrology model.
- Developed at the University of Washington and Princeton; applied worldwide.
- VIC requires precipitation, maximum and minimum temperature, wind speed, soil, and vegetation data sets as input.
- A data processing pipeline was generated to gather and prepare the meteorological dataset required as input to the model.

Source: Gao et al. (2009)
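As an illustration of the kind of output such a pre-processing pipeline produces, the VIC classic driver reads per-grid-cell plain-text forcing files of daily meteorological values. The column order, file name, and helper function below are assumptions for illustration only, not taken from the slides:

```python
# Illustrative sketch: write a VIC-style ASCII forcing file for one grid cell.
# The column order (precip, tmax, tmin, wind) and the lat/lon file-naming
# convention are assumptions for this example.

def write_forcing_file(path, records):
    """records: iterable of daily (precip_mm, tmax_c, tmin_c, wind_ms) tuples."""
    with open(path, "w") as f:
        for precip, tmax, tmin, wind in records:
            # One whitespace-delimited row per day.
            f.write(f"{precip:.2f} {tmax:.2f} {tmin:.2f} {wind:.2f}\n")

daily = [
    (0.00, 21.5, 9.8, 2.1),   # day 1
    (4.25, 18.2, 11.0, 3.4),  # day 2
]
write_forcing_file("data_45.1250_-112.8750", daily)
```

A pipeline like the one described here would emit one such file per grid cell, which is why the pre-processing steps (regridding, rescaling, appending) dominate the workflow.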

Background: The VIC pre-processing pipeline

Color-coding schema for dataset preparation:
- Blue and Green: Precipitation (P)
- Green: Minimum and maximum temperature (Tmin and Tmax)
- Purple: Wind speed (W)
- Yellow: Combination of P, Tmax, Tmin, and W databases
- Dark Blue: Complete meteorological database

Objective 1: Metadata Capture
- OntoSoft provides an ontology for scientific software.
- OntoSoft provides a Web portal for populating the ontology.
- The OntoSoft ontology is divided into six categories: "Identify", "Understand", "Execute", "Do Research", "Get Support", and "Update".

http://www.ontosoft.org/
http://www.ontosoft.org/portal/
http://ontosoft.org/ontology
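The completeness analysis reported in the results can be sketched as a simple calculation over a metadata record. Only the six category names come from the OntoSoft ontology; the field names, record, and function below are hypothetical:

```python
# Hypothetical sketch: percent-complete metadata per OntoSoft category.
# Category names are from the OntoSoft ontology; field lists are invented
# for illustration and do not reflect the real ontology properties.

ONTOSOFT_CATEGORIES = ["Identify", "Understand", "Execute",
                       "Do Research", "Get Support", "Update"]

def percent_complete(record, fields_by_category):
    """Return the share of fields (0-100) filled in per category."""
    result = {}
    for category in ONTOSOFT_CATEGORIES:
        fields = fields_by_category.get(category, [])
        filled = sum(1 for f in fields if record.get(f))
        result[category] = 100 * filled // len(fields) if fields else 0
    return result

# Illustrative field lists and a partially filled record.
fields = {"Identify": ["name", "version"], "Understand": ["purpose", "docs_url"]}
entry = {"name": "preproc_precip", "version": "1.0", "purpose": "grid rainfall"}
print(percent_complete(entry, fields))
```

A calculation of this shape, applied per program and per category, is what produces the completeness percentages shown in the results tables.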

OntoSoft Web Portal

Sources for Metadata

Metadata for each process in the VIC pre-processing pipeline was captured from the following publicly available resources:
- VIC software publication in Zenodo
- Source code and prior experience
- VIC documentation
- VIC website

Metadata was entered into the OntoSoft Web Portal and can be verified and updated, if necessary, by others.

Results: Metadata Analysis

Percent-complete metadata for each OntoSoft category (Identify; Understand, Req*/Opt*; Execute; Do Research, Req/Opt; Get Support; Update) was tabulated for the 15 programs in the pipeline:

Temperature (Max./Min.) and Precipitation:
1. preproc_precip
2. read_prec_dly
3. preproc_append
4. append_prec
5. run_append_prec
6. regrid
7. mk_monthly
8. get_prism
9. rescale
10. vicinput

Soil and Vegetation:
11. create_LDAS_soil
12. create_LDAS_veg_param

Wind Speed:
13. getwind
14. regrid_wind

Complete Meteorological Database:
15. combine_wind

[Table: per-program completeness percentages by category.]
*Req is Required; Opt is optional

Metadata Analysis

[Chart: completeness (%) of each OntoSoft metadata category for each of the 15 pre-processing programs following the metadata extraction process; create_LDAS_veg_param and regrid are highlighted.]


Metadata Analysis

[Chart: overall percentage of each source used to extract metadata in each OntoSoft category.] Sources:
- VIC user discussion
- Source code and prior experience
- VIC website
- VIC software publication in Zenodo
- VIC documentation

Metadata Analysis

[Chart: source for extracted metadata in each OntoSoft category.]

Preliminary Findings
- It was possible to capture 90% or more of the required OntoSoft metadata for 13 of the 15 software programs used in the data processing pipeline from available online sources.
- Four sources supplied a similar amount of the metadata, showing that metadata is generally available but distributed across sources.
- Some metadata elements are more easily identified than others; some terms we were confident in populating, while others may be subject to interpretation.

Objective 2: Execution of Legacy Code

Step 1: Use GeoDataSpace to create a geounit for the data processing pipeline.
Step 2: Use the DataNet Federation Consortium for execution of the pipeline as a Web app.

GeoDataSpace

GeoDataSpace Client
- A Git- and Dropbox-like Python client for Linux and Mac OS X
- Annotate: provides semantic annotations
- Package: packages code, data, and environment into Docker containers
- Track: tracks provenance of the scientific program

Prototype Implementation
- A Docker container was created manually for a simpler post-processing pipeline used to visualize VIC model output. The pipeline is described in a prior paper¹.
- The more complex pre-processing pipeline has been Dockerized using GeoDataSpace, but is still being tested.
- The Docker container was uploaded to Docker Hub, and the DE administrator was contacted to install it as a Web app. The administrator verified the container and then loaded it as an app; it is now available to DE users for data processing.

¹ Essawy, B. T., J. L. Goodall, H. Xu, A. Rajasekar, J. D. Myers, T. A. Kugler, M. M. Billah, M. C. Whitton, and R. W. Moore (2016), Server-side workflow execution using data grid technology for reproducible analyses of data-intensive hydrologic systems, Earth and Space Science, 3, 163–175, doi:10.1002/2015EA000139.

Discovery Environment

[Screenshot: location of the Docker container on Docker Hub.]

Discovery Environment (DE) Interface

Summary

Reproducibility and transparency of modeling will benefit from:
- Better metadata capture at a low level to identify the various software used to perform an analysis. OntoSoft can be used to organize existing but scattered metadata for modeling software.
- Docker containers to more easily execute legacy software across machines. For simple pipelines, Dockerizing can be done manually; tools like GeoDataSpace can assist in creating containers for more complex pipelines.
- Tools for executing shared Docker containers. The DataNet Federation Consortium (DFC) Discovery Environment (DE) provides a means for executing Docker containers as Web applications.

THANK YOU

Contact info:
Jon Goodall
Department of Civil and Environmental Engineering
University of Virginia
goodall@virginia.edu