Challenges with Maintaining Legacy Software to Achieve Reproducible Computational Analyses: An Example for Hydrologic Modeling Data Processing Pipelines.


1 Challenges with Maintaining Legacy Software to Achieve Reproducible Computational Analyses: An Example for Hydrologic Modeling Data Processing Pipelines Bakinam T. Essawy, Jonathan L. Goodall, Tanu Malik, Hao Xu, Michael Conway and Yolanda Gil iEMSs 2016 Conference Toulouse, France July 11, 2016

2 Introduction Scientific models may have many modules and components built by different individuals. Metadata should capture these contributions and identify the exact versions of software used to perform analyses. Scientific models often depend on legacy software, and reproducing model runs across different machines and over time is challenging. As an example, we have automated a data processing pipeline for the Variable Infiltration Capacity (VIC) hydrologic model. The software was built by many researchers over decades of work, but low-level metadata is scattered and not formally organized. Moving the VIC data processing pipelines to new machines remains a significant challenge, due in large part to the need to install legacy software dependencies.

3 Research Objectives Using the Variable Infiltration Capacity (VIC) data processing pipeline as a case study, the research objectives are to:
Test a software metadata standard for capturing low-level metadata for the VIC data processing pipeline.
Develop a methodology for re-execution of the VIC data processing pipeline using Docker containers.
[Figure: basic idea behind Docker]

4 Methodology The OntoSoft ontology will be used to organize metadata for the VIC data processing pipeline. The GeoDataSpace project will be used to assist in creating a Docker container for the VIC data processing pipeline. The DataNet Federation Consortium (DFC) Discovery Environment (DE) will be used to execute the Docker containers as a Web app. OntoSoft and GeoDataSpace are funded under the NSF EarthCube program, and the DFC is funded under the NSF DataNet program.

5 Background: The VIC Model
VIC = Variable Infiltration Capacity: a regional-scale land surface hydrology model. VIC was developed at the University of Washington and Princeton and has been applied worldwide. VIC requires precipitation, maximum and minimum temperature, wind speed, soil, and vegetation data sets as input. A data processing pipeline was generated to gather and prepare the meteorological dataset required as an input to the model. Source: Gao et al. (2009)

6 Background: The VIC pre-processing pipeline
Color coding schema for dataset preparation:
Blue and Green: Precipitation (P)
Green: Minimum and maximum temperature (Tmin and Tmax)
Purple: Wind speed (W)
Yellow: Combination of the P, Tmax, Tmin, and W databases
Dark Blue: Complete meteorological database
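Conceptually, each colored stage consumes the output of the one before it, so the pipeline is an ordered chain of legacy executables. The idea can be sketched in a few lines of Python (the driver function is a hypothetical illustration, not the actual VIC tooling):

```python
import subprocess

def run_pipeline(steps):
    """Run each pipeline step in order, stopping at the first failure.

    `steps` is a list of command lines (each a list of strings); e.g. the
    precipitation steps would run before the combination step that needs
    their output.
    """
    for cmd in steps:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"step {cmd[0]} failed: {result.stderr}")
    return len(steps)  # number of steps that completed successfully
```

Because every step must find its legacy dependencies installed on the host, a driver like this is exactly what breaks when the pipeline moves to a new machine, which motivates the container-based approach of Objective 2.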

7 Objective 1: Metadata Capture
OntoSoft provides an ontology for scientific software. OntoSoft provides a Web portal for populating the ontology. The OntoSoft ontology is divided into six categories: "Identify", "Understand", "Execute", "Do Research", "Get Support", and "Update".

8 OntoSoft Web Portal

9 Sources for Metadata Metadata for each process in the VIC pre-processing pipeline was captured from the following publicly available resources:
VIC software publication in Zenodo
Source code and prior experience
VIC documentation
VIC website
Metadata was entered into the OntoSoft Web Portal and can be verified and updated, if necessary, by others.

10 Results: Metadata Analysis
Percent complete metadata for each OntoSoft category (Identify, Understand, Execute, Do Research, Get Support, Update; Req = required, Opt = optional) was tabulated for each of the 15 software components in the pipeline:
Temperature (Max./Min.) and Precipitation: 1 preproc_precip, 2 read_prec_dly, 3 preproc_append, 4 append_prec, 5 run_append_prec, 6 regrid, 7 mk_monthly, 8 get_prism, 9 rescale, 10 vicinput
Soil and Vegetation: 11 create_LDAS_soil, 12 create_LDAS_veg_param
Wind Speed: 13 getwind, 14 regrid_wind
Complete Meteorological Database: 15 combine_wind
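The percentages in the table amount to the share of metadata properties in a category that have a value. A small sketch of that calculation (the property names below are illustrative placeholders, not the actual OntoSoft vocabulary):

```python
def completeness(fields, entered):
    """Percent of metadata properties in `fields` with a non-empty value."""
    if not fields:
        return 0
    filled = sum(1 for f in fields if entered.get(f))
    return round(100 * filled / len(fields))

# Hypothetical required "Identify" properties for one pipeline component.
identify_required = ["name", "description", "website", "unique_id"]
metadata = {"name": "preproc_precip", "description": "daily precip prep",
            "website": ""}  # website empty, unique_id missing
```

Here two of the four illustrative properties are populated, so the component would score 50% in that category.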

11 create_LDAS_veg_param
[Figure: Metadata Analysis. Completeness (%) of OntoSoft metadata for each of the 15 pre-processing steps following the metadata extraction process, with create_LDAS_veg_param and regrid highlighted.]

12 create_LDAS_veg_param
regrid

13 Metadata Analysis Overall percentage of metadata extracted from each source, by OntoSoft category. Sources: VIC user discussion, source code and prior experience, VIC website, VIC software publication in Zenodo, and VIC documentation.

14 Metadata Analysis: Source for Extracted Metadata in Each OntoSoft Category

15 Preliminary Findings It was possible to capture 90% or more of the required OntoSoft metadata for 13 of the 15 software components used in the data processing pipeline from available online sources. Four sources supplied similar amounts of metadata, showing that metadata is generally available but distributed across sources. Some metadata elements are more easily identified than others: some terms we were confident in populating, while others may be subject to interpretation.

16 Objective 2: Execution of legacy code
Step 1: Use GeoDataSpace to create a geounit for the data processing pipeline. Step 2: Use the DataNet Federation Consortium to execute the pipeline as a Web app.

17 GeoDataSpace

GeoDataSpace Client A Git- and Dropbox-like Python client for Linux and Mac OS X. Annotate: provides semantic annotations. Package: packages code, data, and environment into Docker containers. Track: tracks provenance of the scientific program.

19 Prototype Implementation
A Docker container was created manually for a simpler post-processing pipeline used to visualize VIC model output; the pipeline is described in a prior paper1. The more complex pre-processing pipeline has been Dockerized using GeoDataSpace, but is still being tested. The Docker container was uploaded to Docker Hub, and the DE administrator was contacted to install the Docker container as a Web app. The DE administrator verified the container and then loaded it as an app. It is now available to DE users for data processing. 1 Essawy, B. T., J. L. Goodall, H. Xu, A. Rajasekar, J. D. Myers, T. A. Kugler, M. M. Billah, M. C. Whitton, and R. W. Moore (2016), Server-side workflow execution using data grid technology for reproducible analyses of data-intensive hydrologic systems, Earth and Space Science, 3, 163-175, doi: /2015EA
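Manually Dockerizing a pipeline like this amounts to pinning the legacy dependencies inside an image. A hypothetical Dockerfile sketch (the base image, package names, and paths are assumptions for illustration, not the actual container published on Docker Hub):

```dockerfile
# Hypothetical sketch of a container for a VIC post-processing pipeline.
FROM ubuntu:16.04

# Install legacy dependencies at build time so the environment is fixed
# and reproducible wherever the image runs.
RUN apt-get update && apt-get install -y \
    gcc gfortran python \
    && rm -rf /var/lib/apt/lists/*

# Copy the pipeline scripts into the image.
COPY scripts/ /opt/vic-pipeline/

# The Discovery Environment invokes the container with input/output
# paths as arguments to this entry point.
ENTRYPOINT ["/opt/vic-pipeline/run.sh"]
```

Once built and pushed to Docker Hub, the same image runs identically on any Docker host, which is what lets the DE expose it as a Web app.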

20 Location of the Docker Container on the Docker Hub

21 Discovery Environment
DE Interface

22 Summary Reproducibility and transparency of modeling will benefit from:
Better metadata capture at a low level to identify the various software used to perform an analysis; OntoSoft can be used to organize existing but scattered metadata for modeling software.
Docker containers to more easily execute legacy software across machines; for simple pipelines, Dockerizing can be done manually, and tools like GeoDataSpace can assist in creating containers for more complex pipelines.
Tools for executing shared Docker containers; the DataNet Federation Consortium (DFC) Discovery Environment (DE) provides a means to execute Docker containers as Web applications.

23 THANK YOU Contact info: Jon Goodall Department of Civil and Environmental Engineering University of Virginia

