Enhancing GIS Capabilities for High Resolution Earth Science Grids (IN24B-05) Benjamin Koziol (ben.koziol@noaa.gov)1, Robert Oehmke1, Peggy Li2, Ryan O’Kuinghttons3, Gerhard Theurich4, Cecelia DeLuca1, Rocky Dunlap1 1NESII/CIRES/NOAA-ESRL, 2NASA-JPL, 3Cherokee Nation Technologies, 4Source Spring, Inc. AGU December 2017 Hi, my name is Ben Koziol, and today I’ll be presenting “Enhancing GIS Capabilities for High Resolution Earth Science Grids”. My colleagues and I work as part of the NOAA Environmental Software Infrastructure & Interoperability (NESII) group at NOAA’s Earth System Research Lab in Boulder, CO. I also want to quickly thank the program committee for putting this session together.
A Presentation in Two Parts Part 1 - National Hydrography Dataset Conservative Regridding Motivation: Develop and profile a high performance regridding solution for complex, irregular meshes and fine resolution rectangular grids over large scales. Data discussion Implementation Performance Part 2 - Chunked Regridding Motivation: Develop software infrastructure for grid manipulations that scale to high spatial resolutions and accuracies while accommodating arbitrary compute environments and grid structures. Next Steps This presentation will be in two parts covering separate but interconnected data processing workflows. Both workflows use the same major software components and are focused on large-scale regridding exercises that push the spatial processing capabilities of this particular geoscientific software stack. I’ll give an overview of that software stack in the next slide. In Part 1, I’ll describe a conservative regridding experiment using the National Hydrography Dataset. The goal is to develop and profile a high performance regridding solution for complex, irregular meshes and fine resolution rectangular grids over large scales. In Part 2, I’ll walk through our current approach to chunked regridding. The goal here is to develop software infrastructure for grid manipulations that scale to high spatial resolutions and accuracies while accommodating arbitrary compute environments and grid structures.
Software Stack We primarily used two pieces of software developed in the NESII group for these workflows. ESMF, the Earth System Modeling Framework, is a high performance spatial interpolation and model coupling platform. ESMF supports a variety of grid and mesh metadata formats and interpolation methods: bilinear, patch, and first-order conservative. ESMF also has a powerful parallel array facility that supports haloing, redistribution, and arbitrary dimensional decompositions. OpenClimateGIS (OCGIS) is a pure-Python geospatial manipulation package as well as a general parallel Python computation and IO framework. OCGIS provides the spatial subsetting and GIS toolkit used in these workflows. Both packages are fully open source and built on a number of other valuable open source tools for which we and the community are eternally grateful.
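To make the stack concrete, here is a rough ESMPy sketch of the regridding pattern used throughout this talk. The file names, formats, and options below are illustrative assumptions, not the exact scripts behind these results.

import ESMF

# Logically rectangular source grid (corners included for conservative weights)
# and an unstructured destination mesh, both read from NetCDF files.
grid = ESMF.Grid(filename="source_grid.nc",
                 filetype=ESMF.FileFormat.GRIDSPEC,
                 add_corner_stagger=True)
mesh = ESMF.Mesh(filename="destination_mesh.nc",
                 filetype=ESMF.FileFormat.ESMFMESH)

srcfield = ESMF.Field(grid, name="src")
dstfield = ESMF.Field(mesh, name="dst", meshloc=ESMF.MeshLoc.ELEMENT)

# First-order conservative interpolation; destination elements with no source
# overlap are ignored rather than raising an error.
regrid = ESMF.Regrid(srcfield, dstfield,
                     regrid_method=ESMF.RegridMethod.CONSERVE,
                     unmapped_action=ESMF.UnmappedAction.IGNORE)
dstfield = regrid(srcfield, dstfield)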
Part 1: National Hydrography Dataset Conservative Regridding So, moving on to Part 1, let me discuss the NHD regridding experiment.
Here’s the Catch-ments Conservatively regrid to hydrologic catchment polygons (unstructured) from an arbitrarily gridded field (structured) High performance requirements Reusability Source Grid: CONUS exact test field - 250-meter (~2e-3 degrees), three timesteps Destination Grid: NHDPlus hydrologic catchments for CONUS [1] 7.7 GB of ESRI-based vector data 2,647,454 Mesh Elements 485,638,947 Nodes Certain classes of hydrologic models have native grids that are not logically rectangular or composed of repeated shapes like a triangular mesh. Rather, the modeling grid or mesh is defined by local environmental characteristics like topography, which often leads to dynamic mesh element shapes and properties - a “flexible mesh”. Confounding this issue is the fact that most forcing data, like precipitation or incident solar radiation, are provided on logically rectangular grids. The data on these forcing grids, or source grids, must be interpolated or regridded onto the unstructured form of the destination grid. A conservative regridding approach is often preferred for physical quantities like these as it preserves mass and better accounts for the irregular edges of an unstructured grid. For those unfamiliar with the term “conservative regridding”, it is best to think of this as an area-weighted spatial average to first order. Also, when I say “unstructured” or “flexible mesh”, I am referring to collections of polygons and multi-polygons for this use case. So, what we were trying to do in this exercise was evaluate ESMF capabilities under these circumstances. Namely, we focused on high performance requirements - the goal is to move towards operational capabilities in the long term, so speed is a consideration. The method also had to be reusable, as the purpose of infrastructure software is to generalize across a wide variety of use cases and ensure reuse of developed components. Regarding the input datasets, we used a seamless NHDPlus hydrologic catchments dataset stored in an ESRI File Geodatabase and an exact analytic field calculated on a logically rectangular, 250-meter resolution CONUS grid for use in error evaluation. [1] http://www.horizon-systems.com/nhdplus/
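For reference, the “area-weighted spatial average” statement can be written down. Ignoring masking and normalization options, the first-order conservative weight for destination element d and source cell s is the fraction of the destination area covered by that source cell, where A_{s ∩ d} is the overlap area between the two:

w_{sd} = \frac{A_{s \cap d}}{A_{d}}, \qquad \bar{f}_{d} = \sum_{s} w_{sd} \, f_{s}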
Structure of an Element Hole / Interior Dangling Element The hydrologic catchment elements are complex enough that it is worth spending a little time on the characteristics of these elements that caused us the most headaches. Many elements have holes, or interiors as they are sometimes called in GIS speak. You can see one there on the upper left side of the slide. These holes represent unique structures inside a catchment, like a lake or large depression, or may represent a data issue. Regardless of their origin, from a machine perspective, these holes need to be handled by the regridder and accounted for in regridding weights. Thankfully, holes may be handled using multi-geometries by “splitting” an element along an interior axis, creating two component geometries without holes. This approach dovetailed nicely with two other data challenges. First, elements were already composed of multi-geometries, an artifact from their origin as flow directions computed on logically rectangular grids. Notice the dangling element on the right side of the slide there. And second, high node counts, with some elements approaching fifty thousand nodes, required us to apply the splitting methodology generally to mesh elements to reduce maximum node counts. This created new multi-geometries in the process. This additional level of geometry splitting was needed because ESMF interpolation performance degrades as element node count increases. This is a pretty standard scaling characteristic of algorithms operating on mesh elements like these. Holes/Interiors → Split geometry on interior center and use multi-geometry implementation Multi-Geometries (Multiple Geometry Parts Counted as One Unique Feature) → Create weights for geometry parts then normalize for entire geometry High Node Count → Split geometries based on node counts and use multi-geometry regridding
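As a rough illustration of the splitting idea, here is a minimal Shapely sketch (not the OCGIS implementation): a polygon with an interior is cut along a line through the interior’s centroid, producing a multi-geometry whose parts have no holes.

from shapely.geometry import LineString, MultiPolygon, Polygon
from shapely.ops import split

# A square catchment-like element with a square hole (interior) in the middle.
element = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)],
                  [[(4, 4), (6, 4), (6, 6), (4, 6)]])

# Cut along a vertical line through the interior's centroid so each resulting
# part absorbs its half of the hole boundary into its exterior ring.
cx = element.interiors[0].centroid.x
parts = split(element, LineString([(cx, -1), (cx, 11)]))

multi = MultiPolygon(list(parts.geoms))
assert all(len(part.interiors) == 0 for part in multi.geoms)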
Results and Performance Conservative regridding result with CONUS NHDPlus catchments overlaid on analytical source field. Here are the results with a short performance report. I know you can’t see too much on the graphic there, but you can at least see the continuation of the test field’s spatial pattern. Format conversion using OCGIS took minutes running on a small number of cores. We encoded the geometries in a modified version of ESMF’s unstructured NetCDF format. It’s important to note that we did not need topology for this type of regridding. If it were needed, say for patch or second-order conservative regridding, the conversion would take longer because of the search for shared coordinates. ESMF regridding was done on Yellowstone using 512 cores and took about 17 minutes. Weight application took approximately 17 seconds to initialize, then hundredths of a second to apply per timestep. We were also pleased with the RMSE result, which is in a reasonable range given the source data resolution and the irregularity of the mesh elements. OCGIS Format Conversion - 8 cores, minutes ESMF Regridding Yellowstone Supercomputer - 512 cores, ~17 mins for weight generation Weight application - ~17 seconds initialization, ~0.01 seconds SMM Root-Mean-Square Error: 1.710e-3 Normalized Root-Mean-Square Error: 0.2 %
Part 2: Chunked Regridding Shifting gears, kind of, I’ll move on to the discussion of our approach to chunked regridding.
Motivation for Chunked Approach ESMF memory requirements for weight calculation and sparse matrix multiplication increase linearly with factor count As grid resolution and complexity increase, factor counts will increase in step, resulting in potentially outsized memory requirements for regridding operations Provided the spatial mapping may be maintained for weight calculation, source and destination grids may be chunked (split) “Maintaining spatial mapping” implies spatial relationships are indistinguishable inside the local chunks from the spatial relationships present in their global parent grids An iterative, offline approach to weight generation and sparse matrix multiplication lifts machine-memory limitations on grid resolution at the expense of computational time and ease of use Ultimate goal is to wrap index-based decompositions within a spatial decomposition framework Increasing the number of processors, and hence increasing the available total memory, will not always be a feasible solution. First, let me explain the motivation for developing a chunked approach to regridding. ESMF memory requirements for a regridding operation generally increase linearly with factor count. This includes the actual weight calculation as well as the associated sparse matrix multiplication. The sparse matrix multiplication memory requirements are related to performance optimizations for repeated weight matrix applications. And to be clear, a “factor” is the spatial interaction between a source and destination element or elements, manifesting in a single weight or factor value. Keeping all this in mind, we can anticipate that grid resolutions and spatial accuracy will continue to increase and produce outsized memory requirements now and into the future. What’s necessary is the ability to break a regridding operation into manageable chunks that maintain a spatial mapping indistinguishable locally from its global counterpart. If done appropriately, regridding can scale to virtually any spatial resolution or data configuration. In essence, the ultimate goal is to create a spatial decomposition framework that may be used in lieu of index-based decompositions. Granted, the spatial decomposition will be transformed into an index-based one behind the scenes, but this can remain hidden from the user. It is also worth noting that simply increasing the number of processors, and hence increasing the available total memory, will not always be a feasible solution. One, some jobs may require too many cores to be tractable because of compute time and expense, or the job may simply require an enormous amount of memory. And two, not everyone has easy access to HPC resources, yet these data should still be manageable and accessible to researchers and developers. For example, using this chunked regridding workflow we were able to run a regridding operation estimated to take 4 terabytes of memory on a development laptop with 8 gigabytes.
Graphical Example with Structured Grids This graphic depicts the simplest case of a spatial decomposition for two logically rectangular grids. The destination is broken into pieces, or sliced, using an index-based decomposition - red. The red chunks are then spatially buffered using a distance that ensures the regridding operation will be fully spatially mapped by the source grid - green - the spatial decomposition. The choice of this spatial buffer depends on the grids’ spatial resolutions and structure along with the regridding method. For example, patch and higher-order conservative regridding methods will require a larger spatial halo or buffer. This approach ensures that all destination cells are globally unique (no duplication of destination elements) and that no modification of weight values is required to reconstruct the destination grid in its original un-chunked form. One can see how this is easily generalized to unstructured grids, the main difference being how the spatial decomposition is implemented. Red = Destination grid slices Green = Example buffered bounding box used to subset source grid
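Here is a minimal sketch of the index-plus-buffer idea. The single-axis split, the fixed buffer distance, and the names are illustrative assumptions, not the OCGIS implementation.

import numpy as np

def chunk_destination(dst_lat, dst_lon, nchunks, buffer_deg):
    """Slice a 2-D destination grid along its first dimension (the red chunks)
    and yield each slice with a buffered bounding box (green) used to subset
    the source grid so every destination cell is spatially mapped."""
    for rows in np.array_split(np.arange(dst_lat.shape[0]), nchunks):
        if rows.size == 0:
            continue
        sl = slice(rows[0], rows[-1] + 1)
        lat, lon = dst_lat[sl, :], dst_lon[sl, :]
        bbox = (lon.min() - buffer_deg, lat.min() - buffer_deg,
                lon.max() + buffer_deg, lat.max() + buffer_deg)
        yield sl, bbox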
Development Pathway Integrate chunked regridding into the ESMF_RegridWeightGen CLI Add spatial decomposition capability to the core ESMF library and expose it in the ESMF Python interface, ESMPy The “out-of-core” memory paradigm (lazy evaluation) is a useful analog, and there is considerable ongoing work in the Python community on how to address it (biggus, dask, lama/cf-python) ESMF Homepage: https://www.earthsystemcog.org/projects/esmf/ OCGIS Homepage: https://www.earthsystemcog.org/projects/openclimategis/ Chunked Regridding Demo: https://sourceforge.net/p/esmf/external_demos/ci/master/tree/ESMF_FileRegridWFDemo/ Suggestions? Please email esmf_support@list.woc.noaa.gov if you’d like to try these workflows or have any questions What’s next for this work? Well, first, we want to get chunked regridding, and potentially some of the data manipulations for the NHD catchment regridding, integrated into the ESMF_RegridWeightGen command line interface. Right now, the tools are independent of the command line program. We’d also like to move as much of the “work” done by external GIS libraries in OpenClimateGIS into ESMF, giving codes built on ESMF access to spatial subsetting and spatial decompositions. We have been following the “out-of-core” memory paradigm in big data Python libraries with interest. There is much that can be gained from this work when generalizing to multiple dimensions with stricter decomposition requirements. Currently these tools typically decompose data along a single axis, which somewhat limits their utility with n-dimensional datasets, but they are improving rapidly. And that is the end of my presentation. Here are the links to the ESMF and OCGIS homepages along with the ESMF support email. Thank you for your time, and please don’t hesitate to contact us with questions or suggestions.
Backup Slide(s)
Workflow 1 Here is the final workflow we used. The NHD data stored in an ESRI File Geodatabase on the left is converted to ESMF Unstructured Format using OpenClimateGIS. OCGIS first converts the coordinate system from the native WGS84-like oblate spheroid datum to a spherical datum before applying the geometry preparations described in previous slides. The data is then temporarily stored in a UGRID-like ragged array data structure before converting to ESMF Unstructured Format and writing to NetCDF. ESMF is then used to generate the conservative regridding weights and apply them to the source data using sparse matrix multiplication. OCGIS is then used again for error evaluation and conversion back into ESRI formats for visual validation. All this processing and computation is done in parallel using MPI with asynchronous IO.
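For reference, the sparse matrix multiply can also be reproduced offline from the weight file itself. This hedged SciPy sketch assumes the standard ESMF/SCRIP weight file layout (variables row, col, S and dimensions n_a, n_b); the file name and the stand-in source field are illustrative, not part of the actual workflow.

import netCDF4 as nc
import numpy as np
from scipy.sparse import coo_matrix

with nc.Dataset("weights.nc") as ds:          # assumed weight file name
    row = ds.variables["row"][:] - 1          # destination indices (1-based in the file)
    col = ds.variables["col"][:] - 1          # source indices (1-based in the file)
    S = ds.variables["S"][:]                  # regridding factors
    n_src = ds.dimensions["n_a"].size
    n_dst = ds.dimensions["n_b"].size

weights = coo_matrix((S, (row, col)), shape=(n_dst, n_src)).tocsr()
src_values = np.random.rand(n_src)            # stand-in for the flattened source field
dst_values = weights.dot(src_values)          # the sparse matrix multiplication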
Workflow 2 Here is the simplified workflow for our chunked regridding approach. The source and destination grids enter on the left. OpenClimateGIS is used to “chunk” the source and destination grids into pieces. OCGIS also appends some metadata to the chunks and creates an additional indexing metadata file to assist with data reconstruction following a regridding operation. Chunks are persisted in individual NetCDF files, exchanging memory usage for disk usage. We then recursively call ESMF weight generation on the chunks, creating a series of weight files. Following weight generation there are a number of possibilities, including sparse matrix multiplication, merging and re-indexing chunked weight files, and inserting weighted chunks into a master destination file. These final steps are done using OCGIS.
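The merge and re-index step can be pictured roughly as follows. The per-chunk global index arrays and all names here are hypothetical stand-ins for the OCGIS index metadata, not its actual schema.

import netCDF4 as nc
import numpy as np

def merge_chunk_weights(chunk_weight_paths, src_global_index, dst_global_index):
    """Concatenate per-chunk ESMF weight files into one global factor set by
    mapping chunk-local element indices back to global indices."""
    rows, cols, factors = [], [], []
    for path, src_map, dst_map in zip(chunk_weight_paths, src_global_index, dst_global_index):
        with nc.Dataset(path) as ds:
            col = ds.variables["col"][:] - 1  # chunk-local source indices
            row = ds.variables["row"][:] - 1  # chunk-local destination indices
            factors.append(ds.variables["S"][:])
        cols.append(src_map[col] + 1)         # re-index to global, back to 1-based
        rows.append(dst_map[row] + 1)
    return np.concatenate(rows), np.concatenate(cols), np.concatenate(factors)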
How to store? Unstructured data stores use coordinate indexing (indirection) Multi-geometry breaks may be included using a standard flag (may also be used for interiors) Minimize indirection and indexing to accommodate parallelism This approach used the ESMF Unstructured Format as opposed to UGRID. ESMF format uses a vector for coordinate indirection where UGRID uses a rectangular array. Node thresholding, interior splitting, and multi-geometry flagging done in data conversion step node_connectivity = 0 1 2 3 4 2 coordinate_value = 10. 20. 30. 40. 50. node_connectivity = 0 1 2 -8 3 4 2 The data manipulations discussed in the previous slide, along with the general ESMF requirement to store data in NetCDF, required us to look into novel data conversion and storage approaches. I know discussions of metadata and data formats can be quite dry, but this aspect of the regridding experiment proved in some ways to be the most challenging and still does not have a definitive solution. I’ll provide an overview of our solution and try not to keep you awake at night dreaming of a better one. Unstructured data stores use indirection in the form of a coordinate index to automatically encode topology and avoid coordinate duplication, the former being the most important feature of coordinate indexing. We were able to incorporate multi-geometries quite easily by adding a geometry break flag into this coordinate index, taking advantage of the fact that the index is otherwise either zero or a positive integer. The same approach may be used for holes by adding an additional flag with a different negative value. We preferred the break value approach mostly because it fit nicely into how ESMF and general GIS software process mesh elements. That is, the elements were still atomic, and multi-geometries could be handled easily by creating an outer loop. It is relatively straightforward to normalize weights for the multi-geometry case from weights calculated from component geometries. Furthermore, we did not want to add additional indexing or counts that must be locally re-indexed or adjusted during parallel operations. In the end, we used ESMF Unstructured Format over the Unstructured Grid Format, UGRID, as coordinate indices are stored in a vector in ESMF Unstructured. The high variability in node counts caused UGRID rectangular arrays to be much larger than necessary to store coordinate information (masked, empty space in the rectangular array). Lastly, all this pre-processing was done in the data conversion step. When the regrid operation is called, it does not concern itself with polygon splitting or node thresholding.
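A small sketch of how a reader might expand one element’s connectivity using the break flag shown above (illustrative only; -8 is simply the flag value from the example):

import numpy as np

BREAK_FLAG = -8

def element_parts(node_connectivity, coordinate_value):
    """Split an element's connectivity vector on the break flag and return the
    node coordinates of each geometry part."""
    conn = np.asarray(node_connectivity)
    pieces = np.split(conn, np.where(conn == BREAK_FLAG)[0])
    return [coordinate_value[piece[piece != BREAK_FLAG]] for piece in pieces]

coords = np.array([10., 20., 30., 40., 50.])
print(element_parts([0, 1, 2, -8, 3, 4, 2], coords))  # [[10. 20. 30.], [40. 50. 30.]]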
Subsetting / Spatial Decomposition Challenges Maintain independent spatial mask - is it a masked value because of data issues or spatial position? Coordinate indices must be re-indexed to persist a spatial subset - requires a flurry of communication in parallel Label-based slicing and mask cascades are critical → Once a spatial slice and associated mask is created, it must be applied across all subset targets with shared dimensions Delayed loading of “payload” data greatly increases performance and lowers memory usage Difficult to calculate memory requirements except in very controlled conditions MPI → More difficult to implement but offers the necessary communication solution for distributed slicing (slicing-in-parallel), re-indexing, and asynchronous IO Before talking about next steps, I want to reflect on the major challenges we encountered while developing the spatial decomposition workflow and the tools necessary to spatially subset structured and unstructured Earth science grids. First, when a mask is related to a spatial overlay, how do we differentiate between a mask related to spatial position and masks in place prior to a spatial subset? We added a separate variable to house this mask. To our knowledge, current metadata conventions do not account for ancillary masks such as this. Second, coordinate indices are difficult to subset and operate on in parallel due to processor-local versus global indexing. Working with these data often requires a lot of MPI communication to distribute and reduce appropriately. This is the best reason to avoid additional indexing layers in data formats whenever possible. Third, label-based slicing and mask cascade operations are essential to account for the data varieties encountered in the wild, with strange dimension orderings and “novel” metadata structures. The transition from direct indexing to labels makes for more robust data processing but does add performance penalties in tight loops. Fourth, working with coordinates in memory for subsetting greatly reduces memory overhead when working with big payload data variables. In this way, dimension objects track source indices in addition to describing array shapes. The payload data may be loaded on demand following a subset, increasing performance by avoiding unnecessary IO. It is difficult to calculate memory requirements except in very controlled conditions or with a lot of combinatorial profiling. Lastly, when it comes to memory-limited operations, the utility of MPI is somewhat in question. This is mostly due to the implementation overhead of MPI and the fact that MPI works best in multi-core environments and is tuned for performance. It would be hard to remove MPI entirely due to the communication toolkit it provides. How to best merge MPI tools with streaming IO and big memory requirements, while also meeting performance requirements, is a challenge facing many Earth system software packages.
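To illustrate the delayed-loading point, here is a simplified sketch with assumed file and variable names and assumed 2-D coordinate arrays; it is not the OCGIS dimension machinery. Coordinates are read and subset first, and the large payload variable is only read through the resulting source indices.

import netCDF4 as nc
import numpy as np

def bbox_slices(lat, lon, bbox):
    """Return row/column slices covering the grid cells inside
    bbox = (minx, miny, maxx, maxy)."""
    minx, miny, maxx, maxy = bbox
    rows, cols = np.where((lon >= minx) & (lon <= maxx) &
                          (lat >= miny) & (lat <= maxy))
    return slice(rows.min(), rows.max() + 1), slice(cols.min(), cols.max() + 1)

with nc.Dataset("forcing.nc") as ds:          # assumed file and variable names
    lat = ds.variables["lat"][:]              # small coordinate arrays held in memory
    lon = ds.variables["lon"][:]
    rsl, csl = bbox_slices(lat, lon, (-105.5, 39.5, -104.5, 40.5))
    # The big payload variable is read only through the tracked source indices.
    payload = ds.variables["pr"][:, rsl, csl]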