Modeling & Analyzing Massive Terrain Data Sets Pankaj K. Agarwal Duke University
STREAM Project: Scalable Techniques for hi-Resolution Elevation data Analysis & Modeling. In collaboration with: Lars Arge, Helena Mitasova. Students: Andrew Danner, Thomas Molhave, Ke Yi
Diverse elevation point data: density, distribution, accuracy. Photogrammetry: 0.76m vertical accuracy (5ft contours). Lidar: 0.15m vertical accuracy; altitude 700m and 2300m. RTK-GPS: 0.10m vertical accuracy. [Figure panels: 1974, 1995, 1998, 1999, 2001, 2004]
Constructing Digital Elevation Models Grid, TIN, Contour DEMs
Natural feature extraction. Maps of topo parameters: computed simultaneously with interpolation using partial derivatives of the RST function. Terrain features: combined thresholds of topo parameters. Extracted foredunes 1999-2004 using profile curvature and elevation threshold
Tracking evolving features. Slip faces: slope > 30deg; new slip face in 1999 [Figure panels: 1995, 1999, 2001]. Dune crests: profile curvature threshold. How to automatically track features that change some of their properties over time?
Hydrology Analysis Flow Analysis Watershed hierarchy
Challenge: Massive Data Sets. LIDAR: NC Coastline: 200 million points – over 7 GB; Neuse River basin (NC): 500 million points – over 17 GB. Output is also large: 10ft grid: 10GB; 5ft grid: 40GB. Storage is not the problem. Processing is.
Approximation Algorithms Exact computation expensive Many practical problems are intractable Multiple & often conflicting optimization criteria Suffices to find a near-optimal solution Tunable Approximation algorithms Tradeoff between accuracy and efficiency User specifies tolerance
I/O-Bottleneck. Data resides in secondary memory. Disk access is 10^6 times slower than main memory access. Maximize useful data transferred with each access. Amortize large access time by transferring large contiguous blocks of data. I/O – not CPU – is often the bottleneck for processing massive data sets. [Figure: disk with track, magnetic surface, read/write arm, read/write head]
I/O-efficient Algorithms [AV88]. Traditional algorithms optimize CPU computation; not sensitive to penalty of disk access. I/O model: memory is finite (size M); data is transferred in blocks (size B); complexity measured in disk blocks transferred. [Figure: Disk, RAM, CPU, with block size B and memory size M] Virtual memory helps sometimes, but not always; depends on replacement policy, working set size. B ~ 2K-8K
Elevation Data to Watershed Hierarchies: A Pipeline Approach
Point Cloud to Grid DEM Given a point cloud sampled from a bivariate function z=F(x,y) (elevation data) Goal: construct a grid DEM for z DEM Applications Hydrology Ecology Viewsheds … Many algorithms only work on grid data
Sub-meter resolution: improved structures representation 2004 lidar-based 30cm resolution DSM 1999 lidar 2m resolution DSM 2004 lidar: 0.5m resolution binning interpolated DSM Binning (within cell average) sufficient for ~1-3m DEMs but interpolation produces improved geometry representation and allows us to create higher resolution DEMs/DSMs
General Approach: Interpolation. Kriging, IDW, Splines, … Interpolate z=F(x,y) from N points: O(N^3). Modern data sets can have millions of points. Direct interpolation does not scale
Improving Interpolation with Quad Trees. Segment initial point set using a quad tree. Each quad tree leaf has < K points (K = 10-100). Interpolate each segment separately: N/K segments, O(K^3) interpolation per segment, O(K^2 N) total running time, linear in N. Boundary smoothness? Use points from nearby segments
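The segmentation step can be sketched in a few lines of Python. This is an in-memory illustration only, not the bulk-loaded external quad tree used in the talk; the name `build_quadtree`, the `(x, y, z)` tuple layout, and the `max_depth` guard against coincident points are assumptions made for the example.

```python
def build_quadtree(points, bounds, k, depth=0, max_depth=32):
    """Split `points` until every leaf holds fewer than k points.

    points: list of (x, y, z) tuples; bounds: (x0, y0, x1, y1).
    Returns a list of leaves, each a (bounds, points) pair.
    max_depth is a hypothetical guard so coincident points cannot recurse forever.
    """
    if len(points) < k or depth >= max_depth:
        return [(bounds, points)]
    x0, y0, x1, y1 = bounds
    xm, ym = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    buckets = [[], [], [], []]
    for p in points:
        # Quadrant index: bit 0 set for the east half, bit 1 set for the north half.
        buckets[(p[0] >= xm) + 2 * (p[1] >= ym)].append(p)
    child_bounds = [
        (x0, y0, xm, ym), (xm, y0, x1, ym),
        (x0, ym, xm, y1), (xm, ym, x1, y1),
    ]
    leaves = []
    for child, pts in zip(child_bounds, buckets):
        if pts:  # empty quadrants produce no segment to interpolate
            leaves.extend(build_quadtree(pts, child, k, depth + 1, max_depth))
    return leaves
```

Each returned leaf is one segment to interpolate; since every leaf holds fewer than K points, the per-segment cost stays bounded regardless of N.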
Overview of the Algorithm Construct quad-tree on disk using bulk loading [AAPV 01] For each quad-tree node, find points in neighboring segments Arrange points on disk for sequential interpolation Use the algorithm by Mitasova, Mitas, Harmon Can use any other interpolation method
Building an External Quad Tree. Build top few levels in memory. Buffer points in lower levels. Write buffers to disk when full. [Figure: example with K=2; block I/O]
Building a TIN. TIN: Triangulated Irregular Network. Constrained Delaunay Triangulation. Developed an I/O-efficient algorithm. Builds TIN for Neuse River Basin (14GB) in 7.5 hours
Future Work: Hierarchical DEMs Considerable work in graphics and GIS communities on multiresolution hierarchies for terrains Focuses on visualization Most algorithms perform random memory access Out-of-core simplification I/O-efficient algorithms for terrain simplification (edge contraction method) Feature preserving simplification Preserves main river networks Preserves visibility for most of the points
Coping with Noisy Data. Nearly all flow models assume [O’Callaghan and Mark 84, de Berg et al. 96] that water flows downwards and stops at a minimum
Denoising via Flooding [Jenson and Domingue 88, Arge et al. 03] Pour water uniformly on terrain Water level rises until steady-state
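A small in-memory sketch of what flooding produces on a grid DEM. It uses a priority-queue fill, which is a different technique from the I/O-efficient algorithm of [Arge et al. 03]. Assumptions made for the example: the terrain is a list of rows, water drains freely off the boundary cells, and cells are 8-connected. The name `flood_fill_sinks` is hypothetical.

```python
import heapq

def flood_fill_sinks(grid):
    """Raise every interior sink to the level at which water spills out.

    Assumes water leaves the terrain through the boundary cells.
    Returns a new grid; the input is not modified.
    """
    rows, cols = len(grid), len(grid[0])
    filled = [row[:] for row in grid]
    visited = [[False] * cols for _ in range(rows)]
    heap = []
    # Seed the queue with the boundary, where the water level is known.
    for r in range(rows):
        for c in range(cols):
            if r in (0, rows - 1) or c in (0, cols - 1):
                heapq.heappush(heap, (grid[r][c], r, c))
                visited[r][c] = True
    # Grow inward from the lowest known water level.
    while heap:
        level, r, c = heapq.heappop(heap)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and not visited[nr][nc]:
                    visited[nr][nc] = True
                    # A cell can never end up lower than the lowest path out.
                    filled[nr][nc] = max(grid[nr][nc], level)
                    heapq.heappush(heap, (filled[nr][nc], nr, nc))
    return filled
```

After filling, every cell has a non-ascending path to the boundary, so flow routing no longer gets trapped in spurious minima.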
River Network After Flooding Sort(N) I/Os
Denoising via Flooding [Jenson and Domingue 88, Arge et al. 03]. Pour water uniformly on terrain. Water level rises until steady-state. Use topological persistence (Edelsbrunner et al.) to measure the importance of a minimum. Flood low-persistence minima, preserve high-persistence minima
Topological Persistence [Edelsbrunner et al. 02]
The Union-Find Problem A universe of N elements: x1, x2, …, xN Initially N singleton sets: {x1}, {x2 }, …, {xN} Each set has a representative Maintain the partition under Union(xi, xj) : Joins the sets containing xi and xj Find(xi) : Returns the representative of the set containing xi
Formulated as Batched Union-Find. On a terrain represented as a TIN. When reaching a minimum or maximum: do nothing. A regular point u: issue union(u,v) for a lower neighbor v. A saddle u: let v and w be nodes from u’s two connected pieces in its lower link; issue find(v), find(w), union(u,v), union(u,w). Sequence of union/find operations can be computed in advance. [Figure: lower link of a vertex]
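A minimal sketch of how such a sweep pairs minima with the saddles that merge them. It runs on a plain graph (the vertices and edges of a TIN would do) and uses an ordinary in-memory union-find, not the batched I/O-efficient algorithm. Assumptions: all heights are distinct, and the name `minima_persistence` and the `(heights, edges)` input layout are made up for the example.

```python
def minima_persistence(heights, edges):
    """Pair each non-global minimum with the saddle height where it merges.

    heights: list of distinct vertex heights; edges: list of (a, b) index pairs.
    Returns (minimum_height, saddle_height) pairs in sweep order.
    """
    n = len(heights)
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    parent = list(range(n))
    lowest = list(range(n))  # lowest[root] = lowest vertex of that component

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    processed = [False] * n
    pairs = []
    for u in sorted(range(n), key=lambda v: heights[v]):  # sweep upward
        processed[u] = True
        for v in adj[u]:
            if not processed[v]:
                continue
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            mu, mv = lowest[ru], lowest[rv]
            young, old = (mu, mv) if heights[mu] > heights[mv] else (mv, mu)
            if young != u:
                # Two existing components meet at u: the younger minimum dies.
                pairs.append((heights[young], heights[u]))
            parent[ru] = rv
            lowest[rv] = old
    return pairs
```

The persistence of a minimum is the difference between the two heights in its pair; minima whose persistence falls below a chosen threshold are the candidates for flooding.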
Classical Union-Find Algorithm. [Figure: sets stored as trees rooted at their representatives; Union(d, h) uses link-by-rank; Find(n) uses path compression]
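For reference, the classical in-memory structure with link-by-rank and path compression can be written as follows. This is the textbook version, shown only to make the two heuristics concrete; the class name `UnionFind` and the use of integer element ids are choices made for the example.

```python
class UnionFind:
    """Disjoint sets over elements 0..n-1 with link-by-rank and path compression."""

    def __init__(self, n):
        self.parent = list(range(n))  # every element starts as its own representative
        self.rank = [0] * n

    def find(self, x):
        # First pass: locate the representative.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: path compression, point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        # Link-by-rank: hang the shallower tree under the deeper one.
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return ra
```

Both heuristics follow parent pointers to arbitrary memory locations, which is the access pattern that becomes expensive once the data no longer fits in main memory.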
Complexity. O(N α(N)) for a sequence of N union and find operations [Tarjan 75]. α(•): inverse Ackermann function (grows very slowly!). Optimal in the worst case [Tarjan 79, Fredman and Saks 89]. Batched (off-line) version: entire sequence known in advance. Can be improved to linear on RAM [Gabow and Tarjan 85]. Not possible on a pointer machine [Tarjan 79]
Our Results. An I/O-efficient algorithm for the batched union-find problem using O(sort(N)) = O((N/B) log_{M/B}(N/B)) I/Os; same as sorting, optimal in the worst case. A simpler algorithm using O(sort(N) log(N/M)) I/Os; implemented. Apply to persistence computation, flooding: O(sort(N)) time. Other applications
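To get a feel for the sort(N) bound, the sketch below evaluates scan(N) = N/B and sort(N) = (N/B) log_{M/B}(N/B) with constants dropped. The function names, and measuring N, M and B in items rather than bytes, are assumptions for the illustration; the numbers it yields are order-of-magnitude estimates, not measured I/O counts.

```python
def scan_ios(n, b):
    """Blocks touched by one sequential pass over n items with block size b."""
    return -(-n // b)  # ceiling division

def sort_ios(n, m, b):
    """Estimate of (n/b) * log_{m/b}(n/b), ignoring constant factors.

    n: number of items, m: items that fit in memory, b: items per block.
    """
    blocks = scan_ios(n, b)
    fanout = max(2, m // b)  # how many runs one merge pass can combine
    passes, reach = 0, 1
    while reach < blocks:    # integer loop avoids floating-point log rounding
        reach *= fanout
        passes += 1
    return blocks * max(1, passes)
```

With N = 10^9, M = 10^6 and B = 10^3, the estimate is 2 x 10^6 block transfers, compared with up to 10^9 transfers if every operation triggered its own random disk access.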
I/O-Efficient Batched Union-Find Two-stage algorithm Convert to interval union-find Compute an order on the elements s.t. each union joins two adjacent sets Solve batched interval union-find Each stage formulated as a geometric problem and solved using sort(N) I/Os
Experimental Results. Traditional algorithm good as long as data fits in memory
Experimental Results: Neuse River Basin. 0.5 billion points (14GB raw data), sampled to obtain smaller data sets. On the entire data set: EM (external-memory algorithm) takes 5 hours, IM (internal-memory algorithm) fails
Contour Trees
Previous Results Directly maintain contours O(N log N) time [van Kreveld et al. 97] Needs union-split-find for circular lists Do not extend to higher dimensions Two sweeps by maintaining components, then merge O(N log N) time [Carr et al. 03] Extend to arbitrary dimensions
Join Tree and Split Tree [Carr et al. 03]. Qualified nodes. [Figure: join tree and split tree on nodes 1-9]
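A hedged sketch of the sweep that builds a join tree with union-find, in the spirit of [Carr et al. 03]. Assumptions: the terrain is given as a graph with distinct vertex heights, the sweep runs from high to low, and the result is the fully augmented tree (every vertex appears as a node). The name `join_tree` is hypothetical. Running the same function on negated heights gives the corresponding split tree under this convention.

```python
def join_tree(heights, edges):
    """Return augmented join-tree edges as (upper_vertex, lower_vertex) pairs.

    heights: list of distinct vertex heights; edges: list of (a, b) index pairs.
    """
    n = len(heights)
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    parent = list(range(n))
    lowest = list(range(n))  # lowest[root] = lowest vertex swept so far in that component

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    seen = [False] * n
    tree_edges = []
    for v in sorted(range(n), key=lambda i: heights[i], reverse=True):
        seen[v] = True
        for u in adj[v]:
            if not seen[u]:
                continue
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            # The component above u now continues downward through v.
            tree_edges.append((lowest[ru], v))
            parent[ru] = rv
            lowest[rv] = v
    return tree_edges
```

A vertex that receives two or more incoming edges is a point where separate components join.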
Final Contour Tree. Hard to BATCH! [Figure: join tree, split tree, and the resulting contour tree on nodes 1-9]
Another Characterization. Let w be the highest node that is a descendant of v in join tree and ancestor of u in split tree; (u, w) is a contour tree edge. Now can BATCH! [Figure: join tree, split tree, and contour tree on nodes 1-9, with u, v, w marked]
Experimental Results On the Neuse River Basin data set
Coping with Noise: Future Work Constructing hydrologically conditioned DEMs Formulate as an optimization problem Using persistent homology Integrating topology and geometry Handling flat areas Scalability remains a big issue Handling anthropogenic features
Back to Pipeline: Flow Modeling
From Elevation to River Networks. Where does water go? From higher elevation to lower elevation. Single flow directions form a tree. [Figure: nodes with elevations 80, 90, 100, 95, 85, 110 draining to the sea]
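One common way to assign a single flow direction on a grid DEM is steepest descent among the eight neighbors. The sketch below assumes that rule, unit cell spacing, and a hypothetical function name `flow_directions`; the slides do not say which routing rule is used.

```python
import math

def flow_directions(grid):
    """Map each cell (r, c) to the neighbor it drains to, or None for a sink.

    Uses steepest descent over the 8 neighbors, with drop divided by distance.
    """
    rows, cols = len(grid), len(grid[0])
    direction = {}
    for r in range(rows):
        for c in range(cols):
            best, best_slope = None, 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nr, nc = r + dr, c + dc
                    if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols:
                        # Diagonal neighbors are farther away, so divide by distance.
                        slope = (grid[r][c] - grid[nr][nc]) / math.hypot(dr, dc)
                        if slope > best_slope:
                            best, best_slope = (nr, nc), slope
            direction[(r, c)] = best
    return direction
```

Because every cell points to at most one strictly lower neighbor, following the pointers can never cycle, which is why the flow directions form a forest of trees rooted at the sinks.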
Drainage Area. How much area is upstream of each node? Each node has initial drainage area (1). Drainage area of internal nodes depends on drainage area of children. [Figure: river tree annotated with drainage areas]
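The recurrence can be evaluated by visiting nodes from highest to lowest, so that each node's area is final before it is passed downstream. This is a minimal in-memory sketch; the `downstream` array (index of the node each node drains to, or `None` at an outlet) and the function name are assumptions for the example.

```python
def drainage_area(heights, downstream):
    """Return the drainage area of every node in a flow tree.

    heights[i]: elevation of node i; downstream[i]: node that i drains to,
    or None at an outlet. Assumes each node drains to a strictly lower node.
    """
    area = [1] * len(heights)  # every node contributes one unit of its own
    order = sorted(range(len(heights)), key=lambda v: heights[v], reverse=True)
    for u in order:
        d = downstream[u]
        if d is not None:
            area[d] += area[u]  # u is final here: everything upstream is higher
    return area
```

On large terrains the sort by elevation is the step an external-memory implementation has to make I/O-efficient.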
Drainage Area example (Panama)
IFSARE and SRTM based stream analysis Stream and watershed boundary extraction for Panama for on-going water quality research to provide more detailed and up-to-date data. SRTM - 7400x3600 DSM at 90m resolution for entire Panama, IFSARE - 10800x11300 DEM at 10m resolution for the Panama canal section Using r.terraflow - officially released in GRASS6.2 in 2006 - streams can be extracted in 3-4 hours, tile-based approach takes days SRTM DEM for entire Panama with main rivers extracted from DEM in 3 hr
Hierarchical Watershed Decomposition
Watershed Hierarchies Decompose a river network into a hierarchy of hydrological units All water in HU flows to a common outlet Hierarchy provides tunable level of detail Method used: Pfafstetter [VV99] Want a solution scalable to large modern hi-res terrains
Pfafstetter. Find main river. Find four largest tributaries. Label basins/interbasins. Recurse until single path. [Figure: river network with drainage areas; tributary basins labeled 2, 4, 6, 8 and interbasins labeled 1, 3, 5, 7, 9]
Recurse. [Figure: basins 2 and 7 subdivided into 21-27 and 71-75]
Example Watershed Boundaries. [Figure: watersheds labeled 1-9, 91-99, 991-999]
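One level of the labeling can be sketched as follows, assuming the main river has already been found and each tributary is described by the position of its junction along the main river (increasing upstream from the outlet) and its drainage area. The function names and this input layout are assumptions for the example, as is the choice to place a junction node itself in the interbasin upstream of it.

```python
from bisect import bisect_right

def label_basins(tributaries):
    """Assign labels 2, 4, 6, 8 to the four largest tributaries.

    tributaries: list of (junction_position, drainage_area) pairs.
    Returns {junction_position: label}, numbered from the outlet upstream.
    """
    largest = sorted(tributaries, key=lambda t: t[1], reverse=True)[:4]
    junctions = sorted(j for j, _ in largest)
    return {j: 2 * (k + 1) for k, j in enumerate(junctions)}

def interbasin_label(position, basin_labels):
    """Label (1, 3, 5, 7 or 9) of the interbasin containing a main-river position."""
    junctions = sorted(basin_labels)
    # Each labeled junction at or below the position moves the label up by two.
    return 1 + 2 * bisect_right(junctions, position)
```

Recursing inside a basin or interbasin repeats the same step and appends one more digit to the label, which is how labels such as 71 to 75 arise.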
A Complete Pipeline
Implementation TPIE: C++ primitives for I/O-efficient algorithms GRASS: Open Source GIS Interpolation: Regularized spline with tension (in GRASS) Data: North Carolina LIDAR Neuse river basin: 400 million points (NC Floodmaps) Outer banks coastal data : 128 million points (NOAA CSC) USGS 30m NED
Grid Construction Results (Neuse)

Resolution (ft)          40       20       10
Grid cells (x10^6)      221      885     3542
Points (x10^6)          205      340      415
Total time           12h32m   14h46m   26h52m
Time spent (%)
  Build quad tree       8.9      7.1      5.7
  Find neighbors       31.6     32.4     29.1
  Interpolate          58.8     58.5     59.3
  Write output          0.7        2      5.9

Point thinning. Adjusted interpolation parameters
Sample Watershed Results
size (MB): 150, 713, 5819
size (mln cells): 30.8, 147, 396.5
total time: 10m29s, 58m10s, 187m43s
Time spent… %
  importing data: 8, 7, 16
  sorting by flow: 15, 13
  building river list: 31, 35, 29
  sorting river list: 19, 20
  computing labels: 6
  sort by grid order: 14, 12
  exporting data: 5, 4
Future Work Terrain Modeling Terrain Analysis
Multivalent terrain models multiple surfaces hybrid raster and 3D vector for structured environments (urban) voxel for unstructured environments (forested areas) Multiple surface representation: DSM and DEM
Terrain analysis on DEMs with structures. Investigate multivalent elevation field representations that preserve road and stream connectivity
Dynamic Data Continuous data streams from various sensors Maintain smart summaries of data Communication, energy efficient algorithms for sensor networks Coping with moving/deformable objects Tracking evolving terrain features Kinetic algorithms that update the structure locally
Terrain Analysis. Flow modeling: tracking evolving features, erosion analysis, soil moisture modeling (collaboration with ecologists). Visibility based navigation: finding paths from which most of the terrain is visible, computing well concealed paths, covering terrains from a few points
Acknowledgments. Collaborators: Lars Arge, Helena Mitasova, Andrew Danner, Thomas Molhave, Ke Yi. Sponsored by Army Research Office W911NF-04-1-0278
Relevant Publications
Danner, Yi, Molhave, Agarwal, Arge, Mitasova. TerraStream: From Elevation Data to Watershed Hierarchies. Submitted for publication, 2007.
Agarwal, Arge, Danner. From point cloud to grid DEM: A scalable approach. In Proc. 12th SSDH, 771–788, 2006.
Agarwal, Arge, Yi. I/O-efficient construction of constrained Delaunay triangulations. In Proc. ESA, 355–366, 2005.
Agarwal, Arge, Yi. I/O-efficient batched union-find and its applications to terrain analysis. In Proc. 22nd SoCG, 2006.
Arge, Danner, Haverkort, Zeh. I/O-efficient hierarchical watershed decomposition of grid terrain models. In Proc. 12th SSDH, 825–844, 2006.
Danner. I/O Efficient Algorithms and Applications in Geographic Information Systems. PhD thesis, Duke University, 2006.
Thanks!
Future Directions – Noise Removal Bridge detection/removal Other flow routing methods Flow routing on flat surfaces Comparing flow networks
Mapping LIDAR point density

Year   No. of points / 2m grid cell   Sensor
1996    0.2
1997    0.9                           ATM II NASA
1998    0.4
1999    1.4
2001    0.2                           NCflood Leica
2003    2.0                           EARL NASA
2004   15.0                           SHOALS USACE
2005    6.0                           SHOALS USACE

[Figure panels: point density at the beginning of the project; current industry standard; 1998 and 2004, pt / 2m grid cell; 2001: NC Flood; 2004: SHOALS]
Building an External Quad Tree levels in one scan I/Os total After first scan of data: Write top of tree to disk Flush all buffers Recurse on buffers What is in memory what is on disk
Coping with Noisy Data Identifying minima likely due to noise Don’t want to remove real features Persistent homology [ELZ 02] Computed in Sort(N) I/Os
Interpolation. Scan through list of points with neighbors. Interpolate each leaf separately – use any interpolation method. Evaluate at grid cells. Write grid cell values as (i,j,z) as they are computed. Sort grid cells by raster order. [Figure: Interpolate, Evaluate, emit (i, j, z), Sort]
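The evaluate-then-sort step can be sketched as below. Since any interpolation method may be plugged in, the example uses inverse distance weighting (IDW) as a simple stand-in for the regularized spline with tension used in the implementation. The function names, the `(points, cells)` segment layout, and placing cell centers at `(j + 0.5, i + 0.5) * cell_size` are assumptions for the illustration.

```python
def idw(points, x, y, power=2.0):
    """Inverse-distance-weighted estimate of z at (x, y) from (px, py, pz) samples."""
    num = den = 0.0
    for px, py, pz in points:
        d2 = (px - x) ** 2 + (py - y) ** 2
        if d2 == 0.0:
            return pz  # query coincides with a sample point
        w = 1.0 / d2 ** (power / 2.0)
        num += w * pz
        den += w
    return num / den

def interpolate_segments(segments, cell_size):
    """Evaluate every segment at its grid cells and return (i, j, z) in raster order.

    segments: list of (points, cells), where cells are (row i, column j) indices.
    """
    triples = []
    for points, cells in segments:
        for i, j in cells:
            x, y = (j + 0.5) * cell_size, (i + 0.5) * cell_size
            triples.append((i, j, idw(points, x, y)))
    # Segments arrive in quad-tree order; one sort restores row-major raster order.
    triples.sort(key=lambda t: (t[0], t[1]))
    return triples
```

Emitting (i, j, z) triples and sorting once at the end keeps the per-segment work sequential, so the only step that reorders data is the final sort.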