Efficient Algorithms for Large-Scale GIS Applications Laura Toma Duke University

Why GIS? How it all started: Duke environmental researchers found that computing flow accumulation for the Appalachian Mountains took 14 days (with 512MB memory) on an 800km x 800km terrain at 100m resolution, i.e. ~64 million points. GIS (Geographic Information Systems): systems that handle spatial data -- visualization, processing, queries, analysis… A rich area of problems for computer science: graphics, graph theory, computational geometry, scientific computing…

GIS and the Environment. An indispensable tool. Monitoring: keep an eye on the state of earth systems using satellites and monitoring stations (water, pollution, ecosystems, urban development,…). Modeling and simulation: predict the consequences of human actions and natural processes. Analysis and risk assessment: find the problem areas and analyze their possible causes (soil erosion, groundwater pollution,…). Planning and decision support: provide information and tools for better management of resources. [Figures: precipitation in tropical South America; nitrogen concentrations in Chesapeake Bay]

GIS and the Environment: Bald Head Island renourishment; sediment flow.

Computations on Terrains. Reality: the elevation of a terrain is a continuous function of two variables, h(x,y). We estimate, predict, and simulate: flooding, pollution, erosion, deposition, vegetation structure, …. GIS: a DEM (Digital Elevation Model) is a set of sample points and their heights, { (x, y, h_xy) }; we model the terrain and compute indices on the DEM.

DEM Representations: TIN, grid, contour lines, sample points.

Modeling Flow on Terrains. What happens when it rains? Predict areas susceptible to floods, predict the location of streams, compute watersheds. Flow is modeled using two basic attributes: Flow Direction (FD), the direction water flows at a point, and Flow Accumulation (FA), the total amount of water that flows through a point (if water is distributed according to the flow directions).

DEM and Flow Accumulation [Panama]

Uses. Flow direction and flow accumulation are used for: computing other hydrological attributes (river network, moisture indices, watersheds and watershed divides); analysis and prediction of sediment and pollutant movement in landscapes; decision support in land management, flood and pollution prevention, and disaster management.

Massive Terrain Data. Remote sensing technology produces massive amounts of terrain data at increasingly higher resolutions (1km, 100m, 30m, 10m, 1m,…). NASA SRTM: mission launched in 2001; acquired data for 80% of the earth at 30m resolution (about 5TB). USGS: most of the US at 10m resolution. LIDAR: 1m resolution.

Example: LIDAR Terrain Data. Massive (irregular) point sets (1-10m resolution), relatively cheap and easy to collect. Example: Jockey's Ridge (NC coast).

It's Growing! Appalachian Mountains: the area is approx. 800 km x 800 km. Sampled at 100m resolution: 64 million points (128MB). At 30m resolution: ~640 million points (1.2GB). At 10m resolution: 6,400 million = 6.4 billion points (12GB). At 1m resolution: ~640 billion points (1.2TB).

Computing on Massive Data. GRASS (open-source GIS): killed after running for 17 days on a 6700 x 4300 grid (approx. 50 MB dataset). TARDEM (research, U. Utah): killed after running for 20 days on an approx. 240 MB grid dataset; CPU utilization 5%, 3GB swap file. ArcInfo (ESRI, commercial GIS): can handle the 240MB dataset, but doesn't work for datasets bigger than 2GB.

Outline  Introduction  Flow direction and flow accumulation Definitions, assumptions, algorithm outline  Scalability to large terrains I/O-model I/O-efficient flow accumulation TerraFlow  I/O-efficient graph algorithms  Conclusion and future work

Flow Direction (FD) on Grids. Water flows downhill, following the gradient. On grids, the gradient is approximated using the 3x3 neighborhood of each cell. SFD (Single-Flow Direction): the FD points to the steepest downslope neighbor. MFD (Multiple-Flow Direction): the FD points to all downslope neighbors.
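
A minimal C++ sketch of the per-cell SFD rule described above, assuming a row-major elevation array of nrows x ncols cells; the function name and direction encoding are illustrative, not TerraFlow's actual interface.

#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative SFD computation: for cell (r, c), return the index (0..7) of the
// steepest downslope neighbor in the 3x3 neighborhood, or -1 if there is none
// (flat cell or local minimum). Drops to diagonal neighbors are divided by
// sqrt(2) so that slopes, not raw drops, are compared.
int steepest_descent_neighbor(const std::vector<float>& elev,
                              int nrows, int ncols, int r, int c) {
    const int dr[8] = {-1, -1, -1, 0, 0, 1, 1, 1};
    const int dc[8] = {-1, 0, 1, -1, 1, -1, 0, 1};
    double h = elev[(std::size_t)r * ncols + c];
    double best_slope = 0.0;                       // only strictly downslope neighbors qualify
    int best = -1;
    for (int k = 0; k < 8; k++) {
        int nr = r + dr[k], nc = c + dc[k];
        if (nr < 0 || nr >= nrows || nc < 0 || nc >= ncols) continue;
        double drop = h - elev[(std::size_t)nr * ncols + nc];
        double dist = (dr[k] != 0 && dc[k] != 0) ? std::sqrt(2.0) : 1.0;
        double slope = drop / dist;                // positive means downslope
        if (slope > best_slope) { best_slope = slope; best = k; }
    }
    return best;                                   // -1: a flat area or sink, handled separately
}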

Flow accumulation with MFD

Flow accumulation with SFD

Computing FD. Goal: compute the FD for every cell in the grid (the FD grid). Algorithm: for each cell, compute SFD/MFD by inspecting its 8 neighbor cells. Analysis: O(N) time for a grid of N cells. Is this all? No! Flat areas, i.e. plateaus and sinks, have no downslope neighbor.

FD on Flat Areas: …no obvious flow direction. Plateaus: assign flow directions such that each cell flows towards the nearest spill point of the plateau. Sinks: either catch the water inside the sink, or route the water out of the sink using uphill flow directions; model the steady state of water and remove (fill) sinks by simulating flooding, i.e., uniformly pouring water on the terrain until a steady state is reached; then assign uphill flow directions on the original terrain by assigning downhill flow directions on the flooded terrain.
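
A minimal sketch of the plateau rule above (each plateau cell drains toward its nearest spill point), assuming the plateau cells and their spill points have already been identified; it runs a multi-source BFS from the spill points and points every cell at the neighbor through which it was first reached. The data layout and names are illustrative, and a real implementation on massive grids would of course have to do this I/O-efficiently.

#include <queue>
#include <vector>

// Illustrative plateau routing by multi-source BFS. For each plateau cell v,
// dir[v] receives the index of the neighbor it should flow to, so that every
// cell drains toward its nearest spill point (BFS distance = grid distance).
// neighbors[v] is assumed to list the neighbors of v inside the same plateau.
void route_plateau(const std::vector<int>& spill_points,
                   const std::vector<std::vector<int>>& neighbors,
                   std::vector<int>& dir) {
    std::vector<bool> visited(dir.size(), false);
    std::queue<int> q;
    for (int s : spill_points) {          // start the BFS from all spill points at once
        visited[s] = true;
        q.push(s);
    }
    while (!q.empty()) {
        int u = q.front(); q.pop();
        for (int v : neighbors[u]) {
            if (!visited[v]) {
                visited[v] = true;
                dir[v] = u;               // v drains toward the cell it was reached from
                q.push(v);
            }
        }
    }
}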

Flow Accumulation (FA) on Grids. FA models water flow through each cell under "uniform rain": initially there is one unit of water in each cell; water is distributed from each cell to the neighbors pointed to by its FD (flow conservation); if there are several flow directions, water is distributed proportionally to the height difference. The flow accumulation of a cell is the total flow through it. Goal: compute FA for every cell in the grid (the FA grid).
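
One common way to make the proportional split precise (a hedged formalization; actual systems may also weight by neighbor distance or apply an exponent to the slopes): for a cell c with downslope neighbors n_1, ..., n_k,

\[
w_i \;=\; \frac{h(c)-h(n_i)}{\sum_{j=1}^{k}\bigl(h(c)-h(n_j)\bigr)}, \qquad \sum_{i=1}^{k} w_i = 1,
\]
\[
\mathrm{FA}(c) \;=\; 1 \;+\; \sum_{u\,:\,c\in\mathrm{FD}(u)} w_{u\to c}\,\mathrm{FA}(u),
\]

where w_{u→c} is the fraction of u's flow sent to c; the leading 1 is the cell's own unit of rain, and the sum is the water received from upslope neighbors.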

Computing FA. The FD graph: a node for each cell and a (directed) edge from cell a to cell b if the FD of a points to b. The FD graph must be acyclic: this holds automatically on slopes, but care is needed on plateaus. The FD graph depends on the FD method used: the SFD graph is a tree (or a set of trees); the MFD graph is a DAG (or a set of DAGs).

Computing FA: Plane Sweeping. Input: the flow direction grid FD. Output: the flow accumulation grid FA (initialized to 1). Process cells in topological order; for each cell, read its flow from the FA grid and its direction from the FD grid, and update the flow of its downslope neighbors (all neighbors pointed to by the cell's flow direction). Correctness: one sweep is enough, since every cell is finalized before the cells it flows into. Analysis: O(sort) + O(N) time for a grid of N cells. Note: topological order means decreasing height order (since water flows downhill). A sketch follows.
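
A minimal in-memory sketch of the sweep for the SFD case, assuming the elevations and flow directions fit in memory; fd[i] is the index of the downslope neighbor of cell i (or -1), and the names are illustrative. The scattered accesses of exactly this loop are what the I/O analysis later identifies as the bottleneck.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Illustrative in-memory flow accumulation (SFD case).
// elev[i] = elevation of cell i, fd[i] = index of its downslope neighbor or -1.
// Returns fa[i] = total flow through cell i, starting from one unit per cell.
std::vector<double> flow_accumulation_sfd(const std::vector<float>& elev,
                                          const std::vector<int>& fd) {
    std::size_t n = elev.size();
    std::vector<double> fa(n, 1.0);                 // one unit of rain per cell
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    // Topological order = decreasing elevation (water flows downhill).
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return elev[a] > elev[b]; });
    for (int i : order)
        if (fd[i] >= 0) fa[fd[i]] += fa[i];         // push this cell's flow downslope
    return fa;
}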

Scalability Problem. We can compute FD and FA using simple O(N)-time algorithms… but what happens for large datasets? [Chart: dataset size (log scale)]

Scalability Problem: Why? Most (GIS) programs assume the data fits in memory and minimize only CPU computation. But massive data does not fit in main memory! The OS places data on disk and moves data between memory and disk as needed. Disk systems try to amortize the large access time by transferring large contiguous blocks of data. When processing massive data, disk I/O is the bottleneck, rather than CPU time!

Disks are Slow “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)

Scalability to Large Data. Example: reading an array from disk, with array size N = 10 elements, disk block size B = 2 elements, and memory size 4 elements (2 blocks). Algorithm 1 loads 10 blocks; Algorithm 2 loads 5 blocks: N blocks >> N/B blocks. Block sizes are large (32KB, 64KB), so N >> N/B. With N = 256 x 10^6, B = 8000, and 1ms disk access time: N I/Os take 256 x 10^3 sec = 4266 min = 71 hr, while N/B I/Os take 32 x 10^3 ms = 32 sec.
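
A small C++ sketch contrasting the two access patterns: reading B elements per fread call (one block per call) rather than one element per call is what turns the N accesses into N/B. The file name, element type, and block size are placeholders.

#include <cstddef>
#include <cstdio>
#include <vector>

// Illustrative blocked scan: sum a binary file of ints by reading B elements
// per fread call. With a real disk, an element-at-a-time loop costs ~N accesses,
// while this blocked loop costs ~N/B.
long long blocked_sum(const char* path, std::size_t B = 8000) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return 0;
    std::vector<int> block(B);
    long long sum = 0;
    std::size_t got;
    while ((got = std::fread(block.data(), sizeof(int), B, f)) > 0) {
        for (std::size_t i = 0; i < got; i++) sum += block[i];   // process one block
    }
    std::fclose(f);
    return sum;
}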

I/O-model. I/O-operation: read/write one block of data from/to disk. I/O-complexity: the number of I/O-operations (I/Os) performed by the algorithm. External-memory (I/O-efficient) algorithms minimize I/O-complexity. For comparison, the RAM model: CPU-operation; CPU-complexity: the number of CPU-operations performed by the algorithm. Internal-memory algorithms minimize CPU-complexity.

I/O-Model Parameters. N = # elements in the problem instance, B = # elements that fit in a disk block, M = # elements that fit in main memory. Fundamental bounds: scanning: scan(N) = Θ(N/B); sorting: sort(N) = Θ((N/B) log_{M/B}(N/B)). [Diagram: disk D, main memory M, processor P connected by block I/O.] In practice block and main memory sizes are big.
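
A worked instance of these bounds (in LaTeX), plugging in the example values N = 256 x 10^6 and B = 8000 from the earlier slide and an assumed, purely illustrative memory size of M = 10^6 elements:

\[
\mathrm{scan}(N) = \frac{N}{B} = \frac{256\times 10^{6}}{8000} = 32{,}000 \text{ I/Os}, \qquad
\frac{M}{B} = \frac{10^{6}}{8000} = 125,
\]
\[
\mathrm{sort}(N) = \frac{N}{B}\log_{M/B}\frac{N}{B} = 32{,}000 \cdot \log_{125} 32{,}000 \approx 32{,}000 \cdot 2.15 \approx 69{,}000 \text{ I/Os},
\]

i.e. sorting costs only about twice a scan, versus the 256 x 10^6 I/Os of an algorithm that touches one element per I/O.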

Sorting. External mergesort illustrates two often-used features: form main-memory-sized sorted runs (N/M of them), then multi-way merge (repeatedly merge M/B of them at a time). A sketch follows.
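
A minimal, self-contained sketch of both phases in C++, with the sorted runs kept as in-memory vectors so the example compiles on its own (a real external mergesort would write the runs to disk and read them back in blocks); run_size plays the role of M and the merge fan-in plays the role of M/B.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

// Phase 1: cut the input into runs of at most run_size elements (the role of M)
// and sort each run -- an in-memory stand-in for memory-sized sorted runs on disk.
std::vector<std::vector<int>> form_runs(const std::vector<int>& in, std::size_t run_size) {
    std::vector<std::vector<int>> runs;
    for (std::size_t i = 0; i < in.size(); i += run_size) {
        std::vector<int> run(in.begin() + i,
                             in.begin() + std::min(i + run_size, in.size()));
        std::sort(run.begin(), run.end());
        runs.push_back(std::move(run));
    }
    return runs;
}

// Phase 2: k-way merge of the sorted runs (k would be M/B in external memory),
// driven by a min-heap of (current value, run index, position within run).
std::vector<int> merge_runs(const std::vector<std::vector<int>>& runs) {
    using Item = std::tuple<int, std::size_t, std::size_t>;   // value, run, position
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    for (std::size_t r = 0; r < runs.size(); r++)
        if (!runs[r].empty()) heap.emplace(runs[r][0], r, (std::size_t)0);
    std::vector<int> out;
    while (!heap.empty()) {
        auto [v, r, p] = heap.top();
        heap.pop();
        out.push_back(v);
        if (p + 1 < runs[r].size()) heap.emplace(runs[r][p + 1], r, p + 1);
    }
    return out;
}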

I/O-Efficient Priority Queue. Insert: insert into the buffer B0; when B0 is full, empty it recursively into the buffers Bi. Extract_min: extract_min from the PQ; if the PQ is empty, refill it from the buffers Bi.
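
A deliberately simplified, in-memory illustration of this structure (not the actual data structure of [A95, BK97]): inserts accumulate in a small buffer that is sorted and flushed as a run when it fills up, and extract_min takes the smallest element among the buffer and the fronts of the runs. In the real external version the runs live on disk and are merged lazily, so that a sequence of N operations costs O(sort(N)) I/Os in total.

#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Simplified buffered priority queue (illustration only).
// Inserts go into an unsorted buffer ("B0"); when the buffer is full it is
// sorted and flushed as a new sorted run (standing in for an on-disk level Bi).
// extract_min scans the buffer and the fronts of all runs for the minimum.
class BufferedPQ {
public:
    explicit BufferedPQ(std::size_t buffer_capacity) : cap_(buffer_capacity) {}

    void insert(int x) {
        buf_.push_back(x);
        if (buf_.size() >= cap_) {                  // B0 full: empty it into a run
            std::sort(buf_.begin(), buf_.end());
            runs_.emplace_back(buf_.begin(), buf_.end());
            buf_.clear();
        }
    }

    bool empty() const {
        if (!buf_.empty()) return false;
        for (const auto& r : runs_)
            if (!r.empty()) return false;
        return true;
    }

    int extract_min() {                             // precondition: !empty()
        bool found = false, from_run = false;
        int best = 0;
        std::size_t best_buf = 0, best_run = 0;
        for (std::size_t i = 0; i < buf_.size(); i++)
            if (!found || buf_[i] < best) { found = true; from_run = false; best = buf_[i]; best_buf = i; }
        for (std::size_t r = 0; r < runs_.size(); r++)
            if (!runs_[r].empty() && (!found || runs_[r].front() < best)) {
                found = true; from_run = true; best = runs_[r].front(); best_run = r;
            }
        if (from_run) runs_[best_run].pop_front();
        else buf_.erase(buf_.begin() + best_buf);
        return best;
    }

private:
    std::size_t cap_;
    std::vector<int> buf_;                          // unsorted insert buffer (B0)
    std::vector<std::deque<int>> runs_;             // sorted runs (stand-ins for levels Bi)
};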

Computing FA: I/O Analysis. The algorithm takes O(N) time: process (sweep) cells in topological order; for each cell, read its flow from the FA grid and its direction from the FD grid, and update the flow in the FA grid for its downslope neighbors. Problem: the cells in topological order are distributed all over the terrain, so the accesses to the FA grid and the FD grid are scattered: O(N) blocks.

I/O-Efficient Flow Accumulation. Eliminating scattered accesses to the FD grid: store the FD grid in topological order. Eliminating scattered accesses to the FA grid: observe that the flow to a neighbor cell is only needed at the time when that neighbor is processed, and the time when a cell is processed is its topological rank, which serves as a priority. Push flow forward by inserting the flow increment into a priority queue with priority equal to the neighbor's processing time; the flow of a cell is then obtained using DeleteMin operations. Note: augment each cell with the priorities of its 8 neighbors; space (~9N) is traded for I/O. Using an I/O-efficient priority queue [A95, BK97], the O(N) queue operations take O(sort(N)) I/Os [ATV00]. A sketch is below.
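
A minimal sketch of this technique (often called time-forward processing), again for the SFD case and using std::priority_queue as an in-memory stand-in for the I/O-efficient priority queue; cells_by_rank, fd, and rank are illustrative names, not TerraFlow's code.

#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Time-forward flow accumulation (SFD case, in-memory illustration).
// cells_by_rank[t] = cell processed at time t (topological order),
// rank[c] = processing time of cell c, fd[c] = downslope neighbor of c or -1.
std::vector<double> flow_accumulation_tfp(const std::vector<int>& cells_by_rank,
                                          const std::vector<int>& fd,
                                          const std::vector<int>& rank) {
    using Item = std::pair<int, double>;            // (recipient's time, flow increment)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    std::vector<double> fa(cells_by_rank.size(), 0.0);
    for (std::size_t t = 0; t < cells_by_rank.size(); t++) {
        int c = cells_by_rank[t];
        double flow = 1.0;                          // this cell's own unit of rain
        while (!pq.empty() && pq.top().first == (int)t) {
            flow += pq.top().second;                // collect flow pushed forward to time t
            pq.pop();
        }
        fa[c] = flow;
        if (fd[c] >= 0)
            pq.emplace(rank[fd[c]], flow);          // send flow to the neighbor's time slot
    }
    return fa;
}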

TerraFlow. TerraFlow is our suite of programs for flow routing and flow accumulation on massive grids [ATV'00, AC&al'02]. Flow routing and flow accumulation are modeled as graph problems and solved in optimal I/O bounds. Efficient: significantly faster than existing software on very large grids. Scalable: 1 billion elements!! (>2GB data). Flexible: allows multiple flow-modeling methods.

Implementation and Platform. Written in C++, using TPIE (Transparent Parallel I/O Environment), a library of I/O-efficient modules developed at Duke. Platform: TerraFlow and ArcInfo ran on a 500MHz Alpha, FreeBSD 4.0, 1GB RAM; GRASS/TARDEM ran on a 500MHz Intel PIII, FreeBSD/Windows, 1GB RAM.

TerraFlow  GRASS cannot handle Hawaii dataset (killed after 17 days)  TARDEM cannot handle Cumberlands dataset (killed after 20 days)  Significant speedup over ArcInfo for large datasets East-Coast TerraFlow: 8.7 Hours ArcInfo: 78 Hours Washington state TerraFlow: 63 Hours ArcInfo: %

I/O-Efficient Graph Algorithms. Graph G=(V,E) on disk. The basic graph (searching) problems BFS, DFS, SSSP, and topological sorting are big open problems in the I/O-model! Standard internal-memory algorithms use O(E) I/Os, and no I/O-efficient algorithms are known for any of these problems on general graphs: lower bound Ω(sort(V)), best known Ω(V/sqrt(B)). There are O(sort(E)) algorithms for special classes of graphs (trees, grid graphs, bounded-treewidth graphs, outerplanar graphs, planar graphs) that exploit the existence of small separators or geometric structure.

Dijkstra's Algorithm in External Memory. Dijkstra's algorithm: use a priority queue storing vertices keyed by their tentative distances; repeat: DeleteMin(u), and for each adjacent edge (u,v), if d(s,v) > d(s,u) + w(u,v) then DecreaseKey(v, d(s,u) + w(u,v)). Analysis: O(E) I/Os. It takes O(V + E/B) I/Os to load the adjacent edges of all vertices. Using an I/O-efficient priority queue, the O(E) Insert/DeleteMin operations take O(sort(E)) I/Os; but each DecreaseKey needs O(1) I/Os to read the key d(s,v) of v. Overall: O(E) + O(sort(E)) I/Os. Improved SSSP algorithm: O(V + (E/B) log V) I/Os [KS'96]. A sketch that sidesteps DecreaseKey is shown below.
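
A minimal in-memory sketch of Dijkstra without DecreaseKey: instead of updating a key, it inserts a fresh (distance, vertex) entry and discards stale entries on extraction (lazy deletion). This is the style of queue usage that I/O-efficient SSSP algorithms build on, since external priority queues do not support cheap DecreaseKey; the ordinary adjacency-list representation used here does not itself achieve the I/O bounds above.

#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// Lazy-deletion Dijkstra (illustration). adj[u] = list of (v, w(u,v)) edges.
std::vector<double> dijkstra(const std::vector<std::vector<std::pair<int, double>>>& adj,
                             int s) {
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<double> dist(adj.size(), INF);
    using Item = std::pair<double, int>;            // (tentative distance, vertex)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    dist[s] = 0.0;
    pq.emplace(0.0, s);
    while (!pq.empty()) {
        auto [d, u] = pq.top();
        pq.pop();
        if (d > dist[u]) continue;                  // stale entry: u already settled with a smaller key
        for (auto [v, w] : adj[u]) {
            if (dist[u] + w < dist[v]) {
                dist[v] = dist[u] + w;
                pq.emplace(dist[v], v);             // insert a new entry instead of DecreaseKey
            }
        }
    }
    return dist;
}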

SSSP on Grid Graphs [ATV'00]. A grid graph has O(N) vertices and O(N) edges. Previous bound: O(N) I/Os. Lemma: the portion of δ(s,t) between its intersection points with the boundary of any subgrid is a shortest path within that subgrid.

SSSP on Grid Graphs [ATV'00]. Divide the grid into subgrids of size M (assume M > B^2). Replace each subgrid with the complete graph on its boundary vertices, with edge weights equal to the shortest-path distance between the two boundary vertices within the subgrid. The reduced graph G_R has O(N/B) vertices and O(N) edges. Idea: compute shortest paths locally in each subgrid, then compute the shortest way to combine them.

SSSP on Grid Graphs [ATV'00]. Algorithm: 1. Compute SSSP on G_R from s to all boundary vertices. 2. Find SSSP from s to all interior vertices: for any subgrid σ and any t in σ, δ(s,t) = min_{v in Bnd(σ)} { δ(s,v) + δ_σ(v,t) }. Correctness: easy to show using the Lemma. Analysis: O(sort(N)) I/Os, using Dijkstra's algorithm with an I/O-efficient priority queue and graph blocking.

Results on Planar Graphs. For a planar graph G with N vertices: separators can be computed in O(sort(N)) I/Os; with I/O-efficient reductions [ABT'00, AMTZ'01], BFS, DFS, and SSSP can all be solved in O(sort(N)) I/Os. [Diagram: chain of O(sort(N))-I/O reductions among ε-separators, BFS, SSSP, and DFS, from ABT'00 and AMTZ'01.]

SSSP on Planar Graphs. Similar to grid graphs. Assume M > B^2 and bounded degree. Assume the graph is separated into O(N/B^2) subgraphs with O(B^2) vertices each, with a separator set S of O(N/B) vertices, and each subgraph adjacent to O(B) separator vertices.

SSSP on Planar Graphs. The reduced graph G_R has S = O(N/B) vertices and O(N/B^2) x O(B^2) = O(N) edges. Compute SSSP on G_R with Dijkstra's algorithm and an I/O-efficient priority queue, keeping a list L_S = { d(s,v) | v in S }. If each vertex in S reads its O(B) adjacent boundary vertices from L_S, that is O(N/B x B) = O(N) I/Os… but there are only O(N/B^2) boundary sets. Store L_S so that vertices in the same boundary set are consecutive; then each boundary set is accessed once by its O(B) adjacent vertices, for O(N/B^2 x B) = O(N/B) I/Os.

On I/O-Efficient DFS. General graphs: the internal-memory algorithm uses O(V+E) I/Os; improved upper bound: O(V + (E/B) log V) I/Os [KS'96]; lower bound Ω(sort(V)). DFS is a big open problem!! Note: PRAM DFS is P-complete. DFS on planar graphs: an O(sort(N) log N)-I/O algorithm [AMTZ'01], and O(sort(N)) I/Os using a DFS-to-BFS reduction [AMTZ'01].

A Divide-and-Conquer DFS Algorithm on Planar Graphs. Based on a PRAM algorithm [Smith86]. Algorithm: compute a cycle separator C; compute DFS recursively in the connected components G_i of G\C; attach the DFS trees of the G_i onto the cycle. I/O analysis: O(log N) recursive steps, O(sort(N)) I/Os per step, so O(sort(N) log N) I/Os in total.

DFS-to-BFS Reduction on Planar Graphs. Idea: partition the faces of G into levels around a source face containing s and grow the DFS level by level. The levels can be obtained from a BFS in the dual graph. Denote G_i = the union of the boundaries of the faces at level <= i, T_i = the DFS tree of G_i, and H_i = G_i \ G_{i-1}. Algorithm: compute a spanning forest of H_i and attach it onto T_{i-1}. The structure of the levels is simple: the bicomps of H_i are the boundary cycles of G_i. Gluing onto T_{i-1} is simple: a spanning tree is a DFS tree if and only if it has no cross edges.

Other Graph Results. Grid graphs [ATV'00]: MST and SSSP in O(sort(N)) I/Os; CC in O(scan(N)) I/Os. Planar graphs [ABT'00, AMTZ'01]: planar reductions in O(sort(N)) I/Os; DFS in O(sort(N) log N) I/Os. General graphs [ABT'00]: MST in O(sort(N) log log N) I/Os. Planar directed graphs [submitted]: topological sorting and ear decomposition in O(sort(N)) I/Os.

In Conclusion. I have tried to convince you of a few things: massive data is available, and scalable algorithms are necessary in order to process it; I/O-efficient algorithms have applications "outside" computer science and big potential for (interdisciplinary) collaboration; I/O-efficient algorithms are theory and practice put together.

Current and Future Work. TerraFlow: incorporated in GRASS [AMT'02]; current work with U. Muenster (Germany), where 2 MS students are porting TerraFlow to Visual C++ under Windows and making it an ArcInfo extension; computing the complete watershed hierarchy. Some future directions: processing LIDAR data (point-to-grid conversion, point-to-TIN conversion, terrain simplification, Delaunay triangulation…); flow modeling on TINs (practical algorithms on triangulations); reductions on general graphs, directed graphs, dynamic algorithms.

Massive Data. Massive datasets are being collected everywhere; storage management software is a billion-dollar industry. (More) examples: Phone: AT&T's 20TB phone-call database, wireless tracking. Consumer: WalMart's 70TB database of buying patterns (supermarket checkout). Web: a web crawl of 200M pages and 2000M links; Akamai stores 7 billion clicks per day. Geography: NASA satellites generate 1.2TB per day.