DGrid: A Library of Large-Scale Distributed Spatial Data Structures Pieter Hooimeijer,
2 Motivation DGrid was designed to: –Support very large sets of dynamic point data (i.e. points that move unpredictably). –Offer flexible trade-offs between the cost of search operations and the cost of updates. –Run on parallel and distributed systems.
3 Spatial Data Structures Definition: any data structure that holds points, rectangles, lines, polygons, curves, etc. They are typically optimized for a particular type of search operation. We’ll focus on point data.
4 Commonly used example: The Quadtree. –Works like a binary search tree. –Each node in the tree has four children: NE, SE, SW, NW. –This is called ‘Recursive Decomposition’
5 The quadtree implementation in DGrid works like this, with example points A (1,3), B (1,2), C (2,0). This is a ‘bottom-up Matrix (MX) Quadtree.’ (section 3.1.2)
6 Let’s do a search on that quadtree: We ruled out the ‘entire’ NE quadrant at the root level of the tree.
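The pruning step can be sketched in code. The block below is a minimal top-down MX quadtree, not DGrid's bottom-up implementation; the class name `MxQuadtree`, the `last_visited()` counter, and the choice of a 4x4 region with y increasing northwards are all assumptions made for illustration. Only the operation names `insert` and `get_range` come from the slides.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical sketch of an MX quadtree over a square region whose side is
// a power of two. Items live only in unit-sized leaf cells, so the shape of
// the tree depends on where the points are, not on insertion order.
class MxQuadtree {
    struct Node {
        std::unique_ptr<Node> child[4];   // index bit 0: east half, bit 1: north half
        std::vector<std::string> items;   // used only at leaf (unit cell) level
    };

public:
    explicit MxQuadtree(int size) : size_(size), root_(std::make_unique<Node>()) {
        assert(size > 0 && (size & (size - 1)) == 0);  // side must be a power of two
    }

    void insert(int x, int y, const std::string& item) {
        assert(x >= 0 && x < size_ && y >= 0 && y < size_);
        Node* node = root_.get();
        int x0 = 0, y0 = 0;  // lower-left corner of the current node's region
        for (int half = size_ / 2; half >= 1; half /= 2) {
            int q = 0;
            if (x >= x0 + half) { q |= 1; x0 += half; }
            if (y >= y0 + half) { q |= 2; y0 += half; }
            if (!node->child[q]) node->child[q] = std::make_unique<Node>();
            node = node->child[q].get();
        }
        node->items.push_back(item);
    }

    // Collect all items in the inclusive range [(x0, y0) : (x1, y1)].
    std::vector<std::string> get_range(int x0, int y0, int x1, int y1) const {
        std::vector<std::string> out;
        visited_ = 0;
        search(*root_, 0, 0, size_, x0, y0, x1, y1, out);
        return out;
    }

    // Number of nodes the most recent get_range() actually descended into.
    int last_visited() const { return visited_; }

private:
    void search(const Node& n, int nx, int ny, int side,
                int x0, int y0, int x1, int y1,
                std::vector<std::string>& out) const {
        // Prune: skip the whole subtree if its region misses the query range.
        if (nx > x1 || nx + side - 1 < x0 || ny > y1 || ny + side - 1 < y0) return;
        ++visited_;
        if (side == 1) {
            out.insert(out.end(), n.items.begin(), n.items.end());
            return;
        }
        const int half = side / 2;
        for (int q = 0; q < 4; ++q) {
            if (!n.child[q]) continue;  // empty quadrant: nothing to search
            search(*n.child[q], nx + ((q & 1) ? half : 0), ny + ((q & 2) ? half : 0),
                   half, x0, y0, x1, y1, out);
        }
    }

    int size_;
    std::unique_ptr<Node> root_;
    mutable int visited_ = 0;
};
```

With A (1,3), B (1,2) and C (2,0) inserted into `MxQuadtree t(4)`, the call `t.get_range(0, 2, 1, 3)` descends only into the NW quadrant; the SE subtree holding C is rejected at the root, and the NE and SW quadrants were never allocated.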
7 Trade-offs for bottom-up MX Quadtree, compared to other tree data structures: –The shape of the tree does not depend on the insertion order. –No need to balance the tree (which would be expensive). –Insertion and deletion are cheaper for clustered data.
8 C++ Templates They look like this: template <class T> class vector { //... }; In this case, a separate vector class is generated for each item type. // Strong type-checking using templates: MyType * a = someVector.get(5); // instead of: MyType * a = (MyType *)someVector.get(5);
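To make the cast-versus-no-cast contrast concrete, here is a small sketch. `UntypedVector`, `TypedVector`, and the `id` field of `MyType` are hypothetical names invented for this example; they are not part of DGrid or the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Untyped container: stores void*, so every read needs a cast, and the
// compiler cannot catch a cast to the wrong type.
class UntypedVector {
public:
    void add(void* p) { assert(p != nullptr); items_.push_back(p); }
    void* get(std::size_t i) const { return items_.at(i); }
private:
    std::vector<void*> items_;
};

// Templated container: the compiler generates a separate class for each T,
// and get() already returns the right pointer type.
template <class T>
class TypedVector {
public:
    void add(T* p) { assert(p != nullptr); items_.push_back(p); }
    T* get(std::size_t i) const { return items_.at(i); }
private:
    std::vector<T*> items_;
};

struct MyType { int id; };
```

Reading from `TypedVector<MyType>` is simply `MyType* a = v.get(0);`, whereas `UntypedVector` requires `static_cast<MyType*>(u.get(0))` and gives no compile-time protection against choosing the wrong type.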
9 Turns out, C++ templates are a crude functional programming language. Why ‘crude?’ {- Haskell -} fact 1 = 1; fact n = n * fact (n - 1) // C++ Templates template <int N> struct fact { static const int value = N * fact<N - 1>::value; }; template <> struct fact<1> { static const int value = 1; }; This is ‘executed’ by the compiler!
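A complete version of the factorial template that can be compiled and checked is sketched below. The base case `fact<1>` is an assumption chosen to mirror the Haskell definition; the array `table` is a hypothetical use that only works because the value is known at compile time.

```cpp
// General case: the compiler instantiates fact<N - 1> recursively.
template <int N>
struct fact {
    static const int value = N * fact<N - 1>::value;
};

// Base case as an explicit specialization. This plays the role of the
// Haskell equation `fact 1 = 1`. Without it (or for N < 1) the recursion
// never terminates and compilation fails at the instantiation depth limit.
template <>
struct fact<1> {
    static const int value = 1;
};

// The result is a compile-time constant, so it can be checked by the
// compiler and used where a constant expression is required.
static_assert(fact<5>::value == 120, "5! should be 120");
int table[fact<4>::value];  // an array of 24 ints
```

No code runs at program start to compute these values; the multiplication happens during template instantiation.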
10 This is called template metaprogramming. It’s used extensively in DGrid, to make it: –easier to use; –faster; –type safe.
11 Distributed Data DGrid uses the Message Passing Interface (MPI) to run on distributed systems. –MPI is a library of basic ‘send’ and ‘receive’ operations. –Each process gets a unique ID (‘rank’). –Use if-statements on the rank to run different code on different processes.
13 DGrid DGrid has these data structures: –Two types of 2D arrays. –A quadtree. –A distributed data structure. –A location class. Allows nesting of these data structures.
14 Let’s see some examples of nested data structures: –A 2D array of quadtrees (implied: the quadtree contains locations). –A quadtree of small 2D arrays. –A 2D array of 2D arrays. Called ‘tiling.’ A lot like a ‘shallow’ quadtree.
15 DGrid uses the Composite Design Pattern: DataStructure location
16 DGrid uses templates instead of a ‘Component’ interface. The result is that the user can do this: using namespace dgrid::tags; typedef dgrid::dgrid<MyItem, partial_grid_tag< quadtree_tag > > bucket; bucket a(0, 0, 639, 639, tiles(64, 64) << tiles(1, 1)); This is the ‘2D array of quadtrees’ example.
17 Important: The definition of the data structure is a type! Consequences: –Can check parameters at compile time. (Must be tiles( ) << tiles ( ) for this example, or it won’t compile.) –Compiler can optimize extensively (it knows which functions are going to call each other). –Can’t define a type at runtime, so ‘composition’ must be known at compile time.
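One way such a compile-time parameter check could work is sketched below. This is not DGrid's code: `location_tag`, `grid_tag`, `depth`, `tiles_chain`, and `structure` are hypothetical names, and `quadtree_tag` is written here as a template with a default argument purely for this example. The idea is that the nesting depth is computed from the type, each `tiles(...)` chain carries its length in its type, and a `static_assert` compares the two.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tags describing a composition as a type.
struct location_tag {};
template <class Inner = location_tag> struct grid_tag {};
template <class Inner = location_tag> struct quadtree_tag {};

// Nesting depth, computed by the compiler from the tag type.
template <class Tag> struct depth;
template <> struct depth<location_tag> {
    static const std::size_t value = 0;
};
template <class Inner> struct depth<grid_tag<Inner>> {
    static const std::size_t value = 1 + depth<Inner>::value;
};
template <class Inner> struct depth<quadtree_tag<Inner>> {
    static const std::size_t value = 1 + depth<Inner>::value;
};

// A chain of tile sizes whose length N is part of its type.
template <std::size_t N>
struct tiles_chain {
    int w[N];
    int h[N];
};

inline tiles_chain<1> tiles(int w, int h) {
    tiles_chain<1> t{};
    t.w[0] = w;
    t.h[0] = h;
    return t;
}

// tiles(a, b) << tiles(c, d) concatenates two chains.
template <std::size_t A, std::size_t B>
tiles_chain<A + B> operator<<(const tiles_chain<A>& a, const tiles_chain<B>& b) {
    tiles_chain<A + B> out{};
    for (std::size_t i = 0; i < A; ++i) { out.w[i] = a.w[i]; out.h[i] = a.h[i]; }
    for (std::size_t i = 0; i < B; ++i) { out.w[A + i] = b.w[i]; out.h[A + i] = b.h[i]; }
    return out;
}

// The constructor only compiles when one tiles() entry is supplied per level.
template <class Tag>
struct structure {
    template <std::size_t N>
    explicit structure(const tiles_chain<N>&) {
        static_assert(N == depth<Tag>::value,
                      "supply exactly one tiles(...) per nesting level");
    }
};
```

Under these assumptions, `structure<grid_tag<quadtree_tag<>>> s(tiles(64, 64) << tiles(1, 1));` compiles, while passing a single `tiles(64, 64)` is rejected by the compiler rather than failing at runtime.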
18 Data structure operations: –insert(x, y, item) – add item at (x, y) –delete(x, y, item) – remove item from (x, y) –get(x, y, some_list) – get all items at (x, y) –get_range(x0, y0, x1, y1) – get all items in the range [(x0, y0) : (x1, y1)] Note: even the location class must support these operations.
19 In a nested data structure, operations are passed on from level to level. Because the types are known at compile time, these calls can be inlined. –Pro: eliminates the overhead of the function call. –Con: code size increases (function body is repeated).
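The level-to-level forwarding can be illustrated with a small sketch. The classes `location` and `tile_grid` below are hypothetical stand-ins, not DGrid's actual classes, and the tile counts are fixed as template parameters only to keep the example short. Because the inner type is a template argument, every forwarded call is resolved statically, which is what allows the compiler to inline it.

```cpp
#include <cassert>
#include <vector>

// Leaf of the composition: stores items and supports the same operations
// as the containers that hold it.
template <class Item>
class location {
public:
    void insert(int x, int y, const Item& item) { entries_.push_back({x, y, item}); }
    void get(int x, int y, std::vector<Item>& out) const {
        for (const auto& e : entries_)
            if (e.x == x && e.y == y) out.push_back(e.item);
    }
private:
    struct entry { int x; int y; Item item; };
    std::vector<entry> entries_;
};

// An N x N array of tiles, each TileSize x TileSize. Every operation picks
// a tile and forwards the call with tile-local coordinates. No virtual
// functions are involved, so the chain of calls can be inlined.
template <class Inner, int N, int TileSize>
class tile_grid {
public:
    tile_grid() : tiles_(N * N) {}

    template <class Item>
    void insert(int x, int y, const Item& item) {
        tiles_[index(x, y)].insert(x % TileSize, y % TileSize, item);
    }

    template <class Item>
    void get(int x, int y, std::vector<Item>& out) const {
        tiles_[index(x, y)].get(x % TileSize, y % TileSize, out);
    }

private:
    static int index(int x, int y) {
        assert(x >= 0 && x < N * TileSize && y >= 0 && y < N * TileSize);
        return (y / TileSize) * N + (x / TileSize);
    }
    std::vector<Inner> tiles_;
};

// 'Tiling': a 2x2 array of 4x4 arrays of locations, covering 8x8 cells.
typedef tile_grid<tile_grid<location<int>, 4, 1>, 2, 4> tiled;
```

A call such as `t.insert(5, 6, 42)` on a `tiled` object passes through the outer grid, then the inner grid, then the location. The price of inlining that chain is the one noted on the slide: the forwarding code is repeated for every distinct composition type.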
21 Future Work Add more data structures, more search operations. Separate interface further from implementation. (Dynamic) Load Balancing.